feat: segment by graphemes #11
Interesting find: I kinda expected the family to be width 2 then, but it doesn't seem that way. But it also doesn't start with that |
I saw that PR and was thinking about the same thing last week. My concern is exactly the large performance hit, so I'm a bit reluctant to merge this. (But maybe the performance isn't that important, as truncating a full 16KB text is pretty rare? Happy to be convinced.)

The input text used in the benchmark is all Chinese characters without any zero-width fun stuff. I'd love to see some benchmarking for some full emoji text.

Regarding the performance difference: `truncate` and `truncate_start` count things to keep, while `truncate_centered` counts things to remove. But I'm not sure how much this matters, as all benchmarks are truncating to roughly half width.
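To make the "count things to keep" versus "count things to remove" distinction concrete, here is a minimal sketch. It is not the crate's implementation: `toy_width`, `keep_from_start`, and `remove_from_both_ends` are hypothetical names, it works on `char`s rather than graphemes, and the width rule is a deliberately crude assumption.

```rust
/// Toy display width (assumption): 2 for CJK unified ideographs, 1 otherwise.
fn toy_width(c: char) -> usize {
    if ('\u{4E00}'..='\u{9FFF}').contains(&c) { 2 } else { 1 }
}

/// Counts what to KEEP: walk from the start and stop before exceeding `max`.
fn keep_from_start(s: &str, max: usize) -> &str {
    let mut used = 0;
    for (idx, c) in s.char_indices() {
        let w = toy_width(c);
        if used + w > max {
            return &s[..idx];
        }
        used += w;
    }
    s
}

/// Counts what to REMOVE: drop chars alternately from the back and the
/// front until the remaining width fits into `max`.
fn remove_from_both_ends(s: &str, max: usize) -> &str {
    let mut width: usize = s.chars().map(toy_width).sum();
    let (mut start, mut end) = (0, s.len());
    let mut take_back = true;
    while width > max {
        let rest = &s[start..end];
        let next = if take_back { rest.chars().next_back() } else { rest.chars().next() };
        let Some(c) = next else { break };
        if take_back {
            end -= c.len_utf8();
        } else {
            start += c.len_utf8();
        }
        width -= toy_width(c);
        take_back = !take_back;
    }
    &s[start..end]
}
```

In this sketch the keeping loop does work proportional to the kept part and the removing loop does work proportional to the removed part (after one pass to compute the total width), so when truncating to roughly half width both touch a similar number of characters.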
@@ -388,6 +381,15 @@ mod tests {
            ("y\u{0306}ey\u{0306}", 3)
        );
    }

    #[test]
    fn family_stays_together() {
Love this name :)
I assume most use cases will truncate to fairly small widths. Coming from the terminal library ratatui, there are maybe 100 characters of width, and either truncation or wrapping can be used in these cases. Thinking about other places, not many of them deal with long lines either. Browser dev tools truncate to the end of the line, and that is also rather small. Usages on GitHub seem to use something like 25, 50, or 140.
I'll note that |
Okay, let's get this merged. @EdJoPaTo could you rebase?
…tion Conflicts: Cargo.toml src/lib.rs
## 🤖 New release
* `unicode-truncate`: 1.0.0 -> 1.1.0

<details><summary><i><b>Changelog</b></i></summary><p>

<blockquote>

## [1.1.0](v1.0.0...v1.1.0) - 2024-07-08

### Added
- segment by graphemes ([#11](#11))

### Fixed
- *(deps)* update rust crate itertools to 0.13 ([#20](#20))
- fixed typos in the `renovate.json` ([#17](#17))

### Other
- Removed unnessary debug-assertions setting
- Treat control characters as width 1, fixes [#16](#16) ([#19](#19))
- tweak renovate configs ([#13](#13))
</blockquote>

</p></details>

---
This PR was generated with [release-plz](https://github.com/MarcoIeni/release-plz/).

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Before this, zero-width things were assumed to be kept, but that is mostly only a best-effort approach. `unicode-segmentation` bundles up characters that belong together. Sadly, this is slower but more correct.
Interestingly, centered is now faster than the other two by a lot. Analyzing this could lead to performance improvements for the other two as well.
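As an illustration of what "characters that belong together" means, here is a std-only sketch. It is a toy approximation, not the Unicode grapheme rules and not what `unicode-segmentation` does internally: it only glues combining diacritical marks (U+0300..=U+036F) and zero-width-joiner sequences onto the preceding character. The names `extends_cluster`, `toy_clusters`, and `keep_clusters` are hypothetical, and `keep_clusters` counts clusters rather than display width.

```rust
const ZWJ: char = '\u{200D}';

/// Toy rule (assumption): `c` extends the current cluster if it is a
/// combining diacritical mark, a ZWJ, or directly follows a ZWJ.
fn extends_cluster(prev: char, c: char) -> bool {
    ('\u{0300}'..='\u{036F}').contains(&c) || c == ZWJ || prev == ZWJ
}

/// Splits `s` into toy clusters, returned as slices of the input.
fn toy_clusters(s: &str) -> Vec<&str> {
    let mut clusters = Vec::new();
    let mut start = 0;
    let mut prev: Option<char> = None;
    for (idx, c) in s.char_indices() {
        if let Some(p) = prev {
            if !extends_cluster(p, c) {
                // `c` starts a new cluster; close the previous one.
                clusters.push(&s[start..idx]);
                start = idx;
            }
        }
        prev = Some(c);
    }
    if start < s.len() {
        clusters.push(&s[start..]);
    }
    clusters
}

/// Keeps at most `max` whole clusters from the start of `s`.
fn keep_clusters(s: &str, max: usize) -> &str {
    let end: usize = toy_clusters(s).iter().take(max).map(|c| c.len()).sum();
    &s[..end]
}
```

With this toy rule, a family emoji built from three people joined by two ZWJs is five `char`s but a single cluster, so cutting by `char` can split it while cutting by cluster cannot. Real code should rely on a proper segmentation library such as `unicode-segmentation` instead of a hand-written rule like this one.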