Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: segment by graphemes #11

Merged
merged 3 commits into from
Jun 25, 2024
Merged

Conversation

EdJoPaTo
Copy link
Contributor

@EdJoPaTo EdJoPaTo commented May 2, 2024

Before this zero length things were assumed to keep, but this is mostly only a best-effort approach. unicode-segmentation bundles up characters that belong together.

Sadly this is slower but more correct.

zhu fu/16384/end        time:   [98.795 µs 98.834 µs 98.883 µs]
                        thrpt:  [158.02 MiB/s 158.09 MiB/s 158.16 MiB/s]
                 change:
                        time:   [+420.90% +421.28% +421.82%] (p = 0.00 < 0.05)
                        thrpt:  [-80.836% -80.816% -80.802%]
                        Performance has regressed.
Found 8 outliers among 200 measurements (4.00%)
  1 (0.50%) low mild
  6 (3.00%) high mild
  1 (0.50%) high severe
zhu fu/16384/start      time:   [112.87 µs 112.98 µs 113.10 µs]
                        thrpt:  [138.15 MiB/s 138.30 MiB/s 138.43 MiB/s]
                 change:
                        time:   [+461.21% +461.73% +462.28%] (p = 0.00 < 0.05)
                        thrpt:  [-82.215% -82.198% -82.181%]
                        Performance has regressed.
Found 4 outliers among 200 measurements (2.00%)
  4 (2.00%) high mild
zhu fu/16384/centered   time:   [50.122 µs 50.177 µs 50.249 µs]
                        thrpt:  [310.95 MiB/s 311.40 MiB/s 311.74 MiB/s]
                 change:
                        time:   [+86.029% +86.268% +86.498%] (p = 0.00 < 0.05)
                        thrpt:  [-46.380% -46.314% -46.245%]
                        Performance has regressed.
Found 9 outliers among 200 measurements (4.50%)
  8 (4.00%) low mild
  1 (0.50%) high severe

Interestingly centered is now faster than the other two by a lot. Analyzing this could lead to performance improvements for the other two too?

@EdJoPaTo
Copy link
Contributor Author

EdJoPaTo commented May 2, 2024

Interesting find: unicode-width 0.1.12 released a few days ago with this: unicode-rs/unicode-width#41

I kinda expected the family to be width 2 then, but it doesn't seem that way. But it also doesn't start with that \u{FE0F} so I am not entirely sure what should happen or what is correct.

@Aetf
Copy link
Owner

Aetf commented May 5, 2024

I saw that PR and was thinking about the same thing last week. And my concern is exactly the large performance hit. I'm a bit reluctant to merge this. (But maybe the performance isn't that important, as truncating a full 16KB text is pretty rare? Happy to be convinced.)

The input text used in the benchmark is all Chinese characters without any zero-width fun stuff. I'd love to see some benchmarking for some full emoji text.

Regarding the performance difference, truncate and truncate_start count things to keep, while truncate_centered counts things to remove. But I'm not sure how much this matters as all benchmarks are truncating to roughly half width.

@@ -388,6 +381,15 @@ mod tests {
("y\u{0306}ey\u{0306}", 3)
);
}

#[test]
fn family_stays_together() {
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Love this name :)

@EdJoPaTo
Copy link
Contributor Author

EdJoPaTo commented May 6, 2024

And my concern is exactly the large performance hit. I'm a bit reluctant to merge this. (But maybe the performance isn't that important, as truncating a full 16KB text is pretty rare? Happy to be convinced.)

I assume most use-cases will truncate to kinda small numbers. Coming from the terminal library ratatui there are maybe 100 characters of width. Either truncation or wrapping can be used in these cases. When thinking about other places even then not that many long places are relevant. Browser dev tools truncate to the end of the line and that is also rather small.

Usages on GitHub seem to use like 25, 50, 140, something like that.

@Jules-Bertholet
Copy link

Jules-Bertholet commented Jun 17, 2024

I'll note that unicode-width does not guarantee that the width of a string equals the sum of the widths of its grapheme clusters. In 1.13 (current published version), this property fails to hold only for the Old Lisu script (i.e., unlikely to be a problem in practice); in the latest master, it's also true for several other scripts, including Arabic.

@Aetf
Copy link
Owner

Aetf commented Jun 24, 2024

Okay, let's get this merged. @EdJoPaTo could you rebase?

@Aetf Aetf merged commit f85280f into Aetf:master Jun 25, 2024
20 checks passed
@github-actions github-actions bot mentioned this pull request Jun 24, 2024
@EdJoPaTo EdJoPaTo deleted the grapheme-segmentation branch June 25, 2024 04:23
Aetf pushed a commit that referenced this pull request Jul 9, 2024
## 🤖 New release
* `unicode-truncate`: 1.0.0 -> 1.1.0

<details><summary><i><b>Changelog</b></i></summary><p>

<blockquote>

##
[1.1.0](v1.0.0...v1.1.0)
- 2024-07-08

### Added
- segment by graphemes
([#11](#11))

### Fixed
- *(deps)* update rust crate itertools to 0.13
([#20](#20))
- fixed typos in the `renovate.json`
([#17](#17))

### Other
- Removed unnessary debug-assertions setting
- Treat control characters as width 1, fixes
[#16](#16)
([#19](#19))
- tweak renovate configs
([#13](#13))
</blockquote>


</p></details>

---
This PR was generated with
[release-plz](https://github.com/MarcoIeni/release-plz/).

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants