feat: segment by graphemes #11

EdJoPaTo · 2024-05-02T12:45:29Z

Before this change, zero-width characters were simply assumed to be kept, which is mostly only a best-effort approach. The unicode-segmentation crate bundles up the characters that belong together.
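As a rough illustration of why code-point boundaries are not enough, here is a minimal std-only sketch. The helper `naive_truncate_chars` is hypothetical and not part of unicode-truncate or unicode-segmentation; it only shows what happens when a cut ignores grapheme clusters.

```rust
/// Hypothetical helper (not part of unicode-truncate): keep the first `n`
/// code points of `s`, ignoring grapheme cluster boundaries.
fn naive_truncate_chars(s: &str, n: usize) -> &str {
    match s.char_indices().nth(n) {
        Some((byte_idx, _)) => &s[..byte_idx],
        None => s,
    }
}

fn main() {
    // MAN + ZWJ + WOMAN + ZWJ + GIRL: rendered as a single "family" emoji.
    let family = "\u{1F468}\u{200D}\u{1F469}\u{200D}\u{1F467}";
    assert_eq!(family.chars().count(), 5);

    // Cutting after two code points leaves MAN followed by a dangling ZWJ.
    let cut = naive_truncate_chars(family, 2);
    assert_eq!(cut, "\u{1F468}\u{200D}");

    // "y" + COMBINING BREVE: two code points, one user-perceived character.
    let y_breve = "y\u{0306}";
    assert_eq!(naive_truncate_chars(y_breve, 1), "y");

    println!("{} code points in the family emoji", family.chars().count());
}
```

Segmenting by grapheme clusters avoids both cuts, since the family sequence and the `y` with its combining mark are each treated as one unit.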

Sadly this is slower but more correct.

zhu fu/16384/end        time:   [98.795 µs 98.834 µs 98.883 µs]
                        thrpt:  [158.02 MiB/s 158.09 MiB/s 158.16 MiB/s]
                 change:
                        time:   [+420.90% +421.28% +421.82%] (p = 0.00 < 0.05)
                        thrpt:  [-80.836% -80.816% -80.802%]
                        Performance has regressed.
Found 8 outliers among 200 measurements (4.00%)
  1 (0.50%) low mild
  6 (3.00%) high mild
  1 (0.50%) high severe
zhu fu/16384/start      time:   [112.87 µs 112.98 µs 113.10 µs]
                        thrpt:  [138.15 MiB/s 138.30 MiB/s 138.43 MiB/s]
                 change:
                        time:   [+461.21% +461.73% +462.28%] (p = 0.00 < 0.05)
                        thrpt:  [-82.215% -82.198% -82.181%]
                        Performance has regressed.
Found 4 outliers among 200 measurements (2.00%)
  4 (2.00%) high mild
zhu fu/16384/centered   time:   [50.122 µs 50.177 µs 50.249 µs]
                        thrpt:  [310.95 MiB/s 311.40 MiB/s 311.74 MiB/s]
                 change:
                        time:   [+86.029% +86.268% +86.498%] (p = 0.00 < 0.05)
                        thrpt:  [-46.380% -46.314% -46.245%]
                        Performance has regressed.
Found 9 outliers among 200 measurements (4.50%)
  8 (4.00%) low mild
  1 (0.50%) high severe

Interestingly, centered is now much faster than the other two. Analyzing why could lead to performance improvements for the other two as well.

EdJoPaTo · 2024-05-02T13:56:36Z

Interesting find: unicode-width 0.1.12 was released a few days ago with this: unicode-rs/unicode-width#41

I kind of expected the family emoji to be width 2 with that change, but it doesn't seem that way. It also doesn't start with that \u{FE0F}, though, so I am not entirely sure what should happen or what is correct.

Aetf · 2024-05-05T23:00:47Z

I saw that PR and was thinking about the same thing last week. And my concern is exactly the large performance hit. I'm a bit reluctant to merge this. (But maybe the performance isn't that important, as truncating a full 16KB text is pretty rare? Happy to be convinced.)

The input text used in the benchmark is all Chinese characters without any zero-width fun stuff. I'd love to see some benchmarking for some full emoji text.

Regarding the performance difference, truncate and truncate_start count things to keep, while truncate_centered counts things to remove. But I'm not sure how much this matters as all benchmarks are truncating to roughly half width.

Aetf · 2024-05-05T23:01:10Z

src/lib.rs

@@ -388,6 +381,15 @@ mod tests {
                ("y\u{0306}ey\u{0306}", 3)
            );
        }
+
+        #[test]
+        fn family_stays_together() {


Love this name :)

EdJoPaTo · 2024-05-06T18:11:47Z

> And my concern is exactly the large performance hit. I'm a bit reluctant to merge this. (But maybe the performance isn't that important, as truncating a full 16KB text is pretty rare? Happy to be convinced.)

I assume most use cases truncate to fairly small widths. In the terminal library ratatui, there are maybe 100 columns of width available, and either truncation or wrapping can be used there. Even in other contexts, very long widths are rarely relevant: browser dev tools truncate at the end of the line, and that is also rather small.

Usages on GitHub seem to use widths like 25, 50, or 140.

Jules-Bertholet · 2024-06-17T12:19:14Z

I'll note that unicode-width does not guarantee that the width of a string equals the sum of the widths of its grapheme clusters. In 0.1.13 (the current published version), this property fails to hold only for the Old Lisu script (i.e., unlikely to be a problem in practice); in the latest master, it also fails for several other scripts, including Arabic.

Aetf · 2024-06-24T01:11:48Z

Okay, let's get this merged. @EdJoPaTo could you rebase?


## 🤖 New release

* `unicode-truncate`: 1.0.0 -> 1.1.0

<details><summary>Changelog</summary>
<blockquote>

## [1.1.0](v1.0.0...v1.1.0) - 2024-07-08

### Added
- segment by graphemes ([#11](#11))

### Fixed
- *(deps)* update rust crate itertools to 0.13 ([#20](#20))
- fixed typos in the `renovate.json` ([#17](#17))

### Other
- Removed unnessary debug-assertions setting
- Treat control characters as width 1, fixes [#16](#16) ([#19](#19))
- tweak renovate configs ([#13](#13))

</blockquote>
</details>

---

This PR was generated with [release-plz](https://github.com/MarcoIeni/release-plz/).

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

EdJoPaTo force-pushed the grapheme-segmentation branch from 07aad0f to 77f632a Compare May 2, 2024 13:10

EdJoPaTo added 2 commits May 2, 2024 15:14

test(bench): use black_box to ensure bench correctness

c138c07

feat: segment by graphemes

fe3fe59

EdJoPaTo force-pushed the grapheme-segmentation branch from 77f632a to fe3fe59 Compare May 2, 2024 13:16

Aetf reviewed May 5, 2024


EdJoPaTo mentioned this pull request May 6, 2024

fix: unicode truncation bug ratatui/ratatui#1089

Merged

Merge remote-tracking branch 'upstream/master' into grapheme-segmenta…

3374392

…tion Conflicts: Cargo.toml src/lib.rs

Aetf merged commit f85280f into Aetf:master Jun 25, 2024
20 checks passed

github-actions bot mentioned this pull request Jun 24, 2024

chore: release v1.1.0 #15

Merged

EdJoPaTo deleted the grapheme-segmentation branch June 25, 2024 04:23
