Optimize size/speed of Unicode datasets #68232

Mark-Simulacrum · 2020-01-14T21:23:43Z

The overall implementation has the same general idea as the prior approach,
which was based on a compressed trie structure, but modified to use less space
(and, coincidentally, be an overall performance improvement).

Sizes	Old	New	New/current
Alphabetic	4616	2982	64.60%
Case_Ignorable	3144	2112	67.18%
Cased	2376	934	39.31%
Cc	19	43	226.32%
Grapheme_Extend	3072	1734	56.45%
Lowercase	2328	985	42.31%
N	2648	1239	46.79%
Uppercase	1978	934	47.22%
White_Space	241	140	58.09%

Total	20422	11103	54.37%

This table shows the size of the old and new tables in bytes. The most important
of these tables is "Grapheme_Extend", as it is present in essentially all Rust
programs due to being called from str's Debug impl (char::escape_debug). In
a representative case given by this blog post for the embedded world, the
shrinking in this PR shrinks the final binary by 1,604 bytes, from 14,440 to
12,836.

The performance of these new tables, based on the (rough) benchmark of linearly
scanning the entire valid set of chars, querying for each is_*, is roughly
~50% better, though in some cases is either on par or slightly (3-5%) worse. In
practice, I believe the size benefits of this PR are the main concern. The new
implementation has been tested to be equivalent to the current nightly in terms
of returned values on the set of valid chars.

A (relatively) high-level explanation of the specific compression scheme used
can be found in the generator.

This is split into three commits -- the first adds the generator which produces
the Rust code for the tables, the second adds support code for the lookup, and
the third actually swaps the current implementation out for the new one.

rust-highfive · 2020-01-14T21:23:54Z

r? @joshtriplett

(rust_highfive has picked a reviewer for you, use r? to override)

Mark-Simulacrum · 2020-01-14T21:24:10Z

cc @jamesmunns per our discussions about this being helpful for embedded

rust-highfive · 2020-01-14T22:50:14Z

The job x86_64-gnu-llvm-7 of your PR failed (pretty log, raw log). Through arcane magic we have determined that the following fragments from the build log may contain information about the problem.

Click to expand the log.

2020-01-14T21:28:27.7918909Z ##[command]git remote add origin https://github.com/rust-lang/rust
2020-01-14T21:28:27.8009967Z ##[command]git config gc.auto 0
2020-01-14T21:28:27.8094159Z ##[command]git config --get-all http.https://github.com/rust-lang/rust.extraheader
2020-01-14T21:28:27.8162698Z ##[command]git config --get-all http.proxy
2020-01-14T21:28:27.8362080Z ##[command]git -c http.extraheader="AUTHORIZATION: basic ***" fetch --force --tags --prune --progress --no-recurse-submodules --depth=2 origin +refs/heads/*:refs/remotes/origin/* +refs/pull/68232/merge:refs/remotes/pull/68232/merge
---
2020-01-14T22:23:43.7230124Z .................................................................................................... 1600/9519
2020-01-14T22:23:49.1675536Z .................................................................................................... 1700/9519
2020-01-14T22:23:58.4061788Z .................................................................................................... 1800/9519
2020-01-14T22:24:07.7904235Z ........i........................................................................................... 1900/9519
2020-01-14T22:24:14.9825123Z ..................................................................................................ii 2000/9519
2020-01-14T22:24:31.4096467Z iii................................................................................................. 2100/9519
2020-01-14T22:24:39.8061543Z .................................................................................................... 2300/9519
2020-01-14T22:24:42.3382941Z .................................................................................................... 2400/9519
2020-01-14T22:24:48.2445277Z .................................................................................................... 2500/9519
2020-01-14T22:25:08.8398786Z .................................................................................................... 2600/9519
---
2020-01-14T22:27:51.2570276Z .........................................i...............i.......................................... 4900/9519
2020-01-14T22:28:00.6625733Z .................................................................................................... 5000/9519
2020-01-14T22:28:07.1660952Z ....................................................................................i............... 5100/9519
2020-01-14T22:28:12.7632732Z .................................................................................................... 5200/9519
2020-01-14T22:28:23.3509875Z .......................................................ii.ii...........i............................ 5300/9519
2020-01-14T22:28:32.9396986Z .................................................................................................... 5500/9519
2020-01-14T22:28:43.5477184Z .................................................................................................... 5600/9519
2020-01-14T22:28:50.2443148Z ........................................i........................................................... 5700/9519
2020-01-14T22:28:57.2595371Z .................................................................................................... 5800/9519
2020-01-14T22:28:57.2595371Z .................................................................................................... 5800/9519
2020-01-14T22:29:08.2740792Z .................................................................................................... 5900/9519
2020-01-14T22:29:18.2862387Z ...............................ii...i..ii...........i............................................... 6000/9519
2020-01-14T22:29:37.6067722Z .................................................................................................... 6200/9519
2020-01-14T22:29:45.9629012Z .................................................................................................... 6300/9519
2020-01-14T22:29:45.9629012Z .................................................................................................... 6300/9519
2020-01-14T22:29:53.8757046Z ...........................................................i..ii.................................... 6400/9519
2020-01-14T22:30:21.9863564Z .................................................................................................... 6600/9519
2020-01-14T22:30:24.2375970Z ...................................i................................................................ 6700/9519
2020-01-14T22:30:26.5710051Z .................................................................................................... 6800/9519
2020-01-14T22:30:29.1810539Z ...................................i................................................................ 6900/9519
---
2020-01-14T22:32:08.9142276Z .................................................................................................... 7500/9519
2020-01-14T22:32:13.4122534Z .................................................................................................... 7600/9519
2020-01-14T22:32:19.6054794Z .................................................................................................... 7700/9519
2020-01-14T22:32:26.8730354Z .................................................................................................... 7800/9519
2020-01-14T22:32:36.9979843Z ....................................................................................iiii............ 7900/9519
2020-01-14T22:32:53.7448148Z ..................i......i.......................................................................... 8100/9519
2020-01-14T22:32:59.1143144Z .................................................................................................... 8200/9519
2020-01-14T22:33:12.7743361Z .................................................................................................... 8300/9519
2020-01-14T22:33:23.0670789Z .................................................................................................... 8400/9519
---
2020-01-14T22:35:53.7772330Z  finished in 7.371
2020-01-14T22:35:53.7983663Z Check compiletest suite=codegen mode=codegen (x86_64-unknown-linux-gnu -> x86_64-unknown-linux-gnu)
2020-01-14T22:35:54.0083800Z 
2020-01-14T22:35:54.0084045Z running 166 tests
2020-01-14T22:35:57.1173691Z iiii......i........ii..iiii...i....i...........i............i..i..................i....i............ 100/166
2020-01-14T22:35:59.4483678Z i.i.i...iii..iiiiiii.......................iii............ii......
2020-01-14T22:35:59.4490247Z 
2020-01-14T22:35:59.4490348Z  finished in 5.650
2020-01-14T22:35:59.4695103Z Check compiletest suite=codegen-units mode=codegen-units (x86_64-unknown-linux-gnu -> x86_64-unknown-linux-gnu)
2020-01-14T22:35:59.6466477Z 
---
2020-01-14T22:36:01.6360395Z  finished in 2.166
2020-01-14T22:36:01.6555255Z Check compiletest suite=assembly mode=assembly (x86_64-unknown-linux-gnu -> x86_64-unknown-linux-gnu)
2020-01-14T22:36:01.8813210Z 
2020-01-14T22:36:01.8813454Z running 9 tests
2020-01-14T22:36:01.8814305Z iiiiiiiii
2020-01-14T22:36:01.8822973Z 
2020-01-14T22:36:01.8828408Z  finished in 0.226
2020-01-14T22:36:01.9035317Z Check compiletest suite=incremental mode=incremental (x86_64-unknown-linux-gnu -> x86_64-unknown-linux-gnu)
2020-01-14T22:36:02.1010658Z 
---
2020-01-14T22:36:22.7261907Z  finished in 20.822
2020-01-14T22:36:22.7492617Z Check compiletest suite=debuginfo mode=debuginfo-gdb+lldb (x86_64-unknown-linux-gnu -> x86_64-unknown-linux-gnu)
2020-01-14T22:36:22.9469170Z 
2020-01-14T22:36:22.9469432Z running 116 tests
2020-01-14T22:36:49.0532021Z .iiiii..i.....i..i...i..i.i.i..i..i..ii....i.i....ii..........iiii..........i.....i..i.......ii.i.ii 100/116
2020-01-14T22:36:52.6991584Z .....iiii.....ii
2020-01-14T22:36:52.6994440Z 
2020-01-14T22:36:52.6998914Z  finished in 29.950
2020-01-14T22:36:52.7007332Z Uplifting stage1 rustc (x86_64-unknown-linux-gnu -> x86_64-unknown-linux-gnu)
2020-01-14T22:36:52.7007961Z Copying stage2 rustc from stage1 (x86_64-unknown-linux-gnu -> x86_64-unknown-linux-gnu / x86_64-unknown-linux-gnu)
---
2020-01-14T22:50:08.7696607Z ..................................................................................................
2020-01-14T22:50:08.7697372Z failures:
2020-01-14T22:50:08.7697874Z 
2020-01-14T22:50:08.7698805Z ---- char::test_to_lowercase stdout ----
2020-01-14T22:50:08.7699672Z thread 'char::test_to_lowercase' panicked at 'assertion failed: `(left == right)`
2020-01-14T22:50:08.7700052Z   left: `"\u{0}"`,
2020-01-14T22:50:08.7700596Z  right: `"ß"`', src/libcore/../libcore/tests/char.rs:88:5
2020-01-14T22:50:08.7701381Z 
2020-01-14T22:50:08.7701564Z failures:
2020-01-14T22:50:08.7701741Z     char::test_to_lowercase
2020-01-14T22:50:08.7701923Z 
2020-01-14T22:50:08.7701923Z 
2020-01-14T22:50:08.7702108Z test result: FAILED. 995 passed; 1 failed; 2 ignored; 0 measured; 0 filtered out
2020-01-14T22:50:08.7702267Z 
2020-01-14T22:50:08.7718329Z error: test failed, to rerun pass '-p core --test coretests'
2020-01-14T22:50:08.7743321Z 
2020-01-14T22:50:08.7744276Z command did not execute successfully: "/checkout/obj/build/x86_64-unknown-linux-gnu/stage0/bin/cargo" "/checkout/obj/build/x86_64-unknown-linux-gnu/stage0/bin/cargo" "test" "-Zconfig-profile" "--target" "x86_64-unknown-linux-gnu" "-Zbinary-dep-depinfo" "-j" "2" "--release" "--locked" "--color" "always" "--features" "panic-unwind backtrace compiler-builtins-c" "--manifest-path" "/checkout/src/libtest/Cargo.toml" "-p" "core" "--" "--quiet"
2020-01-14T22:50:08.7744640Z expected success, got: exit code: 101
2020-01-14T22:50:08.7744838Z 
---
2020-01-14T22:50:08.7823508Z   local time: Tue Jan 14 22:50:08 UTC 2020
2020-01-14T22:50:08.9454476Z   network time: Tue, 14 Jan 2020 22:50:08 GMT
2020-01-14T22:50:08.9461199Z == end clock drift check ==
2020-01-14T22:50:09.2679907Z 
2020-01-14T22:50:09.2780968Z ##[error]Bash exited with code '1'.
2020-01-14T22:50:09.2833567Z ##[section]Starting: Checkout
2020-01-14T22:50:09.2835478Z ==============================================================================
2020-01-14T22:50:09.2835520Z Task         : Get sources
2020-01-14T22:50:09.2835574Z Description  : Get sources from a repository. Supports Git, TfsVC, and SVN repositories.

I'm a bot! I can only do what humans tell me to, so if this was not helpful or you have suggestions for improvements, please ping or otherwise contact @TimNN. (Feature Requests)

Mark-Simulacrum · 2020-01-15T00:13:43Z

Turns out adding in the case conversions at the last minute in a way that I thought was 1-1 with the old code is not a good idea. The bug should be fixed now.

joshtriplett · 2020-01-15T07:28:24Z

Very nice!

@bors r+

bors · 2020-01-15T07:28:25Z

📌 Commit efcda04 has been approved by joshtriplett

joshtriplett · 2020-01-15T07:34:43Z

Not for this series, but looking at the implementation of escape_debug, how critical is it (and how much does backward compatibility depend on) that it treats extended graphemes specially, rather than escaping all Unicode or all non-printable Unicode? Omitting that table entirely from most Rust programs seems worthwhile.

CAD97 · 2020-01-15T15:33:58Z

rather than escaping all Unicode

That would be str::escape_default

or all non-printable Unicode

That's today's str::escape_debug (roughly) 😛

The problem being solved by the grapheme table is (in theory, anyway) that the printability of any specific Unicode codepoint depends on its position in a grapheme. Just as a simple example, ZERO WIDTH JOINER is an "unprintable" that has no meaning and is invisible when just in the middle of whitespace. But when between two emoji, it signifies that the user wants a singular emoji containing the attributes of both emoji. Similarly, the country code codepoints on their own are "unprintables", but in pairs they represent country flag emoji. (In practice: Unicode is hard and you don't even know what's actually printable until you ask the font. escape_default is best effort to show all meaningful codepoints in the string while maintaining legibility for strings including non-ascii characters.)

…oshtriplett Optimize size/speed of Unicode datasets The overall implementation has the same general idea as the prior approach, which was based on a compressed trie structure, but modified to use less space (and, coincidentally, be an overall performance improvement). Sizes | Old | New | New/current -- | -- | -- | -- Alphabetic | 4616 | 2982 | 64.60% Case_Ignorable | 3144 | 2112 | 67.18% Cased | 2376 | 934 | 39.31% Cc | 19 | 43 | 226.32% Grapheme_Extend | 3072 | 1734 | 56.45% Lowercase | 2328 | 985 | 42.31% N | 2648 | 1239 | 46.79% Uppercase | 1978 | 934 | 47.22% White_Space | 241 | 140 | 58.09% | | | Total | 20422 | 11103 | 54.37% This table shows the size of the old and new tables in bytes. The most important of these tables is "Grapheme_Extend", as it is present in essentially all Rust programs due to being called from `str`'s Debug impl (`char::escape_debug`). In a representative case given by this [blog post] for the embedded world, the shrinking in this PR shrinks the final binary by 1,604 bytes, from 14,440 to 12,836. The performance of these new tables, based on the (rough) benchmark of linearly scanning the entire valid set of chars, querying for each `is_*`, is roughly ~50% better, though in some cases is either on par or slightly (3-5%) worse. In practice, I believe the size benefits of this PR are the main concern. The new implementation has been tested to be equivalent to the current nightly in terms of returned values on the set of valid chars. A (relatively) high-level explanation of the specific compression scheme used can be found [in the generator]. This is split into three commits -- the first adds the generator which produces the Rust code for the tables, the second adds support code for the lookup, and the third actually swaps the current implementation out for the new one. [blog post]: https://jamesmunns.com/blog/fmt-unreasonably-expensive/ [in the generator]: https://github.com/Mark-Simulacrum/rust/blob/unicode-tables/src/tools/unicode-table-generator/src/raw_emitter.rs

@ghost

Rollup of 6 pull requests Successful merges: - #68123 (Implement Cursor for linked lists. (RFC 2570).) - #68212 (Suggest to shorten temporary lifetime during method call inside generator) - #68232 (Optimize size/speed of Unicode datasets) - #68236 (Add some regression tests) - #68237 (Account for `Path`s in `is_suggestable_infer_ty`) - #68252 (remove redundant clones, found by clippy) Failed merges: r? @ghost

joshtriplett · 2020-02-11T21:41:17Z

@CAD97 What I'm suggesting is that for Debug output, it's potentially acceptable to escape characters that might have been printable. And if doing so means we can drop a fairly large table from the majority of Rust binaries, that seems worth doing.

…tolnay Shrink Unicode tables (even more) This shrinks the Unicode tables further, building upon the wins in rust-lang#68232 (the previous counts differ due to an interim Unicode version update, see rust-lang#69929. The new data structure is slower by around 3x, on the benchmark of looking up every Unicode scalar value in each data set sequentially in every data set included. Note that for ASCII, the exposed functions on `char` optimize with direct branches, so ASCII will retain the same performance regardless of internal optimizations (or the reverse). Also, note that the size reduction due to the skip list (from where the performance losses come) is around 40%, and, as a result, I believe the performance loss is acceptable, as the routines are still quite fast. Anywhere where this is hot, should probably be using a custom data structure anyway (e.g., a raw bitset) or something optimized for frequently seen values, etc. This PR updates both the bitset data structure, and introduces a new data structure similar to a skip list. For more details, see the [main.rs] of the table generator, which describes both. The commits mostly work individually and document size wins. As before, this is tested on all valid chars to have the same results as nightly (and the canonical Unicode data sets), happily, no bugs were found. [main.rs]: https://github.com/rust-lang/rust/blob/fb4a715e18b/src/tools/unicode-table-generator/src/main.rs Set | Previous | New | % of old | Codepoints | Ranges | ----------------|---------:|------:|-----------:|-----------:|-------:| Alphabetic | 3055 | 1599 | 52% | 132875 | 695 | Case Ignorable | 2136 | 949 | 44% | 2413 | 410 | Cased | 934 | 359 | 38% | 4286 | 141 | Cc | 43 | 9 | 20% | 65 | 2 | Grapheme Extend | 1774 | 813 | 46% | 1979 | 344 | Lowercase | 985 | 867 | 88% | 2344 | 652 | N | 1266 | 419 | 33% | 1781 | 133 | Uppercase | 934 | 777 | 83% | 1911 | 643 | White_Space | 140 | 37 | 26% | 25 | 10 | ----------------|----------|-------|------------|------------|--------| Total | 11267 | 5829 | 51% | - | - |

rust-highfive assigned joshtriplett Jan 14, 2020

rust-highfive added the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. label Jan 14, 2020

Mark-Simulacrum force-pushed the unicode-tables branch from 21efaf9 to 74ca7ed Compare January 14, 2020 21:27

Mark-Simulacrum added 3 commits January 14, 2020 19:11

Add unicode table generator

064f888

Add support code for new unicode_data module

40ad877

Replace old tables with new unicode data

efcda04

Mark-Simulacrum force-pushed the unicode-tables branch from 74ca7ed to efcda04 Compare January 15, 2020 00:11

bors added S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Jan 15, 2020

Dylan-DPC-zz mentioned this pull request Jan 15, 2020

Rollup of 6 pull requests #68254

Merged

bors merged commit efcda04 into rust-lang:master Jan 15, 2020

Mark-Simulacrum deleted the unicode-tables branch January 23, 2020 20:51

thomcc mentioned this pull request Mar 13, 2020

Add support for the compressed bitset format that rustc uses for some tables in libcore BurntSushi/ucd-generate#30

Open

Mark-Simulacrum mentioned this pull request Mar 27, 2020

Shrink Unicode tables (even more) #70486

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Optimize size/speed of Unicode datasets #68232

Optimize size/speed of Unicode datasets #68232

Mark-Simulacrum commented Jan 14, 2020 •

edited by dtolnay

Loading

rust-highfive commented Jan 14, 2020

Mark-Simulacrum commented Jan 14, 2020

rust-highfive commented Jan 14, 2020

Mark-Simulacrum commented Jan 15, 2020

joshtriplett commented Jan 15, 2020

bors commented Jan 15, 2020

joshtriplett commented Jan 15, 2020

CAD97 commented Jan 15, 2020

joshtriplett commented Feb 11, 2020

Optimize size/speed of Unicode datasets #68232

Optimize size/speed of Unicode datasets #68232

Conversation

Mark-Simulacrum commented Jan 14, 2020 • edited by dtolnay Loading

rust-highfive commented Jan 14, 2020

Mark-Simulacrum commented Jan 14, 2020

rust-highfive commented Jan 14, 2020

Mark-Simulacrum commented Jan 15, 2020

joshtriplett commented Jan 15, 2020

bors commented Jan 15, 2020

joshtriplett commented Jan 15, 2020

CAD97 commented Jan 15, 2020

joshtriplett commented Feb 11, 2020

Mark-Simulacrum commented Jan 14, 2020 •

edited by dtolnay

Loading