Add configurability for unicode and skipping parsers #14

hamiltop · 2023-07-09T03:54:04Z

Two main changes

Allow disabling unicode support. Unicode support greatly increases memory and latency and is not generally needed.
Allow skipping specific parser categories when loading from regex yaml. Compiling regex for unused parsers is a pretty big waste of memory.

Both unicode support and device/os/user_agent parsers can now be configured via UserAgentParserBuilder.

I tried very hard to keep this backwards compatible, but if we're willing to break backwards compatibility we can take these in a different direction.

- Unicode support greatly increases memory and latency and is not generally needed. - Compiling regex for unused parsers is just a waste of memory. Both unicode support and device/os/user_agent parsers can now be configured via `UserAgentParserBuilder`.

hamiltop · 2023-07-09T04:05:35Z

I dug in a bit after this discussion about memory usage: rust-lang/regex#1027 (comment)

I took at a look at other uaparser implementations and none of them are done with full unicode support. The uaparser-core test suite passes without it just fine. So I moved everything to use regex::bytes::Regex and make unicode a configuration option.

Additionally, we at Remind are only interested in OS for our use case. What we've been doing has been to take the core regex yaml file and delete the Device and User Agent sections, but it would be nice to not have to curate that ourselves and instead just pass in flags to only parse and use the os parsers. So I added the ability to disable any of os, device, and user_agent parsers to save on memory.

Some performance numbers with these changes:

First, memory usage. full_parser has all the parsers and includes unicode support.no_unicode_parser is all the parsers with unicode disabled. os_only_no_unicode_parser skips device and user_agent parsers and disables unicode (this is how we would use it at Remind).

➜  uap-rs git:(main) ✗ /usr/bin/time -l target/debug/examples/full_parser
Client { device: Device { family: "Mac", brand: Some("Apple"), model: Some("Mac") }, os: OS { family: "Mac OS X", major: Some("10"), minor: Some("15"), patch: Some("7"), patch_minor: None }, user_agent: UserAgent { family: "Chrome", major: Some("114"), minor: Some("0"), patch: Some("0") } }
        5.48 real         5.40 user         0.07 sys
           296796160  maximum resident set size
                   0  average shared memory size
                   0  average unshared data size
                   0  average unshared stack size
               18210  page reclaims
                  11  page faults
                   0  swaps
                   0  block input operations
                   0  block output operations
                   0  messages sent
                   0  messages received
                   0  signals received
                   0  voluntary context switches
                 101  involuntary context switches
         54894775582  instructions retired
         18604917601  cycles elapsed
           282281600  peak memory footprint
➜  uap-rs git:(main) ✗ /usr/bin/time -l target/debug/examples/no_unicode_parser
Client { device: Device { family: "Mac", brand: Some("Apple"), model: Some("Mac") }, os: OS { family: "Mac OS X", major: Some("10"), minor: Some("15"), patch: Some("7"), patch_minor: None }, user_agent: UserAgent { family: "Chrome", major: Some("114"), minor: Some("0"), patch: Some("0") } }
        1.01 real         0.99 user         0.01 sys
            63062016  maximum resident set size
                   0  average shared memory size
                   0  average unshared data size
                   0  average unshared stack size
                3949  page reclaims
                   6  page faults
                   0  swaps
                   0  block input operations
                   0  block output operations
                   0  messages sent
                   0  messages received
                   0  signals received
                   0  voluntary context switches
                  33  involuntary context switches
         10472472398  instructions retired
          3381014760  cycles elapsed
            59262016  peak memory footprint
➜  uap-rs git:(main) ✗ /usr/bin/time -l target/debug/examples/os_only_no_unicode_parser
Client { device: Device { family: "Other", brand: None, model: None }, os: OS { family: "Mac OS X", major: Some("10"), minor: Some("15"), patch: Some("7"), patch_minor: None }, user_agent: UserAgent { family: "Other", major: None, minor: None, patch: None } }
        0.16 real         0.16 user         0.00 sys
            10469376  maximum resident set size
                   0  average shared memory size
                   0  average unshared data size
                   0  average unshared stack size
                 740  page reclaims
                   5  page faults
                   0  swaps
                   0  block input operations
                   0  block output operations
                   0  messages sent
                   0  messages received
                   0  signals received
                   0  voluntary context switches
                   9  involuntary context switches
          1548425921  instructions retired
           497911781  cycles elapsed
             6668992  peak memory footprint

Disabling Unicode takes us from 300MB to 60MB, and only doing the OS parsers takes that all the way to 10MB.

Beyond memory, the cargo bench results are also nice to see:

parse_device            time:   [578.92 ms 581.72 ms 584.82 ms]                         
                        change: [-0.9028% +0.0193% +1.0401%] (p = 0.99 > 0.05)
                        No change in performance detected.
Found 7 outliers among 100 measurements (7.00%)
  4 (4.00%) high mild
  3 (3.00%) high severe

parse_device unicode disabled                                                                            
                        time:   [496.35 ms 498.60 ms 500.98 ms]
                        change: [+0.0428% +1.2668% +2.4227%] (p = 0.05 > 0.05)
                        No change in performance detected.

parse_os                time:   [2.3230 ms 2.3356 ms 2.3488 ms]                     
                        change: [-2.9222% -1.7137% -0.4659%] (p = 0.01 < 0.05)
                        Change within noise threshold.
Found 8 outliers among 100 measurements (8.00%)
  7 (7.00%) high mild
  1 (1.00%) high severe

parse_os unicode disabled                                                                            
                        time:   [2.3017 ms 2.3118 ms 2.3228 ms]
                        change: [-0.5476% +0.1792% +0.9374%] (p = 0.66 > 0.05)
                        No change in performance detected.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild

parse_user_agent        time:   [174.95 ms 175.33 ms 175.73 ms]                              
                        change: [-1.2574% -0.5656% +0.1207%] (p = 0.10 > 0.05)
                        No change in performance detected.
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high mild

parse_user_agent unicode disabled                                                                             
                        time:   [56.693 ms 56.998 ms 57.326 ms]
                        change: [-2.7180% -1.5820% -0.3546%] (p = 0.01 < 0.05)
                        Change within noise threshold.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild

A huge speedup on the user_agent parsing (57ms vs 175ms), a moderate speedup on device parsing (498 vs 581), and not much change on OS.

I left the unicode support in as the default to make this a non-breaking change, but honestly it's just not needed and I'd recommend we rip it out and just make everything not support unicode.

Anyway, let me know what you think.

hamiltop · 2023-07-09T04:22:15Z

src/parser/mod.rs

+        captures.expand(raw_replacement, &mut target);
+        std::str::from_utf8(&target)
+            .map(|s| Cow::Owned(s.trim().to_owned()))
+            // What is the behavior if we can't parse a string???


This is the only big question mark in behavior. If a replacement has a group and the result of the capture expands into a &[u8] that can't be turned into a string, then we have to do something. Rather than panic or try and surface this, I chose to just skip the replacement.

This may never happen. It likely won't. I'm not sure if a panic is better or worse here and am very open to feedback.

hamiltop · 2023-07-09T04:23:24Z

This example is a good example of how the new builder works and the configuration available https://github.com/davidarmstronglewis/uap-rs/pull/14/files#diff-f2849e4e13ac1633666e5ef659e790e06feb21fa5848bf7268b80e7e7de39a94

hamiltop · 2023-07-09T04:29:10Z

src/parser/mod.rs

+        user_agent: bool,
+        unicode: bool,
+    ) -> Result<UserAgentParser, Error> {
+        let device_matchers = if device {


I explored a few options on how to disable/enable a specific class of parser. Using crate features was neat, but felt overkill for something that could just be provided as config. The most correct approach is probably making the Vec<device::Matcher> an Option<Vec<device::Matcher>> and making Client have optional fields for all the matches, but that was a bigger footprint and I wasn't sure if that was worth it. An empty Vec is what we're getting now with our stripped down regex yaml file and it works well enough.

oceanlewis · 2023-07-18T03:51:46Z

Hey @hamiltop, thank you for the lovely PR and for linking out to context around these changes. Solid change, my only feedback would be to hide the new UserAgentParserBuilder behind a builder method rather than exporting the type directly. In case you've moved on to something else I've pushed up this branch which does just that (and updates some Nix stuff; which you can ignore).

Once this is in I'll cut a new release. Thanks again ❤️

hamiltop · 2023-07-20T16:25:37Z

Great. Updated from your branch and then fixed up a few tests to use the new syntax. Should be g2g here.

oceanlewis · 2023-07-20T18:00:10Z

src/parser/builder.rs

+// fn foo() {
+// /// ```rust
+// /// # use uaparser::*;
+// let _parser = UserAgentParser::builder()
+// .unicode(false)
+//     .user_agent()
+// /// ```
+//     }


Oops, I think I left this in here erroneously.

Updated with a fixup commit and rebased.

oceanlewis

Nice, I'll merge this and cut a new release later tonight.
Thank you for the contribution ❤️

hamiltop mentioned this pull request Jul 9, 2023

Slow performance in a loop #13

Closed

hamiltop commented Jul 9, 2023

View reviewed changes

oceanlewis reviewed Jul 20, 2023

View reviewed changes

oceanlewis and others added 2 commits July 20, 2023 14:38

Hide UserAgentParserBuilder behind builder method

c58b032

Update tests/bench/examples with new builder syntax

d7a94b3

hamiltop force-pushed the disable_unicode_and_more_config branch from 2216e6c to d7a94b3 Compare July 20, 2023 20:38

oceanlewis approved these changes Jul 20, 2023

View reviewed changes

oceanlewis merged commit 71a5f66 into oceanlewis:main Jul 23, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add configurability for unicode and skipping parsers #14

Add configurability for unicode and skipping parsers #14

hamiltop commented Jul 9, 2023 •

edited

Loading

hamiltop commented Jul 9, 2023 •

edited

Loading

hamiltop Jul 9, 2023

hamiltop commented Jul 9, 2023

hamiltop Jul 9, 2023

oceanlewis commented Jul 18, 2023

hamiltop commented Jul 20, 2023

oceanlewis Jul 20, 2023

hamiltop Jul 20, 2023

oceanlewis left a comment

Add configurability for unicode and skipping parsers #14

Add configurability for unicode and skipping parsers #14

Conversation

hamiltop commented Jul 9, 2023 • edited Loading

hamiltop commented Jul 9, 2023 • edited Loading

hamiltop Jul 9, 2023

Choose a reason for hiding this comment

hamiltop commented Jul 9, 2023

hamiltop Jul 9, 2023

Choose a reason for hiding this comment

oceanlewis commented Jul 18, 2023

hamiltop commented Jul 20, 2023

oceanlewis Jul 20, 2023

Choose a reason for hiding this comment

hamiltop Jul 20, 2023

Choose a reason for hiding this comment

oceanlewis left a comment

Choose a reason for hiding this comment

hamiltop commented Jul 9, 2023 •

edited

Loading

hamiltop commented Jul 9, 2023 •

edited

Loading