expose new crate features for optionally shrinking regex #613

Merged 15 commits on Sep 3, 2019

BurntSushi
Member

@BurntSushi BurntSushi commented Sep 2, 2019

This PR is primarily intended to close #583. However, an additional motivation for these changes was to permit users of regex to shrink its dependency tree, should they wish to give up runtime performance in exchange. While this may not sound like a great trade, there exist many cases where high performance regex matching isn't actually required. For example, if one is using a regex to filter a small set of tiny ASCII strings, then it would be perfectly reasonable to disable all of regex's crate features. The end result is substantially smaller binaries, improved compilation times and a dependency tree for regex that shrinks down to a single crate (regex-syntax).

As an example, if I compile the following program in release mode

use regex::Regex;

fn main() {
    Regex::new("x").unwrap();
}

and use regex = "1", then the total stripped binary size is 1.5M. Compare this with a baseline program

use regex::Regex;

fn main() {
    println!("Hello, world!");
}

whose total stripped binary size is 203K. Thus, the total overhead of regex is approximately 1.3M. A large percentage of that overhead corresponds to Unicode tables. For example, if we compile the above regex program, but with Unicode tables disabled (and keeping performance oriented features enabled)

[dependencies.regex]
version = "1.3.0"
default-features = false
features = ["std", "perf"]

then the total binary size drops to 767K, for a total overhead of about 560K.

Finally, disabling all possible features

[dependencies.regex]
version = "1.3.0"
default-features = false
features = ["std"]

results in a binary size of 535K, for a total overhead of about 332K.

You can shrink the binary size even more (by incurring more compilation time) with the following settings:

[profile.release]
lto = true
codegen-units = 1
opt-level = "z"

This results in a baseline (hello world above) binary size of 191K, and a binary size of 367K for regex for a total overhead of 176K. This isn't quite the target of 50K desired by @cramertj, but it does correspond to about an order of magnitude improvement over the status quo.
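For reference, each overhead figure above is just the regex binary's size minus the matching hello-world baseline. A quick sketch that redoes that arithmetic with the sizes quoted in this comment (the `overhead_kb` helper and the labels are made up for illustration):

```rust
/// Overhead attributable to regex: binary size minus the hello-world baseline.
/// (`overhead_kb` is an illustrative helper, not part of any crate.)
fn overhead_kb(binary_kb: u32, baseline_kb: u32) -> u32 {
    binary_kb - baseline_kb
}

fn main() {
    // (label, regex binary size in K, baseline size in K), as reported above.
    let builds = [
        ("std + perf, no Unicode", 767, 203),
        ("std only", 535, 203),
        ("std only, LTO + opt-level=z", 367, 191),
    ];
    for (label, binary, baseline) in builds.iter() {
        println!("{}: {}K overhead", label, overhead_kb(*binary, *baseline));
    }
}
```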

Another great benefit to trimming all this stuff is that release mode compilation times drop by a factor of 2 on my machine:

$ cargo clean

$ time cargo build --release
    Updating crates.io index
   Compiling memchr v2.2.1
   Compiling lazy_static v1.4.0
   Compiling regex-syntax v0.6.11 (/home/andrew/rust/regex/regex-syntax)
   Compiling thread_local v0.3.6
   Compiling aho-corasick v0.7.6
   Compiling regex v1.2.1 (/home/andrew/rust/regex)
   Compiling regex-bloat v0.1.0 (/home/andrew/tmp/play/rust/regex-bloat)
    Finished release [optimized] target(s) in 10.67s

real    10.785
user    55.063
sys     0.961
maxmem  419 MB
faults  1

$ cargo clean

$ time cargo build --release
   Compiling regex-syntax v0.6.11 (/home/andrew/rust/regex/regex-syntax)
   Compiling regex v1.2.1 (/home/andrew/rust/regex)
   Compiling regex-bloat v0.1.0 (/home/andrew/tmp/play/rust/regex-bloat)
    Finished release [optimized] target(s) in 4.84s

real    4.863
user    26.894
sys     0.415
maxmem  322 MB
faults  0

Debug mode compilation also gets a nice ~1.5x speed-up:

$ cargo clean

$ time cargo build
   Compiling memchr v2.2.1
   Compiling lazy_static v1.4.0
   Compiling regex-syntax v0.6.11 (/home/andrew/rust/regex/regex-syntax)
   Compiling thread_local v0.3.6
   Compiling aho-corasick v0.7.6
   Compiling regex v1.2.1 (/home/andrew/rust/regex)
   Compiling regex-bloat v0.1.0 (/home/andrew/tmp/play/rust/regex-bloat)
    Finished dev [unoptimized + debuginfo] target(s) in 7.05s

real    7.069
user    20.716
sys     0.980
maxmem  446 MB
faults  0

$ cargo clean

$ time cargo build
   Compiling regex-syntax v0.6.11 (/home/andrew/rust/regex/regex-syntax)
   Compiling regex v1.2.1 (/home/andrew/rust/regex)
   Compiling regex-bloat v0.1.0 (/home/andrew/tmp/play/rust/regex-bloat)
    Finished dev [unoptimized + debuginfo] target(s) in 4.46s

real    4.490
user    9.788
sys     0.468
maxmem  355 MB
faults  0

We'll remove 'use_std' in regex 2, but keep it around for backward
compatibility.

Fixes #474

This makes sure the generated tables are rustfmt'd.

This nominally moves the logic for acquiring Unicode-aware Perl character
classes into the `unicode` module, and also makes the calling code
robust with respect to failures.

This commit is prep work for making the availability of Unicode-aware
Perl classes optional.

This commit refactors the way this library handles Unicode data by
making it completely optional. Several features are introduced which
permit callers to select only the Unicode data they need (up to a point
of granularity).

An important property of these changes is that the presence or absence of
crate features will never change the match semantics of a regular
expression. Instead, the presence or absence of a crate feature can only
add or subtract from the set of all possible valid regular expressions.

So for example, if the `unicode-case` feature is disabled, then
attempting to produce `Hir` for the regex `(?i)a` will fail. Instead,
callers must use `(?i-u)a` (or enable the `unicode-case` feature).
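To make that property concrete, here is a minimal sketch of the idea, not the actual regex-syntax code. The `case_variants` function and `Error` type are hypothetical, and std's `to_lowercase`/`to_uppercase` merely stand in for a real simple case folding table. The point is only that a missing feature yields an error for Unicode-mode patterns rather than a silent fallback to ASCII behavior:

```rust
/// Error returned when a pattern needs Unicode data that was compiled out.
#[derive(Debug, PartialEq)]
enum Error {
    UnicodeCaseUnavailable,
}

/// Returns the characters that `c` should match under `(?i)`.
/// `unicode` mirrors the `u` flag. Hypothetical sketch, not a real API.
fn case_variants(c: char, unicode: bool) -> Result<Vec<char>, Error> {
    if !unicode {
        // ASCII-only folding never needs Unicode tables, so it always works.
        let mut v = vec![c.to_ascii_lowercase(), c.to_ascii_uppercase()];
        v.dedup();
        return Ok(v);
    }
    if cfg!(feature = "unicode-case") {
        // Stand-in for a lookup in real Unicode simple case folding tables.
        let mut v: Vec<char> = c.to_lowercase().chain(c.to_uppercase()).collect();
        v.dedup();
        Ok(v)
    } else {
        // Feature disabled: reject the request instead of changing semantics.
        Err(Error::UnicodeCaseUnavailable)
    }
}

fn main() {
    // Built without the feature, the `(?i)a` case (Unicode mode) is rejected,
    println!("{:?}", case_variants('a', true));
    // while the `(?i-u)a` case (ASCII mode) works with or without it.
    println!("{:?}", case_variants('a', false));
}
```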

This partially addresses #583 since it permits callers to decrease
binary size.

We have a good thing going, so let's formalize it a bit.

This commit sets up the infrastructure for supporting various `unicode`
and `perf` features, which permit decreasing binary size, compile times
and the size of the dependency tree.

Most of the work here is in modifying the regex tests to make them
work in concert with the available Unicode features. In cases where
Unicode is irrelevant, we just turn it off. In other cases, we require
the Unicode features to run the tests.

This also introduces a new error in the compiler whereby if a Unicode
word boundary is used, but the `unicode-perl` feature is disabled, then
the regex will fail to compile. (Because the necessary data to match
Unicode word boundaries isn't available.)

This makes all uses of `#[inline(always)]` conditional on the
`perf-inline` feature. This should reduce compile times and binary size,
but may decrease match performance.
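One common way to express this kind of conditional attribute is `cfg_attr`. The sketch below assumes that form; `is_word_byte` is a made-up function used only for illustration:

```rust
/// Small hot-path helper used to illustrate conditional inlining.
/// With the `perf-inline` feature enabled, the function is marked
/// `#[inline(always)]`; otherwise the compiler decides on its own.
/// (Assumed form; the function itself is hypothetical.)
#[cfg_attr(feature = "perf-inline", inline(always))]
fn is_word_byte(b: u8) -> bool {
    b.is_ascii_alphanumeric() || b == b'_'
}

fn main() {
    let count = b"foo_bar baz".iter().filter(|&&b| is_word_byte(b)).count();
    println!("{} word bytes", count);
}
```

Behavior is identical either way; only the inlining hint changes.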
This makes the thread_local (and by consequence, lazy_static) crates
optional by providing a naive caching mechanism when perf-cache is
disabled. This is achieved by defining a common API and implementing it
via both approaches.
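A rough sketch of what a naive, mutex-backed variant of such an API might look like. The `Cached`, `CachedGuard` and `get_or` names are assumptions for illustration and may not match the crate's internals, and the auto-trait handling discussed below is not modeled:

```rust
use std::sync::{Mutex, MutexGuard};

/// A naive cache: one shared slot behind a mutex, lazily initialized.
/// Hypothetical sketch; a thread-local variant would expose the same API.
struct Cached<T> {
    slot: Mutex<Option<T>>,
}

/// Guard giving access to the cached value while the lock is held.
struct CachedGuard<'a, T>(MutexGuard<'a, Option<T>>);

impl<T> Cached<T> {
    fn new() -> Cached<T> {
        Cached { slot: Mutex::new(None) }
    }

    /// Returns the cached value, creating it with `create` on first use.
    fn get_or(&self, create: impl FnOnce() -> T) -> CachedGuard<'_, T> {
        let mut guard = self.slot.lock().unwrap();
        if guard.is_none() {
            *guard = Some(create());
        }
        CachedGuard(guard)
    }
}

impl<'a, T> CachedGuard<'a, T> {
    fn value(&mut self) -> &mut T {
        self.0.as_mut().expect("slot is filled by get_or")
    }
}

fn main() {
    let cache: Cached<Vec<u32>> = Cached::new();
    cache.get_or(Vec::new).value().push(1);
    // The value already exists, so this closure is never called.
    cache.get_or(|| unreachable!("already initialized")).value().push(2);
    println!("{:?}", cache.get_or(Vec::new).value());
}
```

Unlike a thread-local cache, this version shares a single value across all threads and serializes access through the lock, which is simpler but slower under contention.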

The one tricky bit here is to ensure our naive version implements the
same auto-traits as the fast version. Since we just use a plain mutex,
it impls RefUnwindSafe, but thread_local does not. So we forcefully
remove the RefUnwindSafe impl from our safe variant.

We should be able to implement RefUnwindSafe in both cases, but
this likely requires some mechanism for clearing the regex cache
automatically if a panic occurs anywhere during search. But that's a
more invasive change and is part of #576.

This commit enables support for the perf-literal feature. When it's
disabled, no literal optimizations will be performed. Instead, only
the regex engine itself is used.
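To illustrate what a literal optimization buys, here is a toy sketch for the single hard-coded pattern `foo[0-9]`. Everything in it is hypothetical: `engine_matches_at` stands in for the regex engine, and std's `str::find` stands in for the fast substring search that a real implementation would get from crates like memchr or aho-corasick. Both paths must report the same match; the prefilter only skips positions that cannot match:

```rust
/// Stand-in for the regex engine: does `foo[0-9]` match starting at `start`?
fn engine_matches_at(hay: &[u8], start: usize) -> bool {
    let rest = &hay[start..];
    rest.starts_with(b"foo") && rest.get(3).map_or(false, |b| b.is_ascii_digit())
}

/// No literal optimization: try the engine at every position.
fn find_plain(hay: &str) -> Option<usize> {
    (0..hay.len()).find(|&start| engine_matches_at(hay.as_bytes(), start))
}

/// Literal optimization: only run the engine where the literal "foo" occurs.
fn find_with_prefilter(hay: &str) -> Option<usize> {
    let mut from = 0;
    while let Some(offset) = hay[from..].find("foo") {
        let start = from + offset;
        if engine_matches_at(hay.as_bytes(), start) {
            return Some(start);
        }
        // "f" is one byte, so `start + 1` is still a valid char boundary.
        from = start + 1;
    }
    None
}

/// Pick a strategy based on the `perf-literal` feature (assumed wiring).
fn find(hay: &str) -> Option<usize> {
    if cfg!(feature = "perf-literal") {
        find_with_prefilter(hay)
    } else {
        find_plain(hay)
    }
}

fn main() {
    let hay = "foo bar foox foo7 baz";
    println!("{:?}", find(hay));
}
```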

In practice, it's quite plausible that we don't need to disable *all*
literal optimizations. But that is the simplest path here, and I don't
have the stomach to do anything more with the current code. src/exec.rs
has turned into a giant soup.

This commit adds support for the perf-dfa feature, which permits users
of this crate to completely disable the lazy DFA. This should help
decrease binary size and compilation times, although this will come at
a significant cost to runtime performance.

This seems to save about 12KB on the final binary size. Benchmarks
suggest that there is no meaningful runtime performance difference.

@BurntSushi BurntSushi mentioned this pull request Sep 2, 2019
@BurntSushi
Member Author

For those wondering, here are the exposed crate features:

Ecosystem features

  • std -
    When enabled, this will cause regex to use the standard library. Currently,
    disabling this feature will always result in a compilation error. It is
    intended to add alloc-only support to regex in the future.

Performance features

  • perf -
    Enables all performance related features. This feature is enabled by default
    and will always cover all features that improve performance, even if more
    are added in the future.
  • perf-cache -
    Enables the use of very fast thread safe caching for internal match state.
    When this is disabled, caching is still used, but with a slower and simpler
    implementation. Disabling this drops the thread_local and lazy_static
    dependencies.
  • perf-dfa -
    Enables the use of a lazy DFA for matching. The lazy DFA is used to compile
    portions of a regex to a very fast DFA on an as-needed basis. This can
    result in substantial speedups, usually by an order of magnitude on large
    haystacks. The lazy DFA does not bring in any new dependencies, but it can
    make compile times longer.
  • perf-inline -
    Enables the use of aggressive inlining inside match routines. This reduces
    the overhead of each match. The aggressive inlining, however, increases
    compile times and binary size.
  • perf-literal -
    Enables the use of literal optimizations for speeding up matches. In some
    cases, literal optimizations can result in speedups of several orders of
    magnitude. Disabling this drops the aho-corasick and memchr dependencies.

Unicode features

  • unicode -
    Enables all Unicode features. This feature is enabled by default, and will
    always cover all Unicode features, even if more are added in the future.
  • unicode-age -
    Provide the data for the Unicode Age property. This makes it possible to
    use classes like \p{Age:6.0} to refer to all codepoints first introduced
    in Unicode 6.0.
  • unicode-bool -
    Provide the data for numerous Unicode boolean properties. The full list
    is not included here, but contains properties like Alphabetic, Emoji,
    Lowercase, Math, Uppercase and White_Space.
  • unicode-case -
    Provide the data for case insensitive matching using
    Unicode's "simple loose matches" specification.
  • unicode-gencat -
    Provide the data for Unicode general categories. This includes, but is
    not limited to, Decimal_Number, Letter, Math_Symbol, Number and
    Punctuation.
  • unicode-perl -
    Provide the data for supporting the Unicode-aware Perl character classes,
    corresponding to \w, \s and \d. This is also necessary for using
    Unicode-aware word boundary assertions. Note that if this feature is
    disabled, the \s and \d character classes are still available if the
    unicode-bool and unicode-gencat features are enabled, respectively.
  • unicode-script -
    Provide the data for Unicode scripts and script extensions. This
    includes, but is not limited to, Arabic, Cyrillic, Hebrew, Latin and
    Thai.
  • unicode-segment -
    Provide the data for the properties used to implement the Unicode text
    segmentation algorithms. This enables using classes like \p{gcb=Extend},
    \p{wb=Katakana} and \p{sb=ATerm}.

@cramertj
Member

cramertj commented Sep 3, 2019

This is incredible work, thank you so much!

@BurntSushi
Member Author

No problem! Thanks for giving me the kick to do it. :-)

This PR is now on crates.io in regex 1.3.0 and regex-syntax 0.6.12.

@BurntSushi
Member Author

It seems regex didn't build on docs.rs and I'm not sure why. I opened an issue: rust-lang/docs.rs#400

@jhpratt
Member

jhpratt commented Sep 3, 2019

Now because of this change, I've got a request! Could we have case insensitive matching even with unicode disabled? Right now it still requires the unicode-case feature.

@BurntSushi
Member Author

@jhpratt Could you please show your code? It should work just fine. i.e., (?i-u)a will match a and A. (As it always has.)

@jhpratt
Member

jhpratt commented Sep 3, 2019

Wasn't aware of the negative -u flag. That solves the issue!

fulmicoton added a commit to quickwit-oss/tantivy that referenced this pull request Sep 3, 2019
fulmicoton added a commit to quickwit-oss/tantivy that referenced this pull request Sep 4, 2019
@nic-hartley

Would it be reasonable to do the equivalent of prepending every regex with (?-u) automatically when the Unicode feature is disabled? That way I wouldn't have to change my incredibly simple regexes to support it; I could just remove it in my Cargo.toml.

@BurntSushi
Member Author

Nope, it's not. That violates the property that the semantics of a regex never change based on which features are enabled. This property is important. Consider, for example, that you've written a library that depends on regex. You don't need Unicode support, so you disable default features and only enable std. Your regexes get automatically rewritten to behave as if they started with (?-u). Now imagine that your library is used in someone else's project, and that project also depends on regex but uses Unicode features. Features are additive, so all of a sudden, your library is also using regex with Unicode support enabled. This means your regexes are no longer rewritten with (?-u) as a prefix. This changes the match semantics of your regexes and can wind up causing spectacular failures. And these are the kind of failures that correspond to subtle corner cases and may not be unit tested.

If you don't want to write (?-u), then an alternative is to write a helper function that calls RegexBuilder::unicode to disable Unicode mode. Then use that helper function to compile all of your regexes instead of Regex::new.
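The additive-features scenario above can be modeled as a simple set union. This is a deliberately simplified sketch of how Cargo unifies features for a single shared dependency; the `unify` function and the two feature lists are made up for illustration:

```rust
use std::collections::BTreeSet;

/// Union of the features requested by every dependent of one crate in a build.
/// Simplified model of feature unification; `unify` is an illustrative name.
fn unify(requests: &[&[&'static str]]) -> BTreeSet<&'static str> {
    requests.iter().flat_map(|r| r.iter().copied()).collect()
}

fn main() {
    let your_library: &[&str] = &["std"];
    let downstream_app: &[&str] = &["std", "perf", "unicode"];

    // Built on its own, your library gets regex without Unicode support,
    let alone = unify(&[your_library]);
    // but inside the app, regex is built once with the union of all requests.
    let in_app = unify(&[your_library, downstream_app]);

    println!("alone:  {:?}", alone);
    println!("in app: {:?}", in_app);
}
```

Because the library cannot control which set it ends up with, any behavior that depends on a feature being absent would silently change between the two builds.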

huitseeker added a commit to huitseeker/diem that referenced this pull request Sep 6, 2019
See rust-lang/regex#613

as it turns out we never use regex in a Unicode context, trim its transitive dependencies
bors-libra pushed a commit to diem/diem that referenced this pull request Sep 6, 2019
See rust-lang/regex#613

as it turns out we never use regex in a Unicode context, trim its transitive dependencies

Closes: #871
Approved by: mimoo