expose new crate features for optionally shrinking regex #613
We'll remove 'use_std' in regex 2, but keep it around for backward compatibility. Fixes #474
This makes sure the generated tables are rustfmt'd.
This nominally moves the logic for acquiring Unicode-aware Perl character classes into the `unicode` module, and also makes the calling code robust with respect to failures. This commit is prep work for making the availability of Unicode-aware Perl classes optional.
This commit refactors the way this library handles Unicode data by making it completely optional. Several features are introduced which permit callers to select only the Unicode data they need (up to a point of granularity). An important property of these changes is that the presence or absence of crate features will never change the match semantics of a regular expression. Instead, the presence or absence of a crate feature can only add to or subtract from the set of all possible valid regular expressions. So for example, if the `unicode-case` feature is disabled, then attempting to produce `Hir` for the regex `(?i)a` will fail. Instead, callers must use `(?i-u)a` (or enable the `unicode-case` feature). This partially addresses #583 since it permits callers to decrease binary size.
We have a good thing going, so let's formalize it a bit.
This commit sets up the infrastructure for supporting various `unicode` and `perf` features, which permit decreasing binary size, compile times and the size of the dependency tree. Most of the work here is in modifying the regex tests to make them work in concert with the available Unicode features. In cases where Unicode is irrelevant, we just turn it off. In other cases, we require the Unicode features to run the tests. This also introduces a new error in the compiler whereby if a Unicode word boundary is used, but the `unicode-perl` feature is disabled, then the regex will fail to compile. (Because the necessary data to match Unicode word boundaries isn't available.)
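To illustrate the kind of check described above, here is a minimal sketch of rejecting a Unicode word boundary when the data for it was compiled out. The names `CompileError` and `check_word_boundary` are hypothetical and are not the crate's actual internals; only the `unicode-perl` feature name comes from this PR.

```rust
/// Hypothetical error type for this sketch (not the crate's real error enum).
#[derive(Debug, PartialEq)]
pub enum CompileError {
    UnicodeWordBoundaryUnavailable,
}

/// Hypothetical check a regex compiler might run when it sees `\b`.
/// `unicode` says whether the assertion is the Unicode-aware variant.
pub fn check_word_boundary(unicode: bool) -> Result<(), CompileError> {
    // `cfg!` expands to a compile-time constant: it is true only when the
    // crate is built with the `unicode-perl` feature enabled.
    if unicode && !cfg!(feature = "unicode-perl") {
        return Err(CompileError::UnicodeWordBoundaryUnavailable);
    }
    Ok(())
}
```

The point of failing loudly is that a missing feature shrinks the set of valid regexes rather than silently changing what `\b` matches. The ASCII-only form `(?-u:\b)` would take the `unicode == false` path and always be accepted.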
This makes all uses of `#[inline(always)]` conditional on the `perf-inline` feature. This should reduce compile times and binary size, but may decrease match performance.
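A feature-gated inline hint of this kind can be expressed with `cfg_attr`. The following is a small sketch: the function `is_word_byte` is a made-up example, and only the `perf-inline` feature name is taken from this PR.

```rust
/// Hypothetical hot-path helper used only for illustration.
/// With the `perf-inline` feature enabled this expands to
/// `#[inline(always)]`; without it, the attribute disappears and the
/// compiler's usual inlining heuristics apply.
#[cfg_attr(feature = "perf-inline", inline(always))]
pub fn is_word_byte(b: u8) -> bool {
    matches!(b, b'A'..=b'Z' | b'a'..=b'z' | b'0'..=b'9' | b'_')
}
```

The function behaves identically either way; only code generation (and therefore compile time, binary size and possibly speed) differs.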
This makes the thread_local (and by consequence, lazy_static) crates optional by providing a naive caching mechanism when perf-cache is disabled. This is achieved by defining a common API and implementing it via both approaches. The one tricky bit here is to ensure our naive version implements the same auto-traits as the fast version. Since we just use a plain mutex, it impls RefUnwindSafe, but thread_local does not. So we forcefully remove the RefUnwindSafe impl from our safe variant. We should be able to implement RefUnwindSafe in both cases, but this likely requires some mechanism for clearing the regex cache automatically if a panic occurs anywhere during search. But that's a more invasive change and is part of #576.
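As a rough sketch of what a mutex-based cache with the same shape could look like, consider the following. The type and method names (`Cached`, `CachedGuard`, `get_or`) are assumptions made for illustration and may not match the code in this PR. The `PhantomData` marker shows one stable way to opt a type out of `RefUnwindSafe`, since a trait object that names only `Send + Sync + UnwindSafe` does not implement it.

```rust
use std::marker::PhantomData;
use std::ops::{Deref, DerefMut};
use std::panic::UnwindSafe;
use std::sync::Mutex;

/// Hypothetical naive cache: a mutex-guarded stack of reusable values.
pub struct Cached<T: Send> {
    stack: Mutex<Vec<T>>,
    // A plain Mutex is RefUnwindSafe. This marker keeps the whole type
    // Send + Sync + UnwindSafe while removing RefUnwindSafe.
    _not_ref_unwind_safe: PhantomData<Box<dyn Send + Sync + UnwindSafe>>,
}

/// Guard that hands its value back to the cache when dropped.
pub struct CachedGuard<'a, T: Send> {
    cache: &'a Cached<T>,
    value: Option<T>,
}

impl<T: Send> Cached<T> {
    pub fn new() -> Cached<T> {
        Cached {
            stack: Mutex::new(Vec::new()),
            _not_ref_unwind_safe: PhantomData,
        }
    }

    /// Reuse a cached value if one exists, otherwise build one with `create`.
    pub fn get_or(&self, create: impl FnOnce() -> T) -> CachedGuard<'_, T> {
        // The lock is released at the end of this statement, so `create`
        // runs without holding the mutex.
        let popped = self.stack.lock().unwrap().pop();
        let value = match popped {
            Some(v) => v,
            None => create(),
        };
        CachedGuard { cache: self, value: Some(value) }
    }

    fn put(&self, value: T) {
        self.stack.lock().unwrap().push(value);
    }
}

impl<'a, T: Send> Deref for CachedGuard<'a, T> {
    type Target = T;
    fn deref(&self) -> &T {
        self.value.as_ref().unwrap()
    }
}

impl<'a, T: Send> DerefMut for CachedGuard<'a, T> {
    fn deref_mut(&mut self) -> &mut T {
        self.value.as_mut().unwrap()
    }
}

impl<'a, T: Send> Drop for CachedGuard<'a, T> {
    fn drop(&mut self) {
        if let Some(v) = self.value.take() {
            self.cache.put(v);
        }
    }
}
```

Every access goes through the mutex, so this trades contention under heavy multi-threaded use for having no extra dependencies.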
This commit enables support for the perf-literal feature. When it's disabled, no literal optimizations will be performed. Instead, only the regex engine itself is used. In practice, it's quite plausible that we don't need to disable *all* literal optimizations. But that is the simplest path here, and I don't have the stomach to do anything more with the current code. src/exec.rs has turned into a giant soup.
This commit adds support for the perf-dfa feature, which permits users of this crate to completely disable the lazy DFA. This should help decrease binary size and compilation times, although it will come at a significant cost to runtime performance.
This seems to save about 12KB on the final binary size. Benchmarks suggest that there is no meaningful runtime performance difference.
For those wondering, here are the exposed crate features:
Ecosystem features
Performance features
Unicode features
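As a hedged illustration of how a downstream crate might pick from these groups, here is a `Cargo.toml` fragment. The feature names `unicode-case` and `perf-dfa` appear in the commits above; the `std` entry is an assumption that the standard-library feature must stay enabled, and the particular combination is only an example.

```toml
[dependencies.regex]
version = "1"
# Turn off everything that is enabled by default, then opt back in.
default-features = false
# Assumption: `std` is required; the other two are picked for illustration.
features = ["std", "unicode-case", "perf-dfa"]
```

With a selection like this, a pattern that needs data from a disabled feature should fail to compile rather than match differently.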
This is incredible work, thank you so much!
No problem! Thanks for giving me the kick to do it. :-) This PR is now on crates.io in
It seems regex didn't build on docs.rs and I'm not sure why. I opened an issue: rust-lang/docs.rs#400
Now because of this change, I've got a request! Could we have case insensitive matching even with unicode disabled? Right now it still requires the
@jhpratt Could you please show your code? It should work just fine. i.e.,
Wasn't aware of the negative
Would it be reasonable to do the equivalent of prepending every regex with
Nope, it's not. That violates the property that the semantics of a regex are never changed based on which features are enabled. This property is important. Consider, for example, that you've written a library that depends on If you don't want to write
See rust-lang/regex#613 as it turns out we never use regex in a Unicode context, trim its transitive dependencies Closes: #871 Approved by: mimoo
This PR is primarily intended to close #583. However, an additional motivation to these changes was to permit users of `regex` to shrink its dependency tree, should they wish to give up runtime performance in exchange. While this may not sound like a great exchange, there exist many cases where high performance regex matching isn't actually required. For example, if one is using a regex to filter a small set of tiny ASCII strings, then it would be perfectly reasonable to disable all of regex's crate features. The end result of this is that it will substantially shrink binary size, improve compilation times and shrink the dependency tree of `regex` down to a single crate (`regex-syntax`).

As an example, if I compile the following program in release mode

and use `regex = "1"`, then the total stripped binary size is `1.5M`. Compare this with a baseline program

whose total stripped binary size is `203K`. Thus, the total overhead of `regex` is approximately `1.3M`. A large percentage of that overhead corresponds to Unicode tables. For example, if we compile the above regex program, but with Unicode tables disabled (and keeping performance oriented features enabled)

then the total binary size drops to `767K`, for a total overhead of about `560K`.

Finally, disabling all possible features

results in a binary size of `535K`, for a total overhead of about `332K`.

You can shrink the binary size even more (by incurring more compilation time) with the following settings:

This results in a baseline (hello world above) binary size of `191K`, and a binary size of `367K` for `regex` for a total overhead of `176K`. This isn't quite the target of `50K` desired by @cramertj, but it does correspond to about an order of magnitude improvement over the status quo.

Another great benefit to trimming all this stuff is that release mode compilation times drop by a factor of 2 on my machine:

Debug mode compilation also gets a nice ~1.5x speed-up: