Add rure, a C API.
This commit contains a new sub-crate called `regex-capi` which provides
a C library called `rure`.

A new `RegexBuilder` type was also added to the Rust API proper, which
permits both users of C and Rust to tweak various knobs on a `Regex`.
This fixes issue #166.
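
As an illustration, usage of the builder from Rust looks roughly like the
sketch below. The knob names (`case_insensitive`, `size_limit`, `compile`)
are assumptions based on the builder pattern; consult the API docs for the
authoritative set.

```
extern crate regex;

use regex::RegexBuilder;

fn main() {
    // Sketch: tweak a couple of knobs, then compile. Method names are
    // illustrative; later versions may call the final step `build`.
    let re = RegexBuilder::new(r"foo\w+")
        .case_insensitive(true)
        .size_limit(1 << 20)
        .compile()
        .unwrap();
    assert!(re.is_match("FOOBAR"));
}
```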

Since it's likely that this API will be used to provide bindings to
other languages, I've created bindings to Go as a proof of concept:
https://github.com/BurntSushi/rure-go --- to my knowledge, the wrapper
has as little overhead as possible. In particular, it was important for
the C library not to store any pointers provided by the caller, since that
can be problematic in languages with managed runtimes and a moving GC.

The C API doesn't expose `RegexSet` or a few other convenience functions,
such as splitting or replacing. Those are left as future work.

Note that the regex-capi crate requires Rust 1.9, since it uses
`panic::catch_unwind`.

This also includes tests of basic API functionality and a commented
example. Both should now run as part of CI.
BurntSushi committed Apr 29, 2016
1 parent 3f408e5 commit 97d374b
Showing 26 changed files with 15,141 additions and 265 deletions.
11 changes: 7 additions & 4 deletions .travis.yml
@@ -18,10 +18,13 @@ script:
- cargo test --verbose --manifest-path=regex-syntax/Cargo.toml
- cargo doc --verbose --manifest-path=regex-syntax/Cargo.toml
- if [ "$TRAVIS_RUST_VERSION" = "nightly" ]; then
travis_wait ./run-bench rust;
travis_wait ./run-bench rust-bytes --no-run;
travis_wait ./run-bench pcre1 --no-run;
travis_wait ./run-bench onig --no-run;
(cd regex-capi && cargo build --verbose);
(cd regex-capi/ctest && ./compile && LD_LIBRARY_PATH=../target/debug ./test);
(cd regex-capi/examples && ./compile && LD_LIBRARY_PATH=../target/release ./iter);
(cd bench && travis_wait ./run rust);
(cd bench && travis_wait ./run rust-bytes --no-run);
(cd bench && travis_wait ./run pcre1 --no-run);
(cd bench && travis_wait ./run onig --no-run);
travis_wait cargo test --verbose --manifest-path=regex_macros/Cargo.toml;
fi
addons:
48 changes: 24 additions & 24 deletions PERFORMANCE.md
@@ -26,11 +26,13 @@ The promise of this crate is that *this pathological behavior can't happen*.
With that said, just because we have protected ourselves against worst case
exponential behavior doesn't mean we are immune from large constant factors
or places where the current regex engine isn't quite optimal. This guide will
detail those cases, among other general advice, and give advice on how to avoid
them.
detail those cases and provide guidance on how to avoid them, among other
bits of general advice.

## Thou Shalt Not Compile Regular Expressions In A Loop

**Advice**: Use `lazy_static` to amortize the cost of `Regex` compilation.

Don't do it unless you really don't mind paying for it. Compiling a regular
expression in this crate is quite expensive. It is conceivable that it may get
faster some day, but I wouldn't hold out hope for, say, an order of magnitude
@@ -48,7 +50,7 @@ This means that in order to realize efficient regex matching, one must
inside a loop, then make sure your call to `Regex::new` is *outside* that loop.

In many programming languages, regular expressions can be conveniently defined
and "compiled" in a global scope, and code can reach out and use them as if
and compiled in a global scope, and code can reach out and use them as if
they were global static variables. In Rust, there is really no concept of
life-before-main, and therefore, one cannot utter this:

@@ -80,10 +82,14 @@ it's self-contained and everything works exactly as you expect. In particular,
`MY_REGEX` can be used from multiple threads without wrapping it in an `Arc` or
a `Mutex`. On that note...

**Advice**: Use `lazy_static` to amortize the cost of `Regex` compilation.
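
A minimal sketch of that pattern (the pattern string here is a placeholder):

```
#[macro_use]
extern crate lazy_static;
extern crate regex;

use regex::Regex;

lazy_static! {
    // Compiled exactly once, on first use, then shared everywhere.
    static ref MY_REGEX: Regex = Regex::new(r"\d{4}-\d{2}-\d{2}").unwrap();
}

fn main() {
    assert!(MY_REGEX.is_match("2016-04-29"));
}
```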

## Using a regex from multiple threads

**Advice**: The performance impact from using a `Regex` from multiple threads
is likely negligible. If necessary, clone the `Regex` so that each thread gets
its own copy. Cloning a regex does not incur any memory overhead beyond
what would already be used by sharing a single `Regex` across multiple threads
simultaneously. *Its only cost is ergonomics.*

It is supported and encouraged to define your regexes using `lazy_static!` as
if they were global static values, and then use them to search text from
multiple threads simultaneously.
@@ -126,14 +132,10 @@ Then you may not suffer from contention since the cost of synchronization is
amortized on *construction of the iterator*. That is, the mutable scratch space
is obtained when the iterator is created and retained throughout its lifetime.

**Advice**: The performance impact from using a `Regex` from multiple threads
is likely negligible. If necessary, clone the `Regex` so that each thread gets
its own copy. Cloning a regex does not incur any memory overhead beyond
what would already be used by sharing a single `Regex` across multiple threads
simultaneously. *Its only cost is ergonomics.*
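
For example, a sketch of handing each thread its own clone:

```
extern crate regex;

use std::thread;
use regex::Regex;

fn main() {
    let re = Regex::new(r"\bfoo\b").unwrap();
    let handles: Vec<_> = (0..4).map(|i| {
        // Cheap clone: per the guide, it adds no memory overhead beyond
        // sharing a single Regex across threads.
        let re = re.clone();
        thread::spawn(move || re.is_match(&format!("worker {} saw foo", i)))
    }).collect();
    for h in handles {
        assert!(h.join().unwrap());
    }
}
```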

## Only ask for what you need

**Advice**: Prefer in this order: `is_match`, `find`, `captures`.

There are three primary search methods on a `Regex`:

* is_match
@@ -166,10 +168,11 @@ end location at which it discovered a match. For example, given the regex `a+`
and the haystack `aaaaa`, `shortest_match` may return `1` as opposed to `5`,
the latter of which being the correct end location of the leftmost-first match.

**Advice**: Prefer in this order: `is_match`, `find`, `captures`.
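
A sketch of the trade-off, using the 0.1-era API in which `find` returns a
`(start, end)` tuple:

```
extern crate regex;

use regex::Regex;

fn main() {
    let re = Regex::new(r"(\w+)@(\w+)").unwrap();
    let text = "send mail to user@example";

    // Cheapest: a yes/no answer, no match locations needed.
    assert!(re.is_match(text));

    // More work: the location of the leftmost match.
    let (s, e) = re.find(text).unwrap();
    assert_eq!(&text[s..e], "user@example");

    // Most work: locations for every capture group.
    let caps = re.captures(text).unwrap();
    assert_eq!(caps.at(1), Some("user"));
}
```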

## Literals in your regex may make it faster

**Advice**: Literals can reduce the work that the regex engine needs to do. Use
them if you can, especially as prefixes.

In particular, if your regex starts with a prefix literal, the prefix is
quickly searched before entering the (much slower) regex engine. For example,
given the regex `foo\w+`, the literal `foo` will be searched for using
@@ -197,11 +200,11 @@ Literals in anchored regexes can also be used for detecting non-matches very
quickly. For example, `^foo\w+` and `\w+foo$` may be able to detect a non-match
just by examining the first (or last) three bytes of the haystack.

**Advice**: Literals can reduce the work that the regex engine needs to do. Use
them if you can, especially as prefixes.
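
For instance, a sketch contrasting a pattern with a literal prefix against one
with no required literal (the speed difference is the guide's claim; both
simply match here):

```
extern crate regex;

use regex::Regex;

fn main() {
    // "foo" is a required prefix, so a fast substring search can skip
    // ahead to candidate positions before the regex engine runs.
    let with_prefix = Regex::new(r"foo\w+").unwrap();

    // No required literal anywhere: the (much slower) regex engine must
    // consider far more starting positions itself.
    let without_prefix = Regex::new(r"\w+").unwrap();

    let haystack = "nothing to see here except one foobar";
    assert!(with_prefix.is_match(haystack));
    assert!(without_prefix.is_match(haystack));
}
```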

## Unicode word boundaries may prevent the DFA from being used

**Advice**: In most cases, `\b` should work well. If not, use `(?-u:\b)`
instead of `\b` if you care about consistent performance more than correctness.

It's a sad state of the current implementation. At the moment, the DFA will try
to interpret Unicode word boundaries as if they were ASCII word boundaries.
If the DFA comes across any non-ASCII byte, it will quit and fall back to an
@@ -233,11 +236,11 @@ more consistent performance.
N.B. When using `bytes::Regex`, Unicode support is disabled by default, so one
can simply write `\b` to get an ASCII word boundary.

**Advice**: In most cases, `\b` should work well. If not, use `(?-u:\b)`
instead of `\b` if you care about consistent performance more than correctness.
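
A sketch of the trade-off:

```
extern crate regex;

use regex::Regex;

fn main() {
    // Unicode-aware \b: always correct, but any non-ASCII byte in the
    // haystack can force a fallback to a slower engine.
    let unicode_wb = Regex::new(r"\bfoo\b").unwrap();

    // ASCII-only word boundary: keeps the fast DFA usable, at the cost
    // of correctness around non-ASCII word characters.
    let ascii_wb = Regex::new(r"(?-u:\b)foo(?-u:\b)").unwrap();

    assert!(unicode_wb.is_match("a foo b"));
    assert!(ascii_wb.is_match("a foo b"));
}
```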

## Excessive counting can lead to exponential state blow up in the DFA

**Advice**: Don't write regexes that cause DFA state blow up if you care about
match performance.

Wait, didn't I say that this crate guards against exponential worst cases?
Well, it turns out that the process of converting an NFA to a DFA can lead to
an exponential blow up in the number of states. This crate specifically guards
@@ -266,14 +269,11 @@ In the future, it may be possible to increase the bound that the DFA uses,
which would allow the caller to choose how much memory they're willing to
spend.

**Advice**: Don't write regexes that cause DFA state blow up if you care about
match performance.
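
A classic illustration of the textbook NFA-to-DFA blow-up (this crate survives
it by capping the lazy DFA's memory and falling back to a slower engine, but
match performance suffers):

```
extern crate regex;

use regex::Regex;

fn main() {
    // A DFA for this pattern must remember the last 21 bits it has seen,
    // which needs on the order of 2^20 states.
    let re = Regex::new(r"[01]*1[01]{20}").unwrap();
    assert!(re.is_match("111111111111111111111111111111"));
}
```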

## Resist the temptation to "optimize" regexes

**Advice**: This ain't a backtracking engine.

An entire book was written on how to optimize Perl-style regular expressions.
Most of those techniques are not applicable for this library. For example,
there is no problem with using non-greedy matching or having lots of
alternations in your regex.

**Advice**: This ain't a backtracking engine.
22 changes: 15 additions & 7 deletions bench/src/ffi/re2.rs
@@ -84,7 +84,10 @@ impl<'r, 't> Iterator for FindMatches<'r, 't> {

fn next(&mut self) -> Option<(usize, usize)> {
fn next_after_empty(text: &str, i: usize) -> usize {
let b = text.as_bytes()[i];
let b = match text.as_bytes().get(i) {
None => return text.len() + 1,
Some(&b) => b,
};
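// The leading byte of a UTF-8 sequence encodes its total length; use it
// to advance past a whole codepoint instead of landing mid-sequence.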
let inc = if b <= 0x7F {
1
} else if b <= 0b110_11111 {
@@ -105,14 +108,19 @@ impl<'r, 't> Iterator for FindMatches<'r, 't> {
Some((s, e)) => (s, e),
};
assert!(s >= self.last_end);
if e == s && Some(self.last_end) == self.last_match {
if self.last_end >= self.text.len() {
return None;
if s == e {
// This is an empty match. To ensure we make progress, start
// the next search at the smallest possible starting position
// of the next match following this one.
self.last_end = next_after_empty(&self.text, e);
// Don't accept empty matches immediately following a match.
// Just move on to the next match.
if Some(e) == self.last_match {
return self.next();
}
self.last_end = next_after_empty(self.text, self.last_end);
return self.next();
} else {
self.last_end = e;
}
self.last_end = e;
self.last_match = Some(self.last_end);
Some((s, e))
}
20 changes: 20 additions & 0 deletions regex-capi/Cargo.toml
@@ -0,0 +1,20 @@
[package]
name = "rure"
version = "0.1.0" #:version
authors = ["The Rust Project Developers"]
license = "MIT/Apache-2.0"
readme = "README.md"
repository = "https://github.com/rust-lang/regex"
documentation = "https://doc.rust-lang.org/regex"
homepage = "https://github.com/rust-lang/regex"
description = """
A C API for Rust's regular expression library.
"""

[lib]
name = "rure"
crate-type = ["staticlib", "dylib"]

[dependencies]
libc = "0.2"
regex = { version = "0.1", path = ".." }
105 changes: 105 additions & 0 deletions regex-capi/README.md
@@ -0,0 +1,105 @@
RUst's REgex engine
===================
rure is a C API to Rust's regex library, which guarantees linear time
searching using finite automata. In exchange, it must give up some common
regex features such as backreferences and arbitrary lookaround. It does
however include capturing groups, lazy matching, Unicode support and word
boundary assertions. Its matching semantics generally correspond to Perl's,
or "leftmost first." Namely, the match locations reported correspond to the
first match that would be found by a backtracking engine.
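
A concrete sketch of leftmost-first semantics, shown from the Rust side (the
0.1-era `find` returns a `(start, end)` tuple):

```
extern crate regex;

use regex::Regex;

fn main() {
    // The alternation prefers its first branch, so the reported match is
    // "foo" even though "foobar" also matches at the same position.
    let re = Regex::new(r"foo|foobar").unwrap();
    assert_eq!(re.find("foobar"), Some((0, 3)));
}
```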

The header file (`include/rure.h`) serves as the primary API documentation of
this library. Types and flags are documented first, and functions follow.

The syntax and possibly other useful things are documented in the Rust
API documentation: http://doc.rust-lang.org/regex/regex/index.html


Examples
--------
There are readable examples in the `ctest` and `examples` sub-directories.

Assuming you have
[Rust and Cargo installed](https://www.rust-lang.org/downloads.html)
(and a C compiler), then this should work to run the `iter` example:

```
$ git clone git://github.com/rust-lang-nursery/regex
$ cd regex/regex-capi/examples
$ ./compile
$ LD_LIBRARY_PATH=../target/release ./iter
```


Performance
-----------
It's fast. Its core matching engine is a lazy DFA, which is what GNU grep
and RE2 use. Like GNU grep, this regex engine can detect multi byte literals
in the regex and will use fast literal string searching to quickly skip
through the input to find possible match locations.

All memory usage is bounded and all searching takes linear time with respect
to the input string.

For more details, see the PERFORMANCE guide:
https://github.com/rust-lang-nursery/regex/blob/master/PERFORMANCE.md


Text encoding
-------------
All regular expressions must be valid UTF-8.

The text encoding of haystacks is more complicated. To a first
approximation, haystacks should be UTF-8. In fact, UTF-8 (and, one
supposes, ASCII) is the only well-defined text encoding supported by this
library. It is impossible to match UTF-16, UTF-32 or any other encoding
without first transcoding it to UTF-8.

With that said, haystacks do not need to be valid UTF-8, and if they aren't
valid UTF-8, no performance penalty is paid. Whether invalid UTF-8 is
matched or not depends on the regular expression. For example, with the
`RURE_FLAG_UNICODE` flag enabled, the regex `.` is guaranteed to match a
single UTF-8 encoding of a Unicode codepoint (sans LF). In particular,
it will not match invalid UTF-8 such as `\xFF`, nor will it match surrogate
codepoints or "alternate" (i.e., non-minimal) encodings of codepoints.
However, with the `RURE_FLAG_UNICODE` flag disabled, the regex `.` will match
any *single* arbitrary byte (sans LF), including `\xFF`.

This provides a useful invariant: wherever `RURE_FLAG_UNICODE` is set, the
corresponding regex is guaranteed to match valid UTF-8. Invalid UTF-8 will
always prevent a match from happening when the flag is set. Since flags can be
toggled in the regular expression itself, this allows one to pick and choose
which parts of the regular expression must match UTF-8 or not.

Some good advice is to always enable the `RURE_FLAG_UNICODE` flag (which is
enabled when using `rure_compile_must`) and selectively disable the flag when
one wants to match arbitrary bytes. The flag can be disabled in a regular
expression with `(?-u)`.

Finally, if one wants to match specific invalid UTF-8 bytes, then you can
use escape sequences. e.g., `(?-u)\\xFF` will match `\xFF`. It's not
possible to use C literal escape sequences in this case since regular
expressions must be valid UTF-8.
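
The same rules are visible from the Rust side via `regex::bytes::Regex`; a
small sketch (not part of the C API itself):

```
extern crate regex;

use regex::bytes::Regex;

fn main() {
    // With Unicode disabled, "." matches any single byte except LF,
    // including bytes that are not valid UTF-8.
    let any_byte = Regex::new(r"(?-u).").unwrap();
    assert!(any_byte.is_match(b"\xFF"));

    // A specific invalid UTF-8 byte must be written as an escape, since
    // the pattern itself must still be valid UTF-8.
    let ff = Regex::new(r"(?-u)\xFF").unwrap();
    assert!(ff.is_match(b"\xFF"));
}
```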


Aborts
------
This library will abort your process if an unwinding panic is caught in the
Rust code. Generally, a panic occurs when there is a bug in the program or
if allocation failed. It is possible to cause this behavior by passing
invalid inputs to some functions. For example, giving an invalid capture
group index to `rure_captures_at` will cause Rust's bounds checks to fail,
which will cause a panic, which will be caught and printed to stderr. The
process will then `abort`.
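
A minimal sketch of that guard pattern (a hypothetical standalone helper, not
the crate's actual code), using only `std::panic::catch_unwind` and
`std::process::abort`:

```
use std::panic;
use std::process;

// Run f, and if it panics, print a message and abort rather than let the
// unwind cross an FFI boundary into C.
fn ffi_guard<T, F>(f: F) -> T
    where F: FnOnce() -> T + panic::UnwindSafe
{
    match panic::catch_unwind(f) {
        Ok(v) => v,
        Err(_) => {
            eprintln!("rure: panic caught at FFI boundary; aborting");
            process::abort()
        }
    }
}

fn main() {
    let v: Vec<u8> = vec![1, 2, 3];
    // An out-of-bounds index panics; the guard turns that into an abort.
    let _ = ffi_guard(move || v[10]);
}
```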


Missing
-------
There are a few things missing from the C API that are present in the Rust API.
There's no particular (known) reason for their absence; they just haven't
been implemented yet.

* RegexSet, which permits matching multiple regular expressions simultaneously
in a single linear time search.
* Splitting a string by a regex.
* Replacing regex matches in a string with some other text.
1 change: 1 addition & 0 deletions regex-capi/ctest/.gitignore
@@ -0,0 +1 @@
test
8 changes: 8 additions & 0 deletions regex-capi/ctest/compile
@@ -0,0 +1,8 @@
#!/bin/sh

set -ex

cargo build --manifest-path ../Cargo.toml
gcc -DDEBUG -o test test.c -ansi -Wall -I../include -L../target/debug -lrure
# If you're using librure.a, then you'll need to link other stuff:
# -lutil -ldl -lpthread -lgcc_s -lc -lm -lrt -lutil -lrure