Add rure, a C API.
This commit contains a new sub-crate called `regex-capi` which provides
a C library called `rure`.

A new `RegexBuilder` type was also added to the Rust API proper, which
permits both users of C and Rust to tweak various knobs on a `Regex`.
This fixes issue #166.
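
As an illustration, usage of the builder from Rust looks roughly like the
sketch below. The knob names (`case_insensitive`, `size_limit`, `compile`)
are assumptions based on the builder pattern; consult the API docs for the
authoritative set.

```
extern crate regex;

use regex::RegexBuilder;

fn main() {
    // Sketch: tweak a couple of knobs, then compile. Method names are
    // illustrative; later versions may call the final step `build`.
    let re = RegexBuilder::new(r"foo\w+")
        .case_insensitive(true)
        .size_limit(1 << 20)
        .compile()
        .unwrap();
    assert!(re.is_match("FOOBAR"));
}
```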

Since it's likely that this API will be used to provide bindings to
other languages, I've created bindings to Go as a proof of concept:
https://github.com/BurntSushi/rure-go --- to my knowledge, the wrapper
has as little overhead as possible. In particular, it was important for
the C library not to store any pointers provided by the caller, since that
can be problematic in languages with managed runtimes and a moving GC.

The C API doesn't expose `RegexSet` or a few other convenience functions,
such as splitting or replacing. Those are left as future work.

Note that the regex-capi crate requires Rust 1.9, since it uses
`panic::catch_unwind`.

This also includes tests of basic API functionality and a commented
example. Both should now run as part of CI.
BurntSushi committed Apr 29, 2016
1 parent 3f408e5 commit 97d374b
Showing 26 changed files with 15,141 additions and 265 deletions.
11 changes: 7 additions & 4 deletions .travis.yml
@@ -18,10 +18,13 @@ script:
- cargo test --verbose --manifest-path=regex-syntax/Cargo.toml
- cargo doc --verbose --manifest-path=regex-syntax/Cargo.toml
- if [ "$TRAVIS_RUST_VERSION" = "nightly" ]; then
travis_wait ./run-bench rust;
travis_wait ./run-bench rust-bytes --no-run;
travis_wait ./run-bench pcre1 --no-run;
travis_wait ./run-bench onig --no-run;
(cd regex-capi && cargo build --verbose);
(cd regex-capi/ctest && ./compile && LD_LIBRARY_PATH=../target/debug ./test);
(cd regex-capi/examples && ./compile && LD_LIBRARY_PATH=../target/release ./iter);
(cd bench && travis_wait ./run rust);
(cd bench && travis_wait ./run rust-bytes --no-run);
(cd bench && travis_wait ./run pcre1 --no-run);
(cd bench && travis_wait ./run onig --no-run);
travis_wait cargo test --verbose --manifest-path=regex_macros/Cargo.toml;
fi
addons:
48 changes: 24 additions & 24 deletions PERFORMANCE.md
@@ -26,11 +26,13 @@ The promise of this crate is that *this pathological behavior can't happen*.
With that said, just because we have protected ourselves against worst case
exponential behavior doesn't mean we are immune from large constant factors
or places where the current regex engine isn't quite optimal. This guide will
detail those cases, among other general advice, and give advice on how to avoid
them.
detail those cases and provide guidance on how to avoid them, among other
bits of general advice.

## Thou Shalt Not Compile Regular Expressions In A Loop

**Advice**: Use `lazy_static` to amortize the cost of `Regex` compilation.

Don't do it unless you really don't mind paying for it. Compiling a regular
expression in this crate is quite expensive. It is conceivable that it may get
faster some day, but I wouldn't hold out hope for, say, an order of magnitude
@@ -48,7 +50,7 @@ This means that in order to realize efficient regex matching, one must
inside a loop, then make sure your call to `Regex::new` is *outside* that loop.

In many programming languages, regular expressions can be conveniently defined
and "compiled" in a global scope, and code can reach out and use them as if
and compiled in a global scope, and code can reach out and use them as if
they were global static variables. In Rust, there is really no concept of
life-before-main, and therefore, one cannot utter this:

@@ -80,10 +82,14 @@ it's self-contained and everything works exactly as you expect. In particular,
`MY_REGEX` can be used from multiple threads without wrapping it in an `Arc` or
a `Mutex`. On that note...

**Advice**: Use `lazy_static` to amortize the cost of `Regex` compilation.
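
A minimal sketch of that pattern (the pattern string here is a placeholder):

```
#[macro_use]
extern crate lazy_static;
extern crate regex;

use regex::Regex;

lazy_static! {
    // Compiled exactly once, on first use, then shared everywhere.
    static ref MY_REGEX: Regex = Regex::new(r"\d{4}-\d{2}-\d{2}").unwrap();
}

fn main() {
    assert!(MY_REGEX.is_match("2016-04-29"));
}
```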

## Using a regex from multiple threads

**Advice**: The performance impact from using a `Regex` from multiple threads
is likely negligible. If necessary, clone the `Regex` so that each thread gets
its own copy. Cloning a regex does not incur any memory overhead beyond
what would already be used by sharing a single `Regex` across multiple threads
simultaneously. *Its only cost is ergonomics.*

It is supported and encouraged to define your regexes using `lazy_static!` as
if they were global static values, and then use them to search text from
multiple threads simultaneously.
@@ -126,14 +132,10 @@ Then you may not suffer from contention since the cost of synchronization is
amortized on *construction of the iterator*. That is, the mutable scratch space
is obtained when the iterator is created and retained throughout its lifetime.

**Advice**: The performance impact from using a `Regex` from multiple threads
is likely negligible. If necessary, clone the `Regex` so that each thread gets
its own copy. Cloning a regex does not incur any memory overhead beyond
what would already be used by sharing a single `Regex` across multiple threads
simultaneously. *Its only cost is ergonomics.*
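
For example, a sketch of handing each thread its own clone:

```
extern crate regex;

use std::thread;
use regex::Regex;

fn main() {
    let re = Regex::new(r"\bfoo\b").unwrap();
    let handles: Vec<_> = (0..4).map(|i| {
        // Cheap clone: per the guide, it adds no memory overhead beyond
        // sharing a single Regex across threads.
        let re = re.clone();
        thread::spawn(move || re.is_match(&format!("worker {} saw foo", i)))
    }).collect();
    for h in handles {
        assert!(h.join().unwrap());
    }
}
```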

## Only ask for what you need

**Advice**: Prefer in this order: `is_match`, `find`, `captures`.

There are three primary search methods on a `Regex`:

* is_match
@@ -166,10 +168,11 @@ end location at which it discovered a match. For example, given the regex `a+`
and the haystack `aaaaa`, `shortest_match` may return `1` as opposed to `5`,
the latter of which being the correct end location of the leftmost-first match.

**Advice**: Prefer in this order: `is_match`, `find`, `captures`.
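
A sketch of the trade-off, using the 0.1-era API in which `find` returns a
`(start, end)` tuple:

```
extern crate regex;

use regex::Regex;

fn main() {
    let re = Regex::new(r"(\w+)@(\w+)").unwrap();
    let text = "send mail to user@example";

    // Cheapest: a yes/no answer, no match locations needed.
    assert!(re.is_match(text));

    // More work: the location of the leftmost match.
    let (s, e) = re.find(text).unwrap();
    assert_eq!(&text[s..e], "user@example");

    // Most work: locations for every capture group.
    let caps = re.captures(text).unwrap();
    assert_eq!(caps.at(1), Some("user"));
}
```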

## Literals in your regex may make it faster

**Advice**: Literals can reduce the work that the regex engine needs to do. Use
them if you can, especially as prefixes.

In particular, if your regex starts with a prefix literal, the prefix is
quickly searched before entering the (much slower) regex engine. For example,
given the regex `foo\w+`, the literal `foo` will be searched for using
@@ -197,11 +200,11 @@ Literals in anchored regexes can also be used for detecting non-matches very
quickly. For example, `^foo\w+` and `\w+foo$` may be able to detect a non-match
just by examining the first (or last) three bytes of the haystack.

**Advice**: Literals can reduce the work that the regex engine needs to do. Use
them if you can, especially as prefixes.
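
For instance, a sketch contrasting a pattern with a literal prefix against one
with no required literal (the speed difference is the guide's claim; both
simply match here):

```
extern crate regex;

use regex::Regex;

fn main() {
    // "foo" is a required prefix, so a fast substring search can skip
    // ahead to candidate positions before the regex engine runs.
    let with_prefix = Regex::new(r"foo\w+").unwrap();

    // No required literal anywhere: the (much slower) regex engine must
    // consider far more starting positions itself.
    let without_prefix = Regex::new(r"\w+").unwrap();

    let haystack = "nothing to see here except one foobar";
    assert!(with_prefix.is_match(haystack));
    assert!(without_prefix.is_match(haystack));
}
```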

## Unicode word boundaries may prevent the DFA from being used

**Advice**: In most cases, `\b` should work well. If not, use `(?-u:\b)`
instead of `\b` if you care about consistent performance more than correctness.

It's a sad state of the current implementation. At the moment, the DFA will try
to interpret Unicode word boundaries as if they were ASCII word boundaries.
If the DFA comes across any non-ASCII byte, it will quit and fall back to an
@@ -233,11 +236,11 @@ more consistent performance.
N.B. When using `bytes::Regex`, Unicode support is disabled by default, so one
can simply write `\b` to get an ASCII word boundary.

**Advice**: In most cases, `\b` should work well. If not, use `(?-u:\b)`
instead of `\b` if you care about consistent performance more than correctness.
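
A sketch of the trade-off:

```
extern crate regex;

use regex::Regex;

fn main() {
    // Unicode-aware \b: always correct, but any non-ASCII byte in the
    // haystack can force a fallback to a slower engine.
    let unicode_wb = Regex::new(r"\bfoo\b").unwrap();

    // ASCII-only word boundary: keeps the fast DFA usable, at the cost
    // of correctness around non-ASCII word characters.
    let ascii_wb = Regex::new(r"(?-u:\b)foo(?-u:\b)").unwrap();

    assert!(unicode_wb.is_match("a foo b"));
    assert!(ascii_wb.is_match("a foo b"));
}
```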

## Excessive counting can lead to exponential state blow up in the DFA

**Advice**: Don't write regexes that cause DFA state blow up if you care about
match performance.

Wait, didn't I say that this crate guards against exponential worst cases?
Well, it turns out that the process of converting an NFA to a DFA can lead to
an exponential blow up in the number of states. This crate specifically guards
@@ -266,14 +269,11 @@ In the future, it may be possible to increase the bound that the DFA uses,
which would allow the caller to choose how much memory they're willing to
spend.

**Advice**: Don't write regexes that cause DFA state blow up if you care about
match performance.
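
A classic illustration of the textbook NFA-to-DFA blow-up (this crate survives
it by capping the lazy DFA's memory and falling back to a slower engine, but
match performance suffers):

```
extern crate regex;

use regex::Regex;

fn main() {
    // A DFA for this pattern must remember the last 21 bits it has seen,
    // which needs on the order of 2^20 states.
    let re = Regex::new(r"[01]*1[01]{20}").unwrap();
    assert!(re.is_match("111111111111111111111111111111"));
}
```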

## Resist the temptation to "optimize" regexes

**Advice**: This ain't a backtracking engine.

An entire book was written on how to optimize Perl-style regular expressions.
Most of those techniques are not applicable for this library. For example,
there is no problem with using non-greedy matching or having lots of
alternations in your regex.

**Advice**: This ain't a backtracking engine.
22 changes: 15 additions & 7 deletions bench/src/ffi/re2.rs
@@ -84,7 +84,10 @@ impl<'r, 't> Iterator for FindMatches<'r, 't> {

fn next(&mut self) -> Option<(usize, usize)> {
fn next_after_empty(text: &str, i: usize) -> usize {
let b = text.as_bytes()[i];
let b = match text.as_bytes().get(i) {
None => return text.len() + 1,
Some(&b) => b,
};
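// The leading byte of a UTF-8 sequence encodes its total length; use it
// to advance past a whole codepoint instead of landing mid-sequence.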
let inc = if b <= 0x7F {
1
} else if b <= 0b110_11111 {
@@ -105,14 +108,19 @@ impl<'r, 't> Iterator for FindMatches<'r, 't> {
Some((s, e)) => (s, e),
};
assert!(s >= self.last_end);
if e == s && Some(self.last_end) == self.last_match {
if self.last_end >= self.text.len() {
return None;
if s == e {
// This is an empty match. To ensure we make progress, start
// the next search at the smallest possible starting position
// of the next match following this one.
self.last_end = next_after_empty(&self.text, e);
// Don't accept empty matches immediately following a match.
// Just move on to the next match.
if Some(e) == self.last_match {
return self.next();
}
self.last_end = next_after_empty(self.text, self.last_end);
return self.next();
} else {
self.last_end = e;
}
self.last_end = e;
self.last_match = Some(self.last_end);
Some((s, e))
}
20 changes: 20 additions & 0 deletions regex-capi/Cargo.toml
@@ -0,0 +1,20 @@
[package]
name = "rure"
version = "0.1.0" #:version
authors = ["The Rust Project Developers"]
license = "MIT/Apache-2.0"
readme = "README.md"
repository = "https://github.com/rust-lang/regex"
documentation = "https://doc.rust-lang.org/regex"
homepage = "https://github.com/rust-lang/regex"
description = """
A C API for Rust's regular expression library.
"""

[lib]
name = "rure"
crate-type = ["staticlib", "dylib"]

[dependencies]
libc = "0.2"
regex = { version = "0.1", path = ".." }
105 changes: 105 additions & 0 deletions regex-capi/README.md
@@ -0,0 +1,105 @@
RUst's REgex engine
===================
rure is a C API to Rust's regex library, which guarantees linear time
searching using finite automata. In exchange, it must give up some common
regex features such as backreferences and arbitrary lookaround. It does
however include capturing groups, lazy matching, Unicode support and word
boundary assertions. Its matching semantics generally correspond to Perl's,
or "leftmost first." Namely, the match locations reported correspond to the
first match that would be found by a backtracking engine.
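
A concrete sketch of leftmost-first semantics, shown from the Rust side (the
0.1-era `find` returns a `(start, end)` tuple):

```
extern crate regex;

use regex::Regex;

fn main() {
    // The alternation prefers its first branch, so the reported match is
    // "foo" even though "foobar" also matches at the same position.
    let re = Regex::new(r"foo|foobar").unwrap();
    assert_eq!(re.find("foobar"), Some((0, 3)));
}
```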

The header file (`include/rure.h`) serves as the primary API documentation of
this library. Types and flags are documented first, and functions follow.

The syntax and possibly other useful things are documented in the Rust
API documentation: http://doc.rust-lang.org/regex/regex/index.html


Examples
--------
There are readable examples in the `ctest` and `examples` sub-directories.

Assuming you have
[Rust and Cargo installed](https://www.rust-lang.org/downloads.html)
(and a C compiler), then this should work to run the `iter` example:

```
$ git clone git://github.com/rust-lang-nursery/regex
$ cd regex/regex-capi/examples
$ ./compile
$ LD_LIBRARY_PATH=../target/release ./iter
```


Performance
-----------
It's fast. Its core matching engine is a lazy DFA, which is what GNU grep
and RE2 use. Like GNU grep, this regex engine can detect multi byte literals
in the regex and will use fast literal string searching to quickly skip
through the input to find possible match locations.

All memory usage is bounded and all searching takes linear time with respect
to the input string.

For more details, see the PERFORMANCE guide:
https://github.com/rust-lang-nursery/regex/blob/master/PERFORMANCE.md


Text encoding
-------------
All regular expressions must be valid UTF-8.

The text encoding of haystacks is more complicated. To a first
approximation, haystacks should be UTF-8. In fact, UTF-8 (and, one
supposes, ASCII) is the only well-defined text encoding supported by this
library. It is impossible to match UTF-16, UTF-32 or any other encoding
without first transcoding it to UTF-8.

With that said, haystacks do not need to be valid UTF-8, and if they aren't
valid UTF-8, no performance penalty is paid. Whether invalid UTF-8 is
matched or not depends on the regular expression. For example, with the
`RURE_FLAG_UNICODE` flag enabled, the regex `.` is guaranteed to match a
single UTF-8 encoding of a Unicode codepoint (sans LF). In particular,
it will not match invalid UTF-8 such as `\xFF`, nor will it match surrogate
codepoints or "alternate" (i.e., non-minimal) encodings of codepoints.
However, with the `RURE_FLAG_UNICODE` flag disabled, the regex `.` will match
any *single* arbitrary byte (sans LF), including `\xFF`.

This provides a useful invariant: wherever `RURE_FLAG_UNICODE` is set, the
corresponding regex is guaranteed to match valid UTF-8. Invalid UTF-8 will
always prevent a match from happening when the flag is set. Since flags can be
toggled in the regular expression itself, this allows one to pick and choose
which parts of the regular expression must match UTF-8 or not.

Some good advice is to always enable the `RURE_FLAG_UNICODE` flag (which is
enabled when using `rure_compile_must`) and selectively disable the flag when
one wants to match arbitrary bytes. The flag can be disabled in a regular
expression with `(?-u)`.

Finally, if one wants to match specific invalid UTF-8 bytes, then you can
use escape sequences. e.g., `(?-u)\\xFF` will match `\xFF`. It's not
possible to use C literal escape sequences in this case since regular
expressions must be valid UTF-8.
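
The same rules are visible from the Rust side via `regex::bytes::Regex`; a
small sketch (not part of the C API itself):

```
extern crate regex;

use regex::bytes::Regex;

fn main() {
    // With Unicode disabled, "." matches any single byte except LF,
    // including bytes that are not valid UTF-8.
    let any_byte = Regex::new(r"(?-u).").unwrap();
    assert!(any_byte.is_match(b"\xFF"));

    // A specific invalid UTF-8 byte must be written as an escape, since
    // the pattern itself must still be valid UTF-8.
    let ff = Regex::new(r"(?-u)\xFF").unwrap();
    assert!(ff.is_match(b"\xFF"));
}
```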


Aborts
------
This library will abort your process if an unwinding panic is caught in the
Rust code. Generally, a panic occurs when there is a bug in the program or
if allocation failed. It is possible to cause this behavior by passing
invalid inputs to some functions. For example, giving an invalid capture
group index to `rure_captures_at` will cause Rust's bounds checks to fail,
which will cause a panic, which will be caught and printed to stderr. The
process will then `abort`.
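
A minimal sketch of that guard pattern (a hypothetical standalone helper, not
the crate's actual code), using only `std::panic::catch_unwind` and
`std::process::abort`:

```
use std::panic;
use std::process;

// Run f, and if it panics, print a message and abort rather than let the
// unwind cross an FFI boundary into C.
fn ffi_guard<T, F>(f: F) -> T
    where F: FnOnce() -> T + panic::UnwindSafe
{
    match panic::catch_unwind(f) {
        Ok(v) => v,
        Err(_) => {
            eprintln!("rure: panic caught at FFI boundary; aborting");
            process::abort()
        }
    }
}

fn main() {
    let v: Vec<u8> = vec![1, 2, 3];
    // An out-of-bounds index panics; the guard turns that into an abort.
    let _ = ffi_guard(move || v[10]);
}
```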


Missing
-------
There are a few things missing from the C API that are present in the Rust API.
There's no particular (known) reason for their absence; they just haven't
been implemented yet.

* RegexSet, which permits matching multiple regular expressions simultaneously
in a single linear time search.
* Splitting a string by a regex.
* Replacing regex matches in a string with some other text.
1 change: 1 addition & 0 deletions regex-capi/ctest/.gitignore
@@ -0,0 +1 @@
test
8 changes: 8 additions & 0 deletions regex-capi/ctest/compile
@@ -0,0 +1,8 @@
#!/bin/sh

set -ex

cargo build --manifest-path ../Cargo.toml
gcc -DDEBUG -o test test.c -ansi -Wall -I../include -L../target/debug -lrure
# If you're using librure.a, then you'll need to link other stuff:
# -lutil -ldl -lpthread -lgcc_s -lc -lm -lrt -lutil -lrure