regex 0.2 #310

Merged 27 commits on Dec 31, 2016

@BurntSushi BurntSushi commented Dec 30, 2016

0.2.0

This is a new major release of the regex crate, and is an implementation of the
regex 1.0 RFC (https://github.com/rust-lang/rfcs/blob/master/text/1620-regex-1.0.md).
We are releasing a 0.2 first, and if there are no major problems, we will
release a 1.0 shortly. For 0.2, the minimum supported Rust version is
1.12.

There are a number of breaking changes in 0.2. They are split into two
types. The first type corresponds to breaking changes in regular expression
syntax. The second type corresponds to breaking changes in the API.

Breaking changes for regex syntax:

  • POSIX character classes now require double bracketing. Previously, the regex
    [:upper:] would parse as the upper POSIX character class. Now it parses
    as the character class containing the characters :upper:. The fix to this
    change is to use [[:upper:]] instead. Note that variants like
    [[:upper:][:blank:]] continue to work.
  • The character [ must always be escaped inside a character class.
  • The characters &, - and ~ must be escaped if any one of them is
    repeated consecutively. For example, [&], [\&], [\&\&], [&-&] are all
    equivalent while [&&] is illegal. (The motivation for this and the prior
    change is to provide a backwards compatible path for adding character class
    set notation.)
  • A bytes::Regex now has Unicode mode enabled by default (like the main
    Regex type). This means regexes compiled with bytes::Regex::new that
    don't have the Unicode flag set should add (?-u) to recover the original
    behavior.

Breaking changes for the regex API:

  • find and find_iter now return Match values instead of
    (usize, usize).
    Match values have start and end methods, which
    return the match offsets. Match values also have an as_str method,
    which returns the text of the match itself.
  • The Captures type now only provides a single iterator over all capturing
    matches, which should replace uses of iter and iter_pos. Uses of
    iter_named should use the capture_names method on Regex.
  • The replace methods now return Cow values. The Cow::Borrowed variant
    is returned when no replacements are made.
  • The Replacer trait has been completely overhauled. This should only
    impact clients that implement this trait explicitly. Standard uses of
    the replace methods should continue to work unchanged.
  • The quote free function has been renamed to escape.
  • The Regex::with_size_limit method has been removed. It is replaced by
    RegexBuilder::size_limit.
  • The RegexBuilder type has switched from owned self method receivers to
    &mut self method receivers. Most uses will continue to work unchanged, but
    some code may require naming an intermediate variable to hold the builder.
  • The free is_match function has been removed. It is replaced by compiling
    a Regex and calling its is_match method.
  • The PartialEq and Eq impls on Regex have been dropped. If you relied
    on these impls, the fix is to define a wrapper type around Regex, impl
    Deref on it and provide the necessary impls.
  • The is_empty method on Captures has been removed. This always returns
    false, so its use is superfluous.
  • The Syntax variant of the Error type now contains a string instead of
    a regex_syntax::Error. If you were examining syntax errors more closely,
    you'll need to explicitly use the regex_syntax crate to re-parse the regex.
  • The InvalidSet variant of the Error type has been removed since it is
    no longer used.
  • Most of the iterator types have been renamed to match conventions. If you
    were using these iterator types explicitly, please consult the documentation
    for their new names. For example, RegexSplits has been renamed to Split.
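
The `Cow` return type noted above for the `replace` methods can be illustrated with a minimal standard-library sketch. `replace_literal` and `replace_literal_owned` are hypothetical helpers written for this illustration, not functions from the regex crate; they only show how a borrowed value avoids an allocation when nothing is replaced.

```rust
use std::borrow::Cow;

/// Hypothetical helper (not part of the regex crate): replaces every
/// occurrence of a literal `needle`, borrowing the input when nothing matches.
fn replace_literal<'t>(text: &'t str, needle: &str, with: &str) -> Cow<'t, str> {
    // An empty needle is treated as "no match" to keep the sketch simple.
    if needle.is_empty() || !text.contains(needle) {
        // No replacement made: hand back the original text without allocating.
        return Cow::Borrowed(text);
    }
    // At least one replacement made: a new String has to be built.
    Cow::Owned(text.replace(needle, with))
}

/// Callers that always want a `String` can convert explicitly.
fn replace_literal_owned(text: &str, needle: &str, with: &str) -> String {
    replace_literal(text, needle, with).into_owned()
}
```

Code that previously stored the result as a `String` would, under this pattern, call `into_owned()` on the returned value.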

A number of bugs have been fixed:

  • BUG #151:
    The Replacer trait has been changed to permit the caller to control
    allocation.
  • BUG #165:
    Remove the free is_match function.
  • BUG #166:
    Expose more knobs (available in 0.1) and remove with_size_limit.
  • BUG #168:
    Iterators produced by Captures now have the correct lifetime parameters.
  • BUG #175:
    Fix a corner case in the parsing of POSIX character classes.
  • BUG #178:
    Drop the PartialEq and Eq impls on Regex.
  • BUG #179:
    Remove is_empty from Captures since it always returns false.
  • BUG #276:
    Position of named capture can now be retrieved from a Captures.
  • BUG #296:
    Remove winapi/kernel32-sys dependency on UNIX.
  • BUG #307:
    Fix error on emscripten.

BurntSushi and others added 20 commits December 30, 2016 01:05
This uses the new Replacer trait essentially as defined in the `bytes`
sub-module and described in rust-lang#151.

Fixes rust-lang#151
It is useless because it will always return false (since every regex has
at least one capture group corresponding to the full match).

Fixes rust-lang#179
It is misleading to suggest that Regex implements equality, since
equality is a well defined operation on regular expressions and this
particular implementation doesn't correspond to that definition at all.

Moreover, I suspect the actual use cases for such an impl are rather
niche. A simple newtype+deref should resolve any such use cases.

Fixes rust-lang#178
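
The newtype-plus-`Deref` workaround mentioned above might look roughly like the following sketch. To keep it free of external dependencies, `Compiled` is a stand-in for `regex::Regex`, and `ComparableRegex` is a hypothetical wrapper name. Comparing by pattern text is an assumption made here; it is one possible definition of equality, not the one the crate used to provide.

```rust
use std::ops::Deref;

/// Stand-in for a compiled regex so the sketch needs only the standard
/// library; with the real crate the inner type would be `regex::Regex`.
struct Compiled {
    source: String,
}

impl Compiled {
    fn new(source: &str) -> Compiled {
        Compiled { source: source.to_string() }
    }

    /// Returns the original pattern text.
    fn as_str(&self) -> &str {
        &self.source
    }
}

/// Hypothetical newtype that opts in to equality on pattern text.
struct ComparableRegex(Compiled);

impl Deref for ComparableRegex {
    type Target = Compiled;

    // Lets callers use the inner type's methods directly on the wrapper.
    fn deref(&self) -> &Compiled {
        &self.0
    }
}

impl PartialEq for ComparableRegex {
    fn eq(&self, other: &ComparableRegex) -> bool {
        // Textual comparison only: `a|b` and `b|a` compare unequal here
        // even though they match the same strings.
        self.0.as_str() == other.0.as_str()
    }
}

impl Eq for ComparableRegex {}
```
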
This corrects a gaffe of mine. In particular, both types contain
references to a `Captures` *and* the text that was searched, but
name only one lifetime. In practice, this means that the shortest
lifetime is used, which can be problematic when one is trying to
extract submatch text.

This also fixes the lifetime annotation on `iter_pos`, which should be
tied to the Captures and not the text.

It was always possible to work around this by using indices.

Fixes rust-lang#168
This is replaced by using RegexBuilder.

Fixes rust-lang#166
It encourages compiling a regex for every use, which can be convenient
in some circumstances but deadly for performance.

Fixes rust-lang#165
Similarly, rename RegexSplitsN to SplitN.

This follows the convention of all other iterator types. In general,
we shouldn't namespace our type names.

Mostly, this adds an `Iter` suffix to all of the names.

If `replace` doesn't find any matches, then it can return the original
string unchanged.

This removes the InvalidSet variant, which is no longer used, and no
longer exposes the `regex_syntax::Error` type, instead exposing it as
a string.

This also removes Captures.{at,pos} and replaces them with Captures.get,
which now returns a Match. Similarly, Captures.name returns a Match as
well.

Fixes rust-lang#276

All use cases can be replaced with Regex::capture_names.
Specifically, use mutable references instead of passing ownership.
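
The effect of `&mut self` receivers on calling code can be sketched with a small stand-in builder. `Config` and `ConfigBuilder` are hypothetical types invented for this illustration; they are not the regex crate's definitions and only mimic the receiver style described above.

```rust
/// Hypothetical built value; the fields are illustrative only.
#[derive(Debug, Clone, PartialEq)]
struct Config {
    pattern: String,
    case_insensitive: bool,
    size_limit: usize,
}

/// Hypothetical builder whose setters take `&mut self`.
struct ConfigBuilder {
    config: Config,
}

impl ConfigBuilder {
    fn new(pattern: &str) -> ConfigBuilder {
        ConfigBuilder {
            config: Config {
                pattern: pattern.to_string(),
                case_insensitive: false,
                size_limit: 10 * (1 << 20),
            },
        }
    }

    fn case_insensitive(&mut self, yes: bool) -> &mut ConfigBuilder {
        self.config.case_insensitive = yes;
        self
    }

    fn size_limit(&mut self, bytes: usize) -> &mut ConfigBuilder {
        self.config.size_limit = bytes;
        self
    }

    fn build(&self) -> Config {
        self.config.clone()
    }
}

/// A chain that ends in `build` still works on a temporary builder.
fn one_expression() -> Config {
    ConfigBuilder::new("a+").case_insensitive(true).size_limit(1 << 20).build()
}

/// Setting options across statements needs a named builder. Writing
/// `let b = ConfigBuilder::new(p).case_insensitive(true);` would not
/// compile, because the temporary builder is dropped while borrowed.
fn conditional(pattern: &str, ignore_case: bool) -> Config {
    let mut builder = ConfigBuilder::new(pattern);
    if ignore_case {
        builder.case_insensitive(true);
    }
    builder.size_limit(1 << 20);
    builder.build()
}
```
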
For example, the regex `[:upper:]` used to correspond to the `upper`
ASCII character class, but it now corresponds to the character class
containing the characters `:upper:`.

Forms like `[[:upper:][:blank:]]` are still accepted.

Fixes rust-lang#175
The escaping of &, - and ~ is only required when the characters are
repeated adjacently, which should be quite rare. Escaping of [ is always
required, unless it appears in the second position of a range.

These rules enable us to add character class sets as described in
UTS#18 RL1.3 in a backward compatible way.
This was added because regex 0.1 supports Rust 1.3+. But we can now
assume Rust 1.12+, which has Vec::extend_from_slice. Yay for less unsafe!
@BurntSushi BurntSushi force-pushed the rfc branch 2 times, most recently from ca60bf9 to f8903d9 Compare December 30, 2016 21:46
When building a Match, we should avoid storing a subslice and instead
store the full string. We can punt subslicing to access. This seems to
get LLVM to optimize tight loops better when the subslice isn't needed.
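
A rough sketch of the layout described above, under the assumption that a match value keeps the whole searched text plus two offsets. `SketchMatch` and `total_len` are hypothetical names, and nothing here is copied from the crate's source.

```rust
/// Hypothetical match value that stores the full haystack and slices it
/// only when the matched text is requested.
#[derive(Clone, Copy, Debug, PartialEq)]
struct SketchMatch<'t> {
    text: &'t str, // the full searched text, not the matched subslice
    start: usize,
    end: usize,
}

impl<'t> SketchMatch<'t> {
    fn new(text: &'t str, start: usize, end: usize) -> SketchMatch<'t> {
        SketchMatch { text, start, end }
    }

    fn start(&self) -> usize {
        self.start
    }

    fn end(&self) -> usize {
        self.end
    }

    /// Slicing happens here, on access. This panics if the offsets are out
    /// of bounds or do not fall on UTF-8 character boundaries.
    fn as_str(&self) -> &'t str {
        &self.text[self.start..self.end]
    }
}

/// A loop that needs only offsets never builds a subslice.
fn total_len(matches: &[SketchMatch<'_>]) -> usize {
    matches.iter().map(|m| m.end() - m.start()).sum()
}
```
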
@BurntSushi BurntSushi force-pushed the rfc branch 2 times, most recently from e818f7e to cc56d60 Compare December 31, 2016 17:57
This API mirrors RegexBuilder, but for multiple patterns.

Also, modify regex-capi to use RegexSetBuilder internally.
@BurntSushi

@bors r+

bors commented Dec 31, 2016

📌 Commit ac3ab6d has been approved by BurntSushi

bors commented Dec 31, 2016

⌛ Testing commit ac3ab6d with merge 52fdae7...

bors added a commit that referenced this pull request Dec 31, 2016
regex 0.2


bors commented Dec 31, 2016

☀️ Test successful - status-appveyor, status-travis
Approved by: BurntSushi
Pushing 52fdae7 to master...

@bors bors merged commit ac3ab6d into rust-lang:master Dec 31, 2016
@BurntSushi BurntSushi deleted the rfc branch December 31, 2016 23:47