[libc++][regex] Correctly adjust match prefix for zero-length matches. #94550
@llvm/pr-subscribers-libcxx
Author: Konstantin Varlamov (var-const)
Full diff: https://github.com/llvm/llvm-project/pull/94550.diff
2 Files Affected:
diff --git a/libcxx/include/regex b/libcxx/include/regex
index b3869d36de1df..8b2637ead328a 100644
--- a/libcxx/include/regex
+++ b/libcxx/include/regex
@@ -792,6 +792,7 @@ typedef regex_token_iterator<wstring::const_iterator> wsregex_token_iterator;
#include <__algorithm/find.h>
#include <__algorithm/search.h>
#include <__assert>
+#include <__availability>
#include <__config>
#include <__iterator/back_insert_iterator.h>
#include <__iterator/default_sentinel.h>
@@ -4700,6 +4701,9 @@ private:
template <class, class>
friend class __lookahead;
+
+ template <class, class, class>
+ friend class regex_iterator;
};
template <class _BidirectionalIterator, class _Allocator>
@@ -5410,7 +5414,9 @@ template <class _BidirectionalIterator, class _CharT, class _Traits>
regex_iterator<_BidirectionalIterator, _CharT, _Traits>&
regex_iterator<_BidirectionalIterator, _CharT, _Traits>::operator++() {
__flags_ |= regex_constants::__no_update_pos;
- _BidirectionalIterator __start = __match_[0].second;
+ _BidirectionalIterator __start = __match_[0].second;
+ _BidirectionalIterator __prefix_start = __start;
+
if (__match_[0].first == __match_[0].second) {
if (__start == __end_) {
__match_ = value_type();
@@ -5424,9 +5430,21 @@ regex_iterator<_BidirectionalIterator, _CharT, _Traits>::operator++() {
else
++__start;
}
+
__flags_ |= regex_constants::match_prev_avail;
- if (!std::regex_search(__start, __end_, __match_, *__pregex_, __flags_))
+ if (!std::regex_search(__start, __end_, __match_, *__pregex_, __flags_)) {
__match_ = value_type();
+
+ } else {
+ // The Standard mandates that if `regex_search` returns true ([re.regiter.incr]), "`match.prefix().first` shall be
+ // equal to the previous value of `match[0].second`... It is unspecified how the implementation makes these
+ // adjustments." The adjustment is necessary if we incremented `__start` above (the branch that deals with
+ // zero-length matches).
+ auto& __prefix = __match_.__prefix_;
+ __prefix.first = __prefix_start;
+ __prefix.matched = __prefix.first != __prefix.second;
+ }
+
return *this;
}
diff --git a/libcxx/test/std/re/re.iter/re.regiter/re.regiter.incr/post.pass.cpp b/libcxx/test/std/re/re.iter/re.regiter/re.regiter.incr/post.pass.cpp
index 9332158f0de95..596a71c70a484 100644
--- a/libcxx/test/std/re/re.iter/re.regiter/re.regiter.incr/post.pass.cpp
+++ b/libcxx/test/std/re/re.iter/re.regiter/re.regiter.incr/post.pass.cpp
@@ -114,5 +114,53 @@ int main(int, char**)
assert(i == e);
}
+ {
+ // Check that we correctly adjust the match prefix when dealing with zero-length matches -- this is explicitly
+ // required by the Standard ([re.regiter.incr]: "In all cases in which the call to `regex_search` returns true,
+ // `match.prefix().first` shall be equal to the previous value of `match[0].second`"). For a pattern that matches
+ // empty sequences, there is an implicit zero-length match between every character in a string -- make sure the
+ // prefix of each of these matches (except the first one) is the preceding character.
+
+ auto validate = [](const std::regex& empty_matching_pattern) {
+ const char source[] = "abc";
+
+ std::cregex_iterator i(source, source + 3, empty_matching_pattern);
+ assert(!i->prefix().matched);
+ assert(i->prefix().length() == 0);
+ assert(i->prefix().first == source);
+ assert(i->prefix().second == source);
+
+ ++i;
+ assert(i->prefix().matched);
+ assert(i->prefix().length() == 1);
+ assert(i->prefix().first == source);
+ assert(i->prefix().second == source + 1);
+ assert(i->prefix().str() == "a");
+
+ ++i;
+ assert(i->prefix().matched);
+ assert(i->prefix().length() == 1);
+ assert(i->prefix().first == source + 1);
+ assert(i->prefix().second == source + 2);
+ assert(i->prefix().str() == "b");
+
+ ++i;
+ assert(i->prefix().matched);
+ assert(i->prefix().length() == 1);
+ assert(i->prefix().first == source + 2);
+ assert(i->prefix().second == source + 3);
+ assert(i->prefix().str() == "c");
+
+ ++i;
+ assert(i == std::cregex_iterator());
+ };
+
+ // An empty pattern produces zero-length matches.
+ validate(std::regex(""));
+ // Any character repeated zero or more times can produce zero-length matches.
+ validate(std::regex("X*"));
+ validate(std::regex("X{0,3}"));
+ }
+
return 0;
}
Force-pushed from 7971a4b to 9fd7dce.
assert(i->prefix().second == source);

++i;
assert(i->prefix().matched);
Not sure if this repetition warrants another helper function (calls would look something like `check_prefix(i->prefix(), /*matched=*/true, /*length=*/1, /*first=*/source, /*second=*/source + 1);`).
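A minimal sketch of what such a helper could look like. Only the name `check_prefix` and the argument order come from the call suggested above; the body and the choice of `std::csub_match` as the parameter type are assumptions, and the PR itself does not include this function.

```cpp
#include <cassert>
#include <cstddef>
#include <regex>

// Hypothetical helper: bundles the four per-match assertions from the test
// into a single call. The parameter order mirrors the suggested call site.
inline void check_prefix(const std::csub_match& prefix, bool matched, std::ptrdiff_t length,
                         const char* first, const char* second) {
  assert(prefix.matched == matched);
  assert(prefix.length() == length);
  assert(prefix.first == first);
  assert(prefix.second == second);
}
```

One trade-off of this shape is that a failing assertion reports a line inside the helper rather than the call site, which may be part of why the repetition was left in place.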
@@ -114,5 +114,53 @@ int main(int, char**)
assert(i == e);
}

{
// Check that we correctly adjust the match prefix when dealing with zero-length matches -- this is explicitly
Please let me know if you think I'm writing too much detail for this single specific issue (here and/or in the test file).
✅ With the latest revision this PR passed the C/C++ code formatter.
Fixes #64451
Force-pushed from 9fd7dce to 0a4aee4.
// equal to the previous value of `match[0].second`... It is unspecified how the implementation makes these
// adjustments." The adjustment is necessary if we incremented `__start` above (the branch that deals with
// zero-length matches).
auto& __prefix = __match_.__prefix_;
There's another call to `regex_search` above, but I don't think prefix adjustment should ever be necessary there since it happens before `__start` is incremented.
// empty sequences, there is an implicit zero-length match between every character in a string -- make sure the
// prefix of each of these matches (except the first one) is the preceding character.

auto validate = [](const std::regex& empty_matching_pattern) {
Update: I actually didn't realize we support `<regex>` in the C++03 mode. Would you prefer to change this to a helper function, or to make the test C++11 and above? (perhaps split into a separate file)
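For the helper-function option, a sketch of how the `validate` lambda might be expressed without `auto` or lambdas. The name `prefixes_equal` and its signature are hypothetical, and it returns a `bool` instead of asserting so the caller keeps control of the failure location; this is not the form the PR ended up with.

```cpp
#include <cassert>
#include <cstddef>
#include <regex>
#include <string>

// Hypothetical C++03-style replacement for the `validate` lambda.
// Returns true if iterating `re` over [first, last) yields exactly `count`
// matches and the prefix of the n-th match equals `expected[n]`.
inline bool prefixes_equal(const std::regex& re, const char* first, const char* last,
                           const char* const* expected, std::size_t count) {
  std::cregex_iterator it(first, last, re);
  const std::cregex_iterator end;
  for (std::size_t n = 0; n != count; ++n, ++it) {
    // Fewer matches than expected, or a prefix that differs: report failure.
    if (it == end || it->prefix().str() != expected[n])
      return false;
  }
  // Succeed only if no unexpected matches remain.
  return it == end;
}
```

With this shape, the test body for each pattern could reduce to one `assert(prefixes_equal(re, source, source + 3, expected, 4));` where `expected` holds `""`, `"a"`, `"b"`, `"c"`, at the cost of no longer checking `first`, `second`, and `matched` individually.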
LGTM with the comment, and also please make sure to provide a very short explanation of the bug we're fixing with this, since the only explanation is on an internal bug tracker. Basically, please make the commit message sufficient in itself to quickly understand the bug we're fixing.
For regex patterns that produce zero-length matches, there is one (imaginary) match in between every character in the sequence being searched (as well as before the first character and after the last character). It's easiest to demonstrate using replacement: `std::regex_replace("abc"s, std::regex(""), "!")` should produce `!a!b!c!`, where each exclamation mark makes a zero-length match visible.

Currently our implementation doesn't correctly set the prefix of each zero-length match, "swallowing" the characters separating the imaginary matches. For example, when going through zero-length matches within `abc`, the corresponding prefixes should be `{'', 'a', 'b', 'c'}`, but before this patch they will all be empty (`{'', '', '', ''}`). This happens in the implementation of `regex_iterator::operator++`. Note that the Standard spells out quite explicitly in `re.regiter.incr` that the prefix might need to be adjusted when dealing with zero-length matches.
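The behavior described above can be observed with a small sketch like the following. It is an illustration written for this description, not the PR's own reproduction example, and the function names `match_prefixes` and `mark_matches` are hypothetical.

```cpp
#include <cassert>
#include <regex>
#include <string>
#include <vector>

// Collects the prefix of every match of `re` in `text`, in iteration order.
inline std::vector<std::string> match_prefixes(const std::string& text, const std::regex& re) {
  std::vector<std::string> prefixes;
  for (std::sregex_iterator it(text.begin(), text.end(), re), end; it != end; ++it)
    prefixes.push_back(it->prefix().str());
  return prefixes;
}

// Makes each match (including zero-length ones) visible by replacing it with `marker`.
inline std::string mark_matches(const std::string& text, const std::regex& re,
                                const std::string& marker) {
  return std::regex_replace(text, re, marker);
}
```

Per `[re.regiter.incr]`, calling `match_prefixes("abc", std::regex(""))` should yield `""`, `"a"`, `"b"`, `"c"`, and `mark_matches("abc", std::regex(""), "!")` should yield `!a!b!c!`. According to the description, libc++ without this patch reports all four prefixes as empty.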
Fixes #64451
rdar://119912002