With libc++, when std::regex_iterator advances past a zero-length match (e.g. "", "^\\s*|\\s*$"), it skips the character immediately after the zero-length match. As a result, std::regex_replace also drops that character.
Example:
    #include <cstdio>
    #include <regex>
    #include <string>

    int main() {
        std::string src = "AB";
        std::regex pattern("");

        // Expected 'xAxBx'. Actual output is 'xxx'.
        std::string repl = std::regex_replace(src, pattern, "x");
        printf("'%s'\n", repl.c_str());

        // Expected ['', 'A', 'B']. Actual output is ['', '', ''].
        std::sregex_iterator begin { src.begin(), src.end(), pattern };
        std::sregex_iterator end {};
        for (auto i = begin; i != end; ++i) {
            std::smatch m = *i;
            printf("'%s'\n", m.prefix().str().c_str());
        }
        return 0;
    }
libstdc++ prints the expected output above.
This bug was originally reported against the Android NDK (android/ndk#1911), where the pattern was "^\\s*|\\s*$".
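For context, that pattern is a common whitespace-trimming idiom. Below is a minimal sketch of such a use; the helper name `trim` is hypothetical and is not taken from the NDK report. When the input has no leading whitespace, the `^\\s*` alternative produces a zero-length match at position 0, which is the case described above.

```cpp
#include <regex>
#include <string>

// Hypothetical helper (an assumption for illustration, not code from the
// report): removes leading and trailing whitespace by replacing every
// match of "^\\s*|\\s*$" with the empty string.
std::string trim(const std::string& s) {
    static const std::regex edges("^\\s*|\\s*$");
    return std::regex_replace(s, edges, "");
}
```

On a conforming implementation, `trim("hi  ")` should return `"hi"`. With the behavior described in this report, the character right after the zero-length match at position 0 (the `h`) would be expected to go missing from the result.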
I wonder if operator++ needs to adjust the prefix backward by one character when it skips one character forward here:
https://github.com/llvm/llvm-project/blob/llvmorg-17.0.0-rc1/libcxx/include/regex#L6512
#94550)
For regex patterns that produce zero-length matches, there is one
(imaginary) match in-between every character in the sequence being
searched (as well as before the first character and after the last
character). It's easiest to demonstrate using replacement:
`std::regex_replace("abc"s, std::regex(""), "!")` should produce `!a!b!c!`, where
each exclamation mark makes a zero-length match visible.
Currently our implementation doesn't correctly set the prefix of each
zero-length match, "swallowing" the characters separating the imaginary
matches -- e.g. when going through zero-length matches within `abc`, the
corresponding prefixes should be `{'', 'a', 'b', 'c'}`, but before this
patch they will all be empty (`{'', '', '', ''}`). This happens in the
implementation of `regex_iterator::operator++`. Note that the Standard
spells out quite explicitly that the prefix might need to be adjusted
when dealing with zero-length matches in
[`re.regiter.incr`](http://eel.is/c++draft/re.regiter.incr):
> In all cases in which the call to `regex_search` returns `true`,
`match.prefix().first` shall be equal to the previous value of
`match[0].second`... It is unspecified how the implementation makes
these adjustments.
[Reproduction example](https://godbolt.org/z/8ve6G3dav)
```cpp
#include <iostream>
#include <regex>
#include <string>

int main() {
  std::string str = "abc";
  std::regex empty_matching_pattern("");

  { // The underlying problem is that `regex_iterator::operator++` doesn't update
    // the prefix correctly.
    std::sregex_iterator i(str.begin(), str.end(), empty_matching_pattern), e;
    std::cout << "\"";
    for (; i != e; ++i) {
      const std::ssub_match& prefix = i->prefix();
      std::cout << prefix.str();
    }
    std::cout << "\"\n";
    // Before the patch: ""
    // After the patch: "abc"
  }

  { // `regex_replace` makes the problem very visible.
    std::string replaced = std::regex_replace(str, empty_matching_pattern, "!");
    std::cout << "\"" << replaced << "\"\n";
    // Before the patch: "!!!!"
    // After the patch: "!a!b!c!"
  }
}
```
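To make the prefix rule from `re.regiter.incr` concrete, here is a hand-written sketch of the increment steps that calls `std::regex_search` directly and tracks where each prefix must begin. It is an illustration only, not the libc++ patch; the function name `prefixes_by_hand` is hypothetical.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical illustration (not the libc++ implementation): walks all
// matches of `re` in `s` following the steps of [re.regiter.incr] and
// returns the prefix of each match. Every prefix starts at the end of the
// previous match, even when the search itself started one character later.
std::vector<std::string> prefixes_by_hand(const std::string& s,
                                          const std::regex& re) {
  namespace rc = std::regex_constants;
  using It = std::string::const_iterator;

  std::vector<std::string> prefixes;
  const It end = s.cend();
  std::smatch m;
  rc::match_flag_type flags = rc::match_default;

  // First match: a plain search over the whole input.
  if (!std::regex_search(s.cbegin(), end, m, re, flags))
    return prefixes;
  prefixes.emplace_back(s.cbegin(), m[0].first);

  for (;;) {
    const It prev_end = m[0].second; // the next prefix must begin here
    It start = prev_end;
    bool found = false;

    if (m[0].first == m[0].second) { // previous match was zero-length
      if (start == end)
        break; // nothing left to search
      // Try a non-empty match anchored at the same position first.
      found = std::regex_search(
          start, end, m, re, flags | rc::match_not_null | rc::match_continuous);
      if (!found)
        ++start; // step over one character, but keep prev_end for the prefix
    }
    if (!found) {
      flags |= rc::match_prev_avail;
      found = std::regex_search(start, end, m, re, flags);
    }
    if (!found)
      break;

    // The adjustment: the prefix spans [prev_end, match start), not
    // [start, match start).
    prefixes.emplace_back(prev_end, m[0].first);
  }
  return prefixes;
}
```

Calling `prefixes_by_hand("abc", std::regex(""))` should yield `{"", "a", "b", "c"}`, matching the prefixes listed above, because the prefix is computed from the saved `prev_end` rather than from the advanced `start`.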
Fixes #64451
rdar://119912002