libc++ regex drops the character between a zero-length match and its subsequent match #64451

Closed
rprichard opened this issue Aug 4, 2023 · 0 comments · Fixed by #94550
Labels
libc++ libc++ C++ Standard Library. Not GNU libstdc++. Not libc++abi. regex Issues related to regex

Comments

@rprichard
Contributor

With libc++, when std::regex_iterator advances past a zero-length match (e.g. "", "^\\s*|\\s*$"), it skips the character immediately after the zero-length match. As a result, std::regex_replace also drops that character.

Example:

```cpp
#include <cstdio>
#include <regex>
#include <string>

int main() {
    std::string src = "AB";
    std::regex pattern("");

    // Expected 'xAxBx'. Actual output is 'xxx'.
    std::string repl = std::regex_replace(src, pattern, "x");
    printf("'%s'\n", repl.c_str());

    // Prints the prefix of each match.
    // Expected ['', 'A', 'B']. Actual output is ['', '', ''].
    std::sregex_iterator begin { src.begin(), src.end(), pattern };
    std::sregex_iterator end {};
    for (auto i = begin; i != end; ++i) {
        std::smatch m = *i;
        printf("'%s'\n", m.prefix().str().c_str());
    }

    return 0;
}
```

libstdc++ prints the expected output above.

This bug was originally reported against the Android NDK as android/ndk#1911, where the pattern was "^\\s*|\\s*$".
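
For context, a pattern like this is typically used to trim whitespace from both ends of a string. The sketch below is an assumption about that use, not code taken from the NDK report; `trim_whitespace` is a hypothetical helper name. Each input either begins or ends with a zero-length match (for example, `^\\s*` matches the empty string at the start of `"hello"`), which is the situation this issue describes.

```cpp
#include <regex>
#include <string>

// Hypothetical helper (not from the NDK report): strips leading and trailing
// whitespace by replacing every match of the pattern with the empty string.
std::string trim_whitespace(const std::string& text) {
  static const std::regex edges("^\\s*|\\s*$");
  return std::regex_replace(text, edges, "");
}
```

On a conforming implementation, `trim_whitespace("  hello  ")` and `trim_whitespace("hello")` should both return `"hello"`.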

I wonder if operator++ needs to adjust the prefix backward one character when it skips one character forward here:

https://github.com/llvm/llvm-project/blob/llvmorg-17.0.0-rc1/libcxx/include/regex#L6512

@EugeneZelenko EugeneZelenko added libc++ libc++ C++ Standard Library. Not GNU libstdc++. Not libc++abi. and removed new issue labels Aug 5, 2023
@philnik777 philnik777 added the regex Issues related to regex label Aug 5, 2023
ldionne pushed a commit that referenced this issue Jun 7, 2024
#94550)

For regex patterns that produce zero-length matches, there is one
(imaginary) match in-between every character in the sequence being
searched (as well as before the first character and after the last
character). It's easiest to demonstrate using replacement:
`std::regex_replace("abc"s, std::regex(""), "!")` should produce `!a!b!c!`, where
each exclamation mark makes a zero-length match visible.
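
The claim that an empty pattern matches once at every position can be checked directly by recording where each match starts. This is a minimal sketch; `match_offsets` is a hypothetical helper name, not part of the patch. It reads `(*it)[0].first` rather than the prefix, so it does not depend on the prefix adjustment discussed here.

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: returns the offset at which each match of `re`
// begins inside `text`.
std::vector<std::ptrdiff_t> match_offsets(const std::string& text,
                                          const std::regex& re) {
  std::vector<std::ptrdiff_t> offsets;
  for (std::sregex_iterator it(text.begin(), text.end(), re), end; it != end; ++it) {
    // `text` is const, so text.begin() is a const_iterator, the same
    // iterator type that std::smatch stores.
    offsets.push_back((*it)[0].first - text.begin());
  }
  return offsets;
}
```

For `"abc"` and an empty pattern, a conforming implementation should yield the offsets 0, 1, 2 and 3: one match before each character and one after the last.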

Currently our implementation doesn't correctly set the prefix of each
zero-length match, "swallowing" the characters separating the imaginary
matches -- e.g. when going through zero-length matches within `abc`, the
corresponding prefixes should be `{'', 'a', 'b', 'c'}`, but before this
patch they will all be empty (`{'', '', '', ''}`). This happens in the
implementation of `regex_iterator::operator++`. Note that the Standard
spells out quite explicitly that the prefix might need to be adjusted
when dealing with zero-length matches in
[`re.regiter.incr`](http://eel.is/c++draft/re.regiter.incr):
> In all cases in which the call to `regex_search` returns `true`,
`match.prefix().first` shall be equal to the previous value of
`match[0].second`... It is unspecified how the implementation makes
these adjustments.
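
To make the required prefixes concrete, the loop below loosely follows the steps of `re.regiter.incr` using `std::regex_search` directly and computes each prefix as the range from the end of the previous match to the start of the current one. It is a sketch under stated assumptions, not the libc++ implementation or the code of this patch: `collect_prefixes` is a hypothetical helper, and setting `match_prev_avail` whenever the search does not start at the beginning of the text is a simplification.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: walks all matches of `re` in `text` by hand and
// returns the prefix each match should report.
std::vector<std::string> collect_prefixes(const std::string& text,
                                          const std::regex& re) {
  namespace rc = std::regex_constants;
  std::vector<std::string> prefixes;
  const auto first = text.cbegin();
  const auto last = text.cend();

  std::smatch m;
  if (!std::regex_search(first, last, m, re)) return prefixes;

  auto prev_end = first;  // end of the previous match (start of text initially)
  for (;;) {
    const auto match_begin = m[0].first;
    const auto match_end = m[0].second;
    // The prefix spans from the end of the previous match to this match.
    prefixes.emplace_back(prev_end, match_begin);
    prev_end = match_end;

    auto start = match_end;
    if (match_begin == match_end) {  // zero-length match
      if (start == last) break;      // nothing left to search
      // First try a non-empty match anchored at the same position.
      rc::match_flag_type anchored = rc::match_not_null | rc::match_continuous;
      if (start != first) anchored |= rc::match_prev_avail;  // simplification
      if (std::regex_search(start, last, m, re, anchored)) continue;
      ++start;  // skip one character; it must still show up in the next prefix
    }
    // Here start is always past the beginning, so match_prev_avail is valid.
    if (!std::regex_search(start, last, m, re, rc::match_prev_avail)) break;
  }
  return prefixes;
}
```

For `"abc"` and an empty pattern this returns `{"", "a", "b", "c"}`. The character skipped by `++start` is recovered because the prefix is measured from `prev_end`, not from `start`.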

[Reproduction example](https://godbolt.org/z/8ve6G3dav)
```cpp
#include <iostream>
#include <regex>
#include <string>

int main() {
  std::string str = "abc";
  std::regex empty_matching_pattern("");

  { // The underlying problem is that `regex_iterator::operator++` doesn't update
    // the prefix correctly.
    std::sregex_iterator i(str.begin(), str.end(), empty_matching_pattern), e;
    std::cout << "\"";
    for (; i != e; ++i) {
      const std::ssub_match& prefix = i->prefix();
      std::cout << prefix.str();
    }
    std::cout << "\"\n";
    // Before the patch: ""
    // After the patch: "abc"
  }

  { // `regex_replace` makes the problem very visible.
    std::string replaced = std::regex_replace(str, empty_matching_pattern, "!");
    std::cout << "\"" << replaced << "\"\n";
    // Before the patch: "!!!!"
    // After the patch: "!a!b!c!"
  }
}
```

Fixes #64451

rdar://119912002