regexp: document and implement invalid UTF-8 treated as U+FFFD #48749
Comments
It may not be clear and it may not be right, but this behavior is how strings work in Go. Invalid UTF-8 gets turned, one byte at a time, into U+FFFD. Here is text from the spec about ranges over strings, which is as good a description as any of how Go handles invalid UTF-8. It's part of the language itself to do it this way:
From the point of view of the matching algorithm, the regexp compiler has already overwritten all the invalid UTF-8 when it built the engine using Go's rules to interpret the string. There is no way to fix this compatibly other than to provide a flag or other mechanism to avoid this interpretation. Given that the code is all runes inside, though, even that may be infeasible. Working as intended, and unfortunate. You are right that your best bet is likely to validate the string ahead of time, or else elide the invalid UTF-8 altogether.
This does not explain why a simple regex explicitly looking for U+FFFD does NOT match on invalid UTF-8. Maybe there's some optimization going on for regexes such as
Anyway, thanks for the quick response. I just wanted to save someone some time hairpulling like I did today :)
I think there is a real bug here for some cases. Reopening.
The literals (lines 13-18) are buggy in https://play.golang.org/p/j-jsteknY0M and should be fixed.
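For anyone who wants to check how their own Go version behaves, a probe along the following lines may help. The patterns and the helper name probe are hypothetical, chosen for illustration, and are not necessarily the ones in the playground link. The results for the invalid input depend on the Go release, so no expected output is given for it.

```go
package main

import (
	"fmt"
	"regexp"
)

// probe (hypothetical helper) reports, for each pattern, whether it
// matches anywhere in input.
func probe(input string, patterns []string) map[string]bool {
	results := make(map[string]bool, len(patterns))
	for _, p := range patterns {
		results[p] = regexp.MustCompile(p).MatchString(input)
	}
	return results
}

func main() {
	// Illustrative patterns: a literal U+FFFD, a class containing only
	// U+FFFD, a class containing U+FFFD plus another character, and ".".
	patterns := []string{`\x{FFFD}`, `[\x{FFFD}]`, `[a\x{FFFD}]`, `.`}

	// A real U+FFFD character versus a single invalid byte.
	inputs := []string{"\uFFFD", "\xff"}

	for _, in := range inputs {
		res := probe(in, patterns)
		for _, p := range patterns {
			fmt.Printf("input %q pattern %-12s match=%v\n", in, p, res[p])
		}
	}
}
```

If the two inputs produce different results for the same pattern, the engine is not treating the invalid byte the same way as U+FFFD for that pattern.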
This is a duplicate of #38006, which I've merged into this issue because this issue had more commentary. That issue was marked as just needing a documentation update. I started to look into fixing this, but it's fairly complex to get all the cases in all the matching engines. For the record, the coherent behavior options are:
RE2 uses Rule 1. Because it works a byte at a time, it can also provide \C to match any single byte of input, which matches invalid UTF-8 as well. This provides the nice property that a match for a regexp without \C is guaranteed to be valid UTF-8.
Unfortunately, today Go has an incoherent mix of these two, although mostly Rule 2. This is a deviation from RE2, and it gives up the nice property, but we probably can't correct that at this point. In particular
Change https://golang.org/cl/354569 mentions this issue:
What version of Go are you using (go version)?
Does this issue reproduce with the latest release?
It reproduces with the Go Playground, which I assume is the latest version.
What operating system and processor architecture are you using (go env)?
What did you do?
I tried to validate a user-provided string, which might contain non-utf8 data, using a regex.
See https://play.golang.org/p/j-jsteknY0M for a concise example.
What did you expect to see?
The regexp package docs say:
So, I expected matching on non-utf8 strings to either:
false
What did you see instead?
For some regexes, it returns false; for some, it returns true. It never returns an error.
In particular, regexes referencing or containing the Unicode REPLACEMENT CHARACTER (\ufffd, �) inside a bracket expression return true (but only if there are other characters in the same bracket). See the playground example.
I understand that the immediate solution for me is to just check for invalid UTF-8 first, before regexing. However, the actual behaviour was so unexpected to me, even if it's technically undefined when reading the docs, that it might be a good idea to at least document this.