Some panic cases found by afl.rs, involving 9 public API #738

StevenJiang1110 · 2021-01-10T08:50:00Z

I used afl.rs to fuzz all public APIs of this crate and found several cases that may cause panics. I fuzzed version 1.4.2, but I have checked that all of the cases can be replayed on the newest version, 1.4.3. These panics involve 9 APIs (some are similar). The code to replay them is as follows:

These 6 cases are slicing or out-of-bounds errors.

let regex_ = regex::bytes::Regex::new("0").unwrap();
let _ = regex::bytes::Regex::find_at(&regex_ ,&[48] ,3472328296227680304);

let regex_ = regex::Regex::new("0").unwrap();
let _local1 = regex::Regex::find_at(&regex_ ,"0" ,3472328296227680304);

let regex_ = regex::bytes::Regex::new("0").unwrap();
let _ = regex::bytes::Regex::shortest_match_at(&regex_ ,&[48] ,3472328296227680304);

let regex_ = regex::bytes::Regex::new("0").unwrap();
let _ = regex::bytes::Regex::is_match_at(&regex_ ,&[48] ,3472328296227680304);

let regex_ = regex::Regex::new("0").unwrap();
let _ = regex::Regex::shortest_match_at(&regex_ ,"0" ,3472328296227680304);

let regex_ = regex::Regex::new("0").unwrap();
let _ = regex::Regex::is_match_at(&regex_ ,"0" ,3472328296227680304);

These 2 cases are about arithmetic overflow.

let regex_ = regex::bytes::Regex::new("0").unwrap();
let capture_location = regex::bytes::Regex::capture_locations(&regex_);
let _ = regex::bytes::CaptureLocations::get(&capture_location ,18388250262078763056);

let regex_ = regex::Regex::new("0").unwrap();
let capture_location = regex::Regex::capture_locations(&regex_);
let _ = regex::CaptureLocations::get(&capture_location ,9236935819261915184);

This case is a Unicode error (char boundary):

let regex_ = regex::Regex::new("(?-u)000|\\S000").unwrap();
let match_ = regex::Regex::find(&regex_ ,"詩00000000000").unwrap();
let _ = regex::Match::as_str(&match_);

I also put this replay code, along with more data that may cause panics, in replay_files.

I hope you can check whether these are real bugs that need to be fixed. Thanks a lot.

BurntSushi · 2021-01-10T15:25:36Z

None of the first 6 are bugs. You're providing an offset that is invalid for the slice given. Arguably this should be documented as a panic condition.
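To illustrate why an out-of-range offset panics, here is a minimal sketch using only the standard library. `tail_from` is a hypothetical helper and not part of the regex API. It shows the difference between slicing with `&haystack[start..]`, which panics when `start` is past the end, and `slice::get`, which returns `None` instead.

```rust
/// Hypothetical helper (not part of the regex crate): returns the part of
/// `haystack` beginning at `start`, or `None` when `start` is past the end.
/// Writing `&haystack[start..]` instead would panic for such an offset.
fn tail_from(haystack: &[u8], start: usize) -> Option<&[u8]> {
    haystack.get(start..)
}

fn main() {
    let haystack = [48u8];
    let empty: &[u8] = &[];

    // An offset inside the slice yields the remaining bytes.
    assert_eq!(tail_from(&haystack, 0), Some(&haystack[..]));
    // An offset equal to the length is still valid and yields an empty slice.
    assert_eq!(tail_from(&haystack, 1), Some(empty));
    // The offset from the report (assumes a 64-bit usize) is far past the end.
    assert_eq!(tail_from(&haystack, 3472328296227680304), None);
}
```

A caller that cannot guarantee a valid offset could apply a check like this before calling the `*_at` methods.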

The second two cases do point to a bug. It should result in None being returned instead of a panic.

The last one is also a bug. Here is a smaller reproduction:

    let regex_ = regex::Regex::new(r"(?-u)\S").unwrap();
    let match_ = regex::Regex::find(&regex_ ,"詩").unwrap();
    let _ = regex::Match::as_str(&match_);

The problem is that the regex given should not be allowed to compile since \S could match invalid UTF-8, and a regex::Regex is never allowed to match invalid UTF-8. So there is something wrong with the logic in regex-syntax that allows such a regex to be constructed in the first place.
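The panic itself can be illustrated without the regex crate. The sketch below assumes the match covers bytes 0..1 of the haystack, i.e. only the first byte of the three-byte character. `slice_checked` is a hypothetical helper, not a regex API.

```rust
/// Hypothetical helper: slices `haystack` by byte offsets, returning `None`
/// when the range is out of bounds or splits a UTF-8 code point.
/// Writing `&haystack[start..end]` would panic in those cases.
fn slice_checked(haystack: &str, start: usize, end: usize) -> Option<&str> {
    haystack.get(start..end)
}

fn main() {
    let haystack = "詩";

    // The character is encoded as three bytes in UTF-8.
    assert_eq!(haystack.len(), 3);
    // Byte offset 1 falls inside the character, so it is not a char boundary.
    assert!(!haystack.is_char_boundary(1));
    // A range ending at offset 1 therefore cannot be sliced as a `&str`.
    assert_eq!(slice_checked(haystack, 0, 1), None);
    // The full range is fine.
    assert_eq!(slice_checked(haystack, 0, 3), Some("詩"));
}
```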

Nice finds!

StevenJiang1110 · 2021-01-23T06:04:20Z

After fuzzing again, afl.rs found another Unicode error. The cause may be similar.

let regex_ = regex::Regex::new("(?-u)0|\\W").unwrap();
let capture_ = regex::Regex::captures(&regex_ ,"〧000000").unwrap();
let mut escape_ = regex::escape("000000000");
let _ = regex::Captures::expand(&capture_ ,"0$0000000" ,&mut escape_);

When Unicode mode is disabled (i.e., (?-u)), the Perl character classes (\w, \d and \s) revert to their ASCII definitions. The negated forms of these classes are also derived from their ASCII definitions, and this means that they may actually match bytes outside of ASCII and thus possibly invalid UTF-8. For this reason, when the translator is configured to only produce HIR that matches valid UTF-8, '(?-u)\W' should be rejected.

Previously, it was not being rejected, which could actually lead to matches that produced offsets that split codepoints, and thus lead to panics when match offsets are used to slice a string. For example, this code

    fn main() {
        let re = regex::Regex::new(r"(?-u)\W").unwrap();
        let haystack = "☃";
        if let Some(m) = re.find(haystack) {
            println!("{:?}", &haystack[m.range()]);
        }
    }

panics with

    byte index 1 is not a char boundary; it is inside '☃' (bytes 0..3) of `☃`

That is, it reports a match at 0..1, which is technically correct, but the regex itself should have been rejected in the first place since the top-level Regex API always has UTF-8 mode enabled.

Also, many of the replacement tests were using '(?-u)\W' (or similar) for some reason. I'm not sure why, so I just removed the '(?-u)' to make those tests pass. Whether Unicode is enabled or not doesn't seem to be an interesting detail for those tests. (All haystacks and replacements appear to be ASCII.)

Fixes #895, Partially addresses #738

The contract of this function says that any invalid group offset should result in a return value of None. In general, it worked fine, unless the offset was so big that some internal multiplication overflowed. That could in turn produce an incorrect result or a panic. So we fix that here with checked arithmetic.

Fixes #738, Fixes #950
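As a rough sketch of the checked-arithmetic idea, assume a simplified layout in which group `i` stores its start offset at slot `2 * i` and its end offset at slot `2 * i + 1`. The `Slots` type below is hypothetical and is not the crate's actual implementation. It only shows how `checked_mul` and `checked_add` turn an overflowing index into `None`.

```rust
/// Hypothetical, simplified stand-in for a capture-location table.
/// Group `i` keeps its start at slot `2 * i` and its end at slot `2 * i + 1`.
struct Slots(Vec<Option<usize>>);

impl Slots {
    /// Returns the (start, end) offsets of group `i`, or `None` if the group
    /// does not exist, did not participate in the match, or if computing its
    /// slot indices would overflow `usize`.
    fn get(&self, i: usize) -> Option<(usize, usize)> {
        // A plain `i * 2` would panic in debug builds and wrap in release
        // builds when `i` is huge; checked arithmetic yields `None` instead.
        let start_slot = i.checked_mul(2)?;
        let end_slot = start_slot.checked_add(1)?;
        let start = self.0.get(start_slot).copied().flatten()?;
        let end = self.0.get(end_slot).copied().flatten()?;
        Some((start, end))
    }
}

fn main() {
    // One group spanning bytes 0..1.
    let slots = Slots(vec![Some(0), Some(1)]);
    assert_eq!(slots.get(0), Some((0, 1)));
    // A group index that is merely out of range.
    assert_eq!(slots.get(1), None);
    // A group index whose slot computation would overflow.
    assert_eq!(slots.get(usize::MAX), None);
}
```

With unchecked multiplication in a release build, a wrapped index could land on a valid slot and return offsets for the wrong group, which corresponds to the "incorrect result" case described above.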