Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Regex.Match fails for the .* pattern when startat is non-zero #42390

Closed
slozier opened this issue Sep 17, 2020 · 3 comments
Closed

Regex.Match fails for the .* pattern when startat is non-zero #42390

slozier opened this issue Sep 17, 2020 · 3 comments

Comments

@slozier
Copy link
Contributor

slozier commented Sep 17, 2020

Description

Regex.Match fails for .* pattern when startat is non-zero. I would expect the following to succeed but it does not.

Debug.Assert(new Regex(".*").Match("abc", startat: 1).Success);

Configuration

.NET SDK (reflecting any global.json):
Version: 5.0.100-rc.1.20452.10
Commit: 473d1b592e

Runtime Environment:
OS Name: Windows
OS Version: 10.0.19042
OS Platform: Windows
RID: win10-x64
Base Path: C:\Program Files\dotnet\sdk\5.0.100-rc.1.20452.10\

Host (useful for support):
Version: 5.0.0-rc.1.20451.14
Commit: 38017c3

Regression?

This works in .NET Core 3.1 and .NET Framework.

@ghost
Copy link

ghost commented Sep 17, 2020

Tagging subscribers to this area: @eerhardt, @pgovind, @jeffhandley
See info in area-owners.md if you want to be subscribed.

@stephentoub stephentoub added this to the 5.0.0 milestone Sep 17, 2020
@stephentoub stephentoub self-assigned this Sep 17, 2020
@stephentoub
Copy link
Member

stephentoub commented Sep 17, 2020

@slozier, thanks for the very helpful repro.

This is due to #1706. That PR adds an optimization that automatically adds a beginning-of-line anchor to a pattern that begins with .*. This lets the implementation skip backtracking by jumping to the start of the next line when it fails to match. This, however, didn't factor in the possibility that matching might be starting in the middle of a line, in which case when it goes to match, the BOL node itself may fail to match.

The simple fix is to just delete the optimization, i.e. remove this code:

// Optimization: implicit anchoring.
// If the expression begins with a .* loop, add an anchor to the beginning:
// - If Singleline is set such that '.' eats anything, the .* will zip to the end of the string and then backtrack through
// the whole thing looking for a match; since it will have examined everything, there's no benefit to examining it all
// again, and we can anchor to beginning.
// - If Singleline is not set, then '.' eats anything up until a '\n' and backtracks from there, so we can similarly avoid
// re-examining that content and anchor to the beginning of lines.
// We are currently very conservative here, only examining concat nodes. This could be loosened in the future, e.g. to
// explore captures (but think through any implications of there being a back ref to that capture), to explore loops and
// lazy loops a positive minimum (but the anchor shouldn't be part of the loop), to explore alternations and support adding
// an anchor if all of them begin with appropriate star loops (though this could also be accomplished by factoring out the
// loops to be before the alternation), etc.
{
RegexNode node = rootNode.Child(0); // skip implicit root capture node
while (true)
{
bool singleline = (node.Options & RegexOptions.Singleline) != 0;
switch (node.Type)
{
case Concatenate:
node = node.Child(0);
continue;
case Setloop when singleline && node.N == int.MaxValue && node.Str == RegexCharClass.AnyClass:
case Setloopatomic when singleline && node.N == int.MaxValue && node.Str == RegexCharClass.AnyClass:
case Notoneloop when !singleline && node.N == int.MaxValue && node.Ch == '\n':
case Notoneloopatomic when !singleline && node.N == int.MaxValue && node.Ch == '\n':
RegexNode? parent = node.Next;
var anchor = new RegexNode(singleline ? Beginning : Bol, node.Options);
Debug.Assert(parent != null);
if (parent.Type == Concatenate)
{
Debug.Assert(parent.ChildCount() >= 2);
Debug.Assert(parent.Children is List<RegexNode>);
anchor.Next = parent;
((List<RegexNode>)parent.Children).Insert(0, anchor);
}
else
{
Debug.Assert(parent.Type == Capture && parent.Next is null, "Only valid capture is the implicit root capture");
var concat = new RegexNode(Concatenate, parent.Options);
concat.AddChild(anchor);
concat.AddChild(node);
parent.ReplaceChild(0, concat);
}
break;
}
break;
}
}

That removes a valuable optimization, but I'm not sure if there's a good way to save it. One thought is to add a new kind of anchor, one that behaves like BOL for the purposes of jumping ahead in FindFirstChar but that doesn't actually require any matching in Go. I expect that would work, but it's probably too late to do so for .NET 5.

@pgovind, @danmosemsft, thoughts?

@jeffhandley
Copy link
Member

Fixed in #42408 (master) and #42409 (release/5.0)

@ghost ghost locked as resolved and limited conversation to collaborators Dec 7, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

3 participants