-
Notifications
You must be signed in to change notification settings - Fork 4.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add Match and isMatch ReadOnlySpan<char> overloads as well as Regex protected methods which expose ReadOnlySpan<char> to avoid allocations. #62509
Conversation
Note regarding the This serves as a reminder for when your PR is modifying a ref *.cs file and adding/modifying public APIs, to please make sure the API implementation in the src *.cs file is documented with triple slash comments, so the PR reviewers can sign off that change. |
Tagging subscribers to this area: @dotnet/area-system-text-regularexpressions Issue DetailsThis is still WIP as these APIs haven't been approved yet and the only purpose of the draft PR is to get early feedback. cc: @stephentoub
|
...raries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexInterpreter.cs
Outdated
Show resolved
Hide resolved
@joperezr, thanks for sharing your draft. There are a few high-level issues here:
So as you currently have things, I don't think this is feasible. Might be good to start out thinking about the APIs a little more. I'd proposed overloads of |
Thanks a bunch for taking a look and for the feedback @stephentoub.
Yup, I'm aware of that (in dotnet/iot we heavily use
I see, makes sense, I'll do that.
I see, yeah good point. My intention here was that I wouldn't want to break existing custom derived RegexRunners, so I wanted to introduce a new virtual Span overload which by default would just set runtext and call the existing Go/FFC methods which would have already been implemented by the derived types. For new derived types, I was expecting consumers do the opposite (which is what I'm doing here in the interpreted engine) which is to override both overloads, and have the non-Span one just call the Span one passing in runtext. I will re-read more thoroughly your API proposal and change this to more closely match what you proposed and take into consideration all of your comments here. |
e417519
to
63ed6b5
Compare
I've now implemented the desired Scan method with both the Interpreted and Compiled engines. Next I'll work on the Source generator one. I realize that there is a lot of code that can be refactored here to avoid repetition on that Scan loop, but for now I'm mostly focusing on proof of concept implementation and will then polish once this is ready for actual review. |
src/libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/Capture.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/Capture.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/Capture.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/Capture.cs
Outdated
Show resolved
Hide resolved
...ies/System.Text.RegularExpressions/src/System/Text/RegularExpressions/CompiledRegexRunner.cs
Outdated
Show resolved
Hide resolved
...ies/System.Text.RegularExpressions/src/System/Text/RegularExpressions/CompiledRegexRunner.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.Text.RegularExpressions/ref/System.Text.RegularExpressions.cs
Outdated
Show resolved
Hide resolved
...ies/System.Text.RegularExpressions/src/System/Text/RegularExpressions/CompiledRegexRunner.cs
Outdated
Show resolved
Hide resolved
...ies/System.Text.RegularExpressions/src/System/Text/RegularExpressions/CompiledRegexRunner.cs
Outdated
Show resolved
Hide resolved
...raries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexInterpreter.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/Regex.cs
Show resolved
Hide resolved
src/libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/Regex.cs
Outdated
Show resolved
Hide resolved
…h all tests passing.
…bool. - Added Scan overload to NonBacktracking - Added Scan overload to CompiledEngine which internally calls the generated Go and FFC dynamic methods. - Moved a lot of the initialization that happens as part of Scan to occur before Scan, in order to not requiring to expose additional APIs to create custom regexRunners - Added Scan overload to Interpreted engine - Moved logic that cleans up the Match object that used to happen at the end of Scan to now happen after Scan() is done.
2df7931
to
2960e99
Compare
...ies/System.Text.RegularExpressions/src/System/Text/RegularExpressions/CompiledRegexRunner.cs
Outdated
Show resolved
Hide resolved
runner.InitializeForGo(); | ||
runner.Scan(this, span, startat - beginning, prevlen, quick); | ||
Match? m = runner.runmatch; | ||
runner.runmatch = null; // Reset runmatch |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doesn't this mean we'll never reuse Match objects from call to call, even if quick == true? #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, I still have to figure out what the best way to do this. Should be fixable, but the issue right now is that when quick is true and you find a match, you actually want to return null as that is what we today interpret as "there was a match" like:
Line 38 in 05dbb13
return Run(quick: true, -1, input, 0, input.Length, UseOptionR() ? input.Length : 0) is null; |
Given Scan doesn't return anything today, and basically uses runmatch as the communication mechanism, then that is why even when quick is true we are setting it to null. That said, after this new refactoring where all of the handling of runmatch happens as part of Run(), this should be easy to fix.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not following. We don't currently null out runmatch for quick; why do we need to here?
Lines 179 to 183 in 05dbb13
if (quick) | |
{ | |
runmatch!.Text = null!; // drop reference | |
return null; | |
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't need it currently because the current Scan
returns a Match object, so when quick == true and we found a match, we simply return null:
Lines 175 to 183 in 87777b6
if (match._matchcount[0] > 0) | |
{ | |
runtext = null; // drop reference to text to avoid keeping it alive in a cache | |
if (quick) | |
{ | |
runmatch!.Text = null!; // drop reference | |
return null; | |
} |
The new Scan method however returns void, and the way to communicate the result back to Run() used to be via runmatch (before I moved all of the handling of runmatch to inside Run() method), which is why I was setting runmatch to null when quick was true. Now that all of the handling of runmatch happens during Run(), I can avoid nulling out runmatch and simply return null from Run
method whenever we found a match and quick was set to true.
src/libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexRunner.cs
Outdated
Show resolved
Hide resolved
(index < endpos && RegexCharClass.IsBoundaryWordChar(runtext[index])); | ||
} | ||
|
||
protected bool IsBoundary(ReadOnlySpan<char> inputSpan, int index, int startpos, int endpos) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need endpos? Isn't endpos always inputSpan.Length?
For that matter, isn't startpos always actually runtextbeg? In which case we don't need it either. At which point this just takes the span and the index. #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
startpos is not always runtextbeg, but it is runner.runtextstart. I just copied the signature from the existing methods, which they did take start and end but I do think we can just use the regexRunner fields so these will get refactored anyway.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
startpos is not always runtextbeg, but it is runner.runtextstart
Where is it not runtextbeg? I see these two calls in the interpreter:
Lines 713 to 727 in af42c8e
case RegexOpcode.Boundary: | |
if (!IsBoundary(runtextpos, runtextbeg, runtextend)) | |
{ | |
break; | |
} | |
advance = 0; | |
continue; | |
case RegexOpcode.NonBoundary: | |
if (IsBoundary(runtextpos, runtextbeg, runtextend)) | |
{ | |
break; | |
} | |
advance = 0; | |
continue; |
if (runtext != null) | ||
{ | ||
sb.Append(RegexCharClass.DescribeChar(runtext[runtextpos - 1])); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should consider whether any of this tracing is still valuable, especially if dumping the current state won't help with the newer modes of access, and whether it should all be deleted or simplified significantly. Entirely up to you, as likely the sole person who would ever enable it :) #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure, I can handle that in a separate PR if that's ok.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup
writer.WriteLine("// Reset state for another iteration."); | ||
writer.WriteLine("base.runtrackpos = base.runtrack!.Length;"); | ||
writer.WriteLine("base.runstackpos = base.runstack!.Length;"); | ||
writer.WriteLine("base.runcrawlpos = base.runcrawl!.Length;"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are any of these necessary?
writer.WriteLine("// We failed to find a match. If we're at the end of the input, then we are done."); | ||
using (EmitBlock(writer, "if (base.runtextpos == text.Length)")) | ||
{ | ||
writer.WriteLine("base.runmatch = global::System.Text.RegularExpressions.Match.Empty;"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this needed? If we initialize runmatch in the regex lib prior to calling Scan, and if Go ensures it uncaptures everything in the event of a failed match, can we just leave runmatch unmodified, and the regex lib will see that runmatch doesn't contain a top-level capture and will know that means a failed match?
src/libraries/System.Text.RegularExpressions/gen/RegexGenerator.Emitter.cs
Show resolved
Hide resolved
writer.WriteLine(); | ||
|
||
writer.WriteLine("// If we got a match, we're done."); | ||
using (EmitBlock(writer, "if (Go(text))")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not against Go returning a bool, but I'm also not clear why it's needed. Can't the caller of Scan see whether there was a match based on the state of the object in runmatch?
protected bool IsMatched(int cap) { throw null; } | ||
protected int MatchIndex(int cap) { throw null; } | ||
protected int MatchLength(int cap) { throw null; } | ||
protected int Popcrawl() { throw null; } | ||
protected internal System.Text.RegularExpressions.Match? Scan(System.Text.RegularExpressions.Regex regex, string text, int textbeg, int textend, int textstart, int prevlen, bool quick) { throw null; } | ||
protected internal System.Text.RegularExpressions.Match? Scan(System.Text.RegularExpressions.Regex regex, string text, int textbeg, int textend, int textstart, int prevlen, bool quick, System.TimeSpan timeout) { throw null; } | ||
protected internal virtual void Scan(System.Text.RegularExpressions.Regex regex, System.ReadOnlySpan<char> text, int textstart, int prevlen, bool quick) { throw null; } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As part of exposing this, we should consider whether we want to keep the names as close to the existing ones as possible, or come up with better ones, e.g. is it better for prevlen to be consistent with the existing Scan overloads that we don't actually think anyone uses, or is it better for prevlen to be named previousMatchLength, or something more descriptive like that. I don't have a strong opinion, just something to consider. #Resolved
…ouch runmatch any more.
src/libraries/System.Text.RegularExpressions/gen/RegexGenerator.Emitter.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.Text.RegularExpressions/gen/RegexGenerator.Emitter.cs
Show resolved
Hide resolved
writer.WriteLine("// Reset state for another iteration."); | ||
writer.WriteLine("base.runtrackpos = base.runtrack!.Length;"); | ||
writer.WriteLine("base.runstackpos = base.runstack!.Length;"); | ||
writer.WriteLine("base.runcrawlpos = base.runcrawl!.Length;"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are any of these actually necessary? runcrawlpos should be 0 because if there were captures we would have uncaptured all the way back to 0 on a failed match. We don't use runtrack at all. I'd have expected runtrackpos to also end up at runtrack.Length as well, as we use it as a backtracking stack which should get fully unwound.
protected abstract void InitTrackCount(); | ||
protected virtual bool FindFirstChar() { throw null; } | ||
protected virtual void Go() { throw null; } | ||
protected virtual void InitTrackCount() { throw null; } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we delete our generated override of InitTrackCount in the source generator? What breaks if runtrackcount remains 0? I think that'll just mean that we end up with the default minimums enforced by InitializeForGo (or whatever it's now called). We should probably delete the override and then confirm the defaults make sense, and/or determine whether there's a different number we could be computing that would help presize this better. If memory serves, right now it's yielding the number of opcodes produced by RegexWriter that are backtracking... but that's no longer particularly relevant to how we use runtrack, and we don't use runstack at all... so there's a fair bit of waste here to be cleaned up.
@@ -7,25 +7,21 @@ namespace System.Text.RegularExpressions | |||
{ | |||
internal sealed class CompiledRegexRunnerFactory : RegexRunnerFactory | |||
{ | |||
private readonly DynamicMethod _goMethod; | |||
private readonly DynamicMethod _findFirstCharMethod; | |||
private readonly DynamicMethod _scanMethod; | |||
private readonly int _trackcount; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As with the source generator, it'd be nice to see whether we can just delete _trackcount, the setting of it, etc. That'll contribute to #62450.
/// </summary> | ||
/// <param name="input">The input span to be searched on.</param> | ||
/// <param name="pattern">The Regex pattern to be used for matching.</param> | ||
/// <returns><see langword="true"/> if the input matches the pattern, <see langword="false"/> otherwise.</returns> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For all of these doc comments, it'd be nice to start with the wording currently used in the docs for the corresponding string-based methods, e.g.
https://docs.microsoft.com/en-us/dotnet/api/system.text.regularexpressions.regex.ismatch?view=net-6.0#system-text-regularexpressions-regex-ismatch(system-string-system-string) #Resolved
public bool IsMatch(ReadOnlySpan<char> input) | ||
{ | ||
return Run(quick: true, -1, input, 0, input.Length, UseOptionR() ? input.Length : 0) is null; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: the other methods are all expression-bodied members, but this one isn't #Resolved
return runner.Scan(this, input, beginning, beginning + length, startat, prevlen, quick, internalMatchTimeout); | ||
bool skipScan = false; | ||
runner.InitializeTimeout(internalMatchTimeout); | ||
ReadOnlySpan<char> span = input.AsSpan(beginning, length); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand how this is avoiding allocations when one of the string-based entrypoints is used with an existing CompileToAssembly assembly. Doesn't it need to store the input
into runtext
so that the original string is available to overrides of FindFirstChar
and Go
?
My expectation was that this would set runner.runtext = input
and then the base Scan function (that all of our engines override but a CompileToAssembly assembly wouldn't) would do a check like if (inputSpan != runtext) runtext = inputSpan.ToString();
. #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A valid alternative to if (inputSpan != runtext) runtext = inputSpan.ToString();
would also be to throw an exception in the base Scan method if runtext is null. That should only be hit if a) a new span-based entrypoint is used and b) an existing CompileToAssembly assembly is being used that doesn't override Scan, and we could say that combination simply isn't supported.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The way I have it now, is that instead of Regex.Run
doing the runner.runtext = input
, I'm doing that in the default implementation of Scan as you suggest:
Lines 91 to 103 in cc92623
protected internal virtual void Scan(ReadOnlySpan<char> text) | |
{ | |
string? s = runtext; | |
if (text != s) | |
{ | |
s = text.ToString(); | |
runtext = s; | |
} | |
Debug.Assert(runregex != null); | |
Match? match = Scan(runregex, s, 0, text.Length, runtextstart, -1, quick, runregex.internalMatchTimeout); | |
runmatch = match; | |
} |
This way, setting runtext will ONLY happen now when using CompiledToAssembly. All of the other engines will override the Scan method, and won't be setting runtext any longer, which is how we prevent that extra allocation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The way I have it now, is that instead of Regex.Run doing the runner.runtext = input, I'm doing that in the default implementation of Scan as you suggest:
We need to do both. If you don't set runner.runtext = input
in the Run
method, then when it gets to this point in Scan
, runtext
will be null and we'll always end up doing text.ToString.
runner.InitializeForScan(this, span, 0, span.Length, startat - beginning, quick); | ||
runner.InitializeForGo(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is InitializeForGo still needed? Its contents can't be in InitializeForScan?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No reason, it can totally be absorbed into Initialize for Scan. I didn't do that refactoring as it wasn't really relevant for the purposes of the API exposure POC, but I can for sure do this as part of the actual implementation once the API is approved.
bool skipScan = false; | ||
runner.InitializeTimeout(internalMatchTimeout); | ||
ReadOnlySpan<char> span = input.AsSpan(beginning, length); | ||
runner.InitializeForScan(this, span, 0, span.Length, startat - beginning, quick); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why does this need to pass in the 0 and span.Length? We're passing in span.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point, it can totally work without those two as all the callsites are passing the same thing since textbeg and textend is set to be beginning and end of the sliced span.
// if we got a match, set runmatch to null if quick is true | ||
if (runner.runmatch!._matchcount[0] > 0) | ||
{ | ||
if (quick) | ||
{ | ||
runner.runmatch = null; | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As noted, we really need to not do this. This is a ship stopper. It'll cause every IsMatch/Replace/Split/Count to start skyrocketing in allocations.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, this little implementation detail is there as part of a copy paste from code that used to be in individual implementations of Scan on the engine, to code that now lives in Run(), and it was originally meant to serve for IsMatch() purposes where a null match means "The input is a match for pattern". Now that all of that lives in Run, we don't have to reset runmatch, only to return null from Run() (which we couldn't do from the new Scan since it returns void)
runner.InitializeTimeout(internalMatchTimeout); | ||
ReadOnlySpan<char> span = input.Slice(beginning, length); | ||
runner.InitializeForScan(this, span, 0, span.Length, startat - beginning, quick); | ||
runner.InitializeForGo(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we need a method to configure state on the runner, we should have it be a single method rather than three :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(Although I see later that at least the timeout initialization is separate out to before a loop, so maybe that makes sense on its own)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree, as said above, this will be refactored into two:
- InitializeTimeout: which will only be called once particularly when being called by
Run<TState>
which has a loop inside - InitializeForScan: which will have contents of what today is IntializeForScan and IntializeForGo, which when being called by
Run<TState>
do need to be called for each new iteration of the loop.
return; | ||
} | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we implement this method in terms of the other Run overload? It'd be nice to avoid all this complicated duplication. I realize it'll come at a small cost, but hopefully it'd be fairly minor.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, I can do that.
// { | ||
Ldthisfld(s_runregexField); | ||
Callvirt(s_regexGetRightToLeft); | ||
BrfalseFar(notRightToLeft); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We no longer implement RightToLeft support in RegexCompiler; any use of RightToLeft ends up in the interpreter:
runtime/src/libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexNode.cs
Lines 2513 to 2518 in af42c8e
if ((Options & (RegexOptions.RightToLeft | RegexOptions.NonBacktracking)) != 0) | |
{ | |
// NonBacktracking isn't supported, nor RightToLeft. The latter applies to both the top-level | |
// options as well as when used to specify positive and negative lookbehinds. | |
return false; | |
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see, yeah I did know that and that's why I removed all of this logic form the source generated emitter, but forgot to also remove it from the Compiled engine, will fix.
|
||
// CheckTimeout(); | ||
Ldthis(); | ||
Call(s_checkTimeoutMethod); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missing guard on the timeout check. #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you mind elaborating? Are you referring to the existing if (ignoreTimeout)
? If so, we don't do that any longer since timeout doesn't get passed through to Scan. Before we had the guard because we were calling DoCheckTimeout()
as opposed to CheckTimeout
which has the guard inside of it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you mind elaborating?
if (_hasTimeout)
{
Ldthis();
Call(s_checkTimeoutMethod);
}
Ldc(0); | ||
_ilg!.Emit(OpCodes.Cgt); | ||
BrfalseFar(afterSuccessMatchLabel); | ||
BrFar(returnLabel); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't want to do the same thing we do in source gen and have Go return a bool rather than having this code reach into the match object to inspect the match count? The bool should be faster and simpler. #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can do that as well here. I originally did it only in the source generator because main idea with that is that source generator doesn't have access to match._matchCount[0]
field, so I used Go
return value to let me know if the input matched or not. For Compiled engine on the other hand, we do have access to the _matchCount field which is why I left it as is, but I agree it would be better to match for a) consistency between generated engines and b) perf.
Ldthisfld(s_runcrawlField); | ||
Ldlen(); | ||
_ilg!.Emit(OpCodes.Conv_I4); | ||
Stfld(s_runcrawlposField); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same questions as on the source gen about whether we actually need to do this resetting.
(We should strive to keep the RegexCompiler implementation and the source generator implementation 1:1 as much as possible, only diverging in places where we're forced to do something special because of C# vs IL (e.g. complications from scoping in C#) or where emitting C# enables further optimizations (e.g. emitting a switch that the C# compiler can further optimize with its heuristics). #Resolved
if (runmatch!._matchcount[0] > 0) | ||
{ | ||
return; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here. Can Go return a bool? #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, it can. Here it wouldn't really be a perf gain like in the compiled engine, just consistency between engines. Interpreted doesn't need it as it can access _matchcount
field, but I'm fine with moving all of them for consistency sake.
// Reset state for another iteration. | ||
runtrackpos = runtrack!.Length; | ||
runstackpos = runstack!.Length; | ||
runcrawlpos = runcrawl!.Length; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unlike RegexCompiler and the source generator, these will need to stay, as the interpreter still uses the "old" approach based on RegexWriter's opcodes. #Resolved
m.Text = input; | ||
} | ||
|
||
// If there was a match and the original text was sliced, then add beggining to the index to get the real |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// If there was a match and the original text was sliced, then add beggining to the index to get the real | |
// If there was a match and the original text was sliced, then add beginning to the index to get the real |
// bump by 1 and stop at textend, but if we're examining right-to-left, we instead bump | ||
// by -1 and stop at textbeg. | ||
int bump = 1, stoppos = text.Length; | ||
Debug.Assert(runregex != null); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think there's value in asserting that something isn't null a line above where it's dereferenced. #Resolved
_timeoutOccursAt = Environment.TickCount + _timeout; | ||
_timeoutChecksToSkip = TimeoutCheckFrequency; | ||
s = text.ToString(); | ||
runtext = s; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, so you have this logic here, it just seems like the runtext = input
logic for string inputs was missing earlier.
As noted, we could alternatively choose to throw a NotSupportedException here if text != s. We should think through the implications. On the one hand, it's nice to implicitly support the span-based APIs when using one of those older CompileToAssembly implementations. On the other hand, this could result in a massive allocation increase when someone switches to using the span-based APIs, and it could easily be a pit of failure; an exception would immediately highlight the problem and help people to know not to use that combination (which should be a rare combination, I hope). #Resolved
} | ||
|
||
Debug.Assert(runregex != null); | ||
Match? match = Scan(runregex, s, 0, text.Length, runtextstart, -1, quick, runregex.internalMatchTimeout); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
text.Length => s.Length ? #Resolved
@@ -207,6 +168,36 @@ public abstract class RegexRunner | |||
} | |||
} | |||
|
|||
internal void InitializeForScan(Regex regex, ReadOnlySpan<char> text, int textbeg, int textend, int textstart, bool quick) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is text
passed in? Or alternatively if textbeg is always 0 and textend is always text.Length, why are textbeg and textend passed in? #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It was originally passed in because this method orignally was also setting runtext. Then I removed that but didn't remove the parameter. You are right about textbeg and end, so I will remove those instead and calculate them based on text instead.
// Environment.TickCount is an int that cycles. We intentionally let timeoutOccursAt | ||
// overflow it will still stay ahead of Environment.TickCount for comparisons made | ||
// in DoCheckTimeout(). | ||
Regex.ValidateMatchTimeout(timeout); // validate timeout as this could be called from user code due to being protected |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A separate discussion, but the logic of this comment seems a bit flawed. Yes, someone can derive from Regex, but if they do and so can set this to a non-sensical value, they can also override FindFirstChar/Go/Scan to do whatever the heck they want. We shouldn't need to pay to validate this value over and over and over again (we already validated the values that came in through the public APIs for consumers of Regex) for such misuse that isn't actually preventing anything nefarious.
|
||
/// <summary> | ||
/// InitTrackCount must initialize the runtrackcount field; this is | ||
/// used to know how large the initial runtrack and runstack arrays | ||
/// must be. | ||
/// </summary> | ||
protected abstract void InitTrackCount(); | ||
protected virtual void InitTrackCount() => runtrackcount = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this need a body or can it be empty? #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It could be empty. I just set it to zero as that was in your original proposal.
{ | ||
if (runmatch is null) | ||
{ | ||
// Use a hashtabled Match object if the capture numbers are sparse | ||
runmatch = runregex!.caps is null ? | ||
new Match(runregex, runregex.capsize, runtext!, runtextbeg, runtextend - runtextbeg, runtextstart) : | ||
new MatchSparse(runregex, runregex.caps, runregex.capsize, runtext!, runtextbeg, runtextend - runtextbeg, runtextstart); | ||
new Match(runregex, runregex.capsize, runtext ?? string.Empty, runtextbeg, runtextend - runtextbeg, runtextstart) : |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Earlier you made the runtext string argument be nullable... why does this then prevent null from being passed in? #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I need to check again, but I believe I added this because a unit test was failing which was checking the equivalence of Match.Empty and a failed match, or something like that.
protected internal override void Scan(ReadOnlySpan<char> text) | ||
{ | ||
ReadOnlySpan<char> inputSpan = runtext; | ||
Go(text); | ||
} | ||
|
||
private void Go(ReadOnlySpan<char> inputSpan) | ||
{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't need Go to be a separate method. The body of Go can just be the body of Scan. #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
totally true, originally I had it separate as in Scan I had a lot of the boilerplate that I had for other engine's Scan methods, but I then started removing all of that since it didn't apply for this engine and was left with just the invocation of Go. Will fix this.
Thanks for all of the great feedback @stephentoub and @danmoseley. Now that these APIs are approved, I will go ahead and close this draft and will send out a real PR with all the previously given feedback addressed shortly. The reason for not turning this one into a real PR instead, is that this one already has a lot of comments which has caused it to have issues when trying to load, so having a clean new one will help with that. |
This is still WIP as these APIs haven't been approved yet and the only purpose of the draft PR is to get early feedback.
cc: @stephentoub