Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Match and isMatch ReadOnlySpan<char> overloads as well as Regex protected methods which expose ReadOnlySpan<char> to avoid allocations. #62509

Closed
wants to merge 6 commits into from

Conversation

joperezr
Copy link
Member

@joperezr joperezr commented Dec 7, 2021

This is still WIP as these APIs haven't been approved yet and the only purpose of the draft PR is to get early feedback.

cc: @stephentoub

@joperezr joperezr added the NO-MERGE The PR is not ready for merge yet (see discussion for detailed reasons) label Dec 7, 2021
@joperezr joperezr requested a review from stephentoub December 7, 2021 23:08
@dotnet-issue-labeler
Copy link

Note regarding the new-api-needs-documentation label:

This serves as a reminder for when your PR is modifying a ref *.cs file and adding/modifying public APIs, to please make sure the API implementation in the src *.cs file is documented with triple slash comments, so the PR reviewers can sign off that change.

@ghost
Copy link

ghost commented Dec 7, 2021

Tagging subscribers to this area: @dotnet/area-system-text-regularexpressions
See info in area-owners.md if you want to be subscribed.

Issue Details

This is still WIP as these APIs haven't been approved yet and the only purpose of the draft PR is to get early feedback.

cc: @stephentoub

Author: joperezr
Assignees: -
Labels:

* NO MERGE *, area-System.Text.RegularExpressions, new-api-needs-documentation

Milestone: -

@joperezr joperezr requested a review from danmoseley December 7, 2021 23:09
@stephentoub
Copy link
Member

@joperezr, thanks for sharing your draft.

There are a few high-level issues here:

  • There's a reason I didn't include Match in my proposed set of span-based APIs. Match(...) returns a Match object representing the first possible match in the input string, but a) things like the values of captures are lazily produced based on the input string stored in the Match object, and b) to get to the next match one needs to call Match.NextMatch(), which also relies on the string input stored in the Match object. Neither of those is possible when the input is a span, since a ref struct can't be stored as a field in a class. We could choose to invent a new ref struct match type that's returned from such a method, however one of the reasons folks want to use spans is reduced allocation, but while such a ref struct match object itself wouldn't be allocating, the data structures underneath it are, e.g. the array used to store all the capture information.
  • Similarly, the existing Scan methods that return Match objects won't work well for span-based inputs, and...
  • protected internal means "protected or internal", so we can't change the signature of those protected internal Match Scan methods anyway to pass in the span: it needs to be a new method. My proposal in the issue was a void Scan accepting a span. The implementation internally populates a Match object, but just as with IsMatch, we can just use that Match as temporary storage, effectively part of the RegexRunner instance, and don't need to return it out of the public API... we can just copy out of that instance whatever data we need and then return that copy, letting that singleton Match object be reused for the next matching operation, thereby enabling it to be ammortized-allocation-free.
  • In the base Go/FindFirstChar methods, you're initializing runtext to the ToString of the span if runtext is null. However, I don't see where runtext is being reset back to null. That would seem to mean it's going to be initialized based on the first input used, and then ignore any subsequent input. On top of that, if a string input is in fact used, and these methods haven't been overridden, we don't want to force the string to a span only to create a new string again. I think for compat we should be storing the string into runtext upfront, and then in the Go/FindFirstChar base virtuals, check whether the span is the same ref/length as the string (one of the few cases where span's Equals has the actual desired semantics).

So as you currently have things, I don't think this is feasible. Might be good to start out thinking about the APIs a little more. I'd proposed overloads of IsMatch as well as a new Enumerate method that would be allocation-free and enable iterating through all matches in the input. However, as I have it proposed, it doesn't provide access to the captures. If we believe access to the captures is important with spans (it arguably is), we should look at something in addition to this, e.g. a new ref struct "match" type that works with span (storing the ref struct span) but still incurs the allocations for the state necessary to store captures. We could then either change the proposed Enumerate API to yield instances of these new ref struct matches (in which case we might want a Boolean or some options input to specify whether it should pay to return all the captures, or we could add Match overloads that return the new type (which would then similarly need NextMatch methods like the current Match).

@joperezr
Copy link
Member Author

joperezr commented Dec 8, 2021

Thanks a bunch for taking a look and for the feedback @stephentoub.

protected internal means "protected or internal", so we can't change the signature of those protected internal Match Scan

Yup, I'm aware of that (in dotnet/iot we heavily use protected internal methods), that was my intention but it looks like I had a typo and accidentally removed internal access modifier from the existing Scan method, I basically just wanted to add a new one instead of changing the existing ones. I'll fix that.

The implementation internally populates a Match object, but just as with IsMatch, we can just use that Match as temporary storage, effectively part of the RegexRunner instance

I see, makes sense, I'll do that.

In the base Go/FindFirstChar methods, you're initializing runtext to the ToString of the span if runtext is null. However, I don't see where runtext is being reset back to null. That would seem to mean it's going to be initialized based on the first input used, and then ignore any subsequent input.

I see, yeah good point. My intention here was that I wouldn't want to break existing custom derived RegexRunners, so I wanted to introduce a new virtual Span overload which by default would just set runtext and call the existing Go/FFC methods which would have already been implemented by the derived types. For new derived types, I was expecting consumers do the opposite (which is what I'm doing here in the interpreted engine) which is to override both overloads, and have the non-Span one just call the Span one passing in runtext.

I will re-read more thoroughly your API proposal and change this to more closely match what you proposed and take into consideration all of your comments here.

@joperezr joperezr force-pushed the AddingSpanRegexOverloads branch 2 times, most recently from e417519 to 63ed6b5 Compare December 20, 2021 20:47
@joperezr
Copy link
Member Author

I've now implemented the desired Scan method with both the Interpreted and Compiled engines. Next I'll work on the Source generator one. I realize that there is a lot of code that can be refactored here to avoid repetition on that Scan loop, but for now I'm mostly focusing on proof of concept implementation and will then polish once this is ready for actual review.

@joperezr joperezr self-assigned this Jan 26, 2022
…bool.

- Added Scan overload to NonBacktracking
- Added Scan overload to CompiledEngine which internally calls the generated Go and FFC dynamic methods.
- Moved a lot of the initialization that happens as part of Scan to occur before Scan, in order to not requiring to expose additional APIs to create custom regexRunners
- Added Scan overload to Interpreted engine
- Moved logic that cleans up the Match object that used to happen at the end of Scan to now happen after Scan() is done.
@joperezr joperezr force-pushed the AddingSpanRegexOverloads branch from 2df7931 to 2960e99 Compare February 4, 2022 17:57
runner.InitializeForGo();
runner.Scan(this, span, startat - beginning, prevlen, quick);
Match? m = runner.runmatch;
runner.runmatch = null; // Reset runmatch
Copy link
Member

@stephentoub stephentoub Feb 4, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't this mean we'll never reuse Match objects from call to call, even if quick == true? #Resolved

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, I still have to figure out what the best way to do this. Should be fixable, but the issue right now is that when quick is true and you find a match, you actually want to return null as that is what we today interpret as "there was a match" like:

return Run(quick: true, -1, input, 0, input.Length, UseOptionR() ? input.Length : 0) is null;

Given Scan doesn't return anything today, and basically uses runmatch as the communication mechanism, then that is why even when quick is true we are setting it to null. That said, after this new refactoring where all of the handling of runmatch happens as part of Run(), this should be easy to fix.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not following. We don't currently null out runmatch for quick; why do we need to here?

if (quick)
{
runmatch!.Text = null!; // drop reference
return null;
}

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't need it currently because the current Scan returns a Match object, so when quick == true and we found a match, we simply return null:

if (match._matchcount[0] > 0)
{
runtext = null; // drop reference to text to avoid keeping it alive in a cache
if (quick)
{
runmatch!.Text = null!; // drop reference
return null;
}

The new Scan method however returns void, and the way to communicate the result back to Run() used to be via runmatch (before I moved all of the handling of runmatch to inside Run() method), which is why I was setting runmatch to null when quick was true. Now that all of the handling of runmatch happens during Run(), I can avoid nulling out runmatch and simply return null from Run method whenever we found a match and quick was set to true.

(index < endpos && RegexCharClass.IsBoundaryWordChar(runtext[index]));
}

protected bool IsBoundary(ReadOnlySpan<char> inputSpan, int index, int startpos, int endpos)
Copy link
Member

@stephentoub stephentoub Feb 4, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need endpos? Isn't endpos always inputSpan.Length?

For that matter, isn't startpos always actually runtextbeg? In which case we don't need it either. At which point this just takes the span and the index. #Resolved

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

startpos is not always runtextbeg, but it is runner.runtextstart. I just copied the signature from the existing methods, which they did take start and end but I do think we can just use the regexRunner fields so these will get refactored anyway.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

startpos is not always runtextbeg, but it is runner.runtextstart

Where is it not runtextbeg? I see these two calls in the interpreter:

case RegexOpcode.Boundary:
if (!IsBoundary(runtextpos, runtextbeg, runtextend))
{
break;
}
advance = 0;
continue;
case RegexOpcode.NonBoundary:
if (IsBoundary(runtextpos, runtextbeg, runtextend))
{
break;
}
advance = 0;
continue;

if (runtext != null)
{
sb.Append(RegexCharClass.DescribeChar(runtext[runtextpos - 1]));
}
Copy link
Member

@stephentoub stephentoub Feb 4, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should consider whether any of this tracing is still valuable, especially if dumping the current state won't help with the newer modes of access, and whether it should all be deleted or simplified significantly. Entirely up to you, as likely the sole person who would ever enable it :) #Resolved

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure, I can handle that in a separate PR if that's ok.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup

writer.WriteLine("// Reset state for another iteration.");
writer.WriteLine("base.runtrackpos = base.runtrack!.Length;");
writer.WriteLine("base.runstackpos = base.runstack!.Length;");
writer.WriteLine("base.runcrawlpos = base.runcrawl!.Length;");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are any of these necessary?

writer.WriteLine("// We failed to find a match. If we're at the end of the input, then we are done.");
using (EmitBlock(writer, "if (base.runtextpos == text.Length)"))
{
writer.WriteLine("base.runmatch = global::System.Text.RegularExpressions.Match.Empty;");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this needed? If we initialize runmatch in the regex lib prior to calling Scan, and if Go ensures it uncaptures everything in the event of a failed match, can we just leave runmatch unmodified, and the regex lib will see that runmatch doesn't contain a top-level capture and will know that means a failed match?

writer.WriteLine();

writer.WriteLine("// If we got a match, we're done.");
using (EmitBlock(writer, "if (Go(text))"))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not against Go returning a bool, but I'm also not clear why it's needed. Can't the caller of Scan see whether there was a match based on the state of the object in runmatch?

protected bool IsMatched(int cap) { throw null; }
protected int MatchIndex(int cap) { throw null; }
protected int MatchLength(int cap) { throw null; }
protected int Popcrawl() { throw null; }
protected internal System.Text.RegularExpressions.Match? Scan(System.Text.RegularExpressions.Regex regex, string text, int textbeg, int textend, int textstart, int prevlen, bool quick) { throw null; }
protected internal System.Text.RegularExpressions.Match? Scan(System.Text.RegularExpressions.Regex regex, string text, int textbeg, int textend, int textstart, int prevlen, bool quick, System.TimeSpan timeout) { throw null; }
protected internal virtual void Scan(System.Text.RegularExpressions.Regex regex, System.ReadOnlySpan<char> text, int textstart, int prevlen, bool quick) { throw null; }
Copy link
Member

@stephentoub stephentoub Feb 4, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As part of exposing this, we should consider whether we want to keep the names as close to the existing ones as possible, or come up with better ones, e.g. is it better for prevlen to be consistent with the existing Scan overloads that we don't actually think anyone uses, or is it better for prevlen to be named previousMatchLength, or something more descriptive like that. I don't have a strong opinion, just something to consider. #Resolved

writer.WriteLine("// Reset state for another iteration.");
writer.WriteLine("base.runtrackpos = base.runtrack!.Length;");
writer.WriteLine("base.runstackpos = base.runstack!.Length;");
writer.WriteLine("base.runcrawlpos = base.runcrawl!.Length;");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are any of these actually necessary? runcrawlpos should be 0 because if there were captures we would have uncaptured all the way back to 0 on a failed match. We don't use runtrack at all. I'd have expected runtrackpos to also end up at runtrack.Length as well, as we use it as a backtracking stack which should get fully unwound.

protected abstract void InitTrackCount();
protected virtual bool FindFirstChar() { throw null; }
protected virtual void Go() { throw null; }
protected virtual void InitTrackCount() { throw null; }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we delete our generated override of InitTrackCount in the source generator? What breaks if runtrackcount remains 0? I think that'll just mean that we end up with the default minimums enforced by InitializeForGo (or whatever it's now called). We should probably delete the override and then confirm the defaults make sense, and/or determine whether there's a different number we could be computing that would help presize this better. If memory serves, right now it's yielding the number of opcodes produced by RegexWriter that are backtracking... but that's no longer particularly relevant to how we use runtrack, and we don't use runstack at all... so there's a fair bit of waste here to be cleaned up.

@@ -7,25 +7,21 @@ namespace System.Text.RegularExpressions
{
internal sealed class CompiledRegexRunnerFactory : RegexRunnerFactory
{
private readonly DynamicMethod _goMethod;
private readonly DynamicMethod _findFirstCharMethod;
private readonly DynamicMethod _scanMethod;
private readonly int _trackcount;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As with the source generator, it'd be nice to see whether we can just delete _trackcount, the setting of it, etc. That'll contribute to #62450.

/// </summary>
/// <param name="input">The input span to be searched on.</param>
/// <param name="pattern">The Regex pattern to be used for matching.</param>
/// <returns><see langword="true"/> if the input matches the pattern, <see langword="false"/> otherwise.</returns>
Copy link
Member

@stephentoub stephentoub Feb 8, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For all of these doc comments, it'd be nice to start with the wording currently used in the docs for the corresponding string-based methods, e.g.
https://docs.microsoft.com/en-us/dotnet/api/system.text.regularexpressions.regex.ismatch?view=net-6.0#system-text-regularexpressions-regex-ismatch(system-string-system-string) #Resolved

public bool IsMatch(ReadOnlySpan<char> input)
{
return Run(quick: true, -1, input, 0, input.Length, UseOptionR() ? input.Length : 0) is null;
}
Copy link
Member

@stephentoub stephentoub Feb 8, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: the other methods are all expression-bodied members, but this one isn't #Resolved

return runner.Scan(this, input, beginning, beginning + length, startat, prevlen, quick, internalMatchTimeout);
bool skipScan = false;
runner.InitializeTimeout(internalMatchTimeout);
ReadOnlySpan<char> span = input.AsSpan(beginning, length);
Copy link
Member

@stephentoub stephentoub Feb 8, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand how this is avoiding allocations when one of the string-based entrypoints is used with an existing CompileToAssembly assembly. Doesn't it need to store the input into runtext so that the original string is available to overrides of FindFirstChar and Go?

My expectation was that this would set runner.runtext = input and then the base Scan function (that all of our engines override but a CompileToAssembly assembly wouldn't) would do a check like if (inputSpan != runtext) runtext = inputSpan.ToString();. #Resolved

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A valid alternative to if (inputSpan != runtext) runtext = inputSpan.ToString(); would also be to throw an exception in the base Scan method if runtext is null. That should only be hit if a) a new span-based entrypoint is used and b) an existing CompileToAssembly assembly is being used that doesn't override Scan, and we could say that combination simply isn't supported.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The way I have it now, is that instead of Regex.Run doing the runner.runtext = input, I'm doing that in the default implementation of Scan as you suggest:

protected internal virtual void Scan(ReadOnlySpan<char> text)
{
string? s = runtext;
if (text != s)
{
s = text.ToString();
runtext = s;
}
Debug.Assert(runregex != null);
Match? match = Scan(runregex, s, 0, text.Length, runtextstart, -1, quick, runregex.internalMatchTimeout);
runmatch = match;
}

This way, setting runtext will ONLY happen now when using CompiledToAssembly. All of the other engines will override the Scan method, and won't be setting runtext any longer, which is how we prevent that extra allocation.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The way I have it now, is that instead of Regex.Run doing the runner.runtext = input, I'm doing that in the default implementation of Scan as you suggest:

We need to do both. If you don't set runner.runtext = input in the Run method, then when it gets to this point in Scan, runtext will be null and we'll always end up doing text.ToString.

Comment on lines +389 to +390
runner.InitializeForScan(this, span, 0, span.Length, startat - beginning, quick);
runner.InitializeForGo();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is InitializeForGo still needed? Its contents can't be in InitializeForScan?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No reason, it can totally be absorbed into Initialize for Scan. I didn't do that refactoring as it wasn't really relevant for the purposes of the API exposure POC, but I can for sure do this as part of the actual implementation once the API is approved.

bool skipScan = false;
runner.InitializeTimeout(internalMatchTimeout);
ReadOnlySpan<char> span = input.AsSpan(beginning, length);
runner.InitializeForScan(this, span, 0, span.Length, startat - beginning, quick);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why does this need to pass in the 0 and span.Length? We're passing in span.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, it can totally work without those two as all the callsites are passing the same thing since textbeg and textend is set to be beginning and end of the sliced span.

Comment on lines +410 to +417
// if we got a match, set runmatch to null if quick is true
if (runner.runmatch!._matchcount[0] > 0)
{
if (quick)
{
runner.runmatch = null;
}
}
Copy link
Member

@stephentoub stephentoub Feb 8, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As noted, we really need to not do this. This is a ship stopper. It'll cause every IsMatch/Replace/Split/Count to start skyrocketing in allocations.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, this little implementation detail is there as part of a copy paste from code that used to be in individual implementations of Scan on the engine, to code that now lives in Run(), and it was originally meant to serve for IsMatch() purposes where a null match means "The input is a match for pattern". Now that all of that lives in Run, we don't have to reset runmatch, only to return null from Run() (which we couldn't do from the new Scan since it returns void)

Comment on lines +468 to +471
runner.InitializeTimeout(internalMatchTimeout);
ReadOnlySpan<char> span = input.Slice(beginning, length);
runner.InitializeForScan(this, span, 0, span.Length, startat - beginning, quick);
runner.InitializeForGo();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we need a method to configure state on the runner, we should have it be a single method rather than three :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Although I see later that at least the timeout initialization is separate out to before a loop, so maybe that makes sense on its own)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree, as said above, this will be refactored into two:

  • InitializeTimeout: which will only be called once particularly when being called by Run<TState> which has a loop inside
  • InitializeForScan: which will have contents of what today is IntializeForScan and IntializeForGo, which when being called by Run<TState> do need to be called for each new iteration of the loop.

return;
}
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we implement this method in terms of the other Run overload? It'd be nice to avoid all this complicated duplication. I realize it'll come at a small cost, but hopefully it'd be fairly minor.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I can do that.

// {
Ldthisfld(s_runregexField);
Callvirt(s_regexGetRightToLeft);
BrfalseFar(notRightToLeft);
Copy link
Member

@stephentoub stephentoub Feb 8, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We no longer implement RightToLeft support in RegexCompiler; any use of RightToLeft ends up in the interpreter:

if ((Options & (RegexOptions.RightToLeft | RegexOptions.NonBacktracking)) != 0)
{
// NonBacktracking isn't supported, nor RightToLeft. The latter applies to both the top-level
// options as well as when used to specify positive and negative lookbehinds.
return false;
}
#Resolved

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, yeah I did know that and that's why I removed all of this logic form the source generated emitter, but forgot to also remove it from the Compiled engine, will fix.


// CheckTimeout();
Ldthis();
Call(s_checkTimeoutMethod);
Copy link
Member

@stephentoub stephentoub Feb 8, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing guard on the timeout check. #Resolved

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mind elaborating? Are you referring to the existing if (ignoreTimeout)? If so, we don't do that any longer since timeout doesn't get passed through to Scan. Before we had the guard because we were calling DoCheckTimeout() as opposed to CheckTimeout which has the guard inside of it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mind elaborating?

if (_hasTimeout)
{
    Ldthis();
    Call(s_checkTimeoutMethod);
}

Ldc(0);
_ilg!.Emit(OpCodes.Cgt);
BrfalseFar(afterSuccessMatchLabel);
BrFar(returnLabel);
Copy link
Member

@stephentoub stephentoub Feb 8, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't want to do the same thing we do in source gen and have Go return a bool rather than having this code reach into the match object to inspect the match count? The bool should be faster and simpler. #Resolved

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can do that as well here. I originally did it only in the source generator because main idea with that is that source generator doesn't have access to match._matchCount[0] field, so I used Go return value to let me know if the input matched or not. For Compiled engine on the other hand, we do have access to the _matchCount field which is why I left it as is, but I agree it would be better to match for a) consistency between generated engines and b) perf.

Ldthisfld(s_runcrawlField);
Ldlen();
_ilg!.Emit(OpCodes.Conv_I4);
Stfld(s_runcrawlposField);
Copy link
Member

@stephentoub stephentoub Feb 8, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same questions as on the source gen about whether we actually need to do this resetting.

(We should strive to keep the RegexCompiler implementation and the source generator implementation 1:1 as much as possible, only diverging in places where we're forced to do something special because of C# vs IL (e.g. complications from scoping in C#) or where emitting C# enables further optimizations (e.g. emitting a switch that the C# compiler can further optimize with its heuristics). #Resolved

if (runmatch!._matchcount[0] > 0)
{
return;
}
Copy link
Member

@stephentoub stephentoub Feb 8, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here. Can Go return a bool? #Resolved

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, it can. Here it wouldn't really be a perf gain like in the compiled engine, just consistency between engines. Interpreted doesn't need it as it can access _matchcount field, but I'm fine with moving all of them for consistency sake.

// Reset state for another iteration.
runtrackpos = runtrack!.Length;
runstackpos = runstack!.Length;
runcrawlpos = runcrawl!.Length;
Copy link
Member

@stephentoub stephentoub Feb 8, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unlike RegexCompiler and the source generator, these will need to stay, as the interpreter still uses the "old" approach based on RegexWriter's opcodes. #Resolved

m.Text = input;
}

// If there was a match and the original text was sliced, then add beggining to the index to get the real
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// If there was a match and the original text was sliced, then add beggining to the index to get the real
// If there was a match and the original text was sliced, then add beginning to the index to get the real

// bump by 1 and stop at textend, but if we're examining right-to-left, we instead bump
// by -1 and stop at textbeg.
int bump = 1, stoppos = text.Length;
Debug.Assert(runregex != null);
Copy link
Member

@danmoseley danmoseley Feb 8, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think there's value in asserting that something isn't null a line above where it's dereferenced. #Resolved

_timeoutOccursAt = Environment.TickCount + _timeout;
_timeoutChecksToSkip = TimeoutCheckFrequency;
s = text.ToString();
runtext = s;
Copy link
Member

@stephentoub stephentoub Feb 8, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, so you have this logic here, it just seems like the runtext = input logic for string inputs was missing earlier.

As noted, we could alternatively choose to throw a NotSupportedException here if text != s. We should think through the implications. On the one hand, it's nice to implicitly support the span-based APIs when using one of those older CompileToAssembly implementations. On the other hand, this could result in a massive allocation increase when someone switches to using the span-based APIs, and it could easily be a pit of failure; an exception would immediately highlight the problem and help people to know not to use that combination (which should be a rare combination, I hope). #Resolved

}

Debug.Assert(runregex != null);
Match? match = Scan(runregex, s, 0, text.Length, runtextstart, -1, quick, runregex.internalMatchTimeout);
Copy link
Member

@stephentoub stephentoub Feb 8, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

text.Length => s.Length ? #Resolved

@@ -207,6 +168,36 @@ public abstract class RegexRunner
}
}

internal void InitializeForScan(Regex regex, ReadOnlySpan<char> text, int textbeg, int textend, int textstart, bool quick)
Copy link
Member

@stephentoub stephentoub Feb 8, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is text passed in? Or alternatively if textbeg is always 0 and textend is always text.Length, why are textbeg and textend passed in? #Resolved

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was originally passed in because this method orignally was also setting runtext. Then I removed that but didn't remove the parameter. You are right about textbeg and end, so I will remove those instead and calculate them based on text instead.

// Environment.TickCount is an int that cycles. We intentionally let timeoutOccursAt
// overflow it will still stay ahead of Environment.TickCount for comparisons made
// in DoCheckTimeout().
Regex.ValidateMatchTimeout(timeout); // validate timeout as this could be called from user code due to being protected
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A separate discussion, but the logic of this comment seems a bit flawed. Yes, someone can derive from Regex, but if they do and so can set this to a non-sensical value, they can also override FindFirstChar/Go/Scan to do whatever the heck they want. We shouldn't need to pay to validate this value over and over and over again (we already validated the values that came in through the public APIs for consumers of Regex) for such misuse that isn't actually preventing anything nefarious.


/// <summary>
/// InitTrackCount must initialize the runtrackcount field; this is
/// used to know how large the initial runtrack and runstack arrays
/// must be.
/// </summary>
protected abstract void InitTrackCount();
protected virtual void InitTrackCount() => runtrackcount = 0;
Copy link
Member

@stephentoub stephentoub Feb 8, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this need a body or can it be empty? #Resolved

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It could be empty. I just set it to zero as that was in your original proposal.

{
if (runmatch is null)
{
// Use a hashtabled Match object if the capture numbers are sparse
runmatch = runregex!.caps is null ?
new Match(runregex, runregex.capsize, runtext!, runtextbeg, runtextend - runtextbeg, runtextstart) :
new MatchSparse(runregex, runregex.caps, runregex.capsize, runtext!, runtextbeg, runtextend - runtextbeg, runtextstart);
new Match(runregex, runregex.capsize, runtext ?? string.Empty, runtextbeg, runtextend - runtextbeg, runtextstart) :
Copy link
Member

@stephentoub stephentoub Feb 8, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Earlier you made the runtext string argument be nullable... why does this then prevent null from being passed in? #Resolved

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I need to check again, but I believe I added this because a unit test was failing which was checking the equivalence of Match.Empty and a failed match, or something like that.

Comment on lines +87 to +93
protected internal override void Scan(ReadOnlySpan<char> text)
{
ReadOnlySpan<char> inputSpan = runtext;
Go(text);
}

private void Go(ReadOnlySpan<char> inputSpan)
{
Copy link
Member

@stephentoub stephentoub Feb 8, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't need Go to be a separate method. The body of Go can just be the body of Scan. #Resolved

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

totally true, originally I had it separate as in Scan I had a lot of the boilerplate that I had for other engine's Scan methods, but I then started removing all of that since it didn't apply for this engine and was left with just the invocation of Go. Will fix this.

@joperezr
Copy link
Member Author

Thanks for all of the great feedback @stephentoub and @danmoseley. Now that these APIs are approved, I will go ahead and close this draft and will send out a real PR with all the previously given feedback addressed shortly. The reason for not turning this one into a real PR instead, is that this one already has a lot of comments which has caused it to have issues when trying to load, so having a clean new one will help with that.

@joperezr joperezr closed this Feb 15, 2022
@ghost ghost locked as resolved and limited conversation to collaborators Mar 19, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-System.Text.RegularExpressions new-api-needs-documentation NO-MERGE The PR is not ready for merge yet (see discussion for detailed reasons)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants