Add Match and isMatch ReadOnlySpan<char> overloads as well as Regex protected methods which expose ReadOnlySpan<char> to avoid allocations. #62509

joperezr · 2021-12-07T23:08:20Z

This is still WIP as these APIs haven't been approved yet and the only purpose of the draft PR is to get early feedback.

cc: @stephentoub

dotnet-issue-labeler · 2021-12-07T23:08:31Z

Note regarding the new-api-needs-documentation label:

This serves as a reminder for when your PR is modifying a ref *.cs file and adding/modifying public APIs, to please make sure the API implementation in the src *.cs file is documented with triple slash comments, so the PR reviewers can sign off that change.

ghost · 2021-12-07T23:08:39Z

Tagging subscribers to this area: @dotnet/area-system-text-regularexpressions
See info in area-owners.md if you want to be subscribed.

Issue Details

This is still WIP as these APIs haven't been approved yet and the only purpose of the draft PR is to get early feedback.

cc: @stephentoub

Author:	joperezr
Assignees:	-
Labels:	`* NO MERGE *`, `area-System.Text.RegularExpressions`, `new-api-needs-documentation`
Milestone:	-

...raries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexInterpreter.cs

stephentoub · 2021-12-08T14:51:11Z

@joperezr, thanks for sharing your draft.

There are a few high-level issues here:

There's a reason I didn't include Match in my proposed set of span-based APIs. Match(...) returns a Match object representing the first possible match in the input string, but a) things like the values of captures are lazily produced based on the input string stored in the Match object, and b) to get to the next match one needs to call Match.NextMatch(), which also relies on the string input stored in the Match object. Neither of those is possible when the input is a span, since a ref struct can't be stored as a field in a class. We could choose to invent a new ref struct match type that's returned from such a method, however one of the reasons folks want to use spans is reduced allocation, but while such a ref struct match object itself wouldn't be allocating, the data structures underneath it are, e.g. the array used to store all the capture information.
Similarly, the existing Scan methods that return Match objects won't work well for span-based inputs, and...
protected internal means "protected or internal", so we can't change the signature of those protected internal Match Scan methods anyway to pass in the span: it needs to be a new method. My proposal in the issue was a void Scan accepting a span. The implementation internally populates a Match object, but just as with IsMatch, we can just use that Match as temporary storage, effectively part of the RegexRunner instance, and don't need to return it out of the public API... we can just copy out of that instance whatever data we need and then return that copy, letting that singleton Match object be reused for the next matching operation, thereby enabling it to be ammortized-allocation-free.
In the base Go/FindFirstChar methods, you're initializing runtext to the ToString of the span if runtext is null. However, I don't see where runtext is being reset back to null. That would seem to mean it's going to be initialized based on the first input used, and then ignore any subsequent input. On top of that, if a string input is in fact used, and these methods haven't been overridden, we don't want to force the string to a span only to create a new string again. I think for compat we should be storing the string into runtext upfront, and then in the Go/FindFirstChar base virtuals, check whether the span is the same ref/length as the string (one of the few cases where span's Equals has the actual desired semantics).

So as you currently have things, I don't think this is feasible. Might be good to start out thinking about the APIs a little more. I'd proposed overloads of IsMatch as well as a new Enumerate method that would be allocation-free and enable iterating through all matches in the input. However, as I have it proposed, it doesn't provide access to the captures. If we believe access to the captures is important with spans (it arguably is), we should look at something in addition to this, e.g. a new ref struct "match" type that works with span (storing the ref struct span) but still incurs the allocations for the state necessary to store captures. We could then either change the proposed Enumerate API to yield instances of these new ref struct matches (in which case we might want a Boolean or some options input to specify whether it should pay to return all the captures, or we could add Match overloads that return the new type (which would then similarly need NextMatch methods like the current Match).

joperezr · 2021-12-08T17:49:40Z

Thanks a bunch for taking a look and for the feedback @stephentoub.

protected internal means "protected or internal", so we can't change the signature of those protected internal Match Scan

Yup, I'm aware of that (in dotnet/iot we heavily use protected internal methods), that was my intention but it looks like I had a typo and accidentally removed internal access modifier from the existing Scan method, I basically just wanted to add a new one instead of changing the existing ones. I'll fix that.

The implementation internally populates a Match object, but just as with IsMatch, we can just use that Match as temporary storage, effectively part of the RegexRunner instance

I see, makes sense, I'll do that.

In the base Go/FindFirstChar methods, you're initializing runtext to the ToString of the span if runtext is null. However, I don't see where runtext is being reset back to null. That would seem to mean it's going to be initialized based on the first input used, and then ignore any subsequent input.

I see, yeah good point. My intention here was that I wouldn't want to break existing custom derived RegexRunners, so I wanted to introduce a new virtual Span overload which by default would just set runtext and call the existing Go/FFC methods which would have already been implemented by the derived types. For new derived types, I was expecting consumers do the opposite (which is what I'm doing here in the interpreted engine) which is to override both overloads, and have the non-Span one just call the Span one passing in runtext.

I will re-read more thoroughly your API proposal and change this to more closely match what you proposed and take into consideration all of your comments here.

joperezr · 2021-12-21T20:32:07Z

I've now implemented the desired Scan method with both the Interpreted and Compiled engines. Next I'll work on the Source generator one. I realize that there is a lot of code that can be refactored here to avoid repetition on that Scan loop, but for now I'm mostly focusing on proof of concept implementation and will then polish once this is ready for actual review.

src/libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/Capture.cs

...ies/System.Text.RegularExpressions/src/System/Text/RegularExpressions/CompiledRegexRunner.cs

src/libraries/System.Text.RegularExpressions/ref/System.Text.RegularExpressions.cs

...ies/System.Text.RegularExpressions/src/System/Text/RegularExpressions/CompiledRegexRunner.cs

...raries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexInterpreter.cs

src/libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/Regex.cs

…h all tests passing.

…bool. - Added Scan overload to NonBacktracking - Added Scan overload to CompiledEngine which internally calls the generated Go and FFC dynamic methods. - Moved a lot of the initialization that happens as part of Scan to occur before Scan, in order to not requiring to expose additional APIs to create custom regexRunners - Added Scan overload to Interpreted engine - Moved logic that cleans up the Match object that used to happen at the end of Scan to now happen after Scan() is done.

...ies/System.Text.RegularExpressions/src/System/Text/RegularExpressions/CompiledRegexRunner.cs

stephentoub · 2022-02-04T19:02:22Z

src/libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/Regex.cs

+                runner.InitializeForGo();
+                runner.Scan(this, span, startat - beginning, prevlen, quick);
+                Match? m = runner.runmatch;
+                runner.runmatch = null; // Reset runmatch


Doesn't this mean we'll never reuse Match objects from call to call, even if quick == true? #Resolved

yeah, I still have to figure out what the best way to do this. Should be fixable, but the issue right now is that when quick is true and you find a match, you actually want to return null as that is what we today interpret as "there was a match" like:

runtime/src/libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/Regex.Match.cs

Line 38 in 05dbb13

return Run(quick: true, -1, input, 0, input.Length, UseOptionR() ? input.Length : 0) is null;

Given Scan doesn't return anything today, and basically uses runmatch as the communication mechanism, then that is why even when quick is true we are setting it to null. That said, after this new refactoring where all of the handling of runmatch happens as part of Run(), this should be easy to fix.

I'm not following. We don't currently null out runmatch for quick; why do we need to here?

runtime/src/libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexRunner.cs

Lines 179 to 183 in 05dbb13

if (quick)

{

runmatch!.Text = null!; // drop reference

return null;

}

We don't need it currently because the current Scan returns a Match object, so when quick == true and we found a match, we simply return null:

runtime/src/libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexRunner.cs

Lines 175 to 183 in 87777b6

if (match._matchcount[0] > 0)

{

runtext = null; // drop reference to text to avoid keeping it alive in a cache

if (quick)

{

runmatch!.Text = null!; // drop reference

return null;

}

The new Scan method however returns void, and the way to communicate the result back to Run() used to be via runmatch (before I moved all of the handling of runmatch to inside Run() method), which is why I was setting runmatch to null when quick was true. Now that all of the handling of runmatch happens during Run(), I can avoid nulling out runmatch and simply return null from Run method whenever we found a match and quick was set to true.

src/libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexRunner.cs

stephentoub · 2022-02-04T19:05:01Z

src/libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexRunner.cs

+                   (index < endpos && RegexCharClass.IsBoundaryWordChar(runtext[index]));
+        }
+
+        protected bool IsBoundary(ReadOnlySpan<char> inputSpan, int index, int startpos, int endpos)


Do we need endpos? Isn't endpos always inputSpan.Length?

For that matter, isn't startpos always actually runtextbeg? In which case we don't need it either. At which point this just takes the span and the index. #Resolved

startpos is not always runtextbeg, but it is runner.runtextstart. I just copied the signature from the existing methods, which they did take start and end but I do think we can just use the regexRunner fields so these will get refactored anyway.

startpos is not always runtextbeg, but it is runner.runtextstart

Where is it not runtextbeg? I see these two calls in the interpreter:

runtime/src/libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexInterpreter.cs

Lines 713 to 727 in af42c8e

case RegexOpcode.Boundary:

if (!IsBoundary(runtextpos, runtextbeg, runtextend))

{

break;

}

advance = 0;

continue;

case RegexOpcode.NonBoundary:

if (IsBoundary(runtextpos, runtextbeg, runtextend))

{

break;

}

advance = 0;

continue;

stephentoub · 2022-02-04T19:08:08Z

src/libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexRunner.cs

+                    if (runtext != null)
+                    {
+                        sb.Append(RegexCharClass.DescribeChar(runtext[runtextpos - 1]));
+                    }


We should consider whether any of this tracing is still valuable, especially if dumping the current state won't help with the newer modes of access, and whether it should all be deleted or simplified significantly. Entirely up to you, as likely the sole person who would ever enable it :) #Resolved

sure, I can handle that in a separate PR if that's ok.

stephentoub · 2022-02-04T19:21:08Z

src/libraries/System.Text.RegularExpressions/gen/RegexGenerator.Emitter.cs

+                        writer.WriteLine("// Reset state for another iteration.");
+                        writer.WriteLine("base.runtrackpos = base.runtrack!.Length;");
+                        writer.WriteLine("base.runstackpos = base.runstack!.Length;");
+                        writer.WriteLine("base.runcrawlpos = base.runcrawl!.Length;");


Are any of these necessary?

stephentoub · 2022-02-04T19:22:41Z

src/libraries/System.Text.RegularExpressions/gen/RegexGenerator.Emitter.cs

+                    writer.WriteLine("// We failed to find a match. If we're at the end of the input, then we are done.");
+                    using (EmitBlock(writer, "if (base.runtextpos == text.Length)"))
+                    {
+                        writer.WriteLine("base.runmatch = global::System.Text.RegularExpressions.Match.Empty;");


Why is this needed? If we initialize runmatch in the regex lib prior to calling Scan, and if Go ensures it uncaptures everything in the event of a failed match, can we just leave runmatch unmodified, and the regex lib will see that runmatch doesn't contain a top-level capture and will know that means a failed match?

src/libraries/System.Text.RegularExpressions/gen/RegexGenerator.Emitter.cs

stephentoub · 2022-02-04T19:23:49Z

src/libraries/System.Text.RegularExpressions/gen/RegexGenerator.Emitter.cs

+                        writer.WriteLine();
+
+                        writer.WriteLine("// If we got a match, we're done.");
+                        using (EmitBlock(writer, "if (Go(text))"))


I'm not against Go returning a bool, but I'm also not clear why it's needed. Can't the caller of Scan see whether there was a match based on the state of the object in runmatch?

stephentoub · 2022-02-04T19:26:34Z

src/libraries/System.Text.RegularExpressions/ref/System.Text.RegularExpressions.cs

        protected bool IsMatched(int cap) { throw null; }
        protected int MatchIndex(int cap) { throw null; }
        protected int MatchLength(int cap) { throw null; }
        protected int Popcrawl() { throw null; }
        protected internal System.Text.RegularExpressions.Match? Scan(System.Text.RegularExpressions.Regex regex, string text, int textbeg, int textend, int textstart, int prevlen, bool quick) { throw null; }
        protected internal System.Text.RegularExpressions.Match? Scan(System.Text.RegularExpressions.Regex regex, string text, int textbeg, int textend, int textstart, int prevlen, bool quick, System.TimeSpan timeout) { throw null; }
+        protected internal virtual void Scan(System.Text.RegularExpressions.Regex regex, System.ReadOnlySpan<char> text, int textstart, int prevlen, bool quick) { throw null; }


As part of exposing this, we should consider whether we want to keep the names as close to the existing ones as possible, or come up with better ones, e.g. is it better for prevlen to be consistent with the existing Scan overloads that we don't actually think anyone uses, or is it better for prevlen to be named previousMatchLength, or something more descriptive like that. I don't have a strong opinion, just something to consider. #Resolved

…ouch runmatch any more.

src/libraries/System.Text.RegularExpressions/gen/RegexGenerator.Emitter.cs

stephentoub · 2022-02-08T21:48:38Z

src/libraries/System.Text.RegularExpressions/gen/RegexGenerator.Emitter.cs

+                    writer.WriteLine("// Reset state for another iteration.");
+                    writer.WriteLine("base.runtrackpos = base.runtrack!.Length;");
+                    writer.WriteLine("base.runstackpos = base.runstack!.Length;");
+                    writer.WriteLine("base.runcrawlpos = base.runcrawl!.Length;");


Are any of these actually necessary? runcrawlpos should be 0 because if there were captures we would have uncaptured all the way back to 0 on a failed match. We don't use runtrack at all. I'd have expected runtrackpos to also end up at runtrack.Length as well, as we use it as a backtracking stack which should get fully unwound.

stephentoub · 2022-02-08T21:54:21Z

src/libraries/System.Text.RegularExpressions/ref/System.Text.RegularExpressions.cs

-        protected abstract void InitTrackCount();
+        protected virtual bool FindFirstChar() { throw null; }
+        protected virtual void Go() { throw null; }
+        protected virtual void InitTrackCount() { throw null; }


Can we delete our generated override of InitTrackCount in the source generator? What breaks if runtrackcount remains 0? I think that'll just mean that we end up with the default minimums enforced by InitializeForGo (or whatever it's now called). We should probably delete the override and then confirm the defaults make sense, and/or determine whether there's a different number we could be computing that would help presize this better. If memory serves, right now it's yielding the number of opcodes produced by RegexWriter that are backtracking... but that's no longer particularly relevant to how we use runtrack, and we don't use runstack at all... so there's a fair bit of waste here to be cleaned up.

stephentoub · 2022-02-08T21:58:21Z

...tem.Text.RegularExpressions/src/System/Text/RegularExpressions/CompiledRegexRunnerFactory.cs

@@ -7,25 +7,21 @@ namespace System.Text.RegularExpressions
 {
    internal sealed class CompiledRegexRunnerFactory : RegexRunnerFactory
    {
-        private readonly DynamicMethod _goMethod;
-        private readonly DynamicMethod _findFirstCharMethod;
+        private readonly DynamicMethod _scanMethod;
        private readonly int _trackcount;


As with the source generator, it'd be nice to see whether we can just delete _trackcount, the setting of it, etc. That'll contribute to #62450.

stephentoub · 2022-02-08T22:03:23Z

src/libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/Regex.Match.cs

+        /// </summary>
+        /// <param name="input">The input span to be searched on.</param>
+        /// <param name="pattern">The Regex pattern to be used for matching.</param>
+        /// <returns><see langword="true"/> if the input matches the pattern, <see langword="false"/> otherwise.</returns>


For all of these doc comments, it'd be nice to start with the wording currently used in the docs for the corresponding string-based methods, e.g.
https://docs.microsoft.com/en-us/dotnet/api/system.text.regularexpressions.regex.ismatch?view=net-6.0#system-text-regularexpressions-regex-ismatch(system-string-system-string) #Resolved

stephentoub · 2022-02-08T22:03:33Z

src/libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/Regex.Match.cs

+        public bool IsMatch(ReadOnlySpan<char> input)
+        {
+            return Run(quick: true, -1, input, 0, input.Length, UseOptionR() ? input.Length : 0) is null;
+        }


Nit: the other methods are all expression-bodied members, but this one isn't #Resolved

stephentoub · 2022-02-08T22:07:00Z

src/libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/Regex.cs

-                return runner.Scan(this, input, beginning, beginning + length, startat, prevlen, quick, internalMatchTimeout);
+                bool skipScan = false;
+                runner.InitializeTimeout(internalMatchTimeout);
+                ReadOnlySpan<char> span = input.AsSpan(beginning, length);


I don't understand how this is avoiding allocations when one of the string-based entrypoints is used with an existing CompileToAssembly assembly. Doesn't it need to store the input into runtext so that the original string is available to overrides of FindFirstChar and Go?

My expectation was that this would set runner.runtext = input and then the base Scan function (that all of our engines override but a CompileToAssembly assembly wouldn't) would do a check like if (inputSpan != runtext) runtext = inputSpan.ToString();. #Resolved

A valid alternative to if (inputSpan != runtext) runtext = inputSpan.ToString(); would also be to throw an exception in the base Scan method if runtext is null. That should only be hit if a) a new span-based entrypoint is used and b) an existing CompileToAssembly assembly is being used that doesn't override Scan, and we could say that combination simply isn't supported.

The way I have it now, is that instead of Regex.Run doing the runner.runtext = input, I'm doing that in the default implementation of Scan as you suggest:

runtime/src/libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexRunner.cs

Lines 91 to 103 in cc92623

protected internal virtual void Scan(ReadOnlySpan<char> text)

{

string? s = runtext;

if (text != s)

{

s = text.ToString();

runtext = s;

}

Debug.Assert(runregex != null);

Match? match = Scan(runregex, s, 0, text.Length, runtextstart, -1, quick, runregex.internalMatchTimeout);

runmatch = match;

}

This way, setting runtext will ONLY happen now when using CompiledToAssembly. All of the other engines will override the Scan method, and won't be setting runtext any longer, which is how we prevent that extra allocation.

The way I have it now, is that instead of Regex.Run doing the runner.runtext = input, I'm doing that in the default implementation of Scan as you suggest:

We need to do both. If you don't set runner.runtext = input in the Run method, then when it gets to this point in Scan, runtext will be null and we'll always end up doing text.ToString.

stephentoub · 2022-02-08T22:21:04Z

src/libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/Regex.cs

+                runner.InitializeForScan(this, span, 0, span.Length, startat - beginning, quick);
+                runner.InitializeForGo();


Why is InitializeForGo still needed? Its contents can't be in InitializeForScan?

No reason, it can totally be absorbed into Initialize for Scan. I didn't do that refactoring as it wasn't really relevant for the purposes of the API exposure POC, but I can for sure do this as part of the actual implementation once the API is approved.

stephentoub · 2022-02-08T22:21:55Z

src/libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/Regex.cs

+                bool skipScan = false;
+                runner.InitializeTimeout(internalMatchTimeout);
+                ReadOnlySpan<char> span = input.AsSpan(beginning, length);
+                runner.InitializeForScan(this, span, 0, span.Length, startat - beginning, quick);


Why does this need to pass in the 0 and span.Length? We're passing in span.

Good point, it can totally work without those two as all the callsites are passing the same thing since textbeg and textend is set to be beginning and end of the sliced span.

stephentoub · 2022-02-08T22:22:47Z

src/libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/Regex.cs

+                    // if we got a match, set runmatch to null if quick is true
+                    if (runner.runmatch!._matchcount[0] > 0)
+                    {
+                        if (quick)
+                        {
+                            runner.runmatch = null;
+                        }
+                    }


As noted, we really need to not do this. This is a ship stopper. It'll cause every IsMatch/Replace/Split/Count to start skyrocketing in allocations.

Yeah, this little implementation detail is there as part of a copy paste from code that used to be in individual implementations of Scan on the engine, to code that now lives in Run(), and it was originally meant to serve for IsMatch() purposes where a null match means "The input is a match for pattern". Now that all of that lives in Run, we don't have to reset runmatch, only to return null from Run() (which we couldn't do from the new Scan since it returns void)

stephentoub · 2022-02-08T22:26:48Z

src/libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/Regex.cs

+                runner.InitializeTimeout(internalMatchTimeout);
+                ReadOnlySpan<char> span = input.Slice(beginning, length);
+                runner.InitializeForScan(this, span, 0, span.Length, startat - beginning, quick);
+                runner.InitializeForGo();


If we need a method to configure state on the runner, we should have it be a single method rather than three :)

(Although I see later that at least the timeout initialization is separate out to before a loop, so maybe that makes sense on its own)

Agree, as said above, this will be refactored into two:

InitializeTimeout: which will only be called once particularly when being called by Run<TState> which has a loop inside

InitializeForScan: which will have contents of what today is IntializeForScan and IntializeForGo, which when being called by Run<TState> do need to be called for each new iteration of the loop.

stephentoub · 2022-02-08T22:29:02Z

src/libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/Regex.cs

+                            return;
+                        }
+                    }
+                }


Can we implement this method in terms of the other Run overload? It'd be nice to avoid all this complicated duplication. I realize it'll come at a small cost, but hopefully it'd be fairly minor.

Sure, I can do that.

stephentoub · 2022-02-08T22:31:25Z

...libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexCompiler.cs

+            // {
+            Ldthisfld(s_runregexField);
+            Callvirt(s_regexGetRightToLeft);
+            BrfalseFar(notRightToLeft);


We no longer implement RightToLeft support in RegexCompiler; any use of RightToLeft ends up in the interpreter:

runtime/src/libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexNode.cs

Lines 2513 to 2518 in af42c8e

if ((Options & (RegexOptions.RightToLeft | RegexOptions.NonBacktracking)) != 0)

{

// NonBacktracking isn't supported, nor RightToLeft. The latter applies to both the top-level

// options as well as when used to specify positive and negative lookbehinds.

return false;

}

#Resolved

I see, yeah I did know that and that's why I removed all of this logic form the source generated emitter, but forgot to also remove it from the Compiled engine, will fix.

stephentoub · 2022-02-08T22:31:47Z

...libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexCompiler.cs

+
+            // CheckTimeout();
+            Ldthis();
+            Call(s_checkTimeoutMethod);


Missing guard on the timeout check. #Resolved

Do you mind elaborating? Are you referring to the existing if (ignoreTimeout)? If so, we don't do that any longer since timeout doesn't get passed through to Scan. Before we had the guard because we were calling DoCheckTimeout() as opposed to CheckTimeout which has the guard inside of it.

Do you mind elaborating?

if (_hasTimeout) { Ldthis(); Call(s_checkTimeoutMethod); }

stephentoub · 2022-02-08T22:32:40Z

...libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexCompiler.cs

+            Ldc(0);
+            _ilg!.Emit(OpCodes.Cgt);
+            BrfalseFar(afterSuccessMatchLabel);
+            BrFar(returnLabel);


We don't want to do the same thing we do in source gen and have Go return a bool rather than having this code reach into the match object to inspect the match count? The bool should be faster and simpler. #Resolved

We can do that as well here. I originally did it only in the source generator because main idea with that is that source generator doesn't have access to match._matchCount[0] field, so I used Go return value to let me know if the input matched or not. For Compiled engine on the other hand, we do have access to the _matchCount field which is why I left it as is, but I agree it would be better to match for a) consistency between generated engines and b) perf.

stephentoub · 2022-02-08T22:34:02Z

...libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexCompiler.cs

+            Ldthisfld(s_runcrawlField);
+            Ldlen();
+            _ilg!.Emit(OpCodes.Conv_I4);
+            Stfld(s_runcrawlposField);


Same questions as on the source gen about whether we actually need to do this resetting.

(We should strive to keep the RegexCompiler implementation and the source generator implementation 1:1 as much as possible, only diverging in places where we're forced to do something special because of C# vs IL (e.g. complications from scoping in C#) or where emitting C# enables further optimizations (e.g. emitting a switch that the C# compiler can further optimize with its heuristics). #Resolved

stephentoub · 2022-02-08T22:34:28Z

...raries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexInterpreter.cs

+                    if (runmatch!._matchcount[0] > 0)
+                    {
+                        return;
+                    }


Same here. Can Go return a bool? #Resolved

Yeah, it can. Here it wouldn't really be a perf gain like in the compiled engine, just consistency between engines. Interpreted doesn't need it as it can access _matchcount field, but I'm fine with moving all of them for consistency sake.

stephentoub · 2022-02-08T22:35:04Z

...raries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexInterpreter.cs

+                    // Reset state for another iteration.
+                    runtrackpos = runtrack!.Length;
+                    runstackpos = runstack!.Length;
+                    runcrawlpos = runcrawl!.Length;


Unlike RegexCompiler and the source generator, these will need to stay, as the interpreter still uses the "old" approach based on RegexWriter's opcodes. #Resolved

danmoseley · 2022-02-08T22:36:07Z

src/libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/Regex.cs

+                            m.Text = input;
+                    }
+
+                    // If there was a match and the original text was sliced, then add beggining to the index to get the real


Suggested change

// If there was a match and the original text was sliced, then add beggining to the index to get the real

// If there was a match and the original text was sliced, then add beginning to the index to get the real

danmoseley · 2022-02-08T22:37:51Z

...raries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexInterpreter.cs

+            // bump by 1 and stop at textend, but if we're examining right-to-left, we instead bump
+            // by -1 and stop at textbeg.
+            int bump = 1, stoppos = text.Length;
+            Debug.Assert(runregex != null);


I don't think there's value in asserting that something isn't null a line above where it's dereferenced. #Resolved

stephentoub · 2022-02-08T22:38:58Z

src/libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexRunner.cs

-                _timeoutOccursAt = Environment.TickCount + _timeout;
-                _timeoutChecksToSkip = TimeoutCheckFrequency;
+                s = text.ToString();
+                runtext = s;


Ok, so you have this logic here, it just seems like the runtext = input logic for string inputs was missing earlier.

As noted, we could alternatively choose to throw a NotSupportedException here if text != s. We should think through the implications. On the one hand, it's nice to implicitly support the span-based APIs when using one of those older CompileToAssembly implementations. On the other hand, this could result in a massive allocation increase when someone switches to using the span-based APIs, and it could easily be a pit of failure; an exception would immediately highlight the problem and help people to know not to use that combination (which should be a rare combination, I hope). #Resolved

stephentoub · 2022-02-08T22:40:18Z

src/libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexRunner.cs

            }

+            Debug.Assert(runregex != null);
+            Match? match = Scan(runregex, s, 0, text.Length, runtextstart, -1, quick, runregex.internalMatchTimeout);


text.Length => s.Length ? #Resolved

stephentoub · 2022-02-08T22:41:57Z

src/libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexRunner.cs

@@ -207,6 +168,36 @@ public abstract class RegexRunner
            }
        }

+        internal void InitializeForScan(Regex regex, ReadOnlySpan<char> text, int textbeg, int textend, int textstart, bool quick)


Why is text passed in? Or alternatively if textbeg is always 0 and textend is always text.Length, why are textbeg and textend passed in? #Resolved

It was originally passed in because this method orignally was also setting runtext. Then I removed that but didn't remove the parameter. You are right about textbeg and end, so I will remove those instead and calculate them based on text instead.

stephentoub · 2022-02-08T22:46:26Z

src/libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexRunner.cs

+                // Environment.TickCount is an int that cycles. We intentionally let timeoutOccursAt
+                // overflow it will still stay ahead of Environment.TickCount for comparisons made
+                // in DoCheckTimeout().
+                Regex.ValidateMatchTimeout(timeout); // validate timeout as this could be called from user code due to being protected


A separate discussion, but the logic of this comment seems a bit flawed. Yes, someone can derive from Regex, but if they do and so can set this to a non-sensical value, they can also override FindFirstChar/Go/Scan to do whatever the heck they want. We shouldn't need to pay to validate this value over and over and over again (we already validated the values that came in through the public APIs for consumers of Regex) for such misuse that isn't actually preventing anything nefarious.

stephentoub · 2022-02-08T22:47:22Z

src/libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexRunner.cs


        /// <summary>
        /// InitTrackCount must initialize the runtrackcount field; this is
        /// used to know how large the initial runtrack and runstack arrays
        /// must be.
        /// </summary>
-        protected abstract void InitTrackCount();
+        protected virtual void InitTrackCount() => runtrackcount = 0;


Does this need a body or can it be empty? #Resolved

It could be empty. I just set it to zero as that was in your original proposal.

stephentoub · 2022-02-08T22:48:38Z

src/libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexRunner.cs

        {
            if (runmatch is null)
            {
                // Use a hashtabled Match object if the capture numbers are sparse
                runmatch = runregex!.caps is null ?
-                    new Match(runregex, runregex.capsize, runtext!, runtextbeg, runtextend - runtextbeg, runtextstart) :
-                    new MatchSparse(runregex, runregex.caps, runregex.capsize, runtext!, runtextbeg, runtextend - runtextbeg, runtextstart);
+                    new Match(runregex, runregex.capsize, runtext ?? string.Empty, runtextbeg, runtextend - runtextbeg, runtextstart) :


Earlier you made the runtext string argument be nullable... why does this then prevent null from being passed in? #Resolved

I need to check again, but I believe I added this because a unit test was failing which was checking the equivalence of Match.Empty and a failed match, or something like that.

stephentoub · 2022-02-08T22:49:38Z

...RegularExpressions/src/System/Text/RegularExpressions/Symbolic/SymbolicRegexRunnerFactory.cs

+            protected internal override void Scan(ReadOnlySpan<char> text)
            {
-                ReadOnlySpan<char> inputSpan = runtext;
+                Go(text);
+            }

+            private void Go(ReadOnlySpan<char> inputSpan)
+            {


We don't need Go to be a separate method. The body of Go can just be the body of Scan. #Resolved

totally true, originally I had it separate as in Scan I had a lot of the boilerplate that I had for other engine's Scan methods, but I then started removing all of that since it didn't apply for this engine and was left with just the invocation of Go. Will fix this.

joperezr · 2022-02-15T22:00:33Z

Thanks for all of the great feedback @stephentoub and @danmoseley. Now that these APIs are approved, I will go ahead and close this draft and will send out a real PR with all the previously given feedback addressed shortly. The reason for not turning this one into a real PR instead, is that this one already has a lot of comments which has caused it to have issues when trying to load, so having a clean new one will help with that.

joperezr added the NO-MERGE The PR is not ready for merge yet (see discussion for detailed reasons) label Dec 7, 2021

joperezr requested a review from stephentoub December 7, 2021 23:08

dotnet-issue-labeler bot added area-System.Text.RegularExpressions new-api-needs-documentation labels Dec 7, 2021

joperezr requested a review from danmoseley December 7, 2021 23:09

danmoseley reviewed Dec 8, 2021

View reviewed changes

...raries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexInterpreter.cs Outdated Show resolved Hide resolved

runfoapp bot mentioned this pull request Dec 8, 2021

System.Runtime.Serialization.Xml.Tests failed #59926

Closed

joperezr force-pushed the AddingSpanRegexOverloads branch 2 times, most recently from e417519 to 63ed6b5 Compare December 20, 2021 20:47

stephentoub reviewed Jan 19, 2022

View reviewed changes

src/libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/Regex.cs Show resolved Hide resolved

stephentoub reviewed Jan 19, 2022

View reviewed changes

src/libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/Regex.cs Outdated Show resolved Hide resolved

joperezr self-assigned this Jan 26, 2022

joperezr added 3 commits February 1, 2022 15:12

Adding new API and ensuring new Scan is called during Regex.Run() wit…

c1c0fa3

…h all tests passing.

Remove timeout from Scan, as well as fix most of the test issues

2960e99

joperezr force-pushed the AddingSpanRegexOverloads branch from 2df7931 to 2960e99 Compare February 4, 2022 17:57

Cleaning up some commented out code

d16b8e9

stephentoub reviewed Feb 4, 2022

View reviewed changes

Making Span to only take in a Span and ensure it doesn't require to t…

444c017

…ouch runmatch any more.

joperezr mentioned this pull request Feb 8, 2022

Augment Regex extensibility point for better perf and span-based matching #59629

Closed

stephentoub reviewed Feb 8, 2022

View reviewed changes

src/libraries/System.Text.RegularExpressions/gen/RegexGenerator.Emitter.cs Outdated Show resolved Hide resolved

stephentoub reviewed Feb 8, 2022

View reviewed changes

src/libraries/System.Text.RegularExpressions/gen/RegexGenerator.Emitter.cs Show resolved Hide resolved

stephentoub reviewed Feb 8, 2022

View reviewed changes

Adress some PR feedback

cc92623

stephentoub reviewed Feb 8, 2022

View reviewed changes

danmoseley reviewed Feb 8, 2022

View reviewed changes

stephentoub reviewed Feb 8, 2022

View reviewed changes

joperezr closed this Feb 15, 2022

joperezr mentioned this pull request Feb 17, 2022

Adding Regex.IsMatch(Span) and RegexRunner.Scan() methods #65473

Merged

2 tasks

ghost locked as resolved and limited conversation to collaborators Mar 19, 2022

	if (quick)
	{
	runmatch!.Text = null!; // drop reference
	return null;
	}

	if (match._matchcount[0] > 0)
	{
	runtext = null; // drop reference to text to avoid keeping it alive in a cache

	if (quick)
	{
	runmatch!.Text = null!; // drop reference
	return null;
	}

	case RegexOpcode.Boundary:
	if (!IsBoundary(runtextpos, runtextbeg, runtextend))
	{
	break;
	}
	advance = 0;
	continue;

	case RegexOpcode.NonBoundary:
	if (IsBoundary(runtextpos, runtextbeg, runtextend))
	{
	break;
	}
	advance = 0;
	continue;

	protected internal virtual void Scan(ReadOnlySpan<char> text)
	{
	string? s = runtext;
	if (text != s)
	{
	s = text.ToString();
	runtext = s;
	}

	Debug.Assert(runregex != null);
	Match? match = Scan(runregex, s, 0, text.Length, runtextstart, -1, quick, runregex.internalMatchTimeout);
	runmatch = match;
	}

		runner.InitializeForScan(this, span, 0, span.Length, startat - beginning, quick);
		runner.InitializeForGo();

	if ((Options & (RegexOptions.RightToLeft \| RegexOptions.NonBacktracking)) != 0)
	{
	// NonBacktracking isn't supported, nor RightToLeft. The latter applies to both the top-level
	// options as well as when used to specify positive and negative lookbehinds.
	return false;
	}

	// If there was a match and the original text was sliced, then add beggining to the index to get the real
	// If there was a match and the original text was sliced, then add beginning to the index to get the real

Add Match and isMatch ReadOnlySpan<char> overloads as well as Regex protected methods which expose ReadOnlySpan<char> to avoid allocations. #62509

Add Match and isMatch ReadOnlySpan<char> overloads as well as Regex protected methods which expose ReadOnlySpan<char> to avoid allocations. #62509

Conversation

joperezr commented Dec 7, 2021

dotnet-issue-labeler bot commented Dec 7, 2021

ghost commented Dec 7, 2021

stephentoub commented Dec 8, 2021

joperezr commented Dec 8, 2021

joperezr commented Dec 21, 2021

stephentoub Feb 4, 2022 • edited by joperezr Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

stephentoub Feb 4, 2022 • edited by joperezr Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

stephentoub Feb 4, 2022 • edited by joperezr Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

stephentoub Feb 4, 2022 • edited by joperezr Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

stephentoub Feb 8, 2022 • edited by joperezr Loading

Choose a reason for hiding this comment

stephentoub Feb 8, 2022 • edited by joperezr Loading

Choose a reason for hiding this comment

stephentoub Feb 8, 2022 • edited by joperezr Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

stephentoub Feb 8, 2022 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

stephentoub Feb 8, 2022 • edited by joperezr Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

stephentoub Feb 8, 2022 • edited by joperezr Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

stephentoub Feb 8, 2022 • edited by joperezr Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

stephentoub Feb 8, 2022 • edited by joperezr Loading

Choose a reason for hiding this comment

stephentoub Feb 8, 2022 • edited by joperezr Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

stephentoub Feb 8, 2022 • edited by joperezr Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

danmoseley Feb 8, 2022 • edited by joperezr Loading

Choose a reason for hiding this comment

stephentoub Feb 8, 2022 • edited by joperezr Loading

Choose a reason for hiding this comment

stephentoub Feb 8, 2022 • edited by joperezr Loading

Choose a reason for hiding this comment

stephentoub Feb 8, 2022 • edited by joperezr Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

stephentoub Feb 8, 2022 • edited by joperezr Loading

stephentoub Feb 4, 2022 •

edited by joperezr

Loading

stephentoub Feb 4, 2022 •

edited by joperezr

Loading

stephentoub Feb 4, 2022 •

edited by joperezr

Loading

stephentoub Feb 4, 2022 •

edited by joperezr

Loading

stephentoub Feb 8, 2022 •

edited by joperezr

Loading

stephentoub Feb 8, 2022 •

edited by joperezr

Loading

stephentoub Feb 8, 2022 •

edited by joperezr

Loading

stephentoub Feb 8, 2022 •

edited

Loading

stephentoub Feb 8, 2022 •

edited by joperezr

Loading

stephentoub Feb 8, 2022 •

edited by joperezr

Loading

stephentoub Feb 8, 2022 •

edited by joperezr

Loading

stephentoub Feb 8, 2022 •

edited by joperezr

Loading

stephentoub Feb 8, 2022 •

edited by joperezr

Loading

stephentoub Feb 8, 2022 •

edited by joperezr

Loading

danmoseley Feb 8, 2022 •

edited by joperezr

Loading

stephentoub Feb 8, 2022 •

edited by joperezr

Loading

stephentoub Feb 8, 2022 •

edited by joperezr

Loading

stephentoub Feb 8, 2022 •

edited by joperezr

Loading

stephentoub Feb 8, 2022 •

edited by joperezr

Loading

stephentoub Feb 8, 2022 •

edited by joperezr

Loading

stephentoub Feb 8, 2022 •

edited by joperezr

Loading