Report partial matches #678

Ekleog · 2020-05-14T14:06:13Z

Hello,

First, this issue's description may overlap a bit with #425, by attempting to do what #25 wanted.
However, I believe the solution described below to handle it would probably be much easier, and I think it would be a good thing to handle my use case even if #425 were implemented (even though it could be worked around by using longest_potential_match_len in such a world).
So, I thought it was worth it suggesting this intermediate solution, that would hopefully be reasonable to implement even though #425 appears to not be as of today, and solve at least some issues :)

Problem description

I'm trying to use regexes from streaming nom parsers, for parsing SMTP information.

In order to do so, the nom way of doing things is basically to, when receiving a packet, optimistically parse it, and if the parser reports that it needs additional data to proceed, then wait for enough data to come in and parse again -- this way most matches are done in one pass, even though it means one additional case for the rare case where a message was split between two packets.

However, in order to use regexes from this, there has to be a way to be able to know whether the regex didn't match because it couldn't match, or if it didn't match because it didn't have a complete match yet.

Proposed solution

The solution I can think of is simple with only basic theory of automata -- not having any idea how regex's engine works in practice, I cannot say whether it would be simple in practice, though.

The idea is the following: if the automaton has reached a failing state before the end of the input, return that there can be no match. If it has not, and reaches the end of the input in a start or intermediate state, return that there may be a match that would come up if there were more input.

Note that this scheme works well only for regexes that start with ^ or similar -- but any regex that doesn't would potentially match later on, which means that this would be working as intended.

As a further evolution that might be enough to handle part of #425's problem too, it might make sense to also report when the regex was last in its start state, so that the caller could just buffer starting from the last time the regex was in its start state and re-run from that last position the next time. It's less optimized than doing it all in one pass, but hopefully would be fast enough (by making the user rescan only the parts of the buffer that could contain a match), while unlocking additional use cases.

What do you think about this idea? Would it be simple enough to deserve implementing, until #425 would be solved, or would I be better off implementing this myself based on the regex_automata crate?

The text was updated successfully, but these errors were encountered:

BurntSushi · 2020-05-14T15:11:56Z

I think you would definitely be better off using regex-automata. It might not be difficult to implement your suggestion in the actual regex engines themselves, but exposing that information in the API is non-trivial.

BurntSushi · 2020-05-14T15:13:00Z

it might make sense to also report when the regex was last in its start state, so that the caller could just buffer starting from the last time the regex was in its start state and re-run from that last position the next time

Note that this isn't correct in general. The start state might itself consume input.

Ekleog · 2020-05-14T19:43:41Z

Oh that's right, so I guess this further evolution would require additional changes, by splitting the start state between a start state that hasn't consumed any input yet and a start state that has consumed input, and this would make things much harder to implement for the extension because then any transition to the start state would have to be considered, to either point to the nothing-consumed start state or to the something-consumed start state.

As for how to expose this information in the API… intuitively, as it doesn't incur any performance cost, I'd have done it by taking the methods that return Option<T> and instead return Result<T, enum { WillNeverMatch, MayMatchLater }>. Now, I don't know how you feel about backwards-compatibility and API-breaking changes, and I'm not really sure how to incrementally improve the current API, as all good function names are taken :/ something like adding a _streaming at the end?

(Anyway, I'm going to try it out with regex-automata for now, which probably solves my current issue, thank you for your feedback! :))

BurntSushi · 2020-05-15T02:21:07Z

Now, I don't know how you feel about backwards-compatibility and API-breaking changes

I'm sure there will be a regex 2.0 some day, but I won't do such a thing lightly and this kind of change definitely doesn't elevate to that level unfortunately. (Even if I agreed with your proposal. :P)

I'd have done it by taking the methods that return Option<T> and instead return Result<T, enum { WillNeverMatch, MayMatchLater }>

My initial reaction to this is that it makes the API more complex for a rather niche use case.

With that said, I have actually experimented with such an API in regex-automata. In particular, I've been working on #656, which involves re-working most of regex-automata. In particular, I've added try_* variants of a lot of the matching methods. For example: Regex::try_find_leftmost returns Result<Option<Match>, NoMatch> instead of just Option<Match>. (Warning: I have not put much, if any, effort into docs at the moment. They are likely outdated and wrong in places.)

Now, NoMatch isn't quite what you want here, since it actually indicates an error that occurred while searching. In this case, the caller can't make any assumptions about whether a match did or didn't occur. (Specifically, it corresponds to one of two cases: the machine hit a case it could not process and quit, or the machine "decided" it was inefficient and gave up.)

When I originally wrote NoMatch, it actually had three variants: the two I just described in addition to None. In other words, the API was Result<Match, NoMatch>. It turned out that this was actually quite inconvenient to use, because in some contexts, you want to propagate the "quit" and "gave up" variants as normal errors and handle the "didn't match" case differently. So it leads to a lot of explicit case analysis. It also presents problems for implementing iterators. The Option<Match> API extends naturally to iterators, but the Result<Match, NoMatch> API does not. You wind up having a next method that returns Option<Result<Match, NoMatch>>, but where the NoMatch::None variant can never occur, because it signals the end of iteration. So it's just super awkward.

So the API you would want, I think, would most naturally be:

// This corresponds to `NoMatch` in the regex-automata docs I linked.
type MatchError = ...;

enum NoMatch {
    WillNeverMatch(usize),
    MayMatchLater(usize),
}

fn try_find_leftmost(...) -> Result<Result<Match, NoMatch>, MatchError>;

This is a little ugly, but maybe it is the right thing to do at the regex-automata level. On the other hand, I don't think it would be that bad to demand that callers implement their own search loop for these kinds of cases, which wouldn't be too bad. Especially since this is really only useful for the anchored case.

More generally, my hope is that regex-automata will become the right place for these kinds of lower level and more complex APIs that I'm not sure really belong in regex proper. regex-automata is also something I'm more willing to evolve rapidly and am generally okay with it having a much higher compexity budget that regex. Namely, regex is something that ideally everyone can use and use easily without really know much about how regexes work. But in order to use regex-automata effectively, I would generally hope that folks would expect to put in a bigger investment to understand how it works.

So I think that's my high level feedback for the time being.

Ekleog · 2020-05-17T17:11:18Z

What, you don't want to make an API-breaking change just for my very specific use case? :P (more seriously, I was more thinking it could be something to stash on top of a list of changes queued for a potential 2.0 release whenever it comes)

So first: I have implemented what you suggested with regex-automata, here -- it works great, thank you for the hint!

For your idea of API for try_find_leftmost, it looks indeed like what would be best! The reason why I was suggesting replacement of Option<Match> with Result<Match, WillNeverMatch|MayMatchLater> is that it would allow simply replacing if let Some(foo) with if let Ok(foo), given the error variant would just be a split of the previous None case between a WillNeverMatch and a MayMatchLater case :)

With your try_find_leftmost, it's equivalent to going from Result<Option<Match>, MatchError> that AFAIU is your current API to Result<Result<Match, NoMatch>, MatchError> -- the same thing I was suggesting :)

About users having to write their own search loop, it totally makes sense to me! I think it should also be possible to upstream this specific loop into nom (related: rust-bakery/nom#1155), so that each other user wouldn't have to rewrite their own :)

BurntSushi · 2020-05-17T19:33:14Z

With your try_find_leftmost, it's equivalent to going from Result<Option<Match>, MatchError> that AFAIU is your current API to Result<Result<Match, NoMatch>, MatchError> -- the same thing I was suggesting :)

Right. My main point I was trying to express was how awkward it is, I guess.

Note that when trying to use regex-automata in other crates, it would be good to mention that it is first and foremost an implementation detail of regex. This doesn't mean one shouldn't use it of course, but that the same kind of API stability one might expect with regex won't really apply to regex-automata.

BurntSushi · 2023-03-06T15:51:50Z

I'm going to close this in favor of #656. Namely, once regex-automata 0.3 is released, you should have more options. You might still need to implement your own search routine using a lazy DFA for example, but this is eminently doable in a couple dozen lines. Otherwise, I don't think it's worth complicating every search API to report this information.

BurntSushi mentioned this issue Jan 5, 2023

Get position of first non-matching character #941

Closed

BurntSushi closed this as not planned Won't fix, can't repro, duplicate, stale Mar 6, 2023

dustinpho mentioned this issue Mar 13, 2023

Prune paths when using --full-path and a pattern containing the start of string anchor sharkdp/fd#1279

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Report partial matches #678

Report partial matches #678

Ekleog commented May 14, 2020

BurntSushi commented May 14, 2020

BurntSushi commented May 14, 2020

Ekleog commented May 14, 2020

BurntSushi commented May 15, 2020 •

edited

Loading

Ekleog commented May 17, 2020 •

edited

Loading

BurntSushi commented May 17, 2020

BurntSushi commented Mar 6, 2023

Report partial matches #678

Report partial matches #678

Comments

Ekleog commented May 14, 2020

Problem description

Proposed solution

BurntSushi commented May 14, 2020

BurntSushi commented May 14, 2020

Ekleog commented May 14, 2020

BurntSushi commented May 15, 2020 • edited Loading

Ekleog commented May 17, 2020 • edited Loading

BurntSushi commented May 17, 2020

BurntSushi commented Mar 6, 2023

BurntSushi commented May 15, 2020 •

edited

Loading

Ekleog commented May 17, 2020 •

edited

Loading