Report partial matches #678
Comments
I think you would definitely be better off using |
Note that this isn't correct in general. The start state might itself consume input. |
Oh, that's right. So I guess this further evolution would require additional changes: splitting the start state into one that hasn't consumed any input yet and one that has. That would make things much harder to implement for the extension, because every transition to the start state would then have to point either to the nothing-consumed start state or to the something-consumed start state. As for how to expose this information in the API… intuitively, as it doesn't incur any performance cost, I'd have done it by taking the methods that return (Anyway, I'm going to try it out with regex-automata for now, which probably solves my current issue, thank you for your feedback! :))
I'm sure there will be a regex 2.0 some day, but I won't do such a thing lightly and this kind of change definitely doesn't elevate to that level unfortunately. (Even if I agreed with your proposal. :P)
My initial reaction to this is that it makes the API more complex for a rather niche use case. With that said, I have actually experimented with such an API in Now, When I originally wrote So the API you would want, I think, would most naturally be:
    // This corresponds to `NoMatch` in the regex-automata docs I linked.
    type MatchError = ...;
    enum NoMatch {
        WillNeverMatch(usize),
        MayMatchLater(usize),
    }
    fn try_find_leftmost(...) -> Result<Result<Match, NoMatch>, MatchError>;
This is a little ugly, but maybe it is the right thing to do at the More generally, my hope is that So I think that's my high level feedback for the time being.
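To make the nested `Result` shape above more concrete, here is a minimal sketch of how a streaming caller might map each case to a decision. Everything except the `NoMatch` variant names and the overall return shape is an assumption made for illustration: the `Match` and `MatchError` definitions, the `Action` enum, and the `decide` helper are hypothetical, and the comment does not say what the `usize` payloads mean (the sketch treats the one in `MayMatchLater` as the offset to keep buffering from).

```rust
// Placeholder types: the comment above leaves `Match` and `MatchError` unspecified.
#[derive(Debug, PartialEq, Eq)]
struct Match {
    start: usize,
    end: usize,
}

#[derive(Debug, PartialEq, Eq)]
enum NoMatch {
    WillNeverMatch(usize),
    MayMatchLater(usize),
}

#[derive(Debug, PartialEq, Eq)]
struct MatchError;

/// Hypothetical decision a streaming caller could take after one search.
#[derive(Debug, PartialEq, Eq)]
enum Action {
    /// A match was found; consume the buffer up to this offset.
    Consume(usize),
    /// No match is possible, however much input follows.
    Reject,
    /// Keep the buffer from this offset and wait for more input.
    WaitForMore { keep_from: usize },
    /// The search itself failed.
    Abort,
}

/// Maps the proposed nested result onto a caller-side action.
/// Assumption: the payload of `MayMatchLater` is an offset into the input.
fn decide(result: Result<Result<Match, NoMatch>, MatchError>) -> Action {
    match result {
        Ok(Ok(m)) => Action::Consume(m.end),
        Ok(Err(NoMatch::WillNeverMatch(_))) => Action::Reject,
        Ok(Err(NoMatch::MayMatchLater(offset))) => Action::WaitForMore { keep_from: offset },
        Err(MatchError) => Action::Abort,
    }
}
```

The outer `Result` keeps search failures apart from the two "no match" outcomes, which is the distinction a streaming parser needs.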
What, you don't want to make an API-breaking change just for my very specific use case? :P (more seriously, I was more thinking it could be something to stash on top of a list of changes queued for a potential 2.0 release whenever it comes) So first: I have implemented what you suggested with regex-automata, here -- it works great, thank you for the hint! For your idea of API for With your About users having to write their own search loop, it totally makes sense to me! I think it should also be possible to upstream this specific loop into nom (related: rust-bakery/nom#1155), so that each other user wouldn't have to rewrite their own :) |
Right. My main point I was trying to express was how awkward it is, I guess. Note that when trying to use regex-automata in other crates, it would be good to mention that it is first and foremost an implementation detail of |
I'm going to close this in favor of #656. Namely, once |
Hello,
First, this issue's description may overlap a bit with #425, by attempting to do what #25 wanted.
However, I believe the solution described below would probably be much easier to implement, and I think it would be worth handling my use case even if #425 were implemented (even though it could be worked around by using `longest_potential_match_len` in such a world).
So I thought it was worth suggesting this intermediate solution, which would hopefully be reasonable to implement even though #425 does not appear to be as of today, and which would solve at least some issues :)
Problem description
I'm trying to use regexes from streaming `nom` parsers, for parsing SMTP information.
The nom way of doing things is basically to optimistically parse a packet when it is received and, if the parser reports that it needs additional data to proceed, to wait for enough data to come in and parse again. This way most matches are done in one pass, at the cost of parsing again in the rare case where a message was split between two packets.
However, in order to use regexes this way, there has to be a way to know whether the regex didn't match because it cannot match, or because it doesn't have a complete match yet.
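The optimistic parse-then-retry loop described above can be sketched as follows. This is not nom's API: `Parsed`, `parse_line`, and `lines_from_packets` are hypothetical names, and the parser is a stand-in that accepts one CRLF-terminated line. The point is only that the parser must be able to report "incomplete" as a distinct outcome, which a plain match/no-match regex result cannot do.

```rust
/// Hypothetical outcome of a streaming parser (not nom's actual types).
#[derive(Debug, PartialEq, Eq)]
enum Parsed {
    /// A full line was parsed; this many bytes were consumed.
    Done(usize),
    /// The buffer holds no complete line yet; more data is needed.
    Incomplete,
}

/// Stand-in streaming parser: one CRLF-terminated line.
fn parse_line(buf: &[u8]) -> Parsed {
    match buf.windows(2).position(|w| w == b"\r\n") {
        Some(pos) => Parsed::Done(pos + 2),
        None => Parsed::Incomplete,
    }
}

/// Feeds packets into a buffer and parses optimistically after each one,
/// retrying only when the previous attempt reported `Incomplete`.
fn lines_from_packets(packets: &[&[u8]]) -> Vec<Vec<u8>> {
    let mut buf: Vec<u8> = Vec::new();
    let mut lines: Vec<Vec<u8>> = Vec::new();
    for packet in packets {
        buf.extend_from_slice(packet);
        // Most packets contain whole lines, so this usually succeeds at once.
        while let Parsed::Done(n) = parse_line(&buf) {
            lines.push(buf.drain(..n).collect());
        }
    }
    lines
}
```

With the packets `HELO exa` and `mple.org\r\nNOOP\r\n`, the first parse reports `Incomplete` and the second pass yields both lines.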
Proposed solution
The solution I can think of is simple with only basic theory of automata. Not having any idea how `regex`'s engine works in practice, I cannot say whether it would be simple in practice, though.
The idea is the following: if the automaton has reached a failing state before the end of the input, return that there can be no match. If it has not, and reaches the end of the input in a start or intermediate state, return that there may be a match that would come up if there were more input.
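As an illustration of the idea, here is a minimal sketch using a hand-written automaton for the anchored pattern `^ab+c`. It says nothing about how `regex` works internally: the `Outcome` enum, the `DEAD` sentinel, and the transition table are all made up for this example, and the search stops at the first accepting state, which is enough for this pattern but is not a general leftmost-first search.

```rust
/// Outcome of running the automaton over the bytes seen so far.
#[derive(Debug, PartialEq, Eq)]
enum Outcome {
    /// A match ends at this byte offset.
    Match(usize),
    /// A failing state was reached at this offset; more input cannot help.
    WillNeverMatch(usize),
    /// Input ended in a start or intermediate state; more input may match.
    MayMatchLater,
}

/// Sentinel for the failing (dead) state.
const DEAD: usize = usize::MAX;

/// Hand-written transitions for `^ab+c`:
/// 0 = start, 1 = seen "a", 2 = seen "ab+", 3 = accepting.
fn step(state: usize, byte: u8) -> usize {
    match (state, byte) {
        (0, b'a') => 1,
        (1, b'b') => 2,
        (2, b'b') => 2,
        (2, b'c') => 3,
        _ => DEAD,
    }
}

fn is_accepting(state: usize) -> bool {
    state == 3
}

/// Runs the automaton and reports which of the three cases applies.
fn run(input: &[u8]) -> Outcome {
    let mut state = 0;
    for (i, &byte) in input.iter().enumerate() {
        state = step(state, byte);
        if state == DEAD {
            return Outcome::WillNeverMatch(i);
        }
        if is_accepting(state) {
            return Outcome::Match(i + 1);
        }
    }
    Outcome::MayMatchLater
}
```

Here `run(b"abb")` reports `MayMatchLater`, because a later `c` would complete the match, while `run(b"ax")` reports `WillNeverMatch(1)`.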
Note that this scheme works well only for regexes that start with `^` or similar, but any regex that doesn't would potentially match later on, which means that this would be working as intended.
As a further evolution that might be enough to handle part of #425's problem too, it might make sense to also report when the regex was last in its start state, so that the caller could just buffer starting from the last time the regex was in its start state and re-run from that last position the next time. It's less optimized than doing it all in one pass, but hopefully would be fast enough (by making the user rescan only the parts of the buffer that could contain a match), while unlocking additional use cases.
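A rough sketch of this "buffer from the last start state" evolution, for an unanchored search of the literal `abc`. It relies on a strong assumption that does not hold for regexes in general: being in the start state must mean that no partial match is in progress. A pattern such as `a*b`, whose start state consumes input that belongs to the match, would break it. The names `Scan`, `scan`, and `find_in_chunks` are hypothetical.

```rust
/// Result of scanning the buffered input for the literal `abc`.
#[derive(Debug, PartialEq, Eq)]
enum Scan {
    /// A match ends at this offset.
    Found(usize),
    /// No match yet; bytes before `keep_from` cannot be part of a match.
    NeedMore { keep_from: usize },
}

/// Hand-written transitions: 0 = start, 1 = "a", 2 = "ab", 3 = "abc".
fn step(state: u8, byte: u8) -> u8 {
    match (state, byte) {
        (2, b'c') => 3,
        (1, b'b') => 2,
        (_, b'a') => 1,
        _ => 0,
    }
}

/// Scans from the start state and records the offset at which the
/// automaton was last in that state.
fn scan(input: &[u8]) -> Scan {
    let mut state = 0u8;
    let mut last_start = 0usize;
    for (i, &byte) in input.iter().enumerate() {
        state = step(state, byte);
        if state == 3 {
            return Scan::Found(i + 1);
        }
        if state == 0 {
            last_start = i + 1;
        }
    }
    Scan::NeedMore { keep_from: last_start }
}

/// Feeds chunks one at a time, keeping only the tail that may start a match
/// and rescanning that tail together with the next chunk.
fn find_in_chunks(chunks: &[&[u8]]) -> bool {
    let mut buf: Vec<u8> = Vec::new();
    for chunk in chunks {
        buf.extend_from_slice(chunk);
        match scan(&buf) {
            Scan::Found(_) => return true,
            Scan::NeedMore { keep_from } => {
                buf.drain(..keep_from);
            }
        }
    }
    false
}
```

The recorded offset is conservative: for the input `aab` the automaton never returns to its start state, so the whole buffer is kept even though only `ab` could still begin a match.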
What do you think about this idea? Would it be simple enough to deserve implementing, until #425 would be solved, or would I be better off implementing this myself based on the `regex_automata` crate?