execute a regex on text streams #425
Thanks for taking the time to write this up! I'm afraid the issue tracker (and the issue you linked) isn't entirely up to date with my opinion on this particular feature. Basically, I agree that this would be useful to have. API design is certainly one aspect of this that needs to be solved, and I thank you for spending time on that.

However, the API design is just the surface. The far more difficult aspect of providing a feature like this is the actual implementation work required. My estimate of the level of difficulty is roughly "a rewrite of every single regex engine." That's about as bad as it gets, and it's why this particular feature hasn't happened yet. The problem is that the assumption that a regex engine has access to the entire buffer on which it can match is thoroughly baked into all of the matching routines. For example, as a trivially simple thing to understand (but far from the only issue), consider that none of the existing engines can pause execution of a match at an arbitrary point and then pick it back up again. Such a thing is necessary for a streaming match mode. (There is an interesting compromise point here. For example, the regex crate could accept an

Getting back to API design, I think we would want to take a very strong look at Hyperscan, whose documentation can be found here. This particular regex library is probably the fastest in the world, is built on finite automata, and specifically specializes in matching regexes on streams. You'll notice that an interesting knob on their API is the setting of whether to report the start of a match or not. This is important and particularly relevant to the regex crate. In particular, when running the DFA (the fastest regex engine), it is not possible in the general case to determine both the start and end of a match in a single pass. The first pass searches forwards through the string for the end of the match. The second pass then uses a reversed DFA and searches backwards in the text, from the end of the match, to find its start. Cool, right? And it unfortunately requires that the entire match be buffered in memory. That throws a wrench in things, because now you need to handle the case of "what happens when a match is 2GB?" There's no universal right answer, so you need to expose knobs like Hyperscan does.

The start-of-match stuff throws a wrench in my idea above as well. Even if the regex crate provided an API to search an
You can also see that RE2 doesn't have such an API either, mostly for the reasons I've already outlined here. That issue does note another problem, in particular, managing the state of the DFA, but that just falls under the bigger problem of "how do you pause/resume each regex engine" that I mentioned above. For Go's regexp engine, there is someone working on adding a DFA, and even in their work, they defer to the NFA when given a rune reader to use with the DFA. In other words, even with the DFA, that particular implementation chooses (2).

What am I trying to say here? What I'm saying is that a number of people who specialize in this area (including myself) have come to the same conclusion: the feature you're asking for is freaking hard to implement at multiple levels, at least in a fast finite-automaton-based regex engine. So this isn't really something that we can just think ourselves out of here. The paths available are known; they are just a ton of work.

With that said, this is something that I think would be very cool to have. My plans for the future of regex revolve around one goal: "make it easier to add and maintain new optimizations." This leaves me yearning for a more refined internal architecture, which also incidentally amounts to a rewrite. In the course of doing this, my plan was to re-evaluate the streaming feature and see if I could figure out how to do it. However, unfortunately, the time scale on this project is probably best measured in years, so it won't help you any time soon.

If you need this feature yesterday, then this is the best thing I can come up with, if you're intent on sticking with pure Rust:
Basically, this amounts to "make the same trade off as Go." You could lobby me to make this a feature of the regex crate (with the warning that it will always run slowly), but I'm not particularly inclined to do that because it is still quite a bit of work and I'd rather hold off on adding such things until I more thoroughly understand the problem domain. It's basically a matter of priorities. I don't want to spend a couple months of my time adding a feature that has known bad performance, with no easy route to fixing it. Apologies for the bad news!
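For a concrete illustration of the forward/reverse two-pass DFA scheme described above: the present-day regex-automata crate implements exactly this in its `dfa::regex::Regex`, which pairs a forward DFA with a reversed one. A minimal sketch, assuming regex-automata 0.4 with default features:

```rust
use regex_automata::dfa::regex::Regex;

fn main() {
    // `Regex` compiles two dense DFAs: a forward DFA that scans for the
    // *end* of a match, and a reversed DFA that then scans backwards from
    // that end offset to recover the *start*.
    let re = Regex::new(r"[0-9]{4}-[0-9]{2}-[0-9]{2}").unwrap();
    let hay = &b"launch date: 2010-03-14, retired: 2023-01-09"[..];

    let m = re.find(hay).unwrap();
    // Both ends of the match are known only after both passes complete,
    // which is why the candidate region must be available in memory.
    assert_eq!((m.start(), m.end()), (13, 23));
}
```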
@BurntSushi I think I was also perhaps misled by my experience with the C++ standard library, e.g. http://en.cppreference.com/w/cpp/regex (As an aside, it seems slightly strange to me that the Rust stdlib doesn't have a bidirectional iterator trait. It does have double-ended, but that's different. Maybe I should write up a Rust RFC.) As I said before, my own use-case is not urgent. It's only for a toy project anyway. No one is going to die or lose their job. ;-) But I appreciate your "if you need this yesterday" explanation. I may end up doing that at some point, if/when I get around to adding regex to my editor.
Ah! I feel quite silly now! I didn't notice that ticket. Apologies. Having said that, I think this ticket actually covers a superset of #386. #386 is essentially just API 2 in this write-up, and doesn't cover the streaming case (API 1). And although my personal use-case only needs API 2, I was intentionally trying to cover the wider range of use-cases covered in #25. So I guess it depends on how relevant you feel the streaming use-case is? In any case, I won't be at all offended if you close this as a duplicate. :-)
Sounds good to me! I closed out #386. :-)
That was a greatly useful explanation for me. I'm writing a tool that needs string matching, although not general regexes, and I have been looking at three alternatives:
The ideal for me would be a RegexSet that would give me not just the indexes of matched patterns but also the (start,end) indexes, or rather the (start,end) of the first match, and that could be paused/restarted to feed it the input in blocks, not all at once. I see now why all that is not possible at the moment, and even if implemented it would be slower than what we already have. Thanks!
I didn't respond to this before, but to be clear, C++'s standard library regex engine is backtracking AFAIK, which has a completely different set of trade offs in this space. If you asked me to do this feature request in a backtracking engine, my first thought to you would be "forget everything I just said." :)
@BurntSushi Also, looking at

Although I don't think I have the time/energy for it right now (I'm putting most of my spare time into Ropey), if I get around to it, would you be offended if I took a crack at creating a separate crate to accommodate the use-cases outlined in this issue, using just the slower NFA engine from this crate? My intent in doing so would be twofold:
This would all be with the idea that these use-cases would be folded back into this crate whenever you get the time to tackle your bigger plans for it, at which point my fork would be deprecated. But it would give a stop-gap solution in the meantime.
@cessen I would not be offended. :) In fact, I suggested that exact idea above, I think. You are right, of course, that the pikevm is more amenable to this change. The bounded backtracker might also be capable of handling it, but I haven't thought about it.
Yup! I was double-checking that you were okay with it being a published crate, rather than just an in-project-repo sort of thing. Thanks much! If I get started on this at some point, I'll post here.
I've started poking at this in https://github.com/cessen/streaming_regex

So far I've ripped out everything except the PikeVM engine, and also ripped out the byte regex. My goal is to get the simple case working, and then I can start adding things back in (like the literal matcher, byte regex, etc.). Now that I've gotten to this point, it's really obvious what you meant, @BurntSushi, regarding everything needing to be re-done to make this possible!

The first thing I'm going to try is to rework the PikeVM engine so that it incrementally takes a byte at a time as input. I think this can work even in the unicode case by having a small four-byte buffer to collect the bytes of unicode scalar values, and only execute regex instructions once a full value has been collected. Hopefully that can be done relatively efficiently. Once that's done, building on that for incremental input will hopefully be relatively straightforward (fingers crossed).

Does that sound like a reasonable approach, @BurntSushi?
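For illustration, a minimal sketch of that four-byte buffering idea (hypothetical code, not the fork's actual implementation):

```rust
/// Accumulates input bytes until they form a complete UTF-8 scalar value.
/// The regex engine would only execute instructions when `push` yields a char.
struct Utf8Accumulator {
    buf: [u8; 4],
    len: usize,
}

impl Utf8Accumulator {
    fn new() -> Self {
        Utf8Accumulator { buf: [0; 4], len: 0 }
    }

    /// Feed one byte of input; returns `Some(char)` once a full scalar
    /// value has been collected.
    fn push(&mut self, byte: u8) -> Option<char> {
        self.buf[self.len] = byte;
        self.len += 1;
        match std::str::from_utf8(&self.buf[..self.len]) {
            Ok(s) => {
                self.len = 0;
                s.chars().next()
            }
            Err(_) => {
                // Either an incomplete prefix (keep buffering) or invalid
                // UTF-8; real code would distinguish the two and report an
                // error instead of silently resetting.
                if self.len == 4 {
                    self.len = 0;
                }
                None
            }
        }
    }
}
```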
@cessen Indeed it does! Exciting. :-)
@BurntSushi
I wouldn't bother honestly. I would do whatever is natural for you. I expect the current internals to be completely redone before something like streaming functionality could get merged. I'm happy to be a sounding board of course!
Oh sure, I'm not expecting it to be mergeable. But I'm hoping that the architecture of this fork will be relevant to your rewrite in the future, so that any lessons learned can be applied. So I'd rather not go off in some direction that's completely tangential to (or fundamentally clashing with) what you have in mind for the future. If that makes sense? (Edited to add the "not" before "expecting" above. Really bad typo...)
Yeah I think so. I'm happy to give feedback as time permits. :)
Awesome, thanks! And, of course, I don't mean to squeeze you for time. Apologies for coming off a bit presumptuous earlier--I didn't mean it that way. I think I'll post here, if that's okay. Let me know if you'd prefer elsewhere!
So, the first thing is a bit bigger-picture: I think we'll have a lot more wiggle-room to experiment with fast incremental scanning approaches if we keep the incremental APIs chunk-based rather than byte-based. This basically amounts to changing API 2 in my original proposal above to take an iterator over byte slices instead of an iterator over bytes. I think that still covers all of the real use-cases. (And if someone really needs to use a byte iterator with the API, they can do their own buffering to pass larger chunks.)

The relevance to what I'm doing right now is that I'm thinking of changing the `RegularExpression` trait to something like this:

```rust
pub trait RegularExpression {
    fn is_match_at(
        &self,
        chunk: &[u8],           // analogous to `text` in your code
        offset_in_chunk: usize, // analogous to `start` in your code
        is_last_chunk: bool,
    ) -> bool;

    fn find_at(
        &self,
        chunk: &[u8],
        offset_in_chunk: usize,
        is_last_chunk: bool,
    ) -> Option<(usize, usize)>;

    // etc...
}
```

Calling code would pass a chunk at a time, and indicate via `is_last_chunk` whether the given chunk is the final one. The iterator-based convenience APIs would then be built on top of that:

```rust
pub trait RegularExpression {
    // ...the above stuff, plus:

    fn find_iter<'t, I: Iterator<Item = &'t [u8]>>(
        self,
        text: I,
    ) -> Matches<Self> {
        // ...
    }

    // etc...
}
```

One of the nice things about this approach is that a single contiguous text is just a special case: for a contiguous slice, you pass the whole text as a single chunk with `is_last_chunk = true`. Using the DFA internally on contiguous text falls naturally out of this as well, since we can switch to the DFA when we know we're on the last chunk. And this gives us plenty of room to experiment with approaches to faster incremental scanning.

(Incidentally, the reason I'm focusing on the
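For example, a hypothetical call site against that sketched trait (Ropey's `Rope::chunks()` is a real chunk iterator; `find_iter` and `Matches` are the sketches above, with `Matches` assumed to yield byte offsets):

```rust
use ropey::Rope;

// Hypothetical: `R` is the sketched trait above.
fn first_match<R: RegularExpression>(re: R, rope: &Rope) -> Option<(usize, usize)> {
    // Each rope chunk is contiguous; mapping &str -> &[u8] yields exactly
    // the `Iterator<Item = &[u8]>` the sketched API consumes.
    re.find_iter(rope.chunks().map(str::as_bytes)).next()
}
```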
@cessen Everything you said sounds pretty sensible to me. Here are some random thoughts:
Thanks for the feedback!
Yes, that was my understanding as well. I didn't mean to imply that
Or, more specifically, if we reach the upper bound (if one is given) we can set `is_last_chunk = true`. We would still have to do something like pass an empty chunk at the end when the upper bound isn't given or when we reach the end before the upper bound.
So, regardless of all of this, I think your suggestion to use the peekable iterator is a better idea. For some reason I thought not all iterators could be made peekable, but apparently they can. So that would be a great approach!
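A minimal sketch of that idea, driving the hypothetical chunk-based trait from earlier with `Peekable` so that no sentinel empty chunk is needed (all names besides the standard library's are assumptions):

```rust
fn is_match_stream<'t, I, R>(re: &R, chunks: I) -> bool
where
    I: Iterator<Item = &'t [u8]>,
    R: RegularExpression, // the sketched trait from above
{
    let mut it = chunks.peekable();
    while let Some(chunk) = it.next() {
        // `peek` tells us whether another chunk follows, which is exactly
        // the `is_last_chunk` flag the sketched API wants.
        let is_last = it.peek().is_none();
        if re.is_match_at(chunk, 0, is_last) {
            return true;
        }
    }
    false
}
```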
Quick status report: the PikeVM engine now takes a single token at a time, and no longer accesses the input text directly. Next step is to get it to take just a single byte at a time.

One thing I've noticed is that I think the PikeVM can be made to handle Unicode and byte-data use-cases at the same time, even within the same regex (assuming the syntax supports making a distinction). I'm guessing with the DFA that's not possible in a reasonable way, because it can only be in one state at a time...? It's not particularly relevant to the use-cases I'm trying to accommodate, but I think it's interesting nonetheless.
@pascalkuthe Sounds good. There is also the nuclear option: that we add (The marginal cost of adding another engine is fairly high, but not nearly as high as it was before |
One other thing worth asking here for folks who are working on the streaming case: are the APIs for the DFA and the lazy DFA low-level enough to make it possible to build a streaming engine on top of them?
Yeah, I actually got the DFA working first (both lazy and eager). With the exception of #1031 and a way to determine the length of a prefilter (so I know how much to buffer), that worked fairly well. There is obviously some duplication, but it's not too bad: https://github.com/pascalkuthe/ropey_regex/blob/master/src/engines/dfa/search.rs

For the DFA in particular, the memchr-based acceleration might be nice to reuse (it's a verbatim copy right now IIRC), but it's not too much code.
I have been working on this again a bit and I have run into a small roadblock. I am trying to be very general with my approach, supporting a cursor API that optionally supports backtracking. The goal would be that inputs that support backtracking (like ropes) can use all engines (dfa, hybrid, ...) while fully streaming inputs (basically any

The problem is that

However, I read the documentation of that module carefully and it sounds like this may not be a huge problem in practice, because most practical cases only involve empty matches at the start of the haystack. These cases don't pose an issue because there is no backtracking involved (note that I require chunks to be Unicode-aligned, so if a match is at a chunk boundary it is automatically Unicode-aligned). The only cases which wouldn't work are:
I would be ok with panicking in these two cases with a cursor that doesn't support backtracking (that happens automatically anyway), since they are pretty niche, as long as the more common cases work. @BurntSushi is my understanding correct here, or do you have any better ideas how to handle this?
I'm not quite sure to be honest. (Note that I wonder if there is perhaps a way to "not solve" this problem. I can think of two different approaches to take there:
It's also worth pointing out that, at least for the lazy DFA and fully compiled DFAs, you don't have to use the search APIs that handle empty matches. You can implement your own.
Thanks for your response @BurntSushi
After thinking about it some more I am not sure either. I think it's possible that the NFA/DFA would try to match some other pattern and then terminate at the empty match (like
yeah sorry, that is what I meant. I didn't want to spell it all out, got tripped up by the documentation referring to that, and somehow thought they were the same thing.
Yeah, that's what I am doing (I am also reimplementing the pikevm), but I am also copying the
For my personal use case of ropey/helix all of this doesn't matter, since backtracking isn't an issue there. The first option is actually quite interesting to me. We never want an empty match in helix. I would be happy to just remove those states from the DFA/NFA somehow? I would like to find a way to at least partially support these cases in the future too (probably using the old hack used in the regex crate for iteration; iteration of a cursor without backtracking is probably pretty niche anyway, since that only works with earliest-match semantics for similar reasons).
Oh interesting, I wasn't thinking of it this way. I was thinking of it as in "regex compilation returns an error if streaming mode is asked for and it can match the empty string." I'm not sure how to do your idea. What would you do with a regex like

(It's worth noting that Hyperscan---which supports stream searching---errors on regexes that can match the empty string by default. You have to go out of your way to enable
I do wonder if it makes sense to forget about the "true streaming" use case and really just focus on the "non-contiguous storage" use case. It would be an incremental improvement and it would satisfy some real world use cases. It isn't as operationally flexible which is a bummer. |
For my specific use case I essentially want to select all text that matches a regex pattern. An empty match essentially leads to no selection and therefore can be ignored (they actually lead to bugs currently, and I was going to work around it downstream by just checking the length of each match in a

Reporting an error for your example could be fine, but repetitions tend to be a bit annoying. Patterns like

Of course, that is kind of specific to my use case, and reporting an error is probably the better general solution (I would love a better solution for
I have been tempted by this too. It may indeed make more sense to do that for now. I actually have no special codepath for the non-backtracking case right now. I have a cursor trait:

```rust
pub trait Cursor {
    fn chunk(&self) -> &[u8];

    /// Whether this cursor can be used for unicode matching. That
    /// means specifically that it promises that unicode codepoints are never
    /// split across chunk boundaries.
    fn utf8_aware(&self) -> bool;

    /// Returns true if successful, false at EOI.
    fn advance(&mut self) -> bool;

    /// Returns true if successful, false at EOI
    /// or if backtracking is not supported.
    fn backtrack(&mut self) -> bool;
}
```

and the

So for regex searches that won't run into these restrictions, you can totally still use my crate. I was mostly trying to remove the cases where it would panic, but it might be better to ignore that use case for now. Now that you mention it, this API probably isn't too useful for something "fully" streaming, as such users would probably want to treat the regex search more like a state machine (and using regex-automata directly would work pretty well in that case I think; I imagine these are highly specific use-cases that may even be able to constrain their regex patterns further). So that was a lot of words to say: you are right, I probably shouldn't focus on the fully streaming case for now :D
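For a feel of the API, here is a minimal hypothetical implementation of that trait over a fixed, non-empty list of chunks (for example the two halves of a gap buffer); regex-cursor ships its own implementations, including one for ropey, so this is just a sketch:

```rust
struct ChunkCursor<'a> {
    chunks: &'a [&'a [u8]], // assumed non-empty
    pos: usize,
}

impl<'a> Cursor for ChunkCursor<'a> {
    fn chunk(&self) -> &[u8] {
        // The chunk the cursor currently points at.
        self.chunks[self.pos]
    }

    fn utf8_aware(&self) -> bool {
        // Assume the caller guarantees chunks never split a codepoint.
        true
    }

    fn advance(&mut self) -> bool {
        if self.pos + 1 < self.chunks.len() {
            self.pos += 1;
            true
        } else {
            false // EOI
        }
    }

    fn backtrack(&mut self) -> bool {
        if self.pos > 0 {
            self.pos -= 1;
            true
        } else {
            false // already at the first chunk
        }
    }
}
```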
Yeah I like the idea of building a Shitty First Draft that targets what you need specifically first. That will get you (and us) some real world experience with a prototype. And then hopefully it can be iterated on and improved in the future.
I managed to implement a fully working meta regex engine (pikevm + dfa + hybrid + prefilter) that passes all regex tests at https://github.com/pascalkuthe/regex-cursor. I put a short summary in the readme. I think it would be nice to upstream this eventually, since I had to duplicate a ton of private code (primarily in the meta engine) and I am a bit worried about maintaining the duplication in the long run. The cursor API that I came up with is very generic, and the pikevm implementation could be made to fit fully streaming input in the future (although with heavy limitations; see the readme). Of course this is just a prototype and there is a long road to getting this upstream, but I want to get the ball rolling a bit. A couple points:
There is also some stuff missing:
@pascalkuthe Wow, that is amazing. I'll have to take a deeper look soon.
The main problem here I think, and why I designed it the way I did, was to avoid making everything in the crate polymorphic. My suspicion is that with |
yeah, that is a good point, maybe the boilerplate cost is worth paying to avoid bloating compile time. It may also make sense to keep all the cursor stuff behind feature flags. There is a lot of code that is just fully copy-pasted.
Yeah I'm definitely overall in favor of finding a way to upstream your work. It's a really nice use case to serve if we can do it. But I'll want to understand more about the trade offs. Are you planning to publish the crate? (So that it's easy to read the crate docs.) |
yeah I will publish the crate later today but full disclaimer: The docs still need work. They are nowhere near the great documentation upstream has (most of the docs are also copied from upstream where API was covered). The public API probably won't be too interesting for you though since it's mostly a carbon copy from regex-automata (including docs). Mostly the cursor trait will be interesting. |
Yeah no/poor docs is cool. Just want to navigate the types and what not. |
I polished the docs up a bit and pushed it to crates.io; it should be on docs.rs in a couple minutes: https://docs.rs/regex-cursor/
Had a look at what would be involved with implementing the overlapping search routines, and indeed I see what you mean by lots of copy/paste, since these rely on some

Edit: FWIW, the impression I get is that it seems overall likely I'd be better off just running matches over many regexes individually than trying to complete that aspect of the API, if only for maintainability's sake.
@pascalkuthe have you run any benchmarks against your implementation? I am curious how it compares as a baseline. |
no, I haven't gotten around to that yet. The performance will mostly depend on your collection (a collection with small chunks and a slow cursor will see a much larger impact than one with large chunks and a fast cursor). I will only be benchmarking ropey, as that is the only practical use case I have. There will be some slowdown from the cases being accelerated by the engines/strategies not implemented yet, but that is only temporary. The only thing where performance would really be interesting is the prefilter, since that actually has some additional complexity. The rest I would expect to be very close to upstream regex if chunk breaks are reasonably rare (which they are in ropey's case).
I am really excited for this feature! Thank you for putting in so much work on clearing a path for it. I took some time to benchmark it using rebar to get baseline results. This is for the
I think we can ignore everything under 20% for now. I haven't gone through all of them, but for most of the big offenders (those that are an order of magnitude slower) I think I identified the cause.
Those are really impressive numbers. Wow. Yes there are a few 10x in there, but not many. I was expecting things to be a lot worse. (I still haven't looked at the code yet though.) Nice work @pascalkuthe.
Maybe a small update: |
I recently saw that another Rust implementation of regexes over streams, |
Cross-refer to related content I read today: |
This is more-or-less a continuation of issue #25 (most of which is actually here).
Preface
I don't personally have an urgent need for this functionality, but I do think it would be useful and would make the regex crate even more powerful and flexible. I also have a motivating use-case that I didn't see mentioned in the previous issue.
More importantly, though, I think I have a reasonable design that would handle all the relevant use-cases for streaming regex--or at least would make the regex crate not the limiting/blocking factor. I don't have the time/energy to work on implementing it myself, so please take this proposal with the appropriate amount of salt. It's more of a thought and a "hey, I think this design might work", than anything else.
And most importantly: thanks so much to everyone who has put time and effort into contributing to the regex crate! It is no coincidence that it has become such a staple of the Rust ecosystem. It's a great piece of software!
My use-case
I occasionally hack on a toy text editor project of mine, and this editor uses ropes as its in-memory text data structure. The relevant implication of this is that text in my editor is split over non-contiguous chunks of memory. Since the regex crate only works on contiguous strings, that means I can't use it to perform searches on text in my editor. (Unless, I suppose, I copy the text wholesale into a contiguous chunk of memory just to perform the search on that copy. But that seems overkill and wouldn't make sense for larger texts.)
Proposal
In the previous issue discussing this topic, the main problem noted was that the regex crate would have to allocate (e.g. a String) to return the contents of matches from an arbitrary stream. My proposed solution essentially amounts to: don't return the content of the match at all, and instead only return the byte offsets. It is then the responsibility of the client code to fetch the actual contents. For example, my editor would use its own rope APIs to fetch the contents (or replace them, or whatever), completely independent of the regex crate.
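To make that division of labor concrete, here is a sketch of what the contract could look like from the editor's side. Ropey's `chunks` and `byte_slice` are real methods; `StreamRegex` and `find_offsets` are hypothetical names invented for this illustration:

```rust
use ropey::Rope;

fn highlight_matches(re: &StreamRegex, rope: &Rope) {
    // Hypothetical offsets-only API: the engine consumes chunks and yields
    // (start, end) byte offsets, never the matched text itself.
    for (start, end) in re.find_offsets(rope.chunks().map(str::as_bytes)) {
        // The *client* resolves offsets against its own storage.
        let matched = rope.byte_slice(start..end);
        println!("match at {start}..{end}: {matched}");
    }
}
```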
The current API that returns the contents along with offsets could (and probably should) still be included as a convenience for performing regex on contiguous slices. But the "raw" or "low level" API would only yield byte offsets, allowing for a wider range of use-cases.
Layered API
I'm imagining there would be three "layers" to the API, of increasing levels of convenience and decreasing levels of flexibility:
1. Feed chunks of bytes manually, handling matches as we go
2. Give regex an iterator that yields bytes
3. Give regex a slice, just like the current API
I'm of course not suggesting naming schemes here, or even the precise way that these APIs should work. I'm just trying to illustrate the idea. :-)
Note that API 2 above addresses my use-case just fine. But API 1 provides even more flexibility for other use-cases.
Things this doesn't address
BurntSushi noted the following in the previous discussion (referencing Go's streaming regex support):
This proposal doesn't solve that problem, but rather side-steps it, making it the responsibility of the client code to decide how to handle it (or not). Practically speaking, this isn't actually an API problem but rather a fundamental problem with unbounded streaming searches.
IMO, it doesn't make sense to keep this functionality out of the regex crate because of this issue, because the issue is by its nature outside of the regex crate. The important thing is to design the API such that people can implement their own domain-specific solutions in the client code.
As an aside: API 1 above could be enhanced to provide the length of the longest potential match so far. For clarity of what I mean, here is an example of what that might look like and how it could be used:
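(All names below are invented for illustration; nothing like this exists in the regex crate today.)

```rust
/// Result of feeding one chunk to a hypothetical streaming searcher.
enum FeedResult {
    /// A match was found at these byte offsets into the overall stream.
    Match { start: usize, end: usize },
    /// No match yet. No in-flight match can begin before this offset, so
    /// the caller may safely discard any earlier bytes it has buffered.
    NoMatchYet { earliest_possible_start: usize },
}

trait StreamSearcher {
    fn feed(&mut self, chunk: &[u8]) -> FeedResult;
}
```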
That would allow client code to hold onto only the minimal amount of data. Nevertheless, that only mitigates the problem, since you can still have regexes that match unbounded amounts of data.