Incremental Parsing #114

Closed
sirinath opened this issue Dec 1, 2014 · 13 comments

Comments

@sirinath

sirinath commented Dec 1, 2014

The following is a similar project that supports incremental parsing: https://github.com/Eliah-Lakhin/papa-carlo. Perhaps some of its cool features could be incorporated into parboiled2.

@alexander-myltsev
Contributor

@sirinath, thank you for the information. I have heard of it; it is in my queue to borrow some features for parboiled2.

@sirthias
Owner

sirthias commented Dec 1, 2014

I don't think the incremental parsing approach from papa-carlo applies to pb2.
Also, I'm not sure it would help in any way to speed things up.
The README from papa-carlo says that it parses a 600-line JSON file in ca. 250 ms.
That is roughly 2,400 JSON lines per second, about three orders of magnitude (≈ 830×) slower than pb2, which does about 2 million JSON lines per second with the jsonBenchmark example.

I think what we want is #42, i.e. error recovery.
But not necessarily in an incremental fashion.

sirthias closed this as completed Dec 1, 2014
@MasseGuillaume

@sirthias well, if you reparse the same input multiple times (like Scala syntax in an IDE context), it becomes constant-time parsing per edit vs. linear.

@sirthias
Owner

sirthias commented Dec 2, 2014

Ok. The question is what is faster in a real-world setting.
If papa-carlo is 1000 times slower, the benefit of constant vs. linear runtime only manifests itself for extremely large files.

@Eliah-Lakhin

Hello, guys.

Thanks for mentioning my project here.

First of all I have to say that I myself am a big fan of Parboiled. I have been using it for years, and Mathias has done a really great job implementing a parsing library that, on the one hand, has a parser generator's performance characteristics and, on the other hand, provides an API that is not a pain to use in practice, in contrast to many other parser libraries.

In Papa Carlo my primary goal was reaching incrementality first. And it is true that on startup it is slow compared to well-optimized linear-time parsers such as Parboiled. That is one of the drawbacks of Papa Carlo that, I believe, sooner or later will be fixed... maybe by integrating Parboiled's backend into Papa Carlo - why not? That way we could win by combining the good parts of both projects.

Talking about the JSON parser: I have to admit that for JSON it doesn't really matter whether we have incrementality or not, because in that case any well-optimized linear-time parser is extremely fast. But I think for something less synthetic, say Scala syntax, @MasseGuillaume's argument holds: constant-time parser performance may save much more time than a linear-time parser does in practice, maybe even for not-so-large files.

Advocating the incremental parsing approach, I would also like to add that Papa Carlo rewrites only the affected fragments of the parse tree rather than rewriting the whole tree. This is very useful for the subsequent compilation phases, such as semantic analysis, since it makes it possible to implement those phases in an incremental/caching fashion too. And I think this is a very important part, because in practice the semantic phase is usually the slowest compiler phase.
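
To illustrate the idea, here is a rough sketch with made-up names (not Papa Carlo's actual API): parse-tree fragments are cached keyed by the source span they cover, and after an edit only the fragments whose span overlaps the edited range are reparsed, while everything else is reused as-is.

```scala
// Rough sketch of incremental fragment reuse; Span, Fragment and FragmentCache
// are illustrative names, not Papa Carlo's real types.
final case class Span(start: Int, end: Int) {
  def overlaps(other: Span): Boolean = start < other.end && other.start < end
}
final case class Fragment(span: Span, subtree: String) // subtree stands in for a parsed node

final class FragmentCache(initial: Vector[Fragment]) {
  private var fragments = initial

  /** Reparse only the fragments touched by the edit; untouched fragments are kept verbatim. */
  def applyEdit(edited: Span, reparse: Span => Fragment): Vector[Fragment] = {
    fragments = fragments.map(f => if (f.span.overlaps(edited)) reparse(f.span) else f)
    fragments
  }
}
```

The payoff is that later phases keyed on the same fragments can reuse their cached results for everything outside the edited span.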

@MasseGuillaume

@Eliah-Lakhin naive question: is it possible to initialize the parser with parboiled and then switch to incremental mode?

@Eliah-Lakhin

@MasseGuillaume I think it is possible. The code parsing loop runs in this procedure: https://github.com/Eliah-Lakhin/papa-carlo/blob/cb8722235227bf57873f3051754c537cc1e6bf74/src/main/scala/name.lakhin.eliah.projects/papacarlo/syntax/Session.scala#L65, using PEG rule interpreters on top of lexical tokens. So the thing is that we would need to replace these interpreters with Parboiled-based parsers. For each Papa Carlo rule there are similar rules in Parboiled that work the same way.
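
For illustration, here is a minimal parboiled2 rule sketch (a made-up KeyValueParser grammar, not Papa Carlo's or pb2's actual code) of the kind of PEG rules that such an interpreter could be swapped out for:

```scala
import org.parboiled2._
import scala.util.{Failure, Success}

// Minimal parboiled2 parser sketch; the grammar is illustrative only.
class KeyValueParser(val input: ParserInput) extends Parser {
  def WS    = rule { zeroOrMore(anyOf(" \t")) }
  def Ident = rule { capture(oneOrMore(CharPredicate.AlphaNum)) ~ WS }
  def Pair  = rule { Ident ~ ':' ~ WS ~ Ident ~> ((k: String, v: String) => k -> v) }
  def Line  = rule { WS ~ Pair ~ EOI }
}

object Demo extends App {
  new KeyValueParser("answer: 42").Line.run() match {
    case Success(pair) => println(pair)              // (answer,42)
    case Failure(e)    => println(s"parse failed: $e")
  }
}
```

Since each such rule has a clear Papa Carlo counterpart, driving the incremental machinery with compiled pb2 rules instead of interpreted ones seems plausible in principle.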

@sirinath
Author

sirinath commented Dec 8, 2014

Perhaps this is something worth exploring further, or even merging the efforts of both projects.

@paulp

paulp commented Dec 9, 2014

> Advocating the incremental parsing approach, I would also like to add that Papa Carlo rewrites only the affected fragments of the parse tree rather than rewriting the whole tree. This is very useful for the subsequent compilation phases, such as semantic analysis, since it makes it possible to implement those phases in an incremental/caching fashion too.

And don't underestimate the importance and potential huge benefits of understanding language semantics differentially. We should be designing programming languages with the primary ambition that it is easy to know the consequences of a change, because that is how we create all non-trivial programs: by changing existing ones. Instead, every language of which I am aware has been designed at least implicitly as if we would be writing every program from scratch.

Of course, one reason this isn't the way things are done is that it will be difficult, and because "language engineering" is almost non-existent - programming languages are mostly developed as if they are one-off throwaways, not as modular foundations for exploring the interplay between the many design tradeoffs. The fact that odersky thinks scalac's parser is stable and all finished says enough.

In summary, I intend to have the benefits of both these approaches, one way or another!

@sirinath
Author

Maybe this can be re-opened?

@sirthias
Owner

I currently don't see incremental parsing as a priority for pb2.
There are too many other focus points for us at this time, and I don't have a clear picture of how exactly incremental parsing could work in pb2 in a way that wouldn't deteriorate general parsing performance or introduce too much complexity.

My take here is still that parsing is something that could be (and should be) made very fast, so that reparsing a "usually sized" input unit is not a problem.

What I would be more interested in is a "streamified parsing" approach, where pb2 can be used to parse potentially infinite streams of, say, JSON data in a nice fashion. This is somewhat related to incremental parsing, but less general.
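
Just to sketch the direction (hypothetical types, nothing like this exists in pb2 today): such a parser would emit events as input chunks arrive instead of producing a single result value at the end.

```scala
// Hypothetical sketch of an event-based ("streamified") JSON interface;
// these types are illustrative and are not part of parboiled2.
sealed trait JsonEvent
case object StartObject                     extends JsonEvent
case object EndObject                       extends JsonEvent
case object StartArray                      extends JsonEvent
case object EndArray                        extends JsonEvent
final case class FieldName(name: String)    extends JsonEvent
final case class StringValue(value: String) extends JsonEvent
final case class NumberValue(value: Double) extends JsonEvent

trait StreamingJsonParser {
  /** Feed the next chunk of input; returns the events completed so far. */
  def push(chunk: String): Seq[JsonEvent]
  /** Signal end of input; returns remaining events or a parse error message. */
  def finish(): Either[String, Seq[JsonEvent]]
}
```

A real design would have to decide how partial matches at chunk boundaries are buffered, which is where most of the complexity would live.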

@paulp

paulp commented Dec 10, 2014

Oh, an event parser. That's a good idea.

@sirinath
Author

The current parser does not need any change to accommodate this. We just have to efficiently find 1) where to start and stop parsing so that the change lies in between, and 2) then update the tree.
