
Incremental reparsing #5

Open
c42f opened this issue Feb 7, 2022 · 4 comments
Labels
enhancement, parser

Comments

@c42f
Member

c42f commented Feb 7, 2022

@davidanthoff asked on Zulip about incremental reparsing.

is there support for partial reparses, i.e. some sort of incremental parsing? Basic idea is that user presses one key in the editor, and we don't want to reparse the whole document on every key press, but only a subset, based on the precise range of the doc that was edited

To capture my thoughts on this somewhere more permanent: I think this should work fine, but there are a couple of tricky things to work out.

First, how are the changed bytes supplied to the parser system? I haven't looked into LanguageServer yet, but presumably it's something like "insert this byte here" or "change line 10 to such-and-such string". Those might require a representation of the source which isn't a String (or Vector{UInt8} buffer); it might be a rope data structure or something. Should we extend the SourceFile abstraction to allow different AbstractString types? Or perhaps this state should be managed outside the parser completely? Internally, I feel the lexer and parser should always operate on Vector{UInt8} as a concrete, efficient data structure for UTF-8 encoded text, so the subrange of text being parsed should probably be copied into one of these for use by the tokenizer (see the sketch below).
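
A minimal sketch of what that could look like (hypothetical names, not JuliaSyntax API: TextEdit and apply_edit are made up for illustration):

struct TextEdit
    first_byte::Int            # first byte replaced (1-based, inclusive)
    last_byte::Int             # last byte replaced; last_byte == first_byte - 1 encodes a pure insertion
    replacement::Vector{UInt8}
end

function apply_edit(buf::Vector{UInt8}, e::TextEdit)
    # Splice the replacement into a fresh buffer. A rope would avoid this O(n)
    # copy, but the lexer still wants a contiguous Vector{UInt8} for the
    # subrange actually being reparsed.
    vcat(buf[1:e.first_byte-1], e.replacement, buf[e.last_byte+1:end])
end

julia> buf = Vector{UInt8}("f(x) = x + 1");

julia> String(copy(apply_edit(buf, TextEdit(10, 10, Vector{UInt8}("-")))))
"f(x) = x - 1"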

Second, the new source text intersects with the existing parse tree node(s) which cover some range of bytes. There can be several such nodes nested together; which one do we choose? Equivalently, which production (JuliaSyntax.parse_* function) do we start reparsing from? Starting deeper in the tree is good because it implies a smaller span, but the parser may have nontrivial state which isn't explicit in the parse tree. For example, space-sensitive parsing within [] or macro calls, or the special parsing of in as = within the iterator specification of a for loop. So we'd need a list of rules specifying which productions we can restart parsing from, and a way to correctly reconstruct the ParseState for those cases (see the node-selection sketch below). To start with, toplevel/module scope is probably fine and we could throw something together quickly for that, I think.
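
To illustrate just the node-selection part, here's a hedged sketch (Node, covers and restart_node are made-up illustration types, not the JuliaSyntax tree API), treating only direct children of toplevel as safe restart points, per the suggestion above:

struct Node
    first_byte::Int
    last_byte::Int
    is_toplevel_child::Bool   # assumed flag: direct child of module/toplevel scope
    children::Vector{Node}
end

covers(n::Node, lo, hi) = n.first_byte <= lo && hi <= n.last_byte

function restart_node(root::Node, lo::Int, hi::Int)
    # Descend to the innermost node covering the edited bytes [lo, hi],
    # remembering the deepest ancestor we know how to restart parsing from.
    safe = root
    node = root
    while true
        i = findfirst(c -> covers(c, lo, hi), node.children)
        i === nothing && return safe
        node = node.children[i]
        node.is_toplevel_child && (safe = node)
    end
end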

@pfitzseb
Member

pfitzseb commented Feb 7, 2022

To start with, toplevel/module scope is probably fine and we could throw something together quickly for that, I think.

Agreed. I also think it's not hugely important for performance, though (the current LS ecosystem has incremental parsing disabled, I think, and edit performance isn't typically bad).

@c42f
Member Author

c42f commented Feb 7, 2022

To measure current performance a little...

The largest Julia source file in JuliaLang/julia seems to be test/core.jl, and we've got

julia> text = read(joinpath(Sys.BINDIR, "..", "share", "julia", "test", "core.jl"), String);

julia> sizeof(text)
198113

The current time to parse this on my machine and convert it into a SyntaxNode:

julia> @time JuliaSyntax.parse(JuliaSyntax.SyntaxNode, text);
  0.060513 seconds (549.17 k allocations: 33.554 MiB, 19.23% gc time)

With some optimization to tree creation we might approach the current time to parse this without constructing the AST:

julia> @time JuliaSyntax.parse(JuliaSyntax.ParseStream(text))
  0.022842 seconds (111.16 k allocations: 10.286 MiB)
ParseStream at position 198114

60 ms is probably detectable latency and annoying to some people. 20 ms should be low enough to feel close to "instant".

But I can imagine people might

  • Generate much larger source files than the 200 kB here (inadvisable, perhaps, but rather probable)
  • Have a much slower machine than my Intel i7 laptop, which is several years old but by no means slow.

Conversely, the hot token-handling loops and data structures in JuliaSyntax.jl haven't been optimized at all, so the parser might become quite a bit faster in the future.

Anyway, with all that in mind, I don't think we need incremental parsing immediately, but we should probably have it in the future, and there will likely be use cases where it matters. I think the current design accommodates it, and that's the most important thing at this early stage.

@c42f added the enhancement and parser labels on May 18, 2022
@LilithHafner
Member

Agree that it isn't needed now, and also that we should keep incremental parsing in mind when defining public API.

@gafter

gafter commented Jul 8, 2023

In Roslyn, any nontrivial parser state is recorded in the green nodes. If the state isn't the same during reparsing, the node is "crumbled" to its constituent parts (one level at a time) and reparsed.
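
A minimal sketch of that scheme (hypothetical names, not Roslyn's or JuliaSyntax's actual API; intersection with the edited byte range is omitted for brevity): each green node records the parser state it was produced under, and reuse falls back to the children, one level at a time, when the state doesn't match.

struct GreenNode
    width::Int             # byte width of this node's text
    parse_state::Symbol    # assumed encoding of relevant parser state, e.g. :normal, :space_sensitive
    children::Vector{GreenNode}
end

# Return the subtrees of `node` reusable under the current parser `state`,
# "crumbling" a node into its children when its recorded state doesn't match.
function reuse_or_crumble(node::GreenNode, state::Symbol)
    node.parse_state == state && return [node]       # whole subtree reusable as-is
    isempty(node.children) && return GreenNode[]     # bare token: must relex
    reduce(vcat, (reuse_or_crumble(c, state) for c in node.children);
           init=GreenNode[])
end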
