Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add modern tree-sitter support behind an experimental flag #472

Merged
merged 307 commits into from
May 20, 2023

Conversation

savetheclocktower
Copy link
Contributor

@savetheclocktower savetheclocktower commented Apr 6, 2023

Depends on #454 (I think).

Description of the Change

This PR brings the master branch up-to-date with my tree-sitter-hell branch (which has been the de facto branch for iterating on the modern-tree-sitter language mode).

This will add an experimental feature that is explicitly marked as experimental, but which should not affect anyone who doesn’t opt into it. So here’s how I’d like this to work, if folks are OK with it:

  • We confirm that it doesn’t crash anyone’s editor, whether they’ve opted in or not. (This is why I think we’ll need to land add allow-jit entitlement (fixes Apple Silicon builds) #454 first.)
  • People verify that all the specs pass, especially new specs, and that there are no spec regressions compared to current master.
  • We test it enough to ascertain that behavior hasn’t changed for anyone who doesn’t opt in by changing the core.useExperimentalModernTreeSitter setting.
  • We do a sanity check when core.useExperimentalModernTreeSitter is set to true and make sure it can do simple things, like editing and saving files, without breaking the editor.

I know this seems like a low bar to set, but don’t get me wrong: in practice, the new mode is working great for me. My aim is just to define a realistic standard for getting an experimental feature into the editor, rather than wait until everything is at 100% feature parity.

How does this work?

The core.useExperimentalModernTreeSitter setting will be the opt-in setting for using experimental web-tree-sitter grammars.

If core.useExperimentalModernTreeSitter is set to false:

  • Modern Tree-sitter grammars will be hidden in the grammar selector, so the user won’t be able to choose one manually
  • In general it’ll be nearly impossible to get Pulsar to pick a modern tree-sitter grammar if a TM or legacy-TS version exists
  • The grammar selector will show a green “Tree-sitter” badge for TS grammars and no equivalent badge for TM-style grammars, just as it does today

If core.useExperimentalModernTreeSitter is set to true:

  • Modern Tree-sitter grammars will be visible in the grammar selector alongside the two other kinds of grammars
  • Pulsar will prefer a modern Tree-sitter grammar when choosing a grammar for a newly opened file, if such a grammar exists
  • The user will be able to choose between three different grammars in the grammar selector if they all exist for a given language and the “Hide Duplicate TextMate Grammars” setting for grammar-selector is unchecked; otherwise the user will choose between legacy-TS and modern-TS grammars when all three exist, and will see TM grammars only when neither kind of TS grammar exists for a language
  • The grammar selector will show a green “Tree-sitter” badge for modern TS grammars, a yellow (warning) “Legacy Tree-sitter” badge for old TS grammars, and no equivalent badge for TM-style grammars

No, I mean, like, the whole system. How does it work now?

Here's most of the documentation I've written as I've gone. I'm open to any suggestions or critiques about the design choices themselves, though perhaps they're a bit big of a subject for this PR and could take place in Discord.

Which grammars have been updated?

This will add new modern-tree-sitter versions of every grammar that had a legacy tree-sitter equivalent:

  • In language-c: source.c and source.cpp
  • In language-css: source.css
  • In language-go: source.go
  • In language-html: text.html.basic, text.html.erb, text.html.ejs
  • In language-hyperlink: A hyperlink-highlighting parser for injecting into strings and comments (not yet injected into all grammars)
  • In language-java: source.java
  • In language-javascript: source.javascript, plus an injection for highlighting regular expressions and one for highlighting JSDoc (JSX has built-in support)
  • In language-json: source.json
  • In language-python: source.python
  • In language-ruby: source.ruby, plus an injection for highlighting regular expressions
  • In language-rust-bundled: source.rust
  • In language-shellscript: source.shell
  • In language-todo: A parser for highlighting TODOs and FIXMEs and whatnot for injecting into code comments (not yet injected into all grammars)
  • In language-typescript: source.ts and source.tsx (two different tree-sitter parsers)

There are a few other tree-sitter parsers that are good enough to be added to one of our built-in packages — Markdown, PHP, YAML — but those don’t have legacy tree-sitter grammars, so there’s no urgency to replace them.

Alternate Designs

Oh, that ship has sailed.

Possible Drawbacks

The infrastructure for modern tree-sitter grammars will still be loaded for everyone, including the web-tree-sitter package and all WASM files. We can mitigate this somewhat by ignoring modern tree-sitter grammars during window load unless the user has opted in, but at a cost: changing core.useExperimentalModernTreeSitter to opt into the experimental mode would also require that the user reload the window. That’s why I chose not to do it here, but I’m open to the idea.

This mode will require some documentation. It’s mostly written on my end, but amid the overall docs rewrite, it’s not clear to me where that text should live.

Verification Process

Here are some things to try after checking out this PR branch:

  • At first launch, don’t change any settings yet
  • Make sure you can edit files as usual, that Pulsar makes the same automatic grammar choices that it did before, and that your grammar overrides (if you have any) are still respected
  • Open the grammar selector on a JavaScript file and make sure you see a maximum of 1 or 2 JavaScript entries in the list, depending on your settings

Then, after changing core.useExperimentalModernTreeSitter to true:

  • Ensure that a legacy tree-sitter grammar is never chosen over a modern tree-sitter grammar
  • Ensure that you can switch between (e.g.) the TM-style JavaScript grammar and the modern-TS JavaScript grammar without anything exploding
  • Ensure basic editing tasks work, and that saving files works

Here are some things that may be different, but which I think will make you happy:

  • Syntax highlighting will ideally be the best of both worlds: thorough and specific like a TM-style grammar, but precise and fast like a legacy tree-sitter grammar.
  • Indentation is now able to be more nuanced, and we can do some magical-feeling stuff with it, like hanging indents for statements that span multiple lines. I’ve added some of those rules into the JavaScript grammar, but in general they can be applied to lots of languages where it makes sense.
  • We have the ability to highlight TODOs in comments, and URLs in comments and strings, the way that TM-style grammars can do, but which legacy TS grammars never could. The integration requires work from a grammar author and isn’t yet as simple as I’d like, but it’s present in the new JavaScript grammar.

Here are some things that may be different, and may take you by surprise:

  • Your syntax highlighting will be different. Some of that is accidental, but some of it is the result of intentional renaming of some scopes. The bundled syntax themes will require some tweaks; I plan to have that discussion once I get some feedback.

  • Indentation rules might behave a bit differently. In general, if it works better, great; if it’s not as good as what you were used to, please do file an issue once this lands on master.

  • Code folding will behave differently. In general, we are now preferring to exclude a closing delimiter from the fold, so that most folds will look like…

    class Foo {}

    …but otherwise should be very similar to code folding in a legacy grammar. If you notice otherwise, please do file an issue once this lands on master.

  • I’ve noticed some occasional odd behaviors around highlighting of injected layers, but it’s been very hard to reproduce consistently. If you do notice any of these, please do file an issue once this lands on master. If you can reproduce it consistently, I will send you gifts. (I think these are all fixed, but absolutely let me know if I'm wrong.)

Release Notes

Added an experimental new tree-sitter mode that will eventually replace Pulsar’s legacy support for tree-sitter grammars. You can opt into this mode via the Core → Use Experimental Modern Tree-Sitter setting.

mauricioszabo and others added 30 commits February 28, 2023 20:06
…folds

Implementing folds for new tree-sitter
Get indents working better. Move things around a bit.
Implement `isRowCommented`,  add more syntax node predicates, implement automatic reloading of SCM files.
`LanguageLayer` is re-introduced. All tree parsing and querying will eventually need to live inside an instance of `LanguageLayer`. Right now, just the syntax highlighting is doing so.
A quick explanation on this - the old code was matching if the full
scope was ok. This is fine when we want to test a single grammar, but
for example, TextMate grammars always added `.ruby` on the last part of
the grammar. That was quite bad for TreeSitter, that basically didn't do
the same thing. So now it matches a fine crafted Regexp that basically
checks if the full scope is match from the beginning, OR if part of the
scope (up to just before the `.`) matches. So, for example, for
`constant.other.ruby` it'll match `constant`, `constant.other`, and
`constant.other.ruby` but it'll NOT MATCH `constant.oth`.
Preliminary. Folds and indents are broken at the moment.
…and add a `tags.scm`.
This is the latest `master` — it works fine now that we build our own version
of `web-tree-sitter`.

It's twice as big as the last stable release. I don't know why, but I've been
using it for about a week without any trouble.
…the same ones we applied to ERB earlier.
…when they haven't been re-parsed yet.

Say we've got an injection layer on rows 100-150. If a user inserts a newline
on line 1, we know for certain that that doesn't affect the parsed tree of that
injection layer, except that it needs to be edited to incorporate that change
so that its node positions are correct. So we wait to re-parse it because we
don't need to, even though we know the tree is technically dirty. We only need
to re-parse it when an edit occurs within its layer extent.

But each node has a `text` property that's actually just a getter. When `text`
is read, the node looks up its text using its `startIndex` and `endIndex`. If we
called `parse` with a string originally, then it's going to do its lookups
against a string that we know to be stale. Most `#match?` predicates would be
doomed to fail, and if the display layer needed to highlight any part of that
layer, it'd miss a lot of stuff.

Luckily, web-tree-sitter envisions that your buffer might be represented by a
strange data structure, and lets you specify a callback that is used when
the parser needs to get the contents of various buffer ranges. We can have that
callback do its lookups against a string copy of the buffer that we know to be
fresh, because we're updating it on every buffer change.

This is a big deal, because the alternative is having to re-parse every
injection layer on every transaction. Now we can safely run captures against
dirty trees when we know that the changes cannot have affected the structure of
the tree.
…returns its expected object signature.
…where such a parse could fail because the source text changes in the middle of
the parse.

The change I introduced yesterday made it possible for an async parse job to get
confused because its source text has changed in between async jobs. To prevent
this, the buffer text must remain constant during the parse, but should use the
most current buffer text _after_ the parse.
…to accept multiple node types.
…for consumption by other packages.

This is nearly the simplest possible interface around the `bookmarks` package,
except that it will fire events when the number of bookmarks has changed.
…when there's no indents query to help us out.

This seems to be the default behavior for plaintext and for other language
modes.

I've got a Tree-sitter Markdown grammar locally and I noticed it wasn't
preserving indentation when I was two levels deep into a bulleted list and hit
`Enter` — I had to re-indent each time. (The Markdown grammar doesn't include an
indents query because there's practically no way to predict indentation levels
in Markdown.)
…to consider content ranges instead of extents.

If you want to know which layers are operative at a given point, it makes no
sense to check a layer's extent. A `LanguageLayer` could extend over the entire
buffer and still only apply highlighting to one very small range in the buffer.

Thus it makes no sense for `controllingLayerAtPoint` not to check a layer's
content ranges, nor does it make sense for the `injectionLayersAtPoint` method
that it relies on.

If I felt like there was any use in knowing that a language layer extended over
a range that included the given point but wasn't operative at that point, I'd be
hesitant to change these methods. But I don't.

This bug was discovered when I realized that the wrong indents query was being
consulted in the following HTML…

    <style>
      #foo {

…when I moved the cursor to the end of the `<style>` element and hit `Enter`.
Forgot to sort deeper layers first.
…as overly broad.

Turns out there are some scenarios where you'd want to know about a layer whose
extent may include a given point, even if its content ranges _don't_ include
that point. I'd have realized this if I'd done something as simple as run the
test suite. Silly me.

But `controllingLayerAtPoint` will still enforce the content range constraint
because we use it to pick a winner for indent queries, fold queries, and so on.
@mauricioszabo
Copy link
Contributor

So - Ruby test is failing, and it's expected to do - I wrote that with the scopes that the queries from NeoVIM where returning at the time, and we changed that.

Autocomplete-CSS is also failing, and this one I have no idea if the failure is correct or not...

…when straddling an injection boundary.

When we do this one row at a time, the controlling layer for an indent query is
the comparison (previous) row, because that's where the query starts. The batch
version should be no different.

The new spec describes the exact scenario that revealed this bug.
…not to expect empty tokens.

I made a change in 29cfcad that neglected to apply any scopes for a range that
was zero characters long. My instincts tell me that this is a safe change to
make, but it does affect any tests that used `tokensForScreenRow` and expected
it to report information about zero-length tokens. So the expected results
needed to be updated.
This got put into a conditional, but I neglected to remove the original line.
@savetheclocktower
Copy link
Contributor Author

Autocomplete-CSS is also failing, and this one I have no idea if the failure is correct or not...

File that one under “how did this ever work?” — a spec is failing because it's expecting a scope name that is present in the TextMate-style CSS grammar but missing from the tree-sitter CSS grammar. This drives me nuts. They shipped tree-sitter back in the day without even checking for stuff like this.

This is just a band-aid. There are deeper issues here, like lack of
compatibility with tree-sitter grammars and lack of SCSS (new syntax) support.
@savetheclocktower
Copy link
Contributor Author

So - Ruby test is failing, and it's expected to do - I wrote that with the scopes that the queries from NeoVIM where returning at the time, and we changed that.

Yeah, my inclination right now is to skip those specs. I haven't written any for the other grammars because I expect there to be a lot of flux as we adjust highlights.scm files to maintain scope name compatibility with legacy grammars (when needed and where appropriate). Once we're ready to ship this for real, we can go through and write grammar-level specs.

Temporarily skipping the tree-sitter specs; we can formalize them later.
Copy link
Contributor

@mauricioszabo mauricioszabo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's GOOOOO!!!

Copy link
Member

@confused-Techie confused-Techie left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lets do this!!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants