Add modern tree-sitter support behind an experimental flag #472

savetheclocktower · 2023-04-06T20:36:45Z

Depends on #454 (I think).

Description of the Change

This PR brings the master branch up-to-date with my tree-sitter-hell branch (which has been the de facto branch for iterating on the modern-tree-sitter language mode).

This will add an experimental feature that is explicitly marked as experimental, but which should not affect anyone who doesn’t opt into it. So here’s how I’d like this to work, if folks are OK with it:

We confirm that it doesn’t crash anyone’s editor, whether they’ve opted in or not. (This is why I think we’ll need to land add allow-jit entitlement (fixes Apple Silicon builds) #454 first.)
People verify that all the specs pass, especially new specs, and that there are no spec regressions compared to current master.
We test it enough to ascertain that behavior hasn’t changed for anyone who doesn’t opt in by changing the core.useExperimentalModernTreeSitter setting.
We do a sanity check when core.useExperimentalModernTreeSitter is set to true and make sure it can do simple things, like editing and saving files, without breaking the editor.

I know this seems like a low bar to set, but don’t get me wrong: in practice, the new mode is working great for me. My aim is just to define a realistic standard for getting an experimental feature into the editor, rather than wait until everything is at 100% feature parity.

How does this work?

The core.useExperimentalModernTreeSitter setting will be the opt-in setting for using experimental web-tree-sitter grammars.

If core.useExperimentalModernTreeSitter is set to false:

Modern Tree-sitter grammars will be hidden in the grammar selector, so the user won’t be able to choose one manually
In general it’ll be nearly impossible to get Pulsar to pick a modern tree-sitter grammar if a TM or legacy-TS version exists
The grammar selector will show a green “Tree-sitter” badge for TS grammars and no equivalent badge for TM-style grammars, just as it does today

If core.useExperimentalModernTreeSitter is set to true:

Modern Tree-sitter grammars will be visible in the grammar selector alongside the two other kinds of grammars
Pulsar will prefer a modern Tree-sitter grammar when choosing a grammar for a newly opened file, if such a grammar exists
The user will be able to choose between three different grammars in the grammar selector if they all exist for a given language and the “Hide Duplicate TextMate Grammars” setting for grammar-selector is unchecked; otherwise the user will choose between legacy-TS and modern-TS grammars when all three exist, and will see TM grammars only when neither kind of TS grammar exists for a language
The grammar selector will show a green “Tree-sitter” badge for modern TS grammars, a yellow (warning) “Legacy Tree-sitter” badge for old TS grammars, and no equivalent badge for TM-style grammars

No, I mean, like, the whole system. How does it work now?

Here's most of the documentation I've written as I've gone. I'm open to any suggestions or critiques about the design choices themselves, though perhaps they're a bit big of a subject for this PR and could take place in Discord.

Which grammars have been updated?

This will add new modern-tree-sitter versions of every grammar that had a legacy tree-sitter equivalent:

In language-c: source.c and source.cpp
In language-css: source.css
In language-go: source.go
In language-html: text.html.basic, text.html.erb, text.html.ejs
In language-hyperlink: A hyperlink-highlighting parser for injecting into strings and comments (not yet injected into all grammars)
In language-java: source.java
In language-javascript: source.javascript, plus an injection for highlighting regular expressions and one for highlighting JSDoc (JSX has built-in support)
In language-json: source.json
In language-python: source.python
In language-ruby: source.ruby, plus an injection for highlighting regular expressions
In language-rust-bundled: source.rust
In language-shellscript: source.shell
In language-todo: A parser for highlighting TODOs and FIXMEs and whatnot for injecting into code comments (not yet injected into all grammars)
In language-typescript: source.ts and source.tsx (two different tree-sitter parsers)

There are a few other tree-sitter parsers that are good enough to be added to one of our built-in packages — Markdown, PHP, YAML — but those don’t have legacy tree-sitter grammars, so there’s no urgency to replace them.

Alternate Designs

Oh, that ship has sailed.

Possible Drawbacks

The infrastructure for modern tree-sitter grammars will still be loaded for everyone, including the web-tree-sitter package and all WASM files. We can mitigate this somewhat by ignoring modern tree-sitter grammars during window load unless the user has opted in, but at a cost: changing core.useExperimentalModernTreeSitter to opt into the experimental mode would also require that the user reload the window. That’s why I chose not to do it here, but I’m open to the idea.

This mode will require some documentation. It’s mostly written on my end, but amid the overall docs rewrite, it’s not clear to me where that text should live.

Verification Process

Here are some things to try after checking out this PR branch:

At first launch, don’t change any settings yet
Make sure you can edit files as usual, that Pulsar makes the same automatic grammar choices that it did before, and that your grammar overrides (if you have any) are still respected
Open the grammar selector on a JavaScript file and make sure you see a maximum of 1 or 2 JavaScript entries in the list, depending on your settings

Then, after changing core.useExperimentalModernTreeSitter to true:

Ensure that a legacy tree-sitter grammar is never chosen over a modern tree-sitter grammar
Ensure that you can switch between (e.g.) the TM-style JavaScript grammar and the modern-TS JavaScript grammar without anything exploding
Ensure basic editing tasks work, and that saving files works

Here are some things that may be different, but which I think will make you happy:

Syntax highlighting will ideally be the best of both worlds: thorough and specific like a TM-style grammar, but precise and fast like a legacy tree-sitter grammar.
Indentation is now able to be more nuanced, and we can do some magical-feeling stuff with it, like hanging indents for statements that span multiple lines. I’ve added some of those rules into the JavaScript grammar, but in general they can be applied to lots of languages where it makes sense.
We have the ability to highlight TODOs in comments, and URLs in comments and strings, the way that TM-style grammars can do, but which legacy TS grammars never could. The integration requires work from a grammar author and isn’t yet as simple as I’d like, but it’s present in the new JavaScript grammar.

Here are some things that may be different, and may take you by surprise:

Your syntax highlighting will be different. Some of that is accidental, but some of it is the result of intentional renaming of some scopes. The bundled syntax themes will require some tweaks; I plan to have that discussion once I get some feedback.
Indentation rules might behave a bit differently. In general, if it works better, great; if it’s not as good as what you were used to, please do file an issue once this lands on master.
Code folding will behave differently. In general, we are now preferring to exclude a closing delimiter from the fold, so that most folds will look like…
```
class Foo {…}
```
…but otherwise should be very similar to code folding in a legacy grammar. If you notice otherwise, please do file an issue once this lands on master.
I’ve noticed some occasional odd behaviors around highlighting of injected layers, but it’s been very hard to reproduce consistently. If you do notice any of these, please do file an issue once this lands on master. If you can reproduce it consistently, I will send you gifts. (I think these are all fixed, but absolutely let me know if I'm wrong.)

Release Notes

Added an experimental new tree-sitter mode that will eventually replace Pulsar’s legacy support for tree-sitter grammars. You can opt into this mode via the Core → Use Experimental Modern Tree-Sitter setting.

…folds Implementing folds for new tree-sitter

…nto fix-ruby-tree-sitter-scopes

Get indents working better. Move things around a bit.

Implement `isRowCommented`, add more syntax node predicates, implement automatic reloading of SCM files.

`LanguageLayer` is re-introduced. All tree parsing and querying will eventually need to live inside an instance of `LanguageLayer`. Right now, just the syntax highlighting is doing so.

A quick explanation on this - the old code was matching if the full scope was ok. This is fine when we want to test a single grammar, but for example, TextMate grammars always added `.ruby` on the last part of the grammar. That was quite bad for TreeSitter, that basically didn't do the same thing. So now it matches a fine crafted Regexp that basically checks if the full scope is match from the beginning, OR if part of the scope (up to just before the `.`) matches. So, for example, for `constant.other.ruby` it'll match `constant`, `constant.other`, and `constant.other.ruby` but it'll NOT MATCH `constant.oth`.

…nto fix-ruby-tree-sitter-scopes

…uby-tree-sitter-scopes-with-right-tokens

Preliminary. Folds and indents are broken at the moment.

…and add a `tags.scm`.

This is the latest `master` — it works fine now that we build our own version of `web-tree-sitter`. It's twice as big as the last stable release. I don't know why, but I've been using it for about a week without any trouble.

…the same ones we applied to ERB earlier.

…when they haven't been re-parsed yet. Say we've got an injection layer on rows 100-150. If a user inserts a newline on line 1, we know for certain that that doesn't affect the parsed tree of that injection layer, except that it needs to be edited to incorporate that change so that its node positions are correct. So we wait to re-parse it because we don't need to, even though we know the tree is technically dirty. We only need to re-parse it when an edit occurs within its layer extent. But each node has a `text` property that's actually just a getter. When `text` is read, the node looks up its text using its `startIndex` and `endIndex`. If we called `parse` with a string originally, then it's going to do its lookups against a string that we know to be stale. Most `#match?` predicates would be doomed to fail, and if the display layer needed to highlight any part of that layer, it'd miss a lot of stuff. Luckily, web-tree-sitter envisions that your buffer might be represented by a strange data structure, and lets you specify a callback that is used when the parser needs to get the contents of various buffer ranges. We can have that callback do its lookups against a string copy of the buffer that we know to be fresh, because we're updating it on every buffer change. This is a big deal, because the alternative is having to re-parse every injection layer on every transaction. Now we can safely run captures against dirty trees when we know that the changes cannot have affected the structure of the tree.

…returns its expected object signature.

…where such a parse could fail because the source text changes in the middle of the parse. The change I introduced yesterday made it possible for an async parse job to get confused because its source text has changed in between async jobs. To prevent this, the buffer text must remain constant during the parse, but should use the most current buffer text _after_ the parse.

…to accept multiple node types.

…for consumption by other packages. This is nearly the simplest possible interface around the `bookmarks` package, except that it will fire events when the number of bookmarks has changed.

…when there's no indents query to help us out. This seems to be the default behavior for plaintext and for other language modes. I've got a Tree-sitter Markdown grammar locally and I noticed it wasn't preserving indentation when I was two levels deep into a bulleted list and hit `Enter` — I had to re-indent each time. (The Markdown grammar doesn't include an indents query because there's practically no way to predict indentation levels in Markdown.)

…to consider content ranges instead of extents. If you want to know which layers are operative at a given point, it makes no sense to check a layer's extent. A `LanguageLayer` could extend over the entire buffer and still only apply highlighting to one very small range in the buffer. Thus it makes no sense for `controllingLayerAtPoint` not to check a layer's content ranges, nor does it make sense for the `injectionLayersAtPoint` method that it relies on. If I felt like there was any use in knowing that a language layer extended over a range that included the given point but wasn't operative at that point, I'd be hesitant to change these methods. But I don't. This bug was discovered when I realized that the wrong indents query was being consulted in the following HTML… <style> #foo { …when I moved the cursor to the end of the `<style>` element and hit `Enter`.

Forgot to sort deeper layers first.

…as overly broad. Turns out there are some scenarios where you'd want to know about a layer whose extent may include a given point, even if its content ranges _don't_ include that point. I'd have realized this if I'd done something as simple as run the test suite. Silly me. But `controllingLayerAtPoint` will still enforce the content range constraint because we use it to pick a winner for indent queries, fold queries, and so on.

mauricioszabo · 2023-05-19T21:11:49Z

So - Ruby test is failing, and it's expected to do - I wrote that with the scopes that the queries from NeoVIM where returning at the time, and we changed that.

Autocomplete-CSS is also failing, and this one I have no idea if the failure is correct or not...

…when straddling an injection boundary. When we do this one row at a time, the controlling layer for an indent query is the comparison (previous) row, because that's where the query starts. The batch version should be no different. The new spec describes the exact scenario that revealed this bug.

…not to expect empty tokens. I made a change in 29cfcad that neglected to apply any scopes for a range that was zero characters long. My instincts tell me that this is a safe change to make, but it does affect any tests that used `tokensForScreenRow` and expected it to report information about zero-length tokens. So the expected results needed to be updated.

This got put into a conditional, but I neglected to remove the original line.

savetheclocktower · 2023-05-19T21:58:19Z

Autocomplete-CSS is also failing, and this one I have no idea if the failure is correct or not...

File that one under “how did this ever work?” — a spec is failing because it's expecting a scope name that is present in the TextMate-style CSS grammar but missing from the tree-sitter CSS grammar. This drives me nuts. They shipped tree-sitter back in the day without even checking for stuff like this.

This is just a band-aid. There are deeper issues here, like lack of compatibility with tree-sitter grammars and lack of SCSS (new syntax) support.

savetheclocktower · 2023-05-19T22:11:52Z

So - Ruby test is failing, and it's expected to do - I wrote that with the scopes that the queries from NeoVIM where returning at the time, and we changed that.

Yeah, my inclination right now is to skip those specs. I haven't written any for the other grammars because I expect there to be a lot of flux as we adjust highlights.scm files to maintain scope name compatibility with legacy grammars (when needed and where appropriate). Once we're ready to ship this for real, we can go through and write grammar-level specs.

Temporarily skipping the tree-sitter specs; we can formalize them later.

mauricioszabo

Let's GOOOOO!!!

confused-Techie

Lets do this!!

mauricioszabo and others added 30 commits February 28, 2023 20:06

Merge pull request pulsar-edit#401 from pulsar-edit/wasm-tree-sitter-…

18b162c

…folds Implementing folds for new tree-sitter

Trying to solve some scopes on highlights

d5f2f4f

Adding a testfile for TS grammar

784adc1

Merge remote-tracking branch 'origin/feature/modernize-tree-sitter' i…

785bfc6

…nto fix-ruby-tree-sitter-scopes

Getting only real assert comments

5b80503

Fix iPath error

7842474

Added lang attribute

e1cce9a

I am in hell

917fdf5

additional hell

092fe3f

briefly out of hell

ebf30d8

Lots of fixes.

6a41b3b

Fix a few bugs.

319e0b7

Get indents working better. Move things around a bit.

Implement bufferRangeForScopeAtPosition

e73d69e

Fixes

3f614b4

Implement `isRowCommented`, add more syntax node predicates, implement automatic reloading of SCM files.

Paving the way for injections

12e8c9c

`LanguageLayer` is re-introduced. All tree parsing and querying will eventually need to live inside an instance of `LanguageLayer`. Right now, just the syntax highlighting is doing so.

Removing VERY specific matches

0863b75

Merge remote-tracking branch 'origin/feature/modernize-tree-sitter' i…

0362fc7

…nto fix-ruby-tree-sitter-scopes

Some highlights changes

a80cbf2

Merge remote-tracking branch 'clocktower/tree-sitter-hell' into fix-r…

46b19aa

…uby-tree-sitter-scopes-with-right-tokens

Fixed-ish tests

f6018f7

Adding a hook for did-change-highlighting

6006798

More fixes to highlights

ebf4d77

Make require the same as TextMate

77de02a

Get injections working.

67f7fb7

Preliminary. Folds and indents are broken at the moment.

Get indents and folds working again

bea3252

Don't force a parse just to do folds (this was dumb of me)

c2e44b4

Revert adding a hook for highlight

815b66c

Adding highlight files for Clojure

6a8cc16

Adding a ready promise

100ca5a

savetheclocktower added 18 commits April 30, 2023 14:05

Add namespaces to Ruby query files…

1f5947c

…and add a `tags.scm`.

Update to latest tree-sitter-ruby

ee3b892

This is the latest `master` — it works fine now that we build our own version of `web-tree-sitter`. It's twice as big as the last stable release. I don't know why, but I've been using it for about a week without any trouble.

Add some highlighting fixes to EJS…

d8444b4

…the same ones we applied to ERB earlier.

Fix a bug in ordering of highlight iterators

008fa13

Ensure getSyntaxNodeAndGrammarContainingRange…

0cbdc2a

…returns its expected object signature.

Fix typo

a2f3c9e

Fix bug in WASMTreeSitterLanguageMode#parse

ef90e16

Give onlyIf(Not)Type ability…

31f67b9

…to accept multiple node types.

Merge master into tree-sitter-hell

133ffe2

Bump second-mate to 9686771

2e9f0d4

Add bookmarks service…

3ce6b26

…for consumption by other packages. This is nearly the simplest possible interface around the `bookmarks` package, except that it will fire events when the number of bookmarks has changed.

Fix bug introduced in controllingLayersAtPoint

811315a

Forgot to sort deeper layers first.

Merge from master

a20cd06

savetheclocktower added 3 commits May 19, 2023 14:18

Remove redundant did-insert-text firing

87f5d78

This got put into a conditional, but I neglected to remove the original line.

Get autocomplete-css tests working

f7af100

This is just a band-aid. There are deeper issues here, like lack of compatibility with tree-sitter grammars and lack of SCSS (new syntax) support.

Get language-ruby specs working

0391f94

Temporarily skipping the tree-sitter specs; we can formalize them later.

mauricioszabo approved these changes May 20, 2023

View reviewed changes

confused-Techie approved these changes May 20, 2023

View reviewed changes

savetheclocktower merged commit 36ebc40 into pulsar-edit:master May 20, 2023

claytonrcarter mentioned this pull request Jan 17, 2024

Status re Pulsar? digitallyserviced/semanticolor#48

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add modern tree-sitter support behind an experimental flag #472

Add modern tree-sitter support behind an experimental flag #472

savetheclocktower commented Apr 6, 2023 •

edited

Loading

mauricioszabo commented May 19, 2023

savetheclocktower commented May 19, 2023

savetheclocktower commented May 19, 2023

mauricioszabo left a comment

confused-Techie left a comment

Add modern tree-sitter support behind an experimental flag #472

Add modern tree-sitter support behind an experimental flag #472

Conversation

savetheclocktower commented Apr 6, 2023 • edited Loading

Description of the Change

How does this work?

No, I mean, like, the whole system. How does it work now?

Which grammars have been updated?

Alternate Designs

Possible Drawbacks

Verification Process

Release Notes

mauricioszabo commented May 19, 2023

savetheclocktower commented May 19, 2023

savetheclocktower commented May 19, 2023

mauricioszabo left a comment

Choose a reason for hiding this comment

confused-Techie left a comment

Choose a reason for hiding this comment

savetheclocktower commented Apr 6, 2023 •

edited

Loading