Rewrite the grammar once again.
* Parses the GHC codebase!

  I'm using a trimmed set of the source directories of the compiler and most core libraries in
  [this repo](https://github.com/tek/tsh-test-ghc).

  This used to break horribly in many files because explicit brace layouts weren't supported very well.

* Faster in most cases!
  Here are a few simple benchmarks to illustrate the difference, not to be taken _too_ seriously, using the test
  codebases in `test/libs`:

  Old:
  ```
  effects: 32ms
  postgrest: 91ms
  ivory: 224ms
  polysemy: 84ms
  semantic: 1336ms
  haskell-language-server: 532ms
  flatparse: 45ms
  ```

  New:
  ```
  effects: 29ms
  postgrest: 64ms
  ivory: 178ms
  polysemy: 70ms
  semantic: 692ms
  haskell-language-server: 390ms
  flatparse: 36ms
  ```

  GHC's `compiler` directory takes 3000ms, but is among the fastest repos for per-line and per-character times!
  To get more detailed info (including new codebases I added, consisting mostly of core libraries), run
  `test/parse-libs`.
  I also added an interface for running `hyperfine`, exposed as a Nix app – execute
  `nix run .#bench-libs -- stm mtl transformers` with the desired set of libraries in `test/libs` or
  `test/libs/tsh-test-ghc/libraries`.

* Smaller size of the shared object.

  `tree-sitter generate` produces a `haskell.so` with a size of 4.4MB for the old grammar, and 3.0MB for the new one.

* Significantly faster time to generate, and slightly faster build.

  On my machine, generation takes 9.34s vs 2.85s, and compiling takes 3.75s vs 3.33s.

* All terminals now have proper text nodes when possible, like the `.` in modules.
  Fixes #102, #107, #115 (partially?).

* Semicolons are now forced after newlines even if the current parse state doesn't allow them, to fail alternative
  interpretations in GLR conflicts that sometimes produced top-level expression splices for valid (and invalid) code.
  Fixes #89, #105, #111.

* Comments aren't pulled into preceding layouts anymore.
  Fixes #82, #109.
  (Can probably still be improved with a few heuristics for e.g. postfix haddock)

* Similarly, whitespace is kept out of layout-related nodes as much as possible.
  Fixes #74.

* Hashes can now be operators in all situations, without sacrificing unboxed tuples.
  Fixes #108.

* Expression quotes are now handled separately from quasiquotes and their contents parsed properly.
  Fixes #116.

* Explicit brace layouts are now handled correctly.
  Fixes #92.

* Function application with multiple block arguments is handled correctly.

* Unicode categories for identifiers now match GHC, and the full unicode character set is supported for things like
  prefix operator detection.

* Haddock comments have dedicated nodes now.

* Use named precedences instead of closely replicating the GHC parser's productions.

* Different layouts are tracked and closed with their special cases considered.
  In particular, multi-way if now has layout.

* Fixed a CPP bug where a mid-line `#endif` would be a false positive.

* CPP only matches legal directives now.

* Parsing is generally more lenient than in GHC, also in the presence of errors:
  * Missing closing tokens at EOF are tolerated for:
    * CPP
    * Comment
    * TH Quotation
  * Multiple semicolons are tolerated in some positions, like `if/then`
  * Unboxed tuples and sums are allowed to have arbitrary numbers of filled positions

* List comprehensions can have multiple sets of qualifiers (`ParallelListComp`).

* Deriving clauses after GADTs don't require layout anymore.

* Newtype instance heads are working properly now.

* Escaping newlines in comments and CPP works now.
  Escaping newlines on regular lines won't be implemented.

* One remaining issue is that qualified left sections that contain infix ops are broken, e.g. `(a + a A.+)`.
  I haven't managed to figure out a good strategy for this – my suspicion is that it's impossible to correctly parse
  application, infix and negation without lexing all qualified names in the scanner.
  I will try that out at some point, but for now I'm planning to just accept that this one thing doesn't work.
  For what it's worth, none of the codebases I use for testing contain this construct in a way that breaks parsing.

* The repo now includes a Haskell program that generates C code for classifying characters as belonging to some sets
  of Unicode categories, using bitmaps.
  I might need to change this to write them all to a shared file, so the set of source files stays the same.
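
The bitmap idea from the last item can be illustrated with a small sketch. This is not the repo's Haskell generator, nor the C code it emits: it is a Python stand-in, and the function names, the chosen category set, the 8-bits-per-byte packing and the `limit` cutoff are all assumptions made for illustration.

```python
import unicodedata


def build_bitmap(categories, limit=0x110000):
    """Pack one bit per code point: set if its general category is in `categories`."""
    bitmap = bytearray((limit + 7) // 8)
    for cp in range(limit):
        if unicodedata.category(chr(cp)) in categories:
            bitmap[cp >> 3] |= 1 << (cp & 7)
    return bitmap


def in_bitmap(bitmap, cp):
    """Membership test: one index, one shift, one mask."""
    byte_index = cp >> 3
    if byte_index >= len(bitmap):
        return False  # code points past the table are treated as non-members
    return bool(bitmap[byte_index] & (1 << (cp & 7)))


def emit_c_array(name, bitmap):
    """Render the bitmap as a C array definition (hypothetical output format)."""
    body = ", ".join(f"0x{b:02x}" for b in bitmap)
    return f"static const uint8_t {name}[{len(bitmap)}] = {{{body}}};"


# Hypothetical set: uppercase and titlecase letters, truncated to keep the example fast.
UPPER = build_bitmap({"Lu", "Lt"}, limit=0x3000)
```

A scanner could then answer "is this character in the set?" with a single array lookup instead of a chain of range comparisons, which is presumably the appeal of generating the tables ahead of time.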
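
The Old/New timings listed under "Faster in most cases" are easier to compare as ratios. The sketch below only does arithmetic on the figures quoted in that item; it does not rerun any benchmark, and the variable and function names are made up for this example.

```python
# Parse times in milliseconds, copied from the Old/New lists in the commit message.
old_ms = {
    "effects": 32, "postgrest": 91, "ivory": 224, "polysemy": 84,
    "semantic": 1336, "haskell-language-server": 532, "flatparse": 45,
}
new_ms = {
    "effects": 29, "postgrest": 64, "ivory": 178, "polysemy": 70,
    "semantic": 692, "haskell-language-server": 390, "flatparse": 36,
}


def speedups(old, new):
    """Return old/new time ratios per codebase (values above 1 mean faster)."""
    return {name: old[name] / new[name] for name in old}


ratios = speedups(old_ms, new_ms)
total_ratio = sum(old_ms.values()) / sum(new_ms.values())

# Print the codebases from largest to smallest improvement, then the summed total.
for name, ratio in sorted(ratios.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<24} {old_ms[name]:>5}ms -> {new_ms[name]:>4}ms  ({ratio:.2f}x)")
print(f"{'total':<24} {sum(old_ms.values()):>5}ms -> {sum(new_ms.values()):>4}ms  ({total_ratio:.2f}x)")
```

On these numbers the largest gain is on `semantic` (about 1.9x) and the smallest on `effects` (about 1.1x); summed over the seven codebases, the time drops from 2344ms to 1459ms. As the commit message says, these figures are not to be taken too seriously.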
tek committed Mar 24, 2024
1 parent 95a4f00 commit 07299fd
Showing 154 changed files with 574,420 additions and 893,240 deletions.
3 changes: 3 additions & 0 deletions .gitattributes
@@ -1,2 +1,5 @@
/src/** linguist-vendored
/examples/* linguist-vendored
/src/parser.c -diff
/src/grammar.json -diff
/src/node-types.json -diff
8 changes: 5 additions & 3 deletions .gitignore
@@ -2,10 +2,12 @@ node_modules
build
*.log
package-lock.json
repos
examples/*
!examples/.gitkeep
/test/libs/*
!/test/libs/.gitkeep
.gdb_history
*.o
*.so
/.build/
/target/
/dist-newstyle
/.lib/
38 changes: 13 additions & 25 deletions Cargo.lock
16 changes: 13 additions & 3 deletions Cargo.toml
@@ -1,12 +1,11 @@
[package]
name = "tree-sitter-haskell"
description = "haskell grammar for the tree-sitter parsing library"
version = "0.15.0"
version = "1.0.0"
keywords = ["incremental", "parsing", "haskell"]
categories = ["parsing", "text-editors"]
repository = "https://github.com/tree-sitter/tree-sitter-haskell"
edition = "2018"
license = "MIT"
edition = "2021"

build = "bindings/rust/build.rs"
include = [
@@ -19,6 +18,17 @@ include = [
[lib]
path = "bindings/rust/lib.rs"

[[test]]
name = "parse-test"
path = "test/rust/parse-test.rs"

[[bin]]
name = "parse"
path = "test/rust/parse.rs"
test = false
bench = false
doc = false

[dependencies]
tree-sitter = "0.20"

5 changes: 4 additions & 1 deletion Package.swift
@@ -32,7 +32,10 @@ let package = Package(
sources: [
"src/parser.c",
"src/scanner.c",
"src/unicode.h",
"src/id.h",
"src/space.h",
"src/symop.h",
"src/varid-start.h",
],
resources: [
.copy("queries")
30 changes: 9 additions & 21 deletions README.md
@@ -1,10 +1,6 @@
# tree-sitter-haskell

[![CI][ci]](https://github.com/tree-sitter/tree-sitter-haskell/actions/workflows/ci.yml)
[![discord][discord]](https://discord.gg/w7nTvsVJhm)
[![matrix][matrix]](https://matrix.to/#/#tree-sitter-chat:matrix.org)
[![crates][crates]](https://crates.io/crates/tree-sitter-haskell)
[![npm][npm]](https://www.npmjs.com/package/tree-sitter-haskell)
[![CI](https://github.com/tree-sitter/tree-sitter-haskell/actions/workflows/ci.yml/badge.svg)](https://github.com/tree-sitter/tree-sitter-haskell/actions/workflows/ci.yml)

Haskell grammar for [tree-sitter].

@@ -23,7 +19,7 @@ local parser_config = require "nvim-treesitter.parsers".get_parser_configs()
parser_config.haskell = {
install_info = {
url = "~/path/to/tree-sitter-haskell",
files = {"src/parser.c", "src/scanner.c", "src/unicode.h"}
files = {"src/parser.c", "src/scanner.c"}
}
}
EOF
@@ -100,7 +96,7 @@ These extensions are supported ✅, unsupported ❌ or not applicable because th
* NamedFieldPuns ✅
* NamedWildCards ✅
* NegativeLiterals ➖️
* NondecreasingIndentation
* NondecreasingIndentation
* NPlusKPatterns ➖️
* NullaryTypeClasses ✅
* NumDecimals ➖️
@@ -166,11 +162,7 @@ Preprocessor `#elif` and `#else` directives cannot be handled correctly, since t
manually reset to what it was at the `#if`.
As a workaround, the code blocks in the alternative branches are parsed as part of the directives.

## Layout

`NondecreasingIndentation` is not supported (yet?).

### Operators on newlines in `do`
## Operators on newlines in `do`

A strange edge case is when an infix operator follows an expression statement of a do block with an indent of less or equal the `do`'s layout column:

@@ -204,18 +196,20 @@ These are stored in `./tests/corpus/`
$ tree-sitter test
```

## Test parsing an example codebase
## Parsing the codebase of a real-world library

**Requires**: `bc`
This will print the percentage of the codebase parsed, and the time taken

```
$ ./script/parse-examples # this clones all repos
$ ./script/parse-example <example> # where <example> is a project under ./examples/
$ ./script/parse-libs # this clones all repos
$ ./script/parse-lib <lib> # where <lib> is a project under ./test/libs/
```

## Enable scanner debug output

<!-- TODO rewrite -->

To get an extra-verbose scanner, unoptimized, with debug symbols:

```
@@ -235,9 +229,3 @@ If you want to debug the scanner with `gdb`, you can
```
$ tree-sitter parse -D test/Basic.hs # Produces log.html
```

[ci]: https://img.shields.io/github/actions/workflow/status/tree-sitter/tree-sitter-haskell/ci.yml?logo=github&label=CI
[discord]: https://img.shields.io/discord/1063097320771698699?logo=discord&label=discord
[matrix]: https://img.shields.io/matrix/tree-sitter-chat%3Amatrix.org?logo=matrix&label=matrix
[npm]: https://img.shields.io/npm/v/tree-sitter-haskell?logo=npm
[crates]: https://img.shields.io/crates/v/tree-sitter-haskell?logo=rust
1 change: 1 addition & 0 deletions cabal.project
@@ -0,0 +1 @@
packages: tools
145 changes: 145 additions & 0 deletions flake.lock