[RFC007] First step: AST representation #2072

yannham · 2024-10-16T17:09:46Z

New AST representation for the (future) bytecode compiler

This is an implementation of the first step of RFC007: defining the first AST representation, which is meant to be the parser's output.

RFC007 as a whole is a big chunk of work; we don't want to implement it all at once and make it the default. This PR creates a new bytecode module, which is only enabled under the bytecode-experimental feature, so that mainline Nickel is left unchanged and we can experiment step by step on the side.

Content

This PR is concerned with the AST design part. It defines a new immutable, arena-allocated AST that has been stripped of any runtime concern: this should be the AST produced by the parser in the future.

Any subpart of term::Term that relies explicitly on the RichTerm representation has been copied, cleaned and adapted: mostly the patterns, record and array satellite datatypes.

Finally, this PR introduces a bytecode::compat module that converts from the current mainline AST to the new AST, to make sure we haven't overlooked any part and that we have enough methods to build any AST node. This part isn't tested yet, but this PR is already huge, so we left testing for future work.
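To give an idea of what such a conversion involves, here is a minimal sketch, not the actual compat code. All the types below (OldTerm, OldUnaryOp, OldBinaryOp, Node, PrimOp) and the functions from_mainline and alloc_slice are simplified, hypothetical stand-ins, and Box::leak plays the role of the arena so that the sketch only needs the standard library.

```rust
use std::rc::Rc;

/// Hypothetical, heavily simplified stand-in for the mainline AST.
pub enum OldTerm {
    Num(i64),
    Op1(OldUnaryOp, Rc<OldTerm>),
    Op2(OldBinaryOp, Rc<OldTerm>, Rc<OldTerm>),
}

pub enum OldUnaryOp {
    Negate,
}

pub enum OldBinaryOp {
    Plus,
}

/// Hypothetical stand-in for the unified primop type of the new AST.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum PrimOp {
    Negate,
    Plus,
}

/// Hypothetical stand-in for the new AST node.
#[derive(Debug, PartialEq)]
pub enum Node<'ast> {
    Num(i64),
    PrimOpApp(PrimOp, &'ast [Node<'ast>]),
}

/// Assumption: `Box::leak` replaces arena allocation in this sketch.
pub fn alloc_slice<'ast, T: 'ast>(items: Vec<T>) -> &'ast [T] {
    Box::leak(items.into_boxed_slice())
}

/// Converts the old representation to the new one, mapping `Op1` and `Op2`
/// to a single `PrimOpApp` node that carries a slice of arguments.
pub fn from_mainline<'ast>(term: &OldTerm) -> Node<'ast> {
    match term {
        OldTerm::Num(n) => Node::Num(*n),
        OldTerm::Op1(op, arg) => {
            let op = match op {
                OldUnaryOp::Negate => PrimOp::Negate,
            };
            Node::PrimOpApp(op, alloc_slice(vec![from_mainline(arg)]))
        }
        OldTerm::Op2(op, left, right) => {
            let op = match op {
                OldBinaryOp::Plus => PrimOp::Plus,
            };
            Node::PrimOpApp(
                op,
                alloc_slice(vec![from_mainline(left), from_mainline(right)]),
            )
        }
    }
}
```

Writing the conversion as an exhaustive match means the compiler flags any old variant that has no counterpart in the new AST, which is the kind of coverage check the compat module is meant to provide.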

AST design guideline

Important: this PR doesn't set anything in stone. The size of the AST hasn't been checked or hardcore optimized yet, and we'll probably have to update the representation as we re-implement typechecking, etc. It's rather a first draft, trying to follow a systematic approach.

The AST is designed to be compact and adapted for processing by various analysis phases (mostly typechecking, code analysis by the LSP and compilation by the future bytecode compiler). Thus we've replaced every Box/Rc with plain immutable references, whose content is allocated in a centralized arena (actually several of them).
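As a rough illustration of the Box/Rc replacement (not the PR's actual definitions), the sketch below uses a hypothetical two-variant Node and a hypothetical AstAlloc allocator. Box::leak stands in for a real arena here, which is an assumption made to keep the example dependency-free.

```rust
/// Hypothetical, simplified node type; not the PR's actual `Node`.
#[derive(Debug, PartialEq)]
pub enum Node<'ast> {
    Num(i64),
    // Children are plain shared references instead of `Box<Node>` or `Rc<Node>`.
    Add(&'ast Node<'ast>, &'ast Node<'ast>),
}

/// Stand-in allocator (assumption): `Box::leak` plays the role of the arena,
/// so that the sketch only needs the standard library.
pub struct AstAlloc;

impl AstAlloc {
    /// Moves `value` to a stable location and hands back an immutable
    /// reference tied to the allocator's lifetime.
    pub fn alloc<'ast, T: 'ast>(&'ast self, value: T) -> &'ast T {
        Box::leak(Box::new(value))
    }
}

/// A consumer only ever sees shared references, so subtrees can be freely
/// shared between several parents.
pub fn eval(node: &Node<'_>) -> i64 {
    match node {
        Node::Num(n) => *n,
        Node::Add(left, right) => eval(left) + eval(right),
    }
}
```

Because references are Copy, the same allocated subtree can appear under several parents without reference counting, for example Node::Add(one, one) where one was allocated once.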

For structs, we have no reason to use references - for example, if struct Foo has a field bar: Bar, there is no good reason to add an indirection bar: &'ast Bar. Thus struct fields use owned data as much as possible.

The converse holds for enums: to avoid size bloat, we add a reference indirection for any variant whose arguments take up more than a few words.
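The size argument can be checked with std::mem::size_of. The sketch below is only an illustration: LetData, Inline and Indirect are hypothetical types, and exact sizes depend on the target and on the compiler's layout choices.

```rust
use std::mem::size_of;

/// Hypothetical payload that is several words wide.
pub struct LetData<'ast> {
    pub bindings: &'ast [(&'ast str, i64)],
    pub body: i64,
    pub rec: bool,
}

/// The payload is stored inline: every value of the enum, even `Null`,
/// is at least as large as `LetData`.
pub enum Inline<'ast> {
    Null,
    Let(LetData<'ast>),
}

/// The payload lives behind a reference: the enum stays small and only
/// the `Let` case pays for an indirection.
pub enum Indirect<'ast> {
    Null,
    Let(&'ast LetData<'ast>),
}

/// Returns the sizes in bytes of the inline and indirect variants.
pub fn sizes() -> (usize, usize) {
    (
        size_of::<Inline<'static>>(),
        size_of::<Indirect<'static>>(),
    )
}
```

Since every node of the tree has the size of the largest variant, keeping the wide payloads behind a reference keeps the common small nodes cheap.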

Because everything is immutable and shareable, and because we want to avoid heap allocation as much as possible for performance reasons (arena allocation should be faster), we don't use Vec<T> but &'ast [T] instead, which is the immutable equivalent.
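A small sketch of what a slice-backed node could look like, assuming a hypothetical Node with an Array variant and a hypothetical alloc_many helper. Box::leak again stands in for arena allocation, so this version still touches the heap, unlike what an arena would do.

```rust
/// Hypothetical, simplified node type with a sequence of children.
#[derive(Debug)]
pub enum Node<'ast> {
    Num(i64),
    // An immutable slice instead of a growable `Vec<Node>`.
    Array(&'ast [Node<'ast>]),
}

/// Hypothetical helper: moves all items to a stable location and returns
/// them as an immutable slice. Assumption: `Box::leak` replaces the arena.
pub fn alloc_many<'ast, T: 'ast>(items: impl IntoIterator<Item = T>) -> &'ast [T] {
    Box::leak(items.into_iter().collect::<Vec<T>>().into_boxed_slice())
}

/// Sums all the numbers contained in a node, recursing through arrays.
pub fn total(node: &Node<'_>) -> i64 {
    match node {
        Node::Num(n) => *n,
        Node::Array(elems) => elems.iter().map(total).sum::<i64>(),
    }
}
```

The slice is built once and never resized afterwards, which matches an AST that is immutable after parsing.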

We've tried to reduce the number of variants of the same construct: while the original AST distinguishes Let and LetPattern, Fun and FunPattern, Record and RecRecord, and so on, this AST merges all those cases. While we take a small size hit for the simplest cases (a let x = y now has one indirection, and stores the size of a one-element slice in a fat pointer), this makes the definition and the code consuming it arguably much simpler, and we won't pay this price at runtime since the AST will be compiled away.
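The merged let can be pictured roughly as follows. This is a sketch under assumptions: Pattern, LetBinding, Node, bound_names, leak and leak_slice are hypothetical and much simpler than the real definitions, and Box::leak replaces the arena.

```rust
/// Hypothetical, simplified pattern: a plain identifier is just one
/// particular pattern, so `Let` and `LetPattern` collapse into one node.
pub enum Pattern<'ast> {
    Any(&'ast str),
    Record(&'ast [&'ast str]),
}

pub struct LetBinding<'ast> {
    pub pattern: Pattern<'ast>,
    pub value: Node<'ast>,
}

pub enum Node<'ast> {
    Num(i64),
    Var(&'ast str),
    // A single let form: even `let x = 1 in x` stores a one-element slice.
    Let {
        bindings: &'ast [LetBinding<'ast>],
        body: &'ast Node<'ast>,
        rec: bool,
    },
}

/// Assumption: `Box::leak` stands in for arena allocation.
pub fn leak<'ast, T: 'ast>(value: T) -> &'ast T {
    Box::leak(Box::new(value))
}

pub fn leak_slice<'ast, T: 'ast>(values: Vec<T>) -> &'ast [T] {
    Box::leak(values.into_boxed_slice())
}

/// One code path handles both the simple and the destructuring case.
pub fn bound_names<'ast>(node: &Node<'ast>) -> Vec<&'ast str> {
    let mut names = Vec::new();
    if let Node::Let { bindings, .. } = node {
        for binding in bindings.iter() {
            match &binding.pattern {
                Pattern::Any(id) => names.push(*id),
                Pattern::Record(fields) => names.extend(fields.iter().copied()),
            }
        }
    }
    names
}
```

A consumer such as bound_names needs no special case for the simple let x = y form, which is the simplification the merge is after.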

Similarly, there's no difference anymore between UnaryOp, BinaryOp, and NAryOp and the corresponding Op1, Op2, OpN: there is only one PrimOp type, and one PrimOpApp node taking a slice of arguments. Function application is also made multi-ary, which is a more efficient representation for application to multiple arguments and can also help give better error messages during typechecking for over- or under-application.

Type is the only satellite data that hasn't been replicated. It's a problem because it includes a Contract constructor that still refers to the old representation (a RichTerm). We can also wonder if we'd like to arena-allocate the Type AST as well, but this is non-trivial work and we left it for a follow-up PR.
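To illustrate how a single PrimOpApp node with a slice of arguments lends itself to arity diagnostics, here is a minimal sketch. The PrimOp variants, the arity method, the Node shape and check_arity are all hypothetical and chosen for the example only.

```rust
/// Hypothetical unified primop type (illustrative variants only).
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum PrimOp {
    Negate,
    Plus,
}

impl PrimOp {
    /// Number of arguments the operator expects.
    pub fn arity(self) -> usize {
        match self {
            PrimOp::Negate => 1,
            PrimOp::Plus => 2,
        }
    }
}

/// Hypothetical, simplified node type.
pub enum Node<'ast> {
    Num(i64),
    Var(&'ast str),
    // One node for every primop application, whatever the arity.
    PrimOpApp {
        op: PrimOp,
        args: &'ast [Node<'ast>],
    },
    // Multi-ary application: `f x y` is one node with two arguments.
    App {
        head: &'ast Node<'ast>,
        args: &'ast [Node<'ast>],
    },
}

/// Reports over- or under-application of a primop. A real typechecker
/// would also compare `App` arguments against the function's type; this
/// sketch accepts every other node.
pub fn check_arity(node: &Node<'_>) -> Result<(), String> {
    match node {
        Node::PrimOpApp { op, args } if args.len() != op.arity() => Err(format!(
            "{:?} expects {} argument(s), got {}",
            op,
            op.arity(),
            args.len()
        )),
        _ => Ok(()),
    }
}
```

Since all the arguments are available at once in the slice, the check can report the expected and actual counts in a single message instead of discovering the mismatch one nested application at a time.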

Reviewing

The diff is really big, but keep in mind that a lot of code needed to be copy-pasted and almost mechanically adapted. I've also added even more documentation (such as the arguments of primops).

The most interesting part is probably the type definitions: Node, Ast, Pattern, PrimOp, etc. The rest is mostly type-guided implementation.

github-actions · 2024-10-16T17:32:40Z

Bencher Report

Branch	2072/merge
Testbed	ubuntu-latest

⚠️ WARNING: The Latency Measure does not have a Threshold. Without a Threshold, no Alerts will ever be generated!
To only post results if a Threshold exists, set the --ci-only-thresholds CLI flag.

Benchmark	Latency (ns)
fibonacci 10	479,890.00
foldl arrays 50	1,681,600.00
foldl arrays 500	6,524,600.00
foldr strings 50	7,072,100.00
foldr strings 500	61,949,000.00
generate normal 250	43,954,000.00
generate normal 50	2,051,300.00
generate normal unchecked 1000	3,300,300.00
generate normal unchecked 200	745,310.00
pidigits 100	3,200,600.00
pipe normal 20	1,482,900.00
pipe normal 200	10,459,000.00
product 30	817,510.00
scalar 10	1,493,100.00
sum 30	815,220.00

🐰 View full continuous benchmarking report in Bencher

aspiwack · 2024-10-17T00:31:23Z

For my understanding, do you actually intend the flat AST to be an output of the parser?

jneem · 2024-10-17T02:26:39Z

If I understood this correctly, there's no flat AST: just a tree AST and a bytecode format. This PR is the simplified tree AST produced by the parser. It's simplified compared to current Nickel because it doesn't have to support evaluation.

aspiwack · 2024-10-17T02:56:45Z

Ah. That's the point I'd missed. Thanks @jneem.

jneem

I like the new simple Term -- it almost fits within one editor window!

core/src/bytecode/ast/compat.rs

core/src/bytecode/ast/mod.rs

core/src/bytecode/ast/pattern/bindings.rs

core/src/bytecode/ast/pattern/mod.rs

yannham added 6 commits October 23, 2024 11:21

Initial AST draft

254f4ec

This commit starts to define an immutable AST, the first representation of the future bytecode virtual machine.

More AST building and conversion

eee6b95

This commit continues the effort of defining a new AST. It introduces many helper methods to build nodes (which requires explicit allocation in arenas), and introduces methods to convert from the current mainline AST representation (unfinished).

Move from_xxx in separate module; complete the conversion

c0730bc

Remaining TODO: shuffle arguments order of some primops.

Fix typo in comment

3e45912

Move primop in their own module

a3b450c

Fix clippy warning

24b837c

yannham force-pushed the rfc007/ast-first-draft branch from 20a6885 to 24b837c Compare October 23, 2024 09:23

yannham marked this pull request as ready for review October 23, 2024 09:48

Don't make typed-arena optional; Ident uses it

fbef9a6

yannham requested a review from jneem October 23, 2024 09:50

yannham added 2 commits October 23, 2024 14:55

Fix build with nix-experimental feature enabled

7d8bd8d

Fix cargo doc warnings

a79bcc5

yannham mentioned this pull request Oct 24, 2024

[RFC007] New AST for types #2079

Merged

jneem approved these changes Oct 24, 2024

yannham and others added 3 commits October 25, 2024 12:15

Apply suggestions from code review

545d4d2

Co-authored-by: jneem <joeneeman@gmail.com>

Get rid of the 'a lifetime

deb58c3

Get rid of comment leftover

0a009a8

yannham enabled auto-merge October 25, 2024 10:39

yannham added this pull request to the merge queue Oct 25, 2024

Merged via the queue into master with commit aaa04ab Oct 25, 2024
5 checks passed

yannham deleted the rfc007/ast-first-draft branch October 25, 2024 10:41
