Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[RFC007] First step: AST representation #2072

Merged
merged 12 commits into from
Oct 25, 2024
Merged

Conversation

yannham
Copy link
Member

@yannham yannham commented Oct 16, 2024

New AST representation for the (future) bytecode compiler

This is an implementation of the first step of RFC007: define the first AST representation, supposed to be output by the parser.

The whole RFC007 is a big chunk of work; we don't want to implement it all at once and make it the default. This PR creates a new bytecode module, which is only enabled under the bytecode-experimental feature, so that mainline Nickel is left unchanged, and we can experiment step by step on the side.

Content

This PR is concerned with the AST design part. It defines a new immutable, arena-allocated AST, that has been cleaned from any runtime concern - this should be the AST as produced by the parser in the future.

Any subpart of term::Term that relies explicitely on the RichTerm representation has been copied, cleaned and adapted: mostly the patterns, record and array satellite datatypes.

Finally, this PR introduces a bytecode::compat module that converts from the current mainline AST to the new AST, to make sure we haven't overlooked any part and that we have enough methods to build any ast node. This part isn't tested yet but this PR is already huge so we left this for future work.

AST design guideline

Important: this PR doesn't set anything in stone. The size of the AST hasn't been checked or hardcore optimized yet, and we'll probably have to update the representation as we re-implement typechecking, etc. It's rather a first draft, trying to follow a systematic approach.

The AST is designed to be compact and adapted for processing by various analysis phases (mostly typechecking, code analysis by the LSP and compilation by the future bytecode compiler). Thus we've replaced any Box/Rc by plain immutable references, where the content has been allocated in a centralized arena (actually several ones of them).

For structs, we have no reason to use references - for example, if struct Foo has a field bar: Bar, there is no good reason to add an indirection bar: &'ast Bar. Thus struct fields use owned data as much as possible.

This is the converse for enums: to avoid size bloat, we add reference indirection for any variant where the arguments takes up more than a few words.

Because everything is immutable and shareable, and that we want to avoid heap allocation as much as possible for performance reason (arena allocation should be faster), we don't use Vec<T> but &'ast [T] instead, which is the immutable equivalent.

We've tried to reduce the variation of the same constructs: while the original AST has two Let and LetPattern, Fun and FunPattern, Record and RecRecord, and so on, this AST merges all those cases. Whilie we take a small size hit for the simplest cases (a let x = y now has one indirection, and store the size of a 1 element slice in a fat pointer), this makes the definition and the code consuming it arguably much simpler, and we won't pay this price at runtime since the AST will be compiled away.

Similarly, there's no difference anymore between UnaryOp, BinaryOp, and NAryOp and the corresponding Op1, Op2, OpN: there is only one PrimOp type, and one PrimOpApp node taking a slice of arguments. Function application is also made multi-ary, which is a more efficient representation for application to multiple arguments and can also help give better error messages during typechecking for over or under-application.

Type is the only satellite data that hasn't be replicated. It's a problem because it includes a Contract constructor that still refer to the old representation (a RichTerm). We can also wonder if we we'd like to arena-allocate the Type AST as well, but this is non trivial work and we left it for a follow-up PR.

Reviewing

The diff is really big, but keep in mind that a lot of code needed to be copy pasted and almost mechanically adapted. I've also added even more documentation (such as the arguments on primops).

The most interesting is probably type definitions: Node, Ast, Pattern, PrimOp, etc. The rest is mostly type-guided implementation.

core/Cargo.toml Outdated Show resolved Hide resolved
Copy link
Contributor

github-actions bot commented Oct 16, 2024

🐰 Bencher Report

Branch2072/merge
Testbedubuntu-latest

⚠️ WARNING: The following Measure does not have a Threshold. Without a Threshold, no Alerts will ever be generated!

Click here to create a new Threshold
For more information, see the Threshold documentation.
To only post results if a Threshold exists, set the --ci-only-thresholds CLI flag.

Click to view all benchmark results
BenchmarkLatencynanoseconds (ns)
fibonacci 10📈 view plot
⚠️ NO THRESHOLD
479,890.00
foldl arrays 50📈 view plot
⚠️ NO THRESHOLD
1,681,600.00
foldl arrays 500📈 view plot
⚠️ NO THRESHOLD
6,524,600.00
foldr strings 50📈 view plot
⚠️ NO THRESHOLD
7,072,100.00
foldr strings 500📈 view plot
⚠️ NO THRESHOLD
61,949,000.00
generate normal 250📈 view plot
⚠️ NO THRESHOLD
43,954,000.00
generate normal 50📈 view plot
⚠️ NO THRESHOLD
2,051,300.00
generate normal unchecked 1000📈 view plot
⚠️ NO THRESHOLD
3,300,300.00
generate normal unchecked 200📈 view plot
⚠️ NO THRESHOLD
745,310.00
pidigits 100📈 view plot
⚠️ NO THRESHOLD
3,200,600.00
pipe normal 20📈 view plot
⚠️ NO THRESHOLD
1,482,900.00
pipe normal 200📈 view plot
⚠️ NO THRESHOLD
10,459,000.00
product 30📈 view plot
⚠️ NO THRESHOLD
817,510.00
scalar 10📈 view plot
⚠️ NO THRESHOLD
1,493,100.00
sum 30📈 view plot
⚠️ NO THRESHOLD
815,220.00
🐰 View full continuous benchmarking report in Bencher

@aspiwack
Copy link
Member

For my understanding, do you actually intent the flat AST to be an output of the parser?

@jneem
Copy link
Member

jneem commented Oct 17, 2024

If I understood this correctly, there's no flat AST: just a tree AST and a bytecode format. This PR is the simplified tree AST produced by the parser. It's simplified compared to current nickel because it doesn't have to support evaluation.

@aspiwack
Copy link
Member

Ah. That's the point I'd missed. Thanks @jneem .

This commit starts to define an immutable AST, the first representation
of the future bytecode virtual machine.
This commit continues the effort of defining a new AST. It introduces
many helper methods to build nodes (which requires explicit allocation
in arenas), and introduces methods to convert from the current mainline
AST representation (unfinished).
Remaining TODO: shuffle arguments order of some primops.
@yannham yannham marked this pull request as ready for review October 23, 2024 09:48
@yannham yannham requested a review from jneem October 23, 2024 09:50
Copy link
Member

@jneem jneem left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the new simple Term -- it almost fits within one editor window!

core/src/bytecode/ast/compat.rs Show resolved Hide resolved
core/src/bytecode/ast/mod.rs Outdated Show resolved Hide resolved
core/src/bytecode/ast/mod.rs Outdated Show resolved Hide resolved
core/src/bytecode/ast/pattern/bindings.rs Show resolved Hide resolved
core/src/bytecode/ast/pattern/mod.rs Outdated Show resolved Hide resolved
@yannham yannham added this pull request to the merge queue Oct 25, 2024
Merged via the queue into master with commit aaa04ab Oct 25, 2024
5 checks passed
@yannham yannham deleted the rfc007/ast-first-draft branch October 25, 2024 10:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants