Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add smithy-syntax package and smithy format command #1830

Merged
merged 19 commits into from
Jun 19, 2023
Merged

Conversation

mtdowling
Copy link
Member

This PR adds a new package called smithy-syntax that provides a token tree to represent a parsed model, and a formatter to format Smithy models. It's called a token tree instead of a parse tree because it bottoms-out at lexer tokens rather than more granular productions from the grammar for things like identifiers, quoted strings, etc.

A formatter is lumped in here to ensure that the toke tree works well and is usable for a real-world use case. smithy-syntax is currently marked as unstable, but we can make it stable after integrating it with the Smithy LSP.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

mtdowling and others added 18 commits June 16, 2023 15:59
smithy-syntax is used to provide token trees (a very slightly higher
level parse tree) of Smithy IDL models. This token tree is meant to
be lossless, fault-tolerant, and easy to navigate.

The use case for smithy-syntax is to make it easier to perform syntax
analysis in tools like the LSP, IntelliJ, in-place model transforms
(e.g., refactoring tools), and a formatter.
This commit adds a test for almost every implementation
of the parse method for TreeType. OperationInput/Output/Errors,
and number nodes aren't covered yet. These tests only include valid
cases, and are not comprehensive yet. More cases are needed to check
edge cases and different code paths for some TreeTypes, as well as
error behavior. The test assertions check that the parser has produced
the correct TreeTypes for the root node and its immediate children.

Various minor bugs were fixed, and updates made while writing these tests:
- CapturingTokenizer was updated to avoid a bug where the first character,
$, was not included in the list of captured tokens.
- An enum member with no ws around the '=' could have the value assingment
skipped.
- Members with a value assignment could consume whitespace that should be a
part of the value assignment's BR.
- ENUM_SHAPE_MEMBER was added.
- Mixins added to operation shape.
- Trait statements added to inline aggregate shape.
- Some places changed an expect -> skip to the corresponding parse method.
- TraitStructure and NodeValue were swapped in TRAIT_BODY_VALUE.
- Removed extraneous BR from apply statement.
- Block apply statement and singular apply statement had eachothers
implementation.
This change implements the updated grammar for traits and operations
in #1800. WS and BR handling is now more explicit, including
individual token trees for comments, spaces, and commas.

NODE_OBJECT_KEY now contains an IDENTIFIER and QUOTED_TEXT tree rather
than raw tokens.

This change also cleans up token trees and adds some new features:
* They implement FromSourceLocation
* Removed unused methods.
All TreeTypeTests now expect the entire tree to be valid. This
helps us find ambiguity and inconsistencies in the parser for
cases when parsing a child tree is invalid only due to the
context of the parent. Fixes minor bugs that caused trees to be
invalid:
- enumMember test didn't have a BR after VALUE_ASSIGNMENT

Fills in missing tests for:
- Operations
- Block apply statements
- Entity type name
- Aggregate type name

Other minor changes and improvements:
- Adds util methods for parsing optional for-resource, mixins,
and value assignment productions.
- Adds util method for parsing any shape type and name.
- Updates peekPastSpaces to not skip the current token, in case
it is not SP. This simplifies parsing value assignments, and has
more intuitive behavior.
- Removes peekPastWs, which was only used for TRAIT_BODY_VALUE,
inlining the implementation. peekPastWs skipped the current token
like peekPastSpaces, which is necessary for TRAIT_BODY_VALUE.
- Add ENTITY_TYPE_NAME production.
- Updates name of some productions to be consistent with grammar.
- Fixed a few places where TOKEN was parsed as IDENTIFIER.
This formatter uses Wadler's pretty printer algorithm to automatically
format code based on line length (via prettier4j).

Future commits should investigate shading the dependency on prettier4J.
Co-authored-by: Jordon Phillips <JordonPhillips@users.noreply.github.com>
This doesn't account for every possible bad doc comment (e.g., doc
comments that come after traits). That can be a followup.
Node string values should parse as a SHAPE_ID not as an IDENTIFIER.
* Add parse failure tests

Adds test cases that validate the parser fails when expected by
asserting that any descendant of the root is an error tree. There
is a test per-production, if that production has terminal tokens,
or represents an alternation of productions that are distinguished
by terminal tokens. For example, SHAPE_OR_APPLY_STATEMENT chooses
between shape and apply statements based on whether the current
lexemme is "apply".

Only cases where the terminal tokens are invalid are tested because
if a non-terminal (ie. another production) is invalid, the error tree
is a child of the tree for that production. Since we also have tests
that make sure the correct children are present for each production,
we should have enough coverage to only test invalid non-terminals.
For example, when parsing this shape statement:
```
structure Foo.Bar {}
```
we know the `Foo.Bar` is part of the IDENTIFIER production, we have
tests that validate the the IDENTIFIER production is present, and
we have tests that check for errors in IDENTIFIERs, so we don't
need to specifically test this kind of error for shape statements.

One caveat is that the existing tests I'm referring to only ensure
all expected children are present for valid trees, so there is a
logical inconsistency in the argument that these tests provide full
coverage, but I can't think of a case that could cause a problem.
Either way, these tests should provide _good enough_ coverage.

Some bug fixes:
- Fixed an issue where the error caused by not finding a "namespace"
in NAMESPACE_STATEMENT was recovered by the parent tree.
- Expect IDENTIFIER for "with" token in MIXINS production.

* Add cases for productions with loops

Adds more test cases for `*Statement` productions to ensure loops
are broken out of properly when parsing fails.
Should not include the tree node if not present.
Minor changes to different docs and APIs for smithy-syntax
@mtdowling mtdowling requested a review from a team as a code owner June 16, 2023 20:59
@mtdowling mtdowling changed the title Add smithy-syntax package and smithy-format command Add smithy-syntax package and smithy format command Jun 16, 2023
@mtdowling mtdowling merged commit 6f6fe88 into main Jun 19, 2023
@mtdowling mtdowling deleted the token-tree branch September 11, 2023 18:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants