Use node.loc.tokens to improve handling of parentheses. #537

Merged — 21 commits, Oct 13, 2018

Conversation

benjamn (Owner) commented Sep 10, 2018

Recast has long suffered from not having reliable access to the lexical analysis of source tokens during reprinting. Instead, it relied on scanning raw source characters, which is error-prone: identifying comments, string literals, regular expression literals, and other large tokens is difficult or impossible without a full tokenization and sometimes a small amount of initial parsing.

Most importantly, accurate token information can be used to detect whether a node was originally wrapped with parentheses, even if those parentheses were separated from the node by comments or other incidental non-whitespace text, such as trailing commas. Here are just some of the issues that have resulted from the lack of reliable token information:

- #533
- #528
- #513
- #512
- #366
- #327
- #286

With this change, every node in the AST returned by recast.parse will now have a node.loc.tokens reference to a shared array containing the entire sequence of original source tokens, as well as node.loc.{start,end}.token indexes into this array of tokens, such that

node.loc.tokens.slice(
  node.loc.start.token,
  node.loc.end.token
)

returns a complete list of all source tokens contained by the node. Note that some nodes (such as comments) may contain no source tokens, in which case node.loc.start.token === node.loc.end.token, which will be the index of the first token after the position where the node appeared.
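As a minimal sketch of the indexing scheme described above (using a hand-built token array rather than a real recast AST, so the shapes here are illustrative only):

```javascript
// Sketch of the node.loc.tokens indexing scheme: start.token points at
// a node's first token, end.token just past its last token, and all
// nodes share one tokens array.
const tokens = [
  { type: "Punctuator", value: "(" },
  { type: "Identifier", value: "foo" },
  { type: "Punctuator", value: ")" },
  { type: "Punctuator", value: ";" },
];

// A hypothetical node for the identifier `foo` inside `(foo);`.
const fooNode = {
  loc: {
    tokens,
    start: { token: 1 },
    end: { token: 2 },
  },
};

const contained = fooNode.loc.tokens.slice(
  fooNode.loc.start.token,
  fooNode.loc.end.token
);

console.log(contained.map(t => t.value)); // [ "foo" ]
```

A node that contains no tokens would simply have equal start.token and end.token indexes, so the same slice returns an empty array.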

Most parsers can expose token information for free / very cheaply, as a byproduct of the parsing process. In case a custom parser is provided that does not expose token information, we fall back to Esprima's tokenizer. While there is considerable variation between different parsers in terms of AST format, there is much less variation in tokenization, so the Esprima tokenizer should be adequate in most cases (even for JS dialects like TypeScript). If it is not adequate, the caller should simply ensure that the custom parser exposes an ast.tokens array containing token objects with token.loc.{start,end}.{line,column} information.

@benjamn benjamn self-assigned this Sep 10, 2018
      newPath.firstInStatement() &&
      !hasOpeningParen(oldPath)) {
    return false;
  }
benjamn (Owner, Author):
Very glad to see this ceremony moved entirely into FastPath#hasParens.

FastPath#hasParens is now implemented in terms of source tokens rather
than source characters, so it should not be fooled as easily by
intervening comments etc.
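A minimal sketch of what a token-based parenthesis check can look like (the real FastPath#hasParens logic is more involved; the helper and data below are hypothetical):

```javascript
// Hypothetical token-based parenthesis check: a node is wrapped in
// parens if the token just before its start index is "(" and the token
// at its end index is ")". Comments are not tokens, so an intervening
// comment cannot fool this check the way it could fool a character scan.
function hasParens(node) {
  const { tokens, start, end } = node.loc;
  const before = tokens[start.token - 1];
  const after = tokens[end.token];
  return Boolean(
    before && before.value === "(" &&
    after && after.value === ")"
  );
}

// Tokens for `(/* comment */ foo)` — the comment never appears in the
// token stream, so the "(" and ")" remain adjacent to foo's indexes.
const tokens = [
  { value: "(" },
  { value: "foo" },
  { value: ")" },
];

const fooNode = { loc: { tokens, start: { token: 1 }, end: { token: 2 } } };
console.log(hasParens(fooNode)); // true
```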
These options are documented in lib/options.js and apply to the entire printing process; they don't change depending on how the various print functions are called.
https://ariya.io/2011/08/hall-of-api-shame-boolean-trap

This will allow adding additional options without overcomplicating all the
various places where print is called.
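The boolean-trap concern linked above can be illustrated with a generic sketch (the function and option names here are hypothetical, not recast's actual API):

```javascript
// Boolean trap: at the call site it is unclear what `true, false` mean,
// and adding a third flag forces every caller to change.
function printTrap(node, includeComments, wrapParens) {
  return { includeComments, wrapParens };
}

// Options object: every flag is named at the call site, and new options
// can be added later without breaking existing callers.
function print(node, options = {}) {
  const { includeComments = true, wrapParens = true } = options;
  return { includeComments, wrapParens };
}

const node = {};
printTrap(node, true, false);       // hard to read at the call site
print(node, { wrapParens: false }); // self-documenting
```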
Normally the genericPrint function will attempt to wrap its result with
parentheses if path.needsParens(), but that's not what we want if we're
printing the path in order to patch it into a location that already has
parentheses, so it's important to be able to disable this behavior in such
special cases.

Note that children printed recursively by the genericPrint[NoParens]
functions will still be wrapped with parentheses if necessary, since this
option applies to the printing of the root node only.

Should help with #327.
The complicated logic that I implemented previously was an approximation
of a more fundamental and ultimately much simpler decision problem: some
nodes need parentheses only because they can't come first in the enclosing
statement, due to parsing ambiguities (e.g. FunctionExpression and
ObjectExpression nodes). However, these nodes do not need to be followed
immediately by a closing parenthesis, because the presence of an opening
parenthesis is enough to resolve the parsing ambiguity.
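The statement-start ambiguity described above can be demonstrated directly (this is standard JavaScript behavior, not recast-specific):

```javascript
// At statement start, `{` begins a block (and `function` a declaration),
// so the same characters parse differently with and without a leading
// parenthesis.
const asStatement = eval("{ foo: 1 }");    // block with labeled statement
const asExpression = eval("({ foo: 1 })"); // object literal

console.log(asStatement);  // 1 (completion value of the block)
console.log(asExpression); // { foo: 1 }

// Only the *opening* paren is needed to force expression context; once
// parsing has begun as an expression, the closing paren merely balances it.
```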
@benjamn benjamn force-pushed the use-tokens-to-improve-parens-handling branch from 0475616 to c331354 on September 11, 2018
@benjamn benjamn merged commit 8a5493b into master Oct 13, 2018
eventualbuddha added a commit to codemod-js/codemod that referenced this pull request Nov 19, 2018

This is now required for recast, as it'll otherwise fall back on esprima, which doesn't work for JSX.

benjamn/recast#537