Enforce insertion-order for SyntaxIssues.
Abstract special characters, e.g., \[Pi] -> Pi.
Add Span, Part, and MessageName to Concretify.
Non-BMP PUA characters are not strange.
Better reporting of unrecognized longnames.
Lots of internal updates.
Fix building for older versions on Mac.
Add suggestions for a->b and a:>b at top-level
Concrete CallNodes have changed how they contain their children.
The old behavior would create a list around the children nodes.
But there can only ever be a single child in a CallNode.
So the wrapping list was removed.
With this:
CodeConcreteParse["f[]"]
The old behavior is:
ContainerNode[String, {CallNode[{LeafNode[Symbol,
"f", <|Source -> {{1, 1}, {1, 2}}|>]}, {GroupNode[
GroupSquare, {LeafNode[Token`OpenSquare,
"[", <|Source -> {{1, 2}, {1, 3}}|>],
LeafNode[Token`CloseSquare,
"]", <|Source -> {{1, 3}, {1, 4}}|>]}, <|Source -> {{1, 2}, {1,
4}}|>]}, <|Source -> {{1, 1}, {1, 4}}|>]}, <|Source -> {{1,
1}, {1, 4}}|>]
And the new behavior is:
ContainerNode[String, {CallNode[{LeafNode[Symbol,
"f", <|Source -> {{1, 1}, {1, 2}}|>]},
GroupNode[
GroupSquare, {LeafNode[Token`OpenSquare,
"[", <|Source -> {{1, 2}, {1, 3}}|>],
LeafNode[Token`CloseSquare,
"]", <|Source -> {{1, 3}, {1, 4}}|>]}, <|Source -> {{1, 2}, {1,
4}}|>], <|Source -> {{1, 1}, {1, 4}}|>]}, <|Source -> {{1,
1}, {1, 4}}|>]
The major update is using clang's [[clang::musttail]] annotation to enforce tail call optimization.
This means that CodeParser no longer grows the stack when parsing. Recursive calls reuse the same stack frame.
This makes parsing faster, and it is no longer possible to hit the system stack limit when parsing.
To use this, build CodeParser with clang and with -DUSE_MUSTTAIL=ON.
This is completely opt-in.
Another recent update is that starting in 13.2 (and able to be turned on in 13.1), a library built by the Compiler for building exprs will be used instead of MathLink.
This also means speed.
The old behavior can be restored by passing -DTRANSPORT=MathLink.
Various curly quote characters are strange
Specify "sequence" -> "UTF-8 sequence" in various places
Specify reason for unsafe character encoding
AdditionalDescriptions for unrecognized long names
Show the parenthesized version of certain replacement characters
Limit fatal errors returned by parser to 10
Recommend ASCII replacements for strange characters
Handle =. uniformly, from both text and boxes
Warn about signs in precision specification of reals
Handle buggy FE syntax of a:b:c:d
Linear syntax \! is preserved during abstracting, so ScopingData must also handle it
U+279D Triangle-Headed Rightwards Arrow is strange
Handle SuperscriptBox in DefinitionSymbols
First draft of Concretify
Add new 13.1 FormalScript characters
support new 13.1 syntax "PackedArray"::["Real64"]
Handle out-of-range \|FFFFFF notation
parseBox should handle Cell that does not have BoxData
Related fix to 418690: Allow {a, b} = c to parse as having definitions for a and b
Fix 418744 but this was a loose thread and whole sweater unraveled.
Unsafe input will now give Missing["UnsafeCharacterEncoding"] instead of attempting to be repaired
unsafe input is anything with:
- incomplete sequences
- stray surrogates
- BOM
so Missing[] is now a head that may appear in different places with nodes
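As a rough illustration of the three unsafe cases listed above, the following Python sketch classifies a byte string. It is not CodeParser's implementation: the function name classify_input is hypothetical, returning None stands in for Missing["UnsafeCharacterEncoding"], and treating U+FEFF anywhere in the input as a BOM is an assumption. Python's strict decoder also rejects other malformed input (such as overlong encodings), which is broader than the list above.

```python
BOM = "\ufeff"

def classify_input(data: bytes):
    """Return the decoded text, or None when the input counts as unsafe.

    None is a stand-in for Missing["UnsafeCharacterEncoding"]; no repair
    of the input is attempted.
    """
    try:
        # Strict decoding rejects incomplete sequences and
        # UTF-8-encoded surrogates (bytes ED A0 80 .. ED BF BF).
        text = data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return None
    # Assumption: a BOM anywhere in the input makes it unsafe.
    if BOM in text:
        return None
    return text
```

Because nothing is repaired, a caller only has to distinguish the two outcomes; there is no intermediate status to track.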
Add unsafeCharacterEncodingFlag to ParserSession
Remove UTF8Status status field from ByteDecoder
status does not need to be tracked because no repairing is ever attempted
In 13.1, @@@ is parsed as System`MapApply
Update file format logic
Split "Definitions" metadata into "AdditionalDefinitions"
foo /: bar[foo] will have "Definitions" metadata for foo and "AdditionalDefinitions" metadata for bar
Fix 418240: CodeParse not respecting options set by SetOptions
Fix handling of NumberComma
Was mistakenly being treated as UnsupportedCharacter
Fix 420623: assert with \<0xa9>
Reduce severity of unexpected characters inside of strings or comments
Do not allow PacletManager to participate in finding `Generate` files
Do not give UnexpectedCharacter issue if UnexpectedSpaceCharacter is already going to be given
Handle multiple arguments to \[Integral]
RowBox as implicit Times
Change how implicit Nulls are parsed.
Implicit Nulls from stray commas are now errors. New error tokens Token`Error`PrefixImplicitNull and Token`Error`InfixImplicitNull are added.
"Comma" issues are no longer generated by the parser, they are now created by the linter.
Add syntax errors for old FE syntax
Work on unexpected calls from boxes
Add SubsuperscriptBox as ok
Include the name of the box when there is a strange call issue
Handle missing rands for all operators
Correctly handle Token`Tilde from boxes
Handle unrecognized characters
Fix 414131: =. from boxes was not being abstracted
Fix 414139: ' from boxes was not being abstracted properly
Fix not handling CallMissingCloserNode in Scoping
Only pass in "StringifyMode" -> 2 if arg is a String
Fix 414554: do not mark Block variable inside of Module as unused
Notes on compatibility have been added to docs/compatibility.md
Introduce BatchMode option for Abstract
BatchMode -> True where Begin[] and End[] nodes will be at top-level (e.g., .wl files)
- PackageNodes[] and ContextNodes[] WILL be created
- Issues about unbalanced directives WILL be created
BatchMode -> False otherwise, i.e., where Begin[] and End[] nodes are separate or not easily scanned together (e.g., cells in notebooks)
- PackageNodes[] and ContextNodes[] will NOT be created
- Issues about unbalanced directives will NOT be created
Fix 409304: Teach parser about TernaryOptionalPattern. TernaryOptionalPattern comes from boxes.
Multi-pronged fix for 409210:
\[Prime] is not strange
Add SubscriptBox and SuperscriptBox as not strange for head of calls
Fix 409216: Teach abstract and scoping about GridBox
Fix 409472: The abstracting code assumed that the Source was always LineColumn convention, and this was breaking for boxes. So introduce some functions to handle the different source conventions.
410337: Fix DefinitionSymbols for errors
410404: Fix parsing \[Integral] a + 2
Trivia2 was not scoped properly and was not being cleaned up at the correct time
Fix 405322: Symbols occurring in options to DynamicModule have same scope as body
Provide more ASCII approximations for long names
Handle linear syntax as tokens, but they are unsupported
A workflow is removing the \( \) from a LinearSyntaxBlob and then tokenizing.
>> and >>> should not give top-level lints
Use the source of just the ; when reporting top-level CompoundExpression
Add CodeStructuralSyntaxAggQ
Move stray comma handling back to concrete parser and teach -check flag to fail with stray commas
For unexpected characters, display both the actual encoded representation and the escaped form.
This will help with diagnosing problems with invisible characters
Add a new Kernel file RowBox.wl that is generated at build-time.
Hard-code more values for parseBox
In the fall-through cases of parseBox, be more efficient with testing different string cases
Remove StringifyMode 3 (passthrough). Just hard-code these cases
Allow "FileFormat" -> "Script" for ignoring shebang in .wls files
Provide better error reporting for \Alpha]
Add ScopingData[ast] function for gathering information about scoping constructs
Introduce DefinitionSymbols function and start using it
This is a saner function than the older DeclarationName
Update ASCII replacements
Provide some selectors for CodeAction objects
Move ToInputFormString
Compile` also has a ToInputFormString symbol, so to prevent shadowing messages, move CodeParser`ToInputFormString to CodeParser`ToString`ToInputFormString and rely on CodeParser`ToString not being on $ContextPath (unless explicitly loaded, of course)
Insert "FileName" into CSTs
Allow CodeConcreteParse of Cells, Notebooks, CellObjects, and NotebookObjects
Lower confidence of package-related issues that are not 100% errors
Fix warning from MSVC: using universal-character-name \u2423 for \[SpaceIndicator] was technically not correct, since MSVC is set up with a different code page by default
Fix handling #"foo"
Fix bug 404282, and related issues
CodeConcreteParse["\"\r"] would give messages because the \r newline was not being handled correctly
Fix that, and also fix issues with \r\n newlines where the indices from the SourceCharacterIndex were wrong because of treating \r\n as a single source character
Handle TernaryTilde at top-level
Add UpSet and UpSetDelayed to top-level definitions
Syntax such as Attributes[foo] = {HoldAll} is a definition for foo and not for Attributes
.wl files may or may not be scripts starting with #!, so both cases must be handled
In the doc page tutorial/FilesStreamsAndExternalOperations there is a paragraph that starts:
"Scripts may be stored either in normal .wl package files or in dedicated .wls script files."
Fix 406933: specific comment makes CodeParser give messages and go into infinite loop
Included in Mathematica 12.2
The parser side of what needs to be fixed for 398836: implicit Times tokens do need to be present in the box cst.
I had been remiss in treating unhandled boxes as implicit Times.
So start doing that now. But do the easy thing and simply riffle in ImplicitTimes tokens.
This is not strictly correct, but the analysis needed to properly insert ImplicitTimes tokens is too complicated in the current location.
Note: Mixing explicit Times operators and implicit Times does not currently work
Fix LinearSyntaxBlob in wrong context
Return a Rational when parsing something like 1*^-2
This returned an Integer before.
Add Token`PercentPercent as a variable length token
Add CompoundNode
Add CodeSyntaxQ, CodeSyntaxCSTQ, CodeStructuralSyntaxQ
Add Token`Error`UnterminatedFileString
Add Token`Boxes`CommentContent
Add UnterminatedTokenErrorNeedsReparseNode
Add Token`Error`UnexpectedCloser
Add GroupMissingOpenerNode (only used in boxes)
Add CallMissingCloserNode
Add QuaternaryNode and FragmentNode
Add new nodes and tags: MemoizedSetDelayed, MemoizedTagSetDelayed, UnterminatedGroupNode, UnterminatedCallNode
Add "TabWidth" option to functions and correctly calculate columns using tab stops
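To picture what calculating columns with tab stops involves, here is a minimal Python sketch. It assumes 1-based columns and tab stops at columns 1, 1 + TabWidth, 1 + 2*TabWidth, and so on; the function name column_after and the default width of 4 are hypothetical, and the rule CodeParser actually applies may differ.

```python
def column_after(line: str, tab_width: int = 4) -> int:
    """Return the 1-based column reached after reading every character in line."""
    column = 1
    for ch in line:
        if ch == "\t":
            # Jump to the next tab stop instead of advancing by one.
            column += tab_width - (column - 1) % tab_width
        else:
            column += 1
    return column
```

Under these assumptions a tab advances by between 1 and tab_width columns, depending on where it starts.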
Graduate SourceConvention option to be a symbol
Add -check flag to command-line parser
Remove OptionalDefaultNode and just use Token`UnderDot directly.
Remove Token`LineContinuation. The actual line continuation string is now always attached to the previous or next token. Currently, it's always the next token, but that is an implementation detail.
Remove the 3 fake line continuation code points. These are no longer used.
Remove UnexpectedLineContinuation FormatIssue. This is no longer used.
NOTE: line continuations are NOT currently handled in places where SourceCharacters are used directly: comments and file stringification. This is not relevant for comments, but it is wrong to not handle line continuations during file stringification.
Propagate EncodingIssue in a few places
Treat EqualDot as a proper binary operator
Add new |-> operator for Function.
Improve stack shim performance
Use results from some profiling to speed up tokenizing numbers.
Dynamically set GroupMissingCloserNeedsReparseNode and UnterminatedTokenErrorNeedsReparseNode while parsing, instead of doing Replace.
Use Normal[ReadByteArray[#]]& instead of Import[#, "Byte"]
Remove asserts from aggregate and try to not use local variables when possible
Add better handling for syntax with .. and ... where backtracking is needed
Examples are 2^^.. and 123*^2..
Continue parsing numbers with invalid bases or unrecognized digits (because of the base)
A massive number of minor bug fixes, typo fixes, reorganizations, tweaks, and cleanup
Properly abstract unquoted strings
Rename ParseLeaf -> ConcreteParseLeaf
Rename AST -> CodeParser
Add support for System`Private`NewContextPath / System`Private`RestoreContextPath
Do better job with ToStandardFormBoxes handling multiple inputs separated by newlines
Simplify newline handling so that newline tokens are contiguous and half-open, just like all other tokens
Enable multi-line mode for matching chunks.
Add SourceConvention option
Generate Parselet registrations at build-time
Convert various std::maps to sorted std::arrays
Add EncodingIssues
Template-ize UnderParselet
Make all parsing functions listable
Insert "Definition" metadata for functions
Allow + +a to parse as +a, the same as kernel
Change from old syntax CodeParse[str, h] to new syntax CodeParse[str, ContainerNode -> h]
Call setupLibraries[] lazily
Transition to purely parselet-driven parsing
Remove implicit Times logic from parser, and properly handle inside parselets.
No longer need to pay the cost of checking implicitTimes boolean for every parse.
Fix crash found by afl
Combine handling of Inequality and VectorInequality
Combine handling of Infix and Inequality
Cleanup several issues from fuzz testing
Only load expr lib functions if expr lib exists
Fix assert when parsing U+FEFF codepoint.
Fix FileExistsQ::fstr that can happen with earlier versions
Update Source of nodes to be half-open. This is a change from earlier versions, where the Source was always inclusive.
For example, here is the old Source for the integer 123:
In[2]:= ConcreteParseString["123"]
Out[2]= ContainerNode[String, {LeafNode[Integer,
"123", <|Source -> {{1, 1}, {1, 3}}|>]}, <||>]
And now here is the new Source:
In[3]:= ConcreteParseString["123"]
Out[3]= ContainerNode[String, {LeafNode[Integer,
"123", <|Source -> {{1, 1}, {1, 4}}|>]}, <||>]
This change has a number of nice qualities. It is now easy to determine the length of the token by subtracting the start from the end, and 0-length tokens can now be represented accurately.
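The arithmetic behind the half-open convention can be sketched in a few lines of Python. This is an illustration only: the helper token_length is hypothetical, and it only handles tokens that start and end on the same line.

```python
def token_length(source):
    """Length of a single-line token from a half-open {{line, col}, {line, col}} Source."""
    (start_line, start_col), (end_line, end_col) = source
    if start_line != end_line:
        raise ValueError("multi-line tokens need the source text to measure")
    # Half-open: the end column is one past the last character,
    # so no +1 adjustment is needed.
    return end_col - start_col
```

With the new Source for 123, {{1, 1}, {1, 4}}, this gives 4 - 1 = 3, and a 0-length token such as {{1, 4}, {1, 4}} gives 0. Under the old inclusive convention the same subtraction would have been off by one, and an empty token had no natural representation.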
Add Creator field to paclet.
Standardize on using ContainerNode as the outer-most node
Add InsertNodeAfter CodeAction command
Add ParseBytes function
Add InsertTextAfter CodeAction command
FormatIssues now explicitly supply their CodeActions
Do not complain about unexpected line continuations in comments
Add more operators:
- DoubleRightTee
- DoubleLeftTee
- UpTee
- DownTee
- RoundImplies
- Perpendicular
- etc.
Disable treating BMP PUA as strange for now
REPLACEMENT CHARACTER 0xfffd is strange; this will allow flagging of bad UTF-8 in the linter
Introduce \r\n as a single SourceCharacter. This greatly simplifies newline handling.
Standardize on using Whitespace as a token
Require using File[] wrapper
Add Listable version of Tokenize
Add SafeString function
InlinePart longname is unsupported
Bring in several fixes found from fuzz testing
When parsing a - b + c, make sure to give the abstracted Times expression the correct Source.
Fix implicit Times in boxes by giving it the same Source as the RHS
Fix parsing single-digit precision
Treat prefix !! properly
Fix line continuations in # and % tokens
Abstract HermitianConjugate into ConjugateTranspose
Handle a-EOF and a/EOF
Handle Unicode non-characters and BOM
Fix precedence problem of ++a++ and --a--
Remove unused MissingOpener parts
Do not treat - as separate binary operator from +. Combine - parsing with +.
Do not check for strange characters in comments
Switch to using unique_ptr implementation.
Work on messages
Simplify handling line continuation of just \r
No need to issue warning for errors being strange
Simplify handling \ at end of file
Return TOKEN_ERROR_EMPTYSTRING at EOF when appropriate
Prevent SourceManager from advancing past EOF
Do not count SyntaxErrorNodes as being strange at top-level
Do the favor of combining naked \ with next character for better error reporting
Tighter error handling
Call back to kernel for LongName suggestions
Add MultiBoxNode, for handling multiple inputs in an Input cell
Make sure that ParseLeafs remembers any SyntaxIssues
Support different SourceStyles
Space or Newline chars that are directly encoded are not strange
Remove string methods from SourceCharacter and WLCharacter, and provide iterators instead
Differentiate between line continuations with different newlines
Optimizing: Collect flag fields into single bitset fields
Optimizing: eliminate dynamic_cast
Develop system for using RAII to automatically handle queueing unhandled whitespace
Work on CodeActions
Distinguish between SyntaxIssues and FormatIssues
Add ConfidenceLevel to issues
Allow :: to work when parsing boxes
Allow ? to work when parsing boxes
Convert all syntax issues in C++ code to use CodeActions
Add LeafSeqNode and NodeSeqNode classes in order to alleviate the need to constantly iterate through vectors
Employ some strategies to reduce copying Tokens so much
Letterlike characters can be strange, or very strange (with higher confidence that it is a problem)
Better error handling for f[1\[Alpa]2]
Added Intra[] construct for specifying positions within tokens
Handle \[Alpa] being parsed as boxes
Start work to make TokenEnum contain other useful bits
Make sure that SyntaxErrorNodes always have children, and are not just a leaf node
Convert Token errors into appropriate SyntaxErrorNodes when abstracting
Allow e.g., { + } to be parsed correctly as boxes
Revamped the error handling in the parser, so that unexpected closing brackets and unexpected operators do not eat any unnecessary whitespace. Lot of work and kind of ugly. But maybe error handling has to be ugly
Use LongName for making characters graphical, if available
Add check for strange characters
Allow boxes with << to be parsed properly
Drop ImplicitNull when converting back to boxes
Add checks for strange Unicode characters
When parsing a leaf, treat multiple tokens of whitespace as a single token of whitespace, similar to how FE works
Remove special handling of NonAssociativity with DirectedEdge and UndirectedEdge. This was bug 206938 and is now fixed.
Add a Quirks mode
Introduce AbstractFormatIssues, to allow warning about unneeded line continuations
Have ScopedIFS manage the data buffer from a file, and pass it to SourceManager
Treat \r\n as a single Newline token
Fix text mode error found on Windows
Fix bug where \\ in a string, at end of line, on Windows, gave an assert
Fix ToSourceCharacterString
Fix OptionalDefaultPatternNode
Fix when UnhandledDot can happen
Fix ParseLeaf of <<
Fix parsing TagSetDelayed and TagUnset
Fix when ReplaceNode CodeAction is the entire expression (ReplacePart does not work)
Fix reporting of EOF in escape sequences
Move NonAssociative error handling to Abstract.wl
Convert tokenQueue to a deque, since there are so many insertions in the front
Remove append from Parser, and only have prepend
Completely remove Metadatas
Move library-related stuff to Library.wl
Organize Token and LongName files
Remove unneeded use of unique_ptr
Remove LOOKAHEAD wrappers
Add ConcreteParseBox, ToStandardFormBoxes, and ToSourceCharacterString functions.
Add progress monitoring.
Switch to sending MakeLeafNode calls over MathLink, for performance.
Add some missing operators, \[Colon], \[CupCap], etc.
Other performance improvements.
Add some CodeActions.
Add Did You Mean for /@ for / at top-level
Move to using a struct with bitfields for WLCharacter.
Read entire file into buffer and store in SourceManager
Add ParseLeaf function
Create single nodes for Inequalities and VectorInequalities
Preserve line continuations in concrete syntax
Introduce a token for =. which is needed for box support
The characters \" and \\ are mapped into special codes
Warn about line continuations inside comments
Rename implicit tokens ImplicitNull, ImplicitOne, ImplicitAll
Add InfixOperatorWithTrailingParselet, for commas and semis
Only report UnlikelyEscapeSequence if not a valid character
Report strange space characters
Report strange newline characters
Various bug fixes.
Consolidate various atom Nodes into single LeafNode
\[EmptySet] is not strange.
Fixed handling of runs of multiple ;;
Comments, whitespace, and newlines are now returned in concrete syntax.
Introduce new concept of aggregate syntax which is concrete syntax, but with comments, whitespace, and newlines removed.
Aborts are handled more gracefully.
Handle more operators.
Add \[CubeRoot] character.
Various bug fixes.
Include operator tokens in concrete syntax trees.
Include comments as a separate list returned by concreteParse functions.
Include warnings about using unsupported and undocumented characters.
Include warnings about strange top-level expressions.
Various bug fixes.
Put Source information on link as a packed array.
Properly abstract strings in operators that do their own stringification (::, >>, >>>, <<, #).
Handle parsing #"123" and #a`b
Sandbox mode is now respected in the library.
Support abstracting \[VectorLess], \[VectorGreater], etc.
Updated build scripts.
Re-add remarks about invalid UTF-8 sequences.
Add a remark about stray \r characters.
Add more ambiguity remarks.
Support abstracting BeginPackage[]/EndPackage[], Begin[]/End[], and BeginStaticAnalysisIgnore[]/EndStaticAnalysisIgnore[].
Various bug fixes.
Remove explicit lists of letterlike long names, and add explicit list of uninterpretable long names.
All valid letterlike characters should work now, and any \:xxxx characters are flagged as strange.
Fill out the other lists of characters.
All characters should now be categorized (letterlike, operator, space, newline, comma, uninterpretable)
Add >>> operator.
Properly stringify args after ::, <<, >>, and >>>.
Attach comments to CST nodes.
Stop using an executable. Build a shared library and use LibraryLink and MathLink. Support building with earlier versions of Mathematica. Building with version 11.0 is supported.
Separate SyntaxIssues and AbstractSyntaxIssues.
Allow File[] wrappers to work.
Various bug fixes.
Correct concrete and abstract ' (Derivative) parsing
Abstract parsing errors for a \[DirectedEdge] b \[DirectedEdge] c and a \[UndirectedEdge] b \[UndirectedEdge] c
Correctly abstract a \[Divides] b
Add a stop-gap for parsing large expressions containing + and -.
The concrete syntax tree was originally treating exprs of + and exprs of - as separate infix nodes. And then when abstracted, they would be combined. However, this is a problem for expressions that heavily alternate between + and -, e.g., a + b - c + d - e + f - g ...
This would create a deeply nested expression and then the internal call to ToExpression would fail, even though the kernel can parse the original expression. The limitation of ToExpression is understood. Introduce a stop-gap that treats a + b - c + d - e + f - g ... as a single infix node, with InternalMinusNode for minus nodes. Eventually when we move to something like LibraryLink / WSTP, then we can move back to separate parse trees for + and -.
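The nesting problem and the stop-gap can be illustrated with a small Python sketch. It models trees as plain tuples and is not the wl-ast implementation: the functions parse_separate, parse_flat, and depth, and the node tag "InfixPlus", are hypothetical; only InternalMinusNode comes from the note above (abbreviated here to "InternalMinus").

```python
def parse_separate(tokens):
    """Old shape: runs of the same operator share a node; a change of operator nests."""
    tree = tokens[0]
    for op, operand in zip(tokens[1::2], tokens[2::2]):
        if isinstance(tree, tuple) and tree[0] == op:
            tree = tree + (operand,)
        else:
            tree = (op, tree, operand)
    return tree

def parse_flat(tokens):
    """Stop-gap shape: one infix node; operands after '-' get a minus wrapper."""
    children = [tokens[0]]
    for op, operand in zip(tokens[1::2], tokens[2::2]):
        children.append(("InternalMinus", operand) if op == "-" else operand)
    return ("InfixPlus", *children)

def depth(tree):
    """Nesting depth of a tuple tree; bare operands have depth 0."""
    if isinstance(tree, str):
        return 0
    return 1 + max(depth(child) for child in tree[1:])

tokens = "a + b - c + d - e".split()
print(depth(parse_separate(tokens)), depth(parse_flat(tokens)))
```

In this model the separate-node tree gains one level for every switch between + and -, while the flat tree stays at depth 2 however long the expression grows.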
Use FindFile to help with things like ParseFile["Foo`"]
Fix more cases for DeclarationName
Add support for all set relation operators
Fix certain infix longname operators being parsed as binary operators
Better error handling for ToInputFormString and ToFullFormString
Make sure to delete all new'd memory in wl-ast
Some work on documentation notebooks.
Add Boxes.wl, providing a rudimentary CSTToBoxes function
Make Divide binary instead of infix
Enforce NonASCII restriction on strings and files. This comes from a restriction of RunProcess and will be addressed in a future update.
Preserve the difference between characters provided as byte encoded, and characters provided as \ encoded
For example, the byte 0x0a and the bytes 0x5c 0x6e both encode the newline character. And there are times where the distinction is important and must be preserved. \ syntax for the newline character cannot be used outside of strings.
Also it is just a bit nicer to preserve the encoding that was provided. One caveat is that \ encoding does canonicalize to one form. For example, \n, \:000a, \.0a, and \012 all encode the newline character. But it is canonicalized to \n. The general pattern for canonicalization is to first prefer the short \x form if possible (e.g., \n \r \t), and then prefer long names \[Name] syntax, and then prefer \:xxxx syntax.
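The preference order described above (short form, then long name, then \:xxxx) can be sketched in Python as follows. This is an illustration under assumptions, not the parser's code: canonical_escape is a hypothetical name, the long-name table is a two-entry subset chosen for the example, and code points above U+FFFF are left out.

```python
SHORT_FORMS = {0x0A: "\\n", 0x0D: "\\r", 0x09: "\\t"}

# Tiny illustrative subset; the real table of long names is much larger.
LONG_NAMES = {0x03B1: "Alpha", 0x03C0: "Pi"}

def canonical_escape(code_point: int) -> str:
    """Pick one canonical \\ form for a character that was given in escaped form."""
    if code_point in SHORT_FORMS:
        return SHORT_FORMS[code_point]            # 1. short form
    if code_point in LONG_NAMES:
        return "\\[" + LONG_NAMES[code_point] + "]"  # 2. long name
    if code_point > 0xFFFF:
        raise ValueError("code points above U+FFFF are outside this sketch")
    return "\\:{:04x}".format(code_point)         # 3. four-digit hex form
```

All four spellings of newline from the example decode to code point 10, so they all map to \n here; whether the character arrived as a raw byte or as an escape would be tracked separately, since that distinction is preserved.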
Separate Concrete Syntax Trees and Abstract Syntax Trees.
Moved unused generated files out of paclet layout.
Introduced expression depth and expression breadth warnings.
Added wl-ast as a Paclet Resource.
Added notes in README.md about Antivirus problems on Windows.
Added TernaryOperatorsToSymbol
Renamed LinearSyntaxBang->PrefixLinearSyntaxBang
Added PrefixInvisiblePrefixScriptBase
Added PostfixInvisiblePostfixScriptBase
Renamed InfixImplicitPlus->BinaryImplicitPlus
Added BinaryInvisibleTimes
Added WLCharacter
brenton2maclap:MacOSX-x86-64 brenton$ ./wl-ast -format characters
>>> 1+1
{
WLCharacter[49, "1", <|Source->{{1, 1}, {1, 1}}|>],
WLCharacter[43, "+", <|Source->{{1, 2}, {1, 2}}|>],
WLCharacter[49, "1", <|Source->{{1, 3}, {1, 3}}|>],
WLCharacter[-1, "", <|Source->{{2, 0}, {2, 0}}|>],
Nothing
}
brenton2maclap:MacOSX-x86-64 brenton$
Added logic to use PacletResources to find wl-ast executable.
Added ConcreteParse functions and start separating out concrete and abstract parse trees.
Added warning for input like:
1.2.3