Macro future proofing #550
@@ -0,0 +1,147 @@
- Start Date: 2014-12-21
- RFC PR: (leave this empty)
- Rust Issue: (leave this empty)

# Key Terminology

- `macro`: anything invokable as `foo!(...)` in source code.
- `MBE`: macro-by-example, a macro defined by `macro_rules`.
- `matcher`: the left-hand side of a rule in a `macro_rules` invocation.
- `macro parser`: the bit of code in the Rust parser that will parse the input
  using a grammar derived from all of the matchers.
- `NT`: non-terminal, the various "meta-variables" that can appear in a matcher.
- `fragment`: the piece of Rust syntax that an NT can accept.
- `fragment specifier`: the identifier in an NT that specifies which fragment
  the NT accepts.
- `language`: a context-free language.

Example:

```rust
macro_rules! i_am_an_mbe {
    (start $foo:expr end) => ($foo)
}
```

`(start $foo:expr end)` is a matcher, and `$foo` is an NT with `expr` as its
fragment specifier.

# Summary

Future-proof the allowed forms that input to an MBE can take by requiring
certain delimiters following NTs in a matcher. In the future, it will be
possible to lift these restrictions backwards compatibly if desired.

# Motivation

In current Rust, the `macro_rules` parser is very liberal in what it accepts
in a matcher. This can cause problems, because it is possible to write an
MBE which corresponds to an ambiguous grammar. When an MBE is invoked, if the
macro parser encounters an ambiguity while parsing, it will bail out with a
"local ambiguity" error. As an example, take the following MBE:

```rust
macro_rules! foo {
    ($($foo:expr)* $bar:block) => (/*...*/)
};
```

Attempts to invoke this MBE will never succeed, because the macro parser
will always emit an ambiguity error rather than make a choice when presented
with an ambiguity. In particular, it needs to decide when to stop accepting
expressions for `foo` and look for a block for `bar` (noting that blocks are
valid expressions). Situations like this are inherent to the macro system. On
the other hand, it's possible to write an unambiguous matcher that becomes
ambiguous due to changes in the syntax for the various fragments. As a
concrete example:

```rust
macro_rules! bar {
    ($in:ty ( $($arg:ident)*, ) -> $out:ty;) => (/*...*/)
};
```

When the type syntax was extended to include the unboxed closure traits,
an input such as `FnMut(i8, u8) -> i8;` became ambiguous. The goal of this
proposal is to prevent such scenarios in the future by requiring certain
"delimiter tokens" after an NT. When extending Rust's syntax in the future,
ambiguity need only be considered when combined with these sets of delimiters,
rather than any possible arbitrary matcher.

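As a rough illustration of the kind of matcher this proposal is aiming for, the sketch below follows every `expr` NT with either `=>` or `,`, both of which this RFC lists as allowed after an `expr`. The macro name `pairs` and its expansion are hypothetical and exist only for this example.

```rust
// Hypothetical macro, not part of the RFC: builds a Vec of (key, value) tuples.
macro_rules! pairs {
    // `$key:expr` is followed by `=>`, and `$value:expr` is followed by the
    // repetition separator `,` (or by the end of the matcher).
    ($($key:expr => $value:expr),*) => {
        vec![$(($key, $value)),*]
    };
}

fn main() {
    // Each key is a full expression; the `=>` token marks where it ends.
    let table = pairs!(1 + 1 => "two", 3 => "three");
    assert_eq!(table, vec![(2, "two"), (3, "three")]);
    println!("{:?}", table);
}
```

Because the tokens after each NT come from a small fixed set, an extension to expression syntax only has to be checked against those tokens to know whether a matcher like this one could start parsing differently.
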
# Detailed design

The algorithm for recognizing valid matchers `M` follows. Note that a matcher
is merely a token tree. A "simple NT" is an NT without repetitions. That is,
`$foo:ty` is a simple NT but `$($foo:ty)+` is not; the latter is a "complex
NT". `FOLLOW(NT)` is the set of tokens allowed to follow an NT with the given
fragment specifier, and is defined below. `F` represents the separator in a
complex NT. In `$($foo:ty),+`, `F` would be `,`, and for `$($foo:ty)+`, `F`
would be `EOF`.

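For a concrete picture of the separator case, here is a small sketch of a complex NT with a `,` separator, so that `F` is `,` for the contents of the repetition. The macro name `count_tys` is an assumption made up for this example.

```rust
// Hypothetical macro, not part of the RFC: counts the types in a
// comma-separated list.
macro_rules! count_tys {
    // `$t:ty` is the last token inside `$(...)`, so the token that may
    // follow it is the separator `,`, i.e. `F` is `,` here.
    ($($t:ty),+) => {
        [$(stringify!($t)),+].len()
    };
}

fn main() {
    let n = count_tys!(u8, i32, String);
    assert_eq!(n, 3);
    println!("{}", n);
}
```
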
*input*: a token tree `M` representing a matcher and a token `F`

*output*: whether `M` is valid

1. If there are no tokens in `M`, accept.
2. For each token `T` in `M`:
    1. If `T` is not an NT, continue.
    2. If `T` is a simple NT, look ahead to the next token `T'` in `M`. If
       `T'` is `EOF` or a close delimiter of a token tree, replace `T'` with
       `F`. If `T'` is in the set `FOLLOW(NT)`, `T'` is `EOF`, `T'` is any NT,
       or `T'` is any identifier, continue. Else, reject.
    3. Else, `T` is a complex NT.
        1. If `T` has the form `$(...)+` or `$(...)*`, run the algorithm on
           the contents with `F` set to `EOF`. If it accepts, continue, else,
           reject.
        2. If `T` has the form `$(...)U+` or `$(...)U*` for some token `U`, run
           the algorithm on the contents with `F` set to `U`. If it accepts,
           continue, else, reject.

This algorithm should be run on every matcher in every `macro_rules`
invocation, with `F` as `EOF`. If it rejects a matcher, an error should be
emitted and compilation should not complete.

The current legal fragment specifiers are: `item`, `block`, `stmt`, `pat`,
`expr`, `ty`, `ident`, `path`, `meta`, and `tt`.

- `FOLLOW(stmt)` = `FOLLOW(expr)`
- `FOLLOW(pat)` = `{FatArrow, Comma, Pipe}`
- `FOLLOW(expr)` = `{Comma, FatArrow, CloseBrace, CloseParen, CloseBracket}`
- `FOLLOW(ty)` = `{Comma, CloseBrace, CloseParen, CloseBracket}`
- `FOLLOW(block)` = any token
- `FOLLOW(ident)` = any token
- `FOLLOW(tt)` = any token
- `FOLLOW(item)` = up for discussion
- `FOLLOW(path)` = up for discussion
- `FOLLOW(meta)` = up for discussion

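The algorithm and FOLLOW sets above can be sketched roughly as follows. This is a minimal model under stated assumptions, not the compiler's implementation: the types `Frag`, `Tok`, and `Tree` and the functions `follows` and `valid` are hypothetical names, a matcher is modeled as a flat token list in which close delimiters appear as ordinary tokens, a complex NT directly after a simple NT is treated as "any NT", and `item`, `path`, and `meta` are left out because their FOLLOW sets are still up for discussion.

```rust
// Sketch only: every name here is hypothetical, not a rustc internal.
#[allow(dead_code)]
#[derive(Clone, Copy)]
enum Frag { Stmt, Pat, Expr, Ty, Block, Ident, Tt }

#[allow(dead_code)]
enum Tok {
    Eof, Comma, FatArrow, Pipe, Semi, OpenParen,
    CloseBrace, CloseParen, CloseBracket,
    Ident(&'static str),
}

enum Tree {
    Tok(Tok),                    // any token that is not an NT
    Nt(Frag),                    // simple NT such as `$foo:ty`
    Rep(Vec<Tree>, Option<Tok>), // complex NT `$(...)U*`; `None` = no separator
}

/// Acceptance test of step 2.2 for the (possibly substituted) lookahead.
fn follows(frag: Frag, next: &Tok) -> bool {
    // `EOF` and any identifier are always accepted by step 2.2.
    if matches!(next, Tok::Eof | Tok::Ident(_)) {
        return true;
    }
    match frag {
        Frag::Block | Frag::Ident | Frag::Tt => true,
        Frag::Pat => matches!(next, Tok::FatArrow | Tok::Comma | Tok::Pipe),
        Frag::Expr | Frag::Stmt => matches!(
            next,
            Tok::Comma | Tok::FatArrow | Tok::CloseBrace | Tok::CloseParen | Tok::CloseBracket
        ),
        Frag::Ty => matches!(
            next,
            Tok::Comma | Tok::CloseBrace | Tok::CloseParen | Tok::CloseBracket
        ),
    }
}

fn is_close(tok: &Tok) -> bool {
    matches!(tok, Tok::CloseBrace | Tok::CloseParen | Tok::CloseBracket)
}

/// Returns whether matcher `m` is valid given the token `f`.
fn valid(m: &[Tree], f: &Tok) -> bool {
    for (i, t) in m.iter().enumerate() {
        let ok = match t {
            // Step 2.1: not an NT.
            Tree::Tok(_) => true,
            // Step 2.2: simple NT; end of input or a close delimiter becomes `f`.
            Tree::Nt(frag) => match m.get(i + 1) {
                None => follows(*frag, f),
                Some(Tree::Tok(next)) if is_close(next) => follows(*frag, f),
                Some(Tree::Tok(next)) => follows(*frag, next),
                Some(_) => true, // followed by another NT (assumption: simple or complex)
            },
            // Step 2.3: complex NT; recurse with the separator, or `EOF` if none.
            Tree::Rep(inner, sep) => valid(inner, sep.as_ref().unwrap_or(&Tok::Eof)),
        };
        if !ok {
            return false;
        }
    }
    true // step 1 (empty matcher) and the fall-through case both accept
}

fn main() {
    use Tree::{Nt, Rep};
    let ok = [Nt(Frag::Expr), Tree::Tok(Tok::Comma), Nt(Frag::Ty)]; // `$e:expr , $t:ty`
    let bad = [Nt(Frag::Ty), Tree::Tok(Tok::Semi)];                 // `$t:ty ;`
    let rep = [Rep(vec![Nt(Frag::Ty)], Some(Tok::Comma))];          // `$($t:ty),+`
    println!(
        "{} {} {}",
        valid(&ok, &Tok::Eof),
        valid(&bad, &Tok::Eof),
        valid(&rep, &Tok::Eof)
    );
}
```

Under this model a matcher such as `$t:ty ;` is rejected because `Semi` is not in `FOLLOW(ty)`, and `$t:ty (`, the shape that begins the `bar` matcher from the Motivation section, is rejected for the same reason.
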
# Drawbacks

This does restrict the input to an MBE, but the choice of delimiters provides
reasonable freedom.

# Alternatives

1. Fix the syntax that a fragment can parse. This would create a situation
   where a future MBE might not be able to accept certain inputs because the
   input uses newer features than the fragment that was fixed at 1.0. For
   example, in the `bar` MBE above, if the `ty` fragment was fixed before the
   unboxed closure sugar was introduced, the MBE would not be able to accept
   such a type. While this approach is feasible, it would cause unnecessary
   confusion for future users of MBEs when they can't put certain perfectly
   valid Rust code in the input to an MBE. Versioned fragments could avoid
   this problem but only for new code.
2. Keep `macro_rules` unstable. Given the great syntactical abstraction that
   `macro_rules` provides, it would be a shame for it to be unusable in a
   release version of Rust. If ever `macro_rules` were to be stabilized, this
   same issue would come up.
3. Do nothing. This is very dangerous, and has the potential to essentially
   freeze Rust's syntax for fear of accidentally breaking a macro.

# Unresolved questions

1. What should the FOLLOW sets for `item`, `path`, and `meta` be?
2. Should the `FOLLOW` set for `ty` be extended? In practice, `RArrow`,
   `Colon`, `as`, and `in` are also used. (See the next item.)
3. What, if any, identifiers should be allowed in the FOLLOW sets? The author
   is concerned that allowing arbitrary identifiers would limit the future use
   of "contextual keywords".

Could you give some intuition of what the effects are of including or excluding tokens in the FOLLOW sets? I.e., does it mean the new token can be used as a delimiter token? What does it mean for future proofing? Does it restrict or extend what we can do in the future? How? I have a feeling we should think about semi-colons, but I'm not sure how. Should they be in the follow sets for either expr or stmt? Is it true that an item must always end with a

If a token is in the FOLLOW set of an NT, we can never change the language in such a way that parsing that NT would consume that token. This gives us a rigid boundary around which we can change the language and not break macros. I think the

Consider Looks like it's implied that each FOLLOW set contains all identifiers:
Further, alternation (

[1] RFC: Extend pattern syntax #99, postponed

I'm not sure how we ought to handle identifiers.

For |

Forgot one, of course:
=