Macro future proofing #550

Merged
merged 13 commits into from
Jan 19, 2015
147 changes: 147 additions & 0 deletions text/0000-macro-future-proofing.md
@@ -0,0 +1,147 @@
- Start Date: 2014-12-21
- RFC PR: (leave this empty)
- Rust Issue: (leave this empty)

# Key Terminology

- `macro`: anything invokable as `foo!(...)` in source code.
- `MBE`: macro-by-example, a macro defined by `macro_rules`.
- `matcher`: the left-hand-side of a rule in a `macro_rules` invocation.
- `macro parser`: the bit of code in the Rust parser that will parse the input
using a grammar derived from all of the matchers.
- `NT`: non-terminal, the various "meta-variables" that can appear in a matcher.
- `fragment`: The piece of Rust syntax that an NT can accept.
- `fragment specifier`: The identifier in an NT that specifies which fragment
the NT accepts.
- `language`: a context-free language.

Example:

```rust
macro_rules! i_am_an_mbe {
    (start $foo:expr end) => ($foo)
}
```

`(start $foo:expr end)` is a matcher, `$foo` is an NT with `expr` as its
fragment specifier.

# Summary

Future-proof the allowed forms that input to an MBE can take by requiring
certain delimiters following NTs in a matcher. In the future, it will be
possible to lift these restrictions backwards compatibly if desired.

# Motivation

In current Rust, the `macro_rules` parser is very liberal in what it accepts
in a matcher. This can cause problems, because it is possible to write an
MBE which corresponds to an ambiguous grammar. When an MBE is invoked, if the
macro parser encounters an ambiguity while parsing, it will bail out with a
"local ambiguity" error. As an example, take the following MBE:

```rust
macro_rules! foo {
    ($($foo:expr)* $bar:block) => (/*...*/)
};
```

Attempts to invoke this MBE will never succeed, because the macro parser
will always emit an ambiguity error rather than make a choice when presented
with an ambiguity. In particular, it needs to decide when to stop accepting
expressions for `foo` and look for a block for `bar` (noting that blocks are
valid expressions). Situations like this are inherent to the macro system. On
the other hand, it's possible to write an unambiguous matcher that becomes
ambiguous due to changes in the syntax for the various fragments. As a
concrete example:

```rust
macro_rules! bar {
    ($in:ty ( $($arg:ident)*, ) -> $out:ty;) => (/*...*/)
};
```

When the type syntax was extended to include the unboxed closure traits,
an input such as `FnMut(i8, u8) -> i8;` became ambiguous. The goal of this
proposal is to prevent such scenarios in the future by requiring certain
"delimiter tokens" after an NT. When extending Rust's syntax in the future,
ambiguity need only be considered when combined with these sets of delimiters,
rather than any possible arbitrary matcher.
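As a rough illustration of the idea (not part of the proposal itself), the ambiguous `foo` matcher above can be reworked so that every `expr` NT is followed by a delimiter token. The macro name `sum_then` and the helper `demo` are hypothetical; the sketch assumes `,` as the repetition separator and `=>` before the block, both of which appear in the `FOLLOW(expr)` set proposed below.

```rust
// Hypothetical macro: each `$e:expr` is followed either by the `,` separator
// or by `=>`, so the macro parser never has to guess where an expression ends.
macro_rules! sum_then {
    ($($e:expr),* => $body:block) => {{
        // Sum the listed expressions, starting from zero.
        let total = 0 $(+ $e)*;
        // Evaluate the trailing block and add its value.
        let extra = $body;
        total + extra
    }};
}

// Hypothetical helper showing one invocation.
fn demo() -> i32 {
    sum_then!(1, 2, 3 => { 10 })
}
```

Here `demo()` evaluates `1 + 2 + 3` and then adds the block's value, `10`.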

# Detailed design

The algorithm for recognizing valid matchers `M` follows. Note that a matcher
is merely a token tree. A "simple NT" is an NT without repetitions. That is,
`$foo:ty` is a simple NT but `$($foo:ty)+` is not. `FOLLOW(NT)` is the set of
tokens allowed to immediately follow an NT with the given fragment specifier,
and is defined below.
`F` is used for representing the separator in complex NTs. In `$($foo:ty),+`,
`F` would be `,`, and for `$($foo:ty)+`, `F` would be `EOF`.

*input*: a token tree `M` representing a matcher and a token `F`

*output*: whether M is valid

1. If there are no tokens in `M`, accept.
2. For each token `T` in `M`:
    1. If `T` is not an NT, continue.
    2. If `T` is a simple NT, look ahead to the next token `T'` in `M`. If
       `T'` is `EOF` or a close delimiter of a token tree, replace `T'` with
       `F`. If `T'` is in the set `FOLLOW(NT)`, `T'` is EOF, `T'` is any NT,
       or `T'` is any identifier, continue. Else, reject.
    3. Else, `T` is a complex NT.
        1. If `T` has the form `$(...)+` or `$(...)*`, run the algorithm on
           the contents with `F` set to `EOF`. If it accepts, continue, else,
           reject.
        2. If `T` has the form `$(...)U+` or `$(...)U*` for some token `U`, run
           the algorithm on the contents with `F` set to `U`. If it accepts,
           continue, else, reject.

This algorithm should be run on every matcher in every `macro_rules`
invocation, with `F` as `EOF`. If it rejects a matcher, an error should be
emitted and compilation should not complete.
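The steps above could be sketched roughly as follows. This is a minimal illustration, not the compiler's implementation: the `Tok` type, the flat token representation, and the `follow` callback (which answers whether a token is in `FOLLOW` for a fragment specifier) are all assumptions made for the example. It also assumes that a complex NT counts as "any NT" when it appears as `T'`.

```rust
/// Simplified matcher token; every name in this sketch is hypothetical.
#[derive(Debug, Clone)]
enum Tok {
    /// Any non-identifier token, stored as its text, e.g. "," or ")".
    Lit(&'static str),
    /// A plain identifier such as `end`.
    Ident(&'static str),
    /// A simple NT `$name:spec`; only the fragment specifier is kept.
    Nt(&'static str),
    /// A complex NT `$(...)U*` or `$(...)U+`; `None` means no separator `U`.
    Rep(Vec<Tok>, Option<&'static str>),
}

fn is_close_delim(t: &str) -> bool {
    matches!(t, ")" | "]" | "}")
}

/// Returns true if matcher `m` is valid. `f` is the token substituted for the
/// end of `m` (`None` stands for `EOF`); `follow(spec, tok)` reports whether
/// `tok` is in `FOLLOW` for the fragment specifier `spec`.
fn valid(m: &[Tok], f: Option<&str>, follow: &dyn Fn(&str, &str) -> bool) -> bool {
    // Step 1: an empty matcher never enters the loop and is accepted.
    for (i, t) in m.iter().enumerate() {
        match t {
            // Step 2.1: `T` is not an NT.
            Tok::Lit(_) | Tok::Ident(_) => continue,
            // Step 2.2: simple NT, so inspect the next token `T'`.
            Tok::Nt(spec) => {
                let ok = match m.get(i + 1) {
                    // `T'` is any NT or any identifier.
                    Some(Tok::Nt(_)) | Some(Tok::Rep(..)) | Some(Tok::Ident(_)) => true,
                    // `T'` is a close delimiter: replace it with `F`.
                    Some(Tok::Lit(s)) if is_close_delim(s) => {
                        f.map_or(true, |u| follow(spec, u))
                    }
                    // Any other token must be in FOLLOW(NT).
                    Some(Tok::Lit(s)) => follow(spec, s),
                    // `T'` is EOF: replace it with `F`.
                    None => f.map_or(true, |u| follow(spec, u)),
                };
                if !ok {
                    return false;
                }
            }
            // Step 2.3: complex NT, so recurse with `F` set to the separator.
            Tok::Rep(inner, sep) => {
                if !valid(inner, *sep, follow) {
                    return false;
                }
            }
        }
    }
    true
}
```

A top-level check of a matcher would call `valid(matcher, None, &follow)`, mirroring the instruction to run the algorithm with `F` as `EOF`.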

The current legal fragment specifiers are: `item`, `block`, `stmt`, `pat`,
`expr`, `ty`, `ident`, `path`, `meta`, and `tt`.

- `FOLLOW(stmt)` = `FOLLOW(expr)`
- `FOLLOW(pat)` = `{FatArrow, Comma, Pipe}`
- `FOLLOW(expr)` = `{Comma, FatArrow, CloseBrace, CloseParen, CloseBracket}`
- `FOLLOW(ty)` = `{Comma, CloseBrace, CloseParen, CloseBracket}`
- `FOLLOW(block)` = any token
- `FOLLOW(ident)` = any token
- `FOLLOW(tt)` = any token
- `FOLLOW(item)` = up for discussion
- `FOLLOW(path)` = up for discussion
- `FOLLOW(meta)` = up for discussion
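For illustration only, the sets listed above can be written as a lookup function. The `Token` enum and the function name `in_follow` are hypothetical, `Semi` is included purely as an example of a token that is in none of the explicit sets, and the undecided specifiers (`item`, `path`, `meta`) return `None` rather than guessing at an answer.

```rust
/// Tokens named in the FOLLOW sets, plus `Semi` as an outside example.
/// The enum itself is a hypothetical stand-in for the real token type.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Token {
    Comma,
    FatArrow,
    Pipe,
    CloseBrace,
    CloseParen,
    CloseBracket,
    Semi,
}

/// `Some(true)` or `Some(false)` when the set is defined in the list above;
/// `None` when the set is still "up for discussion" or the specifier is unknown.
fn in_follow(spec: &str, tok: Token) -> Option<bool> {
    use Token::*;
    match spec {
        // FOLLOW(stmt) is defined to equal FOLLOW(expr).
        "stmt" | "expr" => Some(matches!(
            tok,
            Comma | FatArrow | CloseBrace | CloseParen | CloseBracket
        )),
        "pat" => Some(matches!(tok, FatArrow | Comma | Pipe)),
        "ty" => Some(matches!(
            tok,
            Comma | CloseBrace | CloseParen | CloseBracket
        )),
        // Any token may follow these fragments.
        "block" | "ident" | "tt" => Some(true),
        // `item`, `path`, `meta`: undecided in this draft.
        _ => None,
    }
}
```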

# Drawbacks

It does restrict the input to an MBE, but the choice of delimiters provides
reasonable freedom.

# Alternatives

1. Fix the syntax that a fragment can parse. This would create a situation
where a future MBE might not be able to accept certain inputs because the
input uses newer features than the fragment that was fixed at 1.0. For
example, in the `bar` MBE above, if the `ty` fragment was fixed before the
unboxed closure sugar was introduced, the MBE would not be able to accept
such a type. While this approach is feasible, it would cause unnecessary
confusion for future users of MBEs when they can't put certain perfectly
valid Rust code in the input to an MBE. Versioned fragments could avoid
this problem but only for new code.
2. Keep `macro_rules` unstable. Given the great syntactical abstraction that
`macro_rules` provides, it would be a shame for it to be unusable in a
release version of Rust. If ever `macro_rules` were to be stabilized, this
same issue would come up.
3. Do nothing. This is very dangerous, and has the potential to essentially
freeze Rust's syntax for fear of accidentally breaking a macro.

# Unresolved questions

1. What should the FOLLOW sets for `item`, `path`, and `meta` be?
2. Should the `FOLLOW` set for `ty` be extended? In practice, `RArrow`,
Member Author

Forgot one, of course: `=`

`Colon`, `as`, and `in` are also used. (See next item)
3. What, if any, identifiers should be allowed in the FOLLOW sets? The author
is concerned that allowing arbitrary identifiers would limit the future use
of "contextual keywords".
Member

Could you give some intuition of what the effects are of including or excluding tokens in the FOLLOW sets? I.e., does it mean the new token can be used as a delimiter token? What does it mean for future proofing? Does it restrict or extend what we can do in the future? How?

I have a feeling we should think about semi-colons, but I'm not sure how. Should they be in the follow sets for either expr or stmt? Is it true that an item must always end with a } or a ;? If so, does that mean we should take anything for FOLLOW for item? (I feel I only have about a 50% grasp of the concepts here, so forgive my possibly stupid questions)

Member Author

If a token is in the FOLLOW set of an NT, we can never change the language in such a way that parsing that NT would consume that token. This gives us a rigid boundary around which we can change the language and not break macros.

I think the FOLLOW for item could be everything. I think the FOLLOW for meta could be anything, but I'm not sure how that interacts with future plans for letting attributes contain arbitrary token trees. I really don't know about the futures of path/ty. path really ought to be removed, since there isn't really a single "path" that makes sense anymore.


Consider `foo @ pat`, an example of `$_:pat`. We might want to extend the pattern syntax so that pattern guards are allowed in patterns: `foo @ pat if condition` [1]. Now, let's assume this syntax is going to be added in 1.1. The input `foo @ pat if something` will continue to be accepted by the matcher `( $bar:pat if something )` for some time, but it will be rejected as soon as 1.1 comes. The culprit is the `if` in `( $bar:pat if something )` and hence `if` should be excluded from `FOLLOW(pat)`.

Looks like it's implied that each FOLLOW set contains all identifiers:

> If (...) `T'` is any identifier, continue. Else, reject.

Further, alternation (`|`) in patterns has been proposed in [1]. I think `FOLLOW(pat)` shouldn't include `Pipe`, either.
Edit: in both cases, the syntax that pat parses could be fixed.

[1] RFC: Extend pattern syntax #99, postponed

Member Author

I'm not sure how we ought to handle identifiers.

Member Author

For `ty`, I have in the implementation `as`, `,`, `->`, `:`, `=`, and `>`. Any token for `meta` and `item`. Still not sure about `path`.