Allow non-ASCII identifiers #2457
@@ -0,0 +1,182 @@ | ||
- Feature Name: non_ascii_idents | ||
- Start Date: 2018-06-03 | ||
- RFC PR: (leave this empty) | ||
- Rust Issue: (leave this empty) | ||
|
||
# Summary | ||
[summary]: #summary | ||
|
||
Allow non-ASCII letters (such as accented characters, Cyrillic, Greek, Kanji, etc.) in Rust identifiers. | ||
|
||
# Motivation | ||
[motivation]: #motivation | ||
|
||
Writing code using domain-specific terminology simplifies implementation and discussion as opposed to translating words from the project requirements. When the code is only intended for a limited audience such as with in-house projects or in teaching it can be beneficial to write code in the group's language as it boosts communication and helps people not fluent in English to participate and write Rust code themselves. | ||
|
||
The rationale from [PEP 3131] nicely explains it: | ||
|
||
> ~~Python~~ *Rust* code is written by many people in the world who are not familiar with the English language, or even well-acquainted with the Latin writing system. Such developers often desire to define classes and functions with names in their native languages, rather than having to come up with an (often incorrect) English translation of the concept they want to name. By using identifiers in their native language, code clarity and maintainability of the code among speakers of that language improves. | ||
> | ||
> For some languages, common transliteration systems exist (in particular, for the Latin-based writing systems). For other languages, users have larger difficulties to use Latin to write their native words. | ||
|
||
Additionally, some math-oriented projects may want to use identifiers that closely resemble mathematical notation.
|
||
# Guide-level explanation | ||
[guide-level-explanation]: #guide-level-explanation | ||
|
||
Identifiers include variable names, function and trait names and module names. They start with a letter or an underscore and may be followed by more letters, digits and some connecting punctuation. | ||
|
||
Examples of valid identifiers are: | ||
|
||
* ASCII letters and digits: `image_width`, `line2`, `Photo`, `el_tren`, `_unused` | ||
* words containing accented characters: `garçon`, `hühnervögel` | ||
* identifiers in other scripts: `Москва`, `東京`, ... | ||
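As an illustration of the valid examples above, here is a small sketch that uses accented Latin and Han identifiers for fields, functions and parameters. All names in it (`Commande`, `hühnervögel_zählen`, `beschreibung`, `東京`) are made up for this example and are not part of the proposal.

```rust
/// An order whose field names use accented Latin letters (hypothetical type).
struct Commande {
    garçon: String,
    quantité: u32,
}

/// Sums the number of birds over all coops (hypothetical function).
fn hühnervögel_zählen(ställe: &[u32]) -> u32 {
    ställe.iter().sum()
}

/// Formats an order as "<name> x<quantity>".
fn beschreibung(c: &Commande) -> String {
    format!("{} x{}", c.garçon, c.quantité)
}

/// An identifier written entirely in another script.
fn 東京() -> &'static str {
    "Tokyo"
}
```

Calling `hühnervögel_zählen(&[3, 4, 5])` works like any other function call; nothing about the call syntax changes.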
|
||
Examples of invalid identifiers are: | ||
|
||
* Keywords: `impl`, `fn`, `_` (underscore), ... | ||
* Identifiers starting with numbers or containing "non letters": `42_the_answer`, `third√of7`, `◆◆◆`, ... | ||
* Many Emojis: 🙂, 🦀, 💩, ... | ||
|
||
[Composed characters] like those used in the word `ḱṷṓn` can be represented in different ways with Unicode. These different representations are all the same identifier in Rust. | ||
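To make "different representations" concrete, the following sketch compares two encodings of the letter `é`: one precomposed code point versus a base letter plus a combining accent. It only uses the standard library, which does not provide Unicode normalization, so it shows the difference rather than performing NFC itself; an external crate such as `unicode-normalization` would be needed for that. The helper `code_points` and the two constants are names chosen for this example.

```rust
/// Returns the code points of `s` formatted as `U+XXXX`.
fn code_points(s: &str) -> Vec<String> {
    s.chars().map(|c| format!("U+{:04X}", c as u32)).collect()
}

/// "é" as a single precomposed code point (this is also its NFC form).
const PRECOMPOSED: &str = "\u{00E9}";

/// "é" as "e" followed by COMBINING ACUTE ACCENT.
const DECOMPOSED: &str = "e\u{0301}";
```

As plain strings the two constants compare unequal, because their code point sequences differ. Under this proposal both spellings would name the same identifier, since both normalize to the same NFC form.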
|
||
To disallow non-ASCII identifiers in a project (for example to ease collaboration or for security reasons) and limit the accepted identifiers to ASCII, add this lint to the `lib.rs` or `main.rs` file of your project:
|
||
```rust
#![forbid(non_ascii_idents)]
```
|
||
Some Unicode characters look confusingly similar to each other or even identical, like the Latin **A** and the Cyrillic **А**. The compiler may warn you about names that are easy to confuse with keywords, names from the same crate and imported items. If needed (but not recommended) this warning can be silenced with a `#[allow(confusable_idents)]` annotation on the enclosing function or module.
|
||
## Usage notes | ||
|
||
All code written in the Rust Language Organization (*rustc*, tools, std, common crates) will continue to only use ASCII identifiers and the English language. | ||
|
||
For open source crates it is suggested to write them in English and use ASCII only. An exception can be made if the application domain (e.g. math) benefits from Unicode and the target audience (e.g. for a crate interfacing with Russian passports) is comfortable with the language and characters used. Additionally, crates should consider providing an ASCII-only API.
|
||
Private projects can use any script and language the developer(s) desire. It is still a good idea (as with any language feature) not to overdo it. | ||
|
||
# Reference-level explanation | ||
[reference-level-explanation]: #reference-level-explanation | ||
|
||
Identifiers in Rust are based on the [Unicode® Standard Annex #31 Unicode Identifier and Pattern Syntax][UAX31]. | ||
|
||
Note: The supported Unicode version should be stated in the documentation. | ||
|
||
The lexer defines identifiers as: | ||
|
||
> **<sup>Lexer:</sup>**
> IDENTIFIER_OR_KEYWORD:
> XID_Start XID_Continue<sup>\*</sup>
> | `_` XID_Continue<sup>+</sup>
>
> IDENTIFIER :
> IDENTIFIER_OR_KEYWORD <sub>*Except a [strict] or [reserved] keyword*</sub>
|
||
`XID_Start` and `XID_Continue` are used as defined in the aforementioned standard. The definition of identifiers is forward compatible with each successive release of Unicode as only appropriate new characters are added to the classes but none are removed. | ||
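The grammar above can be sketched as a small predicate. Note the assumptions: the standard library does not expose `XID_Start` or `XID_Continue`, so this sketch substitutes `char::is_alphabetic` and `char::is_alphanumeric` as rough stand-ins. They differ from the real properties for some characters (for example many combining marks), so this is an approximation of the rule's shape, not a conforming implementation. The function names are chosen for this example.

```rust
/// Rough stand-in for XID_Start (assumption: approximated by Alphabetic).
fn is_start(c: char) -> bool {
    c.is_alphabetic()
}

/// Rough stand-in for XID_Continue (assumption: alphanumeric or `_`).
fn is_continue(c: char) -> bool {
    c.is_alphanumeric() || c == '_'
}

/// Checks the IDENTIFIER_OR_KEYWORD rule:
///   XID_Start XID_Continue*  |  `_` XID_Continue+
fn is_identifier_or_keyword(s: &str) -> bool {
    let mut chars = s.chars();
    match chars.next() {
        // `_` must be followed by at least one continue character,
        // so a lone underscore is rejected.
        Some('_') => {
            let rest = chars.as_str();
            !rest.is_empty() && rest.chars().all(is_continue)
        }
        // A start character may be followed by zero or more continue characters.
        Some(c) if is_start(c) => chars.all(is_continue),
        // Empty input, or a first character that cannot start an identifier.
        _ => false,
    }
}
```

Keyword filtering (the IDENTIFIER rule) is left out; it would be a lookup against the keyword list after this check succeeds.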
|
||
Parsers for Rust syntax normalize identifiers to [NFC][UAX15]. Every API accepting raw identifiers (such as `proc_macro::Ident::new`) normalizes them to NFC, and APIs returning them as strings (like `proc_macro::Ident::to_string`) return the normalized form. This means two identifiers are equal if their NFC forms are equal.
|
||
A `non_ascii_idents` lint is added to the compiler. This lint is `allow` by default. The lint checks if any identifier in the current context contains a codepoint with a value equal to or greater than 0x80 (outside ASCII range). Not only locally defined identifiers are checked but also those imported from other crates and modules into the current context. | ||
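The check performed by the lint, as described above, amounts to a code point comparison. A minimal sketch, assuming identifiers are already available as strings (the function names are hypothetical and do not reflect the compiler's internals):

```rust
/// True if any code point of `ident` is at or above 0x80, i.e. outside ASCII.
fn is_non_ascii_ident(ident: &str) -> bool {
    ident.chars().any(|c| c as u32 >= 0x80)
}

/// Returns the identifiers from `idents` that the lint would flag.
fn flagged_idents<'a>(idents: &[&'a str]) -> Vec<&'a str> {
    idents
        .iter()
        .copied()
        .filter(|ident| is_non_ascii_ident(ident))
        .collect()
}
```

For the input `["image_width", "garçon", "東京", "_unused"]` this reports `garçon` and `東京`.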
|
||
## Confusable detection | ||
Currently the parser will go down a bad route when encountering an identifier that isn't a keyword. We probably want to forbid Unicode idents that could be confused with any of the Rust keywords.
See fifth paragraph, last sentence:
|
||
|
||
Rust compilers should detect confusingly similar Unicode identifiers and warn the user about it. | ||
|
||
Note: This is *not* mandatory for all Rust compilers as it requires considerable implementation effort and is not related to the core function of the compiler. Rather, it is a tool to detect accidental misspellings and intentional homograph attacks.
|
||
A new `confusable_idents` lint is added to the compiler. The default setting is `warn`. | ||
|
||
Note: The confusable detection is set to `warn` instead of `deny` to enable forward compatibility. The list of confusable characters will be extended in the future, and with `deny` programs that were once valid would then fail to compile.
|
||
The confusable detection algorithm is based on [Unicode® Technical Standard #39 Unicode Security Mechanisms Section 4 Confusable Detection][TR39Confusable]. For every distinct identifier X execute the function `skeleton(X)`. If there exist two distinct identifiers X and Y in the same crate where `skeleton(X) = skeleton(Y)` report it. The compiler uses the same mechanism to check if an identifier is too similar to a keyword. | ||
|
||
Note: A fast way to implement this is to compute `skeleton` for each identifier once and place the result in a hashmap as a key. If one tries to insert a key that already exists check if the two identifiers differ from each other. If so report the two confusable identifiers. | ||
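The hashmap approach from the note can be sketched as follows. The `skeleton` function here is a toy stand-in: it only maps a handful of Cyrillic letters to look-alike Latin letters, using a table invented for this example. The real `skeleton` from UTS #39 also involves normalization and the full confusables data, so treat this as an illustration of the bookkeeping only.

```rust
use std::collections::hash_map::Entry;
use std::collections::HashMap;

/// Toy stand-in for the UTS #39 `skeleton` function.
/// HYPOTHETICAL table: only a few Cyrillic-to-Latin look-alikes are mapped.
fn skeleton(ident: &str) -> String {
    ident
        .chars()
        .map(|c| match c {
            '\u{0410}' => 'A', // CYRILLIC CAPITAL LETTER A
            '\u{0430}' => 'a', // CYRILLIC SMALL LETTER A
            '\u{0435}' => 'e', // CYRILLIC SMALL LETTER IE
            '\u{043E}' => 'o', // CYRILLIC SMALL LETTER O
            '\u{0440}' => 'p', // CYRILLIC SMALL LETTER ER
            '\u{0441}' => 'c', // CYRILLIC SMALL LETTER ES
            other => other,
        })
        .collect()
}

/// Reports pairs of distinct identifiers that share a skeleton.
/// Each identifier's skeleton is computed once and used as the map key.
fn find_confusables<'a>(idents: &[&'a str]) -> Vec<(&'a str, &'a str)> {
    let mut seen: HashMap<String, &'a str> = HashMap::new();
    let mut reports = Vec::new();
    for &ident in idents {
        match seen.entry(skeleton(ident)) {
            Entry::Occupied(existing) => {
                let first = *existing.get();
                // Same skeleton: only a problem if the identifiers differ.
                if first != ident {
                    reports.push((first, ident));
                }
            }
            Entry::Vacant(slot) => {
                slot.insert(ident);
            }
        }
    }
    reports
}
```

Repeated uses of the same identifier hit the occupied branch but are not reported, because the stored identifier and the new one are equal.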
|
||
## Conformance Statement | ||
|
||
* UAX31-C1: The Rust language conforms to the Unicode® Standard Annex #31 for Unicode Version 10.0.0.
* UAX31-C2: It observes the following requirements:
  * UAX31-R1. Default Identifiers: To determine whether a string is an identifier it uses UAX31-D1 with the following profile:
    * Start := XID_Start, plus `_`
    * Continue := XID_Continue
    * Medial := empty
  * UAX31-R1b. Stable Identifiers: Once a string qualifies as an identifier, it does so in all future versions.
  * UAX31-R4. Equivalent Normalized Identifiers: All identifiers are normalized according to normalization form C before comparison.
|
||
# Drawbacks | ||
[drawbacks]: #drawbacks | ||
|
||
* "ASCII is enough for anyone." As source code should be written in English and in English only (source: various people) no characters outside the ASCII range are needed to express identifiers. Therefore support for Unicode identifiers introduces unnecessary complexity to the compiler. | ||
* "Foreign characters are hard to type." Usually computer keyboards provide access to the US-ASCII printable characters and the local language characters. Characters from other scripts are difficult to type, require entering numeric codes or are not available at all. These characters either need to be copy-pasted or entered with an alternative input method. | ||
* "Foreign characters are hard to read." If one is not familiar with the characters used it can be hard to tell them apart (e.g. φ and ψ) and one may not be able to refer to the identifiers in an appropriate way (e.g. "loop" and "trident" instead of phi and psi).
* "My favorite terminal/text editor/web browser has incomplete Unicode support." Even in 2018 some characters are not widely supported in all places where source code is usually displayed.
* Homoglyph attacks are possible. Without confusable detection identifiers can be distinct for the compiler but visually the same. Even with confusable detection there are still similar looking characters that may be confused by the casual reader. | ||
|
||
# Rationale and alternatives | ||
[alternatives]: #alternatives | ||
|
||
As stated in [Motivation](#motivation), allowing Unicode identifiers outside the ASCII range improves Rust's accessibility for developers not working in English. Especially in teaching and when the application domain vocabulary is not in English it can be beneficial to use names from the native language. To facilitate this it is necessary to allow a wide range of Unicode characters in identifiers. The proposed implementation based on the Unicode TR31 is already used by other programming languages and is implemented behind the `non_ascii_idents` feature gate in *rustc*, but lacks the proposed NFC normalization.
|
||
NFC normalization was chosen over NFKC normalization for the following reasons: | ||
|
||
* [Mathematicians want to use symbols mapped to the same NFKC form](https://github.com/rust-lang/rfcs/pull/2457#issuecomment-394928432) like π and ϖ in the same context. | ||
* [Some words are mangled by NFKC](https://github.com/rust-lang/rfcs/pull/2457#issuecomment-394922103) in surprising ways. | ||
* Naive (search) tools can't find different variants of the same NFKC identifier. As most text is already in NFC form search tools work well. | ||
|
||
Possible variants: | ||
|
||
1. Require all identifiers to be already in NFC form. | ||
2. Two identifiers are only equal if their codepoints are equal. | ||
3. Perform NFKC mapping instead of NFC mapping for identifiers. | ||
4. Only a number of common scripts could be supported. | ||
5. A [restriction level][TR39Restriction] is specified, allowing only a subset of scripts and limiting script-mixing within an identifier.
|
||
An alternative design would use [Immutable Identifiers][TR31Alternative] as done in [C++]. In this case a list of Unicode codepoints is reserved for syntax (ASCII operators, braces, whitespace) and all other codepoints (including currently unassigned codepoints) are allowed in identifiers. The advantages are that the compiler does not need to know the Unicode character classes XID_Start and XID_Continue for each character and that the set of allowed identifiers never changes. It is disadvantageous that all not explicitly excluded characters at the time of creation can be used in identifiers. This allows developers to create identifiers that can't be recognized as such. It also impedes other uses of Unicode in Rust syntax like custom operators if they were not initially reserved. | ||
|
||
It is always a possibility to do nothing and limit identifiers to ASCII.
|
||
It has been suggested that Unicode identifiers should be opt-in instead of opt-out. The proposal chooses opt-out to benefit the international Rust community. New Rust users should not need to search for the configuration option they may not even know exists. Additionally it simplifies tutorials in other languages as they can omit an annotation in every code snippet. | ||
|
||
## Confusable detection | ||
|
||
The current design was chosen because the algorithm and list of similar characters are already provided by the Unicode Consortium. A different algorithm and list of characters could be created. I am not aware of any other programming language implementing confusable detection. The confusable detection was primarily included because homoglyph attacks are a huge concern for some members of the community. | ||
|
||
Instead of offering confusable detection the lint `forbid(non_ascii_idents)` is sufficient to protect a project written in English from homoglyph attacks. Projects using different languages are probably either written by students, by a small group or inside a regional company. These projects are not threatened as much as large open source projects by homoglyph attacks but still benefit from the easier debugging of typos. | ||
|
||
# Prior art | ||
[prior-art]: #prior-art | ||
|
||
"[Python PEP 3131][PEP 3131]: Supporting Non-ASCII Identifiers" is the Python equivalent to this proposal. The proposed identifier grammar **XID_Start XID_Continue<sup>\*</sup>** is identical to the one used in Python 3. While Python uses KC normalization this proposes to use normalization form C. | ||
|
||
[JavaScript] supports Unicode identifiers based on the same Default Identifier Syntax but does not apply normalization. | ||
|
||
The [CPP reference][C++] describes the allowed Unicode identifiers; it is based on the immutable identifier principle.
|
||
[Java] also supports Unicode identifiers. Characters must belong to a number of Unicode character classes similar to the XID_Start and XID_Continue classes used in Python. Unlike in Python, no normalization is performed.
|
||
The [Go language][Go] allows identifiers in the form **Letter (Letter | Number)\*** where **Letter** is a Unicode letter and **Number** is a Unicode decimal number. This is more restricted than the proposed design, mainly as it does not allow the combining characters needed to write some languages such as Hindi.
|
||
# Unresolved questions | ||
[unresolved]: #unresolved-questions | ||
|
||
* Which context is adequate for confusable detection: file, current scope, crate? | ||
* Are Unicode characters allowed in `no_mangle` and `extern fn`s? | ||
* How do Unicode names interact with the file system? | ||
* Are crates with Unicode names allowed and can they be published to crates.io? | ||
* Are `non_ascii_idents` and `confusable_idents` good names? | ||
* Should [ZWNJ and ZWJ be allowed in identifiers][TR31Layout]? | ||
|
||
[PEP 3131]: https://www.python.org/dev/peps/pep-3131/ | ||
[UAX31]: http://www.unicode.org/reports/tr31/ | ||
[UAX15]: https://www.unicode.org/reports/tr15/ | ||
[TR31Alternative]: http://unicode.org/reports/tr31/#Alternative_Identifier_Syntax | ||
[TR31Layout]: https://www.unicode.org/reports/tr31/#Layout_and_Format_Control_Characters | ||
[TR39Confusable]: https://www.unicode.org/reports/tr39/#Confusable_Detection | ||
[TR39Restriction]: https://www.unicode.org/reports/tr39/#Restriction_Level_Detection | ||
[C++]: https://en.cppreference.com/w/cpp/language/identifiers | ||
[Julia Unicode PR]: https://github.com/JuliaLang/julia/pull/19464 | ||
[Java]: https://docs.oracle.com/javase/specs/jls/se10/html/jls-3.html#jls-3.8 | ||
[JavaScript]: http://www.ecma-international.org/ecma-262/6.0/#sec-names-and-keywords | ||
[Go]: https://golang.org/ref/spec#Identifiers | ||
[Composed characters]: https://en.wikipedia.org/wiki/Precomposed_character |
Bringing my position inline so that anyone can reply to this directly.
I believe that by default we should:
(`latin`, `math`, `greek`, `hira`/`hiragana`, `ascii`/`reduced`/`limited`/`lets-party-like-its-1960`, `emoji`, etc.). If a list of allowable scripts is defined in a crate, encountering chars outside of them is a hard error. On this error, confusables that are outside of the currently allowed scripts but that have a similar representation in the current script (`𝜆` in `math` -> `λ` in `greek`) should be pointed out, with suggestions to either use the allowed script's char or to add the new script to the allowed list.
Can you please state the reasoning behind point three? If the supporting arguments were already discussed also include the opinions of other people on the matter.
Being liberal on the default is a contentious point, with people bringing multiple positions. I fall into the "let's be relatively liberal" camp, with maybe a concession to a warn-by-default lint that fires outside of ASCII and suggests specifying allowed scripts. By doing it this way we minimize friction for users of non-ASCII scripts, while not heavily burdening ASCII-only users (this is debated).
I believe that allowing people to specify allowed scripts is a good solution to the problem of cross-script confusables sneaking in, without forcing people to decide that it is not worth using math symbols for formulas because they are worried about Cyrillic confusables being sneaked into the codebase. This would also allow me to use emojis for identifiers if I so choose, as that script would be disabled by default, but I could add it to the allowed list.
Per-scope customization would be interesting, as it would give people a lot of linting control when they want to allow a given script internally but not in the public API. I would not count function argument names as part of the public API for this consideration, as we don't need to write them down, just be able to read them.
Would you say yourself that this is a fair and complete review of the previous arguments?
I believe it to be so, yes. I haven't seen my last point around per scope settings raised before, though.
This was discussed under the "malicious contributor" label and was deemed "not much trouble".
How does allowing more codepoints than UAX#31 Default Identifiers affect the evolution and readability of Rust? (See custom operators and smart quotes.) For which people is it important to allow additional characters?
Why is the proposed confusable lint insufficient?
Why aren't ad hoc solutions like regular expressions sufficient?
Why are typos involving Unicode characters different from those only involving ASCII ones?
I feel that giving partial support for Unicode idents would be a disservice if we don't allow a future proof escape hatch.
Other than novelty uses (emoji), I don't know enough about other languages that might need it (my mother tongue is limited to latin).
Proactive warning and fine-grained control for individual projects. A binary gate feels too blunt to me given the scope.
I believe that the compiler and an editor should be enough to develop in Rust, and it should make an effort to provide help to newcomers for non obvious errors (and here I fall in the same camp of "avoid non ASCII in general" if possible, I just don't want to mandate it).
Tooling support. We already make suggestions based on Levenshtein distance for ASCII typos; we should do the same for confusable chars.
Having said all of this, I'm enthusiastically in support of adding the proposed support and will do my best to make sure that the presented worries have a reasonable answer in the shape of compiler assisted support.
Great 👍
Maybe update your position above if it is only for novelty?
Care to give an example from your personal experience where such a feature in a programming language would have made your work easier? (Keep in mind that many languages have had Unicode support for years now, among them Python, Java and C++.)
We already do (tested on nightly):
My position is born of my own ignorance, I defer to people with more experience with non-western-european languages. For me (again, other than the novelty use of emoji and math symbols), my use case can be completely covered by latin (ñ, ü, é) and my counter case would be identifying confusables (which as you point out it would be handled) and bad diagnostics when changed code (smart quotes) is pasted.
Entire generations of Spanish-speaking students have been "trained" to transliterate to ASCII, where `año` (year) gets changed to `anio` (meaningless) or `ano` (anus). I think that the focus for this document should be placed on the use of other scripts that limit developers more heavily than this.
As pointed out, having a single toggle would make it hard to only allow full Unicode support at different levels, like only for literals, idents, in comments, or maybe on idents for internal interfaces, while disallowing them in other places (like public interfaces).