Allow non-ASCII identifiers #2455
Conversation
I doubt that the confusability check can be implemented effectively. There are just too many characters in Unicode that look confusingly similar, often from different scripts (Latin and Cyrillic, for example). On the other hand, I doubt that homograph attacks will be a problem, as you already need to trust the people who write your code.
Are there other languages with permissive Unicode identifiers that we could survey? How prevalent are homograph attacks? Separately from attack vectors, I like the idea itself, though, particularly as a way to improve failure modes.
This RFC is incomplete IMO. The reference-level section needs a detailed specification of the proposed behaviors. NFC is unambiguous enough, but “the current set of allowed characters” means “whatever happened to be implemented unstable in June 2018” and doesn’t cut it. Similarly for confusables. https://unicode.org/reports/tr31/ and https://www.unicode.org/reports/tr39/#Confusable_Detection are relevant, but they each define multiple variations to pick from. (Edit: it also needs discussion of what happens when the compiler updates to a new version of Unicode: #2455 (comment).)

Motivation and alternatives should be discussed at a finer granularity. Why error on non-NFC source rather than have the parser do the normalization itself and have the compiler compare normalized identifiers in name resolution? Many people are not familiar with Unicode normalization; what do we expect them to do with such a compile error? Even for those who are, many text editors do not provide any UI for normalization.

Regarding confusables, I agree with @pyfisch that the threat model of a homograph attack during code review seems dubious. If however we mean to help people who accidentally type an ident that doesn’t compare equal to what they expect, why not only look for confusables in diagnostics, without changing whether a compiler error happens in the first place? The same-file boundary seems arbitrary: calling a method defined in another file happens all the time.

Finally, for this feature in particular I would like to see a lot more research in the prior art section. Different programming languages do slightly different things here (e.g. rust-lang/rust#28979 (comment) and rust-lang/rust#28979 (comment)). Before we stabilize this I’d like to find out what those differences are in a wide range of programming languages and why they made those decisions. Some choices might be arbitrary, but others might be guided by reasons that would also apply to Rust.

Yes, this is a lot of work, and yes, I haven’t been doing that work myself either. But I think this is what would produce the best outcome.
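To make the normalization alternative concrete: NFC collapses visually identical spellings onto one canonical code-point sequence, which is what the parser could do silently instead of erroring on non-NFC source. A minimal sketch in Python, used here only because its stdlib ships `unicodedata`; the string is illustrative:

```python
import unicodedata

# "café" can be encoded two ways; both render identically.
composed = "caf\u00e9"     # U+00E9 LATIN SMALL LETTER E WITH ACUTE
decomposed = "cafe\u0301"  # "e" followed by U+0301 COMBINING ACUTE ACCENT

assert composed != decomposed  # distinct code-point sequences
# NFC normalization maps both spellings to the composed form,
# so identifiers compared after normalization would be equal.
assert unicodedata.normalize("NFC", decomposed) == composed
```

A parser that normalized identifiers this way would spare users who have never heard of NFC from an error they cannot easily act on.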
Unicode characters outside the ASCII range can be harder to type, and potentially harder to read; for most characters in Unicode, there is a representation using only ASCII characters. I do recognize that in some cases (maths/science libraries) it can make the code easier to read, but to be fair, there is a lot more to this than simply handling collisions of symbols which look similar. There is also this discussion on /r/rust which shows one case of this (in Go), and the community opinion on this topic. Edit: Doing some more research, here is the PEP for it, and they have pretty good justification for why it should be allowed, so now I'm quite torn on the issue, as it would make sense to have the internationalization. Edit2: I derped and added the link to Idris without reading it properly. Anyways, removed that link now.
I'm somewhat concerned about the potential for homograph attacks and think that we should be biased towards caution. A conservative alternative is to introduce a select few sets of characters rather than trying to come up with a complete and foolproof solution.
- What parts of the design do you expect to resolve through the RFC process before this gets merged?
- What parts of the design do you expect to resolve through the implementation of this feature before stabilization?
- What related issues do you consider out of scope for this RFC that could be addressed in the future independently of the solution that comes out of this RFC?
- How can the confusability check be implemented? We should have a fast path when there are *no* non-ASCII identifiers, but we also must be fast (at worst, O(N log N)) when there *are* non-ASCII identifiers.
This needs to be the core of the detailed design, not an unresolved question, IMO.
@shingtaklam1324 For me, the biggest reason to add this is not so that programmers can type Greek letters for mathematical variables (such as ∆ for delta), but so that identifiers can be written in any language. I'm not opposed to the use of a standard language for open source projects to facilitate collaboration, but I do think that developers should still be given a choice, instead of being told what's best for them.
@repax My proposal is that any attempt at a homograph attack automatically results in an error message. That is why it requires the compiler to check for confusability. @pyfisch I believe that a confusability check can be done efficiently: after normalizing to NFC, replace every character in every identifier by the lowest-valued character it could be confused with. If any identifiers are the same now that were not the same before, report an error.
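The algorithm described here is essentially the UTS #39 “skeleton” comparison. A hedged sketch in Python, with a deliberately tiny, hypothetical confusables table standing in for the real UTS #39 data file:

```python
import unicodedata

# Hypothetical two-entry confusables table; UTS #39 ships the real one,
# mapping each character to a canonical "skeleton" character.
CONFUSABLES = {
    "\u0430": "a",  # CYRILLIC SMALL LETTER A  -> LATIN a
    "\u03bf": "o",  # GREEK SMALL LETTER OMICRON -> LATIN o
}

def skeleton(ident: str) -> str:
    """NFC-normalize, then fold every character to its confusable class."""
    nfc = unicodedata.normalize("NFC", ident)
    return "".join(CONFUSABLES.get(ch, ch) for ch in nfc)

def find_confusable_pairs(idents):
    """Report pairs of distinct identifiers that share a skeleton."""
    seen = {}
    clashes = []
    for ident in idents:
        key = skeleton(ident)
        if key in seen and seen[key] != ident:
            clashes.append((seen[key], ident))
        else:
            seen.setdefault(key, ident)
    return clashes

# Latin "data" vs. a homograph whose second letter is Cyrillic а:
pairs = find_confusable_pairs(["data", "d\u0430ta", "other"])
assert pairs == [("data", "d\u0430ta")]
```

Because each identifier is hashed once by its skeleton, the check is linear in the total length of the identifiers, comfortably inside the O(N log N) bound raised in the unresolved questions.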
@DemiMarie After reading Unicode Technical Standard #39 on confusable detection, mentioned by @SimonSapin, I agree that there is an efficient algorithm. But I infer from the RFC and your comments that you are either unaware of the algorithm and the associated table, or that you want to use a different confusable-detection function. Either way, I agree with @SimonSapin and @sfackler that the RFC is incomplete.

If a table-based solution like the one from Unicode Technical Standard #39 is used, detected confusables should not be a hard error but a warning, since the confusion tables are expected to improve and erroring on additional characters may break existing programs.
Didn't we address this in #1565? ;) In all seriousness, I do believe a homograph attack against code review sounds plausible, so ideally anything like this should live behind a feature gate, even after stabilization. There is no defense in restricting the script per identifier like @pyfisch mentioned. Also, you cannot do the confusability checks at the file level, because identifiers are routinely used across files.

Also, error messages should say exactly what script to enable, making this painless. We could do crate-level confusability checks based on the scripts enabled. We should likely allow
I'm German, and German, even though it uses the Latin alphabet, has characters that are not in ASCII (äöüß). Most languages that use the Latin alphabet do, and of course there are so many other languages that don't use the Latin alphabet at all. I think it is a legitimate demand that you should be able to code in your native language and script, so generally I'm in favour of this RFC.

Note that even though I agree there should be an option, I don't think it is a good idea for the code I am writing. As a German, I could write e.g. the comments in my public codebases in German. Like so many other non-native-English open source developers, I chose to write them in English instead. This is highly useful because the number of German speakers is so much smaller than the number of English speakers: I can accept PRs from people in France, the USA, etc. They all understand what my code is doing, and I understand what their code is doing.

From a beauty aspect, I prefer consistent codebases. Most if not all of the crates, all of std, all the keywords, and all of the rustc error messages are in English. Starting to use German would create a half-English, half-German codebase, which is something I don't find appealing. My personal view, at least. Even with translated error messages, this causes googleability to suffer dramatically.

Another advantage of keeping non-ASCII identifiers out of codebases is that everyone with an ASCII-capable keyboard can easily edit stuff. Greek letters look nice, but they are hard to type. Imagine someone non-technical (and non-Greek) doing their first steps in programming, seeing µ, and wondering how to type that weird u thing.

So I agree with @Serentty that the main use case of this proposal should be internationalisation. This is a great goal. Rust should be international, and there are programming communities which use languages other than English as their lingua franca.
Currently, the RFC intends to simply turn off the ASCII ident check, without any opt-in or anything else. Basically, you can accidentally start using non-ASCII identifiers even though you might want to keep your codebase ASCII-only. I agree with @burdges that this should be different. I would love the default to be to allow only ASCII idents and to error on non-ASCII idents via a default-error lint. Anyone who works on non-English codebases can then turn off that lint. The idea of making the lint customizable using scripts is really awesome. It would mean that the lint is still useful even for those who use a native script: generally, if you work on a codebase that uses German idents, you don't want e.g. Russian/Bulgarian script or, I don't know, runes or something.
Unicode Technical Report #31 is also worth a read. It has a list of suggested scripts to allow and disallow, which I think we should consider, rather than just allowing all Unicode characters like some other languages do. This could be set up like lint levels, with specific lints for the different character sets and different warning levels by default.
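One way to picture the script-restriction lints suggested here: map each character to a script and flag identifiers that stray outside an allowlist. The sketch below is an approximation only; it derives a pseudo-script from the first word of each character's Unicode name, whereas real tooling would use the Script property from UAX #24, and the allowlist is a hypothetical lint configuration:

```python
import unicodedata

# Hypothetical lint configuration: which scripts this crate permits.
ALLOWED_SCRIPTS = {"LATIN", "GREEK"}

def scripts_of(ident):
    """Collect pseudo-script names for the non-ASCII characters of an ident."""
    found = set()
    for ch in ident:
        if ord(ch) < 128:
            continue  # ASCII is always allowed
        name = unicodedata.name(ch, "")
        if name:
            # First word of e.g. "CYRILLIC SMALL LETTER A"; a crude proxy
            # for the real Script property (it misfires on names like
            # "COMBINING ACUTE ACCENT").
            found.add(name.split()[0])
    return found

def lint(ident):
    """Return the sorted list of disallowed scripts; empty means OK."""
    return sorted(scripts_of(ident) - ALLOWED_SCRIPTS)

assert lint("naive") == []                 # plain ASCII passes
assert lint("\u03b4elta") == []            # Greek δ is on the allowlist
assert lint("d\u0430ta") == ["CYRILLIC"]   # Cyrillic а gets flagged
```

An error message produced from this result could name the offending script exactly, which is the “say exactly what script to enable” ergonomics suggested earlier in the thread.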
I would be unhappy if this was added, for a few reasons:
I agree with @SimonSapin agreeing with @pyfisch that the risk of homograph attacks / risk to security is dubious. It seems to me that there are far riskier things security-wise than Unicode idents.

During my first year as a computer science student, we had two professors who insisted on referring to CS terms by their Swedish names and used Swedish in comments and variable names. Most students, as far as I recall, really disliked the experience, because they were not learning the international terms and so had to re-learn them in English. These students, among them myself, were not native English speakers; however, knowledge of English is expected in Swedish society. Therefore, I would not use Swedish in any code nor accept it in my projects.

That said, it seems important to me to support the internationalization of Rust, because English is far from a lingua franca across the world. So I support the principle of non-ASCII idents. To satisfy both those who want to avoid non-ASCII idents in their code and those who want to use them, I propose two lints (either in clippy or the compiler):

The first allows you to use non-ASCII idents in non-exported code.
Another concern is what happens when we update the compiler’s version of the Unicode character database. I think we don’t want to get stuck on an old version forever, but we also don’t want to start emitting errors for programs that used to compile correctly with a previous rustc. This requires some careful tweaks to the algorithm; see for example https://unicode.org/reports/tr15/#Stabilized_Strings.
One more concern: FFI. What happens if I no_mangle an ident with a funny character? I'm guessing this should be disallowed?
@mark-i-m we could start off by forbidding no_mangle and extern fns with such names.
For the upgrading-tables question: this seems like something that could be done on epoch boundaries. Always warn using the latest tables, and error using the table that was current when the epoch was locked. (I'm not sure that's better than a deny-by-default lint that always uses the latest tables, though, which I think would be allowed under the stability guarantee.)
This is the technical portion of the RFC. Explain the design in sufficient detail that:
This RFC proposes that non-ASCII identifiers be stabilized, with the current set of allowed characters. However, there are restrictions to avoid visual confusion:

* No two identifiers that are distinct but visually confusable may be in the same scope.
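The same-scope rule targets exactly the situation that Python, which already allows non-ASCII identifiers, accepts silently today. A short demonstration; the variable names are illustrative only:

```python
# Latin "a" (U+0061) and Cyrillic "а" (U+0430) render identically in
# most fonts, yet they are two distinct identifiers in the same scope.
a = 1   # LATIN SMALL LETTER A
а = 2   # CYRILLIC SMALL LETTER A (U+0430)

assert a != а             # two different bindings despite identical looks
assert "a" != "\u0430"    # because the underlying code points differ
```

Under the proposed rule, a Rust compiler would reject this pair with an error instead of letting the two bindings coexist.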
How does this interact with `use foo::*; use bar::*;`? Does this mean that minor version updates to crates can stop my code compiling because they introduce distinct-but-confusable names?
Maybe it would make sense to require or encourage ASCII aliases for non-ASCII identifiers? This could solve many problems:
I've written an alternative proposal at #2457 that offers a more precise reference-level explanation and explores several other designs.
Closing in favor of #2457, which is vastly better written.