Change bare key characters to Letter and Digit #990
I believe this would greatly improve things and solves all the issues, mostly. It's a bit more complex, but not overly so, and can be implemented without a Unicode library without too much effort. It offers a good middle ground, IMHO.

I don't think there are ANY perfect solutions here; *anything* will be a trade-off. That said, I do believe some trade-offs are better than others, and I've made it no secret that I feel the current trade-off is a bad one. After looking at a bunch of different options I believe this is by far the best path for TOML.

Advantages:

- This is what I would consider the "minimal set" of characters we need to add for reasonable international support, meaning we can't really make a mistake with this by accidentally allowing too much. We can add new ranges in TOML 1.2 (or even change the entire approach, although I'd be very surprised if we need to), based on actual real-world feedback, but any approach we take will need to include letters and digits from all scripts. This is a strong argument in favour of this and a huge improvement: we can't really do anything wrong here in a way that we can't correct later, unlike what we have now, which is "well I think it probably won't cause any problems, based on what these 5 European/American guys think, but if it does: we won't be able to correct it". Being conservative for these types of things is good!

- This solves the normalisation issues, since combining characters are no longer allowed in bare keys, so it becomes a moot point. For quoted keys normalisation is mostly a non-issue because few people use them, which is why this has gone largely unnoticed and undiscussed before the "Unicode in bare keys" PR was merged.[1]

- It's consistent in what we allow: no "this character is allowed, but this very similar other thing isn't, what gives?!" Note that toml-lang#954 was NOT about "I want all emojis to work" per se, but "this character works fine, but this very similar one doesn't".
This shows up in a number of things aside from emojis:

    a.toml:
        Input: ; = 42 # U+037E GREEK QUESTION MARK (Other_Punctuation)
        Error: line 1: expected '.' or '=', but got ';' instead

    b.toml:
        Input: · = 42 # U+0387 GREEK ANO TELEIA (Other_Punctuation)
        Error: (none)

    c.toml:
        Input: – = 42 # U+2013 EN DASH (Dash_Punctuation)
        Error: line 1: expected '.' or '=', but got '–' instead

    d.toml:
        Input: ⁻ = 42 # U+207B SUPERSCRIPT MINUS (Math_Symbol)
        Error: (none)

    e.toml:
        Input: ＃x = "commented ... or is it?" # U+FF03 FULLWIDTH NUMBER SIGN (Other_Punctuation)
        Error: (none)

"Some punctuation is allowed but some isn't" is hard to explain, and also not what the specification says: "Punctuation, spaces, arrows, box drawing and private use characters are not allowed." In reality, a lot of punctuation IS allowed, but not all (especially outside of the Latin character range by the way, which shows the Euro/US bias in how it's written).

People don't read specifications in great detail, nor should they. People try something and see if it works. Now it seems to work on first approximation, and then (possibly months or years later) it seems to "suddenly break". From the user's perspective this seems like a bug in the TOML parser, but it's not: it's a bug in the specification. It should either allow everything or nothing. This in-between is confusing and horrible.

There is no good way to communicate this other than "these codepoints, which cover most of what you'd write in a sentence, except when it doesn't". In contrast, "we allow letters and digits" is simple to spec, simple to communicate, and should have a minimum potential for confusion. The current spec disallows some things seemingly almost arbitrarily while allowing other very similar characters.
- This avoids a long list of confusable special TOML characters; some were mentioned above but there are many more:

      '＃' U+FF03 FULLWIDTH NUMBER SIGN (Other_Punctuation)
      '＂' U+FF02 FULLWIDTH QUOTATION MARK (Other_Punctuation)
      '﹟' U+FE5F SMALL NUMBER SIGN (Other_Punctuation)
      '﹦' U+FE66 SMALL EQUALS SIGN (Math_Symbol)
      '﹐' U+FE50 SMALL COMMA (Other_Punctuation)
      '︲' U+FE32 PRESENTATION FORM FOR VERTICAL EN DASH (Dash_Punctuation)
      '˝' U+02DD DOUBLE ACUTE ACCENT (Modifier_Symbol)
      '՚' U+055A ARMENIAN APOSTROPHE (Other_Punctuation)
      '܂' U+0702 SYRIAC SUBLINEAR FULL STOP (Other_Punctuation)
      'ᱹ' U+1C79 OL CHIKI GAAHLAA TTUDDAAG (Modifier_Letter)
      '₌' U+208C SUBSCRIPT EQUALS SIGN (Math_Symbol)
      '⹀' U+2E40 DOUBLE HYPHEN (Dash_Punctuation)
      '࠰' U+0830 SAMARITAN PUNCTUATION NEQUDAA (Other_Punctuation)

  Is this a big problem? I guess it depends; I can certainly imagine an Armenian speaker accidentally leaving an Armenian apostrophe. Confusables are also an issue with different scripts (Latin and Cyrillic is well-known), but this is less of an issue since it's not syntax, and also something that's fundamentally unavoidable in any multi-script environment.

- Maps closer to identifiers in more (though not all) languages. We discussed whether TOML keys are "strings" or "identifiers" last week in toml-lang#966 and while views differ (mostly because they're both) it seems to me that making it map *closer* is better. This is a minor issue, but it's nice.

That does not mean it's perfect; as I mentioned all solutions come with a trade-off. The ones made here are:

- The biggest issue by far is that the check to see if a character is valid may become more complex for some languages and environments that can't rely on a Unicode database being present. However, implementing this check is trivial logic-wise: it just needs to loop over every character and check if it's in a range table. You already need this with TOML 1.0, it's just that the range tables become larger.
  The downside is that it needs a somewhat large-ish "allowed characters" table with 716 start/stop ranges, which is not ideal, but entirely doable and easily auto-generated. It's ~164 lines hard-wrapped at column 80 (or ~111 lines hard-wrapped at col 120). tomlc99 is 2,387 lines, so that seems within the limits of reason (actually, reading through the tomlc99 code, adding multibyte support at all will be the harder part, with this range table being a minor part).

- There's a new Unicode version roughly every year or so, and the way it's written now means it's "locked" to Unicode 9 or, optionally, a later version. This is probably fine: Apple's APFS filesystem (which does normalisation) is "locked" to Unicode 9.0; HFS+ was Unicode 3.2. Go is Unicode 8.0. etc. I don't think this is really much of an issue in practice. I chose Unicode 9 as everyone supports this; I went back and forth on it for a long time, and we can also use a more recent version. I feel this gives us a nice balance between reasonable interoperability while also future-proofing things.

- ABNF doesn't support Unicode. This is a tooling issue, and in my opinion the tooling should adjust to how we want TOML to look, rather than adjusting TOML to what tooling supports. AFAIK no one uses the ABNF directly in code, and it's merely "informational". I'm not happy with this, but personally I think this should be a non-issue when considering what to do here. We're not the only people running into this limitation, and it is really something that IETF should address in a new RFC or something ("Extra Augmented BNF"?)

Another solution I tried is restricting the code ranges; I twice tried to do this (with some months in-between) and spent a long time looking at Unicode blocks and ranges, and I found this impractical: we'll end up with a long list which isn't all that different from what this proposal adds.
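For illustration, the range-table check described above could be sketched roughly as follows. This is only a sketch under stated assumptions: `ALLOWED_RANGES`, `is_allowed` and `is_bare_key` are hypothetical names, and the table is a tiny hand-written excerpt standing in for the auto-generated table of several hundred ranges, not the actual table from this PR.

```python
from bisect import bisect_right

# Hypothetical, heavily abridged table of inclusive (start, stop) code point
# ranges, sorted by start. A real table would be auto-generated from the
# Unicode data files and be far longer.
ALLOWED_RANGES = [
    (0x2D, 0x2D),  # -
    (0x30, 0x39),  # 0-9
    (0x41, 0x5A),  # A-Z
    (0x5F, 0x5F),  # _
    (0x61, 0x7A),  # a-z
    (0xC0, 0xD6),  # Latin-1 letters before the multiplication sign
    (0xD8, 0xF6),  # Latin-1 letters between the multiplication and division signs
    (0xF8, 0xFF),  # Latin-1 letters after the division sign
]
_STARTS = [start for start, _ in ALLOWED_RANGES]


def is_allowed(cp):
    """Binary-search the sorted table for a range containing code point cp."""
    i = bisect_right(_STARTS, cp) - 1
    return i >= 0 and cp <= ALLOWED_RANGES[i][1]


def is_bare_key(key):
    """A bare key is non-empty and every character is in the table."""
    return len(key) > 0 and all(is_allowed(ord(ch)) for ch in key)
```

With a binary search the lookup cost grows only logarithmically with the table size, so a larger table mostly costs storage rather than time.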
Fixes toml-lang#954
Fixes toml-lang#966
Fixes toml-lang#979
Ref toml-lang#687
Ref toml-lang#891
Ref toml-lang#941

---

[1]: Aside: I encountered this just the other day as I created a TOML file with all UK election results since 1945, which looks like:

    [1950]
    Labour       = [13_266_176, 315, 617]
    Conservative = [12_492_404, 298, 619]
    Liberal      = [ 2_621_487,   9, 475]
    Sinn_Fein    = [    23_362,   0,   2]

That should be Sinn_Féin, but "Sinn_Féin" seemed ugly, so I just wrote it as Sinn_Fein. This is what most people seem to do.
Instead of trying to define custom subsets, have you considered using the characters based on Unicode properties instead? I'm thinking of https://www.unicode.org/reports/tr44/#Alphabetic and https://www.unicode.org/reports/tr44/#Numeric_Type (=Decimal? or Digit?) for these. The current set of characters ends up excluding multiple languages that don't use Latin/Latin-derived script (eg: none of the Other_Alphabetic characters are included in this set, which are necessary components of various languages -- I have an example below).
This is still the case. An example I have handy: औरंगाबाद (Aurangabad) is a city in India. The codepoints it is composed of are (with name + category):
The vowels and signs of the script end up being not in the current set (they're in Mc and Mn categories) while the letters are. It is subtle that रग is permitted but रंग is not. How I printed the decomposed info about that string
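The original snippet did not survive on this page. As a stand-in, and not necessarily what the commenter actually ran, a per-codepoint breakdown like the one described can be produced with Python's standard `unicodedata` module. The helper name `describe` is made up for this sketch.

```python
import unicodedata


def describe(text):
    """Return (code point, name, general category) for each character."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"), unicodedata.category(ch))
        for ch in text
    ]


# The word from the comment, written with escapes so the code points are explicit.
word = "\u0914\u0930\u0902\u0917\u093e\u092c\u093e\u0926"  # Aurangabad in Devanagari

for code_point, name, category in describe(word):
    print(code_point, category, name)
```

The letters report category Lo, while the anusvara (U+0902) reports Mn and the vowel sign AA (U+093E) reports Mc, which matches the point made above.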
Taking a step back from the details, I think this is a good idea. TOML 1.0.0 used: Line 50 in 8eae5e1
Changing that to the somewhat equivalent seems like a reasonable approach to take...
(I still need to review the normalisation discussion though, to see why that concern isn't resolved with a "implementations should normalise in a manner appropriate for their implementation language/context" or something similar) |
"Alphabetic" is just a list of categories plus "Other_Alphabetic" property. This is similar to what we have now, except that Letter_Number (Nl) and Other_Alphabetic are included. I don't think we need Letter_Number since all of this seems to be for historic scripts, although it also won't hurt to include it I guess. We can just add Other_Alphabetic; looking through what that includes, it does have some combining characters: https://gist.github.com/arp242/183881717be3cde197d357bfe90d4541 But all of these seem to be without pre-composed forms, so that's not really an issue and " My main complaint it includes a bunch of a-z variants:
But I guess that's rare enough of a thing we don't need to worry about it. One issue is that Other_Alphabetic is harder to check than a category; e.g. Python's unicodedata doesn't have anything for this as far as I can see, so you'll need something like I'm not sure if "these categories + Other_Alphabetic property" or "Alphabetic property" is easier; I think the first one probably is? |
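The snippet referred to above is missing from this page. One possible approach, offered only as a sketch, is to read the property straight from the Unicode `PropList.txt` file. The names `parse_property` and `has_property` are hypothetical, and `SAMPLE_PROPLIST` is a short excerpt in the `PropList.txt` layout used here for illustration; a real check would load the full file for the chosen Unicode version.

```python
import re

# Illustrative excerpt in the PropList.txt layout (assumption: abridged and
# only for demonstration; use the real file for actual checks).
SAMPLE_PROPLIST = """\
# comment lines start with a hash and are skipped
0345          ; Other_Alphabetic # Mn       COMBINING GREEK YPOGEGRAMMENI
0900..0902    ; Other_Alphabetic # Mn   [3] DEVANAGARI SIGN INVERTED CANDRABINDU..DEVANAGARI SIGN ANUSVARA
0903          ; Other_Alphabetic # Mc       DEVANAGARI SIGN VISARGA
093E..0940    ; Other_Alphabetic # Mc   [3] DEVANAGARI VOWEL SIGN AA..DEVANAGARI VOWEL SIGN II
2000..200A    ; White_Space # Zs  [11] EN QUAD..HAIR SPACE
"""

# "XXXX" or "XXXX..YYYY", then ";", then the property name.
_LINE = re.compile(r"^([0-9A-Fa-f]+)(?:\.\.([0-9A-Fa-f]+))?\s*;\s*(\w+)")


def parse_property(text, prop):
    """Collect inclusive (start, stop) ranges that carry the given property."""
    ranges = []
    for line in text.splitlines():
        match = _LINE.match(line)
        if match and match.group(3) == prop:
            start = int(match.group(1), 16)
            stop = int(match.group(2) or match.group(1), 16)
            ranges.append((start, stop))
    return ranges


def has_property(cp, ranges):
    """Linear scan; fine for a sketch, a real table would be binary-searched."""
    return any(start <= cp <= stop for start, stop in ranges)


other_alphabetic = parse_property(SAMPLE_PROPLIST, "Other_Alphabetic")
```

Parsing the data file once and emitting a static table avoids any runtime dependency on a Unicode library, at the cost of pinning the table to one Unicode version.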
Per comments; also commit the script used to generate the ABNF ranges. Probably want to replace that with something less, ehm, crap, so it's easier for people to modify and run ... This was just quick and easy to write for me now.
On the Pitfalls of Including Unicode Character Classes in TOML Specs

Incorporating Unicode character classes into the TOML specification is more problematic than simply defining code-point ranges, and here's why:

Code-point ranges are fixed values, while Unicode character classes are intrinsically tied to a given version of the Unicode Standard. This means that character classes are ever-evolving entities, updated with each new Unicode release.

Tethering the TOML spec to such a mutable component like Unicode character classes creates an undesirable dependency. Every time the Unicode Standard updates, it would essentially trigger a change in the TOML specification, whether we like it or not.

In practical terms, anyone building a TOML parser would either have to rely on bulky libraries like ICU to keep up-to-date with these character classes, or continually revise their parser to align with each new Unicode version. Neither of these options is particularly appealing or efficient.

In summary: Let's absolutely incorporate the necessary code-points to make TOML as inclusive as possible for various scripts. But let's steer clear of adding Unicode character classes to the spec, to avoid creating an unnecessary and burdensome dependency.
Existing codepoints are basically never changed, and only new ones are added. Lots of stuff runs on old Unicode versions. Practically speaking, this is a non-issue, and the only problem you might run in to is that an implementation won't support the codepoint you want. However, the new codepoints that have been added tend to be obscure and not commonly used, and the "or later" covers this. You really don't need an ICU library or have to "continually revise their parser". In fact, lots of systems run on years-old ICU libraries, which is effectively the same, or have some specific Unicode version ossified in the spec (without "or later"). |
I acknowledge that Unicode code-points are rarely, if ever, removed from character classes. However, it's worth noting that they do get added, particularly to the The issue I see here is that the flexibility in Unicode character classes can complicate validation tests for TOML parsers. Imagine you have a test suite that validates a parser this year; it might not produce the same results next year if the Unicode Standard for both the parser and the test suite diverges. So, if we're going down the road of incorporating character classes, it's imperative to also specify a minimum Unicode Standard version that's supported. In this way, tests should only hold parsers accountable for character classes as defined in that specific Unicode version. Additionally, given that reserved code-points will be introduced in future Unicode updates, tests should be prepared for ambiguity. Specifically, they shouldn't mark a parser as 'failed' just because it accommodates a code-point from a future Unicode Standard. In summary, if we must include character classes, we should also commit to a minimum Unicode Standard version. This adds a layer of complexity and leaves some room for ambiguity concerning which characters are considered valid, but it's a more manageable approach. |
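A rough sketch of how a test suite could encode that tolerance is shown below. The function name `expected_verdict` and the three-way result are invented for illustration. The Unicode database bundled with the running Python interpreter stands in for "the minimum supported version", which is an assumption; a real suite would pin a specific version's data files. "Letter or digit" is approximated here as general categories L* and Nd.

```python
import unicodedata


def expected_verdict(ch):
    """What a conformance test may demand for ch used as a bare-key character.

    "accept": assigned and a letter/digit in the baseline Unicode version.
    "reject": assigned in the baseline version, but not a letter/digit.
    "either": unassigned in the baseline version (category Cn); a parser built
              against a later Unicode version may legitimately accept it.
    """
    category = unicodedata.category(ch)
    if category == "Cn":
        return "either"
    if category.startswith("L") or category == "Nd":
        return "accept"
    return "reject"


# The baseline this sketch is implicitly tied to.
print("Unicode data version:", unicodedata.unidata_version)
```

Under this scheme a test only fails a parser on code points whose status is settled in the baseline version, which is the accommodation the comment above asks for.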
That's what it does already; see the diff and commit message.
If it's not in Unicode then lots of stuff already won't work, and if it gets added today it will take a few years for the world (including TOML) to catch up. That's fine. It's really not a practical issue people will run in to, or at least not more so than anything else, and "or later" covers this.
"Flexibility" and "diverges" is rather overstating it; stuff isn't going to get randomly assigned, re-assigned, or unassigned, and many aspects have been very stable for many years. It's more or less "append only", and what is or isn't a "letter" or "digit" is not some hard cutting-edge problem, but fairly well established. Perhaps an implementation might add an unwise test such as checking for a random unassigned codepoint, but that's just not a smart thing to do, and toml-test will include tests so writing these tests yourself isn't even needed. I just don't see how this can be a practical issue. And even if it is: the fix is so trivial it's just not worth worrying about. |
My apologies for overlooking the inclusion of a minimal Unicode version in the diff and commit message—that indeed addresses the core of my previous concerns. If there's a specific Unicode version that serves as the benchmark, then implementors have a stable foundation to build upon. This eliminates the risk of a parser becoming outdated due to Unicode updates, as long as it aligns with the stated minimal version. In light of this new information, I have no reservations about incorporating Unicode character classes into the TOML spec. |
While I'm not opposed to Unicode properties as such, I have two problems with this approach: (1) As currently written, it would remove the possibility to use arbitrary words in any language as bare keys, since combining characters (Unicode categories Mc and Mn, particularly) are not allowed. I suppose that's intentional since @arp242 hopes in this way to implicitly force NFC normalization on bare keys? Right? When we first added non-ASCII letters to bare keys (#687), the initial proposal was indeed very similar to this one, except that it also included the rule:
And that, while it goes against the idea of "automatically enforced normalization", is indeed necessary to allow arbitrary words in arbitrary languages as bare keys, which is what #687 was all about. For example, the g̃ used in the Guarani alphabet is encoded as g with tilde using a combining diacritical mark (U+0303 ◌̃ COMBINING TILDE), rather than a precomposed character. The Navajo language also uses several letters, such as į́ (i with ogonek below and acute above), which don't exist in precomposed form.

Generally, the Unicode people make it very clear that they don't add new characters in precomposed form if a combination including combining marks is already available, see the question Q: Unicode doesn't contain the character I need, which is a Latin letter with a certain diacritical mark. Can you add it? You can read the answer yourself, but the gist of it is that if a letter can already be expressed as "base letter" followed by one or more "combining diacritical marks", no new precomposed character will be added for it, since the combination can be used just fine. So no, we cannot throw the combining marks out, however much we would like to.

(2) My second, much smaller reservation is that #687, as it was finally merged, is the much simpler solution. We had started with a Unicode approach close to this one, but @abelbraaksma then argued for the simpler range-based approach also used in the XML definition, and apparently convinced sufficiently many people, myself included, that that's the better way to go. Consider the 9 lines the character ranges have in the current ABNF compared to the 215 lines they would have if this proposal was accepted.
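The point about missing precomposed forms in (1) can be checked with Python's standard `unicodedata` module. This is a small sketch for illustration; the helper name `nfc_length` is made up here.

```python
import unicodedata


def nfc_length(text):
    """Number of code points left after NFC normalisation."""
    return len(unicodedata.normalize("NFC", text))


e_acute = "e\u0301"                # e + COMBINING ACUTE ACCENT
g_tilde = "g\u0303"                # g + COMBINING TILDE (Guarani)
i_ogonek_acute = "i\u0328\u0301"   # i + COMBINING OGONEK + COMBINING ACUTE (Navajo)

for label, text in [("e-acute", e_acute), ("g-tilde", g_tilde), ("i-ogonek-acute", i_ogonek_acute)]:
    print(label, len(text), "->", nfc_length(text))
```

Only the first sequence collapses to a single precomposed code point. The Guarani letter stays as two code points, and the Navajo letter composes only as far as U+012F followed by the combining acute, so a rule that forbids category Mn characters rejects both even in fully NFC-normalised text.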
@pradyunsg's point about औरंगाबाद (Aurangabad) is, of course, essentially the same as mine. It hasn't yet been addressed, as far as I can see. Possibly (I don't know) it is possible to write Devanagari in completely precomposed form, but if the software people typically use prefers combining marks that still wouldn't help them, since they would still run into incomprehensible errors when trying to use arbitrary words written in Devanagari as bare keys. |
औरंगाबाद is fixed by adding Other_Alphabetic; there's a bunch of other combining marks in there, and I believe this should cover what's needed. I'm not entirely sure if some of these also have a pre-composed form, but many don't (and if there isn't any pre-composed form, then "there is only one way to represent it" is retained). For example for Devanagari there are no pre-composed characters with Anusvara or aa vowel sign that are used in औरंगाबाद. I find it difficult to get good information on this; I really need to write a script to get a good analysis and I'll have to get back to you on that. Guarani and Navajo will be harder :-( But ... maybe we should just release and see what happens?
The thing is, with this proposal we can always change course if we make a mistake; other than the To be honest I wouldn't be surprised if this feature sees very little uptake, with the main usage being the odd diacritic in Latin like Or maybe people will start using this a lot; perhaps in ways we didn't anticipate. That's also entirely possible. This is really my main objection against the current state: we just don't know what people will do, and will work well, what will and won't be a practical problem that's encountered, or what people will or won't be confused about. And if it turns out we made a mistake, we can't easily correct it without breaking compatibility. We don't really know, and we can't really know without real-world experience. We don't really need to solve every possible detail here in one go; someone currently using quoted keys for their Navajo will have to continue using quoted keys a bit longer. So yeah, maybe just "release and see what happens"? It's certainly a bit of a downside/trade-off, but seems like a reasonable one, especially since the real-world experience will allow us to make a more informed decision later on, if need be. As for the ABNF: this doesn't fill me with joy either. But I also think it's unimportant; I feel people have been too focused on making the ABNF "nice", but the ABNF file is not an art project and should serve practical goals we have for TOML, one way or the other. 17 people have touched that file since 2014; it's slightly inconvenient for a small group of people. TOML will likely be around for decades, and "ABNF looks a bit ugly" is both minor and fixable: maybe ABNF gains Unicode next year, or we switch to NBNF which has Unicode, or whatever. For now, (I'm also fine with the comments that I had in my original proposal before I amended it after Pradyun's feedback; that was also discussed as an option before) |
This seems like a good middle-ground to me. It is very close to the original logic proposed in #687 so I already have lookups implemented in TOML++ in a very-nearly-conforming manner, so that's a plus. My only note:
ahem and at least one Australian 🇦🇺 😅 |
@arp242: Is औरंगाबाद really covered by the Other_Alphabetic property? I must say I'm a bit stumped by what Other_Alphabetic even is and what it's meant to be used for. My googling power has somehow failed me here. But I thought the gist file you posted contains all characters that have this property? If so, that would be insufficient – unless I'm blind, there aren't any DEVANAGARI marks there (DEVANAGARI SIGN ANUSVARA and DEVANAGARI VOWEL SIGN AA are two mentioned by @pradyunsg, doubtless there are others).
https://www.unicode.org/Public/14.0.0/ucd/PropList.txt (or the 9.0.0 link) is likely the best place to check whether things are in the list. IIRC, Devanagari relies on both Mn/Mc category characters and the relevant codepoints get the Other_Alphabetic property applied to them as well.
Looks like something went wrong copy/pasting in to that gist file; actually looks like pasting in gist is pretty broken in general because I can't get the damn thing fixed (leave it to the frontend people to break pasting text...), so I put it here: https://pastebin.com/p1ty4NXn I use my uni tool for this kind of stuff by the way; for example to show that Other_Alphabetic includes enough for औरंगाबाद:
(or use That pastebin is just the output of In general it's a reasonably handy frontend for the Unicode database, or at least it is for me, if you're the sort of person who likes commandline tools anyway. |
(with apologies for jumping in with that might be an obvious comment to those who have been discussing this for years with lots of built-up context) Why go through such contortions to define a bespoke identifier syntax, instead of using the default Unicode identifier syntax (essentially Is the only reason that the Unicode identifier set includes marks that may or may not be fully NFC-normalized (the Personally, I think that if TOML wants to support Unicode, it's not unreasonable to expect parsers to support Unicode as well. However, an approximation of NFC-validation by filtering out Also: if you want a minimal set of security-conscious characters, you could additionally filter the identifier set with the |
Short answer: yes, that's the main reason. See this discussion for context: #966. You may also wish to visit the initial issue and subsequent pull request that got us here. |
And #941 |
Thanks for the links. NFC validation doesn't seem to me to be as terrible as it's been made out to be in those previous discussions:
Sorry @wjordan but I disagree with you that "only 17kb of memory" is fine; for some embedded environments that's absolutely a deal-breaker (I maintain a C++ TOML library and many of my users are embedded). I'm not going to implement any form of normalisation. There were also other reasons that were unrelated to implementation difficulty, too. In a more meta sense, I should point out we've debated normalisation a great deal over the last year or so, and have only just reached a consensus of sorts. I very much hope the ship has sailed. |
"We want normalisation" is also something we could reconsider if we spec things so that only one form is allowed so that normalisation is never or extremely rarely needed. I have no plans for this, or expectations that we need to, it's just nice to have the option and keeping doors open is good, just in case we might reconsider things in 10 or 15 years or whatever. And I think you're making fine points @wjordan, it's just ... there's been a lot of discussions about this (which are still on-going), and they've been difficult at times, and I think people have become a bit tired of it all 😅 |
I understand you're all exhausted by normalization talk, but the stiff resistance to adopting any non-trivial To recap: There are 111 combining characters ( As pointed out already, these combining marks are required in some Guarani and Navajo expressions, but my related, broader concern is that this will certainly affect other languages, and we don't have any idea or estimate on that total impact. (What other languages/words out there require This concern would be avoided by including the missing combining characters (effectively adopting UAX31 identifiers), and doing the bit of extra work to validate that those combining characters are used in NFC-valid contexts. This would make the spec much simpler, and more aligned with programming languages that have already adopted similar identifiers. But if you've already closed off discussion on that point and need to avoid any non-trivial Unicode normal-form logic whatsoever in order to ensure 100% compatibility with minimal embedded implementations of TOML, my preference would be to back out Unicode support entirely (#979) rather than offer incomplete Unicode-identifier support that rejects valid identifiers in an unknown number of languages. |
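To make the "allow combining characters, but require NFC" idea concrete, here is a minimal sketch. It rests on assumptions: the function name `is_nfc_bare_key` is hypothetical, the allowed set is approximated as general categories L*, Nd, Mn and Mc plus ASCII `-` and `_` (not the exact UAX31 identifier definition), and Python's `unicodedata` does the normalisation work that an embedded parser would have to carry its own tables for.

```python
import unicodedata


def _allowed(ch):
    """Approximate allowed set: letters, decimal digits, combining marks, - and _."""
    if ch in "-_":
        return True
    category = unicodedata.category(ch)
    return category.startswith("L") or category in {"Nd", "Mn", "Mc"}


def is_nfc_bare_key(key):
    """Accept a key only if every character is allowed and it is already in NFC."""
    if not key or not all(_allowed(ch) for ch in key):
        return False
    # Validation only: the key is compared to its NFC form, never rewritten.
    return unicodedata.normalize("NFC", key) == key
```

For example, `is_nfc_bare_key("Sinn_F\u00e9in")` is true, the decomposed spelling `"Sinn_Fe\u0301in"` is rejected, and `"g\u0303"` is accepted because no precomposed form exists. The cost is the normalisation data itself, which is the implementation burden debated in the replies.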
While I appreciate the thought @arp242 has put into this proposal, I too see it as a step in the wrong direction compared to what we already have in the current main branch. If we move away from @abelbraaksma's simple and robust proposal towards a Unicode-category-based solution, we should do it properly, that is, as described in the first comment of #687. Hence with combining characters (categories Mc and Mn) allowed anywhere in a bare key (except as the first character), instead of the clever but incomplete hack to go for the Other_Alphabetic property instead.

If we go for the proposal here, we break Unicode, since Unicode was never meant to be usable without combining characters. They are an integral part of Unicode's concept of "letters". With this proposal, we couldn't say any more "now you can use arbitrary letters in bare keys, no matter what language", but only: "you can use some letters good for some languages", and if people then ask whether it's good enough for their language, we'll probably have to respond: "we don't know, ask your local Unicode experts, but the smaller your language is, the bigger the chances are that it's not".

I'm sorry, but that's just not good enough. We had promised to allow bare keys in arbitrary languages, now let's deliver. Especially since one very well thought-out solution is ready, developed by @abelbraaksma who I suspect knows more about Unicode than all the rest of us together.

Normalization is nothing we have to worry about, since we have already all but decided that we won't do it. @abelbraaksma's solution is very similar to what's allowed in XML names, and while XML doesn't normalize either (by default), in 25 years of XML history I haven't heard any complaints about that. JSON doesn't normalize either and still allows arbitrary strings as keys, and I haven't heard any complaints about that either. So, let's not break things out of worry about a problem that doesn't even exist in the first place.
The current main also "breaks" Unicode, just in a different way, with different failure modes and edge cases. I don't consider it any more or less "robust" than this; actually, I consider it much less robust as we won't be able to amend or fix things in future revisions, as it pretty much closes the door to that.

TOML is not XML; I don't think we can really draw any strong lessons from it, not without real-world experience anyway. In XML I have no idea if 1) people actually use it in the first place, and 2) how, and 3) if that worked out well for them. And even then: XML is much more of an interchange format between systems than a human-edited format like TOML.

Also remember the XML spec is from 2005 and that things have changed since then. This is why only a subset of symbols work: it goes to some effort to exclude the blocks that were defined in 2005, but the symbols and emojis from the SMP are all allowed (which wasn't in use in 2005 – also see: MySQL's infamous "utf8" support). The XML spec is showing its age; the authors couldn't have predicted the future so that's only to be expected, but we shouldn't copy aspects that are showing their age almost 20 years later (and we also don't know if it actually worked out well for XML authors using non-ASCII in the first place). If anything, I feel the lesson is that their approach comes with serious downsides, because none of us know exactly what Unicode 35 will look like.

In short, just remove the restriction (and allow much more) or update it to include all the new symbols. Leaving it in place half-arsed is the worst option. And as far as I'm concerned "allow everything" except syntax ( As I mentioned, I don't think there are any "perfect" choices necessarily, just "better" and "worse" trade-offs, but the current version is by far my least favourite trade-off and I feel almost any other option is better.
No one is using this feature, so it's not surprising there isn't an existing problem. I think the current approach has a lot of potential for problems, but of course I can't be sure about this: no one can, not without real-world experience. This is why I keep banging on about "we can always adjust later" as a major advantage of this. The author in #989 was right that we should just release and see what happens to get real-world feedback; it's just that with the current proposal we can't actually do that because once we release it we're more or less stuck with it (the only direction we can meaningfully go to is "allow almost everything"). |
You just say that, but without any evidence to back it up. We never promised to allow only letters in bare keys, hence we don't break anything if we also allow some other stuff.
Actually XML 1.0 is from 1998; I don't think the XML names definition has significantly changed since then. But we didn't just blindly copy it, there was a lot of discussion about whether this is the right approach and about getting the details right in #687 and #891, as you should remember too – after all, you participated in these discussions too.
I think that we might want to recheck and possibly revise things in TOML 1.2 (or maybe even later) is a valid concern. But I don't think we need to re-open the careful and slow consensus process (cf. the hundreds of comments that went into #687 and #891) because of that. Instead let's just add a sentence such as:
I propose to ship TOML 1.1 with this language (or something close to it). In this way, we keep the future open and could, if the need really arises, still switch to a Unicode-category-based implementation in TOML 1.2 or do some other adjustments, without breaking any SemVer promises. |
"This is actually really inconsistent" wasn't brought up. I brought it up later in #954. Normalisation also wasn't discussed at all. And sure, there were lots of comments, but I also just gave up commenting in #891 and unsubscribed. I figured "I think this is bad, but I can live with it, I guess", and people didn't seem especially receptive to actually questioning the entire approach (in particular after your comment, which I interpreted as "this is the approach we want to take, so stop saying it's bad"). And granted, I was late to the discussion, but people discussed things in #687 between Dec 2019 and July 2020. Surely "you should have commented in that 6 month window or forever be silent" can't be the way things go? Especially not if additional issues that were never even brought up are raised later? |
@arp242: I can understand you're frustrated since you opened #954 long ago and not much has happened there for a long time. However, I suggested a solution based on a proposal by @eksortso, and @abelbraaksma supported it too: #954 (comment). I can't remember hearing any objections, so I'd still say that's a good way to do it. Maybe I can prepare a PR for it one of these days. As for normalization, that affects bare and quoted keys in exactly the same way, so we need to address it anyway. We'd even have to address it if we had no bare keys at all, like JSON. And we have a solution ready to be adopted, the same one as used by JSON too. Other than that, what do you think of my proposal above to declare the relaxed rules for bare keys as experimental? That should address your main concern that we get stuck with something we might later want to revise, right? |
I agree, but few people use quoted keys so it's not really that much of a practical issue.
I'm not a huge fan to be honest. My issue is that for me, "compatibility" means "it doesn't break people's files". That it's marked as "experimental" is something most people won't see, so in practice, it will break files. And people don't update their implementations, dependencies, files, etc. right away, so it might be quite a while before we get meaningful feedback. So I would personally be very hesitant to revert any "experimental" feature, especially one so user-facing as this. I don't see myself being in favour of that, even if I think it's not a good feature, unless there really is an overwhelming number of problems with it. |
I haven't followed the whole discussion here, but I think the original post / PR is about using categories ( Furthermore, there have been instances where codepoints have been moved from one category to another, which causes yet other issues. "Be liberal in what you accept" was a prime goal when writing the original change that led to the inclusion of a wide range of international codepoints. There will always be characters that can be confusing, but that is a judgment we should leave to the writer of the TOML files. Who are we to decide what is and what is not a legible key character? Another argument for the way we did it was to be forward compatible (i.e., any unassigned Unicode code blocks are allowed by default; somewhere above someone claimed the opposite, but that's simply not true, unless there's a bug). Using any version with Letter/Digit categories is not going to be that. Plus, it will introduce conflicts between implementations. We tried to use the lessons learned from other standards that went through the same process and often regretted earlier decisions (i.e., allowing Letter/Digit and tying the standard to a minimal Unicode, introducing compatibility issues). Anyway, this is an open standard (well, just about) and if there is consensus to move in a different direction, I am certainly not going to stand in the way 😆. As the OP mentioned, there's no "perfect" solution here and each approach has up and downsides. |
I've just barely followed the discussion about what to allow for bare keys, except when it was first being discussed, and my only contribution was noting that adding what we had at the time would triple the length of That said, I do remember that we did lift a page from XML and allowed for a broad swath of code points that would satisfy an international user base. And we did reject using Unicode classes, because our standard would fluctuate a lot based on whichever version of Unicode we adhered to. But let's set all that aside for a bit. We have tools to automate ABNF generation, and it would fall upon us to use those tools consistently with every release if we decided to track Unicode classes. The use of those tools, then, must be standardized, if only for our own use. The output of those tools would allow us to release an addendum to the spec to describe which code points are allowed. The addendum wouldn't be used for well-formedness (we'll keep the ABNF minimal that way) but for validity. The new spec with addenda would require testing, and we would need to keep And if users hit upon something we missed or would cause problems, all these different processes and documents would need revisions. Well, hopefully not the processes. But we'd all need to keep on top of these things. Just some of my stray thoughts before I head into work. |
But while that would solve our own issue with creating the spec, it has a bunch of side effects:
Creating unicode ranges independent of any foreign versioning and whims would fix all of the above. There's a reason other standards went in that direction as well. No lock-in and no lock-step needed with existing versioning. |
So why bother to specify ranges at all then? This is a nice soundbite, but meaningless and doesn't address anything.
What "other standards" are these? What specifically have they come to regret? Because the only one you've ever mentioned is XML, which made a decision 25 years ago, when Unicode was in a rather different place, and almost everything else that I've been able to find uses some form of Unicode database. |
I agree. There is no inherent need for that, except for certain special characters. Which is in part why we include much more than just Letter/Digit. We've briefly looked into precisely this idea.
Sorry if you see it that way. To me it is a very important part. We should not try to be prescriptive. People can think for themselves. If I would encounter a Chinese TOML file I won't understand it. Neither would I if it was written in smileys. But that's not for us to decide. The whole idea with this is to be as liberal as we can be (see previous point).
At the time it was a long discussion in the W3C WG spanning multiple standardization committees. Indeed, this includes XML, but also HTML, SVG, XPath and iirc, several standards that aren't partially derived or dependent on XML (for instance, you see a similar decision in the HTTP standardization process: the more recent the standard, the more permissive the ranges, generally speaking).
Did you look at standards that have a lot of implementations, or did you look at languages, that often have only one or two implementations? Like C# or Java or Python? As I mentioned above, in the end, if there's consensus to go the Unicode Categories way, or similar, fine with me. I am just warning against it as I feel strongly that it is a step in the wrong direction. You didn't address issues with versioning, being future proof, and differences between implementations, or implementations that don't have access to a Unicode Database, or that want to be lean and mean (the C++ Unicode is many dozens of MBs in size iirc), and allowing Unicode to grow without having to update TOML, and how to deal with ranges that are currently unassigned that become assigned later. To me, these are insurmountable issues, unless you accept the downsides and just tell the folks to "live with it". Differences exist, just check each implementation's details. That's not the end of the world or anything, I just want us to make a (very) conscious decision on something that already has been discussed at length and (very) consciously been decided before. In my view, if you change a previous decision by 180 degrees, then there should be an even stronger argument in favor of it. Again, not trying to say "don't do it". Just trying to say "tread carefully, the path is treacherous..." ;). |
In my humble (and totally unbiased 😉 ) opinion my alternative PR #1002 (which extends the ranges just gently and improves the textual clarification) would be the best way to go. But leaving the ranges as they are in the current main branch should be acceptable as well (though I really would improve the wording in the written spec, as the main branch text is a bit misleading there). |
I already addressed all of this in the original PR message or the following discussion. This is like drawing blood from a stone. Vague references to standards running into problems using this approach. "So what specific problems then?" More vague references. Well okay then 🤷
And I did mention back then, and then too I was told "we already discussed this, so fuck off" in so many words. So I unsubscribed and shrugged. Some one else objected a few weeks ago. He was told to fuck off as well and hasn't returned since. Never mind some issues were not mentioned even once before the entire thing was merged (mainly normalisation). The sheer length and verbosity generated on this right from the start makes it hard for anyone to pitch in. So "discussed at length" means bugger all. |
I listed them above, and then one by one in summary, and they are present in this PR. The part from the standards committees was very explicit. They locked themselves into Unicode versions, and it became impossible to create a new version of the standard without becoming backwards incompatible. We decided before to have a simple range, not locking into Letter/Digit definitions as they are way too complex (you added several hundred lines to the ABNF). If we go this way, we should do it properly:
Do not lock your own standard with another standard. If you do need to reference a version, make sure it is forward compatible (i.e., think of what would happen if you never update TOML anymore, are we screwed, or are we OK?). If there really is a good reason to go this more complex approach, harder to implement, but we can swallow that pill, I guess, then please, find a dynamic approach and let's do it properly. But realize that this is the second time we try that and I feel very much that this discussion is going in the exact same direction as back then. That is not a critique, open standards have a way of doing that from time to time.
I doubt I ever said anything to that effect. I usually link to the places where something is already discussed if I make such a claim, but it is a lot of work. If I did make you unsubscribe, I apologize. When something takes over a year to implement, there is bound to be a lot of discussion. I doubt I remember it all now. Pity you unsubscribed. If I did that every time I didn't get it my way, there wouldn't be a repository left that I could participate in. Glad you came back, thanks for that!
Well, that's not nice. I have missed that entirely. But we should try to stay polite to one another. We all strive toward one goal: ensuring that TOML remains the brilliant, minimalistic standard it is.
Please refrain from offensive language. I'm really trying here, we're all volunteers and doing this for the benefit of the community. I was not referring to this PR, but to the previous one, where we had several moments with voting and recollecting our thoughts by summarizing them for everyone involved. We really went deep into the subject and researched several options. If I missed anything in this long discussion, I apologize. But I have not found an argument that convinces me to go to a sub-optimal solution. It may be here, but I did not catch it. |
No you didn't. You listed vague statements and platitudes. "People have come to regret it" is a worthless argument without knowing why specifically, what problems people reported, etc. Mindless citations of Postel's "law" are even more worthless (aka "Postel's thing he said 40 years ago that may or may not be applicable to some scenarios").
If you find yourself disagreeing all the time then maybe that's something you need to look at, because in my 25 years of participation strong disagreements have been exceedingly rare. But you know what? I'm done here for now. Merge whatever atrociously broken faux-Unicode nonsense you want. The ONLY way this will not cause problems is if no one uses it.
Pretending to be offended by a statement like that is some bad faith nonsense, especially since it conveniently allows you to ignore why I said that. This has been a recurring theme; just ignore the inconvenient and continue as if it wasn't said. No wonder almost no one actually working on TOML is participating here any more and it's been hijacked with random people from the internet with Very Strong Opinions™. |
I don't want to get involved with this proposal. I simply have no time. But I've been involved with the TOML project in general for at least seven years, and @arp242 you still somehow suggest that I'm just one of those random opinionated people you're railing against. That is not fair, and it's not good. You've been doing great work on As I've suggested elsewhere, the alternative to this needless topic churn is an "enhancement proposal"-style document (like a Python PEP or Rust RFC) to keep down the bad feelings and raise up good choices. Especially if nobody can keep track of all the arguments one way or another. (I sure can't, but I would like to get into our history.) Doing this means compiling arguments and contexts that have already been stated, even those you disagree with, but the compassion would serve not just the standard project but also your own line of argument. If anyone wants to start setting precedents for these things, do please share your work with us. Once I finish moving back to my hometown at the end of the year, I think I will begin this type of work, using the TOML wiki. I'll compile everyone's very strong or barely mentioned opinions on many topics, while treating my own opinions as no more special than anyone else's. I won't address bare-key characters just yet; I'll start small, with e.g. what characters to allow in comments (#996, #924, and earlier). After this, then I'll participate more, and try to put this particular proposal in context, so that we have minimal and obvious documents explaining ourselves objectively that we can share with each other and with random outsiders. Meanwhile, I beg you and @abelbraaksma and everyone I've worked with in years past to:
|
This is veering hugely off-topic, but I definitely think what TOML needs is less arguing from ivory towers and more firing up of editors and IDEs and actually writing stuff. In the end arguing isn't actually worth anything. As it stands the spec is just broken as it includes an example with an emoji, which just doesn't work for the general case. That it's nonetheless included as an example demonstrates that y'all didn't even understand your own spec. And you really didn't need me to tell you that. You could have found that out yourself if you had actually bothered writing test cases. Again: more firing up of editors. All of this has been a recurring theme on a number of issues now. When I actually did the work of implementing duration and file size suffixes as a prototype and found it didn't actually work for reasons that were never mentioned by anyone, your response was "we discussed this at length already". Seriously...? All of that was completely worthless because the real important issues were not discovered until someone actually did the real work instead of bikeshedding about details. So yeah, turns out you do need to actually work on things to know where the problems are and what does and doesn't work, and while in principle I'm willing to listen to anyone, when discussions are dominated by people who don't actually do anything other than argue here (not even writing a test case!) then I do think there's a bit of an issue... |
Yeah, nobody is perfect at first attempt, and though that's not exactly a bug, I agree it's confusing and should be removed as an example – as I also do in #1002. But I also think the most promising way forward is to straighten out such little irregularities if one finds them, rather than throwing more or less all prior work away and starting again from scratch, as this PR advocates. Well, there may be rare cases where starting from scratch is indeed best, but I'm pretty unconvinced that this is one of them. Especially since the solution you propose would be way more complicated than the one we already have, and it wouldn't solve any real-life problems that the existing solution doesn't solve just as well. Regarding the case of the duration and file size suffixes: I wasn't much involved with that, so I don't know any details, but it seems a very different case, since that one has so far remained in the exploration and discussion phase and is not scheduled to make it into TOML 1.1. For more comprehensive bare keys, on the other hand, we have a working solution that has been merged and is ready to be shipped. (Though I agree that there is still some room for improvement, as I suggest to do in #1002 – but that's the way of incremental progress rather than a radical break.) |
I believe this would greatly improve things and solves all the issues, mostly. It's a bit more complex, but not overly so, and can be implemented without a Unicode library without too much effort. It offers a good middle ground, IMHO.
I don't think there are ANY perfect solutions here and that anything will be a trade-off. That said, I do believe some trade-offs are better than others, and I've made it no secret that I feel the current trade-off is a bad one. After looking at a bunch of different options I believe this is by far the best path for TOML.
Advantages:
This is what I would consider the "minimal set" of characters we need to add for reasonable international support, meaning we can't really make a mistake with this by accidentally allowing too much.
We can add new ranges in TOML 1.2 (or even change the entire approach, although I'd be very surprised if we need to), based on actual real-world feedback, but any approach we will take will need to include letters and digits from all scripts.
This is a strong argument in favour of this and a huge improvement: we can't really do anything wrong here in a way that we can't correct later, unlike what we have now, which is "well I think it probably won't cause any problems, based on what these 5 European/American guys think, but if it does: we won't be able to correct it".
Being conservative for these types of things is good!
This solves the normalisation issues, since combining characters are no longer allowed in bare keys, so it becomes a moot point.
For quoted keys normalisation is mostly a non-issue because few people use them, which is why this has gone largely unnoticed and undiscussed before the "Unicode in bare keys" PR was merged.[1]
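To make the normalisation point concrete, here is a minimal Python sketch. It uses the `Sinn_Féin` key from the footnote as sample data and the standard `unicodedata` module; the variable and function names are made up for illustration. It shows that the same visible key can be two different code point sequences, and that the decomposed form contains a combining mark (general category `Mn`), which is not a letter or digit and so would not be accepted in a bare key under a letters-and-digits rule.

```python
import unicodedata

# Two spellings of the same visible key (sample data from the footnote).
composed = "Sinn_F\u00e9in"     # é as a single code point, U+00E9
decomposed = "Sinn_Fe\u0301in"  # e followed by COMBINING ACUTE ACCENT, U+0301


def categories(text):
    """Return the Unicode general category of every character in text."""
    return [unicodedata.category(ch) for ch in text]


# The two strings differ when compared code point by code point...
same_raw = composed == decomposed
# ...but are equal once both are normalised to NFC.
same_after_nfc = unicodedata.normalize("NFC", decomposed) == composed
# Only the decomposed form contains a Mark (category starting with "M").
decomposed_has_mark = any(c.startswith("M") for c in categories(decomposed))
composed_has_mark = any(c.startswith("M") for c in categories(composed))
```

Quoted keys can still contain either form, so this only illustrates why the question goes away for bare keys specifically.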
It's consistent in what we allow: no "this character is allowed, but this very similar other thing isn't, what gives?!"
Note that Not all emojis work as bare keys #954 was NOT about "I want all emojis to work" per se, but "this character works fine, but this very similar doesn't". This shows up in a number of things aside from emojis:
"Some punctuation is allowed but some isn't" is hard to explain, and also not what the specification says: "Punctuation, spaces, arrows, box drawing and private use characters are not allowed." In reality, a lot of punctuation IS allowed, but not all (especially outside of the Latin character range by the way, which shows the Euro/US bias in how it's written).
People don't read specifications in great detail, nor should they. People try something and see if it works. Now it seems to work on first approximation, and then (possibly months or years later) it seems to "suddenly break". From the user's perspective this seems like a bug in the TOML parser, but it's not: it's a bug in the specification. It should either allow everything or nothing. This in-between is confusing and horrible.
There is no good way to communicate this other than "these codepoints, which cover most of what you'd write in a sentence, except when it doesn't".
In contrast, "we allow letters and digits" is simple to spec, simple to communicate, and should have a minimum potential for confusion. The current spec disallows some things seemingly almost arbitrarily while allowing other very similar characters.
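As a rough illustration of how short the "letters and digits" rule is to state in code, here is a Python sketch. It rests on assumptions that are mine, not the PR's wording: "Letter" is taken to mean any general category starting with `L`, "Digit" is taken to mean `Nd`, and `-` and `_` stay allowed as in TOML 1.0. The function names are hypothetical, and Python's `unicodedata` uses whatever Unicode version ships with the interpreter rather than a pinned one.

```python
import unicodedata


def is_bare_key_char(ch):
    """Hypothetical check: ASCII '-' and '_', or a Unicode letter or digit.

    Assumes "Letter" means general category L* and "Digit" means Nd.
    """
    if ch in "-_":
        return True
    category = unicodedata.category(ch)
    return category.startswith("L") or category == "Nd"


def is_bare_key(key):
    """A bare key must be non-empty and consist only of allowed characters."""
    return len(key) > 0 and all(is_bare_key_char(ch) for ch in key)
```

With this sketch, `is_bare_key("Sinn_Féin")` (precomposed é) and `is_bare_key("ключ")` are accepted, while `is_bare_key("a.b")`, an emoji, or the Armenian apostrophe (U+055A, category `Po`) are rejected.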
This avoids a long list of confusable special TOML characters; some were mentioned above but there are many more:
Is this a big problem? I guess it depends; I can certainly imagine an Armenian speaker accidentally leaving an Armenian apostrophe.
Confusables is also an issue with different scripts (Latin and Cyrillic is well-known), but this is less of an issue since it's not syntax, and also something that's fundamentally unavoidable in any multi-script environment.
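The Latin/Cyrillic case can be shown in a few lines of Python (a small sketch using only the standard `unicodedata` module; the names `describe`, `latin_a` and `cyrillic_a` are just for illustration). Both characters are letters, they look nearly identical in most fonts, and no normalisation form maps one onto the other, which is why this kind of confusable can't be specified away.

```python
import unicodedata

latin_a = "a"          # U+0061
cyrillic_a = "\u0430"  # U+0430, renders almost identically in most fonts


def describe(ch):
    """Return (code point, Unicode name, general category) for one character."""
    return (f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.category(ch))


latin_info = describe(latin_a)
cyrillic_info = describe(cyrillic_a)

# Even compatibility normalisation (NFKC) keeps the two scripts distinct.
still_distinct = (
    unicodedata.normalize("NFKC", latin_a)
    != unicodedata.normalize("NFKC", cyrillic_a)
)
```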
Maps closer to identifiers in more (though not all) languages. We discussed whether TOML keys are "strings" or "identifiers" last week in Clarify that key uniqueness depends only on binary representation, recommend normalization #966 and while views differ (mostly because they're both) it seems to me that making it map closer is better. This is a minor issue, but it's nice.
That does not mean it's perfect; as I mentioned all solutions come with a trade-off. The ones made here are:
The biggest issue by far is that the check to see if a character is valid may become more complex for some languages and environments that can't rely on a Unicode database being present.
However, implementing this check is trivial logic-wise: it just needs to loop over every character and check if it's in a range table. You already need this with TOML 1.0, it's just that the range tables become larger.
The downside is it needs a somewhat large-ish "allowed characters" table with 716 start/stop ranges, which is not ideal, but entirely doable and easily auto-generated. It's ~164 lines hard-wrapped at column 80 (or ~111 lines hard-wrapped at col 120). tomlc99 is 2,387 lines, so that seems within the limits of reason (actually, reading through the tomlc99 code adding multibyte support at all will be the harder part, with this range table being a minor part).
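For implementations without a Unicode database, the check described above could look roughly like this Python sketch. The table here is a tiny hand-picked excerpt for illustration only, not the 716-range table from the proposal; a real one would be auto-generated from the Unicode data files. `RANGES`, `STARTS` and `in_ranges` are hypothetical names.

```python
from bisect import bisect_right

# Illustrative excerpt of a sorted, non-overlapping (start, stop) table.
# A real table would be generated, not written by hand.
RANGES = [
    (0x30, 0x39),   # ASCII digits 0-9
    (0x41, 0x5A),   # ASCII A-Z
    (0x61, 0x7A),   # ASCII a-z
    (0xC0, 0xD6),   # Latin-1 letters up to Ö (U+00D7 is the × sign)
    (0xD8, 0xF6),   # Latin-1 letters up to ö (U+00F7 is the ÷ sign)
    (0xF8, 0x2C1),  # ø through the first spacing modifier letters
]
STARTS = [start for start, _ in RANGES]


def in_ranges(code_point):
    """Binary-search the table for a range containing code_point."""
    # Index of the last range whose start is <= code_point, or -1 if none.
    i = bisect_right(STARTS, code_point) - 1
    return i >= 0 and code_point <= RANGES[i][1]
```

A lookup is O(log n) in the number of ranges, so even several hundred ranges cost only about ten comparisons per character; a plain linear scan over the table would also work for small inputs.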
There's a new Unicode version roughly every year or so, and the way it's written now means it's "locked" to Unicode 9 or, optionally, a later version. This is probably fine: Apple's APFS filesystem (which does normalisation) is "locked" to Unicode 9.0; HFS+ was Unicode 3.2. Go is Unicode 8.0. etc. I don't think this is really much of an issue in practice.
I chose Unicode 9 as everyone supports this; I went back and forth on it for a long time, and we can also use a more recent version. I feel this gives us a nice balance between reasonable interoperability and future-proofing.
ABNF doesn't support Unicode. This is a tooling issue, and in my opinion the tooling should adjust to how we want TOML to look, rather than adjusting TOML to what tooling supports. AFAIK no one uses the ABNF directly in code, and it's merely "informational".
I'm not happy with this, but personally I think this should be a non-issue when considering what to do here. We're not the only people running in to this limitation, and it is really something that IETF should address in a new RFC or something ("Extra Augmented BNF"?)
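For implementations that do lean on a runtime's Unicode tables, it is easy to see which Unicode version those tables come from. A small Python sketch, where `MIN_UNICODE` and `runtime_unicode_version` are hypothetical names and the 9.0.0 baseline is the one named above:

```python
import unicodedata

# Baseline named in this proposal; later versions are also acceptable.
MIN_UNICODE = (9, 0, 0)


def runtime_unicode_version():
    """Parse the interpreter's Unicode database version, e.g. '15.0.0'."""
    return tuple(int(part) for part in unicodedata.unidata_version.split("."))


# True when the runtime tables are at least as new as the baseline.
meets_baseline = runtime_unicode_version() >= MIN_UNICODE
```

Two parsers built on different runtimes can therefore disagree about characters assigned after the baseline, which is the interoperability trade-off being discussed here.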
Another solution I tried is restricting the code ranges; I twice tried to do this (with some months in-between) and spent a long time looking at Unicode blocks and ranges, and I found this impractical: we'll end up with a long list which isn't all that different from what this proposal adds.
Fixes #954
Fixes #966
Fixes #979
Ref #687
Ref #891
Ref #941
[1]:
Aside: I encountered this just the other day as I created a TOML file with all UK election results since 1945, which looks like:
That should be Sinn_Féin, but "Sinn_Féin" seemed ugly, so I just wrote it as Sinn_Fein. This is what most people seem to do.