-
Notifications
You must be signed in to change notification settings - Fork 454
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[spec/interpreter/test] Align definition of newline with Unicode reco… #1684
Conversation
I propose not recognizing NEL (U+85) as a newline character. It is not recognized as a newline in popular text formats, tools, or editors, so it would be inconvenient to work with if files using NEL newlines were in circulation. An incomplete list of popular text formats that don't recognize NEL: |
Yeah, I'm not feeling strongly. I'd assume there is value in following the official Unicode recommendation, even though NEL hardly matters in practice. I believe the motivation for that recommendation is maximising compatibility with legacy character sets, but I'd be fine limiting that concern to ASCII. That said, there are various other codepoints in Unicode that represent forms of line breaks, and there likely is no consistency in how editors and tools treat those. So it's doubtful if that form of alignment is even possible, independent of NEL. |
Yes, the situations with plain-CR, LS, and PS are complex, and there's no way to avoid some problems no matter what we do here. But, we can still avoid other problems, which I propose we do here. |
Do NEL characters ever appear in practice? If not, following unicode would not cause any additional problems in practice and we would get the benefits of just doing what unicode recommends (at least for this particular sliver of the design space). |
I'm not aware of any practical benefits of following this particular Unicode recommendation here. NEL's only known use in practice is in text converted from EBCDIC. My understanding is that Unicode added the recommendation about NEL for the benefit of interoperability with EBCDIC documents. I don't believe this is relevant to Wasm. Additional examples of things that don't interpret U+85 (NEL) as a newline include JS, CSS, HTML, Python, and many more. |
I do think that being able to offload all decision making in this area to unicode is an advantage, but I'll admit it's a small one. I guess being consistent with so many existing text formats is probably a larger advantage, so not treating NEL as a newline sounds fine to me. |
Okay, I don't care enough, so I just removed NEL. PTAL. |
\unicode{09} ~|~ \unicode{0A} ~|~ \unicode{0D} \\ | ||
\Tnewline ~|~ \unicode{09} \\ | ||
\production{newline} & \Tnewline &::=& | ||
\unicode{0A} ~|~ \unicode{0D} ~|~ \unicode{0D}~\unicode{0A} \\ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we got rid of the last case here, then it would parse as two newlines in a row, which I don't think would break anything. Is it worth making that simplification, or is it better to keep that case for consistency with the standard definition of newline?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed, currently that should not observably change anything (though an implementation taking that literally would report incorrect line numbers). But yeah, I think it is preferable to use the standard definition for clarity.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting and unfortunate that git and GitHub treat this as a binary file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I assume their heuristic is based on the appearance of 0x00 and other control characters.
So far, the Wasm text format only recognises LF as a newline. This PR tweaks that definition to align with Section 5.8 of the Unicode standard, which recommends treating CR, LF, CRLF, and NEL as equivalent newlines in text files, in order to be agnostic to newline translations performed by various tools and conversions between OSes and character sets.
This only affects the interpretation of line comments, where previously, code like
would (surprisingly) treat the instruction as part of the comment, but no longer does now. While theoretically a breaking change, it shouldn't affect any well-formatted code. It hence seems okay to consider this a bug fix.