Zig source encoding #663

Closed
thejoshwolfe opened this issue Dec 23, 2017 · 7 comments
Labels: accepted (This proposal is planned.), proposal (This issue suggests modifications. If it also has the "accepted" label then it is planned.)

@thejoshwolfe
Contributor

This issue exists to document the rationale for Zig's source encoding. The rules below will be added to the docs, but the rationale discussion will be linked from the docs to here.

Discussion

Goals

  1. Have some kind of unicode support.
  2. It's acceptable for zig source code to be difficult to validate (e.g. by the compiler).
  3. Once validated, zig source code should be easy to consume (e.g. analyzed by tools and displayed by editors).

We want some kind of unicode support

We want to support unicode in some contexts in Zig, such as string literals:

// looks good
print("Сделайте выбор.\n");

So we don't want to force all bytes of a zig source file to be ascii:

// this is so unreadable, it's unacceptable
print("\xd0\xa1\xd0\xb4\xd0\xb5\xd0\xbb\xd0\xb0\xd0\xb9\xd1\x82\xd0\xb5 \xd0\xb2\xd1\x8b\xd0\xb1\xd0\xbe\xd1\x80.\n");

Each rule in Zig's grammar is either defined with a character whitelist accepting only specific ascii characters (e.g. [0-9A-Za-z_] used in identifiers) or with a character blacklist accepting any character except for the terminator/escape characters (e.g. //.*?\n for comments). Here are the contexts where any character is allowed (using # as a placeholder for the characters):

  • character literal: '#'
  • string literal: "#", c"#", \\#, c\\#
  • comment: //#

It's tempting to simply allow any byte value in those contexts while searching for the terminator. This allows utf8 in string literals and is easy to support in the compiler. This isn't as robust as providing a unicode string type, but it works well enough for some use cases, like the print example above.
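To make the byte-oriented approach concrete, here is a minimal sketch of scanning a // comment by looking only for the terminating byte (this is not the actual tokenizer, and scanLineComment is a made-up name for this example):

// Scan a line comment by looking only for the terminator byte. Every other
// byte, including any utf8 continuation bytes, is skipped over untouched.
fn scanLineComment(source: []const u8, start: usize) usize {
    var i = start; // index of the first '/' of "//"
    while (i < source.len and source[i] != '\n') : (i += 1) {}
    return i; // index of the terminating '\n', or source.len at end of file
}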

The problem with turning a blind eye to Unicode

If we want an editor to display the print example above in the intended way, then editors really need to interpret the zig source file as utf8. Additionally, it's very natural in many programming environments (e.g. Python 3, Node JavaScript) to read a file as a string rather than as bytes, and the obvious encoding to reach for is utf8.

If the zig compiler simply tolerates any bytes where utf8 might be, then it's possible to have "correct" zig code with invalid utf8 sequences. This corner case will have undesirable consequences for naive consumers of the zig source, such as throwing an exception or crashing when simply trying to read the file as a string. If invalid utf8 sequences are valid zig, then zig really isn't utf8 compatible, which is an awkward situation for bytes-to-string conversion; there'd be no correct way to convert zig source bytes to a string.
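To illustrate the failure mode (a sketch; the std.unicode names below assume a recent Zig version and have shifted between releases): a consumer that insists on decoding the whole file as utf8 fails on exactly the bytes a byte-oriented tokenizer would tolerate.

const std = @import("std");

test "invalid utf8 trips a strict consumer" {
    // A byte-oriented scanner would skip right past the 0xff inside this comment,
    // but decoding the whole file as utf8 fails.
    const source = "//\xff\n";
    try std.testing.expectError(error.InvalidUtf8, std.unicode.Utf8View.init(source));
}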

We want valid zig source to be easy to consume, and we want to support unicode in some way, therefore zig source code shall be encoded in utf8.

Zig source is UTF-8 encoded

It is a compile error for zig source to contain invalid utf8 byte sequences. There are plenty of examples of these: "//\xff", "//\x80", "//\xc2", "//\xc0\x80", "//\xc2\x00", etc.
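All of these can be rejected with a plain utf8 validity check. A sketch (std.unicode.utf8ValidateSlice is the current std name; it may differ in other Zig versions):

const std = @import("std");

test "the malformed examples above are rejected" {
    const bad = [_][]const u8{ "//\xff", "//\x80", "//\xc2", "//\xc0\x80", "//\xc2\x00" };
    for (bad) |src| {
        // None of these byte strings is a valid utf8 sequence.
        try std.testing.expect(!std.unicode.utf8ValidateSlice(src));
    }
}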

Although zig source code is technically in unicode, this does not mean that zig grammar allows non-ascii unicode outside the "blacklist character" contexts outlined above. You cannot have identifiers in Russian, nor can you use NBSP to format your code. Outside of string literals and comments, it's always an error to have a byte value greater than 0x7f (this is discussed more below).

Line endings are important

Comments and multiline string tokens are terminated by the end of the line. Knowing where lines end is critical to understanding zig source code.

In zig, all lines are terminated by an LF character, "\n". It is an error for zig source to contain CR characters. This suits the goal of making valid zig source easy to consume. You can either simply look for "\n", or you can use a general-purpose regex like \r\n?|\n. Either one will work, because all the complex alternatives to "\n" are guaranteed to not be present in the source.
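For example, a line-oriented consumer of validated source only needs a single-byte split. A sketch (forEachLine is a made-up name; std.mem.splitScalar is the current std spelling, and older releases spelled it differently):

const std = @import("std");

// Because CR is guaranteed absent, splitting on the single byte '\n' enumerates
// every line of already-validated zig source.
fn forEachLine(source: []const u8) void {
    var it = std.mem.splitScalar(u8, source, '\n');
    while (it.next()) |line| {
        _ = line; // ... process one line ...
    }
}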

But we can't stop there. Visual Studio recognizes even more variations on line ending style:

  • CRLF: Carriage return + line feed, Unicode characters 000D + 000A
  • LF: Line feed, Unicode character 000A
  • NEL: Next line, Unicode character 0085
  • LS: Line separator, Unicode character 2028
  • PS: Paragraph separator, Unicode character 2029

If NEL, LS, and PS are allowed to show up in zig comments without terminating the comment, then we've got a weird corner case for anyone making a Visual Studio plugin for Zig syntax.

Therefore, we impose an additional restriction: valid zig source must not contain the NEL, LS, or PS code points. These characters are encoded in multiple bytes, so this adds complexity to zig source validators. However, this complexity is justified, because it makes valid zig source easier to consume (remember the goals above).
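For reference, these are the utf8 byte sequences a validator has to watch for, along with a minimal codepoint-level check (isBannedLineBreak is a made-up name for this sketch):

// NEL (U+0085) = 0xC2 0x85
// LS  (U+2028) = 0xE2 0x80 0xA8
// PS  (U+2029) = 0xE2 0x80 0xA9
// All three are multi-byte in utf8, so this check runs on decoded codepoints.
fn isBannedLineBreak(cp: u21) bool {
    return cp == 0x0085 or cp == 0x2028 or cp == 0x2029;
}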

Ascii control characters are mostly no good

Control characters '\x00' through '\x1f' and '\x7f' are mostly useless. The only control character zig recognizes is '\x0a', a.k.a. '\n', which is always and only the line terminator. All the other control characters have either superfluous (CR), confusing (BS), inconsistent (VT), or otherwise obsolete (ENQ) behavior, and they are all banned everywhere in zig source code. (For the debate on windows line endings and hard tabs in zig, see #544.)
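The corresponding byte-level check is trivial. A sketch (isBannedControlByte is a made-up name):

// '\x0a' ('\n') is the only control character allowed anywhere in zig source.
fn isBannedControlByte(byte: u8) bool {
    return (byte <= 0x1f and byte != '\n') or byte == 0x7f;
}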

Other crazy unicode stuff isn't as important

There's a huge amount of weird stuff you can do with unicode, like right-to-left text, zero-width characters, and the poop emoji. Although Zig does want to be a readable language, there's a limit to how much we can enforce when it comes to obscure unicode craziness. You're going to be able to make pretty obfuscated unicode string literals if you try, and zig isn't going to try to stop that. The important thing is that the unicode doesn't interfere with the interpretation of zig's grammar.

If some unicode craziness is found that zig allows that confuses naive editors or analysis tools, then we should consider imposing additional restrictions for the sake of keeping zig easy to consume.

The rules

  • It is an error for zig source to contain invalid utf8 byte sequences.
  • For every codepoint of zig source code, it is an error for the codepoint to be one of U+0000-U+0009, U+000b-U+001f, U+007f, U+0085, U+2028, U+2029
  • It is an error for character literals to contain any source codepoint > U+007f. (e.g. 'й')

Note: From the above rules, and from the zig grammar, it follows that:

  • Outside of a string literal or comment, it is an error for any codepoint to be > U+007f.
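Putting the first two rules together, a whole-file validator could look like the following sketch. This is not the compiler's implementation; the character-literal rule is a grammar-level check and is omitted here, and the std.unicode names assume a recent Zig version.

const std = @import("std");

fn isValidZigSourceEncoding(source: []const u8) bool {
    // Rule 1: the bytes must form valid utf8.
    const view = std.unicode.Utf8View.init(source) catch return false;
    // Rule 2: no banned codepoints anywhere in the file.
    var it = view.iterator();
    while (it.nextCodepoint()) |cp| {
        switch (cp) {
            0x0000...0x0009, 0x000b...0x001f, 0x007f, 0x0085, 0x2028, 0x2029 => return false,
            else => {},
        }
    }
    return true;
}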

Implications for consumers

If you have zig source that you know is valid, you can trust that:

  • The source is valid utf8, or you can interpret it simply as bytes. Every character that is significant to zig's grammar is a single-byte character.
  • Every line is terminated by an LF character; you can look only for "\n", or you can use a generic line-ending parser.
  • String literals and comments might contain non-ascii unicode characters, but you can ignore them, either as entire code points or as individual bytes, when scanning for terminators and escape sequences.
  • Whitespace between tokens is ignored. When looking for "whitespace", you can just check for " " and "\n" (a minimal check is sketched below), or you can use a generic whitespace scanner that checks for "\r", "\t", and the 25 unicode whitespace characters.
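A minimal whitespace check for already-validated source (isZigWhitespace is a made-up name):

// Between tokens of valid zig source, only ' ' and '\n' can appear as whitespace.
fn isZigWhitespace(byte: u8) bool {
    return byte == ' ' or byte == '\n';
}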
@andrewrk andrewrk added this to the 0.3.0 milestone Dec 24, 2017
@andrewrk andrewrk added the accepted and proposal labels Dec 24, 2017
@andrewrk andrewrk modified the milestones: 0.3.0, 0.4.0 Feb 28, 2018
@milkowski

Isn't forbidding the CRLF line-ending style too restrictive for common use? I mean, it will be a source of weirdness in the Windows world, and even if you conscientiously use LF in your editor, it may still cause problems with some kinds of automated source processing. I know of no other popular language with such a restriction.

@thejoshwolfe
Contributor Author

CRLF

See the discussion in #544.

@andrewrk
Member

Is this done? What are the action items to resolve this issue?

@thejoshwolfe
Contributor Author

This is done in self hosted. I think that's good enough.

@vi

vi commented Sep 23, 2019

Maybe CRLF should be allowed in --sloppy dev-only unpublishable mode?

@Serentty

Non-ASCII identifiers are a very important feature to me. For code which isn't meant to be published for an English-speaking audience, I regularly use identifiers which can't be represented in ASCII. The current “solution” of enforcing only ASCII in identifiers is very anglocentric.

@thejoshwolfe
Contributor Author

The current “solution” of enforcing only ASCII in identifiers is very anglocentric.

@Serentty That's a very legitimate point. I've opened #3947 to discuss it. Your input there would be appreciated.

andrewrk added a commit that referenced this issue Jul 31, 2024
I pointed a fuzzer at the tokenizer and it crashed immediately. Upon
inspection, I was dissatisfied with the implementation. This commit
removes several mechanisms:
* Removes the "invalid byte" compile error note.
* Dramatically simplifies tokenizer recovery by making recovery always
  occur at newlines, and never otherwise.
* Removes UTF-8 validation.
* Moves some character validation logic to `std.zig.parseCharLiteral`.

Removing UTF-8 validation is a regression of #663, however, the existing
implementation was already buggy. When adding this functionality back,
it must be fuzz-tested while checking the property that it matches an
independent Unicode validation implementation on the same file. While
we're at it, fuzzing should check the other properties of that proposal,
such as no ASCII control characters existing inside the source code.

Other changes included in this commit:

* Deprecate `std.unicode.utf8Decode` and its WTF-8 counterpart. This
  function has an awkward API that is too easy to misuse.
* Make `utf8Decode2` and friends use arrays as parameters, eliminating a
  runtime assertion in favor of using the type system.

After this commit, the crash found by fuzzing, which was
"\x07\xd5\x80\xc3=o\xda|a\xfc{\x9a\xec\x91\xdf\x0f\\\x1a^\xbe;\x8c\xbf\xee\xea"
no longer causes a crash. However, I did not feel the need to add this
test case because the simplified logic eradicates most crashes of this
nature.