Zig source encoding #663

thejoshwolfe · 2017-12-23T23:46:35Z

This issue exists to document the rationale for Zig's source encoding. The rules below will be added to the docs, but the rationale discussion will be linked from the docs to here.

Discussion

Goals

Have some kind of unicode support.
It's acceptable for zig source code to be difficult to validate (e.g. by the compiler).
Once validated, zig source code should be easy to consume (e.g. analyzed by tools and displayed by editors).

We want some kind of unicode support

We want to support unicode in some contexts in Zig, such as string literals:

// looks good
print("Сделайте выбор.\n");

So we don't want to force all bytes of a zig source file to be ascii:

// this is so unreadable, it's unacceptable
print("\xd0\xa1\xd0\xb4\xd0\xb5\xd0\xbb\xd0\xb0\xd0\xb9\xd1\x82\xd0\xb5 \xd0\xb2\xd1\x8b\xd0\xb1\xd0\xbe\xd1\x80.\n");

Each rule in Zig's grammar is either defined with a character whitelist accepting only specific ascii characters (e.g. [0-9A-Za-z_] used in identifiers) or with a character blacklist accepting any character except for the terminator/escape characters (e.g. //.*?\n for comments). Here are the contexts where any character is allowed (using # as a placeholder for the characters):

character literal: '#'
string literal: "#", c"#", \\#, c\\#
comment: //#

It's tempting to simply allow any byte value in those contexts while searching for the terminator. This allows utf8 in string literals and is easy to support in the compiler. This isn't as robust as providing a unicode string type, but it works well enough for some usecases, like the print example above.

The problem with turning a blind eye to Unicode

If we want an editor to display the print example above the intended way, then editors really need to be interpreting the zig source file as utf8. Additionally, it's very natural in many programming environments (e.g. Python 3, Node JavaScript) to read a file as a string rather than as bytes, and the obvious encoding to reach for is utf8.

If the zig compiler simply tolerates any bytes where utf8 might be, then it's possible to have "correct" zig code with invalid utf8 sequences. This corner case will have undesirable consequences for naive consumers of the zig source, such as throwing an exception or crashing when simply trying to read the file as a string. If invalid utf8 sequences are valid zig, then zig really isn't utf8 compatible, which is an awkward situation for bytes-to-string conversion; there'd be no correct way to convert zig source bytes to a string.

We want valid zig source to be easy to consume, and we want to support unicode in some way, therefore zig source code shall be encoded in utf8.

Zig source is UTF-8 encoded

It is a compile error for zig source to contain invalid utf8 byte sequences. There are plenty of examples of these: "//\xff", "//\x80", "//\xc2", "//\xc0\x80", "//\xc2\x00", etc.

Although zig source code is technically in unicode, this not mean that zig grammar allows non-ascii unicode outside the "blacklist character" contexts outlined above. You cannot have identifiers in Russian, nor can you use NBSP to format your code. Outside of string literals and comments, it's always an error to have a byte value greater than 0x7f (this is discussed more below.)

Line endings are important

Comments and multiline string tokens are terminated by the end of the line. Knowing where lines end is critical to understanding zig source code.

In zig, all lines are terminated by an LF character, "\n". It is an error for zig source to contain CR characters. This suits the goal of making valid zig source easy to consume. You can either look for simply "\n", or you can use a general-purpose regex like \r\n?|\n. Either one will work, because all the complex alternatives to "\n" are guaranteed to not be present in the source.

But we can't stop there. Visual Studio recognizes even more variations on line ending style:

CRLF: Carriage return + line feed, Unicode characters 000D + 000A

LF: Line feed, Unicode character 000A

NEL: Next line, Unicode character 0085

LS: Line separator, Unicode character 2028

PS: Paragraph separator, Unicode character 2029

If NEL, LS, and PS are allowed to show up in zig comments without terminating the comment, then we've got a weird corner case for anyone making a VisualStudio plugin for Zig syntax.

Therefore, we impose additional restrictions on valid zig source that zig source must not contain NEL, LS, or PS unicode points. These characters are encoded in multiple bytes, so this adds complexity to zig source validators. However, this complexity is justified, because it makes valid zig source easier to consume (remember the goals above.).

Ascii control characters are mostly no good

Control characters '\x00' through '\x1f' and '\x7f' are mostly useless. The only control character zig recognizes is '\x0a', a.k.a. '\n', which is always and only the line terminator. All the other control characters have either superfluous (CR), confusing (BS), inconsistent (VT), or otherwise obsolete (ENQ) behavior, and they are all banned everywhere in zig source code. (For the debate on windows line endings and hard tabs in zig, see #544.)

Other crazy unicode stuff isn't as important

There's a huge amount of weird stuff you can do with unicode, like right-to-left text, zero-width characters, and the poop emoji. Although Zig does want to be a readable language, there's a limit to how much we can enforce when it comes to obscure unicode craziness. You're going to be able to make pretty obfuscated unicode string literals if you try, and zig isn't going to try to stop that. The important thing is that the unicode doesn't interfere with the interpretation of zig's grammar.

If some unicode craziness is found that zig allows that confuses naive editors or analysis tools, then we should consider imposing additional restrictions for the sake of keeping zig easy to consume.

The rules

It is an error for zig source to contain invalid utf8 byte sequences.
For every codepoint of zig source code, it is an error for the codepoint to be one of U+0000-U+0009, U+000b-U+001f, U+007f, U+0085, U+2028, U+2029
It is an error for character literals to contain any source codepoint > U+007f. (e.g. 'й')

Note: From the above rules, and from the zig grammar, it follows that:

Outside of a string literal or comment, it is an error for any codepoint to be > U+007f.

Implications for consumers

If you have zig source that you know is valid, you can trust that:

The source is valid utf8, or you can interpret it simply as bytes. Every character that is significant to zig's grammar is a single-byte character.
You can trust that every line is terminated by an LF character, or you can use a generic line ending parser.
String literals and comments might contain non-ascii unicode characters, but you can ignore them, either as entire code points or as individual bytes, when scanning for terminators and escape sequences.
Whitespace between tokens is ignored. When looking for "whitespace", you can just check for " " and "\n", or you can use a generic whitespace scanner that checks for "\r", "\t", and the 25 unicode whitespace characters.

The text was updated successfully, but these errors were encountered:

milkowski · 2018-03-16T22:58:26Z

Isn't the CRLF ending style forbidding too restrictive for common use? I mean it will be weirdness in Windows world and even if you conscientious use it in your editor it may still make some problems with some sort automating source processing. I know no other popular language with such restriction.

thejoshwolfe · 2018-03-17T00:00:40Z

CRLF

See the discussion in #544.

See #663

andrewrk · 2018-11-21T04:09:01Z

Is this done? What are the action items to resolve this issue?

thejoshwolfe · 2018-11-22T17:22:08Z

This is done in self hosted. I think that's good enough.

vi · 2019-09-23T21:51:31Z

Maybe CRLF should be allowed in --sloppy dev-only unpublishable mode?

Serentty · 2019-12-15T22:57:48Z

Non-ASCII identifiers are a very important feature to me. For code which isn't meant to be published for an English-speaking audience, I regularly use identifiers which can't be represented in ASCII. The current “solution” of enforcing only ASCII in identifiers is very anglocentric.

thejoshwolfe · 2019-12-19T14:37:20Z

The current “solution” of enforcing only ASCII in identifiers is very anglocentric.

@Serentty That's a very legitimate point. I've opend #3947 to discuss it. Your input there would be appreciated.

I pointed a fuzzer at the tokenizer and it crashed immediately. Upon inspection, I was dissatisfied with the implementation. This commit removes several mechanisms: * Removes the "invalid byte" compile error note. * Dramatically simplifies tokenizer recovery by making recovery always occur at newlines, and never otherwise. * Removes UTF-8 validation. * Moves some character validation logic to `std.zig.parseCharLiteral`. Removing UTF-8 validation is a regression of #663, however, the existing implementation was already buggy. When adding this functionality back, it must be fuzz-tested while checking the property that it matches an independent Unicode validation implementation on the same file. While we're at it, fuzzing should check the other properties of that proposal, such as no ASCII control characters existing inside the source code. Other changes included in this commit: * Deprecate `std.unicode.utf8Decode` and its WTF-8 counterpart. This function has an awkward API that is too easy to misuse. * Make `utf8Decode2` and friends use arrays as parameters, eliminating a runtime assertion in favor of using the type system. After this commit, the crash found by fuzzing, which was "\x07\xd5\x80\xc3=o\xda|a\xfc{\x9a\xec\x91\xdf\x0f\\\x1a^\xbe;\x8c\xbf\xee\xea" no longer causes a crash. However, I did not feel the need to add this test case because the simplified logic eradicates most crashes of this nature.

I pointed a fuzzer at the tokenizer and it crashed immediately. Upon inspection, I was dissatisfied with the implementation. This commit removes several mechanisms: * Removes the "invalid byte" compile error note. * Dramatically simplifies tokenizer recovery by making recovery always occur at newlines, and never otherwise. * Removes UTF-8 validation. * Moves some character validation logic to `std.zig.parseCharLiteral`. Removing UTF-8 validation is a regression of ziglang#663, however, the existing implementation was already buggy. When adding this functionality back, it must be fuzz-tested while checking the property that it matches an independent Unicode validation implementation on the same file. While we're at it, fuzzing should check the other properties of that proposal, such as no ASCII control characters existing inside the source code. Other changes included in this commit: * Deprecate `std.unicode.utf8Decode` and its WTF-8 counterpart. This function has an awkward API that is too easy to misuse. * Make `utf8Decode2` and friends use arrays as parameters, eliminating a runtime assertion in favor of using the type system. After this commit, the crash found by fuzzing, which was "\x07\xd5\x80\xc3=o\xda|a\xfc{\x9a\xec\x91\xdf\x0f\\\x1a^\xbe;\x8c\xbf\xee\xea" no longer causes a crash. However, I did not feel the need to add this test case because the simplified logic eradicates most crashes of this nature.

andrewrk added this to the 0.3.0 milestone Dec 24, 2017

andrewrk added accepted This proposal is planned. proposal This issue suggests modifications. If it also has the "accepted" label then it is planned. labels Dec 24, 2017

thejoshwolfe added a commit that referenced this issue Dec 24, 2017

[self-hosted] source must be valid utf8. see #663

d6a74ed

thejoshwolfe added a commit that referenced this issue Dec 24, 2017

add source encoding rules to the docs. see #663

f0a1753

andrewrk modified the milestones: 0.3.0, 0.4.0 Feb 28, 2018

andrewrk added a commit that referenced this issue May 25, 2018

update json_test to be compliant with zig source encoding

56cb7f1

See #663

andrewrk added a commit that referenced this issue May 29, 2018

fix syntax of std/json_test.zig

b0eebfa

See #663

bronze1man mentioned this issue Jul 5, 2018

doc: please document that zig do not accept '\t' in the source code file #1182

Closed

thejoshwolfe closed this as completed Nov 22, 2018

hryx mentioned this issue Nov 24, 2018

Make final newline mandatory #1779

Closed

thejoshwolfe mentioned this issue Dec 19, 2019

consider allowing non-ascii identifiers #3947

Closed

thejoshwolfe mentioned this issue Jan 13, 2020

Allow non-ascii identifiers #4151

Closed

andrewrk mentioned this issue Jul 31, 2024

std.zig.tokenizer: simplification and spec conformance #20885

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Zig source encoding #663

Zig source encoding #663

thejoshwolfe commented Dec 23, 2017

milkowski commented Mar 16, 2018

thejoshwolfe commented Mar 17, 2018

andrewrk commented Nov 21, 2018

thejoshwolfe commented Nov 22, 2018

vi commented Sep 23, 2019

Serentty commented Dec 15, 2019

thejoshwolfe commented Dec 19, 2019