clarify string descriptions #875

Open: wants to merge 3 commits into base: main
32 changes: 17 additions & 15 deletions toml.md
@@ -259,12 +259,13 @@ String
------

There are four ways to express strings: basic, multi-line basic, literal, and
-multi-line literal. All strings must contain only valid UTF-8 characters.
+multi-line literal. Strings can contain any valid Unicode codepoint except the
+following control characters: U+0000 to U+0008, U+000A to U+001F, and
+U+007F. Note that tab (U+0009) is allowed. Multi-line strings can also contain
+newlines (U+000A) and carriage returns (U+000D).

@abravalheri commented on Feb 24, 2022

I think that saying U+000A and U+000D are not allowed first [1] and then adding an exception for multi-line strings is kind of a double negative (an exception to the previous exception)...

I would recommend restricting the code point ranges/enumeration to the ones that are forbidden in all types of strings.

Then I would add a second (separate) statement specifically saying that "basic" and "literal" strings (single-line) don't allow newlines/carriage returns.

For example, something like:

Strings can contain any valid Unicode codepoint except the following control characters:
U+0000 to U+0008, U+000B, U+000C, U+000E to U+001F, and U+007F.
Note that tab (U+0009) is allowed.
Newlines (U+000A) and carriage returns (U+000D) are allowed in multi-line strings
but forbidden in basic and literal strings.

Footnotes

  1. U+000A and U+000D are elements of the previously mentioned character ranges/enumeration
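
To make the proposed wording concrete, here is a minimal Python sketch of the rule as I read it: one set of control characters forbidden in every string kind, plus a separate newline/carriage-return check that applies only to single-line strings. It assumes the intended range is U+000E to U+001F, as in the multi-line paragraph this PR removes. The names `FORBIDDEN_EVERYWHERE`, `NEWLINE_CHARS`, and `disallowed_code_points` are made up for illustration and do not come from the spec or any parser. It looks only at raw code points and does not cover escape sequences or the CRLF pairing rule.

```python
# Hypothetical sketch of the proposed wording; not taken from any TOML parser.
# Control characters forbidden in every kind of string (assumes the range
# U+000E to U+001F is intended).
FORBIDDEN_EVERYWHERE = (
    set(range(0x00, 0x09))      # U+0000 to U+0008
    | {0x0B, 0x0C}              # U+000B, U+000C
    | set(range(0x0E, 0x20))    # U+000E to U+001F
    | {0x7F}                    # U+007F
)

# Allowed in multi-line strings, forbidden in basic and literal strings.
NEWLINE_CHARS = {0x0A, 0x0D}    # U+000A (LF), U+000D (CR)


def disallowed_code_points(text: str, multi_line: bool) -> list[str]:
    """Return the raw code points in `text` that the proposed rule rejects."""
    bad = []
    for ch in text:
        cp = ord(ch)
        if cp in FORBIDDEN_EVERYWHERE:
            bad.append(f"U+{cp:04X}")
        elif cp in NEWLINE_CHARS and not multi_line:
            bad.append(f"U+{cp:04X}")
    return bad
```

Tab (U+0009) is in neither set, so it passes for every string kind, which matches the "Note that tab (U+0009) is allowed" sentence.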


-**Basic strings** are surrounded by quotation marks (`"`). Any Unicode character
-may be used except those that must be escaped: quotation mark, backslash, and
-the control characters other than tab (U+0000 to U+0008, U+000A to U+001F,
-U+007F).
+**Basic strings** are surrounded by quotation marks (`"`). Backslash and
+quotation mark may only occur if they are part of a valid escape sequence.

```toml
str = "I'm a string. \"You can quote me\". Name\tJos\u00E9\nLocation\tSF."
@@ -297,6 +298,9 @@ like to break up a very long string into multiple lines. TOML makes this easy.
**Multi-line basic strings** are surrounded by three quotation marks on each
side and allow newlines. A newline immediately following the opening delimiter
will be trimmed. All other whitespace and newline characters remain intact.
+Carriage returns (U+000D) are allowed only as part of a newline sequence U+000D
+U+000A (CRLF). Backslash may only occur if it is part of a valid escape
+sequence.

```toml
str1 = """
@@ -340,11 +344,6 @@ str3 = """\
"""
```

-Any Unicode character may be used except those that must be escaped: backslash
-and the control characters other than tab, line feed, and carriage return
-(U+0000 to U+0008, U+000B, U+000C, U+000E to U+001F, U+007F). Carriage returns
-(U+000D) are only allowed as part of a newline sequence.
-
You can write a quotation mark, or two adjacent quotation marks, anywhere inside
a multi-line basic string. They can also be written just inside the delimiters.

@@ -405,9 +404,10 @@ apos15 = "Here are fifteen apostrophes: '''''''''''''''"
str = ''''That,' she said, 'is still pointless.''''
```

-Control characters other than tab are not permitted in a literal string. Thus,
-for binary data, it is recommended that you use Base64 or another suitable ASCII
-or UTF-8 encoding. The handling of that encoding will be application-specific.
+Because most control characters are not permitted even in literal and multi-line
+literal strings, these literal strings are not suited for representing blobs of
+binary data. It is recommended that you use Base64 or another suitable ASCII or
+UTF-8 encoding. The handling of that encoding will be application-specific.
Comment on lines -408 to +410

We have an alternative paragraph that expresses these same sentiments in #929.
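
For readers of either wording, the binary-data recommendation in the paragraph under review can be illustrated with a small Python sketch. It uses only the standard-library `base64` module. The helper names `blob_to_toml_line` and `toml_value_to_blob` are hypothetical, and the sketch assumes `key` is already a valid bare key.

```python
import base64


def blob_to_toml_line(key: str, blob: bytes) -> str:
    """Render `blob` as a TOML basic string holding standard Base64 text.

    Assumes `key` is a valid bare key; no key quoting is attempted.
    """
    encoded = base64.b64encode(blob).decode("ascii")
    # The standard Base64 alphabet (A-Z, a-z, 0-9, '+', '/', '=') contains no
    # quotation marks, backslashes, or control characters, so the value needs
    # no escaping inside a basic string.
    return f'{key} = "{encoded}"'


def toml_value_to_blob(encoded: str) -> bytes:
    """Decode the string value back to bytes, rejecting non-Base64 input."""
    return base64.b64decode(encoded, validate=True)
```

Decoding is left to the application, in line with the sentence that the handling of the encoding is application-specific.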


Integer
-------
@@ -763,7 +763,8 @@ member_since = 1999-08-04

Dotted keys create and define a table for each key part before the last one. Any
such table must have all its key/value pairs defined under the current `[table]`
-header, or in the root table if defined before all headers, or in one inline table.
+header, or in the root table if defined before all headers, or in one inline
+table.

```toml
fruit.apple.color = "red"
@@ -1008,6 +1009,7 @@ When transferring TOML files over the internet, the appropriate MIME type is
ABNF Grammar
------------

-A formal description of TOML's syntax is available, as a separate [ABNF file][abnf].
+A formal description of TOML's syntax is available, as a separate
+[ABNF file][abnf].

[abnf]: ./toml.abnf