EEP 64: Triple-Quoted Strings #47
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,364 @@ | ||
Author: Kiko Fernandez-Reyes <kiko(at)erlang(dot)org>, | ||
Raimo Niskanen <raimo(at)erlang(dot)org> | ||
Status: Draft | ||
Type: Standards Track | ||
Created: 07-Jun-2023 | ||
Erlang-Version: OTP-27 | ||
Post-History: | ||
**** | ||
EEP 64: Verbatim Multi-line Indented Strings | ||
---- | ||
|
||
Abstract | ||
======== | ||
|
||
This EEP proposes the introduction of Verbatim Multi-line Indented strings, | ||
*VMI Strings*, and defines their semantics. The main benefit is to allow | ||
*multi-line* strings in an easy and useful (*indented*) way, | ||
similar to other languages, e.g. Elixir. | ||
|
||
Their first use case is for in-module documentation attributes | ||
containing [Markdown][] or similarly formatted text where *verbatim* text | ||
is desirable since any documentation text format has its own notion | ||
of escape sequences which will collide with Erlang's escape sequences. | ||
|
||
Rationale | ||
========= | ||
|
||
Today (June 2023), writing multi-line strings is awkward and arguably ugly. | ||
They may contain escape sequences and have no concept of indentation: | ||
|
||
    foo () -> | ||
        case bar() of | ||
            ok -> | ||
                X = "First line | ||
    Second line with \"\\*not emphasized\\* Markdown\" | ||
    Third line", | ||
                {ok, X} | ||
        end. | ||
|
||
The content's indentation cannot adhere to the surrounding code's | ||
and the `*` has to be doubly escaped to get a `\*` character | ||
sequence into the actual content. | ||
|
||
In a documentation attribute as suggested in [EEP 59][], | ||
the indentation problem is not that pronounced because | ||
the documentation attribute itself is not much indented: | ||
|
||
-doc """ | ||
First line | ||
Second line with "\*not emphasized\* Markdown" | ||
Third line | ||
""". | ||
|
||
The main reason to consider this EEP is for documentation | ||
attributes, where not having to worry about escape sequences | ||
is this EEP's most attractive property. Introducing a new string | ||
format, however, will also require defining how it shall behave | ||
in Erlang code. | ||
|
||
Having a string format that is only allowed in attributes would | ||
simply be very strange and the one suggested in this EEP | ||
would also be useful in Erlang code. | ||
|
||
Design Decisions | ||
---------------- | ||
|
||
An attribute is an Erlang form in the source code | ||
that consists of a `-` token, an atom, one value term and a full stop | ||
(dot). The value term may be enclosed in parentheses | ||
(which is not very interesting for documentation attributes). | ||
|
||
-doc " Badly formatted | ||
documentation paragraph | ||
/-\\ | ||
\\-/". | ||
|
||
A documentation attribute should have a string as its content term, | ||
and here we want to use our new and more convenient VMI String | ||
instead of a normal string: | ||
|
||
-doc """ | ||
Better formatted | ||
documentation paragraph | ||
/-\ | ||
\-/ | ||
""". | ||
|
||
### VMI String Scanner Token | ||
|
||
A VMI String must be a token that the scanner recognizes as a string, | ||
which makes it suitable for a documentation attribute value term. | ||
It starts and ends with three double quotes: `"""`. | ||
|
||
Double quotes, `"`, are chosen because normal Erlang strings use them | ||
and this is just a new variant. Since double quotes are used, | ||
a VMI String shall, like a normal string, produce a list of codepoints. | ||
|
||
It would be more convenient if a VMI String produced a UTF-8 binary, | ||
but that would be a surprising feature for double quotes, and | ||
the documentation build process can work around this by converting | ||
the codepoint list (`string()`) into the needed binary chunk. | ||
|
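The conversion mentioned above can be sketched with the standard `unicode` module. This is only an illustration of what a documentation build step might do; the module and function names (`vmi_utf8_sketch`, `to_chunk/1`) are hypothetical and not part of this EEP.

```erlang
-module(vmi_utf8_sketch).
-export([to_chunk/1]).

%% Convert a codepoint list (string()) into a UTF-8 encoded binary.
%% unicode:characters_to_binary/1 encodes to UTF-8 by default.
%% ASSUMPTION: the input is a valid list of Unicode codepoints, so the
%% {error, ...} and {incomplete, ...} return values are not handled here.
to_chunk(Codepoints) when is_list(Codepoints) ->
    unicode:characters_to_binary(Codepoints).
```

For the codepoint list `[$T,$s,$c,$h,252,223]` (the string `Tschüß`), the result is a binary in which `ü` and `ß` each occupy two bytes.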
||
In source code a VMI String is valid in a binary, | ||
so producing a Unicode binary is reasonably straightforward: | ||
|
||
X = <<""" | ||
Line 1 | ||
Line 2 | ||
"""/utf8>> | ||
|
||
The extra syntactic overhead is small since we are targeting | ||
multi-line strings here. | ||
|
||
As a future expansion it has been proposed to use prefixes | ||
for specialized strings such as regular expressions, | ||
interpolated variables ([PR-7343][]), Unicode binary strings, etc. | ||
For example: `X = u"Tschüß"` for a UTF-8 encoded string. | ||
|
||
### VMI String Start | ||
|
||
After the starting `"""` only white-space is allowed | ||
up to the end of the line. | ||
|
||
As a possible future expansion we might allow a keyword here | ||
that shouldn't be part of the string content, but could be | ||
a hint for syntax highlighting in the editor. | ||
|
||
-doc """ md | ||
Markdown content | ||
* Bullet list | ||
""". | ||
|
||
The scanner does not need any special treatment | ||
for the rest of the line after the starting | ||
`"""`, except that it should not search for an ending `"""` there. | ||
|
||
A later step, preferably done by the parser, strips the | ||
characters up to and including the newline from the | ||
first line of the string. | ||
|
||
If any of these characters is not white-space, | ||
the parser reports a syntax error. | ||
|
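The start-line rule can be sketched as a small function over the raw token content, that is, the characters following the opening quotes. The names below (`vmi_start_sketch`, `strip_start_line/1`, the error atoms) are hypothetical, and the white-space set is a simplifying assumption rather than the scanner's actual definition.

```erlang
-module(vmi_start_sketch).
-export([strip_start_line/1]).

%% Strip everything up to and including the first newline of the raw
%% token content (the characters after the opening quotes).
%% ASSUMPTION: white-space here means space, tab and CR only; the real
%% scanner's white-space definition is wider.
strip_start_line([$\n | Rest]) ->
    {ok, Rest};
strip_start_line([C | Rest]) when C =:= $\s; C =:= $\t; C =:= $\r ->
    strip_start_line(Rest);
strip_start_line([_NonWhiteSpace | _]) ->
    %% Anything else on the start line is a syntax error.
    {error, non_white_space_on_start_line};
strip_start_line([]) ->
    {error, missing_newline}.
```

A start line holding only blanks is dropped, while a start line such as ` md` is rejected under the current rules.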
||
### VMI String End | ||
|
||
All bytes are collected as they are (verbatim) and become | ||
the VMI String content. | ||
|
||
A VMI String ends with newline followed by optional white-space | ||
and then by `"""`. This completes the scanner token. | ||
|
||
A later step, preferably done by the parser, uses the white-space | ||
on the ending line as the definition of the string's indentation | ||
and strips that white-space sequence from every line in the string, | ||
and strips the newline preceding the ending line. | ||
|
||
If any line does not start with the defined indentation, | ||
either because the line is too short or because its prefix differs, | ||
the parser reports a syntax error. | ||
|
||
Requiring that all lines have exactly the same white-space | ||
characters as indentation is a simple way to avoid having | ||
to define how indentation white-space (tab vs. space) should be | ||
normalized, and it also seems like a reasonable requirement. | ||
|
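The indentation rule can be sketched as follows, assuming the content has already been split into lines and the white-space on the ending line has been captured as `Indent`. All names (`vmi_indent_sketch`, `strip_indent/2`, the error tuple) are hypothetical; this is an illustration of the rule, not the parser's implementation.

```erlang
-module(vmi_indent_sketch).
-export([strip_indent/2]).

%% Lines:  the content lines, without their newline characters.
%% Indent: the exact white-space sequence found before the closing quotes.
%% Every line must start with exactly that sequence; it is removed and
%% the lines are joined with newlines again.
strip_indent(Lines, Indent) ->
    strip_indent(Lines, Indent, []).

strip_indent([], _Indent, Acc) ->
    {ok, lists:append(lists:join("\n", lists:reverse(Acc)))};
strip_indent([Line | Lines], Indent, Acc) ->
    case lists:prefix(Indent, Line) of
        true ->
            Stripped = lists:nthtail(length(Indent), Line),
            strip_indent(Lines, Indent, [Stripped | Acc]);
        false ->
            %% Too short, or a different tab/space mix: syntax error.
            {error, {bad_indentation, Line}}
    end.
```

Because the comparison is character by character, a line indented with a tab does not match an ending line indented with spaces, which is the behaviour the paragraph above asks for.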
||
### Leading and trailing newline | ||
|
||
The rules above strip one leading and one trailing newline. | ||
This is a simple convention that also gives control over | ||
the string's content: | ||
|
||
    "\n X\n" = """ | ||
|
||
       X | ||
|
||
      """, | ||
|
||
    "X" = """ | ||
      X | ||
      """, | ||
|
||
    "" = """ | ||
|
||
      """ | ||
|
||
Note that the following could be considered a syntax error: the | ||
multi-line string is too short, since a newline shall be stripped | ||
both from the starting line and from the last content line, | ||
so the content could be seen as less than empty. | ||
But it is more convenient to also allow it as an empty string: | ||
|
||
"" = """ | ||
""" | ||
|
||
The definition of newline and white-space is the same as | ||
the scanner's current one, with one exception: when stripping the newline | ||
from the line preceding the ending line, a CR should also | ||
be stripped if the line ends in CR LF. | ||
|
||
This is a convenience for systems where CR LF is used as newline. | ||
In most places in the scanner CR is treated as white-space, | ||
but in this case it would be inconvenient to not strip the CR. | ||
|
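A minimal sketch of the CR LF convenience, assuming the content still ends with the newline that precedes the ending line. The module and function names are hypothetical.

```erlang
-module(vmi_crlf_sketch).
-export([strip_final_newline/1]).

%% Remove the newline that precedes the ending line. If the line ends
%% in CR LF, the CR is removed as well; a CR elsewhere is left alone.
strip_final_newline(Content) ->
    case lists:reverse(Content) of
        [$\n, $\r | Rev] -> lists:reverse(Rev);
        [$\n | Rev] -> lists:reverse(Rev);
        _ -> Content
    end.
```

With this rule a file saved with CR LF line endings and one saved with LF line endings produce the same last content line.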
||
### Indentation | ||
|
||
The rules above facilitate indentation of the content | ||
to adhere to the surrounding code. The ending line | ||
determines the indentation. | ||
|
||
    "This text\nhas no indentation" = """ | ||
      This text | ||
      has no indentation | ||
      """, | ||
|
||
    " This text\n has indentation" = """ | ||
       This text | ||
       has indentation | ||
      """, | ||
|
||
    " This text\nhas an indented first line" = | ||
    """ | ||
     This text | ||
    has an indented first line | ||
    """, | ||
|
||
    """ | ||
      This is a syntax error (incorrect indentation) | ||
        """, | ||
|
||
    """ This is a syntax error | ||
        (non-white-space on start line) | ||
        """, | ||
|
||
    """ | ||
        This will probably be a syntax error | ||
        since no ending line can be found""" | ||
|
||
### Backwards incompatibility | ||
|
||
This is valid today: | ||
|
||
X = """ | ||
X | ||
""" | ||
|
||
It is equivalent to: | ||
|
||
X = "" " | ||
X | ||
" "" | ||
|
||
Which is equivalent to: | ||
|
||
X = " | ||
X | ||
" | ||
|
||
Which is equivalent to: | ||
|
||
X = "\n X\n" | ||
|
||
But with the suggested VMI Strings the first | ||
code snippet would instead be equivalent to: | ||
|
||
X = "X" | ||
|
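The chain of equivalences for today's behaviour relies on adjacent string literals being concatenated at compile time. A minimal sketch, written with ordinary strings so that it does not depend on how `"""` is scanned; the module name is hypothetical.

```erlang
-module(adjacent_literals_sketch).
-export([today/0]).

%% Adjacent string literals are concatenated at compile time, so an
%% empty string on either side of a literal does not change it.
today() ->
    "" "\n X\n" "".
```

The function returns the same list as the single literal `"\n X\n"`, which is the result that would change to `"X"` under this EEP.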
||
Also, this is valid today: | ||
|
||
X = """ xxx | ||
X | ||
""" | ||
|
||
But according to this EEP it would be two syntax errors: | ||
|
||
1. The start line has non-white-space after `"""`. | ||
2. The first content line has incorrect indentation. | ||
|
||
There are many other similar constructions that also | ||
would be syntax errors. | ||
|
||
* It is highly unlikely that anyone has deliberately | ||
used `"""` in source code to mean an empty string | ||
concatenated to another string. | ||
* Most combinations with `"""` that are allowed today will cause | ||
syntax errors. Only a few will have a subtly changed | ||
behaviour (string content). | ||
* Users can simply grep for `"""` in their source code. | ||
The same sequence produced e.g. through macros would | ||
be harder to find, but the worst problem would not | ||
be new syntax errors (hard to miss), but changed | ||
behaviour. And the changed behaviour would be | ||
a slightly different string content. | ||
|
||
Therefore, it should be very unlikely that anyone | ||
encounters a real backwards incompatibility problem | ||
from the suggestions in this EEP. | ||
|
||
### Quoting of `"""` | ||
|
||
In the rules above there is no possibility to have `"""` | ||
first on a line in a VMI String. | ||
|
||
This would be allowed: | ||
|
||
-doc """ | ||
A VMI String starts with: """ | ||
and ends with: """ | ||
""". | ||
|
||
As long as `"""` isn't first on a line. | ||
|
||
It would be possible to work around this in Erlang code: | ||
|
||
X = """ | ||
A VMI String starts with: | ||
"" | ||
""" """ | ||
" | ||
and ends with: | ||
"" | ||
""" "\"" | ||
|
||
That is ugly, and it is not possible in a documentation | ||
attribute where string concatenation isn't allowed. | ||
|
||
We can either ignore the problem since it is only | ||
when placed first on a line that `"""` is a problem, | ||
or we can use the GitHub [Markdown][] trick to allow | ||
3 or more start characters and matching end characters | ||
so this would be valid: | ||
|
||
X = """" | ||
A VMI String starts with: | ||
""" | ||
and ends with: | ||
""" | ||
"""" | ||
I liked the four double quotes idea.

    1> """"
    .. ```erlang
    .. foo() ->
    .. """
    .. foo
    .. bar
    .. """.
    .. ```
    .. """".
    "```erlang\nfoo() ->\n \"\"\"\n foo\n bar\n \"\"\".\n```\n"

The

    1> """
    .. ```erlang
    .. foo() ->
    .. \"\"\"
    .. foo
    .. bar
    .. \"\"\".
    .. ```
    .. """.
    "```erlang\nfoo() ->\n \"\"\"\n foo\n bar\n \"\"\".\n```\n"

C# has this three-or-more double quotes syntax too.

@williamthome: Note that the idea is to allow 3 or more, not 3 or 4...

@RaimoNiskanen Oh, nice! I'll change the code to support 3+ double quotes.

The PR is updated and now 3+ double quotes are supported.

I am a bit late to the party but I like the idea a lot too. This and automatically stripping the last line are elegant solutions to the problems presented. 👍
||
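The "3 or more start characters and matching end characters" idea discussed above can be sketched as two helper functions. All names are hypothetical, and whether a closing line may carry more quotes than the opening is an assumption of this sketch, not something the EEP text settles.

```erlang
-module(vmi_delim_sketch).
-export([opening/1, is_closing_line/2]).

%% Count the double quotes that open the string; at least 3 are required.
opening(Chars) ->
    opening(Chars, 0).

opening([$" | Rest], N) ->
    opening(Rest, N + 1);
opening(Rest, N) when N >= 3 ->
    {ok, N, Rest};
opening(_Rest, _N) ->
    not_a_vmi_string.

%% A line closes the string if, after its indentation, it starts with
%% the same number of double quotes as the opening.
%% ASSUMPTION: extra quotes after the first N are not rejected here, and
%% indentation is limited to spaces and tabs.
is_closing_line(Line, N) ->
    IsBlank = fun(C) -> C =:= $\s orelse C =:= $\t end,
    Rest = lists:dropwhile(IsBlank, Line),
    lists:prefix(lists:duplicate(N, $"), Rest).
```

With an opening of four quotes, a content line that starts with only three quotes is not treated as the end of the string, which is what makes the example with embedded triple quotes work.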
|
||
[EEP 59]: https://www.erlang.org/eeps/eep-0059 | ||
"EEP 59: Module attributes for documentation" | ||
|
||
[EEP 62]: https://www.erlang.org/eeps/eep-0062 | ||
"String Interpolation Syntax" | ||
|
||
[PR-7343]: https://github.com/erlang/otp/pull/7343 | ||
"Feature: String Interpolation" | ||
|
||
[Markdown]: https://github.github.com/gfm/ | ||
"GitHub Flavored Markdown" | ||
|
||
Copyright | ||
========= | ||
|
||
This document is placed in the public domain or under the CC0-1.0-Universal | ||
license, whichever is more permissive. | ||
|
||
[EmacsVar]: <> "Local Variables:" | ||
[EmacsVar]: <> "mode: indented-text" | ||
[EmacsVar]: <> "indent-tabs-mode: nil" | ||
[EmacsVar]: <> "sentence-end-double-space: t" | ||
[EmacsVar]: <> "fill-column: 70" | ||
[EmacsVar]: <> "coding: utf-8" | ||
[EmacsVar]: <> "End:" | ||
[VimVar]: <> " vim: set fileencoding=utf-8 expandtab shiftwidth=4 softtabstop=4: " |
Is the term "VMI Strings" going to be used in docs whenever we describe such strings? I don't love this acronym. I think it's nicer to call them triple-quoted strings.
@essen: I guess the fitting general feature here would be Multi-line Indented string, that has escape sequences. The question is if that merits a prefix...
@zuiderkwast: I just wanted to call them something in the EEP that was not too long since I have to mention them many times. "Triple quoted strings" is a better name, even if they could be "N>=3 quoted strings". "Verbatim strings" is another possibility. But if we later add prefixes to make them more flexible, e.g. lose "verbatim", it is maybe "triple quoted" that is the one property that must remain...
"Triple quoted" or "Multi-line". I'd say we cannot take that away from them either. That they can be indented is just what makes them good for multi-line.