EEP 64: Triple-Quoted Strings #47

Merged
merged 10 commits into from
Jun 28, 2023
364 changes: 364 additions & 0 deletions eeps/eep-0064.md
@@ -0,0 +1,364 @@
Author: Kiko Fernandez-Reyes <kiko(at)erlang(dot)org>,
Raimo Niskanen <raimo(at)erlang(dot)org>
Status: Draft
Type: Standards Track
Created: 07-Jun-2023
Erlang-Version: OTP-27
Post-History:
****
EEP 64: Verbatim Multi-line Indented Strings
----

Abstract
========

This EEP proposes the introduction of Verbatim Multi-line Indented strings,
*VMI Strings*, and defines their semantics. The main benefit is to allow

Is the term "VMI Strings" going to be used in docs whenever we describe such strings? I don't love this acronym. I think it's nicer to call them triple-quoted strings.

@essen: I guess the fitting general feature here would be Multi-line Indented string, that has escape sequences. The question is if that merits a prefix...

@zuiderkwast: I just wanted to call them something in the EEP that was not too long since I have to mention them many times. "Triple quoted strings" is a better name, even if they could be "N>=3 quoted strings". "Verbatim strings" is another possibility. But if we later add prefixes to make them more flexible, e.g. lose "verbatim", it is maybe "triple quoted" that is the one property that must remain...

"Triple quoted" or "Multi-line". I'd say we cannot take that away from them either. That they can be indented is just what makes them good for multi-line.

*multi-line* strings in an easy and useful (*indented*) way,
similar to other languages, e.g. Elixir.

Their first use case is for in-module documentation attributes
containing [Markdown][] or similarly formatted text where *verbatim* text
is desirable since any documentation text format has its own notion
of escape sequences which will collide with Erlang's escape sequences.

Rationale
=========

Today (June 2023), writing multi-line strings is awkward and arguably ugly.
They may contain escape sequences and have no concept of indentation:

foo() ->
    case bar() of
        ok ->
            X = "First line
Second line with \"\\*not emphasized\\* Markdown\"
Third line",
            {ok, X}
    end.

The content's indentation cannot adhere to the surrounding code's
and the `*` has to be doubly escaped to get a `\*` character
sequence into the actual content.

In a documentation attribute as suggested in [EEP 59][],
the indentation problem is not that pronounced because
the documentation attribute itself is not much indented:

-doc """
First line
Second line with "\*not emphasized\* Markdown"
Third line
""".

The main reason to consider this EEP is for documentation
attributes, where not having to worry about escape sequences
is this EEP's most attractive property. Introducing a new string
format, however, will also require defining how it shall behave
in Erlang code.

Having a string format that is only allowed in attributes would
simply be very strange and the one suggested in this EEP
would also be useful in Erlang code.

Design Decisions
----------------

An attribute is an Erlang form in the source code
that consists of a `-` token, an atom, one value term and a full stop
(dot). The value term may be enclosed in parentheses
(which is not very interesting for documentation attributes).

-doc " Badly formatted
documentation paragraph
/-\\
\\-/".

A documentation attribute should have a string as its content term,
and here we want to use our new and more convenient VMI String
instead of a normal string:

-doc """
Better formatted
documentation paragraph
/-\
\-/
""".

### VMI String Scanner Token

A VMI String must be a token that the scanner recognizes as a string,
which makes it suitable for a documentation attribute value term.
It starts and ends with three double quotes: `"""`.

Double quotes, `"`, are chosen because normal Erlang strings use them
and this is just a new variant. Since double quotes are used,
a VMI String shall, like a normal string, produce a list of codepoints.

It would be more convenient if a VMI String produced a UTF-8 binary,
but that would be a surprising feature for double quotes, and
the documentation build process can work around this by converting
the codepoint list (`string()`) into the needed binary chunk.
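
As an illustration of that conversion, here is a minimal sketch using
`unicode:characters_to_binary/1` from the standard library. The module
and function names are hypothetical and not part of this EEP; how the
documentation build process actually performs the conversion is not
specified here.

```erlang
-module(doc_chunk_sketch).
-export([to_utf8/1]).

%% Hypothetical helper: convert a string() (a list of Unicode
%% codepoints), such as the value of a documentation attribute,
%% into a UTF-8 encoded binary.
-spec to_utf8(string()) -> binary().
to_utf8(Codepoints) when is_list(Codepoints) ->
    case unicode:characters_to_binary(Codepoints) of
        Bin when is_binary(Bin) -> Bin;
        %% Invalid or truncated input is reported as an error tuple.
        {error, _Converted, _Rest} -> error(badarg, [Codepoints]);
        {incomplete, _Converted, _Rest} -> error(badarg, [Codepoints])
    end.
```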

In source code a VMI String is valid in a binary,
so producing a Unicode binary is reasonably straightforward:

X = <<"""
Line 1
Line 2
"""/utf8>>

The extra overhead is acceptable since we are targeting
multi-line strings here.

As a future expansion it has been proposed to use prefixes
for specialized strings such as regular expressions,
interpolated variables ([PR-7343][]), Unicode binary strings, etc.
For example: `X = u"Tschüß"` for a UTF-8 encoded string.

### VMI String Start

After the starting `"""` only white-space is allowed
up to the end of the line.

As a possible future expansion we might allow a keyword here
that shouldn't be part of the string content, but could be
a hint for syntax highlighting in the editor.

-doc """ md
Markdown content
* Bullet list
""".

The scanner does not need to have any special treatment
for the string content on the line after the starting
`"""` except that it should not search for an ending `"""`.

A later step, preferably done by the parser, strips the
characters up to and including the newline from the
first line of the string.

If any of these characters is not white-space,
the parser reports a syntax error.
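
The start line check described above could look roughly like the
following sketch. It is written in today's Erlang and is not the
actual scanner or parser code. The module name, the function name and
the error terms are hypothetical, and treating only space, tab and CR
as white-space is a simplifying assumption; the scanner's own
white-space definition is wider.

```erlang
-module(vmi_start_sketch).
-export([strip_start_line/1]).

%% Raw is everything collected after the opening quotes.
%% Returns the content that follows the first newline, or an error
%% if the start line contains anything but white-space.
-spec strip_start_line(string()) -> {ok, string()} | {error, term()}.
strip_start_line([$\n | Content]) ->
    {ok, Content};
strip_start_line([C | Rest]) when C =:= $\s; C =:= $\t; C =:= $\r ->
    %% Assumption: only space, tab and CR count as white-space here.
    strip_start_line(Rest);
strip_start_line([C | _]) ->
    {error, {non_white_space_on_start_line, C}};
strip_start_line([]) ->
    {error, no_newline_on_start_line}.
```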

### VMI String End

All bytes are collected as they are (verbatim) and become
the VMI String content.

A VMI String ends with newline followed by optional white-space
and then by `"""`. This completes the scanner token.

A later step, preferably done by the parser, uses the white-space
on the ending line as the definition of the string's indentation
and strips that white-space sequence from every line in the string,
and strips the newline preceding the ending line.

If any line does not start with the defined indentation,
either because the line is too short or because its prefix differs,
the parser reports a syntax error.

Requiring that all lines must have exactly the same white-space
characters as indentation is a simple solution to not have
to define how indentation white-space (tab vs. space) normalization
should be done, and also seems like a reasonable requirement.
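
To make the stripping rule concrete, here is a minimal sketch in
today's Erlang. It is not the proposed parser implementation; the
module name, function names and error term are hypothetical. It
assumes the start line has already been removed and that `Content`
runs up to, but does not include, the closing quotes, so its last
line is exactly the white-space of the ending line.

```erlang
-module(vmi_indent_sketch).
-export([strip_indentation/1]).

%% The last line of Content defines the indentation. That exact
%% character sequence must prefix every content line; it is removed,
%% and so is the newline preceding the ending line.
-spec strip_indentation(string()) -> {ok, string()} | {error, term()}.
strip_indentation(Content) ->
    Lines = split_lines(Content, [], []),
    Indent = lists:last(Lines),
    strip_lines(lists:droplast(Lines), Indent, 1, []).

%% Split on LF only; always returns at least one (possibly empty) line.
split_lines([$\n | Rest], Cur, Acc) ->
    split_lines(Rest, [], [lists:reverse(Cur) | Acc]);
split_lines([C | Rest], Cur, Acc) ->
    split_lines(Rest, [C | Cur], Acc);
split_lines([], Cur, Acc) ->
    lists:reverse([lists:reverse(Cur) | Acc]).

strip_lines([], _Indent, _N, Acc) ->
    {ok, lists:append(lists:join("\n", lists:reverse(Acc)))};
strip_lines([Line | Lines], Indent, N, Acc) ->
    case lists:prefix(Indent, Line) of
        true ->
            Stripped = lists:nthtail(length(Indent), Line),
            strip_lines(Lines, Indent, N + 1, [Stripped | Acc]);
        false ->
            %% Line too short, or its prefix differs (e.g. tab vs. space).
            {error, {bad_indentation, N}}
    end.
```

Comparing the prefix character by character is what makes tab
versus space normalization unnecessary: a line indented with a tab
simply does not match an ending line indented with spaces.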

### Leading and trailing newline

The rules above strip one leading and one trailing newline.
This is a simple convention that also gives control over
the string's content:

"\n X\n" = """

 X

""",

"X" = """
X
""",

"" = """

"""

Note that the following could be regarded as a syntax error,
a too short multi-line string: a newline shall be stripped
both from the starting line and from the last content line,
so the content could be seen as less than empty.
It is, however, more convenient to also allow it as an empty string:

"" = """
"""

The definition of newline and white-space is the same as
the scanner's current one, with one exception: when stripping
the newline from the line preceding the ending line, a CR should
also be stripped if the line ends in CR LF.

This is a convenience for systems where CR LF is used as newline.
In most places in the scanner CR is treated as white-space,
but in this case it would be inconvenient to not strip the CR.
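
A minimal sketch of that special case, in today's Erlang. The module
and function names are hypothetical, and `Content` is assumed to be
the string content up to and including the newline that precedes the
ending line.

```erlang
-module(vmi_crlf_sketch).
-export([strip_final_newline/1]).

%% Strip the newline preceding the ending line.  If that newline is
%% CR LF, the CR is stripped too.  CR LF inside the content is kept.
-spec strip_final_newline(string()) -> string().
strip_final_newline(Content) ->
    case lists:reverse(Content) of
        [$\n, $\r | Rev] -> lists:reverse(Rev);
        [$\n | Rev] -> lists:reverse(Rev);
        _ -> Content
    end.
```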

### Indentation

The rules above facilitate indentation of the content
to adhere to the surrounding code. The ending line
determines the indentation.

"This text\nhas no indentation" = """
This text
has no indentation
""",

" This text\n has indentation" = """
 This text
 has indentation
""",

" This text\nhas an indented first line" =
  """
   This text
  has an indented first line
  """,

"""
This is a syntax error (incorrect indentation)
  """,

""" This is a syntax error
(non-white-space on start line)
""",

"""
This will probably be a syntax error
since no ending line can be found"""

### Backwards incompatibility

This is valid today:

X = """
X
"""

It is equivalent to:

X = "" "
X
" ""

Which is equivalent to:

X = "
X
"

Which is equivalent to:

X = "\n X\n"

But with the suggested VMI Strings the first
code snippet would instead be equivalent to:

X = "X"

Also, this is valid today:

X = """ xxx
X
"""

But according to this EEP it would be two syntax errors:

1. The start line has got non-white-space after `"""`.
2. The first content line has incorrect indentation.

There are many other similar constructions that also
would be syntax errors.

* It is very unlikely that anyone has deliberately
used `"""` in source code to mean an empty string
concatenated with another string.
* Most combinations with `"""` that are allowed today will cause
syntax errors. Only a few will have a subtly changed
behaviour (string content).
* Users can simply grep for `"""` in their source code.
The same sequence produced e.g. through macros would
be harder to find, but the worst problem would not
be new syntax errors (hard to miss), but changed
behaviour. And the changed behaviour would be
a slightly different string content.
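
For users who prefer to do the search from Erlang rather than with
grep, a rough sketch could look like this. The module and function
names are hypothetical. It only finds literal occurrences in the
source text, so, as noted above, sequences produced through macros
are not detected.

```erlang
-module(triple_quote_grep_sketch).
-export([lines_with_triple_quotes/1]).

%% Return {LineNumber, Line} for every line of Source that contains
%% three consecutive double quotes.
-spec lines_with_triple_quotes(binary()) -> [{pos_integer(), binary()}].
lines_with_triple_quotes(Source) when is_binary(Source) ->
    Lines = binary:split(Source, <<"\n">>, [global]),
    Numbered = lists:zip(lists:seq(1, length(Lines)), Lines),
    [{N, Line} || {N, Line} <- Numbered,
                  binary:match(Line, <<"\"\"\"">>) =/= nomatch].
```

A file can be checked by passing it the binary returned from
`file:read_file/1`.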

Therefore, it should be very unlikely that anyone
encounters a real backwards incompatibility problem
from the suggestions in this EEP.

### Quoting of `"""`

In the rules above there is no possibility to have `"""`
first on a line in a VMI String.

This would be allowed:

-doc """
A VMI String starts with: """
and ends with: """
""".

As long as `"""` isn't first on a line.

It would be possible to work around this in Erlang code:

X = """
A VMI String starts with:
""
""" """
"
and ends with:
""
""" "\""

That is ugly, and it is not possible in a documentation
attribute where string concatenation isn't allowed.

We can either ignore the problem since it is only
when placed first on a line that `"""` is a problem,
or we can use the GitHub [Markdown][] trick to allow
3 or more start characters and matching end characters
so this would be valid:

X = """"
A VMI String starts with:
"""
and ends with:
"""
""""
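
The matching of start and end delimiters could be sketched as
follows, in today's Erlang. This is an illustration, not the scanner
implementation. The module and function names are hypothetical, only
space and tab are treated as white-space, and accepting an ending
line with more than N quotes is an assumption; the text above does
not say how such a line should be handled.

```erlang
-module(vmi_delim_sketch).
-export([count_quotes/1, ends_string/2]).

%% Count the run of double quotes at the start of Str,
%% e.g. to find N for the opening delimiter.
-spec count_quotes(string()) -> non_neg_integer().
count_quotes(Str) ->
    length(lists:takewhile(fun(C) -> C =:= $" end, Str)).

%% True if Line is optional white-space followed by at least N
%% double quotes, that is, if it can end a string opened by N quotes.
-spec ends_string(string(), pos_integer()) -> boolean().
ends_string(Line, N) ->
    Rest = lists:dropwhile(fun(C) -> C =:= $\s orelse C =:= $\t end, Line),
    count_quotes(Rest) >= N.
```

With N = 4, a content line that starts with only three quotes is not
an ending line, which is what allows `"""` first on a line inside the
string.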
@williamthome williamthome Jun 16, 2023

I liked the four double quotes idea.
I've updated my PR with this change. Now this is possible:

1> """"
.. ```erlang
.. foo() ->
..     """
..     foo
..     bar
..     """.
.. ```
.. """".
"```erlang\nfoo() ->\n    \"\"\"\n    foo\n    bar\n    \"\"\".\n```\n"

The -doc attribute can benefit from this, for example to write examples that contain triple quotes.
Before this change, the above had to be written as

1> """
.. ```erlang
.. foo() ->
..     \"\"\"
..     foo
..     bar
..     \"\"\".
.. ```
.. """.
"```erlang\nfoo() ->\n    \"\"\"\n    foo\n    bar\n    \"\"\".\n```\n"

C# has this three-or-more double quotes syntax too.

@williamthome: Note that the idea is to allow 3 or more, not 3 or 4...

@RaimoNiskanen Oh, nice! I'll change the code to support 3+ double quotes.

The PR is updated and now 3+ double quotes are supported.

I am a bit late to the party but I like the idea a lot too. This and automatically stripping the last line are elegant solutions to the problems presented. 👍


[EEP 59]: https://www.erlang.org/eeps/eep-0059
"EEP 59: Module attributes for documentation"

[EEP 62]: https://www.erlang.org/eeps/eep-0062
"String Interpolation Syntax"

[PR-7343]: https://github.com/erlang/otp/pull/7343
"Feature: String Interpolation"

[Markdown]: https://github.github.com/gfm/
"GitHub Flavored Markdown"

Copyright
=========

This document is placed in the public domain or under the CC0-1.0-Universal
license, whichever is more permissive.

[EmacsVar]: <> "Local Variables:"
[EmacsVar]: <> "mode: indented-text"
[EmacsVar]: <> "indent-tabs-mode: nil"
[EmacsVar]: <> "sentence-end-double-space: t"
[EmacsVar]: <> "fill-column: 70"
[EmacsVar]: <> "coding: utf-8"
[EmacsVar]: <> "End:"
[VimVar]: <> " vim: set fileencoding=utf-8 expandtab shiftwidth=4 softtabstop=4: "