EEP 64: Triple-Quoted Strings #47

Merged
merged 10 commits into from
Jun 28, 2023
354 changes: 354 additions & 0 deletions eeps/eep-0064.md
@@ -0,0 +1,354 @@
Author: Kiko Fernandez-Reyes <kiko(at)erlang(dot)org>,
Raimo Niskanen <raimo(at)erlang(dot)org>
Status: Draft
Type: Standards Track
Created: 07-Jun-2023
Erlang-Version: OTP-27
Post-History:
****
EEP 64: Multi-line Indented Verbatim Strings
----

Abstract
========

This EEP proposes the introduction of Multi-line Indented Verbatim strings,
*MIV Strings*, and defines their semantics. The main benefit is to allow
*multi-line* strings in an easy and useful, *indented*, way,
similar to other languages, e.g. Elixir.

Their first use case is for in-module documentation attributes
containing [Markdown][] or similarly formatted text where *verbatim* text
is desirable since any documentation text format has its own notion
of escape sequences which will collide with Erlang's escape sequences.

Rationale
=========

Today (June 2023), writing multi-line strings is awkward and arguably ugly.
They may contain escape sequences and have no concept of indentation:

    foo() ->
        case bar() of
            ok ->
                X = "First line
    Second line with \"\\*not emphasized\\* Markdown\"
    Third line",
                {ok, X}
        end.

The content's indentation cannot adhere to the surrounding code's
and the `*` has to be doubly escaped to get a `\*` character
sequence into the actual content.

In a documentation attribute as suggested in [EEP 59][]
the indentation problem is not that pronounced because
the documentation attribute itself is not much indented:

    -doc """
        First line
        Second line with "\*not emphasized\* Markdown"
        Third line
        """.

The main reason to consider this EEP is for documentation
attributes, and here not having to worry about escape sequences
is the most attractive property of this EEP. But introducing
a new string format will also require defining how it would behave
in Erlang code.

Having a string format only allowed in attributes would simply be
very strange and the one suggested in this EEP would be useful
also in Erlang code.

Design Decisions
----------------

An attribute is an Erlang form in the source code
that consists of a `-` token, an atom, one value term and a full stop
(dot). The value term may be enclosed in parentheses
(which is not very interesting for documentation attributes).

    -doc " Badly formatted
    documentation paragraph
    /-\\
    \\-/".

A documentation attribute should have a string as its content term,
and here we want to use our new and more convenient MIV String
instead of a normal string:

    -doc """
        Better formatted
        documentation paragraph
        /-\
        \-/
        """.

### MIV String Scanner Token

A MIV String must be a token that the scanner recognizes as a string,
which makes it suitable for a documentation attribute value term.
It starts and ends with three double quotes: `"""`.

Double quotes, `"`, are chosen because normal Erlang strings use them
and this is just a new variant. Since double quotes are used,
a MIV String should, like a normal string, produce a list of codepoints.

It would be more convenient if a MIV String produced a UTF-8 binary,
but that would be surprising when using double quotes, and
the documentation build process can convert the codepoint list
into the needed binary chunk.

In source code a MIV String is valid in a binary so
producing a unicode binary is reasonably straightforward:

    X = <<"""
          Line 1
          Line 2
          """/utf8>>

The extra syntactic overhead of the binary brackets and `/utf8` is small,
since we are targeting multi-line strings here.
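
The conversion from a codepoint list to a UTF-8 binary, as a documentation
build step might do it, can be sketched as follows. This is only
an illustration: the module and function names are hypothetical,
and the only library call used is `unicode:characters_to_binary/1`.

```erlang
-module(miv_utf8_sketch).
-export([to_utf8/1]).

%% Convert a string (a list of Unicode code points) into a UTF-8 binary.
%% unicode:characters_to_binary/1 returns a binary on success and an
%% {error, ...} or {incomplete, ...} tuple otherwise.
to_utf8(Codepoints) when is_list(Codepoints) ->
    case unicode:characters_to_binary(Codepoints) of
        Bin when is_binary(Bin) -> Bin;
        Other -> error({bad_unicode, Other})
    end.
```

The input here is an ordinary string, so the sketch works the same way
for a normal string and for the content of a MIV String.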

As a future expansion it has been proposed to use prefixes
for specialized strings such as regular expressions,
interpolated variables ([PR-7343][]), unicode binary strings, etc.
For example: `X = u"Tschüß"` for a UTF-8 encoded string.

### MIV String Start

After the starting `"""` only whitespace is allowed
up to the end of the line.

As a possible future expansion we might allow a keyword here
that shouldn't be part of the string content, but could be
a hint for syntax highlighting in the editor.

    -doc """ md
        Markdown content
        * Bullet list
        """.

The scanner does not need to have any special treatment
for the string content on the line after the starting
`"""` except that it should not search for an ending `"""`.

A later step, preferably done by the parser, strips the
characters up to and including the newline from the
first line of the string.

If any of these characters is not whitespace,
the parser reports a syntax error.
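
The start line check could look roughly like the sketch below.
It assumes that the text following the opening quotes is available
as a plain string, and that whitespace means space, tab or CR; the real
scanner's definition of whitespace is wider. The module name, function
name and error atoms are hypothetical.

```erlang
-module(miv_start_sketch).
-export([strip_start_line/1]).

%% Raw is everything collected after the opening quotes.
%% Returns {ok, Rest}, where Rest begins at the first content line,
%% or an error tuple if the start line holds anything but whitespace.
strip_start_line([$\n | Rest]) ->
    {ok, Rest};
strip_start_line([C | Rest]) when C =:= $\s; C =:= $\t; C =:= $\r ->
    %% Assumed whitespace set: space, tab, CR
    strip_start_line(Rest);
strip_start_line([_ | _]) ->
    {error, non_whitespace_on_start_line};
strip_start_line([]) ->
    {error, no_newline_after_start}.
```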

### MIV String End

All bytes are collected as they are (verbatim) and become
the MIV String content.

A MIV String ends with whitespace on a new line followed by `"""`.
This completes the scanner token.

A later step, preferably done by the parser, uses the whitespace
on the ending line as the definition of the string's indentation
and strips that whitespace sequence from every line in the string,
and strips the newline preceding the ending line.

If any line does not start with the defined indentation,
either because the line is too short or because its prefix differs,
the parser reports a syntax error.

Requiring that all lines have exactly the same whitespace
characters as indentation is a simple solution that avoids having
to define how indentation whitespace (tab vs. space) normalization
should be done, and it also seems like a reasonable requirement.
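
The indentation rule can be sketched as below, assuming the content
has already been split into lines and the whitespace of the ending line
is known. The module name, function name and error term are hypothetical;
this is not the parser implementation.

```erlang
-module(miv_indent_sketch).
-export([strip_indentation/2]).

%% Lines:  the content lines, without their newlines.
%% Indent: the whitespace that precedes the closing quotes.
%% Every line must start with exactly the characters in Indent.
strip_indentation(Lines, Indent) ->
    strip_indentation(Lines, Indent, 1, []).

strip_indentation([], _Indent, _N, Acc) ->
    {ok, lists:reverse(Acc)};
strip_indentation([Line | Lines], Indent, N, Acc) ->
    case lists:prefix(Indent, Line) of
        true ->
            Stripped = lists:nthtail(length(Indent), Line),
            strip_indentation(Lines, Indent, N + 1, [Stripped | Acc]);
        false ->
            %% Too short a line, or a differing prefix (e.g. tab vs. space)
            {error, {bad_indentation, N}}
    end.
```

Since the prefix is compared character by character, a tab never matches
a space, which is the reason no normalization rule is needed.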

### Leading and trailing newline

The rules above strip one leading and one trailing newline.
This is a simple convention that also gives control over
the string's content:

    "\n X\n" = """

         X

        """,

    "X" = """
        X
        """,

    "" = """
        """

The definition of newline and whitespace is the same as
the one currently used by the scanner, with one exception: when stripping
the newline from the line preceding the ending line, a CR should also
be stripped if the line ends in CR LF.

This is a convenience for systems where CR LF is used as newline.
In most places in the scanner CR is treated as whitespace,
but in this case it would be inconvenient to not strip the CR.
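
Stripping the newline that precedes the ending line, including the CR LF
case, can be sketched as below. The input is assumed to be the content
up to, but not including, the ending line; the module and function names
are hypothetical.

```erlang
-module(miv_newline_sketch).
-export([strip_last_newline/1]).

%% Remove one trailing newline from Content.
%% If that newline is preceded by CR, the CR is removed as well.
strip_last_newline(Content) ->
    case lists:reverse(Content) of
        [$\n, $\r | Rev] -> lists:reverse(Rev);
        [$\n | Rev] -> lists:reverse(Rev);
        _ -> Content
    end.
```

Only one newline is removed, so an empty last content line still
contributes a newline to the string.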

### Indentation

The rules above facilitate indentation of the content
to adhere to the surrounding code. The ending line
determines the indentation.

    "This text\nhas no indentation" = """
        This text
        has no indentation
        """,

    " This text\n has indentation" = """
         This text
         has indentation
        """,

    " This text\nhas an indented first line" =
    """
     This text
    has an indented first line
    """,

    """
        This is a syntax error (incorrect indentation)
          """,

    """ This is a syntax error
        (non-whitespace on start line)
        """,

    """
    This will probably be a syntax error
    since no ending line can be found"""

### Backwards incompatibility

This is valid today:

    X = """
      X
      """

It is equivalent to:

    X = "" "
      X
      " ""

Which is equivalent to:

    X = "
      X
      "

Which is equivalent to:

    X = "\n  X\n  "

But with the suggested MIV Strings the first
code snippet would instead be equivalent to:

    X = "X"

Also, this is valid today:

X = """ xxx
X
"""

But according to this EEP it would be two syntax errors:

1. The start line has got non-whitespace after `"""`.
2. The first content line has incorrect indentation.

There are many other similar constructions that also
would be syntax errors.

* It is far from likely that anyone has deliberately
used `"""` in source code to mean an empty string
concatenated to another string.
* Most combinations with `"""` that are allowed today will cause
syntax errors. Only a few will have a subtly changed
behaviour (string content).
* Users can simply grep for `"""` in their source code.
Causing the same sequence e.g. through macros would
be harder to find, but the worst problem would not
be new syntax errors (hard to miss), but changed
behaviour. And the changed behaviour would be
a slightly different string content.

Therefore, it should be very unlikely that anyone
encounters a real backwards incompatibility problem
from the suggestions in this EEP.

### Quoting of `"""`

In the rules above there is no possibility to have `"""`
first on a line in a MIV String.

This would be allowed:

    -doc """
        A MIV String starts with: """
        and ends with: """
        """.

As long as `"""` isn't first on a line.

It would be possible to work around in Erlang code:

    X = """
        A MIV String starts with:
        ""
        """ """
        "
        and ends with:
        ""
        """ "\""

That is ugly, and it is not possible in a documentation
attribute where string concatenation isn't allowed.

We can either ignore the problem since it is only
when placed first on a line that `"""` is a problem,
or we can use the GitHub [Markdown][] trick to allow
3 or more start characters and matching end characters
so this would be valid:

    X = """"
        A MIV String starts with:
        """
        and ends with:
        """
        """"
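
Matching a start delimiter of 3 or more quotes against an ending line
could be sketched as below. The sketch assumes that an ending line is
whitespace (space or tab) followed by at least as many double quotes
as the start delimiter had, and it returns whatever follows the first
N quotes untouched. How extra quotes on the ending line should be treated
is not decided by the text above. All names are hypothetical.

```erlang
-module(miv_delim_sketch).
-export([count_quotes/1, end_line/2]).

%% Count the double quotes at the head of Str.
%% Returns {Count, Rest}.
count_quotes(Str) ->
    count_quotes(Str, 0).

count_quotes([$" | Rest], N) ->
    count_quotes(Rest, N + 1);
count_quotes(Rest, N) ->
    {N, Rest}.

%% Is Line an ending line for a string that was opened with N quotes?
%% Returns {true, Indent, Rest} or false.
end_line(Line, N) ->
    {Indent, AfterIndent} = lists:splitwith(fun is_blank/1, Line),
    Delim = lists:duplicate(N, $"),
    case lists:prefix(Delim, AfterIndent) of
        true -> {true, Indent, lists:nthtail(N, AfterIndent)};
        false -> false
    end.

%% Assumed indentation characters: space and tab.
is_blank(C) ->
    C =:= $\s orelse C =:= $\t.
```

With a start delimiter of four quotes, a line holding only three quotes
is not an ending line and therefore becomes ordinary content.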

@williamthome williamthome Jun 16, 2023

I liked the four double quotes idea.
I've updated my PR with this change. Now this is possible:

1> """"
.. ```erlang
.. foo() ->
..     """
..     foo
..     bar
..     """.
.. ```
.. """".
"```erlang\nfoo() ->\n    \"\"\"\n    foo\n    bar\n    \"\"\".\n```\n"

The -doc attribute can benefit from it to, for example, write triple quotes examples.
Before this change, the above should be written as

1> """
.. ```erlang
.. foo() ->
..     \"\"\"
..     foo
..     bar
..     \"\"\".
.. ```
.. """.
"```erlang\nfoo() ->\n    \"\"\"\n    foo\n    bar\n    \"\"\".\n```\n"

C# has this three-or-more double quotes syntax too.

@williamthome: Note that the idea is to allow 3 or more, not 3 or 4...

@RaimoNiskanen Oh, nice! I'll change the code to support 3+ double quotes.

The PR is updated and now 3+ double quotes are supported.

I am a bit late to the party but I like the idea a lot too. This and automatically stripping the last line are elegant solutions to the problems presented. 👍


[EEP 59]: https://www.erlang.org/eeps/eep-0059
"EEP 59: Module attributes for documentation"

[EEP 62]: https://www.erlang.org/eeps/eep-0062
"String Interpolation Syntax"

[PR-7343]: https://github.com/erlang/otp/pull/7343
"Feature: String Interpolation"

[Markdown]: https://github.github.com/gfm/
"GitHub Flavored Markdown"

Copyright
=========

This document is placed in the public domain or under the CC0-1.0-Universal
license, whichever is more permissive.

[EmacsVar]: <> "Local Variables:"
[EmacsVar]: <> "mode: indented-text"
[EmacsVar]: <> "indent-tabs-mode: nil"
[EmacsVar]: <> "sentence-end-double-space: t"
[EmacsVar]: <> "fill-column: 70"
[EmacsVar]: <> "coding: utf-8"
[EmacsVar]: <> "End:"
[VimVar]: <> " vim: set fileencoding=utf-8 expandtab shiftwidth=4 softtabstop=4: "