EEP 64: Triple-Quoted Strings #47

Merged Jun 28, 2023 (10 commits)

kikofernandez
Contributor

This EEP discusses the design of the triple-quoted binary strings.

In the case that document attributes EEP 59 and interpolation strings (EEP 62) are added before this EEP, interpolation attributes are to be disallowed in documentation attributes (this is mentioned in the EEP).

The semantics should feel familiar for Elixir developers, and should be pretty close to Elixir triple-quote semantics.

Feedback is welcome.

eeps/eep-0064.md Outdated
"""
remove_double_quotes(X) ->

#### Binary-Strings Errors
Contributor

Perhaps worth mentioning that it also errors when a """ is not immediately followed by optional whitespace and a newline? Elixir says:

iex(1)> """foo"""
** (SyntaxError) iex:1:1: heredoc allows only optional whitespace followed by a new line after """

Contributor Author

Since we have to stay compatible with existing code, where strings are delimited by a single double-quote character, we may have to allow code that for some reason contains """foo""", which in Erlang today produces "foo". If we use backticks instead of quotes, then your point stands.

Contributor

Perhaps it is worth pushing a deprecation warning to Erlang/OTP 26.x that warns on triple quotes?

eeps/eep-0064.md Outdated

### Runtime semantics

Triple-quoted strings should only produce binary-strings. This makes easy to
Contributor

I worry this may be potentially confusing. Why is "foo" a list but """foo""" a binary? What if I want to write a long text but as a charlist instead? Here is a potential example. So, even though I would prefer binaries, I believe a more consistent option is to return charlists.

Then, for binaries, there are a few options:

  1. Write <<"""...""">>

  2. Build on EEP 63 to introduce u"""..."""

  3. However, for documentation in particular, the compiler can convert -doc """...""". into binaries, so the distinction may not be terribly important

Completely agreeing with José here. I think triple-quoted strings should produce strings in Erlang unless enclosed in a binary context; this would be more consistent with the existing Erlang syntax.

Contributor

I agree with José as well; I think it is confusing for """ to produce a binary. Another option for Unicode binaries could be to use backticks:

io:format(`backtick unicode binary`)
-doc ```
       backtick 
       multiline
       unicode
       binary
     ```.

Not the most convenient key on some keyboards, but JavaScript seems to get away with using it, so maybe we can as well.

Contributor

I like using backticks to mean binary. That nicely solves how to write binary literal strings on one line as well. I think we have a winner!

Backticks are interesting because they also exist in Prolog. Erlang's syntax comes from Prolog, as we all know (which also has single quotes for atoms). In SWI Prolog it seems to be configurable how double-quoted and back-quoted strings work: https://www.swi-prolog.org/pldoc/man?section=string and similarly in GNU Prolog.

But, for simple binary strings, isn't a prefix like b"binary string" simpler and more aligned with Python, C, Rust, etc. for different kinds of strings?

I think b"" is enough for UTF-8 encoded binaries as long as the source code is UTF-8 encoded, which it normally is. (If we then add triple quotes for charlist strings, then b"""...""" seems the natural choice for the binary version.)

(Btw C11 uses u"" for UTF-16 strings, U"" for UTF-32 strings and u8"" for UTF-8, so u"" for UTF-8 would be confusing in this context.)

Contributor

Yes, but a regexp string is a different animal where I guess you do not want any escape sequences, and therefore cannot use interpolation. Possibly chosen with a prefix, e.g r"foo\(\.\)+".

I like that Erlang becomes more Perl-like. :-)

@wojtekmach wojtekmach Jun 9, 2023

Finally, choosing backtick will also depend on the documentation format. If we choose Markdown, the backticks in Markdown are used for code snippets. Which means that, if I want to write string:to_upper(`hello`)

This will be also problematic with heredocs and fenced code blocks:

image

(I'm pasting an image because I don't know how to write it in Markdown so it renders correctly. :))

I'm personally a fan of code blocks done with 4 spaces indent but fenced blocks are sometimes useful too.

@wojtekmach Markdown allows any number of backticks, so if you need 3 backticks, just write it within 4 backticks. Perhaps Erlang could allow that trick too:

-doc ````
Some function.

Example:

```
> 1 + 2
3
```
````

(Note that the above example was enclosed in 5 backticks.)
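
The "use a longer fence than anything inside" trick can be sketched as follows. This is a minimal Python model of the idea only, not part of the proposal; the function name `delimiter_for` and its parameters are hypothetical.

```python
import re

def delimiter_for(content: str, char: str = "`", minimum: int = 3) -> str:
    """Return a run of `char` that is longer than any run inside `content`.

    Hypothetical helper: it models the Markdown-style rule that a fence of
    N characters can enclose any text whose longest run is shorter than N.
    """
    runs = re.findall(re.escape(char) + "+", content)
    longest = max((len(run) for run in runs), default=0)
    return char * max(minimum, longest + 1)

# The documentation text contains a three-backtick fenced block,
# so the enclosing delimiter needs four backticks.
doc = "Some function.\n\nExample:\n\n```\n> 1 + 2\n3\n```\n"
fence = delimiter_for(doc)
wrapped = fence + "\n" + doc + fence
```

The same rule would apply to an "at least three quotes" delimiter by passing `'"'` as `char`.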

@williamthome williamthome Jun 11, 2023

I like the backtick idea, but I'd prefer it to be used as in JavaScript, where it permits interpolation, for example:

Bar = `bar`,
<<"foo bar 2">> = `foo ~s{Bar} ~p{1+1}`.

or

-define(BAR, `bar`).

-doc ```
foo ~s{?BAR} ~p{1+1}
```

Or (why not) a single backtick as an alternative to triple quotes, using the same idea: the last backtick defines the indentation.

-doc `
backtick 
multiline
unicode
binary
`

% Single line also a valid syntax
-doc `backtick unicode binary`

BTW, triple quotes give more control of the encoding (I don't see any problem writing the below)

<<"""
foo bar
"""/utf8>>

@essen

essen commented Jun 8, 2023

I'm not a fan of triple quoting for this. Providing an end marker avoids having to worry about escaping and whatnot. Something like this in a terminal:

<<EOF
hello
EOF

But of course adapted to Erlang syntax. The end marker is whatever you need, it could be <<PIE or other.

eeps/eep-0064.md Outdated
has no indentation
"""

Equivalent to: <<"This text has no indentation\n">>

Would it make sense to allow a comment after \?
For example:

"""
foo \ % Maybe I want to break this line to add a comment about it
bar
"""

Contributor

I think that's ambiguous because \ is no longer at the end. Should \% be allowed then at the end? And, if so, how do you actually know \ is at the end? Couldn't \ also be used to escape space for some reason?

> I think that's ambiguous because \ is no longer at the end.

Oh, this is true and makes things tricky.

> Should \% be allowed then at the end?

Oh, using \% makes sense to me.

> And, if so, how do you actually know \ is at the end?

I think this is easy. I've now implemented this, based on your idea of \%, on this branch of a side project called heredocs. This is what the implementation outputs:

"foo bar" = """
            foo \% This is a comment and must be ignored
            bar\
            """

I don't know if this is a good idea, but it's possible.

Contributor

My concern is that the last example is still ambiguous with escaping the % character itself. :)

Contributor

I do not understand why you want a comment in a multi line string.
All in it should be string content, otherwise it isn't very useful, right?

We want it for Markdown documentation, and Markdown uses trailing \ for forcing a line break, which clashes head on with using trailing \ for newline escape.

So I do not know if we want newline escape. Or any escape codes at all, for that matter, since Markdown uses \ for escaping things, just as the regular Erlang string syntax, and these may also collide...

Contributor

> We want it for Markdown documentation, and Markdown uses trailing \ for forcing a line break, which clashes head on with using trailing \ for newline escape.

\ will escape the new line character. To force the line break, you need \\ (because Markdown needs to see the \ itself). So to me it makes logical sense: if you use \\, you are escaping the character that would escape/hide the line break. :)

Comment inside the string makes the syntax too complex IMO. It's already possible to put two strings side-by-side to concatenate them, e.g.

"foo " % this comment is ignored
"bar" = """
        foo \
        """
        %% this comment would be ignored too
        """
        bar\
        """.

> We want it for Markdown documentation

Sorry @RaimoNiskanen, I commented without thinking about Markdown.
BTW, we can write comments in Markdown in many different ways.

@michalmuskala
Contributor

michalmuskala commented Jun 13, 2023

> I'm not a fan of triple quoting for this. Providing an end marker avoids having to worry about escaping and whatnot. Something like this in a terminal:
>
> <<EOF
> hello
> EOF

This would make Erlang syntax context-dependent. You wouldn't be able to parse this anymore with yecc as it is today, since you'd need to track the "end" quotes.


As for triple-quote syntax in general, I love it in Elixir. However, having used Rust quite extensively, which doesn't have it, I find that regular multi-line strings can serve this use case relatively decently. With good-enough formatter integration, it's not particularly annoying either. For example, see how it's used in Rust Analyzer for testing - https://github.com/rust-lang/rust-analyzer/blob/f8dec25bd70cc3568069daf8c3d5f2a65e3aa4cb/crates/ide/src/move_item.rs#L193.

I'd say the much more useful feature is "raw strings" that disable escape sequences.

Taking examples from the EEP, they would need to be written as below:

-doc "
Removes double-quotes (\") from a given string
".

foo() ->
    X = "
This is the beginning of the triple-quote text
",
    use_x(X).

The un-nesting of string under X is fairly awkward, but is significantly less painful than I initially anticipated seeing it for the first time.

This does leave the issue of the extra whitespace at the beginning (and end) though it could be handled with "escaping the newline", and potentially -doc could automatically strip the string:

foo() ->
    X = "\
This is the beginning of the triple-quote text\
",
    use_x(X).

While this is "uglier" it does avoid introducing a new concept and syntactic form to the language.

@josevalim
Contributor

@michalmuskala I haven't felt the need for triple quoted strings in Erlang so far but if the goal is to introduce docstrings then I think the need for them is well justified. Imagine writing examples in the documentation and having to escape every string in the documentation itself. So I think the #"..."# delimiters in the Rust example is pretty important too.

@williamthome

williamthome commented Jun 13, 2023

@michalmuskala My concern with generalizing this syntax is lists of integers. How would this be handled:

1> [92,65,66,67].
"\\ABC"

If I understand you correctly, the compiler would automatically strip the 92.

Edit: I think I'm wrong, because the compiler should know how to deal with this, since it will be handled by the scanner. Anyway, I'll keep my comment here.

@essen

essen commented Jun 13, 2023

>> I'm not a fan of triple quoting for this. Providing an end marker avoids having to worry about escaping and whatnot. Something like this in a terminal:
>>
>> <<EOF
>> hello
>> EOF
>
> This would make Erlang syntax context-dependent. You wouldn't be able to parse this anymore with yecc as it is today, since you'd need to track the "end" quotes.

I'm not sure why the parser would have to be context dependent. The scanner, sure, but we already can't parse Erlang with leex so this won't change things much. The string can be processed and made a regular string long before we reach the parser.

@zuiderkwast

If it's only for -doc, it's not required to add a new string literal. We can just have a multi-line doc attribute, something like this:

-doc.
Removes double-quotes (\") from a given string
-end.

... which would need to be handled by the scanner and be equivalent to -doc "Removes double-quotes (\") from a given string".

@michalmuskala
Contributor

> @michalmuskala I haven't felt the need for triple quoted strings in Erlang so far but if the goal is to introduce docstrings then I think the need for them is well justified. Imagine writing examples in the documentation and having to escape every string in the documentation itself. So I think the #"..."# delimiters in the Rust example is pretty important too.

Yes, I fully agree that "raw" strings, or strings that disable escaping, are important - more than triple-quoted strings. Their utility would go beyond docs, e.g. for regexes.

@williamthome

@zuiderkwast It's not only for documentation; it's also useful for writing better multi-line strings in Erlang code. See the EEP abstract and rationale. Documentation is the main reason, but not the only purpose.
See also this example from @josevalim.

@williamthome

Documentation is the main reason for this EEP, and Markdown has been mentioned many times as something to be compatible with and not clash with. But while reading about Markdown for technical writing, I ended up at AsciiDoc:

[...] AsciiDoc is a lightweight and semantic markup language primarily designed for writing technical documentation.
[...] An AsciiDoc processor can parse and comprehend an AsciiDoc source document and convert the parsed document structure into one or more output formats, such as HTML, PDF, EPUB3, man(ual) page, or DocBook. The ability to produce multiple output formats is one of the main advantages of AsciiDoc.
[...] AsciiDoc bridges the gap between ease of writing and the rigorous requirements of technical authoring and publishing. AsciiDoc only requires a text editor to read or write, thereby offering a low bar to getting started.

The license is Apache 2.0. It's simple and powerful. O'Reilly has it in their docs and has a repo with examples of how to, so books are written using it. Looking at the examples the format is compatible with GitHub.

I'm not trying to cause any confusion about this, quite the opposite. Maybe it can be a better alternative to Markdown for docs.
I saw @essen using AsciiDoc in Cowboy, so it's not something new for all.

This link compares it to Markdown and this link talks about Markdown compatibility.

@RaimoNiskanen
Contributor

@essen

> I have these kinds of strings all over my tests and if I understand correctly they cannot be expressed because the \r is removed from the line ending and it's not possible to include it in the MIV string (no escaping).

Yes or no. It should be possible to use Verbatim strings for these test strings.

Your editor should allow you to insert CR into the source code, last on the line. In Emacs you use Ctrl-Q to quote the CR ([Enter]). It is only on the last line; where CR? LF Whitespace* " " " is considered to be the string ending. So CR LF is stripped only from the last content line.

All whitespace (CR LF TAB HT ...) on the preceding lines is copied verbatim.

Also, the string start is " " " Whitespace* LF, which is not part of the string content.

Note that CR is considered Whitespace in the scanner, but LF is end of line.
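
The start and end rules described above can be modeled in a few lines. The following is a rough Python sketch, not the actual Erlang scanner: the name `scan_triple_quoted` is hypothetical, whitespace is assumed to mean space, TAB and CR, and indentation removal is deliberately left out.

```python
import re

# Opening delimiter: """ Whitespace* LF (CR counts as whitespace, LF ends the line).
_OPEN = re.compile(r'"""[ \t\r]*\n')
# Closing delimiter: CR? LF Whitespace* """
_CLOSE = re.compile(r'\r?\n[ \t\r]*"""')

def scan_triple_quoted(src: str) -> str:
    """Return the content of a triple-quoted string that starts at src[0]."""
    opening = _OPEN.match(src)
    if opening is None:
        raise ValueError('expected """ followed by optional whitespace and LF')
    # Start one character early so the LF of the opening line can also
    # serve as the LF of the closing delimiter (the empty-string case).
    closing = _CLOSE.search(src, opening.end() - 1)
    if closing is None:
        raise ValueError('missing closing """ at the start of a line')
    # Everything in between is copied verbatim; only the final CR? LF is dropped.
    return src[opening.end():closing.start()]
```

Under these assumptions a CR LF inside the content survives, while the CR LF right before the closing delimiter does not, which is the behaviour the comment describes.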

@RaimoNiskanen
Contributor

@williamthome: Regarding trailing backslash and escaping. Just because of all the quirks of escape sequences e.g. by backslashes in different documentation text formats such as Markdown, maybe also AsciiDoc, and whatnot, I think it is simplest to not have any escape character in these Verbatim strings.

We'll see what to do about value interpolation one day...

Improve the use of conditionals and punctuation.

Rename to "Verbatim Multi-line Indented Strings" to emphasize
the verbatim property and to avoid ambiguity about how to
pronounce the abbreviation.

Clarify that there are two forms of the empty string.
@RaimoNiskanen RaimoNiskanen changed the title EEP 64: Multi-line Indented Verbatim Strings EEP 64: Verbatim Multi-line Indented Strings Jun 19, 2023
@RaimoNiskanen
Contributor

Thank you for the feedback! I have updated (and renamed, again).

I realized that there are two forms of empty string, so I wrote a paragraph about that.
My suggestion is to allow both forms.

@essen

essen commented Jun 19, 2023

> @essen
>
>> I have these kinds of strings all over my tests and if I understand correctly they cannot be expressed because the \r is removed from the line ending and it's not possible to include it in the MIV string (no escaping).
>
> Yes or no. It should be possible to use Verbatim strings for these test strings.

Alright then it is possible but it is not ideal: the CR is hidden (like the LF), so it's no longer possible to see the difference between a test that tries sending requests with only LF, compared to a test that properly sends CRLF for example. You can see it if you display hidden whitespaces but you won't see it on GitHub or others.

Modifying the string to make the LF into CRLF would make things more obvious I think (crlf(""" ... """)). Perhaps this warrants an example in the EEP.
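
A post-processing helper of the kind suggested here could look roughly like this. It is a Python sketch of the idea only; `crlf` is the hypothetical function name from the comment, and the request text is made up for illustration.

```python
import re

def crlf(text: str) -> str:
    """Expand every bare LF into CR LF, leaving existing CR LF pairs untouched."""
    return re.sub(r"(?<!\r)\n", "\r\n", text)

# A test input pasted verbatim with LF line endings...
request = "GET / HTTP/1.1\nHost: example.org\n\n"
# ...becomes a proper CR LF terminated request after post-processing.
wire = crlf(request)
```

The line endings stay invisible in the source, but the call to `crlf` makes the intent explicit at the point of use.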

@RaimoNiskanen
Contributor

@essen

> You can see it if you display hidden whitespaces but you won't see it on GitHub or others.

That is a problem. In Emacs it shows as ^M, unless, I guess in DOS mode where it probably is hidden. So Emacs is an exception, and that is not enough.

> Modifying the string to make the LF into CRLF would make things more obvious I think (crlf(""" ... """)). Perhaps this warrants an example in the EEP.

There are many suggestions on creative string prefixes appearing.

b" u" re" crlf" f" d", ...
Binary, Unicode, Regular Expression, CR-LF-expanded, Formatted (Interpolated), Debug Formatted (Verbose Interpolation), and probably more.

How to combine them into a reasonable semantics will have to be a future problem. This EEP mentions "specialized strings such as regular expressions, interpolated variables ([PR-7343][]), Unicode binary strings, etc" and I think this can fall under other "specialized strings".

Your use case needs escape sequences, so concatenating single quoted strings, as you do, is maybe not that bad...

@essen

essen commented Jun 19, 2023

@RaimoNiskanen

> crlf"

Probably not a good idea anyway. CRLF is increasingly becoming a legacy problem. I wouldn't build a feature around it. Workarounds in the documentation are appreciated though.

> Your use case needs escape sequences, so concatenating single quoted strings, as you do, is maybe not that bad...

The downside is not being able to just copy paste stuff, for example examples from RFCs or output from developer tools. Or copying a test input into netcat to try it in a terminal. So I think the crlf(""" ... """) solution would be the best fit for this particular scenario (tests).

eeps/eep-0064.md Outdated
Comment on lines 15 to 16
This EEP proposes the introduction of Verbatim Multi-line Indented strings,
*VMI Strings*, and defines their semantics. The main benefit is to allow

Is the term "VMI Strings" going to be used in docs whenever we describe such strings? I don't love this acronym. I think it's nicer to call them triple-quoted strings.

Contributor

@essen: I guess the fitting general feature here would be Multi-line Indented string, that has escape sequences. The question is if that merits a prefix...

@zuiderkwast: I just wanted to call them something in the EEP that was not too long, since I have to mention them many times. "Triple quoted strings" is a better name, even if they could be "N>=3 quoted strings". "Verbatim strings" is another possibility. But if we later add prefixes to make them more flexible, e.g. lose "verbatim", it is maybe "triple quoted" that is the one property that must remain...

Contributor

"Triple quoted" or "Multi-line". I'd say we cannot take that away from them either. That they can be indented is just what makes them good for multi-line.

@RaimoNiskanen
Contributor

I have polished two small language details. That, I hope, would be the last on this track.

Now, in the next commit, I will switch viewpoint and argue for multi-line strings with escape sequences...

Polish the text and clarify CR LF handling
@RaimoNiskanen RaimoNiskanen changed the title EEP 64: Verbatim Multi-line Indented Strings EEP 64: Triple-Quoted Strings Jun 20, 2023
@RaimoNiskanen
Contributor

> Now, in the next commit, I will switch viewpoint and argue for multi-line strings with escape sequences...

That did not happen. I renamed them, again, to "triple-quoted strings" and clarified some details.

I tried to see that escape sequences would be useful, but ended up back in that it will be nice to have them out of the way for any documentation text markup source format

@williamthome

williamthome commented Jun 20, 2023

@RaimoNiskanen Now I see what you mean and I agree with you. This EEP should not include the trailing backslash or any other escape; that is a job for post-processing or formatting functions. Triple-quoted strings should only handle multi-line strings and permit "quotes inside quotes".
An example where escape sequences can be a problem: suppose the -doc attribute is implemented and accepts a file as its argument, for example

-doc {file, "foo.md"}.

and the content of foo.md is

This is the first line.\
This is the second one.

then, if the trailing backslash is treated as a newline escape,

% this will be the output:
-doc "This is the first line.This is the second one."

% but this is expected
-doc "This is the first line.\\\nThis is the second one."

So, I switch viewpoint and I'm against the trailing backslash that I mentioned before.
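
The clash can be reproduced with a tiny model. This is a Python sketch, assuming the newline escape simply deletes a backslash followed by LF; `join_escaped_newlines` is a hypothetical name, not anything from the EEP.

```python
def join_escaped_newlines(text: str) -> str:
    """Model of a trailing-backslash newline escape: drop every backslash + LF pair."""
    return text.replace("\\\n", "")

# Markdown source in which the trailing backslash is meant as a hard line break.
markdown = "This is the first line.\\\nThis is the second one."

escaped = join_escaped_newlines(markdown)  # the Markdown line break is swallowed
verbatim = markdown                        # a verbatim string leaves it for Markdown
```

With escape processing, the Markdown renderer never sees the backslash; with verbatim content it does.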

@zuiderkwast

@williamthome quotes cannot be escaped. Nothing can be escaped.

@essen

essen commented Jun 20, 2023

> I tried to see that escape sequences would be useful, but ended up back in that it will be nice to have them out of the way for any documentation text markup source format

Yes I'm arriving at the same conclusion. It will be very useful to not have to escape anything, and not just for documentation. Strings can always be post-processed if necessary (even as a parse transform if that's important).

eeps/eep-0064.md Outdated
"\nX" = """

X
""""

Suggested change
""""
"""

Contributor

Oops!

eeps/eep-0064.md Outdated
"\r\nX" = """

X
""""

Suggested change
""""
"""

* Verbatim strings:        Motivate more why
* Comparison with Elixir:  Fleshes out the motivations
@RaimoNiskanen
Contributor

I added the sections "Verbatim strings" and "Comparison with Elixir"

@josevalim
Contributor

For what it's worth, I am very happy with the current proposal. It outlines the problem well, the design decisions and their limitations, plus possible future improvements. 👍 Regarding the name, I would indeed propose verbatim strings, as it clarifies that they do not support escape characters.

@RaimoNiskanen
Contributor

I also think "Verbatim Strings" is a better name, today, but given that a future extension might be Sigils, these Triple-quoted Strings could start to support e.g. escape sequences and interpolation...

So "triple-quoted" is the thing that they always will be. Except maybe if they become "at-least-triple-quoted", but the name is still sort of valid.

"Heredocs" could be a name, especially for a free-to-choose delimiter as at-least-triple-quoted. I actually think the name "heredocs" in Elixir is a bit misleading since the delimiter is fixed to """, but in shells, where the name originates, the delimiter is a free-to-choose word.

@RaimoNiskanen RaimoNiskanen merged commit f61c8a3 into erlang:master Jun 28, 2023
1 check passed
@RaimoNiskanen
Contributor

RaimoNiskanen commented Jun 28, 2023

I merged this EEP which only means that it is recognized and in status: Draft.

I also wrote an implementation of the EEP: erlang/otp#7451
