Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RFC: line-endings #1212

Merged
merged 1 commit into from
Aug 27, 2015
Merged

RFC: line-endings #1212

merged 1 commit into from
Aug 27, 2015

Conversation

RalfJung
Copy link
Member

@RalfJung RalfJung commented Jul 15, 2015

This RFC proposes to define a "line" as terminated by either \n or \r\n`. Also see rust-lang/rust#26743.

Rendered output

@RalfJung RalfJung changed the title Add RFC line_endings RFC: line-endings Jul 15, 2015
@alexcrichton alexcrichton added the T-libs-api Relevant to the library API team, which will review and decide on the RFC. label Jul 15, 2015
@ben0x539
Copy link

What about just '\r'?

@nagisa
Copy link
Member

nagisa commented Jul 15, 2015

That is not a new line and does not behave like one on any system I am aware of.

@yigal100
Copy link

@nagisa Just '\r' was used on Mac OS before Apple switched to Mac OS X. There used to be and maybe still exist similar open source (I should say hobby) OSes.
In any case it isn't used on any mainstream / modern OS and is a mostly historic edge case.

anecdote:
The distinction between \n and \r is in itself an historic carry over from before computers. On a mechanic typewriter in order to start writing the next line of text the user had to perform two distinct actions:

  1. Return the writing head to the beginning of the line ("carriage return" or '\r')
  2. Advance the paper to the next line ("line feed" or '\n')

It is ridiculous that this was carried to computers where there is no carriage that needs to be returned...

@RalfJung
Copy link
Member Author

It is ridiculous that this was carried to computers where there is no carriage that needs to be returned...

Yeah, just imagine we had fully digital screens that still scaled and messed up the (perfectly pixel-aligned) image they get via HDMI input, just to be compatible with some analog screens from the 60ies. Oh wait...

Back on topic though, I did not include \r because (a) it's fairly exotic nowadays, as already mentioned, and (b) it's harder to implement. The change for this one is actually a 3-liner, since we can still split at \n, and just add some post-processing.

@emberian
Copy link
Member

I don't really have a strong opinion about this. I do think that this will make lines less suitable for use in robust parsers, but it's my opinion that robust parsers shouldn't have used it in the first place, in favor of nom or somesuch. This does make it much easier to write "do what I want" simple code.

@ruuda
Copy link

ruuda commented Jul 21, 2015

As a developer who works both on Windows and GNU/Linux I am in favour of this. I have never encountered a carriage return followed by a line feed that was not intended as a newline, and even if people worked around the current behaviour by manually stripping the carriage return, this will not break anything (assuming only carriage returns were stripped, not blindly the first character before a line feed).

@jminer
Copy link

jminer commented Jul 22, 2015

I strongly support this change. On Windows, \r\n is considered a newline (not just the \n alone). Notepad (Windows' most basic text editor) doesn't even recognize a lone \n as a newline. I really wish every app on every platform supported both \n and \r\n.

@withoutboats
Copy link
Contributor

Why not extend this RFC to also recognize uncommon but well-defined line separators, specifically U+0085 NEXT LINE, U+2028 LINE SEPERATOR, and U+2029 PARAGRAPH SEPARATOR?

@RalfJung
Copy link
Member Author

I don't know of any language framework that will treat these special characters as a newline per default, in their basic notion of what a "line" is. Adding \r\n and just that makes Rust consistent with most languages out there, and it is trivially implementable with virtually no performance impact: Just continue splitting at \n, and then remove a single trailing \r if present. In particular, read_until for BufRead can be easily employed, which is no longer the case if the pattern is to be generalized.

@nagisa
Copy link
Member

nagisa commented Jul 23, 2015

Most of the languages are not unicode-aware or unicode-focused as rust is. Unicode standard specifies that these characters should be accounted for.

@nagisa
Copy link
Member

nagisa commented Jul 25, 2015

Thinking about it again, I really agree with @withoutboats; these unicode line boundaries should be supported.

👎 from me on this RFC until it is updated to include all the line boundaries.

@petrochenkov
Copy link
Contributor

I'm a bit out of the loop, why \n is trimmed off in the first place? I remember lines() didn't trimmed anything around 1.0 (exactly to avoid this kind of bikesheddy discussions , I guess).

Anyway, big -1 to unicode, if you need exotic separators for some exotic reason, make a crate, use it, put it to crates.io and don't make lines() unimplementable by simple byte search. \r\n is an unfortunate legacy we, Windows people, have to live with, so I guess it have to be trimmed off if lines() trims anything at all.

Edit: actually I'm for separate method for \r\n like lines_any with the next recommendation: use lines() on files controlled by you (they should contain only \ns) and use lines_any() only when you have to, i.e. on uncontrolled files possibly containing \r\ns or console.

@aturon aturon self-assigned this Jul 29, 2015
@mitsuhiko
Copy link
Contributor

Super strong -1 on supporting \r\n generally as a line ending. I think it would make more sense to define clearly where line conversion should take place. For instance it makes a lot of sense for opening files to perform a conversion between \r\n to \n as a policy on platforms that require that.

The support for \r\n in various places in Python has become an annoying problem and I think a new language should not carry over this into every code every written. Instead provide strong support for performing those conversions similar to how we only encode/decode unicode in one place.

@RalfJung
Copy link
Member Author

@mitsuhiko So, what do you propose instead? In my view, this is exactly what I am proposing: We have APIs that work on file data byte-per-byte, we have APIs that work char-per-char (and involve unicode), and then we have APIs that work line-by-line - and these are exactly the right place to handle different conventions of treating file endings.

Note that I am defining pretty clearly where line conversion should take place: When you call the lines() iterator. This is very unlike Python, where line endings are handled like in C: Even when you bread char-by-char, you see \n uniformly as line ending, because the "text mode" takes care of that below the character level. Rust has no "text mode", and I am not proposing to add one.

@mitsuhiko
Copy link
Contributor

APIs that work line-by-line - and these are exactly the right place to handle different conventions of treating file endings.

In my mind APIs should assume that lines are terminated with \n. It's the responsibility of the input/output system to perform the newline conversions.

That's the same rule that is also applied for unicode handling.

@aturon
Copy link
Member

aturon commented Aug 11, 2015

@mitsuhiko

It's the responsibility of the input/output system to perform the newline conversions.

I'm a bit confused here -- isn't the proposed API part of the I/O system?

Update: ah, I forgot this affected str as well.

@nagisa
Copy link
Member

nagisa commented Aug 11, 2015

I'm a bit confused here -- isn't the proposed API part of the I/O system?

RFC proposes changing both BufRead::lines and str::lines. The latter one is most certainly not a part of any I/O scheme.

@aturon
Copy link
Member

aturon commented Aug 11, 2015

@nagisa Yes, I'd forgotten that this RFC also touched str. I agree that str::lines is not part of I/O, of course.

So, @mitsuhiko, your argument is that I/O should actually convert, and not just parse, newlines? I disagree that this is how things work with Unicode: there's no conversion step there (unless you count the _lossy methods); rather, there are various functions that assume the data is already in UTF8.

To be honest, having I/O do a conversion of the data seems more intrusive to me; we try pretty hard in Rust to expose system APIs without imposing a lot of extra semantics. And I agree with the basic thrust of the RFC, that the expectation when seeing a lines parsing function in I/O is that it will handle common forms of newline markers.

The situation for str::lines is more subtle since, as you imply, at that point you've already adopted at least an encoding (UTF8); there's perhaps somewhat less of an expectation about the flexible behavior of str::lines. Nevertheless, it's entirely possible to wind up with a str from I/O without having done any newline conversion. And given the existence of the split method, I still don't see a good reason not to have an ergonomic, default lines parser that works across multiple newline conventions. If you want \n only, it's trivial to get that with split.

The support for \r\n in various places in Python has become an annoying problem

Can you elaborate on the problem?

@mitsuhiko
Copy link
Contributor

@aturon

disagree that this is how things work with Unicode: there's no conversion step there (unless you count the _lossy methods); rather, there are various functions that assume the data is already in UTF8.

That's only the case because what you read just happens to match the encoding you deal internally with. If you had a hypothetical API that reads iso-8859-1 encoded files then in Rust that would be a re-encode into UTF-8 on the way in and not a string class that deals with iso-8859-1 encoded charpoints. You might have a stream based API that re-encodes charpoints step by step if that is more likely, but at least the individual chars would always be unicode charpoints.

This is how things are "supposed" to work on Windows. If you open a file in text mode on Windows the newlines are \n and not \r\n.

Can you elaborate on the problem?

The problem is that it's nearly impossible to understand if code will deal with non \n newlines properly. For instance a str.splitlines handles \r, \n as well as \r\n as newlines. However if you pass True to it the newlines remain and then some code gets this very wrong. Largely this is caused by code having different assumptions about what valid newlines are. For instance a multipart form data parser that was quite popular in Python had a bug that would get continuations wrong if a \r\n was split into \r and \n at a chunk boundary because it did not know that \r was a valid newline terminator for Python.

There is plenty of code that works on \n only and then just passes \r alone through which ends up creating a mix of \n and \r\n in the document. Sometimes lone \r show up because someone ended up running x.rstrip('\n') on a line.

All of this is emphasized by the fact that there are no good APIs to work with newline data other than .splitlines() and the text read mode which however only exists on windows. This then causes problems on Unix systems where \r\n show up for HTTP and there is no support for converting the newlines.

@RalfJung
Copy link
Member Author

Hi,

In my mind APIs should assume that lines are terminated with \n. It's the responsibility of the input/output system to perform the newline conversions.

That's the same rule that is also applied for unicode handling.

There is no conversion going on with unicode. Instead, you call APIs
that imply that you "view this buffer as a UTF-8 buffer", and then you
get various additional operations unlocked.
So, I absolutely disagree here. The I/O system is about getting bytes
read and written, and it shouldn't (and doesn't) care about what these
bytes mean. It's up to the programmer to apply higher-level APIs, that
know about concepts such as "characters" or "lines".
(Or maybe when you say I/O system, you have a larger scope in mind, in
which case I'd argue that BufRead is part of the I/O system.)

The API for line-based access corresponding to this would be a type that
wraps any kind of buffer and provides a line-based iterator. We
shouldn't be changing what the characters are, that just leads to
confusion - we should change how we view them.
Neither String nor BufRead should know what a line is, and BufRead
shouldn't know about String either.
We could have a trait "Lines", with the lines() method returning an
iterator that yield lines, and implement that trait for String and
BufRead. Or maybe for BufRead only, as it should be possible to get a
BufRead on a String? The latter would be a much less efficient
implementation, though.

However, it's too late to go for such an API, we already have lines()
functions in String and BufRead, implying that they both know what a
"line" is. What I am proposing is merely to make sure that these APIs
live up to their promise, and actually teach them about what lines are.
Using the vocabulary I used above, these two functions are the ones that
change "how we view the characters", they are abstracting from a
sequence of characters to a sequence of lines - they very much define
what a line is. I'm suggesting to change that definition, to match
better with the programmers' expectations and real-world files that
programs are going to be working on.

It may be that I still did not understand your proposal. Clearly, you
are saying that something, somewhere, should know about \r\n, so if my
reply here doesn't answer your concerns, maybe you could draft an API
that works the way you envision it?

@mitsuhiko
Copy link
Contributor

The I/O system is about getting bytes read and written, and it shouldn't (and doesn't) care about what these bytes mean.

I completely disagree. This only works in a world where no conversions should be performed. Note that I am not saying that any IO operation should do this. However the place to perform encoding/decoding/newline conversions should be one transformation step at the boundary to not drag that problem into every single part of the system.

While you can argue that rust currently does not have that problem because it only supports UTF-8, the reasonable place to perform unicode handling has traditionally always been the IO boundary and not "every part of the system".

Clearly, you are saying that something, somewhere, should know about \r\n, so if my reply here doesn't answer your concerns, maybe you could draft an API that works the way you envision it?

For instance I would imagine that there could be a wrapper for a buffered reader or iterator that converts newlines.

@mitsuhiko
Copy link
Contributor

Also remember that Unicode also defines more terminators. If there is a hypothetical system in the future that has another one you don't want to have to update each and every library ever written that deals with newlines, but just the place where the conversion takes place.

@mitsuhiko
Copy link
Contributor

Also another note: there are still systems which use \r only. For instance if you ever opened up an XML file generated by OS X system APIs you will notice that they terminate like this. A good example are keyboard layout files. You can find an example of such a file here: https://raw.githubusercontent.com/mitsuhiko/osx-keyboard-layouts/master/US%20ISO.keylayout

@RalfJung
Copy link
Member Author

For instance I would imagine that there could be a wrapper for a buffered reader or iterator that converts newlines.

That sounds to me like what I wrote in the paragraph "The API for
line-based access" above, doesn't it?

@mitsuhiko
Copy link
Contributor

That sounds to me like what I wrote in the paragraph "The API for line-based access" above, doesn't it?

If that is one API isolated then I'm all for it. If that is like in the RFC then I am not a fan of it, because it will just add to a list of ever growing APIs that need to be aware of different newlines.

@SimonSapin
Copy link
Contributor

@brson I don’t know. I suppose this was written in a world where most programs on your IMB S/360 would not support anything other than EBCDIC (an encoding competing with ASCII) with NEL for newlines.

I’m not that familiar with the Windows ecosystem. I’d say let’s wait and see if someone asks for println! to write \r\n.

@retep998
Copy link
Member

What I think is important is that Rust be able to read any file ending equally well. So if I ask for the next line, it'll treat \r\n the same as it does a plain \n across any platform.

What I don't think works too well is having Rust write different newlines based on the platform it is running on. When printing to the console, \n already works just fine to move to the next line, but if you're using writeln! to write to some arbitrary Write thing, it would be surprising if suddenly it was writing a different kind of newline just because you switched platforms. Maybe we could have a type that wraps around a Write with the sole purpose of translating \n to the preferred newline for that platform.

@RalfJung
Copy link
Member Author

Why not just keep the lines functions and let them work on \n only? There is nothing wrong with them.

That would be very confusing, having some functions called lines support different line delimiters than others.

Regarding the Unicode spec, I wonder if there is any precedent of a language using unicode line separators for their built-in notion of "lines"? Maybe the unicode crate (that some functionality of String also moved/is going to move to) would be a better place for that? On the other hand, Rust has better unicode support than most other languages. I guess the question here is whether supporting these line endings is more like supporting unicode-character iteration and belongs into libstd, or more like supporting unicode-grapeheme iteration. In the end, I'm fine either way.

If we decide to go with unicode support, suddenly read_line also is in scope for this RFC. I strongly think that read_line and lines should be in sync, the main difference being that read_line does not strip the separator. (Of course, it also doesn't yield an iterator, but that's a different story.)

If we make this change, then reading the lines of a file and outputting them again will produce a different file.

The lines iterator loses information, namely, which line separator has been used. This is pretty much by design, and I don't think it is a problem.

Maybe we could have a type that wraps around a Write with the sole purpose of translating \n to the preferred newline for that platform.

I'd rather prefer an API similar to lines, but for writing. So far, Rust has successfully resisted the temptation to mess with the byte stream in order to support different kinds of newlines: There is no magic translation of \r\n to \n, followed by a function that splits at \n, going on. Having a "lines" iterator leaves the door open to alternative "lines" iterators implemented in crates, keeps the newline processing out of the lower-level "byte stream" API, and makes it clear from the type what you are dealing with (BufRead? I'm getting the raw bytes/characters. Lines? I'm getting something higher-level.) No magic "file mode".
For writing, this is of course harder, because a write_line function has to pick some line separator, rather than just supporting a sensible set of them. I think for now \n is a sensible default, it is reasonably well supported (doesn't even Windows Notepad properly read \n by now?). Offering more choices here is, I would argue, out of scope for this RFC. (Except for the part that every line ending possibly chosen by write_line should be supported by lines, but I guess that goes without saying ;-).

@liigo
Copy link
Contributor

liigo commented Aug 21, 2015

That's meaningless to read and output a file without changing its contents.
If you change its contents, you always product different file.
2015年8月21日 08:52于 "Brian Anderson" notifications@github.com写道:

@SimonSapin https://github.com/SimonSapin

Only on output is it necessary to distinguish between them.

Rust does not do any special handling of newlines based on platform, like
some systems do. If we make this change, then reading the lines of a file,
and outputting them again will produce a different file.

Does Rust's treatment of newlines on output also need to be considered
here?


Reply to this email directly or view it on GitHub
#1212 (comment).

@nagisa
Copy link
Member

nagisa commented Aug 21, 2015

You just called cat a meaningless program.
On Aug 21, 2015 4:17 PM, "Liigo Zhuang" notifications@github.com wrote:

That's meaningless to read and output a file without changing its contents.
If you change its contents, you always product different file.
2015年8月21日 08:52于 "Brian Anderson" notifications@github.com写道:

@SimonSapin https://github.com/SimonSapin

Only on output is it necessary to distinguish between them.

Rust does not do any special handling of newlines based on platform, like
some systems do. If we make this change, then reading the lines of a
file,
and outputting them again will produce a different file.

Does Rust's treatment of newlines on output also need to be considered
here?


Reply to this email directly or view it on GitHub
#1212 (comment).


Reply to this email directly or view it on GitHub
#1212 (comment).

@RalfJung
Copy link
Member Author

An implementation of cat in Rust could easily use read_line, which preserves the separator. Actually, most of the time it doesn't have to care about line endings at all.

Now, there's a discussion we could have about why preserving the separator is tied to not using an iterator. That's a good question. For uniformity, both read_line and lines could take an argument specifying whether the separator is to be stripped or not. This opens a few cans of worms though, about backwards compatibility and the best way to design an API taking a boolean flag...

@retep998
Copy link
Member

Just for the record, on Windows 10, notepad.exe still does not understand \n newlines. Basically everything else does though.

@jminer
Copy link

jminer commented Aug 21, 2015

I agree with @retep998 that it is fine to write a newline as \n on Windows. I think some languages produce a different line ending based on the platform, but I prefer using \n even on Windows. Notepad is the only program I can think of that does not support reading both \r\n and \n. Hopefully Rust will support reading both too.

@liigo
Copy link
Contributor

liigo commented Aug 21, 2015 via email

@liigo
Copy link
Contributor

liigo commented Aug 21, 2015 via email

@Emerentius
Copy link

👍 on supporting \r\n.
👎 on unicode separators unless they are actually used in the wild, I don't think they are worth including. It seems to me, that they are so rare, that they only serve as an obscure attack surface and bug source noone thinks of at first. Not that we are preventing any of that with either choice, of course.

If they are deemed necessary, they could also be included later in a separate RFC.

@jansegre
Copy link

I just found out that Unicode has a line breaking algorithm best practices. There's a simplistic view on wikipedia, which is already somewhat complicated. I think these recommendations are akin to XID_Start and XID_Continue going to a separate crate, and also like .is_numerical() being a different method than .is_digit().

I expect that .lines() consider both \r\n and \n, from a subjective point of view. About the unicode line break, I would expect to either have a separate method or a dedicated crate for it, if I ever need it.

TL;DR 👍 for \r\n

@bstrie
Copy link
Contributor

bstrie commented Aug 25, 2015

I'm wary of getting into the weeds of supporting unicode separators. Let's follow the principle of least surprise and try to do what people most generally expect from other languages, and I don't know of any language that has set the precedent of being aware of unicode linebreaks.

@withoutboats
Copy link
Contributor

These arguments for not supporting unicode separators are convincing to me (as the person who first raised the idea of supporting unicode separators).

It wouldn't be surprising if, for example, some protocol defined \n as newline, and then allowed arbitrary unicode data other than \n in the body of a line; attempting to parse this protocol using a unicode separator aware lines() method would lead to surprising and very rare bugs.

@birkenfeld
Copy link

@brson

If we make this change, then reading the lines of a file and outputting them again will produce a different file.

This is already the case (although less prominent), since after lines() we don't know if the last line was terminated by a newline character or not.

In general, since lines() is a convenience API anyway, 👍 for \n and \r\n support.

@RalfJung
Copy link
Member Author

It wouldn't be surprising if, for example, some protocol defined \n
as newline, and then allowed arbitrary unicode data other than \n
in the body of a line; attempting to parse this protocol using a
unicode separator aware lines() method would lead to surprising and
very rare bugs.

Parsing such a protocol with lines() would be a bug anyway, one should
use split('\n'). lines() is really for "I'm parsing text, and I want
the lines", it is not for protocols that happen to use a particular
line ending as separator for something.

@withoutboats
Copy link
Contributor

@RalfJung Right (except that split('\n') can't be used on a BufRead interface), but someone could easily do it as a quick hack and they shouldn't encounter really unusual surprises like "this obscure character that I've never heard of is a recognized line seperator."

@RalfJung
Copy link
Member Author

@RalfJung Right (except that split('\n') can't be used on a BufRead
interface)

Why not?
https://doc.rust-lang.org/beta/std/io/trait.BufRead.html#method.split

@withoutboats
Copy link
Contributor

@RalfJung - oh right, split(b'\n') can, but not if the line-ending is \r\n. Anyway this is off topic.

@alexcrichton
Copy link
Member

This was discussed this week at the libs team triage meeting and the conclusion was to merge. The APIs here seem to be a good improvement over what we have today without any loss of functionality, and pretty strong cases have been made to avoid unicode line endings for now.

Thanks again for the RFC @RalfJung!

@alexcrichton alexcrichton merged commit 8e0f4f2 into rust-lang:master Aug 27, 2015
@aturon
Copy link
Member

aturon commented Aug 27, 2015

Follow-up: we would like to land an implementation as soon as possible, and need to widely announce this potentially-breaking change when it lands. Unfortunately, this kind of semantic change is very hard to check for breakage using crater.

@RalfJung
Copy link
Member Author

Yay, my first RFC got accepted :)

I can give the implementation a try. Shouldn't be hard. I'll be traveling a lot tomorrow, should have time to do the coding easily - but I can't really compile anything while on battery, considering Rust's size and compile time.

@aturon
Copy link
Member

aturon commented Aug 27, 2015

@RalfJung looks like @alexcrichton beat you to it: rust-lang/rust#28034 -- but you could give the PR an initial review!

@RalfJung
Copy link
Member Author

Oh, okay, that was fast^^ - seems like I need to wait for my first non-trivial commit to the actual compiler ;-) . I'll check out his PR.

@tshepang
Copy link
Member

@alexcrichton is a lord

@RalfJung RalfJung deleted the line-endings branch August 6, 2018 09:15
@Centril Centril added the A-platform Platform related proposals & ideas label Nov 23, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-platform Platform related proposals & ideas final-comment-period Will be merged/postponed/closed in ~10 calendar days unless new substational objections are raised. T-libs-api Relevant to the library API team, which will review and decide on the RFC.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet