RFC: line-endings #1212
What about just |
That is not a new line and does not behave like one on any system I am aware of. |
@nagisa Just anecdote:
It is ridiculous that this was carried to computers where there is no carriage that needs to be returned... |
Yeah, just imagine we had fully digital screens that still scaled and messed up the (perfectly pixel-aligned) image they get via HDMI input, just to be compatible with some analog screens from the 60s. Oh wait... Back on topic though, I did not include |
I don't really have a strong opinion about this. I do think that this will make |
As a developer who works both on Windows and GNU/Linux I am in favour of this. I have never encountered a carriage return followed by a line feed that was not intended as a newline, and even if people worked around the current behaviour by manually stripping the carriage return, this will not break anything (assuming only carriage returns were stripped, not blindly the first character before a line feed). |
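The distinction drawn above (strip a carriage return only when it directly precedes a line feed, rather than blindly dropping whatever character comes before the line feed) can be sketched as follows. The helper name `strip_line_ending` is made up for illustration and is not part of the RFC or of std.

```rust
/// Hypothetical helper: strips one trailing line terminator,
/// either `\n` or `\r\n`, and leaves everything else untouched.
fn strip_line_ending(line: &str) -> &str {
    match line.strip_suffix('\n') {
        // Only look for a `\r` when a `\n` was actually removed.
        Some(rest) => rest.strip_suffix('\r').unwrap_or(rest),
        // No line feed at the end: a lone `\r` is kept as data.
        None => line,
    }
}
```

With this helper, `"abc\r\n"` and `"abc\n"` both become `"abc"`, while `"abc\r"` is returned unchanged.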
I strongly support this change. On Windows, |
Why not extend this RFC to also recognize uncommon but well-defined line separators, specifically |
I don't know of any language framework that will treat these special characters as a newline per default, in their basic notion of what a "line" is. Adding |
Most of the languages are not unicode-aware or unicode-focused as rust is. Unicode standard specifies that these characters should be accounted for. |
Thinking about it again, I really agree with @withoutboats; these unicode line boundaries should be supported. 👎 from me on this RFC until it is updated to include all the line boundaries. |
I'm a bit out of the loop, why Anyway, big -1 to unicode, if you need exotic separators for some exotic reason, make a crate, use it, put it to crates.io and don't make Edit: actually I'm for separate method for |
Super strong -1 on supporting The support for |
@mitsuhiko So, what do you propose instead? In my view, this is exactly what I am proposing: We have APIs that work on file data byte-per-byte, we have APIs that work char-per-char (and involve unicode), and then we have APIs that work line-by-line - and these are exactly the right place to handle different conventions of treating file endings. Note that I am defining pretty clearly where line conversion should take place: When you call the |
In my mind APIs should assume that lines are terminated with That's the same rule that is also applied for unicode handling. |
I'm a bit confused here -- isn't the proposed API part of the I/O system? Update: ah, I forgot this affected |
RFC proposes changing both |
@nagisa Yes, I'd forgotten that this RFC also touched So, @mitsuhiko, your argument is that I/O should actually convert, and not just parse, newlines? I disagree that this is how things work with Unicode: there's no conversion step there (unless you count the To be honest, having I/O do a conversion of the data seems more intrusive to me; we try pretty hard in Rust to expose system APIs without imposing a lot of extra semantics. And I agree with the basic thrust of the RFC, that the expectation when seeing a lines parsing function in I/O is that it will handle common forms of newline markers. The situation for
Can you elaborate on the problem? |
That's only the case because what you read just happens to match the encoding you deal internally with. If you had a hypothetical API that reads iso-8859-1 encoded files then in Rust that would be a re-encode into UTF-8 on the way in and not a string class that deals with iso-8859-1 encoded charpoints. You might have a stream based API that re-encodes charpoints step by step if that is more likely, but at least the individual chars would always be unicode charpoints. This is how things are "supposed" to work on Windows. If you open a file in text mode on Windows the newlines are
The problem is that it's nearly impossible to understand if code will deal with non There is plenty of code that works on All of this is emphasized by the fact that there are no good APIs to work with newline data other than |
Hi,
There is no conversion going on with unicode. Instead, you call APIs The API for line-based access corresponding to this would be a type that However, it's too late to go for such an API, we already have lines() It may be that I still did not understand your proposal. Clearly, you |
I completely disagree. This only works in a world where no conversions should be performed. Note that I am not saying that any IO operation should do this. However the place to perform encoding/decoding/newline conversions should be one transformation step at the boundary to not drag that problem into every single part of the system. While you can argue that rust currently does not have that problem because it only supports UTF-8, the reasonable place to perform unicode handling has traditionally always been the IO boundary and not "every part of the system".
For instance I would imagine that there could be a wrapper for a buffered reader or iterator that converts newlines. |
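One way such a wrapper could look, as a rough sketch only: an iterator adapter that rewrites every `\r\n` pair to `\n` at the boundary, so that downstream code only ever sees `\n`. The type name `NormalizeNewlines` is hypothetical and nothing like it is proposed in the RFC text.

```rust
use std::iter::Peekable;

/// Hypothetical adapter: yields the chars of `inner`,
/// rewriting each `\r\n` pair as a single `\n`.
struct NormalizeNewlines<I: Iterator<Item = char>> {
    inner: Peekable<I>,
}

impl<I: Iterator<Item = char>> NormalizeNewlines<I> {
    fn new(inner: I) -> Self {
        NormalizeNewlines { inner: inner.peekable() }
    }
}

impl<I: Iterator<Item = char>> Iterator for NormalizeNewlines<I> {
    type Item = char;

    fn next(&mut self) -> Option<char> {
        let c = self.inner.next()?;
        if c == '\r' && self.inner.peek() == Some(&'\n') {
            // Drop the `\r` and emit the `\n` that follows it.
            return self.inner.next();
        }
        // Any other char, including a lone `\r`, passes through.
        Some(c)
    }
}
```

Collecting `NormalizeNewlines::new("a\r\nb".chars())` into a `String` gives `"a\nb"`. The trade-off discussed in this thread still applies: the conversion changes the data, so the original terminators cannot be recovered afterwards.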
Also remember that Unicode also defines more terminators. If there is a hypothetical system in the future that has another one you don't want to have to update each and every library ever written that deals with newlines, but just the place where the conversion takes place. |
Also another note: there are still systems which use |
That sounds to me like what I wrote in the paragraph "The API for |
If that is one API isolated then I'm all for it. If that is like in the RFC then I am not a fan of it, because it will just add to a list of ever growing APIs that need to be aware of different newlines. |
@brson I don’t know. I suppose this was written in a world where most programs on your IBM S/360 would not support anything other than EBCDIC (an encoding competing with ASCII) with NEL for newlines. I’m not that familiar with the Windows ecosystem. I’d say let’s wait and see if someone asks for |
What I think is important is that Rust be able to read any file ending equally well. So if I ask for the next line, it'll treat What I don't think works too well is having Rust write different newlines based on the platform it is running on. When printing to the console, |
That would be very confusing, having some functions called Regarding the Unicode spec, I wonder if there is any precedent of a language using unicode line separators for their built-in notion of "lines"? Maybe the unicode crate (that some functionality of If we decide to go with unicode support, suddenly
The
I'd rather prefer an API similar to |
That's meaningless to read and output a file without changing its contents.
|
You just called `cat` a meaningless program.
|
An implementation of Now, there's a discussion we could have about why preserving the separator is tied to not using an iterator. That's a good question. For uniformity, both |
Just for the record, on Windows 10, |
I agree with @retep998 that it is fine to write a newline as |
You just called `cat` a meaningless program.
Writing to the console does not produce a file, and nothing is written back, so saying
"producing different file" here is meaningless.
|
What counts as a newline depends on the platform; it varies between operating
systems and is not just \n or \r\n. Rust's std always tries to work across
mainstream platforms, and IMO that should cover newlines. Supporting both \n and
\r\n is enough for now.
|
👍 on supporting If they are deemed necessary, they could also be included later in a separate RFC. |
I just found out that Unicode has a line breaking algorithm best practices. There's a simplistic view on wikipedia, which is already somewhat complicated. I think these recommendations are akin to I expect that TL;DR 👍 for |
I'm wary of getting into the weeds of supporting unicode separators. Let's follow the principle of least surprise and try to do what people most generally expect from other languages, and I don't know of any language that has set the precedent of being aware of unicode linebreaks. |
These arguments for not supporting unicode separators are convincing to me (as the person who first raised the idea of supporting unicode separators). It wouldn't be surprising if, for example, some protocol defined |
This is already the case (although less prominent), since after In general, since |
Parsing such a protocol with |
@RalfJung Right (except that split('\n') can't be used on a BufRead interface), but someone could easily do it as a quick hack and they shouldn't encounter really unusual surprises like "this obscure character that I've never heard of is a recognized line separator." |
Why not? |
@RalfJung - oh right, split(b'\n') can, but not if the line-ending is |
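For reference, the byte-level approach mentioned in this exchange could be sketched like this: `BufRead::split(b'\n')` splits on the line feed byte, and the caller removes a trailing `\r` by hand. The function name `split_lines_bytes` is an assumption made for this example.

```rust
use std::io::{self, BufRead};

/// Hypothetical helper: splits the input on `\n` bytes and
/// drops one trailing `\r` from each resulting chunk.
fn split_lines_bytes<R: BufRead>(reader: R) -> io::Result<Vec<Vec<u8>>> {
    let mut lines = Vec::new();
    for chunk in reader.split(b'\n') {
        // `split` yields io::Result<Vec<u8>> with the delimiter removed.
        let mut line = chunk?;
        if line.last() == Some(&b'\r') {
            line.pop();
        }
        lines.push(line);
    }
    Ok(lines)
}
```

Note that this sketch also trims a `\r` at the very end of the input when no `\n` follows it, which a stricter parser might not want.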
This was discussed this week at the libs team triage meeting and the conclusion was to merge. The APIs here seem to be a good improvement over what we have today without any loss of functionality, and pretty strong cases have been made to avoid unicode line endings for now. Thanks again for the RFC @RalfJung! |
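A small sketch of the behaviour that results from this decision, as far as I understand the current standard library: both `str::lines` and `BufRead::lines` treat `\n` and `\r\n` as line terminators, while a `\r` that is not followed by `\n` stays part of the line. The two wrapper functions are only there to make the example easy to call.

```rust
use std::io::BufRead;

/// Collects lines from an in-memory string using `str::lines`.
fn lines_from_str(text: &str) -> Vec<&str> {
    text.lines().collect()
}

/// Collects lines from any buffered reader using `BufRead::lines`.
/// Panics on I/O errors, which is acceptable for a sketch.
fn lines_from_reader<R: BufRead>(reader: R) -> Vec<String> {
    reader.lines().map(|l| l.expect("read failed")).collect()
}
```

Calling `lines_from_str("a\r\nb\n")` yields `"a"` and `"b"`, with no stray carriage return left on the first line.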
Follow-up: we would like to land an implementation as soon as possible, and need to widely announce this potentially-breaking change when it lands. Unfortunately, this kind of semantic change is very hard to check for breakage using crater. |
Yay, my first RFC got accepted :) I can give the implementation a try. Shouldn't be hard. I'll be traveling a lot tomorrow, should have time to do the coding easily - but I can't really compile anything while on battery, considering Rust's size and compile time. |
@RalfJung looks like @alexcrichton beat you to it: rust-lang/rust#28034 -- but you could give the PR an initial review! |
Oh, okay, that was fast^^ - seems like I need to wait for my first non-trivial commit to the actual compiler ;-) . I'll check out his PR. |
@alexcrichton is a lord |
This RFC proposes to define a "line" as terminated by either `\n` or `\r\n`. Also see rust-lang/rust#26743.