-
-
Notifications
You must be signed in to change notification settings - Fork 1.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PEP 680: Further explain why files are read as binary #2281
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm happy with the change (if it helps anyone understand the issue better), but also don't think it's really necessary. My thinking is that even if all platforms defaulted to UTF-8, it makes no sense providing an API that allows changing that default, because TOML strictly requires UTF-8. Why allow configuring incorrect parse results?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Your changes LGTM, thanks! I just have one comment to clarify something that confused me when I was reading the PEP initially.
pep-0680.rst
Outdated
encodings, such as Windows), and avoid incorrectly parsing single carriage | ||
returns as valid TOML due to universal newlines in text mode. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
encodings, such as Windows), and avoid incorrectly parsing single carriage | |
returns as valid TOML due to universal newlines in text mode. | |
encodings, such as Windows), and avoid incorrectly parsing files using a | |
single carriage return (``\r``) for the end of line character as valid TOML | |
(due to Python's universal newline conversion in text mode). |
Despite careful re-reading, I was initially confused by this line when reading the original PEP, and then misunderstood it to mean files that were just single CR
s, rather than files that use single CR
s as the newline character. I suggest clarifying this as above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a commit to clarify this. I think this reason is much less important than the other one, so don't wish to spend too many words on this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The commit doesn't really clarify the confusion I had over its intended meaning, sorry. The problematic usage is with using a single CR
as an end of line character, not single CRs anywhere in the file being an invalid character as the current wording implies (which both won't universal newlines unless it is the EOL character, and appears to be legal TOML inside multi-line double-quoted strings, at least per the verbatim letter of the spec).
If it isn't that important, such that it wouldn't be worth adding a few extra words to make its intended meaning clear and explicit and avoid introducing additional confusion, then maybe it would be best to just remove it entirely?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't have a preference on the wording here, but I do think this is important and should not be removed.
Regarding TOML spec confusion: note that CR character is always part of an EOL sequence (CR+LF). If it isn't then the TOML is invalid. A bare CR is not valid in multi-line strings (double or single quoted) as per toml.abnf (I've also made a clarification of this in the Markdown spec, but that isn't included in TOML v1.0.0).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't have a preference on the wording here, but I do think this is important and should not be removed.
In that case, I suggest including the above clarification (or similar), as the present wording was the source of confusion on what you meant here (by someone who makes heavy use of TOML, though lacking your depth of expertise into the intricacies of the format).
Regarding TOML spec confusion: note that CR character is always part of an EOL sequence (CR+LF). If it isn't then the TOML is invalid. A bare CR is not valid in multi-line strings (double or single quoted) as per toml.abnf.
Ah, thanks for the clarification. I was mislead by this line,
Any Unicode character may be used except those that must be escaped: backslash and the control characters other than tab, line feed, and carriage return (U+0000 to U+0008, U+000B, U+000C, U+000E to U+001F, U+007F).
and the lack of explicitly requiring any CF
to always be followed by LF
, but indeed the formal ABNF is correct.
On the discuss thread, questions were brought up about the binary file type: https://discuss.python.org/t/pep-680-tomllib-support-for-parsing-toml-in-the-standard-library/13040/43
Ensuring that it's easy to correctly parse TOML files containing non-ASCII characters on Windows is a good reason to use binary files. This PR adds some words to highlight this reason for using binary files.
cc @hukkin