PEP 680: Further explain why files are read as binary #2281

hauntsaninja · 2022-01-26T23:07:14Z

On the discuss thread, questions were brought up about the binary file type: https://discuss.python.org/t/pep-680-tomllib-support-for-parsing-toml-in-the-standard-library/13040/43
Ensuring that it's easy to correctly parse TOML files containing non-ASCII characters on Windows is a good reason to use binary files. This PR adds some words to highlight this reason for using binary files.

cc @hukkin

hukkin

I'm happy with the change (if it helps anyone understand the issue better), but also don't think it's really necessary. My thinking is that even if all platforms defaulted to UTF-8, it makes no sense providing an API that allows changing that default, because TOML strictly requires UTF-8. Why allow configuring incorrect parse results?

CAM-Gerlach

Your changes LGTM, thanks! I just have one comment to clarify something that confused me when I was reading the PEP initially.

CAM-Gerlach · 2022-01-26T23:26:17Z

pep-0680.rst

+encodings, such as Windows), and avoid incorrectly parsing single carriage
+returns as valid TOML due to universal newlines in text mode.


Suggested change

encodings, such as Windows), and avoid incorrectly parsing single carriage

returns as valid TOML due to universal newlines in text mode.

encodings, such as Windows), and avoid incorrectly parsing files using a

single carriage return (``\r``) for the end of line character as valid TOML

(due to Python's universal newline conversion in text mode).

Despite careful re-reading, I was initially confused by this line when reading the original PEP, and then misunderstood it to mean files that were just single CRs, rather than files that use single CRs as the newline character. I suggest clarifying this as above.

Added a commit to clarify this. I think this reason is much less important than the other one, so don't wish to spend too many words on this.

The commit doesn't really clarify the confusion I had over its intended meaning, sorry. The problematic usage is with using a single CR as an end of line character, not single CRs anywhere in the file being an invalid character as the current wording implies (which both won't universal newlines unless it is the EOL character, and appears to be legal TOML inside multi-line double-quoted strings, at least per the verbatim letter of the spec).

If it isn't that important, such that it wouldn't be worth adding a few extra words to make its intended meaning clear and explicit and avoid introducing additional confusion, then maybe it would be best to just remove it entirely?

I don't have a preference on the wording here, but I do think this is important and should not be removed.

Regarding TOML spec confusion: note that CR character is always part of an EOL sequence (CR+LF). If it isn't then the TOML is invalid. A bare CR is not valid in multi-line strings (double or single quoted) as per toml.abnf (I've also made a clarification of this in the Markdown spec, but that isn't included in TOML v1.0.0).

I don't have a preference on the wording here, but I do think this is important and should not be removed.

In that case, I suggest including the above clarification (or similar), as the present wording was the source of confusion on what you meant here (by someone who makes heavy use of TOML, though lacking your depth of expertise into the intricacies of the format).

Regarding TOML spec confusion: note that CR character is always part of an EOL sequence (CR+LF). If it isn't then the TOML is invalid. A bare CR is not valid in multi-line strings (double or single quoted) as per toml.abnf.

Ah, thanks for the clarification. I was mislead by this line,

Any Unicode character may be used except those that must be escaped: backslash and the control characters other than tab, line feed, and carriage return (U+0000 to U+0008, U+000B, U+000C, U+000E to U+001F, U+007F).

and the lack of explicitly requiring any CF to always be followed by LF, but indeed the formal ABNF is correct.

PEP 680: further explain binary files

ce3d680

hauntsaninja requested a review from encukou as a code owner January 26, 2022 23:07

the-knights-who-say-ni added the CLA signed label Jan 26, 2022

hauntsaninja mentioned this pull request Jan 26, 2022

Please consider pushing tomli into stdlib hukkin/tomli#141

Closed

hukkin approved these changes Jan 26, 2022

View reviewed changes

CAM-Gerlach reviewed Jan 26, 2022

View reviewed changes

code review

d1df0be

CAM-Gerlach changed the title ~~PEP 680: further explain binary files~~ PEP 680: Further explain why files are read as binary Jan 27, 2022

encukou merged commit 5796270 into python:main Jan 27, 2022

hauntsaninja deleted the pep-680-edits-2 branch January 27, 2022 10:42

hukkin mentioned this pull request Feb 8, 2022

Submission for SC consideration: PEP 680 -- tomllib python/steering-council#104

Closed

erlend-aasland mentioned this pull request May 13, 2022

pep8/greppable exception messages erlend-aasland/peps#1

Closed

erlend-aasland mentioned this pull request Jun 27, 2022

pep 687/mark as accepted erlend-aasland/peps#2

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

PEP 680: Further explain why files are read as binary #2281

PEP 680: Further explain why files are read as binary #2281

hauntsaninja commented Jan 26, 2022 •

edited

Loading

hukkin left a comment

CAM-Gerlach left a comment

CAM-Gerlach Jan 26, 2022

hauntsaninja Jan 26, 2022

CAM-Gerlach Jan 27, 2022 •

edited

Loading

hukkin Jan 27, 2022 •

edited

Loading

CAM-Gerlach Jan 27, 2022

		encodings, such as Windows), and avoid incorrectly parsing single carriage
		returns as valid TOML due to universal newlines in text mode.

-encodings, such as Windows), and avoid incorrectly parsing single carriage
-returns as valid TOML due to universal newlines in text mode.
+encodings, such as Windows), and avoid incorrectly parsing files using a
+single carriage return (``\r``) for the end of line character as valid TOML
+(due to Python's universal newline conversion in text mode).

PEP 680: Further explain why files are read as binary #2281

PEP 680: Further explain why files are read as binary #2281

Conversation

hauntsaninja commented Jan 26, 2022 • edited Loading

hukkin left a comment

Choose a reason for hiding this comment

CAM-Gerlach left a comment

Choose a reason for hiding this comment

CAM-Gerlach Jan 26, 2022

Choose a reason for hiding this comment

hauntsaninja Jan 26, 2022

Choose a reason for hiding this comment

CAM-Gerlach Jan 27, 2022 • edited Loading

Choose a reason for hiding this comment

hukkin Jan 27, 2022 • edited Loading

Choose a reason for hiding this comment

CAM-Gerlach Jan 27, 2022

Choose a reason for hiding this comment

hauntsaninja commented Jan 26, 2022 •

edited

Loading

CAM-Gerlach Jan 27, 2022 •

edited

Loading

hukkin Jan 27, 2022 •

edited

Loading