WG21 P1854: Source to Execution encoding conversion should not lead to loss of information #50

Open
cor3ntin opened this issue Aug 2, 2019 · 7 comments
Labels: enhancement (new feature or request), WG21-tracked (this issue is tracked as a WG21 GitHub issue)

Comments

@cor3ntin
Collaborator

cor3ntin commented Aug 2, 2019

When converting a string literal or wide string literal (or character literal) from the source encoding to the execution encoding, it is implementation-defined how non-representable characters are handled, which can lead to loss of data.

In practice, most compilers make that ill-formed: https://godbolt.org/z/SlhCdr

The standard should match existing practice and not encourage implementations to modify the meaning of string literals.

http://eel.is/c++draft/lex#phases-1.5

Each basic source character set member in a character literal or a string literal, as well as each escape sequence and universal-character-name in a character literal or a non-raw string literal, is converted to the corresponding member of the execution character set ([lex.ccon], [lex.string]); if there is no corresponding member, ~~it is converted to an implementation-defined member other than the null (wide) character~~ **the program is ill-formed**.

Note: the above paragraph needs further modifications as per #46
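To make the proposed rule concrete, here is a minimal sketch of a strict conversion that refuses to produce a result when a character has no counterpart in the target encoding. It is only a runtime model of what a compiler would do at translation time: ISO-8859-1 is assumed as the execution encoding purely for illustration, and `to_latin1_strict` is a hypothetical helper, not part of any compiler or of the standard library.

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <string_view>

// Hypothetical helper: convert UTF-32 code points to ISO-8859-1 (Latin-1),
// whose 256 code units map one-to-one onto U+0000..U+00FF.
// Returns std::nullopt as soon as a code point has no Latin-1 counterpart,
// modelling the proposed "the program is ill-formed" outcome.
std::optional<std::string> to_latin1_strict(std::u32string_view in) {
    std::string out;
    out.reserve(in.size());
    for (char32_t cp : in) {
        if (cp > 0xFF) {
            return std::nullopt;  // no corresponding member: reject, do not substitute
        }
        out.push_back(static_cast<char>(static_cast<unsigned char>(cp)));
    }
    return out;
}
```

With this model, `to_latin1_strict(U"caf\u00E9")` yields a four-byte string, while `to_latin1_strict(U"\u1234")` yields no value at all, so the caller cannot silently end up with different text.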

tahonermann added the "New" label (new issues that have not yet been discussed in SG16) on Aug 2, 2019.
@tahonermann
Member

> In practice, most compilers make that ill-formed

The diagnostic for that example is awful. Instead of stating that the character lacks representation in the presumed execution encoding, it states that it is "invalid" (whatever that means), or is an "incomplete multibyte or wide character". I have no idea what an "incomplete wide character" might be.

Substituting a \u1234 escape sequence for the non-representable character produces a similar diagnostic.
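For contrast, a small sketch of the same universal-character-name in literals whose encodings are fixed by the standard (UTF-8 and UTF-32). Here no implementation-defined conversion is involved, so the checks hold regardless of the execution encoding chosen for ordinary literals. The constant names are illustrative only.

```cpp
#include <cassert>
#include <cstddef>

// U+1234 written as a universal-character-name in u8 and U literals.
// Sizes are given in code units, excluding the null terminator.
constexpr std::size_t utf8_units  = sizeof(u8"\u1234") - 1;
constexpr std::size_t utf32_units = sizeof(U"\u1234") / sizeof(char32_t) - 1;

// First UTF-8 code unit of U+1234; the cast works whether the element
// type is char (C++17) or char8_t (C++20).
constexpr unsigned char utf8_lead = static_cast<unsigned char>(u8"\u1234"[0]);

static_assert(utf8_units == 3, "U+1234 takes three UTF-8 code units");
static_assert(utf32_units == 1, "U+1234 is a single UTF-32 code unit");
static_assert(utf8_lead == 0xE1, "lead byte of the UTF-8 sequence E1 88 B4");

// By contrast, what an ordinary literal such as "\u1234" contains depends on
// the execution encoding; in an encoding that lacks the character, it has
// no faithful representation at all.
```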

> The standard should match existing practice and not encourage implementations to modify the meaning of string literals.

The linked godbolt example only demonstrates behavior for a single compiler. The proposed change is not existing practice for some other compilers. In particular, the Microsoft compiler will silently substitute a replacement character. The claim that the proposed change reflects the behavior of "most compilers" is unsubstantiated.

That being said, I think I can get behind this proposed change. Implementations can always offer an extension to substitute replacement characters in the (very few) cases where that is desirable.
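The silent-substitution behavior can be modelled in the same spirit, as a rough sketch rather than a description of any particular compiler. ISO-8859-1 as the target encoding and `'?'` as the replacement are assumptions made for illustration, and `to_latin1_lossy` is a hypothetical helper.

```cpp
#include <cassert>
#include <string>
#include <string_view>

// Hypothetical helper: convert UTF-32 code points to ISO-8859-1 (Latin-1),
// replacing every non-representable code point with `replacement`.
// The caller gets no signal that a substitution happened.
std::string to_latin1_lossy(std::u32string_view in, char replacement = '?') {
    std::string out;
    out.reserve(in.size());
    for (char32_t cp : in) {
        if (cp <= 0xFF) {
            out.push_back(static_cast<char>(static_cast<unsigned char>(cp)));
        } else {
            out.push_back(replacement);  // information about cp is lost here
        }
    }
    return out;
}
```

Under these assumptions, `to_latin1_lossy(U"\u1234")` and `to_latin1_lossy(U"\u4E2D")` both produce `"?"`: two different literals become indistinguishable, which is the data loss the issue is about.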

@cor3ntin
Collaborator Author

cor3ntin commented Aug 2, 2019

TBH, I wasn't able to make Clang accept anything but UTF-8 as the input encoding.

@tahonermann
Member

LLVM Clang (and common derivatives like Apple Clang and Android Clang) only support UTF-8. There are derivatives that do support other encodings though (e.g., the z/OS Clang ports).

tahonermann added the "paper submitted" (a paper proposing a specific solution has been submitted), "bug" (something isn't working), and "enhancement" (new feature or request) labels and removed the "New" (new issues that have not yet been discussed in SG16) and "bug" labels on Nov 17, 2019.
@tahonermann
Member

P1854 was submitted with a proposed fix for this issue and was discussed by SG16 in Belfast. This is now waiting on an updated paper.

tahonermann added the "paper revision needed" label (an updated paper proposing a specific solution is needed) and removed the "paper submitted" label on Nov 17, 2019.
@tahonermann
Member

This issue is now tracked by cplusplus/papers#608.

tahonermann changed the title from "Source to Execution encoding conversion should not lead to loss of information" to "WG21 P1854: Source to Execution encoding conversion should not lead to loss of information" on Mar 1, 2020.
tahonermann added the "WG21-tracked" label (this issue is tracked as a WG21 GitHub issue) and removed the "paper revision needed" label on Mar 1, 2020.
@peter-b
Collaborator

peter-b commented Sep 16, 2021

@cor3ntin I don't think we ever polled Proposal 7 from P2178, and there doesn't seem to be a current paper that plugs this silent data loss hole. Do we need a new paper, or a new revision of P1854?

@cor3ntin
Collaborator Author

cor3ntin commented Sep 16, 2021 via email
