WG21 P1854: Source to Execution encoding conversion should not lead to loss of information #50

cor3ntin · 2019-08-02T10:38:40Z

When converting a string literal or wide string literal (or a character literal) from the source encoding to the execution encoding, it is implementation-defined how non-representable characters are handled, which can lead to loss of data.

In practice, most compilers make that ill-formed: https://godbolt.org/z/SlhCdr

The standard should match existing practice and not encourage implementations to modify the meaning of string literals.

http://eel.is/c++draft/lex#phases-1.5

Each basic source character set member in a character literal or a string literal, as well as each escape sequence and universal-character-name in a character literal or a non-raw string literal, is converted to the corresponding member of the execution character set ([lex.ccon], [lex.string]); if there is no corresponding member, ~~it is converted to an implementation-defined member other than the null (wide) character~~ the program is ill-formed.

Note: the above paragraph needs further modifications as per #46
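As a rough runtime analogy for the proposed rule, the sketch below rejects any code point that has no counterpart in the target encoding instead of substituting something else. It assumes ISO-8859-1 as the execution encoding purely for illustration; `to_latin1_strict` and `check_to_latin1_strict` are hypothetical names, not anything from the standard or from P1854, and a real compiler would diagnose this during translation rather than at run time.

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <string_view>

// Hypothetical helper: converts code points to ISO-8859-1, which is
// assumed here to be the execution encoding. Returns std::nullopt when
// any code point has no Latin-1 representation, mirroring the proposed
// "the program is ill-formed" outcome.
std::optional<std::string> to_latin1_strict(std::u32string_view text) {
    std::string out;
    out.reserve(text.size());
    for (char32_t cp : text) {
        if (cp > 0xFF) {
            return std::nullopt;  // no corresponding member: reject
        }
        out.push_back(static_cast<char>(static_cast<unsigned char>(cp)));
    }
    return out;
}

// Small usage check for the sketch above.
inline void check_to_latin1_strict() {
    assert(to_latin1_strict(U"caf\u00E9").has_value());  // U+00E9 fits in Latin-1
    assert(!to_latin1_strict(U"\u1234").has_value());    // U+1234 does not
}
```

The point of returning an empty optional rather than a best-effort string is that the caller cannot accidentally continue with text whose meaning has changed.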

tahonermann · 2019-08-02T21:13:34Z

> In practice, most compilers make that ill-formed

The diagnostic for that example is awful. Instead of stating that the character lacks representation in the presumed execution encoding, it states that it is "invalid" (whatever that means), or is an "incomplete multibyte or wide character". I have no idea what an "incomplete wide character" might be.

Substituting a \u1234 escape sequence for the non-representable character produces a similar diagnostic.

> The standard should match existing practice and not encourage implementations to modify the meaning of string literals

The linked godbolt example only demonstrates behavior for a single compiler. The proposed change is not existing practice for some other compilers. In particular, the Microsoft compiler will silently substitute a replacement character. The claim that the proposed change reflects the behavior of "most compilers" is unsubstantiated.

That being said, I think I can get behind this proposed change. Implementations can always offer an extension to substitute replacement characters in the (very few) cases where that is desirable.
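To make the cost of silent substitution concrete, here is a minimal sketch of a lossy conversion. It assumes ISO-8859-1 as the execution encoding and `'?'` as the replacement character; both are choices made for illustration, and `to_latin1_lossy` and `check_to_latin1_lossy` are hypothetical names. It is not meant to reproduce any particular compiler's behavior.

```cpp
#include <cassert>
#include <string>
#include <string_view>

// Hypothetical helper: converts code points to ISO-8859-1 (assumed
// execution encoding) and silently replaces anything that does not fit
// with '?' (assumed replacement character).
std::string to_latin1_lossy(std::u32string_view text) {
    std::string out;
    out.reserve(text.size());
    for (char32_t cp : text) {
        if (cp > 0xFF) {
            out.push_back('?');  // information about the original is lost
        } else {
            out.push_back(static_cast<char>(static_cast<unsigned char>(cp)));
        }
    }
    return out;
}

// Two different inputs become indistinguishable after conversion.
inline void check_to_latin1_lossy() {
    assert(to_latin1_lossy(U"\u1234") == "?");
    assert(to_latin1_lossy(U"\u1234") == to_latin1_lossy(U"\u5678"));
}
```

Because distinct inputs collapse to the same output, nothing downstream can tell that a substitution happened, which is the data loss this issue is about.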

cor3ntin · 2019-08-02T22:06:40Z

TBH, I wasn't able to make Clang accept anything but UTF-8 as the input encoding.

tahonermann · 2019-08-02T23:58:02Z

LLVM Clang (and common derivatives like Apple Clang and Android Clang) only support UTF-8. There are derivatives that do support other encodings though (e.g., the z/OS Clang ports).

tahonermann · 2019-11-17T22:06:13Z

P1854 was submitted with a proposed fix for this issue and was discussed by SG16 in Belfast. This is now waiting on an updated paper.

tahonermann · 2020-03-01T02:39:01Z

This issue is now tracked by cplusplus/papers#608.

peter-b · 2021-09-16T14:40:55Z

@cor3ntin I don't think we ever polled Proposal 7 from P2178, and there doesn't seem to be a current paper that plugs this silent data loss hole. Do we need a new paper, or a new revision of P1854?

cor3ntin · 2021-09-16T15:19:38Z

P1854 will be revised


tahonermann added the "New" label (new issues that have not yet been discussed in SG16) on Aug 2, 2019

tahonermann added the "paper submitted" (a paper proposing a specific solution has been submitted), "bug" (something isn't working), and "enhancement" (new feature or request) labels and removed the "New" and "bug" labels on Nov 17, 2019

tahonermann added the "paper revision needed" label (an updated paper proposing a specific solution is needed) and removed the "paper submitted" label on Nov 17, 2019

tahonermann changed the title from ~~Source to Execution encoding conversion should not lead to loss of information~~ to "WG21 P1854: Source to Execution encoding conversion should not lead to loss of information" on Mar 1, 2020

tahonermann added the "WG21-tracked" label (this issue is tracked as a WG21 GitHub issue) and removed the "paper revision needed" label on Mar 1, 2020

tahonermann assigned cor3ntin on Mar 1, 2020
