Implement RFC 3348, c"foo" literals #108801

Merged May 5, 2023 (10 commits)

fee1-dead
Member

RFC: rust-lang/rfcs#3348
Tracking issue: #105723

@rustbot
Collaborator

rustbot commented Mar 6, 2023

r? @wesleywiser

(rustbot has picked a reviewer for you, use r? to override)

@rustbot rustbot added S-waiting-on-review, T-compiler, T-libs, T-rustdoc labels Mar 6, 2023
@petrochenkov petrochenkov self-assigned this Mar 6, 2023
compiler/rustc_ast/src/ast.rs (outdated, resolved)
compiler/rustc_lexer/src/lib.rs (outdated, resolved)
compiler/rustc_ast_passes/src/feature_gate.rs (outdated, resolved)
compiler/rustc_lexer/src/unescape.rs (outdated, resolved)
Char(char),
}

pub fn unescape_c_string<F>(src: &str, mode: Mode, callback: &mut F)

Contributor

If c"..." requires different unescaping from some other existing strings, then something is going wrong, in general.

Perhaps implementation for c"..." and the stuff from rust-lang/rfcs#3349 should be decoupled.

Member Author

It has to be different because returning a char doesn't cover all cases for C string literals. If the RFC you mentioned is accepted, byte string literals won't be able to represent their units as chars either. We need to distinguish Unicode characters, which should be encoded as UTF-8, from raw bytes: c"À" encodes to [0xC3, 0x80] (the code point is U+00C0), while c"\xC0" encodes to [0xC0] directly. Before this PR, byte strings pass these byte values as chars, which are then converted into u8s. C strings need to pass two kinds of units: chars that must be encoded as UTF-8, and bytes that must be appended as-is as u8s.
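To make the distinction concrete, here is a minimal sketch of the two kinds of units described above. The `Unit` enum and the `push_unit` and `encode` helpers are hypothetical names chosen for illustration; they are not the types or functions this PR adds to `rustc_lexer`.

```rust
/// Hypothetical stand-in for the unit type discussed in this thread:
/// either a Unicode character or a raw byte.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Unit {
    /// A character from the literal text or a `\u{...}` escape; stored as UTF-8.
    Char(char),
    /// A byte from a `\xNN` escape; stored as-is.
    Byte(u8),
}

/// Appends one unit to the output buffer of a C string literal.
fn push_unit(buf: &mut Vec<u8>, unit: Unit) {
    match unit {
        Unit::Char(c) => {
            // A char needs at most 4 bytes in UTF-8.
            let mut tmp = [0u8; 4];
            buf.extend_from_slice(c.encode_utf8(&mut tmp).as_bytes());
        }
        Unit::Byte(b) => buf.push(b),
    }
}

/// Encodes a whole sequence of units into bytes (without the trailing NUL).
fn encode(units: &[Unit]) -> Vec<u8> {
    let mut buf = Vec::new();
    for &unit in units {
        push_unit(&mut buf, unit);
    }
    buf
}
```

Under these assumptions, `encode(&[Unit::Char('À')])` yields the two UTF-8 bytes of U+00C0, while `encode(&[Unit::Byte(0xC0)])` yields the single byte 0xC0. A bare `char` callback cannot tell these two cases apart.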

Contributor

Sorry, I don't understand what you are saying.
Both byte and C strings support non-UTF-8 data, so (Rust) chars are out of the question.
I'm concerned about the difference between byte strings and C strings. Both produce arbitrary non-UTF-8 [u8], and any differences between them should eventually be eliminated (that's the point of rust-lang/rfcs#3349, from what I remember).

Member Author

@fee1-dead fee1-dead Mar 7, 2023

What you are saying is true, but currently both byte strings and normal strings emit chars in their implementation. Byte strings just use the code points to represent the byte values, and that would need to change to an enum (just as this PR does for C string literals) if we were to implement that RFC.

Member

My understanding is that most of this complication comes from the fact that the C-str RFC explicitly states that it supports both \u and \x escapes in c"" literals. Is that correct?

Member Author

@fee1-dead fee1-dead Apr 19, 2023

@compiler-errors It's not so much about the \u escape as about the \x escape, which has a different meaning in byte strings than in characters and normal strings. nnethercote's comment on the RFC mentioned above suggested that a table should make this clearer:

| | Example | # sets* | Characters | Escapes |
|---|---|---|---|---|
| Character | 'H' | 0 | All Unicode | Quote & ASCII & Unicode |
| String | "hello" | 0 | All Unicode | Quote & ASCII & Unicode |
| Raw string | r#"hello"# | <256 | All Unicode | N/A |
| Byte | b'H' | 0 | All ASCII | Quote & Byte |
| Byte string | b"hello" | 0 | All ASCII | Quote & Byte |
| Raw byte string | br#"hello"# | <256 | All ASCII | N/A |
| C string | c"hello" | 0 | All Unicode | Quote & Byte & Unicode |

Note that since normal strings accept Unicode, we can emit chars that correspond to the actual characters. Byte strings are different: they allow bytes that are not valid UTF-8 (e.g. \xFF is allowed in byte strings but not in normal strings). How do we unescape them currently? We emit the code point (e.g. \xFF -> ÿ, U+00FF) for byte strings and interpret the value as a byte later on.

That means a ÿ emitted for a normal string means the character "ÿ", with code point U+00FF, encoded in UTF-8 as 0xC3 0xBF. The same ÿ emitted for a byte string means only the byte 0xFF. C strings are explicitly allowed to contain both, so an enum is needed to convey either a character to be encoded as UTF-8 or a raw byte value.
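The two readings of U+00FF can be checked with ordinary string and byte string literals. This is only an illustrative sketch; the function names are made up for this example, and `narrow` merely mimics the "interpret the value later" step described above rather than reproducing the compiler's code.

```rust
/// How a normal string stores U+00FF: as its two-byte UTF-8 encoding.
fn str_bytes() -> Vec<u8> {
    "\u{FF}".as_bytes().to_vec()
}

/// How a byte string stores `\xFF`: as the single byte 0xFF.
fn byte_str_bytes() -> Vec<u8> {
    b"\xFF".to_vec()
}

/// Mimics the byte-string path described above: the unescaper reports
/// `\xFF` as the char U+00FF and the consumer narrows it back to a byte.
fn narrow(c: char) -> u8 {
    c as u8
}
```

Here `str_bytes()` is `[0xC3, 0xBF]`, `byte_str_bytes()` is `[0xFF]`, and `narrow('\u{FF}')` is `0xFF`. The same `char` value therefore stands for different byte sequences depending on which kind of literal produced it.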

Member

@est31 est31 Apr 27, 2023

@fee1-dead In the "Characters" column of the "C string" row, do you really mean "all bytes except NUL"? IIRC Rust source files are required to be valid UTF-8, and RFC 3348 has changed nothing about that. At least I found nothing in the RFC's text indicating that. The goal was more about the Escapes column: the encoded result can be invalid Unicode, but the literal itself still has to be valid UTF-8. Otherwise programs processing Rust source code could no longer assume that the source is valid UTF-8. In other words, any program that uses Rust's String type to represent a slice of Rust code (including Rust's proc macro infrastructure!) would fail for snippets containing C strings with invalid UTF-8.

I think that entry should rather read "All Unicode" or "All UTF-8".
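The point about `String` can be illustrated with a small sketch. The helper names below are hypothetical; the example only contrasts the on-disk bytes of a literal that uses an escape with a literal that would contain a raw 0xFF byte.

```rust
/// Returns true if `bytes` could be held in a `String`, i.e. is valid UTF-8.
fn is_valid_source(bytes: &[u8]) -> bool {
    std::str::from_utf8(bytes).is_ok()
}

/// Source text of a C string literal that uses an escape: pure ASCII on disk.
fn escaped_literal_source() -> Vec<u8> {
    br#"c"\xFF""#.to_vec()
}

/// The same literal with a raw 0xFF byte between the quotes instead of the escape.
fn raw_byte_literal_source() -> Vec<u8> {
    vec![b'c', b'"', 0xFF, b'"']
}
```

The escaped form passes `is_valid_source`, while the raw-byte form does not and cannot be converted with `String::from_utf8`. That is why the non-UTF-8 capability belongs in the Escapes column and not in the Characters column.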

Member Author

@est31: corrected, thanks.

@petrochenkov petrochenkov removed their assignment Mar 6, 2023
@fee1-dead fee1-dead added S-waiting-on-author and removed S-waiting-on-review labels Mar 8, 2023
@fee1-dead fee1-dead marked this pull request as ready for review March 10, 2023 15:31
@rustbot
Collaborator

rustbot commented Mar 10, 2023

Hey! It looks like you've submitted a new PR for the library teams!

If this PR contains changes to any rust-lang/rust public library APIs then please comment with @rustbot label +T-libs-api -T-libs to tag it appropriately. If this PR contains changes to any unstable APIs please edit the PR description to add a link to the relevant API Change Proposal or create one if you haven't already. If you're unsure where your change falls no worries, just leave it as is and the reviewer will take a look and make a decision to forward on if necessary.

Examples of T-libs-api changes:

  • Stabilizing library features
  • Introducing insta-stable changes such as new implementations of existing stable traits on existing stable types
  • Introducing new or changing existing unstable library APIs (excluding permanently unstable features / features without a tracking issue)
  • Changing public documentation in ways that create new stability guarantees
  • Changing observable runtime behavior of library APIs

Some changes occurred in src/tools/clippy

cc @rust-lang/clippy

@fee1-dead fee1-dead added S-waiting-on-review and removed S-waiting-on-author labels Mar 10, 2023
@bors
Contributor

bors commented Mar 11, 2023

☔ The latest upstream changes (presumably #108998) made this pull request unmergeable. Please resolve the merge conflicts.

@fee1-dead fee1-dead changed the title [WIP] Implement RFC 3348, c"foo" literals Implement RFC 3348, c"foo" literals Mar 12, 2023
@fee1-dead
Member Author

r? compiler

@rustbot rustbot assigned jackh726 and unassigned wesleywiser Apr 7, 2023
matthiaskrgr added a commit to matthiaskrgr/rust that referenced this pull request May 4, 2023
…r-errors

Implement RFC 3348, `c"foo"` literals

RFC: rust-lang/rfcs#3348
Tracking issue: rust-lang#105723
bors added a commit to rust-lang-ci/rust that referenced this pull request May 5, 2023
Rollup of 6 pull requests

Successful merges:

 - rust-lang#103056 (Fix `checked_{add,sub}_duration` incorrectly returning `None` when `other` has more than `i64::MAX` seconds)
 - rust-lang#108801 (Implement RFC 3348, `c"foo"` literals)
 - rust-lang#110773 (Reduce MIR dump file count for MIR-opt tests)
 - rust-lang#110876 (Added default target cpu to `--print target-cpus` output and updated docs)
 - rust-lang#111068 (Improve check-cfg implementation)
 - rust-lang#111238 (btree_map: `Cursor{,Mut}::peek_prev` must agree)

Failed merges:

 - rust-lang#110694 (Implement builtin # syntax and use it for offset_of!(...))

r? `@ghost`
`@rustbot` modify labels: rollup
@bors bors merged commit 4891f02 into rust-lang:master May 5, 2023
@rustbot rustbot added this to the 1.71.0 milestone May 5, 2023
@klensy
Contributor

klensy commented May 16, 2023

Looks like rustfmt doesn't know about the new literals, sadly.

flip1995 pushed a commit to flip1995/rust that referenced this pull request May 20, 2023
…r-errors

Implement RFC 3348, `c"foo"` literals

@kanashimia

@fee1-dead this should be feature gated under c_str_literal and not c_str_literals, as mentioned in the RFC and tracking issue, right?

Extreme confusion:

error[E0635]: unknown feature `c_str_literal`

@fee1-dead fee1-dead deleted the c-str branch May 30, 2023 13:24
@fee1-dead
Member Author

fee1-dead commented May 30, 2023

feature(c_str_literals) made more sense to me, but I don't really mind c_str_literal either. If anyone has a preference for one over the other, feel free to open a PR to either the rfcs repo or to rust-lang/rust. I've updated the tracking issue description in the meantime.

matthiaskrgr added a commit to matthiaskrgr/rust that referenced this pull request Jun 1, 2023
use c literals in compiler and library

Use c literals rust-lang#108801 in compiler and library

currently blocked on:
* <strike>rustfmt: don't know how to format c literals</strike> nope, nightly one works.
* <strike>bootstrap</strike>

r? `@ghost`
`@rustbot` blocked
matthiaskrgr added a commit to matthiaskrgr/rust that referenced this pull request Jun 2, 2023
use c literals in compiler and library

fee1-dead added a commit to fee1-dead-contrib/rust that referenced this pull request Jul 6, 2023
…=compiler-errors

Revert the lexing of `c"…"` string literals

Fixes \[after beta-backport\] rust-lang#113235.
Further progress is tracked in rust-lang#113333.

This PR *manually* reverts parts of rust-lang#108801 (since a git-revert would've been too coarse-grained & messy)
and git-reverts rust-lang#111647.

CC `@fee1-dead` (rust-lang#108801) `@klensy` (rust-lang#111647)
r? `@compiler-errors`

`@rustbot` label F-c_str_literals beta-nominated
bors added a commit to rust-lang-ci/rust that referenced this pull request Dec 1, 2023
…ilstrieb

Stabilize C string literals

RFC: https://rust-lang.github.io/rfcs/3348-c-str-literal.html

Tracking issue: rust-lang#105723

Documentation PR (reference manual): rust-lang/reference#1423

# Stabilization report

Stabilizes C string and raw C string literals (`c"..."` and `cr#"..."#`), which are expressions of type [`&CStr`](https://doc.rust-lang.org/stable/core/ffi/struct.CStr.html). Both new literals require Rust edition 2021 or later.

```rust
const HELLO: &core::ffi::CStr = c"Hello, world!";
```

C strings may contain any byte other than `NUL` (`b'\x00'`), and their in-memory representation is guaranteed to end with `NUL`.
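The NUL rules stated above can be sketched with the runtime constructor `CStr::from_bytes_with_nul`, which enforces the same invariants at run time that the literal syntax enforces at compile time. The helper name `checked_c_str` is hypothetical and exists only for this illustration.

```rust
use std::ffi::CStr;

/// Builds a `&CStr` from bytes that must end in exactly one NUL and
/// contain no interior NUL; returns `None` otherwise.
fn checked_c_str(bytes: &[u8]) -> Option<&CStr> {
    CStr::from_bytes_with_nul(bytes).ok()
}
```

With this helper, `checked_c_str(b"Hello, world!\0")` succeeds, while input with an interior NUL (`b"Hello,\0world!\0"`) or without a terminating NUL (`b"Hello, world!"`) is rejected.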

## Implementation

Originally implemented by PR rust-lang#108801, which was reverted due to unintentional changes to lexer behavior in Rust editions < 2021.

The current implementation landed in PR rust-lang#113476, which restricts C string literals to Rust edition >= 2021.

## Resolutions to open questions from the RFC

* Adding C character literals (`c'.'`) of type `c_char` is not part of this feature.
  * Support for `c"..."` literals does not prevent `c'.'` literals from being added in the future.
* C string literals should not be blocked on making `&CStr` a thin pointer.
  * It's possible to declare constant expressions of type `&'static CStr` in stable Rust (as of v1.59), so C string literals are not adding additional coupling on the internal representation of `CStr`.
* The unstable `concat_bytes!` macro should not accept `c"..."` literals.
  * C strings have two equally valid `&[u8]` representations (with or without terminal `NUL`), so allowing them to be used in `concat_bytes!` would be ambiguous.
* Adding a type to represent C strings containing valid UTF-8 is not part of this feature.
  * Support for a hypothetical `&Utf8CStr` may be explored in the future, should such a type be added to Rust.
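The `concat_bytes!` ambiguity mentioned in the list above comes from the two byte views that `CStr` already exposes. A minimal sketch, with `both_views` as a hypothetical helper name:

```rust
use std::ffi::CStr;

/// Returns the two `&[u8]` views of a C string:
/// the contents without the terminator, and the contents with it.
fn both_views(s: &CStr) -> (&[u8], &[u8]) {
    (s.to_bytes(), s.to_bytes_with_nul())
}
```

For a C string holding `hi`, the first view is `b"hi"` and the second is `b"hi\0"`. Neither is obviously the one a byte-concatenation macro should pick, which matches the reasoning given for keeping C string literals out of `concat_bytes!`.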
bors added a commit to rust-lang/miri that referenced this pull request Dec 2, 2023
Stabilize C string literals

flip1995 pushed a commit to flip1995/rust-clippy that referenced this pull request Dec 5, 2023
Stabilize C string literals

lnicola pushed a commit to lnicola/rust-analyzer that referenced this pull request Apr 7, 2024
Stabilize C string literals

RalfJung pushed a commit to RalfJung/rust-analyzer that referenced this pull request Apr 27, 2024
Stabilize C string literals

Labels
S-waiting-on-bors, T-compiler, T-libs, T-rustdoc