Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow limited access to OsStr bytes #109698

Merged
merged 6 commits into from
May 31, 2023
Merged

Allow limited access to OsStr bytes #109698

merged 6 commits into from
May 31, 2023

Conversation

epage
Copy link
Contributor

@epage epage commented Mar 28, 2023

OsStr has historically kept its implementation details private out of
concern for locking us into a specific encoding on Windows.

This is an alternative to #95290 which proposed specifying the encoding on Windows. Instead, this
only specifies that for cross-platform code, OsStr's encoding is a superset of UTF-8 and defines
rules for safely interacting with it

At minimum, this can greatly simplify the os_str_bytes crate and every
arg parser that interacts with OsStr directly (which is most of those
that support invalid UTF-8).

Tracking issue: #111544

`OsStr` has historically kept its implementation details private out of
concern for locking us into a specific encoding on Windows.

This is an alternative to rust-lang#95290 which proposed specifying the encoding on Windows.  Instead, this
only specifies that for cross-platform code, `OsStr`'s encoding is a superset of UTF-8 and defines
rules for safely interacting with it

At minimum, this can greatly simplify the `os_str_bytes` crate and every
arg parser that interacts with `OsStr` directly (which is most of those
that support invalid UTF-8).
@rustbot
Copy link
Collaborator

rustbot commented Mar 28, 2023

Thanks for the pull request, and welcome! The Rust team is excited to review your changes, and you should hear from @cuviper (or someone else) soon.

Please see the contribution instructions for more information. Namely, in order to ensure the minimum review times lag, PR authors and assigned reviewers should ensure that the review label (S-waiting-on-review and S-waiting-on-author) stays updated, invoking these commands when appropriate:

  • @rustbot author: the review is finished, PR author should check the comments and take action accordingly
  • @rustbot review: the author is ready for a review, this PR will be queued again in the reviewer's queue

@rustbot rustbot added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. T-libs Relevant to the library team, which will review and decide on the PR/issue. labels Mar 28, 2023
@rustbot
Copy link
Collaborator

rustbot commented Mar 28, 2023

Hey! It looks like you've submitted a new PR for the library teams!

If this PR contains changes to any rust-lang/rust public library APIs then please comment with @rustbot label +T-libs-api -T-libs to tag it appropriately. If this PR contains changes to any unstable APIs please edit the PR description to add a link to the relevant API Change Proposal or create one if you haven't already. If you're unsure where your change falls no worries, just leave it as is and the reviewer will take a look and make a decision to forward on if necessary.

Examples of T-libs-api changes:

  • Stabilizing library features
  • Introducing insta-stable changes such as new implementations of existing stable traits on existing stable types
  • Introducing new or changing existing unstable library APIs (excluding permanently unstable features / features without a tracking issue)
  • Changing public documentation in ways that create new stability guarantees
  • Changing observable runtime behavior of library APIs

@epage epage force-pushed the wtf branch 2 times, most recently from 01f8d93 to 0d87d66 Compare March 28, 2023 14:26
@cuviper
Copy link
Member

cuviper commented Mar 28, 2023

I think this counts as new guarantee for API purposes.

@rustbot label +T-libs-api -T-libs
r? libs-api

@rustbot rustbot added T-libs-api Relevant to the library API team, which will review and decide on the PR/issue. and removed T-libs Relevant to the library team, which will review and decide on the PR/issue. labels Mar 28, 2023
@rustbot rustbot assigned Amanieu and unassigned cuviper Mar 28, 2023
@ChrisDenton
Copy link
Member

Out of interest, was there anything in particular that motivated you to propose this now?

@the8472
Copy link
Member

the8472 commented Mar 28, 2023

I think this should come with an example since it's a wide pointer. Conversion from C to Rust has to jump through extra hoops.

@joshtriplett
Copy link
Member

I'm personally amenable to adding this guarantee, but I do agree with @the8472 that we need to confirm that this will actually work as intended.

@epage epage changed the title Allow FFI support for OsStr Allow limited access to OsStr bytes in unsafe blocks Mar 31, 2023
@epage
Copy link
Contributor Author

epage commented Mar 31, 2023

Sorry for the lack of details and confusion on this. I was under the impression that working towards my final goal would best be done in smaller steps of defining OsStr but that meant I didn't have a use case for it which I recognize can be frustrating to deal with when evaluating a major change like this. I've since updated the documentation and PR to reflect the minimum of what I need for my use case: clap and other CLI tools that need to deal with parsing and splitting up OsStrs for however much valid UTF-8 is in them.

My hope that this more limited alternative to #95290 will have a chance to move forward.

@ChrisDenton
Copy link
Member

If we do this, I wonder if it makes sense to also offer a split_at method on OsStr that panics if used incorrectly. I.e.:

// Similar to `str::split_at` and `[T]::split_at`
// Panics if `mid` is not on a UTF-8 code point boundary.
fn split_at(&self, mid: usize) -> (&OsStr, &OsStr);

Splitting and joining are the two hazard areas when dealing with known valid OsStrs (whatever form they're in). We have OsString::push for the joining case but nothing to help with splitting. I think it would be unfortunate if we don't have some safe way to check you're doing it right.

@epage
Copy link
Contributor Author

epage commented Mar 31, 2023

but nothing to help with splitting. I think it would be unfortunate if we don't have some safe way to check you're doing it right.

A part of me finds it weird to offer a safe function like split_at that takes a parameter (mid) that can only be calculated from another type derived from with unsafe code. Documenting use of split_at would then further raise visibility of this feature, for both good (people getting stuff done) and bad (risk of misuse). However, I do see the benefit as that does greatly reduce the risk of something going wrong. In my experiments with transmuting &OsStr, I wrapped all byte operations in functions that only allowed deriving mid from &str parameters (e.g. OsStrExt::strip_prefix(&self, prefix: &str)) so I knew I was splitting at a safe boundary.

EDIT: To make clear, I'm open to adding that function if that is the direction we want to take this.

@ChrisDenton
Copy link
Member

Hm, we could offer some safe way to get an index according to these rules. E.g.

// Returns the byte index where `searcher` first returns true.
// `searcher` can inspect `char`s decoded from UTF-8.
// Non UTF-8 encoded characters are skipped.
fn find_char(&self, searcher: FnMut(char) -> bool) -> Option<usize>;

But at this point I'll stop because I'm practically designing yet another alternative approach.

@Amanieu Amanieu added the I-libs-api-nominated Nominated for discussion during a libs-api team meeting. label Apr 8, 2023
@chorman0773
Copy link
Contributor

Does this confirm that OsStr is a "bag of bytes" on any platform, in that you can put and leave any (init, non-pointer) bytes in the slice, or am I misinterpreting the proposal?

@epage
Copy link
Contributor Author

epage commented Apr 11, 2023

Does this confirm that OsStr is a "bag of bytes" on any platform, in that you can put and leave any (init, non-pointer) bytes in the slice, or am I misinterpreting the proposal?

For now, this proposals restricts what is safe to transmute from &[u8] to &OsStr, so this does not work as an arbitrary bag of bytes

//! - When [transmuting] from `&[u8]` to `&OsStr`,
//!   - the slice may only include content from comparable `&OsStr` (see above) or be valid UTF-8
//!   - any splits of the `&OsStr` must be along char boundaries (the first byte of a UTF-8 code
//!     point sequence)

@BurntSushi
Copy link
Member

I think I buy the rules as written, but I would like to see a doc example using unsafe, along with a SAFETY comment justifying why it's OK. I think that would also help others write correct code.

Also, cc @SimonSapin to see what you think about this. (Let me know if you want me to stop pinging you about this, but I always think of you as the champion against a change like this.)

Popping up a level, this is kind of an interesting use of unsafe, isn't it? It's carving out a strictly more conservative set of things the caller needs to uphold than what is actually true. I wonder, for example, how Miri might detect UB in a case like this.

@blyxxyz
Copy link
Contributor

blyxxyz commented Apr 11, 2023

Does this confirm that OsStr is a "bag of bytes" on any platform, in that you can put and leave any (init, non-pointer) bytes in the slice, or am I misinterpreting the proposal?

It actually isn't a bag of bytes on Windows, it must be WTF-8-encoded or you can start reading out of bounds:

fn main() {
    let b: &[u8] = b"\xC2";
    let s: &std::ffi::OsStr = unsafe { std::mem::transmute(b) };
    dbg!(s);
}
$ cargo run -q --target x86_64-pc-windows-gnu
[src/main.rs:4] s = "\u{9b}:]  = \n\0\0\0\0\0\0\01\u{10}L[...]thread 'main' panicked at 'failed printing to stderr: Windows stdio in console mode does not support writing non-UTF-8 byte sequences', library\std\src\io\stdio.rs:1008:9

The proposal is phrased to never produce invalid WTF-8.

I think that as written now it allows for any OsStr encoding that's a self-synchronizing superset of UTF-8.

@asquared31415
Copy link
Contributor

asquared31415 commented Apr 11, 2023

When [transmuting] from &[u8] to &OsStr,

This wording would provide an additional guarantee that &[u8] and &OsStr have the same layout in terms of their (ptr, len) pairs. As far as I am aware, this is something that is not currently guaranteed, and I don't recall the current stance on guaranteeing layout compatibility between any two fat pointer types that have different pointees.

@bors bors added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. and removed S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. labels May 28, 2023
@epage epage force-pushed the wtf branch 2 times, most recently from e2d912b to e6a35c4 Compare May 29, 2023 12:57
@epage
Copy link
Contributor Author

epage commented May 30, 2023

I've addressed what caused the wasm build failure and it should be good to go again.

@Amanieu
Copy link
Member

Amanieu commented May 30, 2023

@bors r+

@bors
Copy link
Contributor

bors commented May 30, 2023

📌 Commit e6a35c4 has been approved by Amanieu

It is now in the queue for this repository.

@bors bors added S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels May 30, 2023
Dylan-DPC added a commit to Dylan-DPC/rust that referenced this pull request May 30, 2023
Allow limited access to `OsStr` bytes

`OsStr` has historically kept its implementation details private out of
concern for locking us into a specific encoding on Windows.

This is an alternative to rust-lang#95290 which proposed specifying the encoding on Windows.  Instead, this
only specifies that for cross-platform code, `OsStr`'s encoding is a superset of UTF-8 and defines
rules for safely interacting with it

At minimum, this can greatly simplify the `os_str_bytes` crate and every
arg parser that interacts with `OsStr` directly (which is most of those
that support invalid UTF-8).

Tracking issue: rust-lang#111544
@bors
Copy link
Contributor

bors commented May 30, 2023

⌛ Testing commit e6a35c4 with merge 9610dfe...

@bors
Copy link
Contributor

bors commented May 31, 2023

☀️ Test successful - checks-actions
Approved by: Amanieu
Pushing 9610dfe to master...

@bors bors added the merged-by-bors This PR was explicitly merged by bors. label May 31, 2023
@bors bors merged commit 9610dfe into rust-lang:master May 31, 2023
@rustbot rustbot added this to the 1.72.0 milestone May 31, 2023
@rust-timer
Copy link
Collaborator

Finished benchmarking commit (9610dfe): comparison URL.

Overall result: no relevant changes - no action needed

@rustbot label: -perf-regression

Instruction count

This benchmark run did not return any relevant results for this metric.

Max RSS (memory usage)

Results

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

mean range count
Regressions ❌
(primary)
- - 0
Regressions ❌
(secondary)
- - 0
Improvements ✅
(primary)
- - 0
Improvements ✅
(secondary)
-3.5% [-4.6%, -2.5%] 2
All ❌✅ (primary) - - 0

Cycles

This benchmark run did not return any relevant results for this metric.

Binary size

Results

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

mean range count
Regressions ❌
(primary)
0.0% [0.0%, 0.1%] 5
Regressions ❌
(secondary)
- - 0
Improvements ✅
(primary)
-0.1% [-0.1%, -0.1%] 3
Improvements ✅
(secondary)
- - 0
All ❌✅ (primary) -0.0% [-0.1%, 0.1%] 8

Bootstrap: 642.846s -> 643.715s (0.14%)

@apiraino apiraino removed the to-announce Announce this issue on triage meeting label Jun 15, 2023
epage added a commit to epage/rust that referenced this pull request Jul 7, 2023
This extends rust-lang#109698 to allow no-cost conversion between `Vec<u8>` and `OsString`
as suggested in feedback from `os_str_bytes` crate in rust-lang#111544.
matthiaskrgr added a commit to matthiaskrgr/rust that referenced this pull request Jul 22, 2023
Allow limited access to `OsString` bytes

This extends rust-lang#109698 to allow no-cost conversion between `Vec<u8>` and `OsString` as suggested in feedback from `os_str_bytes` crate in rust-lang#111544.
matthiaskrgr added a commit to matthiaskrgr/rust that referenced this pull request Jul 22, 2023
Allow limited access to `OsString` bytes

This extends rust-lang#109698 to allow no-cost conversion between `Vec<u8>` and `OsString` as suggested in feedback from `os_str_bytes` crate in rust-lang#111544.
matthiaskrgr added a commit to matthiaskrgr/rust that referenced this pull request Jul 22, 2023
Allow limited access to `OsString` bytes

This extends rust-lang#109698 to allow no-cost conversion between `Vec<u8>` and `OsString` as suggested in feedback from `os_str_bytes` crate in rust-lang#111544.
@epage epage deleted the wtf branch July 22, 2023 14:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
disposition-merge This issue / PR is in PFCP or FCP with a disposition to merge it. finished-final-comment-period The final comment period is finished for this PR / Issue. merged-by-bors This PR was explicitly merged by bors. S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. T-libs-api Relevant to the library API team, which will review and decide on the PR/issue.
Projects
None yet
Development

Successfully merging this pull request may close these issues.