OS string string-like interface #1309
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,239 @@ | ||
- Feature Name: osstring_string_interface | ||
- Start Date: 2015-10-05 | ||
- RFC PR: (leave this empty) | ||
- Rust Issue: (leave this empty) | ||
|
||
# Summary | ||
|
||
Add a string-like API to the `OsString` and `OsStr` types. This RFC | ||
focuses on creating a string-like interface, as opposed to RFC #1307, | ||
which focuses more on container-like features. | ||
|
||
# Motivation | ||
|
||
As mentioned in the `std::ffi::os_str` documentation: "**Note**: At | ||
the moment, these types are extremely bare-bones, usable only for | ||
conversion to/from various other string types. Eventually these types | ||
will offer a full-fledged string API." This is intended as a step in | ||
that direction. | ||
|
||
Having an ergonomic way to manipulate OS strings is needed to allow | ||
programs to easily handle non-UTF-8 data received from the operating | ||
system. Currently, it is common for programs to just convert OS data | ||
to `String`s, which leads to undesirable panics in the unusual case | ||
where the input is not UTF-8. For example, currently, calling rustc | ||
with a non-UTF-8 command line argument will result in an immediate | ||
panic. Fixing that in a way that actually handles non-UTF-8 data | ||
correctly (as opposed to, for example, just interpreting it lossily as | ||
UTF-8) would be very difficult with the current OS string API. Most | ||
of the functions proposed here were motivated by the OS string | ||
processing needs of rustc. | ||
|
||
# Detailed design | ||
|
||
## `OsString` | ||
|
||
`OsString` will get the following new method: | ||
```rust | ||
/// Converts an `OsString` into a `String`, avoiding a copy if possible. | ||
/// | ||
/// Any non-Unicode sequences are replaced with U+FFFD REPLACEMENT CHARACTER. | ||
pub fn into_string_lossy(self) -> String; | ||
|
||
``` | ||
|
||
This is analogous to the existing `OsStr::to_string_lossy` method, but | ||
transfers ownership. This operation can be done without a copy if the | ||
`OsString` contains UTF-8 data or if the platform is Windows. | ||
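
The behavior described above can be approximated today with existing stable APIs. The following is a minimal sketch, not the proposed implementation: the free function `into_string_lossy` is a hypothetical stand-in for the proposed method, built from `OsString::into_string` and `OsStr::to_string_lossy`. It avoids a copy only in the valid-Unicode case; it does not perform the in-place conversion that the platform-internal representation might allow.

```rust
use std::ffi::OsString;

/// Hypothetical free-function stand-in for the proposed
/// `OsString::into_string_lossy` method (assumption, not the RFC's code).
fn into_string_lossy(s: OsString) -> String {
    // `into_string` reuses the existing buffer when the contents are valid Unicode.
    s.into_string()
        // Otherwise fall back to the lossy conversion, which allocates a new
        // `String` with U+FFFD in place of each non-Unicode sequence.
        .unwrap_or_else(|os| os.to_string_lossy().into_owned())
}
```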
|
||
## `OsStr` | ||
|
||
`OsStr` will get the following new methods: | ||
```rust | ||
/// Returns true if the string starts with a valid UTF-8 sequence | ||
/// equal to the given `&str`. | ||
fn starts_with_str(&self, prefix: &str) -> bool; | ||
When adding string-like APIs to fn starts_with<S: AsRef<OsStr>>(&self, prefix: S) -> bool; That should cover this use case as well as something like
Reasonable. I'll make that change. |
||
|
||
/// If the string starts with the given `&str`, returns the rest | ||
/// of the string. Otherwise returns `None`. | ||
fn remove_prefix_str(&self, prefix: &str) -> Option<&OsStr>; | ||
Along the lines of sticking as close to
With I’ve wanted
Yeah, I still like this functionality, but you're right that it would be nice on both types of strings. Since modifying Hmm, if this took a pattern and also returned the prefix, then it could nicely replace |
||
|
||
/// Retrieves the first character from the `OsStr` and returns it | ||
/// and the remainder of the `OsStr`. Returns `None` if the | ||
/// `OsStr` does not start with a character (either because it is | ||
/// empty or because it starts with non-UTF-8 data). | ||
fn slice_shift_char(&self) -> Option<(char, &OsStr)>; | ||
We may want to hold off on adding this API to OS strings for now as the API on
My understanding is that the
let first = string.chars().next();
if let Some(first) = first {
    let rest = &string[first.len_utf8()..];
    ...
}
But neither
let mut split = string.splitn(3, "");
split.next().unwrap();
match (split.next(), split.next()) {
    (Some(first), Some(rest)) if first.to_str().map(|s| s.chars().count() == 1).unwrap_or(false) => {
        let first = first.to_str().unwrap().chars().next().unwrap();
        ...
    }
    _ => {}
}
but that's kind of a mouthful and is easy to screw up in subtle ways (and depends on what empty patterns end up doing).
Hm, very good points! I realize now I should have read the section below a little more closely, thanks for the explanation! |
||
|
||
/// If the `OsStr` starts with a UTF-8 section followed by | ||
/// `boundary`, returns the sections before and after the boundary | ||
/// character. Otherwise returns `None`. | ||
fn split_off_str(&self, boundary: char) -> Option<(&str, &OsStr)>; | ||
Similarly to |
||
|
||
/// Returns an iterator over sections of the `OsStr` separated by | ||
/// the given character. | ||
/// | ||
/// # Panics | ||
/// | ||
/// Panics if the boundary character is not ASCII. | ||
fn split<'a>(&'a self, boundary: char) -> Split<'a>; | ||
This is certainly an interesting API! Some thoughts I have here:
As I mentioned in the Alternatives section, it is not possible to split an Thinking about this a bit more, the restriction to ASCII should be completely unnecessary. We should be able to take an arbitrary
Aha! Sorry I think I missed that part of the section below. Generalization to a full cc @SimonSapin, curious on your thoughts on
I agree the ASCII restriction is not necessary on Windows. I don’t think there is an issue splitting WTF-8 on any
Can you expand on this? Concatenation joining surrogate pairs that were previously not in a pair can be unexpected, but I don’t think it makes anything impossible.
I'm going to move to a I'll try to add a better explanation of the splitting problem. For now see my comments about |
||
``` | ||
|
||
These methods fall into two categories. The first four | ||
(`starts_with_str`, `remove_prefix_str`, `slice_shift_char`, and | ||
`split_off_str`) interpret a prefix of the `OsStr` as UTF-8 data, | ||
while ignoring any non-UTF-8 parts later in the string. The last is a | ||
restricted splitting operation. | ||
|
||
### `starts_with_str` | ||
|
||
`string.starts_with_str(prefix)` is logically equivalent to | ||
`string.remove_prefix_str(prefix).is_some()`, but is likely to be a | ||
common enough special case to warrant its own clearer syntax. | ||
|
||
### `remove_prefix_str` | ||
|
||
This can be used, for example, to strip the leading "--" from | ||
command line options, a common first step that simplifies later | ||
processing. Example: | ||
```rust | ||
let opt = OsString::from("--path=/some/path"); | ||
assert_eq!(opt.remove_prefix_str("--"), Some(OsStr::new("path=/some/path"))); | ||
``` | ||
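
For illustration, here is a minimal sketch of how these two prefix operations could behave on Unix, where an `OsStr` is an arbitrary byte sequence. It is written as free functions because the methods are only proposed here, it relies on the Unix-only `std::os::unix::ffi::OsStrExt` trait, and it says nothing about how a Windows implementation would work.

```rust
use std::ffi::OsStr;
use std::os::unix::ffi::OsStrExt; // Unix-only: exposes the raw bytes of an `OsStr`

/// Sketch of the proposed `remove_prefix_str` as a free function (assumption:
/// Unix, where matching the prefix is a plain byte comparison).
fn remove_prefix_str<'a>(s: &'a OsStr, prefix: &str) -> Option<&'a OsStr> {
    s.as_bytes()
        .strip_prefix(prefix.as_bytes())
        // The remainder may contain non-UTF-8 bytes; it is still a valid `OsStr`.
        .map(|rest| OsStr::from_bytes(rest))
}

/// Sketch of the proposed `starts_with_str`, defined through the function above.
fn starts_with_str(s: &OsStr, prefix: &str) -> bool {
    remove_prefix_str(s, prefix).is_some()
}
```

With these definitions, `remove_prefix_str(OsStr::new("--path=/some/path"), "--")` yields the remainder `path=/some/path`, matching the example above.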
|
||
### `slice_shift_char` | ||
|
||
This performs the same function as the similarly named method on | ||
`str`, except that it also returns `None` if the `OsStr` does not | ||
start with a valid UTF-8 character. While the `str` version of this | ||
function may be removed for being redundant with `str::chars`, the | ||
functionality is still needed here because it is not clear how an | ||
iterator over the contents of an `OsStr` could be defined in a | ||
platform-independent way. | ||
|
||
An intended use for this function is for interpreting bundled | ||
command-line switches. For example, with switches from rustc: | ||
|
||
```rust | ||
let arg = OsString::from("vL/path"); // Leading '-' has already been removed | ||
let mut opts: &OsStr = &arg; // bind the owned string first so the borrow stays valid | ||
let mut verbose = false; | ||
while let Some((ch, rest)) = opts.slice_shift_char() { | ||
opts = rest; | ||
match ch { | ||
'v' => { verbose = true; } | ||
'L' => { /* interpret remainder as a link path */ } | ||
_ => { /* ... other switches ... */ } | ||
} | ||
} | ||
``` | ||
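
A possible Unix-only sketch of the operation itself is shown below, again as a free function rather than the proposed method. It assumes the Unix representation (raw bytes) and decodes the first character from the longest valid UTF-8 prefix of those bytes.

```rust
use std::ffi::OsStr;
use std::os::unix::ffi::OsStrExt; // Unix-only: exposes the raw bytes of an `OsStr`

/// Sketch of the proposed `slice_shift_char` as a free function (assumption: Unix).
fn slice_shift_char(s: &OsStr) -> Option<(char, &OsStr)> {
    let bytes = s.as_bytes();
    // Find the longest prefix of the bytes that is valid UTF-8.
    let valid = match std::str::from_utf8(bytes) {
        Ok(text) => text,
        Err(e) => std::str::from_utf8(&bytes[..e.valid_up_to()]).ok()?,
    };
    // `None` if the string is empty or starts with non-UTF-8 data.
    let ch = valid.chars().next()?;
    Some((ch, OsStr::from_bytes(&bytes[ch.len_utf8()..])))
}
```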
|
||
### `split_off_str` | ||
|
||
This is intended for interpreting "tagged" OS strings, for example | ||
rustc's `-L [KIND=]PATH` arguments. It is expected that such tags | ||
will usually be UTF-8. Example: | ||
```rust | ||
let s = OsString::from("dylib=/path"); | ||
|
||
let (name, kind) = match s.split_off_str('=') { | ||
None => (&*s, cstore::NativeUnknown), | ||
Some(("dylib", name)) => (name, cstore::NativeUnknown), | ||
Some(("framework", name)) => (name, cstore::NativeFramework), | ||
Some(("static", name)) => (name, cstore::NativeStatic), | ||
Some((s, _)) => { error(...) } | ||
}; | ||
``` | ||
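
The following sketch shows one way the described semantics could be realized on Unix. It is a free function standing in for the proposed method, and it assumes that "starts with a UTF-8 section followed by `boundary`" means the boundary must occur within the longest valid UTF-8 prefix of the string.

```rust
use std::ffi::OsStr;
use std::os::unix::ffi::OsStrExt; // Unix-only: exposes the raw bytes of an `OsStr`

/// Sketch of the proposed `split_off_str` as a free function (assumption: Unix).
fn split_off_str(s: &OsStr, boundary: char) -> Option<(&str, &OsStr)> {
    let bytes = s.as_bytes();
    // Only the leading valid UTF-8 section is searched for the boundary.
    let valid = match std::str::from_utf8(bytes) {
        Ok(text) => text,
        Err(e) => std::str::from_utf8(&bytes[..e.valid_up_to()]).ok()?,
    };
    let at = valid.find(boundary)?;
    // Everything after the boundary is returned as-is, UTF-8 or not.
    let rest = &bytes[at + boundary.len_utf8()..];
    Some((&valid[..at], OsStr::from_bytes(rest)))
}
```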
|
||
### `split` | ||
|
||
This resembles the function of the same name on `str`, except that the | ||
splitting boundary is restricted to an ASCII character instead of a | ||
general pattern. ASCII characters have well-defined meanings in both | ||
flavors of OS string, and the portions before and after such a | ||
character are always well-formed OS strings. | ||
|
||
This is intended for interpreting OS strings containing several paths. | ||
Using this function will generally restrict the allowed paths to those | ||
not containing the separator, but this is a common limitation already | ||
in such interfaces. For example, rustc's `--emit dep-info=bar.d,link` | ||
could be processed as: | ||
```rust | ||
let arg = OsString::from("dep-info=bar.d,link"); | ||
|
||
for part in arg.split(',') { | ||
match part.split_off_str('=') { | ||
... | ||
} | ||
} | ||
``` | ||
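
As a rough illustration of the ASCII-restricted form, the sketch below splits the raw bytes on Unix. The name `split_ascii` and the `impl Iterator` return type are assumptions made for this example; the RFC proposes a method named `split` returning a dedicated `Split<'a>` type.

```rust
use std::ffi::OsStr;
use std::os::unix::ffi::OsStrExt; // Unix-only: exposes the raw bytes of an `OsStr`

/// Sketch of the proposed ASCII-only `split` (assumption: Unix; hypothetical name).
///
/// Panics if `boundary` is not ASCII, mirroring the documented behavior.
fn split_ascii<'a>(s: &'a OsStr, boundary: char) -> impl Iterator<Item = &'a OsStr> + 'a {
    assert!(boundary.is_ascii(), "boundary must be an ASCII character");
    let byte = boundary as u8;
    // An ASCII byte never occurs inside a multi-byte UTF-8 sequence, so each
    // piece keeps any well-formed UTF-8 text intact.
    s.as_bytes()
        .split(move |&b| b == byte)
        .map(|part| OsStr::from_bytes(part))
}
```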
|
||
## `SliceConcatExt` | ||
|
||
Implement the trait | ||
```rust | ||
impl<S> SliceConcatExt<OsStr> for [S] where S: Borrow<OsStr> { | ||
type Output = OsString; | ||
... | ||
} | ||
``` | ||
|
||
This has the same behavior as the `str` version, except that it works | ||
on OS strings. It is intended as a more convenient and efficient way | ||
of building up an `OsString` from parts than repeatedly calling | ||
`push`. | ||
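
For comparison, the effect of such a concatenation can be written today with `push`. The helper below is a hypothetical sketch, not the proposed trait implementation; it is bounded on `AsRef<OsStr>` (rather than the `Borrow<OsStr>` used above) because that is what `OsString::push` accepts. It reserves the full capacity up front, which is the kind of efficiency gain a dedicated `concat` could offer.

```rust
use std::ffi::{OsStr, OsString};

/// Hypothetical helper approximating `concat` for OS strings with existing APIs.
fn concat_os<S: AsRef<OsStr>>(parts: &[S]) -> OsString {
    // Sum the lengths first so the buffer is allocated only once.
    let total: usize = parts.iter().map(|p| p.as_ref().len()).sum();
    let mut out = OsString::with_capacity(total);
    for part in parts {
        out.push(part);
    }
    out
}
```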
|
||
# Drawbacks | ||
|
||
This is a somewhat unusual string interface in that much of the | ||
functionality either accepts or returns a different type of string | ||
than the one the interface is designed to work with (`str` instead of | ||
the probably expected `OsStr`). | ||
|
||
# Alternatives | ||
|
||
## Interfaces without `str` | ||
|
||
Versions of the `*_str` functions that take or return `&OsStr`s seem | ||
more natural, but in at least some of the cases it is not possible to | ||
implement such an interface. For example, on Windows, the following | ||
should hold using a hypothetical `remove_prefix(&self, &OsStr) -> | ||
Option<&OsStr>`: | ||
|
||
```rust | ||
let string = OsString::from("😺"); // [0xD83D, 0xDE3A] in UTF-16 | ||
let prefix: OsString = OsStringExt::from_wide(&[0xD83D]); | ||
let suffix: OsString = OsStringExt::from_wide(&[0xDE3A]); | ||
|
||
assert_eq!(string.remove_prefix(&prefix[..]), Some(&suffix[..])); | ||
It is possible to write
It is possible to write a |
||
``` | ||
|
||
However, the slice `&suffix[..]` (internally `[0xED, 0xB8, 0xBA]`) | ||
does not occur anywhere in `string` (internally `[0xF0, 0x9F, 0x98, | ||
0xBA]`), so there would be no way to construct the return value of | ||
such a function. | ||
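
The byte values quoted above can be checked with portable code. The sketch below is only an illustration: `three_byte_form` and `contains` are hypothetical helpers, and the three-byte encoding of a lone surrogate follows the generalized UTF-8 pattern that WTF-8 uses for unpaired surrogates.

```rust
/// Encodes a 16-bit code unit in the range 0x0800..=0xFFFF using the three-byte
/// UTF-8 pattern; WTF-8 applies this pattern to lone surrogates as well.
fn three_byte_form(unit: u16) -> [u8; 3] {
    let cp = unit as u32;
    [
        (0xE0 | (cp >> 12)) as u8,
        (0x80 | ((cp >> 6) & 0x3F)) as u8,
        (0x80 | (cp & 0x3F)) as u8,
    ]
}

/// Returns true if `needle` occurs as a contiguous run inside `haystack`.
fn contains(haystack: &[u8], needle: &[u8]) -> bool {
    !needle.is_empty() && haystack.windows(needle.len()).any(|w| w == needle)
}

/// Returns the UTF-16 code units and UTF-8 bytes of the example string.
fn cat_encodings() -> (Vec<u16>, Vec<u8>) {
    let cat = "😺";
    (cat.encode_utf16().collect(), cat.as_bytes().to_vec())
}
```

Here `cat_encodings()` yields the surrogate pair `[0xD83D, 0xDE3A]` and the bytes `[0xF0, 0x9F, 0x98, 0xBA]`, while `three_byte_form(0xDE3A)` gives `[0xED, 0xB8, 0xBA]`, which `contains` confirms is not a sub-slice of the former.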
|
||
## Different forms for `split` | ||
|
||
The restriction of the argument of `split` to ASCII characters is a | ||
very conservative choice. It would be possible to allow any Unicode | ||
character as the divider, at the expense of creating somewhat strange | ||
situations where, for example, applying `split` followed by `concat` | ||
produces a string containing the divider character. As any interface | ||
manipulating OS strings is generally non-Unicode, needing to split on | ||
non-ASCII characters is likely rare. | ||
|
||
In some ways, it would be more natural to split on bytes in Unix and | ||
16-bit code units in Windows, but it would be difficult to present a | ||
cross-platform interface for such functionality and implementations on | ||
Windows would have similar issues to those in the `remove_prefix` | ||
example above. | ||
|
||
# Unresolved questions | ||
|
||
It is not obvious that the `split` function's restriction to ASCII | ||
dividers is the correct interface. | ||
|
||
There are many directions this interface could be extended in. It | ||
would be possible to provide a subset of this functionality using | ||
`OsStr` rather than `str` in the interface, and it would also be | ||
possible to create functions that interacted with non-prefix portions | ||
of `OsStr`s. It is not clear whether the usefulness of these | ||
interfaces is high enough to be worth pursuing them at this time. |
Reading between lines I think you already know that, but WTF-8 to UTF-8 conversion can be done in place: https://simonsapin.github.io/wtf-8/#converting-wtf-8-utf-8
Yep. In fact, the internal WTF-8 implementation in libstd already has an `into_string_lossy` that does the right thing. It just isn't exposed at the `OsString` level.