OS string string-like interface #1309

Closed
wants to merge 3 commits into from
239 changes: 239 additions & 0 deletions text/0000-osstring-string-interface.md
@@ -0,0 +1,239 @@
- Feature Name: osstring_string_interface
- Start Date: 2015-10-05
- RFC PR: (leave this empty)
- Rust Issue: (leave this empty)

# Summary

Add a string-like API to the `OsString` and `OsStr` types. This RFC
focuses on creating a string-like interface, as opposed to RFC #1307,
which focuses more on container-like features.

# Motivation

As mentioned in the `std::ffi::os_str` documentation: "**Note**: At
the moment, these types are extremely bare-bones, usable only for
conversion to/from various other string types. Eventually these types
will offer a full-fledged string API." This is intended as a step in
that direction.

Having an ergonomic way to manipulate OS strings is needed to allow
programs to easily handle non-UTF-8 data received from the operating
system. Currently, it is common for programs to just convert OS data
to `String`s, which leads to undesirable panics in the unusual case
where the input is not UTF-8. For example, currently, calling rustc
with a non-UTF-8 command line argument will result in an immediate
panic. Fixing that in a way that actually handles non-UTF-8 data
correctly (as opposed to, for example, just interpreting it lossily as
UTF-8) would be very difficult with the current OS string API. Most
of the functions proposed here were motivated by the OS string
processing needs of rustc.

# Detailed design

## `OsString`

`OsString` will get the following new method:
```rust
/// Converts an `OsString` into a `String`, avoiding a copy if possible.
///
/// Any non-Unicode sequences are replaced with U+FFFD REPLACEMENT CHARACTER.
pub fn into_string_lossy(self) -> String;

```

This is analogous to the existing `OsStr::to_string_lossy` method, but
transfers ownership. This operation can be done without a copy if the
`OsString` contains UTF-8 data or if the platform is Windows.
Comment (Contributor):

Reading between lines I think you already know that, but WTF-8 to UTF-8 conversion can be done in place: https://simonsapin.github.io/wtf-8/#converting-wtf-8-utf-8

Comment (Author):

Yep. In fact, the internal WTF-8 implementation in libstd already has an into_string_lossy that does the right thing. It just isn't exposed at the OsString level.
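
For illustration, the semantics of the proposed `into_string_lossy` can be approximated with existing stable APIs. This is only a sketch: the free function below is a hypothetical stand-in for the proposed method, and unlike the in-place WTF-8 conversion discussed above it always allocates when the input is not valid Unicode.

```rust
use std::ffi::OsString;

/// Hypothetical free-function stand-in for the proposed
/// `OsString::into_string_lossy`.
fn into_string_lossy(s: OsString) -> String {
    match s.into_string() {
        // Valid Unicode: `into_string` hands back the owned data as a `String`.
        Ok(string) => string,
        // Otherwise fall back to the borrowing lossy conversion, which
        // replaces invalid sequences with U+FFFD and allocates a new `String`.
        Err(original) => original.to_string_lossy().into_owned(),
    }
}
```

A real implementation inside libstd could avoid the second allocation on Windows by rewriting the buffer in place.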


## `OsStr`

`OsStr` will get the following new methods:
```rust
/// Returns true if the string starts with a valid UTF-8 sequence
/// equal to the given `&str`.
fn starts_with_str(&self, prefix: &str) -> bool;
Comment (Member):

When adding string-like APIs to OsStr, it may be best to try to stick to original str APIs as much as possible, while also allowing all kinds of fun functionality for OsStr. Along those lines, perhaps this API could look like:

fn starts_with<S: AsRef<OsStr>>(&self, prefix: S) -> bool;

That should cover this use case as well as something like os_str.starts_with(&other_os_string).

Comment (Author):

Reasonable. I'll make that change.


/// If the string starts with the given `&str`, returns the rest
/// of the string. Otherwise returns `None`.
fn remove_prefix_str(&self, prefix: &str) -> Option<&OsStr>;
Comment (Member):

Along the lines of sticking as close to str as possible, I think that this would be best expressed as a splitn if you're working with str. Adding that API to OS strings, however, may be a bit tricky as it may involve dealing with the Pattern trait, so it may be best to perhaps hold off on this just yet? Either that or perhaps adding splitn which only takes P: AsRef<OsStr> for now.

Comment (Contributor):

With splitn you have to check that the first chunk is empty, and it does more work than necessary in the negative case.

I’ve wanted str::remove_prefix as well (on "normal" UTF-8 strings). Could it be added on both types?

Comment (Author):

Yeah, I still like this functionality, but you're right that it would be nice on both types of strings. Since modifying str seems out of scope for this RFC, I think this will probably be bumped to a future RFC.

Hmm, if this took a pattern and also returned the prefix, then it could nicely replace slice_shift_char.


/// Retrieves the first character from the `OsStr` and returns it
/// and the remainder of the `OsStr`. Returns `None` if the
/// `OsStr` does not start with a character (either because it is
/// empty or because it starts with non-UTF-8 data).
fn slice_shift_char(&self) -> Option<(char, &OsStr)>;
Comment (Member):

We may want to hold off on adding this API to OS strings for now, as the API on str is still unstable and it may be a while before it's stabilized (it's also a little dubious how useful it is).

Comment (Author):

My understanding is that the str version of this is potentially non-useful because it is easy to rewrite it as

let first = string.chars().next();
if let Some(first) = first {
    let rest = &string[first.len_utf8()..];
    ...
}

But neither chars() nor indexing make sense on an OsStr. The best I can come up with for OsStr is something like

let mut split = string.splitn(3, "");
split.next().unwrap();
match (split.next(), split.next()) {
    (Some(first), Some(rest)) if first.to_str().map(|s| s.chars().count() == 1).unwrap_or(false) => {
        let first = first.to_str().unwrap().chars().next().unwrap();
        ...
    }
    _ => {}
}

but that's kind of a mouthful and is easy to screw up in subtle ways (and depends on what empty patterns end up doing).

Comment (Member):

Hm, very good points! I realize now I should have read the section below a little more closely, thanks for the explanation!


/// If the `OsStr` starts with a UTF-8 section followed by
/// `boundary`, returns the sections before and after the boundary
/// character. Otherwise returns `None`.
fn split_off_str(&self, boundary: char) -> Option<(&str, &OsStr)>;
Comment (Member):

Similarly to remove_prefix_str above, perhaps this would be best suited for a splitn in terms of compositions? You'd probably get two OsStr instances out of that, but could call to_str on the first to get the same API as this.


/// Returns an iterator over sections of the `OsStr` separated by
/// the given character.
///
/// # Panics
///
/// Panics if the boundary character is not ASCII.
fn split<'a>(&'a self, boundary: char) -> Split<'a>;
Comment (Member):

This is certainly an interesting API! Some thoughts I have here:

  • In the interest of staying aligned with str, this may want to take at least P: AsRef<OsStr> and can perhaps eventually be generalized to P: OsPattern (similar to the split function on str) to also allow char.
  • Having this API panic if a character isn't ASCII though is pretty unfortunate, and I think it'd be best to avoid that (and perhaps using OsStr would alleviate that for now?)

Comment (Author):

As I mentioned in the Alternatives section, it is not possible to split an OsStr on an OsStr in Windows because of the details of the WTF-8 encoding.

Thinking about this a bit more, the restriction to ASCII should be completely unnecessary. We should be able to take an arbitrary &str, and possibly even a full P: Pattern (although I'll have to think about that a bit more), but notably not any generalizations of those to OsStr.

Comment (Member):

Aha! Sorry I think I missed that part of the section below. Generalization to a full Pattern would be great, and even &str would be quite nice!

cc @SimonSapin, curious on your thoughts on split + the WTF-8 encoding

Comment (Contributor):

I agree the ASCII restriction is not necessary on Windows. I don’t think there is an issue splitting WTF-8 on any char. It’s not clear what splitting on a non-ASCII means on Unix, though. Split on the corresponding UTF-8 sequence? That makes sense (especially if split also accepts &str arguments), but needs to be documented.

As I mentioned in the Alternatives section, it is not possible to split an OsStr on an OsStr in Windows because of the details of the WTF-8 encoding.

Can you expand on this? Concatenation joining surrogate pairs that were previously not in a pair can be unexpected, but I don’t think it makes anything impossible.

Comment (Author):

I'm going to move to a Pattern interface for split in the next version, although that might end up being scaled back if we can't decide what to do with a pattern that matches "".

I'll try to add a better explanation of the splitting problem. For now see my comments about remove_prefix later in these comments.

```

These methods fall into two categories. The first four
(`starts_with_str`, `remove_prefix_str`, `slice_shift_char`, and
`split_off_str`) interpret a prefix of the `OsStr` as UTF-8 data,
while ignoring any non-UTF-8 parts later in the string. The last is a
restricted splitting operation.

### `starts_with_str`

`string.starts_with_str(prefix)` is logically equivalent to
`string.remove_prefix_str(prefix).is_some()`, but is likely to be a
common enough special case to warrant its own clearer syntax.

### `remove_prefix_str`

This could be used for things such as removing the leading "--" from
command line options as is common to enable simpler processing.
Example:
```rust
let opt = OsString::from("--path=/some/path");
assert_eq!(opt.remove_prefix_str("--"), Some(OsStr::new("path=/some/path")));
```
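
As a rough sketch of how these two prefix operations relate, both can be written today as free functions over the encoded bytes of an `OsStr`. The function names mirror the proposed methods but are hypothetical stand-ins, and the sketch assumes Rust 1.74 or later for `OsStr::as_encoded_bytes` and `OsStr::from_encoded_bytes_unchecked`, neither of which existed when this RFC was written.

```rust
use std::ffi::OsStr;

/// Hypothetical stand-in for the proposed `OsStr::remove_prefix_str`.
fn remove_prefix_str<'a>(s: &'a OsStr, prefix: &str) -> Option<&'a OsStr> {
    // Compare the UTF-8 bytes of `prefix` against the start of the OS string.
    let rest = s.as_encoded_bytes().strip_prefix(prefix.as_bytes())?;
    // SAFETY: `rest` was produced by `as_encoded_bytes` and starts either at
    // the beginning of the string (empty prefix) or immediately after a
    // valid, non-empty UTF-8 substring, which is a permitted split point.
    Some(unsafe { OsStr::from_encoded_bytes_unchecked(rest) })
}

/// Hypothetical stand-in for the proposed `OsStr::starts_with_str`,
/// expressed through the equivalence stated in the RFC text.
fn starts_with_str(s: &OsStr, prefix: &str) -> bool {
    remove_prefix_str(s, prefix).is_some()
}
```

Only the prefix has to be valid UTF-8; whatever follows it is passed through untouched.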

### `slice_shift_char`

This performs the same function as the similarly named method on
`str`, except that it also returns `None` if the `OsStr` does not
start with a valid UTF-8 character. While the `str` version of this
function may be removed for being redundant with `str::chars`, the
functionality is still needed here because it is not clear how an
iterator over the contents of an `OsStr` could be defined in a
platform-independent way.

An intended use for this function is for interpreting bundled
command-line switches. For example, with switches from rustc:

```rust
let mut opts = &OsString::from("vL/path")[..]; // Leading '-' has already been removed
while let Some((ch, rest)) = opts.slice_shift_char() {
    opts = rest;
    match ch {
        'v' => { verbose = true; }
        'L' => { /* interpret remainder as a link path */ }
        ....
    }
}
```
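
One possible way to obtain this behaviour on current Rust is sketched below. It is not the implementation proposed by the RFC: `slice_shift_char` here is a hypothetical free function, and it assumes Rust 1.74 or later for the encoded-bytes accessors on `OsStr`.

```rust
use std::ffi::OsStr;

/// Hypothetical stand-in for the proposed `OsStr::slice_shift_char`.
fn slice_shift_char(s: &OsStr) -> Option<(char, &OsStr)> {
    let bytes = s.as_encoded_bytes();
    // A UTF-8 encoded character is 1 to 4 bytes long, and the shortest
    // prefix that validates as UTF-8 is exactly one character.
    for len in 1..=bytes.len().min(4) {
        if let Ok(head) = std::str::from_utf8(&bytes[..len]) {
            let ch = head.chars().next()?;
            // SAFETY: the split point is immediately after a valid,
            // non-empty UTF-8 substring of the encoded bytes.
            let rest = unsafe { OsStr::from_encoded_bytes_unchecked(&bytes[len..]) };
            return Some((ch, rest));
        }
    }
    // Empty string, or the string does not start with valid UTF-8.
    None
}
```

With this helper, the bundled-switch loop above can be written as `while let Some((ch, rest)) = slice_shift_char(opts) { ... }`.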

### `split_off_str`

This is intended for interpreting "tagged" OS strings, for example
rustc's `-L [KIND=]PATH` arguments. It is expected that such tags
will usually be UTF-8. Example:
```rust
let s = OsString::from("dylib=/path");

let (name, kind) = match s.split_off_str('=') {
    None => (&*s, cstore::NativeUnknown),
    Some(("dylib", name)) => (name, cstore::NativeUnknown),
    Some(("framework", name)) => (name, cstore::NativeFramework),
    Some(("static", name)) => (name, cstore::NativeStatic),
    Some((s, _)) => { error(...) }
};
```
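
The tag-splitting behaviour can be sketched as follows. The free function is a hypothetical stand-in for the proposed method, written under the assumption that Rust 1.74 or later is available for `OsStr::as_encoded_bytes` and `OsStr::from_encoded_bytes_unchecked`.

```rust
use std::ffi::OsStr;

/// Hypothetical stand-in for the proposed `OsStr::split_off_str`.
fn split_off_str(s: &OsStr, boundary: char) -> Option<(&str, &OsStr)> {
    let bytes = s.as_encoded_bytes();
    let mut buf = [0u8; 4];
    let needle = boundary.encode_utf8(&mut buf).as_bytes();
    // Byte offset of the first occurrence of the encoded boundary character.
    let at = bytes.windows(needle.len()).position(|w| w == needle)?;
    // The section before the boundary must be valid UTF-8.
    let head = std::str::from_utf8(&bytes[..at]).ok()?;
    // SAFETY: the remainder starts immediately after a valid, non-empty
    // UTF-8 substring (the boundary character itself).
    let rest = unsafe { OsStr::from_encoded_bytes_unchecked(&bytes[at + needle.len()..]) };
    Some((head, rest))
}
```

Only the first occurrence of the boundary is considered, so `a=b=c` splits into `a` and `b=c`.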

### `split`

This is similar to the similarly named function on `str`, except the
splitting boundary is restricted to be an ASCII character instead of a
general pattern. ASCII characters have well-defined meanings in both
flavors of OS string, and the portions before and after such a
character are always well-formed OS strings.

This is intended for interpreting OS strings containing several paths.
Using this function will generally restrict the allowed paths to those
not containing the separator, but this is a common limitation already
in such interfaces. For example, rustc's `--emit dep-info=bar.d,link`
could be processed as:
```rust
let arg = OsString::from("dep-info=bar.d,link");

for part in arg.split(',') {
    match part.split_off_str('=') {
        ...
    }
}
```
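
A minimal sketch of the ASCII-only split, again as a hypothetical free function rather than the proposed method. It returns a `Vec` instead of the `Split` iterator for brevity and assumes Rust 1.74 or later for the encoded-bytes accessors.

```rust
use std::ffi::OsStr;

/// Hypothetical stand-in for the proposed `OsStr::split`, restricted to
/// ASCII boundaries as in the RFC text.
fn split_ascii(s: &OsStr, boundary: char) -> Vec<&OsStr> {
    assert!(boundary.is_ascii(), "boundary must be ASCII");
    let byte = boundary as u8;
    s.as_encoded_bytes()
        .split(|b| *b == byte)
        // SAFETY: every piece is delimited by the ends of the string or by
        // an ASCII byte, which is a valid, non-empty UTF-8 substring, so
        // each split point is permitted.
        .map(|piece| unsafe { OsStr::from_encoded_bytes_unchecked(piece) })
        .collect()
}
```

Because an ASCII byte never occurs inside a multi-byte sequence in either OS string encoding, each piece is itself a well-formed OS string.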

## `SliceConcatExt`

Implement the trait
```rust
impl<S> SliceConcatExt<OsStr> for [S] where S: Borrow<OsStr> {
    type Output = OsString;
    ...
}
```

This has the same behavior as the `str` version, except that it works
on OS strings. It is intended as a more convenient and efficient way
of building up an `OsString` from parts than repeatedly calling
`push`.
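
The concatenation half of this trait could behave roughly like the sketch below. `concat_os` is a hypothetical free function used only for illustration; it reserves the total length up front, which is the efficiency gain over repeated `push` calls on an unsized buffer.

```rust
use std::borrow::Borrow;
use std::ffi::{OsStr, OsString};

/// Hypothetical free-function equivalent of `concat` for OS strings.
fn concat_os<S: Borrow<OsStr>>(parts: &[S]) -> OsString {
    // Sum the lengths first so the buffer is allocated once.
    let total: usize = parts
        .iter()
        .map(|p| Borrow::<OsStr>::borrow(p).len())
        .sum();
    let mut out = OsString::with_capacity(total);
    for part in parts {
        out.push(Borrow::<OsStr>::borrow(part));
    }
    out
}
```

The `Borrow<OsStr>` bound lets the same function accept slices of `OsString` as well as slices of `&OsStr`.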

# Drawbacks

This is a somewhat unusual string interface in that much of the
functionality either accepts or returns a different type of string
than the one the interface is designed to work with (`str` instead of
the probably expected `OsStr`).

# Alternatives

## Interfaces without `str`

Versions of the `*_str` functions that take or return `&OsStr`s seem
more natural, but in at least some of the cases it is not possible to
implement such an interface. For example, on Windows, the following
should hold using a hypothetical `remove_prefix(&self, &OsStr) ->
Option<&OsStr>`:

```rust
let string = OsString::from("😺"); // [0xD83D, 0xDE3A] in UTF-16
let prefix: OsString = OsStringExt::from_wide(&[0xD83D]);
let suffix: OsString = OsStringExt::from_wide(&[0xDE3A]);

assert_eq!(string.remove_prefix(&prefix[..]), Some(&suffix[..]));
Comment (Contributor):

It is possible to write remove_prefix for WTF-8 such that this holds. (When the prefix ends with a lead surrogate, also consider the corresponding range of four-bytes sequences in self.) But this is such an edge case that it may not be worth the complexity, and it’s not even clear to me that it’s a desirable behavior.

Comment (Author):

It is possible to write a starts_with variant that does this (and one will be included in the next version of this RFC), but for remove_prefix you also have to construct an &OsStr representing the rest of the string, which is a problem if that OsStr should start with a trail surrogate that is combined with something in the in-memory representation. (I suppose it could return a Cow<OsStr>, but I'm not planning on proposing that unless someone else thinks it's a really good idea.)

```

However, the slice `&suffix[..]` (internally `[0xED, 0xB8, 0xBA]`)
does not occur anywhere in `string` (internally `[0xF0, 0x9F, 0x98,
0xBA]`), so there would be no way to construct the return value of
such a function.
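
The byte sequences quoted above can be checked on any platform with a small sketch. The helper names are hypothetical; `wtf8_lone_surrogate` applies the three-byte generalized UTF-8 form that WTF-8 uses for an unpaired surrogate, and no Windows-specific API is involved.

```rust
/// Encodes a lone UTF-16 surrogate as WTF-8 does: the three-byte
/// generalized UTF-8 form of the surrogate code point.
fn wtf8_lone_surrogate(unit: u16) -> [u8; 3] {
    assert!((0xD800..=0xDFFF).contains(&unit), "not a surrogate");
    [
        0xE0 | (unit >> 12) as u8,
        0x80 | ((unit >> 6) & 0x3F) as u8,
        0x80 | (unit & 0x3F) as u8,
    ]
}

/// Returns true if `needle` occurs as a contiguous run inside `haystack`.
fn contains_bytes(haystack: &[u8], needle: &[u8]) -> bool {
    haystack.windows(needle.len()).any(|w| w == needle)
}
```

For the paired string, `"😺".as_bytes()` yields the four-byte form, and `contains_bytes` confirms that the three-byte encoding of the trail surrogate is not a subslice of it, so no borrowed `&OsStr` could point at the expected suffix.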

## Different forms for `split`

The restriction of the argument of `split` to ASCII characters is a
very conservative choice. It would be possible to allow any Unicode
character as the divider, at the expense of creating somewhat strange
situations where, for example, applying `split` followed by `concat`
produces a string containing the divider character. As any interface
manipulating OS strings is generally non-Unicode, needing to split on
non-ASCII characters is likely rare.

In some ways, it would be more natural to split on bytes in Unix and
16-bit code units in Windows, but it would be difficult to present a
cross-platform interface for such functionality and implementations on
Windows would have similar issues to those in the `remove_prefix`
example above.

# Unresolved questions

It is not obvious that the `split` function's restriction to ASCII
dividers is the correct interface.

There are many directions this interface could be extended in. It
would be possible to provide a subset of this functionality using
`OsStr` rather than `str` in the interface, and it would also be
possible to create functions that interacted with non-prefix portions
of `OsStr`s. It is not clear whether the usefulness of these
interfaces is high enough to be worth pursuing them at this time.