[RFC]: Standardize methods for leaking containers #2969

Closed
wants to merge 9 commits into from
335 changes: 335 additions & 0 deletions text/0000-container-leak.md
@@ -0,0 +1,335 @@
- Feature Name: `container-leak`
- Start Date: 2020-08-01
- RFC PR: [rust-lang/rfcs#0000](https://github.com/rust-lang/rfcs/pull/0000)
- Rust Issue: [rust-lang/rust#0000](https://github.com/rust-lang/rust/issues/0000)

# Summary
[summary]: #summary

Describe a standard set of methods for converting container types like `Box<T>`, `Arc<T>`, `Vec<T>`, `String` to and from raw pointers.

For containers with a single value like `Box<T>`, `Arc<T>`, and `Rc<T>`, any subset of the following method pairs should be added to work with their raw representations:

- `leak`: leak the container and return an arbitrarily long-lived shared or mutable reference to its allocated content.
- `leak_raw`: leak the container and return a `NonNull<T>` pointer to its content. The type `T` is the same as `Deref::Target`, so `Self::leak_raw(value)` is equivalent to `NonNull::from(&*value)` and `NonNull::from(Self::leak(value))`.
- `unleak_raw`: take a previously leaked `NonNull<T>` pointer and restore the container from it.
- `into_raw`: leak the container and return a raw pointer to its content.
- `from_raw`: take a previously leaked raw pointer and restore the container from it.

For multi-value containers like `Vec<T>` and `String`, any subset of the following method pairs should be added to work with their raw representations:

- `leak`: shrink the container to its initialized length, leak it and return an arbitrarily long-lived shared or mutable reference to its allocated content.
Member

> shrink the container to its initialized length

Is shrinking it first the best option here? I'd personally expect Vec::leak to not shrink the contents. I think of leak as just something like "hey this Vec you got here, never deallocate it", like std::mem::forget, except we can still access the leaked contents. Shrinking it first might involve a lot of copying, just to save some unused capacity. Without shrinking, it does what I'd expect from a leak function: it's basically a no-op other than stopping deallocation from happening.

If I want to not waste any unused capacity, I could still easily call shrink_to_fit first (or leak the .into_boxed_slice() instead).

If leak does shrink, I'd have to use unsafe code to leak it 'efficiently' (i.e. without copying/shrinking, but wasting some unused heap space).

Contributor

Do you have an example use-case where this inefficiency would matter? The best one I can think of is a startup latency sensitive embedded system, that might want to quickly leak a lot of data at startup. However that use-case might be better off using a separate "pointer-bump" allocator for data it wants to leak. Not possible yet. But with custom allocators it will be.

Other than that situation, I'd expect shrink_to_fit to almost always be the right choice, because the inefficiency of something done once at startup would be outweighed by the long-term gain of having a bit less memory pressure.

Member

It doesn't feel very much like Rust to hide a potentially complex operation (shrinking/reallocating) inside a function that appears to be cheap or even free (leak()). It might very well be the case that there are not many programs where it would matter much, but it'd be good if functions do what their name suggests. The name leak suggests it just leaks the memory allocation of the container. Not that it copies all the data over to another new allocation first and leaks that instead while deallocating the original allocation.

If leak() doesn't shrink, that's easily explained in the documentation, and shrink-leaking is still easy to achieve by calling shrink_to_fit() first (or using into_boxed_*().leak()). (Or maybe a shrink_and_leak() can be added?)

If leak() doesn't just leak but also shrinks, then just leaking is hard to achieve. It'd require unsafe {} and a raw pointer.

I also think leak() should be consistent with leak_raw, which only leaks and doesn't shrink.

Contributor Author

> Is shrinking it first the best option here? I'd personally expect Vec::leak to not shrink the contents.

Maybe whether or not these methods drop extra capacity should be left unspecified? I think Vec::leak only shrinks as a consequence of being implemented through Vec::into_boxed_slice so that it can use Box::leak.

Member

> I think of leak as just something like "hey this Vec you got here, never deallocate it"

Current leak implementations can return a lifetime shorter than &'static, which means at the end of its lifetime it would be legal to reconstitute the original allocation. For that one needs to know the capacity.

Member

> Current leak implementations can return a lifetime shorter than &'static, which means at the end of its lifetime it would be legal to reconstitute the original allocation.

As far as I know, the only reason it can return lifetimes shorter than 'static is to make it possible to leak Vec<Something<'notstatic>>. Turning a leaked Vec back into the original Vec doesn't really seem like an intended use case for leak(). into_raw_parts seems like the right function in that case.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Preferably, the term leak should only be used in a context where the intent is to leak the vector, and it should not shrink the vector.

Different versions of into_raw(_parts)/from_raw(_parts) should generally be used to decompose these types if you later intend to recompose them.

Except that, sadly, for historical reasons some of the into_raw methods do not return NonNull and as such require an otherwise unnecessary unsafe block (or .unwrap()) to create a NonNull.

With regard to shrink_to_fit(), we should warn in bold text in the leak documentation that reconstructing a Vec for which you only have the len and data_ptr will lead to unsafe behavior when dropping it, as the Layout for alloc and dealloc do not match. That is why Vec::into_boxed_slice() calls Vec::shrink_to_fit(); otherwise dropping the boxed slice would cause unsafe behavior whose consequences are allocator-defined.

Using leak() to create a pointer to the underlying slice and then passing it to a C FFI as ptr+len was a bug I just ran into...
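
A minimal sketch of the layout-safe ptr+len round trip discussed in this thread, assuming the data is first converted with `into_boxed_slice()` so that the allocation holds exactly `len` elements. The helper names `into_ptr_len` and `from_ptr_len` are hypothetical and exist only for illustration:

```rust
use std::ptr;

/// Hypothetical helper: hand out a thin pointer plus a length.
/// `into_boxed_slice` drops any excess capacity first, so `len` alone
/// describes the whole allocation.
pub fn into_ptr_len(v: Vec<u8>) -> (*mut u8, usize) {
    let boxed: Box<[u8]> = v.into_boxed_slice();
    let len = boxed.len();
    (Box::into_raw(boxed) as *mut u8, len)
}

/// Hypothetical helper: rebuild the boxed slice from the same pair.
///
/// # Safety
/// `ptr` and `len` must come from a single call to `into_ptr_len`
/// and must not be used again afterwards.
pub unsafe fn from_ptr_len(ptr: *mut u8, len: usize) -> Box<[u8]> {
    // The layout used for deallocation now matches the allocation.
    unsafe { Box::from_raw(ptr::slice_from_raw_parts_mut(ptr, len)) }
}
```

Rebuilding a `Vec` from a pointer and length alone is only sound when the capacity is known to equal the length, which is what the boxed slice provides here.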

- `leak_raw_parts`: leak the container and return a `NonNull<T>` pointer to its content along with any other state, like the allocated capacity, that would be needed to restore the container. The type `T` is the same as `Deref::Target`, so `NonNull::from(&*self)` is equivalent to `NonNull::from(self.leak())` and `NonNull::from(self.leak_raw_parts().0)`.
- `unleak_raw_parts`: take a previously leaked `NonNull<T>` pointer and additional state and restore the container from it.
- `into_raw_parts`: leak the container and return a raw pointer to its content along with any other state that would be needed to restore the container.
- `from_raw_parts`: take a previously leaked raw pointer and additional state and restore the container from it.

The `leak_raw`/`unleak_raw` methods are "modern" semantic alternatives to the existing `into_raw`/`from_raw` pair of methods on containers that use `NonNull<T>` as the pointer type instead of `*const T` or `*mut T`.
Users are encouraged to prefer the `leak_raw`/`unleak_raw` methods over `into_raw`/`from_raw` except for the important case where they need FFI-safety.

# Motivation
[motivation]: #motivation

The `NonNull<T>` type is a non-nullable pointer type that's covariant over `T`. `NonNull<T>` has stronger invariants than `*mut T`, but weaker than the internal `Unique<T>`.
Since `Unique<T>` isn't planned to be stabilized, `NonNull<T>` is the most appropriate pointer type for containers like `Box<T>` and `Vec<T>` to use as pointers to their inner value.

Unfortunately, `NonNull<T>` was stabilized after methods like `Box::into_raw` and `Vec::from_raw_parts`, which are left working with `*mut T`.
Now with the proposed API addition of `Vec::into_raw_parts` we're left with a conundrum. The options appear to be:

- break symmetry with `Vec::from_raw_parts` and diverge from `Box::into_raw` by producing a more semantically accurate `NonNull<T>`.
- not use a newer and more appropriate type for the purpose it exists for and leave it up to users to convert.

This RFC aims to answer this question by specifying any `into_raw`/`from_raw`-like APIs to stay consistent with the precedent set by `Box<T>` and `Vec<T>` of working with `*const T` and `*mut T`, and introduce a similar new API for `NonNull<T>` that is also more semantically typed with respect to `T`.
Rather than returning a `(*mut T, usize)` pair for its allocated storage, `Vec::leak_raw_parts` returns a `NonNull<[T]>`.

The new `leak_raw`/`unleak_raw` API is kept similar to the existing `into_raw`/`from_raw` API to make it discoverable and to avoid new cognitive load for those who are already familiar with `into_raw`/`from_raw`.
The semantic names make it clear to a reader what happens to the contents of the container through the conversion into a pointer.

# Guide-level explanation
[guide-level-explanation]: #guide-level-explanation

## When do I use `leak_raw`/`unleak_raw`?

The `leak_raw`/`unleak_raw` and `leak_raw_parts`/`unleak_raw_parts` methods are good for pure Rust data structures that would probably use references if it were possible to describe their non-trivial access and ownership requirements through them.

The `leak_raw` method can be used to take manual control of the lifetime and access to the contents of a container like `Box<T>`.
The `unleak_raw` method can then be used to later restore the container from its leaked pointer.
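
As a rough illustration of the intended semantics, the proposed pair can be approximated on stable Rust today. This is only a sketch: the free functions `leak_raw` and `unleak_raw` below are hypothetical stand-ins for the proposed `Box::leak_raw` and `Box::unleak_raw`, built from the existing `Box::leak` and `Box::from_raw`:

```rust
use std::ptr::NonNull;

/// Hypothetical stand-in for the proposed `Box::leak_raw`.
pub fn leak_raw<T: ?Sized>(b: Box<T>) -> NonNull<T> {
    // `Box::leak` yields `&mut T`, which converts into `NonNull<T>`.
    NonNull::from(Box::leak(b))
}

/// Hypothetical stand-in for the proposed `Box::unleak_raw`.
///
/// # Safety
/// `ptr` must come from `leak_raw` and must not be unleaked twice.
pub unsafe fn unleak_raw<T: ?Sized>(ptr: NonNull<T>) -> Box<T> {
    unsafe { Box::from_raw(ptr.as_ptr()) }
}
```

Between the two calls the caller is responsible for the allocation: nothing frees it unless the pointer is passed back to `unleak_raw`.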

Take the example of `LinkedList<T>` from the standard library. Internally, it stores `NonNull<T>` pointers to its nodes:

```rust
pub struct LinkedList<T> {
    head: Option<NonNull<Node<T>>>,
    tail: Option<NonNull<Node<T>>>,
    len: usize,
    marker: PhantomData<Box<Node<T>>>,
}
```

The nodes are allocated using `Box<T>`, where they're then leaked into the linked list, then later unleaked back out.
This can be done using `leak_raw`/`unleak_raw`:

```rust
impl<T> LinkedList<T> {
    fn push_front_node(&mut self, mut node: Box<Node<T>>) {
        unsafe {
            node.next = self.head;
            node.prev = None;

            // Leak the contents of `node` and return a `NonNull<Node<T>>`.
            // It's now the responsibility of `LinkedList<T>` to manage.
            let node = Some(Box::leak_raw(node));

            match self.head {
                None => self.tail = node,
                Some(head) => (*head.as_ptr()).prev = node,
            }

            self.head = node;
            self.len += 1;
        }
    }

    fn pop_front_node(&mut self) -> Option<Box<Node<T>>> {
        self.head.map(|node| unsafe {
            // Unleak the contents of `node` and return a `Box<Node<T>>`.
            // It's now the responsibility of `Box<T>` to manage.
            // `unleak_raw` takes the `NonNull<Node<T>>` directly.
            let node = Box::unleak_raw(node);
            self.head = node.next;

            match self.head {
                None => self.tail = None,
                Some(head) => (*head.as_ptr()).prev = None,
            }

            self.len -= 1;
            node
        })
    }
}
```
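
The same ownership hand-off can be tried out on stable Rust with a smaller structure. The following is a minimal sketch, not the standard library's implementation: `Stack` and `Node` are hypothetical types, and `NonNull::from(Box::leak(..))` and `Box::from_raw(..)` stand in for the proposed `Box::leak_raw` and `Box::unleak_raw`:

```rust
use std::ptr::NonNull;

struct Node<T> {
    value: T,
    next: Option<NonNull<Node<T>>>,
}

/// Hypothetical singly linked stack that owns its nodes through raw pointers.
pub struct Stack<T> {
    head: Option<NonNull<Node<T>>>,
    len: usize,
}

impl<T> Stack<T> {
    pub fn new() -> Self {
        Stack { head: None, len: 0 }
    }

    pub fn push(&mut self, value: T) {
        let node = Box::new(Node { value, next: self.head });
        // Stand-in for `Box::leak_raw(node)`: the stack now manages the node.
        self.head = Some(NonNull::from(Box::leak(node)));
        self.len += 1;
    }

    pub fn pop(&mut self) -> Option<T> {
        self.head.map(|node| {
            // Stand-in for `Box::unleak_raw(node)`.
            // SAFETY: every pointer stored in the stack came from `push`
            // and is removed from the stack before the box is dropped.
            let node = unsafe { Box::from_raw(node.as_ptr()) };
            self.head = node.next;
            self.len -= 1;
            node.value
        })
    }

    pub fn len(&self) -> usize {
        self.len
    }
}

impl<T> Drop for Stack<T> {
    fn drop(&mut self) {
        // Unleak every remaining node so nothing stays leaked.
        while self.pop().is_some() {}
    }
}
```

The `Drop` impl matters here: because the nodes were leaked, dropping the stack without it would leave every remaining node allocated forever.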

The `leak_raw_parts` method is the equivalent of `leak_raw` for multi-value containers like `String` that return extra data beyond the pointer needed to reconstruct the container later.
The `unleak_raw_parts` method is the equivalent of `unleak_raw`.

The `String::leak_raw_parts` method is a nice example of the new `leak_raw` API because it returns the most accurate pointer type possible to represent the raw string data.
Instead of a `(*mut u8, usize)` pair for the pointer and length, it returns a `NonNull<str>`, which encodes its length and retains the UTF-8 invariant together.
The following example shows how `leak_raw_parts` makes it easier to work with the leaked string than `into_raw_parts`:

```diff
let string = String::from("🗻∈🌏");

+ let (ptr, cap): (NonNull<str>, usize) = string.leak_raw_parts();
- let (ptr, len, cap): (*mut u8, usize, usize) = string.into_raw_parts();

+ assert_eq!(Some("🗻"), unsafe { ptr.as_ref().get(0..4) });
- assert_eq!(Some("🗻"), unsafe { str::from_utf8_unchecked(slice::from_raw_parts(ptr, len)).get(0..4) });

+ let string = unsafe { String::unleak_raw_parts(ptr, cap) };
- let string = unsafe { String::from_raw_parts(ptr, len, cap) };
```
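
For readers who want to experiment, the proposed pair can be approximated on stable Rust. This is a sketch under assumptions: the free functions `leak_raw_parts` and `unleak_raw_parts` are hypothetical stand-ins for the proposed `String` methods, and the pointer is taken from the underlying `Vec<u8>` rather than from a `&mut str` so that it still covers the spare capacity:

```rust
use std::mem::ManuallyDrop;
use std::ptr::{self, NonNull};

/// Hypothetical stand-in for the proposed `String::leak_raw_parts`.
pub fn leak_raw_parts(s: String) -> (NonNull<str>, usize) {
    // Prevent the buffer from being freed when this function returns.
    let mut bytes = ManuallyDrop::new(s.into_bytes());
    let (data, len, cap) = (bytes.as_mut_ptr(), bytes.len(), bytes.capacity());
    // Build a fat `*mut str` that carries the length as metadata.
    let fat = ptr::slice_from_raw_parts_mut(data, len) as *mut str;
    // A `Vec` buffer pointer is never null, even when the vector is empty.
    (NonNull::new(fat).expect("Vec pointers are never null"), cap)
}

/// Hypothetical stand-in for the proposed `String::unleak_raw_parts`.
///
/// # Safety
/// `ptr` and `cap` must come from one call to `leak_raw_parts`
/// and must not be used again afterwards.
pub unsafe fn unleak_raw_parts(ptr: NonNull<str>, cap: usize) -> String {
    let len = unsafe { ptr.as_ref() }.len();
    unsafe { String::from_raw_parts(ptr.cast::<u8>().as_ptr(), len, cap) }
}
```

Only the capacity travels next to the pointer; the length and the UTF-8 guarantee both live in the `NonNull<str>` itself.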

## When do I use `into_raw`/`from_raw`?

The `into_raw`/`from_raw` and `into_raw_parts`/`from_raw_parts` methods are good for FFI where a Rust type needs to be used by non-Rust code.

The `*mut T`, `*const T`, and `usize` types returned by these methods typically have a direct counterpart in the target language, so they don't require learning new concepts for users that are familiar with raw pointers.

As an example, it's common to share complex Rust values opaquely by boxing them and passing raw pointers to-and-fro.
Take this example [from The Rust FFI Guide][ffi-guide] that wraps a web request:

```rust
#[no_mangle]
pub unsafe extern "C" fn request_create(url: *const c_char) -> *mut Request {
    if url.is_null() {
        return ptr::null_mut();
    }

    let raw = CStr::from_ptr(url);

    let url_as_str = match raw.to_str() {
        Ok(s) => s,
        Err(_) => return ptr::null_mut(),
    };

    let parsed_url = match Url::parse(url_as_str) {
        Ok(u) => u,
        Err(_) => return ptr::null_mut(),
    };

    let req = Request::new(parsed_url, Method::Get);

    // Get a stable address for the request
    Box::into_raw(Box::new(req))
}

#[no_mangle]
pub unsafe extern "C" fn request_destroy(req: *mut Request) {
    if !req.is_null() {
        // Reinterpret the stable address as a previously allocated box
        drop(Box::from_raw(req));
    }
}
```

In this example, a reader only needs to consider one kind of pointer type (technically `*const T` and `*mut T` are different types, but one could read them like `T*` from other languages with a sharing annotation).
This API could use `Option<NonNull<Request>>` instead of `*mut Request` to force null checking in `request_destroy`, but that requires the author to juggle more concepts to write.
They'd need to understand that while `*mut T` has the same representation as `NonNull<T>`, it has the same semantics as `Option<NonNull<T>>`.
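
To make the trade-off concrete, here is a minimal sketch of the `Option<NonNull<Request>>` variant. The `Request` struct, its `id` field, and the failure condition are hypothetical placeholders rather than the FFI guide's types, and the export attribute is omitted to keep the snippet self-contained:

```rust
use std::ptr::NonNull;

/// Hypothetical stand-in for the request type wrapped by the FFI.
pub struct Request {
    pub id: u32,
}

/// Returns `None` (a null pointer on the C side) on failure.
pub extern "C" fn request_create(id: u32) -> Option<NonNull<Request>> {
    if id == 0 {
        // Hypothetical failure case, standing in for a URL parse error.
        return None;
    }
    Some(NonNull::from(Box::leak(Box::new(Request { id }))))
}

/// # Safety
/// `req` must be null or a pointer returned by `request_create`
/// that has not been destroyed yet.
pub unsafe extern "C" fn request_destroy(req: Option<NonNull<Request>>) {
    // The type forces the null case to be handled before the pointer is used.
    if let Some(req) = req {
        drop(unsafe { Box::from_raw(req.as_ptr()) });
    }
}
```

The C signature is unchanged, since `Option<NonNull<T>>` is represented as a nullable pointer, but the Rust author now has to reason about three pointer-like types instead of one.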

The `into_raw_parts` method is the equivalent of `into_raw` for multi-value containers like `Vec<T>` that split the fat pointer into its FFI-safe parts.
The `from_raw_parts` method is the equivalent of `from_raw`.

An FFI over `Vec<u8>` is a nice example of when `into_raw_parts` can be helpful over `leak_raw_parts`.
An FFI should only be built from FFI-safe types that have a well-known representation, but the fat `NonNull<[u8]>` pointer returned by `leak_raw_parts` (and consequently `*const [u8]`) is not considered FFI-safe.
That's not a problem for `into_raw_parts` though because it only returns FFI-safe `*mut u8` and `usize` types.

The following example shows how `into_raw_parts` makes it easier to work with FFI-safe values than `leak_raw_parts`:

```diff
  #[repr(C)]
  pub struct RawVec {
      ptr: *mut u8,
      len: usize,
      cap: usize
  }

  #[no_mangle]
  pub unsafe extern "C" fn vec_create() -> RawVec {
      let v = vec![0u8; 512];

+     let (ptr, len, cap) = v.into_raw_parts();
-     let (ptr, cap) = v.leak_raw_parts();
-     let (ptr, len) = (ptr.cast::<u8>().as_ptr(), ptr.len());

      RawVec { ptr, len, cap }
  }

  #[no_mangle]
  pub unsafe extern "C" fn vec_destroy(vec: RawVec) {
      if !vec.ptr.is_null() {
+         drop(Vec::from_raw_parts(vec.ptr, vec.len, vec.cap));
-         drop(Vec::unleak_raw_parts(NonNull::slice_from_raw_parts(NonNull::new_unchecked(vec.ptr), vec.len), vec.cap));
      }
  }
```
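
Since this RFC lists `Vec::into_raw_parts` as unstable, a sketch of the same round trip using only long-stable calls may be useful for comparison. It assumes `ManuallyDrop` as the way to give up ownership of the buffer, and it omits the export attribute to stay self-contained:

```rust
use std::mem::ManuallyDrop;

#[repr(C)]
pub struct RawVec {
    pub ptr: *mut u8,
    pub len: usize,
    pub cap: usize,
}

pub extern "C" fn vec_create() -> RawVec {
    // `ManuallyDrop` keeps the buffer alive after this function returns.
    let mut v = ManuallyDrop::new(vec![0u8; 512]);
    RawVec {
        ptr: v.as_mut_ptr(),
        len: v.len(),
        cap: v.capacity(),
    }
}

/// # Safety
/// `vec` must come from `vec_create` and must not be destroyed twice.
pub unsafe extern "C" fn vec_destroy(vec: RawVec) {
    if !vec.ptr.is_null() {
        // All three parts are required to rebuild the original allocation.
        drop(unsafe { Vec::from_raw_parts(vec.ptr, vec.len, vec.cap) });
    }
}
```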

[ffi-guide]: https://michael-f-bryan.github.io/rust-ffi-guide/basic_request.html#creating-the-c-interface

# Reference-level explanation
[reference-level-explanation]: #reference-level-explanation

This RFC proposes the following API for single-value containers (some of these methods are already stable or implemented but unstable):

```rust
impl<T> Box<T> {
    // Already stable
    pub fn leak<'a>(this: Box<T>) -> &'a mut T where T: 'a;

    pub fn leak_raw(this: Box<T>) -> NonNull<T>;
    pub unsafe fn unleak_raw(ptr: NonNull<T>) -> Box<T>;

    // Already stable
    pub fn into_raw(this: Box<T>) -> *mut T;
    // Already stable
    pub unsafe fn from_raw(ptr: *mut T) -> Box<T>;
}

impl<T> Rc<T> {
    pub fn leak_raw(this: Rc<T>) -> NonNull<T>;
    pub unsafe fn unleak_raw(ptr: NonNull<T>) -> Rc<T>;

    // Already stable
    pub fn into_raw(this: Rc<T>) -> *const T;
    // Already stable
    pub unsafe fn from_raw(ptr: *const T) -> Rc<T>;
}

impl<T> Arc<T> {
    pub fn leak_raw(this: Arc<T>) -> NonNull<T>;
    pub unsafe fn unleak_raw(ptr: NonNull<T>) -> Arc<T>;

    // Already stable
    pub fn into_raw(this: Arc<T>) -> *const T;
    // Already stable
    pub unsafe fn from_raw(ptr: *const T) -> Arc<T>;
}
```

and the following API for multi-value containers (some of these methods are already stable or implemented but unstable):

```rust
impl<T> Vec<T> {
    pub fn leak<'a>(self) -> &'a mut [T] where T: 'a;

    pub fn leak_raw_parts(self) -> (NonNull<[T]>, usize);
    pub unsafe fn unleak_raw_parts(ptr: NonNull<[T]>, capacity: usize) -> Vec<T>;

    // Unstable, tracked by: https://github.com/rust-lang/rust/issues/65816
    pub fn into_raw_parts(self) -> (*mut T, usize, usize);
    // Already stable
    pub unsafe fn from_raw_parts(ptr: *mut T, length: usize, capacity: usize) -> Vec<T>;
}

impl String {
    pub fn leak<'a>(self) -> &'a mut str;

    pub fn leak_raw_parts(self) -> (NonNull<str>, usize);
    pub unsafe fn unleak_raw_parts(ptr: NonNull<str>, capacity: usize) -> String;

    // Unstable, tracked by: https://github.com/rust-lang/rust/issues/65816
    pub fn into_raw_parts(self) -> (*mut u8, usize, usize);
    // Already stable
    pub unsafe fn from_raw_parts(ptr: *mut u8, length: usize, capacity: usize) -> String;
}
```

These conversion methods follow the existing convention of associated functions (called as `Box::leak_raw(b)`) for containers that dereference to their inner value like `Box<T>`, and inherent methods taking `self` for others.

The docs for the `into_raw`/`from_raw` methods will point users to `leak_raw`/`unleak_raw` unless they need FFI-safety.

The `NonNull<[T]>` and `NonNull<str>` types are expected to eventually offer a way to get their length without needing to go through a reference first, but the exact mechanism is left as out-of-scope for this RFC.
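
As an illustration of the `Vec<T>` signatures above, here is a minimal sketch that emulates them on a recent stable toolchain. The free functions `leak_raw_parts` and `unleak_raw_parts` are hypothetical stand-ins for the proposed methods, and the sketch assumes that `NonNull::slice_from_raw_parts` and `NonNull<[T]>::len` are available on the toolchain in use:

```rust
use std::mem::ManuallyDrop;
use std::ptr::NonNull;

/// Hypothetical stand-in for the proposed `Vec::leak_raw_parts`.
pub fn leak_raw_parts<T>(v: Vec<T>) -> (NonNull<[T]>, usize) {
    // Give up ownership of the buffer without freeing it.
    let mut v = ManuallyDrop::new(v);
    // A `Vec` buffer pointer is never null, even when the vector is empty.
    let data = NonNull::new(v.as_mut_ptr()).expect("Vec pointers are never null");
    // The fat pointer carries the initialized length; capacity travels separately.
    (NonNull::slice_from_raw_parts(data, v.len()), v.capacity())
}

/// Hypothetical stand-in for the proposed `Vec::unleak_raw_parts`.
///
/// # Safety
/// `ptr` and `capacity` must come from one call to `leak_raw_parts`
/// and must not be used again afterwards.
pub unsafe fn unleak_raw_parts<T>(ptr: NonNull<[T]>, capacity: usize) -> Vec<T> {
    // The length is read from the pointer metadata, not through a reference.
    unsafe { Vec::from_raw_parts(ptr.cast::<T>().as_ptr(), ptr.len(), capacity) }
}
```

Restoring the container needs exactly the state named in the signature: the fat pointer supplies the data pointer and length, and the extra `usize` supplies the capacity.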

# Drawbacks
[drawbacks]: #drawbacks

A drawback of this approach is that it creates a standard that any future containers are expected to adhere to.
It creates more API surface area that needs to be rationalized with future idioms, just like this RFC is attempting to do for `into_raw`/`from_raw` with `NonNull<T>`.
As an example, if a future Rust stabilizes another even more appropriate pointer type then it would need to be fit into this scheme.

It introduces more APIs, so users have to choose the right one for their use case instead of just trying to make the only option available work for them.
With clear guidance in the documentation for these methods and similarities in their design this shouldn't be an issue in practice.

# Rationale and alternatives
[rationale-and-alternatives]: #rationale-and-alternatives

An alternative is to just start using `NonNull<T>` going forward and accept the inconsistency with existing methods.
This isn't preferable to keeping new `into_raw`/`from_raw` pairs consistent with the ones that already exist because it forces users to learn the return values for all of these methods by rote instead of being able to rely on simple conventions.

Another is to just use `leak` methods and the conversion from `&T` and `&mut T` into `NonNull<T>` to work with.
This isn't preferable to method pairs that return a `NonNull<T>` and look similar to `into_raw`/`from_raw` because the `leak` route is less discoverable while still being preferable for common use cases, and requires more steps to leak and unleak than would otherwise be needed.

Another is to deprecate `into_raw`/`from_raw` in favor of `leak_raw().as_ptr()` and `NonNull::new_unchecked(ptr)`.
This makes it easier to discover the preferred API for working with raw container contents at the expense of more machinery in FFI use cases.
This isn't preferable to guidance in docs on both sets of methods because it puts more burden on FFI code and deprecates APIs that are already perfectly suited to their needs.
This could possibly be worked around by making it easier to convert types like `NonNull<[T]>` into a `(*mut T, usize)` pair.

# Prior art
[prior-art]: #prior-art

The prior art is `Box<T>`, which already has the `leak`, `into_raw` and `from_raw` methods.
It also has the unstable `into_raw_non_null`, but that is deprecated in favor of `NonNull::from(Box::leak(b))`.
This current workaround is the second alternative listed above, that isn't considered preferable to `Box::leak_raw(b)`.

# Unresolved questions
[unresolved-questions]: #unresolved-questions

This RFC doesn't propose a `leak` method for `Rc<T>` or `Arc<T>` but they could be added after working through the motivations.

Do we expect `Box::unleak_raw(NonNull::from(Box::leak(b)))` to work?

# Future possibilities
[future-possibilities]: #future-possibilities

There are other types that should probably be included, like `OsString` and `PathBuf`.
Using `NonNull<[T]>` and `NonNull<str>` sets an expectation that `NonNull<T>` will have some APIs for working with these fat-pointer types.