Implement Fill for [MaybeUninit<T>] #1080
Hmm, the problem there is that Rand v0.8 has already replaced this trait with `Fill`.
Oh damn, that's what I get for forgetting to click the "Go to latest version" button on docs.rs. That's good news then, I'm glad there's nothing in the way of this feature.
Actually no, because it's undefined behavior in Rust to produce a mutable reference to uninitialized memory, even if it's never explicitly read from. That's because, when writing through the mutable reference, Rust will attempt to drop the previously contained value, which is uninitialized in this case. So a designated `Fill` implementation for `[MaybeUninit<T>]` is still needed.
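To make the sound alternative concrete: the usual pattern is to write each element through `MaybeUninit::write` (which does not drop the old value) and only form a `&mut [u8]` once every byte is initialized. The sketch below illustrates this under that assumption; `fill_with` and the counter generator are invented for this example and are not rand APIs.

```rust
use std::mem::MaybeUninit;

// Fill an uninitialized byte buffer without ever materializing a
// `&mut [u8]` to uninitialized memory: write each element through
// `MaybeUninit`, then convert once everything is initialized.
fn fill_with<'a>(
    dest: &'a mut [MaybeUninit<u8>],
    mut next: impl FnMut() -> u8,
) -> &'a mut [u8] {
    for slot in dest.iter_mut() {
        // `MaybeUninit::write` stores a value without dropping the
        // (possibly uninitialized) previous contents.
        slot.write(next());
    }
    // SAFETY: every element was written in the loop above.
    unsafe { std::slice::from_raw_parts_mut(dest.as_mut_ptr() as *mut u8, dest.len()) }
}

fn main() {
    let mut buf = [MaybeUninit::<u8>::uninit(); 8];
    let mut n = 0u8;
    let filled = fill_with(&mut buf, || {
        n += 1;
        n
    });
    assert_eq!(filled, &[1, 2, 3, 4, 5, 6, 7, 8]);
}
```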
Question about the source code in lines 350 to 358 in 98a1aaf: …

This trait is also used by PRNGs, and for some applications (e.g. games) it's important to keep RNG output reproducible on different platforms.
Currently, the API of `RngCore` can only fill already-initialized buffers. In order to implement the feature proposed in this issue, a new method would need to be added to `RngCore`:

```rust
pub fn try_fill_uninitialized_bytes(&mut self, dest: &mut [MaybeUninit<u8>]) -> Result<(), Error>
```

What do the rand maintainers think about this suggestion?
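For illustration, here is one way such a method could be added as a defaulted (hence non-breaking) trait method: generate into a small initialized chunk and copy into the destination, so existing implementors need no changes. This is only a sketch; `RngCoreLike` and `Counter` are stand-ins invented for this example, not rand's actual types, and a real implementation would likely write directly instead of copying.

```rust
use std::mem::MaybeUninit;

// Stand-in for rand's `RngCore`, with only the parts this sketch needs.
trait RngCoreLike {
    fn fill_bytes(&mut self, dest: &mut [u8]);

    // Sketch of the proposed addition, with a conservative default:
    // generate into a small initialized chunk, then copy into `dest`.
    fn fill_uninitialized_bytes<'a>(
        &mut self,
        dest: &'a mut [MaybeUninit<u8>],
    ) -> &'a mut [u8] {
        let mut chunk = [0u8; 32];
        for part in dest.chunks_mut(32) {
            let n = part.len();
            self.fill_bytes(&mut chunk[..n]);
            for (slot, &byte) in part.iter_mut().zip(&chunk[..n]) {
                slot.write(byte);
            }
        }
        // SAFETY: every element of `dest` was written above.
        unsafe { std::slice::from_raw_parts_mut(dest.as_mut_ptr() as *mut u8, dest.len()) }
    }
}

// Toy deterministic generator, for demonstration only.
struct Counter(u8);

impl RngCoreLike for Counter {
    fn fill_bytes(&mut self, dest: &mut [u8]) {
        for b in dest {
            self.0 = self.0.wrapping_add(1);
            *b = self.0;
        }
    }
}

fn main() {
    let mut rng = Counter(0);
    let mut buf = [MaybeUninit::<u8>::uninit(); 5];
    let bytes = rng.fill_uninitialized_bytes(&mut buf);
    assert_eq!(bytes, &[1, 2, 3, 4, 5]);
}
```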
I'd really prefer not to go that route since (a) we want to keep `RngCore` simple, … How much performance overhead is there if you copy to a buffer first? The compiler may even optimise the extra copy away, and in many cases the perf. isn't critical anyway.
Of course at this point you might as well just use a zeroed array to start with and avoid using `MaybeUninit` at all. If we really need to redesign the core RNG for better performance here, it would be better to look at making …
This should probably be backed up by a benchmark demonstrating the performance advantage.
We also probably should wait for rust-lang/rust#78485 to be implemented and use the proposed `ReadBuf` type.
The `RngCore` trait could gain a provided method along these lines:

```rust
fn try_fill_read_buf(&mut self, buf: &mut ReadBuf<'_>) -> Result<(), Error> {
    self.try_fill_bytes(buf.initialize_unfilled())
}
```

This is non-breaking and fairly low complexity, but still an addition to what is supposed to be a simple core trait. What are our thoughts on this? Mine is that a benchmark demonstrating significant improvement is required. I guess it would be interesting to look at both something like …
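To make the `ReadBuf` idea concrete: it pairs a possibly-uninitialized buffer with a high-water mark of how many leading bytes are known to be initialized, so zeroing is paid at most once per buffer rather than once per fill. The `MiniReadBuf` below is a deliberately simplified stand-in written for this sketch, not std's actual type (the real design has since been reworked into the unstable `BorrowedBuf`).

```rust
use std::mem::MaybeUninit;

// Simplified stand-in for the proposed `ReadBuf`: a buffer plus a
// count of how many leading bytes are known to be initialized.
struct MiniReadBuf<'a> {
    data: &'a mut [MaybeUninit<u8>],
    initialized: usize,
}

impl<'a> MiniReadBuf<'a> {
    fn new(data: &'a mut [MaybeUninit<u8>]) -> Self {
        MiniReadBuf { data, initialized: 0 }
    }

    // Zero the not-yet-initialized tail (at most once over the buffer's
    // lifetime), then hand out the whole buffer as `&mut [u8]`.
    fn initialize_unfilled(&mut self) -> &mut [u8] {
        for slot in &mut self.data[self.initialized..] {
            slot.write(0);
        }
        self.initialized = self.data.len();
        // SAFETY: all elements are initialized at this point.
        unsafe {
            std::slice::from_raw_parts_mut(self.data.as_mut_ptr() as *mut u8, self.data.len())
        }
    }
}

fn main() {
    let mut storage = [MaybeUninit::<u8>::uninit(); 8];
    let mut buf = MiniReadBuf::new(&mut storage);
    // First call pays for zeroing; a "fill" then overwrites the bytes.
    buf.initialize_unfilled().copy_from_slice(&[7; 8]);
    // Second call skips the zeroing, and the earlier contents survive.
    assert_eq!(buf.initialize_unfilled(), &[7; 8]);
}
```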
I think it may be useful to have … I am significantly less sure about the benefits of supporting uninitialized memory. First it should be demonstrated that zero-filling initialization indeed does not get optimized out for our PRNGs; I will not be surprised if it does (but note that the compiler should not be able to perform such an optimization for …).
@newpavlov Isn't …? I think it might make sense to improve …
@vks …
Maybe a good alternative is to provide something like …
I threw together a quick benchmark:

```rust
fn tmp(c: &mut Criterion) {
    c.bench_function("tmp", |b| {
        let mut rng = XorShiftRng::from_entropy();
        b.iter(|| {
            let mut buf = [0u8; 4096];
            rng.fill_bytes(&mut buf);
            buf
        });
    });
    c.bench_function("pmt", |b| {
        let mut rng = XorShiftRng::from_entropy();
        b.iter(|| {
            // I know this is UB, but there's no way around this right now
            let mut buf: [u8; 4096] = unsafe { MaybeUninit::uninit().assume_init() };
            rng.fill_bytes(&mut buf);
            buf
        });
    });
}
```

I'm seeing a consistent 0.1µs difference between the two, which I guess isn't all that much (it's still measurable though, and would presumably get worse as the buffer size increases). I feel like it'd be a huge bummer to make it impossible to use uninitialized buffers. People who know what they're doing (this repo) would implement those methods, but other people can just ignore them and pay the cost of zeroing everything out.
Interesting benchmark. Does it do what you think it does? Quips aside, saving 100ns isn't sufficient justification for adding new API. We could however try building an API around `ReadBuf`.
Playing with this on godbolt, you get basically the same assembly (also, I'm using the UB method in my project with tests that pass, so it seems fine for now):

```asm
; buffer zero-initialized with [0u8; 4096]
example::test:
        push    rbx
        mov     eax, 4096
        call    __rust_probestack
        sub     rsp, rax
        mov     rbx, rsp
        mov     edx, 4096
        mov     rdi, rbx
        xor     esi, esi
        call    qword ptr [rip + memset@GOTPCREL]
        add     rsp, 4096
        pop     rbx
        ret

; buffer left uninitialized via MaybeUninit
example::test:
        mov     eax, 4096
        call    __rust_probestack
        sub     rsp, rax
        mov     rax, rsp
        add     rsp, 4096
        ret
```
I was kind of thinking that too, but those 100ns are way more significant if you look at it from the percentage perspective: 1.46µs vs 1.56µs. That's 6% lost for nothing. Regarding the `ReadBuf` stuff, I don't understand its relevance. `ReadBuf` is a higher-level API that offers users access to either an uninitialized buffer or an initialized version as needed. Why would we force people to use that when they can just pass in a `MaybeUninit` buf (which can come from `ReadBuf` if they want)? Regarding the API, why is there resistance against something that consumers don't have to implement? That doesn't seem like bloat to me. Also, the current API forces users to leave performance on the table, which I'm strongly against: you should always have the choice to go lower level if you'd like.
It's still extra complexity, which requires some justification. 1.46μs vs 1.56μs is potentially valid justification, but it's a micro-benchmark and basically the best-case scenario for this change. 100ns on its own is nothing, so this is only significant if (a) repeated often or (b) a much larger buffer is filled. But, if repeating often, the same buffer should be reused, so (a) is out. With a larger buffer, probably the only reason you'd do that is to write to disk, but then the disk writes will be much more significant than the extra initialisation time. Hence, 6% faster in this benchmark is on its own very weak grounds for any new addition. Demonstrating a 5% improvement in a real-world example which saves non-negligible amounts of CPU time would be much stronger justification. |
That's a fair point. I came up with a somewhat convoluted benchmark that would be best case for disk writes, as it doesn't reuse the buffer at all:

```rust
fn maybe(c: &mut Criterion) {
    let mut g = c.benchmark_group("maybe");
    g.sample_size(1000);
    g.bench_function("init", |b| {
        let mut rng = XorShiftRng::from_entropy();
        let mut file = std::fs::File::create("/tmp/init.bench").unwrap();
        b.iter(|| {
            let mut buf = [0u8; 4096];
            rng.fill_bytes(&mut buf);
            file.write_all(&buf).unwrap();
        });
        std::fs::remove_file("/tmp/init.bench").unwrap();
    });
    g.bench_function("uninit", |b| {
        let mut rng = XorShiftRng::from_entropy();
        let mut file = std::fs::File::create("/tmp/uninit.bench").unwrap();
        b.iter(|| {
            let mut buf: [u8; 4096] = unsafe { std::mem::MaybeUninit::uninit().assume_init() };
            rng.fill_bytes(&mut buf);
            file.write_all(&buf).unwrap();
        });
        std::fs::remove_file("/tmp/uninit.bench").unwrap();
    });
}
```
Perhaps more interesting is finding the number of iterations it takes to make that lost performance insignificant by reusing the buffer N times:

```rust
fn maybe(c: &mut Criterion) {
    let mut g = c.benchmark_group("maybe");
    g.sample_size(1000);
    const N: usize = 10;
    let mut rng = XorShiftRng::from_entropy();
    let mut file = std::fs::File::create("/tmp/maybe.bench").unwrap();
    g.bench_function("init", |b| {
        b.iter(|| {
            let mut buf = [0u8; 4096];
            for _ in 0..N {
                rng.fill_bytes(&mut buf);
                file.write_all(&buf).unwrap();
            }
        });
    });
    g.bench_function("uninit", |b| {
        b.iter(|| {
            let mut buf: [u8; 4096] = unsafe { std::mem::MaybeUninit::uninit().assume_init() };
            for _ in 0..N {
                rng.fill_bytes(&mut buf);
                file.write_all(&buf).unwrap();
            }
        });
    });
    std::fs::remove_file("/tmp/maybe.bench").unwrap();
}
```

I couldn't get any consistent numbers, but at 10 iterations it takes somewhere around 200µs to complete. If we use the 1-2µs difference from the first benchmark, 200µs is the threshold where the cost of zero initialization drops just below 1%, meaning we have to use the buffer 10 times before the cost becomes insignificant. Realistically, I would argue that you have to use the buffer ~100 times before it becomes truly insignificant (0.1%). Honestly, I'm not sure if these benchmarks are useful since they're so finicky, but I thought it'd be worth sharing what I've been playing around with.
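As a quick sanity check on the arithmetic in that last paragraph (using the numbers quoted above, taking 2µs as the upper end of the claimed per-buffer saving):

```rust
fn main() {
    // Upper end of the claimed per-buffer saving from skipping zeroing.
    let saved_us = 2.0_f64;
    // Approximate total time of the 10-iteration write benchmark.
    let total_us = 200.0_f64;

    // At ~200µs of total work, the zeroing overhead is about 1%...
    assert!((saved_us / total_us - 0.01).abs() < 1e-9);
    // ...so reaching ~0.1% takes roughly 10x more reuse, i.e. ~100 uses.
    assert!((saved_us / (10.0 * total_us) - 0.001).abs() < 1e-9);
}
```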
Thanks for the extra benchmarking. Summary: 4.4% or 800ns impact in the most significant case, and it's still something that doesn't exactly scale (unless you have code that writes thousands of random files but doesn't share the buffer used, for some reason). If it were some internal change, then sure, but I'm still not convinced this justifies API additions.
Yeah, that's fair. BTW, is the API addition issue specifically around …?
Honestly, it's about keeping the complexity of the library in check. You seem oddly committed to getting this optimisation in somehow, so I'm not saying 100% no, just that it'll take more to convince me; personally though, I'd drop the idea. We're spending more CPU time discussing this issue than I can see it saving. I don't want an …
I guess what bothers me is that it's impossible. I also don't feel like 2 extra methods and some transmutes to bridge the gap add much complexity, but since neither of us is budging, I'll drop it. :)
@SUPERCILEX thanks for the comments and benches. In the end, I think we won't do this, at least not unless there is good real-world justification. Your benchmarks are interesting, but not quite enough to convince me of their utility. The original motivation was this:

… in which case re-use of buffers should surely make the savings from using uninitialized memory insignificant. Regarding …, the core issue is whether to add a method like …
Background

What is your motivation? Right now, the `Rng::fill` function can only write to buffers that are already initialized. That's because passing uninitialized memory as a `&[T]` is undefined behavior. With an `AsByteSliceMut` implementation for `[MaybeUninit<T>]`, rand could be used to write vast amounts of random numbers into memory without needing to zero out the memory first.

What type of application is this? Fast generation of vast amounts of numbers.

Feature request

The `rand::AsByteSliceMut` trait should be implemented for all `[MaybeUninit<T>]` where `T` is one of the numeric integer types. In other words, similar to how you can `fill` a `&mut [u32]`/`&mut [i32]`/`&mut [usize]`/... or a `&mut [Wrapping<u32>]`/`&mut [Wrapping<i32>]`/`&mut [Wrapping<usize>]`/..., rand should support `fill`ing a `&mut [MaybeUninit<u32>]`/`&mut [MaybeUninit<i32>]`/`&mut [MaybeUninit<usize>]`/...
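For the "vast amounts of numbers" use case described above, one pattern that works on stable Rust today, without any change to rand, is `Vec`'s spare-capacity API: allocate, write the spare capacity element by element, then mark it initialized. This is a sketch of that pattern; `random_vec` and the xorshift generator are illustrative stand-ins, not rand APIs.

```rust
use std::mem::MaybeUninit;

// Build a Vec of random-ish numbers without first zeroing the storage:
// `spare_capacity_mut` exposes the allocation as `[MaybeUninit<u32>]`.
fn random_vec(len: usize, mut next_u32: impl FnMut() -> u32) -> Vec<u32> {
    let mut v: Vec<u32> = Vec::with_capacity(len);
    let spare: &mut [MaybeUninit<u32>] = &mut v.spare_capacity_mut()[..len];
    for slot in spare {
        slot.write(next_u32());
    }
    // SAFETY: the first `len` elements were initialized just above.
    unsafe { v.set_len(len) };
    v
}

fn main() {
    // Tiny xorshift32 as a stand-in for a real RNG.
    let mut state = 0x1234_5678_u32;
    let v = random_vec(1000, || {
        state ^= state << 13;
        state ^= state >> 17;
        state ^= state << 5;
        state
    });
    assert_eq!(v.len(), 1000);
}
```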