
compilation time (of volatile array read?) is super-quadratic in array size #74476

Open
ijackson opened this issue Jul 18, 2020 · 11 comments
Labels
- A-LLVM: Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues.
- C-enhancement: Category: An issue proposing an enhancement or a PR with one.
- I-compiletime: Issue: Problems and improvements with respect to compile times.
- T-compiler: Relevant to the compiler team, which will review and decide on the PR/issue.

Comments

@ijackson (Contributor)

Hi. I tried this code:

```rust
const COUNT: usize = 40000;

fn main() {
    let mut vol_buffer = [0u64; COUNT];

    let mut input = unsafe { std::ptr::read_volatile(&mut vol_buffer) };

    input[0] = 42;
    eprintln!("{:?}", input[0]);
}
```

Expected results:

Relatively quick compilation of a program that prints 42.

Actual results:

rustc took 45 seconds on my really rather fast laptop (and also used a great deal of memory). The compiled binary works as expected.

Changing the value of COUNT shows the following (with an earlier rustc, but the newly-updated one is about as slow):

COUNT   time
10000   1.74s
20000   5.9s
30000   20s
40000   46s

I haven't waited for completion of any larger values. In my original version I had COUNT = a million; I killed rustc after some minutes and about 2G of RAM use.

Meta

rustc --version --verbose:

```
binary: rustc
commit-hash: 346aec9b02f3c74f3fce97fd6bda24709d220e49
commit-date: 2020-07-11
host: x86_64-unknown-linux-gnu
release: 1.46.0-nightly
LLVM version: 10.0
```
@ijackson ijackson added the C-bug Category: This is a bug. label Jul 18, 2020
@jonas-schievink jonas-schievink added C-enhancement Category: An issue proposing an enhancement or a PR with one. I-compiletime Issue: Problems and improvements with respect to compile times. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. and removed C-bug Category: This is a bug. labels Jul 18, 2020
@hanna-kruppe (Contributor)

A tremendous amount of machine code is generated (depending on COUNT) which of course takes time. The volatile load is loading a huge value, and because it's volatile, the generated code has to ensure every byte is loaded exactly once, so it can't just call memcpy. I don't see any way to avoid this problem, but I also wonder why you're doing a volatile load of the entire array in the first place.

@hanna-kruppe (Contributor) commented Jul 18, 2020

(Well, the amount of code generated should be linear in COUNT, not quadratic or worse, but since it's all in one basic block, one of the many passes over that code probably is non-linear in the amount of code being processed, and avoiding that is probably very hard and not worth the effort.)

@the8472 (Member) commented Jul 18, 2020

Couldn't this be rewritten as a loop of volatile u64 reads?

@hanna-kruppe (Contributor)

You're right that a loop would be a legal lowering, but I don't think there's a good way to actually generate that within the context of how CodeGen, and especially legalization, is done in LLVM today. But I just remembered that "volatile memcpy" is now a thing in LLVM, so perhaps rustc could be modified to emit that for sufficiently large values? However, it is allowed to call the ordinary memcpy (which may perform multiple accesses to the same byte), and more generally it's unspecified what memory accesses it performs, so it might not be what users of {read,write}_volatile expect or want. But once again, I don't know why OP has written the code like this in the first place, so it's hard to prioritize the issue or suggest workarounds.
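As a sketch of the loop-based approach applied in user code rather than inside the compiler, here is a hypothetical rewrite of the original program that replaces the single huge volatile aggregate load with word-sized volatile reads (this is a workaround one could write today; it is not what rustc currently emits):

```rust
// Hypothetical workaround: instead of one volatile read of the whole
// [u64; COUNT] value, perform COUNT word-sized volatile reads in a loop.
// Each element is still loaded exactly once, but the generated code is a
// small loop body rather than COUNT inlined load instructions.
const COUNT: usize = 40000;

fn main() {
    let vol_buffer = [0u64; COUNT];

    let mut input = [0u64; COUNT];
    for i in 0..COUNT {
        // SAFETY: `i` is in bounds and `vol_buffer` is a live, aligned array.
        input[i] = unsafe { std::ptr::read_volatile(vol_buffer.as_ptr().add(i)) };
    }

    input[0] = 42;
    eprintln!("{:?}", input[0]);
}
```

This keeps the amount of machine code (and hence compile time) independent of COUNT, at the cost of making the word-by-word access pattern explicit.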

@ijackson (Contributor, Author) commented Jul 19, 2020

Thanks etc.

Hi again. I'm afraid I don't have much to add to the discussion of compiler internals.

My program does not need to do such a large volatile read and I have changed it not to; i.e., it now reads in smaller chunks. So this bug is not a practical problem for me and I'm not looking for help :-).

I thought I would report this as a bug because the behaviour seemed, to me as a non-compiler-engineer, quite surprising. It seemed to me that it was probably not intended and that you'd like to know about it.

Rust is usually very helpful and forgiving. (Admittedly, here I used unsafe, and unsafe is definitely treading hazardous ground.) I am trying to help make Rust more helpful and forgiving.

I don't know how common this kind of program is. If you have seen others have similar experiences, maybe at least some kind of compiler warning would be helpful? If you think there's little that could be done here to make matters better, then fair enough; please close this issue.

What I was trying to do

Anyway, in the hope it's useful and/or interesting to you, I'll explain what I was trying to do. Try not to laugh too hard:

I used this technique because I was doing some benchmarking and needed to defeat the optimiser, which could otherwise reasonably have realised that my entire program was a no-op. I could have used IO instead, but the function I was benchmarking was a very complex way of describing the identity function.[1] The shortest (and to me most obvious) expression of this concept in Rust was to have one single volatile read of a whole array of input values.

I'm not aware of a better way to make a barrier for the optimiser. IMO, ideally there would be an operator, in std::mem perhaps, which would take a value and return it, but guarantee that the compiler would not optimise 'through' it. I.e., an identity function whose innards the compiler does not know: the compiler must compute precisely the input value, and then make sure that all uses of the returned value use that one computation. You could then benchmark f by computing opt_barrier(f(opt_barrier(test_value))). A volatile memory barrier isn't quite right because it requires that the value is in memory. But maybe that is one for the language team.
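(Editorial note: something close to the opt_barrier described above now exists in the standard library as std::hint::black_box, stabilized in Rust 1.66. It is an identity function the optimizer is asked to treat as maximally opaque, though it is documented as best-effort rather than a hard guarantee. A minimal sketch of the benchmarking pattern described above, using a stand-in function f:)

```rust
use std::hint::black_box;

// Stand-in for the function being benchmarked (hypothetical example).
fn f(x: u64) -> u64 {
    x.wrapping_mul(2).wrapping_add(2)
}

fn main() {
    let test_value = 20u64;
    // `black_box` is an identity function the compiler is asked to treat
    // as opaque: it must materialize `test_value`, actually compute `f`,
    // and may not assume anything about the result. Best-effort only.
    let result = black_box(f(black_box(test_value)));
    eprintln!("{}", result);
}
```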

Ian.

[1] Read only if you have a strong stomach: orlp/slotmap#43 (comment)
The benchmark program (now working, with a smaller volatile read) is attached.

Incidentally, I am very impressed with how Rust managed to deal with that ludicrous contraption. Out of curiosity I looked at the generated assembler for 32-bit ARM, and the relevant method contains simply one instruction: bx lr. Well done to all involved!

slotmap-slot-idx-test.rs.txt

@the8472 (Member) commented Jul 19, 2020

@ijackson (Contributor, Author)

Interesting, thanks. That is precisely right, except that it says it's not guaranteed to actually work :-).

@the8472 (Member) commented Jul 19, 2020

It's generally considered good enough for benchmarking; you can always look at the assembly output to make sure.

@ijackson (Contributor, Author) commented Jul 19, 2020

I ended up (skim)reading a lot of the discussion in the various black_box issues etc.

I think what is needed is what is described in this comment: rust-lang/rfcs#1484 (comment). Unfortunately what is provided is something considerably weaker. As documented, black_box is not really useful even for benchmarking, because (as you suggest) you still have to read the generated assembler to know whether it worked. It's also no good for crypto constant-time stuff, which was another use case. But I think that discussion should probably be in some other issue.

In this issue: is it worth trying to make the compiler not quadratic in the size of the volatile array read? Or providing a warning about this or similar situations? If not then I think probably you'll want to close this issue, and sorry for wasting your time.

Thanks for your attention.

@hanna-kruppe (Contributor)

I do think we should deprecate and warn about volatile reads and writes like this one. Generally, volatile accesses for aggregates (excluding those that map to architecturally-supported data types, such as repr(transparent) newtypes around primitives or SIMD types) have few legitimate use cases and several gotchas (no real guarantees about hardware memory access granularity, bad codegen as seen here). I am aware of a few use cases, but those have different needs from e.g. MMIO uses, so a different interface for them (cf. LLVM's volatile memcpy) seems like a better choice than overloading these operations. cc @rust-lang/wg-unsafe-code-guidelines

@RalfJung (Member)
