
compilation time (of volatile array read?) is super-quadratic in array size #74476

Open
ijackson opened this issue Jul 18, 2020 · 11 comments
Labels
- A-LLVM: Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues.
- C-enhancement: Category: An issue proposing an enhancement or a PR with one.
- I-compiletime: Issue: Problems and improvements with respect to compile times.
- T-compiler: Relevant to the compiler team, which will review and decide on the PR/issue.

Comments

@ijackson (Contributor)

Hi. I tried this code:

```rust
const COUNT: usize = 40000;

fn main() {
    let mut vol_buffer = [0u64; COUNT];

    let mut input = unsafe { std::ptr::read_volatile(&mut vol_buffer) };

    input[0] = 42;
    eprintln!("{:?}", input[0]);
}
```

Expected results:

Relatively quick compilation of a program that prints 42.

Actual results:

rustc took 45 seconds on my really rather fast laptop (and also used a great deal of memory). The compiled binary works as expected.

Changing the value of COUNT shows the following (with an earlier rustc, but the newly-updated one is about as slow):

COUNT   time
10000   1.74s
20000   5.9s
30000   20s
40000   46s

I haven't waited for completion of any larger values. In my original version I had COUNT = a million; I killed rustc after some minutes and about 2G of RAM use.

Meta

rustc --version --verbose:

```
binary: rustc
commit-hash: 346aec9b02f3c74f3fce97fd6bda24709d220e49
commit-date: 2020-07-11
host: x86_64-unknown-linux-gnu
release: 1.46.0-nightly
LLVM version: 10.0
```
@ijackson ijackson added the C-bug Category: This is a bug. label Jul 18, 2020
@jonas-schievink jonas-schievink added C-enhancement Category: An issue proposing an enhancement or a PR with one. I-compiletime Issue: Problems and improvements with respect to compile times. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. and removed C-bug Category: This is a bug. labels Jul 18, 2020
@hanna-kruppe (Contributor)

A tremendous amount of machine code is generated (depending on COUNT) which of course takes time. The volatile load is loading a huge value, and because it's volatile, the generated code has to ensure every byte is loaded exactly once, so it can't just call memcpy. I don't see any way to avoid this problem, but I also wonder why you're doing a volatile load of the entire array in the first place.

@hanna-kruppe (Contributor) commented Jul 18, 2020

(Well, the amount of code generated should be linear in COUNT, not quadratic or worse, but since it's all in one basic block, one of the many passes over that code probably is non-linear in the amount of code being processed, and avoiding that is probably very hard and not worth the effort.)

@the8472 (Member) commented Jul 18, 2020

Couldn't this be rewritten as a loop of volatile u64 reads?

@hanna-kruppe (Contributor)

You're right that a loop would be a legal lowering, but I don't think there's a good way to actually generate that within the context of how CodeGen, and especially legalization, is done in LLVM today. But I just remembered that "volatile memcpy" is now a thing in LLVM, so perhaps rustc could be modified to emit that for sufficiently large values? However, it is allowed to call the ordinary memcpy (which may perform multiple accesses to the same byte), and more generally it's unspecified what memory accesses it performs, so it might not be what users of {read,write}_volatile expect or want. But once again, I don't know why OP has written the code like this in the first place, so it's hard to prioritize the issue or suggest workarounds.
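As a sketch of the loop-based approach applied in user code rather than inside the compiler, here is a hypothetical rewrite of the original program that replaces the single huge volatile aggregate load with word-sized volatile reads (this is a workaround one could write today; it is not what rustc currently emits):

```rust
// Hypothetical workaround: instead of one volatile read of the whole
// [u64; COUNT] value, perform COUNT word-sized volatile reads in a loop.
// Each element is still loaded exactly once, but the generated code is a
// small loop body rather than COUNT inlined load instructions.
const COUNT: usize = 40000;

fn main() {
    let vol_buffer = [0u64; COUNT];

    let mut input = [0u64; COUNT];
    for i in 0..COUNT {
        // SAFETY: `i` is in bounds and `vol_buffer` is a live, aligned array.
        input[i] = unsafe { std::ptr::read_volatile(vol_buffer.as_ptr().add(i)) };
    }

    input[0] = 42;
    eprintln!("{:?}", input[0]);
}
```

This keeps the amount of machine code (and hence compile time) independent of COUNT, at the cost of making the word-by-word access pattern explicit.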

@ijackson (Contributor, Author) commented Jul 19, 2020

Thanks etc.

Hi again. I'm afraid I don't have much to add to the discussion of compiler internals.

My program does not need to do such a large volatile read and I have changed it not to; i.e., it now reads in smaller chunks. So this bug is not a practical problem for me and I'm not looking for help :-).

I thought I would report this as a bug because the behaviour seemed, to me as a non-compiler-engineer, quite surprising. It seemed to me that it was probably not intended and that you'd like to know about it.

Rust is usually very helpful and forgiving. (Admittedly, here I used unsafe, and unsafe is definitely treading hazardous ground.) I am trying to help make Rust more helpful and forgiving.

I don't know how common this kind of program is. If you have seen others have similar experiences, maybe at least some kind of compiler warning would be helpful? If you think there's little that could be done here to make matters better, then fair enough; please close this issue.

What I was trying to do

Anyway, in the hope it's useful and/or interesting to you, I'll explain what I was trying to do. Try not to laugh too hard:

I used this technique because I was doing some benchmarking and needed to defeat the optimiser, which could otherwise reasonably have realised that my entire program was a no-op. I could have used IO instead, but the function I was benchmarking was a very complex way of describing the identity function.[1] The shortest (and to me most obvious) expression of this concept in Rust was to have one single volatile read of a whole array of input values.

I'm not aware of a better way to make a barrier for the optimiser. IMO, ideally there would be an operator, in std::mem perhaps, which would take a value and return it, but guarantee that the compiler would not optimise 'through' it. I.e., an identity function whose innards the compiler does not know: the compiler must compute precisely the input value, and then make sure that all uses of the returned value use that one computation. You could then benchmark f by computing opt_barrier(f(opt_barrier(test_value))). A volatile memory barrier isn't quite right because it requires that the value is in memory. But maybe that is one for the language team.
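(Editorial note: something close to the opt_barrier described above now exists in the standard library as std::hint::black_box, stabilized in Rust 1.66. It is an identity function the optimizer is asked to treat as maximally opaque, though it is documented as best-effort rather than a hard guarantee. A minimal sketch of the benchmarking pattern described above, using a stand-in function f:)

```rust
use std::hint::black_box;

// Stand-in for the function being benchmarked (hypothetical example).
fn f(x: u64) -> u64 {
    x.wrapping_mul(2).wrapping_add(2)
}

fn main() {
    let test_value = 20u64;
    // `black_box` is an identity function the compiler is asked to treat
    // as opaque: it must materialize `test_value`, actually compute `f`,
    // and may not assume anything about the result. Best-effort only.
    let result = black_box(f(black_box(test_value)));
    eprintln!("{}", result);
}
```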

Ian.

[1] Read only if you have a strong stomach: orlp/slotmap#43 (comment)
The benchmark program (now working, with a smaller volatile read) is attached.

Incidentally, I am very impressed with how Rust managed to deal with that ludicrous contraption. Out of curiosity I looked at the generated assembler for 32-bit ARM, and the relevant method contains simply one instruction: bx lr. Well done to all involved!

slotmap-slot-idx-test.rs.txt

@the8472 (Member) commented Jul 19, 2020

@ijackson (Contributor, Author)

Interesting, thanks. That is precisely right, except that it says it's not guaranteed to actually work :-).

@the8472 (Member) commented Jul 19, 2020

It's generally considered good enough for benchmarking; you can always look at the assembly output to make sure.

@ijackson (Contributor, Author) commented Jul 19, 2020

I ended up (skim)reading a lot of the discussion in the various black_box issues etc.

I think what is needed is what is described in this comment: rust-lang/rfcs#1484 (comment). Unfortunately what is provided is something considerably weaker. As documented, black_box is not really useful even for benchmarking, because (as you suggest) you still have to read the generated assembler to know whether it worked. It's also no good for crypto constant-time stuff, which was another use case. But I think that discussion should probably be in some other issue.

In this issue: is it worth trying to make the compiler not quadratic in the size of the volatile array read? Or providing a warning about this or similar situations? If not then I think probably you'll want to close this issue, and sorry for wasting your time.

Thanks for your attention.

@hanna-kruppe (Contributor)

I do think we should deprecate and warn about volatile reads and writes like this one. Generally, volatile accesses for aggregates (excluding those that map to architecturally-supported data types, such as repr(transparent) newtypes around primitives or SIMD types) have few legitimate use cases and several gotchas (no real guarantees about hardware memory access granularity, bad codegen as seen here). I am aware of a few use cases, but those have different needs from e.g. MMIO uses, so a different interface for them (cf. LLVM's volatile memcpy) seems like a better choice than overloading these operations. cc @rust-lang/wg-unsafe-code-guidelines

@RalfJung (Member)
