rdrand: Avoid inlining unrolled retry loops. #444
The rdrand implementation contains three calls to rdrand():

1. One in the main loop, for full words of output.
2. One after the main loop, for the potential partial word of output.
3. One inside the self-test loop.

In the first case, the loop is unrolled into:

```
loop:
    ....
    rdrand <register>
    jb loop
    rdrand <register>
    jb loop
    rdrand <register>
    jb loop
    rdrand <register>
    jb loop
    rdrand <register>
    jb loop
    rdrand <register>
    jb loop
    rdrand <register>
    jb loop
    rdrand <register>
    jb loop
    rdrand <register>
    jb loop
    rdrand <register>
    jb loop
```

The second case is similar, except it isn't a loop.

In the third case, the self-test loop, the same unrolling happens, but then the self-test loop is also unrolled, so the result is a sequence of 160 instructions.

With this change, the generated code for the loop looks like this:

```
loop:
    ...
    rdrand <register>
    jb loop
    call retry
    test rax, rax
    jne loop
    jmp fail
```

The generated code for the tail now looks like this:

```
rdrand rdx
jae call_retry
...
```

This is much better because we're no longer jumping over the uselessly-unrolled loops.

The loop in `retry()` still gets unrolled though, but the compiler will put it in the cold function section.

Since rdrand will basically never fail, the `jb <success>` in each call is going to be predicted as succeeding, so the number of instructions doesn't change. But instruction cache pressure should be reduced.
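For readers who want to see the shape of the change at the source level, here is a minimal sketch of the pattern described above: one attempt on the hot path, with the retry loop moved into a function marked `#[cold]` and `#[inline(never)]`. It is not the code from this PR. The names `next_word`, `fill`, and `RETRY_LIMIT`, the closure-based `step` parameter (standing in for a single `rdrand` attempt), and the limit of 10 are all assumptions made for illustration.

```rust
/// Attempts made on the slow path. Hypothetical value, chosen for illustration.
const RETRY_LIMIT: usize = 10;

/// Slow path: kept out of line and marked cold, so that even if the compiler
/// unrolls this loop, the unrolled code does not sit inside the caller.
#[cold]
#[inline(never)]
fn retry<F: FnMut() -> Option<u64>>(step: &mut F) -> Option<u64> {
    for _ in 0..RETRY_LIMIT {
        if let Some(word) = step() {
            return Some(word);
        }
    }
    None
}

/// Hot path: a single attempt, falling back to `retry` only on failure.
#[inline]
fn next_word<F: FnMut() -> Option<u64>>(step: &mut F) -> Option<u64> {
    match step() {
        Some(word) => Some(word),
        None => retry(step),
    }
}

/// Fills `dest` with full words, then handles the potential partial word.
/// `step` is a stand-in for one hardware attempt (an assumption of this sketch).
fn fill<F: FnMut() -> Option<u64>>(dest: &mut [u8], step: &mut F) -> Option<()> {
    let mut chunks = dest.chunks_exact_mut(8);
    for chunk in &mut chunks {
        chunk.copy_from_slice(&next_word(step)?.to_ne_bytes());
    }
    let tail = chunks.into_remainder();
    if !tail.is_empty() {
        let word = next_word(step)?.to_ne_bytes();
        tail.copy_from_slice(&word[..tail.len()]);
    }
    Some(())
}
```

Passing the attempt in as a closure keeps the sketch portable and lets the retry behavior be exercised with a source that fails on purpose. Whether a compiler actually emits a `call retry` like the one shown above depends on the toolchain and optimization level.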
Force-pushed from e28dd9c to 857e873.
Maybe it's better to just write the desired assembly?
This pretty much does what we want. There is a longstanding rust-lang/rust issue about controlling unrolling, but with these changes the unrolling isn't much of an issue anymore. There is a place for inline assembly, but loops are one thing I'd like to avoid writing in inline assembly.
Assembly with a loop looks simple enough: https://rust.godbolt.org/z/1r87Gc6Kb It's just a quick draft, so it's better to double-check the code. (Also, I am not sure why it inserts seemingly useless
I did some experiments looking at the size of
So this does help at
I think we should stick to the intrinsics: they allow LLVM to reason about the carry flag, and they avoid error-prone inline assembly.
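As a rough illustration of the intrinsic-based approach, the sketch below wraps `core::arch::x86_64::_rdrand64_step`, which writes a random word through its argument and returns 1 when the instruction set the carry flag (success) and 0 otherwise. Because the success bit is an ordinary return value, the compiler is free to branch on it directly. The wrapper names `rdrand_step` and `try_rdrand` are hypothetical, and this is not the crate's actual code.

```rust
/// One `rdrand` attempt via the intrinsic. Hypothetical helper for illustration.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "rdrand")]
unsafe fn rdrand_step() -> Option<u64> {
    let mut word = 0u64;
    // The intrinsic returns 1 if the carry flag was set (a valid word was
    // produced) and 0 otherwise.
    if core::arch::x86_64::_rdrand64_step(&mut word) == 1 {
        Some(word)
    } else {
        None
    }
}

/// Safe wrapper: returns `None` when the CPU (or the target) has no RDRAND,
/// or when the single attempt fails.
fn try_rdrand() -> Option<u64> {
    #[cfg(target_arch = "x86_64")]
    {
        if std::is_x86_feature_detected!("rdrand") {
            // SAFETY: RDRAND support was checked at run time just above.
            return unsafe { rdrand_step() };
        }
    }
    None
}
```

The run-time feature check is only there to keep the sketch safe to call on any machine; it says nothing about how the crate itself detects RDRAND.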
TBH, I don't have any urgent need for this. I mostly wrote it so I could review the generated assembly more easily. At some point I think it would be good to expose the RDRAND implementation on all x86/x86_64 targets via a new API, so that a user can mix
I found a much better approach: https://rust.godbolt.org/z/xW4sWr8hW It uses the fact that the
I don't think we need to spend too much more time on this. But I just want to point out that in targets where
Yes, this is why I mentioned Feel free to close this PR if you do not plan to move forward with it.
@newpavlov Your idea of removing the With that in mind, I'm going to close this. If we end up not adding a public rdrand API then we can implement your idea. Or if you want to do it in the interim, I'm happy to review it.