rdrand: Avoid inlining unrolled retry loops. #444

briansmith · 2024-05-31T20:18:04Z

The rdrand implementation contains three calls to rdrand():

One in the main loop, for full words of output.
One after the main loop, for the potential partial word of output.
One inside the self-test loop.

In the first case, the loop is unrolled into:

loop:
   ....

   rdrand <register>
   jb loop
   rdrand <register>
   jb loop
   rdrand <register>
   jb loop
   rdrand <register>
   jb loop
   rdrand <register>
   jb loop
   rdrand <register>
   jb loop
   rdrand <register>
   jb loop
   rdrand <register>
   jb loop
   rdrand <register>
   jb loop
   rdrand <register>
   jb loop

The second case is similar, except it isn't a loop.

In the third case, the self-test loop, the same unrolling happens, but then
the self-test loop is also unrolled, so the result is a sequence of 160
instructions.

With this change, the generated code for the loop looks like this:

loop:
        ...

	rdrand <register>
	jb loop
	call retry
	test rax, rax
	jne loop
	jmp fail

The generated code for the tail now looks like this:

        rdrand rdx
	jae call_retry
        ...

This is much better because we're no longer jumping over the uselessly-
unrolled loops.

The loop in retry() still gets unrolled though, but the compiler will
put it in the cold function section.

Since rdrand will basically never fail, the jb <success> in each
call is going to be predicted as succeeding, so the number of
instructions doesn't change. But, instruction cache pressure should
be reduced.

The rdrand implementation contains three calls to rdrand(): 1. One in the main loop, for full words of output. 2. One after the main loop, for the potential partial word of output. 3. One inside the self-test loop. In the first case, the loop is unrolled into: ``` loop: .... rdrand <register> jb loop rdrand <register> jb loop rdrand <register> jb loop rdrand <register> jb loop rdrand <register> jb loop rdrand <register> jb loop rdrand <register> jb loop rdrand <register> jb loop rdrand <register> jb loop rdrand <register> jb loop ``` The second case is similar, except it isn't a loop. In the third case, the self-test loop, the same unrolling happens, but then the self-test loop is also unrolled, so the result is a sequence of 160 instructions. With this change, the generated code for the loop looks like this: ``` loop: ... rdrand <register> jb loop call retry test rax, rax jne loop jmp fail ``` The generated code for the tail now looks like this: ``` rdrand rdx jae call_retry ... ``` This is much better because we're no longer jumping over the uselessly- unrolled loops. The loop in `retry()` still gets unrolled though, but the compiler will put it in the cold function section. Since rdrand will basically never fail, the `jb <success>` in each call is going to be predicted as succeeding, so the number of instructions doesn't change. But, instruction cache pressure should be reduced.

newpavlov · 2024-05-31T22:45:30Z

Maybe it's better to just write the desired assembly?

briansmith · 2024-06-01T00:07:28Z

Maybe it's better to just write the desired assembly?

This pretty much does what we want. There is a longstanding rust-lang/rust issue about controlling unrolling, but with these changes the unrolling isn't much of an issue anymore. There is a place for inline assembly but I think loops are one thing where I'd like to avoid doing them in inline assembly.

newpavlov · 2024-06-02T22:26:27Z

Assembly with loop looks simple enough: https://rust.godbolt.org/z/1r87Gc6Kb It's just a quick draft, so it's better to double-check the code. (Also, I am not sure why it inserts seemingly useless push rax and pop rcx)

josephlr · 2024-06-04T00:52:11Z

I did some experiments looking at the size of rdrand_exact at different optimization levels. I wanted to exclude the initialization code, as #443 is a better place to discuss inlining of the initialization code.

Optimization Level	Current Impl	This PR	ASM Loop
`opt-level=z`	`186` bytes	`197` bytes	`159` bytes
`opt-level=s`	`147` bytes	`152` bytes	`147` bytes
`opt-level=3`	`211` bytes	`194` bytes (`131` excluding `retry`)	`132` bytes

So this does help at opt-level=3, but I'm wondering if it's worth it. All these implementations will have essentially identical performance, and 80 bytes of icache pressure is probably irrelevant compared to the latancy of rdrand. The main benefit to me is that when we have #288, we will be able to see if the retry logic is covered by tests.

Maybe it's better to just write the desired assembly?

I think we should stick to the intrinsics, they allow LLVM to reason about the carry flag, and avoid error-prone inline assembly.

briansmith · 2024-06-04T01:23:00Z

TBH, I don't have any urgent need for this. I mostly wrote it so I could review the generated assembly easier. At some point I think it would be good to expose the RDRAND implementation on all x86/x86_64 targets via a new API, so that a user can mix getrandom() and RDRAND, which is pretty common to do. This is mostly a step towards that; the hope is that the linker will be able to put the non-cold parts of getrandom[_uninit] and the non-cold parts of rdrand right next to each other in the same page.

newpavlov · 2024-06-05T00:32:04Z

I found a much better approach: https://rust.godbolt.org/z/xW4sWr8hW

It uses the fact that the rdrand function can not be inlined because of the enabled target feature, so number of retries becomes opaque for the optimizer. We also could mark it with #[inline(never)] or even #[no_mangle] to be extra safe. It also may be worth to return Result<Word, Error> instead of Option to shave off several instructions.

briansmith · 2024-06-05T00:39:24Z

It uses the fact that the rdrand function can not be inlined because of the enabled target feature, so number of retries becomes opaque for the optimizer.

I don't think we need to spend too much more time on this. But I just want to point out that in targets where rdrand is actually used, at lesat SGX, have rdrand as a statically-enabled feature, so everything gets inlined into getrandom_inner.

newpavlov · 2024-06-05T00:45:43Z

Yes, this is why I mentioned #[inline(never)] and #[no_mangle]. From playing in godbolt, the latter looks more robust, but we would need to be careful with the function name to prevent potential collisions.

Feel free to close this PR, if you do not plan to move forward with it.

briansmith · 2024-06-08T04:12:29Z

@newpavlov Your idea of removing the #[target_feature(enable = "rdrand")] from the callers does seem better than my idea. However, in #230 (comment) it was suggested that maybe we would offer a public RDRAND API. I think we should have a discussion about whether we want to have such an API, what the design of that public API would be, see if we can implement getrandom_uninit in terms of that API, and then see if we can avoid the weird codegen in the implementation of that API.

With that in mind, I'm going to close this. If we end up not adding a public rdrand API then we can implement your idea. Or if you want to do it in the interim, I'm happy to review it.

briansmith force-pushed the b/rdrand-loops branch from e28dd9c to 857e873 Compare May 31, 2024 20:37

briansmith closed this Jun 8, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

rdrand: Avoid inlining unrolled retry loops. #444

rdrand: Avoid inlining unrolled retry loops. #444

briansmith commented May 31, 2024 •

edited

Loading

newpavlov commented May 31, 2024

briansmith commented Jun 1, 2024

newpavlov commented Jun 2, 2024

josephlr commented Jun 4, 2024

briansmith commented Jun 4, 2024 •

edited

Loading

newpavlov commented Jun 5, 2024 •

edited

Loading

briansmith commented Jun 5, 2024

newpavlov commented Jun 5, 2024

briansmith commented Jun 8, 2024

rdrand: Avoid inlining unrolled retry loops. #444

rdrand: Avoid inlining unrolled retry loops. #444

Conversation

briansmith commented May 31, 2024 • edited Loading

newpavlov commented May 31, 2024

briansmith commented Jun 1, 2024

newpavlov commented Jun 2, 2024

josephlr commented Jun 4, 2024

briansmith commented Jun 4, 2024 • edited Loading

newpavlov commented Jun 5, 2024 • edited Loading

briansmith commented Jun 5, 2024

newpavlov commented Jun 5, 2024

briansmith commented Jun 8, 2024

briansmith commented May 31, 2024 •

edited

Loading

briansmith commented Jun 4, 2024 •

edited

Loading

newpavlov commented Jun 5, 2024 •

edited

Loading