Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rdrand: Avoid inlining unrolled retry loops. #444

Closed
wants to merge 1 commit into from

Conversation

briansmith
Copy link
Contributor

@briansmith briansmith commented May 31, 2024

The rdrand implementation contains three calls to rdrand():

  1. One in the main loop, for full words of output.
  2. One after the main loop, for the potential partial word of output.
  3. One inside the self-test loop.

In the first case, the loop is unrolled into:

loop:
   ....

   rdrand <register>
   jb loop
   rdrand <register>
   jb loop
   rdrand <register>
   jb loop
   rdrand <register>
   jb loop
   rdrand <register>
   jb loop
   rdrand <register>
   jb loop
   rdrand <register>
   jb loop
   rdrand <register>
   jb loop
   rdrand <register>
   jb loop
   rdrand <register>
   jb loop

The second case is similar, except it isn't a loop.

In the third case, the self-test loop, the same unrolling happens, but then
the self-test loop is also unrolled, so the result is a sequence of 160
instructions.

With this change, the generated code for the loop looks like this:

loop:
        ...

	rdrand <register>
	jb loop
	call retry
	test rax, rax
	jne loop
	jmp fail

The generated code for the tail now looks like this:

        rdrand rdx
	jae call_retry
        ...

This is much better because we're no longer jumping over the uselessly-
unrolled loops.

The loop in retry() still gets unrolled though, but the compiler will
put it in the cold function section.

Since rdrand will basically never fail, the jb <success> in each
call is going to be predicted as succeeding, so the number of
instructions doesn't change. But, instruction cache pressure should
be reduced.

The rdrand implementation contains three calls to rdrand():

1. One in the main loop, for full words of output.
2. One after the main loop, for the potential partial word of output.
3. One inside the self-test loop.

In the first case, the loop is unrolled into:

```
loop:
   ....

   rdrand <register>
   jb loop
   rdrand <register>
   jb loop
   rdrand <register>
   jb loop
   rdrand <register>
   jb loop
   rdrand <register>
   jb loop
   rdrand <register>
   jb loop
   rdrand <register>
   jb loop
   rdrand <register>
   jb loop
   rdrand <register>
   jb loop
   rdrand <register>
   jb loop
```

The second case is similar, except it isn't a loop.

In the third case, the self-test loop, the same unrolling happens, but then
the self-test loop is also unrolled, so the result is a sequence of 160
instructions.

With this change, the generated code for the loop looks like this:

```
loop:
        ...

	rdrand <register>
	jb loop
	call retry
	test rax, rax
	jne loop
	jmp fail
```

The generated code for the tail now looks like this:

```
        rdrand rdx
	jae call_retry
        ...

```

This is much better because we're no longer jumping over the uselessly-
unrolled loops.

The loop in `retry()` still gets unrolled though, but the compiler will
put it in the cold function section.

Since rdrand will basically never fail, the `jb <success>` in each
call is going to be predicted as succeeding, so the number of
instructions doesn't change. But, instruction cache pressure should
be reduced.
@newpavlov
Copy link
Member

Maybe it's better to just write the desired assembly?

@briansmith
Copy link
Contributor Author

Maybe it's better to just write the desired assembly?

This pretty much does what we want. There is a longstanding rust-lang/rust issue about controlling unrolling, but with these changes the unrolling isn't much of an issue anymore. There is a place for inline assembly but I think loops are one thing where I'd like to avoid doing them in inline assembly.

@newpavlov
Copy link
Member

Assembly with loop looks simple enough: https://rust.godbolt.org/z/1r87Gc6Kb It's just a quick draft, so it's better to double-check the code. (Also, I am not sure why it inserts seemingly useless push rax and pop rcx)

@josephlr
Copy link
Member

josephlr commented Jun 4, 2024

I did some experiments looking at the size of rdrand_exact at different optimization levels. I wanted to exclude the initialization code, as #443 is a better place to discuss inlining of the initialization code.

Optimization Level Current Impl This PR ASM Loop
opt-level=z 186 bytes 197 bytes 159 bytes
opt-level=s 147 bytes 152 bytes 147 bytes
opt-level=3 211 bytes 194 bytes (131 excluding retry) 132 bytes

So this does help at opt-level=3, but I'm wondering if it's worth it. All these implementations will have essentially identical performance, and 80 bytes of icache pressure is probably irrelevant compared to the latancy of rdrand. The main benefit to me is that when we have #288, we will be able to see if the retry logic is covered by tests.

Maybe it's better to just write the desired assembly?

I think we should stick to the intrinsics, they allow LLVM to reason about the carry flag, and avoid error-prone inline assembly.

@briansmith
Copy link
Contributor Author

briansmith commented Jun 4, 2024

TBH, I don't have any urgent need for this. I mostly wrote it so I could review the generated assembly easier. At some point I think it would be good to expose the RDRAND implementation on all x86/x86_64 targets via a new API, so that a user can mix getrandom() and RDRAND, which is pretty common to do. This is mostly a step towards that; the hope is that the linker will be able to put the non-cold parts of getrandom[_uninit] and the non-cold parts of rdrand right next to each other in the same page.

@newpavlov
Copy link
Member

newpavlov commented Jun 5, 2024

I found a much better approach: https://rust.godbolt.org/z/xW4sWr8hW

It uses the fact that the rdrand function can not be inlined because of the enabled target feature, so number of retries becomes opaque for the optimizer. We also could mark it with #[inline(never)] or even #[no_mangle] to be extra safe. It also may be worth to return Result<Word, Error> instead of Option to shave off several instructions.

@briansmith
Copy link
Contributor Author

It uses the fact that the rdrand function can not be inlined because of the enabled target feature, so number of retries becomes opaque for the optimizer.

I don't think we need to spend too much more time on this. But I just want to point out that in targets where rdrand is actually used, at lesat SGX, have rdrand as a statically-enabled feature, so everything gets inlined into getrandom_inner.

@newpavlov
Copy link
Member

Yes, this is why I mentioned #[inline(never)] and #[no_mangle]. From playing in godbolt, the latter looks more robust, but we would need to be careful with the function name to prevent potential collisions.

Feel free to close this PR, if you do not plan to move forward with it.

@briansmith
Copy link
Contributor Author

@newpavlov Your idea of removing the #[target_feature(enable = "rdrand")] from the callers does seem better than my idea. However, in #230 (comment) it was suggested that maybe we would offer a public RDRAND API. I think we should have a discussion about whether we want to have such an API, what the design of that public API would be, see if we can implement getrandom_uninit in terms of that API, and then see if we can avoid the weird codegen in the implementation of that API.

With that in mind, I'm going to close this. If we end up not adding a public rdrand API then we can implement your idea. Or if you want to do it in the interim, I'm happy to review it.

@briansmith briansmith closed this Jun 8, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants