Bug: Very inefficient code generated for async functions setup (and likely for generators in general) #99504
From a cursory look, I believe this requires a MIR optimization. LLVM doesn't have enough information for this, and I don't think it can be provided either.
@nikic Well, I'm no expert on LLVM or MIR, my knowledge is pretty superficial, so I can't judge. But the workaround shown in https://godbolt.org/z/dWzoqEjh1 basically implements manually what the Rust compiler is/should be doing, and it works without the temporary on the stack and thus without the `memcpy`.

BTW, would you care to add appropriate labels to the issue (bug, performance, ...)? I seem not to be able to add anything. Thanks.
I did some more digging in the MIR. What I see is the following (slow with memcpy: https://godbolt.org/z/eKGqjhWWP, fast with …). With a "regular" call and await, the generated code looks like this:
In … So one of the two moves in … The workaround I made with pinning looks a bit different:
Here, the future is stored directly into the caller's state, into … We see that … So I thought, maybe it's just a question of adding … I'm stopping the analysis here for now. Maybe someone can take it further (e.g., look at the generated LLVM IR).
Well, I did take yet another look. I can confirm that in the generated LLVM IR there is a call to `memcpy`.

@nikic You mentioned that this requires a MIR optimization. In the MIR output, there is no explicit call to `memcpy`; … In my pinned future workaround, I added an explicit call to …

Note that this seems to have something to do with the destination being in the caller's future's closure. I could not reproduce a similar problem with regular functions: there, the identity function is always optimized out, both on assignments to temporaries, to composite objects, via RVO, etc.

The question is: what can be done to solve the problem? All async programs are heavily impacted by it.
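The pinned-future workaround referred to above can be sketched in plain std Rust. This is an illustration of the pattern only (the `callee`/`caller` bodies and the tiny no-op-waker executor are mine, not from the issue): bind the callee's future to a local and pin it there instead of writing `callee().await` directly.

```rust
use std::future::Future;
use std::pin::pin;
use std::task::{Context, Poll, RawWaker, RawWakerVTable, Waker};

async fn callee() -> u32 {
    let data = [1_u32; 128]; // large state kept alive across the await
    std::future::ready(()).await;
    data[0]
}

// The workaround pattern: store the callee's future in a local first,
// pin it in place, and await the pinned reference.
async fn caller() -> u32 {
    let fut = callee();  // the future becomes part of the caller's state
    let fut = pin!(fut); // pinned in place, no further moves
    fut.await
}

// Minimal single-shot executor with a no-op waker, enough for futures
// that never wait on external events.
fn block_on<F: Future>(fut: F) -> F::Output {
    fn raw() -> RawWaker {
        fn noop(_: *const ()) {}
        fn clone(_: *const ()) -> RawWaker {
            raw()
        }
        static VTABLE: RawWakerVTable = RawWakerVTable::new(clone, noop, noop, noop);
        RawWaker::new(std::ptr::null(), &VTABLE)
    }
    let waker = unsafe { Waker::from_raw(raw()) };
    let mut cx = Context::from_waker(&waker);
    let mut fut = pin!(fut);
    loop {
        if let Poll::Ready(v) = fut.as_mut().poll(&mut cx) {
            return v;
        }
    }
}

fn main() {
    assert_eq!(block_on(caller()), 1);
}
```

Whether this form actually avoids the `memcpy` depends on the compiler version; the godbolt links in the thread show the difference for the version under discussion.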
This is fixable in principle, but it's not easy. The first thing to do would be to fix the fact that MIR inlining into generators is turned off; without that, we have no shot at doing this at all. From a cursory look, this seems non-trivial. cc @cjgillot - do you have any idea how much work this would be, and/or is it on your radar (or that of anyone else working on inlining)?
@JakobDegen Just an understanding question:
Which inlining do you mean? I do see stuff being inlined into async functions, even multiple async functions into each other in the final compiler output (after all phases). |
Sorry, I mean in MIR. LLVM still inlines, but since this is something that needs to be fixed in MIR, we need to do the inlining there as well.
@JakobDegen Thanks for the explanation. I thought it was something like that :-).

BTW, just an idea which might help short-term: when generating code for … Would it be possible to determine this in the compiler, in the code generation path, in …? An alternative would be to check there whether the implementation of …

This wouldn't be a "proper" fix, but it would alleviate the most pressing issue, which makes async programs slow. And, provided the expression type can be checked for having a … What do you think?
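One user-visible distinction that a cheap type-property check like the one gestured at here could rely on (an illustration only, not the compiler-internal check being proposed): every Rust value can already be relocated by a plain byte copy, so what varies between types is whether dropping them runs any code, which `std::mem::needs_drop` exposes.

```rust
use std::mem::needs_drop;

fn main() {
    // Moves in Rust are always plain memcpy; the interesting cheap
    // question about a type is whether dropping it has side effects.
    assert!(!needs_drop::<[u32; 128]>()); // plain data: no drop glue
    assert!(needs_drop::<Vec<u8>>());     // owns an allocation: has drop glue
}
```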
It would be possible to not emit the …
Well, I fully agree :-). But based on the discussion above, it seems like the proper fix is quite far away. That's why I proposed the workaround for now, which could likely be backported to a stable release soon.
I hacked #99782 to try and enable MIR inlining for generators. Does this branch solve the issue you are having? |
That on its own is probably not enough, but try that with #96451 rebased on top. Even if that doesn't solve this particular issue, it should still get rid of a bunch of copies (would love to see a usage report if you have some internal benchmarks) |
@cjgillot I tested with your changes (I hope correctly) by building the Rust compiler from source and adding it as a new toolchain. Unfortunately, it doesn't seem to help. The callee future is still materialized on the stack first and then memcpy-ed into the proper place in the caller's future space.

@JakobDegen Your changes can't be applied conflict-free, neither to the current master nor on top of the above changes. I resolved the relevant technical conflicts (that was not much, basically one function), but there are further logical conflicts (like no-longer-exhaustive matches), so I didn't investigate further. Seems like you'll have to rebase your PR on the current master and fix it. I can then try again afterwards.
Ah, crap. Yeah, I'll try and do that tonight |
@schreter rebased, try now |
@JakobDegen (CC @workingjubilee) This time I could merge both changes cleanly; yours had just a minor conflict in some test files for generated LLVM instructions. However, it doesn't seem to help - memcpy() is still generated. Of course, the error can be on my side and the compiler used might not be the newly-built compiler (not sure how to check that; I just followed the instructions mentioned above). Can you please test yourself? The minimal example is this:

```rust
#[tokio::main]
pub async fn main() {
    // this call causes the Future from `callee()`
    // to be materialized on stack and memcpy-moved
    // to the future of this function
    callee().await;
}

#[inline(never)] // prevent inlining creating the generator
pub async fn callee() {
    let mut data = [0_u32; 128];
    tokio::task::yield_now().await;
    println!("addr {:p}", &mut data);
}
```

If you set a breakpoint at …
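A tokio-free variant of this example can at least confirm that the callee's state is embedded in the caller's future. This is a sketch (exact sizes are layout-dependent and only lower-bounded here; `ready(()).await` stands in for `tokio::task::yield_now().await` as the suspension point keeping `data` alive):

```rust
async fn callee() {
    let mut data = [0_u32; 128];     // 512 bytes live across the .await
    std::future::ready(()).await;
    std::hint::black_box(&mut data); // keep `data` used after the await
}

async fn caller() {
    callee().await; // the callee's state is embedded in this future
}

fn main() {
    // The callee's future must contain the 512-byte array...
    assert!(std::mem::size_of_val(&callee()) >= 512);
    // ...and the caller's future aggregates the callee's state, so it
    // is at least as large (modulo layout optimizations).
    assert!(std::mem::size_of_val(&caller()) >= 512);
}
```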
I can probably take a look at some point, but it might be a bit.
@JakobDegen Any news? This, together with #97540, causes quite big and measurable performance issues in our project. Thanks.
What we see in our project, which also uses larger `Future`s, is a lot of unnecessary memory copying. These memcpy() calls are the hottest function in the profile (and I mean, in some cases, very dominant, like 90%, even with optimization level 3). I searched for similar issues, but found none, so here we go.

What happens:

Calling an async function produces a `Future` (which is in fact a generator, which is later polled). This `Future` is stored in the caller's async function "stack" (i.e., it's part of the caller's `Future`), so the caller's `Future` is the aggregate of the parent function state and the called function's `Future` (with some layout optimizations if calling multiple async functions in independent blocks).

Unfortunately, instead of instructing the generator to materialize the `Future` directly into the proper place in the caller's state, the `Future` is first materialized on the regular stack and then copied from the stack to the caller's `Future`.

Consider, for example, calling one of several request handlers (in a `match` statement) where one or more of them produce a large `Future`. Then, the call to the request handler materializes a `Future` by writing a few words to the stack, and then this (practically empty) space is copied in full from the stack to the caller's `Future` (i.e., including the uninitialized space - it's simply a binary move).

This wastes cycles like there is no tomorrow - instead of materializing the callee's `Future` on the stack, async functions should materialize the callee's `Future` directly in-place in the caller's `Future`. That would save the copying (and, especially, the copying of still-uninitialized space!).

A minimal example in Compiler Explorer is here: https://godbolt.org/z/b45MTex3e. You can see that `callee().await` first materializes the `Future` on the stack and then copies it into the proper place.
on stack and then it's copied into proper place.Generated instructions (Intel):
What I'd expect to see:
This might be related to #97540; I also posted it there first.
Interestingly, the same problem does NOT happen when calling a function producing a large result and storing it in a variable in the async closure, then using that variable later. In that case, the function produces the value directly in the future's state. This is also true when storing the large generated future in a variable, pinning it explicitly, and awaiting it (as demonstrated in https://godbolt.org/z/dWzoqEjh1).
We found some temporary workarounds for the problem: boxing some especially large futures and/or the above-mentioned pinning workaround. This helps improve the performance somewhat, but memory allocation is not particularly cheap either. Further, hunting down these issues requires a lot of analysis time (since it's also not possible to set a warning level to warn about large futures). Therefore, these are not practicable.
A real solution to the problem, removing the unnecessary memcpy, would be very desirable, since it would help the performance of async code in general. It looks like some move optimization is missing.
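As a user-code analogy for the missing optimization (the types and function names below are hypothetical stand-ins, not compiler internals): `current` mirrors what today's codegen does - build the callee's state in a stack temporary, then binary-move it - while `desired` constructs directly into the caller-provided slot. Rust does not guarantee this elision today; the sketch only illustrates the shape of the fix.

```rust
use std::mem::MaybeUninit;

// Hypothetical stand-ins: `BigState` plays the role of a large callee
// future, `CallerFuture` the caller's generator state.
struct BigState {
    data: [u32; 128],
}

struct CallerFuture {
    child: MaybeUninit<BigState>,
}

fn make_state() -> BigState {
    BigState { data: [7; 128] }
}

// What rustc emits today: the callee's state materializes in a stack
// temporary, then is binary-moved (the memcpy) into the caller's future.
fn current(caller: &mut CallerFuture) {
    let tmp = make_state();               // stack temporary
    caller.child = MaybeUninit::new(tmp); // the extra copy
}

// The desired shape: construct directly into the caller-provided slot.
// (No guaranteed elision in Rust today; this is only the target shape.)
fn desired(caller: &mut CallerFuture) {
    caller.child.write(make_state());
}

fn main() {
    let mut a = CallerFuture { child: MaybeUninit::uninit() };
    let mut b = CallerFuture { child: MaybeUninit::uninit() };
    current(&mut a);
    desired(&mut b);
    unsafe {
        assert_eq!(a.child.assume_init_ref().data[0], 7);
        assert_eq!(b.child.assume_init_ref().data[0], 7);
    }
}
```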
BTW, I tried to post this directly as a bug, but the "Submit new issue" was grayed out. Therefore, I'm submitting it as a blank issue.