
Test suite compilation spends a large amount of time in evaluate_obligation #87012

Open
patrickfreed opened this issue Jul 9, 2021 · 19 comments
Labels
I-compiletime Issue: Problems and improvements with respect to compile times. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.


@patrickfreed

I originally posted this on the Rust forums but was directed here since the cause could be related to a compiler bug.

I'm looking for some help debugging why the compile times in the mongodb crate's test suite are much slower than the crate itself. For reference, a single incremental change can take a minute or longer, even when running just cargo check --tests. The crate itself compiles quickly--the issue is only with the tests.

The output of compiling with -Zself-profile indicates that nearly the entire time (~1m) is spent in evaluate_obligation, but I can't figure out what this implies or how to reduce it.

A few possible culprits that I've investigated so far:

  • the tests make heavy use of #[cfg_attr(...)] to conditionally select either #[tokio::test] or #[async_std::test]. Removing all usages of this and defaulting to just #[tokio::test] has an insignificant impact on compile times
  • the tests also make heavy use of a function that accepts a closure that returns a future. According to this thread, such functions can cause type length / compilation time issues, but converting it to a macro that calls a function directly didn't seem to change much. Perhaps that suffers from a similar issue?
  • We generate a lot of builders via the typed-builder crate. Perhaps using them in the tests leads to long compilation times due to the generated builders having lots of generic parameters.
  • this compiler issue related to async and compilation times, though someone posted their profile output and evaluate_obligation is not a factor in it.
  • most of our tests are in src/test rather than tests, not sure if this has any impact
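
For reference, the closure-accepting helper pattern from the second bullet can be sketched like this (a hypothetical minimal reduction, not the actual mongodb test code): every call site instantiates the generic function with a distinct closure type and a distinct future type, so each one is a separate generic instantiation for the compiler to check.

```rust
use std::future::Future;

// Hypothetical reduction of the test helper: a function that accepts a
// closure returning a future. Each call site produces a unique closure
// type and future type, which is why this pattern is a compile-time
// suspect.
async fn run_test<F, Fut>(test: F) -> Fut::Output
where
    F: FnOnce() -> Fut,
    Fut: Future,
{
    test().await
}
```

Converting this to a macro, as mentioned above, replaces the generic instantiations with directly expanded code at each call site.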

Anyways, I'm not sure where to go from here, and I was wondering if anyone else has encountered similar issues in the past or knows how I could further debug this to identify the root cause. Thanks!

Here are the top few results from the self-profile output:

+--------------------------------------------------+-----------+-----------------+----------+------------+-----------------------+
| Item                                             | Self time | % of total time | Time     | Item count | Incremental load time |
+--------------------------------------------------+-----------+-----------------+----------+------------+-----------------------+
| evaluate_obligation                              | 58.34s    | 85.300          | 58.39s   | 110480     | 0.00ns                |
+--------------------------------------------------+-----------+-----------------+----------+------------+-----------------------+
| expand_crate                                     | 3.54s     | 5.179           | 3.59s    | 1          | 0.00ns                |
+--------------------------------------------------+-----------+-----------------+----------+------------+-----------------------+
| typeck                                           | 2.74s     | 4.011           | 61.39s   | 1159       | 71.43ms               |
+--------------------------------------------------+-----------+-----------------+----------+------------+-----------------------+
| self_profile_alloc_query_strings                 | 979.78ms  | 1.433           | 980.49ms | 1          | 0.00ns                |
+--------------------------------------------------+-----------+-----------------+----------+------------+-----------------------+
| hir_lowering                                     | 204.80ms  | 0.299           | 204.80ms | 1          | 0.00ns                |
+--------------------------------------------------+-----------+-----------------+----------+------------+-----------------------+
@est31
Member

est31 commented Jul 9, 2021

How many tests does the test suite contain? Are they doctests? Tests in the tests/ directory? #[test] functions?

Can you isolate the behaviour to a single test such that, if you remove it, the time spent in evaluate_obligation is more normal again? You could try commenting out tests one by one. Maybe it is the result of a combination of multiple tests?

@estebank estebank added I-slow Issue: Problems and improvements with respect to performance of generated code. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. labels Jul 10, 2021
@jyn514
Member

jyn514 commented Jul 10, 2021

We generate a lot of builders via the typed-builder crate. Perhaps using them in the tests leads to long compilation times due to the generated builders having lots of generic parameters.

Without looking into this for more than 30 seconds, I'd expect this to be the issue.

@jyn514
Member

jyn514 commented Jul 10, 2021

For context, evaluate_obligation is the part of the compiler that determines whether a T: Trait bound is fulfilled.
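
A tiny illustration of the kind of obligation involved here (a hypothetical reduction, not the actual test code): auto traits like Send on the future returned by an async fn are exactly the kind of T: Trait bound evaluate_obligation checks, and evaluating them requires looking through everything the future holds across .await points.

```rust
// Hypothetical reduction: `is_send` imposes a `T: Send` bound, and
// calling it with a future forces the compiler to evaluate the
// obligation `impl Future<Output = i32>: Send`, which means examining
// every value held across an `.await` point inside `outer`.
fn is_send<T: Send>(_: &T) -> bool {
    true
}

async fn inner() -> i32 {
    1
}

async fn outer() -> i32 {
    inner().await
}

fn main() {
    let fut = outer();
    assert!(is_send(&fut));
}
```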

@nikic nikic added I-compiletime Issue: Problems and improvements with respect to compile times. and removed I-slow Issue: Problems and improvements with respect to performance of generated code. labels Jul 11, 2021
@patrickfreed
Author

Can you isolate the behaviour to a single test such that, if you remove it, the time spent in evaluate_obligation is more normal again? You could try commenting out tests one by one. Maybe it is the result of a combination of multiple tests?

I went ahead and commented the tests out in chunks until they were all commented out and saw no meaningful change, which suggests the slowdown is somewhere in the test harness code rather than in the tests themselves.

Digging further, I managed to find that it's likely due to a large number of #[async_trait] implementations (which is in line with evaluate_obligation being related to determining if trait bounds are satisfied, as @jyn514 pointed out). After commenting out most of the implementations, I was able to get the evaluate_obligation time down to roughly 20s instead of a minute.

For reference, the trait looks like this:

#[async_trait]
pub trait TestOperation: Debug {
    async fn execute_on_collection(
        &self,
        collection: &Collection<Document>,
        session: Option<&mut ClientSession>,
    ) -> Result<Option<Bson>>;
    async fn execute_on_database(
        &self,
        database: &Database,
        session: Option<&mut ClientSession>,
    ) -> Result<Option<Bson>>;
    async fn execute_on_client(&self, client: &TestClient) -> Result<Option<Bson>>;
    async fn execute_on_session(&self, session: &mut ClientSession) -> Result<Option<Bson>>;
}

And we have maybe 60 implementations of that trait.

evaluate_obligation still takes up a big chunk of time after removing most of them, but I think it could be reduced further by removing the last few implementations of the trait.

self-profile output after commenting out:

+--------------------------------------------------+-----------+-----------------+----------+------------+-----------------------+
| Item                                             | Self time | % of total time | Time     | Item count | Incremental load time |
+--------------------------------------------------+-----------+-----------------+----------+------------+-----------------------+
| evaluate_obligation                              | 7.00s     | 31.556          | 7.18s    | 33687      | 0.00ns                |
+--------------------------------------------------+-----------+-----------------+----------+------------+-----------------------+
| expand_crate                                     | 5.28s     | 23.804          | 5.40s    | 1          | 0.00ns                |
+--------------------------------------------------+-----------+-----------------+----------+------------+-----------------------+
| typeck                                           | 2.39s     | 10.790          | 10.15s   | 1028       | 100.07ms              |
+--------------------------------------------------+-----------+-----------------+----------+------------+-----------------------+
| incr_comp_encode_dep_graph                       | 1.28s     | 5.761           | 1.28s    | 981842     | 0.00ns                |
+--------------------------------------------------+-----------+-----------------+----------+------------+-----------------------+

So does this seem like a bug in the compiler, then? In async_trait? Or neither?

@patrickfreed
Author

patrickfreed commented Jul 30, 2021

I'm happy to report that converting the async_trait implementations to regular trait implementations that return futures eliminated nearly all the time spent in evaluate_obligation. Given that, it seems like this may be a problem with async_trait, so I've filed an issue there.

edit: per @dtolnay, this is still likely a rustc issue, not an async_trait one.
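
For anyone hitting the same problem, the conversion described above looks roughly like this (a minimal sketch with a hypothetical one-method trait, not the actual mongodb code): instead of #[async_trait], the trait method returns a boxed future directly, so the bounds are stated once, explicitly.

```rust
use std::future::Future;
use std::pin::Pin;

// The same shape async_trait generates, but written once by hand.
pub type BoxFuture<'a, T> = Pin<Box<dyn Future<Output = T> + Send + 'a>>;

pub trait TestOperation {
    // Hand-written equivalent of `async fn execute(&self) -> i32`.
    fn execute(&self) -> BoxFuture<'_, i32>;
}

pub struct Op;

impl TestOperation for Op {
    fn execute(&self) -> BoxFuture<'_, i32> {
        Box::pin(async move { 42 })
    }
}
```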

@crepererum
Contributor

So we have this issue over at https://github.com/influxdata/influxdb_iox as well. Here is a (non-minimal) reproducible example:

$ git clone https://github.com/influxdata/influxdb_iox.git
$ cd influxdb_iox
$ git checkout 61ccdbba93b3908ca80ef1ecce8a0e96d89ef1f0
$ RUSTFLAGS="-Zself-profile -Zself-profile-events=default,args" cargo +nightly build
$ cd ..
$ summarize summarize query_tests-<PID>.mm_profdata

I've also followed https://blog.rust-lang.org/inside-rust/2020/02/25/intro-rustc-self-profile.html and created a chrome profile: https://we.tl/t-dqgMlqYWHI (can re-upload somewhere else on request)

There you can see the detailed calls to evaluate_obligation which often have these or similar parameters:

Canonical {
    max_universe: U0,
    variables: [
        CanonicalVarInfo { kind: Region(U0) }, 
        CanonicalVarInfo { kind: Region(U0) }
    ],
    value: ParamEnvAnd {
       param_env: ParamEnv {
           caller_bounds: [
               Binder(OutlivesPredicate(ReLateBound(DebruijnIndex(1), BoundRegion { var: 0, kind: BrAnon(0) }), ReLateBound(DebruijnIndex(1), BoundRegion { var: 1, kind: BrAnon(1) })), []),
               Binder(OutlivesPredicate(scenarios::delete::TwoDeleteMultiExprsFromRubOneMeasurementOneChunk, ReLateBound(DebruijnIndex(1), BoundRegion { var: 1, kind: BrAnon(1) })), [])
            ],
            reveal: UserFacing
        },
        value: Binder(TraitPredicate(<lifecycle::policy::LifecyclePolicy<server::db::lifecycle::WeakDb> as std::marker::Send>), [])
    }
}
Canonical {
    max_universe: U0,
    variables: [
        CanonicalVarInfo { kind: Region(U0) },
        CanonicalVarInfo { kind: Region(U0) }
    ],
    value: ParamEnvAnd {
        param_env: ParamEnv {
            caller_bounds: [
                Binder(OutlivesPredicate(ReLateBound(DebruijnIndex(1), BoundRegion { var: 0, kind: BrAnon(0) }), ReLateBound(DebruijnIndex(1), BoundRegion { var: 1, kind: BrAnon(1) })), []),
                Binder(OutlivesPredicate(scenarios::delete::TwoDeleteMultiExprsFromRubOneMeasurementOneChunk, ReLateBound(DebruijnIndex(1), BoundRegion { var: 1, kind: BrAnon(1) })), [])
            ],
        reveal: UserFacing
        },
        value: Binder(TraitPredicate(<std::sync::Arc<server::Db> as std::marker::Sync>), [])
    }
}

I'm not really sure how to read this, but could it be that the compiler is quite busy figuring out if a future is Send and Sync?

@est31
Member

est31 commented Sep 22, 2021

It would be interesting to get a self contained example that only uses the async-trait crate, no other dependencies. Is that possible?

@crepererum
Contributor

I've tried, but so far I haven't been able to reproduce it on a smaller scale. I THINK it's not just about the number of async-trait impls. I'll give it another shot and report back once I have a smaller example.

@est31
Member

est31 commented Sep 23, 2021

@crepererum yeah I thought about something like a crate that autogenerates a lot of async-trait impls.

@crepererum
Contributor

I think I have a somewhat usable example now: https://github.com/crepererum/rust_issue_87012/tree/ee51fa3d07438c0b065df39a961698118a75aad2

It's not perfect, since the effect could be larger, but the effect is clearly visible, especially considering that the time evaluate_obligation takes can easily be increased by adding more example code.

@carols10cents
Member

Hi, I'm investigating this issue further and I'd like to work on fixing it, but I'm not sure where to start on the fix.

I've taken @crepererum 's example and tried various reductions on it, with the following results:

At main commit https://github.com/crepererum/rust_issue_87012/tree/ee51fa3d07438c0b065df39a961698118a75aad2, running ./test.sh shows evaluate_obligation taking around 100ms and 23% of compiling repro_crate.

In the first commit on my experiments branch, https://github.com/integer32llc/rust_issue_87012/tree/a0411b038eba9d3ddfa1b85fb0bd0012dc674da7, I used cargo expand to get the code that the async-trait crate generates, replaced the code in repro_crate/src/main.rs with the generated code, and removed the dependency on async-trait. This had no effect on evaluate_obligation other than raising the percentage of compile time it takes, because the build no longer does the macro expansion; this commit is still ~100ms, 36%.

In integer32llc/rust_issue_87012@bf24fd9 I removed the clippy allows just to reduce some noise; this had no effect on the problem.

In integer32llc/rust_issue_87012@cbd1831, I changed the body of all the trait implementation functions to call futures::future::ready(()).await; rather than helper_crate::entry_point().await; and the problem no longer occurred: evaluate_obligation time was now 4ms/3%.

I reverted that and tried removing the Send trait bound: integer32llc/rust_issue_87012@8b288cf the problem no longer occurred: 4ms/3%.

I reverted that and tried removing the lifetime annotation bounds: integer32llc/rust_issue_87012@df4daad the problem no longer occurred: 2ms/1%.

I tried removing the body of the helper_crate::entry_point function: integer32llc/rust_issue_87012@7d577df the problem no longer occurred: 6ms/4%.

So it appears to me that this problem needs all 3 of:

  • A Send trait bound
  • A lifetime bound
  • Calling async fn that calls other async fns

Additionally, looking at the chrome profiles of cases that exhibit the problem, it looks like the issue is both that evaluate_obligation is being called many times and that these calls are expensive. The rectangles next to the one I'm hovering over here are all evaluate_obligation:

screen shot zooming in on a relevant part of a chrome profile with the problem

In cases when the problem isn't present, evaluate_obligation appears fewer times and when it does appear, it takes way less time:

screen shot zooming in on a relevant part of a chrome profile where the problem isn't present

I'm not sure what experiments would be helpful to do next. Does it seem like the problem is that evaluate_obligation is called many times, or that the individual calls take a long time? Are there other reductions that would be interesting? Would it be useful to vary the number of parameters (and thus lifetime annotations) in the trait function definition, or the number of trait implementations, or something else?

Does anyone have any intuition about what and where the underlying issue might be?

@est31
Member

est31 commented Dec 16, 2021

Personally I'd try working on removing the async_trait macro invocation / dependency next. I'd start with creating a macro_rules macro that repeats the output of the macro (based on cargo expand or similar). Then I'd try to reduce this code and see how far I can go with the issue still remaining.

This will then hopefully help me gain enough knowledge that upon studying the code I can identify potential sources for the slowness.

It would also be interesting to try to find out what arguments evaluate_obligation gets called with and what makes it run so long.

@crepererum
Contributor

@est31 I think async_trait is already gone (see @carols10cents 's first step). The current code at https://github.com/integer32llc/rust_issue_87012/blob/78cc8c8a91d043cf7970d10fc1ad59c3352d74e2/repro_crate/src/main.rs looks something like this:

trait Trait {
    #[must_use]
    fn f<'life0, 'async_trait>(
        &'life0 self,
    ) -> ::core::pin::Pin<
        Box<dyn ::core::future::Future<Output = ()> + ::core::marker::Send + 'async_trait>,
    >
    where
        'life0: 'async_trait,
        Self: 'async_trait;
}
pub struct S0;
impl Trait for S0 {
    fn f<'life0, 'async_trait>(
        &'life0 self,
    ) -> ::core::pin::Pin<
        Box<dyn ::core::future::Future<Output = ()> + ::core::marker::Send + 'async_trait>,
    >
    where
        'life0: 'async_trait,
        Self: 'async_trait,
    {
        Box::pin(async move {
            let __self = self;
            let _: () = {
                helper_crate::entry_point().await;
            };
        })
    }
}
// Repeat this for S1...Sn

There's also a helper_crate that builds up a somewhat nested object structure. My original, uneducated intuition was that this is related to checking the Send bound, that these checks are run multiple times (i.e. for every S0..Sn), and that each check carries some substantial type payload (X0..Xn in helper_crate).

@est31
Member

est31 commented Dec 16, 2021

@crepererum good point, I haven't checked her repo. I think there are several ways to narrow the problem down further.

First would be to find out the complexity function, that is, what happens when you add M new items to N already existing ones. If M = N, for example, and the run time doubles, you know that it's linear in N. If the run time quadruples, you know it's a quadratic function. If it is 8 times as long, it's a cubic function, etc. If the run time increases by much more than that, it might be exponential. For that, try adding 1 or 2 cases a few times, measuring at each step, and check whether the time gets multiplied by a constant each time.

The complexity can be a mix of multiple functions, so it might not always be 100% clear, but a trend can show up. Instead of doubling you could do a 5 fold increase, which if the largest power is 2, gives you a 25 fold increase in time, or a 125 if the largest power is 3.

Why is the function interesting? Because it tells you something about the structure of the code that causes the slowness. If it's linearly increasing, it's likely not the main cause, unless the number is really large, but instead it's just invoking something moderately slow many times. If it's quadratically increasing, you have nested loops of level 2. If it's cubically increasing, the nesting level is 3. If it's exponentially increasing, it's likely some search problem, that is something where you have code trying out all combinations of some possible values to try to make them fit.

Why is this information helpful? Because it tells you what kind of code to look for.

Furthermore, if the problem can be nicely tuned to become really bad, the offending code paths will show up in the profiles.
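
The doubling heuristic described above can be sketched numerically (the measurements below are made up for illustration): if the cost is O(N^k), then multiplying N by r multiplies the time by r^k, so k = ln(t2/t1) / ln(n2/n1).

```rust
// Estimate the polynomial degree k from two (size, time) measurements:
// time ~ C * n^k implies k = ln(t2/t1) / ln(n2/n1).
fn estimated_degree(n1: f64, t1: f64, n2: f64, t2: f64) -> f64 {
    (t2 / t1).ln() / (n2 / n1).ln()
}

fn main() {
    // Made-up example: doubling the input quadruples the time,
    // which points at quadratic behaviour (k close to 2).
    let measurements = [(8.0, 0.5), (16.0, 2.0), (32.0, 8.0)];
    for w in measurements.windows(2) {
        let (n1, t1) = w[0];
        let (n2, t2) = w[1];
        println!("{n1} -> {n2}: degree ~ {:.1}", estimated_degree(n1, t1, n2, t2));
    }
}
```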

@carols10cents
Member

carols10cents commented Dec 16, 2021

good point, I haven't checked her repo.

@est31 Just FYI, I tried to give a summary of what was contained in the repo in my comment where I said:

In the first commit on my experiments branch, integer32llc/rust_issue_87012@a0411b0, I used cargo expand to get the code that the async-trait crate generates and I replaced the code in repro_crate/src/main.rs with the generated code and removed the dependency on async-trait.

Also @wesleywiser pointed me to #89831 that changes some of the evaluate_obligation code. I'm not 100% sure I've built that PR and configured rustup correctly, but if I have, that PR makes my test case worse: evaluate_obligation takes 172ms/42% :(

I will try varying the number of trait implementations, the number of parameters/lifetime constraints, and perhaps the number of async functions that are called in helper_crate to see if I can find any performance patterns.

Thank you!

@carols10cents
Member

carols10cents commented Dec 16, 2021

Ok, I've now written scripts that do 3 experiments:

  1. Varying the number of trait implementations (script, results)
  2. Varying the number of parameters (and each parameter has a lifetime annotation) (script, results)
  3. Varying the number of nested async fn calls (script, results)

They all appear to be linear, BUT the number of nested async fn calls has by far the largest multiplier.

By the time the script gets to 22 nested async functions, the compiler hits the recursion limit when compiling the repro_crate crate, which is interesting because the nested async function definitions are in the helper_crate crate; the overflow happens when repro_crate tries to evaluate whether the call to the entry function meets the trait bounds.

If I add #![recursion_limit = "256"] to repro_crate, I can get up to 23 nested async functions, then helper_crate errors with:

error: overflow representing the type `MaybeUninit<from_generator::GenFuture<[static generator@helper_crate/src/lib.rs:6:30: 8:2]>>`

This overflow error can be reproduced by checking out integer32llc/rust_issue_87012@d2a34c0 and running cargo build.
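
The nesting pattern the third script varies can be sketched like this (a hypothetical three-level reduction; the script generated up to 23 levels): each async fn awaits the next, so each level's future type contains the next level's future type, and an auto-trait check on the outermost future must walk the entire chain.

```rust
// Hypothetical reduction of helper_crate's structure: the future of
// `entry_point` nests the future of `layer1`, which nests the future
// of `layer2`; checking `Send` (or computing layout) on the outermost
// future recurses through every layer, hence the overflow at
// sufficient depth.
async fn layer2() -> u32 {
    1
}

async fn layer1() -> u32 {
    layer2().await + 1
}

pub async fn entry_point() -> u32 {
    layer1().await + 1
}
```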

There are a number of nested async performance issues I've found that I suspect are related, but none of them mention evaluate_obligation; I suspect that's because they aren't trying to check that the nested async calls satisfy trait bounds.

Next, I'm going to try to get the arguments evaluate_obligation is called with, unless anyone has ideas for better directions to take!

@carols10cents
Member

In chatting with @nnethercote, I made a version that reproduces the problem with only one crate, in case compiling a workspace is difficult to do with profiling tools. That version is here: https://github.com/integer32llc/rust_issue_87012/tree/one-crate and in this comment I was using the code at 9740ea40b19f76f35ef7da836712564ff8adffc3.

I started with 69ac533, added #[instrument(level = "debug", skip(self))] to evaluate_obligation, and ran ./x.py build && ./x.py install.

Then, in rust_issue_87012, I set up a rustup toolchain with rustup toolchain link stage2 build/<host-triple>/stage2, set a rustup override to my stage2 toolchain, and ran:

RUSTC_LOG=[evaluate_obligation]=debug rustc --crate-name repro_crate --edition=2018 src/main.rs -Zself-profile -Zself-profile-events=default,args --crate-type bin --emit=dep-info,link -C embed-bitcode=no -C split-debuginfo=unpacked -C debuginfo=2 -C metadata=8bb05b458def09dc -C extra-filename=-8bb05b458def09dc --out-dir /Users/carolnichols/rust/rust_issue_87012/target/debug/deps 2>traits-log

and there are some particularly wordy and deep groups of log messages, one of which starts with:

┐rustc_trait_selection::traits::query::evaluate_obligation::evaluate_obligation obligation=Obligation(predicate=Binder(TraitPredicate(<std::future::from_generator::GenFuture<[static gener
ator@src/main.rs:586:28: 589:2]> as std::marker::Send>, polarity:Positive), []), depth=9)
├─0ms DEBUG rustc_query_system::query::plumbing ty::query::get_query<evaluate_obligation>(key=Canonical { max_universe: U0, variables: [CanonicalVarInfo { kind: Region(U0) }, CanonicalVar
Info { kind: Region(U0) }], value: ParamEnvAnd { param_env: ParamEnv { caller_bounds: [Binder(OutlivesPredicate(ReLateBound(DebruijnIndex(1), BoundRegion { var: 0, kind: BrAnon(0) }), ReL
ateBound(DebruijnIndex(1), BoundRegion { var: 1, kind: BrAnon(1) })), []), Binder(OutlivesPredicate(S14, ReLateBound(DebruijnIndex(1), BoundRegion { var: 1, kind: BrAnon(1) })), [])], rev
eal: UserFacing, constness: NotConst }, value: Binder(TraitPredicate(<std::future::from_generator::GenFuture<[static generator@src/main.rs:586:28: 589:2]> as std::marker::Send>, polarity:
Positive), []) } }, span=src/main.rs:291:9: 296:11 (#0))
├─0ms DEBUG rustc_traits::evaluate_obligation evaluate_obligation(canonical_goal=Canonical {
│     max_universe: U0,
│     variables: [
│         CanonicalVarInfo {
│             kind: Region(
│                 U0,
│             ),
│         },
│         CanonicalVarInfo {
│             kind: Region(
│                 U0,
│             ),
│         },
│     ],
│     value: ParamEnvAnd {
│         param_env: ParamEnv {
│             caller_bounds: [
│                 Binder(OutlivesPredicate(ReLateBound(DebruijnIndex(1), BoundRegion { var: 0, kind: BrAnon(0) }), ReLateBound(DebruijnIndex(1), BoundRegion { var: 1, kind: BrAnon(1) })),
 []),
│                 Binder(OutlivesPredicate(S14, ReLateBound(DebruijnIndex(1), BoundRegion { var: 1, kind: BrAnon(1) })), []),
│             ],
│             reveal: UserFacing,
│             constness: NotConst,
│         },
│         value: Binder(TraitPredicate(<std::future::from_generator::GenFuture<[static generator@src/main.rs:586:28: 589:2]> as std::marker::Send>, polarity:Positive), []),
│     },
│ })
├─0ms DEBUG rustc_infer::infer::region_constraints created new region variable '_#0r in U0 with origin MiscVariable(no-location (#0))
├─0ms DEBUG rustc_infer::infer::region_constraints created new region variable '_#1r in U0 with origin MiscVariable(no-location (#0))
├─0ms DEBUG rustc_traits::evaluate_obligation evaluate_obligation: goal=ParamEnvAnd {
│     param_env: ParamEnv {
│         caller_bounds: [
│             Binder(OutlivesPredicate('_#0r, '_#1r), []),
│             Binder(OutlivesPredicate(S14, '_#1r), []),
│         ],
│         reveal: UserFacing,
│         constness: NotConst,
│     },
│     value: Binder(TraitPredicate(<std::future::from_generator::GenFuture<[static generator@src/main.rs:586:28: 589:2]> as std::marker::Send>, polarity:Positive), []),
│ }

and continues such that the depth of the vertical pipes wraps around my terminal width... here's ~1800 lines that's incomplete but gives an idea: https://gist.github.com/carols10cents/58a4fcf4ab4e1483ad5b4e4f5c01b0f8

There are messages in there that look troubling to me, like "CACHE MISS" and "rollback", but I don't actually know if that's expected or not.

If there are ways to filter these logs or interesting bits in these logs that would be helpful, please let me know and I'm happy to get that info.

@Aaron1011
Member

Can you see if #92044 improves performance for you?

@carols10cents
Member

@Aaron1011 It does fix it!!!! I commented on the PR, thank you so much!!

nnethercote added a commit to nnethercote/rust that referenced this issue Apr 5, 2022
By passing in a `&mut Vec` in various places and appending to it, rather
than building many tiny `Vec`'s and then combining them.

This removes about 20% of the allocations that occurred in a `check` build of
a test program from rust-lang#87012, making it roughly 2% faster.