Rust cache invalidated between native and Wasm builds #971

Closed
tiziano88 opened this issue May 11, 2020 · 35 comments

@tiziano88
Collaborator

To reproduce:

  • cargo build --release --target=wasm32-unknown-unknown --package=aggregator
    • Finished release [optimized] target(s) in 43.39s (slow)
  • cargo build --release --target=wasm32-unknown-unknown --package=aggregator
    • Finished release [optimized] target(s) in 0.14s (fast)
  • cargo build --release --package=aggregator_backend
    • Finished release [optimized] target(s) in 1m 15s (slow)
  • cargo build --release --package=aggregator_backend
    • Finished release [optimized] target(s) in 0.15s (fast)
  • cargo build --release --target=wasm32-unknown-unknown --package=aggregator
    • Finished release [optimized] target(s) in 43.40s (expected: fast; actual: slow again)

I am guessing that switching compilation target causes some dependencies to be rebuilt (looks like libc may be a potential culprit).
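One way to test this guess is to count how many crates Cargo reports as rebuilt after each switch of target. The helper below is only a sketch: the name count_compiled is hypothetical and not part of the Oak scripts. It relies on Cargo printing one "Compiling <crate>" status line per crate it rebuilds, and on those status lines going to stderr.

```shell
# Hypothetical helper, not part of the Oak scripts: count the crates that a
# cargo invocation reports as (re)compiled. Reads cargo's status output on stdin.
count_compiled() {
  # grep -c prints the number of matching lines; it exits non-zero when the
  # count is 0, so "|| true" keeps the helper usable under "set -e".
  grep -c '^ *Compiling ' || true
}

# Usage (cargo writes its status lines to stderr, hence 2>&1):
#   cargo build --release --target=wasm32-unknown-unknown --package=aggregator 2>&1 | count_compiled
#   cargo build --release --package=aggregator_backend 2>&1 | count_compiled
```

A count of 0 on a repeated build means the cache was reused. A non-zero count after switching targets shows how many crates were rebuilt, and dropping the count in favour of the raw lines shows which ones.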

@rbehjati could you look into it? It will become more and more relevant as we switch to the Rust version of main for the oak_loader, as we will interleave compiling native and Wasm code even more often then.

@tiziano88
Collaborator Author

Perhaps we should have two separate Cargo workspaces, one for x86 and one for wasm?

cc @project-oak/core

@tiziano88
Collaborator Author

We could have the following top-level directories:

  • runtime (Cargo workspace, compiled to x86) -- basically the current oak top-level directory, after removing the remaining C++ code and renaming
  • sdk (Cargo workspace, compiled to Wasm) -- we could consider dropping the rust subfolder at some point, but we can keep it for the time being
  • abi (stand-alone crate, no need for workspace) -- moved from oak_abi which is currently within server
  • examples (Cargo workspace, compiled to Wasm, except extra binaries such as the aggregator back-end, which will be a stand-alone crate not nested in the examples workspace)

Thoughts?

@daviddrysdale
Contributor

Couple of cross-refs:

@tiziano88
Collaborator Author

@rbehjati could you try to pull out the abi crate to be top level in a PR to start with? Then we can split the top-level Cargo workspace file into smaller parts, and eventually do any other moves / renames.

@tiziano88
Collaborator Author

I am not sure things are in a consistent state after #1034; is oak_abi part of the workspace or not now? It seems I am not able to build or do anything with it directly:

cargo build
error: current package believes it's in a workspace when it's not:
current:   /home/tzn/src/oak/oak_abi/Cargo.toml
workspace: /home/tzn/src/oak/Cargo.toml

this may be fixable by adding `oak_abi` to the `workspace.members` array of the manifest located at: /home/tzn/src/oak/Cargo.toml
Alternatively, to keep it out of the workspace, add the package to the `workspace.exclude` array, or add an empty `[workspace]` table to the package's manifest.

@tiziano88
Collaborator Author

This may also be the cause of #1037

@tiziano88
Collaborator Author

FYI, I just realised that cargo build has a --target-dir flag that may be sufficient to fix this issue in the short term: we can just use different dirs for Wasm and native code.
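As a rough sketch of the idea, a small helper could derive a distinct target directory from the compilation target. The helper name and the directory names are assumptions made for illustration and are not what the Oak scripts use.

```shell
# Hypothetical helper: pick a distinct --target-dir per compilation target,
# so native and Wasm artifacts never share a cache directory.
# $1 = base directory, $2 = target triple (empty or omitted for a native build).
target_dir_for() {
  if [ -z "${2:-}" ]; then
    printf '%s/native\n' "$1"
  else
    printf '%s/%s\n' "$1" "$2"
  fi
}

# Usage (directory names are assumptions, not what the Oak scripts use):
#   cargo build --release --target-dir="$(target_dir_for target)" --package=aggregator_backend
#   cargo build --release --target=wasm32-unknown-unknown \
#     --target-dir="$(target_dir_for target wasm32-unknown-unknown)" --package=aggregator
```

The trade-off is that nothing is shared between the two directories, so dependencies common to both builds are compiled twice and take up disk space twice.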

tiziano88 added a commit to tiziano88/oak that referenced this issue May 28, 2020
It should help with project-oak#971, until a more proper solution is in place.
@blaxill
Contributor

blaxill commented May 28, 2020

@tiziano88 25c1d06#diff-35e128d19b2a49f8257e7a6ed82e3f44
There was an issue causing temporary files to accumulate when doing this previously, although it might have been due to using a fresh temp directory each time (and not manually erasing it), rather than using a consistent --target-dir

@tiziano88
Collaborator Author

Good catch @blaxill, thanks! I think it is solved by reusing the same directory, hopefully.

tiziano88 added a commit that referenced this issue May 28, 2020
It should help with #971, until a more proper solution is in place.

In my experiments, this brings the time to run `./scripts/run_examples` without having made any changes from 440s down to 39s.
@tiziano88
Collaborator Author

Now that examples is a separate workspace (#1045), can we remove the separate target dir (#1044), and still benefit from separate caches?

@rbehjati
Contributor

rbehjati commented Jun 5, 2020

I am not sure if --target-dir=examples/target has been of much help. But specifying a --target prevents invalidation of the cache. No target-dir is specified in the following commands, all targets are generated in examples/target, and cache is not invalidated:

$ cargo build --release --target=x86_64-unknown-linux-musl --manifest-path="examples/aggregator/backend/Cargo.toml"
    Finished release [optimized] target(s) in 0.15s
$ cargo build --release --target=wasm32-unknown-unknown --manifest-path="examples/translator/module/rust/Cargo.toml"
   Compiling same-file v1.0.6
   Compiling maplit v1.0.2
   Compiling cfg-if v0.1.10
   Compiling bytes v0.5.4
   Compiling byteorder v1.3.4
   Compiling serde v1.0.111
   Compiling fmt v0.1.0
   Compiling getrandom v0.1.14
   Compiling log v0.4.8
   Compiling serde_derive v1.0.111
   Compiling walkdir v2.3.1
   Compiling rand_core v0.5.1
   Compiling oak_utils v0.1.0 (/opt/my-project/oak_utils)
   Compiling prost v0.6.1
   Compiling prost-types v0.6.1
   Compiling oak_abi v0.1.0 (/opt/my-project/oak_abi)
   Compiling oak v0.1.0 (/opt/my-project/sdk/rust/oak)
   Compiling translator_common v0.1.0 (/opt/my-project/examples/translator/common)
   Compiling translator v0.1.0 (/opt/my-project/examples/translator/module/rust)
    Finished release [optimized] target(s) in 17.64s
$ cargo build --release --target=wasm32-unknown-unknown --manifest-path="examples/translator/module/rust/Cargo.toml"
    Finished release [optimized] target(s) in 0.13s
$ cargo build --release --target=x86_64-unknown-linux-musl --manifest-path="examples/aggregator/backend/Cargo.toml"
    Finished release [optimized] target(s) in 0.15s

If instead of cargo build --release --target=x86_64-unknown-linux-musl --manifest-path="examples/aggregator/backend/Cargo.toml" I use cargo build --release --manifest-path="examples/aggregator/backend/Cargo.toml" (which does not specify a target), then the cache will be invalidated after each alternating command.

--target-dir can still help if we use different dirs (e.g., examples/target/aggregator and examples/target/backend):

$ cargo build --release --target-dir=examples/target/aggregator --target=wasm32-unknown-unknown --manifest-path="examples/translator/module/rust/Cargo.toml"
    Finished release [optimized] target(s) in 31.73s
$ cargo build --release --target-dir=examples/target/backend --manifest-path="examples/aggregator/backend/Cargo.toml"
    Finished release [optimized] target(s) in 57.85s
$ cargo build --release --target-dir=examples/target/backend --manifest-path="examples/aggregator/backend/Cargo.toml"
    Finished release [optimized] target(s) in 0.16s
$ cargo build --release --target-dir=examples/target/aggregator --target=wasm32-unknown-unknown --manifest-path="examples/translator/module/rust/Cargo.toml"
    Finished release [optimized] target(s) in 0.13s

So, most of the changes in #1044 are still relevant. We could remove --target-dir=examples/target, but that would not really make a difference.

[edit: I noticed that I've included results from experimenting with translator here, but aggregator behaves similarly.]

@tiziano88
Collaborator Author

I think if you undo your changes, then #1044 definitely makes a difference. Now that your changes are in, as you say, it does not seem to make a difference any more, since target dirs are already separated by workspace anyways. Hence my point that we can now remove that flag. But I haven't actually tried this myself.

@rbehjati
Contributor

rbehjati commented Jun 5, 2020

My hypothesis is that #1044 would not have made a huge difference if you had not added --target=x86_64-unknown-linux-musl (but I have not tried all combinations either 😄). Right now, we have to keep that flag, but we can remove --target-dir=examples/target. If that is what you mean. Does it make sense to keep --target=x86_64-unknown-linux-musl for aggregator backend?

@tiziano88
Collaborator Author

Could you try to remove --target-dir but keep --target and see if the cache gets invalidated? I think it would be good to know. Is your hypothesis that the aggregator backend was the only thing that was invalidating the cache then?

@rbehjati
Contributor

rbehjati commented Jun 5, 2020

The cache does not seem to get invalidated in that case.

--target-dir kept, --target kept (currently in master):

$ rm -rf target && rm -rf examples/target
$ time ./scripts/run_examples
real    4m22.916s
user    31m14.892s
sys     0m42.810s
$ time ./scripts/run_examples
real    0m43.430s
user    0m7.713s
sys     0m3.313s

--target-dir removed, --target kept (timing is similar to the previous case):

$ rm -rf target && rm -rf examples/target
$ time ./scripts/run_examples
real    4m15.375s
user    31m21.495s
sys     0m42.732s
$ time ./scripts/run_examples
real    0m43.149s
user    0m7.744s
sys     0m3.265s

--target-dir kept, --target removed (much slower):

$ rm -rf target && rm -rf examples/target
$ time ./scripts/run_examples
real    4m33.951s
user    33m22.486s
sys     0m44.127s
$ time ./scripts/run_examples
real    2m51.976s
user    16m34.688s
sys     0m22.346s

--target-dir removed, --target removed:

$ rm -rf target && rm -rf examples/target
$ time ./scripts/run_examples
real    4m25.595s
user    33m28.209s
sys     0m44.055s
$ time ./scripts/run_examples
real    2m52.324s
user    16m36.129s
sys     0m22.476s

I noticed that the change you made to aggregator backend improved things, but I was not timing the commands before. So I could not really measure the improvement... all I have is just a hunch!
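To compare these runs numerically rather than by eye, the "real" durations can be converted to seconds and expressed as a cold-to-warm ratio. This is only a sketch; the helper names to_seconds and speedup are hypothetical, and it assumes the XmY.YYYs format shown in the timings above.

```shell
# Convert a `time`-style duration such as "4m22.916s" into seconds.
to_seconds() {
  printf '%s\n' "$1" | awk -F'm' '{ sub(/s$/, "", $2); printf "%.3f\n", $1 * 60 + $2 }'
}

# Ratio of a cold-run duration to a warm-run duration (both in seconds).
speedup() {
  awk -v cold="$1" -v warm="$2" 'BEGIN { printf "%.1f\n", cold / warm }'
}

# Usage with the first case above (--target-dir kept, --target kept):
#   speedup "$(to_seconds 4m22.916s)" "$(to_seconds 0m43.430s)"
```

A ratio well above 1 indicates that the second run reused the cache. The two cases with --target removed yield a much smaller ratio, which matches the "much slower" observation.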

@tiziano88
Collaborator Author

If the tests were done in master, after your change that separated the examples workspace already, then I don't think this experiment is particularly conclusive though, is it?

@rbehjati
Contributor

rbehjati commented Jun 5, 2020

Yes. The tests were done in master. What do we want to reach a conclusion about? Whether to remove --target-dir, or how cargo build and its flags work?
I think the tests give enough evidence that --target-dir can be removed. I included the last two cases with target removed just for info, not as evidence for removing --target-dir.

@tiziano88
Collaborator Author

I guess I'm still confused about how this interacted with --target=x86_64-unknown-linux-musl, as you mentioned in #971 (comment).

@rbehjati
Contributor

rbehjati commented Jun 5, 2020

Yeah. Me too. Here is more data that may or may not help with the confusion. Please don't ask me to do a full factorial experiment!

Before #1044 (on commit 0b133c0: no --target-dir, and no --target for backend):

$ rm -rf target && rm -rf examples/target
$ time ./scripts/run_examples
real    5m30.142s
user    44m1.707s
sys     0m58.771s
$ time ./scripts/run_examples
real    3m41.917s
user    27m42.284s
sys     0m35.379s

Still on commit 0b133c0, but after adding --target for backend:

$ rm -rf target && rm -rf examples/target
$ time ./scripts/run_examples
real    3m32.346s
user    22m22.393s
sys     0m34.561s
$ time ./scripts/run_examples
real    0m39.415s
user    0m6.175s
sys     0m3.168s

On #1044 (both --target-dir, and --target are present):

$ rm -rf target && rm -rf examples/target
$ time ./scripts/run_examples
real    3m52.818s
user    25m40.070s
sys     0m38.064s
$ time ./scripts/run_examples
real    0m39.501s
user    0m6.185s
sys     0m3.084s

Still on #1044, but --target for backend is removed:

$ rm -rf target && rm -rf examples/target
$ time ./scripts/run_examples
real    5m40.568s
user    47m30.893s
sys     1m3.099s
$ time ./scripts/run_examples
real    2m53.625s
user    23m40.563s
sys     0m29.746s

@rbehjati
Contributor

rbehjati commented Jun 5, 2020

Getting back to your earlier question... the aggregator backend seems to have a significant impact on invalidating the cache, if not being the only thing. Do you agree?

@tiziano88
Collaborator Author

Does it mean that compiling the aggregator to x86 and then wasm is what is causing the issue?

@tiziano88
Collaborator Author

Apart from the cache issue, I think we should at least split out the crates so that the runtime / loader have a dedicated Cargo.lock file, which is the list of dependencies that are actually part of the TCB. Specifically, I think this would mean trimming down this list so that it only contains oak_loader and oak_runtime:

oak/Cargo.toml

Lines 1 to 10 in a2f74a4

[workspace]
members = [
"oak/server/rust/oak_loader",
"oak/server/rust/oak_runtime",
"runner",
"sdk/rust/oak",
"sdk/rust/oak_tests",
"third_party/roughenough",
]
exclude = ["oak_abi", "oak_utils"]

@rbehjati does this make sense?

@rbehjati
Contributor

rbehjati commented Jun 8, 2020

I agree. I'll make SDK and runner separate crates. I suppose we don't want to make third_party a separate workspace. Do we?

After this we can perform a thorough analysis to understand what is contributing to the invalidation of the cache.

@tiziano88
Collaborator Author

Note that this is still an issue. Try running the following command twice in a row: ./scripts/run_example -e trusted_information_retrieval. This is probably because of the backend, which is x86 but invalidates the wasm cache, since its target-dir is set to examples/target.

cc @ipetr0v

@tiziano88
Collaborator Author

@rbehjati if possible, let's try to build an understanding of what's happening, rather than just trying to get the numbers down.

My theory (not tested) is still that some dependency has a feature flag that causes it to be compiled differently in wasm vs x86, and / or has optional dependencies that cause part of the cache to go missing in one case.

@rbehjati
Contributor

Your theory seems consistent with my observation that specifying the target helps. From what I can see the change in #1179 solves the problem with trusted_information_retrieval (at least temporarily). Clearly fewer/no crates are recompiled when specifying the appropriate --target.

Apparently, trusted_information_retrieval cannot be compiled for --target=x86_64-unknown-linux-musl (which we previously had), but it can be compiled for --target=x86_64-unknown-linux-gnu (I don't know what the difference between these two targets is).

Getting back to your theory, is it expected that everything should be compiled the same for wasm and x86? In other words, do we have a requirement for excluding anything that compiles differently for different architectures?

@tiziano88
Collaborator Author

Thanks for putting together #1179, AFAICT it is doing two things at once:

  • remove --target-dir flag
  • add --target flag

Which one is actually having the desired effect? Or is it really the combination of --target and no --target-dir that solves the issue?

@tiziano88
Collaborator Author

For reference, I think this is pretty much what I thought was happening: https://stackoverflow.com/questions/60869985/why-is-cargo-build-cache-invalidating

Though it does not really explain why #1179 actually makes things work correctly 😅

@rbehjati
Contributor

Which one is actually having the desired effect? Or is it really the combination of --target and no --target-dir that solves the issue?

I have not seen --target-dir=./examples/target have any impact (at least after separating the workspaces).

Though it does not really explain why #1179 actually makes things work correctly 😅

I think when --target is specified for each architecture, the files that are specific to that architecture go into some examples/target/<arch-name>/release directory, as opposed to being included in examples/target/release. So, in this case, examples/target/release only contains the files that are compiled the same for different architectures. However, I could not find an explanation confirming this in any of the Cargo documentation I looked at. This is just my conclusion based on observations from some experiments reported below.
I have included the file structures for each case:

  1. Using --target-dir=./examples/target and without specifying a target for backend. In our current setup, this would be the same as not specifying a --target-dir. In this case many crates are recompiled when running ./scripts/run_example -e trusted_information_retrieval for the second time.
398M	examples/target/release
 26M	examples/target/wasm32-unknown-unknown
---------------------------
424M	examples/target/

246M	oak/server/target/release
214M	oak/server/target/x86_64-unknown-linux-musl
  2. Using --target=x86_64-unknown-linux-gnu for the backend. In this case none of the crates are recompiled when running ./scripts/run_example -e trusted_information_retrieval twice in a row.
227M	examples/target/release
 26M	examples/target/wasm32-unknown-unknown
183M	examples/target/x86_64-unknown-linux-gnu
---------------------------
436M	examples/target/

246M	oak/server/target/release
214M	oak/server/target/x86_64-unknown-linux-musl

The difference between this case and the previous case is that (almost) everything that is now in examples/target/x86_64-unknown-linux-gnu was in examples/target/release in the previous case. I don't know how exactly cargo works, but it seems that by specifying --target for the backend, we are forcing backend-specific files, which compile differently based on the target architecture, to go into a separate directory.

  3. Using --target-dir="examples/target/${EXAMPLE}/wasm" for the Oak module, and --target-dir="examples/target/${EXAMPLE}/backend" for the backend, without specifying a target for the backend. None of the crates are recompiled in this case.
104M    examples/target/trusted_information_retrieval/wasm/release
26M     examples/target/trusted_information_retrieval/wasm/wasm32-unknown-unknown

381M    examples/target/trusted_information_retrieval/backend/release
---------------------------
511M	examples/target/

246M	oak/server/target/release
214M	oak/server/target/x86_64-unknown-linux-musl

In this case nothing is shared between the backend and the wasm module. It is the same as running cargo build for the Oak module and for the backend separately, each in a clean environment (similar to case 4 below).

  4. If I run only cargo build --release --target=x86_64-unknown-linux-gnu --manifest-path="examples/trusted_information_retrieval/backend/Cargo.toml", I get the following:
210M    examples/target/release
183M    examples/target/x86_64-unknown-linux-gnu

@tiziano88
Collaborator Author

So is --target=x86_64-unknown-linux-gnu the same or different than no --target flag, on an x86-linux machine?

@rbehjati
Contributor

It is the same.

The following command (source) gives the default target (which is used when --target is not specified):

rustc -Z unstable-options --print target-spec-json | grep llvm-target

On my linux machine, and our docker image, the output from this command is

"llvm-target": "x86_64-unknown-linux-gnu",

This is also the target specified in scripts/run_tests_tsan and our .cargo file.
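The command above needs a nightly toolchain because of -Z unstable-options. As a possible alternative on stable, the verbose version output of rustc (rustc -vV) includes a "host: <triple>" line. The sketch below extracts it; the helper name is hypothetical, and it reads the rustc output on stdin so that it can be tried without a toolchain installed.

```shell
# Extract the host triple from the output of `rustc -vV` (read on stdin).
# That output contains a line of the form "host: <triple>".
host_triple_from_rustc_output() {
  sed -n 's/^host: //p'
}

# Usage:
#   host="$(rustc -vV | host_triple_from_rustc_output)"
#   cargo build --release --target="$host" --manifest-path=examples/aggregator/backend/Cargo.toml
```

Deriving the triple this way avoids hard-coding x86_64-unknown-linux-gnu in scripts that may also run on other hosts.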

@tiziano88
Collaborator Author

It is the same.

In that case, does it mean that we actually don't need to specify --target in #1179 ? Just removing --target-dir would have been enough?

@rbehjati
Contributor

I don't think so. If we don't specify --target all the target files go to the same directory, and the cache will be invalidated again. We are using --target=XYZ to put the target files in different dirs.
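The directory split described here can be written down as a small sketch. As far as I understand the Cargo build cache layout, artifacts for the requested target go under <target-dir>/<triple>/<profile> when --target is passed, and under <target-dir>/<profile> otherwise. Build scripts and proc macros for the host appear to stay in <target-dir>/<profile> either way, which would match the directory sizes reported earlier. The function name is hypothetical.

```shell
# Sketch of Cargo's output layout: where the artifacts for the requested
# target end up. $1 = target dir, $2 = profile dir ("debug" or "release"),
# $3 = value passed to --target (empty or omitted if the flag is absent).
artifact_dir() {
  if [ -z "${3:-}" ]; then
    printf '%s/%s\n' "$1" "$2"
  else
    printf '%s/%s/%s\n' "$1" "$3" "$2"
  fi
}

# Usage:
#   artifact_dir examples/target release
#   artifact_dir examples/target release x86_64-unknown-linux-gnu
#   artifact_dir examples/target release wasm32-unknown-unknown
```

With an explicit --target on every build, the backend and the Wasm module resolve to different directories. Without it, the backend artifacts land in the same examples/target/release directory that holds the host-side build scripts for the Wasm build.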

@rbehjati
Contributor

rbehjati commented Jul 7, 2020

I have been digging deeper into this and here are some of my findings.

The following is the list of packages that are recompiled when running cargo build --release --target=wasm32-unknown-unknown --manifest-path=examples/aggregator/module/rust/Cargo.toml -Z unstable-options after running cargo build --release --manifest-path=examples/aggregator/backend/Cargo.toml.

# in examples/target/release/build

anyhow-f3bb683c8c9d193b
getrandom-3c98b8e535bc4756/
indexmap-aaacfd4ae6512dee/
libc-079e747c65dddbcb/
log-85d8f5fb872ec954/
proc-macro2-06ed86eb6c32d03e/
prost-build-bae3641e35f69cf3/
syn-f0a8bab71fd7020d/

After this, running cargo build --release --manifest-path=examples/aggregator/backend/Cargo.toml again results in rebuilding the same packages.

Each of these folders has the following content:

build-script-build
build_script_build-f0a8bab71fd7020d
build_script_build-f0a8bab71fd7020d.d

The binary files build-script-build and build_script_build-f0a8bab71fd7020d are rewritten when switching between the cargo build ... commands.

Corresponding to each of the folders, there is another folder inside examples/target/release/build with the same crate name, but a different fingerprint (e.g., anyhow-3ef89ee82156c6e6). All these folders have the following content, and are not rewritten when switching between cargo build ... commands:

invoked.timestamp  
out/
output  
root-output  
stderr

These seem to be the output from some build.rs script. When specifying --target in a cargo build (or cargo run) command, the subdirectories in examples/target/<TARGET>/release/build are all of the second form (i.e., are the output from a Rust build script).
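To tell the two kinds of folder apart without opening each one, a helper along these lines could be used. It is a sketch based only on the file names listed above; the function name and the labels it prints are hypothetical.

```shell
# Hypothetical helper: classify one sub-directory of <target-dir>/release/build
# using the file names listed above.
classify_build_dir() {
  if [ -e "$1/build-script-build" ]; then
    # Holds the compiled build script binary (rewritten when switching builds).
    echo "compiled-build-script"
  elif [ -e "$1/output" ] || [ -d "$1/out" ]; then
    # Holds what the build script produced when it was run.
    echo "build-script-output"
  else
    echo "unknown"
  fi
}

# Usage:
#   for d in examples/target/release/build/*/; do
#     printf '%s\t%s\n' "$(classify_build_dir "$d")" "$d"
#   done
```

Running the loop before and after switching between the two cargo build commands, and comparing modification times of the "compiled-build-script" folders, should show which build scripts get rewritten.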

I am still not entirely sure how cargo decides whether to rebuild a package or not; however, the following note from the Cargo book might be relevant.

When not using --target, this has a consequence that Cargo will share your dependencies with build scripts and proc macros. RUSTFLAGS will be shared with every rustc invocation. With the --target flag, build scripts and proc macros are built separately (for the host architecture), and do not share RUSTFLAGS.

More specifically it is advised that:

If you have args that you do not want to pass to build scripts or proc macros and are building for the host, pass --target with the host triple.

We set RUSTFLAGS in some of our scripts, but I am not sure if they are causing the problem. For now, the best solution seems to be to keep specifying --target with the host triple.

@rbehjati
Contributor

rbehjati commented Jul 9, 2020

For now we are happy with the solution using --target, so I am closing this. I have shared a more detailed report of my investigation with the team.
