Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Spin apps fail to run - compiled with incompatible wasmtime version #79

Closed
vdice opened this issue Apr 12, 2024 · 16 comments
Closed

Spin apps fail to run - compiled with incompatible wasmtime version #79

vdice opened this issue Apr 12, 2024 · 16 comments

Comments

@vdice
Copy link
Contributor

vdice commented Apr 12, 2024

I built the shim from main as of writing and I'm seeing the following when I attempt to run seemingly any Spin app.

Steps I followed:

  1. Build the shim via eg cargo build --release; move it to where containerd expects it
  2. Run containerd via eg sudo containerd -c /path/to/containerd/config.toml
    a. Note: containerd version containerd github.com/containerd/containerd v1.7.14 dcf2847247e18caba8dce86522029642f60fe96b
  3. Pull Spin app (published with Spin v2.4.2 via spin registry push) via ctr eg sudo ctr image pull docker.io/vdice/static-fileserver:latest
  4. Run Spin app via eg sudo ctr run --rm --runtime io.containerd.spin.v2 docker.io/vdice/static-fileserver:latest static bogus

See the following in the containerd/runwasi/shim logs:

time="2024-04-12T02:19:00.310100534Z" level=info msg="starting instance: static"

time="2024-04-12T02:19:00.310223623Z" level=info msg="calling start function"

time="2024-04-12T02:19:00.310247832Z" level=info msg="setting up wasi"

time="2024-04-12T02:19:00.310726562Z" level=info msg=" >>> configuring spin oci application 4"

time="2024-04-12T02:19:00.310737395Z" level=info msg="<<< writing wasm artifact with length 5882984 config to cache, near "/.cache/registry/manifests""

time="2024-04-12T02:19:00.315409687Z" level=info msg="writing spin oci config to "/spin.json""

time="2024-04-12T02:19:00.346827672Z" level=error msg="run_wasi ERROR >>>  failed: failed to build spin trigger

Caused by:
    0: Failed to instantiate component 'static-fileserver'
    1: deserializing module "/.cache/registry/wasm/sha256:5f05b15f0f7cd353d390bc5ebffec7fe25c6a6d7a05b9366c86dcb1a346e9f0f"
    2: Module was compiled with incompatible Wasmtime version '19.0.1'"

time="2024-04-12T02:19:00.347279817Z" level=info msg="error running start function: failed to build spin trigger"

time="2024-04-12T02:19:00.350726929Z" level=info msg="deleting instance: static"

time="2024-04-12T02:19:00.350799974Z" level=info msg="cgroup manager V2 will be used"

INFO[2024-04-11T20:19:00.351348290-06:00] shim disconnected                             id=static namespace=default
WARN[2024-04-11T20:19:00.351442211-06:00] cleaning up after shim disconnected           id=static namespace=default
INFO[2024-04-11T20:19:00.351447753-06:00] cleaning up dead shim                         namespace=default
time="2024-04-12T02:19:00.351227785Z" level=info msg="Shutting down shim instance"

I noticed the shim's wasmtime version was recently bumped to 19.0.1 in #71 -- unintended consequence? Perhaps there are other deps needing to be bumped as well for compatibility?

@devigned
Copy link
Contributor

I think #71 might be the culprit. @Mossaka, would you expect this behavior?

@devigned
Copy link
Contributor

My guess here is that we are pre-compiling in the shim with 19.0.1, but Spin is expecting some other version of Wasmtime and is throwing the error. @vdice does this seem reasonable?

@vdice
Copy link
Contributor Author

vdice commented Apr 12, 2024

@devigned That indeed makes sense. I was thrown off by run_wasi ERROR thinking it was somewhere further up but looking at it again, I agree, that must be the Spin runtime throwing the error. I see Spin is still on 18.0.4.

@radu-matei
Copy link
Member

Yeah, I think we need to revert the Wasmtime bump for now.

@devigned
Copy link
Contributor

#80

@vdice
Copy link
Contributor Author

vdice commented Apr 12, 2024

Thanks @devigned! Meanwhile, curious as to what we can add in CI to catch this? Did it slip by the tests due to the tested Spin Apps still being wrapped in Docker images, hence not hitting the precompilation flow?

@devigned
Copy link
Contributor

Thanks @devigned! Meanwhile, curious as to what we can add in CI to catch this? Did it slip by the tests due to the tested Spin Apps still being wrapped in Docker images, hence not hitting the precompilation flow?

That is exactly where my mind went too. Having a test that catches this scenario should be the resolution of this issue.

@Mossaka
Copy link
Member

Mossaka commented Apr 12, 2024

Having a test that catches this scenario should be the resolution of this issue.

Yup, definitely a test needed to be added. I also think the dependabot should disable wasmtime bump since it's a core dependency of Spin.

@vdice
Copy link
Contributor Author

vdice commented Apr 12, 2024

(An update to mention that Spin apps now run on the shim built from main after the revert, as expected. Thanks again!)

@devigned
Copy link
Contributor

(An update to mention that Spin apps now run on the shim built from main after the revert, as expected. Thanks again!)

Thank you for creating the issue!

@kingdonb
Copy link

kingdonb commented Apr 21, 2024

Thank you for laying this out in detail.

I believe I hit this issue but I have not installed containerd-shim-spin manually, I used the Talos extension for spin. It seems to work with all the example apps, except for with my app pushed using spin registry push (spin 2.4.2) which quits with no log immediately after starting; I noticed a lot of the examples have used docker build with a wasi target, and the OCIs they emit have different metadata and structure just ever so subtly. I pulled down to a cache directory and compared the manifests and noticed some differences in config.json but I have not got direct console access to run ctr directly, so am not sure if I hit the same issue as you or not. Not sure if any of that is relevant to the crashloopbackoff, (not sure how to pack up the static fs server and my content without spin registry push, but I imagine it's a pretty simple Dockerfile...)

I didn't find anything particularly informative in any of the other logs I checked, to help me narrow down the issue. So I was really glad to read your report, because this seems likely to be the same issue I'm having. Except the only indication of what is wrong exposed through spin operator seems to be an exit code 137 on the pod that is in crashloopbackoff. Should I file a separate report, or do you think this might be the same issue? (I have honestly no idea how to compile containerd-shim-spin from source and make my own extension for Talos Linux, but I guess I'm gonna find out soon...)

@radu-matei
Copy link
Member

Thanks for the additional details, @kingdonb!

For context, the issue is coming from the ahead of time compilation feature that was added to the shim in #32.
This is an optimization that ensures a Wasm component is only compiled once on any given node, so scaling up is significantly faster compared to when having to compile (especially when a pod only is allowed little CPU resources).

The issue we're seeing here seems to come out of a mismatch between the compiler output and Wasm engine — and I'm really confused by it, since I assumed we were doing a compatibility check somewhere (see

.precompile_compatibility_hash()
).
But perhaps this is a larger issue with the caching feature that we need to revisit.

@kingdonb
Copy link

I saw #84 that might also be the same issue, where it was suggested to disable precompilation to get better logs. No idea if that's a userland knob or something that must be configured at compile time. But that issue might also be more precisely my issue, I am also pushing ~200MB in my packed bundle, (this is a hugo site that has been built and a static fs server wasm.)

My bigger issue is that whatever caused my failure doesn't leave any trace in the K8s userland, I have given up compiling extensions for now as it sounds like the main issue has been resolved and I'll see it in the next release... thank you!

@devigned
Copy link
Contributor

I believe we should consider adding containerd config options for the shim to enable feature code paths. I don't believe Rust compilation flags would be desirable in cases like this since the user is likely to not have compiled the shim in the first place. Adding containerd config options for the shim would allow for on node configuration and debugging. It could also provide a path for turning on and off newly release / unstable features w/o recompilation.

@jsturtevant
Copy link
Contributor

jsturtevant commented Apr 22, 2024

The issue we're seeing here seems to come out of a mismatch between the compiler output and Wasm engine — and I'm really confused by it, since I assumed we were doing a compatibility check somewhere (see

.precompile_compatibility_hash()

).
But perhaps this is a larger issue with the caching feature that we need to revisit.

This seems it might be related to the fact that the Spin compiled on version 0.19 but we are using linker and other functions from 0.17.0: https://github.com/containerd/runwasi/blob/5309512c6dac6d03df9e67190f560689790c04f7/crates/containerd-shim-wasmtime/src/instance.rs#L153-L166

The pre-compile and the checks we perform are all handled by the shim its self, but then when we set everything up we are linking with a different version.

Monday morning 🤦, that is a different shim all together so there must be something else going on.

I believe we should consider adding containerd config options for the shim to enable feature code paths. I don't believe Rust compilation flags would be desirable in cases like this since the user is likely to not have compiled the shim in the first place. Adding containerd config options for the shim would allow for on node configuration and debugging. It could also provide a path for turning on and off newly release / unstable features w/o recompilation.

+1

@vdice
Copy link
Contributor Author

vdice commented May 15, 2024

Are all the follow-ups/tangential issues filed for this one? I'd be fine with closing now that we have #90 in place as a safeguard for catching this in the future, in tandem with #99

@vdice vdice closed this as completed Jul 11, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants