Daemon errors with (HTTP code 404) -- no such container: sandbox #261

Open · cywang117 opened this issue Jul 12, 2021 · 17 comments

@cywang117

cywang117 commented Jul 12, 2021

NOTE for users and support agents arriving here in the future: since it's not clear how to reproduce this issue, please gather more information about the conditions on the device. Some good starting questions and things to check:

  • Did this error appear after a release update?
  • Are deltas enabled?
  • Does the release build use intermediate containers? (If unsure, the Dockerfile(s) of the services will tell you.)
  • Any other questions you think might be relevant.

If the user is okay with it, asking them to leave the device in this invalid state for engineers to investigate would also help.

Description

The balenaEngine daemon errors with (HTTP code 404) -- no such container: sandbox. However, there is no sandbox container on the device. The error is surfaced by the device Supervisor in the journal logs as:

Device state apply error Error: Failed to apply state transition steps. (HTTP code 404) no such container - sandbox 915c9f1f78712e9db8bb1edf3d94fd669a917c608270f4c95e3a8c72de142b15 not found Steps:["updateMetadata"]

Per https://github.com/balena-io/balena-io/issues/1684, this might be due to bad internal state for one of the containers on the device. The issue is fixed by restarting balenaEngine with systemctl restart balena, OR by running systemctl stop balena-supervisor && balena stop $(balena ps -a -q) && balena rm $(balena ps -a -q) && systemctl start balena-supervisor; however, neither is ideal, as the containers experience a few minutes of downtime.

It's unclear how to reproduce this issue.
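
For context on what actually fails: per the Supervisor commit referenced further down in this thread, the "updateMetadata" step renames the container to match the target release. Below is a minimal sketch of that call, assuming dockerode and an illustrative socket path and helper name; it is not the actual Supervisor code.

import Docker from 'dockerode';

// Assumed engine socket path on balenaOS; adjust as needed for your device.
const docker = new Docker({ socketPath: '/var/run/balena-engine.sock' });

// Illustrative helper: rename a service container to its target-release name.
// This is the call that surfaces "(HTTP code 404) no such container - sandbox ...".
async function renameForTargetRelease(containerId: string, targetName: string) {
  const container = docker.getContainer(containerId);
  await container.rename({ name: targetName });
}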

Additional information you deem important (e.g. issue happens only occasionally):

The issue happens when a new update is downloaded by the device. It has sometimes appeared in combination with #1579, making the cause unclear.

Additional environment details (device type, OS, etc.):

Device Type: Raspberry Pi 4 64bit, 2GB RAM
OS: balenaOS 2.80.3+rev1.prod

@jellyfish-bot

[cywang117] This issue has attached support thread https://jel.ly.fish/72633746-3415-449a-9617-e123cba1e954

@jellyfish-bot

[cywang117] This issue has attached support thread https://jel.ly.fish/e7428359-c335-4d00-81db-dfb4293d1423

@cywang117
Author

The fact that stopping the Supervisor, removing the containers, and starting the Supervisor fixes the issue seems to indicate that this is a Supervisor issue and not a balenaEngine issue. I'll move this to the Supervisor repo.

@cywang117 cywang117 transferred this issue from balena-os/balena-engine Jul 12, 2021
@cywang117
Author

cywang117 commented Jul 12, 2021

So it seems that just restarting the Supervisor, without removing containers, does not fix this issue; restarting balenaEngine does. It's now unclear whether this is Supervisor-related or balenaEngine-related. I'm leaning towards balenaEngine holding bad state for one of the containers on the device, since a Supervisor restart didn't change anything.

@cywang117 cywang117 transferred this issue from balena-os/balena-supervisor Jul 12, 2021
@jellyfish-bot

[cywang117] This issue has attached support thread https://jel.ly.fish/661c8c96-8357-4bfc-9380-308a65fff910

@jellyfish-bot

[danthegoodman1] This issue has attached support thread https://jel.ly.fish/a4f6be4b-50dc-454d-9c5c-dbcf168119db

@cywang117
Author

@lmbarros @robertgzr Drawing your attention to some edits I made to this GitHub issue:

NOTE for users and support agents arriving here in the future: since it's not clear how to reproduce this issue, please gather more information about the conditions on the device. Some good starting questions and things to check:

  • Did this error appear after a release update?
  • Are deltas enabled?
  • Does the release build use intermediate containers? (If unsure, the Dockerfile(s) of the services will tell you.)
  • Any other questions you think might be relevant.

If the user is okay with it, asking them to leave the device in this invalid state for engineers to investigate would also help.

Are there any other questions you think would be useful for investigating the causes of this issue? Could this kind of problem be unavoidable given current implementation limitations in upstream dependencies (Moby)?

@jellyfish-bot

[pipex] This issue has attached support thread https://jel.ly.fish/dc8d2638-ebb4-4ba8-8ae6-edae48602850

@jellyfish-bot

[pipex] This issue has attached support thread https://jel.ly.fish/e82fe388-3955-4252-97c4-6c837151cce2

@jellyfish-bot

[pipex] This issue has attached support thread https://jel.ly.fish/b7fa70df-ad99-4deb-8f6a-2b78d2f47a44

@pipex

pipex commented Dec 9, 2021

Some extra information for this ticket: this has been reported to happen more often with containers that don't get updated as frequently as others. So a container that has been renamed a few times, while other containers have been recreated, may sometimes get into this state.

For instance, on one affected device, the failing container's eth0 has a very low interface index (15, paired with host-side index 16):

root@4cd008d3ffa1:/opt# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
15: eth0@if16: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default 
    link/ether 02:42:ac:11:00:05 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 172.17.0.5/16 brd 172.17.255.255 scope global eth0
       valid_lft forever preferred_lft forever

Meanwhile, the other veth interfaces on the host have much higher indices, confirming that this is an old network attachment.

root@c73b31f:~# ip a | grep veth
1291: veth6f7ff99@if1290: <BROADCAST,MULTICAST,UP,LOWER_UP,M-DOWN> mtu 1500 qdisc noqueue master br-2f18a4b13b86 
16: veth367da35@if15: <BROADCAST,MULTICAST,UP,LOWER_UP,M-DOWN> mtu 1500 qdisc noqueue master br-2f18a4b13b86 
1380: veth72261a3@if1379: <BROADCAST,MULTICAST,UP,LOWER_UP,M-DOWN> mtu 1500 qdisc noqueue master br-2f18a4b13b86 
1180: vethe52f1a4@if1179: <BROADCAST,MULTICAST,UP,LOWER_UP,M-DOWN> mtu 1500 qdisc noqueue master br-2f18a4b13b86

Could this issue be an unintended side effect of some cleanup process?
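
As a rough diagnostic (a sketch using dockerode; the socket path and output format are illustrative), one could list every container together with its creation time and network sandbox ID, so that long-lived, frequently renamed containers that might be carrying stale sandbox references stand out:

import Docker from 'dockerode';

const docker = new Docker({ socketPath: '/var/run/balena-engine.sock' });

// Print each container's name, creation time, and network sandbox ID/key.
async function listSandboxes(): Promise<void> {
  const containers = await docker.listContainers({ all: true });
  for (const summary of containers) {
    const info = await docker.getContainer(summary.Id).inspect();
    console.log(
      info.Name,
      'created:', info.Created,
      'sandbox:', info.NetworkSettings.SandboxID,
      'key:', info.NetworkSettings.SandboxKey,
    );
  }
}

listSandboxes().catch(console.error);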

@jellyfish-bot

[gantonayde] This issue has attached support thread https://jel.ly.fish/1b57a2f7-e2b2-4658-94ef-0a35bef04f4b

@jellyfish-bot

[pipex] This issue has attached support thread https://jel.ly.fish/bf30fa84-cc92-4cf8-aefd-4c2f14c4a944

@jellyfish-bot

[nitish] This issue has attached support thread https://jel.ly.fish/9f4bc524-e6d5-4480-98a5-4d2cefba84f3

@vipulgupta2048
Member

vipulgupta2048 commented Jun 2, 2022

Did this error appear after a release update? Yep
Are deltas enabled? Yes
Does the release build use intermediate containers? Indeed, 2 stages

This happened on a new device with just the second release I pushed to it, running a minimal server application (200 MB image, two-stage build). The error is below:

Jun 02 20:25:35 a01a838 balena-supervisor[2376]: [info]    Applying target state
Jun 02 20:25:36 a01a838 balena-supervisor[2376]: [error]   Scheduling another update attempt in 1000ms due to failure:  Error: Failed to appl>
Jun 02 20:25:36 a01a838 balena-supervisor[2376]: [error]         at fn (/usr/src/app/dist/app.js:6:8690)
Jun 02 20:25:36 a01a838 balena-supervisor[2376]: [error]   Device state apply error Error: Failed to apply state transition steps. (HTTP code>
Jun 02 20:25:36 a01a838 balena-supervisor[2376]: [error]         at fn (/usr/src/app/dist/app.js:6:8690)
Jun 02 20:25:37 a01a838 balena-supervisor[2376]: [info]    Applying target state
Jun 02 20:25:38 a01a838 balena-supervisor[2376]: [error]   Scheduling another update attempt in 2000ms due to failure:  Error: Failed to appl>
Jun 02 20:25:38 a01a838 balena-supervisor[2376]: [error]         at fn (/usr/src/app/dist/app.js:6:8690)
Jun 02 20:25:38 a01a838 balena-supervisor[2376]: [error]   Device state apply error Error: Failed to apply state transition steps. (HTTP code>
Jun 02 20:25:38 a01a838 balena-supervisor[2376]: [error]         at fn (/usr/src/app/dist/app.js:6:8690)
Jun 02 20:25:40 a01a838 balena-supervisor[2376]: [info]    Applying target state

Attaching diagnostics file: a01a83846e174aa51dc2b33fbf0a17e7_diagnostics_2022.06.02_20.56.19+0000.txt

Adding the output of balena info and balena version:

root@a01a838:~# balena info
Client:
 Context:    default
 Debug Mode: false

Server:
 Containers: 2
  Running: 2
  Paused: 0
  Stopped: 0
 Images: 3
 Server Version: 20.10.12
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Native Overlay Diff: true
  userxattr: false
 Logging Driver: journald
 Cgroup Driver: systemd
 Cgroup Version: 1
 Plugins:
  Volume: local
  Network: bridge host null
  Log: journald json-file local
 Swarm: 
  NodeID: 
  Is Manager: false
  Node Address: 
 Runtimes: io.containerd.runc.v2 io.containerd.runtime.v1.linux runc
 Default Runtime: runc
 Init Binary: balena-engine-init
 containerd version: 
 runc version: 
 init version: 949e6fa-dirty (expected: de40ad007797e)
 Kernel Version: 5.10.83-v8
 Operating System: balenaOS 2.94.4
 OSType: linux
 Architecture: aarch64
 CPUs: 4
 Total Memory: 960MiB
 Name: a01a838
 ID: V47H:PCFQ:GMDT:PV3S:OW2J:FRXS:MRZ7:V737:5HEQ:BFCP:GBUS:SJOJ
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: true
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false

WARNING: No blkio throttle.read_bps_device support
WARNING: No blkio throttle.write_bps_device support
WARNING: No blkio throttle.read_iops_device support
WARNING: No blkio throttle.write_iops_device support
root@a01a838:~# balena version
Client:
 Version:           20.10.12
 API version:       1.41
 Go version:        go1.16.2
 Git commit:        73c78258302d94f9652da995af6f65a621fac918
 Built:             Wed Mar  2 10:28:01 2022
 OS/Arch:           linux/arm64
 Context:           default
 Experimental:      true

Server:
 Engine:
  Version:          20.10.12
  API version:      1.41 (minimum version 1.12)
  Go version:       go1.16.2
  Git commit:       73c78258302d94f9652da995af6f65a621fac918
  Built:            Wed Mar  2 10:28:01 2022
  OS/Arch:          linux/arm64
  Experimental:     true
 containerd:
  Version:          1.4.0+unknown
  GitCommit:        
 runc:
  Version:          spec: 1.0.2-dev
  GitCommit:        
 balena-engine-init:
  Version:          0.13.0
  GitCommit:        949e6fa-dirty

FD: https://www.flowdock.com/app/rulemotion/r-supervisor/threads/FQqETXXQaGFg1oLyWz7ccNbPgAx

@jellyfish-bot

[lmbarros] This has attached https://jel.ly.fish/88b86997-9411-40b9-ae2f-8f3505febb93

@jellyfish-bot

[pipex] This has attached https://jel.ly.fish/c09369f0-c870-4f93-9133-0ec8b995fda9

pipex added a commit to balena-os/balena-supervisor that referenced this issue Nov 15, 2023
The `updateMetadata` step renames the container to match the target
release when the service doesn't change between releases. We have seen
this step fail because of an engine bug that seems to relate to the
engine keeping stale references after container restarts. The only way
around this issue is to remove the old container and create it again.
This implements that workaround during the updateMetadata step to deal
with that issue.

Change-type: minor
Relates-to: balena-os/balena-engine#261
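
For illustration, here is a minimal sketch of the fallback described in that commit message (using dockerode; the 404 check and the recreate step are simplified and illustrative, not the actual Supervisor implementation):

import Docker from 'dockerode';

const docker = new Docker({ socketPath: '/var/run/balena-engine.sock' });

// Try the rename; on the spurious 404, remove the container and recreate it
// under the target name (simplified: a real implementation would carry over
// the full host config, networks, restart policy, etc.).
async function renameOrRecreate(containerId: string, targetName: string) {
  const container = docker.getContainer(containerId);
  try {
    await container.rename({ name: targetName });
  } catch (e: any) {
    if (e.statusCode !== 404) {
      throw e;
    }
    const info = await container.inspect();
    await container.remove({ force: true });
    const fresh = await docker.createContainer({
      name: targetName,
      Image: info.Config.Image,
      Cmd: info.Config.Cmd,
      Env: info.Config.Env,
      Labels: info.Config.Labels,
      HostConfig: info.HostConfig,
    });
    await fresh.start();
  }
}
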
pipex added a commit to balena-os/balena-supervisor that referenced this issue Nov 22, 2023
pipex added a commit to balena-os/balena-supervisor that referenced this issue Nov 22, 2023