This repository has been archived by the owner on Oct 13, 2023. It is now read-only.

[18.09 backport] Delete stale containerd object on start failure #154

thaJeztah
Member

backport of moby#38364 for 18.09
fixes moby#38346 for 18.09

containerd has two objects with regard to containers.
There is a "container" object, which is metadata, and a "task", which
manages the actual runtime state.

When docker starts a container, it creates both the container metadata
and the task at the same time. So when a container exits, docker deletes
both of these objects as well.

This change ensures that if, on start, creating the container metadata
object in containerd fails due to a name conflict, we clean up the
stale object and try again.

Signed-off-by: Brian Goff <cpuguy83@gmail.com>
(cherry picked from commit 5ba30cd)
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>

@thaJeztah thaJeztah added this to the 18.09.3 milestone Feb 15, 2019
@thaJeztah
Member Author

ping @tonistiigi @cpuguy83 PTAL

@cpuguy83 cpuguy83 left a comment

LGTM

@thaJeztah
Member Author

thaJeztah commented Feb 18, 2019

Interesting: both Power and s390x are failing with the following (https://jenkins.dockerproject.org/job/Docker-PRs-powerpc/13358/console and https://jenkins.dockerproject.org/job/Docker-PRs-s390x/13246/console):

01:54:08 FAIL: docker_cli_swarm_test.go:340: DockerSwarmSuite.TestSwarmContainerEndpointOptions
01:54:08 
01:54:08 [d4244336afd63] waiting for daemon to start
01:54:08 [d4244336afd63] daemon started
01:54:08 
01:54:08 docker_cli_swarm_test.go:348:
01:54:08     c.Assert(err, checker.IsNil, check.Commentf("%s", out))
01:54:08 ... value *exec.ExitError = &exec.ExitError{ProcessState:(*os.ProcessState)(0xc4242ca480), Stderr:[]uint8(nil)} ("exit status 125")
01:54:08 ... jwpgahfrribmvdlkks63m881o
01:54:08 
01:54:08 
01:54:08 [d4244336afd63] exiting daemon

@thaJeztah
Member Author

From the daemon logs of that test:

time="2019-02-15T01:54:08.013683320Z" level=debug msg="DisableService ingress-sbox START"
time="2019-02-15T01:54:08.013772013Z" level=debug msg="DisableService ingress-sbox DONE"
time="2019-02-15T01:54:08.214509584Z" level=debug msg="Revoking external connectivity on endpoint gateway_ingress-sbox (84ee9ee8134f41eea984de47ec1580f8af63258aedec2d636e6cbfca2978a9d7)"
time="2019-02-15T01:54:08.215592405Z" level=debug msg="DeleteConntrackEntries purged ipv4:0, ipv6:0"
time="2019-02-15T01:54:08.321575898Z" level=debug msg="Releasing addresses for endpoint gateway_ingress-sbox's interface on network docker_gwbridge"
time="2019-02-15T01:54:08.321648833Z" level=debug msg="ReleaseAddress(LocalDefault/172.26.0.0/16, 172.26.0.2)"
time="2019-02-15T01:54:08.321714887Z" level=debug msg="Released address PoolID:LocalDefault/172.26.0.0/16, Address:172.26.0.2 Sequence:App: ipam/default/data, ID: LocalDefault/172.26.0.0/16, DBIndex: 0x0, Bits: 65536, Unselected: 65532, Sequence: (0xe0000000, 1)->(0x0, 2046)->(0x1, 1)->end Curr:3"
time="2019-02-15T01:54:08.322653261Z" level=warning msg="Error (Unable to complete atomic operation, key modified) deleting object [endpoint yy1f8qf6tq4un5i0dauxfjdox 5e3f493590e12eb724bbd4da6947326074018ab87c7d239a4a8c192d11c5b472], retrying...."
time="2019-02-15T01:54:08.332116764Z" level=debug msg="Releasing addresses for endpoint ingress-endpoint's interface on network ingress"
time="2019-02-15T01:54:08.332146193Z" level=debug msg="ReleaseAddress(LocalDefault/10.255.0.0/16, 10.255.0.2)"
time="2019-02-15T01:54:08.332208481Z" level=debug msg="Released address PoolID:LocalDefault/10.255.0.0/16, Address:10.255.0.2 Sequence:App: ipam/default/data, ID: LocalDefault/10.255.0.0/16, DBIndex: 0x0, Bits: 65536, Unselected: 65532, Sequence: (0xe0000000, 1)->(0x0, 2046)->(0x1, 1)->end Curr:0"
time="2019-02-15T01:54:08.345266665Z" level=debug msg="releasing IPv4 pools from network ingress (yy1f8qf6tq4un5i0dauxfjdox)"
time="2019-02-15T01:54:08.345293981Z" level=debug msg="ReleaseAddress(LocalDefault/10.255.0.0/16, 10.255.0.1)"
time="2019-02-15T01:54:08.345330922Z" level=debug msg="Released address PoolID:LocalDefault/10.255.0.0/16, Address:10.255.0.1 Sequence:App: ipam/default/data, ID: LocalDefault/10.255.0.0/16, DBIndex: 0x0, Bits: 65536, Unselected: 65533, Sequence: (0xc0000000, 1)->(0x0, 2046)->(0x1, 1)->end Curr:0"
time="2019-02-15T01:54:08.345367527Z" level=debug msg="ReleasePool(LocalDefault/10.255.0.0/16)"
time="2019-02-15T01:54:08.345611258Z" level=debug msg="cleanupServiceDiscovery for network:yy1f8qf6tq4un5i0dauxfjdox"
time="2019-02-15T01:54:08.350383843Z" level=debug msg="Unix socket /run/docker/libnetwork/74ea7309c6268e8e8671cef226d36e5534a22f3bb42428b98d0103fac1508edc.sock doesn't exist. cannot accept client connections"
time="2019-02-15T01:54:08.350490938Z" level=debug msg="Cleaning up old mountid : start."
time="2019-02-15T01:54:08.350896947Z" level=info msg="stopping event stream following graceful shutdown" error="<nil>" module=libcontainerd namespace=moby
time="2019-02-15T01:54:08.355076867Z" level=debug msg="Cleaning up old mountid : done."
time="2019-02-15T01:54:08.359546495Z" level=debug msg="Clean shutdown succeeded"

docker.log

@thaJeztah
Member Author

I think the remaining failures are flaky tests

@andrewhsu andrewhsu merged commit ba8664c into docker-archive:18.09 Feb 22, 2019
@thaJeztah thaJeztah deleted the 18.09_backport_fix_stale_container_on_start branch February 22, 2019 22:07
algitbot pushed a commit to alpinelinux/aports that referenced this pull request Mar 12, 2019
https://github.com/docker/docker-ce/releases/tag/v18.09.3

The more important fixes in this version:
* When copying existing folder, ignore xattr set errors when the target filesystem doesn't support xattr. docker-archive/engine#135
* Graphdriver: fix device mode not being detected if character-device bit is set. docker-archive/engine#160
* Fix nil pointer dereference on failure to connect to containerd. docker-archive/engine#162
* Delete stale containerd object on start failure. docker-archive/engine#154
liske pushed a commit to liske/aports that referenced this pull request Apr 7, 2019
https://github.com/docker/docker-ce/releases/tag/v18.09.3

The more important fixes in this version:
* When copying existing folder, ignore xattr set errors when the target filesystem doesn't support xattr. docker-archive/engine#135
* Graphdriver: fix device mode not being detected if character-device bit is set. docker-archive/engine#160
* Fix nil pointer dereference on failure to connect to containerd. docker-archive/engine#162
* Delete stale containerd object on start failure. docker-archive/engine#154
seiferteric pushed a commit to project-arlo/sonic-buildimage that referenced this pull request Oct 14, 2019
…ll dockers are down except database

This is an issue in the docker engine, which has been resolved in PR#154:
docker-archive/engine#154

This commit updates docker to 19.03.0 to pick up the fix.

Signed-off-by: Dante (Kuo-Jung) Su <dante.su@broadcom.com>
Change-Id: Iecfb7b312abfbcc7741cdcd8b506f9d6c19c4eef
@alexanderadam

alexanderadam commented Mar 2, 2020

Should this bug be fixed in 19.03.6?
Could it be the cause for Ansible issue 64492?

@cpuguy83

cpuguy83 commented Mar 2, 2020

Based on the error message, it seems like there is still a task running, which this PR does not handle. That sounds like a very different problem.

@cpuguy83

cpuguy83 commented Mar 2, 2020

Could you open a new issue with the details? Any clues on how the system got into that state?

@alexanderadam

alexanderadam commented Mar 3, 2020

Done. Not really. I believe that it was caused by installing some updates (including containerd.io, docker-ce and docker-ce-cli) this time.
