
garden running in bpm does not find/delete existing containers on startup #120

Closed
sunjayBhatia opened this issue Jan 16, 2019 · 4 comments

@sunjayBhatia
Contributor

Description

When running garden in bpm, any orphaned containers are not cleaned up on restarts of the garden job. See reproduction steps below.

This affected us in our CI environment: the orphaned containers left state behind on the host VM in the form of container network interfaces and network namespace files. Subsequent container creations then failed during network configuration because of this polluted state, periodically causing Diego cells in CF to become unhealthy.


Environment

  • garden-runc-release version: 1.17.2
  • IaaS: bosh-lite + GCP
  • Stemcell version: N/A
  • Kernel version: N/A

Steps to reproduce

  • Deploy CF on bosh-lite with one Diego cell and BPM/rootless enabled
  • Push an app
  • Run bosh stop diego-cell (or monit stop garden from the cell)
  • SSH onto the cell and see there is a garden-init process still running from the stranded application instance
  • Run bosh start diego-cell (or monit start garden from the cell)
  • SSH onto the cell and eventually see there are now two garden-init processes, the original stranded application (now pid 1) and another for the rescheduled app
  • Look at the garden depot directory and see there is only one container in garden's record
  • Look through the garden logs and see that there was no clean-up-container log line corresponding to the stranded container
  • Look at output of ifconfig and see there are two container network interfaces

Note: repeated stops of garden seem to orphan only one extra garden-init process at a time, but the stranded container network interfaces show that garden failed to fully delete many containers

Cause

  • Garden running in BPM does not keep its container depot state across restarts of the job, so it loses track of containers that were running when the job exited

Resolution

  • Garden uses a volume mounted on the host to store container state so it can remember it between restarts?
  • Other options pending
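One shape the volume-mount option could take is BPM's `additional_volumes`, which lets a job mount a writable host directory into its container so the depot would survive restarts. This is a sketch only: the job name and paths follow garden-runc-release conventions and may not match the job's real bpm.yml.

```yaml
# bpm.yml sketch for the garden job (illustrative, not the actual spec)
processes:
- name: garden
  executable: /var/vcap/packages/guardian/bin/gdn
  additional_volumes:
  - path: /var/vcap/data/garden/depot  # container depot persists across restarts
    writable: true
```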
@cf-gitbot

We have created an issue in Pivotal Tracker to manage this:

https://www.pivotaltracker.com/story/show/163269392

The labels on this github issue will be updated when the story is started.

@BooleanCat
Contributor

We observe that when garden is stopped via monit stop garden, all of Garden's containers also stop. We appear to leak image dirs: creating containers with the same name after restarting garden fails because the image plugin complains about the leftover dirs.

We also observe that the depot contains no trace of our container.

We can see from BPM code and logs that BPM attempts the following on gdn:

  1. send sigterm
  2. send sigquit
  3. runc delete -f the bpm container
    3.1 runc sends kill --all to the container

Outside of bpm in CF, garden is configured to destroy containers on start up. When garden dies, the containers continue to function as normal, and garden kills them the next time it starts up.

Within BPM, because of the above, garden containers die when the garden bpm container dies.

Within a pid namespace, it is documented in pid_namespaces(7) on man7 that when the init process (pid 1) is killed, the kernel sends SIGKILL to all other processes in the namespace. Experimentally, we've observed that this also applies to all nested child pid namespaces.

The conclusion is that, if we wish to continue to use BPM, then it is expected behaviour for containers to die when garden dies.

TODOs

  1. Consider always enabling destroy_on_startup in rootless
  2. Persist the depot in bpm mode
  3. raise issue in BPM to document this "gotcha"
  4. PR man7 to document the behaviour we observed (around pid namespace ancestors dying)

@Callisto13
Contributor

This appears fixed since 1.18.1, are we okay to close?

@Callisto13
Contributor

@BooleanCat ^^

@cf-gitbot cf-gitbot removed the accepted label Apr 8, 2019
tas-runtime-bot pushed a commit that referenced this issue Dec 20, 2023
tas-runtime-bot added a commit that referenced this issue Oct 15, 2024