
garden running in bpm does not find/delete existing containers on startup #120

Closed
sunjayBhatia opened this issue Jan 16, 2019 · 4 comments

@sunjayBhatia
Contributor

Description

When running garden in bpm, any orphaned containers are not cleaned up on restarts of the garden job. See reproduction steps below.

This affected us in our CI environment: the orphaned containers left state behind on the host VM in the form of container network interfaces and network namespace files. Subsequent container creations then failed during network configuration because of this polluted state, periodically causing Diego cells in CF to become unhealthy.


Environment

  • garden-runc-release version: 1.17.2
  • IaaS: bosh-lite + GCP
  • Stemcell version: N/A
  • Kernel version: N/A

Steps to reproduce

  • Deploy CF on bosh-lite with one Diego cell and BPM/rootless enabled
  • Push an app
  • Run bosh stop diego-cell (or monit stop garden from the cell)
  • SSH onto the cell and see there is a garden-init process still running from the stranded application instance
  • Run bosh start diego-cell (or monit start garden from the cell)
  • SSH onto the cell and eventually see there are now two garden-init processes, the original stranded application (now pid 1) and another for the rescheduled app
  • Look at the garden depot directory and see there is only one container in garden's record
  • Look through the garden logs and see that there was no clean-up-container log line corresponding to the stranded container
  • Look at output of ifconfig and see there are two container network interfaces

Note: repeated stops of garden seem to orphan only one extra garden-init process at a time, but the stranded container network interfaces show that garden failed to fully delete many containers

Cause

  • Garden running in BPM does not keep its container depot state across restarts of the job, so it loses track of containers that were running when the job exited

Resolution

  • Garden uses a volume mounted on the host to store container state so it can remember it between restarts?
  • Other options pending
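One shape the volume-mount option could take is BPM's `additional_volumes`, which lets a job mount a writable host directory into its container so the depot would survive restarts. This is a sketch only: the job name and paths follow garden-runc-release conventions and may not match the job's real bpm.yml.

```yaml
# bpm.yml sketch for the garden job (illustrative, not the actual spec)
processes:
- name: garden
  executable: /var/vcap/packages/guardian/bin/gdn
  additional_volumes:
  - path: /var/vcap/data/garden/depot  # container depot persists across restarts
    writable: true
```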
@cf-gitbot

We have created an issue in Pivotal Tracker to manage this:

https://www.pivotaltracker.com/story/show/163269392

The labels on this github issue will be updated when the story is started.

@BooleanCat
Contributor

We observe that when garden is stopped via monit stop garden, all of Garden's containers also stop. We appear to leak image dirs: creating containers with the same name after restarting garden fails because the image plugin complains about the leftover dirs.

We also observe that the depot contains no trace of our container.

We can see from BPM code and logs that BPM attempts the following on gdn:

  1. send sigterm
  2. send sigquit
  3. runc delete -f the bpm container
    3.1 runc sends kill --all to the container

Outside of bpm in CF, garden is configured to destroy containers on start up. When garden dies, the containers continue to function as normal, and garden kills them the next time it starts up.

Within BPM, because of the above, garden containers die when the garden bpm container dies.

Within a pid namespace, it is documented in pid_namespaces(7) on man7 that when the init process (pid 1) is killed, the kernel sends SIGKILL to all other processes in the namespace. Experimentally, we've observed that this also applies to all nested child pid namespaces.

The conclusion is that, if we wish to continue to use BPM, then it is expected behaviour for containers to die when garden dies.

TODOs

  1. Consider always enabling destroy_on_startup in rootless
  2. Persist the depot in bpm mode
  3. raise issue in BPM to document this "gotcha"
  4. PR man7 to document the behaviour we observed (around pid namespace ancestors dying)

@Callisto13
Contributor

This appears fixed since 1.18.1, are we okay to close?

@Callisto13
Contributor

@BooleanCat ^^

@cf-gitbot cf-gitbot removed the accepted label Apr 8, 2019
tas-runtime-bot pushed a commit that referenced this issue Dec 20, 2023
tas-runtime-bot added a commit that referenced this issue Oct 15, 2024