Don't panic!
This document provides tips on troubleshooting Garden in the unlikely event that it does not work as expected.
Your best friend in troubleshooting Garden is the dontpanic report. dontpanic is an in-house Garden tool that gathers various diagnostic information and packs it into an archive. The reporter is expected to run the tool and share the report, alongside a reasonable description of the problem, with the Garden developers for further investigation.
https://github.com/cloudfoundry/dontpanic
Lots of useful stuff, mostly the output of various commands executed on the diego cell. Refer to the dontpanic README for more information, or just look at the code to see the commands it runs.
dontpanic ships as a package within the garden-runc-release, which means that the binary is available on every diego cell running Garden 1.17.1+. In the unlikely event that you are troubleshooting an earlier Garden version, you can download the binary from the dontpanic release page.
NOTE: The latest dontpanic release may be outdated, so consider releasing a new version. There is no CI pipeline for that; you would have to do it manually.
- The reporter logs onto the diego cell (via ssh); see how to ssh onto a diego cell below.
- The reporter switches to root by running sudo su -
- The reporter runs /var/vcap/packages/dontpanic/bin/dontpanic and relaxes until dontpanic is done. Once it is done, it prints the location of the produced report (it is produced in the /var/vcap/data/tmp directory).
- The reporter shares the report with the Garden team. The reporter might want to copy the report to their machine via bosh scp in order to make sharing easier.
- The Garden team looks at the report, figures out what is wrong and saves the day.
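The reporter's part of the steps above can be sketched as a small shell helper. This is only a sketch: run_dontpanic and the DONTPANIC_BIN override are hypothetical names introduced here, while the default path is the one given above. It assumes you are already root on the diego cell.

```shell
# Hypothetical helper, not part of dontpanic itself.
# Default location of the binary as documented above; override for testing.
DONTPANIC_BIN="${DONTPANIC_BIN:-/var/vcap/packages/dontpanic/bin/dontpanic}"

run_dontpanic() {
  # The package only ships with Garden 1.17.1+, so fail loudly if it is missing.
  if [ ! -x "$DONTPANIC_BIN" ]; then
    echo "dontpanic not found at $DONTPANIC_BIN" >&2
    return 1
  fi
  # Pass any extra arguments straight through to dontpanic.
  "$DONTPANIC_BIN" "$@"
}
```

Calling run_dontpanic with no arguments produces the report; dontpanic itself prints where the archive was written.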
Sometimes it is useful to see what Garden is up to at the moment the dontpanic report is taken. If so, the reporter can supply the --sigquit flag to the dontpanic binary. This sends SIGQUIT to the Guardian server (gdn) process, which makes gdn dump its goroutine stacks into the Garden error log.
WARNING The --sigquit flag terminates the Guardian server!
Note1 dontpanic collects Garden logs, monit logs, kernel logs and various data from the /proc filesystem. We do not expect this data to contain any secrets (such as passwords or certificates), but the reporter is advised to double-check that.
Note2 The report archive might be big, so sharing it might involve some sort of shared cloud storage.
Once you get a dontpanic report you should have all the troubleshooting details you need. If that is not the case, consider enhancing dontpanic.
Here is a sample algorithm you can follow:
- Usually you would start with monit summary to figure out whether all jobs are running.
- If Garden is not running, have a look at its logs (in the garden directory in the report). If the gdn process is crashing, its error logs should contain clues. Also consider looking at the containerd logs.
- Have a look at the configuration files (the config directory). Do the options there make sense?
- Are there errors in the Garden logs?
- Look at the process tree:
  - Is the gdn process (the Guardian server) alive? If not, look into the Garden error logs for clues.
  - Are there any dadoo or containerd-shim processes that have no children? Such processes could indicate that a container exited without the shim noticing.
- Look at the running containers (garden-containers.log). Are there too many of them?
- Look at disk usage (df.log). Could the cell be running out of disk space?
- Depending on the issue you are looking at, look at the lsof, iptables, meminfo, etc. information.
- Once you have targeted the deployment with bosh, you need to figure out the VM instance ID of the diego cell you want to connect to. To list all the deployment's VMs, just run bosh -d <dep name> vms.
- Pick the diego cell VM you want to connect to and run bosh -d <dep name> ssh <diego cell VM ID>, for example bosh -d cf ssh diego-cell/d3e1b55a-078d-4d9d-9c0a-76306e891dea. If working on a deployment with a single cell, bosh -d <dep name> ssh diego-cell would also do the trick.
- Once connected to the diego cell, it is useful to become root via sudo su -
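The lookup-then-ssh steps above can be sketched as follows. The helper names diego_cells and ssh_first_cell are hypothetical, and the sketch assumes that the instance group is called diego-cell (as in the example above) and that the instance name is the first column of the bosh vms output.

```shell
# diego_cells (hypothetical helper): filters `bosh -d <dep name> vms` output
# on stdin down to the instance names of diego cell VMs.
diego_cells() {
  awk '$1 ~ /^diego-cell\// { print $1 }'
}

# ssh_first_cell (hypothetical helper): ssh onto the first diego cell listed
# for the given deployment.
ssh_first_cell() {
  dep="$1"
  cell=$(bosh -d "$dep" vms | diego_cells | head -n 1)
  if [ -z "$cell" ]; then
    echo "no diego cell found in deployment $dep" >&2
    return 1
  fi
  bosh -d "$dep" ssh "$cell"
}
```

For example, ssh_first_cell cf would connect to the first diego cell of the cf deployment.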
Provided that you have ssh-ed onto a diego cell and run dontpanic, and it told you where the report archive is, run bosh scp on your local machine, for example:
bosh -d cf scp diego-cell/d3e1b55a-078d-4d9d-9c0a-76306e891dea:/var/vcap/data/tmp/os-report-ed0db125-3c54-43b4-8023-78d2ff53a39e-2021-06-01-09-29-20.084414748.tar.gz .
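If you copy reports often, the command above can be wrapped in a small function. This is a sketch only; fetch_report is a hypothetical name, and its arguments are the deployment name, the diego cell instance, the remote archive path printed by dontpanic, and an optional local destination directory (default: the current directory).

```shell
# fetch_report (hypothetical helper): copies a dontpanic archive from a diego
# cell to the local machine and prints the local path of the copy.
fetch_report() {
  dep="$1"
  cell="$2"
  remote="$3"
  dest="${4:-.}"
  bosh -d "$dep" scp "$cell:$remote" "$dest" || return 1
  echo "$dest/$(basename "$remote")"
}
```

The printed path can then be handed to tar -xzf to unpack the report locally.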
- Garden logs: /var/vcap/sys/log/garden
- Garden depot: /var/vcap/data/garden/depot
- Garden configuration files: /var/vcap/jobs/garden/config
- Garden binaries: /var/vcap/packages/{dontpanic,garden-idmapper,greenskeeper,guardian,netplugin-shim,thresholder}
- Garden job monit scripts: /var/vcap/jobs/garden/bin
In order to troubleshoot an issue it is best to reproduce it in a minimal local environment. In most cases it is sufficient to create a pure Garden deployment (without all the CF machinery) where you can use the Garden API with proper input to reproduce the problem.
Check out the Creating-sandbox-environments-for-debugging wiki page.
The most convenient way to call Garden is to use the unofficial Garden client, gaol. It provides a command line interface to create/delete containers, run processes, etc. (see its README for usage). In most cases the client should be sufficient as is; if it is not, you can always change it to your liking (e.g. make it create containers with a hardcoded memory limit), build it and use your own version instead.
Running gaol from your local machine is only possible with a local bosh-lite Garden deployment where Garden is configured to listen on an HTTP port. For more realistic deployments it is most convenient to copy the gaol binary onto the diego cell and call it after ssh-ing. The bosh-inject-garden-tools.sh script automates that task.
Alternatively, you may want to create a Ginkgo test that calls the Garden API and reproduces the issue. This approach is great because you can push the test after fixing the issue to ensure that the bug never appears again. Existing GAT tests are a great starting point, as their fixtures set up a ready-to-use Garden client.