kargo-controller creates zombie [git] processes #2926

Closed · 4 tasks done
moro-drake opened this issue Nov 14, 2024 · 11 comments · Fixed by #2959

@moro-drake

Checklist

  • I've searched the issue queue to verify this is not a duplicate bug report.
  • I've included steps to reproduce the bug.
  • I've pasted the output of kargo version.
  • I've pasted logs, if applicable.

Description

We run Kargo in an OpenShift cluster. OpenShift runs it as the following user: runAsUser: 1001910000 (ps output for this user is attached as screenshots). Since the 1.0.3 update, we have noticed that it creates zombie [git] processes. These processes slowly accumulate until the controller becomes unusable (Unix fork() cannot create any more processes).
A Warehouse set to discover new tags using the NewestTag strategy speeds this up (you can see a 'zombie' spawn on each refresh of the Warehouse in the UI). With a 1m interval and a discovery limit of 20, the kargo-controller was dead within half a day (we had about 44 active Warehouses with a NewestTag subscription).

Screenshots

(Three attached screenshots of ps output showing the accumulating defunct [git] processes.)

Steps to Reproduce

  1. Create a Kargo project with a stage and a warehouse subscription to git.
  2. Set the Warehouse subscription spec like this:

spec:
  freightCreationPolicy: Automatic
  interval: 1m0s
  subscriptions:
    - git:
        branch: main
        commitSelectionStrategy: NewestTag
        discoveryLimit: 20

  3. Open a terminal on the node running the kargo-controller and run ps auxf | grep 'defunct' | wc -l
  4. Refresh the Warehouse that subscribes with NewestTag in the UI.
  5. Run ps auxf | grep 'defunct' | wc -l again - the count increments because a new 'zombie' has spawned.

Version

Kargo v1.0.3

Logs

time="2024-11-14T03:06:08Z" level=error msg="Reconciler error" Warehouse="{release-warehouse dex}" controller=warehouse controllerGroup=kargo.akuity.io controllerKind=Warehouse error="error discovering artifacts: error discovering commits: failed to clone git repo \"https://repo.git\": error cloning repo \"https://repo.git\" into \"/tmp/repo-223330204/repo\": error executing cmd [/usr/bin/git clone --no-tags --branch main --single-branch https://repo.git /tmp/repo-223330204/repo]: Cloning into '/tmp/repo-223330204/repo'...\nerror: cannot fork() for remote-https: Resource temporarily unavailable\n" name=release-warehouse namespace=dex reconcileID="\"f2e1b0c6-3ad5-4d40-9965-fdda5c8b1a7d\""
@hiddeco
Contributor

hiddeco commented Nov 18, 2024

This is an interesting issue, as we appear to make use of Exec everywhere we call git, which in turn uses cmd.CombinedOutput(), which makes sure to .Wait() (failing to .Wait() is normally the reason for zombie processes).

Needs a more thorough investigation.
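
For reference, a minimal Go sketch of the pattern described above (not Kargo's actual code): CombinedOutput reaps its child even on failure, whereas a Start() without a matching Wait() is the classic recipe for a zombie:

package main

import (
	"fmt"
	"os/exec"
	"time"
)

func main() {
	// CombinedOutput starts the command and internally calls Wait,
	// so the child is always reaped, even when the command fails.
	out, err := exec.Command("git", "--version").CombinedOutput()
	fmt.Printf("%s(err: %v)\n", out, err)

	// By contrast, Start without a matching Wait never collects the
	// child's exit status: once the child exits, it lingers as
	// <defunct> for as long as this process lives.
	cmd := exec.Command("git", "--version")
	_ = cmd.Start()
	time.Sleep(5 * time.Second) // run `ps` now and look for <defunct>
}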

@hiddeco
Contributor

hiddeco commented Nov 18, 2024

Update: I have thus far been unable to reproduce this on a development setup of main with multiple Warehouses with Git subscriptions in a variety of configurations, including the suggested NewestTag config.

@hiddeco
Contributor

hiddeco commented Nov 18, 2024

I appear to have found one variation of this. Steps:

  1. Create a Warehouse with an SSH Git subscription (to e.g. GitHub.com), but without supplying any credentials for it.

  2. Let this run for some time (or refresh the Warehouse by hand), which causes authentication errors.

  3. Watch the number of open SSH processes grow for each clone:

    $ ps aux
    PID   USER     TIME  COMMAND
     1330 kargo     0:00 [ssh]
     1335 kargo     0:00 [ssh]
     1344 kargo     0:00 [ssh]
     1348 kargo     0:00 [ssh]
     1352 kargo     0:00 [ssh]
     1356 kargo     0:00 [ssh]
     1360 kargo     0:00 [ssh]
     1374 kargo     0:00 [ssh]
     1378 kargo     0:00 [ssh]
     1382 kargo     0:00 [ssh]
     1386 kargo     0:00 [ssh]
     1390 kargo     0:00 [ssh]
     1419 kargo     0:00 [ssh]
     1442 kargo     0:00 [ssh]
     1450 kargo     0:00 [ssh]
     1491 kargo     0:00 [ssh]
     1664 kargo     0:00 [ssh]

@krancour
Member

Similarly, just now, I've been able to do the same with HTTPS:

docker run -d --rm -e GIT_ASKPASS=/usr/local/bin/credential-helper -w /tmp ghcr.io/akuity/kargo:v1.0.3 sleep 1000

# Important that there be no tty; repeat this next step n times
docker exec <container id> git clone https://any/private/repo

docker exec <container id> ps                                      
PID   USER     TIME  COMMAND
    1 nonroot   0:00 sleep 1000
   13 nonroot   0:00 [git] # You will see n of these
   27 nonroot   0:00 [git]
   41 nonroot   0:00 [git]
   55 nonroot   0:00 [git]
   63 nonroot   0:00 ps

@krancour
Member

krancour commented Nov 18, 2024

Before our custom credential helper is implicated...

This has the same result:

docker run -d --rm -e GIT_ASKPASS=/usr/bin/false -w /tmp ghcr.io/akuity/kargo:v1.0.3 sleep 1000

# Important that there be no tty; repeat this next step n times
docker exec <container id> git clone https://any/private/repo

docker exec <container id> ps

The zombies are git and not the credential helper. But those zombies are being created when the credential helper exits non-zero and git attempts to interactively ask the user for creds and finds there is no tty.

It is interesting that @hiddeco was able to make ssh zombies instead of git zombies, but the overall smell is similar.

@hiddeco
Contributor

hiddeco commented Nov 18, 2024

@krancour does this go away if you set GIT_TERMINAL_PROMPT=0?

@krancour
Member

Changes the error message, but the zombies still get created.

@krancour
Member

krancour commented Nov 18, 2024

docker exec a9cc0da85d19c5c254b5281cbcc0d88cc4e5a3152f5db13854f907ce92f36898 git clone https://github.com...
Cloning into 'repo'...
GIT_PASSWORD must be set
error: unable to read askpass response from '/usr/local/bin/credential-helper'
fatal: could not read Username for 'https://github.com': terminal prompts disabled

x 3

docker exec a9cc0da85d19c5c254b5281cbcc0d88cc4e5a3152f5db13854f907ce92f36898 ps                                           
PID   USER     TIME  COMMAND
    1 nonroot   0:00 sleep 1000
   13 nonroot   0:00 [git]
   27 nonroot   0:00 [git]
   41 nonroot   0:00 [git]
   49 nonroot   0:00 ps

@krancour
Member

This is interesting...

Remove GIT_ASKPASS from the equation and this can still be triggered:

docker run -d --rm -e GIT_TERMINAL_PROMPT=0 -w /tmp ghcr.io/akuity/kargo:v1.0.3 sleep 1000

docker exec <container> git clone https://github.com/...
Cloning into 'repo'...
fatal: could not read Username for 'https://github.com': terminal prompts disabled

x 3

docker exec <container> ps                                           
PID   USER     TIME  COMMAND
    1 nonroot   0:00 sleep 1000
   13 nonroot   0:00 [git]
   22 nonroot   0:00 [git]
   31 nonroot   0:00 [git]
   34 nonroot   0:00 ps

@hiddeco
Contributor

hiddeco commented Nov 18, 2024

The problem is likely that the kargo-controller process has PID 1, which has a special responsibility (see e.g. https://blog.phusion.nl/2015/01/20/docker-and-the-pid-1-zombie-reaping-problem/)

Its task is to "adopt" orphaned child processes ("adopt" really is the technical term). This means the init process becomes the parent of such processes, even though it never created them directly. Crucially, init must then also reap (wait() on) those adopted children once they exit; a PID 1 like the kargo-controller binary, which has no reason to wait on children it did not spawn, leaves them behind as zombies.
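
A tiny hypothetical Go demo of that adoption (not Kargo code): the parent starts a helper and exits without waiting, so the helper is reparented to PID 1; unless PID 1 reaps it after it exits, it lingers as <defunct>:

package main

import (
	"fmt"
	"os"
	"os/exec"
)

func main() {
	// Behave like the aborting git parent: start a helper child...
	cmd := exec.Command("sleep", "1")
	if err := cmd.Start(); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Printf("started child %d; exiting without Wait\n", cmd.Process.Pid)
	// ...and die immediately. The child is orphaned and adopted by
	// PID 1. When it exits a second later, PID 1 must reap it, or it
	// shows up as <defunct> in ps from then on.
}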

We can confirm this theory by running @krancour's original example with the --init flag added (available since Docker 1.13):

$ docker run --init -d --rm -e GIT_ASKPASS=/usr/local/bin/credential-helper -w /tmp ghcr.io/akuity/kargo:v1.0.3 sleep 1000
$ docker exec <id> git clone https://github.com/some/repository.git
...
$ docker exec <id> git clone https://github.com/some/repository.git
...
$ docker exec <id> git clone https://github.com/some/repository.git
...
$ docker exec 197f93ca78208788f3ee1ebdf62a27a5380d37f4db70e955add7fa97013e111e ps aux
PID   USER     TIME  COMMAND
    1 nonroot   0:00 /sbin/docker-init -- sleep 1000
    7 nonroot   0:00 sleep 1000
   38 nonroot   0:00 ps aux

To avoid this, we should likely make use of a lightweight supervisor like tini.
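
For completeness, the alternative to an external init like tini (or Docker's --init shown above) is for the PID 1 process to reap adopted children itself. A minimal sketch of such a reaper on Linux (hypothetical, not what Kargo ended up doing):

package main

import (
	"os"
	"os/signal"
	"syscall"
)

// reapZombies collects the exit status of any children reparented to
// this process. Only meaningful when the process runs as PID 1; this
// mirrors what tini/docker-init do for the whole container.
func reapZombies() {
	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, syscall.SIGCHLD)
	for range sigs {
		for {
			// -1 waits for any child; WNOHANG returns immediately
			// when no more exited children are pending.
			pid, err := syscall.Wait4(-1, nil, syscall.WNOHANG, nil)
			if pid <= 0 || err != nil {
				break
			}
		}
	}
}

func main() {
	go reapZombies()
	// ... the controller's real work would run here ...
	select {}
}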

@krancour
Member

Nice find @hiddeco!
