kargo-controller creates zombie [git] processes #2926

Closed · 4 tasks done
moro-drake opened this issue Nov 14, 2024 · 11 comments · Fixed by #2959

@moro-drake

Checklist

  • I've searched the issue queue to verify this is not a duplicate bug report.
  • I've included steps to reproduce the bug.
  • I've pasted the output of kargo version.
  • I've pasted logs, if applicable.

Description

We run Kargo in an OpenShift cluster. OpenShift runs it as the following user: runAsUser: 1001910000 (ps output for this user is attached as screenshots). Since the 1.0.3 update, we have noticed that it creates zombie [git] processes. These processes slowly accumulate until the controller becomes unusable (Unix fork() cannot create any more processes).
A Warehouse set to discover new tags using the NewestTag strategy speeds this up (you can see a 'zombie' spawn on each refresh of the Warehouse in the UI). With a 1m interval and a discovery limit of 20, the kargo-controller was dead within half a day (we had about 44 active Warehouses with a NewestTag subscription).

Screenshots

(Three attached screenshots of ps output showing the accumulating defunct [git] processes.)

Steps to Reproduce

  1. Create a Kargo project with a stage and a warehouse subscription to git.
  2. Set the Warehouse subscription spec like this:

spec:
  freightCreationPolicy: Automatic
  interval: 1m0s
  subscriptions:
    - git:
        branch: main
        commitSelectionStrategy: NewestTag
        discoveryLimit: 20

  3. Open a terminal on the node running the kargo-controller and run ps auxf | grep 'defunct' | wc -l
  4. Refresh the Warehouse that subscribes with NewestTag in the UI.
  5. Run ps auxf | grep 'defunct' | wc -l again - the count increments because a new 'zombie' has spawned.

Version

Kargo v1.0.3

Logs

time="2024-11-14T03:06:08Z" level=error msg="Reconciler error" Warehouse="{release-warehouse dex}" controller=warehouse controllerGroup=kargo.akuity.io controllerKind=Warehouse error="error discovering artifacts: error discovering commits: failed to clone git repo \"https://repo.git\": error cloning repo \"https://repo.git\" into \"/tmp/repo-223330204/repo\": error executing cmd [/usr/bin/git clone --no-tags --branch main --single-branch https://repo.git /tmp/repo-223330204/repo]: Cloning into '/tmp/repo-223330204/repo'...\nerror: cannot fork() for remote-https: Resource temporarily unavailable\n" name=release-warehouse namespace=dex reconcileID="\"f2e1b0c6-3ad5-4d40-9965-fdda5c8b1a7d\""
@hiddeco
Contributor

hiddeco commented Nov 18, 2024

This is an interesting issue, as we appear to make use of Exec everywhere we call git, which in turn uses cmd.CombinedOutput(), which makes sure to .Wait() (failing to .Wait() is normally the reason for zombie processes).

Needs a more thorough investigation.
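
For reference, a minimal Go sketch of the pattern described above (not Kargo's actual code): CombinedOutput reaps its child even on failure, whereas a Start() without a matching Wait() is the classic recipe for a zombie:

package main

import (
	"fmt"
	"os/exec"
	"time"
)

func main() {
	// CombinedOutput starts the command and internally calls Wait,
	// so the child is always reaped, even when the command fails.
	out, err := exec.Command("git", "--version").CombinedOutput()
	fmt.Printf("%s(err: %v)\n", out, err)

	// By contrast, Start without a matching Wait never collects the
	// child's exit status: once the child exits, it lingers as
	// <defunct> for as long as this process lives.
	cmd := exec.Command("git", "--version")
	_ = cmd.Start()
	time.Sleep(5 * time.Second) // run `ps` now and look for <defunct>
}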

@hiddeco
Contributor

hiddeco commented Nov 18, 2024

Update: I have thus far been unable to reproduce this on a development setup of main with multiple Warehouses with Git subscriptions in a variety of configurations, including the suggested NewestTag config.

@hiddeco
Contributor

hiddeco commented Nov 18, 2024

I appear to have found one variation of this. Steps:

  1. Create a Warehouse with an SSH Git subscription (to e.g. GitHub.com), but without supplying any credentials for it.

  2. Let this run for some time (or refresh the Warehouse by hand), which causes authentication errors.

  3. Watch the number of open SSH processes grow for each clone:

    $ ps aux
    PID   USER     TIME  COMMAND
     1330 kargo     0:00 [ssh]
     1335 kargo     0:00 [ssh]
     1344 kargo     0:00 [ssh]
     1348 kargo     0:00 [ssh]
     1352 kargo     0:00 [ssh]
     1356 kargo     0:00 [ssh]
     1360 kargo     0:00 [ssh]
     1374 kargo     0:00 [ssh]
     1378 kargo     0:00 [ssh]
     1382 kargo     0:00 [ssh]
     1386 kargo     0:00 [ssh]
     1390 kargo     0:00 [ssh]
     1419 kargo     0:00 [ssh]
     1442 kargo     0:00 [ssh]
     1450 kargo     0:00 [ssh]
     1491 kargo     0:00 [ssh]
     1664 kargo     0:00 [ssh]

@krancour
Member

Similarly, just now, I've been able to do the same with HTTPS:

docker run -d --rm -e GIT_ASKPASS=/usr/local/bin/credential-helper -w /tmp ghcr.io/akuity/kargo:v1.0.3 sleep 1000

# Important that there be no tty; repeat this next step n times
docker exec <container id> git clone https://any/private/repo

docker exec <container id> ps                                      
PID   USER     TIME  COMMAND
    1 nonroot   0:00 sleep 1000
   13 nonroot   0:00 [git] # You will see n of these
   27 nonroot   0:00 [git]
   41 nonroot   0:00 [git]
   55 nonroot   0:00 [git]
   63 nonroot   0:00 ps

@krancour
Member

krancour commented Nov 18, 2024

Before our custom credential helper is implicated...

This has the same result:

docker run -d --rm -e GIT_ASKPASS=/usr/bin/false -w /tmp ghcr.io/akuity/kargo:v1.0.3 sleep 1000

# Important that there be no tty; repeat this next step n times
docker exec <container id> git clone https://any/private/repo

docker exec <container id> ps

The zombies are git and not the credential helper. But those zombies are being created when the credential helper exits non-zero and git attempts to interactively ask the user for creds and finds there is no tty.

It is interesting that @hiddeco was able to make ssh zombies instead of git zombies, but the overall smell is similar.

@hiddeco
Contributor

hiddeco commented Nov 18, 2024

@krancour does this go away if you set GIT_TERMINAL_PROMPT=0?

@krancour
Member

Changes the error message, but the zombies still get created.

@krancour
Member

krancour commented Nov 18, 2024

docker exec a9cc0da85d19c5c254b5281cbcc0d88cc4e5a3152f5db13854f907ce92f36898 git clone https://github.com...
Cloning into 'repo'...
GIT_PASSWORD must be set
error: unable to read askpass response from '/usr/local/bin/credential-helper'
fatal: could not read Username for 'https://github.com': terminal prompts disabled

x 3

docker exec a9cc0da85d19c5c254b5281cbcc0d88cc4e5a3152f5db13854f907ce92f36898 ps                                           
PID   USER     TIME  COMMAND
    1 nonroot   0:00 sleep 1000
   13 nonroot   0:00 [git]
   27 nonroot   0:00 [git]
   41 nonroot   0:00 [git]
   49 nonroot   0:00 ps

@krancour
Member

This is interesting...

Remove GIT_ASKPASS from the equation and this can still be triggered:

docker run -d --rm -e GIT_TERMINAL_PROMPT=0 -w /tmp ghcr.io/akuity/kargo:v1.0.3 sleep 1000

docker exec <container> git clone https://github.com/...
Cloning into 'repo'...
fatal: could not read Username for 'https://github.com': terminal prompts disabled

x 3

docker exec <container> ps                                           
PID   USER     TIME  COMMAND
    1 nonroot   0:00 sleep 1000
   13 nonroot   0:00 [git]
   22 nonroot   0:00 [git]
   31 nonroot   0:00 [git]
   34 nonroot   0:00 ps

@hiddeco
Contributor

hiddeco commented Nov 18, 2024

The problem is likely that the kargo-controller process has PID 1, which has a special responsibility (see e.g. https://blog.phusion.nl/2015/01/20/docker-and-the-pid-1-zombie-reaping-problem/)

Its task is to "adopt" orphaned child processes ("adopt" really is the technical term). This means the init process becomes the parent of such processes, even though it never created them directly. Crucially, init must then also reap (wait() on) those adopted children once they exit; a PID 1 like the kargo-controller binary, which has no reason to wait on children it did not spawn, leaves them behind as zombies.
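
A tiny hypothetical Go demo of that adoption (not Kargo code): the parent starts a helper and exits without waiting, so the helper is reparented to PID 1; unless PID 1 reaps it after it exits, it lingers as <defunct>:

package main

import (
	"fmt"
	"os"
	"os/exec"
)

func main() {
	// Behave like the aborting git parent: start a helper child...
	cmd := exec.Command("sleep", "1")
	if err := cmd.Start(); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Printf("started child %d; exiting without Wait\n", cmd.Process.Pid)
	// ...and die immediately. The child is orphaned and adopted by
	// PID 1. When it exits a second later, PID 1 must reap it, or it
	// shows up as <defunct> in ps from then on.
}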

We can confirm this theory by running @krancour's original example with the --init flag added (available since Docker 1.13):

$ docker run --init -d --rm -e GIT_ASKPASS=/usr/local/bin/credential-helper -w /tmp ghcr.io/akuity/kargo:v1.0.3 sleep 1000
$ docker exec <id> git clone https://github.com/some/repository.git
...
$ docker exec <id> git clone https://github.com/some/repository.git
...
$ docker exec <id> git clone https://github.com/some/repository.git
...
$ docker exec 197f93ca78208788f3ee1ebdf62a27a5380d37f4db70e955add7fa97013e111e ps aux
PID   USER     TIME  COMMAND
    1 nonroot   0:00 /sbin/docker-init -- sleep 1000
    7 nonroot   0:00 sleep 1000
   38 nonroot   0:00 ps aux

To avoid this, we should likely make use of a lightweight supervisor like tini.
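
For completeness, the alternative to an external init like tini (or Docker's --init shown above) is for the PID 1 process to reap adopted children itself. A minimal sketch of such a reaper on Linux (hypothetical, not what Kargo ended up doing):

package main

import (
	"os"
	"os/signal"
	"syscall"
)

// reapZombies collects the exit status of any children reparented to
// this process. Only meaningful when the process runs as PID 1; this
// mirrors what tini/docker-init do for the whole container.
func reapZombies() {
	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, syscall.SIGCHLD)
	for range sigs {
		for {
			// -1 waits for any child; WNOHANG returns immediately
			// when no more exited children are pending.
			pid, err := syscall.Wait4(-1, nil, syscall.WNOHANG, nil)
			if pid <= 0 || err != nil {
				break
			}
		}
	}
}

func main() {
	go reapZombies()
	// ... the controller's real work would run here ...
	select {}
}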

@krancour
Member

Nice find @hiddeco!
