Artifact GC still acting up for failed Workflows #12845
-
I still have problems with artifact GC in Argo Workflows 3.5.5.
This works fine as long as there are no errors. I can see artifacts being created and automatically removed in that bucket (for successful workflows).
So, in conclusion: the artifact GC failed, for an unknown reason. So what is the problem here? What am I missing? Why is the GC somehow not working for failed workflows?
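For context, artifact GC is enabled on my workflows roughly like this (a minimal sketch with placeholder names, not my exact spec):
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: artifact-gc-example-
spec:
  entrypoint: main
  artifactGC:
    strategy: OnWorkflowDeletion    # clean up artifacts when the Workflow is deleted
  templates:
    - name: main
      container:
        image: alpine:3.19
        command: [sh, -c, "echo hello > /tmp/final-data"]
      outputs:
        artifacts:
          - name: FINAL_DATA
            path: /tmp/final-data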
-
I haven't used or worked on Artifact GC, but can try to help
No, I'm reading that as: the Artifact GC Pod failed, so the Workflow has not been cleaned up properly and carries that message. That's also why it didn't finish deleting (as it failed to clean up), if I'm understanding correctly, as you don't have
I think you are correct -- would you like to submit a PR to fix that?
The Artifact GC Pod log message is the same log message you get when you try to add
-
I would, but I couldn't find the correct location yet.
That is correct. I might consider enabling this once this kind of behavior is an exception and doesn't happen for every failed workflow. Otherwise, my artifact repository fills up with orphaned artifacts.
Are you sure? I mean, look at point 6 again. The logs contain all necessary details for the artifact repository. The only place this can come from is the controller-level artifact repository configuration. If that component/container were unable to get that config (somehow), it wouldn't be able to write those logs. Based on the fact that the container can write those logs, I would assume the config is accessible, but there is another problem hidden behind that generic error message. But I can't fix it if I don't know what it is.
Argo runs in the
-
Yes, I'm aware :) The
It should, yes. Otherwise even the successful workflows wouldn't be able to delete artifacts. But I was thinking about that as well. I still have that on my todo list to verify if there are any errors in the audit logs of MinIO while that happens. It's one of the very few ways to potentially make whatever-this-is visible.
I looked at the code and I don't see a reason why this would suddenly fail? 🤔 Also: What is the
What could possibly make the GC work for one artifact but not for another? 🤔
What does that mean exactly? I can check when I know what I am looking for. Thanks, btw, for all the answers. Hopefully I can fix this some day; it's driving me crazy. Since I can easily implement a workaround for successful workflows (exit handler), failed workflows are the main(!) reason why I need this feature.
-
Just an idea: is it possible that, for failed workflows, somehow additional Kubernetes resources are being created/used that my workflow service account can't access, and that's why the artifact repository seems to be not configured? This is the role used by the workflow service account. By any chance, is there some important permission missing?
apiVersion: "rbac.authorization.k8s.io/v1"
kind: "Role"
metadata:
name: "workflow-argo"
rules:
# See https://argoproj.github.io/argo-workflows/workflow-rbac/
- apiGroups:
- "argoproj.io"
resources:
- "workflowtaskresults"
verbs:
- "create"
- "patch"
- apiGroups:
- "argoproj.io"
resources:
- "workflows"
verbs:
# Permissions to submit a workflow
- "list"
- "create"
# Permissions to resubmit, retry, resume, suspend a workflow
- "get"
- "update"
# See https://argoproj.github.io/argo-workflows/walk-through/artifacts/#service-accounts-and-annotations
- apiGroups:
- "argoproj.io"
resources:
- "workflowartifactgctasks"
verbs:
- "list"
- "watch"
# See https://argoproj.github.io/argo-workflows/walk-through/artifacts/#service-accounts-and-annotations
- apiGroups:
- "argoproj.io"
resources:
- "workflowartifactgctasks/status"
verbs:
- "patch"
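For completeness, this Role is bound to the workflow ServiceAccount with a standard RoleBinding, roughly like this (a sketch; the ServiceAccount name is a placeholder for whatever your workflows actually use):
apiVersion: "rbac.authorization.k8s.io/v1"
kind: "RoleBinding"
metadata:
  name: "workflow-argo"
roleRef:
  apiGroup: "rbac.authorization.k8s.io"
  kind: "Role"
  name: "workflow-argo"
subjects:
  - kind: "ServiceAccount"
    name: "workflow-argo" # placeholder for the actual workflow ServiceAccount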
-
@static-moonlight can you confirm that the artifact still exists in the repository? Have you done a manual inspection of it? I assume the artifactgc finalizer doesn't get removed, right?
-
@static-moonlight I just experienced an artifactgc failure for one of my failed workflows. Will investigate and report back.
-
@static-moonlight is your workflow controller configmap syntax correct? Notably:
https://argo-workflows.readthedocs.io/en/latest/workflow-controller-configmap/
-
@static-moonlight I've added a test specific to failed workflow artifact garbage collection. It seems to be working. #12904
-
It should. Otherwise, my editor would most likely show me syntax errors, and nothing would work in the cluster. I don't like the inline yaml config (
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: argo
# [...]
configMapGenerator:
- name: workflow-controller-configmap
behavior: replace
files:
- config=config.yaml
# [...]
config.yaml:
# [...]
artifactRepository:
s3:
bucket: argo-artifacts
endpoint: s3.storage:9999
insecure: true
accessKeySecret:
name: artifact-repository
key: USERNAME
secretKeySecret:
name: artifact-repository
key: PASSWORD
# [...]
It works nicely for normal operation, though. Workflows are being executed, and the artifact repository only contains artifacts for active workflows, which means it works OK. Because of our latest stabilization, workflow errors are kind of rare now. I'm currently working on a dedicated setup to forcefully fail a few workflow runs to hopefully find the missing puzzle piece which is causing all this trouble ...
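For reference, one way to rule out a rendering problem on my side is to dump the ConfigMap that kustomize actually generated and check that the artifactRepository block arrived intact under the config key (namespace argo as in the kustomization above):
$ kubectl -n argo get configmap workflow-controller-configmap -o yaml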
-
I have a suspicion that it has something to do with this: "Pod was active on the node longer than the specified deadline". I have collected roughly 200 defect input files (which will cause the workflow to fail) and threw them at the cluster. Most of those workflows failed, but not because they couldn't handle the input; they took too long and hit the configured "activeDeadlineSeconds" of the pod. And for all those failed workflows, I see the "Artifact garbage collection failed" message again. ... I'm doing more tests on this one.
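For anyone who wants to reproduce this, the failing workflows look roughly like this sketch: an output artifact plus a tight activeDeadlineSeconds that kills the pod before the artifact is ever produced (image, names and timings are placeholders, not my real spec):
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: artgc-deadline-repro-
spec:
  entrypoint: main
  artifactGC:
    strategy: OnWorkflowDeletion
  templates:
    - name: main
      activeDeadlineSeconds: 5          # pod gets killed before the artifact exists
      container:
        image: alpine:3.19
        command: [sh, -c, "sleep 60 && echo done > /tmp/final-data"]
      outputs:
        artifacts:
          - name: FINAL_DATA
            path: /tmp/final-data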
-
Well, that doesn't seem to be it either. When I try to specifically fail a single workflow by setting a pod's
So far it seems that there is a random factor involved. Sometimes it works and sometimes it doesn't. In high-load situations I could produce lots of artifact GC errors. During normal operation I almost never have them. I'm a little lost here. I can't seem to find the magic recipe to reliably (re)produce this behavior. But apparently it still happens sometimes. Any more ideas?
-
I did the same test again. This time I limited the number of parallel workflows with a semaphore. The result: green across the board. The workflows failed, as expected, and they were removed successfully, including artifact GC. This amplifies my suspicion that workload has something to do with the issue. I would even throw the theory of a potential race condition into the ring. All I can do now is run incremental tests, increasing the number of parallel workflows each time, and see when it starts to break ...
EDIT: The intended limit (for parallel workflows) is set to 5. I increased it to 10, 15 and now 20. With the limit of 20, things start to break. I have a couple of leftovers from this test run ... with "Artifact garbage collection failed" errors again. I assume the number of artifact GC errors increases when I put even more load on the cluster. Meaning: load is obviously a factor. Can someone confirm this?
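In case it helps with reproducing, the workflow-level semaphore I use for limiting parallelism looks roughly like this (ConfigMap name, key and image are placeholders):
apiVersion: v1
kind: ConfigMap
metadata:
  name: workflow-semaphore          # placeholder name
data:
  parallel: "5"                     # max number of workflows allowed to run at once
---
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: load-test-
spec:
  entrypoint: main
  synchronization:
    semaphore:
      configMapKeyRef:
        name: workflow-semaphore
        key: parallel
  templates:
    - name: main
      container:
        image: alpine:3.19
        command: [sh, -c, "sleep 10"]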
-
@juliev0 @shuangkun thoughts?
-
I apologize, I haven't been able to go through this entire thread yet, but on the off chance it helps, I did notice a bug in the existing ArtifactGC deletion pod code while I was reviewing a separate PR. That person was going to fix it in theirs, but their PR never got merged unfortunately, and it still needs fixing. I will ask my colleague (who wrote it) to fix it. The issue is that if this line returns an error, it isn't handled.
-
So, the
-
It seems like this line is where s3 should be set, so it must not be getting called, or it's getting called and it's failing.
-
It sounds like we don't know if the Artifact ever got written in the first place? Yes, I am not sure if there's logic in here to take that into account. I see the file gets looked for here. If that fails, an error is simply written, but then the flow continues and the Outputs are still reported here (with no "s3" ever written). @Garett-MacGowan @shuangkun did your tests of Failed Workflows ever involve artifacts that never got written at all? (if you know)
-
This Draft PR is what I have in mind. I haven't tested it at all, and not sure if anyone wants to take it over...
-
Thank you for all the input. I am currently a little busy, but I will make some room to run another test and try to get you some answers to the open questions ... I am holding back any Argo upgrades for now so that the results stay comparable, at least until I've run the tests again and gotten you the information you are waiting for. After that, I can see a scenario to test out the pull request. I might need some instructions for that, but we can get to that later :)
-
Soooo ... I ran the test again:
workflow-9h4k7-370302293:
# [...]
message: Error (exit code 1)
name: workflow-9h4k7.create-variant-a
outputs:
artifacts:
- name: FINAL_DATA
path: /tmp/final-data
s3:
key: workflow-9h4k7/workflow-9h4k7-create-data-variant-370302293/FINAL_DATA.tgz
exitCode: "1"
phase: Failed
# [...]
workflow-9h4k7-1778390099:
# [...]
message: Pod was active on the node longer than the specified deadline
name: workflow-9h4k7.create-variant-b
outputs:
artifacts:
- name: FINAL_DATA
path: /tmp/final-data # <---- MISSING S3 KEY?
exitCode: "143"
phase: Failed
# [...]
workflow-9h4k7-2441593696:
# [...]
name: workflow-9h4k7.create-variant-c
outputs:
artifacts:
- name: FINAL_DATA
path: /tmp/final-data
s3:
key: workflow-9h4k7/workflow-9h4k7-create-data-variant-2441593696/FINAL_DATA.tgz
exitCode: "0"
phase: Succeeded
# [...]
workflow-9h4k7-2907771518:
# [...]
message: Error (exit code 1)
name: workflow-9h4k7.create-variant-d
outputs:
artifacts:
- name: FINAL_DATA
path: /tmp/final-data
s3:
key: workflow-9h4k7/workflow-9h4k7-create-data-variant-2907771518/FINAL_DATA.tgz
exitCode: "1"
phase: Failed
# [...]
$ rclone lsf --recursive s3:argo-artifacts/workflow-9h4k7 | sort
workflow-9h4k7-create-data-variant-2441593696/
workflow-9h4k7-create-data-variant-2441593696/FINAL_DATA.tgz
$ kubectl get workflowartifactgctasks.argoproj.io -n workflow workflow-9h4k7-artgc-wfdel-473657125-0 -o yaml
apiVersion: argoproj.io/v1alpha1
kind: WorkflowArtifactGCTask
metadata:
creationTimestamp: "2024-05-28T12:45:35Z"
generation: 1
labels:
workflows.argoproj.io/artifact-gc-pod: "3766346137"
name: workflow-9h4k7-artgc-wfdel-473657125-0
namespace: workflow
ownerReferences:
- apiVersion: argoproj.io/v1alpha1
blockOwnerDeletion: true
controller: true
kind: Workflow
name: workflow-9h4k7
uid: 162f1039-2101-40e5-8e6c-4e79998f0952
resourceVersion: "114378141"
uid: d1b8aa00-7e76-4c0c-8393-6a8ff5ad9a96
spec:
artifactsByNode:
workflow-9h4k7-370302293:
archiveLocation:
s3:
accessKeySecret:
key: USERNAME
name: artifact-repository
bucket: argo-artifacts
endpoint: s3.storage:9999
insecure: true
key: '{{workflow.name}}/{{pod.name}}'
secretKeySecret:
key: PASSWORD
name: artifact-repository
artifacts:
FINAL_DATA:
name: FINAL_DATA
path: /tmp/final-data
s3:
key: workflow-9h4k7/workflow-9h4k7-create-data-variant-370302293/FINAL_DATA.tgz
workflow-9h4k7-1778390099:
archiveLocation:
s3:
accessKeySecret:
key: USERNAME
name: artifact-repository
bucket: argo-artifacts
endpoint: s3.storage:9999
insecure: true
key: '{{workflow.name}}/{{pod.name}}'
secretKeySecret:
key: PASSWORD
name: artifact-repository
artifacts:
FINAL_DATA: # <---- MISSING S3 KEY?
name: FINAL_DATA
path: /tmp/final-data
workflow-9h4k7-2441593696:
archiveLocation:
s3:
accessKeySecret:
key: USERNAME
name: artifact-repository
bucket: argo-artifacts
endpoint: s3.storage:9999
insecure: true
key: '{{workflow.name}}/{{pod.name}}'
secretKeySecret:
key: PASSWORD
name: artifact-repository
artifacts:
FINAL_DATA:
name: FINAL_DATA
path: /tmp/final-data
s3:
key: workflow-9h4k7/workflow-9h4k7-create-data-variant-2441593696/FINAL_DATA.tgz
workflow-9h4k7-2907771518:
archiveLocation:
s3:
accessKeySecret:
key: USERNAME
name: artifact-repository
bucket: argo-artifacts
endpoint: s3.storage:9999
insecure: true
key: '{{workflow.name}}/{{pod.name}}'
secretKeySecret:
key: PASSWORD
name: artifact-repository
artifacts:
FINAL_DATA:
name: FINAL_DATA
path: /tmp/final-data
s3:
key: workflow-9h4k7/workflow-9h4k7-create-data-variant-2907771518/FINAL_DATA.tgz
workflow-9h4k7-4240421633:
archiveLocation:
s3:
accessKeySecret:
key: USERNAME
name: artifact-repository
bucket: argo-artifacts
endpoint: s3.storage:9999
insecure: true
key: '{{workflow.name}}/{{pod.name}}'
secretKeySecret:
key: PASSWORD
name: artifact-repository
artifacts:
RAW_DATA:
name: RAW_DATA
path: /tmp/raw-data
s3:
key: workflow-9h4k7/workflow-9h4k7-download-raw-data-4240421633/RAW_DATA.tgz
Please let me know if you need any more information :)
-
I am setting the version for
If you can provide a test image, I can run it to see if it helps with the problem.
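In case it matters, I would wire a test image in via the kustomize images transformer, roughly like this (registry, repository and tag are placeholders; whether swapping the controller image alone is enough, or the argoexec image also needs to be overridden, depends on where the fix lands):
# kustomization.yaml (sketch)
images:
  - name: quay.io/argoproj/workflow-controller
    newName: example.org/argo-test/workflow-controller   # placeholder
    newTag: test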
-
Sorry for asking, just to clarify: ...
-
So, first things first. I got a little confused, because I was required to sign in on Docker Hub just to explore the image. That's why I falsely assumed I would need pull secrets for Kubernetes. I didn't. As you stated, I could access the image just fine. Long story short: I was able to set up my test environment with your
Used
I also confirmed with
As for the test: It looks pretty good. I did multiple test runs, with increased workload each time. I didn't see any "Artifact garbage collection failed" errors. I also checked the artifact repository after each test and can confirm that everything was cleaned up properly. Green across the board. To be sure, I did a final test with the normal
I would say: good test
Either it was pure luck that 3 times in a row I couldn't get this error to show, or it was because of your code change. My money is on the code change ;-)
The only thing I noticed is that the regular TTL mechanism seemed a little affected by the workload burst. Meaning: completed workflows stayed in the system for longer than usual. But it automatically recovered without me doing anything within 15-20 minutes. Plus, it only happened once. So I'm not worried about that. Additionally, the high workload slightly affected other workflows as well, in the form of higher workflow duration and a couple of exceeded
Aside from that, I didn't see anything unusual and everything was working fine. I would be happy to see this code change in an upcoming Argo release :) I'm confident it will contribute to increased stability for Argo's artifact handling in general.
Thank you so much for being patient with me, and for your help and support!
P.S. I just realized that this entire thing was tracked as a Q&A discussion, instead of being a real issue. Sorry about that.
-
Summarizing this very long and winding discussion: failed Workflows can have an incomplete status / results in a few places, specifically a missing s3.key. So the code finds an artifactLocation but then doesn't know how to process it and defaults to the "configure an artifact repository" error. This can happen due to activeDeadlineSeconds being hit or the main container otherwise being stopped.