Artifact GC still acting up for failed Workflows #12845
-
I still have problems with artifact GC in Argo Workflows 3.5.5.
This works fine as long as there are no errors. I can see artifacts being created and automatically removed in that bucket (for successful workflows).
So, in conclusion: the artifact GC failed, for an unknown reason. So what is the problem here? What am I missing? Why is the GC somehow not working for failed workflows?
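For context, artifact GC is enabled on my workflows roughly like this (a minimal sketch with placeholder names, not my exact spec):
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: artifact-gc-example-
spec:
  entrypoint: main
  artifactGC:
    strategy: OnWorkflowDeletion    # clean up artifacts when the Workflow is deleted
  templates:
    - name: main
      container:
        image: alpine:3.19
        command: [sh, -c, "echo hello > /tmp/final-data"]
      outputs:
        artifacts:
          - name: FINAL_DATA
            path: /tmp/final-data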
-
I haven't used or worked on Artifact GC, but can try to help
No, I'm reading that as: the Artifact GC Pod failed, so the Workflow has not been cleaned up properly and carries that message. That's also why it didn't finish deleting (as it failed to clean up), if I'm understanding correctly, as you don't have
I think you are correct -- would you like to submit a PR to fix that?
The Artifact GC Pod log message is the same log message you get when you try to add
-
I would, but I couldn't find the correct location yet.
That is correct. I might consider enabling this once this kind of behavior is an exception and doesn't happen for every failed workflow. Otherwise, my artifact repository fills up with orphaned artifacts.
Are you sure? I mean, look at point 6 again. The logs contain all necessary details for the artifact repository. The only place this can come from is the controller-level artifact repository configuration. If that component/container were unable to get that config (somehow), it wouldn't be able to write those logs. Based on the fact that the container can write those logs, I would assume the config is accessible, but there is another problem hidden behind that generic error message. But I can't fix it if I don't know what it is.
Argo runs in the
-
Yes, I'm aware :) The
It should, yes. Otherwise even the successful workflows wouldn't be able to delete artifacts. But I was thinking about that as well. I still have that on my todo list to verify if there are any errors in the audit logs of MinIO while that happens. It's one of the very few ways to potentially make whatever-this-is visible.
I looked at the code and I don't see a reason why this would suddenly fail? 🤔 Also: What is the
What could possibly make the GC work for one artifact but not for another? 🤔
What does that mean exactly? I can check when I know what I am looking for. Thanks, btw, for all the answers. Hopefully I can fix this some day; it's driving me crazy. Since I can easily implement a workaround for successful workflows (exit handler), failed workflows are the main(!) reason why I need this feature.
-
Just an idea: is it possible that, for failed workflows, somehow additional Kubernetes resources are being created/used that my workflow service account can't access, and that's why the artifact repository seems to be not configured? This is the role used by the workflow service account. By any chance, is there some important permission missing?
apiVersion: "rbac.authorization.k8s.io/v1"
kind: "Role"
metadata:
name: "workflow-argo"
rules:
# See https://argoproj.github.io/argo-workflows/workflow-rbac/
- apiGroups:
- "argoproj.io"
resources:
- "workflowtaskresults"
verbs:
- "create"
- "patch"
- apiGroups:
- "argoproj.io"
resources:
- "workflows"
verbs:
# Permissions to submit a workflow
- "list"
- "create"
# Permissions to resubmit, retry, resume, suspend a workflow
- "get"
- "update"
# See https://argoproj.github.io/argo-workflows/walk-through/artifacts/#service-accounts-and-annotations
- apiGroups:
- "argoproj.io"
resources:
- "workflowartifactgctasks"
verbs:
- "list"
- "watch"
# See https://argoproj.github.io/argo-workflows/walk-through/artifacts/#service-accounts-and-annotations
- apiGroups:
- "argoproj.io"
resources:
- "workflowartifactgctasks/status"
verbs:
- "patch"
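For completeness, this Role is bound to the workflow ServiceAccount with a standard RoleBinding, roughly like this (a sketch; the ServiceAccount name is a placeholder for whatever your workflows actually use):
apiVersion: "rbac.authorization.k8s.io/v1"
kind: "RoleBinding"
metadata:
  name: "workflow-argo"
roleRef:
  apiGroup: "rbac.authorization.k8s.io"
  kind: "Role"
  name: "workflow-argo"
subjects:
  - kind: "ServiceAccount"
    name: "workflow-argo" # placeholder for the actual workflow ServiceAccount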
-
@static-moonlight can you confirm that the artifact still exists in the repository? Have you done a manual inspection of it? I assume the artifactgc finalizer doesn't get removed, right?
-
@static-moonlight I just experienced an artifactgc failure for one of my failed workflows. Will investigate and report back.
-
@static-moonlight is your workflow controller configmap syntax correct? Notably:
https://argo-workflows.readthedocs.io/en/latest/workflow-controller-configmap/
-
@static-moonlight I've added a test specific to failed workflow artifact garbage collection. It seems to be working. #12904
-
It should. Otherwise, my editor would most likely show me syntax errors, and nothing would work in the cluster. I don't like the inline yaml config (
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: argo
# [...]
configMapGenerator:
- name: workflow-controller-configmap
behavior: replace
files:
- config=config.yaml
# [...]
config.yaml:
# [...]
artifactRepository:
s3:
bucket: argo-artifacts
endpoint: s3.storage:9999
insecure: true
accessKeySecret:
name: artifact-repository
key: USERNAME
secretKeySecret:
name: artifact-repository
key: PASSWORD
# [...]
It works nicely for normal operation, though. Workflows are being executed, and the artifact repository only contains artifacts for active workflows, which means it works OK. Because of our latest stabilization, workflow errors are kind of rare now. I'm currently working on a dedicated setup to forcefully fail a few workflow runs to hopefully find the missing puzzle piece which is causing all this trouble ...
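For reference, one way to rule out a rendering problem on my side is to dump the ConfigMap that kustomize actually generated and check that the artifactRepository block arrived intact under the config key (namespace argo as in the kustomization above):
$ kubectl -n argo get configmap workflow-controller-configmap -o yaml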
-
I have a suspicion that it has something to do with this: "Pod was active on the node longer than the specified deadline". I have collected roughly 200 defect input files (which will cause the workflow to fail) and threw them at the cluster. Most of those workflows failed, but not because they couldn't handle the input; they took too long and hit the configured "activeDeadlineSeconds" of the pod. And for all those failed workflows, I see the "Artifact garbage collection failed" message again. ... I'm doing more tests on this one.
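For anyone who wants to reproduce this, the failing workflows look roughly like this sketch: an output artifact plus a tight activeDeadlineSeconds that kills the pod before the artifact is ever produced (image, names and timings are placeholders, not my real spec):
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: artgc-deadline-repro-
spec:
  entrypoint: main
  artifactGC:
    strategy: OnWorkflowDeletion
  templates:
    - name: main
      activeDeadlineSeconds: 5          # pod gets killed before the artifact exists
      container:
        image: alpine:3.19
        command: [sh, -c, "sleep 60 && echo done > /tmp/final-data"]
      outputs:
        artifacts:
          - name: FINAL_DATA
            path: /tmp/final-data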
-
Well, that doesn't seem to be it either. When I try to specifically fail a single workflow by setting a pod's
So far it seems that there is a random factor involved. Sometimes it works and sometimes it doesn't. In high-load situations I could produce lots of artifact GC errors. During normal operation I almost never have them. I'm a little lost here. I can't seem to find the magic recipe to reliably (re)produce this behavior. But apparently it still happens sometimes. Any more ideas?
-
I did the same test again. This time I limited the number of parallel workflows with a semaphore. The result: green across the board. The workflows failed, as expected, and they were removed successfully, including artifact GC. This amplifies my suspicion that workload has something to do with the issue. I would even throw the theory of a potential race condition into the ring. All I can do now is run incremental tests, increasing the number of parallel workflows each time, and see when it starts to break ...
EDIT: The intended limit (for parallel workflows) is set to 5. I increased it to 10, 15 and now 20. With the limit of 20, things start to break. I have a couple of leftovers from this test run ... with "Artifact garbage collection failed" errors again. I assume the number of artifact GC errors increases when I put even more load on the cluster. Meaning: load is obviously a factor. Can someone confirm this?
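In case it helps with reproducing, the workflow-level semaphore I use for limiting parallelism looks roughly like this (ConfigMap name, key and image are placeholders):
apiVersion: v1
kind: ConfigMap
metadata:
  name: workflow-semaphore          # placeholder name
data:
  parallel: "5"                     # max number of workflows allowed to run at once
---
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: load-test-
spec:
  entrypoint: main
  synchronization:
    semaphore:
      configMapKeyRef:
        name: workflow-semaphore
        key: parallel
  templates:
    - name: main
      container:
        image: alpine:3.19
        command: [sh, -c, "sleep 10"]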
-
@juliev0 @shuangkun thoughts?
-
I apologize, I haven't been able to go through this entire thread yet, but on the off chance it helps, I did notice a bug in the existing ArtifactGC deletion pod code while I was reviewing a separate PR. That person was going to fix it in theirs, but their PR never got merged unfortunately, and it still needs fixing. I will ask my colleague (who wrote it) to fix it. The issue is that if this line returns an error, it isn't handled.
-
So, the
-
It seems like this line is where s3 should be set, so it must not be getting called, or it's getting called and it's failing.
-
It sounds like we don't know if the Artifact ever got written in the first place? Yes, I am not sure if there's logic in here to take that into account. I see the file gets looked for here. If that fails, an error is simply written, but then the flow continues and the Outputs are still reported here (with no "s3" ever written). @Garett-MacGowan @shuangkun did your tests of Failed Workflows ever involve artifacts that never got written at all? (if you know)
-
This Draft PR is what I have in mind. I haven't tested it at all, and not sure if anyone wants to take it over...
-
Thank you for all the input. I am currently a little busy, but I will make some room to run another test and try to get you some answers to the open questions ... I am holding back any Argo upgrades for now so that the results stay comparable, at least until I've run the tests again and gotten you the information you are waiting for. After that, I can see a scenario to test out the pull request. I might need some instructions for that, but we can get to that later :)
-
Soooo ... I ran the test again:
workflow-9h4k7-370302293:
# [...]
message: Error (exit code 1)
name: workflow-9h4k7.create-variant-a
outputs:
artifacts:
- name: FINAL_DATA
path: /tmp/final-data
s3:
key: workflow-9h4k7/workflow-9h4k7-create-data-variant-370302293/FINAL_DATA.tgz
exitCode: "1"
phase: Failed
# [...]
workflow-9h4k7-1778390099:
# [...]
message: Pod was active on the node longer than the specified deadline
name: workflow-9h4k7.create-variant-b
outputs:
artifacts:
- name: FINAL_DATA
path: /tmp/final-data # <---- MISSING S3 KEY?
exitCode: "143"
phase: Failed
# [...]
workflow-9h4k7-2441593696:
# [...]
name: workflow-9h4k7.create-variant-c
outputs:
artifacts:
- name: FINAL_DATA
path: /tmp/final-data
s3:
key: workflow-9h4k7/workflow-9h4k7-create-data-variant-2441593696/FINAL_DATA.tgz
exitCode: "0"
phase: Succeeded
# [...]
workflow-9h4k7-2907771518:
# [...]
message: Error (exit code 1)
name: workflow-9h4k7.create-variant-d
outputs:
artifacts:
- name: FINAL_DATA
path: /tmp/final-data
s3:
key: workflow-9h4k7/workflow-9h4k7-create-data-variant-2907771518/FINAL_DATA.tgz
exitCode: "1"
phase: Failed
# [...]
$ rclone lsf --recursive s3:argo-artifacts/workflow-9h4k7 | sort
workflow-9h4k7-create-data-variant-2441593696/
workflow-9h4k7-create-data-variant-2441593696/FINAL_DATA.tgz
$ kubectl get workflowartifactgctasks.argoproj.io -n workflow workflow-9h4k7-artgc-wfdel-473657125-0 -o yaml
apiVersion: argoproj.io/v1alpha1
kind: WorkflowArtifactGCTask
metadata:
creationTimestamp: "2024-05-28T12:45:35Z"
generation: 1
labels:
workflows.argoproj.io/artifact-gc-pod: "3766346137"
name: workflow-9h4k7-artgc-wfdel-473657125-0
namespace: workflow
ownerReferences:
- apiVersion: argoproj.io/v1alpha1
blockOwnerDeletion: true
controller: true
kind: Workflow
name: workflow-9h4k7
uid: 162f1039-2101-40e5-8e6c-4e79998f0952
resourceVersion: "114378141"
uid: d1b8aa00-7e76-4c0c-8393-6a8ff5ad9a96
spec:
artifactsByNode:
workflow-9h4k7-370302293:
archiveLocation:
s3:
accessKeySecret:
key: USERNAME
name: artifact-repository
bucket: argo-artifacts
endpoint: s3.storage:9999
insecure: true
key: '{{workflow.name}}/{{pod.name}}'
secretKeySecret:
key: PASSWORD
name: artifact-repository
artifacts:
FINAL_DATA:
name: FINAL_DATA
path: /tmp/final-data
s3:
key: workflow-9h4k7/workflow-9h4k7-create-data-variant-370302293/FINAL_DATA.tgz
workflow-9h4k7-1778390099:
archiveLocation:
s3:
accessKeySecret:
key: USERNAME
name: artifact-repository
bucket: argo-artifacts
endpoint: s3.storage:9999
insecure: true
key: '{{workflow.name}}/{{pod.name}}'
secretKeySecret:
key: PASSWORD
name: artifact-repository
artifacts:
FINAL_DATA: # <---- MISSING S3 KEY?
name: FINAL_DATA
path: /tmp/final-data
workflow-9h4k7-2441593696:
archiveLocation:
s3:
accessKeySecret:
key: USERNAME
name: artifact-repository
bucket: argo-artifacts
endpoint: s3.storage:9999
insecure: true
key: '{{workflow.name}}/{{pod.name}}'
secretKeySecret:
key: PASSWORD
name: artifact-repository
artifacts:
FINAL_DATA:
name: FINAL_DATA
path: /tmp/final-data
s3:
key: workflow-9h4k7/workflow-9h4k7-create-data-variant-2441593696/FINAL_DATA.tgz
workflow-9h4k7-2907771518:
archiveLocation:
s3:
accessKeySecret:
key: USERNAME
name: artifact-repository
bucket: argo-artifacts
endpoint: s3.storage:9999
insecure: true
key: '{{workflow.name}}/{{pod.name}}'
secretKeySecret:
key: PASSWORD
name: artifact-repository
artifacts:
FINAL_DATA:
name: FINAL_DATA
path: /tmp/final-data
s3:
key: workflow-9h4k7/workflow-9h4k7-create-data-variant-2907771518/FINAL_DATA.tgz
workflow-9h4k7-4240421633:
archiveLocation:
s3:
accessKeySecret:
key: USERNAME
name: artifact-repository
bucket: argo-artifacts
endpoint: s3.storage:9999
insecure: true
key: '{{workflow.name}}/{{pod.name}}'
secretKeySecret:
key: PASSWORD
name: artifact-repository
artifacts:
RAW_DATA:
name: RAW_DATA
path: /tmp/raw-data
s3:
key: workflow-9h4k7/workflow-9h4k7-download-raw-data-4240421633/RAW_DATA.tgz
Please let me know if you need any more information :)
-
I am setting the version for
If you can provide a test image, I can run it to see if it helps with the problem.
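In case it matters, I would wire a test image in via the kustomize images transformer, roughly like this (registry, repository and tag are placeholders; whether swapping the controller image alone is enough, or the argoexec image also needs to be overridden, depends on where the fix lands):
# kustomization.yaml (sketch)
images:
  - name: quay.io/argoproj/workflow-controller
    newName: example.org/argo-test/workflow-controller   # placeholder
    newTag: test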
-
Sorry for asking, just to clarify: ...
-
So, first things first. I got a little confused, because I was required to sign in on Docker Hub just to explore the image. That's why I falsely assumed I would need pull secrets for Kubernetes. I didn't. As you stated, I could access the image just fine. Long story short: I was able to set up my test environment with your
Used
I also confirmed with
As for the test: It looks pretty good. I did multiple test runs, with increased workload each time. I didn't see any "Artifact garbage collection failed" errors. I also checked the artifact repository after each test and can confirm that everything was cleaned up properly. Green across the board. To be sure, I did a final test with the normal
I would say: good test
Either it was pure luck that 3 times in a row I couldn't get this error to show, or it was because of your code change. My money is on the code change ;-)
The only thing I noticed is that the regular TTL mechanism seemed a little affected by the workload burst. Meaning: completed workflows stayed in the system for longer than usual. But it automatically recovered without me doing anything within 15-20 minutes. Plus, it only happened once. So I'm not worried about that. Additionally, the high workload slightly affected other workflows as well, in the form of higher workflow duration and a couple of exceeded
Aside from that, I didn't see anything unusual and everything was working fine. I would be happy to see this code change in an upcoming Argo release :) I'm confident it will contribute to increased stability for Argo's artifact handling in general.
Thank you so much for being patient with me, and for your help and support!
P.S. I just realized that this entire thing was tracked as a Q&A discussion, instead of being a real issue. Sorry about that.
-
Summarizing this very long and winding discussion: failed Workflows can have an incomplete status / results in a few places, specifically a missing s3.key. So the code finds an artifactLocation but then doesn't know how to process it and defaults to the "configure an artifact repository" error. This can happen due to activeDeadlineSeconds being hit or the main container otherwise being stopped.