feat: Parallel artifact GC #11768
Conversation
@juliev0 Could you help review this?
Certainly. Will try to do today or tomorrow.
This is missing documentation as it stands. I'm not sure how it undrafted itself. I'll try to get that done this evening.
Was going to mention docs, as there's a new number-of-workers setting. Potentially the new scaling docs from #11731 could be a good place to add it. Also the CI failure is on
The test failures aren't happening locally, but I have an idea. The new variable isn't configurable - the pod that runs artifact GC is completely hardcoded at the moment. I suggest a separate PR to add configurability to it.
Force-pushed from da948e5 to 454dd14
This PR was put in recently to add configurability for resources in the Pod. Is this what you mean?
Yep, I missed that PR. That's great, saves me a job.
Looking at the e2e test failure, I'm thinking it probably doesn't have anything to do with your code. This is the test that doesn't delete any artifacts. It looks like it waits for 1 minute to verify that the finalizer gets removed. I wonder if there is potential for this not to be long enough...if it just never did the Workflow reconciliation because maybe it was CPU starved in the CI or something? If you do a series of empty commits, does it happen again?
I'm curious if you tried testing with multiple WorkflowArtifactGCTasks? I know we never actually enabled the creation of multiple, but there was always the possibility.
	}
	results[artifact.Name] = v1alpha1.ArtifactResult{Name: artifact.Name, Success: true, Error: nil}
	return true, err
})
It looks like this is a bug in the original code: it's not doing anything if err is non-nil.
We have guaranteed err == nil here. I've made that explicit now.
Or, as @isubasinghe pointed out, you meant the other err. Oops. Done.
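For context, a minimal sketch of the closure pattern under discussion, with the per-attempt error handled explicitly. deleteArtifact is a hypothetical stand-in for the driver call the real code makes; the other names follow the snippet above.

err = waitutil.Backoff(retry.DefaultRetry, func() (bool, error) {
	// deleteErr is the per-attempt error (the "other err" in this thread).
	if deleteErr := deleteArtifact(&artifact); deleteErr != nil {
		// Record the failure instead of silently falling through.
		errMsg := deleteErr.Error()
		results[artifact.Name] = v1alpha1.ArtifactResult{Name: artifact.Name, Success: false, Error: &errMsg}
		return false, deleteErr
	}
	results[artifact.Name] = v1alpha1.ArtifactResult{Name: artifact.Name, Success: true, Error: nil}
	return true, nil
})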
Generally this looks great, but the error isn't handled. I've added more comments, let me know what you think :)
taskList, err := artifactGCTaskInterface.List(context.Background(), metav1.ListOptions{LabelSelector: labelSelector})
if err != nil {
	return err
}
taskWorkers := env.LookupEnvIntOr(common.EnvExecGCWorkers, 4)
Should we look into how many cores are available and base the parallelism off that? I can imagine some deployments being backed by some heavy servers.
if log2(cores) + 4 >= cores:
    return cores
else:
    return log2(cores) + 4
Something like the above growth rate could make sense because it doesn't scale linearly with cores.
Not even sure if this is a good idea, just a suggestion.
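For illustration only, the suggested heuristic expressed in Go; the function name is hypothetical and this is not what the PR implements.

package artifactgc

import (
	"math"
	"runtime"
)

// suggestedGCWorkers grows the worker count as log2(cores)+4, capped at the
// number of cores, so parallelism scales sub-linearly on large machines.
func suggestedGCWorkers() int {
	cores := runtime.NumCPU()
	heuristic := int(math.Log2(float64(cores))) + 4
	if heuristic >= cores {
		return cores
	}
	return heuristic
}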
This is deployed as a separate pod, so users can tune resource limits and ARGO_EXEC_GC_WORKERS to fit. This is a heavily IO-bound process, as we're mostly just dispatching deletes to a remote blob store, and how much it benefits from parallelism depends on how that store is configured, so attempting an autotune is probably not worthwhile.
Adding more workers needs a little more RAM too, so if we want to avoid OOM kills that needs adjusting as well.
Makes sense, yeah I can imagine this being majorly bound on the network. Makes sense that an autotune isn't worthwhile.
} else {
	response.Task.Status.ArtifactResultsByNode[response.NodeName] = v1alpha1.ArtifactResultNodeStatus{ArtifactResults: response.Results}
	// Check for completed tasks
	nodesToGo[response.Task]--
Do you think we need to add an ok check here? I see that this should obviously exist, I'm just tempted to be more defensive so that we future-proof this a bit more.
Alternatively, add a comment that we expect response.Task to be present here because it was populated in the previous loop; it should probably be accompanied by a comment at line 89 where it gets populated.
Sorry for being pedantic/annoying when it is obvious, I just don't want things to go the way of NodeStatus. I'm tempted to use a linter to force ok checks everywhere, to be honest.
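A minimal sketch of the defensive lookup being asked for, assuming nodesToGo is a map keyed by the task; the logging call is illustrative.

// Only decrement when the key is present, instead of relying on the
// earlier population loop having added it.
if remaining, ok := nodesToGo[response.Task]; ok {
	nodesToGo[response.Task] = remaining - 1
} else {
	log.Warnf("received a result for an unknown task (node %q)", response.NodeName)
}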
Ok, done
Thanks!
Sorry, is this the if response.Task == nil check? I'm actually kind of confused why that's in there. It seems redundant, no? I think we should remove this. If we started doing this everywhere, it would make the code unnecessarily long.
	continue
}

err = waitutil.Backoff(retry.DefaultRetry, func() (bool, error) {
As @juliev0 states, the err is not handled
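A small sketch of what handling that return value might look like; the wrapped error message is illustrative, not the PR's actual fix.

err = waitutil.Backoff(retry.DefaultRetry, func() (bool, error) {
	// ... per-artifact delete attempt ...
	return true, nil
})
if err != nil {
	// Propagate the failure rather than dropping it.
	return fmt.Errorf("artifact GC deletion failed: %w", err)
}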
Fixed
workflow/common/common.go
Outdated
@@ -159,6 +159,9 @@ const (
	// EnvAgentPatchRate is the rate that the Argo Agent will patch the Workflow TaskSet
	EnvAgentPatchRate = "ARGO_AGENT_PATCH_RATE"

	// EnvExecGCWorkers is the number of artifact GC workers for argoexec
	EnvExecGCWorkers = "ARGO_EXEC_GC_WORKERS"
imo if it's only used for artifact GC, perhaps the env name should also carry this information, just to be more obvious to anyone who is looking at a .env file. ARGO_EXEC_ARTI_GC_WORKERS?
Good idea, done.
Nice
Parallelise deletion of artifacts when they are being garbage collected. This is theoretically a transparent change and hence can't be integration tested. Signed-off-by: Alan Clucas <alan@clucas.org>
Force-pushed from 53d4f38 to 68559bd
Signed-off-by: Alan Clucas <alan@clucas.org>
Sorry, but would you mind adding a unit test?
This PR has been automatically marked as stale because it has not had recent activity and needs further changes. It will be closed if no further activity occurs.
I'll get back to this one day!
This PR has been automatically marked as stale because it has not had recent activity and needs further changes. It will be closed if no further activity occurs.
This PR has been closed due to inactivity and lack of changes. If you would like to still work on this PR, please address the review comments and re-open.
Is this forgotten?
Nope, but it's not a priority for me right now. If someone wants to work on it they are welcome.
Parallelise deletion of artifacts when they are being garbage collected.
This is theoretically a transparent change and hence can't be integration tested.
Addresses #9295
Tested on minio and real S3
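For readers skimming the thread, a minimal, self-contained sketch of the fan-out pattern this PR describes; the deleteArtifact function and artifact names are illustrative stand-ins, not the PR's actual code.

package main

import (
	"fmt"
	"sync"
)

// deleteArtifact is a stand-in for the real blob-store delete call.
func deleteArtifact(name string) error {
	fmt.Println("deleting", name)
	return nil
}

func main() {
	artifacts := []string{"a.tgz", "b.tgz", "c.tgz"}
	const workers = 4 // the real code reads this from an env var, defaulting to 4

	jobs := make(chan string)
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			// Each worker drains the channel, deleting artifacts in parallel.
			for name := range jobs {
				if err := deleteArtifact(name); err != nil {
					fmt.Println("failed to delete", name, ":", err)
				}
			}
		}()
	}
	for _, a := range artifacts {
		jobs <- a
	}
	close(jobs)
	wg.Wait()
}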