
feat: Parallel artifact GC #11768

Closed · wants to merge 2 commits
Conversation

@Joibel (Member) commented Sep 6, 2023

Parallelise deletion of artifacts when they are being garbage collected.

This is theoretically a transparent change and hence can't be integration tested.

Addresses #9295

Tested on MinIO and real S3.
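
For readers skimming the change, here is a minimal, self-contained sketch of the fan-out pattern being applied (hypothetical names; the real PR wires this into argoexec's artifact GC code and its storage drivers rather than a toy main):

package main

import (
	"fmt"
	"sync"
)

// deleteArtifact stands in for the real blob-store delete call (e.g. an S3
// DeleteObject issued by the artifact driver).
func deleteArtifact(name string) error {
	fmt.Println("deleting", name)
	return nil
}

// deleteAll fans artifact deletions out to a fixed pool of workers so the
// IO-bound deletes overlap instead of running one after another.
func deleteAll(artifacts []string, workers int) {
	jobs := make(chan string)
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for name := range jobs {
				if err := deleteArtifact(name); err != nil {
					fmt.Printf("failed to delete %s: %v\n", name, err)
				}
			}
		}()
	}
	for _, a := range artifacts {
		jobs <- a
	}
	close(jobs)
	wg.Wait()
}

func main() {
	deleteAll([]string{"a.tgz", "b.tgz", "c.tgz"}, 4)
}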

@agilgur5 added the area/artifacts S3/GCP/OSS/Git/HDFS etc label Sep 6, 2023
@terrytangyuan (Member) left a comment

@juliev0 Could you help review this?

@juliev0 (Contributor) commented Sep 7, 2023

> @juliev0 Could you help review this?

Certainly. Will try to do today or tomorrow.

@Joibel (Member, Author) commented Sep 7, 2023

This is missing documentation as it is. I'm not sure how it undrafted itself. I'll try and get that done this evening.

@agilgur5 commented Sep 7, 2023

> This is missing documentation as it is.

Was going to mention docs, since there's a new number-of-workers setting. The new scaling docs from #11731 could potentially be a good place to add it.

Also, the CI failure is on TestArtifactGC, so I imagine it's related to the changes.

@Joibel (Member, Author) commented Sep 7, 2023

The test failures aren't happening locally, but I have an idea.

The new variable isn't configurable - the pod that runs artifact GC is completely hardcoded at the moment. I suggest a separate PR to add configurability to it.

@juliev0 (Contributor) commented Sep 8, 2023

> The test failures aren't happening locally, but I have an idea.
>
> The new variable isn't configurable - the pod that runs artifact GC is completely hardcoded at the moment. I suggest a separate PR to add configurability to it.

This PR was put in recently to add configurability for resources in the Pod. Is this what you mean?

@Joibel (Member, Author) commented Sep 8, 2023

> > The test failures aren't happening locally, but I have an idea.
> >
> > The new variable isn't configurable - the pod that runs artifact GC is completely hardcoded at the moment. I suggest a separate PR to add configurability to it.
>
> This PR was put in recently to add configurability for resources in the Pod. Is this what you mean?

Yep, I missed that PR. That's great, saves me a job.

@juliev0 (Contributor) commented Sep 12, 2023

Looking at the e2e test failure, I'm thinking it probably doesn't have anything to do with your code. This is the test that doesn't delete any artifacts. It looks like it waits for 1 minute to verify that the finalizer gets removed. I wonder if there is potential for this not to be long enough...if it just never did the Workflow reconciliation because maybe it was CPU starved in the CI or something?

If you do a series of empty commits, does it happen again?

@juliev0 (Contributor) left a comment

I'm curious whether you tried testing with multiple WorkflowArtifactGCTasks? I know we never actually enabled the creation of multiple, but there was always the possibility.

}
results[artifact.Name] = v1alpha1.ArtifactResult{Name: artifact.Name, Success: true, Error: nil}
return true, err
})
@juliev0 (Contributor) Sep 12, 2023

It looks like this is a bug in the original code: it's not doing anything if err is non-nil.

@Joibel (Member, Author)

We have guaranteed err == nil here. I've made that explicit now.

@Joibel (Member, Author)

Or, as @isubasinghe pointed out you meant the other err. Oops. Done.
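
For context, a minimal sketch of the shape of the fix being discussed (deleteFromStore is a hypothetical stand-in for the artifact driver's delete call; this is not the PR's verbatim code): the inner delete error is recorded in the ArtifactResult instead of being dropped, and only a nil error reaches the final return.

err = waitutil.Backoff(retry.DefaultRetry, func() (bool, error) {
	// deleteFromStore stands in for the actual artifact-driver delete call.
	if deleteErr := deleteFromStore(&artifact); deleteErr != nil {
		errString := deleteErr.Error()
		results[artifact.Name] = v1alpha1.ArtifactResult{Name: artifact.Name, Success: false, Error: &errString}
		return false, deleteErr // let Backoff retry transient failures
	}
	results[artifact.Name] = v1alpha1.ArtifactResult{Name: artifact.Name, Success: true, Error: nil}
	return true, nil // success: no error is silently dropped
})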

@isubasinghe (Member) left a comment

Generally looks great, but the error isn't handled. I've added more comments, let me know what you think :)

taskList, err := artifactGCTaskInterface.List(context.Background(), metav1.ListOptions{LabelSelector: labelSelector})
if err != nil {
return err
}
taskWorkers := env.LookupEnvIntOr(common.EnvExecGCWorkers, 4)
@isubasinghe (Member)

Should we look into how many cores are available and base the parallelism off that? I can imagine some deployments being backed by some heavy servers.

if log2(cores) + 4 >= cores:
  return cores
else:
  return log2(cores) + 4

Something like the above growth rate could make sense because it doesn't scale with cores linearly.

Not even sure if this is a good idea, just a suggestion.
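
For illustration only, a compact Go rendering of that growth rate (the PR instead keeps a fixed default read from the ARGO_EXEC_GC_WORKERS environment variable):

import (
	"math"
	"runtime"
)

// suggestedWorkers caps parallelism at min(cores, log2(cores)+4), so the
// worker count grows sub-linearly with the number of available cores.
func suggestedWorkers() int {
	cores := runtime.NumCPU()
	n := int(math.Log2(float64(cores))) + 4
	if n >= cores {
		return cores
	}
	return n
}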

@Joibel (Member, Author) Dec 11, 2023

This is deployed as a separate pod, so users can tune resource limits and ARGO_EXEC_GC_WORKERS to fit. This is a heavily IO-bound process, as we're mostly just dispatching deletes to a remote blob store, and how much it benefits from parallelism depends on how that store is configured, so attempting an autotune is probably not worthwhile.
Adding more workers needs a little more RAM too, so if we want to avoid OOM kills, that needs adjusting.

@isubasinghe (Member)

Makes sense, yeah, I can imagine this being mostly bound on the network. Agreed that an autotune isn't worthwhile.

} else {
response.Task.Status.ArtifactResultsByNode[response.NodeName] = v1alpha1.ArtifactResultNodeStatus{ArtifactResults: response.Results}
// Check for completed tasks
nodesToGo[response.Task]--
@isubasinghe (Member)

Do you think we need to add an ok check here? I see that this should obviously exist, I'm just tempted to be more defensive so that we future-proof this a bit more.

Alternatively add a comment that we expect the response.Task to be present here because it was populated in the previous loop, probably should be accompanied by a comment at line 89 where it gets populated.

Sorry for being pedantic/annoying when it is obvious, I just don't want things to go in the way of NodeStatus, tempted to use a linter to force ok checks everywhere to be honest.
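
For concreteness, a small sketch of the comma-ok guard being suggested (the log call and message are illustrative; the map and fields come from the excerpt above):

// Guard the lookup so a response for a Task that was never registered in
// nodesToGo is logged and skipped, instead of silently creating a fresh
// counter entry when it is decremented below.
if _, ok := nodesToGo[response.Task]; !ok {
	log.Warnf("ignoring result for unregistered task (node %s)", response.NodeName)
	continue
}
response.Task.Status.ArtifactResultsByNode[response.NodeName] = v1alpha1.ArtifactResultNodeStatus{ArtifactResults: response.Results}
nodesToGo[response.Task]--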

@Joibel (Member, Author)

Ok, done

@isubasinghe (Member)

Thanks!

@juliev0 (Contributor) Dec 12, 2023

Sorry, is this the if response.Task == nil check? I'm actually kind of confused why that's in there. It seems redundant, no? I think we should remove it. If we started doing this everywhere, it would make the code unnecessarily longer.

continue
}

err = waitutil.Backoff(retry.DefaultRetry, func() (bool, error) {
@isubasinghe (Member)

As @juliev0 states, the err is not handled

@Joibel (Member, Author)

Fixed

@@ -159,6 +159,9 @@ const (
// EnvAgentPatchRate is the rate that the Argo Agent will patch the Workflow TaskSet
EnvAgentPatchRate = "ARGO_AGENT_PATCH_RATE"

// EnvExecGCWorkers is the number of artifact GC workers for argoexec
EnvExecGCWorkers = "ARGO_EXEC_GC_WORKERS"
@isubasinghe (Member)

IMO, if it's only used for artifact GC, perhaps the env name should also carry this information, just to be more obvious to anyone looking at a .env file. ARGO_EXEC_ARTI_GC_WORKERS?

@Joibel (Member, Author)

Good idea, done.

@isubasinghe (Member)

Nice

Parallelise deletion of artifacts when they are being garbage
collected.

This is theoretically a transparent change and hence can't be
integration tested.

Signed-off-by: Alan Clucas <alan@clucas.org>
@juliev0 self-assigned this Dec 11, 2023
@Joibel force-pushed the parallel-delet branch 2 times, most recently from 53d4f38 to 68559bd on December 11, 2023 22:42
Signed-off-by: Alan Clucas <alan@clucas.org>
@juliev0 (Contributor) commented Dec 12, 2023

Sorry but would you mind adding a unit test?
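
For illustration, a minimal self-contained unit test in the spirit of that request (hypothetical: it exercises a generic worker pool like the sketch near the top of this page rather than the PR's actual GC code):

package main

import (
	"sync"
	"sync/atomic"
	"testing"
)

// TestParallelDeleteCoversAllArtifacts checks that every artifact handed to
// the worker pool is deleted exactly once, regardless of worker count.
func TestParallelDeleteCoversAllArtifacts(t *testing.T) {
	artifacts := []string{"a", "b", "c", "d", "e"}
	var deleted int64

	jobs := make(chan string)
	var wg sync.WaitGroup
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for range jobs {
				atomic.AddInt64(&deleted, 1) // fake delete
			}
		}()
	}
	for _, a := range artifacts {
		jobs <- a
	}
	close(jobs)
	wg.Wait()

	if got := atomic.LoadInt64(&deleted); got != int64(len(artifacts)) {
		t.Fatalf("expected %d deletions, got %d", len(artifacts), got)
	}
}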

@agilgur5 added the area/gc Garbage collection, such as TTLs, retentionPolicy, delays, and more label Jun 9, 2024
@juliev0 added the problem/more information needed Not enough information has been provide to diagnose this issue. label Jul 3, 2024
github-actions bot commented Jul 17, 2024
This PR has been automatically marked as stale because it has not had recent activity and needs further changes. It will be closed if no further activity occurs.

@github-actions bot added the problem/stale This has not had a response in some time label Jul 17, 2024
@Joibel removed the problem/stale This has not had a response in some time label Jul 17, 2024
@Joibel (Member, Author) commented Jul 17, 2024

I'll get back to this one day!

github-actions bot commented Aug 1, 2024

This PR has been automatically marked as stale because it has not had recent activity and needs further changes. It will be closed if no further activity occurs.

@github-actions bot added the problem/stale This has not had a response in some time label Aug 1, 2024
github-actions bot commented Aug 16, 2024
This PR has been closed due to inactivity and lack of changes. If you would like to still work on this PR, please address the review comments and re-open.

@github-actions bot closed this Aug 16, 2024
@tooptoop4 (Contributor) commented:

dis forgotten

@Joibel (Member, Author) commented Oct 22, 2024

> dis forgotten

Nope, but it's not a priority for me right now.

If someone wants to work on it, they are welcome to.
