
feat: Parallel artifact GC #11768

Closed · wants to merge 2 commits
Conversation

@Joibel (Member) commented Sep 6, 2023

Parallelise deletion of artifacts when they are being garbage collected.

This is theoretically a transparent change and hence can't be integration tested.

Addresses #9295

Tested on MinIO and real S3.
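
For readers skimming the change, here is a minimal, self-contained sketch of the fan-out pattern being applied (hypothetical names; the real PR wires this into argoexec's artifact GC code and its storage drivers rather than a toy main):

package main

import (
	"fmt"
	"sync"
)

// deleteArtifact stands in for the real blob-store delete call (e.g. an S3
// DeleteObject issued by the artifact driver).
func deleteArtifact(name string) error {
	fmt.Println("deleting", name)
	return nil
}

// deleteAll fans artifact deletions out to a fixed pool of workers so the
// IO-bound deletes overlap instead of running one after another.
func deleteAll(artifacts []string, workers int) {
	jobs := make(chan string)
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for name := range jobs {
				if err := deleteArtifact(name); err != nil {
					fmt.Printf("failed to delete %s: %v\n", name, err)
				}
			}
		}()
	}
	for _, a := range artifacts {
		jobs <- a
	}
	close(jobs)
	wg.Wait()
}

func main() {
	deleteAll([]string{"a.tgz", "b.tgz", "c.tgz"}, 4)
}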

@agilgur5 added the area/artifacts S3/GCP/OSS/Git/HDFS etc label Sep 6, 2023
@terrytangyuan (Member) left a comment

@juliev0 Could you help review this?

@juliev0 (Contributor) commented Sep 7, 2023

> @juliev0 Could you help review this?

Certainly. Will try to do today or tomorrow.

@Joibel (Member, Author) commented Sep 7, 2023

This is missing documentation as it is. I'm not sure how it undrafted itself. I'll try and get that done this evening.

@agilgur5 commented Sep 7, 2023

> This is missing documentation as it is.

Was going to mention docs, since there's a new number-of-workers setting. The new scaling docs from #11731 could potentially be a good place to add it.

Also, the CI failure is on TestArtifactGC, so I imagine it's related to the changes.

@Joibel (Member, Author) commented Sep 7, 2023

The test failures aren't happening locally, but I have an idea.

The new variable isn't configurable - the pod that runs artifact GC is completely hardcoded at the moment. I suggest a separate PR to add configurability to it.

@juliev0 (Contributor) commented Sep 8, 2023

> The test failures aren't happening locally, but I have an idea.
>
> The new variable isn't configurable - the pod that runs artifact GC is completely hardcoded at the moment. I suggest a separate PR to add configurability to it.

This PR was put in recently to add configurability for resources in the Pod. Is this what you mean?

@Joibel (Member, Author) commented Sep 8, 2023

> > The test failures aren't happening locally, but I have an idea.
> >
> > The new variable isn't configurable - the pod that runs artifact GC is completely hardcoded at the moment. I suggest a separate PR to add configurability to it.
>
> This PR was put in recently to add configurability for resources in the Pod. Is this what you mean?

Yep, I missed that PR. That's great, saves me a job.

@juliev0 (Contributor) commented Sep 12, 2023

Looking at the e2e test failure, I'm thinking it probably doesn't have anything to do with your code. This is the test that doesn't delete any artifacts. It looks like it waits for 1 minute to verify that the finalizer gets removed. I wonder if there is potential for this not to be long enough...if it just never did the Workflow reconciliation because maybe it was CPU starved in the CI or something?

If you do a series of empty commits, does it happen again?

@juliev0 (Contributor) left a comment

I'm curious whether you tried testing with multiple WorkflowArtifactGCTasks? I know we never actually enabled the creation of multiple, but there was always the possibility.

}
results[artifact.Name] = v1alpha1.ArtifactResult{Name: artifact.Name, Success: true, Error: nil}
return true, err
})
@juliev0 (Contributor) Sep 12, 2023

It looks like this is a bug in the original code: it's not doing anything if err is non-nil.

@Joibel (Member, Author)

We have guaranteed err == nil here. I've made that explicit now.

@Joibel (Member, Author)

Or, as @isubasinghe pointed out you meant the other err. Oops. Done.
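
For context, a minimal sketch of the shape of the fix being discussed (deleteFromStore is a hypothetical stand-in for the artifact driver's delete call; this is not the PR's verbatim code): the inner delete error is recorded in the ArtifactResult instead of being dropped, and only a nil error reaches the final return.

err = waitutil.Backoff(retry.DefaultRetry, func() (bool, error) {
	// deleteFromStore stands in for the actual artifact-driver delete call.
	if deleteErr := deleteFromStore(&artifact); deleteErr != nil {
		errString := deleteErr.Error()
		results[artifact.Name] = v1alpha1.ArtifactResult{Name: artifact.Name, Success: false, Error: &errString}
		return false, deleteErr // let Backoff retry transient failures
	}
	results[artifact.Name] = v1alpha1.ArtifactResult{Name: artifact.Name, Success: true, Error: nil}
	return true, nil // success: no error is silently dropped
})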

@isubasinghe (Member) left a comment

Generally looks great, but the error isn't handled. I've added more comments, let me know what you think :)

taskList, err := artifactGCTaskInterface.List(context.Background(), metav1.ListOptions{LabelSelector: labelSelector})
if err != nil {
return err
}
taskWorkers := env.LookupEnvIntOr(common.EnvExecGCWorkers, 4)
@isubasinghe (Member)

Should we look into how many cores are available and base the parallelism off that? I can imagine some deployments being backed by some heavy servers.

if log2(cores) + 4 >= cores:
  return cores
else:
  return log2(cores) + 4

Something like the above growth rate could make sense because it doesn't scale with cores linearly.

Not even sure if this is a good idea, just a suggestion.
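
For illustration only, a compact Go rendering of that growth rate (the PR instead keeps a fixed default read from the ARGO_EXEC_GC_WORKERS environment variable):

import (
	"math"
	"runtime"
)

// suggestedWorkers caps parallelism at min(cores, log2(cores)+4), so the
// worker count grows sub-linearly with the number of available cores.
func suggestedWorkers() int {
	cores := runtime.NumCPU()
	n := int(math.Log2(float64(cores))) + 4
	if n >= cores {
		return cores
	}
	return n
}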

@Joibel (Member, Author) Dec 11, 2023

This is deployed as a separate pod, so users can tune resource limits and ARGO_EXEC_GC_WORKERS to fit. This is a heavily IO-bound process, as we're mostly just dispatching deletes to a remote blob store, and how much it benefits from parallelism depends on how that store is configured, so attempting an autotune is probably not worthwhile.
Adding more workers needs a little more RAM too, so if we want to avoid OOM kills, that needs adjusting.

@isubasinghe (Member)

Makes sense, yeah, I can imagine this being mostly bound on the network. Agreed that an autotune isn't worthwhile.

} else {
response.Task.Status.ArtifactResultsByNode[response.NodeName] = v1alpha1.ArtifactResultNodeStatus{ArtifactResults: response.Results}
// Check for completed tasks
nodesToGo[response.Task]--
@isubasinghe (Member)

Do you think we need to add an ok check here? I see that this should obviously exist, I'm just tempted to be more defensive so that we future-proof this a bit more.

Alternatively add a comment that we expect the response.Task to be present here because it was populated in the previous loop, probably should be accompanied by a comment at line 89 where it gets populated.

Sorry for being pedantic/annoying when it is obvious, I just don't want things to go in the way of NodeStatus, tempted to use a linter to force ok checks everywhere to be honest.
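
For concreteness, a small sketch of the comma-ok guard being suggested (the log call and message are illustrative; the map and fields come from the excerpt above):

// Guard the lookup so a response for a Task that was never registered in
// nodesToGo is logged and skipped, instead of silently creating a fresh
// counter entry when it is decremented below.
if _, ok := nodesToGo[response.Task]; !ok {
	log.Warnf("ignoring result for unregistered task (node %s)", response.NodeName)
	continue
}
response.Task.Status.ArtifactResultsByNode[response.NodeName] = v1alpha1.ArtifactResultNodeStatus{ArtifactResults: response.Results}
nodesToGo[response.Task]--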

@Joibel (Member, Author)

Ok, done

@isubasinghe (Member)

Thanks!

@juliev0 (Contributor) Dec 12, 2023

Sorry, is this the if response.Task == nil check? I'm actually kind of confused why that's in there. It seems redundant, no? I think we should remove it. If we started doing this everywhere, it would make the code unnecessarily longer.

continue
}

err = waitutil.Backoff(retry.DefaultRetry, func() (bool, error) {
@isubasinghe (Member)

As @juliev0 states, the err is not handled

@Joibel (Member, Author)

Fixed

@@ -159,6 +159,9 @@ const (
// EnvAgentPatchRate is the rate that the Argo Agent will patch the Workflow TaskSet
EnvAgentPatchRate = "ARGO_AGENT_PATCH_RATE"

// EnvExecGCWorkers is the number of artifact GC workers for argoexec
EnvExecGCWorkers = "ARGO_EXEC_GC_WORKERS"
@isubasinghe (Member)

IMO, if it's only used for artifact GC, perhaps the env name should also carry this information, just to be more obvious to anyone looking at a .env file. ARGO_EXEC_ARTI_GC_WORKERS?

@Joibel (Member, Author)

Good idea, done.

@isubasinghe (Member)

Nice

Parallelise deletion of artifacts when they are being garbage
collected.

This is theoretically a transparent change and hence can't be
integration tested.

Signed-off-by: Alan Clucas <alan@clucas.org>
@juliev0 self-assigned this Dec 11, 2023
@Joibel force-pushed the parallel-delet branch 2 times, most recently from 53d4f38 to 68559bd on December 11, 2023 22:42
Signed-off-by: Alan Clucas <alan@clucas.org>
@juliev0 (Contributor) commented Dec 12, 2023

Sorry but would you mind adding a unit test?
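
For illustration, a minimal self-contained unit test in the spirit of that request (hypothetical: it exercises a generic worker pool like the sketch near the top of this page rather than the PR's actual GC code):

package main

import (
	"sync"
	"sync/atomic"
	"testing"
)

// TestParallelDeleteCoversAllArtifacts checks that every artifact handed to
// the worker pool is deleted exactly once, regardless of worker count.
func TestParallelDeleteCoversAllArtifacts(t *testing.T) {
	artifacts := []string{"a", "b", "c", "d", "e"}
	var deleted int64

	jobs := make(chan string)
	var wg sync.WaitGroup
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for range jobs {
				atomic.AddInt64(&deleted, 1) // fake delete
			}
		}()
	}
	for _, a := range artifacts {
		jobs <- a
	}
	close(jobs)
	wg.Wait()

	if got := atomic.LoadInt64(&deleted); got != int64(len(artifacts)) {
		t.Fatalf("expected %d deletions, got %d", len(artifacts), got)
	}
}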

@agilgur5 added the area/gc Garbage collection, such as TTLs, retentionPolicy, delays, and more label Jun 9, 2024
@juliev0 added the problem/more information needed Not enough information has been provide to diagnose this issue. label Jul 3, 2024
github-actions bot commented Jul 17, 2024
This PR has been automatically marked as stale because it has not had recent activity and needs further changes. It will be closed if no further activity occurs.

@github-actions bot added the problem/stale This has not had a response in some time label Jul 17, 2024
@Joibel removed the problem/stale This has not had a response in some time label Jul 17, 2024
@Joibel (Member, Author) commented Jul 17, 2024

I'll get back to this one day!

github-actions bot commented Aug 1, 2024

This PR has been automatically marked as stale because it has not had recent activity and needs further changes. It will be closed if no further activity occurs.

@github-actions bot added the problem/stale This has not had a response in some time label Aug 1, 2024
github-actions bot commented Aug 16, 2024
This PR has been closed due to inactivity and lack of changes. If you would like to still work on this PR, please address the review comments and re-open.

@github-actions bot closed this Aug 16, 2024
@tooptoop4 (Contributor) commented:

dis forgotten

@Joibel (Member, Author) commented Oct 22, 2024

> dis forgotten

Nope, but it's not a priority for me right now.

If someone wants to work on it, they are welcome to.
