
fix(backend): Synced ScheduledWorkflow CRs on apiserver startup #11469

Open

hbelmiro wants to merge 3 commits into master from issue-11296
Conversation

@hbelmiro (Contributor) commented Dec 16, 2024

Resolves: #11296

This PR changes the apiserver to patch existing ScheduledWorkflow CRs for each Job on startup so that they reflect the currently deployed KFP version.

Testing

  1. Create recurring runs

  2. Make sure the recurring runs are running

  3. Patch their swf CRs to force failures
    3.1 Get the swf CRs

    kubectl get swf
    NAME                             AGE
    heya8gqgg                        27m
    howdyxt5fj                       26m
    runofpipelinewithconditio5zwcj   60m

    3.2 Patch each swf CR with an invalid workflow spec to force failures. At this point, the recurring runs will start to fail due to the invalid spec.

    kubectl patch scheduledworkflow heya8gqgg --type='merge' -p '{"spec":{"workflow":{"spec":{"dynamicKey1":"dynamicValue1","dynamicKey2":"dynamicValue2"}}}}'
  4. Build a new apiserver image

  5. Edit the ml-pipeline deployment to use the new apiserver image

  6. The new apiserver pod will fix the swf CRs and the recurring runs will run successfully again

See a video demonstrating the test:

kfp-issue-11296.mp4



Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@hbelmiro hbelmiro force-pushed the issue-11296 branch 2 times, most recently from 8ed1796 to 332cc47, on December 16, 2024 19:27

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign chensun for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@google-oss-prow google-oss-prow bot added size/L and removed size/XXL labels Dec 16, 2024
Signed-off-by: Helber Belmiro <helber.belmiro@gmail.com>
@hbelmiro hbelmiro changed the title from "WIP - Update SWF CRs on apiserver startup" to "fix(backend): Synced ScheduledWorkflow CRs on apiserver startup" Dec 17, 2024
@hbelmiro hbelmiro marked this pull request as ready for review December 17, 2024 20:26
@google-oss-prow google-oss-prow bot requested a review from HumairAK December 17, 2024 20:26
@hbelmiro (Contributor Author)

@HumairAK @chensun can you guys PTAL?

@@ -106,6 +106,13 @@ func main() {
    }
    log.SetLevel(level)

    ctx, cancel := context.WithTimeout(context.Background(), 3*time.Minute)
    defer cancel()
    err = resourceManager.SyncSwfCrs(ctx)
Contributor:

The persistence agent seems to already have a reconcile loop for scheduled workflows. If I'm reading the code right, on startup it'll reconcile everything and then handle creates, updates, and deletes.

Could the migration logic be added to the persistence agent instead?

Contributor Author:

It probably could. But that doesn't sound like the persistence agent's responsibility. It doesn't sound like the API server's responsibility either, but I think it fits better there since we have a jobStore.

My original plan was to do it in https://github.com/kubeflow/pipelines/tree/master/backend/src/crd/controller/scheduledworkflow, but we would have to make HTTP calls to the API server. Then we decided to leave it in the API server.

It would be nice to hear others' opinions about that.

Contributor:

@hbelmiro I agree that the persistence agent isn't the best fit. That controller you linked could be a good fit.

I think we should consider adding a KFP version column to the jobs table so that you can skip generating and updating scheduled workflows if the version matches.

If we were to pursue using the scheduled workflow controller, here is an idea (a sketch follows this list):

  • Have the scheduled workflow controller query the API server health endpoint at startup to get the tag_name value, to see what version of KFP we are on. In the background, it could keep querying the API server to see if the version changed.
  • The scheduled workflow controller's reconcile loop checks a pipelines.kubeflow.org/version annotation; if its value doesn't match tag_name, the workflow definition is updated and the annotation is set to the current version.
  • When the API server creates a ScheduledWorkflow object, it sets the pipelines.kubeflow.org/version annotation.
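
A rough sketch of what that annotation check could look like (a sketch only; the helper names are hypothetical, not existing controller code):

    import (
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    )

    const versionAnnotation = "pipelines.kubeflow.org/version"

    // needsUpdate reports whether the ScheduledWorkflow was generated by a
    // different KFP version and should have its workflow definition regenerated.
    func needsUpdate(obj metav1.Object, currentVersion string) bool {
        return obj.GetAnnotations()[versionAnnotation] != currentVersion
    }

    // markVersion stamps the object with the current KFP version after its
    // workflow definition has been regenerated.
    func markVersion(obj metav1.Object, currentVersion string) {
        annotations := obj.GetAnnotations()
        if annotations == nil {
            annotations = map[string]string{}
        }
        annotations[versionAnnotation] = currentVersion
        obj.SetAnnotations(annotations)
    }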

Contributor Author:

I love this idea. My only concern is about making HTTP calls to the API server.
How about implementing it in a follow-up PR?

Contributor:

Sure, no problem! What do you think of the other comment, about adding a KFP version column to the jobs table so that you can skip generating and updating scheduled workflows when the version matches?

Contributor Author:

I like that too. But other than https://github.com/kubeflow/pipelines/blob/master/VERSION, I'm not aware of a field or file from which I can get the version. Are you?
Otherwise, maybe @HumairAK and @chensun can recommend something.

Contributor:

@hbelmiro I think TAG_VERSION can be used; it is surfaced from the health endpoint.
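
For illustration, a minimal sketch of polling the API server for that value (the endpoint path and response shape here are assumptions based on this discussion, not a verified contract):

    import (
        "encoding/json"
        "net/http"
    )

    // fetchKFPVersion asks the API server for its version. The healthz path
    // and the tag_name field are assumptions taken from the comments above.
    func fetchKFPVersion(apiServerURL string) (string, error) {
        resp, err := http.Get(apiServerURL + "/apis/v2beta1/healthz")
        if err != nil {
            return "", err
        }
        defer resp.Body.Close()

        var payload struct {
            TagName string `json:"tag_name"`
        }
        if err := json.NewDecoder(resp.Body).Decode(&payload); err != nil {
            return "", err
        }
        return payload.TagName, nil
    }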

hbelmiro and others added 2 commits December 18, 2024 10:40
Co-authored-by: Matt Prahl <mprahl@users.noreply.github.com>
Signed-off-by: Helber Belmiro <helber.belmiro@gmail.com>
Signed-off-by: Helber Belmiro <helber.belmiro@gmail.com>
Comment on lines +109 to +124
    go reconcileSwfCrs(resourceManager)

    go startRpcServer(resourceManager)
    startHttpProxy(resourceManager)

    clientManager.Close()
}

func reconcileSwfCrs(resourceManager *resource.ResourceManager) {
    ctx, cancel := context.WithTimeout(context.Background(), 3*time.Minute)
    defer cancel()
    err := resourceManager.ReconcileSwfCrs(ctx)
    if err != nil {
        log.Errorf("Could not reconcile the ScheduledWorkflow Kubernetes resources: %v", err)
    }
}
Contributor:

I know the rest of the code doesn't do this for separate goroutines, but I think the correct approach would be something like the following, to properly wait for the goroutine to finish.

Suggested change:

    backgroundCtx, backgroundCancel := context.WithCancel(context.Background())
    defer backgroundCancel()

    wg := sync.WaitGroup{}
    wg.Add(1)
    go reconcileSwfCrs(resourceManager, backgroundCtx, &wg)
    go startRpcServer(resourceManager)
    // This is blocking
    startHttpProxy(resourceManager)

    backgroundCancel()
    clientManager.Close()
    wg.Wait()
}

func reconcileSwfCrs(resourceManager *resource.ResourceManager, ctx context.Context, wg *sync.WaitGroup) {
    defer wg.Done()
    err := resourceManager.ReconcileSwfCrs(ctx)
    if err != nil {
        log.Errorf("Could not reconcile the ScheduledWorkflow Kubernetes resources: %v", err)
    }
}

    go startRpcServer(resourceManager)
    startHttpProxy(resourceManager)

    clientManager.Close()
}

func reconcileSwfCrs(resourceManager *resource.ResourceManager) {
    ctx, cancel := context.WithTimeout(context.Background(), 3*time.Minute)
Contributor:

I don't think we need a timeout context here since we don't really care if it takes longer than 3 minutes when it's run asynchronously. A regular background context should be okay.

    defer cancel()
    err := resourceManager.ReconcileSwfCrs(ctx)
    if err != nil {
        log.Errorf("Could not reconcile the ScheduledWorkflow Kubernetes resources: %v", err)
Contributor:

I agree with your approach of not exiting with an error code in this case, but I'm curious what others think. My concern is that, as a KFP admin, it'd be hard to know that this failed; but it also doesn't seem warranted to keep the API server from running if this can't succeed.

Contributor Author:

If I were implementing this from scratch, I'd definitely exit with an error. But since the original behavior (the one this PR aims to fix) is to run the API server with outdated swfs, existing deployments may start to fail after an upgrade even when they would work with outdated swfs.
At the same time, silently ignoring the failure and letting outdated swfs keep "working" may be even more dangerous than just exiting with an error. (While writing this, I'm leaning more toward exiting with an error.)

Yeah, let's see what others think.

Comment on lines +612 to +617
    if k8sNamespace == "" {
        k8sNamespace = common.GetPodNamespace()
    }
    if k8sNamespace == "" {
        return errors.New("Namespace cannot be empty when deleting a ScheduledWorkflow Kubernetes resource.")
    }
Contributor:

When is this code needed? It seems the jobs table always has the namespace set.

        return failedToReconcileSwfCrsError(err)
    }

    err = r.patchSwfCrSpec(ctx, jobs[i].Namespace, jobs[i].K8SName, scheduledWorkflow.Spec)
Contributor:

Could you use a reflect.DeepEqual check to compare the desired spec and the current spec, to avoid patches when the value is already correct?
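
For example (a sketch using the names from the surrounding diff; currentSwf stands for the CR fetched from the cluster and is a hypothetical name):

    // Requires the "reflect" import. Skip the patch when the spec stored in
    // the cluster already matches the desired spec generated from the job.
    if reflect.DeepEqual(currentSwf.Spec, scheduledWorkflow.Spec) {
        continue // nothing to reconcile for this job
    }

    err = r.patchSwfCrSpec(ctx, jobs[i].Namespace, jobs[i].K8SName, scheduledWorkflow.Spec)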

"Failed to marshal patch spec")
}

_, err = r.getScheduledWorkflowClient(k8sNamespace).Patch(
Contributor:

Could you perform an update instead of a patch? That way it'll catch resourceVersion mismatches (i.e. ScheduledWorkflow was updated between when the object was retrieved and the update request). It'd be good to also retry the whole job iteration flow in that event. The IsConflict function from "k8s.io/apimachinery/pkg/api/errors" should help detect that specific case.
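
A sketch of that pattern using client-go's retry helper (the Get/Update signatures on the scheduled workflow client, plus swfName and desiredSpec, are assumed placeholders):

    import (
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/client-go/util/retry"
    )

    // RetryOnConflict re-runs the closure whenever the update fails with a
    // resourceVersion conflict (detected internally via errors.IsConflict).
    err := retry.RetryOnConflict(retry.DefaultRetry, func() error {
        // Re-fetch so the update carries the latest resourceVersion.
        swf, err := r.getScheduledWorkflowClient(k8sNamespace).Get(ctx, swfName, metav1.GetOptions{})
        if err != nil {
            return err
        }
        swf.Spec = desiredSpec
        _, err = r.getScheduledWorkflowClient(k8sNamespace).Update(ctx, swf)
        return err
    })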

    }

    scheduledWorkflow, err := tmpl.ScheduledWorkflow(jobs[i])
    if err != nil {
Contributor:

Should this ignore not found errors? I could see the case where the database still has a reference to the scheduled workflow but the object no longer exists on the Kubernetes cluster.
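
If so, a sketch of the skip (assuming err comes from fetching the CR from the cluster, and that k8serrors aliases "k8s.io/apimachinery/pkg/api/errors"):

    // If the database still references a ScheduledWorkflow that no longer
    // exists on the cluster, log it and move on instead of failing the sync.
    if k8serrors.IsNotFound(err) {
        log.Infof("ScheduledWorkflow %s/%s not found, skipping", jobs[i].Namespace, jobs[i].K8SName)
        continue
    }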

Successfully merging this pull request may close these issues.

[backend] Updating KFP doesn't update existing scheduled runs