
Add scheduler func for clearing batch scheduling on completed #1079

Merged
merged 5 commits into kubeflow:master on Dec 9, 2020

Conversation

nicholas-fwang (Contributor)

Related to issue #1062

When a SparkApplication that uses the Volcano scheduler transitions to a completed or failed state, its pod group is not deleted.
The pod group can only be removed from the Volcano queue by deleting the SparkApplication or waiting for the TTL timeout.
This causes newly submitted pod groups to get stuck (pending) in the queue until the SparkApplication, and with it the pod group owned via ownerReference, is removed.
In my opinion, it would be better to remove the pod group when the SparkApplication transitions to a completed or failed state.
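
A minimal sketch of what that Volcano-side cleanup ends up looking like in this PR (the method name and the NotFound handling are settled during the review below; errors here refers to k8s.io/apimachinery/pkg/api/errors):

func (v *VolcanoBatchScheduler) CleanupOnCompletion(app *v1beta2.SparkApplication) error {
	// Delete the pod group created for this application so it no longer
	// occupies capacity in the Volcano queue.
	podGroupName := v.getAppPodGroupName(app)
	err := v.volcanoClient.SchedulingV1beta1().PodGroups(app.Namespace).Delete(podGroupName, &metav1.DeleteOptions{})
	if err != nil && !errors.IsNotFound(err) {
		return err
	}
	// A NotFound error is ignored: a missing pod group means there is
	// nothing left to clean up.
	return nil
}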

@nicholas-fwang (Contributor, Author)

If you want some test code added for the Volcano scheduler, I will write it :)

@@ -25,4 +25,5 @@ type BatchScheduler interface {

ShouldSchedule(app *v1beta2.SparkApplication) bool
DoBatchSchedulingOnSubmission(app *v1beta2.SparkApplication) error
ClearBatchSchedulingOnCompleted(app *v1beta2.SparkApplication) error
Collaborator:

Nit: CleanupOnCompletion.

func (v *VolcanoBatchScheduler) ClearBatchSchedulingOnCompleted(app *v1beta2.SparkApplication) error {
// remove pod group
podGroupName := v.getAppPodGroupName(app)
if err := v.volcanoClient.SchedulingV1beta1().PodGroups(app.Namespace).Delete(podGroupName, &metav1.DeleteOptions{}); err != nil {
Collaborator:

You can do:

return v.volcanoClient.SchedulingV1beta1().PodGroups(app.Namespace).Delete(podGroupName, &metav1.DeleteOptions{})

@@ -597,6 +597,11 @@ func (c *Controller) syncSparkApplication(key string) error {
glog.Errorf("failed to update SparkApplication %s/%s: %v", app.Namespace, app.Name, err)
return err
}

if err := c.clearBatchScheduling(app, appCopy); err != nil {
glog.Errorf("failed to clean up batch scheduling %s/%s: %v", app.Namespace, app.Name, err)
Collaborator:

failed to clean up batch scheduling config for SparkApplication

@@ -1001,3 +1006,19 @@ func (c *Controller) hasApplicationExpired(app *v1beta2.SparkApplication) bool {

return false
}

// Clean up batch scheduler if use
Collaborator:

Clean up batch scheduling config if applicable..

@nicholas-fwang (Contributor, Author)

@liyinan926 Thanks! Updated the code per your comments.

@nicholas-fwang (Contributor, Author)

@liyinan926 PTAL

@liyinan926 (Collaborator) left a comment:

Sorry for the delay in review.

@@ -597,6 +597,11 @@ func (c *Controller) syncSparkApplication(key string) error {
glog.Errorf("failed to update SparkApplication %s/%s: %v", app.Namespace, app.Name, err)
return err
}

if err := c.cleanUpBatchScheduling(app, appCopy); err != nil {
Collaborator:

Just realized that we have a method for cleaning up application resources like the driver UI service. We should add deletion of this resource to that method.

Contributor (Author):

Thanks for the review!
I wonder if the function you are talking about is deleteSparkResources.
As far as I understand, that function is called before a SparkApplication retry, i.e., when the application is not in a normal state (invalidating, failed submission, etc.).
In this cleanup, the pod group is deleted only when the application is in a completed or failed state, which means the SparkApplication will no longer restart. If the operator deleted the pod group while the application is invalidating or in failed submission, the application would not be queued to Volcano and would not restart until it is re-queued.
Please comment if this is different from what I understood.

Collaborator:

Makes sense.

@@ -597,6 +597,11 @@ func (c *Controller) syncSparkApplication(key string) error {
glog.Errorf("failed to update SparkApplication %s/%s: %v", app.Namespace, app.Name, err)
return err
}

if err := c.cleanUpBatchScheduling(app, appCopy); err != nil {
Collaborator:

Makes sense.

@@ -25,4 +25,5 @@ type BatchScheduler interface {

ShouldSchedule(app *v1beta2.SparkApplication) bool
DoBatchSchedulingOnSubmission(app *v1beta2.SparkApplication) error
CleanupOnCompleted(app *v1beta2.SparkApplication) error
Collaborator:

To be consistent, let's rename this to CleanupOnCompletion.

Contributor (Author):

Done

@@ -1001,3 +1006,19 @@ func (c *Controller) hasApplicationExpired(app *v1beta2.SparkApplication) bool {

return false
}

// Clean up batch scheduling config if applicable.
func (c *Controller) cleanUpBatchScheduling(oldApp, newApp *v1beta2.SparkApplication) error {
Collaborator:

Please rename this to cleanUpOnTermination, which sounds more generic and can be extended later to handle deletion of other resources, e.g., the UI service. Please update the comments also.

Contributor (Author):

Fixed it.

state := newApp.Status.AppState.State
// If new app state is completed or failed, no more app is running,
// and only once needs to clean up on completed or failed
if (state == v1beta2.CompletedState || state == v1beta2.FailedState) &&
Collaborator:

I suggest moving the state check to the caller of this method.

Contributor (Author):

Thanks, I moved it to the caller.
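
As a rough illustration of that change, the caller-side guard in syncSparkApplication could look something like this (placement and error message are assumptions based on the snippets above):

// Clean up batch scheduling resources only once the application has
// reached a terminal state.
if appCopy.Status.AppState.State == v1beta2.CompletedState ||
	appCopy.Status.AppState.State == v1beta2.FailedState {
	if err := c.cleanUpOnTermination(app, appCopy); err != nil {
		glog.Errorf("failed to clean up resources for SparkApplication %s/%s: %v", app.Namespace, app.Name, err)
		return err
	}
}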

@@ -1001,3 +1009,17 @@ func (c *Controller) hasApplicationExpired(app *v1beta2.SparkApplication) bool {

return false
}

// Clean up resources such as batch scheduler config if applicable.
Collaborator:

Comment needs to be updated.

Contributor (Author):

Updated it:
Clean up when the spark application is terminated.

// Clean up resources such as batch scheduler config if applicable.
func (c *Controller) cleanUpOnTermination(oldApp, newApp *v1beta2.SparkApplication) error {
if needScheduling, scheduler := c.shouldDoBatchScheduling(newApp); needScheduling {
// If new app state is completed or failed, no more app is running,
Collaborator:

Comment needs update.

Contributor (Author):

Updated it:
batch scheduler is cleaned up only when state is changed to completion state

func (v *VolcanoBatchScheduler) CleanupOnCompletion(app *v1beta2.SparkApplication) error {
podGroupName := v.getAppPodGroupName(app)
//Remove pod group for Spark Application
return v.volcanoClient.SchedulingV1beta1().PodGroups(app.Namespace).Delete(podGroupName, &metav1.DeleteOptions{})
Collaborator:

We should ignore the error if it's NotFound.

Contributor (Author):

I only saw this comment just now; I will update the code to ignore the NotFound error here.

@@ -160,6 +160,12 @@ func (v *VolcanoBatchScheduler) syncPodGroup(app *v1beta2.SparkApplication, size
return nil
}

func (v *VolcanoBatchScheduler) CleanupOnCompletion(app *v1beta2.SparkApplication) error {
podGroupName := v.getAppPodGroupName(app)
//Remove pod group for Spark Application
Collaborator:

This comment is not needed.

Contributor (Author):

Removed it.

func (c *Controller) cleanUpOnTermination(oldApp, newApp *v1beta2.SparkApplication) error {
if needScheduling, scheduler := c.shouldDoBatchScheduling(newApp); needScheduling {
// batch scheduler is cleaned up only when state is changed to completion state
if newApp.Status.AppState.State != oldApp.Status.AppState.State {
Collaborator:

This check is not needed if deletion is idempotent, i.e., if we ignore NotFound error above.

Contributor (Author):

Looks reasonable.
Updated to:

if needScheduling, scheduler := c.shouldDoBatchScheduling(newApp); needScheduling {
	if err := scheduler.CleanupOnCompletion(newApp); err != nil && !errors.IsNotFound(err) {
		return err
	}
}

if err := scheduler.CleanupOnCompletion(newApp); err != nil {
return err
}
if err := scheduler.CleanupOnCompletion(newApp); err != nil && !errors.IsNotFound(err) {
Collaborator:

Instead of dealing with the NotFound error here, why not return a nil if so in CleanupOnCompletion?

Contributor (Author):

You're right :)
Updated the NotFound check to be inside CleanupOnCompletion.
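
Putting the pieces together, the termination cleanup on the controller side presumably ends up roughly as below (a sketch only; CleanupOnCompletion now swallows the NotFound error itself):

// Clean up resources such as batch scheduler config if applicable.
func (c *Controller) cleanUpOnTermination(oldApp, newApp *v1beta2.SparkApplication) error {
	if needScheduling, scheduler := c.shouldDoBatchScheduling(newApp); needScheduling {
		if err := scheduler.CleanupOnCompletion(newApp); err != nil {
			return err
		}
	}
	return nil
}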

@liyinan926 (Collaborator) left a comment:

LGTM

@liyinan926 liyinan926 merged commit 8ac786b into kubeflow:master Dec 9, 2020
@nicholas-fwang nicholas-fwang deleted the cleanup-batch-scheduler branch December 14, 2020 02:44
jbhalodia-slack pushed a commit to jbhalodia-slack/spark-operator that referenced this pull request Oct 4, 2024
Add scheduler func for clearing batch scheduling on completed (kubeflow#1079)

* Add scheduler func for clearing batch scheduling on completed

* Rename batch scheduler func

* Rename spark app clean up func

* Update comment for clean up function

* Remove app state changed checking when clean up batch scheduler