[STRMHELP-315] Rollback on Failed Job Monitoring 🐛 #291

sethsaperstein-lyft · 2023-05-15T22:31:53Z

overview

The job monitoring PR introduced a bug: when job monitoring fails because of a timeout or a failed vertex, the application reaches the DeployFailed state instead of attempting to roll back. This PR simplifies the job submission and job monitoring logic and makes the job attempt a rollback in that case.

additional info

Errors returned by a state in the state machine are added to the status as the last error. The shouldRollback check at the beginning of these states determines whether that error is retryable and, if it is not, moves the application to rolling back. The change is therefore to return an error when monitoring finds a failed vertex or a vertex timeout.

sethsaperstein-lyft · 2023-05-15T23:41:44Z

/PTAL @maghamravi @anandswaminathan

maghamravi · 2023-05-16T17:55:42Z

pkg/controller/flinkapplication/flink_state_machine.go

-		}
-		return updateJobAndReturn(ctx, job, s, allVerticesRunning, app, hash)
+	logger.Info(ctx, "Monitoring job vertices with timeout ", flinkJobVertexTimeout)
+	jobStarted, err := monitorJobStart(job, flinkJobVertexTimeout)


nit: it would be good to rename the method to monitorJobSubmission and jobStarted to status

sethsaperstein-lyft · 2023-05-16

I agree with monitorJobSubmission. Cleaner.

I don't fully understand why jobStarted should be renamed to status. I believe jobStarted is more intuitive as to what monitorJobStart actually returns. If not all vertices are running, jobStarted is false. If all vertices are running, jobStarted is true. If any vertex has failed, it returns an error.

Unless you're suggesting that status should be a string rather than a bool and correspond to something like "NOT_STARTED", "STARTED". Can you clarify status and why it should be status?

maghamravi · 2023-05-17

There are only two hard things in Computer Science: cache invalidation and naming things :)
My rationale for renaming jobStarted to status was primarily due to my other suggestion to rename the method to monitorJobSubmission. Given that the method returns a boolean, status felt more natural. Okay to keep it as jobStarted.

maghamravi · 2023-05-16T18:13:07Z

pkg/controller/flinkapplication/flink_state_machine.go

-	// wait until all vertices have been scheduled and running
-	hasFailure := false
-	failedVertexIndex := -1
+func monitorJobStart(job *client.FlinkJobOverview, timeout config2.Duration) (bool, error) {


Nice to see this method being succinct now!

rollback on job start failure

ec83fbe

sethsaperstein-lyft requested review from anandswaminathan, premsantosh and maghamravi as code owners May 15, 2023 22:31

sethsaperstein-lyft added 2 commits May 15, 2023 15:41

lint overlords

46c26e7

fix log level

9072e02

sethsaperstein-lyft mentioned this pull request May 15, 2023

Monitor Job Vertices State on Deploy #284

Merged

maghamravi reviewed May 16, 2023

maghamravi previously approved these changes May 16, 2023

rename method

1a44de0

sethsaperstein-lyft dismissed maghamravi’s stale review via 1a44de0 May 16, 2023 20:39

maghamravi approved these changes May 17, 2023

sethsaperstein-lyft merged commit bea4e54 into master May 17, 2023

sethsaperstein-lyft deleted the STRMHELP-315_monitor_state_fix branch May 17, 2023 21:09
