[STRMCMP-1658] Enable Deploys to Fallback without State #287

sethsaperstein-lyft · 2023-04-17T21:33:57Z

overview

Add a CRD parameter, fallbackWithoutState, that when enabled allows a deploy to proceed without a savepoint or checkpoint if the savepoint fails and there are no usable externalized checkpoints.

When both the savepoint and the checkpoints are failing, this has a similar effect to savepointDisabled. The benefit of this option is that it only takes effect when there is no other path forward, so it acts as a fallback that does not require manual intervention. It is disabled by default.
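As a rough illustration of "disabled by default", the sketch below decodes a trimmed-down spec in Go. The `AppSpec` type, its JSON tags, and `ParseSpec` are hypothetical stand-ins for illustration, not the operator's actual types; only the two option names come from this PR.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// AppSpec is a hypothetical, trimmed-down stand-in for the FlinkApplication
// spec. Only the two flags discussed in this PR are modelled, and the JSON
// tags are assumptions.
type AppSpec struct {
	SavepointDisabled    bool `json:"savepointDisabled,omitempty"`
	FallbackWithoutState bool `json:"fallbackWithoutState,omitempty"`
}

// ParseSpec decodes a JSON spec. A missing fallbackWithoutState key leaves
// the field at Go's zero value (false), which matches "disabled by default".
func ParseSpec(raw []byte) (AppSpec, error) {
	var spec AppSpec
	err := json.Unmarshal(raw, &spec)
	return spec, err
}

func main() {
	for _, raw := range []string{`{}`, `{"fallbackWithoutState": true}`} {
		spec, err := ParseSpec([]byte(raw))
		if err != nil {
			panic(err)
		}
		fmt.Printf("%s -> fallbackWithoutState=%v\n", raw, spec.FallbackWithoutState)
	}
}
```

Relying on the zero value means existing applications that never set the field keep their current behavior.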

The state graph does not change. Instead, the recovering state, which attempts to find an externalized checkpoint, transitions to submittingJob when this option is enabled and no externalized checkpoint is found.
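The transition described above can be sketched as a small decision function. This is a minimal sketch under stated assumptions: the `Phase` type, the `DeployFailed` outcome, and `NextAfterRecovering` are illustrative names, and treating a lookup error the same as a missing checkpoint follows the review suggestion in this thread rather than a confirmed detail of the merged code.

```go
package main

import (
	"errors"
	"fmt"
)

// Phase names mirror the states mentioned in this PR. The string values and
// the DeployFailed outcome are illustrative assumptions.
type Phase string

const (
	SubmittingJob Phase = "SubmittingJob"
	DeployFailed  Phase = "DeployFailed"
)

// NextAfterRecovering decides where the recovering state goes next.
// checkpointPath is the externalized checkpoint that was found (empty if
// none), and lookupErr is any error from looking it up. It returns the next
// phase and the path to restore from (empty means "start without state").
func NextAfterRecovering(checkpointPath string, lookupErr error, fallbackWithoutState bool) (Phase, string) {
	if lookupErr == nil && checkpointPath != "" {
		// Normal recovery: restore from the externalized checkpoint.
		return SubmittingJob, checkpointPath
	}
	if fallbackWithoutState {
		// No usable state, but the fallback is enabled: submit with an
		// empty restore path.
		return SubmittingJob, ""
	}
	// Fallback disabled: behavior is unchanged and the deploy fails.
	return DeployFailed, ""
}

func main() {
	cases := []struct {
		path     string
		err      error
		fallback bool
	}{
		{"s3://bucket/chk-42", nil, false},
		{"", nil, false},
		{"", nil, true},
		{"", errors.New("lookup failed"), true},
	}
	for _, c := range cases {
		phase, restore := NextAfterRecovering(c.path, c.err, c.fallback)
		fmt.Printf("path=%q err=%v fallback=%v -> %s (restore from %q)\n",
			c.path, c.err, c.fallback, phase, restore)
	}
}
```

Keeping the decision in one place avoids adding a node to the state graph: the only thing that changes is which restore path submittingJob receives.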

sethsaperstein-lyft · 2023-04-25T22:58:09Z

/PTAL @premsantosh @leoluoInSea @maghamravi

sethsaperstein-lyft · 2023-04-25T23:00:05Z

additional info

The original scope of this change was to handle two cases:

1. Savepoints and checkpoints are failing and we would like to start without state.
2. Submitting the job is failing but would succeed if it did not use a savepoint (e.g. a job graph change).

The original solution involved an extra state in the state graph for submitting without state. On upgrade, this state would be reached in the two cases above as follows:

savepointing -> recovering -> submittingJobWithoutState
savepointing -> submittingJob -> submittingJobWithoutState

After further review, it appears that case 2 can be solved by enabling the option allowNonRestoredState. This leaves only the first case to handle, in which the savepoint fails and no externalized checkpoints are available. In this case we do not need a different state; instead we can reuse the submittingJob state, which will proceed with an empty savepointPath. This solves the issue while simplifying the solution. Consulted @maghamravi, who agrees.
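To make the "empty savepointPath" idea concrete, here is a hedged sketch of how a submission body might be built. The JSON field names follow the request body of Flink's REST endpoint `POST /jars/:jarid/run`; the `RunRequest` type, `BuildRunRequest`, and the example class name and path are hypothetical and not taken from the operator's code.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// RunRequest models a subset of the body accepted by Flink's
// POST /jars/:jarid/run endpoint. The Go type is a hypothetical stand-in
// for whatever the operator uses internally.
type RunRequest struct {
	EntryClass            string `json:"entryClass,omitempty"`
	Parallelism           int    `json:"parallelism,omitempty"`
	SavepointPath         string `json:"savepointPath,omitempty"`
	AllowNonRestoredState bool   `json:"allowNonRestoredState,omitempty"`
}

// BuildRunRequest returns the JSON body for a submission. An empty
// savepointPath is omitted from the body, so the job starts without
// restored state.
func BuildRunRequest(entryClass string, parallelism int, savepointPath string, allowNonRestored bool) (string, error) {
	body, err := json.Marshal(RunRequest{
		EntryClass:            entryClass,
		Parallelism:           parallelism,
		SavepointPath:         savepointPath,
		AllowNonRestoredState: allowNonRestored,
	})
	return string(body), err
}

func main() {
	// Case 1: no savepoint and no checkpoint, so submit without state.
	stateless, err := BuildRunRequest("com.example.Job", 2, "", false)
	if err != nil {
		panic(err)
	}
	fmt.Println(stateless)

	// Case 2: restore from a savepoint but tolerate state that no longer
	// maps to an operator (for example after a job graph change).
	restored, err := BuildRunRequest("com.example.Job", 2, "s3://bucket/sp-1", true)
	if err != nil {
		panic(err)
	}
	fmt.Println(restored)
}
```

The same submission path serves both outcomes, which is why no separate submittingJobWithoutState state is needed.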

sethsaperstein-lyft · 2023-04-25T23:02:01Z

integ/operator-test-app/Dockerfile

@@ -9,7 +9,7 @@ ENV PATH=$FLINK_HOME/bin:$HADOOP_HOME/bin:$MAVEN_HOME/bin:$PATH
 COPY . /code

 # Configure Flink version
-ENV FLINK_VERSION=1.11.6 \
+ENV FLINK_VERSION=1.8.1 \


Returning this to 1.8.1, which corresponds to the image on Docker Hub. I attempted to update this in the integration test PR but am reverting for now, so as not to deal with further non-trivial configuration options given the 8GB of total memory available on GitHub Actions.

sethsaperstein-lyft · 2023-04-25T23:04:35Z

integ/setup.sh

-minikube image load lyft/operator-test-app:b1b3cb8e8f98bd41f44f9c89f8462ce255e0d13f.1
-minikube image load lyft/operator-test-app:b1b3cb8e8f98bd41f44f9c89f8462ce255e0d13f.2
-
+cd integ/operator-test-app


Building the test app image as part of the integration test, as opposed to relying on the remote image, whose contents may not match the contents of integ/operator-test-app.

anandswaminathan

~~You are introducing a regression here when fallbackWithoutState is not set.~~

~~You are not failing deploy when there is an error~~

I see that you have handled the case below. Looks good

anandswaminathan

The logic looks good. Wondering if the duplication of lines can be avoided. Something like

failDeploy := false

if err != nil {
  failDeploy = true
} else if path == "" {
  ...
  failDeploy = true
}

if failDeploy {
  if FallbackWithoutState {
    failDeploy = false
  }
}

if failDeploy {
  s.deployFailed
}

submitjob

premsantosh · 2023-04-26T05:08:32Z

integ/checkpoint_failure_test.go

+	err = s.Util.ExecuteCommand("minikube", "ssh", "touch /tmp/checkpoints/fail && chmod 0644 /tmp/checkpoints/fail")
+	c.Assert(err, IsNil)
+
+	// wait a bit for it to start failing


Is this comment incorrect, given that on line 115 you assert the error is nil?

Oh! You are not verifying that the app is failing, are you?

Lines have moved around since, but if you are referring to s.Util.GetFlinkApplication, this gets the k8s FlinkApplication CR, and the err indicates whether there were errors retrieving the CR object, as opposed to anything related to the status of the actual job.

On line 135, s.Util.FlinkAPIGet(newApp, endpoint) gets the actual Flink job and its status to show that the job itself is healthy.


sethsaperstein-lyft · 2023-05-04T18:37:00Z

Re: avoiding the duplicated lines in the deploy-failure handling: agreed, refactored.


sethsaperstein-lyft added 9 commits March 17, 2023 16:14

add fields

42eb35f

merge master

8932cf3

integ test failing due to vertices timeout of 0

f94219f

about to make changes to rid of new state in graph

caadae6

revamped to not modify state. test passing

68ad78f

fix image name

5e1bc6b

update logging

a397152

unit test

841924c

fix import for lint

f05c41d

sethsaperstein-lyft requested review from maghamravi, leoluoInSea and premsantosh April 25, 2023 22:57

sethsaperstein-lyft marked this pull request as ready for review April 25, 2023 22:57

sethsaperstein-lyft requested a review from anandswaminathan as a code owner April 25, 2023 22:57

sethsaperstein-lyft commented Apr 25, 2023


anandswaminathan requested changes Apr 26, 2023


anandswaminathan previously approved these changes Apr 26, 2023


premsantosh reviewed Apr 26, 2023


sethsaperstein-lyft added 3 commits May 3, 2023 11:25

skip tests

d004ec2

change order of status check and fail

cdc027b

Revert "skip tests"

d2cfc6d

This reverts commit d004ec2.

sethsaperstein-lyft dismissed anandswaminathan’s stale review via d2cfc6d May 3, 2023 19:25

sethsaperstein-lyft added 5 commits May 3, 2023 13:42

Revert "Revert "skip tests""

30d4644

This reverts commit d2cfc6d.

skip test

c92aae4

fix test with task failure now that there's the vertex monitoring

d9350f4

remove skips

924725b

lint and refactor

9aacdbf

sethsaperstein-lyft requested review from anandswaminathan and premsantosh and removed request for leoluoInSea May 4, 2023 18:38

sethsaperstein-lyft added 5 commits May 4, 2023 13:21

Revert "remove skips"

2845963

This reverts commit 924725b.

limit checkpoint lookback

25b4859

Revert "Revert "remove skips""

8781624

This reverts commit 2845963.

add checkpoint timeouts

062390f

Revert "Revert "Revert "remove skips"""

d97fbb5

This reverts commit 8781624.

anandswaminathan previously approved these changes May 5, 2023


sethsaperstein-lyft added 2 commits May 4, 2023 17:43

don't allow checkpoints to trigger

bcb53bc

Revert "Revert "Revert "Revert "remove skips""""

fb202d4

This reverts commit d97fbb5.

sethsaperstein-lyft dismissed anandswaminathan’s stale review via fb202d4 May 5, 2023 00:44

sethsaperstein-lyft added 2 commits May 4, 2023 17:47

remove redundant line

f3a3775

lint

3b2e806

anandswaminathan approved these changes May 5, 2023


sethsaperstein-lyft merged commit db5a5de into master May 5, 2023

sethsaperstein-lyft deleted the STRMCMP-1658_fallback_without_state branch May 5, 2023 18:31
