-
Notifications
You must be signed in to change notification settings - Fork 157
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[STRMCMP-1658] Enable Deploys to Fallback without State #287
[STRMCMP-1658] Enable Deploys to Fallback without State #287
Conversation
additional infoThe original scope of this change was to handle 2 cases:
The original solution involved an extra state in the state graph for submitting without state. This state would be reached in the above two cases on upgrade as such:
After further review, it appears that case 2 can be solved by enabling the option |
@@ -9,7 +9,7 @@ ENV PATH=$FLINK_HOME/bin:$HADOOP_HOME/bin:$MAVEN_HOME/bin:$PATH | |||
COPY . /code | |||
|
|||
# Configure Flink version | |||
ENV FLINK_VERSION=1.11.6 \ | |||
ENV FLINK_VERSION=1.8.1 \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
returning this back to 1.8.1 which corresponds to the image in dockerhub. Attempted to update this in the integration test PR but am reverting for now as not to deal with further configuration options that are non-trivial when dealing with 8GB memory total for github actions
minikube image load lyft/operator-test-app:b1b3cb8e8f98bd41f44f9c89f8462ce255e0d13f.1 | ||
minikube image load lyft/operator-test-app:b1b3cb8e8f98bd41f44f9c89f8462ce255e0d13f.2 | ||
|
||
cd integ/operator-test-app |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Building the test app image as part of the integration test as opposed to relying on the remote image whose contents may not match the integ/operator-test-app contents
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are introducing a regression here when fallbackWithoutState is not set.
You are not failing deploy when there is an error
I see that you have handled the case below. Looks good
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The logic looks good. Wondering if the duplication of lines can be avoided. Something like
failDeploy := false
if err !=nil {
failDeploy = true
} else path =="" {
...
failDeploy = true
}
if failDeploy {
if FallbackWithoutState {
failDeploy = false
}
}
if failDeploy {
s.deplpyFailed
}
submitjob
err = s.Util.ExecuteCommand("minikube", "ssh", "touch /tmp/checkpoints/fail && chmod 0644 /tmp/checkpoints/fail") | ||
c.Assert(err, IsNil) | ||
|
||
// wait a bit for it to start failing |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this comment incorrect because on line 115 you assert error is nil?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh! You can not verifying that the app is failing is it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lines have moved around since but if you are referring to s.Util.GetFlinkApplication
this gets the k8s FlinkApplication CR and the err corresponds to if there were errors retrieving the CR object as opposed to anything related to the status of the actual job.
Ln 135 s.Util.FlinkAPIGet(newApp, endpoint)
gets the actual flink job and corresponding status to show that the job itself is healthy
This reverts commit d004ec2.
agreed. refactored |
overview
Add CRD parameter
fallbackWithoutState
to be enabled to allow deploys to occur without a savepoint or checkpoint in the case that the savepoint fails and there are no usable externalized checkpoints.In the case that the savepoint and checkpoints are failing, this has a similar effect as savepointDisabled. The benefit of this configuration option is that it only exhibits this behavior when there is no other path forward and thus acts as a fallback without potential intervention. This is disabled by default.
The state graph does not change, but rather, the recovering state that attempts to find an externalized checkpoint forwards to submittingJob in the case that this is enabled and there is no externalized checkpoint