Since 1.12, healthy Revision is taken down because of temporary glitch during Pod creation #14660
Comments
@dprotaso some fallout of the other fix ^^
Hmmm... we also added this fix recently: #14453. The motivation there was that if you have quotas or some other constraint, folks wanted it to fail fast.
In theory we would drop this code to fix your problem: serving/pkg/reconciler/revision/resources/pa.go, lines 55 to 64 in 5e23d81.
It looks like the prior code only switched to unreachable when the revision had failed and reachability was pending or unreachable: serving/pkg/reconciler/revision/resources/pa.go, lines 49 to 55 in 621acf8.
Yep, the first code block is what I also think is causing the behavior. Not sure if this can be changed to have some delayed effect only, like when Knative waits for the initial scale of a revision.
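A hedged way to observe the decision that code drives on a live cluster; the resource names assume the reproduction below (service test, so the revision is test-00001):

```sh
# The Revision conditions the reachability logic keys off:
kubectl get revision test-00001 \
  -o jsonpath='{range .status.conditions[*]}{.type}={.status}{"\n"}{end}'

# The PodAutoscaler field that pa.go populates from them; on 1.12 it flips
# to Unreachable as soon as ResourcesAvailable goes False:
kubectl get podautoscaler test-00001 -o jsonpath='{.spec.reachability}{"\n"}'
```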
/assign @dprotaso
@SaschaSchwarze0 - I have a WIP fix here - #14795. But I'd like to add some e2e tests to confirm the changes and prevent future regressions. In the meantime, are you able to validate the patch to see if there are any further issues? I've done it manually and confirmed that it fixes this issue and #14115. I don't think I'll wrap this up for the release next week - but it can land in a future patch release.
Following up - confirmed this only really happens during an upgrade. I added a test for that. I discovered (#14795 (comment)) that, when upgrading, the older controller could still cause this error. What happens is that the older revision reconciler detects the config map change for the queue-proxy sidecar image and then tries to update the deployment. If you are trying to upgrade from a bad version, then I would recommend updating the
@dprotaso Sorry for not having had time earlier. I see this has been merged already. But the good news is that I do not see any issues after having patched it. Also, the reported issue is resolved. The Revision still goes into Ready=False with reason FailedCreate, but the Deployment stays up and I can continue to reach the app. Once Pod creation is functional again (by removing the broken webhook), the Revision eventually goes back to Active. Thanks for fixing it.
@dprotaso We had an outage with exactly this problem when upgrading from the bad version 1.12.0 to 1.13.1. Are we supposed to update the controller deployment image to 1.13.1 manually first? Though we are using the operator, which may revert it back.
Yes
Unsure if you're able to override the controller using some operator property |
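A hedged sketch of that manual step, assuming the upstream release image naming and a 1.13.1 target; when the operator manages the installation it may revert this change, as noted above:

```sh
# Bump the Serving controller image before rolling out the rest of the
# upgrade, so the old reconciler never reacts to the queue-proxy image
# change in config-deployment. The image tag is illustrative.
kubectl -n knative-serving set image deployment/controller \
  controller=gcr.io/knative-releases/knative.dev/serving/cmd/controller:v1.13.1
```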
In Knative 1.12, there is a change in how the reachability of a Revision is calculated: #14309. This change has a negative side effect in the following scenario:
kn service create test --image ghcr.io/src2img/http-synthetics:latest --scale-min 1 --scale-max 1
kubectl -n knative-serving patch configmap config-deployment -p '{"data":{"queue-sidecar-image":"gcr.io/knative-releases/knative.dev/serving/cmd/queue:v1.12.1"}}'
The following is now happening:
Knative Serving updates the Deployment's queue-proxy image.
The Deployment controller creates a new ReplicaSet for this with a desired replica count of 1.
The ReplicaSet controller fails to create the Pod for this ReplicaSet, which leads to the Deployment reporting the failure (reason FailedCreate) in its status.
Without Knative in the picture, the Deployment would still be up, as the Pod of the old ReplicaSet still exists. The ReplicaSet controller goes into exponential backoff retrying the Pod creation, and eventually, when the webhook communication works again, it will succeed.
Knative Serving 1.11 behaves exactly like this: it keeps the Revision active and the KService therefore fully reachable.
Knative Serving 1.12 breaks here. It determines that the Revision is not reachable anymore and propagates the Deployment's reason (FailedCreate) to the Revision. Because the Revision is now considered unreachable, its Deployment is scaled down to 0. This breaks the availability of the KService.
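A hedged set of commands to see this chain on an affected cluster; names follow the kn example above (the Deployment is named after the revision):

```sh
# The Deployment keeps its old Pod but reports the failure in a condition
# with reason FailedCreate:
kubectl get deployment test-00001-deployment \
  -o jsonpath='{range .status.conditions[*]}{.type}={.status} reason={.reason}{"\n"}{end}'

# The ReplicaSet controller's retries surface as FailedCreate events:
kubectl get events --field-selector reason=FailedCreate

# On 1.12 the Revision then reports Ready=False with the propagated reason:
kubectl get revision test-00001 \
  -o jsonpath='{.status.conditions[?(@.type=="Ready")].reason}{"\n"}'
```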
Expected Behavior
A temporary problem in creating a Pod should not cause the KService to be down.
Actual Behavior
The KService is down since Serving 1.12. One can repair the Revision by deleting its Deployment. The new one will come up assuming Pod creation works.
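A hedged sketch of that repair, using the revision name from the example above (Knative names the Deployment after the revision):

```sh
# Deleting the Deployment lets the revision reconciler recreate it; the new
# one comes up as long as Pod creation works again.
kubectl delete deployment test-00001-deployment
```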
Without knowing the exact design details, so this is just an opinion: in general, I think an active Revision may go into a failed status, but it should not do so that quickly. For example, if I have a Revision whose image I deleted, the new Pod will not come up and Knative may eventually mark the Revision as failed; but for temporary problems that resolve within a few minutes, it should just give the Deployment time to become healthy again (especially if it was never really broken).
Steps to Reproduce the Problem
Included in above.
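The thread above attributes the Pod-creation glitch to a broken webhook. A hedged way to simulate that kind of temporary failure (all names here are illustrative, not taken from the report): register a validating webhook for Pod creation whose backend service does not exist and whose failurePolicy is Fail, then remove it again so the ReplicaSet can recover. Note that this blocks Pod creation cluster-wide; on a shared cluster, scope it with a namespaceSelector.

```sh
kubectl apply -f - <<'EOF'
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: broken-pod-webhook
webhooks:
- name: broken.pods.example.com
  admissionReviewVersions: ["v1"]
  sideEffects: None
  failurePolicy: Fail           # reject Pod creation while the backend is unreachable
  rules:
  - apiGroups: [""]
    apiVersions: ["v1"]
    resources: ["pods"]
    operations: ["CREATE"]
  clientConfig:
    service:                    # a service that intentionally does not exist
      name: does-not-exist
      namespace: default
      path: /validate
EOF

# Later, remove it again so Pod creation recovers and the Revision goes back to Active:
kubectl delete validatingwebhookconfiguration broken-pod-webhook
```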