
Since 1.12, a healthy Revision is taken down because of a temporary glitch during Pod creation #14660

Closed
SaschaSchwarze0 opened this issue Nov 23, 2023 · 10 comments · Fixed by #14795

@SaschaSchwarze0 (Contributor)

In Knative 1.12, there is a change in how the reachability of a Revision is calculated: #14309. That change has a negative side effect in the following scenario:

  1. You have Kubernetes with Knative Serving
  2. A simple service with currentScale=1 exists, e.g. kn service create test --image ghcr.io/src2img/http-synthetics:latest --scale-min 1 --scale-max 1
  3. An admission webhook for Pod creation is in place with failurePolicy=Fail, and the webhook is not functional (in our case the cause was a temporary network glitch between the Kubernetes control plane and the worker nodes running the service). To reproduce this, apply a webhook like the following:
cat <<EOF | kubectl create -f -
apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:
  name: dummy-webhook
webhooks:
- admissionReviewVersions:
  - v1
  clientConfig:
    service:
      name: non-existing
      namespace: non-existing
      path: /defaulting
      port: 443
  failurePolicy: Fail
  matchPolicy: Equivalent
  name: webhook.non-existing.dev
  objectSelector: {}
  reinvocationPolicy: IfNeeded
  rules:
  - apiGroups:
    - ""
    apiVersions:
    - v1
    operations:
    - CREATE
    resources:
    - pods
    scope: '*'
  sideEffects: None
  timeoutSeconds: 5
EOF
  4. Trigger an update on the Knative configuration that causes all Deployments to be updated, for example: kubectl -n knative-serving patch configmap config-deployment -p '{"data":{"queue-sidecar-image":"gcr.io/knative-releases/knative.dev/serving/cmd/queue:v1.12.1"}}'
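To watch the failure unfold after step 4, a plain watch works (optional; run it in the namespace of the test service):

kubectl get revision,deployment,replicaset,pods -w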

The following is now happening:

  1. Knative Serving updates the Deployment's queue-proxy image.

  2. The Deployment controller creates a new ReplicaSet for this with desiredScale 1.

  3. The ReplicaSet controller fails to create the Pod for this ReplicaSet, which leaves the Deployment with this status:

    status:
      conditions:
      - lastTransitionTime: "2023-11-21T13:54:56Z"
        lastUpdateTime: "2023-11-22T10:49:39Z"
        message: ReplicaSet "http-synthetics-00001-deployment-bcbddf84" has successfully progressed.
        reason: NewReplicaSetAvailable
        status: "True"
        type: Progressing
      - lastTransitionTime: "2023-11-22T10:49:59Z"
        lastUpdateTime: "2023-11-22T10:49:59Z"
        message: Deployment does not have minimum availability.
        reason: MinimumReplicasUnavailable
        status: "False"
        type: Available
      - lastTransitionTime: "2023-11-22T10:49:59Z"
        lastUpdateTime: "2023-11-22T10:49:59Z"
        message: 'Internal error occurred: failed calling webhook "webhook.non-existing.dev": failed to call webhook: ...'
        reason: FailedCreate
        status: "True"
        type: ReplicaFailure
      observedGeneration: 18
      unavailableReplicas: 1

    Without Knative in the picture, the Deployment would still be up because the Pod of the old ReplicaSet still exists. The ReplicaSet controller retries the Pod creation with exponential backoff; once the webhook communication works again, creation eventually succeeds.

    Knative Serving 1.11 behaves this way: it keeps the Revision active, and the KService therefore stays fully reachable.

  4. Knative Serving 1.12 breaks here. It propagates the Deployment's failure reason (FailedCreate) to the Revision and concludes that the Revision is no longer reachable. Because the Revision is considered unreachable, its Deployment is scaled down to 0, which breaks the availability of the KService.
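A quick way to inspect the propagated condition is a jsonpath query like this (a sketch; test-00001 assumes the first Revision of the service from step 2):

kubectl get revision test-00001 -o jsonpath='{range .status.conditions[*]}{.type}={.status} ({.reason}){"\n"}{end}'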

In what area(s)?


/area autoscale

What version of Knative?

1.12

Expected Behavior

A temporary problem in creating a Pod should not cause the KService to be down.

Actual Behavior

The KService is down since Serving 1.12. One can repair the Revision by deleting its Deployment; the replacement comes up, assuming Pod creation works again.
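For reference, the repair is a single deletion (assuming the test service from the reproduction steps; Knative names a Revision's Deployment <revision-name>-deployment):

kubectl delete deployment test-00001-deployment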

Without knowing the exact design details, this is just an opinion: in general, I think an active Revision may go into a failed state, but it should not do so this quickly. For example, if I have a Revision whose image I deleted, the new Pod will never come up, and Knative may eventually mark the Revision as Failed. But for temporary problems that resolve within a few minutes, it should just give the Deployment time to become healthy again (especially since it was never really broken).

Steps to Reproduce the Problem

Included above.

SaschaSchwarze0 added the kind/bug label on Nov 23, 2023
@SaschaSchwarze0 (Contributor, Author)

@dprotaso some fallout of the other fix ^^

dprotaso (Member) commented Nov 23, 2023

Hmmm... we also added this fix recently: #14453

The motivation there was that folks hitting quotas or similar constraints wanted the Revision to fail fast.

dprotaso (Member) commented Nov 23, 2023

In theory, we would drop this code to fix your problem:

conds := []apis.ConditionType{
    v1.RevisionConditionResourcesAvailable,
    v1.RevisionConditionContainerHealthy,
}
for _, cond := range conds {
    // Any single False condition immediately marks the Revision unreachable,
    // even for transient failures such as a temporarily broken webhook.
    if c := rev.Status.GetCondition(cond); c != nil && c.IsFalse() {
        return autoscalingv1alpha1.ReachabilityUnreachable
    }
}

It looks like the prior code only switched to unreachable when the Revision had failed and its reachability was already pending or unreachable:

if rev.Status.GetCondition(v1.RevisionConditionReady).IsFalse() {
    // Make sure that we don't do this when a newly failing revision is
    // marked reachable by outside forces.
    if !rev.IsReachable() {
        return autoscalingv1alpha1.ReachabilityUnreachable
    }
}
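For illustration, restoring that guard might look roughly like this (a sketch of the idea only, not the actual change from #14795):

// Sketch: only flip to unreachable when the Revision is failing overall and
// it was not already marked reachable from the outside (pre-1.12 behavior).
if rev.Status.GetCondition(v1.RevisionConditionReady).IsFalse() && !rev.IsReachable() {
    return autoscalingv1alpha1.ReachabilityUnreachable
}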

dprotaso added this to the v1.13.0 milestone on Nov 23, 2023
@SaschaSchwarze0 (Contributor, Author)

Yep, the first code block is what I also think is causing the behavior. I'm not sure whether this can be changed to take effect only after some delay, similar to how Knative waits for the initial scale of a Revision.

dprotaso (Member)

/assign @dprotaso

dprotaso (Member) commented Jan 17, 2024

@SaschaSchwarze0 - I have a WIP fix here - #14795

But I'd like to add some e2e tests to confirm the changes and prevent future regressions. In the meantime, are you able to validate the patch to see if there are any further issues?

I've tested it manually and confirmed that it fixes this issue and #14115.

I don't think I'll wrap this up before next week's release, but it can land in a future patch release.

dprotaso (Member) commented Jan 30, 2024

Following up: I confirmed this only really happens during an upgrade, and I added a test for that.

I discovered (#14795 (comment)) that, during an upgrade, the older controller can still cause this error.

What happens is that the older revision reconciler detects the ConfigMap change for the queue-proxy sidecar image and then tries to update the Deployment.

If you are upgrading from a bad version, I recommend updating the controller Deployment first, before continuing with the rest of the upgrade.
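Concretely, that could look like this (a sketch; assumes the default controller Deployment and container names in knative-serving, with v1.13.1 as the example target tag):

kubectl -n knative-serving set image deployment/controller \
  controller=gcr.io/knative-releases/knative.dev/serving/cmd/controller:v1.13.1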

@SaschaSchwarze0 (Contributor, Author)

> In the meantime, are you able to validate the patch to see if there are any further issues?

@dprotaso Sorry for not having had time earlier; I see this is merged now already. The good news is that I do not see any issues after having patched it, and the reported issue is resolved. The Revision still goes to Ready=False with reason=FailedCreate, but the Deployment stays up and I can continue to reach the app. Once Pod creation is functional again (after removing the broken webhook), the Revision eventually goes back to Active.
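For the reproduction above, making Pod creation functional again boils down to deleting the dummy webhook (name taken from the YAML in the original report):

kubectl delete mutatingwebhookconfiguration dummy-webhook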

Thanks for fixing it.

yuzisun commented Mar 3, 2024

> Following up: I confirmed this only really happens during an upgrade, and I added a test for that.
>
> I discovered (#14795 (comment)) that, during an upgrade, the older controller can still cause this error.
>
> What happens is that the older revision reconciler detects the ConfigMap change for the queue-proxy sidecar image and then tries to update the Deployment.
>
> If you are upgrading from a bad version, I recommend updating the controller Deployment first, before continuing with the rest of the upgrade.

@dprotaso We had an outage with exactly this problem when upgrading from the bad version 1.12.0 to 1.13.1. Are we supposed to update the controller Deployment image to 1.13.1 manually first? We are using the operator, though, which may revert the change.

dprotaso (Member) commented Mar 3, 2024

> Are we supposed to update the controller Deployment image to 1.13.1 manually first?

Yes

> We are using the operator, though, which may revert the change.

I'm unsure whether you're able to override the controller image using some operator property.
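If the operator's registry override mechanism applies here, something along these lines might work (an untested sketch; verify against the KnativeServing API of your operator version):

apiVersion: operator.knative.dev/v1beta1
kind: KnativeServing
metadata:
  name: knative-serving
  namespace: knative-serving
spec:
  registry:
    override:
      # Hypothetical pin of the controller container image to the fixed release.
      controller: gcr.io/knative-releases/knative.dev/serving/cmd/controller:v1.13.1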
