
Since 1.12, a healthy Revision is taken down because of a temporary glitch during Pod creation #14660

Closed
SaschaSchwarze0 opened this issue Nov 23, 2023 · 10 comments · Fixed by #14795

@SaschaSchwarze0 (Contributor)

In Knative 1.12, there is a change in how the reachability of a Revision is calculated: #14309. That change has a negative side effect in the following scenario:

  1. You have Kubernetes with Knative Serving
  2. A simple service with currentScale=1 exists, e.g. kn service create test --image ghcr.io/src2img/http-synthetics:latest --scale-min 1 --scale-max 1
  3. An admission webhook for Pod creation is in place with failurePolicy=Fail, and the webhook is not functional (in our case the cause was a temporary network glitch between the Kubernetes control plane and the worker nodes running the service). To reproduce this, apply a webhook like the following:
cat <<EOF | kubectl create -f -
apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:
  name: dummy-webhook
webhooks:
- admissionReviewVersions:
  - v1
  clientConfig:
    service:
      name: non-existing
      namespace: non-existing
      path: /defaulting
      port: 443
  failurePolicy: Fail
  matchPolicy: Equivalent
  name: webhook.non-existing.dev
  objectSelector: {}
  reinvocationPolicy: IfNeeded
  rules:
  - apiGroups:
    - ""
    apiVersions:
    - v1
    operations:
    - CREATE
    resources:
    - pods
    scope: '*'
  sideEffects: None
  timeoutSeconds: 5
EOF
  4. Trigger an update on the Knative configuration that causes all Deployments to be updated, for example: kubectl -n knative-serving patch configmap config-deployment -p '{"data":{"queue-sidecar-image":"gcr.io/knative-releases/knative.dev/serving/cmd/queue:v1.12.1"}}'
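To watch the failure unfold after step 4, a plain watch works (optional; run it in the namespace of the test service):

kubectl get revision,deployment,replicaset,pods -w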

The following is now happening:

  1. Knative Serving updates the Deployment's queue-proxy image.

  2. The Deployment controller creates a new ReplicaSet for this with desiredScale 1.

  3. The ReplicaSet controller fails to create the Pod for this ReplicaSet, which leaves the Deployment with this status:

    status:
      conditions:
      - lastTransitionTime: "2023-11-21T13:54:56Z"
        lastUpdateTime: "2023-11-22T10:49:39Z"
        message: ReplicaSet "http-synthetics-00001-deployment-bcbddf84" has successfully progressed.
        reason: NewReplicaSetAvailable
        status: "True"
        type: Progressing
      - lastTransitionTime: "2023-11-22T10:49:59Z"
        lastUpdateTime: "2023-11-22T10:49:59Z"
        message: Deployment does not have minimum availability.
        reason: MinimumReplicasUnavailable
        status: "False"
        type: Available
      - lastTransitionTime: "2023-11-22T10:49:59Z"
        lastUpdateTime: "2023-11-22T10:49:59Z"
        message: 'Internal error occurred: failed calling webhook "webhook.non-existing.dev": failed to call webhook: ...'
        reason: FailedCreate
        status: "True"
        type: ReplicaFailure
      observedGeneration: 18
      unavailableReplicas: 1

    Without Knative in the picture, the Deployment would still be up because the Pod of the old ReplicaSet still exists. The ReplicaSet controller retries the Pod creation with exponential backoff; once the webhook communication works again, creation eventually succeeds.

    Knative Serving 1.11 behaves this way: it keeps the Revision active, and the KService therefore stays fully reachable.

  4. Knative Serving 1.12 breaks here. It propagates the Deployment's failure reason (FailedCreate) to the Revision and concludes that the Revision is no longer reachable. Because the Revision is considered unreachable, its Deployment is scaled down to 0, which breaks the availability of the KService.
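A quick way to inspect the propagated condition is a jsonpath query like this (a sketch; test-00001 assumes the first Revision of the service from step 2):

kubectl get revision test-00001 -o jsonpath='{range .status.conditions[*]}{.type}={.status} ({.reason}){"\n"}{end}'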

In what area(s)?


/area autoscale

What version of Knative?

1.12

Expected Behavior

A temporary problem in creating a Pod should not cause the KService to be down.

Actual Behavior

The KService is down since Serving 1.12. One can repair the Revision by deleting its Deployment; the replacement comes up, assuming Pod creation works again.
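For reference, the repair is a single deletion (assuming the test service from the reproduction steps; Knative names a Revision's Deployment <revision-name>-deployment):

kubectl delete deployment test-00001-deployment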

Without knowing the exact design details, this is just an opinion: in general, I think an active Revision may go into a failed state, but it should not do so this quickly. For example, if I have a Revision whose image I deleted, the new Pod will never come up, and Knative may eventually mark the Revision as Failed. But for temporary problems that resolve within a few minutes, it should just give the Deployment time to become healthy again (especially since it was never really broken).

Steps to Reproduce the Problem

Included above.

SaschaSchwarze0 added the kind/bug label on Nov 23, 2023
@SaschaSchwarze0 (Contributor, Author)

@dprotaso some fallout of the other fix ^^

dprotaso (Member) commented Nov 23, 2023

Hmmm... we also added this fix recently: #14453

The motivation there was that folks hitting quotas or similar constraints wanted the Revision to fail fast.

dprotaso (Member) commented Nov 23, 2023

In theory, we would drop this code to fix your problem:

conds := []apis.ConditionType{
    v1.RevisionConditionResourcesAvailable,
    v1.RevisionConditionContainerHealthy,
}
for _, cond := range conds {
    // Any single False condition immediately marks the Revision unreachable,
    // even for transient failures such as a temporarily broken webhook.
    if c := rev.Status.GetCondition(cond); c != nil && c.IsFalse() {
        return autoscalingv1alpha1.ReachabilityUnreachable
    }
}

It looks like the prior code only switched to unreachable when the Revision had failed and its reachability was already pending or unreachable:

if rev.Status.GetCondition(v1.RevisionConditionReady).IsFalse() {
    // Make sure that we don't do this when a newly failing revision is
    // marked reachable by outside forces.
    if !rev.IsReachable() {
        return autoscalingv1alpha1.ReachabilityUnreachable
    }
}
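For illustration, restoring that guard might look roughly like this (a sketch of the idea only, not the actual change from #14795):

// Sketch: only flip to unreachable when the Revision is failing overall and
// it was not already marked reachable from the outside (pre-1.12 behavior).
if rev.Status.GetCondition(v1.RevisionConditionReady).IsFalse() && !rev.IsReachable() {
    return autoscalingv1alpha1.ReachabilityUnreachable
}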

dprotaso added this to the v1.13.0 milestone on Nov 23, 2023
@SaschaSchwarze0 (Contributor, Author)

Yep, the first code block is what I also think is causing the behavior. I'm not sure whether this can be changed to take effect only after some delay, similar to how Knative waits for the initial scale of a Revision.

dprotaso (Member)

/assign @dprotaso

dprotaso (Member) commented Jan 17, 2024

@SaschaSchwarze0 - I have a WIP fix here - #14795

But I'd like to add some e2e tests to confirm the changes and prevent future regressions. In the meantime, are you able to validate the patch to see if there are any further issues?

I've tested it manually and confirmed that it fixes this issue and #14115.

I don't think I'll wrap this up before next week's release, but it can land in a future patch release.

dprotaso (Member) commented Jan 30, 2024

Following up: I confirmed this only really happens during an upgrade, and I added a test for that.

I discovered (#14795 (comment)) that, during an upgrade, the older controller can still cause this error.

What happens is that the older revision reconciler detects the ConfigMap change for the queue-proxy sidecar image and then tries to update the Deployment.

If you are upgrading from a bad version, I recommend updating the controller Deployment first, before continuing with the rest of the upgrade.
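Concretely, that could look like this (a sketch; assumes the default controller Deployment and container names in knative-serving, with v1.13.1 as the example target tag):

kubectl -n knative-serving set image deployment/controller \
  controller=gcr.io/knative-releases/knative.dev/serving/cmd/controller:v1.13.1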

@SaschaSchwarze0 (Contributor, Author)

> In the meantime, are you able to validate the patch to see if there are any further issues?

@dprotaso Sorry for not having had time earlier; I see this is merged now already. The good news is that I do not see any issues after having patched it, and the reported issue is resolved. The Revision still goes to Ready=False with reason=FailedCreate, but the Deployment stays up and I can continue to reach the app. Once Pod creation is functional again (after removing the broken webhook), the Revision eventually goes back to Active.
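For the reproduction above, making Pod creation functional again boils down to deleting the dummy webhook (name taken from the YAML in the original report):

kubectl delete mutatingwebhookconfiguration dummy-webhook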

Thanks for fixing it.

yuzisun commented Mar 3, 2024

> Following up: I confirmed this only really happens during an upgrade, and I added a test for that.
>
> I discovered (#14795 (comment)) that, during an upgrade, the older controller can still cause this error.
>
> What happens is that the older revision reconciler detects the ConfigMap change for the queue-proxy sidecar image and then tries to update the Deployment.
>
> If you are upgrading from a bad version, I recommend updating the controller Deployment first, before continuing with the rest of the upgrade.

@dprotaso We had an outage with exactly this problem when upgrading from the bad version 1.12.0 to 1.13.1. Are we supposed to update the controller Deployment image to 1.13.1 manually first? We are using the operator, though, which may revert the change.

dprotaso (Member) commented Mar 3, 2024

> Are we supposed to update the controller Deployment image to 1.13.1 manually first?

Yes

> We are using the operator, though, which may revert the change.

I'm unsure whether you're able to override the controller image using some operator property.
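If the operator's registry override mechanism applies here, something along these lines might work (an untested sketch; verify against the KnativeServing API of your operator version):

apiVersion: operator.knative.dev/v1beta1
kind: KnativeServing
metadata:
  name: knative-serving
  namespace: knative-serving
spec:
  registry:
    override:
      # Hypothetical pin of the controller container image to the fixed release.
      controller: gcr.io/knative-releases/knative.dev/serving/cmd/controller:v1.13.1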
