Use user-defined readinessProbe in queue-proxy #4731

Merged: 9 commits into knative:master from the queue-probe branch on Jul 16, 2019

Conversation

joshrider
Contributor

Signed-off-by: Shash Reddy <shashwathireddy@gmail.com>
Co-authored-by: Shash Reddy <shashwathireddy@gmail.com>

Fixes #4014

Proposed Changes

  • add a default readiness probe to the revision spec when the user does not specify one
  • remove HTTP and TCP readiness probes from the user-container when creating deployments; instead, translate them into a probe performed by the queue-proxy against the user-container
  • when the user specifies an Exec readiness probe, it stays on the user-container, and the queue-proxy performs a TCP probe against the user-container to ensure a path is open
  • have the handler used by the activator (to check that the pod is ready) use the same readiness criteria defined by the user
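
For readers skimming the thread, the probe placement described in the list above can be sketched roughly as follows. This is an illustration only, not the code in this PR: `Probe` and its action types are simplified, hypothetical stand-ins for the Kubernetes probe types, `splitReadinessProbe` is an invented name, and using a TCP check as the default when no probe is given is an assumption.

```go
package main

// Probe is a simplified, hypothetical stand-in for a Kubernetes probe.
// At most one of HTTPGet, TCPSocket, or Exec is expected to be set.
type Probe struct {
	HTTPGet   *HTTPGetAction
	TCPSocket *TCPSocketAction
	Exec      *ExecAction
}

// HTTPGetAction describes an HTTP GET check.
type HTTPGetAction struct {
	Path string
	Port int
}

// TCPSocketAction describes a TCP connect check.
type TCPSocketAction struct{ Port int }

// ExecAction describes a command run inside the container.
type ExecAction struct{ Command []string }

// splitReadinessProbe decides which probe stays on the user-container and
// which probe the queue-proxy performs against the user-container.
func splitReadinessProbe(user *Probe, userPort int) (onContainer, viaQueueProxy *Probe) {
	tcp := &Probe{TCPSocket: &TCPSocketAction{Port: userPort}}
	switch {
	case user == nil:
		// No probe specified: fall back to a TCP check (assumed default).
		return nil, tcp
	case user.Exec != nil:
		// Exec probes stay on the user-container; the queue-proxy adds a
		// TCP check to confirm that a path to the container is open.
		return user, tcp
	default:
		// HTTP and TCP probes leave the user-container and are performed
		// by the queue-proxy instead.
		return nil, user
	}
}
```

With this sketch, an Exec probe yields a non-nil probe on the container plus a TCP probe for the queue-proxy, while an HTTP probe yields nothing on the container and the original probe for the queue-proxy.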

NOTE: for the activator's probe, we are using the same count of "successful probes" as the pod's usual readiness probe. That is, if the activator and "kubelet" are both probing concurrently and the probe's SuccessThreshold is 4, they will only need 4 consecutive successes collectively (as opposed to 4 each). Please poke holes in this.
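
To make the shared-count behaviour in the note concrete, here is a minimal sketch of a counter that several probers feed into. It is an assumption-laden illustration, not the PR's implementation: `sharedProbe` and `record` are hypothetical names, and a mutex is just one way to make concurrent callers safe.

```go
package main

import "sync"

// sharedProbe is a hypothetical illustration of one success counter shared
// by every caller (for example the activator and the kubelet).
type sharedProbe struct {
	mu               sync.Mutex
	successThreshold int
	consecutive      int
}

// record registers one probe result from any caller and reports whether the
// threshold of consecutive successes has been reached collectively.
func (p *sharedProbe) record(success bool) bool {
	p.mu.Lock()
	defer p.mu.Unlock()
	if !success {
		// Any failure resets the streak for all callers.
		p.consecutive = 0
		return false
	}
	if p.consecutive < p.successThreshold {
		p.consecutive++
	}
	return p.consecutive >= p.successThreshold
}
```

With `successThreshold` set to 4, two successes reported by one caller followed by two from another make the fourth call return true, which is the "4 collectively, not 4 each" behaviour the note asks reviewers to scrutinize.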

Release Note

HTTP and TCP readinessProbes are performed by the queue-proxy against the user-container

@googlebot

So there's good news and bad news.

👍 The good news is that everyone that needs to sign a CLA (the pull request submitter and all commit authors) have done so. Everything is all good there.

😕 The bad news is that it appears that one or more commits were authored or co-authored by someone other than the pull request submitter. We need to confirm that all authors are ok with their commits being contributed to this project. Please have them confirm that here in the pull request.

Note to project maintainer: This is a terminal state, meaning the cla/google commit status will not change from this state. It's up to you to confirm consent of all the commit author(s), set the cla label to yes (if enabled on your project), and then merge this pull request when appropriate.


@googlebot added the `cla: no` label (Indicates the PR's author has not signed the CLA.) on Jul 12, 2019
@knative-prow-robot added the `size/XXL` label (Denotes a PR that changes 1000+ lines, ignoring generated files.) on Jul 12, 2019
Contributor

@knative-prow-robot knative-prow-robot left a comment


@joshrider: 0 warnings.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@knative-prow-robot added the `area/API` (API objects and controllers) and `area/networking` labels on Jul 12, 2019
@joshrider
Contributor Author

/test pull-knative-serving-integration-tests

@joshrider changed the title from "use user-defined readinessprobe in queue-proxy" to "Use user-defined readinessProbe in queue-proxy" on Jul 12, 2019
@knative-metrics-robot

The following is the coverage report on pkg/.
Say /test pull-knative-serving-go-coverage to re-run this coverage report

| File | Old Coverage | New Coverage | Delta |
|---|---|---|---|
| pkg/apis/serving/k8s_validation.go | 98.9% | 98.6% | -0.3 |
| pkg/apis/serving/v1beta1/revision_defaults.go | 87.5% | 89.5% | 2.0 |

@shashwathi
Contributor

/test pull-knative-serving-integration-tests

Contributor

@markusthoemmes markusthoemmes left a comment


A few flyby comments. I have a really hard time keeping track of what calls what, which probes go where and which retries are applied at which spots.

Do you mind drawing a picture of where we want to apply which retry? The nested retrying feels a little odd to me, maybe there's room for an interim change there as well as this PR is pretty big.

Thanks for doing this though, this is great stuff 🙂

if probeUserContainer() {
// Respond with the name of the component handling the request.
w.Write([]byte(queue.Name))
if prober != nil {
Contributor


Maybe in a separate PR: Is there a reason why we don't return the state from healthState here? Seems unnecessarily redundant to probe on this path 🤔

@greghaynes do you need that for your "direct to ip" work?

Contributor Author


That seems like an excellent suggestion. 👍

@joshrider
Contributor Author

/test pull-knative-serving-smoke-tests

joshrider and others added 6 commits July 15, 2019 21:31
Signed-off-by: Shash Reddy <shashwathireddy@gmail.com>
Co-authored-by: Shash Reddy <shashwathireddy@gmail.com>
Co-authored-by: Shash Reddy <shashwathireddy@gmail.com>
Co-authored-by: Shash Reddy <shashwathireddy@gmail.com>
- merge logic for knative probes and user defined probes
- use probe-period as argument name
- pass probe as environment variable instead of container args

Signed-off-by: Shash Reddy <shashwathireddy@gmail.com>
- Use context for timeout
- do not override exec probe
- simplify the logic for errors when multiple probes are mentioned

Signed-off-by: Shash Reddy <shashwathireddy@gmail.com>
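
One of the commits above notes that the probe is passed as an environment variable instead of container args. A rough sketch of that idea, assuming a JSON encoding: the `probeConfig` type, its fields, and the variable name `READINESS_PROBE_JSON` are all hypothetical and chosen only for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// probeConfig is a hypothetical, trimmed-down probe description.
type probeConfig struct {
	Path             string `json:"path,omitempty"`
	Port             int    `json:"port"`
	PeriodSeconds    int    `json:"periodSeconds,omitempty"`
	TimeoutSeconds   int    `json:"timeoutSeconds,omitempty"`
	SuccessThreshold int    `json:"successThreshold,omitempty"`
}

// probeEnvVar is an assumed variable name, not one taken from this PR.
const probeEnvVar = "READINESS_PROBE_JSON"

// encodeProbe serializes the probe so it can be set as an env var value.
func encodeProbe(p probeConfig) (string, error) {
	b, err := json.Marshal(p)
	if err != nil {
		return "", fmt.Errorf("encoding probe: %w", err)
	}
	return string(b), nil
}

// decodeProbe reads the probe back. lookup has the same signature as
// os.LookupEnv, so that function can be passed in directly.
func decodeProbe(lookup func(string) (string, bool)) (probeConfig, error) {
	raw, ok := lookup(probeEnvVar)
	if !ok {
		return probeConfig{}, fmt.Errorf("%s is not set", probeEnvVar)
	}
	var p probeConfig
	if err := json.Unmarshal([]byte(raw), &p); err != nil {
		return probeConfig{}, fmt.Errorf("decoding %s: %w", probeEnvVar, err)
	}
	return p, nil
}
```

In a real binary the reading side would call `decodeProbe(os.LookupEnv)`. Taking the lookup function as a parameter keeps the sketch testable without touching the process environment.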
@shashwathi
Contributor

@mattmoor : Addressed all your comments. Ready for another review 👍

  // started as early as possible while still wanting to give the container some breathing
  // room to get up and running.
- timeoutErr := wait.PollImmediate(25*time.Millisecond, timeout, func() (bool, error) {
+ timeoutErr := wait.PollImmediateUntil(aggressivePollInterval, func() (bool, error) {
Contributor


Though I know Matt suggested it I liked the previous version more, it's shorter :)
🤷‍♀

Member


What'd I suggest?
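
For context on the polling change discussed in this thread (and the "Use context for timeout" commit), here is a standard-library sketch of "poll immediately, then at a short interval, until success or a deadline". It does not use the `wait` package from the snippet; `pollWithTimeout` is a hypothetical helper, and a caller would pick its own interval, such as the 25ms seen in the snippet.

```go
package main

import (
	"context"
	"time"
)

// pollWithTimeout is a hypothetical helper: it calls check immediately and
// then once per interval until check returns true or the timeout elapses.
// interval must be positive, because time.NewTicker panics otherwise.
func pollWithTimeout(timeout, interval time.Duration, check func() bool) error {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	// First attempt happens right away, without waiting for a tick.
	if check() {
		return nil
	}

	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			// The deadline passed before check succeeded.
			return ctx.Err()
		case <-ticker.C:
			if check() {
				return nil
			}
		}
	}
}
```

A call such as `pollWithTimeout(10*time.Second, 25*time.Millisecond, probe)` returns nil as soon as `probe` reports success and a non-nil error once the deadline passes.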

joshrider and others added 2 commits July 16, 2019 09:48
Co-authored-by: Shash Reddy <shashwathireddy@gmail.com>
Co-authored-by: Shash Reddy <shashwathireddy@gmail.com>
@mattmoor
Member

I think things largely look good. Going to give others a chance to leave comments, but if nothing comes up I'll do a final pass later so we can get this baking. It may be worth checking out the data race failure above, since this PR touches the queue logic. thanks for all the work leading up to this!

@joshrider
Contributor Author

Sounds good. Neither of us has been able to reproduce that data race locally. I'd be curious to hear if anyone else knows how it happened.

/test pull-knative-serving-unit-tests

Signed-off-by: Shash Reddy <shashwathireddy@gmail.com>
Member

@mattmoor mattmoor left a comment


/lgtm
/approve
🎉

@knative-prow-robot added the `lgtm` (Indicates that a PR is ready to be merged.) and `approved` (Indicates a PR has been approved by an approver from all required OWNERS files.) labels on Jul 16, 2019
@mattmoor added the `cla: yes` label (Indicates the PR's author has signed the CLA.) and removed the `approved` and `cla: no` labels on Jul 16, 2019
@googlebot

A Googler has manually verified that the CLAs look good.

(Googler, please make sure the reason for overriding the CLA status is clearly documented in these comments.)


@mattmoor
Member

/approve

@knative-prow-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: joshrider, mattmoor


@knative-prow-robot added the `approved` label (Indicates a PR has been approved by an approver from all required OWNERS files.) on Jul 16, 2019
@knative-prow-robot knative-prow-robot merged commit f53271a into knative:master Jul 16, 2019
nak3 added a commit to nak3/serving that referenced this pull request Jul 17, 2019
This patch makes a tiny fix which removes invalid setting in
configuration example.

After knative#4731, `periodSeconds`
needs to be set with `failureThreshold` and `timeoutSeconds`. This
patch simply removes `periodSeconds` from the config.
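
The constraint mentioned in this commit message can be pictured with a small validation sketch. The rule here is inferred only from the wording above (`periodSeconds` must be accompanied by `failureThreshold` and `timeoutSeconds`); the actual validation in knative/serving may differ, and `probeSettings` and `validateProbe` are hypothetical names.

```go
package main

import "errors"

// probeSettings is a hypothetical subset of readiness probe fields.
// A zero value means the field was not set.
type probeSettings struct {
	PeriodSeconds    int
	TimeoutSeconds   int
	FailureThreshold int
}

// validateProbe rejects a probe that sets periodSeconds without also
// setting failureThreshold and timeoutSeconds (assumed rule).
func validateProbe(p probeSettings) error {
	if p.PeriodSeconds > 0 && (p.FailureThreshold == 0 || p.TimeoutSeconds == 0) {
		return errors.New("periodSeconds requires failureThreshold and timeoutSeconds to be set")
	}
	return nil
}
```

Under this assumed rule, a config that sets only `periodSeconds` is rejected, which matches the commit's choice to drop `periodSeconds` from the example.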
@joshrider joshrider deleted the queue-probe branch August 6, 2019 15:17
knative-prow-robot pushed a commit that referenced this pull request Sep 29, 2019
knative-prow-robot pushed a commit to knative/docs that referenced this pull request Jun 5, 2020
Since knative/serving#4731, periodSeconds also requires
failureThreshold and timeoutSeconds to be set.
Labels
  • approved: Indicates a PR has been approved by an approver from all required OWNERS files.
  • area/API: API objects and controllers
  • area/networking
  • cla: yes: Indicates the PR's author has signed the CLA.
  • lgtm: Indicates that a PR is ready to be merged.
  • size/XXL: Denotes a PR that changes 1000+ lines, ignoring generated files.
Development

Successfully merging this pull request may close these issues.

Apply default livenessProbe and readinessProbe to the user container
10 participants