
Apply default livenessProbe and readinessProbe to the user container #4014

Closed
chizhg opened this issue May 6, 2019 · 15 comments · Fixed by #4731
Assignees
Labels
area/test-and-release (flags unit/e2e/conformance/perf test issues for product features) · kind/feature (well-understood/specified features, ready for coding) · kind/spec (discussion of how a feature should be exposed to customers) · P1
Milestone

Comments

@chizhg
Member

chizhg commented May 6, 2019

In what area(s)?

/area test-and-release

Describe the feature

According to the second paragraph of Meta Requests in the Knative Runtime Contract, we should apply a default livenessProbe and readinessProbe to the user container when they are not specified, with the following settings:

  • tcpSocket set to the container's port
  • initialDelaySeconds set to 0
  • periodSeconds set to platform-specific value

@tcnghia @mattmoor @markusthoemmes
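
The contract language above amounts to simple defaulting logic. A minimal sketch, using simplified stand-in types (the real code uses `corev1.Probe`; these local structs only mirror the relevant fields):

```go
package main

import "fmt"

// Simplified stand-ins for the corev1 types; field names mirror the
// Kubernetes API, but these are illustrative, not the real types.
type TCPSocketAction struct{ Port int }

type Probe struct {
	TCPSocket           *TCPSocketAction
	InitialDelaySeconds int32
	PeriodSeconds       int32
}

type Container struct {
	Port           int
	ReadinessProbe *Probe
}

// defaultReadinessProbe applies the contract's default when the user
// has not specified a probe: tcpSocket on the container's port,
// initialDelaySeconds 0, and a platform-chosen periodSeconds.
func defaultReadinessProbe(c *Container, platformPeriod int32) {
	if c.ReadinessProbe != nil {
		return // a user-specified probe wins
	}
	c.ReadinessProbe = &Probe{
		TCPSocket:           &TCPSocketAction{Port: c.Port},
		InitialDelaySeconds: 0,
		PeriodSeconds:       platformPeriod,
	}
}

func main() {
	c := &Container{Port: 8080}
	defaultReadinessProbe(c, 10)
	fmt.Println(c.ReadinessProbe.TCPSocket.Port) // 8080
}
```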

@chizhg chizhg added the kind/feature Well-understood/specified features, ready for coding. label May 6, 2019
@knative-prow-robot knative-prow-robot added area/test-and-release It flags unit/e2e/conformance/perf test issues for product features kind/good-first-issue kind/spec Discussion of how a feature should be exposed to customers. labels May 6, 2019
@markusthoemmes
Contributor

In theory we have this implemented via the queue-proxy, which performs exactly this probe (we don't do liveness probes, though).

@dgerd

dgerd commented May 6, 2019

QueueProxy Probe on Admin Interface:

queueReadinessProbe = &corev1.Probe{

QueueProxyHandler Setup on Admin Interface:

func createAdminHandlers() *http.ServeMux {

Probe Handler:

queueReadinessProbe = &corev1.Probe{

@dgerd

dgerd commented May 6, 2019

We do have something here. Some areas where this doesn't match well with the description:

  1. LivenessProbe is not configured
  2. "Platform-specific value" makes me think this is configuration; however, periodSeconds is hard-coded in the queueProxy
  3. "Default" makes me think that user specified probes override this behavior, but instead they are applied concurrently.

For each of these, I think we could go either way: update the contract, or update the implementation. I do, however, want to make them consistent.

dgerd pushed a commit to dgerd/serving that referenced this issue May 7, 2019
This change makes numerous cleanups to the runtime contract in an
attempt to improve the readability of the document and make it
more useful for the intended audience.

* Moves developer-facing statements to a new `runtime-user-guide`.
Focuses `runtime-contract` on the operator/platform-provider.
* Adds links to conformance tests that cover Runtime Contract statements.
* Corrects, updates, or removes statements to more accurately represent
today's Knative runtime.
* Downgrades most untestable statements to informative, or removes them.
* Copies in important OCI runtime requirements we previously referenced.
* Removes a reference to the OCI specification that didn't bring new
requirements.

Ref: knative#2539, knative#2973, knative#4014, knative#4027
@dgerd

dgerd commented May 8, 2019

After discussion in the API working group meeting, this is my understanding of where we landed. Please correct if I missed something or misunderstood.

On the contract/specification side:

  1. LivenessProbe should not be configured by default. This part of the statement should be removed from the runtime contract.

On the implementation side:

  1. The default TCP probe behavior should be disabled when a user specifies a custom readinessProbe.
  2. The default TCP readinessProbe should be added by the webhook and appear on the Revision spec.
  3. Custom readiness probes defined by the user should be translated to health checks that are done BY the queue-proxy against the user-container, and removed from the user container.
  4. We should understand if our queue-proxy HTTP server startup time is causing #3308 ("Good gRPC deployment pods frequently fail at least one health check"). If it is, we should consider using an execProbe on the queue-proxy to perform the health check against the user-container. If it is not, we should look at how quickly we time out when health-checking the user container.
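
Steps 1-3 above can be sketched together as webhook-side logic. This is an illustrative sketch, not the merged implementation: the types are simplified stand-ins for `corev1`, and `USER_PROBE_JSON` is a hypothetical env var name for handing the serialized probe to the queue-proxy.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Simplified stand-ins for the corev1 types used here.
type TCPSocketAction struct {
	Port int `json:"port"`
}

type HTTPGetAction struct {
	Path string `json:"path"`
	Port int    `json:"port"`
}

type Probe struct {
	TCPSocket *TCPSocketAction `json:"tcpSocket,omitempty"`
	HTTPGet   *HTTPGetAction   `json:"httpGet,omitempty"`
}

type Container struct {
	Port           int
	ReadinessProbe *Probe
	Env            map[string]string
}

// applyProbePlan sketches the plan: default a TCP readinessProbe in the
// webhook only when the user set none (so it appears on the Revision
// spec), and hand any user-defined probe to the queue-proxy while
// removing it from the user container.
func applyProbePlan(user, queueProxy *Container) error {
	if user.ReadinessProbe == nil {
		user.ReadinessProbe = &Probe{TCPSocket: &TCPSocketAction{Port: user.Port}}
		return nil
	}
	raw, err := json.Marshal(user.ReadinessProbe)
	if err != nil {
		return err
	}
	queueProxy.Env["USER_PROBE_JSON"] = string(raw) // hypothetical name
	user.ReadinessProbe = nil                       // queue-proxy now owns the check
	return nil
}

func main() {
	qp := &Container{Env: map[string]string{}}
	u := &Container{Port: 8080, ReadinessProbe: &Probe{HTTPGet: &HTTPGetAction{Path: "/healthz", Port: 8080}}}
	_ = applyProbePlan(u, qp)
	fmt.Println(u.ReadinessProbe == nil) // true
}
```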

@mattmoor mattmoor added this to the Serving 0.7 milestone May 10, 2019
@joshrider
Contributor

/assign @joshrider

@joshrider
Contributor

/assign @shashwathi

@joshrider
Contributor

joshrider commented Jun 11, 2019

I've run into some questions about how to translate a RevisionSpec's probes over into something we'd want to execute from the queue-proxy against the user-container.

periodSeconds, successThreshold, and failureThreshold do not cleanly map over to the way we do things now.

At present, our hardcoded readinessProbe will make a GET request to the queue-proxy, which then fires TCP probes at the user-container at 50 ms intervals. This gives us the chance to pick up the user-container as soon as possible. This 50ms could be the "platform-specific" periodSeconds value referred to by the runtime-contract, but the sub-second value puts us into a bit of a sticky spot.

The periodSeconds value on the probe is an integer representing a number of seconds. This makes it difficult to: a) include an accurate description of the default probe in the Revision spec, and b) allow the user to configure their own sub-second periodSeconds values.

@dgerd pointed out that 0 is not a valid value for a probe, and may be an option to signify some special value. It could be that 0 is a stand-in for some smaller period of time (like 50ms), or it could indicate that we need to look elsewhere for the desired value. This could get us where we're going, but does not seem straightforward for a user. I'd be happy to get some feedback on ways to tackle this.

successThreshold and failureThreshold are both a little bit awkward, and may result in us reimplementing a bunch of logic from the kubelet's prober. I also have some questions about their meaningfulness given that, as the default probe is currently implemented, once the TCP probe from the queue-proxy against the user-container succeeds once, subsequent probes of the queue-proxy return successfully without ever re-checking the user-container.

Sidenote: there is a comment where we build the default probe in the queue-proxy that suggests we want to get the PreStop going as soon as possible. Given that the failureThreshold isn't set and that the default value for the failureThreshold is 3, would we benefit by setting it ourselves to a lower value?

@mattmoor
Member

Few thoughts:

  1. I think periodSeconds: 0 should mean do it as the system wants to.
  2. I don't think the system default needs to be reflected back into the yaml (and here in fact cannot be)
  3. If periodSeconds is specified we should respect it.
  4. I think that we should disallow failureThreshold if periodSeconds is unspecified, but I think successThreshold still makes sense.
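
A sketch of how points 1, 3, and 4 could translate into webhook validation — this is one possible encoding of the proposal, not the merged code, and `Probe` is a simplified stand-in for `corev1.Probe`:

```go
package main

import (
	"errors"
	"fmt"
)

// Probe is a simplified stand-in for corev1.Probe.
type Probe struct {
	PeriodSeconds    int32
	FailureThreshold int32
	SuccessThreshold int32
}

// validate encodes the proposed rules: periodSeconds 0 means "probe as
// aggressively as the system wants", an explicit periodSeconds is
// respected, failureThreshold is disallowed when periodSeconds is
// unspecified, and successThreshold is always allowed.
func validate(p Probe) error {
	if p.PeriodSeconds == 0 && p.FailureThreshold > 0 {
		return errors.New("failureThreshold requires an explicit periodSeconds")
	}
	return nil
}

func main() {
	fmt.Println(validate(Probe{PeriodSeconds: 0, SuccessThreshold: 1}))        // <nil>
	fmt.Println(validate(Probe{PeriodSeconds: 0, FailureThreshold: 3}) != nil) // true
}
```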

@joshrider
Contributor

@mattmoor @dgerd

Clarifying question: if the user specifies an HTTP probe with periodSeconds: 0, do we want that to be invalid or do we want to translate that to the aggressive retry configuration?

@mattmoor mattmoor modified the milestones: Serving 0.7, Serving 0.8 Jun 19, 2019
@mattmoor
Member

aggressive retries.

@mattmoor
Member

The base change is in. I'd like to see this land as early in 0.8 as we can, so it can bake. Please LMK if you have questions or need reviews.

@joshrider
Contributor

Not to jinx anything, but the e2e tests just passed. Have some housekeeping and cleanup to do, but hope to have something up for feedback at the end of the week.

@mattmoor
Member

@joshrider That's awesome news, LMK when things are RFAL.

@mattmoor
Member

mattmoor commented Jul 2, 2019

@joshrider did you jinx it? 😲

@joshrider
Contributor

Guess it's hard to argue that I didn't at this point...

Just put something up with a WIP tag. #4600
Lots to look at and feedback greatly appreciated. Will also need to be shepherded through CLA-land manually.

@mattmoor @dgerd

joshrider added a commit to joshrider/serving that referenced this issue Jul 8, 2019


Co-authored-by: Shash Reddy <shashwathireddy@gmail.com>
knative-prow-robot pushed a commit that referenced this issue Jul 9, 2019
* prepare queue healthHandler for optional aggressive retries #4014

Co-authored-by: Shash Reddy <shashwathireddy@gmail.com>

* add documentation for IsHTTPProbeReady

Co-authored-by: Shash Reddy <shashwathireddy@gmail.com>

* Address comments

* add kubelet header to http probe, add status code to failed probe error

* use kubelet user-agent in http probe

Co-authored-by: Shash Reddy <shashwathireddy@gmail.com>
shashwathi added a commit to joshrider/serving that referenced this issue Jul 9, 2019
This PR builds the adapter which converts a user-defined probe to a probe that
will be executed by queue proxy against user container.

Co-authored-by: Shash Reddy <shashwathireddy@gmail.com>
knative-prow-robot pushed a commit that referenced this issue Jul 10, 2019
* Prep for queue health handler for aggressive retries #4014

This PR builds the adapter which converts a user-defined probe to a probe that
will be executed by queue proxy against user container.

Co-authored-by: Shash Reddy <shashwathireddy@gmail.com>

* Address golang linter comments

* fix comment

Co-authored-by: Shash Reddy <shashwathireddy@gmail.com>

* return probe encoding error

* factor out queue probe retry logic, clean up

Co-authored-by: Shash Reddy <shashwathireddy@gmail.com>
8 participants