
Apply default livenessProbe and readinessProbe to the user container #4014

Closed
chizhg opened this issue May 6, 2019 · 15 comments · Fixed by #4731
Assignees
Labels
area/test-and-release (flags unit/e2e/conformance/perf test issues for product features) · kind/feature (well-understood/specified features, ready for coding) · kind/spec (discussion of how a feature should be exposed to customers) · P1
Milestone

Comments

@chizhg
Member

chizhg commented May 6, 2019

In what area(s)?

/area test-and-release

Describe the feature

According to the second paragraph of Meta Requests in the Knative Runtime Contract, we should apply a default livenessProbe and readinessProbe to the user container when they are not specified, with the following settings:

  • tcpSocket set to the container's port
  • initialDelaySeconds set to 0
  • periodSeconds set to platform-specific value

@tcnghia @mattmoor @markusthoemmes
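
The contract language above amounts to simple defaulting logic. A minimal sketch, using simplified stand-in types (the real code uses `corev1.Probe`; these local structs only mirror the relevant fields):

```go
package main

import "fmt"

// Simplified stand-ins for the corev1 types; field names mirror the
// Kubernetes API, but these are illustrative, not the real types.
type TCPSocketAction struct{ Port int }

type Probe struct {
	TCPSocket           *TCPSocketAction
	InitialDelaySeconds int32
	PeriodSeconds       int32
}

type Container struct {
	Port           int
	ReadinessProbe *Probe
}

// defaultReadinessProbe applies the contract's default when the user
// has not specified a probe: tcpSocket on the container's port,
// initialDelaySeconds 0, and a platform-chosen periodSeconds.
func defaultReadinessProbe(c *Container, platformPeriod int32) {
	if c.ReadinessProbe != nil {
		return // a user-specified probe wins
	}
	c.ReadinessProbe = &Probe{
		TCPSocket:           &TCPSocketAction{Port: c.Port},
		InitialDelaySeconds: 0,
		PeriodSeconds:       platformPeriod,
	}
}

func main() {
	c := &Container{Port: 8080}
	defaultReadinessProbe(c, 10)
	fmt.Println(c.ReadinessProbe.TCPSocket.Port) // 8080
}
```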

@chizhg chizhg added the kind/feature Well-understood/specified features, ready for coding. label May 6, 2019
@knative-prow-robot knative-prow-robot added area/test-and-release It flags unit/e2e/conformance/perf test issues for product features kind/good-first-issue kind/spec Discussion of how a feature should be exposed to customers. labels May 6, 2019
@markusthoemmes
Contributor

In theory we have this implemented via the queue-proxy, which performs exactly this probe (we don't do liveness probes, though).

@dgerd

dgerd commented May 6, 2019

QueueProxy Probe on Admin Interface:

queueReadinessProbe = &corev1.Probe{

QueueProxyHandler Setup on Admin Interface:

func createAdminHandlers() *http.ServeMux {

Probe Handler:

queueReadinessProbe = &corev1.Probe{

@dgerd

dgerd commented May 6, 2019

We do have something here. Some areas where this doesn't match well with the description:

  1. LivenessProbe is not configured
  2. "Platform-specific value" makes me think this is configuration; however, periodSeconds is hard-coded in the queueProxy
  3. "Default" makes me think that user specified probes override this behavior, but instead they are applied concurrently.

For each of these, I think we could go either way: update the contract, or update the implementation. I do, however, want to make them consistent.

dgerd pushed a commit to dgerd/serving that referenced this issue May 7, 2019
This change makes numerous cleanups to the runtime contract in an
attempt to improve the readability of the document and make it
more useful for the intended audience.

* Moves developer-facing statements to a new `runtime-user-guide`.
Focuses `runtime-contract` on the operator/platform-provider.
* Adds links to conformance tests that cover Runtime Contract statements.
* Corrects, updates, or removes statements to more accurately represent
today's Knative runtime.
* Downgrades most untestable statements to informative, or removes them.
* Copies in important OCI runtime requirements we previously referenced.
* Removes a reference to the OCI specification that didn't bring new
requirements.

Ref: knative#2539, knative#2973, knative#4014, knative#4027
@dgerd

dgerd commented May 8, 2019

After discussion in the API working group meeting, this is my understanding of where we landed. Please correct if I missed something or misunderstood.

On the contract/specification side:

  1. LivenessProbe should not be configured by default. This part of the statement should be removed from the runtime contract.

On the implementation side:

  1. The default TCP probe behavior should be disabled when a user specifies a custom readinessProbe.
  2. The default TCP readinessProbe should be added by the webhook and appear on the Revision spec.
  3. Custom readiness probes defined by the user should be translated to health checks that are done BY the queue-proxy against the user-container, and removed from the user container.
  4. We should understand if our queue-proxy HTTP server startup time is causing #3308 ("Good gRPC deployment pods frequently fail at least one health check"). If it is, we should consider using an execProbe on the queue-proxy to perform the health check against the user-container. If it is not, we should look at how quickly we time out when health-checking the user container.
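
Steps 1-3 above can be sketched together as webhook-side logic. This is an illustrative sketch, not the merged implementation: the types are simplified stand-ins for `corev1`, and `USER_PROBE_JSON` is a hypothetical env var name for handing the serialized probe to the queue-proxy.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Simplified stand-ins for the corev1 types used here.
type TCPSocketAction struct {
	Port int `json:"port"`
}

type HTTPGetAction struct {
	Path string `json:"path"`
	Port int    `json:"port"`
}

type Probe struct {
	TCPSocket *TCPSocketAction `json:"tcpSocket,omitempty"`
	HTTPGet   *HTTPGetAction   `json:"httpGet,omitempty"`
}

type Container struct {
	Port           int
	ReadinessProbe *Probe
	Env            map[string]string
}

// applyProbePlan sketches the plan: default a TCP readinessProbe in the
// webhook only when the user set none (so it appears on the Revision
// spec), and hand any user-defined probe to the queue-proxy while
// removing it from the user container.
func applyProbePlan(user, queueProxy *Container) error {
	if user.ReadinessProbe == nil {
		user.ReadinessProbe = &Probe{TCPSocket: &TCPSocketAction{Port: user.Port}}
		return nil
	}
	raw, err := json.Marshal(user.ReadinessProbe)
	if err != nil {
		return err
	}
	queueProxy.Env["USER_PROBE_JSON"] = string(raw) // hypothetical name
	user.ReadinessProbe = nil                       // queue-proxy now owns the check
	return nil
}

func main() {
	qp := &Container{Env: map[string]string{}}
	u := &Container{Port: 8080, ReadinessProbe: &Probe{HTTPGet: &HTTPGetAction{Path: "/healthz", Port: 8080}}}
	_ = applyProbePlan(u, qp)
	fmt.Println(u.ReadinessProbe == nil) // true
}
```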

@mattmoor mattmoor added this to the Serving 0.7 milestone May 10, 2019
@joshrider
Contributor

/assign @joshrider

@joshrider
Contributor

/assign @shashwathi

@joshrider
Contributor

joshrider commented Jun 11, 2019

I've run into some questions about how to translate a RevisionSpec's probes over into something we'd want to execute from the queue-proxy against the user-container.

periodSeconds, successThreshold, and failureThreshold do not cleanly map over to the way we do things now.

At present, our hardcoded readinessProbe will make a GET request to the queue-proxy, which then fires TCP probes at the user-container at 50 ms intervals. This gives us the chance to pick up the user-container as soon as possible. This 50ms could be the "platform-specific" periodSeconds value referred to by the runtime-contract, but the sub-second value puts us into a bit of a sticky spot.

The periodSeconds value on the probe is an integer representing a number of seconds. This makes it difficult to: a) include an accurate description of the default probe in the Revision spec, and b) allow the user to configure their own sub-second periodSeconds values.

@dgerd pointed out that 0 is not a valid value for a probe, and may be an option to signify some special value. It could be that 0 is a stand-in for some smaller period of time (like 50ms), or it could indicate that we need to look elsewhere for the desired value. This could get us where we're going, but does not seem straightforward for a user. I'd be happy to get some feedback on ways to tackle this.

successThreshold and failureThreshold are both a little bit awkward, and may result in us reimplementing a bunch of logic from the kubelet's prober. I also have some questions about their meaningfulness given that, as the default probe is currently implemented, once the TCP probe from the queue-proxy against the user-container succeeds once, subsequent probes of the queue-proxy return successfully without ever re-checking the user-container.

Sidenote: there is a comment where we build the default probe in the queue-proxy that suggests we want to get the PreStop going as soon as possible. Given that the failureThreshold isn't set and that the default value for the failureThreshold is 3, would we benefit by setting it ourselves to a lower value?

@mattmoor
Member

Few thoughts:

  1. I think periodSeconds: 0 should mean do it as the system wants to.
  2. I don't think the system default needs to be reflected back into the yaml (and here in fact cannot be)
  3. If periodSeconds is specified we should respect it.
  4. I think that we should disallow failureThreshold if periodSeconds is unspecified, but I think successThreshold still makes sense.
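
A sketch of how points 1, 3, and 4 could translate into webhook validation — this is one possible encoding of the proposal, not the merged code, and `Probe` is a simplified stand-in for `corev1.Probe`:

```go
package main

import (
	"errors"
	"fmt"
)

// Probe is a simplified stand-in for corev1.Probe.
type Probe struct {
	PeriodSeconds    int32
	FailureThreshold int32
	SuccessThreshold int32
}

// validate encodes the proposed rules: periodSeconds 0 means "probe as
// aggressively as the system wants", an explicit periodSeconds is
// respected, failureThreshold is disallowed when periodSeconds is
// unspecified, and successThreshold is always allowed.
func validate(p Probe) error {
	if p.PeriodSeconds == 0 && p.FailureThreshold > 0 {
		return errors.New("failureThreshold requires an explicit periodSeconds")
	}
	return nil
}

func main() {
	fmt.Println(validate(Probe{PeriodSeconds: 0, SuccessThreshold: 1}))        // <nil>
	fmt.Println(validate(Probe{PeriodSeconds: 0, FailureThreshold: 3}) != nil) // true
}
```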

@joshrider
Contributor

@mattmoor @dgerd

Clarifying question: if the user specifies an HTTP probe with periodSeconds: 0, do we want that to be invalid or do we want to translate that to the aggressive retry configuration?

@mattmoor mattmoor modified the milestones: Serving 0.7, Serving 0.8 Jun 19, 2019
@mattmoor
Member

aggressive retries.

@mattmoor
Member

The base change is in. I'd like to see this land as early in 0.8 as we can, so it can bake. Please LMK if you have questions or need reviews.

@joshrider
Contributor

Not to jinx anything, but the e2e tests just passed. Have some housekeeping and cleanup to do, but hope to have something up for feedback at the end of the week.

@mattmoor
Member

@joshrider That's awesome news, LMK when things are RFAL.

@mattmoor
Member

mattmoor commented Jul 2, 2019

@joshrider did you jinx it? 😲

@joshrider
Contributor

Guess it's hard to argue that I didn't at this point...

Just put something up with a WIP tag. #4600
Lots to look at and feedback greatly appreciated. Will also need to be shepherded through CLA-land manually.

@mattmoor @dgerd

joshrider added a commit to joshrider/serving that referenced this issue Jul 8, 2019


Co-authored-by: Shash Reddy <shashwathireddy@gmail.com>
knative-prow-robot pushed a commit that referenced this issue Jul 9, 2019
* prepare queue healthHandler for optional aggressive retries #4014

Co-authored-by: Shash Reddy <shashwathireddy@gmail.com>

* add documentation for IsHTTPProbeReady

Co-authored-by: Shash Reddy <shashwathireddy@gmail.com>

* Address comments

* add kubelet header to http probe, add status code to failed probe error

* use kubelet user-agent in http probe

Co-authored-by: Shash Reddy <shashwathireddy@gmail.com>
shashwathi added a commit to joshrider/serving that referenced this issue Jul 9, 2019
This PR builds the adapter which converts a user-defined probe to a probe that
will be executed by queue proxy against user container.

Co-authored-by: Shash Reddy <shashwathireddy@gmail.com>
knative-prow-robot pushed a commit that referenced this issue Jul 10, 2019
* Prep for queue health handler for aggressive retries #4014

This PR builds the adapter which converts a user-defined probe to a probe that
will be executed by queue proxy against user container.

Co-authored-by: Shash Reddy <shashwathireddy@gmail.com>

* Address golang linter comments

* fix comment

Co-authored-by: Shash Reddy <shashwathireddy@gmail.com>

* return probe encoding error

* factor out queue probe retry logic, clean up

Co-authored-by: Shash Reddy <shashwathireddy@gmail.com>
8 participants