Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: [release/v0.14.x] Introduce startup probe to CAREN pod to reduce races #934

Merged
merged 1 commit into from
Oct 2, 2024

Conversation

jimmidyson
Copy link
Member

@jimmidyson jimmidyson commented Oct 1, 2024

Introduce a more frequent startup probe polling every second, which once
it succeeds will trigger readiness and liveness probes to run at their
default intervals. This reduces chances of races in clusterctl checking
for readiness while keeping the overhead on the kubelet for checking
readiness once the pod is up to a minimum (default probe period).

dkoshkin
dkoshkin previously approved these changes Oct 1, 2024
@dkoshkin dkoshkin dismissed their stale review October 1, 2024 17:47

Wrong timeout

@dkoshkin
Copy link
Contributor

dkoshkin commented Oct 1, 2024

I think this is the wrong timeout, and is being set for the helm registry?

@supershal
Copy link
Contributor

with this it will result in following with defaults

readinessProbe:
      failureThreshold: 3
      httpGet:
        path: /readyz
        port: healthz
        scheme: HTTP
      periodSeconds: 1
      successThreshold: 1
      timeoutSeconds: 1

Once the container starts it will probe every 1 second with timeout of 1 seconds and upto 3 consecutive failures. so I guess CAREN should be up and ready in 3 seconds. CAREN starts within a second.

@jimmidyson
Copy link
Member Author

with this it will result in following with defaults

readinessProbe:
      failureThreshold: 3
      httpGet:
        path: /readyz
        port: healthz
        scheme: HTTP
      periodSeconds: 1
      successThreshold: 1
      timeoutSeconds: 1

Once the container starts it will probe every 1 second with timeout of 1 seconds and upto 3 consecutive failures. so I guess CAREN should be up and ready in 3 seconds. CAREN starts within a second.

While you're right, the failure of a readiness probe just removes it from the service endpoints so the pod keeps running (as opposed to liveness probe where the pod will be killed). The readiness probe will still run every second and after a single success will be marked as ready. So this isn't a problem even if it fails.

@jimmidyson jimmidyson force-pushed the jimmi/probe-races-backport branch from 4f89fe1 to fa52c80 Compare October 2, 2024 09:03
@github-actions github-actions bot added fix and removed fix labels Oct 2, 2024
@jimmidyson jimmidyson changed the title fix: [release/v0.14.x] Shorten readiness probe period to try to prevent races fix: [release/v0.14.x] Introduce startup probe to CAREN pod to reduce races Oct 2, 2024
@github-actions github-actions bot added fix and removed fix labels Oct 2, 2024
@jimmidyson
Copy link
Member Author

I've actually changed the implementation here to use startup probe to have the same effect (fast checking for readiness) but not have the overhead on the kubelet of probing every second once the pod is first set to ready (using default readiness probe period once startup probe has succeeded).

Introduce a more frequent startup probe polling every second, which once
it succeeds will trigger readiness and liveness probes to run at their
default intervals. This reduces chances of races in clusterctl checking
for readiness while keeping the overhead on the kubelet for checking
readiness once the pod is up to a minimum (default probe period).
@jimmidyson
Copy link
Member Author

@dkoshkin If you could take a look and approve again if you're happy then we can merge.

@supershal supershal merged commit 05fd175 into release/v0.14.x Oct 2, 2024
15 checks passed
@supershal supershal deleted the jimmi/probe-races-backport branch October 2, 2024 18:53
supershal added a commit that referenced this pull request Oct 2, 2024
🤖 I have created a release *beep* *boop*
---


## 0.14.9 (2024-10-02)

<!-- Release notes generated using configuration in .github/release.yaml
at release/v0.14.x -->

## What's Changed
### Fixes 🔧
* fix: [release/v0.14.x] Rename webhook container to manager by
@jimmidyson in
#933
* fix: [release/v0.14.x] Introduce startup probe to CAREN pod to reduce
races by @jimmidyson in
#934
### Other Changes
* ci(release/v0.14.x): enable creating release-please PR from release
branches by @supershal in
#911
* test(e2e): [release/v0.14.x] Use released rocky image by @jimmidyson
in
#935


**Full Changelog**:
v0.14.8...v0.14.9

---
This PR was generated with [Release
Please](https://github.com/googleapis/release-please). See
[documentation](https://github.com/googleapis/release-please#release-please).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants