Support eventual consistency #298

Open
3 tasks
windsource opened this issue Jun 19, 2024 · 7 comments
Labels
enhancement New feature or request. Issue will appear in the change log "Features"
Milestone

Comments

@windsource
Contributor

windsource commented Jun 19, 2024

Description

When applying a new manifest, Ankaios currently retries a failed start a fixed number of times and then gives up. The workload remains in state Pending, subState StartingFailed. A start can fail for different reasons, for example:

  • cannot pull image (registry not available, image not found, not authorized, ...)
  • invalid options passed to either commandOptions or commandArgs
  • folder does not exist when mounting volumes
  • etc.

While some of these problems cannot be solved without changing the manifest (e.g. invalid options), others might disappear after some time (e.g. registry not available or folder not existing).

Some users expect that Ankaios constantly tries to reach the desired state and also that Ankaios provides the result of the latest try (e.g. Podman error message).

Goals

  • Ankaios should constantly try to reach the desired state.
  • The interval between the retries shall be increased over time with a jitter (maybe same strategy as K8S).
  • ank get workloads shall provide the latest result.
  • ank apply shall return after the first attempt

Final result

Summary

To be filled when the final solution is sketched.

Tasks

  • Task 1
  • Task 2
  • ...
@windsource windsource added the enhancement New feature or request. Issue will appear in the change log "Features" label Jun 19, 2024
@windsource
Contributor Author

windsource commented Jun 21, 2024

Maybe we can also have an optional maximum time after which Ankaios stops trying to reach the desired state. The parameter could be part of a config file (see #302).

@krucod3 krucod3 added this to the v0.6 milestone Jun 26, 2024
@inf17101
Contributor

inf17101 commented Jun 26, 2024

Builds upon #67 (PR #137)

@krucod3
Contributor

krucod3 commented Dec 4, 2024

Kubernetes uses a backoff strategy capped at 300 seconds (5 minutes):
https://kubernetes.io/docs/concepts/containers/images/

There are different capped exponential backoff algorithms that can be used. A small comparison can be found here:
https://aws.amazon.com/blogs/architecture/exponential-backoff-and-jitter/
Additionally, general information can be found here:
https://en.wikipedia.org/wiki/Exponential_backoff

According to the AWS comparison, we can just use some sort of full or decorrelated jitter, e.g.

delay = min(cap, random_between(base, 3*last_delay)) // note that this is not exponential, but relatively fast growing 

where cap is 300 seconds and base could be something between 100 and 500 milliseconds.
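
To make the formula above concrete, here is a minimal Python sketch of it, not the Ankaios implementation. The names `next_delay`, `delays`, `BASE_SECONDS` and `CAP_SECONDS` are made up for illustration, and the base of 100 ms is just one value picked from the suggested 100 to 500 ms range.

```python
import random

BASE_SECONDS = 0.1   # assumption: lower end of the suggested 100 to 500 ms range
CAP_SECONDS = 300.0  # cap from the comment above (5 minutes)


def next_delay(last_delay, base=BASE_SECONDS, cap=CAP_SECONDS, rng=random):
    """Return the next wait time: min(cap, random_between(base, 3 * last_delay))."""
    # Guard against an upper bound below base (e.g. when last_delay is 0).
    upper = max(base, 3 * last_delay)
    return min(cap, rng.uniform(base, upper))


def delays(count, seed=None):
    """Return the first `count` delays, starting from last_delay = base."""
    rng = random.Random(seed)
    delay = BASE_SECONDS
    result = []
    for _ in range(count):
        delay = next_delay(delay, rng=rng)
        result.append(delay)
    return result


if __name__ == "__main__":
    for attempt, delay in enumerate(delays(10), start=1):
        print(f"retry {attempt}: wait {delay:.2f} s")
```

Because each delay is drawn from a range that depends on the previous delay, the sequence is not monotonic: a single draw can be shorter than the one before, but the upper bound grows quickly until it hits the cap.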

@krucod3
Contributor

krucod3 commented Dec 4, 2024

We should probably think about adding a new substate to the pending state, as the backoff could be quite long and we should signal to users that the workload is waiting in backoff. This must be done carefully, as we already had problems with quickly changing substates for the retry.

@krucod3
Contributor

krucod3 commented Dec 4, 2024

As a start, an optional maximum time for the retry can be configured centrally at the server. We already added the possibility to distribute config options to agents using the server hello message, so the workflow is already prepared.

@windsource
Contributor Author

windsource commented Dec 6, 2024

Just to make sure: the capping at 5 minutes in K8S does not mean that all attempts are stopped after 5 minutes, but that the maximum time between two attempts is 5 minutes.

@krucod3
Contributor

krucod3 commented Dec 6, 2024

Yes, the backoff delay is capped (see proposed formula above).
Stopping further retry attempts altogether would be your proposal of an optional configuration option, which can be implemented on top of this high-level idea.
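
To illustrate the difference between the two limits, here is a rough Python sketch under stated assumptions. The generator `retry_schedule` and the parameter `max_retry_seconds` are hypothetical names, not existing Ankaios config options, and time is only simulated by summing the delays instead of actually sleeping.

```python
import random


def retry_schedule(max_retry_seconds=None, base=0.1, cap=300.0, seed=None):
    """Yield successive backoff delays in seconds.

    `cap` limits the time between two attempts. `max_retry_seconds`
    (hypothetical, optional) limits the total time spent retrying;
    None means retry forever.
    """
    rng = random.Random(seed)
    elapsed = 0.0
    delay = base
    while True:
        # Capped delay as in the proposed formula.
        delay = min(cap, rng.uniform(base, max(base, 3 * delay)))
        # Stop only if the optional overall limit would be exceeded.
        if max_retry_seconds is not None and elapsed + delay > max_retry_seconds:
            return
        elapsed += delay
        yield delay


if __name__ == "__main__":
    limited = list(retry_schedule(max_retry_seconds=600.0))
    print(f"{len(limited)} retries within {sum(limited):.1f} s")
```

Without `max_retry_seconds` the generator never ends and only the spacing between attempts is bounded by `cap`. With it, the schedule ends once the next wait would push the total past the limit.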
