This repository has been archived by the owner on Jul 30, 2021. It is now read-only.

Will not recover if you delete the last api-server pod #267

Closed
aaronlevy opened this issue Jan 11, 2017 · 2 comments
Assignees: aaronlevy
Labels: kind/regression, priority/P0

Comments

@aaronlevy
Contributor

We should be able to recover from deletion of the only api-server pod (as long as it is ultimately managed by a higher-level object like a daemonset).

However, with the change to the new checkpointer in v0.3.1 there is an issue: the local kubelet state is only updated if the kubelet can contact an api-server. The checkpointer asks the kubelet "is the api-server running?" and the kubelet answers "yes". But if the pod that was just deleted was the last api-server pod, the kubelet can no longer reach any api-server, so its local state is never updated to reflect that the api-server is not in fact running.

The old checkpointer worked around this by simply hitting "localhost:8080" to determine whether the api-server is running. That has downsides: the checkpointer is no longer a generic tool (it has to know about specific pods), and the check isn't actually accurate (it could be the checkpointed copy that is running and listening on :8080).

Still, keeping that same behavior would be a reasonable short-term option.
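
For concreteness, the old behavior amounts to something like the sketch below (the /healthz path and the 3-second timeout are illustrative assumptions, not the exact old code):

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// apiServerUp mimics the old checkpointer's check: a plain HTTP request to
// the insecure api-server port on localhost. As noted above, a checkpointed
// copy of the api-server could also be listening on :8080, so a successful
// response does not prove the "real" api-server pod is running.
func apiServerUp() bool {
	client := &http.Client{Timeout: 3 * time.Second}
	resp, err := client.Get("http://127.0.0.1:8080/healthz")
	if err != nil {
		return false
	}
	defer resp.Body.Close()
	return resp.StatusCode == http.StatusOK
}

func main() {
	fmt.Println("api-server reachable:", apiServerUp())
}
```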

/cc @Quentin-M

@aaronlevy aaronlevy self-assigned this Jan 11, 2017
@aaronlevy aaronlevy added the kind/regression and priority/P0 labels Jan 11, 2017
@Quentin-M
Contributor

Quentin-M commented Jan 11, 2017

Leaving a quick comment before I leave the office for today.

We discussed potential solutions. The two that sounded the most reasonable, given the existing constraints and the desire to keep a generic checkpointer, are:

  • Extracting the liveness probe from a pod specification and querying it in order to determine the pod state. The upside is that we give the developer some control over what the checkpointer does with their pod. The drawbacks are that we could only support HTTP probes (and someone might eventually ask for exec/script support), and that it might not be reliable in every case. A rough sketch of this approach follows the list.

  • Actually contacting the local container runtime for the state of the pods. The advantage is that it is a fairly simple and reliable solution. It can be done either by implementing the calls to the existing runtimes (supposedly a low amount of code and maintenance - there are few runtimes, and new ones are rare) or by leveraging the kubelet's code (which means we would need to vendor it...).
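
To illustrate the first option, here is a minimal sketch that reads the HTTP liveness probe out of a checkpointed pod manifest and queries it against localhost. The manifest path, the k8s.io/api and k8s.io/apimachinery usage, and the timeout are assumptions for illustration, not an agreed design:

```go
package main

import (
	"fmt"
	"net/http"
	"os"
	"time"

	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/util/yaml"
)

// probeAlive decodes a pod manifest and queries each container's HTTP
// liveness probe on localhost, returning false if any probe fails.
func probeAlive(podManifest string) (bool, error) {
	f, err := os.Open(podManifest)
	if err != nil {
		return false, err
	}
	defer f.Close()

	var pod v1.Pod
	if err := yaml.NewYAMLOrJSONDecoder(f, 4096).Decode(&pod); err != nil {
		return false, err
	}

	for _, c := range pod.Spec.Containers {
		if c.LivenessProbe == nil || c.LivenessProbe.HTTPGet == nil {
			continue // this approach can only handle HTTP probes
		}
		hg := c.LivenessProbe.HTTPGet
		scheme := "http"
		if hg.Scheme == v1.URISchemeHTTPS {
			scheme = "https"
		}
		// Named ports are not resolved here; a real implementation would map
		// them to the container's port numbers.
		url := fmt.Sprintf("%s://127.0.0.1:%s%s", scheme, hg.Port.String(), hg.Path)

		client := &http.Client{Timeout: 3 * time.Second}
		resp, err := client.Get(url)
		if err != nil {
			return false, nil // probe unreachable: treat the pod as not running
		}
		resp.Body.Close()
		if resp.StatusCode < 200 || resp.StatusCode >= 400 {
			return false, nil
		}
	}
	return true, nil
}

func main() {
	// Hypothetical manifest path, purely for illustration.
	alive, err := probeAlive("/etc/kubernetes/manifests/kube-apiserver.yaml")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("liveness probes passing:", alive)
}
```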

@aaronlevy
Contributor Author

Just to note: the interim workaround is to use an older version of the checkpointer (until a fix is merged): quay.io/coreos/pod-checkpointer:b4f0353cc12d95737628b8815625cc8e5cedb6fc

Also, the fix I plan for is to use the kubelet's /runningpods/ API to determine the needed information.
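
A rough sketch of that direction, asking the kubelet which pods are actually running locally rather than trusting potentially stale api-server state. The port (10250), the skipped TLS verification, and the k8s-app label check are assumptions for illustration; the real fix may differ:

```go
package main

import (
	"crypto/tls"
	"encoding/json"
	"fmt"
	"net/http"
	"time"

	v1 "k8s.io/api/core/v1"
)

// runningPods queries the kubelet's /runningpods/ endpoint, which reports
// the pods the kubelet actually has running on this node.
func runningPods() (*v1.PodList, error) {
	client := &http.Client{
		Timeout: 5 * time.Second,
		Transport: &http.Transport{
			// Illustrative only: a real client should present kubelet client
			// credentials and verify the kubelet's serving certificate.
			TLSClientConfig: &tls.Config{InsecureSkipVerify: true},
		},
	}
	resp, err := client.Get("https://127.0.0.1:10250/runningpods/")
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("kubelet returned %s", resp.Status)
	}

	var pods v1.PodList
	if err := json.NewDecoder(resp.Body).Decode(&pods); err != nil {
		return nil, err
	}
	return &pods, nil
}

// apiServerRunning checks the running-pod list for an api-server pod.
// The label selector here is a hypothetical example.
func apiServerRunning(pods *v1.PodList) bool {
	for _, p := range pods.Items {
		if p.Labels["k8s-app"] == "kube-apiserver" {
			return true
		}
	}
	return false
}

func main() {
	pods, err := runningPods()
	if err != nil {
		fmt.Println("kubelet query failed:", err)
		return
	}
	fmt.Println("api-server running:", apiServerRunning(pods))
}
```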

Projects
None yet
Development

No branches or pull requests

2 participants