This repository has been archived by the owner on Jul 30, 2021. It is now read-only.

Will not recover if you delete the last api-server pod #267

Closed
aaronlevy opened this issue Jan 11, 2017 · 2 comments
Assignees: aaronlevy
Labels: kind/regression, priority/P0

Comments

@aaronlevy
Contributor

We should be able to recover from deletion of the only api-server pod (as long as it is ultimately managed by a higher-level object like a daemonset).

However, with the change to the new checkpointer in v0.3.1 there is an issue: the local kubelet state is only updated if the kubelet can contact an api-server. The checkpointer asks the kubelet "is the api-server running?" and the kubelet answers "yes". But if the pod that was just deleted was the last api-server pod, the kubelet can no longer reach any api-server, so its local state is never updated to reflect that the api-server is not in fact running.

The old checkpointer worked around this by simply hitting "localhost:8080" to determine whether the api-server is running. That has downsides: the checkpointer is no longer a generic tool (it has to know about specific pods), and the check isn't actually accurate (it could be the checkpointed copy that is running and listening on :8080).

Still, keeping that same behavior would be a reasonable short-term option.
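
For concreteness, the old behavior amounts to something like the sketch below (the /healthz path and the 3-second timeout are illustrative assumptions, not the exact old code):

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// apiServerUp mimics the old checkpointer's check: a plain HTTP request to
// the insecure api-server port on localhost. As noted above, a checkpointed
// copy of the api-server could also be listening on :8080, so a successful
// response does not prove the "real" api-server pod is running.
func apiServerUp() bool {
	client := &http.Client{Timeout: 3 * time.Second}
	resp, err := client.Get("http://127.0.0.1:8080/healthz")
	if err != nil {
		return false
	}
	defer resp.Body.Close()
	return resp.StatusCode == http.StatusOK
}

func main() {
	fmt.Println("api-server reachable:", apiServerUp())
}
```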

/cc @Quentin-M

@aaronlevy aaronlevy self-assigned this Jan 11, 2017
@aaronlevy aaronlevy added the kind/regression and priority/P0 labels Jan 11, 2017
@Quentin-M
Contributor

Quentin-M commented Jan 11, 2017

Leaving a quick comment before I leave the office for today.

We discussed potential solutions. The two that sounded the most reasonable, given the existing constraints and the desire to keep a generic checkpointer, are:

  • Extracting the liveness probe from a pod specification and querying it in order to determine the pod state. The upside is that we give the developer some control over what the checkpointer does with their pod. The drawbacks are that we could only support HTTP probes (and someone might eventually ask for exec/script support), and that it might not be reliable in every case. A rough sketch of this approach follows the list.

  • Actually contacting the local container runtime for the state of the pods. The advantage is that it is a fairly simple and reliable solution. It can be done either by implementing the calls to the existing runtimes (supposedly a low amount of code and maintenance - there are few runtimes, and new ones are rare) or by leveraging the kubelet's code (which means we would need to vendor it...).
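
To illustrate the first option, here is a minimal sketch that reads the HTTP liveness probe out of a checkpointed pod manifest and queries it against localhost. The manifest path, the k8s.io/api and k8s.io/apimachinery usage, and the timeout are assumptions for illustration, not an agreed design:

```go
package main

import (
	"fmt"
	"net/http"
	"os"
	"time"

	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/util/yaml"
)

// probeAlive decodes a pod manifest and queries each container's HTTP
// liveness probe on localhost, returning false if any probe fails.
func probeAlive(podManifest string) (bool, error) {
	f, err := os.Open(podManifest)
	if err != nil {
		return false, err
	}
	defer f.Close()

	var pod v1.Pod
	if err := yaml.NewYAMLOrJSONDecoder(f, 4096).Decode(&pod); err != nil {
		return false, err
	}

	for _, c := range pod.Spec.Containers {
		if c.LivenessProbe == nil || c.LivenessProbe.HTTPGet == nil {
			continue // this approach can only handle HTTP probes
		}
		hg := c.LivenessProbe.HTTPGet
		scheme := "http"
		if hg.Scheme == v1.URISchemeHTTPS {
			scheme = "https"
		}
		// Named ports are not resolved here; a real implementation would map
		// them to the container's port numbers.
		url := fmt.Sprintf("%s://127.0.0.1:%s%s", scheme, hg.Port.String(), hg.Path)

		client := &http.Client{Timeout: 3 * time.Second}
		resp, err := client.Get(url)
		if err != nil {
			return false, nil // probe unreachable: treat the pod as not running
		}
		resp.Body.Close()
		if resp.StatusCode < 200 || resp.StatusCode >= 400 {
			return false, nil
		}
	}
	return true, nil
}

func main() {
	// Hypothetical manifest path, purely for illustration.
	alive, err := probeAlive("/etc/kubernetes/manifests/kube-apiserver.yaml")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("liveness probes passing:", alive)
}
```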

@aaronlevy
Contributor Author

Just to note: the interim workaround is to use an older version of the checkpointer (until a fix is merged): quay.io/coreos/pod-checkpointer:b4f0353cc12d95737628b8815625cc8e5cedb6fc

Also, the fix I plan for is to use the kubelet's /runningpods/ API to determine the needed information.
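
A rough sketch of that direction, asking the kubelet which pods are actually running locally rather than trusting potentially stale api-server state. The port (10250), the skipped TLS verification, and the k8s-app label check are assumptions for illustration; the real fix may differ:

```go
package main

import (
	"crypto/tls"
	"encoding/json"
	"fmt"
	"net/http"
	"time"

	v1 "k8s.io/api/core/v1"
)

// runningPods queries the kubelet's /runningpods/ endpoint, which reports
// the pods the kubelet actually has running on this node.
func runningPods() (*v1.PodList, error) {
	client := &http.Client{
		Timeout: 5 * time.Second,
		Transport: &http.Transport{
			// Illustrative only: a real client should present kubelet client
			// credentials and verify the kubelet's serving certificate.
			TLSClientConfig: &tls.Config{InsecureSkipVerify: true},
		},
	}
	resp, err := client.Get("https://127.0.0.1:10250/runningpods/")
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("kubelet returned %s", resp.Status)
	}

	var pods v1.PodList
	if err := json.NewDecoder(resp.Body).Decode(&pods); err != nil {
		return nil, err
	}
	return &pods, nil
}

// apiServerRunning checks the running-pod list for an api-server pod.
// The label selector here is a hypothetical example.
func apiServerRunning(pods *v1.PodList) bool {
	for _, p := range pods.Items {
		if p.Labels["k8s-app"] == "kube-apiserver" {
			return true
		}
	}
	return false
}

func main() {
	pods, err := runningPods()
	if err != nil {
		fmt.Println("kubelet query failed:", err)
		return
	}
	fmt.Println("api-server running:", apiServerRunning(pods))
}
```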

Projects
None yet
Development

No branches or pull requests

2 participants