This repository has been archived by the owner on Jul 30, 2021. It is now read-only.

master reboot test is failing #262

Closed
peebs opened this issue Jan 6, 2017 · 6 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. kind/regression Categorizes issue or PR as related to a regression from a prior release. priority/P0

Comments

@peebs

peebs commented Jan 6, 2017

This seemed to slip into the master branch while the tests were temporarily broken during some of the self-hosted-flannel changes. If I reboot a master node, the api-server doesn't come back up. @aaronlevy is already on it and suspects what the problem is.

@aaronlevy aaronlevy self-assigned this Jan 6, 2017
@aaronlevy aaronlevy added kind/bug Categorizes issue or PR as related to a bug. kind/regression Categorizes issue or PR as related to a regression from a prior release. priority/P0 labels Jan 6, 2017
@aaronlevy
Contributor

There are a few different issues at play here - but the core issue seems to be that the checkpointer relies on the kubelet /pods api to determine local state - and that api may not reflect accurate local state.

Unfortunately, that kubelet api endpoint just reports state from the last time it was able to successfully contact an api-server - essentially a cache of the last state reported to the api.

We didn't have this same issue with the old checkpointer - because it would determine "api is running" by reaching out to an api -- but that isn't reliable in multi-master, or for generic checkpointing.

This is a rather disappointing discovery -- as it seems there is no way to easily determine the local pod state if an api-server is not available.
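
For reference, the state check in question amounts to roughly the following (a minimal sketch; it assumes the kubelet's default read-only port, 10255):

```go
// Minimal sketch of what the checkpointer's state check amounts to:
// poll the kubelet /pods endpoint (assumes the default read-only port
// 10255). The response is the kubelet's cached pod list, which can be
// stale if no api-server has been reachable since the pods last changed.
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

type podList struct {
	Items []struct {
		Metadata struct {
			Name      string `json:"name"`
			Namespace string `json:"namespace"`
		} `json:"metadata"`
		Status struct {
			Phase string `json:"phase"`
		} `json:"status"`
	} `json:"items"`
}

func main() {
	resp, err := http.Get("http://127.0.0.1:10255/pods")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var pods podList
	if err := json.NewDecoder(resp.Body).Decode(&pods); err != nil {
		panic(err)
	}
	for _, p := range pods.Items {
		// Phase reflects the last state the kubelet knew about, not
		// necessarily what is running on the node right now.
		fmt.Printf("%s/%s: %s\n", p.Metadata.Namespace, p.Metadata.Name, p.Status.Phase)
	}
}
```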

Some options moving forward:

  1. Go back to hard-coded expectations (e.g. curl localhost:8080)
  2. Extract liveness probes from pod specs and try to use those if available
  3. Add some kind of "last seen" annotation to checkpoint parent pods, where we expect to see an update within a particular time window.
  4. Longer term option: make use of CRI interface and inspect state ourselves

/cc @yifan-gu @dghubble

@Quentin-M
Contributor

Quentin-M commented Jan 9, 2017

I am not too familiar with the checkpointer itself and the constraints that surround it, but I believe that relying on liveness probes could be a decent solution. It is simple, users are familiar with the concept, and they would expect the checkpointer to rely on them. It is also user-customizable in a way. Additionally, the project already vendors Kubernetes, so the functions that achieve this could be re-used directly (less code to maintain, better integration overall).

The downside is that if we do not want to add the requirement of defining probes, this is only part of the solution.
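
As a rough illustration of what relying on liveness probes could look like, here is a sketch with simplified types (not the vendored Kubernetes prober; named probe ports are not handled):

```go
// Rough sketch of option 2 above: read a checkpointed pod manifest,
// pull out an httpGet liveness probe if one is defined, and hit it
// directly. Types are simplified; a real implementation would re-use the
// vendored Kubernetes types/prober.
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"os"
	"time"
)

type manifest struct {
	Spec struct {
		Containers []struct {
			Name          string `json:"name"`
			LivenessProbe *struct {
				HTTPGet *struct {
					Path string `json:"path"`
					Port int    `json:"port"` // numeric ports only in this sketch
				} `json:"httpGet"`
			} `json:"livenessProbe"`
		} `json:"containers"`
	} `json:"spec"`
}

func probeManifest(path string) error {
	raw, err := os.ReadFile(path)
	if err != nil {
		return err
	}
	var m manifest
	if err := json.Unmarshal(raw, &m); err != nil {
		return err
	}
	client := &http.Client{Timeout: 5 * time.Second}
	for _, c := range m.Spec.Containers {
		if c.LivenessProbe == nil || c.LivenessProbe.HTTPGet == nil {
			continue // no probe defined - would need a fallback strategy
		}
		url := fmt.Sprintf("http://127.0.0.1:%d%s", c.LivenessProbe.HTTPGet.Port, c.LivenessProbe.HTTPGet.Path)
		resp, err := client.Get(url)
		if err != nil {
			return fmt.Errorf("container %s: probe failed: %v", c.Name, err)
		}
		resp.Body.Close()
		if resp.StatusCode >= 400 {
			return fmt.Errorf("container %s: probe returned %d", c.Name, resp.StatusCode)
		}
	}
	return nil
}

func main() {
	if err := probeManifest(os.Args[1]); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```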

@aaronlevy
Contributor

The other part of this issue is that api-server behavior has changed in v1.5:

In earlier versions, the api-server would continually re-try to bind to particular addresses (:443 & :8080). If they were already in use, it would just try again in 15 seconds.

The behavior in v1.5 is that the api-server will exit immediately if it is not able to listen on those addresses (and would rely on external mechanisms of systemd/kubelet/etc to restart it again).

In terms of the checkpointer - we need a reliable way to determine "real api-server is running, or it is trying to be run". The "trying to be run" is important, because we need to remove an active checkpoint in this situation, so the real server can actually start (and bind on 443/8080).

Even if we check the liveness probe, we don't know "what" we are checking (only that someone happens to be listening on 8080). It could be an active checkpoint, or it could be the active parent. All we can determine is that one of them happens to be running - but we can't make reliable / actionable decisions based on just that information (I think adding the liveness check will work - just not in a particularly clean way).
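
To make that concrete, a check like the following sketch only proves that something answers on the port, not which process it is:

```go
// Sketch of the ambiguity: dialing the insecure port only proves that
// *something* is listening on 8080; it could be the active checkpoint or
// the real (parent) api-server, and we cannot tell which from this alone.
package main

import (
	"fmt"
	"net"
	"time"
)

func main() {
	conn, err := net.DialTimeout("tcp", "127.0.0.1:8080", 2*time.Second)
	if err != nil {
		fmt.Println("nothing listening on 8080:", err)
		return
	}
	conn.Close()
	fmt.Println("someone is listening on 8080 (checkpoint or parent - unknown)")
}
```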

For example, the issue I was seeing was:

  • After reboot the kubelet starts (no local pod state)
  • Checkpointer sees that it has an inactive checkpoint for api-server, and should start it
  • Checkpointed-apiserver starts, and now kubelet can ask api what pods should be running
  • Kubelet starts the real apiserver pod, but it immediately fails because it cannot bind to 8080/443
  • Even though the api-server pod immediately failed, it was in fact "started" by the kubelet - so the last reported state to the api was: apiserver podState=running.
  • Checkpointer inspects the local kubelet-api, and sees apiserver podState=running - even though that is no longer true.
  • We are now in a state where kubelet-api thinks api-pod is running - but this could be stale information.
  • The issue: the desired state is that the checkpointed copy is deactivated (this happens) and that the real apiserver pod takes over (this does not happen). What I'm seeing is that the kubelet is not trying to restart the real api-server pod (it had failed to bind earlier), even though it could successfully start now. The checkpointer then only sees that the /pods api reports the (stale) apiserver podState=running, so it makes no changes (it thinks everything is a-ok).

One other option which comes to mind after typing this: ensuring that the local docker state (of failed pods) exists for a longer window (--minimum-image-ttl-duration).

I think this might help because I believe the kubelet determines what pods it needs to restart by inspecting information serialized into the docker containers (otherwise a reboot of a kubelet would mean all local state is lost until api-server is available). So my hunch is that we have an issue where the api-server pod is being garbage collected in the window before kubelet knows to restart it. But if we leave this sitting around longer -- we could have a better recovery window.

@Quentin-M
Contributor

  • Both the old and new checkpointers are affected by the API server change in 1.5
  • The --minimum-image-ttl-duration doesn't seem to help
  • The old checkpointer is slightly better because the race can potentially be won, but the odds are low and it takes forever to get the right timing.

Will experiment with waiting for a file lock on API server start.

@aaronlevy
Contributor

For posterity:

There is only minimal state which is actually stored with the docker containers: https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/dockertools/labels.go#L70

So it's not actually possible to recover local state from this info alone (this is mapped to the internal kubelet pod state).
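
For anyone who wants to see that for themselves, here is a rough sketch that dumps those labels (the Docker Go client calls are illustrative of that era's API):

```go
// Rough sketch: dump the io.kubernetes.* labels the kubelet serializes
// onto docker containers (the label keys live in the kubelet file linked
// above). Note how little is there - pod name/namespace/UID and container
// name/hash, not the full pod status.
package main

import (
	"context"
	"fmt"
	"log"

	"github.com/docker/docker/api/types"
	"github.com/docker/docker/client"
)

func main() {
	cli, err := client.NewEnvClient()
	if err != nil {
		log.Fatal(err)
	}
	containers, err := cli.ContainerList(context.Background(), types.ContainerListOptions{All: true})
	if err != nil {
		log.Fatal(err)
	}
	for _, c := range containers {
		fmt.Println(c.Names, c.State)
		for k, v := range c.Labels {
			fmt.Printf("  %s=%s\n", k, v)
		}
	}
}
```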

Another option @Quentin-M and I came up with is to have the api-server use file locks to coordinate between the parent / checkpoint. This way we don't end up in failure loops when both are running, but only one can successfully listen on host address.
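
A minimal sketch of that coordination, assuming an agreed-upon lock file path (the path here is illustrative): each api-server instance takes an exclusive flock before binding and holds it for as long as it serves.

```go
// Minimal sketch of the file-lock idea: each api-server instance takes an
// exclusive flock on a shared path before binding, and holds it for as
// long as it serves. The lock path is illustrative.
package main

import (
	"log"
	"os"
	"syscall"
)

const lockPath = "/var/run/kube-apiserver.lock" // illustrative path

func acquireLock() (*os.File, error) {
	f, err := os.OpenFile(lockPath, os.O_CREATE|os.O_RDWR, 0600)
	if err != nil {
		return nil, err
	}
	// Blocks until no other api-server instance holds the lock, so the
	// checkpoint and the self-hosted pod never fight over 443/8080.
	if err := syscall.Flock(int(f.Fd()), syscall.LOCK_EX); err != nil {
		f.Close()
		return nil, err
	}
	return f, nil
}

func main() {
	f, err := acquireLock()
	if err != nil {
		log.Fatal(err)
	}
	// In the real setup the lock would be held for the lifetime of the
	// api-server process; the kernel releases it when the process exits.
	defer f.Close()

	log.Println("lock acquired; safe to bind and serve")
}
```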

@aaronlevy
Contributor

Should be closed by: #264

peebs pushed a commit to peebs/bootkube that referenced this issue Mar 22, 2017
This commit represents a workaround for kubernetes-retired#262. By maintaining
a file lock while the API server is running (either temporary
or self-hosted), we prevent the self-hosted API server from
starting and trying to bind ports, until the temporary one is
stopped. Therefore, we avoid the loop where the self-hosted
API server would crash as soon as it is brought up due to the
ports already being bound by the stopping temporary server.