Skip to content
This repository has been archived by the owner on Jul 30, 2021. It is now read-only.

checkpointer: rewrite for generic pod checkpoints #238

Merged
merged 1 commit into from
Dec 30, 2016

Conversation

aaronlevy
Copy link
Contributor

Rewrite of the checkpointer to support checkpointing of arbitrary pods.

Will checkpoint pods based on annotation, then also supports garbage collection of pods and secrets by tracking the parents of the checkpoints. I've called out TODOs in the code for some minor tweaks that can be follow ups.

Big change, so hopefully can get a few eyes on this.

/cc @yifan-gu @xiang90 @ericchiang @dghubble @derekparker

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Dec 20, 2016
@ericchiang
Copy link
Contributor

Code looks good at first scan. Good comments, TODOS, and appreciate you expanding the docs. No specific comments on the overall architecture or flow.

@derekparker
Copy link
Contributor

Looks good to me overall, tests look good, and great to have tests come out of this rewrite, and overall making it more unit-testable.

name: kube-apiserver
namespace: kube-system
annotations:
checkpointer.alpha.coreos.com/checkpoint-of=kube-system/kube-apiserver
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure we need to append kube-system in this annotation (we might need it for the on disk file name). we already have namespace in metadata.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree. I'll change annotation to just use podname.

if pod.Annotations == nil {
return false
}
shouldCheckpoint := pod.Annotations[shouldCheckpointAnnotation] == "true"
Copy link
Contributor

@xiang90 xiang90 Dec 27, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

define "true" somewhere as a const?

return false
}
shouldCheckpoint := pod.Annotations[shouldCheckpointAnnotation] == "true"
isStatic := pod.Annotations[podSourceAnnotation] == "file"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

define file as a const?

// If there is a parent pod and an active checkpoint, stop the active checkpoint
// The parent may not be in a running state, but the kubelet is trying to start it so we should get out of the way.
for id := range localParentPods {
if _, ok := activeCheckpoints[id]; ok {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

range over the activeCheckpoints in the outer loop? be consistent with line 148?


podList, err := client.Core().Pods(api.NamespaceAll).List(opts)
if err != nil {
glog.Infof("Unable to contact APIServer, will not perform garbage collection: %v", err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

warnf?

@xiang90
Copy link
Contributor

xiang90 commented Dec 27, 2016

gave this pr a quick review. the logic seems reasonable.

@aaronlevy
Copy link
Contributor Author

coreosbot test checkpoint code

activeManifest = filepath.Join(activePath, manifestFilename)
checkpointManifest = filepath.Join(ignorePath, manifestFilename)
secureAPIAddr = fmt.Sprintf("https://%s:%s", os.Getenv("KUBERNETES_SERVICE_HOST"), os.Getenv("KUBERNETES_SERVICE_PORT_HTTPS"))
activeCheckpointPath = "/etc/kubernetes/manifests"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it better to change path to "/etc/kubernetes/checkpointer/pod_manifests"?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This must map to the --config or --pod-manifest-path for the static pods to be started. The path by convention (and our installation tools) is set to /etc/kubernetes/manifests

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for the explanation.

glog.Error(err)
time.Sleep(3 * time.Second)

// Retrive pods from the local kubelet API
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this comment is meaningless i feel. line67 is clear enough.

}

// Create an inactive checkpoint for any valid parent pods
createCheckpoints(client, localParentPods)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

createCheckpointsForValidPartents? Then remove the comment?

if !isRunning(pod) {
continue
}
id := podID(pod)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PodID is not ideal. k8s pods object has its UID. I thought it was Pod UID. maybe podPath or podFullName or something less confusing?

glog.Errorf("Failed to sanitize manifest for %s: %v", id, err)
continue
}
if err := ensureCheckpointManifest(cp); err != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

writeCheckpointManifest? why do we name it as ensure?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I used ensure because it may not write the manifest if it already exists and is the same. No strong feelings here, could use writeCheckpointManifest or maybe writeIfNotExistsCheckpointManifest

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel writeXXX is more descriptive.

p := podIDToActiveCheckpointPath(id)
if err := os.Remove(p); err != nil {
if os.IsNotExist(err) { // Sanity check (it's fine - just want to surface this if it's occuring)
glog.Infof("Attempted to remove active checkpoint, but manifest no longer exists: %s", p)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

warnf?

@xiang90
Copy link
Contributor

xiang90 commented Dec 29, 2016

LGTM after fixing the minor issues. We should fix the CI test (or at least figure out what is wrong) before get this merged.

@xiang90
Copy link
Contributor

xiang90 commented Dec 29, 2016

@aaronlevy Squash commits and merge?

@aaronlevy
Copy link
Contributor Author

@pbx0 is on vacation, so not entirely sure how to debug the CI test -- but I'll squash and do a bit more manual testing then merge if all looks well. Thanks @xiang90

@aaronlevy
Copy link
Contributor Author

I think the ci tests are borked at the moment.

@aaronlevy
Copy link
Contributor Author

coreosbot test this please

@aaronlevy
Copy link
Contributor Author

Manually tested the checkpoint functionality by launching a cluster and rebooting the master node multiple times.

I'm going to go ahead and merge this as it doesn't really change any existing functionality until we bump the checkpointer manifest hash. The only change in existing behavior with this PR will be a new annotation on the api-server.

@aaronlevy aaronlevy merged commit 5f0ba7b into kubernetes-retired:master Dec 30, 2016
@aaronlevy aaronlevy deleted the checkpoint-rewrite branch December 30, 2016 00:54
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants