-
Notifications
You must be signed in to change notification settings - Fork 224
checkpointer: rewrite for generic pod checkpoints #238
checkpointer: rewrite for generic pod checkpoints #238
Conversation
Code looks good at first scan. Good comments, TODOS, and appreciate you expanding the docs. No specific comments on the overall architecture or flow. |
Looks good to me overall, tests look good, and great to have tests come out of this rewrite, and overall making it more unit-testable. |
name: kube-apiserver | ||
namespace: kube-system | ||
annotations: | ||
checkpointer.alpha.coreos.com/checkpoint-of=kube-system/kube-apiserver |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not sure we need to append kube-system in this annotation (we might need it for the on disk file name). we already have namespace in metadata.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree. I'll change annotation to just use podname.
if pod.Annotations == nil { | ||
return false | ||
} | ||
shouldCheckpoint := pod.Annotations[shouldCheckpointAnnotation] == "true" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
define "true" somewhere as a const?
return false | ||
} | ||
shouldCheckpoint := pod.Annotations[shouldCheckpointAnnotation] == "true" | ||
isStatic := pod.Annotations[podSourceAnnotation] == "file" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
define file as a const?
// If there is a parent pod and an active checkpoint, stop the active checkpoint | ||
// The parent may not be in a running state, but the kubelet is trying to start it so we should get out of the way. | ||
for id := range localParentPods { | ||
if _, ok := activeCheckpoints[id]; ok { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
range over the activeCheckpoints in the outer loop? be consistent with line 148?
|
||
podList, err := client.Core().Pods(api.NamespaceAll).List(opts) | ||
if err != nil { | ||
glog.Infof("Unable to contact APIServer, will not perform garbage collection: %v", err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
warnf?
gave this pr a quick review. the logic seems reasonable. |
coreosbot test checkpoint code |
activeManifest = filepath.Join(activePath, manifestFilename) | ||
checkpointManifest = filepath.Join(ignorePath, manifestFilename) | ||
secureAPIAddr = fmt.Sprintf("https://%s:%s", os.Getenv("KUBERNETES_SERVICE_HOST"), os.Getenv("KUBERNETES_SERVICE_PORT_HTTPS")) | ||
activeCheckpointPath = "/etc/kubernetes/manifests" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is it better to change path to "/etc/kubernetes/checkpointer/pod_manifests"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This must map to the --config
or --pod-manifest-path
for the static pods to be started. The path by convention (and our installation tools) is set to /etc/kubernetes/manifests
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks for the explanation.
glog.Error(err) | ||
time.Sleep(3 * time.Second) | ||
|
||
// Retrive pods from the local kubelet API |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this comment is meaningless i feel. line67 is clear enough.
} | ||
|
||
// Create an inactive checkpoint for any valid parent pods | ||
createCheckpoints(client, localParentPods) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
createCheckpointsForValidPartents? Then remove the comment?
if !isRunning(pod) { | ||
continue | ||
} | ||
id := podID(pod) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PodID is not ideal. k8s pods object has its UID. I thought it was Pod UID. maybe podPath or podFullName or something less confusing?
glog.Errorf("Failed to sanitize manifest for %s: %v", id, err) | ||
continue | ||
} | ||
if err := ensureCheckpointManifest(cp); err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
writeCheckpointManifest? why do we name it as ensure?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I used ensure because it may not write the manifest if it already exists and is the same. No strong feelings here, could use writeCheckpointManifest
or maybe writeIfNotExistsCheckpointManifest
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel writeXXX is more descriptive.
p := podIDToActiveCheckpointPath(id) | ||
if err := os.Remove(p); err != nil { | ||
if os.IsNotExist(err) { // Sanity check (it's fine - just want to surface this if it's occuring) | ||
glog.Infof("Attempted to remove active checkpoint, but manifest no longer exists: %s", p) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
warnf?
LGTM after fixing the minor issues. We should fix the CI test (or at least figure out what is wrong) before get this merged. |
@aaronlevy Squash commits and merge? |
@pbx0 is on vacation, so not entirely sure how to debug the CI test -- but I'll squash and do a bit more manual testing then merge if all looks well. Thanks @xiang90 |
bc66d23
to
bd6e885
Compare
I think the ci tests are borked at the moment. |
coreosbot test this please |
Manually tested the checkpoint functionality by launching a cluster and rebooting the master node multiple times. I'm going to go ahead and merge this as it doesn't really change any existing functionality until we bump the checkpointer manifest hash. The only change in existing behavior with this PR will be a new annotation on the api-server. |
Rewrite of the checkpointer to support checkpointing of arbitrary pods.
Will checkpoint pods based on annotation, then also supports garbage collection of pods and secrets by tracking the parents of the checkpoints. I've called out TODOs in the code for some minor tweaks that can be follow ups.
Big change, so hopefully can get a few eyes on this.
/cc @yifan-gu @xiang90 @ericchiang @dghubble @derekparker