checkpointer: make grace period a flag. #813
Conversation
coreosbot run e2e checkpointer
LGTM
What are the other things? I don't feel a flag should be added solely for testing; it is better to create a unit test if we want to remove the dependency on timing.
@xiang90 it's a tunable parameter. I don't know what the optimal value is. Adding a flag helps for experimentation.
pkg/checkpoint/checkpoint.go
```diff
@@ -45,6 +44,9 @@ type Options struct {
 	RemoteRuntimeEndpoint string
 	// RuntimeRequestTimeout is the timeout that is used for requests to the RemoteRuntimeEndpoint.
 	RuntimeRequestTimeout time.Duration
+	// CheckpointGracePeriod is the timeout ithat is used for cleaning up checkpoints when the parent
```
ithat -> that
This allows users to configure the checkpoint removal grace period parameter as a flag. This may be useful for tests, among other things.
To be honest, I would rather have fewer tunable parameters and good default values. Also, it is not clear to me from reading the code why we have this gracePeriod in the first place. Maybe it is for dealing with a delete-then-create case, so that the deleted pod will still run for a while until the created one comes up?
If I understand correctly, setting the grace period depends on how fast the kubelet will start the replacement pod after the API server changes the state of the pod. What would be the side effect of setting it long enough for 99% of real-world cases, say 5 minutes instead of 1 minute? And what is the plan for the experiments you mentioned? What I want to avoid is adding a flag that is either misused or never used.
In the future I might expect there to be more of these types of configurables (like: don't start a checkpoint if it's older than X, to protect against stale nodes rejoining with really old workloads), or even controlling these types of things per-pod via annotations. So to move this forward, @xiang90, would opening an issue to discuss experimenting with the default grace period, and maybe adding a note on the flag that it's alpha / could be removed, satisfy your concerns?
will do.
that would be great.
Thinking about this a bit more, changing a flag from […] So for now, let's just merge this as is; I'm not concerned about this as a configurable for now. Feel free to add some info to the flag description if you like, but I won't block on it.
lgtm
@aaronlevy Shall we get this merged? After this PR gets merged, I will go ahead and create an issue for evaluating the default timeout, and potentially remove the flag once we have high confidence. Thanks.