-
Notifications
You must be signed in to change notification settings - Fork 3
Add cron job that does defrag without considering threshold #37
Conversation
It can be beneficial to defrag even if the threshold is not yet met, so we add a command line flag that can take the values hourly, daily or weekly and will force a defrag at this interval. The default value is set to weekly. The implementation is not robust, defrags can be missed due to restarting etc. A new failure scenario is added as we validate the interval and fail to start if the value is invalid.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've left a few comments about context's and tests, but it looks good overall
@@ -1025,6 +1048,14 @@ func nextOutdatedPeer(cluster *etcdv1alpha1.EtcdCluster, peers *etcdv1alpha1.Etc | |||
return nil | |||
} | |||
|
|||
func scheduleMapKeyFor(cluster *etcdv1alpha1.EtcdCluster) string { | |||
return string(cluster.UID) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: it might be worth making this string(cluster.UID) + defrag
.
Just so it's clear what this key does and to stop us accidentally overlapping keys if we add more cron jobs in the future
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good idea, will add it.
func (r *EtcdClusterReconciler) defragCronJob(log logr.Logger, cluster *etcdv1alpha1.EtcdCluster, members *[]etcd.Member) { | ||
ctx, cancel := context.WithTimeout(context.Background(), time.Second*25) | ||
func (r *EtcdClusterReconciler) defragWithThresholdCronJob(log logr.Logger, cluster *etcdv1alpha1.EtcdCluster) { | ||
ctx, cancel, etcdClient, members, ok := r.setupDefragDeps(log, cluster) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's a bit cleaner to create both the required context's outside of setupDefragDeps
, so it's really clear what context is being used where and what timeouts are set.
It would also be nice to add a comment on why we need a new context (i.e "we can't use reconciles
ctx, as it will have likely expired when this cronjob is ran").
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that it was a bit strange to return the ctx, cancel from the method (also I think there was a bug with defer cancel() as well). I've moved the creation of the defrag context to each method https://github.com/storageos/etcd-cluster-operator/pull/37/files#diff-4d256d176300fb45905cd320c72758685ad4f268407980cfd0564bb59504bd26R1444 I did not move the tls context out though, imo it belongs in the method as a implementation detail. But happy to move it outside and pass it in if you think it's clearer.
} | ||
|
||
func (r *EtcdClusterReconciler) defragCronJob(log logr.Logger, cluster *etcdv1alpha1.EtcdCluster) { | ||
ctx, cancel, etcdClient, members, ok := r.setupDefragDeps(log, cluster) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same point about creating the context's outside of setupDefragDeps
defer cancel() | ||
defer etcdClient.Close() | ||
|
||
ctx = context.Background() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here we overwrite the ctx we're meant to be using, which means we're not setting a timeout on our defrag call
internal/interval/validate_test.go
Outdated
t.Run(tc.name, func(t *testing.T) { | ||
out, err := ValidateForDefrag(tc.input) | ||
|
||
if tc.err != nil && err == nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would recommend using testify's assertions here
https://github.com/stretchr/testify
assert.Equal(t, tc.err, err)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The v1.6.1 version did not support wrapped errors, I updated the library to v.1.7.2 which has methods for checking wrapped errors. That said, the methods do not handle both nil and error cases in the same method so I added a small helper method https://github.com/storageos/etcd-cluster-operator/pull/37/files#diff-f4cb5f477ae509d72e01582438b6e1b118c6695ac2287f765d3e0f10c19460c3R31
internal/interval/validate_test.go
Outdated
if tc.err != nil && err == nil { | ||
t.Errorf("expected an error") | ||
} | ||
if err != nil && tc.err == nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same point about testify assertions here
internal/interval/validate_test.go
Outdated
} | ||
} | ||
|
||
if out != tc.output { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same point about testify assertions here
@@ -91,13 +99,21 @@ func main() { | |||
// See https://github.com/improbable-eng/etcd-cluster-operator/issues/171 | |||
ctrl.SetLogger(zap.New(zap.UseDevMode(true))) | |||
|
|||
validatedInterval, err := interval.ValidateForDefrag(defragWithoutThresholdInterval) | |||
if err != nil { | |||
setupLog.Error(err, "invalid defrag-without-threshold-interval configuration") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might be worth printing out the supplied value in this error message
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is part of the error we are getting e.g.
ERROR setup invalid defrag-without-threshold-interval configuration {"error": "invalid interval: every 5s"}
Should we move it up to log it as key:value here as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah yep, I misread.
Ignore me :)
Update the version of the testify library to v1.7.2 (from v1.6.1).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM - except the one point raised
// create a new context with a long timeout for the defrag operation. | ||
ctx, cancel := context.WithTimeout(context.Background(), defragTimeout) | ||
defer cancel() | ||
|
||
err := defragger.Defrag(ctx, members, etcdClient, log) | ||
if err != nil { | ||
log.Error(err, "failed to defrag") | ||
} | ||
} | ||
|
||
func (r *EtcdClusterReconciler) setupDefragDeps(log logr.Logger, cluster *etcdv1alpha1.EtcdCluster) ( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might be worth adding a comment that "it's the callers responsibility to close the return etcd client"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🚢
It can be beneficial to defrag even if the threshold is not yet met, so
we add a command line flag that can take the values hourly, daily or
weekly and will force a defrag at this interval. The default value is
set to weekly.
The implementation is not robust, defrags can be missed due to
restarting etc.
A new failure scenario is added as we validate the interval and fail to
start if the value is invalid.