
Add cron job that does defrag without considering threshold #37

Merged
walle merged 6 commits into main from add-defrag-without-threshold-cron-job on Jun 14, 2022

Conversation


@walle walle commented Jun 13, 2022

It can be beneficial to defrag even if the threshold is not yet met, so we add a command line flag that can take the values hourly, daily or weekly and forces a defrag at that interval. The default value is weekly.
The implementation is not robust; defrags can be missed due to restarts, etc.

A new failure scenario is added: we validate the interval and fail to start if the value is invalid.

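As a rough sketch of the validation this adds (the names match the diffs further down in this conversation, but the function body and return value here are assumptions, not the exact code in this PR):

// Sketch of the interval validation; implementation details are assumed.
package interval

import "fmt"

// ValidateForDefrag accepts only the supported interval names and returns
// the validated value, or an error naming the invalid input.
func ValidateForDefrag(value string) (string, error) {
	switch value {
	case "hourly", "daily", "weekly":
		return value, nil
	default:
		return "", fmt.Errorf("invalid interval: %s", value)
	}
}

In main.go the flag itself is registered with a weekly default and validated at startup, as shown in the main.go diff below; the flag spelling defrag-without-threshold-interval is inferred from the error message quoted later in this thread.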
@walle walle requested a review from a team as a code owner June 13, 2022 13:41
Collaborator

@Mojachieee Mojachieee left a comment

I've left a few comments about contexts and tests, but it looks good overall.

@@ -1025,6 +1048,14 @@ func nextOutdatedPeer(cluster *etcdv1alpha1.EtcdCluster, peers *etcdv1alpha1.Etc
return nil
}

func scheduleMapKeyFor(cluster *etcdv1alpha1.EtcdCluster) string {
return string(cluster.UID)
Collaborator

nit: it might be worth making this string(cluster.UID) + "defrag".

Just so it's clear what this key does, and to stop us accidentally overlapping keys if we add more cron jobs in the future.

Author

Good idea, will add it.
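For reference, a rough sketch of the suggested tweak (the exact suffix string is illustrative):

func scheduleMapKeyFor(cluster *etcdv1alpha1.EtcdCluster) string {
	// Suffix the UID so it's clear what the key is for and so it can't
	// collide with keys for other cron jobs added later.
	return string(cluster.UID) + "defrag"
}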

func (r *EtcdClusterReconciler) defragCronJob(log logr.Logger, cluster *etcdv1alpha1.EtcdCluster, members *[]etcd.Member) {
ctx, cancel := context.WithTimeout(context.Background(), time.Second*25)
func (r *EtcdClusterReconciler) defragWithThresholdCronJob(log logr.Logger, cluster *etcdv1alpha1.EtcdCluster) {
ctx, cancel, etcdClient, members, ok := r.setupDefragDeps(log, cluster)
Collaborator

I think it's a bit cleaner to create both of the required contexts outside of setupDefragDeps, so it's really clear which context is being used where and what timeouts are set.

It would also be nice to add a comment on why we need a new context (i.e. "we can't use the reconcile's ctx, as it will likely have expired by the time this cron job is run").

Author

I agree that it was a bit strange to return the ctx and cancel from the method (and I think there was a bug with defer cancel() as well). I've moved the creation of the defrag context into each method: https://github.com/storageos/etcd-cluster-operator/pull/37/files#diff-4d256d176300fb45905cd320c72758685ad4f268407980cfd0564bb59504bd26R1444. I did not move the TLS context out though; imo it belongs in the method as an implementation detail. But happy to move it outside and pass it in if you think it's clearer.
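Roughly, the per-method version looks like this (the setupDefragDeps return values shown here are an assumption about its post-review shape, not the exact committed code):

func (r *EtcdClusterReconciler) defragCronJob(log logr.Logger, cluster *etcdv1alpha1.EtcdCluster) {
	// We can't reuse the reconcile's context: it will most likely have expired
	// by the time this cron job runs, so create a fresh one scoped to the defrag.
	ctx, cancel := context.WithTimeout(context.Background(), defragTimeout)
	defer cancel()

	etcdClient, members, ok := r.setupDefragDeps(log, cluster)
	if !ok {
		return
	}
	defer etcdClient.Close()

	if err := defragger.Defrag(ctx, members, etcdClient, log); err != nil {
		log.Error(err, "failed to defrag")
	}
}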

}

func (r *EtcdClusterReconciler) defragCronJob(log logr.Logger, cluster *etcdv1alpha1.EtcdCluster) {
ctx, cancel, etcdClient, members, ok := r.setupDefragDeps(log, cluster)
Collaborator

Same point about creating the contexts outside of setupDefragDeps.

defer cancel()
defer etcdClient.Close()

ctx = context.Background()
Collaborator

Here we overwrite the ctx we're meant to be using, which means we're not setting a timeout on our defrag call
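In other words, the fix is simply to keep using the timeout-bound ctx created above instead of replacing it, e.g.:

	defer cancel()
	defer etcdClient.Close()

	// Keep the ctx from context.WithTimeout above; reassigning it to
	// context.Background() would silently drop the defrag timeout.
	err := defragger.Defrag(ctx, members, etcdClient, log)
	if err != nil {
		log.Error(err, "failed to defrag")
	}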

t.Run(tc.name, func(t *testing.T) {
out, err := ValidateForDefrag(tc.input)

if tc.err != nil && err == nil {
Collaborator

I would recommend using testify's assertions here
https://github.com/stretchr/testify

assert.Equal(t, tc.err, err)

Author

The v1.6.1 version did not support wrapped errors, so I updated the library to v1.7.2, which has methods for checking wrapped errors. That said, those methods do not handle both the nil and error cases in the same method, so I added a small helper method: https://github.com/storageos/etcd-cluster-operator/pull/37/files#diff-f4cb5f477ae509d72e01582438b6e1b118c6695ac2287f765d3e0f10c19460c3R31
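A rough sketch of what such a helper can look like with testify's NoError/ErrorIs (the helper name and package here are assumptions; the PR's actual helper is at the link above):

package interval_test // package name assumed for the sketch

import (
	"testing"

	"github.com/stretchr/testify/require"
)

// requireErrorMatches handles both the "no error expected" case and the
// "returned error must wrap the expected error" case in one place.
func requireErrorMatches(t *testing.T, expected, actual error) {
	t.Helper()
	if expected == nil {
		require.NoError(t, actual)
		return
	}
	require.ErrorIs(t, actual, expected)
}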

if tc.err != nil && err == nil {
t.Errorf("expected an error")
}
if err != nil && tc.err == nil {
Collaborator

Same point about testify assertions here

}
}

if out != tc.output {
Collaborator

Same point about testify assertions here

@@ -91,13 +99,21 @@ func main() {
// See https://github.com/improbable-eng/etcd-cluster-operator/issues/171
ctrl.SetLogger(zap.New(zap.UseDevMode(true)))

validatedInterval, err := interval.ValidateForDefrag(defragWithoutThresholdInterval)
if err != nil {
setupLog.Error(err, "invalid defrag-without-threshold-interval configuration")
Collaborator

It might be worth printing out the supplied value in this error message

Author

It is already part of the error we get, e.g.

ERROR   setup   invalid defrag-without-threshold-interval configuration {"error": "invalid interval: every 5s"}

Should we move it up to log it as key:value here as well?

Collaborator

Ah yep, I misread.

Ignore me :)

Mojachieee previously approved these changes Jun 14, 2022
Collaborator

@Mojachieee Mojachieee left a comment

LGTM - except the one point raised

// create a new context with a long timeout for the defrag operation.
ctx, cancel := context.WithTimeout(context.Background(), defragTimeout)
defer cancel()

err := defragger.Defrag(ctx, members, etcdClient, log)
if err != nil {
log.Error(err, "failed to defrag")
}
}

func (r *EtcdClusterReconciler) setupDefragDeps(log logr.Logger, cluster *etcdv1alpha1.EtcdCluster) (
Collaborator

Might be worth adding a comment that "it's the caller's responsibility to close the returned etcd client".
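e.g. a doc comment along these lines (wording illustrative):

// setupDefragDeps builds the etcd client and member list needed for a defrag.
// It is the caller's responsibility to close the returned etcd client.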

Collaborator

@Mojachieee Mojachieee left a comment

🚢

@walle walle merged commit 07c92ea into main Jun 14, 2022
@walle walle deleted the add-defrag-without-threshold-cron-job branch June 14, 2022 12:42