Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Emit failure resolve event #604

Closed
wants to merge 5 commits into from
Closed

Conversation

darkowlzz
Copy link
Contributor

@darkowlzz darkowlzz commented Mar 7, 2022

When source reconciliation failure happens, a warning event is emitted but
when resolved, there's no resolved event emitted, leaving the user uninformed
about the status of the operation.

This change checks the object status for failure conditions and
emits a normal event to notify that the failure is resolved on a
successful reconciliation. If a failure is resolved along with a new artifact,
"stored artifact" event is emitted, suppressing the failure resolve message.
New artifact event implies that the reconciliation was successful.

When using notification-controller, this appears as:

git-repo-fail-final

Console logs:

{"level":"error","ts":"2022-03-07T13:57:00.950Z","logger":"controller.gitrepository","msg":"Reconciler error","reconciler group":"source.toolkit.fluxcd.io","reconciler kind":"GitRepository","name":"podinfo","namespace":"flux-system","error":"failed to checkout and determine revision: unable to clone 'ssh://git@github.com/stefanprodan/podinfo': Failed to retrieve list of SSH authentication methods: Failed getting response"}
{"level":"info","ts":"2022-03-07T13:57:07.279Z","logger":"controller.gitrepository","msg":"resolved 'GitOperationFailed'","reconciler group":"source.toolkit.fluxcd.io","reconciler kind":"GitRepository","name":"podinfo","namespace":"flux-system"}

@darkowlzz darkowlzz added the enhancement New feature or request label Mar 7, 2022
Copy link
Member

@pjbgf pjbgf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should improve user's perception when facing intermittent issues (i.e. during github clone ops). Nice work @darkowlzz

LGTM

@stefanprodan
Copy link
Member

That printscreen is confusing, why is there the error twice? Are we sending 2 events now instead of one?

@stefanprodan
Copy link
Member

stefanprodan commented Mar 7, 2022

Maybe it could be something like resolved <reason>, stored artifact for commit <message>

@darkowlzz
Copy link
Contributor Author

That printscreen is confusing, why is there the error twice? Are we sending 2 events now instead of one?

The error in the normal event is to show what was the failure before, in case the recovery took place after some time, providing user some context about what failed. But now I don't think it's needed. Would be simpler to just say something like:

Successfully recovered from earlier 'GitOperationFailed' failure

Happy to have some better messaging.

@darkowlzz
Copy link
Contributor Author

darkowlzz commented Mar 7, 2022

Maybe it could be something like resolved <reason>, stored artifact for commit <message>

This event is emitted in reconcileSource() sub-reconciler. By this time, we haven't stored the artifact. We can't say anything about artifact.

@darkowlzz
Copy link
Contributor Author

darkowlzz commented Mar 7, 2022

[OUTDATED]

Just for a record, with this, the ssh could not read data would appear as:

git-repo-failure-recovery-2

The simplified messaging could also be:

resolved 'GitOperationFailed' failure

Not sure how informative we should be.

@stefanprodan
Copy link
Member

I'm not sure we should inject the previous failure in metadata, also do we send a 2nd message after this with the revision? Does this change means people will receive 2 info messages every time SC recovers? For example now SC errors out for each HelmChart at startup, I have 10 charts, so 10 error messages and 10 success messages at every upgrade. Will I receive 30 messages every time after this change?

@darkowlzz
Copy link
Contributor Author

I'm not sure we should inject the previous failure in metadata, also do we send a 2nd message after this with the revision? Does this change means people will receive 2 info messages every time SC recovers? For example now SC errors out for each HelmChart at startup, I have 10 charts, so 10 error messages and 10 success messages at every upgrade. Will I receive 30 messages every time after this change?

Yes, it'll result in double notifications every time there's a failure.
I checked the code, I think we can consolidate everything as one message at the very end, along with the stored artifact message.
Will get back with the new implementation.
Thanks for the feedback.

@hiddeco hiddeco added the area/git Git related issues and pull requests label Mar 7, 2022
@darkowlzz darkowlzz force-pushed the git-fetch-recovery-event branch 2 times, most recently from ce5f508 to b18a7d2 Compare March 8, 2022 12:06
Copy link
Member

@stefanprodan stefanprodan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Thanks @darkowlzz

@darkowlzz darkowlzz changed the title gitrepo: Emit failure resolve event Emit failure resolve event Mar 11, 2022
When source fetch failure happens, a warning event is emitted but
when resolved, there's no resolved event emitted.

This change introduces a new ResultProcessor, NotifySuccess, for
SummarizeAndPatch helper which consolidates all the messages to be
emitted as a single event. It checks for any recovery from failure
condition and emits event about the recovery. In case there's a failure
recovery and a new artifact, success event message only mentions the new
artifact success message. Recovery is implied. This helps avoid mixing
of the event types to have K8s native events friendly events that work
with event counters.

The gitrepo reconciler is modified to use the new NotifySuccess
ResultProcessor. A change in artifact is checked in reconcile() and a
Notification object is returned, populated with the success information.

When the Notification is empty/zero, and no failure recovery, no event
is emitted. Failure recovery without any Notification results in an
event with only the recovery information.

Introduce new condition "StorageOperationFailed" for use with any
failures due to storage operation. With this, all the failures are
associated with some condition, which is used to detect a failure
recovery and reported as notification.

Signed-off-by: Sunny <darkowlzz@protonmail.com>
Update bucket, helmrepo and helmchart reconcilers to use NotifySuccess
ResultProcessor and StorageOperationFailedCondition.

Signed-off-by: Sunny <darkowlzz@protonmail.com>
Since the old version of the object is now kept in a separate variable,
the old object can be passed to summary helper to create patch helper.

Signed-off-by: Sunny <darkowlzz@protonmail.com>
Replace `StorageOperationFailedReason` with granular reasons for the
failure.
Update docs with `StorageOperationFailedCondition`.

Signed-off-by: Sunny <darkowlzz@protonmail.com>
Signed-off-by: Sunny <darkowlzz@protonmail.com>
@darkowlzz
Copy link
Contributor Author

Summarizing how things changed eventually...
I've updated the PR description with a new image to show what the notifications look like.
Instead of combining recovery and artifact stored events, the recovery event message is suppressed when there's a new artifact. New artifact implies that reconciliation was successful. When there's a recovery without a new artifact, a single notification with the recovery message is sent. Keeping the recovery and new artifact events separate help in watching these events separate with their own counters in K8s native events.
Also, updated all the failures to be associated with some status condition. The storage failures are now associated with the StorageOperationFailedCondition with granular reasons for the exact failure which appear in the recovery message.
Updated all the controllers to have these changes.

@hiddeco
Copy link
Member

hiddeco commented Mar 11, 2022

@stefanprodan please have a look at this and see if you agree with it from a UX perspective. I directed changes this way, as I felt that "polluting" the event data with conditional strings would be non beneficial to native Kubernetes Events, as it for example causes issues with keeping counts of repetitive messages.

By emitting the same event again after "recovering" from a failure, we still produce a signal to a user while allowing Kubernetes to keep count.

@darkowlzz darkowlzz added the hold Issues and pull requests put on hold label Mar 14, 2022
@darkowlzz
Copy link
Contributor Author

darkowlzz commented Mar 14, 2022

Putting this on hold for now. I would like to implement it in another way.
Will extract the status condition API changes into a separate PR.

@darkowlzz
Copy link
Contributor Author

#624 is ready, a new simpler implementation. Closing this.

@darkowlzz darkowlzz closed this Apr 5, 2022
@darkowlzz darkowlzz deleted the git-fetch-recovery-event branch April 5, 2022 19:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/git Git related issues and pull requests enhancement New feature or request hold Issues and pull requests put on hold
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

4 participants