Add common.RetryForever() and use for concurrent sync operations #1503
Conversation
Note that the failing |
In some cases retrying an error could make sense, but in the majority of cases the approach should be to error as soon as possible and stop the execution from progressing.
I initially misinterpreted this to mean stop the program - but I understand that you mean stop the goroutine that is erroring. This approach makes it very hard to operate and debug Edge instances. We could have an instance that failed to sync rootchain events many hours ago, and we would have to search for that single ERROR log in an unknown range of time. If the sync were retried, we would see the ERROR log regularly.

If we alert on every ERROR log, then we are inundated with alerts - I believe because the dispatcher can log ERROR when RPC requests from users fail. If we keep the status quo and don't retry (or exit the program) for failures like gRPC server startup, event tracker sync, and the Prometheus server, what is your advice for cutting through the logs and being able to alert on ERRORs? Am I incorrect about the dispatcher logging ERROR on user requests? Thanks.

Example of a user-initiated RPC request that results in ERROR logs (using eth_sendRawTransaction):
Looking at tt.Sync in ethgo https://github.com/umbracle/ethgo/blob/main/tracker/tracker.go#L578.

```go
go common.RetryForever(ctx, time.Second, func(context.Context) error {
	if err := tt.Sync(ctx); err != nil {
		e.logger.Error("failed to sync", "error", err)
		return err
	}

	return nil
})
```

IDK why the sync func in ethgo completely ignores those errors in the for loop.
I feel that we should keep retrying logic only in event_tracker.go and in polybft.go (synchronization protocol).

As a side note, please set golangci-lint as your linting tool and run the make lint command. Seems like there are a couple of linter errors which should be fixed.
@Stefan-Ethernal Yep, I can see the golangci-lint warnings in my IDE about line cuddling etc. I was ignoring them as I wasn't sure whether you guys really wanted to enforce those spacing requirements, given that the GitHub workflow does not enforce them. I am happy to add a workflow step that enforces the linter if you would like. Or perhaps there is a reason you do not want this. Thanks.
@sergerad In fact we do have a GH workflow for linters as well, although for some reason it hasn't been triggering for external collaborators. I have created #1512, which got merged.

EDIT: It didn't help, so now I'm suspecting that it may be related to repository settings.
Force-pushed from ee378f8 to 99a1d8a.
Have removed the retry logic apart from polybft sync / event tracker. If we merge this PR, I will need to do a follow-up PR to ethgo to address the fact that this error is not handled: https://github.com/umbracle/ethgo/blob/main/tracker/tracker.go#LL582C7-L582C22

@vcastellm are you happy doing the retries on the sync logic? Currently in this PR we are also concurrently retrying initialization logic for the trackers, because the tests rely on the fact that those errors are not handled yet. If we want initialization logic out of the concurrent retries, I can update the tests in this PR. If you are not happy doing retries on the sync logic at all, I can close this PR.
LGTM.
As a side note: please rebase again onto the most recent commit on the develop branch, since we have finally fixed the lint workflow, and then just resolve the pending linting issues.
I did rebase when you mentioned the linting. I can see this commit in my logs.
Will rebase again now in any case.
Force-pushed from d75e370 to 80ba145.
Force-pushed from 73ade2b to 6bc2939.
Yep, but that one didn't work, so we have opened and merged another PR today (#1526), and now it's working properly 🙂
Force-pushed from 6bc2939 to 29a0cba.
LGTM. Thanks for taking the comments from the guys into consideration.
@vcastellm are you happy doing the retries on the sync logic?
LGTM now, thanks for the contribution and sorry for the late reply
Description
Add common.RetryForever() and use it for concurrent block/event synchronisation operations that can take a long time and fail at any point.
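For reference, a minimal sketch of what such a helper could look like, based on the call sites quoted in this conversation (the actual implementation in this PR may differ in details such as ticker handling):

```go
package common

import (
	"context"
	"time"
)

// RetryForever keeps invoking fn until it returns nil or ctx is cancelled,
// waiting `interval` between failed attempts. Sketch only; see the merged
// code for the exact behaviour.
func RetryForever(ctx context.Context, interval time.Duration, fn func(context.Context) error) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()

	for {
		if err := fn(ctx); err == nil {
			return
		}

		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
		}
	}
}
```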
Background
The Edge application handles some errors by logging them and letting the rest of the program continue without retrying the failed operation. This is undesirable when those operations are critical (such as relayer startup and sync, whose failure can mean that deposits no longer work), so critical operations should be retried.
This change allows critical operations to be retried every N seconds. Some of these errors are transient and will fix themselves on retry; other errors will have to be fixed manually by Edge operators.
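A short usage sketch of the intended call pattern; relayer.Start and logger below are placeholder names for a critical operation and its logger, and the 10-second interval is illustrative (the real call sites wrap operations such as the event tracker sync shown earlier):

```go
// relayer.Start and logger are placeholder names for a critical operation
// and its logger. Retry the operation in its own goroutine instead of
// logging the error once and letting the rest of the program continue.
go common.RetryForever(ctx, 10*time.Second, func(ctx context.Context) error {
	if err := relayer.Start(ctx); err != nil {
		logger.Error("failed to start relayer, will retry", "error", err)

		return err
	}

	return nil
})
```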
Note that in a previous PR we implemented an approach which propagated concurrent errors back to the server shutdown logic. However, this was much more complicated and introduced an error-handling paradigm that is not consistent with how Edge is implemented today.
Changes include
Breaking changes
None
Checklist
Testing