
Add common.RetryForever() and use for concurrent sync operations #1503

Merged: 14 commits from the retry-concurrent branch into 0xPolygon:develop on May 22, 2023

Conversation

@sergerad (Contributor) commented on May 14, 2023

Description

Add common.RetryForever() and use it for concurrent block/event synchronisation operations that can take a long time and fail at any point.

Background

The Edge application handles some errors by logging them and continuing to run the rest of the program without retrying the failed operation. This is undesirable when those operations are critical (such as relayer startup and sync, whose failure can mean that deposits no longer work). Critical operations should be retried because:

  1. It allows the application to self-heal, rather than run without critical capabilities; and
  2. It allows for more reliable alerting and monitoring based on repeating critical errors.

This change allows critical operations to be retried every N seconds. Some such errors are transient and will fix themselves on retry. Other errors will have to be manually fixed by Edge operators.
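
For reference, here is a minimal sketch of what such a helper could look like. This is an illustration only; the signature matches how it is called later in this thread, but the actual common.RetryForever implementation in this PR may differ:

	package common

	import (
		"context"
		"time"
	)

	// RetryForever keeps invoking fn at the given interval until it returns nil
	// or the context is cancelled. Sketch only; the real implementation may
	// handle logging and backoff differently.
	func RetryForever(ctx context.Context, interval time.Duration, fn func(context.Context) error) {
		ticker := time.NewTicker(interval)
		defer ticker.Stop()

		for {
			if err := fn(ctx); err == nil {
				return
			}

			// Wait for the next tick, or stop if the caller cancels the context.
			select {
			case <-ctx.Done():
				return
			case <-ticker.C:
			}
		}
	}

Callers are expected to wrap long-running operations in a goroutine, e.g. go common.RetryForever(ctx, time.Second, syncFn), so the operation is retried every second until it succeeds or the node shuts down.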

Note that in a previous PR, we implemented an approach that propagated concurrent errors back to the server shutdown logic. However, this was much more complicated and introduced an error-handling paradigm that is not consistent with how Edge is implemented today.

Changes include

  • Bugfix (non-breaking change that solves an issue)
  • Hotfix (change that solves an urgent issue, and requires immediate attention)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (change that is not backwards-compatible and/or changes current functionality)

Breaking changes

None

Checklist

  • I have assigned this PR to myself
  • I have added at least 1 reviewer
  • I have added the relevant labels
  • I have updated the official documentation
  • I have added sufficient documentation in code

Testing

  • I have tested this code with the official test suite
  • I have tested this code manually

@sergerad
Copy link
Contributor Author

Note that the failing TestE2E_Bridge_DepositAndWithdrawERC1155 passes on my machine. I assume it's flaky due to the heuristic it uses for waiting for blocks after the deposit txs.

@Stefan-Ethernal added the "bug fix" (Functionality that fixes a bug) label on May 14, 2023
@Stefan-Ethernal requested a review from a team on May 14, 2023 18:41
@vcastellm (Contributor) left a comment:

In some cases retrying an error could make sense but in the majority of cases, the approach should be to error as soon as possible and stop the execution from progressing.

@sergerad (Contributor, Author) commented on May 15, 2023

In some cases retrying an error could make sense but in the majority of cases, the approach should be to error as soon as possible and stop the execution from progressing.

I initially misinterpreted this to mean stop the program - but I understand that you mean stop the goroutine that is erroring.

This approach makes it very hard to operate and debug Edge instances. We could have an instance that failed to sync rootchain events many hours ago, and we would have to search for that single ERROR log in an unknown range of time. If the sync were retried, we would see the ERROR log regularly. If we alert on every ERROR log, then we are inundated with alerts, because (I believe) the dispatcher can log ERROR when RPC requests from users fail.

If we keep the status quo and don't retry (or exit the program) on failures like gRPC server startup, event tracker sync, and the Prometheus server, what is your advice for cutting through the logs and being able to alert on ERRORs? Am I incorrect about the dispatcher logging ERROR on user requests? Thanks

Example of a user-initiated RPC request that results in ERROR logs (using eth_sendRawTransaction):

message:"2023-05-15T20:58:25.180Z [ERROR] polygon.server.dispatcher: failed to dispatch: method=eth_sendRawTransaction err=\"cannot parse RLP: cannot parse short bytes: cannot parse array value: cannot parse short bytes: cannot parse array value: cannot parse short bytes: length is not enough\""

@sergerad (Contributor, Author) commented on May 15, 2023

Looking at tt.Sync in ethgo: https://github.com/umbracle/ethgo/blob/main/tracker/tracker.go#L578.
The RetryForever here only helps with the batch sync, because the long-running for loop after it is handled concurrently by the Sync func itself. That means the following change is useful only for retrying the batch sync before the for loop:

	go common.RetryForever(ctx, time.Second, func(context.Context) error {
		if err := tt.Sync(ctx); err != nil {
			e.logger.Error("failed to sync", "error", err)
			return err
		}
		return nil
	})

I don't know why the Sync func in ethgo completely ignores those errors in the for loop.

@Stefan-Ethernal mentioned this pull request on May 16, 2023
@Stefan-Ethernal (Collaborator) left a comment:

I feel that we should keep retrying logic only in event_tracker.go and in polybft.go (synchronization protocol).

As a side note, please set up golangci-lint as your linting tool and run the make lint command. It seems like there are a couple of linter errors that should be fixed.

@sergerad (Contributor, Author) commented on May 16, 2023

@Stefan-Ethernal Yep, I can see the golangci-lint warnings in my IDE about line cuddling etc. I was ignoring them as I wasn't sure whether you guys really wanted to enforce those spacing requirements, given the GitHub workflow does not enforce them.

I am happy to add a workflow step that enforces the linter if you would like. Or perhaps there is a reason you do not want this. Thanks

@Stefan-Ethernal (Collaborator) commented on May 16, 2023

@Stefan-Ethernal Yep, I can see the golangci-lint warnings in my IDE about line cuddling etc. I was ignoring them as I wasn't sure whether you guys really wanted to enforce those spacing requirements, given the GitHub workflow does not enforce them.

I am happy to add a workflow step that enforces the linter if you would like. Or perhaps there is a reason you do not want this. Thanks

@sergerad In fact, we do have a GH workflow for linters as well, although for some reason it hasn't been triggering for external collaborators. I have created #1512, which got merged into develop, so if you can rebase to the latest commit on the develop branch, let's see if it is any better now (I expect it to trigger this time). 🙂 🤞

EDIT: It didn't help, so now I suspect that it may be related to repository settings.

@sergerad (Contributor, Author) commented on May 16, 2023

I have removed the retry logic apart from the polybft sync / event tracker. If we merge this PR, I will need to do a follow-up PR to ethgo to address the fact that this error is not handled: https://github.com/umbracle/ethgo/blob/main/tracker/tracker.go#LL582C7-L582C22

@vcastellm are you happy doing the retries on the sync logic? Currently in this PR we are also concurrently retrying the initialization logic for the trackers, because the tests rely on the fact that those errors are not handled yet (as sketched below). If we want the initialization logic out of the concurrent retries, I can update the tests in this PR.
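
For illustration, the pattern currently in this PR looks roughly like the following sketch (Init() is a hypothetical stand-in for the tracker's start-up logic; names are not exact):

	// Both initialization and the long-running sync live inside the retried
	// closure, so a failure in either step is logged and retried rather than
	// silently stopping the goroutine.
	go common.RetryForever(ctx, time.Second, func(ctx context.Context) error {
		if err := tracker.Init(); err != nil { // hypothetical init step
			logger.Error("failed to init tracker", "error", err)
			return err
		}

		if err := tracker.Sync(ctx); err != nil {
			logger.Error("failed to sync", "error", err)
			return err
		}

		return nil
	})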

If you are not happy doing retries on the sync logic at all, I can close this PR.

@Stefan-Ethernal (Collaborator) left a comment:

LGTM.

As a side note: please rebase again to the most recent commit on the develop branch, since we have finally fixed the lint workflow, and then just resolve any pending linting issues.

@sergerad (Contributor, Author) commented:

As a side note: please rebase again to the most recent commit on the develop branch, since we have finally fixed the lint workflow, and then just resolve any pending linting issues.

I did rebase when you mentioned the linting. I can see this commit in my logs:

commit c3490646a75df27a36831e1a56873a5c46d01d61
Author: Stefan Negovanović <93934272+Stefan-Ethernal@users.noreply.github.com>
Date:   Tue May 16 20:43:07 2023 +0200

    Fix Lint GH workflow (#1512)
    
    * Remove ignored branches and increase timeout
    
    * Broader triggers
    
    * Add verbose output

Will rebase again now in any case

@sergerad changed the title from "Add common.RetryForever() and use for concurrent operations" to "Add common.RetryForever() and use for concurrent sync operations" on May 19, 2023
@Stefan-Ethernal (Collaborator) commented:

I did rebase when you mentioned the linting. I can see this commit in my logs

Yep, but that one didn't work, so we have opened and merged another PR today (#1526) and now it's working properly 🙂

@goran-ethernal (Collaborator) left a comment:

LGTM. Thanks for taking the guys' comments into consideration.

@vcastellm (Contributor) left a comment:

@vcastellm are you happy doing the retries on the sync logic?

LGTM now. Thanks for the contribution, and sorry for the late reply.

@vcastellm merged commit 60b21b4 into 0xPolygon:develop on May 22, 2023
@github-actions bot locked and limited conversation to collaborators on May 22, 2023
@sergerad deleted the retry-concurrent branch on May 22, 2023 18:08