Add common.RetryForever() and use for concurrent sync operations #1503
Conversation
Note that the failing |
In some cases retrying an error could make sense, but in the majority of cases the approach should be to error as soon as possible and stop the execution from progressing.
I initially misinterpreted this to mean stop the program - but I understand that you mean stop the goroutine that is erroring. This approach makes it very hard to operate and debug Edge instances. We could have an instance that failed to sync rootchain events many hours ago, and we would have to search for that single ERROR log in an unknown range of time. If the sync were retried, we would see the ERROR log regularly.

If we alert on every ERROR log, then we are inundated with alerts - I believe because the dispatcher can log ERROR when RPC requests from users fail. If we keep the status quo and don't retry (or exit the program) for failures like gRPC server startup, event tracker sync, and the Prometheus server, what is your advice for cutting through the logs and being able to alert on ERRORs? Am I incorrect about the dispatcher logging ERROR on user requests? Thanks.

Example of a user-initiated RPC request that results in ERROR logs (using eth_sendRawTransaction):
Looking at tt.Sync in ethgo https://github.com/umbracle/ethgo/blob/main/tracker/tracker.go#L578.

```go
go common.RetryForever(ctx, time.Second, func(context.Context) error {
	if err := tt.Sync(ctx); err != nil {
		e.logger.Error("failed to sync", "error", err)
		return err
	}

	return nil
})
```

IDK why the sync func in ethgo completely ignores those errors in the for loop.
I feel that we should keep retrying logic only in event_tracker.go and in polybft.go (synchronization protocol).

As a side note, please set golangci-lint as your linting tool and run the make lint command. Seems like there are a couple of linter errors which should be fixed.
@Stefan-Ethernal Yep, I can see the golangci-lint warnings in my IDE about line cuddling etc. I was ignoring them as I wasn't sure whether you guys really wanted to enforce those spacing requirements, given that the GitHub workflow does not enforce them. I am happy to add a workflow step that enforces the linter if you would like. Or perhaps there is a reason you do not want this. Thanks.
@sergerad In fact we do have a GH workflow for linters as well, although for some reason it hasn't been triggering for external collaborators. I have created #1512, which got merged.

EDIT: It didn't help, so now I'm suspecting that it may be related to repository settings.
Force-pushed from ee378f8 to 99a1d8a.
Have removed the retry logic apart from polybft sync / event tracker. If we merge this PR, I will need to do a follow-up PR to ethgo to address the fact that this error is not handled: https://github.com/umbracle/ethgo/blob/main/tracker/tracker.go#LL582C7-L582C22

@vcastellm are you happy doing the retries on the sync logic? Currently in this PR we are also concurrently retrying initialization logic for the trackers, because the tests rely on the fact that those errors are not handled yet. If we want initialization logic out of the concurrent retries, I can update the tests in this PR. If you are not happy doing retries on the sync logic at all, I can close this PR.
LGTM.
As a side note: please rebase again onto the most recent commit on the develop branch, since we have finally fixed the lint workflow, and then just resolve the pending linting issues.
I did rebase when you mentioned the linting. I can see this commit in my logs.
Will rebase again now in any case.
Force-pushed from d75e370 to 80ba145.
Force-pushed from 73ade2b to 6bc2939.
Yep, but that one didn't work, so we have opened and merged another PR today (#1526), and now it's working properly 🙂
Force-pushed from 6bc2939 to 29a0cba.
LGTM. Thanks for taking the comments from the guys into consideration.
@vcastellm are you happy doing the retries on the sync logic?
LGTM now, thanks for the contribution and sorry for the late reply
Description
Add common.RetryForever() and use it for concurrent block/event synchronisation operations that can take a long time and fail at any point.
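For reference, a minimal sketch of what such a helper could look like, based on the call sites quoted in this conversation (the actual implementation in this PR may differ in details such as ticker handling):

```go
package common

import (
	"context"
	"time"
)

// RetryForever keeps invoking fn until it returns nil or ctx is cancelled,
// waiting `interval` between failed attempts. Sketch only; see the merged
// code for the exact behaviour.
func RetryForever(ctx context.Context, interval time.Duration, fn func(context.Context) error) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()

	for {
		if err := fn(ctx); err == nil {
			return
		}

		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
		}
	}
}
```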
Background
The Edge application handles some errors by logging them and letting the rest of the program continue without retrying the failed operation. This is undesirable when those operations are critical (such as relayer startup and sync, whose failure can mean that deposits no longer work), so critical operations should be retried.
This change allows critical operations to be retried every N seconds. Some of these errors are transient and will fix themselves on retry; other errors will have to be fixed manually by Edge operators.
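A short usage sketch of the intended call pattern; relayer.Start and logger below are placeholder names for a critical operation and its logger, and the 10-second interval is illustrative (the real call sites wrap operations such as the event tracker sync shown earlier):

```go
// relayer.Start and logger are placeholder names for a critical operation
// and its logger. Retry the operation in its own goroutine instead of
// logging the error once and letting the rest of the program continue.
go common.RetryForever(ctx, 10*time.Second, func(ctx context.Context) error {
	if err := relayer.Start(ctx); err != nil {
		logger.Error("failed to start relayer, will retry", "error", err)

		return err
	}

	return nil
})
```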
Note that in a previous PR we implemented an approach which propagated concurrent errors back to the server shutdown logic. However, this was much more complicated and introduced an error-handling paradigm that is not consistent with how Edge is implemented today.
Changes include
Breaking changes
None
Checklist
Testing