
[Access] Connection pool evictions cause connection failures #4534

Conversation

@Guitarheroua (Contributor)

#2833

Context
Connection pool evictions cause connection failures. To resolve this issue, I proposed checking the current state of the ClientConn (via the GetState() or WaitForStateChange() methods). While these methods can track the state of the connection itself, they do not handle ongoing requests gracefully.

To handle ongoing requests properly, this PR introduces a new client interceptor, the "request watcher". The interceptor keeps track of the number of requests that have started and completed, and uses synchronization to wait for unfinished requests during the connection closure process. Any additional requests that arrive after closure is initiated are rejected while the ongoing requests finish.
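
Below is a minimal sketch of the request-watcher idea as a gRPC unary client interceptor. The type and field names (requestWatcher, closeRequested, inFlight) are illustrative assumptions, not necessarily the PR's actual identifiers:

```go
// Hypothetical sketch of the "request watcher"; names are illustrative,
// not the PR's exact identifiers.
package connection

import (
	"context"
	"sync"

	"google.golang.org/grpc"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

type requestWatcher struct {
	mu             sync.RWMutex
	inFlight       sync.WaitGroup // tracks started-but-unfinished requests
	closeRequested bool           // set once closure of the connection is initiated
}

// UnaryInterceptor rejects new requests once closure has been requested and
// counts in-flight requests so Close can wait for them to complete.
func (w *requestWatcher) UnaryInterceptor(
	ctx context.Context,
	method string,
	req, reply interface{},
	cc *grpc.ClientConn,
	invoker grpc.UnaryInvoker,
	opts ...grpc.CallOption,
) error {
	w.mu.RLock()
	if w.closeRequested {
		w.mu.RUnlock()
		return status.Errorf(codes.Unavailable, "the connection to %s is closing", cc.Target())
	}
	w.inFlight.Add(1)
	w.mu.RUnlock()
	defer w.inFlight.Done()

	return invoker(ctx, method, req, reply, cc, opts...)
}

// Close marks the connection for closure, waits for in-flight requests to
// finish, and only then closes the underlying gRPC connection.
func (w *requestWatcher) Close(conn *grpc.ClientConn) error {
	w.mu.Lock()
	w.closeRequested = true
	w.mu.Unlock()

	w.inFlight.Wait()
	return conn.Close()
}
```

In this sketch the connection manager would attach the interceptor with grpc.WithUnaryInterceptor(w.UnaryInterceptor) when dialing, and call w.Close instead of closing the grpc.ClientConn directly.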

@Guitarheroua Guitarheroua marked this pull request as ready for review July 5, 2023 11:07
@Guitarheroua Guitarheroua requested a review from peterargue as a code owner July 5, 2023 11:07
@Guitarheroua (Contributor Author)

@peterargue Fixed all comments

@peterargue (Contributor) left a comment

thanks for the updates @Guitarheroua. Added a few more comments based on the changes.

) error {
// Prevent new requests from being sent if the connection is marked for closure
if cachedClient.closeRequested.Load() {
return status.Errorf(codes.Unavailable, "the connection to %s was closed", cachedClient.Address)

Contributor:

I think this is the correct behavior, but ideally, we'd never encounter this error since it's too late to reconnect. Is there a check needed in retrieveConnection to ensure the connection isn't closed before reusing?

If we catch it there, then a new connection can be created.
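
A hypothetical illustration of that suggestion follows; the type names, fields, and cache here are simplified assumptions, not the PR's exact code:

```go
// Hypothetical sketch of the suggested check in retrieveConnection; names and
// fields are assumptions, not the PR's exact code.
package connection

import (
	"sync/atomic"

	"google.golang.org/grpc"
	"google.golang.org/grpc/connectivity"
)

type CachedClient struct {
	ClientConn     *grpc.ClientConn
	closeRequested atomic.Bool
}

type ConnectionFactoryImpl struct {
	cache map[string]*CachedClient // simplified stand-in for the pool's LRU cache
	dial  func(grpcAddress string) (*grpc.ClientConn, error)
}

// retrieveConnection reuses a cached connection only if it is neither marked
// for closure nor shut down; otherwise it evicts the entry and dials a fresh one.
func (cf *ConnectionFactoryImpl) retrieveConnection(grpcAddress string) (*grpc.ClientConn, error) {
	if client, ok := cf.cache[grpcAddress]; ok {
		if !client.closeRequested.Load() &&
			client.ClientConn != nil &&
			client.ClientConn.GetState() != connectivity.Shutdown {
			return client.ClientConn, nil
		}
		// closing or shut down: drop the stale entry so callers get a new connection
		delete(cf.cache, grpcAddress)
	}

	conn, err := cf.dial(grpcAddress)
	if err != nil {
		return nil, err
	}
	cf.cache[grpcAddress] = &CachedClient{ClientConn: conn}
	return conn, nil
}
```

Catching the closing or shut-down state here means callers get a fresh connection instead of the Unavailable error shown above.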

@peterargue (Contributor) left a comment

This connection factory has proven to be difficult to get right with the different race conditions in a performant way. I took a shot at refactoring it a bit to make it easier to work with. feel free to take some or all of these changes if you think they help. Also feel free to ignore. I haven't done any testing and I'm sure it's missing metrics, logging, comments, etc.

https://github.com/onflow/flow-go/compare/petera/example-conn-factory-refactor?expand=1

…es' of github.com:Guitarheroua/flow-go into guitarheroua/2833-conn-pool-evictions-cause-conn-failures
@Guitarheroua (Contributor Author)

> This connection factory has proven to be difficult to get right with the different race conditions in a performant way. I took a shot at refactoring it a bit to make it easier to work with. feel free to take some or all of these changes if you think they help. Also feel free to ignore. I haven't done any testing and I'm sure it's missing metrics, logging, comments, etc.
>
> https://github.com/onflow/flow-go/compare/petera/example-conn-factory-refactor?expand=1

It was unexpected for me, but it really looks nicer and cleaner. I added all changes to the code, added metrics, commented on everything, and fixed tests. Should be nice.

@peterargue (Contributor) left a comment

added a few small comments, but I think this is pretty much ready to go. Can you also do some manual testing using localnet to make sure it's working as expected? let me know if you need any pointers on how to do that.

@codecov-commenter commented Jul 20, 2023

Codecov Report

Merging #4534 (f7c7116) into master (2528190) will decrease coverage by 1.75%.
Report is 286 commits behind head on master.
The diff coverage is 55.25%.

@@            Coverage Diff             @@
##           master    #4534      +/-   ##
==========================================
- Coverage   56.25%   54.51%   -1.75%     
==========================================
  Files         653      914     +261     
  Lines       64699    85318   +20619     
==========================================
+ Hits        36396    46509   +10113     
- Misses      25362    35219    +9857     
- Partials     2941     3590     +649     
Flag Coverage Δ
unittests 54.51% <55.25%> (-1.75%) ⬇️

Flags with carried forward coverage won't be shown.

Files Changed Coverage Δ
cmd/scaffold.go 14.83% <0.00%> (+0.06%) ⬆️
engine/access/rest/middleware/metrics.go 0.00% <0.00%> (ø)
engine/access/rest/routes/execution_result.go 56.75% <0.00%> (ø)
engine/access/rpc/backend/backend_accounts.go 54.83% <ø> (ø)
engine/access/rpc/backend/backend_events.go 66.88% <ø> (ø)
engine/access/rpc/backend/backend_scripts.go 16.95% <ø> (ø)
engine/access/rpc/backend/backend_transactions.go 49.01% <0.00%> (+2.86%) ⬆️
engine/access/state_stream/engine.go 0.00% <0.00%> (ø)
engine/common/rpc/convert/blocks.go 78.76% <ø> (ø)
fvm/environment/env.go 100.00% <ø> (ø)
... and 75 more

... and 239 files with indirect coverage changes

}
defer client.mu.Unlock()

if client.ClientConn != nil && client.ClientConn.GetState() != connectivity.Shutdown {

Member:

Do we actually need the check for state? We always remove the connection from the cache when we close the connection.
Additionally, I would be more explicit in the implementation of this function by implementing the following logic:

  1. GetOrAdd returns a new client - we perform initialization of the connection, record metrics, and return.
  2. GetOrAdd returns a cached client - we take the lock, record metrics, and return.

We can simplify and structure this function so that it has these specific paths.
In reality it already works the way I have described, but the branching is based on the availability of ClientConn rather than on whether the client was cached.
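
A hypothetical sketch of the two explicit paths described above; getOrAdd, the map-based cache, and the metrics placement are simplified assumptions, not the PR's exact code:

```go
// Hypothetical sketch of the two-path structure; names are assumptions.
package connection

import (
	"fmt"
	"sync"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

type CachedClient struct {
	mu         sync.Mutex
	ClientConn *grpc.ClientConn
}

type Manager struct {
	mu    sync.Mutex
	cache map[string]*CachedClient // simplified stand-in for the pool's LRU cache
}

// getOrAdd returns the client cached for grpcAddress, or adds an empty one.
// The boolean reports whether the client already existed in the cache.
func (m *Manager) getOrAdd(grpcAddress string) (*CachedClient, bool) {
	m.mu.Lock()
	defer m.mu.Unlock()
	if client, ok := m.cache[grpcAddress]; ok {
		return client, true
	}
	client := &CachedClient{}
	m.cache[grpcAddress] = client
	return client, false
}

// GetConnection is structured around the two paths: initialize a brand new
// client, or reuse a cached one after any in-progress initialization finishes.
func (m *Manager) GetConnection(grpcAddress string) (*grpc.ClientConn, error) {
	client, existing := m.getOrAdd(grpcAddress)

	if !existing {
		// Path 1: new client - dial the connection under the client's lock
		// (record "new connection" metrics here).
		client.mu.Lock()
		defer client.mu.Unlock()

		conn, err := grpc.Dial(grpcAddress, grpc.WithTransportCredentials(insecure.NewCredentials()))
		if err != nil {
			// drop the half-initialized entry so the next caller retries
			m.mu.Lock()
			delete(m.cache, grpcAddress)
			m.mu.Unlock()
			return nil, err
		}
		client.ClientConn = conn
		return conn, nil
	}

	// Path 2: cached client - taking the lock waits for any in-progress
	// initialization, then the existing connection is reused
	// (record "connection reused" metrics here).
	client.mu.Lock()
	defer client.mu.Unlock()
	if client.ClientConn == nil {
		return nil, fmt.Errorf("connection to %s failed to initialize", grpcAddress)
	}
	return client.ClientConn, nil
}
```

Taking the per-client lock in both paths is what makes the cached path wait for a concurrent initialization instead of racing it.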

Member:

On the other hand, the current logic helps recover from the case where someone closed the connection using grpc.ClientConn.Close instead of Manager.Remove(grpcAddress). That is a good thing, but it also encourages wrong usage of the connection manager.

@peterargue (Contributor) commented Jul 21, 2023

Yea, this also covers the case where the remote end hung up. I think it's a good idea to keep this, though if we didn't check, the application logic would likely handle it gracefully.


// Remove removes the gRPC client connection associated with the given grpcAddress from the cache.
// It returns true if the connection was removed successfully, false otherwise.
func (m *Manager) Remove(grpcAddress string) bool {

@durkmurder (Member) commented Jul 21, 2023

I think we need to obtain the lock here to make sure that res.ClientConn is indeed initialized.
Without it we can run into this scenario:

  • GetOrAdd creates the client and takes the lock, but hasn't created the connection yet
  • Another goroutine calls Remove, which gets the client in a half-initialized state and removes it from the cache
  • Since ClientConn is not yet set, we will call Close on a nil grpc.ClientConn

Obtaining the lock after calling cache.Remove should fix the problem.
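
A minimal sketch of that ordering; the map-based cache and CachedClient here are simplified stand-ins for the PR's actual types:

```go
// Hypothetical sketch of the locking order described above; names are assumptions.
package connection

import (
	"sync"

	"google.golang.org/grpc"
)

type CachedClient struct {
	mu         sync.Mutex
	ClientConn *grpc.ClientConn
}

type Manager struct {
	mu    sync.Mutex
	cache map[string]*CachedClient // simplified stand-in for the pool's LRU cache
}

// Remove removes the gRPC client connection associated with the given
// grpcAddress from the cache and closes it. It returns true if a connection
// was removed successfully, false otherwise.
func (m *Manager) Remove(grpcAddress string) bool {
	m.mu.Lock()
	client, ok := m.cache[grpcAddress]
	if ok {
		delete(m.cache, grpcAddress)
	}
	m.mu.Unlock()
	if !ok {
		return false
	}

	// Take the client's lock only after it has been removed from the cache. If a
	// concurrent GetOrAdd is still initializing ClientConn, this blocks until that
	// finishes, so Close is never called on a nil connection.
	client.mu.Lock()
	defer client.mu.Unlock()

	if client.ClientConn != nil {
		_ = client.ClientConn.Close()
	}
	return true
}
```

Because the entry is already gone from the cache, no new caller can pick it up while Remove waits for the client's lock.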

Contributor:

good catch

@durkmurder (Member) left a comment

Very solid, great stuff.

@Guitarheroua (Contributor Author)

Resolved merge conflicts. Fixed all comments. @peterargue, could you please do a final double-check?

@Guitarheroua (Contributor Author)

> added a few small comments, but I think this is pretty much ready to go. Can you also do some manual testing using localnet to make sure it's working as expected? let me know if you need any pointers on how to do that.

I ran this branch with all the latest changes on localnet. The access container was created successfully; it works, shows logs, and displays visual info about the access node. Should I check something specific?

@peterargue (Contributor) commented Jul 21, 2023

> added a few small comments, but I think this is pretty much ready to go. Can you also do some manual testing using localnet to make sure it's working as expected? let me know if you need any pointers on how to do that.
>
> I ran this branch with all the latest changes on localnet. The access container was created successfully; it works, shows logs, and displays visual info about the access node. Should I check something specific?

Maybe add a few more collection nodes (make -e COLLECTION=6 bootstrap start-cached), then set the cache size to something small like 1 or 2. You should be able to just run the loader (make load) to generate SendTransaction traffic and make sure evictions are handled properly.

@Guitarheroua (Contributor Author)

> added a few small comments, but I think this is pretty much ready to go. Can you also do some manual testing using localnet to make sure it's working as expected? let me know if you need any pointers on how to do that.
>
> I ran this branch with all the latest changes on localnet. The access container was created successfully; it works, shows logs, and displays visual info about the access node. Should I check something specific?
>
> Maybe add a few more collection nodes (make -e COLLECTION=6 bootstrap start-cached), then set the cache size to something small like 1 or 2. You should be able to just run the loader (make load) to generate SendTransaction traffic and make sure evictions are handled properly.

Here is what I got. I do not know whether the error was caused by our changes; it looks like it was not.

> make -e COLLECTION=6 bootstrap start-cached=2
> make load
go run --tags relic ../benchmark/cmd/manual -log-level info -tps 1,10,100 -tps-durations 30s,30s
11:54PM INF metrics server started address=:8080 endpoint=/metrics
11:54PM INF Service Address: f8d6e0586b0a20c7
11:54PM INF Fungible Token Address: ee82856bf20e2aa6
11:54PM INF Flow Token Address: 0ae53cb6e3f42a79
11:54PM INF worker stats TxsExecuted=0 TxsExecutedMovingAverage=0 TxsSent=0 TxsSentMovingAverage=0 TxsTimedout=0 Workers=0
11:54PM INF service account loaded num_keys=50
11:54PM INF creating accounts cumulative=0 num=750 numberOfAccounts=10000
11:54PM INF worker stats TxsExecuted=0 TxsExecutedMovingAverage=0 TxsSent=1 TxsSentMovingAverage=0 TxsTimedout=0 Workers=0
11:54PM INF creating accounts cumulative=750 num=750 numberOfAccounts=10000
11:54PM INF creating accounts cumulative=1500 num=750 numberOfAccounts=10000
11:54PM INF worker stats TxsExecuted=0 TxsExecutedMovingAverage=0 TxsSent=2 TxsSentMovingAverage=0 TxsTimedout=0 Workers=0
11:54PM INF worker stats TxsExecuted=0 TxsExecutedMovingAverage=0 TxsSent=3 TxsSentMovingAverage=0 TxsTimedout=0 Workers=0
...
11:55PM INF worker stats TxsExecuted=10 TxsExecutedMovingAverage=0.10461840683514284 TxsSent=14 TxsSentMovingAverage=0.000007213393639844955 TxsTimedout=0 Workers=0
11:55PM FTL unable to init loader error="error creating accounts: timeout waiting for account creation tx to be executed"
exit status 1
make: *** [Makefile:122: load] Error 1

@peterargue (Contributor)

Do you see any errors or panics in the Access node logs?

@Guitarheroua (Contributor Author)

> Do you see any errors or panics in the Access node logs?

No, I do not see any panics in the access node logs. I checked the Loki output after running the load.

@peterargue (Contributor)

Actually, we also need to remove this check:
https://github.com/Guitarheroua/flow-go/blob/guitarheroua/2833-conn-pool-evictions-cause-conn-failures/engine/access/rpc/backend/backend.go#L228-L234

Your localnet tests would have reset the limit to 250 and not exercised the evictions. If you remove that check, you should be able to run the same manual tests and see lots of "closing grpc connection evicted from pool" debug messages. We're looking to make sure that:

  1. The access node remains stable
  2. The error rate for SendTransaction requests is stable with a low limit (and a high rate of evictions)

@Guitarheroua (Contributor Author)

> Actually, we also need to remove this check: https://github.com/Guitarheroua/flow-go/blob/guitarheroua/2833-conn-pool-evictions-cause-conn-failures/engine/access/rpc/backend/backend.go#L228-L234
>
> Your localnet tests would have reset the limit to 250 and not exercised the evictions. If you remove that check, you should be able to run the same manual tests and see lots of "closing grpc connection evicted from pool" debug messages. We're looking to make sure that:
>
> 1. The access node remains stable
>
> 2. The error rate for SendTransaction requests is stable with a low limit (and a high rate of evictions)

No panics at all. A lot of:

{
  "level": "debug",
  "node_role": "access",
  "node_id": "c1aac167011455fcd922e19388c79ff1b0d2ea565df24dd1552da9baba859c08",
  "grpc_conn_evicted": "execution_2:9000",
  "time": "2023-07-24T21:09:55.170365577Z",
  "message": "closing grpc connection evicted from pool"
}

@peterargue (Contributor)

thanks for testing! Approved

@peterargue peterargue merged commit 0607ed2 into onflow:master Jul 26, 2023
@Guitarheroua Guitarheroua deleted the guitarheroua/2833-conn-pool-evictions-cause-conn-failures branch October 30, 2023 09:27