
[Access] Connection pool evictions cause connection failures #4534

Conversation

@Guitarheroua (Contributor)

#2833

Context
Connection pool evictions cause connection failures. To resolve this issue, I proposed checking the current state of the ClientConn (via the GetState() or WaitForStateChange() methods). While these methods can track the state of the connection itself, they do not handle ongoing requests gracefully.

To handle ongoing requests properly, this PR introduces a new client interceptor, the "request watcher". The interceptor keeps track of the number of requests that have started and completed, and uses synchronization to wait for unfinished requests during the connection closure process. Any additional requests that arrive after closure is initiated are rejected while the ongoing requests finish.
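
Below is a minimal sketch of the request-watcher idea as a gRPC unary client interceptor. The type and field names (requestWatcher, closeRequested, inFlight) are illustrative assumptions, not necessarily the PR's actual identifiers:

```go
// Hypothetical sketch of the "request watcher"; names are illustrative,
// not the PR's exact identifiers.
package connection

import (
	"context"
	"sync"

	"google.golang.org/grpc"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

type requestWatcher struct {
	mu             sync.RWMutex
	inFlight       sync.WaitGroup // tracks started-but-unfinished requests
	closeRequested bool           // set once closure of the connection is initiated
}

// UnaryInterceptor rejects new requests once closure has been requested and
// counts in-flight requests so Close can wait for them to complete.
func (w *requestWatcher) UnaryInterceptor(
	ctx context.Context,
	method string,
	req, reply interface{},
	cc *grpc.ClientConn,
	invoker grpc.UnaryInvoker,
	opts ...grpc.CallOption,
) error {
	w.mu.RLock()
	if w.closeRequested {
		w.mu.RUnlock()
		return status.Errorf(codes.Unavailable, "the connection to %s is closing", cc.Target())
	}
	w.inFlight.Add(1)
	w.mu.RUnlock()
	defer w.inFlight.Done()

	return invoker(ctx, method, req, reply, cc, opts...)
}

// Close marks the connection for closure, waits for in-flight requests to
// finish, and only then closes the underlying gRPC connection.
func (w *requestWatcher) Close(conn *grpc.ClientConn) error {
	w.mu.Lock()
	w.closeRequested = true
	w.mu.Unlock()

	w.inFlight.Wait()
	return conn.Close()
}
```

In this sketch the connection manager would attach the interceptor with grpc.WithUnaryInterceptor(w.UnaryInterceptor) when dialing, and call w.Close instead of closing the grpc.ClientConn directly.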

@Guitarheroua Guitarheroua marked this pull request as ready for review July 5, 2023 11:07
@Guitarheroua Guitarheroua requested a review from peterargue as a code owner July 5, 2023 11:07
@Guitarheroua (Contributor Author)

@peterargue Fixed all comments

@peterargue (Contributor) left a comment

thanks for the updates @Guitarheroua. Added a few more comments based on the changes.

) error {
// Prevent new requests from being sent if the connection is marked for closure
if cachedClient.closeRequested.Load() {
return status.Errorf(codes.Unavailable, "the connection to %s was closed", cachedClient.Address)

Contributor:

I think this is the correct behavior, but ideally, we'd never encounter this error since it's too late to reconnect. Is there a check needed in retrieveConnection to ensure the connection isn't closed before reusing?

If we catch it there, then a new connection can be created.
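
A hypothetical illustration of that suggestion follows; the type names, fields, and cache here are simplified assumptions, not the PR's exact code:

```go
// Hypothetical sketch of the suggested check in retrieveConnection; names and
// fields are assumptions, not the PR's exact code.
package connection

import (
	"sync/atomic"

	"google.golang.org/grpc"
	"google.golang.org/grpc/connectivity"
)

type CachedClient struct {
	ClientConn     *grpc.ClientConn
	closeRequested atomic.Bool
}

type ConnectionFactoryImpl struct {
	cache map[string]*CachedClient // simplified stand-in for the pool's LRU cache
	dial  func(grpcAddress string) (*grpc.ClientConn, error)
}

// retrieveConnection reuses a cached connection only if it is neither marked
// for closure nor shut down; otherwise it evicts the entry and dials a fresh one.
func (cf *ConnectionFactoryImpl) retrieveConnection(grpcAddress string) (*grpc.ClientConn, error) {
	if client, ok := cf.cache[grpcAddress]; ok {
		if !client.closeRequested.Load() &&
			client.ClientConn != nil &&
			client.ClientConn.GetState() != connectivity.Shutdown {
			return client.ClientConn, nil
		}
		// closing or shut down: drop the stale entry so callers get a new connection
		delete(cf.cache, grpcAddress)
	}

	conn, err := cf.dial(grpcAddress)
	if err != nil {
		return nil, err
	}
	cf.cache[grpcAddress] = &CachedClient{ClientConn: conn}
	return conn, nil
}
```

Catching the closing or shut-down state here means callers get a fresh connection instead of the Unavailable error shown above.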

@peterargue (Contributor) left a comment

This connection factory has proven to be difficult to get right with the different race conditions in a performant way. I took a shot at refactoring it a bit to make it easier to work with. feel free to take some or all of these changes if you think they help. Also feel free to ignore. I haven't done any testing and I'm sure it's missing metrics, logging, comments, etc.

https://github.com/onflow/flow-go/compare/petera/example-conn-factory-refactor?expand=1

…es' of github.com:Guitarheroua/flow-go into guitarheroua/2833-conn-pool-evictions-cause-conn-failures
@Guitarheroua (Contributor Author)

> This connection factory has proven to be difficult to get right with the different race conditions in a performant way. I took a shot at refactoring it a bit to make it easier to work with. feel free to take some or all of these changes if you think they help. Also feel free to ignore. I haven't done any testing and I'm sure it's missing metrics, logging, comments, etc.
>
> https://github.com/onflow/flow-go/compare/petera/example-conn-factory-refactor?expand=1

It was unexpected for me, but it really looks nicer and cleaner. I added all changes to the code, added metrics, commented on everything, and fixed tests. Should be nice.

@peterargue (Contributor) left a comment

added a few small comments, but I think this is pretty much ready to go. Can you also do some manual testing using localnet to make sure it's working as expected? let me know if you need any pointers on how to do that.

@codecov-commenter commented Jul 20, 2023

Codecov Report

Merging #4534 (f7c7116) into master (2528190) will decrease coverage by 1.75%.
Report is 286 commits behind head on master.
The diff coverage is 55.25%.

@@            Coverage Diff             @@
##           master    #4534      +/-   ##
==========================================
- Coverage   56.25%   54.51%   -1.75%     
==========================================
  Files         653      914     +261     
  Lines       64699    85318   +20619     
==========================================
+ Hits        36396    46509   +10113     
- Misses      25362    35219    +9857     
- Partials     2941     3590     +649     
Flag Coverage Δ
unittests 54.51% <55.25%> (-1.75%) ⬇️

Flags with carried forward coverage won't be shown.

Files Changed Coverage Δ
cmd/scaffold.go 14.83% <0.00%> (+0.06%) ⬆️
engine/access/rest/middleware/metrics.go 0.00% <0.00%> (ø)
engine/access/rest/routes/execution_result.go 56.75% <0.00%> (ø)
engine/access/rpc/backend/backend_accounts.go 54.83% <ø> (ø)
engine/access/rpc/backend/backend_events.go 66.88% <ø> (ø)
engine/access/rpc/backend/backend_scripts.go 16.95% <ø> (ø)
engine/access/rpc/backend/backend_transactions.go 49.01% <0.00%> (+2.86%) ⬆️
engine/access/state_stream/engine.go 0.00% <0.00%> (ø)
engine/common/rpc/convert/blocks.go 78.76% <ø> (ø)
fvm/environment/env.go 100.00% <ø> (ø)
... and 75 more

... and 239 files with indirect coverage changes

}
defer client.mu.Unlock()

if client.ClientConn != nil && client.ClientConn.GetState() != connectivity.Shutdown {

Member:

Do we actually need the check for state? We always remove the connection from the cache when we close the connection.
Additionally, I would be more explicit in the implementation of this function by implementing the following logic:

  1. GetOrAdd returns a new client - we perform initialization of the connection, record metrics, and return.
  2. GetOrAdd returns a cached client - we take the lock, record metrics, and return.

We can simplify and structure this function so that it has these specific paths.
In reality it already works the way I have described, but the branching is based on the availability of ClientConn rather than on whether the client was cached.
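
A hypothetical sketch of the two explicit paths described above; getOrAdd, the map-based cache, and the metrics placement are simplified assumptions, not the PR's exact code:

```go
// Hypothetical sketch of the two-path structure; names are assumptions.
package connection

import (
	"fmt"
	"sync"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

type CachedClient struct {
	mu         sync.Mutex
	ClientConn *grpc.ClientConn
}

type Manager struct {
	mu    sync.Mutex
	cache map[string]*CachedClient // simplified stand-in for the pool's LRU cache
}

// getOrAdd returns the client cached for grpcAddress, or adds an empty one.
// The boolean reports whether the client already existed in the cache.
func (m *Manager) getOrAdd(grpcAddress string) (*CachedClient, bool) {
	m.mu.Lock()
	defer m.mu.Unlock()
	if client, ok := m.cache[grpcAddress]; ok {
		return client, true
	}
	client := &CachedClient{}
	m.cache[grpcAddress] = client
	return client, false
}

// GetConnection is structured around the two paths: initialize a brand new
// client, or reuse a cached one after any in-progress initialization finishes.
func (m *Manager) GetConnection(grpcAddress string) (*grpc.ClientConn, error) {
	client, existing := m.getOrAdd(grpcAddress)

	if !existing {
		// Path 1: new client - dial the connection under the client's lock
		// (record "new connection" metrics here).
		client.mu.Lock()
		defer client.mu.Unlock()

		conn, err := grpc.Dial(grpcAddress, grpc.WithTransportCredentials(insecure.NewCredentials()))
		if err != nil {
			// drop the half-initialized entry so the next caller retries
			m.mu.Lock()
			delete(m.cache, grpcAddress)
			m.mu.Unlock()
			return nil, err
		}
		client.ClientConn = conn
		return conn, nil
	}

	// Path 2: cached client - taking the lock waits for any in-progress
	// initialization, then the existing connection is reused
	// (record "connection reused" metrics here).
	client.mu.Lock()
	defer client.mu.Unlock()
	if client.ClientConn == nil {
		return nil, fmt.Errorf("connection to %s failed to initialize", grpcAddress)
	}
	return client.ClientConn, nil
}
```

Taking the per-client lock in both paths is what makes the cached path wait for a concurrent initialization instead of racing it.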

Member:

On the other hand, the current logic helps recover from the case where someone closed the connection using grpc.ClientConn.Close instead of Manager.Remove(grpcAddress). That is a good thing, but it also encourages wrong usage of the connection manager.

@peterargue (Contributor) commented Jul 21, 2023

Yea, this also covers the case where the remote end hung up. I think it's a good idea to keep this, though if we didn't check, the application logic would likely handle it gracefully.


// Remove removes the gRPC client connection associated with the given grpcAddress from the cache.
// It returns true if the connection was removed successfully, false otherwise.
func (m *Manager) Remove(grpcAddress string) bool {

@durkmurder (Member) commented Jul 21, 2023

I think we need to obtain the lock here to make sure that res.ClientConn is indeed initialized.
Without it we can run into this scenario:

  • GetOrAdd creates the client and takes the lock, but hasn't created the connection yet
  • Another goroutine calls Remove, which gets the client in a half-initialized state and removes it from the cache
  • Since ClientConn is not yet set, we will call Close on a nil grpc.ClientConn

Obtaining the lock after calling cache.Remove should fix the problem.
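
A minimal sketch of that ordering; the map-based cache and CachedClient here are simplified stand-ins for the PR's actual types:

```go
// Hypothetical sketch of the locking order described above; names are assumptions.
package connection

import (
	"sync"

	"google.golang.org/grpc"
)

type CachedClient struct {
	mu         sync.Mutex
	ClientConn *grpc.ClientConn
}

type Manager struct {
	mu    sync.Mutex
	cache map[string]*CachedClient // simplified stand-in for the pool's LRU cache
}

// Remove removes the gRPC client connection associated with the given
// grpcAddress from the cache and closes it. It returns true if a connection
// was removed successfully, false otherwise.
func (m *Manager) Remove(grpcAddress string) bool {
	m.mu.Lock()
	client, ok := m.cache[grpcAddress]
	if ok {
		delete(m.cache, grpcAddress)
	}
	m.mu.Unlock()
	if !ok {
		return false
	}

	// Take the client's lock only after it has been removed from the cache. If a
	// concurrent GetOrAdd is still initializing ClientConn, this blocks until that
	// finishes, so Close is never called on a nil connection.
	client.mu.Lock()
	defer client.mu.Unlock()

	if client.ClientConn != nil {
		_ = client.ClientConn.Close()
	}
	return true
}
```

Because the entry is already gone from the cache, no new caller can pick it up while Remove waits for the client's lock.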

Contributor:

good catch

@durkmurder (Member) left a comment

Very solid, great stuff.

@Guitarheroua (Contributor Author)

Resolved merge conflicts. Fixed all comments. @peterargue, could you please do a final double-check?

@Guitarheroua (Contributor Author)

> added a few small comments, but I think this is pretty much ready to go. Can you also do some manual testing using localnet to make sure it's working as expected? let me know if you need any pointers on how to do that.

I ran this branch with all the latest changes on localnet. The access container was created successfully; it works, shows logs, and displays visual info about the access node. Should I check something specific?

@peterargue (Contributor) commented Jul 21, 2023

> added a few small comments, but I think this is pretty much ready to go. Can you also do some manual testing using localnet to make sure it's working as expected? let me know if you need any pointers on how to do that.
>
> I ran this branch with all the latest changes on localnet. The access container was created successfully; it works, shows logs, and displays visual info about the access node. Should I check something specific?

Maybe add a few more collection nodes (make -e COLLECTION=6 bootstrap start-cached), then set the cache size to something small like 1 or 2. You should be able to just run the loader (make load) to generate SendTransaction traffic and make sure evictions are handled properly.

@Guitarheroua (Contributor Author)

> added a few small comments, but I think this is pretty much ready to go. Can you also do some manual testing using localnet to make sure it's working as expected? let me know if you need any pointers on how to do that.
>
> I ran this branch with all the latest changes on localnet. The access container was created successfully; it works, shows logs, and displays visual info about the access node. Should I check something specific?
>
> Maybe add a few more collection nodes (make -e COLLECTION=6 bootstrap start-cached), then set the cache size to something small like 1 or 2. You should be able to just run the loader (make load) to generate SendTransaction traffic and make sure evictions are handled properly.

Here is what I got. I do not know whether the error was caused by our changes; it looks like it was not.

> make -e COLLECTION=6 bootstrap start-cached=2
> make load
go run --tags relic ../benchmark/cmd/manual -log-level info -tps 1,10,100 -tps-durations 30s,30s
11:54PM INF metrics server started address=:8080 endpoint=/metrics
11:54PM INF Service Address: f8d6e0586b0a20c7
11:54PM INF Fungible Token Address: ee82856bf20e2aa6
11:54PM INF Flow Token Address: 0ae53cb6e3f42a79
11:54PM INF worker stats TxsExecuted=0 TxsExecutedMovingAverage=0 TxsSent=0 TxsSentMovingAverage=0 TxsTimedout=0 Workers=0
11:54PM INF service account loaded num_keys=50
11:54PM INF creating accounts cumulative=0 num=750 numberOfAccounts=10000
11:54PM INF worker stats TxsExecuted=0 TxsExecutedMovingAverage=0 TxsSent=1 TxsSentMovingAverage=0 TxsTimedout=0 Workers=0
11:54PM INF creating accounts cumulative=750 num=750 numberOfAccounts=10000
11:54PM INF creating accounts cumulative=1500 num=750 numberOfAccounts=10000
11:54PM INF worker stats TxsExecuted=0 TxsExecutedMovingAverage=0 TxsSent=2 TxsSentMovingAverage=0 TxsTimedout=0 Workers=0
11:54PM INF worker stats TxsExecuted=0 TxsExecutedMovingAverage=0 TxsSent=3 TxsSentMovingAverage=0 TxsTimedout=0 Workers=0
...
11:55PM INF worker stats TxsExecuted=10 TxsExecutedMovingAverage=0.10461840683514284 TxsSent=14 TxsSentMovingAverage=0.000007213393639844955 TxsTimedout=0 Workers=0
11:55PM FTL unable to init loader error="error creating accounts: timeout waiting for account creation tx to be executed"
exit status 1
make: *** [Makefile:122: load] Error 1

@peterargue (Contributor)

Do you see any errors or panics in the Access node logs?

@Guitarheroua (Contributor Author)

> Do you see any errors or panics in the Access node logs?

No, I do not see any panics in the access node logs. I checked the Loki output after running the load.

@peterargue (Contributor)

Actually, we also need to remove this check:
https://github.com/Guitarheroua/flow-go/blob/guitarheroua/2833-conn-pool-evictions-cause-conn-failures/engine/access/rpc/backend/backend.go#L228-L234

Your localnet tests would have reset the limit to 250 and not exercised the evictions. If you remove that check, you should be able to run the same manual tests and see lots of "closing grpc connection evicted from pool" debug messages. We're looking to make sure that:

  1. The access node remains stable
  2. The error rate for SendTransaction requests is stable with a low limit (and a high rate of evictions)

@Guitarheroua (Contributor Author)

> Actually, we also need to remove this check: https://github.com/Guitarheroua/flow-go/blob/guitarheroua/2833-conn-pool-evictions-cause-conn-failures/engine/access/rpc/backend/backend.go#L228-L234
>
> Your localnet tests would have reset the limit to 250 and not exercised the evictions. If you remove that check, you should be able to run the same manual tests and see lots of "closing grpc connection evicted from pool" debug messages. We're looking to make sure that:
>
> 1. The access node remains stable
>
> 2. The error rate for SendTransaction requests is stable with a low limit (and a high rate of evictions)

No panics at all. A lot of:

{
  "level": "debug",
  "node_role": "access",
  "node_id": "c1aac167011455fcd922e19388c79ff1b0d2ea565df24dd1552da9baba859c08",
  "grpc_conn_evicted": "execution_2:9000",
  "time": "2023-07-24T21:09:55.170365577Z",
  "message": "closing grpc connection evicted from pool"
}

@peterargue (Contributor)

thanks for testing! Approved

@peterargue peterargue merged commit 0607ed2 into onflow:master Jul 26, 2023
@Guitarheroua Guitarheroua deleted the guitarheroua/2833-conn-pool-evictions-cause-conn-failures branch October 30, 2023 09:27