[Access] Connection pool evictions cause connection failures #4534
Conversation
Co-authored-by: Peter Argue <89119817+peterargue@users.noreply.github.com>
@peterargue Fixed all comments
thanks for the updates @Guitarheroua. Added a few more comments based on the changes.
```go
) error {
	// Prevent new requests from being sent if the connection is marked for closure
	if cachedClient.closeRequested.Load() {
		return status.Errorf(codes.Unavailable, "the connection to %s was closed", cachedClient.Address)
```
I think this is the correct behavior, but ideally we'd never encounter this error, since by this point it's too late to reconnect. Is there a check needed in `retrieveConnection` to ensure the connection isn't closed before reusing it? If we catch it there, a new connection can be created (sketched below).
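For illustration, a minimal sketch of that check inside `retrieveConnection`, assuming the manager has a `dial` helper and that the client's lock guards `ClientConn` (both assumptions, not the exact flow-go API):

```go
// retrieveConnection sketch: before reusing a cached client, verify it has
// not been marked for closure; if it has, dial a fresh connection instead of
// failing the caller's request.
func (m *Manager) retrieveConnection(client *CachedClient) (*grpc.ClientConn, error) {
	client.mu.Lock()
	defer client.mu.Unlock()

	if client.ClientConn == nil || client.closeRequested.Load() {
		conn, err := m.dial(client.Address) // hypothetical dial helper
		if err != nil {
			return nil, err
		}
		client.closeRequested.Store(false)
		client.ClientConn = conn
	}
	return client.ClientConn, nil
}
```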
…es' of github.com:Guitarheroua/flow-go into guitarheroua/2833-conn-pool-evictions-cause-conn-failures
This connection factory has proven difficult to get right in a performant way, given the various race conditions. I took a shot at refactoring it a bit to make it easier to work with. Feel free to take some or all of these changes if you think they help, or to ignore them entirely. I haven't done any testing, and I'm sure it's missing metrics, logging, comments, etc.
https://github.com/onflow/flow-go/compare/petera/example-conn-factory-refactor?expand=1
…low/flow-go into guitarheroua/2833-conn-pool-evictions-cause-conn-failures
…es' of github.com:Guitarheroua/flow-go into guitarheroua/2833-conn-pool-evictions-cause-conn-failures
It was unexpected for me, but it really does look nicer and cleaner. I added all the changes to the code, added metrics, commented everything, and fixed the tests. Should be good now.
Added a few small comments, but I think this is pretty much ready to go. Can you also do some manual testing using localnet to make sure it's working as expected? Let me know if you need any pointers on how to do that.
Codecov Report
```diff
@@            Coverage Diff             @@
##           master    #4534      +/-   ##
==========================================
- Coverage   56.25%   54.51%    -1.75%
==========================================
  Files         653      914      +261
  Lines       64699    85318    +20619
==========================================
+ Hits        36396    46509    +10113
- Misses      25362    35219     +9857
- Partials     2941     3590      +649
```
Flags with carried forward coverage won't be shown.
```go
	}
	defer client.mu.Unlock()

	if client.ClientConn != nil && client.ClientConn.GetState() != connectivity.Shutdown {
```
Do we actually need to check the state? We always remove the connection from the cache when we close it.
Additionally, I would be more explicit in the implementation of this function by following this logic:
- `GetOrAdd` returns a new client: we initialize the connection, record metrics, and return.
- `GetOrAdd` returns a cached client: we take the lock, record metrics, and return.

We can simplify and structure this function so that it has these two specific paths (see the sketch after this comment). In reality it already works the way I described, but the choice of path is based not on whether the client was cached but on the availability of `ClientConn`.
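For illustration, a rough sketch of those two explicit paths, assuming a `cache.GetOrAdd` that reports whether the entry already existed, a `dial` helper, and hypothetical metrics hooks (none of these names are from the actual codebase):

```go
// GetOrAdd returns a client for grpcAddress, with one explicit path per case.
func (m *Manager) GetOrAdd(grpcAddress string) (*CachedClient, error) {
	client, existed := m.cache.GetOrAdd(grpcAddress)

	client.mu.Lock()
	defer client.mu.Unlock()

	if !existed {
		// Path 1: new client - initialize the connection and record metrics.
		conn, err := m.dial(grpcAddress) // hypothetical dial helper
		if err != nil {
			return nil, err
		}
		client.ClientConn = conn
		m.metrics.NewConnectionEstablished() // hypothetical metric
		return client, nil
	}

	// Path 2: cached client - the lock is held; record reuse and return.
	m.metrics.ConnectionFromPoolReused() // hypothetical metric
	return client, nil
}
```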
On the other hand, the current logic helps to recover from the case where someone closed the connection using `grpc.ClientConn.Close` instead of `Manager.Remove(grpcAddress)`. That is a good thing, but it also encourages incorrect usage of the connection manager.
Yea, this also covers the case where the remote end hung up. I think it's a good idea to keep this, though if we didn't check, the application logic would likely handle it gracefully.
```go
// Remove removes the gRPC client connection associated with the given grpcAddress from the cache.
// It returns true if the connection was removed successfully, false otherwise.
func (m *Manager) Remove(grpcAddress string) bool {
```
I think we need to obtain the lock here to make sure that `res.ClientConn` is indeed initialized. Without it we can run into this scenario:

1. `GetOrAdd` creates the client and takes the lock, but hasn't created the connection yet.
2. Another goroutine calls `Remove`, which gets the client in a half-initialized state and removes it from the cache.
3. Since `ClientConn` is not yet set, we perform `Close` on a nil `grpc.ClientConn`.

Obtaining the lock after calling `cache.Remove` should fix the problem; see the sketch below.
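For illustration, a sketch of `Remove` taking the client's lock after evicting the entry, assuming `cache.Remove` returns the evicted client (an assumption for this sketch):

```go
// Remove evicts the client from the cache, then waits for any in-progress
// initialization by taking the client's lock before closing the connection.
func (m *Manager) Remove(grpcAddress string) bool {
	client, ok := m.cache.Remove(grpcAddress) // assumed to return the evicted entry
	if !ok {
		return false
	}

	// If GetOrAdd is still initializing this client, it holds the lock;
	// blocking here guarantees ClientConn is fully set (or still nil).
	client.mu.Lock()
	defer client.mu.Unlock()

	if client.ClientConn != nil {
		client.ClientConn.Close()
	}
	return true
}
```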
good catch
…eroua/2833-conn-pool-evictions-cause-conn-failures
Very solid, great stuff.
Resolved merge conflicts and fixed all comments. @peterargue could you please do a final double-check?
Ran this branch with all the latest changes on localnet. The Access container was created successfully, works, shows logs, and displays visual info about the access node. Is there anything specific I should check?
Maybe add a few more collection nodes (e.g. `COLLECTION=6`)?
Here is what I got. I don't know whether the error was caused by our changes; it looks like it wasn't.

```
> make -e COLLECTION=6 bootstrap start-cached=2
> make load
go run --tags relic ../benchmark/cmd/manual -log-level info -tps 1,10,100 -tps-durations 30s,30s
11:54PM INF metrics server started address=:8080 endpoint=/metrics
11:54PM INF Service Address: f8d6e0586b0a20c7
11:54PM INF Fungible Token Address: ee82856bf20e2aa6
11:54PM INF Flow Token Address: 0ae53cb6e3f42a79
11:54PM INF worker stats TxsExecuted=0 TxsExecutedMovingAverage=0 TxsSent=0 TxsSentMovingAverage=0 TxsTimedout=0 Workers=0
11:54PM INF service account loaded num_keys=50
11:54PM INF creating accounts cumulative=0 num=750 numberOfAccounts=10000
11:54PM INF worker stats TxsExecuted=0 TxsExecutedMovingAverage=0 TxsSent=1 TxsSentMovingAverage=0 TxsTimedout=0 Workers=0
11:54PM INF creating accounts cumulative=750 num=750 numberOfAccounts=10000
11:54PM INF creating accounts cumulative=1500 num=750 numberOfAccounts=10000
11:54PM INF worker stats TxsExecuted=0 TxsExecutedMovingAverage=0 TxsSent=2 TxsSentMovingAverage=0 TxsTimedout=0 Workers=0
11:54PM INF worker stats TxsExecuted=0 TxsExecutedMovingAverage=0 TxsSent=3 TxsSentMovingAverage=0 TxsTimedout=0 Workers=0
...
11:55PM INF worker stats TxsExecuted=10 TxsExecutedMovingAverage=0.10461840683514284 TxsSent=14 TxsSentMovingAverage=0.000007213393639844955 TxsTimedout=0 Workers=0
11:55PM FTL unable to init loader error="error creating accounts: timeout waiting for account creation tx to be executed"
exit status 1
make: *** [Makefile:122: load] Error 1
```
Do you see any errors or panics in the Access node logs?
No, I do not see any panics in the Access node logs. I checked the Loki output after running the load.
Actually, we also need to remove this check. Your localnet tests would have reset the limit to 250 and not exercised the evictions. If you remove that check, you should be able to run the same manual tests and see lots of eviction log messages.
No panic at all. A lot of:

```json
{
  "level": "debug",
  "node_role": "access",
  "node_id": "c1aac167011455fcd922e19388c79ff1b0d2ea565df24dd1552da9baba859c08",
  "grpc_conn_evicted": "execution_2:9000",
  "time": "2023-07-24T21:09:55.170365577Z",
  "message": "closing grpc connection evicted from pool"
}
```
Thanks for testing! Approved
#2833
Context
Connection pool evictions cause connection failures. To resolve this issue, I proposed checking the current state of the `ClientConn` (via the `GetState()` or `WaitForStateChange()` methods). While these methods can track the state of the connection itself, they do not handle ongoing requests gracefully.
To handle ongoing requests properly, a new client interceptor, the "request watcher", is introduced. This interceptor keeps track of the number of requests that have started and completed. It uses synchronization to wait for unfinished requests while the connection is being closed; any additional requests that arrive after closure has been initiated are rejected.
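A self-contained sketch of how such a request watcher can be built as a gRPC unary client interceptor; the type and method names below are illustrative, not the exact ones from this PR:

```go
package connection

import (
	"context"
	"sync"

	"google.golang.org/grpc"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// requestWatcher counts in-flight requests so a connection can be drained
// before it is closed.
type requestWatcher struct {
	mu             sync.RWMutex
	closeRequested bool
	wg             sync.WaitGroup
}

// begin registers a new request, or fails if closure has been requested.
// The read lock ensures no request can register after Close sets the flag.
func (w *requestWatcher) begin() error {
	w.mu.RLock()
	defer w.mu.RUnlock()
	if w.closeRequested {
		return status.Error(codes.Unavailable, "connection is closing")
	}
	w.wg.Add(1)
	return nil
}

// Interceptor rejects new requests once closure starts and tracks the rest.
func (w *requestWatcher) Interceptor() grpc.UnaryClientInterceptor {
	return func(ctx context.Context, method string, req, reply interface{},
		cc *grpc.ClientConn, invoker grpc.UnaryInvoker, opts ...grpc.CallOption) error {
		if err := w.begin(); err != nil {
			return err
		}
		defer w.wg.Done()
		return invoker(ctx, method, req, reply, cc, opts...)
	}
}

// Close marks the connection for closure, waits for in-flight requests to
// finish, then closes the underlying connection.
func (w *requestWatcher) Close(conn *grpc.ClientConn) error {
	w.mu.Lock()
	w.closeRequested = true
	w.mu.Unlock()
	w.wg.Wait()
	return conn.Close()
}
```

The interceptor would be installed when dialing, e.g. with `grpc.WithUnaryInterceptor(w.Interceptor())`, so every unary call on the connection passes through the counter.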