grpc.StartGRPCServer takes 5 seconds to start #14429
Comments
Happy to submit a patch once we settle on the solution.
IIRC,
Does it have to be blocking, @alexanderbez?
@alexanderbez, same as it is right now: start the goroutine internally. If the user needs to handle the error (unlikely), they should call Serve themselves. Handling errors for only 5 seconds is pointless.
I'm referring to So yes, it's starting a server, of course it has to block. @Wondertan I'm not really following your proposal, but if you can open a draft PR I would be happy to take a look at it 😄
…#1551) This PR is one of the cleanup PRs that I am doing. It removes the usage of KVStore. Besides simplifying and unifying tests, it also fixes all the existing and potential "bind already in used" errors from the app and tendermint tandem. I tested this by running all the tests with `-parallel 12`, which runs tests in parallel within each package. With this flag, we can reduce the time it takes to run all the tests on CI and locally. Going further, we can also consider running all the tests in parallel against a single instance of the app rather than one instance per test. Would love this diff to be reddish, but unfortunately, I had to copy a bunch of code from Cosmos SDK. The rationale is in the comments. Also, see cosmos/cosmos-sdk#14429
cc @Wondertan did this get fixed in celestia / is there anything that can be upstreamed? I would appreciate this 5-second delay being eliminated as well. It's not clear to me either why this is a blocking 5-second wait. As I read the godoc, it seems it's explicitly not intended to be blocking: https://pkg.go.dev/google.golang.org/grpc#Server.Serve
So the currently implemented semantics seem incorrect: they assume that no error within 5 seconds means continued successful execution. What I think you want instead is for this error to be handled within that same goroutine, with {some error recovery handling} for the main application to deal with, and no blocking behavior. We should define how we want gRPC errors to be dealt with, and infer the error handling strategy from that. If we want gRPC to be isolated from the main application, the error recovery plan could just be a function that's run after grpcServer.Serve() returns, in that goroutine; otherwise, the start function immediately returns with the server. If we want gRPC failures within 5 seconds to terminate the entire application, the main app can pass in a channel on which the error is communicated, and the error handling logic is then run from within the app (but entirely non-blocking).
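A minimal sketch of the channel-based variant described above, not the SDK's actual code. The `server` interface, `startNonBlocking`, and `failingServer` are hypothetical names introduced for illustration; the interface only mirrors the `Serve(net.Listener) error` method that `*grpc.Server` exposes, so the example runs with the standard library alone.

```go
package main

import (
	"errors"
	"fmt"
	"net"
)

// server is a hypothetical stand-in for the single *grpc.Server method
// this sketch needs.
type server interface {
	Serve(lis net.Listener) error
}

// startNonBlocking launches srv.Serve in a goroutine and returns at once.
// The returned channel receives Serve's error, if any, and is then closed,
// so the caller decides when and how to handle a failure.
func startNonBlocking(srv server, lis net.Listener) <-chan error {
	errCh := make(chan error, 1) // buffered: the goroutine never blocks on send
	go func() {
		defer close(errCh)
		if err := srv.Serve(lis); err != nil {
			errCh <- err
		}
	}()
	return errCh
}

// failingServer is a test double whose Serve fails immediately.
type failingServer struct{}

func (failingServer) Serve(net.Listener) error {
	return errors.New("serve failed")
}

func main() {
	// The double ignores the listener, so nil is enough for the demo.
	errCh := startNonBlocking(failingServer{}, nil)
	fmt.Println("serve error:", <-errCh)
}
```

With this shape, the caller gets control back immediately and can either ignore the channel (gRPC isolated from the app) or select on it to shut the application down.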
I think we can share the err channel between all services, and only wait once at the end.
I think I can drastically clean this stuff up. Let me take a stab at it...
@ValarDragon, we have a quick solution for the test runner that just avoids the timeout. We don't have any clean and "correct" solution.
Summary of Bug
grpc.StartGRPCServer takes 5 seconds to start. It's not really clear why this approach was taken, and git blame didn't give me any definitive answer. Looking at the Serve implementation, I don't see why one would wait 5 seconds to conclude the server started correctly.
Why is this a problem? Besides solving a problem that does not exist, this creates additional problems for end users. E.g., we in celestia-node run dozens of integration tests locally and on CI. Each test that runs an App has to wait an additional 5 seconds, just because. Also, a dev from Secret shared that they had to similarly engineer a solution around the problem.
Running this in a goroutine does not help and is racy, as the func sets GRPCClient on the context. Additionally, the func does not respect Go's native context... Context police 🚓 is discontented.
I don't see any other workaround besides copying the code into our fork and waiting until it's fixed.
Possible solutions
Call Serve in a goroutine, without waiting 5 seconds for an error. Additionally, you could pass a logger (through Context?) to log errors, if any. Also, Go's native context should still be respected, and the server routine should be stopped once it is canceled. Otherwise, the server runs forever and the goroutine is leaked.
Version
Latest master
Steps to Reproduce
Just run grpc.StartGRPCServer