[query] Avoid errors when closing shared listener #5559
Conversation
cmd/query/app/server.go (outdated)
s.grpcServer.Stop()

// Log and close HTTP server
s.logger.Info("Closing HTTP server")
errs = append(errs, s.httpServer.Close())
It feels to me that here is the actual issue. When we're running on a single port with cmux, we create a single Listener. But the way at least the HTTP server works is that when you pass it a listener it assumes it owns it and will close it when stopping the server. Since we have several servers, they probably all try to close the listener, but in random order due to goroutine scheduling. I suspect that if we add sleeps in this function after each call to Close (and maybe change the order of closing servers), we might be able to reproduce the error in the failed test.
If my assumption is correct, I still don't know how to actually fix the issue, since we don't control how the servers deal with the Listener. One thing we could do is to explicitly check for the "conn already closed" error and not return it (but only when running on a single port).
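As a minimal sketch of that error-filtering idea (not the fix that eventually landed in this PR): since Go 1.16 the "use of closed network connection" error can be matched with errors.Is(err, net.ErrClosed), so a hypothetical helper could drop those errors only in the single-port case.

package app

import (
	"errors"
	"net"
	"net/http"
)

// filterClosedErr drops the errors that are expected when several servers
// share a single cmux listener and race to close it. Hypothetical helper,
// shown only to illustrate the idea discussed above.
func filterClosedErr(err error, sharedPort bool) error {
	if err == nil || !sharedPort {
		return err
	}
	// net.ErrClosed matches "use of closed network connection" (Go 1.16+);
	// http.ErrServerClosed is returned after http.Server.Close/Shutdown.
	if errors.Is(err, net.ErrClosed) || errors.Is(err, http.ErrServerClosed) {
		return nil
	}
	return err
}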
Thank you for the detailed explanation. I will add sleeps after each close call to try and reproduce the error. Additionally, I will look into checking for the "conn already closed" error and handle it appropriately.
@yurishkuro I have implemented the requested changes and enhanced the logging in the Close method. Despite running the TestServerSinglePort test multiple times with extended sleep intervals, the test is consistently passing, and I am unable to reproduce the original error.
…enhanced logging

This commit addresses the persistent timeout issue in the TestServerSinglePort test by making the following improvements:
1. Dynamic Port Assignment:
- Used ":0" for automatic port selection in the HTTP and gRPC host ports.
2. Enhanced Logging:
- Added detailed logging in the TestServerSinglePort test to log the assigned ports.
- Improved logging in the Start and Close methods to provide better insights into server operations.
3. Error Handling:
- Ensured graceful handling and logging of errors during server start and stop.
These changes aim to make the TestServerSinglePort test more robust and provide better diagnostics in case of failures.

Signed-off-by: Shivam Verma <vermaaatul07@gmail.com>
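For context on the dynamic port assignment mentioned in the commit message, here is a minimal standalone sketch (not the PR's actual test code) of binding to ":0" and reading back the OS-assigned port:

package main

import (
	"fmt"
	"net"
)

func main() {
	// Binding to ":0" lets the OS pick any free port, which avoids
	// collisions when tests run in parallel.
	lis, err := net.Listen("tcp", ":0")
	if err != nil {
		panic(err)
	}
	defer lis.Close()

	// The assigned port can be recovered from the listener's address.
	port := lis.Addr().(*net.TCPAddr).Port
	fmt.Println("assigned port:", port)
}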
@@ -274,7 +274,7 @@ func (s *Server) initListener() (cmux.CMux, error) {
func (s *Server) Start() error {
you should add this option to the logger in the test:
flagsSvc.Logger = zaptest.NewLogger(t, zaptest.WrapOptions(zap.AddCaller()))
that way it prints the source line.
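For reference, a self-contained sketch of that logger setup in a test (the surrounding flagsSvc field from the Jaeger test is omitted; everything shown is standard zap/zaptest API):

package app

import (
	"testing"

	"go.uber.org/zap"
	"go.uber.org/zap/zaptest"
)

func TestLoggerWithCaller(t *testing.T) {
	// zaptest routes log output through t.Log; zap.AddCaller annotates each
	// entry with the file:line of the call site (e.g. app/server.go:300),
	// which is what makes the log output below traceable to specific statements.
	logger := zaptest.NewLogger(t, zaptest.WrapOptions(zap.AddCaller()))
	logger.Info("HTTP server stopped", zap.Int("port", 16686))
}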
I changed the timeout to 2s and look what the logs say:
logger.go:146: 2024-06-12T15:13:20.232-0400 INFO app/static_handler.go:109 Using UI configuration {"path": ""}
logger.go:146: 2024-06-12T15:13:20.234-0400 INFO app/server.go:256 Query server started {"port": 16686, "addr": ":16686"}
logger.go:146: 2024-06-12T15:13:20.234-0400 INFO app/server.go:300 Starting HTTP server {"port": 16686, "addr": ":16686"}
logger.go:146: 2024-06-12T15:13:20.234-0400 INFO app/server.go:332 Starting CMUX server {"port": 16686, "addr": ":16686"}
logger.go:146: 2024-06-12T15:13:20.234-0400 INFO app/server.go:318 Starting GRPC server {"port": 16686, "addr": ":16686"}
logger.go:146: 2024-06-12T15:13:20.283-0400 INFO app/server.go:358 Stopping gRPC server
logger.go:146: 2024-06-12T15:13:20.284-0400 INFO app/server.go:310 HTTP server stopped {"port": 16686, "addr": ":16686"}
logger.go:146: 2024-06-12T15:13:20.284-0400 INFO app/server.go:323 GRPC server stopped {"port": 16686, "addr": ":16686"}
logger.go:146: 2024-06-12T15:13:22.285-0400 INFO app/server.go:362 Closing HTTP server
logger.go:146: 2024-06-12T15:13:24.286-0400 INFO app/server.go:369 Closing CMux server
logger.go:146: 2024-06-12T15:13:24.286-0400 INFO app/server.go:375 Server stopped
Note that after "Stopping gRPC server" we're seeing "HTTP server stopped", even before "Closing HTTP server". When I add a log line for the CMUX server goroutine, it also logs that the CMUX server is stopped before we come to trying to stop HTTP. So it seems that when we're running on a single port it's simply sufficient to close the gRPC server, which closes the underlying listener, and all other servers exit automatically (we do already catch ErrServerClosed errors and don't log them).
I pushed a fix.
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## main #5559 +/- ##
=======================================
Coverage 96.20% 96.20%
=======================================
Files 327 327
Lines 16006 16013 +7
=======================================
+ Hits 15398 15405 +7
Misses 432 432
Partials 176 176
Flags with carried forward coverage won't be shown.
☔ View full report in Codecov by Sentry.
Not sure if it will fix the issue, but at least it provides additional logging should it happen again.
I will keep this issue in mind and will revisit it once I gain more understanding of how to resolve it. Thank you so much for your help and guidance.
Description
This PR aims to address the intermittent timeout issue in the TestServerSinglePort test by adding enhanced logging to the Start and Close methods of the server. These additional logs provide better visibility into the server's behavior during the test, which should help in diagnosing the root cause of the issue.
Changes
Enhanced Logging:
- Start method: added logging for listener initialization failures.
- Start method: added logging of the initialized HTTP and gRPC ports.
- Close method: added logging to indicate an attempt to close the server and to confirm when the server has stopped.
These changes provide better diagnostics and insights into server operations, which should help in identifying and resolving the flaky test issue if it occurs again.
Related Issue
Closes #5519
Checklist
- I have read CONTRIBUTING_GUIDELINES
- I have run lint and test steps successfully
  - for jaeger: make lint test
  - for jaeger-ui: yarn lint and yarn test