Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[query] Avoid errors when closing shared listener #5559

Merged
merged 9 commits into from
Jun 12, 2024
Merged

[query] Avoid errors when closing shared listener #5559

merged 9 commits into from
Jun 12, 2024

Conversation

vermaaatul07
Copy link
Contributor

@vermaaatul07 vermaaatul07 commented Jun 11, 2024

Description

This PR aims to address the intermittent timeout issue in the TestServerSinglePort test by adding enhanced logging to the Start and Close methods of the server. These additional logs provide better visibility into the server's behavior during the test, which should help in diagnosing the root cause of the issue.

Changes

Enhanced Logging:

  • Added error logging in the Start method for listener initialization failures.
  • Added info logging in the Start method to log the initialized HTTP and gRPC ports.
  • Added info logging in the Close method to indicate an attempt to close the server and confirm when the server has stopped.

These changes provide better diagnostics and insights into server operations, which should help in identifying and resolving the flaky test issue if it occurs again.

Related Issue

Closes #5519

Checklist

  • I have read CONTRIBUTING_GUIDELINES
  • I have signed all commits
  • I have added unit tests for the new functionality
  • I have run lint and test steps successfully
    • for jaeger: make lint test
    • for jaeger-ui: yarn lint and yarn test

@vermaaatul07 vermaaatul07 requested a review from a team as a code owner June 11, 2024 12:37
@vermaaatul07 vermaaatul07 requested a review from pavolloffay June 11, 2024 12:37
@vermaaatul07 vermaaatul07 marked this pull request as draft June 11, 2024 12:49
cmd/query/app/server.go Outdated Show resolved Hide resolved
cmd/query/app/server.go Outdated Show resolved Hide resolved
cmd/query/app/server_test.go Outdated Show resolved Hide resolved
@vermaaatul07 vermaaatul07 changed the title Fix flaky TestServerSinglePort by adding dynamic port assignment and enhanced logging Add Enhanced Logging to Diagnose TestServerSinglePort Issues Jun 11, 2024
cmd/query/app/server.go Outdated Show resolved Hide resolved
cmd/query/app/server.go Outdated Show resolved Hide resolved
cmd/query/app/server.go Outdated Show resolved Hide resolved
cmd/query/app/server.go Outdated Show resolved Hide resolved
cmd/query/app/server.go Outdated Show resolved Hide resolved
cmd/query/app/server.go Outdated Show resolved Hide resolved
cmd/query/app/server.go Outdated Show resolved Hide resolved
cmd/query/app/server.go Outdated Show resolved Hide resolved
cmd/query/app/server.go Outdated Show resolved Hide resolved
cmd/query/app/server.go Outdated Show resolved Hide resolved
s.grpcServer.Stop()

// Log and close HTTP server
s.logger.Info("Closing HTTP server")
errs = append(errs, s.httpServer.Close())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if feels to me that here is the actual issue. When we're running on a single port with cmux, we create a single Listener. But the way at least HTTP server works is when you pass it a listener it assumes it owns it and will close it when stopping the server. Since we have several servers they probably all try to close the listener, but in random order due to goroutine scheduling. I suspect if we add sleeps in this function after each call to Close (and maybe change the order of closing servers) then we might be able to reproduce the error in the failed test.

If my assumption is correct, I still don't know how to actually fix the issue, since we don't control how servers deal with Listener. One thing we could do is to explicitly check for "conn already closed" error and not return it (but only when running on a single port).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the detailed explanation. I will add sleeps after each close call to try and reproduce the error. Additionally, I will look into checking for the "conn already closed" error and handle it appropriately.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yurishkuro I have implemented the requested changes and enhanced the logging in the Close method. Despite running the TestServerSinglePort test multiple times with extended sleep intervals, the test is consistently passing, and I am unable to reproduce the original error.

…enhanced logging

This commit addresses the persistent timeout issue in the TestServerSinglePort test by making the following improvements:

1. Dynamic Port Assignment:
   - Used ":0" for automatic port selection in the HTTP and gRPC host ports.

2. Enhanced Logging:
   - Added detailed logging in the TestServerSinglePort test to log the assigned ports.
   - Improved logging in the Start and Close methods to provide better insights into server operations.

3. Error Handling:
   - Ensured graceful handling and logging of errors during server start and stop.

These changes aim to make the TestServerSinglePort test more robust and provide better diagnostics in case of failures.

Signed-off-by: Shivam Verma <vermaaatul07@gmail.com>
Signed-off-by: Shivam Verma <vermaaatul07@gmail.com>
Signed-off-by: Shivam Verma <vermaaatul07@gmail.com>
Signed-off-by: Shivam Verma <vermaaatul07@gmail.com>
Signed-off-by: Shivam Verma <vermaaatul07@gmail.com>
Signed-off-by: Shivam Verma <vermaaatul07@gmail.com>
cmd/query/app/server.go Outdated Show resolved Hide resolved
@@ -274,7 +274,7 @@ func (s *Server) initListener() (cmux.CMux, error) {
func (s *Server) Start() error {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you should add this option to the logger in the test:

	flagsSvc.Logger = zaptest.NewLogger(t, zaptest.WrapOptions(zap.AddCaller()))

that way it prints the source line.

I changed the timeout to 2s and look what the logs say:

    logger.go:146: 2024-06-12T15:13:20.232-0400 INFO    app/static_handler.go:109       Using UI configuration  {"path": ""}
    logger.go:146: 2024-06-12T15:13:20.234-0400 INFO    app/server.go:256       Query server started    {"port": 16686, "addr": ":16686"}
    logger.go:146: 2024-06-12T15:13:20.234-0400 INFO    app/server.go:300       Starting HTTP server    {"port": 16686, "addr": ":16686"}
    logger.go:146: 2024-06-12T15:13:20.234-0400 INFO    app/server.go:332       Starting CMUX server    {"port": 16686, "addr": ":16686"}
    logger.go:146: 2024-06-12T15:13:20.234-0400 INFO    app/server.go:318       Starting GRPC server    {"port": 16686, "addr": ":16686"}
    logger.go:146: 2024-06-12T15:13:20.283-0400 INFO    app/server.go:358       Stopping gRPC server
    logger.go:146: 2024-06-12T15:13:20.284-0400 INFO    app/server.go:310       HTTP server stopped     {"port": 16686, "addr": ":16686"}
    logger.go:146: 2024-06-12T15:13:20.284-0400 INFO    app/server.go:323       GRPC server stopped     {"port": 16686, "addr": ":16686"}
    logger.go:146: 2024-06-12T15:13:22.285-0400 INFO    app/server.go:362       Closing HTTP server
    logger.go:146: 2024-06-12T15:13:24.286-0400 INFO    app/server.go:369       Closing CMux server
    logger.go:146: 2024-06-12T15:13:24.286-0400 INFO    app/server.go:375       Server stopped

note that after Stopping gRPC server we're seeing HTTP server stopped, even before Closing HTTP server. When I add a log line for CMUX server goroutine, it also logs that CMUX server is stopped before we come to trying to stop HTTP. So it seems when we're running on a single port it's simply sufficient to close gRPC server, which closes the underlying listenter, and all other servers exit automatically (we do already catch ErrServerClosed errors and don't log them).

yurishkuro and others added 3 commits June 12, 2024 15:47
Signed-off-by: Yuri Shkuro <github@ysh.us>
Signed-off-by: Yuri Shkuro <github@ysh.us>
@yurishkuro
Copy link
Member

I pushed a fix.

@yurishkuro yurishkuro changed the title Add Enhanced Logging to Diagnose TestServerSinglePort Issues [query] Avoid errors when closing shared listener Jun 12, 2024
Copy link

codecov bot commented Jun 12, 2024

Codecov Report

Attention: Patch coverage is 76.92308% with 3 lines in your changes missing coverage. Please review.

Project coverage is 96.20%. Comparing base (60f9427) to head (8be6e17).

Files Patch % Lines
cmd/query/app/server.go 76.92% 1 Missing and 2 partials ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##             main    #5559   +/-   ##
=======================================
  Coverage   96.20%   96.20%           
=======================================
  Files         327      327           
  Lines       16006    16013    +7     
=======================================
+ Hits        15398    15405    +7     
  Misses        432      432           
  Partials      176      176           
Flag Coverage Δ
badger_v1 8.05% <ø> (ø)
badger_v2 1.93% <ø> (ø)
cassandra-3.x-v1 16.45% <ø> (ø)
cassandra-3.x-v2 1.85% <ø> (ø)
cassandra-4.x-v1 16.45% <ø> (ø)
cassandra-4.x-v2 1.85% <ø> (ø)
elasticsearch-7.x-v1 18.89% <ø> (+0.01%) ⬆️
elasticsearch-8.x-v1 19.09% <ø> (+0.01%) ⬆️
elasticsearch-8.x-v2 19.09% <ø> (ø)
grpc_v1 9.48% <ø> (ø)
grpc_v2 7.53% <ø> (+0.01%) ⬆️
kafka 9.77% <ø> (ø)
opensearch-1.x-v1 18.94% <ø> (+0.01%) ⬆️
opensearch-2.x-v1 18.94% <ø> (+0.01%) ⬆️
opensearch-2.x-v2 18.93% <ø> (ø)
unittests 94.07% <76.92%> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@yurishkuro yurishkuro marked this pull request as ready for review June 12, 2024 19:57
@yurishkuro yurishkuro merged commit 11cb50d into jaegertracing:main Jun 12, 2024
40 of 42 checks passed
@yurishkuro
Copy link
Member

not sure if it will fix the issue, but at least it provides additional logging should it happen again.

@vermaaatul07
Copy link
Contributor Author

not sure if it will fix the issue, but at least it provides additional logging should it happen again.

I will keep this issue in mind and will revisit it once I gain more understanding of how to resolve it. Thank you so much for your help and guidance.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Flaky unit test TestServerSinglePort
2 participants