Query frontend drops requests when rolling #2672

Closed
joe-elliott opened this issue Jul 18, 2023 · 9 comments
Labels
keepalive (exempts Issues / PRs from stale workflow), operations, type/bug (something isn't working)

Comments

@joe-elliott
Member

When the query frontend is rolled, it drops queries, resulting in Grafana errors. There are a lot of details to check here. Does the query frontend correctly drain queries? Do we have our k8s config correct to give it time to finish? Do we drop readiness while shutting down so queries are no longer routed to a pod that is shutting down? etc.

Make this as seamless as possible
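
For reference, a minimal sketch of the drop-readiness-then-drain pattern these questions point at, assuming a plain Go net/http server; the /ready path, port, and timings below are illustrative and not necessarily how Tempo wires this up:

package main

import (
	"context"
	"net/http"
	"os"
	"os/signal"
	"sync/atomic"
	"syscall"
	"time"
)

func main() {
	var shuttingDown atomic.Bool

	mux := http.NewServeMux()
	// readiness endpoint: starts returning 503 once shutdown begins so k8s
	// stops routing new queries to this pod
	mux.HandleFunc("/ready", func(w http.ResponseWriter, r *http.Request) {
		if shuttingDown.Load() {
			http.Error(w, "shutting down", http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusOK)
	})

	srv := &http.Server{Addr: ":3200", Handler: mux} // port is illustrative
	go srv.ListenAndServe()

	// wait for SIGTERM from the kubelet
	sigCh := make(chan os.Signal, 1)
	signal.Notify(sigCh, syscall.SIGTERM)
	<-sigCh

	// 1. fail readiness so endpoints drop this pod
	shuttingDown.Store(true)
	// 2. give endpoint controllers / load balancers time to notice
	time.Sleep(10 * time.Second)
	// 3. drain in-flight requests; terminationGracePeriodSeconds must be
	//    long enough to cover steps 2 and 3
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()
	srv.Shutdown(ctx)
}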

ie-pham self-assigned this Aug 28, 2023
@ie-pham
Collaborator

ie-pham commented Sep 6, 2023

Reproduced in dev-01 with only one query-frontend pod. Executed the query, then immediately deleted the pod.
[Screenshot: 2023-09-06 at 9:56 AM]

Logs

level=warn ts=2023-09-06T14:55:48.021533264Z caller=grpc_logging.go:78 method=/frontend.Frontend/Process duration=1.100965261s err="queue is stopped" msg=gRPC

level=error ts=2023-09-06T14:55:47.9870941Z caller=searchsharding.go:181 msg="error executing sharded query" url="/querier/tempo/api/search?blockID=76250f04-6c78-4be8-9e94-215c745822a4&dataEncoding=&dc=%5B%7B%22name%22%3A%22db.statement%22%7D%2C%7B%22name%22%3A%22component%22%7D%2C%7B%22name%22%3A%22http.user_agent%22%7D%2C%7B%22name%22%3A%22otel.library.name%22%7D%2C%7B%22name%22%3A%22db.connection_string%22%7D%2C%7B%22name%22%3A%22organization%22%7D%2C%7B%22name%22%3A%22peer.address%22%7D%2C%7B%22name%22%3A%22net.peer.name%22%7D%2C%7B%22name%22%3A%22blockID%22%7D%2C%7B%22name%22%3A%22db.name%22%7D%2C%7B%22scope%22%3A1%2C%22name%22%3A%22host.name%22%7D%2C%7B%22scope%22%3A1%2C%22name%22%3A%22opencensus.exporterversion%22%7D%2C%7B%22scope%22%3A1%2C%22name%22%3A%22client-uuid%22%7D%2C%7B%22scope%22%3A1%2C%22name%22%3A%22ip%22%7D%2C%7B%22scope%22%3A1%2C%22name%22%3A%22database%22%7D%2C%7B%22scope%22%3A1%2C%22name%22%3A%22os.description%22%7D%2C%7B%22scope%22%3A1%2C%22name%22%3A%22process.runtime.description%22%7D%2C%7B%22scope%22%3A1%2C%22name%22%3A%22container.id%22%7D%2C%7B%22scope%22%3A1%2C%22name%22%3A%22slug%22%7D%2C%7B%22scope%22%3A1%2C%22name%22%3A%22module.path%22%7D%5D&encoding=none&end=1694012136&footerSize=16800&indexPageSize=0&limit=20&pagesToSearch=2&q=%7B+.bloom+%3D+%22foo%22+%7D&size=76407693&start=1693407336&startPage=0&totalRecords=1&version=vParquet3" err="queue is stopped"

@ie-pham
Collaborator

ie-pham commented Sep 6, 2023

[Screenshot: 2023-09-06 at 11:47 AM]

@ie-pham
Collaborator

ie-pham commented Sep 6, 2023

Response: "join iterator peek failed: join iterator peek failed: context deadline exceeded\n\

@ie-pham
Collaborator

ie-pham commented Sep 20, 2023

First theory:

  1. When the query-frontend is rolled out, connections to the queriers are severed:
level=error ts=2023-09-20T15:24:26.387841995Z caller=frontend_processor.go:63 msg="error contacting frontend" address=10.132.92.39:9095 err="rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial tcp 10.132.92.39:9095: connect: connection refused\""
  2. Because all queriers are disconnected, the frontend queue is stopped and all on-going requests get abruptly errored (a toy model of this is sketched below the code snippet):
	for q.queues.len() > 0 && q.connectedQuerierWorkers.Load() > 0 {
		q.cond.Wait(context.Background())
	}
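
A toy model of that loop (illustrative only, not Tempo's actual queue code): once every querier worker disconnects, the drain loop exits immediately and whatever is still queued is failed with "queue is stopped".

package main

import (
	"fmt"
	"sync"
	"time"
)

type queue struct {
	mtx              sync.Mutex
	cond             *sync.Cond
	pending          int // queued requests still waiting for a querier
	connectedWorkers int // querier workers currently connected
}

func (q *queue) stopping() {
	q.mtx.Lock()
	defer q.mtx.Unlock()
	// same shape as the loop above: stop waiting as soon as the queue is
	// empty OR no connected workers remain to drain it
	for q.pending > 0 && q.connectedWorkers > 0 {
		q.cond.Wait()
	}
	fmt.Printf("queue stopped with %d request(s) still pending\n", q.pending)
}

func main() {
	q := &queue{pending: 5, connectedWorkers: 3}
	q.cond = sync.NewCond(&q.mtx)

	// simulate the rollout: every querier disconnects before the queue drains
	go func() {
		time.Sleep(100 * time.Millisecond)
		q.mtx.Lock()
		q.connectedWorkers = 0
		q.mtx.Unlock()
		q.cond.Broadcast()
	}()

	q.stopping() // prints: queue stopped with 5 request(s) still pending
}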

Contributor

This issue has been automatically marked as stale because it has not had any activity in the past 60 days.
The next time this stale check runs, the stale label will be removed if there is new activity. The issue will be closed after 15 days if there is no new activity.
Please apply keepalive label to exempt this Issue.

github-actions bot added the stale label Dec 17, 2023
github-actions bot closed this as not planned Jan 2, 2024
joe-elliott added the keepalive and type/bug labels and removed the stale label Jan 2, 2024
joe-elliott reopened this Jan 2, 2024
@knylander-grafana
Contributor

knylander-grafana commented Jan 18, 2024

@joe-elliott Will we need docs of any type for this? Customer warnings or something?

@joe-elliott
Member Author

Probably not? I suppose it depends on the fix

@knylander-grafana
Contributor

Confirmed with Joe that we most likely won't need docs for this; however, it does depend on the fix.

@joe-elliott
Member Author

This PR will likely allow configurations to correct this issue:

#3395

If a valid configuration cannot be found, we will reopen.
