Query frontend drops requests when rolling #2672

Closed
joe-elliott opened this issue Jul 18, 2023 · 9 comments
Labels
keepalive (exempts Issues / PRs from stale workflow), operations, type/bug (something isn't working)

Comments

@joe-elliott
Member

When the query frontend is rolled, it drops queries, resulting in Grafana errors. There are a lot of details to check here. Does the query frontend correctly drain queries? Do we have our k8s config correct to give it time to finish? Do we drop readiness while shutting down so queries are no longer routed to a pod that is shutting down? etc.

Make this as seamless as possible
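
For reference, a minimal sketch of the drop-readiness-then-drain pattern these questions point at, assuming a plain Go net/http server; the /ready path, port, and timings below are illustrative and not necessarily how Tempo wires this up:

package main

import (
	"context"
	"net/http"
	"os"
	"os/signal"
	"sync/atomic"
	"syscall"
	"time"
)

func main() {
	var shuttingDown atomic.Bool

	mux := http.NewServeMux()
	// readiness endpoint: starts returning 503 once shutdown begins so k8s
	// stops routing new queries to this pod
	mux.HandleFunc("/ready", func(w http.ResponseWriter, r *http.Request) {
		if shuttingDown.Load() {
			http.Error(w, "shutting down", http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusOK)
	})

	srv := &http.Server{Addr: ":3200", Handler: mux} // port is illustrative
	go srv.ListenAndServe()

	// wait for SIGTERM from the kubelet
	sigCh := make(chan os.Signal, 1)
	signal.Notify(sigCh, syscall.SIGTERM)
	<-sigCh

	// 1. fail readiness so endpoints drop this pod
	shuttingDown.Store(true)
	// 2. give endpoint controllers / load balancers time to notice
	time.Sleep(10 * time.Second)
	// 3. drain in-flight requests; terminationGracePeriodSeconds must be
	//    long enough to cover steps 2 and 3
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()
	srv.Shutdown(ctx)
}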

ie-pham self-assigned this Aug 28, 2023
@ie-pham
Collaborator

ie-pham commented Sep 6, 2023

Reproduced in dev-01 with only one query-frontend pod. Executed the query, then immediately deleted the pod.
[Screenshot: 2023-09-06 at 9:56 AM]

Logs

level=warn ts=2023-09-06T14:55:48.021533264Z caller=grpc_logging.go:78 method=/frontend.Frontend/Process duration=1.100965261s err="queue is stopped" msg=gRPC

level=error ts=2023-09-06T14:55:47.9870941Z caller=searchsharding.go:181 msg="error executing sharded query" url="/querier/tempo/api/search?blockID=76250f04-6c78-4be8-9e94-215c745822a4&dataEncoding=&dc=%5B%7B%22name%22%3A%22db.statement%22%7D%2C%7B%22name%22%3A%22component%22%7D%2C%7B%22name%22%3A%22http.user_agent%22%7D%2C%7B%22name%22%3A%22otel.library.name%22%7D%2C%7B%22name%22%3A%22db.connection_string%22%7D%2C%7B%22name%22%3A%22organization%22%7D%2C%7B%22name%22%3A%22peer.address%22%7D%2C%7B%22name%22%3A%22net.peer.name%22%7D%2C%7B%22name%22%3A%22blockID%22%7D%2C%7B%22name%22%3A%22db.name%22%7D%2C%7B%22scope%22%3A1%2C%22name%22%3A%22host.name%22%7D%2C%7B%22scope%22%3A1%2C%22name%22%3A%22opencensus.exporterversion%22%7D%2C%7B%22scope%22%3A1%2C%22name%22%3A%22client-uuid%22%7D%2C%7B%22scope%22%3A1%2C%22name%22%3A%22ip%22%7D%2C%7B%22scope%22%3A1%2C%22name%22%3A%22database%22%7D%2C%7B%22scope%22%3A1%2C%22name%22%3A%22os.description%22%7D%2C%7B%22scope%22%3A1%2C%22name%22%3A%22process.runtime.description%22%7D%2C%7B%22scope%22%3A1%2C%22name%22%3A%22container.id%22%7D%2C%7B%22scope%22%3A1%2C%22name%22%3A%22slug%22%7D%2C%7B%22scope%22%3A1%2C%22name%22%3A%22module.path%22%7D%5D&encoding=none&end=1694012136&footerSize=16800&indexPageSize=0&limit=20&pagesToSearch=2&q=%7B+.bloom+%3D+%22foo%22+%7D&size=76407693&start=1693407336&startPage=0&totalRecords=1&version=vParquet3" err="queue is stopped"

@ie-pham
Collaborator

ie-pham commented Sep 6, 2023

[Screenshot: 2023-09-06 at 11:47 AM]

@ie-pham
Collaborator

ie-pham commented Sep 6, 2023

Response: "join iterator peek failed: join iterator peek failed: context deadline exceeded\n\

@ie-pham
Collaborator

ie-pham commented Sep 20, 2023

First theory:

  1. When the query-frontend is rolled out, connections to the queriers are severed:
level=error ts=2023-09-20T15:24:26.387841995Z caller=frontend_processor.go:63 msg="error contacting frontend" address=10.132.92.39:9095 err="rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial tcp 10.132.92.39:9095: connect: connection refused\""
  2. Because all queriers are disconnected, the frontend queue is stopped and all on-going requests get abruptly errored (a toy model of this is sketched below the code snippet):
	for q.queues.len() > 0 && q.connectedQuerierWorkers.Load() > 0 {
		q.cond.Wait(context.Background())
	}
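
A toy model of that loop (illustrative only, not Tempo's actual queue code): once every querier worker disconnects, the drain loop exits immediately and whatever is still queued is failed with "queue is stopped".

package main

import (
	"fmt"
	"sync"
	"time"
)

type queue struct {
	mtx              sync.Mutex
	cond             *sync.Cond
	pending          int // queued requests still waiting for a querier
	connectedWorkers int // querier workers currently connected
}

func (q *queue) stopping() {
	q.mtx.Lock()
	defer q.mtx.Unlock()
	// same shape as the loop above: stop waiting as soon as the queue is
	// empty OR no connected workers remain to drain it
	for q.pending > 0 && q.connectedWorkers > 0 {
		q.cond.Wait()
	}
	fmt.Printf("queue stopped with %d request(s) still pending\n", q.pending)
}

func main() {
	q := &queue{pending: 5, connectedWorkers: 3}
	q.cond = sync.NewCond(&q.mtx)

	// simulate the rollout: every querier disconnects before the queue drains
	go func() {
		time.Sleep(100 * time.Millisecond)
		q.mtx.Lock()
		q.connectedWorkers = 0
		q.mtx.Unlock()
		q.cond.Broadcast()
	}()

	q.stopping() // prints: queue stopped with 5 request(s) still pending
}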

Contributor

This issue has been automatically marked as stale because it has not had any activity in the past 60 days.
The next time this stale check runs, the stale label will be removed if there is new activity. The issue will be closed after 15 days if there is no new activity.
Please apply keepalive label to exempt this Issue.

github-actions bot added the stale label Dec 17, 2023
github-actions bot closed this as not planned Jan 2, 2024
joe-elliott added the keepalive and type/bug labels and removed the stale label Jan 2, 2024
joe-elliott reopened this Jan 2, 2024
@knylander-grafana
Contributor

knylander-grafana commented Jan 18, 2024

@joe-elliott Will we need docs of any type for this? Customer warnings or something?

@joe-elliott
Member Author

Probably not? I suppose it depends on the fix

@knylander-grafana
Contributor

Confirmed with Joe that we most likely won't need docs for this; however, it does depend on the fix.

@joe-elliott
Member Author

This PR will likely allow configurations to correct this issue:

#3395

If a valid configuration cannot be found, we will reopen.
