QFE: new middleware to force query statistics collection #7854

Merged

Conversation

pedro-stanaka
Contributor

  • I added CHANGELOG entry for this change.
  • Change is not relevant to the end user.

Changes

  • Adds a new middleware for both instant and range queries that can force stats collection.
  • Adds the new query stats to the logs.
  • Updates the protobuf for the stats package.

Verification

@pedro-stanaka pedro-stanaka force-pushed the feat/qfe-force-stats-collection branch 4 times, most recently from 6d956d0 to e441d55 Compare October 23, 2024 08:40
@pedro-stanaka pedro-stanaka marked this pull request as ready for review October 23, 2024 08:40
fpetkovski
fpetkovski previously approved these changes Oct 23, 2024
internal/cortex/querier/queryrange/stats_middleware.go (outdated, resolved)
internal/cortex/querier/queryrange/stats_middleware.go (outdated, resolved)
@yeya24
Contributor

yeya24 commented Oct 24, 2024

Can you share an example log line? Just want to see what it looks like.

@pedro-stanaka
Contributor Author

pedro-stanaka commented Oct 24, 2024

Can you share an example log line? Just want to see what it looks like.

Formatted for presentation:

{
    "caller": "handler.go:217",
    "grafana_dashboard_uid": "jmwtDwa4k",
    "grafana_panel_id": "120",
    "host": "localhost:8084",
    "level": "info",
    "method": "POST",
    "msg": "slow query detected",
    "org_id": "anonymous",
    "param_analyze": "true",
    "param_dedup": "true",
    "param_end": "1729183020",
    "param_engine": "thanos",
    "param_max_source_resolution": "0s",
    "param_partial_response": "false",
    "param_query": "sum(rate(nginx_ingress_controller_requests{}[5m])) by (ingress)",
    "param_start": "1729172212",
    "param_step": "360",
    "path": "/api/v1/query_range",
    "peak_samples": 409,
    "query_range_hours": 3,
    "query_range_human": "3h0m8s",
    "remote_addr": "[::1]:55935",
    "remote_user": "",
    "time_taken": "3.518708209s",
    "total_samples_loaded": 720187,
    "trace_id": "a990d4d2c94a5530b48d94d0e5e55664",
    "ts": "2024-10-24T20:11:58.16198Z"
}

@pedro-stanaka pedro-stanaka force-pushed the feat/qfe-force-stats-collection branch from 6317abf to 551f368 Compare October 29, 2024 08:04
@pedro-stanaka
Contributor Author

pedro-stanaka commented Oct 29, 2024

@yeya24 would you mind reviewing the PR, please? We use a similar log line in our production environment and would like to upstream it. It provides a single place to look for heavy queries and to investigate issues coming from the read path.
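
To illustrate the "single place to look for heavy queries" idea, here is a minimal sketch of how a consumer might decode one of these log lines and flag heavy queries. It assumes the query frontend is emitting JSON-formatted logs with the field names shown in the example above. The `slowQueryLog` type, the `parseSlowQueryLog` and `isHeavy` helpers, and the thresholds are hypothetical and are not part of this PR.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// slowQueryLog mirrors a subset of the fields from the example log line.
// The struct itself is hypothetical; only the JSON keys come from the example.
type slowQueryLog struct {
	Query              string `json:"param_query"`
	Path               string `json:"path"`
	TimeTaken          string `json:"time_taken"`
	PeakSamples        int64  `json:"peak_samples"`
	TotalSamplesLoaded int64  `json:"total_samples_loaded"`
}

// parseSlowQueryLog decodes a single JSON-formatted log line.
func parseSlowQueryLog(line []byte) (slowQueryLog, error) {
	var e slowQueryLog
	err := json.Unmarshal(line, &e)
	return e, err
}

// isHeavy reports whether the entry exceeds either threshold.
// Both thresholds are illustrative assumptions, not Thanos defaults.
func isHeavy(e slowQueryLog, maxSamples int64, maxDuration time.Duration) (bool, error) {
	d, err := time.ParseDuration(e.TimeTaken)
	if err != nil {
		return false, fmt.Errorf("parse time_taken %q: %w", e.TimeTaken, err)
	}
	return e.TotalSamplesLoaded > maxSamples || d > maxDuration, nil
}

func main() {
	line := []byte(`{"param_query":"sum(rate(nginx_ingress_controller_requests{}[5m])) by (ingress)",` +
		`"path":"/api/v1/query_range","peak_samples":409,` +
		`"time_taken":"3.518708209s","total_samples_loaded":720187}`)

	entry, err := parseSlowQueryLog(line)
	if err != nil {
		panic(err)
	}
	heavy, err := isHeavy(entry, 500000, 5*time.Second)
	if err != nil {
		panic(err)
	}
	fmt.Printf("query=%q samples=%d heavy=%t\n", entry.Query, entry.TotalSamplesLoaded, heavy)
}
```

Because `time_taken` is logged as a Go duration string, `time.ParseDuration` can read it directly; the sample counts are plain JSON numbers.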

@pedro-stanaka pedro-stanaka force-pushed the feat/qfe-force-stats-collection branch from 551f368 to 34ae446 Compare October 31, 2024 16:47
Signed-off-by: Pedro Tanaka <pedro.tanaka@shopify.com>
Signed-off-by: Pedro Tanaka <pedro.tanaka@shopify.com>
Signed-off-by: Pedro Tanaka <pedro.tanaka@shopify.com>
@pedro-stanaka pedro-stanaka force-pushed the feat/qfe-force-stats-collection branch from 34ae446 to 6b13da0 Compare October 31, 2024 16:49
Signed-off-by: Pedro Tanaka <pedro.tanaka@shopify.com>
@pedro-stanaka pedro-stanaka force-pushed the feat/qfe-force-stats-collection branch from 6b13da0 to b69c9bb Compare October 31, 2024 16:56
yeya24
yeya24 previously approved these changes Nov 3, 2024
Contributor

@yeya24 yeya24 left a comment

Thanks. I think it makes sense to collect the stats and log them. Just some questions about the implementation detail

func (s statsMiddleware) Do(ctx context.Context, r Request) (Response, error) {
	if s.forceStats {
		r = r.WithStats("all")
	}
Contributor

I don't fully understand why we need a middleware here. Why can't we do that in the query frontend's ServeHTTP function?

Contributor Author

Yep, you are right. That is much simpler, will do that.

Contributor Author

@pedro-stanaka pedro-stanaka Nov 4, 2024

As I started to refactor this, I remembered why we have to keep this in a middleware. The QFE HTTP handler only has access to the raw http.Request, which means that if we want to change the value of the stats field, we would have to parse the request twice.

The same goes for the statistics: the response (which can be quite expensive to parse) is only available in its decoded state in the tripperware; the HTTP handler just writes back the response from the tripperware.
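
A minimal, self-contained sketch of the point above: inside the middleware chain the request is already decoded, so the stats field can be overridden without re-parsing, and the decoded response can be inspected before it is serialized. Only `statsMiddleware`, `forceStats`, `Do` and `WithStats("all")` come from the diff under review. The simplified `Request`, `Response` and `Handler` interfaces, the `record` callback, the concrete types and `runDemo` are hypothetical stand-ins, not the actual Thanos or Cortex definitions.

```go
package main

import (
	"context"
	"fmt"
)

// Request is a simplified stand-in for the already-decoded query request.
type Request interface {
	GetStats() string
	WithStats(stats string) Request
}

// Response is a simplified stand-in for the decoded query response.
type Response interface {
	TotalSamples() int64
	PeakSamples() int64
}

// Handler is the next step in the middleware chain.
type Handler interface {
	Do(ctx context.Context, r Request) (Response, error)
}

// HandlerFunc adapts a plain function to the Handler interface.
type HandlerFunc func(ctx context.Context, r Request) (Response, error)

func (f HandlerFunc) Do(ctx context.Context, r Request) (Response, error) { return f(ctx, r) }

type queryRequest struct{ stats string }

func (q queryRequest) GetStats() string { return q.stats }

// WithStats returns a copy of the request with the stats field replaced.
func (q queryRequest) WithStats(stats string) Request {
	q.stats = stats
	return q
}

type queryResponse struct{ total, peak int64 }

func (q queryResponse) TotalSamples() int64 { return q.total }
func (q queryResponse) PeakSamples() int64  { return q.peak }

// statsMiddleware optionally forces stats collection and records the result.
// The record callback is a hypothetical sink (for example, a logger).
type statsMiddleware struct {
	next       Handler
	forceStats bool
	record     func(total, peak int64)
}

func (s statsMiddleware) Do(ctx context.Context, r Request) (Response, error) {
	if s.forceStats {
		r = r.WithStats("all") // no second parse: r is already decoded
	}
	resp, err := s.next.Do(ctx, r)
	if err != nil {
		return nil, err
	}
	s.record(resp.TotalSamples(), resp.PeakSamples())
	return resp, nil
}

// runDemo wires the middleware to a fake downstream handler and returns the
// stats value the downstream saw plus the recorded sample counts.
func runDemo(forceStats bool) (seenStats string, total, peak int64) {
	downstream := HandlerFunc(func(_ context.Context, r Request) (Response, error) {
		seenStats = r.GetStats()
		return queryResponse{total: 720187, peak: 409}, nil
	})
	mw := statsMiddleware{
		next:       downstream,
		forceStats: forceStats,
		record:     func(t, p int64) { total, peak = t, p },
	}
	if _, err := mw.Do(context.Background(), queryRequest{}); err != nil {
		panic(err)
	}
	return seenStats, total, peak
}

func main() {
	stats, total, peak := runDemo(true)
	fmt.Printf("stats=%q total=%d peak=%d\n", stats, total, peak)
}
```

Doing the same in an HTTP handler would require decoding the raw request to rewrite the parameter and decoding the response body again to read the stats, which is the double parsing described above.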

internal/cortex/frontend/transport/handler.go (outdated, resolved)

return atomic.LoadInt64(&s.TotalLoadedSamples)
}

// Merge the provide Stats into this one.
func (s *Stats) Merge(other *Stats) {
Contributor

I am not sure why we are not using this function to add stats samples, or why this function is not used anywhere.
This is where Cortex tracks and merges stats: https://github.com/cortexproject/cortex/blob/master/pkg/querier/stats/stats.go#L346

Contributor Author

Because that is Cortex's "private" way of tracking statistics. In Thanos we chose to stick to Prometheus's "simpler" stats interface, which uses only Samples Total and Peak Samples and is implemented all the way into the Thanos PromQL engine.

Contributor

I see, so we are not calling AddFetchedSeries, AddFetchedChunkBytes, etc. in the Thanos querier.
That is fine, but I don't think collecting this data conflicts with collecting the standard query stats. We can still do both.

Anyway, this is not blocking the PR, so we can change it later if we find it useful.
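
For context on what a merge over the two Prometheus-style counters could look like, here is a rough sketch. Only the `Stats` type name, the `TotalLoadedSamples` field, the `Merge` signature and the `atomic.LoadInt64` accessor appear in the diff. The `PeakLoadedSamples` field name, the other method names, the choice to merge peaks by taking the maximum, and the `mergeAll` helper are assumptions made for illustration and may differ from the real implementation.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// Stats holds the two counters discussed in this thread.
// PeakLoadedSamples is an assumed field name.
type Stats struct {
	TotalLoadedSamples int64
	PeakLoadedSamples  int64
}

// AddTotalSamples atomically adds n to the total loaded samples.
func (s *Stats) AddTotalSamples(n int64) {
	if s == nil {
		return
	}
	atomic.AddInt64(&s.TotalLoadedSamples, n)
}

// LoadTotalSamples atomically reads the total loaded samples.
func (s *Stats) LoadTotalSamples() int64 {
	if s == nil {
		return 0
	}
	return atomic.LoadInt64(&s.TotalLoadedSamples)
}

// UpdatePeakSamples raises the stored peak to n if n is larger,
// using a compare-and-swap loop so concurrent callers never lower it.
func (s *Stats) UpdatePeakSamples(n int64) {
	if s == nil {
		return
	}
	for {
		cur := atomic.LoadInt64(&s.PeakLoadedSamples)
		if n <= cur || atomic.CompareAndSwapInt64(&s.PeakLoadedSamples, cur, n) {
			return
		}
	}
}

// LoadPeakSamples atomically reads the peak loaded samples.
func (s *Stats) LoadPeakSamples() int64 {
	if s == nil {
		return 0
	}
	return atomic.LoadInt64(&s.PeakLoadedSamples)
}

// Merge folds other into s: totals are summed, peaks take the maximum.
// Taking the maximum for peaks is an assumption of this sketch.
func (s *Stats) Merge(other *Stats) {
	if s == nil || other == nil {
		return
	}
	s.AddTotalSamples(other.LoadTotalSamples())
	s.UpdatePeakSamples(other.LoadPeakSamples())
}

// mergeAll merges partial stats concurrently, as might happen when a
// query is split into several sub-requests (hypothetical helper).
func mergeAll(parts []*Stats) *Stats {
	merged := &Stats{}
	var wg sync.WaitGroup
	for _, p := range parts {
		wg.Add(1)
		go func(p *Stats) {
			defer wg.Done()
			merged.Merge(p)
		}(p)
	}
	wg.Wait()
	return merged
}

func main() {
	merged := mergeAll([]*Stats{
		{TotalLoadedSamples: 100, PeakLoadedSamples: 10},
		{TotalLoadedSamples: 250, PeakLoadedSamples: 40},
		{TotalLoadedSamples: 50, PeakLoadedSamples: 25},
	})
	fmt.Printf("total=%d peak=%d\n", merged.LoadTotalSamples(), merged.LoadPeakSamples())
}
```

Additional counters such as fetched series or chunk bytes could be merged the same way alongside these two, which is the "do both" option raised above.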

cmd/thanos/query_frontend.go (outdated, resolved)
Signed-off-by: Pedro Tanaka <pedro.tanaka@shopify.com>
Signed-off-by: Pedro Tanaka <pedro.tanaka@shopify.com>
Signed-off-by: Pedro Tanaka <pedro.tanaka@shopify.com>
@pedro-stanaka pedro-stanaka force-pushed the feat/qfe-force-stats-collection branch from 1265ac9 to 457b861 Compare November 4, 2024 14:12
@pedro-stanaka
Contributor Author

The E2E tests are working locally:

image

Signed-off-by: Pedro Tanaka <pedro.tanaka@shopify.com>
@pedro-stanaka pedro-stanaka force-pushed the feat/qfe-force-stats-collection branch from f52f085 to 3d47cda Compare November 4, 2024 15:21
Contributor

@yeya24 yeya24 left a comment

Thanks!


@pedro-stanaka
Contributor Author

I see, so we are not calling AddFetchedSeries, AddFetchedChunkBytes, etc. in the Thanos querier.
That is fine, but I don't think collecting this data conflicts with collecting the standard query stats. We can still do both.
Anyway, this is not blocking the PR, so we can change it later if we find it useful.

I can try working on this as a follow-up.

@fpetkovski fpetkovski enabled auto-merge November 5, 2024 08:29
@fpetkovski fpetkovski merged commit 9bc3cc0 into thanos-io:main Nov 5, 2024
22 checks passed