query-frontend performance #716

Closed
josunect opened this issue Dec 14, 2023 · 9 comments

@josunect

Installing Tempo with the operator and the following resources:

  resources:
    total:
      limits:
        memory: 1Gi
        cpu: 2000m
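
For context, here is a minimal sketch of a complete TempoStack CR this snippet would sit in (the CR name tempo-cr, the tempo namespace, and the object storage secret are assumptions for illustration):

apiVersion: tempo.grafana.com/v1alpha1
kind: TempoStack
metadata:
  name: tempo-cr
  namespace: tempo
spec:
  storage:
    secret:
      name: object-storage   # assumed secret holding S3/MinIO credentials
      type: s3
  resources:
    total:
      limits:
        memory: 1Gi
        cpu: 2000m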

With the following query, the pod is OOMKilled:

curl -G -s http://localhost:3200/api/search --data-urlencode 'q={ .service.name = "productpage.bookinfo"} && { } | select("status", ".service_name", ".node_id", ".component", ".upstream_cluster", ".http.method", ".response_flags")' --data-urlencode 'spss=10' --data-urlencode 'limit=100' --data-urlencode 'start=1701948096' --data-urlencode 'end=1702552896' | jq

[screenshot: query-frontend pod OOMKilled]

The system seems stable after increasing the resources:

  resources:
    total:
      limits:
        memory: 4Gi
        cpu: 8000m

But that seems like a lot for a development environment?

Tested in minikube following https://grafana.com/docs/tempo/latest/setup/operator/.

@andreasgerstmayr
Collaborator

hi @josunect!

Can you give more details, e.g. which command is used to generate the traces, and after what timeframe Tempo runs into OOM?

I'd like to test this with a basic Tempo setup (https://github.com/grafana/tempo/blob/main/example/docker-compose/s3/docker-compose.yaml) to see if it's an issue of the operator or Tempo itself.

@josunect
Author

Hi, @andreasgerstmayr !

So, we are configuring Istio to send traces to Tempo using Zipkin (https://istio.io/latest/docs/tasks/observability/distributed-tracing/zipkin/), with a sampling rate of 100 and the following configuration:

--set values.meshConfig.defaultConfig.tracing.zipkin.address=tempo-cr-distributor.tempo:9411
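
For context, this flag (together with the sampling rate of 100 mentioned above) corresponds roughly to the following mesh configuration; this is a sketch of the resulting IstioOperator values, not the exact output of the hack script:

meshConfig:
  defaultConfig:
    tracing:
      sampling: 100   # assumed from the sampling rate of 100 described above
      zipkin:
        address: tempo-cr-distributor.tempo:9411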

There is a Kiali hack script to easily create the environment that we use:

https://github.com/kiali/kiali/tree/master/hack/istio/tempo

And follow these steps (example with minikube):

Running the script will install Tempo and Istio (configured to send traces to Tempo) in different namespaces.
Then port-forward the service:

kubectl port-forward svc/tempo-cr-query-frontend 16686:16686 -n tempo

And then run this query every 10 seconds, to get traces from the last hour with a limit of 200:

curl 'http://localhost:16686/api/traces?end=1709135961250000&limit=200&service=productpage.bookinfo&start=1709132361250000'
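
For reference, a sketch that reproduces this with a rolling one-hour window every 10 seconds (the Jaeger API timestamps are in microseconds; the loop is an illustration, not the exact commands used in the original report):

while true; do
  end=$(( $(date +%s) * 1000000 ))     # now, in microseconds
  start=$(( end - 3600 * 1000000 ))    # one hour ago
  curl -s "http://localhost:16686/api/traces?end=${end}&limit=200&service=productpage.bookinfo&start=${start}" > /dev/null
  sleep 10
done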

With roughly 30 minutes of traces, after running the query 4 times, the query-frontend was killed by OOM:

[screenshot: query-frontend pod OOMKilled]

I've also seen the issue with the Tempo API.

@andreasgerstmayr
Collaborator

Thank you for the detailed instructions! I'll test that in the coming days.

@andreasgerstmayr
Collaborator

andreasgerstmayr commented Mar 4, 2024

I can reproduce the OOM of the tempo container in the query-frontend pod with the instructions above. The container gets 409 MB of memory allocated (5% [1] of the 8 GB allocated to the TempoStack).

The query fetches the full contents of up to 200 traces, resulting in a response of roughly 2 MB. The Jaeger UI/API (wrapped with tempo-query) is not optimized for Tempo. For every search query, it fetches a list of matching trace IDs and then fetches each entire trace in a serial loop [2]. Fetching an entire trace is an expensive operation.
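
Roughly, the Jaeger search path behaves like the following sequence against Tempo's HTTP API (a sketch for illustration, not the actual tempo-query code referenced in [2]):

# the search itself returns only the matching trace IDs (a small response) ...
ids=$(curl -s -G http://tempo-cr-query-frontend.tempo.svc.cluster.local:3200/api/search \
  --data-urlencode 'q={ resource.service.name = "productpage.bookinfo" }' \
  --data-urlencode limit=200 | jq -r '.traces[].traceID')

# ... then every full trace is fetched one at a time, which is the expensive part
for id in $ids; do
  curl -s "http://tempo-cr-query-frontend.tempo.svc.cluster.local:3200/api/traces/${id}" > /dev/null
done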

AFAICS the intended usage is to run a TraceQL query to find interesting traces (traces with errors, high latency, etc.) and then fetch the entire trace only for these (few) matching traces.

Running the same query with TraceQL should improve performance. Tempo recommends using scoped attributes, e.g. { resource.service.name = "productpage.bookinfo" }:

curl -s -G http://tempo-cr-query-frontend.tempo.svc.cluster.local:3200/api/search --data-urlencode 'q={ resource.service.name = "productpage.bookinfo" }' --data-urlencode start=$(date -d "1 hour ago" +%s) --data-urlencode end=$(date +%s) --data-urlencode limit=200

This query still OOMs on my machine with the operator-assigned resource limits after a while.

However, when I increase the resources of the query-frontend pod [3]:

spec:
  template:
    queryFrontend:
      component:
        resources:
          limits:
            cpu: "2"
            memory: 2Gi

I can run the above curl command in an endless loop and the tempo container of the query-frontend pod uses about 60% CPU and 1.1 GiB memory (and does not run out of memory 😃).

Edit: The query via the Jaeger API also works now (with 2 GB of memory), albeit slowly, because the Jaeger API is not optimized for Tempo as described above.

[1] "query-frontend": {cpu: 0.09, memory: 0.05},
[2] https://github.com/grafana/tempo/blob/v2.3.1/cmd/tempo-query/tempo/plugin.go#L270-L281
[3] This feature is already merged in the main branch, but not in a released version yet.
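
As a quick sanity check of the formula in [1]: 0.05 × 8 GiB = 0.05 × 8192 MiB ≈ 409 MiB, which matches the ~409 MB limit observed on the query-frontend container.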

@josunect
Author

josunect commented Mar 4, 2024

Thanks for all the information, @andreasgerstmayr! That is really useful.

I think option [3] would be ideal, since using TraceQL still has some issues in the end, and it will help allocate the resources where they are really needed.

@andreasgerstmayr
Collaborator

@josunect we just released version 0.9.0 of the operator, which allows overriding resource requests and limits per component.

Could you test if this resolves the issue? We'll soon deprecate the current formula for allocating resources to components and switch to T-shirt sizes (#845) instead.
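
For reference, a sketch of how the per-component override could be applied to an existing TempoStack (assuming the CR is named tempo-cr in the tempo namespace, as in the reproduction steps above):

kubectl patch tempostack tempo-cr -n tempo --type merge \
  -p '{"spec":{"template":{"queryFrontend":{"component":{"resources":{"limits":{"cpu":"2","memory":"2Gi"}}}}}}}'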

@josunect
Author

Thank you, @andreasgerstmayr! I have done some initial testing and it looks like performance has improved. I will do some more testing and update the issue.

Thanks!

@josunect
Author

josunect commented Apr 1, 2024

Hi @andreasgerstmayr,

In the tests I've done, I didn't find any issues. This resource allocation configuration seems more appropriate.
I think the issue can be closed.

Thank you!

@andreasgerstmayr
Collaborator

Thanks for the follow-up! I'll close this issue then.
