query-frontend performance #716
Hi @josunect! Can you give more details, i.e. what command is used to generate the traces, and after which timeframe Tempo runs into OOM? I'd like to test this with a basic Tempo setup (https://github.com/grafana/tempo/blob/main/example/docker-compose/s3/docker-compose.yaml) to see if it's an issue of the operator or of Tempo itself.
Hi @andreasgerstmayr! We are configuring Istio to send traces to Tempo using Zipkin (https://istio.io/latest/docs/tasks/observability/distributed-tracing/zipkin/), with a sampling rate of 100, using the following configuration:
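The manifest itself was not captured in this excerpt. As a reference, a minimal sketch of what such an Istio tracing configuration typically looks like; the distributor service name, namespace, and port below are assumptions, not the configuration actually used here:

```yaml
# Hypothetical IstioOperator snippet: Zipkin tracer pointing at the Tempo
# distributor, with 100% sampling. Service address and namespace are
# assumptions for illustration only.
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
  meshConfig:
    enableTracing: true
    defaultConfig:
      tracing:
        sampling: 100
        zipkin:
          address: tempo-simplest-distributor.tempo.svc.cluster.local:9411
```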
There is a Kiali hack script that makes it easy to create the environment we use: https://github.com/kiali/kiali/tree/master/hack/istio/tempo. Then follow these steps (e.g. with minikube):
That will install Tempo and Istio (configured to send traces) in different namespaces.
Then run this query every 10 seconds to get traces from the last hour with a limit of 200:
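The original command was not captured in this excerpt. A sketch of an equivalent search against the Jaeger HTTP API exposed by the query-frontend (tempo-query); the host, port, and service name are assumptions:

```sh
# Hypothetical reconstruction: search the Jaeger API for traces of one
# service over the last hour, limited to 200 results.
# Jaeger expects start/end as microseconds since the epoch.
end=$(date +%s%6N)
start=$((end - 3600000000))   # one hour ago
curl -G -s "http://localhost:16686/api/traces" \
  --data-urlencode "service=productpage.bookinfo" \
  --data-urlencode "limit=200" \
  --data-urlencode "start=${start}" \
  --data-urlencode "end=${end}"
```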
I've found the issue with the Tempo API as well.
Thank you for the detailed instructions! I'll test that in the coming days.
I can reproduce the OOM of the tempo container in the query-frontend pod with the instructions above. The container gets 409 MB of memory allocated (5% [1] of the 8 GB allocated to the TempoStack). The query fetches the full trace of up to 200 traces, resulting in a response of roughly 2 MB.

The Jaeger UI/API (wrapped with tempo-query) is not optimized for Tempo. For every search query, it fetches a list of matching trace IDs and then fetches each entire trace in a serial loop [2]. Fetching an entire trace is an expensive operation. As far as I can see, the intended usage is to run a TraceQL query to find interesting traces (traces with errors, high latency, etc.) and then fetch the entire trace only for those (few) matching traces.

Running the same query with TraceQL should improve performance. Tempo recommends using scoped attributes, i.e. qualifying attributes with their scope (resource. or span.) instead of the unscoped . prefix.
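The concrete query from the original comment is not preserved in this excerpt; a sketch of what a scoped variant of the search might look like (endpoint, port, and parameters are assumptions):

```sh
# Hypothetical scoped-attribute TraceQL search: resource.service.name
# instead of the unscoped .service.name. Not the exact command from
# the original comment.
curl -G -s "http://localhost:3200/api/search" \
  --data-urlencode 'q={ resource.service.name = "productpage.bookinfo" }' \
  --data-urlencode 'limit=200'
```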
This query still OOMs on my machine with the resource limits after a while. However, when I increase the resources of the query-frontend pod [3]:
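The override itself was not captured in this excerpt. A sketch of what a per-component resource override on the TempoStack might look like, based on the per-component resources feature referenced in [3]; the field layout and the values (matching the 2 GB mentioned below) are assumptions:

```yaml
# Hypothetical per-component override: give the query-frontend its own
# resource limits instead of its 5% share of the total budget.
# Field names and values are assumptions.
apiVersion: tempo.grafana.com/v1alpha1
kind: TempoStack
metadata:
  name: simplest
spec:
  template:
    queryFrontend:
      resources:
        limits:
          cpu: "2"
          memory: 2Gi
```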
I can run the above curl command in an endless loop and the tempo container of the query-frontend pod uses about 60% CPU and 1.1 GiB of memory (and does not run out of memory 😃).

Edit: The query via the Jaeger API also works now (with 2 GB of memory), albeit slowly, because the Jaeger API is not optimized for Tempo as described above.

[1]
[2] https://github.com/grafana/tempo/blob/v2.3.1/cmd/tempo-query/tempo/plugin.go#L270-L281
[3] This feature is already merged in the main branch, but not in a released version yet.
Thanks for all the information, @andreasgerstmayr! That is really useful. I think option [3] would be the ideal one, as using TraceQL still has some issues in the end, and I think it will help to allocate the resources where they are really needed.
Thank you, @andreasgerstmayr! I have done some initial testing and it looks like performance has improved. I will do some more testing and update the issue. Thanks!
In the tests I've done, I didn't find any issues. This resource allocation configuration seems more appropriate. Thank you!
Thanks for the follow-up! I'll close this issue then.
Installing Tempo with the operator and the following resources:
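The resource block itself was not captured in this excerpt. A minimal sketch of a TempoStack with a total resource budget, consistent with the 8 GB total mentioned in the comments above; the storage secret and the exact values are assumptions:

```yaml
# Hypothetical TempoStack for this report; the actual resource values
# were not preserved in this excerpt and are assumptions.
apiVersion: tempo.grafana.com/v1alpha1
kind: TempoStack
metadata:
  name: simplest
spec:
  storage:
    secret:
      name: minio
      type: s3
  resources:
    total:
      limits:
        cpu: "2"
        memory: 8Gi
```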
With the following query, the pod is OOMKilled:
```sh
curl -G -s http://localhost:3200/api/search \
  --data-urlencode 'q={ .service.name = "productpage.bookinfo"} && { } | select("status", ".service_name", ".node_id", ".component", ".upstream_cluster", ".http.method", ".response_flags")' \
  --data-urlencode 'spss=10' \
  --data-urlencode 'limit=100' \
  --data-urlencode 'start=1701948096' \
  --data-urlencode 'end=1702552896' | jq
```
The system seems stable when increasing the resources:
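The increased values were not captured in this excerpt; a sketch of the kind of increase meant here, with assumed numbers:

```yaml
# Hypothetical increased total budget for the TempoStack; the actual
# values from the report are not preserved here and are assumptions.
spec:
  resources:
    total:
      limits:
        cpu: "4"
        memory: 16Gi
```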
But that seems like a lot for a development environment?
Tested in minikube following https://grafana.com/docs/tempo/latest/setup/operator/.