Thanos query responds too slowly #6845

Open

hellojaewon opened this issue Oct 25, 2023 · 6 comments

hellojaewon commented Oct 25, 2023

Thanos, Prometheus and Golang version used:
0.32.4

Object Storage Provider:
minio (not related to this issue)

What happened:
Thanos Query responds to HTTP requests very slowly, including /api/v1/query requests from Thanos Ruler and even the basic / resource.
The CPU and memory usage of the querier was almost idle.

What you expected to happen:
It should respond quickly.

How to reproduce it (as minimally and precisely as possible):
We have 10 queriers and 3 rulers, each running on its own VM, along with the other components (receive, store, etc.). Some time after starting the services, we noticed that our recording rules were empty for certain time ranges, because we set the ruler's query timeout to 1 minute. When we dug into a tcpdump, the querier responded only after 2 minutes.
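For reference, the capture below was taken with something along these lines (a sketch; the interface and exact filter are assumptions, and 19192 is the querier's HTTP port from the config further down):

tcpdump -nn -i any 'host 2.2.2.2 and tcp port 19192' -w thanos-query.pcap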

Full logs of the relevant components:
ruler : 1.1.1.1
query : 2.2.2.2

Request: packet No. 129
Response: packet No. 372

    129 2023-10-25 07:40:10.376262    1.1.1.1        2.2.2.2        HTTP     496    430        POST /api/v1/query HTTP/1.1  (application/x-www-form-urlencoded)
    130 2023-10-25 07:40:10.376272    2.2.2.2        1.1.1.1        TCP      66     0          19192 → 43556 [ACK] Seq=733 Ack=432 Win=143 Len=0 TSval=3323103448 TSecr=2621259665
    143 2023-10-25 07:40:25.441848    2.2.2.2        1.1.1.1        TCP      66     0          [TCP Keep-Alive] 19192 → 43556 [ACK] Seq=732 Ack=432 Win=143 Len=0 TSval=3323118514 TSecr=2621259665
    146 2023-10-25 07:40:25.442885    1.1.1.1        2.2.2.2        TCP      66     0          [TCP Keep-Alive ACK] 43556 → 19192 [ACK] Seq=432 Ack=733 Win=143 Len=0 TSval=2621274731 TSecr=3323103448
    147 2023-10-25 07:40:25.568984    1.1.1.1        2.2.2.2        TCP      66     0          [TCP Keep-Alive] 43556 → 19192 [ACK] Seq=431 Ack=733 Win=143 Len=0 TSval=2621274857 TSecr=3323103448
    149 2023-10-25 07:40:25.568997    2.2.2.2        1.1.1.1        TCP      66     0          [TCP Keep-Alive ACK] 19192 → 43556 [ACK] Seq=733 Ack=432 Win=143 Len=0 TSval=3323118641 TSecr=2621274731
    170 2023-10-25 07:40:40.545895    2.2.2.2        1.1.1.1        TCP      66     0          [TCP Keep-Alive] 19192 → 43556 [ACK] Seq=732 Ack=432 Win=143 Len=0 TSval=3323133618 TSecr=2621274731
    172 2023-10-25 07:40:40.546707    1.1.1.1        2.2.2.2        TCP      66     0          [TCP Keep-Alive ACK] 43556 → 19192 [ACK] Seq=432 Ack=733 Win=143 Len=0 TSval=2621289835 TSecr=3323118641
    174 2023-10-25 07:40:40.672989    1.1.1.1        2.2.2.2        TCP      66     0          [TCP Keep-Alive] 43556 → 19192 [ACK] Seq=431 Ack=733 Win=143 Len=0 TSval=2621289961 TSecr=3323118641
    176 2023-10-25 07:40:40.673013    2.2.2.2        1.1.1.1        TCP      66     0          [TCP Keep-Alive ACK] 19192 → 43556 [ACK] Seq=733 Ack=432 Win=143 Len=0 TSval=3323133745 TSecr=2621289835
    192 2023-10-25 07:40:55.649887    2.2.2.2        1.1.1.1        TCP      66     0          [TCP Keep-Alive] 19192 → 43556 [ACK] Seq=732 Ack=432 Win=143 Len=0 TSval=3323148722 TSecr=2621289835
    193 2023-10-25 07:40:55.650827    1.1.1.1        2.2.2.2        TCP      66     0          [TCP Keep-Alive ACK] 43556 → 19192 [ACK] Seq=432 Ack=733 Win=143 Len=0 TSval=2621304939 TSecr=3323133745
    195 2023-10-25 07:40:55.776974    1.1.1.1        2.2.2.2        TCP      66     0          [TCP Keep-Alive] 43556 → 19192 [ACK] Seq=431 Ack=733 Win=143 Len=0 TSval=2621305065 TSecr=3323133745
    196 2023-10-25 07:40:55.776993    2.2.2.2        1.1.1.1        TCP      66     0          [TCP Keep-Alive ACK] 19192 → 43556 [ACK] Seq=733 Ack=432 Win=143 Len=0 TSval=3323148849 TSecr=2621304939
    218 2023-10-25 07:41:10.753882    2.2.2.2        1.1.1.1        TCP      66     0          [TCP Keep-Alive] 19192 → 43556 [ACK] Seq=732 Ack=432 Win=143 Len=0 TSval=3323163826 TSecr=2621304939
    220 2023-10-25 07:41:10.754196    1.1.1.1        2.2.2.2        TCP      66     0          [TCP Keep-Alive ACK] 43556 → 19192 [ACK] Seq=432 Ack=733 Win=143 Len=0 TSval=2621320043 TSecr=3323148849
    224 2023-10-25 07:41:10.880992    1.1.1.1        2.2.2.2        TCP      66     0          [TCP Keep-Alive] 43556 → 19192 [ACK] Seq=431 Ack=733 Win=143 Len=0 TSval=2621320169 TSecr=3323148849
    225 2023-10-25 07:41:10.881016    2.2.2.2        1.1.1.1        TCP      66     0          [TCP Keep-Alive ACK] 19192 → 43556 [ACK] Seq=733 Ack=432 Win=143 Len=0 TSval=3323163953 TSecr=2621320043
    243 2023-10-25 07:41:25.857880    2.2.2.2        1.1.1.1        TCP      66     0          [TCP Keep-Alive] 19192 → 43556 [ACK] Seq=732 Ack=432 Win=143 Len=0 TSval=3323178930 TSecr=2621320043
    245 2023-10-25 07:41:25.858224    1.1.1.1        2.2.2.2        TCP      66     0          [TCP Keep-Alive ACK] 43556 → 19192 [ACK] Seq=432 Ack=733 Win=143 Len=0 TSval=2621335147 TSecr=3323163953
    247 2023-10-25 07:41:25.985020    1.1.1.1        2.2.2.2        TCP      66     0          [TCP Keep-Alive] 43556 → 19192 [ACK] Seq=431 Ack=733 Win=143 Len=0 TSval=2621335273 TSecr=3323163953
    249 2023-10-25 07:41:25.985043    2.2.2.2        1.1.1.1        TCP      66     0          [TCP Keep-Alive ACK] 19192 → 43556 [ACK] Seq=733 Ack=432 Win=143 Len=0 TSval=3323179057 TSecr=2621335147
    271 2023-10-25 07:41:40.961870    2.2.2.2        1.1.1.1        TCP      66     0          [TCP Keep-Alive] 19192 → 43556 [ACK] Seq=732 Ack=432 Win=143 Len=0 TSval=3323194034 TSecr=2621335147
    273 2023-10-25 07:41:40.962157    1.1.1.1        2.2.2.2        TCP      66     0          [TCP Keep-Alive ACK] 43556 → 19192 [ACK] Seq=432 Ack=733 Win=143 Len=0 TSval=2621350250 TSecr=3323179057
    276 2023-10-25 07:41:41.088971    1.1.1.1        2.2.2.2        TCP      66     0          [TCP Keep-Alive] 43556 → 19192 [ACK] Seq=431 Ack=733 Win=143 Len=0 TSval=2621350377 TSecr=3323179057
    278 2023-10-25 07:41:41.088984    2.2.2.2        1.1.1.1        TCP      66     0          [TCP Keep-Alive ACK] 19192 → 43556 [ACK] Seq=733 Ack=432 Win=143 Len=0 TSval=3323194161 TSecr=2621350250
    298 2023-10-25 07:41:56.065855    2.2.2.2        1.1.1.1        TCP      66     0          [TCP Keep-Alive] 19192 → 43556 [ACK] Seq=732 Ack=432 Win=143 Len=0 TSval=3323209138 TSecr=2621350250
    299 2023-10-25 07:41:56.066162    1.1.1.1        2.2.2.2        TCP      66     0          [TCP Keep-Alive ACK] 43556 → 19192 [ACK] Seq=432 Ack=733 Win=143 Len=0 TSval=2621365354 TSecr=3323194161
    303 2023-10-25 07:41:56.193017    1.1.1.1        2.2.2.2        TCP      66     0          [TCP Keep-Alive] 43556 → 19192 [ACK] Seq=431 Ack=733 Win=143 Len=0 TSval=2621365481 TSecr=3323194161
    304 2023-10-25 07:41:56.193029    2.2.2.2        1.1.1.1        TCP      66     0          [TCP Keep-Alive ACK] 19192 → 43556 [ACK] Seq=733 Ack=432 Win=143 Len=0 TSval=3323209265 TSecr=2621365354
    372 2023-10-25 07:42:10.349122    2.2.2.2        1.1.1.1        HTTP/JSON 466    400        HTTP/1.1 200 OK , JavaScript Object Notation (application/json)
    376 2023-10-25 07:42:10.349358    1.1.1.1        2.2.2.2        TCP      66     0          43556 → 19192 [ACK] Seq=432 Ack=1133 Win=146 Len=0 TSval=2621379638 TSecr=3323223421
    396 2023-10-25 07:42:21.545201    1.1.1.1        2.2.2.2        HTTP     496    430        POST /api/v1/query HTTP/1.1  (application/x-www-form-urlencoded)
    397 2023-10-25 07:42:21.545235    2.2.2.2        1.1.1.1        TCP      66     0          19192 → 43556 [ACK] Seq=1133 Ack=862 Win=146 Len=0 TSval=3323234617 TSecr=2621390833
    413 2023-10-25 07:42:36.641368    1.1.1.1        2.2.2.2        TCP      66     0          [TCP Keep-Alive] 43556 → 19192 [ACK] Seq=861 Ack=1133 Win=146 Len=0 TSval=2621405929 TSecr=3323234617
    414 2023-10-25 07:42:36.641392    2.2.2.2        1.1.1.1        TCP      66     0          [TCP Keep-Alive ACK] 19192 → 43556 [ACK] Seq=1133 Ack=862 Win=146 Len=0 TSval=3323249713 TSecr=2621390833

Anything else we need to know:

mhoffm-aiven (Contributor) commented:

Can you share some details about your setup, please? For example, what downstream Store APIs are you using? Every bit of information helps!


hellojaewon commented Oct 26, 2023

Our Thanos deployment consists of receive, query, ruler, and store, with object storage (MinIO). Each component runs on its own virtual machine. We use Thanos Store for long-term data and the ruler for recording rules, but to simplify the issue those can be ignored.
The downstream Store APIs for the queriers are the receivers.
We have 6 receivers and 10 queriers, and each querier has connections to all 6 receivers.

We have two kinds of problems.
First, the queriers frequently fail to update their endpoints (the receivers), with the following logs.

update of endpoint failed ... DeadlineExceeded desc = context deadline exceeded
update of endpoint failed ... DeadlineExceeded desc = context deadline exceeded
update of endpoint failed ... DeadlineExceeded desc = context deadline exceeded
...
removing endpoint because it's unhealthy or does not exist
adding new receive with [storeEndpoints examplarsAPI]

These patterns repeated on every querier. As a result, we sometimes failed to evaluate recording rules. (The recording rules are evaluated every minute.)
We temporarily worked around this by reducing store.unhealthy-timeout from 5m (the default) to 30s. Even though the errors still occur, it has worked so far.
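For reference, the workaround is just this flag on the querier command line (a minimal sketch; the full command line is shown later in this thread):

thanos query \
  ... \
  --store.unhealthy-timeout=30s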

The second problem is that the querier replies to HTTP requests very slowly.

Neither problem occurs right after we restart the querier; however, some time later they both happen again.

yeya24 (Contributor) commented Oct 29, 2023

Without knowing how much data you fetched, it is hard for us to understand the performance.
Can you please share more details about your query? What is the query itself, and how many series and chunks are fetched?
What's the time range? And it would be best if you had distributed tracing set up so that we can understand which part of the request is the most time-consuming.


hellojaewon commented Oct 29, 2023

I think the query and the data size are not important, because even a simple HTTP GET request takes a long time when the querier is in this bad state.
Moreover, we removed this querier (2.2.2.2) from the ruler's targets, which means there are no HTTP requests going to it.
Here is the tcpdump.

      9 2023-10-29 08:47:06.688459    1.1.1.1         2.2.2.2         TCP      66     0          64443 → 19192 [SYN] Seq=0 Win=64240 Len=0 MSS=1460 WS=256 SACK_PERM
     10 2023-10-29 08:47:06.688499    2.2.2.2         1.1.1.1         TCP      66     0          19192 → 64443 [SYN, ACK] Seq=0 Ack=1 Win=65535 Len=0 MSS=1460 SACK_PERM WS=512
     11 2023-10-29 08:47:06.689601    1.1.1.1         2.2.2.2         TCP      54     0          64443 → 19192 [ACK] Seq=1 Ack=1 Win=262656 Len=0
     12 2023-10-29 08:47:06.690746    1.1.1.1         2.2.2.2         HTTP     137    83         GET / HTTP/1.1
     13 2023-10-29 08:47:06.690760    2.2.2.2         1.1.1.1         TCP      54     0          19192 → 64443 [ACK] Seq=1 Ack=84 Win=65536 Len=0
     73 2023-10-29 08:48:06.705900    1.1.1.1         2.2.2.2         TCP      55     1          [TCP Keep-Alive] 64443 → 19192 [ACK] Seq=83 Ack=1 Win=262656 Len=1
     74 2023-10-29 08:48:06.705948    2.2.2.2         1.1.1.1         TCP      66     0          [TCP Keep-Alive ACK] 19192 → 64443 [ACK] Seq=1 Ack=84 Win=65536 Len=0 SLE=83 SRE=84
    105 2023-10-29 08:48:09.296331    2.2.2.2         1.1.1.1         HTTP     905    851        HTTP/1.1 200 OK  (text/html)
    106 2023-10-29 08:48:09.306338    1.1.1.1         2.2.2.2         TCP      54     0          64443 → 19192 [FIN, ACK] Seq=84 Ack=852 Win=261888 Len=0
    107 2023-10-29 08:48:09.346731    2.2.2.2         1.1.1.1         TCP      54     0          19192 → 64443 [ACK] Seq=852 Ack=85 Win=65536 Len=0
    120 2023-10-29 08:48:24.378753    2.2.2.2         1.1.1.1         TCP      54     0          [TCP Keep-Alive] 19192 → 64443 [ACK] Seq=851 Ack=85 Win=65536 Len=0
    121 2023-10-29 08:48:24.379763    1.1.1.1         2.2.2.2         TCP      54     0          [TCP Keep-Alive ACK] 64443 → 19192 [ACK] Seq=85 Ack=852 Win=261888 Len=0

Anyway, we have five metrics; the heaviest one has almost 8,000 series and takes about 5 seconds to fetch in the normal situation.
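If it helps, the delay can also be seen purely client-side with curl's write-out timers on that plain GET / (a sketch; 19192 is this querier's HTTP port from our config):

curl -s -o /dev/null \
  -w 'connect: %{time_connect}s  ttfb: %{time_starttransfer}s  total: %{time_total}s\n' \
  http://2.2.2.2:19192/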

yeya24 (Contributor) commented Oct 29, 2023

Anyway, we have five metrics; the heaviest one has almost 8,000 series and takes about 5 seconds to fetch in the normal situation.

Is the 5s spent on the server only, or is that the round trip from the client to the server and back?

As I said, it is better to set up tracing and continuous-profiling tools to help narrow down the issue and understand what's slow here.
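For example, request tracing in Thanos can be enabled by passing a tracing config file to the querier via --tracing.config-file. A minimal Jaeger-flavored sketch of that file (the endpoint below is only a placeholder for your collector):

type: JAEGER
config:
  service_name: thanos-query
  sampler_type: const
  sampler_param: 1
  endpoint: http://jaeger-collector:14268/api/traces

Then start the querier with, e.g., --tracing.config-file=/var/thanos/tracing.yml (the path is just an example).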


hellojaewon commented Oct 29, 2023

As the tcpdump shows, there is no problem at the TCP layer, because the server immediately ACKs the client's request. However, the HTTP response at the application layer is slow.

Here are our configurations.
We have 6 receivers, 10 queriers, and 3 rulers; here is the config for one of each.
I will also look for a way to trace or profile the problem to give more details (see the sketch after the configs below). Thank you.

receive (3.3.3.3)

thanos receive \
--tsdb.path /data/dbfiles/receive \
--grpc-address 0.0.0.0:19090 \
--http-address 0.0.0.0:10909 \
--receive.replication-factor 2 \
--label 'receive_replica=\"1\"' \
--receive.local-endpoint 3.3.3.3:19090 \
--receive.hashrings-file /var/thanos/hashring.json \
--remote-write.address 0.0.0.0:10908 \
--objstore.config-file /var/thanos/receive_bucket_config.yml

query (2.2.2.2)

thanos query \
--http-address 0.0.0.0:19192 \
--grpc-address 0.0.0.0:19092 \
--store.sd-files=/var/thanos/target.json \  (contains the receive addresses; example below)
--query.auto-downsampling \
--query.timeout 1m \
--store.response-timeout 30s \
--store.sd-interval 3h \
--store.unhealthy-timeout 30s \
--query.max-concurrent 30 \
--query.replica-label replica \
--query.replica-label stage \
--query.replica-label receive_replica
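The target.json passed to --store.sd-files uses the Prometheus file-SD format, roughly like this (a sketch with a single receiver shown; in practice there is one target per receiver gRPC endpoint, 19090 being the receivers' gRPC port):

[
  {
    "targets": ["3.3.3.3:19090"]
  }
]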

ruler (1.1.1.1)

thanos rule \
--data-dir=/data/thanos \
--http-address=0.0.0.0:10902 \
--grpc-address=0.0.0.0:10901 \
--eval-interval=1m \
--resend-delay=2m \
--tsdb.block-duration=2h \
--tsdb.no-lockfile \
--tsdb.retention=100d \
--tsdb.wal-compression \
--rule-file=/var/thanos/rules/*.yml \
--objstore.config-file=/var/thanos/bucket_config.yml \
--label 'machine=\"ruler-01\"' \
--log.format=json \
--log.level=debug \
--query=2.2.2.1:19192 \
--query=2.2.2.2:19192 \
--query=2.2.2.3:19192 \
--query=2.2.2.4:19192 \
...
--query=2.2.2.10:19192
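As a starting point for the profiling mentioned above, something like the following could be pulled from the querier while it is in the slow state (a sketch; it assumes the querier exposes Go's standard /debug/pprof endpoints on its HTTP port, which Thanos components do):

# 30s CPU profile from the querier
go tool pprof "http://2.2.2.2:19192/debug/pprof/profile?seconds=30"
# goroutine dump to see where the HTTP handlers are blocked
curl -s "http://2.2.2.2:19192/debug/pprof/goroutine?debug=2" > goroutines.txt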
