[Object Store Access] Context Deadline exceeded after upgrading to v0.32.4 from v0.31.0 #6785
Comments
@jnyi Is this reproducible every time? It looks like a small network issue that should happen rarely.
It is reproducible. We are using an IAM assumed role in AWS instead of a direct username/password, and rolling back to the older version fixed the issue.
Also notice in the config that the user set the s3 endpoint as
Umm, @Pacobart did you only see Thanos using the dualstack endpoint recently, after the upgrade?
Great find! It is unfortunate they did that, but at least I now know where it's coming from.
@jnyi We got similar issues after updating the Thanos version. Iterating the whole bucket is now required due to #6474. The list objects request might take a long time and time out.
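A rough sketch of the failure mode described above, not Thanos code: a paginated listing that must finish inside a fixed deadline fails once the bucket has enough pages. The function name, the `pages` iterable, and the injectable `clock` are all hypothetical and only stand in for a real object store client.

```python
import time


def list_all_keys(pages, deadline_seconds, clock=time.monotonic):
    """Collect keys from every page, failing once the deadline has passed.

    `pages` is any iterable of lists of keys (a stand-in for paginated
    list-objects responses). `clock` is injectable so the behaviour can be
    checked without real waiting.
    """
    start = clock()
    keys = []
    for page in pages:
        # The deadline covers the whole iteration, not a single request,
        # so more pages means a higher chance of running out of time.
        if clock() - start > deadline_seconds:
            raise TimeoutError("deadline exceeded while iterating bucket")
        keys.extend(page)
    return keys
```

With this shape, raising the deadline or reducing the number of pages that must be walked are the only two ways out, which matches the workarounds discussed in this thread.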
Is there a way to skip old versions when iterating objects in S3? Maybe that is something we can do on our side.
Speaking about the object storage layout: taking Grafana Mimir as an example, they break the layout down by tenant, while Thanos today puts all tenants plus raw-resolution and 5m/1h downsampled blocks under the same prefix. Would prefixing them separately and only iterating the relevant logical subpaths make it simpler?
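To make the layout question concrete, here is a minimal sketch of prefix-scoped listing. The `tenant/resolution/block` key layout and the function name are assumptions for illustration only; this is not the current Thanos or Mimir layout, and a real object store would apply the prefix filter server-side rather than in a loop.

```python
def list_block_dirs(keys, prefix=""):
    """Return the distinct first-level 'directories' found under `prefix`."""
    dirs = set()
    for key in keys:
        if not key.startswith(prefix):
            continue  # outside the requested subpath, never inspected further
        rest = key[len(prefix):]
        head, sep, _ = rest.partition("/")
        if sep:  # only count entries that have something below them
            dirs.add(prefix + head)
    return sorted(dirs)


# Hypothetical keys under a per-tenant, per-resolution layout.
example_keys = [
    "tenant-a/raw/01ABC/meta.json",
    "tenant-a/raw/01ABC/chunks/000001",
    "tenant-a/5m/01DEF/meta.json",
    "tenant-b/raw/01GHI/meta.json",
]
```

Calling `list_block_dirs(example_keys, "tenant-a/raw/")` only considers one tenant's raw blocks, whereas an empty prefix walks everything.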
Cortex has a bucket index to help speed up loading blocks. I think Thanos might be able to have something similar to avoid iterating the whole bucket every time.
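A minimal sketch of the bucket index idea, assuming a single JSON object that lists block IDs. The object name, the JSON shape, and the dict standing in for the bucket are all hypothetical; this is not the Cortex index format.

```python
import json

INDEX_KEY = "bucket-index.json"  # hypothetical object name


def write_index(bucket, block_ids):
    """Store the list of known block IDs as one small object."""
    bucket[INDEX_KEY] = json.dumps({"blocks": sorted(block_ids)})


def load_block_ids(bucket):
    """Read block IDs from the index, falling back to a full scan."""
    raw = bucket.get(INDEX_KEY)
    if raw is not None:
        # One read instead of iterating every object in the bucket.
        return json.loads(raw)["blocks"]
    # No index yet: derive block IDs from every */meta.json key.
    return sorted({k.split("/")[0] for k in bucket if k.endswith("/meta.json")})
```

The trade-off is that something has to keep the index up to date, so readers may see a slightly stale view between updates.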
Bucket index would be a great addition. I wonder how hard it is to merge that code from Cortex. However, we do need a short-term solution to unblock people from upgrading.
Can we maintain a directory next to the blocks that contains only files named after the blocks? Every time we add a block we write a new file there, and every time we delete a block we delete its file there. Then we only need to look in that directory to find blocks.
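The marker directory proposal above could look roughly like this. It is a sketch under assumptions: the `block-markers/` prefix and the function names are made up, and a dict stands in for the bucket, so the prefix filter here is a loop where a real store would do a prefix listing.

```python
MARKER_PREFIX = "block-markers/"  # hypothetical prefix


def add_block(bucket, block_id, files):
    """Upload the block's files, then write its marker last."""
    for name, data in files.items():
        bucket[f"{block_id}/{name}"] = data
    # Writing the marker last means a marker never points at a partial block.
    bucket[MARKER_PREFIX + block_id] = b""


def delete_block(bucket, block_id):
    """Remove the marker first, then the block's files."""
    bucket.pop(MARKER_PREFIX + block_id, None)
    for key in [k for k in bucket if k.startswith(block_id + "/")]:
        del bucket[key]


def known_blocks(bucket):
    """Discover blocks by looking only under the marker prefix."""
    return sorted(
        k[len(MARKER_PREFIX):] for k in bucket if k.startswith(MARKER_PREFIX)
    )
```

Discovery then costs one listing over small marker objects instead of a walk over every chunk and index file.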
For the short term, maybe we can introduce a hidden flag to opt into the new behavior, and for the long term we can move towards a block index solution. WDYT?
https://cortexmetrics.io/docs/blocks-storage/bucket-index/ Sharing the Cortex bucket index doc.
--block-viewer.global.sync-block-timeout=5m https://thanos.io/tip/components/compact.md/ Changing these defaults on the Thanos compactor fixes the iter context deadline timeouts in my cluster.
I confirm these worked. Thanks @michaelswierszcz
Thanos, Prometheus and Golang version used:
Object Storage Provider: AWS s3
What happened: We use k8s and an assumed role for accessing AWS S3. It worked fine with Thanos v0.31.0. After upgrading to v0.32.4, the compactor started reporting errors, which looks like a regression related to permissions.
What you expected to happen: Thanos compactor can talk to AWS S3 using assumeRole.
How to reproduce it (as minimally and precisely as possible): Use a normal S3 configuration without username/password, with the k8s worker assuming a role to access S3.
Full logs to relevant components:
s3 config:
IAM assume role:
Anything else we need to know: