[Object Store Access] Context Deadline exceeded after upgrading to v0.32.4 from v0.31.0 #6785

jnyi · 2023-10-09T19:06:05Z

Thanos, Prometheus and Golang version used:

v0.31.0 --> v0.32.4

Object Storage Provider: AWS s3

What happened: We run on Kubernetes and use an assumed IAM role to access AWS S3. This worked fine with Thanos v0.31.0. After upgrading to v0.32.4, the compactor started logging errors; it looks like a permissions-related regression.

What you expected to happen: Thanos compactor can talk to AWS S3 using AssumeRole.

How to reproduce it (as minimally and precisely as possible): Use a normal S3 configuration without user/password credentials, relying on the role assumed by the Kubernetes worker to access S3.

Full logs to relevant components:

ts=2023-10-09T18:20:13.796433303Z caller=blocks_cleaner.go:44 level=info name=thanos-compactor msg="started cleaning of blocks marked for deletion"
ts=2023-10-09T18:20:13.797847735Z caller=blocks_cleaner.go:58 level=info name=thanos-compactor msg="cleaning of blocks marked for deletion done"
ts=2023-10-09T18:24:50.467766592Z caller=runutil.go:100 level=error name=thanos-compactor msg="function failed. Retrying in next tick" err="BaseFetcher: iter bucket: Get \"https://<redacted bucket>.s3.dualstack.us-west-2.amazonaws.com/?continuation-token=<redacted>&delimiter=&encoding-type=url&fetch-owner=true&list-type=2&prefix=thanos%2Foregon-dev%2F\": context deadline exceeded"
ts=2023-10-09T18:24:50.46789616Z caller=compact.go:597 level=error name=thanos-compactor msg="retriable error" err="BaseFetcher: iter bucket: Get \"https://<redacted bucket>.s3.dualstack.us-west-2.amazonaws.com/?continuation-token=<redacted>&delimiter=&encoding-type=url&fetch-owner=true&list-type=2&prefix=thanos%2Foregon-dev%2F\": context deadline exceeded"

s3 config:

"config":
  "bucket": "<bucket>"
  "endpoint": "s3.us-west-2.amazonaws.com"
  "insecure": false
  "region": "us-west-2"
  "signature_version2": false
"prefix": "thanos/oregon-dev"
"type": "S3"

IAM assume role:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Service": "ec2.amazonaws.com",
                "AWS": "arn:aws:iam::<aws account id>:role/KubernetesRoles-IAMRoleWorker-<redacted>"
            },
            "Action": "sts:AssumeRole"
        }
    ]
}

Anything else we need to know:


yeya24 · 2023-10-16T07:05:13Z

@jnyi Is this reproducible every time? It looks like a small network issue that should happen rarely.

jnyi · 2023-10-23T20:59:35Z

It is reproducible. We are using an IAM assumed role in AWS instead of direct username/password credentials. Rolling back to the older version fixed the issue.

Pacobart · 2023-10-23T21:56:56Z

Also notice that the config sets the S3 endpoint to s3.us-west-2.amazonaws.com, but the logs show s3.dualstack.us-west-2.amazonaws.com. This is another issue: #6804

yeya24 · 2023-10-23T22:41:32Z

Umm, @Pacobart did you see Thanos using dualstack endpoint only recently after the upgrade?
Looks like dualstack was used since 2019 minio/minio-go#1055

Pacobart · 2023-10-24T21:20:38Z

> Umm, @Pacobart did you see Thanos using dualstack endpoint only recently after the upgrade? Looks like dualstack was used since 2019 minio/minio-go#1055

Great find! It's unfortunate that they did that, but at least now I know where it's coming from.

yeya24 · 2023-11-27T03:26:20Z

@jnyi We hit similar issues after updating our Thanos version. Iterating the whole bucket is now required due to #6474, so the list objects request can take a long time and time out.
If you are using S3 with bucket versioning enabled, you can try cleaning up some old versions and trying again. Ideally this should improve list performance.
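
To illustrate the failure mode described above, here is a minimal simulation. It is not Thanos or S3 code: `FakeBucket`, `iter_bucket`, the page size, and the per-page latency are all hypothetical. It only shows why a single deadline spread across a paginated listing is exceeded once the listing grows, and why either shrinking the listing or raising the timeout makes it pass again.

```python
from dataclasses import dataclass


class DeadlineExceeded(Exception):
    """Raised when the simulated listing runs past its time budget."""


@dataclass
class FakeBucket:
    # Hypothetical stand-in for an object store; not a Thanos or S3 API.
    keys: list
    page_size: int = 1000
    seconds_per_page: float = 0.5  # assumed latency of one list request

    def list_pages(self, prefix):
        """Yield the keys under prefix one page at a time."""
        matching = [k for k in self.keys if k.startswith(prefix)]
        for i in range(0, len(matching), self.page_size):
            yield matching[i:i + self.page_size]


def iter_bucket(bucket, prefix, timeout_seconds):
    """Walk every page under prefix, failing once the time budget is spent."""
    elapsed = 0.0  # simulated clock, so the example is deterministic
    seen = []
    for page in bucket.list_pages(prefix):
        elapsed += bucket.seconds_per_page
        if elapsed > timeout_seconds:
            raise DeadlineExceeded(f"gave up after {len(seen)} keys")
        seen.extend(page)
    return seen


def make_keys(n):
    return [f"thanos/oregon-dev/{i:05d}/meta.json" for i in range(n)]


small = FakeBucket(make_keys(2500))   # 3 pages
large = FakeBucket(make_keys(10000))  # 10 pages

print(len(iter_bucket(small, "thanos/oregon-dev/", timeout_seconds=2.0)))
try:
    iter_bucket(large, "thanos/oregon-dev/", timeout_seconds=2.0)
except DeadlineExceeded as err:
    print("listing failed:", err)
print(len(iter_bucket(large, "thanos/oregon-dev/", timeout_seconds=10.0)))
```

The same bucket that fails with a 2 second budget succeeds with a 10 second one, which matches the two remedies discussed in this thread: reduce what has to be listed, or give the listing more time.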

fpetkovski · 2023-11-27T07:10:19Z

Is there a way to skip old versions when iterating objects in S3? Maybe that is something we can do on our side.

jnyi · 2023-12-04T18:30:29Z

Speaking of the object storage layout: Grafana Mimir, for example, breaks its layout down by tenant, while Thanos today puts all tenants plus raw-resolution and 5m/1h downsampled blocks under the same prefix. Would prefixing them separately and iterating only the relevant logical subpaths make this simpler?

yeya24 · 2023-12-04T22:59:39Z

Cortex has a bucket index to help speed up loading blocks. I think Thanos might be able to have something similar to avoid iterating the whole bucket every time.

fpetkovski · 2023-12-05T11:02:41Z

Bucket index would be a great addition, I wonder how hard it is to merge that code from Cortex. However, we do need a short term solution to unblock people from upgrading.

MichaHoffmann · 2023-12-05T16:58:28Z

Can we maintain a directory alongside the blocks that contains only files named after the blocks? Every time we add a block we write a new file there, and every time we delete a block we delete its file there. Then we only need to look in that directory to find blocks.
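
A rough in-memory sketch of the marker-directory idea above. Everything here is hypothetical: the `MarkerBucket` class, the `block-markers/` prefix, and the write-marker-last / delete-marker-first ordering are assumptions made for illustration, not an existing Thanos layout.

```python
class MarkerBucket:
    """In-memory sketch of a bucket that keeps one empty marker per block."""

    MARKER_PREFIX = "block-markers/"  # hypothetical prefix

    def __init__(self):
        self.objects = {}  # object key -> bytes

    def upload_block(self, block_id, files):
        for name, data in files.items():
            self.objects[f"{block_id}/{name}"] = data
        # Assumed ordering: write the marker last, so a visible marker
        # implies the block's files were already uploaded.
        self.objects[self.MARKER_PREFIX + block_id] = b""

    def delete_block(self, block_id):
        # Assumed ordering: remove the marker first, so readers stop
        # discovering the block before its files disappear.
        self.objects.pop(self.MARKER_PREFIX + block_id, None)
        for key in [k for k in self.objects if k.startswith(block_id + "/")]:
            del self.objects[key]

    def list_block_ids(self):
        # A real object store would serve this as a prefix listing that
        # returns only the small marker objects; the dict scan here is
        # just the in-memory equivalent.
        prefix = self.MARKER_PREFIX
        return sorted(k[len(prefix):] for k in self.objects if k.startswith(prefix))


bucket = MarkerBucket()
bucket.upload_block("block-a", {"meta.json": b"{}", "index": b"", "chunks/000001": b""})
bucket.upload_block("block-b", {"meta.json": b"{}", "index": b"", "chunks/000001": b""})
print(bucket.list_block_ids())
bucket.delete_block("block-a")
print(bucket.list_block_ids())
```

Discovery then costs one listing over the marker prefix, whose size grows with the number of blocks rather than with the number of objects inside them.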

MichaHoffmann · 2023-12-05T17:04:27Z

> Bucket index would be a great addition, I wonder how hard it is to merge that code from Cortex. However, we do need a short term solution to unblock people from upgrading.

For the short term, maybe we can introduce a hidden flag to opt into the new behavior, and long term we can move towards a block index solution. WDYT?

yeya24 · 2023-12-05T17:04:30Z

https://cortexmetrics.io/docs/blocks-storage/bucket-index/ Sharing Cortex bucket index doc.
Thanos might not need all the features from it, but the idea is the same.
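
A minimal sketch of the bucket index idea: one periodically rewritten object that lists the known blocks, so readers fetch a single file instead of iterating the bucket. The object name, the field names, and the helper functions below are simplified assumptions for illustration. They are not the Cortex schema or any Thanos API; see the Cortex doc linked above for the real format.

```python
import gzip
import json

INDEX_KEY = "bucket-index.json.gz"  # hypothetical object name


def write_index(store, blocks, deletion_marks, updated_at):
    """Serialize the block list into a single compressed object."""
    doc = {
        "version": 1,
        "updated_at": updated_at,
        "blocks": blocks,
        "block_deletion_marks": deletion_marks,
    }
    store[INDEX_KEY] = gzip.compress(json.dumps(doc).encode("utf-8"))


def load_live_blocks(store):
    """Read the index with one GET and drop blocks marked for deletion."""
    doc = json.loads(gzip.decompress(store[INDEX_KEY]).decode("utf-8"))
    marked = {m["block_id"] for m in doc["block_deletion_marks"]}
    return [b for b in doc["blocks"] if b["block_id"] not in marked]


# A plain dict stands in for the object store in this sketch.
store = {}
write_index(
    store,
    blocks=[
        {"block_id": "block-a", "min_time": 0, "max_time": 7200000},
        {"block_id": "block-b", "min_time": 7200000, "max_time": 14400000},
    ],
    deletion_marks=[{"block_id": "block-a", "deletion_time": 1700000000}],
    updated_at=1700000100,
)
print([b["block_id"] for b in load_live_blocks(store)])
```

The trade-off is that some component has to keep the index up to date, and readers see the bucket as of the last index write rather than as of the current moment.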

michaelswierszcz · 2024-02-05T16:10:07Z

--block-viewer.global.sync-block-timeout=5m
--block-viewer.global.sync-block-interval=1m

https://thanos.io/tip/components/compact.md/

Changing these defaults on the Thanos compactor fixes the iter context deadline timeouts in my cluster.

BouchaaraAdil · 2024-04-14T22:15:12Z

> --block-viewer.global.sync-block-timeout=5m
> --block-viewer.global.sync-block-interval=1m

I confirm these worked. Thanks @michaelswierszcz

jnyi mentioned this issue Oct 23, 2023

AWS S3 endpoint configuration not working #6804

Closed
