
Thanos compactor filling the disk with stack traces on Azure #2565

Closed
joaosilva15 opened this issue May 5, 2020 · 19 comments · Fixed by #3341

Comments

@joaosilva15

Thanos, Prometheus and Golang version used:

Thanos: 0.12.2
Go: 1.13.6

Object Storage Provider:

Azure

What happened:
After upgrading to the newest version of the Thanos compactor, the logs have been filled with stack traces from the Azure SDK.
The error states that deletion-mark.json cannot be found and that the API returned 404 (as expected, since the blocks are not to be deleted).

What you expected to happen:

No stack traces in the logs, as the error is expected.

How to reproduce it (as minimally and precisely as possible):

Running the compactor against the blob storage is enough in our case.

Full logs to relevant components:

May  5 10:11:50 thanos /usr/bin/thanos[98674]: 2020/05/05 10:11:50 ==> REQUEST/RESPONSE (Try=1/5.08092ms, OpTime=5.11022ms) -- REQUEST ERROR#012   HEAD https://thanos.link/blob/01D7K76ENE3WXX79N5PVQN4X58/deletion-mark.json?timeout=61#012   Authorization: REDACTED#012   User-Agent: [Thanos Azure-Storage/0.7 (go1.13.6; linux)]#012 #012   X-Ms-Version: [2018-11-09]#012   x-ms-date: [Tue, 05 May 2020 10:11:50 GMT]#012   --------------------------------------------------------------------------------#012   RESPONSE Status: 404 The specified blob does not exist.#012   Date: [Tue, 05 May 2020 10:11:49 GMT]#012   Server: [Windows-Azure-Blob/1.0 Microsoft-HTTPAPI/2.0]#012   X-Ms-Error-Code: [BlobNotFound]#012   X-Ms-Request-Id: [5a0f3866-d01e-0030-68c5-2203ee000000]#012   X-Ms-Version: [2018-11-09]#012   --------------------------------------------------------------------------------#012   ERROR:#012-> github.com/Azure/azure-storage-blob-go/azblob.newStorageError, /go/pkg/mod/github.com/!azure/azure-storage-blob-go@v0.8.0/azblob/zc_storage_error.go:42#012===== RESPONSE ERROR (ServiceCode=BlobNotFound) =====#012Description=404 The specified blob does not exist., Details: (none)#012   HEAD https://thanos.link/blob/01D7K76ENE3WXX79N5PVQN4X58/deletion-mark.json?timeout=61#012   Authorization: REDACTED#012   User-Agent: [Thanos Azure-Storage/0.7 (go1.13.6; linux)]#012   X-Ms-Client-Request-Id: [30a18d7a-0e52-493c-41c5-9bc739c346f9]#012   X-Ms-Date: [Tue, 05 May 2020 10:11:50 GMT]#012   X-Ms-Version: [2018-11-09]#012   --------------------------------------------------------------------------------#012   RESPONSE Status: 404 The specified blob does not exist.#012   Date: [Tue, 05 May 2020 10:11:49 GMT]#012   Server: [Windows-Azure-Blob/1.0 Microsoft-HTTPAPI/2.0]#012   X-Ms-Error-Code: [BlobNotFound]#012   X-Ms-Request-Id: [5a0f3866-d01e-0030-68c5-2203ee000000]#012   X-Ms-Version: [2018-11-09]#012#012#012#012goroutine 191 [running]:#012gh.neting.cc/Azure/azure-storage-blob-go/azblob.stack(0xc003d635c0, 0xc004e7e000, 0xc004966ea0)#012#011/go/pkg/mod/github.com/!azure/azure-storage-blob-go@v0.8.0/azblob/zc_policy_request_log.go:146 +0x9d#012gh.neting.cc/Azure/azure-storage-blob-go/azblob.NewRequestLogPolicyFactory.func1.1(0x23069c0, 0xc004ebd680, 0xc004e7e000, 0x10, 0x1b72320, 0x1, 0xc004e40f20)#012#011/go/pkg/mod/github.com/!azure/azure-storage-blob-go@v0.8.0/azblob/zc_policy_request_log.go:96 +0x650#012gh.neting.cc/Azure/azure-pipeline-go/pipeline.PolicyFunc.Do(0xc004cb9450, 0x23069c0, 0xc004ebd680, 0xc004e7e000, 0xc004e41008, 0xc0030b7400, 0x46, 0xc00204c330)#012#011/go/pkg/mod/github.com/!azure/azure-pipeline-go@v0.2.2/pipeline/core.go:43 +0x44#012gh.neting.cc/Azure/azure-storage-blob-go/azblob.(*SharedKeyCredential).New.func1(0x23069c0, 0xc004ebd680, 0xc004e7e000, 0x33981e0, 0x23069c0, 0xc004ebd680, 0xc004d25b10)#012#011/go/pkg/mod/github.com/!azure/azure-storage-blob-go@v0.8.0/azblob/zc_credential_shared_key.go:66 +0x2d2#012gh.neting.cc/Azure/azure-pipeline-go/pipeline.PolicyFunc.Do(0xc005065170, 0x23069c0, 0xc004ebd680, 0xc004e7e000, 0xc004ebd680, 0xc004d25b10, 0xc000000001, 0x0)#012#011/go/pkg/mod/github.com/!azure/azure-pipeline-go@v0.2.2/pipeline/core.go:43 +0x44#012gh.neting.cc/Azure/azure-storage-blob-go/azblob.NewRetryPolicyFactory.func1.1(0x2306940, 0xc000368940, 0xc004be1f00, 0x10, 0x1b72320, 0x64492d747301, 0xc004e40dc0)#012#011/go/pkg/mod/github.com/!azure/azure-storage-blob-go@v0.8.0/azblob/zc_policy_retry.go:204 
+0x66c#012gh.neting.cc/Azure/azure-pipeline-go/pipeline.PolicyFunc.Do(0xc004cb94a0, 0x2306940, 0xc000368940, 0xc004be1f00, 0xc004e40e78, 0x30, 0x28, 0x8)#012#011/go/pkg/mod/github.com/!azure/azure-pipeline-go@v0.2.2/pipeline/core.go:43 +0x44#012gh.neting.cc/Azure/azure-storage-blob-go/azblob.NewUniqueRequestIDPolicyFactory.func1.1(0x2306940, 0xc000368940, 0xc004be1f00, 0x10, 0x1b72320, 0x40ca01, 0xc004e40dc0)#012#011/go/pkg/mod/github.com/!azure/azure-storage-blob-go@v0.8.0/azblob/zc_policy_unique_request_id.go:19 +0xae#012gh.neting.cc/Azure/azure-pipeline-go/pipeline.PolicyFunc.Do(0xc002d055e0, 0x2306940, 0xc000368940, 0xc004be1f00, 0xc004e40e60, 0x2a, 0xc00204c1e0, 0xc003cd8478)#012#011/go/pkg/mod/github.com/!azure/azure-pipeline-go@v0.2.2/pipeline/core.go:43 +0x44#012gh.neting.cc/Azure/azure-storage-blob-go/azblob.NewTelemetryPolicyFactory.func1.1(0x2306940, 0xc000368940, 0xc004be1f00, 0x1, 0x0, 0x1, 0xc00494e338)#012#011/go/pkg/mod/github.com/!azure/azure-storage-blob-go@v0.8.0/azblob/zc_policy_telemetry.go:34 +0x15d#012gh.neting.cc/Azure/azure-pipeline-go/pipeline.PolicyFunc.Do(0xc0050651a0, 0x2306940, 0xc000368940, 0xc004be1f00, 0xc0050651a0, 0xc004be1f00, 0xc003cd8548, 0x40ca58)#012#011/go/pkg/mod/github.com/!azure/azure-pipeline-go@v0.2.2/pipeline/core.go:43 +0x44#012gh.neting.cc/Azure/azure-pipeline-go/pipeline.(*pipeline).Do(0xc003367780, 0x2306940, 0xc000368940, 0x22bab00, 0xc004c008c0, 0xc004be1f00, 0x25, 0xc000d932c0, 0x36, 0x0)#012#011/go/pkg/mod/github.com/!azure/azure-pipeline-go@v0.2.2/pipeline/core.go:129 +0x81#012gh.neting.cc/Azure/azure-storage-blob-go/azblob.blobClient.GetProperties(0xc00204c240, 0x5, 0x0, 0x0, 0x0, 0xc00204c248, 0x25, 0xc000d932c0, 0x36, 0x0, ...)#012#011/go/pkg/mod/github.com/!azure/azure-storage-blob-go@v0.8.0/azblob/zz_generated_blob.go:761 +0x3a5#012gh.neting.cc/Azure/azure-storage-blob-go/azblob.BlobURL.GetProperties(0xc00204c240, 0x5, 0x0, 0x0, 0x0, 0xc00204c248, 0x25, 0xc000d932c0, 0x36, 0x0, ...)#012#011/go/pkg/mod/github.com/!azure/azure-storage-blob-go@v0.8.0/azblob/url_blob.go:113 +0x11f#012gh.neting.cc/thanos-io/thanos/pkg/objstore/azure.(*Bucket).Exists(0xc0002620b0, 0x2306940, 0xc000368940, 0xc00204c1b0, 0x2d, 0x0, 0xc005064fc0, 0x0)#012#011/app/pkg/objstore/azure/azure.go:256 +0x2fb#012gh.neting.cc/thanos-io/thanos/pkg/objstore/azure.(*Bucket).getBlobReader(0xc0002620b0, 0x2306940, 0xc000368940, 0xc00204c1b0, 0x2d, 0x0, 0x0, 0x0, 0x0, 0x0, ...)#012#011/app/pkg/objstore/azure/azure.go:176 +0x32b#012gh.neting.cc/thanos-io/thanos/pkg/objstore/azure.(*Bucket).Get(0xc0002620b0, 0x2306940, 0xc000368940, 0xc00204c1b0, 0x2d, 0xc0001b49c0, 0x1c9b540, 0x2335640, 0xc0033676c0)#012#011/app/pkg/objstore/azure/azure.go:226 +0x65#012gh.neting.cc/thanos-io/thanos/pkg/objstore.(*metricBucket).Get(0xc0033676c0, 0x2306940, 0xc000368940, 0xc00204c1b0, 0x2d, 0x0, 0x0, 0x22b8120, 0xc002d05500)#012#011/app/pkg/objstore/objstore.go:331 +0xd9#012gh.neting.cc/thanos-io/thanos/pkg/block/metadata.ReadDeletionMark(0x2306940, 0xc000368940, 0x7f5e9c2bfd10, 0xc0007c6d40, 0x22b71a0, 0xc0008cc5d0, 0xc0007e7680, 0x1a, 0x0, 0x0, ...)#012#011/app/pkg/block/metadata/deletionmark.go:51 +0x163#012gh.neting.cc/thanos-io/thanos/pkg/block.(*IgnoreDeletionMarkFilter).Filter(0xc0009d7ad0, 0x2306940, 0xc000368940, 0xc005dea600, 0xc000811720, 0xc001664000, 0x0, 0x0)#012#011/app/pkg/block/fetcher.go:774 +0x1a1#012gh.neting.cc/thanos-io/thanos/pkg/block.(*BaseFetcher).fetch(0xc000d2c060, 0x2306940, 0xc000368940, 0xc0007c7d40, 0xc0004647d0, 0x5, 0x5, 0x0, 0x0, 0x0, 
...)#012#011/app/pkg/block/fetcher.go:432 +0x5b5#012gh.neting.cc/thanos-io/thanos/pkg/block.(*MetaFetcher).Fetch(0xc000d2c3c0, 0x2306940, 0xc000368940, 0xc003eefe00, 0xc004feaeb8, 0xc003eefdb0, 0xfe)#012#011/app/pkg/block/fetcher.go:472 +0x9f#012gh.neting.cc/thanos-io/thanos/pkg/store.(*BucketStore).SyncBlocks(0xc0001ea200, 0x2306940, 0xc000368940, 0xc000c07da8, 0xc001c57a40)#012#011/app/pkg/store/bucket.go:344 +0x66#012main.runStore.func4.1(0xc000c07d80, 0xc000c07d38)#012#011/app/cmd/thanos/store.go:309 +0x60#012gh.neting.cc/thanos-io/thanos/pkg/runutil.Repeat(0x29e8d60800, 0xc0000af440, 0xc003cd9f58, 0x0, 0x0)#012#011/app/pkg/runutil/runutil.go:72 +0x91#012main.runStore.func4(0x0, 0x0)#012#011/app/cmd/thanos/store.go:308 +0x505#012gh.neting.cc/oklog/run.(*Group).Run.func1(0xc000791980, 0xc000465270, 0xc0008ffec0)#012#011/go/pkg/mod/github.com/oklog/run@v1.0.0/group.go:38 +0x27#012created by github.com/oklog/run.(*Group).Run#012#011/go/pkg/mod/github.com/oklog/run@v1.0.0/group.go:37 +0xbb

Anything else we need to know:

I understand that this is not a Thanos problem, as Thanos is handling the error correctly and performing the correct operation. I'm just wondering if someone else is having the same issue, and whether there was a need to change the SDK log level while developing the Azure support; if so, how do I do it? I ran through the code and didn't find an option for it, but I would be more than happy to help with a little guidance.

@joaosilva15
Author

joaosilva15 commented May 5, 2020

After some investigation, it's possible to change the log level in

    p := blob.NewPipeline(c, blob.PipelineOptions{

with

    Log: pipeline.LogOptions{
        ShouldLog: func(level pipeline.LogLevel) bool {
            return level <= pipeline.LogFatal
        },
    },

but this will disable logs below fatal for every call we make.

For more insight, you can see in https://github.com/Azure/azure-storage-blob-go/blob/master/azblob/zc_policy_request_log.go#L67 that 4xx responses are treated as LogError, so to remove the traces we need to change the log level to Panic, Fatal, or None.

After reading the code more carefully, the 404 status code is excluded by that if statement, yet for some reason it is still being treated as an error.
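
For reference, here is a minimal, self-contained sketch of the option described above, assuming the same import aliases (`blob` for azblob, `pipeline` for azure-pipeline-go) that Thanos's Azure bucket client uses; this is an illustration, not the shipped patch:

```go
package azure

import (
	"github.com/Azure/azure-pipeline-go/pipeline"
	blob "github.com/Azure/azure-storage-blob-go/azblob"
)

// newPipeline builds the blob pipeline with SDK logging limited to fatal
// events, which silences the 404 stack traces emitted at the LogError level.
func newPipeline(c blob.Credential) pipeline.Pipeline {
	return blob.NewPipeline(c, blob.PipelineOptions{
		Log: pipeline.LogOptions{
			ShouldLog: func(level pipeline.LogLevel) bool {
				return level <= pipeline.LogFatal
			},
		},
	})
}
```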

@GiedriusS
Member

> After some investigation, it's possible to change the log level in `p := blob.NewPipeline(c, blob.PipelineOptions{` with `Log: pipeline.LogOptions{ShouldLog: ...}`, but this will disable logs below fatal for every call we make.
>
> [...]
>
> After reading the code more carefully, the 404 status code is excluded by that if statement, yet for some reason it is still being treated as an error.

Are you talking about Thanos's code or Azure's library? I would vote in favor of making ShouldLog always return false because, from Thanos's perspective, we know better when to log; e.g. in this case we shouldn't, since that error is expected.

@GiedriusS
Member

cc @vglafirov do you have any opinion on this?

@joaosilva15
Author

The first part, about disabling the logs, was for Thanos. The 404 being treated as an error is on the Azure library side.

I wouldn't mind disabling the logs, as most of the time they don't give more information than Thanos does. Maybe only activate the logs in the Azure library in debug mode?

@GiedriusS
Member

> The first part, about disabling the logs, was for Thanos. The 404 being treated as an error is on the Azure library side.
>
> I wouldn't mind disabling the logs, as most of the time they don't give more information than Thanos does. Maybe only activate the logs in the Azure library in debug mode?

Yep, I think that would be the best solution here, given how the library works. We could do something similar to what S3 has: https://github.com/thanos-io/thanos/pull/937/files#diff-23ce6f3804b08b263e06e0c152bf4ee5R54-R55. IMHO, adding another log-level setting would be a bit too much, since the log levels there do not match up with what Thanos has: https://github.com/thanos-io/thanos/blob/master/cmd/thanos/main.go#L50-L51 vs. https://github.com/Azure/azure-pipeline-go/blob/master/pipeline/core.go#L59-L80, and confusion would inevitably arise. And we'd essentially have two options for the same thing :/

What are your thoughts on this?

@joaosilva15
Author

Yeah, I also think that adding another flag would create confusion and wouldn't be necessary. I would say we only activate the logs in the library when the Thanos log level is debug. I'm not sure what level should be applied to the library itself, though. Maybe also debug? I think that makes the most sense.

I just think that, unless you really want to check what's happening in the library because of some strange behaviour, the log lines are irrelevant, as Thanos already gives us that information.
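
To illustrate the idea, here is a hypothetical sketch (not an existing Thanos option); the `debug` flag is assumed to be derived from Thanos's own log level:

```go
package azure

import "github.com/Azure/azure-pipeline-go/pipeline"

// logOptions forwards everything to the SDK logger when Thanos runs in debug
// mode and otherwise keeps only fatal events.
func logOptions(debug bool) pipeline.LogOptions {
	return pipeline.LogOptions{
		ShouldLog: func(level pipeline.LogLevel) bool {
			if debug {
				return level <= pipeline.LogDebug
			}
			return level <= pipeline.LogFatal
		},
	}
}
```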

@stale

stale bot commented Jun 10, 2020

Hello 👋 Looks like there was no activity on this issue for last 30 days.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗
If there will be no activity for next week, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.

@stale stale bot added the stale label Jun 10, 2020
@joaosilva15
Author

Bumping the issue.

Yes, this is still happening. We also discovered that this happens even when changing the log level, because of this line: https://github.com/Azure/azure-storage-blob-go/blob/master/azblob/zc_policy_request_log.go#L76

The requests made by Thanos seem to be marked with forceLog set to true, which bypasses the configured log level.

@stale stale bot removed the stale label Jun 12, 2020
@stale

stale bot commented Jul 12, 2020

Hello 👋 Looks like there was no activity on this issue for last 30 days.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗
If there will be no activity for next week, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.

@stale stale bot added the stale label Jul 12, 2020
@joaosilva15
Author

Still happening

@stale stale bot removed the stale label Jul 13, 2020
@stale

stale bot commented Aug 12, 2020

Hello 👋 Looks like there was no activity on this issue for last 30 days.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗
If there will be no activity for next week, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.

@stale stale bot added the stale label Aug 12, 2020
@bwplotka
Member

Any ideas whether this is something to improve in the Azure invocation on the Thanos side, or in the Azure lib itself? (:

Help wanted!

@joaosilva15
Author

I didn't investigate deeply enough to find the true issue, but it seemed like the Azure lib was not categorizing the error properly (as per the links above, a 404 should not be logged). I tried reducing the log level to None, but the lib still forces the logs to be written.

With that said, I think the solution is either to check why the error is not being categorized correctly or to find a way to drop the logs completely... I'll try to investigate this further. I still didn't get the time to pick this up 😞

@dbrennand

dbrennand commented Aug 18, 2020

Hey guys, I am also seeing this in the Thanos compactor logs. Anything I can do to help? 😄

@stale

stale bot commented Oct 17, 2020

Hello 👋 Looks like there was no activity on this issue for the last two months.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗
If there will be no activity in the next two weeks, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.

@stale stale bot added the stale label Oct 17, 2020
kakkoyun pushed a commit that referenced this issue Oct 19, 2020
Ever since the introduction of the deletion marker, Thanos compact has
been filling the logs with Azure blob storage exceptions. As discussed
in #2565, the error logging should be handled by Thanos itself, so this
patch turns off standard logging from the Azure library.

However, [due to the way logging is set up in the library][1], it will
always log errors to syslog, regardless of how `ShouldLog` is
configured. As such, and until that can be a configurable behavior, care
should be taken to handle that from the syslog side as well.

[1]: https://github.com/Azure/azure-storage-blob-go/blob/48358e1de5110852097ebbc11c53581d64d47300/azblob/zc_policy_request_log.go#L100-L102

Signed-off-by: Pedro Araujo <phcrva@gmail.com>
@kakkoyun
Member

@joaosilva15 Could you check if this has been fixed by #3331?

@stale stale bot removed the stale label Oct 19, 2020
@pharaujo
Contributor

pharaujo commented Oct 19, 2020

Please be aware that, due to ForceLog shenanigans, when running outside a container the stack traces will still hit the logs (through syslog). I'm currently dropping these using an rsyslog filter.
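
For anyone else needing the same workaround, a rule along these lines can drop the forced SDK traces before they reach disk (illustrative only; the program name and match string are assumptions about your setup):

```
# rsyslog: drop Azure SDK request-log stack traces from the Thanos compactor.
if $programname == 'thanos' and $msg contains 'azure-storage-blob-go' then stop
```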

@joaosilva15
Author

@kakkoyun What @pharaujo said. This change makes a workaround possible but does not fix the underlying issue, which is in the Azure SDK.

@pharaujo
Contributor

I think I found the issue in the Azure blob lib, but also found a way to disable ForceLog, so I'll submit another MR here to add that.
