Streaming postings decoding leads to OOMs #6434

Closed
fpetkovski opened this issue Jun 9, 2023 · 7 comments · Fixed by #6475

Comments

fpetkovski (Contributor) commented Jun 9, 2023

Object Storage Provider: GCS

What happened:

We have certain queries that have very high postings cardinality, like:

sum(container_memory_working_set_bytes{namespace=~"<namespace>", container!="", pod!=""})

The streaming postings decoding seems to hold on to one snappy decoder for each posting, so for the container and pod labels we end up with a huge number of open decoders even before the merge starts.

This leads to stores OOMing instantly, because each snappy reader uses a 64KB buffer.
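
To illustrate the shape of the problem, here is a simplified sketch (using github.com/golang/snappy for brevity, not the exact store-gateway code path): every matched key gets its own streaming reader, and each reader allocates its decode buffers up front.

```go
package main

import (
	"bytes"
	"fmt"
	"io"

	"github.com/golang/snappy"
)

func main() {
	// Pretend every label value matched by container!="" / pod!="" has its
	// own snappy-compressed postings list.
	compressed := make([][]byte, 100_000)
	for i := range compressed {
		var buf bytes.Buffer
		w := snappy.NewBufferedWriter(&buf)
		fmt.Fprintf(w, "postings-%d", i)
		w.Close()
		compressed[i] = buf.Bytes()
	}

	// One streaming reader per key. golang/snappy's NewReader allocates its
	// decode buffers eagerly (the stream block size is 64KiB), so all of
	// these buffers stay alive until the merge has consumed every reader.
	readers := make([]io.Reader, 0, len(compressed))
	for _, c := range compressed {
		readers = append(readers, snappy.NewReader(bytes.NewReader(c)))
	}
	// Gigabytes of buffers are pinned before any merging happens.
	fmt.Println("open decoders:", len(readers))
}
```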

What you expected to happen:

I would like some mechanism for more controlled memory management. Maybe we should choose the decoding strategy based on the size of the compressed data, or alternatively based on the number of keys.
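
For example, something along these lines (the names and thresholds below are hypothetical, not an existing Thanos API):

```go
package main

import "fmt"

// Hypothetical thresholds, not existing Thanos settings.
const (
	maxStreamingKeys  = 1024     // assumption: tune per deployment
	maxStreamingBytes = 64 << 20 // assumption: 64 MiB of compressed postings
)

// chooseStrategy picks between streaming and eager decoding based on how
// many postings keys matched and how much compressed data they hold.
func chooseStrategy(numKeys int, compressedBytes int64) string {
	if numKeys > maxStreamingKeys || compressedBytes > maxStreamingBytes {
		// Too many keys: decode each posting fully with a recycled decoder
		// instead of keeping one open reader (and buffer) per key.
		return "eager"
	}
	// Few keys: per-reader buffers are affordable and we can stream-merge
	// without materializing whole postings lists.
	return "streaming"
}

func main() {
	fmt.Println(chooseStrategy(50_000, 8<<20)) // eager
	fmt.Println(chooseStrategy(12, 1<<20))     // streaming
}
```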

How to reproduce it (as minimally and precisely as possible):

The query above could be sufficient.

Anything else we need to know:

Screenshot from a small staging environment when running the query
[screenshot]

yeya24 (Contributor) commented Jun 9, 2023

Is this caused by #6303? Wouldn't the old format have the same problem?
If we could include postings as part of the index header, would that solve this problem, and would it increase disk space a lot? At least we wouldn't have to deal with the postings cache.

fpetkovski (Contributor, Author) commented Jun 9, 2023

I believe it was introduced by #6303. With the old format we decode one posting fully at a time, and we recycle the snappy decoder. With the streaming decoding we keep all decoders open until everything is merged, or at least that's how I understand it. Since we open one decoding reader per posting, for a matcher like container!="" we end up with many open decoders, and each one has a footprint of 64KB.

I think that with the query above, decoding postings fully has a lower footprint than allocating buffers for decoders, so we end up being worse off.
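
For illustration, the old pattern roughly looks like this (a minimal sketch assuming block-format snappy, not the actual store code):

```go
package main

import (
	"fmt"

	"github.com/golang/snappy"
)

// decodeAll decodes one postings list at a time into a single recycled
// buffer, so the peak footprint is roughly one decoded list instead of one
// open streaming reader per key.
func decodeAll(compressed [][]byte) error {
	var dst []byte // reused across keys; grows to the largest decoded list
	for _, c := range compressed {
		n, err := snappy.DecodedLen(c)
		if err != nil {
			return err
		}
		if cap(dst) < n {
			dst = make([]byte, n)
		}
		decoded, err := snappy.Decode(dst[:n], c)
		if err != nil {
			return err
		}
		// ... merge/consume `decoded` here before moving to the next key ...
		_ = decoded
	}
	return nil
}

func main() {
	sample := [][]byte{snappy.Encode(nil, []byte("postings for one label value"))}
	fmt.Println(decodeAll(sample))
}
```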

GiedriusS (Member) commented

Thanks for the report. Let me try to write an optimized routine. It should be pretty straightforward given that we have preallocated []byte slices.

yeya24 (Contributor) commented Jun 15, 2023

I think we saw the same thing: 40GB when streaming-decoding postings.

[screenshot]

GiedriusS (Member) commented Jun 15, 2023

I think a simple fix would be to try to estimate the block size instead of always using the default block size, which is 64KiB (65536 bytes). The only problem is that the reader size needs to be passed to NewReader, so we would have to use a different sync.Pool than the one used for gRPC calls. Perhaps it would make sense to keep a running average of postings list sizes and then use that as the block size when creating a new reader?

Creating a new input buffer seems to be unavoidable, because in Snappy there can be references to earlier uncompressed data, so we need to keep the last bytes of uncompressed output.
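
Something roughly like this, perhaps (a hypothetical sketch, not the merged fix):

```go
package main

import (
	"sync"
	"sync/atomic"
)

// bufSizeEstimator tracks a running average of decoded postings-list sizes
// and hands out buffers of roughly that size from a dedicated sync.Pool,
// instead of always paying the 64KiB default block size.
type bufSizeEstimator struct {
	totalBytes atomic.Int64
	samples    atomic.Int64
	pool       sync.Pool
}

func newBufSizeEstimator() *bufSizeEstimator {
	e := &bufSizeEstimator{}
	e.pool.New = func() any {
		b := make([]byte, e.estimate())
		return &b
	}
	return e
}

// observe records the decoded size of one postings list.
func (e *bufSizeEstimator) observe(decodedBytes int) {
	e.totalBytes.Add(int64(decodedBytes))
	e.samples.Add(1)
}

// estimate returns the running average, clamped to a sane range.
func (e *bufSizeEstimator) estimate() int {
	n := e.samples.Load()
	if n == 0 {
		return 4 << 10 // assumption: 4 KiB starting point
	}
	avg := int(e.totalBytes.Load() / n)
	const minSize, maxSize = 1 << 10, 64 << 10
	if avg < minSize {
		return minSize
	}
	if avg > maxSize {
		return maxSize
	}
	return avg
}

func (e *bufSizeEstimator) get() *[]byte  { return e.pool.Get().(*[]byte) }
func (e *bufSizeEstimator) put(b *[]byte) { e.pool.Put(b) }

func main() {
	est := newBufSizeEstimator()
	est.observe(8 << 10)
	buf := est.get()
	defer est.put(buf)
	_ = buf
}
```

One caveat with pooling buffers like this is that the pool hands back whatever size was current when each buffer was allocated, which lines up with the point above about needing a pool separate from the one used for gRPC.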

yeya24 (Contributor) commented Jul 12, 2023

Hi @fpetkovski, have you tested the latest Thanos with #6475? Does that PR fix the OOM kill issue?

fpetkovski (Contributor, Author) commented Jul 12, 2023

We haven't had a chance to test this yet, but I will report back if we see more OOMs once we upgrade.
