Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactor(batch-exports): Buffer batches while async flushing #25631

Merged
merged 7 commits into from
Oct 17, 2024

Conversation

tomasfarias
Copy link
Contributor

@tomasfarias tomasfarias commented Oct 16, 2024

Problem

If we keep a connection to ClickHouse open without activity for too long, it will be closed. It's unclear if the http client, pod OS, or ClickHouse is closing it. Based on ClickHouse logs, it would appear it's somewhere on the client side. I have tried all the knobs offered by aiohttp to keep the connection open for longer, but around 30s seems to be the limit (which, to be fair, is quite a lot).

Anyways, let's not keep the connection inactive for too long. For that, we will read from the socket into a buffer of record batches. This buffer will allow us to keep on reading even if other activities down stream are slow (like flushing).

Speaking of downstream, the consumer of this queue is a writing loop that writes to one file at a time. After each file is written, we allow the main thread to continue, and start writing to a new file. This is done because flushing to a destination can be slow, so we can continue writing new files instead of waiting sequentially.

Potential improvements would write to multiple files at a time, round-robin style, but then error recovery becomes a bit more complicated, so left that out as an improvement for later.

This solution will put more memory pressure and disk pressure in our pods, but our current resource utilization is about 20%-40% of total memory requested, so we have some to spare. The size of the in-memory buffer queue can be configured to never go above a max to ensure we don't OOM.

Changes

  • Adds a new AsyncRecordBatchProducer class to read record batches from an async interator of bytes into a queue.
  • Adds a new aproduce_query_as_arrow_record_batches method to our ClickHouseClient to asynchronously read record batches into a queue from a given ClickHouse query, using the class mentioned in the previous point.
  • Adds a new function start_produce_batch_export_record_batches to initialize the production of record batches in batch exports, using the previous two items.
  • Adds a new function consume_batch_export_record_batches to initialize consumption of record batches from a queue, writing them to a file, and issuing a flush when done.
  • Updated the bigquery batch export to use all of the above: We now start a writing loop that spawns consumption tasks every time we start a flush. This allows us to keep on writing to files even as we flush.
  • Bumped google-cloud-biquery library version. Saw some weird retry errors otherwise.

👉 Stay up-to-date with PostHog coding conventions for a smoother review.

Does this work well for both Cloud and self-hosted?

How did you test this code?

Ran bigquery batch export tests, mostly as-is except that now JSON values are automagically deserialized, so no need to call json.loads again. All passed:

================== 71 passed, 6 skipped in 522.19s (0:08:42) ===================

@tomasfarias tomasfarias marked this pull request as ready for review October 16, 2024 17:11
@tomasfarias tomasfarias requested a review from a team October 16, 2024 17:11
Copy link
Member

@EDsCODE EDsCODE left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work. As you mentioned, probably could have more tests around the new objects

@tomasfarias tomasfarias force-pushed the feat/non-blocking-flushes-bigquery-batch-export branch from 07fe70f to 619bc59 Compare October 17, 2024 09:22
@tomasfarias tomasfarias merged commit ef24aad into master Oct 17, 2024
86 checks passed
@tomasfarias tomasfarias deleted the feat/non-blocking-flushes-bigquery-batch-export branch October 17, 2024 10:02
Copy link

sentry-io bot commented Oct 20, 2024

Suspect Issues

This pull request was deployed and Sentry observed the following issues:

Did you find this useful? React with a 👍 or 👎

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants