refactor(batch-exports): Buffer batches while async flushing #25631
Problem
If we keep a connection to ClickHouse open without activity for too long, it will be closed. It's unclear whether the HTTP client, the pod OS, or ClickHouse itself is closing it; based on ClickHouse logs, it appears to be somewhere on the client side. I have tried all the knobs offered by `aiohttp` to keep the connection open for longer, but around 30s seems to be the limit (which, to be fair, is quite a lot).

Anyway, let's not keep the connection inactive for too long. To that end, we read from the socket into a buffer of record batches. This buffer allows us to keep reading even when downstream activities (like flushing) are slow.
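The producer side of this can be sketched as follows (all names here are illustrative, not the actual PostHog code), with a bounded `asyncio.Queue` acting as the record batch buffer between the socket reader and the slower downstream consumer:

```python
import asyncio
from typing import AsyncIterator, List, Optional


async def produce_record_batches(
    batches: AsyncIterator[bytes],
    queue: "asyncio.Queue[Optional[bytes]]",
) -> None:
    # Drain the connection as fast as the buffer allows; `put` only blocks
    # once `maxsize` batches are already waiting for the consumer.
    async for batch in batches:
        await queue.put(batch)
    await queue.put(None)  # sentinel: tell the consumer the stream is done


async def demo() -> List[bytes]:
    async def fake_stream() -> AsyncIterator[bytes]:
        # Stand-in for the bytes coming off the ClickHouse socket.
        for chunk in (b"batch-0", b"batch-1", b"batch-2"):
            yield chunk

    # Bounded queue: the producer backs off instead of growing without limit.
    queue: "asyncio.Queue[Optional[bytes]]" = asyncio.Queue(maxsize=2)
    producer = asyncio.create_task(produce_record_batches(fake_stream(), queue))

    received = []
    while (item := await queue.get()) is not None:
        received.append(item)
    await producer
    return received
```

Because the producer only awaits on `queue.put`, the connection keeps seeing activity as long as the buffer has room, regardless of how slow the flush side is.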
Speaking of downstream: the consumer of this queue is a writing loop that writes to one file at a time. After each file is written, we allow the main thread to continue and start writing to a new file. We do this because flushing to a destination can be slow, so we keep writing new files instead of waiting on each flush sequentially.
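The consumer loop above can be sketched like this (again, illustrative names and a simplified shape, using background tasks to stand in for the slow flush; the real code hands control back differently):

```python
import asyncio
from typing import List, Optional


async def consume_record_batches(
    queue: "asyncio.Queue[Optional[bytes]]",
    flush,
    max_file_bytes: int,
) -> None:
    # Illustrative consumer: write one "file" at a time; once it is full,
    # hand it to a background flush task and immediately start the next
    # file instead of waiting for the (slow) flush to finish.
    pending = []
    current: List[bytes] = []
    size = 0
    while (batch := await queue.get()) is not None:
        current.append(batch)
        size += len(batch)
        if size >= max_file_bytes:
            pending.append(asyncio.create_task(flush(b"".join(current))))
            current, size = [], 0
    if current:  # flush whatever remains after the sentinel
        pending.append(asyncio.create_task(flush(b"".join(current))))
    await asyncio.gather(*pending)


async def demo() -> List[bytes]:
    flushed: List[bytes] = []

    async def fake_flush(data: bytes) -> None:
        flushed.append(data)  # stand-in for a slow upload to the destination

    queue: "asyncio.Queue[Optional[bytes]]" = asyncio.Queue()
    for batch in (b"aaaa", b"bbbb", b"cc"):
        queue.put_nowait(batch)
    queue.put_nowait(None)  # end-of-stream sentinel

    await consume_record_batches(queue, fake_flush, max_file_bytes=8)
    return flushed
```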
A potential improvement would be to write to multiple files at a time, round-robin style, but error recovery then becomes a bit more complicated, so I left that out as an improvement for later.
This solution will put more memory and disk pressure on our pods, but our current resource utilization is about 20%-40% of total memory requested, so we have some to spare. The size of the in-memory buffer queue can be capped to ensure we don't OOM.
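For instance, one rough (hypothetical, not the PR's actual configuration) way to derive the queue bound from a memory budget and an estimated record batch size:

```python
def buffer_queue_maxsize(memory_budget_bytes: int, est_batch_bytes: int) -> int:
    # Hypothetical helper: cap the buffer so that even if the consumer
    # stalls completely, buffered record batches stay within the budget.
    return max(1, memory_budget_bytes // est_batch_bytes)


# e.g. a 512 MiB budget with ~8 MiB record batches allows 64 buffered batches
maxsize = buffer_queue_maxsize(512 * 1024 * 1024, 8 * 1024 * 1024)
```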
Changes
- A new `AsyncRecordBatchProducer` class to read record batches from an async iterator of bytes into a queue.
- An `aproduce_query_as_arrow_record_batches` method on our `ClickHouseClient` to asynchronously read record batches from a given ClickHouse query into a queue, using the class mentioned in the previous point.
- `start_produce_batch_export_record_batches` to initialize the production of record batches in batch exports, using the previous two items.
- `consume_batch_export_record_batches` to initialize consumption of record batches from a queue, writing them to a file and issuing a flush when done.
- Bumped the `google-cloud-bigquery` library version. Saw some weird retry errors otherwise.
Does this work well for both Cloud and self-hosted?
How did you test this code?
Ran BigQuery batch export tests, mostly as-is, except that JSON values are now automagically deserialized, so there's no need to call `json.loads` again. All passed: