Add a lambda to replicate dynamo CDC stream to ClickHouse #5419
Here is the working ingestion strategy that I have:
1. Use `tools/rockset_migration/dynamodb2s3.py` to dump the content of a DynamoDB table to an S3 bucket in JSON format. For example, I tested it with `python dynamodb_2_s3.py --dynamodb-table torchci-issues --s3-bucket ossci-raw-job-status --s3-path debug-clickhouse-ingest/Manual/`.
2. Ingest `s3://ossci-raw-job-status/debug-clickhouse-ingest/Manual/*.json` with an S3 ClickPipe; remember to use `dynamoKey` as the sort key and `ReplacingMergeTree` as the table engine. Together they support the way we mutate GitHub records. The main reason I use an S3 ClickPipe here is that it automatically creates the table for me from the JSON structure.
3. `dynamo-clickhouse-replicator` will sync all future changes to the corresponding ClickHouse table.

I could add a deployment workflow later if we decide to go this route.
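To illustrate why `ReplacingMergeTree` with `dynamoKey` as the sort key fits records that get mutated, here is a minimal pure-Python model of the merge behavior, assuming no version column is configured (in that case ClickHouse keeps the most recently inserted row among rows with the same sorting key). The `state` field and the second key are made up for the example; this is only a sketch of the semantics, not the engine itself.

```python
from typing import Any

Row = dict[str, Any]


def replacing_merge(rows: list[Row], sort_key: str = "dynamoKey") -> list[Row]:
    """Model a ReplacingMergeTree merge without a version column.

    Among rows that share the same sorting key, only the last inserted
    row survives the merge.
    """
    latest: dict[Any, Row] = {}
    for row in rows:  # rows are assumed to be in insertion order
        latest[row[sort_key]] = row
    return list(latest.values())


# Hypothetical records: the same issue is written twice as it changes.
inserted = [
    {"dynamoKey": "pytorch/pytorch/130498", "state": "open"},
    {"dynamoKey": "pytorch/pytorch/130499", "state": "open"},
    {"dynamoKey": "pytorch/pytorch/130498", "state": "closed"},
]

merged = replacing_merge(inserted)
```

Every update can therefore be written as a plain insert, and the older copies of the same `dynamoKey` are dropped once a merge runs.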
Failed attempts that didn't work out well:
merges, this is still feasible
`dynamodb2s3.py` and the lambda address this by un-marshaling the records before inserting them into ClickHouse. Ref
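As a rough illustration of what un-marshaling means here, the sketch below converts a DynamoDB-typed item (attribute values wrapped in type tags such as `S`, `N`, `M`, `L`) into plain Python values that serialize to ordinary JSON. It is not the code from this PR: the function names and the sample fields are hypothetical, and `boto3.dynamodb.types.TypeDeserializer` is an existing alternative for the same job.

```python
from typing import Any


def unmarshal_value(attr: dict[str, Any]) -> Any:
    """Convert one DynamoDB attribute value, e.g. {"S": "x"}, to a plain value."""
    (type_tag, value), = attr.items()  # each attribute has exactly one type tag
    if type_tag == "S":
        return value
    if type_tag == "N":
        # Numbers arrive as strings; fall back to float for non-integers.
        try:
            return int(value)
        except ValueError:
            return float(value)
    if type_tag == "BOOL":
        return value
    if type_tag == "NULL":
        return None
    if type_tag == "L":
        return [unmarshal_value(v) for v in value]
    if type_tag == "M":
        return {k: unmarshal_value(v) for k, v in value.items()}
    if type_tag == "SS":
        return list(value)
    if type_tag == "NS":
        return [unmarshal_value({"N": v}) for v in value]
    if type_tag in ("B", "BS"):
        return value  # binary payloads are left untouched in this sketch
    raise ValueError(f"Unknown DynamoDB type tag: {type_tag}")


def unmarshal_item(item: dict[str, Any]) -> dict[str, Any]:
    """Un-marshal every attribute of a DynamoDB item."""
    return {name: unmarshal_value(attr) for name, attr in item.items()}


# Hypothetical item, shaped like a stream record's NewImage.
image = {
    "dynamoKey": {"S": "pytorch/pytorch/130498"},
    "number": {"N": "130498"},
    "labels": {"L": [{"M": {"name": {"S": "triaged"}}}]},
    "locked": {"BOOL": False},
    "closed_at": {"NULL": True},
}

row = unmarshal_item(image)
```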
`dynamoKey` of `pytorch/pytorch/130498`
Using the `FINAL` keyword returns the latest one: https://clickhouse.com/docs/en/sql-reference/statements/select/from
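A small sketch of how such a query could be built follows. Rows with the same sorting key may coexist until a background merge runs, and `FINAL` applies the merge at query time. The table name `issues` is an assumption (the PR text does not name the ClickHouse table), and `{key:String}` is ClickHouse's query-parameter placeholder, so the key value is passed separately rather than interpolated.

```python
import re

_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def latest_row_query(table: str, key_column: str = "dynamoKey") -> str:
    """Build a SELECT that reads the merged (latest) row for one key."""
    for name in (table, key_column):
        # Identifiers are interpolated, so only accept simple names.
        if not _IDENTIFIER.fullmatch(name):
            raise ValueError(f"Unsafe identifier: {name!r}")
    # FINAL goes right after the table name in the FROM clause.
    return f"SELECT * FROM {table} FINAL WHERE {key_column} = {{key:String}}"


# "issues" is a hypothetical table name used only for illustration.
query = latest_row_query("issues")
params = {"key": "pytorch/pytorch/130498"}
```

The `query` and `params` pair would then be handed to whichever client is in use, provided it supports ClickHouse's server-side parameter binding.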