Add a lambda to replicate dynamo CDC stream to ClickHouse #5419

Closed

@huydhn commented Jul 11, 2024

Here is the working ingestion strategy that I have:

  1. Bulk transfer all records from a DynamoDB table to a ClickHouse one:
    1. Use tools/rockset_migration/dynamodb2s3.py to dump the contents of a DynamoDB table to an S3 bucket in JSON format. For example, I tested it with python dynamodb_2_s3.py --dynamodb-table torchci-issues --s3-bucket ossci-raw-job-status --s3-path debug-clickhouse-ingest/Manual/
    2. Use S3 ClickPipe to ingest the data, for example s3://ossci-raw-job-status/debug-clickhouse-ingest/Manual/*.json. Remember to use dynamoKey as the sort key and ReplacingMergeTree as the table engine; together they support the way we mutate GitHub records. The main reason I use S3 ClickPipe here is that it automatically creates the table for me from the JSON structure.
  2. After the import, the lambda dynamo-clickhouse-replicator will sync all future changes to the corresponding ClickHouse table.
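
As a rough illustration of step 2, here is a minimal sketch of how a replicator could sort a DynamoDB stream event into rows to insert and keys to remove. It is not the code from this PR. The function name split_stream_event is hypothetical, the stream is assumed to be configured to include new images, and the table's key is assumed to be a string attribute named dynamoKey. How removals should be applied on the ClickHouse side is left open.

```python
from typing import Any, Dict, List, Tuple


def split_stream_event(
    event: Dict[str, Any],
) -> Tuple[List[Dict[str, Any]], List[str]]:
    """Split a DynamoDB stream event into images to insert and keys to remove.

    Hypothetical helper: the images are returned as-is (still in DynamoDB
    JSON), so they would need to be un-marshaled before being inserted.
    """
    upserts: List[Dict[str, Any]] = []
    removes: List[str] = []
    for record in event.get("Records", []):
        change = record["dynamodb"]
        if record["eventName"] in ("INSERT", "MODIFY"):
            # With ReplacingMergeTree an update is just another insert of the
            # full row; the newest row per sort key wins once rows are merged.
            upserts.append(change["NewImage"])
        elif record["eventName"] == "REMOVE":
            # Assumption: the key is a string attribute named dynamoKey.
            removes.append(change["Keys"]["dynamoKey"]["S"])
    return upserts, removes
```

Treating INSERT and MODIFY the same way is what makes the ReplacingMergeTree choice convenient: the lambda never has to issue an UPDATE, it only appends new versions of a row.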

I could add a deployment workflow later if we decide to go this route.

Attempts that didn't work out:

  1. I tried to use the ClickHouse JSON object type, but it has been deprecated: https://clickhouse.com/docs/en/sql-reference/data-types/object-data-type. There is an open issue to address this gap: RFC: Semistructured Columns ClickHouse/ClickHouse#54864
  2. I tried to create the table manually with CREATE TABLE, but it's very tedious given the complex structure of the GitHub webhook payload. For simpler use cases, e.g. merges, this is still feasible.
  3. I tried the DynamoDB to S3 auto-export route https://docs.aws.amazon.com/prescriptive-guidance/latest/dynamodb-full-table-copy-options/amazon-s3.html but the export is in DynamoDB JSON format, which confuses ClickHouse. The above script dynamodb2s3.py and the lambda address this by un-marshaling the records before inserting them into ClickHouse.
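
To make the un-marshaling step concrete, here is a minimal standard-library sketch of converting DynamoDB JSON (where every value is wrapped in a type tag such as S, N, BOOL, M or L) into plain JSON-like Python values. The names unmarshal_value and unmarshal_item are hypothetical and this is not the code from the script or the lambda. Binary values (B, BS) are passed through unchanged here, which is an assumption of the sketch.

```python
from typing import Any, Dict, Union


def _to_number(text: str) -> Union[int, float]:
    """DynamoDB sends numbers as strings; prefer int, fall back to float."""
    try:
        return int(text)
    except ValueError:
        return float(text)


def unmarshal_value(attr: Dict[str, Any]) -> Any:
    """Convert one tagged DynamoDB attribute value into a plain value."""
    ((type_tag, value),) = attr.items()  # exactly one type tag per value
    if type_tag == "S":
        return value
    if type_tag == "N":
        return _to_number(value)
    if type_tag == "BOOL":
        return bool(value)
    if type_tag == "NULL":
        return None
    if type_tag == "M":
        return {k: unmarshal_value(v) for k, v in value.items()}
    if type_tag == "L":
        return [unmarshal_value(v) for v in value]
    if type_tag == "SS":
        return list(value)
    if type_tag == "NS":
        return [_to_number(v) for v in value]
    if type_tag in ("B", "BS"):
        return value  # assumption: leave binary payloads untouched
    raise ValueError(f"Unknown DynamoDB type tag: {type_tag}")


def unmarshal_item(item: Dict[str, Any]) -> Dict[str, Any]:
    """Convert a whole DynamoDB item into a plain dictionary."""
    return {name: unmarshal_value(attr) for name, attr in item.items()}
```

For example, an item such as {"dynamoKey": {"S": "pytorch/pytorch/130498"}, "state": {"S": "closed"}} becomes {"dynamoKey": "pytorch/pytorch/130498", "state": "closed"}, which is the flat shape ClickHouse can infer a schema from.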

Ref

SELECT dynamoKey, state, labels FROM `torchci-issues` WHERE dynamoKey = 'pytorch/pytorch/130498'
pytorch/pytorch/130498	closed	[{"color":"f7e101","default":false,"description":"Related to continuous integration","id":"1300896147","name":"module: ci","node_id":"MDU6TGFiZWwxMzAwODk2MTQ3","url":"https://api.github.com/repos/pytorch/pytorch/labels/module:%20ci"},{"color":"AA5D26","default":false,"description":"","id":"5626213550","name":"unstable","node_id":"LA_kwDOA-j9z88AAAABT1k0rg","url":"https://api.github.com/repos/pytorch/pytorch/labels/unstable"}]
pytorch/pytorch/130498	open	[]

Using the FINAL keyword returns only the latest version of the row: https://clickhouse.com/docs/en/sql-reference/statements/select/from

SELECT dynamoKey, state, labels FROM `torchci-issues` FINAL WHERE dynamoKey = 'pytorch/pytorch/130498'
pytorch/pytorch/130498	closed	[{"color":"f7e101","default":false,"description":"Related to continuous integration","id":"1300896147","name":"module: ci","node_id":"MDU6TGFiZWwxMzAwODk2MTQ3","url":"https://api.github.com/repos/pytorch/pytorch/labels/module:%20ci"},{"color":"AA5D26","default":false,"description":"","id":"5626213550","name":"unstable","node_id":"LA_kwDOA-j9z88AAAABT1k0rg","url":"https://api.github.com/repos/pytorch/pytorch/labels/unstable"}]
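
The difference between the two queries can be modeled in a few lines of Python. This is only a toy model of the semantics, not how ClickHouse is implemented: ReplacingMergeTree deduplicates rows that share a sorting key during background merges, so a plain SELECT may still see older versions, while FINAL applies the deduplication at query time. The sketch assumes no version column is configured, in which case the most recently inserted row wins, and that the open row was inserted before the closed one. The function names are hypothetical.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


def select_plain(rows: List[Row], key_value: str) -> List[Row]:
    """Model of SELECT without FINAL: every unmerged version is visible."""
    return [row for row in rows if row["dynamoKey"] == key_value]


def select_final(rows: List[Row], key_value: str) -> List[Row]:
    """Model of SELECT ... FINAL: keep the last inserted row per sort key."""
    latest: Dict[str, Row] = {}
    for row in rows:  # rows are in insertion order
        latest[row["dynamoKey"]] = row  # later inserts replace earlier ones
    return [row for row in latest.values() if row["dynamoKey"] == key_value]


# Assumed insertion order: the issue was opened first, then closed.
rows = [
    {"dynamoKey": "pytorch/pytorch/130498", "state": "open"},
    {"dynamoKey": "pytorch/pytorch/130498", "state": "closed"},
]
```

With these rows, select_plain returns both versions and select_final returns only the closed one, mirroring the two query results above.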

@huydhn requested review from ZainRizvi and clee2000 July 11, 2024 00:16

vercel bot commented Jul 11, 2024

@huydhn is attempting to deploy a commit to the Meta Open Source Team on Vercel.

A member of the Team first needs to authorize it.

@facebook-github-bot added the CLA Signed label Jul 11, 2024

huydhn commented Sep 18, 2024

Closing this in favor of @clee2000's PR.

@huydhn closed this Sep 18, 2024