ddb-sync is a tool for syncing data from a set of source tables to a set of destination tables in DynamoDB. It can be configured to perform a backfill operation, stream consumption, or both, via CLI flags or a config file.
Run `go get github.com/instructure/ddb-sync` and `ddb-sync --help` for usage.
ddb-sync has two phases of work: backfilling and streaming. Backfill scans the source table and copies each record to the destination table. Stream consumes a pre-configured DynamoDB Stream and writes all record changes to the destination table.
If both streaming and backfill are configured for a table, the stream phase runs after the backfill completes.
Note: On actively written tables, the streaming phase is required to get new records synced.
ddb-sync does not verify consistency between the source and destination tables; an outside methodology should be used to verify that the data has been delivered as expected.
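One coarse outside check is to compare the approximate item counts that DescribeTable reports for both tables. The sketch below uses the AWS SDK for Go; the regions and table names are placeholders for your own setup, and since DynamoDB refreshes ItemCount only periodically (roughly every six hours), treat this as a sanity check rather than proof of consistency.

```go
// A coarse check only: ItemCount from DescribeTable is approximate and is refreshed
// by DynamoDB only periodically (roughly every six hours).
package main

import (
	"fmt"
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/dynamodb"
)

func approxCount(svc *dynamodb.DynamoDB, table string) int64 {
	out, err := svc.DescribeTable(&dynamodb.DescribeTableInput{TableName: aws.String(table)})
	if err != nil {
		log.Fatalf("describe %s: %v", table, err)
	}
	return *out.Table.ItemCount
}

func main() {
	// Placeholder regions and table names; adjust to your own tables.
	src := dynamodb.New(session.Must(session.NewSession(aws.NewConfig().WithRegion("us-west-2"))))
	dst := dynamodb.New(session.Must(session.NewSession(aws.NewConfig().WithRegion("us-east-2"))))

	fmt.Printf("source ~%d items, destination ~%d items\n",
		approxCount(src, "ddb-sync-source"), approxCount(dst, "ddb-sync-dest"))
}
```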
- Table Migrations
  - Across regions
  - Across accounts
- Table synchronizations
ddb-sync is a Go binary.
Ensure you have a proper Go environment set up and then:

```
go get github.com/instructure/ddb-sync
```
The IAM permissions required by the roles the application uses to read from the source tables and write to the destination tables are listed below for each phase.
Table | Backfill Permissions | Stream Permissions |
---|---|---|
Source Table | dynamodb:DescribeTable<br>dynamodb:Scan | dynamodb:DescribeStream<br>dynamodb:DescribeTable<br>dynamodb:GetRecords (on `<src_table_arn>/stream/*`)<br>dynamodb:GetShardIterator |
Destination Table | dynamodb:BatchWriteItem | dynamodb:DeleteItem<br>dynamodb:PutItem |
In order to backfill a table, a policy resembling the following must be present.
```json
{
  "Action": [
    "dynamodb:DescribeTable",
    "dynamodb:Scan"
  ],
  "Resource": [
    "<input_table_arn>"
  ],
  "Effect": "Allow"
},
{
  "Action": [
    "dynamodb:BatchWriteItem"
  ],
  "Resource": [
    "<output_table_arn>"
  ],
  "Effect": "Allow"
}
```
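For a sense of how these two statements get exercised, here is a rough sketch of a backfill pass: paginated `Scan` calls against the source and `BatchWriteItem` calls (at most 25 items each) against the destination. This is not ddb-sync's actual implementation; role assumption, retries of unprocessed batch items, and pacing against provisioned capacity are all omitted, and the regions and table names are placeholders.

```go
// Rough illustration of a backfill pass: Scan the source, BatchWriteItem to the destination.
package main

import (
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/dynamodb"
)

func main() {
	// Placeholder regions and table names; a real run would also assume the role ARNs.
	src := dynamodb.New(session.Must(session.NewSession(aws.NewConfig().WithRegion("us-west-2"))))
	dst := dynamodb.New(session.Must(session.NewSession(aws.NewConfig().WithRegion("us-east-2"))))

	var start map[string]*dynamodb.AttributeValue
	for {
		// dynamodb:Scan on the source table, one page at a time.
		page, err := src.Scan(&dynamodb.ScanInput{
			TableName:         aws.String("ddb-sync-source"),
			ExclusiveStartKey: start,
		})
		if err != nil {
			log.Fatal(err)
		}

		// dynamodb:BatchWriteItem on the destination table; 25 items max per request.
		for i := 0; i < len(page.Items); i += 25 {
			end := i + 25
			if end > len(page.Items) {
				end = len(page.Items)
			}
			var writes []*dynamodb.WriteRequest
			for _, item := range page.Items[i:end] {
				writes = append(writes, &dynamodb.WriteRequest{PutRequest: &dynamodb.PutRequest{Item: item}})
			}
			if _, err := dst.BatchWriteItem(&dynamodb.BatchWriteItemInput{
				RequestItems: map[string][]*dynamodb.WriteRequest{"ddb-sync-dest": writes},
			}); err != nil {
				log.Fatal(err)
			}
		}

		if len(page.LastEvaluatedKey) == 0 {
			break // no more pages
		}
		start = page.LastEvaluatedKey
	}
}
```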
In order to stream a table, a policy resembling the following must be present.
```json
{
  "Action": [
    "dynamodb:DescribeStream",
    "dynamodb:DescribeTable",
    "dynamodb:GetShardIterator"
  ],
  "Resource": [
    "<input_table_arn>"
  ],
  "Effect": "Allow"
},
{
  "Action": [
    "dynamodb:GetRecords"
  ],
  "Resource": [
    "<input_table_arn>/stream/*"
  ],
  "Effect": "Allow"
},
{
  "Action": [
    "dynamodb:DeleteItem",
    "dynamodb:PutItem"
  ],
  "Resource": [
    "<output_table_arn>"
  ],
  "Effect": "Allow"
}
```
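Likewise, the stream-side statements map onto the DynamoDB Streams read loop. The condensed sketch below (again, not ddb-sync's actual code) shows where each permission is exercised; the region and stream ARN are placeholders, and shard lineage ordering, DescribeStream paging, checkpointing, and the destination writes themselves are omitted.

```go
// Condensed DynamoDB Streams read loop: DescribeStream, GetShardIterator, GetRecords.
package main

import (
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/dynamodbstreams"
)

func main() {
	// Placeholder region and stream ARN; DescribeTable on the source table reports the
	// real one as LatestStreamArn.
	streamArn := "arn:aws:dynamodb:us-west-2:<account_num>:table/ddb-sync-source/stream/<label>"
	svc := dynamodbstreams.New(session.Must(session.NewSession(aws.NewConfig().WithRegion("us-west-2"))))

	// dynamodb:DescribeStream
	desc, err := svc.DescribeStream(&dynamodbstreams.DescribeStreamInput{StreamArn: aws.String(streamArn)})
	if err != nil {
		log.Fatal(err)
	}

	for _, shard := range desc.StreamDescription.Shards {
		// dynamodb:GetShardIterator
		it, err := svc.GetShardIterator(&dynamodbstreams.GetShardIteratorInput{
			StreamArn:         aws.String(streamArn),
			ShardId:           shard.ShardId,
			ShardIteratorType: aws.String(dynamodbstreams.ShardIteratorTypeTrimHorizon),
		})
		if err != nil {
			log.Fatal(err)
		}

		// dynamodb:GetRecords (on <src_table_arn>/stream/*). NextShardIterator is nil once a
		// closed shard has been drained; on an open shard this polls indefinitely, so a real
		// consumer would sleep between empty reads.
		for iter := it.ShardIterator; iter != nil; {
			out, err := svc.GetRecords(&dynamodbstreams.GetRecordsInput{ShardIterator: iter})
			if err != nil {
				log.Fatal(err)
			}
			for _, r := range out.Records {
				// This is where a syncer would issue dynamodb:PutItem (INSERT/MODIFY, from
				// r.Dynamodb.NewImage) or dynamodb:DeleteItem (REMOVE, from r.Dynamodb.Keys)
				// against the destination table.
				log.Printf("%s %v", aws.StringValue(r.EventName), r.Dynamodb.Keys)
			}
			iter = out.NextShardIterator
		}
	}
}
```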
ddb-sync is probably most useful when used with a config file. You can use a config file to run many concurrent operations together. Alternatively, you can provide a single operation as flags to the CLI. Either a config file or a combination of input and output options is required. In addition, if using the flags, the input and output tables must differ by name or region, or a role ARN must be provided.
The ddb-sync configuration file format is YAML. It describes a plan object containing an array of operations, each specifying an input and an output table by table name and region, along with the role ARN used to access it with the appropriate permissions listed above.
An example is below.
```yaml
---
plan:
  - input:
      table: ddb-sync-source
      region: us-west-2
      role_arn: arn:aws:iam::<account_num>:role/ddb-sync-READ_ONLY_SOURCE
    output:
      table: ddb-sync-dest
      region: us-east-2
      role_arn: arn:aws:iam::<account_num>:role/ddb-sync_WRITE_ONLY_DEST
    stream:
      disabled: true
    backfill:
      disabled: false
      segments: 0,1,2       # This is 0-indexed, 0 through 3 are valid in this case
      total_segments: 4
  - input:
      table: ddb-sync-source-2
      region: us-west-2
      role_arn: arn:aws:iam::<account_num>:role/ddb-sync-READ_ONLY_SOURCE
    output:
      table: ddb-sync-dest-2
      region: us-east-2
      role_arn: arn:aws:iam::<account_num>:role/ddb-sync_WRITE_ONLY_DEST
```
Note: Be sure your source tables have provisioned read capacity and your destination tables have provisioned write capacity before using the tool. To optimize performance, adjust the provisioned read and write capacity. Be cautious about table sharding constraints.
The tool's output notes the consumed write capacity units, which can help you provision and scale your tables while using the tool. See the output section for more details.
It is highly recommended that you run this utility on an EC2 instance with the proper permissions attached to the instance profile. This will provide predictable network performance, consistent compute resources, and keep your data off the coffee shop router.
If you have a partitioned source table, you may want to consider specifying `backfill-total-segments` to optimize your capacity to scan your source table. A typical value for this would be ~N, where N is the number of partitions in your source table.
If you are limited by the I/O capacity of an EC2 instance, you can consider splitting the backfill segments across multiple EC2 instances. This is done by specifying the same `total_segments` count on every instance and giving a unique set of `backfill-segments` to each separate instance. This configuration requires that all backfill segments complete before invoking a stream operation, so in this case stream must be disabled and invoked as a separate operation afterwards.
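For context, these segment options correspond directly to the `Segment` and `TotalSegments` parameters of DynamoDB's parallel Scan API. The helper below is purely illustrative (it is not part of ddb-sync and reuses the aws-sdk-go imports from the backfill sketch above); it shows how every instance shares the same total while owning its own segment numbers.

```go
// scanSegment issues one page of a parallel Scan. backfill-total-segments maps onto
// TotalSegments and backfill-segments onto Segment: all instances share the same
// TotalSegments value and each owns a disjoint set of Segment values.
func scanSegment(src *dynamodb.DynamoDB, segment, totalSegments int64) (*dynamodb.ScanOutput, error) {
	return src.Scan(&dynamodb.ScanInput{
		TableName:     aws.String("ddb-sync-source"), // placeholder table name
		Segment:       aws.Int64(segment),            // e.g. 0, 1 or 2 on instance A; 3 on instance B
		TotalSegments: aws.Int64(totalSegments),      // e.g. 4, matching total_segments above
	})
}
```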
Invoke the compiled binary and provide options for a run:

```
ddb-sync <cli-options>
```
The CLI options are listed below:
```
  --config-file string             Filename for configuration yaml
  --input-region string            The input region
  --input-role-arn string          ARN of the input role
  --input-table string             Name of the input table
  --output-region string           The output region
  --output-role-arn string         ARN of the output role
  --output-table string            Name of the output table
  --backfill-segments ints         [Optional] Specify backfill scan segment(s) to target in this operation, 0-indexed. Example: "0,1,2". Prohibits streaming and "backfill-total-segments" must be specified.
  --backfill-total-segments int    Specify backfill 'Scan' concurrency segments
  --backfill                       Perform the backfill operation (default true)
  --stream                         Perform the streaming operation (default true)
```
To disable streaming or backfill, pass false:

```
ddb-sync --stream=false
```
Backfill-only operations will exit (0) upon completion of all steps. However, when streaming steps are enabled, the command will never exit on its own. When you've ascertained that your streaming operations have completed, you can send SIGINT for a clean shutdown.
ddb-sync logs to stdout and displays status messaging to stderr. Either can be redirected to a file for later viewing.
For instance, `ddb-sync --config-file <config_file.yml> 1>ddb-sync-log.out` would capture the log to a file called `ddb-sync-log.out`.
The log displays each operation on a table along with the status relevant to that operation. The format is below.

```
timestamp [src-table] ⇨ [dest-table]: Log statement
```
ddb-sync logs (with a timestamp prefix) on starting or completing actions as well as shard transitions in the stream. An example is below.
```
2018/06/19 11:05:27 [ddb-sync-source] ⇨ [ddb-sync-dest]: Backfill complete. 14301 items written over 5m9s
```
ddb-sync will add progress updates every 20 minutes to the log for historical convenience, e.g.:
```
2018/06/19 11:12:26
================= Progress Update ==================
[ddb-sync-source] ⇨ [ddb-sync-dest]: Streaming: 400 items written over 20m1s
[ddb-sync-source-2] ⇨ [ddb-sync-dest-2]: Backfill in progress: 822 items written over 1m1s
====================================================
```
The bottom portion of the output given to the user is the status messaging. It describes each table, the operation being worked on, the write capacity unit usage rate for writes to the destination table, and the number of records written by each operation.
An example follows:
```
------------------------------------------------ Current Status ---------------------------------------------------
TABLE                     DETAILS                BACKFILL           STREAM                           RATES & BUFFER
⇨ ddb-sync-dest           47K items (29MiB)      -COMPLETE-         2019 written (~18h57m latent)    12 items/s ⇨ ◕ ⇨ 32 WCU/s
⇨ ddb-sync-dest-2         ~46K items (~26MiB)    -SKIPPED-          789 written (~46m35s latent)     9 items/s ⇨ ◕ ⇨ 17 WCU/s
⇨ ddb-sync-destination    ~46K items (~26MiB)    267164 written     -PENDING-                        49 items/s ⇨ ◕ ⇨ 50 WCU/s
```
You can use the Go script in contrib/dynamo_writer.go to help create some faked items. You'll need a table that conforms to the expected partition key schema. For instance, in inseng, you can test with the source table ddb-sync-source.
You can run the item writer to write items:

```
vgo run contrib/dynamo_writer.go <table_name>
```
Pull requests welcome for both bug fixes and features.
Our build system includes linting and a test suite, so be sure to include further tests as part of your pull request.