CDC Hadoop Dataflow

A low-latency, multi-tenant Change Data Capture (CDC) pipeline that continuously replicates data from OLTP (MySQL) to OLAP (NoSQL) systems with no impact on the source.

This project demonstrates how to build a dataflow pipeline that moves data from operational databases (MySQL, Oracle) to analytics databases (Hadoop, MongoDB, MarkLogic) in real time, using Change Data Capture (CDC), Kafka, and tools such as Apache NiFi, Kafka Streams, or Spark to process and ingest data into Hadoop.

Features

Capture changes from many data sources and types.
Feed data to many client types (real-time, slow/catch-up, full bootstrap).
Multi-tenant: can contain data from many different databases, support multiple consumers.
Non-intrusive architecture for change capture.
Both batch and near-real-time delivery.
Isolate fast consumers from slow consumers.
Isolate sources from consumers in terms of:
1. Schema changes
2. Physical layout changes
3. Speed mismatch
Change filtering: filter database changes at the database level, schema level, table level, and row/column level.
Buffer change records in Kafka for flexible consumption from an arbitrary point in the change stream, including full bootstrap of the entire data set.
Guaranteed in-commit-order, at-least-once delivery with high availability (at-least-once rather than exactly-once, so consumers may see duplicates).
Resilience and Recoverability
Schema-awareness
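
The change-filtering feature listed above can be pictured as a predicate applied to each change record. The following is a minimal Python sketch, assuming records shaped like the Maxwell JSON shown under Manual Testing. ChangeFilter and its fields are hypothetical names used for illustration; they are not part of this project or of Maxwell, and the sketch covers only database, table, and column filtering.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ChangeFilter:
    """Hypothetical filter: keeps only selected databases, tables, and columns."""
    databases: set = field(default_factory=set)     # empty set = allow all
    tables: set = field(default_factory=set)        # empty set = allow all
    drop_columns: set = field(default_factory=set)  # columns removed from "data"

    def apply(self, record: dict) -> Optional[dict]:
        """Return a filtered copy of the record, or None if it is excluded."""
        if self.databases and record.get("database") not in self.databases:
            return None
        if self.tables and record.get("table") not in self.tables:
            return None
        data = {k: v for k, v in record.get("data", {}).items()
                if k not in self.drop_columns}
        # Build a new dict so the caller's record is left untouched.
        return {**record, "data": data}


change_filter = ChangeFilter(databases={"test"}, tables={"shop"},
                             drop_columns={"phone_number"})
record = {"database": "test", "table": "shop", "type": "insert",
          "data": {"id": 4, "name": "aaa", "phone_number": "3331114444"}}
filtered = change_filter.apply(record)
other = change_filter.apply({"database": "test", "table": "orders", "data": {}})
print(filtered)
print(other)
```

Returning None for excluded records lets a consumer loop skip them with a single check, while the column filter produces a copy rather than mutating the incoming record.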

Setup

Install and Run MySQL

Install the source MySQL database and configure it with row-based replication, as per the instructions.
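
Row-based replication is a server-side setting, so it can help to check the configuration before starting the pipeline. The following is a minimal Python sketch, assuming the settings are written explicitly in the [mysqld] section of a my.cnf-style file and that the relevant options are server_id, log-bin, and binlog_format. The sample file content and the missing_settings helper are illustrative assumptions, not part of this project; newer MySQL versions may enable some of these by default, in which case an explicit entry is not strictly required.

```python
import configparser

# Assumed example configuration, not taken from this repository.
SAMPLE_MY_CNF = """
[mysqld]
server_id=1
log-bin=master
binlog_format=row
"""


def missing_settings(cnf_text: str) -> list:
    """Return a list of problems found in the [mysqld] section (empty if none)."""
    parser = configparser.ConfigParser(allow_no_value=True, strict=False,
                                       interpolation=None)
    parser.read_string(cnf_text)
    if not parser.has_section("mysqld"):
        return ["no [mysqld] section"]
    # MySQL treats '-' and '_' in option names as interchangeable.
    opts = {k.replace("-", "_"): (v or "") for k, v in parser.items("mysqld")}
    problems = []
    if "server_id" not in opts:
        problems.append("server_id is not set")
    if "log_bin" not in opts:
        problems.append("log_bin is not set")
    if opts.get("binlog_format", "").lower() != "row":
        problems.append("binlog_format is not ROW")
    return problems


print(missing_settings(SAMPLE_MY_CNF))
```

An empty list means the file explicitly sets all three options; any entries in the list name the option to review.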

Install and Run Kafka

Follow the instructions

Install and Run Maxwell

cd cdc/maxwell
# curl -L -0 https://github.com/zendesk/maxwell/releases/download/v1.0.0/maxwell-1.1.2.tar.gz | tar --strip-components=1 -zx -C .
curl -L -0 https://github.com/xmlking/maxwell/releases/download/1.1.2.1/maxwell-1.1.2.1-kafka-connect.tar.gz | tar --strip-components=1 -zx -C .

Run

cd cdc/maxwell

Run with stdout producer (for testing only)

bin/maxwell --user='maxwell' --password='XXXXXX' --host='127.0.0.1' --producer=stdout

Run with Kafka producer

bin/maxwell

Test

Manual Testing

If all goes well, you'll see Maxwell replaying your inserts:

mysql -u root -p

mysql> CREATE TABLE test.shop
       (
         id BIGINT(20) NOT NULL AUTO_INCREMENT,
         version BIGINT(20) NOT NULL,
         name VARCHAR(255) NOT NULL,
         owner VARCHAR(255) NOT NULL,
         phone_number VARCHAR(255) NOT NULL,
         primary key (id, name)
       );
mysql> INSERT INTO test.shop (version, name, owner, phone_number) values (0, 'aaa', 'bbb', '3331114444');
Query OK, 1 row affected (0.02 sec)

(maxwell)
{"database":"test","table":"shop","pk.id":4,"pk.name":"aaa"}
{"database":"test","table":"shop","type":"insert","ts":1458510224,"xid":33531,"commit":true,"data":{"owner":"bbb","name":"aaa","phone_number":"3331114444","id":4,"version":0}}
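
The two output lines above look like a message key (database, table, and primary-key columns) followed by the change record itself; that reading is an interpretation of the sample rather than something stated here. Under that assumption, a minimal Python sketch of how a consumer might decode them is shown below. The describe and primary_key helpers are hypothetical names, and ts is assumed to be a Unix timestamp in seconds.

```python
import json
from datetime import datetime, timezone

# The two sample lines printed by Maxwell above, copied verbatim.
KEY_LINE = '{"database":"test","table":"shop","pk.id":4,"pk.name":"aaa"}'
VALUE_LINE = ('{"database":"test","table":"shop","type":"insert","ts":1458510224,'
              '"xid":33531,"commit":true,"data":{"owner":"bbb","name":"aaa",'
              '"phone_number":"3331114444","id":4,"version":0}}')


def primary_key(key: dict) -> dict:
    """Collect the 'pk.<column>' fields of a key line into {column: value}."""
    return {k[len("pk."):]: v for k, v in key.items() if k.startswith("pk.")}


def describe(key_line: str, value_line: str) -> dict:
    """Decode a key/value pair into a small summary of the change."""
    key = json.loads(key_line)
    value = json.loads(value_line)
    return {
        "target": f'{value["database"]}.{value["table"]}',
        "operation": value["type"],
        "primary_key": primary_key(key),
        # Assumption: 'ts' is seconds since the Unix epoch.
        "committed_at": datetime.fromtimestamp(value["ts"], tz=timezone.utc),
        "row": value["data"],
    }


event = describe(KEY_LINE, VALUE_LINE)
print(event["target"], event["operation"], event["primary_key"])
# prints: test.shop insert {'id': 4, 'name': 'aaa'}
```

Keeping the primary key separate from the row data makes it straightforward for a downstream consumer to upsert by key, which is one common way to tolerate the duplicates that at-least-once delivery allows.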

Testing via Grails App

You can also use testApp to generate load.
