This project builds a data pipeline for tracking and analyzing sales performance with AWS services. It creates a DynamoDB table, captures changes to it with Change Data Capture (CDC), moves those changes through Kinesis streams, stores the results in S3, and queries them with Amazon Athena. The pipeline uses the following technologies:
- Python
- DynamoDB
- DynamoDB Streams (CDC)
- Kinesis Stream
- EventBridge Pipes (for stream ingestion)
- Kinesis Firehose (to batch the stream)
- Lambda
- Athena
- S3
Data Generation Script
- A Python script generates synthetic sales data.
- The script uses the boto3 library to connect to DynamoDB.
- The script is included in the repository (mock_data_generator_for_dynamodb.py).
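A minimal sketch of what such a generator might look like. The field names follow the classifier pattern used later in this README (order_id, product_name, quantity, price); the product list, the value ranges, and the function names are assumptions for illustration, and the actual script in the repository may differ.

```python
import random
import uuid
from decimal import Decimal
from typing import Optional

# Hypothetical product catalogue; the real script may use different values.
PRODUCTS = ["Laptop", "Phone", "Tablet", "Headphones", "Monitor"]


def generate_order(rng: random.Random) -> dict:
    """Build one synthetic order item."""
    return {
        # Random 128-bit identifier formatted as a UUID string.
        "order_id": str(uuid.UUID(int=rng.getrandbits(128))),
        "product_name": rng.choice(PRODUCTS),
        "quantity": rng.randint(1, 5),
        # boto3 does not accept Python floats for DynamoDB numbers,
        # so the price is built as a Decimal (10.00 to 1000.00).
        "price": Decimal(rng.randint(1000, 100000)) / Decimal(100),
    }


def write_orders(table_name: str, count: int, seed: Optional[int] = None) -> None:
    """Write `count` synthetic orders to the given DynamoDB table."""
    import boto3  # imported here so generate_order works without AWS access

    rng = random.Random(seed)
    table = boto3.resource("dynamodb").Table(table_name)
    # batch_writer buffers the puts and sends them in batches.
    with table.batch_writer() as batch:
        for _ in range(count):
            batch.put_item(Item=generate_order(rng))
```

Calling `write_orders(table_name, 100)` with the name of the table created in the next section would insert 100 items, assuming AWS credentials and a region are configured.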
DynamoDB Setup
- Created a DynamoDB database named sales-performance-outlook.
- Implemented Change Data Capture (CDC) to track updates and deletions of records.
- Created a DynamoDB table named Orders-data-table with order_id as the key.
- Enabled a DynamoDB stream to capture changes in the table (sales-performance-outlook), specifying which data to capture (old and/or new item images).
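Enabling the stream can also be done from code rather than the console. The sketch below uses the boto3 `update_table` call; the choice of `NEW_AND_OLD_IMAGES` as the view type is an assumption (the README only says old/new item data is captured), and the helper names are hypothetical.

```python
ALLOWED_VIEW_TYPES = {"KEYS_ONLY", "NEW_IMAGE", "OLD_IMAGE", "NEW_AND_OLD_IMAGES"}


def stream_settings(view_type: str = "NEW_AND_OLD_IMAGES") -> dict:
    """Build the StreamSpecification argument for update_table."""
    if view_type not in ALLOWED_VIEW_TYPES:
        raise ValueError(f"unsupported stream view type: {view_type}")
    return {"StreamEnabled": True, "StreamViewType": view_type}


def enable_stream(table_name: str, view_type: str = "NEW_AND_OLD_IMAGES") -> str:
    """Turn on the DynamoDB stream for a table and return the stream ARN."""
    import boto3  # imported here so stream_settings works without AWS access

    client = boto3.client("dynamodb")
    response = client.update_table(
        TableName=table_name,
        StreamSpecification=stream_settings(view_type),
    )
    return response["TableDescription"]["LatestStreamArn"]
```

With `NEW_AND_OLD_IMAGES`, each stream record carries the item both before and after the change, which is what makes updates and deletions visible downstream.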
Kinesis Stream
EventBridge Integration
Kinesis Firehose
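Firehose can call a Lambda function to transform records before they are written to S3. The sketch below follows the Firehose transformation contract (each input record has a `recordId` and base64 `data`; each output record returns the same `recordId`, a `result`, and base64 `data`). It assumes each payload is a DynamoDB stream record with a `dynamodb.NewImage` section in DynamoDB-typed JSON; the actual payload shape and transformation in this project may differ, and the helper names are hypothetical.

```python
import base64
import json

# Fields kept in the output, matching the classifier pattern in this README.
FIELDS = ("order_id", "product_name", "quantity", "price")


def _plain(attr: dict):
    """Unwrap one DynamoDB-typed attribute such as {"S": "x"} or {"N": "3"}."""
    (type_tag, value), = attr.items()
    if type_tag == "N":
        return int(value) if value.lstrip("-").isdigit() else float(value)
    return value


def flatten(record: dict) -> dict:
    """Reduce a stream record to a flat row; empty if there is no new image."""
    image = record.get("dynamodb", {}).get("NewImage", {})
    return {name: _plain(image[name]) for name in FIELDS if name in image}


def lambda_handler(event, context):
    output = []
    for item in event["records"]:
        payload = json.loads(base64.b64decode(item["data"]))
        row = flatten(payload)
        if not row:
            # For example a deletion, which carries no NewImage.
            output.append({"recordId": item["recordId"],
                           "result": "Dropped",
                           "data": item["data"]})
            continue
        # One JSON object per line keeps the S3 files easy to crawl and query.
        line = json.dumps(row) + "\n"
        output.append({"recordId": item["recordId"],
                       "result": "Ok",
                       "data": base64.b64encode(line.encode("utf-8")).decode("utf-8")})
    return {"records": output}
```

Dropping records without a new image is a design choice of this sketch; a pipeline that needs to analyze deletions would emit the old image instead.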
Glue
S3 Storage and Crawler
- Set up S3 as the destination for Kinesis Firehose, which stores the transformed data as files.
- A sample output file is stored in the repository directory (Output_Sample).
- Created a crawler with a JSON classifier to identify raw data patterns in the S3 bucket.
- Ran the crawler with a table name prefix of outlook_ to create a table in the sales-data-catalog database.
- Classifier pattern: "$.order_id,$.product_name,$.quantity,$.price"
- The classifier is there so that the crawler does not pick up raw data that violates this pattern.
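The classifier and crawler described above can be sketched with the boto3 Glue client. The JSON path, the outlook_ prefix, and the sales-data-catalog database name come from this README; the crawler name, classifier name, IAM role ARN, and S3 path are placeholders the caller has to supply, and the helper names are hypothetical.

```python
JSON_PATH = "$.order_id,$.product_name,$.quantity,$.price"


def classifier_request(name: str) -> dict:
    """Arguments for glue.create_classifier with a custom JSON classifier."""
    return {"JsonClassifier": {"Name": name, "JsonPath": JSON_PATH}}


def crawler_request(name: str, role_arn: str, s3_path: str,
                    classifier_name: str) -> dict:
    """Arguments for glue.create_crawler targeting the Firehose output."""
    return {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": "sales-data-catalog",
        "TablePrefix": "outlook_",
        "Classifiers": [classifier_name],
        "Targets": {"S3Targets": [{"Path": s3_path}]},
    }


def create_and_run(crawler_name: str, classifier_name: str,
                   role_arn: str, s3_path: str) -> None:
    """Create the classifier and crawler, then start the crawler."""
    import boto3  # imported here so the request builders work without AWS access

    glue = boto3.client("glue")
    glue.create_classifier(**classifier_request(classifier_name))
    glue.create_crawler(
        **crawler_request(crawler_name, role_arn, s3_path, classifier_name)
    )
    glue.start_crawler(Name=crawler_name)
```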
Athena Query
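As an illustration of the kind of query this step enables, the sketch below totals revenue per product and submits the query through the boto3 Athena client. The table name and the S3 output location are placeholders (the crawler decides the actual table name, which starts with outlook_), and the query assumes the crawler typed quantity and price as numeric columns.

```python
def revenue_query(table: str) -> str:
    """Build a query that totals revenue per product."""
    return (
        "SELECT product_name, "
        "SUM(quantity * price) AS revenue "
        f'FROM "{table}" '
        "GROUP BY product_name "
        "ORDER BY revenue DESC"
    )


def run_query(table: str, output_location: str,
              database: str = "sales-data-catalog") -> str:
    """Submit the query to Athena and return its execution id."""
    import boto3  # imported here so revenue_query works without AWS access

    athena = boto3.client("athena")
    response = athena.start_query_execution(
        QueryString=revenue_query(table),
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    return response["QueryExecutionId"]
```

`start_query_execution` only submits the query; the results are written to the given S3 location once the query has finished.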
Permissions Management
- Granted IAM users the permissions required for DynamoDB, Kinesis, and EventBridge.
Conclusion
- The pipeline captures, processes, and analyzes sales data end to end.
- CDC tracks changes to records, giving a complete view of sales performance over time.
- Serverless services such as Athena and Lambda minimize infrastructure management.
- The project shows how multiple AWS services can be integrated into an end-to-end data processing and analytics solution.