
Pinterest Data Pipeline

Real-time & historical data pipeline simulation

Project aim

This project is an educational exercise in hands-on big data gathering and processing (ETL) using common big data tools: Kafka, Spark, Airflow (both on premises & in AWS MWAA), AWS Kinesis and Databricks.

What it does

The Pinterest Data Pipeline takes example Pinterest user-generated data through an ETL process. This is handled in two ways: first, "legacy" data is moved from an external database to an S3 data store for cleansing and ad hoc reporting; secondly, data is handled with data streams.

Concepts learned/demonstrated

  • Using an EC2 instance via SSH, SSHFS and SCP
  • Local configuration of Kafka topics via the command line
  • Basic MSK configuration in the AWS console
  • Kinesis stream connectivity with API Gateway
  • Airflow to schedule jobs both on premises and in AWS (MWAA)
  • Use of Databricks to:
    • Connect to an AWS S3 bucket
    • Load a series of files into a Spark dataframe
    • Connect to an AWS Kinesis data stream
    • Use PySpark to clean the data/streams and provide basic query results for static data (see the sketch after this list)
    • Store the transformed data in the Databricks data lake
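
For illustration, here is a minimal PySpark sketch of the batch flow described in the Databricks bullets above, as it might run in a Databricks notebook. The bucket name, paths, column name and cleaning rule are illustrative placeholders rather than the project's actual values.

    # Minimal sketch: read JSON files from S3 into a Spark dataframe, apply a
    # simple cleaning step, and persist the result to the data lake.
    # Bucket, paths, column name and cleaning rule are illustrative placeholders.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, regexp_replace

    spark = SparkSession.builder.getOrCreate()  # provided as `spark` in Databricks

    # Load the JSON objects written to the bucket (e.g. by the Kafka S3 sink connector)
    df_pin = spark.read.json("s3a://example-pinterest-bucket/topics/example.pin/*.json")

    # Example cleaning: turn a follower count such as "10k" into the integer 10000
    df_clean = (
        df_pin
        .withColumn("follower_count", regexp_replace(col("follower_count"), "k", "000"))
        .withColumn("follower_count", col("follower_count").cast("int"))
    )

    # Persist the transformed data as a Delta table in the Databricks data lake
    df_clean.write.format("delta").mode("overwrite").saveAsTable("pinterest_posts_cleaned")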

Report Request Output

Since this project is connected directly to GitHub via the Databricks Repos feature, Databricks removes all notebook output before committing to GitHub. The reporting output can instead be seen in [this snapshot of the notebook](documentation_resources/AiCore%20Pinterest%20project.ipynb) taken locally from Databricks.

Architecture

System languages

While a good portion of the system could be handled with SQL, the focus was to maximise the use of Python on top of Spark to demonstrate the concepts learned. Only Python and some Bash scripts are used for development of this project.

File structure

The project has the following folder structure. The naming rule is that any script with a 0 as the first character of the file name is for initial configuration and should only be run once. More details are given in the "Set-up & usage instructions" section of this file. Note: the 0-prefixed set-up naming convention excludes files starting with 0x115306525, which is a naming requirement for this project's AWS access.

  • DataBricks
    • PinterestTransformations.py - data cleaning/transformation functions used in the Jupyter notebooks
    • AiCore Pinterest project.ipynb - first set of milestones simulating the legacy data handling and report requests
    • AiCore Read Kinesis.ipynb - cloud data stream handling using Databricks and AWS technologies
    • MWAA_S3_bucket
      • 0x115306525_dag.py - DAG file for Airflow in AWS (MWAA) to execute "AiCore Pinterest project.ipynb"
  • documentation_resources - notebooks with the reporting output & documentation support files
  • Kineses
    • 0create_kinesis_streams.py - code to be run once to create the Kinesis data streams in AWS (see the sketch after this list)
    • test_stream_api.py - simple tests to check the APIs are in place
    • user_posting_emulation_streaming.py - simulates user interaction with results going to Kinesis streams
  • legacy_simulation - see the "Set-up & usage instructions" section for more information on these files
    • run_this_to_send_messages_after_ec2_kafka_started.sh
    • run_on_ec2
      • 0setup_topics_and_connectors.sh
      • 1run_this_on_kafka_ec2.sh
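
As a rough illustration of the one-off step performed by 0create_kinesis_streams.py (referenced in the list above), creating the streams directly with boto3 could look like the sketch below. The stream names, shard count and region are placeholders, and the project's actual script may create the streams through other means (for example via the REST API set up in API Gateway).

    # Rough sketch of a one-off Kinesis stream creation step, assuming boto3
    # is used directly. Stream names, shard count and region are placeholders.
    import boto3

    kinesis = boto3.client("kinesis", region_name="us-east-1")

    for stream_name in ("example-pin-stream", "example-geo-stream", "example-user-stream"):
        kinesis.create_stream(StreamName=stream_name, ShardCount=1)
        print(f"Requested creation of stream: {stream_name}")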

Set-up & usage instructions

Legacy data project steps

On prem components

The on-premises software for processing the legacy data, plus initial exploratory testing, was run on Ubuntu Linux using Conda to manage dependencies.
The main components are:

  • Java JRE 8
  • Apache AirFlow
  • Apache Spark
  • Apache Kafka via Confluent
For a detailed technical view of the supporting software used, see [this Conda environment export](documentation_resources/conda-env.yaml).

Create topics

With the software security credentials and other minor configurations set as per the AiCore Data Engineering notebook instructions, the Kafka topics for the on-premises legacy data can be created with the following script in the legacy_simulation/run_on_ec2 folder:

0setup_topics_and_connectors.sh
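
For reference, a one-off topic creation step like the one this script performs could be sketched in Python with the kafka-python admin client as shown below. The broker address and topic names are illustrative placeholders, and an MSK cluster would additionally need the appropriate authentication settings.

    # Rough sketch of one-off topic creation with the kafka-python admin client.
    # The broker address and topic names here are illustrative placeholders.
    from kafka.admin import KafkaAdminClient, NewTopic

    admin = KafkaAdminClient(bootstrap_servers="broker.example.amazonaws.com:9092")

    topics = [
        NewTopic(name=name, num_partitions=1, replication_factor=1)
        for name in ("example.pin", "example.geo", "example.user")
    ]
    admin.create_topics(new_topics=topics)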

Legacy Data Population

Shell scripts are provided to prevent errors and to aid in returning to this process in the future. To load the data from the legacy database into an S3 bucket using Kafka, execute the following on the EC2 instance within the VPC:

1run_this_on_kafka_ec2.sh

This will execute the following commands:

cd ~/confluent-7.2.0/bin/
./kafka-rest-start /home/ec2-user/confluent-7.2.0/etc/kafka-rest/kafka-rest.properties
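
Once kafka-rest-start is running, a quick way to confirm that the REST proxy is reachable (assuming it listens on its default port, 8082, on the same host) is to list the topics it exposes:

    # Quick connectivity check against the Kafka REST proxy started above.
    # The host is a placeholder; 8082 is the proxy's default listener port.
    import requests

    response = requests.get("http://localhost:8082/topics")
    response.raise_for_status()
    print(response.json())  # list of topic names visible to the proxy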

The next script can be run on the same computer, though it is recommended to run it on another box:

2run_this_to_send_messages.sh 

This will execute the following:

python3 user_posting_emulation.py

Sample data will then be sent at random intervals to the Kafka topics for processing.
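
For context, sending a single record through the REST proxy looks roughly like the snippet below. The host, topic name and payload values are placeholders, and the project's emulation script may post to an API Gateway URL that fronts the proxy rather than to the proxy directly.

    # Rough sketch of sending one record to a Kafka topic via the REST proxy.
    # Host, topic name and payload values are illustrative placeholders.
    import json
    import requests

    payload = json.dumps({
        "records": [
            {"value": {"index": 1, "title": "example pin", "follower_count": "10k"}}
        ]
    })
    headers = {"Content-Type": "application/vnd.kafka.json.v2+json"}

    response = requests.post(
        "http://localhost:8082/topics/example.pin",
        data=payload,
        headers=headers,
    )
    response.raise_for_status()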

Conclusion

This was the final project in the AiCore Data Engineering path, which was certainly more of what we would have called "tech ops" in my previous organisation. Big data tools were entirely new to me. Now, with some basic hands-on experience, I know some of their capabilities and feel confident that, with these notebooks as a reference, I could have these tools up and running.

Some research would still be required before making a professional recommendation on the use of these technologies, largely in terms of execution performance. That said, even without further study I can say that much of what has been taught and used here would allow a fairly rapid prototype for data handling. I'm grateful for the experience and look forward to applying these skills and learning more.

