Trend Factor: Visualizing web trends using the Common Crawl
This is a project I completed in three weeks during the Insight Data Engineering Program (New York, 19A session). The goal is to count the number of web pages that mention a given set of terms between January and December 2018, using a subset (~20 TB) of the Common Crawl, an archive of web page content. The results can help companies measure brand recognition over time or compare the popularity of different products. A sample batch job was run with a set of database names; the UI with the results is temporarily displayed at techwebtrends.xyz, and a recording of the web UI is also available here.
The data was ingested from Amazon S3 using Spark (Scala), and the final results, containing the search terms and their frequencies, were saved to a PostgreSQL database. Apache Airflow was used to schedule the batch processing tasks corresponding to each month of the Common Crawl data. The front-end web UI, showing the results of the batch processing jobs, was built with the Python Dash framework.
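The batch jobs themselves are written in Spark Scala (see the project folder), but the overall shape of the pipeline is easy to illustrate in a few lines of Python. The PySpark-flavored sketch below is only an illustration under assumptions: the S3 path pattern, search terms, table name, and connection details are placeholders, and it counts matching lines of the WET text extracts rather than grouping lines into whole pages as the real job must.

from pyspark.sql import SparkSession

# Hypothetical search terms; the real job takes its term list as input.
TERMS = ["postgresql", "mysql", "mongodb"]

spark = SparkSession.builder.appName("TrendFactorSketch").getOrCreate()

# Illustrative path to one month of Common Crawl WET (plain-text) extracts on S3.
wet_path = "s3a://commoncrawl/crawl-data/CC-MAIN-2018-05/segments/*/wet/*.warc.wet.gz"
lines = spark.sparkContext.textFile(wet_path)

# Emit (term, 1) for every line mentioning a term, then sum the counts per term.
counts = (lines
          .flatMap(lambda line: [(t, 1) for t in TERMS if t in line.lower()])
          .reduceByKey(lambda a, b: a + b))

# Append this month's counts to PostgreSQL over JDBC (placeholder connection details).
results = counts.toDF(["term", "frequency"])
(results.write.format("jdbc")
    .option("url", "jdbc:postgresql://<db-host>:5432/<database>")
    .option("dbtable", "term_counts")
    .option("user", "<user>")
    .option("password", "<password>")
    .option("driver", "org.postgresql.Driver")
    .mode("append")
    .save())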
Languages
- Python 2.7
- Scala 2.11.8
Technologies
- Spark
- Airflow
- PostgreSQL
Third-Party Libraries
- AWS CLI
Install and configure the AWS CLI and the open-source tool Pegasus on your local machine, and clone this repository: git clone https://github.com/Marinela09/Insight-DE-Project. Configure a VPC with a security group and subnet on AWS, and add your local IP address to the VPC inbound rules.
AWS Clusters Setup
For my project, I used the following cluster structure:
- 7-node Spark Cluster for Batch Processing
- 1-node Database Cluster for Postgres and front-end Dash application
To deploy each cluster, use the Pegasus tool: configure the master and worker yml files and run the peg up <yml filename> command. For example, the master file for the Spark cluster can be configured with the following yml template:
purchase_type: on_demand
subnet_id: subnet-XXXX
price: string
num_instances: 1
key_name: XXXX-keypair
security_group_ids: sg-XXXX
instance_type: m4.large
tag_name: SparkCluster
vol_size: 100
role: master
use_eips: true
For the workers, change the number of instances to 6, and the role to worker.
Installing the required technologies
1. Spark and Hadoop
Use the following peg commands to install and start Spark and Hadoop on the cluster:
peg install SparkCluster ssh
peg install SparkCluster aws
peg install SparkCluster environment
peg install SparkCluster spark
peg install SparkCluster hadoop
peg service SparkCluster hadoop start
peg service SparkCluster spark start
2. Apache Airflow
SSH into the master node of the SparkCluster (peg ssh SparkCluster 1) and install the Apache Airflow tool following the instructions in this guide.
3. PostgreSQL
SSH into the master node of the database cluster and install PostgreSQL. Configure the database to allow remote connections.
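Allowing remote connections typically means setting listen_addresses in postgresql.conf and adding an entry for the Spark cluster to pg_hba.conf. A quick way to confirm the database is reachable from another node is a short check like the sketch below, assuming psycopg2 is installed; every connection detail is a placeholder.

# Minimal remote-connection check; host, database, and credentials are placeholders.
import psycopg2

conn = psycopg2.connect(
    host="<database-node-public-dns>",
    port=5432,
    dbname="<database>",
    user="<user>",
    password="<password>",
)
cur = conn.cursor()
cur.execute("SELECT version();")
print(cur.fetchone()[0])
cur.close()
conn.close()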
- ./airflow/ contains the Python script for the Airflow Directed Acyclic Graph (DAG) definition, used to schedule the batch processing jobs for each month of the 2018 Common Crawl data.
- ./dashapp/ contains the Python script for running the front-end web application.
- ./project/ contains all configuration files and the Spark Scala scripts for replicating my batch processing jobs.
- build.sbt is used to compile the Spark Scala scripts located in the project folder.
To replicate my results:
Copy the project folder to the master node of the Spark cluster. Download the PostgreSQL JDBC driver 42.2.5 jar file into the project folder. Run the sbt package command.
Copy the Airflow DAG definition file into the airflow/dags folder on the master node of the Spark cluster. Run the Python script (python schedule_batch.py), start the Airflow Web UI, and trigger DAG execution:
airflow webserver
airflow scheduler
airflow worker
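For reference, the DAG definition follows the common Airflow pattern of one spark-submit task per month of crawl data. The sketch below is a simplified stand-in for schedule_batch.py, not the actual file: the jar path, main class, crawl labels, and Spark master URL are all placeholders.

# Illustrative Airflow DAG: one spark-submit task per month of 2018 Common Crawl data.
# The jar path, main class, crawl labels, and Spark master URL are placeholders.
from datetime import datetime
from airflow import DAG
from airflow.operators.bash_operator import BashOperator

CRAWLS = ["2018-05", "2018-09", "2018-13"]  # illustrative monthly crawl labels

default_args = {"owner": "airflow", "start_date": datetime(2019, 1, 1)}
dag = DAG("common_crawl_batch", default_args=default_args, schedule_interval=None)

previous = None
for crawl in CRAWLS:
    task = BashOperator(
        task_id="process_{}".format(crawl.replace("-", "_")),
        bash_command=("spark-submit --class <MainClass> "
                      "--master spark://<master-dns>:7077 "
                      "/home/ubuntu/project/target/scala-2.11/<job>.jar {}".format(crawl)),
        dag=dag,
    )
    # Chain the months so only one month of the crawl is processed at a time.
    if previous is not None:
        previous >> task
    previous = task

Once the DAG file is in airflow/dags, it appears in the Airflow Web UI started above and can be triggered from there.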
To run the front-end visualization tool after the batch processing jobs are done, place the Dash app script ranking_app.py on the database cluster and execute python ranking_app.py.
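For a sense of what the front end does, the sketch below is a minimal Dash app that reads the term frequencies from PostgreSQL and renders them as a bar chart. It is not the real ranking_app.py: the table name, column names, and connection string are assumptions.

# Minimal Dash sketch that reads term frequencies from PostgreSQL and plots them.
# Table name, column names, and connection string are placeholders.
import dash
import dash_core_components as dcc
import dash_html_components as html
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql://<user>:<password>@localhost:5432/<database>")
df = pd.read_sql("SELECT term, frequency FROM term_counts", engine)

app = dash.Dash(__name__)
app.layout = html.Div([
    html.H1("Trend Factor: term mentions in the 2018 Common Crawl"),
    dcc.Graph(
        id="term-frequencies",
        figure={
            "data": [{"x": df["term"], "y": df["frequency"], "type": "bar"}],
            "layout": {"xaxis": {"title": "Search term"},
                       "yaxis": {"title": "Number of pages"}},
        },
    ),
])

if __name__ == "__main__":
    app.run_server(host="0.0.0.0", port=8050, debug=False)

With the app bound to 0.0.0.0, the dashboard is reachable on port 8050 of the database node, subject to the security group's inbound rules.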