This is an academic project created to complete the course "Big Data Processing".
The main goal of the project was to implement MapReduce processing with Apache Hadoop and Apache Hive.
Additionally, I implemented job scheduling with Apache Airflow and infrastructure configuration with Terraform.
Source: https://opendata.cityofnewyork.us/data/
- NYPD_Motor_Vehicle_Collisions.csv – road accident data.
- zips-boroughs.csv – zip code to borough mapping data.
The first job (Hadoop MapReduce) loads the data from NYPD_Motor_Vehicle_Collisions.csv
and counts the number of victims for each street,
distinguishing by the type of injured person (pedestrian, cyclist, motorist) and the type of injury (injured, killed).
The results are limited to accidents after 2012, and each street is described with its zip code.
The result should have the following attributes:
- street
- zip code
- type of injured person
- type of injury
- number of injured people
The result file must be in a binary format; I chose Avro.
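The map and reduce phases of the first job can be sketched in Python in the Hadoop Streaming style. This is a minimal illustration, not the actual implementation: the column indices and the MM/DD/YYYY date layout are assumptions about the NYPD CSV, and the Avro serialization step is omitted.

```python
import csv
from collections import defaultdict
from io import StringIO

# Hypothetical column layout -- the real NYPD CSV has more columns and
# different positions; adjust the indices to the actual header.
DATE, ZIP, STREET = 0, 1, 2
PED_INJ, PED_KILL, CYC_INJ, CYC_KILL, MOT_INJ, MOT_KILL = 3, 4, 5, 6, 7, 8

FIELDS = [
    ("pedestrian", "injured", PED_INJ), ("pedestrian", "killed", PED_KILL),
    ("cyclist", "injured", CYC_INJ), ("cyclist", "killed", CYC_KILL),
    ("motorist", "injured", MOT_INJ), ("motorist", "killed", MOT_KILL),
]

def map_collision(line):
    """Map phase: emit ((street, zip, person_type, injury_type), count)
    pairs for accidents after 2012 that have a street and a zip code."""
    row = next(csv.reader(StringIO(line)))
    year = int(row[DATE].split("/")[-1])        # date assumed as MM/DD/YYYY
    if year <= 2012 or not row[ZIP] or not row[STREET]:
        return
    for person_type, injury_type, col in FIELDS:
        count = int(row[col] or 0)
        if count:
            yield (row[STREET], row[ZIP], person_type, injury_type), count

def reduce_counts(pairs):
    """Reduce phase: sum the counts per (street, zip, person, injury) key."""
    totals = defaultdict(int)
    for key, count in pairs:
        totals[key] += count
    return dict(totals)
```

In the real job the keys and summed counts would then be written out as Avro records with the five attributes listed above.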
The second job (Hive) loads the output of the first job together with the zips-boroughs.csv dataset
and finds the three streets in Manhattan with the highest total number of injured and killed people,
broken down by type of injured person.
The output file must be in JSON format. Example output:
{"street":"1 AVENUE","person_type":"cyclist","killed":1,"injured":194}
{"street":"1 AVENUE","person_type":"motorist","killed":0,"injured":427}
{"street":"1 AVENUE","person_type":"pedestrian","killed":2,"injured":308}
{"street":"2 AVENUE","person_type":"cyclist","killed":0,"injured":212}
{"street":"2 AVENUE","person_type":"motorist","killed":0,"injured":514}
{"street":"2 AVENUE","person_type":"pedestrian","killed":2,"injured":309}
{"street":"BROADWAY","person_type":"cyclist","killed":0,"injured":239}
{"street":"BROADWAY","person_type":"motorist","killed":0,"injured":769}
{"street":"BROADWAY","person_type":"pedestrian","killed":6,"injured":463}