Collection of code for submitting Spark/Hadoop/Hive/Pig tasks to EMR (AWS Elastic MapReduce) | #DE

yennanliu/spark_emr_dev

SPARK-EMR-DEV

Demo of various data ETL processes via AWS EMR.

Scala Projects

File structure

# ├── README.md
# ├── athena            : Athena queries
# ├── build.sbt         : sbt build definition for the dev env
# ├── config            : configs with credentials to access AWS and 3rd-party services
# ├── data              : sample data for script tests
# ├── doc               : reference docs
# ├── hive              : Hive scripts
# ├── project           : sbt project files
# ├── pyspark           : PySpark code
# ├── quick_start.sh    : helper script to run sbt/spark commands
# ├── script            : helper scripts
# ├── src               : main Scala Spark ETL code
# ├── target            : compiled class files
# └── task_step         : JSON files defining EMR task steps
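The step files under task_step follow the JSON format accepted by `aws emr add-steps`. As a rough sketch (the step name, main class, and jar path below are placeholders, not files from this repo), a Spark step definition looks like:

```json
[
  {
    "Type": "Spark",
    "Name": "SampleSparkETL",
    "ActionOnFailure": "CONTINUE",
    "Args": [
      "--deploy-mode", "cluster",
      "--class", "com.example.MainETL",
      "s3://my-bucket/jars/spark_emr_dev.jar"
    ]
  }
]
```

Such a file can then be submitted to a running cluster with something like `aws emr add-steps --cluster-id j-XXXXXXXX --steps file://task_step/sample_step.json` (cluster id and file name are illustrative).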

Quick Start

Prerequisites

  1. Update the configs under config with your own credentials and rename them (e.g. aws.config.dev -> aws.config) to access services such as data sources and file systems.
  2. Install SBT as the Scala dependency-management tool.
  3. Install Java and Spark.
  4. Modify build.sbt to match your dev environment.
  5. Check the Spark ETL scripts under src.
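Once the prerequisites are in place, a typical build-and-run cycle might look like the following command sketch (the jar path, Scala version, main class, and S3 bucket are assumptions for illustration, not values taken from this repo; quick_start.sh wraps similar commands):

```shell
# Build the jar from the sbt project
# (use `sbt assembly` instead if a fat jar with shaded dependencies is needed)
sbt clean package

# Smoke-test the ETL job locally before pushing to EMR
spark-submit \
  --class com.example.MainETL \
  --master "local[*]" \
  target/scala-2.11/spark_emr_dev_2.11-0.1.jar

# Upload the jar to S3 so an EMR step (see task_step) can reference it
aws s3 cp target/scala-2.11/spark_emr_dev_2.11-0.1.jar s3://my-bucket/jars/
```

Running locally first keeps the feedback loop short; the same jar is then reused unchanged by the EMR step definitions.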

Ref