This is an advanced workshop for people comfortable programming. We'll be writing code in both Python and Scala. You don't have to be an expert in either of those languages, but some familiarity, at least with Python, is recommended.
Students will write two components:
- A series of workflow management classes in Python using the Luigi framework.
- A Scala Spark job to join the datasets together and perform some basic group analysis.
The workflow chains together three steps: downloading multiple data sources for a given day (2019-02-08) from S3, passing those sources as input to the Spark program, and verifying that the expected output files are produced.
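To make the shape of that chain concrete, here is a minimal sketch in plain Python. It is not Luigi's API and not the workshop solution: it only imitates the `requires` / `output` / `run` pattern that a `luigi.Task` subclass uses. The source names (`users`, `events`), the directory layout, and the `build` helper are all hypothetical, the download step writes placeholder files instead of reading from S3, and the Spark step only drops a `_SUCCESS` marker where a real job would write its results.

```python
from pathlib import Path

RUN_DATE = "2019-02-08"  # the day named in the workshop description


class Task:
    """Minimal stand-in for a workflow task: done when its output exists."""

    def __init__(self, workdir):
        self.workdir = Path(workdir)

    def requires(self):
        return []  # tasks that must finish before this one

    def output(self):
        raise NotImplementedError

    def run(self):
        raise NotImplementedError

    def complete(self):
        return self.output().exists()


class Download(Task):
    """Stands in for fetching one source from S3 (no network access here)."""

    def __init__(self, workdir, source):
        super().__init__(workdir)
        self.source = source  # hypothetical source name

    def output(self):
        return self.workdir / RUN_DATE / f"{self.source}.csv"

    def run(self):
        self.output().parent.mkdir(parents=True, exist_ok=True)
        self.output().write_text(f"placeholder rows for {self.source}\n")


class SparkJob(Task):
    """Stands in for submitting the Spark job with the downloads as input."""

    def requires(self):
        # Assumed source names, for illustration only.
        return [Download(self.workdir, s) for s in ("users", "events")]

    def output(self):
        return self.workdir / RUN_DATE / "joined" / "_SUCCESS"

    def run(self):
        self.output().parent.mkdir(parents=True, exist_ok=True)
        self.output().touch()  # a real job would write joined results


def build(task, log):
    """Run dependencies first, then the task, skipping completed work.

    Returns True when the task's expected output exists afterwards.
    """
    for dep in task.requires():
        build(dep, log)
    if not task.complete():
        task.run()
        log.append(type(task).__name__)
    return task.complete()
```

Calling `build(SparkJob("some/dir"), [])` runs both downloads, then the placeholder Spark step, and returns whether the expected output file exists. Calling it a second time runs nothing, because every output is already on disk. Treating a task as complete when its output exists is the same idea Luigi relies on by default.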
Writing both components from scratch should take students, with assistance, about 2 to 3 hours, including VM setup time.
- VM Setup directions (Recommended)
- For directions on setting up the compile/run dependencies to run locally instead of in the VM, see Local Setup.
- First let's code the Luigi Tasks.
- Then let's write the Scala Spark Job.
- Now let's run the Spark artifact you built with Luigi and Put it all together.
For directions on how to run the fake data generator see Fake Data Generator.