Tetra Concepts Spark Training

Intro

This project sets up a Vagrant VM with a local installation of Spark and a small slice of the Enron corpus. It also includes some nice tools from Markus Dale for working with the corpus!

Install software

Install VirtualBox (https://www.virtualbox.org/wiki/Downloads)
Install Vagrant (http://docs.vagrantup.com/v2/installation/index.html)

We are ready to be vagrants ...

From the project directory, run:

vagrant up (if you get an error, make sure you have changed into the project directory)
vagrant ssh

We are now logged into our VM

Let us make sure everything is running as expected:

Run ~/spark.sh
Grab a drink. Spark takes a few moments to start up and then prints the Spark ASCII art.
Type :paste and hit enter (the :paste command tells the REPL that you are about to paste text and that it should not interpret it until you hit CTRL-d).
Copy and paste the following into your Spark shell:

import org.apache.spark.SparkContext._
import org.apache.spark.rdd._
import com.uebercomputing.mailparser.enronfiles.AvroMessageProcessor
import com.uebercomputing.mailrecord._
import com.uebercomputing.mailrecord.Implicits.mailRecordToMailRecordOps

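// Input options: the Avro mail file and a local Hadoop configuration file.
val args = Array("--avroMailInput", "/opt/rpm1/enron/filemail.avro", "--hadoopConfPath", "hadoop-local.xml")
// Parse the options (.get fails if they could not be parsed).
val config = CommandLineOptionsParser.getConfigOpt(args).get
// Load the mail records as an RDD using the shell's SparkContext (sc).
val recordsRdd = MailRecordAnalytic.getMailRecordsRdd(sc, config)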

Did you hit CTRL-d? If not, do that now. Otherwise, pat yourself on the back for being an overachiever.
Now type:

recordsRdd.count

If it returns "40419", you are ready to roll!
If you really want to impress us, figure out who sent the most emails containing the term "fbi" in the body.
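
If you are not sure where to start on the "fbi" question, the shape of the computation can be sketched on plain Scala collections first. This is only an illustrative sketch, not the project's solution: the Mail case class, its field names, and the sample messages are all hypothetical stand-ins, and the real MailRecord type in the VM may expose its sender and body differently.

```scala
// Hypothetical stand-in for a mail record; the real MailRecord type
// may use different field names or accessors.
case class Mail(from: String, body: String)

// Toy data, invented purely for illustration.
val mails = Seq(
  Mail("alice@example.com", "The FBI called this morning."),
  Mail("bob@example.com", "Lunch at noon?"),
  Mail("alice@example.com", "Forwarding the fbi request."),
  Mail("carol@example.com", "fbi follow-up attached.")
)

// Keep messages whose body mentions "fbi" (case-insensitive substring match),
// then count the matching messages per sender.
val countsBySender: Map[String, Int] =
  mails
    .filter(_.body.toLowerCase.contains("fbi"))
    .groupBy(_.from)
    .map { case (sender, msgs) => (sender, msgs.size) }

// Pick the sender with the highest count.
val (topSender, topCount) = countsBySender.maxBy(_._2)
println(s"$topSender sent $topCount matching emails")
```

On an RDD the same idea would probably become a filter, a map to (sender, 1) pairs, and a reduceByKey(_ + _), assuming the records let you read the sender and the body. Working out those accessors on recordsRdd is the part left to you.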

Data

We have a small data set covering four executives from Enron:

Kenneth Lay
Jeffrey Skilling
Greg Whalley
Vincent Kaminski

Thanks to

Markus Dale for providing some sweet tools to work with the corpus and advice on setting up the environment! (https://github.com/medale/spark-mail)
