This project sets up a Vagrant VM with a local installation of Spark and a small slice of the Enron corpus. It also includes some nice tools, provided by Markus Dale, for working with the corpus!
- install VirtualBox (https://www.virtualbox.org/wiki/Downloads)
- install Vagrant (http://docs.vagrantup.com/v2/installation/index.html)
from the project directory...
- vagrant up (if you get an error, make sure you have changed directory into the project directory)
- vagrant ssh
Let's make sure everything is running as we expect ...
- run ~/spark.sh
- grab a drink; Spark should do its thing for a few moments and finally splash the Spark ASCII art
- type in :paste and hit enter (the :paste command tells the REPL that you are going to paste text and that it should not interpret it until you hit CTRL-d)
- copy and paste the following in your spark shell
import org.apache.spark.SparkContext._
import org.apache.spark.rdd._
import com.uebercomputing.mailparser.enronfiles.AvroMessageProcessor
import com.uebercomputing.mailrecord._
import com.uebercomputing.mailrecord.Implicits.mailRecordToMailRecordOps
val args = Array("--avroMailInput", "/opt/rpm1/enron/filemail.avro", "--hadoopConfPath", "hadoop-local.xml")
val config = CommandLineOptionsParser.getConfigOpt(args).get
val recordsRdd = MailRecordAnalytic.getMailRecordsRdd(sc, config)
- did you hit CTRL-d? if not, do that. else pat yourself on the back for being an overachiever.
- now type:
recordsRdd.count
- if you see it return "40419", then you are ready to roll!
- if you really want to impress us, figure out which person sent the most emails with the term "fbi" in the body...
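
If you want a starting point for that last challenge, here is a minimal sketch of the general pattern (filter on the body, group by sender, count, sort) using plain Scala collections. The Mail case class, its from and body fields, and the TermSenders object are hypothetical stand-ins made up for illustration; they are not part of the spark-mail API, so check the real mail record type for the actual accessor names.

```scala
// Hypothetical stand-in for a mail record; the field names are assumptions,
// not the real spark-mail accessors.
final case class Mail(from: String, body: String)

object TermSenders {
  // For each sender, count the mails whose body contains `term`
  // (case-insensitive). The result is sorted by descending count,
  // with ties broken alphabetically by sender.
  def sendersByTermCount(mails: Seq[Mail], term: String): Seq[(String, Int)] = {
    val needle = term.toLowerCase
    mails
      .filter(mail => mail.body.toLowerCase.contains(needle))
      .groupBy(mail => mail.from)
      .map { case (sender, hits) => sender -> hits.size }
      .toSeq
      .sortBy { case (sender, count) => (-count, sender) }
  }
}
```

Calling TermSenders.sendersByTermCount(mails, "fbi") on a small hand-made Seq[Mail] returns (sender, count) pairs with the most frequent sender first. On recordsRdd the same idea would probably be a filter, a map to (sender, 1) pairs, and a reduceByKey, but that part is left to you.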
We have a small data set covering 4 executives from Enron:
- Kenneth Lay
- Jeffrey Skilling
- Greg Whalley
- Vincent Kaminski
Thanks to Markus Dale for providing some sweet tools to work with the corpus and advice on setting up the environment! (https://github.com/medale/spark-mail)