Code to accompany Advanced Analytics with Spark, by Sandy Ryza, Uri Laserson, Sean Owen, and Josh Wills.
The source to accompany the 2nd edition is found in this, the default
master
branch.
The source to accompany the 1st edition may be found in the
1st-edition
branch.
Apache Maven 3.2.5+ and Java 8+ are required to build. From the root level of the project,
run mvn package
to compile artifacts into target/
subdirectories beneath each chapter's directory.
- Chapter 2: https://archive.ics.uci.edu/ml/machine-learning-databases/00210/
- Chapter 3: https://storage.googleapis.com/aas-data-sets/profiledata_06-May-2005.tar.gz
- Chapter 4: https://archive.ics.uci.edu/ml/machine-learning-databases/covtype/
- Chapter 5: https://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html (do not use http://www.sigkdd.org/kdd-cup-1999-computer-network-intrusion-detection as the copy has a corrupted line)
- Chapter 6: https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles-multistream.xml.bz2
- Chapter 7: ftp://ftp.nlm.nih.gov/nlmdata/sample/medline/ (
*.gz
) - Chapter 8: https://storage.googleapis.com/aas-data-sets/trip_data_1.csv.zip (from http://www.andresmh.com/nyctaxitrips/)
- Chapter 9: See https://github.com/sryza/aas/tree/master/ch09-risk/data ; included download scripts no longer work
- Chapter 10: ftp://ftp.ncbi.nih.gov/1000genomes/ftp/phase3/data/HG00103/alignment/HG00103.mapped.ILLUMINA.bwa.GBR.low_coverage.20120522.bam
- Chapter 11: https://github.com/thunder-project/thunder/tree/v0.4.1/python/thunder/utils/data/fish/tif-stack