title: "Introduction to Big Data with Spark and Python"
author:
  name: Dan Koch
  url: https://github.com/dmkoch/spark-intro
  twitter: dkoch
theme: sudodoki/reveal-cleaver-theme
output: presentation.html
Big data is a broad term for data sets so large or complex that traditional data processing applications are inadequate -- Wikipedia
- Volume
  - How many gigabytes, terabytes, or petabytes can one machine hold?
  - About 3 hours to read 1 TB from a single hard drive (at roughly 100 MB/s)
- Velocity -- Writes come in faster than even the largest machine can handle
- Also Variety, Variability, Veracity, Complexity -- Wikipedia
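The read-time figure above can be checked with a little arithmetic. This is a minimal sketch that assumes a sustained sequential throughput of 100 MB/s and decimal units (1 TB = 10^12 bytes); the function name `read_time_hours` is made up for illustration, and real drives vary.

```python
def read_time_hours(size_bytes, throughput_bytes_per_sec):
    """Hours needed to read size_bytes at a constant throughput."""
    return size_bytes / throughput_bytes_per_sec / 3600.0

TB = 10 ** 12  # decimal terabyte, in bytes
MB = 10 ** 6   # decimal megabyte, in bytes

# Assumed throughput: 100 MB/s from one drive
print(round(read_time_hours(1 * TB, 100 * MB), 2))  # 2.78
```

Splitting the same read across many drives on many machines divides that time, which is the motivation for cluster computing.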
- General-purpose cluster computing framework
- Wide variety of applications supported
- High-level and low-level APIs -- "everything included"
- Fast -- designed to run in memory
- Scala
- Java
- Python 2
- Python 3 (since Spark 1.4)
- R (since Spark 1.4)
- Other languages via pipe()
- Spark Streaming
- MLlib
- GraphX
- DataFrames / SparkSQL
- SparkR
- Data science support
- General purpose -- you can build a web app too
- Readability
# Download tarball (https://spark.apache.org/downloads.html)
tar xf spark.tar.gz
export SPARK_HOME=/path/to/sparkdir
pip install py4j
- RDD -- Resilient Distributed Dataset
- Transformations
- Actions
- Transparent Scaling -- Laptop to Cluster
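The difference between transformations and actions can be sketched as follows. This assumes an existing SparkContext named `sc`, such as the one the pyspark shell creates; the helper names `is_even`, `square`, and `sum_even_squares` are hypothetical.

```python
def is_even(n):
    return n % 2 == 0

def square(n):
    return n * n

def sum_even_squares(sc, numbers):
    """Sum the squares of the even numbers using an RDD."""
    rdd = sc.parallelize(numbers)   # distribute a local list as an RDD
    evens = rdd.filter(is_even)     # transformation: only recorded, nothing runs yet
    squares = evens.map(square)     # transformation: still nothing runs
    # Action: Spark now executes the whole chain and returns a plain Python value
    return squares.reduce(lambda a, b: a + b)
```

In the pyspark shell, `sum_even_squares(sc, range(10))` would run the job. The same function works unchanged whether `sc` points at a laptop or a cluster.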
- pyspark shell
- standalone scripts
- IPython Notebook
spark-submit \
--master spark:// \
yourscript.py arg1 arg2
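A standalone script launched this way might look like the sketch below. It assumes, purely for illustration, that `arg1` is an input path and `arg2` is a search string; `parse_args`, `main`, and the app name are hypothetical. The pyspark import sits inside `main` so the argument handling can be exercised without Spark installed.

```python
import sys

def parse_args(argv):
    """Return (input_path, pattern) from a command line like: yourscript.py arg1 arg2"""
    if len(argv) != 3:
        raise SystemExit("usage: yourscript.py <input_path> <pattern>")
    return argv[1], argv[2]

def main(argv):
    # Imported here so this file can be loaded on a machine without Spark
    from pyspark import SparkConf, SparkContext

    input_path, pattern = parse_args(argv)
    # No setMaster() call: the master is taken from spark-submit's --master option
    conf = SparkConf().setAppName("yourscript")
    sc = SparkContext(conf=conf)
    try:
        # count() is an action, so the file is read and filtered here
        matches = sc.textFile(input_path).filter(lambda line: pattern in line).count()
        print(matches)
    finally:
        sc.stop()  # release cluster resources even if the job fails
```

When saving this as a script, end the file with `if __name__ == "__main__": main(sys.argv)` so that spark-submit runs it.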
Cluster managers
- Standalone
- Mesos
Commercial offerings
- Mesosphere -- Enterprisey Mesos
- Databricks -- Cluster Manager for AWS
- Amazon EMR -- Packaging of Spark on Hadoop YARN
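import re  # needed for re.sub below

# `sc` is the SparkContext provided by the pyspark shell; `filename` is the input path
wordcounts = sc.textFile(filename) \
    .map(lambda text: re.sub('[^a-z0-9 ]+', '', text.lower()).strip()) \
    .flatMap(lambda text: text.split()) \
    .map(lambda word: (word, 1)) \
    .reduceByKey(lambda x, y: x + y)  # Transformations (lazy)
# Index into the (word, count) pair so the key function works on Python 2 and 3
print(wordcounts.takeOrdered(10, key=lambda kv: -kv[1]))  # Action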
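To see what each stage of the word count does, the same logic can be run on a small sample in plain Python, without Spark. This is a sketch for checking results locally; `normalize`, `top_words`, and the sample lines are made up for illustration. Ties are broken alphabetically here, which Spark's `takeOrdered` does not promise.

```python
import re
from collections import Counter

def normalize(text):
    """Lowercase a line and drop everything except a-z, 0-9, and spaces."""
    return re.sub('[^a-z0-9 ]+', '', text.lower()).strip()

def top_words(lines, n=10):
    """Return the n most frequent words as (word, count) pairs."""
    counts = Counter()
    for line in lines:
        # split() plays the role of flatMap; Counter plays the role of reduceByKey
        counts.update(normalize(line).split())
    # Highest count first; alphabetical order among equal counts
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:n]

sample = ["The quick brown fox.", "The lazy dog; the end!"]
print(top_words(sample, 3))  # [('the', 3), ('brown', 1), ('dog', 1)]
```

The Spark version does the same work, but splits the lines across machines and merges the per-word counts in `reduceByKey`.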
- Official documentation: https://spark.apache.org/docs/latest/
- Berkeley edX course: Introduction to Big Data with Apache Spark
- O'Reilly book: Learning Spark