# Latency
Rohit edited this page Mar 16, 2017
In the Parallel Programming course, we learned about:
- Data Parallelism in the single-machine, multi-core, multiprocessor world.
- Parallel Collections as an implementation of this paradigm.
In this course, we will learn about:
- Data Parallelism in a distributed (multi-node) setting.
- The distributed collections abstraction from Apache Spark as an implementation of this paradigm.
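The shift from single-machine data parallelism to distributed data parallelism can be sketched as follows. This is a minimal, illustrative example, not from the course itself: it assumes Scala 2.12 (where parallel collections are built in) and a Spark dependency on the classpath; the object name and the `local[*]` master are my own choices.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object DataParallelSketch {
  def main(args: Array[String]): Unit = {
    val data = (1 to 1000).toVector

    // Single machine: a parallel collection splits the work across cores.
    val localSum = data.par.map(_ * 2).sum

    // Distributed: an RDD splits the same work across the nodes of a
    // cluster (simulated here with local[*], i.e. all local cores).
    val conf = new SparkConf().setAppName("sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    val distributedSum = sc.parallelize(data).map(_ * 2).sum().toInt

    assert(localSum == distributedSum) // same result, very different execution
    sc.stop()
  }
}
```

The two APIs look almost identical, and that is deliberate: Spark's RDDs mimic Scala collections. But, as the sections below stress, evaluation and latency behave very differently in the distributed case.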
Because of distribution, we face two new issues:
- Partial failure: crash failures on a subset of the machines in the cluster.
- Latency: network communication makes some operations much slower. Latency cannot be masked and is always present; it shapes the programming model and our code directly, as we try to minimize network communication.
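One concrete way latency shapes code is the choice of reduction operation on pair RDDs (previewed in Weeks 2 and 3 below). A hedged sketch, with illustrative names and the same local-mode setup assumption as above: `reduceByKey` pre-combines values on each node, so far less data crosses the network than with `groupByKey`.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LatencySketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("latency-sketch").setMaster("local[*]"))
    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

    // groupByKey ships every value across the network before summing:
    val viaGroup = pairs.groupByKey().mapValues(_.sum)

    // reduceByKey sums locally on each node first, then ships only the
    // per-node partial sums -- much less network communication.
    val viaReduce = pairs.reduceByKey(_ + _)

    // Both produce the same result; only the communication pattern differs.
    assert(viaGroup.collect().toMap == viaReduce.collect().toMap)
    sc.stop()
  }
}
```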
Apache Spark stands out in how it handles these two issues.
## Week 1
- Introduction
- Data Parallel to Distributed Data Parallel
- Latency
- RDDs: Spark's Distributed Collection
- RDDs: Transformation and Action
- Evaluation in Spark: Unlike Scala Collections!
- Cluster Topology Matters!
## Week 2
- Reduction Operations (fold, foldLeft, aggregate)
- Pair RDDs
- Pair RDDs: Transformations and Actions
- Pair RDDs: Joins
## Week 3
- Shuffling: What it is and why it's important
- Partitioning
- Optimizing with Partitioners
- Wide vs Narrow Dependencies
## Week 4