Iterative Reduce Programming Guide

Overview

Designed specifically for parallel iterative algorithms on Hadoop
Implemented directly on top of YARN
Intrinsic Parallelism
Easier to focus on problem and less on distributed programming

Lifecycle of an IterativeReduce Application

Client
- Launches the YARN ApplicationMaster
Master
- Computes required resources
- Obtains resources from YARN
- Launches Workers
- Base class: ComputableMaster
Workers
- Computation on partial data (input split)
- Synchronizes with Master
- Base class: ComputableWorker
Job Completion Mechanics
- Right now jobs complete when all workers send a complete() call to the master, then the master exits.
- The client waits for this as well, but the client can terminate early and the job will continue to run.
- We're looking at ways to make this more elegant

ComputableMaster Component

Overview
- We inherit from this interface (described in Appendix A) for our custom master process
- This is used conceptually similar to the AllReduce primitive
- It is always running on the YARN application master
- At each superstep the master will receive a list of messages from the workers based on the work they’ve done in the last iteration
Methods
- setup( Configuration c )
  - Similar to “setup” in MapReduce, runs before any execution.
  - used to get command line configuration keys into the local process
  - allows the programmer the chance to setup any long lived local data structures that need to be initialized after the config parameters are loaded.
- complete( DataOutputStream out )
  - the method that is called after all workers are complete
  - the system passes in an output stream to write any work/models/etc to HDFS
  - also allows the programmer a chance to clean up any held resources
- T compute( Collection workerUpdates, Collection masterUpdates )
  - this is the method where the superstep action occurs
  - we can see both all of the current superstep’s worker updates and the master updates all previous supersteps generated

ComputableWorker Component

Overview
- We inherit from this interface (described in Appendix B) for our custom worker process
- It is always running as a YARN container until the end of execution
- recieves a record reader reference to process records from
- Every so often it stops computing so the master node can combine results from all workers
Methods
- setup( Configuration c )
  - Similar to “setup” in MapReduce, runs before any execution.
  - used to get command line configuration keys into the local process
  - allows the programmer the chance to setup any long lived local data structures that need to be initialized after the config parameters are loaded.
- T compute( )
  - this is the method where the superstep action occurs
  - we can see both all of the current superstep’s worker updates and the master updates all previous supersteps generated

Framework Demo: Parallel Summation

We’ll walk through a simple application which we used as a Computable unit test during the development of IterativeReduce
- Code: https://github.com/emsixteeen/IterativeReduce/tree/master/src/test/java/com/cloudera/iterativereduce/yarn
- It is meant to run some data through IterativeReduce for testing purposes
This application sums a large set of numbers in a distributed fashion
- Each worker has a split of the total dataset
- After each batch within the split, the batch partial sum is sent to the master
- The master adds up all previous global sums, and then all the worker sums to get the global current sum
CompoundAdditionMaster
- collects partial sums from workers, computes new global sum
CompoundAdditionWorker
- reads data from HDFS
- computes partial sum for batch
UpdateableInt
- inherits from Updateable, is the class that we use to send data back and forth between worker and master

Appendix A: Computable Master Interface

public interface ComputableMaster<T extends Updateable> {
 void setup(Configuration c);
 void complete(DataOutputStream out) throws IOException;
 T compute(Collection<T> workerUpdates, Collection<T> masterUpdates);
 T getResults();
}

Appendix B: Computable Worker Interface

public interface ComputableWorker<T extends Updateable> {
 void setup(Configuration c);
 T compute(List<T> records);
 T compute();
 void setRecordParser(RecordParser r);
 T getResults();
 void update(T t);
}

References and Links

Original Hadoop World 2012 Presentation
- http://www.cloudera.com/content/cloudera/en/resources/library/hadoopworld/strata-hadoop-world-2012-knitting-boar_slide_deck.html
Knitting Boar on Github
- https://github.com/jpatanooga/KnittingBoar

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Iterative Reduce Programming Guide

Overview

Lifecycle of an IterativeReduce Application

ComputableMaster Component

ComputableWorker Component

Framework Demo: Parallel Summation

Appendix A: Computable Master Interface

Appendix B: Computable Worker Interface

References and Links

Clone this wiki locally