Iterative Reduce Programming Guide

jpatanooga edited this page Nov 9, 2012 · 10 revisions

Overview

  • Designed specifically for parallel iterative algorithms on Hadoop
  • Implemented directly on top of YARN
  • Intrinsic Parallelism
  • Makes it easier to focus on the problem and less on distributed programming

Lifecycle of an IterativeReduce Application

  • Client
    • Launches the YARN ApplicationMaster
  • Master
    • Computes required resources
    • Obtains resources from YARN
    • Launches Workers
    • Base class: ComputableMaster
  • Workers
    • Computation on partial data (input split)
    • Synchronizes with Master
    • Base class: ComputableWorker
  • Job Completion Mechanics
    • Currently, a job completes once every worker has sent a complete() call to the master; the master then exits.
    • The client also waits for this, but it can terminate early and the job will continue to run.
    • We're looking at ways to make this more elegant.
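
The completion rule above (the master exits once every worker has called complete()) can be sketched in a few lines. This is only an illustration: CompletionTracker and the string worker IDs are hypothetical names, not IterativeReduce classes, and the sketch does not model the client or the communication between master and workers.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical helper (not an IterativeReduce class) modelling the completion rule. */
final class CompletionTracker {
    private final int expectedWorkers;
    private final Set<String> completed = new HashSet<>();

    CompletionTracker(int expectedWorkers) {
        if (expectedWorkers <= 0) {
            throw new IllegalArgumentException("expectedWorkers must be positive");
        }
        this.expectedWorkers = expectedWorkers;
    }

    /** Records one worker's complete() call; repeated calls from the same worker count once. */
    synchronized boolean workerComplete(String workerId) {
        completed.add(workerId);
        return isJobComplete();
    }

    /** True once every expected worker has reported, i.e. the master may exit. */
    synchronized boolean isJobComplete() {
        return completed.size() >= expectedWorkers;
    }
}
```

With two expected workers, workerComplete("w1") returns false and a later workerComplete("w2") returns true, at which point the master in this model would exit.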

ComputableMaster Component

  • Overview
    • We implement this interface (described in Appendix A) for our custom master process
    • Conceptually, it is used much like the AllReduce primitive
    • It always runs in the YARN ApplicationMaster
    • At each superstep, the master receives a list of messages from the workers based on the work they did in the last iteration
  • Methods
    • setup( Configuration c )
      • Similar to “setup” in MapReduce; runs before any execution
      • Used to get command-line configuration keys into the local process
      • Gives the programmer a chance to set up any long-lived local data structures that must be initialized after the config parameters are loaded
    • complete( DataOutputStream out )
      • Called after all workers are complete
      • The system passes in an output stream for writing any work, models, etc. to HDFS
      • Also gives the programmer a chance to clean up any held resources
    • T compute( Collection<T> workerUpdates, Collection<T> masterUpdates )
      • This is the method where the superstep action occurs
      • We can see both the current superstep’s worker updates and the master updates generated by all previous supersteps
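
As a rough illustration of the three methods above, here is a minimal master that keeps the largest value any worker has reported. Everything in it is an assumption made for the sketch: MaxMaster, LongUpdate and the "sketch.label" key are hypothetical, and a plain Map stands in for Hadoop's Configuration so the code has no dependencies. A real master would implement ComputableMaster from Appendix A instead.

```java
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.Collection;
import java.util.Map;

/** Hypothetical message type; in the framework this role is played by an Updateable. */
final class LongUpdate {
    final long value;
    LongUpdate(long value) { this.value = value; }
}

/** Hypothetical master that keeps the largest value any worker has reported. */
final class MaxMaster {
    private LongUpdate globalMax = new LongUpdate(Long.MIN_VALUE);
    private String label = "max";

    // Assumption: a Map stands in for Hadoop's Configuration.
    void setup(Map<String, String> conf) {
        label = conf.getOrDefault("sketch.label", label);
    }

    /** One superstep: combine this superstep's worker updates with earlier master updates. */
    LongUpdate compute(Collection<LongUpdate> workerUpdates,
                       Collection<LongUpdate> masterUpdates) {
        long max = Long.MIN_VALUE;
        for (LongUpdate u : workerUpdates) max = Math.max(max, u.value);
        for (LongUpdate u : masterUpdates) max = Math.max(max, u.value);
        globalMax = new LongUpdate(max);
        return globalMax;
    }

    LongUpdate getResults() { return globalMax; }

    /** Called once at the end: write the final result to the stream we are handed. */
    void complete(DataOutputStream out) throws IOException {
        out.writeUTF(label);
        out.writeLong(globalMax.value);
        out.flush();
    }
}
```

Taking the maximum is a convenient choice here because the result is the same whether the framework hands compute() every earlier master update or only the latest one.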

ComputableWorker Component

  • Overview
    • We implement this interface (described in Appendix B) for our custom worker process
    • It runs in a YARN container until the end of execution
    • It receives a record reader reference from which to process records
    • Every so often it stops computing so that the master can combine results from all workers
  • Methods
    • setup( Configuration c )
      • Similar to “setup” in MapReduce; runs before any execution
      • Used to get command-line configuration keys into the local process
      • Gives the programmer a chance to set up any long-lived local data structures that must be initialized after the config parameters are loaded
    • T compute( )
      • This is the method where the superstep action occurs
      • We can see both the current superstep’s worker updates and the master updates generated by all previous supersteps
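
The worker side can be sketched the same way: process one batch per compute() call, hand back a partial result, and wait for the master's combined value. BatchMaxWorker, LongUpdate and the "sketch.batch.size" key are hypothetical names. An Iterator stands in for the record reader and a Map for Hadoop's Configuration. What update() does with the master's value is our assumption, since this guide does not describe that method.

```java
import java.util.Iterator;
import java.util.Map;

/** Hypothetical message type; in the framework this role is played by an Updateable. */
final class LongUpdate {
    final long value;
    LongUpdate(long value) { this.value = value; }
}

/** Hypothetical worker that tracks the largest record value, one batch per superstep. */
final class BatchMaxWorker {
    private Iterator<Long> records;      // assumption: stands in for the record reader
    private int batchSize = 2;
    private long best = Long.MIN_VALUE;

    // Assumption: a Map stands in for Hadoop's Configuration.
    void setup(Map<String, String> conf) {
        batchSize = Integer.parseInt(conf.getOrDefault("sketch.batch.size", "2"));
    }

    void setRecords(Iterator<Long> records) { this.records = records; }

    boolean hasMoreRecords() { return records != null && records.hasNext(); }

    /** Processes at most one batch, then returns so the master can combine results. */
    LongUpdate compute() {
        for (int i = 0; i < batchSize && hasMoreRecords(); i++) {
            best = Math.max(best, records.next());
        }
        return new LongUpdate(best);
    }

    /** Assumed behaviour: fold the master's combined result back into local state. */
    void update(LongUpdate global) { best = Math.max(best, global.value); }

    LongUpdate getResults() { return new LongUpdate(best); }
}
```

Returning after each batch is what gives the master its chance to combine results: the caller alternates compute() and update() until hasMoreRecords() is false.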

Framework Demo: Parallel Summation

  • We’ll walk through a simple application that we used as a Computable unit test during the development of IterativeReduce
  • This application sums a large set of numbers in a distributed fashion
    • Each worker has a split of the total dataset
    • After each batch within the split, the batch partial sum is sent to the master
    • The master adds all previous global sums to the current worker sums to get the current global sum
  • CompoundAdditionMaster
    • collects partial sums from workers, computes new global sum
  • CompoundAdditionWorker
    • reads data from HDFS
    • computes partial sum for batch
  • UpdateableInt
    • inherits from Updateable; it is the class we use to send data back and forth between worker and master
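
To make the data flow concrete, here is a small in-process sketch of the summation. It is not the CompoundAddition code itself. IntUpdate, SumWorker, SumMaster and LocalSumDriver are hypothetical stand-ins, and the driver loop replaces the framework. One assumption matters for the result: the driver hands the master only the most recent global sum as masterUpdates, so the total equals the plain sum of the data. If every earlier global sum were passed in, this master would count them again.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;

/** Hypothetical stand-in for UpdateableInt: the value sent between worker and master. */
final class IntUpdate {
    private final int value;
    IntUpdate(int value) { this.value = value; }
    int get() { return value; }
}

/** Worker side: sums one batch of its split per superstep. */
final class SumWorker {
    private final Iterator<Integer> split;
    private final int batchSize;

    SumWorker(List<Integer> split, int batchSize) {
        this.split = split.iterator();
        this.batchSize = batchSize;
    }

    boolean hasMore() { return split.hasNext(); }

    IntUpdate compute() {
        int partial = 0;
        for (int i = 0; i < batchSize && split.hasNext(); i++) partial += split.next();
        return new IntUpdate(partial);
    }
}

/** Master side: adds the global sums it is given to the new worker sums. */
final class SumMaster {
    IntUpdate compute(Collection<IntUpdate> workerUpdates, Collection<IntUpdate> masterUpdates) {
        int total = 0;
        for (IntUpdate u : masterUpdates) total += u.get();
        for (IntUpdate u : workerUpdates) total += u.get();
        return new IntUpdate(total);
    }
}

/** Hypothetical driver standing in for the framework's superstep loop. */
final class LocalSumDriver {
    static int run(List<SumWorker> workers, SumMaster master) {
        List<IntUpdate> previous = Collections.emptyList();
        IntUpdate global = new IntUpdate(0);
        while (true) {
            List<IntUpdate> updates = new ArrayList<>();
            for (SumWorker w : workers) {
                if (w.hasMore()) updates.add(w.compute());
            }
            if (updates.isEmpty()) break;              // every split is exhausted
            global = master.compute(updates, previous);
            // Assumption: only the latest global sum is carried into the next superstep.
            previous = Collections.singletonList(global);
        }
        return global.get();
    }
}
```

For example, with splits [1, 2, 3, 4] and [10, 20, 30] and a batch size of 2, the first superstep produces 3 + 30 = 33 and the second produces 33 + 7 + 30 = 70, which is the sum of all seven numbers.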

Appendix A: Computable Master Interface

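public interface ComputableMaster<T extends Updateable> {
    // Runs before any execution; loads configuration into the local process.
    void setup(Configuration c);
    // Called after all workers are complete; write results to the supplied stream.
    void complete(DataOutputStream out) throws IOException;
    // Superstep logic: combines this superstep's worker updates with earlier master updates.
    T compute(Collection<T> workerUpdates, Collection<T> masterUpdates);
    T getResults();
}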

Appendix B: Computable Worker Interface

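public interface ComputableWorker<T extends Updateable> {
    // Runs before any execution; loads configuration into the local process.
    void setup(Configuration c);
    T compute(List<T> records);
    // This is the method where the superstep action occurs.
    T compute();
    void setRecordParser(RecordParser r);
    T getResults();
    void update(T t);
}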

References and Links