Emilio Coppa edited this page Apr 2, 2014 · 88 revisions

This project contains several diagrams describing Apache Hadoop internals (2.3.0 or later). Although these diagrams are not specified in a formal or unambiguous language (e.g., UML), they should be reasonably understandable (see the diagram notation conventions) and useful for anyone who wants to grasp the main ideas behind Hadoop. Unfortunately, the diagrams do not cover every internal detail. You are free to help :)


Actors
  • Job Submitter
  • Node Manager
  • Resource Manager
  • Application Master

Tasks
  • Map Task
  • Reduce Task
  • Merger
  • Input

Model of computation
  • Job
  • Task
  • Task Attempt
  • Application
  • Container

Extra
  • Async Dispatcher
  • Localized Resource
  • Container Allocator [AM]
  • Container Launcher [AM]
  • Containers Launcher [NM]

### Architecture Overview

![Hadoop- Yarn MapReduce - Architecture Overview](https://www.lucidchart.com/publicSegments/view/53302af2-7d38-412b-8275-6ffe0a009433/9_image.png)

### Hadoop Configuration parameters

| Parameter | File | Default | Diagram(s) |
| ------------- | ------------- | ------------- | ------------- |
| `mapreduce.task.io.sort.mb` | `mapred-site.xml` | 100 | [MapTask > Shuffle](MapTask#post-execution---shuffle) |
| | | | [MapTask > Execution](MapTask#execution) |
| `mapreduce.map.sort.spill.percent` | `mapred-site.xml` | 0.80 | [MapTask > Shuffle](MapTask#post-execution---shuffle) |
| | | | [MapTask > Execution](MapTask#execution) |
| `mapreduce.task.io.sort.factor` | `mapred-site.xml` | 100 | [MapTask > Shuffle](MapTask#post-execution---shuffle) |
| | | | [Merge](MapReduceMerge) |
| | | | [ReduceTask > Shuffle](ReduceTask#shuffle---merge) |
| `mapreduce.map.combine.minspills` | `mapred-site.xml` | 3 | [MapTask > Shuffle](MapTask#post-execution---shuffle) |
| `mapreduce.job.reduces` | `mapred-site.xml` | 1 | [MapTask > Shuffle](MapTask#post-execution---shuffle) |
| | | 0 | [Job > NEW => INITED](Job#new--inited-job_init) |
| `mapreduce.cluster.local.dir` | `mapred-site.xml` | `${hadoop.tmp.dir}/mapred/local` | [MapTask > Shuffle](MapTask#post-execution---shuffle) |
| `mapreduce.reduce.merge.memtomem.enabled` | `mapred-site.xml` | False | [Reduce Task > Shuffle](ReduceTask#shuffle) |
| `mapreduce.framework.name` | `mapred-site.xml` | `yarn`/`local` | [Reduce Task > Shuffle](ReduceTask#shuffle) |
| `mapreduce.reduce.shuffle.parallelcopies` | `mapred-site.xml` | 5 | [Reduce Task > Shuffle](ReduceTask#shuffle) |
| `mapreduce.reduce.memory.totalbytes` | `mapred-site.xml` | `Runtime.maxMemory()` | [Reduce Task > Fetcher](ReduceTask#local-fetcher) |
| `mapreduce.reduce.shuffle.memory.limit.percent` | `mapred-site.xml` | 0.25 | [Reduce Task > Fetcher](ReduceTask#local-fetcher) |
| `mapreduce.job.ubertask.enable` | `mapred-site.xml` | False | [Job > NEW => INITED](Job#new--inited-job_init) |
| `mapreduce.job.ubertask.maxmaps` | `mapred-site.xml` | 9 | [Job > NEW => INITED](Job#new--inited-job_init) |
| `mapreduce.job.ubertask.maxreduces` | `mapred-site.xml` | 1 | [Job > NEW => INITED](Job#new--inited-job_init) |
| `mapreduce.job.ubertask.maxbytes` | `mapred-site.xml` | `dfs.block.size` | [Job > NEW => INITED](Job#new--inited-job_init) |
| `mapreduce.map.failures.maxpercent` | `mapred-site.xml` | 0 | [Job > RUNNING => {RUNNING, COMMITTING, FAIL ABORT}](Job#running--running-committing-fail-abort-job_task_completed) |
| `mapreduce.reduce.failures.maxpercent` | `mapred-site.xml` | 0 | [Job > RUNNING => {RUNNING, COMMITTING, FAIL ABORT}](Job#running--running-committing-fail-abort-job_task_completed) |
| `mapreduce.map.memory.mb` | `mapred-site.xml` | 1024 | [Task Attempt > NEW => UNASSIGNED](TaskAttempt#new--unassigned-ta_schedule) |
| `mapreduce.reduce.memory.mb` | `mapred-site.xml` | 1024 | [Task Attempt > NEW => UNASSIGNED](TaskAttempt#new--unassigned-ta_schedule) |
| `yarn.scheduler.maximum-allocation-mb` | `yarn-site.xml` | 8192 | [Container Allocator](ContainerAllocator) |
| `mapreduce.reduce.shuffle.merge.percent` | `mapred-site.xml` | 0.90 | [Reduce Task > Shuffle](ReduceTask#shuffle---merge) |
| `yarn.resourcemanager.scheduler.class` | `yarn-site.xml` | `CapacityScheduler` | [Resource Manager](ResourceManager) |
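
To give a feel for how a few of the defaults in the table above interact, here is a minimal Python sketch. It is not Hadoop code: the function and class names (`spill_threshold_bytes`, `UberLimits`, `is_uber_candidate`, `exceeds_failure_limit`) are hypothetical, the 128 MiB block size stands in for `dfs.block.size` as an assumption, and the checks are simplified (the real uber-task decision is likely to consider further conditions, such as memory requirements). Treat it as an illustration of the arithmetic, and consult the linked diagrams for the actual control flow.

```python
from dataclasses import dataclass

MIB = 1024 * 1024


def spill_threshold_bytes(io_sort_mb=100, spill_percent=0.80):
    """Sort-buffer fill level, in bytes, at which a map-side spill would start.

    Defaults mirror mapreduce.task.io.sort.mb and
    mapreduce.map.sort.spill.percent from the table.
    """
    return int(io_sort_mb * MIB * spill_percent)


@dataclass
class UberLimits:
    """Hypothetical container for the mapreduce.job.ubertask.* settings."""
    enabled: bool = False        # mapreduce.job.ubertask.enable
    max_maps: int = 9            # mapreduce.job.ubertask.maxmaps
    max_reduces: int = 1         # mapreduce.job.ubertask.maxreduces
    # The table lists dfs.block.size as the default; 128 MiB is an assumption.
    max_bytes: int = 128 * MIB   # mapreduce.job.ubertask.maxbytes


def is_uber_candidate(num_maps, num_reduces, input_bytes, limits):
    """Simplified size check: the job is small enough on every listed limit."""
    return (
        limits.enabled
        and num_maps <= limits.max_maps
        and num_reduces <= limits.max_reduces
        and input_bytes <= limits.max_bytes
    )


def exceeds_failure_limit(failed_tasks, total_tasks, max_percent=0):
    """True when the share of failed tasks is above the allowed percentage.

    max_percent mirrors mapreduce.map.failures.maxpercent (or the reduce
    equivalent); with the default of 0, a single failed task is too many.
    """
    return failed_tasks * 100 > max_percent * total_tasks


print(spill_threshold_bytes())
print(is_uber_candidate(4, 1, 10 * MIB, UberLimits(enabled=True)))
print(exceeds_failure_limit(1, 100))
```

With the defaults, a spill would start once roughly 80 MiB of the 100 MiB sort buffer is in use, and uber mode stays off unless `mapreduce.job.ubertask.enable` is set explicitly.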