Michigan Hadoop (madoop) is a lightweight MapReduce framework for education. Madoop implements the Hadoop Streaming interface. Madoop is implemented in Python and runs on a single machine.
For an in-depth explanation of how to write MapReduce programs in Python for Hadoop Streaming, see our Hadoop Streaming tutorial.
Install Madoop.
$ pip install madoop
Create an example MapReduce program with input files.
$ madoop --example
$ tree example
example
├── input
│ ├── input01.txt
│ └── input02.txt
├── map.py
└── reduce.py
Run the example word count MapReduce program.
$ madoop \
-input example/input \
-output example/output \
-mapper example/map.py \
-reducer example/reduce.py
Concatenate and print the output.
$ cat example/output/part-*
Goodbye 1
Bye 1
Hadoop 2
World 2
Hello 2
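For orientation, a word count mapper and reducer in the Hadoop Streaming style might look like the sketch below. It is not the code that `madoop --example` generates. The function names `map_words` and `reduce_counts` are hypothetical, and the tab between key and value is an assumption based on Hadoop Streaming's default separator.

```python
import itertools


def map_words(lines):
    """Mapper: emit one "word<TAB>1" line per whitespace-separated word."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reduce_counts(sorted_lines):
    """Reducer: sum the counts of each word.

    The input must already be sorted so that lines sharing a key are adjacent.
    """
    # Split each "word<TAB>count" line into a [word, count] pair.
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    # groupby only merges adjacent items, which is why sorted input is required.
    for word, group in itertools.groupby(pairs, key=lambda pair: pair[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"
```

In a standalone streaming script, the mapper would iterate over `sys.stdin` and print each yielded line, and the reducer would do the same with the sorted map output.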
Madoop implements a subset of the Hadoop Streaming interface. You can simulate the Hadoop Streaming interface at the command line with cat and sort.
Here's how to run our example MapReduce program on Apache Hadoop.
$ hadoop \
jar path/to/hadoop-streaming-X.Y.Z.jar \
-input example/input \
-output output \
-mapper example/map.py \
-reducer example/reduce.py
$ cat output/part-*
Here's how to run our example MapReduce program at the command line using cat and sort.
$ cat input/* | ./map.py | sort | ./reduce.py
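The same pipeline can be driven from Python, which is handy in a test script. The sketch below is one possible approach and is not part of Madoop. The helper name `run_pipeline` is hypothetical, and it assumes the mapper and reducer are Python scripts that can be launched with the current interpreter. Python's `sorted` orders lines by code point, which may differ from `sort` under some locales.

```python
import pathlib
import subprocess
import sys


def run_pipeline(input_dir, mapper, reducer):
    """Mimic `cat input/* | ./map.py | sort | ./reduce.py` and return stdout."""
    # cat input/*: concatenate every regular file in the input directory.
    text = "".join(
        path.read_text(encoding="utf-8")
        for path in sorted(pathlib.Path(input_dir).iterdir())
        if path.is_file()
    )
    # ./map.py: feed the concatenated input to the mapper on stdin.
    mapped = subprocess.run(
        [sys.executable, str(mapper)],
        input=text, capture_output=True, text=True, check=True,
    ).stdout
    # sort: order the map output so lines with the same key are adjacent.
    shuffled = "".join(sorted(mapped.splitlines(keepends=True)))
    # ./reduce.py: feed the sorted lines to the reducer on stdin.
    return subprocess.run(
        [sys.executable, str(reducer)],
        input=shuffled, capture_output=True, text=True, check=True,
    ).stdout
```

A call such as `run_pipeline("input", "map.py", "reduce.py")` returns the reducer's output as a single string. `check=True` raises `subprocess.CalledProcessError` if either script exits with a nonzero status.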
Madoop | Hadoop | cat/sort |
---|---|---|
Implements some Hadoop options | All Hadoop options | No Hadoop options |
Multiple mappers and reducers | Multiple mappers and reducers | One mapper, one reducer |
Single machine | Many machines | Single machine |
jar hadoop-streaming-X.Y.Z.jar argument ignored | jar hadoop-streaming-X.Y.Z.jar argument required | No arguments |
Lines within a group are sorted | Lines within a group are sorted | Lines within a group are sorted |
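To make the "Multiple mappers and reducers" and "Lines within a group are sorted" rows concrete, here is a rough sketch of how map output could be split across several reducers. It is an illustration only and does not describe Madoop's or Hadoop's internals. The function name `partition` and the CRC32-based hash are assumptions chosen for this example.

```python
import collections
import zlib


def partition(map_output_lines, num_reducers):
    """Split "key<TAB>value" lines into one sorted list per reducer."""
    partitions = collections.defaultdict(list)
    for line in map_output_lines:
        key = line.split("\t", 1)[0]
        # A deterministic hash sends every line with the same key to the
        # same reducer. The built-in hash() is randomized per process for str.
        index = zlib.crc32(key.encode("utf-8")) % num_reducers
        partitions[index].append(line)
    # Sorting each partition makes lines that share a key adjacent.
    return [sorted(partitions[i]) for i in range(num_reducers)]
```

With one reducer, this reduces to a plain sort, which matches the `cat`/`sort` column: one mapper and one reducer.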
Contributions from the community are welcome! Check out the guide for contributing.
Michigan Hadoop is written by Andrew DeOrio awdeorio@umich.edu.