Getting started with Prince in 5 minutes

goossaert edited this page Sep 14, 2010 · 32 revisions

1. How to install Prince?

  1. Install Hadoop
    If you need help getting and installing Hadoop, I recommend Michael Noll’s excellent tutorial.
  2. Set the HADOOP_HOME environment variable
    The HADOOP_HOME environment variable must point to Hadoop’s installation directory, that is, the directory that contains the “bin” and “contrib” directories. Set it on your local system and on every node of your cluster.
  3. Install python-setuptools
    On Ubuntu/Debian:
    $ sudo apt-get update
    $ sudo apt-get install python-setuptools
    

    On Redhat/Fedora/SuSE/Mandrake:
    $ su -
    $ yum install python-setuptools
    

    For Cygwin, Mac OS X and other platforms, please read the instructions here:
    http://pypi.python.org/pypi/setuptools
  4. Download the latest version of Prince
    $ wget http://github.com/goossaert/prince/tarball/v0.1 -O prince.tar.gz
  5. Unpack and install Prince
    $ tar zxf prince.tar.gz
    $ cd goossaert-prince-xxxxxxx    # "xxxxxxx" will change depending on the version
    $ sudo python setup.py install
    
  6. Use Prince
    Just add “import prince” at the top of your Python program, and run it with python program.py.
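
Before running a first job, it can help to confirm that step 2 was done correctly. The following is a minimal sketch, not part of Prince: the helper name check_hadoop_home is hypothetical, and it only checks what this page states, namely that HADOOP_HOME is set and that the directory contains “bin” and “contrib”.

```python
import os


def check_hadoop_home(environ=None):
    """Return a list of problems found with HADOOP_HOME (empty if none).

    Hypothetical helper, not part of Prince. It only verifies that the
    variable is set and that the directory holds "bin" and "contrib".
    """
    if environ is None:
        environ = os.environ
    home = environ.get("HADOOP_HOME")
    if not home:
        return ["HADOOP_HOME is not set"]
    if not os.path.isdir(home):
        return ["HADOOP_HOME does not point to a directory: %s" % home]
    problems = []
    for sub in ("bin", "contrib"):
        if not os.path.isdir(os.path.join(home, sub)):
            problems.append("missing '%s' directory under %s" % (sub, home))
    return problems


if __name__ == "__main__":
    # Print one line per problem; print nothing if everything looks fine.
    for problem in check_hadoop_home():
        print(problem)
```

Run it on your local system and on each node; no output means the variable looks usable.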

2. The classic word count example

Here is how Prince performs a word count:

wordcount.py

import prince

def wc_mapper(key, value):
    for word in value.split():
        yield word, 1

def wc_reducer(key, values):
    try:
        yield key, sum(int(v) for v in values)
    except ValueError:
        pass  # discard non-numerical values

if __name__ == "__main__":
    prince.init() # Always call prince.init() at the beginning of a program
    prince.run(wc_mapper, wc_reducer, 'input_file', 'output_file', inputformat='text', outputformat='text')
    print prince.dfs.read('output_file/part*') # Read the output file and print it 

Then, as the hadoop user, simply run:

hadoop@prince$ python wordcount.py

More examples can be found on the Example programs with Prince page, including Dijkstra’s algorithm, merge sort and PageRank.
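
To see what the mapper and reducer above compute without a cluster, the two functions can be exercised in a single process. This is only a sketch of the map, group-by-key and reduce steps: the simulate helper is hypothetical and not part of Prince, and passing the line index as the mapper key is an assumption, since this page does not say which key Prince supplies for text input.

```python
from collections import defaultdict


def wc_mapper(key, value):
    for word in value.split():
        yield word, 1


def wc_reducer(key, values):
    try:
        yield key, sum(int(v) for v in values)
    except ValueError:
        pass  # discard non-numerical values


def simulate(mapper, reducer, lines):
    """Hypothetical in-process stand-in for a MapReduce run (not Prince's API)."""
    grouped = defaultdict(list)
    # Map phase: the line index is used as the key (an assumption).
    for index, line in enumerate(lines):
        for k, v in mapper(index, line):
            grouped[k].append(v)
    # Reduce phase: each key is reduced once, with all of its values.
    results = {}
    for k in sorted(grouped):
        for out_key, out_value in reducer(k, grouped[k]):
            results[out_key] = out_value
    return results


counts = simulate(wc_mapper, wc_reducer,
                  ["the snake ate the elephant", "the elephant"])
print(counts)
```

Testing the functions this way first makes it easier to tell logic errors apart from cluster or configuration problems.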

3. Extra-light API

init()

Method to call at the beginning of all programs.

run()

Runs a Hadoop task with the specified Python mapper and reducer methods.

get_parameters()

Method callable from the mapper and reducer methods to get parameters passed by the run() method.

dfs.read()

Reads the content of files on the DFS.

dfs.write()

Writes content to a file on the DFS.

dfs.exists()

Tests whether a path exists on the DFS.

You can find more details about these methods in the Prince API Reference.
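
As a rough illustration of how dfs.exists() and dfs.read() might be combined, here is a sketch under stated assumptions: that dfs.exists() takes a single path and returns a truth value, and that dfs.read() accepts a pattern such as 'output_file/part*' as in the word count example. Neither signature is confirmed on this page, so check the Prince API Reference. The helper read_job_output and the FakeDfs class are hypothetical; FakeDfs only stands in for prince.dfs so the snippet runs without a cluster.

```python
import fnmatch


def read_job_output(dfs, output_dir):
    """Return the job output, or None if the output directory is missing.

    `dfs` is expected to behave like prince.dfs. The single-argument
    exists(path) call is an assumption, not a documented signature.
    """
    if not dfs.exists(output_dir):
        return None
    return dfs.read(output_dir + "/part*")


class FakeDfs(object):
    """Hypothetical in-memory stand-in for prince.dfs, for local trials only."""

    def __init__(self, files):
        self.files = files  # maps a file path to its content

    def exists(self, path):
        return any(name == path or name.startswith(path + "/")
                   for name in self.files)

    def read(self, pattern):
        return "".join(self.files[name] for name in sorted(self.files)
                       if fnmatch.fnmatch(name, pattern))


fake = FakeDfs({"output_file/part-00000": "the\t3\n"})
print(read_job_output(fake, "output_file"))
```

In a real program, prince.dfs would be passed in place of the fake object, after calling prince.init().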

4. What you need to know before you start coding

  1. Always start with init(). Make sure you call prince.init() before you do anything else in your program.
  2. Do not use print in mappers and reducers. Standard input and output carry the data between your methods and Hadoop Streaming. Anything you print is written to standard output, where Hadoop interprets it as a (key, value) pair, and this alters your computations.
  3. Have all mappers and reducers in the local namespace. All your mapper and reducer methods must be accessible in the namespace of your program: either define them in the same file as your program, or import them with ‘from imported_file import method’. If you import them, you also need to make the imported files accessible, as explained in the next item.
  4. Make all imports accessible. If you import other Python files, add the file names to the ‘files’ argument of the run() method. This is particularly important if you import mappers and reducers from another file. If you import external libraries, make sure that they are correctly installed on the worker nodes of your cluster. See the example import count in the repository for more details about that, and have a look at the Prince API Reference to learn about the ‘files’ argument.
  5. Import all files from the directories your programs are in. All imported files must be placed in the same directory as your programs on your local hard drive. This, of course, does not apply to imported libraries if they are correctly installed on the worker nodes.
  6. Make all used files accessible. If you use a file in your computation, you must add it to the ‘files’ argument too.
  7. Avoid global variables. Keep in mind that Hadoop will start multiple processes with your program, on the same node or on different nodes. Because each process has its own memory space, a global variable is not shared between them. You can still use one if you wish, but any modification stays local to the map or reduce process that made it.
  8. Time for a coffee break. Do not expect to have awesome computation speed, since it is Python over Java. For performance, prefer Java to Python. But for learning purposes and fast coding, it is nice to have Python.
  9. The Parallel Dimension Debugging Syndrome. The PDDS is what happens when you send your program into a black hole, which is exactly the case with Hadoop Streaming. Once Hadoop starts mapping or reducing with your Python methods, you have no control over what is happening, and you cannot see the exceptions raised or the error messages. A solution is the --trace argument, which saves the trace of an exception raised on Hadoop’s side. You can learn more about it on the dedicated page: Debugging with Prince.
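
The warning about print in item 2 can be illustrated with a simplified model of how a streaming consumer reads a task's standard output: each line is split at the first tab into a key and a value, and a line without a tab becomes a key with an empty value. The functions below are hypothetical and do not use Prince. They are written for a current Python 3 interpreter and only show why a stray debug line ends up in the data, and that sending diagnostics to standard error keeps the data channel clean. Whether Prince surfaces standard error messages is not covered on this page.

```python
import io
import sys


def parse_streaming_output(text):
    """Simplified model: split each line at the first tab into (key, value)."""
    pairs = []
    for line in text.splitlines():
        key, _, value = line.partition("\t")
        pairs.append((key, value))
    return pairs


def noisy_mapper(words, out):
    for word in words:
        out.write("processing %s\n" % word)  # stray debug line on the data channel
        out.write("%s\t1\n" % word)


def quiet_mapper(words, out, log=sys.stderr):
    for word in words:
        log.write("processing %s\n" % word)  # diagnostics go to the log stream
        out.write("%s\t1\n" % word)


words = ["snake", "elephant"]

noisy_out = io.StringIO()
noisy_mapper(words, noisy_out)
noisy_pairs = parse_streaming_output(noisy_out.getvalue())

quiet_out = io.StringIO()
quiet_mapper(words, quiet_out, log=io.StringIO())
quiet_pairs = parse_streaming_output(quiet_out.getvalue())

print(noisy_pairs)  # contains extra pairs such as ('processing snake', '')
print(quiet_pairs)  # contains only the intended (word, '1') pairs
```

The extra pairs produced by the noisy version would reach the reducer like any other key, which is how a harmless-looking debug message changes the result of a job.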

5. Where does the name “Prince” come from?

Prince is a reference to The Little Prince, a novel written by Antoine de Saint-Exupéry and published in 1943. In this novel, there is a very unusual illustration by Saint-Exupéry of a snake eating an elephant:

An elephant in a snake

As the logo of Hadoop is an elephant, and the logo of Python is a snake, this depicts exactly what Prince is doing: it’s an elephant in a snake, or Hadoop in Python. You can learn more about The Little Prince on Wikipedia. The Little Prince and the related artworks are copyrighted materials owned by N.R.F. Gallimard.

6. What’s next?

Download Prince, use the API, enjoy and give me feedback. There are tons of cool applications for the MapReduce paradigm, so go find one and solve it with the fun of Python. Here are a few pages you can visit now:

The Little Prince