Debugging with Prince
As explained in the “Getting started” page, debugging with Hadoop Streaming can be very difficult. When Hadoop starts mapping or reducing with your Python methods, you have no control over what is happening, and you cannot see the raised exceptions and error messages. This gives the very frustrating feeling of sending a program into a black hole, and gives rise to PDDS, the Parallel Dimension Debugging Syndrome.
In order to avoid losing hours looking for insignificant bugs, a trace writing feature has been included in Prince. To use it, simply add the argument --trace basename to the command line of your program. If an exception is raised in one of your mapper or reducer methods, the trace of the exception will be written to your DFS at the path given by the base name, augmented with an index number. For instance:
python program.py argument1 argument2 --trace mytracefile
If an exception is raised, it will be intercepted by Prince and the trace will be written to the DFS at the location “mytracefile0”. If this file already exists, the trace will be written to “mytracefile1”, and so on until an available location is found. Please be advised that testing for existing files on Hadoop can be very slow, so if too many files with the same base name are present, finding an available location can waste a non-trivial amount of time. That said, remember to clean up trace files every once in a while, or use a different base name for every new run of a program. A fast and easy way to check the content of a saved trace file is to use Hadoop’s NameNode web interface on your master node, which should be available at the URL http://your_master:50070/. You can also use the “-cat” command of hadoop dfs.
Finally, as you may have noticed, the --trace argument can be used even when other arguments are defined in the program. Just make sure that Prince’s init() method is run at the very beginning of the program; the --trace argument will then be handled and removed, so your checks on len(sys.argv) remain valid.
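As an illustration, here is a minimal sketch of such a program skeleton. Only the init() call and the --trace handling come from the description above; the prince module name, the main() wrapper, and the two positional arguments are assumptions made for this example:

```python
import sys
import prince

def main():
    # Run init() first: it handles and removes the --trace argument,
    # so the check on len(sys.argv) below only sees the program's
    # own arguments.
    prince.init()

    if len(sys.argv) != 3:
        sys.stderr.write('Usage: program.py argument1 argument2 [--trace basename]\n')
        sys.exit(1)

    argument1, argument2 = sys.argv[1], sys.argv[2]
    # ... set up and launch the mapping/reducing job from here ...

if __name__ == '__main__':
    main()
```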
Here are a few tips on the most common mistakes when coding with Python on Hadoop.
- Incorrect I/O tuples. The primary source of mistakes is incorrect reading and writing of keys and values in mapper and reducer methods. When in doubt, just use the identity mapper and reducer methods to check that the input you receive is what you expect.
- Use of print in mappers and reducers. Since standard input and output are used by Hadoop Streaming to communicate with Python programs, calling print interferes with this process by introducing invalid tuples.
- Mapper and reducer parameters. Prince takes care of filling in the parameters of the mapper and reducer methods you provide. All keys and values are strings, except for the reducer’s values parameter, which is a generator of strings. Make sure to convert these strings to the primitive type you need with int() or float() before using them, as in the sketch after this list.
- Reducing duplicate keys. A reducer can return or yield a given key only once. If a key is returned multiple times, the reducing process will fail. Make sure every key returned is unique, as in the reducer sketched below.
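To make the last two points concrete, here is a word-count-style sketch. The mapper(key, value) and reducer(key, values) signatures and the yielded (key, value) tuples are assumptions drawn from the descriptions above, not a verbatim listing of the Prince API:

```python
def wordcount_mapper(key, value):
    # key and value both arrive as strings; emit one (word, count)
    # tuple per word found in the input line.
    for word in value.split():
        yield word, 1

def wordcount_reducer(key, values):
    # values is a generator of strings, so convert each item with
    # int() before summing. The key is yielded exactly once, which
    # avoids the duplicate-key failure described above.
    yield key, sum(int(v) for v in values)
```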