Paul Houle edited this page Sep 20, 2013 · 1 revision

Documentation

ranSample is a bakemono application that takes a random sample out of Hadoop input files.

ranSample works at the line level; it works for processing N-Triples files, but it also works for any file where records are stored on single lines.
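As a rough illustration of what line-level sampling means, here is a minimal sketch in plain Java, with no Hadoop dependency. It assumes each line is kept independently with probability equal to the fraction, which may not be exactly how ranSample decides. The class name LineSampler and its sample method are hypothetical and are not part of bakemono.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical illustration; this is not ranSample's actual code.
class LineSampler {
    // Keep each line independently with probability `fraction`
    // (an assumption about how a line-level sampler might decide).
    static List<String> sample(List<String> lines, double fraction, Random rng) {
        if (fraction < 0.0 || fraction > 1.0) {
            throw new IllegalArgumentException("fraction must be in [0, 1]");
        }
        List<String> kept = new ArrayList<>();
        for (String line : lines) {
            // nextDouble() is in [0, 1), so 0.0 keeps nothing and 1.0 keeps everything.
            if (rng.nextDouble() < fraction) {
                kept.add(line);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Build some made-up N-Triples-style lines, one record per line.
        List<String> lines = new ArrayList<>();
        for (int i = 0; i < 100000; i++) {
            lines.add("<s" + i + "> <p> <o" + i + "> .");
        }
        List<String> kept = sample(lines, 0.1, new Random(42));
        System.out.println("kept " + kept.size() + " of " + lines.size() + " lines");
    }
}
```

Because each line is decided on its own, the size of the sample is only approximately the requested fraction of the input, and the sampler never needs to look inside a record.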

If you're controlling bakemono with haruhi, the command format is

haruhi run job [-r slots] fraction input output

If the -r option is unspecified, ranSample runs as a map-only job and the number of output splits will be the same as the number of input splits. If you specify a numeric value for -r, the system will run with that number of reduce slots. If you'd like to reduce the number of output splits because the sampled data is much smaller than the input, this is how to do it.

Examples

I sampled 10% of the "a" statements on my local cluster by running

haruhi run job 0.1 /freebase/20130915/a /little_a

Doing so, I get 23 output splits, because that is what I set in the pse3 process. If you'd rather have four output splits, you can do

haruhi run job -r4 0.1 /freebase/20130915/a /little_a2

producing results like

amefurashi$ hadoop fs -ls /little_a2

Found 6 items
-rw-r--r--   3 paul supergroup          0 2013-09-20 17:06 /little_a2/_SUCCESS
drwxr-xr-x   - paul supergroup          0 2013-09-20 17:02 /little_a2/_logs
-rw-r--r--   3 paul supergroup   21601776 2013-09-20 17:06 /little_a2/part-r-00000.gz
-rw-r--r--   3 paul supergroup   21609228 2013-09-20 17:06 /little_a2/part-r-00001.gz
-rw-r--r--   3 paul supergroup   21634772 2013-09-20 17:06 /little_a2/part-r-00002.gz
-rw-r--r--   3 paul supergroup   21589370 2013-09-20 17:06 /little_a2/part-r-00003.gz

Notes

Previous versions of Infovore had a random sampler application that was not Hadoop based. Random samples have many practical uses, such as:

  1. If you need summary statistics, a random sample can give you good enough results in less time.
  2. Small random samples can speed up your test cycle.
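To make the first point concrete, here is a small sketch of scaling a count taken from a sample back up to an estimate for the full data set. It assumes every line had the same chance of being kept, equal to the sampling fraction. The class SampleEstimate and the numbers in main are made up for illustration.

```java
// Hypothetical illustration; not part of bakemono or haruhi.
class SampleEstimate {
    // Estimate a full-data count from a count observed in the sample,
    // assuming each line was kept with probability `fraction`.
    static double estimateTotal(long countInSample, double fraction) {
        if (fraction <= 0.0 || fraction > 1.0) {
            throw new IllegalArgumentException("fraction must be in (0, 1]");
        }
        return countInSample / fraction;
    }

    public static void main(String[] args) {
        // Made-up numbers: 1500 matching lines seen in a 10% sample.
        double estimate = estimateTotal(1500, 0.1);
        System.out.println("estimated matches in full data: " + estimate);
    }
}
```

The estimate is noisy for rare items, so it is best suited to statistics backed by many sampled lines.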

The current version has a workaround for a gotcha: you can't pass null to a reducer.

Despite that, ranSample could be a good pedagogical example of how to pass configuration to a mapper or reducer through Hadoop's Configuration mechanism, and I could write it up someday.
