# ranSample

`ranSample` is a bakemono application that takes a random sample out of Hadoop input files. `ranSample` works at the line level; it was written to process N-Triples files, but it works for any file where records are stored on single lines.
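Conceptually, line-level sampling is just an independent coin flip per line. Here is a minimal sketch in plain Java (my own illustration, not bakemono's actual code; the class and method names are made up):

```java
import java.util.List;
import java.util.Random;
import java.util.stream.Collectors;

public class LineSampler {
    // Keep each input line independently with probability `fraction`.
    // A fixed seed makes runs repeatable; a real Hadoop job would
    // typically seed each task separately.
    static List<String> sample(List<String> lines, double fraction, long seed) {
        Random rng = new Random(seed);
        return lines.stream()
                    .filter(line -> rng.nextDouble() < fraction)
                    .collect(Collectors.toList());
    }
}
```

Because each line is kept or dropped on its own, the sampler never needs to see more than one record at a time, which is what makes it trivially parallel across Hadoop splits.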
If you're controlling bakemono with haruhi, the command format is

```
haruhi run job [-r slots] fraction input output
```
If the `-r` option is unspecified, `ranSample` runs as a map-only job and the number of output splits will be the same as the number of input splits. If you specify a numeric value for `-r`, the job will run with that number of reduce slots. If you'd like to reduce the number of output splits because the sampled data is much smaller, this is how to do it.
I sampled 10% of the `a` statements on my local cluster by running

```
haruhi run job 0.1 /freebase/20130915/a /little_a
```
Doing so, I get 23 output splits, because that is what I set in the `pse3` process. If you'd rather have four output splits, you can do

```
haruhi run job -r4 0.1 /freebase/20130915/a /little_a2
```

producing results like
```
amefurashi$ hadoop fs -ls /little_a2
Found 6 items
-rw-r--r--   3 paul supergroup        0 2013-09-20 17:06 /little_a2/_SUCCESS
drwxr-xr-x   - paul supergroup        0 2013-09-20 17:02 /little_a2/_logs
-rw-r--r--   3 paul supergroup 21601776 2013-09-20 17:06 /little_a2/part-r-00000.gz
-rw-r--r--   3 paul supergroup 21609228 2013-09-20 17:06 /little_a2/part-r-00001.gz
-rw-r--r--   3 paul supergroup 21634772 2013-09-20 17:06 /little_a2/part-r-00002.gz
-rw-r--r--   3 paul supergroup 21589370 2013-09-20 17:06 /little_a2/part-r-00003.gz
```
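Since each line is an independent coin flip, the observed sample fraction converges on the requested one as the input grows. A standalone simulation (not part of ranSample) makes that easy to check:

```java
import java.util.Random;

public class FractionCheck {
    // Simulate n per-line coin flips and report the observed keep rate.
    static double observedFraction(int n, double fraction, long seed) {
        Random rng = new Random(seed);
        int kept = 0;
        for (int i = 0; i < n; i++) {
            if (rng.nextDouble() < fraction) kept++;
        }
        return (double) kept / n;
    }
}
```

For 100,000 lines at fraction 0.1, the observed rate typically lands within a few tenths of a percent of 0.1; the standard deviation of the observed fraction shrinks as 1/√n.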
Previous versions of Infovore had a random sampler application that was not Hadoop-based. Random samples have many practical uses:

- if you need summary statistics, a random sample can give you good-enough results in less time
- small random samples can speed up your test cycle
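For the summary-statistics use case, a count observed in the sample scales up by 1/fraction to estimate the count in the full data. A one-line sketch of that arithmetic (my own illustration):

```java
public class SampleEstimate {
    // Estimate a population count from a count observed in a random
    // sample that kept each record with probability `fraction`.
    static double estimateTotal(long countInSample, double fraction) {
        return countInSample / fraction;
    }
}
```

So seeing 1,000 matching statements in a 10% sample suggests roughly 10,000 in the full dataset, with sampling error that you can bound using the binomial distribution.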
The current version contains a workaround for a Hadoop gotcha: you can't pass null to a reducer.
Despite that, `ranSample` could be a good pedagogical example of how one can pass configuration to a mapper or reducer through Hadoop's Configuration mechanism, and I may write it up someday.
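The pattern in question is simple to sketch: the job driver writes parameters such as the sampling fraction into a key-value configuration, and each mapper or reducer reads them back at setup time. Below is a self-contained mimic using a plain Map; the real code would use Hadoop's `org.apache.hadoop.conf.Configuration`, and the property name here is made up:

```java
import java.util.HashMap;
import java.util.Map;

public class ConfigPassing {
    // Hypothetical property name; Hadoop convention is a dotted key.
    static final String FRACTION_KEY = "ransample.fraction";

    // Driver side: store the fraction before submitting the job.
    static Map<String, String> driverConf(double fraction) {
        Map<String, String> conf = new HashMap<>();
        conf.put(FRACTION_KEY, Double.toString(fraction));
        return conf;
    }

    // Mapper/reducer side: read it back (e.g. in setup()), with a default.
    static double readFraction(Map<String, String> conf) {
        return Double.parseDouble(conf.getOrDefault(FRACTION_KEY, "1.0"));
    }
}
```

The point of the mechanism is that the Configuration object is serialized and shipped to every task, so it is the standard way to get small scalar parameters from the driver into distributed mappers and reducers.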