Standalone hadoop
What if you don't have access to a hadoop cluster? Does that mean you can't use Thrax? Of course not! (It'll just take a lot longer.) Hadoop can be run in standalone mode on a single computer. You might say that defeats the purpose of using hadoop in the first place, but it does give you some nice things for free, like sorting records on disk. So let's get started setting up a standalone hadoop install.
For a happy medium between standalone and a full cluster, see pseudodistributed hadoop.
It's quite easy to get everything you need in one tarball. Here's one link:
wget http://apache.cs.utah.edu//hadoop/core/hadoop-0.20.2/hadoop-0.20.2.tar.gz
More generally, the official Hadoop Common page has links to all recent versions. Once you have the tarball, you can use
tar -xzf hadoop-0.20.2.tar.gz
to unpack it.
The short version: set the three properties listed below to directories where you have lots of free hard disk space.
Hadoop is ready to run in standalone mode essentially as soon as you unpack it. But since standalone mode is really meant for small-scale testing, and not for production usage, you have to make some changes in the configuration if you want to use it with a large dataset. The configuration file you need to change is $HADOOP/conf/mapred-site.xml. Here's how mine looks:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/Users/jonny/mapred/tmp</value>
  </property>
  <property>
    <name>mapred.local.dir</name>
    <value>/Users/jonny/mapred/local</value>
    <final>true</final>
  </property>
  <property>
    <name>mapred.system.dir</name>
    <value>/Users/jonny/mapred/system</value>
    <final>true</final>
  </property>
</configuration>
As you can see, I added three properties:

- hadoop.tmp.dir is the base directory where temporary hadoop files are stored during a job.
- mapred.local.dir is where intermediate files are stored during a job: output from map tasks, chunks of data during the shuffle, and so on.
- mapred.system.dir is where shared files are stored during a job.
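Hadoop will generally create these directories on its own, but it's easy to set them up and sanity-check them yourself. A minimal sketch, assuming a base directory of $HOME/mapred (the paths here are just examples; substitute a directory on whatever partition has the most free space):

```shell
# Example only: pick a base directory on a partition with plenty of free space.
BASE="$HOME/mapred"

# Create the three directories named in mapred-site.xml.
mkdir -p "$BASE/tmp" "$BASE/local" "$BASE/system"

# Confirm they exist and see how much space their partition has.
ls "$BASE"
df -h "$BASE"
```

Whatever paths you choose, they must match the `<value>` elements in your mapred-site.xml exactly.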
The default for all three settings is somewhere under the local /tmp. The problem is that whatever partition /tmp is on is almost certainly not big enough to hold all the intermediate data from a normal-sized hadoop run. That's why I reset those values to places where I know I have a lot of disk space.
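If you're not sure whether the defaults are safe on your machine, you can check which partition /tmp lives on and how much free space it has, and compare against the location you're thinking of using instead (the $HOME path is just an example):

```shell
# See which partition /tmp is on and how much free space it has.
df -h /tmp

# Compare against a candidate directory for the hadoop properties
# (path is an example; substitute your own).
df -h "$HOME"
```

If the "Avail" column for /tmp is small relative to your input corpus, point the three properties somewhere roomier.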