Executing SAMOA with Apache Storm
In this tutorial we will describe how to execute SAMOA on top of Storm. Here is an outline of what we want to do:
- Ensure that you have the necessary Storm cluster and configuration to execute SAMOA
- Ensure that you have all the SAMOA deployables for execution in the cluster
- Configure samoa-storm.properties
- Execute a SAMOA classification task
- Observe the task execution
Before we start the tutorial, please ensure that you already have a Storm cluster running (preferably Storm 0.8.2). You can follow this tutorial to set up a Storm cluster.
You also need to install Storm on the machine from which you initiate the deployment, and configure Storm with at least the following settings in ~/.storm/storm.yaml:
########### These MUST be filled in for a storm configuration
nimbus.host: "<enter your nimbus host name here>"
## List of custom serializations
kryo.register:
- samoa.classifiers.trees.AttributeContentEvent: samoa.classifiers.trees.AttributeContentEvent$AttributeCEFullPrecSerializer
- samoa.classifiers.trees.ComputeContentEvent: samoa.classifiers.trees.ComputeContentEvent$ComputeCEFullPrecSerializer
Or, if you are using SAMOA with the optimized VHT, use the following configuration instead:
########### These MUST be filled in for a storm configuration
nimbus.host: "<enter your nimbus host name here>"
## List of custom serializations
kryo.register:
- samoa.classifiers.trees.NaiveAttributeContentEvent: samoa.classifiers.trees.NaiveAttributeContentEvent$NaiveAttributeCEFullPrecSerializer
- samoa.classifiers.trees.ComputeContentEvent: samoa.classifiers.trees.ComputeContentEvent$ComputeCEFullPrecSerializer
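The two configurations above differ only in which attribute event serializer is registered. As a minimal sketch, assuming you want to generate the file contents rather than edit them by hand, the following Python helper renders either variant. The function name render_storm_yaml and the host name nimbus.example.org are hypothetical and not part of SAMOA or Storm; the class names are copied from the configurations above.

```python
# Serializer registrations copied from the two storm.yaml variants above.
STANDARD_VHT = [
    ("samoa.classifiers.trees.AttributeContentEvent",
     "samoa.classifiers.trees.AttributeContentEvent$AttributeCEFullPrecSerializer"),
    ("samoa.classifiers.trees.ComputeContentEvent",
     "samoa.classifiers.trees.ComputeContentEvent$ComputeCEFullPrecSerializer"),
]
OPTIMIZED_VHT = [
    ("samoa.classifiers.trees.NaiveAttributeContentEvent",
     "samoa.classifiers.trees.NaiveAttributeContentEvent$NaiveAttributeCEFullPrecSerializer"),
    ("samoa.classifiers.trees.ComputeContentEvent",
     "samoa.classifiers.trees.ComputeContentEvent$ComputeCEFullPrecSerializer"),
]


def render_storm_yaml(nimbus_host, optimized_vht=False):
    """Return the text of a minimal storm.yaml (hypothetical helper)."""
    if not nimbus_host:
        raise ValueError("nimbus_host must not be empty")
    entries = OPTIMIZED_VHT if optimized_vht else STANDARD_VHT
    lines = [f'nimbus.host: "{nimbus_host}"', "kryo.register:"]
    # One list item per custom serialization, mapping event class to serializer.
    lines += [f"  - {event}: {serializer}" for event, serializer in entries]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # "nimbus.example.org" is a placeholder; use your own nimbus host name.
    print(render_storm_yaml("nimbus.example.org"))
```

The helper only returns text. Review the output before saving it as ~/.storm/storm.yaml, since an existing file may contain other settings.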
Alternatively, if you do not have a Storm cluster running, you can execute SAMOA with Storm in local mode, as explained in the section samoa-storm.properties Configuration.
There are three deployables for executing SAMOA on top of Storm:
- samoa is the main script to execute SAMOA. You do not need to change anything in this script.
- SAMOA-Storm-x.x.x.jar is the deployed jar file, where x.x.x is the version number of SAMOA.
- samoa-storm.properties contains deployment configurations. You need to set the parameters in this properties file correctly.
Currently, the properties file contains two configurations:
- samoa.storm.mode determines whether the task is executed locally (using Storm's LocalCluster) or in a Storm cluster. Use local if you want to test SAMOA and do not have a Storm cluster for deployment. Use cluster if you want to test SAMOA on your Storm cluster.
- samoa.storm.numworker determines the number of workers that execute the SAMOA tasks in the Storm cluster. This field must be an integer less than or equal to the number of available slots in your Storm cluster. If you are using local mode, this property corresponds to the number of threads used by Storm's LocalCluster to execute your SAMOA task.
Here is an example of a complete properties file:
# SAMOA Storm properties file
# This file contains specific configurations for SAMOA deployment in the Storm platform
# Note that you still need to configure Storm client in your machine,
# including setting up Storm configuration file (~/.storm/storm.yaml) with correct settings
# samoa.storm.mode corresponds to the execution mode of the Task in Storm
# possible values:
# 1. cluster: the Task will be sent into nimbus. The nimbus is configured by Storm configuration file
# 2. local: the Task will be sent using local Storm cluster
samoa.storm.mode=cluster
# samoa.storm.numworker corresponds to the number of worker processes allocated in Storm cluster
# possible values: any integer greater than 0
samoa.storm.numworker=7
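Since a wrong mode or worker count only surfaces at deployment time, it can help to check the file beforehand. The sketch below is one possible pre-flight check in Python, not part of SAMOA. It assumes the file uses only plain key=value lines and # comments (the full Java properties syntax is richer), and the function names and the available_slots parameter are hypothetical.

```python
def parse_properties(text):
    """Parse simple key=value lines, skipping blank lines and '#' comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if not sep:
            raise ValueError(f"not a key=value line: {line!r}")
        props[key.strip()] = value.strip()
    return props


def validate_samoa_storm(props, available_slots=None):
    """Check the two documented properties and return (mode, workers)."""
    mode = props.get("samoa.storm.mode")
    if mode not in ("local", "cluster"):
        raise ValueError(f"samoa.storm.mode must be local or cluster, got {mode!r}")
    try:
        workers = int(props.get("samoa.storm.numworker", ""))
    except ValueError:
        raise ValueError("samoa.storm.numworker must be an integer") from None
    if workers < 1:
        raise ValueError("samoa.storm.numworker must be greater than 0")
    # The slot limit only applies to a real cluster; pass it in if you know it.
    if mode == "cluster" and available_slots is not None and workers > available_slots:
        raise ValueError(f"{workers} workers requested, {available_slots} slots available")
    return mode, workers


if __name__ == "__main__":
    example = "# comment\nsamoa.storm.mode=cluster\nsamoa.storm.numworker=7\n"
    print(validate_samoa_storm(parse_properties(example), available_slots=8))
```

The number of available slots is not something the properties file knows, so the check takes it as an optional argument that you would read from the Storm UI or your cluster configuration.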
You can execute a SAMOA task using the aforementioned samoa script with the following format:
samoa <platform> <jar> "<task>"
<platform> can be storm or s4. Using the storm option means you are deploying SAMOA on a Storm environment. In this configuration, the script uses the aforementioned YAML file (~/.storm/storm.yaml) and samoa-storm.properties to perform the deployment. Using the s4 option means you are deploying SAMOA on an Apache S4 environment. Follow this link to learn more about deploying SAMOA on Apache S4.
<jar> is the location of the deployed jar file (SAMOA-Storm-x.x.x.jar) in your file system. The location can be a relative or an absolute path to the jar file.
"<task>" is the SAMOA task command line, such as PrequentialEvaluation or ClusteringTask. This command line follows the format of Massive Online Analysis (MOA).
The complete command to execute SAMOA is:
cd src/main/resources
./samoa storm ../../../target/SAMOA-Storm-0.0.1.jar "PrequentialEvaluation -d /tmp/dump.csv -i 1000000 -f 100000 -l (com.yahoo.labs.samoa.learners.classifiers.trees.VerticalHoeffdingTree -p 4) -s (com.yahoo.labs.samoa.moa.streams.generators.RandomTreeGenerator -c 2 -o 10 -u 10)"
The example above uses the Prequential Evaluation task and the Vertical Hoeffding Tree classifier.
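The task string contains spaces and parentheses, so it must reach the samoa script as a single argument. As a small sketch, assuming you launch the script from Python rather than typing the command, the snippet below assembles the argument list from the three parts described above. The helper name build_samoa_command is hypothetical; the paths and task are taken from the command above.

```python
import shlex


def build_samoa_command(platform, jar, task, script="./samoa"):
    """Return the argument list for: samoa <platform> <jar> "<task>"."""
    if platform not in ("storm", "s4"):
        raise ValueError(f"platform must be storm or s4, got {platform!r}")
    # The task stays one list element, so no manual quoting is needed.
    return [script, platform, jar, task]


task = (
    "PrequentialEvaluation -d /tmp/dump.csv -i 1000000 -f 100000 "
    "-l (com.yahoo.labs.samoa.learners.classifiers.trees.VerticalHoeffdingTree -p 4) "
    "-s (com.yahoo.labs.samoa.moa.streams.generators.RandomTreeGenerator -c 2 -o 10 -u 10)"
)
argv = build_samoa_command("storm", "../../../target/SAMOA-Storm-0.0.1.jar", task)

# Print a shell-safe rendering of the command for inspection.
print(shlex.join(argv))
```

To actually run it, the list could be passed to subprocess.run(argv, cwd="src/main/resources", check=True), which matches the cd step in the command above.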
There are two ways to observe the task execution: using the Storm UI and monitoring the dump file of the SAMOA task. Note that the dump file will be created on the cluster if you are executing your task in cluster mode.
Go to the web address of the Storm UI and check whether the SAMOA task executes as intended. Use this UI to kill the associated Storm topology if necessary.
Several tasks have an option to specify a dump file, which is a file that represents the task output. In our example, the Prequential Evaluation task has the -d option, which specifies the path to the dump file. Since Storm decides where the Storm tasks are allocated, you should set the dump file to a path on a shared filesystem if you want to access it from the machine submitting the task.
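One simple way to monitor the dump file is to look at its most recent lines while the task runs. The following Python sketch does that without assuming anything about the columns SAMOA writes, since their layout is not described here. The function name tail_lines is hypothetical, and /tmp/dump.csv is the path used with -d in the example command.

```python
import os
from collections import deque


def tail_lines(path, n=5):
    """Return the last n lines of a text file, without trailing newlines."""
    with open(path, "r", encoding="utf-8") as handle:
        # A bounded deque keeps only the most recent n lines in memory.
        return [line.rstrip("\n") for line in deque(handle, maxlen=n)]


if __name__ == "__main__":
    dump_path = "/tmp/dump.csv"  # path passed to -d in the example command
    if os.path.exists(dump_path):
        for row in tail_lines(dump_path):
            print(row)
    else:
        print(f"{dump_path} not found; in cluster mode it is written on the cluster")
```

In cluster mode this only works from the submitting machine if the dump path is on a shared filesystem, as noted above.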