Tools branch
The tools branch (at jweese/thrax/tools) is an experimental branch that refactors the Thrax pipeline into a collection of independently-runnable tools. The next step for this branch will be to include an automatic dependency-handling and task-tracking system, but that somewhat depends on Hadoop stabilizing its API at some point.
Each tool can be easily run from the command line. We present an example pipeline below. Note: it is important to keep <work directory> consistent among all the commands in this pipeline. The nice thing is, you don't have to worry about a local work directory anymore -- it doesn't exist! The work directory is a directory on the Hadoop cluster.
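A minimal sketch of keeping that path consistent: set it once as a shell variable and reuse it in every command below. The HDFS path here is hypothetical.
$ WORKDIR=/user/you/thrax-work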
Extract rules from the corpus. Assuming you have the corpus set up as in the Quickstart, you can extract all the rules (without feature scores) by running
$ hadoop jar bin/thrax.jar edu.jhu.thrax.hadoop.tools.ExtractionTool <conf file> <input path> <work directory>
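For instance, with a config file thrax.conf and a corpus at /user/you/corpus (both paths hypothetical), the invocation might look like:
$ hadoop jar bin/thrax.jar edu.jhu.thrax.hadoop.tools.ExtractionTool thrax.conf /user/you/corpus $WORKDIR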
If using the lexprob feature, extract word-level lexical probabilities:
$ hadoop jar bin/thrax.jar edu.jhu.thrax.hadoop.tools.TargetWordGivenSourceWordProbabilityTool <input path> <work directory>
$ hadoop jar bin/thrax.jar edu.jhu.thrax.hadoop.tools.SourceWordGivenTargetWordProbabilityTool <input path> <work directory>
Java is so verbose.
Parallelization: extraction and word-level lexical probabilities can all be run at the same time.
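A minimal sketch of that parallelism, reusing the hypothetical paths from above: since each hadoop jar invocation blocks until its job finishes, you can background all three and wait for them together.
$ hadoop jar bin/thrax.jar edu.jhu.thrax.hadoop.tools.ExtractionTool thrax.conf /user/you/corpus $WORKDIR &
$ hadoop jar bin/thrax.jar edu.jhu.thrax.hadoop.tools.TargetWordGivenSourceWordProbabilityTool /user/you/corpus $WORKDIR &
$ hadoop jar bin/thrax.jar edu.jhu.thrax.hadoop.tools.SourceWordGivenTargetWordProbabilityTool /user/you/corpus $WORKDIR &
$ wait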
Run map-reduce jobs for the features that need it. For each such feature:
$ hadoop jar bin/thrax.jar edu.jhu.thrax.hadoop.tools.FeatureTool <work directory> <feature>
Here, <feature> is the feature's name exactly as it is written in the thrax.conf file.
Another parallelization advantage: assuming that the extraction and word-level tasks are finished, all of these feature tasks can be run in parallel without any problems!
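For example, a sketch that launches several feature jobs at once. The feature names in the loop are illustrative (only lexprob is mentioned above); substitute the names from your own thrax.conf.
$ for f in lexprob rarity phrase-penalty; do
>   hadoop jar bin/thrax.jar edu.jhu.thrax.hadoop.tools.FeatureTool $WORKDIR "$f" &
> done
$ wait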
The final step is to aggregate everything!
$ hadoop jar bin/thrax.jar edu.jhu.thrax.hadoop.tools.OutputTool <true|false> <work directory> [f1 f2 f3 ...]
The boolean first argument indicates whether to label the feature scores or not. f1, f2, and so on are the names of features, again, as they would be written in the config file. For this step, you need to include all map-reduce features and all simple features that you want to appear in the output.
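For instance, to write labeled feature scores for two features (feature names again illustrative):
$ hadoop jar bin/thrax.jar edu.jhu.thrax.hadoop.tools.OutputTool true $WORKDIR lexprob rarity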
It's that easy! It'll be even easier once we figure out running dependent jobs and everything.