
Command Line Usage

jpatanooga edited this page Nov 22, 2012 · 2 revisions

Running Knitting Boar (KB) without any arguments displays the help panel. The options are listed below, with some detail about each one.

Input Options

--input <src>                    Src data input directory or file
--output <dst>                   Where we'll write the model to in HDFS 
--passes <arg> (=1)              Number of Training Passes
--features <arg>                 Size of the feature vector
--lambda <arg>                   weight of the prior on beta
--vectorFactoryType <arg>        Type of vector factory

The vectorFactoryType option currently defaults to the RecordFactory for the 20Newsgroups dataset, but it can also be configured for other vector factory types.

Dataset Conversion Options

--input <src>                    Src data input directory or file
--output <dst>                   Where we'll write the dataset to in HDFS 
--recordsPerBlock <arg> (=20000) Number of max records per dataset shard

Example:

./convert_20newsgroups.sh --input ./20news-bydate-train/ --output ./ --recordsPerBlock 12000

The conversion tool currently supports only the 20Newsgroups dataset. It converts the 20Newsgroups data into one or more large files, each holding at most --recordsPerBlock records, so that the data is easier to manage while computing with Knitting Boar.
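
To illustrate what a per-shard record limit means, here is a minimal Python sketch of grouping records into blocks of at most a fixed size. This is not Knitting Boar's actual conversion code: the function name `shard_records` and the in-memory record list are hypothetical, and only the default of 20000 is taken from the --recordsPerBlock option above.

```python
from typing import Iterable, Iterator, List


def shard_records(records: Iterable[str],
                  records_per_block: int = 20000) -> Iterator[List[str]]:
    """Yield lists of at most `records_per_block` records each.

    Hypothetical helper for illustration only; it mirrors the idea
    behind --recordsPerBlock, not the converter's real implementation.
    """
    if records_per_block <= 0:
        raise ValueError("records_per_block must be positive")

    block: List[str] = []
    for record in records:
        block.append(record)
        # Emit the block as soon as it reaches the limit.
        if len(block) == records_per_block:
            yield block
            block = []

    # Emit the final, possibly smaller, block.
    if block:
        yield block


if __name__ == "__main__":
    # 25 made-up records split into shards of at most 10 records.
    docs = [f"doc-{i}" for i in range(25)]
    for index, shard in enumerate(shard_records(docs, records_per_block=10)):
        print(index, len(shard))
```

Under this sketch, an input with fewer records than the limit ends up in a single shard, which matches the "single large file (or multiple large files)" behavior described above.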
