Command Line Usage

Running Knitting Boar (KB) without any arguments displays the help panel. The options are listed below, with some detail about each one.

Input Options

--input <src>                    Src data input directory or file
--output <dst>                   Where we'll write the model to in HDFS 
--passes <arg> (=1)              Number of Training Passes
--features <arg>                 Size of the feature vector
--lambda <arg>                   weight of the prior on beta
--vectorFactoryType <arg>        Type of vector factory

The vectorFactoryType option currently defaults to the RecordFactory for the 20Newsgroups dataset, but it can also be configured for:

RCV1 dataset - https://github.com/JohnLangford/vowpal_wabbit/wiki/Rcv1-example
Standard Mahout-style CSV input (currently disabled)

Dataset Conversion Options

--input <src>                    Src data input directory or file
--output <dst>                   Where we'll write the dataset to in HDFS 
--recordsPerBlock <arg> (=20000) Number of max records per dataset shard

Example:

./convert_20newsgroups.sh --input ./20news-bydate-train/ --output ./ --recordsPerBlock 12000

The conversion tool currently supports only the 20Newsgroups dataset. It converts the 20Newsgroups data into one or more large files, which are easier to manage when computing with Knitting Boar.
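
To illustrate what a per-shard record limit such as --recordsPerBlock implies, here is a minimal Python sketch that groups records into consecutive shards of at most a given size. It is not Knitting Boar's conversion code: the function name shard_records, the record strings, and the corpus size of 25,000 are hypothetical, chosen only to show how one input becomes "a single large file (or multiple large files)".

```python
from typing import Iterable, Iterator, List


def shard_records(records: Iterable[str], records_per_block: int = 20000) -> Iterator[List[str]]:
    """Yield consecutive shards holding at most records_per_block records each.

    Hypothetical helper for illustration; the default mirrors the documented
    default of --recordsPerBlock.
    """
    if records_per_block <= 0:
        raise ValueError("records_per_block must be positive")
    shard: List[str] = []
    for record in records:
        shard.append(record)
        if len(shard) == records_per_block:
            # Shard is full: emit it and start a new one.
            yield shard
            shard = []
    if shard:
        # Emit the final, possibly smaller, shard.
        yield shard


# Hypothetical corpus of 25,000 records split with a limit of 12,000 per shard.
corpus = (f"doc-{i}" for i in range(25000))
sizes = [len(s) for s in shard_records(corpus, records_per_block=12000)]
print(sizes)  # [12000, 12000, 1000]
```

Under this reading, a dataset with fewer records than the limit ends up in a single shard, and a larger one is spread across several.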
