Command Line Usage
jpatanooga edited this page Nov 22, 2012 · 2 revisions
Running Knitting Boar (KB) without any arguments displays the help panel shown below, which lists each option with a short description.
--input <src> Src data input directory or file
--output <dst> Where we'll write the model to in HDFS
--passes <arg> (=1) Number of Training Passes
--features <arg> Size of the feature vector
--lambda <arg> weight of the prior on beta
--vectorFactoryType <arg> Type of vector factory
The vectorFactoryType option currently defaults to the RecordFactory for the 20Newsgroups dataset, but it can also be configured for:
- RCV1 dataset - https://github.com/JohnLangford/vowpal_wabbit/wiki/Rcv1-example
- Standard Mahout-style CSV input (currently disabled)
The dataset conversion tool takes the following options:
--input <src> Src data input directory or file
--output <dst> Where we'll write the dataset to in HDFS
--recordsPerBlock <arg> (=20000) Number of max records per dataset shard
Example:
./convert_20newsgroups.sh --input ./20news-bydate-train/ --output ./ --recordsPerBlock 12000
The conversion tool currently supports only the 20Newsgroups dataset. It converts the 20Newsgroups data into one or more large files, which are easier to manage when computing with Knitting Boar.
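To illustrate what a per-shard record cap such as --recordsPerBlock implies, here is a minimal Python sketch. It is not the conversion tool's code: the function name `shard_records`, the `doc-N` record strings, and the assumption that records fill shards sequentially are all hypothetical, chosen only to show how a cap of 12000 would split a set of records into shards.

```python
from typing import Iterable, Iterator, List


def shard_records(records: Iterable[str],
                  records_per_block: int = 20000) -> Iterator[List[str]]:
    """Yield consecutive shards holding at most records_per_block records.

    Hypothetical helper: assumes shards are filled in input order, which
    the actual conversion tool may or may not do.
    """
    if records_per_block < 1:
        raise ValueError("records_per_block must be at least 1")
    block: List[str] = []
    for record in records:
        block.append(record)
        if len(block) == records_per_block:
            yield block
            block = []
    if block:
        # The final shard may hold fewer records than the cap.
        yield block


# 30,000 made-up records with the cap used in the example above (12000).
shards = list(shard_records((f"doc-{i}" for i in range(30000)),
                            records_per_block=12000))
shard_sizes = [len(s) for s in shards]
print(shard_sizes)  # [12000, 12000, 6000]
```

Under these assumptions, a smaller cap yields more, smaller shards, and the last shard holds whatever records remain.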