Quickstart

jcanny edited this page Apr 29, 2014 · 39 revisions

Retrieving Sample Data

The BIDMach bundle includes scripts for loading medium-sized datasets for experimentation. These are bash scripts (for Linux, Mac or Cygwin). Typing:

<BIDMach_dir>/scripts/getdata.sh

will start downloading these datasets and converting them into binary files in <BIDMach_dir>/data/. The data include the Reuters news dataset (RCV1), three text datasets from the UC Irvine (UCI) data repository (NIPS, NYTIMES and PUBMED) and a digit recognition dataset from UCI.

Creating Learners

Most text data are stored as sparse, single-precision matrices. RCV1 includes a category assignment matrix, stored as an FMat (a dense, single-precision matrix) whose 0/1 entries indicate, for each instance/category pair, whether that instance belongs to the category. Category assignments in RCV1 are one-to-many, i.e. each document may be assigned to multiple categories, although single-category assignments are common.
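To make the 0/1 category matrix concrete, here is a minimal plain-Scala sketch that builds one from per-document label sets. It does not use BIDMach or FMat; the object name, the `toIndicator` helper and the layout (rows are categories, columns are documents) are assumptions made for illustration only.

```scala
// Plain-Scala illustration of a 0/1 category assignment matrix.
// Assumption: rows index categories, columns index documents.
object CategoryMatrixSketch {
  // labels(j) holds the category indices assigned to document j.
  def toIndicator(labels: Seq[Set[Int]], ncats: Int): Array[Array[Float]] = {
    require(labels.forall(_.forall(i => i >= 0 && i < ncats)), "category index out of range")
    val m = Array.ofDim[Float](ncats, labels.length) // initialised to 0
    for ((cats, j) <- labels.zipWithIndex; i <- cats) m(i)(j) = 1f
    m
  }
}

// Three documents, four categories; document 1 belongs to two categories.
val cats = CategoryMatrixSketch.toIndicator(Seq(Set(0), Set(1, 3), Set(2)), 4)
```

Document 1 has two nonzero entries in its column, which is the one-to-many case described above.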

For each major class of model (e.g. Generalized Linear Models, Latent Dirichlet Allocation, etc.), there is a simple learner which uses a matrix as input. To perform logistic regression on Reuters data, you can start by loading the matrices:

val a = loadSMat("<BIDMach_dir>/data/rcv1/docs.smat.lz4")
val c = loadFMat("<BIDMach_dir>/data/rcv1/cats.fmat.lz4")

which loads the documents in RCV1 into a as a sparse matrix, and the category assignments into c. Then doing:

val (mm, mopts) = GLM.learner(a, c, 1)

creates a learner which will build multiple binary predictors (an OAA or One-Against-All multiclass classifier) for the 110 categories in RCV1 using logistic regression. Logistic regression is one of several Generalized Linear Models (GLMs). Linear regression is also a GLM. Support vector machines (SVMs) are not GLMs, but the code for the SVM loss function is similar enough that they are included in the GLM package.

The choice of GLM model is set by the third argument. Currently:

0 = linear regression
1 = logistic regression
2 = logistic regression predictor with likelihood (not log likelihood) loss
3 = Support Vector Machine
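As a rough guide to what the four settings mean, the sketch below writes out the usual textbook per-example loss for each one in plain Scala. This is not BIDMach's code: the object and function names are hypothetical, and BIDMach's own sign and scaling conventions may differ. Here `z` is the linear score for one instance and `y` is its target (0 or 1 for the classifiers).

```scala
// Textbook per-example losses for the four GLM settings (illustrative only).
object GlmLossSketch {
  def sigmoid(z: Double): Double = 1.0 / (1.0 + math.exp(-z))

  // kind follows the numbering above; z is the linear score, y the target.
  def loss(kind: Int, z: Double, y: Double): Double = kind match {
    case 0 => // linear regression: squared error
      (z - y) * (z - y)
    case 1 => // logistic regression: negative log likelihood
      val p = sigmoid(z)
      -(y * math.log(p) + (1 - y) * math.log(1 - p))
    case 2 => // logistic predictor scored by negative likelihood
      val p = sigmoid(z)
      -(y * p + (1 - y) * (1 - p))
    case 3 => // SVM: hinge loss with the target mapped from {0,1} to {-1,+1}
      math.max(0.0, 1.0 - (2 * y - 1) * z)
    case _ =>
      throw new IllegalArgumentException(s"unknown GLM kind: $kind")
  }
}
```

Settings 1 and 2 share the same sigmoid predictor and differ only in how a prediction is scored, which matches the description of setting 2 as a logistic regression predictor with likelihood loss.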

Tuning and Training

Two objects are returned by the call to GLM.learner. The first, "mm", is the learner instance itself, which holds the model, datasource etc. The second, "mopts", is an options object, returned separately. It is a custom class which holds all of the options for this particular combination of model, mixins, datasource and updater. You can inspect and change any of the options in this instance. To see what they are, do:

> mopts.what
Option Name       Type          Value
 ==========       ====          =====
addConstFeat      boolean       false
lrate             FMat          1
batchSize         int           10000
dim               int           256
doubleScore       boolean       false
epsilon           float         1.0E-5
evalStep          int           11
featType          int           1
initsumsq         float         1.0E-5
links             IMat          1,1,1,1,1,1,1,1,1,1,...
mask              FMat          null
npasses           int           1
nzPerColumn       int           0
pstep             float         0.01
putBack           int           -1
regweight         FMat          1.0000e-07
resFile           String        null
rmask             FMat          null
sample            float         1.0
sizeMargin        float         3.0
startBlock        int           8000
targets           FMat          null
targmap           FMat          null
texp              FMat          0.50000
useGPU            boolean       true
vexp              FMat          0.50000
waitsteps         int           2

Most of these are for advanced use only, but here are some basic ones:

npasses:     number of passes over the dataset
batchSize:   size (in instances) of each minibatch
lrate:       the learning rate
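
To show how these three options interact, here is a minimal plain-Scala sketch of minibatch gradient training for a single logistic regression predictor. It is an illustration under simplifying assumptions, not BIDMach's updater: the object name and `train` signature are hypothetical, the step size is a constant `lrate`, and none of the other options listed above are modelled.

```scala
// Minibatch gradient ascent on the logistic log likelihood (illustrative only).
object MinibatchSketch {
  def sigmoid(z: Double): Double = 1.0 / (1.0 + math.exp(-z))

  // xs: one feature vector per instance; ys: 0/1 targets.
  def train(xs: Array[Array[Double]], ys: Array[Double],
            npasses: Int, batchSize: Int, lrate: Double): Array[Double] = {
    require(batchSize > 0 && xs.nonEmpty && xs.length == ys.length)
    val w = new Array[Double](xs(0).length)
    // npasses full sweeps; each sweep visits the data in blocks of batchSize.
    for (_ <- 0 until npasses; start <- xs.indices by batchSize) {
      val end = math.min(start + batchSize, xs.length)
      val grad = new Array[Double](w.length)
      for (i <- start until end) {
        val z = w.indices.map(k => w(k) * xs(i)(k)).sum
        val err = ys(i) - sigmoid(z)
        for (k <- w.indices) grad(k) += err * xs(i)(k)
      }
      // One model update per minibatch, scaled by the learning rate.
      for (k <- w.indices) w(k) += lrate * grad(k) / (end - start)
    }
    w
  }
}

// Toy data: the second feature is a constant 1 acting as a bias term.
val xs = Array(Array(1.0, 1.0), Array(2.0, 1.0), Array(-1.0, 1.0), Array(-2.0, 1.0))
val ys = Array(1.0, 1.0, 0.0, 0.0)
val w = MinibatchSketch.train(xs, ys, npasses = 50, batchSize = 2, lrate = 0.5)
```

With four instances and batchSize = 2, each pass performs two model updates, so npasses = 50 gives 100 updates in total. A larger batchSize means fewer, smoother updates per pass.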