
Quickstart

jcanny edited this page Dec 1, 2014 · 39 revisions

Retrieving Sample Data

The BIDMach bundle includes some scripts for loading medium-sized datasets for experimentation. These are bash scripts (for Linux, Mac or Cygwin). Typing:

<BIDMach_dir>/scripts/getdata.sh

will start loading and converting these datasets into binary files in <BIDMach_dir>/data/. The data include the Reuters news dataset (RCV1), three text datasets from the UC Irvine Data repository (NIPS, NYTIMES and PUBMED) and a digit recognition dataset from UCI.

Creating Learners

Most text data are stored as sparse, single-precision matrices. RCV1 includes a category assignment matrix, stored as an FMat with 0,1 values for each combination of input instance/category indicating whether that instance belongs in the category. Category assignments in RCV1 are one-to-many, i.e. each document may be assigned to multiple categories, although single category assignments are common.
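As a toy illustration (the values below are made up, not from RCV1; this sketch uses BIDMat's `\` and `on` concatenation operators and its sparse() conversion), a small term-document matrix and its 0/1 category matrix might be built like this:

```scala
// Hypothetical 4-term x 3-document count matrix, built dense and then
// converted to a sparse single-precision matrix (SMat).
val counts = (1f \ 0f \ 2f) on
             (0f \ 3f \ 0f) on
             (1f \ 1f \ 0f) on
             (0f \ 0f \ 1f)
val docs = sparse(counts)          // SMat: one column per document

// Hypothetical 2-category x 3-document 0/1 assignment matrix (FMat).
// Note document 0 belongs to both categories: assignments are one-to-many.
val cats = (1f \ 0f \ 1f) on
           (1f \ 1f \ 0f)
```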

For each major class of model (e.g. Generalized Linear Models, Latent Dirichlet Allocation, etc.), there is a simple learner which takes matrices as input. To perform logistic regression on the Reuters data, you can start by loading the matrices:

val a = loadSMat("<BIDMach_dir>/data/rcv1/docs.smat.lz4")
val c = loadFMat("<BIDMach_dir>/data/rcv1/cats.fmat.lz4")

which loads the documents in RCV1 into a as a sparse matrix, and the category assignments into c. Then doing:

val (mm, mopts) = GLM.learner(a, c, 1)

creates a learner which will build multiple binary predictors (an OAA or One-Against-All multiclass classifier) for the 103 categories in RCV1 using logistic regression. Logistic regression is one of several GLM (Generalized Linear Model) models. Linear regression is also a GLM model. Support vector machines (SVM) are not GLM models, but the code for the SVM loss function is similar enough that they are included in the GLM package.

The choice of GLM model is set by the third argument. Currently:

0 = linear regression
1 = logistic regression
2 = logistic regression predictor with likelihood (not log likelihood) loss
3 = Support Vector Machine
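For example, the same matrices can be used to build a learner of each flavor simply by changing the third argument (a sketch, reusing the a and c matrices loaded above):

```scala
// The third argument to GLM.learner selects the link/loss:
val (linm, linopts) = GLM.learner(a, c, 0)   // linear regression
val (logm, logopts) = GLM.learner(a, c, 1)   // logistic regression
val (svmm, svmopts) = GLM.learner(a, c, 3)   // SVM loss
```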

Tuning

Two objects are returned by the call to GLM.learner. The first, "mm", is the learner instance itself, which holds the model, datasource, etc. The second, "mopts", is an options object: a custom class which holds all of the options for this particular combination of model, mixins, datasource and updater. You can inspect and change any of the options in this instance. To see what they are, do:

> mopts.what
Option Name       Type          Value
 ==========       ====          =====
addConstFeat      boolean       false
lrate             FMat          1
batchSize         int           10000
dim               int           256
doubleScore       boolean       false
epsilon           float         1.0E-5
evalStep          int           11
featType          int           1
initsumsq         float         1.0E-5
links             IMat          1,1,1,1,1,1,1,1,1,1,...
mask              FMat          null
npasses           int           1
nzPerColumn       int           0
pstep             float         0.01
putBack           int           -1
regweight         FMat          1.0000e-07
resFile           String        null
rmask             FMat          null
sample            float         1.0
sizeMargin        float         3.0
startBlock        int           8000
targets           FMat          null
targmap           FMat          null
texp              FMat          0.50000
useGPU            boolean       true
vexp              FMat          0.50000
waitsteps         int           2

Most of these are for advanced use only, but here are some basic ones:

npasses:     number of passes over the dataset
batchSize:   size (in instances) of each minibatch
lrate:       the learning rate
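You might adjust these options before training, for instance like this (the values below are purely illustrative, not tuning recommendations):

```scala
mopts.npasses = 2        // make two passes over the dataset
mopts.batchSize = 1000   // smaller minibatches, more frequent updates
mopts.lrate = 0.3f       // scale down the gradient steps
mm.train                 // train with the modified options
```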

npasses is intuitive: it is the number of passes over the data. For classical batch algorithms, this was also the number of iterations, but batch algorithms are very slow on large datasets. Minibatch algorithms, which BIDMach favors, perform many model updates in one pass over the dataset; in fact a single pass is often enough to reach an acceptable loss level. You should tune the other parameters with the goal of achieving a high likelihood in as few passes as possible.

batchSize is a very important parameter. Pure stochastic gradient algorithms update the model after every instance (i.e. with a minibatch size of 1), but this is very expensive and lacks parallelism. You can generally achieve a comparable, and sometimes faster, rate of convergence using much larger minibatch sizes. Minibatch sizes in the hundreds to thousands are common, and in fact RCV1 achieves almost optimal convergence with a minibatch size of 10,000.

lrate (learning rate) is the basic scaling constant for the gradient. Small values weaken the gradient updates, which generally slows learning down. Higher values accelerate updates, but can also lead to overshooting.

BIDMach will typically run well with the default parameters, but as you use it and get a feel for their effects, you will probably get a lot of value from systematically tuning them.

Training

You start a learning session by calling a learner's "train" method like this:

> mm.train
corpus perplexity=5582.125391
pass= 0
 2.00%, ll=-0.693, gf=3.745, secs=0.2, GB=0.02, MB/s=92.42, GPUmem=0.83
16.00%, ll=-0.134, gf=10.185, secs=0.9, GB=0.12, MB/s=131.09, GPUmem=0.82
30.00%, ll=-0.123, gf=11.024, secs=1.6, GB=0.22, MB/s=135.70, GPUmem=0.82
44.00%, ll=-0.102, gf=11.353, secs=2.4, GB=0.33, MB/s=138.03, GPUmem=0.82
58.00%, ll=-0.094, gf=11.555, secs=3.1, GB=0.43, MB/s=139.82, GPUmem=0.82
72.00%, ll=-0.074, gf=11.659, secs=3.8, GB=0.53, MB/s=140.06, GPUmem=0.82
87.00%, ll=-0.085, gf=11.733, secs=4.5, GB=0.63, MB/s=140.70, GPUmem=0.82
100.00%, ll=-0.069, gf=11.778, secs=5.2, GB=0.73, MB/s=139.77, GPUmem=0.82
Time=5.1970 secs, gflops=11.78

The learner reports on each line: the percentage of data processed, the log likelihood, net gigaflops so far, seconds so far, Gigabytes processed, Megabytes/second achieved, and (if a GPU is being used) GPU memory used so far.

To avoid overfitting models, BIDMach always holds out a fixed set of miniBatches on each pass to use for scoring. Those "test" minibatches are never used for model training. The log likelihoods it reports come from those test minibatches. This is controlled by the "evalStep" parameter, which defaults to 11. It causes every nth minibatch to be a test minibatch, so the test minibatches are 0, 11, 22, 33,... All other minibatches are used for training.
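A small sketch of that schedule in plain Scala: with the default evalStep of 11, the held-out minibatch indices among (say) the first 100 minibatches are:

```scala
val evalStep = 11
// Every evalStep-th minibatch is held out for scoring.
val held = (0 until 100).filter(_ % evalStep == 0)
// held == Vector(0, 11, 22, 33, 44, 55, 66, 77, 88, 99); all others train
```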

You can review the convergence of the algorithm by plotting the log likelihood over time. It is tracked in an FMat in mm.results, so you can do:

> val r = mm.results
> plot(r(0,?))

which puts up a plot of the first row of the results array, showing the likelihood on each held-out minibatch.

Using the Model

The model itself is usually represented internally as a matrix. It is accessible from the learner object via its modelmat field. Some models comprise multiple matrices, and in that case you can extract an array of them with modelmats. A modelmat is necessarily generic, i.e. it has type Mat. Most of the time, it will be a dense matrix, either a CPU (FMat) or GPU (GMat) matrix. It is easiest to manipulate the model on the CPU, so it is best to pull it back into an FMat, which you can do like this:

> val model = FMat(mm.modelmat)
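With the model pulled back to the CPU you can compute scores directly. The sketch below assumes the GLM model matrix has one row per category and one column per feature, and that atest is a hypothetical sparse matrix of new documents with the same feature rows as the training data; applying the logistic link to the linear predictor then gives per-category scores:

```scala
val model = FMat(mm.modelmat)     // one row per category, one column per feature
val scores = model * atest        // linear predictor: categories x documents
val expS = exp(scores)            // elementwise exponential
val preds = expS / (expS + 1f)    // logistic link, elementwise
```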

Prediction

Given a model that has been trained, you can make new predictions with it by creating a predictor. To make this realistic, we should first partition the data matrix into training and test instance subsets. We can do this with:

val inds = randperm(a.ncols)
val atest = a(?, inds(0->100000))
val atrain = a(?, inds(100000->a.ncols))
val ctest = c(?, inds(0->100000))
val ctrain = c(?, inds(100000->a.ncols))

which uses a random permutation to split the matrices a and c into training and test subsets. We will build the model from the training subset and use it to predict values for the test data. First we need to create a container "cx" for the predicted values. Then we create a learner and predictor for the data:

val cx = zeros(ctest.nrows, ctest.ncols)
val (mm, mopts, nn, nopts) = GLM.learner(atrain, ctrain, atest, cx, 1)

The learner mm is identical to the learner we ran earlier. The function also returns another learner object nn, with options nopts: this is the predictor. A predictor shares virtually all the functionality of a learner, but is customized for prediction. As such it includes a datasource, but normally does not include an updater and usually does not include any mixins. We can first train the model, and then use it for prediction (the model is shared between mm and nn):

> mm.train
...
> nn.predict
Predicting
10.00%, ll=-0.375, gf=15.539, secs=0.0, GB=0.01, MB/s=711.31, GPUmem=0.82
20.00%, ll=-0.376, gf=16.788, secs=0.0, GB=0.02, MB/s=767.94, GPUmem=0.82
30.00%, ll=-0.370, gf=16.819, secs=0.0, GB=0.03, MB/s=769.78, GPUmem=0.82
40.00%, ll=-0.377, gf=16.892, secs=0.0, GB=0.04, MB/s=774.01, GPUmem=0.82
50.00%, ll=-0.373, gf=17.181, secs=0.1, GB=0.05, MB/s=787.27, GPUmem=0.82
60.00%, ll=-0.375, gf=15.791, secs=0.1, GB=0.06, MB/s=723.52, GPUmem=0.82
70.00%, ll=-0.376, gf=16.139, secs=0.1, GB=0.07, MB/s=739.70, GPUmem=0.82
80.00%, ll=-0.375, gf=16.240, secs=0.1, GB=0.07, MB/s=744.42, GPUmem=0.82
89.00%, ll=-0.375, gf=16.454, secs=0.1, GB=0.08, MB/s=754.18, GPUmem=0.82
100.00%, ll=-0.373, gf=16.626, secs=0.1, GB=0.09, MB/s=761.94, GPUmem=0.82

which indicates a relatively large negative likelihood. This is misleading, and the likelihoods generated by logistic regression often are. Because the logistic function generates predictions that can be very close to zero or one, the log likelihood will often be large and negative even though the prediction error is very small. It is more useful to compare the actual prediction accuracy, which you can do with some matrix algebra:

> val p = ctest *@ cx + (1 - ctest) *@ (1 - cx)
> mean(p, 2)
  0.99128
  0.94204
  0.98562
  0.98679
  0.95985
  0.96614
  0.93597
  0.97642
       ..

These are the accuracies for each target (each category for RCV1).
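If you want a single summary figure, the per-category accuracies can be averaged (a sketch; note that an unweighted mean treats all 103 categories equally, regardless of how common each category is):

```scala
val perCat = mean(p, 2)     // column vector: one accuracy per category
val overall = mean(perCat)  // unweighted mean accuracy over categories
```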

If you already have a GLM model, you can construct a predictor for it like this:

> val model = mm.model
> val (nn, nopts) = GLM.predictor(model, atest, cx, 1)

and then compute predictions with

nn.predict

as before.
