Machine Learning Models

Overview

BIDMach supports supervised and unsupervised models. It includes a general regression class with linear, logistic and SVM subclasses, and a factor model class with LDA, NMF and SFA (often implemented with ALS, or Alternating Least Squares). There are also random forest and k-means models. We have started a group of causal estimators in a separate package. Most models can be combined with mixins (e.g. L1 and L2 regularization). There are typically several options for updaters, although some models use multiplicative updates and require matching updaters.

All models work with either sparse or dense input matrices (on CPU or GPU). For regression models which have target matrices (indicating category membership) as well as data matrices, the target matrices can also be either sparse or dense.

Regression models include the Generalized Linear Model (GLM) and the Factorization Machine. Linear Regression, Logistic Regression and Support Vector Machines are implemented under GLM with different link functions and loss gradients. The Factorization Machine models interactions between features.

Factor models include Latent Dirichlet Allocation (LDA), Alternating Least Squares (ALS) and Non-negative Matrix Factorization (NMF). A factor model takes a sparse m-by-n matrix as input and decomposes it into two low-dimensional matrices (m-by-k and k-by-n), where k is usually much smaller than m and n. Dimension reduction is typically used to improve prediction, but the factors can also be used to capture topic or cluster structure in the dataset.

Usage of the models is described below. Import the following package before using the models:

import BIDMach.models._

Regression Models

Generalized Linear Model

The Generalized Linear Model (GLM) is a generalization of linear regression. Linear regression, logistic regression and SVM are implemented under GLM with different link functions and likelihood functions. We will be adding other standard GLM models soon. You should be able to deploy your own GLM models with your own link object that extends BIDMach.models.GLM.GLMLink. The following code creates a learner and the associated options for GLM.

val (nn, opts) = GLM.learner(a, c, i)
Input a is an m-by-n data matrix. It can be either dense or sparse. m is the number of features and n is the number of examples. c is a k-by-n matrix, where k is the number of categories. c(i,j)=1 indicates that example j belongs to category i. For details on how to create matrices, please refer to Data Wrangling or Creating Test Matrices. The choice of GLM model is set by the third argument i. Currently:
0 = linear regression
1 = logistic regression
2 = logistic regression predictor with likelihood (not log likelihood) optimization
3 = Support Vector Machine
The learner builds a binary predictor for each category, i.e. a One-Against-All (OAA) multiclass classifier. To see all the parameters of the learner:
opts.what

To train the model:

nn.train

GLM produces the weight vectors for each of the k categories. To get the results of the learner:

nn.modelmat: the feature-weight matrix
To save the model on disk, please see Saving your results. If you have some data to predict, you can construct a predictor for it like this:
val (nn, nopts) = GLM.predictor(model, atest, ctest, i)

Here model is the model produced in the training step, atest is the test data, and again i is the choice of GLM model. ctest is a matrix that will hold the predictions (see Creating Test Matrices below for how to create some test matrices). You can compute predictions with

nn.predict
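
Putting these steps together, here is a minimal end-to-end sketch. The matrix names, sizes and parameter values are purely illustrative, the rand and zeros helpers are described under Creating Test Matrices below, and the accessor used to pass the trained model to the predictor is an assumption (see the comment):

import BIDMach.models._

// 100 features, 10000 training instances, 2 binary targets (all sizes illustrative).
val a = rand(100, 10000)               // dense training data, one instance per column
val c = rand(2, 10000) < 0.5f          // random 0/1 target matrix, one row per category

val (nn, opts) = GLM.learner(a, c, 1)  // 1 = logistic regression
opts.npasses = 2
nn.train

// Predict on held-out data; ctest receives the predictions.
val atest = rand(100, 1000)
val ctest = zeros(2, 1000)
val model = nn.model                   // the trained model from the step above (assumed accessor)
val (pp, popts) = GLM.predictor(model, atest, ctest, 1)
pp.predict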

Linear Regression

Please refer to the usage of GLM, setting the third argument i=0.

Logistic Regression

Please refer to the usage of GLM, setting the third argument i=1 or 2. i=1 produces standard logistic regression which maximizes the total log likelihood across all input instances. A logistic model outputs a probability <math>p</math> for each test instance, and the probability of correctness is <math>p</math> if the label is 1, and <math>1-p</math> if it is zero. Standard logistic regression optimizes the sum of logs of this probability. A weakness of this formulation is that it attributes a lot of weight to incorrectly labeled instances, trying to avoid large negative log likelihoods. In effect it tries hard not to make bad labels worse.

Alternatively, setting i=2 optimizes the average likelihood across instances. This directly optimizes the accuracy of the prediction (i.e. the probability of a correct prediction). It tries to make good labels better. This option often leads to more accurate models (since it is directly optimizing accuracy). It seems to be a little trickier to tune, but you should find the effort worthwhile if you care about absolute accuracy. Whereas standard logistic regression weights instances according to their prediction error (so bad instances have higher weight), this version weights instances "on the fence" more heavily. Thus it resembles an SVM, and is optimizing accuracy on borderline instances.

Support Vector Machine (SVM)

Support vector machine learning is very similar. Start by defining a learner with data and target matrices:

> val (nn, nopts) = GLM.SVMlearner(datamat, labelmat)
then train it with
> nn.train
Training uses the Pegasos SGD (Stochastic Gradient Descent) algorithm [1]. This algorithm minimizes the SVM objective function using gradient steps of the form:

<math> \nabla = \lambda w - \mathbb{1}[y_i \langle w, x_i \rangle < 1] \: y_i x_i </math>

where <math>w</math> is the model weight vector, <math>x_i</math> is the ith instance, <math>y_i</math> is the label of the ith instance (a 0/1 target, treated as ±1 inside the gradient), and <math>\mathbb{1}</math> is an indicator function which is 1 when its argument is true, 0 otherwise. <math>\lambda</math> is the regularization parameter of the SVM (see [1]). The regularization is implemented using BIDMach's L2Regularizer mixin. <math>\lambda</math> is equivalent to the L2 regularizer weight, which can be set with:

> nopts.reg2weight

The section below on Tuning explains how to set the other important parameters.
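
For example, a sketch with illustrative values (datamat and labelmat are placeholders, and the scalar assignments assume the implicit conversions available in BIDMach scripts):

val (nn, nopts) = GLM.SVMlearner(datamat, labelmat)
nopts.reg2weight = 0.0001f   // lambda in the Pegasos objective
nopts.lrate = 0.1f           // learning rate (see Tuning below)
nopts.npasses = 2
nn.train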

[1] Shalev-Shwartz, S., Singer, Y., Srebro, N., "Pegasos: Primal Estimated sub-GrAdient SOlver for SVM", ICML 2007.

Constant Terms

The GLM models are homogeneous and do not include a constant term. To model a constant term, you need to introduce a constant feature (usually fixed at 1) into the input data source. Make sure you do this after doing any normalization of each input feature. The standard datasources include this as an option (Options.addConstFeat). When this option is set, a constant feature is added as the last feature on each instance (so the number of features increments by 1).
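
For example (a sketch, reusing the learner construction from the GLM section above):

val (nn, opts) = GLM.learner(a, c, 1)
opts.addConstFeat = true   // append a constant feature (value 1) to each instance
nn.train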

Tuning

Regression models share the following options:

var targets:FMat = null
var targmap:FMat = null
var rmask:FMat = null

and in addition, the GLM model has an option:

var links:IMat = null

which specifies the link number (0 for linear, 1 for logistic, etc.) for each target. The links argument must be set. The others are optional: they are not needed if the learner is given both data and target matrices (as in GLM.learner(dat,cats,1)), since the targets are given explicitly.

A GLM learner also typically inherits options from a Learner object (e.g. the number of passes over the dataset), from regularizer mixins (L1 and possibly L2), and from an updater such as ADAGrad (e.g. the learning rate). Those options are described in detail in the corresponding sections of the Docs. Regularization weights, learning rate and the number of passes over the data are all settable in the opts object.

Back to the regression-specific fields: targets is used to define targets for datasets from which they have not been extracted as separate datasources. For instance, from a set of log files for social media data, you might define targets such as emoticons (e.g. ":-)") which exist as features in the dataset. targets is a k x n matrix, where k is the number of targets and n is the number of features. targets should be a zero-one-valued matrix with ones in positions (i,j) where feature j defines target i.

targmap supports multimodel inference over the same set of targets. It will typically be a stack of identity matrices replicated p times, where the learner trains p models per target with different parameters. See BIDMach/Experiments.scala for examples of its usage. For simple cases, you can leave it null.

rmask forces certain elements of the prediction matrix to zero. It is a zero-one-valued matrix. This is essential when you use targets to define targets from input features. At a minimum, the target input features need to be forced to zero in rmask.
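
Here is a hypothetical sketch of defining two targets from input features and masking those features out of the model. The feature indices and sizes are made up, opts is the options object of a GLM learner as above, and the wildcard slice-update is assumed BIDMat syntax:

val ntargets = 2
val nfeats = 1000

// targets(i,j) = 1 when feature j defines target i.
val targets = zeros(ntargets, nfeats)
targets(0, 10) = 1f
targets(1, 11) = 1f

// rmask zeroes the corresponding entries so the target features
// cannot be used to predict themselves.
val rmask = ones(ntargets, nfeats)
rmask(?, 10) = zeros(ntargets, 1)   // mask feature 10 for every target
rmask(?, 11) = zeros(ntargets, 1)   // mask feature 11 for every target

opts.targets = targets
opts.rmask = rmask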

You can see a full list of a GLM learner options with the what method, like this:

> val (mm,opts) = GLM.learner(dat,cats,1)
> opts.what
Option Name       Type          Value
-----------       ----          -----
addConstFeat      boolean       false
batchSize         int           10000
dim               int           256
doubleScore       boolean       false
epsilon           float         1.0E-5
evalStep          int           11
featType          int           1
initsumsq         float         1.0E-5
links             IMat          1,1,1,1,1,1,1,1,1,1,...
lrate             FMat          1
mask              FMat          null
npasses           int           1
nzPerColumn       int           0
pstep             float         0.01
putBack           int           -1
r1nmats           int           1
reg1weight        FMat          1.0000e-07
resFile           String        null
rmask             FMat          null
sample            float         1.0
sizeMargin        float         3.0
startBlock        int           8000
targets           FMat          null
targmap           FMat          null
texp              FMat          0.50000
useGPU            boolean       true
vexp              FMat          0.50000
waitsteps         int           2

We won't go over all the options here; they are discussed in the sections on DataSource, Learner, Mixins and Updaters. There are many, but tuning a model typically focuses on a few: lrate (the learning rate), the regularizer weights (reg1weight and reg2weight), and the updater parameters texp and vexp, which are discussed in their respective sections.

Many of the parameters above have FMat type but contain a single value. That means all models share the same parameter value. But you can tailor the parameters for each target (and use multiple models per target using targmap) by passing in a k-element FMat for any such parameter, where k is the total number of targets. Thus you can tune parameters in parallel with a single pass (or a few passes) over a large dataset.
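
For example, a sketch with four targets (or target-model combinations), assuming BIDMat's row constructor for building a 1 x k FMat:

// One learning rate and one L1 regularizer weight per target/model.
opts.lrate      = row(0.1f, 0.03f, 0.01f, 0.003f)
opts.reg1weight = row(1e-6f, 1e-7f, 1e-6f, 1e-7f)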

Factorization Machines

Factorization machines are described in [2]. They use dimension reduction to model interactions in prediction tasks, and have been shown to work well on collaborative filtering and click prediction. An output <math>\hat{y}</math> is modeled as

<math>\hat{y} = w_0 \: + \: \sum\limits_{i=1}^n \; w_i x_i \: + \sum\limits_{i=1}^n \sum\limits_{j=i+1}^n \; \langle \mathbf{\text{v}}_i, \mathbf{\text{v}}_j \rangle \; x_i x_j</math>

where each <math>\mathbf{\text{v}}_i</math> is a k < n dimensional vector. <math>\hat{y}</math> can be thought of as an approximation to a second-order function <math>y(x)</math> using a dimension-reduced interaction matrix. The second summation in the formula above is effectively symmetric and, except for the missing diagonal terms, positive definite. It has been noted that factorization machines can approximate any second-order function of the features if k is large enough, by varying the diagonal weights in the first sum. But this is not necessarily an efficient approximation. In our implementation we generalize the above formula to this one:

<math> \hat{y} = w_0 \: + \sum\limits_{i=1}^n \; w_i x_i \: + \sum\limits_{i=1}^n \sum\limits_{j=i+1}^n \; \langle \mathbf{\text{v}}_i, \mathbf{\text{v}}_j \rangle \; x_i x_j \; - \sum\limits_{i=1}^n \sum\limits_{j=i+1}^n \langle \mathbf{\text{u}}_i, \mathbf{\text{u}}_j \rangle \; x_i x_j </math>

which adds a second sum of negative definite terms. We assume the <math>\mathbf{\text{v}}_i</math> now have dimension <math>k_1</math> and the <math>\mathbf{\text{u}}_i</math> have dimension <math>k_2</math>. This generally gives better predictions for a given total number <math>k=k_1 + k_2</math> of dimensions than assuming all positive eigenvalues as in the original formulation. We found that <math>k_1 > k_2</math> by a factor of 4 or more works best. Either this is a property of the full second-order matrix, or the positive subspace is more important for high-value predictions.

Factorization machines are integrated with BIDMach's GLM framework, so you can create factorization machines for either linear, logistic or SVM link functions. To create a Factorization machine model, do:

> val (mm, opts) = FM.learner(datamat, labels, d)

where d is the link number (0 for linear, 1 or 2 for logistic, 3 for SVM). The options for setting the positive and negative subspace dimensions are:
> opts.dim1 = 224
> opts.dim2 = 32

Otherwise, the options for running and tuning GLM models apply.
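
Putting it together (a sketch; datamat and labels are placeholders):

val (mm, opts) = FM.learner(datamat, labels, 1)  // logistic link
opts.dim1 = 224    // positive-definite subspace dimension k1
opts.dim2 = 32     // negative-definite subspace dimension k2
mm.train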

Note that factorization machines currently only support single-target predictions. This will probably change soon.

[2] Rendle, S. "Factorization machines." In Proc. 10th IEEE Int. Conf. on Data Mining, 2010.

Factor Models

Latent Dirichlet Allocation (LDA)

We currently have two models for LDA: BIDMach.models.LDA uses the Variational Bayes method for model inference and parameter estimation, and BIDMach.models.LDAgibbs uses Gibbs sampling. In addition, the Variational Bayes model supports both batch and minibatch updates, so effectively there are 3 LDA algorithms available. The usage of the algorithms is the same, with a few exceptions in setting the model options. The following code creates a learner and the associated options for LDA.

val (nn, opts) = LDA.learner(a, k)
or 
val (nn, opts) = LDA.learnBatch(a, k)
or 
val (nn, opts) = LDAgibbs.learner(a, k)
The required input a is an m-by-n sparse matrix (word-document counts). m is the number of features and n is the number of examples; for a topic model, m is the number of words and n is the number of documents. Input k is optional and is the dimension of the model (the number of topics).

To see all the parameters of the learner:

opts.what

To train the model:

nn.train

To get the results of the learner:

nn.modelmats(0): the topic-word matrix, one topic per row.
nn.datamats(1): the factorized topic-document matrix, one document per column.
and you will probably want to wrap a call to FMat(...) around each of these expressions in order to get the result as a float matrix.

Hint: Make sure that opts.putBack is set to 1 if you want to store the topic-document matrix in nn.datamats(1).

See Saving your results below for how to save these.

Here are the key parameters for LDA and LDAgibbs. Default values are noted in parentheses; a sketch of setting them follows the lists below.

dim(256): Model dimension
uiter(5): Number of iterations on one block of data
alpha(0.001f): Dirichlet document-topic prior
beta(0.0001f): Dirichlet word-topic prior

Other key parameters inherited from the learner, datasource and updater:
batchSize: the number of samples processed in a minibatch
power(0.3f): the exponent of the moving average of online learning
npasses(2): number of complete passes over the dataset

Specialized parameters for LDA

exppsi(true): Apply exp(psi(X)) if true, otherwise just use X

Specialized parameters for LDAgibbs

m(100): number of samples (per word) taken for each pass through the data
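
A sketch of setting these options (the values are illustrative, not recommendations; the option names follow the lists above):

val (nn, opts) = LDA.learner(a, 256)
opts.uiter = 8           // inner iterations per minibatch
opts.alpha = 0.01f       // Dirichlet document-topic prior
opts.beta = 0.001f       // Dirichlet word-topic prior
opts.batchSize = 20000
opts.npasses = 4
nn.train
val topicWord = FMat(nn.modelmats(0))   // one topic per row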

SFA: Sparse Factor Analysis (equivalent to ALS)

Sparse Factor Analysis is a popular model for collaborative filtering. It is often referred to as ALS (Alternating Least Squares), which is an algorithm rather than a model. ALS is actually a very slow way to implement SFA. Instead, we use a gradient method with a CG (conjugate gradient) solver, which should become preconditioned CG soon. Construction of SFA models is similar to LDA. For example, you can create an SFA learner with

val (nn, opts) = SFA.learner(a, k)
The results are also in the same format as in LDA.

Non-negative Matrix Factorization (NMF)

The usage of the NMF model is the same as LDA. For example, you can create an NMF learner with

val (nn, opts) = NMF.learner(a, k)
The main options are:
dim(256): Model dimension
uiter(5): Number of iterations on one block of data
uprior(0.01f): Prior on the user (datasource) factor
mprior(0.0001f): Prior on the model
NMFeps(1e-9): A safety floor constant

Other key parameters inherited from the learner, datasource and updater:

batchSize: the number of samples processed in a minibatch
power(0.3f): the exponent of the moving average of online learning
npasses(2): number of complete passes over the dataset
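
A sketch of setting the NMF options (values are illustrative; the option names follow the list above):

val (nn, opts) = NMF.learner(a, 256)
opts.uiter = 5
opts.uprior = 0.01f       // prior on the user (datasource) factor
opts.mprior = 0.0001f     // prior on the model
opts.npasses = 4
nn.train
val factors = FMat(nn.modelmats(0))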

Independent Component Analysis (ICA)

The goal of Independent Component Analysis (ICA) is to find a linear representation of non-Gaussian data where the underlying components are as independent as possible. Let S be an n x k source matrix, where each column is an n-dimensional vector giving the values of the n sources at a certain time index. We assume we can represent the "data mixing" process with an n x n, invertible mixing matrix A, so that we observe X = AS. Thus, the input to the model should be an n x k matrix X, and the ICA model should recover W = A^{-1} so that the original sources can be recovered by left-multiplying the input, i.e., WX = WAS = S.

Rather surprisingly, it turns out that so long as at most one of the sources follows a Gaussian distribution, it is possible to recover the sources subject to ambiguity among permutations and scale, which for many datasets is not an issue.

BIDMach includes an implementation of FastICA, the current state-of-the-art algorithm for ICA. To use ICA, we need an input matrix X. (See below for important facts regarding zero-mean and whitened data.) If X is an n x k matrix, then its dimension is n (so set opts.dim = n). This is important, since the dimension is determined by the data, so it does not make sense to have a fixed default as there is for, say, LDA.

Here is some sample usage with a single matrix as input. In reality, large-scale data may require a FilesDS.

class xopts extends Learner.Options with MatDS.Opts with ICA.Opts with ADAGrad.Opts
val opts = new xopts
opts.dim = n // n = the number of rows of the data matrix
val output = loadFMat("ica_output.txt") // A file containing our output
val nn = new Learner(new MatDS(Array(output), opts), new ICA(opts), null, new ADAGrad(opts), opts);
nn.train

Our current model returns the following:

modelmats(0) // Our estimated W = A^{-1}
modelmats(1) // Our calculated data mean vector

Thus, our predicted source can be computed by

modelmats(0) * (output - modelmats(1))

Even if the data matrix (called output here because (mixing matrix) * (source matrix) = output matrix, even though it is the input to the ICA model) is already zero-mean, this is fine: modelmats(1) simply becomes the zero vector.

Right now, our version does not support efficient whitening. It is best if the data is already pre-whitened. If not, one can whiten it with an eigendecomposition of its covariance matrix, but this is an expensive operation.
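
For reference, the standard whitening transform, given the eigendecomposition <math>C = E D E^T</math> of the covariance matrix of the zero-mean data <math>X</math>, is

<math> \tilde{X} = D^{-1/2} E^T X </math>

after which <math>\tilde{X}</math> has identity covariance.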

We are currently extending ICA to include a faster and more reliable whitening procedure.

For an excellent survey of ICA, read the following reference:

Aapo Hyvärinen and Erkki Oja, "Independent Component Analysis: A Tutorial", Neural Networks, Vol. 13, No. 4-5 (2000), pp. 411-430.

Random Forests

Coming Soon

Clustering

K-Means

The usage of KMeans is similar to LDA. For example, you can create a KMeans learner with

val (nn, opts) = KMeans.learner(a, k)
where k is the number of clusters. If you have n data points in d dimensions, then the data matrix a should be a d x n matrix.

KMeans is currently a simple implementation that runs for a fixed number of iterations selected by

opts.npasses
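
A sketch of a complete run (the centroid layout is an assumption, following the one-row-per-component convention of the other models):

val (nn, opts) = KMeans.learner(a, 256)   // 256 clusters
opts.npasses = 10
nn.train
val centroids = FMat(nn.modelmat)   // one centroid per row (assumed layout)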

Tips

Saving your results

Result matrices are almost always stored as GMat (useGPU=true) or FMat (useGPU=false), and have general (Mat) type. It's easier to work with them as FMats (and this facilitates saving). The result of a training session is stored in the Learner's modelmat or modelmats field (for models with several components). You can extract and convert them to FMat like this:

> val (nn,opts) = LDA.learner(a,256)
> nn.train
> ...
> val m = FMat(nn.modelmat)

After that, you can use saveAs("filename.mat",m,"model") to save your results in a Matlab-compatible HDF5-format mat file, or use saveFMat("filename.fmat.lz4",m) to save in BIDMat's own format using lz4 compression. You will find the latter to be significantly faster. We strongly recommend you use a similar file naming convention (".mat" suffix for Matlab-compatible files, and "name.mattype.compresstype" for BIDMat-format files). HDF5 can store multiple arrays in each file, and requires a string name for each (in this case, we named m as "model" in the HDF5 file). BIDMat requires matrices to each be saved in their own file. There is type metadata in the file, but it is easier to track matrix types and contents using filenames. (See here for more details.)
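
For example, both save calls described above, as a sketch:

val m = FMat(nn.modelmat)
saveAs("model.mat", m, "model")    // Matlab-compatible HDF5; the array is named "model"
saveFMat("model.fmat.lz4", m)      // BIDMat native format, lz4-compressed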

Creating Test Matrices

Random Dense Matrices

You can create random dense matrices with either:

> val a=rand(n,m)    //Creating a random FMat
> val a=grand(n,m)   //Creating a random GMat
By convention, BIDMach's datasources assume that the row index is the feature index and the column index is for instances. You may want to use the GMat version if you're in a hurry. It's about 30x faster on our test machine.

Random Labels

To create an array of random binary labels with probability p, just generate a uniform (0,1) array as above, and then do:

> val labels = a < p

which will create an array with the same type as a (assumed to be FMat or GMat) containing 0s and 1s. If you need integer labels, apply IMat or GIMat to the output, e.g.

> val labels = IMat(a < p)

To generate random integers in some range 0..k-1, you can do

> val randints = IMat(floor(k * 0.99999f * rand(n,m)))
where the 0.99999f is to guard against the rare event that rand returns 1f.

Random Sparse Matrices

You can create random sparse matrices with the sprand function. It takes two dimension arguments and then a density argument.

val a=sprand(n,m,0.2)   //Creating a random SMat with 20% non-zero values
This returns a sparse matrix with approximately 20% non-zeros and where the non-zero values are distributed uniformly in the range (0,1).

Random Power-Law Matrices

Since power-law matrices are so common in data analysis, there are convenience functions for generating random power-law matrices. The simplest is powrand which takes row,column dimension arguments and then a density argument which indicates the expected number of items per column:

> val a = powrand(10000,1000,100)
> mean(sum(a))
99.5
This could represent a set of documents over a dictionary of 10000 terms, with 1000 documents of average length 100.

This function generates data by columns. For each column, it samples from a power-law distribution with exponent -1 to determine the number of elements in that column. Using this number, it repeatedly samples from a second power-law distribution with exponent -1 to obtain the row indices of the non-zeros in that column.

powrand uses the general form of sprand, which is:

sprand(nrows, ncols, rowGenFunction, colGenFunction)
This routine samples from the second function (colGenFunction) to get the number of elements in each column and uses this as a count to repeatedly sample from the row generator. For basic power-law data, both functions are specializations of
simplePowerLaw(range:Int)
which produces power-law distributed integers with exponent -1 in the range 0,...,range-1.

For more general power-law data, you can specialize the Pareto random generator whose interface is:

paretoGen(low:Int, high:Int, alpha:Double)
where low,high give the range of output values, and alpha is the shape parameter.
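
For example, a sketch that mimics the powrand call above but swaps in a Pareto row generator (the parameter choices are purely illustrative, and the argument order follows the sprand form given above):

// 10000 rows (terms) x 1000 columns (documents), about 100 non-zeros per column.
// The row generator draws term indices; the column generator draws per-column counts.
val a = sprand(10000, 1000, paretoGen(0, 9999, 1.0), simplePowerLaw(100))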