fozziethebeat edited this page Oct 27, 2011 · 3 revisions

The Matrix Package

Introduction

The matrix package provides tools for working with dense and sparse matrices, including matrix factorization methods, matrix transforms, matrix multiplication, and serialization/deserialization to and from several popular formats.

All code is centered around two interfaces:

  1. Matrix.java, which provides the core methods for accessing and updating values in a matrix.
  2. SparseMatrix.java, which provides access to sparse vectors in a matrix.

Beyond these two core interfaces, we provide the following tools centered around matrices:

  1. Matrix Factorization
  2. Matrix Smoothing or Scaling
  3. Matrix Serialization/Deserialization

Matrix Implementations

  • YaleSparseMatrix, using the [Yale Sparse Matrix Format](http://en.wikipedia.org/wiki/Sparse_matrix). This matrix is ideal for sparse matrices that can fit in memory.
  • GrowingSparseMatrix, using the [Yale Sparse Matrix Format](http://en.wikipedia.org/wiki/Sparse_matrix). This matrix can grow to any size based on the largest row and column value set.
  • AtomicGrowingMatrix, an atomic matrix that will grow to any size based on the largest row and column value set.
  • AtomicGrowingSparseMatrix, a sparse atomic matrix that will grow to any size based on the largest row and column value set.
  • AtomicGrowingSparseHashMatrix, a sparse atomic matrix, backed by a hash map, that will grow to any size based on the largest row and column value set. This provides faster access to each cell in the matrix at the cost of slower access to rows as vectors.
  • ArrayMatrix, a dense, in-memory matrix for reasonably small matrices.
  • OnDiskMatrix, which stores all values on disk, suitable for extremely large matrices.
  • SparseOnDiskMatrix, which stores all values on disk in a sparse format, suitable for extremely large matrices.
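
To illustrate the idea behind the Yale format mentioned above, here is a minimal sketch of compressed-row storage in plain Java. It is not the YaleSparseMatrix implementation; the class name YaleSketch and its fields are hypothetical and exist only for this example. Only non-zero cells are stored, together with their column indices and a per-row offset table.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of Yale (compressed sparse row) storage; not the library class. */
final class YaleSketch {
    final int rows;
    final int cols;
    final double[] values;  // non-zero values, stored row by row
    final int[] colIndex;   // column of each stored value
    final int[] rowStart;   // entries rowStart[r] .. rowStart[r + 1] - 1 belong to row r

    YaleSketch(double[][] dense) {
        rows = dense.length;
        cols = rows == 0 ? 0 : dense[0].length;
        List<Double> vals = new ArrayList<>();
        List<Integer> colsSeen = new ArrayList<>();
        rowStart = new int[rows + 1];
        for (int r = 0; r < rows; r++) {
            rowStart[r] = vals.size();
            for (int c = 0; c < cols; c++) {
                if (dense[r][c] != 0.0) {
                    vals.add(dense[r][c]);
                    colsSeen.add(c);
                }
            }
        }
        rowStart[rows] = vals.size();
        values = new double[vals.size()];
        colIndex = new int[vals.size()];
        for (int i = 0; i < values.length; i++) {
            values[i] = vals.get(i);
            colIndex[i] = colsSeen.get(i);
        }
    }

    /** Returns the value at (row, col); cells that were never stored read as 0. */
    double get(int row, int col) {
        for (int i = rowStart[row]; i < rowStart[row + 1]; i++) {
            if (colIndex[i] == col) {
                return values[i];
            }
        }
        return 0.0;
    }
}
```

Because a row's entries are contiguous, reading a whole row as a sparse vector is cheap, while a single-cell lookup has to scan that row's stored entries.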

Matrices.java provides additional utilities for handling matrices.

Matrix Factorization

Many Semantic Space models require a reduced feature space. Matrix factorization serves as the main method for this reduction. We provide three powerful factorization methods:

  1. Singular Value Decomposition: A given matrix is split into three smaller factor matrices that best approximate the original matrix.
  2. Non-negative Matrix Factorization: A given matrix is split into two non-negative matrices whose product approximates the original dataset.
  3. Locality Preserving Projections: Similar to SVD, this method projects a dataset into a smaller latent feature space using local features that capture and retain non-linear features of the dataset as a whole.

Both SVD and NMF factorize an original dataset into two distinct matrices. If the original dataset is a word by document matrix, then the resulting factor matrices can be best described as:

  1. A term by latent factor matrix
  2. A latent factor by document matrix.

Both SVD and NMF implement the generic MatrixFactorization interface, which lets users change the factorization method easily. For example, the following code will save the factor matrices generated by SVD and NMF:

SparseMatrix dataset = ...
int numDimensions = ...
MatrixFactorization factorizor = new SingularValueDecompositionLibC();
// Assumption: factorize(SparseMatrix, int) must run before the factors are read.
factorizor.factorize(dataset, numDimensions);
MatrixIO.writeMatrix(factorizor.dataClasses(), new File("svd-ws.dat"), Format.DENSE_TEXT);
MatrixIO.writeMatrix(factorizor.classFeatures(), new File("svd-ds.dat"), Format.DENSE_TEXT);
// Swap in NMF; the calls that follow are unchanged.
factorizor = new NonNegativeMatrixFactorizationMultiplicative();
factorizor.factorize(dataset, numDimensions);
MatrixIO.writeMatrix(factorizor.dataClasses(), new File("nmf-ws.dat"), Format.DENSE_TEXT);
MatrixIO.writeMatrix(factorizor.classFeatures(), new File("nmf-ds.dat"), Format.DENSE_TEXT);

In this example, factorizor.dataClasses() returns the term by latent factor matrix, while factorizor.classFeatures() returns the latent factor by document matrix. Both methods can similarly be used on other data matrices, such as document by term matrices, term by term matrices, or other more interesting data sets.
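
To make the shapes of the two factor matrices concrete, the following is a minimal, self-contained sketch of non-negative matrix factorization using the classic multiplicative update rules. It does not use or reproduce NonNegativeMatrixFactorizationMultiplicative; that the library class uses the same update rules is only an assumption based on its name. The class NmfSketch, its method names, and the seed parameter are all hypothetical.

```java
import java.util.Random;

/** Hypothetical sketch of NMF with multiplicative updates; not the library class. */
final class NmfSketch {
    static double[][] multiply(double[][] a, double[][] b) {
        int n = a.length, inner = b.length, m = b[0].length;
        double[][] out = new double[n][m];
        for (int i = 0; i < n; i++)
            for (int x = 0; x < inner; x++)
                for (int j = 0; j < m; j++)
                    out[i][j] += a[i][x] * b[x][j];
        return out;
    }

    static double[][] transpose(double[][] a) {
        double[][] out = new double[a[0].length][a.length];
        for (int i = 0; i < a.length; i++)
            for (int j = 0; j < a[0].length; j++)
                out[j][i] = a[i][j];
        return out;
    }

    /** Squared Frobenius distance between v and the product w * h. */
    static double error(double[][] v, double[][] w, double[][] h) {
        double[][] wh = multiply(w, h);
        double sum = 0;
        for (int i = 0; i < v.length; i++)
            for (int j = 0; j < v[0].length; j++)
                sum += (v[i][j] - wh[i][j]) * (v[i][j] - wh[i][j]);
        return sum;
    }

    /** Returns {W, H}: W is rows-by-k (term by factor), H is k-by-columns (factor by document). */
    static double[][][] factorize(double[][] v, int k, int iterations, long seed) {
        int n = v.length, m = v[0].length;
        Random rng = new Random(seed);
        double[][] w = new double[n][k];
        double[][] h = new double[k][m];
        // Strictly positive start values, so no entry is stuck at zero.
        for (double[] row : w) for (int j = 0; j < k; j++) row[j] = rng.nextDouble() + 0.1;
        for (double[] row : h) for (int j = 0; j < m; j++) row[j] = rng.nextDouble() + 0.1;
        final double eps = 1e-9; // guards against division by zero
        for (int it = 0; it < iterations; it++) {
            double[][] wt = transpose(w);
            double[][] num = multiply(wt, v);
            double[][] den = multiply(multiply(wt, w), h);
            for (int i = 0; i < k; i++)
                for (int j = 0; j < m; j++)
                    h[i][j] *= num[i][j] / (den[i][j] + eps);
            double[][] ht = transpose(h);
            num = multiply(v, ht);
            den = multiply(w, multiply(h, ht));
            for (int i = 0; i < n; i++)
                for (int j = 0; j < k; j++)
                    w[i][j] *= num[i][j] / (den[i][j] + eps);
        }
        return new double[][][] {w, h};
    }
}
```

For a 4 by 3 word by document matrix and k = 2, the first returned matrix is 4 by 2 and the second is 2 by 3, matching the two factor matrices described above.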

Matrix Smoothing and Scaling

We've included several smoothing metrics, which we refer to as transforms, for data matrices. These typically take a large sparse matrix consisting of either word co-occurrence counts or word-document counts, reduce the value of non-informative features, and boost the value of highly informative features. We currently support the following transformations:

We are also adding matrix normalization techniques such as:

  • Row Magnitude Transform, which normalizes each row to have a magnitude of 1.

All of the above methods implement the same Transform interface, which can be applied both to in-memory matrices and to our various on-disk matrix formats. To transform an existing matrix with one of the above methods:

Matrix dataset = ...
Transform transform = ...
Matrix transformed = transform.transform(dataset);

To transform a serialized matrix:

Transform transform = ...
File dataset = ...
Format datasetFormat = ...
File transformed = transform.transform(dataset, datasetFormat);
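
As a concrete picture of what a row magnitude normalization computes, here is a minimal sketch on a plain double[][] array. It does not implement the Transform interface and is not the library's Row Magnitude Transform; the class RowMagnitudeSketch and its method are hypothetical, and leaving all-zero rows unchanged is an assumption made for this example.

```java
/** Hypothetical sketch of row magnitude normalization; not the library transform. */
final class RowMagnitudeSketch {
    /** Returns a copy of m in which every non-zero row has Euclidean length 1. */
    static double[][] normalizeRows(double[][] m) {
        double[][] out = new double[m.length][];
        for (int r = 0; r < m.length; r++) {
            double sumOfSquares = 0;
            for (double value : m[r]) {
                sumOfSquares += value * value;
            }
            double magnitude = Math.sqrt(sumOfSquares);
            out[r] = new double[m[r].length];
            for (int c = 0; c < m[r].length; c++) {
                // Assumption: an all-zero row has no direction, so it stays all zero.
                out[r][c] = magnitude == 0 ? 0 : m[r][c] / magnitude;
            }
        }
        return out;
    }
}
```

After normalization, the dot product of two rows equals their cosine similarity, which is one common reason to apply this step before comparing rows.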

Matrix Serialization/Deserialization

We provide support for a variety of external data formats for matrix files. Each format has its own set of trade-offs. In particular, some formats are optimal for sparse datasets, while others are best for dense datasets. Similarly, binary formats are compact but not human readable, while text-based formats are readable but consume more disk space.
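
The sparse versus dense trade-off can be seen with a small sketch that writes the same matrix in two text layouts. Both layouts are illustrative assumptions (one line per row for the dense form, one "row column value" line per non-zero cell with 1-based indices for the sparse form) and are not guaranteed to match the exact files that MatrixIO reads or writes. The class FormatSizeSketch is hypothetical.

```java
/** Hypothetical sketch comparing a dense and a sparse text layout; not MatrixIO. */
final class FormatSizeSketch {
    /** One line per row, with every cell written out. */
    static String toDenseText(double[][] m) {
        StringBuilder sb = new StringBuilder();
        for (double[] row : m) {
            for (int c = 0; c < row.length; c++) {
                if (c > 0) {
                    sb.append(' ');
                }
                sb.append(row[c]);
            }
            sb.append('\n');
        }
        return sb.toString();
    }

    /** One "row column value" line per non-zero cell, using 1-based indices. */
    static String toSparseTriples(double[][] m) {
        StringBuilder sb = new StringBuilder();
        for (int r = 0; r < m.length; r++) {
            for (int c = 0; c < m[r].length; c++) {
                if (m[r][c] != 0.0) {
                    sb.append(r + 1).append(' ').append(c + 1).append(' ')
                      .append(m[r][c]).append('\n');
                }
            }
        }
        return sb.toString();
    }
}
```

For a matrix that is mostly zeros, the triple layout is much shorter. For a matrix with few zeros, it is longer than the dense layout, because every value also carries its two indices.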

MatrixIO provides serialization and deserialization support for Matrix instances. For example, the following code will read a SVDLIBC_SPARSE_BINARY file from disk and load it into a new Matrix:

File matrixFile = ...
Matrix matrix = MatrixIO.readMatrix(matrixFile, Format.SVDLIBC_SPARSE_BINARY);

Similarly, the following will write the now in-memory matrix in the MATLAB_SPARSE format:

File outputFile = ...
MatrixIO.writeMatrix(matrix, outputFile, Format.MATLAB_SPARSE);

We also provide support for iterating through the values in a stored matrix with an iterator. For example,

Iterator<MatrixEntry> entryIter = MatrixIO.getMatrixFileIterator(outputFile, Format.MATLAB_SPARSE);
while (entryIter.hasNext()) {
    MatrixEntry entry = entryIter.next();
    System.out.printf("%d %d %f\n", entry.row(), entry.column(), entry.value());
}

will iterate through every value that was written to outputFile and print the row, column, and cell value.