Loading and saving matrices

jcanny edited this page May 26, 2014 · 27 revisions

Matlab and HDF5 Files

BIDMat supports two main file types: a Matlab-compatible HDF5 format, and a simple custom binary format with optional gzip or lz4 compression. The first format makes it easy to exchange data with Matlab, Scipy and many other tools. The second format is generally much faster.

Reading Mat Files

Matrices in Matlab-compatible HDF5 format can be read with commands like this:

scala> val a:IMat = load("d:\\data\\sentiment\\data1.mat","tokens")

scala> val b:SMat = load("d:\\data\\sentiment\\data2.mat","trigrams")

The load command takes a filename argument, followed by the name of a variable in the file. Assuming the data were created by Matlab (with the "-v7.3" option to save), the variable name is the name of the object saved in Matlab.

Note that each variable declaration includes a matrix type. This is important. The load function can return FMat, DMat, IMat, SMat, SDMat, CMat, CSMat or String objects, and its actual return type is AnyRef. Providing a type declaration for the assigned value or variable tells the compiler exactly what type to expect, and allows the variable to be bound to the correct type. Note that CSMat is similar to Matlab's "cell matrix" and its elements may be of any of the above types. Most commonly, though, the CSMat will hold string data.

The underlying representation is HDF5, a widely-used format for storing matrices of scientific data, and the format now used by Matlab. Matlab's version of this format is prefaced by a 512-byte header. That is the only difference between Matlab's HDF5 files and non-Matlab HDF5 files. Without the header, though, Matlab will not read a data file. It will also complain if certain metadata on each array are missing. So it's best to use a save function that is compatible with Matlab.
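To illustrate the header difference, here is a minimal sketch that looks for the standard 8-byte HDF5 format signature at offset 0 (no header) or offset 512 (behind a 512-byte header). The object and method names are hypothetical and not part of BIDMat, and checking only these two offsets is a simplification.

```scala
object Hdf5Probe {
  // The 8-byte HDF5 format signature: 0x89 'H' 'D' 'F' '\r' '\n' 0x1a '\n'.
  val Signature: Array[Byte] =
    Array[Int](0x89, 0x48, 0x44, 0x46, 0x0d, 0x0a, 0x1a, 0x0a).map(_.toByte)

  /** Returns the offset (0 or 512) at which the HDF5 signature is found, if any. */
  def signatureOffset(bytes: Array[Byte]): Option[Int] =
    Seq(0, 512).find { off =>
      bytes.length >= off + Signature.length &&
        bytes.slice(off, off + Signature.length).sameElements(Signature)
    }

  /** True when the signature sits behind a 512-byte header, as described for Matlab files. */
  def hasMatlabStyleHeader(bytes: Array[Byte]): Boolean =
    signatureOffset(bytes).contains(512)
}
```

Passing the first 520 bytes of a file to hasMatlabStyleHeader is enough for this check.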

You can also load and save non-Matlab-compatible HDF5 files using saveAsHDF5(fname,varname). The file contents are the same as with saveAs(fname,varname), but the Matlab-compatible header is not created. You can read these files using the same load(fname,varname) command.

Saving Data to Files

Saving variables to a file is straightforward:

scala> saveAs("d:\\data\\sentiment\\data1.mat", a, "tokens", b, "trigrams")

You can save an arbitrary number of variables to a file. The first argument to saveAs is a filename, and the remaining arguments form an alternating list of variables from the environment and String names. The effect is that variable a is saved as "tokens", b is saved as "trigrams", etc. In fact, a and b don't have to be references to matrices; they could be any expressions that return the appropriate matrix types.
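The alternating value/name convention can be pictured with a small stand-alone sketch. This is not BIDMat's implementation; SaveArgs and pair are hypothetical names that only show how such an argument list can be grouped into (name, value) pairs.

```scala
object SaveArgs {
  /** Splits an alternating (value, name, value, name, ...) list into (name, value) pairs. */
  def pair(args: Any*): Seq[(String, Any)] = {
    require(args.length % 2 == 0, "expected alternating values and names")
    args.grouped(2).map {
      case Seq(value, name: String) => (name, value)
      case other => throw new IllegalArgumentException(s"name must be a String: $other")
    }.toList
  }
}
```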

You can load this data directly into Matlab with the load command (which doesn't need the "-v7.3" option). It will create variables named "tokens" and "trigrams" that are, respectively, a dense matrix of int32 and a sparse matrix of double.

Limitations

Not all Matlab types are supported. Currently supported are dense matrices of double, float, int32 and String, and sparse matrices of double, float, or byte (note that the last two do not exist in Matlab). String data are stored as uint16, which matches well with the internal formats of Matlab and Java/Scala, and will be read by Matlab as strings. A CSMat of string data will be read by Matlab as a cell array of Strings. Unfortunately, this is very inefficient in HDF5, since compression only happens within a given array (i.e. within one string). We don't know of a fast way to exchange string data with Matlab. You can load/save SBMat string data with BIDMat's HDF5 or custom formats, and these are much faster than I/O with CSMats.

BIDMat Files

BIDMat includes a simple binary file format for high-speed load/save of compressed or uncompressed data. Each file holds a single matrix of a particular type. We recommend expressing the file type in the file name, although it can be read from a header in the file. To save an FMat a you can do:

> saveFMat("/data/mymat.fmat", a)      or
> saveFMat("/data/mymat.fmat.gz", a)   or
> saveFMat("/data/mymat.fmat.lz4", a)

Each command saves the matrix in BIDMat binary format. The first command stores the matrix uncompressed. The second two commands save with gzip or lz4 compression respectively. The compression type in those cases is determined by the file extension. To load the data from these files you use corresponding load commands:

> val x = loadFMat("/data/mymat.fmat")      or
> val x = loadFMat("/data/mymat.fmat.gz")   or
> val x = loadFMat("/data/mymat.fmat.lz4")

In each case x will have type FMat, and the correct decompression method will be inferred from the file name. Similarly, there are I/O routines for the other major BIDMat types:

Mat type     Save                    Load
FMat         saveFMat(fname,var)     loadFMat(fname)
DMat         saveDMat(fname,var)     loadDMat(fname)
IMat         saveIMat(fname,var)     loadIMat(fname)
SMat         saveSMat(fname,var)     loadSMat(fname)
SDMat        saveSDMat(fname,var)    loadSDMat(fname)
SBMat        saveSBMat(fname,var)    loadSBMat(fname)

Text File I/O (Dense Matrices)

You can also do I/O for matrices that are pure numerical tables. There should be no header data, and no irregularities (or missing values) in the input file. The input routines are (in the BIDMat.HMat object):

loadFMatTxt(fname:String)
loadIMatTxt(fname:String)
loadDMatTxt(fname:String)

You can call these directly, but they are also invoked whenever

loadFMat(fname:String)
loadIMat(fname:String)
loadDMat(fname:String)

are applied to filenames that end in a ".txt", ".txt.gz" or ".txt.lz4" extension. These routines expect the file to contain one row per line, with fields delimited by a space, tab or comma. If there is a ".gz" or ".lz4" suffix, then the compression filter is applied first.
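The expected text layout can be sketched with a small stand-alone parser. This is not the BIDMat reader; TxtParse is a hypothetical name, and two details are assumptions of the sketch: runs of consecutive delimiters are treated as one, and the result is stored in column-major order.

```scala
object TxtParse {
  /** Parses numeric text into (nrows, ncols, data), with data in column-major order. */
  def parse(text: String): (Int, Int, Array[Float]) = {
    // One matrix row per non-empty line; fields split on space, tab or comma.
    val rows = text.split("\r?\n").filter(_.trim.nonEmpty).map(_.trim.split("[ \t,]+"))
    val nrows = rows.length
    val ncols = if (nrows == 0) 0 else rows(0).length
    require(rows.forall(_.length == ncols), "every line must have the same number of fields")
    val data = new Array[Float](nrows * ncols)
    for (i <- 0 until nrows; j <- 0 until ncols)
      data(i + j * nrows) = rows(i)(j).toFloat
    (nrows, ncols, data)
  }
}
```

The require call reflects the rule that the input must be a regular table: a line with a missing field is rejected rather than padded.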

The corresponding output routines are

saveFMatTxt(fname:String,v:FMat)
saveIMatTxt(fname:String,v:IMat)
saveDMatTxt(fname:String,v:DMat)

These routines have optional compression type and delimiter arguments:

saveFMatTxt(fname:String,v:FMat,compress:Int=0,delim:String="\t")
saveIMatTxt(fname:String,v:IMat,compress:Int=0,delim:String="\t")
saveDMatTxt(fname:String,v:DMat,compress:Int=0,delim:String="\t")

They are also invoked when you call

saveFMat(fname:String,v:FMat)
saveIMat(fname:String,v:IMat)
saveDMat(fname:String,v:DMat)

with filename arguments that end in ".txt", ".txt.gz" or ".txt.lz4".
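As a rough picture of what the text output looks like, the following stand-alone sketch renders a column-major array as one row per line with a configurable delimiter. TxtFormat is a hypothetical helper, not the BIDMat writer, and the number formatting (Scala's default Float.toString) is an assumption.

```scala
object TxtFormat {
  /** Renders a column-major matrix as text, one row per line, fields joined by delim. */
  def format(nrows: Int, ncols: Int, data: Array[Float], delim: String = "\t"): String = {
    require(data.length == nrows * ncols, "data length must equal nrows * ncols")
    (0 until nrows).map { i =>
      // Element (i, j) of a column-major array lives at index i + j * nrows.
      (0 until ncols).map(j => data(i + j * nrows).toString).mkString(delim)
    }.mkString("", "\n", "\n")
  }
}
```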

BIDMat File Compression

BIDMat supports gzip or lz4 file compression as in the examples above. LZ4 compression is typically 5-20 times faster than low-compression gzip. File sizes are larger, but the faster load/save times are a big advantage in most applications we have looked at. The default gzip compression level is 3, which also favors faster compression at the cost of somewhat larger files.
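The gzip level trade-off can be tried out with the JDK alone. The sketch below is independent of BIDMat (GzipLevels is a hypothetical name): it gzips a byte array at a chosen level between 1 (fastest) and 9 (smallest) by setting the level on the stream's underlying Deflater, and reads it back.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream}
import java.util.zip.{GZIPInputStream, GZIPOutputStream}

object GzipLevels {
  /** Gzips bytes at the given level (1 = fastest, 9 = smallest output). */
  def gzip(bytes: Array[Byte], level: Int): Array[Byte] = {
    val bos = new ByteArrayOutputStream()
    // GZIPOutputStream has no level parameter; its protected Deflater field `def` is set in a subclass.
    val gz = new GZIPOutputStream(bos) { `def`.setLevel(level) }
    gz.write(bytes)
    gz.close()
    bos.toByteArray
  }

  /** Decompresses a gzip byte array back to the original bytes. */
  def gunzip(bytes: Array[Byte]): Array[Byte] = {
    val in = new GZIPInputStream(new ByteArrayInputStream(bytes))
    val out = new ByteArrayOutputStream()
    val buf = new Array[Byte](4096)
    var n = in.read(buf)
    while (n >= 0) { out.write(buf, 0, n); n = in.read(buf) }
    in.close()
    out.toByteArray
  }
}
```

Comparing the output length and elapsed time of gzip(data, 1) and gzip(data, 9) on your own data gives a quick feel for where a low default level sits.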

It's possible to override the default compression (e.g. to save as a filename without ".gz" or ".lz4") using a third optional argument:

> saveFMat("/data/mymat.fmat", a, compress)

where compress:Int=2 does gzip compression and compress=3 does lz4 compression. Similarly, you can override filename-inferred compression with an optional argument to the load functions:

> val x = loadFMat("/data/mymat.fmat", compress)

where compress has the same meaning as before. These options apply to all the matrix I/O routines listed above.
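The interplay between the file extension and the explicit argument can be summarized in a short sketch. CompressionChoice is a hypothetical stand-alone model, not BIDMat code. The codes 2 (gzip) and 3 (lz4) come from the text above; treating every other value as "infer from the extension" is an assumption of the sketch.

```scala
object CompressionChoice {
  sealed trait Codec
  case object NoCompression extends Codec
  case object Gzip extends Codec
  case object Lz4 extends Codec

  /** Picks a codec from the file extension alone. */
  def fromFileName(fname: String): Codec =
    if (fname.endsWith(".gz")) Gzip
    else if (fname.endsWith(".lz4")) Lz4
    else NoCompression

  /** An explicit code wins over the extension: 2 = gzip, 3 = lz4. */
  def resolve(fname: String, compress: Option[Int] = None): Codec = compress match {
    case Some(2) => Gzip
    case Some(3) => Lz4
    case _       => fromFileName(fname) // assumption: any other value falls back to the extension
  }
}
```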

With gzip, you can further tailor the compression level from level 1 (faster, lower compression), to level 9 (slower, higher compression) using the global variable:

> Mat.compressionLevel

BIDMat File Format

BIDMat files include a 4-word (16-byte) binary header. The first word specifies the matrix type. Read as a decimal number, it has the form:

WXYZ00ABC (decimal digits)
WXYZ = version number (currently zero)
A = matrix type: 1 (dense), 2 (sparse), 3 (sparse, norows), 4 (3-tensor), 5 (4-tensor), 6 (5-tensor)
B = data type: 0 (byte), 1 (int), 2 (long), 3 (float), 4 (double), 5 (complex float), 6 (complex double)
C = index type (sparse matrices only): 1 (int), 2 (long)
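The digit layout can be decoded with integer division and remainders. The sketch below is a stand-alone illustration (TypeWord and its members are hypothetical names). It reads the word exactly as laid out above: C is the lowest decimal digit, then B, then A, followed by two zero digits and the version.

```scala
object TypeWord {
  final case class Decoded(version: Int, matrixType: Int, dataType: Int, indexType: Int)

  /** Splits the first header word, read as decimal digits WXYZ00ABC, into its fields. */
  def decode(word: Int): Decoded =
    Decoded(
      version    = word / 100000,     // WXYZ: everything above the two zero digits
      matrixType = (word / 100) % 10, // A
      dataType   = (word / 10) % 10,  // B
      indexType  = word % 10          // C
    )

  /** Inverse of decode for single-digit fields. */
  def encode(d: Decoded): Int =
    d.version * 100000 + d.matrixType * 100 + d.dataType * 10 + d.indexType
}
```

Under this reading, a version-zero dense float matrix has type word 130, and a sparse float matrix with int indices has type word 231.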

The next 3 words are respectively:

nrows (Int)
ncols (Int)
nnz   (Int) for sparse matrices, zero for dense matrices

The remainder of the file contains the data itself. For dense matrices, this is a single array in column-major order.
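A minimal sketch of reading the four header words and indexing a dense data array follows. BidmatHeader is a hypothetical name, and the byte order is an assumption: the text does not state it, so the sketch defaults to little-endian and lets the caller override it.

```scala
import java.nio.{ByteBuffer, ByteOrder}

object BidmatHeader {
  final case class Header(typeWord: Int, nrows: Int, ncols: Int, nnz: Int)

  /** Reads the four 32-bit header words from the first 16 bytes. */
  def read(bytes: Array[Byte], order: ByteOrder = ByteOrder.LITTLE_ENDIAN): Header = {
    require(bytes.length >= 16, "header needs 16 bytes")
    val buf = ByteBuffer.wrap(bytes, 0, 16).order(order)
    // Arguments are evaluated left to right, so the words are consumed in file order.
    Header(buf.getInt(), buf.getInt(), buf.getInt(), buf.getInt())
  }

  /** Offset of element (row, col) in a dense, column-major data array. */
  def denseIndex(h: Header, row: Int, col: Int): Int = row + col * h.nrows
}
```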

For sparse matrices, CSC format is used. There are three data arrays stored in this order:

(compressed) column array (length = ncols + 1)
(optional) row array      (length = nnz)
data array                (length = nnz)

The row array is optional since some sparse matrices have lower-dense rows. That is, the row indices in a given column are a contiguous range from 0 to k-1, where k is the number of non-zeros in that column. These row indices can be automatically generated from the column array, and do not need to be stored.
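Regenerating the omitted row indices is a short computation. In the sketch below (ImplicitRows is a hypothetical name, not BIDMat code), the number of non-zeros in each column is the difference between consecutive entries of the column array, so the result does not depend on whether that array is zero-based or one-based.

```scala
object ImplicitRows {
  /** Rebuilds row indices when each column's non-zeros occupy rows 0 to k-1. */
  def rowIndices(colPtr: Array[Int]): Array[Int] = {
    // k for column j is colPtr(j + 1) - colPtr(j).
    val counts = colPtr.zip(colPtr.drop(1)).map { case (start, end) => end - start }
    counts.flatMap(k => 0 until k)
  }
}
```

For a column array of 0, 2, 2, 5 (three columns holding 2, 0 and 3 non-zeros), the regenerated row array is 0, 1, 0, 1, 2.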