written by Alex Gittens
licensed under the Creative Commons ShareAlike 4.0 International License
Nystrom Bestiary is a collection of code for experimenting with various SPSD sketches, including Nystrom extensions based on column sampling, Nystrom extensions based on random mixtures of columns, and 'pinched' and 'prolonged' eigensketches (see equations (5.11) and (5.12) of the review paper "Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions" for the precise definitions of these two sketches).
It was used to produce the figures in the paper "Revisiting the Nystrom Method for Improved Large-scale Machine Learning" (arXiv preprint link) by Alex Gittens and Michael Mahoney. In particular, the experimental setup to generate exactly those figures is included.
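For orientation, the simplest sketch named above, a Nystrom extension built from uniformly sampled columns, can be illustrated in a few lines. This is a minimal sketch under stated assumptions, not the code in extensions/: the test matrix, the sizes, and all variable names are hypothetical, and the pseudoinverse form C * pinv(W) * C' is the textbook formulation rather than necessarily the one used in this package.

```matlab
% Illustrative column-sampling Nystrom extension (hypothetical names and
% sizes; not taken from the extensions/ directory).
rng(0);                        % fix the seed so the run is repeatable
n = 200;                       % size of the SPSD matrix
l = 40;                        % number of columns to sample
X = randn(n, 5);
A = X * X';                    % small SPSD test matrix of rank at most 5

idx = randperm(n, l);          % l distinct column indices, chosen uniformly
C = A(:, idx);                 % n-by-l block of sampled columns
W = A(idx, idx);               % l-by-l intersection of sampled rows and columns
Anys = C * pinv(W) * C';       % Nystrom extension of A

% Relative Frobenius-norm error of the approximation
relErr = norm(A - Anys, 'fro') / norm(A, 'fro');
fprintf('relative Frobenius error: %.2e\n', relErr);
```

Because the test matrix here has rank far below the number of sampled columns, the error should be very small; for real datasets the error depends on the spectrum of the matrix and on how the columns are chosen.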
Send comments, suggestions, and complaints to gittens AT icsi FULLSTOP berkeley FULLSTOP edu
- the extensions/ directory contains the implementations of various Nystrom extensions
- the io/ directory contains the code used to create, load, and process the datasets
- the datasets/ directory contains the datasets used in the experiments
- the experiments/ directory contains a set of m-files that actually runs the Nystrom extensions on various datasets and stores statistics on the errors and timing
- the outputs/ directory is used to store the output of the experiments
- the plots/ directory stores the plots of the timings and errors
- the auxiliary/ directory contains code needed in computing the extensions
- the visualization/ directory contains the code used to produce the plots of the various timings and errors
- the misc/ directory contains miscellany (so far, the code to generate the data for Table 2 in the paper)
ALL m-files should be run from the base folder; otherwise you'll run into path issues.
To produce the figures in the paper:
#####Short story
Ensure that you are in the base directory, NystromBestiary, and run the following commands from the Matlab prompt:
addpath(genpath('.'))
create_bestiary_datasets
maxNumCompThreads(1); % if you want accurate timing info
runall
visualizeall
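If you want to restore your usual thread count after timing, one possible pattern is sketched below. It relies only on the built-in maxNumCompThreads, which returns the previous setting when called with a new value; the timed SVD is a hypothetical stand-in for an experiment and is not part of this package.

```matlab
% Limit Matlab to one computational thread and remember the old setting.
oldThreads = maxNumCompThreads(1);

% Hypothetical stand-in for a timed experiment.
A = randn(500);
tic;
s = svd(A);
elapsed = toc;
fprintf('single-threaded SVD took %.3f seconds\n', elapsed);

% Restore the previous thread count once timing is done.
maxNumCompThreads(oldThreads);
```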
#####Long story
- add all the subdirectories in this folder to your path
- run create_bestiary_datasets to generate some required distance matrices; this step generates about 1.5 GB of data
- if you want to have email notifications at the start and end of each experiment, modify runall.m to set the sendEmails flag to true and set the email-related variables appropriately, then run setpref('Internet', 'SMTP_Password', 'youremailpassword') at the Matlab command line
- run runall (or pick individual experiments) in the experiments directory; this step generates about 2.7 GB of data
- wait several days for the experiments to stop running!
- run visualizeall
The pdfs will be located in the output directory
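For the optional email notifications, the password preference lives in Matlab's 'Internet' preference group, which the built-in sendmail function also reads for the server, sender address, and username. A sketch of setting the whole group by hand follows. The server and addresses are hypothetical placeholders, and whether notifier.m or runall.m already sets the other preferences from the email-related variables is an assumption you should check in those files.

```matlab
% Hypothetical account details; replace with your own.
setpref('Internet', 'SMTP_Server',   'smtp.example.com');
setpref('Internet', 'E_mail',        'you@example.com');
setpref('Internet', 'SMTP_Username', 'you@example.com');
setpref('Internet', 'SMTP_Password', 'youremailpassword');

% Confirm that the preferences were stored.
disp(getpref('Internet', 'SMTP_Server'));
```

Preferences set this way persist across Matlab sessions, so the password stays stored until you remove it with rmpref('Internet', 'SMTP_Password').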
See the individual m-files for more details. Make appropriate modifications to substitute your own datasets.
- jdqrpcg.m is due to Yvan Notay (see the m file for full attribution)
- notifier.m is due to Benjamin Krause (see the m file for full attribution)
For dataset provenances, see Table 3 in the above-mentioned paper (datasets: Abalone, Wine, Spam, Kin8nm, Dexter, Gisette, Enron, Protein, SNPs, HEP, GR, Gnutella).
Two additional datasets, Cranfield and Medline, are from the Text to Matrix Generator Matlab Toolbox's website.