ML Dimensionality Reduction techniques for large arrays #7

Open
jakirkham opened this issue May 19, 2018 · 8 comments

@jakirkham
Contributor

One of the things that came up during the ImageXD conference, which is also of great interest in our lab, is how to perform dimensionality reduction in Dask, particularly on large arrays. These may be stacks of images or other data. Since these are usually large-array problems, the interest is in using Dask Arrays to work on data that would otherwise be impractical to handle. Techniques of interest include matrix factorization methods such as Dictionary Learning, NMF, etc.

@mrocklin
Member

Are there particular algorithms that you are optimistic about that would be both useful on some of the example datasets and also feasible to implement and test within a week?

@valentina-s

I will be happy to work on NMF (though remotely). Two types of algorithms are commonly used: alternating least squares and multiplicative methods. I will need to think (and check some references) about which is easier to make distributed. In general they are all iterative, so some thought is required to keep the task graph from becoming too large.
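
For reference, here is a minimal in-memory sketch of the multiplicative approach, written with plain NumPy and assuming the standard Frobenius-norm update rules. The function name `nmf_multiplicative`, its parameters, and the synthetic test matrix are illustrative assumptions, not code from any of the libraries discussed here.

```python
import numpy as np

def nmf_multiplicative(X, k, n_iter=200, eps=1e-9, seed=0):
    """Approximate a nonnegative X (n x m) as W (n x k) @ H (k x m).

    Hypothetical helper: multiplicative updates for the Frobenius-norm
    objective. eps only guards against division by zero.
    """
    rng = np.random.default_rng(seed)
    n, m = X.shape
    W = rng.random((n, k)) + eps
    H = rng.random((k, m)) + eps
    for _ in range(n_iter):
        # Each update multiplies by a nonnegative ratio, so W and H stay >= 0.
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ (H @ H.T) + eps)
    return W, H

# Synthetic nonnegative matrix with exact rank 4 (made-up test data).
rng = np.random.default_rng(1)
X = rng.random((60, 4)) @ rng.random((4, 40))
W, H = nmf_multiplicative(X, k=4)
rel_err = np.linalg.norm(X - W @ H) / np.linalg.norm(X)
print(f"relative reconstruction error: {rel_err:.3f}")
```

Every step is a matrix product or an elementwise operation, which is what makes this variant a natural first candidate for a chunked-array port.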

A famous test case for NMF (and dictionary learning) is to apply it to the faces dataset:

http://scikit-learn.org/stable/auto_examples/decomposition/plot_faces_decomposition.html#sphx-glr-auto-examples-decomposition-plot-faces-decomposition-py

@mrocklin
Member

Here is a small example doing alternating least squares on multi-dimensional arrays: https://gist.github.com/mrocklin/6fc759ab829a44c4f1969a6d6fc9dd28

This is a naive implementation, though, and not necessarily of use here. I thought I'd post it just as an example.
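
To make the alternating least squares idea concrete in its simplest matrix form, here is a small NumPy sketch for NMF. It is not the code from the gist: it is a projected variant (solve an unconstrained least-squares problem, then clip negatives to zero), and the name `nmf_als` and the test data are assumptions for illustration.

```python
import numpy as np

def nmf_als(X, k, n_iter=50, seed=0):
    """Projected alternating least squares for X (n x m) ~ W (n x k) @ H (k x m).

    Hypothetical helper: each half-step solves an ordinary least-squares
    problem with the other factor fixed, then clips negatives to zero.
    Clipping is a heuristic, so the error is not guaranteed to decrease
    at every iteration.
    """
    rng = np.random.default_rng(seed)
    W = rng.random((X.shape[0], k))
    for _ in range(n_iter):
        # Fix W, solve min_H ||X - W H||_F, project onto H >= 0.
        H = np.clip(np.linalg.lstsq(W, X, rcond=None)[0], 0, None)
        # Fix H, solve the transposed problem for W, project onto W >= 0.
        W = np.clip(np.linalg.lstsq(H.T, X.T, rcond=None)[0].T, 0, None)
    return W, H

# Synthetic nonnegative matrix with exact rank 4 (made-up test data).
rng = np.random.default_rng(1)
X = rng.random((60, 4)) @ rng.random((4, 40))
W, H = nmf_als(X, k=4)
rel_err = np.linalg.norm(X - W @ H) / np.linalg.norm(X)
print(f"relative reconstruction error: {rel_err:.3f}")
```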

@jakirkham
Contributor Author

Thanks for chiming in @valentina-s. Had been meaning to bring this to your attention. :)

Think we should be able to work something out so we can chat/share code snippets. @NelleV are you aware of any good resources for this that we could use during the sprint?

Indeed the faces dataset would be a good one to use. It is also common to apply matrix factorization techniques to calcium imaging data, which we have a fair bit of. Haven't looked through the other datasets that people have provided, but maybe some of them would be good candidates for this as well.

Honestly even working out a very rough implementation during the conference would be very useful. These sorts of techniques are pretty important for us at work in a wide range of applications. So it's pretty easy to justify spending time on improving them afterwards.

@jakirkham
Contributor Author

Thanks for compiling these, @valentina-s. Looks like we have our reading cut out for us.

Something else that @GaelVaroquaux was sharing earlier is MODL, which would be good to take a look at. It would work best when run on a single powerful node.

Agree that Neurofinder data would work well for this. Talking to @TomAugspurger about making this data easily available from Pangeo. Typically what people do with this data in particular is restructure it into 2D, where one dimension is the raveled spatial coordinates and the other is time. The goal is to find some meaningful representative images that can be used to reconstruct the original data.
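
As a rough illustration of that restructuring, the sketch below uses NumPy on a small random stand-in movie (the shapes and variable names are made up). A truncated SVD stands in for the factorization step purely to show the shapes; NMF or dictionary learning would slot into the same place.

```python
import numpy as np

# Stand-in movie: (time, y, x). Real data would be far larger.
n_frames, height, width = 50, 16, 12
movie = np.random.default_rng(0).random((n_frames, height, width))

# Ravel the spatial axes, then put pixels on rows and time on columns.
X = movie.reshape(n_frames, height * width).T   # (pixels, time)

# Placeholder factorization X ~ W @ H via truncated SVD (not NMF).
k = 3
U, s, Vt = np.linalg.svd(X, full_matrices=False)
W = U[:, :k] * s[:k]    # (pixels, k): one raveled image per component
H = Vt[:k]              # (k, time): one time course per component

# Unravel each component back into an image.
components = W.T.reshape(k, height, width)
print(X.shape, components.shape)
```

Row `p` of `X` corresponds to pixel `(p // width, p % width)`, so the reshape back to images is exact.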

@valentina-s

I think having an option to test with a big dataset on a cluster will be great.

I started the brute force conversion of the multiplicative method for NMF here:

https://github.com/valentina-s/daskNMF/blob/master/ExploreNMFmu.ipynb

and I am also planning to convert the coordinate descent solver and test it on my laptop with the faces dataset (which is not really the best testing setup, but it is something to start with).
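
For readers following along, here is one possible way to write multiplicative updates on a Dask array. It is a sketch under stated assumptions and not the code in the notebook linked above: `X` is assumed to be chunked along rows only, the small factor `H` is kept as an in-memory NumPy array, and `W` is persisted every iteration so the task graph does not grow with the iteration count. The name `nmf_mu_dask` is hypothetical.

```python
import numpy as np
import dask
import dask.array as da

def nmf_mu_dask(X, k, n_iter=20, eps=1e-9, seed=0):
    """Multiplicative-update NMF for a row-chunked nonnegative dask array X (n x m).

    Hypothetical helper. W (n x k) is a dask array chunked like X's rows;
    H (k x m) is a NumPy array, assumed small enough to fit in memory.
    """
    n, m = X.shape
    rng = np.random.default_rng(seed)
    W = da.from_array(rng.random((n, k)), chunks=(X.chunks[0], (k,)))
    H = rng.random((k, m))
    for _ in range(n_iter):
        # Both products reduce over the long row axis and give small results.
        WtX, WtW = dask.compute(W.T @ X, W.T @ W)
        H *= WtX / (WtW @ H + eps)
        Ht = da.from_array(H.T, chunks=H.T.shape)    # (m, k), single chunk
        HHt = da.from_array(H @ H.T, chunks=(k, k))  # (k, k), single chunk
        # persist() materializes W's chunks, so the next iteration starts
        # from a shallow graph instead of stacking on all previous updates.
        W = (W * (X @ Ht) / (W @ HHt + eps)).persist()
    return W, H

# Small synthetic example (made-up data) with four row chunks.
rng = np.random.default_rng(1)
X_np = rng.random((200, 3)) @ rng.random((3, 30))
X = da.from_array(X_np, chunks=(50, 30))
W, H = nmf_mu_dask(X, k=3)
W_np = W.compute()
rel_err = np.linalg.norm(X_np - W_np @ H) / np.linalg.norm(X_np)
print(f"relative reconstruction error: {rel_err:.3f}")
```

Keeping `H` in memory only makes sense when the number of columns is moderate; if both dimensions are large, `H` would need to be a chunked array as well.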

I think there are two scenarios for images:

  • chunk the images if the images are huge and the sample size is small
  • chunk samples, if sample size is huge, most probably when images come from videos with high sample rate (in that case the random subsampling techniques will help).

I considered the first case, but I think after reading some of the references I will have a better idea of what makes sense.
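
The two scenarios listed above could look roughly like this with Dask Arrays. The shapes and chunk sizes are made-up placeholders, and `da.zeros` is used only because it is lazy and allocates nothing until computed.

```python
import numpy as np
import dask.array as da

# Scenario 1: few samples, huge images. Keep all samples in each chunk
# and split along the spatial axes.
few_big = da.zeros((20, 8192, 8192), chunks=(20, 1024, 1024), dtype="float32")

# Scenario 2: many samples, small images (e.g. video frames). Keep whole
# images in each chunk and split along the sample axis.
many_small = da.zeros((100_000, 64, 64), chunks=(5_000, 64, 64), dtype="float32")

print(few_big.numblocks)     # (1, 8, 8)
print(many_small.numblocks)  # (20, 1, 1)

# Random subsampling along the sample axis for the second scenario.
rng = np.random.default_rng(0)
idx = np.sort(rng.choice(many_small.shape[0], size=1_000, replace=False))
subsample = many_small[idx]
print(subsample.shape)       # (1000, 64, 64)
```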

@jakirkham
Contributor Author

FWIW have built a copy of modl for conda in my channel. Only macOS and Linux (working on getting access to a Windows machine as well). The versioning is a little weird; sorry about that. Though conda does seem ok installing that anyways. Also ran the test suite as part of the build. So it should work. Should add this uses nearly everything from conda-forge with the exception of compiler runtime libraries (e.g. libgcc on Linux), which came from defaults.

ref: https://anaconda.org/jakirkham/modl
