
NBLAST score matrix calculation #28

Status: Closed (wanted to merge 22 commits)

Conversation

@clbarnes (Collaborator) commented Apr 9, 2021

I feel that the original smat.csv can only be stretched so far. Different score matrices would probably be valuable for different brains and different tasks (lateral pairs, lineage classification, segmental repetition). This implementation is based on my equally-untested experimental Rust implementation, which in turn is based on the description of the algorithm in the paper (rather than on the R implementation).

@clbarnes force-pushed the smat branch 2 times, most recently from 3a1505a to 1ea9295 on April 9, 2021 17:42
@clbarnes (Collaborator, Author) commented Apr 9, 2021

How the dist/dot bins are specified needs work, as it's hard to intuit with the infinity handling. I'm now leaning towards explicitly asking only for the interior boundaries and adding -inf and inf on either side (skipping the padding where the outer boundaries are already infinite).

EDIT: now implemented this.
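The interior-boundaries approach can be sketched as follows (a minimal illustration; `pad_boundaries` is a hypothetical helper name, not the PR's actual API):

```python
import numpy as np

def pad_boundaries(interior):
    """Pad interior bin boundaries with -inf/+inf so every value falls
    into some bin. Skips the padding if the ends are already infinite."""
    interior = np.asarray(interior, dtype=float)
    parts = []
    if not np.isneginf(interior[0]):
        parts.append([-np.inf])
    parts.append(interior)
    if not np.isposinf(interior[-1]):
        parts.append([np.inf])
    return np.concatenate(parts)

# 3 interior fence posts -> 4 bins covering the whole real line
bounds = pad_boundaries([0.0, 0.5, 1.0])
```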

@schlegelp (Collaborator):

That's a great idea! I always wondered whether the score matrices based on FlyCircuit (light level, many brains) might be too forgiving when applied to EM data (within-brain comparison, nm resolution, probably better transforms).

Would love to hear @jefferis' thoughts on this. For example: we have a large collection of curated neurons from the hemibrain which we could use to generate an EM-based scoring matrix.

@jefferis:

The intention was always to provide different score matrices for different circumstances, just like BLAST. Specifically, you are supposed to use different BLAST substitution matrices depending on how distantly related you think your proteins may be. However, also just like BLAST, the default does sufficiently well that there has not been much impetus to make new matrices. Unlike BLAST, there is a second factor besides relatedness to consider, which is spatial scale. This is potentially quite important for the larval vs adult difference and would be a good reason to try regenerating a matrix there. In the past I've told people working on brains other than the adult fly to try scaling their neurons up or down by a factor of e.g. 2x or 0.5x for larva or zebrafish, respectively.
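The rescaling trick above is just a uniform scaling of coordinates before NBLAST; a toy sketch with dummy data (the scale factors are the illustrative ones from the comment, not calibrated values):

```python
import numpy as np

# Dummy point cloud standing in for a dotprops' coordinates, in microns.
points = np.random.rand(100, 3) * 50

# Scale towards adult-fly spatial scale before using the default matrix:
larva_points = points * 2.0       # e.g. scale larval neurons up
zebrafish_points = points * 0.5   # e.g. scale zebrafish neurons down
```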

In terms of EM vs LM, I don't expect much of an issue. Our registration of the FlyCircuit dataset was very high quality; the registration across two EM datasets, or between the left and right sides of a single dataset, will not be substantially better. This is further emphasised by the fact that there was an uncorrected Z-compression artefact in the FlyCircuit data, meaning that stuff is a little closer there than in real life.

What will make a significant difference is clustering neurons in the same EM hemisphere. If you want to make an adult scoring matrix for that purpose, then the distances will likely be appreciably smaller due to cofasciculation. However even there I have some concerns that you might end up rewarding the very tight fasciculation of long tracts too much if you do not train based on very similar cell type differences (which will be in the axons and dendritic arbours).

Final point, the original matrices were only constructed with olfactory uPNs. There is no doubt that some greater diversity of cell types would be good.

@jefferis:

@dokato FYI

@jefferis:

@clbarnes in terms of the dotprops upper and lower bounds, I guess extending them to +/- Inf is no problem. In R I clamped the range to (0, 1) when calculating them for each comparison, using the zapsmall function. I guess there is a very small computational cost to doing this which could be avoided by the range extension.
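A rough Python analogue of that R clamping step (zapsmall away the noise, then restrict to the valid range) might look like this; with bins extended to +/- inf it becomes unnecessary:

```python
import numpy as np

# Dot products of unit tangent vectors should lie in [0, 1], but floating
# point noise can push them slightly outside.
dots = np.array([-1e-17, 0.3, 1.0000000001])

# np.round stands in for R's zapsmall; np.clip enforces the range.
clamped = np.clip(np.round(dots, 7), 0.0, 1.0)
# equal to [0.0, 0.3, 1.0]
```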

@clbarnes (Collaborator, Author) commented Apr 10, 2021

Thanks all! Indeed, I've seen some scaling of the score matrix for different sizes of animal, and as you say it would react to cell types differently. I imagine it could also be impacted by how you pre-process the neurons.

Still have yet to assemble some training data for testing this, will let you know when I have.

The inf boundaries wasn't so much a performance concern as it was helping my mental model of fences and fence posts.

@schlegelp (Collaborator):

Just a side note, but I wonder if it would be possible to implement ScoreMatrixBuilder in a way that can produce scoring matrices of arbitrary dimension? For example, we might want to bring in a third metric in addition to distance and vector product.

@jefferis commented Apr 13, 2021

For example, we might want to bring in a third metric in addition to distance and vector product.

This is definitely of interest, but I have somewhat resisted it in the past because you quickly end up with limited data feeding into some cells (it's possible that one might need to fit a smooth surface and use that, etc.).

@clbarnes (Collaborator, Author) commented Apr 13, 2021

Shouldn't be too hard! It'll probably be most flexible to abstract it away from the dist/dot/alpha terminology (just take a Callable[[Dotprops, Dotprops], List[np.ndarray[int]]] for generating axis indices), then a subclass could reintroduce the concrete case.

It might be nice to have the same structure reflected on the ScoreFunction side: the Nblaster generic over Callable[[Dotprops, Dotprops], float] as a score function, have ND lookup tables implement __call__ like the ScoreFunction does currently, and separate the lookup/ surface fit/ passthrough cases rather than handling them all in one object. This would move responsibility for iterating over the point matches into the ScoreFunction, and might reduce some of the shortcuts available to the Nblaster. Any other way just relies on the score function being equally applicable to both scalars and np.ndarrays.

A ScoreLookup2D would be a special case which could have from_csv and from_pandas class methods, so that its use is explicit rather than being overly flexible in the constructor which would make subclassing harder.
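The ND lookup idea above might look roughly like this (an illustrative `LookupNd`, not the PR's actual implementation; each boundary array is assumed to carry +/- inf at its ends):

```python
import numpy as np

class LookupNd:
    """Minimal sketch of an N-dimensional score lookup: one boundary
    array per axis, one value per cell of the resulting grid."""

    def __init__(self, boundaries, values):
        # boundaries: list of 1D arrays, each with +/-inf at the ends
        self.boundaries = [np.asarray(b) for b in boundaries]
        self.values = np.asarray(values)

    def __call__(self, *coords):
        # Digitize each coordinate array against its axis's interior
        # fence posts; the infinite outer bounds make every value land
        # in a valid bin.
        idxs = tuple(
            np.digitize(c, b[1:-1]) for c, b in zip(coords, self.boundaries)
        )
        return self.values[idxs]

# 2D case: distances on axis 0, dot products on axis 1.
dist_bounds = np.array([-np.inf, 1.0, 5.0, np.inf])  # 3 bins
dot_bounds = np.array([-np.inf, 0.5, np.inf])        # 2 bins
values = np.arange(6).reshape(3, 2)
lookup = LookupNd([dist_bounds, dot_bounds], values)
scores = lookup(np.array([0.5, 10.0]), np.array([0.9, 0.1]))
# scores -> [1, 4]
```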

@clbarnes (Collaborator, Author) commented Apr 13, 2021

The ND case is covered, along with a distdot-specific subclass for the common case; likewise an ND lookup class, and a 2D lookup class which inherits from it and provides pandas IO, plus some conveniences. Still yet to check whether it actually works...

@clbarnes (Collaborator, Author) commented Apr 14, 2021

47 unit tests added (OK, I cheated, they're parameterised), and the score matrix produced by a synthetic example (matching the neurons already in the repo with their jittered selves) looks like the right shape.

(and rebased on master)

@clbarnes marked this pull request as ready for review April 14, 2021 13:37
@clbarnes force-pushed the smat branch 2 times, most recently from c776660 to 4db1fde on April 14, 2021 15:57
@schlegelp (Collaborator):

Another one for the wish list: it would be super useful to have some metric(s) to quantify how well a given scoring function/matrix is doing given the list of matching and non-matching neurons.

(attached image: nbl_matrix)

Not sure if this would necessarily be part of LookupNd but some way to compare/evaluate scoring matrices and functions would be useful.

@schlegelp (Collaborator):

I also wonder if the ScoringFunction class should be moved from navis.nbl.nblast_funcs to navis.nbl.smat, to have everything scoring-related in one place. Similarly, ScoringFunction would need to be modified to allow N-dimensional look-up.

@clbarnes I'm happy to merge unless you want to first play around with the above.

@clbarnes (Collaborator, Author) commented Apr 15, 2021

Another one for the wish list: it would be super useful to have some metric(s) to quantify how well a given scoring function/matrix is doing given the list of matching and non-matching neurons.

Would this be a measure of the score function, or might it fit better on the NBlaster wrapping it? I imagine it's not too clear how useful it'll be until you've done the summing and normalisation. I guess you might want to just get the raw distdot scores out for some given pairs of neurons, so you could then do some separability tests of your own or show them on a violin plot or something (but the noise would be pretty large).

I also wonder if the ScoringFunction class should be moved from navis.nbl.nblast_funcs to navis.nbl.smat to have everything scoring-related in one place. Similarly ScoringFunction would need to be modified to allow N-dimensional look-up.

Yes, I haven't got round to looking at ScoreFunction yet. I wonder if it might be better to deprecate it entirely (with (Pending)DeprecationWarning) - if people want to use the passthrough case, they should just use operator.mul, and if they want a lookup table, there's Lookup2D, which has the same pandas interop. I could make ScoringFunction a wrapper around those two options for compatibility. I opted to just implement to/from dataframe on Lookup2D (rather than add file reading etc.) because pandas' IO is very flexible and it didn't seem worth re-implementing a subset of that when users could just pd.read/to_whatever themselves.
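The compatibility wrapper could be as thin as this (hypothetical sketch; `scoring_function` and its behaviour are illustrative, not navis's API):

```python
import operator
import warnings

def scoring_function(smat=None):
    """Deprecation shim: the passthrough case becomes operator.mul,
    anything else is assumed to already be a Lookup2d-like callable."""
    warnings.warn(
        "ScoringFunction is deprecated; use operator.mul or Lookup2d",
        PendingDeprecationWarning,
        stacklevel=2,
    )
    if smat is None:
        return operator.mul  # passthrough: score = dist * dot
    return smat
```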

@schlegelp (Collaborator) commented Apr 15, 2021

Would this be a measure of the score function or might it fit better on the NBlaster wrapping it?

To my mind this would be most useful if it was flexible. If it was a separate function that takes NBLAST scores plus a list of matching pairs and produces a couple of interesting metrics (e.g. min/mean/max within- and across-type scores), it could be plugged in at the end of LookupNdBuilder as well as after a proper normalized NBLAST.
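Such an evaluation helper could be a small free function along these lines (the name and the choice of summary statistics are illustrative only):

```python
import numpy as np

def score_separation(scores, is_match):
    """Summarise NBLAST scores for matching vs non-matching pairs."""
    scores = np.asarray(scores, dtype=float)
    is_match = np.asarray(is_match, dtype=bool)
    within, across = scores[is_match], scores[~is_match]
    return {
        "within": (within.min(), within.mean(), within.max()),
        "across": (across.min(), across.mean(), across.max()),
        # Bigger gap between group means -> better separation.
        "gap": within.mean() - across.mean(),
    }

stats = score_separation([0.9, 0.8, 0.2, 0.1], [True, True, False, False])
# stats["gap"] is ~0.7 for this well-separated toy example
```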

Yes, I haven't got round to looking at ScoreFunction yet. I wonder if it might be better to deprecate it entirely.

I think that's fair: I'd be surprised if anybody other than us used it in its current form. Most users won't interact with it anyway. As things are implemented at the moment, there isn't even a way of using e.g. navis.nblast with a custom scoring function. On that end, I had imagined that the nblast functions would get a new parameter scoring_func that looks something like this:

    scoring_func :  str | pandas.DataFrame | Lookup2d | callable, optional
                    A scoring function or matrix:
                      - strings will be passed to `pandas.read_csv` to produce a scoring matrix
                      - DataFrames are expected to be scoring matrices with rows and cols representing distances and dot products, respectively
                      - an instance of `navis.nbl.smats.Lookup2d`
                      - a callable that accepts two (N, ) arrays of distances of
                        nearest-neighbor vector pairs and their dot products and
                        returns a single (N, ) array of the final scores
                      - `None` will use the original scoring matrices from the 
                        NBLAST paper
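The dispatch that docstring describes might be sketched like so (hypothetical `parse_scoring_func`; the DataFrame and default branches are stubbed out since they depend on Lookup2d and the bundled matrices):

```python
import pandas as pd

def parse_scoring_func(scoring_func):
    """Normalise the various accepted inputs into a (dists, dots) -> scores
    callable. Illustrative sketch, not navis's actual implementation."""
    if scoring_func is None:
        raise NotImplementedError("would load the original NBLAST matrices")
    if isinstance(scoring_func, str):
        # Strings are treated as paths to a CSV scoring matrix.
        scoring_func = pd.read_csv(scoring_func, index_col=0)
    if isinstance(scoring_func, pd.DataFrame):
        # Rows = distance bins, cols = dot-product bins.
        raise NotImplementedError("would build a Lookup2d from the DataFrame")
    if callable(scoring_func):
        return scoring_func
    raise TypeError(f"Unexpected scoring function: {type(scoring_func)}")
```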

Also just to make a note: once this is stable enough, it'd be nice to add a tutorial on "making your own NBLAST scoring function" to the docs.

@clbarnes force-pushed the smat branch 2 times, most recently from c6b19c6 to 779616f on April 15, 2021 17:35
@clbarnes (Collaborator, Author):

Last commit is a bit speculative - re-adds helper methods on the LookupDistDotBuilder for determining the boundaries. This is trivial for the dots, but less so for the dists: settled on asking for another argument which sets the scaling, but I suspect it won't do very well in practice.

The right way to do this may be to wait until you have all the distdots and then partition it like a balanced tree.

You could just keep a (massively downsampled) actual spatial tree as a score function, of course (because what this algorithm really needs is more nearest neighbour queries...).
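The balanced-tree partition amounts to picking bin edges at quantiles of the observed distdots once training data is in hand, e.g.:

```python
import numpy as np

rng = np.random.default_rng(0)
dists = rng.exponential(scale=5.0, size=10_000)  # dummy training distances

n_bins = 8
# Interior fence posts at equal quantiles -> roughly equal counts per bin.
interior = np.quantile(dists, np.linspace(0, 1, n_bins + 1)[1:-1])
counts = np.bincount(np.digitize(dists, interior), minlength=n_bins)
# each of the 8 counts is ~1250
```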

@schlegelp (Collaborator):

I guess that this would be a case where one might run a couple iterations to find good bounds - for which you need a way to evaluate the results :)

@schlegelp (Collaborator):

Just a thought sparked by #38: I was wondering if it would make sense to nudge users to follow a convention by which matrices are always made for data in microns. Minimally this could be something in the docstring, or perhaps a warning if the dotprops are not in microns.
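A gentle nudge could be a warning keyed off whatever unit metadata is available; a hypothetical sketch (`check_microns` and the unit-string handling are illustrative, and navis's actual unit machinery may differ):

```python
import warnings

MICRON_NAMES = ("um", "µm", "micron", "microns", "micrometer")

def check_microns(units):
    """Warn if a dotprops' recorded unit string is not micron-like."""
    if units is None:
        warnings.warn("Units unknown; scoring matrices assume microns")
    elif str(units).lower() not in MICRON_NAMES:
        warnings.warn(f"Dotprops appear to be in {units}, not microns; "
                      "scores from a microns-trained matrix may be off")

check_microns("nm")  # warns
```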

@schlegelp (Collaborator):

@schlegelp , any thoughts on which smat.py members should be exported in all?

Given that they are very specialised functions/classes, I would have probably exposed Lookup2d, LookupNd, smat_fcwb and parse_score_fn at module level (i.e. navis.nbl.LookupNd, not navis.LookupNd) and left the builders in navis.nbl.smat for "power users".

@schlegelp (Collaborator):

Hi @clbarnes. I'm doing some work on the whole nblast module and I was wondering if you wanted to merge this PR, to avoid having to make a huge rebase further down the line?

@clbarnes (Collaborator, Author):

Sorry to have let this go stale - I think the rebase will be significant so I wouldn't wait on this before 1.0.

@clbarnes (Collaborator, Author):

Closing in favour of #69

@clbarnes closed this Nov 25, 2021