
NBLAST score matrix calculation #28

Status: Closed (wanted to merge 22 commits)

Conversation

@clbarnes (Collaborator) commented Apr 9, 2021

I feel that the original smat.csv can only be stretched so far. Different score matrices would probably be valuable for different brains and different tasks (lateral pairs, lineage classification, segmental repetition). This implementation is based on my equally-untested experimental Rust implementation, which in turn is based on the description of the algorithm in the paper (rather than on the R implementation).

@clbarnes force-pushed the smat branch 2 times, most recently from 3a1505a to 1ea9295 on April 9, 2021 17:42
@clbarnes (Collaborator, Author) commented Apr 9, 2021

How the dist/dot bins are specified needs work, as it's hard to intuit with the infinity handling. I'm now leaning towards explicitly asking only for the interior boundaries and adding -inf and inf on either side (skipping the padding where the outer boundaries are already infinite).

EDIT: now implemented this.
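The interior-boundaries approach can be sketched as follows (a minimal illustration; `pad_boundaries` is a hypothetical helper name, not the PR's actual API):

```python
import numpy as np

def pad_boundaries(interior):
    """Pad interior bin boundaries with -inf/+inf so every value falls
    into some bin. Skips the padding if the ends are already infinite."""
    interior = np.asarray(interior, dtype=float)
    parts = []
    if not np.isneginf(interior[0]):
        parts.append([-np.inf])
    parts.append(interior)
    if not np.isposinf(interior[-1]):
        parts.append([np.inf])
    return np.concatenate(parts)

# 3 interior fence posts -> 4 bins covering the whole real line
bounds = pad_boundaries([0.0, 0.5, 1.0])
```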

@schlegelp (Collaborator):

That's a great idea! I always wondered whether the score matrices based on FlyCircuit (light level, many brains) might be too forgiving when applied to EM data (within-brain comparison, nm resolution, probably better transforms).

Would love to hear @jefferis' thoughts on this. For example: we have a large collection of curated neurons from the hemibrain which we could use to generate an EM-based scoring matrix.

@jefferis:

The intention was always to provide different score matrices for different circumstances, just like BLAST. Specifically, you are supposed to use different BLAST substitution matrices depending on how distantly related you think your proteins may be. However, also just like BLAST, the default does sufficiently well that there has not been much impetus to make new matrices. Unlike BLAST, there is a second factor besides relatedness to consider, which is spatial scale. This is potentially quite important for the larval vs adult difference and would be a good reason to try regenerating a matrix there. In the past I've told people working on brains other than the adult fly to try scaling their neurons up or down by a factor of e.g. 2x or 0.5x for larva or zebrafish, respectively.
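The rescaling trick above is just a uniform scaling of coordinates before NBLAST; a toy sketch with dummy data (the scale factors are the illustrative ones from the comment, not calibrated values):

```python
import numpy as np

# Dummy point cloud standing in for a dotprops' coordinates, in microns.
points = np.random.rand(100, 3) * 50

# Scale towards adult-fly spatial scale before using the default matrix:
larva_points = points * 2.0       # e.g. scale larval neurons up
zebrafish_points = points * 0.5   # e.g. scale zebrafish neurons down
```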

In terms of EM vs LM, I don't expect much of an issue. Our registration of the FlyCircuit dataset was very high quality; the registration across two EM datasets, or between the left and right sides of a single dataset, will not be substantially better. This is further emphasised by the fact that there was an uncorrected Z-compression artefact in the FlyCircuit data, meaning that stuff is a little closer there than in real life.

What will make a significant difference is clustering neurons in the same EM hemisphere. If you want to make an adult scoring matrix for that purpose, then the distances will likely be appreciably smaller due to cofasciculation. However even there I have some concerns that you might end up rewarding the very tight fasciculation of long tracts too much if you do not train based on very similar cell type differences (which will be in the axons and dendritic arbours).

Final point, the original matrices were only constructed with olfactory uPNs. There is no doubt that some greater diversity of cell types would be good.

@jefferis:

@dokato FYI

@jefferis:

@clbarnes in terms of the dotprops upper and lower bounds, I guess extending them to +/- Inf is no problem. In R I clamped the range to (0, 1) when calculating them for each comparison, using the zapsmall function. I guess there is a very small computational cost to doing this which could be avoided by the range extension.
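A rough Python analogue of that R clamping step (zapsmall away the noise, then restrict to the valid range) might look like this; with bins extended to +/- inf it becomes unnecessary:

```python
import numpy as np

# Dot products of unit tangent vectors should lie in [0, 1], but floating
# point noise can push them slightly outside.
dots = np.array([-1e-17, 0.3, 1.0000000001])

# np.round stands in for R's zapsmall; np.clip enforces the range.
clamped = np.clip(np.round(dots, 7), 0.0, 1.0)
# equal to [0.0, 0.3, 1.0]
```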

@clbarnes (Collaborator, Author) commented Apr 10, 2021

Thanks all! Indeed, I've seen some scaling of the score matrix for different sizes of animal, and as you say it would react to cell types differently. I imagine it could also be impacted by how you pre-process the neurons.

Still have yet to assemble some training data for testing this, will let you know when I have.

The inf boundaries wasn't so much a performance concern as it was helping my mental model of fences and fence posts.

@schlegelp (Collaborator):

Just a side note, but I wonder if it would be possible to implement ScoreMatrixBuilder in a way that can produce scoring matrices of arbitrary dimension? For example, we might want to bring in a third metric in addition to distance and vector product.

@jefferis commented Apr 13, 2021

For example, we might want to bring in a third metric in addition to distance and vector product.

This is definitely of interest, but I have somewhat resisted it in the past because you quickly end up with limited data feeding into some cells (it's possible that one might need to fit a smooth surface and use that, etc.).

@clbarnes (Collaborator, Author) commented Apr 13, 2021

Shouldn't be too hard! It'll probably be most flexible to abstract it away from the dist/dot/alpha terminology (just take a Callable[[Dotprops, Dotprops], List[np.ndarray[int]]] for generating axis indices), then a subclass could reintroduce the concrete case.

It might be nice to have the same structure reflected on the ScoreFunction side: the Nblaster generic over Callable[[Dotprops, Dotprops], float] as a score function, have ND lookup tables implement __call__ like the ScoreFunction does currently, and separate the lookup/ surface fit/ passthrough cases rather than handling them all in one object. This would move responsibility for iterating over the point matches into the ScoreFunction, and might reduce some of the shortcuts available to the Nblaster. Any other way just relies on the score function being equally applicable to both scalars and np.ndarrays.

A ScoreLookup2D would be a special case which could have from_csv and from_pandas class methods, so that its use is explicit rather than being overly flexible in the constructor which would make subclassing harder.
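The ND lookup idea above might look roughly like this (an illustrative `LookupNd`, not the PR's actual implementation; each boundary array is assumed to carry +/- inf at its ends):

```python
import numpy as np

class LookupNd:
    """Minimal sketch of an N-dimensional score lookup: one boundary
    array per axis, one value per cell of the resulting grid."""

    def __init__(self, boundaries, values):
        # boundaries: list of 1D arrays, each with +/-inf at the ends
        self.boundaries = [np.asarray(b) for b in boundaries]
        self.values = np.asarray(values)

    def __call__(self, *coords):
        # Digitize each coordinate array against its axis's interior
        # fence posts; the infinite outer bounds make every value land
        # in a valid bin.
        idxs = tuple(
            np.digitize(c, b[1:-1]) for c, b in zip(coords, self.boundaries)
        )
        return self.values[idxs]

# 2D case: distances on axis 0, dot products on axis 1.
dist_bounds = np.array([-np.inf, 1.0, 5.0, np.inf])  # 3 bins
dot_bounds = np.array([-np.inf, 0.5, np.inf])        # 2 bins
values = np.arange(6).reshape(3, 2)
lookup = LookupNd([dist_bounds, dot_bounds], values)
scores = lookup(np.array([0.5, 10.0]), np.array([0.9, 0.1]))
# scores -> [1, 4]
```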

@clbarnes (Collaborator, Author) commented Apr 13, 2021

The ND case is covered, along with a distdot-specific subclass for the common case; likewise an ND lookup class, and a 2D lookup class which inherits from it and provides pandas IO, plus some conveniences. Still yet to check whether it actually works...

@clbarnes (Collaborator, Author) commented Apr 14, 2021

47 unit tests added (OK, I cheated, they're parameterised), and the score matrix produced by a synthetic example (matching the neurons already in the repo with their jittered selves) looks like the right shape.

(and rebased on master)

@clbarnes marked this pull request as ready for review April 14, 2021 13:37
@clbarnes force-pushed the smat branch 2 times, most recently from c776660 to 4db1fde on April 14, 2021 15:57
@schlegelp (Collaborator):

Another one for the wish list: it would be super useful to have some metric(s) to quantify how well a given scoring function/matrix is doing given the list of matching and non-matching neurons.

(attached image: nbl_matrix)

Not sure if this would necessarily be part of LookupNd but some way to compare/evaluate scoring matrices and functions would be useful.

@schlegelp (Collaborator):

I also wonder if the ScoringFunction class should be moved from navis.nbl.nblast_funcs to navis.nbl.smat, to have everything scoring-related in one place. Similarly, ScoringFunction would need to be modified to allow N-dimensional look-up.

@clbarnes I'm happy to merge unless you want to first play around with the above.

@clbarnes (Collaborator, Author) commented Apr 15, 2021

Another one for the wish list: it would be super useful to have some metric(s) to quantify how well a given scoring function/matrix is doing given the list of matching and non-matching neurons.

Would this be a measure of the score function, or might it fit better on the NBlaster wrapping it? I imagine it's not too clear how useful it'll be until you've done the summing and normalisation. I guess you might want to just get the raw distdot scores out for some given pairs of neurons, so you could then do some separability tests of your own or show them on a violin plot or something (but the noise would be pretty large).

I also wonder if the ScoringFunction class should be moved from navis.nbl.nblast_funcs to navis.nbl.smat to have everything scoring-related in one place. Similarly ScoringFunction would need to be modified to allow N-dimensional look-up.

Yes, I haven't got round to looking at ScoreFunction yet. I wonder if it might be better to deprecate it entirely (with (Pending)DeprecationWarning) - if people want to use the passthrough case, they should just use operator.mul, and if they want a lookup table, there's Lookup2D, which has the same pandas interop. I could make ScoringFunction a wrapper around those two options for compatibility. I opted to just implement to/from dataframe on Lookup2D (rather than add file reading etc.) because pandas' IO is very flexible and it didn't seem worth re-implementing a subset of that when users could just pd.read/to_whatever themselves.
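The compatibility wrapper could be as thin as this (hypothetical sketch; `scoring_function` and its behaviour are illustrative, not navis's API):

```python
import operator
import warnings

def scoring_function(smat=None):
    """Deprecation shim: the passthrough case becomes operator.mul,
    anything else is assumed to already be a Lookup2d-like callable."""
    warnings.warn(
        "ScoringFunction is deprecated; use operator.mul or Lookup2d",
        PendingDeprecationWarning,
        stacklevel=2,
    )
    if smat is None:
        return operator.mul  # passthrough: score = dist * dot
    return smat
```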

@schlegelp (Collaborator) commented Apr 15, 2021

Would this be a measure of the score function or might it fit better on the NBlaster wrapping it?

To my mind this would be most useful if it was flexible. If it was a separate function that takes NBLAST scores plus a list of matching pairs and produces a couple of interesting metrics (e.g. min/mean/max within- and across-type scores), it could be plugged in at the end of LookupNdBuilder as well as after a proper normalized NBLAST.
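Such an evaluation helper could be a small free function along these lines (the name and the choice of summary statistics are illustrative only):

```python
import numpy as np

def score_separation(scores, is_match):
    """Summarise NBLAST scores for matching vs non-matching pairs."""
    scores = np.asarray(scores, dtype=float)
    is_match = np.asarray(is_match, dtype=bool)
    within, across = scores[is_match], scores[~is_match]
    return {
        "within": (within.min(), within.mean(), within.max()),
        "across": (across.min(), across.mean(), across.max()),
        # Bigger gap between group means -> better separation.
        "gap": within.mean() - across.mean(),
    }

stats = score_separation([0.9, 0.8, 0.2, 0.1], [True, True, False, False])
# stats["gap"] is ~0.7 for this well-separated toy example
```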

Yes, I haven't got round to looking at ScoreFunction yet. I wonder if it might be better to deprecate it entirely.

I think that's fair: I'd be surprised if anybody other than us used it in its current form. Most users won't interact with it anyway. As things are implemented at the moment, there isn't even a way of using e.g. navis.nblast with a custom scoring function. On that end, I had imagined that the nblast functions would get a new parameter scoring_func that looks something like this:

    scoring_func :  str | pandas.DataFrame | Lookup2d | callable, optional
                    A scoring function or matrix:
                      - strings will be passed to `pandas.read_csv` to produce a scoring matrix
                      - DataFrames are expected to be scoring matrices with rows and cols representing distances and dot products, respectively
                      - an instance of `navis.nbl.smats.Lookup2d`
                      - a callable that accepts two (N, ) arrays of distances of
                        nearest-neighbor vector pairs and their dot products and
                        returns a single (N, ) array of the final scores
                      - `None` will use the original scoring matrices from the 
                        NBLAST paper
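The dispatch that docstring describes might be sketched like so (hypothetical `parse_scoring_func`; the DataFrame and default branches are stubbed out since they depend on Lookup2d and the bundled matrices):

```python
import pandas as pd

def parse_scoring_func(scoring_func):
    """Normalise the various accepted inputs into a (dists, dots) -> scores
    callable. Illustrative sketch, not navis's actual implementation."""
    if scoring_func is None:
        raise NotImplementedError("would load the original NBLAST matrices")
    if isinstance(scoring_func, str):
        # Strings are treated as paths to a CSV scoring matrix.
        scoring_func = pd.read_csv(scoring_func, index_col=0)
    if isinstance(scoring_func, pd.DataFrame):
        # Rows = distance bins, cols = dot-product bins.
        raise NotImplementedError("would build a Lookup2d from the DataFrame")
    if callable(scoring_func):
        return scoring_func
    raise TypeError(f"Unexpected scoring function: {type(scoring_func)}")
```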

Also just to make a note: once this is stable enough, it'd be nice to add a tutorial on "making your own NBLAST scoring function" to the docs.

@clbarnes force-pushed the smat branch 2 times, most recently from c6b19c6 to 779616f on April 15, 2021 17:35
@clbarnes (Collaborator, Author):

Last commit is a bit speculative - re-adds helper methods on the LookupDistDotBuilder for determining the boundaries. This is trivial for the dots, but less so for the dists: settled on asking for another argument which sets the scaling, but I suspect it won't do very well in practice.

The right way to do this may be to wait until you have all the distdots and then partition it like a balanced tree.

You could just keep a (massively downsampled) actual spatial tree as a score function, of course (because what this algorithm really needs is more nearest neighbour queries...).
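The balanced-tree partition amounts to picking bin edges at quantiles of the observed distdots once training data is in hand, e.g.:

```python
import numpy as np

rng = np.random.default_rng(0)
dists = rng.exponential(scale=5.0, size=10_000)  # dummy training distances

n_bins = 8
# Interior fence posts at equal quantiles -> roughly equal counts per bin.
interior = np.quantile(dists, np.linspace(0, 1, n_bins + 1)[1:-1])
counts = np.bincount(np.digitize(dists, interior), minlength=n_bins)
# each of the 8 counts is ~1250
```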

@schlegelp (Collaborator):

I guess that this would be a case where one might run a couple iterations to find good bounds - for which you need a way to evaluate the results :)

@schlegelp (Collaborator):

Just a thought sparked by #38: I was wondering if it would make sense to nudge users to follow a convention by which matrices are always made for data in microns. Minimally this could be something in the docstring, or perhaps a warning if the dotprops are not in microns.
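A gentle nudge could be a warning keyed off whatever unit metadata is available; a hypothetical sketch (`check_microns` and the unit-string handling are illustrative, and navis's actual unit machinery may differ):

```python
import warnings

MICRON_NAMES = ("um", "µm", "micron", "microns", "micrometer")

def check_microns(units):
    """Warn if a dotprops' recorded unit string is not micron-like."""
    if units is None:
        warnings.warn("Units unknown; scoring matrices assume microns")
    elif str(units).lower() not in MICRON_NAMES:
        warnings.warn(f"Dotprops appear to be in {units}, not microns; "
                      "scores from a microns-trained matrix may be off")

check_microns("nm")  # warns
```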

@schlegelp (Collaborator):

@schlegelp , any thoughts on which smat.py members should be exported in all?

Given that they are very specialised functions/classes, I would have probably exposed Lookup2d, LookupNd, smat_fcwb and parse_score_fn at module level (i.e. navis.nbl.LookupNd, not navis.LookupNd) and left the builders in navis.nbl.smat for "power users".

@schlegelp (Collaborator):

Hi @clbarnes. I'm doing some work on the whole nblast module and I was wondering if you wanted to merge this PR, to avoid having to make a huge rebase further down the line?

@clbarnes (Collaborator, Author):

Sorry to have let this go stale - I think the rebase will be significant so I wouldn't wait on this before 1.0.

@clbarnes (Collaborator, Author):

Closing in favour of #69

@clbarnes closed this Nov 25, 2021