NBLAST score matrix calculation #28
Force-pushed from 3a1505a to 1ea9295.
How the dist/dot bins are specified needs work, as it's hard to intuit with the infinity handling. I'm now leaning towards explicitly asking only for the interior boundaries and adding -inf and inf to either side; this could be skipped if the outermost boundaries are already infinite. EDIT: now implemented this.
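A minimal sketch of the boundary-padding idea described above, assuming plain NumPy; the bin values are illustrative and none of this is the PR's actual API:

```python
import numpy as np

def pad_boundaries(interior):
    """Return bin boundaries with -inf/+inf added, unless already present."""
    b = np.asarray(interior, dtype=float)
    if not np.isneginf(b[0]):
        b = np.concatenate([[-np.inf], b])
    if not np.isposinf(b[-1]):
        b = np.concatenate([b, [np.inf]])
    return b

bounds = pad_boundaries([0.75, 1.5, 2, 5, 10])  # hypothetical interior distance boundaries
values = np.array([0.1, 3.0, 42.0])
bins = np.digitize(values, bounds) - 1          # 0-based bin index, never out of range
```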
That's a great idea! I always wondered whether the score matrices based on fly circuit (light level, many brains) might be too forgiving when applied to EM data (within brain comparison, nm resolution, probably better transforms). Would love to hear @jefferis' thoughts on this. For example: we have a large collection of curated neurons from the hemibrain which we could use to generate an EM-based scoring matrix.
The intention was always to provide different score matrices for different circumstances, just like BLAST. Specifically, you are supposed to use different BLAST substitution matrices depending on how distantly related you think your proteins may be. However, also just like BLAST, the default does sufficiently well that there has not been much impetus to make new matrices.

Unlike BLAST, there is a second factor besides relatedness to consider, which is spatial scale. This is potentially quite important for the larval vs adult difference and would be a good reason to try regenerating a matrix there. In the past I've told people working on brains other than the adult fly to try scaling their neurons up or down by a factor of e.g. 2x or 0.5x for larva or zebrafish, respectively.

In terms of EM vs LM, I don't expect much of an issue. Our registration of the flycircuit dataset was very high quality; the registration across two EM datasets, or between the left and right of a single dataset, will not be substantially better. This is further emphasised by the fact that there was an uncorrected Z compression artefact in the flycircuit data, meaning that stuff is a little closer there than in real life.

What will make a significant difference is clustering neurons in the same EM hemisphere. If you want to make an adult scoring matrix for that purpose, then the distances will likely be appreciably smaller due to cofasciculation. However, even there I have some concerns that you might end up rewarding the very tight fasciculation of long tracts too much if you do not train based on very similar cell type differences (which will be in the axons and dendritic arbours).

Final point: the original matrices were only constructed with olfactory uPNs. There is no doubt that some greater diversity of cell types would be good.
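For illustration, a minimal sketch of the rescaling workaround mentioned above; the coordinates and the exact factor are stand-ins, not anything from navis:

```python
import numpy as np

# Stand-in for raw neuron point coordinates (um) from a non-adult-fly brain
points = np.random.default_rng(0).uniform(0, 100, (1000, 3))

# Scale before building dotprops so inter-neuron distances fall in the range
# the adult-fly score matrix was trained on (e.g. ~2x for larva, ~0.5x for zebrafish)
scaled_points = points * 2.0
```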
@dokato FYI
@clbarnes in terms of the dotprops upper and lower bounds, I guess extending them to +/- Inf is no problem. In R I clamped the range when calculating them for each comparison to (0,1) using the zapsmall function. I guess there is a very small computational cost to doing this which could be avoided by range extension.
Thanks all! Indeed, I've seen some scaling of the score matrix for different sizes of animal, and as you say it would react to cell types differently. I imagine it could also be impacted by how you pre-process the neurons. I still have yet to assemble some training data for testing this; will let you know when I have. The inf boundaries weren't so much a performance concern as a way of helping my mental model of fences and fence posts.
Just a side note, but I wonder if it would be possible to implement …
This is definitely of interest but I have somewhat resisted it in the past because you quickly end up with limited data feeding in some cells (it's possible that one might need to fit a smooth surface and use that etc).
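One hedged way to realise the "fit a smooth surface" idea, assuming SciPy's RectBivariateSpline; the bin centres and scores below are illustrative stand-ins:

```python
import numpy as np
from scipy.interpolate import RectBivariateSpline

dist_centres = np.linspace(0.5, 20, 10)      # hypothetical distance bin centres (um)
dot_centres = np.linspace(0.05, 0.95, 10)    # hypothetical |dot product| bin centres
raw_scores = np.random.randn(10, 10)         # stand-in for a noisy raw score matrix

# s > 0 lets the spline smooth rather than interpolate exactly, so sparse cells
# borrow strength from their neighbours
smooth = RectBivariateSpline(dist_centres, dot_centres, raw_scores, s=5.0)
smoothed_scores = smooth(dist_centres, dot_centres)  # evaluate back on the grid
```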
Shouldn't be too hard! It'll probably be most flexible to abstract it away from the dist/dot/alpha terminology (just take a …). It might be nice to have the same structure reflected on the … A …
The ND case is covered: there's an ND lookup class, a 2D lookup class which inherits from it and provides pandas IO, a distdot-specific subclass for the common case, plus some convenience cases. Still yet to check whether it actually works...
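A rough sketch of the kind of structure described above; the class and method names are illustrative rather than necessarily those used in the PR:

```python
import numpy as np
import pandas as pd

class LookupNd:
    def __init__(self, boundaries, cells):
        # boundaries: one 1D array of bin edges per axis; cells: ND array of scores
        self.boundaries = [np.asarray(b, dtype=float) for b in boundaries]
        self.cells = np.asarray(cells)

    def __call__(self, *args):
        # Digitize each coordinate into its axis' bins, then index the cell array
        idxs = tuple(
            np.clip(np.digitize(a, b) - 1, 0, len(b) - 2)
            for a, b in zip(args, self.boundaries)
        )
        return self.cells[idxs]

class Lookup2d(LookupNd):
    def to_dataframe(self):
        # Encode bin edges in the index/columns so the table can round-trip via pandas
        rows = [f"[{lo},{hi})" for lo, hi in zip(self.boundaries[0], self.boundaries[0][1:])]
        cols = [f"[{lo},{hi})" for lo, hi in zip(self.boundaries[1], self.boundaries[1][1:])]
        return pd.DataFrame(self.cells, index=rows, columns=cols)

# Usage: score = Lookup2d([dist_bounds, dot_bounds], cells)(dists, dots)
```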
47 unit tests added (OK, I cheated, they're parameterised), and the score matrix produced by a synthetic example (matching the neurons already in the repo with their jittered selves) looks like the right shape. (and rebased on master) |
Force-pushed from c776660 to 4db1fde.
Another one for the wish list: it would be super useful to have some metric(s) to quantify how well a given scoring function/matrix is doing, given the list of matching and non-matching neurons. Not sure if this would necessarily be part of …
I also wonder if the …

@clbarnes I'm happy to merge unless you want to first play around with the above.
Would this be a measure of the score function or might it fit better on the NBlaster wrapping it? As I imagine it's not too clear how useful it'll be until you've done the summing and normalisation. I guess you might want to just get the raw distdot scores out for some given pairs of neurons so you could then do some separability tests of your own or show them on a violin plot or something (but the noise would be pretty large).
Yes, I haven't got round to looking at ScoreFunction yet. I wonder if it might be better to deprecate it entirely (with …
To my mind this would be most useful if it was flexible. If it was a separate function that takes NBLAST scores + a list of matching pairs and produces a couple of interesting metrics (e.g. min/mean/max within and across type scores), it could be plugged in at the end of …
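A minimal sketch of such a separate evaluation function, under the assumption that scores arrive as a square pandas DataFrame of all-by-all NBLAST scores; the function name and the choice of metrics are hypothetical:

```python
import numpy as np
import pandas as pd

def match_metrics(scores: pd.DataFrame, matching_pairs):
    """scores: square DataFrame of NBLAST scores; matching_pairs: iterable of (query, target) IDs."""
    match_mask = pd.DataFrame(False, index=scores.index, columns=scores.columns)
    for q, t in matching_pairs:
        match_mask.loc[q, t] = True
        match_mask.loc[t, q] = True

    self_mask = np.eye(len(scores), dtype=bool)           # exclude self-vs-self scores
    within = scores.values[match_mask.values]
    across = scores.values[~match_mask.values & ~self_mask]
    return {
        "within_min_mean_max": (within.min(), within.mean(), within.max()),
        "across_min_mean_max": (across.min(), across.mean(), across.max()),
    }
```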
I think that's fair: I'd be surprised if anybody other than us used it in its current form. Most users won't interact with it anyway. As things are implemented at the moment, there isn't even a way of using e.g. …
Also just to make a note: once this is stable enough, it'd be nice to add a tutorial on "making your own NBLAST scoring function" to the docs.
Force-pushed from c6b19c6 to 779616f.
Last commit is a bit speculative - re-adds helper methods on the … The right way to do this may be to wait until you have all the distdots and then partition it like a balanced tree. You could just keep a (massively downsampled) actual spatial tree as a score function, of course (because what this algorithm really needs is more nearest neighbour queries...).
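A hedged sketch of the "partition it like a balanced tree" idea: pool the distances (or dot products) from the training pairs, then take quantiles as the interior boundaries so each bin receives roughly the same number of observations. The values and names are illustrative:

```python
import numpy as np

dists = np.random.exponential(2.0, 10_000)  # stand-in for pooled nearest-neighbour distances
n_bins = 10

# Interior boundaries at evenly spaced quantiles, then pad with +/- inf as discussed above
interior = np.quantile(dists, np.linspace(0, 1, n_bins + 1)[1:-1])
boundaries = np.concatenate([[-np.inf], interior, [np.inf]])

bin_idx = np.digitize(dists, boundaries) - 1
counts = np.bincount(bin_idx, minlength=n_bins)  # roughly equal counts per bin
```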
I guess that this would be a case where one might run a couple of iterations to find good bounds - for which you need a way to evaluate the results :)
Just a thought sparked by #38: I was wondering if it would make sense to nudge users to follow a convention by which matrices are always made for data in microns. Minimally this could be something in the docstring, or perhaps a warning if the dotprops are not in microns.
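A sketch of the gentlest version of this nudge; the `units` attribute and its string format are assumptions for illustration, not confirmed navis API:

```python
import warnings

def check_microns(dotprops):
    """Warn if any dotprops do not appear to use micron units (assumed `units` attribute)."""
    for dp in dotprops:
        units = str(getattr(dp, "units", "")).lower()
        if units and not any(u in units for u in ("um", "micron", "µm")):
            warnings.warn(
                f"Dotprops units look like '{units}', not microns; "
                "score matrices are conventionally built for micron-scale data."
            )
```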
Given that they are very specialised functions/classes, I would have probably exposed …
Also include lookup classes.
- Use composition instead of inheritance to allow boundaries to be defined later
- Add (questionable) helper methods for determining boundaries
This reverts commit 779616f.
The tests are not run in parallel with each other so there shouldn't be any OS-level interference. I also tested without session-scoped test fixtures. The deadlock occurs on the first parallelised test which runs directly after a non-parallelised test - by re-ordering the tests, the deadlock is avoided.
Hi @clbarnes. I'm doing some work on the whole nblast module and I was wondering if you wanted to merge this PR, to avoid having to make a huge rebase further down the line?
Sorry to have let this go stale - I think the rebase will be significant so I wouldn't wait on this before 1.0.
Closing in favour of #69
I feel that the original smat.csv can only be stretched so far. Different score matrices would probably be valuable for different brains and different tasks (lateral pairs, lineage classification, segmental repetition). This implementation is based on my equally-untested experimental Rust implementation, which is based on the description of the algorithm in the paper (rather than the R implementation).