Adding robustica option to ICA decomposition to achieve consistent results #1013
Conversation
I'm really excited to see this @BahmanTahayori. I'm going to start looking at it now. Also, some of your style issues are due to us recently allowing Python 3.12, which has a few more stringencies. Addressing the following might get the style check passing:
@BahmanTahayori I've gone through the entire PR and made some comments. I generally like your approach, but most of my feedback is focused on making sure I understand what's happening and the messages to end users are clear.
I already noticed an issue with the 3-echo integration test where robust ICA outputs fewer than the specified number of components.
The one big thing I'd still want to do before approving is to run this on a bunch of my own datasets to make sure it's stable and giving consistently plausible results.
@BahmanTahayori and I talked yesterday. In addition to going through the above review comments, I wanted to highlight one other part of our discussion to other developers (@tsalo @eurunuela @dowdlelt @n-reddy @javiergcas @goodalse2019 @NaomiGaggi @marco7877 @CesarCaballeroGaudes).

Robust ICA seems to always output fewer components than initially requested. This is reasonable: if you have 100 time points and request 90 components, it should not be able to find 90 stable components. This is both a problem and a possible solution for our current issues with PCA dimensionality estimation. On the problem side, even if we get a perfect dimensionality estimate to feed into RobustICA, RobustICA will still output fewer ICA components. On the solution side, I don't really care how many components are estimated in the PCA step; I care that our ICA components reliably capture as much of the total structured variance of the data as possible. This means we can potentially give RobustICA an initial number of components that is deliberately too large.

I've been testing this out on a few datasets and it's promising, but not yet perfect. For example, I tested it on a dataset with 160 time points and RobustICA was set to run FastICA for 30 iterations. Ideally, the calculated number of stable components should be the same whether we initialize to 65 or 55. That's not happening, but it is reasonably stable. One other noteworthy thing is that I'm running this with a block-design flashing-checkerboard task, which almost always produces a high-variance component with weights in primary visual cortex and a time series that follows the block-design task. I am NOT seeing that when I initialized with 94 components, and those final results were qualitatively noisier. This might be a sign that if one overestimates the initial number of components too much, the method breaks down.

What might be happening is that the algorithm tries to successfully run FastICA 30 times; when a run fails to converge, it tries again. For 94 components, it failed 20 times (running FastICA a total of 50 times). For 70 components, it failed to converge only 8 times. In addition to being computationally expensive, this means trying to get ICA to converge with far too many components produces stranger results when it does converge. I'm going to keep playing with this and encourage others to do so as well. I'm also playing with more iterations of FastICA (the new n_robust_runs option).

For reference, the manuscript describing this robust ICA approach is https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-05043-9
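For anyone who wants to experiment with this locally, here is a rough sketch of the kind of call being discussed, using the robustica package directly rather than through tedana. The parameter names (n_components, robust_runs, robust_method) follow the robustica documentation as I recall it and should be treated as assumptions; the data array and its shape are made up for illustration.

```python
import numpy as np
from robustica import RobustICA  # https://github.com/CRG-CNAG/robustica

# Stand-in for optimally combined data: (voxels x time points), 160 time points.
data = np.random.rand(10000, 160)

# Deliberately request more components than we expect to be stable
# (e.g. ~70 for 160 time points) and let the clustering step decide
# how many of them are reproducible across FastICA runs.
rica = RobustICA(
    n_components=70,          # intentionally generous initial guess
    robust_runs=30,           # number of FastICA runs to cluster (assumed name)
    robust_method="DBSCAN",   # clustering algorithm (assumed name)
)
S, A = rica.fit_transform(data)  # S: sources, A: mixing matrix

# The number of stable components actually returned is typically
# smaller than n_components.
print(f"Requested 70 components, got {S.shape[1]} stable components")
```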
Co-authored-by: Dan Handwerker <7406227+handwerkerd@users.noreply.github.com>
Co-authored-by: Dan Handwerker <7406227+handwerkerd@users.noreply.github.com>
* RobustICA: Restructure code loop over robust methods
* Addressing the issue with try/except
Co-authored-by: Bahman <tahayori@gmail.com>
Just wanted to jump in and officially register my excitement for this - I've checked this out and am currently running it on some troublesome data from Neurostars. I think it will be enlightening. I am also seeing a lot of failures to converge (~550 timepoints, 180 requested, ~65% variance explained), and the phrase "exponential back-off" crossed my mind. It's not quite applicable here, but the central idea is that if something fails, you don't just keep trying the same thing (back-off = more waiting). Here, at the risk of adding another layer of complexity (ugh), perhaps an option should be that too many failures to converge get logged and the entire loop is restarted with N fewer components? Not exponential, but maybe based on the number of timepoints: if someone wild shows up with 1500 timepoints and starts with 500, it might not be nice to step through 499, 498, etc. But if they have 200 and start with 66, it might be smart to do 65, 64, etc. (a rough sketch of this idea is below). In any case, I'm looking forward to seeing where this goes.
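A minimal sketch of that restart idea, purely hypothetical: the helper `run_robust_ica`, the failure threshold, and the step-size rule are all invented for illustration, and the step is scaled to the number of timepoints rather than decrementing by one.

```python
def run_robust_ica_with_backoff(data, n_timepoints, n_components, max_failures=10):
    """Hypothetical wrapper: if too many FastICA runs fail to converge,
    log it and restart the whole robust-ICA loop with fewer components.

    `run_robust_ica` is a stand-in for whatever drives the repeated FastICA
    runs; it is assumed to return (sources, n_convergence_failures).
    """
    # Step size scales with the data: roughly 1% of timepoints, at least 1.
    step = max(1, n_timepoints // 100)

    while n_components > 1:
        sources, n_failures = run_robust_ica(data, n_components)
        if n_failures <= max_failures:
            return sources
        print(
            f"{n_failures} convergence failures with {n_components} components; "
            f"retrying with {n_components - step}"
        )
        n_components -= step

    raise RuntimeError("Could not find a component count that converges reliably")
```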
I just opened #1125 which replaces this PR. I'll plan on closing this PR after @BahmanTahayori and @Lestropie see the other PR and confirm it contains the key changes from this PR.
Aligning to meica:main
With the new robustica version and the updates from Bahman, this is really close to merging. I made a couple of minor comments and noticed one additional thing to add. @tsalo, could you respond to my dependabot question today so that Bahman can have your response during his Wednesday?
One additional edit here: when tedana is run, we keep track of the version numbers of key dependencies so that users can see which versions were used for each execution. We should add robustica (and maybe tqdm) to those dependencies. I can open another PR onto this PR with these changes, but I figure this is just easier to describe here. The changes are in tedana/utils.py.

In the imports (https://github.com/BahmanTahayori/tedana_florey/blob/1b1eb38f03f664254d5e8aa0dfa144a051442a7e/tedana/utils.py#L21), add `from robustica import __version__ as robustica_version` (new line 21) and `from tqdm import __version__ as tqdm_version` (new line 28). The imports need to be in alphabetical order.

In `get_system_version_info` (https://github.com/BahmanTahayori/tedana_florey/blob/1b1eb38f03f664254d5e8aa0dfa144a051442a7e/tedana/utils.py#L618), add `"robustica": robustica_version,` (new line 620) and `"tqdm": tqdm_version,` (new line 624). A sketch of the combined result is below.
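For concreteness, here is roughly what those two spots in tedana/utils.py would look like after the edits. The surrounding imports and dictionary entries are abbreviated, and the local variable name `versions` is only illustrative; the exact neighbours in the real file may differ.

```python
# In the import block of tedana/utils.py (kept in alphabetical order):
from robustica import __version__ as robustica_version
from tqdm import __version__ as tqdm_version

# Inside get_system_version_info(), in the dictionary of dependency versions:
versions = {
    # ... existing entries such as numpy, scipy, sklearn ...
    "robustica": robustica_version,
    "tqdm": tqdm_version,
}
```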
Bahman, you also made a comment about wanting to add a figure to better depict robustica's output. I'm fine if you want to add that to this PR, but, if you think this is otherwise ready to merge, you can also open another PR just for that. (I'd also like the iq variable in ica.py (mean index quality) stored in a json file in addition to the text log, but that's a slightly tricky edit and I think it will be clearer as a stand-alone PR.)

Looking forward to seeing this finally merged soon!
pyproject.toml (Outdated)
@@ -30,7 +30,7 @@ dependencies = [
    "pandas>=2.0,<=2.2.2",
    "pybtex",
    "pybtex-apa-style",
-   "robustica@git+https://github.com/BahmanTahayori/robustica",
+   "robustica>=0.1.4",
@tsalo I'm still a bit confused about the dependabot. How do we set this up so that it requires a minimum version of 0.1.4, but will also automatically open a pull request when new versions of robustica are released?
I think something like "robustica>=0.1.4,<=0.1.4" should work.
Thanks @handwerkerd and @tsalo for your comments. I applied all the required changes. Regarding the clustering results, the code is ready and I generate a figure to visualise the clustering in 2D (going from high dimension to low dimension) using TSNE (https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html). I can show the results with or without the noise cluster. I have attached an example for these two cases (the red dots are clustered as noise). In my local code, the result is saved in the "figure" directory but I have not added it to the HTML report. What are your thoughts on adding them to the HTML report? Which figure should be added?

I think that now that this PR is ready, it is better to finalise it and then we will add these results and the iq variable in a separate PR.
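A rough sketch of how such a figure could be produced with scikit-learn's TSNE, assuming the per-run ICA components are stacked into a 2-D array and cluster labels are available (with -1 marking noise, as DBSCAN does). The variable names and stand-in arrays are illustrative, not the actual PR code.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.manifold import TSNE

# components: (n_runs * n_components, n_voxels) array of all FastICA
# components across robust runs; labels: cluster index per component,
# with -1 marking components that were not assigned to any cluster (noise).
components = np.random.rand(300, 5000)           # illustrative stand-in
labels = np.random.randint(-1, 10, size=300)     # illustrative stand-in

# Embed the high-dimensional components into 2D for visualization.
embedding = TSNE(n_components=2, random_state=42).fit_transform(components)

noise = labels == -1
plt.scatter(embedding[~noise, 0], embedding[~noise, 1], c="black", s=10,
            label="clustered components")
plt.scatter(embedding[noise, 0], embedding[noise, 1], c="red", s=10,
            label="noise (unclustered)")
plt.legend()
plt.savefig("figure/robustica_clusters.png")
```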
Let me see if I correctly understand these figures: each small circle is a component from one iteration of FastICA. The black circles are the components that were deemed part of stable clusters, and the red circles are the components that didn't end up in any cluster. The blue lines connect the components in a cluster. I'm assuming the final stable component is some type of centroid for each cluster? As an evaluation tool, the ideal would be for every cluster to be a small, tight circle, but we can see some are stretched into longer lines and some are a bit more spread out. In the plot with the red dots, in the top center, I see a grouping of 3 clusters with red in between, which would tell me that the algorithm picked 3 stable components from that grouping, but, with a few more/different iterations, the exact placement of those stable components in this space might have been slightly different.

Is this description correct?

I do think this is useful and could be added to the html report, but let's definitely do that as a separate PR.
Yes, your description is correct. I used a convex hull to draw the blue line around each cluster. The centroid is calculated in robustica (_compute_centroids). Your explanation about the 3 clusters at the top centre is excellent; with a few more or fewer iterations those three could be combined into one cluster.

I will do it in a separate PR.
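A minimal sketch of drawing a convex hull around each cluster in the 2-D embedding, using scipy.spatial.ConvexHull. This mirrors what the figure does, but the stand-in arrays and variable names are illustrative, not the actual PR code.

```python
import matplotlib.pyplot as plt
import numpy as np
from scipy.spatial import ConvexHull

embedding = np.random.rand(300, 2)               # stand-in for t-SNE output
labels = np.random.randint(-1, 10, size=300)     # stand-in cluster labels, -1 = noise

for cluster_id in np.unique(labels):
    if cluster_id == -1:
        continue  # skip the noise "cluster"
    points = embedding[labels == cluster_id]
    if len(points) < 3:
        continue  # a 2-D hull needs at least 3 points
    hull = ConvexHull(points)
    # hull.vertices are the indices of the hull points in order;
    # append the first vertex again to close the polygon.
    vertices = np.append(hull.vertices, hull.vertices[0])
    plt.plot(points[vertices, 0], points[vertices, 1], color="blue", linewidth=1)
```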
@BahmanTahayori, could you please send me the script you used to generate these figures? I will add an interactive version of them to the reports. Thanks!
Thanks @eurunuela. Dan and I decided to add it as a separate PR as it still requires some minor modifications (adding legends, labels and running more local tests). @handwerkerd if you think that it should be added to this PR, I am happy to go ahead and add the code that generates the plot and saves it in the figure folder and then Eneko will add it to the HTML report. What do you recommend @handwerkerd?
I think it should be a different PR. We want to merge this and have RobustICA available for users as soon as possible, and I see the figure as an extra really but not part of the core functionality.
Co-authored-by: Dan Handwerker <7406227+handwerkerd@users.noreply.github.com>
docs/faq.rst (Outdated)
We have one option that is generally useful and is also a partial solution.
``--ica_method robustica`` will run `robustica`_.
This is a method that, for a given number of PCA components,
will repeated run ICA and identify components that are stable across iterations.
will "repeatedly " run
@BahmanTahayori: This particular fix has already been addressed in 8743aca, but for the sake of future interactions, you can propose such fixes on GitHub in a way that maintainers can accept with a single button press: https://haacked.com/archive/2019/06/03/suggested-changes/ (you may have previously interacted with this mechanism when I was proposing modifications on your fork).
Thanks @Lestropie. Will consider it for the next time.
This looks ready to merge! @BahmanTahayori and @Lestropie Thank you for all your work on this.
I have made most suggestions here, but there are two that will need to be committed: removing the tqdm import and adding a warning when the clustering approach is changed.
I think this PR would be ready once these changes are made.
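For illustration only: the requested warning could look something like the snippet below. The helper name, the logger setup, and the exact wording are guesses, not the PR's actual code; tedana's real logging configuration may differ.

```python
import logging

LGR = logging.getLogger("GENERAL")


def warn_if_clustering_changed(requested="DBSCAN", used="DBSCAN"):
    """Hypothetical helper: warn users when the clustering algorithm that
    robust ICA actually used differs from the one that was requested."""
    if used != requested:
        LGR.warning(
            "Clustering approach changed from %s to %s during robust ICA; "
            "component stability estimates may be affected.",
            requested,
            used,
        )
```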
We probably want to open an issue so we can discuss where we move these defaults.
I agree.
I opened #1132.
from scipy import __version__ as scipy_version
from scipy import ndimage
from scipy.special import lpmv
from sklearn import __version__ as sklearn_version
from sklearn.utils import check_array
from threadpoolctl import __version__ as threadpoolctl_version
from tqdm import __version__ as tqdm_version
I don't think we're using tqdm. We should remove it from here or else users will get errors because it isn't a requirement in our pyproject.toml
file.
It is being used within robustica, and we're also using it ourselves (line 158 in 30a9a66):

for voxel in tqdm(voxel_idx, desc=f"{echo_num}-echo monoexponential"):

It's also in our pyproject.toml. I was mixed on adding it here because this theoretically isn't a library that will affect results, but they seem to keep releasing new versions and I figured it didn't hurt to track the version. I lean towards keeping it, but I'm fine either way.
I think we should keep it as well.
Co-authored-by: Eneko Uruñuela <e.urunuela@icloud.com>
LGTM! Thank you @BahmanTahayori!
Given that the `FastICA` method depends on its seed value, in some cases the generated output can be substantially different. To overcome this issue, we added `robustica` (https://github.com/CRG-CNAG/robustica/tree/main/robustica), a Python library based on Icasso, as an option for the ICA decomposition. `robustica` clusters many iterations of `FastICA` to achieve consistent results.

Changes proposed in this pull request:

- `ica_method` is added as an option (robustica or fastica) so the user can select between the two methods.
- The number of runs when robustica is used can be set by the user through `n_robust_runs`. The default is `robustica` with 30 robust runs. This default was selected based on our experience using robustica on two different datasets with over 250 subjects' data.
- The DBSCAN clustering algorithm has been implemented in a hard-coded manner.