Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding robustica option to ICA decomposition to achieve consistent results #1013

Merged
merged 47 commits into from
Sep 23, 2024

Conversation

BahmanTahayori
Copy link
Contributor

Given that the FastICA method depends on its seed value, in some cases the generated output can be substantially different. To overcome this issue, we added robustica (https://github.com/CRG-CNAG/robustica/tree/main/robustica) which is a python library based on Icasso as an option to the ICA decomposition. robustica clusters many iterations of FastICA to achieve consistent results.

Changes proposed in this pull request:

  • ica_method is added as an option (robustica or fastica) and the user can select between the two methods. The number of runs for when robustica is used can be determined by the user through setting n_robust_runs. The default is set to `robustica' with 30 number of robust runs. This default value is selected based on our experience using robustica for two different datasets with over 250 subjects' data.

  • The DBSCAN clustering algorithm has been implemented in a hard-coded manner.

@handwerkerd
Copy link
Member

I'm really excited to see this @BahmanTahayori. I'm going to start looking at it now.
I asked on https://mattermost.brainhack.org/brainhack/channels/tedana and no one was definitely planning to attend our tedana dev call this Thursday/Friday so I cancelled that meeting. If you were getting this ready to discuss at that meeting, I'll either recreate the meeting, even if it's just the two of us, or we can schedule a meeting at another time that's slightly more convenient in Australia.

Also, some of your style issues are due to us allowing recently python 3.12, which has a few more stringencies. Addressing the following might get the style check passing:

tedana/decomposition/ica.py:143:5: E722 do not use bare 'except'
tedana/decomposition/pca.py:397:71: E226 missing whitespace around arithmetic operator
tedana/decomposition/pca.py:397:88: E231 missing whitespace after ','
tedana/selection/selection_nodes.py:324:38: E231 missing whitespace after ','

pyproject.toml Outdated Show resolved Hide resolved
Copy link
Member

@handwerkerd handwerkerd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@BahmanTahayori I've gone through the entire PR and made some comments. I generally like your approach, but most of my feedback is focused on making sure I understand what's happening and the messages to end users are clear.

I already noticed an issue with the 3-echo integration test where robust ICA outputs fewer than the specified number components.

The one big thing I'd still want to do before approving is to run this on a bunch of my own datasets to make sure it's stable and giving consistently plausible results.

tedana/decomposition/ica.py Outdated Show resolved Hide resolved
@handwerkerd
Copy link
Member

@BahmanTahayori and I talked yesterday. In addition to going through the above review comments, I wanted to highlight one other part of our discussion to other developers (@tsalo @eurunuela @dowdlelt @n-reddy @javiergcas @goodalse2019 @NaomiGaggi @marco7877 @CesarCaballeroGaudes ). Robust ICA seems to always output fewer components than initially requested. This is reasonable. If you have 100 time points and request 90 components, it should not be able to find 90 stable components. This is both a problem and a possible solution for our current issues with PCA dimensionality estimation.

On the problem side, even if we get a perfect dimensionality estimate to input into RobustICA, and RobustICA will still output fewer ICA components.

On the solution side, I don't really care about how many components are estimated in the PCA step, I care that our ICA components reliably estimate as much of the total structured variance of the data as possible. This means, we can potentially give RobustICA an initial number of components that is too large such as min(1/3 of time points, 95% of total variance) and RobustICA will settle on a smaller plausible number.

I've been testing this out on a few datasets and it's promising, but not yet perfect. For example, I tested it on a dataset with 160 time points and RobustICA was set to run FastICA for 30 iterations

Initial components Calculated stable ICA components Index Quality
40 37 0.9483
45 42 0.9519
50 38 0.9464
55 43 0.9501
60 46 0.9456
65 45 0.9442
70 49 0.9448
75 48 0.9300
94 49 0.9287

Ideally, the calculated number of stable components should be the same if we initialize to 65 or 55. That's not happening, but it is reasonably stable. One other thing that’s noteworthy is that I’m running this with a block design flashing checkerboard task, which almost always results in a high variance component with the weights in primary visual cortex and a time series that follows the block design task. I am NOT seeing that when I initialized with 94 components & those final results were qualitatively noiser. This might be a sign that if one overestimates the intial number of components too much, the method breaks down. What might be happening is the algorithm is trying to successfully run FastICA 30 times. When it fails to converge it tries again. For 94 components, it failed 20 times (running FastICA a total of 50 times). For 70 components, it failed to converge only 8 times. In addition to being computationally expensive, this means trying to get ICA to converge with way too many components is resulting in stranger results when it does converge.

I'm going to keep playing with this and encourage others to do so as well. I'm also playing with more iterations of FastICA (the new n_robust_runs option in this PR) to see if that results in a more stable number of components across a wider range of inputs. This should definitely be a discussion at our January developers call. If anyone wants to actively look into this between now and then I'll find time to talk more and help you get started.

For reference, the manuscript describing this robust ICA approach is https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-05043-9
In parallel, I asked @jefferykclam to work on the dimensionality estimation issue and he was converging on a similar approach based on this publication: https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0094943

BahmanTahayori and others added 4 commits February 9, 2024 14:48
Co-authored-by: Dan Handwerker <7406227+handwerkerd@users.noreply.github.com>
Co-authored-by: Dan Handwerker <7406227+handwerkerd@users.noreply.github.com>
* RobustICA: Restructure code loop over robust methods

* Addressing the issue with try/except

---------

Co-authored-by: Bahman <tahayori@gmail.com>
@dowdlelt
Copy link
Collaborator

Just wanted to jump in an officially register my excitement for this - I've checked this out and currently running on some trouble some data from neurostars. I think it will be enlightening.

I am also seeing a lot of failure to converge (~550 timepoints, req 180, ~65% var explained) and the phrase exponential back-off crossed my mind. Its not quite applicable here, but the central idea is if something fails, you don't just keep trying the same thing (back-off = more waiting). Here, at the risk of adding another layer of complexity (ugh) perhaps an option should be that too many failures (?) to converge should be logged, and the entire loop should be restarted with N few components? Not exponential, but maybe based on number of timepoints. If someone wild shows up with 1500 timepoints, and starts with 500, it might not be nice to step through 499, 498, etc. But If they have 200 and start with 66, might be smart to do 65, 64, etc.

In any case, I'm looking forward to seeing where this goes.

@handwerkerd
Copy link
Member

I just opened #1125 which replaces this PR. I'll plan on closing this PR after @BahmanTahayori and @Lestropie see the other PR and confirm it contains the key changes from this PR.

pyproject.toml Outdated Show resolved Hide resolved
@handwerkerd handwerkerd marked this pull request as ready for review September 12, 2024 19:09
Copy link
Member

@handwerkerd handwerkerd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

WIth the new robustica version and the updates from Bahman, this is really close to merging. I made a couple of minor comments and noticed one additional thing to add. @tsalo, could you respond to my dependabot Q today so that Bahman can have your response during his Wednesday day?

One additional edit here: When tedana is run, we keep track of version numbers of key dependencies so that users can see which versions were used for each execution. We should add robustica (and maybe tqdm) to those dependencies. I can open another PR onto this PR with these changes, but I figure this is just easier to describe here. The changes are in tedana/utils.py

In the imports ( https://github.com/BahmanTahayori/tedana_florey/blob/1b1eb38f03f664254d5e8aa0dfa144a051442a7e/tedana/utils.py#L21 )
add from robustica import __version__ as robustica_version (new line 21) and from tqdm import __version__ as tqdm_version (new line 28). The imports need to be in alphabetical order.

In get_system_version_info https://github.com/BahmanTahayori/tedana_florey/blob/1b1eb38f03f664254d5e8aa0dfa144a051442a7e/tedana/utils.py#L618 add: "robustica": robustica_version, (new line 620) and "tqdm": tqdm_version, (new line 624)

Bahman, You also made a comment about wanting to add a figure to better depict robustica's output. I'm fine if you want to add that into this PR, but, if you think this is otherwise ready to merge, you can also open another PR just for that. (I'd also like the iq varable in ica.py (mean index quality) stored in a json file in addition to the text log, but that's a slightly tricky edit and I think it will be clearer as a stand-alone PR.

Looking forward to seeing this finally merged soon!

docs/faq.rst Outdated Show resolved Hide resolved
pyproject.toml Outdated
@@ -30,7 +30,7 @@ dependencies = [
"pandas>=2.0,<=2.2.2",
"pybtex",
"pybtex-apa-style",
"robustica@git+https://github.com/BahmanTahayori/robustica",
"robustica>=0.1.4",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tsalo I'm still a bit confused about the dependabot. How do we set this up so that it requires a minimum bersion of 0.1.4, but will also automatically open a pull request when new versions of robustica are released?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think something like "robustica>=0.1.4,<=0.1.4" should work.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @handwerkerd and @tsalo for your comments. I applied all the required changes. Regarding the clustering results, the code is ready and I generate a figure to visualise the clustering in 2D (going from high dimension to low dimension) using TSNE, https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html. I can show the results with or without the noise cluster. I have attached an example for these two cases (the red dots are clustered as noise). In my local code, the result is saved in the "figure" directory but I have not added it to the HTML report. What are your thoughts on adding them to the HTML report? Which figure should be added?

I think now that this PR is ready, it is better to finalise it and then we will add these results and the iq variable in a separate PR.

Projection_Clusters

Projection_Clusters_noise

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me see if I correctly understand these figure: Each small circle is a component from one iteration of fastICA. The black circle are the components that were deemed part of stable clusters and the red circles are the components that didn't end up in any cluster. The blue lines connect the components in a cluster. I'm assuming the final stable component is some type of centroid for each cluster? As an evaluation tool, the ideal would be every combined is in a small, tight circle, but we can see some are in longer lines and some are a bit farther apart. With the plot with the red dots, in the top center, I see a grouping of 3 clusters with red in-between which would tell me that the algorithm picked 3 stable components from this cluster, but, with a few more/different iterations, the expact placement of those stable components in this space might have been slightly different.

Is this description correct?

I do think this is useful and could be added to the html report, but let's definitely do that as a separate PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, your description is correct. I used a convex hull to draw the blue line around each cluster. The centroid is calculated in robustica (_compute_centroids). Your explanation about the 3 clusters at the top centre is excellent. With a few more or less iterations those three could be combined into 1 cluster.

I will do it in a separate PR.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@BahmanTahayori, could you please send me the script you used to generate these figures? I will add an interactive version of them to the reports. Thanks!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @eurunuela. Dan and I decided to add it as a separate PR as it still requires some minor modifications (adding legends, labels and running more local tests). @handwerkerd if you think that it should be added to this PR, I am happy to go ahead and add the code that generates the plot and saves it in the figure folder and then Eneko will add it to the HTML report. What do you recommend @handwerkerd?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it should be a different PR. We want to merge this and have RobustICA available for users as soon as possible, and I see the figure as an extra really but not part of the core functionality.

Co-authored-by: Dan Handwerker <7406227+handwerkerd@users.noreply.github.com>
docs/faq.rst Outdated
We have one option that is generally useful and is also a partial solution.
``--ica_method robustica`` will run `robustica`_.
This is a method that, for a given number of PCA components,
will repeated run ICA and identify components that are stable across iterations.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will "repeatedly " run

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@BahmanTahayori: This particular fix has already been addressed in 8743aca, but for the sake of future interactions, you can propose such fixes on GitHub in a way that maintainers can accept with a single button press: https://haacked.com/archive/2019/06/03/suggested-changes/ (you may have previously interacted with this mechanism when I was proposing modifications on your fork).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @Lestropie. Will consider it for the next time.

handwerkerd
handwerkerd previously approved these changes Sep 18, 2024
Copy link
Member

@handwerkerd handwerkerd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks ready to merge! @BahmanTahayori and @Lestropie Thank you for all your work on this.

Copy link
Collaborator

@eurunuela eurunuela left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have made most suggestions here but there are two that will need to be committed: removing the tqdm import and adding a warning when the clustering approach is changed.

I think this PR would be ready once these changes are made.

docs/faq.rst Outdated Show resolved Hide resolved
docs/faq.rst Outdated Show resolved Hide resolved
docs/faq.rst Outdated Show resolved Hide resolved
docs/faq.rst Outdated Show resolved Hide resolved
docs/faq.rst Outdated Show resolved Hide resolved
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We probably want to open an issue so we can discuss where we move these defaults.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I opened #1132.

tedana/decomposition/ica.py Outdated Show resolved Hide resolved
tedana/decomposition/ica.py Show resolved Hide resolved
tedana/workflows/tedana.py Outdated Show resolved Hide resolved
from scipy import __version__ as scipy_version
from scipy import ndimage
from scipy.special import lpmv
from sklearn import __version__ as sklearn_version
from sklearn.utils import check_array
from threadpoolctl import __version__ as threadpoolctl_version
from tqdm import __version__ as tqdm_version
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we're using tqdm. We should remove it from here or else users will get errors because it isn't a requirement in our pyproject.toml file.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is being used within robustica and we're also using it in

for voxel in tqdm(voxel_idx, desc=f"{echo_num}-echo monoexponential"):

It's also in our project.toml. I was mixed in adding it here because this theoretically isn't a library that will affect results, but they seem to keep releasing new versions and I figured it didn't hurt to track the version. I lean towards keeping, but I'm fine either way.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should keep it as well.

Co-authored-by: Eneko Uruñuela <e.urunuela@icloud.com>
Copy link
Collaborator

@eurunuela eurunuela left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Thank you @BahmanTahayori!

@handwerkerd handwerkerd merged commit df58347 into ME-ICA:main Sep 23, 2024
14 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants