Adding robustica option to ICA decomposition to achieve consistent results #1013
Conversation
I'm really excited to see this @BahmanTahayori. I'm going to start looking at it now. Also, some of your style issues are due to us recently allowing Python 3.12, which has a few more stringencies. Addressing the following might get the style check passing:
@BahmanTahayori I've gone through the entire PR and made some comments. I generally like your approach, but most of my feedback is focused on making sure I understand what's happening and the messages to end users are clear.
I already noticed an issue with the 3-echo integration test where robust ICA outputs fewer than the specified number of components.
The one big thing I'd still want to do before approving is to run this on a bunch of my own datasets to make sure it's stable and giving consistently plausible results.
@BahmanTahayori and I talked yesterday. In addition to going through the above review comments, I wanted to highlight one other part of our discussion to other developers (@tsalo @eurunuela @dowdlelt @n-reddy @javiergcas @goodalse2019 @NaomiGaggi @marco7877 @CesarCaballeroGaudes).

Robust ICA seems to always output fewer components than initially requested. This is reasonable: if you have 100 time points and request 90 components, it should not be able to find 90 stable components. This is both a problem and a possible solution for our current issues with PCA dimensionality estimation. On the problem side, even if we get a perfect dimensionality estimate to feed into RobustICA, RobustICA will still output fewer ICA components. On the solution side, I don't really care how many components are estimated in the PCA step; I care that our ICA components reliably capture as much of the total structured variance of the data as possible. This means we can potentially give RobustICA an initial number of components that is deliberately too large.

I've been testing this out on a few datasets and it's promising, but not yet perfect. For example, I tested it on a dataset with 160 time points and RobustICA was set to run FastICA for 30 iterations. Ideally, the calculated number of stable components should be the same whether we initialize to 65 or 55. That's not happening, but it is reasonably stable. One other noteworthy thing is that I'm running this with a block-design flashing-checkerboard task, which almost always produces a high-variance component with weights in primary visual cortex and a time series that follows the block-design task. I am NOT seeing that when I initialized with 94 components, and those final results were qualitatively noisier. This might be a sign that if one overestimates the initial number of components too much, the method breaks down.

What might be happening is that the algorithm tries to successfully run FastICA 30 times; when a run fails to converge, it tries again. For 94 components, it failed 20 times (running FastICA a total of 50 times). For 70 components, it failed to converge only 8 times. In addition to being computationally expensive, this means trying to get ICA to converge with far too many components produces stranger results when it does converge. I'm going to keep playing with this and encourage others to do so as well. I'm also playing with more iterations of FastICA (the new n_robust_runs option).

For reference, the manuscript describing this robust ICA approach is https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-05043-9
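For anyone who wants to experiment with this locally, here is a rough sketch of the kind of call being discussed, using the robustica package directly rather than through tedana. The parameter names (n_components, robust_runs, robust_method) follow the robustica documentation as I recall it and should be treated as assumptions; the data array and its shape are made up for illustration.

```python
import numpy as np
from robustica import RobustICA  # https://github.com/CRG-CNAG/robustica

# Stand-in for optimally combined data: (voxels x time points), 160 time points.
data = np.random.rand(10000, 160)

# Deliberately request more components than we expect to be stable
# (e.g. ~70 for 160 time points) and let the clustering step decide
# how many of them are reproducible across FastICA runs.
rica = RobustICA(
    n_components=70,          # intentionally generous initial guess
    robust_runs=30,           # number of FastICA runs to cluster (assumed name)
    robust_method="DBSCAN",   # clustering algorithm (assumed name)
)
S, A = rica.fit_transform(data)  # S: sources, A: mixing matrix

# The number of stable components actually returned is typically
# smaller than n_components.
print(f"Requested 70 components, got {S.shape[1]} stable components")
```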
Co-authored-by: Dan Handwerker <7406227+handwerkerd@users.noreply.github.com>
Co-authored-by: Dan Handwerker <7406227+handwerkerd@users.noreply.github.com>
* RobustICA: Restructure code loop over robust methods
* Addressing the issue with try/except
Co-authored-by: Bahman <tahayori@gmail.com>
Just wanted to jump in and officially register my excitement for this - I've checked this out and am currently running it on some troublesome data from Neurostars. I think it will be enlightening. I am also seeing a lot of failures to converge (~550 timepoints, 180 requested, ~65% variance explained), and the phrase "exponential back-off" crossed my mind. It's not quite applicable here, but the central idea is that if something fails, you don't just keep trying the same thing (back-off = more waiting). Here, at the risk of adding another layer of complexity (ugh), perhaps an option should be that too many failures to converge get logged and the entire loop is restarted with N fewer components? Not exponential, but maybe based on the number of timepoints: if someone wild shows up with 1500 timepoints and starts with 500, it might not be nice to step through 499, 498, etc. But if they have 200 and start with 66, it might be smart to do 65, 64, etc. (a rough sketch of this idea is below). In any case, I'm looking forward to seeing where this goes.
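A minimal sketch of that restart idea, purely hypothetical: the helper `run_robust_ica`, the failure threshold, and the step-size rule are all invented for illustration, and the step is scaled to the number of timepoints rather than decrementing by one.

```python
def run_robust_ica_with_backoff(data, n_timepoints, n_components, max_failures=10):
    """Hypothetical wrapper: if too many FastICA runs fail to converge,
    log it and restart the whole robust-ICA loop with fewer components.

    `run_robust_ica` is a stand-in for whatever drives the repeated FastICA
    runs; it is assumed to return (sources, n_convergence_failures).
    """
    # Step size scales with the data: roughly 1% of timepoints, at least 1.
    step = max(1, n_timepoints // 100)

    while n_components > 1:
        sources, n_failures = run_robust_ica(data, n_components)
        if n_failures <= max_failures:
            return sources
        print(
            f"{n_failures} convergence failures with {n_components} components; "
            f"retrying with {n_components - step}"
        )
        n_components -= step

    raise RuntimeError("Could not find a component count that converges reliably")
```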
I just opened #1125 which replaces this PR. I'll plan on closing this PR after @BahmanTahayori and @Lestropie see the other PR and confirm it contains the key changes from this PR.
Aligning to meica:main
With the new robustica version and the updates from Bahman, this is really close to merging. I made a couple of minor comments and noticed one additional thing to add. @tsalo, could you respond to my dependabot question today so that Bahman can have your response during his Wednesday?
One additional edit here: when tedana is run, we keep track of the version numbers of key dependencies so that users can see which versions were used for each execution. We should add robustica (and maybe tqdm) to those dependencies. I can open another PR onto this PR with these changes, but I figure this is just easier to describe here. The changes are in tedana/utils.py.

In the imports (https://github.com/BahmanTahayori/tedana_florey/blob/1b1eb38f03f664254d5e8aa0dfa144a051442a7e/tedana/utils.py#L21), add `from robustica import __version__ as robustica_version` (new line 21) and `from tqdm import __version__ as tqdm_version` (new line 28). The imports need to be in alphabetical order.

In `get_system_version_info` (https://github.com/BahmanTahayori/tedana_florey/blob/1b1eb38f03f664254d5e8aa0dfa144a051442a7e/tedana/utils.py#L618), add `"robustica": robustica_version,` (new line 620) and `"tqdm": tqdm_version,` (new line 624). A sketch of the combined result is below.
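For concreteness, here is roughly what those two spots in tedana/utils.py would look like after the edits. The surrounding imports and dictionary entries are abbreviated, and the local variable name `versions` is only illustrative; the exact neighbours in the real file may differ.

```python
# In the import block of tedana/utils.py (kept in alphabetical order):
from robustica import __version__ as robustica_version
from tqdm import __version__ as tqdm_version

# Inside get_system_version_info(), in the dictionary of dependency versions:
versions = {
    # ... existing entries such as numpy, scipy, sklearn ...
    "robustica": robustica_version,
    "tqdm": tqdm_version,
}
```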
Bahman, you also made a comment about wanting to add a figure to better depict robustica's output. I'm fine if you want to add that to this PR, but, if you think this is otherwise ready to merge, you can also open another PR just for that. (I'd also like the iq variable in ica.py (mean index quality) stored in a json file in addition to the text log, but that's a slightly tricky edit and I think it will be clearer as a stand-alone PR.)

Looking forward to seeing this finally merged soon!
pyproject.toml (Outdated)
@@ -30,7 +30,7 @@ dependencies = [
    "pandas>=2.0,<=2.2.2",
    "pybtex",
    "pybtex-apa-style",
-   "robustica@git+https://github.com/BahmanTahayori/robustica",
+   "robustica>=0.1.4",
@tsalo I'm still a bit confused about the dependabot. How do we set this up so that it requires a minimum version of 0.1.4, but will also automatically open a pull request when new versions of robustica are released?
I think something like "robustica>=0.1.4,<=0.1.4" should work.
Thanks @handwerkerd and @tsalo for your comments. I applied all the required changes. Regarding the clustering results, the code is ready and I generate a figure to visualise the clustering in 2D (going from high dimension to low dimension) using TSNE (https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html). I can show the results with or without the noise cluster. I have attached an example for these two cases (the red dots are clustered as noise). In my local code, the result is saved in the "figure" directory but I have not added it to the HTML report. What are your thoughts on adding them to the HTML report? Which figure should be added?

I think that now that this PR is ready, it is better to finalise it and then we will add these results and the iq variable in a separate PR.
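A rough sketch of how such a figure could be produced with scikit-learn's TSNE, assuming the per-run ICA components are stacked into a 2-D array and cluster labels are available (with -1 marking noise, as DBSCAN does). The variable names and stand-in arrays are illustrative, not the actual PR code.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.manifold import TSNE

# components: (n_runs * n_components, n_voxels) array of all FastICA
# components across robust runs; labels: cluster index per component,
# with -1 marking components that were not assigned to any cluster (noise).
components = np.random.rand(300, 5000)           # illustrative stand-in
labels = np.random.randint(-1, 10, size=300)     # illustrative stand-in

# Embed the high-dimensional components into 2D for visualization.
embedding = TSNE(n_components=2, random_state=42).fit_transform(components)

noise = labels == -1
plt.scatter(embedding[~noise, 0], embedding[~noise, 1], c="black", s=10,
            label="clustered components")
plt.scatter(embedding[noise, 0], embedding[noise, 1], c="red", s=10,
            label="noise (unclustered)")
plt.legend()
plt.savefig("figure/robustica_clusters.png")
```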
Let me see if I correctly understand these figures: each small circle is a component from one iteration of FastICA. The black circles are the components that were deemed part of stable clusters, and the red circles are the components that didn't end up in any cluster. The blue lines connect the components in a cluster. I'm assuming the final stable component is some type of centroid for each cluster? As an evaluation tool, the ideal would be for every cluster to be a small, tight circle, but we can see some are stretched into longer lines and some are a bit more spread out. In the plot with the red dots, in the top center, I see a grouping of 3 clusters with red in between, which would tell me that the algorithm picked 3 stable components from that grouping, but, with a few more/different iterations, the exact placement of those stable components in this space might have been slightly different.

Is this description correct?

I do think this is useful and could be added to the html report, but let's definitely do that as a separate PR.
Yes, your description is correct. I used a convex hull to draw the blue line around each cluster. The centroid is calculated in robustica (_compute_centroids). Your explanation about the 3 clusters at the top centre is excellent; with a few more or fewer iterations those three could be combined into one cluster.

I will do it in a separate PR.
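A minimal sketch of drawing a convex hull around each cluster in the 2-D embedding, using scipy.spatial.ConvexHull. This mirrors what the figure does, but the stand-in arrays and variable names are illustrative, not the actual PR code.

```python
import matplotlib.pyplot as plt
import numpy as np
from scipy.spatial import ConvexHull

embedding = np.random.rand(300, 2)               # stand-in for t-SNE output
labels = np.random.randint(-1, 10, size=300)     # stand-in cluster labels, -1 = noise

for cluster_id in np.unique(labels):
    if cluster_id == -1:
        continue  # skip the noise "cluster"
    points = embedding[labels == cluster_id]
    if len(points) < 3:
        continue  # a 2-D hull needs at least 3 points
    hull = ConvexHull(points)
    # hull.vertices are the indices of the hull points in order;
    # append the first vertex again to close the polygon.
    vertices = np.append(hull.vertices, hull.vertices[0])
    plt.plot(points[vertices, 0], points[vertices, 1], color="blue", linewidth=1)
```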
@BahmanTahayori, could you please send me the script you used to generate these figures? I will add an interactive version of them to the reports. Thanks!
Thanks @eurunuela. Dan and I decided to add it as a separate PR as it still requires some minor modifications (adding legends, labels and running more local tests). @handwerkerd if you think that it should be added to this PR, I am happy to go ahead and add the code that generates the plot and saves it in the figure folder and then Eneko will add it to the HTML report. What do you recommend @handwerkerd?
I think it should be a different PR. We want to merge this and have RobustICA available for users as soon as possible, and I see the figure as an extra really but not part of the core functionality.
Co-authored-by: Dan Handwerker <7406227+handwerkerd@users.noreply.github.com>
docs/faq.rst (Outdated)
We have one option that is generally useful and is also a partial solution.
``--ica_method robustica`` will run `robustica`_.
This is a method that, for a given number of PCA components,
will repeated run ICA and identify components that are stable across iterations.
will "repeatedly " run
@BahmanTahayori: This particular fix has already been addressed in 8743aca, but for the sake of future interactions, you can propose such fixes on GitHub in a way that maintainers can accept with a single button press: https://haacked.com/archive/2019/06/03/suggested-changes/ (you may have previously interacted with this mechanism when I was proposing modifications on your fork).
Thanks @Lestropie. Will consider it for the next time.
This looks ready to merge! @BahmanTahayori and @Lestropie Thank you for all your work on this.
I have made most suggestions here, but there are two that will need to be committed: removing the tqdm import and adding a warning when the clustering approach is changed.
I think this PR would be ready once these changes are made.
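For illustration only: the requested warning could look something like the snippet below. The helper name, the logger setup, and the exact wording are guesses, not the PR's actual code; tedana's real logging configuration may differ.

```python
import logging

LGR = logging.getLogger("GENERAL")


def warn_if_clustering_changed(requested="DBSCAN", used="DBSCAN"):
    """Hypothetical helper: warn users when the clustering algorithm that
    robust ICA actually used differs from the one that was requested."""
    if used != requested:
        LGR.warning(
            "Clustering approach changed from %s to %s during robust ICA; "
            "component stability estimates may be affected.",
            requested,
            used,
        )
```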
We probably want to open an issue so we can discuss where we move these defaults.
I agree.
I opened #1132.
from scipy import __version__ as scipy_version
from scipy import ndimage
from scipy.special import lpmv
from sklearn import __version__ as sklearn_version
from sklearn.utils import check_array
from threadpoolctl import __version__ as threadpoolctl_version
from tqdm import __version__ as tqdm_version
I don't think we're using tqdm. We should remove it from here or else users will get errors because it isn't a requirement in our pyproject.toml
file.
It is being used within robustica, and we're also using it ourselves (line 158 in 30a9a66):

for voxel in tqdm(voxel_idx, desc=f"{echo_num}-echo monoexponential"):

It's also in our pyproject.toml. I was mixed on adding it here because this theoretically isn't a library that will affect results, but they seem to keep releasing new versions and I figured it didn't hurt to track the version. I lean towards keeping it, but I'm fine either way.
I think we should keep it as well.
Co-authored-by: Eneko Uruñuela <e.urunuela@icloud.com>
LGTM! Thank you @BahmanTahayori!
Given that the `FastICA` method depends on its seed value, in some cases the generated output can be substantially different. To overcome this issue, we added `robustica` (https://github.com/CRG-CNAG/robustica/tree/main/robustica), a Python library based on Icasso, as an option for the ICA decomposition. `robustica` clusters many iterations of `FastICA` to achieve consistent results.

Changes proposed in this pull request:

- `ica_method` is added as an option (robustica or fastica) so the user can select between the two methods.
- The number of runs when robustica is used can be set by the user through `n_robust_runs`. The default is `robustica` with 30 robust runs. This default was selected based on our experience using robustica on two different datasets with over 250 subjects' data.
- The DBSCAN clustering algorithm has been implemented in a hard-coded manner.