Code repository for the paper "Across-subject ensemble-learning alleviates the need for large samples for fMRI decoding" by Himanshu Aggarwal, Liza Al-Shikhley and Bertrand Thirion.
A preprint version of this work can be found here: https://hal.science/hal-04636523
Clone this repository and move into it:

```bash
git clone git@github.com:man-shu/ensemble-fmri.git
cd ensemble-fmri
```

Then download the downsampled 3mm event-wise GLM effect-size maps of the fMRI datasets. The archive is only ~7GB in size:

```bash
wget https://zenodo.org/records/12204275/files/data.zip
unzip data.zip -d data
```
Create a new conda environment with the dependencies and activate it:

```bash
conda env create -f env/ensemblepy.yml
conda activate ensemblepy
```
We used a conditional permutation importance method to compute the importance scores, as provided in this package and explained in this paper. We modified the code to work in the ensemble setting. To install the package, run:

```bash
git clone git@github.com:man-shu/Variable_Importance.git
cd Variable_Importance/BBI_package
pip install .
cd ../..
```
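For intuition, the snippet below shows scikit-learn's *standard* (non-conditional) permutation importance on synthetic data. It only illustrates the general idea of permutation-based importance scores; it is not the conditional variant provided by the package above.

```python
# Illustration only: standard permutation importance from scikit-learn,
# not the conditional variant used in the paper.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

X, y = make_classification(n_samples=200, n_features=10, random_state=0)
clf = RandomForestClassifier(random_state=0).fit(X, y)

# permute each feature in turn and measure the drop in accuracy
result = permutation_importance(clf, X, y, n_repeats=10, random_state=0)
print(result.importances_mean)
```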
We also compared the ensemble approach with the conventional Graph Convolutional Network (GCN) classifier. The GCN implementation used is inspired by this tutorial. It needs the following dependencies:

```bash
pip install torch torch-geometric
```
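For reference, a minimal two-layer GCN with `torch-geometric` looks like the sketch below. This only illustrates the building blocks; it is not necessarily the exact architecture used in `scripts/compare_with_gcn.py`.

```python
# Minimal two-layer GCN sketch (illustration only; the architecture in
# scripts/compare_with_gcn.py may differ).
import torch
import torch.nn.functional as F
from torch_geometric.nn import GCNConv


class GCN(torch.nn.Module):
    def __init__(self, n_features, n_classes, hidden=64):
        super().__init__()
        self.conv1 = GCNConv(n_features, hidden)
        self.conv2 = GCNConv(hidden, n_classes)

    def forward(self, x, edge_index):
        # x: node features; edge_index: graph connectivity in COO format
        x = F.relu(self.conv1(x, edge_index))
        return self.conv2(x, edge_index)
```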
All the scripts expect event-wise GLM effect-size maps in the `data` directory. The data structure should be as follows:
```
data/neuromod
└── 3mm
    ├── sub-01.nii.gz
    ├── sub-01_labels.csv
    ├── sub-01_runs.csv
    ├── sub-02.nii.gz
    ├── sub-02_labels.csv
    ├── sub-02_runs.csv
    ├── sub-03.nii.gz
    ├── sub-03_labels.csv
    ├── sub-03_runs.csv
    ├── sub-05.nii.gz
    ├── sub-05_labels.csv
    └── sub-05_runs.csv
```
- Under `data`, create a sub-directory with the name of your dataset, e.g. `neuromod` in this case.
- This sub-directory should contain another sub-directory named after the resolution of the data, e.g. `3mm` in this case. Note that we downsampled all our data to 3mm resolution because working with full-resolution data is computationally expensive.
- The `3mm` directory should contain the effect-size maps for each subject, in NIfTI format, with filenames of the form `<subject_id>.nii.gz`, e.g. `sub-01.nii.gz` in this case. The subject_id itself can be anything, as long as it is unique for each file. Each volume in this 4D NIfTI file should correspond to an event in the task, and all runs should be concatenated into this one NIfTI file.
- The `3mm` directory should also contain the labels for each subject, in a CSV file with a filename of the form `<subject_id>_labels.csv`, e.g. `sub-01_labels.csv` in this case. The CSV file should have one column without any header, containing the label for each event/volume in the NIfTI file.
- Finally, add the name of your dataset (the sub-directory you just created under `data`) in all the scripts. For example, in `scripts/vary_train_size.py`, add your dataset's name to the `datas` list as follows:

```python
# in lines 42-48
datas = [
    "neuromod",
    "forrest",
    "rsvp",
    "bold",
    "aomic_anticipation",
    "your_dataset_name",
]
```
You can even remove the other datasets if you are only interested in your dataset.
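To check that your dataset follows this layout, here is a minimal sanity-check sketch, assuming `nibabel` and `pandas` are available and using the hypothetical dataset name `your_dataset_name`: every 4D NIfTI file should have exactly one label per volume.

```python
# Hypothetical sanity check for the expected data layout.
from pathlib import Path

import nibabel as nib
import pandas as pd

data_dir = Path("data") / "your_dataset_name" / "3mm"
for img_path in sorted(data_dir.glob("sub-*.nii.gz")):
    subject_id = img_path.name.replace(".nii.gz", "")
    labels = pd.read_csv(data_dir / f"{subject_id}_labels.csv", header=None)
    n_volumes = nib.load(img_path).shape[-1]
    # one label per event/volume
    assert n_volumes == len(labels), (
        f"{subject_id}: {n_volumes} volumes vs {len(labels)} labels"
    )
    print(f"{subject_id}: {n_volumes} events, OK")
```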
The section "Reproduce the results in the paper" explains what each script does and how to run them.
If you want to know how to obtain event-wise GLM effect-size maps, you can follow this tutorial. Note that this method is referred to as Least Squares Separation (LSS) in the tutorial.
We did the same for the AOMIC dataset using the `scripts/glm/glm_anticipation.py` script. So once you have downloaded the AOMIC dataset from here, you can update the paths in the script and run it to get the event-wise GLM effect-size maps. Note that you only need the files matching the `*task-anticipation*` wildcard from the AOMIC dataset.
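For a rough idea of what LSS does, the sketch below fits one GLM per trial with nilearn, isolating that trial as its own condition and lumping all other trials into a single nuisance condition. The file names, TR, and event columns here are assumptions for illustration; prefer the actual scripts under `scripts/glm/` for reproduction.

```python
# Hedged LSS (Least Squares Separation) sketch with nilearn.
# Assumed inputs: a BOLD run "bold.nii.gz" (TR = 2.0 s, hypothetical) and an
# events table with the standard onset/duration/trial_type columns.
import pandas as pd
from nilearn.glm.first_level import FirstLevelModel

events = pd.read_csv("events.tsv", sep="\t")
model = FirstLevelModel(t_r=2.0)

effect_maps = []
for i in range(len(events)):
    lss_events = events.copy()
    # isolate trial i; model all other trials as one nuisance condition
    lss_events["trial_type"] = "other"
    lss_events.loc[i, "trial_type"] = "trial_of_interest"
    model.fit("bold.nii.gz", events=lss_events)
    effect_maps.append(
        model.compute_contrast("trial_of_interest", output_type="effect_size")
    )
```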
To generate the numbers plotted in Fig 2 and Fig 3 (over varying training sizes) in the paper:

- using the data in the `data` directory, saving the results in the `results` directory, with 20 parallel jobs and DiFuMo features, run:

  ```bash
  python scripts/vary_train_size.py data results 20 difumo
  ```

- or, with full voxel-space features, run:

  ```bash
  python scripts/vary_train_size.py data results 20 voxels
  ```
To generate the numbers plotted in Fig 4 (over varying numbers of subjects in the ensemble) in the paper:

- using the data in the `data` directory, saving the results in the `results` directory, with 20 parallel jobs and DiFuMo features, run:

  ```bash
  python scripts/vary_n_subs.py data results 20 difumo
  ```

- or, with full voxel-space features, run:

  ```bash
  python scripts/vary_n_subs.py data results 20 voxels
  ```
To compute the importance scores for the ensemble setting using the DiFuMo features and a RandomForest classifier, plotted in Supplementary Fig. 2 in the paper, run:

```bash
python scripts/feat_imp.py data results 20
```
To use the pretraining strategy with an L1-penalized Linear SVC (the results plotted in Supplementary Figure 2), change line 175 under the `pretrain` function in `utils.py` from:

```python
clf = LinearSVC(dual="auto")
```

to:

```python
clf = LinearSVC(dual="auto", penalty="l1")
```
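The point of the L1 penalty is that it drives many coefficients to exactly zero, i.e. it performs implicit feature selection during pretraining. A quick, self-contained way to see the effect (synthetic data, not the paper's pipeline):

```python
# L1 vs L2 penalty: L1 zeroes out many coefficients (synthetic example).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=200, n_features=50, random_state=0)
l2 = LinearSVC(dual="auto").fit(X, y)
l1 = LinearSVC(dual="auto", penalty="l1").fit(X, y)
print("zero coefficients with L2:", int(np.sum(l2.coef_ == 0)))
print("zero coefficients with L1:", int(np.sum(l1.coef_ == 0)))
```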
For comparing the ensemble approach with the GCN approach, run:

```bash
python scripts/compare_with_gcn.py data results 20 gcn/param_grid.json
```
To plot Figures 2 and 3 in the paper, run:

```bash
python scripts/plotting/plot_fig2_fig3.py
```
To plot Figure 4 in the paper, run:

```bash
python scripts/plotting/plot_fig4.py
```
To plot the feature importance scores, run the following command. This will plot the feature importance scores on surfaces and glass brains for each subject in each dataset. The plots will be saved in the `plots/feat_imp` directory.

```bash
python scripts/plotting/plot_suppfig1.py
```
To plot the comparison with GCN, run the following command. This will create a plot similar to Fig 2 but with the GCN results.

```bash
python scripts/plotting/plot_gcn.py
```
We ran the experiments on a CPU-based cluster with 72 nodes and 376 GB of RAM, but only used 20 parallel jobs. The OS was Ubuntu 18.04.6 LTS.
The time taken for each experiment is as follows (the CPU percentage indicates how many cores were busy on average, e.g. 1606% ≈ 16 cores):
```bash
# command
time python scripts/vary_train_size.py data results 20 difumo
# output
264026.23s user 4271.26s system 1606% cpu 4:38:23.80 total

# command
time python scripts/vary_train_size.py data results 20 voxels
# output
541518.33s user 20061.83s system 1715% cpu 9:05:44.40 total

# command
time python scripts/vary_n_subs.py data results 20 difumo
# output
264768.95s user 644.23s system 1804% cpu 4:05:10.79 total

# command
time python scripts/vary_n_subs.py data results 20 voxels
# output
361548.14s user 5086.45s system 1828% cpu 5:34:16.29 total

# command
time python scripts/feat_imp.py data results 20
# output
2596918.56s user 65902.53s system 2082% cpu 35:31:34.59 total
```
If you don't have conda installed, you can install it from here.
In case you face any issues, please feel free to open an issue on this repository.