Should data be normalized along time or voxels? #653

Closed · notZaki opened this issue Jan 15, 2021 · 22 comments · Fixed by #702
Labels: question (issues detailing questions about the project or its direction)

notZaki (Contributor) commented Jan 15, 2021

Summary

Input data is normalized to zero-mean and unit-variance prior to PCA. This normalization can be done along two different directions, and this choice affects the output.

Additional Detail

Strategy A: Normalize along second dimension (time)

The data matrix with shape NxT is normalized along the second dimension here:

data_z = ((data.T - data.T.mean(axis=0)) / data.T.std(axis=0)).T # var normalize ts
data_z = (data_z - data_z.mean()) / data_z.std() # var normalize everything

Strategy B: Normalize along first dimension (masked voxels)

The MA-PCA code uses sklearn.preprocessing.StandardScaler, which is initialized here:

scaler = StandardScaler(with_mean=True, with_std=True)

The scaling performed by this scaler is along the first dimension, so it can be expressed as:

(data - data.mean(axis=0)) / data.std(axis=0) # Note the lack of transpose

The PCA implementation in sklearn also internally centers the data along this same dimension.
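
To make the difference concrete, here is a minimal runnable sketch on synthetic data (the shape and values are arbitrary, not the test data), showing that the two strategies produce different matrices and that StandardScaler implements Strategy B:

import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
data = rng.normal(size=(1000, 50))  # N voxels x T timepoints

# Strategy A: z-score each voxel's time series (normalize along time)
data_a = ((data.T - data.T.mean(axis=0)) / data.T.std(axis=0)).T

# Strategy B: z-score each time point's map (normalize along voxels)
data_b = (data - data.mean(axis=0)) / data.std(axis=0)

print(np.allclose(data_a, data_b))  # False: the two strategies disagree

# StandardScaler standardizes each column, i.e. Strategy B here
scaler = StandardScaler(with_mean=True, with_std=True)
print(np.allclose(scaler.fit_transform(data), data_b))  # True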

Comparison using the two strategies

These two strategies do not produce the same normalized matrix, and the choice changes the downstream results. As an example, here's what happens on the five-echo test data:

  • If strategy A (earlier, incorrect implementation): MA-PCA estimates 72 components. The ICA decomposition fails to converge and the seed has to be changed at least 3 times until convergence.
    • Final classification (on local run): accepted = 28, rejected = 13, ignored = 31.
    • On CI: ICA fails to converge after 10 attempts [log]
  • If strategy A (corrected implementation): MA-PCA estimates 66 components. The ICA decomposition converges on the first attempt.
    • Final classification (on CI & local run): accepted = 33, rejected = 11, ignored = 22 [log]
  • If strategy B: MA-PCA estimates 71 components. The ICA decomposition converges on the first attempt, but no BOLD components are identified.
    • Final classification (on local run): accepted = 5, rejected = 56, ignored = 10
    • On CI: No BOLD components found [log]

Potential points for discussion

  • Are different parts of the code intentionally using different normalization strategies? Does the Kundu family of tedpca options need Strategy A while MA-PCA needs Strategy B?
  • In my opinion, it makes more sense to normalize each time series (Strategy A)
    • Counterpoint: The original MA-PCA implementation in GIFT uses Strategy B
tsalo (Member) commented Jan 15, 2021

I think it makes sense to normalize each voxel over time before PCA. I don't think the second normalization (across both voxels and time) is necessary, although I'm not sure if it's a problem.

I don't think it makes sense to normalize each time-point over voxels. We had a similar issue in #228 with the PCA output.

eurunuela (Collaborator) commented

I totally agree with @tsalo. Normalizing the time dimension makes sense, but normalizing across voxels doesn't.

notZaki (Contributor, Author) commented Jan 17, 2021

In that case, I've updated an older PR to normalize over time rather than over voxels. The current mapca code is (1) not doing any normalization in the beginning and (2) normalizing over the voxel dimension while subsampling.

I've also updated the original comment in this issue. I didn't correctly implement one of the strategies. Now the results are more favorable if normalization is done over the time dimension.

tsalo (Member) commented Jan 17, 2021

@notZaki would you be willing to look into whether the second normalization in strategy A (data_z = (data_z - data_z.mean()) / data_z.std() # var normalize everything) impacts the results?

notZaki (Contributor, Author) commented Jan 17, 2021

The second step shouldn't substantially change the result, because normalizing along rows (or columns; it doesn't matter which) will also make the global mean close to zero and the global std close to 1.

As an example, for the three-echo data in the tests:

data_z = ((data.T - data.T.mean(axis=0)) / data.T.std(axis=0)).T
data_z.mean()     # = 2e-17
1 - data_z.std()  # = -2e-16
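
The same check can be reproduced on synthetic data (a minimal sketch; any matrix will do, since row-wise z-scoring forces this result exactly, up to float error):

import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=500.0, scale=30.0, size=(2000, 100))  # voxels x time

# Z-score each voxel's time series, then inspect the global statistics
data_z = ((data.T - data.T.mean(axis=0)) / data.T.std(axis=0)).T
print(data_z.mean())     # ~0 (float error, ~1e-17)
print(1 - data_z.std())  # ~0 (float error, ~1e-16)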

@tsalo added the "question" label on Jan 19, 2021
emdupre (Member) commented Jan 27, 2021

Canonically, I think we'd be interested in spatial ICA. This is the form that Kundu describes in motivating ME-ICA in his 2017 NeuroImage paper.

So I'd echo others that strategy A is most appropriate !

notZaki (Contributor, Author) commented Jan 28, 2021

If the ICA sources are spatial, then I would expect the normalization to be along the spatial dimension, i.e. strategy B. That would also explain why GIFT uses strategy B, since it also does spatial ICA (I think).
Or do I have it the wrong way around? ... I knew I shouldn't have slept through linear algebra class.

emdupre (Member) commented Feb 3, 2021

Thanks again to everyone for weighing in, here ! This is something that's definitely important to think through.

I had a chance to gather my thoughts on this. Just to talk through my thinking, here : Let's consider the fMRI timeseries to be a t x v matrix, where t is equal to the number of timepoints and v is equal to the number of voxels. With spatial ICA, we assume that we have t observations of our spatial random variables that we want to recover (these variables will be a 1 x v vector, but can be reconstructed into our expected volume).

In that case, normalizing each voxel over time should improve convergence by putting the individual voxel time courses on the same scale.

The alternative, normalizing each time point over voxels, would result in each voxel having a "jumpy" timecourse (since its value at each time step would depend not on itself but on all other voxels' values at that same time step).
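
A minimal sketch of that dependence on synthetic data (the helper name normalize_over_voxels is just for illustration): perturbing a single voxel at one time point changes the normalized value of a different, untouched voxel.

import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=(100, 40))  # voxels x time

def normalize_over_voxels(d):
    # Z-score each time point's map (each column), i.e. strategy B
    return (d - d.mean(axis=0)) / d.std(axis=0)

z_before = normalize_over_voxels(data)

perturbed = data.copy()
perturbed[0, 10] += 50.0  # spike one voxel at a single time point

z_after = normalize_over_voxels(perturbed)

# Voxel 5 was never touched, yet its normalized value at t=10 changed:
print(z_before[5, 10], z_after[5, 10])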

I'd be a bit surprised if GIFT is in fact doing the second normalization. The authors suggest here that they'd advocate for "intensity normalization" which is a method for normalizing each voxel over time. Admittedly it's an older paper (2010), but I'm still having trouble imagining how to justify normalizing each time point independently.

But please let me know if I'm misunderstanding something re the normalization options under consideration !

CesarCaballeroGaudes (Contributor) commented Feb 3, 2021

I also had some thoughts about this. An important point is that normalizing in space (i.e. making each volume zero-mean and unit-variance) is equivalent to performing global signal regression, since one is actually forcing the global signal to be zero. As far as my long-term memory allows, I think we discussed this during the tedana hackathon in DC (@javiergcas @eurunuela) and we decided that this should not be done. I might be wrong. Even so, the procedure still needs some kind of normalization of the features (the input to PCA), and therefore (again, my memory might be wrong) we considered temporal normalization more appropriate.
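
A minimal sketch of that equivalence on synthetic data: after z-scoring each volume, the global signal is identically zero.

import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=1000.0, size=(500, 60))  # voxels x time

# Z-score each volume (normalize over voxels at each time point)
data_z = (data - data.mean(axis=0)) / data.std(axis=0)

# The global signal (mean over voxels per time point) is forced to zero
global_signal = data_z.mean(axis=0)
print(np.allclose(global_signal, 0.0))  # True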

Btw, @emdupre thanks for the abstract (excellent search engine skills)... it is really helpful, and it is true that the maPCA paper is an old one. The abstract advocates for just demeaning the voxel time series, with no variance normalization. Although I believe these evaluations, I'm slightly surprised, because PCA is usually applied with feature normalization so that all features have the same variance.

eurunuela (Collaborator) commented Feb 3, 2021

Thank you both @emdupre and @CesarCaballeroGaudes . This whole thing has been driving me nuts for the past month.

What GIFT does is take each time point and z-score the map. Note that for them the data is v x t, as you can see here.

The abstract actually reports the bad test-retest reliability @notZaki has seen when z-scoring the timecourses of each voxel (which, as @CesarCaballeroGaudes mentioned, we thought was the correct thing to do in the hackathon).

Now, the abstract also reports that the best approach for test-retest reliability is "intensity normalization"; i.e., dividing the timecourses of each voxel by their mean (percent signal change without removing the mean).
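
For concreteness, a minimal sketch of that intensity normalization (my own reading of the abstract; the helper name is hypothetical), assuming a voxels x time array:

import numpy as np

def intensity_normalize(data):
    # Scale each voxel's time series to a mean of 100 (percent signal
    # change units, without removing the mean); data: voxels x time
    voxel_means = data.mean(axis=1, keepdims=True)
    return 100.0 * data / voxel_means

rng = np.random.default_rng(0)
normed = intensity_normalize(rng.normal(loc=800.0, scale=20.0, size=(100, 50)))
print(normed.mean(axis=1))  # every voxel now has a mean of 100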

However, the abstract does not study the test-retest reliability of z-scoring the maps at each time point, which, as @CesarCaballeroGaudes mentioned, is equivalent to performing global signal regression. We still do not know why GIFT does this.

I agree with @emdupre in that normalizing each voxel over time makes more sense, and that's why we thought of implementing this approach during the hackathon.

emdupre (Member) commented Feb 3, 2021

I'm now wondering how much I really understand GIFT's implementation 😅

I also don't have an intuition for why they would perform this kind of GSR-esque normalization, but I think it stands against their own code comment (at line 206), which explicitly identifies the normalization as variance normalization. To make sure I wasn't misunderstanding the previously linked OHBM abstract, I went back to the Beckmann and Smith (2004) paper, which introduced variance normalization and describes their motivation as follows:

An immediate consequence of the fact that we are operating under an isotropic noise model is that as an initial preprocessing step we will modify the original data time courses to be normalized to zero mean and unit variance. This appears to be a sensible step in that on the one hand we know that the voxel-wise standard deviation of resting state data varies significantly over the brain but on the other hand, all voxels' time courses are assumed to be generated from the same noise process. This variance-normalization preconditions the data under the "null hypotheses" of purely Gaussian noise, i.e., in the absence of any signal: the data matrix $X$ is identical up to second-order statistics to a simple set of realizations from a $\mathcal{N}(0, 1)$ noise process.

So, I think we'd be on reasonable ground to move forward with the time-series normalization approach !

CesarCaballeroGaudes (Contributor) commented

Thanks for getting back to the Probabilistic ICA paper. A notable difference is that Beckmann & Smith propose variance normalization, not intensity normalization as suggested in Allen & Calhoun's abstract.

emdupre (Member) commented Feb 3, 2021

A notable difference is that Beckmann & Smith propose variance normalization, not intensity normalization as suggested in Allen & Calhoun's abstract.

Yes, but my point was that the code comment explicitly identifies the procedure as variance normalization. They describe the normalization methods implemented in GIFT here as:

  • 'Remove Mean Per Timepoint' - At each time point, the image mean is removed.
  • 'Remove Mean Per Voxel' - The time-series mean is removed at each voxel.
  • 'Intensity Normalization' - At each voxel, the time series is scaled to have a mean of 100. Since the data is already scaled to percent signal change, there is no need to scale the components.
  • 'Variance Normalization' - At each voxel, the time series is linearly detrended and converted to Z-scores.

and performed here.
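
For reference, a minimal numpy/scipy sketch of what that last 'Variance Normalization' option describes (my own reading, not GIFT's actual MATLAB code; the helper name is hypothetical), assuming a voxels x time array:

import numpy as np
from scipy.signal import detrend

def variance_normalize(data):
    # Linearly detrend each voxel's time series, then convert to z-scores
    # (data: voxels x time)
    detrended = detrend(data, axis=1, type="linear")
    mean = detrended.mean(axis=1, keepdims=True)  # ~0 after detrending
    std = detrended.std(axis=1, keepdims=True)
    return (detrended - mean) / std

rng = np.random.default_rng(0)
normed = variance_normalize(rng.normal(loc=800.0, size=(100, 50)))
print(normed.mean(axis=1).max(), normed.std(axis=1).min())  # ~0 and ~1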

Someone who's more familiar with the GIFT code base -- could you confirm where the icatb_estimate_dimension script is called ? Is it possible that icatb_preproc_data is called first, and that the normalization we see in estimate_dimension takes place after an initial variance (or intensity, or otherwise) normalization ?

notZaki (Contributor, Author) commented Feb 3, 2021

For GIFT, it looks like the default is to remove the mean per timepoint, based on icatb_setup_analysis.m and icatb_defaults.m.

I tried looking into nilearn's CanICA, and it looks like it does temporal normalization during the PCA step, but spatial normalization during ICA. I'll have to dive into it again to make sure I understood it properly.


For my own understanding: would the normalization strategy depend on whether the ICA is spatial or temporal? I assumed it did, so my reasoning was that temporal normalization would be appropriate for temporal ICA, and spatial normalization for spatial ICA. I justified this pairing because I think ICA expects the components to have zero mean, though I don't know if this is necessarily true.


Throwing another hat in the ring: the data could be normalized using only the global mean/std. This would be a uniform transformation, so neither the time series nor the spatial components would have any new "jumpy" changes.
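
In code, that option would amount to a single global z-score (a minimal sketch on synthetic data):

import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=1000.0, size=(500, 60))  # voxels x time

# Global normalization: a single mean and std for the entire matrix.
# One affine transform, so every time series and every map keeps its shape.
data_z = (data - data.mean()) / data.std()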

emdupre (Member) commented Feb 4, 2021

I tried looking into nilearn's CanICA, and it looks like it does temporal normalization during the PCA step, but spatial normalization during ICA.

I feel much better equipped to speak on CanICA than GIFT 😅 I'll give a brief summary of my understanding below, but I think that if the goal is to match the GIFT implementation, this is sufficiently different that it's unlikely to help us here. Instead, if we're able to find a normalization that gives similar results when running the same data through GIFT and maPCA, I think we can move forward with that solution.


CanICA feeds the voxelwise data directly (without any normalization) into an initial PCA to reduce the data dimensionality across subjects; specifically, to reduce the time dimension. This is done as part of the BaseDecomposition class, such that the data has shape n_samples x n_features after this initial dimensionality reduction. n_samples corresponds to the number of retained components, while n_features is equal to the number of voxels.

The stacked data is normalized in MultiPCA for each component (i.e., a scaling factor S is calculated separately for each component and used to norm the data). This normed data is then transposed (now n_features x n_samples) and passed to sklearn's randomized_svd. The returned components are transposed again to be n_samples x n_features. Note that although n_samples could differ between these two initial decompositions, in nilearn the n_samples is controlled by n_components and defaults to twenty throughout.

Now we get to CanICA itself. Using the components returned from MultiPCA (n_samples x n_features), we pass these to scikit-learn's fastica (which expects n_samples x n_features, so our dimensions check out). By default, whiten=True, which means that we are again going to normalize our components individually, this time by subtracting the mean. We then pass the normalized matrix to another PCA and use the estimated left singular vectors and diagonal matrix for whitening. Now we perform the ICA itself. The result is used in CanICA, and after some light cleaning the component maps are returned.
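
For orientation, a generic sketch of the fastica call itself (synthetic mixtures; this is not nilearn's actual code or data orientation, and newer scikit-learn versions spell the whitening option explicitly rather than whiten=True):

import numpy as np
from sklearn.decomposition import fastica

rng = np.random.default_rng(0)
# Two non-Gaussian sources observed through a random mixing matrix;
# rows are observations (n_samples), columns are mixtures (n_features)
S_true = rng.laplace(size=(5000, 2))
A = rng.normal(size=(2, 2))
X = S_true @ A.T  # shape (5000, 2)

# Whitening (centering + PCA) happens inside fastica before the ICA rotation;
# K is the whitening matrix, W the unmixing matrix, S the estimated sources
K, W, S = fastica(X, n_components=2, whiten="unit-variance", random_state=0)
print(S.shape)  # (5000, 2)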

eurunuela (Collaborator) commented

Instead, if we're able to find a normalization that will give similar results when running the same data through GIFT and maPCA, I think we can move forward with that solution.

I totally agree, and #636 does that right now; i.e., it z-scores the volume at each time point. We should merge that in before we cut the release.

As I previously said (maybe in the PR, idk), we could further study these differences in the maPCA repo, and why not, it could even be an abstract for OHBM next year.

Re CanICA: thank you for the explanation, @emdupre. I will have to read the two referenced papers to learn more about the reasoning behind the three different PCAs and their corresponding normalization (or lack thereof).

emdupre (Member) commented Feb 5, 2021

I totally agree and #636 does that right now; i.e. it z-scores the volume for each time-point. We should merge that in before we cut the release.

I think I was confused, since #636 only removes a variable assignment and a code comment. So thanks for clarifying !

As I previously said (maybe in the PR, idk), we could further study these differences in the maPCA repo, and why not, it could even be an abstract for OHBM next year.

Lots of decomp options, for sure 😄 So this is definitely something we can / should come back to !

emdupre (Member) commented Feb 5, 2021

Should we close this and #655 now that #636 is merged ?

notZaki (Contributor, Author) commented Feb 5, 2021

Adding numbers to what @eurunuela mentioned: I ran GIFT's icatb_estimate_dimension(ts_OC, adaptive_mask), where the inputs are tedana's outputs, and compared that against #636. The number of components estimated with mdl was identical:

             GIFT   tedana
Three Echo    68     68
Four Echo     30     30
Five Echo     64     64

Actually, the pre-PR636 branch (no normalization) gave the same result too.


Regarding closing this issue: the normalization question affects multiple parts of tedana: maPCA, kundu-PCA, and ICA.
For maPCA, we're emulating GIFT for now, but whether that is how it should be done is still unclear to me. So I would be in favour of keeping this issue open as a unified discussion on normalization, or continuing it in #655.

eurunuela (Collaborator) commented

Adding numbers to what @eurunuela mentioned: I ran GIFT's icatb_estimate_dimension(ts_OC, adaptive_mask), where the inputs are tedana's outputs, and compared that against #636. The number of components estimated with mdl was identical:

             GIFT   tedana
Three Echo    68     68
Four Echo     30     30
Five Echo     64     64

Actually, the pre-PR636 branch (no normalization) gave the same result too.

This is great news, as this is what we aimed for during the hackathon. I believe the second normalization step in maPCA is the key one, and that's why the results are still the same after #636 was merged.

Regarding closing this issue: the normalization question affects multiple parts of tedana: maPCA, kundu-PCA, and ICA.
For maPCA, we're emulating GIFT for now, but whether that is how it should be done is still unclear to me. So I would be in favour of keeping this issue open as a unified discussion on normalization, or continuing it in #655.

I agree that we should keep this issue open. I believe the different normalization steps are worth investigating.

Thank you once again @notZaki for bringing all this up.

tsalo (Member) commented Mar 16, 2021

Should we close this and #655 now that #636 is merged ?

@emdupre I think this is closable.

notZaki (Contributor, Author) commented Mar 16, 2021

If #702 gets merged then this can be closed. The discussion could be resumed in ME-ICA/mapca if needed.

@tsalo linked a pull request on Mar 16, 2021 that will close this issue