
[FIX] Normalize data to zero mean and unit variance before dimension estimation #636

Merged — 11 commits from normalizeBeforeMAPCA into ME-ICA:main, Feb 5, 2021

Conversation

@notZaki (Contributor) commented Dec 29, 2020

References #653

Input data is normalized prior to dimension estimation in the GIFT/MATLAB implementation, but not in ma_pca.
There is a normalized array data_z, but it isn't used as the input to ma_pca:

data_z = ((data.T - data.T.mean(axis=0)) / data.T.std(axis=0)).T  # var normalize ts
data_z = (data_z - data_z.mean()) / data_z.std()  # var normalize everything
if algorithm in ['mdl', 'aic', 'kic']:
    data_img = io.new_nii_like(ref_img, utils.unmask(data, mask))
    mask_img = io.new_nii_like(ref_img, mask.astype(int))
    voxel_comp_weights, varex, varex_norm, comp_ts = ma_pca.ma_pca(
        data_img, mask_img, algorithm)

Not sure if this was intentional, but this PR normalizes the input in ma_pca.

@tsalo (Member) commented Dec 29, 2020

Hi @notZaki. Thank you for noticing this bug. @eurunuela and I are actually working on moving the MA-PCA code out of tedana and into another repository, so that folks can use it without having to install tedana. We're just waiting on licensing info from the GIFT devs before we can make that repository public.

In the meantime, I would recommend changing the data that tedpca feeds into ma_pca rather than changing ma_pca itself.

@notZaki (Contributor, Author) commented Dec 29, 2020

Sounds good. In that case, I'll hold off on this PR until the new repository becomes public.

The MATLAB code mentions GPLv2 at the top of the script, but the year is listed as 2003-2009, so it's probably better that you're confirming with the devs.

@notZaki notZaki closed this Dec 29, 2020
@notZaki notZaki reopened this Jan 15, 2021
@notZaki (Contributor, Author) commented Jan 15, 2021

I re-opened this PR because there's something I needed to check with the CI.

@codecov bot commented Jan 17, 2021

Codecov Report

Merging #636 (2b385db) into master (ca7e145) will decrease coverage by 0.00%.
The diff coverage is n/a.


@@            Coverage Diff             @@
##           master     #636      +/-   ##
==========================================
- Coverage   93.64%   93.64%   -0.01%     
==========================================
  Files          26       26              
  Lines        2030     2029       -1     
==========================================
- Hits         1901     1900       -1     
  Misses        129      129              
Impacted Files Coverage Δ
tedana/decomposition/ma_pca.py 97.57% <ø> (-0.01%) ⬇️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@eurunuela (Collaborator) commented Jan 19, 2021

I'm sorry I haven't responded before.

We wrote this code in November 2019 and it's been some time, so I cannot quite remember the thinking behind some of the decisions made during the hackathon. I don't know why we added line 483. In the original GIFT code, they do z-score the data before estimating the number of components to keep, so my guess is we were testing how the z-scoring affected the decomposition and forgot to remove that line.

Now, since we're working on maPCA in its own repo, I'd add an if statement to make sure the input data is z-scored, and apply line 482 when it's not.

Edit: Good catch, @notZaki! Thanks!

@jbteves (Collaborator) commented Jan 19, 2021

@eurunuela now that you bring that up, I actually do remember that from when I was reviewing the code, because I'm the one who made you put in that TODO. So that's why it's there.

@eurunuela (Collaborator) commented Jan 19, 2021

> @eurunuela now that you bring that up, I actually do remember that from when I was reviewing the code, because I'm the one who made you put in that TODO. So that's why it's there.

I guess jetlag happened and we just forgot we had to do that. Sorry about that.

Next steps before we merge this PR should be:

Would you like to do that, @notZaki? Or would you prefer I push changes to this PR?

Edit: The second point is mainly for the maPCA repo.

Edit 2: We could simply z-score in time without the if statement.

@jbteves (Collaborator) commented Jan 19, 2021

@eurunuela I'd open an issue in the mapca repo cross-referencing this one so we don't forget a second time ;-)

@eurunuela (Collaborator)

Now that I think about it, I'm starting to remember why we used data instead of data_z. For maPCA we only want to z-score in time, while tedana z-scores both in time and space before the PCA step, hence the normalization step inside maPCA.

@CesarCaballeroGaudes (Contributor)

Yes, I also remember that.

@eurunuela (Collaborator) commented Jan 19, 2021

I've been looking into the different normalization approaches and how they differ. I've checked the one used in GIFT, the sklearn scaler we added during the hackathon, and the z-score applied to the temporal dimension. Here's an overview of what I've found:

I will use this dummy matrix to make the comparisons:

array([[4, 1, 8, 8],
       [7, 3, 7, 5]])

with dimensions 2 (space) by 4 (time).

GIFT

This is how GIFT normalizes variance (see here):

    for n = 1:size(data, 2)
        data(:, n) = detrend(data(:, n), 0) ./ std(data(:, n));
    end

This normalization returns the following matrix:

array([[-0.70710678, -0.70710678,  0.70710678,  0.70710678],
       [ 0.70710678,  0.70710678, -0.70710678, -0.70710678]])

If you try to replicate this in Python, you will see that the std functions return different values. To compute the standard deviation the way MATLAB does, you'd need to pass the ddof=1 argument to the NumPy call.
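As a minimal NumPy sketch (not part of the PR), the GIFT-style per-column normalization can be reproduced on the dummy matrix above; ddof=1 matches MATLAB's sample standard deviation:

```python
import numpy as np

# The 2 (space) x 4 (time) dummy matrix from above
data = np.array([[4, 1, 8, 8],
                 [7, 3, 7, 5]], dtype=float)

# GIFT removes each column's mean (detrend(x, 0)) and divides by its std.
# MATLAB's std() is the sample estimate (N - 1), so NumPy needs ddof=1.
gift_like = (data - data.mean(axis=0)) / data.std(axis=0, ddof=1)

print(gift_like)
# [[-0.70710678 -0.70710678  0.70710678  0.70710678]
#  [ 0.70710678  0.70710678 -0.70710678 -0.70710678]]
```

With ddof=0 instead, every entry would be ±1, which is exactly the sklearn result shown below.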

Sklearn scaler

This is how we implemented it during the hackathon for maPCA (see here):

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler(with_mean=True, with_std=True)
scaler.fit_transform(data)

And it returns:

array([[-1., -1.,  1.,  1.],
       [ 1.,  1., -1., -1.]])

Z-score in temporal dimension

This is how tedana does z-score in the temporal dimension (see here):

data_z = ((data.T - data.T.mean(axis=0)) / data.T.std(axis=0)).T

And it returns:

array([[-0.42409446, -1.44192118,  0.93300782,  0.93300782],
       [ 0.90453403, -1.50755672,  0.90453403, -0.30151134]])

Conclusion

It is clear that the three approaches yield very different results.

I've gone through the paper again, and having zero mean and unit variance in the temporal dimension is what makes sense. Mathematically speaking, we're interested in calculating the entropy rate of a Gaussian random process (x[n], n=1,2,...,N; where N is the number of samples, E{x[n]} = 0 and E{x^2[n]} = 1).

To me, z-scoring in the temporal domain makes the most sense and is what I'd do (and that's what has been suggested so far in this PR). However, I wanted to make sure we all see exactly what the different approaches are doing.
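To see that the temporal z-score satisfies those two conditions, here is a quick sketch (using the dummy matrix from the comparison above) checking that each voxel's time series ends up with zero mean and unit variance:

```python
import numpy as np

# The 2 (space) x 4 (time) dummy matrix from the comparison above
data = np.array([[4, 1, 8, 8],
                 [7, 3, 7, 5]], dtype=float)

# tedana-style z-score along the temporal dimension
data_z = ((data.T - data.T.mean(axis=0)) / data.T.std(axis=0)).T

# Each voxel's time series now has E{x[n]} = 0 and E{x^2[n]} = 1
print(data_z.mean(axis=1))  # ~[0. 0.]
print(data_z.var(axis=1))   # ~[1. 1.]
```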

@notZaki (Contributor, Author) commented Jan 19, 2021

Thanks for sharing the results @eurunuela

I think the difference between GIFT and StandardScaler will become negligible with a larger matrix. They seem very different on the example data because the first dimension (space) only has two values, so the sample and population standard deviation produce very different values.

If z-scoring in the temporal domain is indeed what makes sense, then perhaps this can be relayed to the GIFT group to confirm whether this was a bug or intentional. This would be more relevant for the mapca repo because it would be diverging from GIFT.

@jbteves (Collaborator) commented Jan 19, 2021

@eurunuela would you mind quickly opening an issue explaining briefly what tedana's options and current behavior are? I'm afraid that your comment will be missed in a PR rather than an issue.

@eurunuela (Collaborator)

> I think the difference between GIFT and StandardScaler will become negligible with a larger matrix. They seem very different on the example data because the first dimension (space) only has two values, so the sample and population standard deviation produce very different values.

You're right. I didn't think of it as I was just trying to keep the demonstration simple. I've just tested the same commands with a bigger matrix (20000 x 100) and the results between GIFT and sklearn are very similar. The biggest difference is in the order of 4e-5.
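The shrinking gap between the sample and population estimates can be reproduced with a short sketch (the matrix size mirrors the test above; the seed and data are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.standard_normal((20000, 100))  # 20000 voxels x 100 time points

# GIFT uses the sample std (ddof=1); StandardScaler uses the population std (ddof=0)
z_sample = (data - data.mean(axis=0)) / data.std(axis=0, ddof=1)
z_population = (data - data.mean(axis=0)) / data.std(axis=0, ddof=0)

# The two stds differ by a factor of sqrt(N / (N - 1)), which is
# negligible for N = 20000 rows
print(np.abs(z_sample - z_population).max())  # on the order of 1e-4
```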

> If z-scoring in the temporal domain is indeed what makes sense, then perhaps this can be relayed to the GIFT group to confirm whether this was a bug or intentional. This would be more relevant for the mapca repo because it would be diverging from GIFT.

There will be no need for that. I've checked that the sklearn scaler z-scores in the temporal domain, so we're safe.

@notZaki (Contributor, Author) commented Jan 20, 2021

> There will be no need for that. I've checked that the sklearn scaler z-scores in the temporal domain, so we're safe.

I thought that the earlier results were showing that the sklearn scaler is z-scoring in the spatial domain instead of the temporal domain.

@eurunuela (Collaborator) commented Jan 20, 2021

> I thought that the earlier results were showing that the sklearn scaler is z-scoring in the spatial domain instead of the temporal domain.

I've looked into that in more depth (see #655). The sklearn scaler appears to be z-scoring both in the spatial and temporal domains.

@eurunuela (Collaborator) previously approved these changes Jan 23, 2021 and left a comment:

Thank you @notZaki! I think we should have this PR in our next release to ensure both normalization steps are done.

@emdupre (Member) commented Jan 27, 2021

> Thank you @notZaki! I think we should have this PR in our next release to ensure both normalization steps are done.

Thanks everyone for digging into this! Just to be clear: is this PR ready for reviews?

@eurunuela (Collaborator)

> Thanks everyone for digging into this! Just to be clear: is this PR ready for reviews?

I think it's ready to get merged, so yes.

@jbteves (Collaborator) commented Jan 27, 2021

@notZaki could you please confirm by switching this PR from "Draft" to "Ready for Review"?

@notZaki (Contributor, Author) commented Jan 27, 2021

On my end, I am not completely sure whether the normalization should be done along voxels or along time.
If there is no doubt that normalization should be along time, then this is ready to get merged. Otherwise, I think the discussions in #653 and #655 need to be resolved first.

@notZaki (Contributor, Author) commented Jan 29, 2021

Revisiting this: If the next release is scheduled for next week, then some kind of change should be made because mdl is the default option and I think everyone agrees that the current version (no normalization) is not ideal.

Options are:

  • (A) Normalize in temporal domain
    • Pro: Consistent with how tedana z-scores elsewhere in decomposition/pca
  • (B) Normalize along spatial dimension
    • Pro: This is what GIFT does

I am leaning towards option B because I think it's safer to follow an established library (GIFT), but I'm curious what everyone else's thoughts are.
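For concreteness, the two options differ only in which axis gets normalized. A hypothetical NumPy sketch (the array shape, seed, and variable names are illustrative, not from the PR):

```python
import numpy as np

rng = np.random.default_rng(42)
data = rng.standard_normal((100, 10)) * 3 + 1  # voxels x time, arbitrary scale/offset

# Option A: z-score each voxel's time series (temporal domain, tedana-style)
option_a = ((data.T - data.T.mean(axis=0)) / data.T.std(axis=0)).T

# Option B: z-score each time point across voxels (spatial domain, GIFT-style;
# sample std to match MATLAB)
option_b = (data - data.mean(axis=0)) / data.std(axis=0, ddof=1)

print(np.allclose(option_a.mean(axis=1), 0))  # True: rows (time series) are zero mean
print(np.allclose(option_b.mean(axis=0), 0))  # True: columns (time points) are zero mean
```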

@eurunuela (Collaborator)

Why don't we go with option B (safest, and it will select more components) and study option A in the maPCA repo?

@eurunuela (Collaborator) left a review comment:

Thank you @notZaki

@eurunuela (Collaborator)

Can we please merge this before we cut the release? It makes sure that we're normalizing the data as the first step of maPCA (even if the results don't vary too much).

@emdupre @tsalo @jbteves

@emdupre (Member) commented Feb 1, 2021

> Can we please merge this before we cut the release?

Yes, I think we should! @notZaki had made a great point re: which to use, and I wanted to drag out my old linear algebra text to confirm. But I can commit to doing that by Wednesday, which should still leave us enough time to get it merged before cutting on Friday!

@eurunuela (Collaborator)

> > Can we please merge this before we cut the release?
>
> Yes, I think we should! @notZaki had made a great point re: which to use, and I wanted to drag out my old linear algebra text to confirm. But I can commit to doing that by Wednesday, which should still leave us enough time to get it merged before cutting on Friday!

Sounds great! Right now, the PR mimics GIFT. I will have a look at the papers too just to make sure.

Base automatically changed from master to main February 1, 2021 23:57
@emdupre (Member) left a review comment:

Thanks @notZaki! 🚀

@eurunuela (Collaborator)

Thanks @notZaki and @emdupre. I'm merging this PR.

@eurunuela eurunuela merged commit f8802df into ME-ICA:main Feb 5, 2021
@notZaki notZaki deleted the normalizeBeforeMAPCA branch February 5, 2021 15:49