
[FIX] Normalize PCA mixing matrix over time, not component #228

Merged (8 commits) on Mar 11, 2019

Conversation

@tsalo (Member) commented Feb 28, 2019

References #223. One of the concerns I brought up in #223 is that the normalized PCA mixing matrix, which is only used to calculate the weight maps (WTS) within fitmodels_direct, is normalized over component, rather than over time. This strikes me as invalid, though I could be misinterpreting the purpose of the normalization. This will not impact the MLE dimensionality estimation, but should improve the validity of the Kundu PCA decision tree.

Changes proposed in this pull request:

  • Z-score the PCA mixing matrix over time (per component), rather than over components (per TR); a sketch of the difference is included below.
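
For concreteness, a minimal sketch of the two normalizations. The time-by-component orientation and the use of scipy.stats.zscore here are illustrative assumptions, not tedana's actual code:

```python
import numpy as np
from scipy import stats

# Hypothetical mixing matrix: rows are timepoints, columns are components.
rng = np.random.default_rng(seed=0)
mmix = rng.standard_normal((160, 80))

# Old behavior: z-score each timepoint (row) across components.
mmix_old = stats.zscore(mmix, axis=1)

# Proposed behavior: z-score each component (column) over time.
mmix_new = stats.zscore(mmix, axis=0)
```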

@ME-ICA deleted a comment from codecov bot Feb 28, 2019
@ME-ICA deleted a comment from codecov bot Mar 1, 2019
@ME-ICA deleted a comment from codecov bot Mar 1, 2019
@codecov bot commented Mar 1, 2019

Codecov Report

Merging #228 into master will increase coverage by 0.02%.
The diff coverage is 0%.


@@            Coverage Diff             @@
##           master     #228      +/-   ##
==========================================
+ Coverage   47.83%   47.86%   +0.02%     
==========================================
  Files          33       33              
  Lines        2013     2012       -1     
==========================================
  Hits          963      963              
+ Misses       1050     1049       -1
Impacted Files                          Coverage Δ
tedana/decomposition/eigendecomp.py     10.34% <0%> (+0.05%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@ME-ICA deleted a comment from codecov bot Mar 2, 2019
@tsalo requested a review from emdupre March 6, 2019 14:08
@emdupre (Member) commented Mar 10, 2019

To the best of my understanding, the purpose of normalization was so that each component would be on the same scale when passed on to the PCA decision tree (and therefore could use, e.g., the same thresholds). If we normalize by time, that assumption no longer holds.

Could you explain a little more why you were thinking it would make sense to normalize over time?

@tsalo (Member, Author) commented Mar 10, 2019

The old way z-scored each timepoint across components, while this fix z-scores each component across timepoints. The former approach doesn't just rescale the time series, but also changes them. Granted, the difference is small (for a random array with 80 components and 160 timepoints, the correlation before vs. after z-scoring is ~0.99), but, as far as I can tell, it still isn't a valid change to make. See the sketch below.
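
As a rough illustration of that check (the exact array shape, seed, and use of scipy.stats.zscore are assumptions, not the code that was actually run):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=0)
mmix = rng.standard_normal((160, 80))  # 160 timepoints x 80 components

# Old way: z-score each timepoint across components. This mixes in the
# row-wise mean/std, so each component's time series changes shape.
old = stats.zscore(mmix, axis=1)
r_old = [np.corrcoef(mmix[:, i], old[:, i])[0, 1] for i in range(mmix.shape[1])]
print(np.mean(r_old))  # high (~0.99), but not exactly 1

# New way: z-score each component over time. This is an affine rescaling
# of each time series, so the correlation with the original is exactly 1.
new = stats.zscore(mmix, axis=0)
r_new = [np.corrcoef(mmix[:, i], new[:, i])[0, 1] for i in range(mmix.shape[1])]
print(np.mean(r_new))  # 1.0 (up to floating point)
```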

@emdupre (Member) commented Mar 11, 2019

Ah, sorry, I think I misunderstood! So to clarify (for me):

  • What is the correlation between the original components and this new normalization? The 0.99 is with the old normalization, correct?

  • I think we should be checking the PCA selection tree in the integration tests. This seems like as good a time as any, since I want to know how this is impacting the PCA selection. WDYT? Should we add it to the three-echo dataset?

@tsalo (Member, Author) commented Mar 11, 2019

The time series correlate perfectly after z-scoring the new way. Yeah, the old way gets 0.999.

That's a good idea. I can change the three-echo integration test in this PR.

@emdupre (Member) commented Mar 11, 2019

If this is correlating at 1.0, then I'm wondering if it's really a necessary normalization -- obviously removing the old one seems to be!

If we can fix the merge conflict (sorry, I think I pulled it in with #208), then this LGTM!

@tsalo (Member, Author) commented Mar 11, 2019

It's used to generate the normalized version of the mixing matrix, which is used in fitmodels_direct. Hypothetically, the scale of the time series should impact the parameter estimates, and I believe that the goal of the z-scoring is to make the parameter estimates from computefeats2 equivalent to betas. This is definitely something worth discussing in the larger context of how metric calculation should be performed (e.g., in #178, #179, and #223), but if we work under the assumption that the current version of fitmodels_direct is generally correct, then this is still a bug that needs to be fixed.
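
To illustrate the intuition about betas (this is only a sketch of the general statistics, not tedana's fitmodels_direct or computefeats2 code): when both the regressor and the data are z-scored over time, the ordinary least-squares slope equals the Pearson correlation, i.e. a standardized beta.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=1)
x = rng.standard_normal(160)             # a single component time series
y = 2.0 * x + rng.standard_normal(160)   # a single voxel time series

# Z-score both over time, then fit an intercept-free OLS model.
xz = stats.zscore(x)
yz = stats.zscore(y)
beta = np.linalg.lstsq(xz[:, np.newaxis], yz, rcond=None)[0][0]

print(np.isclose(beta, np.corrcoef(x, y)[0, 1]))  # True: beta equals Pearson r
```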

@emdupre (Member) commented Mar 11, 2019

I think this is good to merge. It sounds like we need to do some issue clean-up around metric calculation -- we can deal with that after this is in :)

@jbteves added the "breaking change" label (will make a non-trivial change to outputs) and removed the "output-change" label on Apr 19, 2021