computefeats2 does not calculate z-statistics accurately #178
Comments
If we're interested in getting z-scores for the betas, then I think the following is an appropriate replacement for
|
The problem with my proposed approach above is that test statistics for parameter estimates seem to be meaningless when the degrees of freedom are low. The t-statistics end up being grossly inflated (often in the millions to billions with dof = 1), and the z-statistics then end up being binarized at +/- 8 (the max value). Here are a couple of possible solutions:
|
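As an aside (my own illustration, not something from the original thread): the "+/- 8" ceiling described above comes from clamping p-values at machine precision before the normal-quantile transform, so every saturated statistic maps to the same z-value. A minimal sketch:

import numpy as np
from scipy.special import ndtri

# Largest two-tailed z obtainable once p is clamped at machine epsilon,
# as in the p_to_z helper sketched later in this thread.
eps = np.spacing(1)                     # ~2.22e-16
z_ceiling = ndtri(1 - eps / 2)          # roughly 8.2
print(z_ceiling)

# Any p-value that underflows past eps collapses to that same ceiling,
# which is what makes the resulting maps look binarized at +/- 8.
for p in [1e-16, 1e-300, 0.0]:
    p_clamped = max(p, eps)
    print(p, ndtri(1 - p_clamped / 2))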
Could you elaborate on point 1? If it's good but computationally intense, we can see if we're clever enough to find a way around that problem. |
I believe that it would look something like this [completely untested code]:

import numpy as np
from scipy import stats
from scipy.special import ndtri

# assumed import: get_coeffs lives in the model-fitting module referenced below
from tedana.model.fit import get_coeffs


def computefeats2(data, mmix, mask, n_iters=1000):
    """
    Converts `data` to component space using `mmix`

    Parameters
    ----------
    data : (S x T) array_like
        Input data
    mmix : (T [x C]) array_like
        Mixing matrix for converting input data to component space, where `C`
        is components and `T` is the same as in `data`
    mask : (S,) array_like
        Boolean mask array
    n_iters : int, optional
        Number of iterations to use to build null distributions. Default is 1000.

    Returns
    -------
    data_Z : (S x C) :obj:`numpy.ndarray`
        Voxel-wise Z-statistics for each component
    """
    if data.ndim != 2:
        raise ValueError('Parameter data should be 2d, not {0}d'.format(data.ndim))
    elif mmix.ndim != 2:
        raise ValueError('Parameter mmix should be 2d, not '
                         '{0}d'.format(mmix.ndim))
    elif mask.ndim != 1:
        raise ValueError('Parameter mask should be 1d, not {0}d'.format(mask.ndim))
    elif data.shape[0] != mask.shape[0]:
        raise ValueError('First dimensions (number of samples) of data ({0}) '
                         'and mask ({1}) do not match.'.format(data.shape[0],
                                                               mask.shape[0]))
    elif data.shape[1] != mmix.shape[0]:
        raise ValueError('Second dimensions (number of volumes) of data ({0}) '
                         'and mmix ({1}) do not match.'.format(data.shape[1],
                                                               mmix.shape[0]))

    # z-score inputs to regression
    data_z = stats.zscore(data[mask], axis=-1)
    mmix_z = stats.zscore(mmix, axis=0)

    # get betas of `data`~`mmix`
    # (S x C)
    data_betas = get_coeffs(data_z, mmix_z, mask=None)

    # build null distributions by permuting each component's time series and
    # re-fitting; the permutation is seeded per iteration for reproducibility
    betas_null = np.zeros(data_betas.shape + (n_iters,))  # S x C x n_iters
    for i_iter in range(n_iters):
        for j_comp in range(mmix.shape[1]):
            temp_mmix = mmix_z.copy()
            idx = np.random.RandomState(seed=i_iter).permutation(mmix.shape[0])
            temp_mmix[:, j_comp] = temp_mmix[idx, j_comp]
            temp = get_coeffs(data_z, temp_mmix, mask=None)
            betas_null[:, j_comp, i_iter] = temp[:, j_comp]

    # Get z-values for betas using empirical nulls
    p_values = null_to_p(data_betas, betas_null, tail='two')
    signs = np.sign(data_betas - np.mean(betas_null, axis=-1))
    data_Z = p_to_z(p_values, tail='two')
    data_Z *= signs
    return data_Z


def null_to_p(test_value, null_array, tail='two'):
    """
    Return p-value for test value against null array.

    Parameters
    ----------
    test_value : :obj:`float`, :obj:`int`, or array_like
        Either a single value to compare to a 1D array or an ND array to
        compare to an (N+1)D array.
    null_array : array_like
        An array with one more dimension than test_value. The last dimension
        should be the permutations.
    tail : {'two', 'upper', 'lower'}
        Tails to use for the calculation of p-values.

    Returns
    -------
    p_value : :obj:`float` or array_like
        Either a single p-value or an array with the same shape as test_value
    """
    if isinstance(test_value, np.ndarray):
        # recurse over the flattened array, one null distribution per value
        assert null_array.ndim == (test_value.ndim + 1)
        values_1d = np.reshape(test_value, test_value.size)
        nulls_2d = np.reshape(null_array, (test_value.size, null_array.shape[-1]))
        p_values = np.zeros(values_1d.shape)
        for idx in range(values_1d.shape[0]):
            p_values[idx] = null_to_p(values_1d[idx], nulls_2d[idx, :], tail=tail)
        p_values = np.reshape(p_values, test_value.shape)
        return p_values

    if tail == 'two':
        p_value = (50 - np.abs(stats.percentileofscore(
            null_array, test_value) - 50.)) * 2. / 100.
    elif tail == 'upper':
        p_value = 1 - (stats.percentileofscore(null_array, test_value) / 100.)
    elif tail == 'lower':
        p_value = stats.percentileofscore(null_array, test_value) / 100.
    else:
        raise ValueError('Argument "tail" must be one of ["two", "upper", '
                         '"lower"]')
    return p_value


def p_to_z(p, tail='two'):
    """
    Convert p-values to z-values.

    Parameters
    ----------
    p : array_like
        Array of p-values
    tail : {'one', 'two'}
        Tails to use.

    Returns
    -------
    z : array_like
        Array of z-values
    """
    eps = np.spacing(1)
    p = np.array(p)
    p[p < eps] = eps
    if tail == 'two':
        z = ndtri(1 - (p / 2))
        z = np.array(z)
    elif tail == 'one':
        z = ndtri(1 - p)
        z = np.array(z)
        z[z < 0] = 0
    else:
        raise ValueError('Argument "tail" must be one of ["one", "two"]')

    if z.shape == ():
        z = z[()]
    return z

UPDATE: This takes about 4 minutes per iteration with 160 components. To build a null distribution, I think we'd want at least 1000 iterations. |
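For context, a hypothetical usage sketch (not from the thread) of how the proposed computefeats2 above would be invoked; the shapes and random data are placeholders, and it assumes the tedana get_coeffs import in the sketch resolves:

import numpy as np

# Hypothetical shapes: S samples (voxels), T volumes, C components.
n_samples, n_vols, n_comps = 1000, 50, 10
rng = np.random.RandomState(42)

data = rng.normal(size=(n_samples, n_vols))   # S x T voxel time series
mmix = rng.normal(size=(n_vols, n_comps))     # T x C mixing matrix
mask = np.ones(n_samples, dtype=bool)         # keep every sample

# A small n_iters keeps the sketch fast; a real null distribution would
# need on the order of 1000 iterations, per the timing note above.
data_Z = computefeats2(data, mmix, mask, n_iters=100)
print(data_Z.shape)                           # (n_samples, n_comps)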
@smoia You've been digging into MELODIC's code a bit. Do you know how they convert their component maps into z-statistics? |
If we just use more aggressive dimensionality reduction in the PCA step prior to metric calculation, then the properly calculated statistics should be valid. We can do this by setting |
If we want to go with the permutation approach, it looks like nilearn.mass_univariate.permuted_ols does what I was thinking, so it's an issue that's been dealt with already I guess. It would add a lot of time to the workflow, but we could maybe support it as an option. |
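For reference, a rough, untested sketch of how permuted_ols might slot in here, treating time points as the "samples". The variable names (data, mmix, mask) and the p_to_z helper come from the earlier sketch, and the parameter names and return values reflect my reading of nilearn's documentation, so they are worth double-checking against the installed version:

import numpy as np
from scipy import stats
from nilearn.mass_univariate import permuted_ols

data_z = stats.zscore(data[mask], axis=-1)   # S x T
mmix_z = stats.zscore(mmix, axis=0)          # T x C

# Regress every masked voxel time series (targets) on the component time
# series (tested variables); permutations build the null internally.
neg_log_pvals, t_scores, _ = permuted_ols(
    tested_vars=mmix_z,        # T x C (samples are time points here)
    target_vars=data_z.T,      # T x S
    model_intercept=True,
    n_perm=1000,
    two_sided_test=True,
    random_state=42)

# Outputs are (C x S); convert the -log10 p-values to signed z-statistics.
p_values = 10 ** -neg_log_pvals.T            # S x C
data_Z = np.sign(t_scores.T) * p_to_z(p_values, tail='two')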
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions to tedana :tada:! |
This issue isn't as stale as you think, bot. |
This issue should be somewhat resolved by our shift to MA-PCA methods (see #178 (comment)), and we decided to refactor the regression function instead of using a nonparametric approach like the one proposed above. We still need to do the refactor, which includes converting z-values to z-statistics. |
This is coming up as well in the BrainHack Donostia project |
Still need the contribution? I can look into it! |
I think that the current solution is available here. @CesarCaballeroGaudes @smoia do you know what, if anything, still needs to be done to get it working? EDIT: I'm happy to handle merge conflicts and any documentation tuning that needs to happen if the core code is working. |
That would be amazing. Thank you! |
I believe that `computefeats2` tries to calculate correlation coefficients between the input data and the mixing matrix (see below) by variance normalizing the data and using `np.linalg.lstsq`. It then crops extreme values (<-0.999 and >0.999) and converts the correlation coefficients to z-values. I assumed that the cropping was to prevent large z-values with correlation coefficients approaching 1 and -1. However, it doesn't look like normalization is doing what we want, and the "correlation coefficients" can in fact end up quite large. I added in a couple of lines to print out the max and min values of `data_R` before cropping, and got values as high as +/- 12 with our five-echo test dataset.

If I'm right, then this is a bug, although I don't recall if `computefeats2` is used for anything beyond generating component weight maps.

NOTE: It is used to compute WTS in `fitmodels_direct`, which can impact computed metrics. The mixing matrix is normalized in `tedica`, so there is no meaningful impact on ICA component maps or metrics, but this isn't done for `tedpca`, so there are small differences between metrics and more noticeable differences between component maps. The `tedpca` component maps almost look binarized because so many voxels are cropped.

Referenced code: tedana/tedana/model/fit.py, lines 310 to 318 in 1bc32e4
However, before I try fixing this, I want to make sure that I'm interpreting the function correctly. Is anyone familiar enough with this function to take a look?
BTW, to fix the issue (and get betas equivalent to correlation coefficients), I would do the following:
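As a rough sketch of one way to get betas that are literal correlation coefficients (my guess at the general idea, not necessarily the exact change intended here): z-score both the data and the mixing matrix, then take the per-component dot product over time, which gives Pearson r values bounded in [-1, 1]. The helper name is hypothetical:

import numpy as np
from scipy import stats

def correlation_betas(data, mmix, mask):
    """Pearson correlation between each masked voxel time series (S x T)
    and each component time series (T x C). Hypothetical helper."""
    data_z = stats.zscore(data[mask], axis=-1)   # zero mean, unit variance per voxel
    mmix_z = stats.zscore(mmix, axis=0)          # zero mean, unit variance per component
    n_vols = data_z.shape[1]
    # For variance-normalized series, the time-wise dot product divided by T
    # is exactly the Pearson r, so no cropping at +/- 0.999 is needed.
    # Note: these are marginal (one component at a time) correlations, not
    # the partial betas from a joint multiple regression.
    return np.dot(data_z, mmix_z) / n_vols       # S x C, bounded in [-1, 1]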