[FIX] Impute: sparse #2357

Merged
merged 5 commits into from
Jun 2, 2017

jerneju
Contributor

@jerneju jerneju commented May 30, 2017

Issue

Fixes #2349.

Description of changes

Work in progress.

Includes
  • Code changes
  • Tests
  • Documentation

@nikicc
Contributor

nikicc commented May 30, 2017

@jerneju I already did some debugging on this on Friday, and IMO the problem is in Orange/statistics/util.py, in the stats function, which for sparse input returns X.min and X.max. I think X.min and X.max don't handle missing values and return np.nan when some values are missing, whereas we should probably return the minimum and maximum among the defined values only. They should probably just be replaced with the nanmin and nanmax functions (L256–272 in the same file).
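
The suggestion above can be sketched as a NaN-aware min/max over a sparse matrix. This is only an illustration under stated assumptions: `sparse_nanmin_nanmax` is a hypothetical helper, not the `nanmin`/`nanmax` code in Orange, and it treats every entry that is not explicitly stored as a defined zero.

```python
import numpy as np
import scipy.sparse as sp


def sparse_nanmin_nanmax(x):
    """Hypothetical helper: min and max over the defined (non-NaN) values."""
    x = sp.csr_matrix(x)
    defined = x.data[~np.isnan(x.data)]
    # Entries that are not stored explicitly are implicit zeros and count
    # as defined values.
    n_implicit_zeros = x.shape[0] * x.shape[1] - x.nnz
    if n_implicit_zeros > 0:
        defined = np.append(defined, 0.0)
    if defined.size == 0:
        # Every value is missing.
        return np.nan, np.nan
    return float(defined.min()), float(defined.max())


x = sp.csr_matrix(np.array([[0.0], [np.nan], [9.0]]))
lo, hi = sparse_nanmin_nanmax(x)
```

For the column used in this PR's test (0, missing, 9), the helper reports the range of the two defined values rather than NaN.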

@jerneju
Contributor Author

jerneju commented Jun 1, 2017

Well, additional issue:
screenshot_20170601_135914

@nikicc nikicc added the DH2017 label Jun 1, 2017
"""
x = np.array([[0], [np.nan], [9]])
x = sp.csr_matrix(x)
self.assertEqual(stats(x)[0][2], 3.)
Contributor

This should be 4.5 not 3.
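
To make the expected value concrete: the column holds 0, a missing value, and 9, so the mean over the two defined values is (0 + 9) / 2 = 4.5, while dividing by all three entries gives 3. A minimal dense NumPy check (the variable names are illustrative only):

```python
import numpy as np

values = np.array([0.0, np.nan, 9.0])

# Dividing the NaN-ignoring sum by the total number of entries counts the
# missing value as if it were a zero.
mean_over_all = np.nansum(values) / values.size

# Dividing by the number of defined entries leaves the missing value out.
n_defined = values.size - np.isnan(values).sum()
mean_over_defined = np.nansum(values) / n_defined
```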


n_values = np.prod(x.shape) - np.sum(np.isnan(x.data))
return np.nansum(x.data) / n_values
x.data = np.nan_to_num(x.data)
Contributor

nan_to_num converts np.nans to zeros, which causes mean to also treat them as zeros. E.g. for the sparse array of [np.nan, np.nan, 1] this implementation returns 0.33 instead of 1.

What's wrong with the previous implementation?
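
The difference can be reproduced with a small sketch. `mean_with_nan_to_num` is an assumed reconstruction of the approach under review (only its `nan_to_num` line is quoted in this thread), while `mean_over_defined` wraps the two lines quoted above as the previous implementation; both function names are hypothetical.

```python
import numpy as np
import scipy.sparse as sp


def mean_with_nan_to_num(x):
    """Assumed shape of the reviewed approach: NaNs become zeros first."""
    x = x.copy()
    x.data = np.nan_to_num(x.data)
    return x.data.sum() / (x.shape[0] * x.shape[1])


def mean_over_defined(x):
    """The previous implementation: divide by the count of defined values."""
    n_values = np.prod(x.shape) - np.sum(np.isnan(x.data))
    return np.nansum(x.data) / n_values


x = sp.csr_matrix(np.array([[np.nan, np.nan, 1.0]]))
zeroed = mean_with_nan_to_num(x)   # NaNs are counted as zeros
ignored = mean_over_defined(x)     # NaNs are left out of the count
```

Here `zeroed` is one third and `ignored` is 1, matching the numbers in the comment.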

if not sp.issparse(c):
    c = np.array(c, copy=True)
else:
    c = c.copy()
Contributor

Why do we need a copy? Doesn't toarray() already take care of this?
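
A small check of that point, assuming a SciPy sparse matrix: `toarray()` allocates a new dense array, so writing to the result does not change the sparse matrix it came from. The column values are made up for illustration.

```python
import numpy as np
import scipy.sparse as sp

c = sp.csc_matrix(np.array([[0.0], [np.nan], [9.0]]))

dense = c.toarray()
# The dense array does not share memory with the sparse matrix's storage.
shares = np.shares_memory(dense, c.data)

# Writing to the dense array leaves the sparse matrix untouched.
dense[2, 0] = -1.0
still_nine = c[2, 0]
```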

    c = np.array(c, copy=True)
else:
    c = c.copy()
    c = c.toarray().flatten()
Contributor

Should we use ravel instead, which doesn't necessarily make another copy?
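
For reference, the difference between the two NumPy methods can be checked directly. The `dense` array below merely stands in for the result of `c.toarray()`.

```python
import numpy as np

dense = np.array([[0.0], [np.nan], [9.0]])  # stands in for c.toarray()

flat = dense.flatten()  # always returns a copy
rav = dense.ravel()     # returns a view when the memory layout allows it

flat_shares = np.shares_memory(dense, flat)
rav_shares = np.shares_memory(dense, rav)
```

Since the dense array is already a fresh copy, the view returned by `ravel` avoids a second allocation.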

@nikicc nikicc added this to the 3.4.3 milestone Jun 2, 2017
@nikicc nikicc self-assigned this Jun 2, 2017
@codecov-io

codecov-io commented Jun 2, 2017

Codecov Report

Merging #2357 into master will decrease coverage by 0.03%.
The diff coverage is 85.41%.

@@            Coverage Diff             @@
##           master    #2357      +/-   ##
==========================================
- Coverage   73.41%   73.38%   -0.04%     
==========================================
  Files         317      317              
  Lines       55653    55664      +11     
==========================================
- Hits        40859    40850       -9     
- Misses      14794    14814      +20

@@ -281,14 +281,32 @@ def mean(x):
return np.sum(x.data) / n_values


def nanmean(x):
def nanmean(x, axis=None):
Contributor

What about:

def nanmean(x, axis=None):
    """ Equivalent of np.nanmean that supports sparse or dense matrices. """
    def nanmean_sparse(x):
        n_values = np.prod(x.shape) - np.sum(np.isnan(x.data))
        return np.nansum(x.data) / n_values

    if not sp.issparse(x):
        return np.nanmean(x, axis=axis)
    if axis is None:
        return nanmean_sparse(x)
    if axis in [0, 1]:
        arr = x if axis == 1 else x.T
        return np.array([nanmean_sparse(row) for row in arr])
    else:
        raise NotImplementedError
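
A quick check of the suggestion above against `np.nanmean` on the equivalent dense array. The function body is copied from the comment; the test matrix is made up for illustration, and the sketch assumes a `scipy.sparse` matrix class (such as `csr_matrix`) whose iteration yields one-row sparse matrices.

```python
import numpy as np
import scipy.sparse as sp


def nanmean(x, axis=None):
    """ Equivalent of np.nanmean that supports sparse or dense matrices. """
    def nanmean_sparse(x):
        # Implicit zeros are defined values; only stored NaNs are excluded.
        n_values = np.prod(x.shape) - np.sum(np.isnan(x.data))
        return np.nansum(x.data) / n_values

    if not sp.issparse(x):
        return np.nanmean(x, axis=axis)
    if axis is None:
        return nanmean_sparse(x)
    if axis in [0, 1]:
        arr = x if axis == 1 else x.T
        return np.array([nanmean_sparse(row) for row in arr])
    else:
        raise NotImplementedError


dense = np.array([[1.0, np.nan, 0.0],
                  [0.0, 4.0, np.nan]])
sparse = sp.csr_matrix(dense)

overall = nanmean(sparse)
per_column = nanmean(sparse, axis=0)
per_row = nanmean(sparse, axis=1)
```

Note that a row or column consisting only of NaNs would divide zero by zero here, so a caller may want to handle that case separately.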

Contributor Author

Well, I did some speed testing. The results are interesting and are listed below:

Ratio for axis 0 : 1.558
Ratio for axis 1 : 0.664

if not sp.issparse(c):
    c = np.array(c, copy=True)
else:
    c = c.toarray().ravel()
Contributor

What if we take only c.data here, so that we wouldn't need to densify the whole column? Consequently, we would only need to set c.data in L314.
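
One way this idea might look, as a sketch rather than the code merged in this PR: only the stored entries are inspected and overwritten, so the column never has to be converted to a dense array. `fill_missing_sparse` is a hypothetical name.

```python
import numpy as np
import scipy.sparse as sp


def fill_missing_sparse(col, value):
    """Hypothetical helper: replace NaNs among the stored entries only."""
    col = col.copy()                      # leave the caller's column intact
    col.data[np.isnan(col.data)] = value  # implicit zeros are never touched
    return col


col = sp.csc_matrix(np.array([[0.0], [np.nan], [9.0]]))
filled = fill_missing_sparse(col, 4.5)
```

If the fill value is itself zero, the result keeps explicitly stored zeros; calling `eliminate_zeros()` on it afterwards would drop them.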

@nikicc nikicc changed the title [WIP][FIX] Impute: sparse [FIX] Impute: sparse Jun 2, 2017
@nikicc nikicc merged commit d79e46a into biolab:master Jun 2, 2017
@jerneju jerneju deleted the sparse-impute branch June 5, 2017 07:12
3 participants