-
-
Notifications
You must be signed in to change notification settings - Fork 1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[FIX] Impute: sparse #2357
[FIX] Impute: sparse #2357
Conversation
@jerneju I already did some debugging about this on Friday and IMO the problem is in the file |
Orange/tests/test_util.py
Outdated
""" | ||
x = np.array([[0], [np.nan], [9]]) | ||
x = sp.csr_matrix(x) | ||
self.assertEqual(stats(x)[0][2], 3.) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be 4.5 not 3.
Orange/statistics/util.py
Outdated
|
||
n_values = np.prod(x.shape) - np.sum(np.isnan(x.data)) | ||
return np.nansum(x.data) / n_values | ||
x.data = np.nan_to_num(x.data) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nan_to_num
converts np.nans
to zeros, which causes mean
to also treat them as zeros. E.g. for the sparse array of [np.nan, np.nan, 1]
this implementation returns 0.33 instead of 1.
What's wrong with the previous implementation?
Orange/preprocess/impute.py
Outdated
if not sp.issparse(c): | ||
c = np.array(c, copy=True) | ||
else: | ||
c = c.copy() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need a copy? Doesn't toarray()
already takes care of this?
Orange/preprocess/impute.py
Outdated
c = np.array(c, copy=True) | ||
else: | ||
c = c.copy() | ||
c = c.toarray().flatten() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we use ravel
instead that doesn't necessarily make an other copy?
Codecov Report
@@ Coverage Diff @@
## master #2357 +/- ##
==========================================
- Coverage 73.41% 73.38% -0.04%
==========================================
Files 317 317
Lines 55653 55664 +11
==========================================
- Hits 40859 40850 -9
- Misses 14794 14814 +20 |
Orange/statistics/util.py
Outdated
@@ -281,14 +281,32 @@ def mean(x): | |||
return np.sum(x.data) / n_values | |||
|
|||
|
|||
def nanmean(x): | |||
def nanmean(x, axis=None): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about:
def nanmean(x, axis=None):
""" Equivalent of np.nanmean that supports sparse or dense matrices. """
def nanmean_sparse(x):
n_values = np.prod(x.shape) - np.sum(np.isnan(x.data))
return np.nansum(x.data) / n_values
if not sp.issparse(x):
return np.nanmean(x, axis=axis)
if axis is None:
return nanmean_sparse(x)
if axis in [0, 1]:
arr = x if axis == 1 else x.T
return np.array([nanmean_sparse(row) for row in arr])
else:
raise NotImplementedError
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, I did some speed testing. The results are interesting and are listed below:
Ratio for axis 0 : 1.558
Ratio for axis 1 : 0.664
Orange/preprocess/impute.py
Outdated
if not sp.issparse(c): | ||
c = np.array(c, copy=True) | ||
else: | ||
c = c.toarray().ravel() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about if we take only c.data
here and we would need to density the whole column? Consequently, we would need to set only c.data
in L314.
Issue
Fixes #2349.
Description of changes
Work in progress.
Includes