[FIX] Impute: sparse #2357

jerneju · 2017-05-30T17:05:48Z

Issue

Description of changes

Work in progress.

Includes

Code changes
Tests
Documentation

nikicc · 2017-05-30T19:51:17Z

@jerneju I already did some debugging on this on Friday, and IMO the problem is in Orange/statistics/util.py, in the function stats, which for sparse data returns X.min and X.max. I think X.min and X.max don't handle missing values and return np.nan when some values are missing, whereas we should probably return the minimum and maximum among the defined values only. They should probably just be replaced with the nanmin and nanmax functions (L256–272 in the same file).
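To illustrate what "minimum and maximum among defined values only" could mean for a sparse matrix, here is a minimal sketch. The helper name `sparse_nanmin_nanmax` is hypothetical and is not the nanmin/nanmax code in Orange/statistics/util.py; it assumes that implicit (unstored) entries count as zeros and that stored NaNs are the missing values.

```python
import numpy as np
import scipy.sparse as sp


def sparse_nanmin_nanmax(x):
    """Hypothetical helper: min and max over the defined values of a sparse matrix."""
    x = sp.csr_matrix(x)
    # Stored values that are not NaN.
    defined = x.data[~np.isnan(x.data)]
    # Entries that are not stored at all are implicit zeros.
    n_implicit_zeros = x.shape[0] * x.shape[1] - x.nnz
    if n_implicit_zeros > 0:
        defined = np.append(defined, 0.0)
    if defined.size == 0:
        # Every value is missing, so there is nothing to report.
        return np.nan, np.nan
    return defined.min(), defined.max()


x = sp.csr_matrix(np.array([[0.0], [np.nan], [9.0]]))
lo, hi = sparse_nanmin_nanmax(x)

all_missing = sp.csr_matrix(np.array([[np.nan]]))
lo_missing, hi_missing = sparse_nanmin_nanmax(all_missing)
```

For the column [0, nan, 9] this yields 0 and 9, rather than letting the stored NaN decide the result.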

jerneju · 2017-06-01T11:58:01Z

https://sentry.io/biolab/orange3/issues/284690104/

jerneju · 2017-06-01T11:58:11Z

https://sentry.io/biolab/orange3/issues/284690041/

jerneju · 2017-06-01T11:59:34Z

Well, additional issue:

nikicc · 2017-06-01T21:40:32Z

Orange/tests/test_util.py

+        """
+        x = np.array([[0], [np.nan], [9]])
+        x = sp.csr_matrix(x)
+        self.assertEqual(stats(x)[0][2], 3.)


This should be 4.5, not 3 (the mean of the defined values 0 and 9).

nikicc · 2017-06-01T21:40:36Z

Orange/statistics/util.py


-    n_values = np.prod(x.shape) - np.sum(np.isnan(x.data))
-    return np.nansum(x.data) / n_values
+    x.data = np.nan_to_num(x.data)


nan_to_num converts np.nans to zeros, which causes mean to also treat them as zeros. E.g. for the sparse array of [np.nan, np.nan, 1] this implementation returns 0.33 instead of 1.

What's wrong with the previous implementation?
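A small sketch of the difference, assuming the new code takes the mean over all cells after calling nan_to_num. The function names `mean_with_nan_to_num` and `nanmean_previous` are made up for this comparison; the body of the second one is copied from the removed lines in the diff above.

```python
import numpy as np
import scipy.sparse as sp


def mean_with_nan_to_num(x):
    """Assumed behaviour of the new code: NaNs become zeros, then a plain mean."""
    x = x.copy()  # avoid modifying the caller's matrix
    x.data = np.nan_to_num(x.data)
    return x.data.sum() / np.prod(x.shape)


def nanmean_previous(x):
    """The removed implementation: NaNs are excluded from the count."""
    n_values = np.prod(x.shape) - np.sum(np.isnan(x.data))
    return np.nansum(x.data) / n_values


row = sp.csr_matrix(np.array([[np.nan, np.nan, 1.0]]))
zeroed_mean = mean_with_nan_to_num(row)   # NaNs counted as zeros
ignored_mean = nanmean_previous(row)      # NaNs left out of the mean

column = sp.csr_matrix(np.array([[0.0], [np.nan], [9.0]]))
column_mean = nanmean_previous(column)
```

The same previous implementation also gives 4.5 for the [0, nan, 9] column used in the test.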

nikicc · 2017-06-01T21:44:09Z

Orange/preprocess/impute.py

+        if not sp.issparse(c):
+            c = np.array(c, copy=True)
+        else:
+            c = c.copy()


Why do we need a copy? Doesn't toarray() already take care of this?

nikicc · 2017-06-01T21:44:36Z

Orange/preprocess/impute.py

+            c = np.array(c, copy=True)
+        else:
+            c = c.copy()
+            c = c.toarray().flatten()


Should we use ravel instead, which doesn't necessarily make another copy?
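A quick check of both points, as a sketch on a made-up single-column matrix: toarray() already returns a new dense array that is independent of the sparse matrix, ravel() returns a view of that array when it is contiguous, and flatten() always copies.

```python
import numpy as np
import scipy.sparse as sp

c = sp.csc_matrix(np.array([[0.0], [np.nan], [9.0]]))

dense = c.toarray()          # new dense array, independent of c
raveled = dense.ravel()      # view of dense (no extra copy for contiguous data)
flattened = dense.flatten()  # always a fresh copy

shares_ravel = np.shares_memory(dense, raveled)
shares_flatten = np.shares_memory(dense, flattened)

# Writing through the raveled view changes dense, but not the sparse matrix.
raveled[2] = -1.0
dense_value = dense[2, 0]
sparse_value = c[2, 0]
```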

codecov-io · 2017-06-02T12:27:09Z

Codecov Report

Merging #2357 into master will decrease coverage by 0.03%.
The diff coverage is 85.41%.

@@            Coverage Diff             @@
##           master    #2357      +/-   ##
==========================================
- Coverage   73.41%   73.38%   -0.04%     
==========================================
  Files         317      317              
  Lines       55653    55664      +11     
==========================================
- Hits        40859    40850       -9     
- Misses      14794    14814      +20

nikicc · 2017-06-02T12:51:09Z

Orange/statistics/util.py

@@ -281,14 +281,32 @@ def mean(x):
    return np.sum(x.data) / n_values


-def nanmean(x):
+def nanmean(x, axis=None):


What about:

def nanmean(x, axis=None):
    """ Equivalent of np.nanmean that supports sparse or dense matrices. """
    def nanmean_sparse(x):
        n_values = np.prod(x.shape) - np.sum(np.isnan(x.data))
        return np.nansum(x.data) / n_values

    if not sp.issparse(x):
        return np.nanmean(x, axis=axis)
    if axis is None:
        return nanmean_sparse(x)
    if axis in [0, 1]:
        arr = x if axis == 1 else x.T
        return np.array([nanmean_sparse(row) for row in arr])
    else:
        raise NotImplementedError

Well, I did some speed testing. The results are interesting and are listed below:

Ratio for axis 0 : 1.558
Ratio for axis 1 : 0.664
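If the per-row Python loop in the suggestion above turns out to matter for speed, the axis=0 case could perhaps be vectorized. This is only a sketch under that assumption, not code from this PR: `nanmean_columns` is a hypothetical name, and it has not been timed against either version.

```python
import numpy as np
import scipy.sparse as sp


def nanmean_columns(x):
    """Hypothetical vectorized per-column nanmean for a sparse matrix."""
    x = sp.csc_matrix(x)
    n_rows, n_cols = x.shape
    nan_mask = np.isnan(x.data)
    # Column index of every stored entry, derived from the CSC index pointer.
    col_idx = np.repeat(np.arange(n_cols), np.diff(x.indptr))
    # Number of stored NaNs per column.
    nan_counts = np.bincount(col_idx[nan_mask], minlength=n_cols)
    # Sum of the stored non-NaN values per column (implicit zeros add nothing).
    sums = np.bincount(col_idx[~nan_mask], weights=x.data[~nan_mask],
                       minlength=n_cols)
    # Defined values per column = all rows minus the missing ones.
    return sums / (n_rows - nan_counts)


x = sp.csr_matrix(np.array([[0.0, np.nan],
                            [np.nan, 2.0],
                            [9.0, 4.0]]))
column_means = nanmean_columns(x)
```

A column in which every value is NaN divides zero by zero here and produces nan with a NumPy runtime warning.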

nikicc · 2017-06-02T12:55:13Z

Orange/preprocess/impute.py

+        if not sp.issparse(c):
+            c = np.array(c, copy=True)
+        else:
+            c = c.toarray().ravel()


What if we take only c.data here, so that we would not need to densify the whole column? Consequently, we would only need to set c.data in L314.
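Roughly what I have in mind, as a sketch only: `impute_sparse_column` and the fill value are made up for illustration and are not the code in impute.py. Only the stored values are touched, so the column never has to be converted to a dense array.

```python
import numpy as np
import scipy.sparse as sp


def impute_sparse_column(c, value):
    """Hypothetical helper: replace stored NaNs in a sparse column with value."""
    c = c.copy()  # leave the caller's column untouched
    c.data[np.isnan(c.data)] = value
    return c


column = sp.csc_matrix(np.array([[0.0], [np.nan], [9.0]]))
imputed = impute_sparse_column(column, 4.5)

imputed_values = imputed.toarray().ravel()  # densified here only for inspection
original_still_missing = np.isnan(column.toarray()[1, 0])
```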

nikicc added the DH2017 label Jun 1, 2017

nikicc suggested changes Jun 1, 2017

nikicc added this to the 3.4.3 milestone Jun 2, 2017

nikicc self-assigned this Jun 2, 2017

nikicc suggested changes Jun 2, 2017

jerneju added 5 commits June 2, 2017 16:30

[FIX] Impute/Stats: sparse support: mean

47de863

preprocess/impute: numpy -> np

66a5249

[FIX] Impute: sparse support: As a distinct value

54664c0

[FIX] Impute: sparse support: just error message

6053ed1

[FIX] Impute/Preprocess: sparse support: Random

cf584f5

nikicc changed the title ~~[WIP][FIX] Impute: sparse~~ [FIX] Impute: sparse Jun 2, 2017

nikicc approved these changes Jun 2, 2017

nikicc merged commit d79e46a into biolab:master Jun 2, 2017

jerneju deleted the sparse-impute branch June 5, 2017 07:12
