Statistics.countnans: Fix sparse implementation and add axis support #2558
Conversation
Force-pushed from 0359a98 to 57c6e28.
@nikicc It appears that the existing code was not counting the number of NaNs, but rather the number of zero elements.
Apparently, this functionality is used in …
Force-pushed from 811338f to 87e9276.
Kudos for finding and correcting the bug 👍 IMO `countnans` should count the number of NaNs, not zeros like it currently does.

About the failing tests: I quickly looked at the first one, and the test is incorrect, as was the previous `countnans` implementation. Could you also correct the tests for `_compute_distributions` according to the new implementation?
Orange/tests/test_statistics.py (Outdated)
```python
        countnans(csr_matrix(x), axis=1), [1, 1],
        'Countnans fails on sparse data with `axis=1`')

    def test_countnans_with_2d_weights(self):
```
Could you also add a test for weights on sparse matrices?
This didn't work before and I wasn't sure whether or not it was even worth the hassle, since I assume we only use sparse matrices when the data is very big, and forming a dense 2d weight matrix of that size kind of defeats the purpose...
However, I've implemented it now. In any case sparse weight matrices weren't supported before and aren't supported now. I don't know if it's a requirement, but it would require more work.
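The 2d-weight behaviour described above can be illustrated with a small sketch. This is an assumed semantics (the function name `weighted_countnans` is hypothetical, not Orange's actual code): each NaN contributes its weight to the count instead of 1.

```python
import numpy as np

def weighted_countnans(x, weights, axis=None):
    """Hypothetical sketch: NaN counting where each NaN contributes
    its corresponding weight rather than 1."""
    x = np.asarray(x, dtype=float)
    # Element-wise product zeroes out non-NaN positions
    return (np.isnan(x) * weights).sum(axis=axis)

x = np.array([[np.nan, 1.], [2., np.nan]])
w = np.array([[0.5, 1.], [1., 2.]])
```

Under this sketch, `weighted_countnans(x, w)` sums the weights at NaN positions (here 0.5 + 2.0), and `axis=1` gives the per-row weighted counts.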
You're right, we shouldn't worry about weights on sparse matrices.
Orange/statistics/util.py (Outdated)
```python
    """
    if not sp.issparse(X):
        X = np.asanyarray(X)
        isnan = np.isnan(X)
        if weights is not None and weights.shape == X.shape:
            isnan = isnan * weights

        # In order to keep return types consistent with sparse vectors, we will
        # handle `axis=1` given a regular 1d numpy array equivalently as
```
I suggest we drop this compatibility for 1D arrays. The only case I can think of where this would be handy is something like:

```python
for row in data.X:
    countnans(row, axis=1)
```

But we probably shouldn't be passing 1D vectors with `axis=1` anyway; we should calculate NaNs on the whole matrix. I would drop this compatibility for 1D and worry about it if it ever becomes a problem.
Thinking a bit further: if we really want 100% compatibility, we could make `countnans` also raise an error for sparse 1D arrays (e.g. `[[1 2 3]]`) with `axis=1`. We treat them as 1D arrays anyhow, so I don't see a benefit of supporting both `axis=0` and `axis=1` if both return the same value.
I agree, this does seem to be the most sensible solution. Most of all, we should be consistent with the behaviour on sparse and dense matrices, otherwise much confusion is inevitable.
> Most of all, we should be consistent with the behaviour on sparse and dense matrices, otherwise much confusion is inevitable.

But on the other hand, silently changing user-provided input of `axis=1` to `axis=0` just because `axis=1` would result in an error is IMO worse than some slight incompatibility between sparse and dense. If I could come up with at least one use case where this incompatibility would really hurt us, I might be less against it.

IMO, if you want to stick to 100% the same behaviour, I would prefer that we also raise an error when a sparse row is provided and the user passes `axis=1`. That way we could essentially treat a sparse matrix of shape `(1, X)` as a 1D array and get exactly the same behaviour as with 1D dense arrays.
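The behaviour proposed here can be sketched as follows. This is a rough illustration (the function name `countnans_1d_consistent` is hypothetical, not the PR's actual code), showing a sparse row vector rejecting `axis=1` exactly as numpy does for a dense 1d array:

```python
import numpy as np
import scipy.sparse as sp

def countnans_1d_consistent(x, axis=None):
    """Sketch: count NaNs, treating a sparse row vector like a dense
    1d array, so axis=1 raises in both cases."""
    if sp.issparse(x):
        if x.shape[0] != 1:
            raise NotImplementedError("sketch handles only row vectors")
        if axis == 1:
            # Mirror numpy's error for axis=1 on 1d input
            raise ValueError("'axis' entry is out of bounds")
        # Only explicitly stored values can be NaN; implicit entries are 0
        return np.isnan(x.data).sum()
    # numpy itself raises for axis=1 on a 1d array
    return np.isnan(np.asarray(x, dtype=float)).sum(axis=axis)
```

With this sketch, `countnans_1d_consistent(csr_matrix([[1., np.nan, 3.]]))` and `countnans_1d_consistent(np.array([1., np.nan, 3.]))` both count one NaN, and both fail for `axis=1`.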
Yes, I agree, changing the user input was a bad idea. The error is much better.
Orange/tests/test_statistics.py (Outdated)
```python
    def test_shape_matches_dense_and_sparse_given_array_and_axis_1(self):
        dense = np.array([0, 1, 0, 2, 2, np.nan, 1, np.nan, 0, 1])
        sparse = csr_matrix(dense)
        np.testing.assert_equal(countnans(dense, axis=1), countnans(sparse, axis=1))
```
I don't like this. For example, numpy crashes if you want to sum with `axis=1` and only have a 1D array:

```python
>>> y = np.array([0, 1, 0, 2, 2, np.nan, 1, np.nan, 0, 1])
>>> y.sum(axis=0)
nan
>>> y.sum(axis=1)
Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File "/Users/Niko/anaconda/envs/orange3/lib/python3.6/site-packages/numpy/core/_methods.py", line 32, in _sum
    return umr_sum(a, axis, dtype, out, keepdims)
ValueError: 'axis' entry is out of bounds
```

Considering the idea from the above comments, I would probably just make both of these calls raise an error.
```python
def bincount(X, max_val=None, weights=None, minlength=None):
    """Return counts of values in array X.

    Works kind of like np.bincount(), except that it also supports floating
    arrays with nans.
```
Would you be willing to also document the `max_val` argument? This doesn't seem to be a numpy argument, and it's not obvious what it is for.
I've added thorough documentation.
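To make the discussion concrete, the role of a `max_val`-style argument can be sketched roughly like this. This is an assumed illustration (`bincount_with_nans` is a hypothetical name, not Orange's actual implementation): pad the histogram out to `max_val + 1` bins and report NaNs separately, since `np.bincount` cannot handle them.

```python
import numpy as np

def bincount_with_nans(x, max_val=None):
    """Sketch: count integer values, padding the histogram to
    max_val + 1 bins; NaNs are counted separately."""
    x = np.asarray(x, dtype=float)
    nans = np.isnan(x)
    # Guarantee bins 0..max_val exist even if never observed
    minlength = int(max_val) + 1 if max_val is not None else 0
    counts = np.bincount(x[~nans].astype(int), minlength=minlength)
    return counts, nans.sum()
```

For example, `bincount_with_nans([0., 1., 3.], max_val=3)` yields four bins even though the value 2 never occurs.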
Orange/statistics/util.py (Outdated)
```python
    # Since `csr_matrix.values` only contain non-zero values, we must count
    # those separately and set the appropriate bin
    if sp.issparse(X_):
        bc[0] = np.prod(X_.shape) - X_.nnz
```
Should we maybe make `np.prod(X_.shape) - X_.nnz` a separate function, something like `sparse_count_zeros`? We already have `_sparse_has_zeros`, and we will need to add the number of zero elements to the Table widget sooner or later.
I think this is a good idea.

Perhaps it would also make sense to implement `sparse_num_elements`, which would essentially be `np.prod(x.shape)`, since it's also used a lot. This would make the functionality much clearer, but I'm hesitant because it seems far too excessive.
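The two helpers discussed here are small enough to sketch directly (names as proposed in the thread; signatures are assumptions, not the merged code):

```python
import numpy as np
import scipy.sparse as sp

def sparse_num_elements(x):
    """Total number of entries a matrix of this shape holds."""
    return np.prod(x.shape)

def sparse_count_zeros(x):
    """Number of entries not explicitly stored, i.e. implicit zeros."""
    assert sp.issparse(x)
    return sparse_num_elements(x) - x.nnz
```

For a 2×2 csr matrix with two stored values, `sparse_count_zeros` returns 2.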
Orange/tests/test_statistics.py (Outdated)
```python
        np.testing.assert_equal(bincount(dense), expected)
        np.testing.assert_equal(bincount(sparse), expected)

        hist, n_nans = bincount([0., 1., 3], max_val=3)
```
What exactly does `max_val=3` do here?
This was a remnant of the old tests. I've removed it and added an appropriate test.
Orange/tests/test_statistics.py (Outdated)
```python
        np.testing.assert_equal(countnans(x, weights=w, axis=1), [1, 2])


class TestBincount(unittest.TestCase):
```
Perhaps add one test with weights?
Done.
Force-pushed from 23a7433 to cd71ba6.
At first I was puzzled why zeros were counted as NaNs in sparse matrices; this has no apparent advantage and can be confusing when first dealing with it. It probably came from pandas, where the default fill for sparse matrices is NaN (but can be changed to anything). That doesn't make much sense to me, because if we separate the two, we can easily determine which data is missing and which is actually zero. Pandas does differentiate between explicit and implicit NaNs, but then again, pandas is far more flexible than plain sparse matrices.
…ous distributions
Force-pushed from 2485608 to 2a23396.
Codecov Report

```
@@            Coverage Diff             @@
##           master    #2558      +/-   ##
==========================================
- Coverage   75.83%   75.03%    -0.81%
==========================================
  Files         338      327       -11
  Lines       59532    57645     -1887
==========================================
- Hits        45145    43252     -1893
- Misses      14387    14393       +6
```
Force-pushed from 9aff785 to 70354e7.
The previous implementation didn't return zero counts for sparse matrices and treated them as NaNs. This was changed and the tests have been updated.
Force-pushed from 70354e7 to 6f12808.
Not sure where this came from, but it was incorrect. The current assumption is that values not stored in the sparse matrix correspond to zeros, not missing values! Missing values are explicitly stored in sparse matrices as …
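The convention described here — implicit entries are zeros, missing values must be stored explicitly — can be demonstrated in a few lines with scipy:

```python
import numpy as np
import scipy.sparse as sp

dense = np.array([[0., 1., np.nan]])
sparse = sp.csr_matrix(dense)

# The leading 0 is implicit (not stored); 1.0 and the NaN are stored,
# so missing values stay distinguishable from zeros.
assert sparse.nnz == 2
assert np.isnan(sparse.data).sum() == 1
```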
@nikicc Is there anything left to do with this PR, or do you think it could be merged?
@pavlin-policar sorry that this took so long 🙈

Overall it looks OK; there are only a few things left:
- fix `bincount` to consider explicit zeros,
- make sure that `x` is of the correct type whenever you call `x.indices` (just cast it to `csr` beforehand)

Also, when discovering the problem of explicit zeros, I was wondering if we could hack the `dense_sparse` decorator to also insert at least one explicit zero. Doing so, we should have caught the `bincount` problem.
```python
             [0, 0, 0, 1, 0, 2, np.nan, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
             [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
             [0, 0, 2, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1.1, 0, 0, 0, 0, 0, 0]]
        )
```
This test doesn't quite work for explicit zeros. E.g. if you add `X[0,0] = 0` here, the test should pass and the results should be the same, but it fails. IMO the problem is in `bincount`; check the comment above.
Also, once this is corrected, please add some explicit zeros to this example.
Orange/statistics/util.py (Outdated)
```python
        if weights is not None:
            zero_weights = weights[zero_indices].sum()
            weights = weights[x.indices]
```
Does `x.indices` work here for both `csc` and `csr`? If not, please cast it to whatever (csc/csr) is required beforehand.
Orange/statistics/util.py (Outdated)
```python
        if weights.ndim == 1:
            n_items = np.prod(x.shape)
            zero_indices = np.setdiff1d(np.arange(n_items), x.indices, assume_unique=True)
```
Does `x.indices` work here for both `csc` and `csr`? If not, please cast it to whatever (csc/csr) is required beforehand.
Orange/statistics/util.py (Outdated)
```python
    # Since `csr_matrix.values` only contain non-zero values, we must count
    # those separately and set the appropriate bin
    if sp.issparse(x_original):
        bc[0] = zero_weights
```
This should probably be `bc[0] = bc[0] + zero_weights` to account for explicit zeros stored in `x.data`.
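The underlying scipy behaviour that makes this necessary is easy to demonstrate: a csr matrix built directly from `data`/`indices`/`indptr` arrays keeps an explicit zero in `.data`, so it counts toward `nnz` and would land in the zero bin of any count over `.data`.

```python
import numpy as np
import scipy.sparse as sp

# Construct a 1x3 csr matrix whose .data contains an explicit zero
data = np.array([0., 1., 2.])
indices = np.array([0, 1, 2])
indptr = np.array([0, 3])
x = sp.csr_matrix((data, indices, indptr), shape=(1, 3))

assert x.nnz == 3               # the stored zero counts toward nnz
assert (x.data == 0).sum() == 1  # and is visible in .data
```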
Orange/statistics/util.py (Outdated)
```python
    return np.fromiter((np.isnan(row.data).sum() for row in X), dtype=dtype)


def sparse_count_zeros(x):
```
Should we perhaps rename this to `sparse_count_implicit_zeros` to make sure explicit zeros aren't counted?
Orange/statistics/util.py (Outdated)
```python
    return np.prod(x.shape) - x.nnz


def sparse_has_zeros(x):
```
Should we perhaps rename this to `sparse_has_implicit_zeros` to make sure explicit zeros aren't considered?
Orange/statistics/util.py (Outdated)
```diff
-def bincount(X, max_val=None, weights=None, minlength=None):
+def sparse_zero_weights(x, weights):
```
Is this only meant to be used when both `x` and `weights` are of the same shape, i.e. one-dimensional? If so, should we add a check for this?
Adding a check here causes more problems than it's worth, because this can be called in two ways that are valid:
- `x` has shape `(1, n)`
- `x` has shape `(n, 1)`

These are basically equivalent since they are both "1d", but this would make the check more complicated.
Force-pushed from b5b4cdc to e4206e2.
These commits were moved to PR #2698.
Issue

The statistics module implementation of `countnans` did not support the `axis` argument. Moreover, it appeared to have been computing the number of NaNs incorrectly on sparse data in general.

Description of changes

Add an implementation of `countnans` which correctly counts NaNs and supports the `axis` keyword.

Includes