
Statistics.countnans: Fix sparse implementation and add axis support #2558

Closed

Conversation

@pavlin-policar (Collaborator) commented Sep 4, 2017

Issue

The statistics module implementation of countnans did not support the axis argument. Moreover, it appeared to have been computing the number of NaNs incorrectly on sparse data in general.

Description of changes

Add an implementation of countnans that correctly counts NaNs and supports the axis keyword.
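
A quick sketch of the intended behaviour (hypothetical session; the import path and exact return values are assumptions, not taken from this PR):

import numpy as np
from scipy.sparse import csr_matrix
from Orange.statistics.util import countnans  # assumed import path

x = np.array([[1, 1, 0, 0],
              [2, np.nan, 0, 0]])

countnans(x)                      # 1.0 -- one NaN in total, zeros ignored
countnans(csr_matrix(x), axis=1)  # [0, 1] -- NaNs per row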

Includes
  • Code changes
  • Tests
  • Documentation

@pavlin-policar (Collaborator Author):

@nikicc It appears that the existing code was not counting the number of NaNs, but rather the number of zero elements.

>>> x = np.array([1, 1, 0, 0, 2, np.nan, 0, 0]).reshape(2, -1)
>>> x
array([[  1.,   1.,   0.,   0.],
       [  2.,  nan,   0.,   0.]])
>>> _count_nans_per_row_sparse(csr_matrix(x), None)
4.0

Apparently, this functionality is used in Table._compute_distributions (hence my failing test). I haven't looked any further than that, but either the distributions in the table are being computed incorrectly, or the function is misnamed.

@nikicc (Contributor) left a comment:

Kudos for finding and correcting the bug 👍 IMO countnans should count the number of NaNs, not zeros as it currently does.

About the failing tests: I quickly looked at the first one, and the test is incorrect, as was the previous countnans implementation. Could you also correct the tests for _compute_distributions according to the new implementation?

countnans(csr_matrix(x), axis=1), [1, 1],
          'Countnans fails on sparse data with `axis=1`')

def test_countnans_with_2d_weights(self):
@nikicc (Contributor):

Could you also add a test for weights on sparse matrices?

@pavlin-policar (Collaborator Author):

This didn't work before, and I wasn't sure whether it was even worth the hassle, since I assume we only use sparse matrices when the data is very big, and forming a dense 2D weight matrix of that size kind of defeats the purpose...

However, I've implemented it now. In any case, sparse weight matrices weren't supported before and aren't supported now. I don't know whether that's a requirement, but supporting them would require more work.
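
For illustration, a sketch of how 2D weights interact with the NaN count, assuming the mask-times-weights logic quoted further down (the example values are made up):

import numpy as np

x = np.array([[1, np.nan, 0],
              [np.nan, np.nan, 2]])
w = np.array([[1, 2, 3],
              [4, 5, 6]])

# Each NaN contributes its weight instead of 1, so the per-row counts are
# expected to be [2, 4 + 5] = [2, 9]:
isnan = np.isnan(x)
print((isnan * w).sum(axis=1))  # [2. 9.]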

@nikicc (Contributor):

You're right, we shouldn't worry about weights on sparse matrices.

"""
if not sp.issparse(X):
X = np.asanyarray(X)
isnan = np.isnan(X)
if weights is not None and weights.shape == X.shape:
isnan = isnan * weights

# In order to keep return types consistent with sparse vectors, we will
# handle `axis=1` given a regular 1d numpy array equivallently as
@nikicc (Contributor):

I suggest we drop this compatibility for 1D arrays. The only case I can think of where this would be handy is something like:

for row in data.X:
    countnans(row, axis=1)

But we probably shouldn't be passing 1D vectors with axis=1 anyway; we should calculate NaNs on the whole matrix.

I would drop this compatibility for 1D arrays and worry about it if it ever becomes a problem.

@nikicc (Contributor):

Thinking a bit further: if we really want 100% compatibility, we could make it raise an error for countnans with sparse 1D arrays (e.g. [[1 2 3]]) and axis=1 as well. We treat them as 1D arrays anyhow, so I don't see the benefit of supporting both axis=0 and axis=1 if both return the same value.

@pavlin-policar (Collaborator Author):

I agree, this does seem to be the most sensible solution. Most of all, we should be consistent with the behaviour on sparse and dense matrices, otherwise much confusion is inevitable.

@nikicc (Contributor):

> Most of all, we should be consistent with the behaviour on sparse and dense matrices, otherwise much confusion is inevitable.

But on the other hand, silently changing user-provided input of axis=1 to axis=0 just because axis=1 would result in an error is IMO worse than some slight incompatibility between sparse and dense. If I could come up with at least one use case where this incompatibility would really hurt us, I might be less against it.

IMO, if you want to stick to 100% identical behaviour, I would prefer that we also raise an error when a sparse row is provided and the user passes axis=1. That way we essentially treat a sparse matrix of shape (1, n) as a 1D array and get exactly the same behaviour as with dense 1D arrays.

@pavlin-policar (Collaborator Author):

Yes, I agree, changing the user input was a bad idea. The error is much better.

def test_shape_matches_dense_and_sparse_given_array_and_axis_1(self):
    dense = np.array([0, 1, 0, 2, 2, np.nan, 1, np.nan, 0, 1])
    sparse = csr_matrix(dense)
    np.testing.assert_equal(countnans(dense, axis=1), countnans(sparse, axis=1))
@nikicc (Contributor):

I don't like this. For example, numpy raises an error if you sum over axis=1 on a 1D array:

>>> y = np.array([0, 1, 0, 2, 2, np.nan, 1, np.nan, 0, 1])
>>> y.sum(axis=0)
nan
>>> y.sum(axis=1)
Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File "/Users/Niko/anaconda/envs/orange3/lib/python3.6/site-packages/numpy/core/_methods.py", line 32, in _sum
    return umr_sum(a, axis, dtype, out, keepdims)
ValueError: 'axis' entry is out of bounds

Considering the idea from the comments above, I would probably just make both of these calls raise an error.
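
A sketch of the agreed-upon behaviour (the exception type and import path are assumptions):

import numpy as np
from scipy.sparse import csr_matrix
from Orange.statistics.util import countnans  # assumed import path

y = np.array([0, 1, 0, 2, 2, np.nan, 1, np.nan, 0, 1])

# Both the dense 1D array and its (1, n) sparse counterpart should reject
# axis=1, mirroring np.sum's behaviour on 1D input.
for arr in (y, csr_matrix(y)):
    try:
        countnans(arr, axis=1)
    except ValueError as exc:
        print(exc)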



def bincount(X, max_val=None, weights=None, minlength=None):
    """Return counts of values in array X.

    Works kind of like np.bincount(), except that it also supports floating
    arrays with nans.

@nikicc (Contributor):

Would you be willing to also document the max_val argument? This doesn't seem to be a numpy argument, and it's not obvious what it is for.

@pavlin-policar (Collaborator Author):

I've added thorough documentation.

# Since `csr_matrix.values` only contain non-zero values, we must count
# those separately and set the appropriate bin
if sp.issparse(X_):
    bc[0] = np.prod(X_.shape) - X_.nnz
@nikicc (Contributor):

Should we maybe extract np.prod(X_.shape) - X_.nnz into a separate function, something like sparse_count_zeros? We already have _sparse_has_zeros, and we will need the number of zero elements in the Table widget sooner or later.

@pavlin-policar (Collaborator Author):

I think this is a good idea.

Perhaps it would also make sense to implement sparse_num_elements, which would essentially be np.prod(x.shape), since that is also used a lot. This would make the functionality much clearer, but I'm hesitant because it seems excessive.
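
A minimal sketch of the two helpers under discussion (sparse_num_elements is only a proposal here, not merged code):

import numpy as np

def sparse_count_zeros(x):
    # Implicit zeros: elements not stored in the sparse matrix.
    return np.prod(x.shape) - x.nnz

def sparse_num_elements(x):
    # Total number of elements the matrix would have if densified.
    return np.prod(x.shape)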

np.testing.assert_equal(bincount(dense), expected)
np.testing.assert_equal(bincount(sparse), expected)

hist, n_nans = bincount([0., 1., 3], max_val=3)
@nikicc (Contributor):

What exactly does max_val=3 do here?

@pavlin-policar (Collaborator Author):

This was a remnant of the old tests. I've removed it and added an appropriate test.

np.testing.assert_equal(countnans(x, weights=w, axis=1), [1, 2])


class TestBincount(unittest.TestCase):
@nikicc (Contributor):

Perhaps add one test with weights?

@pavlin-policar (Collaborator Author):

Done.

@pavlin-policar force-pushed the statistics-util-countnans branch 3 times, most recently from 23a7433 to cd71ba6 on September 8, 2017
@pavlin-policar (Collaborator Author):

At first I was puzzled as to why zeros were counted as NaNs in sparse matrices. This has no apparent advantage and is confusing when one first encounters it...

This probably came from pandas, where the default fill value for sparse structures is NaN (but can be changed to anything else). This doesn't make much sense to me, because if we separate the two, we can easily determine which data is missing and which is actually zero. Pandas does differentiate between explicit and implicit NaNs, but then again, pandas is far more flexible than plain sparse matrices.
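
For comparison, a small illustration with the modern pandas API (this PR predates it, so treat it only as an analogy):

import numpy as np
import pandas as pd

# With the default fill_value of NaN, unstored entries mean "missing",
# while zeros must be stored explicitly -- the opposite of scipy.sparse.
s = pd.arrays.SparseArray([1.0, np.nan, 0.0])
print(s.fill_value)  # nan
print(s.sp_values)   # [1. 0.] -- the NaN is the implicit value here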

@codecov-io commented Sep 9, 2017

Codecov Report

Merging #2558 into master will decrease coverage by 0.8%.
The diff coverage is 95.18%.

@@            Coverage Diff             @@
##           master    #2558      +/-   ##
==========================================
- Coverage   75.83%   75.03%   -0.81%     
==========================================
  Files         338      327      -11     
  Lines       59532    57645    -1887     
==========================================
- Hits        45145    43252    -1893     
- Misses      14387    14393       +6

@pavlin-policar force-pushed the statistics-util-countnans branch 2 times, most recently from 9aff785 to 70354e7 on September 10, 2017
@nikicc (Contributor) commented Sep 12, 2017

> This probably came from pandas, where the default fill value for sparse structures is NaN (but can be changed to anything else).

Not sure where this came from, but it was incorrect. The current assumption is that values not stored in the sparse matrix correspond to zeros and not missing values! Missing values are explicitly stored in sparse matrices as np.nan. If there still exist functions that don't obey this rule, they should be corrected.
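
A small illustration of this convention (example values are made up):

import numpy as np
from scipy.sparse import csr_matrix

x = csr_matrix(np.array([[1.0, 0.0, np.nan]]))

print(x.nnz)             # 2 -- the NaN is stored explicitly, the zero is not
print(np.isnan(x.data))  # [False  True] -- missing values live in .data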

@pavlin-policar (Collaborator Author):

@nikicc Is there anything left to do with this PR, or do you think it could be merged?

@nikicc (Contributor) left a comment:

@pavlin-policar sorry that this took so long 🙈

Overall it looks OK; there are only a few things left:

  • fix bincount to consider explicit zeros,
  • make sure that x is of the correct type whenever you call x.indices (just cast it to csr beforehand)

Also, when discovering the problem of explicit zeros, I wondered whether we could hack the dense_sparse decorator to also insert at least one explicit zero. Had we done so, we would have caught the bincount problem.
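
For reference, a quick sketch of how an explicit zero can end up in a scipy sparse matrix (example values are made up):

import numpy as np
from scipy.sparse import csr_matrix

x = csr_matrix(np.array([[1.0, 0.0, 2.0]]))
x.data[0] = 0.0  # overwrite a stored value in place -> an explicit zero

print(x.nnz)              # 2 -- nnz counts stored entries, explicit zero included
print(x.count_nonzero())  # 1 -- only one truly non-zero value remains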

[0, 0, 0, 1, 0, 2, np.nan, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 2, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1.1, 0, 0, 0, 0, 0, 0]]
)
@nikicc (Contributor):

This test doesn't quite work for explicit zeros. E.g. if you add X[0, 0] = 0 here, the test should pass and the results should be the same, but it fails. IMO the problem is in bincount; check the comment above.

@nikicc (Contributor):

Also, once this is corrected, please add some explicit zeros to this example.


if weights is not None:
    zero_weights = weights[zero_indices].sum()
    weights = weights[x.indices]
@nikicc (Contributor):

Does x.indices work here for both csc and csr? If not, please cast it to whichever format (csc/csr) is required beforehand.


if weights.ndim == 1:
    n_items = np.prod(x.shape)
    zero_indices = np.setdiff1d(np.arange(n_items), x.indices, assume_unique=True)
@nikicc (Contributor):

Does x.indices work here for both csc and csr? If not, please cast it to whichever format (csc/csr) is required beforehand.

# Since `csr_matrix.values` only contain non-zero values, we must count
# those separately and set the appropriate bin
if sp.issparse(x_original):
    bc[0] = zero_weights
@nikicc (Contributor):

This should probably be bc[0] = bc[0] + zero_weights to account for explicit zeros stored in x.data.

return np.fromiter((np.isnan(row.data).sum() for row in X), dtype=dtype)


def sparse_count_zeros(x):
@nikicc (Contributor):

Should we perhaps rename this to sparse_count_implicit_zeros to make sure explicit zeros aren't counted?

return np.prod(x.shape) - x.nnz


def sparse_has_zeros(x):
@nikicc (Contributor):

Should we perhaps rename this to sparse_has_implicit_zeros to make sure explicit zeros aren't considered?

@nikicc self-assigned this Oct 18, 2017


def bincount(X, max_val=None, weights=None, minlength=None):
def sparse_zero_weights(x, weights):
@nikicc (Contributor) commented Oct 18, 2017:

Is this only meant to be used when both x and weights have the same shape, i.e. are one-dimensional? If so, should we add a check for this?

@pavlin-policar (Collaborator Author):

Adding a check here causes more problems than it's worth, because this can be called in two valid ways:

  1. x has shape (1, n)
  2. x has shape (n, 1)

These are basically equivalent, since both are "1d", but supporting both would make the check more complicated.

@nikicc (Contributor) commented Oct 20, 2017

These commits were moved to PR #2698.
