[ENH] Support sparse Jaccard #3657

ajdapretnar · 2019-03-04T14:10:21Z

Issue

Jaccard did not support sparse data, making it useless for text mining.

Description of changes

Custom support for sparse Jaccard.

Includes

Code changes
Tests
Documentation

ajdapretnar · 2019-03-04T14:16:48Z

@lanzagar @thocevar @janezd
This requires some discussion.

Previously, when Distances was given sparse discrete data, Jaccard did not work. This is currently fixed with a not-so-nice hack.
DistMatrix cannot be tested with np.testing.assert_array_equal, because max() returns an int instead of numpy.int64. See: DistMatrix: np.testing.assert_array_equal crashes on two different matrices #3658
Distance.Jaccard().fit does not work for any sparse data. How to solve this?

All in all, the code needs to be made nicer and cleaner. Any suggestions welcome.

janezd · 2019-03-07T21:34:53Z

Orange/widgets/unsupervised/owdistances.py

-                    issparse(data.X) and getattr(metric, "fallback", None)
+                    issparse(data.X) and getattr(metric, "fallback",
+                                                 None) and metric is not
+                                distance.Jaccard


Is there a reason for not checking metric.supports_sparse instead of metric is not distance.Jaccard?

The condition that specifically checks for Jaccard a few lines later is needed because there is no specific flag signalling whether a metric supports distances by columns. Here, we have a flag to check, unless I overlooked something.

ajdapretnar · 2019-03-11

Jaccard is the only distance that supports discrete attributes via fallback. Other metrics fall back to sklearn's methods, which won't work with discrete.
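
To make the exchange above concrete, here is a minimal sketch comparing the identity check from the diff with the flag check proposed in the review. The classes are hypothetical stand-ins, not Orange's real metric classes; only the attribute names `supports_sparse` and `fallback` come from the discussion, and the values assigned to them are assumptions made for illustration.

```python
def _placeholder_fallback(x):
    """Hypothetical fallback; stands in for an sklearn-based function."""
    return x


class Metric:
    """Hypothetical stand-in for a distance class; not Orange's real API."""
    supports_sparse = False
    fallback = None


class Euclidean(Metric):
    # Assumed: accepts sparse input only by delegating to a fallback
    # that cannot deal with discrete attributes.
    supports_sparse = True
    fallback = staticmethod(_placeholder_fallback)


class Jaccard(Metric):
    # Assumed: accepts sparse input and discrete attributes.
    supports_sparse = True
    fallback = staticmethod(_placeholder_fallback)


def drop_discrete_by_identity(metric, is_sparse):
    """Condition as written in the diff: special-case Jaccard by identity."""
    return bool(is_sparse
                and getattr(metric, "fallback", None)
                and metric is not Jaccard)


def drop_discrete_by_flag(metric, is_sparse):
    """Alternative from the review: consult supports_sparse instead."""
    return bool(is_sparse
                and getattr(metric, "fallback", None)
                and not metric.supports_sparse)
```

Under these assumptions the two predicates disagree for `Euclidean` on sparse data: the identity check still drops discrete attributes, while the flag check keeps them and would pass them on to a fallback that cannot handle them. That is the situation the reply describes.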

codecov · 2019-03-12T22:03:49Z

Codecov Report

Merging #3657 into master will increase coverage by <.01%.
The diff coverage is 89.09%.

@@            Coverage Diff             @@
##           master    #3657      +/-   ##
==========================================
+ Coverage   84.31%   84.32%   +<.01%     
==========================================
  Files         370      370              
  Lines       67856    67899      +43     
==========================================
+ Hits        57214    57255      +41     
- Misses      10642    10644       +2

codecov · 2019-03-12T22:03:50Z

Codecov Report

Merging #3657 into master will increase coverage by <.01%.
The diff coverage is 97.82%.

@@            Coverage Diff             @@
##           master    #3657      +/-   ##
==========================================
+ Coverage   84.43%   84.44%   +<.01%     
==========================================
  Files         372      372              
  Lines       68111    68143      +32     
==========================================
+ Hits        57511    57543      +32     
  Misses      10600    10600

janezd · 2019-03-12T22:09:34Z

551effc changes the base class so that it can handle numpy arrays if fallback is not provided. Perhaps we can stop passing numpy arrays to fallbacks and use fallbacks only for sparse data.

fd1fd19 moves @ajdapretnar's code for sparse Jaccard to the proper class and also contains minor fixes in tests.

@ajdapretnar, you still have to compute the probabilities in the sparse data fitter to handle missing values when the model is used on dense data.

janezd · 2019-03-12T22:13:29Z

Jaccard.compute_distances is now ugly. Dense data is handled in compute_distances itself, while sparse data is handled in another method. It should be symmetric: one method for dense and one for sparse, with compute_distances calling one or the other. That is, extract most of the code from compute_distances, except for the first two lines, into a separate method.

sparse_jaccard is also not a very good name: jaccard is already the class name. Maybe _compute_sparse and _compute_dense.
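
A rough sketch of the symmetric layout described above, assuming binary (zero / non-zero) rows and ignoring missing values. `JaccardSketch` and `_to_distance` are hypothetical names, the convention that two all-zero rows have distance 0 is an assumption, and none of this is the code that was merged.

```python
import numpy as np
import scipy.sparse as sp


class JaccardSketch:
    """Illustrative only: row-wise Jaccard distances on binary data."""

    def compute_distances(self, x):
        # Thin dispatcher: one private method per storage type.
        if sp.issparse(x):
            return self._compute_sparse(x)
        return self._compute_dense(np.asarray(x))

    def _compute_dense(self, x):
        nonzero = (x != 0).astype(float)
        intersection = nonzero @ nonzero.T
        sizes = nonzero.sum(axis=1)
        union = sizes[:, None] + sizes[None, :] - intersection
        return self._to_distance(intersection, union)

    def _compute_sparse(self, x):
        # Copy so the caller's matrix is not modified in place.
        nonzero = sp.csr_matrix(x, dtype=float, copy=True)
        nonzero.eliminate_zeros()
        nonzero.data[:] = 1.0
        intersection = (nonzero @ nonzero.T).toarray()
        sizes = np.asarray(nonzero.sum(axis=1)).ravel()
        union = sizes[:, None] + sizes[None, :] - intersection
        return self._to_distance(intersection, union)

    @staticmethod
    def _to_distance(intersection, union):
        # Assumed convention: two empty rows are identical (distance 0).
        similarity = np.ones_like(union, dtype=float)
        np.divide(intersection, union, out=similarity, where=union > 0)
        return 1.0 - similarity


x = np.array([[1, 0, 1, 0],
              [1, 1, 0, 0],
              [0, 0, 0, 0]])
dense_result = JaccardSketch().compute_distances(x)
sparse_result = JaccardSketch().compute_distances(sp.csr_matrix(x))
```

Keeping the two branches behind one dispatcher also makes it straightforward to test that dense and sparse inputs give the same matrix.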

janezd · 2019-03-13T16:11:02Z

Cosine tests failed because of wrong clipping: sklearn (correctly) clips to [0, 2], while our function for dense matrices (effectively) clipped to [0, 1]. This commit doesn't belong to this PR, but this PR adds tests that compare distances on sparse and dense matrices, which revealed the problem, hence it's more practical to fix it here.
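
The effect of the clipping range can be seen in a small NumPy-only sketch. The function below is a hypothetical illustration, not Orange's or sklearn's implementation, and it assumes that no row is all zeros.

```python
import numpy as np


def cosine_distances(x, upper=2.0):
    """1 - cosine similarity between rows, clipped to [0, upper]."""
    x = np.asarray(x, dtype=float)
    norms = np.linalg.norm(x, axis=1)   # assumes no all-zero rows
    similarity = (x @ x.T) / np.outer(norms, norms)
    return np.clip(1.0 - similarity, 0.0, upper)


rows = np.array([[1.0, 0.0],
                 [-1.0, 0.0],
                 [0.0, 1.0]])
full = cosine_distances(rows)               # clipped to [0, 2]
narrow = cosine_distances(rows, upper=1.0)  # clipped to [0, 1]
```

The first two rows point in opposite directions, so their distance is 2 with the wider clip and is cut to 1 with the narrower one, which makes them indistinguishable from orthogonal rows.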

ajdapretnar · 2019-03-14T14:40:54Z

Orange/distance/distance.py

@@ -450,7 +453,7 @@ def fit_rows(self, attributes, x, n_vals):
        frequencies of non-zero values per each column.
        """
        if sp.issparse(x):
-            ps = None  # wrong!
+            ps = x.getnnz(axis=0)


If I understand the Cython code correctly, this should be ok. @janezd
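
For reference, a short sketch of what `getnnz(axis=0)` returns on a SciPy CSR matrix. It yields per-column counts of stored entries, not proportions, and explicitly stored zeros are counted too. Whether the model expects counts or proportions is not settled in this thread, so the division by the number of rows below is only an assumption about how a proportion could be derived.

```python
import numpy as np
import scipy.sparse as sp

x = sp.csr_matrix(np.array([[1, 0, 2],
                            [0, 0, 3],
                            [4, 0, 0]]))

counts = x.getnnz(axis=0)         # stored entries per column
fractions = counts / x.shape[0]   # assumed: share of rows with a stored value
```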

ajdapretnar added 4 commits March 1, 2019 16:11

Sparse Jaccard

fa18777

Do not remove nonbinary for sparse

e791397

Test sparse Jaccard

70467ea

Disable check for Jaccard

7ba8c0b

ajdapretnar requested a review from janezd March 4, 2019 14:10

ajdapretnar mentioned this pull request Mar 5, 2019

OWCommonTerms: distance matrix based on common terms in docs biolab/orange3-text#417

Closed

3 tasks

janezd reviewed Mar 7, 2019

Distances: Support numpy arrays without fallbacks

551effc

janezd force-pushed the sparse-jaccard branch from fd1fd19 to 9005785 Compare March 12, 2019 22:27

Jaccard distance: Move from a fallback to its own class

d962655

janezd force-pushed the sparse-jaccard branch from 9005785 to a0309b8 Compare March 13, 2019 16:07

Cosine distance: Fix clipping

5cf9802

janezd force-pushed the sparse-jaccard branch from a0309b8 to 5cf9802 Compare March 13, 2019 22:10

janezd mentioned this pull request Mar 13, 2019

Pylint distances #3674

Merged

Section code and extend fitter

9d84d85

ajdapretnar commented Mar 14, 2019

janezd self-assigned this Mar 15, 2019

OWDistance: Minor reformatting

a30e688

janezd force-pushed the sparse-jaccard branch from 14e3056 to a30e688 Compare March 15, 2019 14:06

janezd changed the title ~~[WIP][RFC] Support sparse Jaccard~~ [RFC] Support sparse Jaccard Mar 15, 2019

janezd changed the title ~~[RFC] Support sparse Jaccard~~ [ENH] Support sparse Jaccard Mar 15, 2019

janezd merged commit 574f2c2 into biolab:master Mar 15, 2019
