[ENH] Support sparse Jaccard #3657
@lanzagar @thocevar @janezd
All in all, the code needs to be made nicer and cleaner. Any suggestions are welcome.
- issparse(data.X) and getattr(metric, "fallback", None)
+ issparse(data.X) and getattr(metric, "fallback", None) and metric is not distance.Jaccard
Is there a reason for not checking metric.supports_sparse instead of metric is not distance.Jaccard?
The condition that specifically checks for Jaccard a few lines later is needed because there is no specific flag signalling whether a metric supports distances by columns. Here, we have a flag to check, unless I overlooked something.
Jaccard is the only distance that supports discrete attributes via fallback. Other metrics fall back to sklearn's methods, which won't work with discrete attributes.
Codecov Report
@@            Coverage Diff             @@
##           master    #3657      +/-   ##
==========================================
+ Coverage   84.31%   84.32%   +<.01%
==========================================
  Files         370      370
  Lines       67856    67899      +43
==========================================
+ Hits        57214    57255      +41
- Misses      10642    10644       +2
Codecov Report
@@            Coverage Diff             @@
##           master    #3657      +/-   ##
==========================================
+ Coverage   84.43%   84.44%   +<.01%
==========================================
  Files         372      372
  Lines       68111    68143      +32
==========================================
+ Hits        57511    57543      +32
  Misses      10600    10600
551effc changes the base class so that it can handle numpy arrays if a fallback is not provided. Perhaps we can stop passing numpy arrays to fallbacks and use fallbacks only for sparse data.

fd1fd19 moves @ajdapretnar's code for sparse Jaccard to the proper class and also contains minor fixes in tests.

@ajdapretnar, you still have to compute the probabilities in the sparse data fitter so that missing values can be handled when the model is used on dense data.
Cosine tests failed because of wrong clipping: sklearn (correctly) clips to [0, 2], while our function for dense matrices (effectively) clipped to [0, 1]. This commit doesn't belong to this PR, but this PR adds tests that compare distances on sparse and dense matrices, which revealed the problem, hence it's more practical to fix it here.
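To illustrate the clipping issue, here is a minimal, standard-library sketch. It is not the code from Orange or sklearn; the helper names `cosine_distance` and `clip` are made up for this example. It shows why the range has to be [0, 2]: for two vectors pointing in opposite directions the cosine similarity is -1, so the distance 1 - cos is 2, and clipping to [0, 1] silently halves it.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cos(u, v); the result lies in [0, 2] for nonzero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def clip(value, low, high):
    """Clamp value to the closed interval [low, high]."""
    return max(low, min(high, value))


u = [3.0, 4.0]
v = [-3.0, -4.0]  # same length, opposite direction

raw = cosine_distance(u, v)
clipped_wide = clip(raw, 0.0, 2.0)    # keeps the true distance
clipped_narrow = clip(raw, 0.0, 1.0)  # loses the "opposite direction" information

print(raw, clipped_wide, clipped_narrow)
```

Clipping is still useful for removing small floating-point overshoots below 0 or above 2; the point is only that the upper bound must be 2.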
@@ -450,7 +453,7 @@ def fit_rows(self, attributes, x, n_vals):
         frequencies of non-zero values per each column.
         """
         if sp.issparse(x):
-            ps = None  # wrong!
+            ps = x.getnnz(axis=0)
If I understand the Cython code correctly, this should be ok. @janezd
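For reference, a small sketch of what `getnnz(axis=0)` returns on a SciPy sparse matrix. The toy matrix is invented for this example, and whether the fitter expects raw counts or proportions is an assumption that should be checked against the Cython code. Note that `getnnz` counts stored entries, so explicitly stored zeros would be counted too unless `eliminate_zeros()` is called first.

```python
import numpy as np
import scipy.sparse as sp

# Toy data: 3 rows, 3 columns; the middle column is entirely zero.
dense = np.array([[1, 0, 2],
                  [0, 0, 3],
                  [4, 0, 0]])
x = sp.csr_matrix(dense)

# Number of stored (here: non-zero) entries in each column.
nonzero_per_column = x.getnnz(axis=0)

# Assumption: if proportions are needed instead of counts,
# divide by the number of rows.
nonzero_fraction = nonzero_per_column / x.shape[0]

print(nonzero_per_column)
print(nonzero_fraction)
```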
Issue
Jaccard did not support sparse data, making it useless for text mining.
Description of changes
Custom support for sparse Jaccard.
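As a rough illustration of the idea (not the implementation in this PR), Jaccard distances between rows of a sparse matrix can be computed without densifying the data. The sketch below treats every non-zero entry as "present", ignores missing values entirely, and defines the distance between two all-zero rows as 0; all three choices are assumptions, and the function name `sparse_jaccard_rows` is hypothetical.

```python
import numpy as np
import scipy.sparse as sp


def sparse_jaccard_rows(x):
    """Pairwise Jaccard distances between the rows of a sparse matrix."""
    # Binarize a copy: drop explicitly stored zeros, set the rest to 1.
    b = sp.csr_matrix(x, dtype=np.float64, copy=True)
    b.eliminate_zeros()
    b.data[:] = 1.0

    # |A & B| for every pair of rows; the n_rows x n_rows result is dense.
    intersection = (b @ b.T).toarray()

    # |A | B| = |A| + |B| - |A & B|
    sizes = b.getnnz(axis=1)
    union = sizes[:, None] + sizes[None, :] - intersection

    # Assumption: two empty rows are considered identical (similarity 1).
    with np.errstate(divide="ignore", invalid="ignore"):
        similarity = np.where(union > 0, intersection / union, 1.0)
    return 1.0 - similarity


x = sp.csr_matrix(np.array([[1, 0, 1, 0],
                            [1, 1, 0, 0],
                            [0, 0, 0, 5]]))
distances = sparse_jaccard_rows(x)
print(distances)
```

Only the result matrix is dense here; the input is never converted, which is what matters for wide, mostly empty data such as bag-of-words matrices.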
Includes