[ENH] Bhatthacharayya distance #4111

AndrejaKovacic · 2019-10-17T14:39:24Z

Description of changes

A new metric is introduced in the Distances widget. Bhattacharyya distance measures the distance between distributions. Currently, the user must already have data that can be interpreted as distributions. In the future, Pivot could be extended to aggregate all features, filtered by column values, so that distributions could be generated in Orange itself.
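For readers unfamiliar with the metric: the Bhattacharyya distance between two discrete distributions p and q is -ln(sum_i sqrt(p_i * q_i)). The following is a minimal NumPy sketch of that formula, assuming dense, non-negative input vectors that are normalized to sum to one. The function name `bhattacharyya` is illustrative and is not the code added in this PR.

```python
import numpy as np


def bhattacharyya(a, b):
    """Bhattacharyya distance between two non-negative vectors.

    Illustrative helper only; not the function added in this PR.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Normalize each vector so that it sums to one,
    # mimicking a probability distribution.
    p = a / a.sum()
    q = b / b.sum()
    # The Bhattacharyya coefficient is sum(sqrt(p_i * q_i));
    # the distance is its negative logarithm.
    return -np.log(np.sum(np.sqrt(p * q)))


# Proportional vectors describe the same distribution: distance is (numerically) zero.
same = bhattacharyya([1, 2, 3], [2, 4, 6])
# Partially overlapping distributions: coefficient 0.5, distance ln(2).
different = bhattacharyya([1, 1, 0], [0, 1, 1])
```

Distributions with no overlap at all give a coefficient of zero and therefore an infinite distance.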

Includes

Code changes
Tests
Documentation

codecov · 2019-10-17T14:51:52Z

Codecov Report

Merging #4111 into master will increase coverage by 0.01%.
The diff coverage is 100%.

@@            Coverage Diff            @@
##           master   #4111      +/-   ##
=========================================
+ Coverage   85.68%   85.7%   +0.01%     
=========================================
  Files         390     389       -1     
  Lines       69727   69813      +86     
=========================================
+ Hits        59744   59830      +86     
  Misses       9983    9983

janezd · 2019-10-18T07:39:25Z

Orange/distance/distance.py

@@ -644,6 +644,36 @@ class PearsonRAbsolute(CorrelationDistance):
    def fit(self, _):
        return PearsonModel(True, self.axis, self.impute)

+def _prob_dist(a):
+    # Makes the vector sum to one, as to mimick probability distribution.
+    return a/np.sum(a)


Add spaces around /.

janezd · 2019-10-18T07:39:48Z

Orange/distance/distance.py

+    b = _prob_dist(b)
+    if sp.issparse(a):
+        return -np.log(np.sum(np.sqrt(a.multiply(b))))
+    return -np.log(np.sum(np.sqrt(a*b)))


janezd

Apologies for pestering you. I have a few comments, and we can talk about this on Friday, but if you'd prefer to be done with it, we can merge it as it is.

Orange/distance/distance.py

janezd · 2019-10-23T21:39:10Z

Orange/widgets/unsupervised/owdistances.py

@@ -157,6 +160,9 @@ def _fix_missing():
                      _fix_discrete, _fix_missing, _fix_nonbinary):
            if not check():
                return None
+        if (METRICS[self.metric_idx][0] == 'Bhattacharyya') and _min(data.X) < 0:


It is in general a bad idea to use string literals like this. Somebody will rename it ... and there we go. I'd be happier with METRICS[self.metric_idx][1] is distance.Bhattacharyya (and, by the way, you can remove the parentheses).

On the other hand, why wouldn't the distance itself (i.e. function distance._bhattacharyya) test the values and raise ValueError("Bhattacharyya distance requires non-negative values")? The widget already catches and shows ValueError exceptions.

As I proposed on Friday, this is better because it gives a reasonable error also to anybody that would call this distance from a script. With current code, if somebody calls Bhattacharyya with negative values, (s)he will get just a RuntimeWarning: invalid value encountered in sqrt, the result will be nan ... and this nan will propagate until an unrelated function crashes down the road. Testing and raising an exception there is better because it helps debugging.

This is open for discussion -- on Friday.
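The difference between the two behaviours described in this comment can be sketched as follows. This is an illustration under assumptions, not the PR's code: both function names are hypothetical, normalization is omitted for brevity, and the inputs are plain dense NumPy arrays.

```python
import warnings

import numpy as np


def bhattacharyya_unchecked(a, b):
    # No validation: a negative product goes straight into sqrt.
    return -np.log(np.sum(np.sqrt(a * b)))


def bhattacharyya_checked(a, b):
    # Hypothetical variant that validates inside the distance function,
    # so that script callers also get a clear error.
    if a.min() < 0 or b.min() < 0:
        raise ValueError("Bhattacharyya distance requires non-negative values")
    return -np.log(np.sum(np.sqrt(a * b)))


a = np.array([0.5, -0.2, 0.7])
b = np.array([0.3, 0.3, 0.4])

# Unchecked: NumPy only emits a RuntimeWarning and the result is nan,
# which then propagates silently to whatever uses it next.
with warnings.catch_warnings():
    warnings.simplefilter("ignore", RuntimeWarning)
    silent_result = bhattacharyya_unchecked(a, b)

# Checked: the problem surfaces immediately, at its source.
try:
    bhattacharyya_checked(a, b)
    error_message = None
except ValueError as exc:
    error_message = str(exc)
```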

You are right, I moved it.

janezd · 2019-11-01T18:49:20Z

Orange/distance/distance.py

+    # Raise an exception for infinities, nans and negative values
+    check_array(a,
+                accept_sparse=True, accept_large_sparse=True, ensure_2d=False)
+    if a.min() < 0:


@AndrejaKovacic, would this be OK, too?

Dense arrays also have a method min, so there's no need for if. Also, I think it is better to not change the exception message raised by check_array so the caller is informed that, for instance, there are nan values in the data.

Yes, it's more informative this way.
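The point about `min` can be sketched as below: NumPy arrays and SciPy sparse matrices both provide a `min()` method, so one code path covers both. The helper name `check_non_negative` follows the commit title in this PR, but the body here is an assumption: it leaves out the `check_array` call shown in the diff, and the error message is invented for illustration.

```python
import numpy as np
import scipy.sparse as sp


def check_non_negative(a):
    # Sketch only: the version in the PR first calls sklearn's check_array
    # to reject infinities and nans; that step is omitted here.
    # a.min() works for dense arrays and for sparse matrices alike,
    # so no sp.issparse(a) branch is needed.
    if a.min() < 0:
        raise ValueError("Bhattacharyya distance requires non-negative values")


dense_ok = np.array([[0.0, 1.5], [2.0, 0.0]])
sparse_ok = sp.csr_matrix(dense_ok)
dense_bad = np.array([[0.0, -1.0], [2.0, 0.0]])
sparse_bad = sp.csr_matrix(dense_bad)

# Record whether each input is rejected.
rejected = []
for data in (dense_ok, sparse_ok, dense_bad, sparse_bad):
    try:
        check_non_negative(data)
        rejected.append(False)
    except ValueError:
        rejected.append(True)
```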

AndrejaKovacic added 2 commits October 15, 2019 11:07

Add bhattacaryya

9e7e5b2

Add bhattcharyya distance docs

cb6c698

AndrejaKovacic force-pushed the bhatthacharayya branch from 57be411 to 94d5249 Compare October 17, 2019 14:54

janezd self-assigned this Oct 18, 2019

janezd reviewed Oct 18, 2019

Add bhattcharyya test

93b0494

AndrejaKovacic force-pushed the bhatthacharayya branch from 94d5249 to 93b0494 Compare October 18, 2019 08:42

janezd reviewed Oct 23, 2019

AndrejaKovacic force-pushed the bhatthacharayya branch 2 times, most recently from 7766bd9 to 0bc5bb0 Compare October 25, 2019 08:40

Move input validation to distance method

3e6b549

AndrejaKovacic force-pushed the bhatthacharayya branch from 0bc5bb0 to 3e6b549 Compare October 25, 2019 12:03

distance.check_non_negative: Simplify

69083d1

janezd reviewed Nov 1, 2019

janezd merged commit 0c0a1e9 into biolab:master Nov 1, 2019
