-
-
Notifications
You must be signed in to change notification settings - Fork 1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[FIX] Distances: prevent inf numbers #2380
Conversation
Codecov Report
@@ Coverage Diff @@
## master #2380 +/- ##
=========================================
Coverage ? 74.25%
=========================================
Files ? 320
Lines ? 55836
Branches ? 0
=========================================
Hits ? 41459
Misses ? 14377
Partials ? 0 |
Rebased. |
@@ -160,6 +161,9 @@ def checks(metric, data): | |||
except MemoryError: | |||
self.Error.distances_memory_error() | |||
return | |||
|
|||
if np.isinf(met).any(): | |||
met = np.nan_to_num(met) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Without the conditional, this would make one linear pass instead of three.
@@ -160,6 +161,8 @@ def checks(metric, data): | |||
except MemoryError: | |||
self.Error.distances_memory_error() | |||
return | |||
|
|||
met = np.nan_to_num(met) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No. This is just a quick local patch, not a solution.
It reminds me of a guy (whom we all know and) who - years ago - programmed some machine learning algorithm. He occasionally got probabilities above 1, so he fixed the problem by adding
if (p > 1) {
p = 1;
}
If we get an error report, we shouldn't find the simplest way to just fix it somewhere. If some distances are infinite, they are infinite. We can't just set them to 42 and pretend everything is cool. If we believe that distance matrix should never contain infinite values, we should raise an exception. If we decide that infinite values can arise and are legal, other functions/widgets should be able to handle those.
In this case, I'd say that infinite values are allowed and hierarchical clustering should do something about them. Replacing them with large values (perhaps len(data) * np.nanmax(data)
) could make sense within the hierarchical clustering. Or it may be even more complex since it depends upon linkage.
So: if there are infinite distances, we need to think what would we like the clustering to do with them, not what is the easiest way to get rid of them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
perhaps len(data) * np.nanmax(data)
If you opt for this suggestion, please, just don't use np.nanmax(data)
. We should never use np.nanmax(data)
since it doesn't work on sparse data! We have our own implementation of nanmax
in Orange/statistics/util.py
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was about to write that while I totally agree with @janezd, most solutions for inf distances will probably be equivalent to substitution with large numbers. In which case maybe we should consider if we want every widget to duplicate this substitution.
But there is always the possibility that somewhere inf numbers would be illegal (and we would want a warning). And even worse - this substitution will probably not allow downstream methods to ignore the problem, since you still cannot multiply the new "large values", or add two of them etc. which could happen in average linkage and so on.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This sounds more like clipping. With lower and upper range?
@@ -763,6 +763,9 @@ class Outputs: | |||
cluster_roles = ["Attribute", "Class variable", "Meta variable"] | |||
basic_annotations = ["None", "Enumeration"] | |||
|
|||
class Error(widget.OWWidget.Error): | |||
not_finite_distances = Msg("Not all distances are finite") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rephrase: Some distances are infinite.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why some error messages have dot at the end and some have not?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right, we have to decide. @BlazZupan, @astaric, @lanzagar, @ajdapretnar? I don't have a clear preference.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I vote for dots in sentences (verb present).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. No verb, no dot. Verb, dot. Period. :)
However, some error messages are not sentences, like "Out of memory" or "Error during optimization". What about those?
Message without verb bad, message with verb good? Makes them longer, though. Sometimes long not good, when widget narrow.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh no, I would keep those dot-less. Short, too.
@@ -970,6 +974,9 @@ def set_distances(self, matrix): | |||
|
|||
self.matrix = matrix | |||
if matrix is not None: | |||
if not np.all(np.isfinite(matrix)): | |||
self.Error.not_finite_distances() | |||
return |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I may have overlooked something, but I think this is wrong: it aborts the function without reseting the items (and related models). Note that the check above sets matrix
to None
. In line with this, you should probably have something like:
if matrix is not None:
if not np.all(np.isfinite(matrix)):
self.Error.not_finite_distances()
matrix = None
self.matrix = matrix
if matrix is not None:
self._set_items(matrix.row_items, matrix.axis)
else:
self._set_items(None)
The line self.matrix = matrix
is moved from above to the place after the check for infinite values.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, fixed.
if matrix is not None: | ||
N, _ = matrix.shape | ||
if N < 2: | ||
self.error("Empty distance matrix") | ||
matrix = None | ||
if not np.all(np.isfinite(matrix)): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, not like that. If it comes here, matrix
can be set to None
in the line above. Another if matrix is not None
is needed here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is Ok now.
@@ -1082,6 +1090,8 @@ def _update_labels(self): | |||
self.labels.setMinimumWidth(1 if labels else -1) | |||
|
|||
def _invalidate_clustering(self): | |||
if self.matrix is not None and not np.all(np.isfinite(self.matrix)): | |||
return |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
First, this check may be redundant since np.all(np.isfinite(self.matrix))
is already checked above and self.matrix
is set to None in this case. Hence, the situation in which this conjunction is true cannot arise, I guess.
Second -- if this condition can be fulfilled -- is skipping these calls really OK?
_update()
clears the plot and then performs further cleanup ifself.matrix is None
. I guess this must be done if the widget receives invalid data._update_labels()
checks forself.root
(which is set toNone
in_update()
) and works correctly whenmatrix
isNone
._invalidate_output()
has to be called passNone
to outputs.
I believe that these two lines have to be removed. But please check whether there are any other side-effects.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These two lines are now removed.
if matrix is not None: | ||
N, _ = matrix.shape | ||
if N < 2: | ||
self.error("Empty distance matrix") | ||
matrix = None | ||
if not np.all(np.isfinite(matrix)): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is Ok now.
Issue
Description of changes
Includes