[MX-9588] Add micro averaging strategy for F1 metric #9777
Conversation
Regarding other approaches, something I looked at was the following:

```python
class MacroMetric(EvalMetric):
    def __init__(self, base_metric):
        super(MacroMetric, self).__init__("macro_" + base_metric.name,
                                          output_names=base_metric.output_names,
                                          label_names=base_metric.label_names)
        self.base_metric = base_metric

    def update(self, labels, preds):
        self.base_metric.update(labels, preds)
        self.sum_metric += self.base_metric.get()[1]
        self.num_inst += 1
        self.base_metric.reset()
```

Any metric that defines the "micro" behavior can then be used as "macro" just by wrapping it in `MacroMetric`.
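To make the micro/macro distinction concrete, here is a minimal pure-Python sketch of the wrapper idea above. It uses a stand-in `Accuracy` class instead of mxnet's `EvalMetric` (all names and numbers here are illustrative, not from the PR):

```python
class Accuracy:
    """Stand-in 'micro'-style metric: pools raw correct/total counts across batches."""
    name = "accuracy"

    def __init__(self):
        self.reset()

    def reset(self):
        self.sum_metric = 0.0
        self.num_inst = 0

    def update(self, labels, preds):
        self.sum_metric += sum(1 for l, p in zip(labels, preds) if l == p)
        self.num_inst += len(labels)

    def get(self):
        return self.name, self.sum_metric / max(self.num_inst, 1)


class MacroMetric:
    """Averages the base metric's per-batch values instead of pooling raw counts."""
    def __init__(self, base_metric):
        self.name = "macro_" + base_metric.name
        self.base_metric = base_metric
        self.sum_metric = 0.0
        self.num_inst = 0

    def update(self, labels, preds):
        self.base_metric.update(labels, preds)
        self.sum_metric += self.base_metric.get()[1]
        self.num_inst += 1
        self.base_metric.reset()

    def get(self):
        return self.name, self.sum_metric / max(self.num_inst, 1)


micro = Accuracy()
macro = MacroMetric(Accuracy())
for labels, preds in [([1, 1, 1, 1], [1, 1, 1, 0]),   # batch of 4: 3/4 correct
                      ([1, 1], [0, 0])]:              # batch of 2: 0/2 correct
    micro.update(labels, preds)
    macro.update(labels, preds)
# micro pools counts: 3 correct / 6 total = 0.5
# macro averages per-batch scores: (0.75 + 0.0) / 2 = 0.375
```

The two results differ whenever batch sizes or per-batch scores vary, which is exactly the behavioral choice the `average` option exposes.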
This fixes #9588
python/mxnet/metric.py
Outdated
```
@@ -503,21 +578,27 @@ class F1(EvalMetric):
    label_names : list of str, or None
        Name of labels that should be used when updating with update_dict.
        By default include all labels.
    average : str
```
Suggest documenting the default: `average : str, default 'macro'`
```
        raise ValueError("%s currently only supports binary classification."
                         % self.__class__.__name__)

        for y_pred, y_true in zip(pred_label, label):
```
Do we have to use a for-loop here? Using array arithmetic ops should be more efficient. Also, we could avoid converting to numpy arrays so the computation can stay on the GPU.
I agree, but there's another issue for that (#9586), so I assumed it would be done in a separate PR.
Let's address that in a separate PR. My last attempt ran into performance issues when switching to ndarray-based logic, so we should tackle it once it's clearer how to resolve those.
I've used code like

```python
tp = nd.sum((pred == 1) * (label == 1)).asscalar()
fp = nd.sum((pred == 1) * (label == 0)).asscalar()
fn = nd.sum((pred == 0) * (label == 1)).asscalar()
precision = float(tp) / (tp + fp)
recall = float(tp) / (tp + fn)
f1 = 2 * (precision * recall) / (precision + recall)
```

to calculate the F1, and I find it's much faster on GPU.
I've previously written code to accelerate F1 calculation on GPU. However, it's not based on the metric class and directly uses NDArray:

```python
def nd_f1(pred, label, num_class, average="micro"):
    """Evaluate F1 using mx.nd.NDArray

    Parameters
    ----------
    pred : nd.NDArray
        Shape (num, label_num) or (num,)
    label : nd.NDArray
        Shape (num, label_num) or (num,)
    num_class : int
    average : str

    Returns
    -------
    f1 : float
    """
    if pred.dtype != np.float32:
        pred = pred.astype(np.float32)
        label = label.astype(np.float32)
    assert num_class > 1
    assert pred.ndim == label.ndim
    if num_class == 2 and average == "micro":
        tp = nd.sum((pred == 1) * (label == 1)).asscalar()
        fp = nd.sum((pred == 1) * (label == 0)).asscalar()
        fn = nd.sum((pred == 0) * (label == 1)).asscalar()
        precision = float(tp) / (tp + fp)
        recall = float(tp) / (tp + fn)
        f1 = 2 * (precision * recall) / (precision + recall)
    else:
        assert num_class is not None
        pred_onehot = nd.one_hot(indices=pred, depth=num_class)
        label_onehot = nd.one_hot(indices=label, depth=num_class)
        tp = pred_onehot * label_onehot
        fp = pred_onehot * (1 - label_onehot)
        fn = (1 - pred_onehot) * label_onehot
        if average == "micro":
            tp = nd.sum(tp).asscalar()
            fp = nd.sum(fp).asscalar()
            fn = nd.sum(fn).asscalar()
            precision = float(tp) / (tp + fp)
            recall = float(tp) / (tp + fn)
            f1 = 2 * (precision * recall) / (precision + recall)
        elif average == "macro":
            if tp.ndim == 3:
                tp = nd.sum(tp, axis=(0, 1))
                fp = nd.sum(fp, axis=(0, 1))
                fn = nd.sum(fn, axis=(0, 1))
            else:
                tp = nd.sum(tp, axis=0)
                fp = nd.sum(fp, axis=0)
                fn = nd.sum(fn, axis=0)
            precision = nd.mean(tp / (tp + fp)).asscalar()
            recall = nd.mean(tp / (tp + fn)).asscalar()
            f1 = 2 * (precision * recall) / (precision + recall)
        else:
            raise NotImplementedError
    return f1
```
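One thing worth noting about the snippet above: its "macro" branch averages precision and recall across classes first and then combines them, which is not the same quantity as the mean of per-class F1 scores (sklearn's definition of macro F1). A quick pure-Python cross-check makes the difference visible (all data here is made up for illustration):

```python
def per_class_counts(preds, labels, num_class):
    """TP/FP/FN counts for each class treated in turn as the positive class."""
    counts = []
    for c in range(num_class):
        tp = sum(1 for p, l in zip(preds, labels) if p == c and l == c)
        fp = sum(1 for p, l in zip(preds, labels) if p == c and l != c)
        fn = sum(1 for p, l in zip(preds, labels) if p != c and l == c)
        counts.append((tp, fp, fn))
    return counts

preds  = [0, 1, 2, 1]
labels = [0, 2, 2, 1]
counts = per_class_counts(preds, labels, 3)

# Style used in nd_f1's macro branch: average precision/recall over classes first.
prec = sum(tp / (tp + fp) for tp, fp, fn in counts) / 3
rec  = sum(tp / (tp + fn) for tp, fp, fn in counts) / 3
f1_avg_pr = 2 * prec * rec / (prec + rec)        # 5/6 ~ 0.833

# sklearn-style macro F1: mean of the per-class F1 scores.
def f1(tp, fp, fn):
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    return 2 * p * r / (p + r)

f1_macro = sum(f1(*c) for c in counts) / 3       # 7/9 ~ 0.778
```

The two values generally differ, so whichever definition the metric settles on should be stated in the docstring.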
OK, it should be addressed in a later PR.
```python
fscore2 = 2. * (1) / (2 * 1 + 0 + 0)
fscore_total = 2. * (1 + 1) / (2 * (1 + 1) + (1 + 0) + (0 + 0))
np.testing.assert_almost_equal(microF1.get()[1], fscore_total)
np.testing.assert_almost_equal(macroF1.get()[1], (fscore1 + fscore2) / 2.)
```
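The expected values in this test follow directly from F1 = 2·TP / (2·TP + FP + FN). A small sketch of the arithmetic, assuming the first batch had counts (tp=1, fp=1, fn=0), which is consistent with the totals in the formulas above:

```python
def f1_from_counts(tp, fp, fn):
    """F1 = 2*TP / (2*TP + FP + FN)."""
    return 2.0 * tp / (2 * tp + fp + fn)

# Per-batch (tp, fp, fn); the first batch's counts are inferred from the totals.
batches = [(1, 1, 0), (1, 0, 0)]

fscores = [f1_from_counts(*b) for b in batches]
macro = sum(fscores) / len(fscores)            # average of per-batch F1s
tp, fp, fn = (sum(c) for c in zip(*batches))
micro = f1_from_counts(tp, fp, fn)             # F1 from counts pooled over batches
# macro = (2/3 + 1) / 2 = 5/6 ~ 0.833..., micro = 4/5 = 0.8
```

This is why the test must assert different values for `microF1` and `macroF1` on the same updates.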
For the test part, I think one way is to compare the result with sklearn.metrics.f1_score. However, I'm not sure if it works for CI. @marcoabreu
Since we don't have dependency support on the Windows slaves yet, this won't work unless scikit-learn happens to be present already - I can't check right now. I'd propose just trying it out and seeing whether it works.
Ok, trying this out. We'll see.
Seems that sklearn is not installed on some machines.
Yeah, I've just reverted the commit.
I think the current way is fine. All we need to do is provide an mx.metric version of sklearn.metrics.f1_score.
python/mxnet/metric.py
Outdated
```
@@ -475,8 +475,84 @@ def update(self, labels, preds):
        self.num_inst += num_samples


class _BinaryClassificationMixin(object):
    """
    Private mixin for keeping track of TPR, FPR, TNR, FNR counts for a classification metric.
```
Could you add more explanation of what this does?
Updated. Let me know if that's not what you had in mind.
python/mxnet/metric.py
Outdated
```diff
 @register
-class F1(EvalMetric):
+class F1(EvalMetric, _BinaryClassificationMixin):
```
Use a member instead of multiple inheritance if possible.
This is one of the cases where the use of mixins is appropriate for backward compatibility and ease of use.
I think we can make a `BinaryClassificationMetric` class that is intended to be abstract and inherits from `EvalMetric` without affecting backward compatibility. It seemed more appropriate as a mixin here since it would be useless as a concrete class and doesn't actually implement any functionality required by `EvalMetric`.
Multiple inheritance is very rarely necessary. In this case, I think `_BinaryClassificationMixin` should either inherit from `EvalMetric` or be refactored into a few utility functions.
Composition is also possible.
Honestly, the current multiple inheritance seems reasonable, in that calculating counts and keeping counts are two separate concerns. Mixins are more flexible and will likely require less code when we extend these to multi-class/multi-label/top-k use cases.
Composition is equally flexible, with the drawback of an extra dereference, though I'm fine either way.
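As a rough illustration of the composition alternative being discussed, here is a plain-Python sketch in which the metric holds a counter object as a member instead of mixing it in. All class and method names here are made up for illustration, not the PR's actual API:

```python
class _BinaryClassificationMetrics:
    """Helper that owns the TP/FP/FN counters for one binary classifier."""
    def __init__(self):
        self.true_positives = 0
        self.false_positives = 0
        self.false_negatives = 0

    def update_counts(self, labels, preds):
        for l, p in zip(labels, preds):
            if p == 1 and l == 1:
                self.true_positives += 1
            elif p == 1 and l == 0:
                self.false_positives += 1
            elif p == 0 and l == 1:
                self.false_negatives += 1

    @property
    def fscore(self):
        denom = 2 * self.true_positives + self.false_positives + self.false_negatives
        return 2.0 * self.true_positives / denom if denom else 0.0


class F1:
    """Metric that delegates count-keeping to a member object (composition)."""
    def __init__(self):
        self._stats = _BinaryClassificationMetrics()

    def update(self, labels, preds):
        self._stats.update_counts(labels, preds)

    def get(self):
        return "f1", self._stats.fscore


m = F1()
m.update([1, 0, 1], [1, 1, 0])   # tp=1, fp=1, fn=1 -> F1 = 2/(2+1+1) = 0.5
```

The extra dereference (`self._stats.…`) is the cost; the benefit is that the counter object could later be shared by precision, recall, and F1 without any inheritance.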
Can you check latest commit to see if that's what you had in mind?
Also, one thing to note: when the output has only one label, micro F1 is equivalent to accuracy (https://stackoverflow.com/questions/37358496/is-f1-micro-the-same-as-accuracy). This could potentially help accelerate the computation of micro F1. Also, we sometimes need to deal with multi-label classification and may need to support that in the future.
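The single-label equivalence follows because each wrong prediction contributes exactly one false positive (for the predicted class) and one false negative (for the true class), so pooled precision and recall both equal accuracy. A pure-Python sketch with made-up data:

```python
def micro_f1(preds, labels, num_class):
    """Micro F1: pool TP/FP/FN over all classes, then compute one F1."""
    tp = fp = fn = 0
    for c in range(num_class):
        tp += sum(1 for p, l in zip(preds, labels) if p == c and l == c)
        fp += sum(1 for p, l in zip(preds, labels) if p == c and l != c)
        fn += sum(1 for p, l in zip(preds, labels) if p != c and l == c)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


preds  = [0, 2, 1, 1, 0]
labels = [0, 2, 2, 1, 1]
accuracy = sum(p == l for p, l in zip(preds, labels)) / len(preds)
# micro_f1(preds, labels, 3) == accuracy == 0.6
```

So for single-label outputs, micro F1 could simply be computed as a running correct/total ratio. The equivalence breaks down for multi-label outputs, where a sample can add true positives and false negatives simultaneously.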
@sxjscience let's track the multi-label case in #9589
Obviously, no change is expected since none of the update logic was changed here.

Before: [benchmark output]

After: [benchmark output]
This reverts commit 797c01c.
python/mxnet/metric.py
Outdated
```
@@ -503,21 +582,27 @@ class F1(EvalMetric):
    label_names : list of str, or None
        Name of labels that should be used when updating with update_dict.
        By default include all labels.
    average : str, default 'macro'
        Strategy to be used for aggregating across micro-batches.
```
"mini-batches" is more commonly used.
python/mxnet/metric.py
Outdated
```
    average : str, default 'macro'
        Strategy to be used for aggregating across micro-batches.
        "macro": average the F1 scores for each batch
        "micro": compute a single F1 score across all batches
```
Add period at the end. Currently it renders into:
http://mxnet-doc.s3-accelerate.dualstack.amazonaws.com/api/python/metric/metric.html#mxnet.metric.F1
Thanks all for the review!
* add macro/micro f1 and test and binary abstraction
* make average an option
* use metric.create
* add decimal for float division
* add default in docstring, reference generic base class in error msg
* expand on docstring
* use scikit in test
* Revert "use scikit in test" (reverts commit 797c01c)
* use composition
* minibatches
Description
This PR adds a mixin class that F1 and other metrics like precision and recall can leverage in the future. It also provides a new option for the F1 metric called `average`, which defines how the metric will be aggregated across mini-batches.

Checklist

Essentials

* `make lint`

Approach
The "micro" vs "macro" update strategy is not specific to F1 score. The macro update just takes an average of averages, which can be done for any metric. It may be best to design an abstraction where any metric can have the micro/macro update option, but I couldn't see a good way to do that here.

For now, the behavior for each type of update is hard-coded into the `update` method of the `F1` class. We can discuss the approach.

Please let me know if I have missed or overlooked anything :)