[MX-9588] Add micro averaging strategy for F1 metric #9777
Conversation
Regarding other approaches, something I looked at was the following:

```python
class MacroMetric(EvalMetric):
    def __init__(self, base_metric):
        super(MacroMetric, self).__init__("macro_" + base_metric.name,
                                          output_names=base_metric.output_names,
                                          label_names=base_metric.label_names)
        self.base_metric = base_metric

    def update(self, labels, preds):
        self.base_metric.update(labels, preds)
        self.sum_metric += self.base_metric.get()[1]
        self.num_inst += 1
        self.base_metric.reset()
```

Any metric that defines the "micro" behavior can then be used as "macro" just by wrapping it in `MacroMetric`.
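To make the micro/macro distinction concrete, here is a minimal pure-Python sketch of the wrapper idea above. It uses a stand-in `Accuracy` class instead of mxnet's `EvalMetric` (all names and numbers here are illustrative, not from the PR):

```python
class Accuracy:
    """Stand-in 'micro'-style metric: pools raw correct/total counts across batches."""
    name = "accuracy"

    def __init__(self):
        self.reset()

    def reset(self):
        self.sum_metric = 0.0
        self.num_inst = 0

    def update(self, labels, preds):
        self.sum_metric += sum(1 for l, p in zip(labels, preds) if l == p)
        self.num_inst += len(labels)

    def get(self):
        return self.name, self.sum_metric / max(self.num_inst, 1)


class MacroMetric:
    """Averages the base metric's per-batch values instead of pooling raw counts."""
    def __init__(self, base_metric):
        self.name = "macro_" + base_metric.name
        self.base_metric = base_metric
        self.sum_metric = 0.0
        self.num_inst = 0

    def update(self, labels, preds):
        self.base_metric.update(labels, preds)
        self.sum_metric += self.base_metric.get()[1]
        self.num_inst += 1
        self.base_metric.reset()

    def get(self):
        return self.name, self.sum_metric / max(self.num_inst, 1)


micro = Accuracy()
macro = MacroMetric(Accuracy())
for labels, preds in [([1, 1, 1, 1], [1, 1, 1, 0]),   # batch of 4: 3/4 correct
                      ([1, 1], [0, 0])]:              # batch of 2: 0/2 correct
    micro.update(labels, preds)
    macro.update(labels, preds)
# micro pools counts: 3 correct / 6 total = 0.5
# macro averages per-batch scores: (0.75 + 0.0) / 2 = 0.375
```

The two results differ whenever batch sizes or per-batch scores vary, which is exactly the behavioral choice the `average` option exposes.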
This fixes #9588
python/mxnet/metric.py
Outdated
```
@@ -503,21 +578,27 @@ class F1(EvalMetric):
    label_names : list of str, or None
        Name of labels that should be used when updating with update_dict.
        By default include all labels.
    average : str
```
Suggest documenting the default: `average : str, default 'macro'`
```
        raise ValueError("%s currently only supports binary classification."
                         % self.__class__.__name__)

        for y_pred, y_true in zip(pred_label, label):
```
Do we have to use a for-loop here? Using array arithmetic ops should be more efficient. Also, we could avoid converting to numpy arrays so the computation can stay on the GPU.
I agree, but there's another issue for that (#9586), so I assumed it would be done in a separate PR.
Let's address that in a separate PR. My last attempt ran into performance issues when switching to ndarray-based logic, so we should tackle it once it's clearer how to resolve those.
I've used code like

```python
tp = nd.sum((pred == 1) * (label == 1)).asscalar()
fp = nd.sum((pred == 1) * (label == 0)).asscalar()
fn = nd.sum((pred == 0) * (label == 1)).asscalar()
precision = float(tp) / (tp + fp)
recall = float(tp) / (tp + fn)
f1 = 2 * (precision * recall) / (precision + recall)
```

to calculate the F1, and I find it's much faster on GPU.
I've previously written code to accelerate F1 calculation on GPU. However, it's not based on the metric class and directly uses NDArray:

```python
def nd_f1(pred, label, num_class, average="micro"):
    """Evaluate F1 using mx.nd.NDArray

    Parameters
    ----------
    pred : nd.NDArray
        Shape (num, label_num) or (num,)
    label : nd.NDArray
        Shape (num, label_num) or (num,)
    num_class : int
    average : str

    Returns
    -------
    f1 : float
    """
    if pred.dtype != np.float32:
        pred = pred.astype(np.float32)
        label = label.astype(np.float32)
    assert num_class > 1
    assert pred.ndim == label.ndim
    if num_class == 2 and average == "micro":
        tp = nd.sum((pred == 1) * (label == 1)).asscalar()
        fp = nd.sum((pred == 1) * (label == 0)).asscalar()
        fn = nd.sum((pred == 0) * (label == 1)).asscalar()
        precision = float(tp) / (tp + fp)
        recall = float(tp) / (tp + fn)
        f1 = 2 * (precision * recall) / (precision + recall)
    else:
        assert num_class is not None
        pred_onehot = nd.one_hot(indices=pred, depth=num_class)
        label_onehot = nd.one_hot(indices=label, depth=num_class)
        tp = pred_onehot * label_onehot
        fp = pred_onehot * (1 - label_onehot)
        fn = (1 - pred_onehot) * label_onehot
        if average == "micro":
            tp = nd.sum(tp).asscalar()
            fp = nd.sum(fp).asscalar()
            fn = nd.sum(fn).asscalar()
            precision = float(tp) / (tp + fp)
            recall = float(tp) / (tp + fn)
            f1 = 2 * (precision * recall) / (precision + recall)
        elif average == "macro":
            if tp.ndim == 3:
                tp = nd.sum(tp, axis=(0, 1))
                fp = nd.sum(fp, axis=(0, 1))
                fn = nd.sum(fn, axis=(0, 1))
            else:
                tp = nd.sum(tp, axis=0)
                fp = nd.sum(fp, axis=0)
                fn = nd.sum(fn, axis=0)
            precision = nd.mean(tp / (tp + fp)).asscalar()
            recall = nd.mean(tp / (tp + fn)).asscalar()
            f1 = 2 * (precision * recall) / (precision + recall)
        else:
            raise NotImplementedError
    return f1
```
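One thing worth noting about the snippet above: its "macro" branch averages precision and recall across classes first and then combines them, which is not the same quantity as the mean of per-class F1 scores (sklearn's definition of macro F1). A quick pure-Python cross-check makes the difference visible (all data here is made up for illustration):

```python
def per_class_counts(preds, labels, num_class):
    """TP/FP/FN counts for each class treated in turn as the positive class."""
    counts = []
    for c in range(num_class):
        tp = sum(1 for p, l in zip(preds, labels) if p == c and l == c)
        fp = sum(1 for p, l in zip(preds, labels) if p == c and l != c)
        fn = sum(1 for p, l in zip(preds, labels) if p != c and l == c)
        counts.append((tp, fp, fn))
    return counts

preds  = [0, 1, 2, 1]
labels = [0, 2, 2, 1]
counts = per_class_counts(preds, labels, 3)

# Style used in nd_f1's macro branch: average precision/recall over classes first.
prec = sum(tp / (tp + fp) for tp, fp, fn in counts) / 3
rec  = sum(tp / (tp + fn) for tp, fp, fn in counts) / 3
f1_avg_pr = 2 * prec * rec / (prec + rec)        # 5/6 ~ 0.833

# sklearn-style macro F1: mean of the per-class F1 scores.
def f1(tp, fp, fn):
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    return 2 * p * r / (p + r)

f1_macro = sum(f1(*c) for c in counts) / 3       # 7/9 ~ 0.778
```

The two values generally differ, so whichever definition the metric settles on should be stated in the docstring.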
OK, it should be addressed in a later PR.
```python
fscore2 = 2. * (1) / (2 * 1 + 0 + 0)
fscore_total = 2. * (1 + 1) / (2 * (1 + 1) + (1 + 0) + (0 + 0))
np.testing.assert_almost_equal(microF1.get()[1], fscore_total)
np.testing.assert_almost_equal(macroF1.get()[1], (fscore1 + fscore2) / 2.)
```
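The expected values in this test follow directly from F1 = 2·TP / (2·TP + FP + FN). A small sketch of the arithmetic, assuming the first batch had counts (tp=1, fp=1, fn=0), which is consistent with the totals in the formulas above:

```python
def f1_from_counts(tp, fp, fn):
    """F1 = 2*TP / (2*TP + FP + FN)."""
    return 2.0 * tp / (2 * tp + fp + fn)

# Per-batch (tp, fp, fn); the first batch's counts are inferred from the totals.
batches = [(1, 1, 0), (1, 0, 0)]

fscores = [f1_from_counts(*b) for b in batches]
macro = sum(fscores) / len(fscores)            # average of per-batch F1s
tp, fp, fn = (sum(c) for c in zip(*batches))
micro = f1_from_counts(tp, fp, fn)             # F1 from counts pooled over batches
# macro = (2/3 + 1) / 2 = 5/6 ~ 0.833..., micro = 4/5 = 0.8
```

This is why the test must assert different values for `microF1` and `macroF1` on the same updates.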
For the test part, I think one way is to compare the result with sklearn.metrics.f1_score. However, I'm not sure if it works for CI. @marcoabreu
Since we don't have dependency support on the Windows slaves yet, this won't work unless scikit-learn happens to be present already - I can't check right now. I'd propose just trying it out and seeing whether it works.
Ok, trying this out. We'll see.
Seems that sklearn is not installed on some machines.
Yeah, I've just reverted the commit.
I think the current way is fine. All we need to do is provide an mx.metric version of sklearn.metrics.f1_score.
python/mxnet/metric.py
Outdated
```
@@ -475,8 +475,84 @@ def update(self, labels, preds):
        self.num_inst += num_samples


class _BinaryClassificationMixin(object):
    """
    Private mixin for keeping track of TPR, FPR, TNR, FNR counts for a classification metric.
```
Could you add more explanation of what this does?
Updated. Let me know if that's not what you had in mind.
python/mxnet/metric.py
Outdated
```diff
 @register
-class F1(EvalMetric):
+class F1(EvalMetric, _BinaryClassificationMixin):
```
Use a member instead of multiple inheritance if possible.
This is one of the cases where the use of mixins is appropriate for backward compatibility and ease of use.
I think we can make a `BinaryClassificationMetric` class that is intended to be abstract and inherits from `EvalMetric` without affecting backward compatibility. It seemed more appropriate as a mixin here since it would be useless as a concrete class and doesn't actually implement any functionality required by `EvalMetric`.
Multiple inheritance is very rarely necessary. In this case, I think `_BinaryClassificationMixin` should either inherit from `EvalMetric` or be refactored into a few utility functions.
Composition is also possible.
Honestly, the current multiple inheritance seems reasonable, in that calculating counts and keeping counts are two separate concerns. Mixins are more flexible and will likely require less code when we extend these to multi-class/multi-label/top-k use cases.
Composition is equally flexible, with the drawback of an extra dereference, though I'm fine either way.
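As a rough illustration of the composition alternative being discussed, here is a plain-Python sketch in which the metric holds a counter object as a member instead of mixing it in. All class and method names here are made up for illustration, not the PR's actual API:

```python
class _BinaryClassificationMetrics:
    """Helper that owns the TP/FP/FN counters for one binary classifier."""
    def __init__(self):
        self.true_positives = 0
        self.false_positives = 0
        self.false_negatives = 0

    def update_counts(self, labels, preds):
        for l, p in zip(labels, preds):
            if p == 1 and l == 1:
                self.true_positives += 1
            elif p == 1 and l == 0:
                self.false_positives += 1
            elif p == 0 and l == 1:
                self.false_negatives += 1

    @property
    def fscore(self):
        denom = 2 * self.true_positives + self.false_positives + self.false_negatives
        return 2.0 * self.true_positives / denom if denom else 0.0


class F1:
    """Metric that delegates count-keeping to a member object (composition)."""
    def __init__(self):
        self._stats = _BinaryClassificationMetrics()

    def update(self, labels, preds):
        self._stats.update_counts(labels, preds)

    def get(self):
        return "f1", self._stats.fscore


m = F1()
m.update([1, 0, 1], [1, 1, 0])   # tp=1, fp=1, fn=1 -> F1 = 2/(2+1+1) = 0.5
```

The extra dereference (`self._stats.…`) is the cost; the benefit is that the counter object could later be shared by precision, recall, and F1 without any inheritance.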
Can you check latest commit to see if that's what you had in mind?
Also, one thing to note: when the output has only one label, micro F1 is equivalent to accuracy (https://stackoverflow.com/questions/37358496/is-f1-micro-the-same-as-accuracy). This could potentially help accelerate the computation of micro F1. Also, we sometimes need to deal with multi-label classification and may need to support that in the future.
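The single-label equivalence follows because each wrong prediction contributes exactly one false positive (for the predicted class) and one false negative (for the true class), so pooled precision and recall both equal accuracy. A pure-Python sketch with made-up data:

```python
def micro_f1(preds, labels, num_class):
    """Micro F1: pool TP/FP/FN over all classes, then compute one F1."""
    tp = fp = fn = 0
    for c in range(num_class):
        tp += sum(1 for p, l in zip(preds, labels) if p == c and l == c)
        fp += sum(1 for p, l in zip(preds, labels) if p == c and l != c)
        fn += sum(1 for p, l in zip(preds, labels) if p != c and l == c)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


preds  = [0, 2, 1, 1, 0]
labels = [0, 2, 2, 1, 1]
accuracy = sum(p == l for p, l in zip(preds, labels)) / len(preds)
# micro_f1(preds, labels, 3) == accuracy == 0.6
```

So for single-label outputs, micro F1 could simply be computed as a running correct/total ratio. The equivalence breaks down for multi-label outputs, where a sample can add true positives and false negatives simultaneously.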
@sxjscience let's track the multi-label case in #9589
Obviously, no change is expected since none of the update logic was changed here.

Before: [benchmark output]

After: [benchmark output]
This reverts commit 797c01c.
python/mxnet/metric.py
Outdated
```
@@ -503,21 +582,27 @@ class F1(EvalMetric):
    label_names : list of str, or None
        Name of labels that should be used when updating with update_dict.
        By default include all labels.
    average : str, default 'macro'
        Strategy to be used for aggregating across micro-batches.
```
"mini-batches" is more commonly used.
python/mxnet/metric.py
Outdated
```
    average : str, default 'macro'
        Strategy to be used for aggregating across micro-batches.
        "macro": average the F1 scores for each batch
        "micro": compute a single F1 score across all batches
```
Add period at the end. Currently it renders into:
http://mxnet-doc.s3-accelerate.dualstack.amazonaws.com/api/python/metric/metric.html#mxnet.metric.F1
Thanks all for the review!
* add macro/micro f1 and test and binary abstraction
* make average an option
* use metric.create
* add decimal for float division
* add default in docstring, reference generic base class in error msg
* expand on docstring
* use scikit in test
* Revert "use scikit in test" (reverts commit 797c01c)
* use composition
* minibatches
Description
This PR adds a mixin class that F1 and other metrics like precision and recall can leverage in the future. It also provides a new option for the F1 metric called `average`, which defines how the metric will be aggregated across mini-batches.

Checklist

Essentials

* `make lint`

Approach
The "micro" vs "macro" update strategy is not specific to F1 score. The macro update just takes an average of averages, which can be done for any metric. It may be best to design an abstraction where any metric can have the micro/macro update option, but I couldn't see a good way to do that here.

For now, the behavior for each type of update is hard-coded into the `update` method of the `F1` class. We can discuss the approach.

Please let me know if I have missed or overlooked anything :)