This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

[MX-9588] Add micro averaging strategy for F1 metric #9777

Merged (10 commits), Feb 15, 2018

Conversation

@sethah (Contributor) commented Feb 13, 2018

Description

This PR adds a mixin class that F1 and other metrics like precision and recall can leverage in the future. It also adds a new option for the F1 metric, average, which defines how the metric is aggregated across mini-batches.
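
For illustration, a minimal usage sketch of the new option (the batches iterable here is hypothetical):

import mxnet as mx

# 'macro' (the default) averages the per-batch F1 scores; 'micro' pools
# true/false positive and false negative counts across batches first.
macro_f1 = mx.metric.F1(average="macro")
micro_f1 = mx.metric.F1(average="micro")

for labels, preds in batches:  # hypothetical iterable of (labels, preds) pairs
    macro_f1.update(labels, preds)
    micro_f1.update(labels, preds)

print(macro_f1.get())  # ('f1', macro-averaged value)
print(micro_f1.get())  # ('f1', micro-averaged value)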

Checklist

Essentials

  • Passed code style checking (make lint)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
    • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
    • Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
    • Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
  • Code is well-documented:
    • For user-facing API changes, API doc string has been updated.
    • For new C++ functions in header files, their functionalities and arguments are documented.
    • For new examples, README.md is added to explain what the example does, the source of the dataset, expected performance on the test set, and a reference to the original paper if applicable
  • To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

Approach

The "micro" vs "macro" update strategy is not specific to F1 score. The macro update just takes an average of averages, which can be done for any metric. It may be best to design an abstraction where any metric can have the micro/macro update option, but I couldn't see a good way to do that here that would:

  • be easy to use for end users AND
  • maintain backward compatibility AND
  • maintain current semantics

For now, the behavior for each type of update is hard-coded into the update method of the F1 class. We can discuss the approach.
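
To make the difference concrete, a toy calculation (illustrative numbers; they match the unit test excerpt further down):

# Batch 1: TP=1, FP=1, FN=0  ->  F1 = 2*1 / (2*1 + 1 + 0) = 2/3
# Batch 2: TP=1, FP=0, FN=0  ->  F1 = 2*1 / (2*1 + 0 + 0) = 1.0
macro_f1 = (2. / 3 + 1.0) / 2                    # average of per-batch F1s = 5/6
micro_f1 = 2. * (1 + 1) / (2 * (1 + 1) + 1 + 0)  # F1 from pooled counts = 4/5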

Please let me know if I have missed or overlooked anything :)

@sethah sethah requested a review from szha as a code owner February 13, 2018 04:25
@sethah (Contributor Author) commented Feb 13, 2018

Regarding other approaches, something I looked at was the following:

from mxnet.metric import EvalMetric


class MacroMetric(EvalMetric):

    def __init__(self, base_metric):
        super(MacroMetric, self).__init__("macro_" + base_metric.name,
                                          output_names=base_metric.output_names,
                                          label_names=base_metric.label_names)
        self.base_metric = base_metric

    def update(self, labels, preds):
        # Score this batch in isolation, accumulate the result, then reset
        # the wrapped metric so batches are averaged rather than pooled.
        self.base_metric.update(labels, preds)
        self.sum_metric += self.base_metric.get()[1]
        self.num_inst += 1
        self.base_metric.reset()

Any metric that has defined the "micro" behavior could then be used as "macro" just by calling metric = mx.metric.MacroMetric(mx.metric.F1()), but that seems pretty awkward for users, and it also changes the default behavior to micro, which breaks backward compatibility. The current solution works well enough, and we can revisit in a later PR if needed.

@sethah sethah changed the title Mx 9588 [MX-9588] Add micro averaging strategy for F1 metric Feb 13, 2018
@szha szha self-assigned this Feb 13, 2018
@szha (Member) commented Feb 13, 2018

This fixes #9588

@@ -503,21 +578,27 @@ class F1(EvalMetric):
label_names : list of str, or None
Name of labels that should be used when updating with update_dict.
By default include all labels.
average : str
Member:

, default 'macro'

raise ValueError("%s currently only supports binary classification."
% self.__class__.__name__)

for y_pred, y_true in zip(pred_label, label):
Member:

Do we have to use a for-loop here? Using array arithmetic ops should be more efficient. Also, we could avoid converting to numpy arrays so the calculation can run on the GPU.

Contributor Author:

I agree, but there's another issue for that, so I assumed it would be done in a separate PR. #9586

Member:

Let's address that in a separate PR. My last attempt ran into performance problems when switching to ndarray-based logic, so we should tackle it once it's clearer how to resolve them.

Member:

I've used code like

from mxnet import nd

# Count true positives, false positives, and false negatives with
# elementwise ops so the reductions run on the device (e.g. the GPU).
tp = nd.sum((pred == 1) * (label == 1)).asscalar()
fp = nd.sum((pred == 1) * (label == 0)).asscalar()
fn = nd.sum((pred == 0) * (label == 1)).asscalar()
precision = float(tp) / (tp + fp)
recall = float(tp) / (tp + fn)
f1 = 2 * (precision * recall) / (precision + recall)

to calculate F1, and I find it much faster on the GPU.

Member:

I've previously written code to accelerate the F1 calculation on the GPU. However, it's not based on EvalMetric and uses NDArray directly:

import numpy as np
from mxnet import nd


def nd_f1(pred, label, num_class, average="micro"):
    """Evaluate F1 using mx.nd.NDArray

    Parameters
    ----------
    pred : nd.NDArray
        Shape (num, label_num) or (num,)
    label : nd.NDArray
        Shape (num, label_num) or (num,)
    num_class : int
    average : str

    Returns
    -------
    f1 : float
    """
    if pred.dtype != np.float32:
        pred = pred.astype(np.float32)
        label = label.astype(np.float32)
    assert num_class > 1
    assert pred.ndim == label.ndim
    if num_class == 2 and average == "micro":
        tp = nd.sum((pred == 1) * (label == 1)).asscalar()
        fp = nd.sum((pred == 1) * (label == 0)).asscalar()
        fn = nd.sum((pred == 0) * (label == 1)).asscalar()
        precision = float(tp) / (tp + fp)
        recall = float(tp) / (tp + fn)
        f1 = 2 * (precision * recall) / (precision + recall)
    else:
        assert num_class is not None
        pred_onehot = nd.one_hot(indices=pred, depth=num_class)
        label_onehot = nd.one_hot(indices=label, depth=num_class)
        tp = pred_onehot * label_onehot
        fp = pred_onehot * (1 - label_onehot)
        fn = (1 - pred_onehot) * label_onehot
        if average == "micro":
            tp = nd.sum(tp).asscalar()
            fp = nd.sum(fp).asscalar()
            fn = nd.sum(fn).asscalar()
            precision = float(tp) / (tp + fp)
            recall = float(tp) / (tp + fn)
            f1 = 2 * (precision * recall) / (precision + recall)
        elif average == "macro":
            if tp.ndim == 3:
                tp = nd.sum(tp, axis=(0, 1))
                fp = nd.sum(fp, axis=(0, 1))
                fn = nd.sum(fn, axis=(0, 1))
            else:
                tp = nd.sum(tp, axis=0)
                fp = nd.sum(fp, axis=0)
                fn = nd.sum(fn, axis=0)
            precision = nd.mean(tp / (tp + fp)).asscalar()
            recall = nd.mean(tp / (tp + fn)).asscalar()
            f1 = 2 * (precision * recall) / (precision + recall)
        else:
            raise NotImplementedError
    return f1
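
For example (hypothetical inputs; three classes, macro averaging):

pred = nd.array([0, 2, 1, 1])
label = nd.array([0, 1, 1, 2])
print(nd_f1(pred, label, num_class=3, average="macro"))  # 0.5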

Contributor Author:

Shouldn't this conversation take place on #9586? This PR only aims to fix #9588.

Member:

OK, it should be addressed in a later PR.

fscore2 = 2. * (1) / (2 * 1 + 0 + 0)
fscore_total = 2. * (1 + 1) / (2 * (1 + 1) + (1 + 0) + (0 + 0))
np.testing.assert_almost_equal(microF1.get()[1], fscore_total)
np.testing.assert_almost_equal(macroF1.get()[1], (fscore1 + fscore2) / 2.)
Member:

For the test part, I think one way is to compare the result with sklearn.metrics.f1_score. However, I'm not sure if it works in CI. @marcoabreu
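
Such a check might look like the following sketch (assumes scikit-learn is available; all_labels, all_preds, and micro_f1 are hypothetical):

import numpy as np
from sklearn.metrics import f1_score

# The 'micro' strategy pools counts across batches, so for binary labels it
# should match sklearn's F1 computed on the concatenated predictions.
expected = f1_score(np.concatenate(all_labels), np.concatenate(all_preds))
np.testing.assert_almost_equal(micro_f1.get()[1], expected)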

Contributor:

Since we don't have dependency support on the Windows slaves yet, we can't use this unless scikit-learn is already present as a dependency - I can't check right now. I'd propose to just try it out and see whether it works.

Contributor Author:

Ok, trying this out. We'll see.

Member:

Seems that sklearn is not installed on some machines.

Contributor Author:

Yeah, I've just reverted the commit.

@sxjscience (Member):

I think the current way is fine. All we need to do is provide an mx.metric version of sklearn.metrics.f1_score.

@@ -475,8 +475,84 @@ def update(self, labels, preds):
self.num_inst += num_samples


class _BinaryClassificationMixin(object):
"""
Private mixin for keeping track of TPR, FPR, TNR, FNR counts for a classification metric.
Contributor:

Could you add more explanation of what this does?

Contributor Author:

Updated. Let me know if that's not what you had in mind.

@register
class F1(EvalMetric):
class F1(EvalMetric, _BinaryClassificationMixin):
Contributor:

Use a member instead of multiple inheritance if possible.

Member:

This is one of the cases where using mixins is appropriate, for backward compatibility and ease of use.

Contributor Author:

I think we can make a BinaryClassificationMetric class that is intended to be abstract and inherits from EvalMetric without affecting backward compatibility. It seemed more appropriate as a mixin here since it would be useless as a concrete class and doesn't actually implement any functionality required by EvalMetric.

Contributor:

Multiple inheritance is very rarely necessary. In this case I think _BinaryClassificationMixin should either inherit from EvalMetric or be refactored into a few utility functions.

Contributor:

Composition is also possible.
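
A minimal sketch of what composition could look like here (hypothetical names, not the PR's final code; assumes EvalMetric is imported as above):

class _BinaryClassificationMetrics(object):
    """Hypothetical counter: tracks TP/FP/TN/FN and derives precision/recall/F1."""


class F1(EvalMetric):
    def __init__(self, average="macro", **kwargs):
        super(F1, self).__init__("f1", **kwargs)
        # Composition: hold the counter object as a member instead of mixing it in.
        self.metrics = _BinaryClassificationMetrics()
        self.average = average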

Member:

Honestly, the current multiple inheritance seems reasonable, in that calculating counts and keeping counts are two separate concerns. Mixins are more flexible and will likely require less code when we extend these to multi-class/multi-label/top-k use cases.

Member:

Composition is equally flexible, with the drawback of an extra dereference, though I'm fine either way.

Contributor Author:

Can you check the latest commit to see if that's what you had in mind?

@sxjscience (Member) commented Feb 13, 2018

Also, one thing to note is that when the output has only one label, micro F1 is equivalent to accuracy (https://stackoverflow.com/questions/37358496/is-f1-micro-the-same-as-accuracy). This could potentially help accelerate the computation of micro F1.

Also, we sometimes need to deal with multi-label classification and may need to support that in the future.
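
As a quick sanity check of that equivalence, a small numpy sketch (illustrative data):

import numpy as np

pred = np.array([1, 0, 1, 1, 0, 1])
label = np.array([1, 1, 0, 1, 0, 0])

accuracy = np.mean(pred == label)

# Micro F1 pools TP/FP/FN over both classes; every error is simultaneously
# a false positive for one class and a false negative for the other, so
# micro precision == micro recall == accuracy.
tp = np.sum((pred == 1) & (label == 1)) + np.sum((pred == 0) & (label == 0))
fp = np.sum((pred == 1) & (label == 0)) + np.sum((pred == 0) & (label == 1))
fn = np.sum((pred == 0) & (label == 1)) + np.sum((pred == 1) & (label == 0))
micro_f1 = 2.0 * tp / (2 * tp + fp + fn)

assert np.isclose(accuracy, micro_f1)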

@szha (Member) commented Feb 13, 2018

@sxjscience let's track the multi-label case in #9589

@szha (Member) commented Feb 13, 2018

@sethah do you mind running the related test from #9705 and report the before/after change results here?

@sethah (Contributor Author) commented Feb 14, 2018

As expected, there is no change, since none of the update logic was modified here.

Before

Metric         Data-Ctx  Label-Ctx   Data Size   Batch Size     Output Dim     Elapsed Time
------------------------------------------------------------------------------------------
F1             cpu(0)    cpu(0)      131072      16             2              0.82034
------------------------------------------------------------------------------------------
F1             cpu(0)    cpu(0)      131072      64             2              0.46931
------------------------------------------------------------------------------------------
F1             cpu(0)    cpu(0)      131072      256            2              0.35179
------------------------------------------------------------------------------------------
F1             cpu(0)    cpu(0)      131072      1024           2              0.35274
------------------------------------------------------------------------------------------
F1             cpu(0)    cpu(0)      16384       16             2              0.10991
------------------------------------------------------------------------------------------
F1             cpu(0)    cpu(0)      16384       64             2              0.057258
------------------------------------------------------------------------------------------
F1             cpu(0)    cpu(0)      16384       256            2              0.046497
------------------------------------------------------------------------------------------
F1             cpu(0)    cpu(0)      16384       1024           2              0.044378
------------------------------------------------------------------------------------------
F1             cpu(0)    cpu(0)      2048        16             2              0.01292
------------------------------------------------------------------------------------------
F1             cpu(0)    cpu(0)      2048        64             2              0.0070581
------------------------------------------------------------------------------------------
F1             cpu(0)    cpu(0)      2048        256            2              0.005384
------------------------------------------------------------------------------------------
F1             cpu(0)    cpu(0)      2048        1024           2              0.00511
------------------------------------------------------------------------------------------

After

Metric         Data-Ctx  Label-Ctx   Data Size   Batch Size     Output Dim     Elapsed Time
------------------------------------------------------------------------------------------
F1             cpu(0)    cpu(0)      131072      16             2              0.81942
------------------------------------------------------------------------------------------
F1             cpu(0)    cpu(0)      131072      64             2              0.46574
------------------------------------------------------------------------------------------
F1             cpu(0)    cpu(0)      131072      256            2              0.37653
------------------------------------------------------------------------------------------
F1             cpu(0)    cpu(0)      131072      1024           2              0.32401
------------------------------------------------------------------------------------------
F1             cpu(0)    cpu(0)      16384       16             2              0.099127
------------------------------------------------------------------------------------------
F1             cpu(0)    cpu(0)      16384       64             2              0.057564
------------------------------------------------------------------------------------------
F1             cpu(0)    cpu(0)      16384       256            2              0.044527
------------------------------------------------------------------------------------------
F1             cpu(0)    cpu(0)      16384       1024           2              0.04604
------------------------------------------------------------------------------------------
F1             cpu(0)    cpu(0)      2048        16             2              0.012832
------------------------------------------------------------------------------------------
F1             cpu(0)    cpu(0)      2048        64             2              0.0071719
------------------------------------------------------------------------------------------
F1             cpu(0)    cpu(0)      2048        256            2              0.0058281
------------------------------------------------------------------------------------------
F1             cpu(0)    cpu(0)      2048        1024           2              0.0057449
------------------------------------------------------------------------------------------

@@ -503,21 +582,27 @@ class F1(EvalMetric):
label_names : list of str, or None
Name of labels that should be used when updating with update_dict.
By default include all labels.
average : str, default 'macro'
Strategy to be used for aggregating across micro-batches.
Member:

"mini-batches" is more commonly used.

average : str, default 'macro'
Strategy to be used for aggregating across micro-batches.
"macro": average the F1 scores for each batch
"micro": compute a single F1 score across all batches

@szha szha merged commit d03182f into apache:master Feb 15, 2018
@sethah (Contributor Author) commented Feb 15, 2018

Thanks all for the review!

rahul003 pushed a commit to rahul003/mxnet that referenced this pull request Jun 4, 2018
* add macro/micro f1 and test and binary abstraction

* make average an option

* use metric.create

* add decimal for float division

* add default in docstring, reference generic base class in error msg

* expand on docstring

* use scikit in test

* Revert "use scikit in test"

This reverts commit 797c01c.

* use composition

* minibatches
zheng-da pushed a commit to zheng-da/incubator-mxnet that referenced this pull request Jun 28, 2018
(same commit list as above)