Proposal to mxnet.metric #18046

acphile · 2020-04-14T01:50:31Z

Motivation

mxnet.metric provides different methods for users to judge the performance of models. However, it currently has several shortcomings. We propose to refactor the metrics interface to address them and to place the new interface under mx.gluon.metrics.

In general, we want to make the following improvements:

Move the API to the gluon namespace
Make the API more user-friendly and Pythonic
Structure the API so that hybridizing the complete training loop becomes feasible in the future

1. Inconsistency in computational granularity of metrics

Currently there are two computational granularities in mxnet.metric:

“macro” level: calculates average performance per batch, as in the implementation of MAE
“micro” level: calculates average performance per sample, as in the implementations of Accuracy and CrossEntropy

Generally, the “micro” level is more useful, because we usually care about the average performance over the data samples in the test set rather than over the test batches. The metrics therefore need to be made consistent in this respect.
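To make the difference between the two granularities concrete, here is a minimal plain-Python sketch. The function names and the toy data are hypothetical and are not part of mxnet.metric; the point is only that the two averages disagree as soon as batches have different sizes.

```python
# Illustrative only: MAE under the two granularities described above.
# Function names are hypothetical, not mxnet.metric APIs.

def mae_per_batch_average(batches):
    """'macro' granularity: compute MAE per batch, then average the batch values."""
    batch_maes = []
    for labels, preds in batches:
        errors = [abs(y - p) for y, p in zip(labels, preds)]
        batch_maes.append(sum(errors) / len(errors))
    return sum(batch_maes) / len(batch_maes)


def mae_per_sample_average(batches):
    """'micro' granularity: accumulate absolute errors over all samples."""
    total_error = 0.0
    num_samples = 0
    for labels, preds in batches:
        total_error += sum(abs(y - p) for y, p in zip(labels, preds))
        num_samples += len(labels)
    return total_error / num_samples


batches = [
    ([0.0, 0.0, 0.0, 0.0], [1.0, 1.0, 1.0, 1.0]),  # 4 samples, absolute error 1 each
    ([0.0], [4.0]),                                # 1 sample, absolute error 4
]
macro = mae_per_batch_average(batches)   # (1 + 4) / 2 = 2.5
micro = mae_per_sample_average(batches)  # (4 + 4) / 5 = 1.6
```

With equally sized batches the two values coincide, so the discrepancy typically shows up only through a smaller last batch.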

2. For future hybridization of the complete training loop

Currently, metrics in mxnet.metric receive a list of NDArray and compute their results with numpy. In fact, the computation of many metrics could be implemented in an nn.HybridBlock. With HybridBlock.hybridize(), the computation could run in the backend, which could be faster. By refactoring mxnet.metric, we could one day compile the model together with the metric, as in TensorFlow, and run the complete training loop, including evaluation, fully in the backend. The new API design therefore takes the hybridization use case into account, so that hybridizing the complete training loop becomes straightforward once backend support is there.

3. Lacking some useful metrics

Although many metrics are already included, some still need to be implemented.

Apart from the metrics already provided in mxnet.metric: http://mxnet.incubator.apache.org/api/python/docs/api/metric/index.html?highlight=metric#module-mxnet.metric , we plan to add the following metrics:

F-beta score: (1+beta^2)*precision*recall/(beta^2*precision+recall)
binary accuracy with threshold: using a confidence threshold to judge whether the example is positive or negative
MeanCosineSimilarity: returns the average cosine similarity between predictions and ground truth
MeanPairwiseDistance: returns the average pairwise distance between predictions and ground truth
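As a rough illustration of the first two items, the sketch below thresholds confidence scores into binary predictions and then computes F-beta and binary accuracy from the resulting counts. It is plain Python with hypothetical helper names. Returning 0 when a denominator is zero is an assumption made here, not something the proposal specifies.

```python
# Illustrative only: helper names are hypothetical, not mxnet.metric APIs.

def binary_counts(labels, scores, threshold=0.5):
    """Count TP/FP/TN/FN, classifying as positive when score > threshold."""
    tp = fp = tn = fn = 0
    for label, score in zip(labels, scores):
        predicted_positive = score > threshold
        if predicted_positive and label == 1:
            tp += 1
        elif predicted_positive and label == 0:
            fp += 1
        elif not predicted_positive and label == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def fbeta(tp, fp, fn, beta=1.0):
    """F-beta = (1+beta^2)*precision*recall / (beta^2*precision + recall)."""
    # Assumption: undefined ratios are reported as 0.0.
    precision = tp / (tp + fp) if tp + fp > 0 else 0.0
    recall = tp / (tp + fn) if tp + fn > 0 else 0.0
    denom = beta ** 2 * precision + recall
    return (1 + beta ** 2) * precision * recall / denom if denom > 0 else 0.0


def binary_accuracy(tp, fp, tn, fn):
    """(true_positives + true_negatives) / total_examples."""
    return (tp + tn) / (tp + fp + tn + fn)


labels = [1, 0, 1, 1, 0]
scores = [0.9, 0.6, 0.4, 0.8, 0.1]
tp, fp, tn, fn = binary_counts(labels, scores, threshold=0.5)
f1 = fbeta(tp, fp, fn, beta=1.0)
acc = binary_accuracy(tp, fp, tn, fn)
```

Here the thresholded predictions are [1, 1, 0, 1, 0], giving precision = recall = 2/3, so F1 is 2/3 and the binary accuracy is 3/5.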

4. Fixing issues in the existing metrics

Some special cases and input shapes need to be examined and fixed.
About EvalMetric (base class in metric.py)

distinction between local and global:
a. Currently, for metrics in metric.py, when update() is called, both the local accumulator and the global accumulator are updated with the same value.
b. The global accumulator may be useful when the evaluation has several parts (for example, joint training on different datasets). You may want to get the evaluation result of one part and then call “reset_local()” to continue the evaluation for the next part. In the end, you can call “get_global()” to obtain the overall evaluation performance.
c. You may also define how local and global results are updated in your own metric (a subclass of EvalMetric).
parameters “output_names” and “label_names”, and method “update_dict”
a. I could only find “update_dict” used in “https://github.com/apache/incubator-mxnet/blob/48e9e2c6a1544843ba860124f4eaa8e7bac6100b/python/mxnet/module/executor_group.py”, where I think using “update” would also be reasonable.
b. I don’t know where the corresponding parameters "output_names" and "label_names" could be used, since there are no corresponding examples.
get_name_value()
a. returns pairs of the metric’s name and the metric’s evaluation value.
b. It is helpful when using CompositeEvalMetric
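The local/global distinction described above can be sketched with a small stand-alone class. This is a hypothetical stand-in written for illustration, not the actual EvalMetric code; only the method names reset_local() and get_global() are taken from the discussion above.

```python
# Illustrative only: a hypothetical running-mean metric with local and
# global accumulators. It does not reproduce mxnet's EvalMetric.

class RunningMean:
    def __init__(self):
        self.global_sum = 0.0
        self.global_count = 0
        self.reset_local()

    def reset_local(self):
        """Clear only the local accumulator; the global one keeps its state."""
        self.local_sum = 0.0
        self.local_count = 0

    def update(self, values):
        """Both accumulators receive the same update."""
        batch_sum = float(sum(values))
        self.local_sum += batch_sum
        self.local_count += len(values)
        self.global_sum += batch_sum
        self.global_count += len(values)

    def get(self):
        return self.local_sum / self.local_count if self.local_count else float("nan")

    def get_global(self):
        return self.global_sum / self.global_count if self.global_count else float("nan")


metric = RunningMean()
metric.update([1.0, 0.0])            # evaluation on part A
part_a = metric.get()                # 0.5
metric.reset_local()
metric.update([1.0, 1.0, 1.0, 1.0])  # evaluation on part B
part_b = metric.get()                # 1.0
overall = metric.get_global()        # 5 / 6 over both parts
```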

Here are the detailed changes to be made:

improve Class MAE (and MSE, RMSE)
a. including parameter “average”, default average=“macro”
i. “macro” represents average per batch
ii. “micro” represents average per example
b. including micro level calculation:
improve Class _BinaryClassification
a. support the situation len(pred.shape)==1
i. for binary classification, we only need to output a confidence score of being positive, like: pred=[0.1,0.3,0.7] or like pred=[[0.1],[0.3],[0.7]]
b. including parameter “threshold”, default: threshold=0.5
i. sometimes we may need to define a threshold that when confidence(positive) > threshold, we classify it as positive, otherwise negative
c. including parameter “beta” default: beta=1
i. updating “fscore” calculation with F-beta= (1+beta^2)*precision*recall/(beta^2*precision+recall), which is more general
d. including method binary_accuracy:
i. calculation: (true_positives+true_negatives)/total_examples
improve Class TopKAccuracy
a. Line 578-579: self.global_sum_metric should be accumulated
add Class MeanCosineSimilarity(axis=-1, eps=1e-12)
add Class MeanPairwiseDistance(p=2)
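For the two classes to be added, a minimal plain-Python sketch over 2-D inputs (one vector per sample, reduced along the last axis) could look as follows. How eps enters the computation is an assumption here: it is used to clamp the product of the norms so that a zero vector does not cause a division by zero. The function names are hypothetical.

```python
# Illustrative only: plain-Python versions over lists of vectors.
# The use of eps (clamping the denominator) is an assumption.
import math


def mean_cosine_similarity(labels, preds, eps=1e-12):
    """Average cosine similarity between each label/prediction vector pair."""
    total = 0.0
    for y, p in zip(labels, preds):
        dot = sum(a * b for a, b in zip(y, p))
        norm_y = math.sqrt(sum(a * a for a in y))
        norm_p = math.sqrt(sum(b * b for b in p))
        total += dot / max(norm_y * norm_p, eps)
    return total / len(labels)


def mean_pairwise_distance(labels, preds, p=2):
    """Average p-norm distance between each label/prediction vector pair."""
    total = 0.0
    for y, q in zip(labels, preds):
        total += sum(abs(a - b) ** p for a, b in zip(y, q)) ** (1.0 / p)
    return total / len(labels)


labels = [[1.0, 0.0], [0.0, 1.0]]
preds = [[1.0, 0.0], [1.0, 0.0]]
cos = mean_cosine_similarity(labels, preds)   # (1 + 0) / 2
dist = mean_pairwise_distance(labels, preds)  # (0 + sqrt(2)) / 2
```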

Comparisons with other frameworks

Compared with PyTorch Ignite

Reference: https://pytorch.org/ignite/metrics.html
The base class for metrics is implemented independently. Metrics in ignite.metrics use the .attach() method to consume the output of the engine’s process_function. This is done by having the engine call add_event_handler.
Metric arithmetic is supported, which is similar to mxnet.metric.CustomMetric
Some metrics currently are not included in ours:

Compared with TensorFlow Keras

Reference: https://tensorflow.google.cn/api_docs/python/tf/keras/metrics?hl=en
Base class for metrics inherits from tf.keras.engine.base_layer.Layer, which is also the class from which all layers inherit. Metric functions in tf.keras.metrics could be supplied in the metrics parameter when a model is compiled.
Generally, metric functions in tf.keras.metrics have an input sample_weight defining contributing weights when updating the states.
tf.keras.metrics uses Accuracy and SparseCategoricalAccuracy to distinguish the case where y_pred is a predicted label from the case where y_pred is a probability distribution, which I think may be to avoid internal shape checking. Currently we could combine them in one metric.
Some metrics currently are not included in ours:

AUC
BinaryAccuracy
Hinge related, like SquaredHinge, Hinge, CategoricalHinge
CosineSimilarity
KLDivergence
LogCoshError: logcosh = log((exp(x) + exp(-x))/2), where x is the error (y_pred - y_true)
MeanIoU
Poisson
SensitivityAtSpecificity

sxjscience · 2020-04-14T21:46:16Z

I think we can also borrow ideas from the design in AllenNLP: https://github.com/allenai/allennlp/tree/master/allennlp/training/metrics

sxjscience · 2020-04-16T15:46:15Z

Also, I suggest removing the option of macro averaging. I don't think the current implementation is correct. In scikit-learn, there is no macro option for MAE (https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_absolute_error.html#sklearn.metrics.mean_absolute_error) or MSE (https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_squared_error.html#sklearn.metrics.mean_squared_error). And for F1 score, the macro option is used for multi-label/multi-class prediction. See also: #9586 (comment)

acphile · 2020-04-27T08:43:06Z

Here are the updated changes to be made:

1. improve Class MAE, MSE, RMSE

a. UPD: remove “macro” support, which represents the average per batch
b. Rewrite RMSE to inherit from MSE
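One way to read items a and b together is sketched below: MSE accumulates squared errors per sample, and RMSE reuses that state and only changes how the final value is reported. The class bodies are hypothetical and written in plain Python. They are not the code merged in #18083.

```python
# Illustrative only: hypothetical MSE/RMSE classes with per-sample accumulation.
import math


class MSE:
    def __init__(self):
        self.sum_squared_error = 0.0
        self.num_samples = 0

    def update(self, labels, preds):
        """Accumulate squared errors over individual samples, not batches."""
        for y, p in zip(labels, preds):
            self.sum_squared_error += (y - p) ** 2
            self.num_samples += 1

    def get(self):
        if self.num_samples == 0:
            return float("nan")
        return self.sum_squared_error / self.num_samples


class RMSE(MSE):
    def get(self):
        """Same accumulation as MSE; the square root is taken once at the end."""
        return math.sqrt(super().get())


mse, rmse = MSE(), RMSE()
for labels, preds in [([0.0, 0.0], [3.0, 3.0]), ([0.0], [3.0])]:
    mse.update(labels, preds)
    rmse.update(labels, preds)
mse_value = mse.get()    # 27 / 3 = 9.0
rmse_value = rmse.get()  # sqrt(9.0) = 3.0
```

Taking the square root only in get() means the result is the root of the overall mean, which in general differs from an average of per-batch RMSE values.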

2. improve Class _BinaryClassification

a. UPD: including parameter “class_type” in [‘binary’, ‘multiclass’, ‘multilabel’]
b. support the situation len(pred.shape)==1 for class_type='binary'
     i. for binary classification, we only need to output a confidence score of being positive, like: pred=[0.1,0.3,0.7] or like pred=[[0.1],[0.3],[0.7]]
c. including parameter “threshold”, default: threshold=0.5
     i. sometimes we may need to define a threshold that when confidence(positive) > threshold, we classify it as positive, otherwise negative
     ii. used when class_type in [‘binary’, ‘multilabel’]
d. including parameter “beta” default: beta=1
     i. updating “fscore” calculation with F-beta= (1+beta^2)*precision*recall/(beta^2*precision+recall), which is more general
e. UPD: add cases for multilabel/multiclass
     i. including parameter ‘class_type’ in [‘binary’, ‘multilabel’, ‘multiclass’]
     ii. For ‘multilabel’, pred should be (N, ..., C) and label should be (N, ..., C)
     iii. For ‘multiclass’, pred should be (N, ..., C) and label should be (N, ...)
f. UPD: replace global_fscore with micro_fscore
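The three class_type cases above differ mainly in how raw predictions are turned into predicted labels. The sketch below shows one possible reading for the simplest shapes only, (N,) or (N, 1) for 'binary' and (N, C) otherwise, rather than the general (N, ..., C) case. The function name is hypothetical, and using argmax for 'multiclass' is an assumption.

```python
# Illustrative only: hypothetical helper, restricted to (N,), (N, 1) and (N, C) inputs.

def to_predicted_labels(pred, class_type="binary", threshold=0.5):
    if class_type == "binary":
        # Accept both [0.1, 0.7] and [[0.1], [0.7]].
        flat = [row[0] if isinstance(row, (list, tuple)) else row for row in pred]
        return [1 if score > threshold else 0 for score in flat]
    if class_type == "multilabel":
        # Each of the C outputs is thresholded independently: result is (N, C).
        return [[1 if score > threshold else 0 for score in row] for row in pred]
    if class_type == "multiclass":
        # Assumption: the predicted class is the argmax over C; result is (N,).
        return [max(range(len(row)), key=row.__getitem__) for row in pred]
    raise ValueError("class_type must be 'binary', 'multilabel' or 'multiclass'")


binary_flat = to_predicted_labels([0.1, 0.3, 0.7], "binary")
binary_column = to_predicted_labels([[0.1], [0.3], [0.7]], "binary")
multilabel = to_predicted_labels([[0.9, 0.2], [0.6, 0.7]], "multilabel")
multiclass = to_predicted_labels([[0.1, 0.7, 0.2], [0.8, 0.1, 0.1]], "multiclass")
```

The threshold is only consulted for 'binary' and 'multilabel', matching item c.ii above.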

3. add Class BinaryAccuracy(threshold=0.5)

4. add Class MeanCosineSimilarity(axis=-1, eps=1e-12)

5. add Class MeanPairwiseDistance(p=2)

6. improve Class F1:

a. F1(class_type="binary", threshold=0.5, average="micro")
b. average in [“binary”, “micro”, “macro”]:
     i. "macro": Calculate metrics for each label and return unweighted mean of f1.
     ii. "micro": Calculate metrics globally by counting the total TP, FN and FP.
     iii. None: Return f1 scores for each class (numpy.ndarray) .
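The difference between the averaging modes can be sketched from per-class (TP, FP, FN) counts. The function names and the toy counts are hypothetical; the identity F1 = 2*TP / (2*TP + FP + FN) is the usual equivalent form of the harmonic mean of precision and recall.

```python
# Illustrative only: averaging modes for F1, given per-class (TP, FP, FN) counts.

def f1_from_counts(tp, fp, fn):
    denom = 2 * tp + fp + fn
    # Assumption: report 0.0 when there are no positives at all.
    return 2 * tp / denom if denom > 0 else 0.0


def averaged_f1(per_class_counts, average="micro"):
    if average == "micro":
        # Pool the counts over all classes, then compute a single F1.
        tp = sum(c[0] for c in per_class_counts)
        fp = sum(c[1] for c in per_class_counts)
        fn = sum(c[2] for c in per_class_counts)
        return f1_from_counts(tp, fp, fn)
    scores = [f1_from_counts(*c) for c in per_class_counts]
    if average == "macro":
        # Unweighted mean of the per-class F1 scores.
        return sum(scores) / len(scores)
    if average is None:
        return scores
    raise ValueError("average must be 'micro', 'macro' or None")


counts = [(8, 2, 2), (1, 1, 3)]  # (TP, FP, FN) for class 0 and class 1
micro_f1 = averaged_f1(counts, "micro")  # 18 / 26
macro_f1 = averaged_f1(counts, "macro")  # (0.8 + 1/3) / 2
per_class = averaged_f1(counts, None)    # [0.8, 1/3]
```

In this toy case the frequent class dominates the micro average, while the macro average weights the poorly predicted rare class equally.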

7. add Class Fbeta(class_type="binary", beta=1, threshold=0.5, average="micro")

8. UPD: using mxnet.numpy instead of numpy

leezu · 2020-05-27T22:52:06Z

Closed by #18083

acphile added the Feature request label Apr 14, 2020

acphile mentioned this issue Apr 16, 2020

Changes to mxnet.metric #18083

Merged

7 tasks

leezu mentioned this issue May 20, 2020

Pin mxnet version in response to mx.metric reorg dmlc/gluon-cv#1310

Merged

leezu closed this as completed May 27, 2020
