Text-related (NLP) metrics

@Borda Borda released this 10 Aug 14:22
· 1545 commits to master since this release

[0.5.0] - 2021-08-09

This release includes general improvements to the library and new metrics within the NLP domain.

https://devblog.pytorchlightning.ai/torchmetrics-v0-5-nlp-metrics-f4232467b0c5

Natural language processing is arguably one of the most exciting areas of machine learning, with models such as BERT, RoBERTa, and GPT-3 pushing the limits of what automated text translation, recognition, and generation systems can do.

With the introduction of these models, many metrics have been proposed to measure how well they perform. TorchMetrics v0.5 includes four such metrics: BERT score, BLEU, ROUGE, and WER.
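To make one of these metrics concrete, here is a minimal pure-Python sketch of word error rate (WER), computed as the word-level edit distance divided by the number of reference words. This illustrates the standard definition only; it is not the TorchMetrics implementation, and the function name `word_error_rate` is hypothetical.

```python
def word_error_rate(predictions, references):
    """Word-level edit distance summed over all pairs, divided by total reference words.

    Illustrative sketch only, not the TorchMetrics implementation.
    Assumes whitespace tokenization and at least one reference word.
    """
    total_errors = 0
    total_words = 0
    for pred, ref in zip(predictions, references):
        pred_tokens = pred.split()
        ref_tokens = ref.split()
        # Dynamic-programming Levenshtein distance over words, one row at a time.
        previous = list(range(len(pred_tokens) + 1))
        for i, ref_word in enumerate(ref_tokens, start=1):
            current = [i]
            for j, pred_word in enumerate(pred_tokens, start=1):
                substitution = previous[j - 1] + (ref_word != pred_word)
                insertion = current[j - 1] + 1   # extra word in the prediction
                deletion = previous[j] + 1       # reference word missing from the prediction
                current.append(min(substitution, insertion, deletion))
            previous = current
        total_errors += previous[-1]
        total_words += len(ref_tokens)
    return total_errors / total_words


predictions = ["this is the prediction", "there is an other sample"]
references = ["this is the reference", "there is another one"]
print(word_error_rate(predictions, references))  # 4 errors over 8 reference words
```

Lower is better: a value of 0 means every prediction matches its reference word for word, and the value can exceed 1 when predictions contain many extra words.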

Detailed changes

Added

  • Added Text-related (NLP) metrics: BERT score, BLEU, ROUGE, and WER
  • Added MetricTracker wrapper metric for keeping track of the same metric over multiple epochs (#238)
  • Added other metrics:
    • Symmetric Mean Absolute Percentage error (SMAPE) (#375)
    • Calibration error (#394)
    • Permutation Invariant Training (PIT) (#384)
  • Added support in nDCG metric for target with values larger than 1 (#349)
  • Added support for negative targets in nDCG metric (#378)
  • Added None as reduction option in CosineSimilarity metric (#400)
  • Allowed passing labels in (n_samples, n_classes) to AveragePrecision (#386)
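Of the metrics added above, SMAPE is simple enough to sketch in a few lines. The version below assumes the common convention in which each term is 2|y - ŷ| / (|y| + |ŷ|), giving values in [0, 2]; other conventions exist, and the release notes do not say which one the library uses. The function name `smape` and the `eps` default are assumptions made for illustration.

```python
def smape(preds, target, eps=1e-8):
    """Symmetric mean absolute percentage error, using the [0, 2] convention.

    Illustrative sketch only. `eps` (an assumed value) guards against
    division by zero when a prediction and its target are both zero.
    """
    if len(preds) != len(target):
        raise ValueError("preds and target must have the same length")
    total = 0.0
    for p, t in zip(preds, target):
        total += 2 * abs(p - t) / max(abs(p) + abs(t), eps)
    return total / len(preds)


print(smape([1.0, 2.0], [1.0, 2.0]))  # perfect predictions give 0.0
print(smape([1.0], [3.0]))            # 2 * 2 / 4
```

Because the denominator uses both the prediction and the target, the score treats over- and under-prediction symmetrically, unlike plain MAPE.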

Changed

  • Moved psnr and ssim from functional.regression.* to functional.image.* (#382)
  • Moved image_gradient from functional.image_gradients to functional.image.gradients (#381)
  • Moved R2Score from regression.r2score to regression.r2 (#371)
  • Pearson metric now only stores 6 statistics instead of all predictions and targets (#380)
  • Use torch.argmax instead of torch.topk when k=1 for better performance (#419)
  • Moved check for number of samples in R2 score to support single sample updating (#426)
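The Pearson change in the list above reflects a general idea: a correlation coefficient can be accumulated from a fixed number of running statistics, so memory no longer grows with the number of samples. The sketch below uses six raw sums chosen for readability; this choice is an assumption, and the library may track a different (and more numerically stable) set of six values. The class name `StreamingPearson` is hypothetical.

```python
import math


class StreamingPearson:
    """Pearson correlation from six running sums (illustrative sketch only)."""

    def __init__(self):
        self.n = 0
        self.sum_x = 0.0
        self.sum_y = 0.0
        self.sum_xx = 0.0
        self.sum_yy = 0.0
        self.sum_xy = 0.0

    def update(self, preds, target):
        # Fold a batch into the running sums; the batch itself is not stored.
        for x, y in zip(preds, target):
            self.n += 1
            self.sum_x += x
            self.sum_y += y
            self.sum_xx += x * x
            self.sum_yy += y * y
            self.sum_xy += x * y

    def compute(self):
        # Covariance and variances, each scaled by n (the factor cancels in the ratio).
        cov = self.sum_xy - self.sum_x * self.sum_y / self.n
        var_x = self.sum_xx - self.sum_x ** 2 / self.n
        var_y = self.sum_yy - self.sum_y ** 2 / self.n
        return cov / math.sqrt(var_x * var_y)


metric = StreamingPearson()
metric.update([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])  # first batch
metric.update([4.0], [8.0])                      # second batch
print(metric.compute())                          # perfectly correlated data
```

Raw sums like these can lose precision when values are large relative to their spread; updating running means and variances instead avoids that at the cost of slightly more involved update code.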

Deprecated

  • Renamed r2score >> r2_score and kldivergence >> kl_divergence in functional (#371)
  • Moved bleu_score from functional.nlp to functional.text.bleu (#360)

Removed

  • Removed restriction that threshold has to be in (0,1) range to support logit input (#351, #401)
  • Removed restriction that preds could not be bigger than num_classes to support logit input (#357)
  • Removed modules regression.psnr and regression.ssim (#382)
  • Removed (#379):
    • function functional.mean_relative_error
    • num_thresholds argument in BinnedPrecisionRecallCurve

Fixed

  • Fixed bug where classification metrics with average='macro' would lead to wrong result if a class was missing (#303)
  • Fixed weighted, multi-class AUROC computation to allow for 0 observations of some class, as contribution to final AUROC is 0 (#376)
  • Fixed _forward_cache and _computed attributes not being moved to the correct device when the metric is moved (#413)
  • Fixed calculation in IoU metric when using ignore_index argument (#328)

Contributors

@BeyondTheProof, @Borda, @CSautier, @discort, @edwardclem, @gagan3012, @hugoperrin, @karthikrangasai, @paul-grundmann, @quancs, @rajs96, @SkafteNicki, @vatch123

If we forgot someone due to not matching commit email with GitHub account, let us know :]