CharacTER: MT metric (#286)

* init character MT metric
* Update README.md
* make style
* add isorts fixes
* make style
* fix example in README
* add cer dependency for tests
* Update metrics/character/requirements.txt
  Co-authored-by: helen <31600291+mathemakitten@users.noreply.github.com>
* Update metrics/character/README.md
  Co-authored-by: helen <31600291+mathemakitten@users.noreply.github.com>
* Update metrics/character/README.md
  Co-authored-by: helen <31600291+mathemakitten@users.noreply.github.com>
* Update metrics/character/character.py
  Co-authored-by: helen <31600291+mathemakitten@users.noreply.github.com>
* Update metrics/character/character.py
  Co-authored-by: helen <31600291+mathemakitten@users.noreply.github.com>
* Delete .gitattributes
* require cer >=1.1.0
* use calculate_cer when given a string
* add separate test for single/corpus
* streamline output format
  the corpus version now only adds attributes, but cer_scores will always be present and always a list
* style
* update documentation
* add singleton example
* update cer dependency to 1.2.0
* make metric more robust
  Now correctly accepts single strings and lists as input. Now only returns cer_scores and not other statistics as this seems rather uncommon and might be confusing for users.
* fix doctest formatting
* use non-local metric name
* update dependency
* simplify metric, assume we always work with batches
* aggregate scores
  add aggregate and return_all_scores arguments
* add multi-reference option
* remove "Literal"
* Delete tests.py
  Do tests via doctest instead
* Apply suggestions from code review
  Co-authored-by: helen <31600291+mathemakitten@users.noreply.github.com>
* Update description
  Co-authored-by: helen <31600291+mathemakitten@users.noreply.github.com>
  Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>
huggingface · Dec 8, 2022 · 544f1e8
1 parent 83322c1
commit 544f1e8
Showing 5 changed files with 284 additions and 0 deletions.
diff --git a/metrics/character/README.md b/metrics/character/README.md
@@ -0,0 +1,106 @@
+---
+title: CharacTER
+emoji: 🔤
+colorFrom: orange
+colorTo: red
+sdk: gradio
+sdk_version: 3.0.2
+app_file: app.py
+pinned: false
+tags:
+- evaluate
+- metric
+- machine-translation
+description: >-
+  CharacTer is a character-level metric inspired by the commonly applied translation edit rate (TER).
+---
+
+# Metric Card for CharacTER
+
+## Metric Description
+CharacTer is a character-level metric inspired by the translation edit rate (TER) metric. It is
+defined as the minimum number of character edits required to adjust a hypothesis until it completely matches the
+reference, normalized by the length of the hypothesis sentence. CharacTer calculates the character-level edit
+distance while performing the shift edit at the word level. Unlike the strict matching criterion in TER, a hypothesis
+word is considered to match a reference word, and can be shifted, if the edit distance between them is below a
+threshold value. The Levenshtein distance between the reference and the shifted hypothesis sequence is computed at the
+character level. In addition, the lengths of hypothesis sequences instead of reference sequences are used for
+normalizing the edit distance, which effectively counters the issue that shorter translations normally achieve lower
+TER.
+
+## Intended Uses
+CharacTER was developed for machine translation evaluation.
+
+## How to Use
+
+```python
+import evaluate
+character = evaluate.load("character")
+
+# Single hyp/ref 
+preds = ["this week the saudis denied information published in the new york times"]
+refs = ["saudi arabia denied this week information published in the american new york times"]
+results = character.compute(references=refs, predictions=preds)
+
+# Corpus example
+preds = ["this week the saudis denied information published in the new york times",
+         "this is in fact an estimate"]
+refs = ["saudi arabia denied this week information published in the american new york times",
+        "this is actually an estimate"]
+results = character.compute(references=refs, predictions=preds)
+```
+
+### Inputs
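+- **predictions**: a list of predictions to score. Each prediction should be a string with tokens separated by spaces.
+- **references**: a list with one reference per prediction, or a list of lists when a prediction has several
+     references (the lowest, i.e. best, score is kept). Each reference should be a string with tokens separated by spaces.
+- **aggregate**: one of `"mean"`, `"sum"` or `"median"`; how the sentence-level scores are aggregated (default: `"mean"`).
+- **return_all_scores**: a boolean; whether to also return the individual sentence-level scores (default: `False`).
+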
+### Output Values
+
+- **cer_score**: the aggregated score over all prediction/reference pairs, computed according to
+     `aggregate`
+- **cer_scores**: a list of all individual scores, one per prediction; only returned when
+     `return_all_scores=True`
+
+CharacTER is an edit rate, so lower scores are better; a score of 0 means that the prediction matches
+the reference exactly.
+
+When several references are given for a prediction, the lowest score over those references is reported.
+
+### Output Example
+```python
+# Single hyp/ref example with the default aggregate="mean"
+{'cer_score': 0.36619718309859156}
+
+# Corpus example with aggregate="sum" and return_all_scores=True,
+# as in the docstring of character.py
+{
+    'cer_score': 0.6254564423578508,
+    'cer_scores': [0.36619718309859156, 0.25925925925925924]
+}
+```
+
+## Citation
+```bibtex
+@inproceedings{wang-etal-2016-character,
+    title = "{C}harac{T}er: Translation Edit Rate on Character Level",
+    author = "Wang, Weiyue  and
+      Peter, Jan-Thorsten  and
+      Rosendahl, Hendrik  and
+      Ney, Hermann",
+    booktitle = "Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers",
+    month = aug,
+    year = "2016",
+    address = "Berlin, Germany",
+    publisher = "Association for Computational Linguistics",
+    url = "https://aclanthology.org/W16-2342",
+    doi = "10.18653/v1/W16-2342",
+    pages = "505--510",
+}
+```
+
+## Further References
+- Repackaged version that is used in this HF implementation: [https://github.com/bramvanroy/CharacTER](https://github.com/bramvanroy/CharacTER)
+- Original version: [https://github.com/rwth-i6/CharacTER](https://github.com/rwth-i6/CharacTER)
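The metric description in the README above says the edit distance is normalized by the hypothesis length rather than the reference length. Below is a minimal sketch of that normalization only. It is not the implementation in the `cer` package: it omits the word-level shift step, works on raw strings (spaces included), and the names `levenshtein` and `simplified_edit_rate` are hypothetical. It will therefore not reproduce the scores shown in this commit.

```python
def levenshtein(hyp: str, ref: str) -> int:
    """Character-level edit distance (insertions, deletions, substitutions)."""
    # previous[j] holds the distance between the processed hyp prefix and ref[:j]
    previous = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        current = [i]
        for j, r in enumerate(ref, start=1):
            cost = 0 if h == r else 1
            current.append(
                min(
                    previous[j] + 1,         # delete a hypothesis character
                    current[j - 1] + 1,      # insert a reference character
                    previous[j - 1] + cost,  # match or substitute
                )
            )
        previous = current
    return previous[-1]


def simplified_edit_rate(hyp: str, ref: str) -> float:
    """Edit distance divided by the hypothesis length (assumption: no shift step)."""
    if not hyp:
        raise ValueError("hypothesis must not be empty")
    return levenshtein(hyp, ref) / len(hyp)
```

For the hypothesis `"ab"` and the reference `"abcd"`, two insertions are needed. Dividing by the hypothesis length gives 1.0, whereas dividing by the reference length would give 0.5, which illustrates why normalizing by the hypothesis penalizes translations that are too short.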
diff --git a/metrics/character/app.py b/metrics/character/app.py
@@ -0,0 +1,6 @@
+import evaluate
+from evaluate.utils import launch_gradio_widget
+
+
+module = evaluate.load("character")
+launch_gradio_widget(module)
diff --git a/metrics/character/character.py b/metrics/character/character.py
@@ -0,0 +1,169 @@
+# Copyright 2020 The HuggingFace Datasets Authors and the current dataset script contributor.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""CharacTER metric, a character-based TER variant, for machine translation."""
+import math
+from statistics import mean, median
+from typing import Iterable, List, Union
+
+import cer
+import datasets
+from cer import calculate_cer
+from datasets import Sequence, Value
+
+import evaluate
+
+
+_CITATION = """\
+@inproceedings{wang-etal-2016-character,
+    title = "{C}harac{T}er: Translation Edit Rate on Character Level",
+    author = "Wang, Weiyue  and
+      Peter, Jan-Thorsten  and
+      Rosendahl, Hendrik  and
+      Ney, Hermann",
+    booktitle = "Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers",
+    month = aug,
+    year = "2016",
+    address = "Berlin, Germany",
+    publisher = "Association for Computational Linguistics",
+    url = "https://aclanthology.org/W16-2342",
+    doi = "10.18653/v1/W16-2342",
+    pages = "505--510",
+}
+"""
+
+_DESCRIPTION = """\
+CharacTer is a character-level metric inspired by the commonly applied translation edit rate (TER). It is
+defined as the minimum number of character edits required to adjust a hypothesis until it completely matches the
+reference, normalized by the length of the hypothesis sentence. CharacTer calculates the character-level edit
+distance while performing the shift edit at the word level. Unlike the strict matching criterion in TER, a hypothesis
+word is considered to match a reference word, and can be shifted, if the edit distance between them is below a
+threshold value. The Levenshtein distance between the reference and the shifted hypothesis sequence is computed at the
+character level. In addition, the lengths of hypothesis sequences instead of reference sequences are used for
+normalizing the edit distance, which effectively counters the issue that shorter translations normally achieve lower
+TER."""
+
+_KWARGS_DESCRIPTION = """
+Calculates how good the predictions are in terms of the CharacTER metric given some references.
+Args:
+    predictions: a list of predictions to score. Each prediction should be a string with
+     tokens separated by spaces.
+    references: a list of references for each prediction. You can also pass multiple references for each prediction,
+     i.e. a list that contains, for each prediction, a sublist with its references. When multiple references are
+     given, the lowest (best) score is returned for that prediction-references pair.
+     Each reference should be a string with tokens separated by spaces.
+    aggregate: one of "mean", "sum", "median" to indicate how the scores of individual sentences should be
+     aggregated
+    return_all_scores: a boolean, indicating whether in addition to the aggregated score, also all individual
+     scores should be returned
+Returns:
+    cer_score: an aggregated score across all the items, based on 'aggregate'
+    cer_scores: (optionally, if 'return_all_scores' evaluates to True) a list of all scores, one per ref/hyp pair
+Examples:
+    >>> character_mt = evaluate.load("character")
+    >>> preds = ["this week the saudis denied information published in the new york times"]
+    >>> refs = ["saudi arabia denied this week information published in the american new york times"]
+    >>> character_mt.compute(references=refs, predictions=preds)
+    {'cer_score': 0.36619718309859156}
+    >>> preds = ["this week the saudis denied information published in the new york times",
+    ...          "this is in fact an estimate"]
+    >>> refs = ["saudi arabia denied this week information published in the american new york times",
+    ...         "this is actually an estimate"]
+    >>> character_mt.compute(references=refs, predictions=preds, aggregate="sum", return_all_scores=True)
+    {'cer_score': 0.6254564423578508, 'cer_scores': [0.36619718309859156, 0.25925925925925924]}
+    >>> preds = ["this week the saudis denied information published in the new york times"]
+    >>> refs = [["saudi arabia denied this week information published in the american new york times",
+    ...          "the saudis have denied new information published in the ny times"]]
+    >>> character_mt.compute(references=refs, predictions=preds, aggregate="median", return_all_scores=True)
+    {'cer_score': 0.36619718309859156, 'cer_scores': [0.36619718309859156]}
+"""
+
+
+@evaluate.utils.file_utils.add_start_docstrings(_DESCRIPTION, _KWARGS_DESCRIPTION)
+class Character(evaluate.Metric):
+    """CharacTer is a character-level metric inspired by the commonly applied translation edit rate (TER)."""
+
+    def _info(self):
+        return evaluate.MetricInfo(
+            module_type="metric",
+            description=_DESCRIPTION,
+            citation=_CITATION,
+            inputs_description=_KWARGS_DESCRIPTION,
+            features=[
+                datasets.Features(
+                    {"predictions": Value("string", id="prediction"), "references": Value("string", id="reference")}
+                ),
+                datasets.Features(
+                    {
+                        "predictions": Value("string", id="prediction"),
+                        "references": Sequence(Value("string", id="reference"), id="references"),
+                    }
+                ),
+            ],
+            homepage="https://github.com/bramvanroy/CharacTER",
+            codebase_urls=["https://github.com/bramvanroy/CharacTER", "https://github.com/rwth-i6/CharacTER"],
+        )
+
+    def _compute(
+        self,
+        predictions: Iterable[str],
+        references: Union[Iterable[str], Iterable[Iterable[str]]],
+        aggregate: str = "mean",
+        return_all_scores: bool = False,
+    ):
+        if aggregate not in ("mean", "sum", "median"):
+            raise ValueError("'aggregate' must be one of 'sum', 'mean', 'median'")
+
+        predictions = [p.split() for p in predictions]
+        # Predictions and references have the same internal types (both lists of strings),
+        # so only one reference per prediction
+        if isinstance(references[0], str):
+            references = [r.split() for r in references]
+
+            scores_d = cer.calculate_cer_corpus(predictions, references)
+            cer_scores: List[float] = scores_d["cer_scores"]
+
+            if aggregate == "sum":
+                score = sum(cer_scores)
+            elif aggregate == "mean":
+                score = scores_d["mean"]
+            else:
+                score = scores_d["median"]
+        else:
+            # In the case of multiple references, we keep the "best score" for each prediction,
+            # i.e., the lowest CharacTER score over all of its references
+            references = [[r.split() for r in refs] for refs in references]
+
+            cer_scores = []
+            for pred, refs in zip(predictions, references):
+                min_score = math.inf
+                for ref in refs:
+                    score = calculate_cer(pred, ref)
+
+                    if score < min_score:
+                        min_score = score
+
+                cer_scores.append(min_score)
+
+            if aggregate == "sum":
+                score = sum(cer_scores)
+            elif aggregate == "mean":
+                score = mean(cer_scores)
+            else:
+                score = median(cer_scores)
+
+        # Return scores
+        if return_all_scores:
+            return {"cer_score": score, "cer_scores": cer_scores}
+        else:
+            return {"cer_score": score}
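The multi-reference branch of `_compute` above keeps, for each prediction, the lowest score over its references and then aggregates the per-prediction scores. Below is a minimal standalone sketch of that control flow with a pluggable scoring function. The names `best_scores`, `aggregate_scores` and `toy_score` are hypothetical, and `toy_score` is a stand-in that has nothing to do with CharacTER itself.

```python
from statistics import mean, median
from typing import Callable, Dict, List, Sequence

# Maps the accepted 'aggregate' values to the function that combines the scores
AGGREGATORS: Dict[str, Callable[[Sequence[float]], float]] = {
    "mean": mean,
    "sum": sum,
    "median": median,
}


def best_scores(
    predictions: Sequence[str],
    references: Sequence[Sequence[str]],
    score_fn: Callable[[str, str], float],
) -> List[float]:
    """Keep the lowest (best) score per prediction over its references."""
    # Unlike the loop in the commit, which would record math.inf for an empty
    # reference list, min() raises ValueError in that case.
    return [min(score_fn(pred, ref) for ref in refs) for pred, refs in zip(predictions, references)]


def aggregate_scores(scores: Sequence[float], aggregate: str = "mean") -> float:
    """Combine per-prediction scores with one of 'mean', 'sum' or 'median'."""
    if aggregate not in AGGREGATORS:
        raise ValueError("'aggregate' must be one of 'sum', 'mean', 'median'")
    return AGGREGATORS[aggregate](scores)


def toy_score(pred: str, ref: str) -> float:
    """Hypothetical stand-in score: length difference relative to the prediction."""
    return abs(len(pred) - len(ref)) / len(pred)
```

With `predictions = ["abcd", "ab"]` and `references = [["abcd", "abcdef"], ["a", "abc"]]`, `best_scores(predictions, references, toy_score)` keeps one value per prediction, which `aggregate_scores` then reduces to a single number.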
diff --git a/metrics/character/requirements.txt b/metrics/character/requirements.txt
@@ -0,0 +1,2 @@
+git+https://github.com/huggingface/evaluate@{COMMIT_PLACEHOLDER}
+cer>=1.2.0
diff --git a/setup.py b/setup.py
@@ -105,6 +105,7 @@
 TESTS_REQUIRE = [
     # test dependencies
     "absl-py",
+    "cer>=1.2.0",  # for characTER
     "nltk",  # for NIST and probably others
     "pytest",
     "pytest-datadir",