CharacTER: MT metric #286

Merged

35 commits merged (branch `character` into `main`) on Dec 8, 2022.

Changes from all commits:
- `70645e2` init character MT metric (Sep 2, 2022)
- `8b13546` Update README.md (Sep 2, 2022)
- `735af7a` make style (Sep 2, 2022)
- `6f2a680` add isorts fixes (Sep 2, 2022)
- `3c66bb8` make style (Sep 11, 2022)
- `6dc3573` fix example in README (Sep 11, 2022)
- `bc464e4` add cer dependency for tests (Sep 12, 2022)
- `5778a94` Update metrics/character/requirements.txt (Sep 14, 2022)
- `5aeb507` Update metrics/character/README.md (Sep 14, 2022)
- `2b3b2d4` Update metrics/character/README.md (Sep 14, 2022)
- `aac9896` Update metrics/character/character.py (Sep 14, 2022)
- `c55a7a2` Update metrics/character/character.py (Sep 15, 2022)
- `8523429` Delete .gitattributes (Sep 15, 2022)
- `fdda8ae` require cer >=1.1.0 (Sep 15, 2022)
- `a7ee66b` use calculate_cer when given a string (Sep 15, 2022)
- `32df29c` add separate test for single/corpus (Sep 15, 2022)
- `3df21b6` streamline output format (Sep 15, 2022)
- `5a4985b` style (Sep 15, 2022)
- `71927cd` update documentation (Sep 15, 2022)
- `7b82410` add singleton example (Sep 15, 2022)
- `5ce2961` Merge branch 'huggingface:main' into character (Dec 6, 2022)
- `3541c98` update cer dependency to 1.2.0 (Dec 6, 2022)
- `7206075` make metric more robust (Dec 6, 2022)
- `2ce5bc6` fix doctest formatting (Dec 6, 2022)
- `fdff370` use non-local metric name (Dec 6, 2022)
- `215ae81` update dependency (Dec 6, 2022)
- `b4ec6f6` simplify metric, assume we always work with batches (Dec 6, 2022)
- `21ac66e` aggregate scores (Dec 7, 2022)
- `7b1d80e` add multi-reference option (Dec 7, 2022)
- `e8996d6` Merge branch 'main' into character (lvwerra, Dec 8, 2022)
- `9c1338f` remove "Literal" (Dec 8, 2022)
- `7854efc` Merge branch 'character' of https://github.com/BramVanroy/evaluate in… (Dec 8, 2022)
- `3c96c3c` Delete tests.py (Dec 8, 2022)
- `8d15353` Apply suggestions from code review (Dec 8, 2022)
- `c35f53a` Update description (Dec 8, 2022)
106 changes: 106 additions & 0 deletions metrics/character/README.md
@@ -0,0 +1,106 @@
---
title: CharacTER
emoji: 🔤
colorFrom: orange
colorTo: red
sdk: gradio
sdk_version: 3.0.2
app_file: app.py
pinned: false
tags:
- evaluate
- metric
- machine-translation
description: >-
CharacTer is a character-level metric inspired by the commonly applied translation edit rate (TER).
---

# Metric Card for CharacTER

## Metric Description
CharacTer is a character-level metric inspired by the translation edit rate (TER) metric. It is
defined as the minimum number of character edits required to adjust a hypothesis until it completely matches the
reference, normalized by the length of the hypothesis sentence. CharacTer calculates the character-level edit
distance while performing the shift edit at the word level. Unlike the strict matching criterion in TER, a hypothesis
word is considered to match a reference word, and can be shifted, if the edit distance between them is below a
threshold value. The Levenshtein distance between the reference and the shifted hypothesis sequence is then computed
on the character level. In addition, the length of the hypothesis sequence, rather than that of the reference, is used
to normalize the edit distance, which effectively counters the issue that shorter translations normally achieve lower
TER.
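
To make the normalization concrete, here is a minimal, hypothetical sketch of the score *without* the word-level shift step: a plain character-level Levenshtein distance divided by the hypothesis length. It is only an illustration; the packaged implementation also performs shift edits, so its scores can differ.

```python
def char_edit_distance(hyp: str, ref: str) -> int:
    """Plain character-level Levenshtein distance, computed with a rolling DP row."""
    dp = list(range(len(ref) + 1))
    for i in range(1, len(hyp) + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, len(ref) + 1):
            cur = dp[j]
            dp[j] = min(
                dp[j] + 1,  # delete a hypothesis character
                dp[j - 1] + 1,  # insert a reference character
                prev + (hyp[i - 1] != ref[j - 1]),  # substitute (free if equal)
            )
            prev = cur
    return dp[-1]


hyp = "this is in fact an estimate"
ref = "this is actually an estimate"
# Normalize by the *hypothesis* length, not the reference length
print(char_edit_distance(hyp, ref) / len(hyp))
```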

## Intended Uses
CharacTER was developed for machine translation evaluation.

## How to Use

```python
import evaluate
character = evaluate.load("character")

# Single hyp/ref
preds = ["this week the saudis denied information published in the new york times"]
refs = ["saudi arabia denied this week information published in the american new york times"]
results = character.compute(references=refs, predictions=preds)

# Corpus example
preds = ["this week the saudis denied information published in the new york times",
"this is in fact an estimate"]
refs = ["saudi arabia denied this week information published in the american new york times",
"this is actually an estimate"]
results = character.compute(references=refs, predictions=preds)
```
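
The metric also accepts an `aggregate` argument (`"mean"`, `"sum"` or `"median"`) and a `return_all_scores` flag, and supports multiple references per prediction. The snippet below mirrors the doctests in `character.py`:

```python
# Sum of per-sentence scores, plus the individual scores
results = character.compute(
    references=refs, predictions=preds, aggregate="sum", return_all_scores=True
)
# {'cer_score': 0.6254564423578508, 'cer_scores': [0.36619718309859156, 0.25925925925925924]}

# Multiple references per prediction: pass a sublist of references for each
# prediction; the lowest (best) score per prediction is kept
preds = ["this week the saudis denied information published in the new york times"]
refs = [["saudi arabia denied this week information published in the american new york times",
         "the saudis have denied new information published in the ny times"]]
results = character.compute(references=refs, predictions=preds)
```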

### Inputs
- **predictions**: a list of predictions to score. Each prediction should be a string with
tokens separated by spaces.
- **references**: a list of references, one for each prediction. You can also pass a sublist of references per
prediction, in which case the lowest (best) score for that prediction is used. Each reference should be a string
with tokens separated by spaces.
- **aggregate**: how to aggregate the per-sentence scores; one of `"mean"` (default), `"sum"` or `"median"`.
- **return_all_scores**: a boolean indicating whether the individual per-sentence scores should be returned in
addition to the aggregated score. Defaults to `False`.


### Output Values

- **cer_score**: the aggregated CharacTER score, computed across all prediction/reference pairs according to
`aggregate`
- **cer_scores**: (only when `return_all_scores=True`) all individual scores, one per prediction/reference pair

### Output Example
```python
{'cer_score': 0.6254564423578508, 'cer_scores': [0.36619718309859156, 0.25925925925925924]}
```

## Citation
```bibtex
@inproceedings{wang-etal-2016-character,
title = "{C}harac{T}er: Translation Edit Rate on Character Level",
author = "Wang, Weiyue and
Peter, Jan-Thorsten and
Rosendahl, Hendrik and
Ney, Hermann",
booktitle = "Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers",
month = aug,
year = "2016",
address = "Berlin, Germany",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/W16-2342",
doi = "10.18653/v1/W16-2342",
pages = "505--510",
}
```

## Further References
- Repackaged version that is used in this HF implementation: [https://github.com/bramvanroy/CharacTER](https://github.com/bramvanroy/CharacTER)
- Original version: [https://github.com/rwth-i6/CharacTER](https://github.com/rwth-i6/CharacTER)
6 changes: 6 additions & 0 deletions metrics/character/app.py
@@ -0,0 +1,6 @@
import evaluate
from evaluate.utils import launch_gradio_widget


module = evaluate.load("character")
launch_gradio_widget(module)
169 changes: 169 additions & 0 deletions metrics/character/character.py
@@ -0,0 +1,169 @@
# Copyright 2020 The HuggingFace Datasets Authors and the current dataset script contributor.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""CharacTER metric, a character-based TER variant, for machine translation."""
import math
from statistics import mean, median
from typing import Iterable, List, Union

import cer
import datasets
from cer import calculate_cer
from datasets import Sequence, Value

import evaluate


_CITATION = """\
@inproceedings{wang-etal-2016-character,
title = "{C}harac{T}er: Translation Edit Rate on Character Level",
author = "Wang, Weiyue and
Peter, Jan-Thorsten and
Rosendahl, Hendrik and
Ney, Hermann",
booktitle = "Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers",
month = aug,
year = "2016",
address = "Berlin, Germany",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/W16-2342",
doi = "10.18653/v1/W16-2342",
pages = "505--510",
}
"""

_DESCRIPTION = """\
CharacTer is a character-level metric inspired by the commonly applied translation edit rate (TER). It is
defined as the minimum number of character edits required to adjust a hypothesis until it completely matches the
reference, normalized by the length of the hypothesis sentence. CharacTer calculates the character-level edit
distance while performing the shift edit at the word level. Unlike the strict matching criterion in TER, a hypothesis
word is considered to match a reference word, and can be shifted, if the edit distance between them is below a
threshold value. The Levenshtein distance between the reference and the shifted hypothesis sequence is computed on the
character level. In addition, the length of the hypothesis sequence, rather than that of the reference, is used to
normalize the edit distance, which effectively counters the issue that shorter translations normally achieve lower
TER."""

_KWARGS_DESCRIPTION = """
Calculates how good the predictions are in terms of the CharacTER metric, given some references.
Args:
    predictions: a list of predictions to score. Each prediction should be a string with
        tokens separated by spaces.
    references: a list of references, one per prediction. You can also pass multiple references for each
        prediction: a list containing, for each prediction, a sublist of its references. When multiple
        references are given, the lowest (best) score is returned for that prediction-references pair.
        Each reference should be a string with tokens separated by spaces.
    aggregate: one of "mean", "sum" or "median", indicating how the scores of the individual sentences
        should be aggregated
    return_all_scores: a boolean indicating whether, in addition to the aggregated score, all individual
        scores should also be returned
Returns:
    cer_score: an aggregated score across all items, based on 'aggregate'
    cer_scores: (optional, when 'return_all_scores' evaluates to True) a list of all scores, one per ref/hyp pair
Examples:
>>> character_mt = evaluate.load("character")
>>> preds = ["this week the saudis denied information published in the new york times"]
>>> refs = ["saudi arabia denied this week information published in the american new york times"]
>>> character_mt.compute(references=refs, predictions=preds)
{'cer_score': 0.36619718309859156}
>>> preds = ["this week the saudis denied information published in the new york times",
... "this is in fact an estimate"]
>>> refs = ["saudi arabia denied this week information published in the american new york times",
... "this is actually an estimate"]
>>> character_mt.compute(references=refs, predictions=preds, aggregate="sum", return_all_scores=True)
{'cer_score': 0.6254564423578508, 'cer_scores': [0.36619718309859156, 0.25925925925925924]}
>>> preds = ["this week the saudis denied information published in the new york times"]
>>> refs = [["saudi arabia denied this week information published in the american new york times",
... "the saudis have denied new information published in the ny times"]]
>>> character_mt.compute(references=refs, predictions=preds, aggregate="median", return_all_scores=True)
{'cer_score': 0.36619718309859156, 'cer_scores': [0.36619718309859156]}
"""


@evaluate.utils.file_utils.add_start_docstrings(_DESCRIPTION, _KWARGS_DESCRIPTION)
class Character(evaluate.Metric):
    """CharacTer is a character-level metric inspired by the commonly applied translation edit rate (TER)."""

    def _info(self):
        return evaluate.MetricInfo(
            module_type="metric",
            description=_DESCRIPTION,
            citation=_CITATION,
            inputs_description=_KWARGS_DESCRIPTION,
            features=[
                datasets.Features(
                    {"predictions": Value("string", id="prediction"), "references": Value("string", id="reference")}
                ),
                datasets.Features(
                    {
                        "predictions": Value("string", id="prediction"),
                        "references": Sequence(Value("string", id="reference"), id="references"),
                    }
                ),
            ],
            homepage="https://github.com/bramvanroy/CharacTER",
            codebase_urls=["https://github.com/bramvanroy/CharacTER", "https://github.com/rwth-i6/CharacTER"],
        )

    def _compute(
        self,
        predictions: Iterable[str],
        references: Union[Iterable[str], Iterable[Iterable[str]]],
        aggregate: str = "mean",
        return_all_scores: bool = False,
    ):
        if aggregate not in ("mean", "sum", "median"):
            raise ValueError("'aggregate' must be one of 'sum', 'mean', 'median'")

        predictions = [p.split() for p in predictions]

        # Predictions and references have the same internal types (both lists of strings),
        # so there is only one reference per prediction
        if isinstance(references[0], str):
            references = [r.split() for r in references]

            scores_d = cer.calculate_cer_corpus(predictions, references)
            cer_scores: List[float] = scores_d["cer_scores"]

            if aggregate == "sum":
                score = sum(cer_scores)
            elif aggregate == "mean":
                score = scores_d["mean"]
            else:
                score = scores_d["median"]
        else:
            # In the case of multiple references, we just find the "best score",
            # i.e., the reference that the prediction is closest to, i.e. the lowest characTER score
            references = [[r.split() for r in refs] for refs in references]

            cer_scores = []
            for pred, refs in zip(predictions, references):
                min_score = math.inf
                for ref in refs:
                    score = calculate_cer(pred, ref)

                    if score < min_score:
                        min_score = score

                cer_scores.append(min_score)

            if aggregate == "sum":
                score = sum(cer_scores)
            elif aggregate == "mean":
                score = mean(cer_scores)
            else:
                score = median(cer_scores)

        # Return scores
        if return_all_scores:
            return {"cer_score": score, "cer_scores": cer_scores}
        else:
            return {"cer_score": score}
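
For reference, the heavy lifting is done by the `cer` package: `calculate_cer` scores one tokenized hypothesis/reference pair and `calculate_cer_corpus` returns the corpus statistics used above. A minimal standalone sketch, assuming `cer>=1.2.0` as pinned in the requirements:

```python
from cer import calculate_cer, calculate_cer_corpus

pred = "this is in fact an estimate".split()
ref = "this is actually an estimate".split()

# Score a single tokenized hypothesis/reference pair
print(calculate_cer(pred, ref))

# Corpus-level statistics: a dict with (among others) 'cer_scores', 'mean' and 'median'
stats = calculate_cer_corpus([pred], [ref])
print(stats["cer_scores"], stats["mean"], stats["median"])
```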
2 changes: 2 additions & 0 deletions metrics/character/requirements.txt
@@ -0,0 +1,2 @@
git+https://github.com/huggingface/evaluate@{COMMIT_PLACEHOLDER}
cer>=1.2.0
1 change: 1 addition & 0 deletions setup.py
@@ -105,6 +105,7 @@
TESTS_REQUIRE = [
    # test dependencies
    "absl-py",
    "cer>=1.2.0",  # for characTER
    "nltk",  # for NIST and probably others
    "pytest",
    "pytest-datadir",