* init character MT metric
* Update README.md
* make style
* add isort fixes
* make style
* fix example in README
* add cer dependency for tests
* Update metrics/character/requirements.txt
* Update metrics/character/README.md
* Update metrics/character/README.md
* Update metrics/character/character.py
* Update metrics/character/character.py
* Delete .gitattributes
* require cer >=1.1.0
* use calculate_cer when given a string
* add separate test for single/corpus
* streamline output format: the corpus version now only adds attributes, but cer_scores will always be present and always a list
* style
* update documentation
* add singleton example
* update cer dependency to 1.2.0
* make metric more robust: now correctly accepts single strings and lists as input, and only returns cer_scores and not other statistics, as those seem rather uncommon and might be confusing for users
* fix doctest formatting
* use non-local metric name
* update dependency
* simplify metric, assume we always work with batches
* aggregate scores: add aggregate and return_all_scores arguments
* add multi-reference option
* remove "Literal"
* Delete tests.py; do tests via doctest instead
* Apply suggestions from code review
* Update description

Co-authored-by: helen <31600291+mathemakitten@users.noreply.github.com>
Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>
1 parent 83322c1 · commit 544f1e8
Showing 5 changed files with 284 additions and 0 deletions.
metrics/character/README.md
@@ -0,0 +1,106 @@
---
title: CharacTER
emoji: 🔤
colorFrom: orange
colorTo: red
sdk: gradio
sdk_version: 3.0.2
app_file: app.py
pinned: false
tags:
- evaluate
- metric
- machine-translation
description: >-
  CharacTer is a character-level metric inspired by the commonly applied translation edit rate (TER).
---

# Metric Card for CharacTER

## Metric Description
CharacTer is a character-level metric inspired by the translation edit rate (TER) metric. It is
defined as the minimum number of character edits required to adjust a hypothesis, until it completely matches the
reference, normalized by the length of the hypothesis sentence. CharacTer calculates the character level edit
distance while performing the shift edit on word level. Unlike the strict matching criterion in TER, a hypothesis
word is considered to match a reference word and could be shifted, if the edit distance between them is below a
threshold value. The Levenshtein distance between the reference and the shifted hypothesis sequence is computed on the
character level. In addition, the lengths of hypothesis sequences instead of reference sequences are used for
normalizing the edit distance, which effectively counters the issue that shorter translations normally achieve lower
TER.

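The normalization idea can be illustrated with a small sketch. This is not the shipped implementation and the helper names below are made up for illustration; in particular, it omits CharacTER's word-level shift edits, so its scores will generally differ from those returned by the metric:

```python
# Illustrative sketch only: plain character-level Levenshtein distance normalized by the
# *hypothesis* length (CharacTER normalizes by the hypothesis, not the reference, length).
# The real metric additionally performs word-level shift edits before this step.
def char_edit_distance(a: str, b: str) -> int:
    # Standard dynamic-programming Levenshtein distance over characters.
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        curr = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def simplified_character_score(hypothesis: str, reference: str) -> float:
    # Normalizing by the hypothesis length avoids rewarding overly short hypotheses,
    # which can happen under reference-length normalization as in TER.
    return char_edit_distance(hypothesis, reference) / len(hypothesis)


print(simplified_character_score("this is in fact an estimate", "this is actually an estimate"))
```
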
## Intended Uses
CharacTER was developed for machine translation evaluation.

## How to Use

```python
import evaluate
character = evaluate.load("character")

# Single hyp/ref
preds = ["this week the saudis denied information published in the new york times"]
refs = ["saudi arabia denied this week information published in the american new york times"]
results = character.compute(references=refs, predictions=preds)

# Corpus example
preds = ["this week the saudis denied information published in the new york times",
         "this is in fact an estimate"]
refs = ["saudi arabia denied this week information published in the american new york times",
        "this is actually an estimate"]
results = character.compute(references=refs, predictions=preds)
```
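
You can also pass several reference translations per prediction (the lowest, i.e. best, score among them is used) and control how sentence-level scores are aggregated via `aggregate` ("mean", "sum" or "median") and `return_all_scores`. The example below mirrors the doctests in `character.py`:

```python
# Multiple references for one prediction, with explicit aggregation options
preds = ["this week the saudis denied information published in the new york times"]
refs = [["saudi arabia denied this week information published in the american new york times",
         "the saudis have denied new information published in the ny times"]]
results = character.compute(references=refs, predictions=preds,
                            aggregate="median", return_all_scores=True)
# {'cer_score': 0.36619718309859156, 'cer_scores': [0.36619718309859156]}
```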

### Inputs
- **predictions**: a list of predictions to score. Each prediction should be a string with
  tokens separated by spaces.
- **references**: a list with one reference per prediction, or a list of lists when scoring against multiple
  references per prediction (see the example above). Each reference should be a string with tokens separated
  by spaces.

### Output Values

- **cer_score**: the score aggregated over all prediction/reference pairs, according to the `aggregate` argument
  ("mean" by default, or "sum"/"median")
- **cer_scores**: all sentence-level scores, one per prediction/reference pair (only returned when
  `return_all_scores=True`)

Lower scores indicate that a hypothesis needs fewer character edits to match its reference, i.e. lower is better.

### Output Example
Output of the corpus example above, run with `return_all_scores=True`:
```python
{'cer_score': 0.3127282211789254, 'cer_scores': [0.36619718309859156, 0.25925925925925924]}
```
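
With the default `aggregate="mean"`, `cer_score` is the arithmetic mean of the sentence-level scores, here (0.36619718309859156 + 0.25925925925925924) / 2 = 0.3127282211789254; with `aggregate="sum"` the same pair yields 0.6254564423578508.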

## Citation
```bibtex
@inproceedings{wang-etal-2016-character,
    title = "{C}harac{T}er: Translation Edit Rate on Character Level",
    author = "Wang, Weiyue  and
      Peter, Jan-Thorsten  and
      Rosendahl, Hendrik  and
      Ney, Hermann",
    booktitle = "Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers",
    month = aug,
    year = "2016",
    address = "Berlin, Germany",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/W16-2342",
    doi = "10.18653/v1/W16-2342",
    pages = "505--510",
}
```

## Further References
- Repackaged version that is used in this HF implementation: [https://github.com/bramvanroy/CharacTER](https://github.com/bramvanroy/CharacTER)
- Original version: [https://github.com/rwth-i6/CharacTER](https://github.com/rwth-i6/CharacTER)
metrics/character/app.py
@@ -0,0 +1,6 @@
import evaluate
from evaluate.utils import launch_gradio_widget


module = evaluate.load("character")
launch_gradio_widget(module)
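This small script is the entry point for the metric's Gradio demo (`app_file: app.py` in the README front matter above). With the two packages from requirements.txt installed, running it with `python app.py` should launch the widget locally.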
metrics/character/character.py
@@ -0,0 +1,169 @@
# Copyright 2020 The HuggingFace Datasets Authors and the current dataset script contributor.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""CharacTER metric, a character-based TER variant, for machine translation."""
import math
from statistics import mean, median
from typing import Iterable, List, Union

import cer
import datasets
from cer import calculate_cer
from datasets import Sequence, Value

import evaluate


_CITATION = """\
@inproceedings{wang-etal-2016-character,
    title = "{C}harac{T}er: Translation Edit Rate on Character Level",
    author = "Wang, Weiyue  and
      Peter, Jan-Thorsten  and
      Rosendahl, Hendrik  and
      Ney, Hermann",
    booktitle = "Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers",
    month = aug,
    year = "2016",
    address = "Berlin, Germany",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/W16-2342",
    doi = "10.18653/v1/W16-2342",
    pages = "505--510",
}
"""

_DESCRIPTION = """\
CharacTer is a character-level metric inspired by the commonly applied translation edit rate (TER). It is
defined as the minimum number of character edits required to adjust a hypothesis, until it completely matches the
reference, normalized by the length of the hypothesis sentence. CharacTer calculates the character level edit
distance while performing the shift edit on word level. Unlike the strict matching criterion in TER, a hypothesis
word is considered to match a reference word and could be shifted, if the edit distance between them is below a
threshold value. The Levenshtein distance between the reference and the shifted hypothesis sequence is computed on the
character level. In addition, the lengths of hypothesis sequences instead of reference sequences are used for
normalizing the edit distance, which effectively counters the issue that shorter translations normally achieve lower
TER."""

_KWARGS_DESCRIPTION = """
Calculates how good the predictions are in terms of the CharacTER metric given some references.
Args:
    predictions: a list of predictions to score. Each prediction should be a string with
        tokens separated by spaces.
    references: a list of references for each prediction. You can also pass multiple references for each prediction,
        so a list and in that list a sublist for each prediction for its related references. When multiple references are
        given, the lowest (best) score is returned for that prediction-references pair.
        Each reference should be a string with tokens separated by spaces.
    aggregate: one of "mean", "sum", "median" to indicate how the scores of individual sentences should be
        aggregated
    return_all_scores: a boolean, indicating whether in addition to the aggregated score, also all individual
        scores should be returned
Returns:
    cer_score: an aggregated score across all the items, based on 'aggregate'
    cer_scores: (optionally, if 'return_all_scores' evaluates to True) a list of all scores, one per ref/hyp pair
Examples:
    >>> character_mt = evaluate.load("character")
    >>> preds = ["this week the saudis denied information published in the new york times"]
    >>> refs = ["saudi arabia denied this week information published in the american new york times"]
    >>> character_mt.compute(references=refs, predictions=preds)
    {'cer_score': 0.36619718309859156}
    >>> preds = ["this week the saudis denied information published in the new york times",
    ...          "this is in fact an estimate"]
    >>> refs = ["saudi arabia denied this week information published in the american new york times",
    ...         "this is actually an estimate"]
    >>> character_mt.compute(references=refs, predictions=preds, aggregate="sum", return_all_scores=True)
    {'cer_score': 0.6254564423578508, 'cer_scores': [0.36619718309859156, 0.25925925925925924]}
    >>> preds = ["this week the saudis denied information published in the new york times"]
    >>> refs = [["saudi arabia denied this week information published in the american new york times",
    ...          "the saudis have denied new information published in the ny times"]]
    >>> character_mt.compute(references=refs, predictions=preds, aggregate="median", return_all_scores=True)
    {'cer_score': 0.36619718309859156, 'cer_scores': [0.36619718309859156]}
"""


@evaluate.utils.file_utils.add_start_docstrings(_DESCRIPTION, _KWARGS_DESCRIPTION)
class Character(evaluate.Metric):
    """CharacTer is a character-level metric inspired by the commonly applied translation edit rate (TER)."""

    def _info(self):
        return evaluate.MetricInfo(
            module_type="metric",
            description=_DESCRIPTION,
            citation=_CITATION,
            inputs_description=_KWARGS_DESCRIPTION,
            features=[
                datasets.Features(
                    {"predictions": Value("string", id="prediction"), "references": Value("string", id="reference")}
                ),
                datasets.Features(
                    {
                        "predictions": Value("string", id="prediction"),
                        "references": Sequence(Value("string", id="reference"), id="references"),
                    }
                ),
            ],
            homepage="https://github.com/bramvanroy/CharacTER",
            codebase_urls=["https://github.com/bramvanroy/CharacTER", "https://github.com/rwth-i6/CharacTER"],
        )

    def _compute(
        self,
        predictions: Iterable[str],
        references: Union[Iterable[str], Iterable[Iterable[str]]],
        aggregate: str = "mean",
        return_all_scores: bool = False,
    ):
        if aggregate not in ("mean", "sum", "median"):
            raise ValueError("'aggregate' must be one of 'sum', 'mean', 'median'")

        predictions = [p.split() for p in predictions]
        # Predictions and references have the same internal types (both lists of strings),
        # so only one reference per prediction
        if isinstance(references[0], str):
            references = [r.split() for r in references]

            scores_d = cer.calculate_cer_corpus(predictions, references)
            cer_scores: List[float] = scores_d["cer_scores"]

            if aggregate == "sum":
                score = sum(cer_scores)
            elif aggregate == "mean":
                score = scores_d["mean"]
            else:
                score = scores_d["median"]
        else:
            # In the case of multiple references, we just find the "best score",
            # i.e., the reference that the prediction is closest to, i.e. the lowest characTER score
            references = [[r.split() for r in refs] for refs in references]

            cer_scores = []
            for pred, refs in zip(predictions, references):
                min_score = math.inf
                for ref in refs:
                    score = calculate_cer(pred, ref)

                    if score < min_score:
                        min_score = score

                cer_scores.append(min_score)

            if aggregate == "sum":
                score = sum(cer_scores)
            elif aggregate == "mean":
                score = mean(cer_scores)
            else:
                score = median(cer_scores)

        # Return scores
        if return_all_scores:
            return {"cer_score": score, "cer_scores": cer_scores}
        else:
            return {"cer_score": score}
metrics/character/requirements.txt
@@ -0,0 +1,2 @@
git+https://github.com/huggingface/evaluate@{COMMIT_PLACEHOLDER}
cer>=1.2.0