
Small fix for SacreBLEUScore and the mteval-v13a tokenizer #1778

Merged
merged 1 commit into Lightning-AI:master from sacrebleu_fix on May 13, 2023

Conversation

@RistoAle97 (Contributor) commented May 12, 2023

The behaviour of the mteval-v13a tokenizer (the one used by WMT) now matches the original SacreBLEU implementation.
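
As a rough illustration (not part of the PR), the fix can be sanity-checked by comparing the torchmetrics metric against the reference sacrebleu package on the same inputs; the toy sentences below are made up for the example:

# Sanity-check sketch (not from the PR): compare torchmetrics against the
# reference sacrebleu package on a couple of toy sentences.
import sacrebleu
from torchmetrics.text import SacreBLEUScore

preds = ["the cat sat on the mat .", "hello there general kenobi"]
targets = [["the cat is sitting on the mat ."], ["hello there general kenobi !"]]

# Reference implementation; "13a" (mteval-v13a) is sacrebleu's default tokenizer.
ref_streams = [[t[0] for t in targets]]  # sacrebleu expects one list per reference set
reference = sacrebleu.corpus_bleu(preds, ref_streams)

# Torchmetrics implementation with the same tokenizer (scores are in [0, 1]).
tm = SacreBLEUScore(tokenize="13a")
tm_score = (tm(preds, targets) * 100).item()

print(f"sacrebleu:    {reference.score:.5f}")
print(f"torchmetrics: {tm_score:.5f}")  # the two should agree after this fix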

What does this PR do?

Fixes #1776

Before submitting
  • Was this discussed/agreed via a GitHub issue? (no need for typos and docs improvements)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure to update the docs?
  • Did you write any new necessary tests?
PR review

Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in GitHub issues, there's a high chance it will not be merged.

Did you have fun?

Make sure you had fun coding 🙃


📚 Documentation preview 📚: https://torchmetrics--1778.org.readthedocs.build/en/1778/

The behaviour is the same as the original implementation now
@RistoAle97 (Contributor, Author) commented:

Some more insight into the effect of the change: I compared the HF version with the torchmetrics one (before and after applying my small change). Here are the results on the newstest2013 en-es test set:

# translations is a sequence of generated Spanish sentences; targets is a sequence of sequences of reference Spanish sentences
> hf_score = hf_scb.compute(predictions=translations, references=targets)
> hf_bleu_score = "{0:.5f}".format(hf_score["score"])
> print(f"HF implementation\nBLEU score: {hf_bleu_score}\npreds_len: {hf_score['sys_len']}\ntarget_len: {hf_score['ref_len']}")
... HF implementation
... BLEU score: 30.73243
... preds_len: 67090
... target_len: 70528

> scb_tm = SacreBLEUScore()  # current implementation
> tm_score = "{0:.5f}".format(scb_tm(translations, targets) * 100)
> print(f"Torchmetrics implementation\nBLEU score: {tm_score}\npreds_len: {scb_tm.preds_len}\ntarget_len: {scb_tm.target_len}")
... Torchmetrics implementation
... BLEU score: 30.67770
... preds_len: 67028.0
... target_len: 70468.0

> scb_tm_new = SacreBLEUScoreTest()  # my implementation with the applied change
> tm_score_new = "{0:.5f}".format(scb_tm_new(translations, targets) * 100)
> print(f"Torchmetrics implementation\nBLEU score: {tm_score_new}\npreds_len: {scb_tm_new.preds_len}\ntarget_len: {scb_tm_new.target_len}")
... Torchmetrics new version
... BLEU score: 30.73244
... preds_len: 67090.0
... target_len: 70528.0
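
The transcript above relies on a few objects that are not defined in the comment. A minimal sketch of the assumed setup, where hf_scb is Hugging Face's evaluate wrapper around sacrebleu and SacreBLEUScoreTest stands for a locally patched copy of SacreBLEUScore with this PR's change applied:

# Assumed setup for the transcript above (not shown in the original comment).
import evaluate
from torchmetrics.text import SacreBLEUScore

hf_scb = evaluate.load("sacrebleu")  # Hugging Face wrapper around the reference sacrebleu package

# translations: list[str] with one generated Spanish sentence per source sentence
# targets:      list[list[str]] with the reference translation(s) for each sentence
# SacreBLEUScoreTest: hypothetical name for a local copy of SacreBLEUScore with the tokenizer fix applied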

I also compared the three versions on the WMT14 en-fr test set:

... Huggingface SacreBLEU: {'score': 33.17529924795069, 'counts': [48246, 29265, 19045, 12569], 'totals': [76495, 73492, 70489, 67487], 'precisions': [63.070788940453625, 39.82066075219071, 27.018400034047865, 18.624327648287817], 'bp': 0.9894540029829183, 'sys_len': 76495, 'ref_len': 77306}

... Torchmetrics SacreBLEU:
... score: 33.1378173828125
... preds_len: 76434.0
... target_len: 77247.0

... Torchmetrics SacreBLEU new version:
... score: 33.1753044128418
... preds_len: 76495.0
... target_len: 77306.0

While the BLEU score is still slightly different, the discrepancy between the original version (used through HuggingFace) and the new torchmetrics implementation is smaller, and the computed lengths for both predictions and references now match.
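
The matching lengths matter because SacreBLEU's brevity penalty is computed from exactly these corpus-level lengths (BP = 1 if sys_len > ref_len, otherwise exp(1 - ref_len / sys_len)). A quick check, not part of the PR, using the WMT14 en-fr numbers above reproduces the bp value reported by the HuggingFace output:

# Brevity penalty from the corpus-level lengths reported above (WMT14 en-fr).
import math

sys_len, ref_len = 76495, 77306
bp = 1.0 if sys_len > ref_len else math.exp(1 - ref_len / sys_len)
print(f"{bp:.10f}")  # 0.9894540030, matching the 'bp' in the Hugging Face result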

@stancld stancld enabled auto-merge (squash) May 13, 2023 08:42
codecov bot commented May 13, 2023

Codecov Report

Merging #1778 (50d97ba) into master (17c0e9f) will decrease coverage by 42%.
The diff coverage is 100%.

Additional details and impacted files
@@           Coverage Diff            @@
##           master   #1778     +/-   ##
========================================
- Coverage      87%     46%    -42%     
========================================
  Files         253     253             
  Lines       14164   14164             
========================================
- Hits        12387    6458   -5929     
- Misses       1777    7706   +5929     

@mergify mergify bot added the ready label May 13, 2023
@stancld stancld merged commit 2d35650 into Lightning-AI:master May 13, 2023
@RistoAle97 RistoAle97 deleted the sacrebleu_fix branch September 4, 2023 12:18
Development

Successfully merging this pull request may close these issues.

#1776: The BLEU score computed by the SacreBLEUScore metric is not the same as the original implementation