Add support to print multi system results as JSON #213

me-manikanta · 2022-09-28T17:09:16Z

Issue #207

Added logic to convert the multi-system output into a list of dictionaries (one per system) and print it as JSON.
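A minimal sketch of the idea, not the code in this PR: assuming the multi-system results are available as a header row plus one row per system, each row can be zipped with the headers and the resulting list dumped with the standard `json` module. The names `table_to_json`, `headers`, and `rows` are hypothetical, and the system paths are shortened for readability.

```python
import json


def table_to_json(headers, rows):
    """Hypothetical helper: turn tabular multi-system results into a JSON string."""
    # Build one dictionary per system, keyed by column name.
    records = [dict(zip(headers, row)) for row in rows]
    # indent=4 matches the layout of the outputs shown in this PR.
    return json.dumps(records, indent=4, ensure_ascii=False)


# Illustrative data only (assumed shape: one header row, one row per system).
headers = ["System", "BLEU", "chrF2"]
rows = [
    ["newstest2017.KIT.4951.de-en", "55.1", "73.8"],
    ["newstest2017.LIUM-NMT.4733.de-en", "52.2", "72.1"],
]

print(table_to_json(headers, rows))
```

Keeping the conversion separate from the printing makes it straightforward to test the dictionary structure without capturing stdout.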

Verified the changes for the following:

BLEU and CHRF for Multiple Systems

python3 sacrebleu/sacrebleu.py --input data/wmt17-submitted-data/txt/system-outputs/newstest2017/de-en/newstest2017.KIT.4951.de-en data/wmt17-submitted-data/txt/system-outputs/newstest2017/de-en/newstest2017.LIUM-NMT.4733.de-en --language-pair hi-en  -lc  --encoding utf-8-sig data/wmt17-submitted-data/txt/system-outputs/newstest2017/de-en/newstest2017.C-3MA.4958.de-en -m bleu chrf  -sh --format json

Output:

[
    {
        "System": "data/wmt17-submitted-data/txt/system-outputs/newstest2017/de-en/newstest2017.KIT.4951.de-en",
        "BLEU": "55.1",
        "chrF2": "73.8"
    },
    {
        "System": "data/wmt17-submitted-data/txt/system-outputs/newstest2017/de-en/newstest2017.LIUM-NMT.4733.de-en",
        "BLEU": "52.2",
        "chrF2": "72.1"
    }
]
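Note that in the output above the scores are serialized as strings. A consumer would therefore need to convert them before comparing systems. The following is a rough sketch under that assumption; `scores_as_floats` is a hypothetical helper, the key name `System` is taken from the output above, and the paths are shortened.

```python
import json

# Shortened copy of the output above (scores are JSON strings here).
raw = """
[
    {"System": "newstest2017.KIT.4951.de-en", "BLEU": "55.1", "chrF2": "73.8"},
    {"System": "newstest2017.LIUM-NMT.4733.de-en", "BLEU": "52.2", "chrF2": "72.1"}
]
"""


def scores_as_floats(entry, name_key="System"):
    """Hypothetical helper: keep the system name, convert every score to float."""
    return {k: (v if k == name_key else float(v)) for k, v in entry.items()}


systems = [scores_as_floats(entry) for entry in json.loads(raw)]

# Pick the system with the highest BLEU score.
best = max(systems, key=lambda s: s["BLEU"])
print(best["System"], best["BLEU"])
```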

Paired test using bootstrap resampling

python3 sacrebleu/sacrebleu.py --input data/wmt17-submitted-data/txt/system-outputs/newstest2017/de-en/newstest2017.KIT.4951.de-en data/wmt17-submitted-data/txt/system-outputs/newstest2017/de-en/newstest2017.LIUM-NMT.4733.de-en --language-pair hi-en  -lc  --encoding utf-8-sig data/wmt17-submitted-data/txt/system-outputs/newstest2017/de-en/newstest2017.C-3MA.4958.de-en -m bleu chrf  -sh --format json --paired-bs

Output:

[
    {
        "System": "Baseline: data/wmt17-submitted-data/txt/system-outputs/newstest2017/de-en/newstest2017.KIT.4951.de-en",
        "BLEU": {
            "score": 55.05078088066641,
            "p_value": null,
            "mean": 55.04298847040095,
            "ci": 0.7779333388823026
        },
        "chrF2": {
            "score": 73.81214722094455,
            "p_value": null,
            "mean": 73.79934401771284,
            "ci": 0.48946591806097217
        }
    },
    {
        "System": "data/wmt17-submitted-data/txt/system-outputs/newstest2017/de-en/newstest2017.LIUM-NMT.4733.de-en",
        "BLEU": {
            "score": 52.18305014543905,
            "p_value": 0.000999000999000999,
            "mean": 52.18573885163444,
            "ci": 0.7636661706325505
        },
        "chrF2": {
            "score": 72.10939066073507,
            "p_value": 0.000999000999000999,
            "mean": 72.10817046815828,
            "ci": 0.5031330617618934
        }
    }
]

Paired test using approximate randomization

python3 sacrebleu/sacrebleu.py --input data/wmt17-submitted-data/txt/system-outputs/newstest2017/de-en/newstest2017.KIT.4951.de-en data/wmt17-submitted-data/txt/system-outputs/newstest2017/de-en/newstest2017.LIUM-NMT.4733.de-en --language-pair hi-en  -lc  --encoding utf-8-sig data/wmt17-submitted-data/txt/system-outputs/newstest2017/de-en/newstest2017.C-3MA.4958.de-en -m bleu chrf  -sh --format json --paired-ar

Output:

[
    {
        "System": "Baseline: data/wmt17-submitted-data/txt/system-outputs/newstest2017/de-en/newstest2017.KIT.4951.de-en",
        "BLEU": {
            "score": 55.05078088066641,
            "p_value": null,
            "mean": null,
            "ci": null
        },
        "chrF2": {
            "score": 73.81214722094455,
            "p_value": null,
            "mean": null,
            "ci": null
        }
    },
    {
        "System": "data/wmt17-submitted-data/txt/system-outputs/newstest2017/de-en/newstest2017.LIUM-NMT.4733.de-en",
        "BLEU": {
            "score": 52.18305014543905,
            "p_value": 9.999000099990002e-05,
            "mean": null,
            "ci": null
        },
        "chrF2": {
            "score": 72.10939066073507,
            "p_value": 9.999000099990002e-05,
            "mean": null,
            "ci": null
        }
    }
]
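In both paired outputs above, some fields are `null`: the baseline has no `p_value`, and approximate randomization reports no `mean` or `ci`. Code that reads this JSON has to tolerate those gaps. The sketch below shows one way to do so; `describe` is a hypothetical helper, the 0.05 threshold is an arbitrary choice for illustration, and it assumes `ci` is the half-width of the confidence interval.

```python
def describe(metric_result, alpha=0.05):
    """Hypothetical helper: format one metric entry, skipping null fields."""
    text = f"{metric_result['score']:.1f}"

    # mean/ci are only present for bootstrap resampling.
    mean, ci = metric_result["mean"], metric_result["ci"]
    if mean is not None and ci is not None:
        text += f" (mean {mean:.1f} +/- {ci:.1f})"

    # p_value is null for the baseline system.
    p = metric_result["p_value"]
    if p is not None:
        marker = " *" if p < alpha else ""
        text += f", p = {p:.4f}{marker}"
    return text


# Values copied from the outputs above.
baseline_ar = {"score": 55.05078088066641, "p_value": None, "mean": None, "ci": None}
system_bs = {
    "score": 52.18305014543905,
    "p_value": 0.000999000999000999,
    "mean": 52.18573885163444,
    "ci": 0.7636661706325505,
}

print(describe(baseline_ar))
print(describe(system_bs))
```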

Other tests

Similar results are produced when `--paired-bs-n` and `--paired-ar-n` are also passed.
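As a sanity check on the numbers, the p-values in the paired outputs above equal 1/1001 and 1/10001. That would be consistent with an add-one estimate of the form (c + 1) / (n + 1), where c is the number of trials whose difference exceeds the observed one (here 0) and n is the number of trials (1000 and 10000 respectively). Both the formula and the trial counts are assumptions made for this sketch; they are not stated in this PR. The function name `smoothed_p_value` is hypothetical.

```python
def smoothed_p_value(exceed_count, n_trials):
    """Assumed add-one p-value estimate: (c + 1) / (n + 1)."""
    return (exceed_count + 1) / (n_trials + 1)


# Assumed trial counts: 1000 bootstrap resamples, 10000 randomization trials.
p_bootstrap = smoothed_p_value(0, 1000)
p_randomization = smoothed_p_value(0, 10000)

print(p_bootstrap)
print(p_randomization)
```

Under this reading, a larger trial count lowers the smallest p-value the test can report, which is why the approximate randomization values are roughly ten times smaller than the bootstrap ones.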

mjpost · 2022-10-05T11:57:47Z

These seem to be randomly failing due to downloading timeouts.

mjpost approved these changes Sep 29, 2022

me-manikanta added 2 commits October 2, 2022 01:42

Add support to print multi system results as JSON

7cc9a09

Rename System to system

12547e4

me-manikanta force-pushed the feat/stat-sig-json-support branch from abb5580 to 12547e4 Compare October 1, 2022 20:12

mjpost added 3 commits October 4, 2022 07:26

Updated CHANGELOG

ce1482c

Merge branch 'master' into me-manikanta-feat/stat-sig-json-support

bb6cbcf

Restored Python 3.6 (don't want to do a major release)

b04bbd3

version bump

b35633f

mjpost merged commit 37de171 into mjpost:master Oct 6, 2022

me-manikanta deleted the feat/stat-sig-json-support branch October 8, 2022 16:14
