
Having better defaults for ChrF #124

Closed
alvations opened this issue Nov 24, 2020 · 5 comments · Fixed by #152

@alvations

chrF++ is a better metric than the default chrF6 used in sacreBLEU; could we change the default to chrF++?

From the original ChrF creator: https://twitter.com/amelija16mp/status/1331288013880614913

@ozancaglayan
Collaborator

This would require renaming the chrf_order param to chrf_char_order and adding a new chrf_word_order argument. For chrF++, I think the defaults are 6 and 2, respectively. The current chrF in sacreBLEU probably amounts to these params being 6 and 0. Can someone confirm?
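For illustration, the difference between those two settings can be sketched with plain n-gram extraction. This is a minimal sketch, not sacreBLEU's implementation; the function names are hypothetical, and whitespace handling for character n-grams is simplified:

```python
from collections import Counter

def char_ngrams(text, max_order=6):
    # Character n-grams up to max_order (spaces dropped here for simplicity).
    text = text.replace(" ", "")
    grams = Counter()
    for n in range(1, max_order + 1):
        for i in range(len(text) - n + 1):
            grams[text[i:i + n]] += 1
    return grams

def word_ngrams(words, max_order=2):
    # Word n-grams up to max_order; chrF++ adds these with order 2,
    # while plain chrF corresponds to word order 0 (no word n-grams).
    grams = Counter()
    for n in range(1, max_order + 1):
        for i in range(len(words) - n + 1):
            grams[tuple(words[i:i + n])] += 1
    return grams
```

Under this sketch, "char order 6, word order 0" uses only `char_ngrams(text, 6)`, while "char order 6, word order 2" additionally matches `word_ngrams(tokens, 2)`.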

@alvations
Author

@m-popovic should be the "authoritative" person to confirm that =)

@ozancaglayan ozancaglayan self-assigned this Feb 24, 2021
@ozancaglayan ozancaglayan added this to the 2.0.0 milestone Feb 24, 2021
@ozancaglayan
Collaborator

ozancaglayan commented Feb 24, 2021

I added this to my refactor2021 branch. I should note that chrF++.py does a primitive tokenization for word-level n-gram matching, e.g. it only separates out punctuation from the beginning and end of words. Can someone come up with a regexp to replace the below implementation that I took from chrF++.py? (P.S.: the below is actually quite fast, but I am curious about the regexp as well)

import string

def separate_punctuation(line):
    # Split off a single punctuation character from the start or end
    # of each whitespace-separated token (as chrF++.py does).
    words = line.strip().split()
    tokenized = []
    for w in words:
        if len(w) == 1:
            tokenized.append(w)
        else:
            lastChar = w[-1]
            firstChar = w[0]
            if lastChar in string.punctuation:
                tokenized += [w[:-1], lastChar]
            elif firstChar in string.punctuation:
                tokenized += [firstChar, w[1:]]
            else:
                tokenized.append(w)

    return tokenized

Also, because of the way the sentences are tokenized as above, languages like zh and ja would probably not benefit from having word-level ngrams, right?
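To illustrate that concern: on unsegmented text (no spaces between words, as is typical for zh/ja), whitespace splitting yields a single "word", so word-level n-grams carry almost no signal. A quick check:

```python
# Whitespace tokenization on an unsegmented Chinese sentence vs. English:
zh = "这是一个测试。"
en = "this is a test ."

print(zh.split())  # ['这是一个测试。']  -> one "word", no useful word n-grams
print(en.split())  # ['this', 'is', 'a', 'test', '.']
```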

@martinpopel
Collaborator

martinpopel commented Feb 24, 2021

it only separates out punctuation from the beginning and end of words

Yes, only a single punctuation character, and only at the beginning or end of a word; moreover, if there is punctuation at the end, no punctuation from the beginning is separated:

separate_punctuation('(hi)') == ['(hi', ')']

This seems very wrong, but if we need to replicate chrF++, we need to accept that.

I am curious about the regexp as well

The following code is so obfuscated, and also about 3.3 times slower than the original, that I mention it just as a joke:

import re
import string

# re.escape is needed so characters like ']', '\', '^' and '-' are safe
# inside the character classes below.
punct = re.escape(string.punctuation)
ugly_re = rf'\s+|((?<=\S)[{punct}](?=\s|$)|(?<=\s)[{punct}](?=\S*[^{punct}](?:\s|$)))'

def separate_punctuation_re(line):
    return list(filter(None, re.split(ugly_re, line)))

The following code seems to be 23% faster than the original (on a randomly chosen file):

import string
punct_set = set(string.punctuation)

def separate_punctuation4(line):
    tokenized = []
    for w in line.split():
        if len(w) == 1:
            tokenized.append(w)
        else:
            if w[-1] in punct_set:
                tokenized += (w[:-1], w[-1])
            elif w[0] in punct_set:
                tokenized += (w[0], w[1:])
            else:
                tokenized.append(w)
    return tokenized
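As a sanity check, the two loop-based variants should agree exactly; a sketch that re-defines both locally and compares them on a few inputs (timings will of course vary by input, so only equivalence is asserted here):

```python
import string

punct_set = set(string.punctuation)

def separate_punctuation(line):
    # Original chrF++.py-style loop: split off a single leading OR trailing
    # punctuation character per token (trailing takes precedence).
    tokenized = []
    for w in line.strip().split():
        if len(w) == 1:
            tokenized.append(w)
        elif w[-1] in string.punctuation:
            tokenized += [w[:-1], w[-1]]
        elif w[0] in string.punctuation:
            tokenized += [w[0], w[1:]]
        else:
            tokenized.append(w)
    return tokenized

def separate_punctuation4(line):
    # Set-lookup variant; behavior is intentionally identical.
    tokenized = []
    for w in line.split():
        if len(w) == 1:
            tokenized.append(w)
        elif w[-1] in punct_set:
            tokenized += (w[:-1], w[-1])
        elif w[0] in punct_set:
            tokenized += (w[0], w[1:])
        else:
            tokenized.append(w)
    return tokenized

for line in ["(hi)", "Hello, world!", "a b.c", "..."]:
    assert separate_punctuation(line) == separate_punctuation4(line)
```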

@ozancaglayan
Collaborator

Didn't notice that glitch about first/last; definitely weird, probably an elif/if confusion. I'll note it there as a comment.
Will integrate your last version, thanks!

ozancaglayan added a commit that referenced this issue Mar 26, 2021
- Allow using epsilon smoothing (#144)
- Add multi-reference support
- Add chrF++ support through the word_order argument (#124)
@ozancaglayan ozancaglayan linked a pull request Mar 26, 2021 that will close this issue
ozancaglayan added a commit that referenced this issue Jul 18, 2021
  - Build: Add Windows and OS X testing to github workflow
  - Improve documentation and type annotations.
  - Drop `Python < 3.6` support and migrate to f-strings.
  - Drop input type manipulation through `isinstance` checks. If the user does not obey
    the expected annotations, exceptions will be raised. Robustness attempts led to
    confusion and obfuscated score errors in the past (fixes #121)
  - Use colored strings in tabular outputs (multi-system evaluation mode) through
    the help of `colorama` package.
  - tokenizers: Add caching to tokenizers, which speeds things up a bit.
  - `intl` tokenizer: Use `regex` module. Speed goes from ~4 seconds to ~0.6 seconds
    for a particular test set evaluation. (fixes #46)
  - Signature: Formatting changed (mostly to remove '+' separator as it was
    interfering with chrF++). The field separator is now '|' and key values
    are separated with ':' rather than '.'.
  - Metrics: Scale all metrics into the [0, 100] range (fixes #140)
  - BLEU: In case of no n-gram matches at all, skip smoothing and return 0.0 BLEU (fixes #141).
  - BLEU: allow modifying max_ngram_order (fixes #156)
  - CHRF: Added multi-reference support, verified the scores against chrF++.py, added test case.
  - CHRF: Added chrF+ support through `word_order` argument. Added test cases against chrF++.py.
    Exposed it through the CLI (--chrf-word-order) (fixes #124)
  - CHRF: Add possibility to disable effective order smoothing (pass --chrf-eps-smoothing).
    This way, the scores obtained are exactly the same as chrF++, Moses and NLTK implementations.
    We keep the effective ordering as the default for compatibility, since this only
    affects sentence-level scoring with very short sentences. (fixes #144)
  - CLI: Allow modifying TER arguments through CLI. We still keep the TERCOM defaults.
  - CLI: Prefix metric-specific arguments with --chrf and --ter. To maintain compatibility, BLEU argument names are kept the same.
  - CLI: Added `--format/-f` flag. The single-system output mode is now `json` by default.
    If you want to keep the old text format persistently, you can export `SACREBLEU_FORMAT=text` into your
    shell.
  - CLI: sacreBLEU now supports evaluating multiple systems for a given test set
    in an efficient way. Through the use of `tabulate` package, the results are
    nicely rendered into a plain text table, LaTeX, HTML or RST (cf. --format/-f argument).
    The systems can be either given as a list of plain text files to `-i/--input` or
    as a tab-separated single stream redirected into `STDIN`. In the former case,
    the basenames of the files will be automatically used as system names.
  - Statistical tests: sacreBLEU now supports confidence interval estimation
    through bootstrap resampling for single-system evaluation (`--confidence` flag)
    as well as paired bootstrap resampling (`--paired-bs`) and paired approximate
    randomization tests (`--paired-ar`) when evaluating multiple systems (fixes #40 and fixes #78).