
Having better defaults for ChrF #124

Closed
alvations opened this issue Nov 24, 2020 · 5 comments · Fixed by #152

@alvations

chrF++ is a better metric than the default chrF6 used in sacreBLEU; could we change the default to chrF++?

From the original ChrF creator: https://twitter.com/amelija16mp/status/1331288013880614913

@ozancaglayan
Collaborator

This would require renaming the chrf_order param to chrf_char_order and adding a new chrf_word_order argument. For chrF++, I think the defaults are 6 and 2, respectively. The current chrF in sacreBLEU probably amounts to these params being 6 and 0. Can someone confirm?
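For illustration, the difference between those two settings can be sketched with plain n-gram extraction. This is a minimal sketch, not sacreBLEU's implementation; the function names are hypothetical, and whitespace handling for character n-grams is simplified:

```python
from collections import Counter

def char_ngrams(text, max_order=6):
    # Character n-grams up to max_order (spaces dropped here for simplicity).
    text = text.replace(" ", "")
    grams = Counter()
    for n in range(1, max_order + 1):
        for i in range(len(text) - n + 1):
            grams[text[i:i + n]] += 1
    return grams

def word_ngrams(words, max_order=2):
    # Word n-grams up to max_order; chrF++ adds these with order 2,
    # while plain chrF corresponds to word order 0 (no word n-grams).
    grams = Counter()
    for n in range(1, max_order + 1):
        for i in range(len(words) - n + 1):
            grams[tuple(words[i:i + n])] += 1
    return grams
```

Under this sketch, "char order 6, word order 0" uses only `char_ngrams(text, 6)`, while "char order 6, word order 2" additionally matches `word_ngrams(tokens, 2)`.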

@alvations
Author

@m-popovic should be the "authoritative" person to confirm that =)

@ozancaglayan ozancaglayan self-assigned this Feb 24, 2021
@ozancaglayan ozancaglayan added this to the 2.0.0 milestone Feb 24, 2021
@ozancaglayan
Collaborator

ozancaglayan commented Feb 24, 2021

I added this to my refactor2021 branch. I should note that chrF++.py does a primitive tokenization for word-level n-gram matching, e.g. it only separates out punctuation from the beginning and end of words. Can someone come up with a regexp to replace the below implementation that I took from chrF++.py? (P.S.: the below is actually quite fast, but I am curious about the regexp as well)

import string

def separate_punctuation(line):
    # Split off a single punctuation character from the start or end
    # of each whitespace-separated token (as chrF++.py does).
    words = line.strip().split()
    tokenized = []
    for w in words:
        if len(w) == 1:
            tokenized.append(w)
        else:
            lastChar = w[-1]
            firstChar = w[0]
            if lastChar in string.punctuation:
                tokenized += [w[:-1], lastChar]
            elif firstChar in string.punctuation:
                tokenized += [firstChar, w[1:]]
            else:
                tokenized.append(w)

    return tokenized

Also, because of the way the sentences are tokenized as above, languages like zh and ja would probably not benefit from having word-level ngrams, right?
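To illustrate that concern: on unsegmented text (no spaces between words, as is typical for zh/ja), whitespace splitting yields a single "word", so word-level n-grams carry almost no signal. A quick check:

```python
# Whitespace tokenization on an unsegmented Chinese sentence vs. English:
zh = "这是一个测试。"
en = "this is a test ."

print(zh.split())  # ['这是一个测试。']  -> one "word", no useful word n-grams
print(en.split())  # ['this', 'is', 'a', 'test', '.']
```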

@martinpopel
Collaborator

martinpopel commented Feb 24, 2021

it only separates out punctuation from the beginning and end of words

Yes, only a single punctuation character, and only at the beginning or end of a word; moreover, if there is punctuation at the end, no punctuation from the beginning is separated:

separate_punctuation('(hi)') == ['(hi', ')']

This seems very wrong, but if we need to replicate chrF++, we need to accept that.

I am curious about the regexp as well

The following code is so obfuscated, and also about 3.3 times slower than the original, that I mention it just as a joke:

import re
import string

# re.escape is needed so characters like ']', '\', '^' and '-' are safe
# inside the character classes below.
punct = re.escape(string.punctuation)
ugly_re = rf'\s+|((?<=\S)[{punct}](?=\s|$)|(?<=\s)[{punct}](?=\S*[^{punct}](?:\s|$)))'

def separate_punctuation_re(line):
    return list(filter(None, re.split(ugly_re, line)))

The following code seems to be 23% faster than the original (on a randomly chosen file):

import string
punct_set = set(string.punctuation)

def separate_punctuation4(line):
    tokenized = []
    for w in line.split():
        if len(w) == 1:
            tokenized.append(w)
        else:
            if w[-1] in punct_set:
                tokenized += (w[:-1], w[-1])
            elif w[0] in punct_set:
                tokenized += (w[0], w[1:])
            else:
                tokenized.append(w)
    return tokenized
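As a sanity check, the two loop-based variants should agree exactly; a sketch that re-defines both locally and compares them on a few inputs (timings will of course vary by input, so only equivalence is asserted here):

```python
import string

punct_set = set(string.punctuation)

def separate_punctuation(line):
    # Original chrF++.py-style loop: split off a single leading OR trailing
    # punctuation character per token (trailing takes precedence).
    tokenized = []
    for w in line.strip().split():
        if len(w) == 1:
            tokenized.append(w)
        elif w[-1] in string.punctuation:
            tokenized += [w[:-1], w[-1]]
        elif w[0] in string.punctuation:
            tokenized += [w[0], w[1:]]
        else:
            tokenized.append(w)
    return tokenized

def separate_punctuation4(line):
    # Set-lookup variant; behavior is intentionally identical.
    tokenized = []
    for w in line.split():
        if len(w) == 1:
            tokenized.append(w)
        elif w[-1] in punct_set:
            tokenized += (w[:-1], w[-1])
        elif w[0] in punct_set:
            tokenized += (w[0], w[1:])
        else:
            tokenized.append(w)
    return tokenized

for line in ["(hi)", "Hello, world!", "a b.c", "..."]:
    assert separate_punctuation(line) == separate_punctuation4(line)
```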

@ozancaglayan
Collaborator

Didn't notice that glitch about first/last; definitely weird, probably an elif/if confusion. I'll note it there as a comment.
Will integrate your last version, thanks!

ozancaglayan added a commit that referenced this issue Mar 26, 2021
- Allow using epsilon smoothing (#144)
- Add multi-reference support
- Add chrF++ support through the word_order argument (#124)
@ozancaglayan ozancaglayan linked a pull request Mar 26, 2021 that will close this issue
ozancaglayan added a commit that referenced this issue Jul 18, 2021
  - Build: Add Windows and OS X testing to github workflow
  - Improve documentation and type annotations.
  - Drop `Python < 3.6` support and migrate to f-strings.
  - Drop input type manipulation through `isinstance` checks. If the user does not obey
    the expected annotations, exceptions will be raised. Robustness attempts led to
    confusion and obfuscated score errors in the past (fixes #121)
  - Use colored strings in tabular outputs (multi-system evaluation mode) through
    the help of `colorama` package.
  - tokenizers: Add caching to tokenizers, which speeds things up a bit.
  - `intl` tokenizer: Use `regex` module. Speed goes from ~4 seconds to ~0.6 seconds
    for a particular test set evaluation. (fixes #46)
  - Signature: Formatting changed (mostly to remove '+' separator as it was
    interfering with chrF++). The field separator is now '|' and key values
    are separated with ':' rather than '.'.
  - Metrics: Scale all metrics into the [0, 100] range (fixes #140)
  - BLEU: In case of no n-gram matches at all, skip smoothing and return 0.0 BLEU (fixes #141).
  - BLEU: allow modifying max_ngram_order (fixes #156)
  - CHRF: Added multi-reference support, verified the scores against chrF++.py, added test case.
  - CHRF: Added chrF+ support through `word_order` argument. Added test cases against chrF++.py.
    Exposed it through the CLI (--chrf-word-order) (fixes #124)
  - CHRF: Add possibility to disable effective order smoothing (pass --chrf-eps-smoothing).
    This way, the scores obtained are exactly the same as chrF++, Moses and NLTK implementations.
    We keep the effective ordering as the default for compatibility, since this only
    affects sentence-level scoring with very short sentences. (fixes #144)
  - CLI: Allow modifying TER arguments through CLI. We still keep the TERCOM defaults.
  - CLI: Prefix metric-specific arguments with --chrf and --ter. To maintain compatibility, BLEU argument names are kept the same.
  - CLI: Added `--format/-f` flag. The single-system output mode is now `json` by default.
    If you want to keep the old text format persistently, you can export `SACREBLEU_FORMAT=text` into your
    shell.
  - CLI: sacreBLEU now supports evaluating multiple systems for a given test set
    in an efficient way. Through the use of `tabulate` package, the results are
    nicely rendered into a plain text table, LaTeX, HTML or RST (cf. --format/-f argument).
    The systems can be either given as a list of plain text files to `-i/--input` or
    as a tab-separated single stream redirected into `STDIN`. In the former case,
    the basenames of the files will be automatically used as system names.
  - Statistical tests: sacreBLEU now supports confidence interval estimation
    through bootstrap resampling for single-system evaluation (`--confidence` flag)
    as well as paired bootstrap resampling (`--paired-bs`) and paired approximate
    randomization tests (`--paired-ar`) when evaluating multiple systems (fixes #40 and fixes #78).