
Releases: Tiiiger/bert_score

Version 0.3.3

10 May 22:23
3a45dc2
  • Fixing a bug with empty strings (issue #47).
  • Supporting 6 ELECTRA models and 24 smaller BERT models (see the sketch after this list).
  • A new Google sheet tracking the performance (i.e., Pearson correlation with human judgment) of different models on WMT16 to-English.
  • Including a script for tuning the best number of layers of an English pre-trained model on WMT16 to-English data.
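A minimal sketch of scoring with one of the newly supported checkpoints. The ELECTRA identifier and the num_layers value below are illustrative assumptions, not tuned recommendations; the tuned layers live in the Google sheet mentioned above:

```python
from bert_score import score

cands = ["The quick brown fox jumps over the lazy dog."]
refs = ["A fast brown fox leaps over a lazy dog."]

# model_type accepts a Hugging Face checkpoint name; the ELECTRA
# identifier and layer choice here are assumptions for illustration.
P, R, F1 = score(
    cands,
    refs,
    model_type="google/electra-base-discriminator",
    num_layers=9,  # hypothetical; consult the Google sheet for the tuned value
)
print(f"F1: {F1.mean().item():.4f}")
```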

Version 0.3.2

18 Apr 17:04
  • Bug fix: resolves the bug in v0.3.1 when using multiple reference sentences.
  • Supporting multiple reference sentences with our command-line tool.

Version 0.3.1

18 Apr 17:09
58bc5d7
  • A new BERTScorer object that caches the model to avoid re-loading it multiple times. Please see our Jupyter notebook example for the usage.
  • Supporting multiple reference sentences for each example. The score function can now take a list of lists of strings as the references and return the score between the candidate sentence and its closest reference sentence (see the sketch after this list).
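A minimal sketch of the cached scorer together with multi-reference scoring, assuming English defaults:

```python
from bert_score import BERTScorer

# The underlying model is loaded once and reused across score() calls.
scorer = BERTScorer(lang="en")

cands = ["The cat sat on the mat."]
# One list of reference strings per candidate; the score against the
# closest reference is returned for each candidate.
refs = [["A cat was sitting on the mat.", "There is a cat on the mat."]]

P, R, F1 = scorer.score(cands, refs)
print(f"P={P.mean().item():.4f} R={R.mean().item():.4f} F1={F1.mean().item():.4f}")
```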

Version 0.3.0

18 Apr 17:11
926c516
  • Supporting Baseline Rescaling: we apply a simple linear transformation to improve the readability of BERTScore, using pre-computed "baselines". It has been pointed out (e.g., in #20 and #23) that the numerical range of BERTScore is very narrow when computed with RoBERTa models. In other words, although BERTScore correctly distinguishes examples through ranking, the numerical scores of good and bad examples are very similar. We detail our approach in a separate post; a usage sketch follows.
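A minimal sketch of rescaled scoring; a language must be given so the matching pre-computed baseline can be loaded:

```python
from bert_score import score

cands = ["The weather is nice today."]
refs = ["It is sunny outside."]

# Rescaled scores spread over more of the [0, 1] range: unrelated pairs
# land near 0 instead of clustering in a narrow high band.
P, R, F1 = score(cands, refs, lang="en", rescale_with_baseline=True)
print(F1)
```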

Version 0.2.3

18 Apr 17:17
  • Supporting DistilBERT (Sanh et al.), ALBERT (Lan et al.), and XLM-R (Conneau et al.) models (see the sketch after this list).
  • Including the version of huggingface's transformers in the hash code for reproducibility.
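A minimal sketch using one of the newly supported model families. The checkpoint name is the standard Hugging Face identifier, and the layer choice is an assumption rather than the tuned value:

```python
from bert_score import score

# DistilBERT checkpoint from the Hugging Face hub; num_layers below is
# an assumption for illustration, not the tuned setting.
P, R, F1 = score(
    ["A candidate sentence."],
    ["A reference sentence."],
    model_type="distilbert-base-uncased",
    num_layers=5,
)
```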

Version 0.2.2

18 Apr 17:18
  • Bug fix: when using RoBERTaTokenizer, we now set add_prefix_space=True, which was the default setting in huggingface's pytorch_transformers (where we ran the experiments in the paper) before it was migrated to transformers. This breaking change in transformers leads to a lower correlation with human evaluation. To reproduce our RoBERTa results in the paper, please use version 0.2.2.
  • The best number of layers for DistilRoBERTa is now included.
  • Supporting loading a custom model (see the sketch after this list).
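A minimal sketch of loading a custom model. The path is hypothetical, and num_layers must be chosen by the user, since no tuned default exists for arbitrary checkpoints:

```python
from bert_score import score

# Hypothetical path to a user-provided, transformers-compatible checkpoint.
P, R, F1 = score(
    ["A candidate sentence."],
    ["A reference sentence."],
    model_type="/path/to/your/custom-model",
    num_layers=10,  # user-chosen; no tuned default for custom models
)
```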

Version 0.2.1

18 Apr 17:19
  • SciBERT (Beltagy et al.) models are now included. Thanks to AI2 for sharing the models. By default, we use the 9th layer (the same as BERT-base), but this choice has not been tuned; a usage sketch follows.
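A minimal sketch with SciBERT. The identifier below is AI2's published checkpoint name on the Hugging Face hub, which is an assumption and may differ from the name expected by this exact release:

```python
from bert_score import score

# Layer 9 mirrors the untuned default mentioned above; the checkpoint
# identifier is an assumption based on AI2's published models.
P, R, F1 = score(
    ["The protein binds to the receptor."],
    ["The receptor is bound by the protein."],
    model_type="allenai/scibert_scivocab_uncased",
    num_layers=9,
)
```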

Version 0.2.0

18 Apr 17:20
88d6b5d
  • Supporting BERT, XLM, XLNet, and RoBERTa models through huggingface's Transformers library.
  • Automatically picking the best model for a given language (see the sketch after this list).
  • Automatically picking the layer based on the model.
  • IDF weighting is no longer enabled by default, as we show in the new version of the paper that the improvement brought by importance weighting is not consistent.
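A minimal sketch of the automatic defaults and the now opt-in IDF weighting:

```python
from bert_score import score

cands = ["A candidate translation."]
refs = ["A reference translation."]

# lang="en" picks a default English model and its layer automatically.
P, R, F1 = score(cands, refs, lang="en")

# Importance weighting is opt-in; IDF statistics are computed from the refs.
P_idf, R_idf, F1_idf = score(cands, refs, lang="en", idf=True)
```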