This repository contains information on two BERT models pretrained on a preprocessed version of the CORD-19 dataset, namely ClinicalCovidBERT and BioCovidBERT.

This project was inspired by the `covid_bert_base` model from Deepset and by discussions on Kaggle about potential improvements to that model. The contribution of this project rests on two pillars:

- Better initialization: training is initialized from existing BERT models trained on scientific corpora, namely ClinicalBERT and BioBERT.
- Specialized vocabulary: the custom vocabulary provided in the BioBERT repository, used to train BioBERT-Large v1.1 (+ PubMed 1M), is reused to train `biocovid-bert-large-cased`.
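For reference, the upstream checkpoints used for initialization can be inspected with the Hugging Face `transformers` library. This is a minimal sketch: the Hub model IDs below are assumptions corresponding to Bio+Clinical BERT and BioBERT-Large v1.1 (+ PubMed 1M), not identifiers taken from this repository.

```python
from transformers import AutoTokenizer, AutoModel

# Bio+Clinical BERT, used to initialize clinicalcovid-bert-base-cased (assumed Hub ID).
clinical_tokenizer = AutoTokenizer.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")
clinical_model = AutoModel.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")

# BioBERT-Large v1.1 (+ PubMed 1M) with its custom 30k cased vocabulary,
# used to initialize biocovid-bert-large-cased (assumed Hub ID).
biobert_tokenizer = AutoTokenizer.from_pretrained("dmis-lab/biobert-large-cased-v1.1")
biobert_model = AutoModel.from_pretrained("dmis-lab/biobert-large-cased-v1.1")
```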
Model | Downloads |
---|---|
`clinicalcovid-bert-base-cased` | config.json • tensorflow model • pytorch_model.bin • vocab.txt |
`biocovid-bert-large-cased` | config.json • tensorflow model • pytorch_model.bin • vocab.txt |
**clinicalcovid-bert-base-cased**

- BERT base default configuration
- Cased
- Initialized from Bio+Clinical BERT
- Using the default English `bert_base_cased` vocabulary
- Using whole-word masking
- Pretrained on a preprocessed version of the CORD-19 dataset including titles, abstracts and body text (approx. 1.5 GB)
- Training parameters:
  - `train_batch_size`: 512
  - `max_seq_length`: 128
  - `max_predictions_per_seq`: 20
  - `num_train_steps`: 150000
  - `num_warmup_steps`: 10000
  - `learning_rate`: 2e-5
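`clinicalcovid-bert-base-cased` can be loaded with the Hugging Face `transformers` library: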
```python
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("manueltonneau/clinicalcovid-bert-base-cased")
model = AutoModel.from_pretrained("manueltonneau/clinicalcovid-bert-base-cased")
```
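As an illustrative sketch (not part of the original repository, and assuming a recent `transformers` version), the loaded model can then be used to extract contextual token embeddings; the example sentence is made up:

```python
import torch

# Hypothetical example sentence; any biomedical or clinical text works the same way.
text = "Remdesivir inhibits SARS-CoV-2 replication in human lung cells."
inputs = tokenizer(text, return_tensors="pt")

# Forward pass without gradient tracking; `tokenizer` and `model` come from the snippet above.
with torch.no_grad():
    outputs = model(**inputs)

# Contextual token embeddings of shape (batch_size, sequence_length, hidden_size).
token_embeddings = outputs.last_hidden_state
print(token_embeddings.shape)
```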
**biocovid-bert-large-cased**

- BERT large default configuration
- Cased
- Initialized from BioBERT-Large v1.1 (+ PubMed 1M) using its custom 30k vocabulary
- Using whole-word masking
- Pretrained on the same preprocessed version of the CORD-19 dataset including titles, abstracts and body text (approx. 1.5 GB)
- Training parameters:
  - `train_batch_size`: 512
  - `max_seq_length`: 128
  - `max_predictions_per_seq`: 20
  - `num_train_steps`: 200000
  - `num_warmup_steps`: 10000
  - `learning_rate`: 2e-5
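`biocovid-bert-large-cased` can be loaded in the same way: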
```python
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("manueltonneau/biocovid-bert-large-cased")
model = AutoModel.from_pretrained("manueltonneau/biocovid-bert-large-cased")
```
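Since pretraining uses whole-word masked language modeling, a quick qualitative check is the `fill-mask` pipeline. This is a minimal sketch, not taken from the original repository, and it assumes the released checkpoint includes the masked-language-modeling head weights:

```python
from transformers import pipeline

# Fill-mask pipeline on the pretrained checkpoint (loads a masked-LM model under the hood).
fill_mask = pipeline("fill-mask", model="manueltonneau/biocovid-bert-large-cased")

# [MASK] is the mask token used by BERT-style tokenizers; the example sentence is made up.
for prediction in fill_mask("COVID-19 is caused by a novel [MASK]."):
    print(prediction["token_str"], round(prediction["score"], 3))
```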
Alsentzer, E., Murphy, J., Boag, W., Weng, W.-H., Jin, D., Naumann, T., & McDermott, M. (2019). Publicly Available Clinical BERT Embeddings. In Proceedings of the 2nd Clinical Natural Language Processing Workshop, pages 72-78, Minneapolis, Minnesota, USA. Association for Computational Linguistics.

Lee, J., Yoon, W., Kim, S., Kim, D., Kim, S., So, C. H., & Kang, J. (2020). BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics, 36(4), 1234-1240.

Reimers, N., & Gurevych, I. (2019). Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. arXiv preprint arXiv:1908.10084.
Research supported with Cloud TPUs from Google's TensorFlow Research Cloud (TFRC)