SciFive provides a text-to-text framework for biomedical language and natural language in NLP. Built on the T5 framework and described in the paper SciFive: a text-to-text transformer model for biomedical literature, SciFive achieves state-of-the-art and competitive results on multiple biomedical natural language tasks.
We are migrating SciFive to BioT5X: Pretrained T5X Transformer for Biomedical Text Generation and Classification, which uses T5X and Flaxformer.
📝 Our example BioT5X fine-tuning notebook for the BLURB tasks: finetunning_biot5x_blurb.ipynb
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Use the GPU when one is available, otherwise fall back to the CPU
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained("razent/SciFive-base-Pubmed")
# The model must sit on the same device as the input tensors
model = AutoModelForSeq2SeqLM.from_pretrained("razent/SciFive-base-Pubmed").to(device)

sentence = "Identification of APC2 , a homologue of the adenomatous polyposis coli tumour suppressor ."
# The T5 tokenizer appends the end-of-sequence token </s> itself
encoding = tokenizer(sentence, return_tensors="pt")
input_ids, attention_masks = encoding["input_ids"].to(device), encoding["attention_mask"].to(device)

outputs = model.generate(
    input_ids=input_ids, attention_mask=attention_masks,
    max_length=256,
    early_stopping=True
)

for output in outputs:
    line = tokenizer.decode(output, skip_special_tokens=True, clean_up_tokenization_spaces=True)
    print(line)
Our base Google Cloud Storage URI is gs://scifive
As described in our paper, we release six versions of SciFive, each benchmarked to achieve state-of-the-art results on different biomedical tasks. They are all available in our Google Cloud bucket, and we are working on releasing the models on HuggingFace as well.
Instructions for accessing Cloud Storage from the command line with the Python tool gsutil are described here
The following table contains pretrained SciFive checkpoints.
Model | Size | Step | Config | Checkpoint |
---|---|---|---|---|
SciFive Pubmed | base & large | 1194600 & 1196500 | T5 configs | gs://scifive/models/pubmed/{size}/ |
SciFive Pubmed+PMC | base & large | 1200000 | T5 configs | gs://scifive/models/pubmed_pmc/{size}/ |
SciFive PMC | base & large | 1200000 | T5 configs | gs://scifive/models/pmc/{size}/ |
{size} is either base or large.
- Pubmed: gs://scifive/pretrain/pubmed
- PMC: gs://scifive/pretrain/pmc
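The {size} placeholder in the checkpoint paths from the table above can be expanded programmatically. The following is a minimal sketch: the names CHECKPOINT_TEMPLATES, SIZES, and checkpoint_uri are hypothetical helpers introduced for illustration and are not part of the SciFive code; only the path templates are taken from the table.

```python
# Path templates copied from the checkpoint table; the dictionary keys are
# illustrative labels, not official identifiers.
CHECKPOINT_TEMPLATES = {
    "pubmed": "gs://scifive/models/pubmed/{size}/",
    "pubmed_pmc": "gs://scifive/models/pubmed_pmc/{size}/",
    "pmc": "gs://scifive/models/pmc/{size}/",
}
SIZES = ("base", "large")


def checkpoint_uri(corpus: str, size: str) -> str:
    """Return the Cloud Storage URI for one pretraining corpus and model size."""
    if size not in SIZES:
        raise ValueError(f"size must be one of {SIZES}, got {size!r}")
    return CHECKPOINT_TEMPLATES[corpus].format(size=size)


# Enumerate every corpus and size combination listed in the table
all_uris = [checkpoint_uri(corpus, size)
            for corpus in CHECKPOINT_TEMPLATES
            for size in SIZES]
for uri in all_uris:
    print(uri)
```

Each printed URI can then be passed to a Cloud Storage client such as gsutil to list or copy the checkpoint files.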
Below, we give an example of how to use SciFive on HuggingFace to generate MedNLI outputs. We also publish SciFive fine-tuned on MedNLI so that the experiments can be reproduced.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("razent/SciFive-large-Pubmed_PMC-MedNLI")
model = AutoModelForSeq2SeqLM.from_pretrained("razent/SciFive-large-Pubmed_PMC-MedNLI")
model.cuda()  # this example requires a CUDA-capable GPU

sent_1 = "In the ED, initial VS revealed T 98.9, HR 73, BP 121/90, RR 15, O2 sat 98% on RA."
sent_2 = "The patient is hemodynamically stable"
# The task prefix and field names tell the model which task to perform
text = f"mednli: sentence1: {sent_1} sentence2: {sent_2}"

encoding = tokenizer.encode_plus(text, padding='max_length', max_length=256, return_tensors="pt")
input_ids, attention_masks = encoding["input_ids"].to("cuda"), encoding["attention_mask"].to("cuda")

# The predicted label is short, so a small max_length is sufficient
outputs = model.generate(
    input_ids=input_ids, attention_mask=attention_masks,
    max_length=8,
    early_stopping=True
)

for output in outputs:
    line = tokenizer.decode(output, skip_special_tokens=True, clean_up_tokenization_spaces=True)
    print(line)
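The input string in the MedNLI example follows a fixed pattern: the task prefix mednli:, then the two sentences labeled sentence1: and sentence2:. As a small sketch, that pattern can be wrapped in a helper so that several premise and hypothesis pairs are formatted consistently. The function name format_mednli_input and the second sentence pair are hypothetical, added only for illustration.

```python
def format_mednli_input(premise: str, hypothesis: str) -> str:
    """Build the text-to-text input string used in the MedNLI example."""
    return f"mednli: sentence1: {premise} sentence2: {hypothesis}"


# The first pair comes from the example above; the second is made up.
pairs = [
    ("In the ED, initial VS revealed T 98.9, HR 73, BP 121/90, RR 15, O2 sat 98% on RA.",
     "The patient is hemodynamically stable"),
    ("The patient denies chest pain.",
     "The patient reports chest pain"),
]

prompts = [format_mednli_input(premise, hypothesis) for premise, hypothesis in pairs]
for prompt in prompts:
    print(prompt)
```

The resulting list of strings can be tokenized in one call, for example with tokenizer(prompts, padding=True, return_tensors="pt"), and passed to model.generate as a batch.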
All of the fine-tuning datasets, already preprocessed into text-to-text format, are also available at this
If you use the SciFive model or our code in publications, please cite:
@misc{phan2021scifive,
      title={SciFive: a text-to-text transformer model for biomedical literature},
      author={Long N. Phan and James T. Anibal and Hieu Tran and Shaurya Chanana and Erol Bahadroglu and Alec Peltekian and Grégoire Altan-Bonnet},
      year={2021},
      eprint={2106.03598},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}