In this project, we experiment with a range of prompting strategies for genetic information extraction to evaluate performance and identify the limitations of using generative technologies.
**Lesser the shots, higher the hallucinations: Exploration of Genetic Information Extraction using Generative Large Language Models** - TBA
Organisation of information about genes, genetic variants, and associated diseases from vast quantities of scientific literature through automated information extraction (IE) strategies can facilitate progress in personalised medicine.
We systematically evaluate the performance of generative large language models (LLMs) on the extraction of specialised genetic information, focusing on end-to-end IE encompassing both named entity recognition and relation extraction. We experiment across multilingual datasets with a range of instruction strategies, including zero-shot and few-shot prompting, as well as providing an annotation guideline. Optimal results are obtained with few-shot prompting. However, we also find that generative LLMs fail to adhere to the instructions provided, leading to over-generation of entities and relations. We therefore carefully examine the effect of learning paradigms on the extent to which genetic entities are fabricated, and the limitations of exact matching for determining model performance.
- Download the datasets for the IE tasks.
- Create train and test datasets following the format below for each of the datasets.
  - For each dataset, create a `<dataset_type>_text.tsv` file and a `<dataset_type>_gold_annotations.tsv` file.
  - `<dataset_type>_text.tsv` is a TSV file containing the columns `pmid` (ID of the paper) and `text` (text from the literature).
  - `<dataset_type>_gold_annotations.tsv` is a TSV file containing the ground truth/gold annotations used for pairwise comparisons to evaluate the performance of the system. It contains the columns below (example rows are sketched after this list).
    - For Named Entity Recognition (NER):
      - `pmid`: PubMed ID of the paper
      - `filename`: File name of the paper the text is from
      - `mark`: Annotation ID following the BRAT format
      - `label`: Entity label, e.g. `Disease`
      - `offset1`: Starting index of the span
      - `offset2`: Ending index of the span
      - `span`: Identified entity, e.g. `Síndrome de Gorlin`
    - For Relation Extraction (RE) or joint NER and RE (NERRE):
      - `pmid`: PubMed ID of the paper
      - `filename`: File name of the paper the text is from
      - `mark1`: Annotation ID for the first entity following the BRAT format
      - `label1`: First entity label, e.g. `Gene`
      - `offset1_start`: Starting index of the first span
      - `offset1_end`: Ending index of the first span
      - `span1`: First entity identified, e.g. `DUSP6`
      - `mark2`: Annotation ID for the second entity following the BRAT format
      - `label2`: Second entity label, e.g. `Disease`
      - `offset2_start`: Starting index of the second span
      - `offset2_end`: Ending index of the second span
      - `span2`: Second entity identified, e.g. `Mood Disorders`
      - `relation_mark`: ID for the identified relation
      - `relation_type`: Relation type to annotate, e.g. `biomarker`
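  For illustration, a minimal sketch of the two files for NER is shown below. The PMID, file name, text, and character offsets are made-up placeholder values rather than rows from the actual datasets, and the offsets assume BRAT-style end-exclusive character indices; columns are tab-separated and header rows are included for readability.

  `<dataset_type>_text.tsv`:

  ```
  pmid	text
  123456	Mutations in PTCH1 are associated with Síndrome de Gorlin.
  ```

  `<dataset_type>_gold_annotations.tsv` (NER):

  ```
  pmid	filename	mark	label	offset1	offset2	span
  123456	123456.txt	T1	Disease	39	57	Síndrome de Gorlin
  ```

  The RE/NERRE annotations file follows the same pattern, with the additional columns listed above (two entity spans plus `relation_mark` and `relation_type`).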
  - Alternatively: if the dataset is one of GenoVarDis, TBGA, or Variome, the data can be cleaned and pre-processed once `CLEAN-DATA=true` is set in the `.env` file.
- Set up models
  - Currently supported models are:
    - GPT-3.5 Turbo, model id: `gpt-35-turbo-16k`
    - Llama 3 70b Instruct, model id: `meta.llama3-70b-instruct-v1:0`
  - Using Azure OpenAI
    - Note: the environment variables should be set inside the `.env` file.
  - Using Amazon Bedrock
    - Getting started with Amazon Bedrock
    - Install the AWS CLI
    - Configure SSO for authentication:

      ```
      aws configure sso
      aws sso login --profile <PROFILE-NAME>
      ```
- Duplicate the `.env-template` file as `.env` and populate it according to the task and model.
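  A rough, hypothetical sketch of a populated `.env` is shown below; apart from `CLEAN-DATA`, the variable names are placeholders rather than the keys actually defined in `.env-template`, so refer to the template for the real names.

  ```
  # Hypothetical sketch -- the real keys are listed in .env-template
  CLEAN-DATA=true                          # pre-process GenoVarDis / TBGA / Variome automatically
  # MODEL_ID=gpt-35-turbo-16k              # placeholder name for the model selection key
  # AZURE_OPENAI_API_KEY=<your-key>        # placeholder names for Azure OpenAI credentials
  # AZURE_OPENAI_ENDPOINT=<your-endpoint>
  # RESULT-FOLDER-PATH=./output            # placeholder; results are written under <RESULT-FOLDER-PATH>/results
  ```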
- [Optional] Add custom prompts to the matching prompt library file: `<task>_prompts.json` (a hypothetical sketch of an entry is shown below).
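  The internal structure of the prompt library files is not described in this README, so the snippet below is only a hypothetical illustration of what entries keyed by prompting strategy might look like; mirror the existing entries in the repository rather than this sketch.

  ```json
  {
    "zero_shot": "Extract all Disease entities from the text below and return one entity per line.\n\nText: {text}",
    "few_shot": "Extract all Disease entities from the text below. Example text: ... Example entities: ...\n\nText: {text}"
  }
  ```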
- [Optional] To add other models, update the `models.py` file by creating a corresponding model class, similar to the `GPTModel` class (a sketch is included below).
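  The interface expected by the rest of the code is defined by the existing `GPTModel` class, which is not reproduced in this README; the sketch below only illustrates the general shape under the assumption that a model class wraps a single text-generation call. The class name, constructor arguments, and `generate` method are placeholders, so mirror `GPTModel` rather than this sketch.

  ```python
  # Hypothetical sketch of an additional model class for models.py.
  # Assumes a Bedrock-hosted model and the boto3 Converse API (recent boto3 versions).
  import boto3


  class BedrockModel:
      """Example wrapper for a model served via Amazon Bedrock."""

      def __init__(self, model_id: str, profile_name: str):
          # Reuse the SSO profile configured with `aws configure sso`.
          session = boto3.Session(profile_name=profile_name)
          self.client = session.client("bedrock-runtime")
          self.model_id = model_id

      def generate(self, prompt: str) -> str:
          # Placeholder method name; the real GPTModel may expose a different
          # interface and handle prompt formatting, retries, and parsing itself.
          response = self.client.converse(
              modelId=self.model_id,
              messages=[{"role": "user", "content": [{"text": prompt}]}],
          )
          return response["output"]["message"]["content"][0]["text"]
  ```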
- Run the Python program via your IDE or with `python main.py`.
Brat-Eval is the tool we have used for evaluation.
A summary of the datasets, extracted instances, hallucinated instances, visualisation of results, and performance details will be generated in `<RESULT-FOLDER-PATH>/results` once the program has finished running.
Milindi Kodikara
Karin Verspoor
© 2024 Copyright for this project by its contributors.
🧩 READ stands for Reading, Extraction, and Annotation of Documents!