Using LLMs to Improve the Descriptions of Biobricks Registry Entries

Description lengths in the above plot are in terms of number of characters. It can be observed that a significant number of parts have very short descriptions. For now, we hypothesise that the quality of a description is directly proportional to its length.

Experimental Procedure

We obtain the complete database of parts from the iGEM Registry in the form of 24094 FASTA entries. A copy of this database is available /parts.txt.
We choose registry parts with descriptions of length upto 4 characters. There are 708 such parts. We will use our model to generate better descriptions for these parts. To fine-tune our model, we select parts with good descriptions. In accordance with our hypothesis, we choose parts with descriptions of length greater than 80 characters. There are 758 such parts. We will use these parts to fine-tune our model.
For all 1466 parts, we scrape the corresponding part pages (available at parts.igem.org/Part:{ID}) and use an LLM to summarize the contents of the page in up to 25 tokens.
We fine-tune the chosen LLM on the 758 parts with good descriptions. Our prompts includes the the type of the part along with the summarized information from its webpage. The target output is the description for that part.
We use the fine-tuned LLM to generate descriptions for the 708 parts with poor descriptions. Our prompts includes the the type of the part along with the summarized information from its webpage. We generate descriptions with lengths up to 60 tokens.

Results

We used 2 models to produce our results. The first model was t5-large and the second model was google/flan-t5-large (obtained from HuggingFace). The results from these models are available in /output_t5_large.txt and /output_flan_t5_large.txt respectively. Some examples are given below (for the same parts).

T5-Large

Part ID	Original Description	Generated Description
BBa_C0090	smaI	smaI Coding Region for ahl synthase from serratia sp., _{. cerevisiae (Serratia)}
BBa_C0400	BglG	BglG is known to bind to terminator loops in bgl operon causing antisepsis.
BBa_C0420	BglF	BglF Sensor permease of bgl operon Sequence and Features Assembly Compatibility%3Answer%3B Bgl%3D

Flan-T5-Large

Part ID	Original Description	Generated Description
BBa_C0090	smaI	smaI coding region for AHL synthase smaI from Serratia
BBa_C0400	BglG	BglG is a protein that is known to bind to terminator loops in bgl
BBa_C0420	BglF	BglF Sensor permease of Bgl operon Sequence and Features Assembly Compatibility:

Future Work

This repository is work in progress, and the following tasks are planned for the future.

Processing scraped data to improve the quality of summaries.
Experimenting with more LLMs.
Using a larger and more reliable dataset for fine-tuning.
Experimenting with prompt-tuning.
Experimenting with different prompt formats, and exploring if one-shot or few-shot learning improves results.
Exploring reinforcement learning through human feedback.

Call for Collaboration

Labelling

Our hypothesis about the quality of descriptions being directly proportional to their length does not provide a very reliable metric for classification. iGEM teams can work collaboratively to label the parts in the registry as good or poor based on the quality of their descriptions, and this can potentially improve the quality of our dataset. This labelling can also help us set-up a reinforcement learning pipeline.

Development

Developing this pipeline requires expertise in several areas that we are still learning and exploring. We would love to collaborate with teams that have experience in machine learning, natural language processing and LLMs.

Integration

We would like to integrate this pipeline with the iGEM Registry, so that once our model generates good-enough results, we can modify the entries in the registry with the new descriptions.

Evaluation

We would like other iGEM teams to help us in evaluating the results from this pipeline during the several stages of development so that we can track the quality of our model.

Name		Name	Last commit message	Last commit date
Latest commit History 12 Commits
README.md		README.md
build_csv.py		build_csv.py
get_bb.py		get_bb.py
model_flan5_large.ipynb		model_flan5_large.ipynb
model_t5_large.ipynb		model_t5_large.ipynb
output_flant5_large.txt		output_flant5_large.txt
output_t5_large.txt		output_t5_large.txt
parts versus description lengths.png		parts versus description lengths.png
parts.csv		parts.csv
parts.txt		parts.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Using LLMs to Improve the Descriptions of Biobricks Registry Entries

Experimental Procedure

Results

T5-Large

Flan-T5-Large

Future Work

Call for Collaboration

Labelling

Development

Integration

Evaluation

About

Releases

Packages

Languages

iGEMIISc/biobricks

Folders and files

Latest commit

History

Repository files navigation

Using LLMs to Improve the Descriptions of Biobricks Registry Entries

Experimental Procedure

Results

T5-Large

Flan-T5-Large

Future Work

Call for Collaboration

Labelling

Development

Integration

Evaluation

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages