Description lengths in the plot above are measured in characters. A significant number of parts have very short descriptions. For now, we hypothesise that the quality of a description is directly proportional to its length.
- We obtain the complete database of parts from the iGEM Registry in the form of 24094 FASTA entries. A copy of this database is available at `/parts.txt`.
- We choose registry parts with descriptions of up to 4 characters. There are 708 such parts, and we will use our model to generate better descriptions for them. To fine-tune our model, we select parts with good descriptions: in accordance with our hypothesis, we choose parts with descriptions longer than 80 characters. There are 758 such parts.
- For all 1466 parts, we scrape the corresponding part pages (available at `parts.igem.org/Part:{ID}`) and use an LLM to summarize the contents of each page in up to 25 tokens.
- We fine-tune the chosen LLM on the 758 parts with good descriptions. Each prompt includes the type of the part along with the summarized information from its webpage. The target output is the description for that part.
- We use the fine-tuned LLM to generate descriptions for the 708 parts with poor descriptions. Each prompt again includes the type of the part along with the summarized information from its webpage. We generate descriptions of up to 60 tokens.
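
The selection and prompting steps above can be sketched in Python. This is a minimal sketch under stated assumptions, not the repository's actual code: it assumes each FASTA header line ends with the description in double quotes and carries the part type as the field just before it, and the names `parse_header`, `split_by_length` and `build_prompt`, as well as the prompt wording, are hypothetical.

```python
import re

# Assumed header layout (not confirmed by the source):
#   >BBa_C0090 <other fields> <type> "<description>"
HEADER_RE = re.compile(r'^>(?P<part_id>\S+)\s+(?P<fields>.*?)\s*"(?P<desc>.*)"\s*$')

SHORT_MAX = 4   # descriptions of up to 4 characters are treated as poor
LONG_MIN = 80   # descriptions longer than 80 characters are treated as good


def parse_header(line):
    """Return (part_id, part_type, description), or None if the line does not match."""
    match = HEADER_RE.match(line)
    if match is None:
        return None
    fields = match.group("fields").split()
    part_type = fields[-1] if fields else ""
    return match.group("part_id"), part_type, match.group("desc")


def split_by_length(header_lines):
    """Split parsed parts into (poor, good) lists using the length thresholds."""
    poor, good = [], []
    for line in header_lines:
        parsed = parse_header(line)
        if parsed is None:
            continue  # sequence lines and malformed headers are skipped
        if len(parsed[2]) <= SHORT_MAX:
            poor.append(parsed)
        elif len(parsed[2]) > LONG_MIN:
            good.append(parsed)
    return poor, good


def build_prompt(part_type, summary):
    """Combine the part type with the page summary (the wording is hypothetical)."""
    return f"describe part. type: {part_type}. summary: {summary}"
```

Parts whose descriptions fall between the two thresholds land in neither list, which matches the counts above: 708 poor plus 758 good gives the 1466 parts that are scraped.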
We used two models to produce our results. The first model was `t5-large` and the second model was `google/flan-t5-large` (obtained from HuggingFace). The results from these models are available in `/output_t5_large.txt` and `/output_flan_t5_large.txt` respectively. Some examples are given below (for the same parts).

| Part ID | Original Description | Generated Description |
|---|---|---|
| BBa_C0090 | smaI | smaI Coding Region for ahl synthase from serratia sp., . cerevisiae (Serratia) |
| BBa_C0400 | BglG | BglG is known to bind to terminator loops in bgl operon causing antisepsis. |
| BBa_C0420 | BglF | BglF Sensor permease of bgl operon Sequence and Features Assembly Compatibility%3Answer%3B Bgl%3D |

| Part ID | Original Description | Generated Description |
|---|---|---|
| BBa_C0090 | smaI | smaI coding region for AHL synthase smaI from Serratia |
| BBa_C0400 | BglG | BglG is a protein that is known to bind to terminator loops in bgl |
| BBa_C0420 | BglF | BglF Sensor permease of Bgl operon Sequence and Features Assembly Compatibility: |

This repository is a work in progress, and the following tasks are planned for the future.
- Processing scraped data to improve the quality of summaries.
- Experimenting with more LLMs.
- Using a larger and more reliable dataset for fine-tuning.
- Experimenting with prompt-tuning.
- Experimenting with different prompt formats, and exploring if one-shot or few-shot learning improves results.
- Exploring reinforcement learning from human feedback.
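
As an illustration of the first task in this list: the example outputs shown earlier still carry page residue such as "Sequence and Features", "Assembly Compatibility" and percent-encoded characters. One possible way to strip this from scraped text before summarization is sketched below. The marker list and the function name `clean_scraped_text` are assumptions made for illustration and are not part of the repository.

```python
import re
from urllib.parse import unquote

# Page labels seen in the example outputs; a real list would likely be longer.
BOILERPLATE_MARKERS = ("Sequence and Features", "Assembly Compatibility")


def clean_scraped_text(text):
    """Decode percent-escapes, drop known page labels, and tidy whitespace."""
    text = unquote(text)  # e.g. "%3A" becomes ":"
    for marker in BOILERPLATE_MARKERS:
        text = text.replace(marker, " ")
    text = re.sub(r"\s+", " ", text)  # collapse runs of whitespace
    return text.strip(" :;")  # trim leftover separators at either end
```

Simple string matching like this only removes labels that are known in advance, so the marker list would need to be grown by inspecting more scraped pages.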
Our hypothesis that the quality of a description is directly proportional to its length does not provide a very reliable metric for classification. iGEM teams could work collaboratively to label the parts in the registry as good or poor based on the quality of their descriptions, which could improve the quality of our dataset. This labelling could also help us set up a reinforcement learning pipeline.
Developing this pipeline requires expertise in several areas that we are still learning and exploring. We would love to collaborate with teams that have experience in machine learning, natural language processing and LLMs.
We would like to integrate this pipeline with the iGEM Registry, so that once our model generates sufficiently good results, we can update the entries in the registry with the new descriptions.
We would like other iGEM teams to help us evaluate the results from this pipeline at each stage of development so that we can track the quality of our model.