Skip to content

Code and data for the ACL2024 paper "InstructProtein: Aligning Human and Protein Language via Knowledge Instruction".

License

Notifications You must be signed in to change notification settings

HICAI-ZJU/InstructProtein

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

11 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

InstructProtein: Aligning Human and Protein Language via Knowledge Instruction

This repository is the official model and benchmark proposed in a paper: InstructProtein: Aligning Human and Protein Language via Knowledge Instruction. The model is available at HuggingFace.

Description

InstructProtein is the first large generative language model exploring the feasibility of bidirectional generation between human and protein language. The idea comes from the research of the Understudied Proteins Initiative: Some proteins, such as the tumor suppressor p53, have been studied extensively. By contrast, thousands of human proteins remain 'understudied'. To align proteins with natural language, InstructProtein is fine-tuned on a high-quality dataset constructed from Knowledge Instruction.

Installation

>> git clone https://github.com/HICAI-ZJU/InstructProtein
>> cd InstructProtein

Edit configuration.py script with the absolute location of dump folder in your environment.

How to evaluate models

We provide scripts for evaluating the protein understanding and design capabilities of large language models in ./benchmarks. Let’s take subcellular localization prediction as an example.

>> cd benchmarks/scripts
>> python run_subcellular_localization.py

How to train models

We provide a toy instruction dataset in ./dump/pretrain/raw/instructions.txt.

# Build training dataset
>> cd ./scripts
>> bash ./create_training_dataset.sh

# Train model
>> bash ./training.sh

Limitations

The current model, developed through instruction tuning using the knowledge instruction dataset, serves as a preliminary example. Despite its initial success in controlled environments, it lacks the robustness to manage complex, real-world, production-level tasks.

Reference

If you use our repository, please cite the following related paper:

@inproceedings{wang-etal-2024-instructprotein,
    title = "{I}nstruct{P}rotein: Aligning Human and Protein Language via Knowledge Instruction",
    author = "Wang, Zeyuan  and
      Zhang, Qiang  and
      Ding, Keyan  and
      Qin, Ming  and
      Zhuang, Xiang  and
      Li, Xiaotong  and
      Chen, Huajun",
    editor = "Ku, Lun-Wei  and
      Martins, Andre  and
      Srikumar, Vivek",
    booktitle = "Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = aug,
    year = "2024",
    address = "Bangkok, Thailand",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.acl-long.62",
    pages = "1114--1136",
}

About

Code and data for the ACL2024 paper "InstructProtein: Aligning Human and Protein Language via Knowledge Instruction".

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published