Mutual Annotation-Based Prediction of Protein Domain Functions with Domain2GO

Identifying the functions of known proteins is an important task in understanding their critical roles in biological processes. Although high throughput sequencing tools seem to make this task easier, experimental function identification techniques cannot keep up with the increasing amount of sequence data due to their expensive and time-consuming nature. Continuously expanding output of the high throughput sequencing tools is stored in public databases such as the UniProt Knowledgebase (UniProtKB) and the majority of it (currently ~%99, November 2021) is not experimentally annotated yet.

Here, we propose a new method called Domain2GO with the aim of identifying unknown protein functions by associating domains with Gene Ontology terms, thus defining the problem as domain function prediction. Domain2GO mappings are generated using the existing domain and GO annotation data. In order to obtain highly reliable associations, we employed statistical resampling and analyzed the co-occurrence patterns of domains and GO terms on the same proteins. Furthermore, three different association probability measures were calculated by the expectation-maximization (EM) algorithm and an ablation study was performed to compare the predictive performance of the different versions of the proposed method. As a use-case study, examples selected from the finalized mappings are examined via literature review, to assess their biological relevance. We then applied the proposed method to predict protein functions, by propagating domain-associated GO terms to proteins that are annotated with those domains. For protein function prediction performance evaluation and comparison against other methods, we employed CAFA3 challenge datasets. The results demonstrated the potential of Domain2GO, especially when predicting molecular function and biological process terms, as it performed better than baseline predictors and curated associations (Fmax = 0.48 and 0.36 for MFO and BPO, respectively). The approach proposed here can be extended to other ontologies, biological entities, and their features to explore unknown relationships in complex and large-scale biological data.

The study is summarized in the schematic workflow below.

Schematic representation of the proposed method. (A) The source datasets were downloaded and organized; (B) initial mapping between the InterPro domains and GO terms were obtained, and the mapping parameters were calculated; (C) generation of the randomized annotation and mapping sets were constructed; (D) co-occurrence similarity distributions were plotted, and thresholds were selected based on statistical resampling; (E) an ablation study was conducted by calculating the enrichment of top predictions ranked by different statistical measures and finalized Domain2GO mappings were generated by filtering initial mappings; (F) protein function predictions were generated by propagating Domain2GO mappings to target proteins.

Online User-Friendly Interface for Domain2GO

We offer a user-friendly interface for Domain2GO that allows users to predict the functions of a protein sequence without the need to install any software. You can access the interface from here.

Features:

Easy Query Input: Enter a protein sequence directly or upload a FASTA file.
Automated Domain Identification: The program identifies domains within the query sequence.
Function Prediction: Generates function predictions based on the Domain2GO mapping set.
Downloadable Results: Generated function predictions are available for download as a CSV file.

For more detailed instructions, please refer to the User Guide page of the user interface. This tool simplifies the process of protein function prediction, making it accessible to users without requiring extensive computational resources or software installations.

Content

Mutual Annotation-Based Prediction of Protein Domain Functions with Domain2GO

Installation

Clone the Git repository.
We highly recommend you to use conda platform for installing dependencies properly. After installation of appropriate conda version for your operating system, create and activate conda environment with dependencies as below:
```
conda create --n domain2go_env
conda activate domain2go_env
conda env update --file requirements.yml
```
Download the datasets from here. Uncompress the "input_data.zip" file and move the contents into the input_data folder of clonned repo. "output.zip" file contains all output files generated by source code. If you want to skip some steps of the process when re-producing results, we strongly recommend you to uncompress the "output.zip" file and move the contents into the output folder of clonned repo, as some steps use the previous step's output (For example enrichment analysis uses the output of the em algorithm).

Descriptions of Folders and Files in the Domain2GO Repository

bin folder includes the source code of Domain2GO. To run the code, please set the location of bin folder as the current working directory.
input_data folder contains various training datasets and cafa folder, which contains files related to CAFA evaluation step.
- input_data/cafa/benchmark contains CAFA3 benchmark dataset.
- input_data/cafa/predictions contains propagated and parsed versions of our raw predictions. We include raw prediction generation step in our source code but propagation step is not included since this is performed using https://github.com/CAFA-Challenge/CAFA_evaluation.
output folder contains all output files generated by source code.
- output/cafa/benchmark contains two files that is not created by Domain2GO source code; blast.txt and naive.txt. These predictions are generated as explained in the CAFA experiments, using the Swiss-Prot annotations that existed before the CAFA3 annotation collection period (version date: 2016/09).

Downloading the Finalized Domain and Protein Function Prediction Datasets

Final Domain2GO mapping set is available at the output folder of this repository. You can also download it together with the protein function prediction set from here. "output.zip" file contains all output files generated by source code:
- "filtered_original.txt" file in this folder contains the resulting Domain2GO mapping set (26,696 associations between 4,742 InterPro domains and 11,742 GO terms). This file contains the following columns:
  - GO: GO term ID
  - GO_aspect: GO term category
  - GO_name: GO term name
  - Interpro: InterPro domain accession
  - n: number of co-occurrences of the domain and the GO term
  - n_go: number of proteins annotated with the GO term
  - n_ip: number of proteins annotated with the domain
  - s: co-occurrence similarity score (association probability)
- "protein_function_predictions.txt" file in this folder contains the protein function prediction set (5,046,060 GO term predictions for 291,519 proteins and 11,742 GO terms) obtained by propagating the finalized Domain2GO mapping set to proteins annotated with the domains.

Predicting the Functions of a New Protein Sequence Using Domain2GO Mappings

You can generate function predictions for a query protein using the final Domain2GO mapping set. Please note that the following program is designed to generate predictions for a single protein due to the extended runtime of InterProScan. If you need predictions for multiple UniProtKB/Swiss-Prot proteins, we recommend utilizing our comprehensive protein function prediction dataset available here. The file "protein_function_predictions.txt" within this folder contains function predictions for a substantial collection of 291,519 UniProtKB/Swiss-Prot proteins.

To generate function predictions for a single protein sequence, you can either use user interface or command line.

User Interface

You can access the user interface of Domain2GO from here. You can query a protein sequence by entering it directly or by uploading a FASTA file. Upon entering the query sequence, the program will identify domains within the sequence and generate function predictions based on the final Domain2GO mapping set. The output will be available for download as a CSV file. For more information please refer to the User Guide page of the user interface.

Command Line

To generate function predictions for a protein sequence using command line, please use the following command:

python run_domain2go.py --mapping_path=../output

Upon executing the command, you will be prompted to provide an email and a query protein sequence or the path to a FASTA file containing the sequence. Email is required for InterProScan to notify you when your job is done. Your email will not be used for any other purpose.

You have two options for entering the query protein sequence:

1. Direct Sequence Entry

Please enter the protein sequence or fasta file location: MEYGKVIFLFLLFLKSGQGESLENYIKTEGASLSNSQKKQFVASSTEECEALCEKETEFVCRSFEHYNKEQKCVIMSENSKTSSVERKRDVVLFEKRIYLSDCKSGNGRNYRGTLSKTKSGITCQKWSDLSPHVPNYAPSKYPDAGLEKNYCRNPDDDVKGPWCYTTNPDIRYEYCDVPECEDECMHCSGENYRGTISKTESGIECQPWDSQEPHSHEYIPSKFPSKDLKENYCRNPDGEPRPWCFTSNPEKRWEFCNIPRCSSPPPPPGPMLQCLKGRGENYRGKIAVTKSGHTCQRWNKQTPHKHNRTPENFPCRGLDENYCRNPDGELEPWCYTTNPDVRQEYCAIPSCGTSSPHTDRVEQSPVIQECYEGKGENYRGTTSTTISGKKCQAWSSMTPHQHKKTPDNFPNADLIRNYCRNPDGDKSPWCYTMDPTVRWEFCNLEKCSGTGSTVLNAQTTRVPSVDTTSHPESDCMYGSGKDYRGKRSTTVTGTLCQAWTAQEPHRHTIFTPDTYPRAGLEENYCRNPDGDPNGPWCYTTNPKKLFDYCDIPQCVSPSSFDCGKPRVEPQKCPGRIVGGCYAQPHSWPWQISLRTRFGEHFCGGTLIAPQWVLTAAHCLERSQWPGAYKVILGLHREVNPESYSQEIGVSRLFKGPLAADIALLKLNRPAAINDKVIPACLPSQDFMVPDRTLCHVTGWGDTQGTSPRGLLKQASLPVIDNRVCNRHEYLNGRVKSTELCAGHLVGRGDSCQGDSGGPLICFEDDKYVLQGVTSWGLGCARPNKPGVYVRVSRYISWIEDVMKNN

Additionally, provide a name for the protein sequence:

Please enter a name for the protein sequence: sp|O18783|PLMN_NOTEU

2. FASTA File Entry

Please enter the protein sequence or fasta file location: ../input_data/example_protein_query.fasta

Once you've entered the query sequence, the program will identify domains within the sequence and generate function predictions based on the final Domain2GO mapping set. The output will be saved in the specified path.

Here is an example output:

protein_name	GO_ID	GO_term	GO_category	sequence_region	probability	domain_accession	domain_name
sp\|O18783\|PLMN_NOTEU	GO:0042730	fibrinolysis	biological_process	101-264;273-354;369-450;473-557	0.224	IPR000001	Kringle
sp\|O18783\|PLMN_NOTEU	GO:0004175	endopeptidase activity	molecular_function	576-804	0.293	IPR001254	Trypsin_dom
sp\|O18783\|PLMN_NOTEU	GO:0004252	serine-type endopeptidase activity	molecular_function	576-804	0.598	IPR001254	Trypsin_dom
sp\|O18783\|PLMN_NOTEU	GO:0008233	peptidase activity	molecular_function	576-804	0.223	IPR001254	Trypsin_dom
sp\|O18783\|PLMN_NOTEU	GO:0008236	serine-type peptidase activity	molecular_function	576-804	0.593	IPR001254	Trypsin_dom
sp\|O18783\|PLMN_NOTEU	GO:0016825	hydrolase activity, acting on acid phosphorus-nitrogen bonds	molecular_function	576-804	0.584	IPR001254	Trypsin_dom
sp\|O18783\|PLMN_NOTEU	GO:0017171	serine hydrolase activity	molecular_function	576-804	0.584	IPR001254	Trypsin_dom
sp\|O18783\|PLMN_NOTEU	GO:0031638	zymogen activation	biological_process	576-804	0.259	IPR001254	Trypsin_dom
sp\|O18783\|PLMN_NOTEU	GO:0070011	peptidase activity, acting on L-amino acid peptides	molecular_function	576-804	0.223	IPR001254	Trypsin_dom

How to Reproduce the Results in the Paper

To generate the Domain2GO mapping set, please set the location of bin folder as the current working directory, and run the main_training script with the desired settings as below:

python main_training.py --em skip --enrichment skip --cafa_eval skip

Explanation of the Parameters

--em EM algorithm mode (default: skip); EM algorithm can be performed to calculate only theta (likelihood) scores by setting --em only_theta, or to calculate both theta and E (evidence) by setting --em full_mode.

--enrichment whether or not to perform enrichment analysis (default:skip, can be set to any other value to perform EA)

--cafa_eval whether or not to perform CAFA evaluation (default:skip, can be set to any other value to perform evaluation)

Publication

Please refer to our publication for more information. If you use the Domain2GO method or the datasets provided in this repository, please cite this paper:

Ulusoy E., Doğan T. (2024) Mutual annotation-based prediction of protein domain functions with Domain2GO. Protein Science, 33(6), e4988. Link

License

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program. If not, see http://www.gnu.org/licenses/.

Name		Name	Last commit message	Last commit date
Latest commit History 73 Commits
bin		bin
figures		figures
input_data		input_data
output		output
.gitignore		.gitignore
README.md		README.md
requirements.yml		requirements.yml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Mutual Annotation-Based Prediction of Protein Domain Functions with Domain2GO

Online User-Friendly Interface for Domain2GO

Features:

Content

Installation

Descriptions of Folders and Files in the Domain2GO Repository

Downloading the Finalized Domain and Protein Function Prediction Datasets

Predicting the Functions of a New Protein Sequence Using Domain2GO Mappings

User Interface

Command Line

How to Reproduce the Results in the Paper

Explanation of the Parameters

Publication

License

About

Releases

Packages

Contributors 2

Languages

ervau/Domain2GO

Folders and files

Latest commit

History

Repository files navigation

Mutual Annotation-Based Prediction of Protein Domain Functions with Domain2GO

Online User-Friendly Interface for Domain2GO

Features:

Content

Installation

Descriptions of Folders and Files in the Domain2GO Repository

Downloading the Finalized Domain and Protein Function Prediction Datasets

Predicting the Functions of a New Protein Sequence Using Domain2GO Mappings

User Interface

Command Line

How to Reproduce the Results in the Paper

Explanation of the Parameters

Publication

License

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Languages

Packages