Do protein language models learn phylogeny?

Description

Deep machine learning demonstrates a capacity to uncover evolutionary relationships directly from protein sequences, in effect internalising notions inherent to classical phylogenetic tree inference. We connect these two paradigms by assessing the capacity of protein-based language models (pLMs) to discern phylogenetic relationships without being explicitly trained to do so. We evaluate ESM2, ProtTrans and MSA-Transformer relative to classical phylogenetic methods, while also considering sequence insertions and deletions (indels) across 114 Pfam datasets.

Getting Started

The analysis runs faster on a GPU. We recommend a GPU with at least 45 GB of memory.

Dependencies

pip install -r requirements.txt

bio==1.7.1
fair-esm==2.0.0
huggingface-hub==0.24.3
pandas==2.2.2
pysam==0.22.1
scikit-learn==1.5.1
scipy==1.14.0
sentencepiece==0.2.0
torch==2.1.0
transformers==4.43.3
ete3==3.1.3

For FastTree

conda install -c bioconda fasttree

Running the investigations presented in the paper

Prepare data for analysis (creates the LG tree and LG matrix, removes and standardises gaps, and shuffles the amino acids in the sequences)

Note: An aligned FASTA file is required. It is used to build the phylogenetic tree, from which the distance matrix used for comparison is derived.

python prep_data.py -a <aligned_fasta_file>

arguments:
-a aligned fasta file with full path

example:
python prep_data.py -a ../data/PF00158/PF00158.aln
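
To make the gap standardisation and shuffling mentioned above more concrete, here is a minimal sketch. It is an assumption about what such a step could look like, not the actual logic of prep_data.py: the function names `standardise_gaps` and `shuffle_residues` are hypothetical, and keeping gap positions fixed during shuffling is a choice made for this illustration.

```python
import random

GAP = "-"

def standardise_gaps(seq: str) -> str:
    """Map alternative gap symbols ('.' and '~') to '-' (assumed convention)."""
    return "".join(GAP if ch in ".~-" else ch for ch in seq)

def shuffle_residues(seq: str, rng: random.Random) -> str:
    """Shuffle residues among the non-gap positions; gap columns stay in place."""
    residues = [ch for ch in seq if ch != GAP]
    rng.shuffle(residues)
    it = iter(residues)
    return "".join(ch if ch == GAP else next(it) for ch in seq)

rng = random.Random(0)  # fixed seed so the shuffle is reproducible
aligned = standardise_gaps("MK.LV--AG")
shuffled = shuffle_residues(aligned, rng)
print(aligned)
print(shuffled)
```

A shuffled copy of each sequence keeps the amino-acid composition and gap pattern but destroys residue order, which is what makes it usable as a control.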

One-hot correlation analysis

Note: Run the data preparation step (prep_data.py) first to ensure all required files are created.

python one_hot_corr.py -a <aligned_fasta_file> -m <model_type>

arguments:
-a aligned fasta file with full path
-m model type (options: esm2, pt, msa)

example:
python one_hot_corr.py -a ../data/PF00158/PF00158.aln -m esm2
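
The idea behind a one-hot correlation can be sketched as follows: encode each aligned sequence as a one-hot vector, compute pairwise distances, and correlate them with pairwise distances between model embeddings. This is an illustration under assumptions, not the code in one_hot_corr.py. The Euclidean distance, the Pearson correlation, the helper names, and the toy embedding values are all chosen here for demonstration only.

```python
from itertools import combinations

ALPHABET = "ACDEFGHIKLMNPQRSTVWY-"  # 20 amino acids plus the gap symbol

def one_hot(seq):
    """Flatten an aligned sequence into a 0/1 vector, one block per column."""
    vec = []
    for ch in seq:
        vec.extend(1.0 if ch == a else 0.0 for a in ALPHABET)
    return vec

def euclidean(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5

def pairwise(vectors):
    """Distances for all unordered pairs, in a fixed (i < j) order."""
    return [euclidean(vectors[i], vectors[j])
            for i, j in combinations(range(len(vectors)), 2)]

def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

alignment = ["MKV-A", "MKI-A", "MRIGA", "LRIGS"]
# Toy stand-in for per-sequence pLM embeddings (made-up numbers).
embeddings = [[0.0, 0.1], [0.2, 0.1], [0.9, 0.8], [1.5, 1.4]]

d_onehot = pairwise([one_hot(s) for s in alignment])
d_embed = pairwise(embeddings)
r = pearson(d_onehot, d_embed)
print(round(r, 3))
```

With one-hot vectors, the Euclidean distance between two aligned sequences is the square root of twice the number of mismatched columns, so this baseline only reflects raw column identity.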

Homology correlation analysis using RSS (order) / (magnitude) for low-gap and high-gap Pfam datasets

Note: Run the data preparation step (prep_data.py) first to ensure all required files are created.

python homology_corr.py -a <aligned_fasta_file> -m <model_type> -s <shuffled_fasta> -c <column_attention>

arguments:
-a aligned fasta file with full path
-m model type (options: esm2, pt, msa)
-s whether to use the shuffled fasta file (options: Y for shuffled, N otherwise)
-c column attention representation, only works for model type 'msa'. Use Y when column attention is needed
 (uses layer 1 head 5 from MSA-Transformer for this analysis)

example:
python homology_corr.py -a ../data/PF00158/PF00158.aln -m esm2 -s N -c N
or
python homology_corr.py -a ../data/PF00158/PF00158.aln -m msa -s N -c Y
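
The difference between comparing distances by order and by magnitude can be illustrated with a rank correlation versus a linear correlation. The sketch below is one plausible reading of that distinction and is not taken from homology_corr.py; the two toy distance matrices and all helper names are hypothetical.

```python
def upper_triangle(matrix):
    """Flatten the strict upper triangle of a square distance matrix."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i + 1, n)]

def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks

def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def spearman(x, y):
    """Rank correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Toy matrices (made-up): tree-derived distances vs. embedding distances.
tree_dist = [[0.0, 0.2, 0.9, 1.1],
             [0.2, 0.0, 0.8, 1.0],
             [0.9, 0.8, 0.0, 0.4],
             [1.1, 1.0, 0.4, 0.0]]
embed_dist = [[0.0, 1.0, 4.0, 9.0],
              [1.0, 0.0, 3.0, 8.0],
              [4.0, 3.0, 0.0, 2.0],
              [9.0, 8.0, 2.0, 0.0]]

x, y = upper_triangle(tree_dist), upper_triangle(embed_dist)
rho = spearman(x, y)   # agreement in ordering
r = pearson(x, y)      # agreement in magnitude
print(round(rho, 3), round(r, 3))
```

In this toy case the two matrices order every pair of sequences identically, so the rank correlation is perfect, while the linear correlation is lower because the embedding distances grow non-linearly with the tree distances.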

Local homolog similarity analysis

Note: Run the data preparation step (prep_data.py) first to ensure all required files are created.

python local_homolog_sim.py -a <aligned_fasta_file> -m <model_type> -k <nearest_neighbours>

arguments:
-a aligned fasta file with full path
-m model type (options: esm2, pt, msa)
-k number of nearest neighbours (5, 10, 20)

example:
python local_homolog_sim.py -a ../data/PF00158/PF00158.aln -m esm2 -k 5
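
A nearest-neighbour comparison of this kind can be sketched as the average overlap between each sequence's k nearest neighbours under two distance matrices. The exact metric used by local_homolog_sim.py is not described here, so the overlap score below, the function names, and the toy matrices are assumptions made for illustration.

```python
def k_nearest(dist_row, self_index, k):
    """Indices of the k closest other sequences for one row of a distance matrix."""
    others = [j for j in range(len(dist_row)) if j != self_index]
    others.sort(key=lambda j: (dist_row[j], j))  # index breaks ties deterministically
    return set(others[:k])

def neighbour_overlap(dist_a, dist_b, k):
    """Mean fraction of shared k-nearest neighbours across all sequences."""
    n = len(dist_a)
    total = 0.0
    for i in range(n):
        shared = k_nearest(dist_a[i], i, k) & k_nearest(dist_b[i], i, k)
        total += len(shared) / k
    return total / n

# Toy matrices (made-up): tree-derived distances vs. embedding distances.
tree_dist = [[0, 1, 2, 6, 7],
             [1, 0, 2, 6, 7],
             [2, 2, 0, 5, 6],
             [6, 6, 5, 0, 1],
             [7, 7, 6, 1, 0]]
embed_dist = [[0, 1, 5, 2, 7],
              [1, 0, 2, 6, 7],
              [5, 2, 0, 5, 6],
              [2, 6, 5, 0, 1],
              [7, 7, 6, 1, 0]]

score = neighbour_overlap(tree_dist, embed_dist, k=2)
print(score)
```

A score of 1.0 means every sequence has the same k nearest neighbours under both distance matrices; lower values indicate that local neighbourhoods differ.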

Fine to coarse evolutionary correlation analysis

Note: Run the data preparation step (prep_data.py) first to ensure all required files are created.

python fine_coarse_corr.py -a <aligned_fasta_file> -m <model_type> -c <column_attention>

arguments:
-a aligned fasta file with full path
-m model type (options: esm2, pt, msa)
-c column attention representation, only works for model type 'msa'. Use Y when column attention is needed
 (uses layer 1 head 5 from MSA-Transformer for this analysis)

example:
python fine_coarse_corr.py -a ../data/PF00158/PF00158.aln -m msa -c Y

Elastic Net regression training for salient neuron analysis

Note: Run the data preparation step (prep_data.py) first to ensure all required files are created.

python salient_neurons.py -a <aligned_fasta_file> -m <model_type>

arguments:
-a aligned fasta file with full path
-m model type (options: esm2, pt, msa)

example:
python salient_neurons.py -a ../data/PF00158/PF00158.aln -m esm2
