RNABERT

This repo contains the code for our paper "Informative RNA-base embedding for functional RNA clustering and structural alignment". Please contact me at akiyama@dna.bio.keio.ac.jp with any questions. Please cite this paper if you use our code or system output.

In this package, we provide resources including the source code of the RNABERT model, pre-trained weights, and a prediction module.

1. Environment setup

Our code is written in Python 3.6.5 and requires PyTorch >= 1.4.0, biopython >= 1.76, and a C++17-compatible compiler. To install PyTorch, please follow the instructions here: https://github.com/pytorch/pytorch#installation. Also, please make sure you have at least one NVIDIA GPU.
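The version requirements above can be checked before installing. The following is a minimal sketch, not part of the RNABERT package: the helper names `version_tuple` and `meets_minimum` are made up for illustration, and it assumes each package exposes a `__version__` attribute (PyTorch and biopython both do). Versions are compared as integer tuples because plain string comparison would rank "1.10.0" below "1.4.0".

```python
import importlib
import re
import sys


def version_tuple(version):
    """Turn a string such as '1.13.1+cu117' into (1, 13, 1) for numeric comparison."""
    parts = []
    for piece in version.split("+")[0].split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def meets_minimum(module_name, minimum):
    """Return True/False for an installed module, or None if it cannot be imported."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return None
    installed = getattr(module, "__version__", "0")
    return version_tuple(installed) >= version_tuple(minimum)


if __name__ == "__main__":
    print("Python:", sys.version.split()[0])
    print("torch >= 1.4.0:", meets_minimum("torch", "1.4.0"))
    print("biopython >= 1.76:", meets_minimum("Bio", "1.76"))
    try:
        import torch
        # torch.cuda.is_available() reports whether a CUDA-capable GPU is usable.
        print("CUDA GPU visible:", torch.cuda.is_available())
    except ImportError:
        print("PyTorch is not installed")
```

A result of `None` means the package is missing, and `False` means it is installed but older than the stated minimum.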

1.1 Install the package and other requirements

(Required)

git clone https://github.com/mana438/RNABERT
cd RNABERT
python setup.py install

2. Pre-training (Skip this section if you only want to make predictions)

2.1 Data processing

Pre-training consists of two tasks: masked language modeling (MLM) and structural alignment learning (SAL). The SAL task uses family-specific multiple alignments for training. If you want to train with your own data, see the template data at /sample/mlm/ for the MLM task and /sample/sal/ for the SAL task. RNABERT requires RNA sequences in FASTA format, with all nucleotides represented by A, U (T), G, or C. You can download the data I used for the experiments from the link below.

DATASETS
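When preparing your own MLM data, it may help to confirm that every sequence uses only the nucleotides listed above. The sketch below is a standalone check written for this README, not code from the repository: `read_fasta` and `invalid_records` are hypothetical helper names, and treating lowercase letters as valid is an assumption. Alignment files for the SAL task presumably contain gap characters, so this check is meant for unaligned sequences only.

```python
import io

# Nucleotide alphabet stated in this README: A, U (T), G, C.
ALLOWED = set("AUTGC")


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted text handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def invalid_records(handle):
    """Map each offending record header to its sorted unexpected characters."""
    problems = {}
    for header, seq in read_fasta(handle):
        unexpected = set(seq.upper()) - ALLOWED
        if unexpected:
            problems[header] = sorted(unexpected)
    return problems


example = io.StringIO(">seq1\nGGCUAGCU\nAACG\n>seq2\nGGNACU-U\n")
print(invalid_records(example))  # {'seq2': ['-', 'N']}
```

To check a real file, pass an open file object instead of the `io.StringIO` example, for instance `open("sample/mlm/sample.fa")`.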

2.2 Model Training

For the MLM task, "--maskrate" sets the fraction of nucleotides to be masked and "--mag" sets the number of mask patterns. Adjust the batch size according to the memory size of your GPU.

export TRAIN_FILE=sample/mlm/sample.fa
export PRE_WEIGHT= #optional
export OUTPUT_WEIGHT=/path/to/output/weight

python MLM_SFP.py \
    --pretraining ${PRE_WEIGHT} \
    --outputweight ${OUTPUT_WEIGHT} \
    --data_mlm ${TRAIN_FILE} \
    --epoch 10 \
    --batch 40 \
    --mag 3 \
    --maskrate 0.2
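To give a feel for what the two MLM options control, here is a minimal sketch of masking a sequence at a given rate and repeating it for several mask patterns. It is an illustration under stated assumptions, not the implementation in MLM_SFP.py: the function name `mask_patterns`, the placeholder mask symbol "M", and the uniform random choice of positions are all hypothetical.

```python
import random


def mask_patterns(sequence, maskrate=0.2, mag=3, mask_token="M", seed=0):
    """Return `mag` masked copies of `sequence`, each with its masked positions.

    About `maskrate` of the positions are replaced by `mask_token` in each copy.
    """
    rng = random.Random(seed)
    n_mask = max(1, round(len(sequence) * maskrate))
    variants = []
    for _ in range(mag):
        # Choose distinct positions to hide in this pattern.
        positions = sorted(rng.sample(range(len(sequence)), n_mask))
        tokens = list(sequence)
        for pos in positions:
            tokens[pos] = mask_token
        variants.append(("".join(tokens), positions))
    return variants


for masked, positions in mask_patterns("GGCUAGCUAACGGCUAGCUA"):
    print(masked, positions)
```

With a 20-nucleotide sequence, a rate of 0.2, and three patterns, each of the three copies hides four positions, so one input sequence yields three training examples.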

The SAL task takes multiple alignments per family as input, and "--mag" can be used to specify how many pairwise alignments should be generated for a single sequence.

export TRAIN_FILE=sample/sal/sample.afa.txt
export PRE_WEIGHT= #optional
export OUTPUT_WEIGHT=/path/to/output/weight

python MLM_SFP.py \
    --pretraining ${PRE_WEIGHT} \
    --outputweight ${OUTPUT_WEIGHT} \
    --data_mul ${TRAIN_FILE} \
    --epoch 10 \
    --batch 40 \
    --mag 5
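One common way to derive pairwise alignments from a multiple alignment is to take two rows and drop the columns in which both rows have a gap. The sketch below applies that idea to a fixed number of partners per sequence, in the spirit of "--mag". It is not taken from MLM_SFP.py: the function name `pairwise_from_msa`, the dictionary input, and the random choice of partners are assumptions made for illustration.

```python
import random


def pairwise_from_msa(msa, mag=5, seed=0):
    """Project a multiple alignment onto pairwise alignments.

    msa: dict mapping sequence name to its aligned row (equal lengths, '-' for gaps).
    Returns a list of (name, partner, row_a, row_b) tuples, up to `mag` per sequence.
    """
    rng = random.Random(seed)
    names = list(msa)
    pairs = []
    for name in names:
        others = [n for n in names if n != name]
        for partner in rng.sample(others, min(mag, len(others))):
            # Keep only the columns where at least one of the two rows has a base.
            cols = [(a, b) for a, b in zip(msa[name], msa[partner])
                    if not (a == "-" and b == "-")]
            row_a = "".join(a for a, _ in cols)
            row_b = "".join(b for _, b in cols)
            pairs.append((name, partner, row_a, row_b))
    return pairs


family = {"a": "GG-CU-A", "b": "GGAC--A", "c": "G--CUUA"}
for name, partner, row_a, row_b in pairwise_from_msa(family, mag=2):
    print(name, partner, row_a, row_b)
```

With three sequences and two partners each, the example produces six pairwise alignments.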

2.3 Download pre-trained RNABERT

RNABERT

Download the pre-trained model into a directory. This model was trained on the full Rfam 14.3 dataset (~400 nt).

3. Prediction

After the model is trained, or the pre-trained weights are downloaded, predictions can be obtained by running:

export PRED_FILE=sample/aln/sample.raw.fa
export PRE_WEIGHT=/path/to/pretrained/weight

python MLM_SFP.py \
    --pretraining ${PRE_WEIGHT} \
    --data_alignment ${PRED_FILE} \
    --batch 40 \
    --show_aln

4. Obtain embeddings

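To obtain the embedding vectors for RNA sequences, set the output path and run the following (PRED_FILE and PRE_WEIGHT are set as in the Prediction section):

export OUTPUT_FILE=/path/to/output/embedding

python MLM_SFP.py \
    --pretraining ${PRE_WEIGHT} \
    --data_embedding ${PRED_FILE} \
    --embedding_output ${OUTPUT_FILE} \
    --batch 40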
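Base-level embeddings are often summarized into one vector per sequence before sequences are compared or clustered. The sketch below shows one simple way to do this, mean pooling followed by cosine similarity. It works on in-memory lists of vectors and does not parse the file written by "--embedding_output", whose format is not described here. The function names and the toy two-dimensional vectors are assumptions for illustration.

```python
import math


def mean_pool(base_vectors):
    """Average per-base vectors into a single sequence-level vector."""
    dim = len(base_vectors[0])
    count = len(base_vectors)
    return [sum(vec[i] for vec in base_vectors) / count for i in range(dim)]


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


# Toy per-base embeddings for two short sequences (hypothetical values).
seq_a = [[1.0, 0.0], [3.0, 2.0]]
seq_b = [[2.0, 1.0], [2.0, 1.0], [2.0, 1.0]]

vec_a = mean_pool(seq_a)
vec_b = mean_pool(seq_b)
print(vec_a, vec_b, cosine_similarity(vec_a, vec_b))
```

Mean pooling makes sequences of different lengths comparable, at the cost of discarding positional information.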

mana438/RNABERT
