STARRseqCNN

This repository contains and describes the code used to train, evaluate, and interpret multitask STARRseq CNNs based on STARRseq data from multiple experimental set-ups provided by the Kaikkonen Lab. The models folder contains the trained model used for the further analysis described in the manuscript.

Clone this repository

Run git clone https://github.com/ThorbenMaa/STARRseqCNN.git, then work from inside the cloned directory.

Install dependencies

I recommend using mamba to create the environments and install the dependencies:

mamba env create --name CNN_TM --file=./envs/CNN_TM.yml
mamba env create --name modisco_lite --file=./envs/modisco_lite.yml
mamba env create --name exp_activity_analysis --file=./envs/exp_activity_analysis.yml
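
To confirm that all three environments were created, a small check along the following lines may help. It is only a sketch: the function name missing_envs is hypothetical and not part of this repository, and it assumes that `mamba env list` prints the environment name in the first column of each line.

```shell
#!/usr/bin/env bash
# Sketch: report which of the three environments from this README are absent.
# Reads the output of `mamba env list` on stdin (assumption: the environment
# name is the first whitespace-separated field of its line).
missing_envs() {
    local listing env
    listing="$(cat)"
    for env in CNN_TM modisco_lite exp_activity_analysis; do
        # awk exits with status 0 only if some line has the env name as field 1
        if ! printf '%s\n' "$listing" | awk -v e="$env" '$1 == e {found=1} END {exit !found}'; then
            echo "$env"
        fi
    done
}
```

Used as `mamba env list | missing_envs`, it prints one line per missing environment and nothing if all three exist.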

Required files

Place the following required files in the repository folder, and start all scripts from that folder. The STARRseq activity files:

  • 2023-01-10_22-29-33\ myCounts.minDNAfilt.depthNorm.keepHaps\ -\ starr.haplotypes.oligo1.txt
  • 2023-01-10_22-29-33\ myCounts.minDNAfilt.depthNorm.keepHaps\ -\ starr.haplotypes.oligo2.txt

A file with all oligos:

  • starrseq-all-final-toorder_oligocomposition.csv

A folder containing the files with p-values.
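
Before starting a run, it can be useful to check that the listed input files are in place. The sketch below assumes that the backslashes in the activity file names above are shell escapes, so the actual file names contain spaces. The function name missing_inputs is hypothetical, and the p-value folder is not checked because its name is not given here.

```shell
#!/usr/bin/env bash
# Sketch: list required input files that are missing from a directory
# (default: the current directory). The activity file names are assumed to
# contain literal spaces, hence the quoting.
missing_inputs() {
    local dir="${1:-.}" f
    local prefix='2023-01-10_22-29-33 myCounts.minDNAfilt.depthNorm.keepHaps - starr.haplotypes'
    for f in "$prefix.oligo1.txt" "$prefix.oligo2.txt" \
             'starrseq-all-final-toorder_oligocomposition.csv'; do
        # Print the name of every file that does not exist as a regular file
        [ -f "$dir/$f" ] || echo "$f"
    done
}
```

Calling `missing_inputs` from inside the repository folder prints one line per missing file and nothing if all three are present.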

Workflow for multitask CNN training, evaluation, and interpretation

Run


# model training
mamba activate CNN_TM
bash model_train_eavl_interpretation/sbatch_Train_CNN_TM.sh

# test CNN
bash model_train_eavl_interpretation/sbatch_test_CNN.sh

# in silico mutagenesis (ISM)
bash model_train_eavl_interpretation/sbatch_ism_non_overfitted_multitask.sh


# tfmodisco-lite
## activate env
mamba activate modisco_lite

## download JASPAR and rewrite the motif headers to ID_name (otherwise the tfmodisco reports display only the motif names, not the IDs)
wget https://jaspar.genereg.net/download/data/2022/CORE/JASPAR2022_CORE_vertebrates_non-redundant_pfms_meme.txt
awk '{if ($1=="MOTIF") {print $1,$2"_"$3,$3} else {print $0}}' JASPAR2022_CORE_vertebrates_non-redundant_pfms_meme.txt > JASPAR2022_CORE_vertebrates_non-redundant_pfms_meme_nice.txt


bash model_train_eavl_interpretation/sbatch_tfmodisco_v2_nonOverfitted.sh
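
To illustrate what the awk step in the JASPAR download does, the sketch below wraps the same header rewrite in a function and applies it to a single sample line. The function name rename_motifs is hypothetical, and the sample line is only an illustration that assumes the "MOTIF <ID> <name>" header layout of the MEME file.

```shell
#!/usr/bin/env bash
# Sketch: the header rewrite from the JASPAR step, reading MEME-format text
# on stdin. Only lines whose first field is MOTIF are changed; every other
# line is passed through unmodified.
rename_motifs() {
    awk '{ if ($1 == "MOTIF") { print $1, $2 "_" $3, $3 } else { print $0 } }'
}

# Illustrative input line (assumed header layout: MOTIF <ID> <name>)
printf 'MOTIF MA0004.1 Arnt\n' | rename_motifs
# prints: MOTIF MA0004.1_Arnt Arnt
```

After the rewrite, the second field carries both the ID and the name, so both can appear wherever the motif identifier is displayed.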

Workflow for analysis

Activate the environment using:

mamba activate exp_activity_analysis

All analysis pipelines can be found in the PipelineCommands folder. The ones named pipeline_jointPost_*.sh without JASPAR (i.e. with CNN motifs as motif source) were used for the manuscript.

Motif enrichment can be tested with the script sbatch_run_all.sh in the enrichment_ana folder. The script needs to be run from inside this folder.