STARRseqCNN

This repository contains and describes the code used to train, evaluate, and interpret multi-task STARRseq CNNs based on STARRseq data from multiple experimental set-ups provided by the Kaikkonen Lab. The folder models contains the trained model used for the further analysis described in the manuscript.

Clone this repository

Run git clone https://github.com/ThorbenMaa/STARRseqCNN.git and operate from inside the cloned directory.

Install dependencies

I recommend using mamba to create the environments and install the dependencies:

mamba env create --name CNN_TM --file=./envs/CNN_TM.yml
mamba env create --name modisco_lite --file=./envs/modisco_lite.yml
mamba env create --name exp_activity_analysis --file=./envs/exp_activity_analysis.yml

Required files

The following files are required and must be placed in this folder. Please start all scripts from this folder. The STARRseq activity files:

2023-01-10_22-29-33\ myCounts.minDNAfilt.depthNorm.keepHaps\ -\ starr.haplotypes.oligo1.txt
2023-01-10_22-29-33\ myCounts.minDNAfilt.depthNorm.keepHaps\ -\ starr.haplotypes.oligo2.txt

A file with all oligos:

starrseq-all-final-toorder_oligocomposition.csv

A folder containing the files with p-values.
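
Before starting the workflows, it can help to confirm that the input files are in place. The following is a minimal sketch, not part of the repository: `check_required_files` is a hypothetical helper, and it assumes that the backslashes in the file names above are shell escapes for literal spaces. The folder with p-values is not checked because its name is not given here.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of the repository): report which of the
# required input files are missing from a directory (default: current one).
check_required_files() {
    local dir="${1:-.}"
    local missing=0
    local f
    # File names as listed above; escaped spaces are assumed to be literal spaces.
    local required=(
        "2023-01-10_22-29-33 myCounts.minDNAfilt.depthNorm.keepHaps - starr.haplotypes.oligo1.txt"
        "2023-01-10_22-29-33 myCounts.minDNAfilt.depthNorm.keepHaps - starr.haplotypes.oligo2.txt"
        "starrseq-all-final-toorder_oligocomposition.csv"
    )
    for f in "${required[@]}"; do
        if [ ! -f "$dir/$f" ]; then
            echo "missing: $f"
            missing=$((missing + 1))
        fi
    done
    # Succeed only if no file is missing.
    [ "$missing" -eq 0 ]
}
```

From inside the repository folder, `check_required_files . && echo "all required files found"` prints one line per missing file, or the confirmation message if none is missing.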

Workflow for multitask CNN training, evaluation, and interpretation

Run


# model training
mamba activate CNN_TM
bash model_train_eval_interpretation/sbatch_Train_CNN_TM.sh

# test CNN
bash model_train_eval_interpretation/sbatch_test_CNN.sh

# in silico mutagenesis (ISM)
bash model_train_eval_interpretation/sbatch_ism_non_overfitted_multitask.sh


# tfmodisco-lite
## activate env
mamba activate modisco_lite

## download the JASPAR motifs and rename each motif to ID_name (without this step, only the motif names but not the IDs are displayed in the tfmodisco reports)
wget https://jaspar.genereg.net/download/data/2022/CORE/JASPAR2022_CORE_vertebrates_non-redundant_pfms_meme.txt
awk '{if ($1=="MOTIF") {print $1,$2"_"$3,$3} else {print $0}}' JASPAR2022_CORE_vertebrates_non-redundant_pfms_meme.txt > JASPAR2022_CORE_vertebrates_non-redundant_pfms_meme_nice.txt


bash model_train_eval_interpretation/sbatch_tfmodisco_v2_nonOverfitted.sh
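
To illustrate what the JASPAR renaming step above does, here is a small sketch that applies the same awk rule to a single header line. The helper name `rename_meme_motifs` is hypothetical, and the sample line assumes the usual JASPAR MEME header layout `MOTIF <ID> <name>`.

```bash
#!/usr/bin/env bash
# Hypothetical helper: apply the motif renaming rule to MEME text on stdin.
# Lines starting with MOTIF get ID and name joined as ID_name in the second
# field; all other lines pass through unchanged.
rename_meme_motifs() {
    awk '{ if ($1 == "MOTIF") { print $1, $2 "_" $3, $3 } else { print $0 } }'
}

# Example with an assumed JASPAR-style header line.
printf 'MOTIF MA0004.1 Arnt\n' | rename_meme_motifs
# prints: MOTIF MA0004.1_Arnt Arnt
```

The second field then carries both the ID and the name, so a tool that reports only that field shows both.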

Workflow for analysis

Activate the environment using:

mamba activate exp_activity_analysis

All analysis pipelines can be found in the pipelineCommands folder. Those named pipeline_jointPost_*.sh without JASPAR (i.e. with CNN motifs as the motif source) were used for the manuscript.

Motif enrichment can be tested with the script sbatch_run_all.sh in the enrichment_ana folder. The script needs to be run from inside this folder.
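
One way to respect the working-directory requirement without changing your own shell's directory is a subshell. This is a sketch under two assumptions: that "this folder" refers to enrichment_ana, and that the script is started with bash like the other sbatch_*.sh scripts in this README. `run_inside` is a hypothetical helper, not part of the repository.

```bash
#!/usr/bin/env bash
# Hypothetical helper: run a script from inside its own folder. The subshell
# keeps the cd local, so the caller's working directory is unchanged.
run_inside() {
    local dir="$1" script="$2"
    ( cd "$dir" && bash "$script" )
}

# Usage from the repository root (commented out so that sourcing this
# snippet does not start the job):
# run_inside enrichment_ana sbatch_run_all.sh
```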

