Preprocessing for SNP tissue enrichment analysis methods
Create and activate the conda environment:
$ conda env create -n SNP_enrich --file environment.yaml
$ conda activate SNP_enrich
bedtools2 must be on the $PATH; it is used to compute overlaps. To build it from source:
$ wget
$ tar -zxvf bedtools-2.29.1.tar.gz
$ cd bedtools2
$ make
On a VM, the build may first require installing these libraries:
$ sudo apt-get install build-essential
$ sudo apt-get install libz-dev
$ sudo apt install libbz2-dev
$ sudo apt install libclang-dev
$ sudo apt-get install liblzma-dev
$ sudo apt-get install bzip2
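Before running the pipeline it can help to confirm that bedtools is actually visible on the PATH. The following is a minimal sketch using only the Python standard library; find_tool and tool_version are hypothetical helper names and are not part of this repository.

```python
import shutil
import subprocess


def find_tool(name):
    """Return the absolute path of `name` if it is on the PATH, else None."""
    return shutil.which(name)


def tool_version(name):
    """Return the first line printed by `<name> --version`, or None if the tool is missing."""
    path = find_tool(name)
    if path is None:
        return None
    result = subprocess.run([path, "--version"], capture_output=True, text=True)
    output = (result.stdout or result.stderr).strip()
    return output.splitlines()[0] if output else ""


if __name__ == "__main__":
    version = tool_version("bedtools")
    if version is None:
        # After `make`, the binaries are in bedtools2/bin; that directory must be on the PATH.
        print("bedtools not found on PATH")
    else:
        print("found:", version)
```

If the check fails after a successful build, the most likely cause is that the bedtools2/bin directory has not been added to the PATH.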
$ python
Before using BED files, check whether they are in hg19 or hg38 coordinates and lift them over if needed. (Sumstats from GCS are in hg38.)
Edit the config file to specify the GCS buckets for input sumstats, temporary processing folders, and existing results:
$ vim configs/configs.txt
The following will process all summary stats which do not already have LDSC results:
$ NCORES=4  # example value; set this to the number of cores to use
$ scripts/ gs://genetics-portal-dev-analysis/xg1/LDSC_enrichments/* gs://genetics-portal-dev-analysis/xg1/Test_sumstat_inputs/*.parquet $NCORES
$ INPUT_STUDY_PATH=gs://genetics-portal-dev-analysis/xg1/rsid_sumstats/
Commands are written to LDSC_studies_to_run.txt
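The selection step amounts to a set difference between the studies that have summary statistics and the studies that already have LDSC results. The sketch below illustrates that idea only. It assumes the study ID is the file name up to its first dot, which is a guess at the naming scheme; the bucket paths, function names and command template are hypothetical and do not come from this repository.

```python
import posixpath


def study_id(path):
    """Assumed naming scheme: the study ID is the file name up to the first dot."""
    return posixpath.basename(path.rstrip("/")).split(".")[0]


def studies_without_results(sumstat_paths, result_paths):
    """Return the sorted study IDs that have sumstats but no existing result."""
    done = {study_id(p) for p in result_paths}
    return sorted({study_id(p) for p in sumstat_paths} - done)


def write_manifest(study_ids, command_template, manifest_path):
    """Write one command per study; `{study}` in the template is replaced by the study ID."""
    with open(manifest_path, "w") as handle:
        for sid in study_ids:
            handle.write(command_template.format(study=sid) + "\n")


# Hypothetical example paths, for illustration only.
sumstats = [
    "gs://example-bucket/in/GCST000001.parquet",
    "gs://example-bucket/in/GCST000002.parquet",
]
results = ["gs://example-bucket/out/GCST000001.results"]
todo = studies_without_results(sumstats, results)
print(todo)  # ['GCST000002']
```

Calling write_manifest(todo, "echo {study}", "LDSC_studies_to_run.txt") would then produce a file with one line per remaining study; the echo command is a placeholder for the real LDSC command.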
Pipe to parallel:
$ cat LDSC_studies_to_run.txt | parallel -j $NCORES --joblog logs/
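After the run, the job log can be used to find commands that failed. GNU parallel writes the job log as a tab-separated table with the columns Seq, Host, Starttime, JobRuntime, Send, Receive, Exitval, Signal and Command. The sketch below parses that layout; failed_jobs is a hypothetical helper, and it takes the log contents as text because the log file name is cut off in the command above.

```python
import csv
import io


def failed_jobs(joblog_text):
    """Return the commands whose exit value or signal is non-zero."""
    reader = csv.DictReader(
        io.StringIO(joblog_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    return [
        row["Command"]
        for row in reader
        if row["Exitval"] != "0" or row["Signal"] != "0"
    ]


# Made-up log contents in the GNU parallel joblog layout.
example = (
    "Seq\tHost\tStarttime\tJobRuntime\tSend\tReceive\tExitval\tSignal\tCommand\n"
    "1\t:\t1650000000.000\t12.500\t0\t0\t0\t0\techo study_A\n"
    "2\t:\t1650000001.000\t3.200\t0\t0\t1\t0\techo study_B\n"
)
print(failed_jobs(example))  # ['echo study_B']
```

The failed commands can be written back to a file and piped to parallel again.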
Generate epigenetic input:
For EPIMAP:
Generate consensus peak files:
$ python scripts/CHEERS_preprocessing/
This creates the consensus peaks in:
$ ../../tmp/Master_enhancers.sorted.merged.bed
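The .sorted.merged suffix suggests that the consensus peaks come from sorting the input peaks and merging those that overlap, as bedtools sort followed by bedtools merge would do. Assuming that is the case, the merging logic can be sketched in plain Python as follows. This is an illustration of the technique and not the repository's implementation; merge_peaks is a hypothetical name.

```python
from collections import defaultdict


def merge_peaks(peaks):
    """Merge overlapping or touching intervals per chromosome.

    peaks: iterable of (chrom, start, end) tuples.
    Returns a list of merged (chrom, start, end) tuples sorted by chromosome and start.
    """
    by_chrom = defaultdict(list)
    for chrom, start, end in peaks:
        by_chrom[chrom].append((start, end))

    merged = []
    for chrom in sorted(by_chrom):
        current_start, current_end = None, None
        for start, end in sorted(by_chrom[chrom]):
            if current_start is None:
                current_start, current_end = start, end
            elif start <= current_end:
                # Overlapping or book-ended: extend the current interval.
                current_end = max(current_end, end)
            else:
                merged.append((chrom, current_start, current_end))
                current_start, current_end = start, end
        merged.append((chrom, current_start, current_end))
    return merged


peaks = [("chr1", 100, 200), ("chr1", 150, 300), ("chr1", 400, 500), ("chr2", 10, 20)]
print(merge_peaks(peaks))
# [('chr1', 100, 300), ('chr1', 400, 500), ('chr2', 10, 20)]
```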
Generate signals:
$ python --Sample "BSS01668" --Peaks "../../tmp/Master_enhancers.sorted.merged.bed" --outdir "../../tmp/H3K27ac"
The output directory of the signals in consensus peak files can be passed to the CHEERS normalisation script.
For BLUEPRINT:
Generate consensus peak files:
$ python scripts/CHEERS_preprocessing/ --prefix BLUEPRINT --Peaks configs/BLUEPRINT_peaks.tsv --outdir ~/BLUEPRINT_peaks/
Generate signals in consensus peaks for one sample (the URLs are listed in configs/BLUEPRINT_signals.tsv):
$ python scripts/CHEERS_preprocessing/ --Sample 0 --Peaks ~/BLUEPRINT_peaks/BLUEPRINT_Consensus_peaks.bed --BW_URL --outdir /home/xg1/BLUEPRINT_peaks/ReadsInPeaks
Can be piped to parallel:
$ cat configs/BP_generate_signal_manifest.txt | parallel -j $NCORES
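A manifest like this holds one signal-generation command per sample. The sketch below shows one way such a manifest could be built, using the --Sample, --Peaks, --BW_URL and --outdir flags from the command above. Several things are assumptions: that --Sample takes the zero-based position of the sample in the URL list, that the URLs are available as a plain list, and the script name generate_signal.py, which is a placeholder because the real script name is not shown above.

```python
import shlex


def signal_commands(urls, script, peaks, outdir):
    """Build one shell command per URL, numbering samples from 0.

    Arguments are shell-quoted, which also stops `~` from expanding,
    so absolute paths should be passed.
    """
    commands = []
    for index, url in enumerate(urls):
        args = [
            "python", script,
            "--Sample", str(index),
            "--Peaks", peaks,
            "--BW_URL", url,
            "--outdir", outdir,
        ]
        commands.append(" ".join(shlex.quote(arg) for arg in args))
    return commands


# Placeholder script name and URLs, for illustration only.
commands = signal_commands(
    ["https://example.org/a.bw", "https://example.org/b.bw"],
    script="generate_signal.py",
    peaks="/home/xg1/BLUEPRINT_peaks/BLUEPRINT_Consensus_peaks.bed",
    outdir="/home/xg1/BLUEPRINT_peaks/ReadsInPeaks",
)
print("\n".join(commands))
```

Writing these lines to a text file gives a manifest that can be piped to parallel in the same way.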
The output directory of the signals in consensus peak files can be passed to the CHEERS normalisation script.
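The name of the normalised file used later in this README (BLUEPRINT_counts_normToMax_quantileNorm_euclideanNorm.txt) suggests that the last normalisation step is a Euclidean normalisation. Assuming this means scaling each peak's vector of per-sample signals to unit length, that step could look roughly like the sketch below. It is an illustration of the technique under that assumption and not the CHEERS code itself; euclidean_normalise is a hypothetical name.

```python
import math


def euclidean_normalise(rows):
    """Scale each row (one peak, one value per sample) to unit Euclidean length.

    Rows that are all zero are returned as zeros to avoid dividing by zero.
    """
    normalised = []
    for values in rows:
        norm = math.sqrt(sum(v * v for v in values))
        if norm == 0:
            normalised.append([0.0 for _ in values])
        else:
            normalised.append([v / norm for v in values])
    return normalised


signals = [[3.0, 4.0], [0.0, 0.0]]
print(euclidean_normalise(signals))
```

After this step, the values in each row describe how a peak's signal is distributed across samples rather than its absolute height.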
We used the latest Open Targets fine-mapping results. The latest publicly available credible SNP sets can be downloaded from:
$ gs://open-targets-genetics-releases/22.02.01/v2d_credset
I’ve saved the credible SNPs as ~/Credible_SNP_sets/finemapping_220401.parquet/
Generate credible SNP sets in hg19 for EPIMAP:
$ python --Study_ID GCSTxxxx --input_credset ~/Credible_SNP_sets/finemapping_220401.parquet/ --Enrichment_outdir ~/CHEERS/Results/ --input_peak Normalised_signals.txt --outdir ~/Credible_SNP_sets/Formatted_hg19/
This also creates a manifest file, ~/compute_CHEERS_enrichments.txt, which can be used to run the CHEERS enrichment computation in parallel:
$ cat ~/compute_CHEERS_enrichments.txt | parallel -j $NCORES
Generate credible SNP sets in hg38 for BLUEPRINT:
$ python --Study_ID GCST006979 --input_credset ~/Credible_SNP_sets/finemapping_220401.parquet/ --Enrichment_outdir /home/xg1/BLUEPRINT_peaks/Results/ --input_peak /home/xg1/BLUEPRINT_peaks/Normalised/BLUEPRINT_counts_normToMax_quantileNorm_euclideanNorm.txt --outdir ~/Credible_SNP_sets/Formatted_hg38/
This also creates a manifest file, ~/compute_CHEERS_enrichments.txt, which can be used to run the CHEERS enrichment computation in parallel:
$ cat ~/compute_CHEERS_enrichments.txt | parallel -j $NCORES
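Where GNU parallel is not available, a manifest of this kind can also be run from Python. The following is a rough sketch using only the standard library. It assumes each manifest line is a complete shell command; read_manifest and run_manifest are hypothetical helpers and are not part of this repository.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def read_manifest(path):
    """Return the non-empty lines of a manifest file, one command per line."""
    with open(path) as handle:
        return [line.strip() for line in handle if line.strip()]


def run_manifest(commands, n_workers):
    """Run shell commands with at most `n_workers` at a time.

    Returns a list of (command, exit_code) pairs in the input order.
    """
    def run(command):
        return command, subprocess.run(command, shell=True).returncode

    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        return list(pool.map(run, commands))


# Trivial stand-in commands; a real run would use read_manifest(...) instead.
results = run_manifest(["exit 0", "exit 3"], n_workers=2)
print(results)  # [('exit 0', 0), ('exit 3', 3)]
```

Threads are sufficient here because each worker only waits for an external process; any command with a non-zero exit code can be collected and rerun.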