A Snakemake workflow for calculating the complexity of a GFA file.
The usage of this workflow is described in the Snakemake Workflow Catalog.
The snakemake.old file was used to generate several regions, but it requires a good amount of manual work to prepare each run. So I tried to make the snakemake file but it hangs when running on slurm nodes on expanse (but not on the login node for dry-runs) and I'm not sure why.
Our complexity metric scales with the size of the genome so I had been using a random background sequence to compare against and determine signifcance. Though it requires more testing, I think using a highly polygenic trait's sumstats
file like height gwas hits actually produces a more realistic background distribution.
Only minigraph cactus version 1.1 of the pangenome has the walk labels we use to generate the node-hap map so we are limited to this version. It uses GRCh38 chromosomes and labels to build off of.
All scripts should be in workflow/scripts/
- Generate a node-haplotype map using
build_node_sample_map.sh
and build a gfabase.gfab
file for fast graph subsetting. Note: this file is not an executable and more a collection of one liners that I saved in a single location. - Build
bed
files for graph segmenting. From.sumstats
file formats I built the scriptssumstats_to_plinklinear.py
andplinklinear_to_bed.py
to generate bed files. It requires a snp to GRCh38 coordinate file I currated at/home/wwford/datasets/rsID.coords.bed
on Expanse. Anyone should be able to read this file. - Segment subgraphs using
bed_to_gfas.py
. - Calculate complexity using
gfa_to_complexity.py
There is another script manhattan_range.py
adapted from Arya's script that takes in plinklinear files and plots manhattan plots to verify that we are accurately grabbing the significant regions.
In the workflow above we use that will likely need to be downloaded if not already: GFA Base tabix bgzip
snakemake (though I didn't get the efficient version of this to work)
rsID.coords.bed file example:
chr10 132889560 132889561 rs11815293
chr2 156133793 156133794 rs12694551
...
Feel free to email me if you have any questions: wwford@ucsd.edu