Validation of genetic variants from NGS data using Deep Convolutional Neural Networks

marc-vaisband/deepCNNvalid

deepCNNvalid

Code accompanying the publication "Validation of genetic variants from NGS data using Deep Convolutional Neural Networks" (https://www.biorxiv.org/content/10.1101/2022.04.12.488021v1).

The recommended way to run it is to download and decompress the version archived at Zenodo (https://zenodo.org/record/6409366), which already includes binary dumps in NumPy format of all processed datasets. In this case, open a shell in the root directory and change into the training directory:

cd training

Then run

python 5fold_cv.py

to replicate the results of the cross-validation,

python longterm_training.py

for the 100-split evaluation, and

python compare_datasets.py

for the evaluation on the independent held-out validation dataset from Kotani et al. (https://www.nature.com/articles/s41375-018-0253-3).

Run the scripts from the training directory so that the paths to the datasets resolve correctly. You can of course adapt the folder and data structure if that suits you better.
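One way to guard against launching a script from the wrong directory is to check the working directory before any data is loaded. The sketch below is an illustration and not part of the repository: the helper name `looks_like_training_dir` is hypothetical, and it assumes only that the training directory is named `training` and contains `5fold_cv.py`, as described above.

```python
from pathlib import Path


def looks_like_training_dir(directory):
    """Return True if `directory` is named 'training' and holds 5fold_cv.py.

    Hypothetical helper for illustration; the repository scripts do not
    necessarily perform this check themselves.
    """
    directory = Path(directory)
    return directory.name == "training" and (directory / "5fold_cv.py").is_file()


if __name__ == "__main__":
    if not looks_like_training_dir(Path.cwd()):
        print("Warning: run this from the training directory (cd training).")
```

Calling the check at the top of a script turns a confusing "file not found" error further down into an immediate, readable warning.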

The code in this work was executed using Python 3.8.8 and relies on the following packages:

  • NumPy 1.19.2
  • Pandas 1.2.3
  • TensorFlow 2.4.1
  • Keras 2.4.3
  • scikit-learn 0.24.1
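To compare an existing environment against the versions listed above, a small check along the following lines may help. It is a minimal sketch, not part of the repository: the function names are hypothetical, and the dictionary keys are assumed to be the PyPI distribution names of the listed packages. It relies only on `importlib.metadata`, which is in the standard library from Python 3.8 onward.

```python
from importlib import metadata

# Versions listed in this README; keys are assumed PyPI distribution names.
EXPECTED = {
    "numpy": "1.19.2",
    "pandas": "1.2.3",
    "tensorflow": "2.4.1",
    "keras": "2.4.3",
    "scikit-learn": "0.24.1",
}


def installed_version(dist_name):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def version_mismatches(expected, lookup=installed_version):
    """Map each package whose version differs to an (expected, found) pair."""
    mismatches = {}
    for name, wanted in expected.items():
        found = lookup(name)
        if found != wanted:
            mismatches[name] = (wanted, found)
    return mismatches


if __name__ == "__main__":
    for name, (wanted, found) in version_mismatches(EXPECTED).items():
        print(f"{name}: expected {wanted}, found {found or 'not installed'}")
```

A mismatch does not necessarily mean the code will fail; the listed versions are simply the ones the work was executed with.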

If, on the other hand, you wish to rebuild the data from scratch, run

sh download_and_precess_data.sh

This will download the sequences from the Sequence Read Archive (https://www.ncbi.nlm.nih.gov/sra) and the European Nucleotide Archive (https://www.ebi.ac.uk/ena/browser/) and then process them. Be advised that the download amounts to approximately 1000 GB of sequencing data. The tools required for data processing, whose binaries should be findable in PATH, are:

  • sratoolkit v2.9.2
  • bwa-mem v0.7.15
  • picardtools v2.2.2
  • GATK v3.7
  • samtools v1.3.1
  • varscan v2.4.2
  • annovar v2015Dec14
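Before starting a download of this size, it can be worth checking that the tools are on PATH and that enough disk space is free. The following is a rough preflight sketch under stated assumptions, not part of the repository: the executable names in `ASSUMED_EXECUTABLES` are guesses at how three of the tools are commonly invoked, and the remaining tools are omitted because their invocation names depend on the installation. It does not check tool versions.

```python
import shutil

# Assumed executable names for sratoolkit, bwa and samtools. Picard, GATK,
# VarScan and ANNOVAR are left out because how they are invoked depends on
# the local installation; add the names used on your system.
ASSUMED_EXECUTABLES = ["fastq-dump", "bwa", "samtools"]

# Approximate download size stated in this README, in gigabytes.
REQUIRED_FREE_GB = 1000


def missing_executables(names):
    """Return the names that cannot be found on PATH."""
    return [name for name in names if shutil.which(name) is None]


def free_gb(path="."):
    """Return the free space, in gigabytes, on the filesystem holding `path`."""
    return shutil.disk_usage(path).free / 1e9


if __name__ == "__main__":
    missing = missing_executables(ASSUMED_EXECUTABLES)
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    if free_gb() < REQUIRED_FREE_GB:
        print(f"Only {free_gb():.0f} GB free; about {REQUIRED_FREE_GB} GB is needed.")
```

Processing produces intermediate files in addition to the raw download, so the free-space threshold here is a lower bound rather than a guarantee.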

Additionally, you will need the mm10 build of the mouse genome (https://hgdownload.soe.ucsc.edu/goldenPath/mm10/database/) and the mouse dbSNP (ftp://ftp.ncbi.nih.gov/snp/organisms/archive/mouse_10090/VCF/00-All.vcf.gz). For more information on downloading the required databases, see the included shell scripts.
