deepCNNvalid

Code accompanying the publication "Validation of genetic variants from NGS data using Deep Convolutional Neural Networks" (https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05255-7).

The recommended way to run it is to download and decompress the version archived at Zenodo (https://zenodo.org/record/6409366), as it already includes binary dumps in NumPy format of all processed datasets. In this case, run

cd training

and

python 5fold_cv.py

in a shell, starting from the root directory of the decompressed archive, to replicate the results of the cross-validation,

python longterm_training.py

for the 100-split evaluation, and

python compare_datasets.py

for the evaluation on the independent held-out validation dataset from Kotani et al. (https://www.nature.com/articles/s41375-018-0253-3).

Run the scripts from the training directory so that the relative paths to the datasets resolve correctly. Alternatively, adapt the folder and data structure to whatever suits your setup.
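If you prefer to launch the scripts from elsewhere, for example from a wrapper script, the working directory can be set explicitly. The following is a minimal sketch, not part of the repository: the helper name `run_from_training` is hypothetical, and it assumes the decompressed archive contains the `training` folder described above.

```python
import subprocess
import sys
from pathlib import Path

# Script names as listed in this README.
SCRIPTS = ["5fold_cv.py", "longterm_training.py", "compare_datasets.py"]


def run_from_training(archive_root, script):
    """Run `script` with the working directory set to <archive_root>/training.

    Hypothetical helper: it only ensures that the relative dataset paths used
    by the scripts are resolved against the training directory.
    """
    training_dir = Path(archive_root) / "training"
    if not (training_dir / script).is_file():
        raise FileNotFoundError(f"{script} not found in {training_dir}")
    # Use the current interpreter so the same environment is used.
    return subprocess.run([sys.executable, script], cwd=training_dir, check=True)
```

For example, `run_from_training(".", "5fold_cv.py")` called from the archive root would start the cross-validation.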

The code in this work was executed using Python 3.8.8 and relies on the following packages:

  • NumPy 1.19.2
  • Pandas 1.2.3
  • TensorFlow 2.4.1
  • Keras 2.4.3
  • scikit-learn 0.24.1
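To check whether your environment matches the versions listed above, a short script along the following lines may help. This is a sketch and not part of the repository: the distribution names (for example `tensorflow` rather than a GPU-specific variant) are assumptions about how the packages were installed, and `find_mismatches` is a hypothetical helper.

```python
from importlib import metadata  # available from Python 3.8 onwards

# Versions as listed in this README; the distribution names are assumptions.
PINNED = {
    "numpy": "1.19.2",
    "pandas": "1.2.3",
    "tensorflow": "2.4.1",
    "keras": "2.4.3",
    "scikit-learn": "0.24.1",
}


def find_mismatches(pinned, get_version=metadata.version):
    """Return {name: (listed, installed)} for every package that differs.

    `installed` is None when the package is not installed at all.
    """
    mismatches = {}
    for name, wanted in pinned.items():
        try:
            installed = get_version(name)
        except metadata.PackageNotFoundError:
            installed = None
        if installed != wanted:
            mismatches[name] = (wanted, installed)
    return mismatches


for name, (wanted, installed) in find_mismatches(PINNED).items():
    print(f"{name}: README lists {wanted}, found {installed or 'not installed'}")
```

A differing version does not necessarily mean the scripts will fail; the listed versions are simply the ones the code was executed with.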

If, on the other hand, you wish to rebuild the data from scratch, run

sh download_and_precess_data.sh

This downloads the sequences from the Sequence Read Archive (https://www.ncbi.nlm.nih.gov/sra) and the European Nucleotide Archive (https://www.ebi.ac.uk/ena/browser/) and then processes them. Be advised that the download amounts to approximately 1000 GB of sequencing data. The tools required for data processing, whose binaries must be findable in PATH, are:

  • sratoolkit v2.9.2
  • bwa-mem v0.7.15
  • picardtools v2.2.2
  • GATK v3.7
  • samtools v1.3.1
  • varscan v2.4.2
  • annovar v2015Dec14
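Before starting the download, it may be worth checking that the tools are reachable and that enough disk space is available. The sketch below is not part of the repository. The executable names in `ASSUMED_EXECUTABLES` are assumptions (`fastq-dump` from sratoolkit, `bwa`, `samtools`); Picard, GATK 3.7 and VarScan are typically distributed as Java archives and ANNOVAR as Perl scripts, so how the shell scripts locate them is not covered here. Both helper functions are hypothetical.

```python
import shutil

# ASSUMPTION: executable names for the command-line tools; adjust to your install.
ASSUMED_EXECUTABLES = ["fastq-dump", "bwa", "samtools"]

# Approximate download size stated in this README, in gigabytes.
APPROX_DOWNLOAD_GB = 1000


def missing_executables(names):
    """Return the subset of `names` that cannot be found on PATH."""
    return [name for name in names if shutil.which(name) is None]


def free_gigabytes(path="."):
    """Return the free space (in GB, 1 GB = 1e9 bytes) on the disk holding `path`."""
    return shutil.disk_usage(path).free / 1e9


missing = missing_executables(ASSUMED_EXECUTABLES)
print("Not found on PATH:", ", ".join(missing) if missing else "none")
print(f"Free space here: {free_gigabytes():.0f} GB "
      f"(download is approximately {APPROX_DOWNLOAD_GB} GB)")
```

Note that the processed files need space in addition to the raw download, so the free-space figure is only a lower bound on what is required.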

Additionally, you will need the mm10 build of the mouse genome (https://hgdownload.soe.ucsc.edu/goldenPath/mm10/database/) and the mouse dbSNP (ftp://ftp.ncbi.nih.gov/snp/organisms/archive/mouse_10090/VCF/00-All.vcf.gz). For more information on downloading the required databases, see the included shell scripts.