Quick Start
The biograph full_pipeline command will convert reads to the BioGraph format, call variants against a reference, calculate coverage, and assign genotypes and quality scores using a machine learning model.
$ . bg7/bin/activate
(bg7)$ biograph reference --in hs37d5.fasta --refdir hs37d5/
...
(bg7)$ biograph full_pipeline --biograph my.bg --ref hs37d5/ \
--reads /path/to/my_reads.bam \
--model /path/to/biograph_model-6.0.5.ml \
--tmp /path/to/large/tmp/
2020-09-10 16:02:01,944 [INFO] Running biograph full_pipeline
...
2020-09-10 17:38:53,981 [INFO] Finished full_pipeline
(bg7)$
Results of the analysis are saved inside the BioGraph in the analysis/ folder.
(bg7)$ ls my.bg/analysis/
results.vcf.gz results.vcf.gz.tbi
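As a quick sanity check on the output, the number of variant records in results.vcf.gz can be counted with standard tools. The sketch below assumes the file is bgzip-compressed (suggested by the accompanying .tbi index), which gzip can read directly. count_variants is a hypothetical helper name, not a BioGraph command.

```shell
# Hypothetical helper (not part of BioGraph): count the variant records
# in a compressed VCF by skipping header lines, which start with '#'.
count_variants() {
    gzip -dc "$1" | awk '!/^#/ { n++ } END { print n + 0 }'
}

# Example call, guarded so the snippet does nothing when my.bg is absent.
if [ -f my.bg/analysis/results.vcf.gz ]; then
    count_variants my.bg/analysis/results.vcf.gz
fi
```

The count covers every record in the file, regardless of its FILTER or genotype values.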
Running the entire full_pipeline requires reads and a reference in BioGraph reference format. The full path to the BioGraph classifier model should also be provided.
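Because a full run can take well over an hour, it may be worth confirming that all three inputs exist before launching it. The following is a minimal sketch of such a check. check_inputs is a hypothetical helper, and it only tests that the paths exist, not that the reference directory is a valid BioGraph reference.

```shell
# Hypothetical pre-flight check (not part of BioGraph): verify that the
# reads file, reference directory, and model file exist before starting.
check_inputs() {
    reads=$1
    refdir=$2
    model=$3
    status=0
    [ -f "$reads" ]  || { echo "missing reads file: $reads" >&2; status=1; }
    [ -d "$refdir" ] || { echo "missing reference directory: $refdir" >&2; status=1; }
    [ -f "$model" ]  || { echo "missing model file: $model" >&2; status=1; }
    return "$status"
}

# Possible usage, with the same placeholder paths as the Quick Start example:
#   check_inputs /path/to/my_reads.bam hs37d5/ /path/to/biograph_model-6.0.5.ml \
#       && biograph full_pipeline --biograph my.bg --ref hs37d5/ ...
```

The function reports every missing input rather than stopping at the first, so a single run shows everything that needs fixing.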
BioGraph uses a genetic reference to speed up the read conversion process, as well as for variant calling and genotyping. In addition to the FASTA file itself, it requires a number of indices, all of which are kept in a single reference directory.
Conversion from a BWA-indexed FASTA to the BioGraph reference format only needs to be done once for each reference of interest (hs37d5, GRCh38, etc.). The conversion takes roughly half an hour and produces a reference directory of about 14GB, which can be reused for all subsequent analyses.
For example, the following command will convert the file hs37d5.fasta and save the resulting BioGraph reference to the directory ./hs37d5/:
(bg7)$ biograph reference --in /path/to/hs37d5.fasta --refdir ./hs37d5/
Prebuilt references (including hs37d5 and GRCh38) may be downloaded from AWS S3 at s3://spiral-public/references/. The README.txt under each prefix includes information about each reference, including where the source FASTA was downloaded from.
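One way to fetch a prebuilt reference is with the AWS CLI. The sketch below wraps the standard aws s3 ls and aws s3 sync commands in small helper functions. The function names are hypothetical, and the exact prefix names under the bucket are an assumption, so listing the bucket first is advisable.

```shell
# The bucket location comes from the text above. The prefix name passed
# to these helpers (for example "hs37d5") is an assumption: list the
# bucket first to see the names that actually exist.
REF_BUCKET="s3://spiral-public/references"

# Build the S3 URI for one reference prefix.
ref_uri() {
    printf '%s/%s/\n' "$REF_BUCKET" "$1"
}

# List the available reference prefixes (requires the AWS CLI).
list_references() {
    aws s3 ls "$REF_BUCKET/"
}

# Copy one reference prefix into a local directory of the same name.
fetch_reference() {
    aws s3 sync "$(ref_uri "$1")" "./$1/"
}
```

Running list_references and then, for example, fetch_reference hs37d5 would populate ./hs37d5/, which could then be passed to --ref, assuming the prefix holds a complete reference directory. If no AWS credentials are configured, adding --no-sign-request to the aws commands may work, provided the bucket permits anonymous reads.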
Note that BioGraph references consist of folders with several files inside.
The logs from every BioGraph command are saved under the qc/ folder inside the BioGraph. Additional statistics and reports are saved there as JSON and HTML files.
(bg7)$ ls my.bg/qc/
create_log.txt timings.json
create_stats.json variants_log.txt
kmer_quality_report-BELOW_MIN_COUNT.html variants_stats.json
kmer_quality_report.html
For more details about what is stored in the various files under the BioGraph directory, see What is Inside a BioGraph?
The intermediary files generated by each step in the pipeline are automatically removed at the end of the analysis. These files can be quite large and are generally not required. However, you may wish to keep some or all intermediaries for QC or other purposes.
The --keep vcf parameter will keep intermediary VCFs, --keep jl will keep all dataframes, and --keep all will keep everything.
(bg7)$ biograph full_pipeline --biograph my.bg --ref hs37d5/ \
--reads /path/to/my_reads.bam \
--model /path/to/biograph_model-6.0.5.ml \
--keep all
...
(bg7)$ ls my.bg/analysis/
coverage.vcf.gz df.jl discovery.vcf.gz.tbi results.vcf.gz
coverage.vcf.gz.tbi discovery.vcf.gz grm.jl results.vcf.gz.tbi
Each step in the BioGraph pipeline has several optional parameters. These are covered in detail in the next section, Customizing the BioGraph Pipeline.