Tutorial

Ai Okada edited this page Jun 12, 2024 · 21 revisions

Tutorial: using nanomonsv to obtain somatic structural variation

Preparation

System requirements

  • CPU: 8 vCPUs
  • Memory: 40 GB RAM
  • Free space: 500 GB
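
As a rough pre-flight check, the sketch below reports what the current machine offers next to the figures listed above. It assumes a Linux host (it relies on `nproc` and `/proc/meminfo`), and the function names are illustrative, not part of nanomonsv. Values are reported in GiB, so a machine sold as "40 GB" may show a slightly smaller number.

```shell
# Number of CPUs available to this shell.
cpu_count() {
    nproc
}

# Total RAM in whole GiB, read from /proc/meminfo (the value there is in kB).
mem_total_gb() {
    awk '/^MemTotal:/ { printf "%d\n", $2 / 1024 / 1024 }' /proc/meminfo
}

# Free space in whole GiB on the filesystem that holds the given directory.
free_space_gb() {
    df -Pk "$1" | awk 'NR == 2 { printf "%d\n", $4 / 1024 / 1024 }'
}

# Print the three values alongside the tutorial's stated requirements.
report_requirements() {
    echo "CPUs: $(cpu_count) (tutorial: 8)"
    echo "Memory: $(mem_total_gb) GiB (tutorial: 40)"
    echo "Free space: $(free_space_gb "$1") GiB (tutorial: 500)"
}
```

Run `report_requirements "$PWD"` from the directory where you plan to work.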

Pull singularity images

Use the following commands to pull the required Singularity images.

mkdir -p $PWD/image
singularity pull $PWD/image/sra-tools_3.0.0.sif docker://ncbi/sra-tools:3.0.0
singularity pull $PWD/image/minimap2_2.17.sif docker://aokad/minimap2:2.17
singularity pull $PWD/image/nanomonsv_v0.5.0.sif docker://friend1ws/nanomonsv:v0.5.0

Download reference files

Use the following commands to download the reference genome and its index.

mkdir -p $PWD/reference
wget https://storage.googleapis.com/genomics-public-data/resources/broad/hg38/v0/Homo_sapiens_assembly38.fasta \
  -O $PWD/reference/Homo_sapiens_assembly38.fasta
wget https://storage.googleapis.com/genomics-public-data/resources/broad/hg38/v0/Homo_sapiens_assembly38.fasta.fai \
  -O $PWD/reference/Homo_sapiens_assembly38.fasta.fai
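
Because the FASTA and its `.fai` index are downloaded separately, it can be worth confirming that they describe the same sequences before a long alignment run. The following is a minimal sketch, not part of the original workflow; `check_fai` is a hypothetical helper that compares the sequence names in the FASTA headers with the first column of the index.

```shell
# check_fai FASTA: compare sequence names in FASTA headers with FASTA.fai.
check_fai() {
    fasta=$1
    # Header name = text after '>' up to the first whitespace.
    fasta_names=$(grep '^>' "$fasta" | sed 's/^>//' | awk '{ print $1 }')
    # The first column of a .fai file is the sequence name.
    fai_names=$(cut -f 1 "$fasta.fai")
    if [ "$fasta_names" = "$fai_names" ]; then
        echo "OK: index matches $(printf '%s\n' "$fai_names" | wc -l | tr -d ' ') sequences"
    else
        echo "MISMATCH: $fasta.fai does not list the same sequences as $fasta"
        return 1
    fi
}
```

For example: `check_fai $PWD/reference/Homo_sapiens_assembly38.fasta`. This only compares names and order, not sequence lengths or offsets.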

Tumor and matched control data

The Oxford Nanopore sequencing data used in the bioRxiv paper are available through the public sequence repository service (BioProject ID: PRJDB10898).

SRA-toolkit is a set of tools for working with data registered in the SRA.

Use the following commands to download the sequence data.

COLO829 (14.5 hours)

mkdir -p $PWD/fastq/COLO829
singularity exec $PWD/image/sra-tools_3.0.0.sif fasterq-dump -e 8 -O $PWD/fastq/COLO829 DRR258589

COLO829BL (13 hours)

mkdir -p $PWD/fastq/COLO829BL
singularity exec $PWD/image/sra-tools_3.0.0.sif fasterq-dump -e 8 -O $PWD/fastq/COLO829BL DRR258590
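
Since each download takes many hours, a quick integrity check on the resulting FASTQ may save a failed alignment later. The sketch below assumes the usual four-lines-per-record FASTQ layout; `count_fastq_reads` is an illustrative helper, not an SRA Toolkit command. It scans the whole file, so expect it to take a while on files of this size.

```shell
# count_fastq_reads FILE: print the number of reads, assuming 4 lines per record.
count_fastq_reads() {
    lines=$(wc -l < "$1" | tr -d ' ')
    # A line count that is not a multiple of 4 suggests a truncated file.
    if [ $((lines % 4)) -ne 0 ]; then
        echo "line count $lines is not a multiple of 4: $1" >&2
        return 1
    fi
    echo $((lines / 4))
}
```

For example: `count_fastq_reads $PWD/fastq/COLO829/DRR258589.fastq`.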

Next, align the reads with minimap2, then sort and index the alignments with samtools.

COLO829 (6 hours)

mkdir -p $PWD/bam/COLO829/
singularity exec $PWD/image/minimap2_2.17.sif sh -c \
  "minimap2 -ax map-ont -t 8 -p 0.1 $PWD/reference/Homo_sapiens_assembly38.fasta $PWD/fastq/COLO829/DRR258589.fastq \
  | samtools view -Shb > $PWD/bam/COLO829/COLO829.unsorted.bam && \
  samtools sort -@ 8 -m 2G $PWD/bam/COLO829/COLO829.unsorted.bam -o $PWD/bam/COLO829/COLO829.bam && \
  samtools index $PWD/bam/COLO829/COLO829.bam"

COLO829BL (5 hours)

mkdir -p $PWD/bam/COLO829BL/
singularity exec $PWD/image/minimap2_2.17.sif sh -c \
  "minimap2 -ax map-ont -t 8 -p 0.1 $PWD/reference/Homo_sapiens_assembly38.fasta $PWD/fastq/COLO829BL/DRR258590.fastq \
  | samtools view -Shb > $PWD/bam/COLO829BL/COLO829BL.unsorted.bam && \
  samtools sort -@ 8 -m 2G $PWD/bam/COLO829BL/COLO829BL.unsorted.bam -o $PWD/bam/COLO829BL/COLO829BL.bam && \
  samtools index $PWD/bam/COLO829BL/COLO829BL.bam"
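
The two alignment commands differ only in the sample name and the run accession, so they can be wrapped in one function. This is just a convenience sketch using the same options as above; `align_sample` is a name introduced here, and the directory layout is the one used throughout this tutorial.

```shell
# align_sample SAMPLE RUN: align $PWD/fastq/SAMPLE/RUN.fastq and write
# a sorted, indexed BAM to $PWD/bam/SAMPLE/SAMPLE.bam.
align_sample() {
    sample=$1
    run=$2
    ref=$PWD/reference/Homo_sapiens_assembly38.fasta
    out=$PWD/bam/$sample
    mkdir -p "$out"
    # The whole pipeline runs inside the minimap2 image, as in the commands above.
    singularity exec "$PWD/image/minimap2_2.17.sif" sh -c \
        "minimap2 -ax map-ont -t 8 -p 0.1 $ref $PWD/fastq/$sample/$run.fastq \
        | samtools view -Shb > $out/$sample.unsorted.bam && \
        samtools sort -@ 8 -m 2G $out/$sample.unsorted.bam -o $out/$sample.bam && \
        samtools index $out/$sample.bam"
}
```

With this in place, the two runs become `align_sample COLO829 DRR258589` and `align_sample COLO829BL DRR258590`.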

Remove temporary files.

rm -r $PWD/fastq/
rm $PWD/bam/COLO829/COLO829.unsorted.bam
rm $PWD/bam/COLO829BL/COLO829BL.unsorted.bam

Control panel

We prepared a control panel created from 30 Nanopore sequencing datasets from the Human Pangenome Reference Consortium (HPRC). You can download it with the following commands:

mkdir -p $PWD/control_panel
wget https://zenodo.org/api/files/08b52270-9f9b-47bd-b03d-81f5859d676f/hprc_year1_data_freeze_nanopore_guppy4_minimap2_2_24_merge_control_GRCh38.tar.gz -O $PWD/control_panel/hprc_year1_data_freeze_nanopore_guppy4_minimap2_2_24_merge_control_GRCh38.tar.gz
tar -xvf $PWD/control_panel/hprc_year1_data_freeze_nanopore_guppy4_minimap2_2_24_merge_control_GRCh38.tar.gz -C $PWD/control_panel/

This control panel was made by aligning the 30 Nanopore sequencing datasets to the GRCh38 reference genome (obtained from here) with minimap2 version 2.24. If you use this control panel in a publication, do not forget to credit HPRC!

parse stage

This step parses all the supporting reads of putative somatic SVs.

COLO829 (1 hour)

singularity exec $PWD/image/nanomonsv_v0.5.0.sif \
  nanomonsv parse \
    $PWD/bam/COLO829/COLO829.bam \
    $PWD/output/COLO829/COLO829

COLO829BL (1 hour)

singularity exec $PWD/image/nanomonsv_v0.5.0.sif \
  nanomonsv parse \
    $PWD/bam/COLO829BL/COLO829BL.bam \
    $PWD/output/COLO829BL/COLO829BL

After successful completion, you will find supporting reads stratified by deletions, insertions, and rearrangements:

$PWD/output/
    |- COLO829/
    |   |- COLO829.deletion.sorted.bed.gz
    |   |- COLO829.insertion.sorted.bed.gz
    |   |- COLO829.rearrangement.sorted.bedpe.gz
    |   |- COLO829.bp_info.sorted.bed.gz
    |   |- COLO829.bp_info.sorted.bed.gz.tbi
    |
    |- COLO829BL/
        |- COLO829BL.deletion.sorted.bed.gz
        |- COLO829BL.insertion.sorted.bed.gz
        |- COLO829BL.rearrangement.sorted.bedpe.gz
        |- COLO829BL.bp_info.sorted.bed.gz
        |- COLO829BL.bp_info.sorted.bed.gz.tbi
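
Before moving on to the next stage, you may want to confirm that the five files listed above exist for each sample. A minimal sketch follows; `check_parse_outputs` is a hypothetical helper that takes the same output prefix that was passed to `nanomonsv parse`.

```shell
# check_parse_outputs PREFIX: report any expected parse output that is
# missing or empty. Returns 0 only if all five files are present.
check_parse_outputs() {
    prefix=$1
    status=0
    for suffix in deletion.sorted.bed.gz insertion.sorted.bed.gz \
                  rearrangement.sorted.bedpe.gz bp_info.sorted.bed.gz \
                  bp_info.sorted.bed.gz.tbi; do
        if [ ! -s "$prefix.$suffix" ]; then
            echo "missing or empty: $prefix.$suffix"
            status=1
        fi
    done
    return "$status"
}
```

For example: `check_parse_outputs $PWD/output/COLO829/COLO829 && check_parse_outputs $PWD/output/COLO829BL/COLO829BL`.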

get stage

This step gets the SV result from the parsed supporting reads data obtained above.

COLO829 and COLO829BL (1.5 hours)

singularity exec $PWD/image/nanomonsv_v0.5.0.sif \
  nanomonsv \
    get \
    $PWD/output/COLO829/COLO829 \
    $PWD/bam/COLO829/COLO829.bam \
    $PWD/reference/Homo_sapiens_assembly38.fasta \
    --control_prefix $PWD/output/COLO829BL/COLO829BL \
    --control_bam $PWD/bam/COLO829BL/COLO829BL.bam \
    --processes 8 \
    --single_bnd \
    --use_racon \
    --control_panel_prefix $PWD/control_panel/hprc_year1_data_freeze_nanopore_guppy4_minimap2_2_24_merge_control_GRCh38/hprc_year1_data_freeze_nanopore_guppy4_minimap2_2_24_merge_control_GRCh38

After successful execution, you will find the result file at $PWD/output/COLO829/COLO829.nanomonsv.result.txt.

Removing indels within simple repeats

One of the most effective filters is removing insertions and deletions confined to simple repeat regions. For that, you need to prepare a bgzip-compressed and tabix-indexed simple repeat BED file as follows:

wget http://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/simpleRepeat.txt.gz
zcat simpleRepeat.txt.gz | cut -f 2-4 | sort -k1,1 -k2,2n -k3,3n > simpleRepeat.bed
bgzip -c simpleRepeat.bed > simpleRepeat.bed.gz
tabix -p bed simpleRepeat.bed.gz
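
To see what the `cut` and `sort` steps do, here is a toy run on two made-up rows. The UCSC simpleRepeat table starts with the columns bin, chrom, chromStart, chromEnd, so `cut -f 2-4` keeps exactly the three BED columns, and the sort orders them by chromosome, then numerically by start and end. The function name and the sample rows are illustrative only.

```shell
# Keep columns 2-4 (chrom, chromStart, chromEnd) and sort by position.
simple_repeat_to_bed() {
    cut -f 2-4 | sort -k1,1 -k2,2n -k3,3n
}

# Two made-up rows, deliberately out of order.
printf '585\tchr1\t50000\t50100\ttrf\n585\tchr1\t10000\t10468\ttrf\n' \
    | simple_repeat_to_bed
```

The output lists the chr1:10000-10468 interval first, followed by chr1:50000-50100, each as three tab-separated columns.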

Then, download the annotation script and run it on the result file:

wget https://raw.githubusercontent.com/friend1ws/nanomonsv/master/misc/add_simple_repeat.py
singularity exec $PWD/image/nanomonsv_v0.5.0.sif \
  python3 add_simple_repeat.py \
    $PWD/output/COLO829/COLO829.nanomonsv.result.txt \
    $PWD/output/COLO829/COLO829.nanomonsv.result.filt.txt \
    simpleRepeat.bed.gz

Now, indels confined within a simple repeat are labeled "Simple_repeat" in the COLO829.nanomonsv.result.filt.txt file. You can create a file that includes only the SVs that passed all filter checks as follows:

head -n 1 $PWD/output/COLO829/COLO829.nanomonsv.result.filt.txt > $PWD/output/COLO829/COLO829.nanomonsv.result.filt.pass.txt
tail -n +2 $PWD/output/COLO829/COLO829.nanomonsv.result.filt.txt | grep PASS >> $PWD/output/COLO829/COLO829.nanomonsv.result.filt.pass.txt
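
Note that `grep PASS` keeps any line containing the string "PASS" anywhere. If you prefer to match the filter column exactly, the sketch below is one possible alternative. It looks up the column by its header name, which you pass as an argument; `filter_pass` is a hypothetical helper, and the column name `Is_Filter` used in the example is an assumption, so check it against the first line of your own result file.

```shell
# filter_pass COLUMN: read a tab-separated table with a header on stdin and
# print the header plus the rows whose COLUMN value is exactly "PASS".
filter_pass() {
    column=$1
    awk -F '\t' -v col="$column" '
        NR == 1 {
            # Locate the requested column in the header line.
            for (i = 1; i <= NF; i++) if ($i == col) idx = i
            if (!idx) { print "column not found: " col > "/dev/stderr"; exit 1 }
            print
            next
        }
        $idx == "PASS"
    '
}
```

For example: `filter_pass Is_Filter < $PWD/output/COLO829/COLO829.nanomonsv.result.filt.txt > $PWD/output/COLO829/COLO829.nanomonsv.result.filt.pass.txt`.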