Collection of tools to make popscle easier to use.
Filter BAM file for usage with popscle dsc-pileup by keeping reads:
- which overlap with SNPs in the VCF file
- and which have a cell barcode (default: "CB" tag) contained in the cell barcode list Keeping only relevant reads for popscle dsc-pileup can speedup it up quite significantly (depending on the reduction of the number of reads in the filtered BAM file vs original).
$ ./filter_bam_file_for_popscle_dsc_pileup.sh
Usage: filter_bam_file_for_popscle_dsc_pileup input_bam_filename barcodes_tsv_filename vcf_filename output_bam_filename [barcode_tag]
Purpose: Filter BAM file for usage with popscle dsc-pileup by keeping reads:
- which overlap with SNPs in the VCF file
- and which have a cell barcode (default: "CB" tag) contained in the cell barcode list
Keeping only relevant reads for popscle dsc-pileup can speedup it up quite significantly
(depending on the reduction of the number of reads in the filtered BAM file vs original).
# Create filtered BAM with only the reads dsc-pileup needs.
./filter_bam_file_for_popscle_dsc_pileup.sh \
./samples_to_demultiplex/outs/possorted_genome_bam.bam \
./samples_to_demultiplex/outs/filtered_feature_bc_matrix/barcodes.tsv \
samples.vcf \
/tmp/samples_to_demultiplex.filter_bam_file_for_popscle_dsc_pileup.bam
# Use filtered BAM file for dsc-pileup.
popscle dsc-pileup \
--sam /tmp/samples_to_demultiplex.filter_bam_file_for_popscle_dsc_pileup.bam \
--vcf samples.vcf \
--group-list ./samples_to_demultiplex/outs/filtered_feature_bc_matrix/barcodes.tsv \
--out samples_to_demultiplex.pileup
Sort VCF file in the same order as the BAM file so the following popscle dsc-pileup
error can be solved easily:
[E:%s] Your VCF/BCF files and SAM/BAM/CRAM files have different ordering of chromosomes. SAM/BAM/CRAM file has %s before %s, but VCF/BCF file has %s after %s"
$ ./sort_vcf_same_as_bam.sh
Usage: sort_vcf_same_as_bam BAM_file VCF_file [VCF_type]
Arguments:
- BAM_file: BAM file from which to get the contig order to sort the VCF file.
- VCF_file: VCF file to sort by contig order as defined in the BAM file.
- VCF_type: VCF ouput file type (default: same as input VCF file type):
v: uncompressed VCF, z: compressed VCF,
u: uncompressed BCF, b: compressed BCF
Purpose:
Sort VCF file in the same order as the BAM file, so it can be used with popscle.
# Sort VCF file in the same order as the BAM file, so it can be used with popscle.
./sort_vcf_same_as_bam.sh \
./samples_to_demultiplex/outs/possorted_genome_bam.bam \
samples.vcf \
> /tmp/samples.sorted_as_in_bam.vcf
# Sort gzipped VCF file in the same order as the BAM file and write compressed VCF file.
./sort_vcf_same_as_bam.sh \
./samples_to_demultiplex/outs/possorted_genome_bam.bam \
samples.vcf.gz \
> /tmp/samples.sorted_as_in_bam.vcf.gz
# Sort gzipped VCF file in the same order as the BAM file and write uncompressed VCF file.
./sort_vcf_same_as_bam.sh \
./samples_to_demultiplex/outs/possorted_genome_bam.bam \
samples.vcfi.gz \
v \
> /tmp/samples.sorted_as_in_bam.vcf
BCFtools can be used for filtering VCF files.
Looking at the BCFtools filtering expressions manual gives an idea how to create your own filters for mutations.
Import functions in current shell.
# Import functions.
source filter_vcf_file_for_popscle.sh
# Check if all needed programs are installed.
check_if_programs_exists
get_number_of_samples_in_vcf [VCF_file]
Example:
$ get_number_of_samples_in_vcf DGRP2.source_NCSU.dm6.final.bcf
205
Get all the sample names available in the BCF/VCF file (after the FORMAT
column).
get_samples_names_in_vcf [VCF_file]
Example:
$ get_samples_names_in_vcf DGRP2.source_NCSU.dm6.final.bcf | head
DGRP-021
DGRP-026
DGRP-028
DGRP-031
DGRP-032
DGRP-038
DGRP-040
DGRP-041
DGRP-042
DGRP-045
Extract only certain samples from a VCF file with multiple samples.
subset_samples_from_vcf comma_separated_samples_names [VCF_file]
Example:
subset_samples_from_vcf DGRP-032,DGRP-026,DGRP-042 DGRP2.source_NCSU.dm6.final.bcf | get_samples_names_in_vcf
DGRP-032
DGRP-026
DGRP-042
Only keep SNPs from VCF file (filter out INDELs and other mutations).
only_keep_snps [VCF_file]
Filter out mutations missing genotype (./.
) info for one or more samples.
For those mutations no info is available if the sample has the reference and/or mutations,
so it might be better to skip this mutation in popscle dsc-pileup
.
filter_out_mutations_missing_genotype_for_one_or_more_samples [VCF_file]
Filter out mutations that are heterozygous for one or more samples.
This can be useful to reduce the number of mutations for popscle dsc-pileup
when working with inbred lines
(all mutations are supposed to be homozygous).
In combination with filter_out_mutations_not_unique_for_one_sample
, the number of mutations can be reduced
even further.
filter_out_mutations_heterozygous_for_one_or_more_samples [VCF_file]
Filter out mutations that have homozygous reference calls in all samples.
If the mutation position contains the reference for both alleles in all samples,
the mutation is not informative and can be skipped for popscle dsc-pileup
.
filter_out_mutations_homozygous_reference_in_all_samples [VCF_file]
Filter out mutations that are heterozygous in all samples.
If all samples are inbred lines, you might want to remove all non-homozygous SNPs.
filter_out_mutations_heterozygous_in_all_samples [VCF_file]
Filter out mutations that are homozygous in all samples.
If the mutation position contains the mutation for both alleles in all samples,
the mutation is not informative and can be skipped for popscle dsc-pileup
.
filter_out_mutations_homozygous_in_all_samples [VCF_file]
Only keep mutations (heterozygous/homozygous) which are found only in one sample, but not at all (heterozygous/homozygous) in other samples.
only_keep_mutations_heterozygous_or_homozygous_in_one_sample [VCF_file]
Calculate allele frequency (AF
), allele count (AC
) and total number of alleles (AN
) from genotype info of each sample.
This will add AF
, AC
and AN
info fields or update those fields based on the genotype info of each sample in case they
were set incorrectly.
It is recommended to run this function before running:
filter_out_mutations_homozygous_reference_in_all_samples
: needs correct value forAC
.filter_out_mutations_homozygous_in_all_samples
: needs correct value forAC
andAN
.only_keep_mutations_homozygous_in_one_sample
: needs correct value forAC
.only_keep_mutations_heterozygous_in_one_sample
: needs correct value forAC
.only_keep_mutations_heterozygous_or_homozygous_in_one_sample
: needs correct value forAC
.
Running it after subset_samples_from_vcf
is also recommended as that function only updates 'AF' but not AC
and AN
.
calculate_AF_AC_AN_values_based_on_genotype_info [VCF_file]
Create (minimal) VCF file for popscle dsc-pileup
for 3 inbread lines (homozygous genotype SNPs are very common):
- Only keep mutations for 3 selected samples
- Only keep SNPs
- (Re)calculate allele frequency (
AF
), allele count (AC
), total number of alleles (AN
). - Remove all SNPs which are missing genotype information for at least one sample (not useful to call those positions in
popscle dsc-pileup
). - Remove all SNPs which are homozygous reference in all samples (not useful to call those positions in
popscle dsc-pileup
). - Remove all SNPs which are homozygous in all samples (not useful to call those positions in
popscle dsc-pileup
). - Remove all SNPs which are heterozygous in at least one sample (those mutations shouldn't exist in inbred lines).
- Only keep SNPs heterozygous or homozygous in one sample (but as heterozygous mutations are already filtered out, only keep homozygous ones).
subset_samples_from_vcf DGRP-032,DGRP-026,DGRP-042 DGRP2.source_BCM-HGSC.dm6.final.bcf \
| only_keep_snps \
| calculate_AF_AC_AN_values_based_on_genotype_info \
| filter_out_mutations_missing_genotype_for_one_or_more_samples \
| filter_out_mutations_homozygous_reference_in_all_samples \
| filter_out_mutations_homozygous_in_all_samples \
| filter_out_mutations_heterozygous_for_one_or_more_samples \
| only_keep_mutations_heterozygous_or_homozygous_in_one_sample \
> output.vcf
Create (minimal) VCF file for popscle dsc-pileup
(heterozygous genotype SNPs are very common):
- Only keep SNPs.
- (Re)calculate allele frequency (
AF
), allele count (AC
), total number of alleles (AN
). - Remove all SNPs which are missing genotype information for at least one sample (not useful to call those positions in
popscle dsc-pileup
). - Remove all SNPs which are homozygous reference in all samples (not useful to call those positions in
popscle dsc-pileup
). - Remove all SNPs which are homozygous in all samples (not useful to call those positions in
popscle dsc-pileup
).
only_keep_snps input.vcf \
| calculate_AF_AC_AN_values_based_on_genotype_info \
| filter_out_mutations_missing_genotype_for_one_or_more_samples \
| filter_out_mutations_homozygous_reference_in_all_samples \
| filter_out_mutations_homozygous_in_all_samples \
> output.vcf