Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Release 1.2.3 #48

Merged
merged 24 commits into from
Oct 19, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
24 commits
Select commit Hold shift + click to select a range
3f0794e
remove 'default' conda channel from environment.yml
cmsauer Oct 16, 2024
f57bf7e
update python version
cmsauer Oct 16, 2024
a9440a1
remove importing of unnecessary packages
cmsauer Oct 16, 2024
969949c
commit new updated allele_counter for improved speed and performance.
cmsauer Oct 16, 2024
8963cf2
allele counter version update to v1.2.3
cmsauer Oct 16, 2024
d83ebb5
implement new allele counter into savana workflow and add tmpdir func…
cmsauer Oct 16, 2024
a908901
add and improve various parameters and functions for allele counting …
cmsauer Oct 16, 2024
616e3dd
minor parameter fix
cmsauer Oct 16, 2024
5421498
minor parameter fix
cmsauer Oct 16, 2024
6fc8e6d
minor parameter fix
cmsauer Oct 16, 2024
8709991
fix minor bug in allele counter ref base
cmsauer Oct 17, 2024
a9f0af8
add cna_threads parameter
cmsauer Oct 17, 2024
215a7aa
add mean BAF and number of used hetSNPs to main output
cmsauer Oct 17, 2024
f6674dd
clean CN output to main output files and add headers.
cmsauer Oct 17, 2024
8cd6f7d
addition of parameters and improvement of parameter documentation
cmsauer Oct 17, 2024
d1acdff
addition of parameters and improvement of parameter documentation
cmsauer Oct 17, 2024
b336f85
addition of parameters and improvement of parameter documentation
cmsauer Oct 17, 2024
58bce95
addition of parameters and improvement of parameter documentation
cmsauer Oct 17, 2024
8d421ff
fix output file headers
cmsauer Oct 17, 2024
c9d26d2
fix output file headers
cmsauer Oct 17, 2024
dc667ff
minor bug fix
cmsauer Oct 17, 2024
9d8b7a6
fix for coverage in t2t :surfing_woman:
helrick Oct 17, 2024
1429ace
adding --overwrite flag, cleaning up tmpdir
helrick Oct 17, 2024
a762648
Merge branch 't2t-coverage-fixes' into csauer/allele_count_dev
helrick Oct 17, 2024
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
29 changes: 19 additions & 10 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -99,11 +99,16 @@ After installing, SAVANA can be run on long-read data with a minumum set of argu
savana --tumour <tumour-file> --normal <normal-file> --outdir <outdir> --ref <ref-fasta>
```

This will call somatic SVs. We recommend running with the --contigs argument (`--contigs example/contigs.chr.hg38.txt`) to only examine chromosomes of interest. To compute copy number aberrations, you must provide a phased VCF for the germline sample (generated using whatshap - see [Generating Phased VCF](#generating-phased-vcf)). Then, to call both SVs and CNAs you can run savana with:
This will call somatic SVs. We recommend running with the --contigs argument (`--contigs example/contigs.chr.hg38.txt`) to only examine chromosomes of interest. To compute copy number aberrations, you must provide a phased VCF for the germline sample (see [Generating Phased VCF](#generating-phased-vcf)). Then, to call both SVs and CNAs you can run savana with:
```
savana --tumour <tumour-file> --normal <normal-file> --outdir <outdir> --ref <ref-fasta> --phased_vcf <vcf-file> --blacklist <blacklist-bed-file>
```
> Note, that if you do not want to use a blacklist to compute copy number aberrations, you will have to specify the `--no_blacklist` flag instead.
Note, that if you do not want to use a blacklist to compute copy number aberrations, you will have to specify the `--no_blacklist` flag instead.
Additionally, if you have already generated the heterozygous SNP allele counts using the above command, you can skip this step by providing the <allele_counts_hetSNPs.bed> file instead of the phased VCF using the `--allele_counts_het_snps` flag.
Example command to run savana for both of the above:
```
savana --tumour <tumour-file> --normal <normal-file> --outdir <outdir> --ref <ref-fasta> --allele_counts_het_snps <allele_counts_hetSNPs.bed> --no_blacklist
```

### Quickstart

Expand Down Expand Up @@ -142,7 +147,7 @@ Argument|Description
--cna_resuce| Copy number abberation output file for this sample (used to rescue variants)
--cna_rescue_distance| Maximum distance from a copy number abberation for a variant to be rescued by it
--threads| Number of threads to use (default is maximum available)
--cna_threads| Number of threads to use for CNA calling (default is to use `--threads`)
--cna_threads| Number of threads to use for CNA calling (default is maximum available)
--ref_index| Full path to reference genome fasta index (ref path + ".fai" used by default)
--single_bnd| Report single breakend variants in addition to standard types (False by default)
--single_bnd_min_length| Minimum length of single breakend to consider (default=100)
Expand All @@ -155,20 +160,22 @@ Argument|Description
--coverage_binsize | Length used for coverage bins (default=5)
--chunksize | Chunksize to use when splitting genome for parallel analysis (default=1000000)
*CNA Algorithm Arguments*
--tmpdir| Temp directory for allele counting temp files (defaults to outdir)
--allele_counts_het_snps| If allele counting has already been performed provide the path for the allele counts of heterozygous SNPs to skip this step
--allele_mapq| Mapping quality threshold for reads to be included in the allele counting (default = 0)
--allele_mapq| Mapping quality threshold for reads to be included in the allele counting (default = 5)
--allele_min_reads| Minimum number of reads required per het SNP site for allele counting (default = 10)
--ac_window| Window size for allele counting to parallelise (default = 1000000)
--cn_binsize| Bin window size in kbp (default=10)
--blacklist| Path to the blacklist file
--breakpoints| Path to SAVANA VCF file to incorporate savana breakpoints into copy number analysis
--chromosomes| Chromosomes to analyse. To run on all chromosomes leave unspecified (default). To run on a subset of chromosomes only specify the chromosome numbers separated by spaces. For x and y chromosomes, use 23 and 24, respectively. E.g. use "-c 1 4 23 24" to run chromosomes 1, 4, X and Y
--chromosomes| Chromosomes to analyse. To run on all chromosomes leave unspecified (default). To run on a subset of chromosomes only specify the chromosome numbers separated by spaces. For x and y chromosomes, use 23 and 24, respectively. E.g. use "--chromosomes 1 4 23 24" to run chromosomes 1, 4, X and Y
--readcount_mapq| Mapping quality threshold for reads to be included in the read counting (default = 5)
--no_blacklist| Don't use a blacklist
--bl_threshold| Percentage overlap between bin and blacklist threshold to tolerate for read counting (default = 5). Please specify percentage threshold as integer, e.g. "-t 5". Set "-t 0" if no overlap with blacklist is to be tolerated
--no_basesfilter| Do not filter bases
--bases_threshold| Percentage of known bases per bin required for read counting (default = 75). Please specify percentage threshold as integer, e.g. "-bt 95"
--smoothing_level| Size of neighbourhood for smoothing
--trim| Trimming percentage to be used
--smoothing_level| Size of neighbourhood for smoothing (default = 10)
--trim| Trimming percentage to be used (0.025)
--min_segment_size| Minimum size for a segement to be considered a segment (default = 5)
--shuffles| Number of permutations (shuffles) to be performed during CBS (default = 1000)
--p_seg| p-value used to test segmentation statistic for a given interval during CBS using (shuffles) number of permutations (default = 0.05)
Expand All @@ -181,12 +188,14 @@ Argument|Description
--max_cellularity| Maximum cellularity to be considered for copy number fitting. If hetSNPs allele counts are provided this is estimated during copy number fitting. Alternatively a purity value can be provided if the purity of the sample is already known
--cellularity_step| Cellularity step size for grid search space used during for copy number fitting (default = 0.01)
--cellularity_buffer| Cellularity buffer to define purity grid search space during copy number fitting (default = 0.1)
--overrule_cellularity| Set to sample`s purity if known. This value will overrule the cellularity estimated using hetSNP allele counts (not used by default)
--distance_function| Distance function to be used for copy number fitting. choices=[RMSD, MAD] (default = RMSD)
--distance_filter_scale_factor| Distance filter scale factor to only include solutions with distances < scale factor * min(distance)
--distance_precision| Number of digits to round distance functions to (default = 3)
--max_proportion_zero| Maximum proportion of fitted copy numbers to be tolerated in the zero or negative copy number state (default = 0.1)
--min_proportion_close_to_whole_number| Minimum proportion of fitted copy numbers sufficiently close to whole number to be tolerated for a given fit (default = 0.5)
--max_distance_from_whole_number| Distance from whole number for fitted value to be considered sufficiently close to nearest copy number integer (default = 0.25)
--main_cn_step_change| Max main copy number step change across genome to be considered for a given solution
--min_ps_size| Minimum size (number of SNPs) for phaseset to be considered for purity estimation (default = 10)
--min_ps_length| Minimum length (bps) for phaseset to be considered for purity estimation (default = 500000)
*Additional Arguments*
Expand Down Expand Up @@ -271,7 +280,7 @@ To enable the examination of inserted sequence, `{sample}.inserted_sequences.fa`
### Output Files CNA Algorithm

#### Raw read counts TSV
`{sample}_{cn_binsize}_read_counts.tsv` contains all raw and unfiltered read counts for each bin across the reference genome for the tumour and matched normal sample. In addition, SAVANA also outputs other intermediate files during copy number processing, including the filtered read counts (`{sample}_{cn_binsize}_read_counts_filtered.tsv`) and the, if provided, matched-normal normalised log2 transformed read counts (`{sample}_{cn_binsize}_read_counts_mnorm_log2r.tsv`).
`{sample}_{cn_binsize}_raw_read_counts.tsv` contains all raw and unfiltered read counts for each bin across the reference genome for the tumour and matched normal sample. In addition, SAVANA also outputs other intermediate files during copy number processing, including the filtered and matched-normal normalised log2 transformed read counts (`{sample}_{cn_binsize}_read_counts_mnorm_log2r.tsv`).

#### Segmented log2r relative copy number TSV
`{sample}_{cn_binsize}_read_counts_mnorm_log2r_segmented.tsv` contains the final relative copy number (log2r) data post CBS segmentation. This includes the log2r relative copy number for each bin across the reference genome, as well as the segment IDs and according segmented log2r relative copy number values.
Expand All @@ -285,7 +294,7 @@ The final and main SAVANA CNA output file is `{sample}_{cn_binsize}_segmented_ab
## Phasing Information
### Generating Phased VCF

We recommend using [WhatsHap](https://whatshap.readthedocs.io/en/stable/index.html) to generate phased VCF from matched normal samples. As an example, WhatsHap can be run using the following command:
We recommend using [LongPhase](https://github.com/twolinin/longphase) or [WhatsHap](https://whatshap.readthedocs.io/en/stable/index.html) to generate phased VCF from matched normal samples. As an example, WhatsHap can be run using the following command:

```
whatshap phase --ignore-read-groups -o <phased.vcf.gz> --reference=<ref-fasta> <germline_snps.vcf> <normal-file>
Expand All @@ -295,7 +304,7 @@ Germline SNPs (`<germline_snps.vcf>`) can for example be obtained using [Clair3]

### Generating Phased BAMs

Again, we recommend using [WhatsHap](https://whatshap.readthedocs.io/en/stable/index.html) for tagging sequencing reads by haplotype to generate phased BAMs (see example code below) using the `<phased.vcf.gz>` obtained in the previous step.
Again, we recommend using [LongPhase](https://github.com/twolinin/longphase) or [WhatsHap](https://whatshap.readthedocs.io/en/stable/index.html) for tagging sequencing reads by haplotype to generate phased BAMs (see example code below) using the `<phased.vcf.gz>` obtained in the previous step.

```
whatshap haplotag --ignore-read-groups -o <phased_tumour.bam> --reference <ref-fasta> <phased.vcf.gz> <tumour-file> && samtools index <phased_tumour.bam>
Expand Down
3 changes: 1 addition & 2 deletions environment.yml
Original file line number Diff line number Diff line change
Expand Up @@ -4,9 +4,8 @@
channels:
- conda-forge
- bioconda
- defaults
dependencies:
- python=3.9.6
- python=3.9
- pybedtools=0.9.0
- pysam=0.21.0
- cyvcf2=0.30.16
Expand Down
Loading