(1) "Error waiting for container: invalid character 'u' looking for beginning of value" (2) "Could not execute because the application was not found or a compatible .NET SDK is not installed." #212

Closed
animesh opened this issue Mar 26, 2021 · 61 comments

Comments

@animesh

animesh commented Mar 26, 2021

Docker is running though... below is the full log

Command executing: Powershell.exe docker pull smithlab/spritz ;docker run --rm -i -t --name spritz-956943642 -v """Z:\AGS\AGS RNAseq CNIO\raw data:/app/analysis""" -v """Z:\AGS\AGS RNAseq CNIO\raw data\data:/app/data""" -v """Z:\AGS\AGS RNAseq CNIO\raw data\configs:/app/configs""" smithlab/spritz; docker stop spritz-956943642
Saving output to Z:\AGS\AGS RNAseq CNIO\raw data\workflow_2021-03-23-10-45-25.txt. Please monitor it there...

Using default tag: latest
latest: Pulling from smithlab/spritz
Digest: sha256:55172c3a6e32257f977c9512e473f647eaeab32e35b6d96341598d6b96f97615
Status: Image is up to date for smithlab/spritz:latest
docker.io/smithlab/spritz:latest
Building DAG of jobs...
Using shell: /bin/bash
Provided cores: 12
Rules claiming more threads will be scaled down.
Conda environments: ignored
Job counts:
	count	jobs
	1	all
	1	base_recalibration
	1	build_transfer_mods
	1	call_gvcf_varaints
	1	call_vcf_variants
	1	download_snpeff
	1	final_vcf_naming
	1	finish_variants
	1	generate_reference_snpeff_database
	1	hisat2_groupmark_bam
	1	reference_protein_xml
	1	split_n_cigar_reads
	1	tmpdir
	1	transfer_modifications_variant
	1	variant_annotation_ref
	15

[Tue Mar 23 09:45:38 2021]
rule download_snpeff:
    output: SnpEff/snpEff.config, SnpEff/snpEff.jar, SnpEff_4.3_SmithChemWisc_v2.zip
    log: data/SnpEffInstall.log
    jobid: 5

[Tue Mar 23 09:45:38 2021]
rule build_transfer_mods:
    output: TransferUniProtModifications/TransferUniProtModifications/bin/Release/netcoreapp3.1/TransferUniProtModifications.dll
    log: data/TransferUniProtModifications.build.log
    jobid: 72
    benchmark: data/TransferUniProtModifications.build.benchmark

[Tue Mar 23 09:45:38 2021]
rule tmpdir:
    output: tmp, temporary
    log: data/tmpdir.log
    jobid: 68

Removing temporary output file temporary.
[Tue Mar 23 09:45:38 2021]
Finished job 68.
1 of 15 steps (7%) done

[Tue Mar 23 09:45:38 2021]
rule hisat2_groupmark_bam:
    input: analysis/align/combined.sorted.bam, tmp
    output: analysis/variants/combined.sorted.grouped.bam, analysis/variants/combined.sorted.grouped.bam.bai, analysis/variants/combined.sorted.grouped.marked.bam, analysis/variants/combined.sorted.grouped.marked.bam.bai, analysis/variants/combined.sorted.grouped.marked.metrics
    log: analysis/variants/combined.sorted.grouped.marked.log
    jobid: 16
    benchmark: analysis/variants/combined.sorted.grouped.marked.benchmark
    wildcards: dir=analysis
    resources: mem_mb=16000

[Tue Mar 23 09:45:50 2021]
Finished job 72.
2 of 15 steps (13%) done
Removing temporary output file SnpEff_4.3_SmithChemWisc_v2.zip.
[Tue Mar 23 09:46:31 2021]
Finished job 5.
3 of 15 steps (20%) done

[Tue Mar 23 09:46:31 2021]
rule generate_reference_snpeff_database:
    input: SnpEff/snpEff.jar, data/ensembl/Homo_sapiens.GRCh38.97.gff3, data/ensembl/Homo_sapiens.GRCh38.pep.all.fa, data/ensembl/Homo_sapiens.GRCh38.dna.primary_assembly.karyotypic.fa
    output: SnpEff/data/Homo_sapiens.GRCh38/protein.fa, SnpEff/data/Homo_sapiens.GRCh38/genes.gff, SnpEff/data/genomes/Homo_sapiens.GRCh38.fa, SnpEff/data/Homo_sapiens.GRCh38/doneHomo_sapiens.GRCh38.txt
    log: SnpEff/data/Homo_sapiens.GRCh38/snpeffdatabase.log
    jobid: 4
    benchmark: SnpEff/data/Homo_sapiens.GRCh38/snpeffdatabase.benchmark
    resources: mem_mb=16000

[Tue Mar 23 09:50:32 2021]
Finished job 4.
4 of 15 steps (27%) done

[Tue Mar 23 09:50:32 2021]
rule reference_protein_xml:
    input: SnpEff/data/Homo_sapiens.GRCh38/doneHomo_sapiens.GRCh38.txt, SnpEff/snpEff.jar, data/ensembl/Homo_sapiens.GRCh38.dna.primary_assembly.karyotypic.fa, TransferUniProtModifications/TransferUniProtModifications/bin/Release/netcoreapp3.1/TransferUniProtModifications.dll, data/uniprot/Homo_sapiens.protein.xml.gz
    output: analysis/variants/doneHomo_sapiens.GRCh38.97.txt, analysis/variants/Homo_sapiens.GRCh38.97.protein.xml, analysis/variants/Homo_sapiens.GRCh38.97.protein.xml.gz, analysis/variants/Homo_sapiens.GRCh38.97.protein.fasta, analysis/variants/Homo_sapiens.GRCh38.97.protein.withdecoys.fasta, analysis/variants/Homo_sapiens.GRCh38.97.protein.withmods.xml, analysis/variants/Homo_sapiens.GRCh38.97.protein.withmods.xml.gz
    log: analysis/variants/Homo_sapiens.GRCh38.97.spritz.log
    jobid: 74
    benchmark: analysis/variants/Homo_sapiens.GRCh38.97.spritz.benchmark
    wildcards: dir=analysis
    resources: mem_mb=16000

Removing temporary output file analysis/variants/Homo_sapiens.GRCh38.97.protein.xml.
Removing temporary output file analysis/variants/Homo_sapiens.GRCh38.97.protein.withmods.xml.
[Tue Mar 23 10:08:21 2021]
Finished job 74.
5 of 15 steps (33%) done
time="2021-03-23T15:01:58+01:00" level=error msg="error waiting for container: invalid character 'u' looking for beginning of value"
Done!


@acesnik
Collaborator

acesnik commented Mar 26, 2021

Hmm, I've never seen that one before. Waiting for 'u' is a particularly odd error message, too!

It seems like this is a Docker issue that's still open: docker/for-mac#5139, dealing with the VM freezing. Have you gotten this error multiple times?

@animesh
Author

animesh commented Mar 27, 2021

Yes, it comes and goes... When I restart, it seems to get stuck in the following steps, and it has never made it to the end of this data. I can probably try to ditch Docker and run natively? Is there some "dryrun" output of Spritz that I can go through one step at a time?

Using default tag: latest
latest: Pulling from smithlab/spritz
Digest: sha256:55172c3a6e32257f977c9512e473f647eaeab32e35b6d96341598d6b96f97615
Status: Image is up to date for smithlab/spritz:latest
docker.io/smithlab/spritz:latest
Building DAG of jobs...
Using shell: /bin/bash
Provided cores: 12
Rules claiming more threads will be scaled down.
Conda environments: ignored
Job counts:
	count	jobs
	1	all
	1	base_recalibration
	1	build_transfer_mods
	1	call_gvcf_varaints
	1	call_vcf_variants
	1	copy_gff3_to_snpeff
	1	custom_protein_xml
	1	download_snpeff
	1	final_vcf_naming
	1	finish_isoform
	1	finish_isoform_variants
	1	finish_variants
	1	generate_reference_snpeff_database
	1	generate_snpeff_database
	1	hisat2_groupmark_bam
	1	reference_protein_xml
	1	split_n_cigar_reads
	1	tmpdir
	1	transfer_modifications_isoformvariant
	1	transfer_modifications_variant
	1	variant_annotation_custom
	1	variant_annotation_ref
	22

[Fri Mar 26 10:06:55 2021]
rule tmpdir:
    output: tmp, temporary
    log: data/tmpdir.log
    jobid: 68

[Fri Mar 26 10:06:55 2021]
rule build_transfer_mods:
    output: TransferUniProtModifications/TransferUniProtModifications/bin/Release/netcoreapp3.1/TransferUniProtModifications.dll
    log: data/TransferUniProtModifications.build.log
    jobid: 72
    benchmark: data/TransferUniProtModifications.build.benchmark

[Fri Mar 26 10:06:55 2021]
rule download_snpeff:
    output: SnpEff/snpEff.config, SnpEff/snpEff.jar, SnpEff_4.3_SmithChemWisc_v2.zip
    log: data/SnpEffInstall.log
    jobid: 5

Removing temporary output file temporary.
[Fri Mar 26 10:06:55 2021]
Finished job 68.
1 of 22 steps (5%) done

[Fri Mar 26 10:06:55 2021]
rule copy_gff3_to_snpeff:
    input: analysis/isoforms/combined.transcripts.genome.cds.gff3
    output: SnpEff/data/combined.transcripts.genome.gff3/genes.gff
    log: SnpEff/data/combined.transcripts.genome.gff3/copy_gff3_to_snpeff.log
    jobid: 77

[Fri Mar 26 10:06:55 2021]
rule hisat2_groupmark_bam:
    input: analysis/align/combined.sorted.bam, tmp
    output: analysis/variants/combined.sorted.grouped.bam, analysis/variants/combined.sorted.grouped.bam.bai, analysis/variants/combined.sorted.grouped.marked.bam, analysis/variants/combined.sorted.grouped.marked.bam.bai, analysis/variants/combined.sorted.grouped.marked.metrics
    log: analysis/variants/combined.sorted.grouped.marked.log
    jobid: 16
    benchmark: analysis/variants/combined.sorted.grouped.marked.benchmark
    wildcards: dir=analysis
    resources: mem_mb=16000

[Fri Mar 26 10:06:57 2021]
Finished job 77.
2 of 22 steps (9%) done
[Fri Mar 26 10:07:06 2021]
Finished job 72.
3 of 22 steps (14%) done
Removing temporary output file SnpEff_4.3_SmithChemWisc_v2.zip.
[Fri Mar 26 10:07:53 2021]
Finished job 5.
4 of 22 steps (18%) done

[Fri Mar 26 10:07:53 2021]
rule generate_reference_snpeff_database:
    input: SnpEff/snpEff.jar, data/ensembl/Homo_sapiens.GRCh38.97.gff3, data/ensembl/Homo_sapiens.GRCh38.pep.all.fa, data/ensembl/Homo_sapiens.GRCh38.dna.primary_assembly.karyotypic.fa
    output: SnpEff/data/Homo_sapiens.GRCh38/protein.fa, SnpEff/data/Homo_sapiens.GRCh38/genes.gff, SnpEff/data/genomes/Homo_sapiens.GRCh38.fa, SnpEff/data/Homo_sapiens.GRCh38/doneHomo_sapiens.GRCh38.txt
    log: SnpEff/data/Homo_sapiens.GRCh38/snpeffdatabase.log
    jobid: 4
    benchmark: SnpEff/data/Homo_sapiens.GRCh38/snpeffdatabase.benchmark
    resources: mem_mb=16000

[Fri Mar 26 10:07:53 2021]
rule generate_snpeff_database:
    input: SnpEff/snpEff.jar, SnpEff/data/combined.transcripts.genome.gff3/genes.gff, data/ensembl/Homo_sapiens.GRCh38.pep.all.fa, data/ensembl/Homo_sapiens.GRCh38.dna.primary_assembly.karyotypic.fa
    output: SnpEff/data/combined.transcripts.genome.gff3/protein.fa, SnpEff/data/genomes/combined.transcripts.genome.gff3.fa, SnpEff/data/combined.transcripts.genome.gff3/done.txt
    log: SnpEff/data/combined.transcripts.genome.gff3/snpeffdatabase.log
    jobid: 111
    benchmark: SnpEff/data/combined.transcripts.genome.gff3/snpeffdatabase.benchmark
    resources: mem_mb=16000

[Fri Mar 26 10:10:40 2021]
Finished job 111.
5 of 22 steps (23%) done

[Fri Mar 26 10:10:40 2021]
rule custom_protein_xml:
    input: SnpEff/snpEff.jar, data/ensembl/Homo_sapiens.GRCh38.dna.primary_assembly.karyotypic.fa, SnpEff/data/combined.transcripts.genome.gff3/genes.gff, SnpEff/data/combined.transcripts.genome.gff3/protein.fa, SnpEff/data/genomes/combined.transcripts.genome.gff3.fa, SnpEff/data/combined.transcripts.genome.gff3/done.txt, TransferUniProtModifications/TransferUniProtModifications/bin/Release/netcoreapp3.1/TransferUniProtModifications.dll, data/uniprot/Homo_sapiens.protein.xml.gz
    output: analysis/isoforms/combined.spritz.isoform.protein.xml, analysis/isoforms/combined.spritz.isoform.protein.withdecoys.fasta, analysis/isoforms/combined.spritz.isoform.protein.xml.gz, analysis/isoforms/combined.spritz.isoform.protein.withmods.xml, analysis/isoforms/combined.spritz.isoform.protein.withmods.xml.gz, analysis/isoforms/combined.spritz.isoform.protein.fasta
    log: analysis/isoforms/combined.spritz.isoform.log
    jobid: 114
    benchmark: analysis/isoforms/combined.spritz.isoform.benchmark
    wildcards: dir=analysis
    resources: mem_mb=16000

[Fri Mar 26 10:12:46 2021]
Finished job 4.
6 of 22 steps (27%) done

[Fri Mar 26 10:12:46 2021]
rule reference_protein_xml:
    input: SnpEff/data/Homo_sapiens.GRCh38/doneHomo_sapiens.GRCh38.txt, SnpEff/snpEff.jar, data/ensembl/Homo_sapiens.GRCh38.dna.primary_assembly.karyotypic.fa, TransferUniProtModifications/TransferUniProtModifications/bin/Release/netcoreapp3.1/TransferUniProtModifications.dll, data/uniprot/Homo_sapiens.protein.xml.gz
    output: analysis/variants/doneHomo_sapiens.GRCh38.97.txt, analysis/variants/Homo_sapiens.GRCh38.97.protein.xml, analysis/variants/Homo_sapiens.GRCh38.97.protein.xml.gz, analysis/variants/Homo_sapiens.GRCh38.97.protein.fasta, analysis/variants/Homo_sapiens.GRCh38.97.protein.withdecoys.fasta, analysis/variants/Homo_sapiens.GRCh38.97.protein.withmods.xml, analysis/variants/Homo_sapiens.GRCh38.97.protein.withmods.xml.gz
    log: analysis/variants/Homo_sapiens.GRCh38.97.spritz.log
    jobid: 74
    benchmark: analysis/variants/Homo_sapiens.GRCh38.97.spritz.benchmark
    wildcards: dir=analysis
    resources: mem_mb=16000

Removing temporary output file analysis/isoforms/combined.spritz.isoform.protein.xml.
Removing temporary output file analysis/isoforms/combined.spritz.isoform.protein.withmods.xml.
[Fri Mar 26 10:15:16 2021]
Finished job 114.
7 of 22 steps (32%) done

[Fri Mar 26 10:15:16 2021]
rule finish_isoform:
    input: analysis/isoforms/combined.spritz.isoform.protein.fasta, analysis/isoforms/combined.spritz.isoform.protein.withdecoys.fasta, analysis/isoforms/combined.spritz.isoform.protein.withmods.xml.gz
    output: analysis/final/combined.spritz.isoform.protein.fasta, analysis/final/combined.spritz.isoform.protein.withdecoys.fasta, analysis/final/combined.spritz.isoform.protein.withmods.xml.gz
    log: analysis/isoforms/finish_isoform.log
    jobid: 113
    wildcards: dir=analysis

[Fri Mar 26 10:15:17 2021]
Finished job 113.
8 of 22 steps (36%) done
Removing temporary output file analysis/variants/Homo_sapiens.GRCh38.97.protein.xml.
Removing temporary output file analysis/variants/Homo_sapiens.GRCh38.97.protein.withmods.xml.
[Fri Mar 26 10:19:50 2021]
Finished job 74.
9 of 22 steps (41%) done

@acesnik
Collaborator

acesnik commented Mar 27, 2021

Sorry about Docker being flakey. That's frustrating.

For the time being, I'd recommend using the commandline version, https://github.com/smith-chem-wisc/Spritz/wiki/Spritz-commandline-usage. You should be able to specify the directory you're using for the analysis in the "analysisDirectory" config.yaml specification.

@animesh
Author

animesh commented Mar 28, 2021

Thanks @acesnik, just to confirm before I proceed: I see there is a config.yaml, which I am guessing is created by the GUI?

$find . -iname config.yaml
./configs/config.yaml

$cat ./configs/config.yaml
version: 1
sra: []
sra_se: []
fq: [22286_CGATGT_C5E7AANXX_5_20141008B_20141008.bam., 22287_TGACCA_C5E7AANXX_5_20141008B_20141008.bam., 22288_ACAGTG_C5E7AANXX_5_20141008B_20141008.bam., 22289_GCCAAT_C5E7AANXX_5_20141008B_20141008.bam., 22290_CAGATC_C5E7AANXX_5_20141008B_20141008.bam., 22291_CTTGTA_C5E7AANXX_5_20141008B_20141008.bam., 22292_AGTCAA_C5E7AANXX_5_20141008B_20141008.bam., 22293_AGTTCC_C5E7AANXX_5_20141008B_20141008.bam., 22294_ATGTCA_C5E7AANXX_6_20141008B_20141008.bam., 22295_CCGTCC_C5E7AANXX_6_20141008B_20141008.bam., 22296_GTCCGC_C5E7AANXX_6_20141008B_20141008.bam., 22297_GTGAAA_C5E7AANXX_6_20141008B_20141008.bam., 22298_ATCACG_C5E7AANXX_6_20141008B_20141008.bam., 22299_TTAGGC_C5E7AANXX_6_20141008B_20141008.bam., 22300_ACTTGA_C5E7AANXX_6_20141008B_20141008.bam., 22301_GATCAG_C5E7AANXX_6_20141008B_20141008.bam., 22302_TAGCTT_C5E7AANXX_7_20141008B_20141008.bam., 22303_GGCTAC_C5E7AANXX_7_20141008B_20141008.bam., 22304_GTGGCC_C5E7AANXX_7_20141008B_20141008.bam., 22305_GTTTCG_C5E7AANXX_7_20141008B_20141008.bam., 22306_CGTACG_C5E7AANXX_7_20141008B_20141008.bam., 22307_GAGTGG_C5E7AANXX_7_20141008B_20141008.bam., 22308_ACTGAT_C5E7AANXX_7_20141008B_20141008.bam., 22309_ATTCCT_C5E7AANXX_7_20141008B_20141008.bam.]
fq_se: []
analysisDirectory: [analysis]
release: "97"
species: "Homo_sapiens"
organism: "human"
genome: "GRCh38"
analyses: [variant, isoform]
spritzversion: "0.2.4"
...

I am wondering if I need to change analysisDirectory: [analysis] to analysisDirectory: [$PWD], and whether I can go beyond 12 threads, which seemed to be the limit in the GUI, with the following invocation from $PWD. Specifically, will this restart the process from where the GUI left off? snakemake -j 24 --resources mem_mb=64000

@acesnik
Collaborator

acesnik commented Mar 29, 2021

The analysisDirectory in this case should be the absolute path to the directory that has the FASTQs in it. This should be a different directory than the one with the Snakefile, which is where you will run the snakemake command.

The snakemake command you listed looks good.
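The two conditions described above (an absolute analysisDirectory, and one that differs from the Snakefile directory) can be sketched as a small pre-flight check. This is only an illustration, not part of Spritz: the function name and the example paths are hypothetical, and POSIX-style paths are assumed because the runs in this thread are under WSL/Linux.

```python
import posixpath


def check_analysis_directory(analysis_dir, snakefile_dir):
    """Return a list of problems with an analysisDirectory value (empty if none).

    Hypothetical helper: it only encodes the two conditions stated in the
    comment above and does not read config.yaml itself.
    """
    problems = []
    if not posixpath.isabs(analysis_dir):
        problems.append(f"analysisDirectory {analysis_dir!r} is not an absolute path")
    elif posixpath.normpath(analysis_dir) == posixpath.normpath(snakefile_dir):
        problems.append("analysisDirectory is the same directory as the Snakefile")
    return problems


# Hypothetical paths, modeled on the ones that appear in this thread.
print(check_analysis_directory("analysis", "/home/user/Spritz/Spritz"))
print(check_analysis_directory("/home/user/rnAGS", "/home/user/Spritz/Spritz"))
```

An empty list means both conditions hold; the relative value `analysis` written by the GUI would be reported as not absolute.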

@animesh
Author

animesh commented Mar 29, 2021

Thanks @acesnik for looking into this 👍🏼 but I am not sure where to run the snakemake command from? I tried to find the makefile

(spritz) animeshs@DMED7596:~/rnAGS$ find . -iname "*snake*"

or the workflow

(spritz) animeshs@DMED7596:~/rnAGS$ find . -iname "*work*"
./workflow_2021-01-28-12-29-46.txt
./workflow_2021-01-29-10-20-21.txt
./workflow_2021-01-31-13-16-21.txt
./workflow_2021-01-31-17-27-10.txt
./workflow_2021-02-02-11-09-48.txt
./workflow_2021-02-04-12-46-07.txt
./workflow_2021-02-04-13-20-47.txt
./workflow_2021-02-06-14-07-34.txt
./workflow_2021-02-06-18-14-16.txt
./workflow_2021-02-08-09-00-33.txt
./workflow_2021-02-10-10-56-29.txt
./workflow_2021-02-15-15-49-22.txt
./workflow_2021-02-23-12-32-18.txt
./workflow_2021-02-25-11-34-18.txt
./workflow_2021-02-25-17-52-46.txt
./workflow_2021-03-02-11-21-18.txt
./workflow_2021-03-07-14-33-17.txt
./workflow_2021-03-08-17-28-24.txt
./workflow_2021-03-19-11-57-04.txt
./workflow_2021-03-23-10-45-25.txt
./workflow_2021-03-26-11-06-45.txt

without success. Any ideas where it might be, or which directory to run the command from?

@acesnik
Collaborator

acesnik commented Mar 29, 2021

Try searching for the Snakefile and run it from that directory. It looks like you got the spritz environment set up, so that's good! The environment.yaml file is in the same folder as the Snakefile, and that folder is the working directory.

@acesnik
Collaborator

acesnik commented Mar 29, 2021

In other words, assuming your git clone is named Spritz, you should be able to run it in Spritz/Spritz where the Snakefile is located. https://github.com/smith-chem-wisc/Spritz/tree/master/Spritz

@animesh
Author

animesh commented Mar 29, 2021

Looks like (spritz) animeshs@DMED7596:~/Spritz/Spritz$ snakemake -j 24 --resources mem_mb=64000 1>log.1.txt 2>log.2.txt 0>log.0.txt >> log.txt & seems to have worked, but it didn't seem to start from where it left off in the GUI version (attached log log.2.txt)? Also, there is an error-like message in the log:

(base) animeshs@DMED7596:~$ tail -f Spritz/Spritz/log.2.txt
Write-protected output files for rule reference_protein_xml:
/home/animeshs/rnAGS/variants/doneHomo_sapiens.GRCh38.97.txt
/home/animeshs/rnAGS/variants/Homo_sapiens.GRCh38.97.protein.xml.gz
/home/animeshs/rnAGS/variants/Homo_sapiens.GRCh38.97.protein.fasta
/home/animeshs/rnAGS/variants/Homo_sapiens.GRCh38.97.protein.withdecoys.fasta
/home/animeshs/rnAGS/variants/Homo_sapiens.GRCh38.97.protein.withmods.xml.gz
  File "/home/animeshs/miniconda3/envs/spritz/lib/python3.8/site-packages/snakemake/executors/__init__.py", line 131, in run_jobs
  File "/home/animeshs/miniconda3/envs/spritz/lib/python3.8/site-packages/snakemake/executors/__init__.py", line 433, in run
  File "/home/animeshs/miniconda3/envs/spritz/lib/python3.8/site-packages/snakemake/executors/__init__.py", line 225, in _run
  File "/home/animeshs/miniconda3/envs/spritz/lib/python3.8/site-packages/snakemake/executors/__init__.py", line 150, in _run

but htop shows that things are running?
image
Wondering if I set things up correctly in the first place, though?

@acesnik
Collaborator

acesnik commented Mar 29, 2021

Oh, I see. I think it started from an earlier point than where you left off because the resources that were downloaded into the Docker container (genome, gene model, etc.) were downloaded again, so the timestamps of those resources are later than those of the outputs you had already produced...

Regarding the files that are write-protected, I wonder if there are any hanging docker containers. You can see if there are any still running with docker container ls.

I think the best thing to do now would be just to let it run, or to start it again with the --keep-going flag, i.e. ~/Spritz/Spritz$ snakemake -j 24 --keep-going --resources mem_mb=64000 1>log.1.txt 2>log.2.txt 0>log.0.txt >> log.txt since you got an error with the reference_protein_xml database rule. That flag will let it go as far as it can towards the other final databases and ignore that reference_protein_xml write error if it keeps popping up.
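For reference, the same invocation with stdout and stderr captured in separate files can be sketched in Python. Snakemake writes its progress messages to stderr, which is consistent with log.2.txt being the file that fills up in this thread. The helper names below are hypothetical, and a stand-in child process replaces snakemake so the sketch runs anywhere.

```python
import os
import subprocess
import sys
import tempfile


def snakemake_command(cores, mem_mb, keep_going=True):
    """Build the argument list for the snakemake invocation quoted above."""
    cmd = ["snakemake", "-j", str(cores)]
    if keep_going:
        cmd.append("--keep-going")
    cmd += ["--resources", f"mem_mb={mem_mb}"]
    return cmd


def run_with_logs(cmd, stdout_path, stderr_path):
    """Run cmd with stdout and stderr sent to separate files; return the exit code."""
    with open(stdout_path, "w") as out, open(stderr_path, "w") as err:
        return subprocess.run(cmd, stdout=out, stderr=err).returncode


print(snakemake_command(24, 64000))

# Stand-in child process (not snakemake) that writes one line to each stream.
demo = [sys.executable, "-c",
        "import sys; print('to stdout'); print('to stderr', file=sys.stderr)"]
with tempfile.TemporaryDirectory() as tmp:
    out_path = os.path.join(tmp, "log.1.txt")
    err_path = os.path.join(tmp, "log.2.txt")
    exit_code = run_with_logs(demo, out_path, err_path)
    with open(err_path) as handle:
        stderr_text = handle.read()
```

To run the real workflow, `run_with_logs(snakemake_command(24, 64000), "log.1.txt", "log.2.txt")` would be called from the directory that contains the Snakefile.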

@animesh
Author

animesh commented Mar 29, 2021

I rebooted the machine, so Docker is out. At least, that's what I get when I run:

C:\Users\animeshs\GD\scripts>docker container ls
error during connect: This error may indicate that the docker daemon is not running.: Get http://%2F%2F.%2Fpipe%2Fdocker_engine/v1.24/containers/json: open //./pipe/docker_engine: The system cannot find the file specified.

BTW, the last command crashed with the message in
log.2.txt. Now trying the --keep-going flag, keeping fingers crossed...

@acesnik
Collaborator

acesnik commented Mar 29, 2021

Okay! Hoping for the best! Thanks for the patience with this one!

@acesnik
Collaborator

acesnik commented Apr 6, 2021

Did the rest of the run go okay for you?

@animesh
Author

animesh commented Apr 6, 2021

It crashed with that "Write-protected" error (log.2.txt). I tried chown -R, which didn't work. I probably need to restart the machine, but I'm waiting for some other work to finish first... Is there any way to get past this without restarting?

@acesnik
Collaborator

acesnik commented Apr 6, 2021

Is it possible to remove these files manually?

rm -f /home/animeshs/rnAGS/variants/doneHomo_sapiens.GRCh38.97.txt
rm -f /home/animeshs/rnAGS/variants/Homo_sapiens.GRCh38.97.protein.xml.gz
rm -f /home/animeshs/rnAGS/variants/Homo_sapiens.GRCh38.97.protein.fasta
rm -f /home/animeshs/rnAGS/variants/Homo_sapiens.GRCh38.97.protein.withdecoys.fasta
rm -f /home/animeshs/rnAGS/variants/Homo_sapiens.GRCh38.97.protein.withmods.xml.gz
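Before deleting files one at a time, it may help to list every output that has lost its write permission. As far as I understand, Snakemake reports "Write-protected output files" when an existing output file is not writable, for example after a rule has marked it with protected(); whether that is what happened here is an assumption. The sketch below checks the owner write bit directly (rather than ownership, which is what chown changes); the function name is hypothetical and the demonstration uses a throwaway directory instead of the real analysis folder.

```python
import os
import stat
import tempfile


def find_write_protected(root):
    """Return sorted paths of regular files under root whose owner write bit is cleared."""
    protected = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            mode = os.stat(path).st_mode
            if stat.S_ISREG(mode) and not mode & stat.S_IWUSR:
                protected.append(path)
    return sorted(protected)


# Demonstration on a temporary directory with one read-only file.
with tempfile.TemporaryDirectory() as tmp:
    locked = os.path.join(tmp, "locked.bam")
    normal = os.path.join(tmp, "normal.bam")
    for path in (locked, normal):
        open(path, "w").close()
    os.chmod(locked, stat.S_IRUSR | stat.S_IRGRP | stat.S_IROTH)  # like chmod 444
    found = find_write_protected(tmp)
    print(found)
    os.chmod(locked, stat.S_IRUSR | stat.S_IWUSR)  # restore so cleanup succeeds everywhere
```

Pointing `find_write_protected` at the analysis directory would show all affected files at once, which can then be removed or made writable again in one pass.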

@animesh
Author

animesh commented Apr 6, 2021

Now it's blaming "Write-protected output files for rule hisat2_align_bam_fq:
/home/animeshs/rnAGS/align/22308_ACTGAT_C5E7AANXX_7_20141008B_20141008.bam..fq.sorted.bam". Below is the invocation and log:

(spritz) animeshs@DMED7596:~/Spritz/Spritz$ snakemake -j 24 --keep-going --resources mem_mb=64000 1>log.1.txt 2>log.2.txt 0>log.0.txt >> log.txt &
[1] 23965
(spritz) animeshs@DMED7596:~/Spritz/Spritz$ tail -f log.*
==> log.0.txt <==

==> log.1.txt <==

==> log.2.txt <==
Building DAG of jobs...

==> log.txt <==

==> log.2.txt <==
Using shell: /bin/bash
Provided cores: 24
Rules claiming more threads will be scaled down.
Provided resources: mem_mb=64000
Conda environments: ignored
Job counts:
        count   jobs
        1       LongOrfs
        1       Predict
        1       all
        24      assemble_transcripts_fq
        1       base_recalibration
        1       blastp
        1       call_gvcf_varaints
        1       call_vcf_variants
        1       cdna_alignment_orf_to_genome_orf
        1       copy_gff3_to_snpeff
        1       custom_protein_xml
        1       final_vcf_naming
        1       finish_isoform
        1       finish_isoform_variants
        1       finish_variants
        1       generate_snpeff_database
        1       gtf_file_to_cDNA_seqs
        1       gtf_to_alignment_gff3
        24      hisat2_align_bam_fq
        1       hisat2_groupmark_bam
        1       hisat2_merge_bams
        1       merge_transcripts
        1       reference_protein_xml
        1       remove_exon_and_utr_information
        1       split_n_cigar_reads
        1       transfer_modifications_isoformvariant
        1       transfer_modifications_variant
        1       variant_annotation_custom
        1       variant_annotation_ref
        75
ProtectedOutputException in line 202 of /home/animeshs/Spritz/Spritz/rules/align.smk:
Write-protected output files for rule hisat2_align_bam_fq:
/home/animeshs/rnAGS/align/22308_ACTGAT_C5E7AANXX_7_20141008B_20141008.bam..fq.sorted.bam
  File "/home/animeshs/miniconda3/envs/spritz/lib/python3.8/site-packages/snakemake/executors/__init__.py", line 131, in run_jobs
  File "/home/animeshs/miniconda3/envs/spritz/lib/python3.8/site-packages/snakemake/executors/__init__.py", line 433, in run
  File "/home/animeshs/miniconda3/envs/spritz/lib/python3.8/site-packages/snakemake/executors/__init__.py", line 225, in _run
  File "/home/animeshs/miniconda3/envs/spritz/lib/python3.8/site-packages/snakemake/executors/__init__.py", line 150, in _run

@acesnik
Collaborator

acesnik commented Apr 6, 2021

I've never seen this before. Are the other snakemake processes still hanging around for some reason?

@acesnik
Collaborator

acesnik commented Apr 6, 2021

ps aux | grep snakemake

@animesh
Author

animesh commented Apr 7, 2021

There are a couple of other processes, but I think these are unrelated?

(spritz) animeshs@DMED7596:~/Spritz/Spritz$ ps aux | grep snakemake
animeshs  1959  0.0  0.0   4636     0 pts/1    S+   Mar29   0:00 /bin/sh -c snakemake --snakefile /home/animeshs/miniconda3/envs/atlas/lib/python3.6/site-packages/atlas/Snakefile --directory /mnt/z/ayu  --rerun-incomplete --configfile '/mnt/z/ayu/config.yaml' --nolock   --use-conda --conda-prefix /mnt/z/ayu/databases/conda_envs   all
animeshs  1960  3.8  0.0 1895004    4 pts/1    Sl+  Mar29 491:32 /home/animeshs/miniconda3/envs/atlas/bin/python3.6 /home/animeshs/miniconda3/envs/atlas/bin/snakemake --snakefile /home/animeshs/miniconda3/envs/atlas/lib/python3.6/site-packages/atlas/Snakefile --directory /mnt/z/ayu --rerun-incomplete --configfile /mnt/z/ayu/config.yaml --nolock --use-conda --conda-prefix /mnt/z/ayu/databases/conda_envs all
animeshs 24722  0.0  0.0  14872  2252 pts/2    S+   09:24   0:00 grep --color=auto snakemake

@acesnik
Collaborator

acesnik commented Apr 7, 2021

Hmm, yeah that looks like it's from a pipeline named atlas. If you named the spritz environment atlas instead, you might be able to cancel those.

@animesh
Author

animesh commented Apr 7, 2021

So you think the conda envs are cross-contaminating?

@acesnik
Collaborator

acesnik commented Apr 7, 2021

No, I don't think they're colliding. I really don't know why those output files are write-protected. That's pretty mysterious! I do think restarting after the other runs finish is a good idea.

@animesh
Author

animesh commented Apr 7, 2021

Yes that is the plan, will get back once through 👍🏼

@animesh
Author

animesh commented Apr 13, 2021

I eventually ended up deleting the whole /home/animeshs/rnAGS/align/ and restarting, but then it fails with Error in rule reference_protein_xml (log.2.txt). Digging further into Homo_sapiens.GRCh38.97.spritz.log shows dotnet's System.NullReferenceException. Any ideas what might be a way forward? Do I need to install mono in WSL?

@acesnik
Collaborator

acesnik commented Apr 21, 2021

Are you using version 0.2.4? That looks familiar from errors we were getting in v0.2.3.

@acesnik
Collaborator

acesnik commented Apr 21, 2021

Mono won't help here, as it only works with .NET Framework applications, and this targets .NET Core.

@animesh
Author

animesh commented Apr 22, 2021

Does just running conda update --all bring Spritz to the latest version? What is the best way to check the version? The git log shows:

commit d48529ec60331be1875ce8376bbeb8cc9b426b9b (HEAD -> master, origin/master, origin/HEAD)
Author: Anthony <cesnik@wisc.edu>
Date:   Thu Mar 18 13:05:41 2021 -0500

    Use sra-tools 2.10.1 for prefetch/fastq-dump (#209)

    * add openssh

    * use 2.10.1 for sra-tools

@acesnik
Collaborator

acesnik commented Apr 22, 2021

That is the latest commit. Thanks for checking!

I'll look into this more later today.

@acesnik
Collaborator

acesnik commented May 3, 2021

Sorry for the delay on this. I took a look but didn't get very far. I'm traveling now, so I'll be able to take a closer look in a couple weeks.

@animesh
Author

animesh commented Aug 7, 2021

It looks like it went past that issue, but it does say "Job failed, going on with independent jobs"? I was not sure which log is relevant, so I have tar-gzipped them:
log.zip
Let me know if it works. Could it be because I bumped the genome version to 100, from the 97/98 I used earlier, when re-running the pipeline?

@acesnik
Collaborator

acesnik commented Aug 9, 2021

Interesting. It looks like some of the transcript assemblies are failing to build.

Could you send me the log files at /home/animeshs/rnAGS/*/*.log? For example, it looks like /home/animeshs/rnAGS/isoforms/22289_GCCAAT_C5E7AANXX_5_20141008B_20141008.bam..fq.sorted.gtf.log is one for a failed run.

@acesnik
Collaborator

acesnik commented Aug 9, 2021

I don't think the issue is from bumping the genome version. I've seen these issues before when the aligned read counts are low for some reason.

@acesnik
Collaborator

acesnik commented Aug 9, 2021

You could also look at some of the log files from aligning these files to see if that's the case, e.g. if there are low read counts. In a typical experiment, I would expect >80% or >90% of reads to be aligned. I've seen these types of errors when there are <20% aligned, for example, which points to an alignment issue.
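One way to check this across many samples is to pull the "overall alignment rate" line out of each HISAT2 summary. The sketch below assumes that the summary ends with a line of the form `NN.NN% overall alignment rate`; the function names are hypothetical, the sample text is made up for illustration (it is not taken from this run), and the 20% cutoff simply mirrors the figure mentioned above.

```python
import re


def overall_alignment_rate(log_text):
    """Return the overall alignment rate in percent, or None if the line is absent."""
    match = re.search(r"([\d.]+)% overall alignment rate", log_text)
    return float(match.group(1)) if match else None


def looks_poorly_aligned(log_text, threshold=20.0):
    """True when a rate is present and falls below the threshold (in percent)."""
    rate = overall_alignment_rate(log_text)
    return rate is not None and rate < threshold


# Made-up summary text, for illustration only.
sample = """\
1000 reads; of these:
  1000 (100.00%) were paired; of these:
15.30% overall alignment rate
"""

print(overall_alignment_rate(sample))
print(looks_poorly_aligned(sample))
```

Reading each `*.hisat2.log` file and passing its contents to `overall_alignment_rate` would give one number per sample to compare against the expected >80%.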

@acesnik
Collaborator

acesnik commented Aug 9, 2021

If spritz cannot detect isoforms with stringtie, I'd recommend just finishing the pipeline with the variant analysis. That is, you can try removing isoform from the config file's requested analyses to see if it finishes the job.
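As an illustration of that config change, the sketch below rewrites the `analyses:` line of a config.yaml given as text. It is regex-based and assumes the inline list form shown earlier in this thread (`analyses: [variant, isoform]`); a real YAML parser would be more robust, and the function name is hypothetical. Editing the line by hand in an editor achieves the same thing.

```python
import re


def drop_analysis(config_text, name):
    """Remove one entry from an inline 'analyses: [...]' list in config text."""

    def rewrite(match):
        items = [item.strip() for item in match.group(2).split(",") if item.strip()]
        kept = [item for item in items if item != name]
        return f"{match.group(1)}[{', '.join(kept)}]"

    return re.sub(r"^(analyses:\s*)\[(.*?)\]", rewrite, config_text, flags=re.MULTILINE)


# Excerpt modeled on the config.yaml posted earlier in this thread.
config = 'genome: "GRCh38"\nanalyses: [variant, isoform]\nspritzversion: "0.2.4"\n'
updated = drop_analysis(config, "isoform")
print(updated)
```

Only the `analyses:` line changes; every other line of the config text is passed through untouched.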

@animesh
Author

animesh commented Aug 9, 2021

OK, I have disabled the isoform analysis and reran the workflow:

(spritzbase) animeshs@DMED7596:~/Spritz/Spritz/workflow$ vim config/config.yaml
(spritzbase) animeshs@DMED7596:~/Spritz/Spritz/workflow$ snakemake -j 12 --keep-going --resources mem_mb=64000 --use-conda --conda-frontend mamba
Building DAG of jobs...
Using shell: /bin/bash
Provided cores: 12
Rules claiming more threads will be scaled down.
Provided resources: mem_mb=64000
Job counts:
        count   jobs
        1       all
        1       base_recalibration
        1       call_gvcf_varaints
        1       call_vcf_variants
        1       final_vcf_naming
        1       finish_variants
        1       reference_protein_xml
        1       split_n_cigar_reads
        1       transfer_modifications_variant
        1       variant_annotation_ref
        10

[Mon Aug  9 13:32:16 2021]
rule split_n_cigar_reads:
    input: /home/animeshs/rnAGS/variants/combined.sorted.grouped.marked.bam, ../resources/ensembl/Homo_sapiens.GRCh38.dna.primary_assembly.karyotypic.fa, ../resources/ensembl/Homo_sapiens.GRCh38.dna.primary_assembly.karyotypic.fa.fai, ../resources/ensembl/Homo_sapiens.GRCh38.dna.primary_assembly.karyotypic.dict, ../resources/tmp
    output: /home/animeshs/rnAGS/variants/combined.fixedQuals.bam, /home/animeshs/rnAGS/variants/combined.sorted.grouped.marked.split.bam, /home/animeshs/rnAGS/variants/combined.sorted.grouped.marked.split.bam.bai
    log: /home/animeshs/rnAGS/variants/combined.sorted.grouped.marked.split.log
    jobid: 20
    benchmark: /home/animeshs/rnAGS/variants/combined.sorted.grouped.marked.split.benchmark
    wildcards: dir=/home/animeshs/rnAGS
    resources: mem_mb=24000


[Mon Aug  9 13:32:16 2021]
rule reference_protein_xml:
    input: ptmlist.txt, PSI-MOD.obo.xml, ../resources/SnpEff/data/Homo_sapiens.GRCh38/doneHomo_sapiens.GRCh38.txt, ../resources/SnpEff/snpEff.jar, ../resources/ensembl/Homo_sapiens.GRCh38.dna.primary_assembly.karyotypic.fa, ../SpritzModifications/bin/x64/Release/net5.0/SpritzModifications.dll, ../resources/uniprot/Homo_sapiens.protein.xml.gz
    output: /home/animeshs/rnAGS/variants/doneHomo_sapiens.GRCh38.100.txt, /home/animeshs/rnAGS/variants/Homo_sapiens.GRCh38.100.protein.xml, /home/animeshs/rnAGS/variants/Homo_sapiens.GRCh38.100.protein.xml.gz, /home/animeshs/rnAGS/variants/Homo_sapiens.GRCh38.100.protein.fasta, /home/animeshs/rnAGS/variants/Homo_sapiens.GRCh38.100.protein.withdecoys.fasta, /home/animeshs/rnAGS/variants/Homo_sapiens.GRCh38.100.protein.withmods.xml, /home/animeshs/rnAGS/variants/Homo_sapiens.GRCh38.100.protein.withmods.xml.gz
    log: /home/animeshs/rnAGS/variants/Homo_sapiens.GRCh38.100.spritz.log
    jobid: 2
    benchmark: /home/animeshs/rnAGS/variants/Homo_sapiens.GRCh38.100.spritz.benchmark
    wildcards: dir=/home/animeshs/rnAGS
    resources: mem_mb=16000

Activating conda environment: /home/animeshs/Spritz/Spritz/workflow/.snakemake/conda/8d931524

but it crashed WSL itself...
BTW, cat /home/animeshs/rnAGS/isoforms/*bam..fq.sorted.gtf.log returned nothing; I guess that is because of the rerun? What is the best way to check the mapped read counts?

@acesnik
Collaborator

acesnik commented Aug 9, 2021

Thanks for giving that a try!

Could you tell me more about the WSL crash? If I remember correctly, you have 64 GB of RAM, so that's probably not the issue...

Thanks for the information about the *.sorted.gtf.log files being empty.

Please check on the files at /home/animeshs/rnAGS/align/*.hisat2.log to see information about the alignment rates.
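
With many samples, the overall rate from each log can be collected in one pass. This is a minimal sketch, assuming every log ends in .hisat2.log and contains the "overall alignment rate" line that HISAT2 prints; the function name is made up for illustration.

```shell
# Print "<log file>: <overall alignment rate>" for every HISAT2 log in a directory.
# The function name and the *.hisat2.log naming pattern are assumptions based on this thread.
summarize_alignment_rates() (
    dir=$1
    for log in "$dir"/*.hisat2.log; do
        [ -e "$log" ] || continue   # skip if the glob matched nothing
        # Take the first field of the first "overall alignment rate" line, e.g. "94.88%".
        rate=$(awk '/overall alignment rate/ {print $1; exit}' "$log")
        printf '%s: %s\n' "$(basename "$log")" "${rate:-not found}"
    done
)
```

It could be called as summarize_alignment_rates /home/animeshs/rnAGS/align.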

@animesh
Author

animesh commented Aug 9, 2021

Yes, I have 64 GB of RAM free, but it looks like the Snakemake directory lock was the issue: after first invoking snakemake with --unlock, a rerun with snakemake -j 12 --keep-going --resources mem_mb=64000 --use-conda --conda-frontend mamba is still going strong. Keeping my fingers crossed! BTW, here is one of the alignment logs:
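
For reference, the unlock-then-rerun sequence can be sketched as two separate calls. The flags are the ones used in this thread; the function name and the SNAKEMAKE override variable are hypothetical and only there so another executable can be substituted.

```shell
# Clear a stale Snakemake lock, then start the run with the flags used in this thread.
# SNAKEMAKE is a hypothetical override variable; by default the snakemake on PATH is used.
unlock_and_run() {
    # Step 1: remove the lock left behind by an interrupted run.
    "${SNAKEMAKE:-snakemake}" -j 1 --unlock || return 1
    # Step 2: the actual run.
    "${SNAKEMAKE:-snakemake}" -j 12 --keep-going --resources mem_mb=64000 \
        --use-conda --conda-frontend mamba
}
```

This would be run from the workflow directory, since the lock lives in its .snakemake folder.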

(base) animeshs@DMED7596:~$ cat /home/animeshs/rnAGS/align/22289_GCCAAT_C5E7AANXX_5_20141008B_20141008.bam..fq.hisat2.log
34632639 reads; of these:
  34632639 (100.00%) were paired; of these:
    2271032 (6.56%) aligned concordantly 0 times
    29827102 (86.12%) aligned concordantly exactly 1 time
    2534505 (7.32%) aligned concordantly >1 times
    ----
    2271032 pairs aligned concordantly 0 times; of these:
      167458 (7.37%) aligned discordantly 1 time
    ----
    2103574 pairs aligned 0 times concordantly or discordantly; of these:
      4207148 mates make up the pairs; of these:
        3544017 (84.24%) aligned 0 times
        513094 (12.20%) aligned exactly 1 time
        150037 (3.57%) aligned >1 times
94.88% overall alignment rate
[bam_sort_core] merging from 23 files and 1 in-memory blocks...

Thank you @acesnik for making and sharing this, may the force be with you 👍🏽

@acesnik
Collaborator

acesnik commented Aug 10, 2021

Okay, great! No problem!

I'm going to close this issue. Feel free to reopen or open a new one if you run into anything new!

@acesnik acesnik closed this as completed Aug 10, 2021
@animesh
Author

animesh commented Aug 11, 2021

It went fine for a while but then crashed (2021-08-09T133906.922634.snakemake.log). It looks like the underlying issue has something to do with GATK and htsjdk.samtools.util.RuntimeIOException (combined.sorted.grouped.marked.split.log). Is more RAM needed?

@acesnik
Collaborator

acesnik commented Aug 11, 2021

It looks like the drive where spritz is located may have run out of storage space. I think this happens when it's writing temporary files and the temporary file location (../resources/tmp) runs out of space.
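
One way to check this is to ask df about the filesystem that actually holds the temporary directory, rather than reading the whole table. A minimal sketch follows; the function name is hypothetical.

```shell
# Report the free space (in KiB) on the filesystem that holds a given directory.
# The function name is hypothetical; df -P selects the portable one-line-per-filesystem format.
available_kb() {
    # Row 2 is the filesystem holding "$1"; column 4 is the available space in 1024-byte blocks.
    df -Pk "$1" | awk 'NR == 2 {print $4}'
}
```

From the workflow directory this could be called as available_kb ../resources/tmp.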

Could you share /home/animeshs/rnAGS/variants/Homo_sapiens.GRCh38.100.spritz.log, as well?

@animesh
Author

animesh commented Aug 11, 2021

I think drive space should not be the issue as

(spritzbase) animeshs@DMED7596:~/Spritz/Spritz/workflow$ df -kh
Filesystem      Size  Used Avail Use% Mounted on
/dev/sdb        251G  216G   23G  91% /
none             43G  4.0K   43G   1% /mnt/wsl
tools           953G  746G  208G  79% /init
none             43G     0   43G   0% /dev
none             43G     0   43G   0% /run
none             43G     0   43G   0% /run/lock
none             43G  8.0K   43G   1% /run/shm
none             43G     0   43G   0% /run/user
tmpfs            43G     0   43G   0% /sys/fs/cgroup
drivers         953G  746G  208G  79% /usr/lib/wsl/drivers
lib             953G  746G  208G  79% /usr/lib/wsl/lib
drvfs           953G  746G  208G  79% /mnt/c
drvfs           3.7T  3.4T  263G  93% /mnt/f
drvfs           9.1T  4.5T  4.7T  49% /mnt/z

/mnt/z has the data... Looking into the log file you asked for, Homo_sapiens.GRCh38.100.spritz.log, it seems to be complaining about a DLL?

@acesnik
Collaborator

acesnik commented Aug 11, 2021

The latter issue in Homo_sapiens.GRCh38.100.spritz.log was fixed here: #217. Could you try running git fetch --all; git pull origin master to see if those changes help?

@acesnik
Collaborator

acesnik commented Aug 11, 2021

Is your /home directory with spritz also on /mnt/z? If not, I'd recommend running spritz from /mnt/z, as well.

With it there, where there's plenty of space, I really don't know why a file is closing in the middle of the SplitNCigarReads tool execution. Could you please check whether temporary files are being saved to ../resources/tmp on /mnt/z/ within the Spritz folder?

I just double checked the option for specifying the temporary directory (--tmp-dir), and it looks like that is still correct, so I don't think that's the issue.

@animesh
Author

animesh commented Aug 12, 2021

OK, I have pulled, moved the folder, and am re-running the pipeline:

(base) animeshs@DMED7596:~/Spritz$ git fetch --all
Fetching origin
(base) animeshs@DMED7596:~/Spritz$ git pull origin master
From https://github.com/animesh/Spritz
 * branch            master     -> FETCH_HEAD
(base) animeshs@DMED7596:~/Spritz$ cd ..git pull origin master
(base) animeshs@DMED7596:~/Spritz$ cd ..
(base) animeshs@DMED7596:~$mv Spritz /mnt/z/.
(base) animeshs@DMED7596:~$cd /mnt/z/Spritz/Spritz/workflow
(base) animeshs@DMED7596:/mnt/z/Spritz/Spritz/workflow$ conda activate spritzbase
(spritzbase) animeshs@DMED7596:/mnt/z/Spritz/Spritz/workflow$ snakemake -j 12 --keep-going --resources mem_mb=64000 --use-conda --conda-frontend mamba
Building DAG of jobs...
Creating conda environment envs/proteogenomics.yaml...
Downloading and installing remote packages.
...

But after this it keeps crashing at the first step. Should I just redo everything, or is there something else I can look into to save the work done so far?

@acesnik
Collaborator

acesnik commented Aug 12, 2021

I'm not sure what's going wrong based on that output. Is it producing any log files? I also wonder whether the analysis directory is specified as an absolute path in the config file.
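
A quick way to check the last point is sketched below. It assumes the analysis directory is written on a single line as analysis_directory: [path], optionally followed by a comment; the function name is hypothetical.

```shell
# Print the analysis_directory value from a Spritz config.yaml and say whether it is absolute.
# Assumes the one-line "analysis_directory: [path] # comment" layout; the function name is hypothetical.
check_analysis_directory() (
    config=$1
    # Drop the key, then the trailing comment, brackets, quotes and trailing whitespace.
    dir=$(sed -n 's/^analysis_directory:[[:space:]]*//p' "$config" \
        | sed 's/#.*$//; s/[][]//g; s/"//g; s/[[:space:]]*$//')
    case "$dir" in
        /*) echo "absolute: $dir" ;;
        *)  echo "not absolute: $dir" ;;
    esac
)
```

From the workflow directory this could be called as check_analysis_directory config/config.yaml.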

@animesh
Author

animesh commented Aug 13, 2021

It looks like it is getting stuck at the conda activation stage. Below is the run I restarted and just cancelled; it had been running since yesterday...

(spritzbase) animeshs@DMED7596:/mnt/z/Spritz/Spritz/workflow$ snakemake -j 8 --unlock --keep-going --resources mem_mb=32000 --use-conda --conda-frontend mamba
Unlocking working directory.
(spritzbase) animeshs@DMED7596:/mnt/z/Spritz/Spritz/workflow$ snakemake -j 8 --keep-going --resources mem_mb=32000 --use-conda --conda-frontend mamba
Building DAG of jobs...
Using shell: /bin/bash
Provided cores: 8
Rules claiming more threads will be scaled down.
Provided resources: mem_mb=32000
Job counts:
        count   jobs
        1       all
        1       base_recalibration
        1       call_gvcf_varaints
        1       call_vcf_variants
        1       final_vcf_naming
        1       finish_variants
        1       split_n_cigar_reads
        1       transfer_modifications_variant
        1       variant_annotation_ref
        9

[Thu Aug 12 14:36:34 2021]
rule split_n_cigar_reads:
    input: /home/animeshs/rnAGS/variants/combined.sorted.grouped.marked.bam, ../resources/ensembl/Homo_sapiens.GRCh38.dna.primary_assembly.karyotypic.fa, ../resources/ensembl/Homo_sapiens.GRCh38.dna.primary_assembly.karyotypic.fa.fai, ../resources/ensembl/Homo_sapiens.GRCh38.dna.primary_assembly.karyotypic.dict, ../resources/tmp
    output: /home/animeshs/rnAGS/variants/combined.fixedQuals.bam, /home/animeshs/rnAGS/variants/combined.sorted.grouped.marked.split.bam, /home/animeshs/rnAGS/variants/combined.sorted.grouped.marked.split.bam.bai
    log: /home/animeshs/rnAGS/variants/combined.sorted.grouped.marked.split.log
    jobid: 20
    benchmark: /home/animeshs/rnAGS/variants/combined.sorted.grouped.marked.split.benchmark
    wildcards: dir=/home/animeshs/rnAGS
    resources: mem_mb=24000

Activating conda environment: /mnt/z/Spritz/Spritz/workflow/.snakemake/conda/3ddf2249
^CTerminating processes on user request, this might take some time.
^CCancelling snakemake on user request.
[Fri Aug 13 09:33:02 2021]
Error in rule split_n_cigar_reads:
    jobid: 20
    output: /home/animeshs/rnAGS/variants/combined.fixedQuals.bam, /home/animeshs/rnAGS/variants/combined.sorted.grouped.marked.split.bam, /home/animeshs/rnAGS/variants/combined.sorted.grouped.marked.split.bam.bai
    log: /home/animeshs/rnAGS/variants/combined.sorted.grouped.marked.split.log (check log file(s) for error message)
    conda-env: /mnt/z/Spritz/Spritz/workflow/.snakemake/conda/3ddf2249
    shell:
        (gatk --java-options "-Xmx24000M -Dsamjdk.compression_level=9" FixMisencodedBaseQualityReads -I /home/animeshs/rnAGS/variants/combined.sorted.grouped.marked.bam -O /home/animeshs/rnAGS/variants/combined.fixedQuals.bam && gatk --java-options "-Xmx24000M -Dsamjdk.compression_level=9" SplitNCigarReads -R ../resources/ensembl/Homo_sapiens.GRCh38.dna.primary_assembly.karyotypic.fa -I /home/animeshs/rnAGS/variants/combined.fixedQuals.bam -O /home/animeshs/rnAGS/variants/combined.sorted.grouped.marked.split.bam --tmp-dir ../resources/tmp || gatk --java-options "-Xmx24000M -Dsamjdk.compression_level=9" SplitNCigarReads -R ../resources/ensembl/Homo_sapiens.GRCh38.dna.primary_assembly.karyotypic.fa -I /home/animeshs/rnAGS/variants/combined.sorted.grouped.marked.bam -O /home/animeshs/rnAGS/variants/combined.sorted.grouped.marked.split.bam --tmp-dir ../resources/tmp; samtools index /home/animeshs/rnAGS/variants/combined.sorted.grouped.marked.split.bam) &> /home/animeshs/rnAGS/variants/combined.sorted.grouped.marked.split.log
        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

Removing output files of failed job split_n_cigar_reads since they might be corrupted:
/home/animeshs/rnAGS/variants/combined.fixedQuals.bam, /home/animeshs/rnAGS/variants/combined.sorted.grouped.marked.split.bam
Job failed, going on with independent jobs.

Below is the config; it points to an absolute path. Should it be relative?

(spritzbase) animeshs@DMED7596:/mnt/z/Spritz/Spritz/workflow$ less config/config.yaml
sra: [] #  paired-end SRAs, comma separated, can leave empty, e.g. SRR629563
fq: [22286_CGATGT_C5E7AANXX_5_20141008B_20141008.bam., 22287_TGACCA_C5E7AANXX_5_20141008B_20141008.bam., 22288_ACAGTG_C5E7AANXX_5_20141008B_20141008.bam., 22289_GCCAAT_C5E7AANXX_5_20141008B_20141008.bam., 22290_CAGATC_C5E7AANXX_5_20141008B_20141008.bam., 22291_CTTGTA_C5E7AANXX_5_20141008B_20141008.bam., 22292_AGTCAA_C5E7AANXX_5_20141008B_20141008.bam., 22293_AGTTCC_C5E7AANXX_5_20141008B_20141008.bam., 22294_ATGTCA_C5E7AANXX_6_20141008B_20141008.bam., 22295_CCGTCC_C5E7AANXX_6_20141008B_20141008.bam., 22296_GTCCGC_C5E7AANXX_6_20141008B_20141008.bam., 22297_GTGAAA_C5E7AANXX_6_20141008B_20141008.bam., 22298_ATCACG_C5E7AANXX_6_20141008B_20141008.bam., 22299_TTAGGC_C5E7AANXX_6_20141008B_20141008.bam., 22300_ACTTGA_C5E7AANXX_6_20141008B_20141008.bam., 22301_GATCAG_C5E7AANXX_6_20141008B_20141008.bam., 22302_TAGCTT_C5E7AANXX_7_20141008B_20141008.bam., 22303_GGCTAC_C5E7AANXX_7_20141008B_20141008.bam., 22304_GTGGCC_C5E7AANXX_7_20141008B_20141008.bam., 22305_GTTTCG_C5E7AANXX_7_20141008B_20141008.bam., 22306_CGTACG_C5E7AANXX_7_20141008B_20141008.bam., 22307_GAGTGG_C5E7AANXX_7_20141008B_20141008.bam., 22308_ACTGAT_C5E7AANXX_7_20141008B_20141008.bam., 22309_ATTCCT_C5E7AANXX_7_20141008B_20141008.bam.] # paired-end fastq prefixes, comma separated, can leave empty, e.g. TestPairedEnd
sra_se: [] # single-end SRAs, comma separated, can leave empty, e.g. SRR8070095
fq_se: [] # single-end fastq prefixes, comma separated, can leave empty, e.g. TestSingleEnd
analysis_directory: [/home/animeshs/rnAGS] # for paths to drive e.g. /mnt/c/AnalysisFolder
species: "Homo_sapiens" # ensembl species name
genome: "GRCh38" # ensembl genome version
release: "100" # ensembl release version
organism: "human" # based on uniprot
# analyses: [isoform,variant,quant] # isoform construction, variant calling, transcript quantification
analyses: [variant]
spritz_version: "0.3.4" # should be the same here, common.smk, and MainWindow.xml.cs
prebuilt_spritz_mods: False

@acesnik
Collaborator

acesnik commented Aug 13, 2021

You could try deleting the /mnt/z/Spritz/Spritz/workflow/.snakemake/ folder that has the environments and try again. It should just rebuild the environments quickly before running.

@animesh
Author

animesh commented Aug 17, 2021

Looks like it finally worked (config.zip, 2021-08-13T153310.143502.snakemake.log) @acesnik 👍🏽

I am guessing that the following are the results of the variant calling?

(spritzbase) animeshs@DMED7596:/mnt/z/Spritz/Spritz/workflow$ ls -ltrh /home/animeshs/rnAGS/final/
total 1.1G
-rwxrwxrwx 1 animeshs animeshs 552M Aug 16 16:21 combined.spritz.snpeff.vcf
-rwxrwxrwx 1 animeshs animeshs  65M Aug 16 16:21 combined.spritz.snpeff.protein.fasta
-rwxrwxrwx 1 animeshs animeshs 150M Aug 16 16:21 combined.spritz.snpeff.protein.withdecoys.fasta
-rwxrwxrwx 1 animeshs animeshs  78M Aug 16 16:21 combined.spritz.snpeff.protein.withmods.xml.gz
-rwxrwxrwx 1 animeshs animeshs  34M Aug 16 16:21 Homo_sapiens.GRCh38.100.protein.fasta
-rwxrwxrwx 1 animeshs animeshs  79M Aug 16 16:21 Homo_sapiens.GRCh38.100.protein.withdecoys.fasta
-rwxrwxrwx 1 animeshs animeshs  76M Aug 16 16:21 Homo_sapiens.GRCh38.100.protein.withmods.xml.gz

If so, do the following numbers of protein sequences look reasonable to you?

(spritzbase) animeshs@DMED7596:/mnt/z/Spritz/Spritz/workflow$ for i in  /home/animeshs/rnAGS/final/*.fasta; do echo $i; grep "^>" $i | wc; done
/home/animeshs/rnAGS/final/Homo_sapiens.GRCh38.100.protein.fasta
  63305  437182 6560454
/home/animeshs/rnAGS/final/Homo_sapiens.GRCh38.100.protein.withdecoys.fasta
 152804  611216 14673896
/home/animeshs/rnAGS/final/combined.spritz.snpeff.protein.fasta
  73488  585678 29130709
/home/animeshs/rnAGS/final/combined.spritz.snpeff.protein.withdecoys.fasta
 176685 1069046 67683537

The with-decoys numbers are confusing me, though: 152804 is 26194 more than 2*63305, which is what I would expect if only the reversed sequences were added. Or does it contain the variants too? The corresponding difference is 29709 for the last two files...

How are the variants encoded in the FASTA files? Is there a straightforward way within Spritz to summarize them?

@acesnik
Collaborator

acesnik commented Aug 17, 2021

These two sets of files are actually the results of different programs, which should probably be clearer in the filenames. The first ones (Homo_sapiens.GRCh38.100.protein.fasta, combined.spritz.snpeff.protein.fasta) are produced by Spritz's fork of SnpEff, which does variant annotation, and the second batch (Homo_sapiens.GRCh38.100.protein.withdecoys.fasta, combined.spritz.snpeff.protein.withmods.xml.gz, etc.) is produced by SpritzModifications (formerly TransferUniProtModifications), which applies the annotated variants to proteins. The reason for the discrepancy in counts is that SpritzModifications does limited combinatorial expansion of heterozygous variations, whereas SnpEff does not.

If you check grep "^>mz" $i | wc for targets, versus grep "^>rev_mz" $i | wc for decoys, you can verify that the number of decoys is the same as the number of targets (e.g., 76402 for both for Homo_sapiens.GRCh38.100.protein.withdecoys.fasta).
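
The two counts can also be printed side by side for each file. This is a small sketch using grep -c instead of wc; the function name is hypothetical, and the header prefixes are the ones mentioned above.

```shell
# Count target (">mz") and decoy (">rev_mz") headers in a FASTA file.
# The function name is hypothetical; the header prefixes are the ones named in this thread.
count_targets_and_decoys() (
    fasta=$1
    # grep -c prints 0 but exits non-zero when nothing matches, hence "|| true".
    targets=$(grep -c '^>mz' "$fasta" || true)
    decoys=$(grep -c '^>rev_mz' "$fasta" || true)
    printf '%s\ttargets=%s\tdecoys=%s\n' "$(basename "$fasta")" "$targets" "$decoys"
)
```

For example: for i in /home/animeshs/rnAGS/final/*.withdecoys.fasta; do count_targets_and_decoys "$i"; done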

@acesnik
Collaborator

acesnik commented Aug 17, 2021

If you want to use the FASTA file with variants but without decoys, I would recommend selecting grep "^>mz" combined.spritz.snpeff.protein.withdecoys.fasta > combined.spritz.snpeff.protein.spritzmods.fasta.
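
Note that grep on the header pattern alone keeps only the header lines, so the resulting file would contain no sequences. If the sequences are wanted as well, something like the following awk sketch may be closer to the intent; it assumes a standard FASTA layout with ">" header lines, and the function name is hypothetical.

```shell
# Copy only target records (header starts with ">mz") together with their sequence lines.
# The function name is hypothetical; plain grep "^>mz" would keep the header lines only.
extract_targets() {
    # On every header line decide whether to keep the record; print lines while keep is true.
    awk '/^>/ {keep = ($0 ~ /^>mz/)} keep' "$1"
}
```

For example: extract_targets combined.spritz.snpeff.protein.withdecoys.fasta > combined.spritz.snpeff.protein.spritzmods.fasta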

@animesh
Author

animesh commented Aug 17, 2021

Great @acesnik, 152804/2 = 76402 matches perfectly, though 88340 and 88345 don't. Does that mean there are only 5 variants?

(spritzbase) animeshs@DMED7596:/mnt/z/Spritz/Spritz/workflow$ for i in  /home/animeshs/rnAGS/final/*.fasta; do echo $i;grep "^>mz" $i | wc; done
/home/animeshs/rnAGS/final/Homo_sapiens.GRCh38.100.protein.fasta
  63305  437182 6560454
/home/animeshs/rnAGS/final/Homo_sapiens.GRCh38.100.protein.withdecoys.fasta
  76402  305608 7184144
/home/animeshs/rnAGS/final/combined.spritz.snpeff.protein.fasta
  73488  585678 29130709
/home/animeshs/rnAGS/final/combined.spritz.snpeff.protein.withdecoys.fasta
  88340  443908 33028098
(spritzbase) animeshs@DMED7596:/mnt/z/Spritz/Spritz/workflow$ for i in  /home/animeshs/rnAGS/final/*.fasta; do echo $i;grep "^>rev_mz" $i | wc; done
/home/animeshs/rnAGS/final/Homo_sapiens.GRCh38.100.protein.fasta
      0       0       0
/home/animeshs/rnAGS/final/Homo_sapiens.GRCh38.100.protein.withdecoys.fasta
  76402  305608 7489752
/home/animeshs/rnAGS/final/combined.spritz.snpeff.protein.fasta
      0       0       0
/home/animeshs/rnAGS/final/combined.spritz.snpeff.protein.withdecoys.fasta
  88345  625138 34655439

Do you think the results are fine in general? Should I go forward with /home/animeshs/rnAGS/final/combined.spritz.snpeff.protein.withdecoys.fasta* nonetheless?

@animesh
Author

animesh commented Aug 17, 2021

Thanks, looks like you replied in between @acesnik. So I am guessing the variants are encoded within the ^mz_* entries? How can I extract and confirm them? Is there any way within Spritz?

@acesnik
Collaborator

acesnik commented Aug 17, 2021

Oh, that's interesting about the extra decoys. There is a process of reversing the variants in decoy generation, and that might change the length and thus the filtering based on having proteins >7 AAs. I'll look into that.

The decoy discrepancy is pretty small, so it might not have the biggest effect, but like I mentioned, you could select just the targets with grep "^>mz" combined.spritz.snpeff.protein.withdecoys.fasta > combined.spritz.snpeff.protein.spritzmods.fasta.

All target proteins are encoded with "^>mz", and all decoys are encoded with "^>rev_mz". Variants will have the tag "variant:" in the header, so to check for the count of proteins with variants, I'd recommend doing grep "^>mz" $i | grep "variant:" | wc.
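
To put that count in proportion, the share of target entries carrying a variant can be computed in the same pass. This is a minimal sketch; the function name is hypothetical, and it relies only on the ">mz" prefix and the "variant:" tag described above.

```shell
# Report how many target entries (">mz" headers) carry a "variant:" tag, and the percentage.
# The function name is hypothetical.
variant_fraction() (
    fasta=$1
    # grep -c exits non-zero on zero matches, hence "|| true".
    total=$(grep -c '^>mz' "$fasta" || true)
    with_variants=$(grep '^>mz' "$fasta" | grep -c 'variant:' || true)
    awk -v n="$with_variants" -v t="$total" 'BEGIN {
        if (t == 0) { print "no target entries found"; exit }
        printf "%d of %d target entries (%.1f%%) have variants\n", n, t, 100 * n / t
    }'
)
```

For example: variant_fraction /home/animeshs/rnAGS/final/combined.spritz.snpeff.protein.withdecoys.fasta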

@acesnik
Collaborator

acesnik commented Aug 17, 2021

In typical human experiments, I usually find about a quarter of the entries have variants.

@animesh
Author

animesh commented Aug 17, 2021

Awesome @acesnik 👍🏽 Looks like there are about 15000 variants

(spritzbase) animeshs@DMED7596:/mnt/z/Spritz/Spritz/workflow$ for i in  /home/animeshs/rnAGS/final/*.fasta; do echo $i;grep "^>mz" $i | grep "variant:" | wc ; done
/home/animeshs/rnAGS/final/Homo_sapiens.GRCh38.100.protein.fasta
      0       0       0
/home/animeshs/rnAGS/final/Homo_sapiens.GRCh38.100.protein.withdecoys.fasta
      0       0       0
/home/animeshs/rnAGS/final/combined.spritz.snpeff.protein.fasta
  14695  179491 23040270
/home/animeshs/rnAGS/final/combined.spritz.snpeff.protein.withdecoys.fasta
  17324  159844 26350306

Which is less than expected, then? Probably the read depth was not enough. Where can I check the mapping metrics and adjust the thresholds used for variant calling? Is there something easy to configure within config.yaml?

@acesnik
Collaborator

acesnik commented Aug 17, 2021

That's right around where I would expect. Probably 15-30% of entries or something. I wouldn't try changing any of the thresholding, personally.

It's actually probably more than 15000 variants, since multiple are encoded per protein. You can check on some other figures in the output of SpritzModifications at combined.spritz.snpeff.protein.withmods.log. Right now, it has one summary for the targets only, and then another one for both targets and decoys (which you can ignore).

The first one also tells you the number of variant containing proteins (hopefully the same number as you saw above), the number of unique variants on those proteins, and the number of unique variants by type: missense, frameshift, deletion, etc.
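
To look at the targets-only summary by itself, the first block can be cut out of the log. This is a minimal sketch, assuming each summary starts with a line beginning "Spritz Database Summary"; the function name is hypothetical.

```shell
# Print only the first "Spritz Database Summary" block (targets only) from a SpritzModifications log.
# The function name is hypothetical; the block title is assumed to start each summary.
first_summary_block() {
    # n counts summary titles seen so far; lines are printed only while inside the first block.
    awk '/^Spritz Database Summary/ {n++} n == 1' "$1"
}
```

For example: first_summary_block combined.spritz.snpeff.protein.withmods.log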

@acesnik
Collaborator

acesnik commented Aug 17, 2021

You can also take a step back from there and look at the numbers of variants detected before applying them to proteins by opening combined.spritz.snpeff.html, which is a report from SnpEff variant annotation.

@animesh
Author

animesh commented Aug 17, 2021

The numbers in withmods.log seem to match:

Welcome to SpritzModifications!
Transfering modifications from UniProt database ../resources/uniprot/Homo_sapiens.protein.xml.gz to /home/animeshs/rnAGS/variants/combined.spritz.snpeff.protein.xml
76402	Canonincal proteins translated from gene model (without applied variations)
46883	Proteins with exact sequence match in UniProt
16422	Proteins without exact sequence match in UniProt
Analyzing resulting database /home/animeshs/rnAGS/variants/combined.spritz.snpeff.protein.withmods.xml
Spritz Database Summary
--------------------------------------------------------------
63305	Total number of canonical protein entries (before applying variations)
73491	Total number of protein entries
51237	Total modifications appended from UniProt out of 53436
10186	Total number of variant containing protein entries
13729	Total number of unique variants
231	Total number of unique synonymous variants
13498	Total number of unique nonsynonymous variants
12833	Number of unique SNV missense variants
188	Number of unique MNV missense variants
346	Number of unique frameshift variants
2	Number of unique insertion variants
21	Number of unique deletion variants
74	Number of unique stop gain variants
34	Number of unique stop loss variants
Spritz Database Summary
--------------------------------------------------------------
126610	Total number of canonical protein entries (before applying variations)
146983	Total number of protein entries
100982	Total modifications appended from UniProt out of 105326
20373	Total number of variant containing protein entries
27584	Total number of unique variants
474	Total number of unique synonymous variants
27110	Total number of unique nonsynonymous variants
25781	Number of unique SNV missense variants
378	Number of unique MNV missense variants
691	Number of unique frameshift variants
4	Number of unique insertion variants
40	Number of unique deletion variants
148	Number of unique stop gain variants
68	Number of unique stop loss variants

I will check the HTML too, thanks a lot @acesnik 👍🏽
