WorkflowError: Missing Input files for rule prescore #64

Closed
raghvendra44 opened this issue Apr 25, 2024 · 8 comments

Comments

@raghvendra44

raghvendra44 commented Apr 25, 2024

Hello,
I am trying to set up the repository in my conda/mamba environment.

These are the options I selected when installing the prerequisites:

The following questions will quide you through selecting the files and dependencies needed for CADD.
After this, you will see an overview of the selected files before the download and installation starts.
Please note, that for successfully running CADD locally, you will need the conda environment and at least one set of annotations.

Do you want to install the virtual environments with all CADD dependencies via conda? (y)/n y
Do you want to install CADD v1.7 for GRCh37/hg19? (y)/n n
Do you want to install CADD v1.7 for GRCh38/hg38? (y)/n y
Do you want to load annotations (Annotations can also be downloaded manually from the website)? (y)/n y
Do you want to load prescored variants (Makes SNV calling faster. Can also be loaded/installed later.)? y/(n) y
Do you want to load prescored variants for scoring with annotations (Warning: These files are very big)? y/(n) n
Do you want to load prescored variants for scoring without annotations? y/(n) y
Do you also want to load prescored InDels? We provide scores for well known InDels from sources like ClinVar, gnomAD/TOPMed etc. y/(n) y

The following will be loaded: (disk space occupied)
 - Setup of the virtual environments including all dependencies for CADD v1.7 (16 GB).
 - Download CADD annotations for GRCh38-v1.7 (336 GB)
 - Download prescored SNV (without annotations) for GRCh38-v1.7 (81 GB)
 - Download prescored InDels (without annotations) for GRCh38-v1.7 (1.2 GB)
Please make sure you have enough disk space available.
Ready to continue? (y)/n y
Starting installation. This will take some time.

After the installation completed, an error occurred:

gnomad.genomes.r4.0.indel.tsv.gz: OK
gnomad.genomes.r4.0.indel.tsv.gz.tbi: OK
Setting up virtual environments for CADD v1.7
No validator found for JSON Schema version identifier 'https://json-schema.org/draft/2020-12/schema'
Defaulting to validator for JSON Schema version 'http://json-schema.org/draft-07/schema#'
Note that schema file may not be validated correctly.
Building DAG of jobs...
WorkflowError:
MissingInputException: Missing input files for rule decompress:
    output: test/input.novel.vcf
    wildcards: file=test/input.novel
    affected files:
        test/input.novel.vcf.gz
MissingInputException: Missing input files for rule prescore:
    output: test/input.novel.vcf, test/input.pre.tsv
    wildcards: file=test/input
    affected files:
        /CADD-1.7/CADD-scripts-1.7/data/prescored/GRCh38_v1.7/incl_anno

Can someone suggest how to fix this? Or can I still test the script using the files I have?

Thanks!

@visze
Collaborator

visze commented Apr 25, 2024

Hi. Thanks for reporting! The error occurred during "Setting up virtual environments for CADD v1.7". That is the very last step of the installation, and it depends on the file test/input.vcf.
I can already see that it would be better to ship this file bgzipped (test/input.vcf.gz), because the .vcf may have been deleted: it is defined as temporary in the workflow. Which Snakemake version do you use?

In theory you should have everything except the conda environments. You can try to continue and run the workflow via the CADD.sh script. The environments should be installed there too if they do not exist. Let me know if that works.

@raghvendra44
Author

raghvendra44 commented Apr 25, 2024

Thanks a lot for the quick response!
The Snakemake version in use is 7.32.4.

After converting input.vcf to input.vcf.gz and running ./CADD.sh, it says that I need to specify an input file with variants in it.

> ./CADD.sh
CADD-v1.7 (c) University of Washington, Hudson-Alpha Institute for Biotechnology and Berlin Institute of Health at Charité - Universitätsmedizin Berlin 2013-2023. All rights reserved.
No input file specified. To run CADD, a list of variants has to be provided in a vcf or vcf.gz file.

Is it expecting me to provide the parameters ./CADD.sh -o test/input.novel.vcf -g GRCh38 -a test/input.vcf.gz?

Edited:

I tried running ./CADD.sh -o test/input.novel.vcf -g GRCh38 -a test/input.vcf.gz, but the error is still the same.

CADD-v1.7 (c) University of Washington, Hudson-Alpha Institute for Biotechnology and Berlin Institute of Health at Charité - Universitätsmedizin Berlin 2013-2023. All rights reserved.
Running snakemake pipeline:
snakemake /tmp/tmp.qPUqZ4ID6B/input.tsv.gz --use-conda --conda-prefix /gpfs/data/user/raghvendra/CADD-1.7/CADD-scripts-1.7/envs/conda --cores 1
--configfile /gpfs/data/user/raghvendra/CADD-1.7/CADD-scripts-1.7/config/config_GRCh38_v1.7.yml --snakefile /gpfs/data/user/raghvendra/CADD-1.7/CADD-scripts-1.7/Snakefile -q
No validator found for JSON Schema version identifier 'https://json-schema.org/draft/2020-12/schema'
Defaulting to validator for JSON Schema version 'http://json-schema.org/draft-07/schema#'
Note that schema file may not be validated correctly.
Building DAG of jobs...
WorkflowError:
MissingInputException: Missing input files for rule prescore:
    output: /tmp/tmp.qPUqZ4ID6B/input.novel.vcf, /tmp/tmp.qPUqZ4ID6B/input.pre.tsv
    wildcards: file=/tmp/tmp.qPUqZ4ID6B/input
    affected files:
        /gpfs/data/user/raghvendra/CADD-1.7/CADD-scripts-1.7/data/prescored/GRCh38_v1.7/incl_anno
MissingInputException: Missing input files for rule decompress:
    output: /tmp/tmp.qPUqZ4ID6B/input.novel.vcf
    wildcards: file=/tmp/tmp.qPUqZ4ID6B/input.novel
    affected files:
        /tmp/tmp.qPUqZ4ID6B/input.novel.vcf.gz

@visze
Collaborator

visze commented Apr 25, 2024

Can you make sure that the folder /gpfs/data/user/raghvendra/CADD-1.7/CADD-scripts-1.7/data/prescored/GRCh38_v1.7/incl_anno exists and is accessible from your run node/location, and that the prescored files are in it? There should be 4 files in total: 2 tsv.gz and 2 tsv.gz.tbi.
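
The check described above can be scripted. This is only a minimal sketch: the function name check_prescored_dir is hypothetical and not part of CADD-scripts, and it assumes only what is stated here, namely that every prescored *.tsv.gz table should sit next to a *.tsv.gz.tbi index.

```python
from pathlib import Path


def check_prescored_dir(folder):
    """Return a list of problems found in a prescored-variant folder.

    An empty list means the folder exists, holds at least one *.tsv.gz
    table, and every table has a matching *.tsv.gz.tbi index next to it.
    (Hypothetical helper, not part of CADD-scripts.)
    """
    folder = Path(folder)
    if not folder.is_dir():
        return [f"folder not found: {folder}"]

    problems = []
    # "*.tsv.gz" matches the tables only, not the ".tsv.gz.tbi" indexes.
    tables = sorted(folder.glob("*.tsv.gz"))
    if not tables:
        problems.append(f"no *.tsv.gz files in {folder}")
    for table in tables:
        index = table.with_name(table.name + ".tbi")
        if not index.is_file():
            problems.append(f"missing index: {index.name}")
    return problems
```

Calling check_prescored_dir() on the incl_anno path from the error message, on the same node where the workflow runs, returns an empty list when the folder looks complete.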

@raghvendra44
Author

I don't have a folder called incl_anno.
I have a folder called no_anno, which contains 4 files as you mentioned:

gnomad.genomes.r4.0.indel.tsv.gz
whole_genome_SNVs.tsv.gz
gnomad.genomes.r4.0.indel.tsv.gz.tbi
whole_genome_SNVs.tsv.gz.tbi
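
Judging only from the logs in this thread, the run with -a looked for the incl_anno folder, while this install holds only no_anno. The sketch below reports which prescored folders are present and what that suggests for the -a flag. Both function names are hypothetical, and the mapping of -a to incl_anno is an inference from the error message, not documented behaviour.

```python
from pathlib import Path


def available_prescore_modes(cadd_root, build="GRCh38", version="v1.7"):
    """Report which prescored sub-folders exist and contain *.tsv.gz files.

    The data/prescored/<build>_<version>/ layout is copied from the paths
    shown in this thread. (Hypothetical helper, not part of CADD-scripts.)
    """
    base = Path(cadd_root) / "data" / "prescored" / f"{build}_{version}"
    modes = {}
    for name in ("incl_anno", "no_anno"):
        folder = base / name
        modes[name] = folder.is_dir() and any(folder.glob("*.tsv.gz"))
    return modes


def suggest_annotation_flag(modes):
    """Suggest whether -a is usable, assuming -a needs the incl_anno folder."""
    if modes["incl_anno"]:
        return "-a can be used (annotated prescored files found)"
    if modes["no_anno"]:
        return "run without -a (only un-annotated prescored files found)"
    return "no prescored files found"
```

With the folder listing above, available_prescore_modes() would report incl_anno as missing and no_anno as present.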

@visze
Collaborator

visze commented Apr 25, 2024 via email

@raghvendra44
Author

raghvendra44 commented Apr 25, 2024

I tried running ./CADD.sh -o test/input.novel.vcf -g GRCh38 test/input.vcf.gz, but now there is a new error about jsonschema:

CADD-v1.7 (c) University of Washington, Hudson-Alpha Institute for Biotechnology and Berlin Institute of Health at Charité - Universitätsmedizin Berlin 2013-2023. All rights reserved.
Running snakemake pipeline:
snakemake /tmp/tmp.C8tDZgRwkx/input.tsv.gz --use-conda --conda-prefix /gpfs/data/user/raghvendra/CADD-1.7/CADD-scripts-1.7/envs/conda --cores 1
--configfile /gpfs/data/user/raghvendra/CADD-1.7/CADD-scripts-1.7/config/config_GRCh38_v1.7_noanno.yml --snakefile /gpfs/data/user/raghvendra/CADD-1.7/CADD-scripts-1.7/Snakefile -q
WorkflowError in file /gpfs/data/user/raghvendra/CADD-1.7/CADD-scripts-1.7/Snakefile, line 14:
The Python 3 package jsonschema must be installed in order to use the validate directive.
  File "/gpfs/data/user/raghvendra/CADD-1.7/CADD-scripts-1.7/Snakefile", line 14, in <module>

When I try to install jsonschema with conda install -c anaconda jsonschema, I get this error:

Traceback (most recent call last):
  File "/apps/codes/anaconda3/bin/pip", line 7, in <module>
    from pip._internal.cli.main import main
  File "/apps/codes/anaconda3/lib/python3.7/site-packages/pip/_internal/cli/main.py", line 10, in <module>
    from pip._internal.cli.autocompletion import autocomplete
  File "/apps/codes/anaconda3/lib/python3.7/site-packages/pip/_internal/cli/autocompletion.py", line 9, in <module>
    from pip._internal.cli.main_parser import create_main_parser
  File "/apps/codes/anaconda3/lib/python3.7/site-packages/pip/_internal/cli/main_parser.py", line 7, in <module>
    from pip._internal.cli import cmdoptions
  File "/apps/codes/anaconda3/lib/python3.7/site-packages/pip/_internal/cli/cmdoptions.py", line 23, in <module>
    from pip._internal.cli.progress_bars import BAR_TYPES
  File "/apps/codes/anaconda3/lib/python3.7/site-packages/pip/_internal/cli/progress_bars.py", line 12, in <module>
    from pip._internal.utils.logging import get_indentation
  File "/apps/codes/anaconda3/lib/python3.7/site-packages/pip/_internal/utils/logging.py", line 18, in <module>
    from pip._internal.utils.misc import ensure_dir
  File "/apps/codes/anaconda3/lib/python3.7/site-packages/pip/_internal/utils/misc.py", line 33, in <module>
    from pip._internal.locations import (
  File "/apps/codes/anaconda3/lib/python3.7/site-packages/pip/_internal/locations.py", line 15, in <module>
    from distutils.command.install import SCHEME_KEYS  # type: ignore
  File "/apps/codes/anaconda3/lib/python3.7/distutils/command/install.py", line 9, in <module>
    from distutils.core import Command
  File "/apps/codes/anaconda3/lib/python3.7/distutils/core.py", line 18, in <module>
    from distutils.config import PyPIRCCommand
  File "/apps/codes/anaconda3/lib/python3.7/distutils/config.py", line 7, in <module>
    from configparser import RawConfigParser
  File "/apps/compilers/python3/lib/python3.8/site-packages/configparser.py", line 11, in <module>
    from backports.configparser import (
ModuleNotFoundError: No module named 'backports.configparser'
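
The traceback above runs pip under python3.7 but imports configparser from a python3.8 site-packages directory. That suggests a path from a second interpreter is leaking into the module search path, for example through PYTHONPATH set by a loaded environment module (an assumption, the thread does not show the environment). A minimal sketch for spotting such entries follows; the function name foreign_python_paths is hypothetical.

```python
import re
import sys


def foreign_python_paths(paths, running=None):
    """Return path entries whose pythonX.Y part differs from the interpreter.

    `running` defaults to the version of the interpreter executing this code.
    (Hypothetical helper for illustration.)
    """
    if running is None:
        running = f"{sys.version_info.major}.{sys.version_info.minor}"
    pattern = re.compile(r"python(\d+\.\d+)")
    foreign = []
    for entry in paths:
        match = pattern.search(entry)
        # Entries without a pythonX.Y component are ignored.
        if match and match.group(1) != running:
            foreign.append(entry)
    return foreign
```

Running foreign_python_paths(sys.path) with the same interpreter that fails lists the entries that belong to another Python version.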

@raghvendra44
Author

OK, I was finally able to fix the jsonschema error. I had two Python installations of different versions loaded, which were probably conflicting. Unloading one of them and re-running the script helped. But now a new error has popped up:

$ ./CADD.sh -o test/input.novel.vcf -g GRCh38 test/input.vcf.gz
CADD-v1.7 (c) University of Washington, Hudson-Alpha Institute for Biotechnology and Berlin Institute of Health at Charité - Universitätsmedizin Berlin 2013-2023. All rights reserved.
Running snakemake pipeline:
snakemake /tmp/tmp.rQ1yzI4yeO/input.tsv.gz --use-conda --conda-prefix /gpfs/data/user/raghvendra/CADD-1.7/CADD-scripts-1.7/envs/conda --cores 1
--configfile /gpfs/data/user/raghvendra/CADD-1.7/CADD-scripts-1.7/config/config_GRCh38_v1.7_noanno.yml --snakefile /gpfs/data/user/raghvendra/CADD-1.7/CADD-scripts-1.7/Snakefile -q
Building DAG of jobs...
Your conda installation is not configured to use strict channel priorities. This is however crucial for having robust and correct environments (for details, see https://conda-forge.org/docs/user/tipsandtricks.html). Please consider to configure strict priorities by executing 'conda config --set channel_priority strict'.
Using shell: /usr/bin/bash
Provided cores: 1 (use --cores to define parallelism)
Rules claiming more threads will be scaled down.
Job stats:
job           count
----------  -------
decompress        1
join              1
prepare           1
prescore          1
total             4

Select jobs to execute...
Activating conda environment: envs/conda/7531dbc4fc81a53c2bcc0253cc0bd059_
Select jobs to execute...
Activating conda environment: envs/conda/7531dbc4fc81a53c2bcc0253cc0bd059_
Removing temporary output /tmp/tmp.rQ1yzI4yeO/input.vcf.
Select jobs to execute...
Activating conda environment: envs/conda/7531dbc4fc81a53c2bcc0253cc0bd059_
MissingInputException in rule score in file /gpfs/data/user/raghvendra/CADD-1.7/CADD-scripts-1.7/Snakefile, line 263:
Missing input files for rule score:
    output: /tmp/tmp.rQ1yzI4yeO/input.novel.tsv
    wildcards: file=/tmp/tmp.rQ1yzI4yeO/input
    affected files:
        /gpfs/data/user/raghvendra/CADD-1.7/CADD-scripts-1.7/data/models/GRCh38/CADDv1.7-GRCh38.mod
        /gpfs/data/user/raghvendra/CADD-1.7/CADD-scripts-1.7/data/models/GRCh38/conversionTable_CADDv1.7-GRCh38.txt

@visze, I believe your input here would help me fix this.

@visze
Collaborator

visze commented Apr 25, 2024

The last two files are missing. They are part of a large tar.gz file. It seems that the install script did not complete successfully. Please follow the installation instructions step by step.
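
Before re-running the workflow, it can help to confirm that the two model files exist. This is a minimal sketch with a hypothetical function name, missing_model_files. The file names and the data/models/<build>/ layout are copied from the error message above; nothing else about the install is assumed.

```python
from pathlib import Path


def missing_model_files(cadd_root, build="GRCh38", version="v1.7"):
    """Return the model files from the error message that are not on disk.

    (Hypothetical helper, not part of CADD-scripts.)
    """
    base = Path(cadd_root) / "data" / "models" / build
    expected = [
        base / f"CADD{version}-{build}.mod",
        base / f"conversionTable_CADD{version}-{build}.txt",
    ]
    return [str(path) for path in expected if not path.is_file()]
```

An empty list from missing_model_files() on the CADD-scripts-1.7 directory means both files named in the MissingInputException are in place.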
