Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GTF file does not contain any sense (+) strand products #65

Open
dferran1 opened this issue Oct 3, 2022 · 12 comments
Open

GTF file does not contain any sense (+) strand products #65

dferran1 opened this issue Oct 3, 2022 · 12 comments

Comments

@dferran1
Copy link

dferran1 commented Oct 3, 2022

Hello,

I am attempting to use SNPGenie to analyze synonymous and non-synonymous SNPs in a VCF file. I used gffread to generate a gtf file from my gff3 file and then used grep to select only the exons in the gtf file. I get a message from SNPGenie that the gtf file "does not contain any sense (+) strand products" but on visual inspection it has both + and - in the strand column (see attachment as a .txt as Github did not allow a .gtf). What is the best way to resolve this issue?

The bash command is as follows: perl $snpgenie --fastafile=$reference.chrI.linearized.fa --gtffile=$gff_file.exons.gtf --snpreport=$vcf --vcfformat=4

Thank you,

David Ferranti
stickleback_v5_ensembl_genes.exons.gtf.txt

@singing-scientist
Copy link
Contributor

Greetings! I notice that the GTF file has records for multiple chromosomes/contigs. Unfortunately, as described in the documentation, SNPGenie requires that each run must correspond to exactly one input sequence (i.e. exactly one reference contig in one FASTA file). Thus, it will need to be run separately for each chromosome — one run each, with the appropriate VCF/FASTA/GTF combination containing records for just that chrom. Let me know if that solves the issue!

C

@dferran1
Copy link
Author

dferran1 commented Oct 3, 2022

Hello,

Thank you for the quick response--restricting the gtf file to just chromosome 1 gives the same error message.

@singing-scientist
Copy link
Contributor

Understood. Please provide a small, fully reproducible example (i.e., one FASTA, one GTF, one VCF, and the exact command line used) and I will get to this ASAP! (Apologies for any delays as I am traveling.)

C

@dferran1
Copy link
Author

dferran1 commented Oct 3, 2022

Sure thing--I greatly appreciate you taking the time to help me get this set up.

Here are the three files and the script I use to reformat them and run SNPGenie:

https://drive.google.com/file/d/1NQ6ofZBBGGukO2oifmCaP23IolY9HJT2/view?usp=sharing

@singing-scientist
Copy link
Contributor

Thanks very much! Could you provide a small example, e.g., much less than 500 Mb? For example, input files for the smallest chromosome (or just a fraction of it)?

@dferran1
Copy link
Author

dferran1 commented Oct 3, 2022

Here's the same set of files/script but tweaked to only extract the mitochondrial genome:

https://drive.google.com/file/d/1zXQGYdOejHdQVDALtUiJqNeD04mEIRZI/view?usp=sharing

@singing-scientist
Copy link
Contributor

Hello David. I downloaded the files, which includes a FASTA file of 473 Mb and 25 chromosomes. However, I need a minimal example that is appropriate for SNPGenie, as described in the documentation on this GitHub: one FASTA with one sequence inside; one VCF with variants called relative to that one sequence; and a GTF for the protein-coding regions of that one sequence. Please take a close look at the documentation. If there is still an issue once input specifications are met, feel free to send me a set of example files so I may reproduce the error!

@dferran1
Copy link
Author

dferran1 commented Oct 7, 2022

Hmmm the bash script should have produced the mitochondrial genome-only intermediate files. Anyways here they are:
SNP_Genie_mtDNA_Files.zip

And the SNPGenie command: perl $snpgenie --fastafile=stickleback_v5_assembly.pitx1.chrM.linearized.fa --gtffile=stickleback_v5_ensembl_genes.exons.chrM.gtf --snpreport=Marine.Samples.subsampled.blackspotted.remove.ancient.filtered.bedtools.chrM.vcf --vcfformat=4

Just replace @snpgenie with the appropriate path.

@singing-scientist
Copy link
Contributor

Greetings, David! Each coding product must be labeled as type CDS (see the GTF paragraph of the documentation. If you replace 'exon' with 'CDS' in your file, it should work. Let me know!

C

@dferran1
Copy link
Author

Hello,

Ah--I see. From what I can tell using a GTF with only CDS seems to fix that issue. Unfortunately I am now getting an error that some genes do not have coordinates with a complete set of codons (number of nucleotides must be a multiple of 3). I think I need to go back and troubleshoot the assembly/annotation before I can proceed.

Thanks again for all your assistance with SNPGenie.

Best,

David

@Rohit-Satyam
Copy link

Rohit-Satyam commented Oct 17, 2023

Did this issue ever got resolved? I get a similar error. I must also admit that this game of manually altering fasta and gtf file is bizarre and confusing and should be automated. I spent the entire day altering the gtf in multiple ways and yet it didn't run.

@singing-scientist
Copy link
Contributor

Greetings @Rohit-Satyam! Thanks very much. Indeed in an ideal world of infinite funding and time, all would be made to work for all conceivable input! Regarding the FASTA, there should be no need to alter anything — the input requirement is simply for a single reference sequence. Regarding the GTF file, the requirement is for non-redundant CDS records. If you have checked that your input meet these criteria and are still getting the error, please feel free to share a small reproducible example (FASTA, VCF, GTF) and I'd be happy to investigate what's happening.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants