Skip to content

Commit

Permalink
[translate] skip parsing source (nuc) as a feature
Browse files Browse the repository at this point in the history
This check is already in place for non-VCF inputs, and my guess is it
was omitted here as the TB pipeline's GFF file didn't include a 'source'
annotation. I don't think 'source' is actually a valid GFF ID and I
suspect we've just been applying the INSDC/GenBank term to GFF files,
but it is one of the two fields parsed by `load_features` and there are
GFF files in Nextstrain build pipelines which use it. Modifying the
underlying `load_features` would be a better solution, but that's a
bigger project for another day.

We additionally update the error message to use the same feature name we
export.

Closes #591
Supersedes #1109
  • Loading branch information
jameshadfield committed Dec 11, 2023
1 parent 4d449ae commit 84343d8
Show file tree
Hide file tree
Showing 3 changed files with 8 additions and 4 deletions.
1 change: 1 addition & 0 deletions CHANGES.md
Original file line number Diff line number Diff line change
Expand Up @@ -9,6 +9,7 @@
* Incompatible arguments are now checked, especially related to VCF vs FASTA inputs.
* `--vcf-reference` and `--root-sequence` are now mutually exclusive.
* translate: Tree nodes are checked against the node-data JSON input to ensure sequences are present. [#1348][] (@jameshadfield)
* translate: The 'source' ID for GFF files is now ignored as a potential gene feature. [#1348][] (@jameshadfield)
* translate: Improvements to command line arguments. [#1348][] (@jameshadfield)
* `--tree` and `--ancestral-sequences` are now required arguments.
* separate VCF-only arguments into their own group
Expand Down
9 changes: 6 additions & 3 deletions augur/translate.py
Original file line number Diff line number Diff line change
Expand Up @@ -129,7 +129,7 @@ def translate_feature(aln, feature):
return translations


def translate_vcf_feature(sequences, ref, feature):
def translate_vcf_feature(sequences, ref, feature, feature_name):
'''Translates a subsequence of input nucleotide sequences.
Parameters
Expand Down Expand Up @@ -168,7 +168,7 @@ def str_reverse_comp(str_seq):
# Need to get ref translation to store. check if multiple of 3 for sanity.
# will be padded in safe_translate if not
if len(refNuc)%3:
print("Gene length of {} is not a multiple of 3. will pad with N".format(feature.qualifiers['Name'][0]), file=sys.stderr)
print(f"Gene length of {feature_name!r} is not a multiple of 3. will pad with N", file=sys.stderr)

ref_aa_seq = safe_translate(refNuc)
prot['reference'] = ref_aa_seq
Expand Down Expand Up @@ -409,13 +409,16 @@ def run(args):
print("Read in {} features from reference sequence file".format(len(features)))

## Read in sequences & for each sequence translate each feature _except for_ the source (nuc) feature
## Note that `load_features` _only_ extracts {'gene', 'source'} for GFF files, {'CDS', 'source'} for GenBank.
translations = {}
if is_vcf:
(sequences, ref) = sequences_vcf(args.vcf_reference, args.ancestral_sequences)
features_without_variation = []
for fname, feat in features.items():
if feat.type=='source':
continue
try:
translations[fname] = translate_vcf_feature(sequences, ref, feat)
translations[fname] = translate_vcf_feature(sequences, ref, feat, fname)
except NoVariationError:
features_without_variation.append(fname)
if len(features_without_variation):
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -15,7 +15,7 @@ Translate amino acids for genes using a GFF3 file where the gene names are store
> --output-node-data aa_muts.json \
> --alignment-output translations.vcf \
> --vcf-reference-output translations_reference.fasta
Gene length of rrs_Rvnr01 is not a multiple of 3. will pad with N
Gene length of 'rrs' is not a multiple of 3. will pad with N
Read in 187 specified genes to translate.
Read in 187 features from reference sequence file
162 genes had no mutations and so have been be excluded.
Expand Down

0 comments on commit 84343d8

Please sign in to comment.