Variant Descriptions Position Numbering
The standard human sequence variant nomenclature uses different position numbering schemes to describe variants relative to the reference sequence. Mutalyzer checks if the specified reference sequence is compatible with the selected position numbering scheme for the sequence variation. Variant descriptions involving upstream or downstream regulatory sequences and intron sequences can only be checked using genomic sequence records. Therefore, genomic records with correct annotation of all genes, transcripts and protein isoforms support most position numbering schemes. Mutalyzer automatically converts the given variant description to other position numbering schemes supported by the reference sequence and its annotation. Mutalyzer will not return results when the selected reference sequence does not contain sufficient sequence or annotation to support the nomenclature check of the variant.
There are six position numbering schemes (Sequence Types).
The Genomic position numbering scheme is applied to raw genomic records. The value 1 is assigned to the first base in the record and all bases are counted from there. In the output, genomic numbering is indicated by the g. prefix preceding the position number(s). LRG records and all GenBank records with 'DNA' in the first line will be accepted.
Chromosomal position numbering uses the same 1-based coordinate system as the Ensembl browser and the GFF and SAM file formats. The 0-based coordinate system is used by the UCSC Genome Browser and the BED and BAM file formats. The difference between these systems and their conversion is explained on BioStar's Cheat Sheet.
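The off-by-one relationship between the two coordinate systems can be sketched as follows. This is a minimal illustration and not Mutalyzer code. The function names are hypothetical, and the sketch assumes the usual conventions that 1-based intervals are fully closed (as in GFF) and 0-based intervals are half-open (as in BED).

```python
def one_based_to_zero_based(start, end):
    """Convert a 1-based, fully closed interval to a 0-based, half-open one."""
    if start < 1 or end < start:
        raise ValueError("expected 1 <= start <= end")
    # Only the start shifts; the exclusive end equals the inclusive 1-based end.
    return start - 1, end


def zero_based_to_one_based(start, end):
    """Convert a 0-based, half-open interval to a 1-based, fully closed one."""
    if start < 0 or end <= start:
        raise ValueError("expected 0 <= start < end")
    return start + 1, end


# A single base at 1-based position 100 spans [99, 100) in 0-based terms.
print(one_based_to_zero_based(100, 100))  # (99, 100)
print(zero_based_to_one_based(99, 100))   # (100, 100)
```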
Please note that well-annotated genomic sequence records containing annotated transcripts and corresponding coding sequences can be used in combination with non-coding DNA, coding DNA and protein position numbering schemes.
The Non-coding DNA position numbering scheme can be used with:
- GenBank records containing genomic sequences with annotated transcripts without a corresponding coding sequence.
- LRG records
- GenBank records containing transcript sequences without annotated coding sequences, provided that no intronic bases are involved in the variation. Mutalyzer needs a correctly annotated genomic reference sequence to check HGVS Non-coding DNA numbering of intron positions.
The value 1 is assigned to the first base of the transcript in the record and all the exonic bases are counted from there. Intronic bases are numbered x+1, x+2, x+3, ... y-3, y-2, y-1 where x is the value of the last exonic base upstream of the intron, y is the value of the first exonic base downstream of the intron and x and y are consecutive numbers. Intronic position numbers are always counted from the closest exonic base. In case of a tie, the upstream base is used. In the output, non-coding DNA numbering is indicated by the n. prefix preceding the position number(s).
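The intron rule, including the tie-break, can be illustrated with a short sketch. This is not Mutalyzer's implementation. The function name and the idea of addressing an intronic base by its 1-based offset within the intron are assumptions made for the example.

```python
def intron_position(x, intron_length, offset):
    """Label the base at 1-based `offset` within an intron.

    x is the number of the last exonic base upstream of the intron;
    the first exonic base downstream is y = x + 1.
    """
    if not 1 <= offset <= intron_length:
        raise ValueError("offset must lie inside the intron")
    y = x + 1
    from_upstream = offset
    from_downstream = intron_length - offset + 1
    # "<=" resolves a tie in favour of the upstream exonic base.
    if from_upstream <= from_downstream:
        return f"n.{x}+{from_upstream}"
    return f"n.{y}-{from_downstream}"


# A 5-base intron after exonic base 77: the middle base is a tie.
print([intron_position(77, 5, i) for i in range(1, 6)])
# ['n.77+1', 'n.77+2', 'n.77+3', 'n.78-2', 'n.78-1']
```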
The Coding DNA or cDNA position numbering scheme can be used with:
- GenBank records containing genomic sequences with annotated transcripts and corresponding coding sequences.
- LRG records
- GenBank records containing transcript sequences with annotated coding sequences, provided that no intronic bases are involved in the variation. Mutalyzer needs a correctly annotated genomic reference sequence to check HGVS Coding DNA numbering of intron positions.
The value 1 is assigned to the A of the ATG start codon and all the exonic bases between start and stop are counted normally.
5' untranslated region: Exonic bases upstream of (i.e. before) the ATG are numbered -1, -2, -3 and so on.
3' untranslated region: Exonic bases downstream of (i.e. behind) the stop codon are numbered *1, *2, *3 and so on.
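As a rough sketch of the exonic rules above, the function below maps a 1-based exonic transcript position to its coding DNA label. The function name, its parameters and the coordinates in the example are hypothetical. It assumes that `cds_start` is the transcript position of the A of the ATG and that `cds_end` is the last base of the stop codon.

```python
def coding_label(n, cds_start, cds_end):
    """Map a 1-based exonic transcript position to a c. position label.

    cds_start: transcript position of the A of the ATG start codon.
    cds_end:   transcript position of the last base of the stop codon.
    """
    if not 1 <= cds_start <= cds_end:
        raise ValueError("expected 1 <= cds_start <= cds_end")
    if n < 1:
        raise ValueError("transcript positions start at 1")
    if n < cds_start:
        # 5' UTR: count backwards from the ATG; there is no position 0.
        return f"c.-{cds_start - n}"
    if n > cds_end:
        # 3' UTR: count forwards from the stop codon.
        return f"c.*{n - cds_end}"
    return f"c.{n - cds_start + 1}"


# Hypothetical transcript: 3-base 5' UTR, 9-base coding sequence, then 3' UTR.
print([coding_label(n, 4, 12) for n in (1, 3, 4, 12, 13)])
# ['c.-3', 'c.-1', 'c.1', 'c.9', 'c.*1']
```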
Intronic bases in the Coding sequence are numbered x+1, x+2, x+3, ... y-3, y-2, y-1 where x is the value of the last exonic base upstream of the intron, y is the value of the first exonic base downstream of the intron and x and y are consecutive numbers. Intronic position numbers are always counted from the closest exonic base. In case of a tie, the upstream base is used.
When the 5' untranslated region is split over two or more exons, intronic bases are numbered -x+1, -x+2, -x+3, ... -y-3, -y-2, -y-1 where -x is the value of the last exonic base upstream of the intron, -y is the value of the first exonic base downstream of the intron and x and y are consecutive numbers.
When the 3' untranslated region is split over two or more exons, intronic bases are numbered *x+1, *x+2, *x+3, ... *y-3, *y-2, *y-1 where *x is the value of the last exonic base upstream of the intron, *y is the value of the first exonic base downstream of the intron and x and y are consecutive numbers.
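The same intron rule applies when the flanking exonic bases carry a - or * label. The sketch below lists the labels for a whole intron from the labels of its two flanking exonic bases. The function name and the example labels are hypothetical and chosen for illustration only.

```python
def intron_labels(x_label, y_label, intron_length):
    """Return c. labels for every base of an intron, from 5' to 3'.

    x_label / y_label are the position labels (without the c. prefix) of the
    flanking exonic bases, e.g. "-15" and "-14", or "*3" and "*4".
    """
    if intron_length < 1:
        raise ValueError("intron must contain at least one base")
    # The middle base of an odd-length intron counts from the upstream exon.
    n_upstream = (intron_length + 1) // 2
    upstream = [f"c.{x_label}+{k}" for k in range(1, n_upstream + 1)]
    downstream = [
        f"c.{y_label}-{k}" for k in range(intron_length - n_upstream, 0, -1)
    ]
    return upstream + downstream


# 4-base intron inside a 5' UTR, between exonic bases -15 and -14.
print(intron_labels("-15", "-14", 4))
# ['c.-15+1', 'c.-15+2', 'c.-14-2', 'c.-14-1']

# 5-base intron inside a 3' UTR, between exonic bases *3 and *4.
print(intron_labels("*3", "*4", 5))
# ['c.*3+1', 'c.*3+2', 'c.*3+3', 'c.*4-2', 'c.*4-1']
```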
In the output, cDNA numbering is indicated by the c. prefix preceding the position number(s).
The RNA position numbering scheme has not yet been implemented in Mutalyzer 2. In the output, RNA numbering will be indicated by the r. notation preceding the position number(s). In gene variant databases, r. numbering is used to describe variants observed in transcripts by RT-PCR followed by sequencing. Nucleotide position numbering follows the same scheme as that of the corresponding non-coding or coding DNA.
The Mitochondrial DNA (mtDNA) position numbering scheme uses raw genomic records. The value 1 is assigned to the first base in the record and from there all bases are counted normally.
The Protein position numbering scheme is used to generate variant descriptions at protein level from genomic or Coding DNA descriptions by translation of the Coding sequence. The current version of Mutalyzer 2 does not yet support checks of protein variants using a GenBank protein record. The value 1 is assigned to the first amino acid of the translated Coding sequence and from there all amino acids are counted normally. In the output, protein variants have the prefix p. followed by the amino acid changes between parentheses to indicate that they are predicted by translation of the modified Coding sequence.
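Because the coding sequence is translated three bases at a time, the amino acid number that a coding DNA position falls in follows from simple arithmetic. The helper below is a hypothetical illustration, not part of Mutalyzer. It assumes an exonic position between the start and stop codons.

```python
def codon_number(c_pos):
    """Return the 1-based codon (amino acid) number containing c.<c_pos>."""
    if c_pos < 1:
        raise ValueError("only positions inside the coding sequence apply")
    # c.1-c.3 form codon 1, c.4-c.6 form codon 2, and so on.
    return (c_pos - 1) // 3 + 1


print([codon_number(c) for c in (1, 3, 4, 6, 7)])  # [1, 1, 2, 2, 3]
```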
The EST position numbering scheme can be used with GenBank EST records. The value 1 is assigned to the first base in the record and from there all bases are counted normally. Sequence variation descriptions based on EST sequences lack the c. prefix to indicate that only part of the coding sequence may be present. All records with 'EST' in the first line will be accepted. These records do not allow checks of intronic sequence variations.
The picture below shows the different position numbering schemes applied to the same genomic sequence. Please note that the suffixes -u and +d in intergenic region position numbering below the transcripts are not part of current HGVS nomenclature and are shown for demonstration purposes only.
The current Coding DNA position numbering scheme of the standard human sequence variant nomenclature makes no distinction between upstream intergenic and 5' UTR positions and downstream intergenic and 3' UTR positions. It will be clear that upstream and downstream intergenic positions are not represented in NM_ (or NR_) RefSeq transcript reference sequences.
For upstream and downstream intergenic positions, Mutalyzer only allows the use of the - and * coding DNA position numbering if these positions are present in the genomic reference sequence specified.
We have proposed that the HGVS use the following numbering system for intergenic positions, which can also be applied to genes with non-coding transcripts, but it has not (yet) been accepted (see Numbering untranscribed nucleotides). We had already implemented this in Mutalyzer beta-20, but reverted it in Mutalyzer beta-21 at the request of the HGVS nomenclature committee.
For upstream intergenic positions, Mutalyzer beta-20 combined the position of the first nucleotide of the transcript with the suffix -u followed by the position of the upstream nucleotide. Intergenic bases upstream of Non-coding DNA are numbered n.1-uy, ..., n.1-u3, n.1-u2, n.1-u1 where y is the value of the most upstream base and n.1-u1 is the value of the first intergenic base upstream of the first exon. Intergenic bases upstream of Coding DNA are numbered c.x-uy, ..., c.x-u3, c.x-u2, c.x-u1 where x is the value of the first nucleotide of the first exon and y is the value of the most upstream base. The advantage of this notation is that the -u position corresponds to the - position used by most researchers to describe transcription factor binding sites.
In the picture above, the first nucleotide of the genomic sequence, g.1, corresponds to c.-113-u270 for the protein-encoding transcript and n.1-u270 for the non-coding transcript, respectively.
For downstream intergenic positions, Mutalyzer beta-20 combined the position of the last nucleotide of the transcript with the suffix +d followed by the position of the downstream nucleotide. Intergenic bases downstream of Non-coding DNA are numbered n.x+d1, n.x+d2, n.x+d3 ... where x is the value of the last nucleotide of the last exon. Intergenic bases downstream of Coding DNA are numbered c.x+d1, c.x+d2, c.x+d3, ... where x is the value of the last nucleotide of the last exon.
In the picture above, the last nucleotide of the genomic sequence, g.1470, corresponds to c.*410+d120 for the protein-encoding transcript and n.747+d120 for the non-coding transcript, respectively.
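The -u and +d notation described above (the Mutalyzer beta-20 proposal, which is not part of HGVS nomenclature) can be reproduced with a small sketch. The function name and parameters are hypothetical, and the sketch assumes a transcript on the forward strand. The transcript boundaries g.271 and g.1350 are not stated in the text; they are inferred from the two examples, since g.1 is 270 bases upstream of the transcript and g.1470 is 120 bases downstream.

```python
def intergenic_label(g, tx_start, tx_end, first_label, last_label):
    """Label a genomic position outside a forward-strand transcript.

    tx_start / tx_end: genomic positions of the first and last transcript base.
    first_label / last_label: transcript-level labels of those two bases.
    """
    if g < 1 or tx_start > tx_end:
        raise ValueError("invalid coordinates")
    if g < tx_start:
        # Upstream: distance to the first transcript base, written as -u.
        return f"{first_label}-u{tx_start - g}"
    if g > tx_end:
        # Downstream: distance from the last transcript base, written as +d.
        return f"{last_label}+d{g - tx_end}"
    raise ValueError("position lies inside the transcript")


# Boundaries inferred from the examples in the text (see lead-in).
print(intergenic_label(1, 271, 1350, "n.1", "n.747"))         # n.1-u270
print(intergenic_label(1470, 271, 1350, "n.1", "n.747"))      # n.747+d120
print(intergenic_label(1, 271, 1350, "c.-113", "c.*410"))     # c.-113-u270
print(intergenic_label(1470, 271, 1350, "c.-113", "c.*410"))  # c.*410+d120
```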
The current version of Mutalyzer supports deprecated (non-HGVS) notations for exon positions and exon ranges using exon numbers and positions in introns. More information can be found here.