Welcome to the reVUE-data repository! This repository stores data for reVUE, a platform dedicated to collecting and analyzing information about Variants of Unknown Effect (VUEs) in cancer research. The repository contains all reVUE data in `VUEs.json`.
All reVUE variants are stored in `VUEs.json` and grouped by gene.
For each gene:
- hugoGeneSymbol (string)
- transcriptId (string)
- genomicLocationDescription (string): Description of the variant location, usually stating which exon or intron it falls in, or a pattern the variants share.
- defaultEffect (string): Summary of the default effect predicted by VEP, e.g., "splice".
- comment (string): Short summary of the actual effect, e.g., "Complete exon 3 skip".
- context (string): Clinical or research context, e.g., "Actionable in GIST".
- revisedProteinEffects (array): A list of reVUE variants for this gene. Each element in this array represents one variant.
  - variant (string): HGVSg
  - genomicLocation (string)
  - transcriptId (string)
  - vepPredictedProteinEffect (string)
  - vepPredictedVariantClassification (string)
  - revisedProteinEffect (string)
  - revisedVariantClassification (string)
  - confirmed (boolean): If a variant has `confirmed=true`, Genome Nexus overwrites the VEP protein change and variant classification with the reVUE revised results in its annotation APIs.
  - references (array)
    - pubmedId (string)
    - referenceText (string)
  - mutationOrigin (string): `germline` or `somatic`
  - counts (object): This part is generated by `variant_count.py` under `./scripts/`. The data used for counting come from mskimpact, tcga (all `xxx_tcga_pan_can_atlas_2018`), genie, and mskimpact_nonsignedout (look for `data_nonsignedout_mutations.txt`):
    - germlineVariantsCount (integer)
    - somaticVariantsCount (integer)
    - unknownVariantsCount (integer)
    - totalPatientCount (integer)
    - genePatientCount (integer)
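As a rough illustration of the layout described above, the sketch below walks a parsed `VUEs.json` structure and yields one record per variant. It assumes the top level of the file is a list of gene objects, which the field list suggests but does not state. The helper `iter_variants` and all sample values are hypothetical placeholders, not real reVUE data.

```python
import json


def iter_variants(genes):
    """Yield (gene symbol, variant dict) pairs from a parsed VUEs.json structure.

    Assumption: `genes` is a list of gene objects, each carrying a
    `revisedProteinEffects` array as described in the field list.
    """
    for gene in genes:
        for variant in gene.get("revisedProteinEffects", []):
            yield gene["hugoGeneSymbol"], variant


# Tiny in-memory sample with made-up placeholder values.
sample = json.loads("""
[
  {
    "hugoGeneSymbol": "GENE1",
    "transcriptId": "ENST_PLACEHOLDER",
    "revisedProteinEffects": [
      {"variant": "1:g.100A>G", "confirmed": true},
      {"variant": "1:g.200C>T", "confirmed": false}
    ]
  }
]
""")

# Collect only the variants flagged as confirmed.
confirmed = [v["variant"] for _, v in iter_variants(sample) if v.get("confirmed")]
print(confirmed)
```

To run this against the real file, the `sample` value could be replaced with the result of `json.load` on an open `VUEs.json`, provided the top-level assumption holds.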
Submit a new VUE here, or create a GitHub issue.
TODO
See the curation section below for more details.
After new updates to `VUEs.json` are merged, the reVUE website automatically displays the most recent data.
We regularly release updates for Genome Nexus. After each release, the Genome Nexus annotation API response reflects the latest data.
- Splice_
  - Exon_Skip_ (skip one or multiple whole exons)
    - In_Frame
    - Out_Of_Frame
    - Non_Start (first coding exon skipped)
  - Exon_Extension_ (extend exon)
    - In_Frame
    - Out_Of_Frame
    - Nonsense (introduce stop codon)
  - Exon_shortening_ (truncate a portion of exon)
    - In_Frame
    - Out_Of_Frame
    - Nonsense (introduce stop codon)
  - Intron_Retention_ (introduce the whole intron)
    - In_Frame
    - Out_Of_Frame
Notes:
- If a deletion spans a whole exon, it's classified as `Splice_Exon_Skip_`. For example, if a variant deletes exon 4 and half of exon 5, it's classified as `Splice_Exon_Skip_`, not `Splice_Exon_shortening_`.
- If a variant introduces part of the intron, it's classified as `Splice_Exon_Extension_`; only when it introduces the whole intron is it `Splice_Intron_Retention_`.
If the length of the insertion or deletion is a multiple of 3, the variant is `In_Frame`; otherwise it's `Out_Of_Frame`.
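The multiple-of-3 rule can be sketched in a few lines. This is only an illustration of the frame suffix, assuming the event prefix (such as `Splice_Exon_Extension_`) has already been chosen by a curator. The function names are hypothetical, and the sketch does not attempt to detect the `Nonsense` or `Non_Start` cases.

```python
def frame_suffix(indel_length):
    """Return the frame suffix for an insertion or deletion of `indel_length` nucleotides."""
    if indel_length <= 0:
        raise ValueError("indel_length must be a positive number of nucleotides")
    # A length divisible by 3 adds or removes whole codons and keeps the frame.
    return "In_Frame" if indel_length % 3 == 0 else "Out_Of_Frame"


def splice_classification(event_prefix, indel_length):
    """Combine an event prefix (e.g. 'Splice_Exon_Skip_') with the frame suffix."""
    return event_prefix + frame_suffix(indel_length)


# A 12-nucleotide insertion that extends an exon.
print(splice_classification("Splice_Exon_Extension_", 12))
```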
For example, MET exon 14 skipping variant: `7:g.116412044G>A`.
First verify the transcript ID in the published or submitted data. Then, navigate to the Integrative Genomics Viewer (IGV) and locate position `116412044`.
This location is at the boundary of exon 14 and intron 14. A change here modifies the splicing process and results in the skipping of exon 14.
Exon 13 ends with one nucleotide, `G`, from codon `963 Asp (D)`, while exon 15 starts with two nucleotides, `A` and `T`, from codon `1010 Asp (D)`. Therefore, when exon 13 is connected with exon 15, the remaining `G` combines with `A` and `T`, forming a new amino acid, `Asp (D)`.
Based on the information above, the new protein change is `p.D963_E1009del`.
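The junction reasoning above can be checked with a few lines of code. The sketch rejoins the split codon across the exon 13/exon 15 junction and counts the residues removed by `p.D963_E1009del`. The codon table is a deliberately tiny excerpt of the standard genetic code, and the helper name is hypothetical.

```python
# Minimal excerpt of the standard genetic code; only what this example needs.
CODON_TABLE = {"GAT": "D", "GAC": "D"}


def junction_amino_acid(upstream_tail, downstream_head):
    """Translate the codon formed by joining leftover nucleotides at an exon junction."""
    codon = upstream_tail + downstream_head
    if len(codon) != 3:
        raise ValueError("joined fragments must form exactly one codon")
    return CODON_TABLE[codon]


# Exon 13 ends with G (from codon 963); exon 15 starts with AT (from codon 1010).
new_residue = junction_amino_acid("G", "AT")

# p.D963_E1009del removes residues 963 through 1009 inclusive.
deleted_residues = 1009 - 963 + 1

print(new_residue, deleted_residues)
```

A whole number of deleted residues with an intact codon at the junction is what makes this an in-frame deletion.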
For example, ATM exon 17 skipping variant: `11:g.108138071T>C`.
First verify the transcript ID in the published or submitted data. Then, navigate to the Integrative Genomics Viewer (IGV) and locate position `108138071`.
This position is the second nucleotide at the junction of exon 17 and intron 17. A change here alters the splicing process, leading to the skipping of exon 17. Because exon 17 is 57 codons plus 1 nucleotide long, which is not a multiple of 3, skipping it results in a frameshift.
Exon 16 ends with a complete codon, while exon 18 begins with two nucleotides (`G` and `T`) of codon `880 Gly (G)`. When exon 18 is connected with exon 16, the reading frame starts at the first nucleotide of exon 18 and proceeds three nucleotides at a time, leading to a frameshift. In the wild type, the codon at this position is `823 Ala (A)` of exon 17. With exon 17 deleted, exon 18 takes its place, and the new amino acid sequence becomes `V` `P` `L` `I` `L`, followed by a stop codon.
From the information above, the first changed codon is `823 Ala (A)`, which changes to `Val (V)`. Translation then terminates after 5 codons. Therefore, the protein change of this variant is `p.A823Vfs*5`.
The Genome Nexus annotation API provides the VEP predicted annotation. For example:
https://www.genomenexus.org/annotation/7:g.55248980_55248981insTCCAGGAAGCCT?fields=annotation_summary
Replace the variant ID in the URL and check `annotation_summary` for the VEP predicted annotations.
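Swapping the variant ID into the URL can be scripted. The following is a minimal sketch using only the Python standard library. The URL pattern is taken from the example above, while the function names and the choice to percent-encode characters such as `>` are assumptions rather than documented Genome Nexus behavior.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

BASE_URL = "https://www.genomenexus.org/annotation/"


def annotation_url(variant):
    """Build the annotation URL for an HGVSg variant, requesting annotation_summary."""
    # Keep ':' readable; characters such as '>' are percent-encoded.
    return BASE_URL + quote(variant, safe=":") + "?fields=annotation_summary"


def fetch_annotation(variant, timeout=30):
    """Fetch and parse the JSON response (requires network access)."""
    with urlopen(annotation_url(variant), timeout=timeout) as response:
        return json.load(response)


print(annotation_url("7:g.55248980_55248981insTCCAGGAAGCCT"))
```

Calling `fetch_annotation` with a variant ID would return the parsed response, in which `annotation_summary` can then be inspected.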
The data used for counting come from mskimpact, tcga (all `xxx_tcga_pan_can_atlas_2018`), genie, and mskimpact_nonsignedout (look for `data_nonsignedout_mutations.txt`). Make sure all of these files are downloaded locally.
The counts are generated by `variant_count.py` under `./scripts/`.
After adding all other fields, run this command:
python variant_count.py
The script performs the counting and writes the numbers directly into the JSON file.
For a confirmed reVUE variant, the Genome Nexus API returns the following information in its response (example of an EGFR in-frame insertion):
"vues": {
  "hugoGeneSymbol": "EGFR",
  "genomicLocationDescription": "5 bases upstream from the 5' end of exon 20 (7:g.55248980_55248981insTCCAGGAAGCCT)",
  "defaultEffect": "splice",
  "comment": "Inset a repeated sequence from 55248980-55248992",
  "variant": "7:g.55248980_55248981insTCCAGGAAGCCT",
  "genomicLocation": "7,55248980,55248981,-,TCCAGGAAGCCT",
  "transcriptId": "ENST00000275493",
  "revisedProteinEffect": "p.A763_Y764insFQEA",
  "revisedVariantClassification": "Splice_Exon_Extension_In_Frame",
  "revisedVariantClassificationStandard": "In_Frame_Ins",
  "context": "Recurrent in lung cancer, can be linked to Level 1 TKIs",
  "vepPredictedProteinEffect": "p.X762_splice",
  "vepPredictedVariantClassification": "Splice_Region",
  "mutationOrigin": null,
  "references": [
    {
      "pubmedId": 31715539,
      "referenceText": "Sousa et al., 2020"
    }
  ],
  "confirmed": true
}
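A client that consumes this block might choose between the revised and the VEP-predicted values based on the `confirmed` flag. The sketch below is one hypothetical way to do that. It operates on the `vues` object alone, and the fallback to the VEP fields for unconfirmed entries is an assumption, since the response shown here covers only the confirmed case.

```python
def effective_annotation(vues):
    """Return (protein effect, variant classification) to display for a `vues` block."""
    if vues.get("confirmed"):
        # Confirmed reVUE entries take precedence over the VEP prediction.
        return vues["revisedProteinEffect"], vues["revisedVariantClassification"]
    # Assumed fallback for unconfirmed entries: keep the VEP prediction.
    return vues["vepPredictedProteinEffect"], vues["vepPredictedVariantClassification"]


# Subset of the EGFR example fields shown above.
egfr = {
    "revisedProteinEffect": "p.A763_Y764insFQEA",
    "revisedVariantClassification": "Splice_Exon_Extension_In_Frame",
    "vepPredictedProteinEffect": "p.X762_splice",
    "vepPredictedVariantClassification": "Splice_Region",
    "confirmed": True,
}

print(effective_annotation(egfr))
```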