feat: remove pseudogenes, read through genes, and ones with only predicted functions #510

Closed
haowang-bioinfo opened this issue Mar 23, 2023 · 6 comments

@haowang-bioinfo
Member

haowang-bioinfo commented Mar 23, 2023

Description of the issue:

It is proposed to remove the following genes, because they are pseudogenes, genes from read-through transcripts (spanning two adjacent genes on the same strand with the same orientation), or genes with only predicted functions.

| gene id | check results |
| --- | --- |
| ENSG00000011052 | read through |
| ENSG00000099984 | pseudogene |
| ENSG00000100101 | novel transcript |
| ENSG00000114786 | read through |
| ENSG00000125954 | read through |
| ENSG00000133475 | pseudogene |
| ENSG00000137700 | with only predicted function |
| ENSG00000168970 | read through |
| ENSG00000172782 | with only predicted function |
| ENSG00000189366 | pseudogene, only predicted function |
| ENSG00000250741 | read through, no annotation |
| ENSG00000255730 | with only predicted function |
| ENSG00000259030 | read through |
| ENSG00000259075 | read through, only predicted function |
| ENSG00000259916 | only predicted function |
| ENSG00000272916 | novel transcript, only predicted function |
| ENSG00000275183 | only predicted function |
| ENSG00000276380 | pseudogene |
| ENSG00000282301 | read through |
| ENSG00000284844 | only predicted function |
| ENSG00000285043 | novel protein, without function annotation |
| ENSG00000285269 | read through, only predicted function |

@feiranl
Collaborator

feiranl commented Mar 24, 2023

Looks good! May I know how these genes were identified here? I double checked some of the read-through genes, and those are clear. As for the novel protein or predicted function, I am not sure how to double check them.

@haowang-bioinfo
Member Author

haowang-bioinfo commented Mar 24, 2023

> May I know how these genes were identified here?

Very good question!

genes.tsv currently contains 2292 genes. When all of these Ensembl IDs are submitted to uniprot.org/id-mapping for mapping to UniProtKB, these genes come back as unmapped.
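
The check described here amounts to a set difference between the IDs that were submitted and the IDs that appear in the mapping result. A minimal sketch, assuming genes.tsv has a `genes` column holding the Ensembl IDs and that the ID-mapping result was downloaded as a TSV whose `From` column holds the submitted IDs. Both column names and the function names are assumptions for illustration, not something confirmed in this thread, and the inputs below are placeholders rather than real file contents:

```python
import csv
import io


def read_column(tsv_text, column):
    """Return the values of one named column from tab-separated text."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [row[column] for row in reader]


def find_unmapped(genes_tsv_text, mapping_tsv_text,
                  gene_column="genes", from_column="From"):
    """Return submitted Ensembl IDs that never appear in the mapping result.

    gene_column and from_column are assumed column names; adjust them to
    match the actual files.
    """
    submitted = read_column(genes_tsv_text, gene_column)
    mapped = set(read_column(mapping_tsv_text, from_column))
    # Keep the order of genes.tsv so the report is easy to compare.
    return [gene for gene in submitted if gene not in mapped]


if __name__ == "__main__":
    # Tiny placeholder inputs, not real IDs or accessions.
    genes_tsv = "genes\tgeneSymbols\nENSG_A\tSYM_A\nENSG_B\tSYM_B\nENSG_C\tSYM_C\n"
    mapping_tsv = "From\tEntry\nENSG_A\tP00001\nENSG_C\tP00003\n"
    print(find_unmapped(genes_tsv, mapping_tsv))  # ['ENSG_B']
```

An ID that maps to several UniProtKB entries shows up on several rows of the result, which is why the mapped IDs are collected into a set before the comparison.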

> As for the novel protein or predicted function, I am not sure how to double check them

They evidently have no function annotated in UniProt, but they can be checked on Ensembl (e.g., ENSG00000011052) and NCBI (654364).

@haowang-bioinfo
Member Author

As ENSG00000137700 is the only gene for MAR01621 and MAR06386, its deletion is suspended.
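
This kind of conflict can be caught before a deletion by looking for reactions whose gene rule would be left empty. A rough sketch, assuming the GPR rules are available as plain strings keyed by reaction ID. The function names are hypothetical, and the third reaction and its second gene are made-up placeholders; only the two MAR reactions reflect what is stated in this thread:

```python
import re

# Ensembl human gene IDs: "ENSG" followed by 11 digits.
ENSEMBL_ID = re.compile(r"ENSG\d{11}")


def genes_in_rule(gpr_rule):
    """Return the set of Ensembl gene IDs mentioned in a GPR string."""
    return set(ENSEMBL_ID.findall(gpr_rule))


def reactions_losing_all_genes(gpr_by_reaction, gene_id):
    """List reactions whose GPR rule contains gene_id and no other gene."""
    return sorted(
        rxn for rxn, rule in gpr_by_reaction.items()
        if genes_in_rule(rule) == {gene_id}
    )


if __name__ == "__main__":
    gpr_by_reaction = {
        "MAR01621": "ENSG00000137700",
        "MAR06386": "ENSG00000137700",
        # Hypothetical reaction and partner gene, for contrast only.
        "RXN_EXAMPLE": "ENSG00000137700 or ENSG00000000000",
    }
    print(reactions_losing_all_genes(gpr_by_reaction, "ENSG00000137700"))
    # ['MAR01621', 'MAR06386']
```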

@haowang-bioinfo
Member Author

haowang-bioinfo commented Mar 28, 2023

> As ENSG00000137700 is the only gene for MAR01621 and MAR06386, its deletion is suspended.

As a follow-up, replace ENSG00000137700 with ENSG00000281500: both refer to the same gene, SLC37A4, but the latter has been reviewed and has proper annotation and substrates.
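
For illustration only, a replacement of this kind can be expressed as a whole-ID substitution on the GPR string. This is a sketch with a hypothetical helper name, not the script used in the repository:

```python
import re


def replace_gene(gpr_rule, old_id, new_id):
    """Swap old_id for new_id in a GPR string, matching whole IDs only."""
    pattern = r"\b" + re.escape(old_id) + r"\b"
    return re.sub(pattern, new_id, gpr_rule)


if __name__ == "__main__":
    print(replace_gene("ENSG00000137700", "ENSG00000137700", "ENSG00000281500"))
    # ENSG00000281500
```

The word boundaries keep the substitution from touching a longer identifier that merely contains the old ID. If a rule already lists the new ID, the result will mention it twice and needs a separate clean-up step.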

haowang-bioinfo mentioned this issue Apr 18, 2023
haowang-bioinfo added a commit that referenced this issue Apr 18, 2023
- replace ENSG00000137700 with ENSG00000281500, as proposed in #510
- remove other GCS genes from GPR of MAR06409, as discussed in #529
@haowang-bioinfo
Member Author

There are still 5 genes left unmapped to UniProt. They were manually checked and associated as below:

| Ensembl id | Uniprot id | check results |
| --- | --- | --- |
| ENSG00000139304 | A0A087WZU1 | protein-tyrosine-phosphatase |
| ENSG00000161133 | Q3LFD5 | ubiquitin specific peptidase 41 |
| ENSG00000175164 | A0A087X009 | predicted function |
| ENSG00000181222 | A0A6Q8PGB0 | DNA-dependent RNA polymerase |
| ENSG00000249948 | Q9H227 | beta-glucosidase, with both normal and pseudogene isoforms |
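
One way to carry these manual associations forward is to keep them in a small lookup that only fills gaps left by the automatic mapping. A sketch under that assumption; the function name and the shape of `auto_mapping` (Ensembl ID to UniProt accession) are hypothetical, while the five pairs are the ones listed above:

```python
# Manually curated Ensembl -> UniProt associations from this thread.
MANUAL_UNIPROT = {
    "ENSG00000139304": "A0A087WZU1",
    "ENSG00000161133": "Q3LFD5",
    "ENSG00000175164": "A0A087X009",
    "ENSG00000181222": "A0A6Q8PGB0",
    "ENSG00000249948": "Q9H227",
}


def fill_unmapped(auto_mapping, manual_mapping=MANUAL_UNIPROT):
    """Return a copy of auto_mapping with manual entries added for gaps.

    Genes that already have an automatic mapping are left untouched.
    """
    merged = dict(auto_mapping)
    for gene, accession in manual_mapping.items():
        merged.setdefault(gene, accession)
    return merged


if __name__ == "__main__":
    merged = fill_unmapped({})
    print(merged["ENSG00000249948"])  # Q9H227
```

Using `setdefault` means a later automatic mapping takes precedence over the manual entry, which is a design choice made here for illustration rather than a stated project policy.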

@haowang-bioinfo
Member Author

fixed in #537
