Mapping NCBI accessions/GIs to Uniprot IDs produces few hits #1419

laurahspencer · 2022-03-03T22:32:25Z

laurahspencer
Mar 3, 2022
Collaborator

I have a set of NCBI protein/nucleotide entries in a genome annotation file, and want to ultimately perform GO enrichment analysis. It seems that the best approach is to map the NCBI entries to Uniprot IDs using this Uniprot tool which also retrieves GO IDs. My issue is that very few of my entries map to Uniprot IDs, for example:

Using NCBI accession numbers as my input:

243 out of 13883 RefSeq Protein identifiers were successfully mapped to 244 UniProtKB IDs. (Out of the remaining ids, 10495 were successfully mapped to the UniParc sequence archive)
11 out of 722 RefSeq Nucleotide identifiers were successfully mapped to 11 UniProtKB ID.

Using NCBI GI numbers as my input (which I retrieved using the Batch Entrez tool):

664 out of 14605 GI number identifiers were successfully mapped to 665 UniProtKB IDs in the table below. (Out of the remaining ids, 270 were successfully mapped to the UniParc sequence archive)

Anyone else run into this issue? Or has anyone used an enrichment analysis tool that uses NCBI accession numbers as the input?
FYI I've also tried entering the accession and GI numbers in DAVID, but it doesn't recognize them. As reference, here are a few of the entries:

NCBI_Accession.version	NCBI_Accession	NCBI_GI
CAA68079.1	CAA68079	1483311
CAA67357.1	CAA67357	1771447
AAB94634.1	AAB94634	2739396
AAC61691.1	AAC61691	3660667
P81576.1	P81576	5921936
CAC13698.1	CAC13698	14089939
AAK50057.2	AAK50057	14861878
AAK77775.1	AAK77775	15021498
BAB64297.1	BAB64297	15425681

NOTE: I have seen this issue and Sam's notebook entry, where he uses a Python script to retrieve UniProt IDs. I haven't tested it on my gene sets, but I presume it would query the same UniProt database and therefore give the same results.

sr320 · 2022-03-03T22:35:42Z

sr320
Mar 3, 2022
Maintainer

If it were me, I would take all the sequences and BLAST them against UniProt to get the associated accessions.

4 replies

laurahspencer Mar 3, 2022
Collaborator Author

I did that too, which I could certainly use, but Giles did a more comprehensive blast approach which resulted in a much higher percentage of annotated genes.

sr320 Mar 3, 2022
Maintainer

I guess what I mean is: use your new IDs to batch retrieve the FASTA sequences, then BLAST (again). It might just be that there are limited "hits", but I would not trust that unless I did the BLAST myself.

sr320 Mar 3, 2022
Maintainer

However, that is just doing things in circles, sorry.... If you're set on DAVID you can just feed it any and all IDs...

laurahspencer Mar 3, 2022
Collaborator Author

I'm definitely not set on DAVID - I was hoping to use GO_MWU, which requires GO terms as input. But I was resorting to DAVID, hoping the IDs I had (or could retrieve) would work in DAVID - they do not.
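For reference, here is a minimal sketch of turning UniProt-style GO annotations into the kind of gene-to-GO table GO_MWU reads. It assumes the table is two tab-delimited columns (gene ID, then semicolon-separated GO terms, with "unknown" for unannotated genes) and that the GO IDs arrive as a single string such as "GO:0016301; GO:0005524"; both are assumptions to check against the GO_MWU README and your own download. The function name `go_mwu_line` is hypothetical.

```python
def go_mwu_line(gene_id, uniprot_go_field):
    """Format one gene as 'gene<TAB>GO:...;GO:...' (assumed GO_MWU layout)."""
    # Split on ';', trim whitespace, keep only GO IDs, and de-duplicate.
    terms = sorted(
        {t.strip() for t in uniprot_go_field.split(";") if t.strip().startswith("GO:")}
    )
    # Genes with no GO terms get the placeholder "unknown" (assumption).
    return f"{gene_id}\t{';'.join(terms) if terms else 'unknown'}"


# Hypothetical gene IDs and GO strings, for illustration only.
annotations = {
    "gene1": "GO:0016301; GO:0005524",
    "gene2": "",
}
for gene, go_field in annotations.items():
    print(go_mwu_line(gene, go_field))
```

Sorting the terms is only there to make the output stable between runs.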

kubu4 · 2022-03-03T22:53:37Z

kubu4
Mar 3, 2022
Maintainer

I have a set of NCBI protein/nucleotide entries in a genome annotation file

Can you please provide link to file(s)?

NCBI GI numbers as my input

Out of curiosity, why are you using these? I'm fairly certain these have been deprecated (years ago) and are not really the standard way to identify NCBI "stuff".

Which UniProt database were you using (reviewed/unreviewed)? Unreviewed should get you more matches.

6 replies

kubu4 Mar 3, 2022
Maintainer

Yep. Need to get GO terms.

But, I'm confused by this, since this addresses what you proposed:

I did that too, which I could certainly use, but Giles did a more comprehensive blast approach which resulted in a much higher percentage of annotated genes.

@laurahspencer , have you not used the results from these BLASTs?

sr320 Mar 3, 2022
Maintainer

have you not used the results from these BLASTs?

that's the ticket... using these other IDs instead of Uniprot Accessions in DAVID

laurahspencer Mar 3, 2022
Collaborator Author

@kubu4 here is the annotation file: bestblast.bkc_ref.tab

I'm not using all the entries- I filtered for e-value and those with NCBI accession numbers.

laurahspencer Mar 3, 2022
Collaborator Author

The annotation file I'm working with is one Giles generated - he blasted against multiple NCBI databases which resulted in ~80% success rate. While I was waiting for his to finish, I also blasted against the Uniprot database, resulting in far fewer hits. So, I was proceeding with his results. I have been running into major issues, though, trying to use the NCBI accession numbers (with or without the version number) in any gene enrichment or gene ID conversion tool. That's why I tried to use the NCBI GI instead, which seems more readily accepted (e.g. you used that in your NCBI -> Uniprot mapping python script).

laurahspencer Mar 3, 2022
Collaborator Author

Which UniProt database were you using (reviewed/unreviewed)? Unreviewed should get you more matches.

That's a great question! I will check that now.
UPDATE: If I understand your question, I'm using the unreviewed one (UniprotKB).

kubu4 · 2022-03-04T14:33:55Z

kubu4
Mar 4, 2022
Maintainer

Having perused your BLAST output table, the primary issue is that you have a mix of database IDs. Some are GenBank (e.g. QYJ58449.1), some are RefSeq (e.g. XP_033432496.1). This makes it difficult to map from one database to another (e.g. with the UniProt ID mapping service). The service expects a single ID type (e.g. GenBank) to be mapped to a single database (e.g. UniProt SwissProt). A mix of ID types like the one in your BLAST table cannot be mapped in a single "batch" submission. You'd have to parse out the different IDs based on ID type (which is not indicated in the BLAST table).
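One way to parse out the ID types is to sort accessions by their format before submitting each group separately. The sketch below is an illustration, not a vetted classifier: the regular expressions are approximations of the RefSeq, GenBank protein, and UniProt accession formats, the formats overlap in places, and the function names (`strip_version`, `classify`, `group_by_type`) are made up for this example.

```python
import re
from collections import defaultdict

# Approximate accession patterns (assumptions), checked in this order.
PATTERNS = [
    ("refseq_protein", re.compile(r"^[ANXYW]P_\d+$")),
    ("refseq_nucleotide", re.compile(r"^(?:NM|NR|XM|XR|NC|NG|NT|NW|NZ)_[A-Z0-9]+$")),
    ("genbank_protein", re.compile(r"^[A-Z]{3}\d{5}(?:\d{2})?$")),
    ("uniprot", re.compile(
        r"^(?:[OPQ][0-9][A-Z0-9]{3}[0-9]"
        r"|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2})$")),
    ("gi_number", re.compile(r"^\d+$")),
]


def strip_version(accession):
    """Drop a trailing ".N" version suffix, e.g. CAA68079.1 -> CAA68079."""
    accession = accession.strip()
    base, dot, version = accession.rpartition(".")
    return base if dot and version.isdigit() else accession


def classify(accession):
    """Return the name of the first matching pattern, or "unknown"."""
    bare = strip_version(accession)
    for name, pattern in PATTERNS:
        if pattern.match(bare):
            return name
    return "unknown"


def group_by_type(accessions):
    """Group versionless accessions by guessed ID type."""
    groups = defaultdict(list)
    for accession in accessions:
        groups[classify(accession)].append(strip_version(accession))
    return dict(groups)


# IDs taken from this thread.
ids = ["QYJ58449.1", "XP_033432496.1", "CAA68079.1", "P81576.1", "1483311"]
for id_type, members in group_by_type(ids).items():
    print(id_type, members)
```

Each resulting group could then be submitted as its own batch with the matching "from" database. Whether to strip the version suffix depends on what the mapping tool accepts, so treat that step as optional.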

To add to this, not all RefSeq accessions are mapped to/from UniProt. It seems like the RefSeq mapping criteria are pretty stringent:

https://www.uniprot.org/help/ncbi_mappings

So, with all that in mind, I really think you should take @sr320's approach and perform a BLAST to SwissProt yourself in order to obtain SwissProt accessions. This will then allow you to do a batch submission to UniProt to obtain gene ontology terms.
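As a rough sketch of the step between the BLAST and the batch submission: the code below pulls one SwissProt accession per query out of tabular BLAST output. It assumes the default 12-column tabular layout (`-outfmt 6`, with e-value and bit score in columns 11 and 12) and subject IDs in the UniProt FASTA style `sp|ACCESSION|ENTRY_NAME`. The function name `best_swissprot_hits`, the `1e-5` cutoff, and the example rows are all hypothetical.

```python
import csv


def best_swissprot_hits(lines, max_evalue=1e-5):
    """Map each query ID to the accession of its best hit (lowest e-value)."""
    best = {}
    for row in csv.reader(lines, delimiter="\t"):
        # Skip comment lines and rows that lack the 12 expected columns.
        if len(row) < 12 or row[0].startswith("#"):
            continue
        query, subject = row[0], row[1]
        evalue, bitscore = float(row[10]), float(row[11])
        if evalue > max_evalue:
            continue
        # "sp|P81576|NAME_SPECIES" -> "P81576"; otherwise keep the ID as is.
        parts = subject.split("|")
        accession = parts[1] if len(parts) >= 3 else subject
        # Lower e-value wins; ties are broken by the higher bit score.
        key = (evalue, -bitscore)
        if query not in best or key < best[query][0]:
            best[query] = (key, accession)
    return {query: accession for query, (_, accession) in best.items()}


# Made-up rows for illustration; real input would be the BLAST output file.
example = (
    "gene1\tsp|P81576|AAAA_XXXXX\t90.0\t100\t10\t0\t1\t100\t1\t100\t1e-50\t200\n"
    "gene1\tsp|Q00001|BBBB_XXXXX\t60.0\t100\t40\t0\t1\t100\t1\t100\t1e-10\t80\n"
    "gene2\tsp|Q00002|CCCC_XXXXX\t30.0\t50\t35\t0\t1\t50\t1\t50\t0.5\t20\n"
)
hits = best_swissprot_hits(example.splitlines())
print(hits)
```

The unique accessions in `hits.values()` form a single ID type, so they can go to the UniProt ID mapping service as one batch.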

2 replies

laurahspencer Mar 8, 2022
Collaborator Author

Ah- that makes a lot of sense. I didn't realize that having multiple ID types would mess up conversion - I assumed it would recognize some, and discard others that didn't match the criteria. Thanks for having a look!

sr320 Mar 8, 2022
Maintainer

I think in early versions of DAVID it could actually use multiple ID types, but now it does not...
