Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Using web history instead of ids returns more results than expected #193

Open
stitam opened this issue Jan 8, 2024 · 5 comments
Open

Using web history instead of ids returns more results than expected #193

stitam opened this issue Jan 8, 2024 · 5 comments

Comments

@stitam
Copy link

stitam commented Jan 8, 2024

When I run entrez_search() with use_history and retmax = 5, the function will respect this and return 5 UIDs:

biosample_uid <- entrez_search(db = "biosample", term = "Microthrix parvicella", use_history = TRUE, retmax = 5)
length(biosample_uid$ids)
5

When I convert these to assembly UID without using web history, the entrez_link() will return 5 UIDs, as expected:

assembly_uid <- entrez_link(dbfrom = "biosample", id = biosample_uid$ids, db = "assembly")
length(assembly_uid$links$biosample_assembly)
5

However, when I use the web history, the function will return more than 5 UIDs:

assembly_uid <- entrez_link(dbfrom = "biosample", web_history = biosample_uid$web_history, db = "assembly")
length(assembly_uid$links$biosample_assembly)
14

Even if I set retmax:

assembly_uid <- entrez_link(dbfrom = "biosample", web_history = biosample_uid$web_history, db = "assembly", retmax = 5)
length(assembly_uid$links$biosample_assembly)
14

It seems this is not an entrez_link() thing, it also occurs e.g. with entrez_fetch():

meta <- entrez_fetch(db = "biosample", web_history = biosample_uid$web_history, rettype = "full", retmode = "xml")
xmllist <- xml2::as_list(xml2::read_xml(meta))[[1]]
length(xmllist)
16

Again, ids instead of web history work fine:

meta <- entrez_fetch(db = "biosample", id = biosample_uid$ids, rettype = "full", retmode = "xml")
xmllist <- xml2::as_list(xml2::read_xml(meta))[[1]]
length(xmllist)
5

Any idea what is going on here? Many thanks.

@allenbaron
Copy link

It looks like entrez_link() doesn't support the retmax parameter (see The E-utilities In-Depth: Parameters, Syntax and More). The web history contains all the search hits (not affected by retmax), so the subsequent entrez_link() call using the web history returns them all as well.

entrez_fetch() does support the retmax parameter, so meta <- entrez_fetch(db = "biosample", web_history = biosample_uid$web_history, rettype = "full", retmode = "xml", retmax = 5) returns the only 5 results.

https://www.nlm.nih.gov/dataguide/eutilities/history.html#storing-results-to-the-history-server-using-esearch may also have relevant information.

rentrez does have entrez_post() that can be used to modify web history. Usually, the application is adding IDs to a web history. I can't speak to removing or limiting the IDs in a web history. I guess it might be possible to create a new web history from the existing one but I can't say how off the top of my head. You'd have to dig into the documentation.

@stitam
Copy link
Author

stitam commented Jan 9, 2024

Many thanks @allenbaron for pointing out that entrez_fetch() also supports retmax I think this will solve my issue.

@stitam
Copy link
Author

stitam commented Jan 11, 2024

I can confirm that entrez_fetch() works perfectly with histories; if a run entrez_search() with history and receive e.g. dozens of thousands of ids, I can use a for loop and iteratively adjust the retstart argument to get all the data in chunks, using the history. If I want, I can also adjust the retmax argument to control the size of the chunks. Thanks @allenbaron for pointing this out!

However, entrez_link() behaviour is strange. What I want to do is run entrez_search() to get many many ids, then run entrez_link() to convert these ids and finally entrez_fetch() to get the data; and I want to use histories for this flow. If I break down my entrez_search() using retstart and retmax, then I will have multiple webhistory objects, but entrez_link() will not respect them, which risks that I will have duplicates in the end. However, if entrez_search() (with a huge retmax value) returns all ids in one go, entrez_link() will not convert all of them:

biosample_uid <- entrez_search(db = "biosample", term = "Escherichia coli", use_history = TRUE, retmax = 20000)
length(biosample_uid$ids)
20000
assembly_uid <- entrez_link(dbfrom = "biosample", web_history = biosample_uid$web_history, db = "assembly")
length(assembly_uid$links$biosample_assembly)
294

So in this case I lose a large number of ids. My questions:

  • Do you folks think there is an issue with entrez_link() behaviour?
  • Is this an NCBI thing we have to accept or is this an rentrez issue that can be fixed?

Any idea how I can efficently e.g. convert all E.coli BioSample IDs to Assembly IDs and then fetch the assembly summaries?

@vtorda
Copy link

vtorda commented Feb 14, 2024

Hi,
I have recently played with rentrez, and I have noticed that not necessarily all linked ID will have data in the target database. I searched for taxon ids based on a term that I linked with the nuccore database. I noticed that even though there is a taxon ID linked with nuccore, it won't fetch you nuccore data. Usually, these IDs stand for general higher-level taxa or unclassified terms. Could be similar in your case? Also, why would you expect to get as many assemblies and as many biosample IDs you have?

@stitam
Copy link
Author

stitam commented Feb 19, 2024

Thanks @vtorda, I suspect my issue is different. I would not expect all biosamples to have a corresponding assembly but for e.g. Escherichia coli I would expect to find many biosamples and also many assemblies. If I query the NCBI Assembly database directly:

assembly_uid_no_link <- entrez_search(db = "assembly", term = "Escherichia coli", retmax = 20000)
length(assembly_uid_no_link$ids)
20000

So if I look for E. coli assemblies directly, I get 20k of them (FYI, there are currently >250k assemblies overall on NCBI), however, if I first look for E. coli biosamples and then link them to assemblies, I get 294. Since all assemblies have corresponting biosamples, I would expect to get the same number of assemblies from 1. requesting biosample ids and converting to assembly ids and 2. requesting assembly ids directly.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants