Using web history instead of ids returns more results than expected #193

stitam · 2024-01-08T22:30:28Z

When I run entrez_search() with use_history and retmax = 5, the function will respect this and return 5 UIDs:

biosample_uid <- entrez_search(db = "biosample", term = "Microthrix parvicella", use_history = TRUE, retmax = 5)
length(biosample_uid$ids)
5

When I convert these to assembly UID without using web history, the entrez_link() will return 5 UIDs, as expected:

assembly_uid <- entrez_link(dbfrom = "biosample", id = biosample_uid$ids, db = "assembly")
length(assembly_uid$links$biosample_assembly)
5

However, when I use the web history, the function will return more than 5 UIDs:

assembly_uid <- entrez_link(dbfrom = "biosample", web_history = biosample_uid$web_history, db = "assembly")
length(assembly_uid$links$biosample_assembly)
14

Even if I set retmax:

assembly_uid <- entrez_link(dbfrom = "biosample", web_history = biosample_uid$web_history, db = "assembly", retmax = 5)
length(assembly_uid$links$biosample_assembly)
14

It seems this is not an entrez_link() thing, it also occurs e.g. with entrez_fetch():

meta <- entrez_fetch(db = "biosample", web_history = biosample_uid$web_history, rettype = "full", retmode = "xml")
xmllist <- xml2::as_list(xml2::read_xml(meta))[[1]]
length(xmllist)
16

Again, ids instead of web history work fine:

meta <- entrez_fetch(db = "biosample", id = biosample_uid$ids, rettype = "full", retmode = "xml")
xmllist <- xml2::as_list(xml2::read_xml(meta))[[1]]
length(xmllist)
5

Any idea what is going on here? Many thanks.

The text was updated successfully, but these errors were encountered:

allenbaron · 2024-01-08T23:17:08Z

It looks like entrez_link() doesn't support the retmax parameter (see The E-utilities In-Depth: Parameters, Syntax and More). The web history contains all the search hits (not affected by retmax), so the subsequent entrez_link() call using the web history returns them all as well.

entrez_fetch() does support the retmax parameter, so meta <- entrez_fetch(db = "biosample", web_history = biosample_uid$web_history, rettype = "full", retmode = "xml", retmax = 5) returns the only 5 results.

https://www.nlm.nih.gov/dataguide/eutilities/history.html#storing-results-to-the-history-server-using-esearch may also have relevant information.

rentrez does have entrez_post() that can be used to modify web history. Usually, the application is adding IDs to a web history. I can't speak to removing or limiting the IDs in a web history. I guess it might be possible to create a new web history from the existing one but I can't say how off the top of my head. You'd have to dig into the documentation.

stitam · 2024-01-09T10:44:57Z

Many thanks @allenbaron for pointing out that entrez_fetch() also supports retmax I think this will solve my issue.

stitam · 2024-01-11T09:49:13Z

I can confirm that entrez_fetch() works perfectly with histories; if a run entrez_search() with history and receive e.g. dozens of thousands of ids, I can use a for loop and iteratively adjust the retstart argument to get all the data in chunks, using the history. If I want, I can also adjust the retmax argument to control the size of the chunks. Thanks @allenbaron for pointing this out!

However, entrez_link() behaviour is strange. What I want to do is run entrez_search() to get many many ids, then run entrez_link() to convert these ids and finally entrez_fetch() to get the data; and I want to use histories for this flow. If I break down my entrez_search() using retstart and retmax, then I will have multiple webhistory objects, but entrez_link() will not respect them, which risks that I will have duplicates in the end. However, if entrez_search() (with a huge retmax value) returns all ids in one go, entrez_link() will not convert all of them:

biosample_uid <- entrez_search(db = "biosample", term = "Escherichia coli", use_history = TRUE, retmax = 20000)
length(biosample_uid$ids)
20000
assembly_uid <- entrez_link(dbfrom = "biosample", web_history = biosample_uid$web_history, db = "assembly")
length(assembly_uid$links$biosample_assembly)
294

So in this case I lose a large number of ids. My questions:

Do you folks think there is an issue with entrez_link() behaviour?
Is this an NCBI thing we have to accept or is this an rentrez issue that can be fixed?

Any idea how I can efficently e.g. convert all E.coli BioSample IDs to Assembly IDs and then fetch the assembly summaries?

vtorda · 2024-02-14T10:22:31Z

Hi,
I have recently played with rentrez, and I have noticed that not necessarily all linked ID will have data in the target database. I searched for taxon ids based on a term that I linked with the nuccore database. I noticed that even though there is a taxon ID linked with nuccore, it won't fetch you nuccore data. Usually, these IDs stand for general higher-level taxa or unclassified terms. Could be similar in your case? Also, why would you expect to get as many assemblies and as many biosample IDs you have?

stitam · 2024-02-19T14:57:33Z

Thanks @vtorda, I suspect my issue is different. I would not expect all biosamples to have a corresponding assembly but for e.g. Escherichia coli I would expect to find many biosamples and also many assemblies. If I query the NCBI Assembly database directly:

assembly_uid_no_link <- entrez_search(db = "assembly", term = "Escherichia coli", retmax = 20000)
length(assembly_uid_no_link$ids)
20000

So if I look for E. coli assemblies directly, I get 20k of them (FYI, there are currently >250k assemblies overall on NCBI), however, if I first look for E. coli biosamples and then link them to assemblies, I get 294. Since all assemblies have corresponting biosamples, I would expect to get the same number of assemblies from 1. requesting biosample ids and converting to assembly ids and 2. requesting assembly ids directly.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Using web history instead of ids returns more results than expected #193

Using web history instead of ids returns more results than expected #193

stitam commented Jan 8, 2024 •

edited

Loading

allenbaron commented Jan 8, 2024

stitam commented Jan 9, 2024

stitam commented Jan 11, 2024 •

edited

Loading

vtorda commented Feb 14, 2024

stitam commented Feb 19, 2024

Using web history instead of ids returns more results than expected #193

Using web history instead of ids returns more results than expected #193

Comments

stitam commented Jan 8, 2024 • edited Loading

allenbaron commented Jan 8, 2024

stitam commented Jan 9, 2024

stitam commented Jan 11, 2024 • edited Loading

vtorda commented Feb 14, 2024

stitam commented Feb 19, 2024

stitam commented Jan 8, 2024 •

edited

Loading

stitam commented Jan 11, 2024 •

edited

Loading