Web history object - No esummary records found in file #178

d-caraballo opened this issue Apr 27, 2022 · 6 comments

@d-caraballo

Hi, I am trying to download all available bat coronaviruses. I used the following query:
bat_cov_ids<-entrez_search(db="nuccore", term="Bat coronavirus", retmax = 10000)
This returned 4214 hits, which I could access using entrez_summary coupled with the get-metadata function.

Now I am trying to compare these results with a different search strategy. I want to search for all coronavirus sequences in the "nuccore" database and then filter for bat hosts using the standardised taxonomy, as in the tutorial.

I use the following code:

covs<-entrez_search(db="nuccore", term="txid11118[Organism]")

Which yields:
Entrez search result with 4715446 hits (object contains 20 IDs and a web_history object)
Search term (as translated): txid11118[Organism]

Then I use entrez_summary:
entrez_summary(db="nuccore", web_history=covs$web_history)

And I get the message:
Error during wrapup: No esummary records found in file

What is going wrong??

@allenbaron

You're trying to retrieve too many records, and the only response from the server is "Too many UIDs in request. Maximum number of UIDs is 500 for JSON format output."

@d-caraballo
Author

Thanks, Allen. But wasn't the point of using web_history precisely to avoid the "large request" problem? How can I get the complete record set (4.7E6 hits!) and then filter by host species?

@allenbaron

I'm sorry to disappoint you, but you're going to have to do some extra work here if you want this to work. rentrez cannot handle this use case without extra coding.

Before you do anything else, I recommend you review the E-Utilities documentation, particularly where it discusses large requests in Usage Guidelines and Requirements.

rentrez does instantiate an Entrez History object when use_history = TRUE in entrez_search. An Entrez History object is basically required for large requests (> 200 records, I think), but the Entrez Utilities still limit how many records you can retrieve in a single request. For ESummary the limit depends on the record format requested: 500 for JSON and 10,000 for XML (for more details about each utility, see The E-utilities In-Depth: Parameters, Syntax and More). Obtaining more than that from a History object is possible but requires paging (see "Minimizing the Number of Requests" in the E-Utilities documentation; the Application 3 link provides an example of paging).

rentrez does not have the ability to page, so it will not work with the History object created. You could do this using the E-direct utilities on the command line, which I recommend if you are serious about getting this data. It might also be possible to get all the record IDs from entrez_search() and then request them in chunks of 10,000 with entrez_summary(), but be aware that a bug in rentrez prevents this from working (see PR #174). I fixed this specific issue in a fork when I realized rentrez is not being actively maintained.
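
The chunking step mentioned above can be sketched as follows. This is only an illustration: chunk_ids is a hypothetical helper name, not part of rentrez, and it only splits a vector of IDs into batches. It deliberately does not call entrez_summary(), since the comment above reports a bug in that step (see PR #174).

```r
# Sketch only: split a vector of record IDs into batches of at most `chunk_size`.
# chunk_ids() is a hypothetical helper, not a rentrez function.
chunk_ids <- function(ids, chunk_size = 10000) {
  stopifnot(chunk_size > 0)
  # Assign each ID a batch number (1, 1, ..., 2, 2, ...) and split on it.
  batch <- ceiling(seq_along(ids) / chunk_size)
  unname(split(ids, batch))
}

# Example with made-up IDs: 25 IDs in batches of 10 gives sizes 10, 10 and 5.
batches <- chunk_ids(as.character(1:25), chunk_size = 10)
print(lengths(batches))
```

Each element of the returned list could then be passed as the id argument of one request, with the batch size chosen to stay within the per-format limit quoted earlier in this thread.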

One more thing to consider: the first 10,000 records of your request have a size of 221 MB.

@LauraVP1994

You seem to have the same problem as I have. I did find a way around this problem (at least it worked for me with pubmed). You can use an lapply or for loop, I included my code in issue #180.
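
A loop of the kind described above might look like the sketch below. It is not the code from issue #180. The helper names page_offsets and fetch_summary_pages are hypothetical, and the sketch assumes that entrez_summary() forwards extra arguments such as retstart and retmax to the E-utilities request through its `...` argument; that behaviour is not confirmed in this thread.

```r
# Sketch only: page through an Entrez History result in fixed-size windows.
# page_offsets() and fetch_summary_pages() are hypothetical helper names.

# Zero-based retstart values needed to cover `total` records.
page_offsets <- function(total, page_size = 500) {
  stopifnot(total >= 0, page_size > 0)
  if (total == 0) return(numeric(0))
  seq(0, total - 1, by = page_size)
}

# Assumption: entrez_summary() passes retstart/retmax on to the server via `...`.
# `search` is the result of entrez_search(..., use_history = TRUE).
fetch_summary_pages <- function(search, db, page_size = 500, pause = 0.34) {
  lapply(page_offsets(search$count, page_size), function(start) {
    Sys.sleep(pause)  # stay under 3 requests per second without an API key
    rentrez::entrez_summary(db = db, web_history = search$web_history,
                            retstart = start, retmax = page_size)
  })
}

# Example (not run; needs network access):
# covs  <- rentrez::entrez_search(db = "nuccore", term = "txid11118[Organism]",
#                                 use_history = TRUE)
# pages <- fetch_summary_pages(covs, db = "nuccore")
```

With 4,715,446 hits and the JSON limit of 500 records per request, this would mean roughly 9,400 requests, so narrowing the search term first is probably worthwhile.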

@allenbaron

Ideally, rentrez would be updated to implement the E-utilities paging feature with a web history.

@J-Moravec

Encountered the same issue:

rentrez::entrez_summary(db="gds", web_history=esearch$web_history)
# Esummary includes error message: Too many UIDs in request. Maximum number of UIDs is 500 for JSON format output. 

This got more confusing when I specified retmode="XML" in the hope that it would fix the problem:

rentrez::entrez_summary(db="gds", web_history=esearch$web_history, retmode="XML")
# Error in UseMethod("parse_esummary") : 
# no applicable method for 'parse_esummary' applied to an object of class "character"

Since the documentation specifically says to use the web_history argument when the number of records is too large, it should also state that web_history is not a panacea and explain how to work with a large number of records.

I will try to submit a PR once I figure out how to do it cleanly.
