Search and fetch are providing the wrong data? #185
Comments
When trying to get cross-database data (i.e. sequences from the nucleotide database corresponding to samples from the biosample database), it's necessary to link between the two databases:
library(rentrez)
library(XML)
search <- rentrez::entrez_search(
db = "biosample",
term = "SAMN30954130[ACCN]",
retmax = 9999,
use_history = TRUE
)
nuc_id <- rentrez::entrez_link(
dbfrom = "biosample",
web_history = search$web_history,
db = "nucleotide"
)
fetch_test <- rentrez::entrez_fetch(
db = "nucleotide",
id = nuc_id$links$biosample_nuccore,
rettype = "xml"
)
fetch_list <- XML::xmlToList(fetch_test)
Created on 2023-01-27 by the reprex package (v2.0.1)
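One way to confirm that the linked nucleotide records belong to the expected organism is to tabulate the organism reported in their summaries before fetching the full records. This is only a minimal sketch: it assumes each nucleotide summary exposes an `organism` element, and `organism_counts()` and `check_linked_organisms()` are hypothetical helper names, not part of rentrez.

```r
# Hypothetical helper: tabulate the `organism` element across a list of
# summary records (each record is a list with an `organism` element).
organism_counts <- function(summaries) {
  organisms <- vapply(
    summaries,
    function(s) as.character(s$organism),
    character(1)
  )
  table(organisms)
}

# Hypothetical wrapper: request summaries for nucleotide IDs and tabulate them.
# Requires network access and the rentrez package.
check_linked_organisms <- function(nuc_ids) {
  summaries <- rentrez::entrez_summary(
    db = "nucleotide",
    id = nuc_ids,
    always_return_list = TRUE  # keep a list even when there is a single ID
  )
  organism_counts(summaries)
}

# Usage with the objects from the reprex above (not run here):
# check_linked_organisms(nuc_id$links$biosample_nuccore)
```

If the table shows anything other than the organism you searched for, the IDs were probably not linked from the biosample results.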
Thank you! How would I scale this up to get the data from all the sequences that I need? Using this code, I can perform the search and link, but I can't seem to perform entrez_fetch using a list of linked IDs because the list is too long.
`search <- rentrez::entrez_search( nuc_id <- rentrez::entrez_link( #request is too large fetch_list <- XML::xmlToList(fetch_test)`
After some searching, I tried to change the link function to get a web_history and fetch that way, but this code produces an error (HTTP failure: 400):
`search <- rentrez::entrez_search( #this seems to be working now #request is too large fetch_list <- XML::xmlToList(fetch_test)`
rentrez does have a bug with the post method (see my comment in PR #163) but I don't think that should affect you if you're only using the It may be an issue with the number of records you're requesting at a time; see issue #178 for possible help.
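If the number of records per request is the problem, one possible workaround is to split the linked IDs into smaller batches and fetch each batch separately. The sketch below is not taken from issue #178. `chunk_ids()` and `fetch_in_chunks()` are hypothetical helpers, and the batch size of 200 and the half-second pause are arbitrary assumptions that may need tuning.

```r
# Hypothetical helper: split a vector of IDs into chunks of at most
# `chunk_size` elements, preserving order.
chunk_ids <- function(ids, chunk_size = 200) {
  split(ids, ceiling(seq_along(ids) / chunk_size))
}

# Hypothetical helper: fetch each chunk separately and return a list with
# one XML string per chunk. Requires network access and the rentrez package.
fetch_in_chunks <- function(ids, chunk_size = 200, pause = 0.5) {
  lapply(chunk_ids(ids, chunk_size), function(chunk) {
    Sys.sleep(pause)  # conservative pause between requests
    rentrez::entrez_fetch(db = "nucleotide", id = chunk, rettype = "xml")
  })
}

# Usage with the linked IDs from the earlier example (not run here):
# xml_chunks <- fetch_in_chunks(nuc_id$links$biosample_nuccore)
# fetch_lists <- lapply(xml_chunks, XML::xmlToList)
```

Keeping the chunking in its own function makes it easy to check the batch sizes before any request is sent.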
I am trying to download sequence data from E. coli samples within the state of Washington - it's about 1283 sequences, which I know is a lot. The problem that I am running into is that entrez_search and/or entrez_fetch seem to be pulling the wrong data. For example, the following code does pull 1283 IDs, but when I use entrez_fetch on those IDs, the sequence data I get is from chickens and corn and things that are not E. coli:
search <- entrez_search(db = "biosample", term = "Escherichia coli[Organism] AND geo_loc_name=USA:WA[attr]", retmax = 9999, use_history = T)
Similarly, I tried pulling the sequence from one sample manually as a test. When I search for the accession number SAMN30954130 on the NCBI website, I see metadata for an E. coli sample. When I use this code, I see metadata for a chicken:
search <- entrez_search(db = "biosample", term = "SAMN30954130[ACCN]", retmax = 9999, use_history = T)
fetch_test <- entrez_fetch(db = "nucleotide", id = search$ids, rettype = "xml")
fetch_list <- xmlToList(fetch_test)