Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Making entrez_fetch output accessible in R #179

Open
ellalalalalalala opened this issue Jun 17, 2022 · 3 comments
Open

Making entrez_fetch output accessible in R #179

ellalalalalalala opened this issue Jun 17, 2022 · 3 comments

Comments

@ellalalalalalala
Copy link

Hi!

Thanks for your nice package for communicating with NCBI in R.

I have a list of transcript accessions, of which I want to extract the GenBank entries. My list contains 1232 accessions, which I found to be too much for fetching (HTTP failure 414, the request is too large) and web_history option didn´t help. I read several issues about that problem, e.g. #163, and found out that reducing the input list might solve it. I was now successful by reducing my list to 200 accessions (500 did not work).

Now I´m running into the next problem, as my fetched file does not seem to be accessible.

I tried:

interest_acc<-entrez_fetch(db="nuccore", id=interest_list_sh1, rettype="native", parsed=T)

but get the error "XML content does not seem to be XML [....]". If I take the "parsed=T" out, the entrez_fetch command works, but I dont know how to make the fetched file accessible (xmlToList as shown in issue #113 doesnt work due to the XML incompatibility issue, either). Seems to be a similar problem as #91.

Is there any possiblility, how to access the information? I would actually only need the information, in which nucleotide positions of my transcript of interest the CDS starts, and in which it ends. So maybe I could also fetch a more simple record than the whole "native" one?

Thanks a lot for any help or suggestions!

@allenbaron
Copy link

That output is definitely not xml. I looked into the code of entrez_fetch and it appears to be passing the retmode argument with "" as input, which NCBI is interpreting as null and using the default retmode which is text. See the NCBI Entrez Utilities manual chapter on EFetch.

It works for me with retmode = "xml".

@ellalalalalalala
Copy link
Author

ellalalalalalala commented Jun 20, 2022

Hi @allenbaron , thanks a lot for your fast answer and help!!

It works perfectly fine with the retmode addition. :)

interest_acc<-entrez_fetch(db="nucleotide", id=interest_list_sh1, rettype="db", retmode = "xml", parsed=T)
interest_acc1<-xmlToList(interest_acc)

I still run into two issues and would be very happy for any suggestions (I´m still an R beginner...).

After running entrez_fetch with a list of 181 accessions as a search input (id=) and subsequently running xmlToList, I end up with a huge list containing GBSeq entries.

Strangely, all 181 GBSeq entries belong to the same GBSeq_locus (the first accession of my list). I checked the NCBI Entrez Utilities manual chapter on EFetch, but to me my input looks as is should (comma-delimited list of accessions):

> interest_list_sh1
[1] "NM_001271885.2,NM_005763.4,NM_004996.4,XM_017020319.1,NM_007011.8,NM_001012750.3,NM_001282925.3,NM_006111.3,NM_014384.3,NM_012287.6,NM_001042473.4,NM_001039844.3,NM_001278352.2,NM_005736.4,NM_001278579.2,NM_001382777.1,NM_001025107.3,NM_001207008.2,NM_001010982.5,NM_001035507.3,NM_001164623.3,NM_018361.5,NM_001321441.2,NM_022831.4,NM_032876.6,NM_001199852.2,NM_001005353.3,NM_001270546.1,NM_005088.3,NM_001206646.2,NM_001136275.2,NM_001135745.2,NM_001002246.2,NM_001002248.3,NM_001002249.2,NM_016476.11,NM_001242374.1,NM_001242546.2,NM_001257999.3,NM_015114.3,XM_024449382.1,NM_030816.5,NM_001164440.2,NM_032208.3,NM_001286780.2,XM_017015231.2,NM_001272071.2,NM_003916.5,NM_001030006.2,NM_019043.4,NM_001166050.2,NM_020531.3,NM_001024226.2,NM_001025593.3,NM_001030055.2,NM_001198665.2,NM_014786.4,NM_001173479.2,NM_001177.6,NM_001037174.2,NM_001009584.2,NR_033669.2,NM_080863.5,NM_001198798.2,NM_001321967.2,NM_001030287.4,NM_001675.4,NM_182810.3,NM_001290048.2,NM_001367549.1,NM_001001323.2,NM_001001396.3,NM_001002031.4,NM_001039457.3,NM_001695.5,NM_001258388.2,NM_000489.6,NM_001127696.2,NM_001301668.3,NM_001319075.2,NM_001378495.1,NM_012342.3,NM_004656.4,NM_001271606.2,NM_001291429.2,NM_001300905.2,NM_001170714.3,NM_005872.3,NM_001318975.1,NM_001320715.2,NM_001191.4,NM_001184772.3,NM_001313998.2,NM_001378149.1,NM_001320632.2,NM_001167820.2,NM_006129.5,NM_004329.3,NM_001204.7,NM_001301206.2,NM_001330491.2,NM_004052.4,NM_032515.5,NM_001286149.2,NM_007371.4,NM_032352.4,XM_005250671.5,NM_001003676.3,NM_001145199.2,NM_017924.4,NM_014117.3,NM_001195124.3,NM_001100621.3,NM_001008239.3,NR_024113.2,NM_001286173.2,NM_001013649.4,NM_018356.3,NM_175921.6,NM_001029863.2,NM_001286635.2,NM_001303039.2,NM_023080.3,NM_017998.3,NM_001261390.2,NM_001363669.2,NM_001159772.2,NM_001024649.2,NM_001105530.2,NM_001042476.2,NM_007359.5,NM_001224.5,NM_001145064.3,NM_001172895.1,NM_001206747.2,NM_004657.6,NM_001127321.1,NM_001303494.2,NM_133459.4,NM_001321829.1,NM_001099225.2,NM_005436.5,NM_175884.6,NM_001346100.2,NM_001199189.2,NM_001142300.2,NM_001009186.2,NR_003110.2,NM_001142401.3,NM_000610.4,NM_001789.3,NM_007065.4,NM_001270436.2,NM_001039802.2,NM_001038702.2,NM_024529.5,NM_003503.4,NM_080668.4,NM_015083.4,NM_001300960.2,NM_001145306.2,NM_001322373.2,NM_001172640.2,NM_016343.4,NM_022909.4,NM_001199803.3,NM_001271006.2,NM_001318219.1,NM_018131.5,NM_001350652.2,NM_005507.3,NM_138441.3,NM_001008390.2,NM_001164144.3,NM_001114121.2,NM_001114122.3,NM_012110.4,NM_001144073.2,NM_007236.5,NM_001389651.1,NM_014918.5"
> class(interest_list_sh1)
[1] "character"

Would you have an idea what could be wrong with my id input?

Second point: to find out the CDS start and end position, I used rettype="native" instead rettype="db". I screened the huge output list for the information on CDS span within the respective accession. I found the information in the path: interest_acc1$"Bioseq-set_seq-set"$"Seq-entry"$"Seq-entry_set"$"Bioseq-set"$"Bioseq-set_annot"$"Seq-annot"$"Seq-annot_data"$"Seq-annot_data_ftable"$"Seq-feat"$"Seq-feat_location"$"Seq-loc"$"Seq-loc_int"$"Seq-interval" and were able to drop the output into a new list; but the information to which accession the annotation info belongs is somehow lost in interest_acc1. Do you have any recommendation on how to make the information I need (transcript accession with corresponding cds start and end) accessible? Maybe I can solve this problem myself after the first issue is solved!

Thanks a lot in advance and have a great day!

@allenbaron
Copy link

I could not reproduce your first problem. If you run into more problems, I highly recommend creating a "reprex" (reproducible example). They provide a solid starting point for help with troubleshooting and identifying problems.

Here's an example with the first six accessions in your list:

library(rentrez)

interest_list_sh1 <- c("NM_001271885.2", "NM_005763.4", "NM_004996.4",
                       "XM_017020319.1", "NM_007011.8", "NM_001012750.3")

interest_acc <- rentrez::entrez_fetch(
    db="nucleotide",
    id=interest_list_sh1,
    rettype="db",
    retmode = "xml",
    parsed=T
)

interest_acc1 <- XML::xmlToList(interest_acc)

sapply(interest_acc1, function(.x) .x$GBSeq_locus)
#>          GBSeq          GBSeq          GBSeq          GBSeq          GBSeq 
#> "NM_001271885"    "NM_005763"    "NM_004996" "XM_017020319"    "NM_007011" 
#>          GBSeq 
#> "NM_001012750"

Created on 2022-06-20 by the reprex package (v2.0.1)

For problem 2, converting the XML to a list from the "native" type and then extracting the relevant info would be really painful and I would avoid it. If you have to get the information from that I'd suggest you look into using XML selectors on the XML directly and don't do the XML::xmlToList() step.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants