IntChron parser #115

Open
joeroe opened this issue Oct 8, 2020 · 6 comments · May be fixed by #147

joeroe (Contributor) commented Oct 8, 2020

IntChron <https://intchron.org/> is listed in #2 as not being included because it requires web scraping. However, after spending some time playing with it, I wonder if this might be revisited.

Essentially, IntChron seems to do the same thing as c14bazAAR—systematically compile dates from existing databases—but with a web-based API. An IntChron parser would be more complicated than the existing parsers because, as far as I can tell, there is no way to extract the entire database as a single file. But it should still be possible to retrieve it without resorting to web scraping. The key is that every HTML page on IntChron can also be accessed in csv, json, or txt format, including the "index" pages that eventually lead you to individual date records. I think it could be worth the extra complexity because IntChron does seem to include a lot of dates (for example the entire ORAU database), and it's backed by the Oxford C14 Lab, so it's likely to grow over time.
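
For example, the host index can be pulled straight into R (just a sketch; I'm assuming here that the .csv endpoints return a plain table with no extra preamble lines, which would need checking):

# Sketch only: assumes https://intchron.org/host.csv is plain CSV with no metadata rows to skip
hosts <- read.csv("https://intchron.org/host.csv", stringsAsFactors = FALSE)
head(hosts)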

I can think of a few ways you could approach this, depending on how much flexibility you want to give the user. At the simplest, one could implement a multi-stage parser in c14bazAAR (a rough sketch in R follows the list below):

  1. Retrieve the list of "hosts" (https://intchron.org/host.csv)
  2. Retrieve the list of records-by-country for each host (e.g. https://intchron.org/oxa/record.csv)
  3. Retrieve the list of sites for each country (e.g. https://intchron.org/record/oxa/Jordan.csv)
  4. Retrieve the list of dates for each site (e.g. https://intchron.org/record/oxa/Jordan/Dhuweila.csv)
  5. Parse and collate the dates (actually quite easy because the IntChron format is similar to c14bazAAR's)
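
A rough sketch of what that crawl could look like (illustration only; the URL patterns follow the examples above, but the column names in the index CSVs, such as `url`, are guesses and would need checking against the real files):

# Sketch of the multi-stage crawl described above. Assumes each index CSV has a
# `url` column pointing at its child pages, which is an assumption on my part.
read_intchron_csv <- function(url) {
  tryCatch(read.csv(url, stringsAsFactors = FALSE), error = function(e) NULL)
}

crawl_intchron <- function() {
  hosts <- read_intchron_csv("https://intchron.org/host.csv")   # step 1
  dates <- list()
  for (host in hosts$url) {                                     # step 2
    countries <- read_intchron_csv(paste0(host, ".csv"))
    for (country in countries$url) {                            # step 3
      sites <- read_intchron_csv(paste0(country, ".csv"))
      for (site in sites$url) {                                 # step 4
        dates[[site]] <- read_intchron_csv(paste0(site, ".csv"))
      }
    }
  }
  do.call(rbind, dates)                                         # step 5
}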

On the other end of the spectrum, one could write an R interface to IntChron as its own package, which c14bazAAR could then use as a dependency to retrieve either the entire database or a user-specified subset. That could be worthwhile if the IntChron standard does become widely used, but as things stand I'm not sure it's worth the extra effort.

I'd be happy to put some work into this, but I thought I would first raise the issue and ask whether you think it is something that fits into c14bazAAR, and what the best approach to doing it might be.

nevrome (Member) commented Oct 8, 2020

Very cool! - I was not aware of this option.

This does indeed sound like an application for a package of its own, because the data is not as monolithic as for most of the other "databases" (tables) in c14bazAAR. But writing a parser that simply collects everything may be a good first step in that direction, as you can ignore user input for now and nail down the tree merge algorithm first.

A PR would be very welcome! ORAU is extremely juicy.

joeroe (Contributor, Author) commented Oct 9, 2020

@nevrome That was my thinking too. I have a rough parser at joeroe/c14bazAAR/tree/intchron. It does seem to be worth it – crawling the full database returns over 11,000 dates, most of which are new for c14bazAAR:

# Crawl the full database starting from the host index
intchron <- get_intchron("https://intchron.org/host")
# Or, to save time, load a cached copy of the crawl:
# load("playground/intchron-cache-20201009.Rd")

# Unique dates retrieved from IntChron
length(unique(intchron$labcode))
#> [1] 11613

# How many of those are not already covered by the existing c14bazAAR parsers?
all <- get_c14data("all")
sum(!intchron$labcode %in% all$labnr)
#> [1] 9882

But it's extremely slow. Getting the whole database took about an hour on my fast university connection, because we have to make roughly 2,000 separate HTTP requests.

So I'm thinking that splitting this off into its own package is a good idea after all. That way you could provide functions for getting subsets of the full IntChron database (e.g. by host/source or by country) and encourage the user to work at that granularity in the c14bazAAR parser. Some sort of caching might also help.
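
For the caching, something as simple as memoising the per-page download might already go a long way (just an idea, not implemented in the branch):

# Idea only: cache each fetched page for the session, so repeated or overlapping
# queries don't re-issue the same ~2,000 HTTP requests.
library(memoise)
get_intchron_page <- function(url) read.csv(url, stringsAsFactors = FALSE)  # hypothetical helper
get_intchron_page_cached <- memoise(get_intchron_page)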

nevrome (Member) commented Oct 9, 2020

Alright - thanks for testing, excellent work! Downloading the whole thing is not feasible then, and a package of its own for specific queries is clearly the way to go.

Maybe one way to ensure interoperability with c14bazAAR would be to use the c14_date_list data format in this new package?
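
Roughly like this, maybe (column names made up for illustration; the real mapping from the IntChron fields would still have to be worked out):

# Sketch: once the IntChron columns are renamed to the c14bazAAR conventions
# (labnr, c14age, c14std, ...), the result can be coerced directly.
library(c14bazAAR)
intchron_dates <- data.frame(
  labnr  = "OxA-0000",  # made-up example row
  c14age = 9000,
  c14std = 50
)
intchron_dates <- as.c14_date_list(intchron_dates)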

joeroe (Contributor, Author) commented Oct 12, 2020

I've split the basic API interaction and querying off into its own package: joeroe/rintchron. I'll rewrite the parser on my intchron branch to use these instead. I also managed to get the time taken to retrieve the whole database down to 7 minutes (joeroe/rintchron#3), so I think we're close to it being viable to use as a normal c14bazAAR database, especially if there are separate parsers for ORAU, NCRF, etc.

nevrome (Member) commented Oct 12, 2020

Great job! So we could go through IntChron to get the data from the individual source databases? We could write a parser function get_orau() that calls rintchron::intchron(), something like the sketch below?
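
(A sketch only; the arguments of rintchron::intchron() and the column names in its output are guesses from my side and would need checking against the package.)

# Sketch of a thin ORAU parser on top of rintchron. "oxa" matches the host id
# in the URLs above, but the real interface of rintchron::intchron() and the
# column names of its result are assumptions.
get_orau <- function(...) {
  orau_raw <- rintchron::intchron("oxa", ...)
  orau <- data.frame(
    labnr  = orau_raw$labcode,         # assumed column name
    c14age = orau_raw$r_date,          # assumed
    c14std = orau_raw$r_date_sigma     # assumed
  )
  c14bazAAR::as.c14_date_list(orau)
}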

joeroe (Contributor, Author) commented Oct 12, 2020

I think that's the way to go, yeah.
