fixing vignette for API possible problems
Socorro Dominguez committed Aug 14, 2024
1 parent e6f9e2a commit d6c9306
Showing 1 changed file with 102 additions and 73 deletions.
175 changes: 102 additions & 73 deletions vignettes/neotoma2-package.Rmd
@@ -30,13 +30,13 @@ The [Neotoma Paleoecology Database](https://www.neotomadb.org) is a domain-speci
### Resources

* [Neotoma Homepage](https://www.neotomadb.org)
* The Neotoma homepage, with links to contacts, news and other tools and resources.
* The Neotoma homepage, with links to contacts, news and other tools and resources.
* [Neotoma Database Manual](https://open.neotomadb.org/manual/)
* Documentation for the database itself, with examples of SQL queries and descriptions of Neotoma tables.
* Documentation for the database itself, with examples of SQL queries and descriptions of Neotoma tables.
* [Neotoma (JSON) API](https://api.neotomadb.org/)
* A tool to obtain data in JSON format directly through calls to Neotoma.
* A tool to obtain data in JSON format directly through calls to Neotoma.
* [Neotoma GitHub Organization](https://github.com/NeotomaDB)
* Open code repositories to see how folks are using Neotoma and what kinds of projects we're working on.
* Open code repositories to see how folks are using Neotoma and what kinds of projects we're working on.
* Workshops and Code Examples ([see the section below](#workshops-and-code-examples))

## Neotoma Data Structure
@@ -186,7 +186,7 @@ We can also use `set_site()` as a tool to update the metadata associated with a
```r
# Update a value within an existing `sites` object:
longer_alex[[3]] <- set_site(longer_alex[[3]],
altitude = 3000)
altitude = 3000)
longer_alex
```

@@ -247,34 +247,54 @@ brazil <- '{"type": "Polygon",
# functionality of the `sf` package.
brazil_sf <- geojsonsf::geojson_sf(brazil)
brazil_datasets <- get_datasets(loc = brazil_sf)
brazil_datasets
brazil_datasets <- tryCatch({
get_datasets(loc = brazil_sf)
}, error = function(e) {
message("Failed to retrieve datasets for Brazil: ", e$message)
NULL
})
```

Now we have an object called `brazil_datasets` that contains `r length(brazil_datasets)` sites.

You can plot these findings!

```{r leafletBrazil}
plotLeaflet(brazil_datasets)
if (!is.null(brazil_datasets)) {
plotLeaflet(brazil_datasets)
} else {
cat("Datasets could not be retrieved due to an API error. Please try again later.")
}
```
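
The guard used in these chunks (wrap the API call in `tryCatch()`, return `NULL` on failure, and check for `NULL` before using the result) is repeated several times. A minimal sketch of how it could be factored into one helper is given here. The names `safe_call()` and `flaky_fetch()` are hypothetical and are not part of `neotoma2`; `flaky_fetch()` only simulates a request that may fail, so the example runs without network access.

```r
# Hypothetical helper (not part of neotoma2): evaluate an expression and
# return NULL, with a message, if it raises an error.
safe_call <- function(expr, label = "request") {
  tryCatch(
    expr,  # the promise is forced here, so errors are caught below
    error = function(e) {
      message("Failed to complete ", label, ": ", conditionMessage(e))
      NULL
    }
  )
}

# Stand-in for an API call; it fails on demand to simulate an outage.
flaky_fetch <- function(fail) {
  if (fail) stop("simulated API outage")
  c("site_a", "site_b")
}

ok  <- safe_call(flaky_fetch(FALSE), label = "toy fetch")
bad <- safe_call(flaky_fetch(TRUE), label = "toy fetch")

# Downstream code checks for NULL before using the result.
if (!is.null(bad)) {
  print(bad)
} else {
  cat("Data could not be retrieved. Please try again later.\n")
}
```

Under the same assumption, a call such as `safe_call(get_datasets(loc = brazil_sf), label = "datasets for Brazil")` would behave like the `tryCatch()` block in the chunk above.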

## Filtering Records

Sometimes we take a large number of records, do some analysis, and then choose to select a subset. For example, we may want to select all sites in a region, and then subset those by dataset type. If we want to look at only the geochronological datasets from Brazil, we can start with the set of records returned from our `get_datasets()` query, and then use the `filter` function in `neotoma2` to select only those datasets that are geochronologic:

```{r filterBrazil}
brazil_dates <- neotoma2::filter(brazil_datasets,
datasettype == "geochronologic")
# or:
if (!is.null(brazil_datasets)) {
brazil_dates <- neotoma2::filter(brazil_datasets,
datasettype == "geochronologic")
} else {
cat("Datasets could not be filtered due to a previous API error. Please try again later.")
}
brazil_dates <- brazil_datasets %>%
neotoma2::filter(datasettype == "geochronologic")
# or:
if (!is.null(brazil_datasets)) {
brazil_dates <- brazil_datasets %>%
neotoma2::filter(datasettype == "geochronologic")
} else {
cat("Datasets could not be filtered due to a previous API error. Please try again later.")
}
# With boolean operators:
brazil_space <- brazil_datasets %>% neotoma2::filter(lat > -18 & lat < -16)
if (!is.null(brazil_datasets)) {
brazil_space <- brazil_datasets %>% neotoma2::filter(lat > -18 & lat < -16)
} else {
cat("Datasets could not be filtered due to a previous API error. Please try again later.")
}
```

The `filter()` function takes a datasets object as its first argument, followed by the criteria we want to use to filter. Currently supported criteria include:
@@ -308,17 +328,26 @@ brazil <- '{"type": "Polygon",
# functionality of the `sf` package.
brazil_sf <- geojsonsf::geojson_sf(brazil)
brazil_records <- get_datasets(loc = brazil_sf) %>%
neotoma2::filter(datasettype == "pollen" & age_range_young <= 1000 & age_range_old >= 10000) %>%
get_downloads(verbose = FALSE)
count_by_site <- samples(brazil_records) %>%
dplyr::filter(elementtype == "pollen" & units == "NISP") %>%
group_by(siteid, variablename) %>%
summarise(n = n()) %>%
group_by(variablename) %>%
summarise(n = n()) %>%
arrange(desc(n))
brazil_records <- tryCatch({
get_datasets(loc = brazil_sf) %>%
neotoma2::filter(datasettype == "pollen" & age_range_young <= 1000 & age_range_old >= 10000) %>%
get_downloads(verbose = FALSE)
}, error = function(e) {
message("Failed to retrieve records for Brazil: ", e$message)
NULL
})
if (!is.null(brazil_records)) {
count_by_site <- samples(brazil_records) %>%
dplyr::filter(elementtype == "pollen" & units == "NISP") %>%
group_by(siteid, variablename) %>%
summarise(n = n()) %>%
group_by(variablename) %>%
summarise(n = n()) %>%
arrange(desc(n))
} else {
cat("Records could not be retrieved due to an API error. Please try again later.")
}
```

In this code chunk we define the bounding polygon for our sites, filter by time and dataset type, and then return the full records for those sites. Because we used `get_downloads()`, we get a `sites` object with dataset and sample information. We call `samples()` to extract all the samples from the `sites` object, and then filter the resulting `data.frame` to keep only pollen (a pollen dataset may contain spores and other elements that are not, strictly speaking, pollen) counted as the number of identified specimens (NISP). We then `group_by()` the unique site identifiers (`siteid`) and the taxa (`variablename`) to count the number of times each taxon appears in each site. A second `group_by()` and `summarise()`, on `variablename` alone, then counts how many sites each taxon appears in. Finally, we `arrange()` the result so that the most common taxa come first in the resulting variable `count_by_site`.
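
The two-stage summary can be checked offline on a small, made-up table. The sketch below assumes only that `dplyr` (1.0.0 or later, for the `.groups` argument) is installed; `toy_samples` is hypothetical data that borrows the column names used above and does not come from Neotoma.

```r
library(dplyr)

# Hypothetical sample table: one row per counted sample.
toy_samples <- data.frame(
  siteid       = c(1, 1, 1, 2, 2, 3),
  variablename = c("Pinus", "Pinus", "Quercus", "Pinus", "Alnus", "Pinus"),
  elementtype  = "pollen",
  units        = "NISP",
  stringsAsFactors = FALSE
)

toy_counts <- toy_samples %>%
  dplyr::filter(elementtype == "pollen" & units == "NISP") %>%
  # Stage 1: one row per site/taxon pair (n = samples with that taxon).
  group_by(siteid, variablename) %>%
  summarise(n = n(), .groups = "drop") %>%
  # Stage 2: count the site/taxon pairs per taxon, i.e. the number of sites.
  group_by(variablename) %>%
  summarise(n = n(), .groups = "drop") %>%
  arrange(desc(n))

toy_counts
```

In this toy table *Pinus* occurs at all three sites, so it sorts to the top with `n = 3`, even though it was counted twice at site 1.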
@@ -335,14 +364,14 @@ The most simple case is a search for a publication based on one or more publicat

We can use a single publication ID or multiple IDs. In either case the API returns the publication(s) and creates a new `publications` object (which consists of multiple individual `publication`s).

```{r pubsbyid}
```{r pubsbyid, eval=FALSE}
one <- get_publications(12)
two <- get_publications(c(12, 14))
```

From there we can then subset and extract elements from the list using the standard `[[` format. For example:

```{r showSinglePub}
```{r showSinglePub, eval=FALSE}
two[[2]]
```

@@ -362,31 +391,31 @@ We can also use search elements to search for publications. The `get_publicatio
* `limit`
* `offset`

```{r fulltestPubSearch}
```{r fulltestPubSearch, eval=FALSE}
michPubs <- get_publications(search = "Michigan", limit = 2)
```

This results in a set of `r length(michPubs)` publications from Neotoma, equal to the `limit`. If the number of matching publications is less than the limit then the `length()` will be smaller.

Text matching in Neotoma is approximate, meaning it is a measure of the overall similarity between the search string and the set of article titles. This means that using a nonsense string may still return results:

```{r nonsenseSearch}
```{r nonsenseSearch, eval=FALSE}
noise <- get_publications(search = "Canada Banada Nanada", limit = 5)
```

This returns a result set of length `r length(noise)`.
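
To see why a nonsense string can still return results, it may help to look at a similarity ranking in miniature. The sketch below is an illustration only and makes no claim about how Neotoma actually scores matches: it uses base R's `utils::adist()` (edit distance) as a stand-in similarity measure, and both `similarity()` and `toy_titles` are hypothetical.

```r
# Illustrative similarity score in [0, 1]: 1 means identical strings.
# This is an assumption for demonstration, not Neotoma's matching method.
similarity <- function(query, titles) {
  d <- drop(utils::adist(query, titles, ignore.case = TRUE))
  1 - d / pmax(nchar(query), nchar(titles))
}

# Hypothetical article titles.
toy_titles <- c(
  "Pollen records from Canada",
  "Lake sediments of Michigan",
  "Diatoms of Patagonia"
)

scores <- similarity("Canada Banada Nanada", toy_titles)

# Every title receives a score, so ranking and truncating to `limit`
# always returns something, even when no title is a close match.
limit  <- 2
ranked <- toy_titles[order(scores, decreasing = TRUE)]
head(ranked, limit)
```

Because the results are ranked rather than thresholded, the result set is only empty when there are no titles to rank.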

Each returned publication includes the (Neotoma) ID, the citation, and the publication DOI (if that is stored in Neotoma). We can get the first publication using the standard `[[` nomenclature:

```{r getSecondPub}
```{r getSecondPub, eval=FALSE}
two[[1]]
```

The output will look similar to the output for `two` above; however, only a single result will be returned, and its class will be `publication` (as opposed to `publications`).

We can select an array of `publication` objects using the `[[` method, either as a sequence (`1:10`) or as a numeric vector (`c(1, 2, 3)`):

```{r subsetPubs}
```{r subsetPubs, eval=FALSE}
# Select publications with Neotoma Publication IDs 1 - 10.
pubArray <- get_publications(1:10)
# Select the first five publications:
@@ -398,58 +427,58 @@ subPub

Just as we can use the `set_site()` function to set new site information, we can also create new publication information using `set_publications()`. With `set_publications()` you can enter as much or as little of the article metadata as you'd like, but it's designed (in part) to use the CrossRef API to return information from a DOI.

```{r setNewPub}
```{r setNewPub, eval=FALSE}
new_pub <- set_publications(
articletitle = "Myrtle Lake: a late- and post-glacial pollen diagram from northern Minnesota",
journal = "Canadian Journal of Botany",
volume = 46)
articletitle = "Myrtle Lake: a late- and post-glacial pollen diagram from northern Minnesota",
journal = "Canadian Journal of Botany",
volume = 46)
```

A `publication` has a large number of slots that can be defined. These may be left blank, or they may be set directly after the publication is defined:

```{r setPubValue}
```{r setPubValue, eval=FALSE}
new_pub@pages <- "1397-1410"
```
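
The `@` operator used here is R's standard accessor for S4 slots. The following self-contained sketch shows the same pattern on a made-up class; `toy_publication` and its three slots are hypothetical and far simpler than the real `publication` class in `neotoma2`.

```r
library(methods)

# Hypothetical S4 class with a few character slots that default to NA.
setClass(
  "toy_publication",
  representation(
    articletitle = "character",
    journal      = "character",
    pages        = "character"
  ),
  prototype(
    articletitle = NA_character_,
    journal      = NA_character_,
    pages        = NA_character_
  )
)

# Create an object with some slots filled and others left blank.
pub <- new(
  "toy_publication",
  articletitle = "Myrtle Lake: a late- and post-glacial pollen diagram from northern Minnesota",
  journal      = "Canadian Journal of Botany"
)
pages_blank <- is.na(pub@pages)

# Fill in a blank slot after creation, then confirm the object is still valid.
pub@pages <- "1397-1410"
still_valid <- validObject(pub)
```

Assigning a value of the wrong type (for example `pub@pages <- 1397`) raises an error, because S4 checks slot assignments against the declared slot class.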

## Workshops and Code Examples

* 2022 International AL/IPA Meeting; Bariloche, Argentina
* [English Language Simple Workflow](https://open.neotomadb.org/Workshops/IAL_IPA-November2022/simple_workflow.html)
* Topics: Simple search, climate gradients, stratigraphic plotting
* Spatial Domain: South America
* Dataset Types: Diatoms
* [Spanish Language Simple Workflow](https://open.neotomadb.org/Workshops/IAL_IPA-November2022/simple_workflow_ES.html)
* Topics: Simple search, climate gradients, stratigraphic plotting
* Spatial Domain: South America
* Dataset Types: Diatoms
* [English Language Complex Workflow](https://open.neotomadb.org/Workshops/IAL_IPA-November2022/complex_workflow.html)
* Topics: Chronology building, Bchron
* Spatial Domain: South America
* Dataset Types: Diatoms
* [Spanish Language Complex Workflow](https://open.neotomadb.org/Workshops/IAL_IPA-November2022/complex_workflow_ES.html)
* Topics: Chronology building, Bchron
* Spatial Domain: South America
* Dataset Types: Diatoms
* [English Language Simple Workflow](https://open.neotomadb.org/Workshops/IAL_IPA-November2022/simple_workflow.html)
* Topics: Simple search, climate gradients, stratigraphic plotting
* Spatial Domain: South America
* Dataset Types: Diatoms
* [Spanish Language Simple Workflow](https://open.neotomadb.org/Workshops/IAL_IPA-November2022/simple_workflow_ES.html)
* Topics: Simple search, climate gradients, stratigraphic plotting
* Spatial Domain: South America
* Dataset Types: Diatoms
* [English Language Complex Workflow](https://open.neotomadb.org/Workshops/IAL_IPA-November2022/complex_workflow.html)
* Topics: Chronology building, Bchron
* Spatial Domain: South America
* Dataset Types: Diatoms
* [Spanish Language Complex Workflow](https://open.neotomadb.org/Workshops/IAL_IPA-November2022/complex_workflow_ES.html)
* Topics: Chronology building, Bchron
* Spatial Domain: South America
* Dataset Types: Diatoms
* 2022 European Pollen Database Meeting; Prague, Czech Republic
* [English Language Simple Workflow](https://open.neotomadb.org/Workshops/EPD-May2022/simple_workflow.html)
* Topics: Simple search, climate gradients, stratigraphic plotting, taxonomic harmonization
* Spatial Domain: Europe/Czech Republic
* Dataset Types: Pollen
* [English Language Complex Workflow](https://open.neotomadb.org/Workshops/EPD-May2022/complex_workflow.html)
* Topics: Chronology building, Bchron
* Spatial Domain: Europe/Czech Republic
* Dataset Types: Pollen
* [English Language Simple Workflow](https://open.neotomadb.org/Workshops/EPD-May2022/simple_workflow.html)
* Topics: Simple search, climate gradients, stratigraphic plotting, taxonomic harmonization
* Spatial Domain: Europe/Czech Republic
* Dataset Types: Pollen
* [English Language Complex Workflow](https://open.neotomadb.org/Workshops/EPD-May2022/complex_workflow.html)
* Topics: Chronology building, Bchron
* Spatial Domain: Europe/Czech Republic
* Dataset Types: Pollen
* 2022 American Quaternary Association Meeting
* [English Language Simple Workflow](https://open.neotomadb.org/Workshops/AMQUA-June2022/simple_workflow.html)
* Topics: Simple search, climate gradients, stratigraphic plotting
* Spatial Domain: North America
* Dataset Types: Pollen
* [English Language Complex Workflow](https://open.neotomadb.org/Workshops/AMQUA-June2022/complex_workflow.html)
* Topics: Chronologies
* Spatial Domain: North America
* Dataset Types: Pollen
* [English Language Simple Workflow](https://open.neotomadb.org/Workshops/AMQUA-June2022/simple_workflow.html)
* Topics: Simple search, climate gradients, stratigraphic plotting
* Spatial Domain: North America
* Dataset Types: Pollen
* [English Language Complex Workflow](https://open.neotomadb.org/Workshops/AMQUA-June2022/complex_workflow.html)
* Topics: Chronologies
* Spatial Domain: North America
* Dataset Types: Pollen
* Neotoma-charcoal Workshop, Göttingen, Germany. Authors: Petr Kuneš & Thomas Giesecke
* [English Language Workflow](https://rpubs.com/petrkunes/neotoma-charcoal)
* Topics: Simple Search, PCA, DCA, Charcoal/Pollen Correlation
* Spatial Domain: Global/Czech Republic
* Dataset Types: Pollen, Charcoal
* [English Language Workflow](https://rpubs.com/petrkunes/neotoma-charcoal)
* Topics: Simple Search, PCA, DCA, Charcoal/Pollen Correlation
* Spatial Domain: Global/Czech Republic
* Dataset Types: Pollen, Charcoal
