09-chunks.Rmd

# Extracting text {#chunks}

Functions for extracting parts of texts used to live inside of `fulltext`, but have now
moved to the package [pubchunks][].

The `pubchunks::pub_chunks` function tries to make it easy to extract the parts of
articles you want. This only works with XML format articles though since although
we can get text out of PDFs, there is no machine readable way to say
"I want the abstract".

In addition to only working with XML, this function only has knowledge about a
select set of publishers for which we've encoded knowledge about how to get different
sections of the article. Not all publishers use the same format XML - so each publisher is
slightly different for how to get to each section. That is, to get to the abstract
requires slightly different xpath for publisher A vs. publisher B vs. publisher C.

An alternative to `pubchunks` is to use xpath or css selectors yourself to slice and
dice XML. 

## Usage {#chunks-usage}


```{r}
library(fulltext)
library(pubchunks)
```

Get a full text article

```{r chunks_ft_get}
x <- ft_get('10.1371/journal.pone.0086169')
```

Note that unlike previous versions of `fulltext` you now have to collect (`ft_collect()`)
the text from the XML file on disk. Then you can pass to `pub_chunks()`, here to get
authors.

```{r chunks_ft_get_collect}
x %>% ft_collect %>% pub_chunks("authors")
```

In another example, let's search for PLOS articles.

```{r rplos_dois}
library("rplos")
(dois <- searchplos(q="*:*", fl='id',
   fq=list('doc_type:full',"article_type:\"research article\""),
     limit=5)$data$id)
```

Then get the full text

```{r}
x <- ft_get(dois)
```

Then pull out various sections of each article.

> remember to pull out the full text first

```{r}
x <- ft_collect(x)
```

```{r eval = FALSE}
x %>% pub_chunks("front")
x %>% pub_chunks("body")
x %>% pub_chunks("back")
x %>% pub_chunks("history")
x %>% pub_chunks("authors")
x %>% pub_chunks(c("doi","categories"))
x %>% pub_chunks("all")
x %>% pub_chunks("publisher")
x %>% pub_chunks("acknowledgments")
x %>% pub_chunks("permissions")
x %>% pub_chunks("journal_meta")
x %>% pub_chunks("article_meta")
```

## Tabularize {#chunks-tabularize}

The function `pub_tabularize()` is useful for coercing the output of `pub_chunks()` into a data.frame,
the lingua franca of data work in R.

```{r}
library(data.table)
x <- pub_chunks(x, c("doi", "title"))
x <- pub_tabularize(x)
rbindlist(x$plos, fill = TRUE)
```

## Other inputs {#chunks-other-inputs}

`pub_chunks()` works with other inputs besides the output of `fulltext::ft_get()`.

### Files

```{r}
x <- system.file("examples/10_1016_0021_8928_59_90156_x.xml", 
  package = "pubchunks")
```

```{r}
pub_chunks(x, "abstract")
pub_chunks(x, "title")
pub_chunks(x, "authors")
pub_chunks(x, c("title", "refs"))
```

The output of `pub_chunks()` is a list with an S3 class `pub_chunks` to make 
internal work in the package easier. You can easily see the list structure 
by using `unclass()`.

### xml in a string

```{r}
xml <- paste0(readLines(x), collapse = "")
pub_chunks(xml, "title")
```

### xml2 objects

```{r}
xml <- paste0(readLines(x), collapse = "")
xml <- xml2::read_xml(xml)
pub_chunks(xml, "title")
```


[pubchunks]: https://github.com/ropensci/pubchunks/