<style>@import url(style.css);</style>
[Introduction to Data Analysis](index.html "Course index")
# 11.2. Network(d)s
This section brings a bit of text mining into network analysis: we will turn word associations from a speech into network ties and visualize the resulting network of words. The code draws on the ideas of [Cornelius Puschmann][ynada], and the example data is [Julian Assange's][assange] address at the United Nations in September 2012.
[assange]: http://wikileaks.org/Transcript-of-Julian-Assange.html
[ynada]: http://blog.ynada.com/303
```{r packages, message = FALSE, warning = FALSE}
packages <- c("intergraph", "GGally", "ggplot2", "network", "RColorBrewer", "sna", "tm")
packages <- lapply(packages, FUN = function(x) {
  if(!require(x, character.only = TRUE)) {
    install.packages(x)
    library(x, character.only = TRUE)
  }
})
```
Our first step is to read the text file, extract all words from the speech, and collect all two-word associations like "welcome to" or "fine deeds". We replace Cornelius Puschmann's original code with a functional approach to the job (see the [section on iteration][032]), using one of the many R [apply][bh-apply] functions (`sapply`, which is part of base R) for the main word association routine.
[032]: 032_iteration.html
[bh-apply]: http://badhessian.org/loops-matrices-and-apply-functions/
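To see what the functional approach buys us, here is a small illustration that is not part of the original code: it uses a made-up word vector to show the `for` loop that the `sapply()` call in `build.corpus()` replaces.

```{r apply-sketch, eval = FALSE}
# Toy word vector, for illustration only.
txt <- c("welcome", "to", "the", "united", "nations")
# Loop version: collect consecutive word pairs one at a time.
pairs <- vector("list", length(txt) - 1)
for (i in 1:(length(txt) - 1)) pairs[[i]] <- c(txt[i], txt[i + 1])
# Functional version: apply the same operation over the vector of indexes,
# as done by the word association routine below.
pairs <- sapply(1:(length(txt) - 1), function(i) c(txt[i], txt[i + 1]))
```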
```{r puschmann}
build.corpus <- function(x, skip = 0) {
  # Read the text source.
  src = scan(x, what = "char", sep = "\n", encoding = "UTF-8", skip = skip)
  # Extract all words, dropping punctuation and digits.
  txt = unlist(strsplit(gsub("[[:punct:]|[:digit:]]", " ", tolower(src)), "[[:space:]]+"))
  # Remove single letters.
  txt = txt[nchar(txt) > 1]
  # Function to create a word pair (tie), unless either word is empty or a stopword.
  associate <- function(x) {
    y = c(txt[x], txt[x + 1])
    if(!TRUE %in% (y %in% c("", stopwords("en")))) y
  }
  # Build the two-column edge list of word pairs.
  net = do.call(rbind, sapply(1:(length(txt) - 1), associate))
  # Return network object.
  return(network(net))
}
# Example data.
net <- build.corpus("data/assange.txt")
```
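Before plotting, it is worth checking what `build.corpus()` returned. The quick inspection below is an addition to the original code; it only uses functions from the `network`, `sna` and `tm` packages loaded above.

```{r inspect-assange}
# Number of words (nodes) and word associations (ties) in the network.
network.size(net)
network.edgecount(net)
# Most connected words in the speech.
deg <- sna::degree(net)
names(deg) <- network.vertex.names(net)
head(sort(deg, decreasing = TRUE), 10)
# A few of the English stopwords that were excluded from the ties.
head(stopwords("en"))
```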
The word network is plotted as a very sparse graph, trimmed to associations of non-trivial words that appear at least three times in the speech. Trivial words are removed by matching them against the list of English stopwords included in the `tm` package. We again use the `ggnet` function, but [have a look][rdm-igraph] at the `igraph` package for an alternative (a sketch follows the plot below).
[rdm-igraph]: http://rdatamining.wordpress.com/2012/05/17/an-example-of-social-network-analysis-with-r-using-package-igraph/
```{r plot-assange-auto, message = FALSE, fig.width = 12, fig.height = 9, tidy = FALSE, warning = FALSE}
# Plot with ggnet.
ggnet(net, weight = "degree", subset = 3,
      alpha = 1, segment.color = "grey", label = TRUE, vjust = -2,
      legend = "none")
```
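As mentioned above, the `igraph` package is an alternative to `ggnet` for plotting. The sketch below is an addition to the original code and assumes that `igraph` is installed; it uses the `intergraph` package (loaded above) to convert the `network` object.

```{r plot-assange-igraph, eval = FALSE}
# Convert the network object to an igraph object and plot it with igraph.
# Assumption: the igraph package is installed (it is not loaded above).
library(igraph)
ig <- intergraph::asIgraph(net)
plot(ig,
     vertex.size = 3,
     vertex.label = V(ig)$vertex.names,
     edge.arrow.size = 0.2)
```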
Since corpus construction is wrapped in a single function, we can now take any other plain text speech, prepare it and plot it in just a few lines. The next example is a plot of word associations in [Cory Doctorow's][gh-doctorow] speech to the Chaos Communication Congress in December 2011. Try running the same graph on any plain text speech file, such as this one by [Barack Obama][bo] (a sketch using it follows the next example).
[gh-doctorow]: https://github.com/jwise/28c3-doctorow/blob/master/transcript.md
[bo]: http://librarian.net/dnc/speeches/obama.txt
```{r plot-doctorow-auto, message = FALSE, fig.width = 12, fig.height = 9, tidy = FALSE, warning = FALSE}
# Target locations
link = "https://raw.github.com/jwise/28c3-doctorow/master/transcript.md"
file = "data/doctorow.txt"
# Download speech.
if(!file.exists(file)) download.file(link, file, mode = "wb")
# Build corpus.
net <- build.corpus(file, skip = 37)
# Plot with ggnet.
ggnet(net, weight = "indegree", subset = 5,
      alpha = 1, segment.color = "grey", label = TRUE, vjust = -2,
      legend = "none")
```
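The same recipe works for the Obama speech linked above. The sketch below is an addition to the original code; it assumes that the URL still resolves and that the file needs no lines skipped at the top.

```{r plot-obama-auto, eval = FALSE, fig.width = 12, fig.height = 9}
# Target locations.
link = "http://librarian.net/dnc/speeches/obama.txt"
file = "data/obama.txt"
# Download speech.
if(!file.exists(file)) download.file(link, file, mode = "wb")
# Build corpus and plot with ggnet.
net <- build.corpus(file)
ggnet(net, weight = "degree", subset = 3,
      alpha = 1, segment.color = "grey", label = TRUE, vjust = -2,
      legend = "none")
```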
> __Next week__: [Data-driven advances](120_data.html).