---
title: "Part 1: Getting Started"
author: Dan Turner (dturner@u.northwestern.edu)
---
# **Setup instructions**
## Welcome to my tutorial on the basics of scraping the web using R!
This is an *R Notebook* that has code chunks inline with explanations. Run the code blocks by pressing the *Run* button while the cursor is in that code chunk, or by the key combination *Cmd+Shift+Enter*. Each code chunk behaves kind of like an isolated `.R` script file, but the results appear beneath the code chunk, instead of in the console.
```{r RUN THIS}
library(rvest) # Web scraping
library(tidyverse) # Data wrangling
library(RCurl) # Downloading files from the internet
# install.packages("stringr") # uncomment this line if stringr is not installed
library(stringr) # String manipulation
demo_mode <- TRUE # Skip the code chunks that take more than a few minutes to run
```
If there are no errors, then you are ready to go on to the next part. Otherwise, you probably need to install some packages or dependencies.
# Reading web source code
We are going to use a built-in (but easy to miss) web browser feature to understand the source code of the pages we want to scrape.
After opening a new window in a modern web browser such as Firefox, Chrome, Safari, or Edge, open some page from Wikipedia, such as https://en.wikipedia.org/wiki/Purple_Rain_(film).
## Using Inspector view
(Tested on Chrome, Safari, Edge and Firefox)
Right-click any element (such as a picture or title) on the web page you want to 'inspect' and click 'Inspect'/'Inspect Element'. This will open a new window that shows the HTML code (and more) corresponding to the element you selected.
Take a minute with the built-in selector tool (usually a crosshair or magnifying glass icon) to explore different elements of the page. Some of the most common HTML tags you will see are `div`, `span`, `a`, `table`, `img`, `p`, `ul`, and `li`. We will use these tags to extract relevant information from the page.
Using this tool, you can browse the code and see how it maps onto the visual layout, inspect the CSS style properties of every element, and even test changes to the website's code in real time.
*Web scraping partly depends on the uniformity of the underlying code*
Web code is extremely variable in quality and style, and so the first step to extracting data from a webpage is always understanding how the data is displayed.
## HTML versus CSS
Most websites rely on two major file types, HTML and CSS. HTML (HyperText Markup Language) holds the content: text, links to pictures, tables of data, and everything else. CSS (Cascading Style Sheets) contains the rules that browsers use to visually style that HTML. In the old days, HTML carried its own presentation tags such as `<b>` for bold and `<i>` for italics, but for a long time now it has been best practice to do all styling with CSS classes.
We can select elements of websites using their HTML structure (hierarchically grouped) OR their CSS selectors (stylistically grouped); the sketch below shows both approaches. Advanced users may also get some use out of XPath, but that is beyond the scope of this workshop (sorry).
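For a concrete feel for the difference, here is a minimal sketch that queries the Purple Rain article linked above in both ways. The `.infobox` class is an assumption about how Wikipedia marks up its summary box, so treat the selectors as illustrative rather than definitive.
```{r}
# A quick look at both approaches on the Purple Rain article
purple_rain <- read_html("https://en.wikipedia.org/wiki/Purple_Rain_(film)")

# By HTML tag: every hyperlink (<a> element) on the page
purple_rain %>%
  html_nodes("a") %>%
  html_attr("href") %>%
  head()

# By CSS class: the film's summary box, assuming Wikipedia's "infobox" class
purple_rain %>%
  html_nodes(".infobox") %>%
  html_text() %>%
  word(1, 25)
```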
## Principles of scraping
Because the code for websites is often poorly written (sorry, webdevs), I want to offer some guidelines to help decide when and how to develop a web scraping solution for your project.
Rule 1. Don't scrape what you can download
Rule 2. Don't scrape what you cannot easily clean
Rule 3. Convert scraped strings into native data types (numbers, dates, factors) as soon as possible (see the sketch below)
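As a small illustration of Rule 3, here is a sketch of turning freshly scraped strings into native R types. The example values are made up; `parse_number()` comes from readr (loaded with the tidyverse) and `as.Date()` is base R.
```{r}
# Hypothetical strings, as they might come off a page
raw_budget   <- "$7.2 million"
raw_released <- "July 27, 1984"

# Convert to native types right away (Rule 3)
budget_musd  <- parse_number(raw_budget)                    # numeric: 7.2
release_date <- as.Date(raw_released, format = "%B %d, %Y") # Date object

budget_musd
release_date
```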
# Scraping Wikipedia
*For our first example, we will scrape some lists from Wikipedia.*
Let's compare a list of films set in Minnesota (A) to a list of films actually shot in Minnesota (B). I want to use these lists to answer the simple question, do films shot in Minnesota tend to be set in Minnesota?
Link A. https://en.wikipedia.org/wiki/Category:Films_set_in_Minnesota
Link B. https://en.wikipedia.org/wiki/Category:Films_shot_in_Minnesota
Movie titles in these lists appear as bulleted lists organized into alphabetical categories. The code for the first category ("0–9", in `h3` tags) looks like this:
```
<div class="mw-category-group"><h3>0–9</h3>
<ul><li><a href="/wiki/20/20:_In_an_Instant" title="20/20: In an Instant">20/20: In an Instant</a></li>
<li><a href="/wiki/360_(film)" title="360 (film)">360 (film)</a></li></ul></div>
```
## Common HTML Tags
Here is a quick breakdown of these tags:
`div` means 'division', which here is used to apply the "mw-category-group" CSS class to this chunk of HTML
`ul` means 'unordered list' = bullet point list
`li` means 'list item' = individual bullet point
`a` means 'anchor' = normal hyperlink
`class` and `id` are hooks for CSS styling; in selectors we refer to classes with `.` (like `.mw-category-group`) and ids with `#` (like `#mw-pages`)
We only need the names of the films, but if we wanted to know more about these films later (say, their release date or budget), we might want the link to their Wikipedia page.
Check out one of the movie's pages to get a sense for the data potential: https://en.wikipedia.org/wiki/Purple_Rain_(film)
*With this in mind, let's scrape the films set and shot in Minnesota to get the titles of the films (the text inside each `<a>` tag) and their links (each `<a>` tag's `href` attribute).*
For the curious, here's a quick link to the trailer for Purple Rain: https://www.youtube.com/watch?v=AuXK8ZbTmLk (it won an Oscar).
## Scraping the movie titles and links
```{r}
# Download the html of the two links into R variables
films_set_html <- read_html("https://en.wikipedia.org/wiki/Category:Films_set_in_Minnesota")
films_shot_html <- read_html("https://en.wikipedia.org/wiki/Category:Films_shot_in_Minnesota")
# Peek at the raw text. html_text() (from rvest) flattens the html object into
# a single string, and word() (from stringr) extracts words 100 through 200 of it.
word(html_text(films_shot_html), 100, 200)
```
You should see the list punctuated by newline characters (`\n`).
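If those newlines get in the way of peeking, stringr can tidy the text: `str_squish()` collapses every run of whitespace (including `\n`) into a single space. A minimal sketch:
```{r}
# Collapse newlines and repeated whitespace before peeking
films_shot_html %>%
  html_text() %>%
  str_squish() %>%
  word(100, 200)
```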
## Use web developer tools
Next we will scrape the data we want by using the information from Inspector View to write some inclusion/exclusion rules. Usually you can simply right-click an example of the information you want in Inspector View and use that to build your rule. *When you right-click the line in Inspector View, hover over 'Copy' and click 'Copy Selector'.* On the Films Set in Minnesota page, this gives us:
```
#mw-pages > div > div > div:nth-child(1) > ul > li:nth-child(1) > a
```
If we paste that into the `html_nodes()` function, it returns the title of the first film on the page ('The Adventures of Rocky and Bullwinkle').
```{r}
films_set_html %>%
  html_nodes('#mw-pages > div > div > div:nth-child(1) > ul > li:nth-child(1) > a') %>%
  html_text() # this extracts text from within HTML tags
```
But we want every film, not just the first one. If you look at the rule, you will see two references to `nth-child(1)`. The `:` introduces a CSS pseudo-class, and `:nth-child(1)` matches an element only when it is the first child of its parent, so the copied rule selects only the first category and the first film within it.
Delete both `:nth-child(1)` parts to include all of the films in all of the categories:
```{r}
films_set_html %>%
  html_nodes('#mw-pages > div > div > div > ul > li > a') %>%
  html_text() # this extracts text from within HTML tags
```
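An equivalent, shorter rule uses the classes and ids directly instead of the full copied path. This sketch assumes the film links sit inside the `mw-category-group` divs (shown in the snippet earlier) within the `#mw-pages` section:
```{r}
# Same result with a class/id based selector instead of the long copied path
films_set_html %>%
  html_nodes('#mw-pages .mw-category-group li a') %>%
  html_text() %>%
  head()
```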
Now let's finish the job by storing the links and titles in R variables.
```{r}
# Titles, same as above
films_set_titles <- films_set_html %>%
  html_nodes('#mw-pages > div > div > div > ul > li > a') %>%
  html_text()

# Links: the same rule, but pulling the href attribute instead of the text
films_set_links <- films_set_html %>%
  html_nodes('#mw-pages > div > div > div > ul > li > a') %>%
  html_attr("href")

# Join the titles and links as a data frame
films_set_mn <- data.frame("title" = films_set_titles, "link" = films_set_links)

# Peek
head(films_set_mn)

# Cleanup
rm(films_set_html, films_set_titles, films_set_links)
```
We follow the same process to generate the `films_shot_mn` data frame:
```{r}
# Titles: the same rule works on the films-shot page, too
films_shot_titles <- films_shot_html %>%
  html_nodes('#mw-pages > div > div > div > ul > li > a') %>%
  html_text()

# Links: again pulling the href attribute instead of the text
films_shot_links <- films_shot_html %>%
  html_nodes('#mw-pages > div > div > div > ul > li > a') %>%
  html_attr("href")

# Join the titles and links as a data frame
films_shot_mn <- data.frame("title" = films_shot_titles, "link" = films_shot_links)

# Peek
head(films_shot_mn)

# Cleanup
rm(films_shot_html, films_shot_titles, films_shot_links)
```
## Do films shot in Minnesota tend to be set in Minnesota?
```{r}
# Films shot in MN that are also set in MN
# intersect() finds the films that are both shot and set in Minnesota;
# nrow() gives the number of films (length() on a data frame would count its columns)
length(intersect(films_shot_mn$title,
                 films_set_mn$title)) / nrow(films_shot_mn)

# Films shot in MN that are NOT set in MN
# setdiff() finds the films shot in Minnesota that AREN'T set in Minnesota...
# it turns out there are more of them
length(setdiff(films_shot_mn$title,
               films_set_mn$title)) / nrow(films_shot_mn)
```
In other words, most films shot in Minnesota are NOT set there.
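For readers who prefer dplyr verbs, the same first proportion can be computed with a semi join; this is just an alternative sketch, not a different result.
```{r}
# Alternative: keep the rows of films_shot_mn whose title also appears in
# films_set_mn, then take the proportion
both <- semi_join(films_shot_mn, films_set_mn, by = "title")
nrow(both) / nrow(films_shot_mn)
```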
# How to scrape multiple pages
Let's get a little closer to Northwestern and get the titles and links for films set and shot in Chicago.
*But there is a problem.* There are many more films set and shot in Chicago than in Minnesota, and Wikipedia only lists 200 items per list per page. See for yourself:
```{r}
# same as before
films_set_chicago <- read_html("https://en.wikipedia.org/wiki/Category:Films_set_in_Chicago") %>%
  html_nodes('#mw-pages > div > div > div > ul > li > a') %>%
  html_text()
length(films_set_chicago)
```
The list only has *200* items in it, but according to the link we are scraping, the full list is about twice that size. If we were browsing Wikipedia, we could click "next page" and see how the list continues, but that's not how we're reading it.
This is something you should note about websites you're scraping--how do you get all of the data to be represented in the HTML? 'Next' button? Infinite scroll?
## Crawling
One solution to this issue is to scrape every page separately and remove the duplicates we find, but that takes time and burdens our post-processing. Instead, we will use the recursive property of pagination (the next page can itself have a next page, which can have a next page) to crawl all of the pages, one at a time.
```{r}
# function to scrape the titles and links from one category page
w.scrape <- function(full_url, rule){

  page <- read_html(full_url) # read the page once and reuse it below

  # get titles
  the_titles <- page %>%
    html_nodes( rule ) %>%
    html_text()

  # get links
  the_links <- page %>%
    html_nodes( rule ) %>%
    html_attr('href')

  # as a dataframe
  df <- data.frame("titles" = the_titles, "links" = the_links,
                   stringsAsFactors = FALSE)
  return( df )
}

# return the relative urls of the following pages by walking the 'next page' chain
w.tree <- function(rel_url){

  root = "https://en.wikipedia.org"
  full_url <- paste0(root, rel_url)

  # the 'next page' link is the last link in the #mw-pages div
  last.link <- read_html( full_url ) %>%
    html_node("#mw-pages > a:last-child")

  # if that really is a 'next page' link, get its url and recurse
  if( isTRUE(all.equal(html_text(last.link), "next page")) ){
    next.page <- html_attr(last.link, 'href')
    return( c(next.page, w.tree(next.page)) ) # keep every page we find
  }
}

# convenience function to make a list of urls and their text from wikipedia category pages
w.list <- function(rel_url, rule){

  root = "https://en.wikipedia.org"
  to.scrape <- c(rel_url, w.tree( rel_url )) # the first page plus every page after it

  output <- data.frame() # container

  for(page in to.scrape){
    output <- rbind( w.scrape( paste0(root, page), rule ), output )
    Sys.sleep(0.5) # pause 1/2 second before scraping the next page
  }

  return(unique(output)) # return unique rows
}
```
We use a plain loop so that we can pause between iterations with `Sys.sleep()`. Pausing like this is important if you want to be a friendly scraper, but it is ultimately much slower than letting R open parallel connections.
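If you want that politeness baked in everywhere, one option is a small wrapper around `read_html()`; the function name and delay below are illustrative choices, not part of rvest.
```{r}
# A minimal 'polite' page reader: pause before every request
read_html_politely <- function(url, delay = 0.5){
  Sys.sleep(delay) # be gentle on the server
  read_html(url)
}
```
A helper like this could stand in for `read_html()` inside `w.scrape()` without any other changes.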
```{r message=FALSE, warning=FALSE}
child_rule <- "#mw-pages li a" # a shorter selector that matches every film link on the page

# let's see how it works
films_set_chicago <- w.list(rel_url = "/wiki/Category:Films_set_in_Chicago",
                            rule = child_rule)
```
I get 400 films, which means the list runs to two pages of items.
That should be all of them!
## Scale it up
We could scrape every city or every state, or both, using the same basic methods we employed for the Chicago list and its URLs. Doing so means touching many more HTML pages, which increases the chances that we will hit an error.
Flow control is important for handling those exceptions: use `next` when you want the loop to skip the current item, and `break` when you want it to stop altogether.
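Here is a minimal, hedged sketch of that pattern: each scrape is wrapped in `tryCatch()` so that a failing page is skipped with `next` instead of crashing the whole crawl. The second link is deliberately made up to illustrate a failure.
```{r}
# Skip pages that fail instead of letting one error stop the crawl
test_links <- c("/wiki/Category:Films_set_in_Chicago",
                "/wiki/Category:This_page_probably_does_not_exist")

test_results <- data.frame()

for(test_link in test_links){

  temp <- tryCatch(
    w.list(rel_url = test_link, rule = child_rule),
    error = function(e) NULL # on failure, return NULL instead of stopping
  )

  if(is.null(temp) || nrow(temp) == 0){
    next # skip this link and move on
  }

  test_results <- rbind(test_results, temp)
}
```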
*The code chunk below will scrape all of the state-level pages of films on Wikipedia.*
```{r message=FALSE, warning=FALSE}
# let's get all of the movies from these links
rel_link_list <- c( "/wiki/Category:Films_set_in_the_United_States_by_state",
                    "/wiki/Category:Films_shot_in_the_United_States_by_state")

# using the Inspector tool on 'Films set in Akron, Ohio' I copied the selector path
# I also had to delete the 'child' selectors, as I did before
#parent_rule <- '#mw-subcategories li a'
parent_rule <- '#mw-subcategories > div > div > div > ul > li > div > div.CategoryTreeItem > a'
child_rule  <- "#mw-pages li a"

geo_content <- data.frame() # container for every film we find

# loop all the geographical area links to get all the list page links
for(link in rel_link_list){

  if(demo_mode == TRUE){ break } # this takes a while

  # scrape all the geo categories (one per state) for this link
  geo_links <- w.list( rel_url = link, rule = parent_rule )

  # for each geo category, scrape the film titles and links
  for(i in 1:nrow(geo_links)) {

    temp <- w.list( rel_url = paste0( geo_links$links[i] ),
                    rule = child_rule )

    if(nrow(temp) == 0){
      next # if our rule fails, skip this link
    } else {
      temp$parent_title <- geo_links$titles[i]
      temp$parent_link  <- geo_links$links[i]
      geo_content <- rbind(geo_content, temp)
    }

    rm(temp)
    Sys.sleep(0.5) # pause 1/2 second before scraping the next page
  }#inner
}#outer

rm(link, i, geo_links, rel_link_list) # cleanup

# saving this so you can load it without running it
# saveRDS(geo_content, "geo_content.rds") # 6062 films set in various states
```
This just shows how easily you can scale up a simple scraping script into something with more of an appetite.
In the next part we will practice some web scraping, then we will build on this in one more part.