---
title: "Part 2: Web scraping practice"
author: Dan Turner (dturner@u.northwestern.edu)
---
```{r run this first}
library(rvest) # Web scraping
library(tidyverse) # Data wrangling
library(RCurl) # Download files from the internet
```
# Part 2: **Web scraping practice**
## Challenge 1
Modify the rule below to list the titles of all the blog posts on the first page found at the URL:
```{r Challenge 1 answer}
url <- "https://forum.thegradcafe.com/"

# Earlier versions of the rule, kept for reference:
# rule <- "#ipsLayout_mainArea > section > div:nth-child(8) > article:nth-child(1) > div.cBlog_grid_item__body.ipsPad > div:nth-child(1) > h2 > span > a"
# rule <- "#ipsLayout_mainArea > section > div.cBlog_grid_row.cBlog_grid_row--primary > article > div.cBlog_grid_item__body.ipsPad > div > h2 > span > a"

# Headings
rule <- "#ipsLayout_mainArea > section > ol > li > ol > li > div.ipsDataItem_main > h4 > a"

read_html(url) %>%
  html_nodes(rule) %>%
  html_text()

# Subheadings
rule2 <- "#ipsLayout_mainArea > section > ol > li > ol > li > div.ipsDataItem_main > ul > li"

read_html(url) %>%
  html_nodes(rule2) %>%
  html_text()

# Putting it together in the order it appears on GradCafe
# (a comma in a CSS selector matches nodes that satisfy either rule)
read_html(url) %>%
  html_nodes(paste(rule, rule2, sep = ", ")) %>%
  html_text()
```
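Note that the combined selector above returns headings and subheadings as one flat vector, so the pairing between a heading and its subheadings is lost. A minimal sketch of an alternative, reusing the same selectors and assuming each forum item keeps the `div.ipsDataItem_main > h4 > a` / `ul > li` structure seen above, is to scope the extraction to each item:
```{r Challenge 1 extension}
page <- read_html(url)

# One node per forum item
items <- html_nodes(page, "#ipsLayout_mainArea > section > ol > li > ol > li > div.ipsDataItem_main")

tibble(
  heading = items %>%
    html_node("h4 > a") %>%
    html_text(trim = TRUE),
  subheadings = items %>%
    map_chr(~ paste(html_text(html_nodes(.x, "ul > li"), trim = TRUE), collapse = "; "))
)
```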
## Challenge 2
Modify the rule below to make a dataframe consisting of the titles, links, authors, and dates. Extracting the author and date will require you to use the Inspector view to build and test two more rules.
```{r Challenge 2 answer}
url <- "https://forum.thegradcafe.com/"

# Rule for the post titles (and their links)
rule <- "#ipsLayout_mainArea > section > ol > li > ol > li > ul > li.ipsDataItem_lastPoster__title > a"

# Two modifications are required:
#  1. Add the class that wraps the author/date information: .ipsType_light.ipsType_blendLinks
#  2. Add an :nth-child() rule to slice out the author (1) or the date (2)

# Author rule
author_rule <- "#ipsLayout_mainArea > section > ol > li > ol > li > ul > li.ipsType_light.ipsType_blendLinks > a:nth-child(1)"

# Date rule
date_rule <- "#ipsLayout_mainArea > section > ol > li > ol > li > ul > li.ipsType_light.ipsType_blendLinks > a:nth-child(2)"

titles <- read_html(url) %>%
  html_nodes(rule) %>%
  html_text()

links <- read_html(url) %>%
  html_nodes(rule) %>%
  html_attr('href')

authors <- read_html(url) %>%
  html_nodes(author_rule) %>%
  html_text()

dates <- read_html(url) %>%
  html_nodes(date_rule) %>%
  html_text()

df <- data.frame(titles, links, authors, dates, stringsAsFactors = FALSE)

View(df)
```
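A small refinement: each call to `read_html()` re-downloads the page, so the four columns above come from four separate requests. Parsing the page once and reusing the document is faster and gentler on the server. A minimal sketch using the same rules defined above (the name `df2` is only to avoid clobbering `df`; `tibble()` will also error if the column lengths disagree, which is a handy check that the selectors matched the same rows):
```{r Challenge 2 extension}
# Parse the page once, then extract each column from the same document
page <- read_html(url)

df2 <- tibble(
  titles  = page %>% html_nodes(rule) %>% html_text(),
  links   = page %>% html_nodes(rule) %>% html_attr('href'),
  authors = page %>% html_nodes(author_rule) %>% html_text(),
  dates   = page %>% html_nodes(date_rule) %>% html_text()
)
```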
## Challenge 3
Now that we can extract data from one page, let's make sure we can get every page.
Write a function that lists every page of recently online users (https://forum.thegradcafe.com/online/).
```{r Challenge 3 sample answer}
url <- "https://forum.thegradcafe.com/online"

# First, find out how many pages there are.
# The "last page" link contains the page number in its URL.
page_rule <- "li.ipsPagination_last > a"

page.count <- read_html(url) %>%
  html_nodes(page_rule) %>%
  html_attr('href')

# Strip everything that is not a digit to recover the page number
page.count <- as.numeric(gsub("[^\\d]+", "", page.count, perl = TRUE)[1])

# The simplest solution: build the URL of every page directly
all_the_links <- paste0("https://forum.thegradcafe.com/online/?page=", 1:page.count)

# If desired, the code above could be wrapped in a loop such as:
# while (TRUE) {
#   ...
#   Sys.sleep(300)
# }
# This could be a useful way to track how user traffic changes, for example,
# every five minutes (not tested).
```
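The sample answer above builds the page list as a script, while the challenge asks for a function. A minimal sketch wrapping the same logic follows; the name `list_online_pages` and the `base_url` argument are just illustrative choices, and it assumes the forum keeps its `?page=N` scheme and the `li.ipsPagination_last > a` pagination link.
```{r Challenge 3 as a function}
list_online_pages <- function(base_url = "https://forum.thegradcafe.com/online/") {

  # The "last page" link holds the highest page number in its URL
  last_href <- read_html(base_url) %>%
    html_nodes("li.ipsPagination_last > a") %>%
    html_attr('href')

  n_pages <- as.numeric(gsub("[^\\d]+", "", last_href, perl = TRUE)[1])

  # One URL per page
  paste0(base_url, "?page=", seq_len(n_pages))
}

all_the_links <- list_online_pages()
```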