---
title: "Capstone Presentation: Yelp Data Set"
author: "Jacob Townson"
date: "02/20/2015"
output: ioslides_presentation
---
```{r include=FALSE}
library(mosaic)
library(MASS)  # provides stepAIC
```
## Question
Is how often someone reviews on Yelp correlated with how much they travel? If so, do frequent reviewers travel more than infrequent reviewers, judging by the cities and coordinates where they leave reviews?

Why is this interesting?

- It could be helpful for organizational purposes on Yelp. If frequent reviewers do travel more, that could be used to recommend organizations for users to visit when they travel.

## Methods
```{r include = FALSE}
user_travel <- read.csv("./user_travel")
```
- After cleaning, the data frame contained only the user IDs, the number of cities visited, the maximum radius of travel, and the number of reviews made. Linear models were then fit using the `lm` function in R:
```{r eval = FALSE}
city.predict <- lm(cities~reviews, data = user_travel)
distance.predict <- lm(distance~reviews, data = user_travel)
```
- The residuals of both models seem to follow a quadratic pattern, so we add squared and cubed review counts to the data and fit new models.
```{r include=FALSE}
reviewssq <- (user_travel$reviews)^2
reviewscu <- (user_travel$reviews)^3
user_travel <- cbind(user_travel, reviewssq, reviewscu)
```
```{r}
city.predict2 <- lm(cities ~ reviews + reviewssq + reviewscu,
                    data = user_travel)
distance.predict2 <- lm(distance ~ reviews + reviewssq + reviewscu,
                        data = user_travel)
```
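As an aside, the same cubic model can be written without adding columns by wrapping the powers in `I()` inside the formula. The sketch below uses simulated data (the data frame `sim` and its values are made up for illustration and are not the Yelp data) and checks that both ways of specifying the model give the same coefficients.

```{r eval = FALSE}
# Simulated stand-in for user_travel; values are made up for illustration
set.seed(1)
sim <- data.frame(reviews = rpois(200, lambda = 20))
sim$cities <- 1 + 0.05 * sim$reviews + rnorm(200)

# Approach used in this presentation: precomputed squared and cubed columns
sim$reviewssq <- sim$reviews^2
sim$reviewscu <- sim$reviews^3
fit.cols <- lm(cities ~ reviews + reviewssq + reviewscu, data = sim)

# Formula-only version: I() keeps ^ from being read as formula syntax
fit.formula <- lm(cities ~ reviews + I(reviews^2) + I(reviews^3), data = sim)

# Coefficient names differ between the two fits, but the estimates agree
same.fit <- all.equal(unname(coef(fit.cols)), unname(coef(fit.formula)))
same.fit
```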
Next, we use the `stepAIC` function from the MASS package to select the best model by AIC, always keeping the `reviews` term.
```{r include=FALSE}
# Stepwise selection by AIC; the lower scope keeps reviews in every candidate model
city.predict2 <- stepAIC(city.predict2, scope = list(lower = ~reviews))
distance.predict2 <- stepAIC(distance.predict2, scope = list(lower = ~reviews))
```
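A minimal sketch of what this selection step does, run on simulated data rather than the Yelp data (`sim` and its columns are hypothetical). Because the scope is `list(lower = ~reviews)`, `stepAIC` may drop the squared and cubed terms but never `reviews` itself. The `trace = FALSE` argument is added here only to silence the step-by-step log.

```{r eval = FALSE}
library(MASS)  # provides stepAIC

# Simulated stand-in for user_travel; distance is linear in reviews plus noise
set.seed(2)
sim <- data.frame(reviews = rpois(300, lambda = 15))
sim$reviewssq <- sim$reviews^2
sim$reviewscu <- sim$reviews^3
sim$distance <- 5 + 0.3 * sim$reviews + rnorm(300, sd = 4)

full <- lm(distance ~ reviews + reviewssq + reviewscu, data = sim)

# Stepwise selection by AIC, with reviews protected by the lower scope
selected <- stepAIC(full, scope = list(lower = ~reviews), trace = FALSE)

formula(selected)                                    # terms kept after selection
kept.reviews <- "reviews" %in% names(coef(selected)) # TRUE by construction of the scope
kept.reviews
```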
## Residuals After Using stepAIC
```{r fig.height=2.5}
plot(city.predict2, which=1:2, labels.id = '')
```
## Discussion
Even after adding squared and cubed terms for the number of reviews per user and applying the stepAIC function, I conclude that the number of reviews and the amount of travel are not correlated. The smoothed line in the residuals vs. fitted plot looks acceptable on its own, but the points are so scattered that the fit is hard to trust. The Q-Q plots both seemed to imply a quadratic relationship, yet adding the squared and cubed terms made little difference. It may be that there is no correlation at all, and that we were simply lucky to get the results we did.

In retrospect, I wish I had had more time for the project. With more time, I would add other variables to the model to see whether they explain the amount a user travels better than the number of reviews alone. I think this would make for interesting future work.
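
The strength of the relationship discussed above could also be checked directly instead of being read off the residual plots. The sketch below shows one way this might be done, again on simulated data (the `sim` data frame is made up, and `cities` is generated independently of `reviews`): `cor.test` reports the correlation with a p-value, and `summary()$r.squared` gives the share of variance explained by the linear model.

```{r eval = FALSE}
# Simulated stand-in for user_travel; cities does not depend on reviews here
set.seed(3)
sim <- data.frame(reviews = rpois(250, lambda = 10))
sim$cities <- 1 + rpois(250, lambda = 2)

# Pearson correlation test (the default method of cor.test)
ct <- cor.test(sim$reviews, sim$cities)
ct$estimate   # sample correlation
ct$p.value    # p-value for the null hypothesis of zero correlation

# For a one-predictor linear model, R-squared equals the squared correlation
fit <- lm(cities ~ reviews, data = sim)
r2 <- summary(fit)$r.squared
r2
```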