-
Notifications
You must be signed in to change notification settings - Fork 0
/
FINAL.Rmd
160 lines (133 loc) · 4.26 KB
/
FINAL.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
---
title: "FInal REport"
author: "OwenYeh"
date: "20170130 created"
output:
html_document:
toc: TRUE
number_sections: FALSE
theme: yeti
highlight: pygments
self_contained: No
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
> Ralph Waldo Emerson : We have conquered the power, so we have the strength
##輸/匯入資料
將來自"https://storage.googleapis.com/r_rookies/kaggle_titanic_train.csv" 匯入資料庫之中,並且命名為titanic
```{r}
titanic <- read.csv("https://storage.googleapis.com/r_rookies/kaggle_titanic_train.csv")
str(titanic)
summary(titanic)
```
<br />
##將Titanic 的資料彙整成圖,觀察性別、年紀與社經地位
```{r message=FALSE, warning=FALSE}
library(ggplot2)
library(plotly)
TT1<-ggplot(titanic, aes(Pclass)) +
geom_bar(aes(fill = Sex))+
ggtitle("the quantity of Passenger classes vesus different kind of sexes")
ggplotly(TT1)
```
<br />
<br />
<br />
```{r message=FALSE, warning=FALSE}
TT2<-ggplot(titanic, aes(x = Age)) +
geom_histogram(aes(fill = Sex))+
ggtitle("the quantity of different ages humanbeings vesus different kind of sexes")
ggplotly(TT2)
```
<br />
##清除未知數
使程式碼簡潔化
```{r}
titanic <- titanic[complete.cases(titanic), ]
titanic$Survived <- factor(titanic$Survived)
titanic$Embarked <- as.character(titanic$Embarked)
titanic$Embarked[titanic$Embarked == ""] <- "S"
titanic$Embarked <- factor(titanic$Embarked)
```
<br />
##分組
以70%的資料作為訓練駔,剩下的為測試組
```{r message=FALSE}
n <- nrow(titanic)
set.seed(90)
shuffled_titanic <- titanic[sample(n), ]
train_indices <- 1:round(0.7 * n)
train <- shuffled_titanic[train_indices, ]
test_indices <- (round(0.7 * n) + 1):n
test <- shuffled_titanic[test_indices, ]
```
<br />
##準確度計算
利用`rpart`的樹模型,並利用`confusion_matrix`去計算其準確度
```{r}
#install.packages("rpart")
library(rpart)
tree_fit <- rpart(Survived ~ Pclass + Sex + Age + SibSp + Parch + Fare + Embarked, data = train, method = "class")
prediction <- predict(tree_fit, test[, c("Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked")], type = "class")
confusion_matrix <- table(test$Survived, prediction)
accuracy <- sum(diag(confusion_matrix)) / sum(confusion_matrix)
accuracy
```
<br />
##探索沒上傳不存在的資料
輸入 to_predict
```{r}
url <- "https://storage.googleapis.com/py_ds_basic/kaggle_titanic_test.csv"
to_predict <- read.csv(url)
summary(to_predict)
```
先將年齡用平均年齡填滿
```{r message=FALSE}
#install.packages("dplyr")
library(dplyr)
#install.packages("magrittr")
library(magrittr)
```
```{r}
mean_age_by_Pclass <- to_predict %>%
group_by(Pclass) %>%
summarise(mean_age = round(mean(Age, na.rm = TRUE)))
filter_1 <- is.na(to_predict$Age) & to_predict$Pclass == 1
filter_2 <- is.na(to_predict$Age) & to_predict$Pclass == 2
filter_3 <- is.na(to_predict$Age) & to_predict$Pclass == 3
mean_age_by_Pclass
to_predict[filter_1, ]$Age <- 41
to_predict[filter_2, ]$Age <- 29
to_predict[filter_3, ]$Age <- 24
```
再將FARE用其平均值填滿
```{r}
fare_mean <- mean(to_predict$Fare, na.rm = TRUE)
to_predict$Fare[is.na(to_predict$Fare)] <- fare_mean
```
匯出一個結論
```{r}
summary(to_predict)
```
**to_predict的NA已經清空**
##修正匯出格式
```{r message = FALSE}
predicted <- predict(tree_fit, newdata = to_predict[, c("Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked")])
result <- data.frame(to_predict[, "PassengerId"], predicted)
names(result) <- c("PassengerId", "Survived")
head(result, n = 10)
```
##寫出結論
```{r}
write.csv(result, file = "result_of_the_Tiatanic_report.csv", row.names = FALSE)
```
![The Result and the rank](https://4.bp.blogspot.com/-OCLkiSwee-Q/WI8an8LOPgI/AAAAAAAAAT4/_IajLvnpyPc7T27cG3nblqJ_-BYaDA3ywCLcB/s1600/FINAL1.png)
#references
台大資工系統訓練班 R 程式設計班的教學專案--
<https://yaojenkuo.github.io/r_programming/>
R Markdown Cheat Sheet--
<http://www.rstudio.com/wp-content/uploads/2016/03/rmarkdown-cheatsheet-2.0.pdf>
DataCampRmd--
<https://github.com/dspim/DataCampRmd/blob/master/index.md>
![what message does R handle and created for you when it's correctly used looks like:](http://www.migflug.com/jetflights/wp-content/uploads/2015/03/reditt-com-mig-25-foxbat.jpg)