---
title: "10 - Classification"
author: "Joseph Rickert"
date: "Friday, August 29, 2014"
output: html_document
---
```{r}
library(rattle) # for weather data set
library(rpart) # CART Decision Trees
library(colorspace) # used to generate colors for plots
library(randomForest) # Random Forest ensembles
library(ROCR) # ROC
library(kernlab) # SVM library
library(e1071) # SVM library
library(ada) # Boosting library
```
## Some Convenience Functions
```{r}
# Function to divide the data into training and test sets
ttIndex <- function(data=data, pctTrain=0.7)
{
  # Create indices that randomly divide the data into
  # training and test sets (no separate validation set here)
  N <- nrow(data)
  trainInd <<- sample(N, pctTrain*N)       # global assignment: used by later chunks
  testInd  <<- setdiff(seq_len(N), trainInd)
}
# Function to generate the confusion matrix and percent correct
score <- function(model, target=data[testInd, 21], predict=pr){
  results.test <- table(target, predict, dnn=c("Actual", "Predicted"))
  pct.test.correct <- round(100 * sum(diag(results.test)) / sum(results.test), 2)
  list(results.test, pct.test.correct)
}
```
## Read the Data and Prepare the Training and Test Sets
Get the weather data and select the subset for modeling
```{r}
name <- "weather.csv"
dataDir <- "C:/DATA/Rattle Data" # adjust to the location of weather.csv on your machine
path <- file.path(dataDir,name)
data <- read.csv(path,header=TRUE)
# head(data)
# Select variables for the model
data <- subset(data,select=c(MinTemp:RainToday,RainTomorrow))
set.seed(42) # Set seed
ttIndex(data) # Pick out rows (index into data) for the training and test data sets
```
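A quick sanity check (not part of the original script) confirms that `ttIndex()` produced a disjoint 70/30 partition of the rows:
```{r}
length(trainInd)   # about 70% of nrow(data)
length(testInd)    # the remaining rows
stopifnot(length(intersect(trainInd, testInd)) == 0,
          length(trainInd) + length(testInd) == nrow(data))
```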
## Build a Tree Model with rpart
The rpart algorithm is based on recursive partitioning
(see Section 11.2 of *Data Mining with Rattle and R* by Graham Williams).
The rpart Algorithm:
1. - Partition the data set according to some criterion of "best" partition
2. - Do the same for each of the two new subsets
3. - Once a partition is made, stick with it (greedy approach)
Measures of "best" partition:
1. - information gain (the default)
2. - Gini
Information Gain Algorithm:
For all possible splits (partitions):
1. - Split the data D into two subsets S1 and S2, where D = S1 U S2
2. - Calculate the information I1 and I2 associated with S1 and S2
3. - Compute the total information of the split: Info(D,S1,S2) = (|S1|/|D|)*I1 + (|S2|/|D|)*I2
4. - Compute the information gain of the split: Info(D) - Info(D,S1,S2)
5. - Select the split with the greatest information gain
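The information-gain computation above can be sketched in a few lines of R; the counts below are made-up toy numbers, not taken from the weather data.
```{r}
# Entropy (in bits) of a vector of class proportions
entropy <- function(p) {
  p <- p[p > 0]                    # 0 * log2(0) is taken to be 0
  -sum(p * log2(p))
}
# Toy parent node D: 100 observations, 60 No / 40 Yes
I.D <- entropy(c(60, 40) / 100)
# Candidate split: S1 = 70 obs (55 No / 15 Yes), S2 = 30 obs (5 No / 25 Yes)
I.S1 <- entropy(c(55, 15) / 70)
I.S2 <- entropy(c(5, 25) / 30)
info.split <- (70/100) * I.S1 + (30/100) * I.S2  # Info(D,S1,S2)
gain <- I.D - info.split                         # information gain, about 0.25 bits
gain
```
rpart evaluates a gain like this for every candidate split and greedily takes the best one.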
### Build a classification tree model
```{r}
form <- formula(RainTomorrow ~ .) # Describe the model to R
model <- rpart(formula=form,data=data[trainInd,]) # Build the model
#
model
```
### Interpreting the Model Results
Each line of the output that follows contains:
1. node: a node number
2. split: the logic for how the node splits the data
3. n: the number of observations considered at that split
4. loss: the number of incorrectly classified observations
5. yval: the majority class at that node
6. yprob: the distribution of classes at that node
So for the second line above, `2) Pressure3pm>=1011.9 204 16 No (0.92156863 0.07843137)`:
1. node: 2)
2. split: if Pressure3pm >= 1011.9, go down the left branch of the tree
3. n: 204 observations went down this branch
4. loss: 16 misclassified observations
5. yval: most observations at this node were No
6. yprob: 92% of the observations have target value No, 8% have Yes
### Examine the results
```{r}
printcp(model)
summary(model)
#rpart:::summary.rpart
leaf <- model$where # the leaf node in which each observation ended up
leaf
```
### Plot Tree
First a Basic R plot
```{r}
opar <- par(xpd=TRUE) # Plotting is clipped to the figure region
plot(model)
text(model)
par(opar)
```
Now, a Rattle style plot
```{r}
drawTreeNodes(model)
title(main="Decision Tree weather.csv $ RainTomorrow")
```
### Evaluate model performance on the test set
Run the tree model on the test set and generate an error matrix
```{r}
pr <- predict(model, data[testInd, ], type="class")
#
score(model) # generate the confusion matrix
```
### Draw the ROC Curve
First, create a prediction object. Using the predicted class probabilities (rather than the hard class labels in `pr`) gives an ROC curve with more than one threshold point.
```{r}
prb <- predict(model, data[testInd, ], type="prob")[, 2] # probability of "Yes"
pred <- prediction(prb, data[testInd, 21])
perf <- performance(pred,"tpr","fpr")
plot(perf, main="ROC curve", colorize=T)
```
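As a side note, ROCR will also compute the area under this curve from the same `pred` object:
```{r}
# AUC for the tree model on the test set
auc <- performance(pred, "auc")@y.values[[1]]
round(auc, 3)
```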
### Explore an unpruned tree
The complexity parameter cp sets the minimum improvement in fit that a split must achieve to be kept (default = 0.01). With cp=0 the tree is grown out as far as the other control parameters allow; the cross-validated error typically decreases at first and then begins to increase as the tree overfits.
```{r}
control <- rpart.control(minsplit=10,
                         minbucket=5,
                         maxdepth=20,
                         usesurrogate=0,
                         maxsurrogate=0,
                         cp=0)
model2 <- rpart(formula=form,control=control,data=data[trainInd,])
print(model2$cptable)
plotcp(model2)
grid()
```
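A natural follow-up (not in the original script) is to prune the overgrown tree back at the cp value with the smallest cross-validated error, using rpart's `prune()`:
```{r}
# Prune model2 at the cp value minimizing the cross-validated error (xerror)
cp.tab <- model2$cptable
best.cp <- cp.tab[which.min(cp.tab[, "xerror"]), "CP"]
model2.pruned <- prune(model2, cp=best.cp)
printcp(model2.pruned)
```
The pruned tree usually generalizes better than either the default cp=0.01 tree or the fully grown cp=0 tree.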