WEIGHT LIFTING EXERCISE DATA ANALYSIS
========================================================
## SUMMARY
In this paper, several models will be built to fit the Weight Lifting Exercise data. The models will be compared with each other and the best performing one will be chosen to predict which class each test observation belongs to.
The corresponding .Rmd source file with the code used to perform the analysis can be found at the following Github repository: <https://github.com/MauroGentile/PracMachineLearning>
After an initial preprocessing, the data will be split into a training set, representing 70% of the available data, and a validation set, representing the remaining 30%. A total of 7 different classification models (Random Forest, Knn, C5.0, GBM, Linear, Quadratic and Flexible Discriminant Analysis) will then be built on the training set and optimized through 10-fold cross-validation.
The models will finally be compared in terms of:
1. accuracy, evaluated on the validation set
2. training time, that is to say, the time each model needs to train on the training set
As a result, it will be shown that Random Forest is the best performing algorithm, with a 99.3% accuracy on the validation set, that is to say a 0.7% Out of Sample Error. However, Random Forest is also an extremely resource-consuming algorithm: it took 1 hour and 43 minutes to perform the 10-fold cross-validated training on my iMac 2.7 GHz Core i5 with 8 GB 1600 DDR3 memory.
On the other hand, the C5.0 algorithm reaches nearly the same Out of Sample Error as Random Forest (0.8% vs 0.7%) but took only 14 minutes to perform the same 10-fold cross-validation on the same training set, that is to say, it was 7 times faster than Random Forest.
In other words, RF and C5.0 are nearly indistinguishable from an accuracy point of view, but C5.0 is a much faster algorithm than Random Forest, at least on this data set.
Taking this concept to an extreme, if computing resources or timing is an issue and the training procedure has to be repeated frequently (for instance if the data set gets updated often), Knn may become an excellent alternative: while it does not reach an accuracy as high as Random Forest, it is still an excellent model, with a superb 96.9% accuracy, and it takes only a minute and a half to train, about 70 times less than Random Forest and nearly 10 times less than C5.0.
In conclusion, Random Forest is the algorithm showing the lowest Out of Sample Error rate and is therefore chosen as the best model. However, if the training operation has to be repeated, we may end up preferring C5.0 over Random Forest and, depending on how frequently the training has to be repeated, we may also end up preferring Knn over C5.0, foregoing a little accuracy for the sake of training speed.
## PRELIMINARY SET-UP
### Loading and preprocessing
Let's first load the data from the training file and perform the introductory manipulations.
In order, we will:
1. read the data
2. replace non-numeric values with NAs
3. exclude the first 7 columns, which have nothing to do with the output
4. exclude those variables with more than 95% NA values
```{r reading, echo=TRUE, cache=TRUE, results='hide'}
library(caret)
initialTrainingDataset <- read.csv("pml-training.csv", header = TRUE)
```
```{r replace_non_numeric_values, echo=TRUE, cache=TRUE, results='hide'}
# Coerce every column except the output ("classe") to numeric;
# non-numeric entries become NAs (coercion warnings are suppressed on purpose)
suppressWarnings(
    for (i in 1:(ncol(initialTrainingDataset) - 1)) {
        initialTrainingDataset[, i] <- as.numeric(as.character(initialTrainingDataset[, i]))
    }
)
```
```{r column_selections, echo=TRUE, cache=TRUE, results='hide'}
# exclude the first 7 columns (row id, user name, timestamps and window info)
PartiallyCleanedTrainingData <- initialTrainingDataset[, -c(1:7)]
# exclude all columns with more than 95% NA values
FinalTrainingData <- PartiallyCleanedTrainingData[, -which(colSums(is.na(PartiallyCleanedTrainingData))/nrow(PartiallyCleanedTrainingData) > 0.95)]
```
### Split in training and validation sets
The training set will be extracted through the function "createDataPartition" from the caret package, which makes a "stratified" extraction (i.e. it maintains the same output class balance as the original data set).
Since a large amount of data is available, we can build the models using 70% of the data, while the remaining 30% will be used to evaluate accuracies.
```{r splitting, echo=TRUE, cache=TRUE}
inTrain <- createDataPartition(y = FinalTrainingData$classe, p = 0.7, list= FALSE)
train <- FinalTrainingData[inTrain, ]
validation <- FinalTrainingData[-inTrain, ]
dim(train)
dim(validation)
```
### Class balance check
In classification problems it is important to check whether the classes are balanced.
```{r classbalance, echo=TRUE, cache=TRUE}
table(train$classe)
```
In this case, while class A is much more populated than the other classes, the imbalance is not a cause for concern.
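The raw counts can also be read as proportions; a small complementary check, using base R only:
```{r class_proportions, echo=TRUE, cache=TRUE}
# Share of each class in the training set, as a percentage
round(100 * prop.table(table(train$classe)), 1)
```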
## MODELS BUILDING
In the following, several classification models will be applied to the training data:
1. RF: Random forest
2. LDA: Linear Discriminant Analysis
3. QDA: Quadratic Discriminant Analysis
4. FDA: Flexible Discriminant Analysis
5. GBM: Gradient boosting
6. KNN: k-nearest neighbors
7. C5.0: an evolution of C4.5
Each model will be automatically optimized through a 10-fold cross-validation passed as a parameter to the train function of the caret package (parameters: method="cv", number=10).
Center-and-scale preprocessing will be applied exclusively to LDA, QDA, FDA and Knn, since the other models are not sensitive to such preprocessing (refer to "Applied Predictive Modeling" by Max Kuhn and Kjell Johnson, 2013, p. 550).
Through the function proc.time(), the processing time of each method will also be measured and stored for a subsequent comparative analysis.
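Since every model below uses the same resampling scheme, the control object could in principle be defined once and reused in each train() call; a minimal sketch of this hypothetical refactoring (the chunks below keep their own inline definitions):
```{r shared_ctrl, echo=TRUE, cache=TRUE}
# Hypothetical refactoring: one shared 10-fold cross-validation specification,
# reusable as trControl = ctrl in every train() call below
ctrl <- trainControl(method = "cv", number = 10)
```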
### RANDOM FOREST
```{r randomForest, echo=TRUE, cache=TRUE}
set.seed(100)
ptm <- proc.time()
rf <- train(classe ~ ., data = train, method = "rf", prox = TRUE,
            trControl = trainControl(method = "cv", number = 10))
rf_time<- proc.time() - ptm
```
### LDA
```{r LDA, echo=TRUE, cache=TRUE}
set.seed(100)
ptm <- proc.time()
lda <- train(classe ~ .,data = train,
method = "lda",
preProc = c("center","scale"),
trControl = trainControl(method = "cv", number=10))
LDA_time<- proc.time() - ptm
```
### QDA
```{r QDA, echo=TRUE, cache=TRUE}
set.seed(100)
ptm <- proc.time()
qda <- train(classe ~ ., data = train,
method = "qda",
preProc = c("center","scale"),
trControl = trainControl(method = "cv", number=10))
QDA_time<- proc.time() - ptm
```
### FDA
```{r FDA, echo=TRUE, cache=TRUE}
set.seed(100)
ptm <- proc.time()
fda <- train(classe ~ .,data = train,
method = "fda",
preProc = c("center","scale"),
trControl = trainControl(method = "cv", number=10))
FDA_time<- proc.time() - ptm
```
### GBM
```{r boost, echo=TRUE, cache=TRUE, results='hide'}
set.seed(100)
ptm <- proc.time()
GBM <- train(classe ~ ., method = "gbm", data = train, trControl = trainControl(method = "cv", number=10))
GBM_time<- proc.time() - ptm
```
### Knn
```{r KNN, echo=TRUE, cache=TRUE}
set.seed(100)
ptm <- proc.time()
knn <- train(classe ~ .,data = train,
method = "knn",
preProc = c("center","scale"),
trControl = trainControl(method = "cv", number=10))
knn_time<- proc.time() - ptm
```
### C5.0
```{r c5, echo=TRUE, cache=TRUE}
library(C50)
library(plyr)
set.seed(100)
ptm <- proc.time()
c5 <- train(classe ~ .,data = train,
method = "C5.0",
trControl = trainControl(method = "cv",
number = 10))
c5_time<- proc.time() - ptm
```
## ACCURACY EVALUATION ON THE VALIDATION SET AND BEST MODEL SELECTION
### Confusion matrix
The confusion matrix of each model will now be calculated on the validation set.
```{r cm, echo=TRUE, cache=TRUE}
cm_rf <- confusionMatrix(validation$classe, predict(rf, validation))
cm_knn <- confusionMatrix(validation$classe, predict(knn, validation))
cm_c5 <- confusionMatrix(validation$classe, predict(c5, validation))
cm_lda <- confusionMatrix(validation$classe, predict(lda, validation))
cm_qda <- confusionMatrix(validation$classe, predict(qda, validation))
cm_fda <- confusionMatrix(validation$classe, predict(fda, validation))
cm_GBM <- confusionMatrix(validation$classe, predict(GBM, validation))
```
While a graphical comparison of the accuracies and confidence intervals of the 7 models is performed in the next chapter, the tabular version of all these confusion matrices can be found in Appendix 1.
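As a note on how the accuracy figures are extracted in the next chunk: the overall element of a caret confusionMatrix stores the accuracy and its 95% confidence interval under named entries, so named access is an equivalent alternative to the positional indices used below; a short sketch:
```{r overall_names, echo=TRUE, cache=TRUE}
# Named access to accuracy and its 95% confidence interval
# (equivalent to overall[1], overall[3] and overall[4] used below)
cm_rf$overall[c("Accuracy", "AccuracyLower", "AccuracyUpper")]
```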
### Plot of accuracies on the validation set (equivalent to 1 - Out of Sample Error)
```{r df, echo=TRUE, cache=TRUE}
# Build a summary data frame: one row per model, with the validation accuracy,
# its confidence interval and the training times measured above
Models <- data.frame(name="Random forests", accuracy=cm_rf$overall[1], lower=cm_rf$overall[3], upper=cm_rf$overall[4], elapsed_time=rf_time[3], user_time=rf_time[1]+rf_time[4], system_time=rf_time[2]+rf_time[5])
Models <- rbind(Models, data.frame(name="KNN", accuracy=cm_knn$overall[1], lower=cm_knn$overall[3], upper=cm_knn$overall[4], elapsed_time=knn_time[3], user_time=knn_time[1]+knn_time[4], system_time=knn_time[2]+knn_time[5]))
Models <- rbind(Models, data.frame(name="C5", accuracy=cm_c5$overall[1], lower=cm_c5$overall[3], upper=cm_c5$overall[4], elapsed_time=c5_time[3], user_time=c5_time[1]+c5_time[4], system_time=c5_time[2]+c5_time[5]))
Models <- rbind(Models, data.frame(name="LDA", accuracy=cm_lda$overall[1], lower=cm_lda$overall[3], upper=cm_lda$overall[4], elapsed_time=LDA_time[3], user_time=LDA_time[1]+LDA_time[4], system_time=LDA_time[2]+LDA_time[5]))
Models <- rbind(Models, data.frame(name="QDA", accuracy=cm_qda$overall[1], lower=cm_qda$overall[3], upper=cm_qda$overall[4], elapsed_time=QDA_time[3], user_time=QDA_time[1]+QDA_time[4], system_time=QDA_time[2]+QDA_time[5]))
Models <- rbind(Models, data.frame(name="FDA", accuracy=cm_fda$overall[1], lower=cm_fda$overall[3], upper=cm_fda$overall[4], elapsed_time=FDA_time[3], user_time=FDA_time[1]+FDA_time[4], system_time=FDA_time[2]+FDA_time[5]))
Models <- rbind(Models, data.frame(name="GBM", accuracy=cm_GBM$overall[1], lower=cm_GBM$overall[3], upper=cm_GBM$overall[4], elapsed_time=GBM_time[3], user_time=GBM_time[1]+GBM_time[4], system_time=GBM_time[2]+GBM_time[5]))
```
```{r accuracy_plot, echo=TRUE, cache=TRUE}
# Sort the models by increasing accuracy so the plot lists them in that order
Models <- Models[order(Models[, 2]), ]
ggplot(Models, aes(x=name, y=accuracy)) +
geom_point(size=4) +
geom_errorbar(aes(ymin=lower, ymax=upper), width=.2) +
scale_x_discrete(limits=Models$name)+
coord_flip()
```
The plot above displays the accuracy of each method on the validation set together with its confidence interval. The corresponding Out of Sample error is the complement to 1 of the accuracy.
Two methods stand out from the rest and are virtually indistinguishable from one another: Random Forest and C5.0, with accuracies on the validation set of 99.3% and 99.2% respectively, that is to say an Out of Sample error of 0.7% and 0.8%. Both models also show a very narrow confidence interval, about 0.5% wide.
Two other well performing techniques are Knn and GBM, with accuracies of 96.9% and 96.5% respectively, but with wider confidence intervals than RF and C5.0.
The discriminant analysis models perform much worse, with accuracies lower than 90%. LDA is especially poor, with an accuracy as low as 70%.
The fact that LDA performs much worse than QDA, which in turn performs much worse than Knn, confirms that this is a problem with highly non-linear decision boundaries between classes.
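Since the Out of Sample error is just 1 minus the validation accuracy, it can be tabulated directly from the Models data frame built above; a short sketch:
```{r oos_error, echo=TRUE, cache=TRUE}
# Estimated Out of Sample error per model (1 - validation accuracy)
data.frame(name = Models$name, oos_error = round(1 - Models$accuracy, 4))
```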
### Plot of elapsed training time
During the model building phase, the function proc.time() was used to measure the time needed to build each of the 7 models.
In the graph below, the elapsed training time of each model is plotted against its accuracy, so that a single plot shows both variables of interest for model comparison and evaluation.
As you can see, Random Forest is an extremely resource-consuming algorithm: it took 6212.69 sec to train on my iMac 2.7 GHz Core i5 with 8 GB 1600 DDR3 memory.
The C5.0 algorithm, while reaching nearly the same accuracy, takes 1/7 of that time to train on the same training data (860 sec).
C5.0 is therefore a better model than Random Forest, at least on this data set, since while reaching nearly the same accuracy it is also much faster.
Stretching this concept to an extreme, depending on how often the user has to repeat the model building phase (for instance on a data set which is updated frequently), one may prefer the Knn algorithm over both C5.0 and Random Forest: while it does not reach the same accuracy (96.9% for Knn, against 99.2% for C5.0 and 99.3% for Random Forest), Knn took only a minute and a half to train, whereas Random Forest took 1 hour and 43 minutes and C5.0 took 14 minutes on the same data set.
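To make the raw proc.time() output below easier to read, the elapsed training times can also be expressed in minutes; a small sketch based on the Models data frame built above:
```{r time_minutes, echo=TRUE, cache=TRUE}
# Elapsed training time per model, converted from seconds to minutes
data.frame(name = Models$name, minutes = round(Models$elapsed_time / 60, 1))
```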
```{r times, echo=TRUE, cache=TRUE}
rf_time
c5_time
knn_time
GBM_time
QDA_time
FDA_time
LDA_time
```
```{r elapsed_time_plot, echo=TRUE, cache=TRUE}
ggplot(Models, aes(x = accuracy, y = elapsed_time)) +
    geom_point() +
    geom_text(aes(x = accuracy, label = name), size = 4, vjust = -0.8) +
    xlim(0.6, 1.1) +
    ylim(0, 6300)
```
## PERFORMANCE ON THE TEST DATA
As a last step in this project, I would like to compare the performance of the Random Forest, C5.0 and Knn models on the 20-row data set provided in the project.
The comparison will be performed both in terms of accuracy and in terms of testing time. Please notice that in this case the elapsed time being considered is NOT the time to build the model (as above) but the time needed to make 20 predictions using the models already built above.
This test data will be used only once and, regardless of the output, the models built will NOT be changed, since otherwise this test set would effectively become part of the training set.
Before running the code and seeing the output, let me state my expectations: if the analysis above is correct, I would expect the RF and C5.0 models to give exactly the same output. I would also expect Knn to give the same result (Knn accuracy is not as good as C5.0 but it is still very high), even though I would not be surprised by some discrepancies.
Finally, I would also expect the Knn testing time to be much lower than that of Random Forest and C5.0.
```{r test_data, echo=TRUE, cache=TRUE}
testingData <- read.csv("pml-testing.csv", header = TRUE)
# Apply the same numeric coercion used on the training data
suppressWarnings(
    for (i in 1:(ncol(testingData) - 1)) {
        testingData[, i] <- as.numeric(as.character(testingData[, i]))
    }
)
ptm <- proc.time()
test_rf <- predict(rf, newdata = testingData)
rf_test_time<- proc.time() - ptm
ptm <- proc.time()
test_c5 <- predict(c5, newdata = testingData)
c5_test_time<- proc.time() - ptm
ptm <- proc.time()
test_knn <- predict(knn, newdata = testingData)
knn_test_time<- proc.time() - ptm
test_rf
test_c5
test_knn
# Generate the 20 txt files with the 20 predictions to be submitted through the Coursera web page.
# Please notice that these predictions have been generated with c5, i.e. the best model chosen for this data set.
n = length(test_c5)
for(i in 1:n){
filename = paste0("problem_id_",i,".txt")
write.table(test_c5[i],file=filename,quote=FALSE,row.names=FALSE,col.names=FALSE)
}
```
Let's count the prediction differences between RF and C5.0, and between C5.0 and Knn.
Number of different predicted cases between RF and C5.0:
```{r test_data2, echo=TRUE, cache=TRUE}
sum(!(test_rf==test_c5))
```
Number of different predicted cases between C5 and Knn:
```{r test_data3, echo=TRUE, cache=TRUE}
sum(!(test_knn==test_c5))
```
Differences in testing time:
```{r test_data4, echo=TRUE, cache=TRUE}
rf_test_time
c5_test_time
knn_test_time
rf_c5_ratio <- rf_test_time[3]/c5_test_time[3]
c5_knn_ratio <- c5_test_time[3]/knn_test_time[3]
```
C5.0 was `r round(rf_c5_ratio, 1)` times faster than RF in making predictions.
Knn was `r round(c5_knn_ratio, 1)` times faster than C5.0 in making predictions.
As expected, C5.0 and Random Forest produced exactly the same predictions.
On the other hand, I was a bit surprised to see Knn generate 2 predictions different from those of C5.0 and Random Forest. I expected at most 1 discrepancy, if any, given that the 3 models have similar accuracies, all higher than 96%.
Finally, as expected, Knn was the fastest model at making the predictions.
## CONCLUSIONS
We have seen that Random Forest and C5.0 are the best models on this data set and virtually indistinguishable from each other, with Out of Sample errors of 0.7% and 0.8% respectively, but that C5.0 takes far less time to train. C5.0 is therefore chosen as the preferred model for this data set.
In circumstances where the training operation has to be performed frequently, Knn may become an excellent alternative if training time is a concern: by giving up a few points of accuracy, substantial gains in training and testing time can be achieved.
## APPENDIX 1: CONFUSION MATRIX OF EACH OF THE 7 CONSIDERED MODELS
```{r out, echo=TRUE, cache=TRUE}
cm_rf
cm_c5
cm_knn
cm_lda
cm_qda
cm_fda
cm_GBM
```