---
title: "Wine quality prediction with Machine Learning"
subtitle: "edX Capstone Final Project"
author: "Ciro B Rosa"
date: "21-Jan-2022"
output:
  pdf_document: default
  html_notebook: default
---
<br>
<br>
<br>
### Introduction
<br>
This is the second project submitted by the author for the HarvardX Data Science course, under the Capstone module. It aims to predict the quality of wine based on its chemical composition. The source dataset can be found on Kaggle, at the following link:
https://www.kaggle.com/yasserh/wine-quality-dataset
For ease of execution, all needed files can be downloaded from the author's GitHub page at:
https://github.com/cirobr/ds9-capstone-winequality-edx.git
<br>
<br>
<br>
### Project Contents
<br>
#### Dataset
<br>
The dataset "WineQT.csv" is stored on folder "/dat. It has 1142 observations and 13 (thirteen) columns: one outcome (quality), eleven predictors, and one column (Id), which is merely a sequential numbering for the observations. All columns are named at first row.
<br>
<br>
#### Code
<br>
Three files are presented in this project:
* File "project-code.R" contains the entire script for the project;
* File "project-nb.Rmd" is the corresponding notebook file;
* File "project-nb.pdf" is the executed notebook, stored in PDF format.
<br>
<br>
<br>
### Project Description
<br>
The code starts with environment setup; the dataset is then read and cleaned up. The dataset is randomly split into three parts: training, testing, and validation. The validation split is not touched until the final model is chosen.
Next, a sequence of ML models is applied to classify the outcome "quality". Each model's performance is measured by its accuracy and the corresponding confusion matrix. A table with all accuracies is updated at the end of each model section, in order to compare performance.
At the end, the chosen model is run on the validation split, in order to confirm the forecasted accuracy and to check whether the model has been overtrained.
The project was developed and tested on a general-purpose Linux Ubuntu 20.04 notebook, with RStudio version 1.4.1717 and R version 4.1.2 installed. It was also tested on a Linux Ubuntu 20.04 VM guest over a Windows 10 host. As such, the code is expected to run smoothly on any machine with 8 GB of RAM.
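Since the same 20% proportion is applied twice (first to carve out the validation split, then again on the remainder), the effective shares work out as in the short sketch below (illustrative arithmetic only, with a hypothetical variable name; not part of the pipeline):
```{r warning=FALSE}
# effective share of each split when 20% is carved out twice
pSplit <- 0.20
c(validation = pSplit,                 # 20% of the full dataset
  test       = (1 - pSplit) * pSplit,  # 16%
  train      = (1 - pSplit)^2)         # 64%
```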
<br>
<br>
<br>
### Project Execution
<br>
#### Environment setup
<br>
```{r warning=FALSE}
# suppress warnings
oldw <- getOption("warn")
options(warn = -1)

# install packages (if not already installed)
if(!require(ggplot2)) install.packages("ggplot2", repos = "http://cran.us.r-project.org")
if(!require(tidyverse)) install.packages("tidyverse", repos = "http://cran.us.r-project.org")
if(!require(caret)) install.packages("caret", repos = "http://cran.us.r-project.org")
if(!require(matrixStats)) install.packages("matrixStats", repos = "http://cran.us.r-project.org")
if(!require(nnet)) install.packages("nnet", repos = "http://cran.us.r-project.org")
if(!require(randomForest)) install.packages("randomForest", repos = "http://cran.us.r-project.org")
if(!require(kknn)) install.packages("kknn", repos = "http://cran.us.r-project.org")

# libraries
library(ggplot2)
library(tidyverse)
library(caret)
library(matrixStats)
library(nnet)          # multinom model
library(randomForest)  # rf model
library(kknn)          # kknn model

# global variables
numberOfDigits <- 5
options(digits = numberOfDigits)
proportionTestSet <- 0.20
control <- trainControl(method = "cv",   # 10-fold cross-validation configuration
                        number = 10,
                        p = .9,
                        allowParallel = TRUE)

# global functions
plot_optimization_chart <- function(model, chart_title){
  m <- model %>%
    ggplot(highlight = TRUE) +
    ggtitle(chart_title)
  print(m)
  print(model$bestTune)   # print() is needed inside a function for this to display
  model$finalModel
}
```
<br>
Next, the main dataset is read:
```{r warning=FALSE}
# read dataset from csv
# source of dataset: https://www.kaggle.com/yasserh/wine-quality-dataset
wineDF <- read_csv(file = "./dat/WineQT.csv") %>% as_tibble()
glimpse(wineDF)
```
<br>
Only a basic cleanup is needed, as the original dataset is provided in very good condition:
```{r warning=FALSE}
# remove column "Id" (not used) and make outcome "quality" as first column
wineDF <- wineDF %>% select(-Id) %>% relocate(quality)
# in column names, replace spaces with underscores "_"
names(wineDF) <- gsub(" ", "_", names(wineDF))
names(wineDF)
# check for NA's
sum(is.na(wineDF))
# display the cleaned dataset
head(wineDF)
```
<br>
Now, it is time to split the main dataset into three parts: validationSet, testSet, and trainSet:
```{r warning=FALSE}
# split the validation set for the final step (20%)
set.seed(1, sample.kind="Rounding")
testIndex <- createDataPartition(y = wineDF$quality,
                                 times = 1,
                                 p = proportionTestSet,
                                 list = FALSE)
validationSet <- wineDF[testIndex,]

# from the remaining dataset, split train and test sets (80% / 20%)
mainSet <- wineDF[-testIndex,]
set.seed(2, sample.kind="Rounding")
testIndex <- createDataPartition(y = mainSet$quality,
                                 times = 1,
                                 p = proportionTestSet,
                                 list = FALSE)
testSet <- mainSet[testIndex,]
trainSet <- mainSet[-testIndex,]
```
<br>
<br>
#### Data exploration
<br>
First of all, let's quickly check which wine quality rankings are present on the dataset:
```{r warning=FALSE}
# check for the different wine quality classes on the trainSet
qualityClasses <- unique(trainSet$quality) %>% sort()
qualityClasses
```
<br>
Secondly, let's check a histogram of the wine quality distribution. At the same time, we will verify that, for each quality category, the split preserved the class proportions in both trainSet and testSet:
```{r warning=FALSE}
# check for stratification of the train / test split
p1 <- trainSet %>%
  group_by(quality) %>%
  summarize(qty = n()) %>%
  mutate(split = 'trainSet')
p2 <- testSet %>%
  group_by(quality) %>%
  summarize(qty = n()) %>%
  mutate(split = 'testSet')
p <- bind_rows(p1, p2) %>% group_by(split)

p %>% ggplot(aes(quality, qty, fill = split)) +
  geom_bar(stat = "identity", position = "dodge") +
  ggtitle("Stratification of Testset / Trainset split") +
  scale_x_continuous(breaks = qualityClasses)
```
The chart above shows that the amount of training data for categories 4-7 is much higher than for categories 3 and 8. This imbalance could hinder model development, as categories with little data may not be predicted with good accuracy.
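As a quick complementary check (illustrative only, not part of the original pipeline), the imbalance can be quantified as class proportions:
```{r warning=FALSE}
# proportion of each quality class in the trainSet
round(prop.table(table(trainSet$quality)), 3)
```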
<br>
<br>
Next, we will take a look at each chemical parameter measured in the wine samples, and rank these predictors by their variability within the trainSet:
```{r fig.height=7, fig.width=10, warning=FALSE, dpi=200}
# variability analysis of predictors
X <- trainSet %>% select(-quality) %>% as.matrix()
sX <- colSds(X)
sX <- bind_cols(colnames(X), sX) %>% as_tibble()
colnames(sX) <- c("parameter", "stdev")

sX %>%
  arrange(desc(stdev)) %>%                                       # first, sort by value
  mutate(parameter = factor(parameter, levels = parameter)) %>%  # this trick updates the factor levels
  ggplot(aes(parameter, stdev)) +
  geom_bar(stat = "identity") +
  geom_text(aes(label = sprintf("%0.2f", stdev)), size = 3, vjust = -0.8) +
  theme(axis.text.x = element_text(angle = 40, hjust = 1, size = 10)) +
  ggtitle("Ranking of predictors by variability")
```
<br>
After checking the variability of all predictors, we will remove those flagged as "low variability" by the "nearZeroVar" function.
It is worth mentioning that, although the chart above shows predictors with (visually) low variability, all of them are kept in the dataset after executing "nearZeroVar". In other words, the function considered that every predictor has enough variability to stay in.
```{r warning=FALSE}
# remove predictors with small variance
nzv <- trainSet %>%
  select(-quality) %>%
  nearZeroVar()
removedPredictors <- colnames(trainSet[,nzv])
removedPredictors
trainSet <- trainSet %>% select(-all_of(removedPredictors))
```
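For readers who want to see the evidence behind that decision, caret's saveMetrics option exposes the frequency-ratio and percent-unique metrics that nearZeroVar uses (a diagnostic sketch, not part of the original pipeline):
```{r warning=FALSE}
# diagnostic: the metrics behind the nearZeroVar decision, per predictor
trainSet %>%
  select(-quality) %>%
  nearZeroVar(saveMetrics = TRUE)
```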
<br>
As a last step of data preparation, memory is cleaned up and the outcome "quality" is converted to a factor, as all the analysis to follow is categorical (i.e., wine classification).
```{r warning=FALSE}
# remove temporary data
rm(mainSet, testIndex, wineDF, X, p, p1, p2, sX)
rm(nzv, removedPredictors)
# change outcome to factor
trainSet$quality <- as_factor(trainSet$quality)
testSet$quality <- as_factor(testSet$quality)
validationSet$quality <- as_factor(validationSet$quality)
```
<br>
<br>
#### Find the best model
<br>
In this analysis, all but the first (naive) model are trained and optimized on the trainSet using 10-fold cross-validation, then evaluated against the testSet. The corresponding confusion matrix and accuracy, along with the key parameters of the best-fit model, are displayed for each one.
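Each model section below repeats the same predict-and-score steps; purely as a sketch (a hypothetical helper, not used in the original script), the pattern could be factored out like this:
```{r warning=FALSE}
# hypothetical helper summarizing the repeated evaluation pattern below:
# predict on a data split and return caret's confusion matrix for it
evaluate_model <- function(fit, dataSet) {
  predicted <- predict(fit, newdata = dataSet)
  confusionMatrix(data = predicted, reference = dataSet$quality)
}
```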
<br>
<br>
##### *The naive average model*
<br>
This model simply takes the average of the "quality" vector within the trainSet, rounded to the nearest integer, as the single quality prediction for all wines. One can anticipate extremely poor performance from this approach. Nevertheless, the author decided to include it merely for didactic purposes.
```{r warning=FALSE}
# training
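# note: quality is a factor at this point, so as.numeric() returns the factor
# level indices (1-6) rather than the quality labels (3-8); the rounded mean
# of those indices is what maps to class "4" below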
mu <- mean(as.numeric(trainSet$quality))
mu <- round(mu)
# predicting
N <- length(testSet$quality)
predicted <- replicate(N, mu) %>% factor(levels = qualityClasses)
# accuracy calculation
cm <- confusionMatrix(data = predicted, reference = testSet$quality)
cm$table
acc <- cm$overall["Accuracy"]
accuracyResults <- tibble(model = "naive",
                          accuracy = acc)
accuracyResults
```
As anticipated, the prediction simply assigns category 4 to all wines, which leads to a poor confusion matrix: its main diagonal is not dominated by large values, as it would be for a good model. The accuracy of only 3.8% also reflects that.
<br>
<br>
<br>
##### *Penalized multinomial regression*
<br>
The model "multinom" has not been covered on course lectures. It is part of the "nnet" package and its documentation can be found at":
https://cran.r-project.org/web/packages/nnet/nnet.pdf
Its tuneGrid has only one possible parameter (weight decay), and as such it is programmed to be scanned by the tuneGrid.
```{r warning=FALSE}
# training
set.seed(3, sample.kind="Rounding")
mNomFit <- trainSet %>% train(quality ~ .,
                              method = "multinom",
                              data = .,
                              trace = FALSE,   # suppress printing of iterations
                              tuneGrid = data.frame(decay = seq(0.01, 0.5, 0.01)),
                              trControl = control
                              )
plot_optimization_chart(mNomFit, "Optimization for multinom model")
# predicting
predicted <- testSet %>% predict(mNomFit, newdata = .)
# accuracy calculation
cm <- confusionMatrix(data = predicted, reference = testSet$quality)
cm$table
acc <- cm$overall["Accuracy"]
accuracyResults <- bind_rows(accuracyResults, tibble(model = "multinom",
                                                     accuracy = acc))
accuracyResults
```
Compared to the naive model, we obtain a significant improvement in accuracy, above 60%. The confusion matrix also shows much better prediction of categories 5-6, marginal prediction of category 7, and no correct predictions for categories 3, 4 and 8. This result is in line with the earlier observation that there is little data for those nonperforming classes.
<br>
<br>
<br>
##### *Random forest*
<br>
```{r warning=FALSE}
# training
set.seed(3, sample.kind="Rounding")
rfFit <- trainSet %>% train(quality ~ .,
                            method = "rf",
                            data = .,
                            tuneGrid = data.frame(mtry = seq(3, 7, 1)),
                            trControl = control)
plot_optimization_chart(rfFit, "Optimization for random forest")
# predicting
predicted <- testSet %>% predict(rfFit, newdata = ., type = "raw")
# accuracy calculation
cm <- confusionMatrix(data = predicted, reference = testSet$quality)
cm$table
acc <- cm$overall["Accuracy"]
accuracyResults <- bind_rows(accuracyResults, tibble(model = "rf",
                                                     accuracy = acc))
accuracyResults
```
The random forest model further improved overall accuracy compared to the previous multinom model, apparently thanks to better prediction of category 7. However, predictions for categories 3, 4 and 8 remain unsatisfactory.
<br>
<br>
<br>
##### *KNN*
<br>
This specific KNN (k-nearest-neighbors) model is part of the caret package and has been covered in the course lectures. It has only one tuning parameter.
In a later step, we will also run KKNN (weighted KNN), which has more tuning parameters than this model.
```{r warning=FALSE}
# training
set.seed(3, sample.kind="Rounding")
knnFit <- trainSet %>% train(quality ~ .,
                             method = "knn",
                             data = .,
                             tuneGrid = data.frame(k = seq(3, 71, 2)),
                             trControl = control
                             )
plot_optimization_chart(knnFit, "Optimization for knn")
# predicting
predicted <- testSet %>% predict(knnFit, newdata = ., type = "raw")
# accuracy calculation
cm <- confusionMatrix(data = predicted, reference = testSet$quality)
cm$table
acc <- cm$overall["Accuracy"]
accuracyResults <- bind_rows(accuracyResults, tibble(model = "knn",
                                                     accuracy = acc))
accuracyResults
```
Accuracy for KNN deteriorates compared with all previous models except the naive one. This is reflected in the confusion matrix, which shows poor prediction of category 7.
<br>
<br>
<br>
##### *KKNN / search among all models available*
<br>
The KKNN (weighted KNN) model has three tuning parameters, one of which is the kernel, which makes it possible to weight the contribution of a data point's surrounding points when classifying it.
For the tuning grid, the author decided to evaluate the performance of all available kernels. The best-performing kernel is then taken for further tuning in the next step.
The model documentation can be found at:
https://cran.r-project.org/web/packages/kknn/kknn.pdf
```{r warning=FALSE}
# training
set.seed(3, sample.kind="Rounding")
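# note: data.frame() pairs these vectors by recycling (20 rows in total), so
# every kernel is tried with only a couple of kmax/distance combinations;
# expand.grid() would be needed to scan the full kmax x distance x kernel grid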
kknnFit <- trainSet %>% train(quality ~ .,
                              method = "kknn",
                              data = .,
                              tuneGrid = data.frame(kmax = seq(3, 41, 2),
                                                    distance = c(1, 2),
                                                    kernel = c("rectangular",
                                                               "triangular",
                                                               "epanechnikov",
                                                               "biweight",
                                                               "tri-weight",
                                                               "cos",
                                                               "inv",
                                                               "gaussian",
                                                               "rank",
                                                               "optimal"
                                                               )),
                              trControl = control)
plot_optimization_chart(kknnFit, "Optimization for kknn / models search")
# predicting
predicted <- testSet %>% predict(kknnFit, newdata = ., type = "raw")
# accuracy calculation
cm <- confusionMatrix(data = predicted, reference = testSet$quality)
cm$table
acc <- cm$overall["Accuracy"]
accuracyResults <- bind_rows(accuracyResults, tibble(model = "kknn/all models",
                                                     accuracy = acc))
accuracyResults
```
So far, KKNN is the best model analyzed. The optimization chart clearly shows that the "inv" kernel outperforms all the others. For this reason, it will be tuned further in the next step.
<br>
<br>
<br>
##### *KKNN / inv model*
<br>
The model "inv" is the winner from the search that was carried out above. As such, the tuning parameters for this run are expanded.
```{r warning=FALSE}
# training
set.seed(3, sample.kind="Rounding")
kknnInvFit <- trainSet %>% train(quality ~ .,
                                 method = "kknn",
                                 data = .,
                                 tuneGrid = data.frame(kmax = seq(3, 81, 2),
                                                       distance = seq(0.5, 2, 0.5),
                                                       kernel = "inv"),
                                 trControl = control)
plot_optimization_chart(kknnInvFit, "Optimization for kknn / inv model")
# predicting
predicted <- testSet %>% predict(kknnInvFit, newdata = ., type = "raw")
# accuracy calculation
cm <- confusionMatrix(data = predicted, reference = testSet$quality)
cm$table
acc <- cm$overall["Accuracy"]
accuracyResults <- bind_rows(accuracyResults, tibble(model = "kknn/inv",
                                                     accuracy = acc))
accuracyResults
```
We have obtained an accuracy in excess of 70% in this experiment. For this reason, the author selects the kknn/inv model for the final verification against the validationSet in the next step.
<br>
<br>
<br>
##### *Best model validation*
<br>
```{r warning=FALSE}
# predicting
predicted <- validationSet %>% predict(kknnInvFit, newdata = ., type = "raw")
# accuracy calculation
cm <- confusionMatrix(data = predicted, reference = validationSet$quality)
cm$table
acc <- cm$overall["Accuracy"]
accuracyResults <- bind_rows(accuracyResults, tibble(model = "kknn/inv/validation",
                                                     accuracy = acc))
accuracyResults
# reset warnings
options(warn = oldw)
```
The KKNN/inv-kernel model is validated. The accuracy observed on the validation set drops to around 61%, from the previous 70+%. This result is expected: the validationSet was never seen by the model, whereas the testSet indirectly influenced it, since it was used repeatedly to compare and select models. As such, prediction performance on the validationSet is lower than on the testSet.
<br>
<br>
<br>
### Conclusion
<br>
The WineQT dataset was cleaned and analyzed. A number of models were tested, and the best-performing one, KKNN with the "inv" kernel, showed an accuracy in excess of 70% on testing and above 60% on validation. Good generalization of the model was achieved, thanks mainly to the 10-fold cross-validation executed during the training phase.
On the other hand, the lack of training data for categories 3, 4 and 8 has led to relatively poor accuracy for those classes, despite the successful effort to improve the overall figures. Simple inspection of the confusion matrix confirms this.
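As a quick supporting check (a sketch only; it assumes the confusion matrix cm from the validation step above is still in scope), the per-class sensitivity makes that weakness explicit:
```{r warning=FALSE}
# per-class sensitivity (recall) from the validation confusion matrix;
# the sparse classes 3, 4 and 8 are expected to score near zero
cm$byClass[, "Sensitivity"]
```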
<br>
<br>
<br>
### Next steps / Future work
<br>
The author foresees two possible ways to improve the model accuracy:
* To collect more training data;
* To investigate other algorithms, such as SVM and CNN.
<br>
<br>
The author is currently enrolled in a PhD program in Computer Science / AI / ML, with conclusion forecast for 2024. He plans to research the application of AI/ML to embedded microcontrollers, with and without GPU support. His plans include the use of the Julia language in his research, for the following reasons:
* It is a compiled language;
* It has native parallel-processing support;
* It has growing support for ARM microcontrollers;
* Its syntax is similar to Python's.
<br>
<br>
<br>
### References
<br>
* Text book: http://amlbook.com/
* NNET: https://cran.r-project.org/web/packages/nnet/nnet.pdf
* KKNN: https://cran.r-project.org/web/packages/kknn/kknn.pdf