---
title: "Machine Learning and Human Activity Recognition: Building a Classifier for Wearable Accelerometers’ Data"
author: "@jrcajide"
output: html_document
---
[Human Activity Recognition](https://en.wikipedia.org/wiki/Activity_recognition) has become a key research area in recent years and is gaining increasing attention from the pervasive computing research community.
Research on activity recognition has traditionally focused on discriminating between different activities, i.e. predicting *which* activity was performed at a specific point in time.
This analysis, based on [Qualitative Activity Recognition of Weight Lifting Exercises](http://groupware.les.inf.puc-rio.br/work.jsf?p1=11201), focuses instead on the **quality of executing an activity**. The results underline the potential of model-based assessment and the positive impact of real-time user feedback on the quality of execution.
The data, the [Weight Lifting Exercises Dataset](http://groupware.les.inf.puc-rio.br/har), come from six young, healthy participants who performed one set of 10 repetitions of the Unilateral Dumbbell Biceps Curl in five different fashions: exactly according to the specification (Class A), throwing the elbows to the front (Class B), lifting the dumbbell only halfway (Class C), lowering the dumbbell only halfway (Class D) and throwing the hips to the front (Class E).
A **random forest** was the machine learning model used to classify the unlabeled data obtained from the 4 accelerometers in the test data set and to predict the class of each repetition based on 52 variables.
Although the random forest implementation available through the `caret` package performed well, the final model was tuned and run with Breiman and Cutler's random forest implementation in the `randomForest` package, achieving **99% accuracy** on the classification task.
# 1. Load training and testing data
Data import and wrangling were performed with the `data.table` package.
Empty strings and the literal values `NA` and `#DIV/0!` found in the original data set were read as *missing data*.
```{r, echo=TRUE}
rm(list = ls()); gc(reset = TRUE)
set.seed(1973)
# loading libraries -------------------------------------------------------
library(data.table)
library(dplyr)
library(knitr)
library(randomForest)
library(caret)
library(ggplot2)
library(ggthemes)
library(viridis)
# importing data ----------------------------------------------------------
DT.train <- fread("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv", stringsAsFactors = FALSE, drop = 'V1', na.strings = c('','#DIV/0!','NA'))
DT.test <- fread("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv", stringsAsFactors = FALSE, drop = 'V1', na.strings = c('','#DIV/0!','NA'))
```
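The effect of `na.strings` can be checked on a tiny inline sample. The sketch below is illustrative only: the four rows and their values are made up, and only the general layout (an index column followed by sensor readings and `classe`) is assumed to resemble the raw files.

```r
library(data.table)

# Made-up sample: an index column, two sensor columns and the outcome
csv_text <- paste(
  "V1,roll_belt,kurtosis_roll_belt,classe",
  "1,1.41,,A",
  "2,1.42,#DIV/0!,A",
  "3,1.48,NA,B",
  "4,1.50,0.25,B",
  sep = "\n"
)

toy <- fread(text = csv_text, drop = "V1",
             na.strings = c("", "#DIV/0!", "NA"))

# The empty field, "#DIV/0!" and "NA" are all read as missing values
is.na(toy$kurtosis_roll_belt)
```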
# 2. Exploratory data analysis
## Missing values
Many of the variables have a high percentage of missing values.
```{r, echo=TRUE}
str(DT.train)
```
There are several variables that contain approximately 97.93% missing values.
```{r}
# missing values ----------------------------------------------------------
# Proportion of missing values by variable:
sapply(DT.train, function(x) sum(is.na(x)) / nrow(DT.train) )
```
Those variables were removed from the data sets:
```{r}
# Keep only the columns whose share of missing values is below the ~97.93% level observed above
DT.train <- DT.train[, .SD, .SDcols = sapply(DT.train, function(x) (sum(is.na(x))) / nrow(DT.train)) < 0.9793089]
DT.test <- DT.test[, .SD, .SDcols = sapply(DT.test, function(x) (sum(is.na(x))) / nrow(DT.test)) < 0.9793089]
```
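The same kind of filter can be written with a named cut-off instead of a hard-coded proportion. A minimal sketch on made-up data, where `max_na_share` and its value of 0.5 are assumptions chosen for illustration rather than the threshold used in this analysis:

```r
library(data.table)

toy <- data.table(a = c(1, 2, 3, 4),
                  b = c(NA, NA, NA, 1),
                  c = c(NA, 2, 3, 4))

# Share of missing values per column: a = 0, b = 0.75, c = 0.25
na_share <- colMeans(is.na(toy))

max_na_share <- 0.5                       # hypothetical cut-off
keep <- names(na_share)[na_share < max_na_share]
toy_clean <- toy[, keep, with = FALSE]

names(toy_clean)
```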
Only the variables present in both the training and testing data sets and related to the belt, forearm, arm and dumbbell sensors are needed to predict the `classe` variable:
```{r}
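# Keep the outcome plus the belt, arm, forearm and dumbbell readings
DT.train <- DT.train[, grep("classe|belt|arm|dumbbell", names(DT.train)), with = FALSE]
# Drop any predictor that is not also available in the test set
DT.train <- DT.train[, which((names(DT.train) %in% names(DT.test)) | names(DT.train) == "classe"), with = FALSE]
DT.train <- DT.train[, classe := as.factor(classe)]
DT.train[]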
```
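Because the pattern is an unanchored regular expression, `arm` also matches the `forearm` columns, so no separate alternative is needed for them. A small check on a handful of column names (a sample picked for illustration, not the full set):

```r
cols <- c("user_name", "roll_belt", "pitch_forearm",
          "yaw_arm", "gyros_dumbbell_x", "classe")

# value = TRUE returns the matching names instead of their positions
kept <- grep("classe|belt|arm|dumbbell", cols, value = TRUE)
kept
```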
## Classes
A simple bar plot showing the frequency of each class in the training data set:
```{r}
barplot(table(DT.train$classe),col=viridis(5), border = "white", main="Classes for repetitions of the Unilateral Dumbbell Biceps Curl", sub="Exactly according to the specification (Class A), mistakes (Class B to E)")
# barplot(prop.table(table(DT.train$classe)),col=viridis(5))
```
# 3. Cross validation
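Cross validation is a technique for assessing how the results of a statistical analysis will generalize to an independent data set.
Here a single hold-out split was used: the training data set was split into a training part (60%) and a validation part (40%) with the `createDataPartition` function from the `caret` package, which samples within each class so that both parts keep similar class proportions.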
```{r}
# Hold-out split: 60% training, 40% validation
inTrain <- createDataPartition(y = DT.train$classe, p = 0.6, list = FALSE)
DT.validation <- DT.train[-inTrain, ]
DT.train <- DT.train[inTrain, ]
```
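As a quick check of what `createDataPartition` does, the sketch below applies the same call to R's built-in `iris` data (a stand-in chosen for illustration) and confirms that the two parts do not overlap and that each class is split in roughly the requested proportion.

```r
library(caret)

set.seed(1973)
idx <- createDataPartition(y = iris$Species, p = 0.6, list = FALSE)

train_part <- iris[idx, ]
valid_part <- iris[-idx, ]

# Share of each species that ended up in the training part (about 0.6 each)
train_share <- table(train_part$Species) / table(iris$Species)
train_share
```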
# 4. Modelling
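The random forest technique builds a large ensemble of decision trees. Each tree is grown on a random sample of the original data drawn with replacement (bootstrapping) and considers only a random subset of the predictors at each split; the forest then classifies an observation by majority vote of its trees.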
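One consequence of sampling with replacement is that each tree sees only part of the data: on average about 63% of the rows are drawn at least once, and the remaining rows (the *out-of-bag* rows, roughly 37%) can serve as a built-in validation set for that tree. A minimal base-R sketch of a single bootstrap draw, with the sample size `n` chosen arbitrarily:

```r
set.seed(1)
n <- 10000                                   # arbitrary sample size

# One bootstrap sample: n row indices drawn with replacement
boot_idx <- sample(n, size = n, replace = TRUE)

# Rows never drawn are "out-of-bag" for the tree grown on this sample
oob_share <- 1 - length(unique(boot_idx)) / n

# The expected share is (1 - 1/n)^n, which approaches exp(-1), about 0.368
c(observed = oob_share, expected = exp(-1))
```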
## 4.1. Model tuning
`tuneRF` searches for the optimal value of `mtry`, the number of variables randomly sampled as candidates at each split, with respect to the Out-of-Bag (OOB) error estimate on the given data.
```{r}
bestmtry <- tuneRF(DT.train[,-ncol(DT.train), with=FALSE], DT.train$classe, ntreeTry=100, stepFactor=1.5, improve=0.01, trace=TRUE, plot=TRUE, doBest=FALSE)
```
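`tuneRF` returns a matrix with one row per `mtry` value tried and the corresponding OOB error in the second column. The sketch below runs the same kind of search on the built-in `iris` data (a stand-in for illustration) and picks the row with the lowest error.

```r
library(randomForest)

set.seed(1973)
tuned <- tuneRF(iris[, -5], iris$Species,
                ntreeTry = 100, stepFactor = 1.5, improve = 0.01,
                trace = FALSE, plot = FALSE)

# Column 1 holds mtry, column 2 the OOB error; which.min() takes the first minimum
best_mtry <- tuned[which.min(tuned[, 2]), 1]
best_mtry
```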
## 4.2. Generating the classification model
A random forest model was generated with the training data; its accuracy on the validation data is assessed in section 4.3.
```{r}
# Pick the mtry value with the lowest OOB error (the first one in case of ties)
bestmtry <- bestmtry[which.min(bestmtry[, 2]), 1]
rf <- randomForest(classe ~ . , data=DT.train, mtry=bestmtry, ntree=1000, keep.forest=TRUE, importance=TRUE)
print(rf)
```
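To track the error on held-out data while the forest is being grown, the `x`/`y` interface of `randomForest` accepts the held-out predictors and labels through `xtest` and `ytest`. A minimal sketch on the built-in `iris` data, where the data set and the 100/50 split are assumptions for illustration:

```r
library(randomForest)

set.seed(1973)
train_rows <- sample(nrow(iris), 100)

fit <- randomForest(x = iris[train_rows, -5], y = iris$Species[train_rows],
                    xtest = iris[-train_rows, -5], ytest = iris$Species[-train_rows],
                    ntree = 200, keep.forest = TRUE)

# Overall error rate on the held-out rows after the last tree (first column)
test_error <- fit$test$err.rate[fit$ntree, 1]
test_error
```

`keep.forest = TRUE` is set explicitly because the forest is not retained by default when a test set is supplied this way.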
## 4.3. Assessing model accuracy
Checking model accuracy on the validation data set shows that **the model correctly classifies more than 99% of the observations**.
```{r}
prediction <- predict(rf, DT.validation)
confusionMatrix(prediction, DT.validation$classe)
```
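The accuracy reported by `confusionMatrix` is the share of observations on the diagonal of the confusion table. A base-R sketch with made-up labels shows the calculation:

```r
lv <- c("A", "B", "C")
truth <- factor(c("A", "A", "B", "B", "C", "C"), levels = lv)
pred  <- factor(c("A", "A", "B", "C", "C", "C"), levels = lv)

# Rows are predictions, columns the reference labels
tab <- table(Prediction = pred, Reference = truth)

# Correct predictions sit on the diagonal: 5 of 6 here
accuracy <- sum(diag(tab)) / sum(tab)
accuracy
```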
```{r}
# Show the OOB and per-class error rates as the forest grows
plot(rf, main = "Error rate as a function of the number of trees", col=viridis(6))
legend('topright', colnames(rf$err.rate), col=viridis(6), fill=viridis(6))
```
The dark line shows the overall (OOB) error rate, which falls below 0.01, that is, below 1%. The other lines show the error rate for each class.
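The values behind this plot are stored in the `err.rate` matrix of the fitted object: one row per tree, with the OOB error in the first column followed by one column per class. A sketch on the built-in `iris` data (a stand-in for illustration):

```r
library(randomForest)

set.seed(1973)
fit <- randomForest(Species ~ ., data = iris, ntree = 200)

colnames(fit$err.rate)             # "OOB" followed by the class levels

# OOB error after the last tree, as a proportion (multiply by 100 for a percentage)
final_oob <- fit$err.rate[fit$ntree, "OOB"]
final_oob
```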
## 4.4. Relative variable importance
*Plotting the mean decrease in Gini calculated across all trees*
For each variable in the data set, this measure tells how important that variable is in classifying the data.
The plot shows each variable on the y-axis and its importance on the x-axis. Variables are ordered from most important (top) to least important (bottom), and the value on the x-axis gives an estimate of each variable's importance.
```{r}
# 4.4 Variable importance
# Let’s look at relative variable importance by plotting the mean decrease in Gini calculated across all trees.
# Get importance (stored under its own name to avoid masking the importance() function)
rf.importance <- importance(rf)
varImportance <- data.table(variables = row.names(rf.importance),
                            importance = round(rf.importance[, 'MeanDecreaseGini'], 2))
varImportance <- varImportance[, Rank := min_rank(desc(importance))][order(Rank), ]
# Use ggplot2 to visualize the relative importance of variables
ggplot(varImportance, aes(x = reorder(variables, importance), y = importance, fill = importance)) +
  geom_bar(stat='identity') +
  geom_text(aes(x = variables, y = 0.5, label = Rank), hjust=0, vjust=0.55, size = 3, colour = 'white') +
  labs(x = 'Variables', y = 'Importance (Mean Decrease GINI)') +
  scale_fill_viridis(discrete=FALSE) +
  coord_flip() +
  theme_few() +
  ggtitle("Relative importance of variables") +
  theme(plot.title = element_text(lineheight=.8, face="bold"), legend.position="none")
```
*Gini importance measures the average gain in node purity from splits on a given variable.*
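The Gini impurity of a node is one minus the sum of its squared class proportions, and the decrease credited to a split is the parent's impurity minus the size-weighted impurity of its children. A worked base-R sketch with made-up class counts; `gini` is a helper defined here for illustration, not a `randomForest` function, and the package's reported values are aggregated over all splits and trees, so their scale differs:

```r
# Gini impurity of a node given its class counts (helper for illustration)
gini <- function(counts) {
  p <- counts / sum(counts)
  1 - sum(p^2)
}

parent <- c(A = 40, B = 40)   # impurity 0.5
left   <- c(A = 30, B = 10)   # impurity 0.375
right  <- c(A = 10, B = 30)   # impurity 0.375

n <- sum(parent)
decrease <- gini(parent) -
  (sum(left) / n) * gini(left) -
  (sum(right) / n) * gini(right)

decrease   # 0.125
```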
# 5. Predict
Predicting the classes of the 20 observations from the test data set.
```{r}
# Predict using the test set
prediction <- predict(rf, DT.test)
prediction
```
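Besides the predicted label, `predict` can return the share of trees voting for each class, which gives a rough sense of how clear-cut each prediction is. A sketch on the built-in `iris` data (a stand-in for illustration):

```r
library(randomForest)

set.seed(1973)
fit <- randomForest(Species ~ ., data = iris, ntree = 200)

new_obs <- iris[c(1, 51, 101), -5]

predict(fit, new_obs)                          # predicted labels
votes <- predict(fit, new_obs, type = "prob")  # vote shares, one column per class
rowSums(votes)                                 # each row sums to 1
```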
```{r}
# Save the solution by adding the classification result to each observation in the test set
DT.test$classe <- prediction
# Class A corresponds to the specified execution of the exercise, while the other 4 classes correspond to common mistakes
barplot(table(DT.test$classe),col=viridis(5), border = "white", main="Labels assigned by the model", sub="")
```
Class A corresponds to the specified execution of the exercise, while the other 4 classes correspond to common mistakes.
# 6. Exporting the results.
```{r}
# Write the solution to file
fwrite(DT.test, "solution.csv")
```