---
title: "Cyclistic Bike-Share Data Analysis"
author: "Thanakorn Thanakraikiti"
date: "Published on 15 August 2023 | Updated on `r format(Sys.time(), '%d %B %Y')`"
output:
  # html_document:
  rmdformats::downcute:
    # rmdformats
    self_contained: true
    thumbnails: true
    lightbox: true
    gallery: true
    # html_document
    highlight: tango
    df_print: paged
    code_folding: hide
---
```{r setup, include=FALSE}
# Set root directory
knitr::opts_knit$set(root.dir = "/Users/thanakorntha/Documents/GitHub/Data/cyclistic-analysis")
# Set chunk options
knitr::opts_chunk$set(message = FALSE,
warning = FALSE,
fig.align = "center",
fig.asp = 0.80,
# fig.height = 4,
# fig.width = 6,
out.width = "70%")
# Load package
library(tidyverse)
# Set personal theme
my_custom_theme <- function(base_size = 14) {
# Font
font <- ""
# Replace elements we want to change
theme_bw(base_size = base_size) %+replace%
theme(
#
## Grid elements
panel.grid.minor = element_blank(),
panel.grid.major = element_blank(),
panel.border = element_blank(),
axis.ticks = element_blank(),
axis.line.x = element_line(color = "#343434"),
axis.ticks.x = element_line(color = "#343434"),
axis.line.y = element_line(color = "#343434"),
axis.ticks.y = element_line(color = "#343434"),
#
## Text Elements
#
# Title
plot.title = element_text(size = rel(1.00), color = "#343434", face = "bold", hjust = 0, margin = margin(0, 0, 10, 0)),
plot.subtitle = element_text(size = rel(0.75), color = "#343434", face = "plain", hjust = 0),
plot.caption = element_text(size = rel(0.50), color = "#343434", face = "italic", hjust = 0),
# Axis
axis.title = element_text(size = rel(0.50), color = "#343434", face = "bold"),
axis.title.x = element_text(size = rel(1.00), color = "#343434", face = "bold", margin = margin(5, 0, 5, 0)),
axis.text = element_text(size = rel(0.50), color = "#343434", face = "plain"),
# Legend
legend.title = element_blank(),
legend.text = element_text(size = rel(0.50), color = "#343434", face = "plain"),
#
## Legend
legend.position = "top",
legend.key = element_rect(fill = "transparent", color = NA),
legend.key.size = unit(1, "lines"),
legend.background = element_rect(fill = "transparent", color = NA),
legend.margin = margin(0, 0, 0, 0),
#
## Faceting labels
strip.background = element_rect(fill = "#343434", color = "#343434"),
strip.text = element_text(size = rel(0.80), color = "white", face = "bold", margin = margin(5, 0, 5, 0))
#
)
}
# Change default theme
theme_set(my_custom_theme())
```
# Project Overview
Cyclistic is a successful American bicycle-sharing program that was established in 2016. Since then, the program has grown to a fleet of 5,824 bicycles that are geotracked and locked into a network of 692 stations across Chicago. The bikes can be unlocked from one station and returned to any other station in the system at any time.
Cyclistic's marketing strategy has primarily focused on building general awareness and appealing to broad consumer segments. The program offers a variety of pricing plans, including single-ride passes, full-day passes, and annual memberships. Cyclistic classifies its riders into two groups based on pricing plans: casual riders (*users who purchase single-ride passes or full-day passes*) and annual members (*users who purchase an annual membership*).
Cyclistic's flexible pricing plans attract a larger customer base, but financial analysts have determined that annual members are more profitable. However, casual riders are already aware of the Cyclistic program and have chosen Cyclistic to meet their mobility needs. This suggests that a marketing campaign that targets existing customers is likely to be more effective at expanding the business than a campaign that targets only new customers.
Therefore, Cyclistic's marketing analytics team is interested in understanding how casual riders and annual members use Cyclistic bikes differently. By understanding these differences, the marketing analytics team can develop more targeted marketing strategies to convert casual riders into annual members.
# Dataset
This project analyzes Cyclistic's historical bike trip data from January to December 2022, downloaded and organized in CSV format from the [Divvy system data website](https://divvy-tripdata.s3.amazonaws.com/index.html). The 2022 data comprises twelve monthly datasets, each containing detailed trip information. These datasets hold the following fields for each trip:
1. **ride_id:** unique ID number for all rides
2. **rideable_type:** type of bike
3. **started_at:** date and time the ride started
4. **ended_at:** date and time the ride ended
5. **start_station_name:** name of the station where the ride started
6. **start_station_id:** ID number of the station where the ride started
7. **end_station_name:** name of the station where the ride ended
8. **end_station_id:** ID number of the station where the ride ended
9. **start_lat:** latitude of the location where the ride started
10. **start_lng:** longitude of the location where the ride started
11. **end_lat:** latitude of the location where the ride ended
12. **end_lng:** longitude of the location where the ride ended
13. **member_casual:** type of user
These credible public datasets, collected and updated monthly by the company since 2013, offer valuable insights into Cyclistic's usage patterns across different customers. While data privacy regulations restrict access to personally identifiable information (e.g., linking pass purchases to credit cards), the rich data remains crucial for understanding how customer types use Cyclistic bikes.
However, before utilizing the data for analysis, it requires processing. This involves cleaning inconsistencies and errors, and transforming it into a suitable format for analysis.
> **Note:** The datasets carry a different name (Divvy) because Cyclistic is a fictional company. For the purposes of this case study, the datasets are appropriate and will enable you to answer the business questions. The data has been made available by Motivate International Inc. under this [license](https://divvybikes.com/data-license-agreement).
# Methodology
This project leverages the six-step data analysis framework to unlock transformative insights for Cyclistic's growth. From pinpointing key business questions to extracting actionable recommendations, each step is meticulously designed to yield valuable discoveries.
To efficiently manage and analyze the extensive data, this project utilizes the powerful capabilities of R and RStudio. These tools enable us to delve deeper into the data, extracting meaningful insights that will inform critical decisions and shape Cyclistic's future.
# Data Cleaning
This section dives into the data cleaning process, essential for preparing the dataset for analysis. It covers key steps such as importing the data, removing duplicates, correcting structural errors, handling missing values, adding relevant columns, and dropping unnecessary data. Each step contributes to ensuring high-quality, reliable data for accurate insights.
## Import the Data
I begin by loading the essential libraries: **tidyverse** for transforming, modelling, and visualizing data, **skimr** for quick summary statistics, and **scales** for formatting axis labels.
```{r load-package}
library(tidyverse)
library(skimr)
library(scales)
```
Then, I start importing all 12 monthly datasets and merging them into a single dataset named **trip_data**. To achieve this, I leverage two key functions:
1. **`list.files`** to generate a list of all file paths for efficient reading, and
2. **`read_csv`** to iteratively read each file into memory.
Once all files are read, I use the **`bind_rows`** function to seamlessly stitch them together into a unified dataset.
```{r import-data}
trip_data <- list.files(path = "./data/", pattern = "*-divvy-tripdata.csv", full.names = TRUE) %>%
lapply(read_csv) %>%
bind_rows %>%
arrange(started_at)
```
I would like to take a quick look at the dataset. Cyclists pedaled their way to more than 5 million trips on Cyclistic bikes in 2022, demonstrating the significant impact of bike-sharing programs.
```{r preview-data-1}
dim(trip_data)
head(trip_data)
```
## Remove Duplicates
I use the **`duplicated`** and **`sum`** functions to meticulously check for duplicate trip data. The results reveal no duplicate values, indicating excellent data integrity and providing a reliable foundation for further analysis.
```{r check-duplicates}
sum(duplicated(trip_data$ride_id))
```
## Correct Structural Errors
The next step is to address structural errors, those inconsistencies in data formatting that can lead to misinterpretations. This often takes the form of typos, inconsistent capitalization, or duplicate entries. With this in mind, I'll primarily focus on cleaning up the member type, bike type, and start and end station columns, as they contain text data and are particularly susceptible to such issues.
### Member Type
The ***member_casual*** column has two distinct values: *casual* and *member*. This perfectly aligns with our goal of comparing bike usage between annual members and casual riders, as the two user types are clearly identified in the data.
```{r preview-member}
member_type <- count(trip_data, member_casual, name = "count")
member_type
```
### Bike Type
To understand the distribution of bike types used, I take a look at the ***rideable_type*** column. While *classic_bike* and *electric_bike* have over 2.6 million rides, *docked_bike* shows a significantly lower count with only 170,000.
```{r preview-bike-1}
bike_type <- count(trip_data, rideable_type, name = "count")
bike_type
```
Investigating further, I can confirm that *docked_bike* refers to the same type of bike as *classic_bike*. Therefore, I replace *docked_bike* with *classic_bike* in the column. This correction leaves only *classic_bike* and *electric_bike* as valid categories for further analysis.
```{r fix-bike}
trip_data_v2 <- trip_data %>%
mutate(rideable_type = str_replace_all(rideable_type, "docked_bike", "classic_bike"))
```
```{r preview-bike-2}
bike_type_v2 <- count(trip_data_v2, rideable_type, name = "count")
bike_type_v2
```
### Start and End Station
To thoroughly examine station names, I create separate **start_station** and **end_station** summary tables.
```{r preview-start-station-1, rows.print=5}
start_station <- trip_data_v2 %>%
count(start_station_name, name = "count") %>%
arrange(start_station_name)
start_station
```
```{r preview-end-station-1, include=FALSE, rows.print=5}
end_station <- trip_data_v2 %>%
count(end_station_name, name = "count") %>%
arrange(end_station_name)
end_station
```
After verifying their validity, I identify two areas requiring correction:
1. **Removing Test Stations:** A total of eight test stations are found in the data: "Pawel Bialowas - Test- PBSC charging station", "Hastings WH 2", "DIVVY CASSETTE REPAIR MOBILE STATION", "Base - 2132 W Hubbard Warehouse", "Base - 2132 W Hubbard", "NewHastings", "WestChi", and "WEST CHI-WATSON". I filter these out using the **`filter`** function to ensure accuracy in our analysis.
```{r remove-test-station}
test_station_list <- c("Pawel Bialowas - Test- PBSC charging station",
"Hastings WH 2",
"DIVVY CASSETTE REPAIR MOBILE STATION",
"Base - 2132 W Hubbard Warehouse",
"Base - 2132 W Hubbard",
"NewHastings",
"WestChi",
"WEST CHI-WATSON")
trip_data_v2 <- trip_data_v2 %>%
  filter(!(start_station_name %in% test_station_list |
             end_station_name %in% test_station_list))
```
2. **Addressing Inconsistencies within Station Names:** Typos (e.g., "Michgan" instead of "Michigan"), special symbols (e.g., '\*'), and directional words (e.g., "north" or "south") are present in station names. To address these inconsistencies, I use the **`str_replace_all`** function to remove them, ensuring consistent data for further analysis.
```{r fix-inconsistent-words}
# Words and suffixes to strip from station names
words <- c("*", " - Charging", " (Temp)", "amp;", "Public Rack - ",
           " - north corner", " - south corner", " - midblock south", " - midblock",
           " - North", " - South", " - East", " - West",
           " - NE", " - NW", " - SE", " - SW",
           " - N", " - S", " - E", " - W",
           " NE", " NW", " SE", " SW")
# Remove each fixed string (case-insensitively) from both station name columns
for (word in words) {
  trip_data_v2 <- trip_data_v2 %>%
    mutate(start_station_name = str_replace_all(start_station_name, fixed(word, ignore_case = TRUE), "")) %>%
    mutate(end_station_name = str_replace_all(end_station_name, fixed(word, ignore_case = TRUE), ""))
}
# Strip any remaining single-letter directional suffix (e.g., " N", " S") at the end of a name
trip_data_v2 <- trip_data_v2 %>%
  mutate(start_station_name = str_replace_all(start_station_name, regex(" [NSEW]$", ignore_case = TRUE), "")) %>%
  mutate(end_station_name = str_replace_all(end_station_name, regex(" [NSEW]$", ignore_case = TRUE), ""))
```
After applying these corrections, a re-examination of the **start_station_v2** and **end_station_v2** data confirms the successful removal of test stations and inconsistencies, providing a clean and consistent dataset for subsequent analysis.
```{r preview-start-station-2, rows.print=5}
start_station_v2 <- trip_data_v2 %>%
count(start_station_name, name = "count") %>%
arrange(start_station_name)
start_station_v2
```
```{r preview-end-station-2, include=FALSE, rows.print=5}
end_station_v2 <- trip_data_v2 %>%
count(end_station_name, name = "count") %>%
arrange(end_station_name)
end_station_v2
```
## Handle Missing Data
I use the **`colSums`** and **`is.na`** functions to meticulously identify missing values within the dataset. The analysis reveals six columns with missing data: ***start_station_name*** (833,025 rows), ***start_station_id*** (833,025 rows), ***end_station_name*** (891,896 rows), ***end_station_id*** (891,896 rows), ***end_lat*** (5,858 rows), and ***end_lng*** (5,858 rows).
```{r preview-na-columns-1, rows.print=13}
colSums(is.na(trip_data_v2))
```
To address these missing values, I prioritize ***start_station_name*** and ***end_station_name***, as their missing entries can be imputed using geographic coordinates as a reference. I refrain from imputing the remaining four columns: ***start_station_id*** and ***end_station_id*** will be removed in the subsequent step due to their irrelevance to the analysis, while ***end_lat*** and ***end_lng*** lack suitable reference data for imputation and will also be discarded.
```{r preview-na-start-station-1, rows.print=5}
start_station_location <- trip_data_v2 %>%
count(start_lat, start_lng , start_station_name, name = "count") %>%
arrange(start_lat, start_lng)
start_station_location
```
```{r preview-na-end-station-1, include=FALSE, rows.print=5}
end_station_location <- trip_data_v2 %>%
count(end_lat, end_lng, end_station_name, name = "count") %>%
arrange(end_lat, end_lng)
end_station_location
```
To effectively handle missing station names, I create four temporary columns holding the start and end coordinates rounded to progressively coarser precision (from five decimal places down to two). These columns facilitate matching and assignment of station names to missing entries, prioritizing rows with more precise location data. This iterative approach ensures that the most reliable matches are filled in first.
```{r impute-na-station-names}
# Impute missing station names by matching on rounded coordinates,
# starting at 5-decimal precision and relaxing to 2 decimals
digit <- 5
while (digit > 1) {
  # Round start/end coordinates to the current precision
  trip_data_v2 <- trip_data_v2 %>%
    mutate(start_lat_round = round(start_lat, digits = digit),
           start_lng_round = round(start_lng, digits = digit),
           end_lat_round = round(end_lat, digits = digit),
           end_lng_round = round(end_lng, digits = digit))
  # Fill missing start-station fields within each rounded-coordinate group
  trip_data_v2 <- trip_data_v2 %>%
    group_by(start_lat_round, start_lng_round) %>%
    fill(start_station_name, .direction = "downup") %>%
    fill(start_station_id, .direction = "downup") %>%
    ungroup()
  # Fill missing end-station fields within each rounded-coordinate group
  trip_data_v2 <- trip_data_v2 %>%
    group_by(end_lat_round, end_lng_round) %>%
    fill(end_station_name, .direction = "downup") %>%
    fill(end_station_id, .direction = "downup") %>%
    ungroup()
  digit <- digit - 1
}
# Drop the temporary rounded-coordinate columns
trip_data_v2 <- trip_data_v2 %>%
  select(!c(start_lat_round, start_lng_round, end_lat_round, end_lng_round))
```
The imputation process yields a significant reduction in missing values: ***start_station_name*** and ***start_station_id*** each drop by 11,821 missing records, while ***end_station_name*** and ***end_station_id*** each decrease by 41,386.
```{r preview-na-columns-2, rows.print=13}
colSums(is.na(trip_data_v2))
```
```{r preview-na-start-station-2, rows.print=5}
start_station_location_v2 <- trip_data_v2 %>%
count(start_lat, start_lng, start_station_name, name = "count") %>%
arrange(start_lat, start_lng)
start_station_location_v2
```
```{r preview-na-end-station-2, include=FALSE, rows.print=5}
end_station_location_v2 <- trip_data_v2 %>%
count(end_lat, end_lng, end_station_name, name = "count") %>%
arrange(end_lat, end_lng)
end_station_location_v2
```
To ensure a focus on complete trips, I subsequently remove all remaining rows with missing values using the **`drop_na`** function. This results in 5,614,669 remaining trips, representing a relatively minor reduction from the original dataset.
```{r remove-na-values}
trip_data_v2 <- drop_na(trip_data_v2)
dim(trip_data_v2)
```
## Add Relevant Columns
To facilitate time-series analysis, I create three new columns: ***ride_length_min***, ***day_of_week***, and ***month***.
```{r add-columns}
trip_data_v2$ride_length_min <- as.double(difftime(trip_data_v2$ended_at, trip_data_v2$started_at, units = "mins"))
trip_data_v2$day_of_week <- wday(trip_data_v2$started_at, label = TRUE)
trip_data_v2$month <- format(trip_data_v2$started_at, "%b")
```
However, ***day_of_week*** and ***month*** are not yet arranged in the order needed for analysis: the default day ordering starts on Sunday, and the months are stored as plain text. Therefore, I utilize the **`ordered`** function to set their levels explicitly (Monday to Sunday for days and January to December for months), ensuring their suitability for further analysis.
```{r order-columns}
trip_data_v2 <- within(trip_data_v2, {
day_of_week <- ordered(day_of_week, levels = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"))
month <- ordered(month, levels = c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"))
})
```
## Drop Unnecessary Data
The final step involves identifying and addressing values that could distort the analysis. This is crucial to ensure the validity of the numbers. Outliers are identified using the interquartile range (IQR) method, and their relevance is assessed based on their impact on the analysis and their potential for being errors. If an outlier is deemed irrelevant or a mistake, it is removed from the dataset. This process involves dropping irrelevant columns, correcting erroneous inputs, and filtering out potential outliers.
### Irrelevant Columns
Following the previous step's conclusion, I use the **`select`** function to eliminate the ***start_station_id*** and ***end_station_id*** columns from the dataset, as they are not meaningful for this project's analysis.
```{r drop-columns}
trip_data_v2 <- trip_data_v2 %>%
select( !c(start_station_id, end_station_id) )
```
### Error Inputs
The summary output below reveals inconsistencies in ride lengths, particularly unusually short or long durations. To address them, I filter out invalid rides based on the following criteria:
- **Rides with zero start/end geographic coordinates** are clearly erroneous location records and will be removed.
- **Rides with less than 60 seconds in duration (including negative times)** could indicate false starts or attempts to secure the bike, and will be excluded from further analysis.
- **Rides exceeding 24 hours in length** are considered invalid outliers, as users are not expected to keep bikes for longer than a day.
> **Note:** See the [Divvy System Data](https://divvybikes.com/system-data) and [Divvy article](https://help.divvybikes.com/hc/en-us/articles/360033484791-What-if-I-keep-a-bike-out-too-long-) for a detailed explanation of trip durations.
```{r preview-errors-1, echo=FALSE}
summary(trip_data_v2$ride_length_min)
```
```{r remove-error}
trip_data_v2 <- trip_data_v2 %>%
  filter(!(start_lat == 0 | start_lng == 0 | end_lat == 0 | end_lng == 0 |
             ride_length_min < 1 | ride_length_min > 1440))
```
A thorough re-examination of the data confirms the successful removal of all previously identified errors.
```{r preview-errors-2, echo=FALSE}
summary(trip_data_v2$ride_length_min)
```
### Potential Outliers
Before removing outliers, I examine the distribution of ***ride_length_min*** using a boxplot. The numerous outliers obscure the underlying data trend. These extreme values, while technically part of the dataset, don't reflect the typical ride duration.
```{r plot-outliers-1}
ggplot(data = trip_data_v2, aes(x = member_casual, y = ride_length_min, fill = member_casual)) +
geom_boxplot() +
coord_flip() +
theme(legend.position = "none") +
labs(x = "Member type",
y = "Ride length (in minutes)",
title = "Box plot showing 'ride_length_min' before removing outliers")
```
To identify them, I use the interquartile range (IQR) method. The IQR measures the spread of the middle 50% of the data, that is, the range between the first quartile (Q1) and the third quartile (Q3). In this case, Q1 is 6.05 minutes and Q3 is 18.67 minutes, making the IQR 12.62 minutes.
```{r find-outliers-1}
quantiles <- as.numeric(quantile(trip_data_v2$ride_length_min, probs = c(0.25, 0.50, 0.75), na.rm = FALSE))
iqr_value <- IQR(trip_data_v2$ride_length_min)
```
```{r find-outliers-2, echo=FALSE}
paste0("Q1: ", format(quantiles[1], nsmall = 2), " minutes")
paste0("Q3: ", format(quantiles[3], nsmall = 2), " minutes")
paste0("IQR: ", format(iqr_value, nsmall = 2), " minutes")
```
Next, I set up fences outside Q1 and Q3; any observation lying beyond these fences is treated as an outlier. The lower fence is obtained by subtracting 1.5 times the IQR from Q1, and the upper fence by adding 1.5 times the IQR to Q3. From the output below, the lower fence is -12.88 minutes and the upper fence is 37.59 minutes.
```{r find-outliers-3}
lower_fence <- quantiles[1] - ( 1.5 * iqr_value )
upper_fence <- quantiles[3] + ( 1.5 * iqr_value )
```
```{r find-outliers-4, echo=FALSE}
paste0("Lower Fence: ", format(lower_fence, nsmall = 2), " minutes")
paste0("Upper Fence: ", format(upper_fence, nsmall = 2), " minutes")
```
So, any value less than -12.88 minutes or greater than 37.59 minutes is considered an outlier and removed from the data.
```{r remove-outliers}
trip_data_v2 <- trip_data_v2 %>%
  filter(!(ride_length_min < lower_fence | ride_length_min > upper_fence))
```
Filtering out these outliers significantly cleans the data. The maximum ride length after removal drops to 37.58 minutes, confirming the successful elimination of extreme values. This refined dataset provides a more accurate representation of typical ride durations for further analysis.
```{r preview-outliers, echo=FALSE}
summary(trip_data_v2$ride_length_min)
```
```{r plot-outliers-2}
ggplot(data = trip_data_v2, aes(x = member_casual, y = ride_length_min, fill = member_casual)) +
geom_boxplot() +
coord_flip() +
theme(legend.position = "none") +
labs(x = "Member type",
y = "Ride length (in minutes)",
title = "Box plot showing 'ride_length_min' after removing outliers")
```
## Validate the Data
After data cleaning, I re-examine the dataset and find that it now contains 5,091,142 trip records, a reduction of 576,575 (10%) from the original 5,667,717. The data has 14 columns, none of which contain missing values, ensuring a complete dataset for thorough analysis.
```{r validate-data}
skim_without_charts(trip_data_v2)
```
To verify whether the data follows a normal distribution, I use the Empirical Rule, which states that 99.7% of data points in a normal distribution fall within three standard deviations of the mean (68% within 1 SD, 95% within 2 SD). I create the **`calculate_percentage`** function to measure the percentage of data within specific standard deviation ranges. Its output reveals that 72.10%, 94.08%, and 99.38% of the data fall within one, two, and three standard deviations of the mean, respectively, confirming a close resemblance to a normal distribution. This suggests the data can be reliably analyzed with various statistical methods.
```{r prove-empirical-rule-1}
summary_stats <- summarise(trip_data_v2,
sd = sd(ride_length_min),
mean = mean(ride_length_min),
count = n())
calculate_percentage <- function(n_sd) {
filtered_count <- trip_data_v2 %>%
filter(between(ride_length_min,
summary_stats$mean - n_sd * summary_stats$sd,
summary_stats$mean + n_sd * summary_stats$sd)) %>%
summarise(count = n())
round((filtered_count$count / summary_stats$count) * 100, 2)
}
percentage_sd1 <- calculate_percentage(1)
percentage_sd2 <- calculate_percentage(2)
percentage_sd3 <- calculate_percentage(3)
```
```{r prove-empirical-rule-2, echo=FALSE}
paste0("One standard deviation: ", format(percentage_sd1, nsmall = 2), "%")
paste0("Two standard deviations: ", percentage_sd2, "%")
paste0("Three standard deviations: ", percentage_sd3, "%")
```
# Analysis
This section presents an analysis of Cyclistic's historical trip data from January to December 2022, with the objective of identifying the differences in the use of Cyclistic bikes between annual members and casual riders.
```{r create-mode-function, echo=FALSE}
find_mode <- function(x) {
u <- unique(x)
tab <- tabulate(match(x, u))
u[tab == max(tab)]
}
```
```{r group-variables, echo=FALSE}
# Group the data by the 'member_casual' column, calculating stats for each group
stats_user <- trip_data_v2 %>%
group_by(member_casual) %>%
summarize(mean_ride_length_min = mean(ride_length_min),
median_ride_length_min = median(ride_length_min),
min_ride_length_min = min(ride_length_min),
max_ride_length_min = max(ride_length_min),
sd_ride_length_min = sd(ride_length_min),
ride_count = n())
# Group the data by 'member_casual' and 'day_of_week' columns, calculating stats for each group
stats_user_day <- trip_data_v2 %>%
group_by(member_casual, day_of_week) %>%
summarize(mean_ride_length_min = mean(ride_length_min),
median_ride_length_min = median(ride_length_min),
min_ride_length_min = min(ride_length_min),
max_ride_length_min = max(ride_length_min),
sd_ride_length_min = sd(ride_length_min),
ride_count = n(),
.groups = "drop")
# Group the data by 'member_casual' and 'hour' columns, calculating stats for each group
stats_user_hour <- trip_data_v2 %>%
mutate(hour = as.numeric(format(started_at, "%H"))) %>%
group_by(member_casual, hour) %>%
summarize(mean_ride_length_min = mean(ride_length_min),
median_ride_length_min = median(ride_length_min),
min_ride_length_min = min(ride_length_min),
max_ride_length_min = max(ride_length_min),
sd_ride_length_min = sd(ride_length_min),
ride_count = n(),
.groups = "drop")
# Group the data by 'member_casual' and 'month' columns, calculating stats for each group
stats_user_month <- trip_data_v2 %>%
group_by(member_casual, month) %>%
summarize(mean_ride_length_min = mean(ride_length_min),
median_ride_length_min = median(ride_length_min),
min_ride_length_min = min(ride_length_min),
max_ride_length_min = max(ride_length_min),
sd_ride_length_min = sd(ride_length_min),
ride_count = n(),
.groups = "drop")
```
## Overall Summary of Cyclistic Rides in 2022
The table below presents a statistical summary of ride durations for all of Cyclistic's users, including both casual riders and annual members.
```{r summarize-descriptive-statistics, echo=FALSE}
summary(trip_data_v2$ride_length_min)
```
The time duration of Cyclistic's rides can also be visualized by a histogram. The histogram shows that the distribution of ride times is positively skewed, with a high cluster of lower values and a spread-out tail on the right. This means that there are a few rides that are much longer than the majority of rides.
```{r plot-distribution-duration-all, echo=FALSE}
# Create a histogram
ggplot(trip_data_v2, aes(x = ride_length_min)) +
geom_histogram(color = "black", fill = "darkgray") +
scale_x_continuous(limits = c(0, 40)) +
scale_y_continuous(limits = c(0, 600000), labels = unit_format(unit = "k", scale = 1e-3), expand = c(0, 0)) +
labs(x = "Time Duration (in Minutes)",
y = "Number of Rides",
title = "The distribution of ride lengths (in minutes) shows a slight right \nskew, indicating more rides happen for shorter durations \ncompared to longer ones.") +
# Add a median line
geom_vline(xintercept = median(trip_data_v2$ride_length_min), linetype = "dashed", size = 0.50, color = "red") +
annotate("text", label = "MED", x = median(trip_data_v2$ride_length_min), y = 600000, vjust = 1.25, hjust = -0.25, size = 2.50, color = "red") +
# Add a mean line
geom_vline(xintercept = mean(trip_data_v2$ride_length_min), linetype = "dashed", size = 0.50, color = "blue") +
annotate("text", label = "AVG ", x = mean(trip_data_v2$ride_length_min), y = 600000, vjust = 1.25, hjust = -0.25, size = 2.50, color = "blue") +
# Add a mode line
geom_vline(xintercept = find_mode(trip_data_v2$ride_length_min), linetype = "dashed", size = 0.50, color = "darkgreen") +
annotate("text", label = "MOD", x = find_mode(trip_data_v2$ride_length_min), y = 600000, vjust = 1.25, hjust = -0.25, size = 2.50, color = "darkgreen")
```
Because of the positive skew, the mean ride time is not a very accurate measure of the typical ride time. The median ride time is a more accurate measure, as it is not as affected by the outliers. So, the median ride time will be chosen to conduct further descriptive analysis.
## Casual Riders vs Annual Members {.tabset .tabset-pills}
Casual riders appear to ride for longer periods of time than annual members. The median ride duration for casual riders is 11.52 minutes, higher than the 8.72 minutes for annual members. However, the data also shows that casual riders took a total of 1,946,822 rides, while annual members took 3,144,320. From these findings, it can be concluded that annual members tend to be more regular users of the bike-sharing service, whereas casual riders are more likely to take longer trips when they do use the service, even though they use it less frequently.
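For reference, these figures can be read directly from the grouped summary computed earlier; the sketch below (not evaluated here) assumes the **stats_user** table built in the hidden summary chunk above.
```{r preview-stats-user, eval=FALSE}
# Median ride length and ride counts per user type,
# taken from the stats_user summary created earlier in this report
stats_user %>%
  select(member_casual, median_ride_length_min, ride_count)
```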
### Time Duration by Users
```{r plot-bar-duration-user, echo=FALSE}
ggplot(stats_user, aes(x = member_casual, y = median_ride_length_min, fill = member_casual)) +
geom_bar(stat = "identity", color = "black") +
scale_y_continuous(limits = c(0, 15), expand = c(0, 0)) +
scale_x_discrete(labels = c("Casual", "Member")) +
theme(legend.position = "none") +
labs(x = "User Type",
y = "Time Duration (in Minutes)",
title = "Median time duration on bikes, by user type") +
geom_text(aes(label = round(median_ride_length_min, 2)), position = position_dodge(width = 0.90), vjust = 2)
```
### Ride Counts by Users
```{r plot-bar-count-user, echo=FALSE}
ggplot(stats_user, aes(x = member_casual, y = ride_count, fill = member_casual)) +
geom_bar(stat = "identity", color = "black") +
scale_y_continuous(limits = c(0, 3500000), labels = unit_format(unit = "M", scale = 1e-6), expand = c(0, 0)) +
scale_x_discrete(labels = c("Casual", "Member")) +
theme(legend.position = "none") +
labs(x = "User Type",
y = "Number of Rides",
title = "Total number of rides, by user type") +
geom_text(aes(label = prettyNum(ride_count, big.mark = ",")), position = position_dodge(width = 0.90), vjust = 2)
```
## Day of Week {.tabset .tabset-pills}
Casual riders tend to take longer bike rides than annual members, especially on the weekend, when they spend about 13 minutes per ride, while annual members' ride times stay at a consistent level throughout the week. In terms of ride counts, annual members use Cyclistic bikes more frequently on weekdays, and their usage gradually decreases as the weekend approaches. Casual riders, on the other hand, take more bike rides on weekends. This suggests that the longer ride times of casual riders are closely tied to their increased weekend usage.
### Time Duration by Users and Day of Week
```{r ride-length-user-day, echo=FALSE}
ggplot(stats_user_day, aes(x = day_of_week, y = median_ride_length_min, fill = member_casual)) +
geom_bar(stat = "identity", position = "dodge", color = "black") +
scale_y_continuous(limits = c(0, 15), expand = c(0, 0)) +
scale_fill_discrete(labels = c("Casual", "Member")) +
labs(x = "Day of Week",
y = "Time Duration (in Minutes)",
title = "Median time duration on bikes, by user type and day of week") +
geom_text(aes(label = round(median_ride_length_min, 2)), position = position_dodge(width = 0.90), hjust = -0.10, angle = 270) +
geom_hline(yintercept = median(trip_data_v2$ride_length_min), color = "darkgray", linetype = "dashed", size = 0.50) +
annotate("text", label = "Median", x = 7, y = median(trip_data_v2$ride_length_min), vjust = -0.50, hjust = -0.25, size = 2.50, color = "darkgray")
```
### Ride Counts by Users and Day of Week
```{r ride-count-user-day, echo=FALSE}
ggplot(stats_user_day, aes(x = day_of_week, y = ride_count, fill = member_casual)) +
geom_bar(stat = "identity", position = "dodge", color = "black") +
scale_y_continuous(limits = c(0, 550000), labels = unit_format(unit = "k", scale = 1e-3), expand = c(0, 0)) +
scale_fill_discrete(labels = c("Casual", "Member")) +
labs(x = "Day of Week", y = "Number of Rides", title = "Total number of rides, by user type and day of week") +
geom_text(aes(label = prettyNum(ride_count, big.mark = ",")), position = position_dodge(width = 0.90), hjust = -0.10, angle = 270)
```
## Hour {.tabset .tabset-pills}
Casual riders take their longest bike rides in the early afternoon, between 1 PM and 3 PM, after which ride durations decrease slightly. Annual members, on the other hand, spend a consistent amount of time on bikes throughout the day, with no significant spikes or dips. In terms of ride counts, annual members use Cyclistic bikes at a steady rate throughout the day, with three peaks in usage (at 8 AM, 12 PM, and 5 PM), whereas casual riders start riding at 5 AM and gradually increase their usage until it peaks at 5 PM.
### Time Duration by Users and Hour
```{r plot-line-duration-user-hour, echo=FALSE}
ggplot(stats_user_hour, aes(x = hour, y = median_ride_length_min, group = member_casual)) +
geom_line(aes(color = member_casual), size = 1) +
geom_point(aes(color = member_casual), size = 1) +
scale_x_continuous(limits = c(0, 23), breaks = seq(0, 23, 1)) +
scale_y_continuous(limits = c(0, 15), expand = c(0, 0)) +
scale_color_discrete(labels = c("Casual", "Member")) +
labs(x = "Hour",
y = "Time Duration (in Minutes)",
title = "Median time duration on bikes, by user type and hour") +
geom_hline(yintercept = median(trip_data_v2$ride_length_min), color = "darkgray", linetype = "dashed", size = 0.50) +
annotate("text", label = "Median", x = 23, y = median(trip_data_v2$ride_length_min), vjust = -0.75, hjust = 0.70, size = 2.50, color = "darkgray")
```
### Ride Counts by Users and Hour
```{r plot-line-count-user-hour, echo=FALSE}
ggplot(stats_user_hour, aes(x = hour, y = ride_count, group = member_casual)) +
geom_line(aes(color = member_casual), size = 1) +
geom_point(aes(color = member_casual), size = 1) +
scale_x_continuous(limits = c(0, 23), breaks = seq(from = 0, to = 23, by = 1)) +
scale_y_continuous(limits = c(0, 350000), labels = unit_format(unit = "k", scale = 1e-3), expand = c(0, 0)) +
scale_color_discrete(labels = c("Casual", "Member")) +
labs(x = "Hour",
y = "Number of Rides",
title = "Total number of rides, by user type and hour") +
annotate("rect", xmin = 8, xmax = 17, ymin = 0, ymax = 350000, alpha = 0.10, fill = "darkorange") +
geom_vline(xintercept = c(8, 12, 17), linetype = "dashed", size = 0.50, color = "darkorange")
```
## Month {.tabset .tabset-pills}
The findings show that the median ride time for both casual riders and annual members is relatively consistent throughout the year, with the exception of three months: March, April, and May. Across the year, however, casual riders tend to spend more time on bikes than annual members. The findings also show that the number of Cyclistic rides increases from month to month, peaking in July for casual riders and in August for annual members, before declining sharply. It can be concluded that ridership is strongly seasonal, with both groups riding most during the warmer months.
### Time Duration by Users and Month
```{r plot-line-duration-user-month, echo=FALSE}
ggplot(stats_user_month, aes(x = month, y = median_ride_length_min, group = member_casual)) +
geom_line(aes(color = member_casual), size = 1) +
geom_point(aes(color = member_casual), size = 1) +
scale_y_continuous(limits = c(0, 15), expand = c(0, 0)) +
scale_color_discrete(labels = c("Casual", "Member")) +
labs(x = "Month",
y = "Time Duration (in Minutes)",
title = "Median time duration on bikes, by user type and month") +
geom_hline(yintercept = median(trip_data_v2$ride_length_min), color = "darkgray", linetype = "dashed", size = 0.50) +
annotate("text", label = "Median", x = 12, y = median(trip_data_v2$ride_length_min), vjust = -0.75, hjust = 0.70, size = 2.50, color = "darkgray")
```
### Ride Counts by Users and Month
```{r plot-line-count-user-month, echo=FALSE}
ggplot(stats_user_month, aes(x = month, y = ride_count, group = member_casual)) +
geom_line(aes(color = member_casual), size = 1) +
geom_point(aes(color = member_casual), size = 1) +
scale_y_continuous(limits = c(0, 450000), labels = unit_format(unit = "k", scale = 1e-3), expand = c(0, 0)) +
scale_color_discrete(labels = c("Casual", "Member")) +
labs(x = "Month",
y = "Number of Rides",
title = "Total number of rides, by user type and month") +
annotate("rect", xmin = 7, xmax = 8, ymin = 0, ymax = 450000, alpha = 0.10, fill = "darkorange") +
geom_vline(xintercept = c(7, 8), linetype = "dashed", size = 0.50, color = "darkorange")
```
# Conclusion
This section provides the concluding thoughts and recommendations of this report.
## Key Findings
Based on the data collected, it can be concluded that:
- **Casual riders:** They are more likely to use Cyclistic bikes for leisure purposes, including enjoyment, fitness, exploring, and much more. They spend longer periods of time on bikes and take more rides on **weekends**. They ride throughout the day, but take more rides in the evening.
- **Annual members:** They tend to use Cyclistic bikes for specific purposes, such as commuting to work. They spend shorter periods of time on bikes and take more rides on **weekdays**. They ride during rush hours, especially at 8 AM, 12 PM, and 5 PM.
- **Warmer months:** Both casual riders and annual members take more rides from spring onward (beginning in March), with peak usage in late summer (July and August).
## Recommendations
The report makes the following recommendations:
1. **Price Incentives:** Offer incentives for casual riders to become annual members, such as discounts, free ride credits, or gift cards. This would make it a more attractive option for casual riders who are not sure if they will use Cyclistic bikes enough to justify the full annual membership price.
2. **Personalized Marketing:** Personalize the marketing messages that we send to casual riders. This will help to ensure that we are reaching them with the right message at the right time. For example, we could send a message to casual riders who have taken a certain number of rides in a month, or who have used Cyclistic bikes during rush hour.
3. **New Member Plans:** Consider adding a monthly or quarterly membership option. This would also be a good option for casual riders who only bike during certain times of the year, such as spring and summer.