---
title: "Cyclistic Bike-Share Data Analysis"
author: "Thanakorn Thanakraikiti"
date: "Published on 15 August 2023 | Updated on `r format(Sys.time(), '%d %B %Y')`"
output:
  # html_document:
  rmdformats::downcute:
    # rmdformats
    self_contained: true
    thumbnails: true
    lightbox: true
    gallery: true
    # html_document
    highlight: tango
    df_print: paged
    code_folding: hide
---
```{r setup, include=FALSE}
# Set root directory
knitr::opts_knit$set(root.dir = "/Users/thanakorntha/Documents/GitHub/Data/cyclistic-analysis")
# Set chunk options
knitr::opts_chunk$set(message = FALSE,
warning = FALSE,
fig.align = "center",
fig.asp = 0.80,
# fig.height = 4,
# fig.width = 6,
out.width = "70%")
# Load package
library(tidyverse)
# Set personal theme
my_custom_theme <- function(base_size = 14) {
# Font
font <- ""
# Replace elements we want to change
theme_bw(base_size = base_size) %+replace%
theme(
#
## Grid elements
panel.grid.minor = element_blank(),
panel.grid.major = element_blank(),
panel.border = element_blank(),
axis.ticks = element_blank(),
axis.line.x = element_line(color = "#343434"),
axis.ticks.x = element_line(color = "#343434"),
axis.line.y = element_line(color = "#343434"),
axis.ticks.y = element_line(color = "#343434"),
#
## Text Elements
#
# Title
plot.title = element_text(size = rel(1.00), color = "#343434", face = "bold", hjust = 0, margin = margin(0, 0, 10, 0)),
plot.subtitle = element_text(size = rel(0.75), color = "#343434", face = "plain", hjust = 0),
plot.caption = element_text(size = rel(0.50), color = "#343434", face = "italic", hjust = 0),
# Axis
axis.title = element_text(size = rel(0.50), color = "#343434", face = "bold"),
axis.title.x = element_text(size = rel(1.00), color = "#343434", face = "bold", margin = margin(5, 0, 5, 0)),
axis.text = element_text(size = rel(0.50), color = "#343434", face = "plain"),
# Legend
legend.title = element_blank(),
legend.text = element_text(size = rel(0.50), color = "#343434", face = "plain"),
#
## Legend
legend.position = "top",
legend.key = element_rect(fill = "transparent", color = NA),
legend.key.size = unit(1, "lines"),
legend.background = element_rect(fill = "transparent", color = NA),
legend.margin = margin(0, 0, 0, 0),
#
## Faceting labels
strip.background = element_rect(fill = "#343434", color = "#343434"),
strip.text = element_text(size = rel(0.80), color = "white", face = "bold", margin = margin(5, 0, 5, 0))
#
)
}
# Change default theme
theme_set(my_custom_theme())
```
# Project Overview
Cyclistic is a successful American bicycle-sharing program that was established in 2016. Since then, the program has grown to a fleet of 5,824 bicycles that are geotracked and locked into a network of 692 stations across Chicago. The bikes can be unlocked from one station and returned to any other station in the system at any time.
Cyclistic's marketing strategy has primarily focused on building general awareness and appealing to broad consumer segments. The program offers a variety of pricing plans, including single-ride passes, full-day passes, and annual memberships. Cyclistic classifies its riders into two groups based on pricing plans: casual riders (*users who purchase single-ride passes or full-day passes*) and annual members (*users who purchase an annual membership*).
Cyclistic's flexible pricing plans attract a larger customer base, but financial analysts have determined that annual members are more profitable. However, casual riders are already aware of the Cyclistic program and have chosen Cyclistic to meet their mobility needs. This suggests that a marketing campaign that targets existing customers is likely to be more effective at expanding the business than a campaign that targets only new customers.
Therefore, Cyclistic's marketing analytics team is interested in understanding how casual riders and annual members use Cyclistic bikes differently. By understanding these differences, the marketing analytics team can develop more targeted marketing strategies to convert casual riders into annual members.
# Dataset
This project analyzes Cyclistic's historical bike trip data from January to December 2022, downloaded and organized in CSV format from the [Divvy system data website](https://divvy-tripdata.s3.amazonaws.com/index.html). The 2022 data comprises twelve monthly datasets, each containing detailed trip information. These datasets hold the following fields for each trip:
1. **ride_id:** unique ID number for all rides
2. **rideable_type:** type of bike
3. **started_at:** date and time the ride started
4. **ended_at:** date and time the ride ended
5. **start_station_name:** name of the station where the ride started
6. **start_station_id:** ID number of the station where the ride started
7. **end_station_name:** name of the station where the ride ended
8. **end_station_id:** ID number of the station where the ride ended
9. **start_lat:** latitude of the location where the ride started
10. **start_lng:** longitude of the location where the ride started
11. **end_lat:** latitude of the location where the ride ended
12. **end_lng:** longitude of the location where the ride ended
13. **member_casual:** type of user
These credible public datasets, collected and updated monthly by the company since 2013, offer valuable insights into Cyclistic's usage patterns across different customers. While data privacy regulations restrict access to personally identifiable information (e.g., linking pass purchases to credit cards), the rich data remains crucial for understanding how customer types use Cyclistic bikes.
However, before utilizing the data for analysis, it requires processing. This involves cleaning inconsistencies and errors, and transforming it into a suitable format for analysis.
> **Note:** The datasets carry a different name (Divvy) because Cyclistic is a fictional company. For the purposes of this case study, the datasets are appropriate and will enable you to answer the business questions. The data has been made available by Motivate International Inc. under this [license](https://divvybikes.com/data-license-agreement).
# Methodology
This project leverages the six-step data analysis framework to unlock transformative insights for Cyclistic's growth. From pinpointing key business questions to extracting actionable recommendations, each step is meticulously designed to yield valuable discoveries.
To efficiently manage and analyze the extensive data, this project utilizes the powerful capabilities of R and RStudio. These tools enable us to delve deeper into the data, extracting meaningful insights that will inform critical decisions and shape Cyclistic's future.
# Data Cleaning
This section dives into the data cleaning process, essential for preparing the dataset for analysis. It covers key steps such as importing the data, removing duplicates, correcting structural errors, handling missing values, adding relevant columns, and dropping unnecessary data. Each step contributes to ensuring high-quality, reliable data for accurate insights.
## Import the Data
I begin by loading the essential libraries: **tidyverse** for transforming, modelling, and visualizing data, **skimr** for quick summary statistics, and **scales** for formatting axis labels.
```{r load-package}
library(tidyverse)
library(skimr)
library(scales)
```
Then, I start importing all 12 monthly datasets and merging them into a single dataset named **trip_data**. To achieve this, I leverage two key functions:
1. **`list.files`** to generate a list of all file paths for efficient reading, and
2. **`read_csv`** to iteratively read each file into memory.
Once all files are read, I use the **`bind_rows`** function to seamlessly stitch them together into a unified dataset.
```{r import-data}
trip_data <- list.files(path = "./data/", pattern = "*-divvy-tripdata.csv", full.names = TRUE) %>%
lapply(read_csv) %>%
bind_rows %>%
arrange(started_at)
```
I would like to take a quick look at the dataset. Cyclists pedaled their way to more than 5 million trips on Cyclistic bikes in 2022, demonstrating the significant impact of bike-sharing programs.
```{r preview-data-1}
dim(trip_data)
head(trip_data)
```
## Remove Duplicates
I use the **`duplicated`** and **`sum`** functions to meticulously check for duplicate trip data. The results reveal no duplicate values, indicating excellent data integrity and providing a reliable foundation for further analysis.
```{r check-duplicates}
sum(duplicated(trip_data$ride_id))
```
## Correct Structural Errors
The next step is to address structural errors, those inconsistencies in data formatting that can lead to misinterpretations. This often takes the form of typos, inconsistent capitalization, or duplicate entries. With this in mind, I'll primarily focus on cleaning up the member type, bike type, and start and end station columns, as they contain text data and are particularly susceptible to such issues.
### Member Type
The ***member_casual*** column has two distinct values: *casual* and *member*. This perfectly aligns with our goal of comparing bike usage between annual members and casual riders, as the two user types are clearly identified in the data.
```{r preview-member}
member_type <- count(trip_data, member_casual, name = "count")
member_type
```
### Bike Type
To understand the distribution of bike types used, I take a look at the ***rideable_type*** column. While *classic_bike* and *electric_bike* have over 2.6 million rides, *docked_bike* shows a significantly lower count with only 170,000.
```{r preview-bike-1}
bike_type <- count(trip_data, rideable_type, name = "count")
bike_type
```
Investigating further, I can confirm that *docked_bike* refers to the same type of bike as *classic_bike*. Therefore, I replace *docked_bike* with *classic_bike* in the column. This correction leaves only *classic_bike* and *electric_bike* as valid categories for further analysis.
```{r fix-bike}
trip_data_v2 <- trip_data %>%
mutate(rideable_type = str_replace_all(rideable_type, "docked_bike", "classic_bike"))
```
```{r preview-bike-2}
bike_type_v2 <- count(trip_data_v2, rideable_type, name = "count")
bike_type_v2
```
### Start and End Station
To thoroughly examine station names, I create separate **start_station** and **end_station** summary tables.
```{r preview-start-station-1, rows.print=5}
start_station <- trip_data_v2 %>%
count(start_station_name, name = "count") %>%
arrange(start_station_name)
start_station
```
```{r preview-end-station-1, include=FALSE, rows.print=5}
end_station <- trip_data_v2 %>%
count(end_station_name, name = "count") %>%
arrange(end_station_name)
end_station
```
After verifying their validity, I identify two areas requiring correction:
1. **Removing Test Stations:** A total of eight test stations are found in the data: "Pawel Bialowas - Test- PBSC charging station", "Hastings WH 2", "DIVVY CASSETTE REPAIR MOBILE STATION", "Base - 2132 W Hubbard Warehouse", "Base - 2132 W Hubbard", "NewHastings", "WestChi", and "WEST CHI-WATSON". I filter these out using the **`filter`** function to ensure accuracy in our analysis.
```{r remove-test-station}
test_station_list <- c("Pawel Bialowas - Test- PBSC charging station",
"Hastings WH 2",
"DIVVY CASSETTE REPAIR MOBILE STATION",
"Base - 2132 W Hubbard Warehouse",
"Base - 2132 W Hubbard",
"NewHastings",
"WestChi",
"WEST CHI-WATSON")
trip_data_v2 <- trip_data_v2 %>%
  filter(!(start_station_name %in% test_station_list |
             end_station_name %in% test_station_list))
```
2. **Addressing Inconsistencies within Station Names:** Typos (e.g., "Michgan" instead of "Michigan"), special symbols (e.g., '\*'), and directional words (e.g., "north" or "south") are present in station names. To address these inconsistencies, I use the **`str_replace_all`** function to remove them, ensuring consistent data for further analysis.
```{r fix-inconsistent-words}
# Words and suffixes to strip from station names
words <- c("*", " - Charging", " (Temp)", "amp;", "Public Rack - ",
           " - north corner", " - south corner", " - midblock south", " - midblock",
           " - North", " - South", " - East", " - West",
           " - NE", " - NW", " - SE", " - SW",
           " - N", " - S", " - E", " - W",
           " NE", " NW", " SE", " SW")
# Remove each fixed string (case-insensitively) from both station name columns
for (word in words) {
  trip_data_v2 <- trip_data_v2 %>%
    mutate(start_station_name = str_replace_all(start_station_name, fixed(word, ignore_case = TRUE), "")) %>%
    mutate(end_station_name = str_replace_all(end_station_name, fixed(word, ignore_case = TRUE), ""))
}
# Strip any remaining single-letter directional suffix (e.g., " N", " S") at the end of a name
trip_data_v2 <- trip_data_v2 %>%
  mutate(start_station_name = str_replace_all(start_station_name, regex(" [NSEW]$", ignore_case = TRUE), "")) %>%
  mutate(end_station_name = str_replace_all(end_station_name, regex(" [NSEW]$", ignore_case = TRUE), ""))
```
After applying these corrections, a re-examination of the **start_station_v2** and **end_station_v2** data confirms the successful removal of test stations and inconsistencies, providing a clean and consistent dataset for subsequent analysis.
```{r preview-start-station-2, rows.print=5}
start_station_v2 <- trip_data_v2 %>%
count(start_station_name, name = "count") %>%
arrange(start_station_name)
start_station_v2
```
```{r preview-end-station-2, include=FALSE, rows.print=5}
end_station_v2 <- trip_data_v2 %>%
count(end_station_name, name = "count") %>%
arrange(end_station_name)
end_station_v2
```
## Handle Missing Data
I use the **`colSums`** and **`is.na`** functions to meticulously identify missing values within the dataset. The analysis reveals six columns with missing data: ***start_station_name*** (833,025 rows), ***start_station_id*** (833,025 rows), ***end_station_name*** (891,896 rows), ***end_station_id*** (891,896 rows), ***end_lat*** (5,858 rows), and ***end_lng*** (5,858 rows).
```{r preview-na-columns-1, rows.print=13}
colSums(is.na(trip_data_v2))
```
To address these missing values, I prioritize ***start_station_name*** and ***end_station_name***, as their missing entries can be imputed using geographic coordinates as a reference. I refrain from imputing the remaining four columns: ***start_station_id*** and ***end_station_id*** will be removed in the subsequent step due to their irrelevance to the analysis, while ***end_lat*** and ***end_lng*** lack suitable reference data for imputation and will also be discarded.
```{r preview-na-start-station-1, rows.print=5}
start_station_location <- trip_data_v2 %>%
count(start_lat, start_lng , start_station_name, name = "count") %>%
arrange(start_lat, start_lng)
start_station_location
```
```{r preview-na-end-station-1, include=FALSE, rows.print=5}
end_station_location <- trip_data_v2 %>%
count(end_lat, end_lng, end_station_name, name = "count") %>%
arrange(end_lat, end_lng)
end_station_location
```
To effectively handle missing station names, I create four temporary columns holding the start and end coordinates rounded to progressively coarser precision (from five decimal places down to two). These columns facilitate matching and assignment of station names to missing entries, prioritizing rows with more precise location data. This iterative approach ensures that the most reliable matches are filled in first.
```{r impute-na-station-names}
# Impute missing station names by matching on rounded coordinates,
# starting at 5-decimal precision and relaxing to 2 decimals
digit <- 5
while (digit > 1) {
  # Round start/end coordinates to the current precision
  trip_data_v2 <- trip_data_v2 %>%
    mutate(start_lat_round = round(start_lat, digits = digit),
           start_lng_round = round(start_lng, digits = digit),
           end_lat_round = round(end_lat, digits = digit),
           end_lng_round = round(end_lng, digits = digit))
  # Fill missing start-station fields within each rounded-coordinate group
  trip_data_v2 <- trip_data_v2 %>%
    group_by(start_lat_round, start_lng_round) %>%
    fill(start_station_name, .direction = "downup") %>%
    fill(start_station_id, .direction = "downup") %>%
    ungroup()
  # Fill missing end-station fields within each rounded-coordinate group
  trip_data_v2 <- trip_data_v2 %>%
    group_by(end_lat_round, end_lng_round) %>%
    fill(end_station_name, .direction = "downup") %>%
    fill(end_station_id, .direction = "downup") %>%
    ungroup()
  digit <- digit - 1
}
# Drop the temporary rounded-coordinate columns
trip_data_v2 <- trip_data_v2 %>%
  select(!c(start_lat_round, start_lng_round, end_lat_round, end_lng_round))
```
The imputation process yields a significant reduction in missing values: ***start_station_name*** and ***start_station_id*** each drop by 11,821 missing records, while ***end_station_name*** and ***end_station_id*** each decrease by 41,386.
```{r preview-na-columns-2, rows.print=13}
colSums(is.na(trip_data_v2))
```
```{r preview-na-start-station-2, rows.print=5}
start_station_location_v2 <- trip_data_v2 %>%
count(start_lat, start_lng, start_station_name, name = "count") %>%
arrange(start_lat, start_lng)
start_station_location_v2
```
```{r preview-na-end-station-2, include=FALSE, rows.print=5}
end_station_location_v2 <- trip_data_v2 %>%
count(end_lat, end_lng, end_station_name, name = "count") %>%
arrange(end_lat, end_lng)
end_station_location_v2
```
To ensure a focus on complete trips, I subsequently remove all remaining rows with missing values using the **`drop_na`** function. This results in 5,614,669 remaining trips, representing a relatively minor reduction from the original dataset.
```{r remove-na-values}
trip_data_v2 <- drop_na(trip_data_v2)
dim(trip_data_v2)
```
## Add Relevant Columns
To facilitate time-series analysis, I create three new columns: ***ride_length_min***, ***day_of_week***, and ***month***.
```{r add-columns}
trip_data_v2$ride_length_min <- as.double(difftime(trip_data_v2$ended_at, trip_data_v2$started_at, units = "mins"))
trip_data_v2$day_of_week <- wday(trip_data_v2$started_at, label = TRUE)
trip_data_v2$month <- format(trip_data_v2$started_at, "%b")
```
However, ***day_of_week*** and ***month*** are not yet arranged in the order needed for analysis: the default day ordering starts on Sunday, and the months are stored as plain text. Therefore, I utilize the **`ordered`** function to set their levels explicitly (Monday to Sunday for days and January to December for months), ensuring their suitability for further analysis.
```{r order-columns}
trip_data_v2 <- within(trip_data_v2, {
day_of_week <- ordered(day_of_week, levels = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"))
month <- ordered(month, levels = c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"))
})
```
## Drop Unnecessary Data
The final step involves identifying and addressing values that could distort the analysis. This is crucial to ensure the validity of the numbers. Outliers are identified using the interquartile range (IQR) method, and their relevance is assessed based on their impact on the analysis and their potential for being errors. If an outlier is deemed irrelevant or a mistake, it is removed from the dataset. This process involves dropping irrelevant columns, correcting erroneous inputs, and filtering out potential outliers.
### Irrelevant Columns
Following the previous step's conclusion, I use the **`select`** function to eliminate the ***start_station_id*** and ***end_station_id*** columns from the dataset, as they are not meaningful for this project's analysis.
```{r drop-columns}
trip_data_v2 <- trip_data_v2 %>%
select( !c(start_station_id, end_station_id) )
```
### Error Inputs
The summary output below reveals inconsistencies in ride lengths, particularly unusually short or long durations. To address them, I filter out invalid rides based on the following criteria:
- **Rides with zero start/end geographic coordinates** are clearly erroneous location records and will be removed.
- **Rides with less than 60 seconds in duration (including negative times)** could indicate false starts or attempts to secure the bike, and will be excluded from further analysis.
- **Rides exceeding 24 hours in length** are considered invalid outliers, as users are not expected to keep bikes for longer than a day.
> **Note:** See the [Divvy System Data](https://divvybikes.com/system-data) and [Divvy article](https://help.divvybikes.com/hc/en-us/articles/360033484791-What-if-I-keep-a-bike-out-too-long-) for a detailed explanation of trip durations.
```{r preview-errors-1, echo=FALSE}
summary(trip_data_v2$ride_length_min)
```
```{r remove-error}
trip_data_v2 <- trip_data_v2 %>%
  filter(!(start_lat == 0 | start_lng == 0 | end_lat == 0 | end_lng == 0 |
             ride_length_min < 1 | ride_length_min > 1440))
```
A thorough re-examination of the data confirms the successful removal of all previously identified errors.
```{r preview-errors-2, echo=FALSE}
summary(trip_data_v2$ride_length_min)
```
### Potential Outliers
Before removing outliers, I examine the distribution of ***ride_length_min*** using a boxplot. The numerous outliers obscure the underlying data trend. These extreme values, while technically part of the dataset, don't reflect the typical ride duration.
```{r plot-outliers-1}
ggplot(data = trip_data_v2, aes(x = member_casual, y = ride_length_min, fill = member_casual)) +
geom_boxplot() +
coord_flip() +
theme(legend.position = "none") +
labs(x = "Member type",
y = "Ride length (in minutes)",
title = "Box plot showing 'ride_length_min' before removing outliers")
```
To identify them, I use the interquartile range (IQR) method. The IQR measures the spread of the middle 50% of the data, that is, the range between the first quartile (Q1) and the third quartile (Q3). In this case, Q1 is 6.05 minutes and Q3 is 18.67 minutes, making the IQR 12.62 minutes.
```{r find-outliers-1}
quantiles <- as.numeric(quantile(trip_data_v2$ride_length_min, probs = c(0.25, 0.50, 0.75), na.rm = FALSE))
iqr_value <- IQR(trip_data_v2$ride_length_min)
```
```{r find-outliers-2, echo=FALSE}
paste0("Q1: ", format(quantiles[1], nsmall = 2), " minutes")
paste0("Q3: ", format(quantiles[3], nsmall = 2), " minutes")
paste0("IQR: ", format(iqr_value, nsmall = 2), " minutes")
```
Next, I set up fences outside Q1 and Q3; any observation lying beyond these fences is treated as an outlier. The lower fence is obtained by subtracting 1.5 times the IQR from Q1, and the upper fence by adding 1.5 times the IQR to Q3. From the output below, the lower fence is -12.88 minutes and the upper fence is 37.59 minutes.
```{r find-outliers-3}
lower_fence <- quantiles[1] - ( 1.5 * iqr_value )
upper_fence <- quantiles[3] + ( 1.5 * iqr_value )
```
```{r find-outliers-4, echo=FALSE}
paste0("Lower Fence: ", format(lower_fence, nsmall = 2), " minutes")
paste0("Upper Fence: ", format(upper_fence, nsmall = 2), " minutes")
```
So, any value less than -12.88 minutes or greater than 37.59 minutes is considered an outlier and removed from the data.
```{r remove-outliers}
trip_data_v2 <- trip_data_v2 %>%
  filter(!(ride_length_min < lower_fence | ride_length_min > upper_fence))
```
Filtering out these outliers significantly cleans the data. The maximum ride length after removal drops to 37.58 minutes, confirming the successful elimination of extreme values. This refined dataset provides a more accurate representation of typical ride durations for further analysis.
```{r preview-outliers, echo=FALSE}
summary(trip_data_v2$ride_length_min)
```
```{r plot-outliers-2}
ggplot(data = trip_data_v2, aes(x = member_casual, y = ride_length_min, fill = member_casual)) +
geom_boxplot() +
coord_flip() +
theme(legend.position = "none") +
labs(x = "Member type",
y = "Ride length (in minutes)",
title = "Box plot showing 'ride_length_min' after removing outliers")
```
## Validate the Data
After data cleaning, I re-examine the dataset and find that it now contains 5,091,142 trip records, a reduction of 576,575 (10%) from the original 5,667,717. The data has 14 columns, none of which contain missing values, ensuring a complete dataset for thorough analysis.
```{r validate-data}
skim_without_charts(trip_data_v2)
```
To verify whether the data follows a normal distribution, I use the Empirical Rule, which states that 99.7% of data points in a normal distribution fall within three standard deviations of the mean (68% within 1 SD, 95% within 2 SD). I create the **`calculate_percentage`** function to measure the percentage of data within specific standard deviation ranges. Its output reveals that 72.10%, 94.08%, and 99.38% of the data fall within one, two, and three standard deviations of the mean, respectively, confirming a close resemblance to a normal distribution. This suggests the data can be reliably analyzed with various statistical methods.
```{r prove-empirical-rule-1}
summary_stats <- summarise(trip_data_v2,
sd = sd(ride_length_min),
mean = mean(ride_length_min),
count = n())
calculate_percentage <- function(n_sd) {
filtered_count <- trip_data_v2 %>%
filter(between(ride_length_min,
summary_stats$mean - n_sd * summary_stats$sd,
summary_stats$mean + n_sd * summary_stats$sd)) %>%
summarise(count = n())
round((filtered_count$count / summary_stats$count) * 100, 2)
}
percentage_sd1 <- calculate_percentage(1)
percentage_sd2 <- calculate_percentage(2)
percentage_sd3 <- calculate_percentage(3)
```
```{r prove-empirical-rule-2, echo=FALSE}
paste0("One standard deviation: ", format(percentage_sd1, nsmall = 2), "%")
paste0("Two standard deviations: ", percentage_sd2, "%")
paste0("Three standard deviations: ", percentage_sd3, "%")
```
# Analysis
This section presents an analysis of Cyclistic's historical trip data from January to December 2022, with the objective of identifying the differences in the use of Cyclistic bikes between annual members and casual riders.
```{r create-mode-function, echo=FALSE}
find_mode <- function(x) {
u <- unique(x)
tab <- tabulate(match(x, u))
u[tab == max(tab)]
}
```
```{r group-variables, echo=FALSE}
# Group the data by the 'member_casual' column, calculating stats for each group
stats_user <- trip_data_v2 %>%
group_by(member_casual) %>%
summarize(mean_ride_length_min = mean(ride_length_min),
median_ride_length_min = median(ride_length_min),
min_ride_length_min = min(ride_length_min),
max_ride_length_min = max(ride_length_min),
sd_ride_length_min = sd(ride_length_min),
ride_count = n())
# Group the data by 'member_casual' and 'day_of_week' columns, calculating stats for each group
stats_user_day <- trip_data_v2 %>%
group_by(member_casual, day_of_week) %>%
summarize(mean_ride_length_min = mean(ride_length_min),
median_ride_length_min = median(ride_length_min),
min_ride_length_min = min(ride_length_min),
max_ride_length_min = max(ride_length_min),
sd_ride_length_min = sd(ride_length_min),
ride_count = n(),
.groups = "drop")
# Group the data by 'member_casual' and 'hour' columns, calculating stats for each group
stats_user_hour <- trip_data_v2 %>%
mutate(hour = as.numeric(format(started_at, "%H"))) %>%
group_by(member_casual, hour) %>%
summarize(mean_ride_length_min = mean(ride_length_min),
median_ride_length_min = median(ride_length_min),
min_ride_length_min = min(ride_length_min),
max_ride_length_min = max(ride_length_min),
sd_ride_length_min = sd(ride_length_min),
ride_count = n(),
.groups = "drop")
# Group the data by 'member_casual' and 'month' columns, calculating stats for each group
stats_user_month <- trip_data_v2 %>%
group_by(member_casual, month) %>%
summarize(mean_ride_length_min = mean(ride_length_min),
median_ride_length_min = median(ride_length_min),
min_ride_length_min = min(ride_length_min),
max_ride_length_min = max(ride_length_min),
sd_ride_length_min = sd(ride_length_min),
ride_count = n(),
.groups = "drop")
```
## Overall Summary of Cyclistic Rides in 2022
The table below presents a statistical summary of ride durations for all of Cyclistic's users, including both casual riders and annual members.
```{r summarize-descriptive-statistics, echo=FALSE}
summary(trip_data_v2$ride_length_min)
```
The time duration of Cyclistic's rides can also be visualized by a histogram. The histogram shows that the distribution of ride times is positively skewed, with a high cluster of lower values and a spread-out tail on the right. This means that there are a few rides that are much longer than the majority of rides.
```{r plot-distribution-duration-all, echo=FALSE}
# Create a histogram
ggplot(trip_data_v2, aes(x = ride_length_min)) +
geom_histogram(color = "black", fill = "darkgray") +
scale_x_continuous(limits = c(0, 40)) +
scale_y_continuous(limits = c(0, 600000), labels = unit_format(unit = "k", scale = 1e-3), expand = c(0, 0)) +
labs(x = "Time Duration (in Minutes)",
y = "Number of Rides",
title = "The distribution of ride lengths (in minutes) shows a slight right \nskew, indicating more rides happen for shorter durations \ncompared to longer ones.") +
# Add a median line
geom_vline(xintercept = median(trip_data_v2$ride_length_min), linetype = "dashed", size = 0.50, color = "red") +
annotate("text", label = "MED", x = median(trip_data_v2$ride_length_min), y = 600000, vjust = 1.25, hjust = -0.25, size = 2.50, color = "red") +
# Add a mean line
geom_vline(xintercept = mean(trip_data_v2$ride_length_min), linetype = "dashed", size = 0.50, color = "blue") +
annotate("text", label = "AVG ", x = mean(trip_data_v2$ride_length_min), y = 600000, vjust = 1.25, hjust = -0.25, size = 2.50, color = "blue") +
# Add a mode line
geom_vline(xintercept = find_mode(trip_data_v2$ride_length_min), linetype = "dashed", size = 0.50, color = "darkgreen") +
annotate("text", label = "MOD", x = find_mode(trip_data_v2$ride_length_min), y = 600000, vjust = 1.25, hjust = -0.25, size = 2.50, color = "darkgreen")
```
Because of the positive skew, the mean ride time is not a very accurate measure of the typical ride time. The median ride time is a more accurate measure, as it is not as affected by the outliers. So, the median ride time will be chosen to conduct further descriptive analysis.
## Casual Riders vs Annual Members {.tabset .tabset-pills}
Casual riders appear to ride for longer periods of time than annual members. The median ride duration for casual riders is 11.52 minutes, higher than the 8.72 minutes for annual members. However, the data also shows that casual riders took a total of 1,946,822 rides, while annual members took 3,144,320. From these findings, it can be concluded that annual members tend to be more regular users of the bike-sharing service, whereas casual riders are more likely to take longer trips when they do use the service, even though they use it less frequently.
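For reference, these figures can be read directly from the grouped summary computed earlier; the sketch below (not evaluated here) assumes the **stats_user** table built in the hidden summary chunk above.
```{r preview-stats-user, eval=FALSE}
# Median ride length and ride counts per user type,
# taken from the stats_user summary created earlier in this report
stats_user %>%
  select(member_casual, median_ride_length_min, ride_count)
```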
### Time Duration by Users
```{r plot-bar-duration-user, echo=FALSE}
ggplot(stats_user, aes(x = member_casual, y = median_ride_length_min, fill = member_casual)) +
geom_bar(stat = "identity", color = "black") +
scale_y_continuous(limits = c(0, 15), expand = c(0, 0)) +
scale_x_discrete(labels = c("Casual", "Member")) +
theme(legend.position = "none") +
labs(x = "User Type",
y = "Time Duration (in Minutes)",
title = "Median time duration on bikes, by user type") +
geom_text(aes(label = round(median_ride_length_min, 2)), position = position_dodge(width = 0.90), vjust = 2)
```
### Ride Counts by Users
```{r plot-bar-count-user, echo=FALSE}
ggplot(stats_user, aes(x = member_casual, y = ride_count, fill = member_casual)) +
geom_bar(stat = "identity", color = "black") +
scale_y_continuous(limits = c(0, 3500000), labels = unit_format(unit = "M", scale = 1e-6), expand = c(0, 0)) +
scale_x_discrete(labels = c("Casual", "Member")) +
theme(legend.position = "none") +
labs(x = "User Type",
y = "Number of Rides",
title = "Total number of rides, by user type") +
geom_text(aes(label = prettyNum(ride_count, big.mark = ",")), position = position_dodge(width = 0.90), vjust = 2)
```
## Day of Week {.tabset .tabset-pills}
Casual riders tend to take longer bike rides than annual members, especially on the weekend, when they spend about 13 minutes per ride, while annual members' ride times stay at a consistent level throughout the week. In terms of ride counts, annual members use Cyclistic bikes more frequently on weekdays, and their usage gradually decreases as the weekend approaches. Casual riders, on the other hand, take more bike rides on weekends. This suggests that the longer ride times of casual riders are closely tied to their increased weekend usage.
### Time Duration by Users and Day of Week
```{r ride-length-user-day, echo=FALSE}
ggplot(stats_user_day, aes(x = day_of_week, y = median_ride_length_min, fill = member_casual)) +
geom_bar(stat = "identity", position = "dodge", color = "black") +
scale_y_continuous(limits = c(0, 15), expand = c(0, 0)) +
scale_fill_discrete(labels = c("Casual", "Member")) +
labs(x = "Day of Week",
y = "Time Duration (in Minutes)",
title = "Median time duration on bikes, by user type and day of week") +
geom_text(aes(label = round(median_ride_length_min, 2)), position = position_dodge(width = 0.90), hjust = -0.10, angle = 270) +
geom_hline(yintercept = median(trip_data_v2$ride_length_min), color = "darkgray", linetype = "dashed", size = 0.50) +
annotate("text", label = "Median", x = 7, y = median(trip_data_v2$ride_length_min), vjust = -0.50, hjust = -0.25, size = 2.50, color = "darkgray")
```
### Ride Counts by Users and Day of Week
```{r ride-count-user-day, echo=FALSE}
ggplot(stats_user_day, aes(x = day_of_week, y = ride_count, fill = member_casual)) +
geom_bar(stat = "identity", position = "dodge", color = "black") +
scale_y_continuous(limits = c(0, 550000), labels = unit_format(unit = "k", scale = 1e-3), expand = c(0, 0)) +
scale_fill_discrete(labels = c("Casual", "Member")) +
labs(x = "Day of Week", y = "Number of Rides", title = "Total number of rides, by user type and day of week") +
geom_text(aes(label = prettyNum(ride_count, big.mark = ",")), position = position_dodge(width = 0.90), hjust = -0.10, angle = 270)
```
## Hour {.tabset .tabset-pills}
Casual riders take their longest bike rides in the early afternoon, between 1 PM and 3 PM, after which ride durations decrease slightly. Annual members, on the other hand, spend a consistent amount of time on bikes throughout the day, with no significant spikes or dips. In terms of ride counts, annual members use Cyclistic bikes at a steady rate throughout the day, with three peaks in usage (at 8 AM, 12 PM, and 5 PM), whereas casual riders start riding at 5 AM and gradually increase their usage until it peaks at 5 PM.
### Time Duration by Users and Hour
```{r plot-line-duration-user-hour, echo=FALSE}
ggplot(stats_user_hour, aes(x = hour, y = median_ride_length_min, group = member_casual)) +
geom_line(aes(color = member_casual), size = 1) +
geom_point(aes(color = member_casual), size = 1) +
scale_x_continuous(limits = c(0, 23), breaks = seq(0, 23, 1)) +
scale_y_continuous(limits = c(0, 15), expand = c(0, 0)) +
scale_color_discrete(labels = c("Casual", "Member")) +
labs(x = "Hour",
y = "Time Duration (in Minutes)",
title = "Median time duration on bikes, by user type and hour") +
geom_hline(yintercept = median(trip_data_v2$ride_length_min), color = "darkgray", linetype = "dashed", size = 0.50) +
annotate("text", label = "Median", x = 23, y = median(trip_data_v2$ride_length_min), vjust = -0.75, hjust = 0.70, size = 2.50, color = "darkgray")
```
### Ride Counts by Users and Hour
```{r plot-line-count-user-hour, echo=FALSE}
ggplot(stats_user_hour, aes(x = hour, y = ride_count, group = member_casual)) +
geom_line(aes(color = member_casual), size = 1) +
geom_point(aes(color = member_casual), size = 1) +
scale_x_continuous(limits = c(0, 23), breaks = seq(from = 0, to = 23, by = 1)) +
scale_y_continuous(limits = c(0, 350000), labels = unit_format(unit = "k", scale = 1e-3), expand = c(0, 0)) +
scale_color_discrete(labels = c("Casual", "Member")) +
labs(x = "Hour",
y = "Number of Rides",
title = "Total number of rides, by user type and hour") +
annotate("rect", xmin = 8, xmax = 17, ymin = 0, ymax = 350000, alpha = 0.10, fill = "darkorange") +
geom_vline(xintercept = c(8, 12, 17), linetype = "dashed", size = 0.50, color = "darkorange")
```
## Month {.tabset .tabset-pills}
The findings show that the median ride time for both casual riders and annual members is relatively consistent throughout the year, with the exception of three months: March, April, and May. Across the year, however, casual riders tend to spend more time on bikes than annual members. The findings also show that the number of Cyclistic rides increases from month to month, peaking in July for casual riders and in August for annual members, before declining sharply. It can be concluded that ridership is strongly seasonal, with both groups riding most during the warmer months.
### Time Duration by Users and Month
```{r plot-line-duration-user-month, echo=FALSE}
ggplot(stats_user_month, aes(x = month, y = median_ride_length_min, group = member_casual)) +
geom_line(aes(color = member_casual), size = 1) +
geom_point(aes(color = member_casual), size = 1) +
scale_y_continuous(limits = c(0, 15), expand = c(0, 0)) +
scale_color_discrete(labels = c("Casual", "Member")) +
labs(x = "Month",
y = "Time Duration (in Minutes)",
title = "Median time duration on bikes, by user type and month") +
geom_hline(yintercept = median(trip_data_v2$ride_length_min), color = "darkgray", linetype = "dashed", size = 0.50) +
annotate("text", label = "Median", x = 12, y = median(trip_data_v2$ride_length_min), vjust = -0.75, hjust = 0.70, size = 2.50, color = "darkgray")
```
### Ride Counts by Users and Month
```{r plot-line-count-user-month, echo=FALSE}
ggplot(stats_user_month, aes(x = month, y = ride_count, group = member_casual)) +
geom_line(aes(color = member_casual), size = 1) +
geom_point(aes(color = member_casual), size = 1) +
scale_y_continuous(limits = c(0, 450000), labels = unit_format(unit = "k", scale = 1e-3), expand = c(0, 0)) +
scale_color_discrete(labels = c("Casual", "Member")) +
labs(x = "Month",
y = "Number of Rides",
title = "Total number of rides, by user type and month") +
annotate("rect", xmin = 7, xmax = 8, ymin = 0, ymax = 450000, alpha = 0.10, fill = "darkorange") +
geom_vline(xintercept = c(7, 8), linetype = "dashed", size = 0.50, color = "darkorange")
```
# Conclusion
This section provides the concluding thoughts and recommendations of this report.
## Key Findings
Based on the data collected, it can be concluded that:
- **Casual riders:** They are more likely to use Cyclistic bikes for leisure purposes, including enjoyment, fitness, exploring, and much more. They spend longer periods of time on bikes and take more rides on **weekends**. They ride throughout the day, but take more rides in the evening.
- **Annual members:** They tend to use Cyclistic bikes for specific purposes, such as commuting to work. They spend shorter periods of time on bikes and take more rides on **weekdays**. They ride during rush hours, especially at 8 AM, 12 PM, and 5 PM.
- **Warmer months:** Both casual riders and annual members take more rides from spring onward (beginning in March), with peak usage in late summer (July and August).
## Recommendations
The report makes the following recommendations:
1. **Price Incentives:** Offer incentives for casual riders to become annual members, such as discounts, free ride credits, or gift cards. This would make it a more attractive option for casual riders who are not sure if they will use Cyclistic bikes enough to justify the full annual membership price.
2. **Personalized Marketing:** Personalize the marketing messages that we send to casual riders. This will help to ensure that we are reaching them with the right message at the right time. For example, we could send a message to casual riders who have taken a certain number of rides in a month, or who have used Cyclistic bikes during rush hour.
3. **New Member Plans:** Consider adding a monthly or quarterly membership option. This would also be a good option for casual riders who only bike during certain times of the year, such as spring and summer.