regression.Rmd

---
title: "253Project"
author: "Jenny Li, Kristy Ma, Liz Cao"
date: "2022/3/24"
output:
  html_document:
    toc: yes
    toc_float: yes
    df_print: paged
    code_download: yes
    theme: cerulean
  pdf_document:
    toc: yes
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, error=TRUE, message=FALSE, warning=FALSE)
```

# Part a & b
## Library statements 

```{r}
library(dplyr)
library(readr)
library(broom)
library(ggplot2)
library(tidymodels) 
tidymodels_prefer()
theme_set(theme_bw())       
Sys.setlocale("LC_TIME", "English")
set.seed(74)
```
## Read in data
```{r}
breastCa<-read_csv(file = "breast-cancer.csv")
```
## Data cleaning
```{r}
breastCa_Re<-breastCa %>% 
  drop_na() %>% 
  select(radius_mean:fractal_dimension_mean) 

breastCa_Re_new<-breastCa_Re%>%
  mutate(concave_points_mean=`concave points_mean`)%>%
  select(-8)
```

## Creation of cv folds
```{r}
breastCa_Re_CV<-vfold_cv(breastCa_Re, v = 10)
```

## Model spec
```{r}
#least square
lm_spec <-
    linear_reg() %>% 
    set_engine(engine = 'lm') %>% 
    set_mode('regression')

#LASSO
lm_lasso_spec <- 
  linear_reg() %>%
  set_args(mixture = 1, penalty = tune()) %>% ## mixture = 1 indicates Lasso
  set_engine(engine = 'glmnet') %>% #note we are using a different engine
  set_mode('regression')
```

## Recipes & workflows
```{r}
#least square
least_rec <- recipe(area_mean ~ ., data = breastCa_Re) %>%
    step_corr(all_predictors()) %>% 
    step_nzv(all_predictors()) %>% # removes variables with the same value
    step_normalize(all_numeric_predictors()) %>% # important standardization step for LASSO
    step_dummy(all_nominal_predictors())

least_lm_wf <- workflow() %>%
    add_recipe(least_rec) %>%
    add_model(lm_spec)
    
#LASSO
lasso_wf<- workflow() %>% 
  add_recipe(least_rec) %>%
  add_model(lm_lasso_spec) 
```

## Fit & tune models
```{r}
#least square
least_fit <- fit(least_lm_wf, data = breastCa_Re) 

least_fit %>% tidy()

#LASSO
#tune
penalty_grid <- grid_regular(
  penalty(range = c(-3, 1)), #log10 transformed 
  levels = 30)

tune_output <- tune_grid( # new function for tuning hyperparameters
  lasso_wf, # workflow
  resamples = breastCa_Re_CV, # cv folds
  metrics = metric_set(rmse, mae),
  grid = penalty_grid # penalty grid defined above
)

#fit
best_se_penalty <- select_by_one_std_err(tune_output, metric = 'mae', desc(penalty))
final_wf_se <- finalize_workflow(lasso_wf, best_se_penalty)
lasso_fit <- fit(final_wf_se , data = breastCa_Re)
lasso_fit %>% tidy()

```

# Part c
## Calculate and collect CV metrics

```{r}
# Least Square model
least_fit_cv <- fit_resamples(least_lm_wf,
  resamples = breastCa_Re_CV, 
  metrics = metric_set(rmse, mae)
)

least_fit_cv %>% collect_metrics(summarize = TRUE)
```
```{r}
# LASSO model
tune_output %>% 
  collect_metrics() %>% 
  filter(penalty == (best_se_penalty 
                     %>% pull(penalty)))
```
# Part d
## Residual Plots 
```{r}
#least square
least_fit_output <- least_fit %>%
  predict(new_data = breastCa_Re) %>%
  bind_cols(breastCa_Re) %>%
  mutate(resid = area_mean - .pred)

ggplot(least_fit_output, aes(x = .pred, y = resid)) +
  geom_point() +
  geom_smooth() +
  geom_hline(yintercept = 0, color = "red") +
  labs(x = "Fitted values", y = "Residuals") +
  theme_classic()

# Residuals vs. predictors (x's) 
ggplot(least_fit_output, aes(x = concavity_mean, y = resid)) +
  geom_point() +
  geom_smooth() +
  labs(x = "Concavity mean", y = "Residual") +
  geom_hline(yintercept = 0, color = "red") +
  theme_classic()

# Residuals vs. predictors (x's) 
ggplot(least_fit_output, aes(x = compactness_mean, y = resid)) +
  geom_point() +
  geom_smooth() +
  labs(x = "Compactness mean", y = "Residual") +
  geom_hline(yintercept = 0, color = "red") +
  theme_classic()

# Residuals vs. predictors (x's) 
ggplot(least_fit_output, aes(x = fractal_dimension_mean, y = resid)) +
  geom_point() +
  geom_smooth() +
  labs(x = "Fractal dimension mean", y = "Residual") +
  geom_hline(yintercept = 0, color = "red") +
  theme_classic()

# Residuals vs. predictors (x's) 
ggplot(least_fit_output, aes(x = radius_mean, y = resid)) +
  geom_point() +
  geom_smooth() +
  labs(x = "Radius mean", y = "Residual") +
  geom_hline(yintercept = 0, color = "red") +
  theme_classic()
```

```{r}
#LASSO
lasso_fit_output <- lasso_fit %>%
  predict(new_data = breastCa_Re) %>%
  bind_cols(breastCa_Re) %>%
  mutate(resid = area_mean - .pred)

ggplot(lasso_fit_output, aes(x = .pred, y = resid)) +
  geom_point() +
  geom_smooth() +
  geom_hline(yintercept = 0, color = "red") +
  labs(x = "Fitted values", y = "Residuals") +
  theme_classic()

# Residuals vs. predictors (x's) 
ggplot(lasso_fit_output, aes(x = concavity_mean, y = resid)) +
  geom_point() +
  geom_smooth() +
  labs(x = "Concavity mean", y = "Residual") +
  geom_hline(yintercept = 0, color = "red") +
  theme_classic()

# Residuals vs. predictors (x's) 
ggplot(lasso_fit_output, aes(x = compactness_mean, y = resid)) +
  geom_point() +
  geom_smooth() +
  labs(x = "Compactness mean", y = "Residual") +
  geom_hline(yintercept = 0, color = "red") +
  theme_classic()

# Residuals vs. predictors (x's) 
ggplot(lasso_fit_output, aes(x = fractal_dimension_mean, y = resid)) +
  geom_point() +
  geom_smooth() +
  labs(x = "Fractal dimension mean", y = "Residual") +
  geom_hline(yintercept = 0, color = "red") +
  theme_classic()

# Residuals vs. predictors (x's) 
ggplot(lasso_fit_output, aes(x = radius_mean, y = resid)) +
  geom_point() +
  geom_smooth() +
  labs(x = "Radius mean", y = "Residual") +
  geom_hline(yintercept = 0, color = "red") +
  theme_classic()
```

# Part e
>
In the OLS, “radius_mean” is the most important predictor for our quantitative outcome since it has the lowest p-value which is also much lower than the p-values of other variables. This shows the result is statistically significant and we have stronger evidence to reject the null hypothesis and instead claim a possible association between the mean area of tumor and the mean radius of the tumor. 

>In the LASSO, “radius_mean”, “compactness_mean”, “concavity_mean”, and “fractal_dimension_mean”, are the most important predictors of our quantitative outcome.(since the coefficents for other variables are zero) The penalty shrinks the coefficients of three variables to zero due to their minimum effect on the response variable, and keeps important predictors in the model. 

>The methods we’ve applied did not reach the consensus on the most important variable. The strong association between the mean area of tumor and the mean radius of the tumor, indicated by the lowest p-value for “radius_mean” and its inclusion in the LASSO model, is expected by their definition.


## 2. Summarize investigations: Decide on an overall best model based on your investigations so far. To do this, make clear your analysis goals. Predictive accuracy? Interpretability? A combination of both?

>Answer:  Our goal is to have both accurate predictions and keep interpretability. Our proposed best model: area_mean~ radius_mean + compactness_mean + concavity_mean + fractal_dimension_mean.( based on the p_value and the coefficents from LASSO. We prefer to keep all the variables that contain cofficients because the total of 4 variabls for a model is interpretable.) However, we are using the mean of radius to predict the area which does not really make sense in implementation, but for this assignment we will just leave it there. Also, this is coherent with the residual plot of radius which is a non-linear relationship.


## 3. Societal impact: Are there any harms that may come from your analyses and/or how the data were collected? What cautions do you want to keep in mind when communicating your work?


> Answer: There are some harms that may occur, such as some patients may not be willing to provide their personal information or records for the case study.  We need to protect the information safety of the patient. Also, we should know the number of observations may not be able to be generalized. 


## Accounting for nonlinearity 

```{r}
set.seed(123)

gam_spec <- 
  gen_additive_mod() %>%
  set_engine(engine = 'mgcv') %>%
  set_mode('regression') 

gam_mod <- fit(gam_spec,
    area_mean ~ s(radius_mean)+texture_mean+s(perimeter_mean)+smoothness_mean+compactness_mean+concave_points_mean+concavity_mean+symmetry_mean+fractal_dimension_mean,
    data = breastCa_Re_new
)

par(mfrow=c(2,2))
gam_mod %>% pluck('fit') %>% mgcv::gam.check() 
gam_mod %>% pluck('fit') %>% summary()

ns_rec <- least_rec %>%
  step_ns(x, deg_free = 9)
ns9_wf <- workflow()  %>%
  add_recipe(ns_rec) %>%
  add_model(lm_spec)

hist(breastCa_Re_new$area_mean)

gam_mod %>% pluck('fit') %>% plot()

```

## Evaluation of the GAM model on Test Data 

```{r}
gam_test_output <- gam_mod %>%
  predict(new_data=breastCa_Re_new) %>%
  bind_cols(breastCa_Re_new %>% select(area_mean))

gam_test_output %>%
  rmse(truth=area_mean, estimate=.pred) 

gam_test_output %>%
  mae(truth=area_mean, estimate=.pred) 

```