clinical_prediction_model.Rmd

---
title: "clinical_prediction_model"
author: "liuc"
date: '2022-03-14'
output: pdf_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

## 临床预测模型的一般流程

临床预测模型和临床预后模型虽则常使用线性回归、逻辑回归，但其涉及到具体的临床需求以及发展出一套分析思路，现整理如下。在其他的几个笔记中也有涉及类似的R语言实现过程。参数模型和半参数模型的使用，一个很重要的考量在于模型的解释度。一些非参数化的机器学习模型往往不易于解释。 在临床预测模型中_数据_的采集是最为重要的部分，对于不同的目的采用不同的研究类型选择。而通过回顾性研究得到的数据也会涉及到对具体临床问题的理解，包括预测变量的筛选和潜在重要变量的筛选。数据的质量和多个批次数据用以验证是构建模型中重要的因素，在这个层面来说临床模型的构建是不容易的，需要医院机构具有良好的数据库管理。

1.  建模，同时要关注变量的筛选
2.  验证模型,包括区别度分析（discrimination）、校准曲线（calibration）、决策曲线（DCA）、NRI/IDI等
3.  列线图等

模型的区分度指标是指对于一个模型本身区分或者说预测outcome的准确度，混淆矩阵/AUC/NRI/IDI等就是常用的指标，但是校准曲线是看待结局实际发生的概率和预测的概率的一致性，其calibration refers to the agreement between the estimated and the "true" risk of an outcome关注点在于每一个预测，在回归模型中容易理解，现举一个在分类问题中的例子：比如在逻辑回归中通过0.5来判断二分类的是与否，但每一个样本其实是有对应分类发生的概率的，而这些个体实际上发生的概率（比如10个人中有几个人是患病的）这样的的一个散点图就是校准曲线，看待的是二者的一致性，自然图的解读就是越靠近对角线越好。这的重点在于某种风险发生的具体的概率，可以通过Hosmer-Lemeshow检验得到统计检验。

决策曲线关注的是净收益和阈概率（Threshold Probability）之间的关系，比如说我们在预测一个阳性患者后还需要考虑假如 我们预测的是假阳性可能对患者的损害，我们希望自己做出来的预测模型在临床使用中，在任何时候依照模型结果进行干预净受益都比默认的好（最常见的默认情况就是全干预和全不干预）。`In brief, decision curve analysis calculates a clinical “net benefit” for one or more prediction models or diagnostic tests in comparison to default strategies of treating all or no patients.`阈概率（Threshold Probability），表示的是只有病人的预测概率超过这个阈概率，干预才有受益， 才值得干预。

下面以一个示例记录详细的建模过程。

所使用数据集为Hosmer研究的对低体重新生儿具有影响的因素。变量为二分类变量，考虑采用逻辑回归。 探索影响新生儿体重的独立影响因素。

```{r, include=FALSE}
library(tidyverse)
library(easystats)
library(rms)

get_logistic_pred = function(mod, data, res = "y", 
   pos = 1, neg = 0, cut = 0.5) {
  probs = predict(mod, newdata = data, type = "response")
  ifelse(probs > cut, pos, neg)
}
```

<span style='color: red;'>变量选择<span>

特征工程在临床预测模型中有几点需要考虑：

1.  混杂因素的矫正
2.  临床中一些变量终然在单变量回归中没有达到统计p值，也应该纳入最终的多因素模型中，基于临床专业知识的变量纳入是最为基本的考量
3.  常见的先单因素再多因素的方法（在统计效能不足时）
4.  一个单变量至少有10-20个有效样本量
5.  在临床模型的构建上，变量筛选时应考虑一些统计方法同时还需要考虑具体的数据大小等。
6.  对于连续变量可以考虑各种分类变量变化；对于等级资料设定dummy变量，比上同一个level；对于无序多分类变量设置dummy
7.  dummy 变量设置后在模型解释上，可以突出其所比对的reference。除了`ifelse`之外还可以用R包`fastDummies`以及base包里的`model.matrix`函数。建立模型时什么时候需要用到dummy变量呢，在R中对于factor变量会自动dummy的。


```{r}
# 一个生成dummy变量的示例
library(recipes)

dummies <- df %>% 
  recipes::recipe(outcome ~ .) %>% 
  step_dummy(var,
             one_hot = TRUE
             ) %>% 
  prep() %>% 
  bake()
```

另一个dummy变量设置的方式
这种方式为R里默认的方式
```{r}
contrasts(df1$race.f) <- contr.treatment(3)
```


```{r}
# 读入带label的SPSS数据对于下游的一些分析会带来麻烦，此时可以用labeled包进行处理
df1 <- haven::read_sav('./datasets/Lowweight.sav')
# df1 <- foreign::read.spss('./datasets/Lowweight.sav') %>% as.data.frame()

# 对race变量设置dummy变量, dummy变量需要n-1个变量，其中有一个为reference，此处为race3不纳入模型
df1 <- df1 %>% 
  mutate(
    race1 = if_else(race == 1, 1, 0),
    race2 = if_else(race == 2, 1, 0),
    race3 = if_else(race == 3, 1, 0),
    race.f = factor(race, levels = c(1,2,3),
                    labels = c('white', 'black', 'other'))
  )

df1

```

### rms包的应用

利用**rms**包展开分析，此包提供了多种对于回归模型的包装函数。包括拟合、区分度分析、校准曲线、nomogram等。
适合临床中的种种应用。

一个值得注意的点是，在应用多种统计模型的时候，对于一些参数的选择一般都会选择默认参数，这和在机器学习建模过程中的逻辑是有所差异的，对于逻辑回归，也是有一些参数除了依据具体的问题去选择外，还需要tune 一些超参数。

```{r}
# 为防止影响下游分析将每一列进行类型转换
# 忘记了当时为何要全部转为数字了，但应该是错误的，在R中指定因子还是重要的
# df1 <- df1 %>% 
#   mutate(across(everything(), as.numeric)) %>% 
#   as.data.frame()

attach(df1)
dd <- datadist(df1)
options(datadist = 'dd')


fit1 <- rms::lrm(formula = low ~ age+ftv+ht+lwt+ptl+smoke+ui+race1+race2,
                 data = df1, x = T, y = T
                 )

fit1


```

```{r}
# 以下为rms包所提供的一些方便的功能
plot(anova(fit1), what='proportion chisq') # relative importance

plot(Predict(fit1, fun=plogis)) # predicted values

rms::validate(fit1, method="boot", B=500) # bootstrapped validation

vif(fit1) # test for multicolinearity

Predict(fit1)

# penalty <- pentrace(mod1, penalty=c(0.5,1,2,3,4,6,8,12,16,24), maxit=25)
# mod1_pen <- update(mod1, penalty=penalty$penalty)
# effective.df(mod1_pen)
# mod1_pen
```


*模型检验* 在建模时需要检验模型是否满足其对应的assumption，比如对于线性回归模型应满足LINE，对于Cox回归应该满足等风险比例，对于逻辑回归其outcome应该是分类变量、残差符合正态分布等。

`easystats`似乎对`rms`包的支持一般。

```{r}
# assumption
performance::check_model(fit1)

# 模型整体的表现
performance::performance(fit1)

```

*校准曲线*

Calibration plot and Hosmer-Lemeshow goodness-of-fit test assess calibration.Calibration is the agreement between observed outcomes and model's prediction.

对于下图的解读：三条线分别为，Ideal线为 n=189,Mean absolute error=0.037 Mean squared error=0.00177 0.9 Quantile of absolute error=0.057


```{r}
cal1 <- rms::calibrate(fit = fit1, method = 'boot',
                       B = 500
                       )

plot(cal1,xlim = c(0,1.0),ylim = c(0,1.0))

# 更改绘图的多个参数
# plot(cal1,
#      add=F,
#      conf.int=T,#95%CI（蓝色线）
#      subtitles = F,#关闭副标题
#      cex.subtitles=0.8,
#      lwd=2,
#      lty=1,
#      errbar.col="blue")
```

ROC/AUC

```{r}
# 混淆矩阵
library(ROCR)


df1$predvalue <- predict(fit1, type = 'fitted')

pred <- ROCR::prediction(predictions = df1$predvalue, 
                         labels = df1$low)

perf <- ROCR::performance(pred,
                    measure = "tpr",
                    x.measure = "fpr")
plot(perf, colorize=TRUE,
     main = 'ROC Curve'
     )
abline(0,1, col = 3, lty = 2)
plot(perf,
     lty=3,
     col="grey78",
     add=TRUE)
auc <- ROCR::performance(pred,"auc")
auc@y.values

# for C-statistic, fit1的输出中已经包含有c-index的值，此处尝试加上置信区间
Hmisc::somers2(df1$predvalue, df1$low)
```

对于ROC曲线的分析似乎`pROC`是一个不错的选择。下面以`pROC`作为示例：

```{r}
library(pROC)

roc_score<- roc(df1$low, df1$predvalue, ci = TRUE) #AUC score

plot(roc_score ,main ="ROC curve -- Logistic Regression ")
```

*nomogram*

下面展开列线图的绘制和解读

```{r}
nom <- nomogram(fit = fit1,
                fun = plogis,#
                fun.at = c(.001, .01, .05,
                           seq(.1,.9, by = .1), .95, .99, .999),
                lp = F,
                funlabel = 'LowWeight rate'
                )
# 注意有read_sav所带来的每一列的困扰,因为不识别其每一列的格式，而显示空白
plot(nom)

```

通过以上列线图的结果，可以看到ftv对模型的影响很小，可以不考虑。现重新构建模型fit2:

```{r}
df2 <- df1 %>% 
  mutate(race = ifelse(race==1, 1, 0)) %>% 
  mutate(race = as.factor(race))

dd2 <- datadist(df2)
options(datadist = 'dd2')
fit2 <- rms::lrm(formula = low ~ age+ht+lwt+ptl+smoke+race,
                 data = df2, x = T, y = T
                 )

fit2
```

*DCA*

在R中做DCA曲线的有`ggDCA`, `rmda`等R包，但似乎都差些意思。需要一个更好更强大的R包。 DCA曲线的解读如上文所述，看我们得到的模型是不是比All干预时所有的净收益都要好，如图可以看到模型整体的净收益都比all曲线的净收益要好。 评估模型的临床效用，可以的出结论此模型在 临床应用中会有较好的应用。

```{r}
library(ggDCA)

d_res <- dca(
  fit1
)

ggplot(d_res)

ggDCA::AUDC(d_res)
```

*NRI + IDI*

净重新分类指数(net reclassification improvement NRI)是指在新旧指标，或者在不同模型上的重新分类的变化。广泛应用 于比较两个预测模型的准确度。 IDI(integrated discrimination improvement)综合判别改善指数。 二者是measures of discrimination，一般用于两个模型间的对比。

以下所使用的R包似乎只支持glm对象，rms对象似乎不支持。

```{r}
library(PredictABEL) # 此包二者都可以做，但是有点麻烦
library(nricens)


logistic.model.list <- list(fit1 = glm(formula = low ~ age+ftv+ht+lwt+ptl+smoke+ui+race1+race2,
                                       data = df1, family = binomial, x = T), 
                            fit2 = glm(formula = low ~ age+ht+lwt+ptl+smoke+race, 
                                       data = df1, family = binomial, x = T))
df1 <- within(df1, {
    ## Obtain fitted values from two models
    fitted.fit1     <- fitted(logistic.model.list[["fit1"]])
    fitted.fit2     <- fitted(logistic.model.list[["fit2"]])
    ## Changes in predicted probability
    change          <- factor(sign(fitted.fit2 - fitted.fit1),
                                     levels = c(-1,0,1), labels = c("Down","Unchanged","Up"))
    ## Mark classification
    fitted.fit1.pos <- as.numeric(fitted.fit1 >= 0.5)
    fitted.fit2.pos <- as.numeric(fitted.fit2 >= 0.5)
    ## Changes in classification
    reclass         <- factor(sign(fitted.fit2.pos - fitted.fit1.pos),
                                    levels = c(-1,0,1), labels = c("Down","Unchanged","Up"))
})

# 结果即有NRI也有IDI，不过注意不支持tibble格式
PredictABEL::reclassification(data = df1, 
                              cOutcome = 2,
                              predrisk1 = fitted(logistic.model.list[["fit1"]]),
                              predrisk2 = fitted(logistic.model.list[["fit2"]]),
                              cutoff = c(0, 0.5, 1)
                              )


nricens::nribin(mdl.std = logistic.model.list[["fit1"]],
       mdl.new = logistic.model.list[["fit2"]],
       updown = 'category',
       cut = c(0.2, 0.4), # cut 为高低风险的临界值
       niter = 1000
       )
```


## use glm()函数进行罗辑回归

logistic model，又名logit model, In the logit model the log odds of the outcome is modeled as a linear combination of the predictor variables.

logistic regression是GLM的一种，GLM所有的特点其自然也是具有的。
逻辑回归需要满足几个假设条件:
- response variable  is a binomial random variable with a single trial and success probability 
- Y值的独立性
- 线性，


`stats`包`glm`的应用：

```{r}
df1 <- df1 %>% 
  labelled::remove_labels() %>%
  as_tibble() %>% 
  mutate(race = factor(race),
         across(c(ht, smoke, ui), as.factor)
         # ftv = factor(ftv)
         )
```


在R中建模，是没有必要对factor显示指定编码方法的，对于无序因子变量，R中默认采用dummy编码方法：
下面的两个模型其结果应该是一致的。

```{r}
model_glm2 <- stats::glm(low ~ age + ftv + ht + lwt + ptl + smoke + ui + race2 + race3, 
                        data = df1, family = "binomial")

model_glm <- glm(low ~ age + ftv + ht + lwt + ptl + smoke + ui + race, 
                  data = df1, family = "binomial")

# predict(model_glm)  # 默认predict为predict.glm, type为on the scale of the linear predictors
# predict(model_glm, type = 'response')

```

```{r, eval=FALSE}
summary(model_glm)
```
*interpret results：*第一部分为残差的summary结果, `summary(residuals(model_glm))`。
Coefficients的解释和线性模型是一致的。


模型检验：
和R中的大多数模型一样，base包中的generic函数都是支持的。

```{r}
# assumption
plot(model_glm)

# 奇怪以前是支持glm类的呀。。搞半天似乎是我自己模型的问题。。。在经过不断debug后,是spss数据格式的问题
# m1 <- stats::glm(vs ~ wt + disp, data = mtcars, family = 'binomial')
# m2 <- stats::glm(low ~ age + ftv, data = df1, family = "binomial")

performance::check_model(model_glm)

# 模型整体的表现
performance::performance(model_glm)


result <- binned_residuals(model_glm)
plot(result)

check_collinearity(model_glm) |> plot()

check_outliers(model_glm) %>% plot(., type = "dots")

check_normality(model_glm) %>% plot(., type = "qq")

check_posterior_predictions(model_glm)

check_collinearity(model_glm) %>% 
  print_md()

# check_normality(model_glm) %>%
#   print_md()

DHARMa::testDispersion(model_glm)
```


The default predictions are on the scale of the log odds. These predictions are also available through the type = "link" 默认, type= 'response' probability scale.

```{r}
# 计算相关参数
# 逻辑回归默认predict返回的为其的log odds值，
log_odds <- predict(model_glm)

odds <- exp(log_odds)
probabilities <- plogis(log_odds) # == predict(model_glm, type = 'response')
odds_ratio <- exp(coef(model_glm))


```


```{r}
ggeffects::ggpredict(model = model_glm)
```


在模型完成后，需要评价模型的指标，在临床上用到的几个指标在上一部分已经介绍，在预测模型中，罗辑回归或者类似的分类模型会考虑AUC、等指标。

```{r}
# 通过caret包 计算混淆矩阵
model_glm_pred <- ifelse(predict(model_glm, type = "link") > 0, "1", "0")
# 等价于 model_glm_pred <- ifelse(predict(model_glm, type = "response") > 0.5, "Yes", "No")

train_tab = table(predicted = model_glm_pred, actual = df1$low)

train_con_mat = caret::confusionMatrix(train_tab, positive = "1")

c(train_con_mat$overall["Accuracy"], 
  train_con_mat$byClass["Sensitivity"], 
  train_con_mat$byClass["Specificity"])


# 选取不同cutoff情况下的结果
aa <- get_logistic_pred(model_glm, data = df1, 
                   res = "default", pos = "Yes", neg = "No", 
                   cut = 0.5
                   )
```


```{r}
### 马修斯相关系数, matthews correlation coefficient(phi coefficient)
# 此指数用于指定模型预测的忧度
Logistic_Predictions = predict(model_glm, type = 'response')
Logistic_Predictions_binomial=ifelse(Logistic_Predictions>0.5,1,0)
con_Logistic <- caret::confusionMatrix(factor(Logistic_Predictions_binomial), factor(df1$low))


MCC_Logistic <- list(
  TP <- con_Logistic$table[[1]],
  FP <- con_Logistic$table[[2]],
  FN <- con_Logistic$table[[3]],
  TN <- con_Logistic$table[[4]]
) %>%  # # MCC <- ((TP * TN) - (FP * FN)) / sqrt((TP + FP) * (TP+FN) * (TN+FP) * (TN+FN))
  pmap_dbl(., ~ ((..1 * ..4) - (..2 * ..3))/sqrt((..1 + ..2) * (..1 + ..3) * (..4 + ..2) * (..4 + ..3)))


# use yardisk
# yardstick 有一系列的函数可以得到所需之结果，输入为混淆矩阵
yardstick::mcc(con_Logistic$table)


# PRROC::pr.curve
```

ROC曲线，
此处以df1进行计算，最好是以test数据集进行，不过对于自身模型而言，ROC值也可以作为模型自身的一个判断标准。

```{r}
library(pROC)

test_prob <- predict(model_glm, newdata = df1, type = "response")

test_roc <- roc(df1$low ~ test_prob, plot = TRUE, print.auc = TRUE,
                ci = TRUE
                )
```


```{r}
performance::performance_roc(model_glm, new_data = df1) |> plot() +
  ggplot2::theme_bw()
```


对于罗辑回归模型，也从变量和预测两个角度去分析模型，模型是否具有类似线性模型

```{r}
visreg::visreg(model_glm)
```


```{r}
modelbased::estimate_means(model_glm, at = c('race'))
```


```{r}
modelbased::estimate_slopes(model_glm)
```


*multinomial logistic regression*

To perform multinomial logistic regression, we use the multinom function from the nnet package. Training using multinom() is done using similar syntax to lm() and glm(). We add the trace = FALSE argument to suppress information about updates to the optimization routine as the model is trained.

```{r}
library(nnet)

model_multi = multinom(Species ~ ., data = iris_trn, trace = FALSE)
summary(model_multi)$coefficients

head(predict(model_multi, newdata = iris_trn, type = "prob"))
```


```{r}
model_gamm <- mgcv::gamm(low ~ s(age) + ftv + ht)
```