Imputing mixed numeric/categorical data within train preProc? #1344

Open
jarbet opened this issue Jul 23, 2023 · 0 comments

Is it possible to impute mixed numeric/categorical data within train's preProc argument? I want the imputation to happen inside train's resampling (e.g. cross-validation), so that the estimate of generalization error accounts for the uncertainty in the imputations.

The ?preProcess help page suggests it is not possible to impute categorical variables:

x : a matrix or data frame. Non-numeric predictors are allowed but will be ignored.

However, the bagImpute method can handle mixed data, in theory. The following code runs, but I am not sure whether it is actually imputing the missing factor value or simply dropping rows with a missing factor value:

library(caret);
#> Loading required package: ggplot2
#> Loading required package: lattice
data(iris);
    
nrow(iris);
#> [1] 150

iris.miss <- iris;
iris.miss[1,'Species'] <- NA;
iris.miss[2,'Petal.Length'] <- NA;
set.seed(1);
fit <- train(
    Sepal.Length ~ .,
    data = iris.miss,
    method = 'lm',
    preProc = 'bagImpute',
    na.action = na.pass
    );
fit
#> Linear Regression 
#> 
#> 150 samples
#>   4 predictor
#> 
#> Pre-processing: bagged tree imputation (5) 
#> Resampling: Bootstrapped (25 reps) 
#> Summary of sample sizes: 150, 150, 150, 150, 150, 150, ... 
#> Resampling results:
#> 
#>   RMSE       Rsquared   MAE      
#>   0.3176759  0.8587222  0.2604171
#> 
#> Tuning parameter 'intercept' was held constant at a value of TRUE

Notice that the printed fit reports all 150 samples, which suggests the missing factor value was imputed. However, I suspect that the row is simply being removed from the model rather than imputed. Is that the case?
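
One way to narrow this down without caret is to look at what the formula interface hands to the pre-processing step. The sketch below uses only base R and assumes (this is not confirmed anywhere in this issue) that train's formula method builds its predictor matrix by calling model.frame with the supplied na.action and then model.matrix on the result. Under that assumption, the row with the missing Species is kept, and the missing factor level appears as NA in numeric dummy columns, so any imputation would act on those dummy columns rather than on the factor itself.

```r
# Base R only: mimic the ASSUMED model.frame -> model.matrix path of a formula interface.
data(iris)

iris.miss <- iris
iris.miss[1, "Species"] <- NA
iris.miss[2, "Petal.Length"] <- NA

# na.pass keeps rows that contain missing values in the model frame.
mf <- model.frame(Sepal.Length ~ ., data = iris.miss, na.action = na.pass)

# Because mf already carries a terms attribute, model.matrix uses it as is
# and does not re-apply the default na.action.
x <- model.matrix(attr(mf, "terms"), data = mf)

# Number of rows that survive, and the rows that contain at least one NA.
nrow(x)
x[!complete.cases(x), , drop = FALSE]
```

If the two incomplete rows are still present here, the 150 samples in the printed fit would be consistent with imputation of the dummy columns rather than row removal. On the fitted object itself, nobs(fit$finalModel) should report how many rows the final lm actually used.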

Created on 2023-07-23 by the reprex package (v2.0.1)
