This is an accompanying simulation for another repo.
The proteomics dataset was generated with data-independent acquisition (DIA).
Three imputation methods were compared:
- imputation with random forest
- imputation with KNN
- left-censored imputation (similar to the Perseus default and the DEP R package)

Additionally, all three imputation methods were applied both row-wise (proteins are rows) and column-wise (proteins are columns).
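
As a rough illustration of the row-wise versus column-wise distinction, here is a minimal sketch of left-censored imputation using only the Python standard library. It is not the code used in this simulation. The function name `impute_left_censored`, the `by` argument, and the toy matrix are hypothetical, and the `shift=1.8` and `width=0.3` values are assumed to match the commonly cited Perseus defaults.

```python
import random
import statistics


def impute_left_censored(matrix, by="column", shift=1.8, width=0.3, seed=0):
    """Fill None entries with draws from a down-shifted, narrowed normal.

    by="row":    mean and sd are estimated from each row.
    by="column": mean and sd are estimated from each column.
    shift and width are expressed in units of the observed sd (assumed defaults).
    """
    rng = random.Random(seed)
    # Copy the input so the caller's matrix is left untouched.
    data = [list(row) for row in matrix]
    if by == "column":
        # Transpose so that each inner list is one column.
        data = [list(col) for col in zip(*data)]
    for vec in data:
        observed = [v for v in vec if v is not None]
        if len(observed) < 2:
            continue  # too few values to estimate a spread; leave gaps as they are
        mu = statistics.mean(observed)
        sd = statistics.stdev(observed)
        for i, v in enumerate(vec):
            if v is None:
                vec[i] = rng.gauss(mu - shift * sd, width * sd)
    if by == "column":
        # Transpose back to the original orientation.
        data = [list(row) for row in zip(*data)]
    return data


# Toy log-intensity matrix (made-up numbers); None marks a missing value.
example = [
    [20.1, 21.3, None, 20.7],
    [18.4, None, 18.9, 18.2],
    [25.0, 24.6, 25.3, None],
]
by_row = impute_left_censored(example, by="row")
by_col = impute_left_censored(example, by="column")
```

With `by="row"` each gap is filled from the distribution of its own row, while `by="column"` uses the distribution of its column, so the two calls generally produce different imputed values for the same gap.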
This script is NOT intended as a general comparison of imputation methods (for that, see e.g. [1-3], as well as R packages for proteomics data imputation). Its purpose is to estimate which imputation method works best for a specific proteomics dataset. To do this, missing values were introduced at random into a complete matrix (one without missing values), and the Normalized Root Mean Square Error (NRMSE) was then calculated to compare each imputed value with its real value. An NRMSE close to 0 indicates a better imputation.
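
The evaluation described above can be sketched as follows, again using only the Python standard library. This is an illustration under stated assumptions, not the script itself. The helper names and the toy matrix are hypothetical, the column-mean imputer is only a stand-in for the three methods compared here, and the NRMSE is assumed to be normalized by the variance of the true values at the masked positions (the definition used in the missForest paper [3]); other normalizations exist.

```python
import math
import random


def mask_at_random(matrix, fraction, seed=0):
    """Hide a random fraction of cells; return the masked copy and the hidden positions."""
    rng = random.Random(seed)
    cells = [(i, j) for i, row in enumerate(matrix) for j in range(len(row))]
    hidden = rng.sample(cells, max(1, round(fraction * len(cells))))
    masked = [list(row) for row in matrix]
    for i, j in hidden:
        masked[i][j] = None
    return masked, hidden


def impute_column_mean(masked):
    """Placeholder imputer: replace each gap with the mean of its column."""
    means = []
    for col in zip(*masked):
        observed = [v for v in col if v is not None]
        means.append(sum(observed) / len(observed))
    return [[means[j] if v is None else v for j, v in enumerate(row)]
            for row in masked]


def nrmse(truth, imputed, hidden):
    """sqrt(mean squared error / variance of the true values), over hidden cells only.

    Assumes at least two hidden cells whose true values are not all identical.
    """
    true_vals = [truth[i][j] for i, j in hidden]
    imp_vals = [imputed[i][j] for i, j in hidden]
    mse = sum((t - p) ** 2 for t, p in zip(true_vals, imp_vals)) / len(hidden)
    mean_true = sum(true_vals) / len(true_vals)
    var_true = sum((t - mean_true) ** 2 for t in true_vals) / len(true_vals)
    return math.sqrt(mse / var_true)


# Toy complete matrix (made-up numbers, no missing values).
complete = [
    [20.1, 21.3, 19.8, 20.7],
    [18.4, 18.0, 18.9, 18.2],
    [25.0, 24.6, 25.3, 24.9],
    [22.2, 22.9, 21.7, 22.5],
    [16.3, 17.1, 16.8, 16.5],
]
masked, hidden = mask_at_random(complete, fraction=0.2, seed=1)
score = nrmse(complete, impute_column_mean(masked), hidden)
print(f"NRMSE for column-mean imputation: {score:.3f}")
```

Under this normalization, a perfect imputation scores 0, and filling every gap with the mean of the hidden true values scores exactly 1, which gives a rough reference point when comparing methods.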

1. Jin, L., Y. Bi, C. Hu, J. Qu, S. Shen, X. Wang, and Y. Tian, A comparative study of evaluating missing value imputation methods in label-free proteomics. Sci Rep, 2021. 11(1): p. 1760.
2. Liao, S.G., Y. Lin, D.D. Kang, D. Chandra, J. Bon, N. Kaminski, F.C. Sciurba, and G.C. Tseng, Missing value imputation in high-dimensional phenomic data: imputable or not, and how? BMC Bioinformatics, 2014. 15: p. 346.
3. Stekhoven, D.J. and P. Bühlmann, MissForest—non-parametric missing value imputation for mixed-type data. Bioinformatics, 2011. 28(1): p. 112-118.