The goal of this document is to highlight the functionality implemented in the package missForestPredict and to provide guidance for the usage of this package.
The package missForestPredict implements the missing data imputation algorithm used in the R package missForest (Stekhoven and Bühlmann 2012) with adaptations for prediction settings. The function missForest is used to impute a (training) dataset with missing values and to learn imputation models that can later be used for imputing new observations. The function missForestPredict is used to impute one or multiple new observations (test set) using the models learned on the training data. The word “Predict” in the function name should not mislead the user: the function does not predict an outcome and is agnostic to whether the outcome variable for a prediction model is part of the training data; it treats all columns of the provided data as variables to be imputed.
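The two-step workflow described above can be sketched as follows. This is a minimal illustration on the iris data, not a recommended analysis; it assumes that the argument names save_models and newdata, the helper produce_NA and the ximp element of the returned object match the installed version of the package (see ?missForest and ?missForestPredict). The calls are namespace-qualified because the missForest package exports a function with the same name.

```r
library(missForestPredict)

# Split iris into a training and a test part (illustrative data only).
set.seed(2022)
id_test <- sample(seq_len(nrow(iris)), 50)
iris_train <- iris[-id_test, ]
iris_test <- iris[id_test, ]

# Introduce 10% missing values in both parts with the package helper.
iris_train_miss <- produce_NA(iris_train, proportion = 0.1)
iris_test_miss <- produce_NA(iris_test, proportion = 0.1)

# Impute the training data and keep the learned models for later use.
imp_object <- missForestPredict::missForest(iris_train_miss,
                                            save_models = TRUE)
iris_train_imp <- imp_object$ximp

# Impute new observations with the models learned on the training data.
iris_test_imp <- missForestPredict::missForestPredict(imp_object,
                                                      newdata = iris_test_miss)
```

Note that Species is imputed like any other column, even though it would typically be the outcome in a prediction model on these data.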
Fast implementation
The imputation algorithm is based on random forests (Breiman 2001) as implemented in the ranger R package (Wright and Ziegler 2017). The ranger package provides a fast implementation of random forests suitable for large datasets as well as high-dimensional data.
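To make the role of ranger concrete, the sketch below shows a single, simplified imputation step for one variable: a forest is fitted on the rows where the variable is observed and its predictions fill in the missing entries. It is an illustration of the idea only, not the internal code of missForestPredict; the data setup and object names (df, miss, fit) are hypothetical.

```r
library(ranger)

set.seed(1)
df <- iris

# Hypothetical setup: blank out 15 values of one column to imitate missing data.
miss <- sample(nrow(df), 15)
df$Sepal.Length[miss] <- NA

# Fit a random forest on the rows where the target variable is observed.
obs <- !is.na(df$Sepal.Length)
fit <- ranger(Sepal.Length ~ ., data = df[obs, ], num.trees = 100)

# Predict the missing entries from the remaining (fully observed) columns.
predictors <- setdiff(names(df), "Sepal.Length")
df$Sepal.Length[!obs] <- predict(fit, data = df[!obs, predictors])$predictions
```

In the full algorithm, such a step is repeated for every variable and over several iterations, starting from an initial guess for all missing values.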
Saved models and initialization
The missing data in each column is initialized with the mean/mode (or median/mode) of that variable derived on complete cases, or with a custom imputation scheme. Each variable is then imputed using the iterative algorithm of missForest (Stekhoven and Bühlmann 2012) until a stopping criterion is met. The algorithm supports all variable types (continuous and categorical with two or more levels) and uses a common stopping criterion for all variables. The initialization used for the training data and the random forest models for each iteration are saved and can later be used to impute new observations. By default, imputation initialization and models are also “learned” for variables with no missing values in the original (training) data. This accommodates situations in which new observations have different missingness patterns than those encountered in the training data (for example, because of accidental registration errors or because of unfortunate train