Using the missForestPredict package

Elena Albu

2023-12-11

Introduction

What is this document?

The goal of this document is to highlight the functionality implemented in the package missForestPredict and to provide guidance on using it.

Package information

The package missForestPredict implements the missing data imputation algorithm used in the R package missForest (Stekhoven and Bühlmann 2012) with adaptations for prediction settings. The function missForest is used to impute a (training) dataset with missing values and to learn imputation models that can later be used for imputing new observations. The function missForestPredict is used to impute one or multiple new observations (test set) using the models learned on the training data. The word “Predict” in the function name should not mislead the user. The function does not perform prediction of an outcome and is agnostic to whether the outcome variable for a prediction model is part of the training data or not; it will treat all columns of the provided data as variables to be imputed.
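The two-step workflow described above can be sketched as follows. This is a minimal illustration, not a recommended analysis: the iris split, the seed, and the hand-placed missing values are assumptions made for the example, and the element name ximp for the imputed training data is taken from the package documentation as we understand it.

```r
library(missForestPredict)

data(iris)
set.seed(2022)

# Split iris into a training and a test set (sizes chosen for illustration)
id_test    <- sample(seq_len(nrow(iris)), 50)
iris_train <- iris[-id_test, ]
iris_test  <- iris[id_test, ]

# Introduce a few missing values by hand (illustrative only)
iris_train_miss <- iris_train
iris_train_miss$Sepal.Length[c(3, 17, 42)] <- NA
iris_train_miss$Species[c(5, 60)] <- NA

iris_test_miss <- iris_test
iris_test_miss$Petal.Width[c(1, 10)] <- NA

# Step 1: impute the training data and learn the imputation models.
# The namespace prefix avoids a clash with the missForest package.
imp_object <- missForestPredict::missForest(iris_train_miss,
                                            save_models = TRUE,
                                            verbose = FALSE)
iris_train_imp <- imp_object$ximp

# Step 2: impute new observations with the models learned in step 1
iris_test_imp <- missForestPredict::missForestPredict(imp_object,
                                                      newdata = iris_test_miss)
```

Note that Species is imputed like any other column: the functions do not distinguish an outcome from the predictors.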

Package functionality

Fast implementation

The imputation algorithm is based on random forests (Breiman 2001) as implemented in the ranger R package (Wright and Ziegler 2017). The ranger package provides a fast implementation of random forests suitable for large datasets as well as high-dimensional data.
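To make the role of ranger concrete, the sketch below imputes a single variable with one random forest: the model is fitted on the rows where the variable is observed and then used to predict the rows where it is missing. This is a simplified illustration under our own assumptions (one incomplete variable, a single pass, arbitrary row indices and tuning values); it is not the package's internal code, which iterates over all variables.

```r
library(ranger)

data(iris)
set.seed(1)

# Make three values of one variable missing (row indices are arbitrary)
x <- iris
miss_rows <- c(4, 25, 90)
x$Sepal.Length[miss_rows] <- NA

# Rows where the target variable is observed
obs <- !is.na(x$Sepal.Length)

# Fit a random forest for the target variable on the observed rows
rf <- ranger(Sepal.Length ~ ., data = x[obs, ],
             num.trees = 100, num.threads = 2)

# Predict the missing values from the remaining columns
predictors <- setdiff(names(x), "Sepal.Length")
x_imp <- x
x_imp$Sepal.Length[!obs] <- predict(rf, data = x[!obs, predictors])$predictions
```

Observed values are left untouched; only the three missing entries are filled in.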

Saved models and initialization

The missing data in each column is initialized with the mean/mode (or median/mode) of that variable derived from complete cases, or with a custom imputation scheme. Each variable is then imputed using the iterative algorithm of missForest (Stekhoven and Bühlmann 2012) until a stopping criterion is met. The algorithm supports all variable types (continuous and categorical with two or more levels) and uses a common stopping criterion for all variables. The initialization used for the training data and the random forest models for each iteration are saved and can later be used to impute new observations. By default, the imputation initialization and models are also “learned” for variables with no missing values in the original (training) data. This allows for unfortunate situations in which new observations have missingness patterns different from those encountered in the training data (for example, because of accidental registration errors or because of unfortunate train