--- title: "Routine structural missing data diagnostics" author: "Janick Weberpals" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Routine structural missing data diagnostics} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} bibliography: references.bib link-citations: true --- ```{=html} ``` ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", dpi = 150, fig.width = 6, fig.height = 4.5 ) ``` ```{r setup, message=FALSE} library(smdi) library(gt) library(dplyr) library(here) library(knitr) ``` # `smdi` main functionalities The `smdi` flagship function is `smdi_diagnose()` which calls multiple sub-functions which are also accessible separately. This article aims to give an introduction to missing covariate data diagnostics using the individual `smdi` diagnostics function that all funnel into the main function `smdi_diagnose()`. ::: {.alert .alert-info} What is smdi about? Large-scale simulations revealed characteristic patterns of diagnostic parameters matched to common missing data structures based on three group diagnostics: ::: ```{r, fig-group_diagnostics, fig.cap = "Overview three group diagnostics", echo = FALSE} include_graphics(here("vignettes", "smdi_diagnose_table.png")) ``` ::: {.alert .alert-info} How can this be applied to inform a real-world database study? The observed diagnostic patterns of a specific study will give insights into the likelihood of underlying missingness structures. This package enables researchers to create these three group diagnostics for their own healthcare database analytics with little effort. This is how an **example** could look like in a real-world database study: ::: ```{r, fig-guidance, fig.cap = "Example of how `smdi` diagnostics can be applied to give insights into the likelihood of underlying missingness structures in a real-world database study.", echo = FALSE} include_graphics(here("vignettes", "smdi_examples.png")) ``` ## Illustrative dataset To illustrate the usage of the `smdi` package main functions, we use the `smdi_data` dataset which is an example dataset that comes bundled with the package and includes some ready to use simulated partially observed covariates. If you prefer to simulate missingness yourself, you can do so using the `smdi_data_complete` dataset. In brief, the `smdi_data` dataset consists of a simulated lung cancer cohort with a fictional comparison of two antineoplastic systemic therapy regimens and a time-to-event outcome. More information on the underlying dataset is given in the previous [Data generation](https://janickweberpals.gitlab-pages.partners.org/smdi/articles/a_data_generation.html) article. ```{r} smdi_data %>% glimpse() ``` The dataset consists of `r formatC(nrow(smdi_data), format = "fg", big.mark = ",")` patients and `r ncol(smdi_data)` variables with `exposure` representing the two treatment regimens under comparison and `status` and `eventtime` the vital status and censoring time, respectively. For more information, please checkout: ```{r, eval=FALSE} # dataset with simulated missingness ?smdi::smdi_data() # complete dataset ?smdi::smdi_data_complete() ``` ## Descriptives ### Missingness proportions As with basically any first step into exploring (new) datasets, it's a good idea to get an overview of partially observed covariates and the magnitude of missingness. For this, `smdi` comes with two convenient functions to screen the data for missingness. This can be either as a table ... ```{r} smdi_data %>% smdi_summarize() ``` ... or visually ```{r} covars_missing <- smdi_summarize(data = smdi_data) %>% pull(covariate) smdi_data %>% smdi_vis(covar = covars_missing) ``` The plot also provides flexibility to stratify, e.g. by `exposure`. ```{r} smdi_data %>% smdi_vis(covar = covars_missing, strata = "exposure") ``` ### Missingness patterns {.tabset} Besides describing the proportion missingness by covariate (and potentially stratified by another variable), it is also of great importance to check the missingness patterns. This can give important hints if the missingness in two or more partially observed covariates may follow a monotone or non-monotone missingness pattern. If former is the case, one should be careful and potentially run the below-described `smdi` diagnostics variable-by-variable instead of jointly. > We recommend checking both missingness proportions and patterns as a first step. In case of monotonocity, the `smdi` package may likely culminate in misleading results. Please check the [article on multivariate missingness and monotonicity](https://janickweberpals.gitlab-pages.partners.org/smdi/articles/c_multivariate_missingness.html). To check missingness patterns, the `smdi` comes with re-exports of the `naniar::gg_miss_upset` and `mice::md.pattern` functions. #### Upset plot ```{r} smdi::gg_miss_upset(data = smdi_data) ``` The upset plot and the pattern matrix show that only a small fraction of observations (N = 97) is missing in all the partially observed covariates at the same. A non-monotone missingness pattern is consequently more likely than a monotone one. #### Pattern matrix ```{r, fig.width = 8, fig.height=6, fig.width = 10} smdi::md.pattern(smdi_data[, c(covars_missing)], plot = FALSE) ``` ## Before we are getting started: data format Generally, the dataframe(s) used for `smdi` (defined as the `data` parameter in all functions) should consist of the partially observed covariates, other relevant observed covariates as well as the the exposure and outcome variables. Avoid having meta-variables such as e.g. ID variables, dates and zip codes in your dataframe as all present columns in your dataframe will be used to make inferences about potential missing data mechanisms. ## Group 1 diagnostics: differences in covariate distributions ### Median/average absolute standardized mean differences As discussed in the documentation of the `smdi_asmd` and `smdi_hotelling`, the median/average standardized mean difference (asmd) may be one indicator as to how much patient characteristics differ between patients with and without an observed value for a partially observed covariate. If the median/average asmd is above a certain threshold this may indicate imbalance in patient covariate distributions which may be indicative of the partially observed covariate following a missing at random (MAR) mechanims, i.e., the missingness is explainable by other observed covariates. Similarly, no imbalance between observed covariates may be indicative that missingness cannot be explained with observed covariates and the underlying missingness mechanism may be completely at random (MCAR) or not at random (e.g., missingness is only associated with unobserved factors or through the partially observed covariate itself). To get an idea about the asmd for our partially observed covariates we can run the `smdi_asmd` function. We are ok with the default parameters (i.e., we let the function investigate all covariates with at least one NA [covar = NULL], compute the median asmd [median = TRUE] and don't explicitly model the missingness of other partially observed covariates [includeNA = FALSE]). ```{r} asmd <- smdi_asmd(data = smdi_data) ``` The output returns an *asmd* object that contains a lot of information in the following structure that can be accessed using the "\$" operator: - asmd - covar - covariate modeled - complete table 1 - asmd plot - aggregate median/average asmd Here is an example of `egfr_cat`: ```{r} asmd$egfr_cat$asmd_table1 ``` ```{r} asmd$egfr_cat$asmd_plot ``` ```{r} asmd$egfr_cat$asmd_aggregate ``` To limit the output, we can use the generic `print` or `summary` output of the object which returns a summary table of the aggregate median/average asmd per covariate. ```{r} summary(asmd) ``` ### Hotelling's and Little's hypothesis tests #### Hotteling The `smdi_hotelling` function follows the same logic, but is a formal hypothesis test for the difference in covariate distributions based on a multivariate student t-test.[@hotelling1992] The output (a hotteling object) follows the same structure as above. It's important to remember that the power of statistical hypothesis tests can be influenced by sample size, so the combined investigation along with `smdi_asmd()` is highly recommended. - hotelling - covar - Hotelling test statistics ```{r} h0 <- smdi_hotelling(data = smdi_data) h0 ``` More details can be accessed for each covariate. ```{r} h0$ecog_cat ``` #### Little Little proposes a single global test statistic for MCAR that uses all of the available data.[@little1988a] Hence, the `smdi_little` does not return one test statistic per partially observed covariate but one globally for the entire dataset. ```{r} h0_global <- smdi_little(data = smdi_data) h0_global ``` > CAVE: Hotelling's and Little's show high susceptibility with large sample sizes and it is recommended to always interpret the results along with the other diagnostics. ## Group 2 diagnostics: ability to predict missingness Since MAR mechanisms are defined as the missingness being a function of observed covariates, MAR may be predictable and evaluable by fitting a classification model, e.g., a random forest model, which would yield moderate to high area under the curve (AUC) values in the case of MAR. The `smdi_rf` function trains and fits a random forest model to assess the ability to predict missingness for the specified covariate(s). If the missing indicator for this covariate can be predicted as a function of observed covariates, a MAR missingness mechanism may be likely. CAVE: Depending on the amount of data (sample size x covariates), the computation of the function can take some minutes. The structure of the output again follows the general schema. - rf - covar - rf_table (AUC) - rf_plot (variable importance) ```{r} auc <- smdi_rf(data = smdi_data) auc$ecog_cat$rf_table auc$ecog_cat$rf_plot ``` > CAVE: If the missingness indicator variables of other partially observed covariates (indicated by suffix \_NA) have an extremely high variable importance (combined with an unusually high AUC), this might be an indicator of a monotone missing data pattern. That is, the missingness in one covariate is highly predictive of the missingness of another partially observed covariate. In this case it is advisable to exclude other partially observed covariates and run missingness diagnostics separately. This can be checked, e.g. with the `mice::md.pattern()` function (`mice` package). ## Group 3 diagnostics: association between missingness and outcome The group 3 diagnostic focuses on assessing the association between the missing indicator of the partially observed `covar` and the outcome under study. This may reveal important covariate relationships with the outcome and could give additional pieces of information. Currently, all main types of outcome regressions are supported, namely *glm*, linear (*lm*) and Cox proportional hazards (*survival*) regression models are supported and need to be specified using the *model* and *form_lhs* parameters. If *glm* is specified, any *glm_family* regression type can be used which includes binomial (default), gaussian, Gamma, inverse.gaussian, poisson, quasi, quasibinomial and quasipoisson (for more information see `?stats::family`). Further details on the `smdi_outcome` function can be accessed via ```{r, eval=FALSE} ?smdi::smdi_outcome() ``` The output of `smdi_outcome` returns a table of the univariate and adjusted beta coefficients and 95% confidence intervals for all `covar`. ```{r} smdi_outcome( data = smdi_data, model = "cox", form_lhs = "Surv(eventtime, status)" ) ``` # `smdi_diagnose()` - one function to rule them all Finally, all of the functions above funnel into `smdi_diagnose()` which outputs an smdi object including a summary table of all three smdi group diagnostics and little's global p-value statistic. If details of the individual functions above are needed, this method may not be preferable but is a convenient way to implement routine structural missing covariate diagnostics by calling a single function. Note that all parameters of the individual functions that make up `smdi_diagnose()` can be specified and will be passed on, but only the required parameter must be specified. A most minimal example could look like this. ```{r} diagnostics <- smdi_diagnose( data = smdi_data, covar = NULL, # NULL includes all covariates with at least one NA model = "cox", form_lhs = "Surv(eventtime, status)" ) ``` The output returns two parts: the `smdi_tbl` which can be called using the `$` operator and looks like this ```{r} diagnostics$smdi_tbl ``` ... and the p-value of the global Little's test statistic: ```{r} diagnostics$p_little ``` ## Publication-ready `gt`-style table For a nicely formatted and publication-ready output we can subsequently use the `smdi_diagnose` output and feed it into the `smdi_gt_style()` function: ```{r} library(gt) smdi_style_gt(diagnostics) ``` ### smdi table export To make this even more convenient, `gt` offers a functionality to export the table in different formats, e.g. `.docx`, `.png`, `.pdf` or `.rtf`: ```{r, eval = FALSE} gtsave( data = smdi_style_gt(diagnostics), filename = "smdi_table.docx", # name of the final .docx file path = "." # path where the file should be stored ) ``` # References