---
title: "REDCapDM"
output:
rmarkdown::html_vignette:
toc: true
toc_depth: 5
number_sections: true
package: "`REDCapDM`"
vignette: >
%\VignetteIndexEntry{REDCapDM}
%\VignetteEncoding{UTF-8}
%\VignetteEngine{knitr::rmarkdown}
editor_options:
chunk_output_type: inline
---
```{r message=FALSE, warning=FALSE, include=FALSE}
rm(list = ls())
library(REDCapDM)
library(kableExtra)
library(knitr)
library(dplyr)
library(purrr)
```
# **Introduction**
The REDCapDM package provides a comprehensive toolkit for managing data exported from REDCap. It supports both importing REDCap data (from files or via API), transforming and cleaning the data according to the project’s dictionary and logic, and generating query reports for data validation. In particular, REDCapDM can identify missing or out-of-range values and track changes in identified queries across data versions.
# **Functions**
All main functions are listed below (and described in detail in the examples):
## Import data
- `redcap_data()`: Read REDCap data into R.
## Process data
- `rd_dates()`: Standardize date and datetime fields.
- `rd_delete_vars()`: Remove specified variables (by name or pattern).
- `rd_recalculate()`: Recompute calculated fields and compare with REDCap values.
- `rd_factor()`: Replace numeric multiple-choice columns with their factor version.
- `rd_checkbox()`: Expand checkbox responses with custom labels and rename `var___1` columns (REDCap style) to `var_option`.
- `rd_split()`: Split dataset by form or event.
- `rd_insert_na()`: Manually set specified variables to missing based on a logical filter.
- `rd_rlogic()`: Translate REDCap branching or calculation logic into R syntax.
- `rd_dictionary()`: Update dictionary (translation of REDCap logic into R syntax) to reflect transformed data and logic.
Or we can use all these functions at once:
- `rd_transform()`: One-step pipeline to clean and preprocess the raw REDCap data.
## Queries
- `rd_query()`: Apply expressions to identify data queries/issues.
- `rd_event()`: Report missing/incomplete events per record (longitudinal).
- `check_queries()`: Compare two query reports to track changes made.
- `rd_export`: Export query/report tables to an Excel (.xlsx) file.
# **Installation**
The release version can be installed from CRAN:
```{r eval=FALSE}
install.packages("REDCapDM")
```
The development version can be installed from GitHub:
```{r eval=FALSE}
install.packages("remotes") # Run this line if the 'remotes' package isn't installed already.
remotes::install_github("bruigtp/REDCapDM")
```
# **Built-in dataset**
For the following examples, we will use a random sample of the COVICAN study which is included in the package. COVICAN is an international, multicentre cohort study of cancer patients with COVID-19 to describe the epidemiology, risk factors, and clinical outcomes of co-infections and superinfections in onco-hematological patients with COVID-19.
We can load the built-in dataset by typing:
```{r message=FALSE, warning=FALSE, comment=NA}
library(REDCapDM)
data(covican)
```
The structure of this dataset is:
```{r echo=FALSE, message=FALSE, warning=FALSE, comment=NA}
str(covican, max.level = 1)
```
The first element in the list is a data frame containing all the data. The second element is a data frame with the information in the dictionary of the project about each field. The third and final element is a data frame containing the correspondence of each event with each form.
Some of the variables in the dataset are:
```{r echo=FALSE, message=FALSE, warning=FALSE, comment=NA}
descr <- c("Identifier of each record", "",
"Auto-generated name of the events", "",
"Auto-generated name of each center", "",
"Patients older than 18 years", "No ; Yes",
"Cancer patients", "No ; Yes",
"Diagnosed of COVID-19", "No ; Yes",
"Solid tumour remission >1 year", "No ; Yes",
"Indicator of non-compliance with inclusion and exclusion criteria", "Compliance ; Non-compliance",
"Date of birth (y-m-d)", "",
"Date of first visit (y-m-d)", "",
"Age", "",
"Indicator of diabetes", "No ; Yes",
"Type of diabetes", "No complications ; End-organ diabetes-related disease",
"Indicator of chronic pulmonary disease", "No ; Yes",
"Fraction of inspired oxygen (%)", "",
"Indicator of blood test available", "No ; Yes",
"Potassium (mmol/L)", "",
"Respiratory rate (bpm)", "",
"Indicator of leukemia or lymphoma", "No ; Yes",
"Indicator of acute leukemia", "No ; Yes")
vars <- data.frame("Name" = names(covican$data |> dplyr::select(record_id:acute_leuk)),
"Description" = descr[seq(1, length(descr), 2)],
"Categories" = descr[seq(2, length(descr), 2)])
kable(vars) |>
kableExtra::row_spec(0, bold=TRUE) |>
kableExtra::kable_styling(full_width = F)
```
# **Usage**
The package structure can be divided into three main components: reading raw data, processing data and identifying queries. Typically, after collecting data in REDCap, we will have to follow these three components in order to have a final validated dataset for analysis. We will provide a complete basic user guide on how to perform each one of these steps using the package's functions. For the processing of the data and query identification, we will use the `covican` built-in dataset as an example.
## **Read data**
### **redcap_data**
The `redcap_data()` function allows users to easily import data from a REDCap project into R.
In order to read exported data from REDCap, we first need to download the data and dictionary from the REDCap project in R format. We can then use the arguments `data_path` and `dic_path` to designate the local path where we have stored the R file and the dictionary from the REDCap project:
```{r message=FALSE, warning=FALSE, comment=NA, eval=FALSE}
dataset <- redcap_data(data_path = "C:/Users/username/example.r",
dic_path = "C:/Users/username/example_dictionary.csv")
```
> Note: The R and data CSV file exported from REDCap must be located in the same directory.
If the REDCap project is longitudinal (contains more than one event) then a third element should be specified with the correspondence of each event with each form of the project. This csv file can be downloaded in the REDCap of the project following these steps: _Project Setup_ < _Designate Instruments for My Events_ < _Download instrument-event mappings (CSV)_. Then, it has to be specified using the argument `event_path`:
```{r message=FALSE, warning=FALSE, comment=NA, eval=FALSE}
dataset <- redcap_data(data_path = "C:/Users/username/example.r",
dic_path = "C:/Users/username/example_dictionary.csv",
event_path = "C:/Users/username/events.csv")
```
> Note: if the project is longitudinal and the event-form file is not provided using the `event_path` argument, some steps of the processment can not be performed.
Another way to read data exported from a REDCap project is using an API connection. To do this, we can use the arguments `uri` and `token` which respectively refer to the uniform resource identifier of the REDCap project and the user-specific string that serves as the password:
```{r eval=FALSE, message=FALSE, warning=FALSE, comment=NA}
dataset_api <- redcap_data(uri = "https://redcap.idibell.cat/api/",
token = "55E5C3D1E83213ADA2182A4BFDEA")
```
In this case, there is no need to specify the event-form file since the function will download it automatically using the API connection, if the project is longitudinal.
> **Warning**: Please keep in mind that the API token gives you special access to the REDCap project and that it should not be shared with other people.
The `redcap_data()` function returns a list with three elements: imported data, dictionary and event-form mapping(if included).
## **Process**
Given any data imported from REDCap with `redcap_data()`, this would be the pipeline of an entire processing workflow:
```{r message=FALSE, warning=FALSE, eval = FALSE, comment="#>", collapse = TRUE}
data |>
rd_delete_vars(delete_pattern = c("_complete", "_timestamp") |>
rd_dates() |>
rd_recalculate() |>
rd_checkbox() |>
rd_factor() |>
rd_dictionary() |>
rd_split(by = "event") # use "form" if not longitudinal
```
All functions are optional and should only be used at the user's discretion when necessary. The order of some functions can also be exchanged. For example, for `covican` there are no variables to delete and dates are already processed, so the pipeline would be simplified:
```{r message=FALSE, warning=FALSE, comment="#>", collapse = TRUE}
covican_transformed <- covican |>
rd_recalculate() |>
rd_checkbox() |>
rd_factor() |>
rd_dictionary() |>
rd_split(by = "event")
covican_transformed$results
```
All the functions that can be used in each step of a processing workflow are detailed below:
### **rd_delete_vars**
This function removes unwanted variables from both a REDCap dataset and its dictionary. This is especially useful for eliminating automatically generated fields such as form completion flags (`*_complete`) or timestamps (`*_timestamp`).
You can delete variables either by specifying their exact names or by using regular expression patterns:
```{r}
# Option A: delete by variable name
covican_deleted <- covican |>
rd_delete_vars(vars = c("potassium", "leuk_lymph"))
# Option B: delete by regex pattern
covican_deleted <- covican |>
rd_delete_vars(pattern = c("_complete$", "_timestamp$"))
```
When variables are deleted:
- They are removed from both the dataset and dictionary.
- Factor versions of deleted variables (if present) are also removed.
### **rd_dates**
This function is designed to process and standardize `date` and `datetime` fields in a REDCap dataset. In REDCap projects, date and datetime fields can sometimes be stored as character strings, which can make analyses difficult. This function detects which fields should be dates/datetimes from the REDCap dictionary and converts them to `Date` and `POSIXct`, respectively.
```{r message=FALSE, warning=FALSE, comment="#>", collapse = TRUE}
covican_dates <- covican |>
rd_dates()
```
Quick verification example:
```{r message=FALSE, warning=FALSE, comment="#>", collapse = TRUE}
# Simulate a character date since covican already has the dates in the correct format
covican_dates <- covican
covican_dates$data <- covican_dates$data |>
dplyr::mutate(d_birth = as.character(d_birth))
# Check class before conversion
class(covican_dates$data$d_birth)
# Check class after conversion
covican_dates <- covican_dates |>
rd_dates()
class(covican_dates$data$d_birth)
```
After this transformation, all `date` and `datetime` variables are standardized and ready for analysis in R.
### **rd_recalculate**
This function identifies calculated fields in a REDCap project, translates their calculation logic into R, recalculates the values, and compares the results with the values stored in REDCap.
It then produces a structured report that helps users detect discrepancies between REDCap’s stored calculations and the values recalculated in R.
```{r}
covican_recalc <- covican |>
rd_recalculate()
# Print recalculation results
covican_recalc$results
```
The `results` object includes:
- A summary report outlining the total number of calculated fields, how many were successfully transcribed into R, and how many showed differences between the REDCap values and the recalculated ones.
- A field-level report listing each calculated field, whether the logic was successfully converted to R, and whether the recalculated values match the originals.
You can also exclude specific fields from recalculation (e.g., complex multi-event calculations) to reduce computation time and avoid unnecessary warnings.
```{r}
# Exclude specific variables from recalculation
covican_recalc <- covican |>
rd_recalculate(exclude = c("screening_fail_crit", "resp_rate"))
covican_recalc$results
```
After running this function:
- A new variable with the suffix `_recalc` is added to the dataset, placed immediately after the original variable and containing the recalculated values.
- The data dictionary is updated with a corresponding entry, where the original variable label is extended with `"(Recalculated)"` to make these fields easy to identify.
### **rd_checkbox**
This function cleans and restructures REDCap checkbox fields. It converts the default `"Unchecked/Checked"` categories of checkbox responses created by REDCap into user-specified labels (default `"No"/"Yes"`) and renames the `varname___code` variables (original REDCap structure) to readable names based on the text of the checkbox options. Additionally, it updates the dictionary to match the new variable names. This includes choices, calculations, and branching logic.
```{r}
# Default transformation: "No"/"Yes" labels & renamed variables
cb <- covican |>
rd_checkbox()
str(cb$data$underlying_disease_hemato_acute_myeloid_leukemia)
```
For example, consider the checkbox field of the type of underlying disease present in the `covican` dataset. Originally, the variables were named `type_underlying_disease__0` and `type_underlying_disease__1`, while the option labels were ‘Haematological cancer’ and ‘Solid tumour’. After running the function, the variables are renamed to `type_underlying_disease_haematological_cancer` and `type_underlying_disease_solid_tumour`, reflecting the option text in a readable format..
To preserve the original REDCap-style names (e.g., `varname___1`, `varname___2`) instead of renaming variables based on option text:
```{r}
# use the argument checkbox_names to choose the final format of the variable names
cb <- covican |>
rd_checkbox(checkbox_names = FALSE)
str(cb$data$underlying_disease_hemato___1)
```
If a checkbox field has a branching logic, the function will not modify any values. However, you can use the `na_logic` argument, which accepts the following options:
- `"none"` (default): do not set `NA` based on branching logic during transform.
- `"missing"`: set `NA` only where the branching logic evaluation is `NA`.
- `"eval"`: set `NA` where the branching logic evaluates to `FALSE` (i.e., logic not satisfied or missing).
```{r}
cb <- covican |>
rd_checkbox(na_logic = "eval")
```
By default, checkbox factors are labeled `"No"` and `"Yes"`, but you can specify alternative labels:
```{r}
cb <- covican |>
rd_checkbox(checkbox_labels = c("Absent", "Present"))
str(cb$data$underlying_disease_hemato_acute_myeloid_leukemia)
```
### **rd_factor**
This function converts categorical variables in a REDCap dataset into R factors by replacing each original variable (numeric version) with its corresponding `.factor` version created by REDCap.
```{r}
factored <- covican |>
rd_factor()
# Checking class of the variable
str(factored$data$available_analytics)
```
If you need to keep certain variables in their raw form, you can list them in the `exclude` argument. This prevents those variables from being replaced (including their `.factor` version) while still allowing the rest of the dataset to be converted.
```{r}
factored <- covican |>
rd_factor(exclude = c("available_analytics", "urine_culture"))
# Checking class of both versions of the variable
str(covican$data$available_analytics)
str(covican$data$available_analytics.factor)
```
> Note: the function automatically excludes these system variables from conversion: `redcap_event_name`, `redcap_repeat_instrument`, `redcap_data_access_group`. These variables are retained as-is to avoid interfering with longitudinal event mappings or user access groups.
After conversion only the cleaned factor variables remain in the dataset, the original numeric version of those variables is dropped.
### **rd_dictionary**
When working with REDCap exports, the data dictionary contains field metadata, branching logic, and calculation rules written in REDCap logic. The `rd_dictionary()` function refreshes branching logic and calculations, translating them from REDCap logic into R logic, and ensures the dictionary remains consistent with the cleaned dataset.
```{r}
# Update dictionary after cleaning
dict_result <- covican |>
rd_factor() |>
rd_checkbox() |>
rd_dictionary()
```
When we transform the dictionary:
- Updates branching logic expressions so they match factor labels rather than numeric codes.
- Converts calculations and logic into R-friendly expressions.
- Reports any fields where branching logic or calculations could not be converted.
### **rd_split**
After preparing your dataset, you may want to work with only one form or one event at a time. The `rd_split()` function separates your dataset accordingly.
- **By form**
For non-longitudinal projects (or longitudinal projects with an `event_form` mapping), you can split the dataset into smaller datasets based on forms.
```{r}
forms_data <- covican |>
rd_split(by = "form")
forms_data$data
```
If repeated entries exist, you can reshape the data into wide format:
```{r}
forms_data <- covican |>
rd_split(by = "form", wide = TRUE)
```
> Note: For longitudinal projects, the column events shows the number of events in each form.
- **By event**
For longitudinal projects, you can also split the data by event. The function uses the `event_form` mapping to assign variables correctly to each event:
```{r}
events_data <- covican |>
rd_split(by = "event")
events_data$data
```
If you want to extract only one form or event, use the `which` argument:
```{r}
# Example by form
baseline_data <- covican |>
rd_split(by = "form", which = "demographics")
head(baseline_data$data)
```
### **rd_insert_na**
This is an auxiliar/bonus function that can be used to set some values of a variable(s) to missing if a certain logic is fulfilled. It can be used, for example, to insert missings on those checkboxes that do not have a branching logic, as mentioned earlier. For instance, we can transform the checkboxes with the `rd_checkbox()` function and then use this function to set the values of the checkbox _type_underlying_disease_haematological_cancer_ to missing when the age is less than 65 years old:
```{r message=FALSE, warning=FALSE, comment=NA}
cb <- covican |>
rd_checkbox()
#Before inserting missings
table(cb$data$type_underlying_disease_haematological_cancer)
#Run with this function
cb2 <- covican |>
rd_checkbox() |>
rd_insert_na(vars = "type_underlying_disease_haematological_cancer",
filter = "age < 65")
#After inserting missings
table(cb2$data$type_underlying_disease_haematological_cancer)
```
> Note that both the variable to be transformed (`age`) and the variable included in the filter (`type_underlying_disease_haematological_cancer`) are in the same event. If the variable to be transformed and the filter didn't have any event in common then the transformation would give an error. Furthermore, if the variable to be transformed was in more events than the filter, only the rows of the events in common would be converted.
### **rd_rlogic**
This is also an auxiliar/bonus function that transforms the REDCap logic into logic that can be evaluated in R. It returns both the transformed logic and the result of the evaluation of the logic in R. This function is used internally in multiple functions, for example, `rd_dictionary()`.
> This function only returns the transformed logic, so it has to be used outside the transform workflow.
Let's see how it transforms the logic of one of the calculated fields in the built-in dataset:
```{r message=FALSE, warning=FALSE, comment=NA}
logic_trans <- covican |>
rd_rlogic(logic = "if([exc_1]='1' or [inc_1]='0' or [inc_2]='0' or [inc_3]='0',1,0)",
var = "screening_fail_crit")
str(logic_trans)
```
### **rd_transform**
Alternatively, you can do all these steps at once using the `rd_transform()` function:
```{r message=FALSE, warning=FALSE, comment=NA}
covican_transformed <- rd_transform(covican)
#Print the results of the transformation
covican_transformed$results
```
Using the arguments of the function we can perform all the different type of transformations described until now.
## **Queries**
Queries are very important to ensure the accuracy and reliability of a REDCap dataset. The collected data may contain missing values, inconsistencies, or other potential errors that need to be identified in order to correct them later.
For all the following examples we will use the raw transformed data: `covican_transformed`.
### **rd_query**
The `rd_query()` function allows users to generate queries by using a specific expression. It can be used to identify missing values, values that fall outside the lower and upper limit of a variable and other types of inconsistencies.
#### *Output*
First, we will examine the output of this function. When the `rd_query()` function is executed, it returns a list that includes a data frame with all the queries identified and a second element with a summary of the number of generated queries in each specified variable for each expression applied:
```{r message=FALSE, warning=FALSE, include=FALSE}
example <- rd_query(covican_transformed,
variables = "copd",
expression = "is.na(x)")
```
```{r echo=FALSE, message=FALSE, warning=FALSE, comment=NA}
kable(head(example$queries)) |>
kableExtra::row_spec(0, bold = TRUE) |>
kableExtra::kable_styling()
example$results
```
The data frame is designed to aid users in locating each query in their REDCap project. It includes information such as the record identifier, the Data Access Group (DAG), the event in which each query can be found, along with the name and the description of the analyzed variable and a brief description of the query.
Let's see some examples of the usability of the function in generating different types of queries.
#### *Missings*
If we want to identify missing values in the variables _copd_ and _age_ in the raw transformed data, a list of required arguments needs to be supplied. We must use the `variables` argument to specify the variables from the database that will be examined and the `expression` argument to describe the expression that will be applied to those variables, in this case 'is.na(x)' to detect missing values where x represents the variable itself. Additionaly, we must use the `data` and `dic` arguments to indicate the R objects containing the REDCap data and dictionary, respectively. If the REDCap project presents a longitudinal design, we should also specify the event in which the described variables are present through the use of the `event` argument:
```{r echo=TRUE, message=FALSE, warning=FALSE, comment=NA}
example <- rd_query(covican_transformed,
variables = c("copd", "age"),
expression = c("is.na(x)", "is.na(x)"),
event = "baseline_visit_arm_1")
# Printing results
example$results
```
In this case, we can observe that there are 6 missing values in the _copd_ variable and 5 missing values in _age_.
#### *Missings of variables with a branching logic*
Another example is when we try to identify missing values in variables where a branching logic is employed. In this scenario, when the conditions of the branching logic are not satisfied, by definition, all of the values should be missing and thus queries for this specific missing values (conditions not met) should not be reported. To adress this, the function, when working with raw data, follows a two-step process. Firstly, it transforms the branching logic associated with the specified variable. Then, it applies this transformed logic during the query generation process. However, if the dataset has already been transformed using the `rd_transform()` function beforehand, the function will automatically apply the previously transformed branching logic.
In both scenarios, if a variable contains branching logic that cannot be converted from REDCap logic to R logic, the function will issue a warning. The warning message will advise the user to review the `results` element of the output for more information. This is to indicate that there might be potential issues or limitations with the conversion process for that specific variable's branching logic:
```{r message=FALSE, warning=FALSE, include=FALSE}
covican_transformed$dictionary$branching_logic_show_field_only_if[covican_transformed$dictionary$field_name %in% "potassium"] <- "[available_analytics][current-instance]=‘1’"
```
```{r message=FALSE, warning=TRUE, comment=NA}
example <- rd_query(covican_transformed,
variables = c("age", "copd", "potassium"),
expression = c("is.na(x)", "is.na(x)", "is.na(x)"),
event = "baseline_visit_arm_1")
# Printing results
example$results
```
Based on the information provided, in addition to the missing values of the _age_ and _copd_ variables already identified, there are 31 missing values in the _potassium_ variable. The branching logic associated with this variable, `[available_analytics][current-instance]=‘1’`, contains a smart variable `[current-instance]`, which cannot be directly transformed into R logic.
To address this issue and correctly identify missing values only when _available_analytics_ has the value _1_, the filter argument can be utilized. By specifying the condition within the branching logic, you can ensure that the filtering process fulfills this condition.
It is worth noting that during the transformation process, the value _1_ of the _available_analytics_ variable was changed to _Yes_ due to it being a factor. Therefore, when implementing the filter, you need to consider this transformed value rather than the original one. This ensures that the condition is accurately applied and missing values are appropriately identified based on the desired criteria.
```{r message=FALSE, warning=TRUE, comment=NA}
example <- rd_query(covican_transformed,
variables = c("potassium"),
expression = c("is.na(x)"),
event = "baseline_visit_arm_1",
filter = c("available_analytics=='Yes'"))
# Printing results
example$results
```
The total number of missing values changes when we use the `filter` argument, the variable _potassium_ now presents 21 missing values instead of the previous 31 cases identified. This means that we were identifying 10 missing values in which _available_analytics_ did not have the value _Yes_ and, therefore, should not be considered as missing values.
> Note: The `filter` argument is treated as a vector, which means that we can add a filter to each specified variable. Also, even if this argument is used to apply the branching logic condition, the warning about the presence of unconverted branching logic will still be displayed. In this specific case, you can safely ignore this warning.
#### *Expressions*
Up until this point, we have illustrated examples where the expression applied is used to detect missing values. But, as previously mentioned, the `rd_query()` function is also able to identify outliers or observations that fulfill a specific condition. Hence, to identify, for example, all the observations where _age_ is greater than 70, we should use the `expression` argument again specifying 'x>70':
```{r message=FALSE, warning=TRUE, comment=NA}
example <- rd_query(variables="age",
expression="x>70",
event="baseline_visit_arm_1",
dic=covican_transformed$dictionary,
data=covican_transformed$data)
# Printing results
example$results
```
We can add other variables with other specific expressions in the same function because it is designed to treat the arguments `variables` and `expression` as vectors, so that the element at position _n_ of `expression` is applied to the element at position _n_ of `variables`.
For example, if we want to identify all the observations where _age_ is greater than 70 and all the observations where _copd_ is 'Yes' we shall use:
```{r message=FALSE, warning=TRUE, comment=NA}
example <- rd_query(covican_transformed,
variables=c("age", "copd"),
expression=c("x > 70", "x == 'Yes'"),
event="baseline_visit_arm_1")
# Printing results
example$results
```
In a more complex scenario, for example, to identify all the observations where _age_ is greater than 70, less than 80, or it is a missing value we shall use the following expression:
```{r message=FALSE, warning=TRUE, comment=NA}
example <- rd_query(covican_transformed,
variables="age",
expression="(x>70 & x<80) | is.na(x)",
event="baseline_visit_arm_1")
# Printing results
example$results
```
#### *Special cases*
Same expression for all variables
In order to evaluate the same expression for all variables, the user should supply just a single element for `expression`:
```{r message=FALSE, warning=TRUE, comment=NA}
example <- rd_query(covican_transformed,
variables = c("copd","age","dm"),
expression = "is.na(x)",
event = "baseline_visit_arm_1")
# Printing results
example$results
```
The function issues a warning every time the same expression is applied to all variables to ensure that the user did not make a mistake when providing the information for each argument.
Not defining an event
Another special case is when the data analysed corresponds to a REDCap longitudinal project, but the event argument of the function is not defined.
There are two possibilities here: