Split and unite are complementary functions for manipulating dataframes in R. They are designed for summarised_result objects (see the R package omopgenerics), but they also support dataframes of other classes.
First, let’s load the relevant libraries and generate a mock summarised_result object to use in the following examples.
library(visOmopResults)
library(dplyr)
mock_sr <- mockSummarisedResult()
mock_sr |> glimpse()
#> Rows: 126
#> Columns: 13
#> $ result_id <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
#> $ cdm_name <chr> "mock", "mock", "mock", "mock", "mock", "mock", "mock…
#> $ group_name <chr> "cohort_name", "cohort_name", "cohort_name", "cohort_…
#> $ group_level <chr> "cohort1", "cohort1", "cohort1", "cohort1", "cohort1"…
#> $ strata_name <chr> "overall", "age_group &&& sex", "age_group &&& sex", …
#> $ strata_level <chr> "overall", "<40 &&& Male", ">=40 &&& Male", "<40 &&& …
#> $ variable_name <chr> "number subjects", "number subjects", "number subject…
#> $ variable_level <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
#> $ estimate_name <chr> "count", "count", "count", "count", "count", "count",…
#> $ estimate_type <chr> "integer", "integer", "integer", "integer", "integer"…
#> $ estimate_value <chr> "4919829", "9611305", "6176201", "4600876", "1033323"…
#> $ additional_name <chr> "overall", "overall", "overall", "overall", "overall"…
#> $ additional_level <chr> "overall", "overall", "overall", "overall", "overall"…
A summarised_result contains three types of name-level paired columns, which are targeted by the split and unite functions: the group columns, which typically contain information about cohorts; the strata columns, which hold the stratifications applied to each group; and the additional columns, which hold further information not covered by group and strata.
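To see which names each of the three pairs holds before splitting anything, one option is to list the distinct values of the "name" columns. This is a minimal sketch that only uses `mockSummarisedResult()` from this document and `distinct()` from dplyr; the variable `name_pairs` is introduced here for illustration.

```r
library(visOmopResults)
library(dplyr)

mock_sr <- mockSummarisedResult()

# Distinct combinations of the three "name" columns (group, strata, additional)
name_pairs <- mock_sr |>
  distinct(group_name, strata_name, additional_name)

name_pairs
```

Each distinct value (or each piece of a value joined by "&&&") is a candidate column name once the corresponding split function is applied.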
The idea of the split functions is to pivot the “name” column (e.g. group_name): each value of that column becomes a column in the dataframe, whose values are taken from the “level” column (e.g. group_level).
For instance, the splitGroup function will target the group_name-group_level columns, as seen below.
mock_sr |> splitGroup() |> glimpse()
#> Rows: 126
#> Columns: 12
#> $ result_id <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
#> $ cdm_name <chr> "mock", "mock", "mock", "mock", "mock", "mock", "mock…
#> $ cohort_name <chr> "cohort1", "cohort1", "cohort1", "cohort1", "cohort1"…
#> $ strata_name <chr> "overall", "age_group &&& sex", "age_group &&& sex", …
#> $ strata_level <chr> "overall", "<40 &&& Male", ">=40 &&& Male", "<40 &&& …
#> $ variable_name <chr> "number subjects", "number subjects", "number subject…
#> $ variable_level <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
#> $ estimate_name <chr> "count", "count", "count", "count", "count", "count",…
#> $ estimate_type <chr> "integer", "integer", "integer", "integer", "integer"…
#> $ estimate_value <chr> "4919829", "9611305", "6176201", "4600876", "1033323"…
#> $ additional_name <chr> "overall", "overall", "overall", "overall", "overall"…
#> $ additional_level <chr> "overall", "overall", "overall", "overall", "overall"…
Similarly, the function splitStrata will split the strata_name and strata_level columns, while splitAdditional will split the additional name-level pair. Finally, the function splitAll will split group, strata, and additional at once.
Note that after using splitStrata on our summarised_result object, we no longer have a strata_name-strata_level pair; instead, we have two new columns corresponding to the stratifications, age_group and sex.
mock_sr |> splitStrata() |> glimpse()
#> Rows: 126
#> Columns: 13
#> $ result_id <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
#> $ cdm_name <chr> "mock", "mock", "mock", "mock", "mock", "mock", "mock…
#> $ group_name <chr> "cohort_name", "cohort_name", "cohort_name", "cohort_…
#> $ group_level <chr> "cohort1", "cohort1", "cohort1", "cohort1", "cohort1"…
#> $ age_group <chr> "overall", "<40", ">=40", "<40", ">=40", "overall", "…
#> $ sex <chr> "overall", "Male", "Male", "Female", "Female", "Male"…
#> $ variable_name <chr> "number subjects", "number subjects", "number subject…
#> $ variable_level <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
#> $ estimate_name <chr> "count", "count", "count", "count", "count", "count",…
#> $ estimate_type <chr> "integer", "integer", "integer", "integer", "integer"…
#> $ estimate_value <chr> "4919829", "9611305", "6176201", "4600876", "1033323"…
#> $ additional_name <chr> "overall", "overall", "overall", "overall", "overall"…
#> $ additional_level <chr> "overall", "overall", "overall", "overall", "overall"…
mock_sr |> splitAdditional() |> glimpse()
#> Rows: 126
#> Columns: 11
#> $ result_id <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
#> $ cdm_name <chr> "mock", "mock", "mock", "mock", "mock", "mock", "mock",…
#> $ group_name <chr> "cohort_name", "cohort_name", "cohort_name", "cohort_na…
#> $ group_level <chr> "cohort1", "cohort1", "cohort1", "cohort1", "cohort1", …
#> $ strata_name <chr> "overall", "age_group &&& sex", "age_group &&& sex", "a…
#> $ strata_level <chr> "overall", "<40 &&& Male", ">=40 &&& Male", "<40 &&& Fe…
#> $ variable_name <chr> "number subjects", "number subjects", "number subjects"…
#> $ variable_level <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
#> $ estimate_name <chr> "count", "count", "count", "count", "count", "count", "…
#> $ estimate_type <chr> "integer", "integer", "integer", "integer", "integer", …
#> $ estimate_value <chr> "4919829", "9611305", "6176201", "4600876", "1033323", …
mock_sr |> splitAll() |> glimpse()
#> Rows: 126
#> Columns: 10
#> $ result_id <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
#> $ cdm_name <chr> "mock", "mock", "mock", "mock", "mock", "mock", "mock",…
#> $ cohort_name <chr> "cohort1", "cohort1", "cohort1", "cohort1", "cohort1", …
#> $ age_group <chr> "overall", "<40", ">=40", "<40", ">=40", "overall", "ov…
#> $ sex <chr> "overall", "Male", "Male", "Female", "Female", "Male", …
#> $ variable_name <chr> "number subjects", "number subjects", "number subjects"…
#> $ variable_level <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
#> $ estimate_name <chr> "count", "count", "count", "count", "count", "count", "…
#> $ estimate_type <chr> "integer", "integer", "integer", "integer", "integer", …
#> $ estimate_value <chr> "4919829", "9611305", "6176201", "4600876", "1033323", …
Looking at the results above, observe that the splitting was done not only by the values in the “name” column, but also within values containing the separator “&&&”. That is, “age_group &&& sex” was split into the age_group and sex columns, instead of generating a single column called “age_group &&& sex”.
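The pairing on “&&&” can be sketched in base R for a single name-level pair. This is an illustration of the idea only, not the package's implementation; the helper `split_pair()` is hypothetical and is not part of visOmopResults.

```r
# Hypothetical helper (not from visOmopResults): split one name-level pair
# on the " &&& " separator and pair each name with its level.
split_pair <- function(name, level, sep = " &&& ") {
  names_vec  <- strsplit(name, sep, fixed = TRUE)[[1]]
  levels_vec <- strsplit(level, sep, fixed = TRUE)[[1]]
  # A valid pair has as many names as levels
  stopifnot(length(names_vec) == length(levels_vec))
  # Named character vector: one element per resulting column
  stats::setNames(levels_vec, names_vec)
}

pair <- split_pair("age_group &&& sex", "<40 &&& Male")
pair
```

Here `pair` has one element named age_group and one named sex, which mirrors the two columns that splitStrata creates for that row.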
The function splitNameLevel provides a more tailored splitting of the dataframe. This function can take any dataframe, with no restrictions on the naming of the name-level pair columns, since these can be specified in the name and level arguments.
For instance, let’s use it on the following table:
data_to_split <- tibble(
  denominator = "general_population",
  outcome = "stroke",
  input_arguments = c("wash_out &&& previous_observation"),
  input_arguments_values = c("60 &&& 180")
)
data_to_split
#> # A tibble: 1 × 4
#> denominator outcome input_arguments input_arguments_values
#> <chr> <chr> <chr> <chr>
#> 1 general_population stroke wash_out &&& previous_obser… 60 &&& 180
data_to_split |>
  splitNameLevel(
    name = "input_arguments",
    level = "input_arguments_values"
  )
#> # A tibble: 1 × 4
#> denominator outcome wash_out previous_observation
#> <chr> <chr> <chr> <chr>
#> 1 general_population stroke 60 180
The function splitNameLevel, in addition to the overall argument, has the keep argument, which sets whether to keep the original name-level columns after the splitting.
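As a hedged illustration of keep, the call below repeats the splitNameLevel example with keep = TRUE. It assumes the version of visOmopResults used in this document, where splitNameLevel is exported with a keep argument; the variable `kept` is introduced here for illustration, and no output is shown because the column order may differ between versions.

```r
library(visOmopResults)
library(dplyr)

data_to_split <- tibble(
  denominator = "general_population",
  outcome = "stroke",
  input_arguments = "wash_out &&& previous_observation",
  input_arguments_values = "60 &&& 180"
)

# keep = TRUE is expected to retain the original name-level columns
# alongside the newly created ones
kept <- data_to_split |>
  splitNameLevel(
    name = "input_arguments",
    level = "input_arguments_values",
    keep = TRUE
  )

colnames(kept)
```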
The unite functions are the complement of the split ones. They generate name-level pair columns from targeted columns within a dataframe.
To work with summarised_result objects, we have the uniteGroup, uniteStrata, and uniteAdditional functions, which generate the group, strata, and additional name-level columns, respectively, from a given set of columns. For instance, in the following example we want to create the group_name and group_level columns:
to_unite_group <- tibble(
  denominator_cohort_name = c("general_population", "older_than_60", "younger_than_60"),
  outcome_cohort_name = c("stroke", "stroke", "stroke")
)
to_unite_group |>
  uniteGroup(cols = c("denominator_cohort_name", "outcome_cohort_name"))
#> # A tibble: 3 × 2
#> group_name group_level
#> <chr> <chr>
#> 1 denominator_cohort_name &&& outcome_cohort_name general_population &&& stroke
#> 2 denominator_cohort_name &&& outcome_cohort_name older_than_60 &&& stroke
#> 3 denominator_cohort_name &&& outcome_cohort_name younger_than_60 &&& stroke
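Because unite and split are complementary, piping the united table into splitGroup should recover the original columns. The sketch below checks this round trip, assuming that splitGroup accepts a plain tibble containing only group_name and group_level (the functions are described above as supporting dataframes of other classes); the variable `round_trip` is introduced here for illustration.

```r
library(visOmopResults)
library(dplyr)

to_unite_group <- tibble(
  denominator_cohort_name = c("general_population", "older_than_60", "younger_than_60"),
  outcome_cohort_name = c("stroke", "stroke", "stroke")
)

# Unite the two columns into group_name/group_level, then split them back
round_trip <- to_unite_group |>
  uniteGroup(cols = c("denominator_cohort_name", "outcome_cohort_name")) |>
  splitGroup()

colnames(round_trip)
```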
Apart from the cols argument, which sets the columns to unite, there is the ignore argument, which defaults to ignore = c(NA, "overall"). Levels listed in ignore are left out when building the name-level pair. For example, if we do not ignore any level here, NA values appear in the output:
to_unite_strata <- tibble(
  age = c(NA, ">40", "<=40", NA, NA, NA, NA, NA, ">40", "<=40"),
  sex = c(NA, NA, NA, "F", "M", NA, NA, NA, "F", "M"),
  region = c(NA, NA, NA, NA, NA, "North", "South", "Center", NA, NA)
)
to_unite_strata |>
  uniteStrata(
    cols = c("age", "sex", "region"),
    ignore = character()
  )
#> # A tibble: 10 × 2
#> strata_name strata_level
#> <chr> <chr>
#> 1 age &&& sex &&& region NA &&& NA &&& NA
#> 2 age &&& sex &&& region >40 &&& NA &&& NA
#> 3 age &&& sex &&& region <=40 &&& NA &&& NA
#> 4 age &&& sex &&& region NA &&& F &&& NA
#> 5 age &&& sex &&& region NA &&& M &&& NA
#> 6 age &&& sex &&& region NA &&& NA &&& North
#> 7 age &&& sex &&& region NA &&& NA &&& South
#> 8 age &&& sex &&& region NA &&& NA &&& Center
#> 9 age &&& sex &&& region >40 &&& F &&& NA
#> 10 age &&& sex &&& region <=40 &&& M &&& NA
With the default (ignore = c(NA, "overall")), only the names and levels of non-NA values are returned, and rows where all values are NA are set to “overall”.
to_unite_strata |>
  uniteStrata(cols = c("age", "sex", "region"))
#> # A tibble: 10 × 2
#> strata_name strata_level
#> <chr> <chr>
#> 1 overall overall
#> 2 age >40
#> 3 age <=40
#> 4 sex F
#> 5 sex M
#> 6 region North
#> 7 region South
#> 8 region Center
#> 9 age &&& sex >40 &&& F
#> 10 age &&& sex <=40 &&& M
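The effect of ignore on a single row can be mimicked in base R. The helper `unite_row()` below is hypothetical and is not part of visOmopResults; it is a sketch of the behaviour shown in the two outputs above, not the package's implementation.

```r
# Hypothetical helper (not from visOmopResults): build the name-level pair
# for one row, leaving out values listed in `ignore`.
unite_row <- function(values, ignore = c(NA, "overall"), sep = " &&& ") {
  keep <- !(values %in% ignore)  # %in% also matches NA against NA
  if (!any(keep)) {
    # Nothing left to unite: fall back to "overall"
    return(c(name = "overall", level = "overall"))
  }
  c(
    name  = paste(names(values)[keep], collapse = sep),
    level = paste(values[keep], collapse = sep)
  )
}

row_all_na <- unite_row(c(age = NA_character_, sex = NA_character_, region = NA_character_))
row_mixed  <- unite_row(c(age = ">40", sex = "F", region = NA_character_))
row_no_ignore <- unite_row(
  c(age = ">40", sex = "F", region = NA_character_),
  ignore = character()
)
```

With these inputs, `row_all_na` is the "overall" pair, `row_mixed` unites only age and sex, and `row_no_ignore` keeps all three names with a literal "NA" level for region.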
Lastly, the function uniteNameLevel, like splitNameLevel, provides more flexibility in naming the name-level columns, in addition to the keep argument (FALSE by default), which sets whether to keep the targeted columns. For instance, if we repeat the previous example with keep set to TRUE, we obtain the following output:
to_unite_strata |>
  uniteNameLevel(
    cols = c("age", "sex", "region"),
    name = "name",
    level = "level",
    keep = TRUE
  )
#> # A tibble: 10 × 5
#> age sex region name level
#> <chr> <chr> <chr> <chr> <chr>
#> 1 <NA> <NA> <NA> overall overall
#> 2 >40 <NA> <NA> age >40
#> 3 <=40 <NA> <NA> age <=40
#> 4 <NA> F <NA> sex F
#> 5 <NA> M <NA> sex M
#> 6 <NA> <NA> North region North
#> 7 <NA> <NA> South region South
#> 8 <NA> <NA> Center region Center
#> 9 >40 F <NA> age &&& sex >40 &&& F
#> 10 <=40 M <NA> age &&& sex <=40 &&& M