1.1 Introduction

The OMOP CDM is a person-centric model. The person table contains records that uniquely identify each individual along with some of their demographic information. Below we create a mock CDM reference which, as is standard, has a person table with fields indicating an individual’s date of birth, gender, race, and ethnicity. Each of these, except for date of birth, is represented by a concept ID (and, as the person table contains one record per person, these fields are treated as time-invariant).

library(PatientProfiles)
library(duckdb)
library(dplyr)

cdm <- mockPatientProfiles(numberIndividuals = 10000)

cdm$person %>%
  dplyr::glimpse()
## Rows: ??
## Columns: 5
## Database: DuckDB v1.0.0 [root@Darwin 23.4.0:R 4.4.1/:memory:]
## $ person_id            <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15…
## $ gender_concept_id    <dbl> 8507, 8507, 8532, 8532, 8507, 8507, 8532, 8507, 8…
## $ year_of_birth        <int> 2003, 1998, 1918, 1946, 1920, 1964, 1909, 1938, 1…
## $ race_concept_id      <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ ethnicity_concept_id <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…

As well as the person table, every CDM reference will include an observation period table. This table contains spans of time during which an individual is considered to be under observation. Individuals can have multiple observation periods, but these cannot overlap.

cdm$observation_period %>%
  dplyr::glimpse()
## Rows: ??
## Columns: 5
## Database: DuckDB v1.0.0 [root@Darwin 23.4.0:R 4.4.1/:memory:]
## $ person_id                     <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 1…
## $ observation_period_start_date <date> 2003-01-01, 1998-01-01, 1918-01-01, 194…
## $ observation_period_end_date   <date> 2133-06-05, 2080-07-25, 2114-02-20, 200…
## $ period_type_concept_id        <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ observation_period_id         <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 1…
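
The requirement that observation periods do not overlap can be checked directly. The following is a minimal base R sketch on a small, hypothetical in-memory table (not the mock CDM above); the helper name hasOverlappingPeriods() is ours and is not part of PatientProfiles.

```r
# Hypothetical observation periods: person 1 has two separate periods
observation_period <- data.frame(
  person_id = c(1L, 1L, 2L),
  observation_period_start_date = as.Date(c("2000-01-01", "2005-01-01", "2010-01-01")),
  observation_period_end_date = as.Date(c("2003-12-31", "2009-12-31", "2015-12-31"))
)

# Hypothetical helper: TRUE if, for any person, a period starts on or before
# the end of that person's previous period
hasOverlappingPeriods <- function(op) {
  op <- op[order(op$person_id, op$observation_period_start_date), ]
  n <- nrow(op)
  if (n < 2) {
    return(FALSE)
  }
  samePerson <- op$person_id[-1] == op$person_id[-n]
  startsBeforePreviousEnd <-
    op$observation_period_start_date[-1] <= op$observation_period_end_date[-n]
  any(samePerson & startsBeforePreviousEnd)
}

hasOverlappingPeriods(observation_period)
```

Sorting by person and start date means only consecutive rows need to be compared: if any two periods of a person overlap, then so do two adjacent ones.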

When performing analyses we will often be interested in working with the person and observation period tables to identify individuals’ characteristics on some date of interest. PatientProfiles provides a number of functions that can help us do this.

1.2 Adding characteristics to OMOP CDM tables

Let’s say we’re working with the condition occurrence table.

cdm$condition_occurrence %>%
  glimpse()
## Rows: ??
## Columns: 6
## Database: DuckDB v1.0.0 [root@Darwin 23.4.0:R 4.4.1/:memory:]
## $ person_id                 <int> 622, 2942, 5923, 8963, 5257, 711, 5682, 9304…
## $ condition_start_date      <date> 1952-12-01, 1981-05-09, 2017-11-23, 2031-12…
## $ condition_end_date        <date> 2001-09-02, 1983-09-10, 2096-03-22, 2098-09…
## $ condition_occurrence_id   <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 1…
## $ condition_concept_id      <int> 10, 5, 1, 6, 1, 10, 1, 4, 9, 9, 5, 5, 7, 1, …
## $ condition_type_concept_id <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…

This table contains diagnoses of individuals and we might, for example, want to identify their age on their date of diagnosis. This involves linking back to the person table which contains their date of birth (split across three different columns). PatientProfiles provides a simple function for this. addAge() will add a new column to the table containing each patient’s age relative to the specified index date.

cdm$condition_occurrence <- cdm$condition_occurrence %>%
  addAge(indexDate = "condition_start_date")

cdm$condition_occurrence %>%
  glimpse()
## Rows: ??
## Columns: 7
## Database: DuckDB v1.0.0 [root@Darwin 23.4.0:R 4.4.1/:memory:]
## $ person_id                 <int> 622, 5923, 8963, 5257, 711, 5682, 9304, 6332…
## $ condition_start_date      <date> 1952-12-01, 2017-11-23, 2031-12-15, 1984-05…
## $ condition_end_date        <date> 2001-09-02, 2096-03-22, 2098-09-15, 1994-03…
## $ condition_occurrence_id   <int> 1, 3, 4, 5, 6, 7, 8, 9, 11, 12, 13, 14, 15, …
## $ condition_concept_id      <int> 10, 1, 6, 1, 10, 1, 4, 9, 5, 5, 7, 1, 9, 5, …
## $ condition_type_concept_id <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ age                       <int> 9, 69, 47, 34, 43, 113, 60, 141, 31, 46, 119…
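
To make the idea of age relative to an index date concrete, here is a minimal base R sketch that counts completed years between a date of birth and an index date. The helper ageInYears() and the dates used are hypothetical; this illustrates the concept and is not necessarily how addAge() computes age internally.

```r
# Hypothetical helper: completed years between date of birth and index date
ageInYears <- function(dateOfBirth, indexDate) {
  birth <- as.POSIXlt(dateOfBirth)
  index <- as.POSIXlt(indexDate)
  yearDifference <- index$year - birth$year
  # one year less if the birthday has not yet been reached in the index year
  birthdayPending <- (index$mon < birth$mon) |
    (index$mon == birth$mon & index$mday < birth$mday)
  as.integer(yearDifference - birthdayPending)
}

# birthday already passed in the index year
ageInYears(as.Date("1943-06-15"), as.Date("1952-12-01"))
# index date one day before the birthday
ageInYears(as.Date("1943-06-15"), as.Date("1952-06-14"))
```

The first call gives 9 and the second 8, as the ninth birthday has not yet been reached on 14 June 1952.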

As well as calculating age, we can also create age groups at the same time. Here we create three age groups: those aged 0 to 17, those 18 to 65, and those 66 or older.

cdm$condition_occurrence <- cdm$condition_occurrence %>%
  addAge(
    indexDate = "condition_start_date",
    ageGroup = list(
      "0 to 17" = c(0, 17),
      "18 to 65" = c(18, 65),
      ">= 66" = c(66, Inf)
    )
  )

cdm$condition_occurrence %>%
  glimpse()
## Rows: ??
## Columns: 8
## Database: DuckDB v1.0.0 [root@Darwin 23.4.0:R 4.4.1/:memory:]
## $ person_id                 <int> 622, 5923, 8963, 5257, 711, 5682, 9304, 6332…
## $ condition_start_date      <date> 1952-12-01, 2017-11-23, 2031-12-15, 1984-05…
## $ condition_end_date        <date> 2001-09-02, 2096-03-22, 2098-09-15, 1994-03…
## $ condition_occurrence_id   <int> 1, 3, 4, 5, 6, 7, 8, 9, 11, 12, 13, 14, 15, …
## $ condition_concept_id      <int> 10, 1, 6, 1, 10, 1, 4, 9, 5, 5, 7, 1, 9, 5, …
## $ condition_type_concept_id <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ age                       <int> 9, 69, 47, 34, 43, 113, 60, 141, 31, 46, 119…
## $ age_group                 <chr> "0 to 17", ">= 66", "18 to 65", "18 to 65", …
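
The age group definition is a named list of lower and upper bounds, with both bounds included in the group. A rough base R sketch of how such a list can be applied to a vector of ages is given below. The helper assignAgeGroup() is hypothetical, and leaving ages outside every range as NA is our assumption; how the package labels such ages is not shown in this chapter.

```r
# Hypothetical helper: label each age using closed [lower, upper] ranges
assignAgeGroup <- function(age, ageGroup) {
  result <- rep(NA_character_, length(age))
  for (label in names(ageGroup)) {
    bounds <- ageGroup[[label]]
    inGroup <- !is.na(age) & age >= bounds[1] & age <= bounds[2]
    result[inGroup] <- label
  }
  result
}

ageGroup <- list(
  "0 to 17" = c(0, 17),
  "18 to 65" = c(18, 65),
  ">= 66" = c(66, Inf)
)

# includes the boundary ages 17, 18, 65 and 66
assignAgeGroup(c(9, 69, 47, 17, 18, 65, 66), ageGroup)
```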

By default, when adding age the new column will be called “age” and will be calculated using all available information on date of birth contained in the person table. We can, however, change these defaults. Here, for example, we impose that month of birth is January and day of birth is the 1st for all individuals.

cdm$condition_occurrence <- cdm$condition_occurrence %>%
  addAge(
    indexDate = "condition_start_date",
    ageName = "age_from_year_of_birth",
    ageMissingMonth = 1,
    ageMissingDay = 1,
    ageImposeMonth = TRUE,
    ageImposeDay = TRUE
  )

cdm$condition_occurrence %>%
  glimpse()
## Rows: ??
## Columns: 9
## Database: DuckDB v1.0.0 [root@Darwin 23.4.0:R 4.4.1/:memory:]
## $ person_id                 <int> 622, 5923, 8963, 5257, 711, 5682, 9304, 6332…
## $ condition_start_date      <date> 1952-12-01, 2017-11-23, 2031-12-15, 1984-05…
## $ condition_end_date        <date> 2001-09-02, 2096-03-22, 2098-09-15, 1994-03…
## $ condition_occurrence_id   <int> 1, 3, 4, 5, 6, 7, 8, 9, 11, 12, 13, 14, 15, …
## $ condition_concept_id      <int> 10, 1, 6, 1, 10, 1, 4, 9, 5, 5, 7, 1, 9, 5, …
## $ condition_type_concept_id <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ age                       <int> 9, 69, 47, 34, 43, 113, 60, 141, 31, 46, 119…
## $ age_group                 <chr> "0 to 17", ">= 66", "18 to 65", "18 to 65", …
## $ age_from_year_of_birth    <int> 9, 69, 47, 34, 43, 113, 60, 141, 31, 46, 119…
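
When month and day of birth are imposed as 1 January, every index date falls on or after that year's imposed birthday, so age reduces to the index year minus the year of birth. A minimal base R sketch with hypothetical values (not taken from the mock CDM):

```r
# Hypothetical years of birth and index dates
yearOfBirth <- c(1943L, 1948L)
indexDate <- as.Date(c("1952-12-01", "2017-11-23"))

# With birthdays fixed at 1 January, age is a simple difference in years
indexYear <- as.integer(format(indexDate, "%Y"))
ageFromYearOfBirth <- indexYear - yearOfBirth
ageFromYearOfBirth
```

This may also explain why age and age_from_year_of_birth agree in the output above: the mock person table shown earlier only has a year_of_birth column.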

As well as age at diagnosis, we might also want to identify patients’ sex. PatientProfiles provides the addSex() function that will add this for us. Because sex is treated as time-invariant, we do not have to specify an index date.

cdm$condition_occurrence <- cdm$condition_occurrence %>%
  addSex()

cdm$condition_occurrence %>%
  glimpse()
## Rows: ??
## Columns: 10
## Database: DuckDB v1.0.0 [root@Darwin 23.4.0:R 4.4.1/:memory:]
## $ person_id                 <int> 622, 5923, 8963, 5257, 711, 5682, 9304, 6332…
## $ condition_start_date      <date> 1952-12-01, 2017-11-23, 2031-12-15, 1984-05…
## $ condition_end_date        <date> 2001-09-02, 2096-03-22, 2098-09-15, 1994-03…
## $ condition_occurrence_id   <int> 1, 3, 4, 5, 6, 7, 8, 9, 11, 12, 13, 14, 15, …
## $ condition_concept_id      <int> 10, 1, 6, 1, 10, 1, 4, 9, 5, 5, 7, 1, 9, 5, …
## $ condition_type_concept_id <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ age                       <int> 9, 69, 47, 34, 43, 113, 60, 141, 31, 46, 119…
## $ age_group                 <chr> "0 to 17", ">= 66", "18 to 65", "18 to 65", …
## $ age_from_year_of_birth    <int> 9, 69, 47, 34, 43, 113, 60, 141, 31, 46, 119…
## $ sex                       <chr> "Male", "Male", "Male", "Female", "Female", …
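
Conceptually, sex is derived from gender_concept_id in the person table, where the standard OMOP concept IDs 8507 and 8532 denote male and female respectively. A minimal base R sketch of such a mapping follows; the helper sexFromConcept() is hypothetical, and returning NA for any other concept ID is our assumption rather than documented package behaviour.

```r
# Hypothetical helper: map OMOP gender concept IDs to readable labels
sexFromConcept <- function(genderConceptId) {
  sex <- rep(NA_character_, length(genderConceptId))
  sex[genderConceptId %in% 8507] <- "Male"
  sex[genderConceptId %in% 8532] <- "Female"
  sex
}

sexFromConcept(c(8507, 8507, 8532, 0))
```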

Similarly, we could also identify whether an individual was in observation at the time of their diagnosis (i.e. had an observation period that overlaps with their diagnosis date), as well as identifying how much prior observation time they had on this date and how much they have following it.

cdm$condition_occurrence <- cdm$condition_occurrence %>%
  addInObservation(indexDate = "condition_start_date") %>%
  addPriorObservation(indexDate = "condition_start_date") %>%
  addFutureObservation(indexDate = "condition_start_date")

cdm$condition_occurrence %>%
  glimpse()
## Rows: ??
## Columns: 13
## Database: DuckDB v1.0.0 [root@Darwin 23.4.0:R 4.4.1/:memory:]
## $ person_id                 <int> 622, 5923, 8963, 5257, 711, 5682, 9304, 6332…
## $ condition_start_date      <date> 1952-12-01, 2017-11-23, 2031-12-15, 1984-05…
## $ condition_end_date        <date> 2001-09-02, 2096-03-22, 2098-09-15, 1994-03…
## $ condition_occurrence_id   <int> 1, 3, 4, 5, 6, 7, 8, 9, 11, 12, 13, 14, 15, …
## $ condition_concept_id      <int> 10, 1, 6, 1, 10, 1, 4, 9, 5, 5, 7, 1, 9, 5, …
## $ condition_type_concept_id <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ age                       <int> 9, 69, 47, 34, 43, 113, 60, 141, 31, 46, 119…
## $ age_group                 <chr> "0 to 17", ">= 66", "18 to 65", "18 to 65", …
## $ age_from_year_of_birth    <int> 9, 69, 47, 34, 43, 113, 60, 141, 31, 46, 119…
## $ sex                       <chr> "Male", "Male", "Male", "Female", "Female", …
## $ in_observation            <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ prior_observation         <int> 3622, 25529, 17515, 12562, 15776, 41529, 220…
## $ future_observation        <int> 36734, 35706, 36246, 8044, 17777, 26181, 909…

For these functions, which work with information from the observation period table, it is important to note that the results will be based on the observation period in which the index date falls. Moreover, if a patient is not under observation on the specified date, addPriorObservation() and addFutureObservation() will return NA.
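
The logic described here can be sketched in base R for a single observation period per row. The helper observationAroundIndex() and the dates are hypothetical. We assume prior observation is the number of days from the start of the period to the index date, and future observation the number of days from the index date to the end of the period.

```r
# Hypothetical helper: observation status and time around an index date,
# given one candidate observation period per row
observationAroundIndex <- function(indexDate, periodStart, periodEnd) {
  inObservation <- indexDate >= periodStart & indexDate <= periodEnd
  data.frame(
    in_observation = as.integer(inObservation),
    # days are only reported when the index date falls inside the period
    prior_observation = ifelse(
      inObservation, as.integer(indexDate - periodStart), NA_integer_
    ),
    future_observation = ifelse(
      inObservation, as.integer(periodEnd - indexDate), NA_integer_
    )
  )
}

observationAroundIndex(
  indexDate = as.Date(c("1952-12-01", "1990-01-01")),
  periodStart = as.Date(c("1943-01-01", "2000-01-01")),
  periodEnd = as.Date(c("1960-12-01", "2010-12-31"))
)
```

The first row is in observation, with 3622 days of prior and 2922 days of future observation. The second index date precedes its observation period, so both durations are NA.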

When checking whether someone is in observation, the default is to check whether they were in observation on the index date. We could, though, expand this and consider a window of time around this date. For example, here we add a variable indicating whether someone was in observation from 180 days before the index date to 30 days following it.

cdm$condition_occurrence %>%
  addInObservation(
    indexDate = "condition_start_date",
    window = c(-180, 30)
  ) %>%
  glimpse()
## Rows: ??
## Columns: 13
## Database: DuckDB v1.0.0 [root@Darwin 23.4.0:R 4.4.1/:memory:]
## $ person_id                 <int> 622, 5923, 8963, 5257, 711, 5682, 9304, 6332…
## $ condition_start_date      <date> 1952-12-01, 2017-11-23, 2031-12-15, 1984-05…
## $ condition_end_date        <date> 2001-09-02, 2096-03-22, 2098-09-15, 1994-03…
## $ condition_occurrence_id   <int> 1, 3, 4, 5, 6, 7, 8, 9, 11, 12, 13, 14, 15, …
## $ condition_concept_id      <int> 10, 1, 6, 1, 10, 1, 4, 9, 5, 5, 7, 1, 9, 5, …
## $ condition_type_concept_id <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ age                       <int> 9, 69, 47, 34, 43, 113, 60, 141, 31, 46, 119…
## $ age_group                 <chr> "0 to 17", ">= 66", "18 to 65", "18 to 65", …
## $ age_from_year_of_birth    <int> 9, 69, 47, 34, 43, 113, 60, 141, 31, 46, 119…
## $ sex                       <chr> "Male", "Male", "Male", "Female", "Female", …
## $ prior_observation         <int> 3622, 25529, 17515, 12562, 15776, 41529, 220…
## $ future_observation        <int> 36734, 35706, 36246, 8044, 17777, 26181, 909…
## $ in_observation            <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…

We can also specify a window and require that an individual is in observation for only some of the days within it. Here we add a variable indicating whether the individual was in observation at some point from 365 days after the index date onwards (i.e. at least a year in the future).

cdm$condition_occurrence %>%
  addInObservation(
    indexDate = "condition_start_date",
    window = c(365, Inf),
    completeInterval = FALSE
  ) %>%
  glimpse()
## Rows: ??
## Columns: 13
## Database: DuckDB v1.0.0 [root@Darwin 23.4.0:R 4.4.1/:memory:]
## $ person_id                 <int> 622, 5923, 8963, 5257, 711, 5682, 9304, 6332…
## $ condition_start_date      <date> 1952-12-01, 2017-11-23, 2031-12-15, 1984-05…
## $ condition_end_date        <date> 2001-09-02, 2096-03-22, 2098-09-15, 1994-03…
## $ condition_occurrence_id   <int> 1, 3, 4, 5, 6, 7, 8, 9, 11, 12, 13, 14, 15, …
## $ condition_concept_id      <int> 10, 1, 6, 1, 10, 1, 4, 9, 5, 5, 7, 1, 9, 5, …
## $ condition_type_concept_id <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ age                       <int> 9, 69, 47, 34, 43, 113, 60, 141, 31, 46, 119…
## $ age_group                 <chr> "0 to 17", ">= 66", "18 to 65", "18 to 65", …
## $ age_from_year_of_birth    <int> 9, 69, 47, 34, 43, 113, 60, 141, 31, 46, 119…
## $ sex                       <chr> "Male", "Male", "Male", "Female", "Female", …
## $ prior_observation         <int> 3622, 25529, 17515, 12562, 15776, 41529, 220…
## $ future_observation        <int> 36734, 35706, 36246, 8044, 17777, 26181, 909…
## $ in_observation            <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
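
The difference between requiring observation over a whole window and over only part of it can be illustrated with a small base R sketch. This is our reading of the two checks above, not the package's implementation: the helper inObservationWindow() is hypothetical, and it assumes the observation period passed in is the one containing the index date.

```r
# Hypothetical helper: is a person in observation relative to a window of
# days around the index date?
inObservationWindow <- function(indexDate, periodStart, periodEnd,
                                window, completeInterval) {
  # work in days since the epoch so that infinite window bounds are allowed
  index <- as.numeric(indexDate)
  start <- as.numeric(periodStart)
  end <- as.numeric(periodEnd)
  windowStart <- index + window[1]
  windowEnd <- index + window[2]
  if (completeInterval) {
    # the observation period must cover the whole window
    inObs <- start <= windowStart & end >= windowEnd
  } else {
    # the observation period only needs to overlap the window
    inObs <- start <= windowEnd & end >= windowStart
  }
  as.integer(inObs)
}

indexDate <- as.Date("2010-06-01")
periodStart <- as.Date("2010-01-01")
periodEnd <- as.Date("2012-12-31")

# whole window from 180 days before to 30 days after the index date
inObservationWindow(indexDate, periodStart, periodEnd, c(-180, 30), TRUE)
# any day within that same window
inObservationWindow(indexDate, periodStart, periodEnd, c(-180, 30), FALSE)
# any day from a year after the index date onwards
inObservationWindow(indexDate, periodStart, periodEnd, c(365, Inf), FALSE)
```

The first call returns 0 because observation only starts 151 days before the index date, while the other two return 1. With an infinite upper bound only the partial check is meaningful, as no finite observation period can cover an unbounded window.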

1.3 Adding characteristics to cohort tables

The above functions can be used on both standard OMOP CDM tables and cohort tables. Note that, as the default index date in these functions is “cohort_start_date”, we can now omit the indexDate argument.

cdm$cohort1 %>%
  glimpse()
## Rows: ??
## Columns: 4
## Database: DuckDB v1.0.0 [root@Darwin 23.4.0:R 4.4.1/:memory:]
## $ cohort_definition_id <int> 1, 3, 3, 1, 1, 2, 1, 2, 1, 2, 3, 2, 3, 3, 3, 1, 3…
## $ subject_id           <int> 2303, 5537, 794, 7289, 8393, 8211, 9251, 6521, 83…
## $ cohort_start_date    <date> 1939-10-30, 1958-10-22, 2098-08-03, 1965-02-04, …
## $ cohort_end_date      <date> 1958-09-21, 1983-10-06, 2147-06-19, 2004-01-04, …

cdm$cohort1 <- cdm$cohort1 %>%
  addAge(ageGroup = list(
    "0 to 17" = c(0, 17),
    "18 to 65" = c(18, 65),
    ">= 66" = c(66, Inf)
  )) %>%
  addSex() %>%
  addInObservation() %>%
  addPriorObservation() %>%
  addFutureObservation()

cdm$cohort1 %>%
  glimpse()
## Rows: ??
## Columns: 10
## Database: DuckDB v1.0.0 [root@Darwin 23.4.0:R 4.4.1/:memory:]
## $ cohort_definition_id <int> 1, 3, 3, 1, 1, 2, 1, 2, 1, 2, 3, 3, 3, 3, 1, 3, 1…
## $ subject_id           <int> 2303, 5537, 794, 7289, 8393, 8211, 9251, 6521, 83…
## $ cohort_start_date    <date> 1939-10-30, 1958-10-22, 2098-08-03, 1965-02-04, …
## $ cohort_end_date      <date> 1958-09-21, 1983-10-06, 2147-06-19, 2004-01-04, …
## $ age                  <int> 9, 35, 109, 61, 99, 104, 38, 44, 93, 38, 24, 70, …
## $ age_group            <chr> "0 to 17", "18 to 65", ">= 66", "18 to 65", ">= 6…
## $ sex                  <chr> "Female", "Female", "Male", "Male", "Female", "Ma…
## $ in_observation       <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ prior_observation    <int> 3589, 13078, 40026, 22315, 36517, 38343, 13929, 1…
## $ future_observation   <int> 10808, 14048, 27464, 21479, 1208, 16398, 27919, 3…

1.4 Getting multiple characteristics at once

The above functions, when chained together, each fetch the related information one by one. In cases where we are interested in adding multiple characteristics, we can add these all at the same time using the more general addDemographics() function. This will be more efficient than adding characteristics one at a time, as it requires fewer joins between our table of interest and the person and observation period tables.

cdm$cohort2 %>%
  glimpse()
## Rows: ??
## Columns: 4
## Database: DuckDB v1.0.0 [root@Darwin 23.4.0:R 4.4.1/:memory:]
## $ cohort_definition_id <int> 1, 1, 3, 3, 1, 1, 1, 1, 2, 1, 2, 1, 2, 3, 2, 1, 3…
## $ subject_id           <int> 7049, 3031, 8588, 6759, 1160, 4328, 2902, 1912, 3…
## $ cohort_start_date    <date> 2002-04-22, 1918-12-16, 2139-07-22, 2008-05-22, …
## $ cohort_end_date      <date> 2094-08-21, 1942-03-11, 2167-02-26, 2045-01-16, …

tictoc::tic()
cdm$cohort2 %>%
  addAge(ageGroup = list(
    "0 to 17" = c(0, 17),
    "18 to 65" = c(18, 65),
    ">= 66" = c(66, Inf)
  )) %>%
  addSex() %>%
  addInObservation() %>%
  addPriorObservation() %>%
  addFutureObservation()
## # Source:   table<og_226_1722159335> [?? x 10]
## # Database: DuckDB v1.0.0 [root@Darwin 23.4.0:R 4.4.1/:memory:]
##    cohort_definition_id subject_id cohort_start_date cohort_end_date   age
##                   <int>      <int> <date>            <date>          <int>
##  1                    1       7049 2002-04-22        2094-08-21         36
##  2                    1       3031 1918-12-16        1942-03-11         16
##  3                    3       8588 2139-07-22        2167-02-26        143
##  4                    1       1160 2050-01-27        2089-10-29        104
##  5                    1       4328 2187-03-04        2188-12-20        174
##  6                    2       3975 2067-05-16        2071-03-08        115
##  7                    1       9142 1915-05-22        1929-07-14          2
##  8                    2       1324 2085-07-21        2091-01-23        140
##  9                    1       4750 1996-02-01        2011-10-12         31
## 10                    2       8629 1993-05-15        2005-05-11         88
## # ℹ more rows
## # ℹ 5 more variables: age_group <chr>, sex <chr>, in_observation <int>,
## #   prior_observation <int>, future_observation <int>
tictoc::toc()
## 0.464 sec elapsed

tictoc::tic()
cdm$cohort2 %>%
  addDemographics(
    age = TRUE,
    ageName = "age",
    ageGroup = list(
      "0 to 17" = c(0, 17),
      "18 to 65" = c(18, 65),
      ">= 66" = c(66, Inf)
    ),
    sex = TRUE,
    sexName = "sex",
    priorObservation = TRUE,
    priorObservationName = "prior_observation",
    futureObservation = FALSE
  ) %>%
  glimpse()
## Rows: ??
## Columns: 8
## Database: DuckDB v1.0.0 [root@Darwin 23.4.0:R 4.4.1/:memory:]
## $ cohort_definition_id <int> 1, 1, 3, 1, 1, 2, 1, 2, 1, 2, 2, 1, 3, 3, 2, 2, 3…
## $ subject_id           <int> 7049, 3031, 8588, 1160, 4328, 3975, 9142, 1324, 4…
## $ cohort_start_date    <date> 2002-04-22, 1918-12-16, 2139-07-22, 2050-01-27, …
## $ cohort_end_date      <date> 2094-08-21, 1942-03-11, 2167-02-26, 2089-10-29, …
## $ age                  <int> 36, 16, 143, 104, 174, 115, 2, 140, 31, 88, 91, 1…
## $ age_group            <chr> "18 to 65", "0 to 17", ">= 66", ">= 66", ">= 66",…
## $ sex                  <chr> "Male", "Female", "Male", "Male", "Male", "Female…
## $ prior_observation    <int> 13260, 6193, 52432, 38012, 63614, 42139, 871, 513…
tictoc::toc()
## 0.184 sec elapsed

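In our small mock dataset the single addDemographics() call is already faster (0.18 versus 0.46 seconds in the run above, although it also adds fewer columns than the chained calls), and this difference will become much more noticeable when working with real data, which will typically be far larger.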
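
To illustrate why fewer joins help, the following base R sketch derives age and sex from a person table in two ways: with one join per characteristic, and with a single join that provides everything at once. All tables and values are hypothetical, age is simplified to a difference in calendar years, and the sketch only mirrors the idea behind addDemographics() rather than its actual implementation.

```r
# Hypothetical person and cohort tables
person <- data.frame(
  person_id = 1:3,
  gender_concept_id = c(8507, 8532, 8507),
  year_of_birth = c(1980L, 1950L, 2000L)
)
cohort <- data.frame(
  subject_id = c(3L, 1L, 2L),
  cohort_start_date = as.Date(c("2020-05-01", "2010-03-15", "2015-09-30"))
)
indexYear <- function(x) as.integer(format(x, "%Y"))

# One join per characteristic: two separate lookups against person
stepwise <- merge(
  cohort, person[, c("person_id", "year_of_birth")],
  by.x = "subject_id", by.y = "person_id"
)
stepwise$age <- indexYear(stepwise$cohort_start_date) - stepwise$year_of_birth
stepwise$year_of_birth <- NULL
stepwise <- merge(
  stepwise, person[, c("person_id", "gender_concept_id")],
  by.x = "subject_id", by.y = "person_id"
)
stepwise$sex <- ifelse(stepwise$gender_concept_id == 8507, "Male", "Female")
stepwise$gender_concept_id <- NULL

# All characteristics from a single join
joined <- merge(cohort, person, by.x = "subject_id", by.y = "person_id")
joined$age <- indexYear(joined$cohort_start_date) - joined$year_of_birth
joined$sex <- ifelse(joined$gender_concept_id == 8507, "Male", "Female")
combined <- joined[, c("subject_id", "cohort_start_date", "age", "sex")]

combined
```

Both approaches yield the same age and sex columns, but the second touches the person table only once. On a database backend each join is typically translated into additional SQL, so the saving grows with table size.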