The joineR package implements methods for analyzing data from longitudinal studies in which the response from each subject consists of a time-sequence of repeated measurements and a possibly censored time-to-event outcome. The modelling framework for the repeated measurements is the linear model with random effects and/or correlated error structure. The model for the time-to-event outcome is a Cox proportional hazards model with log-Gaussian frailty. Stochastic dependence is captured by allowing the Gaussian random effects of the linear model to be correlated with the frailty term of the Cox proportional hazards model.

Unbalanced and balanced data formats

A typical protocol for a repeated measurement study specifies a set of information to be recorded on each subject immediately before or after randomization, and one or more outcome measures recorded at each of a set of pre-specified follow-up times. With a slight abuse of terminology, we refer to the initial information as a set of baseline covariates. These can include any subject-specific time-constant information, for example measured or categorical characteristics of a subject, or their treatment allocation in a randomised trial.

A natural way to store the information on baseline covariates is as an \(n\) by \(p + 1\) array or data-frame, in which \(n\) is the number of subjects, \(p\) the number of baseline covariates, and an additional column contains a unique subject identifier. It is then natural to store the follow-up measurements of each outcome variable as an \(n\) by \(m + 1\) data-frame where \(m\) is the number of follow-up times and the additional column contains the subject-identifier with which we can link outcomes to baseline covariates. For completeness, we note that the follow-up times will often be unequally spaced, and should be stored as a vector of length \(m\) to avoid any ambiguity. This defines a balanced data-format, the term referring to the common set of information intended to be obtained from each subject.

In observational studies, and in designed studies with non-hospitalized human subjects, it is difficult or impossible to insist on a common set of follow-up times. In such cases, the usual way to store the data is as an array or data-frame with a separate row for each follow-up measurement and associated follow-up time on each subject, together with the subject identifier and any other covariate information, repeated redundantly over multiple follow-up times on the same subject. This defines the unbalanced data-format, so-called because it allows the number and timing of follow-up measurements to vary arbitrarily between patients.

In practice, the unbalanced data-format is also widely used to store data from a balanced study-design, despite its in-built redundancy. Conversely, the balanced format can be used to store data from an unbalanced design, by defining the time-vector to be the complete set of distinct follow-up times and using NA’s to denote missing follow-up measurements. In practice, unless the data-set derives from a balanced design with no missing values, the choice between the two formats reflects a compromise between storing redundant information in the unbalanced format and generating a possibly large number of NA’s in the balanced format.

Motivating Examples

The package includes four data-sets: heart.valve, liver, mental, and epileptic. Descriptions of each follow.

The heart.valve data-set

This data-set is derived from a study of heart function after surgery to implant a new heart valve, and is loaded (after installing the package) with the command:


The data refer to 256 patients and are stored in the unbalanced format, which is convenient here because measurement times were unique to each subject. The data are stored as a single R object, heart.valve, which is a data-frame of dimension 988 by 25. The average number of repeated measurements per subject is therefore . As with any unbalanced data-set, values of time-constant variables are repeated over all rows that refer to the same subject. The dimensionality of the dataset can be confirmed by a call to the dim() function, whilst the names of the 25 variables can be listed by a call to the names() function:

## [1] 988  25
##  [1] "num"          "sex"          "age"          "time"         "fuyrs"       
##  [6] "status"       "grad"         "log.grad"     "lvmi"         "log.lvmi"    
## [11] "ef"           "bsa"          "lvh"          "prenyha"      "redo"        
## [16] "size"         "con.cabg"     "creat"        "dm"           "acei"        
## [21] "lv"           "emergenc"     "hc"           "sten.reg.mix" "hs"

To extract the subject-specific values of one or more baseline covariates from a data-set in the unbalanced format, we use the function UniqueVariables. The three arguments to this function are: the name of the data-frame; a vector of column names or numbers of the data-frame that identify the required baseline covariates; the column name or number of the data-frame that identifies the subject. For example, to select the baseline covariates emergenc, age and sex from the data-frame heart.valve and assign these to a new R object, we use the command:

heart.valve.cov <- UniqueVariables(
  c("emergenc", "age", "sex"),
  id.col = "num")

The resulting data-frame heart.valve.cov has four columns, corresponding to the subject identifier and the three extracted baseline covariates; hence for example:

heart.valve.cov[11:15, ]
##    num      age emergenc sex
## 11  11 70.56712        0   0
## 12  12 50.97534        0   0
## 13  13 69.94247        0   1
## 14  14 72.30685        0   0
## 15  15 42.40000        1   0

To extract all of the baseline covariates, it is easier to identify the required columns by number, hence:

heart.valve.cov <- UniqueVariables(
  c(2, 3, 5, 6, 12:25), 
  id.col = "num")

## [1] 256  19

Notice that the dimension of heart.valve.cov is 256 (the number of subjects) by 19 (one more than the number of covariates). An analysis of these data is reported in Lim et al. (2008).

The liver data

This data-set is taken from Andersen et al. (1993). It concerns the measurement of liver function in cirrhosis patients treated either with standard or novel therapy.

The data-set is included with the package as a single R object, which can be accessed as follows:

## [1] 2969    6
## [1] "id"          "prothrombin" "time"        "treatment"   "survival"   
## [6] "cens"

This data-set is stored in the balanced format. It contains longitudinal follow-up information on each subject, a solitary baseline covariate representing the respective therapy arm, and time-to-event information (time and censoring indicator) for each subject.

Subsets of the data can be accessed in the usual way. For example,

liver[liver$id %in% 29:30, ]
##    id prothrombin       time treatment  survival cens
## 33 29          59 0.00000000         0 0.1013699    0
## 34 30          51 0.00000000         0 0.1068493    1
## 35 30          75 0.09863014         0 0.1068493    1

shows that subject 29 provided only one measurement, at time \(t = 0\), and was lost-to-follow up at time \(t = 0.1013699\) years (cens = 0), whilst subject 30 provided two measurements, at times \(t = 0\) and \(t = 0.09863014\) years, and died at time \(t = 0.1068493\) years (cens = 1).

The mental data-set

This data-set relates to a study in which 150 patients suffering from chronic mental illness were randomised amongst three different drug treatments: placebo and two active drugs. A questionnaire instrument was used to assess each patient’s mental state at weeks 0, 1, 2, 4, 6 and 8 post-randomization. The data can be loaded with the command:


Only sixty-eight of the 150 subjects provided a complete sequence of measurements at weeks 0, 1, 2, 4, 6 and 8. The remainder left the study prematurely for a variety of reasons, some of which were thought to be related to their mental state. Hence, dropout is potentially informative. The data from the first five subjects can be accessed in the usual way:

mental[1:5, ]
##   id Y.t0 Y.t1 Y.t2 Y.t4 Y.t6 Y.t8 treat n.obs surv.time cens.ind
## 1  1   55   NA   NA   NA   NA   NA     2     1     0.704        0
## 2  2   44   NA   NA   NA   NA   NA     1     1     0.740        0
## 3  3   65   67   NA   NA   NA   NA     2     2     1.121        1
## 4  4   64   56   NA   NA   NA   NA     2     2     1.224        1
## 5  5   47   48   NA   NA   NA   NA     2     2     1.303        0

The data are stored in the balanced format, with 150 rows (one per subject) and 11 columns comprising a subject identifier, the measured values from the questionnaire at each if the six scheduled follow-up times, the treatment allocation, the number of non-missing measured values, an imputed dropout time and a censoring indicator, coded as 1 for subjects who dropped out for reasons thought to be related to their mental health state, and as 0 otherwise. Note the distinction made here between potentially informative dropout and censoring, the latter being assumed to be uninformative. Hence, the command

##  0  1 
## 21 61

shows that 21 of the 82 dropouts did so for reasons unrelated to their mental health.

The epileptic data-set

This data-set is derived from the SANAD (Standard and New Antiepileptic Drugs) study, described in Marson et al. (2007). SANAD was a randomised control trial to compare standard (CBZ) and new (LTG) anti-epileptic drugs with respect to their effects on long-term clinical outcomes. The data are stored in the unbalanced format. The data for each subject consist of repeated measurements of calibrated dose, several baseline covariates and the time to withdrawal from the drug to which they were randomised. The first three rows, all of which relate to the same subject, can be accessed as follows:

epileptic[1:3, ]
##   id dose time with.time with.status with.status2 with.status.uae
## 1  1    2   86      2400           0            0               0
## 2  1    2  119      2400           0            0               0
## 3  1    2  268      2400           0            0               0
##   with.status.isc treat   age gender learn.dis
## 1               0   CBZ 75.67      M        No
## 2               0   CBZ 75.67      M        No
## 3               0   CBZ 75.67      M        No

Converting between balanced and unbalanced data-formats

Two functions are provided to convert objects from one format to the other. The following code demonstrates the conversion of the `mental} data-set from the balanced to the unbalanced format, including a mnemonic re-naming of the column that now contains all of the repeated measurements:

mental.unbalanced <- to.unbalanced(mental, id.col = 1,
                                   times = c(0, 1, 2, 4, 6, 8),
                                   Y.col = 2:7, other.col = 8:11)
## [1] "id"        "time"      "Y.t0"      "treat"     "n.obs"     "surv.time"
## [7] "cens.ind"
names(mental.unbalanced)[3] <- "Y"

The following code converts the new object back to the balanced format:

mental.balanced <- to.balanced(mental.unbalanced, id.col = 1,
                               time.col = 2,
                               Y.col = 3, other.col = 4:7)
## [1] 150  11
##  [1] "id"        "Y.t0"      "Y.t1"      "Y.t2"      "Y.t4"      "Y.t6"     
##  [7] "Y.t8"      "treat"     "n.obs"     "surv.time" "cens.ind"

Note the automatic renaming of the repeated measures according to their measurement times, also that this function does not check whether storing the object in the balanced format is sensible. For example, conversion of the epileptic data to the balanced format leads to the creation of a large array, most of whose values are missing:

epileptic.balanced <- to.balanced(epileptic,
                                  id.col = 1, time.col = 3,
                                  Y.col = 2, other.col = 4:12)
## [1]  605 1006
## [1] 599783

Once a data-set is in the balanced format (if it is unbalanced then to.balanced can be applied) then the mean, variance and correlation matrix of the responses can be extracted using summarybal. An example is given below using the mental data-set:

summarybal(mental, Y.col = 2:7, times = c(0, 1, 2, 4, 6, 8), 
           na.rm = TRUE)
## $mean.vect
##      x        y
## Y.t0 0 55.77333
## Y.t1 1 53.03378
## Y.t2 2 50.37008
## Y.t4 4 48.62963
## Y.t6 6 46.94048
## Y.t8 8 46.02941
## $variance
##     Y.t0     Y.t1     Y.t2     Y.t4     Y.t6     Y.t8 
## 132.1228 152.7403 167.5366 168.6653 228.8759 181.4917 
## $cor.mtx
##           [,1]      [,2]      [,3]      [,4]      [,5]      [,6]
## [1,] 1.0000000 0.5939982 0.4537496 0.4186798 0.3432355 0.2997679
## [2,] 0.5939982 1.0000000 0.7793311 0.6805021 0.6631543 0.6149448
## [3,] 0.4537496 0.7793311 1.0000000 0.7992470 0.7077238 0.6202277
## [4,] 0.4186798 0.6805021 0.7992470 1.0000000 0.8108585 0.7402240
## [5,] 0.3432355 0.6631543 0.7077238 0.8108585 1.0000000 0.8681933
## [6,] 0.2997679 0.6149448 0.6202277 0.7402240 0.8681933 1.0000000

Creating a jointdata object

A jointdata object is a list consisting of up to three data-frames, which collectively contain repeated measurement data, time-to-event data and baseline covariate data. The repeated measurement data must be stored in the unbalanced format. The time-to-event and baseline covariate data must each be stored in the balanced format, i.e. with one line per subject. Each data-frame must include a column containing the subject id, and all three subject id columns must match. The UniqueVariables function provides a convenient way to extract the time-to-event and base line covariate data from an unbalanced data-frame, as in the following example.

liver.long <- liver[, 1:3]
liver.surv <- UniqueVariables(liver, var.col = c("survival", "cens"), 
                              id.col = "id")
liver.baseline <- UniqueVariables(liver, var.col = 4,
                                  id.col = "id")

liver.jd <- jointdata(longitudinal = liver.long,
                      survival = liver.surv, 
                      baseline = liver.baseline,
                      id.col = "id",
                      time.col = "time")

As a second example, we create a jointdata object from the single data-frame heart.valve as follows:

heart.surv <- UniqueVariables(heart.valve, var.col = c("fuyrs", "status"),
                              id.col = "num")
heart.long <- heart.valve[, c(1, 4, 5, 7, 8, 9, 10, 11)]
heart.jd <- jointdata(longitudinal = heart.long, 
                      survival = heart.surv,
                      id.col = "num",
                      time.col = "time")

A summary of a jointdata object can be obtained using the summary function. For example:

## $subjects
## [1] "Number of subjects: 256"
## $longitudinal
##            class
## time     numeric
## fuyrs    numeric
## grad     numeric
## log.grad numeric
## lvmi     numeric
## log.lvmi numeric
## ef       integer
## $survival
## Number of subjects that fail:  54
## Number of subjects censored:  202
## $baseline
## [1] "No baseline covariates data available"
## $times
## [1] "Unbalanced longitudinal study or more than twenty observation times"

A jointdata object can also be constructed from any specified subset of subjects. For example,

take <- heart.jd$survival$num[heart.jd$survival$status == 0]
heart.jd.cens <- subset(heart.jd, take)

selects only those subjects whose survival status is zero, i.e. their event-time is right-censored.

The sample function can also be used to select a random sample of subjects from a jointdata object, for example:

heart.jd.sample <- sample.jointdata(heart.jd, size = 10)

takes the data from a random sample of 10 out of the 150 subjects and assigns the result to a jointdata object named heart.jd.sample.

Plotting a jointdata object

The default operation of the generic plot function applied to a jointdata object is a plot of longitudinal profiles of the repeated measurements. If the object contains more than one longitudinal variable, each is presented in a separate panel. For example, the command plot(heart.jd) produces the following figure: