---
title: "Computing p-values for Fleming-Harrington weighted logrank tests and the MaxCombo test"
author: "Keaven Anderson, Yujie Zhao"
output: rmarkdown::html_vignette
bibliography: simtrial.bib
vignette: >
  %\VignetteIndexEntry{Computing p-values for Fleming-Harrington weighted logrank tests and the MaxCombo test}
  %\VignetteEngine{knitr::rmarkdown}
---

```{r, include=FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

run <- requireNamespace("dplyr", quietly = TRUE) &&
  requireNamespace("gt", quietly = TRUE)
knitr::opts_chunk$set(eval = run)
```

## Introduction

This vignette demonstrates the use of a simple routine for simulation and testing with Fleming-Harrington weighted logrank tests and the MaxCombo test.
In addition, we demonstrate how to perform these tests with a dataset not generated by simulation routines within the package.
Note that all $p$-values computed here are one-sided with small values indicating that the experimental treatment is favored.

## Defining the test

The MaxCombo test has been posed as the maximum of multiple Fleming-Harrington weighted logrank tests (@FH1982, @FH2011).
Combination tests looking at a maximum of selected tests in this class have also been proposed; see @Lee2007, @NPHWGDesign, and @NPHWGSimulation.
The Fleming-Harrington class is indexed by the parameters $\rho \geq 0$ and $\gamma \geq 0$.
We will denote these as FH($\rho, \gamma$).
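At each event time $t$, the FH($\rho, \gamma$) test weights the logrank score by
$$w(t) = \hat{S}(t^-)^{\rho}\left(1 - \hat{S}(t^-)\right)^{\gamma},$$
where $\hat{S}(t^-)$ is the pooled Kaplan-Meier estimate of survival just prior to $t$.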
This class includes the logrank test as FH(0, 0).
Other tests of interest here include:

- FH(0, 1): a test that down-weights early events
- FH(1, 0): a test that down-weights late events
- FH(1, 1): a test that down-weights events increasingly as their quantiles differ from the median

## Executing for a single dataset

### Generating test statistics with `sim_fixed_n()`

We begin with a single trial simulation generated by `sim_fixed_n()` using its default arguments.
`sim_fixed_n()` produces one record per test and data cutoff method per simulation.
Here we choose 3 tests (logrank = FH(0, 0), FH(0, 1) and FH(1, 1)).
When more than one test is chosen, the correlation between tests is computed as shown by @Karrison2016. In the output:

- `v1`, `v2`, `v3` contain the estimated correlations between the chosen tests.
- `rho` and `gamma` indicate the $\rho$ and $\gamma$ used to compute each test.
- `z` is the FH($\rho, \gamma$) normal test statistic with variance 1; a negative value favors the experimental treatment.
- `cut` indicates how the data were cut for analysis, in this case at the later of the targeted minimum follow-up after the last enrollment and the date at which the targeted event count was reached.
- `sim` is a sequential index of the simulations performed.

```{r, message=FALSE, warning=FALSE}
library(simtrial)
library(knitr)
library(dplyr)
library(gt)
```

```{r}
set.seed(123)

x <- sim_fixed_n(
  n_sim = 1,
  timing_type = 5,
  rho_gamma = data.frame(rho = c(0, 0, 1), gamma = c(0, 1, 1))
)

x |>
  gt() |>
  fmt_number(columns = c("ln_hr", "z", "duration", "v1", "v2", "v3"), decimals = 2)
```
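
To make the use of the correlations concrete, the MaxCombo $p$-value can be obtained from the minimum of the correlated test statistics using the multivariate normal distribution, following @Karrison2016. The sketch below is illustrative only; it assumes the mvtnorm package and the column layout shown above.

```{r, eval=FALSE}
# Illustrative only: one-sided MaxCombo p-value from the statistics above.
# The MaxCombo statistic is the minimum (most negative) z; its p-value is
# P(min of the correlated statistics <= observed minimum) under the null.
library(mvtnorm)

z_min <- min(x$z)
corr <- data.matrix(x[, c("v1", "v2", "v3")]) # correlation matrix of tests
1 - pmvnorm(
  lower = rep(z_min, nrow(x)),
  upper = rep(Inf, nrow(x)),
  corr = corr
)[1]
```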

### Generating data with `sim_pw_surv()`

Next, we consider a simulation generated by `sim_pw_surv()`, again using default arguments for that routine.

```{r, message=FALSE, warning=FALSE, cache=FALSE}
set.seed(123)

s <- sim_pw_surv(n = 100)

s |>
  head() |>
  gt() |>
  fmt_number(columns = c("enroll_time", "fail_time", "dropout_time", "cte"), decimals = 2)
```

Once generated, we need to cut the data for analysis. Here we cut after 75 events.

```{r, warning=FALSE, message=FALSE}
x <- s |> cut_data_by_event(75)

x |>
  head() |>
  gt() |>
  fmt_number(columns = "tte", decimals = 2)
```
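
Alternatively, data can be cut at a specified calendar time rather than at a targeted event count. A minimal sketch, assuming the simtrial helper `cut_data_by_date()` with a cutoff measured in time since study start:

```{r, eval=FALSE}
# Illustrative alternative: cut for analysis 24 time units after study start
x_by_date <- s |> cut_data_by_date(24)
```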

Now we can analyze these data. We begin with `s` to show how this can be done with a single command.
In this case, we use the 4-test combination suggested in @NPHWGSimulation and @NPHWGDesign.

```{r, warning=FALSE, message=FALSE}
z <- s |>
  cut_data_by_event(75) |>
  maxcombo(rho = c(0, 0, 1, 1), gamma = c(0, 1, 0, 1))

z
```

Suppose we want the $p$-value based just on the logrank, FH(0, 1), and FH(1, 0) tests, as suggested by @Lee2007.
With `maxcombo()`, we simply specify the corresponding `rho` and `gamma` values for those three tests.

```{r, warning=FALSE, message=FALSE}
z <- s |>
  cut_data_by_event(75) |>
  maxcombo(rho = c(0, 0, 1), gamma = c(0, 1, 0))

z
```

### Using survival data in another format

For a trial not generated by the simulation routines in simtrial, the process is slightly more involved.
We consider survival data not in the simtrial format and show the transformation needed.
In this case we use the small `aml` dataset from the survival package.

```{r, warning=FALSE, message=FALSE}
library(survival)
aml |>
  head() |>
  gt()
```

We rename variables and create a stratum variable as follows:

```{r, warning=FALSE, message=FALSE}
x <- aml |> transmute(
  tte = time,
  event = status,
  stratum = "All",
  treatment = case_when(
    x == "Maintained" ~ "experimental",
    x == "Nonmaintained" ~ "control"
  )
)

x |>
  head() |>
  gt()
```

Now we analyze these data with a MaxCombo test combining the logrank and FH(0, 1) tests and compute a $p$-value.

```{r, warning=FALSE, message=FALSE}
x |> maxcombo(rho = c(0, 0), gamma = c(0, 1))
```
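
As a rough cross-check of the FH(0, 0) component, the unweighted logrank test is also available from the survival package. Note that `survdiff()` reports a two-sided chi-square $p$-value, whereas the MaxCombo $p$-values here are one-sided.

```{r, eval=FALSE}
# Two-sided logrank test on the reformatted data for comparison
survdiff(Surv(tte, event) ~ treatment, data = x)
```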

## Simulation

We now consider a larger simulation to demonstrate how to estimate power for the MaxCombo test. We use 100 simulations here to limit running time; a larger number (e.g., 1000) should be used for a better estimate of design properties. Here we will test at the $\alpha = 0.001$ level.

```{r, cache=FALSE, warning=FALSE, message=FALSE}
set.seed(123)

# Only use cut for events + min follow-up
x <- sim_fixed_n(
  n_sim = 100,
  timing_type = 5,
  rho_gamma = data.frame(rho = c(0, 0, 1), gamma = c(0, 1, 1))
)

# MaxCombo power estimate for cutoff at max of targeted events, minimum follow-up
x |>
  group_by(sim) |>
  filter(row_number() == 1) |>
  ungroup() |>
  summarize(power = mean(p_value < .001))
```

Note that we keep only the first row for each simulation so that each simulation contributes a single MaxCombo $p$-value to the power estimate.
Multiple data cutoff methods can also be evaluated without having to re-run all simulations, as demonstrated below with a smaller set of tests.

```{r, cache=FALSE, warning=FALSE, message=FALSE}
# Only use cuts for events and events + min follow-up
set.seed(123)

x <- sim_fixed_n(
  n_sim = 100,
  timing_type = c(2, 5),
  rho_gamma = data.frame(rho = 0, gamma = c(0, 1))
)
```

Now we compute power separately for each cutoff method, first for the targeted event count.

```{r, warning=FALSE, message=FALSE}
# Subset to targeted events cutoff tests
x |>
  filter(cut == "Targeted events") |>
  group_by(sim) |>
  filter(row_number() == 1) |>
  ungroup() |>
  summarize(power = mean(p_value < .025))
```

Now we use the later of targeted events and minimum follow-up cutoffs.

```{r, warning=FALSE, message=FALSE}
# Subset to targeted events + min follow-up cutoff tests
x |>
  filter(cut != "Targeted events") |>
  group_by(sim) |>
  filter(row_number() == 1) |>
  ungroup() |>
  summarize(power = mean(p_value < .025))
```
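
The two cutoff-specific summaries above can also be produced in a single pass by grouping on `cut` as well as `sim`; a minimal sketch under the same assumptions about the simulation output:

```{r, eval=FALSE}
# Power for each cutoff method in one summary: keep one row per simulation
# and cutoff method, then compute power within each cutoff method
x |>
  group_by(cut, sim) |>
  filter(row_number() == 1) |>
  group_by(cut) |>
  summarize(power = mean(p_value < .025))
```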

## References