Introduction to spatialTIME

Background for Tissue Microarray (TMA) Data

A tissue microarray (TMA) is an array of tissue samples, each obtained by taking a slice of a biopsied, formalin-fixed paraffin-embedded (FFPE) tumor. Each individual slice is referred to as a core. Each core is placed on a TMA and then stained with multiple antibodies and fluorophores, which fluoresce when excited by laser light of varying wavelengths. The intensity is measured, and a random forest algorithm is then used to classify each cell as positive or negative for a particular marker, which allows us to phenotype cells. A schematic of this process is provided in Figure 1.

Create Multiplex Immunofluorescence (mif) Object

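spatialTIME functions use a custom mif object, which can be created using create_mif. The mif object has 6 slots storing the clinical data, the sample summary data, the list of spatial data frames (one per core), the derived results, and the names of the patient ID and sample ID columns.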

We include one example clinical dataset and one example sample dataset, which together cover 229 patients with one core each. Spatial data for only 5 of those 229 samples are included in our package.


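# Load spatialTIME and the tidyverse functions (%>%, mutate, ggplot, ...) used below
library(spatialTIME)
library(tidyverse)

# Make sure the variable types are the same for deidentified_id and 
# deidentified_sample in their corresponding datasets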
x <- create_mif(clinical_data = example_clinical %>% 
                  mutate(deidentified_id = as.character(deidentified_id)),
                sample_data = example_summary %>% 
                  mutate(deidentified_id = as.character(deidentified_id)),
                spatial_list = example_spatial,
                patient_id = "deidentified_id", 
                sample_id = "deidentified_sample")

x #prints a summary of how many patients, samples, and spatial files are present
#> 229 patients spanning 229 samples and 5 spatial data frames were found

Plotting Cores

An individual plot is created for each core (each sample). Plots can be assigned to an R object, such as the initially empty derived slot, and are printed to a PDF if a file name is provided.

When plotting both phenotypes and individual markers, it is important to list the individual markers before the phenotype markers. This ensures that phenotypes derived from multiple markers are not plotted over by the individual markers. For instance, the first plot below appears to have no cytotoxic T cells (CD3+ and CD8+), but when the order is changed we see the cytotoxic T cells. Moral of the story: put the single markers before the marker combinations.
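The effect of marker order can be illustrated without any plotting. The sketch below is not how plot_immunoflo works internally. It only assumes that markers are drawn in the order given, so the last marker a cell is positive for is the one that stays visible. The `cells` table and the `visible_marker` helper are hypothetical.

```r
# Hypothetical positivity table for four cells (1 = positive, 0 = negative).
cells <- data.frame(
  CD3     = c(1, 1, 0, 1),
  CD8     = c(0, 1, 0, 1),
  CD3_CD8 = c(0, 1, 0, 1)   # double-positive phenotype
)

# Assumption: markers are drawn in the order given, so for a cell that is
# positive for several markers the last one drawn is the one left visible.
visible_marker <- function(cells, mnames) {
  shown <- rep(NA_character_, nrow(cells))
  for (m in mnames) {
    shown[cells[[m]] == 1] <- m
  }
  shown
}

# Combination first: the single markers are drawn on top of it.
bad_order  <- visible_marker(cells, c("CD3_CD8", "CD3", "CD8"))
# Single markers first: the combination is drawn last and stays visible.
good_order <- visible_marker(cells, c("CD3", "CD8", "CD3_CD8"))

bad_order
good_order
```

With the combination listed first, no cell ends up labelled as double positive, which mirrors the first plot.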

mnames_bad <- c("CD3..CD8.","CD3..FOXP3.","CD3..Opal.570..Positive",
                "CD8..Opal.520..Positive","FOXP3..Opal.620..Positive", 
                "PDL1..Opal.540..Positive", "PD1..Opal.650..Positive")

# Used to keep the legends in both plots below in the same order and with the
# same color scheme, so that a common legend can be made

values =  RColorBrewer::brewer.pal(length(mnames_bad), "Accent")
names(values) = mnames_bad

# Add an element to the `derived` slot
x <- plot_immunoflo(x, plot_title = "deidentified_sample", mnames = mnames_bad,
                    cell_type = "Classifier.Label")
#> Scale for y is already present.
#> Adding another scale for y, which will replace the existing scale.
#> Scale for y is already present.
#> Adding another scale for y, which will replace the existing scale.
#> Scale for y is already present.
#> Adding another scale for y, which will replace the existing scale.
#> Scale for y is already present.
#> Adding another scale for y, which will replace the existing scale.
#> Scale for y is already present.
#> Adding another scale for y, which will replace the existing scale.

bad_names <- x[["derived"]][["spatial_plots"]][[4]] + 
  theme(legend.position = 'bottom') + 
  scale_color_manual(breaks = mnames_bad,
                     values = values,
                     labels = mnames_bad %>%
                       gsub("..Opal.*", "+", .) %>% 
                       gsub("\\.\\.", "+", .) %>% 
                       gsub("\\.", "+", .)) 
#> Scale for colour is already present.
#> Adding another scale for colour, which will replace the existing scale.

mnames_good <- c("CD3..Opal.570..Positive","CD8..Opal.520..Positive",
                 "FOXP3..Opal.620..Positive","PDL1..Opal.540..Positive",
                 "PD1..Opal.650..Positive","CD3..CD8.","CD3..FOXP3.")

x <- plot_immunoflo(x, plot_title = "deidentified_sample", mnames = mnames_good, 
                    cell_type = "Classifier.Label")
#> Scale for y is already present.
#> Adding another scale for y, which will replace the existing scale.
#> Scale for y is already present.
#> Adding another scale for y, which will replace the existing scale.
#> Scale for y is already present.
#> Adding another scale for y, which will replace the existing scale.
#> Scale for y is already present.
#> Adding another scale for y, which will replace the existing scale.
#> Scale for y is already present.
#> Adding another scale for y, which will replace the existing scale.

good_names <- x[["derived"]][["spatial_plots"]][[4]] + 
  theme(legend.position = 'bottom') + 
  scale_color_manual(breaks = mnames_good, 
                     values = values[match(mnames_good, names(values))],
                     labels = mnames_good %>%
                       gsub("..Opal.*", "+", .) %>% 
                       gsub("\\.\\.", "+", .) %>% 
                       gsub("\\.", "+", .))
#> Scale for colour is already present.
#> Adding another scale for colour, which will replace the existing scale.

x$sample %>% filter(deidentified_sample == 'TMA3_[9,K].tif') %>% select(c(2, 4:15)) %>%
  pivot_longer(cols = 2:13, names_to = 'Marker', values_to = 'Count')
#> # A tibble: 12 × 3
#>    deidentified_sample Marker                          Count
#>    <chr>               <chr>                           <dbl>
#>  1 TMA3_[9,K].tif      FOXP3 (Opal 620) Positive Cells    34
#>  2 TMA3_[9,K].tif      CD3 (Opal 570) Positive Cells     536
#>  3 TMA3_[9,K].tif      CD8 (Opal 520) Positive Cells      83
#>  4 TMA3_[9,K].tif      PD1 (Opal 650) Positive Cells       5
#>  5 TMA3_[9,K].tif      PDL1 (Opal 540) Positive Cells      1
#>  6 TMA3_[9,K].tif      CD3+ FOXP3+ Cells                  34
#>  7 TMA3_[9,K].tif      CD3+ CD8+ Cells                    68
#>  8 TMA3_[9,K].tif      CD3+ CD8+ FOXP3+ Cells              4
#>  9 TMA3_[9,K].tif      CD3+ PD1+ Cells                     5
#> 10 TMA3_[9,K].tif      CD3+ PD-L1+ Cells                   1
#> 11 TMA3_[9,K].tif      CD8+ PD1+ Cells                     0
#> 12 TMA3_[9,K].tif      CD3+ CD8+ PD-L1+ Cells              0
gridExtra::grid.arrange(bad_names, good_names, ncol=2)

Estimating the Degree of Spatial Clustering with Count Based Methods

Univariate

Count Based Methods

Ripley’s \(K\) measures the average number of neighboring cells around each cell, scaled by the overall cell density. That is, it is based on the average (over all cells) number of other cells within a specified radius of a cell. Ripley’s \(K\) is estimated as follows:

\[\hat{K}(r) = \frac{|A|}{n(n-1)}\sum_{i=1}^{n}\sum_{j\neq i}w_{ij}{\bf 1}{(d(x_i,x_j)\le r)},\]

where \(r\) is the specified radius, \(|A|\) is the area of the study region, \(n\) is the number of cells, \(d(x_i,x_j)\) is the distance between the \(i^{th}\) and \(j^{th}\) cell, \({\bf 1}_A\) is the indicator function of event \(A\), and \(w_{ij}\) are the weights assigned for border corrections. Under complete spatial randomness (CSR) the expected value of \(\hat{K}(r)\) is \(\pi r^2\), thus \(\hat{K}\) is expected to grow as a quadratic function of \(r\).
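To make the estimator concrete, here is a minimal base R sketch with all edge weights set to \(w_{ij} = 1\), that is, without any border correction. It scales the pair count by the window area so that the value is comparable to \(\pi r^2\). The function name `naive_ripleys_k` and the toy coordinates are hypothetical, and this is not the implementation used by ripleys_k.

```r
# Naive Ripley's K with all edge weights w_ij = 1 (no border correction).
# `xy` is an n x 2 matrix of cell coordinates and `area` is the window area.
naive_ripleys_k <- function(xy, r, area) {
  n <- nrow(xy)
  d <- as.matrix(dist(xy))   # pairwise Euclidean distances
  diag(d) <- Inf             # exclude the i == j pairs
  sapply(r, function(ri) area * sum(d <= ri) / (n * (n - 1)))
}

# Hypothetical toy core: four cells on the corners of a unit square,
# observed inside a 10 x 10 window.
xy    <- cbind(x = c(0, 1, 0, 1), y = c(0, 0, 1, 1))
radii <- c(0.5, 1, 1.5)

k_hat <- naive_ripleys_k(xy, r = radii, area = 100)
csr   <- pi * radii^2        # theoretical value under CSR

# Observed minus theoretical: positive values suggest clustering.
k_hat - csr
```

Because the four cells sit close together in a large window, the observed values at \(r = 1\) and \(r = 1.5\) are far above \(\pi r^2\).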

Several edge corrections are available. Our studies have included cores with a small number of cells, and we recommend using the ‘isotropic’ or ‘translational’ edge correction, as opposed to the ‘border’ edge correction. The main goal of an edge correction is to account for the fact that there are unobserved points outside of the region; the assumption is that the locations of these cells follow the same distribution as those inside the study region. An excellent description of these corrections is provided here.

Distance Based Measures

The distribution of the nearest neighbor distances, \(\hat{G}(r)\), can be studied and is computed by

\[\hat{G}(r) = \frac{1}{n}\sum_{i=1}^{n}{\bf 1}\left(\min_{j\neq i}\{d(x_i,x_j)\}\le r\right),\]

which is interpreted as the proportion of cells whose distance to their nearest neighbor is at most \(r\). Notice that there is no weighting factor for each pair of points as we saw above. The edge corrections for these methods, reduced sample (rs) and Hanisch (han), simply have different cell inclusion conditions. The reduced sample correction is similar to the border correction for count based methods, where only the middle portion of the area of interest is studied. The Hanisch border correction leaves out points whose \(k^{th}\) neighbor cannot be in the area of interest. For more information about these border corrections, see the following article.
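The uncorrected version of \(\hat{G}(r)\) is simply the empirical distribution function of the nearest-neighbor distances. The sketch below shows this in base R, with no reduced sample or Hanisch correction applied. The helper `naive_nn_g` and the coordinates are hypothetical and are not part of spatialTIME.

```r
# Naive nearest-neighbour G: proportion of cells whose nearest neighbour
# lies within distance r (no edge correction).
naive_nn_g <- function(xy, r) {
  d <- as.matrix(dist(xy))
  diag(d) <- Inf                          # a cell is not its own neighbour
  nn <- apply(d, 1, min)                  # nearest-neighbour distance per cell
  sapply(r, function(ri) mean(nn <= ri))  # empirical CDF evaluated at each r
}

# Hypothetical cells along a line: two close together, two far apart.
xy    <- cbind(x = c(0, 1, 5, 9), y = c(0, 0, 0, 0))
g_hat <- naive_nn_g(xy, r = c(0.5, 1, 4, 5))
g_hat
```

Two of the four cells have a neighbor at distance 1 and the other two at distance 4, so the curve steps from 0 to 0.5 at \(r = 1\) and reaches 1 at \(r = 4\).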

Need for Permutations

An underlying assumption of many spatial clustering metrics is that the cells are randomly distributed across the region, with no evidence of clustering or repulsion, and that the cell intensity is constant across the entire region. This assumption is the so-called complete spatial randomness (CSR). Tissue cores can be damaged by the way they are collected. This damage can lead to rips and tears in the cores, which produce regions that appear to contain no cells when that is not actually the case. Because of these violations of the CSR assumption, the theoretical estimate for CSR may not be accurate. To address this, the cell positivity can be permuted across all observed locations; the resulting permutation distribution of \(K\), \(L\), \(M\), and \(G\) gives a more core-specific measure of CSR than the theoretical value. The permutations also let us determine whether the observed clustering is significant, by checking whether it lies above or below 95% of the permuted CSR estimates.
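The permutation idea can be sketched in a few lines of base R. The example assumes an uncorrected \(K\) (all edge weights equal to 1) and assumes that the degree of clustering is taken as the observed value minus the permuted CSR estimate. The simulated core and the helper `k_at_r` are hypothetical. In practice the ripleys_k function handles this.

```r
set.seed(1)

# Uncorrected K at a single radius r (assumption: no edge correction).
k_at_r <- function(xy, r, area) {
  n <- nrow(xy)
  d <- as.matrix(dist(xy))
  diag(d) <- Inf
  area * sum(d <= r) / (n * (n - 1))
}

# Hypothetical core: 200 cells in a 100 x 100 window, 40 marker-positive.
xy       <- cbind(runif(200, 0, 100), runif(200, 0, 100))
positive <- rep(c(TRUE, FALSE), c(40, 160))

observed <- k_at_r(xy[positive, ], r = 10, area = 1e4)

# Shuffle positivity over the observed locations and recompute K each time.
permuted <- replicate(100, k_at_r(xy[sample(positive), ], r = 10, area = 1e4))

csr_permuted <- mean(permuted)                    # core-specific CSR estimate
envelope     <- quantile(permuted, c(0.025, 0.975))  # central 95% of permuted K
degree_of_clustering <- observed - csr_permuted

c(observed = observed, csr = csr_permuted, envelope)
```

An observed value outside the envelope would be taken as evidence of clustering (above) or regularity (below) relative to the cells actually present on the core.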

Calculating Exact CSR

The result of averaging Ripley’s K or bivariate Ripley’s K over all possible permutations of the cells on a TMA core or an ROI spot is simply the result of running Ripley’s K or bivariate Ripley’s K on all cells while ignoring their marks. This gives us an exact CSR estimate, and therefore an exact degree of clustering that we can use for associations with survival or other clinical variables.
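This identity can be checked by brute force on a small example. The sketch below enumerates every way of labelling 3 of 8 cells as positive, computes an uncorrected \(K\) for each labelling, and compares the average with \(K\) computed on all 8 cells. The helper `k_at_r`, the simulated coordinates, and the choice of an uncorrected estimator are assumptions made for illustration.

```r
# Uncorrected K at a single radius r (assumption: no edge correction).
k_at_r <- function(xy, r, area) {
  n <- nrow(xy)
  d <- as.matrix(dist(xy))
  diag(d) <- Inf
  area * sum(d <= r) / (n * (n - 1))
}

set.seed(42)
xy <- cbind(runif(8), runif(8))   # hypothetical core with 8 cells

# Every possible way of choosing which 3 of the 8 cells are positive.
labelings <- combn(8, 3)
k_all_labelings <- apply(labelings, 2, function(idx) {
  k_at_r(xy[idx, ], r = 0.4, area = 1)
})

# K on all cells, ignoring marks: the exact CSR value.
exact_csr <- k_at_r(xy, r = 0.4, area = 1)

all.equal(mean(k_all_labelings), exact_csr)
```

Each pair of cells is selected in the same fraction of labellings, which is why the average over labellings collapses to the all-cell value and no random permutations are needed.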

Implementation

Univariate Count-Based Methods

The ripleys_k function reports a permuted and a theoretical estimate of CSR, the observed value of \(K\), and, when permute = TRUE and keep_permutation_distribution = TRUE, the full permutation distribution of \(K\).

Currently, the number of permutations is 10, but this should be increased to at least 100 for a more reliable estimate of the mean.

x <- ripleys_k(mif = x, mnames = mnames_good, 
               num_permutations = 10, method = "K",
               edge_correction = 'translation', r_range = 0:100,
               permute = TRUE,
               keep_permutation_distribution = FALSE, overwrite = TRUE, workers = 1)

# This keeps the colors in every plot consistent for the remainder of the vignette
values = RColorBrewer::brewer.pal(length(unique(x$derived$univariate_Count[[x$sample_id]])), "Accent")
names(values) = unique(x$derived$univariate_Count$deidentified_sample)

x$derived$univariate_Count  %>%
  filter(Marker != 'PDL1..Opal.540..Positive') %>%
  ggplot(aes(x = r, y = `Degree of Clustering Permutation`)) +
  geom_line(aes(color = get(x$sample_id)), show.legend = FALSE) +
  facet_wrap(Marker~., scales = 'free') + theme_bw() + 
  scale_color_manual(values = values)

We can also use the exact CSR approach, which is faster and produces a more accurate degree of clustering.

x <- ripleys_k(mif = x, mnames = mnames_good, 
               num_permutations = 10, method = "K",
               edge_correction = 'translation', r_range = 0:100,
               permute = FALSE,
               keep_permutation_distribution = FALSE, overwrite = TRUE, workers = 1)
#> Joining with `by = join_by(r)`
#> Joining with `by = join_by(r)`
#> Joining with `by = join_by(r)`
#> Joining with `by = join_by(r)`
#> Joining with `by = join_by(r)`

# This keeps the colors in every plot consistent for the remainder of the vignette
values = RColorBrewer::brewer.pal(length(unique(x$derived$univariate_Count[[x$sample_id]])), "Accent")
names(values) = unique(x$derived$univariate_Count$deidentified_sample)

x$derived$univariate_Count  %>%
  filter(Marker != 'PDL1..Opal.540..Positive') %>%
  ggplot(aes(x = r, y = `Degree of Clustering Exact`)) +
  geom_line(aes(color = get(x$sample_id)), show.legend = FALSE) +
  facet_wrap(Marker~., scales = 'free') + theme_bw() + 
  scale_color_manual(values = values)

Positive values of the degree of clustering when using method = 'K' indicate evidence of spatial clustering, while negative values correspond to spatial regularity. We can also observe that the curves from the permute = TRUE method are slightly more jagged than those from the permute = FALSE method (the FOXP3 negative curves in particular).

Sensitivity of r

There is no clear way to select which value of \(r\) to use for a particular analysis. Below we illustrate how the degree of clustering can change with the value of \(r\). Notice that for values of \(r > 20\) there is very little difference in the ordering of the samples (though the degree of spatial clustering can change dramatically), while very small values of \(r\) would give different results. It is typically recommended to pick an \(r\) in a region of interest (small-scale or large-scale clustering) where there is higher variation in the degree of clustering between samples.
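One way to act on this recommendation is to compute the between-sample variance of the degree of clustering at each radius and look for the largest spread inside the range of interest. The sketch below does this on a small made-up table. The data frame `doc`, its column names, and the cutoff of \(r \le 30\) are hypothetical stand-ins for the contents of x$derived$univariate_Count.

```r
# Hypothetical stand-in for the derived table of one marker:
# three samples evaluated at r = 0, 10, ..., 50.
doc <- data.frame(
  sample = rep(c("A", "B", "C"), each = 6),
  r      = rep(seq(0, 50, by = 10), times = 3),
  degree = c(0, 5, 12, 30, 55, 90,
             0, 4, 10, 18, 25, 31,
             0, 6, 15, 22, 40, 60)
)

# Variance of the degree of clustering across samples at each radius.
spread_by_r <- tapply(doc$degree, doc$r, var)
spread_by_r

# Restrict to the range of interest (here small-scale clustering, r <= 30)
# and pick the radius with the largest between-sample spread.
candidates <- spread_by_r[as.numeric(names(spread_by_r)) <= 30]
best_r     <- as.numeric(names(which.max(candidates)))
best_r
```

This is only a screening heuristic. The biological scale of interest should still drive the final choice of \(r\).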

x$derived$univariate_Count %>%
  filter(Marker == 'CD3..CD8.') %>%
  inner_join(x$clinical,.) %>%
  ggplot(aes(shape = status, y = `Degree of Clustering Exact`, x =r)) +
  geom_point(aes(color = get(x$sample_id))) +
  theme_bw() + scale_color_manual(values = values)
#> Joining with `by = join_by(deidentified_sample)`

Bivariate Count-Based Methods

In the univariate case, we consider each cell of a single cell type and center a circle around each cell (the reference cell). In the bivariate case, we are interested in how many cells of Type 1 (Counted) are clustered in proximity to cells of Type 2 (Anchor). Here the circles are centered around the cells of Type 2, and the cells of Type 1 inside them are counted. As with univariate Ripley’s K, we will run 10 permutations to get an average permuted CSR value.
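The counting step described above can be sketched in base R as follows. The example assumes no edge correction and the usual normalization by the product of the two cell counts. The helper `naive_bivariate_k` and the coordinates are hypothetical, and this is not the code behind bi_ripleys_k.

```r
# Naive bivariate K: circles are centred on anchor cells and the counted
# cells that fall inside them are tallied (no edge correction).
naive_bivariate_k <- function(anchor_xy, counted_xy, r, area) {
  # Distance from every anchor cell (rows) to every counted cell (columns).
  dx <- outer(anchor_xy[, 1], counted_xy[, 1], "-")
  dy <- outer(anchor_xy[, 2], counted_xy[, 2], "-")
  d  <- sqrt(dx^2 + dy^2)
  n_pairs <- nrow(anchor_xy) * nrow(counted_xy)
  sapply(r, function(ri) area * sum(d <= ri) / n_pairs)
}

# Hypothetical cells: two anchors (Type 2) and three counted cells (Type 1)
# in a 10 x 10 window.
anchor  <- cbind(x = c(0, 10), y = c(0, 0))
counted <- cbind(x = c(1, 2, 10), y = c(0, 0, 3))

k12 <- naive_bivariate_k(anchor, counted, r = c(1, 3), area = 100)
k12
```

As in the univariate case, the result at each radius can be compared with a CSR estimate to obtain a degree of clustering.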

x <- bi_ripleys_k(mif = x, mnames = mnames_good, 
                  num_permutations = 10, 
                  permute=TRUE,
                  edge_correction = 'translation', r_range = 0:100,
                  keep_permutation_distribution = FALSE, workers = 1)

x$derived$bivariate_Count  %>%
  filter(Anchor == 'CD3..FOXP3.',
         Counted == 'CD3..CD8.') %>%
  ggplot(aes(x = r, y = `Degree of Clustering Permutation`)) +
  geom_line(aes(color = get(x$sample_id)), show.legend = TRUE) +
  theme_bw() + scale_color_manual(values = values)

We can also do the same as with the univariate Ripley’s K and set permute = FALSE in order to get the exact CSR estimate and degree of clustering.

x <- bi_ripleys_k(mif = x, mnames = mnames_good, 
                  num_permutations = 10, 
                  permute=FALSE,
                  edge_correction = 'translation', r_range = 0:100,
                  keep_permutation_distribution = FALSE, 
                  overwrite = TRUE, workers = 1)

x$derived$bivariate_Count  %>%
  filter(Anchor == 'CD3..FOXP3.',
         Counted == 'CD3..CD8.') %>%
  ggplot(aes(x = r, y = `Degree of Clustering Exact`)) +
  geom_line(aes(color = get(x$sample_id)), show.legend = TRUE) +
  theme_bw() + scale_color_manual(values = values)

The interpretation of the degree of clustering is the same here. The line for TMA3_[8,U].tif shows evidence that Tregs tend to cluster around Cytotoxic T cells for all values of \(r\), while the line for TMA1_[3,B].tif indicates spatial repulsion of Tregs by Cytotoxic T cells for \(r > 50\). Again, we can see that the curves for the exact degree of clustering are smoother than those using the permutation method. The permutation method would produce smoother curves with an increase in the number of permutations.

Univariate Nearest-Neighbor Methods

The NN_G function reports a permuted and theoretical estimate of CSR, the observed value for \(G\), and the full permutation distribution of \(G\) when keep_perm_dis = TRUE. The degree of clustering is computed by taking the ratio of the observed \(G\) and either the permutation or theoretical estimate of CSR.

Currently, the number of permutations is 10, but this should be increased to at least 100 for a more reliable estimate of the mean.


x <- NN_G(mif = x, mnames = mnames_good, num_permutations = 10,
                edge_correction = 'rs', r = 0:100, workers = 1)

x$derived$univariate_NN  %>%
  filter(Marker != 'PDL1..Opal.540..Positive') %>%
  ggplot(aes(x = r, y = `Degree of Clustering Permutation`)) +
  geom_line(aes(color = get(x$sample_id))) +
  facet_wrap(Marker~., scales = 'free') + theme_bw() + 
  scale_color_manual(values = values)