Single-cell transcriptomics is a rapidly growing field which is essential to improve our understanding of complex tissues and organisms at an unprecedented resolution. The manual annotation of cell identity, which is the gold standard procedure to determine cell populations, remains a major limitation in most scRNA-seq analysis [1]; more recently, a growing number of resources are being published offering the scientific community many cell type specific gene signature as a reference. Alas, marker gene belonging to gene signatures could be unspecific if taken singularly, specially if that given gene is involved in all the differentiation steps in some cellular population of the dataset (from stem cells to a terminally differentiated population), making it a differentially expressed gene (DEG) in more than one cluster. In this context we decided to enlarge combiroc’s breath to the scRNA-seq analysis field as a signature refiner, in order to find the best few marker genes combinations from an already existing gene signature, regardless the genes’ ranking as differentially expressed. Combiroc is specialized in finding specific combinations, thus its use may allow researchers to find some cells of interest with a high degree of confidence and help in the process of manual annotation by reducing the number of marker genes to consider.
Here we show the complete workflow of combiroc-mediated signature refinement at single-cell resolution in which we used 3 PBMC public datasets showed in Seurat vignettes. All of them have been deeply characterized and scrupulously labelled, so they are accepted and used as reference by the scientific community.
The full story and more comments on this case study can be found in our “Less is more” bioRxiv preprint: Ferrari et al. Combiroc: when ‘less is more’ in bulk and single cell marker signatures. bioRxiv 2022.01.17.476603; doi: https://doi.org/10.1101/2022.01.17.476603
# main libraries:
library(combiroc)
library(dplyr)
library(tidyr)
library(ggplot2)
and also needed for this specific scRNAseq workflow (install them first, if you didn’t already):
library(devtools) # used to install devel-level packages
library(Seurat) # used to process scRNAseq data
library(SeuratData) # used to install and load scRNAseq datasets
library(SeuratDisk) # used to read h5seurat files
A note on executing this vignette
The code described in this vignette was originally run on a jupyter-based R-kernel notebook backed by a high-performance computing (HPC) server infrastructure.
Due to relevant size of some of the objects loaded and/or created, it is unlikely that the whole code can be run on a standard computing equipment (i.e. a laptop).
The computationally heavy passages are therefore marked as [[Heavy-Code]]{style=“color: orange;”} and can be skipped by the user by reading/loading pre-computed objects or intermediate results either from rds
or csv
files.
In this way the reader/reviewer will be able to check the code, to download (if needed) the pre-computed object when indicated and follow the whole workflow without having a HPC server.
The difficulty to distinguish NK (Natural Killer) cells from CD8+ T cells in many scRNAseq datasets is a good challenge to test combiroc performance in markers signature optimization.
The purpose of this analysis is to find the best combinations of known gene markers that best describe the NK-cell populations starting from a NK-cell single-cell transcriptomic signature, then using the models calculated on such combinations on other scRNAseq datasets.
After performing differential marker expression on an initial “training” scRNAseq dataset we will take into account the top 30 differentially expressed genes (DEGs) specific for NK cells cluster; then we will use combiroc to select the best gene combinations among these top 30 NK cells markers.
Regression models from the selected combinations will then be used on other independent datasets (“test” datasets) of the same kind.
As combiroc training dataset we are using the PBMC CITE-seq atlas from Hao et al. 2021 [2]. This dataset can be downloaded as an h5 Seurat data (h5s) from the FredHutch-NYCG atlas page (exact download link: here) [[Heavy-Code]]{style=“color: orange;”}
# read the downloaded h5seurat dataset file (using SeuratDisk functions)
atlas <- LoadH5Seurat("pbmc_multimodal.h5seurat")
# activate the level-1 annotations
Idents(atlas) <- atlas@meta.data$celltype.l1
# overview annotated cell clusters with UMAP
DimPlot(atlas, label = T, repel = T)
To guarantee the best possible performance of the training dataset (the one we extract the markers combinations from) is important to make sure its value ranges are similar to those we wanto to use as test/validation datasets. For this reason we advise using the raw counts matrices for all the datasets, in order to be able to consistently rescale them in the same way by using the function Seurat::ScaleData()
.
The gene markers for NK cells were obtained with a standard Seurat analysis (see here); the Seurat::FindMarkers()
function was used, then markers were ordered by fold change: [[Heavy-Code]]{style=“color: orange;”}
# Performing differential expression analysis
nk.de.markers <- FindMarkers(atlas, ident.1 = "NK", ident.2 = NULL)
nk.de.markers <- nk.de.markers[order(-nk.de.markers$avg_log2FC), ]
nk_genes <- rownames(nk.de.markers)[1:30]
By visualising the expression values of the top 4 markers of NK cells cluster it is evident that the majority of them is also higly expressed in other non-NK clusters (e.g. CD8 T), making them too unspecific to allow a well defined cluster annotation if considered individually.
FeaturePlot(atlas, nk_genes[1:4])
The user not willing (or computationally limited) to run the differential expression analysis above can read the pre-computed object containing the gene signature of the top 30 DEGs (overexpressed in NK cells vs all others):
nk.de.markers <- read.csv('inst/precomp/nk_degs')
nk_genes <- rownames(nk.de.markers)[1:30]
nk_genes
#> [1] "GNLY" "NKG7" "GZMB" "PRF1" "FGFBP2" "GZMA" "KLRD1"
#> [8] "CST7" "SPON2" "KLRF1" "CTSW" "CD247" "CCL5" "CLIC3"
#> [15] "HOPX" "GZMH" "IL2RB" "KLRB1" "TRDC" "CD7" "GZMM"
#> [22] "MYOM2" "FCGR3A" "ARL4C" "ABHD17A" "SYNE2" "CMC1" "EFHD2"
#> [29] "ADGRG1" "JAK1"
Central to the combiroc workflow applied to single cell data is the function seurat_to_combiroc()
.
It takes a Seurat object as input and extracts both the selected marker expression values from the @data slot of a given assay and, optionally, the class of any specific celltype (in our case ‘NK’ or ‘Other’); then it directly assembles a dataframe compatible with combiroc workflow either for training (finding combinations and models) or testing purposes (using previously found combinations).
Once we have a Seurat object obtained from a standard Seurat protocol, it needs to be converted in a combiroc-ready version in order to perform the training procedure, i.e. finding the best markers combinations and models, with combiroc’s functions.
Such a train
object can be obtained with the seurat_to_combiroc()
function: [[Heavy-Code]]{style=“color: orange;”}
train <- seurat_to_combiroc(SeuratObject = atlas,
gene_list = nk_genes,
assay = 'SCT',
labelled_data = T,
case_class = 'NK',
case_label = 'NK',
control_label = 'Other')
The pre-computed train
object obtained with the code above can be downloaded from here, then directly read:
train <- readRDS(file = "inst/precomp/train.rds")
head(train)
#> ID Class ABHD17A ADGRG1 ARL4C CCL5 CD247
#> 1 L1_AAACCCAAGAAACTCA Other 0.6931472 0.0000000 0.000000 0.0000000 0.0000000
#> 2 L1_AAACCCAAGACATACA Other 0.0000000 0.0000000 1.609438 0.0000000 0.6931472
#> 3 L1_AAACCCACAACTGGTT Other 0.0000000 0.0000000 0.000000 0.0000000 0.0000000
#> 4 L1_AAACCCACACGTACTA NK 1.3862944 0.6931472 1.609438 2.4849066 1.7917595
#> 5 L1_AAACCCACAGCATACT Other 0.0000000 0.0000000 1.386294 0.6931472 0.6931472
#> 6 L1_AAACCCACATCAGTCA Other 0.0000000 0.0000000 2.302585 2.9957323 1.6094379
#> CD7 CLIC3 CMC1 CST7 CTSW EFHD2 FCGR3A FGFBP2
#> 1 0.0000000 0.000000 0.000000 0.000000 0.000000 1.0986123 0.000000 0.000000
#> 2 1.6094379 0.000000 0.000000 0.000000 0.000000 0.6931472 0.000000 0.000000
#> 3 0.6931472 0.000000 0.000000 0.000000 0.000000 0.0000000 0.000000 0.000000
#> 4 1.6094379 2.197225 1.945910 2.944439 1.945910 1.3862944 1.386294 2.079442
#> 5 0.6931472 0.000000 0.000000 0.000000 1.098612 0.6931472 0.000000 0.000000
#> 6 0.0000000 0.000000 2.302585 2.079442 0.000000 0.0000000 0.000000 0.000000
#> GNLY GZMA GZMB GZMH GZMM HOPX IL2RB JAK1
#> 1 0.000000 0.000000 0.000000 0.00000 0.0000000 0.000000 0.000000 0.6931472
#> 2 0.000000 0.000000 0.000000 0.00000 0.0000000 0.000000 0.000000 1.3862944
#> 3 0.000000 0.000000 0.000000 0.00000 0.6931472 0.000000 0.000000 0.0000000
#> 4 3.044522 1.386294 2.772589 1.94591 1.3862944 1.609438 1.791759 1.9459101
#> 5 0.000000 0.000000 0.000000 0.00000 1.0986123 0.000000 0.000000 0.6931472
#> 6 0.000000 1.098612 0.000000 0.00000 1.9459101 0.000000 0.000000 1.3862944
#> KLRB1 KLRD1 KLRF1 MYOM2 NKG7 PRF1 SPON2 SYNE2
#> 1 0.000000 0.000000 0.000000 0.000000 0.0000000 0.000000 0.000000 0.0000000
#> 2 2.079442 0.000000 0.000000 0.000000 0.0000000 0.000000 0.000000 0.0000000
#> 3 0.000000 0.000000 0.000000 0.000000 0.6931472 0.000000 0.000000 0.6931472
#> 4 1.386294 1.609438 1.386294 1.791759 3.9120230 1.791759 2.397895 1.0986123
#> 5 0.000000 0.000000 0.000000 0.000000 0.0000000 0.000000 0.000000 0.0000000
#> 6 0.000000 0.000000 0.000000 0.000000 2.9957323 0.000000 0.000000 1.7917595
#> TRDC
#> 1 0.000000
#> 2 0.000000
#> 3 0.000000
#> 4 1.791759
#> 5 0.000000
#> 6 0.000000
Then, as described in the main combiroc vignette, we used combiroc_long
to make the data in long tidy format, fit for further processing:
train_long <- combiroc_long(train)
Given a list of markers, combiroc assesses the performance of all combinations of such markers.
The computational load of this process can be high for more than 10 markers.
A list of \(n\) markers generates \({2^{n} - 1 }\), thus for \(n=30\) we have more than a billion possible combinations.
Mathematically, for combinations here we mean combinations without repetition, i.e. a subset of \(k\) distinct elements of a set with \(n\) elements, that can be calculated as the binomial coefficient \(\binom{n}{k}\).
The purpose of combiroc is always finding the best optimized combination of markers, i.e. a subset of the original full signature which, despite the much smaller number of markers, retains the discriminatory power of the original signature or it’s even better. This is particularly important in the field of diagnostics where a smaller set of marker has a bigger translational potential (see our first combiroc paper Mazzara et al. 2017 for a discussion on this).
Similarly, here we will limit the search to combinations composed by no more than 5 markers: to do so we will set the max_length = 5
attribute of the combi()
function.
With this limitation the number of combinations to compute drops to 174,436, which is computationally more manageable.
$$ Combinations(upTo5Markers) = \sum_{i=1}^5 \binom{n}{i} $$
#> n_markers n_combs tot_combs
#> 1 1 30 30
#> 2 2 435 465
#> 3 3 4060 4525
#> 4 4 27405 31930
#> 5 5 142506 174436
The combi()
function works on the train
dataset computing the marker combinations and counting their corresponding positive samples in each class (once thresholds are selected).
A sample, to be considered positive for a given combination, must have a value higher than a given signal threshold (signalthr) for at least a given number of markers composing that combination (combithr).
As described in the combiroc’s vignette for the standard workflow, the argument signalthr in the combi()
function should be set according to the guidelines and characteristics of the methodology used for the analysis or by an accurate inspection of the signal intensity distribution. In the event specific guidelines are missing, one should set the value signalthr as suggested by the distr$Density_plot
feature.
Single cell RNA-seq datasets are exactly one of such cases: it is often neither possible nor easy to extrapolate the best signal threshold, due to the highly scattered distribution of the expression values: here we thus take advantage of the package-specific distr$Density_plot
feature and will let the package find this value without the user intervention.
The core of a combiroc analysis are the two marker_distribution()
and combi()
functions: changing the value of key parameters in these functions we can modulate the stringency and severity of the analysis.
To obtain a performing and stringent signature refinement we set combithr
argument to 2 in combi()
: this is the minimum number of positively expressed genes (i.e. the minimum number of genes that need to reach the signal threshold) to consider the whole combination as a valid one.
The plots below show expression values (signal) distribution obtained with the markers_distributions()
functions on the training dataset in long format (train_long
).
Differently from two well-defined classes (as seen in the standard combiroc workflow), it’s not intuitive where to put the signal threshold, but the function suggests a signal intensity value of 0.9.
Genes expression values, as expected, are distributed in different ways in the two classes:
Please note: the markers_distribution()
command naturally triggers a few warnings in which the user is reminded how to use the command arguments
distr <- markers_distribution(train_long,
signalthr_prediction = TRUE,
case_class = "NK")
distr$Density_plot