1 Preliminaries

1.1 Requirements - Hardware

A 32 or 64 bit processor (recommended: 64bit)
1GHz processor (recommended: multicore for a comprehensive analysis)
2GB MB RAM (recommended: 4GB+, depends on the size of experimental data)

1.2 Requirements - Software

The main software requirement is the installation of the R environment (version: >= 3.2), which can be downloaded from R project website and is distributed for all common operating systems. We tested the package in R environment installed on Windows 7, 10; Mac OS X 10.11 - 10.13 and Ubuntu 18.04 with no significant differences in the performance. The use of a dedicated Integrated development environment (IDE), e.g. posit is recommended.

Apart from a base installation of R, SLEMI requires the following R packages:

for installation

devtools

for estimation

e1071
Hmisc
nnet
glmnet
caret
doParallel (if parallel computation are needed)

for visualisation

ggplot2
ggthemes
gridExtra
corrplot

for data handling

reshape2
stringr
plyr

Each of the above packages can be installed by executing

install.packages("name_of_a_package")

in the R console.

Importantly, during installation availability of the above packages will be verified and missing packages will be automatically installed.

1.3 Installation

The package can be directly installed from GitHub. For installation, open RStudio (or base R) and run following commands in the R console

install.packages("devtools") # run if 'devtools' is not installed
library(devtools)
install_github("sysbiosig/SLEMI")

Are required packages not found, they will be installed automatically.

1.4 Citing and support

The package implements methods published in PLOS Computational Biology, please cite:

Jetka T, Nienałtowski K, Winarski T, Błoński S, Komorowski M (2019) Information-theoretic analysis of multivariate single-cell signaling responses. PLoS Comput Biol 15(7): e1007132. https://doi.org/10.1371/journal.pcbi.1007132

All problems, issues and bugs can be reported here:

https://github.com/sysbiosig/SLEMI/issues

or directly via e-mail: t.jetka a t gmail.com.

2 Structure of the package

The three functions listed below constitute the key wrapper (interface) functions of the package.

mi_logreg_main() enables calculation of the mutual information
capacity_logreg_main() enables calculation of the information capacity
prob_discr_pairwise() serves to calculate probabilities of correct discrimination between pairs of input values

The function capacity_logreg_main() triggers

preprocessing of the data
estimation of channel capacity
running diagnostic procedures
visualisation.

Each of the above steps is implemented within auxiliary functions as presented in the Figure 1 below.

The algorithm to compute the information capacity is implemented within the function capacity_logreg_algorithm(), which uses logistic regression from the nnet package.

Diagnostic procedures (significance and uncertainties of estimates) are provided in an internal function capacity_logreg_testing(). These are based on data bootstrapping and overfitting test.

For visualization, a set of graphs is created by an internal function capacity_output_graphs() and saved in a specified directory. In addition, capacity_logreg_main() returns a list with capacity estimates, optimal input probability distribution, diagnostic measures and other summary information about the analysis.

The function mi_logreg_main() serves to calculate the mutual information. It initiates similar steps as the function capacity_logreg_main() but without performing the optimization of the distribution of the input. Instead, it requires the input distribution to be specified by the user as a function’s argument.

Logistic regression and Monte Carlo methods, following an analogous algorithm as within the capacity_logreg_algorithm() function, are combined to estimate mutual information within a function mi_logreg_algorithm(). Visualisation and diagnostics are carried out by the same set of auxiliary functions as for channel capacity (internal functions capacity_output_graphs() and capacity_logreg_testing()).

The prob_discr_pairwise() allows to estimate probabilities of correct discrimination between two different values of the input. It implements estimation of probabilities of correct classification by logistic regression (from nnet package) for each pair of input values. The probabilities of correct discrimination are visualized with a graph composed of pie charts.

Main function to estimate channel capacity

2.1 Formulation of the problem

SLEMI package is designed to estimate information-theoretic measures between a discrete-valued input, \(X\), and multivariate, continuous output, \(Y\). In a typical experiment aimed to quantify information flow a given signaling system, input values \(x_1\leq x_2 \ldots... \leq x_m\), ranging from 0 to saturation are considered.

Then, for each input level, \(x_i\),\(n_i\) observations are collected, which are represented as vectors \[y^i_j \sim P(Y|X = x_i)\] Within information theory the degree of information transmission is measured as the mutual information \[MI(X,Y) = \sum_{i=1}^{m} P(x_i)\int_{R^k} P(y|X = x_i)log_2\frac{P(y|X = x_i)}{P(y)}dy\] where \(P(y)\) is the marginal distribution of the output. MI is expressed in bits and \(2^{MI}\) can be interpreted as the number of inputs that the system can resolve on average.

The maximization of mutual information with respect to the input distribution, \(P(X)\), defines the information capacity, \(C^*\). Formally, \[C^* = max_{P(X)} MI(X,Y)\] Information capacity is expressed in bits and \(2^{C^*}\) can be interpreted as the maximal number of inputs that the system can effectively resolve. For details regarding information theory or its application in systems biology please see Methods section and Supplementary Information of the corresponding paper (Jetka et al. 2019).

2.2 Input data

Functions mi_logreg_main(), capacity_logreg_main(), prob_discr_pairwise() require data in the form of the object data.frame with a specific structure of rows and columns. Responses \(y^i_j\) are assumed to be measured for a finite set of stimuli levels \(x_1,x_2,\ldots,x_m\). The responses \(y^i_j\) can be multidimensional. Usually, experimental dataset is represented as a table with rows and columns organized as shown in Figure 2.

Therefore, the input data frame is expected to have the form represented by the above table, which can be formally described by the following conditions

each row represent a response of a single cell
first column contains values of the input (X).
second and subsequent columns contain values of the measured output(s); these columns should be of type numeric; order and number of outputs should be the same for all cells.
the number of unique values of the input should be finite
a large number of observations, possibly >100, per input value is required.

Conceptual representation of a generic experimental dataset needed for quantifying information transmission of a channel

An example of the input data.frame, which contains the measurements of the NfkB system presented in the MP is available within the package under the variable data_nfkb. It has the following format

	signal	response_0	response_3	response_21
1	0ng	0.3840744	0.4252835	0.5643459
2	0ng	0.4709216	0.5777821	0.2962640
3	0ng	0.4274474	0.6696011	0.5696369
10001	8ng	0.3120216	0.3475484	9.7036535
10002	8ng	0.2544961	0.6611051	8.1628482
10003	8ng	0.1807391	0.4336810	5.3928484
11540	100ng	1.3534083	3.0158004	6.8983046
11541	100ng	1.7007936	2.2224497	2.8677178
11542	100ng	0.1997087	0.2886905	5.8193494

where each row represents measurements of a single-cell, the column named signal specifies the level of stimulation, while response_T is the response of the NfkB system in an individual cell at time point T. The above table can be shown in R by calling

library(SLEMI)
rbind(data_nfkb[1:3,1:4],data_nfkb[10001:10003,1:4],tail(data_nfkb[,1:4],3))

2.3 Calculation of the information capacity

Calculation of the information capacity with default settings is performed by the command

capacity_logreg_main(dataRaw, signal, response, output_path)

where the required arguments are

dataRaw - data frame with column of type factor containing values of input (X) and columns of type numeric containing values of output (Y), where each row represents a single observation
signal - a character which indicates the name of the column in dataRaw with values of input (X)
response - a character vector which indicates names of columns in dataRaw with values of output (Y)
output_path - a character with the directory, to which output should be saved

The function returns a list with the following elements

cc - a numeric scalar with channel capacity estimate (in bits)
p_opt - a numeric vector with the optimal input distribution
model - a nnet object describing fitted logistic regression model
data - a data.frame with the raw experimental data (if data_out=TRUE)
time - processing time of the algorithm
params - a vector of parameters used in the algorithm
regression - a confusion matrix of logistic regression predictions

By default, all returned elements are saved in output_path directory in a file output.rds. Along with the output data, results of the computations are visualised as the graphs listed below

MainPlot.pdf - a simple summary plot with basic distribution visualization and capacity estimate
capacity.pdf - a diagram presenting the capacity estimates
data_boxplots.pdf - boxplots of data
data_MeanViolin.pdf - violin plots of data with input-output relation curve (of means)

2.4 Calculation of the mutual information

The function mi_logreg_main() takes a similar list of arguments and generates analogous plots to the function capacity_logreg_main(). The differences are listed below.

Firstly, user must specify the distribution of input that should be used for calculation of the mutual information. It is done by passing a numeric vector via the argument pinput of mi_logreg_main() function. Secondly, the returned list stores the value of the computed mutual information (in bits) under the element mi.

2.5 Calculation of the probabilities of correct discrimination

Calculation of the probabilities of correct discrimination between pairs of input values is performed by running the following command

prob_discr_pairwise(dataRaw, signal, response, output_path)

where the required arguments are analogous to the arguments of the functions capacity_logreg_main() and mi_logreg_main(). The probabilities of correct discrimination are computed for each pair of unique input values and returned as a list with the following elements

prob_matr - a symmetric numeric matrix with a probability of discriminating between \(i\)-th and \(j\)-th input values in cell (i,j)
diagnostics - a list of summaries describing fitted logistic regression models of classification between each pair of input values.

In addition, a plot of corresponding pie charts is created in output_path in the pdf format.

3 Diagnostic procedures

In addition to the sole calculation of the information capacity, the function capacity_logreg_main() can also be used to asses accuracy of the channel capacity estimates resulting from potentially insufficient sample size and potential over-fitting of the regression model. Two test are implemented. Precisely, the function can perform

Bootstrap test - capacity is re-calculated using \(\alpha\)% of data, sampled from the original dataset without replacement. After repeating the procedure \(n\) times, standard deviation of the obtained sample can serve as an error of the capacity estimate.
Over-fitting test - the original data is divided into Training and Testing datasets. Then, logistic regression is estimated using \(\alpha\)% of data (training dataset), and integrals of channel capacity are calculated via Monte Carlo using remaining \((1-\alpha)\)% of data (testing dataset). It is repeated \(n\) times.

In order to perform diagnostic tests, that by default are turned off, user must set the value of the input argument

testing = TRUE (default=FALSE)

In addition, settings of the diagnostic test can be altered by changing the following parameters

TestingSeed (default= 1234) - the seed for the random number generator used to sample original dataset,
testing_cores (default= 4) - a number of cores to use (via doParallel package) in parallel computing,
boot_num (default= 40) - a number of repetitions of the bootstrap ,
boot_prob (default= 0.8) - a fraction of initial observations to use in the bootstrap,
traintest_num (default= 40) - a number of repetitions of the overfitting test,
partition_trainfrac (default= 0.6) - a fraction of initial observations to use as a training dataset in the overfitting test

4 Additional functionalities of the function `capacity_logreg_main()`

In addition, to the basic functionalities described above, the function capacity_logreg_main() allows to control several other parameters of the algorithm that computes the information capacity. These parameters and their effects are listed below.

model_out (default=TRUE) - logical, specify if nnet model object should be saved into output file
plot_width (default = 6) - numeric, the basic width of created plots
plot_height (default = 4) - numeric, the basic height of created plots
scale (default = TRUE) - logical, value indicating if the columns of dataRaw are to be centered and scaled, what is usually recommended for the purpose of stability of numerical computations. From a purely theoretical perspective, such transformation does not influence the value of channel capacity.
lr_maxit (default = 1000) - a maximum number of iterations of fitting step of logistic regression algorithm in nnet function. If a warning regarding lack of convergence of logistic model occurs, should be set to a larger value (possible if data is more complex or of a very high dimension).
MaxNWts (default = 5000) - a maximum number of parameters in logistic regression model. A limit is set to prevent accidental over-loading the memory. It should be set to a larger value in case of exceptionally high dimension of the output data or very high number of input values. In principle, logistic model requires fitting \((m-1)\cdot(d+1)\) parameters, where \(m\) is the number of unique input values and \(d\) is the dimension of the output.

The latter two parameters, i.e lr_maxit and MaxNWts, allow to change the parameters of the logistic regression model fitting within the dependent nnet package.

5 Examples

5.1 Minimal example

Below, we present a minimal model that may serve as a quick introduction to computations within the package. Precisely, we consider a system

with four different input values \(X\): 0, 0.1, 1 and 10
with the conditional output, \(Y|X=x\), give by a one-dimensional log-normal distribution \(\exp\{\mathcal{N}(10\cdot\frac{x}{1+x},1)\}\)
and the sample consisting of 1000 observations for each input value.

The example is analogous to the Test scenario 2 of the Supplementary Information of (Jetka et al. 2019) (Section 3.2).

Input data

Firstly, we generate a a synthetic dataset. The data corresponding to the model can be generated, and represented as the data frame tempdata with columns input and output, by running

xs=c(0,0.1,1,10) # concentration of input.
tempdata = data.frame(input = factor(c(t(replicate(1000,xs))),
                      levels=xs),
                      output =  c(matrix(rnorm(4000, mean=10*(xs/(1+xs)),sd=c(1,1,1,1)),
                                            ncol=4,byrow=TRUE) ))

The generated data.frame has the following structure

	input	output
1	0	-0.6747968
2	0	0.9103099
2001	1	5.3381979
2002	1	5.0503969
3999	10	7.0037857
4000	10	10.2305361

Calculation of the information capacity

The Information capacity can be calculated using the capacity_logreg_main() function that takes the data frame “tempdata” as dataRaw argument. Column names “input” and “output” are used as arguments signal and response, respectively. The output_path is set as “minimal_example/”. Therefore, the function is run as follows

tempoutput  <- capacity_logreg_main(dataRaw=tempdata, 
                                    signal="input", response="output", 
                                    output_path="minimal_example/")

Results of the computations are returned as a data structure described before. In addition, results are presented in the form of the following graph (by default saved as MainPlot.pdf in minimal_example/ directory). It represents the input-output data and gives the corresponding channel capacity.

Standard output graph of the minimal working example

Calculation of the mutual information

To compare mutual information of experimental data with its channel capacity, we can run (uniform distribution of input values is assumed, as default)

tempoutput_mi  <- mi_logreg_main(dataRaw=tempdata, 
                                    signal="input", response="output", 
                                    output_path="minimal_exampleMI/",
                                    pinput=rep(1/4,4))

and display results

print(paste("Mutual Information:", tempoutput_mi$mi,"; ",
            "Channel Capacity:", tempoutput$cc, sep=" "))

## [1] "Mutual Information: 1.48828226407725 ;  Channel Capacity: 1.54565042478486"

Alternatively, the distribution of the input can be defined with probabilities \((0.4,0.1,0.4,0.1)\)

tempoutput_mi  <- mi_logreg_main(dataRaw=tempdata, 
                                    signal="input", response="output", 
                                    output_path="minimal_exampleMI/",
                                    pinput=rc(0.4,0.1,0.4,0.1))

and display results

print(paste("Mutual Information:", tempoutput_mi$mi,"; ",
            "Channel Capacity:", tempoutput$cc, sep=" "))

## [1] "Mutual Information: 1.33866963524711 ;  Channel Capacity: 1.54565042478486"

Calculation of the probabilities of correct discrimination

Probabilities of correct discrimination between input values are calculated as follows

tempoutput_probs  <- prob_discr_pairwise(dataRaw=tempdata, 
                                    signal="input", response="output", 
                                    output_path="minimal_exampleProbs/")

The above command generates graph shown in Figure 4 in the output directory

Standard output graph presenting probabilities of correct discrimination between each pair of input values.

Diagnostics

The diagnostic test can be performed as follows

dir.create("example1_testing/")
outputCLR=capacity_logreg_main(dataRaw=data_example1,
                            signal="signal",response="response",
                            output_path="example1_testing/",
                            testing=TRUE, TestingSeed = 1234, testing_cores = 4, 
                            boot_num = 40, boot_prob = 0.8, 
                            traintest_num = 40,partition_trainfrac = 0.6)

It will run diagnostics with 40 re-sampling of the data, where bootstrap is calculated using 80% of the data, while the over-fitting test uses 60% of the original dataset.

Its results are provided in graph presented in Figure 5.

Standard output graph of the diagnostic procedures. P-values (PV) are based on empirical test either left- or right- sided. In the top axis, black dot represents the estimate of the channel capacity that involves the compete dataset, red dot is the mean of bootstrap procedures, while the bars are mean +/- sd. The remaining panels are histograms of all repetitions of a specific diagnostic procedure.

The top diagram shows the value of the capacity estimate (in black) obtained from the complete dataset and the mean value of bootstrap repetitions with indicated +/- standard deviation (in red). Plots that follow show histograms of calculated capacities for different diagnostic regimes. The black dot represents the estimate of the channel capacity based on the complete dataset. In addition, corresponding empirical p-values of both tests (left- and right-sided) are calculated to assess the randomness of obtained results (PV in the plots).

A reliable estimation of the information capacity should yield the following results of the bootstrap and overfitting tests.

The bootstrap test should yield distribution of the capacity estimates with small variance. In addition, the capacity estimated based on the complete dataset should not be an outlier (p-value>0.05). Otherwise, it would indicate that the sample size is too low for an accurate estimation of the channel capacity.
The over-fitting test should provide similar results. The capacity estimate obtained based on the complete dataset should lie within the distribution of capacities generated in the test. In the opposite case, it could mean that the logistic regression model does not fully grasp the essential aspects of input-output dependencies in the data.

5.2 Further step-by-step introductory examples

Two step-by-step examples that further illustrate the applicability of the SLEMI package are provided in the Section 6 of the ‘Testing procedures’ pdf file that is added to the publication (Jetka et al. 2019) and can be found here.

5.3 Examples in paper

To reproduce results of the NFkB analysis presented in the publication, see Section 7 of the ‘Testing procedures’ pdf file that is added to the publication (Jetka et al. 2019) and can be found here.

6 Session Info

sessionInfo()

## R version 4.3.1 (2023-06-16)
## Platform: x86_64-conda-linux-gnu (64-bit)
## Running under: Ubuntu 22.04.3 LTS
## 
## Matrix products: default
## BLAS/LAPACK: /opt/conda/lib/libopenblasp-r0.3.24.so;  LAPACK version 3.11.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: Etc/UTC
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] grid      stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
## [1] SLEMI_1.0.2    reshape2_1.4.4 stringr_1.5.0  gridExtra_2.3  ggplot2_3.4.4 
## 
## loaded via a namespace (and not attached):
##  [1] gtable_0.3.4         xfun_0.40            bslib_0.5.1         
##  [4] recipes_1.0.8        lattice_0.21-9       vctrs_0.6.4         
##  [7] tools_4.3.1          generics_0.1.3       stats4_4.3.1        
## [10] parallel_4.3.1       proxy_0.4-27         tibble_3.2.1        
## [13] fansi_1.0.5          ModelMetrics_1.2.2.2 pkgconfig_2.0.3     
## [16] Matrix_1.6-1.1       data.table_1.14.8    lifecycle_1.0.3     
## [19] compiler_4.3.1       munsell_0.5.0        codetools_0.2-19    
## [22] htmltools_0.5.6.1    class_7.3-22         sass_0.4.7          
## [25] yaml_2.3.7           prodlim_2023.08.28   pillar_1.9.0        
## [28] jquerylib_0.1.4      MASS_7.3-60          cachem_1.0.8        
## [31] gower_1.0.1          iterators_1.0.14     rpart_4.1.21        
## [34] foreach_1.5.2        nlme_3.1-163         parallelly_1.36.0   
## [37] lava_1.7.2.1         tidyselect_1.2.0     digest_0.6.33       
## [40] stringi_1.7.12       future_1.33.0        dplyr_1.1.3         
## [43] purrr_1.0.2          listenv_0.9.0        splines_4.3.1       
## [46] fastmap_1.1.1        colorspace_2.1-0     cli_3.6.1           
## [49] magrittr_2.0.3       survival_3.5-7       utf8_1.2.3          
## [52] e1071_1.7-13         future.apply_1.11.0  withr_2.5.1         
## [55] scales_1.2.1         lubridate_1.9.3      timechange_0.2.0    
## [58] rmarkdown_2.25       globals_0.16.2       nnet_7.3-19         
## [61] timeDate_4022.108    evaluate_0.22        knitr_1.44          
## [64] hardhat_1.3.0        caret_6.0-94         rlang_1.1.1         
## [67] Rcpp_1.0.11          glue_1.6.2           pROC_1.18.4         
## [70] ipred_0.9-14         jsonlite_1.8.7       R6_2.5.1            
## [73] plyr_1.8.9

References

Jetka, Tomasz, Karol Nienałtowski, Tomasz Winarski, Sławomir Błoński, and Michał Komorowski. 2019. “Information-Theoretic Analysis of Multivariate Single-Cell Signaling Responses.” PLOS Computational Biology 15 (7): e1007132. https://doi.org/10.1371/journal.pcbi.1007132.

Statistical Learning based Estimation of the Mutual Information (SLEMI) - R package

User Manual

T. Jetka¹, K. Nienałtowski, M. Komorowski

19 November 2023

1 Preliminaries

1.1 Requirements - Hardware

1.2 Requirements - Software

1.3 Installation

1.4 Citing and support

2 Structure of the package

2.1 Formulation of the problem

2.2 Input data

2.3 Calculation of the information capacity

2.4 Calculation of the mutual information

2.5 Calculation of the probabilities of correct discrimination

3 Diagnostic procedures

4 Additional functionalities of the function `capacity_logreg_main()`

5 Examples

5.1 Minimal example

5.2 Further step-by-step introductory examples

5.3 Examples in paper

6 Session Info

References

Statistical Learning based Estimation of the Mutual Information (SLEMI) - R package

User Manual

T. Jetka1, K. Nienałtowski, M. Komorowski

19 November 2023

1 Preliminaries

1.1 Requirements - Hardware

1.2 Requirements - Software

1.3 Installation

1.4 Citing and support

2 Structure of the package

2.1 Formulation of the problem

2.2 Input data

2.3 Calculation of the information capacity

2.4 Calculation of the mutual information

2.5 Calculation of the probabilities of correct discrimination

3 Diagnostic procedures

4 Additional functionalities of the function capacity_logreg_main()

5 Examples

5.1 Minimal example

5.2 Further step-by-step introductory examples

5.3 Examples in paper

6 Session Info

References

T. Jetka¹, K. Nienałtowski, M. Komorowski

4 Additional functionalities of the function `capacity_logreg_main()`