---
title: "Multivariate tools for compositional data analysis: the ToolsForCoDA package"
author:
- Jan Graffelman - Dpt. of Statistics, Universitat Politecnica de Catalunya; Dpt. of Biostatistics, University of Washington 
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
bibliography: ToolsForCoDa.bib
vignette: >
  %\VignetteIndexEntry{Multivariate tools for compositional data analysis: the ToolsForCoDA package}
  %\VignetteEncoding{UTF-8}
  %\VignetteEngine{knitr::rmarkdown}
editor_options: 
  chunk_output_type: console
---


```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
knitr::opts_chunk$set(fig.align = "center")
knitr::opts_chunk$set(warnings = FALSE)
knitr::opts_chunk$set(message = FALSE)
knitr::opts_chunk$set(fig.width = 6, fig.height = 6) 
rm(list=ls())
```

## Introduction

<div style="text-align: justify"> 

The **ToolsForCoDa** package was originally created in order to provide functions for a canonical correlation analysis with compositional data (@Graffelman2018), based on the centred logratio (clr) transformation of the compositions. Posteriorly, it has been extended with additional tools for the multivariate analysis of compositional data in the R environment. Currently, this package (version 1.1.0) provides functionality for

* log-ratio principal component analysis (LR-PCA). 
* log-ratio canonical correlation analysis (LR-CCO).
* log-ratio discriminant analysis (LR-LDA).

Both CCO and LDA rely on the inversion of a covariance matrix. The covariance matrix of the clr transformed compostions is structurally singular. The programs `lrcco` and `lrlda` resolve this with the use of a generalized inverse. Functionality for the analysis of compositional data in the R environment can be found in the packages **compositions** (@compositions), **robCompositions** (@robCompositions), **easyCODA** (@easyCODA) and **zCompositions** (@zCompositions). For further reading on compositional data, see the seminal text on compositional data by Aitchison (@Aitchison) and several recent statistical textbooks (@Pawlowsky, @Filzmoser, @Greenacre, @VanDenBoogaart). 

The remainder of this vignette shows an example session showing how to perform the three aforementioned types of analysis.

1. [Installation](#installation)

2. [LR-PCA](#lrpca)

3. [LR-CCO](#lrpco)

4. [LR-LDA](#lrlda)

## 1. Installation<a name="installation"></a>

```{r preinstall}
#install.packages("ToolsForCoDa")
library(calibrate)
library(Correlplot)
library(ToolsForCoDa)
```

## 2. Logratio principal component analysis (LR-PCA)<a name="lrpca"></a>

We consider the composition of 37 Pinot Noir samples, consisting of the concentrations of Cd, Mo, Mn, Ni, Cu, Al, Ba, Cr, Sr, Pb, B, Mg, Si, Na, Ca, P, K and an evaluation of the wine's aroma. (@FrankKowalski).

```{r pinotnoir}
data(PinotNoir)
head(PinotNoir)
Aroma <- PinotNoir[,c("Aroma")]
```

We apply closure to the chemical concentrations by division by their total, and
use `lrpca` to do perform LR-PCA.


```{r closure}
X  <- as.matrix(PinotNoir[,1:17])
Xc <- X/rowSums(X)
out.lrpca <- lrpca(Xc)
```

We study the decomposition of compositional variance, and the decay of the LR-PCA eigenvalues by means of a screeplot

```{r decomposition}
round(out.lrpca$decom,2)
plot(1:ncol(out.lrpca$decom),
     out.lrpca$decom[1,],type="b",main="Scree-plot",
     xlab="PC",ylab="Eigenvalue")
```

We construct a covariance biplot, using `jointlim` to establish sensible limits for the x and y axes. Column markers for the clr transformed variables are multiplied by a constant (2.5) for a better visualization, and the amount of explained variance is indicated on the coordinate axes.


```{r biplot}
lims <- jointlim(out.lrpca$Fs,2.5*out.lrpca$Gp)
bplot(out.lrpca$Fs,2.5*out.lrpca$Gp,rowlab="",collab=colnames(Xc),rowch=1,colch=NA,
      xl=lims$xlim,yl=lims$ylim,main="Covariance biplot")

pc1lab <- paste("PC1 (",toString(round(100*out.lrpca$decom[2,1],1)),"%)",sep="")
pc2lab <- paste("PC2 (",toString(round(100*out.lrpca$decom[2,2],1)),"%)",sep="")

text(1,-0.1,pc1lab,cex=0.75)
text(0.1,1.5,pc2lab,cex=0.75,srt=90)
```

This biplot reveals that the logratio $\ln{(Na/Pb)}$ has a large variance and is tightly correlated to the first principal component. 

The variable `Aroma` correlates with the first principal components

```{r aroma}
cor(Aroma,out.lrpca$Fs[,1:2])
```

and as the biplot suggests, `Aroma` correlates positively with the logratio $\ln{(Cr/Sr)}$

```{r correlation}
logScSr <- log(Xc[,c("Cr")]/Xc[,c("Sr")])
cor(Aroma,logScSr)
```

We note function `lrpca` also calculates condition indices, which may prove useful for detecting proportionality or one-dimensional relationships (@Graffelman2021).

## 3. Logratio canonical correlation analysys (LR-CCO)<a name="lrcco"></a>

Two examples of LR-CCO are given below. The first example concerns a small artificial data set, where both the X and Y set are compositional, and is described in Section 3.1 of Graffelman et al. (2018). The second example concerns major oxides compositions of bentonites, where the X set is compositional and Y set is not.

### 3.1 Artificial data

We first load two artificial 3-part compositions.

```{r artificial}
data("Artificial")
Xsim.com <- Artificial$Xsim.com
Ysim.com <- Artificial$Ysim.com
colnames(Xsim.com) <- paste("X",1:3,sep="")
colnames(Ysim.com) <- paste("Y",1:3,sep="")
```

We make the ternary diagrams of the two sets of compositions

```{r ternaries, fig.width = 6, fig.height = 3}
opar <- par(mfrow=c(1,2),mar=c(2,2,1,0)+0.5,pty="s")
par(mfg=c(1,1))
  ternaryplot(Xsim.com,pch=1)
par(mfg=c(1,2))
  ternaryplot(Ysim.com,pch=1)
par(opar)

```

We perform the compositional canonical analysis:

```{r lrcco}
out.lrcco <- lrcco(Xsim.com,Ysim.com)
```

And we reproduce the results in Table 1 of Graffelman et al. (2018). The canonical correlations are obtained as

```{r cancortab}
round(diag(out.lrcco$ccor),digits=3)
```

The canonical weights of the X set and the Y set are obtained by:

```{r canweights}
out.lrcco$A
out.lrcco$B
```

The canonical loadings of the X set and the Y set are obtained by

```{r canloadings}
out.lrcco$Rxu
out.lrcco$Ryv
```

The adequacy coefficients of the X set and the Y set:

```{r canadequacy}
out.lrcco$fitXs
out.lrcco$fitYs
```

The redundancy coefficients of the X set and the Y set

```{r canredundancey}
out.lrcco$fitXp
out.lrcco$fitYp
```

Finally, we make the full set of biplots for LR-CCO given in Figure 2 (@Graffelman2018). In each biplot, the canonical variates are multiplied by a convenient scalar to facilitate
the visualization.


```{r panelbiplots}
opar <- par(mfrow=c(2,2),mar=c(2,2,1,0)+0.5,mgp=c(2,1,0))
par(mfg=c(1,1))
#
# Figure A
#
Z <- rbind(out.lrcco$Fs,out.lrcco$Gp)
plot(Z[,1],Z[,2],type="n",xlim=c(-1,1),ylim=c(-1,1),asp=1,xlab="",ylab="",main="A")
arrows(0,0,Z[1:3,1],Z[1:3,2],col="red",angle=10,length=0.1)
arrows(0,0,Z[4:6,1],Z[4:6,2],col="blue",angle=10,length=0.1)
text(out.lrcco$Fs[,1],out.lrcco$Fs[,2],
     c(expression(X[1]),expression(X[2]),expression(X[3])))
text(out.lrcco$Gp[,1],out.lrcco$Gp[,2],
     c(expression(Y[1]),expression(Y[2]),expression(Y[3])),pos=c(4,3,1))
grid()
fa <- 0.15
points(fa*out.lrcco$U[,1],fa*out.lrcco$U[,2])

par(mfg=c(1,2))
#
# Figure B
#
Z <- rbind(out.lrcco$Fp,out.lrcco$Gs)
plot(Z[,1],Z[,2],type="n",xlim=c(-1.5,1.5),ylim=c(-1.5,1.5),asp=1,xlab="",ylab="",main="B")
arrows(0,0,Z[1:3,1],Z[1:3,2],col="red",angle=10,length=0.1)
arrows(0,0,Z[4:6,1],Z[4:6,2],col="blue",angle=10,length=0.1)
text(out.lrcco$Fp[,1],out.lrcco$Fp[,2],
     c(expression(X[1]),expression(X[2]),expression(X[3])))
text(out.lrcco$Gs[,1],out.lrcco$Gs[,2],
     c(expression(Y[1]),expression(Y[2]),expression(Y[3])),pos=c(4,3,1))
grid()
fa <- 0.25
points(fa*out.lrcco$V[,1],fa*out.lrcco$V[,2])

#
# Standardizing the clr transformed data
#
Xstan.clr <- scale(clrmat(Xsim.com))
Ystan.clr <- scale(clrmat(Ysim.com))
res.stan.cco <- canocov(Xstan.clr,Ystan.clr)
par(mfg=c(2,1))
#
# Figure C
#
Z <- rbind(res.stan.cco$Fs,res.stan.cco$Gp)
plot(Z[,1],Z[,2],type="n",xlim=c(-1,1),ylim=c(-1,1),asp=1,xlab="",ylab="",main="C")
arrows(0,0,Z[1:3,1],Z[1:3,2],col="red",angle=10,length=0.1)
arrows(0,0,Z[4:6,1],Z[4:6,2],col="blue",angle=10,length=0.1)
text(res.stan.cco$Fs[,1],res.stan.cco$Fs[,2],
     c(expression(X[1]),expression(X[2]),expression(X[3])))
text(res.stan.cco$Gp[,1],res.stan.cco$Gp[,2],
     c(expression(Y[1]),expression(Y[2]),expression(Y[3])),pos=c(4,3,1))
grid()
fa <- 0.2
points(fa*res.stan.cco$U[,1],fa*res.stan.cco$U[,2])
circle()
par(mfg=c(2,2))
#
# Figure D
#
Z <- rbind(res.stan.cco$Fp,res.stan.cco$Gs)
plot(Z[,1],Z[,2],type="n",xlim=c(-1.5,1.5),ylim=c(-1.5,1.5),asp=1,xlab="",ylab="",main="D")
arrows(0,0,Z[1:3,1],Z[1:3,2],col="red",angle=10,length=0.1)
arrows(0,0,Z[4:6,1],Z[4:6,2],col="blue",angle=10,length=0.1)
text(res.stan.cco$Fp[,1],res.stan.cco$Fp[,2],
     c(expression(X[1]),expression(X[2]),expression(X[3])))
text(res.stan.cco$Gs[,1],res.stan.cco$Gs[,2],
     c(expression(Y[1]),expression(Y[2]),expression(Y[3])),pos=c(4,3,1))
grid()
fa <- 0.25
points(fa*res.stan.cco$V[,1],fa*res.stan.cco$V[,2])
circle()
par(opar)
```

Panel A shows the logratios $\log{(x_2/x_3)}$ and $\log{(y_1/y_2)}$ to have long links that run parallel to the first canonical variate with the largest canonical correlation; these logratios are highly correlated. The canonical biplot shows the association between the two sets of compositions, which is not visible in the ternary diagrams above.


### 3.2 Canonical analysis of bentonites

In this subsection we treat the canonical analysis of bentonites. The X set concerns the concentrations of 9 major oxides, measured in 14 samples in the US (@Cadrin). A canonical analysis of this data set has been previously described (@Reyment), and is extended here with biplots. The Y set concerns two isotopes, $\delta D$ and $\delta 18O$. 

```{r bentonites}
data("bentonites")
head(bentonites)
```

We clr-transform and column-center the major oxides, after deletion of MnO which is outlying and had many zeros, which were replaced with 0.001. We standardize the isotopes.

```{r clrbentonites}
X <- bentonites[,1:9]
X <- X[,-4]
X <- X/rowSums(X)
Y <- scale(bentonites[,10:11])
Xclr <- clrmat(X)
cco <- canocov(Xclr,Y)
```

The two canonical correlations are large:

```{r twocancor}
diag(cco$ccor)
```

We construct a biplot of the data:

```{r biplotbentonites}
plot(cco$Fs[,1],cco$Fs[,2],col="red",pch=NA,xlab="First principal axis",
     ylab="Second principal axis",xlim=c(-1,1),ylim=c(-1,1),asp=1)
textxy(cco$Fs[,1],cco$Fs[,2],colnames(X),cex=0.75)
arrows(0,0,cco$Gp[,1],cco$Gp[,2],angle=10,col="blue",length=0.1)
arrows(0,0,cco$Fs[,1],cco$Fs[,2],angle=10,col="red",length=0.1)
text(cco$Gp[,1],cco$Gp[,2],colnames(Y),pos=c(3,1))
fa <- 0.45
points(fa*cco$U[,1],fa*cco$U[,2])
textxy(fa*cco$U[,1],fa*cco$U[,2],1:14)
```

We overplot the biplot with the canonical X-variates, which allows one to inspect the original samples (@Graffelman2005). For plotting, the canonical variate is scaled with a convenient scaling factor (here 0.45). This factor does not affect the interpretation of the biplot, but gives the samples a convenient spread.

The logratio $\log{(Na/Mg)}$ (among others) almost coincides with the first canonical variate, which correlates with $\delta 18O$. However, interpretation should proceed with care because of the small sample size.

```{r lognaca}
lnNaCa <- log(X[,c("Na")]/X[,c("Mg")])
cor(Y[,c("d18O")],lnNaCa)
```

## 4. Logratio discriminant analysis (LR-LDA)<a name="lrlda"></a>

We use archeological data from the UK (@Tubb) to illustrate LR-LDA. This dataset consists of measurements of nine oxides on 48 archeological samples from three regions in the UK. We first prepare the data:

```{r tubbdata}
data(Tubb)
head(Tubb)
site             <- factor(Tubb$site)
Oxides           <- as.matrix(Tubb[,2:10])
rownames(Oxides) <- Tubb$Sample
Oxides           <- Oxides/rowSums(Oxides)
```

Next, we carry out LR-LDA by passing the compositions in `Oxides` to the function `lrlda`. Internally, `lrlda` applies the clr transformation of the data.


```{r lrldatubbdata}
out.lrlda <- lrlda(Oxides,site)
```

The group sizes are obtained with:

```{r groupsizes}
out.lrlda$gsizes
```

The group mean vectors of the clr transformed compositions are given by:

```{r groupmeans}
out.lrlda$Mclr
```

The scores of the linear discriminant function are obtained by:

```{r ldscores}
head(out.lrlda$LD)
```

The confusion matrix for the training observations is:

```{r confusion}
out.lrlda$CM
```

Posterior probabilities for the classifications are obtained by

```{r posteriorprobs}
head(round(out.lrlda$prob.posterior,4))
```


We extract biplot coordinates for group centers, individual observations and variables, and construct the LDA biplot.


```{r biplotcoordinates}
Fp <- out.lrlda$Fp
Gs <- out.lrlda$Gs
LD <- out.lrlda$LD

colvec <- rep(NA,nrow(Oxides))
colvec[site=="G"]  <- "red"
colvec[site=="NF"] <- "green"
colvec[site=="W"]  <- "blue"


lims <- jointlim(LD,Fp)
opar <- par(bty="n",xaxt="n",yaxt="n")   
plot(Fp[,1],Fp[,2],asp=1,pch=17,xlab="",ylab="",col=c("red","green","blue"),
     xlim=lims$xlim,ylim=lims$ylim,cex=1.25)
points(LD[,1],LD[,2],col=colvec)
origin()
arrows(0,0,10*Gs[,1],10*Gs[,2],angle = 10, length = 0.1)
textxy(10*Gs[,1],10*Gs[,2],colnames(Oxides))

ld1lab <- paste("LD1 (",toString(round(100*out.lrlda$decom[2,1],1)),"%)",sep="")
ld2lab <- paste("LD2 (",toString(round(100*out.lrlda$decom[2,2],1)),"%)",sep="")

text(7,-0.25,ld1lab,cex=0.75)
text(0.25,7,ld2lab,cex=0.75,srt=90)

par(opar)
legend("topleft",c("G","NF","W"),col=c("red","green","blue"),pch=1,cex=0.5)
```

The LR-LDA biplot shows perfect separation of the three UK regions and suggests that a single logratio like $\log{(MgO/Al2O3)}$ (among other possibilities) is capable of discriminating the three regions. A boxplot of
this logratio confirms this.

```{r lrmgoal2o3}
lrMgOAl2O2 <- Oxides[,c("MgO")]/Oxides[,c("Al2O3")]
boxplot(lrMgOAl2O2~site,col=c("red","green","blue"))
```

</div>

## References