---
title: "Leave-one-out cross-validation and error correction"
author: "Collin Erickson"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Leave-one-out cross-validation and error correction}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, results='hide', echo=FALSE}
set.seed(0)
```



Cross-validation is often used in machine learning to judge how well a model is fit.
Instead of using the entire data set to fit the model,
it will use one part of the data set to fit a model and then test the model on the remaining data.
This gives an idea of how well the model will generalize to independent data.

## Leave-one-out predictions using Gaussian processes

Leave-one-out prediction fits a model to all of the data except
a single point, and then makes a prediction at that point, which can
be compared to the actual value.
This may seem very expensive to do for every point,
but it is actually an inexpensive computation for a Gaussian process model,
as long as the parameters estimated from the full model are reused.
Reusing the full-model parameters does bias the predictions toward better
results than if the parameters were re-estimated for each leave-one-out fit.

Normally each prediction point requires solving a matrix equation.
To predict the output, $y$, at point $\mathbf{x}$, given 
input data in matrix $X_2$ and output $\mathbf{y_2}$, we use the equation
$$ \hat{y} = \hat{\mu} + R(\mathbf{x},~X_2) R(X_2)^{-1}( \mathbf{y_2} - \hat{\mu}\mathbf{1_n}) $$
For leave-one-out predictions, the matrix $X_2$ will have all the design points
except for the one we are predicting at, and thus will be different
for each one.
However, we will have the correlation matrix $R$ for the full data set
from estimating the parameters, and there is a shortcut
to find the inverse of a matrix leaving out a single row and column.
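
As a quick illustration, the sketch below evaluates the prediction equation above directly in R.
The Gaussian correlation function with scale 30, the toy data, and the use of the sample mean
as a stand-in for $\hat{\mu}$ are all assumptions made for this example only.

```{r}
# Sketch of the prediction equation above on toy data (illustrative only)
corr <- function(x, y) exp(-30 * sum((x - y)^2))   # assumed Gaussian correlation
n2 <- 30
d <- 2
X2 <- matrix(runif(n2*d), ncol=d)                  # toy design matrix
y2 <- sin(2*pi*X2[,1]) + cos(2*pi*X2[,2])          # toy response
R2 <- outer(1:n2, 1:n2, Vectorize(function(i,j) corr(X2[i,], X2[j,])))
muhat <- mean(y2)                                  # crude stand-in for muhat
xnew <- runif(d)                                   # point to predict at
rvec <- apply(X2, 1, corr, y = xnew)               # R(x, X2)
yhat <- muhat + rvec %*% solve(R2, y2 - muhat)     # predicted mean at xnew
yhat
```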


There is a significant speed-up from using a matrix multiplication instead of a matrix solve.
The code chunk below shows that solving a system with a $200 \times 200$ matrix
is over 30 times slower than a matrix-vector multiplication.

```{r}
n <- 200
m1 <- matrix(runif(n*n),ncol=n)
b1 <- runif(n)
if (requireNamespace("microbenchmark", quietly = TRUE)) {
  microbenchmark::microbenchmark(solve(m1, b1), m1 %*% b1)
}
```

### Getting the inverse of a submatrix

Suppose we have a matrix $K$ and know its inverse $K^{-1}$.
Suppose that $K$ has block structure
$$ K = \begin{bmatrix} A & B \\ C & D \end{bmatrix}$$
Now we want to find $A^{-1}$ using $K^{-1}$, instead of computing the inverse of $A$ directly.
We can write $K^{-1}$ in block structure
$$K^{-1} = \begin{bmatrix} E & F \\ G & H \end{bmatrix}$$

Now we use the fact that $K K^{-1}  = I$
$$ \begin{bmatrix} I & 0 \\ 0 & I \end{bmatrix} = \begin{bmatrix} A & B \\ C & D \end{bmatrix}\begin{bmatrix} E & F \\ G & H \end{bmatrix}  $$

This gives the equations
$$ AE + BG = I$$
$$ AF + BH = 0$$
$$ CE + DG = 0$$
$$ CF + DH = I$$


Solving the first equation for $A$ gives
$$ A = (I - BG)E^{-1}$$
or
$$ A^{-1} = E (I - BG) ^{-1}$$
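
As a numeric sanity check on this identity (not part of the original derivation), the sketch below
builds a random invertible matrix, extracts the blocks $A$, $B$, $E$, and $G$, and confirms that
$A^{-1} = E(I - BG)^{-1}$; the matrix and block sizes are arbitrary choices.

```{r}
# Check A^{-1} = E (I - B G)^{-1} on a random matrix (illustration only)
nk <- 6
k <- 4                                             # K is nk x nk, A is k x k
K <- matrix(rnorm(nk*nk), ncol=nk) + 5*diag(nk)    # well-conditioned random matrix
Kinv <- solve(K)
A <- K[1:k, 1:k]
B <- K[1:k, (k+1):nk]
E <- Kinv[1:k, 1:k]
G <- Kinv[(k+1):nk, 1:k]
max(abs(solve(A) - E %*% solve(diag(k) - B %*% G)))  # should be near zero
```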

### Leave-one-out covariance matrix inverse for Gaussian processes

For Gaussian processes, consider the block structure of the covariance (or correlation) matrix
when a single row and its corresponding column are removed.
Let the first $n-1$ rows and columns be the covariance of the points in the design matrix $X$,
while the last row and column hold the covariance of the vector $\mathbf{x}$ with $X$ and with itself.
Then we have

$$ K = \begin{bmatrix} C(X,X) & C(X,\mathbf{x}) \\ C(\mathbf{x},X) & C(\mathbf{x},\mathbf{x}) \end{bmatrix}$$

Using the notation from the previous subsection we have $A = C(X,X)$ and $B=C(X,\mathbf{x})$, 
and $E$ and $G$ will be submatrices of the full $K^{-1}$.
$B$ is a column vector, so I'll write it as a vector $\mathbf{b}$,
and $G$ is a row vector, so I'll write it as a vector $\mathbf{g}^T$.
So we have
$$ C(X,X)^{-1} = E(I-\mathbf{b}\mathbf{g}^T)^{-1}$$
So if we want to calculate
$$ A^{-1} = E (I - \mathbf{b}\mathbf{g}^T) ^{-1},$$
we still have to invert $I-\mathbf{b}\mathbf{g}^T$, which is a large matrix.
However, this can be done efficiently using the Sherman-Morrison formula,
since $\mathbf{b}\mathbf{g}^T$ is a rank-one matrix.
$$ (I - \mathbf{b}\mathbf{g}^T)^{-1} = I^{-1} + \frac{I^{-1}\mathbf{b}\mathbf{g}^TI^{-1}}{1-\mathbf{g}^TI^{-1}\mathbf{b}}
= I + \frac{\mathbf{b}\mathbf{g}^T}{1-\mathbf{g}^T\mathbf{b}}
$$
Thus we have a shortcut for $A^{-1}$ that requires only multiplication:
$$ A^{-1} = E \left(I + \frac{\mathbf{b}\mathbf{g}^T}{1-\mathbf{g}^T\mathbf{b}}\right)$$

To speed this up it should be calculated as

$$ A^{-1} = E + \frac{(E\mathbf{b})\mathbf{g}^T}{1-\mathbf{g}^T\mathbf{b}}$$
The code chunk below demonstrates that this shortcut gives a speedup of almost twenty times.
```{r}
set.seed(0)
corr <- function(x,y) {exp(sum(-30*(x-y)^2))}  # Gaussian correlation function
n <- 200
d <- 2
X <- matrix(runif(n*d), ncol=d)
R <- outer(1:n, 1:n, Vectorize(function(i,j) {corr(X[i,], X[j,])}))
Rinv <- solve(R)
A <- R[-n,-n]        # correlation matrix with the last point left out
Ainv <- solve(A)     # direct inverse, for comparison
E <- Rinv[-n, -n]    # submatrix of the full inverse
b <- R[n,-n]         # correlation of the left-out point with the others
g <- Rinv[n,-n]      # corresponding row of the full inverse
Ainv_shortcut <- E + E %*% b %*% g / (1-sum(g*b))
summary(c(Ainv - Ainv_shortcut))  # differences should be negligible
if (requireNamespace("microbenchmark", quietly = TRUE)) {
  microbenchmark::microbenchmark(solve(A), E + E %*% b %*% g / (1-sum(g*b)))
}
```

In terms of the correlation matrices already calculated, this is the following,
where $M_{-i}$ is the matrix $M$ with the $i$th row and column removed, and 
$M_{i,-i}$ is the $i$th row of the matrix $M$ with the value from the $i$th column removed.

$$ R(X_{-i})^{-1} = (R(X)^{-1})_{-i} +
\frac{\left((R(X)^{-1})_{-i}R(X)_{-i,i}\right) (R(X)^{-1})_{i,-i}^T }{1 - (R(X)^{-1})_{i,-i}^T R(X)_{-i,i}}$$


### Leave-one-out prediction

Recall that the predicted mean at a new point is
$$ \hat{y} = \hat{\mu} + R(\mathbf{x},~X_2) R(X_2)^{-1}( \mathbf{y_2} - \hat{\mu}\mathbf{1_n}) $$

For the leave-one-out prediction at design point $\mathbf{x_i}$, this becomes
$$ \hat{y_i} = \hat{\mu} + R(\mathbf{x_i},~X_{-i}) R(X_{-i})^{-1}( \mathbf{y_{-i}} - \hat{\mu}\mathbf{1_{n-1}}) $$
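
Putting the pieces together, the sketch below computes all $n$ leave-one-out predicted means with
the shortcut, reusing `X`, `R`, `Rinv`, and `n` from the chunk above. The toy response `y`
and the use of the sample mean for $\hat{\mu}$ are assumptions made only for this illustration.

```{r}
# Leave-one-out predicted means via the precomputed Rinv shortcut (sketch)
y <- sin(2*pi*X[,1]) + cos(2*pi*X[,2])             # toy response for illustration
muhat <- mean(y)                                   # stand-in for the estimated mean
yhat_loo <- sapply(1:n, function(i) {
  E <- Rinv[-i, -i]
  b <- R[-i, i]                                    # correlation of x_i with X_{-i}
  g <- Rinv[-i, i]
  Ainv <- E + (E %*% b) %*% g / (1 - sum(g*b))     # R(X_{-i})^{-1} via the shortcut
  muhat + sum(b * (Ainv %*% (y[-i] - muhat)))      # predicted mean at x_i
})
summary(yhat_loo - y)                              # leave-one-out prediction errors
```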