\documentclass[mathserif,12pt]{amsart}
%\VignetteIndexEntry{Bayesian Deconvolution}
%\VignetteEngine{knitr::knitr}

%% additional packages
\usepackage[latin1]{inputenc}
\usepackage{a4wide,graphicx,color,thumbpdf}
\usepackage{hyperref}
\usepackage{alltt}
\usepackage{amsmath}
\usepackage{amsthm}
\usepackage{upquote}
\usepackage{mdframed}
\usepackage{xcolor}
\usepackage{eulervm}

%% BibTeX settings
\usepackage[authoryear,round]{natbib}
\bibliographystyle{jae}
\bibpunct{(}{)}{,}{a}{,}{,}
\newcommand{\doi}[1]{\href{http://dx.doi.org/#1}{\normalfont\texttt{doi:#1}}}

%% markup commands for code/software
\let\code=\texttt
\let\pkg=\textbf
\let\proglang=\textsf
\newcommand{\file}[1]{`\code{#1}'}
%\newcommand{\email}[1]{\href{mailto:#1}{\normalfont\texttt{#1}}}

%% paragraph formatting
\renewcommand{\baselinestretch}{1}

%% \usepackage{Sweave} is essentially
\RequirePackage[T1]{fontenc}
\RequirePackage{ae,fancyvrb}
\DefineVerbatimEnvironment{Sinput}{Verbatim}{fontshape=sl}
\DefineVerbatimEnvironment{Soutput}{Verbatim}{}
\DefineVerbatimEnvironment{Scode}{Verbatim}{fontshape=sl}



\def\RR{\mathbb{R}}
\def\OO{\mathcal O}
\def\PP{\mathbb{P}}
\def\LL{\mathcal L}
\def\II{\mathcal I}
\def\EE{\mathbb E}
\def\VB{\mathbb V}
\def\FF{\mathcal F}
\def\NN{\mathcal N}
\def\XX{\mathcal X}
\def\SS{\mathcal S}
\def\GG{\mathcal G}
\def\CC{\mathcal C}
\def\AA{\mathcal A}
\def\TT{\mathcal T}
\def\HH{\mathcal H}
\def\JJ{\mathcal J}
\def\KK{\mathcal K}
\def\PP{\mathcal{P}}
\def\DD{\mathcal{D}}
\def\pperp{\perp\!\!\!\perp}

\def\Polya{P\'olya }
\def\Renyi{R\'enyi }
\def\diag{\mbox{diag}}
\def\sgn{\mbox{sgn}}
\def\TV{\bigvee_\Omega }
%\newtheorem{example}{Example}[section]
\newtheorem{exmp}{Example}
%\newtheorem{remark}{Remark}
\renewcommand{\abstractname}{Abstract}
\def\argmax{\text{argmax}}
\newtheorem{lemma}{Lemma}
\newtheorem{theorem}{Theorem}
\newtheorem{corollary}{Corollary}
\newtheorem{proposition}{Proposition}
\renewenvironment{proof}{\noindent {\bf Proof.\ }}{\hfill{\rule{2mm}{2mm}}}
%\renewenvironment{remark}{\noindent {\bf Remark.\ }} {\hfill{ \rule{2mm}{2mm}}}
%\renewenvironment{example}{\noindent {\bf Example.\ }}{\hfill{ \rule{2mm}{2mm}}}


\begin{document}


\title{Bayesian Deconvolution:\\ An R Vinaigrette}
\author{Roger Koenker}
\thanks{Version:  \today .  A genre manifesto for R Vinaigrettes is available
at \url{http://davoidofmeaning.blogspot.com/2016/12/r-vinaigrettes.html}.
It seemed appropriate that my first venture in this new genre should be
samokritika (self-criticism) directed at the R package REBayes, \cite{REBayes}.
Code to reproduce the computational results presented is available from
\url{http://www.econ.uiuc.edu/~roger/research/ebayes/ebayes.html}, along
with the pdf version of this note.  I would like to thank Jiaying Gu for
very helpful comments on an earlier draft.
}


\begin{abstract}
    Nonparametric maximum likelihood estimation of general mixture models
    pioneered by the work of \cite{KW} has been recently reformulated as
    an exponential family regression spline problem in \cite{Bdecon}.  
    Both approaches yield a low dimensional estimate of the mixing distribution, 
    $g$-modeling in the terminology of Efron.   Some casual empiricism suggests
    that the Efron approach is preferable when the mixing distribution has a
    smooth density, while Kiefer-Wolfowitz is preferable for discrete mixing
    settings.  In the classical Gaussian deconvolution problem both maximum
    likelihood methods appear to be preferable to (Fourier) kernel  methods.
    Kernel smoothing of the Kiefer-Wolfowitz estimator appears to be competitive
    with the Efron procedure for smooth alternatives.
\end{abstract}

\maketitle

%\noindent {\bf Keywords:} 

<<preliminaries, echo=FALSE, warning = FALSE, message = FALSE, results='hide'>>=
hasMosek <- require("Rmosek", quietly = TRUE)                
if(hasMosek) 
    hasMosek <- tryCatch(example(mosek)$value$response$code == 0,
			   warning = function(c) 0,
			   error = function(c) 0)
if (!hasMosek) {                                                                     
                knitr::opts_chunk$set(eval = FALSE)                                   
                msg <- paste("This vignette requires Mosek, but this system",
                             "does not have Mosek",
                             "installed, so code examples will not be evaluated.")
                msg <- paste(strwrap(msg), collapse="\n")
                message(msg) 
            }
require("REBayes")
knitr::render_sweave()
options(prompt = "R> ", continue = "+  ", digits = 3, show.signif.stars = FALSE)
cleanup <- FALSE
@


\section{Introduction} \label{sec:intro}
	\cite{Bdecon} has recently introduced the phrase ``Bayesian deconvolution''
	to describe a maximum likelihood procedure for estimating mixture models
	of the general form,
	\[
	f(y) = \int \varphi(y| \theta) dG(\theta),
	\]
	where $\varphi$ denotes a known parametric ``base'' model and $G$ denotes
	an unknown, nonparametric mixing distribution.  Such models are fundamental
	in empirical Bayes compound decision settings where we have the (iid)
	hierarchical structure,
	\[
	Y_i \sim \varphi (y | \theta_i); \quad \theta_i \sim G.
	\]
	When $\theta$ is a location parameter, so that $\varphi (y | \theta_i) =
	\varphi (y - \theta_i)$, this is a conventional deconvolution problem,
	usually invoking characteristic function methods.  Efron's
	maximum likelihood procedure instead recalls the NPMLE of \cite{KW}, except that
	rather than producing a discrete estimate of $G$ it yields a smooth
	estimate.
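
	For example, in the Gaussian location case the hierarchy is easy to
	simulate; a minimal sketch, with a hypothetical two-point mixing
	distribution $G$:

<<introsim, eval = FALSE>>=
# Sketch: simulate from Y_i | theta_i ~ N(theta_i, 1) with theta_i ~ G,
# here a (hypothetical) two-point G placing mass 7/8 at 0 and 1/8 at 2
n <- 1000
theta <- sample(c(0, 2), n, replace = TRUE, prob = c(7, 1)/8)
y <- rnorm(n, mean = theta, sd = 1)
@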

	This note contrasts the foregoing methods in a very simple, special
	case and argues that maximum likelihood offers considerable advantages
	over prior (Fourier) deconvolution methods, perhaps most significantly
	by extending the domain of applications beyond the location shift model.

\section{The Kiefer-Wolfowitz NPMLE}

In \cite{KM} we have advocated the Kiefer-Wolfowitz NPMLE approach to 
estimating $G$ and constructing estimates of the $\theta_i$'s for compound
decision problems.  In sharp contrast to finite dimensional mixture problems
with highly multimodal likelihoods, discrete formulations of the general
nonparametric mixture problem are strictly convex and therefore admit unique
solutions.  Consider a grid $t_0, t_1, \cdots , t_m$ with associated masses
$\{ g \in \RR^m \mid g_i \geq 0, \; \sum_{i=1}^m g_i \Delta t_i = 1 \}$; we
can approximate the log likelihood by,
\[
\ell(G) = \sum_{i=1}^n \log f_i 
\]
where the $n$ vector $f = Ag$ and $A$ is the $n$ by $m$ matrix with typical
element $\varphi (y_i \mid t_j)$.  As is well known from \cite{Laird} or \cite{Lindsay},
the NPMLE, $\hat G$, has $p \leq n$ positive mass points, and in practice this
$p$ is usually closer to $\log n$ than to $n$.  Interior point methods for solving
such problems are considerably more efficient than earlier EM approaches, greatly
facilitating the study of their performance in simulation experiments.  Unfortunately,
little is known about their statistical efficiency from a theoretical perspective
beyond the basic consistency results of \cite{KW} and \cite{Pfanzagl}.
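
In the Gaussian location case the discretization is easily constructed directly; a minimal
sketch, assuming a sample \code{y}, with \code{GLmix} from the \pkg{REBayes} package solving
the resulting convex program as in the figures below:

<<kwsketch, eval = FALSE>>=
# Sketch: discretized Kiefer-Wolfowitz likelihood for y_i | theta_i ~ N(theta_i, 1)
t <- seq(min(y), max(y), length = 300)   # grid for the support of G
A <- dnorm(outer(y, t, "-"))             # n by m matrix with typical element phi(y_i - t_j)
loglik <- function(g) sum(log(A %*% g))  # log likelihood as a function of the mass vector g
fhat <- GLmix(y)                         # the NPMLE, computed by interior point methods
@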

\section{Efron's NPMLE}

\cite{Bdecon} has proposed an alternative approach to estimating $G$ that expresses
the logarithm of its density $g$ as a regression spline,
\[
g(t | \theta) = \exp \{ \sum_{j=1}^p \theta_j \psi_j (t) - \psi_0 (\theta) \},
\]
as in the pure density estimation methods of \cite{Stone} and \cite{Barron}.  We can
maintain the same discretization for the support of $G$, and set,
\[
g = (g_j) = (g(t_j | \theta)),
\]
so the log likelihood can be expressed as above, except that now we are estimating
a finite dimensional parameter $\theta$ of predetermined dimension.  Efron suggests
natural splines for the $\psi_j$ functions and maximization of the penalized log likelihood,
\[
\ell_n (G_\theta) - \lambda \| \theta \| ,
\]
where the penalty is the Euclidean norm of the vector $\theta$, thereby shrinking $\theta$ toward the
origin and $\hat G$ toward the uniform distribution. 
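
A minimal sketch of the resulting penalized criterion, mirroring the implementation used for
the figures below, with \code{A} and \code{t} as in the previous sketch and a natural spline
basis from the \pkg{splines} package:

<<efronsketch, eval = FALSE>>=
# Sketch: Efron's penalized (negative) log likelihood over the grid t
library(splines)
X <- ns(t, df = 5)                       # natural spline basis psi_1, ..., psi_p
pnll <- function(theta, X, A, lambda){
    g <- exp(X %*% theta)
    g <- g/sum(g)                        # normalized mixing masses g(t_j | theta)
    f <- A %*% g                         # implied mixture density at the observations
    -sum(log(f)) + lambda * sqrt(sum(theta^2))
}
thetahat <- nlm(pnll, rep(0, ncol(X)), X = X, A = A, lambda = 1)$estimate
@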

A striking feature of both the Efron and Kiefer-Wolfowitz proposals is that neither
depends upon the mixture model being a formal convolution.  Of course, when $\theta$
is a location parameter, so that $\varphi (y | \theta) = \varphi (y - \theta)$,
classical deconvolution methods are also applicable.  Efron compares the performance
of his procedure with the kernel deconvolution method of \cite{SC}, and concludes
that the latter is ``too variable in the tails.''
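
With Gaussian errors, the kernel (Fourier) deconvolution estimate of the mixing density, as
implemented for the comparisons below with truncation (bandwidth) parameter $\lambda$, takes
the form
\[
\hat g(t) = \frac{1}{n \pi} \sum_{i=1}^n \int_0^{1/\lambda} e^{s^2/2} \cos (s(Y_i - t)) \, ds ,
\]
with the integral evaluated by numerical quadrature at each grid point $t$.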

<<Xsetup, include = FALSE>>=
X.cap <- "Four estimates of the mixing distributions $G$:  In the left panel the 
true mixing distribution is smooth, in the left panel it is discrete as described
in the text.  
"
@

<<X, fig.height = 6, fig.width = 9, fig.cap = X.cap, cache = TRUE, message = FALSE, warning = FALSE, echo = FALSE>>=
KFE <- function(y, T = 300, lambda = 1/3){
    # Kernel Fourier Estimator: Stefanski and Carroll (Statistics, 1990)
    ks <- function(s,x) exp(s^2/2) * cos(s * x)
    K <- function(t, y, lambda = 1/3){
	k <- y
	for(i in 1:length(y)){
	    k[i] <- integrate(ks, 0, 1/lambda, x = (y[i] - t))$value/pi 
	}
	mean(k)
    }
    eps <- 1e-04
    if(length(T) == 1) T <- seq(min(y)-eps, max(y)+eps, length = T)
    g <- T
    for(j in 1:length(T))
	g[j] <- K(T[j], y, lambda = lambda)
    list(x = T, y = g)
}
BDE <- function(y, T = 300, df = 5, c0 = 1){
    # Bayesian Deconvolution Estimator: Efron (B'ka, 2016)
    require(splines)
    eps <- 1e-04
    if(length(T) == 1) T <- seq(min(y)-eps, max(y)+eps, length = T)
    X <- ns(T, df = df)
    a0 <- rep(0, ncol(X))
    A <- dnorm(outer(y,T,"-"))
    qmle <- function(a, X, A, c0){
	g <- exp(X %*% a)
	g <- g/sum(g)
	f <- A %*% g
	-sum(log(f)) + c0 * sum(a^2)^.5
    }
    ahat <- nlm(qmle, a0, X=X, A=A, c0 = c0)$estimate
    g <- exp(X %*% ahat)
    g <- g/integrate(approxfun(T,g),min(T),max(T))$value
    list(x = T,y = g)
}
W <- function(G, h, interp = FALSE, eps = 0.001){
    #Wasserstein distance:  ||G-H||_W
    H <- cumsum(h$y)
    H <- H/H[length(H)]
    W <- integrate(approxfun(h$x, abs(G(h$x) - H)),min(h$x),max(h$x))$value
    list(W=W, H=H)
}

biweight <- function(x0, x, bw){
    # biweight (quartic) kernel weights at x0 with bandwidth bw
    t <- (x - x0)/bw
    (15/16) * (1 - t^2)^2 * (t > -1 & t < 1)
}
Wasser <- function(G, h, interp = FALSE, eps = 0.001, bw = 0.7){
    #Wasserstein distance:  ||G-H||_W
    if(interp == "biweight"){
	yk = h$x
	for (j in 1:length(yk))
	    yk[j] = sum(biweight(h$x[j], h$x, bw = bw)*h$y/sum(h$y))
	H <- cumsum(yk)
	H <- H/H[length(H)]
    }
    else {
	H <- cumsum(h$y)
	H <- H/H[length(H)]
    }
    W <- integrate(approxfun(h$x, abs(G(h$x) - H)),min(h$x),max(h$x), subdivisions = 500)$value
    list(W=W, H=H)
}
set.seed(1234)
par(mfrow = c(1,2))
for(i in 1:2){
    if(i == 1){
	G0 <- function(t) punif(t,0,6)/8 + 7 * pnorm(t, 0, 0.5)/8  
	rf0 <- function(n){
	    s <- sample(0:1, n, replace = TRUE, prob = c(1,7)/8)
	    rnorm(n) + (1-s) * runif(n,0,6) + s * rnorm(n,0,0.5)
	}
    }
    else{
	G0 <- function(t) 0 + 7 * (t > 0)/8 + (t > 2)/8
	rf0 <- function(n){
	    s <- sample(0:1, n, replace = TRUE, prob = c(1,7)/8)
	    rnorm(n) + (1-s) * 2 + s * 0
	}
    }
    y <- rf0(1000)
    t1 <- system.time(g1 <- BDE(y))
    W1 <- Wasser(G0, g1)
    t2 <- system.time(g2 <- GLmix(y))
    W2 <- Wasser(G0, g2)
    t3 <- system.time(g3 <- KFE(y))
    W3 <- Wasser(G0, g3)
    W4 <- Wasser(G0, g2, interp = "biweight")
    plot(g1$x, G0(g1$x), type = "l", ylim = c(-0.1, 1.1), xlab = "x", ylab = "G(x)")
    lines(g1$x, W1$H,col = 2)
    lines(g2$x, W2$H, col = 3)
    lines(g3$x, W3$H, col = 4)
    lines(g2$x, W4$H, col = 5)
    legend(x = "bottomright", legend= c("True", "Efron","KW", "SC", "KWs"), col = 1:5, lwd = 1.5)
    title(paste("W(E) = ", round(W1$W,3), "; ", "W(KW) = ", round(W2$W,3), 
	    "; ", "W(SC) = ", round(W3$W,3), "; ", "W(KWs) = ", round(W4$W,3), sep = ""), cex.main = 0.75)
}
@


<<Y, cache = TRUE, message = FALSE, warning = FALSE, echo = FALSE>>=
sim1 <- function(n, R = 10){
    A <- matrix(0, 3, R)
    G0 <- function(t) punif(t,0,6)/8 + 7 * pnorm(t, 0, 0.5)/8  
    rf0 <- function(n){
	s <- sample(0:1, n, replace = TRUE, prob = c(1,7)/8)
	rnorm(n) + (1-s) * runif(n,0,6) + s * rnorm(n,0,0.5)
    }
    for(i in 1:R){
	y <- rf0(n)
	g <- BDE(y)
	Wg <- Wasser(G0, g)
	h <- GLmix(y)
	Wh <- Wasser(G0, h)
	Whs <- Wasser(G0, h, interp = "biweight")
	A[,i] <- c(Wg$W, Wh$W, Whs$W)
	}
    A
}
set.seed(12)
A <- sim1(1000)
a <- round(apply(A,1,mean),3)
@

\section{An Illustration}

To compare performance of the three estimators of $G$ described above, I have
considered a slight variant of the simulation setting of Efron.  The observed
$Y_i$ are $\NN (\theta_i, 1)$ with the $\theta_i$ drawn iid either (a) from the mixture of
Gaussian and uniform distributions, 
\[
G(\theta) = (1 - \epsilon) \Phi (\theta/\sigma) + 
	\epsilon \left[ (\theta/M) I(0 \leq \theta < M) + I(\theta \geq M) \right]
\]
with $\epsilon = 1/8$, $\sigma = 1/2$ and $M = 6$,  or (b) from the discrete mixing
distribution with $\epsilon = 1/8$ and,
\[
G(\theta) = (1 - \epsilon)  I(0 \leq \theta) + \epsilon I(2 \leq \theta). 
\]
Figure 1 depicts typical
realizations with sample size $n = 1000$.  Following Efron we have set the
dimension of the natural spline model to $p = 5$, and his penalty parameter to one.
The scaling parameter for the Stefanski-Carroll procedure was $1/3$, also 
following Efron's suggestion.  For all three estimators the grid was equally
spaced on the support of the observed $Y_i$ with $m = 300$ distinct values.
Wasserstein distance, the $L_1$ distance between distribution functions, is reported
above each panel for each of the estimates.
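In this univariate setting,
\[
\| G - \hat G \|_W = \int | G(t) - \hat G(t) | \, dt,
\]
approximated here by numerical integration of the discrepancy over the estimation grid.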

The performance of Efron's estimator is very impressive, while the oscillation
of the kernel method in the tails confirms Efron's criticism.  The Kiefer-Wolfowitz
estimator is respectable; at least the estimated $\hat G$ stays within the $[0,1]$
bounds, but it is clearly inferior to the smoother Efron procedure.  When the
target $G$ is discrete with a small number of mass points, the performance
advantage, not surprisingly, is reversed.  One might count the lack of tuning parameters for the
Kiefer-Wolfowitz NPMLE as an advantage over its competitors, or not, depending
on one's outlook on minimalism.  

In the spirit of competition, I couldn't resist trying to smooth the Kiefer-Wolfowitz
NPMLE to see whether one might be able to approach the performance of the Efron estimator,
so the last (cyan) curve is a biweight kernel smooth with bandwidth equal to 0.7, as sketched
below.  This does almost as well as the default Efron procedure for the smooth alternative in
our test case, but spoils the auspicious performance for the discrete case.  Encouraged by the
former improvement, I decided to wade a little further out in the water by replicating the
experiment.  In 1000 trials of the experiment with the smooth $G_0$, the mean Wasserstein error was 0.188
for the Efron estimator, 0.343
for the unsmoothed KW-NPMLE and 0.190 for the smoothed KW-NPMLE.  Of course, this
result proves nothing at all, except perhaps that I'm acquainted with an excellent bandwidth
oracle.
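
The smoothing here is simply a fixed bandwidth biweight kernel smooth of the NPMLE mass
points, as in the code used for the figures; a minimal sketch, assuming \code{h} is the
fitted object returned by \code{GLmix}:

<<kwsmooth, eval = FALSE>>=
# Sketch: biweight kernel smooth of the KW-NPMLE masses, bandwidth 0.7
biweight <- function(x0, x, bw){
    t <- (x - x0)/bw
    (15/16) * (1 - t^2)^2 * (t > -1 & t < 1)
}
h <- GLmix(y)
gs <- sapply(h$x, function(x0) sum(biweight(x0, h$x, bw = 0.7) * h$y/sum(h$y)))
Gs <- cumsum(gs)/sum(gs)  # smoothed estimate of the mixing distribution function
@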




\bibliography{Bdecon}

\end{document}