\documentclass[12pt]{amsart}
%\VignetteIndexEntry{REBayes: Empirical Bayes for Mixtures}
%\VignetteEngine{knitr::knitr}
%% additional packages
\usepackage[latin1]{inputenc}
\usepackage{thumbpdf}
\usepackage{alltt}
\usepackage{amsmath}
\usepackage{amsfonts}
\usepackage{upquote}
\setlength{\textheight}{205mm}
\setlength{\oddsidemargin}{10mm}
\setlength{\evensidemargin}{10mm}
\setlength{\topmargin}{0mm}
\setlength{\textwidth}{6in}
%% BibTeX settings
\usepackage[authoryear,round]{natbib}
\bibpunct{(}{)}{,}{a}{,}{,}
\newcommand{\doi}[1]{\href{http://dx.doi.org/#1}{\normalfont\texttt{doi:#1}}}
%% markup commands for code/software
\let\code=\texttt
\let\pkg=\textbf
\let\proglang=\textsf
\newcommand{\file}[1]{`\code{#1}'}
%\newcommand{\email}[1]{\href{mailto:#1}{\normalfont\texttt{#1}}}
%% paragraph formatting
\renewcommand{\baselinestretch}{1}
\def\RR{\mathbb{R}}
\def\OO{\mathcal O}
\def\PP{\mathbb{P}}
\def\LL{\mathcal L}
\def\II{\mathcal I}
\def\EE{\mathbb E}
\def\VB{\mathbb V}
\def\FF{\mathcal F}
\def\NN{\mathcal N}
\def\XX{\mathcal X}
\def\SS{\mathcal S}
\def\GG{\mathcal G}
\def\CC{\mathcal C}
\def\AA{\mathcal A}
\def\TT{\mathcal T}
\def\HH{\mathcal H}
\def\JJ{\mathcal J}
\def\KK{\mathcal K}
\def\argmax{\text{argmax}}
\newtheorem{theorem}{Theorem}
\def\sgn{\mbox{sgn}}

\begin{document}
\bibliographystyle{jae}

\title{REBayes:\\ An R Package for Empirical Bayes Mixture Methods}
\author{Roger Koenker}
\author{Jiaying Gu}
\thanks{Version: \today . This research was partially supported by NSF grant SES-11-53548.}

\begin{abstract}
Models of unobserved heterogeneity, or frailty as it is commonly known in survival analysis, can often be formulated as semiparametric mixture models and estimated by maximum likelihood as proposed by \cite{Robbins50} and elaborated by \cite{KW}. Recent developments in convex optimization, as noted by \cite{KM}, have led to dramatic improvements in computational methods for such models. In this vignette we describe an implementation contained in the \proglang{R} package \pkg{REBayes} with applications to a wide variety of mixture settings: Gaussian location and scale, Poisson and binomial mixtures for discrete data, Weibull and Gompertz models for survival data, and several Gaussian models intended for longitudinal data. While the dimension of the nonparametric heterogeneity of these models is inherently limited by our present gridding strategy, we describe how additional fixed parameters can be relatively easily accommodated via profile likelihood. We also describe some nonparametric maximum likelihood methods for shape and norm constrained density estimation that employ related computational methods.
\end{abstract}
\maketitle

%\noindent {\bf Keywords:}

<<preliminaries, echo=FALSE, warning = FALSE, message = FALSE, results='hide'>>=
hasMosek <- requireNamespace("Rmosek", quietly = TRUE)
if(hasMosek)
    hasMosek <- tryCatch(example(mosek)$value$response$code == 0,
                         warning = function(c) 0, error = function(c) 0)
if (!hasMosek) {
    knitr::opts_chunk$set(eval = FALSE)
    msg <- paste("This vignette requires Mosek, but this system",
                 "does not have Mosek",
                 "installed, so code examples will not be evaluated.")
    msg <- paste(strwrap(msg), collapse="\n")
    message(msg)
}
require("REBayes")
knitr::render_sweave()
options(prompt = "R> ", continue = "+ ", digits = 2, show.signif.stars = FALSE)
cleanup <- FALSE
@

\section{Introduction}
\label{sec:intro}

Empirical Bayes methods as conceived by \cite{Robbins.56} are enjoying a robust revival stimulated by more bountiful data sources and new theoretical developments exemplified by \cite{efron.10}. Mixture models have played a central role in this revival, and this has sparked renewed interest in the \cite{KW} nonparametric maximum likelihood estimator (NPMLE) for mixtures. Relatively recent developments in convex optimization have dramatically improved computational methods for the Kiefer-Wolfowitz NPMLE, as described in \cite{KM}. To make these methods accessible to the research community we have developed an \proglang{R} package, \pkg{REBayes}, that incorporates a wide variety of nonparametric mixture models and provides Kiefer-Wolfowitz procedures for each of them.

The simplest univariate mixture model takes the form,
\[
g(x) = \int \varphi (x , \theta ) dF(\theta),
\]
where $\varphi$ is a known density, which we will refer to as the base density, and $F$ is an unknown distribution function that we would like to estimate, given an iid sample from the mixture density $g$. The most familiar example would be the Gaussian location model with $\varphi$ standard Gaussian, so,
\[
g(x) = \int \varphi (x - \mu ) dF(\mu).
\]
This is the standard Gaussian sequence model and has been studied in many simulation experiments, including \cite{JS}, \cite{MW} and \cite{CvdV}, and employed in many -- typically genomic -- applications. The objective of such analyses is a compound decision problem: given an exchangeable sample, $X_1, ... , X_n$, estimate the corresponding $\mu_1, ... , \mu_n$ subject to quadratic loss. As noted by \cite{Robbins.56} this yields the optimal Bayes rule,
\begin{equation}
\EE (\mu | x ) = x + g' (x) / g(x).
\label{eq.tweedie}
\end{equation}
\cite{efron.11} calls this Tweedie's formula since Robbins attributes it to M.C.K. Tweedie; however, it appears earlier in \cite{Dyson26}, who credits it to the English astronomer Arthur Eddington. To turn this into a practical shrinkage formula we obviously need to choose an estimator for the mixture density $g$.

Much of the earlier literature on this problem may be viewed as offering parametric empirical Bayes proposals in which $F$ is specified up to a finite dimensional vector of hyperparameters. Prominent examples of this parametric strategy would be the \pkg{EbayesThresh} package described in \cite{JS} and \cite{JS2}, and the recent work of \cite{Bdecon} and \cite{efron.10}. In \cite{Bayesball} we have made some comparisons with nonparametric Bayes procedures based on the Dirichlet process prior using the \pkg{DPpackage} of \cite{DP}. In our limited experience this leads to estimates of the mixing distribution similar to those of the NPMLE provided that the concentration parameter of the Dirichlet is small.
See \cite{Liu96} for another comparison of Dirichlet and NPMLE methods. More recently interest has focused on nonparametric estimation of the mixing distribution as in \cite{efron.11}, \cite{BG} and \cite{JZ}. The latter authors proposed using the Kiefer-Wolfowitz NPMLE to estimate $F$, and thereby $g$, and then to use the Tweedie formula. The main drawback of this proposal was the painfully slow convergence of the fixed point iteration of the EM algorithm used to compute the NPMLE. \cite{KM}, observing that the discretization suggested by \cite{JZ} produced a convenient, finite dimensional convex optimization problem, showed that the NPMLE could be implemented much more efficiently by standard interior point methods. In the next section we will briefly describe this implementation, and then turn to descriptions of various applications. Other recent applications of the \pkg{REBayes} package may be found in \cite{DZ} and \cite{JZ.15}.

The extensive literature on estimating finite mixture models, that is, models with a prespecified number of parametric components, faces a number of challenging problems: potentially unbounded and multi-modal likelihoods, lack of identifiability, as well as selection of the number of components. The Kiefer and Wolfowitz NPMLE enjoys several important advantages over these finite mixture models. Because it is formulated as a discretized convex optimization problem, it is guaranteed to produce a unique solution, with both the locations and the associated masses of the mixture components determined by the optimization of the likelihood. Positivity of the mass associated with mixture components also ensures a strong form of parsimony, as we shall see, so the mixture distribution is encoded by a relatively simple discrete mixing distribution.

In the next section we will describe our implementation of the Kiefer-Wolfowitz NPMLE; further details are provided in \cite{KM}. As a practical matter the optimization requires an algorithmic approach capable of dealing with a quite general class of additively separable likelihoods optimized subject to both linear equality and inequality constraints. For this purpose we have found the Mosek environment of \cite{Mosek}, and the associated \proglang{R} interface \pkg{Rmosek} of \cite{Rmosek}, to be highly efficient and reliable. \cite{Convex} provide a broader survey of convex optimization methods for the \proglang{R} environment, including a brief mention of some basic \pkg{REBayes} functionality and further details regarding the general capabilities of Mosek. Installation of \proglang{Mosek} and \pkg{Rmosek} is described in detail in the Readme file in the ``inst'' directory of the \pkg{REBayes} package. Various options controlling Mosek's optimization behavior can be passed via the \pkg{REBayes} fitting functions. Among these, \code{rtol} and \code{verb}, which control the convergence tolerance and the verbosity of the printed optimization output, are the most frequently useful. While we have endeavored to choose sensible default values for these and other parameters, some experimentation may be required in unusual cases.

\section{Computation of the Kiefer-Wolfowitz NPMLE}

It is easy to see that the primal problem
\begin{equation}
\min_{F \in \FF} \{ - \sum_{i=1}^n \log g(x_i) \; | \; g(x_i) = \int \varphi(x_i , \theta) dF(\theta), \; i = 1, ... , n \},
\label{eq.KWP}
\end{equation}
where $\FF$ denotes the set of all mixing distributions, is a convex program.
We seek to minimize a strictly convex objective function subject to linear equality constraints over the convex set, $\FF$. The dual formulation of the problem is also illuminating.
\begin{theorem} (\cite{KM})
The solution, $\hat{F}$, of \eqref{eq.KWP} exists, and is an atomic probability measure, with not more than $n$ atoms. The locations, $\hat{\mu}_j$, and the masses, $\hat{f}_j$, at these locations can be found via the following dual characterization: the solution, $\hat{\nu}$, of
\begin{equation}
\max \{ \sum_{i=1}^n \log \nu_i \; | \; \sum_{i=1}^n \nu_i \varphi(Y_i , \mu) \leq n \text{ for all } \mu \}
\label{eq.KWD}
\end{equation}
satisfies the extremal equations ($n$ equations in fewer than $n$ variables)
\begin{equation} \label{exeq}
\sum_j \varphi(Y_i , \hat{\mu}_j) \hat{f}_j = \frac{1}{\hat{\nu}_i},
\end{equation}
and $\hat{\mu}_j$ are exactly those $\mu$ where the dual constraint is active---that is, the constraint function in \eqref{eq.KWD} is equal to $n$.
\end{theorem}

The dual formulation reduces the objective function to a simple finite dimensional sum, albeit now with an infinite dimensional constraint. The upper bound of $n$ on the number of atoms, established under slightly stronger conditions by \cite{Lindsay}, encourages us in the quest for a discrete formulation. We should hasten to add that we have no assurances about where these atoms occur; in particular, it is clear already from an example in \cite{Laird} that they need not occur at the observed $x_i$. \cite{Laird} proposed using the EM algorithm to solve a discretization of the primal problem \eqref{eq.KWP}, and subsequent authors, notably \cite{HeckmanSinger} and \cite{JZ}, have followed her lead. However, as has been frequently observed, EM can be quite lethargic in its pursuit of the optimum. \cite{KM} describe some comparisons of a fixed point EM algorithm with the interior point method implemented in Mosek. \nocite{Mosek} For a relatively small Gaussian location mixture problem with $n = 200$ and a grid of 300 points for the mixing distribution for $\mu$, the interior point method produced a very precise solution in about 1 second and 15 iterations, while after 10 minutes and 100,000 iterations the EM algorithm was still struggling to obtain the same accuracy as the interior point solution.

In our discrete formulation we consider a fixed grid, $\{ u_1 , ... , u_m \}$, of potential support points for the mixing distribution, $F$. Typically, $m$ is a few hundred, and the grid is equally spaced, but this can be easily adapted to particular applications. We denote by $A$ an $n$ by $m$ matrix, with the elements $\varphi (Y_i , u_j )$ in the $i$-th row and $j$-th column. The discrete version of the primal problem is then,
\[
\min_{f \in \RR^m} \{ - \sum_{i=1}^n \log (g_i) \; | \; Af = g, \; f \in \SS \},
\]
where $\SS$ denotes the unit simplex in $\RR^m$, i.e., $\SS = \{s \in \RR^m | 1^\top s = 1, \; s \geq 0 \}$. So $f_j$ denotes the estimated mixing density, $\hat f$, evaluated at the grid point $u_j$, and $g_i$ denotes the estimated mixture density, $\hat g$, evaluated at $Y_i$. In our experience it is somewhat more efficient to solve the corresponding dual problem,
\[
\max_{\nu \in \RR^n} \{ \sum_{i=1}^n \log \nu_i \; | \; A^\top \nu \leq n 1_m, \; \nu \geq 0 \},
\]
and subsequently recover the primal solution. In the \pkg{REBayes} package we have implemented this dual solution method for a wide variety of mixture problems that we will describe in subsequent sections.
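To make the discrete formulation concrete, the following sketch (not evaluated as part of the vignette) constructs the grid $\{ u_1 , ... , u_m \}$ and the $n$ by $m$ matrix $A$ by hand for a small Gaussian location mixture and then computes the NPMLE on the same grid with \code{GLmix}. Supplying the grid through the \code{v} argument and extracting the components \code{x}, \code{y} and \code{logLik} follows the usage illustrated in later sections; for an equally spaced grid, normalizing \code{y} yields the estimated mixing masses $\hat f_j$.

<<KWdiscrete, eval = FALSE>>=
# Discrete Kiefer-Wolfowitz formulation assembled by hand for a Gaussian
# location mixture, then fitted with GLmix on the same grid
set.seed(1234)
y <- c(rnorm(900), rnorm(100, mean = 2))            # mixture sample, n = 1000
u <- seq(min(y) - 0.5, max(y) + 0.5, length = 300)  # equally spaced grid, m = 300
A <- outer(y, u, function(y, u) dnorm(y - u))       # A[i,j] = phi(y_i - u_j)
z <- GLmix(y, v = u)                                # NPMLE computed on the grid u
fhat <- z$y/sum(z$y)                                # estimated mixing masses f_j
ghat <- c(A %*% fhat)                               # fitted mixture density g_i at y_i
sum(log(ghat))                                      # should be close to z$logLik
@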
It is frequently convenient to consider weighted MLE formulations, so \pkg{REBayes} fitting functions make some provision for weights. The implementation relies heavily on the Mosek optimization software of \cite{Mosek} and its \proglang{R} interface package \pkg{Rmosek}, \cite{Rmosek}.

\section{Gaussian mixture models}

Gaussian mixture models are a natural point of departure for application of the foregoing methods. We will begin by describing usage in the simplest Gaussian sequence models. Some connections to multiple testing are described in the following subsection. Gaussian scale mixtures are then considered, followed by some brief remarks on Gaussian longitudinal models where heterogeneity in both location and scale comes into play. The section concludes with a cautionary parable concerning Gaussian location-scale mixtures in non-longitudinal settings.

\subsection{Needles in haystacks}

To illustrate our methods in the simplest possible setting, consider the simulation framework of \cite{JS}: we have $X_i \sim \NN ( \mu_i , 1), \; i = 1, ... , n$, with $s$ of the $\mu_i = \mu_0 \neq 0$ and the remainder, $\mu_i = 0$. When $s$ is reasonably large relative to $n$ and $\mu_0$ is well separated from zero, then it should be easy to distinguish the two mass points of the mixture. Suppose we take $n = 1000$ and $s = 100$ with $\mu_0 = 2$; then the mixture density looks like that illustrated in the left panel of Figure \ref{fig:N02}. In the middle panel of the figure we plot the NPMLE estimate of the mixing ``density,'' which puts most of the mass near zero, and the remainder at a value slightly greater than two. The reader is encouraged to repeat this exercise to gauge the reliability of the NPMLE procedure with the \proglang{R} code reproduced below. Finally, in the right panel we illustrate the Bayes rule for predicting $\mu_i$ given observations at various values between -5 and +6. It may be noted that not only are observations below zero shrunken aggressively toward zero, but observations above two are also shrunken toward the estimated prior mass point near two. Observations between zero and two are, given the estimated mixing distribution, more ambiguous, and the Bayes rule must account for both mass points in computing its conditional expectation.

<<N02setup, include=FALSE>>=
N02.cap = "Kiefer Wolfowitz Estimation of a Gaussian Location Mixture: The left panel is the (unknown) two component mixture density, the middle panel is the estimated NPMLE mixing density and the right panel is the estimated Bayes rule for predicting $\\hat \\mu = \\delta (x)$ based on seeing an observation $x$."
set.seed(1729)
@

<<N02, fig.height = 4, fig.width = 10, fig.cap = N02.cap >>=
# A simple Gaussian mixture model
par(mfrow = c(1,3))
x <- seq(-5, 6, by = 0.05)
plot(x, 0.9 * dnorm(x,0) + 0.1 * dnorm(x,2), type = "l", xlab = "x",
     ylab = expression(g(x)), main = "")
y <- rep(c(0,2), times = c(900,100)) + rnorm(1000)
z <- GLmix(y)
plot(z, xlab = expression(mu), ylab = expression(f(mu)), main = "")
plot(x, predict(z,x), type = "l", ylab = expression(delta(x)))
@

The Tweedie shrinkage strategy depicted in Figure \ref{fig:N02} is effective not only in shrinking the observations with $\mu_i = 0$ toward zero, but also in shrinking the non-null $\mu_i = 2$ observations toward two. This helps to explain the good performance of the NPMLE described in \cite{bakeoff} relative to the thresholding and parametric empirical Bayes procedures of \cite{JS}, \cite{MW} and \cite{CvdV}.
These competitors are quite good at shrinking the null observations toward zero; unlike the NPMLE, they {\it know} that there is mass at zero. But they tend to leave the non-null observations alone, and this tends to inflate their mean squared error. This observation raises a natural question: how would the NPMLE do when the non-null observations came from a more diffuse distribution? In Figure \ref{fig:N0G} we illustrate similar performance for a Gaussian location mixture in which 200 of the 1000 observations have $\mu_i$'s drawn from a $\NN (2,1)$ distribution. The true mixture density looks quite similar to the prior example, but the NPMLE now identifies three distinct mass points: one large one near zero, a smaller one near two, and a very small mass point at about 4.5. The Bayes rule is still quite sure that negative $x_i$ should be pulled toward zero, and observations near two are nudged toward two. But despite its small mass the upper mass point of the estimated mixing distribution exerts a substantial effect. Only when we see extremely large observations bigger than 4.5 are they pulled back toward this largest mass point. This example is considerably more challenging than the previous one, but nevertheless the empirical Tweedie formula produced by the NPMLE provides a reasonable approach.

<<N0Gsetup, include=FALSE>>=
N0G.cap = "Kiefer Wolfowitz Estimation of a Gaussian Location Mixture: The left panel is the (unknown) mixture density, the middle panel is the estimated NPMLE mixing density and the right panel is the estimated Bayes rule for predicting $\\hat \\mu = \\delta (x)$ based on seeing an observation $x$."
set.seed(1726)
@

<<N0G, fig.height = 4, fig.width = 10, fig.cap = N0G.cap >>=
# Another simple Gaussian mixture model
par(mfrow = c(1,3))
x <- seq(-5, 7, by = 0.05)
plot(x, 0.8 * dnorm(x,0) + 0.2 * dnorm(x,2,sqrt(2)), type = "l", xlab = "x",
     ylab = "g(x)", main = "")
y <- c(rep(0,800), rnorm(200, 2)) + rnorm(1000)
z <- GLmix(y)
plot(z, xlab = expression(mu), ylab = expression(f(mu)), main = "")
plot(x, predict(z,x), type = "l", ylab = expression(delta(x)))
@

In the foregoing examples we have employed the posterior mean as a prediction, implicitly assuming that we faced quadratic loss; however, it is straightforward to adopt other loss functions and provide alternative predictions. For \code{GLmix} fitted objects \code{predict.GLmix} allows the user to specify posterior median or posterior modal prediction by setting the argument \code{Loss} equal to 1 or 0, respectively. If \code{Loss} is specified as a value $\tau$ between zero and one, predictions return the posterior $\tau$th quantile. Empirical Bayes posterior quantile prediction is considered in \cite{mbr16}, although they restrict attention to linear shrinkage rules. They reference a large literature on the so-called ``newsvendor'' problem going back to \cite{E88} that motivates the quantile loss function. Analogous \code{predict} functions are also available for Poisson, i.e., \code{Pmix}, and binomial, i.e., \code{Bmix}, fitted objects.

\subsection{Gaussian mixtures and multiple testing}

\cite{Robbins.51} introduced compound decision making with the following (deceptively) simple problem. Suppose we observe,
\begin{equation}
Y_i = \theta_i + u_i, \quad i = 1, \cdots , n,
\label{eq.model}
\end{equation}
with $\{ u_i \}$ iid standard Gaussian, and we know that the $\theta_i$ take values $\pm 1$.
The objective is to estimate the $n$-vector, $\theta \in \{ -1 , 1 \}^n$, subject to $\ell_1$ loss,
\[
L(\hat \theta , \theta) = n^{-1} \sum_{i=1}^n | \hat \theta_i - \theta_i |.
\]
He observes that when $n = 1$ the least favorable version of the problem occurs when we assume that the $\theta_i$'s are drawn independently, taking the values $\pm 1$ with probability $p = 1/2$, and then he proceeds to show that this remains true for the general ``compound decision'' problem with $n \geq 1$. The minimax decision rule is thus,
\[
\delta_{1/2} (y) = \sgn (y)
\]
and yields constant risk,
\[
R(\delta_{1/2} , \theta) = \EE L ( \delta_{1/2} (Y) , \theta ) = \Phi(-1) \approx 0.1586,
\]
irrespective of $p$. And yet, something feels wrong with this procedure. If we saw mostly positive $Y_i$'s wouldn't we begin to think that $p \neq 1/2$? Why are we so attached to the worst case scenario? Exploiting the common structure of the $n$ problems, Robbins suggests {\it estimating} $p$ by $\hat p = (\bar y + 1)/2$. Given this method of moments estimate of $p$, he suggests plugging it into the decision rule,
\[
\delta_p (y) = \sgn (y - \tfrac{1}{2} \log ( (1-p)/p) ),
\]
a procedure that follows immediately from the requirement that the posterior probability,
\[
\PP(\theta = 1 | x, p) = \frac{p \varphi(x-1)}{p \varphi(x-1) + (1-p) \varphi(x+1)},
\]
exceed one half, that is, that the posterior median of $\theta$ be 1. This prototype empirical Bayes procedure sacrifices a little in performance when $p$ is really near $1/2$, but achieves substantial gains in performance when $p$ differs substantially from $1/2$. Of course, when $n$ is large, $\hat p \rightarrow p$, so we have a form of asymptotic optimality.

The link to the multiple testing literature for the Robbins problem is immediately clear since estimation of $\theta \in \{ -1 , 1 \}^n$ is essentially a testing problem in which we have weighed false discovery and false non-discovery equally. If we treat $\theta = -1$ as the null hypothesis and $\theta = 1$ as the alternative, a $p$-value procedure based on $T_i = 1 - \Phi(X_i +1)$ with cutoff $\Phi (-1)$ yields the decision rule,
\[
\delta_p (T) = \sgn(\Phi(-1) - T),
\]
which is equivalent to the minimax rule, $\delta (x) = \sgn (x)$. If, instead, we would like to fix the marginal false discovery rate (mFDR) at some level and optimize the marginal false nondiscovery rate (mFNR), a modified $p$-value cutoff can be constructed, and this would be equivalent to replacing our symmetric $\ell_1$ loss for the estimation/classification problem by an asymmetric linear loss.

A $p$-value testing procedure that is equivalent to the empirical Bayes rule estimator described earlier for the Robbins problem can also be constructed. Under the null that $X_i \sim \NN (-1,1)$, $T_i = 1 - \Phi(X_i + 1) \sim U[0,1]$, while if $X_i \sim \NN (1,1)$,
\[
\PP (T_i < u) = \PP (X_i + 1 > \Phi^{-1} (1-u)) = 1 - \Phi (\Phi^{-1} (1-u) - 2).
\]
Thus, under the null, the density of $T$ is $f_0 (t) \equiv 1$, and under the alternative,
\[
f_1 (t) = \varphi (\Phi^{-1} (1-t) - 2)/\varphi(\Phi^{-1}(1-t)),
\]
and the posterior probability of $\theta_i = 1$ given $t_i$ and assuming for the moment that the unconditional probability, $p = \PP(\theta_i = 1)$, is known, is given by,
\[
\PP (\theta = 1 | t , p ) = \frac{p f_1 (t)} { p f_1 (t) + (1 - p) f_0 (t)}.
\]
Under symmetric loss we were led to the posterior median so $\hat \theta_i = 1$ if $\PP ( \theta_i = 1 | T_i, p ) > 1/2$, which is equivalent to the $p$-value rule,
\[
T_i < 1 - \Phi(1 + \tfrac{1}{2} \log ( (1 - p)/p )).
\]
Again, we are led back to the problem of estimating $p$. In these two-point mixture problems $\ell_1$ loss is equivalent to $0-1$ loss since the median and the mode are identical.

In \cite{ProbRob} we explore some extensions of this simple setting to several other multiple testing problems. We first consider a grouped setting in which we have,
\[
Y_{ij} = \theta_{ij} + u_{ij}, \quad i = 1, \cdots, n, \; \; j = 1, \cdots , m,
\]
with $\{ u_{ij} \}$ iid standard Gaussian as before, and $\theta_{ij} = 1$ with probability $p_i$ and $\theta_{ij} = -1$ with probability $1- p_i$, and independent over $j = 1, \cdots , m$. In this framework we can consider ``group specific'' $p_i$ that vary within the full sample, yielding a nonparametric mixture problem. In the multiple testing context this grouped model has been considered by \cite{EfronAAS.08} and \cite{SunCai} among others. This formulation leads us back to the Kiefer and Wolfowitz NPMLE. We also consider abandoning the rather implausible assumption that we know the support points of the $\theta$'s. This allows us to consider multiple testing rules for more realistic settings with both composite null and alternative hypotheses. A comparison of these rules with the empirical characteristic function procedures of \cite{SM} shows very favorable performance.

\subsection{Gaussian scale mixtures}

Gaussian scale mixtures can be estimated in much the same way that we have described for location mixtures. Suppose we now observe an unbalanced panel,
\[
y_{it} = \sqrt{\theta_i} u_{it}, \quad t = 1, \cdots , m_i, \quad i= 1, \cdots , n
\]
with $u_{it} \sim \NN (0,1)$. Sufficiency reduces the sample to $n$ observations on $S_i = m_i^{-1} \sum_{t=1}^{m_i} y_{it}^2$, and thus $S_i$ has a gamma distribution with shape parameter, $r_i = m_i / 2$, and scale parameter $\theta_i / r_i$, i.e.,
\[
\gamma (S_i | r_i, \theta_i/r_i ) = \frac{1}{\Gamma (r_i) (\theta_i/r_i)^{r_i}} S_i^{r_i-1} \exp \{ - S_i r_i /\theta_i \},
\]
and the marginal density of $S_i$ when the $\theta_i$ are iid from $F$ is
\[
g(S_i ) = \int \gamma (S_i | r_i, \theta/r_i ) dF(\theta).
\]
Estimation of $F$ proceeds as in the location mixture setting except that now the matrix $A$ has typical element $\gamma (S_i | r_i, \theta_j/r_i )$, with the $\theta_j$'s constituting a fine grid covering the support of the sample $S_i$'s. This can be implemented in \pkg{REBayes} with the function \code{GVmix}, which may be seen as a general procedure for scale mixtures of $\chi^2$. A yet more general procedure for scale mixtures of gamma random variables is provided by the function \code{gammamix}. \cite{Robbins.82} contains an early discussion of parametric empirical Bayes methods for scale mixtures of Gaussians; \cite{vdv1996} considers semiparametric efficiency for the same model with an additional unknown location parameter. The scale mixture of Gaussians is also a crucial building block for the more general location-scale mixture we have considered in the longitudinal setting. An application of the Gaussian scale mixture procedure is described in \cite{Kasm} where a simple bivariate linear regression model,
\[
Y_i = \beta_0 + x_i \beta_1 + U_i
\]
is considered.
The $U_i$ are assumed to be iid draws from a scale mixture of Gaussians, so the $U_i^2$ have mixture density,
\[
g(v) = \int_0^\infty \gamma(v | \theta) dF (\theta),
\]
where $\theta = \sigma^2$, and $\gamma$ is the $\chi^2 (1)$ density with free scale parameter $\theta$,
\[
\gamma(v | \theta) = \frac{1}{\Gamma(1/2) \sqrt{2 \theta}} v^{-1/2} \exp ( -v/(2 \theta) ).
\]
Given a preliminary estimate of the $\beta$ parameters we can estimate the mixing distribution $F$ based on the sample of $\hat u_i^2$'s, and this in turn can be used to estimate the score function,
\[
\hat \psi (u) = (- \log \hat g (u))' = \frac{ \int u \varphi(u/\sigma)/\sigma^3 d \hat{F} (\sigma)} {\int \varphi(u/\sigma)/\sigma d \hat{F} (\sigma)},
\]
used to reestimate $\beta$. Iterating this procedure may be seen as our first encounter with Kiefer-Wolfowitz profile likelihood and can be shown to achieve an asymptotically fully efficient regression estimator for the class of linear models with iid scale mixture of Gaussian errors.

\subsection{Longitudinal Gaussian models}

Longitudinal data allow us to explore heterogeneity in both location and scale for Gaussian models. Let's begin by considering the model,
\[
y_{it} = \alpha_i + \sqrt{\theta_i} u_{it}, \quad t = 1, \cdots , m_i, \quad i= 1, \cdots , n
\]
with $u_{it} \sim \NN (0,1)$. We will provisionally assume that $\alpha_i \sim F_\alpha$ and $\theta_i \sim F_{\theta}$ are independent. Again, we have sufficient statistics:
\[
\bar y_i | \alpha_i , \theta_i \sim \NN (\alpha_i, \theta_i/m_i )
\]
and
\[
S_i | r_i , \theta_i \sim \gamma ( S_i | r_i , \theta_i/r_i ),
\]
where $r_i = ( m_i - 1) /2$, $S_i = (m_i - 1)^{-1} \sum_{t=1}^{m_i} (y_{it} - \bar y_i)^2$, and the log likelihood becomes,
\[
\ell (F_\alpha , F_{\theta} | y) = K(y) + \sum_{i=1}^n \log \int \int \gamma(S_i | r_i , \theta /r_i ) \sqrt{m_i} \phi (\sqrt{m_i} (\bar y_i - \alpha)/\sqrt{\theta})/\sqrt{\theta} dF_\alpha (\alpha) dF_{\theta} (\theta).
\]
Since the scale component of the log likelihood is additively separable from the location component, we can solve for $\hat F_{\theta}$ in a preliminary step, as in the previous subsection, and then solve for the $\hat F_\alpha$ distribution. In fact, under the independent prior assumption, we can re-express the Gaussian component of the likelihood as Student-$t$ and thereby eliminate the dependence on $\theta$ in the Kiefer-Wolfowitz problem for estimating $F_\alpha$. An implementation is available in the function \code{WTLVmix} of \pkg{REBayes}. \cite{Bayesball} describe an application to predicting baseball batting averages in which, following \cite{brown.08}, averages are transformed to normality, and the $\theta$'s reflect either under- or over-dispersion relative to the standard binomial model. Again, profile likelihood is used to explore covariate effects embedded in this model of heterogeneity. In particular we estimate an age profile for batting prowess as a quadratic effect that peaks at age 27. Comparing predictive performance for this model, we find that the independent prior NPMLE performs considerably better than its more naive competitors.

It is also possible to relax the independence assumption on the location and scale effects. In \cite{Haydn} we use longitudinal data from the Panel Study of Income Dynamics (PSID) to study models of income dynamics with an arbitrary joint distribution of location and scale heterogeneity. In these models we estimate an AR(1) effect by profile likelihood.
The implementation for these models uses the function \code{WGLVmix} and requires a bivariate gridding strategy for the mixing distribution. We find that there is a distinct {\it negative} dependence between the $\alpha$ (location) and $\theta$ (scale) effects, indicating that low ``ability'' individuals also tend to have high income variability. Accounting for heterogeneity in scale has an acute effect on the estimation of the AR(1) parameter, reducing what is often regarded as a unit root effect to a rather mild $\rho \approx 0.5$. The Bayesian formulation of these models offers the significant additional advantage that it affords a convenient environment for forecasting future income trajectories.

\subsection{The parable of the crabs: A cautionary tale}

The first formal estimation of a mixture model in statistics seems to have been Karl Pearson's \nocite{Pearson} 1894 analysis of the ratio of ``forehead breadth'' to body length of 1000 crabs sampled from the Bay of Naples by the prominent biologist W.F.R. Weldon. Pearson estimated a two component normal mixture model by the method of moments, a truly heroic computational effort given the technology of the time. He allowed his two normal components to have distinct means and variances, so, together with the relative weight of the two components, he had five parameters. Modern (EM) methods are capable of producing similar results, although they are quite sensitive to the choice of initial values. It is thus tempting to ask: Can the Kiefer-Wolfowitz NPMLE offer any further insight into such problems? The short answer, unfortunately, is no. The immediate difficulty one encounters is that in contrast to our baseball application, or the income dynamics model, there is no longitudinal dimension to the data. All we have is a single sample, a basket of crabs. If we were to assume that we had simply a location mixture, or simply a scale mixture, it would be easy to estimate the mixing distribution with the NPMLE. But if we try to emulate Pearson and estimate a nonparametric location and scale mixture, we are headed for a Dirac catastrophe. For each observation, we are entitled to assign a distinct mixing value $\mu_i = x_i$; corresponding to these $\mu_i$ we are also entitled to assign $\sigma_i = 0$; and to each of these points $( \mu_i , \sigma_i ) = (x_i, 0 ), \; i = 1, ... , n$, we can assign mass $1/n$. The likelihood explodes, and our mixing distribution collapses to the familiar empirical distribution. The moral of this fable is this: Sorting a basket of crabs is tougher than it might seem. Kiefer and Wolfowitz knew a thing or two about this; the final paragraph of their 1956 paper points out the fundamental difficulty of the location-scale Gaussian mixture model, and earlier they had already pointed out that the empirical distribution function was, itself, an MLE, of a sort. \cite{Teicher67} provides a more formal discussion.

\section{Mixture models for counts}

The Kiefer-Wolfowitz NPMLE can also be useful in analyzing discrete random variables such as count data, where unobserved heterogeneity also arises naturally. Many applications involve count data as an object of interest: the number of patents across firms or industries, the number of hospital visits among patients, or the number of claims in insurance applications. The typical model for analyzing such data is Poisson regression.
Often, however, even after accounting for observed covariates, there remains some over- or under-dispersion in the data, indicating a need to introduce additional unobserved heterogeneity into the Poisson model. In the literature, a parametric model is typically imposed on the distribution of this unobserved heterogeneity. We illustrate below how the NPMLE provides a more flexible nonparametric approach for handling unobserved heterogeneity in Poisson models, based on a model for the number of claims for a group life insurance policy. We also point out some advantages of the NPMLE over the linear credibility estimators that are widely used for experience rating of insurance contracts. For a detailed discussion of credibility theory in actuarial science see \cite{Cred}.

Our data, first analyzed in \cite{Norberg89}, consists of a portfolio of Norwegian workmen's group life insurance policies. The original 1125 contracts are aggregated into 72 occupational categories, and for each occupational group, $i = 1, \dots, 72$, we observe the total number of deaths $X_i$ (number of claims) and the total number of years exposed to risk $E_i$. Data on the 1125 individual contracts is only partially documented in \cite{Norberg89}, so we resort to the 72 occupation group data that is documented in \cite{Haastrup} and provided in the dataset \code{Norberg} in the \pkg{REBayes} package, available as \code{data("Norberg")}. Figure \ref{fig:P01} displays the histogram of the ratio of $X_i$ to $E_i$. Following \cite{Norberg89}, we assume a Poisson model for $X_i$, so conditional on iid $\theta_i \sim G$,
\[
X_i \sim \text{Poisson}(\theta_i E_i).
\]
Here $E_i$ is renormalized by a factor of 344 as in \cite{Haastrup}, and can be interpreted as the {\it a priori} expected number of claims in the contract period. The multiplicative occupation-specific unobserved factor $\theta_i$ then accounts for the fact that various occupations have different risk profiles that are not observed, but can be indirectly inferred from the observed number of claims. In classical credibility theory this leads to insurance premiums tailored to individual risk profiles based on the observed claims and exposures that have occurred. Rather than assuming that the distribution $G$ belongs to a particular parametric class as in \cite{Norberg89} and \cite{Haastrup}, we adapt the Kiefer-Wolfowitz NPMLE to this task. \cite{Haastrup} also conducts a nonparametric Bayesian analysis with a Dirichlet process prior using a Gamma distribution as the base measure; our methods serve as a nonparametric empirical Bayes contrast to his results.

<<P01setup, include=FALSE>>=
P01.cap = "Histogram of Claims per Exposure for 72 occupation groups."
@

<<P01, fig.height = 4, fig.width = 6, fig.cap = P01.cap >>=
# Parametric Gamma vs Poisson mixture models for insurance claims
data("Norberg")
E <- Norberg$Exposure/344
X <- Norberg$Death
hist(X/E, 90, freq = TRUE, xlab = "X/E", main = "", ylab = "Frequency")
@

<<P02setup, include=FALSE>>=
P02.cap = "Estimated mixing distribution $G$ for $\\theta$ for the group insurance data. The left panel depicts the Kiefer-Wolfowitz NPMLE estimator for $G$ with 1000 grid points. The right panel depicts the cube root of the mass associated with support points around 8. The smooth curve superimposed in the left panel corresponds to the parametric maximum likelihood estimate of the mixing density assuming $G$ follows a Gamma distribution."
@

<<P02, fig.height = 4, fig.width = 6, cache = TRUE, fig.cap = P02.cap >>=
# Maximum likelihood estimation of the Gamma model
logL <- function(par, x, e){
    f <- choose(x + par[1] - 1, x) * (par[2]/(e + par[2]))^par[1] * (e/(e+par[2]))^x
    -sum(log(f))
}
z <- optim(c(5,5), logL, x = X, e = E, hessian = TRUE)
sez <- sqrt(diag(solve(z$hessian)))
z <- z$par
# Estimation of the Poisson mixture model
f = Pmix(X, v = 1000, exposure = E, rtol = 1e-8)
# Now plot the comparison
par(mfrow=c(1,2))
plot(f$x, f$y/sum(f$y), type="l", xlab = expression(theta),
     ylab = expression(f(theta)), ylim = c(0,1))
lines(f$x, dgamma(f$x, shape = z[1], rate = z[2]), col = 2)
plot(f$x, (f$y/sum(f$y))^(1/3), type="l", xlab = expression(theta),
     ylab = expression(f(theta)^{1/3}), ylim = c(0,1))
@

Figure \ref{fig:P02} contrasts the NPMLE estimator with the corresponding parametric empirical Bayes estimates assuming that $G$ follows a Gamma distribution. The main reason for adopting the Gamma mixing distribution is analytical convenience. With $\theta_i \sim \text{Gamma}(\alpha, \beta)$, the marginal distribution of $X_i$ follows a negative binomial distribution,
\begin{align*}
g(X_i | E_i) &= \int \frac{(\theta E_i)^{X_i} \exp(-\theta E_i)}{X_{i}!} \frac{\beta^{\alpha}}{\Gamma(\alpha)} \theta^{\alpha-1} \exp(-\beta \theta) d\theta\\
& = \binom{X_{i}+\alpha - 1}{X_i} \left (\frac{\beta}{E_i + \beta}\right )^{\alpha} \left (\frac{E_i}{E_i + \beta}\right )^{X_i}.
\end{align*}
The maximum likelihood estimates are $\hat \alpha = 6.02$ and $\hat \beta = 5.25$. The credibility estimator of the risk per exposure leads to
\[
\hat \theta_i = \delta_i X_i/E_i + (1- \delta_i) \EE(\theta)
\]
with $\EE(\theta) = \int \theta dG(\theta)$ and $\delta_i = \frac{\VB(\theta)}{\VB(\theta) + \EE(\theta)/E_i}$. Under the parametric assumption that $G$ is $\text{Gamma}(\alpha, \beta)$, it is easy to see that $\EE(\theta) = \alpha/\beta$ and $\VB(\theta) = \alpha/\beta^2$, hence $\hat \theta_i = \frac{X_i + \alpha}{E_i + \beta}$, which is nothing but $\EE(\theta | X_i, E_i)$ from the Poisson-Gamma mixture model. The Gamma assumption on $G$ leads to a convenient analytical form for the credibility estimator, but since it may produce a rather unrealistic estimate of the underlying mixing distribution, the premium calculation based on the parametric credibility estimator may be questionable. In Figure \ref{fig:P02} we see that although the majority of the support points seem to be situated under the ``umbrella'' of the Gamma density, the Gamma distribution fails to detect the two outliers (Group 13 and Group 53, with $X/E$ ratios equal to 8.89 and 7.98, respectively) that account for the remote mass point around 8. In the right panel of Figure \ref{fig:P02}, we plot the cube root of the estimated mixing distribution and magnify the very small yet important mass point around 8. One may argue that these two occupational groups could be viewed as outliers and hence should not be allowed to influence our views about the distribution of the unobserved risk factor $\theta$. However, an insurance company would ignore them at its peril.
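Before turning to the credibility estimator based on the NPMLE, a quick numerical check (not evaluated as part of the vignette; the claim count and exposure below are purely illustrative) confirms that the Poisson-Gamma posterior mean coincides with the credibility formula $(X_i + \alpha)/(E_i + \beta)$ derived above.

<<PGcheck, eval = FALSE>>=
# Numerical check of the Poisson-Gamma credibility formula
alpha <- 6.02; beta <- 5.25       # parametric MLEs reported above
x <- 10; e <- 4                   # illustrative claims and (rescaled) exposure
num <- integrate(function(t) t * dpois(x, t * e) * dgamma(t, alpha, beta), 0, Inf)$value
den <- integrate(function(t) dpois(x, t * e) * dgamma(t, alpha, beta), 0, Inf)$value
c(numerical = num/den, closed.form = (x + alpha)/(e + beta))  # both approximately 1.73
@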
For our general mixing distribution NPMLE estimator $\hat G$, the credibility estimator then becomes,
\[
\hat \theta_i = \EE(\theta | X_i, E_i) = \frac{\int \theta \frac{(\theta E_i)^{X_i} \exp(-\theta E_i)}{X_{i}!} d\hat G(\theta)}{\int \frac{(\theta E_i)^{X_i} \exp(-\theta E_i)}{X_{i}!} d\hat G(\theta)}.
\]
Figure \ref{fig:P03} contrasts the $\hat \theta_i$ based on the parametric Poisson-Gamma empirical Bayes estimator and those based on the nonparametric Poisson mixture model. We can see that for most of the occupational groups the two estimators agree closely, except for the two most extreme cases (Groups 13 and 53), which have the largest $X/E$ ratios. The nonparametric empirical Bayes procedure, relying on the mass point associated with a much larger support point, produces substantially larger credibility estimates for these ``riskier'' groups, thereby justifying a higher premium. The \code{Pmix} function produces the Bayes rule automatically with an output component denoted \code{dy}, as illustrated in the code below.

<<P03setup, include=FALSE>>=
P03.cap = "Comparison of the Parametric and the Nonparametric Empirical Bayes estimator of $\\theta_i$ for 72 occupation groups. As indicated by the 45 degree line there is good agreement between the parametric and nonparametric Bayes rules except for the two groups appearing in the upper right corner of the plot."
@

<<P03, fig.height = 4, fig.width = 6, fig.cap = P03.cap >>=
# Bayes rules for insurance application
PBrule <- (X + z[1])/(E + z[2])
NPBrule <- f$dy
plot(PBrule, NPBrule, cex = 0.5, xlab = "P-EBayes", ylab = "NP-EBayes")
abline(c(0,1))
@

\section{Frailty models in survival analysis}

The notion of frailty to describe unobserved heterogeneity of population risks has become a familiar feature of demographic analysis since its introduction in \cite{VMS}, and has gradually spread to other statistical domains. It is tempting to begin a survival analysis by specifying a simple parametric model for the survival distribution, say the Weibull, and then on further reflection decide that a more flexible approach is necessary. One way to introduce such flexibility is to consider mixtures of the original simple model, for example by letting the scale parameter of the Weibull be random. This sort of thinking leads to deeper concerns about the nature of randomness touched upon by \cite{ABG}. Do we really believe that individual subjects are assigned a scale parameter and then fated to draw a survival time from the corresponding Weibull? Or should we instead just regard the population survival distribution as adequately approximated by the scale mixture? In the absence of further information to distinguish subpopulations it is difficult to see how to untangle these two interpretations, and we will not try to pursue this. Instead, we will illustrate what can be done with our Kiefer-Wolfowitz apparatus in a reanalysis of the influential \cite{CLOV} experiments on medfly mortality. The primary objective of these experiments was to characterize the upper tail of the medfly mortality distribution, an endeavor that revealed several surprising biological features.
\begin{itemize}
\item Mortality rates {\it declined} at advanced ages, contrary to conventional biological wisdom that ageing was an inexorable process of physical decline,
\item The survival distribution had an extremely heavy tail, contrary to the common view that each species had an explicit upper bound on survival prospects,
\item Gender cross-over in mortality rates gave males an advantage at early ages and females an advantage at advanced ages, reversing expectations from other species.
\end{itemize}

In the largest of the three experiments reported in \cite{CLOV}, 1.2 million Mediterranean fruit flies ({\it Ceratitis capitata}) were raised in a large facility in Mexico:
\begin{quote}
``...Pupae were sorted into one of five size classes using a pupal sorter. This enabled size dimorphism to be eliminated as a potential source of sex-specific mortality differences. Approximately, 7,200 medflies (both sexes) of a given size class were maintained in each of 167 mesh covered, 15 cm by 60 cm by 90 cm aluminum cages. Adults were given a diet of sugar and water, ad libitum, and each day dead flies were removed, counted and their sex determined ...''
\end{quote}
Data from this experiment is available in the \pkg{REBayes} package, with further details documented there. All three of the principal conclusions of the study are illustrated in Figure \ref{fig:W01}. As specified in the code fragment below, we compute daily death counts by age and gender, allowing us to plot raw mortality rates by gender. We then estimate the Weibull mixture model using gender-specific Weibull shape parameters as described in \cite{Medflies}. As illustrated in the displayed code, given the estimated mixing distribution it is easy to compute the hazard functions of the corresponding mixture distributions.

<<W01setup, include=FALSE>>=
W01.cap = "Raw and estimated mortality rates for Carey medflies by gender"
@

<<W01, fig.height = 5, fig.width = 8, fig.cap = W01.cap >>=
data("flies")
attach(flies)
# Weibull hazard function
hweibull <- function(s, alpha, lambda, f){
    Lambda <- outer((lambda*s)^(alpha), exp(f$x))
    Surv <- exp(-Lambda) %*% f$y/sum(f$y)
    A <- matrix(0, length(s), length(f$x))
    for (i in 1:length(s)){
        for (j in 1:length(f$x))
            A[i,j] <- dweibull(s[i], shape = alpha,
                               scale = lambda^(-1) * (exp(f$x[j]))^(-1/alpha))
    }
    g <- A %*% f$y
    g/(sum(g)*Surv)
}
ahat <- c(2.793, 2.909) # Gender specific Weibull shape parameters
counts <- tapply(num, list(age, female), "sum")
cols <- c("black", "grey")
labs <- c("Male", "Female")
# Plot raw and estimated hazard functions by gender
for(g in 1:2){
    gc <- counts[!is.na(counts[,g]), g]
    freq <- gc/sum(gc)
    day <- as.numeric(names(gc))
    atrisk <- rev(cumsum(rev(gc)))
    h <- rev(diff(rev(c(atrisk, 0))))/atrisk
    fW <- Weibullmix(day, m = 5000, alpha = ahat[g], weight = freq)
    hW <- hweibull(day, alpha = ahat[g], lambda = 1, fW)
    if(g == 1){
        plot(day[1:100], hW[1:100], type = "l", xlim = c(0,110), ylim = c(0,.20),
             xlab = "Day", ylab = "Hazard")
        points(day[1:100], h[1:100], cex = 0.7)
    } else {
        lines(day[1:120], hW[1:120], col = cols[2])
        points(day[1:100], h[1:100], cex = 0.7, col = cols[2])
    }
    legend("topleft", labs, lty = rep(1,2), lwd = 1.5, col = cols)
}
@

A controversial aspect of the Carey experiment was the effect of cage density. Critics claimed that flies raised in more crowded cages would be more likely to die earlier.
To investigate whether differences in initial cage density had a significant impact on mortality, we can consider a model in which density enters as a linear multiplicative scale shift in the Weibull model; that is, the baseline Weibull scale becomes $\theta_0 \exp ( d_i \beta )$, where $d_i$ denotes initial cage density. To estimate the density effect parameter, $\beta$, we simply evaluate the profiled likelihood on a grid of values on the interval $[-1,1]$, yielding Figure \ref{fig:W02}. This exercise yields a point estimate of about $\hat \beta = -0.5$ that is quite precise, at least if we are to believe the confidence bounds implied by the classical Wilks, $2 \log \lambda \leadsto \chi_1^2$, theory. Leaving the reliability of such intervals to future investigation, we conclude simply that the negative estimated coefficient implies that higher density shifts the survival distribution to the right, thus prolonging lifetimes, and directly contradicting the conjecture of the Carey critics. This finding is confirmed by other methods; see, for example, \cite{KG}, where similar results are reported for both the Cox model and several quantile regression models. Profile likelihood is not always so successful in models of this type; for a cautionary lesson involving estimation of the Weibull shape parameter see \cite{Medflies}.

<<W02setup, include=FALSE>>=
W02.cap = "Initial Cage Density Effect in the Weibull Mixture Model: Profile Log Likelihood (in 1000's) for the cage density effect with 0.95 (Wilks) confidence interval in blue."
@

<<W02, fig.height = 4, fig.width = 6, cache = TRUE, warning = FALSE, fig.cap = W02.cap >>=
# Profile likelihood estimation of initial cage density effect
counts <- tapply(num, list(age, begin), "sum")
freq <- c(counts)
day <- as.numeric(dimnames(counts)[[1]])
den <- as.numeric(dimnames(counts)[[2]])
day <- rep(day, 165)
den <- rep(den, each = 136)
s <- !is.na(freq)
day <- day[s]
den <- den[s]
freq <- freq[s]/sum(freq[s])
beta <- -10:10/10
logL <- beta
for(i in 1:length(beta)){
    f <- Weibullmix(day, m = 500, alpha = 2.95, lambda = exp(beta[i]*den),
                    weight = freq)
    logL[i] <- f$logLik
}
plot(beta, logL/1000, cex = 0.5, xlab = expression(beta), ylab = "Profile Likelihood")
lines(beta, logL/1000)
# Wilks interval for cage density effect
fsp <- splinefun(beta, max(logL) - logL - qchisq(.95,1)/2)
blo <- uniroot(fsp, c(-1, -.5))$root
bhi <- uniroot(fsp, c(-.5, 0))$root
polygon(c(blo, bhi, bhi, blo), c(-40, -40, -30, -30), col = "lightblue")
@

\subsection{MedLife: Fly-by-night insurance for Mediterranean fruit flies}

Imagine that you have been engaged by MedLife{\texttrademark} to design life insurance contracts for medflies of various ages. To keep things relatively simple, suppose that you are not allowed to discriminate on the basis of gender or other observable characteristics, like pupal size or initial cage density. How should we compute an actuarially fair premium for a medfly of age $T$ for a policy that pays 1 if the fly dies between $T$ and $T+k$? We will resist speculating on who the beneficiaries of these policies might be or how double indemnity might be adjudicated. Instead, we will compare our nonparametric Weibull mixture approach with a more conventional parametric method that assumes gamma frailty for the Weibull model. Let's begin by comparing hazard function estimates for the parametric and nonparametric specifications.
When the frailty distribution is gamma, say,
\[
h(z) = \frac{\nu^\eta}{\Gamma (\eta)} z^{\eta - 1} e^{-\nu z},
\]
it is convenient to restrict the mean frailty to be one, so $\nu = \eta$, and to denote $\delta = 1/\eta$. Then, for the Weibull base model with hazard function $a (t) = (\alpha / \beta) (t/\beta)^{\alpha -1}$ and cumulative hazard $A (t) = (t/\beta)^{\alpha}$, we can write the unconditional hazard and survival functions for the population as,
\[
\lambda (t) = a(t)/(1 + \delta A(t))
\]
and
\[
S (t) = (1 + \delta A(t))^{-1/\delta}.
\]
This yields the log likelihood,
\[
\ell ( \alpha, \beta, \delta | t) = \sum_{i=1}^n \{ \log a(t_i) - (1 + 1/\delta) \log(1 + \delta A(t_i)) \}.
\]

<<W03setup, include=FALSE>>=
counts <- tapply(num, age, "sum")
freq <- counts/sum(counts)
day <- as.numeric(names(counts))
counts <- c("0.5" = 0, counts)
atrisk <- rev(cumsum(rev(counts)))
hazard <- rev(diff(rev(c(atrisk, 0))))/atrisk
W03.cap = "Parametric versus Nonparametric Estimates of Medfly Mortality Rates"
@

<<W03, fig.height = 4, fig.width = 6, cache = TRUE, warning = FALSE, fig.cap = W03.cap >>=
# Parametric Gamma frailty vs nonparametric Weibull mixture model
GammaFrailty <- function(pars, age, num, hazard = FALSE){
    alpha <- pars[1]
    beta <- pars[2]
    delta <- pars[3]
    a <- (alpha/beta) * (age/beta)^(alpha - 1)
    A <- (age/beta)^alpha
    if(hazard) z <- a/(1 + delta * A)
    else z <- -sum(num * (log(a) - (1 + 1/delta) * log(1 + delta * A)))
    z
}
pars <- c(5, 20, 1)
z <- optim(pars, GammaFrailty, age = age, num = num)
fitG <- z$par
fitW <- Weibullmix(day, m = 5000, alpha = 2.95, weights = freq)
s <- 1:110
day <- day[s]
hazard <- hazard[s]
plot(day, hazard, cex = 0.5)
lines(day, GammaFrailty(z$par, day, num, hazard = TRUE), lty = 2)
hW <- hweibull(day, alpha = 2.95, lambda = 1, fitW)
lines(day, hW, col = 2)
legend("topright", c("NPMLE", "Gamma"), col = 2:1, lty = 1:2)
@

Estimating the parametric gamma frailty model by maximum likelihood is straightforward, as indicated in the code above, giving $(\hat \alpha, \hat \beta, \hat \delta) = (3.08, 21.12, 0.41)$, and yielding the hazard function shown in Figure \ref{fig:W03}. We superimpose the raw mortality rates and the hazard function based on our Kiefer-Wolfowitz NPMLE, estimated from the full sample without distinguishing medfly gender. While the parametric gamma model is capable of capturing the declining portion of the hazard, it is not sufficiently flexible to adapt to the finer features of the observed mortality rates.

Conditional on a draw of $\theta$ from the frailty distribution, the premium (for a one-period term, $k = 1$) for a fly of age $T = t$ is,
\[
p(t | \theta) = \frac{F(t+1 | \theta) - F(t | \theta)}{S(t | \theta)},
\]
and integrating with respect to $\theta$ we have the unconditional premium,
\[
p(t) = \int \frac{F(t+1 | \theta) - F(t | \theta)}{S(t | \theta)} h( \theta | t) d \theta,
\]
where $h(\theta)$ is the unconditional frailty density, and
\[
h(\theta | t) \equiv h(\theta | T > t) = \frac{S (t | \theta ) h(\theta)}{S(t)} = \frac{\exp (- \theta A(t)) h(\theta)}{ \int \exp (- \theta A(t)) h(\theta) d\theta}
\]
is the corresponding conditional frailty density. The need to condition the frailty distribution on $t$ may seem odd, but a moment's reflection reveals that mass associated with high frailty values, which would imply that subjects die very quickly, must surely be downweighted once subjects attain an age at which having such values is highly improbable.
This is illustrated in Figure \ref{fig:W04} where we depict the estimated conditional frailty density based on our NPMLE at four different ages. To exaggerate the magnitude of the smaller mass points of the NPMLE, we have plotted the cube root of the density. It is clear that the relatively small mass point at $\log(\theta) = -3.4$, visible at age 1.5, has vanished by age 20; flies with such a large frailty would almost surely be dead by age 20.

<<W04setup, include=FALSE>>=
W04.cap = "Conditional Frailty at Various Ages: Note that the cube root of the frailties has been plotted to accentuate the smaller mass points"
@

<<W04, fig.height = 8, fig.width = 8, fig.cap = W04.cap >>=
# Conditional frailty at various ages
Gfrailt <- function(age, fit){
    alpha <- fit[1]
    beta <- fit[2]
    delta <- fit[3]
    A <- (age/beta)^alpha
    (1 + delta * A)^(-1/delta)
}
frailt <- function(v, t, alpha, fit){
    fv = fit$y/sum(fit$y)
    g = sum(exp(-v * (t^alpha)) * fv)
    exp(-v * (t^alpha)) * fv/g
}
par(mfrow = c(2,2))
v <- exp(fitW$x)
for(t in c(1.5, 20, 60, 100)){
    plot(log(v), frailt(v, t, alpha = 2.95, fitW)^(1/3), type = "l",
         main = paste("age =", t), xlab = expression(log(theta)),
         ylab = expression(h( theta , t)^{1/3}))
}
@
% TODO Need to find a way to replace the comma with ! in ylab above.

In Figure \ref{fig:W05} we plot the ten-day term life insurance premium for medflies at various ages for both the parametric gamma model and the nonparametric model. By varying the parameter $k$ in the \code{premium} function one can control the term of the life insurance policy. In the figure $k = 10$, and the premium profile is somewhat smoother than the instantaneous hazard depicted in Figure \ref{fig:W03}. Again we see that the gamma model captures the basic shape of the nonparametric rate structure, but misses some of the nuances.

<<W05setup, include=FALSE>>=
W05.cap = "Ten-Day Term Life Insurance Premia for Medflies of Various Ages"
@

<<W05, fig.height = 4, fig.width = 6, fig.cap = W05.cap >>=
# Life insurance premia for medflies of various ages
premium <- function(v, t, k = 1, alpha, fit){
    if("Weibullmix" %in% class(fit)) {
        R <- t
        for(i in 1:length(t)){
            D <- exp(-v * t[i]^alpha) - exp(-v * (t[i] + k)^alpha)
            D <- D/exp(-v * t[i]^alpha)
            D[is.nan(D)] <- 1
            R[i] <- sum(D * frailt(v, t[i], alpha, fit))
        }
    } else R <- (Gfrailt(t, fit) - Gfrailt(t+k, fit))/Gfrailt(t, fit)
    R
}
v <- exp(fitW$x)
R <- premium(v, day, k = 10, alpha = 2.95, fitW)
plot(day, R, type = "l", col = 2, ylab = "Premium")
R <- premium(v, day, k = 10, alpha = 2.95, fitG)
lines(day, R, lty = 2)
legend("topright", c("NPMLE", "Gamma"), col = 2:1, lty = 1:2)
@

\section{Conclusion}

We have described a new approach to computing the nonparametric maximum likelihood estimator of Kiefer and Wolfowitz for a general class of mixture models, as implemented in the \proglang{R} package \pkg{REBayes}, and illustrated its application in a variety of mixture model settings. The approach exploits recent developments in convex optimization as implemented in the Mosek environment of \cite{Mosek}. \cite{Convex} surveys a broader range of such developments. In addition to the capabilities intended for mixture models, the \pkg{REBayes} package contains the function \code{medde} for norm and shape constrained density estimation. Further details on \code{medde} methods may be found in \cite{KM.10} and the \pkg{REBayes} documentation.
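As a small pointer for interested readers, the following sketch (not evaluated here, and relying only on \code{medde}'s default settings; the returned components \code{x} and \code{y} are assumed to parallel those of the mixture fitting functions above) estimates a penalized density for a Gaussian sample and plots it.

<<meddesketch, eval = FALSE>>=
# A minimal medde sketch with default settings; see ?medde for the
# norm and shape constrained variants
set.seed(1730)
x <- rnorm(500)
fit <- medde(x)
plot(fit$x, fit$y, type = "l", xlab = "x", ylab = "density")
@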
\section{Acknowledgements}
The authors wish to thank Ivan Mizera for many helpful discussions at the early stages of the development of the \pkg{REBayes} package, and two referees for very constructive comments. This research was partially supported by NSF grant SES-11-53548.

\bibliography{rebayes}

\end{document}