---
output:
  pdf_document:
    latex_engine: pdflatex
    citation_package: natbib
    toc: true
    toc_depth: 2
    includes:
        in_header: vignette_head.tex
    keep_tex: true
    number_sections: true
title: "Honest inference in Regression Discontinuity Designs"
author: "Michal Kolesár"
date: "`r format(Sys.time(), '%B %d, %Y')`"
geometry: margin=1in
fontfamily: mathpazo
fontsize: 11pt
bibliography: np-testing-library.bib
vignette: >
  %\VignetteIndexEntry{RDHonest}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include=FALSE, cache=FALSE}
library("knitr")
knitr::opts_knit$set(self.contained = FALSE)
knitr::opts_chunk$set(tidy = TRUE, collapse=TRUE, comment = "#>",
                      tidy.opts=list(blank=FALSE, width.cutoff=55))
```

# Introduction

The package `RDHonest` implements bias-aware inference methods in regression
discontinuity (RD) designs developed in @ArKo18optimal, @ArKo20, and
@KoRo16. In this vignette, we demonstrate the implementation of these methods
using datasets from @lee08, @oreopoulos06, @battistin09, and @LuMi07, which
are included in the package as the data frames `lee08`, `cghs`, `rcp`, and `headst`.
The dataset from @lalive08 used in @KoRo16 is also included in the package as
data frame `rebp`.

# Sharp RD

## Model

We observe units $i=1,\dotsc, n$, with the outcome $Y_i$ for the $i$th unit given
by
\begin{equation}
Y_i = f_Y(x_i) + u_{Y, i}
\end{equation}
where $f_Y(x_i)$ is the expectation of $Y_i$
conditional on the running variable $x_i$ and $u_{Y, i}$ is the regression error that
is conditionally mean zero by definition.

A unit is assigned to treatment if and only if the running variable $x_{i}$ lies
weakly above a known cutoff. We denote the assignment indicator by
$Z_{i}=I\{x_{i}\geq c_{0}\}$. In a sharp RD design, all units comply with the
assigned treatment, so that the observed treatment coincides with treatment
assignment, $D_{i}=Z_{i}$. The parameter of interest is given by the jump of $f$
at the cutoff, $$ \tau_Y=\lim_{x\downarrow c_{0}}f_Y(x)-\lim_{x\uparrow
c_{0}}f_Y(x).$$ Under mild continuity conditions, $\tau_{Y}$ can
be interpreted as the effect of the treatment for units at the threshold
(@htv01). Let $\sigma_Y^2(x_i)$ denote the conditional variance of $Y_i$.

In the @lee08 dataset `lee08`, the running variable corresponds to the margin of victory of
a Democratic candidate in a US House election, and the treatment corresponds to
winning the election. Therefore, the cutoff is zero. The outcome of interest is
the Democratic vote share in the following election.

The `cghs` dataset from @oreopoulos06 consists of a subsample of British
workers. The RD design exploits a change in minimum school-leaving age in the UK
from 14 to 15, which occurred in 1947. The running variable is the year in which
the individual turned 14, with the cutoff equal to 1947 so that the "treatment"
is being subject to a higher minimum school-leaving age. The outcome is log
earnings in 1998.

## Plots

The package provides a function `RDScatter` to plot the raw data. To remove
some noise, the function plots averages over bins of `avg` observations.

```{r, fig.width=4.5, fig.height=3.5, fig.cap="Lee (2008) data"}
library("RDHonest")
## plot averages over bins of 50 observations, for observations at most 50
## points away from the cutoff. See Figure 1.
RDScatter(voteshare~margin, data=lee08, subset=abs(lee08$margin)<=50,
          avg=50, propdotsize=FALSE, xlab="Margin of victory",
          ylab="Vote share in next election")
```

The running variable in the Oreopoulos dataset is discrete. It is therefore
natural to plot the average outcome by each value of the running variable, which
is achieved by setting `avg=Inf`. The option `propdotsize=TRUE` makes the
size of the points proportional to the number of observations that the point
averages over.

```{r, fig.width=4.5, fig.height=3.5, fig.cap="Oreopoulos (2006) data"}
## see Figure 2
f2 <- RDScatter(log(earnings)~yearat14, data=cghs, cutoff=1947,
                avg=Inf, xlab="Year aged 14", ylab="Log earnings",
                propdotsize=TRUE)
## Adjust size of dots if they are too big
f2 + ggplot2::scale_size_area(max_size = 4)
```

## Inference based on local polynomial estimates

The function `RDHonest` constructs one- and two-sided confidence intervals (CIs)
around local linear estimators using either a user-supplied bandwidth, or
a bandwidth that is optimized for a given performance criterion. The sense of
honesty is that, if the regression errors are normally distributed with known
variance, the CIs are guaranteed to achieve correct coverage _in finite
samples_, and achieve correct coverage asymptotically uniformly over the
parameter space otherwise. Furthermore, because the CIs explicitly take into
account the possible bias of the estimators, the asymptotic approximation does
not rely on the bandwidth shrinking to zero at a particular rate---in fact,
the CIs are valid even if the bandwidth is fixed as $n\to\infty$.

We first estimate $\tau_{Y}$ using local linear regression: instead of using all
available observations, we only use observations within a narrow estimation
window around the cutoff, determined by a bandwidth $h$. Within this estimation
window, we may choose to give more weight to observations closer to the
cutoff---this weighting is determined by a kernel $K$. The local linear regression
is just a weighted least squares (WLS) regression of $Y_{i}$ onto the treatment
indicator, the running variable, and their interaction, weighting each
observation using kernel weights $K(x_{i}/h)$. The local linear regression
estimator $\hat{\tau}_{Y, h}$ of $\tau_{Y}$ is given by the first element of the
vector of regression coefficients from this regression,
$$\hat{\beta}_{Y, h}=
  \left(\sum_{i=1}^{n}K(x_{i}/h)m(x_{i})m(x_{i})'\right)^{-1}
  \sum_{i=1}^{n}K(x_{i}/h)m(x_{i})Y_{i},$$
where $m(x)=(I\{x\geq 0\}, I\{x\geq 0\}x, 1, x)'$ collects all the regressors, and we normalize the cutoff to zero.
When the kernel is uniform, $K(u)=I\{\abs{u}\leq 1\}$, $\hat{\tau}_{Y, h}$ is
simply the treatment coefficient from an OLS regression of $Y_{i}$ onto
$m(x_{i})$ for observations that are within distance $h$ of the cutoff. Other
kernel choices may weight observations within the estimation window $[-h, h]$
differently, giving more weight to observations that are relatively closer to
the cutoff.

Equivalently, using the Frisch–Waugh–Lovell theorem, $\hat{\tau}_{Y, h}$ may
also be computed by first running an auxiliary WLS regression of the
treatment indicator $D_{i}$ onto the remaining regressors,
$(I\{x_{i}\geq 0\}x_{i}, 1, x_{i})$ and then running a WLS regression of the
outcome $Y_{i}$ onto the residuals $\tilde{D}_{i}$ from this auxiliary
regression,
$$\hat{\tau}_{Y, h}=\sum_{i=1}^{n}k_{i, h}Y_{i}, \quad k_{i, h}
  =\frac{K(x_{i}/h)\tilde{D}_{i}}{\sum_{i=1}^{n}K(x_{i}/h)\tilde{D}_{i}^{2}}.$$
This representation makes it clear that the estimator is simply a weighted average
of the outcomes. By definition of the residual $\tilde{D}_{i}$, the weights sum
to zero, and satisfy $\sum_{i}k_{i, h}x_{i}=\sum_{i}k_{i, h}x_{i}I\{x_{i}\geq 0\}=0$:
this ensures that our estimate of the jump at $0$ is unbiased when the
regression function is piecewise linear inside the estimation window.
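
As a quick check of these properties, the weights $k_{i, h}$ can be computed
directly from the auxiliary regression. The following base-R sketch uses a
uniform kernel and an arbitrary bandwidth $h=10$, so that the estimate also
coincides with the treatment coefficient from an OLS regression inside the
window:

```{r}
## By-hand computation of the weights k_{i,h} (uniform kernel, h=10)
h <- 10
d <- subset(lee08, abs(margin) <= h)
x <- d$margin
D <- as.numeric(x >= 0)
Dtilde <- residuals(lm(D ~ I(D * x) + x))  # auxiliary regression residuals
k <- Dtilde / sum(Dtilde^2)                # estimation weights k_{i,h}
## weights sum to zero and annihilate piecewise-linear functions
c(sum(k), sum(k * x), sum(k * x * D))
## weighted average of outcomes = OLS treatment coefficient inside the window
c(sum(k * d$voteshare), unname(coef(lm(voteshare ~ D * x, data = d))["D"]))
```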

### Bias-aware confidence intervals

The estimator $\hat{\tau}_{Y, h}$ is a regression estimator, so it will be
asymptotically normal under mild regularity conditions. In particular, if the
regression errors $u_{Y, i}$ are well-behaved, a sufficient condition is that none of the
weights $k_{i, h}$ are too influential in the sense that the maximal leverage
goes to zero, as we discuss in the diagnostics section.

Due to the asymptotic normality, the simplest approach to inference is to use
the usual CI,
$\hat{\tau}_{Y, h}\pm z_{1-\alpha/2}\hatse(\hat{\tau}_{Y, h})$, where
$z_{\alpha}$ is the $\alpha$ quantile of the standard normal distribution.
However, this CI will typically undercover relative to its nominal
confidence level $1-\alpha$ because it's not correctly centered: unless the
regression function $f_{Y}$ is exactly linear inside the estimation window, the
estimator $\hat{\tau}_{Y, h}$ will be biased. If $f_{Y}$ is "close" to linear,
then this bias will be small, but if it is wiggly, the bias may be substantial,
leading to severe coverage distortions.

The idea behind bias-aware inference methods is to bound the potential bias of
the estimator by making an explicit assumption on the smoothness of $f_{Y}$. A
convenient way of doing this is to bound the curvature of $f_{Y}$ by
restricting its second derivative. To allow $f_{Y}$ to be discontinuous at zero,
we assume that it's twice differentiable on either side of the cutoff, with a
second derivative bounded by a known constant $M$. The choice of the exact
curvature parameter $M$ is key to implementing bias-aware methods, and we
discuss it below. Once $M$ is selected, we can work out the potential
finite-sample bias of the estimator and account for it in the CI construction.
In particular, it turns out that the absolute value of the bias of
$\hat{\tau}_{Y, h}$ is maximized at the piecewise quadratic function $x\mapsto
M x^{2}(I\{x< 0\}-I\{x\geq 0\})/2$, so that $$\abs{E[\hat{\tau}_{Y,
h}-\tau_{Y}]}\leq B_{M, h}:=-\frac{M}{2}\sum_{i=1}^{n} k_{i,
h}x_{i}^{2}\operatorname{sign}(x_{i}).$$ Hence, a simple way to ensure that we
achieve correct coverage regardless of the true shape of the regression function
$f_{Y}$ (so long as $\abs{f_{Y}''(x)}\leq M$) is to simply enlarge the usual CI
by this bias bound, leading to the CI $\hat{\tau}_{Y, h}\pm (B_{M,
h}+z_{1-\alpha/2}\hatse(\hat{\tau}_{Y, h}))$. We can actually do slightly better
than this, since the bias bound can't simultaneously be binding on both
endpoints of the CI. In particular, observe that in large samples, the
$t$-statistic $(\hat{\tau}_{Y, h}-\tau_{Y})/\hatse(\hat{\tau}_{Y, h})$ is
normally distributed with variance one, and mean that is bounded by $t=B_{M,
h}/\hatse(\hat{\tau}_{Y, h})$ (ignoring sampling variability in the standard
error, which is negligible in large samples). To ensure correct coverage, we
therefore replace the usual critical value $z_{1-\alpha/2}$, with the $1-\alpha$
quantile of the folded normal distribution $\abs{\mathcal{N}(t,1)}$,
$\cv_{\alpha}(t)$ (note $\cv_{\alpha}(0)=z_{1-\alpha/2}$). This leads to the
bias-aware CI
\begin{equation}
\hat{\tau}_{Y, h} \pm
\cv_{\alpha}(B_{M, h}/\hatse(\hat{\tau}_{Y, h})) \hatse(\hat{\tau}_{Y, h})
\end{equation}
Notice that the bias bound $B_{M, h}$ accounts for the exact *finite-sample bias* of the estimator.
The only asymptotic approximation we have used in its
construction is the approximate normality of the estimator $\hat{\tau}_{Y, h}$,
which obtains without any restrictions on $f_{Y}$---we only need the maximal
leverage to be close to zero, mirroring a standard leverage condition from
parametric regression settings.

The function `CVb` gives the critical values $\cv_{\alpha}(t)$:

```{r, }
CVb(0, alpha=0.05) ## Usual critical value
CVb(1/2, alpha=0.05)

## Tabulate critical values for different bias levels
CVb(0:5, alpha=0.1)
```
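
To see how these pieces fit together, the following sketch constructs a
bias-aware CI by hand with a triangular kernel, $h=8$, and $M=0.1$ (matching the
`RDHonest` call below). The standard error here is a crude EHW-style plug-in, so
it will differ somewhat from the package's default nearest-neighbor standard
errors:

```{r}
## By-hand bias-aware CI: triangular kernel, h=8, M=0.1
h <- 8; M <- 0.1
d <- subset(lee08, abs(margin) <= h)
x <- d$margin
D <- as.numeric(x >= 0)
K <- 1 - abs(x) / h                          # triangular kernel weights
Dtilde <- residuals(lm(D ~ I(D * x) + x, weights = K))
k <- K * Dtilde / sum(K * Dtilde^2)          # estimation weights k_{i,h}
tauhat <- sum(k * d$voteshare)               # point estimate
B <- -M / 2 * sum(k * x^2 * sign(x))         # worst-case bias B_{M,h}
u <- residuals(lm(voteshare ~ D * x, data = d, weights = K))
se <- sqrt(sum(k^2 * u^2))                   # crude EHW-style plug-in
cv <- CVb(B / se, alpha = 0.05)              # bias-aware critical value
c(estimate = tauhat, conf.low = tauhat - cv * se, conf.high = tauhat + cv * se)
```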

The function `RDHonest` puts all these steps together. Specifying curvature
parameter $M=0.1$, bandwidth $h=8$, and a triangular kernel yields:

```{r}
r0 <- RDHonest(voteshare~margin, data=lee08, kern="triangular", M=0.1, h=8)
print(r0)
```

- The default for calculating the standard errors is to use the nearest neighbor
  method. Specifying `se.method="EHW"` changes them to the regression-based
  heteroskedasticity-robust Eicker-Huber-White standard errors. It can be shown
  that unless the regression function $f_{Y}$ is linear inside the estimation
  window, the `EHW` standard errors generally overestimate the conditional
  variance.
- The default option `sclass="H"` specifies the parameter space as second-order
  Hölder smoothness class, which formalizes our assumption above that the second
  derivative of $f_Y$ is bounded by $M$ on either side of the cutoff. The
  package also allows the user to use a Taylor smoothness class by setting
  `sclass="T"`. This changes the computation of the worst-case bias, and allows
  $f_Y$ to correspond to any function such that the approximation error from a
  second-order Taylor expansion around the cutoff is bounded by $M x^{2}/2$. For
  more discussion, see Section 2 in @ArKo18optimal (note the constant $C$ in
  that paper equals $C=M/2$ here).
- Other options for `kern` are `"uniform"` and `"epanechnikov"`, or the user can
  also supply their own kernel function.
- `RDHonest` reports two-sided as well as one-sided CIs. One-sided CIs simply
  subtract off the worst-case bias bound $B_{M, h}$ in addition to subtracting
  the standard error times the $z_{1-\alpha}$ critical value from the estimate.
  It also reports the p-value for the hypothesis that $\tau_Y=0$.
- `RDHonest` also reports the fitted regression coefficients $\hat{\beta}_{Y,
  h}$, and returns the `lm` object under `r0$lm`. We see from the above that the
  fitted slopes below and above the cutoff differ by
  `r round(unname(r0$lm$coefficients[2]), 2)`, for instance.

## Automatic bandwidth choice

Instead of specifying a bandwidth, one can just specify the curvature parameter
$M$, and the bandwidth will be chosen optimally for a given optimality
criterion---minimizing the worst-case MSE of the estimator, or minimizing the
length of the resulting confidence interval. Typically, this makes little difference:

```{r}
RDHonest(voteshare ~ margin, data=lee08, kern="triangular",
         M=0.1, opt.criterion="MSE")
## Choose bws optimal for length of CI
RDHonest(voteshare ~ margin, data=lee08, kern="triangular", M=0.1,
         opt.criterion="FLCI")
```

To compute the optimal bandwidth, the package assumes homoskedastic variance on
either side of the cutoff, which it estimates based on a preliminary local
linear regression using the @ImKa12 bandwidth selector. This homoskedasticity
assumption is dropped when the final standard errors are computed.

Notice that, when the variance function $\sigma^2_Y(x)$ is known, neither the
conditional variance of the estimator, $\sd(\hat{\tau}_{Y,
h})^{2}=\sum_{i=1}^{n}k_{i, h}^{2}\sigma^{2}_{Y}(x_{i})$, nor the bias bound
$B_{M, h}$ depends on the outcome data. Therefore, the MSE and the length of the
infeasible CI, $2\cv_{\alpha}(B_{M, h}/\sd(\hat{\tau}_{Y, h}))
\sd(\hat{\tau}_{Y, h})$, do not depend on the outcome data. To stress this
property, we refer to this infeasible CI (and, with some abuse of terminology,
also the feasible version in eq. (2)) as a fixed-length confidence interval
(FLCI). As a consequence of this property, optimizing the bandwidth for CI
length does not impact the coverage of the resulting CI.

## Choice of curvature parameter

The curvature parameter $M$ is the most important implementation choice. It
would be convenient if one could use data-driven methods to automate its
selection. Unfortunately, if one only assumes that the second derivative of
$f_{Y}$ is bounded by some constant $M$, it is not possible to do that: one
cannot use data to select $M$ without distorting coverage
(@low97, @ArKo18optimal). This result is essentially an instance of the general
issue with pre-testing or model selection rules, such as using
cross-validation or information criteria like AIC or BIC to pick which controls
to include in a regression: doing so leads to distorted confidence intervals.
Here the curvature parameter $M$ indexes the size of the model: a large $M$ is
the analog of saying that all available covariates need to be included in the
model to purge omitted variables bias; a small $M$ is the analog of saying that
a small subset of them will do. Just like one needs to use institutional
knowledge of the problem at hand to decide which covariates to include in a
regression, ideally one uses problem-specific knowledge to select $M$. Analogous
to reporting results based on different subsets of controls in columns of a
table with regression results, one can vary the choice of $M$ by way of
sensitivity analysis.
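
For example, a simple sensitivity analysis varies $M$ over a grid and reports
the resulting estimates and CIs; the grid below is arbitrary, and the bandwidth
is re-optimized for each value of $M$:

```{r}
## Sensitivity analysis: vary the curvature bound M
sapply(c(0.02, 0.05, 0.1, 0.2), function(M) {
    fit <- RDHonest(voteshare ~ margin, data = lee08, M = M,
                    opt.criterion = "MSE")$coefficients
    c(M = M, estimate = fit$estimate, conf.low = fit$conf.low,
      conf.high = fit$conf.high)
})
```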

Depending on the problem at hand, it may be difficult to translate
problem-specific intuition about how close we think the regression function is
to a linear function into a statement about the curvature parameter $M$. In such
cases, it is convenient to have a rule of thumb for selecting $M$ using the
data. To do this, we need to impose additional restrictions on $f_{Y}$ besides
assuming that its second derivative is bounded in order to get around the
results on impossibility of post-model selection inference discussed above. An
appealing way of doing this is to relate the assumption about the local
smoothness of $f_{Y}$ at the cutoff point, which drives the bias of the
estimator but is difficult to measure in the data, to the global smoothness of
$f_{Y}$, which is much easier to measure. In our implementation, we measure
global smoothness by fitting a global quartic polynomial $\check{f}$, separately
on either side of the cutoff, following @ArKo20. We assume that the bound $M$ on
the local second derivative of $f_{Y}$ is no larger than the maximal second
derivative of the global polynomial approximation $\check{f}$. Under this assumption, we can
calibrate $M$ by setting
$$\hat{M}_{ROT}=\sup_{x\in[x_{\min}, x_{\max}]}\abs{\check{f}''(x)}.$$
There are
different ways of relating local and global smoothness, which lead to different
calibrations of $M$. For instance, @ImWa19 propose fitting a global quadratic
polynomial instead, and then multiplying the maximal curvature of the fitted
model by a constant such as $2$ or $4$. An important question for future
research is to figure out whether there is a way of relating local and global
smoothness that is empirically appealing across a wide range of scenarios. See
also @NoRo21 for a discussion of how to visualize the choice of $M$ to aid with its
interpretation.
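
The following base-R sketch illustrates the idea behind $\hat{M}_{ROT}$ for the
Lee data: fit a quartic separately on each side of the cutoff, and take the
largest absolute second derivative, here approximated by evaluating
$\check{f}''$ at the sample points. The package's internal computation may
differ in details, so the value need not match the output below exactly:

```{r}
## Rule-of-thumb curvature: global quartics on each side of the cutoff
M_side <- function(d) {
    b <- coef(lm(voteshare ~ poly(margin, 4, raw = TRUE), data = d))
    ## second derivative of b1 + b2*x + b3*x^2 + b4*x^3 + b5*x^4
    f2 <- function(x) 2 * b[3] + 6 * b[4] * x + 12 * b[5] * x^2
    max(abs(f2(d$margin)))
}
max(M_side(subset(lee08, margin < 0)), M_side(subset(lee08, margin >= 0)))
```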

When the user doesn't supply `M`, the package uses the rule of thumb
$\hat{M}_{ROT}$, and prints a message to inform the user:
```{r}
## Data-driven choice of M
RDHonest(voteshare ~ margin, data=lee08)
```





## Inference when the running variable is discrete

The confidence intervals described above can also be used when the running
variable is discrete, with $G$ support points: their construction makes no
assumptions on the nature of the running variable (see Section 5.1 in @KoRo16
for more detailed discussion).

Units that lie exactly at the cutoff are considered treated, since the
definition of treatment assignment is that the running variable lies weakly
above the cutoff, $x_i\geq c_0$.

As an example, consider the @oreopoulos06 data, in which the running variable is age in years:
```{r}
## Replicate Table 2, column (10) in Kolesar and Rothe (2018)
RDHonest(log(earnings) ~ yearat14, cutoff=1947,
         data=cghs, kern="uniform", M=0.04, opt.criterion="FLCI", sclass="H")
```

In addition, the package provides function `RDHonestBME` that calculates honest
confidence intervals under the assumption that the specification error at the
cutoff is no worse than away from the cutoff, as in Section 5.2 in @KoRo16.

```{r}
## Replicate Table 2, column (6), run local linear regression (order=1)
## with a uniform kernel (other kernels are not implemented for RDHonestBME)
RDHonestBME(log(earnings) ~ yearat14, cutoff=1947,
            data=cghs, h=3, order=1)
```


Let us describe the implementation of the variance estimator $\hat{V}(W)$ used
to construct the CI following Section 5.2 in @KoRo16. Suppose the point
estimate is given by the first element of the regression of the outcome $y_i$ on
$m(x_i)$. For instance, local linear regression with uniform kernel and
bandwidth $h$ corresponds to $m(x)=I(|x|\leq h)\cdot(I(x>c_0),1,x, x\cdot
I(x>c_0))'$. Let $\theta=Q^{-1}E[m(x_i)y_i]$, where $Q=E[m(x_i)m(x_i)']$, denote
the estimand for this regression (treating the bandwidth as fixed), and let
$\delta(x)=f(x)-m(x)'\theta$ denote the specification error at $x$. The RD
estimate is given by the first element of the least squares estimator
$\hat{\theta}=\hat{Q}^{-1}\sum_i m(x_i)y_i$, where $\hat{Q}=\sum_i
m(x_i)m(x_i)'$.

Let $w(x_i)$ denote a vector of indicator (dummy) variables for all support
points of $x_i$ within distance $h$ of the cutoff, so that the conditional mean
$\mu(x_g):=f(x_g)$ at the $g$th support point $x_g$ is given by the $g$th element
of the regression estimand $S^{-1}E[w(x_i)y_i]$, where $S=E[w(x_i)w(x_i)']$. Let
$\hat{\mu}=\hat{S}^{-1}\sum_i w(x_i)y_i$, where $\hat{S}=\sum_i w(x_i)w(x_i)'$
denote the least squares estimator. Then an estimate of
$(\delta(x_1), \dotsc, \delta(x_G))'$ is given by $\hat{\delta}$, the vector with
elements $\hat{\mu}_g-m(x_g)'\hat{\theta}$.

By standard regression results, the asymptotic distribution of $\hat{\theta}$
and $\hat{\mu}$ is given by
\begin{equation*}
\sqrt{n}
\begin{pmatrix}
\hat{\theta}-\theta\\
\hat{\mu}-\mu
\end{pmatrix}\overset{d}{\to}
\mathcal{N}\left(
0,
\Omega
\right),
\end{equation*}
where
\begin{equation*}
\Omega=\begin{pmatrix}
Q^{-1}E[(\epsilon_i^2+\delta(x_i)^2)m(x_i)m(x_i)']Q^{-1}&
Q^{-1}E[\epsilon_i^2 m(x_i)w(x_i)']S^{-1}\\
S^{-1}E[\epsilon_i^2 w(x_i)m(x_i)']Q^{-1}&
S^{-1}E[\epsilon_i^2 w(x_i)w(x_i)']S^{-1}\\
\end{pmatrix}.
\end{equation*}

Let $\hat{u}_i$ denote the regression residual from the regression of $y_i$ on
$m(x_i)$, and let $\hat{\epsilon}_i$ denote the regression residuals from the
regression of $y_i$ on $w(x_i)$. Then a consistent estimator of the asymptotic
variance $\Omega$ is given by
\begin{equation*}
\hat{\Omega}=n\sum_i T_i T_i',
\qquad
T_i'=\begin{pmatrix}
\hat{u}_i m(x_i)'\hat{Q}^{-1}&
\hat{\epsilon}_i w(x_i)'\hat{S}^{-1}
\end{pmatrix}.
\end{equation*}

Note that the upper left block and lower right block correspond simply to the
Eicker-Huber-White estimators of the asymptotic variance of $\hat{\theta}$ and
$\hat{\mu}$. By the delta method, a consistent estimator of the asymptotic
variance of $(\hat\delta, \hat{\theta}_1)$ is given by
\begin{equation*}
    \hat{\Sigma}=
\begin{pmatrix}
-X & I\\
e_1'& 0\\
\end{pmatrix}\hat{\Omega}\begin{pmatrix}
-X & I\\
e_1'& 0\\
\end{pmatrix}',
\end{equation*}
where $X$ is the matrix with $g$th row equal to $m(x_g)'$, and $e_1$ is the first
unit vector.
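
For concreteness, here is a compact base-R sketch of these computations for the
`cghs` data with a uniform kernel and $h=3$. It is purely illustrative, since
`RDHonestBME` carries out these steps internally:

```{r}
## Sketch of the Omega-hat and Sigma-hat computations (cghs data, h=3)
c0 <- 1947; h <- 3
d <- subset(cghs, abs(yearat14 - c0) <= h)
y <- log(d$earnings)
x <- d$yearat14 - c0                          # cutoff normalized to zero
m <- cbind(x >= 0, 1, x, x * (x >= 0))        # m(x_i)
w <- outer(x, sort(unique(x)), "==") + 0      # support-point dummies w(x_i)
n <- length(y)
thetahat <- solve(crossprod(m), crossprod(m, y))
muhat <- solve(crossprod(w), crossprod(w, y)) # cell means
u <- y - m %*% thetahat                       # residuals from m(x) regression
eps <- y - w %*% muhat                        # residuals from w(x) regression
TT <- cbind(drop(u) * m %*% solve(crossprod(m)),
            drop(eps) * w %*% solve(crossprod(w)))
Omegahat <- n * crossprod(TT)
xg <- sort(unique(x))
Xg <- cbind(xg >= 0, 1, xg, xg * (xg >= 0))   # gth row = m(x_g)'
deltahat <- drop(muhat - Xg %*% thetahat)     # estimated specification errors
A <- rbind(cbind(-Xg, diag(length(xg))), c(1, rep(0, 3 + length(xg))))
Sigmahat <- A %*% Omegahat %*% t(A)    # asy. var. of sqrt(n)*(deltahat, thetahat[1])
## RD estimate and its EHW standard error (upper-left block of Omegahat)
c(estimate = thetahat[1], se = sqrt(Sigmahat[length(xg) + 1, length(xg) + 1] / n))
```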

Recall that in the notation of @KoRo16, $W=(g^-, g^+, s^-, s^+)$, and $g^{+}$ and
$g^{-}$ are such that $x_{g^{-}}< c_0\leq x_{g^{+}}$, and
$s^{+}, s^{-}\in\{-1,1\}$. An upper limit for a right-sided CI for
$\theta_1+b(W)$ is then given by

\begin{equation*}
\hat{\theta}_{1}+s^{+}\hat\delta(x_{g^+})+
s^{-}\hat\delta(x_{g^-})+z_{1-\alpha}\hat{V}(W)^{1/2},
\end{equation*}
where $\hat{V}(W)=a(W)'\hat{\Sigma}a(W)$, and $a(W)\in\mathbb{R}^{G_{h}+1}$
denotes a vector with the $g_{-}$th element equal to $s^{-}$,
$(G_{h}^{-}+g_{+})$th element equal to $s^{+}$, the last element equal to one,
and the remaining elements equal to zero. The rest of the construction then
follows the description in Section 5.2 in @KoRo16.


# Fuzzy RD

## Model

In a fuzzy RD design, the treatment status $D_i$ of a unit does not necessarily
equal the treatment assignment $Z_i=I\{x_{i}\geq c_{0}\}$. Instead, the
treatment assignment induces a jump in the treatment probability at the cutoff.
Correspondingly, we augment the outcome model with a first stage that measures
the effect of the running variable on the treatment:
\begin{equation}
Y_i = f_Y(x_i) + u_{Y, i}, \qquad D_i = f_D(x_i) + u_{D, i},
\end{equation}
where $f_D, f_Y$ are the conditional mean functions.

To account for imperfect compliance, the fuzzy RD parameter scales the jump in the
outcome equation $\tau_Y$ by the jump in the treatment probability at the
cutoff, $\tau_D=\lim_{x\downarrow c_{0}}f_D(x)-\lim_{x\uparrow c_{0}}f_D(x)$.
This fuzzy RD parameter, $\theta=\tau_{Y}/\tau_{D}$, measures the local average
treatment effect for individuals at the threshold who comply with the
treatment assignment, provided mild continuity conditions and a monotonicity
condition hold (@htv01). Under perfect compliance, the treatment
probability jumps all the way from zero to one at the threshold so that
$\tau_{D}=1$, and the two parameters coincide.

For example, in the @battistin09 dataset, the treatment variable is an indicator
for retirement, and the running variable is the number of years since being eligible
to retire. The cutoff is $0$. Individuals exactly at the cutoff are dropped from
the dataset; if there were any, they would be assumed to receive the treatment
assignment (i.e., be eligible for retirement).

## Inference based on local polynomial estimates

A natural estimator for the fuzzy RD parameter $\theta$ is the sample analog
based on local linear regression,
\begin{equation*}
  \hat{\theta}_{h}=\frac{\hat{\tau}_{Y, h}}{\hat{\tau}_{D, h}}.
\end{equation*}

Unlike in the sharp case, our bias-aware CIs do rely on the consistency of the
estimator, which generally requires the bandwidth to shrink with the sample
size. Since this estimator is a ratio of regression coefficients, it follows by
the delta method that so long as $\tau_{D}\neq 0$, the estimator will be
asymptotically normal in large samples. In fact, the estimator is equivalent to
a weighted IV regression of $Y_i$ onto $D_i$, using $Z_i$ as an instrument, and
$x_i$ and its interaction with $Z_i$ as controls, so the variance formula is
analogous to the IV variance formula:
\begin{equation*}
\sd(\hat{\theta}_h)^2=\frac{\sd(\hat{\tau}_{Y, h})^2+\theta^2\sd(\hat{\tau}_{D, h})^2
-2\theta\cov(\hat{\tau}_{D, h}, \hat{\tau}_{Y, h})}{\tau_D^2},
\end{equation*}
where $\cov(\hat{\tau}_{D, h}, \hat{\tau}_{Y, h})=\sum_i
k_{i, h}^2\cov(Y_i, D_i\mid x_i)$ is the covariance of the
two estimators.
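
As a check on this construction, the fuzzy point estimate can be reproduced by
hand as the ratio of two sharp RD estimates that use the same kernel and
bandwidth. A small sketch with an arbitrary bandwidth $h=7$ and arbitrary
curvature bounds (which affect the CIs but not the point estimates); the two
numbers should agree so long as no observations are dropped differentially due
to missing values:

```{r}
## Fuzzy point estimate as a ratio of two sharp estimates (h=7)
ff <- RDHonest(log(cn) | retired ~ elig_year, data = rcp, h = 7,
               M = c(0.002, 0.002), T0 = 0)
fy <- RDHonest(log(cn) ~ elig_year, data = rcp, h = 7, M = 0.002)
fd <- RDHonest(retired ~ elig_year, data = rcp, h = 7, M = 0.002)
c(ff$coefficients$estimate, fy$coefficients$estimate / fd$coefficients$estimate)
```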

If the second derivative of $f_Y$ is bounded by $M_Y$ and the second derivative
of $f_D$ is bounded by $M_D$, a linearization argument from Section 3.2.3 in
@ArKo20 shows that the bias can be bounded in large samples by $B_{M, h}$, with
$M=(M_{Y}+\abs{\theta}M_{D})/\abs{\tau_{D}}$, which now depends on $\theta$
itself. Therefore, optimal bandwidth calculations will require a preliminary
estimate of $\abs{\theta}$, which can be passed to `RDHonest` via the option
`T0`. Like in the sharp case, the optimal bandwidth calculations assume
homoskedastic covariance of $(u_{Y, i}, u_{D, i})$ on either side of the cutoff,
which is estimated based on a preliminary local linear regression for both the
outcome and first stage equation, with bandwidth given by the @ImKa12 bandwidth
selector applied to the outcome equation.

```{r}
## Initial estimate of treatment effect for optimal bandwidth calculations
r <- RDHonest(log(cn) | retired ~ elig_year, data=rcp, kern="triangular",
              M=c(0.001, 0.002), opt.criterion="MSE", sclass="H", T0=0)
## Use it to compute optimal bandwidth
RDHonest(log(cn) | retired ~ elig_year, data=rcp, kern="triangular",
         M=c(0.001, 0.002), opt.criterion="MSE", sclass="H",
         T0=r$coefficients$estimate)
```

## Choice of curvature parameters

Like in the sharp RD case, without further restrictions, the curvature
parameters $M_Y$ and $M_D$ cannot be data-driven: to maintain honesty over the
whole function class, a researcher must choose them a priori, rather than
attempting to use a data-driven method. Therefore, one should, whenever
possible, use problem-specific knowledge to decide what choices of $M_Y$ and
$M_D$ are reasonable a priori.

For cases in which this is difficult, the function `RDHonest` implements the
rule of thumb from @ArKo20 described earlier, based on computing the global
smoothness of both $f_Y$ and $f_D$ using a quartic polynomial. When the user
doesn't supply the curvature bounds, the package uses the rule of thumb
$\hat{M}_{ROT}$, and prints a message to inform the user:

```{r}
## Data-driven choice of M
RDHonest(log(cn) | retired ~ elig_year, data=rcp, kern="triangular",
         opt.criterion="MSE", sclass="H", T0=r$coefficients$estimate)
```
See @ArKo20 for a discussion of the restrictions on the
parameter space under which this method yields honest inference.

# Extensions

## Covariates

RD datasets often contain information on a vector of $K$ pre-treatment
covariates $W_{i}$, such as pre-intervention outcomes, demographic, or
socioeconomic characteristics of the units. Similar to randomized controlled
trials, while the presence of covariates doesn't help to weaken the fundamental
identifying assumptions, augmenting the RD estimator with predetermined
covariates can increase its precision.

Let us first describe covariate adjustment in the sharp RD case. We implement
the covariate adjustment studied in @ccft19, namely to include $W_{i}$ as one of
the regressors in the WLS regression, regressing
$Y_{i}$ onto $m(x_{i})$ and $W_{i}$. As in the case with no covariates, we weight each observation with the
kernel weight $K(x_{i}/h)$. This leads to the estimator
\begin{equation*}
  \tilde{\tau}_{Y, h}=\tilde{\beta}_{Y, h,1}, \qquad \tilde{\beta}_{Y, h}=\left(\sum_{i=1}^{n}K(x_{i}/h)
    \tilde{m}(x_{i}, W_{i})\tilde{m}(x_{i}, W_{i})'\right)^{-1}
  \sum_{i=1}^{n}K(x_{i}/h)\tilde{m}(x_{i}, W_{i})Y_{i},
\end{equation*}
where $\tilde{m}(x_{i}, W_{i})=(m(x_{i})', W_{i}')'$. Denote the coefficient on
$W_{i}$ in this regression by $\tilde{\gamma}_{Y, h}$; this corresponds to the
last $K$ elements of $\tilde{\beta}_{Y, h}$. As in the case without covariates,
we first take the bandwidth $h$ as given, and defer the choice of bandwidth to
the end of this subsection.

To motivate the estimator under our framework, and to derive bias-aware CIs
that incorporate covariates, we need to formalize the assumption that the
covariates are predetermined (without any assumptions on the covariates, it is
optimal to ignore the covariates and use the unadjusted estimator
$\hat{\tau}_{Y, h}$). Let $f_{W}(x)=E[W_{i}\mid X_{i}=x]$ denote the regression
function from regressing the covariates on the running variable, and let
\begin{equation*}
  \Sigma_{WW}(x)=\operatorname{var}(W_{i}\mid X_{i}=x), \qquad
  \Sigma_{WY}(x)=\cov(W_{i}, Y_{i}\mid X_{i}=x).
\end{equation*}
We assume that the variance and covariance functions are continuous, except
possibly at zero. Let
$\gamma_{Y}=(\Sigma_{WW}(0_{+})+\Sigma_{WW}(0_{-}))^{-1}
(\Sigma_{WY}(0_{+})+\Sigma_{WY}(0_{-}))$ denote the coefficient on $W_{i}$ when
we regress $Y_{i}$ onto $W_{i}$ for observations at the cutoff. Let
$\tilde{Y}_{i}:=Y_{i}-W_{i}'\gamma_{Y}$ denote the covariate-adjusted outcome.
To formalize the assumption that the covariates are pre-determined, we assume
that $\tau_{W}=\lim_{x\downarrow 0}f_{W}(x)-\lim_{x\uparrow 0}f_{W}(x)=0$, which
implies that $\tau_{Y}$ can be identified as the jump in the covariate-adjusted
outcome $\tilde{Y}_{i}$ at $0$. Following Appendix B.1 in
@ArKo18optimal, we also assume that the covariate-adjusted outcome
varies smoothly with the running variable (except for a possible jump at the
cutoff), in that the second derivative of
\begin{equation*}
  \tilde{f}(x):=f_{Y}(x)-f_{W}(x)'\gamma_{Y}
\end{equation*}
is bounded by a known constant $\tilde{M}$. In addition, we assume $f_{W}$ has
bounded second derivatives.

Under these assumptions, if $\gamma_{Y}$ was known and hence $\tilde{Y}_{i}$ was
directly observable, we could estimate $\tau_{Y}$ as in the case without covariates,
replacing $M$ with $\tilde{M}$ and $Y_{i}$ with
$\tilde{Y}_{i}$. Furthermore, as discussed in @ArKo18optimal, such an
approach would be optimal under homoskedasticity assumptions.
Although $\gamma_{Y}$ is unknown, it turns out that the estimator
$\tilde{\tau}_{Y, h}$ has the same large sample behavior as the infeasible
estimator $\hat{\tau}_{\tilde{Y}, h}$. To show this, note that by standard
regression algebra, $\tilde{\tau}_{Y, h}$ can equivalently be written as
\begin{equation*}
  \tilde{\tau}_{Y, h}=\hat{\tau}_{Y-W'\tilde{\gamma}_{Y, h}, h}=
  \hat{\tau}_{\tilde{Y}, h}-
  \sum_{k=1}^{K}\hat{\tau}_{W_{k}, h}(\tilde{\gamma}_{Y, h, k}-\gamma_{Y, k}).
\end{equation*}
The first equality says that the covariate-adjusted estimate is the same as an
unadjusted estimate that replaces the original outcome $Y_{i}$ with the
covariate-adjusted outcome $Y_{i}-W_{i}'\tilde{\gamma}_{Y, h}$. The second
equality uses the decomposition
$Y_{i}-W_{i}'\tilde{\gamma}_{Y, h}=\tilde{Y}_{i}-W_{i}'
(\tilde{\gamma}_{Y, h}-\gamma_{Y})$
to write the estimator as a sum of the infeasible estimator and a linear
combination of "placebo RD estimators" $\hat{\tau}_{W_{k}, h}$, that
replace $Y_{i}$ in the outcome equation with the $k$th element of $W_{i}$.
Since $f_{W}$ has bounded second derivatives, these placebo estimators converge
to zero, with rate that is at least as fast as the rate of convergence of the
infeasible estimator $\hat{\tau}_{\tilde{Y}, h}$:
$\hat{\tau}_{W_{k}, h}=O_{p}(B_{\tilde{M}, h}+\sd(\hat{\tau}_{\tilde{Y}, h}))$.
Furthermore, under regularity conditions, $\tilde{\gamma}_{Y, h}$ converges to
$\gamma_{Y}$, so that the second term in the previous display is asymptotically
negligible relative to the first. Consequently, we can form bias-aware CIs
based on $\tilde{\tau}_{Y, h}$ as in the case without covariates, treating
the covariate-adjusted outcome $Y_{i}-W_{i}'\tilde{\gamma}_{Y}$ as the outcome,
\begin{equation*}
  \tilde{\tau}_{Y, h}\pm
  \cv_{\alpha}(B_{\tilde{M}, h} / \sd(\hat{\tau}_{\tilde{Y}, h}))\sd(\hat{\tau}_{\tilde{Y}, h}), \qquad
  \sd(\hat{\tau}_{\tilde{Y}, h})^2=\sum_{i=1}^{n}k_{i, h}^{2}\sigma^{2}_{\tilde{Y}}(x_{i}),
\end{equation*}
where
$\sigma^{2}_{\tilde{Y}}(x_{i})=\sigma^{2}_{Y}(x_{i})+{\gamma}_{Y}'\Sigma_{WW}(x_{i}){\gamma}_{Y}-2{\gamma}_{Y}'\Sigma_{WY}(x_{i})$.
If the covariates are effective at explaining variation in the outcomes, then
the quantity
$\sum_{i}k_{i, h}^{2}\left({\gamma}_{Y}'\Sigma_{WW}(x_{i}){\gamma}_{Y}-2{\gamma}_{Y}'\Sigma_{WY}(x_{i})
\right)$ will be negative, and
$\sd(\hat{\tau}_{\tilde{Y}, h})\leq \sd(\hat{\tau}_{Y, h})$. If the smoothness of
the covariate-adjusted conditional mean function $f_{Y}-f_{W}'\gamma_{Y}$ is
greater than the smoothness of the unadjusted conditional mean function
$f_{Y}$, so that $\tilde{M}\leq M$, then using the covariates will help tighten
the confidence intervals.
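
The first equality in the display above is plain regression algebra and can be
verified directly. A base-R sketch using the `headst` data, with a uniform
kernel, an arbitrary bandwidth $h=4$, and a single covariate for brevity
(assuming, as in the `RDHonest` calls below, that `povrate` is already centered
at the cutoff):

```{r}
## Covariate-adjusted estimate = unadjusted estimate on Y - W'gamma_tilde
h <- 4
d <- na.omit(headst[abs(headst$povrate) <= h,
                    c("mortHS", "povrate", "sch1417")])
d$D <- as.numeric(d$povrate >= 0)
fit_adj <- lm(mortHS ~ D * povrate + sch1417, data = d)
d$ytilde <- d$mortHS - coef(fit_adj)["sch1417"] * d$sch1417
fit_unadj <- lm(ytilde ~ D * povrate, data = d)
c(coef(fit_adj)["D"], coef(fit_unadj)["D"])   # identical by regression algebra
```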

Implementation of covariate-adjustment requires a choice of $\tilde{M}$, and
computing the optimal bandwidth requires a preliminary estimate of the variance
of the covariate-adjusted outcome. In our implementation, we first estimate the
model without covariates (using a rule of thumb to calibrate $M$, the bound on
the second derivative of $f_{Y}$), and compute the bandwidth $\check{h}$ that's
MSE optimal without covariates. Based on this bandwidth, we compute a
preliminary estimate $\tilde{\gamma}_{Y, \check{h}}$ of $\gamma_{Y}$, and use
this preliminary estimate to compute a preliminary covariate-adjusted outcome
$Y_{i}-W_{i}'\tilde{\gamma}_{Y, \check{h}}$. If $\tilde{M}$ is not supplied, we
calibrate $\tilde{M}$ using the rule of thumb, using this preliminary
covariate-adjusted outcome as the outcome. Similarly, we use this preliminary
covariate-adjusted outcome as the outcome to compute a preliminary estimator of
the conditional variance $\sigma^{2}_{\tilde{Y}}(x_{i})$, for optimal bandwidth
calculations, as in the case without covariates. With this choice of bandwidth
$h$, in the second step, we estimate $\tau_Y$ using the estimator
$\tilde{\tau}_{Y, h}$ defined above.



A demonstration using the `headst` data:
```{r}
## No covariates
rn <- RDHonest(mortHS ~ povrate, data=headst)
## Use Percent attending school aged 14-17, urban, black,
## and their interaction as covariates.
rc <- RDHonest(mortHS ~ povrate | urban*black + sch1417, data=headst)
rn
rc
```

We see that the inclusion of covariates leads to a reduction in the
rule-of-thumb curvature and also smaller standard errors (this would be true
even if the bandwidth was kept fixed). Correspondingly, the CIs are about 9%
shorter:
```{r}
ci_len <- c(rc$coefficients$conf.high-rc$coefficients$conf.low,
            rn$coefficients$conf.high-rn$coefficients$conf.low)
100 * (1 - ci_len[1]/ci_len[2])
```
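
As an informal check of the assumption that the covariates are predetermined
($\tau_{W}=0$), one can also run placebo sharp RD regressions with a covariate
as the outcome; this is not part of the estimation procedure, just a diagnostic.
For instance:

```{r}
## Placebo RD: covariate as the outcome (the jump should be close to zero)
RDHonest(sch1417 ~ povrate, data = headst)
```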

In the fuzzy RD case, we need to covariate-adjust the treatment $D_i$ as well as
the outcome. The implementation mirrors the sharp case. Define $\gamma_{D}$
analogously to $\gamma_{Y}$, and assume that the second derivative of
$f_{Y}(x)-f_{W}(x)'\gamma_{Y}$ is bounded by a known constant $\tilde{M}_{Y}$,
and that the second derivative of $f_{D}(x)-f_{W}(x)'\gamma_{D}$ is bounded by a
known constant $\tilde{M}_{D}$. The covariate-adjusted estimator is given by
$\tilde{\theta}_{h}=\tilde{\tau}_{Y, h}/\tilde{\tau}_{D, h}$, with variances and
worst-case bias computed as in the case without covariates, replacing the
treatment and outcome with their covariate-adjusted versions.

A demonstration using the `rcp` data, where we add education controls:
```{r}
RDHonest(log(cn) | retired ~ elig_year | education, data=rcp,
         T0=r$coefficients$estimate)
```

Relative to the previous estimate without covariates, the point estimate is now
much larger. This is in part due to the slightly smaller bandwidth used. In
addition, the regression function for the reduced form appears noisy below the
cutoff, potentially due to measurement error (see Figure 3). As a result, the estimates
are quite sensitive to the bandwidth used. The noise is also responsible for the
rather large data-driven estimates of the curvature parameters.

```{r, fig.width=4.5, fig.height=3.5, fig.cap="Battistin et al (2009) data"}
## see Figure 3
f3 <- RDScatter(log(cn)~elig_year, data=rcp, cutoff=0, avg=Inf,
                xlab="Years to eligibility",
                ylab="Log consumption of non-durables", propdotsize=TRUE,
                subset=abs(elig_year)<15)
## Adjust size of dots if they are too big
f3 + ggplot2::scale_size_area(max_size = 5)
```


## Aggregated data and weighted regression

In some cases, data is only observed as cell averages. For instance, suppose
that instead of observing the original `cghs` data, we only observe averages for
cells as follows:

```{r}
dd <- data.frame()
## Collapse data by running variable
for (j in unique(cghs$yearat14)) {
    ix <- cghs$yearat14==j
    df <- data.frame(y=mean(log(cghs$earnings[ix])), x=j,
                     weights=sum(ix),
                     sigma2=var(log(cghs$earnings[ix]))/sum(ix))
    dd <- rbind(dd, df)
}
```

The column `weights` gives the number of observations that each cell averages
over. In this case, if we weight the observations using `weights`, we can
recover the original estimates (and the same worst-case bias). If we use the
estimates of the conditional variance of the outcome, `dd$sigma2`, then we can
also replicate the standard error calculations:

```{r}
s0 <- RDHonest(log(earnings)~yearat14, cutoff=1947, data=cghs)
## keep same bandwidth
s1 <- RDHonest(y~x, cutoff=1947, data=dd, weights=weights,
               sigmaY2=sigma2, se.method="supplied.var",
               h=s0$coefficients$bandwidth)
## Results are identical:
s0
s1
```

Without supplying the variance estimates and specifying
`se.method="supplied.var"`, the variance estimates will not match, since the
collapsed data are generally not sufficient to learn about the true
variability of the collapsed outcomes.

The same method works in fuzzy designs, but one has to also save the conditional
variance of the treatment and its covariance with the outcome:
```{r}
r0 <- RDHonest(log(cn)|retired~elig_year, data=rcp, h=7)
dd <- data.frame(x=sort(unique(rcp$elig_year)), y=NA, d=NA, weights=NA,
                 sig11=NA, sig12=NA, sig21=NA, sig22=NA)
for (j in seq_len(NROW(dd))) {
    ix <- rcp$elig_year==dd$x[j]
    Y <- cbind(log(rcp$cn[ix]), rcp$retired[ix])
    dd[j, -1] <- c(colMeans(Y), sum(ix), as.vector(var(Y))/sum(ix))
}
r1 <- RDHonest(y|d~x, data=dd, weights=weights,
               sigmaY2=sig11, T0=0, sigmaYD=sig21,
               sigmaD2=sig22, h=7,
               se.method="supplied.var")
## Outputs match up to numerical precision
max(abs(r0$coefficients[2:11]-r1$coefficients[2:11]))
```

## Clustering

In some applications, the data are collected by clustered sampling. In such
cases, the user can specify a vector `clusterid` signifying cluster membership.
In this case, preliminary bandwidth calculations assume that the regression
errors have a Moulton-type structure, with homoskedasticity on either side of
the cutoff: \begin{equation*} \cov(Y_{i}, Y_{j})=\begin{cases} \sigma^{2}_{+} &
\text{if $i=j$ and $x_{i}\geq 0$,} \\ \sigma^{2}_{-} & \text{if $i=j$ and
$x_{i}< 0$,} \\ \rho & \text{if $i\neq j$ and $g(i)=g(j)$,} \\ 0 &
\text{otherwise,} \end{cases} \end{equation*} where $g(i)\in\{1,\dotsc, G\}$
denotes cluster membership. Since it appears difficult to generalize the nearest
neighbor variance estimator to clustering, we use regression-based
cluster-robust variance formulas to compute estimator variances, so that option
`se.method="EHW"` is required.

```{r}
## make fake clusters
set.seed(42)
clusterid <- sample(1:50, NROW(lee08), replace=TRUE)
sc <- RDHonest(voteshare~margin, data=lee08, se.method="EHW",
               clusterid=clusterid, M=0.14, h=7)
## Since clusters are unrelated to outcomes, not clustering
## should yield similar standard errors
sn <- RDHonest(voteshare~margin, data=lee08, se.method="EHW",
               M=0.14, h=7)
sc
sn
```


## Specification testing

The package also implements lower-bound estimates for the smoothness constant
$M$ for the Taylor and Hölder smoothness classes, as described in the
supplements to @KoRo16 and @ArKo18optimal.

```{r}
r1 <- RDHonest(voteshare ~ margin, data=lee08, M=0.1, se.method="nn")
### Only use three-point averages for the 100 points closest to the cutoff,
### and report results separately for points above and below the cutoff
RDSmoothnessBound(r1, s=100, separate=TRUE, multiple=FALSE, sclass="T")

## Pool estimates based on observations below and above cutoff, and
## use three-point averages over the entire support of the running variable
RDSmoothnessBound(r1, s=100, separate=FALSE, multiple=TRUE, sclass="H")
```

## Optimal weights under Taylor smoothness class

For the second-order Taylor smoothness class, the function `RDHonest`, with
`kernel="optimal"`, computes finite-sample optimal estimators and confidence
intervals, as described in Section 2.2 in @ArKo18optimal. This typically yields
tighter CIs:

```{r}
r1 <- RDHonest(voteshare ~ margin, data=lee08, kern="optimal", M=0.1,
               opt.criterion="FLCI",
               se.method="nn")$coefficients
r2 <- RDHonest(voteshare ~ margin, data=lee08, kern="triangular", M=0.1,
               opt.criterion="FLCI", se.method="nn",
               sclass="T")$coefficients
r1$conf.high-r1$conf.low
r2$conf.high-r2$conf.low
```

# Inference at a point

The package can also perform inference at a point, and optimal bandwidth
selection for inference at a point. Suppose, for example, one was interested in
the vote share for candidates with margin of victory equal to 20 points:
```{r}
## Specify we're interested in inference at x0=20,
## and drop observations below cutoff
RDHonest(voteshare ~ margin, data=lee08, subset=margin>0,
         cutoff=20, kern="uniform",
         opt.criterion="MSE", sclass="H", point.inference=TRUE)
```

To compute the optimal bandwidth, the package assumes homoskedastic variance on
either side of the cutoff, which it estimates based on a preliminary local
linear regression using the @FaGi96 rule of thumb bandwidth selector. This
homoskedasticity assumption is dropped when the final standard errors are
computed.

# Diagnostics: leverage and effective observations

The estimators in this package are just weighted regression estimators, or
ratios of regression estimators in the fuzzy RD case. Regression estimators are
linear in outcomes, taking the form $\sum_i k_{i, h} Y_i$, where $k_{i, h}$ are
estimation weights, returned in the `data$est_w` component of the `RDHonest`
output (see the expression for $\hat{\tau}_{Y, h}$ above).

For the sampling distribution of the estimator to be well-approximated by a
normal distribution, it is important that these regression weights not be too
large: asymptotic normality requires $L_{\max}=\max_j k_{j, h}^2/\sum_i k_{i,
h}^2\to 0$. If the uniform kernel is used, the ratios $k_{j, h}^2/\sum_i k_{i, h}^2$
are just the diagonal elements of the partial projection matrix. We therefore refer
to $L_{\max}$ as the maximal (partial) leverage, and it is reported in the `RDHonest`
output. The package issues a warning if the maximal leverage exceeds $0.1$---in
such cases using a bigger bandwidth is advised.

In the fuzzy RD case, by Theorem B.2 in the appendix to @ArKo20, the estimator
is asymptotically equivalent to $\sum_i k_{i, h} (Y_i-D_i\theta)/\tau_D$, where
$k_{i, h}$ are the weights for $\hat{\tau}_{Y, h}$. The maximal leverage
calculations are thus analogous to the sharp case.

With local regression methods, it is clear that observations outside the
estimation window don't contribute to estimation, reducing the effective sample
size. If the uniform kernel is used, the package therefore reports the number of
observations inside the estimation window as the "number of effective
observations".

To make this number comparable across different kernels, observe that, under
homoskedasticity, the variance of a linear estimator $\sum_{i} k_{i} Y_i$ is
$\sigma^2\sum_{i} k_i^2$. We expect this to scale in inverse proportion to the
sample size: with twice as many observations and the same bandwidth, we expect
the variance to halve. Now compare this estimator with a uniform kernel
estimator $\sum_i k_{uniform, i}Y_i$ that uses the same bandwidth: the ratio of
the uniform kernel estimator's variance to this estimator's variance is
$\sigma^2\sum_{i} k_{uniform, i}^2/\sigma^2\sum_{i} k_i^2=\sum_{i} k_{uniform,
i}^2/\sum_{i} k_i^2$, so this estimator is as precise as a uniform kernel
estimator that uses $\sum_{i} k_{uniform, i}^2/\sum_{i} k_i^2$ times as many
observations. Correspondingly, we define the number of effective observations
for other kernels as the number of observations inside the estimation window
times $\sum_{i} k_{uniform, i}^2/\sum_{i} k_i^2$. With this definition, using a
triangular kernel typically yields effective sample sizes equal to about 80% of
the number of observations inside the estimation window.
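
Both quantities can be computed directly from the estimation weights. A sketch,
assuming (as noted above) that the weights are stored in the `data$est_w`
component of the `RDHonest` output:

```{r}
## Maximal leverage and effective observations from the estimation weights
rt <- RDHonest(voteshare ~ margin, data = lee08, kern = "triangular",
               M = 0.1, h = 8)
ru <- RDHonest(voteshare ~ margin, data = lee08, kern = "uniform",
               M = 0.1, h = 8)
kt <- rt$data$est_w
ku <- ru$data$est_w
c(max.leverage = max(kt^2) / sum(kt^2),
  effective.obs = sum(kt != 0) * sum(ku^2) / sum(kt^2))
```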

Finally, to assess which observations are important for pinning down the
estimate, it can be useful to explicitly plot the estimation weights.

# References