\documentclass{chapman} %%% copy Sweave.sty definitions %%% keeps `sweave' from adding `\usepackage{Sweave}': DO NOT REMOVE %\usepackage{Sweave} \RequirePackage[T1]{fontenc} \RequirePackage{graphicx,ae,fancyvrb} \IfFileExists{upquote.sty}{\RequirePackage{upquote}}{} \usepackage{relsize} \DefineVerbatimEnvironment{Sinput}{Verbatim}{} \DefineVerbatimEnvironment{Soutput}{Verbatim}{fontfamily=courier, fontshape=it, fontsize=\relsize{-1}} \DefineVerbatimEnvironment{Scode}{Verbatim}{} \newenvironment{Schunk}{}{} %%% environment for raw output \newcommand{\SchunkRaw}{\renewenvironment{Schunk}{}{} \DefineVerbatimEnvironment{Soutput}{Verbatim}{fontfamily=courier, fontshape=it, fontsize=\small} \rawSinput } %%% environment for labeled output \newcommand{\nextcaption}{} \newcommand{\SchunkLabel}{ \renewenvironment{Schunk}{\begin{figure}[ht] }{\caption{\nextcaption} \end{figure} } \DefineVerbatimEnvironment{Sinput}{Verbatim}{frame = topline} \DefineVerbatimEnvironment{Soutput}{Verbatim}{frame = bottomline, samepage = true, fontfamily=courier, fontshape=it, fontsize=\relsize{-1}} } %%% S code with line numbers \DefineVerbatimEnvironment{Sinput} {Verbatim} { %% numbers=left } \newcommand{\numberSinput}{ \DefineVerbatimEnvironment{Sinput}{Verbatim}{numbers=left} } \newcommand{\rawSinput}{ \DefineVerbatimEnvironment{Sinput}{Verbatim}{} } %%% R / System symbols \newcommand{\R}{\textsf{R}} \newcommand{\rR}{{R}} \renewcommand{\S}{\textsf{S}} \newcommand{\SPLUS}{\textsf{S-PLUS}} \newcommand{\rSPLUS}{{S-PLUS}} \newcommand{\SPSS}{\textsf{SPSS}} \newcommand{\EXCEL}{\textsf{Excel}} \newcommand{\ACCESS}{\textsf{Access}} \newcommand{\SQL}{\textsf{SQL}} %%\newcommand{\Rpackage}[1]{\hbox{\rm\textit{#1}}} %%\newcommand{\Robject}[1]{\hbox{\rm\texttt{#1}}} %%\newcommand{\Rclass}[1]{\hbox{\rm\textit{#1}}} %%\newcommand{\Rcmd}[1]{\hbox{\rm\texttt{#1}}} \newcommand{\Rpackage}[1]{\index{#1 package@{\fontseries{b}\selectfont #1} package} {\fontseries{b}\selectfont #1}} \newcommand{\rpackage}[1]{{\fontseries{b}\selectfont #1}} \newcommand{\Robject}[1]{\texttt{#1}} \newcommand{\Rclass}[1]{\index{#1 class@\textit{#1} class}\textit{#1}} \newcommand{\Rcmd}[1]{\index{#1 function@\texttt{#1} function}\texttt{#1}} \newcommand{\Roperator}[1]{\texttt{#1}} \newcommand{\Rarg}[1]{\texttt{#1}} \newcommand{\Rlevel}[1]{\texttt{#1}} %%% other symbols \newcommand{\file}[1]{\hbox{\rm\texttt{#1}}} %%\newcommand{\stress}[1]{\index{#1}\textit{#1}} \newcommand{\stress}[1]{\textit{#1}} \newcommand{\booktitle}[1]{\textit{#1}} %%' %%% Math symbols \usepackage{amstext} \usepackage{amsmath} \newcommand{\E}{\mathsf{E}} \newcommand{\Var}{\mathsf{Var}} \newcommand{\Cov}{\mathsf{Cov}} \newcommand{\Cor}{\mathsf{Cor}} \newcommand{\x}{\mathbf{x}} \newcommand{\y}{\mathbf{y}} \renewcommand{\a}{\mathbf{a}} \newcommand{\W}{\mathbf{W}} \newcommand{\C}{\mathbf{C}} \renewcommand{\H}{\mathbf{H}} \newcommand{\X}{\mathbf{X}} \newcommand{\B}{\mathbf{B}} \newcommand{\V}{\mathbf{V}} \newcommand{\I}{\mathbf{I}} \newcommand{\D}{\mathbf{D}} \newcommand{\bS}{\mathbf{S}} \newcommand{\N}{\mathcal{N}} \renewcommand{\L}{L} \renewcommand{\P}{\mathsf{P}} \newcommand{\K}{\mathbf{K}} \newcommand{\m}{\mathbf{m}} \newcommand{\argmin}{\operatorname{argmin}\displaylimits} \newcommand{\argmax}{\operatorname{argmax}\displaylimits} \newcommand{\bx}{\mathbf{x}} \newcommand{\bbeta}{\mathbf{\beta}} %%% links \usepackage{hyperref} \hypersetup{% pdftitle = {A Handbook of Statistical Analyses Using R (3rd Edition)}, pdfsubject = {Book}, pdfauthor = {Torsten Hothorn and Brian S. Everitt}, colorlinks = {black}, linkcolor = {black}, citecolor = {black}, urlcolor = {black}, hyperindex = {true}, linktocpage = {true}, } %%% captions & tables %% <FIXME>: conflics with figure definition in chapman.cls %%\usepackage[format=hang,margin=10pt,labelfont=bf]{caption} %% </FIMXE> \usepackage{longtable} \usepackage[figuresright]{rotating} %%% R symbol in chapter 1 \usepackage{wrapfig} %%% Bibliography \usepackage[round,comma]{natbib} \renewcommand{\refname}{References \addcontentsline{toc}{chapter}{References}} \citeindexfalse %%% texi2dvi complains that \newblock is undefined, hm... \def\newblock{\hskip .11em plus .33em minus .07em} %%% Example sections \newcounter{exercise}[chapter] \setcounter{exercise}{0} \newcommand{\exercise}{\stepcounter{exercise} \item{Ex.~\arabic{chapter}.\arabic{exercise} }} %% URLs \newcommand{\curl}[1]{\begin{center} \url{#1} \end{center}} %%% for manual corrections %\renewcommand{\baselinestretch}{2} %%% plot sizes \setkeys{Gin}{width=0.95\textwidth} %%% color \usepackage{color} %%% hyphenations \hyphenation{drop-out} \hyphenation{mar-gi-nal} %%% new bidirectional quotes need \usepackage[utf8]{inputenc} %\usepackage{setspace} \definecolor{sidebox_todo}{rgb}{1,1,0.2} \newcommand{\todo}[1]{ \hspace{0pt}% \marginpar{% \fcolorbox{black}{sidebox_todo}{% \parbox{\marginparwidth} { \raggedright\sffamily\footnotesize{TODO: #1}% } }% } } \begin{document} %% Title page \title{A Handbook of Statistical Analyses Using \R{} --- 3rd Edition} \author{Torsten Hothorn and Brian S. Everitt} \maketitle %%\VignetteIndexEntry{Chapter Bayesian Inference} %%\VignetteDepends{rmeta,coin} \setcounter{chapter}{17} \SweaveOpts{prefix.string=figures/HSAUR,eps=FALSE,keep.source=TRUE} <<setup, echo = FALSE, results = hide>>= rm(list = ls()) s <- search()[-1] s <- s[-match(c("package:base", "package:stats", "package:graphics", "package:grDevices", "package:utils", "package:datasets", "package:methods", "Autoloads"), s)] if (length(s) > 0) sapply(s, detach, character.only = TRUE) if (!file.exists("tables")) dir.create("tables") if (!file.exists("figures")) dir.create("figures") set.seed(290875) options(prompt = "R> ", continue = "+ ", width = 63, # digits = 4, show.signif.stars = FALSE, SweaveHooks = list(leftpar = function() par(mai = par("mai") * c(1, 1.05, 1, 1)), bigleftpar = function() par(mai = par("mai") * c(1, 1.7, 1, 1)))) HSAURpkg <- require("HSAUR3") if (!HSAURpkg) stop("cannot load package ", sQuote("HSAUR3")) rm(HSAURpkg) ### </FIXME> hm, R-2.4.0 --vanilla seems to need this a <- Sys.setlocale("LC_ALL", "C") ### </FIXME> book <- TRUE refs <- cbind(c("AItR", "DAGD", "SI", "CI", "ANOVA", "MLR", "GLM", "DE", "RP", "GAM", "SA", "ALDI", "ALDII", "SIMC", "MA", "PCA", "MDS", "CA"), 1:18) ch <- function(x) { ch <- refs[which(refs[,1] == x),] if (book) { return(paste("Chapter~\\\\ref{", ch[1], "}", sep = "")) } else { return(paste("Chapter~", ch[2], sep = "")) } } if (file.exists("deparse.R")) source("deparse.R") setHook(packageEvent("lattice", "attach"), function(...) { lattice.options(default.theme = function() standard.theme("pdf", color = FALSE)) }) @ \pagestyle{headings} <<singlebook, echo = FALSE>>= book <- FALSE @ \chapter[Bayesian Inference]{Incorporating Prior Knowledge via Bayesian Inference: Smoking and Lung Cancer \label{BI}} \section{Introduction} \index{Smoking and lung cancer|(} At the beginning of the 20th century, the death toll due to lung cancer was on the rise and the search for possible causes began. For lung cancer in pit workers, animal experiments showed that the so-called `Schneeberg lung disease' was induced by radiation. But this could not explain the increasing incidence of lung cancer in the general population. The identification of possible risk factors was a challenge for epidemiology and statistics, both disciplines being still in their infancy in the 1920s and 1930s. The first modern controlled epidemiological study on the effect of smoking on lung cancer was performed by Franz Hermann M\"uller as part of his dissertation at the University of Cologne in 1939. The results were published a year later \citep{HSAUR:Mueller1940}. M\"uller sent out questionnaires to the relatives of people who had recently died of lung cancer, asking about the smoking behavior and its intensity of the deceased relative. He also sent the questionnaire to healthy controls to obtain information about the smoking behavior in a control group, although it is not clear how this control group was defined. The number of lung cancer patients and healthy controls in five different groups (nonsmokers to extreme smokers) are given in Table~\ref{BI-Smoking_Mueller1940-tab}. <<BI-Smoking_Mueller1940-tab, echo = FALSE, results = tex>>= data("Smoking_Mueller1940", package = "HSAUR3") toLatex(HSAURtable(Smoking_Mueller1940), caption = paste("Smoking and lung cancer case-control study by M\\\"uller (1940).", "The smoking intensities were defined by the number of", "cigarettes smoked daily:", "1-15 (moderate), 16-25 (heavy), 26-35 (very heavy),", "and more than 35 (extreme)."), label = "BI-Smoking_Mueller1940-tab") @ Four years later Erich Sch\"oninger also wrote his dissertation on the association between smoking and lung cancer and, together with his supervisor Eberhard Schairer at the University of Jena, published his results on a case-control study \citep{HSAUR:SchairerSchoeninger1944} where he assessed the smoking behavior of lung cancer patients, patients diagnosed with other forms of cancer, and also a healthy control group. The data are given in Table~\ref{BI-Smoking_SchairerSchoeniger1944-tab}. <<BI-Smoking_SchairerSchoeniger1944-tab, echo = FALSE, results = tex>>= x <- as.table(Smoking_SchairerSchoeniger1944[, c("Lung cancer", "Healthy control")]) toLatex(HSAURtable(x, xname = "Smoking_SchairerSchoeniger1944"), caption = paste("Smoking and lung cancer case-control study by Schairer and Sch\\\"oniger (1944). Cancer other than lung cancer omitted.", "The smoking intensities were defined by the number of", "cigarettes smoked daily:", "1-5 (moderate), 6-10 (medium), 11-20 (heavy),", "and more than 20 (very heavy)."), label = "BI-Smoking_SchairerSchoeniger1944-tab") @ Shortly after the war, a Dutch epidemiologist reported on a case-control study performed in Amsterdam \citep{HSAUR:Wassink1945} and found similar results as the two German studies; see Table~\ref{BI-Smoking_Wassink1945-tab}. <<BI-Smoking_Wassink1945-tab, echo = FALSE, results = tex>>= data("Smoking_Wassink1945", package = "HSAUR3") toLatex(HSAURtable(Smoking_Wassink1945), caption = paste("Smoking and lung cancer case-control study by Wassink (1945).", "Smoking categories correspond to the categories used by M\\\"uller (1940)."), label = "BI-Smoking_Wassink1945-tab") @ In 1950 perhaps the most important, but not the first, case-control study showing an increasing risk of developing lung cancer with the amount of tobacco smoked, was published in Great Britain by Richard Doll and Austin Bradford Hill \citep{HSAUR:DollHill1950}. We restrict discussion here to data obtained for males and the data shown in Table~\ref{BI-Smoking_DollHill1950-tab} corresponds to the most recent amount of tobacco consumed regularly by smokers before disease onset \citep[Table~V in][]{HSAUR:DollHill1950}. <<BI-Smoking_DollHill1950-tab, echo = FALSE, results = tex>>= data("Smoking_DollHill1950", package = "HSAUR3") x <- as.table(Smoking_DollHill1950[,,"Male", drop = FALSE]) toLatex(HSAURtable(x, xname = "Smoking_DollHill1950"), caption = paste("Smoking and lung cancer case-control study (only males) by Doll and Hill (1950).", "The labels for the smoking categories give the number of cigarettes smoked every day."), label = "BI-Smoking_DollHill1950-tab") @ Although the design of the studies by \cite{HSAUR:Mueller1940} and \cite{HSAUR:SchairerSchoeninger1944}, especially the selection of their control groups, can be criticized \citep[see][for a detailed discussion]{HSAUR:Morabia2013} and the study by \cite{HSAUR:DollHill1950} was larger than the older studies and more detailed information on the smoking behavior was obtained by direct patient interviews, the information provided by the earlier studies was not taken into account by \cite{HSAUR:DollHill1950}. They cite \cite{HSAUR:Mueller1940} in their introduction, but did not compare their findings with his results. It is remarkable to see that both \cite{HSAUR:SchairerSchoeninger1944} and \cite{HSAUR:Wassink1945} extensively made use of the report by \cite{HSAUR:Mueller1940} and go as far as analyzing the merged data \citep[Grafiek I, E, and F, in][]{HSAUR:Wassink1945}. In an informal way, these authors wanted to use the already available information, in today's terms called `prior knowledge', to make a stronger case with the new data. Formal statistical methods to incorporate prior knowledge into data analysis as part of the `Bayesian' way of doing statistical analyses were developed in the second half of the last century, and we will focus on them in the present chapter. \index{Smoking and lung cancer|)} \section{Bayesian Inference} \section{Analysis Using \R{}} \subsection{One-by-one Analysis} For the analysis of the four different case-control studies on smoking and lung cancer, we will (retrospectively, of course) update our knowledge with every new study. We begin with a re-analysis of the data described by \cite{HSAUR:Mueller1940}. Using an approximate permutation test introduced in Chapter~\ref{CI} for the hypothesis of independence of the amount of tobacco smoked and group membership (lung cancer or healthy control), we get <<BI-M-it, echo = TRUE>>= library("coin") set.seed(29) independence_test(Smoking_Mueller1940, teststat = "quad", distribution = approximate(100000)) @ and there is clearly a strong association between the number of cigarettes smoked and incidence of lung cancer. Because the amount of tobacco smoked is an ordered categorical variable, it is more appropriate to take this information into account, for example by means of a linear association test (see Chapter~\ref{CI}). Nonsmokers receive a score of zero, and for the remaining groups we choose the mid-point of the intervals of daily cigarettes smoked that were used by \cite{HSAUR:Mueller1940} to define his groups: <<BI-M40-linit, echo = TRUE>>= ssc <- c(0, 1 + 14 / 2, 16 + 9 / 2, 26 + 9 / 2, 40) independence_test(Smoking_Mueller1940, teststat = "quad", scores = list(Smoking = ssc), distribution = approximate(100000)) @ The result shows that the data are in favor of an ordered alternative. The $p$-values obtained from approximate permutation tests are attractive because no distributional assumptions are required, but it is hard to derive estimates and confidence intervals for interpretable parameters from such tests. We will therefore now switch to logistic regression models as described in Chapter~\ref{GLM} to model the odds of lung cancer in the different smoking groups. Before we start, let us define a small function for computing odds (for intercept parameters) and odds ratios (for difference parameters) and corresponding confidence intervals from a logistic regression model: <<BI-expconfint, echo = TRUE>>= eci <- function(model) cbind("Odds (Ratio)" = exp(coef(model)), exp(confint(model))) @ We model the probability of developing lung cancer given the smoking behavior. Because our data was obtained from case-control studies where the groups (lung cancer patients and healthy controls) were defined first and only after that we observed data on the smoking behavior (in a so-called \stress{choice-based sampling}), this may seem the wrong model to start with. However, the marginal distribution of the two groups only changes the intercept in such a logistic model and the effects of smoking can still be interpreted in the way we require \citep[see][for example]{HSAUR:Tutz2012}. The formula for specifying a logistic regression model can be set up such that the response is a matrix with two columns for each smoking group consisting of the number of lung cancer deaths and the number of healthy controls. Although smoking is an ordered factor, we first fit the model with treatment contrasts, i.e., we can interpret the $\exp$ of the regression coefficients as odds ratios between each smoking group and nonsmokers: <<BI-M40-logreg, echo = TRUE>>= smoking <- ordered(rownames(Smoking_Mueller1940), levels = rownames(Smoking_Mueller1940)) contrasts(smoking) <- "contr.treatment" eci(glm(Smoking_Mueller1940 ~ smoking, family = binomial())) @ We see that all but one of the odds ratios increase with the amount of tobacco smoked with a maximum of almost $30$ for extreme smokers (more than $35$ cigarettes per day). The likelihood confidence intervals are rather wide due to the limited sample size, but also the lower limit increases with smoking. An alternative model formulation can help to compare each smoking group with the preceding group, the so-called split-coding \citep[for this and other codings see][]{HSAUR:Tutz2012}: <<BI-M40-logreg-split, echo = TRUE>>= K <- diag(nlevels(smoking) - 1) K[lower.tri(K)] <- 1 contrasts(smoking) <- rbind(0, K) eci(glm(Smoking_Mueller1940 ~ smoking, family = binomial())) @ The two largest differences are between moderate smokers and nonsmokers (\Robject{smoking1}) and between very heavy and heavy smokers (\Robject{smoking3}). The latter group difference seems, at least judged by the confidence interval, to be larger than expected under a model with no effect of smoking. For the analysis of the three remaining studies, we first perform permutation tests for the independence of smoking and the two groups (lung cancer and healthy controls) in males: <<BI-SS44-it, echo = TRUE>>= xSS44 <- as.table(Smoking_SchairerSchoeniger1944[, c("Lung cancer", "Healthy control")]) ap <- approximate(100000) pvalue(independence_test(xSS44, teststat = "quad", distribution = ap)) pvalue(independence_test(Smoking_Wassink1945, teststat = "quad", distribution = ap)) xDH50 <- as.table(Smoking_DollHill1950[,, "Male"]) pvalue(independence_test(xDH50, teststat = "quad", distribution = ap)) @ All $p$-values indicate that the data are not well-described by the independence model. \subsection{Joint Bayesian Analysis} For a Bayesian analysis, we first merge the data from all four studies into one data frame. In doing so, we also merge the smoking groups in a way that we only have three groups left: nonsmokers, moderate smokers, and heavy smokers. These groups are chosen in a way that the number of daily cigarettes is comparable. We first merge the heavy, very heavy, and extreme smokers from \cite{HSAUR:Mueller1940} <<BI-data-M, echo = TRUE>>= (M <- rbind(Smoking_Mueller1940[1:2,], colSums(Smoking_Mueller1940[3:5,]))) @ and proceed with the lung cancer patients and healthy controls from \cite{HSAUR:SchairerSchoeninger1944} in the same way <<BI-data-SS, echo = TRUE>>= SS <- Smoking_SchairerSchoeniger1944[, c("Lung cancer", "Healthy control")] (SS <- rbind(SS[1,], colSums(SS[2:3,]), colSums(SS[4:5,]))) @ and finally perform the same exercise for the \cite{HSAUR:Wassink1945} and \cite{HSAUR:DollHill1950} data <<BI-data-WDH, echo = TRUE>>= (W <- rbind(Smoking_Wassink1945[1:2,], colSums(Smoking_Wassink1945[3:4,]))) DH <- Smoking_DollHill1950[,, "Male"] (DH <- rbind(DH[1,], colSums(DH[2:3,]), colSums(DH[4:6,]))) @ The three new groups are now called nonsmokers, moderate smokers, and heavy smokers, and we set up a data frame that contains the number of people in each of the possible groups for all studies: <<BI-data-all, echo = TRUE>>= smk <- c("Nonsmoker", "Moderate smoker", "Heavy smoker") x <- expand.grid(Smoking = ordered(smk, levels = smk), Diagnosis = factor(c("Lung cancer", "Control")), Study = c("Mueller1940", "SchairerSchoeniger1944", "Wassink1945", "DollHill1950")) x$weights <- c(as.vector(M), as.vector(SS), as.vector(W), as.vector(DH)) @ Before we fit logistic regression models using the data organized in such a way, we define the contrasts for the smoking ordered factor and expand the data in a way that each row corresponds to one person. This is necessary because the \Rcmd{weights} argument to the \Rcmd{glm} function must not be used to define case weights: <<BI-data-contrasts, echo = TRUE>>= contrasts(x$Smoking) <- "contr.treatment" x <- x[rep(1:nrow(x), x$weights),] @ We now compute one logistic regression model for each study for later comparisons: <<BI-models, echo = TRUE>>= models <- lapply(levels(x$Study), function(s) glm(Diagnosis ~ Smoking, data = x, family = binomial(), subset = Study == s)) names(models) <- levels(x$Study) @ In 1939, M\"uller was hardly in the position to come up with a reasonable prior for the odds ratios between moderate or heavy smokers and nonsmokers. So we also use a noninformative prior and just perform the maximum likelihood analysis: <<BI-M40, echo = TRUE>>= eci(models[["Mueller1940"]]) @ Four years later, the maximum likelihood results obtained for the \cite{HSAUR:SchairerSchoeninger1944} data <<BI-SS44, echo = TRUE>>= eci(models[["SchairerSchoeniger1944"]]) @ could have been improved by using a normal prior for the difference in log odds whose distribution is the distribution of the maximum likelihood estimator obtained for M\"uller's data. At least approximately, we can compute posterior $90\%$ credibility intervals and the posterior mode from the Schairer and Sch\"oniger data by analyzing both data sets simultaneously. We should, however, keep in mind that the odds of developing lung cancer for nonsmokers is not really interesting for our analysis and that the four studies may very well differ with respect to this intercept parameter. Consequently, we don't want to specify a prior for the intercept. One way to implement such a strategy is to exclude the intercept term from the joint model while allowing a separate intercept for each of the studies: <<BI-M40-SS44, echo = TRUE>>= mM40_SS44 <- glm(Diagnosis ~ 0 + Study + Smoking, data = x, family = binomial(), subset = Study %in% c("Mueller1940", "SchairerSchoeniger1944")) eci(mM40_SS44) @ We observe two important differences between the maximum likelihood and Bayesian results for the Schairer and Sch\"oniger data: In the Bayesian analysis, the estimated odds ratio for moderate smokers is closer to the smaller value obtained from M\"uller's data and, more important, the credibility intervals are much narrower and, one has to say, more realistic now. An odds ratio as large as $40$ is hardly something one would expect to see in practice. If Wassink had been aware of Bayesian statistics, he could have used the posterior distribution of the parameters from our model \Robject{mM40\_SS44} as a prior distribution for analyzing his data. The maximum likelihood results for his data <<BI-M40-SS44-W45-ML, echo = TRUE>>= eci(models[["Wassink1945"]]) @ would have changed to <<BI-M40-SS44-W45, echo = TRUE>>= mM40_SS44_W45 <- glm(Diagnosis ~ 0 + Study + Smoking, data = x, family = binomial(), subset = Study %in% c("Mueller1940", "SchairerSchoeniger1944", "Wassink1945")) eci(mM40_SS44_W45) @ The rather small odds ratios obtained from the model fitted to the Wassink data only are now closer to the estimates obtained from the two previous studies and the variability, as given by the credibility intervals, is much smaller. Now, finally, the model for the Doll and Hill data reports rather large odds ratios with wide confidence intervals: <<BI-DH50, echo = TRUE>>= eci(models[["DollHill1950"]]) @ With a (now rather strong) prior defined by the three earlier studies, we get from the joint model for all four studies <<BI-all, echo = TRUE>>= m_all <- glm(Diagnosis ~ 0 + Study + Smoking, data = x, family = binomial()) eci(m_all) @ <<BI-all-round, echo = FALSE>>= r <- eci(m_all) xM <- round(r["SmokingModerate smoker", 2:3], 1) xH <- round(r["SmokingHeavy smoker", 2:3], 1) @ In 1950, the joint evidence based on such an analysis with an odds ratio between $\Sexpr{xM[1]}$ and $\Sexpr{xM[2]}$ for moderate smokers and between $\Sexpr{xH[1]}$ and $\Sexpr{xH[2]}$ for heavy smokers compared to nonsmokers, would have made a much stronger case than any of the single studies alone. It is interesting to see that with this strong prior for the Doll and Hill study, we also get relatively large odds ratios when comparing heavy to moderate smokers (see row labeled \Rcmd{Smoking2}): <<BI-results, echo = TRUE>>= K <- diag(nlevels(x$Smoking) - 1) K[lower.tri(K)] <- 1 contrasts(x$Smoking) <- rbind(0, K) eci(glm(Diagnosis ~ 0 + Study + Smoking, data = x, family = binomial())) @ \subsection{A Comparison with Meta Analysis} One may ask how the Bayesian approach of progressively updating the estimates considered here differs from a classical meta analysis described in Chapter~\ref{MA}. We first reshape the data into a form suitable for such an analysis <<BI-meta-data, echo = TRUE>>= y <- xtabs(~ Study + Smoking + Diagnosis, data = x) ntrtM <- margin.table(y, 1:2)[,"Moderate smoker"] nctrl <- margin.table(y, 1:2)[,"Nonsmoker"] ptrtM <- y[,"Moderate smoker","Lung cancer"] pctrl <- y[,"Nonsmoker","Lung cancer"] ntrtH <- margin.table(y, 1:2)[,"Heavy smoker"] ptrtH <- y[,"Heavy smoker","Lung cancer"] @ and then compute joint odds ratios and confidence intervals for moderate and heavy smokers compared to nonsmokers: <<BI-meta-data, echo = TRUE>>= library("rmeta") meta.MH(ntrt = ntrtM, nctrl = nctrl, ptrt = ptrtM, pctrl = pctrl) meta.MH(ntrt = ntrtH, nctrl = nctrl, ptrt = ptrtH, pctrl = pctrl) @ For moderate smokers, the effect is a little weaker compared with the results reported on earlier and for heavy smokers, the meta analysis identifies a stronger effect for heavy smokers. Nevertheless, the differences between the two rather different approaches are negligible and the conclusions would have been the same. \section{Summary of Findings} We have seen that, using a Bayesian approach to incorporate prior knowledge into a model, the odds of developing lung cancer increase with increased amounts of smoking. Of course, our analysis here is very simplistic, because we ignored that also pipe and cigar smokers were present in the data, we merged the data based on a very rough assessment of the number of cigarettes smoked per day, ignored whether or not the smokers inhaled the smoke into their lungs, or if nonsmokers were subject to passive-smoking, as we call it today. Most importantly, we must not misinterpret findings from case-control studies as casual and, in fact, none of the authors cited here did so. The debate on whether smoking, and which kind of smoking, actually \stress{causes} lung cancer was initiated by the publications cited in this chapter and many famous statisticians took part in the debate, for example, Sir Ronald Fisher \citep{HSAUR:Fisher1959}, took the view that the inference of causation was premature. In retrospect this was one issue (perhaps the only one) where Fisher was mistaken. \section{Final Comments} There remain a few hard-line opponents of Bayesian inference (just a few) who reject the method because of the use of subjective prior distributions which, these opponents feel, have no place in scientific investigations. And there are Bayesians who think that the only defense of using non-Bayesian methods is incompetence. But for an increasing number of statisticians Bayesian inference is very attractive, because we can use the posterior distribution of the parameters to draw conclusions from the data. Although this requires the specification of a prior distribution, we have seen in this chapter that, using data from previous experiments, priors can be defined in a reasonable way. It is not absolutely necessary to rely on rather complex numerical procedures to`estimate' a posterior distribution. When we are willing to cut some corners, we can implement simple Bayesian approaches using standard software. We should also keep in mind that the prior can be interpreted as a penalty on the parameters, and many penalization approaches therefore have an (often implicit) connection to the Bayesian way of doing statistics. Of course, just picking the prior that `works best' is dangerous and almost surely inappropriate. \section*{Exercises} \begin{description} \exercise Produce a forest plot as introduced in Chapter~\ref{MA} for the four smoking studies analyzed here. \exercise Produce a modified forest plot where one can see how the evidence for smoking being related to lung cancer evolved between 1940 and 1950. \exercise Use the \Rpackage{INLA} add-on package to perform a similar analysis by using the coefficients and their standard errors estimated from our initial logistic regression model \texttt{m[["Mueller1940"]]} as parameters of a normal prior for a logistic regression applied to the Schairer and Sch\"oniger data. Compare the resulting credibility intervals for the two odds-ratios with the approximate results obtained in this chapter. \end{description} %%\bibliographystyle{LaTeXBibTeX/refstyle} %%\bibliography{LaTeXBibTeX/HSAUR} \end{document}