%\VignetteIndexEntry{GMMAT}
\documentclass[a4paper,12pt]{article}
\usepackage[margin=1in]{geometry}
\usepackage{amsmath}
\usepackage{float}
\restylefloat{table}
\usepackage{graphicx}
\usepackage{epstopdf}
\usepackage{hyperref}
\usepackage[dvipsnames]{xcolor}
\usepackage{listings}
\usepackage[superscript]{cite}
\newcommand{\mb}{\mathbf}
\newcommand{\bs}{\boldsymbol}
\allowdisplaybreaks[1]
\lstset{
language=R,
basicstyle=\scriptsize\ttfamily,
commentstyle=\ttfamily\color{gray},
numbers=left,
numberstyle=\ttfamily\color{gray}\footnotesize,
stepnumber=1,
numbersep=5pt,
backgroundcolor=\color{white},
showspaces=false,
showstringspaces=false,
showtabs=false,
frame=single,
tabsize=2,
captionpos=b,
breaklines=true,
breakatwhitespace=false,
title=\lstname,
escapeinside={},
keywordstyle={},
morekeywords={}
}

\begin{document}
\title{GMMAT: Generalized linear Mixed Model Association Tests\\Version 1.4.2}
\date{November 17, 2023}
\author{Han Chen\\Human Genetics Center\\Department of Epidemiology\\School of Public Health\\The University of Texas Health Science Center at Houston\\Email: Han.Chen.2@uth.tmc.edu}

\maketitle
\newpage
\tableofcontents
\newpage
\section{Introduction}
\textit{GMMAT} is an R package for performing association tests using generalized linear mixed models (GLMMs) \cite{Breslow1993} in genome-wide association studies (GWAS) and sequencing association studies. GLMMs provide a broad range of models for correlated data analysis. In the GWAS and sequencing association study context, examples of correlated data include those from family studies, samples with cryptic relatedness and/or shared environmental effects, as well as samples generated from complex sampling designs.

\textit{GMMAT} first fits a GLMM with covariate adjustment and random effects to account for population structure and family or cryptic relatedness. For GWAS, \textit{GMMAT} performs score tests for each genetic variant. For candidate gene studies, \textit{GMMAT} can also perform Wald tests to get the effect size estimate for each genetic variant. For rare variant analysis from sequencing association studies, \textit{GMMAT} performs the burden test \cite{Morgenthaler2007, Li2008, Madsen2009, Morris2010}, the sequence kernel association test (SKAT) \cite{Wu2011}, SKAT-O \cite{Lee2012} and the efficient hybrid test of the burden test and SKAT, in the variant Set Mixed Model Association Tests (SMMAT) framework \cite{Chen2019}, based on user-defined variant sets.

\section{The model}
In the context of single variant test, \textit{GMMAT} works with the following GLMM
\begin{equation}
\eta_i=g(\mu_i)=\mb{X}_i\bs\alpha+G_i\beta+b_i.\nonumber
\end{equation}
We assume that given the random effects $\mb{b}$, the outcome $y_i$ are conditionally independent with mean $E(y_i|\mb{b})=\mu_i$ and variance $Var(y_i|\mb{b})=\phi a^{-1}_i v(\mu_i)$, where $\phi$ is the dispersion parameter (for binary and Poisson data $\phi=1$), $a_i$ are known weights, and $v(\cdot)$ is the variance function. The linear predictor $\eta_i$ is a monotonous function of the conditional mean $\mu_i$ via the link function $\eta_i=g(\mu_i)$. $\mb{X}_i$ is a $1\times p$ row vector of covariates for subject $i$, $\bs\alpha$ is a $p\times 1$ column vector of fixed covariate effects including the intercept, $G_i$ is the genotype of the genetic variant of interest for subject $i$, and $\beta$ is the fixed genotype effect. We assume that $\mb{b}\sim N(0, \sum_{k=1}^K\tau_k\mb{V}_k)$ is an $n\times 1$ column vector of random effects, $\tau_k$ are the variance component parameters, $\mb{V}_k$ are known $n\times n$ matrices. In practice, $\mb{V}_k$ can be the theoretical kinship matrix if analyzing family samples with known pedigree structure in a homogeneous population, or the empirical genetic relationship matrix (GRM) to account for population structure and cryptic relatedness, or any $n\times n$ matrices to account for shared environmental effects or complex sampling designs.

\textit{GMMAT} can be used to analyze both continuous and binary traits. For continuous traits, if a normal distribution and an identity link function are assumed, \textit{GMMAT} performs association tests based on linear mixed models (LMMs). For binary traits, however, we showed that performing association tests based on LMMs can lead to invalid \textit{P} values in the presence of moderate or strong population stratification, even after adjusting for top ancestry principal components (PCs) as fixed effects \cite{Chen2016}. In such scenarios, we would recommend assuming a Bernoulli distribution and a logit link function for binary traits, adjusting for top ancestry PCs as fixed-effect covariates. This GLMM is also known as the logistic mixed model.

In the context of variant set tests (also known as gene-based tests or aggregate variant tests), \textit{GMMAT} works with the following GLMM
\begin{equation}
\eta_i=g(\mu_i)=\mb{X}_i\bs\alpha+\mb{G}_i\bs\beta+b_i,\nonumber
\end{equation}
where $\mb{G}_i$ is now a $1\times q$ row vector of genotypes for $q$ variants in subject $i$, $\bs\beta$ is a $q\times 1$ column vector of genotype effects with mean $\beta_0$ and variance $\sigma^2$. The following 4 tests in the SMMAT framework are implemented. The burden test is for $H_0: \beta_0=0$ versus $H_1: \beta_0\ne 0$ under the constraint $\sigma^2=0$, SKAT is for $H_0: \sigma^2=0$ versus $H_1: \sigma^2>0$ under the constraint $\beta_0=0$, SKAT-O is a linear combination of burden test and SKAT statistics, and the efficient hybrid test combines the burden test with an adjusted SKAT (which is asymptotically independent with the burden test) $H_0: \sigma^2=0$ versus $H_1: \sigma^2>0$ under no constraints on $\beta_0$ using Fisher's method \cite{Chen2019}. Of note, in all these tests, the null GLMM is the same as the null model in the single variant test, which only needs to be fitted once.

For longitudinal data, two types of models can be applied: random intercept only models, and random intercept and random slope models. The random intercept only model is appropriate for analyzing repeated measures with no time trends, and observations for the same individual $i$ at different time points $j$ are assumed to be exchangeable. In the context of single variant test, \textit{GMMAT} works with the following GLMM for repeated measures
\begin{equation}
\eta_i=g(\mu_i)=\mb{X}_i\bs\alpha+G_i\beta+b_i.\nonumber
\end{equation}
The notations are similarly defined as the cross-sectional data above, except that now we have for each observation $j$ of individual $i$ the mean $E(y_{ij}|\mb{b})=\mu_i$ and variance $Var(y_{ij}|\mb{b})=\phi a^{-1}_i v(\mu_i)$. The random effects $\mb{b}$, which is a length $n$ vector of $b_i$ ($n$ is the number of individuals), is assumed to follow $\mb{b}\sim N(0, \sum_{k=1}^K\tau_k\mb{V}_k + \tau_{K+1}\mb{I}_n)$, where the first $K$ terms account for between-subject relatedness attributable to $n\times n$ matrices $\mb{V}_k$, and the last term accounts for random individual effects not attributable to relatedness matrices $\mb{V}_k$. For example, for unrelated individuals, we would not have any between-subject relatedness matrices $\mb{V}_k$, but only the last term for random individual effects. Let $\mb{Z}$ be the $N\times n$ design matrix indicating which observation is from which individual ($N$ is the total number of observations, $N > n$ but we do not need the same numbers of observations from each individual), the random effects for all observations $\mb{Zb}\sim N(0, \sum_{k=1}^K\tau_k\mb{ZV}_k\mb{Z}^T + \tau_{K+1}\mb{ZZ}^T)$ accounts for all sources of correlation, and the observed outcome $y_{ij}$ are assumed to be independent conditioning on that.

The random intercept and random slope model is appropriate for analyzing longitudinal data with individual-specific time trends (therefore, a random slope for time effect, in addition to the random intercept described above). In the context of single variant test, \textit{GMMAT} works with the following GLMM for time trends
\begin{equation}
\eta_{ij}=g(\mu_{ij})=\mb{X}_{ij}\bs\alpha+G_i\beta+b_{0i}+b_{1i}T_{ij}.\nonumber
\end{equation}
Typically, the time-dependent variable of interest $T_{ij}$ should be included in the fixed effect covariates $\mb{X}_{ij}$. The random effects of the intercept $\mb{b}_0$, which is a length $n$ vector of $b_{0i}$ ($n$ is the number of individuals), is assumed to follow $\mb{b}_0\sim N(0, \sum_{k=1}^K\tau_k\mb{V}_k + \tau_{K+1}\mb{I}_n)$, where the first $K$ terms account for random individual effects of the intercept attributable to $n\times n$ matrices $\mb{V}_k$, and the last term accounts for random individual effects of the intercept not attributable to relatedness matrices $\mb{V}_k$. In addition, we assume $Cov(\mb{b}_0, \mb{b}_1) = \sum_{k=1}^K\tau_{K+1+k}\mb{V}_k + \tau_{2K+2}\mb{I}_n)$, where $\mb{b}_1$ is a length $n$ vector of the random effects of the slope $b_{1i}$. The first $K$ terms account for the covariance of the random individual effects of the intercept and the slope attributable to $n\times n$ matrices $\mb{V}_k$, and the last term accounts for the covariance of the random individual effects of the intercept and the slope not attributable to relatedness matrices $\mb{V}_k$. Finally, we assume $\mb{b}_1\sim N(0, \sum_{k=1}^K\tau_{2K+2+k}\mb{V}_k + \tau_{3K+3}\mb{I}_n)$, where the first $K$ terms account for random individual effects of the slope attributable to $n\times n$ matrices $\mb{V}_k$, and the last term accounts for random individual effects of the slope not attributable to relatedness matrices $\mb{V}_k$. For example, with $K=1$, let $\mb{b}=(\mb{b}_0^T, \mb{b}_1^T)^T$ be the stacked random individual effects (for both the intercept and the slope), $\mb{Z}$ be the $N\times 2n$ design matrix for linking observations with their corresponding random intercept and random slope ($N$ is the total number of observations, again we do not need the same numbers of observations from each individual), the random effects for all observations $\mb{Zb}\sim N(0, \mb{Z}\{
\begin{pmatrix}
\tau_1 & \tau_3 \\
\tau_3 & \tau_5
\end{pmatrix}
\otimes\mb{V}_k + 
\begin{pmatrix}
\tau_2 & \tau_4 \\
\tau_4 & \tau_6
\end{pmatrix}
\otimes\mb{I}_n\}\mb{Z}^T)$ accounts for all sources of correlation, and the observed outcome $y_{ij}$ are assumed to be independent conditioning on that.

Variant set tests for longitudinal data in the SMMAT framework are similarly defined, for both random intercept only models (repeated, exchangeable measures), and random intercept and random slope models (time trends).

For multiple phenotype analysis, $m$ continuous traits can be jointly modeled in LMMs. Let $\mb{Y}$ be an $n\times m$ matrix of phenotypes, $\mb{X}$ be an $n\times p$ matrix of covariates, $\bs\alpha$ be a $p\times m$ matrix of fixed covariate effects including the intercept, $\mb{G}$ be an $n\times q$ matrix of genotypes (note that for single variant analysis, $q=1$), $\bs\beta$ be a $q\times m$ matrix of genotype effects, $\mb{b}$ be an $n\times m$ matrix of random effects, $\bs\varepsilon$ be an $n\times m$ matrix of random errors, we have
\begin{equation}
\mb{Y}=\mb{X}\bs\alpha+\mb{G}\bs\beta+\mb{b}+\bs\varepsilon.\nonumber
\end{equation}
In this joint model, $\bs\varepsilon$ follow a matrix normal distribution $\mathcal{MN}_{n\times m}(\mb{0}, \mb{I}_n, \bs\Psi_0)$, such that the vectorized form $vec(\bs\varepsilon)\sim \mathcal{N}_{nm}(\mb{0}, \bs\Psi_0\otimes\mb{I}_n)$, where $\bs\Psi_0$ is an $m\times m$ variance-covariance matrix for residuals of $m$ continuous traits. The vectorized form of the random effects $vec(\mb{b})\sim \mathcal{N}_{nm}(\mb{0}, \sum_{k=1}^K\bs\Psi_k\otimes\mb{V}_k)$, where $m\times m$ matrices $\bs\Psi_k$ are the variance-covariance component parameters for $m$ continous traits, corresponding to known $n\times n$ between-subject relatedness matrices $\mb{V}_k$. Once a null model without any genetic effects is fitted, the multiple phenotype single variant test $H_0: \bs\beta=\mb{0}$ versus $H_1: \bs\beta\ne \mb{0}$ can proceed using a chi-square distribution with $m$ degrees of freedom. Variant set tests for multiple phenotypes are currently defined as the burden test, SKAT, SKAT-O, and SMMAT efficient hybrid test on all $qm$ parameters in $\bs\beta$. Therefore, it is recommended that multiple phenotypes are rescaled to the same variance before conducting variant set tests.

\section{Getting started}
\subsection{Downloading \textit{GMMAT}}
\textit{GMMAT} can be downloaded at \url{https://github.com/hanchenphd/GMMAT}. It can be installed as a regular R package. It is also available on CRAN (\url{https://CRAN.R-project.org/package=GMMAT}).

\subsection{Installing \textit{GMMAT}}
\textit{GMMAT} links to R packages \href{https://cran.r-project.org/web/packages/Rcpp/index.html}{Rcpp} and \href{https://cran.r-project.org/web/packages/RcppArmadillo/index.html}{RcppArmadillo}, and also imports R packages \href{https://cran.r-project.org/web/packages/Rcpp/index.html}{Rcpp}, \href{https://cran.r-project.org/web/packages/CompQuadForm/index.html}{CompQuadForm}, \href{https://cran.r-project.org/web/packages/foreach/index.html}{foreach}, \href{https://cran.r-project.org/web/views/HighPerformanceComputing.html}{parallel}, \href{https://cran.r-project.org/web/packages/Matrix/index.html}{Matrix}, methods, and \href{https://cran.r-project.org/web/packages/data.table/index.html}{data.table}. \textit{GMMAT} requires Bioconductor packages \href{http://bioconductor.org/packages/release/bioc/html/SeqArray.html}{SeqArray} and \href{http://bioconductor.org/packages/release/bioc/html/SeqVarTools.html}{SeqVarTools} to work with genotype files in the GDS format. In addition, \textit{GMMAT} requires \href{https://cran.r-project.org/web/packages/testthat/index.html}{testthat} to run code checks during development, and \href{https://cran.r-project.org/web/packages/doMC/index.html}{doMC} to run parallel computing in \textbf{glmm.score} and \textbf{SMMAT} for genotype files in the GDS format (however, \href{https://cran.r-project.org/web/packages/doMC/index.html}{doMC} is not available on Windows and these functions will switch to a single thread). These dependencies should be installed before installing \textit{GMMAT}.

For optimal computational performance, it is recommended to use an R version configured with the Intel Math Kernel Library (or other fast BLAS/LAPACK libraries). See the instructions on building R with Intel MKL (\url{https://software.intel.com/en-us/articles/using-intel-mkl-with-r}).

Here is an example for installing \textit{GMMAT} and all its dependencies in an R session (assuming none of the R packages other than the default has been installed):

<<installation, eval=FALSE, echo=TRUE>>=
## try http:// if https:// URLs are not supported
## remove "doMC" below if you are running Windows
install.packages(c("devtools", "RcppArmadillo", "CompQuadForm", "doMC", 
        "foreach", "Matrix", "data.table", "BiocManager", "testthat"), 
	repos = "https://cran.r-project.org/")
BiocManager::install(c("SeqArray", "SeqVarTools"))
devtools::install_github("hanchenphd/GMMAT")
@

\subsection{Using \textit{GMMAT} in Analysis Commons}
The \textit{GMMAT} package (version 1.1.2) is also available in Analysis Commons \cite{Brody2017} on DNAnexus cloud computing platform (\url{https://platform.dnanexus.com/}), as the "gmmat\_v1.1.2" App in Project "Commons\_Tools".

\section{Input}
\textit{GMMAT} requires the phenotype and covariates in an R data frame, known positive semi-definite matrices $\mb{V}_k$ as an R matrix (in the case of a single matrix) or an R list (in the case of multiple matrices), and genotypes saved in a plain text file (or in a compressed plain text file .gz or .bz2), a \href{http://zzz.bwh.harvard.edu/plink/data.shtml#bed}{PLINK binary PED file}, or in a GDS format file. We describe how to prepare these data below.

\subsection{Phenotype and covariates}
Phenotype and covariates should be either saved as a data frame in R, or recorded in a text file that can be read into R as a data frame. The rows of the data frame represent different individuals, and the columns represent different variables. For example, here we show the header and first 6 rows of the example text file pheno.txt:
\begin{verbatim}
id      disease trait   age     sex
1       1       5.45    61      0
2       1       5.61    50      1
3       0       3.1     54      0
4       1       6.22    48      1
5       1       5.42    49      0
6       0       6.22    50      1
...
\end{verbatim}

In this example, there are an ID column (id), one binary phenotype (disease), one quantitative phenotype (trait) and two covariates (age and sex). There can be additional columns for unused variables, and the order of columns does not matter. To read it into R as a data frame, you can use
<<pheno>>=
pheno.file <- system.file("extdata", "pheno.txt", package = "GMMAT")
pheno <- read.table(pheno.file, header = TRUE)
@

Missing values in the data frame should be recognizable by R as NA. For example, if you use . (period) to denote missing values in the text file, you can use
<<pheno2, eval=FALSE, echo=TRUE>>=
pheno <- read.table(pheno.file, header = TRUE, na.strings = ".")
@

\subsection{Matrices of covariance structure}
\textit{GMMAT} requires at least one positive semi-definite matrix $\mb{V}_k$ to model the covariance structure of the random effects (for cross-sectional data). In the simplest case, this is usually a GRM estimated from the genotype data. Currently \textit{GMMAT} does not provide a function to calculate the GRM, but there are many software packages that can do this job. For example, \href{http://www.xzlab.org/software.html}{GEMMA} \cite{Zhou2012} can be used to estimate either the centered GRM or the standardized GRM. GRM saved in an external file must be read into R as a matrix. For example,
<<GRM>>=
GRM.file <- system.file("extdata", "GRM.txt.bz2", package = "GMMAT")
GRM <- as.matrix(read.table(GRM.file, check.names = FALSE))
@

This matrix can include more than $n$ individuals in practice, but the rownames and colnames must include all individuals' id in the phenotype and covariates data frame. Multiple matrices can be used to allow multiple components of random effects. In such cases, the matrices should be constructed as a list of matrices, and each matrix should comply with the rownames and colnames requirements described above (although they don't have to be in the same order). For example, if you have 3 R matrices Mat1, Mat2 and Mat3, you can construct the R list
<<Mats, eval=FALSE, echo=TRUE>>=
Mats <- list(Mat1, Mat2, Mat3)
@

All matrices must be positive semi-definite. Sparse matrices from the \textit{Matrix} package are also allowed.

\subsection{Genotypes}
\textit{GMMAT} can take genotype files either in plain text format (or the compressed version .gz or .bz2), \href{http://zzz.bwh.harvard.edu/plink/data.shtml#bed}{PLINK binary PED format}, in the GDS format, or in the BGEN format. Non-integer imputed genotypes (dosages) should be saved in plain text files (or the compressed version .gz or .bz2). The plain text file can be space-, tab-, comma-, or even special character-delimited, and there can be additional rows (e.g., comments) and/or columns before the genotype data matrix. Here is an example of part of a tab-delimited plain text genotype file geno.txt:
\begin{verbatim}
# This is an example genotype file for demonstrating GMMAT
# Each row represents one SNP for all individuals in the study
# First column is SNP name, second and third columns are alleles (Allele1
and Allele2): it is recommended to use Allele1 for the reference allele 
and Allele2 for the effect allele, but reversed coding is also allowed 
and does not affect association test results (users should be cautious 
with allele coding when interpreting results)
# Starting from fourth column, each column represents one individual
# In this example, there are 400 individuals and 100 SNPs
SNP1    A       T       0       0       NA      NA      ...
SNP2    A       C       1       0       1       0       ...
SNP3    A       C       0       0       0       1       ...
SNP4    A       G       1       0       1       1       ...
SNP5    A       G       1       0       2       1       ...
...
\end{verbatim}

Genotypes in Variant Call Format (VCF) and \href{http://zzz.bwh.harvard.edu/plink/data.shtml#bed}{PLINK binary PED format} can be converted to the GDS format using seqVCF2GDS and seqBED2GDS functions from the \href{http://bioconductor.org/packages/release/bioc/html/SeqArray.html}{SeqArray} package:
<<convert2GDS, eval=FALSE, echo=TRUE>>=
SeqArray::seqVCF2GDS("VCF_file_name", "GDS_file_name")
SeqArray::seqBED2GDS("BED_file_name", "FAM_file_name", "BIM_file_name", 
        "GDS_file_name")
@

\section{Running \textit{GMMAT}}
If \textit{GMMAT} has been successfully installed, you can load it in an R session using

<<loading>>=
library(GMMAT)
@

We provide 6 functions in \textit{GMMAT}: \textbf{glmmkin} for fitting the GLMM with known $\mb{V}_k$, \textbf{glmm.score} for running single variant score tests, \textbf{glmm.wald} for single variant Wald tests, \textbf{glmm.score.meta} for performing meta-analysis on score test results, \textbf{SMMAT} for running variant set tests SMMAT (also known as gene-based tests or aggregate variant tests), and \textbf{SMMAT.meta} for performing variant set tests meta-analysis using intermediate files (single variant scores and their covariance matrices in each variant set, from the \textbf{SMMAT} function). Details about how to use these functions, their arguments and returned values can be found in the R help document of \textit{GMMAT}. For example, to learn more about \textbf{glmmkin}, in an R session you can type
<<help, eval=FALSE, echo=TRUE>>=
?glmmkin
@

\subsection{Fitting GLMM}
Here we provide a simple example of fitting GLMM using \textbf{glmmkin}. We have the binary phenotype disease, the quantitative phenotype trait, and two covariates age and sex, saved in a plain text file pheno.txt. We also have computed the GRM externally and saved it in a compressed file GRM.txt.bz2. In this example we fit a GLMM assuming Bernoulli distribution of the disease and logit link function (also known as a logistic mixed model). We adjust for age and sex, and use one $n\times n$ matrix as $\mb{V}_k$ (the GRM) to model the covariance structure of the random effects.
<<GMMAT0>>=
model0 <- glmmkin(disease ~ age + sex, data = pheno, kins = GRM, 
        id = "id", family = binomial(link = "logit"))
model0$theta
model0$coefficients
model0$cov
@

Note that pheno and GRM must be read into R as a data frame and a matrix, respectively. When using the function \textbf{glmmkin}, the data frame of phenotype and covariates (in our example, pheno) should be passed to the argument "data", and the matrix or the list of matrices for random effects (in our example, the matrix GRM) should be passed to the argument "kins". The first argument of the function \textbf{glmmkin} is "fixed", which requires a formula for fixed effects. The syntax of the formula is the same as the formula used in a linear model \textbf{lm} and a generalized linear model \textbf{glm}. The example model above is equivalent to
<<GMMAT01, eval=FALSE, echo=TRUE>>=
model0 <- glmmkin(fixed = disease ~ age + sex, data = pheno, kins = GRM, 
        id = "id", family = binomial(link = "logit"))
@
The argument "id" specifies the ID column name in the data frame pheno. The argument "family" takes the same syntax as used in a generalized linear model \textbf{glm}. For example, if you would like to fit a LMM for a quantitative trait, you can use
<<GMMAT1>>=
model1 <- glmmkin(fixed = trait ~ age + sex, data = pheno, kins = GRM, 
        id = "id", family = gaussian(link = "identity"))
@
Please avoid using LMMs for binary traits.

To fit a heteroscedastic LMM for a quantitative trait (allowing heterogeneous residual variances among different groups) \cite{Conomos2016}, you can use
<<GMMAT2>>=
model2 <- glmmkin(fixed = trait ~ age + sex, data = pheno, kins = GRM, 
        id = "id", groups = "disease", 
        family = gaussian(link = "identity"))
model2$theta
@
In this example, groups are defined by disease status. Therefore, disease cases and controls are assumed to have different residual variances on the quantitative trait, after adjusting for age and sex.

Here is a list of supported \textbf{family} objects (for details and alternative link/variance functions, please see the R help document of \textbf{family}):

\begin{center}
\begin{tabular}{cccc}
\hline
Family & Link & Trait & Variance\\
\hline
binomial & logit & binary & $\mu(1-\mu)$\\
gaussian & identity & continuous & $\phi$\\
Gamma & inverse & continuous & $\phi\mu^2$\\
inverse.gaussian & 1/mu\^{}2 & continuous & $\phi\mu^3$\\
poisson & log & count & $\mu$\\
quasi & identity & continuous & $\phi$\\
quasibinomial & logit & binary & $\phi\mu(1-\mu)$\\
quasipoisson & log & count & $\phi\mu$\\
\hline
\end{tabular}
\end{center}

The function \textbf{glmmkin} returns a list. Except for the heteroscedastic LMM, the first element in the vector theta is the estimate of the dispersion parameter $\phi$ (for binary and Poisson data we have a fixed $\phi=1$), and the remaining elements are variance component estimates for each matrix modeling the covariance structure of the random effects (in the same order as in the list of matrices passed to "kins"). In our binary disease example model0 above, we have only one such matrix (the GRM, therefore $K=1$), and the results show that the estimate of the variance component parameter $\tau_1$ is 0.3377331. The vector coefficients gives the fixed effect estimates, and the order matches the order of covariates in the formula passed to "fixed". In our binary disease example above, we have an intercept 0.472081189, and an age effect estimate $-$0.006818634, a sex effect estimate $-$0.086444746. The matrix cov is the covariance matrix of the fixed effect estimates.

When a heteroscedastic LMM is fitted using the function \textbf{glmmkin}, the first elements (with length equal to the number of groups) in the vector theta are the estimated residual variances for each group (in the same order as the levels of groups appearing in the data), followed by variance component estimates for each matrix modeling the covariance structure of the random effects (in the same order as in the list of matrices passed to "kins"). In our heteroscedastic LMM model2 above, 0.9082504 is the residual variance estimate for cases, 1.0589985 is the residual variance estimate for controls (since in the data pheno, disease 1 appears before disease 0), and 1.0853533 is the estimate of the variance component parameter for the GRM.

In the following example, we have the same binary phenotype disease, covariates age and sex, and the same GRM as in the previous example. In addition to the GRM, we have another $n\times n$ matrix to model the covariance structure of the random effects. The GLMM is
\begin{equation}
\log(\frac{P(disease_i=1|age_i,sex_i,b_i)}{1-P(disease_i=1|age_i,sex_i,b_i)})=\alpha_0+\alpha_1\times age_i+\alpha_2\times sex_i+b_i,\nonumber
\end{equation}
where $\mb{b}\sim N(0, \tau_1\mb{V}_1+\tau_2\mb{V}_2)$, $\mb{V}_1$ is the GRM, $\mb{V}_2$ is a block diagonal matrix with block size 10 and all entries equal to 1 within a block. Here $\mb{V}_2$ is used to model clusters: in this example we have 40 clusters with 10 individuals in each cluster. For individual $i$ in cluster $j$ ($j=1,2,\ldots,40$), the model above is equivalent to
\begin{equation}
\log(\frac{P(disease_i=1|age_i,sex_i,b_{1i},b_{2j})}{1-P(disease_i=1|age_i,sex_i,b_{1i},b_{2j})})=\alpha_0+\alpha_1\times age_i+\alpha_2\times sex_i+b_{1i}+b_{2j},\nonumber
\end{equation}
where $\mb{b}_1\sim N(0, \tau_1\mb{V}_1)$, $b_{2j}\sim N(0, \tau_2)$. All 10 individuals in a cluster share a common $b_{2j}$, and $b_{2j}$ are independent and identically distributed across clusters.
<<GMMAT3>>=
M10 <- matrix(0, 400, 400)
for(i in 1:40) M10[(i-1)*10+(1:10), (i-1)*10+(1:10)] <- 1
rownames(M10) <- colnames(M10) <- 1:400
Mats <- list(GRM, M10)
model3 <- glmmkin(fixed = disease ~ age + sex, data = pheno, id = "id",
        kins = Mats, family = binomial(link = "logit"))
model3$theta
@
In this example, the dispersion parameter $\phi$ is fixed to 1, the variance component estimates $\hat\tau_1=0.2199087$, $\hat\tau_2=0.1219592$, corresponding to the $n\times n$ matrices $\mb{V}_1$ (the GRM) and $\mb{V}_2$ (the block diagonal matrix M10), respectively.

For longitudinal data, the following example illustrates a random intercept only model, which is appropriate for analyzing repeated measures with no time trends. We have 5 exchangeable observations of the continuous phenotype y.repeated for each individual, adjusting for sex as a fixed effects covariate. The model is
\begin{equation}
y_{ij}=\alpha_0+\alpha_1\times sex_i+b_i+\epsilon_{ij},\nonumber
\end{equation}
where the random individual effects $\mb{b}\sim N(0, \tau_1\mb{V}_1 + \tau_2\mb{I}_n)$,
$\mb{V}_1$ is the GRM, and the error $\epsilon_{ij}\sim N(0, \phi)$.
<<GMMAT4>>=
pheno2.file <- system.file("extdata", "pheno2.txt", package = "GMMAT")
pheno2 <- read.table(pheno2.file, header = TRUE)
model4 <- glmmkin(y.repeated ~ sex, data = pheno2, kins = GRM, id = "id",
        family = gaussian(link = "identity"))
model4$theta
@
In this example, the dispersion parameter estimate is $\hat\phi=0.5089983$, and the variance component estimates $\hat\tau_1=0.2410866$, $\hat\tau_2=0.2129918$, corresponding to the GRM and the identity matrix, respectively, for the random intercept of individual effects.

The last example of this section illustrates a random intercept and random slope model, which is appropriate for analyzing longitudinal data with individual-specific time trends. We have observations of the continuous phenotype y.trend at 5 time points for each individual, adjusting for sex and time as fixed effects covariates. The model is
\begin{equation}
y_{ij}=\alpha_0+\alpha_1\times sex_i+\alpha_2\times time_{ij}+b_{0i}+b_{1i}\times time_{ij}+\epsilon_{ij},\nonumber
\end{equation}
where the random effects of the intercept $\mb{b}_0\sim N(0, \tau_1\mb{V}_1 + \tau_2\mb{I}_n)$, the covariance $Cov(\mb{b}_0, \mb{b}_1) = \tau_3\mb{V}_1 + \tau_4\mb{I}_n)$, the random effects of the time slope $\mb{b}_1\sim N(0, \tau_5\mb{V}_1 + \tau_6\mb{I}_n)$, $\mb{V}_1$ is the GRM, and the error $\epsilon_{ij}\sim N(0, \phi)$.
<<GMMAT5>>=
model5 <- glmmkin(y.trend ~ sex + time, data = pheno2, kins = GRM, id = "id",
        random.slope = "time", family = gaussian(link = "identity"))
model5$theta
@
In this example, the dispersion parameter estimate is $\hat\phi=2.0139758$, the variance component estimates $\hat\tau_1=1.5056896$, $\hat\tau_2=0.5443516$, corresponding to the GRM and the identity matrix, for the random intercept of individual effects, the covariance estimates $\hat\tau_3=1.2539355$, $\hat\tau_4=-0.2469170$, corresponding to the GRM and the identity matrix, and the variance component estimates $\hat\tau_5=1.1690890$, $\hat\tau_6=0.6931902$, corresponding to the GRM and the identity matrix, for the random slope of individual time effects.

\subsection{Single variant tests}
\subsubsection{Score tests}
When performing score tests in GWAS, we need a fitted GLMM under the null hypothesis $H_0: \beta=0$ and a genotype file. We can construct score tests using the glmmkin class object returned from the function \textbf{glmmkin} for the null GLMM. Note that score tests require only vector/matrix multiplications and are much faster than Wald tests, which require fitting a new GLMM for each SNP. Score tests give the direction of effects but not effect size estimates. However, we can simply add score statistics and their variances from different studies to perform a meta-analysis. Here we provide a simple example of score tests using the plain text genotype file "geno.txt":
<<GMMATscoretxt, eval=FALSE, echo=TRUE>>=
geno.file <- system.file("extdata", "geno.txt", package = "GMMAT")
glmm.score(model0, infile = geno.file, outfile =
        "glmm.score.text.testoutfile.txt", infile.nrow.skip = 5, 
        infile.ncol.skip = 3, infile.ncol.print = 1:3,
        infile.header.print = c("SNP", "Allele1", "Allele2"))
@
The first argument in \textbf{glmm.score} is the returned glmmkin class object from the null GLMM. The argument "infile" is the name (and path if not in the current working directory) of the plain text genotype file (or compressed files .gz and .bz2), and the argument "outfile" is the name of the output file. In this example genotype file, we have 5 comment lines to skip using "infile.nrow.skip". The first 3 columns contain information on SNP name and alleles, which we skip from the analysis using "infile.ncol.skip" but subsequently keep in the output file using "infile.ncol.print" to select the 1st, 2nd and 3rd columns. Corresponding column names in the output file can be assigned using "infile.header.print".

If your genotype information is saved in a \href{http://zzz.bwh.harvard.edu/plink/data.shtml#bed}{PLINK binary PED file} "geno.bed", you can use:
<<GMMATscorebed, eval=FALSE, echo=TRUE>>=
geno.file <- strsplit(system.file("extdata", "geno.bed", 
        package = "GMMAT"), ".bed", fixed = TRUE)[[1]]
glmm.score(model0, infile = geno.file, outfile = 
        "glmm.score.bed.testoutfile.txt")
@
Here "infile" is the prefix (and path if not in the current working directory) of the \href{http://zzz.bwh.harvard.edu/plink/data.shtml#bed}{PLINK} files (.bed, .bim and .fam). SNP information in the .bim file (in our example, "geno.bim") is carried over to the output file.

If your genotype information is saved in a GDS file "geno.gds", you can use:
<<GMMATscoregds, eval=FALSE, echo=TRUE>>=
geno.file <- system.file("extdata", "geno.gds", package = "GMMAT")
glmm.score(model0, infile = geno.file, outfile = 
        "glmm.score.gds.testoutfile.txt")
@

Alternatively, if your genotype information is saved in a BGEN file "geno.bgen", with a BGEN sample file "geno.sample", you can use:
<<GMMATscorebgen, eval=FALSE, echo=TRUE>>=
geno.file <- system.file("extdata", "geno.bgen", package = "GMMAT")
samplefile <- system.file("extdata", "geno.sample", package = "GMMAT")
glmm.score(model0, infile = geno.file, BGEN.samplefile = samplefile,
	outfile = "glmm.score.bgen.testoutfile.txt")
@

The function \textbf{glmm.score} returns the actual computation time in seconds from its function call for plain text genotype files (or compressed files .gz and .bz2) and \href{http://zzz.bwh.harvard.edu/plink/data.shtml#bed}{PLINK binary PED files}. For GDS and BGEN genotype files, no value is returned.

\subsubsection{Wald tests}
When performing Wald tests for candidate SNPs to get effect size estimates, we need the phenotype (and covariates) data frame, the matrices modeling the covariance structure of the random effects, and the genotype file. To perform Wald tests, we do not need fitting the null GLMM required in score tests using \textbf{glmmkin}. In the example below, we perform Wald tests for 4 candidate SNPs of interest and get their effect estimates:
<<GMMATwaldtxt, warning=FALSE>>=
geno.file <- system.file("extdata", "geno.txt", package = "GMMAT")
snps <- c("SNP10", "SNP25", "SNP1", "SNP0")
glmm.wald(fixed = disease ~ age + sex, data = pheno, kins = GRM, id = "id",
        family = binomial(link = "logit"), infile = geno.file, snps = snps, 
	infile.nrow.skip = 5, infile.ncol.skip = 3, infile.ncol.print = 1:3, 
	infile.header.print = c("SNP", "Allele1", "Allele2"))
@
The syntax is a hybrid of \textbf{glmmkin} and \textbf{glmm.score}. Note that the argument "fixed" is a formula including the covariates but NOT the test SNPs. The argument "snps" is a character vector of the names of the test SNPs. If "infile" is a plain text genotype file (or compressed files .gz and .bz2), the function \textbf{glmm.wald} returns a data frame with first columns copied from the genotype file using "infile.ncol.print" and names specified using "infile.header.print", followed by the sample size N, the allele frequency (AF) of the effect allele (Allele2 in this example, but you can also define Allele1 as the effect allele in your coded genotype file), effect size estimate BETA of the effect allele, standard error SE, Wald test \textit{P} value PVAL, and an indicator for whether the GLMM is converged. Note that in the example above, SNP0 is not actually included in the genotype file, so all results are missing.

If your genotype information is saved in a \href{http://zzz.bwh.harvard.edu/plink/data.shtml#bed}{PLINK binary PED file} "geno.bed", you can use:
<<GMMATwaldbed, warning=FALSE>>=
geno.file <- strsplit(system.file("extdata", "geno.bed", 
        package = "GMMAT"), ".bed", fixed = TRUE)[[1]]
glmm.wald(fixed = disease ~ age + sex, data = pheno, kins = GRM, id = "id",
        family = binomial(link = "logit"), infile = geno.file, snps = snps)
@
It returns a data frame with first 6 columns copied from the .bim file (in our example, "geno.bim"), followed by the sample size N, the allele frequency (AF) of A2 allele (the effect allele, note that A1 allele in .bim is coded 0 and A2 allele is coded 1), effect size estimate BETA of A2 allele, standard error SE, Wald test \textit{P} value PVAL, and an indicator for whether the GLMM is converged.

Alternatively, if your genotype information is saved in a GDS file "geno.gds", you can use:
<<GMMATwaldgds, eval=FALSE, echo=TRUE>>=
geno.file <- system.file("extdata", "geno.gds", package = "GMMAT")
glmm.wald(fixed = disease ~ age + sex, data = pheno, kins = GRM, id = "id",
        family = binomial(link = "logit"), infile = geno.file, snps = snps)
@
<<GMMATwaldgds2, eval=TRUE, echo=FALSE, warning=FALSE>>=
if(requireNamespace("SeqArray", quietly = TRUE) && requireNamespace("SeqVarTools", quietly = TRUE)) {
geno.file <- system.file("extdata", "geno.gds", package = "GMMAT")
glmm.wald(fixed = disease ~ age + sex, data = pheno, kins = GRM, id = "id",
        family = binomial(link = "logit"), infile = geno.file, snps = snps)
}
@
It returns a data frame with first 5 columns information extracted from the GDS file, followed by the sample size N, the allele frequency (AF) of ALT allele, effect size estimate BETA of ALT allele, standard error SE, Wald test \textit{P} value PVAL, and an indicator for whether the GLMM is converged.

\subsubsection{Meta-analysis}
Score test results from multiple studies can be combined in meta-analysis. We provide the function \textbf{glmm.score.meta} to perform meta-analysis on score test results. Generally, if each study performs score tests using genotypes in \href{http://zzz.bwh.harvard.edu/plink/data.shtml#bed}{PLINK binary PED format} or GDS format, the score test output from \textbf{glmm.score} can be directly used as input files. Otherwise the meta-analysis function needs a tab or space delimited plain text file (or compressed files that can be recognized by the R function \textbf{read.table}) with at least 8 columns: SNP name, effect allele, reference allele, N, AF, SCORE, VAR and PVAL. Note that the SNP name, effect allele, reference allele can have customized column names in different input files, but the column names of N, AF, SCORE, VAR and PVAL should match exactly. Customized SNP and alleles column names can be specified using "SNP", "A1" and "A2". Note that we do not define whether "A1" or "A2" is the effect allele: it is your choice. However, your choice should be consistent across different studies: for example, if you have two studies with the same allele column names "Allele1" and "Allele2", and you want to define "A1" as the effect allele, but the effect allele is "Allele1" in the first study and "Allele2" in the second study, you need A2 = c("Allele2", "Allele1") for the reference allele column in each study, and A1 = c("Allele1", "Allele2") for the effect allele column in each study. Note that in \textbf{glmm.score} output from analyzing \href{http://zzz.bwh.harvard.edu/plink/data.shtml#bed}{PLINK binary PED format} genotypes, the effect allele has column name "A2", and in \textbf{glmm.score} output from analyzing GDS format genotypes, the effect allele has column name "ALT". Thus if you have a result file from analyzing \href{http://zzz.bwh.harvard.edu/plink/data.shtml#bed}{PLINK binary PED format} genotypes in the third study, and another result file from analyzing GDS format genotypes in the fourth study, in addition to the aforementioned two studies, and you still want to define "A1" as the effect allele, you need A2 = c("Allele2", "Allele1", "A1", "REF") for the reference allele column in each study, and A1 = c("Allele1", "Allele2", "A2", "ALT") for the effect allele column in each study.

Here is an example of meta-analyzing 3 score test result files:
<<GMMATscoremeta, eval=FALSE, echo=TRUE>>=
meta1.file <- system.file("extdata", "meta1.txt", package = "GMMAT")
meta2.file <- system.file("extdata", "meta2.txt", package = "GMMAT")
meta3.file <- system.file("extdata", "meta3.txt", package = "GMMAT")
glmm.score.meta(files = c(meta1.file, meta2.file, meta3.file),
        outfile = "glmm.score.meta.testoutfile.txt",
        SNP = rep("SNP", 3), A1 = rep("A1", 3), A2 = rep("A2", 3))
@

\subsection{Variant set tests}
\subsubsection{Pooled analysis}
Variant set tests (also known as gene-based tests or aggregate variant tests) in a single study (or a pooled analysis of multiple studies) can be performed using the function \textbf{SMMAT}. Currently only the GDS genotype format is supported. In addition to a glmmkin class object returned from the function \textbf{glmmkin} for the null GLMM and the GDS format genotype file, a group definition file with no header and 6 columns (variant set id, variant chromosome, variant position, variant reference allele, variant alternate allele, weight) is also required. For example, here we show the first 6 rows of the example group definition file "SetID.withweights.txt":
\begin{verbatim}
Set1    1       1       T       A       1
Set1    1       2       A       C       4
Set1    1       3       C       A       3
Set1    1       4       G       A       6
Set1    1       5       A       G       9
Set1    1       6       C       A       9
\end{verbatim}
Note that each variant in the group definition file is matched by chromosome, position, reference allele and alternate allele with variants from the GDS file. One genetic variant can be included in different groups with possibly different weights. If no external weights are needed in the analysis, simply replace the 6th column by all 1's.

Four variant set tests are supported in the SMMAT framework: "B" for the burden test, "S" for SKAT, "O" for SKAT-O and "E" for the efficient hybrid test of the burden test and SKAT. You can include one or more tests in a single analysis. If "O" is selected, the burden test and SKAT results will be automatically included; if "E" is selected, the burden test results will be automatically included. Therefore, the following example gives all four test results:
<<SMMAT1, eval=FALSE, echo=TRUE>>=
group.file <- system.file("extdata", "SetID.withweights.txt", 
        package = "GMMAT")
geno.file <- system.file("extdata", "geno.gds", package = "GMMAT")
SMMAT(model0, group.file = group.file, geno.file = geno.file, 
        MAF.range = c(1e-7, 0.5), miss.cutoff = 1, method = "davies", 
        tests = c("O", "E"))
@
<<SMMAT12, eval=TRUE, echo=FALSE>>=
if(requireNamespace("SeqArray", quietly = TRUE) && requireNamespace("SeqVarTools", quietly = TRUE)) {
group.file <- system.file("extdata", "SetID.withweights.txt", 
        package = "GMMAT")
geno.file <- system.file("extdata", "geno.gds", package = "GMMAT")
SMMAT(model0, group.file = group.file, geno.file = geno.file, 
        MAF.range = c(1e-7, 0.5), miss.cutoff = 1, method = "davies", 
        tests = c("O", "E"))
}
@
It returns a data frame with first 8 columns showing the group (variant set) name, number of variants in each group, minimum, mean, maximum missing rate of variants in each group, minimum, mean, maximum effect allele frequency of variants in each group, followed by variant set test results. For Burden, 3 columns will be included to show the burden test score, variance of the score, and its \textit{P} value. For SKAT, the \textit{P} value column will be included. For SKAT-O, 3 columns will be included to show SKAT-O \textit{P} value, minimum \textit{P} value in the search grid, and the value of the mixing parameter $\rho$ at which the minimum \textit{P} value is observed. For the efficient hybrid test, the \textit{P} value column will be included.

For a single study, intermediate files containing single variant scores and their covariance matrices for each variant set (based on the group definition file) can be saved for future use in re-analysis and/or meta-analysis. For example, here we perform the burden test and save intermediate files:
<<SMMAT2, eval=FALSE, echo=TRUE>>=
SMMAT(model0, group.file = group.file, geno.file = geno.file, 
        MAF.range = c(1e-7, 0.5), miss.cutoff = 1, method = "davies", 
        tests = "B", meta.file.prefix = "SMMAT.meta")
@
<<SMMAT22, eval=TRUE, echo=FALSE>>=
if(requireNamespace("SeqArray", quietly = TRUE) && requireNamespace("SeqVarTools", quietly = TRUE)) {
SMMAT(model0, group.file = group.file, geno.file = geno.file, 
        MAF.range = c(1e-7, 0.5), miss.cutoff = 1, method = "davies", 
        tests = "B", meta.file.prefix = "SMMAT.meta")
}
@
In the example above, a space-delimited file "SMMAT.meta.score.1" will be generated to save the single variant scores, and a binary file "SMMAT.meta.var.1" will be generated to save the covariance matrices for the variant sets. Note that the binary file is not human-readable, but can be used by \textbf{SMMAT.meta} in re-analysis and/or meta-analysis.

\subsubsection{Meta-analysis}
With intermediate files generated by \textbf{SMMAT}, the function \textbf{SMMAT.meta} can be used in re-analysis of single study results, and/or meta-analysis to combine multiple studies. Here we show an example of rerunning SKAT using intermediate files generated above in the burden test:
<<SMMATmeta, eval=FALSE, echo=TRUE>>=
SMMAT.meta(meta.files.prefix = "SMMAT.meta", n.files = 1,
        group.file = group.file, MAF.range = c(1e-7, 0.5), 
        miss.cutoff = 1, method = "davies", tests = "S")
@
<<SMMATmeta2, eval=TRUE, echo=FALSE>>=
if(requireNamespace("SeqArray", quietly = TRUE) && requireNamespace("SeqVarTools", quietly = TRUE)) {
SMMAT.meta(meta.files.prefix = "SMMAT.meta", n.files = 1,
        group.file = group.file, MAF.range = c(1e-7, 0.5), 
        miss.cutoff = 1, method = "davies", tests = "S")
}
@
<<SMMATmetacleanup, eval=TRUE, echo=FALSE>>=
if(requireNamespace("SeqArray", quietly = TRUE) && requireNamespace("SeqVarTools", quietly = TRUE)) {
unlink(c("SMMAT.meta.score.1", "SMMAT.meta.var.1"))
}
@
The first argument, "meta.files.prefix", is a vector of intermediate files' prefix with length equal to the number of studies, and the second argument, "n.files", is a vector of integers showing how many sets of intermediate files each study has (also with length equal to the number of studies). In our above example of re-analysis, we have one set of intermediate files ("SMMAT.meta.score.1" and "SMMAT.meta.var.1") with prefix "SMMAT.meta". The group definition file (passed to "group.file") should be the same as the one used to generate intermediate files by \textbf{SMMAT} (with possibly different weights allowed). In the above example, only SKAT is performed, but four variant set tests are supported in the SMMAT framework: "B" for the burden test, "S" for SKAT, "O" for SKAT-O and "E" for the efficient hybrid test of the burden test and SKAT. You can include one or more tests by passing a vector to the argument "tests". If "O" is selected, the burden test and SKAT results will be automatically included; if "E" is selected, the burden test results will be automatically included.

\section{Output}
The single variant score test function \textbf{glmm.score} generates a tab-delimited plain text output file. Here we show the header and the first five rows of the example output "glmm.score.text.testoutfile.txt" from using a plain text genotype file "geno.txt" in the function \textbf{glmm.score}:
\begin{verbatim}
SNP     Allele1 Allele2 N       AF        SCORE    VAR     PVAL
SNP1    A       T       393     0.0254453 1.985    4.55635 0.352406
SNP2    A       C       400     0.5       3.51032  46.3328 0.60606
SNP3    A       C       400     0.2075    -0.5334  30.6023 0.923185
SNP4    A       G       400     0.29875   -3.11494 40.5128 0.624567
SNP5    A       G       400     0.59375   -4.00135 42.2757 0.538287
...
\end{verbatim}
The first 3 columns are copied from the genotype file using "infile.ncol.print" with names specified using "infile.header.print". Results are included in 5 columns: the sample size N, the allele frequency (AF) of the effect allele (Allele2 in this example, but it is the user's choice: you can also define Allele1 as the effect allele in your coded genotype file), the score statistic SCORE of the effect allele, the variance of the score VAR, and score test \textit{P} value PVAL.

If you use a \href{http://zzz.bwh.harvard.edu/plink/data.shtml#bed}{PLINK binary PED file} "geno.bed" as the genotype file, here are the header and the first 5 rows of the example output "glmm.score.bed.testoutfile.txt" from \textbf{glmm.score}:
\begin{verbatim}
CHR     SNP     cM      POS     A1      A2      N       AF      
1       SNP1    0       1       T       A       393     0.974555        
1       SNP2    0       2       A       C       400     0.5     
1       SNP3    0       3       C       A       400     0.7925  
1       SNP4    0       4       G       A       400     0.70125 
1       SNP5    0       5       A       G       400     0.59375 
...
SCORE    VAR     PVAL
-1.985   4.55635 0.352406
3.51032  46.3328 0.60606
0.5334   30.6023 0.923185
3.11494  40.5128 0.624567
-4.00135 42.2757 0.538287
...
\end{verbatim}
The first 6 columns are copied from the .bim file (in our example, "geno.bim"): the chromosome CHR, SNP name, genetic location cM, physical position POS, and alleles A1 and A2. Results are included in 5 columns: the sample size N, the allele frequency (AF) of A2 allele, the score statistic SCORE of A2 allele, the variance of the score VAR, and score test \textit{P} value PVAL.

If you use a GDS genotype file "geno.gds", here are the header and the first 5 rows of the example output "glmm.score.gds.testoutfile.txt" from \textbf{glmm.score}:
\begin{verbatim}
SNP     CHR     POS     REF     ALT     N      MISSRATE AF      
SNP1    1       1       T       A       393    0.0175   0.974554707379135
SNP2    1       2       A       C       400    0        0.5     
SNP3    1       3       C       A       400    0        0.7925  
SNP4    1       4       G       A       400    0        0.70125 
SNP5    1       5       A       G       400    0        0.59375 
...
SCORE                   VAR                   PVAL
-1.98499773964117       4.55635419833128      0.352406198841659
3.51031642022882        46.3327704279554      0.606059807621643
0.533400376138446       30.6022846771608      0.923185374786553
3.11494101139992        40.5127610067912      0.624566559783622
-4.00135050079485       42.2757210650549      0.538287231263494
...
\end{verbatim}
The first 5 columns are extracted from the GDS file: SNP ("annotation/id"), CHR ("chromosome"), POS ("position"), reference and alternate alleles ("allele"). Results are included in 6 columns: the sample size N (with non-missing genotypes), the genotype missing rate MISSRATE, the allele frequency (AF) of ALT allele, the score statistic SCORE of ALT allele, the variance of the score VAR, and score test \textit{P} value PVAL.

If you use a BGEN genotype file "geno.bgen", here are the header and the first 5 rows of the example output "glmm.score.bgen.testoutfile.txt" from \textbf{glmm.score}:
\begin{verbatim}
SNP     RSID    CHR     POS     A1      A2      N       AF      
SNP1    SNP1    1       1       T       A       393     0.974555        
SNP2    SNP2    1       2       A       C       400     0.5     
SNP3    SNP3    1       3       C       A       400     0.7925  
SNP4    SNP4    1       4       G       A       400     0.70125 
SNP5    SNP5    1       5       A       G       400     0.59375 
...
SCORE    VAR     PVAL
-1.985   4.55635 0.352406
3.51032  46.3328 0.60606
0.5334   30.6023 0.923185
3.11494  40.5128 0.624567
-4.00135 42.2757 0.538287
...
\end{verbatim}
The first 6 columns are copied from the BGEN file: the SNP, RSID, chromosome CHR, physical position POS, and alleles A1 and A2. Results are included in 5 columns: the sample size N, the allele frequency (AF) of A2 allele, the score statistic SCORE of A2 allele, the variance of the score VAR, and score test \textit{P} value PVAL.

The meta-analysis function \textbf{glmm.score.meta} generates a tab-delimited plain text output file. Here are the header and the first 5 rows of the example output from the meta-analysis "glmm.score.meta.testoutfile.txt":
\begin{verbatim}
SNP     A1     A2     N       AF      SCORE    VAR      PVAL
L14     A      C      10000   0.65895 21.5371  445.72   0.30766609304268
L25     A      C      10000   0.78425 14.3376  387.091  0.466163476535903
L7      A      C      20000   0.50435 33.1136  1019.122 0.299608382095623
L9      A      C      30000   0.39875 -33.0641 904.842  0.271687891048334
L35     A      C      10000   0.78425 14.3376  387.091  0.466163476535903
...
\end{verbatim}
The first 3 columns are set by the function \textbf{glmm.score.meta} to denote SNP name and alleles (your choice of either A1 or A2 as the effect allele). N is the total sample size, AF is the effect allele frequency, SCORE is the summary score statistic of the effect allele, VAR is the variance of the summary score statistic, and PVAL is the meta-analysis \textit{P} value.

In variant set tests \textbf{SMMAT}, if "meta.file.prefix" is specified, space-delimited intermediate files for single variant scores and binary intermediate files for covariance matrices will be generated. Here are the header and the first 5 rows of the example intermediate file "SMMAT.meta.score.1":
\begin{verbatim}
group chr pos ref alt N missrate altfreq 
Set1 1 1 T A 393 0.0175 0.974554707379135 
Set1 1 2 A C 400 0 0.5 
Set1 1 3 C A 400 0 0.7925 
Set1 1 4 G A 400 0 0.70125 
Set1 1 5 A G 400 0 0.59375 
...
SCORE VAR PVAL
-1.98499773963038 4.55635419833203 0.352406198844316
3.51031642023436 46.3327704279556 0.606059807621076
0.533400376147224 30.6022846771614 0.923185374785294
3.11494101140768 40.5127610067916 0.62456655978276
-4.00135050078827 42.2757210650552 0.538287231264163
...
\end{verbatim}
The first 5 columns are copied from the group definition file, indicating the variant set (group) id, variant chromosome, variant position, variant reference allele, variant alternate allele, respectively. Results are included in 6 columns: the sample size N (with non-missing genotypes), the genotype missing rate missrate, the alt allele frequency altfreq, the score statistic SCORE of alt allele, the variance of the score VAR, and single variant score test \textit{P} value PVAL.

\section{Advanced options}
\subsection{Alternative model fitting algorithms}
By default we use the Average Information REML algorithm \cite{Gilmour1995, Yang2011} to fit the GLMM in \textbf{glmmkin}, which is computationally efficient and recommended in most cases. However, there are also alternative model fitting algorithms:
\begin{verbatim}
method = "REML", method.optim = "Brent"
\end{verbatim}
It maximizes the restricted likelihood using the derivative-free Brent method \cite{Brent1973}, but only works when there is one matrix for the covariance structure of the random effects.
\begin{verbatim}
method = "ML", method.optim = "Brent"
\end{verbatim}
It maximizes the likelihood using the Brent method.
\begin{verbatim}
method = "REML", method.optim = "Nelder-Mead"
\end{verbatim}
It maximizes the restricted likelihood using the Nelder-Mead method \cite{Nelder1965}, however it is usually very slow in large samples.
\begin{verbatim}
method = "ML", method.optim = "Nelder-Mead"
\end{verbatim}
It maximizes the likelihood using the Nelder-Mead method.

Note that the default algorithm is
\begin{verbatim}
method = "REML", method.optim = "AI"
\end{verbatim}
A maximum likelihood version of Average Information algorithm is not available in \textbf{glmmkin}.

\subsection{Changing model fitting parameters}
By default we set the maximum number of iteration to 500 and tolerance to declare convergence to 1e-5:
\begin{verbatim}
maxiter = 500, tol = 1e-5
\end{verbatim}
These parameters can be changed. When using the Brent method for maximizing the likelihood (or restricted likelihood), we specify the search range of the ratio of the variance component parameter $\tau_1$ over the dispersion parameter $\phi$ to be between 1e-5 and 1e5, and we divide the search region evenly into 10 regions on the log scale:
\begin{verbatim}
taumin = 1e-5, taumax = 1e5, tauregion = 10
\end{verbatim}
These parameters can also be changed, but they are only effective when using the Brent method.

\subsection{Missing genotypes}
It is recommended to perform genotype quality control prior to analysis to impute missing genotypes or filter out SNPs with high missing rates. However, \textit{GMMAT} does allow missing genotypes, and imputes to the mean value by default. Alternatively, in \textbf{glmm.score} and \textbf{glmm.wald}, missing genotypes can be omitted from the analysis using
\begin{verbatim}
missing.method = "omit"
\end{verbatim}
In variant set tests using \textbf{SMMAT}, instead of imputing missing genotypes to the mean value, you can impute missing genotypes to 0 (homozygous reference allele) using
\begin{verbatim}
missing.method = "impute2zero"
\end{verbatim}

If using a plain text (or compressed .gz and .bz2) genotype file, missing genotypes should be coded as "NA". If you have missing genotypes coded in a different way, you can specify this in the argument "infile.na".

\subsection{Reordered genotypes}
The genotype file (either a plain text file, a \href{http://zzz.bwh.harvard.edu/plink/data.shtml#bed}{PLINK binary PED file}, or a GDS file) can include more individuals than in the phenotype and covariates data frame, and they can be in different orders. \textit{GMMAT} handles this issue using an argument "select" in both \textbf{glmm.score} and \textbf{glmm.wald}. For example, if the order of individuals in your genotype file is A, B, C, D, but you only have 3 unique individuals (with order C, A, B) in the fitted "obj" (for \textbf{glmm.score}) or in the data frame "data" (for \textbf{glmm.wald}), then you can specify
\begin{verbatim}
select = c(2, 3, 1, 0)
\end{verbatim}
to reflect the order of individuals. Note that since individual D is not included, its order is assigned to 0. The length of the vector must match the number of individuals in your genotype file. Also note that if there are observations with missing phenotype/covariates in "data", "select" for \textbf{glmm.wald} should match to "data" before removing any missing values, while "select" for \textbf{glmm.score} should match to "obj" (in which missing values have been excluded).

In variant set tests, \textbf{SMMAT} will extract ID from "null.obj" using "id\_include" returned in the \textbf{glmmkin} fitted null model object. The ID will be matched to "sample.id" in the GDS genotype file.

\subsection{Parallel computing}
Parallel computing can be enabled in \textbf{glmm.score} and \textbf{SMMAT} using the argument "ncores" to specify how many cores you would like to use on a computing node. By default "ncores" is 1, meaning that these functions will run in a single thread. Currently parallel computing is only implemented for GDS format genotype files.

If you enable parallel computing and save intermediate files, you will get multiple sets of intermediate files. For example, if your "ncores" is 12 and you specified "meta.file.prefix" to "study1", then you will get 12 sets of (totaling 24) intermediate files "study1.score.1", "study1.var.1", "study1.score.2", "study1.var.2", ..., "study1.score.12", "study1.var.12". Later in the meta-analysis to combine with 2 sets of intermediate files "study2.score.1", "study2.var.1", "study2.score.2", "study2.var.2", you will need to use
\begin{verbatim}
meta.files.prefix = c("study1", "study2"), n.files = c(12, 2)
\end{verbatim}

If your R is configured with Intel MKL and you would like to enable parallel computing, it is recommended that you set the environmental variable "MKL\_NUM\_THREADS" to 1 before running R to avoid hanging. Alternatively, you can do this at the beginning of your R script by using
<<MKL, eval=FALSE, echo=TRUE>>=
Sys.setenv(MKL_NUM_THREADS = 1)
@

\subsection{Variant filters}
Variants can be filtered in \textbf{glmm.score} and \textbf{SMMAT} based on minor allele frequency (MAF) and missing rate filters. The argument "MAF.range" specifies the minimum and maximum MAFs for a variant to be included in the analysis. By default the minimum MAF is $1\times 10^{-7}$ and the maximum MAF is 0.5, meaning that only monomorphic markers in the sample will be excluded (if your sample size is no more than 5 million). The argument "miss.cutoff" specifies the maximum missing rate for a variant to be included in the analysis. By default it is set to 1, meaning that no variants will be removed due to high genotype missing rates.

\subsection{Internal minor allele frequency weights}
Internal weights are calculated based on the minor allele frequency (NOT the effect allele frequency, therefore, variants with effect allele frequencies 0.01 and 0.99 have the same weights) as a beta probability density function. Internal weights are multiplied by the external weights given in the last column of the group definition file. To turn off internal weights, use
\begin{verbatim}
MAF.weights.beta = c(1, 1)
\end{verbatim}
to assign flat weights, as a beta distribution with parameters 1 and 1 is a uniform distribution on the interval between 0 and 1.

\subsection{Allele flipping}
In variant set tests \textbf{SMMAT}, by default the alt allele is used as the coding allele and variants in each variant set are matched strictly on chromosome, position, reference and alternate alleles.

The argument "auto.flip" allows automatic allele flipping if a specified variant is not found in the genotype file, but a variant at the same chromosome and position with reference allele matching the alternate allele in the group definition file "group.file", and alternate allele matching the reference allele in the group definition file "group.file", to be included in the analysis. Please use with caution for whole genome sequence data, as both ref/alt and alt/ref variants at the same position are not uncommon, and they are likely two different variants, rather than allele flipping.

The argument "use.minor.allele" allows using the minor allele instead of the alt allele as the coding allele in variant set tests. Note that this choice does not change "S" for SKAT results, but "B" for the burden test, "O" for SKAT-O and "E" for efficient hybrid test of the burden test and SKAT results will be affected. Generally the alt allele can either be the minor or the major allele. If in a variant set, different variants with alt allele frequencies 0.001 and 0.998 are combined together in a burden test, the results would be difficult to interpret. We generally recommend turning on the "use.minor.allele" option, unless you know the ancestry alleles explicitly and the specific scientific hypothesis clearly that you would like to test. Along with the MAF filter, this option is useful for combining rare mutations, assuming rare allele effects are in the same direction.

\subsection{\textit{P} values of weighted sum of chi-squares}
In variant set tests \textbf{SMMAT}, you can use 3 methods in the "method" argument to compute \textit{P} values of weighted sum of chi-square distributions: "davies" \cite{davies1980}, "kuonen" \cite{kuonen1999} and "liu" \cite{liu2009}. By default "davies" is used, if it returns an error message in the calculation, or a \textit{P} value greater than 1, or less than $1\times 10^{-5}$, "kuonen" method will be used. If "kuonen" method fails to compute the \textit{P} value, "liu" method will be used.

\subsection{Heterogeneous genetic effects in variant set meta-analysis}
Heterogeneous genetic effects \cite{lee2013} are allowed in variant set tests meta-analysis function \textbf{SMMAT.meta}, by specifying groups using the "cohort.group.idx" argument. By default all studies are assumed to share the same genetic effects in the meta-analysis, and this can be changed by assigning different group indices to studies. For example,
\begin{verbatim}
cohort.group.idx = c("a","b","a","a","b")
\end{verbatim}
means cohorts 1, 3, 4 are assumed to have homogeneous genetic effects, and cohorts 2, 5 are in another group with homogeneous genetic effects (but possibly heterogeneous with group "a").

\subsection{Other options}
By default, genotypes are centered to the mean before the analysis in single variant tests. You can turn this feature off by specifying
\begin{verbatim}
center = FALSE
\end{verbatim}
in both \textbf{glmm.score} and \textbf{glmm.wald} functions to use raw genotypes.

If your genotype file is a plain text (or a compressed .gz and .bz2 file), and you want to read in fewer lines than all lines included in the file, you can use the "infile.nrow" argument to specify how many lines (including lines to be skipped using "infile.nrow.skip") you want to read in. By default the delimiter is assumed to be a tab, but you can change it using the "infile.sep" argument. These options are implemented in \textbf{glmm.score} and \textbf{glmm.wald}.

In the score test function \textbf{glmm.score}, by default 100 SNPs are tested in a batch. You can change it using the "nperbatch" argument, but the computational time can increase substantially if it is either too small or too large, depending on the performance of your computing system.

If you perform Wald tests \textbf{glmm.wald} and use a plain text (or a compressed .gz and .bz2) file, and your SNPs are not in your first column, you can change "snp.col" in \textbf{glmm.wald} to indicate which column is your SNP name.

In the variant set tests \textbf{SMMAT}, by default the group definition file "group.file" should be tab delimited, but you can change it using the "group.file.sep" argument. Also there is a "Garbage.Collection" argument (default FALSE), if turned on, \textbf{SMMAT} will call the function \textbf{gc} for each variant set tested. It helps save memory footprint, but the computation speed might be slower.

\section{Version}
\subsection{Version 0.6 (October 12, 2015)}
Initial public release of \textit{GMMAT}.

\subsection{Version 0.7 (January 22, 2016)}
\begin{enumerate}
\item Merged old functions \textbf{glmm.score.text} and \textbf{glmm.score.bed} to \textbf{glmm.score}.
\item Merged old functions \textbf{glmm.wald.text} and \textbf{glmm.wald.bed} to \textbf{glmm.wald}.
\item \textbf{glmm.score} now takes "obj", a glmmkin class object returned from \textbf{glmmkin} to set up the score test, instead of "res" and "P" from old functions \textbf{glmm.score.text} and \textbf{glmm.score.bed}.
\item Implemented model fitting with fixed variance components in \textbf{glmmkin}, and model refitting when variance component estimates are on the boundary of the parameter space, and default unconverged Average Information REML to derivative-free Brent method (one variance component parameter) or Nelder-Mead method (more than one variance component parameters).
\item Implemented offset in \textbf{glmmkin}.
\item Renamed alpha, eta, mu to coefficients, linear.predictors, fitted.values in \textbf{glmmkin} returned object.
\item Implemented score test meta-analysis function \textbf{glmm.score.meta}.
\item Fixed minor bugs in \textbf{glmm.wald} to handle errors in fitting each alternative GLMM, and minor bugs in fitting alternative GLMMs using derivative-free algorithms.
\end{enumerate}

\subsection{Version 0.7-1 (January 22, 2016)}
Light version of v0.7: same features as v0.7 except that this light version does not depend on the C++ library boost and cannot take compressed plain text files .gz and .bz2 as genotype files.

\subsection{Version 0.9 (March 9, 2018)}
\begin{enumerate}
\item Added "id\_include" to \textbf{glmmkin} returned object to indicate which rows in the data have nonmissing outcome/covariates and are included in the model fit, which is useful to create
<<select, eval=FALSE, echo=TRUE>>=
select <- match(geno_ID, pheno_ID[obj$id_include])
select[is.na(select)] <- 0
@
that can be used as the "select" argument in \textbf{glmm.score}.
\item Removed memory duplicates for big matrices in C++ and R code.
\item Support for GDS format genotype files implemented in functions \textbf{glmm.score} and \textbf{glmm.wald}, with optional parallel computing.
\item MAF.range and miss.cutoff implemented in \textbf{glmm.score} for minor allele frequency and missing rate filters.
\item Implemented variant set tests \textbf{glmm.rvtests}, including burden test, SKAT, SKAT-O and SMMAT, with optional parallel computing.
\item Implemented variant set re-analysis/meta-analysis function \textbf{glmm.rvtests.meta}.
\item Implemented heteroscedastic linear mixed models in \textbf{glmmkin} and \textbf{glmm.wald} by specifying "groups".
\end{enumerate}

\subsection{Version 0.9.1 (May 13, 2018)}
\begin{enumerate}
\item In variant set tests \textbf{glmm.rvtests} and \textbf{glmm.rvtests.meta}, tests "Burden", "SKAT", "SKAT-O" and "SMMAT" changed to "B", "S", "O" and "E", respectively, to denote 4 variant set tests in the SMMAT framework: the burden test, SKAT, SKAT-O and the efficient hybrid test of the burden test and SKAT.
\item Implemented known weights (e.g. binomial denominator) in \textbf{glmmkin} and \textbf{glmm.wald} by passing the argument "weights" to the \textbf{glm} object "fixed"
\item Implemented internal MAF-based weights, using the argument "MAF.weights.beta" to denote the two beta probability density function parameters. Note that the weights are calculated on the minor allele frequency (not the effect allele frequency). Internal weights are multiplied by the external weights given in the last column of the group definition file.
\end{enumerate}

\subsection{Version 0.9.2 (June 8, 2018)}
\begin{enumerate}
\item In variant set tests \textbf{glmm.rvtests} and \textbf{glmm.rvtests.meta}, test "O" (SKAT-O in the SMMAT framework) p-value switched to minimum p-value multipled by the number of points on the search grid if the integration result is larger than the latter (consistent with implementation in the SKAT package).
\end{enumerate}

\subsection{Version 0.9.3 (July 18, 2018)}
\begin{enumerate}
\item In the null model fitting function \textbf{glmmkin}, prior weights (e.g. binomial denominators for binomial distributions) coerced to a vector.
\item In the null model fitting function \textbf{glmmkin}, model matrix X for fixed effects now included in the returned object.
\end{enumerate}

\subsection{Version 1.0.0 (December 28, 2018)}
\begin{enumerate}
\item Implemented ID matching for the phenotype data frame and relatedness matrices in \textbf{glmmkin} and \textbf{glmm.wald}. The argument "id" is required for a column indicating ID in the phenotype data frame, and the relatedness matrices must have rownames and colnames, and they must at least include all samples as specified in the "id" column of the phenotype data frame "data".
\item Changed the definition of "id\_include" in \textbf{glmmkin} returned object to indicate the original "id" of observations in the data that have nonmissing outcome/covariates and are included in the model fit, which is useful to create
<<select2, eval=FALSE, echo=TRUE>>=
select <- match(geno_ID, unique(obj$id_include))
select[is.na(select)] <- 0
@
that can be used as the "select" argument in \textbf{glmm.score} to match the order of individuals in the plain text (or a compressed .gz and .bz2) genotype file "infile" (assuming "geno\_ID" is a vector of the ID's for "infile"). Note that this is not necessary if the genotype file is a \href{http://zzz.bwh.harvard.edu/plink/data.shtml#bed}{PLINK binary PED file}, or a GDS file (in these cases, "select" is NULL by default, and the genotype ID information will be extracted automatically and matched to "id\_include" in \textbf{glmmkin} returned object).
\item Implemented genotype ID matching in \textbf{glmm.wald} if the genotype file "infile" is a \href{http://zzz.bwh.harvard.edu/plink/data.shtml#bed}{PLINK binary PED file}, or a GDS file. Similarly to \textbf{glmm.score}, "select" is NULL by default and the genotype ID information will be extracted automatically and matched to unique "id" in the phenotype data frame "data" (but before removing any missing outcome/covariates). Note that it is usually necessary to create
<<select3, eval=FALSE, echo=TRUE>>=
select <- match(geno_ID, unique(data[, id]))
select[is.na(select)] <- 0
@
that can be used as the "select" argument in \textbf{glmm.wald} to match the order of individuals in the plain text (or a compressed .gz and .bz2) genotype file "infile" (assuming "geno\_ID" is a vector of the ID's for "infile"). Otherwise, it is assumed that the genotype ID is equal to unique "id" in the phenotype data frame "data".
\item Renamed variant set test functions in the SMMAT framework \textbf{glmm.rvtests} and \textbf{glmm.rvtests.meta} to \textbf{SMMAT} and \textbf{SMMAT.meta}, respectively.
\item Set the default of the argument "kins" to NULL, assuming individuals are unrelated.
\item Implemented model fitting for longitudinal data in \textbf{glmmkin} and \textbf{glmm.wald}. Duplicated "id" in the phenotype data frame "data" are allowed and assumed to be longitudinal data (exchangeable repeated measures, or observations from multiple time points with time trends).
\end{enumerate}

\subsection{Version 1.0.1 (January 1, 2019)}
\begin{enumerate}
\item Added "LinkingTo: BH" to improve portability with boost library headers.
\item Added "Suggests: testthat" to conduct code tests.
\item Made minor code changes to improve portability to the Windows operating system.
\end{enumerate}

\subsection{Version 1.0.2 (January 2, 2019)}
\begin{enumerate}
\item Made minor C++ code changes to improve portability to non-GCC compilers.
\item Added \textbf{Sys.info} checks in \textbf{glmm.score} and \textbf{SMMAT}. If "ncores" is greater than 1 on a Windows operating system, it will be switched back to 1 to use a single thread for genotype files in the GDS format.
\end{enumerate}

\subsection{Version 1.0.3 (January 14, 2019)}
\begin{enumerate}
\item Removed dependency on boost in reading .gz and .bz2 genotype files (for Windows portability).
\item Added dependency on zlib and bzip2 libraries.
\end{enumerate}

\subsection{Version 1.0.4 (March 6, 2019)}
\begin{enumerate}
\item Minor code change in the tests directory as the default method for generating from a discrete uniform distribution used in sample() has been changed in R-devel (3.6.0).
\end{enumerate}

\subsection{Version 1.1.0 (May 21, 2019)}
\begin{enumerate}
\item Implemented support for sparse matrices in the argument "kins" of \textbf{glmmkin}.
\item Fixed a bug in \textbf{glmm.wald} when the argument "missing.method" is "omit".
\item Fixed a bug in \textbf{SMMAT} and \textbf{SMMAT.meta} when reading group definition files with T as all reference or alternate alleles (force T to be character instead of logical).
\end{enumerate}

\subsection{Version 1.1.1 (August 26, 2019)}
\begin{enumerate}
\item Fixed a bug in \textbf{glmm.wald} when the argument "fixed" is a long formula.
\item Fixed a bug in \textbf{glmmkin} and \textbf{glmm.wald} when the indices are already ordered in subsetting ddiMatrix for longitudinal data analysis.
\end{enumerate}

\subsection{Version 1.1.2 (October 11, 2019)}
\begin{enumerate}
\item Fixed a bug in \textbf{glmmkin} when fitting a model with large dense matrices in "kins".
\item Fixed a bug in \textbf{glmm.score} and \textbf{glmm.wald} for analyzing plain text genotype files on Mac OS.
\end{enumerate}

\subsection{Version 1.2.0 (May 14, 2020)}
\begin{enumerate}
\item Implemented multiple phenotype analysis in \textbf{glmmkin}, \textbf{glmm.score} and \textbf{SMMAT}.
\item Added names to theta, coefficients, cov in \textbf{glmmkin} results.
\item Changed the logical expression in \textbf{glmmkin} and \textbf{glmm.wald} for testing if "kins" is a matrix (matrix objects now also inherit from class "array").
\item Fixed a bug in \textbf{glmmkin} and \textbf{glmm.wald} when passing \dots arguments.
\item Changed missing values to NA in \textbf{glmm.score} output for the GDS genotype format.
\item Supported reordered group definition files in \textbf{SMMAT.meta}, as long as chr:pos is a unique variant identifier (thanks to Arthur Gilly).
\item Supported imputed dosage GDS files (in the node annotation/format/DS/data) (thanks to Rounak Dey).
\end{enumerate}

\subsection{Version 1.3.0 (September 4, 2020)}
\begin{enumerate}
\item Implemented a temporary fix that explicitly converts data in tibble format to data.frame objects in \textbf{glmmkin} and \textbf{glmm.wald}.
\item Supported BGEN genotype files in \textbf{glmm.score} (credit: Duy T. Pham).
\end{enumerate}

\subsection{Version 1.3.1 (September 14, 2020)}
\begin{enumerate}
\item Fixed heap-buffer-overflow error in clang address sanitizer check and unknown-crash error in gcc address sanitizer check for BGEN genotype files (credit: Duy T. Pham).
\item Implemented progress bars for \textbf{glmm.score}, \textbf{glmm.wald}, \textbf{SMMAT} and \textbf{SMMAT.meta} in the verbose mode.
\item Implemented \textbf{SMMAT.prep} and \textbf{SMMAT.lowmem}, the two-step low-memory version of \textbf{SMMAT}. \textbf{SMMAT.lowmem} takes the returned R object from \textbf{SMMAT.prep} and uses less memory (if the returned R object from \textbf{SMMAT.prep} is saved to an R data file, the R session is terminated, and this R object is loaded into a new R session for running \textbf{SMMAT.lowmem}), especially when "group.file" contains only a subset of variants from "geno.file".
\end{enumerate}

\subsection{Version 1.3.2 (July 15, 2021)}
\begin{enumerate}
\item Supported SeqVarGDSClass objects in \textbf{glmm.score} and \textbf{SMMAT} (credit: Duy T. Pham).
\item Updated variant matching between "group.file" and "geno.file" in \textbf{SMMAT} for duplicates (credit: Duy T. Pham).
\item Fixed a minor bug in \textbf{.pKuonen} (credit: Duy T. Pham).
\item Fixed a minor bug in printing duplicate variant.id in \textbf{SMMAT}.
\item Used a low-memory sparse design matrix J for longitudinal data analysis in \textbf{glmm.score} and \textbf{SMMAT}.
\end{enumerate}

\subsection{Version 1.4.0 (April 12, 2023)}
\begin{enumerate}
\item Check for system copies of zstd and libdeflate libraries.
\item Implemented better error control in the internal function .Q\_pval (credit: Arthur Gilly).
\item Fixed minor bugs in \textbf{SMMAT}.
\item Fixed minor bugs for MISSRATE and AF calculation for imputed gds files in \textbf{glmm.score}.
\item Fixed a minor bug in reading the ID column for bgen sample files in \textbf{glmm.score}.
\item Replaced \textbf{read.table} by the more efficient function \textbf{data.table::fread}.
\item Replaced \textbf{class} checks using \textbf{inherits}.
\item Removed \textbf{context} in testthat tests.
\end{enumerate}

\subsection{Version 1.4.1 (October 5, 2023)}
\begin{enumerate}
\item Random slope model implemented for cross-sectional data with related individuals.
\item Better memory usage and improved computational efficiency for longitudinal data in \textbf{glmmkin} and \textbf{glmm.wald} (thanks to: Simon Wiegrebe).
\item Bioconductor packages \textbf{SeqArray} and \textbf{SeqVarTools} moved to Suggests.
\end{enumerate}

\subsection{Version 1.4.2 (November 17, 2023)}
\begin{enumerate}
\item Removed the "usernames" option for xcolor.sty.
\end{enumerate}

\section{Contact}
Please refer to the R help document of \textit{GMMAT} for specific questions about each function. For comments, suggestions, bug reports and questions, please contact Han Chen (Han.Chen.2@uth.tmc.edu). For bug reports, please include an example to reproduce the problem without having to access your confidential data.

\section{Acknowledgments}
Duy T. Pham implemented support for BGEN genotype files and SeqVarGDSClass objects. We would like to thank Dr. Chaolong Wang and Dr. Brian Cade for comments and suggestions on \textit{GMMAT} and the user manual. We would also like to thank Dr. Matthew Conomos for help with the Average Information REML algorithm, Dr. Stephanie Gogarten for help with the GDS genotype format, Dr. Ken Rice for help with sparse matrices, Jennifer Brody for help with parallel computing and App development in Analysis Commons, a cloud computing platform, Arthur Gilly for supporting reordered group definition files in \textbf{SMMAT.meta}, and Dr. Rounak Dey for supporting imputed dosage GDS files. The \textit{GMMAT} implementation is supported by NIH grant R00 HL130593, and the analysis pipeline implementation (the gmmat App) in Analysis Commons is supported by NIH grant U01 HL120393.

\begin{thebibliography}{}

\bibitem{Breslow1993} Breslow, N. E. and Clayton, D. G. Approximate inference in generalized linear mixed models. \textit{Journal of the American Statistical Association} \textbf{88,} 9--25 (1993).

\bibitem{Morgenthaler2007} Morgenthaler, S. and Thilly, W. G. A strategy to discover genes that carry multi-allelic or mono-allelic risk for common diseases: A cohort allelic sums test (CAST). \textit{Mutation Research} \textbf{615,} 28--56 (2007).

\bibitem{Li2008} Li, B. and Leal, S. M. Methods for detecting associations with rare variants for common diseases: Application to analysis of sequence data. \textit{The American Journal of Human Genetics} \textbf{83,} 311--321 (2008).

\bibitem{Madsen2009} Madsen, B. E. and Browning, S. R. A groupwise association test for rare mutations using a weighted sum statistic. \textit{PLOS Genetics} \textbf{5,} e1000384 (2009).

\bibitem{Morris2010} Morris, A. P. and Zeggini, E. An evaluation of statistical approaches to rare variant analysis in genetic association studies. \textit{Genetic Epidemiology} \textbf{34,} 188--193 (2010).

\bibitem{Wu2011} Wu, M. C., Lee, S., Cai, T., Li, Y., Boehnke, M. and Lin, X. Rare-variant association testing for sequencing data with the sequence kernel association test. \textit{The American Journal of Human Genetics} \textbf{89,} 82--93 (2011).

\bibitem{Lee2012} Lee, S., Wu, M. C. and Lin, X. Optimal tests for rare variant effects in sequencing association studies. \textit{Biostatistics} \textbf{13,} 762--775 (2012).

\bibitem{Chen2019} Chen, H., Huffman, J. E., Brody, J. A., Wang, C., Lee, S., Li, Z., Gogarten, S. M., Sofer, T., Bielak, L. F., Bis, J. C., \textit{et al.} Efficient variant set mixed model association tests for continuous and binary traits in large-scale whole-genome sequencing studies. \textit{The American Journal of Human Genetics} \textbf{104,} 260--274 (2019).

\bibitem{Chen2016} Chen, H., Wang, C., Conomos, M. P., Stilp, A. M., Li, Z., Sofer, T., Szpiro, A. A., Chen, W., Brehm, J. M., Celed\'{o}n, J. C., Redline, S., Papanicolaou, G. J., Thornton, T. A., Laurie, C. C., Rice, K. and Lin, X. Control for Population Structure and Relatedness for Binary Traits in Genetic Association Studies via Logistic Mixed Models. \textit{The American Journal of Human Genetics} \textbf{98,} 653--666 (2016).

\bibitem{Brody2017} Brody, J. A., Morrison, A. C., Bis, J. C., O'Connell, J. R., Brown, M. R., Huffman, J. E., Ames, D. C., Carroll, A., Conomos, M. P., Gabriel, S., Gibbs, R. A., Gogarten, S. M., Gupta, N., Jaquish, C. E., Johnson, A. D., Lewis, J. P., Liu, X., Manning, A. K., Papanicolaou, G. J., Pitsillides, A. N., Rice, K. M., Salerno, W., Sitlani, C. M., Smith, N. L., NHLBI Trans-Omics for Precision Medicine (TOPMed) Consortium, The Cohorts for Heart and Aging Research in Genomic Epidemiology (CHARGE) Consortium, TOPMed Hematology and Hemostasis Working Group, CHARGE Analysis and Bioinformatics Working Group, Heckbert, S. R., Laurie, C. C., Mitchell, B. D., Vasan, R. S., Rich, S. S., Rotter, J. I., Wilson, J. G., Boerwinkle, E., Psaty, B. M. and Cupples, L. A. Analysis commons, a team approach to discovery in a big-data environment for genetic epidemiology. \textit{Nature Genetics} \textbf{49,} 1560--1563 (2017).

\bibitem{Zhou2012} Zhou, X. and Stephens, M. Genome-wide efficient mixed-model analysis for association studies. \textit{Nature Genetics} \textbf{44,} 821--824 (2012).

\bibitem{Conomos2016} Conomos, M. P., Laurie, C. A., Stilp, A. M., Gogarten, S. M., McHugh, C. P., Nelson, S. C., Sofer, T., Fern\'{a}ndez-Rhodes, L., Justice, A. E., Graff, M., Young, K. L., Seyerle, A. A., Avery, C. L., Taylor, K. D., Rotter, J. I., Talavera, G. A., Daviglus, M. L., Wassertheil-Smoller, S., Schneiderman, N., Heiss, G., Kaplan, R. C., Franceschini, N., Reiner, A. P., Shaffer, J. R., Barr, R. G., Kerr, K. F., Browning, S. R., Browning, B. L., Weir, B. S., Avil\'{e}s-Santa, M. L., Papanicolaou, G. J., Lumley, T., Szpiro, A. A., North, K. E., Rice, K., Thornton, T. A. and Laurie, C. C. Genetic Diversity and Association Studies in US Hispanic/Latino Populations: Applications in the Hispanic Community Health Study/Study of Latinos. \textit{The American Journal of Human Genetics} \textbf{98,} 165--184 (2016).

\bibitem{Gilmour1995} Gilmour, A. R., Thompson, R. and Cullis, B. R. Average information REML: an efficient algorithm for variance parameter estimation in linear mixed models. \textit{Biometrics} \textbf{51,} 1440--1450 (1995).

\bibitem{Yang2011} Yang, J., Lee, S. H., Goddard, M. E. and Visscher, P. M. GCTA: a tool for genome-wide complex trait analysis. \textit{The American Journal of Human Genetics} \textbf{88,} 76--82 (2011).

\bibitem{Brent1973} Brent, R. P. Chapter 4: An Algorithm with Guaranteed Convergence for Finding a Zero of a Function, Algorithms for
Minimization without Derivatives, Englewood Cliffs, NJ: Prentice-Hall, ISBN 0-13-022335-2 (1973).

\bibitem{Nelder1965} Nelder, J. A. and Mead, R. A simplex algorithm for function minimization. \textit{Computer Journal} \textbf{7,} 308--313 (1965).

\bibitem{davies1980} Davies, R. B. Algorithm AS 155: The Distribution of a Linear Combination of $\chi^2$ Random Variables. \textit{Journal of the Royal Statistical Society. Series C (Applied Statistics)} \textbf{29,} 323--333 (1980).

\bibitem{kuonen1999} Kuonen, D. Saddlepoint Approximations for Distributions of Quadratic Forms in Normal Variables. \textit{Biometrika} \textbf{86,} 929--935 (1999).

\bibitem{liu2009} Liu, H., Tang, Y. and Zhang, H. H. A new chi-square approximation to the distribution of non-negative definite quadratic forms in non-central normal variables. \textit{Computational Statistics \& Data Analysis} \textbf{53,} 853--856 (2009).

\bibitem{lee2013} Lee, S., Teslovich, T. M., Boehnke, M. and Lin, X. General Framework for Meta-analysis of Rare Variants in Sequencing Association Studies. \textit{The American Journal of Human Genetics} \textbf{93,} 42--53 (2013).

\end{thebibliography}

\end{document}