For set-based association tests, the snpsettest package employs the statistical model described in VEGAS (versatile gene-based association study) [1], which takes as input variant-level p values and reference linkage disequilibrium (LD) data. Briefly, the test statistic is defined as the sum of squared variant-level Z-statistics. Let $z_i$, $i = 1, \dots, p$, denote the Z scores of the individual SNPs within a set $s$. The test statistic $Q_s$ is
$$
Q_s = \sum_{i=1}^{p} z_i^2
$$
Here, $Z = (z_1, \dots, z_p)'$ follows a multivariate normal distribution with mean vector $\mu$ and covariance matrix $\Sigma$, where $\Sigma$ represents the LD among SNPs. To test a set-level association, we need to evaluate the distribution of $Q_s$. VEGAS uses Monte Carlo simulations to approximate the distribution of $Q_s$ (directly simulating $Z$ from the multivariate normal distribution) and thus computes a set-level p value. However, its use is hampered in practice when set-based p values are very small, because the number of simulations required to obtain such p values is very large. The snpsettest package uses a different approach to evaluate the distribution of $Q_s$ more efficiently.
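The Monte Carlo idea described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the VEGAS implementation: the helper names `cholesky` and `mc_set_pvalue`, the two-SNP LD matrix, and the Z scores are all hypothetical, and the Cholesky factor is used as one valid choice of $A$ with $\Sigma = AA'$.

```python
import math
import random

def cholesky(sigma):
    """Return lower-triangular L with L L' = sigma (sigma must be positive definite)."""
    p = len(sigma)
    L = [[0.0] * p for _ in range(p)]
    for i in range(p):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(sigma[i][i] - s)
            else:
                L[i][j] = (sigma[i][j] - s) / L[j][j]
    return L

def mc_set_pvalue(z_obs, sigma, n_sim=20000, seed=1):
    """Empirical set-level p value: share of null draws with Q >= observed Q."""
    rng = random.Random(seed)
    L = cholesky(sigma)
    p = len(z_obs)
    q_obs = sum(z * z for z in z_obs)
    exceed = 0
    for _ in range(n_sim):
        # Draw Z ~ N(0, sigma) as L @ e with e ~ N(0, I_p)
        e = [rng.gauss(0.0, 1.0) for _ in range(p)]
        z = [sum(L[i][k] * e[k] for k in range(i + 1)) for i in range(p)]
        if sum(v * v for v in z) >= q_obs:
            exceed += 1
    # Add-one correction so the estimate is never exactly zero
    return (exceed + 1) / (n_sim + 1)

# Hypothetical example: two SNPs with LD correlation 0.5
sigma = [[1.0, 0.5], [0.5, 1.0]]
print(mc_set_pvalue([2.0, 1.5], sigma))
```

With this estimator the smallest attainable p value is 1 / (n_sim + 1), so resolving a p value near 1e-8 would take on the order of 1e8 simulated vectors. That is the practical limitation noted above.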
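Let $Y = \Sigma^{-1/2} Z$ (instead of $\Sigma^{-1/2}$, we could use any decomposition that satisfies $\Sigma = AA'$ with a $p \times p$ non-singular matrix $A$ such that $Y = A^{-1} Z$). Then,
$$
\begin{aligned}
E(Y) &= \Sigma^{-1/2} \mu \\
Var(Y) &= \Sigma^{-1/2} \Sigma \Sigma^{-1/2} = I_p \\
Y &\sim N(\Sigma^{-1/2} \mu,\ I_p)
\end{aligned}
$$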
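Now, we define $U = \Sigma^{-1/2} (Z - \mu)$ so that
$$
U \sim N(0, I_p), \quad U = Y - \Sigma^{-1/2} \mu
$$
and express the test statistic $Q_s$ as a quadratic form:
$$
Q_s = \sum_{i=1}^{p} z_i^2 = Z' I_p Z = Y' \Sigma^{1/2} I_p \Sigma^{1/2} Y = (U + \Sigma^{-1/2} \mu)' \Sigma (U + \Sigma^{-1/2} \mu)
$$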
By the spectral theorem, $\Sigma$ can be decomposed as follows:
$$
\Sigma = P \Lambda P', \quad \Lambda = \mathrm{diag}(\lambda_1, \dots, \lambda_p), \quad P'P = PP' = I_p
$$
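where $P$ is an orthogonal matrix. If we set $X = P'U$, then $X$ is a vector of independent standard normal variables, $X \sim N(0, I_p)$, since
$$
E(X) = P' E(U) = 0, \quad Var(X) = P' Var(U) P = P' I_p P = I_p
$$
Substituting the decomposition into the quadratic form gives
$$
\begin{aligned}
Q_s &= (U + \Sigma^{-1/2} \mu)' \Sigma (U + \Sigma^{-1/2} \mu) \\
&= (U + \Sigma^{-1/2} \mu)' P \Lambda P' (U + \Sigma^{-1/2} \mu) \\
&= (X + P' \Sigma^{-1/2} \mu)' \Lambda (X + P' \Sigma^{-1/2} \mu)
\end{aligned}
$$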
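As a small worked illustration (an assumed toy case, not taken from the package), consider two SNPs with LD correlation $r$, where $|r| < 1$. The spectral decomposition is then available in closed form:

$$
\Sigma = \begin{pmatrix} 1 & r \\ r & 1 \end{pmatrix}, \quad
P = \frac{1}{\sqrt{2}} \begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix}, \quad
\Lambda = \mathrm{diag}(1 + r,\ 1 - r)
$$

One can check that $P \Lambda P' = \Sigma$ and $P'P = I_2$. When $\mu = 0$, the quadratic form for this pair reduces to $(1 + r) x_1^2 + (1 - r) x_2^2$, so as $|r|$ grows the weight concentrates on a single component.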
Under the null hypothesis, $\mu$ is assumed to be $0$. Hence,
$$
Q_s = X' \Lambda X = \sum_{i=1}^{p} \lambda_i x_i^2
$$
where $X = (x_1, \dots, x_p)'$. Thus, the null distribution of $Q_s$ is a linear combination of independent chi-square variables $x_i^2 \sim \chi^2_{(1)}$ (i.e., a central quadratic form in independent normal variables). To compute the tail probability for a scalar $q$,
$$
\Pr(Q_s > q),
$$
several methods have been proposed, such as numerical inversion of the characteristic function [2]. The snpsettest package uses the algorithm of Davies [3] or saddlepoint approximation [4] to obtain set-based p values.
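To make the idea of numerical inversion concrete, the sketch below evaluates the tail probability of a weighted sum of independent chi-square(1) variables with Imhof's inversion formula for the central case, integrated with a plain midpoint rule. This is a swapped-in illustration: it is neither Davies' algorithm nor the saddlepoint approximation, and it is not the package's code. The function name `weighted_chisq_sf`, the truncation point `upper`, and the step size `step` are hypothetical choices. The fixed-grid quadrature is accurate only to a few decimal places, so it is not suitable for very small p values.

```python
import math

def weighted_chisq_sf(q, lambdas, upper=1000.0, step=0.01):
    """Approximate Pr(sum_i lambdas[i] * x_i^2 > q), with x_i^2 ~ chi-square(1) independent.

    Uses Imhof's formula (central case):
        Pr(Q > q) = 1/2 + (1/pi) * integral_0^inf sin(theta(u)) / (u * rho(u)) du
        theta(u)  = 0.5 * sum_i atan(lambda_i * u) - 0.5 * q * u
        rho(u)    = prod_i (1 + lambda_i^2 * u^2)^(1/4)
    The integral is truncated at `upper` and evaluated by the midpoint rule.
    """
    n = int(upper / step)
    total = 0.0
    for k in range(n):
        u = (k + 0.5) * step  # midpoints avoid evaluating at u = 0
        theta = 0.5 * sum(math.atan(lam * u) for lam in lambdas) - 0.5 * q * u
        rho = 1.0
        for lam in lambdas:
            rho *= (1.0 + (lam * u) ** 2) ** 0.25
        total += math.sin(theta) / (u * rho)
    return 0.5 + total * step / math.pi

# Hypothetical example: two SNPs with LD correlation r, so the eigenvalues are 1 + r and 1 - r
r = 0.5
lambdas = [1.0 + r, 1.0 - r]
print(weighted_chisq_sf(6.25, lambdas))
```

A quick sanity check is the case of equal weights: with `lambdas = [1.0, 1.0]` the statistic is chi-square with 2 degrees of freedom, whose tail probability is exp(-q / 2).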
References
[1] Liu JZ, McRae AF, Nyholt DR, Medland SE, Wray NR, Brown KM, et al. A Versatile Gene-Based Test for Genome-wide Association Studies. Am J Hum Genet. 2010 Jul 9;87(1):139–45.
[2] Duchesne P, De Micheaux P. Computing the distribution of quadratic forms: Further comparisons between the Liu-Tang-Zhang approximation and exact methods. Comput Stat Data Anal. 2010;54:858–62.
[3] Davies RB. Algorithm AS 155: The Distribution of a Linear Combination of Chi-square Random Variables. J R Stat Soc Ser C Appl Stat. 1980;29(3):323–33.
[4] Kuonen D. Saddlepoint Approximations for Distributions of Quadratic Forms in Normal Variables. Biometrika. 1999;86(4):929–35.