---
title: "Example 4: Using SEAGLE for Chromosome-Wide Gene-Based Analysis"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{example4}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

This tutorial demonstrates how to use the `SEAGLE` package for chromosome-wide gene-based studies when the genotype data are from GWAS studies (.bed + .bim + .fam).  We recommend using PLINK1.9 available from [https://www.cog-genomics.org/plink/1.9/](https://www.cog-genomics.org/plink/1.9/) to generate a .raw file for each of the SNP sets.  For more information on .raw files, please refer to [https://www.cog-genomics.org/plink/2.0/formats#raw](https://www.cog-genomics.org/plink/2.0/formats#raw).  The .raw file records the allelic dosage for each SNP.

Examples files containing GWAS data (.bed, .bim, and .fam files) for all the genes in chromosome 22 and a corresponding gene list in `glist-hg19` can be downloaded via the `SEAGLE-example4-data.zip` file from [http://jocelynchi.com/SEAGLE/SEAGLE-example4-data.zip](http://jocelynchi.com/SEAGLE/SEAGLE-example4-data.zip).

To follow along with the PLINK conversion procedures, you will first need to install PLINK1.9 from [https://www.cog-genomics.org/plink/1.9/](https://www.cog-genomics.org/plink/1.9/) and unzip the example `SEAGLE-example4-data.zip` file.  Afterwards, you can type the PLINK commands below in a command prompt or terminal window in the folder where these files are located. This will produce a .raw file for each gene that you can read into R to obtain a relevant genetic marker matrix ${\bf G}$ that you can input into `SEAGLE`.

After unzipping the `SEAGLE-example4-data.zip` file, you will find two empty directories named `plink_bash` and `R_codes`.  The `plink_bash` directory is where we will write the PLINK1.9 commands that you will run to create the .raw files for each gene.  The `R_codes` directory is where we will write the R scripts for loading the resulting .raw files into R.

## Writing .raw files for each gene in chromosome 22 with PLINK1.9

We will begin by telling R the location where you unzipped the `SEAGLE-example4-data.zip` file.  Then we will read in the gene list in `glist-hg19` and subset for the genes in chromosome 22.  If you unzipped the file to your Downloads directory on a Mac, your path might look like the following.

```
dir <- "/Users/user_name/Downloads/SEAGLE-example4-data/"
glist <- read.table(paste0(dir,"glist-hg19"), header = F)
glist_chr22 <- subset(glist, glist$V1 == 22)
```

Next, we will remove any duplicated genes for chromosome 22 in our gene list.

```
dup <- glist_chr22$V4[duplicated(glist_chr22$V4)] # find duplicated gene
dup
```

Notice that in this example, there are 4 duplicated genes given by the following.

```
# GSTT2: chr22:24,322,339-24,326,106(GRCh37/hg19 by Ensembl)
# GSTT2B: chr22:24,299,601-24,303,373(GRCh37/hg19 by Ensembl)
# RIMBP3B: chr22:21,738,040-21,743,458(GRCh37/hg19 by Entrez Gene)
# RIMBP3C: chr22:21,899,646-21,905,750(GRCh37/hg19 by Ensembl)
```

We will update our gene list by manually updating the positions of the duplicated genes.

```
glist_chr22 <- glist_chr22[!duplicated(glist_chr22$V4),]
glist_chr22[which(glist_chr22$V4 == "GSTT2"),2:3] <- c(24322339, 24326106)
glist_chr22[which(glist_chr22$V4 == "GSTT2B"),2:3] <- c(24299601,24303373)
glist_chr22[which(glist_chr22$V4 == "RIMBP3B"),2:3] <- c(21738040,21743458)
glist_chr22[which(glist_chr22$V4 == "RIMBP3C"),2:3] <- c(21899646,21905750)
write.table(glist_chr22, "glist-hg19_chr22_updated", col.names = F, row.names = F, quote = F)

# replace hyphen with underscore
glist_chr22$V4 <- sub("-", "_", glist_chr22$V4)
```

Now we will use PLINK1.9 to create a .raw file for each gene in chromosome 22, which includes all SNPs for a given gene.  The following code will produce a sequence of bash files to produce .raw files for each gene in `glist_chr22`.  It will also produce a corresponding sequence of R code snippets for reaching each .raw file into R.  Each bash file will run PLINK1.9 for `num_genes=50` genes at a time.

```
# of plink commands per job
num_genes <- 50

bash_plink <- NULL # bash command
command <- c("#! /bin/bash", "#SBATCH --job-name=plink") # plink command
codeR <- NULL # R code
for (i in 1:nrow(glist_chr22)) {
  command <- c(command, paste0("plink --bfile 1kg_phase1_chr22 --make-set
  glist-hg19_chr22_updated --gene ", glist_chr22$V4[i]," --out ",
  glist_chr22$V4[i]," --export A") )

  # write R codes for reading .raw files, one R file for one gene
  codeR <- paste0(glist_chr22$V4[i], " <- read.delim(\"", glist_chr22$V4[i],".raw\", sep=\" \")")
  fileConn<-file(paste0(dir,"R_codes/", glist_chr22$V4[i],".R"))
  writeLines(codeR, fileConn)
  close(fileConn)

  if(i %% num_genes == 0 | i==nrow(glist_chr22)){
    # write plink commands in bash file
    fileConn<-file(paste0(dir,"plink_bash/plink_job",i-num_genes+1,"-",i,".sh"))
    writeLines(command, fileConn)
    close(fileConn)

    command <- c("#! /bin/bash", "#SBATCH --job-name=plink")
    bash_plink <- c(bash_plink, paste0("sbatch plink_job",i-num_genes+1,"-",i,".sh"))
  }
}

# Write bash commands to a text file
fileConn<-file(paste0("bash_plink.txt"))
writeLines(bash_plink, fileConn)
close(fileConn)
```

Note that if you want to run the .sh files in the `plink_bash` directory on your local computer, you can employ the `bash` command instead of `SBATCH` in terminal or command prompt.  Additionally, if you are using Linux, you may need to specify `plink1.9` rather than just `plink`.

Running the above code will setup the code and files to produce a single .raw file for each gene when running the commands in the `bash_plink.txt` file.  The .raw file for each gene will contain all the SNPs for that gene.

## Loading .raw files into R and extracting the genetic marker matrix ${\bf G}$

Now that we have data for each gene in the .raw format, we can load the .raw files sequentially into R  for input into `SEAGLE` as follows.  We'll first load the `SEAGLE` package.

```{r setup}
library(SEAGLE)
```

The following code shows how to test for the GxE interaction effect using `SEAGLE` on these genes.  To illustrate how to use SEAGLE, we will generate synthetic phenotype and covariate data for $n=1092$ study participants.

As an example, we've included .raw files for the following subset of genes from chromosome 22 in the `extdata` folder of this package: ACR, APOBEC3A, APOBEC3C, ARSA, ATF4, ATP5L2, BCRP2, BMS1P17, and BMS1P18.  We've also included a corresponding gene list in `glist-hg19_chr22_example`.  In practice, you will want to specify your own `dir` for the directory where you've stored your .raw files.

```{r}
# Read in gene list
dir <- "../inst/extdata/"
genelist <- read.delim(paste0(dir, "glist-hg19_chr22_example"), sep=" ", header=FALSE)

# Identify number of genes in genelist
num_genes <- dim(genelist)[1]

# Generate synthetic phenotype and covariate data
n <- 1092 # number of study participants

set.seed(1)
y <- 2 * rnorm(n)

set.seed(2)
X <- as.matrix(rnorm(n))

set.seed(3)
E <- as.matrix(rnorm(n))
```

Now that we have ${\bf y}$, ${\bf X}$, ${\bf E}$, and ${\bf G}$, we can run `SEAGLE` on each gene in `genelist`.  We will first perform data checking procedures on ${\bf G}$.  Then we will input ${\bf y}$, ${\bf X}$, ${\bf E}$, and ${\bf G}$ for each gene into the `prep.SEAGLE` function.  The `intercept = 0` parameter indicates that the first column of ${\bf X}$ is not the all ones vector for the intercept.

The preparation procedure formats the input data for the `SEAGLE` function by checking the dimensions of the input data.  It also pre-computes a QR decomposition for $\widetilde{\bf X} = \begin{pmatrix} {\bf 1}_{n} & {\bf X} & {\bf E} \end{pmatrix}$, where ${\bf 1}_{n}$ denotes the all ones vector of length $n$.

Afterwards, we will input the prepared data for each gene into the `SEAGLE` function to compute the score-like test statistic $T$ and its corresponding p-value.  The `init.tau` and `init.sigma` parameters are the initial values for estimating $\tau$ and $\sigma$ in the REML EM algorithm.

```{r}
# Initialize output containers for T and p-value for each gene
T.list <- numeric(length=num_genes)
pv.list <- numeric(length=num_genes)

# Run SEAGLE on each gene in genelist
for (i in 1:num_genes) {
  
  # Identify current gene
  gene_name <- genelist[i,4]
  
  # Obtain G
  Gtemp <- read.delim(paste0(dir, gene_name, ".raw"), sep=" ")
  G <- as.matrix(Gtemp[,-c(1:6)])
  L <- dim(G)[2]                ## number of SNPs
  
  # Make weights
  avg_newsnp <- colMeans(G)
  fA  = avg_newsnp/2            ## freq of allele "A"
  fa  = 1-fA                    ## freq of allele "a"
  maf = fA; maf[fA>0.5]= fa[fA>0.5]## maf should be b/w 0 and 0.5
  G = G[ ,maf>0]                ## only keep those SNPs with MAF>0
  maf <- maf[maf > 0]           ## Keep only MAF > 0)
  wt   = sqrt(maf^(-3/4))       ## Note we take the square root here
  if (length(wt) > 1) {
    G_final    = G %*% diag(wt) ## Input this G
  } else {
    T.list[i] <- NA
    pv.list[i] <- NA
    next
  }

  # Run SEAGLE
  objSEAGLE <- prep.SEAGLE(y=y, X=X, intercept=0, 
                           E=E, G=G_final)
  res <- SEAGLE(objSEAGLE, init.tau=0.5, init.sigma=0.5)
  
  # Save T and p-value into output lists
  T.list[i] <- res$T
  pv.list[i] <- res$pv
}
```

The score-like test statistics $T$ for the G$\times$E effect and their corresponding p-values can be found in `T.list` and `pv.list`, respectively.  We can take a look at the test statistics and p-values computed for each of the genes in `genelist`.

```{r}
resMat <- cbind(T.list, pv.list)
colnames(resMat) <- c("T", "p-value")
resMat
```

## Acknowledgments

Many thanks to Yueyang Huang for generating the example data, PLINK1.9 code and bash files, and R scripts for reading in the .raw files for this tutorial.