---
title: "Ex. 1 - Background questionnaire generation"
author: Yuan-Ling Liaw and Waldir Leoncio
header-includes:
    - \usepackage{setspace}\onehalfspacing
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Ex. 1 - Background questionnaire generation}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE, warning = FALSE}
library(knitr)
options(width = 90, tidy = TRUE, warning = FALSE, message = FALSE)
opts_chunk$set(comment = "", warning = FALSE, message = FALSE,
               echo = TRUE, tidy = TRUE)
```

```{r load}
library(lsasim)
```

```{r packageVersion}
packageVersion("lsasim")
```

---

```{r equation, eval=FALSE}
questionnaire_gen(n_obs, cat_prop = NULL, n_vars = NULL, n_X = NULL, n_W = NULL,
                  cor_matrix = NULL, cov_matrix = NULL,
                  c_mean = NULL, c_sd = NULL,
                  theta = FALSE, family = NULL,
                  full_output = FALSE, verbose = TRUE)
```

The function `questionnaire_gen` generates correlated continuous and ordinal data that resemble background questionnaire data. Only `n_obs` is required; the remaining arguments are optional. The main arguments are:

* `n_obs`: the number of observations (e.g., test takers).
* `cat_prop`: a list of vectors where each vector contains the cumulative proportions for each category of a given item.
* `n_vars`: the number of variables, including the continuous (`X`) and the ordinal (`W`) covariates as well as the latent trait (`theta`).
* `n_X`: the number of continuous (`X`) variables.
* `n_W`: the number of ordinal (`W`) variables.
* `cor_matrix`: a possibly heterogeneous correlation matrix, consisting of polyserial correlations between continuous and ordinal variables, and polychoric correlations between ordinal variables.
* `cov_matrix`: a covariance matrix, formatted as `cor_matrix`.

The arguments `c_mean` and `c_sd` are scaling parameters for the continuous variables. If the logical argument `theta` is `TRUE`, the latent trait is generated as the first continuous variable and labeled `theta`. If `family` is `"gaussian"`, the data are generated from a multivariate normal distribution; otherwise, they are generated from the polychoric correlation matrix.

If the logical argument `full_output` is `TRUE`, the output is a list containing the questionnaire data as well as several objects that might be of interest for further analysis of the data. The output of `full_output` will be addressed in future tutorials.
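
Because `cat_prop` takes cumulative rather than marginal proportions, one convenient way to build it is with `cumsum()`. The sketch below is only an illustration: the marginal proportions and the object names `marginal` and `cum_props` are hypothetical and not part of `lsasim`. A vector consisting of a single `1` stands for a continuous variable, as in the `cat_prop` examples later in this vignette.

```{r sketch cumulative}
# Hypothetical marginal (per-category) proportions, for illustration only
marginal <- list(
  1,               # a single category: continuous variable
  c(.25, .75),     # binary item
  c(.2, .6, .2)    # three-category item
)

# cat_prop is described as cumulative, so accumulate within each item
cum_props <- lapply(marginal, cumsum)
str(cum_props)
```

Each resulting vector ends in 1, and its length gives the number of categories of that item.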

---

We specify only `n_obs = 100` and use a multivariate normal distribution. The generated data contain one continuous variable and four ordinal covariates with 2, 3, 4, and 5 categories, respectively.

```{r ex 1a}
set.seed(4388)
bg <- questionnaire_gen(n_obs = 100, family = "gaussian")
str(bg)
```
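
To see at a glance which columns are continuous and which are ordinal, the number of factor levels per column can be tabulated. The sketch below assumes that ordinal variables are stored as factors and continuous ones as numeric vectors; `toy` is a hypothetical stand-in data frame, used here so that the chunk runs on its own.

```{r sketch levels}
# Hypothetical stand-in for a generated data set (illustration only)
toy <- data.frame(
  q1 = c(0.3, -1.2, 0.8, 0.1),   # continuous variable
  q2 = factor(c(1, 2, 2, 1)),    # 2-category item
  q3 = factor(c(1, 3, 2, 3))     # 3-category item
)

# nlevels() returns 0 for columns that are not factors
n_levels <- sapply(toy, nlevels)
n_levels
```

Under the same assumption, `sapply(bg, nlevels)` would report the number of categories of each ordinal covariate in the generated data.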

---

In addition to `n_obs = 100`, we specify the logical argument `theta = TRUE`. An additional continuous variable is generated and labeled `theta`. The latent trait is always placed first in the generated data.

```{r ex 1b}
set.seed(4388)
bg <- questionnaire_gen(n_obs = 100, theta = TRUE, family = "gaussian")
str(bg)
```

---

We specify `n_vars = 4` without fixing the item types. Four different item types are generated: one 1-category item (continuous), one 2-category item, one 4-category item, and one 5-category item.

```{r ex 2a}
set.seed(4388)
bg <- questionnaire_gen(n_obs = 100, n_vars = 4, family = "gaussian")
str(bg)
```

---

In addition to `n_vars = 4`, we specify the logical argument `theta = TRUE`. Three different item types are generated: two 1-category items (the latent trait and one continuous variable), one 2-category item, and one 5-category item. Note that when `theta = TRUE`, the first continuous variable generated is always labeled `theta`.

```{r ex 2b}
set.seed(4388)
bg <- questionnaire_gen(n_obs = 100, n_vars = 4, theta = TRUE, family = "gaussian")
str(bg)
```

---

We generate one latent trait and three continuous variables by specifying `theta = TRUE` and `n_X = 3`. We also set `n_W = 0`; otherwise, a random number of ordinal variables will be generated.

```{r ex 3a}
set.seed(4388)
bg <- questionnaire_gen(n_obs = 100, n_X = 3, n_W = 0, theta = TRUE, family = "gaussian")
str(bg)
```

---

If `n_W` is left unspecified, the same call also generates a random number of ordinal variables.

```{r ex 3b}
set.seed(4388)
bg <- questionnaire_gen(n_obs = 100, n_X = 3, theta = TRUE, family = "gaussian")
str(bg)
```

---

We can also specify `cat_prop = list(1, 1, 1, 1)` to generate one latent trait and three continuous covariates. The length of `cat_prop` corresponds to the number of generated variables (including latent trait and continuous variables in this case).

```{r ex 3c}
set.seed(4388)
bg <- questionnaire_gen(n_obs = 100, cat_prop = list(1, 1, 1, 1), theta = TRUE, family = "gaussian")
str(bg)
```
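
The correspondence between `cat_prop` and the generated variables can be checked before calling `questionnaire_gen`. The following sketch uses only base R; `example_props` and `n_cat` are hypothetical names, and the rule that a one-element vector denotes a continuous variable is inferred from the example above rather than taken from the package documentation.

```{r sketch lengths}
# Hypothetical specification: two continuous and two ordinal variables
example_props <- list(1, 1, c(.25, 1), c(.2, .8, 1))

# Number of categories per variable
n_cat <- lengths(example_props)
n_cat

sum(n_cat == 1)  # variables with a single category (continuous)
sum(n_cat > 1)   # variables with several categories (ordinal)
```

The length of `example_props` is the total number of variables, and the length of each element is the number of categories of that variable.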

---

We generate two ordinal variables without fixing their number of categories. In this case, one 2-category variable and one 5-category variable are generated.

```{r ex 4a}
set.seed(4388)
bg <- questionnaire_gen(n_obs = 100, n_X = 0, n_W = 2, family = "gaussian")
str(bg)
```

---

We generate one binary variable and three 4-category variables.

```{r ex 4b}
set.seed(4388)
bg <- questionnaire_gen(n_obs = 100, n_X = 0, n_W = list(2, 4, 4, 4), family = "gaussian")
str(bg)
```

---

We generate five variables: one latent trait, two continuous covariates, and two binary covariates. The latent trait is scaled to have a mean of 500 and a standard deviation of 100.

```{r ex 5a}
set.seed(4388)
bg <- questionnaire_gen(n_obs = 100, n_X = 2, n_W = list(2, 2), theta = TRUE,
                        c_mean = c(500, 0, 0), c_sd = c(100, 1, 1), family = "gaussian")
str(bg)
```
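
The effect of `c_mean` and `c_sd` can be pictured as a linear rescaling of a standardized variable. The sketch below illustrates that idea in base R; it is an assumption about the general principle, not a description of how `lsasim` implements the scaling internally, and `z` and `theta_scaled` are hypothetical names.

```{r sketch scaling}
set.seed(4388)
z <- rnorm(100)                 # standard normal draws (hypothetical stand-in)
theta_scaled <- 500 + 100 * z   # shift by the target mean, stretch by the target SD

# Sample mean and standard deviation after rescaling
c(mean = mean(theta_scaled), sd = sd(theta_scaled))
```

With 100 observations, the sample mean and standard deviation will be close to, but generally not exactly, 500 and 100. The same check can be run on the generated latent trait with `mean(bg$theta)` and `sd(bg$theta)`.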

---

We generate one continuous and two ordinal covariates, and we specify the covariance matrix between the continuous and ordinal variables. The continuous covariate is scaled to have a mean of 2 by specifying `c_mean = 2`. When `cov_matrix` is provided, `c_sd` is ignored.

```{r ex 5b}
set.seed(4388)
props <- list(1, c(.25, 1), c(.2, .8, 1))
yw_cov <- matrix(c(1, .5, .5, .5, 1, .8, .5, .8, 1), nrow = 3)
bg <- questionnaire_gen(n_obs = 100, cat_prop = props, cov_matrix = yw_cov,
                        c_mean = 2,
                        family = "gaussian")
str(bg)
```
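
Before passing a hand-written matrix to `cov_matrix`, it may be worth confirming that it is symmetric and positive definite. The base R sketch below re-creates the matrix from the example so that it runs on its own; `is_valid_cov` is a hypothetical helper, not an `lsasim` function.

```{r sketch covcheck}
yw_cov <- matrix(c(1, .5, .5, .5, 1, .8, .5, .8, 1), nrow = 3)

# Hypothetical helper: TRUE if m is symmetric with strictly positive eigenvalues
is_valid_cov <- function(m) {
  isSymmetric(m) &&
    all(eigen(m, symmetric = TRUE, only.values = TRUE)$values > 0)
}

is_valid_cov(yw_cov)
cov2cor(yw_cov)  # correlation matrix implied by the covariance matrix
```

Because every diagonal element of `yw_cov` is 1, the implied correlation matrix is identical to the covariance matrix itself.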