This documents reanalysis response time data from an Experiment
performed by Freeman, Heathcote, Chalmers, and Hockley (2010) using the
mixed model functionality of **afex** implemented in
function `mixed`

followed by post-hoc tests using package
**emmeans** (Lenth, 2017). After a brief description of the
data set and research question, the code and results are presented.

The data are lexical decision and word naming latencies for 300 words
and 300 nonwords from 45 participants presented in Freeman et
al. (2010). The 300 items in each `stimulus`

condition were
selected to form a balanced \(2 \times
2\) design with factors neighborhood `density`

(low
versus high) and `frequency`

(low versus high). The
`task`

was a between subjects factor: 25 participants worked
on the lexical decision task and 20 participants on the naming task.
After excluding erroneous responses each participants responded to
between 135 and 150 words and between 124 and 150 nonwords. We analyzed
log RTs which showed an approximately normal picture.

We start with loading some packages we will need throughout this
example. For data manipulation we will be using the `dplyr`

and `tidyr`

packages from the `tidyverse`

. A thorough
introduction to these packages is beyond this example, but well worth
it, and can be found in ‘R for Data
Science’ by Wickham and Grolemund. For plotting we will be using
`ggplot2`

, also part of the `tidyverse`

.

After loading the packages, we will load the data (which comes with
`afex`

), remove the errors, and take a look at the variables
in the data.

```
library("afex") # needed for mixed() and attaches lme4 automatically.
library("emmeans") # emmeans is needed for follow-up tests
library("multcomp") # for advanced control for multiple testing/Type 1 errors.
library("dplyr") # for working with data frames
library("tidyr") # for transforming data frames from wide to long and the other way round.
library("ggplot2") # for plots
theme_set(theme_bw(base_size = 15) +
theme(legend.position="bottom",
panel.grid.major.x = element_blank()))
data("fhch2010") # load
fhch <- droplevels(fhch2010[ fhch2010$correct,]) # remove errors
str(fhch2010) # structure of the data
```

```
## 'data.frame': 13222 obs. of 10 variables:
## $ id : Factor w/ 45 levels "N1","N12","N13",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ task : Factor w/ 2 levels "naming","lexdec": 1 1 1 1 1 1 1 1 1 1 ...
## $ stimulus : Factor w/ 2 levels "word","nonword": 1 1 1 2 2 1 2 2 1 2 ...
## $ density : Factor w/ 2 levels "low","high": 2 1 1 2 1 2 1 1 1 1 ...
## $ frequency: Factor w/ 2 levels "low","high": 1 2 2 2 2 2 1 2 1 2 ...
## $ length : Factor w/ 3 levels "4","5","6": 3 3 2 2 1 1 3 2 1 3 ...
## $ item : Factor w/ 600 levels "abide","acts",..: 363 121 202 525 580 135 42 368 227 141 ...
## $ rt : num 1.091 0.876 0.71 1.21 0.843 ...
## $ log_rt : num 0.0871 -0.1324 -0.3425 0.1906 -0.1708 ...
## $ correct : logi TRUE TRUE TRUE TRUE TRUE TRUE ...
```

To make sure our expectations about the data match the data we use
some `dplyr`

magic to confirm the number of participants per
condition and items per participant.

```
## are all participants in only one task?
fhch %>% group_by(id) %>%
summarise(task = n_distinct(task)) %>%
as.data.frame() %>%
{.$task == 1} %>%
all()
```

`## [1] TRUE`

```
## participants per condition:
fhch %>% group_by(id) %>%
summarise(task = first(task)) %>%
ungroup() %>%
group_by(task) %>%
summarise(n = n())
```

```
## # A tibble: 2 × 2
## task n
## <fct> <int>
## 1 naming 20
## 2 lexdec 25
```

```
## number of different items per participant:
fhch %>% group_by(id, stimulus) %>%
summarise(items = n_distinct(item)) %>%
ungroup() %>%
group_by(stimulus) %>%
summarise(min = min(items),
max = max(items),
mean = mean(items))
```

```
## # A tibble: 2 × 4
## stimulus min max mean
## <fct> <int> <int> <dbl>
## 1 word 135 150 145.
## 2 nonword 124 150 143.
```

Before running the analysis we should make sure that our dependent
variable looks roughly normal. To compare `rt`

with
`log_rt`

within the same figure we first need to transform
the data from the wide format (where both rt types occupy one column
each) into the long format (in which the two rt types are combined into
a single column with an additional indicator column). To do so we use
`tidyr::pivot_longer`

. Then we simply call
`ggplot`

with `geom_histogram`

and
`facet_wrap(vars(rt_type))`

on the new `tibble`

.
The plot shows that `log_rt`

looks clearly more normal than
`rt`

, although not perfectly so. An interesting exercise
could be to rerun the analysis below using a transformation that
provides an even better ‘normalization’.

The main factors in the experiment were the between-subjects factor
`task`

(`naming`

vs. `lexdec`

), and the
within-subjects factors `stimulus`

(`word`

vs. `nonword`

), `density`

(`low`

vs. `high`

), and `frequency`

(`low`

vs. `high`

). Before running an analysis it is a good idea to
visually inspect the data to gather some expectations regarding the
results. Should the statistical results dramatically disagree with the
expectations this suggests some type of error along the way (e.g., model
misspecification) or at least encourages a thorough check to make sure
everything is correct. We first begin by plotting the data aggregated
by-participant.

In each plot we plot the raw data in the background. To make the
individual data points visible we use
`ggbeeswarm::geom_quasirandom`

and `alpha = 0.5`

for semi-transparency. On top of this we add a (transparent) box plot as
well as the mean and standard error.

```
agg_p <- fhch %>%
group_by(id, task, stimulus, density, frequency) %>%
summarise(mean = mean(log_rt)) %>%
ungroup()
ggplot(agg_p, aes(x = interaction(density,frequency), y = mean)) +
ggbeeswarm::geom_quasirandom(alpha = 0.5) +
geom_boxplot(fill = "transparent") +
stat_summary(colour = "red") +
facet_grid(cols = vars(task), rows = vars(stimulus))
```

Now we plot the same data but aggregated across items:

```
agg_i <- fhch %>% group_by(item, task, stimulus, density, frequency) %>%
summarise(mean = mean(log_rt)) %>%
ungroup()
ggplot(agg_i, aes(x = interaction(density,frequency), y = mean)) +
ggbeeswarm::geom_quasirandom(alpha = 0.3) +
geom_boxplot(fill = "transparent") +
stat_summary(colour = "red") +
facet_grid(cols = vars(task), rows = vars(stimulus))
```

These two plots show a very similar pattern and suggest several things:

- Responses to
`nonwords`

appear slower than responses to`words`

, at least for the`naming`

task. `lexdec`

responses appear to be slower than`naming`

responses, particularly in the`word`

condition.- In the
`nonword`

and`naming`

condition we see a clear effect of`frequency`

with slower responses to`high`

than`low`

`frequency`

words. - In the
`word`

conditions the`frequency`

pattern appears to be in the opposite direction to the pattern described in the previous point: faster responses to`low`

`frequency`

than to`high`

`frequency`

words. `density`

appears to have no effect, perhaps with the exception of the`nonword`

`lexdec`

condition.

To set up a mixed model it is important to identify which factors
vary within which grouping factor generating random variability (i.e.,
grouping factors are sources of stochastic variability). The two
grouping factors are participants (`id`

) and items
(`item`

). The within-participant factors are
`stimulus`

, `density`

, and `frequency`

.
The within-item factor is `task`

. The ‘maximal model’ (Barr,
Levy, Scheepers, and Tily, 2013) therefore is the model with
by-participant random slopes for `stimulus`

,
`density`

, and `frequency`

and their interactions
and by-item random slopes for `task`

.

It is rather common that a maximal model with a complicated random effect structure, such as the present one, does not converge successfully. The best indicator of this is a “singular fit” warning. A model with a singular fit warning should not be reported or used. Instead, one should make sure that qualitatively the same results are also observed with a model without singular fit warnings. If the maximal model that does not converge and a reduced model without a singular fit warning (i.e., the final model) diverge in their results, results should only be interpreted cautiously.

In case of a singular fit or another indicator of a convergence problem, the usual first step is removing the correlations among the random terms. In our example, there are two sets of correlations, one for each random effect grouping variable. Consequently, we can build four model that have the maximal structure in terms of random-slopes and only differ in which correlations among random terms are calculated:

- With all correlations.
- No correlation among by-item random effects (i.e., no correlation
between random intercept and
`task`

random slope). - No correlation among by-participant random effect terms (i.e., no correlation among random slopes themselves and between the random slopes and the random intercept).
- No correlation among either random grouping factor.

The next decision to be made is which method to use for obtaining
\(p\)-values. The best control against
anti-conservative results is provided by `method = "KR"`

(=Kenward-Roger). However, `KR`

needs quite a lot of RAM,
especially with complicated random effect structures and large data
sets. In this case we have both, relatively large data (i.e., many
levels on each random effect, especially the item random effect) and a
complicated random effect structure. Consequently, it seems a reasonable
decision to choose another method. The second ‘best’ method (in terms of
controlling for Type I errors) is the ‘Satterthwaite’ approximation,
`method = 'S'`

. It provides a similar control of Type I
errors as the Kenward-Roger approximation and needs less RAM. The
Satterthwaite method is currently also the default method for
calculating \(p\)-values so does not
need to be set explicitly.

The following code fits the four models using the Satterthwaite
method. To suppress random effects we use the `||`

notation.
Note that it is necessary to set `expand_re = TRUE`

when
suppressing random effects among variables that are entered as factors
and not as numerical variables (all independent variables in the present
case are factors). Also note that `mixed`

automatically uses
appropriate contrast coding if factors are included in interactions
(`contr.sum`

) in contrast to the `R`

default
(which is `contr.treatment`

). To make sure the estimation
does not end prematurely we set the allowed number of function
evaluations to a very high value (using `lmerControl`

).

However, because fitting the models in R might take quite a while,
you should also be able to load the fitted binary files them from this
url and then load them in R with
`load("freeman_models.rda")`

.

```
m1s <- mixed(log_rt ~ task*stimulus*density*frequency +
(stimulus*density*frequency|id)+
(task|item), fhch,
control = lmerControl(optCtrl = list(maxfun = 1e6)))
```

`## boundary (singular) fit: see ?isSingular`

```
m2s <- mixed(log_rt ~ task*stimulus*density*frequency +
(stimulus*density*frequency|id)+
(task||item), fhch,
control = lmerControl(optCtrl = list(maxfun = 1e6)), expand_re = TRUE)
```

`## boundary (singular) fit: see ?isSingular`

```
m3s <- mixed(log_rt ~ task*stimulus*density*frequency +
(stimulus*density*frequency||id)+
(task|item), fhch,
control = lmerControl(optCtrl = list(maxfun = 1e6)), expand_re = TRUE)
```

`## boundary (singular) fit: see ?isSingular`

```
m4s <- mixed(log_rt ~ task*stimulus*density*frequency +
(stimulus*density*frequency||id)+
(task||item), fhch,
control = lmerControl(optCtrl = list(maxfun = 1e6)), expand_re = TRUE)
```

`## boundary (singular) fit: see ?isSingular`

We can see that fitting each of these models emits the “singular fit”
message (it is technically a `message`

and not a
`warning`

) indicating that the model is over-parameterized
and did not converge successfully. In particular, this message indicates
that the model is specified with more random effect parameters than can
be estimated given the current data.

It is instructive to see that even with a comparatively large number of observations, 12960, a set of seven random slopes for the by-participant term and one random slope for the by-item term cannot be estimated successfully. And this holds even after removing all correlations. Thus, it should not be surprising that similar sized models regularly do not converge with smaller numbers of observations. Furthermore, we are here in the fortunate situation that each factor has only two levels. A factor with more levels corresponds to more parameters of the random effect terms.

Before deciding what to do next, we take a look at the estimated
random effect estimates. We do so for the model without any
correlations. Note that for `afex`

models we usually do not
want to use the `summary`

method as it prints the results on
the level of the model coefficients and not model terms. But for the
random effects we have to do so. However, we are only interested in the
random effect terms so we only print those using
`summary(model)$varcor`

.

```
## Groups Name Std.Dev.
## item re2.task1 0.0568715
## item.1 (Intercept) 0.0537587
## id re1.stimulus1_by_density1_by_frequency1 0.0077923
## id.1 re1.density1_by_frequency1 0.0000000
## id.2 re1.stimulus1_by_frequency1 0.0000000
## id.3 re1.stimulus1_by_density1 0.0000000
## id.4 re1.frequency1 0.0179993
## id.5 re1.density1 0.0000000
## id.6 re1.stimulus1 0.0445333
## id.7 (Intercept) 0.1936292
## Residual 0.3001905
```

The output shows that the estimated SDs of the random slopes of the two-way interactions are all zero. However, because we cannot remove the random slopes for the two way interaction while retaining the three-way interaction, we start by removing the three-way interaction first.

```
m5s <- mixed(log_rt ~ task*stimulus*density*frequency +
((stimulus+density+frequency)^2||id)+
(task||item), fhch,
control = lmerControl(optCtrl = list(maxfun = 1e6)), expand_re = TRUE)
```

`## boundary (singular) fit: see ?isSingular`

```
## Groups Name Std.Dev.
## item re2.task1 5.6826e-02
## item.1 (Intercept) 5.3736e-02
## id re1.density1_by_frequency1 0.0000e+00
## id.1 re1.stimulus1_by_frequency1 0.0000e+00
## id.2 re1.stimulus1_by_density1 0.0000e+00
## id.3 re1.frequency1 1.7966e-02
## id.4 re1.density1 1.4422e-05
## id.5 re1.stimulus1 4.4539e-02
## id.6 (Intercept) 1.9374e-01
## Residual 3.0030e-01
```

Not too surprisingly, this model also produces a singular fit. Inspection of the estimates shows that the two-way interaction of the slopes are still estimated to be zero. So we remove those in the next step.

```
m6s <- mixed(log_rt ~ task*stimulus*density*frequency +
(stimulus+density+frequency||id)+
(task||item), fhch,
control = lmerControl(optCtrl = list(maxfun = 1e6)),
expand_re = TRUE)
```

`## boundary (singular) fit: see ?isSingular`

This model still shows a singular fit warning. The random effect estimates below show a potential culprit.

```
## Groups Name Std.Dev.
## item re2.task1 0.056831
## item.1 (Intercept) 0.053737
## id re1.frequency1 0.017972
## id.1 re1.density1 0.000000
## id.2 re1.stimulus1 0.044524
## id.3 (Intercept) 0.193659
## Residual 0.300297
```

As in `m4s`

above, the random effect SD for the density
term is estimated to be zero. Thus, we remove this as well in the next
step.

```
m7s <- mixed(log_rt ~ task*stimulus*density*frequency +
(stimulus+frequency||id)+
(task||item), fhch,
control = lmerControl(optCtrl = list(maxfun = 1e6)), expand_re = TRUE)
```

This model finally does not emit a singular fit warning. Is this our final model? Before deciding on this, we see whether we can add the correlation terms again without running into any problems. We begin by adding the correlation to the by-participant term.

```
m8s <- mixed(log_rt ~ task*stimulus*density*frequency +
(stimulus+frequency|id)+
(task||item), fhch,
control = lmerControl(optCtrl = list(maxfun = 1e6)), expand_re = TRUE)
```

`## Warning: Model failed to converge with max|grad| = 0.00347544 (tol = 0.002, component 1)`

This model does not show a singular fit message but emits another warning. Specifically, a warning that the absolute maximal gradient at the final solution is too high. This warning is not necessarily critical (i.e., it can be a false positive), but can also indicate serious problems. Consequently, we try adding the correlation between the by-item random terms instead:

```
m9s <- mixed(log_rt ~ task*stimulus*density*frequency +
(stimulus+frequency||id)+
(task|item), fhch,
control = lmerControl(optCtrl = list(maxfun = 1e6)), expand_re = TRUE)
```

This model also does not show any warnings. Thus, we have arrived at the end of the model selection process.

We now have the following two relevant models.

`m1s`

: The maximal random effect structure justified by the design (i.e., the maximal model)`m9s`

: The final model

Robust results are those that hold regardless across maximal and final (i.e., reduced) model. Therefore, let us compare the pattern of significant and non-significant effects.

```
## Mixed Model Anova Table (Type 3 tests, S-method)
##
## Model: log_rt ~ task * stimulus * density * frequency + (stimulus *
## Model: density * frequency | id) + (task | item)
## Data: fhch
## Effect df_full F_full p.value_full df_final F_final p.value_final
## 1 task 1, 43.58 13.71 *** <.001 1, 43.51 13.68 *** <.001
## 2 stimulus 1, 50.50 151.03 *** <.001 1, 50.57 151.38 *** <.001
## 3 density 1, 175.53 0.33 .569 1, 584.58 0.36 .547
## 4 frequency 1, 71.17 0.55 .459 1, 70.27 0.56 .456
## 5 task:stimulus 1, 51.45 70.91 *** <.001 1, 51.50 71.32 *** <.001
## 6 task:density 1, 184.47 16.22 *** <.001 1, 578.72 17.89 *** <.001
## 7 stimulus:density 1, 247.85 1.13 .289 1, 584.60 1.19 .275
## 8 task:frequency 1, 75.03 81.10 *** <.001 1, 74.09 82.77 *** <.001
## 9 stimulus:frequency 1, 165.95 55.85 *** <.001 1, 584.78 63.29 *** <.001
## 10 density:frequency 1, 215.83 0.11 .739 1, 584.62 0.11 .742
## 11 task:stimulus:density 1, 259.93 14.52 *** <.001 1, 578.74 14.87 *** <.001
## 12 task:stimulus:frequency 1, 178.33 110.95 *** <.001 1, 578.91 124.16 *** <.001
## 13 task:density:frequency 1, 228.12 5.53 * .020 1, 578.75 5.93 * .015
## 14 stimulus:density:frequency 1, 101.11 3.98 * .049 1, 584.64 4.62 * .032
## 15 task:stimulus:density:frequency 1, 108.10 10.19 ** .002 1, 578.77 11.72 *** <.001
## ---
## Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘+’ 0.1 ‘ ’ 1
```

What this shows is that the pattern of significant and
non-significant effect is the same for both models. The only significant
effect for which the evidence is not that strong is the 3-way
interaction between `stimulus:density:frequency`

. It is only
just below .05 for the full model and has a somewhat lower value for the
final model.

We can also see that one of the most noticeable differences between the maximal and the final model is the number of denominator degrees of freedom. This is highly influenced by the random effect structure and thus considerable larger in the final (i.e., reduced) model. The difference in the other statistics is lower.

It is instructive to compare those results with results obtained
using the comparatively ‘worst’ method for obtaining \(p\)-values implemented in
`afex::mixed`

, likelihood ratio tests. Likelihood ratio-tests
should in principle deliver reasonable results for large data sets. A
common rule of thumb is that the number of levels for each random effect
grouping factor needs to be large, say above 50. Here, we have a very
large number of items (600), but not that many participants (45). Thus,
qualitative results should be the very similar, but it still is
interesting to see exactly what happens. We therefore fit the final
model using `method='LRT'`

.

```
m9lrt <- mixed(log_rt ~ task*stimulus*density*frequency +
(stimulus+frequency||id)+
(task|item), fhch, method = "LRT",
control = lmerControl(optCtrl = list(maxfun = 1e6)),
expand_re = TRUE)
m9lrt
```

```
## Mixed Model Anova Table (Type 3 tests, LRT-method)
##
## Model: log_rt ~ task * stimulus * density * frequency + (stimulus +
## Model: frequency || id) + (task | item)
## Data: fhch
## Df full model: 23
## Effect df Chisq p.value
## 1 task 1 12.44 *** <.001
## 2 stimulus 1 70.04 *** <.001
## 3 density 1 0.37 .545
## 4 frequency 1 0.57 .449
## 5 task:stimulus 1 45.52 *** <.001
## 6 task:density 1 17.81 *** <.001
## 7 stimulus:density 1 1.21 .272
## 8 task:frequency 1 53.81 *** <.001
## 9 stimulus:frequency 1 60.64 *** <.001
## 10 density:frequency 1 0.11 .741
## 11 task:stimulus:density 1 14.85 *** <.001
## 12 task:stimulus:frequency 1 114.21 *** <.001
## 13 task:density:frequency 1 5.96 * .015
## 14 stimulus:density:frequency 1 4.65 * .031
## 15 task:stimulus:density:frequency 1 11.71 *** <.001
## ---
## Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘+’ 0.1 ‘ ’ 1
```

The results in this case match the results of the Satterthwaite method. With lower numbers of levels of the grouping factor (e.g., less participants) this would not necessarily be expected.

Fortunately, the results from all models converged on the same
pattern of significant and non-significant effects providing a high
degree of confidence in the results. This might not be too surprising
given the comparatively large number of total data points and the fact
that each random effect grouping factor has a considerable number of
levels (above 30 for both participants and items). In the following we
focus on the final model using the Satterthwaite method,
`m9s`

.

In terms of the significant findings, there are many that seem to be
in line with the descriptive results described above. For example, the
highly significant effect of `task:stimulus:frequency`

with
\(F(1, 578.91) = 124.16\), \(p < .001\), appears to be in line with
the observation that the frequency effect appears to change its sign
depending on the `task:stimulus`

cell (with
`nonword`

and `naming`

showing the opposite
patterns than the other three conditions). Consequently, we start by
investigating this interaction further below.

Before investigating the significant interaction in detail it is a good idea to remind oneself what a significant interaction represents on a conceptual level; that one or multiple of the variables in the interaction moderate (i.e., affect) the effect of the other variable or variables. Consequently, there are several ways to investigate a significant interaction. Each of the involved variables can be seen as the moderating variables and each of the variables can be seen as the effect of interest. Which one of those possible interpretations is of interest in a given situation highly depends on the actual data and research question and multiple views can be ‘correct’ in a given situation.

In addition to this conceptual issue, there are also multiple
technical ways to investigate a significant interaction. One approach
not followed here is to split the data into subsets according to the
moderating variables and compute the statistical model again for the
subsets with the effect variable(s) as remaining fixed effect. This
approach, also called *simple effects* analysis, is, for example,
recommended by Maxwell and Delaney (2004) as it does not assume variance
homogeneity and is faithful to the data at each level. The approach
taken here is to simply perform the test on the estimated final model.
This approach assumes variance homogeneity (i.e., that the variances in
all groups are homogeneous) and has the added benefit that it is
computationally relatively simple. In addition, it can all be achieved
using the framework provided by `emmeans`

(Lenth, 2017).

Our interest in the beginning is on the effect of
`frequency`

by `task:stimulus`

combination. So let
us first look at the estimated marginal means of this effect. In
`emmeans`

parlance these estimated means are called
‘least-square means’ because of historical reasons, but because of the
lack of least-square estimation in mixed models we prefer the term
estimated marginal means, or EMMs for short. Those can be obtained in
the following way. To prevent `emmeans`

from calculating the
*df* for the EMMs (which can be quite costly), we use asymptotic
*df*s (i.e., \(z\) values and
tests). `emmeans`

requires to first specify the variable(s)
one wants to treat as the effect variable(s) (here
`frequency`

) and then allows to specify condition
variables.

```
emm_options(lmer.df = "asymptotic") # also possible: 'satterthwaite', 'kenward-roger'
emm_i1 <- emmeans(m9s, "frequency", by = c("stimulus", "task"))
emm_i1
```

`## NOTE: Results may be misleading due to involvement in interactions`

```
## stimulus = word, task = naming:
## frequency emmean SE df asymp.LCL asymp.UCL
## low -0.32333 0.0455 Inf -0.4125 -0.2342
## high -0.38210 0.0455 Inf -0.4713 -0.2929
##
## stimulus = nonword, task = naming:
## frequency emmean SE df asymp.LCL asymp.UCL
## low -0.14294 0.0455 Inf -0.2321 -0.0538
## high 0.06405 0.0455 Inf -0.0252 0.1533
##
## stimulus = word, task = lexdec:
## frequency emmean SE df asymp.LCL asymp.UCL
## low 0.02337 0.0413 Inf -0.0576 0.1044
## high -0.04017 0.0413 Inf -0.1211 0.0408
##
## stimulus = nonword, task = lexdec:
## frequency emmean SE df asymp.LCL asymp.UCL
## low 0.10455 0.0413 Inf 0.0235 0.1856
## high -0.00632 0.0413 Inf -0.0873 0.0746
##
## Results are averaged over the levels of: density
## Degrees-of-freedom method: asymptotic
## Confidence level used: 0.95
```

The returned values are in line with our observation that the
`nonword`

and `naming`

condition diverges from the
other three. But is there actual evidence that the effect flips? We can
test this using additional `emmeans`

functionality.
Specifically, we first use the `pairs`

function which
provides us with a pairwise test of the effect of `frequency`

in each `task:stimulus`

combination. Then we need to combine
the four tests within one object to obtain a family-wise error rate
correction which we do via `update(..., by = NULL)`

(i.e., we
revert the effect of the `by`

statement from the earlier
`emmeans`

call) and finally we select the `holm`

method for controlling for family wise error rate (the Holm method is
uniformly more powerful than the Bonferroni).

`## NOTE: Results may be misleading due to involvement in interactions`

```
## contrast stimulus task estimate SE df z.ratio p.value
## low - high word naming 0.0588 0.0149 Inf 3.941 0.0002
## low - high nonword naming -0.2070 0.0150 Inf -13.823 <.0001
## low - high word lexdec 0.0635 0.0167 Inf 3.807 0.0002
## low - high nonword lexdec 0.1109 0.0167 Inf 6.619 <.0001
##
## Results are averaged over the levels of: density
## Degrees-of-freedom method: asymptotic
## P value adjustment: holm method for 4 tests
```

We could also use a slightly more powerful method than the Holm
method, method `free`

from package `multcomp`

.
This method takes the correlation of the model parameters into account.
However, results do not differ much here:

```
##
## Simultaneous Tests for General Linear Hypotheses
##
## Linear Hypotheses:
## Estimate Std. Error z value Pr(>|z|)
## low - high, word, naming == 0 0.05877 0.01491 3.941 0.000162 ***
## low - high, nonword, naming == 0 -0.20699 0.01497 -13.823 < 2e-16 ***
## low - high, word, lexdec == 0 0.06353 0.01669 3.807 0.000162 ***
## low - high, nonword, lexdec == 0 0.11086 0.01675 6.619 1.11e-10 ***
## ---
## Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
## (Adjusted p values reported -- free method)
```

We see that the results are exactly as expected. In the
`nonword`

and `naming`

condition we have a clear
negative effect of frequency while in the other three conditions it is
clearly positive.

We could now also use `emmeans`

and re-transform the
estimates back onto the original RT response scale. For this, we can
again `update`

the `emmeans`

object by using
`tran = "log"`

to specify the transformation and then
indicating we want the means on the response scale with
`type = "response"`

. These values might be used for
plotting.

```
## frequency stimulus task response SE df asymp.LCL asymp.UCL
## low word naming 0.724 0.0329 Inf 0.662 0.791
## high word naming 0.682 0.0310 Inf 0.624 0.746
## low nonword naming 0.867 0.0394 Inf 0.793 0.948
## high nonword naming 1.066 0.0485 Inf 0.975 1.166
## low word lexdec 1.024 0.0423 Inf 0.944 1.110
## high word lexdec 0.961 0.0397 Inf 0.886 1.042
## low nonword lexdec 1.110 0.0459 Inf 1.024 1.204
## high nonword lexdec 0.994 0.0410 Inf 0.916 1.077
##
## Results are averaged over the levels of: density
## Degrees-of-freedom method: asymptotic
## Confidence level used: 0.95
## Intervals are back-transformed from the log scale
```

A more direct approach for plotting the interaction is via
`afex_plot`

. For a plot that is not too busy it makes sense
to specify across which grouping factor the individual level data should
be aggregated. We use the participant variable `"id"`

here.
We also use `ggbeeswarm::geom_quasirandom`

as the geom for
the data in the background following the example in the `afex_plot`

vignette.

```
afex_plot(m9s, "frequency", "stimulus", "task", id = "id",
data_geom = ggbeeswarm::geom_quasirandom,
data_arg = list(
dodge.width = 0.5, ## needs to be same as dodge
cex = 0.8,
width = 0.1,
color = "darkgrey"))
```

`## NOTE: Results may be misleading due to involvement in interactions`

As the last example, let us take a look at the significant four-way
interaction of `task:stimulus:density:frequency`

, \(F(1, 578.77) = 11.72\), \(p < .001\). Here we might be interested
in a slightly more difficult question namely whether the
`density:frequency`

interaction varies across
`task:stimulus`

conditions. If we again look at the figures
above, it appears that there is a difference between
`low:low`

and `high:low`

in the
`nonword`

and `lexdec`

condition, but not in the
other conditions.

Looking at the 2-way interaction of `density:frequency`

by
the `task:stimulus`

interaction can be done using
`emmeans`

using the `joint_test`

function. We
simply need to specify the appropriate `by`

variables and get
conditional tests this way.

```
## stimulus = word, task = naming:
## model term df1 df2 F.ratio p.value
## density 1 Inf 1.292 0.2556
## frequency 1 Inf 15.530 0.0001
## density:frequency 1 Inf 0.196 0.6578
##
## stimulus = nonword, task = naming:
## model term df1 df2 F.ratio p.value
## density 1 Inf 17.926 <.0001
## frequency 1 Inf 191.069 <.0001
## density:frequency 1 Inf 3.656 0.0559
##
## stimulus = word, task = lexdec:
## model term df1 df2 F.ratio p.value
## density 1 Inf 0.359 0.5491
## frequency 1 Inf 14.494 0.0001
## density:frequency 1 Inf 1.669 0.1964
##
## stimulus = nonword, task = lexdec:
## model term df1 df2 F.ratio p.value
## density 1 Inf 15.837 0.0001
## frequency 1 Inf 43.811 <.0001
## density:frequency 1 Inf 14.806 0.0001
```

This test indeed shows that the `density:frequency`

interaction is only significant in the `nonword`

and
`lexdec`

condition. Next, let’s see if we can unpack this
interaction in a meaningful manner. For this we compare
`low:low`

and `high:low`

in each of the four
groups. And just for the sake of making the example more complex, we
also compare `low:high`

and `high:high`

.

To do so, we first need to setup a new set of EMMs. Specifically, we
get the EMMs of the two variables of interest, density and frequency,
using the same `by`

specification as the
`joint_test`

call. We can then setup custom contrasts that
tests our hypotheses.

```
## stimulus = word, task = naming:
## density frequency emmean SE df asymp.LCL asymp.UCL
## low low -0.31384 0.0448 Inf -0.4016 -0.22613
## high low -0.33268 0.0408 Inf -0.4126 -0.25276
## low high -0.37741 0.0466 Inf -0.4687 -0.28611
## high high -0.38644 0.0472 Inf -0.4789 -0.29399
##
## stimulus = nonword, task = naming:
## density frequency emmean SE df asymp.LCL asymp.UCL
## low low -0.10399 0.0499 Inf -0.2019 -0.00611
## high low -0.18230 0.0441 Inf -0.2688 -0.09580
## low high 0.07823 0.0519 Inf -0.0236 0.18004
## high high 0.04902 0.0494 Inf -0.0478 0.14588
##
## stimulus = word, task = lexdec:
## density frequency emmean SE df asymp.LCL asymp.UCL
## low low 0.03713 0.0403 Inf -0.0419 0.11617
## high low 0.00933 0.0368 Inf -0.0627 0.08138
## low high -0.04512 0.0419 Inf -0.1272 0.03696
## high high -0.03479 0.0424 Inf -0.1179 0.04828
##
## stimulus = nonword, task = lexdec:
## density frequency emmean SE df asymp.LCL asymp.UCL
## low low 0.04480 0.0449 Inf -0.0432 0.13278
## high low 0.16331 0.0398 Inf 0.0852 0.24140
## low high -0.00729 0.0466 Inf -0.0987 0.08411
## high high -0.00564 0.0444 Inf -0.0927 0.08140
##
## Degrees-of-freedom method: asymptotic
## Confidence level used: 0.95
```

These contrasts can be specified via a list of custom contrasts on
the EMMs (or reference grid in `emmeans`

parlance) which can
be passed to the `contrast`

function. The contrasts are a
`list`

where each element should sum to one (i.e., be a
proper contrast) and be of length equal to the number of EMMs (although
we have 16 EMMs in total, we only need to specify it for a length of
four due to conditioning by `c("stimulus", "task")`

).

To control for the family wise error rate across all tests, we again
use `update(..., by = NULL)`

on the result to revert the
effect of the conditioning.

```
# desired contrats:
des_c <- list(
ll_hl = c(1, -1, 0, 0),
lh_hh = c(0, 0, 1, -1)
)
update(contrast(emm_i2, des_c), by = NULL, adjust = "holm")
```

```
## contrast stimulus task estimate SE df z.ratio p.value
## ll_hl word naming 0.01884 0.0210 Inf 0.895 1.0000
## lh_hh word naming 0.00904 0.0212 Inf 0.426 1.0000
## ll_hl nonword naming 0.07831 0.0222 Inf 3.525 0.0030
## lh_hh nonword naming 0.02921 0.0210 Inf 1.391 0.9847
## ll_hl word lexdec 0.02780 0.0200 Inf 1.391 0.9847
## lh_hh word lexdec -0.01033 0.0199 Inf -0.520 1.0000
## ll_hl nonword lexdec -0.11850 0.0211 Inf -5.628 <.0001
## lh_hh nonword lexdec -0.00165 0.0197 Inf -0.084 1.0000
##
## Degrees-of-freedom method: asymptotic
## P value adjustment: holm method for 8 tests
```

In contrast to our expectation, the results show two significant
effects and not only one. In line with our expectations, in the
`nonword`

and `lexdec`

condition the EMM of
`low:low`

is smaller than the EMM for `high:low`

,
\(z = -5.63\), \(p < .0001\). However, in the
`nonword`

and `naming`

condition we found the
opposite pattern; the EMM of `low:low`

is larger than the EMM
for `high:low`

, \(z =
3.53\), \(p = .003\). For all
other effects \(|z| < 1.4\), \(p > .98\). In addition, there is no
difference between `low:high`

and `high:high`

in
any condition.

- Barr, D. J., Levy, R., Scheepers, C., & Tily, H. J. (2013).
Random effects structure for confirmatory hypothesis testing: Keep it
maximal.
*Journal of Memory and Language*, 68(3), 255-278. https://doi.org/10.1016/j.jml.2012.11.001 - Bretz, F., Hothorn, T., & Westfall, P. H. (2011).
*Multiple comparisons using R*. Boca Raton, FL: CRC Press. https://CRAN.R-project.org/package=multcomp - Freeman, E., Heathcote, A., Chalmers, K., & Hockley, W. (2010).
Item effects in recognition memory for words.
*Journal of Memory and Language*, 62(1), 1-18. https://doi.org/10.1016/j.jml.2009.09.004 - Lenth, R. (2017).
*emmeans: Estimated Marginal Means, aka Least-Squares Means*. R package version 0.9.1. https://CRAN.R-project.org/package=emmeans - Maxwell, S. E., & Delaney, H. D. (2004).
*Designing experiments and analyzing data: a model-comparisons perspective*. Mahwah, N.J.: Lawrence Erlbaum Associates.