11  Frequentist and Bayesian Mixed Effects Models

In the recordings, James walked through the building blocks of fitting and interpreting mixed effects models, first from a frequentist perspective and then from a Bayesian perspective. The data come from Brown (2021) who developed a tutorial for mixed effects models. Part 1 here uses the lme4 package to fit frequentist mixed effects models. These take a little longer than normal, but still fit in a few seconds. Part 2 uses the brms package to fit Bayesian mixed effects models. You are welcome to fit these in your own time, but for the practical, I have provided the saved model objects as they take 15-30 minutes to fit depending on your computer.

Make sure you have the following packages installed and loaded.

# general data functions
library(tidyverse)

# frequentist 
library(lme4)
library(lmerTest)
library(sjPlot)

# Bayesian 
library(brms)
library(bayestestR)
library(see)
library(tidybayes)

Now we have our packages, we need to read the data from Brown (2021), who reports a repeated measures study on speech perception. Participants complete a response time task while they listen to words being spoken. The words are presented in two conditions: 1) just being spoken in an audio-only condition and 2) being spoken while seeing the speaker in an audiovisual condition. Each participant completes the task for 543 words, split between the two conditions in a different order for each participant. Brown expected response times to be slower in the audiovisual condition compared to the audio condition due to the interference from processing the additional visual information.

This means we have a repeated measures / within-subjects design. The outcome is response time in milliseconds (ms). The predictor - or fixed effect - is the modality (audio and audiovisual). We will also explore the impact of participants and the word stimuli as random effects.

# read the raw data from Brown (2021)
brown_data <- read_csv("Data/Brown_2021.csv")

# create a simplified version of the data by averaging over the word stimuli
brown_aggregate <- brown_data %>% 
  group_by(PID, modality) %>% 
  summarise(mean_RT = mean(RT)) %>% 
  ungroup()

11.1 Frequentist

In the first section, we will explore how to fit mixed effects models using frequentist statistics and the lme4 package (Bates et al., 2015).

11.1.1 Aggregate participants

For demonstration purposes, we are going to start slow with aggregate data and averaging over the words, leaving us with only a mean response time per condition.

11.1.1.1 Standard linear model, no random effects

We can try to fit a standard linear model using mean_RT as the outcome and modality as the predictor.

brown_model0 <- lm(mean_RT ~ modality,
                     data = brown_aggregate)

summary(brown_model0)

Call:
lm(formula = mean_RT ~ modality, data = brown_aggregate)

Residuals:
    Min      1Q  Median      3Q     Max 
-319.57 -124.34  -24.35  107.35  605.71 

Coefficients:
                    Estimate Std. Error t value Pr(>|t|)    
(Intercept)          1044.07      23.89  43.697   <2e-16 ***
modalityAudiovisual    83.16      33.79   2.461   0.0155 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 173.9 on 104 degrees of freedom
Multiple R-squared:  0.05503,   Adjusted R-squared:  0.04595 
F-statistic: 6.057 on 1 and 104 DF,  p-value: 0.0155

As you will see in upcoming models, the intercept and slope are “accurate”, but the standard errors and degrees of freedom are all off. We have 53 participants but 104 residual degrees of freedom, since we are not factoring in that each participant contributes two observations.
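As a quick sanity check on that point, a paired t-test respects the pairing that lm() ignores. This is just a sketch, assuming the modality labels "Audio-only" and "Audiovisual" as they appear in the data.

```r
# Sketch: a paired t-test accounts for the two observations per participant,
# unlike the standard linear model above
brown_wide <- brown_aggregate %>% 
  pivot_wider(names_from = modality, values_from = mean_RT)

t.test(brown_wide$Audiovisual, brown_wide$`Audio-only`, 
       paired = TRUE)
```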

11.1.1.2 Random intercept for participants

After the brief demonstration, we can fit our first mixed effects model. We have a new function called lmer() instead of lm(). This time, we have a similar formula defining mean_RT as the outcome, and modality is still our main predictor. In a mixed effects model, we refer to this as a fixed effect. The new component is the (1 | PID), which represents our random effect. This is a random intercepts only model, where we are telling R to allow the intercept (the 1 part) to vary for each participant (the PID part).

brown_model1 <- lmer(mean_RT ~ modality + (1 | PID), 
                     data = brown_aggregate)

summary(brown_model1)
Linear mixed model fit by REML. t-tests use Satterthwaite's method [
lmerModLmerTest]
Formula: mean_RT ~ modality + (1 | PID)
   Data: brown_aggregate

REML criterion at convergence: 1305.6

Scaled residuals: 
     Min       1Q   Median       3Q      Max 
-2.59477 -0.46495 -0.00148  0.37616  2.64062 

Random effects:
 Groups   Name        Variance Std.Dev.
 PID      (Intercept) 26066    161.45  
 Residual              4192     64.74  
Number of obs: 106, groups:  PID, 53

Fixed effects:
                    Estimate Std. Error      df t value Pr(>|t|)    
(Intercept)          1044.07      23.89   59.70  43.697  < 2e-16 ***
modalityAudiovisual    83.16      12.58   52.00   6.612 2.05e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Correlation of Fixed Effects:
            (Intr)
modltyAdvsl -0.263

The output for a mixed effects model is similar, but we have some new sections. Since we loaded the lmerTest package, we get p-values via the Satterthwaite method. The two key sections are the random effects and fixed effects.

Starting with the fixed effects, if you scroll up to the standard linear model, you will see the estimates for the intercept and slope are identical, but the standard error has decreased substantially now that we have taken the within-subjects design into consideration.

Under random effects, we accurately code that there are 106 observations but 53 participants by defining the random effect. The standard deviation tells us that we expect the response time to vary by 161ms around the intercept of 1044ms, which represents the response time towards audio stimuli. We only allowed the intercept to vary in this model, not the slope, so the model fits a constant slope while each participant's intercept can move up and down.
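If you want the variance components without scanning the printed summary, you can extract them directly; VarCorr() is an lme4 function and coercing it to a data frame gives one row per component.

```r
# Extract the random effect variances and standard deviations as a data frame
as.data.frame(VarCorr(brown_model1))
```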

You can see this more clearly by accessing the coefficient values, which will print the random effects for each participant.

coef(brown_model1)
$PID
    (Intercept) modalityAudiovisual
301    978.4349            83.16029
302   1006.3888            83.16029
303    882.0067            83.16029
304   1167.7371            83.16029
306   1019.0060            83.16029
307   1063.8531            83.16029
308   1230.6380            83.16029
309    781.2698            83.16029
310   1176.8249            83.16029
311    990.6904            83.16029
312   1137.5894            83.16029
313   1084.9800            83.16029
314   1088.1276            83.16029
315    858.5082            83.16029
316   1336.9827            83.16029
318    940.9711            83.16029
319   1026.4853            83.16029
320    999.6693            83.16029
321   1052.4284            83.16029
324   1056.8015            83.16029
325    819.4797            83.16029
326   1028.3805            83.16029
328   1244.5444            83.16029
329   1008.8456            83.16029
331   1417.8641            83.16029
333   1589.9694            83.16029
334    955.2190            83.16029
335   1153.5106            83.16029
337    924.7565            83.16029
338    841.0854            83.16029
340   1180.4104            83.16029
341    898.8564            83.16029
342   1214.5150            83.16029
343   1049.2312            83.16029
344   1035.4302            83.16029
346    883.4886            83.16029
348    823.7689            83.16029
349    935.0100            83.16029
350   1172.5127            83.16029
351   1176.7303            83.16029
352    811.5272            83.16029
353   1132.9344            83.16029
354   1106.1102            83.16029
355   1185.4441            83.16029
356   1062.5312            83.16029
357    918.9947            83.16029
358    945.6114            83.16029
359   1028.0600            83.16029
360   1042.4988            83.16029
361   1008.7396            83.16029
362    947.0871            83.16029
363    910.8909            83.16029
364   1002.4373            83.16029

attr(,"class")
[1] "coef.mer"

The intercept value changes for each participant as some are faster and some are slower on average. But then the slope estimate is identical for each participant.
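The related ranef() function shows the same information expressed as deviations: each participant's intercept is reported relative to the fixed effect of 1044ms rather than as an absolute value, and adding the fixed intercept back recovers the numbers printed by coef().

```r
# ranef() returns each participant's deviation from the fixed effects
head(ranef(brown_model1)$PID)
```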

You can also see this visually using the sjPlot package and plot_model() function.

plot_model(brown_model1, 
           type = "re") # plot random effects

Zero on this plot represents the fixed effect intercept of 1044ms. Each participant then represents a deviation from this as the random effect. Participants in red are below the intercept (i.e., they respond faster), whereas those in blue are above the intercept (i.e., they respond slower).

11.1.2 Trial-level data

Now the function is a little less of a mystery, we can step things up and use mixed effects models to their full capability on trial-level data. If you explore the data, you can see the difference. In the aggregate data, participants contribute two rows:

head(brown_aggregate)
PID modality mean_RT
301 Audio-only 1027.4106
301 Audiovisual 1002.0642
302 Audio-only 1047.3455
302 Audiovisual 1042.5323
303 Audio-only 882.6735
303 Audiovisual 938.4381

Whereas in the full data, participants contribute 543 rows, one for each word:

head(brown_data)
PID RT modality stim
301 1024 Audio-only gown
301 838 Audio-only might
301 1060 Audio-only fern
301 882 Audio-only vane
301 971 Audio-only pup
301 1064 Audio-only rise

This means we can model the variability within both participants and different words, both of which we expect to be sampled from wider distributions of possible participants and words.
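You can check this nesting structure yourself with a quick count; note the totals may fall a little short of 543 rows per participant (or 53 per word) if any trials were excluded during data preparation.

```r
# Count rows per participant and per word to check the nesting structure
brown_data %>% count(PID) %>% head()
brown_data %>% count(stim) %>% head()
```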

11.1.2.1 Random intercepts only

As we work towards fitting the “maximal” model, a useful starting point for conducting a likelihood ratio test later is fitting the random effects with no fixed effects. We are only modelling the variability of responses across participants and words.

brown_model2 <- lmer(RT ~ (1 | PID) + (1 | stim), 
                     data = brown_data)

summary(brown_model2)
Linear mixed model fit by REML. t-tests use Satterthwaite's method [
lmerModLmerTest]
Formula: RT ~ (1 | PID) + (1 | stim)
   Data: brown_data

REML criterion at convergence: 303405.8

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-3.4561 -0.7030 -0.0292  0.5919  4.7565 

Random effects:
 Groups   Name        Variance Std.Dev.
 stim     (Intercept)   330.6   18.18  
 PID      (Intercept) 28075.5  167.56  
 Residual             68943.6  262.57  
Number of obs: 21679, groups:  stim, 543; PID, 53

Fixed effects:
            Estimate Std. Error      df t value Pr(>|t|)    
(Intercept)  1086.42      23.10   52.14   47.03   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The intercept represents the average across all trials - around 1086ms - ignoring the fixed effect of modality.

11.1.2.2 Modality and random intercepts

In the next model, we now add our fixed effect of modality.

brown_model3 <- lmer(RT ~ modality + (1 | PID) + (1 | stim), 
                     data = brown_data)

summary(brown_model3)
Linear mixed model fit by REML. t-tests use Satterthwaite's method [
lmerModLmerTest]
Formula: RT ~ modality + (1 | PID) + (1 | stim)
   Data: brown_data

REML criterion at convergence: 302861.5

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-3.3572 -0.6949 -0.0205  0.5814  4.9120 

Random effects:
 Groups   Name        Variance Std.Dev.
 stim     (Intercept)   360.2   18.98  
 PID      (Intercept) 28065.7  167.53  
 Residual             67215.3  259.26  
Number of obs: 21679, groups:  stim, 543; PID, 53

Fixed effects:
                     Estimate Std. Error        df t value Pr(>|t|)    
(Intercept)          1044.449     23.164    52.781   45.09   <2e-16 ***
modalityAudiovisual    82.525      3.529 21433.526   23.38   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Correlation of Fixed Effects:
            (Intr)
modltyAdvsl -0.078

The estimates are pretty close to our aggregate model, but now we are modelling the full study design with its different sources of variability. The intercept for the audio modality is 1044ms and we expect responses in the audiovisual modality to be 83ms slower on average.

If you look at random effects, we have two key standard deviations to explore. We estimate participants’ intercepts to vary by around 168ms but only around 19ms for word stimuli. This means the responses of different participants are more variable than different stimuli.

As before, we can visualise the random effects to see the broad pattern of variability. Just keep in mind, when we have a huge number of stimuli like the words here, it can be difficult to distinguish the different elements in the plot.

plot_model(brown_model3, 
           type = "re")
[[1]]


[[2]]

11.1.2.3 Modality, random intercepts, and slopes

The final model we will fit is the “maximal” model for our design, where we allow both the intercept and slope to vary. You can see the different syntax where a random intercept has the format (1 | random_effect), whereas a random intercept and slope has the format (1 + fixed_effect | random_effect).

brown_model4 <- lmer(RT ~ modality + (1 + modality | PID) + (1 + modality | stim), 
                     data = brown_data)

summary(brown_model4)
Linear mixed model fit by REML. t-tests use Satterthwaite's method [
lmerModLmerTest]
Formula: RT ~ modality + (1 + modality | PID) + (1 + modality | stim)
   Data: brown_data

REML criterion at convergence: 302385.7

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-3.3646 -0.6964 -0.0140  0.5886  5.0003 

Random effects:
 Groups   Name                Variance Std.Dev. Corr 
 stim     (Intercept)           304.0   17.44        
          modalityAudiovisual   216.6   14.72   0.16 
 PID      (Intercept)         28554.1  168.98        
          modalityAudiovisual  7709.1   87.80   -0.17
 Residual                     65258.8  255.46        
Number of obs: 21679, groups:  stim, 543; PID, 53

Fixed effects:
                    Estimate Std. Error      df t value Pr(>|t|)    
(Intercept)          1044.14      23.36   52.13  44.703  < 2e-16 ***
modalityAudiovisual    83.18      12.57   52.10   6.615 2.02e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Correlation of Fixed Effects:
            (Intr)
modltyAdvsl -0.178

The random effects structure now has four key rows. We can look at the standard deviation around the intercept for word stimuli (17ms) and its slope (15ms). Likewise, we can explore the intercept for participants (169ms) and its slope (88ms). We are now allowing both the response time for audio stimuli (intercepts) and the difference to audiovisual stimuli (slopes) to vary for each participant and stimulus.

For a completely wild graph, you can now see four panels for the random intercept and slope for each random effect. The same interpretation applies where 0 represents the fixed effect value, and values above and below are deviations from this for each level of the random effect.

plot_model(brown_model4, 
           type = "re")
[[1]]


[[2]]

11.1.2.4 Model comparison

The final step here is considering model comparison. We can ask questions such as “does modality have an effect on response times?” and “is this more complicated model worthwhile?”.

By fitting the three models to our trial-level data, we can pass them to the anova() function. This is known as a likelihood ratio test.

anova(brown_model2, 
      brown_model3, 
      brown_model4)
refitting model(s) with ML (instead of REML)
             npar      AIC      BIC    logLik deviance    Chisq Df Pr(>Chisq)
brown_model2    4 303421.9 303453.9 -151707.0 303413.9       NA NA         NA
brown_model3    5 302884.0 302923.9 -151437.0 302874.0 539.9420  1          0
brown_model4    9 302418.7 302490.5 -151200.3 302400.7 473.3161  4          0

You can see the warning about it refitting the models using maximum likelihood instead of restricted maximum likelihood. We could do this manually by adding the argument REML = FALSE to the lmer() function.
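For example, a sketch of refitting model 3 with maximum likelihood yourself:

```r
# Refit model 3 with maximum likelihood instead of REML; anova() would then
# not need to refit it before the likelihood ratio test
brown_model3_ML <- lmer(RT ~ modality + (1 | PID) + (1 | stim), 
                        data = brown_data, 
                        REML = FALSE)
```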

We can see the difference between model 2 (random effects only) and model 3 (modality as a fixed effect and random intercepts) is statistically significant and we have a lower AIC value. Modality does have an effect on response times. The difference between model 3 and model 4 (random intercepts and slopes) is also significant. Even though model 4 is the most complex (9 parameters to estimate), it is a better fitting model relative to its complexity.

11.2 Bayesian

Now we have fitted the models using lme4 and frequentist statistics, we can apply that understanding to fitting Bayesian mixed effects models using the brms package (Bürkner, 2017).

As an initial warning, these models are going to take 15-30 minutes to run on your computer. I have provided the model objects for you to work with quickly, but they are quite computationally expensive to run yourself. This is also why I have focused on the random intercept only model, purely for pragmatic reasons. It takes up to an hour to fit a random intercept and slope model for these data. The formula structure is identical though, so you are welcome to try and fit the “maximal” model yourself.
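If you do want to try it, the maximal formula simply mirrors the lme4 syntax from earlier. This is a sketch of the formula only; fitting it with brm() is the slow part.

```r
# Sketch of the "maximal" Bayesian model formula; fitting this with brm()
# can take an hour or more depending on your computer
brown_maximal_model <- bf(RT ~ modality + (1 + modality | PID) + (1 + modality | stim))
```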

11.2.1 Default priors

We will start by fitting a model with default priors to explore the main difference in using brms compared to lme4. We can define our descriptive model as:

brown_intercept_model <- bf(RT ~ modality + (1 | PID) + (1 | stim)) 

We can pass that to the brm() function without specifying priors, updating to our posterior probability distributions. You can download the model objects to speed things up; otherwise, the file argument will save the fitted model to a “Models” sub-folder on your computer if it is not already saved there.

brown_default_priors <- brm(formula = brown_intercept_model, 
                            family = gaussian(), 
                            data = brown_data, 
                              file = "Models/brown_default_priors")

summary(brown_default_priors)
 Family: gaussian 
  Links: mu = identity; sigma = identity 
Formula: RT ~ modality + (1 | PID) + (1 | stim) 
   Data: brown_data (Number of observations: 21679) 
  Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup draws = 4000

Multilevel Hyperparameters:
~PID (Number of levels: 53) 
              Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept)   169.56     16.59   141.56   205.40 1.00      634     1322

~stim (Number of levels: 543) 
              Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept)    18.22      3.72     9.25    24.62 1.00      942      670

Regression Coefficients:
                    Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept            1040.69     22.97   996.04  1087.17 1.01      263      604
modalityAudiovisual    82.52      3.57    75.46    89.66 1.00    10771     2504

Further Distributional Parameters:
      Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma   259.30      1.25   256.86   261.75 1.00     6120     2854

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).

The organisation of brms models is similar: we have model fitting information, random effects (here given the fancier name of multilevel hyperparameters), and fixed effects. The interpretation is identical, but given we are working in a Bayesian framework, each parameter in the model has its own posterior distribution.

Our estimates are similar to the frequentist equivalents, but there are some potential problems we have not encountered in previous Bayesian models, given we are now working with more complex models. The R-hat values are all good, but several effective sample size values are far too small, in the low hundreds.
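You can also pull these diagnostics out programmatically rather than reading them off the summary; rhat() and neff_ratio() are diagnostic functions brms re-exports from bayesplot, with neff_ratio() giving the effective sample size as a proportion of the total draws.

```r
# Summarise convergence diagnostics across all parameters at once
summary(rhat(brown_default_priors))
summary(neff_ratio(brown_default_priors))
```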

We will not explore everything for this default prior model, but we can explore the posteriors and trace plots. The intercept seems to be the most problematic, in line with the low effective sample size values, with its trace plot here looking particularly choppy.

plot(brown_default_priors)

The final thing we will look at here is the posterior predictive check to make sure assuming normality is a reasonable choice.

pp_check(brown_default_priors, 
         ndraws = 100)

Brown (2021) notes in the tutorial that the response time data have been edited to be more suitable for teaching contexts. Presumably this makes them look a little more normal, or the responses here are just much slower, as a normal distribution suits the data quite well. Response time data normally have a characteristic positive skew, where you would need to consider something like a gamma distribution.

11.2.2 Informed priors

The model we will explore in a little more detail is setting informed priors for the parameters we can set reasonable expectations on. Using our descriptive model and data, we can see what brms will let us set priors on.

get_prior(brown_intercept_model, 
          data = brown_data)
                     prior     class                coef group resp dpar nlpar lb ub  source
                                   b                                                 default
                                   b modalityAudiovisual                             default
 student_t(3, 1057, 289.1) Intercept                                                 default
    student_t(3, 0, 289.1)        sd                                            0    default
                                  sd                       PID                       default
                                  sd           Intercept   PID                       default
                                  sd                      stim                       default
                                  sd           Intercept  stim                       default
    student_t(3, 0, 289.1)     sigma                                            0    default

In a mixed effects model, we have more parameters than we have set so far. We have our standard intercept, slope, and sigma. We also have standard deviation parameters for our two random effects. It looks busier than it really is as brms will allow you to set priors on all standard deviation parameters or specify individual parameters.
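For illustration, here is how you could set a prior on all standard deviation parameters at once, or on a single grouping factor via the group argument of set_prior(). The values here are hypothetical placeholders, not recommendations.

```r
# Hypothetical sketch: one prior covering every sd parameter...
sd_priors_all <- set_prior("student_t(3, 0, 100)", class = "sd")

# ...or a prior for just the participant-level standard deviation
sd_prior_pid <- set_prior("student_t(3, 0, 150)", class = "sd", group = "PID")
```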

For our model, we can use Brown and Strand (2019), a previous article from the tutorial author where they reported a similar study. It is very rare to see people reporting values for sigma or standard deviations, so we are just going to focus on setting priors for the intercept and slope. We do not have enough subject knowledge to set the other priors.

In Brown and Strand (2019), they report descriptive statistics for the mean response time to audio stimuli between 998ms and 1089ms, with an SD between 161 and 203. We can use this information to set a prior for the intercept. There are other predictors in the models they fit, but the modality effect ranged between 21ms and 48ms slower for audiovisual stimuli. The larger effects were statistically significant and the smaller effects were non-significant, so we can bake in a wide range of uncertainty: we expect positive values for the slope, but include some negative values to cover the non-significant effects.

brown_priors <- set_prior("normal(35, 20)", class = "b") + 
  set_prior("normal(1040, 180)", class = "Intercept")

brown_priors %>% 
  parse_dist() %>% # Function from tidybayes/ggdist to turn prior into a dataframe
  mutate(prior_label = paste0(class, ": ", prior)) %>% # add the class and prior
  ggplot(aes(y = 0, dist = .dist, args = .args, fill = prior_label)) + # Fill in details from prior and add fill
  stat_slab(normalize = "panels") + # ggdist layer to visualise distributions
  scale_fill_viridis_d(option = "plasma", end = 0.9) + # Add colour scheme
  guides(fill = "none") + # Remove legend for fill
  facet_wrap(~prior_label, scales = "free") + # Split into a different panel for each prior
  labs(x = "Value", y = "Density") +
  theme_classic()

Now we have our priors, we can refit the model. Given the effective sample size values, I also increased the number of iterations. This will produce 8000 post-warmup draws instead of 4000 to try to increase the effective sample size.

brown_informed_priors <- brm(formula = brown_intercept_model, 
                             prior = brown_priors,
                            family = gaussian(), 
                            data = brown_data, 
                            iter = 4000,
                            file = "Models/brown_informed_priors")

summary(brown_informed_priors)
 Family: gaussian 
  Links: mu = identity; sigma = identity 
Formula: RT ~ modality + (1 | PID) + (1 | stim) 
   Data: brown_data (Number of observations: 21679) 
  Draws: 4 chains, each with iter = 4000; warmup = 2000; thin = 1;
         total post-warmup draws = 8000

Multilevel Hyperparameters:
~PID (Number of levels: 53) 
              Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept)   171.76     16.76   143.16   208.65 1.00     1173     2384

~stim (Number of levels: 543) 
              Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept)    18.22      3.77     9.57    24.70 1.00     1713     1266

Regression Coefficients:
                    Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept            1043.64     23.27   998.27  1089.81 1.01      387      928
modalityAudiovisual    81.08      3.45    74.28    87.85 1.00    21349     5464

Further Distributional Parameters:
      Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma   259.32      1.25   256.87   261.79 1.00    16344     5110

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).

Setting priors and increasing the number of iterations has helped the model, but it is still not quite ideal. Really, you would want to keep tweaking the model until the effective sample size was in the high hundreds for the intercept, but it has improved for all the other parameters. A low effective sample size points to potentially unreliable parameter estimates, but in the interests of time we will explore this model anyway, given the estimates are so close to the frequentist versions.

The posteriors and trace plots look similar, with the intercept just looking a little choppy.

plot(brown_informed_priors)

One helpful type of plot is superimposing the prior with the probability of direction, to look at the consistency between the choice of prior and the posterior. If there is a mismatch, it does not automatically mean there is a problem, but it at least prompts you to reflect and ask questions of your model choices.

plot(p_direction(brown_informed_priors), 
     priors = TRUE)

Here, we can see the slope estimate is firmly positive with quite a narrow posterior compared to the prior we set. We allowed a wide range given the previous study, so the data have been the dominant force here.

If we wanted to plot the model predictions and help communicate our findings, we can take the conditional effects plot and tidy things up a little.

conditional_plot <- conditional_effects(brown_informed_priors)

plot(conditional_plot, 
     plot = FALSE)[[1]] + #I don't know why you need this, but it doesn't work without
  theme_classic() + 
  scale_y_continuous(limits = c(900, 1200), 
                     breaks = seq(900, 1200, 50)) + 
  labs(x = "Modality", 
       y = "Response Time (ms)")

There is not anything different to the descriptive model, but we can run the posterior predictive check just to follow the process.

pp_check(brown_informed_priors, 
         ndraws = 100)