8  Mediation models

For this chapter we’ll go through the steps to estimate a mediation model, including calculation of direct and indirect effects (see lecture for definitions).

The basic structure of a mediation model is as shown in Figure 8.1 below.

Figure 8.1: A basic three-variable mediation structure

In this example, we are interested in the effect on productivity of having people work on-site in an office (in-person working, or IP) versus at home. Let’s say that we have surveyed 1,000 firms, calculated the proportion of employees located at the main worksite (IP), and obtained a measure of each firm’s productivity (PR). Download the (simulated; not real) data if you want to follow along.

## first load in the data
library("tidyverse")
library("lavaan")

wpdat <- read_csv("data/workplace.csv", col_types = "ddd")

wpdat
# A tibble: 1,000 × 3
      IP    OI    PR
   <dbl> <dbl> <dbl>
 1 0.202  75.8  93.7
 2 0.393  84.1  96.8
 3 0.367  80.5  64.8
 4 0.556  74.6  72.3
 5 0.538  64.0  54.0
 6 0.720  91.2  72.8
 7 0.494  79.3  71.1
 8 0.385  60.8  58.2
 9 0.334  90.6  94.1
10 0.483  67.2  82.5
# ℹ 990 more rows

To begin, let’s look at the unmediated association between in-person working and productivity. We can do this with just a very simple regression model.

mod_unmediated <- lm(PR ~ IP, data = wpdat)
summary(mod_unmediated)

Call:
lm(formula = PR ~ IP, data = wpdat)

Residuals:
    Min      1Q  Median      3Q     Max 
-41.358  -8.025   0.259   8.522  38.115 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   69.600      1.771  39.291  < 2e-16 ***
IP            16.193      3.541   4.574 5.39e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 12.26 on 998 degrees of freedom
Multiple R-squared:  0.02053,   Adjusted R-squared:  0.01955 
F-statistic: 20.92 on 1 and 998 DF,  p-value: 5.393e-06

From this model, we can see that the higher the proportion of in-person workers, the higher the productivity (with productivity measured on a scale of 1-100). The coefficient of 16.193 means that if we increased the proportion of workers on-site by 0.3, productivity would go up by about 4.858 points (16.193 × 0.3).
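As a quick sanity check on that arithmetic (a sketch, assuming the mod_unmediated fit from above):

```r
## predicted change in PR for a 0.3 increase in the proportion on-site
coef(mod_unmediated)["IP"] * 0.3  # about 4.86 points
```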

Let’s assume that you have a theory about why in-person working is important for productivity—namely, that in-person interaction is critical for the development of “organizational intelligence” (OI), which might be thought of as the extent to which a typical worker understands the various roles and practices of the organization, allowing them to see how their own efforts fit into the whole and thus to work more fluidly with others as a team.

So to test this, we’d want to know whether (1) the causal path \(IP \rightarrow OI \rightarrow PR\) is statistically significant, and (2) what proportion of the total effect of \(IP\) on \(PR\) is explained by this mechanism versus other remaining (unmeasured) variables that account for the relationship. We may also want to know (3) whether the statistically significant direct effect of IP on PR ‘survives’ after controlling for the mediated effect via OI.

The mediated effect \(IP \rightarrow OI \rightarrow PR\) can be calculated as the product \(a \times b\), where \(a\) and \(b\) are the path coefficients shown in Figure 8.1. This tells you how much of a one-unit change in IP is transmitted to PR through OI.

The total effect is then \(ab + c'\), where \(c'\) is the direct effect of IP on PR after controlling for the mediated effect of IP on PR through OI.
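Because everything here is a linear model with observed variables, this decomposition holds exactly, and we can verify it by hand with two lm() fits (a sketch, assuming wpdat is loaded as above):

```r
## path a: effect of IP on the mediator
a <- coef(lm(OI ~ IP, data = wpdat))["IP"]

## paths b and c': effects of OI and IP on PR, each controlling for the other
fit_pr <- lm(PR ~ OI + IP, data = wpdat)
b      <- coef(fit_pr)["OI"]
cprime <- coef(fit_pr)["IP"]

## indirect plus direct should reproduce the coefficient from PR ~ IP
a * b + cprime
coef(lm(PR ~ IP, data = wpdat))["IP"]
```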

8.1 Fitting mediation models

Let’s take a look at how to fit the model and estimate these effects in lavaan.

We can specify the model in one of two ways: using either two or three regression equations. Both specifications are equivalent, but using two equations (one for each endogenous variable) is the more typical approach.

library("lavaan")

mod_med_2 <- "
# path a
OI ~ IP
# paths b and c'
PR ~ OI + IP"

mod_fit_2 <- sem(mod_med_2, data = wpdat)

We can specify all the paths individually, just to check that the model is the same.

mod_med_3 <- "
# direct effect (path c')
PR ~ IP
# path a
OI ~ IP
# path b
PR ~ OI"

mod_fit_3 <- sem(mod_med_3, data = wpdat)

parameterEstimates(mod_fit_3) |> arrange(op, lhs, rhs)
  lhs op rhs     est    se      z pvalue ci.lower ci.upper
1  OI  ~  IP  22.978 3.020  7.608  0.000   17.059   28.898
2  PR  ~  IP   0.162 2.922  0.055  0.956   -5.565    5.888
3  PR  ~  OI   0.698 0.030 23.458  0.000    0.639    0.756
4  IP ~~  IP   0.012 0.000     NA     NA    0.012    0.012
5  OI ~~  OI 109.309 4.888 22.361  0.000   99.728  118.890
6  PR ~~  PR  96.697 4.324 22.361  0.000   88.221  105.172

If we compare these to the parameter estimates from the two-equation syntax, we can see that they are identical.

parameterEstimates(mod_fit_2) |> arrange(op, lhs, rhs)
  lhs op rhs     est    se      z pvalue ci.lower ci.upper
1  OI  ~  IP  22.978 3.020  7.608  0.000   17.059   28.898
2  PR  ~  IP   0.162 2.922  0.055  0.956   -5.565    5.888
3  PR  ~  OI   0.698 0.030 23.458  0.000    0.639    0.756
4  IP ~~  IP   0.012 0.000     NA     NA    0.012    0.012
5  OI ~~  OI 109.309 4.888 22.361  0.000   99.728  118.890
6  PR ~~  PR  96.697 4.324 22.361  0.000   88.221  105.172

Let’s look at the model output, including \(R^2\) and standardized estimates.

summary(mod_fit_2, standardized=TRUE, rsquare=TRUE)
lavaan 0.6-19 ended normally after 1 iteration

  Estimator                                         ML
  Optimization method                           NLMINB
  Number of model parameters                         5

  Number of observations                          1000

Model Test User Model:
                                                      
  Test statistic                                 0.000
  Degrees of freedom                                 0

Parameter Estimates:

  Standard errors                             Standard
  Information                                 Expected
  Information saturated (h1) model          Structured

Regressions:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
  OI ~                                                                  
    IP               22.978    3.020    7.608    0.000   22.978    0.234
  PR ~                                                                  
    OI                0.698    0.030   23.458    0.000    0.698    0.606
    IP                0.162    2.922    0.055    0.956    0.162    0.001

Variances:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
   .OI              109.309    4.888   22.361    0.000  109.309    0.945
   .PR               96.697    4.324   22.361    0.000   96.697    0.632

R-Square:
                   Estimate
    OI                0.055
    PR                0.368

Our model is just-identified (\(df = 0\)), meaning that the model fit is perfect and we have no spare degrees of freedom for assessing model fit. We can also see that our model gives us an \(R^2\) of 0.368 for the endogenous variable PR, which means that it accounts for about 36.8% of the variation in productivity. The effect of in-person work accounts for only about 5.5% of the variation in organizational intelligence, indicating that there is much more to learn about how such knowledge develops. Finally, we can see from the output that the direct effect \(c'\) (PR ~ IP; \(p = 0.956\)) is not significant in this model (it did not ‘survive’ controlling for the mediation). We should be careful not to accept the null hypothesis, however; we would have to perform a more stringent analysis such as an equivalence test before doing so.

8.2 Calculating indirect effects

What’s missing from the output above are calculations of the indirect effect, the total effect, and the proportion of the total effect explained by the indirect effect. We could compute these from the output using simple arithmetic, but then we would lack information about their standard errors.

To get these values we need to update the model syntax and include calculations by using the x := ... operator, which you can think of as variable x “is calculated as” (whatever is on the right hand side). Using this syntax you can do arbitrary calculations and get statistical information in the output.

However, to perform such calculations we first need to be able to give model parameters predictable names. Up to now, when we specified a regression model in lavaan we’ve followed the standard R model syntax, where the regression coefficients are omitted.

However, we can also use a lavaan-specific syntax in which we give the coefficients names: instead of y ~ x, we write y ~ b1 * x. This tells lavaan to give the coefficient for x the label b1 rather than choosing some default name, which allows us to refer to it again later. Labels are limited to alphanumeric characters (A-Z, a-z, and 0-9), so we can’t legally call something c', but we can call it cprime.

mod_med_names <- "
# path a
OI ~ a * IP
# paths b and c'
PR ~ b * OI + cprime * IP
"

mod_fit_names <- sem(mod_med_names, data = wpdat)
parameterEstimates(mod_fit_names)
  lhs op rhs  label     est    se      z pvalue ci.lower ci.upper
1  OI  ~  IP      a  22.978 3.020  7.608  0.000   17.059   28.898
2  PR  ~  OI      b   0.698 0.030 23.458  0.000    0.639    0.756
3  PR  ~  IP cprime   0.162 2.922  0.055  0.956   -5.565    5.888
4  OI ~~  OI        109.309 4.888 22.361  0.000   99.728  118.890
5  PR ~~  PR         96.697 4.324 22.361  0.000   88.221  105.172
6  IP ~~  IP          0.012 0.000     NA     NA    0.012    0.012

Now we can expand the syntax further to compute our mediation statistics.

mod_med_stats <- "
# regression equations
OI ~ a * IP
PR ~ b * OI + cprime * IP

# compute mediation stats; indir = indirect effect
indir := a * b
total := indir + cprime
prop := indir / total
"

mod_fit_stats <- sem(mod_med_stats, data = wpdat)
parameterEstimates(mod_fit_stats)
    lhs op          rhs  label     est    se      z pvalue ci.lower ci.upper
1    OI  ~           IP      a  22.978 3.020  7.608  0.000   17.059   28.898
2    PR  ~           OI      b   0.698 0.030 23.458  0.000    0.639    0.756
3    PR  ~           IP cprime   0.162 2.922  0.055  0.956   -5.565    5.888
4    OI ~~           OI        109.309 4.888 22.361  0.000   99.728  118.890
5    PR ~~           PR         96.697 4.324 22.361  0.000   88.221  105.172
6    IP ~~           IP          0.012 0.000     NA     NA    0.012    0.012
7 indir :=          a*b  indir  16.032 2.215  7.237  0.000   11.690   20.374
8 total := indir+cprime  total  16.193 3.537  4.578  0.000    9.261   23.126
9  prop :=  indir/total   prop   0.990 0.179  5.539  0.000    0.640    1.340

Now we get statistics for the computed values in the output. Note that the \(z\) statistic in the output is our test statistic; for the indirect effect, this is known as the Sobel statistic. This test assumes normality, but that assumption is unlikely to hold for an estimator calculated as a product of coefficients. A better method for getting standard errors is therefore bootstrapping, which we can request in the call to sem() using the argument se = "boot". Let’s re-fit the model. This will take longer due to the bootstrapping.

mod_fit_boot <- sem(mod_med_stats, data = wpdat, se = "boot")
Warning: lavaan->lav_model_nvcov_bootstrap():  
   19 bootstrap runs failed or did not converge.

We got a warning about non-convergence when re-fitting the model to the bootstrapped samples. This can happen sometimes. The default number of bootstrap samples is 1,000, so we don’t have to worry too much about losing 19 of them. But if many more had failed, we might want to increase the number of bootstrap samples (see ?lavOptions for information).

parameterEstimates(mod_fit_boot)
    lhs op          rhs  label     est    se      z pvalue ci.lower ci.upper
1    OI  ~           IP      a  22.978 2.978  7.716  0.000   17.240   28.979
2    PR  ~           OI      b   0.698 0.031 22.374  0.000    0.634    0.760
3    PR  ~           IP cprime   0.162 2.822  0.057  0.954   -5.386    5.500
4    OI ~~           OI        109.309 4.827 22.646  0.000  100.536  118.833
5    PR ~~           PR         96.697 4.243 22.788  0.000   88.222  104.994
6    IP ~~           IP          0.012 0.000     NA     NA    0.012    0.012
7 indir :=          a*b  indir  16.032 2.153  7.445  0.000   11.881   20.296
8 total := indir+cprime  total  16.193 3.427  4.725  0.000    9.554   22.642
9  prop :=  indir/total   prop   0.990 0.217  4.571  0.000    0.736    1.499

From the output we can see that our estimated mediation effect is 16.032, which is statistically significant, \(z = 7.445\), \(p < .001\). We can also see that the mediation effect accounts for about 99.0% of the total effect.
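One refinement worth knowing about: the confidence intervals above are still symmetric intervals built from the bootstrap standard errors. Since the sampling distribution of a product like \(ab\) is typically skewed, percentile bootstrap intervals, taken directly from the bootstrap distribution, are often preferred. lavaan can report these via the boot.ci.type argument to parameterEstimates() (a sketch, assuming the mod_fit_boot object from above):

```r
library("lavaan")

## percentile bootstrap CIs from the bootstrap distribution itself,
## rather than a normal approximation (requires a fit with se = "boot")
parameterEstimates(mod_fit_boot, boot.ci.type = "perc")
```

With percentile intervals, the ci.lower and ci.upper columns for indir may no longer be symmetric around the estimate; the decision rule is the same (check whether the interval excludes zero).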