7  Path analysis

7.1 SEM software: lavaan

For estimating structural equation models in R we will use the lavaan add-on package (Rosseel, 2012) for R. You can install it by typing install.packages("lavaan") in your R console window.

The main workhorse function for lavaan is called, simply, lavaan(). This is a powerful function that lets you specify a wide array of models. You can see all the options by looking at the help page (?lavaan::lavaan):

lavaan                 package:lavaan                  R Documentation

Fit a Latent Variable Model

Description:

     Fit a latent variable model.

Usage:

     lavaan(model = NULL, data = NULL, ordered = NULL,
            sampling.weights   = NULL,
            sample.cov = NULL, sample.mean = NULL, sample.th = NULL,
            sample.nobs = NULL,
            group = NULL, cluster = NULL, constraints = "", 
            WLS.V = NULL, NACOV = NULL, ov.order = "model",
            slotOptions = NULL, slotParTable = NULL, slotSampleStats = NULL,
            slotData = NULL, slotModel = NULL, slotCache = NULL,
            sloth1 = NULL,
            ...)

The ... in the function syntax suggests there are more options than are listed, and the help page says “See lavOptions for a complete list.” And if you type lavOptions() you will see there are 113 additional options! This can be quite overwhelming for the novice user.

Fortunately, lavaan provides two ‘convenience’ functions that hide away some of this complexity and that automatically choose sensible defaults. These functions are:

function description
sem() Fit a structural equation model
cfa() Fit a confirmatory factor analysis

In what follows, we’ll be using sem() to fit a path model. The function documentation indicates what options of lavaan are set to by default (see ?lavOptions for further information).

     The ‘sem’ function is a wrapper for the more general lavaan
     function, but setting the following default options: ‘int.ov.free
     = TRUE’, ‘int.lv.free = FALSE’, ‘auto.fix.first = TRUE’ (unless
     ‘std.lv = TRUE’), ‘auto.fix.single = TRUE’, ‘auto.var = TRUE’,
     ‘auto.cov.lv.x = TRUE’, ‘auto.efa = TRUE’, ‘auto.th = TRUE’,
     ‘auto.delta = TRUE’, and ‘auto.cov.y = TRUE’.

Many of these options are only relevant for models including latent variables, which our path models will not include. Thus, when using path analyses, these are the defaults you should keep in mind.

option setting effect
int.ov.free=TRUE intercepts of observed variables are not fixed to zero
auto.var=TRUE residual variances for observed variables are set free
auto.cov.y=TRUE covariances of endogenous variables are modeled and set free

7.2 Example: SEM vs regression

We will begin our practical introduction to path analysis with an example that represents a simple extension from ordinary regression modeling—namely, a situation where you have multiple response variables.

Let’s assume we’re interested in the social determinants of narcisissm and we have given a (subclinical) population a narcissism questionnaire (scale goes from 1-100).

To follow along, download the following files. NB: These files contain simulated (i.e., not real) data, so don’t draw any conclusions about narcissism from the analyses.

file description
narc_1.csv narcisissm data, 1 outcome variable
narc_2.csv narcisissm data, 2 outcome variables

7.2.1 single outcome data

We might have the following theory of narcisissm, where parenting style (specifically, the degree of “helicopter” parenting where parents smother their child with attention) and economic privilege jointly determine the level of adult narcissism.

Here, the SEM framework wouldn’t buy us anything because we can fit this model using basic regression.

library("tidyverse") # for read_csv
library("lavaan")    # for SEM; we'll need these functions later
This is lavaan 0.6-19
lavaan is FREE software! Please report any bugs.
narc_1 <- read_csv("data/narc_1.csv", col_types = "ddd")

mod_reg <- lm(Narc ~ HP + EP, data = narc_1)

summary(mod_reg)

Call:
lm(formula = Narc ~ HP + EP, data = narc_1)

Residuals:
     Min       1Q   Median       3Q      Max 
-18.6504  -4.5028  -0.3351   4.4134  18.0993 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 42.62670    6.48476   6.573 4.31e-10 ***
HP           0.18531    0.06704   2.764  0.00625 ** 
EP           0.33032    0.12922   2.556  0.01133 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 7.149 on 197 degrees of freedom
Multiple R-squared:  0.07027,   Adjusted R-squared:  0.06083 
F-statistic: 7.445 on 2 and 197 DF,  p-value: 0.0007642
sem_formula <- 'Narc ~ HP + EP'

mod_sem <- sem(sem_formula, data = narc_1)

summary(mod_sem, rsquare = TRUE)
lavaan 0.6-19 ended normally after 1 iteration

  Estimator                                         ML
  Optimization method                           NLMINB
  Number of model parameters                         3

  Number of observations                           200

Model Test User Model:
                                                      
  Test statistic                                 0.000
  Degrees of freedom                                 0

Parameter Estimates:

  Standard errors                             Standard
  Information                                 Expected
  Information saturated (h1) model          Structured

Regressions:
                   Estimate  Std.Err  z-value  P(>|z|)
  Narc ~                                              
    HP                0.185    0.067    2.785    0.005
    EP                0.330    0.128    2.576    0.010

Variances:
                   Estimate  Std.Err  z-value  P(>|z|)
   .Narc             50.345    5.034   10.000    0.000

R-Square:
                   Estimate
    Narc              0.070

You can see that the regression coefficients closely match the ones from the regression. We also get an estimate of the residual variance for the DV of 50.345, which is close to the “Residual standard error” in the regression model, because sqrt(50.345) = 7.095, which is close to 7.149.

The summary suggests that we have estimated 3 parameters. What are they? Let’s use lavaan’s parTable() function to see.

parTable(mod_sem)
  id  lhs op  rhs user block group free ustart exo label plabel  start    est
1  1 Narc  ~   HP    1     1     1    1     NA   0         .p1.  0.185  0.185
2  2 Narc  ~   EP    1     1     1    2     NA   0         .p2.  0.330  0.330
3  3 Narc ~~ Narc    0     1     1    3     NA   0         .p3. 50.345 50.345
4  4   HP ~~   HP    0     1     1    0     NA   1         .p4. 56.991 56.991
5  5   HP ~~   EP    0     1     1    0     NA   1         .p5.  1.423  1.423
6  6   EP ~~   EP    0     1     1    0     NA   1         .p6. 15.341 15.341
     se
1 0.067
2 0.128
3 5.034
4 0.000
5 0.000
6 0.000

The ‘free’ column counts the free parameters. You’ll note that Narc ~ HP, Narc ~ EP and Narc ~~ Narc are the (population) direct effects and variance that are being estimated. All the variances/covariances for the exogenous variables are just directly calculated from the sample, and so they have no standard errors (= 0).

We can also get confidence intervals using the parameterEstimates() function, which also puts the estimates in a table, making them easier to access for reporting.

parameterEstimates(mod_sem)
   lhs op  rhs    est    se      z pvalue ci.lower ci.upper
1 Narc  ~   HP  0.185 0.067  2.785  0.005    0.055    0.316
2 Narc  ~   EP  0.330 0.128  2.576  0.010    0.079    0.582
3 Narc ~~ Narc 50.345 5.034 10.000  0.000   40.477   60.212
4   HP ~~   HP 56.991 0.000     NA     NA   56.991   56.991
5   HP ~~   EP  1.423 0.000     NA     NA    1.423    1.423
6   EP ~~   EP 15.341 0.000     NA     NA   15.341   15.341

We can also get standardized estimates. The std.all column is the one you want to look at.

parameterEstimates(mod_sem, standardized = TRUE)
   lhs op  rhs    est    se      z pvalue ci.lower ci.upper std.lv std.all
1 Narc  ~   HP  0.185 0.067  2.785  0.005    0.055    0.316  0.185   0.190
2 Narc  ~   EP  0.330 0.128  2.576  0.010    0.079    0.582  0.330   0.176
3 Narc ~~ Narc 50.345 5.034 10.000  0.000   40.477   60.212 50.345   0.930
4   HP ~~   HP 56.991 0.000     NA     NA   56.991   56.991 56.991   1.000
5   HP ~~   EP  1.423 0.000     NA     NA    1.423    1.423  1.423   0.048
6   EP ~~   EP 15.341 0.000     NA     NA   15.341   15.341 15.341   1.000
  std.nox
1   0.025
2   0.045
3   0.930
4  56.991
5   1.423
6  15.341

7.2.2 Adding a second DV

Up to this point using SEM doesn’t really buy us anything over and above ordinary regression. But the situation changes if we have more than one DV. Let’s say that instead of a single measure we had a measure of two types of narcissism: vulnerable narcissism (VN) and grandiose narcissism (GN), which, for the sake of example, let’s say are not mutually exclusive (i.e., a person can have a little bit of each).

We could perform two separate regressions, but this has the problem that each regression is performed independently of the other, which only makes sense if we are sure that the two DVs are totally uncorrelated. If they are correlated, then the estimates from this approach will be biased. Here, a SEM approach is useful because it can account for this covariance.

narc_2 <- read_csv("data/narc_2.csv",
                   col_types = "dddd")

sem_formula2 <- '
VN ~ HP + EP
GN ~ HP + EP'

mod_sem2 <- sem(sem_formula2, data = narc_2)

summary(mod_sem2, rsquare = TRUE)
lavaan 0.6-19 ended normally after 9 iterations

  Estimator                                         ML
  Optimization method                           NLMINB
  Number of model parameters                         7

  Number of observations                           200

Model Test User Model:
                                                      
  Test statistic                                 0.000
  Degrees of freedom                                 0

Parameter Estimates:

  Standard errors                             Standard
  Information                                 Expected
  Information saturated (h1) model          Structured

Regressions:
                   Estimate  Std.Err  z-value  P(>|z|)
  VN ~                                                
    HP                0.131    0.030    4.351    0.000
    EP               -0.193    0.058   -3.315    0.001
  GN ~                                                
    HP                0.145    0.039    3.754    0.000
    EP               -0.014    0.074   -0.190    0.849

Covariances:
                   Estimate  Std.Err  z-value  P(>|z|)
 .VN ~~                                               
   .GN                0.994    0.941    1.056    0.291

Variances:
                   Estimate  Std.Err  z-value  P(>|z|)
   .VN               10.375    1.037   10.000    0.000
   .GN               16.975    1.698   10.000    0.000

R-Square:
                   Estimate
    VN                0.125
    GN                0.066

Let’s have a look at the model estimates, including standardized coefficients, and add the (raw) coefficients to our path diagram.

parameterEstimates(mod_sem2, standardized = TRUE)
   lhs op rhs    est    se      z pvalue ci.lower ci.upper std.lv std.all
1   VN  ~  HP  0.131 0.030  4.351  0.000    0.072    0.191  0.131   0.288
2   VN  ~  EP -0.193 0.058 -3.315  0.001   -0.307   -0.079 -0.193  -0.219
3   GN  ~  HP  0.145 0.039  3.754  0.000    0.069    0.221  0.145   0.257
4   GN  ~  EP -0.014 0.074 -0.190  0.849   -0.160    0.132 -0.014  -0.013
5   VN ~~  VN 10.375 1.037 10.000  0.000    8.341   12.408 10.375   0.875
6   GN ~~  GN 16.975 1.698 10.000  0.000   13.648   20.302 16.975   0.934
7   VN ~~  GN  0.994 0.941  1.056  0.291   -0.850    2.838  0.994   0.075
8   HP ~~  HP 56.991 0.000     NA     NA   56.991   56.991 56.991   1.000
9   HP ~~  EP  1.423 0.000     NA     NA    1.423    1.423  1.423   0.048
10  EP ~~  EP 15.341 0.000     NA     NA   15.341   15.341 15.341   1.000
   std.nox
1    0.038
2   -0.056
3    0.034
4   -0.003
5    0.875
6    0.934
7    0.075
8   56.991
9    1.423
10  15.341