### Introduction

In SEM, one of the first steps is to evaluate the model's global fit. After global fit, we need to evaluate the local fit of the model, meaning how well the model reproduces specific correlations between observed variables.

There are a couple of common methods for this: (a) testing for high
residual correlations, or (b) modification indices. This tutorial
focuses on the second. Modification indices estimate the
**likely** change in model fit if a single parameter that was
not originally included is added to the model. This test can be
carried out for every possible parameter that was not included (Bentler 1990).

### Modification Indices

Modification-index output provides several **indices** to
quantify the effect of each parameter, and we will focus on two here.
These are (a) the modification index (MI) or Lagrange multiplier, which
estimates the extent to which the model's chi-square (\(\chi^2\)) test statistic would decrease if
a parameter were added to the model and freely estimated, and (b) the
standardized expected parameter change (SEPC), which is the approximate
standardized value the parameter would take if it were estimated in the
model (Whittaker 2012).

The MI quantifies the possible effect on overall model fit, while the SEPC gives the effect size for the missed parameter.
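In frequentist *lavaan*, these two quantities can be inspected directly; the sketch below (an illustration assuming *lavaan* is installed, using its built-in Holzinger and Swineford data) shows where the `mi` and `sepc.all` columns come from, before we turn to the Bayesian version.

```
library(lavaan)

# Maximum-likelihood CFA for the Holzinger & Swineford (1939) data
fit.ml <- cfa(' visual  =~ x1 + x2 + x3
                textual =~ x4 + x5 + x6
                speed   =~ x7 + x8 + x9 ',
              data = HolzingerSwineford1939, std.lv = TRUE)

# 'mi' = modification index; 'sepc.all' = standardized expected parameter change
head(modificationindices(fit.ml, sort. = TRUE)[, c("lhs", "op", "rhs", "mi", "sepc.all")])
```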

We will show an example with the Holzinger and Swineford (1939) model. You first estimate your SEM/CFA model as usual:

```
library(blavaan)  # also loads lavaan, which provides the HolzingerSwineford1939 data

HS.model <- ' visual  =~ x1 + x2 + x3
              textual =~ x4 + x5 + x6
              speed   =~ x7 + x8 + x9 '
fit <- bcfa(HS.model, data = HolzingerSwineford1939, std.lv = TRUE)
```

Then we need to write a **discrepancy** function
to collect the modification indices. The list below contains two
functions that compute and save the MI and SEPC.

```
discFUN <- list(
  # modification index (MI) for every non-included parameter
  mod.ind_mi = function(object) {
    temp <- modificationindices(object, free.remove = FALSE)
    mods <- temp$mi
    names(mods) <- paste0(temp$lhs, temp$op, temp$rhs)
    return(mods)
  },
  # standardized expected parameter change (SEPC) for the same parameters
  mod.ind_sepc.all = function(object) {
    temp <- modificationindices(object, free.remove = FALSE)
    sepc.all <- temp$sepc.all
    names(sepc.all) <- paste0(temp$lhs, temp$op, temp$rhs)
    return(sepc.all)
  }
)
```

Then we pass this list of functions to the `ppmc()` function
of *blavaan*. With this function, the MI and SEPC are computed
for each posterior sample, leading to a posterior distribution for each
of them.

`out <- ppmc(fit, discFUN = discFUN)`

Then we view the top 5 parameters arranged by the posterior mean (EAP)
MI, which in this case shows that the parameter with the highest
impact on overall model fit (according to the EAP) is
**visual=~x9**, the cross-loading from the Visual factor to
item **x9**.

`summary(out, prob=.9, discFUN = "mod.ind_mi", sort.by="EAP", decreasing=T)[1:5,]`

```
##
## Posterior summary statistics and highest posterior density (HPD) 90% credible intervals
## for the posterior distribution of realized discrepancy-function values based on observed data,
## along with posterior predictive p values to test hypotheses in either direction:
##
##
## EAP Median MAP SD lower upper PPP_sim_GreaterThan_obs
## visual=~x9 35.164 35.119 35.213 11.223 16.637 52.245 0.014
## x7~~x8 32.981 35.507 39.796 14.528 8.480 56.405 0.076
## x8~~x9 26.437 11.969 2.849 43.335 0.000 65.993 0.327
## x4~~x6 19.620 6.387 2.223 35.491 0.000 52.799 0.457
## visual=~x7 18.284 16.753 13.244 9.849 3.525 31.902 0.015
## PPP_sim_LessThan_obs
## visual=~x9 0.986
## x7~~x8 0.924
## x8~~x9 0.673
## x4~~x6 0.543
## visual=~x7 0.985
```

But according to the posterior median, the parameter that would have
the highest impact is the residual correlation between indicators
**x7** and **x8**:

`summary(out, prob=.9, discFUN = "mod.ind_mi", sort.by="Median", decreasing=T)[1:5,]`

```
##
## Posterior summary statistics and highest posterior density (HPD) 90% credible intervals
## for the posterior distribution of realized discrepancy-function values based on observed data,
## along with posterior predictive p values to test hypotheses in either direction:
##
##
## EAP Median MAP SD lower upper PPP_sim_GreaterThan_obs
## x7~~x8 32.981 35.507 39.796 14.528 8.480 56.405 0.076
## visual=~x9 35.164 35.119 35.213 11.223 16.637 52.245 0.014
## visual=~x7 18.284 16.753 13.244 9.849 3.525 31.902 0.015
## x8~~x9 26.437 11.969 2.849 43.335 0.000 65.993 0.327
## textual=~x1 11.052 9.906 2.014 8.385 0.000 22.615 0.227
## PPP_sim_LessThan_obs
## x7~~x8 0.924
## visual=~x9 0.986
## visual=~x7 0.985
## x8~~x9 0.673
## textual=~x1 0.773
```

The MI is still recommended as the primary metric for deciding which
parameter to include next, while the SEPC indicates the
**likely** effect size for the respective
parameters.

`summary(out, prob=.9, discFUN = "mod.ind_sepc.all", sort.by="EAP", decreasing=T)[1:5,]`

```
##
## Posterior summary statistics and highest posterior density (HPD) 90% credible intervals
## for the posterior distribution of realized discrepancy-function values based on observed data,
## along with posterior predictive p values to test hypotheses in either direction:
##
##
## EAP Median MAP SD lower upper PPP_sim_GreaterThan_obs
## x7~~x8 0.794 0.786 0.755 0.410 0.502 1.323 0.043
## visual=~x9 0.516 0.496 0.465 0.134 0.341 0.704 0.010
## textual=~x1 0.268 0.298 0.312 0.188 0.024 0.531 0.137
## x1~~x9 0.249 0.247 0.246 0.043 0.200 0.304 0.016
## x2~~x3 0.223 0.223 0.223 0.037 0.167 0.274 0.021
## PPP_sim_LessThan_obs
## x7~~x8 0.957
## visual=~x9 0.990
## textual=~x1 0.863
## x1~~x9 0.984
## x2~~x3 0.979
```

Here we see that for the two highest-ranked parameters, the likely SEPC is
x7~~x8 = 0.794 and visual=~x9 = 0.516. With this
information we can decide to include one of these new parameters in the
model (one at a time). For this example, because factor loadings have
a larger impact on the model-implied covariance matrix, I would choose
**visual=~x9**:

```
HS.model <- ' visual  =~ x1 + x2 + x3 + x9
              textual =~ x4 + x5 + x6
              speed   =~ x7 + x8 + x9 '
fit2 <- bcfa(HS.model, data = HolzingerSwineford1939, std.lv = TRUE)
```

And you can check whether the added parameter has the expected impact on
overall fit with the `blavFitIndices()` and `summary()` functions.
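As a sketch of that follow-up check (assuming the `fit` and `fit2` objects estimated above):

```
# Posterior distributions of global fit indices, before and after
# adding the visual=~x9 cross-loading
summary(blavFitIndices(fit))
summary(blavFitIndices(fit2))

# Parameter estimates for the extended model, including visual=~x9
summary(fit2)
```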

It is also important to consider the theoretical relevance of the
suggested parameters, and to ensure that they make sense, rather than
just adding parameters until achieving **good** fit.

### Summary

In this tutorial we showed how to calculate the MI and SEPC across posterior samples, and how to evaluate which parameters could be added.

With the `ppmc()` function we are able to calculate
relevant discrepancy measures after model estimation, and build posterior
distributions of them.

The general recommendations are to use the MI to identify the most likely parameter to add, and the SEPC as the effect size of that new parameter.

### References

Bentler, P. M. 1990. "Fit Indexes, Lagrange Multipliers, Constraint Changes and Incomplete Data in Structural Models." *Multivariate Behavioral Research* 25 (2): 163–72. https://doi.org/10.1207/s15327906mbr2502_3.

Holzinger, Karl J., and Frances Swineford. 1939. *A Study in Factor Analysis: The Stability of a Bi-Factor Solution*. Supplementary Educational Monographs 48. Chicago: University of Chicago Press.

Whittaker, Tiffany A. 2012. "Using the Modification Index and Standardized Expected Parameter Change for Model Modification." *The Journal of Experimental Education* 80 (1): 26–44. https://doi.org/10.1080/00220973.2010.531299.