put these lines in your “setup” code chunk:
knitr::opts_chunk$set(echo = TRUE)
knitr::opts_knit$set(root.dir = rprojroot::find_rstudio_root_file())
library(tidyverse)
run the setup code chunk
in the console, install the “interactions” package by running
install.packages("interactions")
- we will use functions in
this package for simple slopes analysis and plots that will help us
interpret interaction effects.
in the console, install the “car” package by running
install.packages("car")
- we will use a function in this
package to get some extra regression output.
data description: lumos_subset1000pluseye.csv is a subset of a
public dataset of lumosity (a cognitive training website) user
performance data. You can find the publication associated with the full
dataset here:
Guerra-Carrillo,
B., Katovich, K., & Bunge, S. A. (2017). Does higher education hone
cognitive functioning and learning efficacy? Findings from a large and
diverse sample. PloS one, 12(8), e0182276.
https://doi.org/10.1371/journal.pone.0182276
What to do first: Make a new code chunk and use
readr::read_csv() to read in the data. Make sure that NA values are
handled the way you want (click on the tibble in the Environment window
pane to take a quick look).
What to do next: make sure the columns that contain nominal values are treated as nominal, using forcats::as_factor(). You can copy your code chunk from last week, or just look at the solution below.
#first import the data
lumos_tib <- readr::read_csv("data/lumos_subset1000pluseye.csv", na = "NA")
# now make sure the columns we want as factors are treated that way, using forcats::as_factor()
lumos_tib <- lumos_tib %>% dplyr::mutate(
test_type = forcats::as_factor(test_type),
assessment = forcats::as_factor(assessment),
gender = forcats::as_factor(gender),
edu_cat = forcats::as_factor(edu_cat),
english_nativelang = forcats::as_factor(english_nativelang),
ethnicity = forcats::as_factor(ethnicity)
)
Above: the conceptual model of moderation
Above: the statistical model of moderation - notice that X1 and X2 are interchangeable in the statistical model (i.e., the statistical model does not distinguish between the predictor and the moderator)
Previously, when we looked just at age and raw_score, we saw a small association such that older participants scored lower. But what if that association depended on another variable, such as the size of their screen? That is, maybe older individuals do worse than younger individuals only if they are working on a small screen? Such a relationship would be an example of an interactive effect, or moderation (i.e., screen size moderates the effect of age on raw score, or, restated, age moderates the effect of screen size on raw score).
So far, the data we have worked with are real values from Lumosity, but for this step I have created an imaginary variable called imaginary_screensize - just for educational purposes.
Use a new regression model (call it score_ageXscreen_lm) to examine the interaction of age and imaginary_screensize on raw_score. We include the interaction of two variables in a regression model by multiplying the variables and including that product as a predictor. You can specify an interaction in a formula (which you include in a call to lm()) like this: formula = raw_score ~ age + imaginary_screensize + age:imaginary_screensize (you may also see interactions specified with a *, which is shorthand for including an interaction along with main effects; you should always include the main effects along with an interaction). Try it now (you’ll see soon why we should alter this formula). Use summary() to get the model summary.
score_ageXscreen_lm <- lumos_tib %>% drop_na(age,raw_score,imaginary_screensize) %>%
lm(formula = raw_score ~ age + imaginary_screensize + age:imaginary_screensize)
summary(score_ageXscreen_lm)
# you might also be interested in multicollinearity info (uncomment the line below)
# car::vif(score_ageXscreen_lm)
##
## Call:
## lm(formula = raw_score ~ age + imaginary_screensize + age:imaginary_screensize,
## data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.6210 -3.0272 -0.0353 2.7568 12.9636
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.863e+02 4.770e+01 8.099 1.61e-15 ***
## age -2.925e+00 1.088e+00 -2.689 0.00729 **
## imaginary_screensize -2.810e-01 3.627e-02 -7.747 2.31e-14 ***
## age:imaginary_screensize 2.213e-03 8.264e-04 2.677 0.00754 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.337 on 996 degrees of freedom
## Multiple R-squared: 0.2578, Adjusted R-squared: 0.2555
## F-statistic: 115.3 on 3 and 996 DF, p-value: < 2.2e-16
Notice that there are coefficient estimates for age, imaginary_screensize, and their interaction (age:imaginary_screensize).
The positive coefficient (with low p-value) for the interaction term suggests that at larger screensize values, the relation of age to performance is more positive/less negative than at smaller screensize values. Equivalently, we could restate it: at older ages, the relation of screensize to performance is more positive/less negative. The coefficient for the interaction term has the same meaning as the other coefficients: an increase of one unit in the predictor predicts an increase of .002 units in the outcome. Of course, the predictor in this case is the product of two variables, so interpretation takes a little more work, which we will do in step 1.4.
But we now have issues with the interpretation of the coefficients for the main effects of age and imaginary_screensize. The coefficient for a single variable in a model represents the effect of that variable when the other terms are zero. So the coefficient for imaginary_screensize (the main effect of screensize) now represents the effect of screensize at age = 0. Likewise, the coefficient for age represents the effect of age when screensize = 0. Neither effect is interpretable, because of the presence of the interaction.
If we subtract the mean age from each value of age and store it in a new variable called age_cent (and do the same for imaginary_screensize), then we can enter these new variables into the regression instead. Then we will be able to better understand the resulting main-effect coefficients (as, e.g., the effect of age at the mean value of screen size).
You can create these new “centered” variables using mutate() to add new columns (age_cent = age - mean(age) and imaginary_screensize_cent = imaginary_screensize - mean(imaginary_screensize)). But first, select only the cases that have valid values for all the columns we are using (age, imaginary_screensize, raw_score) and store the result in a new tibble (call it lumos_validagerawscreen_tib), so we can keep track of the fact that the new centered variables are computed on the reduced dataset. This is not strictly necessary here, because all cases have valid values, but it is good practice to be mindful of how many complete cases you have for a given set of variables, because missing values affect calculations such as mean-centering (always watch out for, and think about, missing values).
Then re-estimate a new model (store it in a variable called score_ageXscreen_cent_lm) that is the same as the last one, except using the centered variables (and the interaction of the centered variables) as the predictors. Use summary() to get the model summary.
lumos_validagerawscreen_tib <- lumos_tib %>%
drop_na(age,raw_score,imaginary_screensize) %>%
mutate(
age_cent = age - mean(age),
imaginary_screensize_cent = imaginary_screensize - mean(imaginary_screensize)
)
score_ageXscreen_cent_lm <- lumos_validagerawscreen_tib %>%
lm(formula = raw_score ~ age_cent + imaginary_screensize_cent +
age_cent:imaginary_screensize_cent)
summary(score_ageXscreen_cent_lm)
# you might also be interested in multicollinearity info (uncomment line below)
# car::vif(score_ageXscreen_cent_lm)
##
## Call:
## lm(formula = raw_score ~ age_cent + imaginary_screensize_cent +
## age_cent:imaginary_screensize_cent, data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.6210 -3.0272 -0.0353 2.7568 12.9636
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 15.9905184 0.1384815 115.470 < 2e-16 ***
## age_cent -0.0127844 0.0112311 -1.138 0.25526
## imaginary_screensize_cent -0.1846676 0.0105013 -17.585 < 2e-16 ***
## age_cent:imaginary_screensize_cent 0.0022127 0.0008264 2.677 0.00754 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.337 on 996 degrees of freedom
## Multiple R-squared: 0.2578, Adjusted R-squared: 0.2555
## F-statistic: 115.3 on 3 and 996 DF, p-value: < 2.2e-16
The model F-statistic didn’t change; it is still significant (the low p-value indicates that the data and the observed F-stat are unlikely under the null hypothesis that the true value of all coefficients is zero).
The coefficient for the interaction term didn’t change (it is positive and the t-stat for the coefficient is the same), but the coefficients (and t-stats) for age_cent and imaginary_screensize_cent differ from the uncentered version. When an interaction effect is included in a model, centering the variables allows us to interpret the main effects in these two ways (from the Field textbook, p. 792): “(1) they are the effect of that predictor at the mean value of the sample; and (2) they are the average effect of the predictor across the range of scores for the other predictors”. Importantly, age is not a significant predictor in this model, meaning that if we hold screen size constant at its mean value, then age is not significantly related to performance.
When age was the only predictor in the regression, we could easily visualize the linear relationship by plotting the regression line (e.g., over a scatter plot), but the (significant) interaction is telling us that the effect of age is different for different levels of screen size - we can’t depict that with a single line. Instead, we will plot the model predictions for all values of age, but we’ll use three lines:
- age_cent when imaginary_screensize_cent is at its mean value (zero, since we mean-centered it)
- age_cent when imaginary_screensize_cent is at a high value (+1 standard deviation)
- age_cent when imaginary_screensize_cent is at a low value (-1 standard deviation)

To generate this plot, use the interactions::interact_plot() function. You will need to pass it a model = ... argument specifying the variable that stores your model (model = score_ageXscreen_cent_lm if you are following the provided solution code), as well as pred = ... and modx = ... arguments to specify the variables to plot as the predictor and the moderator. Add interval = TRUE as an argument to get 95% confidence intervals around each line.
In the same code chunk, you can also run a “simple slopes” analysis, which is simply a computation of the effect of one variable (age_cent) at different levels of another variable (imaginary_screensize_cent). We will calculate the effect of age_cent at the mean and at +/- 1 SD of imaginary_screensize_cent, just like in the interaction plot, so we will actually be calculating the slopes of the 3 lines we plotted. Use interactions::sim_slopes() - just like interact_plot(), you will pass it model = score_ageXscreen_cent_lm, followed by pred = ... and modx = ... arguments (another useful argument is jnplot = TRUE - we won’t cover that here, but explore it on your own if you wish).
# plot age vs raw_score at different levels of screensize
p1 <- interactions::interact_plot(model = score_ageXscreen_cent_lm, pred = age_cent, modx = imaginary_screensize_cent, interval = TRUE)
p1 + labs (title = "Age * Screensize Simple Slopes")
#simple slopes of age at 3 levels of screensize
interactions::sim_slopes(model = score_ageXscreen_cent_lm, pred = age_cent, modx = imaginary_screensize_cent)
## JOHNSON-NEYMAN INTERVAL
##
## When imaginary_screensize_cent is OUTSIDE the interval [-4.81, 29.54], the
## slope of age_cent is p < .05.
##
## Note: The range of observed values of imaginary_screensize_cent is [-48.97,
## 32.03]
##
## SIMPLE SLOPES ANALYSIS
##
## Slope of age_cent when imaginary_screensize_cent = -1.329232e+01 (- 1 SD):
##
## Est. S.E. t val. p
## ------- ------ -------- ------
## -0.04 0.02 -2.68 0.01
##
## Slope of age_cent when imaginary_screensize_cent = 2.000888e-14 (Mean):
##
## Est. S.E. t val. p
## ------- ------ -------- ------
## -0.01 0.01 -1.14 0.26
##
## Slope of age_cent when imaginary_screensize_cent = 1.329232e+01 (+ 1 SD):
##
## Est. S.E. t val. p
## ------ ------ -------- ------
## 0.02 0.02 1.06 0.29
Answer the following questions about the model in your notes:
1) What is the relation between age and performance
(raw_score
) when screen size is held at a low value?
2) What is the relation between age and performance when screen size is
held at its average (mean) value?
3) What is the relation between age and performance when screen size is
held at a large value?
Finally, take a look at the Johnson-Neyman interval info (above the “Simple Slopes Analysis” output of sim_slopes()). This interval tells us that, below imaginary_screensize_cent values of -4.81, we would consider age to have a significantly negative (p<.05) relation to performance (according to our model), and above imaginary_screensize_cent values of 29.54, age has a significantly positive relation to performance. Does this make sense to you? Can you translate those “zones of significance” cutoffs into the original (not mean-centered) imaginary_screensize units?
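One way to translate the cutoffs back to original units is simply to add the sample mean of imaginary_screensize to the centered cutoffs. A minimal sketch - the mean value 250 below is a made-up placeholder; with the real data you would use mean(lumos_validagerawscreen_tib$imaginary_screensize):

```r
mean_ss <- 250  # hypothetical placeholder; substitute the real sample mean
jn_cutoffs_cent <- c(lower = -4.81, upper = 29.54)  # from the sim_slopes() output
jn_cutoffs_cent + mean_ss  # zones-of-significance cutoffs in original units
```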
Note on effect size - The R2 for our full model (raw_score = b0 + b1*age + b2*imaginary_screensize + b3*age*imaginary_screensize) is .2578, meaning that the model (age, screensize, and their interaction) accounts for 25.78% of the variance in performance. We can use model comparison to compare a model with versus without the interaction term, which gives us an R2 change (here the R2 change is .2578 - .2524 = .0054, or about .54% of variance explained by the interaction of age and screen size). Extra note on the simulated data: the imaginary_screensize variable explains an unrealistically large proportion of the variance because it was specially simulated to give us a significant interaction effect.
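The model comparison described above can be done with anova() on nested models. A sketch on simulated data (x1, x2, y are generic made-up names; with the lab data you would compare centered models with and without age_cent:imaginary_screensize_cent):

```r
# Simulate an outcome with a small interaction effect
set.seed(2)
n  <- 300
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 1 + 0.5 * x1 - 0.3 * x2 + 0.2 * x1 * x2 + rnorm(n)

m_main <- lm(y ~ x1 + x2)          # main effects only
m_full <- lm(y ~ x1 + x2 + x1:x2)  # adds the interaction term

# R^2 change attributable to the interaction:
summary(m_full)$r.squared - summary(m_main)$r.squared

# F test of that change (nested model comparison):
anova(m_main, m_full)
```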
Cautionary note on variance partitioning with anova(): If you’re curious about how R’s anova() function works, you may notice that running anova(score_ageXscreen_cent_lm) gives different statistics for individual terms than summary(score_ageXscreen_cent_lm) does. This is because R’s anova() function uses Type I (also called sequential) sums of squares: variance is first assigned to the first predictor (age), then the next predictor (screen size), and lastly the interaction, so the statistics for individual terms depend on the order in which they are specified in the model. Our interpretation requires Type III (also called partial) sums of squares, where each main effect is assigned the variance not accounted for by the other main effect and the interaction (so the order of the terms does not matter). We’ll revisit this issue when we cover interactions between categorical variables.
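If you want Type III statistics directly, the car package (installed at the start of this activity) provides car::Anova() with a type argument. A sketch on simulated data (generic names), showing that Type I results depend on term order while Type III matches the t-tests from summary():

```r
set.seed(3)
n  <- 200
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 1 + 0.4 * x1 + 0.3 * x2 + 0.1 * x1 * x2 + rnorm(n)

m_a <- lm(y ~ x1 * x2)  # x1 entered first
m_b <- lm(y ~ x2 * x1)  # x2 entered first

anova(m_a)["x1", "F value"]  # Type I F for x1 depends on entry order...
anova(m_b)["x1", "F value"]  # ...so these two values differ

car::Anova(m_a, type = 3)    # Type III: term order does not matter
```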
Above: the simple three variable model of mediation - notice that X and M are distinguished in the model (unlike X1 and X2 in the statistical moderation model)
Mediation refers to a situation when the
relationship between a predictor variable (X in the chart above) and an
outcome variable (Y in the chart above) can be explained by their
relationship to a third variable (the mediator, M).
Forget about the screen size variable for a moment and consider possible explanations for the negative relation between age and performance that we saw when age was our only predictor of raw_score. Maybe part of the relation could be explained by something like eyesight that deteriorates with age. The data file you imported for this activity has a new (simulated/imaginary) variable called eyesight_z, which is an eyesight “score” where higher values indicate better eyesight (it has been scaled such that the sample mean is 0 and the s.d. is 1).
We will test a mediation model where eyesight_z explains the relation between age and raw_score.
A note about causality: In this model we have good reasons for thinking the direction of the relationships is as specified in the mediation model (i.e., it would not be possible for a change in eyesight to cause a change in age, or for a change in test performance to cause a change in eyesight), but there may be many unmeasured variables that could be involved. The test of our mediation model will tell us whether the eyesight_z measure accounts for a “significant” part of the relationship between age and raw_score.
There are four conditions for our mediation model test, which are tested with three regression models. First we will list the regression models (the coefficients of each model are different, so we’ll refer to them with unique subscripts b1 through b4):
1. Y = intercept + b1*X
2. M = intercept + b2*X
3. Y = intercept + b3*X + b4*M
We use these three models to check four conditions of mediation
(section 11.4.2 of the Field textbook):
1. the predictor variable must significantly predict the outcome
variable in model 1 (c is significantly different from 0)
2. the predictor variable must significantly predict the mediator in
model 2 (a is significantly different from 0)
3. the mediator must significantly predict the outcome variable in model
3 (b is significantly different from 0)
4. the predictor variable must predict the outcome variable less
strongly in model 3 than in model 1 (c’ is closer to 0 than c, in other
words, the direct effect is smaller than the total effect)
How much smaller should the direct effect be? A perfect mediation would reduce the direct effect to zero, but that rarely actually happens. To assess whether there is a significant reduction (from c to c’), we will instead test whether a*b (called the “indirect effect”, literally b2 times b4 from our series of regression models) is significantly different from zero. Significance of a*b can be assessed with a Sobel test (textbook section 11.4.2) or by using a bootstrap method to estimate a confidence interval around a*b.
The process()
function that we will use gives us a
bootstrapped 95% confidence interval around a*b, so if the interval does
not contain zero, then we can say there is a significant mediation (or,
more precisely, we reject the null hypothesis that the indirect effect
is zero). Bootstrapping involves taking thousands of repeated random
samples of cases from our data (with replacement, so the same case can
be included more than once in one repetition) and building a sampling
distribution of our parameter of interest from those random samples
(they are samples of a sample, hence the term “bootstrapping”). From
that sampling distribution we can estimate our parameter of interest
(a*b) and a confidence interval around it.
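Here is a sketch of what percentile bootstrapping of the indirect effect looks like, on simulated data with generic names x, m, and y (process() handles all of this for you; this is just to show the idea):

```r
# Simulate a simple mediation: x -> m -> y, plus a small direct path x -> y
set.seed(4)
n <- 300
x <- rnorm(n)
m <- 0.4 * x + rnorm(n)            # mediator model (path a)
y <- 0.5 * m + 0.1 * x + rnorm(n)  # outcome model (paths b and c')
dat <- data.frame(x, m, y)

# Resample cases with replacement, re-fit both regressions, and save a*b
ab_boot <- replicate(1000, {
  d <- dat[sample(nrow(dat), replace = TRUE), ]
  a <- coef(lm(m ~ x, data = d))["x"]
  b <- coef(lm(y ~ x + m, data = d))["m"]
  a * b
})

quantile(ab_boot, c(.025, .975))  # percentile 95% CI for the indirect effect
```

If the resulting interval does not contain zero, we reject the null hypothesis that the indirect effect is zero, just as described above.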
You may read some arguments that not all of those 4 conditions are necessary to have evidence of mediation (e.g., c may not be different from 0), but these 4 conditions are a straightforward method to assess a mediation model, so let’s get on with it and assess our mediation model (that eyesight_z explains the relation between age and raw_score).
Note that we are using a series of linear regression models to test the mediation model, so the assumptions we need to check are the same ones we discussed in the multiple regression lab activity (but also notice that we will be using bootstrapping to estimate the mediated effect, which eases concern over significance tests on the parameter estimate when assumptions are violated). We won’t check assumptions here (to save a little time), but it is a good exercise to check the plots of residuals from each model if you have extra time (use the plot(model) function).
We will use the process() function, from an R script (process.R) developed by Andrew Hayes, to test our mediation model. Unlike all the functions we have used before, this function is not contained in an R package, so we can’t use install.packages(). Instead, we need to download the R script that contains the code for the process() function, then source the script (source just means run the code within your current environment) to make the function available within your project environment. Once you have sourced the script, the process() function will be available in your environment (you will see it appear in the “Environment” pane under “Functions”). But if you want to be able to “knit” your R markdown doc, you will need to copy the source command from the console pane (source("fullpath_and_foldername/process.R", local = knitr::knit_global())) into a chunk in your markdown doc.
In our mediation model, X corresponds to age, Y to raw_score, and M to eyesight_z, so the three models we use to test the conditions are:
1. raw_score = intercept + b1*age (b1 = path c)
2. eyesight_z = intercept + b2*age (b2 = path a)
3. raw_score = intercept + b3*age + b4*eyesight_z (b3 = path c’ and b4 = path b)
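As a sketch, these three models (and the a*b decomposition) can be fit directly with lm(). Simulated stand-in data are used below so the chunk runs on its own; with the real data you would start from lumos_tib %>% drop_na(age, raw_score, eyesight_z):

```r
# Simulated stand-in data (made-up coefficients, for illustration only)
set.seed(5)
n   <- 1000
age <- runif(n, 18, 80)
eyesight_z <- as.numeric(scale(-0.01 * age + rnorm(n)))  # worsens with age
raw_score  <- 18 - 0.01 * age + 4 * eyesight_z + rnorm(n, sd = 4)
med_dat <- data.frame(age, raw_score, eyesight_z)

mod1 <- lm(raw_score ~ age, data = med_dat)               # b1 = path c (total)
mod2 <- lm(eyesight_z ~ age, data = med_dat)              # b2 = path a
mod3 <- lm(raw_score ~ age + eyesight_z, data = med_dat)  # b3 = c', b4 = path b

ab       <- coef(mod2)["age"] * coef(mod3)["eyesight_z"]  # indirect effect a*b
c_total  <- coef(mod1)["age"]
c_direct <- coef(mod3)["age"]

# For OLS on the same sample, the total effect decomposes exactly: c = c' + a*b
all.equal(unname(c_total), unname(c_direct + ab))  # TRUE
```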
Now use process() to estimate the three models: pipe lumos_tib to the function process(), with arguments that specify x = ..., y = ..., m = ..., model = 4, total = TRUE, and effsize = TRUE.
- model = 4 specifies the form of the mediation model (the creator of PROCESS, Andrew Hayes, uses different numbers to refer to different types of models; a simple mediation model was assigned #4 - you can look up the different model numbers in the documentation posted on the class Canvas site)
- the total and effsize arguments specify that we want an estimate of the total effect and a standardized estimate for the indirect effect

lumos_tib %>% drop_na(age,raw_score,eyesight_z) %>%
  process(y = "raw_score", x = "age", m = "eyesight_z",
      model = 4, total = TRUE, effsize = TRUE, progress = FALSE)
##
## ********************* PROCESS for R Version 4.1.1 *********************
##
## Written by Andrew F. Hayes, Ph.D. www.afhayes.com
## Documentation available in Hayes (2022). www.guilford.com/p/hayes3
##
## ***********************************************************************
##
## Model : 4
## Y : raw_score
## X : age
## M : eyesight_z
##
## Sample size: 1000
##
## Random seed: 754320
##
##
## ***********************************************************************
## Outcome Variable: eyesight_z
##
## Model Summary:
## R R-sq MSE F df1 df2 p
## 0.0811 0.0066 0.9944 6.6046 1.0000 998.0000 0.0103
##
## Model:
## coeff se t p LLCI ULCI
## constant 0.2860 0.1157 2.4726 0.0136 0.0590 0.5130
## age -0.0066 0.0026 -2.5699 0.0103 -0.0116 -0.0016
##
## ***********************************************************************
## Outcome Variable: raw_score
##
## Model Summary:
## R R-sq MSE F df1 df2 p
## 0.8379 0.7022 7.5398 1175.1727 2.0000 997.0000 0.0000
##
## Model:
## coeff se t p LLCI ULCI
## constant 16.6401 0.3195 52.0880 0.0000 16.0132 17.2670
## age -0.0137 0.0071 -1.9455 0.0520 -0.0276 0.0001
## eyesight_z 4.1946 0.0872 48.1241 0.0000 4.0236 4.3656
##
## ************************ TOTAL EFFECT MODEL ***************************
## Outcome Variable: raw_score
##
## Model Summary:
## R R-sq MSE F df1 df2 p
## 0.1014 0.0103 25.0288 10.3689 1.0000 998.0000 0.0013
##
## Model:
## coeff se t p LLCI ULCI
## constant 17.8397 0.5803 30.7436 0.0000 16.7010 18.9784
## age -0.0413 0.0128 -3.2201 0.0013 -0.0665 -0.0161
##
## ***********************************************************************
## Bootstrapping in progress. Please wait.
##
## ************ TOTAL, DIRECT, AND INDIRECT EFFECTS OF X ON Y ************
##
## Total effect of X on Y:
## effect se t p LLCI ULCI c_cs
## -0.0413 0.0128 -3.2201 0.0013 -0.0665 -0.0161 -0.1014
##
## Direct effect of X on Y:
## effect se t p LLCI ULCI c'_cs
## -0.0137 0.0071 -1.9455 0.0520 -0.0276 0.0001 -0.0337
##
## Indirect effect(s) of X on Y:
## Effect BootSE BootLLCI BootULCI
## eyesight_z -0.0276 0.0109 -0.0492 -0.0066
##
## Completely standardized indirect effect(s) of X on Y:
## Effect BootSE BootLLCI BootULCI
## eyesight_z -0.0677 0.0262 -0.1198 -0.0164
##
## ******************** ANALYSIS NOTES AND ERRORS ************************
##
## Level of confidence for all confidence intervals in output: 95
##
## Number of bootstraps for percentile bootstrap confidence intervals: 5000
Let’s look closely at the output of process(), from top to bottom:
- the section starting with Outcome Variable: eyesight_z corresponds to model #2 of the three models we use to test the mediation
- the coeff value for the age row gives us b2 (path a)
- the LLCI and ULCI columns give us the lower (LLCI) and upper (ULCI) limits of the 95% confidence interval around each parameter
- the next section, starting with Outcome Variable: raw_score, corresponds to model #3 of the three models we use to test the mediation
- coeff for age gives us b3, a.k.a. path c’, a.k.a. the direct effect
- coeff for eyesight_z gives us b4, a.k.a. path b
- the section starting with TOTAL EFFECT MODEL corresponds to model #1 of the three models we use to test the mediation
- coeff for age gives us b1, a.k.a. path c, a.k.a. the total effect (it’s the same as the coefficient we got last week when we entered age as the only predictor for raw_score)
- at this point, take a moment to recognize that the first three of our four conditions are satisfied (a, b, and c are significantly different from 0)
- now look at the section starting “TOTAL, DIRECT, AND INDIRECT EFFECTS OF X ON Y”
- the indirect effect is -.0276, meaning that a one-unit increase in age predicts a -.0276 change in raw_score, considering the indirect path only (i.e., each year of age predicts that much decrease in raw_score through the indirect path only)
- question to consider: could we use imaginary_screensize in a mediation model?

The “Completely standardized indirect effect(s) of X on Y” section produced by process() is essentially a standardized regression coefficient, and as such it can be compared across studies (and is useful for meta-analyses). We could try to compute something similar to R2, but all of these measures cause difficulties with how we interpret them, so we may be better off sticking with the standardized indirect effect measure (see the Field textbook section 11.4.3 for a full discussion).
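For intuition, the completely standardized indirect effect is just a*b rescaled by sd(x)/sd(y), i.e., the indirect effect in standard-deviation units. A sketch on simulated data (generic made-up names; process() reports this value for you):

```r
set.seed(6)
n <- 400
x <- rnorm(n, mean = 50, sd = 10)
m <- 0.05 * x + rnorm(n)
y <- 2 * m + 0.05 * x + rnorm(n, sd = 3)

a  <- coef(lm(m ~ x))["x"]
b  <- coef(lm(y ~ x + m))["m"]
ab <- a * b                  # indirect effect in raw units
ab_cs <- ab * sd(x) / sd(y)  # completely standardized indirect effect

# Equivalently, fit the same models on z-scored variables:
zx <- as.numeric(scale(x)); zm <- as.numeric(scale(m)); zy <- as.numeric(scale(y))
ab_z <- coef(lm(zm ~ zx))["zx"] * coef(lm(zy ~ zx + zm))["zm"]
all.equal(unname(ab_cs), unname(ab_z))  # TRUE
```

Because it is unit-free, this is the version of the indirect effect that can be meaningfully compared across studies.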