Learning goals:
- Learn how to use linear regression with one continuous outcome and one or more predictors:
  - use scatterplots to check for non-linear associations
  - understand model R2, F-statistic, and beta coefficients (standardized and unstandardized)
  - check model residuals for potential sources of bias
  - use the F-statistic for model comparisons
- Examine curvilinear/quadratic associations (this wasn’t in the SPSS activity)
- Use logistic regression for a dichotomous outcome
Put these lines in your “setup” code chunk:

knitr::opts_chunk$set(echo = TRUE)
knitr::opts_knit$set(root.dir = rprojroot::find_rstudio_root_file())
library(tidyverse)

Run the setup code chunk (the necessary library() statements are in there). If you haven’t already installed the GGally and parameters packages, run these once in the console:

install.packages("GGally")
install.packages("parameters")
Data description: lumos_subset1000plusimaginary.csv is the same file we worked with last week. It is a subset of a public dataset of Lumosity (a cognitive training website) user performance data. You can find the publication associated with the full dataset here:
Guerra-Carrillo,
B., Katovich, K., & Bunge, S. A. (2017). Does higher education hone
cognitive functioning and learning efficacy? Findings from a large and
diverse sample. PloS one, 12(8), e0182276.
https://doi.org/10.1371/journal.pone.0182276
Key variables: raw_score (the test score we will use as an outcome) and pretest_score (test at start of the training program); pretest_score has been transformed to have a mean of 100 and standard deviation of 15 (this is a typical transformation for IQ scores).

What to do first: Make a new code chunk and use readr::read_csv() to read in the data. Make sure that NA values are handled the way you want (click on the tibble in the Environment window pane to take a quick look).
What to do next: make sure the columns that contain nominal values are treated as nominal, using forcats::as_factor(). You can copy your code chunk from last week, or just look at the solution below.
#first import the data
lumos_tib <- readr::read_csv("data/lumos_subset1000plusimaginary.csv", na = "NA")
# now make sure the columns we want as factors are treated that way, using forcats::as_factor()
lumos_tib <- lumos_tib %>% dplyr::mutate(
test_type = forcats::as_factor(test_type),
assessment = forcats::as_factor(assessment),
gender = forcats::as_factor(gender),
edu_cat = forcats::as_factor(edu_cat),
english_nativelang = forcats::as_factor(english_nativelang),
ethnicity = forcats::as_factor(ethnicity)
)
A note about terminology: (same as the note in the SPSS activity) In this lab activity we will use the terms “predictor”, “independent variable (IV)”, and “explanatory variable” interchangeably to refer to variables that are entered as explanatory terms in the model. We will use the terms “dependent variable” and “outcome variable” interchangeably to refer to the variable that is being explained. You should be mindful of the implications of these words (e.g., about causality, which cannot be inferred simply by assigning one variable as a predictor and another as an outcome) but we won’t focus on language in this lab. Instead, you should get used to the different terminology that is used.
Start by examining the associations among age, pretest_score, and raw_score with a scatterplot matrix from GGally::ggscatmat(), like in the code below.

p1 <- lumos_tib %>%
  GGally::ggscatmat(columns = c("age","pretest_score","raw_score"))
p1
- Notice that the age variable is not normally distributed, but normality of the variables is not a concern, especially with a large data set. The scatterplots show that a linear relation between variables is a reasonable assumption.
- So let’s dive straight in and use linear regression with age as an explanatory variable for raw_score.
- We will use the lm() function, piping in the data and specifying age as a predictor for raw_score with the argument formula = raw_score ~ age (an intercept term is automatically included). Save the model in a variable called score_lm.
- Use summary(score_lm) to show the model summary.
score_lm <- lumos_tib %>% drop_na(age,raw_score) %>%
lm(formula = raw_score ~ age)
summary(score_lm)
##
## Call:
## lm(formula = raw_score ~ age, data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.8899 -3.4962 -0.0183 3.1838 21.1927
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.83974 0.58027 30.74 < 2e-16 ***
## age -0.04130 0.01282 -3.22 0.00132 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.003 on 998 degrees of freedom
## Multiple R-squared: 0.01028, Adjusted R-squared: 0.009291
## F-statistic: 10.37 on 1 and 998 DF, p-value: 0.001323
- The Multiple R-squared value tells us that age explains about 1.03% of the variance in raw_score (not much by most standards).
- The F-statistic tests overall model fit (use anova(score_lm) to view the Mean Sum of Squares for the model and residuals), and the p-value tells you the probability of an F-statistic at least that large under the null hypothesis. Both anova() and the model_parameters() call below are shown in the sketch after this list.
- Use parameters::model_parameters(score_lm, standardize="refit") to see standardized coefficients (this function also gives you confidence intervals for parameters). The standardized coefficient for age indicates that a change of +1 standard deviation in age predicts a change of -.10 standard deviations in raw_score; it is the same as if we first standardized (z-scored) each variable and then ran the regression model.
- The intercept is the predicted raw_score when age is zero, which is not very informative here; we could re-center age so that the zero value was meaningful (e.g., re-center to mean=0, then the intercept would correspond to the predicted score for someone with the mean age).
- The p-value for age (under “Pr(>|t|)”) is the same as the overall model p-value; this is because age is the only predictor in the model.
- The t-stat and p-value for age are the same as the t-stat and p-value that we got last week for the correlation between age and raw_score; this is because the correlation is also a test of a linear relation (and age is the only predictor).
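A minimal sketch of those two calls (using the score_lm model from above):

# F table with Mean Sq for the age term and the residuals
anova(score_lm)
# standardized coefficients plus confidence intervals
parameters::model_parameters(score_lm, standardize = "refit")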
Now let’s add pretest_score to our model, so that we are predicting raw_score as a function of age and pretest_score. See if you can write the code yourself (copy and edit the code from the previous step) before looking at the solution below. Store the model in a variable named score_2pred_lm.
score_2pred_lm <- lumos_tib %>% drop_na(age,raw_score,pretest_score) %>%
lm(formula = raw_score ~ age + pretest_score)
summary(score_2pred_lm)
##
## Call:
## lm(formula = raw_score ~ age + pretest_score, data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.7387 -2.0523 0.0005 2.1198 9.3891
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -11.904994 0.846546 -14.063 <2e-16 ***
## age -0.005794 0.008131 -0.713 0.476
## pretest_score 0.274721 0.007052 38.956 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.152 on 997 degrees of freedom
## Multiple R-squared: 0.6076, Adjusted R-squared: 0.6068
## F-statistic: 771.9 on 2 and 997 DF, p-value: < 2.2e-16
Questions to answer for yourself:
- What proportion of the variance in raw_score is explained by the model?
- What does the coefficient for pretest_score tell you?
- Why do you think the age variable is not significant in this model? How does it compare to the partial correlation test that we ran last week with the same variables?

The lm() function produces a predicted outcome value for every set of predictor variables (e.g., if one participant has an age of 15 and a pretest_score of 99.3, then the predicted outcome value will be -11.90 + (-.006*15) + (.275*99.3) ≈ 15.3, based on the model we just estimated). Each case also has a residual: the difference between the actual outcome value (raw_score) and the model-predicted value; thus there are as many model residual values as there are cases. We’ll look at these residuals in this step.

Now, look back at the decision process chart. It says we should look at “zpred vs zresid” to check for linearity, heteroscedasticity, and independence, then look at a histogram of residuals to check for normality.
Fortunately, there is one easy function that will show everything we need: plot(modelname).
- Use plot(score_2pred_lm) to show the diagnostic plots for the 2-predictor model from the last step.
plot(score_2pred_lm)
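Note that plot() on an lm object shows a Q-Q plot for checking normality rather than a histogram; if you also want the histogram of residuals (and a peek at the predicted values), here is a minimal sketch using base R:

# predicted (fitted) values, one per case
head(fitted(score_2pred_lm))
# histogram of residuals to eyeball normality
hist(residuals(score_2pred_lm), breaks = 30,
     main = "residuals: score_2pred_lm", xlab = "residual")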
We can compare two models (with the same outcome data) by computing an F-statistic based on the ratio of error (RSS = residual sum of squares) of one model to the other (for example, a full model with all predictors compared to a less full model with a subset of predictors). If one model is better at explaining the outcome than the other, then the RSS of the worse model will be greater than the RSS of the better model, and the F-statistic quantifies that for us (and gives a p-value under the null hypothesis). You can also rephrase this F-statistic as a ratio based on R2 for each model (see Chapter 9 of the Field textbook, equation 9.18).
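In symbols, with RSS_1 for the reduced model, RSS_2 for the fuller model, k_1 and k_2 predictors respectively, and n cases, the standard model-comparison F (equivalent to the R2 form in Field’s eq. 9.18) is:

\[ F = \frac{(\mathrm{RSS}_1 - \mathrm{RSS}_2)/(k_2 - k_1)}{\mathrm{RSS}_2/(n - k_2 - 1)} \]

With the RSS values from the anova() table below: (24978.7 - 9903.8)/1 divided by 9903.8/997 ≈ 1517.6, matching the F column.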
- Let’s compare the last model (age and pretest_score) to the first model (age only) of raw_score. Because we are comparing the residual variance between models, we can use the anova() function in R. The arguments we pass are the names of the two models (put the fuller model second). Try it now.
anova(score_lm, score_2pred_lm)
## Analysis of Variance Table
##
## Model 1: raw_score ~ age
## Model 2: raw_score ~ age + pretest_score
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 998 24978.7
## 2 997 9903.8 1 15075 1517.6 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
- The p-value tells us that an F-statistic this large for age + pretest_score vs age alone is unlikely if the models equally explain variance in raw_score (so the fuller model is better).
- You can also get fit indices like AIC and BIC with broom::glance(modelname) in R; lower AIC and BIC values indicate better model fit (see the sketch below).
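A minimal sketch of that check (broom is installed with the tidyverse but not attached, so we call it with ::):

# one-row model summaries; compare the AIC and BIC columns across models
broom::glance(score_lm)
broom::glance(score_2pred_lm)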
When we looked just at age and raw_score, we saw a small association such that older participants scored lower. But is it possible that the association is really inverted U-shaped, such that older is better at young ages but worse at old ages? We can examine this possibility by including a quadratic (age^2) term in the model. It’s as simple as centering the age variable (by subtracting the mean) and squaring it, then including that term in the model. It’s always best to include lower-order terms when you add higher-order terms, so our new formula will be raw_score ~ age_ctr + age_squared (first we have to compute age_ctr and age_squared). Try it now: first compute age_ctr and age_squared, then estimate the model with the new formula (store the model in a variable named score_agesq_lm).
lumos_tib <- lumos_tib %>%
mutate(
age_ctr = age-mean(age),
age_squared = age_ctr^2
)
score_agesq_lm <- lumos_tib %>% lm(formula = raw_score ~ age_ctr + age_squared)
summary(score_agesq_lm)
##
## Call:
## lm(formula = raw_score ~ age_ctr + age_squared, data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.5455 -3.4196 -0.0787 3.2473 21.3652
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 16.448005 0.243438 67.566 < 2e-16 ***
## age_ctr -0.059527 0.015266 -3.899 0.000103 ***
## age_squared -0.002668 0.001218 -2.191 0.028659 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.993 on 997 degrees of freedom
## Multiple R-squared: 0.01503, Adjusted R-squared: 0.01305
## F-statistic: 7.605 on 2 and 997 DF, p-value: 0.0005273
- Look at the coefficients for the intercept, age_ctr, and age_squared.
- The p-value for age_squared tells us that it is unlikely (p=.029) to find a t-stat this large under the null hypothesis (that the age_squared association is zero).
- Check multicollinearity with car::vif(score_agesq_lm) – looks okay, right? But what if we hadn’t centered the age variable before squaring it? See the sketch below.
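A minimal sketch of that comparison (age_sq_raw is a hypothetical column name made up here for illustration):

# VIFs for the centered model – should be low
car::vif(score_agesq_lm)
# now square the *uncentered* age variable instead (hypothetical age_sq_raw)
lumos_tib <- lumos_tib %>% mutate(age_sq_raw = age^2)
score_agesq_raw_lm <- lumos_tib %>% lm(formula = raw_score ~ age + age_sq_raw)
car::vif(score_agesq_raw_lm) # expect much larger VIFs: age and age^2 are highly correlated

Logistic regression is used when the outcome variable has just two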
possible values (e.g., true/false, on/off, yes/no, or greater/less than
some critical threshold). For the sake of learning, let’s imagine that a
raw_score
value of 16 or greater wins a $100 prize, so we
want to see if we can predict who wins the prize based only on their
years of education (years_edu
).
Why can’t we run a regular linear regression? - Because
regular linear regression will give you predicted values that fall
outside the possible outcome values and are not interpretable.
Logistic regression will yield predicted values between 0 and 1 that can
be interpreted as the probability of the outcome (e.g., prize received)
occurring. These predicted values follow a sigmoid-shaped logistic (inverse logit) curve.
- First, make a new variable prize that is 1 if raw_score is >= 16. Peek at the solution code to see how.
- Use the glm() function (glm = generalized linear model) to estimate the logistic model. Just like with lm() you can specify the formula as prize ~ years_edu, but now add the argument family=binomial to specify that you are estimating a logistic model.
- Store the model in a variable called prize_binlm, use summary(prize_binlm) to view the model summary, and anova(prize_binlm, test="LRT") for the likelihood ratio test (labeled “Chi-square” in SPSS).
- Get the predicted probabilities (predicted_probs <- fitted(prize_binlm)), and then use them to make a classification table like the one you saw in SPSS (table(...)).

lumos_tib <- lumos_tib %>%
mutate(prize = ifelse(raw_score>=16,1,0)) #if raw_score>=16 set prize to 1 (else 0)
prize_binlm <- lumos_tib %>% drop_na(prize,years_edu) %>%
glm(formula = prize ~ years_edu, family = binomial)
# for the classification table:
cat("Classification Table:\n")
predicted_probs <- fitted(prize_binlm)
table((lumos_tib %>% drop_na(prize, years_edu))$prize, predicted_probs>.5,
dnn = c("observed","predicted"))
#for the coefficients table:
summary(prize_binlm)
#for the overall chi-square test
anova(prize_binlm, test = "LRT")
## Classification Table:
## predicted
## observed FALSE TRUE
## 0 165 207
## 1 110 313
##
## Call:
## glm(formula = prize ~ years_edu, family = binomial, data = .)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.22288 0.44192 -5.030 4.91e-07 ***
## years_edu 0.15099 0.02802 5.388 7.14e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1098.8 on 794 degrees of freedom
## Residual deviance: 1068.4 on 793 degrees of freedom
## AIC: 1072.4
##
## Number of Fisher Scoring iterations: 4
##
## Analysis of Deviance Table
##
## Model: binomial, link: logit
##
## Response: prize
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev Pr(>Chi)
## NULL 794 1098.8
## years_edu 1 30.431 793 1068.4 3.459e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
- First, look at the Classification Table. The table shows how many
cases the model (including years_edu
) predicted to have a
prize
value of 1 (meaning the predicted probability from
the logit function was > .5), and this is split up according to
whether the actual/observed value was 1. So the overall accuracy is
(165+313)/795 = 60.13%
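You can compute that accuracy directly from the fitted probabilities; a minimal sketch (obs is a temporary name introduced here):

# observed outcomes for the cases used in the model
obs <- (lumos_tib %>% drop_na(prize, years_edu))$prize
# proportion of cases where the predicted class (prob > .5) matches the observed value
mean((obs == 1) == (predicted_probs > .5))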
Now, notice there is no R2 value. This is because there is no universally agreed method for computing one, but you can compute something similar to what SPSS provides if you want (e.g., see the documentation for the DescTools package; a sketch follows below).
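If you do want one, a hedged sketch using the DescTools package (assuming it is installed; check its documentation for the available methods):

# pseudo R-squared values similar to the SPSS "Model Summary" output
DescTools::PseudoR2(prize_binlm, which = c("CoxSnell", "Nagelkerke"))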
The “Analysis of Deviance Table” includes the Chi-square test of model fit (the same as the test in the “Model Summary” in SPSS). The chi-square stat is in the “Deviance” column. This test is based on the log-likelihood of the model. From the Field textbook, “log-likelihood is based on summing the probabilities associated with the predicted, P(Yi), and actual, Yi, outcomes. It is analogous to the residual sum of squares in the sense that it is an indicator of how much unexplained information there is after the model has been fitted.” (p. 1378, section 20.3.2). The log-likelihood is used to compute a deviance statistic that has a Chi-square distribution, and we use the Chi-square distribution to compute the probability of obtaining a statistic that large under the null hypothesis to get our overall model significance (p-value).
Let’s focus on the coefficient estimate for
years_edu
(in the table called “Coefficients” in the model
summary). In logistic regression, the coefficients are in units of
log(odds) (log = natural logarithm). This means that if we increase a
value of years_edu
by one unit the model would predict an
increase of .151 in the log(odds) of receiving a prize. We can try to
make that more understandable by converting the coefficient to an odds
ratio by exponentiating it (raising e to the power of
the coefficient - see section 20.3.5 of the Field textbook). If you
exponentiate our coefficient of .151 (in your R code you could add
exp(prize_binlm$coefficients)
) you would get
e^0.151 ≈ 1.16, meaning that a 1-year increase in
years_edu
would predict a ~16% increase in odds of
receiving the prize (according to the model). A number below 1 would
mean that higher years_edu
was related to lower odds of
receiving the prize.
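A minimal sketch of that conversion (confint() on a glm profiles the likelihood, so it may print a message while it runs):

# odds ratios for the intercept and years_edu
exp(coef(prize_binlm))
# 95% confidence intervals on the odds-ratio scale
exp(confint(prize_binlm))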
- You can use the likelihood ratio test (anova(prize_binlm, test = "LRT")) to compare a model with versus without each predictor. For a model with a single predictor you can just use the overall Chi-square test. For models with multiple predictors you would run one model without your predictor of interest, one model with the predictor, and use the anova() function to compare the models, as in the sketch below.
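A minimal sketch of that workflow, using age as a second (illustrative) predictor alongside years_edu:

# model without the predictor of interest, then with it (same cases in both)
prize_base_glm <- lumos_tib %>% drop_na(prize, years_edu, age) %>%
  glm(formula = prize ~ age, family = binomial)
prize_full_glm <- lumos_tib %>% drop_na(prize, years_edu, age) %>%
  glm(formula = prize ~ age + years_edu, family = binomial)
# likelihood ratio test for the contribution of years_edu
anova(prize_base_glm, prize_full_glm, test = "LRT")

We can plot the model predictions with a few lines of code using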
ggplot
(the y-axis is the predicted probability of
prize
) as in this code (the stat_smooth()
function computes the logistic curve):
p8a <- lumos_tib %>% drop_na(prize,years_edu) %>%
ggplot(aes(x=years_edu, y=prize)) +
geom_point(alpha=.15, position = position_jitter(w = 0.2, h = .05)) +
stat_smooth(method="glm", se=FALSE, method.args = list(family=binomial),
fullrange = TRUE) +
labs(title = "predicted probability: prize ~ years_edu",
y= "prize (predicted probability)") +
theme_classic()
# or extend the x-scale to see the full shape of the logistic curve (according to the model, you would need ~30 years of education to give yourself a 90% shot at the prize!! but the max years_edu value is 20 in this data)
p8b <- lumos_tib %>% drop_na(prize,years_edu) %>%
ggplot(aes(x=years_edu, y=prize)) +
geom_point(alpha=.15, position = position_jitter(w = 0.2, h = .05)) +
scale_x_continuous(limits = c(0,30)) +
stat_smooth(method="glm", se=FALSE, method.args = list(family=binomial),
fullrange = TRUE) +
labs(title = "predicted probability: prize ~ years_edu (x axis extended)",
y= "prize (predicted probability)") +
theme_classic()
p8a; p8b
- We didn’t check the same sources of bias (non-linearity, heteroscedasticity, lack of independence) that we did for the linear regression example. Rather than linearity between predictor and outcome, logistic regression assumes that the relation of predictors to the log(odds) of the outcome is linear. Homoscedasticity/homogeneity of variance is not an assumption of logistic regression. We do need to assume independence, because non-independent measures (e.g., unmodeled groups or repeated measures) cause a problem called overdispersion. Refer to the Field textbook (Chapter 20) for a complete overview of logistic regression and how to check assumptions. This resource at STHDA is also useful.