- Use the decision process described in the Field textbook to choose a measure of association between two variables
- Measure associations with more than two variables
- Extra: same correlation coefficient, different associations
- make a folder for today’s activity, with a new Rproj file
- make a “data” folder (inside the project folder)
- make an “r_docs” folder (inside the project folder)
Data description: This is a subset of a public dataset of Lumosity (a cognitive training website) user performance data. You can find the publication associated with this data here:
Guerra-Carrillo, B., Katovich, K., & Bunge, S. A. (2017). Does higher education hone cognitive functioning and learning efficacy? Findings from a large and diverse sample. PLoS ONE, 12(8), e0182276. https://doi.org/10.1371/journal.pone.0182276
Two variables to note are raw_score and pretest_score (a test at the start of the training program). pretest_score has been transformed to have a mean of 100 and a standard deviation of 15 (this is a typical transformation for IQ scores).
First, install 3 packages. In the console (not in your markdown doc), type each of these commands to install the needed packages:
- install.packages("gmodels")
- depending on your settings, installing “gmodels” will also install “DescTools”; if not, run install.packages("DescTools")
- install.packages("ppcor")
What to do next:
1. Edit your “setup” code chunk to add these lines:
knitr::opts_knit$set(root.dir = rprojroot::find_rstudio_root_file())
library("tidyverse")
2. Make a new code chunk in your R markdown doc and use
readr::read_csv() to read in the data. Make sure that NA values are
handled the way you want (click on the tibble in the Environment pane
to take a quick look).
3. Set data types for columns as needed (gender, edu_cat as factors, …)
– take a look at the code to see how
#first import the data
lumos_tib <- readr::read_csv("data/lumos_subset.csv", na = "NA")
# now make sure the columns we want as factors are treated that way, using forcats::as_factor()
lumos_tib <- lumos_tib %>% dplyr::mutate(
test_type = forcats::as_factor(test_type),
assessment = forcats::as_factor(assessment),
gender = forcats::as_factor(gender),
edu_cat = forcats::as_factor(edu_cat),
english_nativelang = forcats::as_factor(english_nativelang),
ethnicity = forcats::as_factor(ethnicity)
)
Let’s start with the association between pre-test performance (pretest_score) and post-test performance (raw_score) – they should be related, right? First we’ll check the distributions of raw_score and pretest_score, then we’ll make a scatter plot of the two together, with pretest_score on the x-axis and raw_score on the y-axis.
# histogram for raw_score
p1 <- lumos_tib %>% drop_na(raw_score,pretest_score) %>%
ggplot( aes(x=raw_score)) + geom_histogram(binwidth = 1) + theme_classic() +
labs (title = "Raw score distribution")
# Q-Q for raw_score
p2 <- lumos_tib %>% drop_na(raw_score,pretest_score) %>%
ggplot( aes(sample=raw_score)) + geom_qq() +
geom_qq_line() + theme_classic() +
labs (title = "Raw score Q-Q")
# histogram for pretest_score
p3 <- lumos_tib %>% drop_na(raw_score,pretest_score) %>%
ggplot( aes(x=pretest_score)) + geom_histogram(binwidth = 3) +
theme_classic() +
labs (title = "pretest distribution")
# Q-Q for pretest_score
p4 <- lumos_tib %>% drop_na(raw_score,pretest_score) %>%
ggplot( aes(sample=pretest_score)) + geom_qq() +
geom_qq_line() + theme_classic() +
labs (title = "pretest Q-Q")
# scatter plot
p5 <- lumos_tib %>% drop_na(raw_score,pretest_score) %>%
ggplot(aes(x=pretest_score, y=raw_score)) +
geom_point() + theme_classic() +
labs(title="raw_score vs pretest_score scatter")
# extra: scatter plot with a slight random jitter so you can see overlapping points
# shape = 1 specifies empty circles for points
p6 <- lumos_tib %>% drop_na(raw_score,pretest_score) %>%
ggplot(aes(x=pretest_score, y=raw_score)) +
geom_jitter(width = .4, height = .2, size = 1, shape = 1, alpha = .25) + theme_classic() +
labs(title="raw_score vs pretest_score scatter with jitter")
p1; p2; p3; p4; p5; p6
Next, compute the Pearson correlation with the correlation() function from the correlation package. Specify the options method = "pearson" and alternative = "two.sided" (for a two-sided significance test) as in the code below. You can use dplyr::select() to select the variables that you pipe to the correlation::correlation() function.
#get the correlation
lumos_tib %>% drop_na(raw_score,pretest_score) %>%
dplyr::select(pretest_score,raw_score) %>%
correlation::correlation(method = "pearson",
alternative = "two.sided",
digits = 3, ci_digits = 3)
#extra example: html formatted table using sjPlot package
lumos_tib %>% drop_na(raw_score,pretest_score,age) %>%
dplyr::select(pretest_score,raw_score,age) %>%
sjPlot::tab_corr(corr.method = "pearson",
title = "Table 1 - correlations between performance and age",
p.numeric = FALSE,digits = 3,
triangle = "lower",
use.viewer = TRUE)
## # Correlation Matrix (pearson-method)
##
## Parameter1 | Parameter2 | r | 95% CI | t(998) | p
## ------------------------------------------------------------------------
## pretest_score | raw_score | 0.779 | [0.754, 0.803] | 39.293 | < .001***
##
## p-value adjustment method: Holm (1979)
## Observations: 1000
|               | pretest_score | raw_score | age |
|---------------|---------------|-----------|-----|
| pretest_score |               |           |     |
| raw_score     | 0.779***      |           |     |
| age           | -0.112***     | -0.101**  |     |

Computed correlation used pearson-method with listwise-deletion.
After you have the scatter plot and pearson correlation,
confidence interval, and null hypothesis test statistic, answer this
question for yourself:
Based on the p-value you got (“p<.001”), which statement below is
true? (assume that this dataset is a random sample of Lumosity
users)
1. there is a greater than 99.9% probability that the true population correlation between raw_score and pretest_score is non-zero
2. there is less than .1% probability that raw_score and pretest_score are uncorrelated
3. there is less than .1% probability of finding a correlation at least this extreme (in a sample this size) if the true population correlation is zero
NOTE: With large samples, correlation p-values are often not useful, because even trivially small correlations are significant. The effect size (the Pearson correlation coefficient, r, in this case) is generally what you should care about.
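To see the point about large samples concretely, here is a quick optional sketch (the r value of .07 is made up for illustration, not from the lumos data) using the standard t transformation of a correlation coefficient:

```r
# Why tiny correlations reach significance in large samples:
# the t statistic for a correlation r with sample size n.
r <- 0.07   # a trivially small correlation (r-squared < .005)
n <- 1000
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)
p_val  <- 2 * pt(-abs(t_stat), df = n - 2)
round(c(t = t_stat, p = p_val), 4)  # t is about 2.22, p is about .027
```

Even r = .07 is “significant” at the .05 level with n = 1000, which is why the size of r matters more than the p-value here.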
NOTE #2: The scatter plot doesn’t seem to show 1,000 points, right? That is because many data points sit right on top of each other (there are only 38 unique values of raw_score). In cases where you want to see more of the individual points, you can use a “jitter” to add/subtract small random amounts from each value before plotting. A different approach is to make the points partially transparent with the alpha option, so the points appear darker where many are stacked on top of one another.
What about the association between pretest_score and gametime_prior? We already checked the pretest_score distribution, so now check the gametime_prior distribution and make a scatter plot of pretest_score vs gametime_prior – what do you notice?
# gametime_prior histogram
p1 <- lumos_tib %>% drop_na(pretest_score,gametime_prior) %>%
ggplot( aes(x=gametime_prior)) +
geom_histogram(bins = 40) + theme_classic() +
labs (title = "Prior game time distribution")
# gametime_prior Q-Q
p2 <- lumos_tib %>% drop_na(pretest_score,gametime_prior) %>%
ggplot( aes(sample=gametime_prior)) +
geom_qq() + geom_qq_line() + theme_classic() +
labs (title = "Prior game time Q-Q")
# scatter plot: pretest_score vs gametime_prior
p3 <- lumos_tib %>% drop_na(pretest_score,gametime_prior) %>%
ggplot(aes(x=gametime_prior, y=pretest_score)) +
geom_point() + theme_classic() +
labs(title="Pretest score vs prior game time scatter")
# EXTRA - is this a situation where you could log-transform the gametime_prior variable?
p4 <- lumos_tib %>% drop_na(pretest_score,gametime_prior) %>%
ggplot( aes(x=log(gametime_prior))) +
geom_histogram(bins = 40) + theme_classic() +
labs (title = "LOG Prior game time distribution")
# gametime_prior Q-Q
p5 <- lumos_tib %>% drop_na(pretest_score,gametime_prior) %>%
ggplot( aes(sample=log(gametime_prior))) +
geom_qq() + geom_qq_line() + theme_classic() +
labs (title = "LOG Prior game time Q-Q")
# scatter plot: pretest_score vs gametime_prior
p6 <- lumos_tib %>% drop_na(pretest_score,gametime_prior) %>%
ggplot(aes(x=log(gametime_prior), y=pretest_score)) +
geom_jitter(width = .1, height = 1, size = 1, shape = 1, alpha = .25) +
theme_classic() +
labs(title="Pretest score vs LOG prior game time scatter")
p1; p2; p3; p4; p5; p6
Use the correlation() function from the correlation package again, but change the method option to method = "spearman".
#get the correlation
lumos_tib %>% drop_na(pretest_score,gametime_prior) %>%
dplyr::select(gametime_prior,pretest_score) %>%
correlation::correlation(method = "spearman",
alternative = "two.sided",
digits = 3, ci_digits = 3)
#extra: for comparison, the pearson correlation on the untransformed variable
lumos_tib %>% drop_na(pretest_score,gametime_prior) %>%
dplyr::select(gametime_prior,pretest_score) %>%
correlation::correlation(method = "pearson",
alternative = "two.sided",
digits = 3, ci_digits = 3)
#extra - notice that the spearman correlation coefficient is not affected by log transformation - why? because log tranformation doesn't affect ranks
lumos_tib %>% drop_na(pretest_score,gametime_prior) %>%
mutate(loggametime_prior=log(gametime_prior)) %>%
dplyr::select(loggametime_prior,pretest_score) %>%
correlation::correlation(method = "spearman",
alternative = "two.sided",
digits = 3, ci_digits = 3)
## # Correlation Matrix (spearman-method)
##
## Parameter1 | Parameter2 | rho | 95% CI | S | p
## ----------------------------------------------------------------------------
## gametime_prior | pretest_score | 0.061 | [-0.003, 0.124] | 1.565e+08 | 0.054
##
## p-value adjustment method: Holm (1979)
## Observations: 1000
## # Correlation Matrix (pearson-method)
##
## Parameter1 | Parameter2 | r | 95% CI | t(998) | p
## ----------------------------------------------------------------------------
## gametime_prior | pretest_score | 0.126 | [0.065, 0.187] | 4.018 | < .001***
##
## p-value adjustment method: Holm (1979)
## Observations: 1000
## # Correlation Matrix (spearman-method)
##
## Parameter1 | Parameter2 | rho | 95% CI | S | p
## -------------------------------------------------------------------------------
## loggametime_prior | pretest_score | 0.061 | [-0.003, 0.124] | 1.565e+08 | 0.054
##
## p-value adjustment method: Holm (1979)
## Observations: 1000
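The “extra” note above (the Spearman coefficient is unaffected by log transformation because log doesn’t change ranks) can be demonstrated directly. This is a standalone sketch with simulated data, not part of the lumos analysis:

```r
# A strictly increasing transform like log() preserves ranks, so
# Spearman's rho is unchanged, while Pearson's r generally shifts.
set.seed(1)
x <- rexp(200)        # positive, skewed variable (log is defined everywhere)
y <- x + rnorm(200)   # a related outcome
identical(rank(x), rank(log(x)))  # TRUE: ranks are identical
c(spearman_raw = cor(x, y, method = "spearman"),
  spearman_log = cor(log(x), y, method = "spearman"),  # same as spearman_raw
  pearson_raw  = cor(x, y),
  pearson_log  = cor(log(x), y))                       # differs from pearson_raw
```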
Extra: compute Kendall’s tau (set method = "kendall") on pretest_score and gametime_prior – compare the correlation coefficient values (spearman, pearson, kendall) and p-values – what do you notice?
#get the correlation
lumos_tib %>% drop_na(pretest_score,gametime_prior) %>%
dplyr::select(gametime_prior,pretest_score) %>%
correlation::correlation(method = "kendall",
alternative = "two.sided",
digits = 3, ci_digits = 3)
## # Correlation Matrix (kendall-method)
##
## Parameter1 | Parameter2 | tau | 95% CI | z | p
## -----------------------------------------------------------------------
## gametime_prior | pretest_score | 0.042 | [0.001, 0.083] | 1.919 | 0.055
##
## p-value adjustment method: Holm (1979)
## Observations: 1000
Sometimes we want to look at associations between nominal variables, but we can’t use the above methods because one or more variables is not ordinal. In this Lumosity dataset, let’s say we want to know whether the non-native English speakers who use the website tend to have a different level of education than native English speakers.
Make a contingency table of the edu_cat and english_nativelang variables using the CrossTable() function from the “gmodels” package, as in the code below (the curly brace syntax is used because gmodels functions do not use the tidyverse syntax). Use the dplyr::filter() function to ignore cases where edu_cat or english_nativelang is “unspecified”.
# contingency table
conting_table <- lumos_tib %>%
filter(edu_cat != "unspecified" & english_nativelang != "unspecified") %>%
{gmodels::CrossTable(.$edu_cat, .$english_nativelang, expected=TRUE,
prop.t=FALSE, prop.c=FALSE, prop.r = FALSE)}
##
##
## Cell Contents
## |-------------------------|
## | N |
## | Expected N |
## | Chi-square contribution |
## |-------------------------|
##
##
## Total Observations in Table: 738
##
##
## | .$english_nativelang
## .$edu_cat | yes | no | Row Total |
## -------------|-----------|-----------|-----------|
## hs | 82 | 5 | 87 |
## | 78.159 | 8.841 | |
## | 0.189 | 1.669 | |
## -------------|-----------|-----------|-----------|
## college | 410 | 37 | 447 |
## | 401.573 | 45.427 | |
## | 0.177 | 1.563 | |
## -------------|-----------|-----------|-----------|
## postcollege | 171 | 33 | 204 |
## | 183.268 | 20.732 | |
## | 0.821 | 7.260 | |
## -------------|-----------|-----------|-----------|
## Column Total | 663 | 75 | 738 |
## -------------|-----------|-----------|-----------|
##
##
## Statistics for All Table Factors
##
##
## Pearson's Chi-squared test
## ------------------------------------------------------------
## Chi^2 = 11.6791 d.f. = 2 p = 0.002910156
##
##
##
Now let’s walk through the output. First, notice the number of total observations - it tells you that 738 out of the original 1000 cases had values that we can examine in this analysis (meaning values were not “unspecified”).
The legend at the top tells you what is in each cell: the observed count (N), the count expected if the two variables were independent (Expected N), and the cell’s contribution to the chi-squared statistic. For example, if edu_cat and english_nativelang are independent, we should expect 20.732 users in the postcollege/non-native-English cell. You can look over every cell, but a general question we can address is whether there is statistical evidence that education level and English as a native language are NOT independent. The chi-squared independence test at the bottom of the output provides the null hypothesis test - showing that the observed deviations from expected frequencies are unlikely (χ2(2, N = 738) = 11.68, p = .0029) if the null hypothesis is true (that the variables are independent). Degrees of freedom for this test is (rows - 1)*(columns - 1). The chi-squared contributions for each cell suggest that the association is largely driven by more non-native speakers with post-college education than would be expected if the variables were independent.
Next, use the DescTools::ContCoef() function to compute the contingency coefficient. You will need to pass in the frequency table that you made in the last step, like this: DescTools::ContCoef(conting_table$t) (assuming you stored the table in a variable called conting_table in the previous step). Then use the DescTools::CramerV() function to compute Cramer’s V, passing in the same frequency table. You can use the print() function to make the output more readable (as in the code below).
# contingency coefficient
print("The contingency coefficient for edu_cat and english_native_lang is:")
print(DescTools::ContCoef(conting_table$t), digits=4)
print("Cramer's V for edu_cat and english_native_lang is:")
print(DescTools::CramerV(conting_table$t), digits=4)
## [1] "The contingency coefficient for edu_cat and english_native_lang is:"
## [1] 0.1248
## [1] "Cramer's V for edu_cat and english_native_lang is:"
## [1] 0.1258
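For reference, both effect sizes are simple functions of the chi-squared statistic. This optional check reproduces the values above from chi2 = 11.6791, N = 738, and min(rows, columns) = 2 for the 3 x 2 table:

```r
chi2 <- 11.6791
N    <- 738
k    <- 2  # min(number of rows, number of columns)
C <- sqrt(chi2 / (chi2 + N))     # contingency coefficient, about 0.1248
V <- sqrt(chi2 / (N * (k - 1)))  # Cramer's V, about 0.1258
round(c(ContCoef = C, CramerV = V), 4)
```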
- the low (~.125) coefficient/Cramer’s V tells us that, although
the chi-squared test is significant, the association is rather small
(but that’s not to say it couldn’t be meaningful)
- Note about effect size: Cramer’s V is a commonly used effect size
measure for associations between nominal variables
Suppose you want to know about the association between age and
performance in the training program (raw_score
), but you
want to adjust for their performance level before the training program
(pretest_score
). One way to adjust is with semi-partial and
partial correlations. We’ll do both here.
Start by making scatter plots for each pair of age, raw_score, and pretest_score, then compute Pearson correlation coefficients between each pair of variables (pipe all three variables to correlation::correlation() as in the code below).
# raw_score X age scatter
p1 <- lumos_tib %>% drop_na(age,raw_score,pretest_score) %>%
ggplot(aes(x=age, y=raw_score)) +
geom_point() + theme_classic() +
labs(title="Raw score vs Age scatter")
p2 <- lumos_tib %>% drop_na(age,raw_score,pretest_score) %>%
ggplot(aes(x=age, y=pretest_score)) +
geom_point() + theme_classic() +
labs(title="Pretest score vs Age scatter")
p3 <- lumos_tib %>% drop_na(age,raw_score,pretest_score) %>%
ggplot(aes(x=pretest_score, y=raw_score)) +
geom_point() + theme_classic() +
labs(title="Raw score vs Pretest score scatter")
corrtable <- lumos_tib %>% drop_na(age,raw_score,pretest_score) %>%
dplyr::select(age,raw_score,pretest_score) %>%
correlation::correlation(method = "pearson",
alternative = "two.sided",
digits = 3, ci_digits = 3)
#if you want r-squared in the corr table then use the line below
corrtable <- corrtable %>% mutate(rsq=r^2)
p1; p2; p3; corrtable
## Parameter1 | Parameter2 | r | 95% CI | t(998) | p | rsq
## -------------------------------------------------------------------------------
## age | raw_score | -0.10 | [-0.16, -0.04] | -3.22 | 0.001** | 0.01
## age | pretest_score | -0.11 | [-0.17, -0.05] | -3.56 | < .001*** | 0.01
## raw_score | pretest_score | 0.78 | [ 0.75, 0.80] | 39.29 | < .001*** | 0.61
##
## Observations: 1000
Now we want the correlation between age and raw_score, adjusting (partialling out) for pretest_score. Notice that the correlations and scatter plots show us that pretest_score shares variance with both (a small negative correlation with age, and a large positive correlation with raw_score).
The semi-partial correlation measures the relation between age and raw_score, accounting for (i.e., removing shared variance from) the association between age and pretest_score (but not the association between raw_score and pretest_score).
The partial correlation measures the relation between age and raw_score, accounting for (i.e., removing shared variance from) both the association between age and pretest_score and the association between raw_score and pretest_score. This can be restated as the unique relationship between age and raw_score as a proportion of the variance in raw_score that is left over when pretest_score has been considered (that’s a mouthful, so we’ll try to sum it all up at the end).
Compute both with the ppcor::spcor.test() and ppcor::pcor.test() function calls as in the code below. We will specify x=age, y=raw_score, and z=pretest_score in the function arguments.
# semi-partial
semipart <- lumos_tib %>% drop_na(age,raw_score,pretest_score) %>%
{ppcor::spcor.test(method = "pearson",
x=.$age, y=.$raw_score, z=.$pretest_score)}
print("semi-partial correlation: x=age, y=raw_score, z=pretest_score")
print(semipart)
# partial
partcor <- lumos_tib %>% drop_na(age,raw_score,pretest_score) %>%
{ppcor::pcor.test(method = "pearson",
x=.$age, y=.$raw_score, z=.$pretest_score)}
print("partial correlation: x=age, y=raw_score, z=pretest_score")
print(partcor)
## [1] "semi-partial correlation: x=age, y=raw_score, z=pretest_score"
## estimate p.value statistic n gp Method
## 1 -0.02242273 0.4789975 -0.7081825 1000 1 pearson
## [1] "partial correlation: x=age, y=raw_score, z=pretest_score"
## estimate p.value statistic n gp Method
## 1 -0.02256492 0.4762132 -0.7126757 1000 1 pearson
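As an optional sanity check, the partial correlation can also be computed by hand from the three zero-order correlations reported earlier. Using the rounded r values from that table, the result only approximately matches ppcor’s -0.0226:

```r
# Partial correlation of x (age) and y (raw_score) given z (pretest_score),
# computed from the zero-order correlations (rounded values from the table above)
r_xy <- -0.10  # age, raw_score
r_xz <- -0.11  # age, pretest_score
r_yz <-  0.78  # raw_score, pretest_score
partial_r <- (r_xy - r_xz * r_yz) / sqrt((1 - r_xz^2) * (1 - r_yz^2))
round(partial_r, 4)  # about -0.023, close to ppcor's -0.0226
```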
What do you notice?
- We originally asked whether age was related to performance at the end of the training program. The (small) zero-order correlation between age and raw_score suggested that older individuals scored somewhat lower.
- But we also saw a (similarly small) correlation between age and pretest_score, so we decided we should adjust for pretest_score. This way we could ask how age uniquely related to post-training performance - that is, when we account for the fact that older adults had lower pretest scores, did older adults perform worse after training?
- Both the semi-partial and partial correlations suggested that the association between age and post-training performance was explained by pretest scores (the adjusted correlations were near zero and non-significant).
- See chapter 8 for more on semi-partial and partial correlations.
Extra: “anscombe” is a small dataset contained in base R. Import it with as_tibble(anscombe), then plot each x/y pair with its regression line, as in the code below.
# import data
ans_tib <- as_tibble(anscombe)
#x1 and y1
p1 <- ans_tib %>%
ggplot(aes(x=x1, y=y1)) + geom_point() +
geom_smooth(method = "lm", formula = "y ~ x", se = FALSE, fullrange = TRUE ) +
scale_x_continuous(limits = c(4,20)) +
scale_y_continuous(limits = c(2,14)) +
coord_cartesian(xlim = c(4,20), ylim = c(2,14)) + labs(title="x1 and y1") + theme_classic() + ggpubr::stat_cor(method="pearson")
#x2 and y2
p2 <- ans_tib %>%
ggplot(aes(x=x2, y=y2)) + geom_point() +
geom_smooth(method = "lm", formula = "y ~ x", se = FALSE, fullrange = TRUE ) +
scale_x_continuous(limits = c(4,20)) +
scale_y_continuous(limits = c(2,14)) +
coord_cartesian(xlim = c(4,20), ylim = c(2,14)) + labs(title="x2 and y2") + theme_classic() + ggpubr::stat_cor(method="pearson")
#x3 and y3
p3 <- ans_tib %>%
ggplot(aes(x=x3, y=y3)) + geom_point() +
geom_smooth(method = "lm", formula = "y ~ x", se = FALSE, fullrange = TRUE ) +
scale_x_continuous(limits = c(4,20)) +
scale_y_continuous(limits = c(2,14)) +
coord_cartesian(xlim = c(4,20), ylim = c(2,14)) + labs(title="x3 and y3") + theme_classic() + ggpubr::stat_cor(method="pearson")
#x4 and y4
p4 <- ans_tib %>%
ggplot(aes(x=x4, y=y4)) + geom_point() +
geom_smooth(method = "lm", formula = "y ~ x", se = FALSE, fullrange = TRUE ) +
scale_x_continuous(limits = c(4,20)) +
scale_y_continuous(limits = c(2,14)) +
coord_cartesian(xlim = c(4,20), ylim = c(2,14)) + labs(title="x4 and y4") + theme_classic() + ggpubr::stat_cor(method="pearson")
gridExtra::grid.arrange(p1, p2, p3, p4, nrow = 2, ncol = 2)
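Alongside the plots, it is worth verifying numerically that the four Anscombe pairs share near-identical summary statistics even though their scatter plots look completely different:

```r
# Means and Pearson correlations for each of the four Anscombe pairs:
# all four share mean(x) = 9, mean(y) of about 7.50, and r of about 0.816.
ans_stats <- sapply(1:4, function(i) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  c(mean_x = mean(x), mean_y = mean(y), r = cor(x, y))
})
round(ans_stats, 3)
```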
Textbook Chapter 8 - Field, A. (2018). Discovering statistics using IBM SPSS statistics. 5th Edition. SAGE Publications.
Dataset from Guerra-Carrillo, Katovich, & Bunge (2017). “Does higher education hone cognitive functioning and learning efficacy? Findings from a large and diverse sample.” PLoS ONE, 12(8), e0182276. Licensed under CC BY 4.0 International by Belen Guerra-Carrillo and Bunge Lab.
discovr_07 (Andy Field tutorial)
More on contingency tables and coefficients at R Coder