Learning Objectives


Step 1 - Get organized


Step 2 - Import data and install packages

data description:
This is a subset of a public dataset of Lumosity (a cognitive training website) user performance data. You can find the publication associated with these data here:
Guerra-Carrillo, B., Katovich, K., & Bunge, S. A. (2017). Does higher education hone cognitive functioning and learning efficacy? Findings from a large and diverse sample. PloS one, 12(8), e0182276. https://doi.org/10.1371/journal.pone.0182276

First, install three packages. In the console (not in your markdown doc), type each of these commands to install the needed packages:
- install.packages("gmodels")
- depending on your settings, installing "gmodels" may also install "DescTools"; if not, run install.packages("DescTools")
- install.packages("ppcor")

What to do next:
1. Edit your “setup” code chunk to add these lines:
knitr::opts_knit$set(root.dir = rprojroot::find_rstudio_root_file())
library("tidyverse")
2. Make a new code chunk in your R markdown doc and use readr::read_csv() to read in the data. Make sure that NA values are handled the way you want (click on the tibble in the Environment window pane to take a quick look).
3. Set data types for columns as needed (gender, edu_cat as factors, …) – take a look at the code to see how

#first import the data
lumos_tib <- readr::read_csv("data/lumos_subset.csv", na = "NA")

# now make sure the columns we want as factors are treated that way, using forcats::as_factor()
lumos_tib <- lumos_tib %>% dplyr::mutate(
  test_type = forcats::as_factor(test_type),
  assessment = forcats::as_factor(assessment),
  gender = forcats::as_factor(gender),
  edu_cat = forcats::as_factor(edu_cat),
  english_nativelang = forcats::as_factor(english_nativelang),
  ethnicity = forcats::as_factor(ethnicity)
)
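
Optional: instead of (or in addition to) clicking on the tibble in the Environment pane, you can check the import from code. This is just a quick sanity check, not something the exercise requires:

# check column types and count missing values in each column
dplyr::glimpse(lumos_tib)
lumos_tib %>% summarise(across(everything(), ~sum(is.na(.))))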

Step 3 - Pearson correlation

Correlation decision chart
Step 3.1 - First, take a look at the decision chart above (Fig 8.6 from the Field textbook). We will start by taking a quick look at the distributions of raw_score and pretest_score, then we'll make a scatterplot of the two together
  1. Make a histogram and Q-Q plot for each variable
    - drop any datapoint that does not have a valid value for both raw_score and pretest_score
    - are the measures distributed (approximately) normally?
  2. Now make a scatterplot with pretest_score on the x-axis and raw_score on the y-axis
    - does the association appear linear?
    - pay attention to the number of points you see in the scatterplot - is it what you expected?
# histogram for raw_score
p1 <- lumos_tib %>% drop_na(raw_score,pretest_score) %>%
    ggplot( aes(x=raw_score)) + geom_histogram(binwidth = 1) + theme_classic() +
        labs (title = "Raw score distribution")
# Q-Q for raw_score
p2 <- lumos_tib %>% drop_na(raw_score,pretest_score) %>%
    ggplot( aes(sample=raw_score)) + geom_qq() + 
        geom_qq_line() + theme_classic() +
        labs (title = "Raw score Q-Q")
# histogram for pretest_score
p3 <- lumos_tib %>% drop_na(raw_score,pretest_score) %>%
    ggplot( aes(x=pretest_score)) + geom_histogram(binwidth = 3) +
      theme_classic() +
      labs (title = "pretest distribution")
# Q-Q for pretest_score
p4 <- lumos_tib %>% drop_na(raw_score,pretest_score) %>%
    ggplot( aes(sample=pretest_score)) + geom_qq() + 
        geom_qq_line() + theme_classic() +
        labs (title = "pretest Q-Q")
# scatter plot
p5 <- lumos_tib %>% drop_na(raw_score,pretest_score) %>%
  ggplot(aes(x=pretest_score, y=raw_score)) + 
    geom_point() + theme_classic() + 
    labs(title="raw_score vs pretest_score scatter")
# extra: scatter plot with a slight random jitter so you can see overlapping points
#   shape = 1 specifies empty circles for points
p6 <- lumos_tib %>% drop_na(raw_score,pretest_score) %>%
  ggplot(aes(x=pretest_score, y=raw_score)) + 
    geom_jitter(width = .4, height = .2, size = 1, shape = 1, alpha = .25) + theme_classic() + 
    labs(title="raw_score vs pretest_score scatter with jitter")

p1; p2; p3; p4; p5; p6

 

  • In the plots we can see that both variables are approximately normally distributed; in any case, we have a large sample (1,000 observations), so we are not too concerned. We should, however, pay attention to extreme values that may have an overly strong influence on the correlation (these are often called high-leverage outliers) - the scatterplot shows that the outliers don't really change the association (i.e., they fit the overall pattern).
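
(Optional) If you want to flag extreme values from code rather than just eyeballing the plots, one quick way - an illustration, not something the exercise asks for - is to standardize each variable and list the cases beyond |z| > 3:

# list any cases more than 3 SD from the mean on either variable
lumos_tib %>% drop_na(raw_score, pretest_score) %>%
  mutate(z_raw = as.numeric(scale(raw_score)),
         z_pretest = as.numeric(scale(pretest_score))) %>%
  filter(abs(z_raw) > 3 | abs(z_pretest) > 3) %>%
  dplyr::select(raw_score, pretest_score, z_raw, z_pretest)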
Step 3.2 - Now that we are satisfied with the assumptions, compute a Pearson correlation coefficient, confidence interval, and null hypothesis test p-value (use a two-sided test).
  • Use the correlation() function from the correlation library. Specify the options method="pearson" and alternative="two.sided" (for a two-sided significance test) like in the code below. You can use dplyr::select() to select the variables that you pipe to the correlation::correlation() function.
#get the correlation
lumos_tib %>% drop_na(raw_score,pretest_score) %>% 
  dplyr::select(pretest_score,raw_score) %>% 
  correlation::correlation(method = "pearson", 
                           alternative = "two.sided", 
                           digits = 3, ci_digits = 3)

#extra example: html formatted table using sjPlot package
lumos_tib %>% drop_na(raw_score,pretest_score,age) %>% 
  dplyr::select(pretest_score,raw_score,age) %>% 
  sjPlot::tab_corr(corr.method = "pearson",
                   title = "Table 1 - correlations between performance and age",
                   p.numeric = FALSE,digits = 3,
                   triangle = "lower",
                   use.viewer = TRUE)
## # Correlation Matrix (pearson-method)
## 
## Parameter1    | Parameter2 |     r |         95% CI | t(998) |         p
## ------------------------------------------------------------------------
## pretest_score |  raw_score | 0.779 | [0.754, 0.803] | 39.293 | < .001***
## 
## p-value adjustment method: Holm (1979)
## Observations: 1000
Table 1 - correlations between performance and age

                  pretest_score   raw_score    age
  pretest_score
  raw_score           0.779***
  age                -0.112***    -0.101**

Computed correlation used pearson-method with listwise-deletion.

 

After you have the scatterplot and the Pearson correlation, confidence interval, and null hypothesis test statistic, answer this question for yourself:
Based on the p-value you got (“p<.001”), which statement below is true? (assume that this dataset is a random sample of Lumosity users)
1. there is a greater than 99.9% probability that the true population correlation between raw_score and pretest_score is non-zero
2. there is less than .1% probability that raw_score and pretest_score are uncorrelated
3. there is less than .1% probability of finding a correlation at least this extreme (in a sample this size) if the true population correlation is zero

NOTE: With large samples, correlation p-values are often not useful, because even trivially small correlations come out statistically significant (see the quick illustration after these notes). The effect size (the Pearson correlation coefficient, r, in this case) is generally what you care about.
NOTE #2: The scatterplot doesn't seem to show 1,000 points, right? That is because many data points sit right on top of each other (there are only 38 unique values of raw_score) - in cases where you want to see more of the individual points, you can use a "jitter" to add/subtract small random amounts from each value before plotting. A different approach is to make the points partially transparent with the alpha option, so the points appear darker where many overlap.
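
To make the first note concrete: the t-statistic for a Pearson correlation is t = r*sqrt(n-2)/sqrt(1-r^2), so with n = 1000 even a tiny correlation clears the conventional .05 threshold (the r value below is made up purely for illustration):

# with n = 1000, even r = .07 comes out "significant"
r <- .07; n <- 1000
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)
2 * pt(abs(t_stat), df = n - 2, lower.tail = FALSE)  # two-sided p, roughly .027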


Step 4 - Non-parametric correlation coefficients: Spearman’s rho & Kendall’s tau

# gametime_prior histogram
p1 <- lumos_tib %>% drop_na(pretest_score,gametime_prior) %>%
    ggplot( aes(x=gametime_prior)) + 
        geom_histogram(bins = 40) + theme_classic() +
        labs (title = "Prior game time distribution")
# gametime_prior Q-Q
p2 <- lumos_tib %>% drop_na(pretest_score,gametime_prior) %>%
    ggplot( aes(sample=gametime_prior)) + 
        geom_qq() + geom_qq_line() + theme_classic() +
        labs (title = "Prior game time Q-Q")
# scatter plot: pretest_score vs gametime_prior
p3 <- lumos_tib %>% drop_na(pretest_score,gametime_prior) %>%
    ggplot(aes(x=gametime_prior, y=pretest_score)) + 
        geom_point() + theme_classic() + 
        labs(title="Pretest score vs prior game time scatter")
# EXTRA - is this a situation where you could log-transform the gametime_prior variable?  
p4 <- lumos_tib %>% drop_na(pretest_score,gametime_prior) %>%
    ggplot( aes(x=log(gametime_prior))) + 
        geom_histogram(bins = 40) + theme_classic() +
        labs (title = "LOG Prior game time distribution")
# LOG gametime_prior Q-Q
p5 <- lumos_tib %>% drop_na(pretest_score,gametime_prior) %>%
    ggplot( aes(sample=log(gametime_prior))) + 
        geom_qq() + geom_qq_line() + theme_classic() +
        labs (title = "LOG Prior game time Q-Q")

# scatter plot: pretest_score vs LOG gametime_prior (with jitter)
p6 <- lumos_tib %>% drop_na(pretest_score,gametime_prior) %>%
    ggplot(aes(x=log(gametime_prior), y=pretest_score)) + 
        geom_jitter(width = .1, height = 1, size = 1, shape = 1, alpha = .25) +
        theme_classic() + 
        labs(title="Pretest score vs LOG prior game time scatter")
p1; p2; p3; p4; p5; p6

 

Step 4.1 - Compute the Spearman rank correlation coefficient, rho, with confidence interval, and null hypothesis test p-value (use a two-sided test).
  • Use the same correlation() function from the correlation library but change the method option to method = "spearman"
  • for comparison, also compute the Pearson correlation coefficient
#get the correlation
lumos_tib %>% drop_na(pretest_score,gametime_prior) %>% 
  dplyr::select(gametime_prior,pretest_score) %>% 
  correlation::correlation(method = "spearman", 
                           alternative = "two.sided", 
                           digits = 3, ci_digits = 3)

# for comparison, the Pearson correlation on the untransformed variables
lumos_tib %>% drop_na(pretest_score,gametime_prior) %>% 
  dplyr::select(gametime_prior,pretest_score) %>% 
  correlation::correlation(method = "pearson", 
                           alternative = "two.sided", 
                           digits = 3, ci_digits = 3)

#extra - notice that the spearman correlation coefficient is not affected by log transformation - why? because log transformation doesn't change the ranks
lumos_tib %>% drop_na(pretest_score,gametime_prior) %>% 
  mutate(loggametime_prior=log(gametime_prior)) %>% 
  dplyr::select(loggametime_prior,pretest_score) %>% 
  correlation::correlation(method = "spearman", 
                           alternative = "two.sided", 
                           digits = 3, ci_digits = 3)
## # Correlation Matrix (spearman-method)
## 
## Parameter1     |    Parameter2 |   rho |          95% CI |         S |     p
## ----------------------------------------------------------------------------
## gametime_prior | pretest_score | 0.061 | [-0.003, 0.124] | 1.565e+08 | 0.054
## 
## p-value adjustment method: Holm (1979)
## Observations: 1000
## # Correlation Matrix (pearson-method)
## 
## Parameter1     |    Parameter2 |     r |         95% CI | t(998) |         p
## ----------------------------------------------------------------------------
## gametime_prior | pretest_score | 0.126 | [0.065, 0.187] |  4.018 | < .001***
## 
## p-value adjustment method: Holm (1979)
## Observations: 1000
## # Correlation Matrix (spearman-method)
## 
## Parameter1        |    Parameter2 |   rho |          95% CI |         S |     p
## -------------------------------------------------------------------------------
## loggametime_prior | pretest_score | 0.061 | [-0.003, 0.124] | 1.565e+08 | 0.054
## 
## p-value adjustment method: Holm (1979)
## Observations: 1000

 

  • Now let's take a moment to think about the Spearman rank correlation coefficient. It is computed by ranking the cases and then correlating the ranks (a quick optional check of this is shown below).
  • This means that some cases will have tied ranks, and when there are a lot of ties the rank correlation needs a correction - Kendall's tau is a rank-based alternative to Spearman's rho that adjusts for ties
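
Here is the quick (optional) check mentioned above: Spearman's rho is just a Pearson correlation computed on the ranked values (R's rank() assigns midranks to ties), so the two commands below should give essentially the same coefficient:

# Spearman on the raw values ...
lumos_tib %>% drop_na(pretest_score, gametime_prior) %>% 
  dplyr::select(gametime_prior, pretest_score) %>% 
  correlation::correlation(method = "spearman", alternative = "two.sided")
# ... matches a Pearson correlation computed on the ranks
lumos_tib %>% drop_na(pretest_score, gametime_prior) %>% 
  mutate(rank_gametime = rank(gametime_prior), rank_pretest = rank(pretest_score)) %>% 
  dplyr::select(rank_gametime, rank_pretest) %>% 
  correlation::correlation(method = "pearson", alternative = "two.sided")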
Step 4.2 - So next, compute the Kendall correlation coefficient
  • Use the same correlation() function (with method = "kendall") on pretest_score and gametime_prior – compare the correlation coefficient values (Spearman, Pearson, Kendall) and the p-values – what do you notice?
  • Note about effect size: the Pearson r, Spearman rho, and Kendall tau are all measures of effect size
#get the correlation
lumos_tib %>% drop_na(pretest_score,gametime_prior) %>% 
  dplyr::select(gametime_prior,pretest_score) %>% 
  correlation::correlation(method = "kendall", 
                           alternative = "two.sided", 
                           digits = 3, ci_digits = 3)
## # Correlation Matrix (kendall-method)
## 
## Parameter1     |    Parameter2 |   tau |         95% CI |     z |     p
## -----------------------------------------------------------------------
## gametime_prior | pretest_score | 0.042 | [0.001, 0.083] | 1.919 | 0.055
## 
## p-value adjustment method: Holm (1979)
## Observations: 1000

 

Step 5 - Categorical (nominal) variables: contingency coefficients

Sometimes we want to look at associations between nominal variables, but we can't use the methods above because one or more of the variables is not ordinal. In this Lumosity data set, let's say we want to know whether the non-native English speakers who use the website tend to have a different level of education than native English speakers.

Step 5.1/5.2 - So let’s look at the association between the edu_cat and the english_nativelang variables
  • a scatterplot is not too useful for categorical variables, so let's make a contingency table to examine the data - use the CrossTable() function from the "gmodels" package as in the code below (the curly-brace syntax is used because CrossTable() does not take the piped data as its first argument the way tidyverse functions do).
  • for this exercise use the dplyr::filter() function to ignore cases where edu_cat or english_nativelang is “unspecified”.
# contingency table
conting_table <- lumos_tib %>% 
  filter(edu_cat != "unspecified" & english_nativelang != "unspecified") %>%
  {gmodels::CrossTable(.$edu_cat, .$english_nativelang, expected=TRUE,
                       prop.t=FALSE, prop.c=FALSE, prop.r = FALSE)}
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |              Expected N |
## | Chi-square contribution |
## |-------------------------|
## 
##  
## Total Observations in Table:  738 
## 
##  
##              | .$english_nativelang 
##    .$edu_cat |       yes |        no | Row Total | 
## -------------|-----------|-----------|-----------|
##           hs |        82 |         5 |        87 | 
##              |    78.159 |     8.841 |           | 
##              |     0.189 |     1.669 |           | 
## -------------|-----------|-----------|-----------|
##      college |       410 |        37 |       447 | 
##              |   401.573 |    45.427 |           | 
##              |     0.177 |     1.563 |           | 
## -------------|-----------|-----------|-----------|
##  postcollege |       171 |        33 |       204 | 
##              |   183.268 |    20.732 |           | 
##              |     0.821 |     7.260 |           | 
## -------------|-----------|-----------|-----------|
## Column Total |       663 |        75 |       738 | 
## -------------|-----------|-----------|-----------|
## 
##  
## Statistics for All Table Factors
## 
## 
## Pearson's Chi-squared test 
## ------------------------------------------------------------
## Chi^2 =  11.6791     d.f. =  2     p =  0.002910156 
## 
## 
## 

 

  • Now let’s walk through the output. First, notice the number of total observations - it should tell you that 738 out of the original 1000 cases had values that we can examine in this analysis (meaning values were not “unspecified”)

  • The legend at the top tells you what is in each cell:

    • N is the observed joint frequency. For example, out of the 75 users who stated English was not their native language, 33 are in the “postcollege” education category.
    • Expected N is the expected joint frequency under the assumption of independence. If edu_cat and english_nativelang were independent, we would expect about 20.732 users in that postcollege/non-native-English cell.
    • Chi-square contribution measures how much each cell contributes to the overall chi-square statistic for the table (the sum of all the contributions equals the overall chi-squared value below the table).

You can look over every cell, but a general question we can address is whether there is statistical evidence that education level and English as a native language are NOT independent. The chi-squared independence test at the bottom of the output provides the null hypothesis test - showing that the observed deviations from the expected frequencies would be unlikely (χ2(2, N = 738) = 11.68, p = .0029) if the null hypothesis were true (that the variables are independent) - the degrees of freedom for this test are (rows - 1)*(columns - 1). The chi-squared contributions for each cell suggest that the association is largely driven by there being more non-native speakers with post-college education than would be expected if the variables were independent.
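
If you want to verify a few of these numbers by hand: Expected N for a cell is (row total × column total) / N, the chi-square contribution is (observed − expected)² / expected, and the p-value comes from a chi-squared distribution with (rows − 1)*(columns − 1) degrees of freedom. Here is a quick check using the postcollege/non-native-English cell from the table above:

# hand-check of values from the CrossTable output above
expected_n <- 204 * 75 / 738                  # Expected N = 20.732
(33 - expected_n)^2 / expected_n              # chi-square contribution = 7.260
(3 - 1) * (2 - 1)                             # degrees of freedom = 2
pchisq(11.6791, df = 2, lower.tail = FALSE)   # p = 0.0029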

Step 5.3 - Measures of association between nominal variables
  • The chi-squared test statistic reflects the association between these two categorical variables, but its scale is hard to interpret. Let's try a couple of measures of association, the contingency coefficient and an alternative called Cramer's V. These measures are each on a 0 to 1 scale, but Cramer's V is generally preferred (the contingency coefficient cannot reach the maximum value of 1 in many cases)
  • use the DescTools::ContCoef() function to compute the contingency coefficient. You will need to pass in the frequency table that you made in the last step, like this: DescTools::ContCoef(conting_table$t) (assuming you stored the table in a variable called conting_table in the previous step)
  • use the DescTools::CramerV() function to compute Cramer’s V. Pass in the same frequency table from before.
  • you can use the print() function to make the output more readable (as in the code below)
# contingency coefficient
print("The contingency coefficient for edu_cat and english_native_lang is:")
print(DescTools::ContCoef(conting_table$t), digits=4)
print("Cramer's V for edu_cat and english_native_lang is:")
print(DescTools::CramerV(conting_table$t), digits=4)
## [1] "The contingency coefficient for edu_cat and english_native_lang is:"
## [1] 0.1248
## [1] "Cramer's V for edu_cat and english_native_lang is:"
## [1] 0.1258

  - the low values (~.125) of the contingency coefficient and Cramer's V tell us that, although the chi-squared test is significant, the association is rather small (but that's not to say it couldn't be meaningful)
  - Note about effect size: Cramer's V is a commonly used effect size measure for associations between nominal variables
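
In case you're curious where these values come from, both measures are simple functions of the chi-squared statistic; here is a quick check using χ2 = 11.6791 and N = 738 from the CrossTable output (k is the smaller of the number of rows and columns):

# formulas behind the two coefficients
chisq <- 11.6791; n <- 738; k <- 2
sqrt(chisq / (chisq + n))    # contingency coefficient, ~0.1248
sqrt(chisq / (n * (k - 1)))  # Cramer's V, ~0.1258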


Step 6 - Accounting for a third variable: Partial and semi-partial correlation

Suppose you want to know about the association between age and performance in the training program (raw_score), but you want to adjust for their performance level before the training program (pretest_score). One way to adjust is with semi-partial and partial correlations. We’ll do both here.

# raw_score X age scatter 
p1 <- lumos_tib %>% drop_na(age,raw_score,pretest_score) %>%
    ggplot(aes(x=age, y=raw_score)) + 
        geom_point() + theme_classic() + 
        labs(title="Raw score vs Age scatter")
p2 <- lumos_tib %>% drop_na(age,raw_score,pretest_score) %>%
    ggplot(aes(x=age, y=pretest_score)) + 
        geom_point() + theme_classic() + 
        labs(title="Pretest score vs Age scatter")
p3 <- lumos_tib %>% drop_na(age,raw_score,pretest_score)%>%
    ggplot(aes(x=pretest_score, y=raw_score)) + 
        geom_point() + theme_classic() + 
        labs(title="Raw score vs Pretest score scatter")
corrtable <- lumos_tib %>% drop_na(age,raw_score,pretest_score) %>% 
  dplyr::select(age,raw_score,pretest_score) %>% 
  correlation::correlation(method = "pearson", 
                           alternative = "two.sided", 
                           digits = 3, ci_digits = 3)
#if you want r-squared in the corr table then use the line below
corrtable <- corrtable %>% mutate(rsq=r^2)

p1; p2; p3; corrtable
## Parameter1 |    Parameter2 |     r |         95% CI | t(998) |         p |  rsq
## -------------------------------------------------------------------------------
## age        |     raw_score | -0.10 | [-0.16, -0.04] |  -3.22 | 0.001**   | 0.01
## age        | pretest_score | -0.11 | [-0.17, -0.05] |  -3.56 | < .001*** | 0.01
## raw_score  | pretest_score |  0.78 | [ 0.75,  0.80] |  39.29 | < .001*** | 0.61
## 
## Observations: 1000

 

# semi-partial
semipart <- lumos_tib %>% drop_na(age,raw_score,pretest_score) %>% 
  {ppcor::spcor.test(method = "pearson", 
                    x=.$age, y=.$raw_score, z=.$pretest_score)}
print("semi-partial correlation: x=age, y=raw_score, z=pretest_score")
print(semipart)
# partial
partcor <- lumos_tib %>% drop_na(age,raw_score,pretest_score) %>% 
  {ppcor::pcor.test(method = "pearson", 
                    x=.$age, y=.$raw_score, z=.$pretest_score)}
print("partial correlation: x=age, y=raw_score, z=pretest_score")
print(partcor)
## [1] "semi-partial correlation: x=age, y=raw_score, z=pretest_score"
##      estimate   p.value  statistic    n gp  Method
## 1 -0.02242273 0.4789975 -0.7081825 1000  1 pearson
## [1] "partial correlation: x=age, y=raw_score, z=pretest_score"
##      estimate   p.value  statistic    n gp  Method
## 1 -0.02256492 0.4762132 -0.7126757 1000  1 pearson

 

What do you notice?
- We originally asked whether age was related to performance at the end of the training program. The (small) zero-order correlation between age and raw_score suggested that older individuals scored somewhat lower.
- But we also saw that there was a (similarly small) correlation between age and pretest_score, so we decided we should adjust for pretest_score. This way we could ask how age uniquely related to post-training performance - that is, when we account for the fact that older adults had lower pretest scores, did older adults perform worse after training?
- Both the semi-partial and partial correlations suggested that the association between age and post-training performance was explained by pretest scores.
- See chapter 8 for more on semi-partial and partial correlations.
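
As an optional check, the partial correlation can also be computed from the three zero-order correlations with the standard formula r_xy.z = (r_xy - r_xz*r_yz) / sqrt((1 - r_xz^2)*(1 - r_yz^2)); plugging in the rounded r values from the correlation table above gives approximately the same answer as ppcor:

# approximate hand-check of the partial correlation between age (x) and raw_score (y),
#  controlling for pretest_score (z), using the rounded zero-order r's from the table above
r_xy <- -0.10; r_xz <- -0.11; r_yz <- 0.78
(r_xy - r_xz * r_yz) / sqrt((1 - r_xz^2) * (1 - r_yz^2))  # roughly -0.02 (ppcor gave -0.0226)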

That’s all - if you have more time help your classmates or check out the extra time exercise below


Extra time: Same correlation coefficient, different associations

# the anscombe dataset is built into R - convert it to a tibble
ans_tib <- as_tibble(anscombe)
#x1 and y1
p1 <- ans_tib %>% 
  ggplot(aes(x=x1, y=y1)) + geom_point() + 
  geom_smooth(method = "lm", formula = "y ~ x", se = FALSE, fullrange = TRUE ) +
  scale_x_continuous(limits = c(4,20)) +
  scale_y_continuous(limits = c(2,14)) +
  coord_cartesian(xlim = c(4,20), ylim = c(2,14)) + labs(title="x1 and y1") + theme_classic() + ggpubr::stat_cor(method="pearson")  
#x2 and y2
p2 <- ans_tib %>% 
  ggplot(aes(x=x2, y=y2)) + geom_point() + 
  geom_smooth(method = "lm", formula = "y ~ x", se = FALSE, fullrange = TRUE ) +
  scale_x_continuous(limits = c(4,20)) +
  scale_y_continuous(limits = c(2,14)) +
  coord_cartesian(xlim = c(4,20), ylim = c(2,14)) + labs(title="x2 and y2") + theme_classic() + ggpubr::stat_cor(method="pearson")
#x3 and y3
p3 <- ans_tib %>% 
  ggplot(aes(x=x3, y=y3)) + geom_point() + 
  geom_smooth(method = "lm", formula = "y ~ x", se = FALSE, fullrange = TRUE ) +
  scale_x_continuous(limits = c(4,20)) +
  scale_y_continuous(limits = c(2,14)) +
  coord_cartesian(xlim = c(4,20), ylim = c(2,14)) + labs(title="x3 and y3") + theme_classic() + ggpubr::stat_cor(method="pearson")
#x4 and y4
p4 <- ans_tib %>% 
  ggplot(aes(x=x4, y=y4)) + geom_point() +
  geom_smooth(method = "lm", formula = "y ~ x", se = FALSE, fullrange = TRUE ) +
  scale_x_continuous(limits = c(4,20)) +
  scale_y_continuous(limits = c(2,14)) +
  coord_cartesian(xlim = c(4,20), ylim = c(2,14)) + labs(title="x4 and y4") + theme_classic() + ggpubr::stat_cor(method="pearson")

gridExtra::grid.arrange(p1, p2, p3, p4, nrow = 2, ncol = 2)

 

That’s all - if you want more examples check out the discovr_07 tutorial


References:

Textbook Chapter 8 - Field, A. (2018). Discovering statistics using IBM SPSS statistics. 5th Edition. SAGE Publications.

Dataset from Guerra-Carrillo, B., Katovich, K., & Bunge, S. A. (2017). “Does higher education hone cognitive functioning and learning efficacy? Findings from a large and diverse sample.” PLoS ONE, 12(8), e0182276. Licensed under CC BY 4.0 (Attribution) International by Belen Guerra-Carrillo and the Bunge Lab.
discovr_07 (Andy Field tutorial)

Anscombe’s quartet - Anscombe, F. J. (1973). “Graphs in Statistical Analysis”. American Statistician. 27 (1): 17–21.

More on contingency tables and coefficients at R Coder