Learning Objectives


Step 1 - Get organized


Step 2 - Import data and install packages

data description:
This is a subset of a public dataset of Lumosity (a cognitive training website) user performance data. You can find the publication associated with these data here:
Guerra-Carrillo, B., Katovich, K., & Bunge, S. A. (2017). Does higher education hone cognitive functioning and learning efficacy? Findings from a large and diverse sample. PloS one, 12(8), e0182276. https://doi.org/10.1371/journal.pone.0182276

First, install three packages. In the console (not in your markdown doc), type each of these commands to install the needed packages:
- install.packages("gmodels")
- depending on your settings, installing "gmodels" may also install "DescTools"; if not, run install.packages("DescTools")
- install.packages("ppcor")

What to do next:
1. Edit your “setup” code chunk to add these lines:
knitr::opts_knit$set(root.dir = rprojroot::find_rstudio_root_file())
library("tidyverse")
2. Make a new code chunk in your R markdown doc and use readr::read_csv() to read in the data. Make sure that NA values are handled the way you want (click on the tibble in the Environment window pane to take a quick look).
3. Set data types for columns as needed (gender, edu_cat as factors, …) – take a look at the code to see how

#first import the data
lumos_tib <- readr::read_csv("data/lumos_subset.csv", na = "NA")

# now make sure the columns we want as factors are treated that way, using forcats::as_factor()
lumos_tib <- lumos_tib %>% dplyr::mutate(
  test_type = forcats::as_factor(test_type),
  assessment = forcats::as_factor(assessment),
  gender = forcats::as_factor(gender),
  edu_cat = forcats::as_factor(edu_cat),
  english_nativelang = forcats::as_factor(english_nativelang),
  ethnicity = forcats::as_factor(ethnicity)
)
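
Optional: instead of (or in addition to) clicking on the tibble in the Environment pane, you can check the import from code. This is just a quick sanity check, not something the exercise requires:

# check column types and count missing values in each column
dplyr::glimpse(lumos_tib)
lumos_tib %>% summarise(across(everything(), ~sum(is.na(.))))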

Step 3 - Pearson correlation

Correlation decision chart
Step 3.1 - First, take a look at the decision chart above (Fig 8.6 from the Field textbook). We will start by taking a quick look at the distributions of raw_score and pretest_score, then we'll make a scatterplot of the two together
  1. Make a histogram and Q-Q plot for each variable
    - drop any datapoint that does not have a valid value for both raw_score and pretest_score
    - are the measures distributed (approximately) normally?
  2. Now make a scatterplot with pretest_score on the x-axis and raw_score on the y-axis
    - does the association appear linear?
    - pay attention to the number of points you see in the scatterplot - is it what you expected?
# histogram for raw_score
p1 <- lumos_tib %>% drop_na(raw_score,pretest_score) %>%
    ggplot( aes(x=raw_score)) + geom_histogram(binwidth = 1) + theme_classic() +
        labs (title = "Raw score distribution")
# Q-Q for raw_score
p2 <- lumos_tib %>% drop_na(raw_score,pretest_score) %>%
    ggplot( aes(sample=raw_score)) + geom_qq() + 
        geom_qq_line() + theme_classic() +
        labs (title = "Raw score Q-Q")
# histogram for pretest_score
p3 <- lumos_tib %>% drop_na(raw_score,pretest_score) %>%
    ggplot( aes(x=pretest_score)) + geom_histogram(binwidth = 3) +
      theme_classic() +
      labs (title = "pretest distribution")
# Q-Q for pretest_score
p4 <- lumos_tib %>% drop_na(raw_score,pretest_score) %>%
    ggplot( aes(sample=pretest_score)) + geom_qq() + 
        geom_qq_line() + theme_classic() +
        labs (title = "pretest Q-Q")
# scatter plot
p5 <- lumos_tib %>% drop_na(raw_score,pretest_score) %>%
  ggplot(aes(x=pretest_score, y=raw_score)) + 
    geom_point() + theme_classic() + 
    labs(title="raw_score vs pretest_score scatter")
# extra: scatter plot with a slight random jitter so you can see overlapping points
#   shape = 1 specifies empty circles for points
p6 <- lumos_tib %>% drop_na(raw_score,pretest_score) %>%
  ggplot(aes(x=pretest_score, y=raw_score)) + 
    geom_jitter(width = .4, height = .2, size = 1, shape = 1, alpha = .25) + theme_classic() + 
    labs(title="raw_score vs pretest_score scatter with jitter")

p1; p2; p3; p4; p5; p6

 

  • In the plots we can see that both variables are approximately normally distributed; in any case, we have a large sample (1,000 observations), so we are not too concerned. We should, however, pay attention to extreme values that may have an overly strong influence on the correlation (these are often called high-leverage outliers) - the scatterplot shows that the outliers don't really change the association (i.e., they fit the overall pattern).
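
(Optional) If you want to flag extreme values from code rather than just eyeballing the plots, one quick way - an illustration, not something the exercise asks for - is to standardize each variable and list the cases beyond |z| > 3:

# list any cases more than 3 SD from the mean on either variable
lumos_tib %>% drop_na(raw_score, pretest_score) %>%
  mutate(z_raw = as.numeric(scale(raw_score)),
         z_pretest = as.numeric(scale(pretest_score))) %>%
  filter(abs(z_raw) > 3 | abs(z_pretest) > 3) %>%
  dplyr::select(raw_score, pretest_score, z_raw, z_pretest)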
Step 3.2 - Now that we are satisfied with the assumptions, compute a Pearson correlation coefficient, confidence interval, and null hypothesis test p-value (use a two-sided test).
  • Use the correlation() function from the correlation library. Specify the options method="pearson" and alternative="two.sided" (for a two-sided significance test) like in the code below. You can use dplyr::select() to select the variables that you pipe to the correlation::correlation() function.
#get the correlation
lumos_tib %>% drop_na(raw_score,pretest_score) %>% 
  dplyr::select(pretest_score,raw_score) %>% 
  correlation::correlation(method = "pearson", 
                           alternative = "two.sided", 
                           digits = 3, ci_digits = 3)

#extra example: html formatted table using sjPlot package
lumos_tib %>% drop_na(raw_score,pretest_score,age) %>% 
  dplyr::select(pretest_score,raw_score,age) %>% 
  sjPlot::tab_corr(corr.method = "pearson",
                   title = "Table 1 - correlations between performance and age",
                   p.numeric = FALSE,digits = 3,
                   triangle = "lower",
                   use.viewer = TRUE)
## # Correlation Matrix (pearson-method)
## 
## Parameter1    | Parameter2 |     r |         95% CI | t(998) |         p
## ------------------------------------------------------------------------
## pretest_score |  raw_score | 0.779 | [0.754, 0.803] | 39.293 | < .001***
## 
## p-value adjustment method: Holm (1979)
## Observations: 1000
Table 1 - correlations between performance and age

                  pretest_score   raw_score    age
  pretest_score
  raw_score           0.779***
  age                -0.112***    -0.101**

Computed correlation used pearson-method with listwise-deletion.

 

After you have the scatterplot and the Pearson correlation, confidence interval, and null hypothesis test statistic, answer this question for yourself:
Based on the p-value you got (“p<.001”), which statement below is true? (assume that this dataset is a random sample of Lumosity users)
1. there is a greater than 99.9% probability that the true population correlation between raw_score and pretest_score is non-zero
2. there is less than .1% probability that raw_score and pretest_score are uncorrelated
3. there is less than .1% probability of finding a correlation at least this extreme (in a sample this size) if the true population correlation is zero

NOTE: With large samples, correlation p-values are often not useful, because even trivially small correlations come out statistically significant (see the quick illustration after these notes). The effect size (the Pearson correlation coefficient, r, in this case) is generally what you care about.
NOTE #2: The scatterplot doesn't seem to show 1,000 points, right? That is because many data points sit right on top of each other (there are only 38 unique values of raw_score) - in cases where you want to see more of the individual points, you can use a "jitter" to add/subtract small random amounts from each value before plotting. A different approach is to make the points partially transparent with the alpha option, so the points appear darker where many overlap.
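
To make the first note concrete: the t-statistic for a Pearson correlation is t = r*sqrt(n-2)/sqrt(1-r^2), so with n = 1000 even a tiny correlation clears the conventional .05 threshold (the r value below is made up purely for illustration):

# with n = 1000, even r = .07 comes out "significant"
r <- .07; n <- 1000
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)
2 * pt(abs(t_stat), df = n - 2, lower.tail = FALSE)  # two-sided p, roughly .027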


Step 4 - Non-parametric correlation coefficients: Spearman’s rho & Kendall’s tau

# gametime_prior histogram
p1 <- lumos_tib %>% drop_na(pretest_score,gametime_prior) %>%
    ggplot( aes(x=gametime_prior)) + 
        geom_histogram(bins = 40) + theme_classic() +
        labs (title = "Prior game time distribution")
# gametime_prior Q-Q
p2 <- lumos_tib %>% drop_na(pretest_score,gametime_prior) %>%
    ggplot( aes(sample=gametime_prior)) + 
        geom_qq() + geom_qq_line() + theme_classic() +
        labs (title = "Prior game time Q-Q")
# scatter plot: pretest_score vs gametime_prior
p3 <- lumos_tib %>% drop_na(pretest_score,gametime_prior) %>%
    ggplot(aes(x=gametime_prior, y=pretest_score)) + 
        geom_point() + theme_classic() + 
        labs(title="Pretest score vs prior game time scatter")
# EXTRA - is this a situation where you could log-transform the gametime_prior variable?  
p4 <- lumos_tib %>% drop_na(pretest_score,gametime_prior) %>%
    ggplot( aes(x=log(gametime_prior))) + 
        geom_histogram(bins = 40) + theme_classic() +
        labs (title = "LOG Prior game time distribution")
# LOG gametime_prior Q-Q
p5 <- lumos_tib %>% drop_na(pretest_score,gametime_prior) %>%
    ggplot( aes(sample=log(gametime_prior))) + 
        geom_qq() + geom_qq_line() + theme_classic() +
        labs (title = "LOG Prior game time Q-Q")

# scatter plot: pretest_score vs LOG gametime_prior (with jitter)
p6 <- lumos_tib %>% drop_na(pretest_score,gametime_prior) %>%
    ggplot(aes(x=log(gametime_prior), y=pretest_score)) + 
        geom_jitter(width = .1, height = 1, size = 1, shape = 1, alpha = .25) +
        theme_classic() + 
        labs(title="Pretest score vs LOG prior game time scatter")
p1; p2; p3; p4; p5; p6

 

Step 4.1 - Compute the Spearman rank correlation coefficient, rho, with confidence interval, and null hypothesis test p-value (use a two-sided test).
  • Use the same correlation() function from the correlation library but change the method option to method = "spearman"
  • for comparison, also compute the Pearson correlation coefficient
#get the correlation
lumos_tib %>% drop_na(pretest_score,gametime_prior) %>% 
  dplyr::select(gametime_prior,pretest_score) %>% 
  correlation::correlation(method = "spearman", 
                           alternative = "two.sided", 
                           digits = 3, ci_digits = 3)

# for comparison, the Pearson correlation on the untransformed variables
lumos_tib %>% drop_na(pretest_score,gametime_prior) %>% 
  dplyr::select(gametime_prior,pretest_score) %>% 
  correlation::correlation(method = "pearson", 
                           alternative = "two.sided", 
                           digits = 3, ci_digits = 3)

#extra - notice that the spearman correlation coefficient is not affected by log transformation - why? because log transformation doesn't change the ranks
lumos_tib %>% drop_na(pretest_score,gametime_prior) %>% 
  mutate(loggametime_prior=log(gametime_prior)) %>% 
  dplyr::select(loggametime_prior,pretest_score) %>% 
  correlation::correlation(method = "spearman", 
                           alternative = "two.sided", 
                           digits = 3, ci_digits = 3)
## # Correlation Matrix (spearman-method)
## 
## Parameter1     |    Parameter2 |   rho |          95% CI |         S |     p
## ----------------------------------------------------------------------------
## gametime_prior | pretest_score | 0.061 | [-0.003, 0.124] | 1.565e+08 | 0.054
## 
## p-value adjustment method: Holm (1979)
## Observations: 1000
## # Correlation Matrix (pearson-method)
## 
## Parameter1     |    Parameter2 |     r |         95% CI | t(998) |         p
## ----------------------------------------------------------------------------
## gametime_prior | pretest_score | 0.126 | [0.065, 0.187] |  4.018 | < .001***
## 
## p-value adjustment method: Holm (1979)
## Observations: 1000
## # Correlation Matrix (spearman-method)
## 
## Parameter1        |    Parameter2 |   rho |          95% CI |         S |     p
## -------------------------------------------------------------------------------
## loggametime_prior | pretest_score | 0.061 | [-0.003, 0.124] | 1.565e+08 | 0.054
## 
## p-value adjustment method: Holm (1979)
## Observations: 1000

 

  • Now let's take a moment to think about the Spearman rank correlation coefficient. It is computed by ranking the cases and then correlating the ranks (a quick optional check of this is shown below).
  • This means that some cases will have tied ranks, and when there are a lot of ties the rank correlation needs a correction - Kendall's tau is a rank-based alternative to Spearman's rho that adjusts for ties
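
Here is the quick (optional) check mentioned above: Spearman's rho is just a Pearson correlation computed on the ranked values (R's rank() assigns midranks to ties), so the two commands below should give essentially the same coefficient:

# Spearman on the raw values ...
lumos_tib %>% drop_na(pretest_score, gametime_prior) %>% 
  dplyr::select(gametime_prior, pretest_score) %>% 
  correlation::correlation(method = "spearman", alternative = "two.sided")
# ... matches a Pearson correlation computed on the ranks
lumos_tib %>% drop_na(pretest_score, gametime_prior) %>% 
  mutate(rank_gametime = rank(gametime_prior), rank_pretest = rank(pretest_score)) %>% 
  dplyr::select(rank_gametime, rank_pretest) %>% 
  correlation::correlation(method = "pearson", alternative = "two.sided")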
Step 4.2 - So next, compute the Kendall correlation coefficient
  • Use the same correlation() function (with method = "kendall") on pretest_score and gametime_prior – compare the correlation coefficient values (Spearman, Pearson, Kendall) and the p-values – what do you notice?
  • Note about effect size: the Pearson r, Spearman rho, and Kendall tau are all measures of effect size
#get the correlation
lumos_tib %>% drop_na(pretest_score,gametime_prior) %>% 
  dplyr::select(gametime_prior,pretest_score) %>% 
  correlation::correlation(method = "kendall", 
                           alternative = "two.sided", 
                           digits = 3, ci_digits = 3)
## # Correlation Matrix (kendall-method)
## 
## Parameter1     |    Parameter2 |   tau |         95% CI |     z |     p
## -----------------------------------------------------------------------
## gametime_prior | pretest_score | 0.042 | [0.001, 0.083] | 1.919 | 0.055
## 
## p-value adjustment method: Holm (1979)
## Observations: 1000

 

Step 5 - Categorical (nominal) variables: contingency coefficients

Sometimes we want to look at associations between nominal variables, but we can't use the methods above because one or more of the variables is not ordinal. In this Lumosity data set, let's say we want to know whether the non-native English speakers who use the website tend to have a different level of education than native English speakers.

Step 5.1/5.2 - So let’s look at the association between the edu_cat and the english_nativelang variables
  • a scatterplot is not too useful for categorical variables, so let's make a contingency table to examine the data - use the CrossTable() function from the "gmodels" package as in the code below (the curly-brace syntax is used because CrossTable() does not take the piped data as its first argument the way tidyverse functions do).
  • for this exercise use the dplyr::filter() function to ignore cases where edu_cat or english_nativelang is “unspecified”.
# contingency table
conting_table <- lumos_tib %>% 
  filter(edu_cat != "unspecified" & english_nativelang != "unspecified") %>%
  {gmodels::CrossTable(.$edu_cat, .$english_nativelang, expected=TRUE,
                       prop.t=FALSE, prop.c=FALSE, prop.r = FALSE)}
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |              Expected N |
## | Chi-square contribution |
## |-------------------------|
## 
##  
## Total Observations in Table:  738 
## 
##  
##              | .$english_nativelang 
##    .$edu_cat |       yes |        no | Row Total | 
## -------------|-----------|-----------|-----------|
##           hs |        82 |         5 |        87 | 
##              |    78.159 |     8.841 |           | 
##              |     0.189 |     1.669 |           | 
## -------------|-----------|-----------|-----------|
##      college |       410 |        37 |       447 | 
##              |   401.573 |    45.427 |           | 
##              |     0.177 |     1.563 |           | 
## -------------|-----------|-----------|-----------|
##  postcollege |       171 |        33 |       204 | 
##              |   183.268 |    20.732 |           | 
##              |     0.821 |     7.260 |           | 
## -------------|-----------|-----------|-----------|
## Column Total |       663 |        75 |       738 | 
## -------------|-----------|-----------|-----------|
## 
##  
## Statistics for All Table Factors
## 
## 
## Pearson's Chi-squared test 
## ------------------------------------------------------------
## Chi^2 =  11.6791     d.f. =  2     p =  0.002910156 
## 
## 
## 

 

  • Now let’s walk through the output. First, notice the number of total observations - it should tell you that 738 out of the original 1000 cases had values that we can examine in this analysis (meaning values were not “unspecified”)

  • The legend at the top tells you what is in each cell:

    • N is the observed joint frequency. For example, out of the 75 users who stated English was not their native language, 33 are in the “postcollege” education category.
    • Expected N is the expected joint frequency under the assumption of independence. If edu_cat and english_nativelang were independent, we would expect about 20.732 users in that postcollege/non-native-English cell.
    • Chi-square contribution measures how much each cell contributes to the overall chi-square statistic for the table (the sum of all the contributions equals the overall chi-squared value below the table).

You can look over every cell, but a general question we can address is whether there is statistical evidence that education level and English as a native language are NOT independent. The chi-squared independence test at the bottom of the output provides the null hypothesis test - showing that the observed deviations from the expected frequencies would be unlikely (χ2(2, N = 738) = 11.68, p = .0029) if the null hypothesis were true (that the variables are independent) - the degrees of freedom for this test are (rows - 1)*(columns - 1). The chi-squared contributions for each cell suggest that the association is largely driven by there being more non-native speakers with post-college education than would be expected if the variables were independent.
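
If you want to verify a few of these numbers by hand: Expected N for a cell is (row total × column total) / N, the chi-square contribution is (observed − expected)² / expected, and the p-value comes from a chi-squared distribution with (rows − 1)*(columns − 1) degrees of freedom. Here is a quick check using the postcollege/non-native-English cell from the table above:

# hand-check of values from the CrossTable output above
expected_n <- 204 * 75 / 738                  # Expected N = 20.732
(33 - expected_n)^2 / expected_n              # chi-square contribution = 7.260
(3 - 1) * (2 - 1)                             # degrees of freedom = 2
pchisq(11.6791, df = 2, lower.tail = FALSE)   # p = 0.0029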

Step 5.3 - Measures of association between nominal variables
  • The chi-squared test statistic reflects the association between these two categorical variables, but its scale is hard to interpret. Let's try a couple of measures of association, the contingency coefficient and an alternative called Cramer's V. These measures are each on a 0 to 1 scale, but Cramer's V is generally preferred (the contingency coefficient cannot reach the maximum value of 1 in many cases)
  • use the DescTools::ContCoef() function to compute the contingency coefficient. You will need to pass in the frequency table that you made in the last step, like this: DescTools::ContCoef(conting_table$t) (assuming you stored the table in a variable called conting_table in the previous step)
  • use the DescTools::CramerV() function to compute Cramer’s V. Pass in the same frequency table from before.
  • you can use the print() function to make the output more readable (as in the code below)
# contingency coefficient
print("The contingency coefficient for edu_cat and english_native_lang is:")
print(DescTools::ContCoef(conting_table$t), digits=4)
print("Cramer's V for edu_cat and english_native_lang is:")
print(DescTools::CramerV(conting_table$t), digits=4)
## [1] "The contingency coefficient for edu_cat and english_native_lang is:"
## [1] 0.1248
## [1] "Cramer's V for edu_cat and english_native_lang is:"
## [1] 0.1258

  - the low values (~.125) of the contingency coefficient and Cramer's V tell us that, although the chi-squared test is significant, the association is rather small (but that's not to say it couldn't be meaningful)
  - Note about effect size: Cramer's V is a commonly used effect size measure for associations between nominal variables
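
In case you're curious where these values come from, both measures are simple functions of the chi-squared statistic; here is a quick check using χ2 = 11.6791 and N = 738 from the CrossTable output (k is the smaller of the number of rows and columns):

# formulas behind the two coefficients
chisq <- 11.6791; n <- 738; k <- 2
sqrt(chisq / (chisq + n))    # contingency coefficient, ~0.1248
sqrt(chisq / (n * (k - 1)))  # Cramer's V, ~0.1258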


Step 6 - Accounting for a third variable: Partial and semi-partial correlation

Suppose you want to know about the association between age and performance in the training program (raw_score), but you want to adjust for their performance level before the training program (pretest_score). One way to adjust is with semi-partial and partial correlations. We’ll do both here.

# raw_score X age scatter 
p1 <- lumos_tib %>% drop_na(age,raw_score,pretest_score) %>%
    ggplot(aes(x=age, y=raw_score)) + 
        geom_point() + theme_classic() + 
        labs(title="Raw score vs Age scatter")
p2 <- lumos_tib %>% drop_na(age,raw_score,pretest_score) %>%
    ggplot(aes(x=age, y=pretest_score)) + 
        geom_point() + theme_classic() + 
        labs(title="Pretest score vs Age scatter")
p3 <- lumos_tib %>% drop_na(age,raw_score,pretest_score)%>%
    ggplot(aes(x=pretest_score, y=raw_score)) + 
        geom_point() + theme_classic() + 
        labs(title="Raw score vs Pretest score scatter")
corrtable <- lumos_tib %>% drop_na(age,raw_score,pretest_score) %>% 
  dplyr::select(age,raw_score,pretest_score) %>% 
  correlation::correlation(method = "pearson", 
                           alternative = "two.sided", 
                           digits = 3, ci_digits = 3)
#if you want r-squared in the corr table then use the line below
corrtable <- corrtable %>% mutate(rsq=r^2)

p1; p2; p3; corrtable
## Parameter1 |    Parameter2 |     r |         95% CI | t(998) |         p |  rsq
## -------------------------------------------------------------------------------
## age        |     raw_score | -0.10 | [-0.16, -0.04] |  -3.22 | 0.001**   | 0.01
## age        | pretest_score | -0.11 | [-0.17, -0.05] |  -3.56 | < .001*** | 0.01
## raw_score  | pretest_score |  0.78 | [ 0.75,  0.80] |  39.29 | < .001*** | 0.61
## 
## Observations: 1000

 

# semi-partial
semipart <- lumos_tib %>% drop_na(age,raw_score,pretest_score) %>% 
  {ppcor::spcor.test(method = "pearson", 
                    x=.$age, y=.$raw_score, z=.$pretest_score)}
print("semi-partial correlation: x=age, y=raw_score, z=pretest_score")
print(semipart)
# partial
partcor <- lumos_tib %>% drop_na(age,raw_score,pretest_score) %>% 
  {ppcor::pcor.test(method = "pearson", 
                    x=.$age, y=.$raw_score, z=.$pretest_score)}
print("partial correlation: x=age, y=raw_score, z=pretest_score")
print(partcor)
## [1] "semi-partial correlation: x=age, y=raw_score, z=pretest_score"
##      estimate   p.value  statistic    n gp  Method
## 1 -0.02242273 0.4789975 -0.7081825 1000  1 pearson
## [1] "partial correlation: x=age, y=raw_score, z=pretest_score"
##      estimate   p.value  statistic    n gp  Method
## 1 -0.02256492 0.4762132 -0.7126757 1000  1 pearson

 

What do you notice?
- We originally asked whether age was related to performance at the end of the training program. The (small) zero-order correlation between age and raw_score suggested that older individuals scored somewhat lower.
- But we also saw that there was a (similarly small) correlation between age and pretest_score, so we decided we should adjust for pretest_score. This way we could ask how age uniquely related to post-training performance - that is, when we account for the fact that older adults had lower pretest scores, did older adults perform worse after training?
- Both the semi-partial and partial correlations suggested that the association between age and post-training performance was explained by pretest scores.
- See chapter 8 for more on semi-partial and partial correlations.
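
As an optional check, the partial correlation can also be computed from the three zero-order correlations with the standard formula r_xy.z = (r_xy - r_xz*r_yz) / sqrt((1 - r_xz^2)*(1 - r_yz^2)); plugging in the rounded r values from the correlation table above gives approximately the same answer as ppcor:

# approximate hand-check of the partial correlation between age (x) and raw_score (y),
#  controlling for pretest_score (z), using the rounded zero-order r's from the table above
r_xy <- -0.10; r_xz <- -0.11; r_yz <- 0.78
(r_xy - r_xz * r_yz) / sqrt((1 - r_xz^2) * (1 - r_yz^2))  # roughly -0.02 (ppcor gave -0.0226)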

That’s all - if you have more time help your classmates or check out the extra time exercise below


Extra time: Same correlation coefficient, different associations

# the anscombe dataset is built into R - convert it to a tibble
ans_tib <- as_tibble(anscombe)
#x1 and y1
p1 <- ans_tib %>% 
  ggplot(aes(x=x1, y=y1)) + geom_point() + 
  geom_smooth(method = "lm", formula = "y ~ x", se = FALSE, fullrange = TRUE ) +
  scale_x_continuous(limits = c(4,20)) +
  scale_y_continuous(limits = c(2,14)) +
  coord_cartesian(xlim = c(4,20), ylim = c(2,14)) + labs(title="x1 and y1") + theme_classic() + ggpubr::stat_cor(method="pearson")  
#x2 and y2
p2 <- ans_tib %>% 
  ggplot(aes(x=x2, y=y2)) + geom_point() + 
  geom_smooth(method = "lm", formula = "y ~ x", se = FALSE, fullrange = TRUE ) +
  scale_x_continuous(limits = c(4,20)) +
  scale_y_continuous(limits = c(2,14)) +
  coord_cartesian(xlim = c(4,20), ylim = c(2,14)) + labs(title="x2 and y2") + theme_classic() + ggpubr::stat_cor(method="pearson")
#x3 and y3
p3 <- ans_tib %>% 
  ggplot(aes(x=x3, y=y3)) + geom_point() + 
  geom_smooth(method = "lm", formula = "y ~ x", se = FALSE, fullrange = TRUE ) +
  scale_x_continuous(limits = c(4,20)) +
  scale_y_continuous(limits = c(2,14)) +
  coord_cartesian(xlim = c(4,20), ylim = c(2,14)) + labs(title="x3 and y3") + theme_classic() + ggpubr::stat_cor(method="pearson")
#x4 and y4
p4 <- ans_tib %>% 
  ggplot(aes(x=x4, y=y4)) + geom_point() +
  geom_smooth(method = "lm", formula = "y ~ x", se = FALSE, fullrange = TRUE ) +
  scale_x_continuous(limits = c(4,20)) +
  scale_y_continuous(limits = c(2,14)) +
  coord_cartesian(xlim = c(4,20), ylim = c(2,14)) + labs(title="x4 and y4") + theme_classic() + ggpubr::stat_cor(method="pearson")

gridExtra::grid.arrange(p1, p2, p3, p4, nrow = 2, ncol = 2)

 

That’s all - if you want more examples check out the discovr_07 tutorial


References:

Textbook Chapter 8 - Field, A. (2018). Discovering statistics using IBM SPSS statistics. 5th Edition. SAGE Publications.

Dataset from Guerra-Carrillo, B., Katovich, K., & Bunge, S. A. (2017). “Does higher education hone cognitive functioning and learning efficacy? Findings from a large and diverse sample.” PLoS ONE, 12(8), e0182276. Licensed under CC BY 4.0 (Attribution) International by Belen Guerra-Carrillo and the Bunge Lab.
discovr_07 (Andy Field tutorial)

Anscombe’s quartet - Anscombe, F. J. (1973). “Graphs in Statistical Analysis”. American Statistician. 27 (1): 17–21.

More on contingency tables and coefficients at R Coder