psych596

Graduate Statistical Methods for Psychology

Learning Objectives - Associations between two variables


Step 1 - Get organized


Step 2 - Import data

data description:
This is a subset of a public dataset of Lumosity (a cognitive training website) user performance data. You can find the publication associated with this data here:
Guerra-Carrillo, B., Katovich, K., & Bunge, S. A. (2017). Does higher education hone cognitive functioning and learning efficacy? Findings from a large and diverse sample. PloS one, 12(8), e0182276. https://doi.org/10.1371/journal.pone.0182276

Import the data: Open SPSS and use File -> Import Data -> CSV or Text Data. Then check the variable types and add labels if you wish. Careful! If you use "import text data", make sure you set the delimiter to "comma" only (SPSS may also automatically treat "space" as a delimiter).
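If you want to see why the delimiter setting matters, here is a minimal sketch in Python (not part of the SPSS workflow). The rows and the "user label" column are made up for illustration; only the column names pretest_score and raw_score come from this activity.

```python
import csv
import io

# Hypothetical rows standing in for the real file. The "user label" column
# is invented here to show a field that contains a space.
sample_text = (
    "user label,pretest_score,raw_score\n"
    "user 1,12,15\n"
    "user 2,9,11\n"
)

# delimiter="," splits on commas only, so the space inside "user 1" is kept
# as part of the value instead of starting a new column.
reader = csv.DictReader(io.StringIO(sample_text), delimiter=",")
rows = [
    {
        "user label": row["user label"],
        "pretest_score": float(row["pretest_score"]),
        "raw_score": float(row["raw_score"]),
    }
    for row in reader
]

print(rows[0])
```

If a space were also treated as a delimiter, "user 1" would be split into two fields and the scores would land in the wrong columns.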

What to do next:

Step 3 - Pearson correlation

Let's examine the association between pre-test performance (pretest_score) and post-test performance (raw_score) -- they should be related, right?

Correlation Decision chart

Step 3.1 - First take a look at the decision chart above (Fig 8.6 from the Field textbook). We will start by taking a quick look at the distributions for raw_score and pretest_score, then we'll make a scatterplot of them together.
  1. Select only cases that have valid scores (greater than or equal to zero) for both raw_score and pretest_score (use Select Cases - you should find that all rows have valid scores)
  2. Make a histogram, bar plot, and Q-Q plot for each variable, like we did in week #2: use Analyze->Descriptive Statistics->Explore (select "Normality Plots" and "histogram")
    - are the measures distributed (approximately) normally? Consider the shape of the histogram, the number and location of outliers, and the distance of points on the Q-Q Plot from the diagonal.
    - you may notice that the Statistical "Tests of Normality" give significant statistics, suggesting that the distributions significantly deviate from a normal distribution. But as noted in the Field textbook, any large sample (like this) is likely to give you a significant statistic even when the deviation from normality is minor. It is more important to look at the plots to assess the distribution.
  3. Now make a scatterplot with pretest_score on the x-axis and raw_score on the y-axis
    - use the Chart Builder (or any method you prefer)
    - does the association appear linear? (we'll talk about this idea in discussion)

In the plots for both variables, we can see that they are approximately normally distributed. In any case, we have a large sample (1000 observations), so, as stated in the decision chart above, we are not concerned about normality. We should, however, pay attention to extreme values (outliers) that may have an overly strong influence on the correlation (these are often called high-leverage outliers) and violate our linearity assumption. Here, the scatterplot shows that the outliers fit the overall linear pattern.

Step 3.2 - Now that we are satisfied with the assumptions, compute a Pearson correlation coefficient, confidence interval, and null hypothesis test p-value (use a two-sided test).

What is the Pearson correlation coefficient between pretest_score and raw_score? This is one measure of effect size (referred to as the correlation coefficient "r"). It can range from -1 to 1. The positive value indicates that individuals with a relatively high pretest_score tend to also have a relatively high raw_score.
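For reference, here is a minimal Python sketch of what sits behind the coefficient and its confidence interval. The scores below are made up, and the interval uses the Fisher z-transformation, which is one common method; the interval SPSS reports may differ slightly depending on its settings.

```python
import math

def pearson_r(x, y):
    """Pearson correlation: covariation of x and y scaled by their spreads."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def fisher_ci(r, n, z_crit=1.959964):
    """Approximate 95% CI for r via the Fisher z-transformation."""
    z = math.atanh(r)            # transform r to the z scale
    se = 1 / math.sqrt(n - 3)    # standard error on the z scale
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)

# Hypothetical scores for 8 users (not taken from the Lumosity data).
pretest = [10, 12, 9, 15, 11, 14, 8, 13]
posttest = [12, 15, 10, 18, 11, 17, 9, 14]

r = pearson_r(pretest, posttest)
low, high = fisher_ci(r, len(pretest))
print(f"r = {r:.3f}, 95% CI [{low:.3f}, {high:.3f}]")
```

Note that the interval is not symmetric around r, because it is built on the z scale and then transformed back.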

Based on the p-value you got ("p<.001") for the null hypothesis test, which statement below is true? (assume that this dataset is a random sample of the population of Lumosity users)
a. there is a greater than 99.9% probability that the true population correlation between raw_score and pretest_score is non-zero
b. there is less than .1% probability that raw_score and pretest_score are uncorrelated
c. there is less than .1% probability of finding a correlation at least this extreme (in a sample this size) if the true population correlation is zero

NOTE: With large samples, correlation p-values are often not very useful, because even trivially small correlations are significant. The effect size (the Pearson correlation coefficient, r, in this case) is generally what you would care about.
NOTE #2: The scatter plot doesn't seem to show 1,000 points, right? That is because many data points are right on top of each other (there are only 38 unique values of raw_score).
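To make the first note concrete, here is a small Python sketch that holds a weak correlation fixed at r = 0.10 and varies only the sample size. It uses the Fisher z normal approximation for the two-sided p-value, which is an assumption made for simplicity; SPSS uses a t-test, so its p-values will differ slightly.

```python
import math

def approx_two_sided_p(r, n):
    """Approximate two-sided p-value for H0: population correlation = 0."""
    z = math.atanh(r) * math.sqrt(n - 3)       # Fisher z test statistic
    return math.erfc(abs(z) / math.sqrt(2))    # two-sided normal tail area

# Same effect size, increasing sample size.
for n in (30, 100, 1000, 10000):
    print(n, round(approx_two_sided_p(0.10, n), 4))
```

The effect size never changes, but the p-value shrinks as n grows, which is why the coefficient itself is the more informative number here.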


Step 4 - Non-parametric correlation coefficients: Spearman's rho & Kendall's tau

Step 4.1 - Compute the Spearman rank correlation coefficient, rho or ρ, with confidence interval, and null hypothesis test p-value (use a two-sided test).
Step 4.2 - Next, compute the Kendall correlation coefficient.
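Both coefficients work on ranks or orderings rather than raw values. The sketch below, with made-up scores, computes Spearman's rho as a Pearson correlation on average ranks and Kendall's tau in its tau-b form, which adjusts for ties. It is an illustration of the definitions, not a description of how SPSS computes them internally.

```python
import math

def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1   # mean of positions i..j, 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson_r(average_ranks(x), average_ranks(y))

def kendall_tau_b(x, y):
    """Kendall's tau-b: compares concordant and discordant pairs, with ties."""
    concordant = discordant = only_x_tied = only_y_tied = 0
    for i in range(len(x)):
        for j in range(i + 1, len(x)):
            dx, dy = x[i] - x[j], y[i] - y[j]
            if dx == 0 and dy == 0:
                continue              # tied on both variables: not counted
            elif dx == 0:
                only_x_tied += 1
            elif dy == 0:
                only_y_tied += 1
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    untied = concordant + discordant
    denom = math.sqrt((untied + only_x_tied) * (untied + only_y_tied))
    return (concordant - discordant) / denom

# Hypothetical scores with some ties (not taken from the Lumosity data).
pretest = [10, 12, 12, 15, 11, 14, 8, 13]
posttest = [12, 15, 14, 18, 11, 15, 9, 14]
print(round(spearman_rho(pretest, posttest), 3))
print(round(kendall_tau_b(pretest, posttest), 3))
```

Kendall's tau is usually smaller in magnitude than Spearman's rho for the same data, so the two should not be compared directly as effect sizes.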

Step 5 - Categorical (nominal) variables: contingency coefficients

Sometimes we want to look at associations between nominal variables, but we can't use the above methods because one or more of the variables is not ordinal. In this Lumosity data set, let's say we want to know whether the non-native English speakers who use the website tend to have a different level of education than the native English speakers.

Step 5.1 - So let's look at the association between the edu_cat and the english_nativelang variables
Step 5.2 - Cross-tabulation frequency table
Step 5.3 - Measures of association between nominal variables

The chi-squared test statistic indicates the association between these two categorical variables, but the scale is hard to interpret. Let's try a couple of measures of association, the contingency coefficient and an alternative called Cramer's V. These measures are each on a 0 to 1 scale, but Cramer's V is generally preferred (contingency coefficients cannot reach the maximum value of 1 in many cases, which makes them hard to compare).
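The Python sketch below shows how both measures are derived from the chi-squared statistic, using the standard formulas C = sqrt(chi2 / (chi2 + N)) and V = sqrt(chi2 / (N * (k - 1))), where k is the smaller of the number of rows and columns. The table is a made-up 2 x 2 example with a perfect association, not the edu_cat by english_nativelang table.

```python
import math

def chi_square(table):
    """Pearson chi-squared statistic and total N for a table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    return chi2, n

def contingency_coefficient(table):
    chi2, n = chi_square(table)
    return math.sqrt(chi2 / (chi2 + n))

def cramers_v(table):
    chi2, n = chi_square(table)
    k = min(len(table), len(table[0]))   # smaller table dimension
    return math.sqrt(chi2 / (n * (k - 1)))

# Hypothetical table in which the two variables are perfectly associated.
perfect = [[50, 0],
           [0, 50]]
print(round(contingency_coefficient(perfect), 3))
print(round(cramers_v(perfect), 3))
```

Even with a perfect association, the contingency coefficient for a 2 x 2 table stays below 1, while Cramer's V reaches 1. That ceiling depends on the table size, which is what makes contingency coefficients hard to compare across tables.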


Step 6 - Accounting for a third variable: Partial and semi-partial correlation

Suppose you want to know about the association between age and performance in the training program (raw_score), but you want to adjust for users' performance level before the training program (pretest_score). One way to adjust is with semi-partial and partial correlations. We'll do both here.

What do you notice?
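As a reference for comparing the two coefficients, both can be written in terms of the three pairwise Pearson correlations. In the Python sketch below, x stands for age, y for raw_score, and z for pretest_score. The correlation values are invented for illustration, and the semi-partial version assumes pretest_score is removed from age only.

```python
import math

def partial_r(r_xy, r_xz, r_yz):
    """Correlation of x and y with z removed from both x and y."""
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz**2) * (1 - r_yz**2))

def semipartial_r(r_xy, r_xz, r_yz):
    """Correlation of y with the part of x that is unrelated to z."""
    return (r_xy - r_xz * r_yz) / math.sqrt(1 - r_xz**2)

# Hypothetical pairwise correlations (not results from the Lumosity data).
r_age_raw = -0.30      # x with y
r_age_pre = -0.25      # x with z
r_raw_pre = 0.80       # y with z

print(round(partial_r(r_age_raw, r_age_pre, r_raw_pre), 3))
print(round(semipartial_r(r_age_raw, r_age_pre, r_raw_pre), 3))
```

The two formulas share a numerator; the partial correlation divides by an extra term that is at most 1, so its magnitude is never smaller than that of the semi-partial correlation.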

That's all - move on to the RStudio activity when you are ready!


References:

Textbook Chapter 8 - Field, A. (2018). Discovering statistics using IBM SPSS statistics. 5th Edition. SAGE Publications.

Dataset from Guerra-Carrillo, B., Katovich, K., & Bunge, S. A. (2017). Does higher education hone cognitive functioning and learning efficacy? Findings from a large and diverse sample. PloS one, 12(8), e0182276. Licensed under CC BY 4.0 (Creative Commons Attribution 4.0 International) by Belen Guerra-Carrillo and Bunge Lab.