Associations between variables: Multiple Regression in SPSS

edited Feb 21, 2024
Jamil Palacios Bhanji

Goals for today


Step 1 - Get organized


Step 2 - Import data and check it out

Data description: lumos_subset1000plusimaginary.csv is the same file we worked with last week, but with one extra variable added ("imaginary_screensize" - this is a fabricated variable that is not part of the real dataset).

This is a subset of a public dataset of Lumosity (a cognitive training website) user performance data. You can find the publication associated with the full dataset here:
Guerra-Carrillo, B., Katovich, K., & Bunge, S. A. (2017). Does higher education hone cognitive functioning and learning efficacy? Findings from a large and diverse sample. PloS one, 12(8), e0182276. https://doi.org/10.1371/journal.pone.0182276

Import the data: Open SPSS and use File -> Import Data -> CSV or Text Data, then check the variable types and add labels if you wish. Careful! If you use "Import Text Data", make sure "comma" is the only delimiter (SPSS may automatically also treat "space" as a delimiter, so uncheck that option).
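If you prefer syntax to the menus, a GET DATA sketch like the one below should reproduce the import. The variable list and formats here are assumptions (only the variables used in this activity are shown); it is safer to Paste the syntax SPSS generates from the import dialog and edit that.

```
* Syntax sketch for the CSV import (File > Import Data > CSV Data).
* The variable list and formats below are assumptions - paste the
* syntax SPSS generates rather than copying this verbatim.
GET DATA
  /TYPE=TXT
  /FILE="lumos_subset1000plusimaginary.csv"
  /DELIMITERS=","
  /QUALIFIER='"'
  /ARRANGEMENT=DELIMITED
  /FIRSTCASE=2
  /VARIABLES=
    age F3.0
    pretest_score F8.2
    raw_score F8.2
    years_edu F2.0
    imaginary_screensize F8.2.
```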

What to do next:


Step 3 - The General Linear Model (GLM) with one predictor

A note about terminology: In this lab activity we will use the terms "predictor", "independent variable (IV)", and "explanatory variable" interchangeably to refer to variables that are entered as explanatory terms in the model. We will use the terms "dependent variable" and "outcome variable" interchangeably to refer to the variable that is being explained. You should be mindful of the implications of these words (e.g., about causality, which cannot be inferred simply by assigning one variable as a predictor and another as an outcome), but we won't focus on language in this lab. Instead, get used to seeing these different terms.

[Figure: GLM decision chart]

Above is the decision process chart from the book.
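To estimate this one-predictor model, use Analyze -> Regression -> Linear with raw_score as the dependent variable and age as the predictor, or run a syntax sketch like this (assuming the same variable names as last week):

```
* Sketch: simple regression predicting raw_score from age
* (equivalent to Analyze > Regression > Linear).
REGRESSION
  /STATISTICS COEFF OUTS R ANOVA
  /DEPENDENT raw_score
  /METHOD=ENTER age.
```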

Looking at the output:

  1. In the Model Summary table, the R-square value tells us that the model (raw_score = b0 + b1*age) explains about 1% of the variance in raw_score (a small effect by most standards, but could be important in some contexts).
  2. In the ANOVA table, the F-statistic is the ratio of variance explained by the model to the error within the model (Mean Square for the model divided by Mean Square of the Residual). The "Sig." value (aka p-value) tells you the probability of an F-statistic at least that large under the null hypothesis.
  3. In the Coefficients table, the unstandardized coefficient (B) for age gives the predicted change in raw_score for each one-year increase in age, and the t-statistic and "Sig." value test whether that coefficient differs from zero.


Step 4 - The General Linear Model (GLM) with multiple predictors, model comparison, model diagnostics (steps 4, 5, and 6 in the RStudio activity)

Now let's add pretest_score to our model, so that we are predicting raw_score as a function of age and pretest_score.
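In the dialog, enter age in Block 1 and pretest_score in Block 2 so SPSS reports both models and the R-square change. A hedged syntax sketch, assuming the same variable names and including the residual plots used in the diagnostics section below:

```
* Sketch: hierarchical regression - age in Block 1, pretest_score
* added in Block 2 - with R-square change and residual plots.
REGRESSION
  /STATISTICS COEFF OUTS R ANOVA CHANGE
  /DEPENDENT raw_score
  /METHOD=ENTER age
  /METHOD=ENTER pretest_score
  /SCATTERPLOT=(*ZRESID ,*ZPRED)
  /RESIDUALS HISTOGRAM(ZRESID) NORMPROB(ZRESID).
```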

Model Diagnostics

Now that we have run the regression model, we need to check the assumptions we made. Let's look back at our decision chart - it says we can use a graph of "zpred vs zresid" to check for linearity, heteroscedasticity, and independence, then look at a histogram of residuals to check for normality. These are all plots created from the Model 2 residuals (the difference between the actual raw_score value and the raw_score value predicted by the model, for each case in the dataset). These charts should appear in the output you generated from the last regression:

  1. "Scatterplot- Dependent Variable: raw_score" - this is the"zpred vs zresid" and is useful for checking linearity, heteroscedasticity, and independence. We are basically checking to make sure there are no clear patterns in the residuals.

  2. "Histogram- Dependent Variable: raw_score" - this is a histogram of the residuals. You expect to see a normal curve shape. Look at this plot along with the "Normal P-P" plot to assess the normality assumption.

  3. "Normal P-P" - this is a P-P plot of the residuals (like a Q-Q plot but of cumulative probability rather than quantiles), and helps us check for non-normally distributed residuals. We expect points to fall close to the diagonal line if the residuals are normal (they do here).

Now that you have reviewed the output, answer the following questions for yourself about the full model (raw_score = b0 + b1*age + b2*pretest_score):

  1. What does the Model R-square tell you for this model? That is, what percent of the variance in raw_score is explained by the model with age and pretest score as predictors?
  2. What does the overall F-statistic and p-value tell you?
  3. What does the beta coefficient for pretest_score tell you?
  4. What does it mean that the t-statistic for the age variable is not significant in this model? How does it compare to the partial correlation test that we ran last week with the same variables? (A syntax sketch for re-running that partial correlation follows this list.)
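For question 4, if you want to re-run last week's partial correlation for comparison (Analyze -> Correlate -> Partial), here is a sketch assuming the same variable names:

```
* Sketch: partial correlation of raw_score and age,
* controlling for pretest_score.
PARTIAL CORR
  /VARIABLES=raw_score age BY pretest_score
  /SIGNIFICANCE=TWOTAIL.
```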

Step 5 - Dichotomous outcome: logistic regression (Step 8 in the RStudio activity)

Logistic regression is used when the outcome variable has just two possible values (e.g., true/false, on/off, yes/no, or greater/less than some critical threshold). For the sake of learning, let's imagine that a raw_score value of 16 or greater wins a $100 prize, so we want to see if we can explain who wins the prize based only on the users' years of education (years_edu).
Why can't we run a regular linear regression? - Because regular linear regression will give you predicted values that fall outside the possible outcome values (0 and 1) and are not interpretable.
On the other hand, logistic regression will yield predicted values between 0 and 1 that can be interpreted as the probability of the outcome (e.g., prize received) occurring. These predicted values follow a sigmoid-shaped logistic function.
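Concretely, with one predictor the model estimates p(outcome) = 1 / (1 + e^-(b0 + b1*years_edu)), so the linear part, b0 + b1*years_edu, is on the log-odds scale rather than the probability scale.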

A sigmoid-shaped logistic function looks something like the figure below: when applied to a logistic regression, the x-axis is a hypothetical predictor and the y-axis is the probability of the outcome occurring.
[Figure: logistic function, image from Wikipedia]

Let's estimate a logistic model now
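In SPSS this means first computing the 0/1 outcome and then using Analyze -> Regression -> Binary Logistic. A syntax sketch follows; the variable name "prize" is our own invention, not part of the dataset:

```
* Create the hypothetical 0/1 outcome: 1 if raw_score >= 16.
COMPUTE prize = (raw_score >= 16).
EXECUTE.

* Sketch: logistic regression of prize on years_edu
* (Analyze > Regression > Binary Logistic).
LOGISTIC REGRESSION VARIABLES prize
  /METHOD=ENTER years_edu
  /PRINT=CI(95).
```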

Look at the output:

That's all for this part, have some fun in RStudio now!


References