last edited 2024-02-09


Goals for today


Step 1 - Get organized


Step 2 - Import data and check it out

#first import the data
ratings_tib <- readr::read_csv("data/Schroeder_Epley_2015_Study_4_abbrev.csv", 
                               na = "NA")
# now make sure the columns we want as factors are treated that way, using forcats::as_factor()
ratings_tib <- ratings_tib |> dplyr::mutate(
  condition_factor = forcats::as_factor(CONDITION)
) |> dplyr::mutate( #let's give names to the levels by using the fct_recode() function
  condition_factor = forcats::fct_recode(condition_factor, text = "0", audio = "1")
) #after this you should see "text" and "audio" in the condition_factor column of ratings_tib
# use code below to re-order the levels of condition
# ratings_tib <- ratings_tib |> mutate(
#   condition_factor = forcats::fct_relevel(condition_factor,"audio","text")
#   )
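# optional check that the recode worked as expected (a quick sketch): cross-tabulate the
# original numeric column against the new factor - each CONDITION value should map onto
# exactly one level of condition_factor
ratings_tib |> dplyr::count(CONDITION, condition_factor)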

 


Step 3 - Examine group means and distributions

Two group decision chart

Above is the decision process chart from the book.

  • Following the chart, we should start with box plots and histograms to check for unusual cases, non-normality, and possible differences in variance (violation of homogeneity of variance) between groups. Note that we used the term “homoscedasticity” when talking about regression with continuous explanatory variables - it’s the same idea as homogeneity of variance. Let’s also make a table of mean, sd, and #cases by group. See if you can use what you learned in previous activities to make:
    1. box plot (group on x-axis)
    2. histogram (one for each group, or color-code groups on a single histogram)
    3. Q-Q plot (one plot for each group)
    4. a table of mean, sd, and #cases by group

To jog your memory, here is a link to the “import and examine” activity where we made descriptive tables and distribution plots. In that earlier activity you can go to Step 3.1 for tables, Step 3.2 for plots, and the code in the last “mini-challenge” section has an example of how to add a grouping variable to tables and plots.

#box plot  
p1 <- ratings_tib |> 
  ggplot(aes(x = condition_factor, y = Hire_Rating)) + 
    geom_boxplot() +
    theme_classic() + labs(title="Hire Rating box plot by group", y = "Hire Rating", x = "Group")  
#histogram
p2 <- ratings_tib |> 
  ggplot( aes(x = Hire_Rating, fill = condition_factor)) + 
    geom_histogram(position = "identity", alpha = .5, binwidth = 1) + 
    theme_classic() + labs (title = "Hire Rating Distribution by Group")

#Extra: q-q plots for each group (using facets - see https://ggplot2-book.org/facet.html)
p3 <- ratings_tib |>
  ggplot(aes(sample = Hire_Rating)) + geom_qq() + geom_qq_line() + theme_classic() + 
    facet_wrap(~ condition_factor) + 
    labs (title = "Q-Q of Hire Rating")

#table
ratings_tib |> group_by(condition_factor) |>  
  dplyr::summarise(
    median =  median(Hire_Rating),
    mean =  mean(Hire_Rating),
    sd = sd(Hire_Rating),
    cases = n() - sum(is.na(Hire_Rating))
  ) |> 
    knitr::kable(caption = "Hire Rating Descriptives by Group", digits = 3) |> 
    kableExtra::kable_styling(full_width = FALSE)

p1; p2; p3
Hire Rating Descriptives by Group

condition_factor   median    mean      sd   cases
text                    2   2.889   2.055      18
audio                   5   4.714   2.261      21

 


Step 4 - Compare means with an independent samples t-test

# independent samples t-test (equal variance not assumed)
t.test(formula = Hire_Rating ~ condition_factor, data = ratings_tib, 
                 alternative = "two.sided", var.equal = FALSE, paired = FALSE)
# or equivalently, using x and y vectors instead of specifying with a formula:
# t.test(x = subset(ratings_tib$Hire_Rating,ratings_tib$condition_factor=="text"), 
#        y = subset(ratings_tib$Hire_Rating,ratings_tib$condition_factor=="audio"),
#        alternative = "two.sided",
#        var.equal = FALSE, paired = FALSE)
## 
##  Welch Two Sample t-test
## 
## data:  Hire_Rating by condition_factor
## t = -2.6399, df = 36.856, p-value = 0.01208
## alternative hypothesis: true difference in means between group text and group audio is not equal to 0
## 95 percent confidence interval:
##  -3.2265958 -0.4241979
## sample estimates:
##  mean in group text mean in group audio 
##            2.888889            4.714286

 

Looking at the t-test summary:

  1. Welch Two Sample t-test is the name for an independent samples t-test that allows for unequal variances in each group (equivalent to the “Equal variances not assumed” row of the SPSS independent samples t-test output)

  2. The t-statistic is the ratio of the mean difference to its estimated standard error (the standard error of the difference is computed by combining the standard error of each group mean; the formula and degrees of freedom are modified to account for unequal variances - see Chapter 10 of the Field textbook), and the p-value tells you the probability of a t-statistic this far from zero under the null hypothesis (notice the degrees of freedom is not an integer - this is due to the modification allowing for unequal variance). An optional hand computation of the Welch t and df appears after this list.

  3. The sign of the t-statistic (positive or negative) is determined by the order of levels of the IV: first level (“text”) minus second level (“audio”). If you wanted to reverse that, you could re-order the levels like this: ratings_tib <- ratings_tib |> dplyr::mutate(condition_factor = forcats::fct_relevel(condition_factor,"audio","text")) - go ahead and do that now (it makes most sense to do it in the chunk where you import the data, then re-run all the chunks)

  4. The 95% confidence interval gives an interval around the estimated difference between means: we expect 95% of intervals constructed this way to contain the true difference in population means

  5. The sample means should match what you saw in your table above.
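
If you want to see where the t statistic and degrees of freedom come from, here is an optional hand computation of the Welch statistic from the group summaries. This is just a sketch for checking your understanding (the object names welch_check, se_sq, t_welch, and df_welch are only illustrative); the t.test() output above is the result you would report.

# optional: hand-compute the Welch t and df from group means, variances, and counts
welch_check <- ratings_tib |> 
  dplyr::group_by(condition_factor) |> 
  dplyr::summarise(m = mean(Hire_Rating, na.rm = TRUE),
                   v = var(Hire_Rating, na.rm = TRUE),
                   n = sum(!is.na(Hire_Rating)))
se_sq <- welch_check$v / welch_check$n        # squared standard error of each group mean
t_welch <- (welch_check$m[1] - welch_check$m[2]) / sqrt(sum(se_sq))
df_welch <- sum(se_sq)^2 / sum(se_sq^2 / (welch_check$n - 1))   # Welch-Satterthwaite df
t_welch; df_welch                             # should match the t and df in the output above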


Step 5 - Effect size - independent samples

In general, effect size estimates for two group comparisons are simply the difference between means expressed in standardized units. This effect size measure is referred to as Cohen’s d. As you read in the Lakens article, there are different methods to calculate d. The effectsize package can give us these measures - here we will use the pooled variance version of d (ds in the Lakens article). An optional hand computation of the pooled d appears after the code and output below.

try it now in your Rmd doc

ratings_tib <- ratings_tib  |> mutate(
  condition_factor = forcats::fct_relevel(condition_factor,"audio","text")
  )
# pooled variance cohen's d
effectsize::cohens_d(Hire_Rating ~ condition_factor, data = ratings_tib, pooled_sd = TRUE, 
                     paired = FALSE, alternative = "two.sided") |>     
  knitr::kable(caption = "Effect size Cohen's d (pooled)", digits = 3) |> 
  kableExtra::kable_styling(full_width = FALSE)

# pooled variance Hedges' g
# effectsize::hedges_g(Hire_Rating ~ condition_factor, data = ratings_tib, pooled_sd = TRUE, 
#                      paired = FALSE, alternative = "two.sided") |>     
#   knitr::kable(caption = "Effect size Hedges' g", digits = 3) |> 
#   kableExtra::kable_styling(full_width = FALSE)
Effect size Cohen’s d (pooled)

Cohens_d     CI   CI_low   CI_high
   0.842   0.95    0.178     1.494
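
If you want to see where that number comes from, here is an optional hand computation of the pooled-sd d from the group summaries (a sketch for checking your understanding; the object names d_check and sd_pooled are only illustrative):

# optional: pooled-sd Cohen's d computed by hand from group means, variances, and counts
d_check <- ratings_tib |> 
  dplyr::group_by(condition_factor) |> 
  dplyr::summarise(m = mean(Hire_Rating, na.rm = TRUE),
                   v = var(Hire_Rating, na.rm = TRUE),
                   n = sum(!is.na(Hire_Rating)))
sd_pooled <- sqrt(sum((d_check$n - 1) * d_check$v) / (sum(d_check$n) - 2))
(d_check$m[1] - d_check$m[2]) / sd_pooled    # difference in means, in pooled-sd units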

 

Now, answer the following questions for yourself based on what you’ve done so far

  1. What is the difference between the mean Hire Rating for recruiters who heard audio pitches and the mean for recruiters who read text pitches (in raw units of the Hire_Rating scale)?
  2. What is the estimated effect size (Cohen’s d), and confidence interval for the effect size? Does the confidence interval include zero?
  3. What do you conclude about the effect of audio versus text pitches on hiring ratings based on this sample of data you have examined?

How to report the result

A report of the result of a comparison like this should include:
1. Type of comparison (this can be in the Methods section. e.g. “Means were compared by an independent samples t-test (equal variances not assumed).”)
2. Means and SDs for each group (you might also report a confidence interval for each mean and/or for the difference between means)
3. Test statistic with degrees of freedom and p value
4. Effect size

Following the way the result was reported in the original publication (Schroeder & Epley, 2015) we could report the finding like this:
“Recruiters reported being more likely to hire the candidates when they listened to pitches (M = 4.71, SD = 2.26) than when they read the same pitches (M = 2.89, SD = 2.06), t(36.87) = 2.64, p = .01, 95% CI of the difference = [0.42, 3.23], d = 0.84.”

Extra: Bayes Factor reporting

You may, in some papers, see a “Bayes Factor” reported along with the typical t-test information. In this two group mean comparison case, a Bayes Factor tells us how much more likely the alternative hypothesis is (compared to the null), given the observed difference in means in the sample. You can think of it as odds in favor of the alternative hypothesis. To calculate a Bayes Factor for a t-test, you can use the ttestBF function in the BayesFactor package; the syntax is like the regular t.test function. Here is example code if you want to try it: BayesFactor::ttestBF( formula = Hire_Rating ~ condition_factor, paired = FALSE, data = ratings_tib ) - the Bayes Factor is displayed after the r value (this is an r scale value, not a correlation coefficient; it controls the width of the alternative distribution, and .707 is a standard value).
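
If you want to run it as a chunk, here is a minimal sketch (it assumes the BayesFactor package is installed; the extractBF() line is optional and just pulls out the numeric value, and the name audio_bf is only illustrative):

# Bayes Factor comparing the alternative to the null for the two-group comparison
audio_bf <- BayesFactor::ttestBF(formula = Hire_Rating ~ condition_factor,
                                 paired = FALSE, data = ratings_tib)
audio_bf                               # the Bayes Factor is the value printed after the r scale
# BayesFactor::extractBF(audio_bf)$bf  # optional: extract just the numeric Bayes Factor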


Step 6 - Non-parametric test for independent samples (based on sum of ranks)

Try it now in your R markdown doc.

# independent samples Wilcoxon rank sum test
wilcox.test(formula = Hire_Rating ~ condition_factor, data = ratings_tib,
            paired = FALSE, alternative = "two.sided")
## Warning in wilcox.test.default(x = DATA[[1L]], y = DATA[[2L]], ...): cannot
## compute exact p-value with ties
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  Hire_Rating by condition_factor
## W = 275.5, p-value = 0.01441
## alternative hypothesis: true location shift is not equal to 0
Examine the output
  • You will get a message saying “cannot compute exact p-value with ties” because there are ties when the scores are ranked, so the p-value given is an approximation (and imprecise when N < 50). For this reason you might choose instead to use a robust test such as the Yuen (1974) test of trimmed means (see discovR tutorial 09 in the Andy Field tutorial we installed during Week 1, and the WRS2::yuen() function; an optional sketch of that function appears after the effect size output below).
  • The output will give you a W statistic and a p-value: the probability of a rank sum statistic at least as extreme as the one observed if the null hypothesis is true. This W is equivalent to the Mann-Whitney U that you see in SPSS, but it differs from the SPSS W statistic (Arrgh!) because there are multiple definitions of the W statistic. When you report this test it is not a bad idea to specify in the Methods what software you used.
  • What about an effect size measure for this kind of comparison? The Field textbook (section 7.4.5) recommends calculating an r value with the formula (r = z/sqrt(N)).
  • The wilcox.test() internally computed a z-statistic but didn’t include it in the output. The simplest way to get the z-stat is to work backwards from the p-value using the qnorm() function to get the corresponding z-stat from the normal distribution. We can access the exact (not rounded) p-value by first storing the result of wilcox.test() in a variable called Wtest, and then selecting the p-value like this : Wtest$p.value.
  • Since our p-value is for a two sided test we will use WtestZ <- abs(qnorm(Wtest$p.value/2)) and WtestR <- WtestZ/sqrt(N) (note that there is no pos/neg sign on the effect size computed this way).
  • Check out the code below and test it in your own markdown if you like.
  • You may notice a tiny difference in the z and r values compared to what you got in SPSS, due to a correction that the wilcox.test() function applies (you can change this by specifying correct = FALSE when you call the function)
# independent samples Wilcoxon rank sum test
Wtest <- wilcox.test(formula = Hire_Rating ~ condition_factor, data = ratings_tib,
                     paired = FALSE, alternative = "two.sided")
## Warning in wilcox.test.default(x = DATA[[1L]], y = DATA[[2L]], ...): cannot
## compute exact p-value with ties
WtestZ <- abs(qnorm(Wtest$p.value/2))
WtestN <- nrow(ratings_tib |> drop_na(Hire_Rating,condition_factor))
WtestR <- WtestZ/sqrt(WtestN)
## the line below puts the Z and R values into a formatted output
cat(sprintf("\nEffect size for rank sum comparison:\nZ = %.3f, r = %.3f", WtestZ, WtestR))
## 
## Effect size for rank sum comparison:
## Z = 2.447, r = 0.392
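
If you want to try the robust alternative mentioned above, here is a minimal sketch of the Yuen trimmed-means comparison (it assumes the WRS2 package is installed; tr = 0.2 requests 20% trimming, which is the default):

# robust comparison of 20% trimmed means (Yuen, 1974) - an alternative to the rank sum test
WRS2::yuen(Hire_Rating ~ condition_factor, data = ratings_tib, tr = 0.2)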

 

Reporting

  • When reporting a result of a Wilcoxon Rank Sum test, you should report the medians, test statistic, and p-value, with an explanation in the Methods of what test you used. For example,
    Methods: “The distribution of the dependent variable in each group was non-normal, therefore groups were compared by the Wilcoxon rank sum test (Wilcoxon, 1945) implemented by the wilcox.test function in R, and effect size is reported as r computed from the test z-score (Rosenthal, 1991).” Results: “Recruiters rated candidates more highly when they listened to pitches (Mdn = 5) than when they read the same pitches (Mdn = 2), Wilcoxon rank sum W = 275.5, p = .014, z = 2.447, r = .392.”

Step 7 - Dependent samples (also called paired samples, within-subjects comparison, repeated measures)

Now, let’s imagine a different sample of data, where each recruiter was exposed to both conditions (audio and text) and made the same ratings, so from each recruiter there are two Hire_Rating measures: one in the audio and one in the text condition. This is a within-subjects or repeated measures design. When we analyze this sample we need to account for the fact that measures in each condition are dependent, meaning that the “audio” rating from a recruiter may be related to the “text” rating from the same recruiter.

Re-interpret (reimagine) the data using pnum_rm as the recruiter/participant ID

Hypothetically, imagine that the dataset we have been working with was a within-subjects design, and our participant identifier is stored in the pnum_rm column instead of pnum. There were 3 more recruiters in the “audio” condition than the “text” condition so we have to drop 3 points (these are the rows with NA values for pnum_rm) so that we have 18 pairs of data points. Below, the plot on the left is the same data we’ve been working with (with the 3 points dropped), and the plot on the right has lines connecting the pairs of points.
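
If you would like to try drawing the paired version yourself, here is a minimal sketch (it assumes ggplot2 and tidyr are loaded; geom_line(aes(group = pnum_rm)) is what connects each recruiter's pair of points, and the title and axis labels are only illustrative):

# paired points plot: one line per recruiter connecting their text and audio ratings
ratings_tib |> 
  tidyr::drop_na(pnum_rm) |> 
  ggplot(aes(x = condition_factor, y = Hire_Rating)) + 
    geom_point() + 
    geom_line(aes(group = pnum_rm), alpha = .5) + 
    theme_classic() + 
    labs(title = "Hire Rating by group (paired points connected)", x = "Group", y = "Hire Rating")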

Now let’s use this paired data and conduct a paired samples t-test to compare hire ratings in the audio condition to the text condition.

  • first create a new tibble where the 3 data points with no pnum_rm value are dropped:
    ratings_rm_tib <- ratings_tib |> drop_na(pnum_rm) - you should have 36 cases in your new tibble.
  • use the same t.test() function that you used before, but on the new sample of data, and specify paired = TRUE (the var.equal argument is irrelevant for paired samples)
  • the way t.test() matches pairs of values is by the order they appear in the data (the first “audio” case gets matched with the first “text” case, and so on – if there are not an equal number of each group you will get an error).
  • thus, your data must be sorted so that matched pairs are identified by their order. You can sort the data by pnum_rm first, to put the cases in order (use dplyr::arrange(pnum_rm))
  • the solution below uses two piping steps to drop na values and sort by pnum_rm with dplyr::arrange() - see the code for how to do this.
ratings_rm_tib <- ratings_tib |> drop_na(pnum_rm) |> dplyr::arrange(pnum_rm, condition_factor) 
#paired samples t-test
t.test(formula = Hire_Rating ~ condition_factor, data = ratings_rm_tib, 
         paired = TRUE, alternative = "two.sided")
## 
##  Paired t-test
## 
## data:  Hire_Rating by condition_factor
## t = 15, df = 17, p-value = 3.096e-11
## alternative hypothesis: true mean difference is not equal to 0
## 95 percent confidence interval:
##  2.148364 2.851636
## sample estimates:
## mean difference 
##             2.5

 

Did you get an error message saying you can’t use “paired = TRUE” with the formula method?

This is caused by a change between version 0.8.5 and 0.8.6 of the “effectsize” package. The fix, which will come in 0.8.7 (not available on CRAN when this was written), will be to use a new function called repeated_measures_d(). For now, the solution is to downgrade your “effectsize” package version to 0.8.5. Use this command in the console (install the remotes package if you don’t already have it):
remotes::install_version("effectsize", version = "0.8.5", repos = "http://cran.us.r-project.org")

Why is the paired comparison so different if most of the points are the same?

  • With this design, observations are not independent. Each observation is related to another observation from the same recruiter.
  • So our model (of the difference between conditions) actually has only 18 independent observations because we are using the difference between audio and text hiring ratings for each recruiter (17 degrees of freedom because we estimate one parameter, the mean difference between conditions).
  • A paired samples t-test is exactly the same as conducting a 1 sample t-test on the difference between conditions compared to zero (notice that the output above reports the “mean difference”) - see the check after this list.
  • The t-statistic is still the ratio of the mean difference between conditions divided by estimated standard error, but our estimated standard error is now based on the difference between pairs of values rather than the standard errors for each group
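
To see that equivalence for yourself, here is an optional sketch that computes each recruiter's audio-minus-text difference and runs a one-sample t-test against zero (it assumes tidyr and dplyr are loaded; diff_tib is only an illustrative name):

# the paired t-test is a one-sample t-test on the within-recruiter differences
diff_tib <- ratings_rm_tib |> 
  dplyr::select(pnum_rm, condition_factor, Hire_Rating) |> 
  tidyr::pivot_wider(names_from = condition_factor, values_from = Hire_Rating) |> 
  dplyr::mutate(diff = audio - text)
t.test(diff_tib$diff, mu = 0)   # t, df, and p should match the paired t-test above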

Step 8 - Effect size for dependent samples

try it now in your Rmd doc

#1 - pooled-sd Cohen's d, treating the two conditions as independent groups
effectsize::cohens_d(Hire_Rating ~ condition_factor, data = ratings_rm_tib, 
                     pooled_sd = TRUE, paired = FALSE, alternative = "two.sided") |>     
  knitr::kable(caption = "Effect size Cohen's d (pooled)", digits = 3) |> 
  kableExtra::kable_styling(full_width = FALSE)
#2 - the data must be sorted by pnum_rm, so it doesn't hurt to repeat the arrange() function here
ratings_rm_tib <- ratings_rm_tib |> dplyr::arrange(pnum_rm) 
effectsize::cohens_d(Hire_Rating ~ condition_factor, data = ratings_rm_tib, 
                     pooled_sd = FALSE, paired = TRUE, alternative = "two.sided") |>     
  knitr::kable(caption = "Effect size Cohen's dz (dependent)", digits = 3) |> 
  kableExtra::kable_styling(full_width = FALSE)
Effect size Cohen’s d (pooled)

Cohens_d     CI   CI_low   CI_high
   1.353   0.95    0.617     2.073

Effect size Cohen’s dz (dependent)

Cohens_d     CI   CI_low   CI_high
   3.536   0.95    2.263     4.793

 

Extra: Bayes Factor reporting

  • You can use the same BayesFactor::ttestBF() function to calculate a Bayes Factor, this time with paired = TRUE. The formula = syntax does not work for a paired test, so you will need to use the x = and y = arguments instead of formula = and data = (a sketch is shown below).
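
Here is a minimal sketch of that paired call (it assumes the BayesFactor package is installed and that ratings_rm_tib is sorted by pnum_rm so the two vectors line up by recruiter):

# Bayes Factor for the paired comparison - build one vector per condition, ordered by recruiter
ratings_rm_tib <- ratings_rm_tib |> dplyr::arrange(pnum_rm)
BayesFactor::ttestBF(
  x = ratings_rm_tib$Hire_Rating[ratings_rm_tib$condition_factor == "audio"],
  y = ratings_rm_tib$Hire_Rating[ratings_rm_tib$condition_factor == "text"],
  paired = TRUE)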

Step 9 - Non-parametric test for paired samples - when assumptions are violated

Try it now in your R markdown doc.
Don’t forget to specify ratings_rm_tib as the data, and sort cases using dplyr::arrange() (especially if you are copy-pasting code from Step 6).

# paired samples Wilcoxon signed rank test
ratings_rm_tib |> 
  dplyr::arrange(pnum_rm) |> 
  wilcox.test(formula = Hire_Rating ~ condition_factor, data = _, 
              paired = TRUE, alternative = "two.sided")
## Warning in wilcox.test.default(x = DATA[[1L]], y = DATA[[2L]], ...): cannot
## compute exact p-value with ties
## 
##  Wilcoxon signed rank test with continuity correction
## 
## data:  Hire_Rating by condition_factor
## V = 171, p-value = 0.0001377
## alternative hypothesis: true location shift is not equal to 0

 

Examine the output
  • You will get a message saying “cannot compute exact p-value with ties” because there are ties when some of the scores are ranked, which means the p-value given is an approximation (and imprecise when N<50).
  • For this reason you might choose instead to use a robust test such as the Yuen (1974) test for paired samples (see discovR tutorial 09 in the Andy Field tutorial we installed during Week 1, and the WRS2::yuend() function; an optional sketch of that function appears after the effect size output below).
  • The output will give you a V statistic and a p-value: the probability of a signed rank statistic at least as extreme as the one observed if the null hypothesis is true.
  • What about an effect size measure for this kind of comparison? The Field textbook (section 7.5.5) makes the same recommendation as for the independent samples comparison: (r = z/sqrt(N)). Just like in Step 6, we have to compute this with a few lines of code (see solution). You should find that it is very close to (though not exactly the same as) what you got in the SPSS activity, due to a correction that the wilcox.test() function applies (you can change this by specifying correct = FALSE when you call the function).
# paired samples Wilcoxon signed rank test
Wtest <- ratings_rm_tib |> 
  dplyr::arrange(pnum_rm) |>
  wilcox.test(formula = Hire_Rating ~ condition_factor, data = _, 
              paired = TRUE, alternative = "two.sided")
## Warning in wilcox.test.default(x = DATA[[1L]], y = DATA[[2L]], ...): cannot
## compute exact p-value with ties
WtestZ <- abs(qnorm(Wtest$p.value/2))
WtestN <- nrow(ratings_rm_tib |> drop_na(Hire_Rating, condition_factor, pnum_rm))
WtestR <- WtestZ/sqrt(WtestN)
## the line below puts the Z and R values into a formatted output
cat(sprintf("\nEffect size for signed rank comparison:\nZ = %.3f, r = %.3f", WtestZ, WtestR))
## 
## Effect size for signed rank comparison:
## Z = 3.812, r = 0.635
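
If you want to try the robust paired alternative mentioned above, here is a minimal sketch using WRS2::yuend() (it assumes the WRS2 package is installed and that ratings_rm_tib is sorted by pnum_rm so the two vectors line up by recruiter):

# robust paired comparison of 20% trimmed means (Yuen's method for dependent samples)
ratings_rm_tib <- ratings_rm_tib |> dplyr::arrange(pnum_rm)
WRS2::yuend(
  x = ratings_rm_tib$Hire_Rating[ratings_rm_tib$condition_factor == "audio"],
  y = ratings_rm_tib$Hire_Rating[ratings_rm_tib$condition_factor == "text"],
  tr = 0.2)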

 

Wrapping up

Some things to think about and explore:
- How would you depict the findings in the between-subjects design example? In the within-subjects design example?
- Check out Figure 7 to see how Schroeder & Epley (2015) depicted the between-subjects data in their paper.
- For an example of plotting data from a within-subjects design, check out Figure 2a in Mehr, Song, & Spelke (2016).

That’s all for this activity!


References