last edited 2024-02-09


Goals for today


Step 1 - Get organized


Step 2 - Import data and check it out

#first import the data
ratings_tib <- readr::read_csv("data/Schroeder_Epley_2015_Study_4_abbrev.csv", 
                               na = "NA")
# now make sure the columns we want as factors are treated that way, using forcats::as_factor()
ratings_tib <- ratings_tib |> dplyr::mutate(
  condition_factor = forcats::as_factor(CONDITION)
) |> dplyr::mutate( #let's give names to the levels by using the fct_recode() function
  condition_factor = forcats::fct_recode(condition_factor, text = "0", audio = "1")
) #after this you should see "text" and "audio" in the condition_factor column of ratings_tib
# use code below to re-order the levels of condition
# ratings_tib <- ratings_tib |> mutate(
#   condition_factor = forcats::fct_relevel(condition_factor,"audio","text")
#   )
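# optional check that the recode worked as expected (a quick sketch): cross-tabulate the
# original numeric column against the new factor - each CONDITION value should map onto
# exactly one level of condition_factor
ratings_tib |> dplyr::count(CONDITION, condition_factor)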

 


Step 3 - Examine group means and distributions

Two group decision chart

Above is the decision process chart from the book.

  • Following the chart, we should start with box plots and histograms to check for unusual cases, non-normality, and possible differences in variance (violation of homogeneity of variance) between groups. Note that we used the term “homoscedasticity” when talking about regression with continuous explanatory variables - it’s the same idea as homogeneity of variance. Let’s also make a table of mean, sd, and #cases by group. See if you can use what you learned in previous activities to make:
    1. box plot (group on x-axis)
    2. histogram (one for each group, or color-code groups on a single histogram)
    3. Q-Q plot (one plot for each group)
    4. a table of mean, sd, and #cases by group

To jog your memory, here is a link to the “import and examine” activity where we made descriptive tables and distribution plots. In that earlier activity you can go to Step 3.1 for tables, Step 3.2 for plots, and the code in the last “mini-challenge” section has an example of how to add a grouping variable to tables and plots.

#box plot  
p1 <- ratings_tib |> 
  ggplot(aes(x = condition_factor, y = Hire_Rating)) + 
    geom_boxplot() +
    theme_classic() + labs(title="Hire Rating box plot by group", y = "Hire Rating", x = "Group")  
#histogram
p2 <- ratings_tib |> 
  ggplot( aes(x = Hire_Rating, fill = condition_factor)) + 
    geom_histogram(position = "identity", alpha = .5, binwidth = 1) + 
    theme_classic() + labs (title = "Hire Rating Distribution by Group")

#Extra: q-q plots for each group (using facets - see https://ggplot2-book.org/facet.html)
p3 <- ratings_tib |>
  ggplot(aes(sample = Hire_Rating)) + geom_qq() + geom_qq_line() + theme_classic() + 
    facet_wrap(~ condition_factor) + 
    labs (title = "Q-Q of Hire Rating")

#table
ratings_tib |> group_by(condition_factor) |>  
  dplyr::summarise(
    median =  median(Hire_Rating),
    mean =  mean(Hire_Rating),
    sd = sd(Hire_Rating),
    cases = n() - sum(is.na(Hire_Rating))
  ) |> 
    knitr::kable(caption = "Hire Rating Descriptives by Group", digits = 3) |> 
    kableExtra::kable_styling(full_width = FALSE)

p1; p2; p3
Hire Rating Descriptives by Group

condition_factor   median    mean      sd   cases
text                    2   2.889   2.055      18
audio                   5   4.714   2.261      21

 


Step 4 - Compare means with an independent samples t-test

# independent samples t-test (equal variance not assumed)
t.test(formula = Hire_Rating ~ condition_factor, data = ratings_tib, 
                 alternative = "two.sided", var.equal = FALSE, paired = FALSE)
# or equivalently, using x and y vectors instead of specifying with a formula:
# t.test(x = subset(ratings_tib$Hire_Rating,ratings_tib$condition_factor=="text"), 
#        y = subset(ratings_tib$Hire_Rating,ratings_tib$condition_factor=="audio"),
#        alternative = "two.sided",
#        var.equal = FALSE, paired = FALSE)
## 
##  Welch Two Sample t-test
## 
## data:  Hire_Rating by condition_factor
## t = -2.6399, df = 36.856, p-value = 0.01208
## alternative hypothesis: true difference in means between group text and group audio is not equal to 0
## 95 percent confidence interval:
##  -3.2265958 -0.4241979
## sample estimates:
##  mean in group text mean in group audio 
##            2.888889            4.714286

 

Looking at the t-test summary:

  1. Welch Two Sample t-test is the name for an independent samples t-test that allows for unequal variances in each group (equivalent to the “Equal variances not assumed” row of the SPSS independent samples t-test output)

  2. The t-statistic is the ratio of the mean difference to its estimated standard error (the standard error of the difference is computed by combining the standard error of each group mean; the formula and degrees of freedom are modified to account for unequal variances - see Chapter 10 of the Field textbook), and the p-value tells you the probability of a t-statistic this far from zero under the null hypothesis (notice the degrees of freedom is not an integer - this is due to the modification allowing for unequal variance). An optional hand computation of the Welch t and df appears after this list.

  3. The sign of the t-statistic (positive or negative) is determined by the order of levels of the IV: first level (“text”) minus second level (“audio”). If you wanted to reverse that, you could re-order the levels like this: ratings_tib <- ratings_tib |> dplyr::mutate(condition_factor = forcats::fct_relevel(condition_factor,"audio","text")) - go ahead and do that now (it makes most sense to do it in the chunk where you import the data, then re-run all the chunks)

  4. The 95% confidence interval gives an interval around the estimated difference between means: we expect 95% of intervals constructed this way to contain the true difference in population means

  5. The sample means should match what you saw in your table above.
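
If you want to see where the t statistic and degrees of freedom come from, here is an optional hand computation of the Welch statistic from the group summaries. This is just a sketch for checking your understanding (the object names welch_check, se_sq, t_welch, and df_welch are only illustrative); the t.test() output above is the result you would report.

# optional: hand-compute the Welch t and df from group means, variances, and counts
welch_check <- ratings_tib |> 
  dplyr::group_by(condition_factor) |> 
  dplyr::summarise(m = mean(Hire_Rating, na.rm = TRUE),
                   v = var(Hire_Rating, na.rm = TRUE),
                   n = sum(!is.na(Hire_Rating)))
se_sq <- welch_check$v / welch_check$n        # squared standard error of each group mean
t_welch <- (welch_check$m[1] - welch_check$m[2]) / sqrt(sum(se_sq))
df_welch <- sum(se_sq)^2 / sum(se_sq^2 / (welch_check$n - 1))   # Welch-Satterthwaite df
t_welch; df_welch                             # should match the t and df in the output above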


Step 5 - Effect size - independent samples

In general, effect size estimates for two group comparisons are simply the difference between means expressed in standardized units. This effect size measure is referred to as Cohen’s d. As you read in the Lakens article, there are different methods to calculate d. The effectsize package can give us these measures - here we will use the pooled variance version of d (ds in the Lakens article). An optional hand computation of the pooled d appears after the code and output below.

try it now in your Rmd doc

ratings_tib <- ratings_tib  |> mutate(
  condition_factor = forcats::fct_relevel(condition_factor,"audio","text")
  )
# pooled variance cohen's d
effectsize::cohens_d(Hire_Rating ~ condition_factor, data = ratings_tib, pooled_sd = TRUE, 
                     paired = FALSE, alternative = "two.sided") |>     
  knitr::kable(caption = "Effect size Cohen's d (pooled)", digits = 3) |> 
  kableExtra::kable_styling(full_width = FALSE)

# pooled variance Hedges' g
# effectsize::hedges_g(Hire_Rating ~ condition_factor, data = ratings_tib, pooled_sd = TRUE, 
#                      paired = FALSE, alternative = "two.sided") |>     
#   knitr::kable(caption = "Effect size Hedges' g", digits = 3) |> 
#   kableExtra::kable_styling(full_width = FALSE)
Effect size Cohen’s d (pooled)

Cohens_d     CI   CI_low   CI_high
   0.842   0.95    0.178     1.494
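
If you want to see where that number comes from, here is an optional hand computation of the pooled-sd d from the group summaries (a sketch for checking your understanding; the object names d_check and sd_pooled are only illustrative):

# optional: pooled-sd Cohen's d computed by hand from group means, variances, and counts
d_check <- ratings_tib |> 
  dplyr::group_by(condition_factor) |> 
  dplyr::summarise(m = mean(Hire_Rating, na.rm = TRUE),
                   v = var(Hire_Rating, na.rm = TRUE),
                   n = sum(!is.na(Hire_Rating)))
sd_pooled <- sqrt(sum((d_check$n - 1) * d_check$v) / (sum(d_check$n) - 2))
(d_check$m[1] - d_check$m[2]) / sd_pooled    # difference in means, in pooled-sd units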

 

Now, answer the following questions for yourself based on what you’ve done so far

  1. What is the difference between the mean Hire Rating for recruiters who heard audio pitches and the mean for recruiters who read text pitches (in raw units of the Hire_Rating scale)?
  2. What is the estimated effect size (Cohen’s d), and confidence interval for the effect size? Does the confidence interval include zero?
  3. What do you conclude about the effect of audio versus text pitches on hiring ratings based on this sample of data you have examined?

How to report the result

A report of the result of a comparison like this should include:
1. Type of comparison (this can be in the Methods section. e.g. “Means were compared by an independent samples t-test (equal variances not assumed).”)
2. Means and SDs for each group (you might also report a confidence interval for each mean and/or for the difference between means)
3. Test statistic with degrees of freedom and p value
4. Effect size

Following the way the result was reported in the original publication (Schroeder & Epley, 2015) we could report the finding like this:
“Recruiters reported being more likely to hire the candidates when they listened to pitches (M = 4.71, SD = 2.26) than when they read the same pitches (M = 2.89, SD = 2.06), t(36.87) = 2.64, p = .01, 95% CI of the difference = [0.42, 3.23], d = 0.84.”

Extra: Bayes Factor reporting

You may, in some papers, see a “Bayes Factor” reported along with the typical t-test information. In this two group mean comparison case, a Bayes Factor tells us how much more likely the alternative hypothesis is (compared to the null), given the observed difference in means in the sample. You can think of it as odds in favor of the alternative hypothesis. To calculate a Bayes Factor for a t-test, you can use the ttestBF function in the BayesFactor package; the syntax is like the regular t.test function. Here is example code if you want to try it: BayesFactor::ttestBF( formula = Hire_Rating ~ condition_factor, paired = FALSE, data = ratings_tib ) - the Bayes Factor is displayed after the r value (this is an r scale value, not a correlation coefficient; it controls the width of the alternative distribution, and .707 is a standard value).
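
If you want to run it as a chunk, here is a minimal sketch (it assumes the BayesFactor package is installed; the extractBF() line is optional and just pulls out the numeric value, and the name audio_bf is only illustrative):

# Bayes Factor comparing the alternative to the null for the two-group comparison
audio_bf <- BayesFactor::ttestBF(formula = Hire_Rating ~ condition_factor,
                                 paired = FALSE, data = ratings_tib)
audio_bf                               # the Bayes Factor is the value printed after the r scale
# BayesFactor::extractBF(audio_bf)$bf  # optional: extract just the numeric Bayes Factor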


Step 6 - Non-parametric test for independent samples (based on sum of ranks)

Try it now in your R markdown doc.

# independent samples Wilcoxon rank sum test
wilcox.test(formula = Hire_Rating ~ condition_factor, data = ratings_tib,
            paired = FALSE, alternative = "two.sided")
## Warning in wilcox.test.default(x = DATA[[1L]], y = DATA[[2L]], ...): cannot
## compute exact p-value with ties
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  Hire_Rating by condition_factor
## W = 275.5, p-value = 0.01441
## alternative hypothesis: true location shift is not equal to 0
Examine the output
  • You will get a message saying “cannot compute exact p-value with ties” because there are ties when the scores are ranked, so the p-value given is an approximation (and imprecise when N < 50). For this reason you might choose instead to use a robust test such as the Yuen (1974) test of trimmed means (see discovR tutorial 09 in the Andy Field tutorial we installed during Week 1, and the WRS2::yuen() function; an optional sketch of that function appears after the effect size output below).
  • The output will give you a W statistic and a p-value: the probability of a rank sum statistic at least as extreme as the one observed if the null hypothesis is true. This W is equivalent to the Mann-Whitney U that you see in SPSS, but it differs from the SPSS W statistic (Arrgh!) because there are multiple definitions of the W statistic. When you report this test it is not a bad idea to specify in the Methods what software you used.
  • What about an effect size measure for this kind of comparison? The Field textbook (section 7.4.5) recommends calculating an r value with the formula (r = z/sqrt(N)).
  • The wilcox.test() internally computed a z-statistic but didn’t include it in the output. The simplest way to get the z-stat is to work backwards from the p-value using the qnorm() function to get the corresponding z-stat from the normal distribution. We can access the exact (not rounded) p-value by first storing the result of wilcox.test() in a variable called Wtest, and then selecting the p-value like this : Wtest$p.value.
  • Since our p-value is for a two sided test we will use WtestZ <- abs(qnorm(Wtest$p.value/2)) and WtestR <- WtestZ/sqrt(N) (note that there is no pos/neg sign on the effect size computed this way).
  • Check out the code below and test it in your own markdown if you like.
  • You may notice a tiny difference in the z and r values compared to what you got in SPSS, due to a correction that the wilcox.test() function applies (you can change this by specifying correct = FALSE when you call the function)
# independent samples Wilcoxon rank sum test
Wtest <- wilcox.test(formula = Hire_Rating ~ condition_factor, data = ratings_tib,
                     paired = FALSE, alternative = "two.sided")
## Warning in wilcox.test.default(x = DATA[[1L]], y = DATA[[2L]], ...): cannot
## compute exact p-value with ties
WtestZ <- abs(qnorm(Wtest$p.value/2))
WtestN <- nrow(ratings_tib |> drop_na(Hire_Rating,condition_factor))
WtestR <- WtestZ/sqrt(WtestN)
## the line below puts the Z and R values into a formatted output
cat(sprintf("\nEffect size for rank sum comparison:\nZ = %.3f, r = %.3f", WtestZ, WtestR))
## 
## Effect size for rank sum comparison:
## Z = 2.447, r = 0.392
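
If you want to try the robust alternative mentioned above, here is a minimal sketch of the Yuen trimmed-means comparison (it assumes the WRS2 package is installed; tr = 0.2 requests 20% trimming, which is the default):

# robust comparison of 20% trimmed means (Yuen, 1974) - an alternative to the rank sum test
WRS2::yuen(Hire_Rating ~ condition_factor, data = ratings_tib, tr = 0.2)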

 

Reporting

  • When reporting a result of a Wilcoxon Rank Sum test, you should report the medians, test statistic, and p-value, with an explanation in the Methods of what test you used. For example,
    Methods: “The distribution of the dependent variable in each group was non-normal, therefore groups were compared by the Wilcoxon rank sum test (Wilcoxon, 1945) implemented by the wilcox.test function in R, and effect size is reported as r computed from the test z-score (Rosenthal, 1991).” Results: “Recruiters rated candidates more highly when they listened to pitches (Mdn = 5) than when they read the same pitches (Mdn = 2), Wilcoxon rank sum W = 275.5, p = .014, z = 2.447, r = .392.”

Step 7 - Dependent samples (also called paired samples, within-subjects comparison, repeated measures)

Now, let’s imagine a different sample of data, where each recruiter was exposed to both conditions (audio and text) and made the same ratings, so from each recruiter there are two Hire_Rating measures: one in the audio and one in the text condition. This is a within-subjects or repeated measures design. When we analyze this sample we need to account for the fact that measures in each condition are dependent, meaning that the “audio” rating from a recruiter may be related to the “text” rating from the same recruiter.

Re-interpret (reimagine) the data using pnum_rm as the recruiter/participant ID

Hypothetically, imagine that the dataset we have been working with was a within-subjects design, and our participant identifier is stored in the pnum_rm column instead of pnum. There were 3 more recruiters in the “audio” condition than the “text” condition so we have to drop 3 points (these are the rows with NA values for pnum_rm) so that we have 18 pairs of data points. Below, the plot on the left is the same data we’ve been working with (with the 3 points dropped), and the plot on the right has lines connecting the pairs of points.
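
If you would like to try drawing the paired version yourself, here is a minimal sketch (it assumes ggplot2 and tidyr are loaded; geom_line(aes(group = pnum_rm)) is what connects each recruiter's pair of points, and the title and axis labels are only illustrative):

# paired points plot: one line per recruiter connecting their text and audio ratings
ratings_tib |> 
  tidyr::drop_na(pnum_rm) |> 
  ggplot(aes(x = condition_factor, y = Hire_Rating)) + 
    geom_point() + 
    geom_line(aes(group = pnum_rm), alpha = .5) + 
    theme_classic() + 
    labs(title = "Hire Rating by group (paired points connected)", x = "Group", y = "Hire Rating")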

Now let’s use this paired data and conduct a paired samples t-test to compare hire ratings in the audio condition to the text condition.

  • first create a new tibble where the 3 data points with no pnum_rm value are dropped:
    ratings_rm_tib <- ratings_tib |> drop_na(pnum_rm) - you should have 36 cases in your new tibble.
  • use the same t.test() function that you used before, but on the new sample of data, and specify paired = TRUE (the var.equal argument is irrelevant for paired samples)
  • the way t.test() matches pairs of values is by the order they appear in the data (the first “audio” case gets matched with the first “text” case, and so on – if there are not an equal number of each group you will get an error).
  • thus, your data must be sorted so that matched pairs are identified by their order. You can sort the data by pnum_rm first, to put the cases in order (use dplyr::arrange(pnum_rm))
  • the solution below uses two piping steps to drop na values and sort by pnum_rm with dplyr::arrange() - see the code for how to do this.
ratings_rm_tib <- ratings_tib |> drop_na(pnum_rm) |> dplyr::arrange(pnum_rm, condition_factor) 
#paired samples t-test
t.test(formula = Hire_Rating ~ condition_factor, data = ratings_rm_tib, 
         paired = TRUE, alternative = "two.sided")
## 
##  Paired t-test
## 
## data:  Hire_Rating by condition_factor
## t = 15, df = 17, p-value = 3.096e-11
## alternative hypothesis: true mean difference is not equal to 0
## 95 percent confidence interval:
##  2.148364 2.851636
## sample estimates:
## mean difference 
##             2.5

 

Did you get an error message saying you can’t use “paired = TRUE” with the formula method?

This is caused by a change between version 0.8.5 and 0.8.6 of the “effectsize” package. The fix, which will come in 0.8.7 (not available on CRAN when this was written), will be to use a new function called repeated_measures_d(). For now, the solution is to downgrade your “effectsize” package version to 0.8.5. Use this command in the console (install the remotes package if you don’t already have it):
remotes::install_version("effectsize", version = "0.8.5", repos = "http://cran.us.r-project.org")

Why is the paired comparison so different if most of the points are the same?

  • With this design, observations are not independent. Each observation is related to another observation from the same recruiter.
  • So our model (of the difference between conditions) actually has only 18 independent observations because we are using the difference between audio and text hiring ratings for each recruiter (17 degrees of freedom because we estimate one parameter, the mean difference between conditions).
  • A paired samples t-test is exactly the same as conducting a 1 sample t-test on the difference between conditions compared to zero (notice that the output above reports the “mean difference”) - see the check after this list.
  • The t-statistic is still the ratio of the mean difference between conditions divided by estimated standard error, but our estimated standard error is now based on the difference between pairs of values rather than the standard errors for each group
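
To see that equivalence for yourself, here is an optional sketch that computes each recruiter's audio-minus-text difference and runs a one-sample t-test against zero (it assumes tidyr and dplyr are loaded; diff_tib is only an illustrative name):

# the paired t-test is a one-sample t-test on the within-recruiter differences
diff_tib <- ratings_rm_tib |> 
  dplyr::select(pnum_rm, condition_factor, Hire_Rating) |> 
  tidyr::pivot_wider(names_from = condition_factor, values_from = Hire_Rating) |> 
  dplyr::mutate(diff = audio - text)
t.test(diff_tib$diff, mu = 0)   # t, df, and p should match the paired t-test above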

Step 8 - Effect size for dependent samples

try it now in your Rmd doc

#1 - pooled-sd Cohen's d, treating the two conditions as independent groups
effectsize::cohens_d(Hire_Rating ~ condition_factor, data = ratings_rm_tib, 
                     pooled_sd = TRUE, paired = FALSE, alternative = "two.sided") |>     
  knitr::kable(caption = "Effect size Cohen's d (pooled)", digits = 3) |> 
  kableExtra::kable_styling(full_width = FALSE)
#2 - the data must be sorted by pnum_rm, so it doesn't hurt to repeat the arrange() function here
ratings_rm_tib <- ratings_rm_tib |> dplyr::arrange(pnum_rm) 
effectsize::cohens_d(Hire_Rating ~ condition_factor, data = ratings_rm_tib, 
                     pooled_sd = FALSE, paired = TRUE, alternative = "two.sided") |>     
  knitr::kable(caption = "Effect size Cohen's dz (dependent)", digits = 3) |> 
  kableExtra::kable_styling(full_width = FALSE)
Effect size Cohen’s d (pooled)

Cohens_d     CI   CI_low   CI_high
   1.353   0.95    0.617     2.073

Effect size Cohen’s dz (dependent)

Cohens_d     CI   CI_low   CI_high
   3.536   0.95    2.263     4.793

 

Extra: Bayes Factor reporting

  • You can use the same BayesFactor::ttestBF() function to calculate a Bayes Factor, this time with paired = TRUE. The formula = syntax does not work for a paired test, so you will need to use the x = and y = arguments instead of formula = and data = (a sketch is shown below).
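
Here is a minimal sketch of that paired call (it assumes the BayesFactor package is installed and that ratings_rm_tib is sorted by pnum_rm so the two vectors line up by recruiter):

# Bayes Factor for the paired comparison - build one vector per condition, ordered by recruiter
ratings_rm_tib <- ratings_rm_tib |> dplyr::arrange(pnum_rm)
BayesFactor::ttestBF(
  x = ratings_rm_tib$Hire_Rating[ratings_rm_tib$condition_factor == "audio"],
  y = ratings_rm_tib$Hire_Rating[ratings_rm_tib$condition_factor == "text"],
  paired = TRUE)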

Step 9 - Non-parametric test for paired samples - when assumptions are violated

Try it now in your R markdown doc.
Don’t forget to specify ratings_rm_tib as the data, and sort cases using dplyr::arrange() (especially if you are copy-pasting code from Step 6).

# paired samples Wilcoxon signed rank test
ratings_rm_tib |> 
  dplyr::arrange(pnum_rm) |> 
  wilcox.test(formula = Hire_Rating ~ condition_factor, data = _, 
              paired = TRUE, alternative = "two.sided")
## Warning in wilcox.test.default(x = DATA[[1L]], y = DATA[[2L]], ...): cannot
## compute exact p-value with ties
## 
##  Wilcoxon signed rank test with continuity correction
## 
## data:  Hire_Rating by condition_factor
## V = 171, p-value = 0.0001377
## alternative hypothesis: true location shift is not equal to 0

 

Examine the output
  • You will get a message saying “cannot compute exact p-value with ties” because there are ties when some of the scores are ranked, which means the p-value given is an approximation (and imprecise when N<50).
  • For this reason you might choose instead to use a robust test such as the Yuen (1974) test for paired samples (see discovR tutorial 09 in the Andy Field tutorial we installed during Week 1, and the WRS2::yuend() function; an optional sketch of that function appears after the effect size output below).
  • The output will give you a V statistic and a p-value: the probability of a signed rank statistic at least as extreme as the one observed if the null hypothesis is true.
  • What about an effect size measure for this kind of comparison? The Field textbook (section 7.5.5) makes the same recommendation as for the independent samples comparison: (r = z/sqrt(N)). Just like in Step 6, we have to compute this with a few lines of code (see solution). You should find that it is very close to (though not exactly the same as) what you got in the SPSS activity, due to a correction that the wilcox.test() function applies (you can change this by specifying correct = FALSE when you call the function).
# paired samples Wilcoxon signed rank test
Wtest <- ratings_rm_tib |> 
  dplyr::arrange(pnum_rm) |>
  wilcox.test(formula = Hire_Rating ~ condition_factor, data = _, 
              paired = TRUE, alternative = "two.sided")
## Warning in wilcox.test.default(x = DATA[[1L]], y = DATA[[2L]], ...): cannot
## compute exact p-value with ties
WtestZ <- abs(qnorm(Wtest$p.value/2))
WtestN <- nrow(ratings_rm_tib |> drop_na(Hire_Rating, condition_factor, pnum_rm))
WtestR <- WtestZ/sqrt(WtestN)
## the line below puts the Z and R values into a formatted output
cat(sprintf("\nEffect size for signed rank comparison:\nZ = %.3f, r = %.3f", WtestZ, WtestR))
## 
## Effect size for signed rank comparison:
## Z = 3.812, r = 0.635
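
If you want to try the robust paired alternative mentioned above, here is a minimal sketch using WRS2::yuend() (it assumes the WRS2 package is installed and that ratings_rm_tib is sorted by pnum_rm so the two vectors line up by recruiter):

# robust paired comparison of 20% trimmed means (Yuen's method for dependent samples)
ratings_rm_tib <- ratings_rm_tib |> dplyr::arrange(pnum_rm)
WRS2::yuend(
  x = ratings_rm_tib$Hire_Rating[ratings_rm_tib$condition_factor == "audio"],
  y = ratings_rm_tib$Hire_Rating[ratings_rm_tib$condition_factor == "text"],
  tr = 0.2)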

 

Wrapping up

Some things to think about and explore:
- How would you depict the findings in the between-subjects design example? In the within-subjects design example?
- Check out Figure 7 to see how Schroeder & Epley (2015) depicted the between-subjects data in their paper.
- For an example of plotting data from a within-subjects design, check out Figure 2a in Mehr, Song, & Spelke (2016).

That’s all for this activity!


References