updated Sept 2022


Goals for today


Starting off notes

Today we are going to jump straight into plotting with the “ggplot2” package. The ggplot2 package will most likely not seem intuitive at first, but doing examples is a good way to get the hang of it. Andy Field’s “discovr” tutorial #3 is highly recommended for getting more comfortable with plotting in R.
For learning more about ggplot2 you can go through the official documentation (the cheat sheet is a good resource) and examples in the R-Graph-Gallery site. Other good resources are the data visualization chapter in R for Data Science, and Nordmann et al (preprint). “Data visualisation using R, for researchers who don’t use R” https://doi.org/10.31234/osf.io/4huvw

This activity borrows the from the PsyTeachR unit on data visualization developed at the University of Glasgow School of Psychology and Institute of Neuroscience and Psychology.


Step 1 - Get organized

If you haven’t already done so (i.e., you already made the folders for the SPSS activity):
1. make a folder for today’s activity- “data-visual”
- in this “data-visual” folder, create a new Rproj file (File->New Project->Existing)
2. in the “data-visual” folder, make a “data” folder
- Download the data file mentalrotation.csv (right-click, Save As) and place it in the new “data” folder
3. in this “data-visual” folder, make a “r_docs” folder
- start a new R Markdown doc and save it in your r_docs folder
- delete all the example text below the “setup” code chunk (that RStudio automatically creates when you make a new RMarkdown doc)
4. in this “data-visual” folder, make an “images” folder (leave it empty for now)
5. In the R markdown doc you started, edit the “setup” code chunk to include these lines:
knitr::opts_chunk$set(echo = TRUE)
knitr::opts_knit$set(root.dir = rprojroot::find_rstudio_root_file())
library("tidyverse")


Step 2 - Import data and check it out

#first import the data
mrot_tib <- readr::read_csv("data/mentalrotation.csv", na = c("NA","[N/A]"))

# now make sure the columns we want as factors are treated that way, using forcats::as_factor() - we could let "Angle" be a Ratio variable 
mrot_tib <- mrot_tib %>% dplyr::mutate(
  DesiredResponse = forcats::as_factor(DesiredResponse),
  ActualResponse = forcats::as_factor(ActualResponse),
  CorrectResponse = forcats::as_factor(CorrectResponse),
  Sex = forcats::as_factor(Sex),
)

 

#first import the data
mrot_tib <- readr::read_csv("data/mentalrotation.csv", na = c("NA","[N/A]"))

# now make sure the columns we want as nominal vals are treated as nominal, using forcats::as_factor()
mrot_tib <- mrot_tib %>% dplyr::mutate(
  DesiredResponse = forcats::as_factor(DesiredResponse),
  ActualResponse = forcats::as_factor(ActualResponse),
  CorrectResponse = forcats::as_factor(CorrectResponse),
  Sex = forcats::as_factor(Sex),
)

# first group by Participant and store in a new tibble
mrot_bysub <- mrot_tib %>% drop_na(Time) %>% 
  group_by(Participant) %>% 
  dplyr::summarise(
    sub_meanRT = mean(Time)
  ) %>% ungroup()
# then average across Participants and show table
mrot_bysub %>% dplyr::summarise(
  meanRT = mean(sub_meanRT),
  ci.low = ggplot2::mean_cl_normal(sub_meanRT)$ymin,
  ci.upp = ggplot2::mean_cl_normal(sub_meanRT)$ymax,
  median =  median(sub_meanRT),
  sd = sd(sub_meanRT),
  cases = n() - sum(is.na(sub_meanRT))
) %>% ungroup() %>% 
    knitr::kable(caption = "Response Time Descriptives", 
                 digits = 3) %>% 
    kableExtra::kable_styling(full_width = FALSE)

# now visualize the distribution
p2a <- mrot_bysub %>% drop_na(sub_meanRT) %>%
    ggplot( aes(x=sub_meanRT)) + geom_histogram(binwidth=200) + theme_classic() +
        labs (title = "Response Time distribution")
p2b <- mrot_bysub %>% drop_na(sub_meanRT) %>%
    ggplot( aes(y=sub_meanRT)) + geom_boxplot() + theme_classic() + 
        labs (title = "Response Time box plot")
p2c <- mrot_bysub %>% drop_na(sub_meanRT) %>%
    ggplot( aes(sample=sub_meanRT)) + geom_qq() + geom_qq_line() + theme_classic() +
        labs (title = "Response Time Q-Q")
p2a; p2b; p2c
mrot_bysub %>% {shapiro.test(.$sub_meanRT)}
Response Time Descriptives
meanRT ci.low ci.upp median sd cases
2734.102 2544.947 2923.256 2618.95 693.007 54
## 
##  Shapiro-Wilk normality test
## 
## data:  .$sub_meanRT
## W = 0.97293, p-value = 0.2588


Step 3 - make a bar plot of response time means by Angle condition

# A. first group data by Participant and Angle and store in a new tibble
mrot_bysub <- mrot_tib %>% filter(CorrectResponse=="Correct") %>% 
  group_by(Participant, Angle) %>% 
  dplyr::summarise(
    sub_meanRT = mean(Time)
  ) %>% ungroup()
# # ----uncomment this block to inspect Participant-level means
# mrot_bysub %>% head(8) %>% 
#   knitr::kable(caption = "Participant-level Mean RT (correct trials)", digits = 3) %>% kableExtra::kable_styling(full_width = FALSE)
# # -----------

# B. then group that data by Angle and average across Participants
mrot_summary <- mrot_bysub %>% 
  group_by(Angle) %>% 
  dplyr::summarise(
    meanRT = mean(sub_meanRT),
    ci.low = ggplot2::mean_cl_normal(sub_meanRT)$ymin,
    ci.upp = ggplot2::mean_cl_normal(sub_meanRT)$ymax,
  ) %>% ungroup()

# and show the group-level data in a table
mrot_summary %>% 
  knitr::kable(caption = "Group-level Mean RT", digits = 3) %>% 
  kableExtra::kable_styling(full_width = FALSE)

# C. now we can make the plot
p3a <- mrot_summary %>% 
  ggplot(aes(x = Angle, y = meanRT)) + 
    geom_bar(stat = "identity") +
    coord_cartesian(ylim = c(0, 4500)) + 
    theme_classic() + labs(title="Correct trial response time means", y = "mean response time (ms)", x = "angle of disparity (degrees)")
p3a
Group-level Mean RT
Angle meanRT ci.low ci.upp
0 1931.013 1777.462 2084.565
50 2518.084 2312.799 2723.369
100 3105.220 2877.041 3333.400
150 3293.210 3075.000 3511.421

 

A few key elements in the code:
  1. First we compute participant means for each Angle condition (part A). The filter(CorrectResponse=="Correct") makes sure we only use trials where the response was correct. Then we take the mean across subjects to get group level means for each Angle condition (part B). In general, it is a good idea to use %>% ungroup() to remove hidden grouping information that R stores when you use group_by() and summarise(), or you may get unexpected results if you use the data object in other functions.

  2. the plotting command is all in the block of code (part C) containing the ggplot() function - mrot_summary %>% sends the summary stats to the ggplot() functions.

  3. ggplot(aes(x = Angle, y = meanRT)) sets the “Angle” column of the data to the x-axis and the “meanRT” column to the y-axis. These are called “aesthetics” (hence the aes() function) in ggplot terminology.

  4. After the main ggplot() function we add “layers”. The main layer is the bars that are created by geom_bar(). We specify stat = "identity" because the statistic we want to plot is already calculated in the mrot_summary tibble the we sent as input to ggplot(). If instead we sent a tibble of participant-wise means, we could specify stat = "summary", fun = "mean" to have ggplot compute the group means.

  5. coord_cartesian(ylim = c(0, 4500)) controls the range of the y-axis (we could add xlim = if we wanted to). labs() controls the labels. theme_classic is one of many options for themes that specify color, font, positioning, etc.


Error bars

  • Okay, so we can see the means, but there’s no information at all about variability, so let’s at least put error bars on the bar plot.
  • Let’s set them to equal to the 95% confidence interval around the mean. The solution above already calculated ci.low and ci.upp, so just add a layer to the ggplot() code including this geom_errorbar(aes(ymin=ci.low, ymax=ci.upp), width=10) - the ymin/ymax are values for the bars, and the width is in units of the x-axis (which is 50 units between levels because “Angle” is treated as an interval measure rather than as a factor).
# add ymin and ymax to the aesthetics list, which are needed for geom_errorbar
p3b <- mrot_summary %>% 
  ggplot(aes(x = Angle, y = meanRT, ymin=ci.low, ymax=ci.upp)) + 
    geom_bar(stat = "identity") +
    geom_errorbar(width=10) +
    coord_cartesian(ylim = c(0, 4500)) + 
    theme_classic() + labs(title="Correct trial response time means w/ 95% CI", y = "mean response time (ms)", x = "angle of disparity (degrees)")
p3b


Step 3.1 - Line Plot

  • In some cases you might prefer a point and line plot instead of bars, though it provides the same amount of information. Let’s see how this data looks as a line plot instead.
  • Instead of geom_bar use geom_point to get a point at each mean.
  • Now add another layer with geom_line(aes(group=1),stat = "identity") to draw lines connecting the points (the group=1 part specifies there is only one group of lines).
p3c <- mrot_summary %>% 
  ggplot(aes(x = Angle, y = meanRT, ymin=ci.low, ymax=ci.upp)) + 
    geom_point(stat = "identity") +
    geom_line(aes(group=1),stat = "identity") +
    geom_errorbar(width=10) +
    coord_cartesian(ylim = c(0, 4500)) + 
    theme_classic() + labs(title="Correct trial response time means w/ 95% CI", y = "mean response time (ms)", x = "angle of disparity (degrees)")
p3c


Step 4 - group the bar plot by “DesiredResponse”

# regroup data to include additional factor
mrot_bysub <- mrot_tib %>% filter(CorrectResponse=="Correct") %>%
  group_by(Participant, Angle, DesiredResponse) %>% 
  dplyr::summarise(
    sub_meanRT = mean(Time)
  ) %>% ungroup()
mrot_summary <- mrot_bysub %>% 
  group_by(Angle,DesiredResponse) %>% 
  dplyr::summarise(
    meanRT = mean(sub_meanRT),
    ci.low = ggplot2::mean_cl_normal(sub_meanRT)$ymin,
    ci.upp = ggplot2::mean_cl_normal(sub_meanRT)$ymax,
  ) %>% ungroup()
# add fill aesthetic as DesiredResponse
# use position_dodge to move the bars and errorbars horizontally so 
# they aren't on top of each other. A dodge value of .9*(distance between levels) is generally a good dodge value. Since "Angle" is stored as a numeric type variable, the distance between levels is 50, so we will use a dodge value of 45. But (as you will see later) if we had stored "Angle" as a factor type variable, then the distance between levels would be 1, and we would use a dodge value of .9
p4a <- mrot_summary %>% 
  ggplot(aes(x = Angle, y = meanRT, ymin=ci.low, ymax=ci.upp, fill=DesiredResponse)) + 
    geom_bar(stat = "identity", position=position_dodge(45)) +
    geom_errorbar(width=10, position=position_dodge(45)) +
    coord_cartesian(ylim = c(0, 4500)) + 
    theme_classic() + labs(title="Correct Trial response time means grouped by DesiredResponse", y = "mean response time (ms)", x = "angle of disparity (degrees)")
p4a

 

Step 4.1 Line Plot, grouped by “DesiredResponse”

  • Cool. Now that you’ve figured that out, try the same thing but as a line plot.

  • Instead of fill, use the color aesthetic.

  • Use a smaller dodge value (or no dodge value at all) now that you don’t have thick bars. If you specify a dodge value, you’ll need to do so for each of the three geom elements (geom_point, geom_line, and geom_errorbar)

mrot_bysub <- mrot_tib %>% drop_na(Time) %>%
  group_by(Participant, Angle, DesiredResponse) %>% 
  dplyr::summarise(
    sub_meanRT = mean(Time)
  ) %>% ungroup()
mrot_summary <- mrot_bysub %>% 
  group_by(Angle,DesiredResponse) %>% 
  dplyr::summarise(
    meanRT = mean(sub_meanRT),
    ci.low = ggplot2::mean_cl_normal(sub_meanRT)$ymin,
    ci.upp = ggplot2::mean_cl_normal(sub_meanRT)$ymax,
  ) %>% ungroup()
p4b <- mrot_summary %>% 
  ggplot(aes(x = Angle, y = meanRT, ymin=ci.low, ymax=ci.upp, color=DesiredResponse)) + 
    geom_point(stat = "identity", position=position_dodge(2)) +
    geom_line(aes(group=DesiredResponse), position=position_dodge(2)) +
    geom_errorbar(width=10, position=position_dodge(2)) +
    coord_cartesian(ylim = c(0, 4500)) + 
    theme_classic() + labs(title="Correct trial response time grouped by DesiredResponse", y = "mean response time (ms)", x = "angle of disparity (degrees)")
p4b


Step 5 - Visualize a relation between two variables

mrot_bysub <- mrot_tib %>% drop_na() %>% 
  group_by(Participant) %>% 
  dplyr::summarise(
    sub_meanRT = mean(Time),
    sub_accuracy = sum(CorrectResponse=="Correct")/n()
  ) %>% ungroup()

 

Scatter plot of response time by accuracy

  • now we can specify the x and y axes, and use geom_point() to plot a point for each subject with accuracy on the x-axis and response time on the y-axis
  • we can add a regression line by adding a layer with geom_smooth(method = "lm", formula = "y ~ x") (the intercept is automatically included). If you want to fit a quadratic line use formula = "y ~ x + I(x^2)" or formula = "y ~ poly(x,2)" instead.
  • the shaded area shows the 95% confidence band around the regression line, you can turn it off with se = FALSE
p5a <- mrot_bysub %>% 
  ggplot(aes(x = sub_accuracy, y = sub_meanRT)) + 
    xlim(.6,1) + # this controls the range of data values for the geom_smooth line
    geom_point(stat="identity") +
    geom_smooth(method = "lm", formula = "y ~ x", se = TRUE, fullrange=TRUE) + #intercept automatically included
    coord_cartesian(ylim = c(0, 6000)) + 
    theme_classic() + labs(title="scatter", y = "mean response time (ms)", x = "accuracy")
p5a


Scatter plot of response time by accuracy, grouped by DesiredResponse

  • What if we wanted to split up subject RT and accuracy by DesiredResponse (whether the shapes were same or different)?
  • just (a) recompute the subject-level means, grouping by DesiredResponse, and (b) add a color mapping aesthetic color=DesiredResponse in the aes() part of the ggplot code
mrot_bysub <- mrot_tib %>% drop_na() %>% 
  group_by(Participant, DesiredResponse) %>% 
  dplyr::summarise(
    sub_meanRT = mean(Time),
    sub_accuracy = sum(CorrectResponse=="Correct")/n()
  ) %>% ungroup()
## `summarise()` has grouped output by 'Participant'. You can override using the
## `.groups` argument.
p5b <- mrot_bysub %>% 
  ggplot(aes(x = sub_accuracy, y = sub_meanRT, color = DesiredResponse)) + 
    geom_point(stat="identity") +
    geom_smooth(method = "lm", formula = "y ~ x", se = FALSE, fullrange=TRUE) + #intercept automatically included
    coord_cartesian(ylim = c(0, 6000), xlim = c(.6,1)) + 
    xlim(.6,1) + # this controls the range of data values for the geom_smooth line
    theme_classic() + labs(title="scatter grouped by DesiredResponse", y = "mean response time (ms)", x = "accuracy")
p5b


Step 6. Alternatives to visualise variability

Here are some alternative approaches to visualizing variability. Take a look at each and think about possible uses. Try making the code work in your markdown doc if you are interested.

1. Use a box plot instead of plotting means:

  • a box plot shows:
    • the median (line)
    • a box spanning the interquartile range (IQR=25th to 75th percentile)
    • “whiskers” extending 1.5*(IQR) past the box edges (specifically, to the most extreme data points that fall within 1.5*IQR)
    • points for individual extreme values beyond the whiskers
  • notice that we have to re-specify the x-axis (Angle) variable as a factor to get geom_boxplot to work split the data by Angle mutate(Angle=forcats::as_factor(Angle)) – this changes the units of the x-axis (previously 50 units between points, now 1 unit between points), which affects values such as the position_dodge value (was 45 in the bar chart example, now it is .9)
  • notice that we pass in subject level means, instead of group means
  • also notice that we have to expand the ylim range to see all outliers
# regroup data 
mrot_bysub <- mrot_tib %>% mutate(Angle=forcats::as_factor(Angle)) %>%
  drop_na(Time) %>% 
  group_by(Participant, Angle, DesiredResponse) %>% 
  dplyr::summarise(
    sub_meanRT = mean(Time)
  ) %>% ungroup()

p6a <- mrot_bysub %>% 
  ggplot(aes(x = Angle, y = sub_meanRT, fill=DesiredResponse)) + 
    geom_boxplot(position=position_dodge(.9)) +
    coord_cartesian(ylim = c(0, 6000)) + 
    theme_classic() + labs(title="Response time grouped box plot", y = "mean response time (ms)", x = "angle of disparity (degrees)")
p6a


2. Use a violin plot layer under plotted means

  • a violin plot shows:
    • a (smoothed) probability density across the full range (width=probability of a given value)
    • you can add anything you like on top, here we add means and CI
  • again, we pass in subject level means for geom_violin
  • again, we use an expanded ylim range
  • for geom_point() and geom_errorbar this time we specify the group summarized means and CI
# group data 
mrot_bysub <- mrot_tib %>% mutate(Angle=forcats::as_factor(Angle)) %>%
  drop_na(Time) %>% 
  group_by(Participant, Angle, DesiredResponse) %>% 
  dplyr::summarise(
    sub_meanRT = mean(Time)
  ) %>% ungroup()

mrot_summary <- mrot_bysub %>% 
  group_by(Angle,DesiredResponse) %>% 
  dplyr::summarise(
    meanRT = mean(sub_meanRT),
    ci.low = ggplot2::mean_cl_normal(sub_meanRT)$ymin,
    ci.upp = ggplot2::mean_cl_normal(sub_meanRT)$ymax,
  ) %>% ungroup()
p6b <- mrot_bysub %>% 
  ggplot(aes(x = Angle, y = sub_meanRT, fill=DesiredResponse)) + 
    geom_violin(position = position_dodge(.9)) +
    geom_point(data=mrot_summary, aes(x = Angle, y = meanRT), 
               stat="identity", position = position_dodge(.9)) +
    geom_errorbar(data=mrot_summary, 
                  aes(x = Angle, y = meanRT, ymin=ci.low, ymax=ci.upp),
                  stat="identity", width=.2, position_dodge(.9)) +
    coord_cartesian(ylim = c(0, 6000)) + 
    theme_classic() + labs(title="Response time grouped violin plot w/ mean and CI", y = "mean response time (ms)", x = "angle of disparity (degrees)")
p6b


3. Show all the points: Use a 1D scatter with mean and error bars on top

  • sometimes it makes sense to show all individual points- here we use a 1D scatter (often called a stripplot), randomly jittered to make the points more visible
  • again, we pass in subject level means
  • again, we use an expanded ylim range
  • again, we specify the x-axis (Angle) variable as a factor
  • use alpha = (with a value from 0 to 1) to control transparency of any element (used here to make the points less prominent- notice that it changes the points in the legend too, there’s a fix for that in the next example).
  • we do the mean and error bars just like in the previous example using group-summarized data
# regroup data 
mrot_bysub <- mrot_tib %>% mutate(Angle=forcats::as_factor(Angle)) %>%
  drop_na(Time) %>% 
  group_by(Participant, Angle, DesiredResponse) %>% 
  dplyr::summarise(
    sub_meanRT = mean(Time)
  ) %>% ungroup()

p6c <- mrot_bysub %>% 
  ggplot(aes(x = Angle, y = sub_meanRT, color=DesiredResponse,
             shape=DesiredResponse)) + 
    geom_jitter(position=position_jitterdodge(.75), alpha = .5) +
    geom_point(data=mrot_summary, aes(x = Angle, y = meanRT),color="black",
               stat="identity", position = position_dodge(.75), 
               show.legend = FALSE) +
    geom_errorbar(data=mrot_summary, 
                  aes(x = Angle, y = meanRT, ymin=ci.low, ymax=ci.upp),
                  color="black", stat="identity", width=.2, 
                  position_dodge(.75), show.legend = FALSE) +
    coord_cartesian(ylim = c(0, 6000)) + 
    theme_classic() + 
    labs(title="grouped 1D scatter plot of Response time", 
         y = "mean response time (ms)", 
         x = "angle of disparity (degrees)")
p6c

 

3b. Too cluttered? Try these variations

Sina Plot
  • requires the “ggforce” library
  • geom_sina groups points to approximate the distribution, so the individual points make the shape of a violin plot
  • again, we specify the x-axis (Angle) variable as a factor
    To make it easier to look at, adjust any of the following:
  • the “alpha” property controls transparency
  • “size” controls point size
  • the guides() layer controls the legend
  • the position_dodge() value controls spacing of the groups
  • Color:
    • scale_color_manual()(for points and lines)
    • scale_fill_manual() (for inside of shapes)
    • adjust font sizes for all elements easily by setting base_size in the theme layer (default is 11 point)
    • this example uses a color palette from the “wesanderson” package (requires that you install.packages("wesanderson"))
library(ggforce)
# regroup data 
mrot_bysub <- mrot_tib %>% mutate(Angle=forcats::as_factor(Angle)) %>%
  drop_na(Time) %>% 
  group_by(Participant, Angle, DesiredResponse) %>% 
  dplyr::summarise(
    sub_meanRT = mean(Time)
  ) %>% ungroup()

p6d <- mrot_bysub %>% 
  ggplot(aes(x = Angle, y = sub_meanRT, color=DesiredResponse,
             shape=DesiredResponse)) + 
    geom_sina(position=position_dodge(.6),alpha=.25, size = 1) +
    geom_point(data=mrot_summary, aes(x = Angle, y = meanRT,
                                     color=DesiredResponse),
              position=position_dodge(.6), show.legend = FALSE) +
    geom_errorbar(data=mrot_summary, aes(x = Angle, y = meanRT,
                                         ymin=ci.low, ymax=ci.upp,
                                         color=DesiredResponse),
                  stat="identity", width=.4, alpha = 1,
                  position_dodge(.6), show.legend = FALSE) +
    coord_cartesian(ylim = c(0, 6000)) + 
    theme_classic(base_size = 14) + 
    labs(title="grouped sina plot w/ means and CIs", 
         y = "mean response time (ms)", 
         x = "angle of disparity (degrees)") +
    guides(color = guide_legend(override.aes = list(alpha=1, size =2))) +
    scale_color_manual(values = wesanderson::wes_palette("Darjeeling1", n = 2))

p6d

 


4. Still want more? Check out pirateplot() in the “yarrr” package:

  • bar shows the mean
  • box around the bar shows conf interval
  • violin shows probability density
  • error bars show IQR
  • points show raw data
# regroup data 
mrot_bysub <- mrot_tib %>% mutate(Angle=forcats::as_factor(Angle)) %>% drop_na(Time) %>% 
  group_by(Participant, Angle, DesiredResponse) %>% 
  dplyr::summarise(
    sub_meanRT = mean(Time)
  ) %>% ungroup()
p6d <- mrot_bysub %>% 
  yarrr::pirateplot(data = ., sub_meanRT ~ Angle + DesiredResponse,
                    ylab = "Response Time(ms)", 
                    quant = c(.25, .75),
                    inf.method = 'ci')
p6d
## NULL

 

 


Last Step - Saving your plots as image files

# notice that the width/height parameters will change the size of the image but the size of components in the chart (like point, text, etc.) will stay the same
# you will get an error with this code if the folder "images" does not exist in your project folder
ggsave(filename = "images/scatter1D.png", plot = p6c, device = "png", width = 6, height = 4, units = "in", dpi = 300)

References