updated Sept 2022
Learn basic chart types using ggplot2 that are appropriate for common purposes
Learn strategies for visualizing variability
By going through examples, develop an beginner’s understanding of the “grammar of graphics” (Wickham, 2007) used by ggplot2 (geom, coordinate system, aesthetic component mapping, statistical function, layers, …)
Today we are going to jump straight into plotting with the “ggplot2”
package. The ggplot2 package will most likely not seem intuitive at
first, but doing examples is a good way to get the hang of it. Andy
Field’s “discovr” tutorial #3 is highly recommended for getting more
comfortable with plotting in R.
For learning more about ggplot2 you can go through the official documentation (the cheat
sheet is a good resource) and examples in
the R-Graph-Gallery site. Other good resources are the data
visualization chapter in R for Data
Science, and Nordmann et al (preprint).
“Data visualisation using R, for researchers who don’t use R”
https://doi.org/10.31234/osf.io/4huvw
This activity borrows the from the PsyTeachR unit on data visualization developed at the University of Glasgow School of Psychology and Institute of Neuroscience and Psychology.
If you haven’t already done so (i.e., you already made the folders
for the SPSS activity):
1. make a folder for today’s activity- “data-visual”
- in this “data-visual” folder, create a new Rproj file
(File->New Project->Existing)
2. in the “data-visual” folder, make a “data” folder
- Download the data file mentalrotation.csv
(right-click, Save As) and place it in the new “data” folder
3. in this “data-visual” folder, make a “r_docs” folder
- start a new R Markdown doc and save it in your r_docs folder
- delete all the example text below the “setup” code chunk (that RStudio
automatically creates when you make a new RMarkdown doc)
4. in this “data-visual” folder, make an “images” folder (leave it empty
for now)
5. In the R markdown doc you started, edit the “setup” code chunk to
include these lines:
knitr::opts_chunk$set(echo = TRUE)
knitr::opts_knit$set(root.dir = rprojroot::find_rstudio_root_file())
library("tidyverse")
DesiredResponse
, ActualResponse
,
CorrectResponse
, and Sex
as factors, using
forcats::as_factor()
(by default they are set to type
character
. Although all our code today will still work the
same with these variables as character
, using the
factor
datatype is an easy way to check the levels for each
of your variables, and can help you avoid some unwanted errors (e.g.,
it’s easy to see some hard to find errors, like the values “Correct” and
“correct” getting treated as different levels - you would be able to see
this easily for any factor
column by looking in the
“Environment” tab in the top right pane). take a look at the
solution to see how to use forcats::as_factor()
#first import the data
mrot_tib <- readr::read_csv("data/mentalrotation.csv", na = c("NA","[N/A]"))
# now make sure the columns we want as factors are treated that way, using forcats::as_factor() - we could let "Angle" be a Ratio variable
mrot_tib <- mrot_tib %>% dplyr::mutate(
DesiredResponse = forcats::as_factor(DesiredResponse),
ActualResponse = forcats::as_factor(ActualResponse),
CorrectResponse = forcats::as_factor(CorrectResponse),
Sex = forcats::as_factor(Sex),
)
#first import the data
mrot_tib <- readr::read_csv("data/mentalrotation.csv", na = c("NA","[N/A]"))
# now make sure the columns we want as nominal vals are treated as nominal, using forcats::as_factor()
mrot_tib <- mrot_tib %>% dplyr::mutate(
DesiredResponse = forcats::as_factor(DesiredResponse),
ActualResponse = forcats::as_factor(ActualResponse),
CorrectResponse = forcats::as_factor(CorrectResponse),
Sex = forcats::as_factor(Sex),
)
# first group by Participant and store in a new tibble
mrot_bysub <- mrot_tib %>% drop_na(Time) %>%
group_by(Participant) %>%
dplyr::summarise(
sub_meanRT = mean(Time)
) %>% ungroup()
# then average across Participants and show table
mrot_bysub %>% dplyr::summarise(
meanRT = mean(sub_meanRT),
ci.low = ggplot2::mean_cl_normal(sub_meanRT)$ymin,
ci.upp = ggplot2::mean_cl_normal(sub_meanRT)$ymax,
median = median(sub_meanRT),
sd = sd(sub_meanRT),
cases = n() - sum(is.na(sub_meanRT))
) %>% ungroup() %>%
knitr::kable(caption = "Response Time Descriptives",
digits = 3) %>%
kableExtra::kable_styling(full_width = FALSE)
# now visualize the distribution
p2a <- mrot_bysub %>% drop_na(sub_meanRT) %>%
ggplot( aes(x=sub_meanRT)) + geom_histogram(binwidth=200) + theme_classic() +
labs (title = "Response Time distribution")
p2b <- mrot_bysub %>% drop_na(sub_meanRT) %>%
ggplot( aes(y=sub_meanRT)) + geom_boxplot() + theme_classic() +
labs (title = "Response Time box plot")
p2c <- mrot_bysub %>% drop_na(sub_meanRT) %>%
ggplot( aes(sample=sub_meanRT)) + geom_qq() + geom_qq_line() + theme_classic() +
labs (title = "Response Time Q-Q")
p2a; p2b; p2c
mrot_bysub %>% {shapiro.test(.$sub_meanRT)}
meanRT | ci.low | ci.upp | median | sd | cases |
---|---|---|---|---|---|
2734.102 | 2544.947 | 2923.256 | 2618.95 | 693.007 | 54 |
##
## Shapiro-Wilk normality test
##
## data: .$sub_meanRT
## W = 0.97293, p-value = 0.2588
# A. first group data by Participant and Angle and store in a new tibble
mrot_bysub <- mrot_tib %>% filter(CorrectResponse=="Correct") %>%
group_by(Participant, Angle) %>%
dplyr::summarise(
sub_meanRT = mean(Time)
) %>% ungroup()
# # ----uncomment this block to inspect Participant-level means
# mrot_bysub %>% head(8) %>%
# knitr::kable(caption = "Participant-level Mean RT (correct trials)", digits = 3) %>% kableExtra::kable_styling(full_width = FALSE)
# # -----------
# B. then group that data by Angle and average across Participants
mrot_summary <- mrot_bysub %>%
group_by(Angle) %>%
dplyr::summarise(
meanRT = mean(sub_meanRT),
ci.low = ggplot2::mean_cl_normal(sub_meanRT)$ymin,
ci.upp = ggplot2::mean_cl_normal(sub_meanRT)$ymax,
) %>% ungroup()
# and show the group-level data in a table
mrot_summary %>%
knitr::kable(caption = "Group-level Mean RT", digits = 3) %>%
kableExtra::kable_styling(full_width = FALSE)
# C. now we can make the plot
p3a <- mrot_summary %>%
ggplot(aes(x = Angle, y = meanRT)) +
geom_bar(stat = "identity") +
coord_cartesian(ylim = c(0, 4500)) +
theme_classic() + labs(title="Correct trial response time means", y = "mean response time (ms)", x = "angle of disparity (degrees)")
p3a
Angle | meanRT | ci.low | ci.upp |
---|---|---|---|
0 | 1931.013 | 1777.462 | 2084.565 |
50 | 2518.084 | 2312.799 | 2723.369 |
100 | 3105.220 | 2877.041 | 3333.400 |
150 | 3293.210 | 3075.000 | 3511.421 |
First we compute participant means for each Angle condition (part
A). The filter(CorrectResponse=="Correct")
makes sure we
only use trials where the response was correct. Then we take the mean
across subjects to get group level means for each Angle condition (part
B). In general, it is a good idea to use %>% ungroup()
to remove hidden grouping information that R stores when you use
group_by()
and summarise()
, or you may get
unexpected results if you use the data object in other
functions.
the plotting command is all in the block of code (part C)
containing the ggplot()
function -
mrot_summary %>%
sends the summary stats to the ggplot()
functions.
ggplot(aes(x = Angle, y = meanRT))
sets the “Angle”
column of the data to the x-axis and the “meanRT” column to the y-axis.
These are called “aesthetics” (hence the aes()
function) in
ggplot terminology.
After the main ggplot()
function we add “layers”.
The main layer is the bars that are created by geom_bar()
.
We specify stat = "identity"
because the statistic we want
to plot is already calculated in the mrot_summary
tibble
the we sent as input to ggplot(). If instead we sent a tibble of
participant-wise means, we could specify
stat = "summary", fun = "mean"
to have ggplot compute the
group means.
coord_cartesian(ylim = c(0, 4500))
controls the
range of the y-axis (we could add xlim =
if we wanted to).
labs()
controls the labels. theme_classic
is
one of many options for themes that specify color, font, positioning,
etc.
ci.low
and
ci.upp
, so just add a layer to the ggplot() code including
this geom_errorbar(aes(ymin=ci.low, ymax=ci.upp), width=10)
- the ymin/ymax are values for the bars, and the width is in units of
the x-axis (which is 50 units between levels because “Angle” is treated
as an interval measure rather than as a factor).# add ymin and ymax to the aesthetics list, which are needed for geom_errorbar
p3b <- mrot_summary %>%
ggplot(aes(x = Angle, y = meanRT, ymin=ci.low, ymax=ci.upp)) +
geom_bar(stat = "identity") +
geom_errorbar(width=10) +
coord_cartesian(ylim = c(0, 4500)) +
theme_classic() + labs(title="Correct trial response time means w/ 95% CI", y = "mean response time (ms)", x = "angle of disparity (degrees)")
p3b
geom_bar
use geom_point
to get
a point at each mean.geom_line(aes(group=1),stat = "identity")
to draw lines
connecting the points (the group=1 part specifies there is only one
group of lines).p3c <- mrot_summary %>%
ggplot(aes(x = Angle, y = meanRT, ymin=ci.low, ymax=ci.upp)) +
geom_point(stat = "identity") +
geom_line(aes(group=1),stat = "identity") +
geom_errorbar(width=10) +
coord_cartesian(ylim = c(0, 4500)) +
theme_classic() + labs(title="Correct trial response time means w/ 95% CI", y = "mean response time (ms)", x = "angle of disparity (degrees)")
p3c
# regroup data to include additional factor
mrot_bysub <- mrot_tib %>% filter(CorrectResponse=="Correct") %>%
group_by(Participant, Angle, DesiredResponse) %>%
dplyr::summarise(
sub_meanRT = mean(Time)
) %>% ungroup()
mrot_summary <- mrot_bysub %>%
group_by(Angle,DesiredResponse) %>%
dplyr::summarise(
meanRT = mean(sub_meanRT),
ci.low = ggplot2::mean_cl_normal(sub_meanRT)$ymin,
ci.upp = ggplot2::mean_cl_normal(sub_meanRT)$ymax,
) %>% ungroup()
# add fill aesthetic as DesiredResponse
# use position_dodge to move the bars and errorbars horizontally so
# they aren't on top of each other. A dodge value of .9*(distance between levels) is generally a good dodge value. Since "Angle" is stored as a numeric type variable, the distance between levels is 50, so we will use a dodge value of 45. But (as you will see later) if we had stored "Angle" as a factor type variable, then the distance between levels would be 1, and we would use a dodge value of .9
p4a <- mrot_summary %>%
ggplot(aes(x = Angle, y = meanRT, ymin=ci.low, ymax=ci.upp, fill=DesiredResponse)) +
geom_bar(stat = "identity", position=position_dodge(45)) +
geom_errorbar(width=10, position=position_dodge(45)) +
coord_cartesian(ylim = c(0, 4500)) +
theme_classic() + labs(title="Correct Trial response time means grouped by DesiredResponse", y = "mean response time (ms)", x = "angle of disparity (degrees)")
p4a
Cool. Now that you’ve figured that out, try the same thing but as
a line plot.
Instead of fill
, use the color
aesthetic.
Use a smaller dodge value (or no dodge value at all) now that you
don’t have thick bars. If you specify a dodge value, you’ll need to do
so for each of the three geom elements (geom_point
,
geom_line
, and geom_errorbar
)
mrot_bysub <- mrot_tib %>% drop_na(Time) %>%
group_by(Participant, Angle, DesiredResponse) %>%
dplyr::summarise(
sub_meanRT = mean(Time)
) %>% ungroup()
mrot_summary <- mrot_bysub %>%
group_by(Angle,DesiredResponse) %>%
dplyr::summarise(
meanRT = mean(sub_meanRT),
ci.low = ggplot2::mean_cl_normal(sub_meanRT)$ymin,
ci.upp = ggplot2::mean_cl_normal(sub_meanRT)$ymax,
) %>% ungroup()
p4b <- mrot_summary %>%
ggplot(aes(x = Angle, y = meanRT, ymin=ci.low, ymax=ci.upp, color=DesiredResponse)) +
geom_point(stat = "identity", position=position_dodge(2)) +
geom_line(aes(group=DesiredResponse), position=position_dodge(2)) +
geom_errorbar(width=10, position=position_dodge(2)) +
coord_cartesian(ylim = c(0, 4500)) +
theme_classic() + labs(title="Correct trial response time grouped by DesiredResponse", y = "mean response time (ms)", x = "angle of disparity (degrees)")
p4b
mrot_bysub <- mrot_tib %>% drop_na() %>%
group_by(Participant) %>%
dplyr::summarise(
sub_meanRT = mean(Time),
sub_accuracy = sum(CorrectResponse=="Correct")/n()
) %>% ungroup()
geom_smooth(method = "lm", formula = "y ~ x")
(the
intercept is automatically included). If you want to fit a quadratic
line use formula = "y ~ x + I(x^2)"
or
formula = "y ~ poly(x,2)"
instead.se = FALSE
p5a <- mrot_bysub %>%
ggplot(aes(x = sub_accuracy, y = sub_meanRT)) +
xlim(.6,1) + # this controls the range of data values for the geom_smooth line
geom_point(stat="identity") +
geom_smooth(method = "lm", formula = "y ~ x", se = TRUE, fullrange=TRUE) + #intercept automatically included
coord_cartesian(ylim = c(0, 6000)) +
theme_classic() + labs(title="scatter", y = "mean response time (ms)", x = "accuracy")
p5a
color=DesiredResponse
in the aes() part of the ggplot
codemrot_bysub <- mrot_tib %>% drop_na() %>%
group_by(Participant, DesiredResponse) %>%
dplyr::summarise(
sub_meanRT = mean(Time),
sub_accuracy = sum(CorrectResponse=="Correct")/n()
) %>% ungroup()
## `summarise()` has grouped output by 'Participant'. You can override using the
## `.groups` argument.
p5b <- mrot_bysub %>%
ggplot(aes(x = sub_accuracy, y = sub_meanRT, color = DesiredResponse)) +
geom_point(stat="identity") +
geom_smooth(method = "lm", formula = "y ~ x", se = FALSE, fullrange=TRUE) + #intercept automatically included
coord_cartesian(ylim = c(0, 6000), xlim = c(.6,1)) +
xlim(.6,1) + # this controls the range of data values for the geom_smooth line
theme_classic() + labs(title="scatter grouped by DesiredResponse", y = "mean response time (ms)", x = "accuracy")
p5b
Here are some alternative approaches to visualizing variability. Take a look at each and think about possible uses. Try making the code work in your markdown doc if you are interested.
mutate(Angle=forcats::as_factor(Angle))
– this
changes the units of the x-axis (previously 50 units between points, now
1 unit between points), which affects values such as the position_dodge
value (was 45 in the bar chart example, now it is .9)# regroup data
mrot_bysub <- mrot_tib %>% mutate(Angle=forcats::as_factor(Angle)) %>%
drop_na(Time) %>%
group_by(Participant, Angle, DesiredResponse) %>%
dplyr::summarise(
sub_meanRT = mean(Time)
) %>% ungroup()
p6a <- mrot_bysub %>%
ggplot(aes(x = Angle, y = sub_meanRT, fill=DesiredResponse)) +
geom_boxplot(position=position_dodge(.9)) +
coord_cartesian(ylim = c(0, 6000)) +
theme_classic() + labs(title="Response time grouped box plot", y = "mean response time (ms)", x = "angle of disparity (degrees)")
p6a
geom_point()
and geom_errorbar
this
time we specify the group summarized means and CI# group data
mrot_bysub <- mrot_tib %>% mutate(Angle=forcats::as_factor(Angle)) %>%
drop_na(Time) %>%
group_by(Participant, Angle, DesiredResponse) %>%
dplyr::summarise(
sub_meanRT = mean(Time)
) %>% ungroup()
mrot_summary <- mrot_bysub %>%
group_by(Angle,DesiredResponse) %>%
dplyr::summarise(
meanRT = mean(sub_meanRT),
ci.low = ggplot2::mean_cl_normal(sub_meanRT)$ymin,
ci.upp = ggplot2::mean_cl_normal(sub_meanRT)$ymax,
) %>% ungroup()
p6b <- mrot_bysub %>%
ggplot(aes(x = Angle, y = sub_meanRT, fill=DesiredResponse)) +
geom_violin(position = position_dodge(.9)) +
geom_point(data=mrot_summary, aes(x = Angle, y = meanRT),
stat="identity", position = position_dodge(.9)) +
geom_errorbar(data=mrot_summary,
aes(x = Angle, y = meanRT, ymin=ci.low, ymax=ci.upp),
stat="identity", width=.2, position_dodge(.9)) +
coord_cartesian(ylim = c(0, 6000)) +
theme_classic() + labs(title="Response time grouped violin plot w/ mean and CI", y = "mean response time (ms)", x = "angle of disparity (degrees)")
p6b
alpha =
(with a value from 0 to 1) to control
transparency of any element (used here to make the points less
prominent- notice that it changes the points in the legend too, there’s
a fix for that in the next example).# regroup data
mrot_bysub <- mrot_tib %>% mutate(Angle=forcats::as_factor(Angle)) %>%
drop_na(Time) %>%
group_by(Participant, Angle, DesiredResponse) %>%
dplyr::summarise(
sub_meanRT = mean(Time)
) %>% ungroup()
p6c <- mrot_bysub %>%
ggplot(aes(x = Angle, y = sub_meanRT, color=DesiredResponse,
shape=DesiredResponse)) +
geom_jitter(position=position_jitterdodge(.75), alpha = .5) +
geom_point(data=mrot_summary, aes(x = Angle, y = meanRT),color="black",
stat="identity", position = position_dodge(.75),
show.legend = FALSE) +
geom_errorbar(data=mrot_summary,
aes(x = Angle, y = meanRT, ymin=ci.low, ymax=ci.upp),
color="black", stat="identity", width=.2,
position_dodge(.75), show.legend = FALSE) +
coord_cartesian(ylim = c(0, 6000)) +
theme_classic() +
labs(title="grouped 1D scatter plot of Response time",
y = "mean response time (ms)",
x = "angle of disparity (degrees)")
p6c
geom_sina
groups points to approximate the
distribution, so the individual points make the shape of a violin
plotguides()
layer controls the legendposition_dodge()
value controls spacing of the
groupsscale_color_manual()
(for points and lines)scale_fill_manual()
(for inside of shapes)base_size
in the theme layer (default is 11 point)install.packages("wesanderson")
)
library(ggforce)
# regroup data
mrot_bysub <- mrot_tib %>% mutate(Angle=forcats::as_factor(Angle)) %>%
drop_na(Time) %>%
group_by(Participant, Angle, DesiredResponse) %>%
dplyr::summarise(
sub_meanRT = mean(Time)
) %>% ungroup()
p6d <- mrot_bysub %>%
ggplot(aes(x = Angle, y = sub_meanRT, color=DesiredResponse,
shape=DesiredResponse)) +
geom_sina(position=position_dodge(.6),alpha=.25, size = 1) +
geom_point(data=mrot_summary, aes(x = Angle, y = meanRT,
color=DesiredResponse),
position=position_dodge(.6), show.legend = FALSE) +
geom_errorbar(data=mrot_summary, aes(x = Angle, y = meanRT,
ymin=ci.low, ymax=ci.upp,
color=DesiredResponse),
stat="identity", width=.4, alpha = 1,
position_dodge(.6), show.legend = FALSE) +
coord_cartesian(ylim = c(0, 6000)) +
theme_classic(base_size = 14) +
labs(title="grouped sina plot w/ means and CIs",
y = "mean response time (ms)",
x = "angle of disparity (degrees)") +
guides(color = guide_legend(override.aes = list(alpha=1, size =2))) +
scale_color_manual(values = wesanderson::wes_palette("Darjeeling1", n = 2))
p6d
pirateplot()
in the
“yarrr” package:# regroup data
mrot_bysub <- mrot_tib %>% mutate(Angle=forcats::as_factor(Angle)) %>% drop_na(Time) %>%
group_by(Participant, Angle, DesiredResponse) %>%
dplyr::summarise(
sub_meanRT = mean(Time)
) %>% ungroup()
p6d <- mrot_bysub %>%
yarrr::pirateplot(data = ., sub_meanRT ~ Angle + DesiredResponse,
ylab = "Response Time(ms)",
quant = c(.25, .75),
inf.method = 'ci')
p6d
## NULL
if you want to save your plots, and control the file type, size,
and resolution (e.g., to submit with a publication, or paste into a word
doc) you can use the ggsave()
function (for some file types
you may need to install additional packages)
below is an example of how to save a plot we created above as a *.png file with 300 dpi resolution, 6in X 4in
if you intend to put multiple plots together to make a figure (e.g., in Adobe Illustrator, Inkscape, Powerpoint, …) it is recommended to use the *.svg file type (device = “svg”) to preserve the resolution of the chart components.
# notice that the width/height parameters will change the size of the image but the size of components in the chart (like point, text, etc.) will stay the same
# you will get an error with this code if the folder "images" does not exist in your project folder
ggsave(filename = "images/scatter1D.png", plot = p6c, device = "png", width = 6, height = 4, units = "in", dpi = 300)