Activity-Import and examine data in R

Download this file as an R markdown doc

Learning Objectives

Set up a project directory and create an R markdown file (*.Rmd) to document your work
Import data from text (csv or tab-delimited)
Get descriptives and characterize distributions using simple visuals:
- histogram
- quantile-quantile plot
- box plot
Run a normality test (shapiro-wilk)
Transform a variable
Reshape a tibble (wide to long and long to wide)

Step 1 - Get organized.

Video resource (Prof Andy Field) - Working in RStudio - refer to the video if you want a review

1.1 You already did this for the SPSS activity: Set up a typical project work flow consisting of a project folder containing

*.Rproj file
“r_docs” folder to store your lab notes in R markdown
“data” folder to store data files for this activity
{“images” folder to store plots} - not needed for most activities

1.2 Start a new R markdown document to save your work on this activity

File->New File->R Markdown
- name it something sensible
- default output html (or something else if you prefer but html doesn’t require any extra package downloads)
- now save it in the “r_docs” folder you made for this activity
in the “setup” code chunk load these packages with the library() function:
- tidyverse, readr, ggplot2
also in the “setup” chunk, paste in the line below (sets the base folder for code run from your markdown file):
- knitr::opts_knit$set(root.dir = rprojroot::find_rstudio_root_file())
run the setup chunk (click the play button in the top right of the chunk) - this action will load the libraries that will be used below
- if you get an error that “there is no package called blahblahblah” then you either have a typo (remember R is case-sensitive), or you need to install the missing package (in the console type, e.g., install.packages("tidyverse") )

setup code chunk image

Step 2 - Download and import data

2.1 download these files:

“nhanes_selectvars_n500.csv”
“cort-hypothetical.txt”
(right-click, save as) and move each to the “data” folder in your project

NHANES is a large public health dataset - you will work with a small subset of cases and variables.
The data in “cort-hypothetical.txt” are hypothetical salivary cortisol values (nmol/L) generated for this activity. It is in a tab-delimited format with each line containing ID and cortisol value.

2.2 Import the data into R

put the code to read the files in a new code chunk (name it “Step2-load-datasets” within your markdown file
Use readr::read_csv() to import “data/nhanes_selectvars_n500.csv”, store it in a tibble named nhanes_tib
Workflow : Insert a “code chunk” into your R markdown doc and put the code there, then use the “play” button to run chunks of code as you go along. Don’t “knit” the file until you want to see all your work in one html/pdf doc.

Important!!!!

Don’t use the View() function in any of your markdown code chunks- it will make “knitr” stall when you try to knit your final doc. Instead, just click on the variable in your “Environment” tab when you want to look at the data.

nhanes_tib <- readr::read_csv("data/nhanes_selectvars_n500.csv")

2.3 Check your column data types

the first thing you should do after importing data is make sure each column is the data type that you want
you can check data types by clicking on the arrow next to nhanes_tib in the Environment tab (top right pane) - or run str(nhanes_tib) from the console
numeric variables should be “num”, and text/string/nominal variables should be “chr” or “Factor” (depending on how you intend to use them) - in this example everything should be fine but in future labs we’ll see examples of how to convert types

Step 3 - Get descriptives and view the distribution

3.1 Get median, mean, and sd for the variable “Height”

use mean(), median(), and sd() functions and select the “Height” column. The basic syntax for selecting a column is to use “$” like this mean(nhanes_tib$Height, na.rm = True) (notice what happens if you leave out “na.rm = True”). To put all the descriptives in a table we will use pipe syntax like you see in the solution below. It’s also a good idea to also count cases - use n() and sum(is.na()) – EDIT 9/8/2021: the n() function only works within the dplyr::summarise() function (see the solution code).
Put the code in a code chunk called “Step3.1-nhanes-height-all”.
To organize the stats in a neater way call the same functions from within the dplyr::summarise() function (see the solution code below)

nhanes_tib %>% dplyr::summarise(
  median =  median(Height,na.rm = TRUE),
  mean =  mean(Height,na.rm = TRUE),
  sd = sd(Height,na.rm=TRUE),
  cases = n() - sum(is.na(Height))
  ) %>% 
  knitr::kable(caption = "Height Descriptives- all individuals", digits = 3) %>% 
  kableExtra::kable_styling(full_width = FALSE)

Height Descriptives- all individuals
median	mean	sd	cases
165.4	159.859	21.849	477

3.1b Now add the confidence interval around the mean to the table, using the function `ggplot2::mean_cl_normal()` (see the solution for how to use it)

nhanes_tib %>% dplyr::summarise(
  median =  median(Height,na.rm = TRUE),
  mean =  mean(Height,na.rm = TRUE),
  norm.ci.low = ggplot2::mean_cl_normal(Height)$ymin,
  norm.ci.upp = ggplot2::mean_cl_normal(Height)$ymax,
  sd = sd(Height,na.rm = TRUE),
  cases = n() - sum(is.na(Height))
  ) %>% 
  knitr::kable(caption = "Height Descriptives- all individuals", digits = 3) %>% 
  kableExtra::kable_styling(full_width = FALSE)

Height Descriptives- all individuals
median	mean	norm.ci.low	norm.ci.upp	sd	cases
165.4	159.859	157.894	161.825	21.849	477

3.2 Look at the distribution of Height

Put your code in the same code chunk that you used for descriptive stats.

Plot a histogram and a boxplot of “Height” using nhanes_tib %>% drop_na(Height) %>% ggplot( aes(x=Height)) + geom_histogram(binwidth=2). Try different values for the binwidth argument to see how it changes the plot. See the solution for extras you can add, like vertical lines for quantiles and mean.
Make a boxplot of “Height” using nhanes_tib %>% drop_na(Height) %>% ggplot( aes(y=Height)) + geom_boxplot()
Make a quantile-quantile plot of Height. Use nhanes_tib %>% drop_na(Height) %>% ggplot( aes(sample=Height)) + geom_qq() + geom_qq_line() Is Height normally distributed in this sample? Does this Q-Q plot look different from the SPSS one (notice the flipped axis labels)?
Describe the distribution shape in your own words. Why does the boxplot show so many outliers at low values? What do you notice about the median and the mean?
Calculate the Shapiro-Wilk statistic for deviation from a normal distribution (use the function shapiro.test() on the Height data). shapiro.test() is an function that doesn’t accept a “data” argument, so to use the usual piping syntax, the code is nhanes_tib %>% {shapiro.test(.$Height)} - or alternatively you can use shapiro.test(nhanes_tib$Height). Although the first example looks unnecessarily complicated, the syntax can be very useful when there are additional piping steps before calling a function.
What does the statistic tell you? (and did you already know that from the plots?)

height_quants <- quantile(nhanes_tib$Height, c(.25,.5,.75), na.rm = TRUE)
height_mean <- mean(nhanes_tib$Height, na.rm = TRUE)
p1 <- nhanes_tib %>% drop_na(Height) %>%
  ggplot( aes(x=Height)) + geom_histogram(binwidth=2) + 
    geom_vline(xintercept = height_quants, colour="red", linetype = "dashed") +
    geom_vline(xintercept = height_mean, colour="black") +
    theme_classic() + labs (title = "Height distribution- all individuals")
p2 <- nhanes_tib %>% drop_na(Height) %>%
    ggplot( aes(y=Height)) + geom_boxplot() + theme_classic() + 
      labs (title = "Height box plot- all individuals")
p3 <- nhanes_tib %>% drop_na(Height) %>%
    ggplot( aes(sample=Height)) + geom_qq() + geom_qq_line() + theme_classic() +
      labs (title = "Height Q-Q- all individuals")
p1; p2; p3
nhanes_tib %>% drop_na(Height) %>% {shapiro.test(.$Height)}

## 
##  Shapiro-Wilk normality test
## 
## data:  .$Height
## W = 0.84446, p-value < 2.2e-16

3.3 Filter by Age

Insert a new code chunk (call it “Step3.3-nhanes-height-adults”)
Now let’s restrict the plots to individuals age 18 or older, make a new descriptives table, histogram, box plot, and q-q plot of height using just those individuals (use the filter() function to select cases, like this: nhanes_tib %>% filter(Age>=18)). Then re-calculate the Shapiro-Wilk statistic.
What do the plots tell you? What do you notice about the mean and median?

# code for table
nhanes_tib %>% filter(Age>=18) %>% dplyr::summarise(
  median =  median(Height,na.rm = TRUE),
  mean =  mean(Height,na.rm = TRUE),
  norm.ci.low = ggplot2::mean_cl_normal(Height)$ymin,
  norm.ci.upp = ggplot2::mean_cl_normal(Height)$ymax,
  sd = sd(Height,na.rm=TRUE),
  cases = n() - sum(is.na(Height))
  ) %>% 
  knitr::kable(caption = "Height Descriptives (Age 18 or older)", digits = 3) %>%
  kableExtra::kable_styling(full_width = FALSE)

#code for histogram
height_quants <- nhanes_tib %>% drop_na(Height) %>% filter(Age>=18) %>%
  {quantile(.$Height, c(.25,.5,.75), na.rm = TRUE)}
height_mean <- nhanes_tib %>% drop_na(Height) %>% filter(Age>=18) %>%
  {mean(.$Height, na.rm = TRUE)}
p1 <- nhanes_tib %>% drop_na(Height) %>% filter(Age>=18) %>%
  ggplot( aes(x=Height)) + geom_histogram(binwidth=2) + 
    geom_vline(xintercept = height_quants, colour="red", linetype = "dashed") +
    geom_vline(xintercept = height_mean, colour="black") +
    theme_classic() + labs (title = "Height distribution- (Age 18 or older)")

#code for boxplot
p2 <- nhanes_tib %>% drop_na(Height) %>% filter(Age>=18) %>%
  ggplot( aes(y=Height)) + geom_boxplot() + theme_classic() + 
    labs (title = "Height box plot- (Age 18 or older)")

#code for Q-Q plot
p3 <- nhanes_tib %>% drop_na(Height) %>% filter(Age>=18) %>%
  ggplot( aes(sample=Height)) + geom_qq() + geom_qq_line() + 
    theme_classic() + labs (title = "Height Q-Q- (Age 18 or older)")
p1; p2; p3
nhanes_tib %>% drop_na(Height) %>% filter(Age>=18) %>% {shapiro.test(.$Height)}

Height Descriptives (Age 18 or older)
median	mean	norm.ci.low	norm.ci.upp	sd	cases
168.75	168.629	167.533	169.726	10.219	336

## 
##  Shapiro-Wilk normality test
## 
## data:  .$Height
## W = 0.99544, p-value = 0.4318

Step 4 - Applying a mathematical transformation

Imagine a subset (N=400) of the NHANES participants gave saliva samples. The lab sent you a file “cort-hypothetical.txt” containing an ID and cortisol measurements on each line, separated by tabs (“\t”).

4.1 Import the text file into a new dataframe/tibble (insert a new code chunk called “Step4-cort” ) - note that some values are “qns” which stands for “quantity not sufficient” and should be treated as missing values. Use readr::read_delim() and set the argument “delim” to “ (for tab-delimited) and”na” to “qns”.

cort_tib <- readr::read_delim("data/cort-hypothetical.txt", delim = "\t",na = "qns")

4.2 Now check out the distribution of “cortisol_t1” using what you’ve learned so far.

Describe the distribution of cortisol_t1 in your notes.

cort_tib <- readr::read_delim("data/cort-hypothetical.txt", delim = "\t",na = "qns")

## Rows: 400 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## dbl (3): ID, cortisol_t1, cortisol_t2
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

cort_tib %>%  dplyr::summarise(
  median =  median(cortisol_t1,na.rm = TRUE),
  mean =  mean(cortisol_t1,na.rm = TRUE),
  norm.ci.low = ggplot2::mean_cl_normal(cortisol_t1)$ymin,
  norm.ci.upp = ggplot2::mean_cl_normal(cortisol_t1)$ymax,
  sd = sd(cortisol_t1,na.rm=TRUE),
  cases = n() - sum(is.na(cortisol_t1))
  ) %>% 
    knitr::kable(caption = "Cort Descriptives", digits = 3) %>% 
    kableExtra::kable_styling(full_width = FALSE)
p1 <- cort_tib %>% drop_na(cortisol_t1) %>%
  ggplot( aes(x=cortisol_t1)) +
    geom_histogram(bins=30) + theme_classic() +
    labs (title = "Cort distribution")
p2 <- cort_tib %>% drop_na(cortisol_t1) %>%
  ggplot( aes(y=cortisol_t1)) + geom_boxplot() + 
    theme_classic() + labs(title = "Cort Box Plot")
p3 <- cort_tib %>% drop_na(cortisol_t1) %>%
  ggplot( aes(sample=cortisol_t1)) + geom_qq() +
    geom_qq_line() + theme_classic() +
    labs(title = "Cort Q-Q Plot")
p1; p2; p3
cort_tib %>% drop_na(cortisol_t1) %>% {shapiro.test(.$cortisol_t1)}

Cort Descriptives
median	mean	norm.ci.low	norm.ci.upp	sd	cases
3.931	4.796	4.461	5.131	3.364	390

## 
##  Shapiro-Wilk normality test
## 
## data:  .$cortisol_t1
## W = 0.80483, p-value < 2.2e-16

4.3 Log transformation of a variable can be useful for some measurements with positively skewed distributions, and it is a common practice with salivary cortisol.

Use the function “log” to create a new column in your cortisol dataframe, containing the natural logarithm of each cortisol_t1 measurement - name the new column logcortisol_1. Then, describe the distribution of the new logcortisol_t1 variable.

cort_tib <- cort_tib %>% mutate(logcortisol_t1=log(cortisol_t1))
cort_tib %>%  dplyr::summarise(
  median =  median(logcortisol_t1,na.rm = TRUE),
  mean =  mean(logcortisol_t1,na.rm = TRUE),
  norm.ci.low = ggplot2::mean_cl_normal(cortisol_t1)$ymin,
  norm.ci.upp = ggplot2::mean_cl_normal(cortisol_t1)$ymax,
  sd = sd(logcortisol_t1,na.rm=TRUE),
  cases = n() - sum(is.na(logcortisol_t1))
  ) %>% 
    knitr::kable(caption = "Log Cort Descriptives", digits = 3) %>% 
    kableExtra::kable_styling(full_width = FALSE)
p1 <- cort_tib %>% drop_na(logcortisol_t1) %>%
  ggplot( aes(x=logcortisol_t1)) +
    geom_histogram(bins=30) + theme_classic() +
    labs (title = "Log Cort distribution")
p2 <- cort_tib %>% drop_na(logcortisol_t1) %>%
  ggplot( aes(y=logcortisol_t1)) + 
    geom_boxplot() + theme_classic() + 
    labs (title = "Log Cort box plot")
p3 <- cort_tib %>% drop_na(logcortisol_t1) %>%
  ggplot( aes(sample=logcortisol_t1)) + 
    geom_qq() + geom_qq_line() + theme_classic()
p1; p2; p3
cort_tib %>% drop_na(logcortisol_t1) %>% 
  {shapiro.test(.$logcortisol_t1)}

Log Cort Descriptives
median	mean	norm.ci.low	norm.ci.upp	sd	cases
1.369	1.369	4.461	5.131	0.637	390

## 
##  Shapiro-Wilk normality test
## 
## data:  .$logcortisol_t1
## W = 0.99414, p-value = 0.1394

Step 5 - Re-structuring data (wide to long and back again)

Did you watch the video about “tidy” data? Would you say that the cortisol data tibble is “tidy” right now (take a look at the tibble by clicking on it in the Environment tab of RStudio)? Assume that cortisol_t1 and cortisol_t2 are the same measure (salivary cortisol) taken at two timepoints.
We will discuss what “tidy” data means when we re-group, but for now let’s practice restructuring the data using pivot_longer() and pivot_wider() (these functions are updated replacements for the gather() and spread() functions you saw used in the tidy data video)

Step 5.1 - Use `pivot_longer()` to restructure the data to long/tidy format

you should end up with 4 columns: ID, time, cortisol, logcortisol, and 2 rows for each ID (for t1 and t2)
name the restructured tibble cort_long_tib
this very tricky so go ahead and peek at the solution
does the new tibble look the way you expected?
what if you want to save the data in this format? Use write_delim() or write_csv and output a text document

# wide to long/tidy
cort_long_tib <- cort_tib %>% 
  pivot_longer(!ID, names_to = c(".value","time"),
               names_sep = "_", values_drop_na = FALSE) 
#.value indicates that component of the name defines the name of the column containing the cell values
head(cort_long_tib)

## # A tibble: 6 × 4
##      ID time  cortisol logcortisol
##   <dbl> <chr>    <dbl>       <dbl>
## 1 61744 t1        2.07     0.726  
## 2 61744 t2        3.09    NA      
## 3 61940 t1        1.23     0.209  
## 4 61940 t2        2.92    NA      
## 5 56097 t1        1.00     0.00499
## 6 56097 t2        1.12    NA

Step 5.2 - Use `pivot_wider()` to restructure the data back to wide format

use cort_long_tib as the input, and reshape it to a wide format
name the restructured tibble cort_wide_tib
you should end up with 5 columns: ID, cortisol_t1, cortisol_t2, logcortisol_t1, and logcortisol_t2, 1 row for each ID (all of the logcortisol_t2 values should be missing)
why would you want to change the shape of a dataset like this? we’ll talk about some cases in discussion
check out this tidyr page for lots of restructuring examples

cort_wide_tib <- cort_long_tib %>% 
  pivot_wider(id_cols = ID, names_from = "time", 
              values_from = c("cortisol","logcortisol"),
              names_sep = "_", values_fill = NULL)

That’s all for today - help a classmate or try the mini-challenge below if you have more time!

Re-do descriptives and a histogram and boxplot of Height from the NHANES data set, but group by Gender, so that you have 1 histogram plot (where different Genders get color-coded) and 1 boxplot (with Gender on the x-axis). Keep it restricted to Age 18+ only. Do you see any outliers (according to the boxplot threshold), when the data are split by Gender?
Hints:
- you can use the “fill” argument in aes()to map a variable onto the fill color chart component (called an “aesthetic” in ggplot language) - use position=“identity” and alpha=.5 so that the histogram bars for different Gender are overlayed (rather than stacked)
- you can use the “x” axis argument in aes() to map a variable onto the x-axis (for the boxplot)
Things to pay attention to: did you see a warning like “Removed {X} rows containing non-finite values” above the plots? What does that mean? How can you adjust your code so that you don’t get the warning?
Experiment with different settings (position, alpha, theme) if you finish early

nhanes_tib %>% drop_na(Height,Gender) %>% filter(Age>=18) %>% 
    group_by(Gender) %>%  
    dplyr::summarise(
        median =  median(Height,na.rm = TRUE),
        mean =  mean(Height,na.rm = TRUE),
        sd = sd(Height,na.rm=TRUE),
        cases = n()
    ) %>% 
    knitr::kable(caption = "Height Descriptives by Gender (Age 18+)", digits = 3) %>% 
    kableExtra::kable_styling(full_width = FALSE)
p1 <- nhanes_tib  %>% drop_na(Height,Gender) %>% filter(Age>=18) %>%
    ggplot( aes(x=Height, fill=Gender)) + 
        geom_histogram(position="identity",alpha=.5,binwidth=2) + 
        theme_classic() + labs (title = "Height Distribution by Gender (Age 18+")
p2 <- nhanes_tib %>% drop_na(Height,Gender) %>% filter(Age>=18) %>%
    ggplot( aes(y=Height, x=Gender)) + geom_boxplot() + theme_classic() + 
        labs (title = "Height Distribution by Gender")
p1
p2

Height Descriptives by Gender (Age 18+)
Gender	median	mean	sd	cases
female	162.3	161.745	7.649	167
male	175.4	175.433	7.533	169

Activity-Import and examine data in R

JB

updated 01/24/2024

Learning Objectives

Step 1 - Get organized.

1.1 You already did this for the SPSS activity: Set up a typical project work flow consisting of a project folder containing

1.2 Start a new R markdown document to save your work on this activity

Step 2 - Download and import data

2.1 download these files:

2.2 Import the data into R

Important!!!!

2.3 Check your column data types

Step 3 - Get descriptives and view the distribution

3.1 Get median, mean, and sd for the variable “Height”

3.1b Now add the confidence interval around the mean to the table, using the function `ggplot2::mean_cl_normal()` (see the solution for how to use it)

3.2 Look at the distribution of Height

3.3 Filter by Age

Step 4 - Applying a mathematical transformation

Imagine a subset (N=400) of the NHANES participants gave saliva samples. The lab sent you a file “cort-hypothetical.txt” containing an ID and cortisol measurements on each line, separated by tabs (“\t”).

Step 5 - Re-structuring data (wide to long and back again)

Step 5.1 - Use `pivot_longer()` to restructure the data to long/tidy format

Step 5.2 - Use `pivot_wider()` to restructure the data back to wide format

That’s all for today - help a classmate or try the mini-challenge below if you have more time!

Activity-Import and examine data in R

JB

updated 01/24/2024

Learning Objectives

Step 1 - Get organized.

1.1 You already did this for the SPSS activity: Set up a typical project work flow consisting of a project folder containing

1.2 Start a new R markdown document to save your work on this activity

Step 2 - Download and import data

2.1 download these files:

2.2 Import the data into R

Important!!!!

2.3 Check your column data types

Step 3 - Get descriptives and view the distribution

3.1 Get median, mean, and sd for the variable “Height”

3.1b Now add the confidence interval around the mean to the table, using the function ggplot2::mean_cl_normal() (see the solution for how to use it)

3.2 Look at the distribution of Height

3.3 Filter by Age

Step 4 - Applying a mathematical transformation

Imagine a subset (N=400) of the NHANES participants gave saliva samples. The lab sent you a file “cort-hypothetical.txt” containing an ID and cortisol measurements on each line, separated by tabs (“\t”).

Step 5 - Re-structuring data (wide to long and back again)

Step 5.1 - Use pivot_longer() to restructure the data to long/tidy format

Step 5.2 - Use pivot_wider() to restructure the data back to wide format

That’s all for today - help a classmate or try the mini-challenge below if you have more time!

3.1b Now add the confidence interval around the mean to the table, using the function `ggplot2::mean_cl_normal()` (see the solution for how to use it)

Step 5.1 - Use `pivot_longer()` to restructure the data to long/tidy format

Step 5.2 - Use `pivot_wider()` to restructure the data back to wide format