General process to analyze categorical outcomes with categorical predictors

Categorical outcome decision process

Step 0 - Install packages
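The examples below use several add-on packages: tidyverse (which provides readr, dplyr, stringr, and purrr), plus janitor, psych, gmodels, DescTools, and vcd. A minimal setup sketch, assuming none are installed yet:

# install once; these are the packages used in this document
install.packages(c("tidyverse", "janitor", "psych", "gmodels", "DescTools", "vcd"))
library(tidyverse)  # the remaining packages are called with :: prefixes below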

Step 1 - Import the data

pub_tib <- readr::read_delim("data/Dataset-PublicationStatistics-2024.tsv",
                             na=c("n/a","na","N/A","NA","Na","N/a",""),
                             delim = "\t", show_col_types = FALSE) |>
  janitor::clean_names() |>  #fixes the column names
  select(-student_name,-citation,-doi) |>  #drop unused columns 
  janitor::remove_empty(which = "rows") |>  # drop empty rows
  mutate(id = row_number(), .before = 1) |>   #add a column with id for each row
  mutate(across(where(is.character), str_trim)) |> #remove trailing/leading whitespace 
  mutate(across(where(is.character), tolower)) #make all text columns lowercase 
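
A quick structure check can confirm the column names and types after import (optional):

dplyr::glimpse(pub_tib)  # one line per column: name, type, and first few values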

Step 2 - Clean the data and set variable types

Check values of our variables:
1. binary variables (reported yes/no): store as factor and check levels
2. numerical (e.g., participants): convert “no” to NA and store as numerical
3. field: since one entry can belong to multiple fields, we create a series of dummy-coded variables (soc, cog, dev, etc.) where a row gets a value of 1 if the given field appears in the field column. There is some variance in how fields are entered (“social”/“soc”, “cognitive”/“cog”), so we’ll just check for a key part of the text for each.
4. sample_size_justification: store as factor and check levels

#fix typos where "no" was entered as "on" (anchor the pattern so only exact "on" values match)
pub_tib <- pub_tib |> mutate(
  income_or_ses_reported = str_replace(income_or_ses_reported, pattern = "^on$", replacement = "no")
)

#1. binary variables - using "factor" instead of "as_factor" because it will put
#   the levels in alpha order - NA values are recoded as "no"
pub_tib <- pub_tib |> mutate(
  race_ethn_reported = factor(replace_na(race_ethn_reported,"no")),
  income_or_ses_reported = factor(replace_na(income_or_ses_reported,"no")),
  location_reported = factor(replace_na(location_reported,"no")),
  general_sample = factor(replace_na(general_sample,"no")),
  sample_size_justification = factor(sample_size_justification)
)
pub_tib |> select(race_ethn_reported:general_sample,sample_size_justification) |> purrr::map(levels)
## $race_ethn_reported
## [1] "no"  "yes"
## 
## $income_or_ses_reported
## [1] "no"  "yes"
## 
## $location_reported
## [1] "no"  "yes"
## 
## $general_sample
## [1] "no"  "yes"
## 
## $sample_size_justification
## [1] "a-priori"         "accuracy"         "constraints"      "heuristics"      
## [5] "no-justification" "no-statement"     "population"
pub_tib |> count(race_ethn_reported)
## # A tibble: 2 × 2
##   race_ethn_reported     n
##   <fct>              <int>
## 1 no                   197
## 2 yes                  109
pub_tib |> count(income_or_ses_reported)
## # A tibble: 2 × 2
##   income_or_ses_reported     n
##   <fct>                  <int>
## 1 no                       233
## 2 yes                       73
pub_tib |> count(location_reported)
## # A tibble: 2 × 2
##   location_reported     n
##   <fct>             <int>
## 1 no                   97
## 2 yes                 209
pub_tib |> count(general_sample)
## # A tibble: 2 × 2
##   general_sample     n
##   <fct>          <int>
## 1 no                93
## 2 yes              213
pub_tib |> count(sample_size_justification)
## # A tibble: 8 × 2
##   sample_size_justification     n
##   <fct>                     <int>
## 1 a-priori                     26
## 2 accuracy                      7
## 3 constraints                  13
## 4 heuristics                   11
## 5 no-justification             57
## 6 no-statement                 81
## 7 population                    5
## 8 <NA>                        106
#2. Numerical variables: the participant-count columns are stored as chr.
#   use parse_number() so any non-numeric values become NA.
#   note an inconsistency in reporting NA/"no" versus 0 - we would need to
#   resolve it before making full use of those columns
pub_tib <- pub_tib |> mutate(
  participants_n = parse_number(stringr::word(participants_n), na = "no"),
  participants_male = parse_number(stringr::word(participants_male), na = "no"),
  participants_female = parse_number(stringr::word(participants_female), na = "no"),
  participants_nonbin = parse_number(stringr::word(participants_nonbin), na = "no")
) 
pub_tib |> select(participants_n:participants_nonbin) |> psych::describe()
##                     vars   n    mean      sd median trimmed    mad min    max
## participants_n         1 306 1143.83 8356.92    112  200.83 127.50   7 138625
## participants_male      2 264  506.53 3847.89     45   93.28  53.37   0  61226
## participants_female    3 268  698.06 5137.24     62  108.82  72.65   0  77399
## participants_nonbin    4  48    3.56   11.76      0    0.88   0.00   0     75
##                      range  skew kurtosis     se
## participants_n      138618 14.85   237.47 477.73
## participants_male    61226 14.94   231.95 236.82
## participants_female  77399 13.08   185.72 313.81
## participants_nonbin     75  4.90    25.82   1.70
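Before resolving the NA-versus-0 inconsistency noted above, a quick tally shows how often each occurs (a sketch, using participants_male as an example):

pub_tib |> count(missing = is.na(participants_male), zero = participants_male == 0)
# missing = TRUE means the value parsed to NA; zero = TRUE means an explicit 0 was entered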
#3. check out "field"
pub_tib |> janitor::tabyl(field) #or use count(field)  
##                       field  n     percent
##          clinical, cog, dev  1 0.003267974
##                         cog 36 0.117647059
##                 cog , other  1 0.003267974
##                    cog, dev 14 0.045751634
##               cog, dev, soc  1 0.003267974
##                  cog, neuro 19 0.062091503
##             cog, neuro, dev  7 0.022875817
##             cog, neuro, soc  1 0.003267974
##                  cog, other  2 0.006535948
##                    cog, soc  6 0.019607843
##                     cog,dev  5 0.016339869
##                   cog,neuro 45 0.147058824
##               cog,neuro,dev  4 0.013071895
##              cog,neuro,pers  1 0.003267974
##                   cog,other  1 0.003267974
##                     cog,soc  9 0.029411765
##                 cog,soc,dev  1 0.003267974
##                      conbeh 14 0.045751634
##                         dev  7 0.022875817
##                    dev, cog  5 0.016339869
##                  dev, other  1 0.003267974
##                    dev, soc  1 0.003267974
##                     dev,cog  2 0.006535948
##               dev,cog,neuro  2 0.006535948
##                   dev,neuro  1 0.003267974
##               dev,neuro,cog  1 0.003267974
##            dev,neuro,social  1 0.003267974
##                     dev,soc  1 0.003267974
##                  dev,social  2 0.006535948
##                       neuro 20 0.065359477
##            neuro, cognitive  2 0.006535948
##                   neuro,cog  2 0.006535948
##                   neuro,dev  1 0.003267974
##                   neuro,soc  1 0.003267974
##                       other 14 0.045751634
##                  other, dev  2 0.006535948
##                other, neuro  1 0.003267974
##                        pers  7 0.022875817
##                   pers, soc  6 0.019607843
##              pers, soc, cog  1 0.003267974
##              pers,quant,cog  2 0.006535948
##          psych, bio, social  1 0.003267974
##            psych, cognitive  2 0.006535948
##           psych, soc, neuro  1 0.003267974
##               psych, social  3 0.009803922
##  psych, social, personality  2 0.006535948
##                         soc 31 0.101307190
##                  soc & pers  1 0.003267974
##                    soc, cog  3 0.009803922
##                   soc, pers  1 0.003267974
##               soc,cog,neuro  2 0.006535948
##                     soc,dev  1 0.003267974
##                   soc,neuro  1 0.003267974
##                      social  5 0.016339869
##            social,cog,neuro  1 0.003267974
# now dummy code the "field" variable, allowing for multiple fields per entry
pub_tib <- pub_tib |> 
  mutate(
    soc = if_else(str_detect(field,"soc"), 1, 0),
    cog = if_else(str_detect(field,"cog"), 1, 0),
    dev = if_else(str_detect(field,"dev"), 1, 0),
    pers = if_else(str_detect(field,"pers"), 1, 0),
    conbeh = if_else(str_detect(field,"con"), 1, 0),
    neuro = if_else(str_detect(field,"neuro"), 1, 0),
    quant = if_else(str_detect(field,"quant"), 1, 0),
    other = if_else(str_detect(field,"other"), 1, 0)
  )
# print counts of each field
pub_tib |> select(soc:other) |> colSums(na.rm = TRUE)
##    soc    cog    dev   pers conbeh  neuro  quant  other 
##     84    179     61     21     14    114      2     22
# only 2 quant cases, so we could drop that column:
# pub_tib <- pub_tib |> select(-quant)

Step 3 - Chi-square test of independence and loglinear analysis

We can discuss what questions to ask of the data and explore as much as we have time for, but let’s start with an example that makes use of a contingency table and the chi-square test of independence:

Question 1 (chi-square test): If a study uses a sample that is meant to represent the general population, is race/ethnicity more likely to be reported?

  • we can test whether general_sample and race_ethn_reported are related
    • H0: general_sample and race_ethn_reported are independent
  1. Generate the contingency table
  2. Examine observed frequencies compared to expected frequencies
  3. Run the chi-squared test of independence
  4. If the expected frequency for any cell is 5 or less, use Fisher’s exact test (a base-R cross-check follows this list)
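Steps 3 and 4 can also be cross-checked with base R (a sketch; chisq.test() and fisher.test() are in the stats package):

q1_table <- with(pub_tib, table(general_sample, race_ethn_reported))
chisq.test(q1_table)   # applies Yates' continuity correction for 2x2 tables
fisher.test(q1_table)  # exact test, preferred when any expected count is <= 5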
# 1. Contingency Table
q1_xtab <- pub_tib |> 
  with(gmodels::CrossTable(general_sample, race_ethn_reported, expected = TRUE,
                       prop.chisq = TRUE)) #use fisher=TRUE if expected counts <5
# 2. Cramer's V (round to 3 decimals)
cat("Cramer's V: ", round(DescTools::CramerV(q1_xtab$t),3), "\n")
## Registered S3 method overwritten by 'DescTools':
##   method         from 
##   reorder.factor gdata
# 3. Extra: odds ratio
q1_odds <- vcd::oddsratio(q1_xtab$t, log=FALSE)
cat("odds ratio general_sample and race_ethn_reported")
q1_odds  #interpretation: a general sample publication is YY as likely to report race/ethn compared to a non-general sample publication
confint(q1_odds) #confidence interval
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |              Expected N |
## | Chi-square contribution |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  306 
## 
##  
##                | race_ethn_reported 
## general_sample |        no |       yes | Row Total | 
## ---------------|-----------|-----------|-----------|
##             no |        45 |        48 |        93 | 
##                |    59.873 |    33.127 |           | 
##                |     3.694 |     6.677 |           | 
##                |     0.484 |     0.516 |     0.304 | 
##                |     0.228 |     0.440 |           | 
##                |     0.147 |     0.157 |           | 
## ---------------|-----------|-----------|-----------|
##            yes |       152 |        61 |       213 | 
##                |   137.127 |    75.873 |           | 
##                |     1.613 |     2.915 |           | 
##                |     0.714 |     0.286 |     0.696 | 
##                |     0.772 |     0.560 |           | 
##                |     0.497 |     0.199 |           | 
## ---------------|-----------|-----------|-----------|
##   Column Total |       197 |       109 |       306 | 
##                |     0.644 |     0.356 |           | 
## ---------------|-----------|-----------|-----------|
## 
##  
## Statistics for All Table Factors
## 
## 
## Pearson's Chi-squared test 
## ------------------------------------------------------------
## Chi^2 =  14.89978     d.f. =  1     p =  0.0001133763 
## 
## Pearson's Chi-squared test with Yates' continuity correction 
## ------------------------------------------------------------
## Chi^2 =  13.91479     d.f. =  1     p =  0.0001912875 
## 
##  
## Cramer's V:  0.221 
## odds ratio general_sample and race_ethn_reported odds ratios for x and y 
## 
## [1] 0.3762336
##                   2.5 %    97.5 %
## no:yes/no:yes 0.2273702 0.6225603
Understand the output
  1. Total observations - 306 means all cases were included

  2. N is the observed joint frequency. For example, out of [how many?] samples that represent the general population, [how many?] of those papers reported race/ethnicity.

  3. Expected N is the expected joint frequency.

  4. Chi-square contribution measures the amount that a cell contributes to the overall chi-square statistic for the table (the sum of all contributions equals the overall chi-squared value below the table); a worked check follows this list.
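For example, the values for the top-left cell (general_sample = no, race_ethn_reported = no) can be reproduced by hand:

# expected N = row total * column total / grand total
93 * 197 / 306              # = 59.873
# chi-square contribution = (observed - expected)^2 / expected
(45 - 59.873)^2 / 59.873    # = 3.694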

Answer the following
  1. Are any of the expected counts less than or equal to 5?
  2. Are the observed deviations from expected frequencies likely under the null hypothesis? (χ2(1, N = 306) = —–, p<.0001).
  3. Examine the observed and expected frequencies. What direction is the association between the categories?
  4. What is the effect size? (Cramer’s V, odds ratio; a hand computation follows this list)
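For item 4, both effect sizes can be checked by hand from the output above:

# Cramer's V = sqrt(X^2 / (N * (min(rows, cols) - 1)))
sqrt(14.89978 / (306 * 1))  # = 0.221, matching DescTools::CramerV()
# odds ratio = odds of "yes" for general samples / odds of "yes" for non-general samples
(61/152) / (48/45)          # = 0.376, matching vcd::oddsratio()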

Using the last example as a template, let’s ask whether research field and race/ethnicity reporting are related.

  • for a chi-square test, each case must fall in exactly one category of each variable, so we will limit our cases to publications that are classified as a single field
  • after limiting cases, there are six categories of research field (and “other”)
# 0. filter cases to keep only single field pubs
pub_tib <- pub_tib |> rowwise() |> 
  mutate(
    issinglefield = if_else(sum(c_across(soc:other))==1, 1, 0)
  ) |> ungroup()
pub_singlefield_tib <- pub_tib |> filter(issinglefield==1) |> 
  mutate(
    singlefield = case_when(
      soc == 1 ~ "soc",
      cog == 1 ~ "cog",
      dev == 1 ~ "dev",
      neuro == 1 ~ "neuro",
      pers == 1 ~ "pers",
      conbeh == 1 ~ "conbeh",
      other == 1 ~ "other"
    )
  )
pub_singlefield_tib |> count(singlefield)

# 1. Contingency Table
fieldxrace_xtab <- pub_singlefield_tib |> 
  with(gmodels::CrossTable(singlefield, race_ethn_reported, expected = TRUE,
                       prop.chisq = TRUE, simulate.p.value=TRUE)) #if expected counts <5
# 2. Cramer's V (round to 3 decimals) - a sketch follows the output below

# 3. Interpretation
## # A tibble: 7 × 2
##   singlefield     n
##   <chr>       <int>
## 1 cog            38
## 2 conbeh         14
## 3 dev             7
## 4 neuro          20
## 5 other          14
## 6 pers            7
## 7 soc            40
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |              Expected N |
## | Chi-square contribution |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  140 
## 
##  
##              | race_ethn_reported 
##  singlefield |        no |       yes | Row Total | 
## -------------|-----------|-----------|-----------|
##          cog |        30 |         8 |        38 | 
##              |    21.714 |    16.286 |           | 
##              |     3.162 |     4.216 |           | 
##              |     0.789 |     0.211 |     0.271 | 
##              |     0.375 |     0.133 |           | 
##              |     0.214 |     0.057 |           | 
## -------------|-----------|-----------|-----------|
##       conbeh |        13 |         1 |        14 | 
##              |     8.000 |     6.000 |           | 
##              |     3.125 |     4.167 |           | 
##              |     0.929 |     0.071 |     0.100 | 
##              |     0.163 |     0.017 |           | 
##              |     0.093 |     0.007 |           | 
## -------------|-----------|-----------|-----------|
##          dev |         1 |         6 |         7 | 
##              |     4.000 |     3.000 |           | 
##              |     2.250 |     3.000 |           | 
##              |     0.143 |     0.857 |     0.050 | 
##              |     0.013 |     0.100 |           | 
##              |     0.007 |     0.043 |           | 
## -------------|-----------|-----------|-----------|
##        neuro |        15 |         5 |        20 | 
##              |    11.429 |     8.571 |           | 
##              |     1.116 |     1.488 |           | 
##              |     0.750 |     0.250 |     0.143 | 
##              |     0.188 |     0.083 |           | 
##              |     0.107 |     0.036 |           | 
## -------------|-----------|-----------|-----------|
##        other |         5 |         9 |        14 | 
##              |     8.000 |     6.000 |           | 
##              |     1.125 |     1.500 |           | 
##              |     0.357 |     0.643 |     0.100 | 
##              |     0.062 |     0.150 |           | 
##              |     0.036 |     0.064 |           | 
## -------------|-----------|-----------|-----------|
##         pers |         3 |         4 |         7 | 
##              |     4.000 |     3.000 |           | 
##              |     0.250 |     0.333 |           | 
##              |     0.429 |     0.571 |     0.050 | 
##              |     0.037 |     0.067 |           | 
##              |     0.021 |     0.029 |           | 
## -------------|-----------|-----------|-----------|
##          soc |        13 |        27 |        40 | 
##              |    22.857 |    17.143 |           | 
##              |     4.251 |     5.668 |           | 
##              |     0.325 |     0.675 |     0.286 | 
##              |     0.163 |     0.450 |           | 
##              |     0.093 |     0.193 |           | 
## -------------|-----------|-----------|-----------|
## Column Total |        80 |        60 |       140 | 
##              |     0.571 |     0.429 |           | 
## -------------|-----------|-----------|-----------|
## 
##  
## Statistics for All Table Factors
## 
## 
## Pearson's Chi-squared test with simulated p-value
##   (based on 2000 replicates) 
## ------------------------------------------------------------
## Chi^2 =  35.65011     d.f. =  NA     p =  0.0004997501 
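One way to fill in step 2 above, following the Question 1 template (using the table saved in fieldxrace_xtab):

cat("Cramer's V: ", round(DescTools::CramerV(fieldxrace_xtab$t),3), "\n")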

Reporting

  • report the likelihood ratio statistic for the final model.
  • For any terms that are significant you should report the chi-square change.
  • If you break down any higher-order interactions in subsequent analyses then you need to report the relevant chi-square statistics (and odds ratios).
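
The loglinear model behind this write-up is not shown in this section. Below is a minimal sketch of how the model could be fit and the three-way term tested with MASS::loglm() (the approach used in the Field textbook); it assumes pub_tib as cleaned above and is not necessarily the exact code that produced the statistics reported next.

# three-way contingency table of the reporting variables
rep_ctab <- xtabs(~ race_ethn_reported + income_or_ses_reported + location_reported,
                  data = pub_tib)
# saturated model, then remove the three-way interaction and compare
sat_mod <- MASS::loglm(~ race_ethn_reported * income_or_ses_reported * location_reported,
                       data = rep_ctab)
no3way_mod <- update(sat_mod,
                     ~ . - race_ethn_reported:income_or_ses_reported:location_reported)
anova(sat_mod, no3way_mod)  # likelihood-ratio chi-square change for the dropped term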

For this example we could report:
The three-way loglinear analysis produced a final model that retained interactions of (a) race reporting by income/ses reporting and (b) location reporting by income/ses reporting. The likelihood ratio of this final model was χ2(2) = 3.222, p = .200, indicating the model did not significantly differ from the observed frequencies. The highest-order interaction (race reporting by income/ses reporting by location reporting) was not significant (comparison of the model without the three-way interaction to the full model: χ2(1) = 0.986, p = .320), and the race/ethnicity-reported by location-reported interaction was not significant (model without this term compared to the model with all 2nd-order interactions: χ2(1) = 2.236, p = .135). To break down the associations, separate chi-square tests were performed to examine (a) race/ethnicity reporting by income/ses reporting (combining location reporting categories) and (b) location reporting by income/ses reporting (combining race/ethnicity reporting categories). There was a significant association between race/ethnicity-reported and whether income/ses was reported, χ2(1) = 24.961, p < .0001, odds ratio = 5.325. Furthermore, there was a significant association between location-reported and whether income/ses was reported, χ2(1) = 11.447, p = .001, odds ratio = 4.055. [plots/contingency tables can be used to characterize the two-way associations completely]

Finishing up - export the cleaned data file and re-run analyses in SPSS

- see the [notes on the SPSS analysis](../spss/loglin-inclass2022-spss.html) linked on Canvas; it may be helpful to match up the SPSS output with the output from the same analysis in R above
pub_tib |> 
  mutate(
    race_ethn_reported = as.numeric(race_ethn_reported)-1, #factor levels are 1=no, 2=yes
    income_or_ses_reported = as.numeric(income_or_ses_reported)-1, #so we subtract 1
    location_reported = as.numeric(location_reported)-1    #to end up with 0, 1 values
  ) |> 
  readr::write_csv("data/collab_data_cleaned.csv")
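
If a native SPSS file is preferred over csv, haven::write_sav() is an option (a sketch; assumes the haven package is installed):

pub_tib |>
  mutate(across(c(race_ethn_reported, income_or_ses_reported, location_reported),
                \(x) as.numeric(x) - 1)) |>  # same 0/1 recode as above
  haven::write_sav("data/collab_data_cleaned.sav")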
