General process to analyze categorical outcomes with categorical predictors

Categorical outcome decision process

Step 0 - Install packages
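The examples below use several add-on packages: tidyverse (which provides readr, dplyr, stringr, and purrr), plus janitor, psych, gmodels, DescTools, and vcd. A minimal setup sketch, assuming none are installed yet:

# install once; these are the packages used in this document
install.packages(c("tidyverse", "janitor", "psych", "gmodels", "DescTools", "vcd"))
library(tidyverse)  # the remaining packages are called with :: prefixes below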

Step 1 - Import the data

pub_tib <- readr::read_delim("data/Dataset-PublicationStatistics-2024.tsv",
                             na=c("n/a","na","N/A","NA","Na","N/a",""),
                             delim = "\t", show_col_types = FALSE) |>
  janitor::clean_names() |>  #fixes the column names
  select(-student_name,-citation,-doi) |>  #drop unused columns 
  janitor::remove_empty(which = "rows") |>  # drop empty rows
  mutate(id = row_number(), .before = 1) |>   #add a column with id for each row
  mutate(across(where(is.character), str_trim)) |> #remove trailing/leading whitespace 
  mutate(across(where(is.character), tolower)) #make all text columns lowercase 
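
A quick structure check can confirm the column names and types after import (optional):

dplyr::glimpse(pub_tib)  # one line per column: name, type, and first few values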

Step 2 - Clean the data and set variable types

Check values of our variables:
1. binary variables (reported yes/no): store as factor and check levels
2. numerical (e.g., participants): convert “no” to NA and store as numerical
3. field: since one entry can belong to multiple fields, we create a series of dummy-coded variables (soc, cog, dev, etc.) where a row gets a value of 1 if the given field appears in the field column. There is some variance in how fields are entered (“social”/“soc”, “cognitive”/“cog”), so we’ll just check for a key part of the text for each.
4. sample_size_justification: store as factor and check levels

#fix typos where "no" was entered as "on" (anchor the pattern so only exact "on" values match)
pub_tib <- pub_tib |> mutate(
  income_or_ses_reported = str_replace(income_or_ses_reported, pattern = "^on$", replacement = "no")
)

#1. binary variables - using "factor" instead of "as_factor" because it will put
#   the levels in alpha order - NA values are recoded as "no"
pub_tib <- pub_tib |> mutate(
  race_ethn_reported = factor(replace_na(race_ethn_reported,"no")),
  income_or_ses_reported = factor(replace_na(income_or_ses_reported,"no")),
  location_reported = factor(replace_na(location_reported,"no")),
  general_sample = factor(replace_na(general_sample,"no")),
  sample_size_justification = factor(sample_size_justification)
)
pub_tib |> select(race_ethn_reported:general_sample,sample_size_justification) |> purrr::map(levels)
## $race_ethn_reported
## [1] "no"  "yes"
## 
## $income_or_ses_reported
## [1] "no"  "yes"
## 
## $location_reported
## [1] "no"  "yes"
## 
## $general_sample
## [1] "no"  "yes"
## 
## $sample_size_justification
## [1] "a-priori"         "accuracy"         "constraints"      "heuristics"      
## [5] "no-justification" "no-statement"     "population"
pub_tib |> count(race_ethn_reported)
## # A tibble: 2 × 2
##   race_ethn_reported     n
##   <fct>              <int>
## 1 no                   197
## 2 yes                  109
pub_tib |> count(income_or_ses_reported)
## # A tibble: 2 × 2
##   income_or_ses_reported     n
##   <fct>                  <int>
## 1 no                       233
## 2 yes                       73
pub_tib |> count(location_reported)
## # A tibble: 2 × 2
##   location_reported     n
##   <fct>             <int>
## 1 no                   97
## 2 yes                 209
pub_tib |> count(general_sample)
## # A tibble: 2 × 2
##   general_sample     n
##   <fct>          <int>
## 1 no                93
## 2 yes              213
pub_tib |> count(sample_size_justification)
## # A tibble: 8 × 2
##   sample_size_justification     n
##   <fct>                     <int>
## 1 a-priori                     26
## 2 accuracy                      7
## 3 constraints                  13
## 4 heuristics                   11
## 5 no-justification             57
## 6 no-statement                 81
## 7 population                    5
## 8 <NA>                        106
#2. Numerical variables: the participant-count columns are stored as chr.
#   use parse_number() so any non-numeric values become NA.
#   note an inconsistency in reporting NA/"no" versus 0 - we would need to
#   resolve it before making full use of those columns
pub_tib <- pub_tib |> mutate(
  participants_n = parse_number(stringr::word(participants_n), na = "no"),
  participants_male = parse_number(stringr::word(participants_male), na = "no"),
  participants_female = parse_number(stringr::word(participants_female), na = "no"),
  participants_nonbin = parse_number(stringr::word(participants_nonbin), na = "no")
) 
pub_tib |> select(participants_n:participants_nonbin) |> psych::describe()
##                     vars   n    mean      sd median trimmed    mad min    max
## participants_n         1 306 1143.83 8356.92    112  200.83 127.50   7 138625
## participants_male      2 264  506.53 3847.89     45   93.28  53.37   0  61226
## participants_female    3 268  698.06 5137.24     62  108.82  72.65   0  77399
## participants_nonbin    4  48    3.56   11.76      0    0.88   0.00   0     75
##                      range  skew kurtosis     se
## participants_n      138618 14.85   237.47 477.73
## participants_male    61226 14.94   231.95 236.82
## participants_female  77399 13.08   185.72 313.81
## participants_nonbin     75  4.90    25.82   1.70
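Before resolving the NA-versus-0 inconsistency noted above, a quick tally shows how often each occurs (a sketch, using participants_male as an example):

pub_tib |> count(missing = is.na(participants_male), zero = participants_male == 0)
# missing = TRUE means the value parsed to NA; zero = TRUE means an explicit 0 was entered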
#3. check out "field"
pub_tib |> janitor::tabyl(field) #or use count(field)  
##                       field  n     percent
##          clinical, cog, dev  1 0.003267974
##                         cog 36 0.117647059
##                 cog , other  1 0.003267974
##                    cog, dev 14 0.045751634
##               cog, dev, soc  1 0.003267974
##                  cog, neuro 19 0.062091503
##             cog, neuro, dev  7 0.022875817
##             cog, neuro, soc  1 0.003267974
##                  cog, other  2 0.006535948
##                    cog, soc  6 0.019607843
##                     cog,dev  5 0.016339869
##                   cog,neuro 45 0.147058824
##               cog,neuro,dev  4 0.013071895
##              cog,neuro,pers  1 0.003267974
##                   cog,other  1 0.003267974
##                     cog,soc  9 0.029411765
##                 cog,soc,dev  1 0.003267974
##                      conbeh 14 0.045751634
##                         dev  7 0.022875817
##                    dev, cog  5 0.016339869
##                  dev, other  1 0.003267974
##                    dev, soc  1 0.003267974
##                     dev,cog  2 0.006535948
##               dev,cog,neuro  2 0.006535948
##                   dev,neuro  1 0.003267974
##               dev,neuro,cog  1 0.003267974
##            dev,neuro,social  1 0.003267974
##                     dev,soc  1 0.003267974
##                  dev,social  2 0.006535948
##                       neuro 20 0.065359477
##            neuro, cognitive  2 0.006535948
##                   neuro,cog  2 0.006535948
##                   neuro,dev  1 0.003267974
##                   neuro,soc  1 0.003267974
##                       other 14 0.045751634
##                  other, dev  2 0.006535948
##                other, neuro  1 0.003267974
##                        pers  7 0.022875817
##                   pers, soc  6 0.019607843
##              pers, soc, cog  1 0.003267974
##              pers,quant,cog  2 0.006535948
##          psych, bio, social  1 0.003267974
##            psych, cognitive  2 0.006535948
##           psych, soc, neuro  1 0.003267974
##               psych, social  3 0.009803922
##  psych, social, personality  2 0.006535948
##                         soc 31 0.101307190
##                  soc & pers  1 0.003267974
##                    soc, cog  3 0.009803922
##                   soc, pers  1 0.003267974
##               soc,cog,neuro  2 0.006535948
##                     soc,dev  1 0.003267974
##                   soc,neuro  1 0.003267974
##                      social  5 0.016339869
##            social,cog,neuro  1 0.003267974
# now dummy code the "field" variable, allowing for multiple fields per entry
pub_tib <- pub_tib |> 
  mutate(
    soc = if_else(str_detect(field,"soc"), 1, 0),
    cog = if_else(str_detect(field,"cog"), 1, 0),
    dev = if_else(str_detect(field,"dev"), 1, 0),
    pers = if_else(str_detect(field,"pers"), 1, 0),
    conbeh = if_else(str_detect(field,"con"), 1, 0),
    neuro = if_else(str_detect(field,"neuro"), 1, 0),
    quant = if_else(str_detect(field,"quant"), 1, 0),
    other = if_else(str_detect(field,"other"), 1, 0)
  )
# print counts of each field
pub_tib |> select(soc:other) |> colSums(na.rm = TRUE)
##    soc    cog    dev   pers conbeh  neuro  quant  other 
##     84    179     61     21     14    114      2     22
# only 2 quant cases, so we could drop that column:
# pub_tib <- pub_tib |> select(-quant)

Step 3 - Chi-square test of independence and loglinear analysis

We can discuss what questions to ask of the data and explore as much as we have time for, but let’s start with an example that makes use of a contingency table and the chi-square test of independence:

Question 1 (chi-square test): If a study uses a sample that is meant to represent the general population, is race/ethnicity more likely to be reported?

  • we can test whether general_sample and race_ethn_reported are related
    • H0: general_sample and race_ethn_reported are independent
  1. Generate the contingency table
  2. Examine observed frequencies compared to expected frequencies
  3. Run the chi-squared test of independence
  4. If the expected frequency for any cell is 5 or less, use Fisher’s exact test (a base-R cross-check follows this list)
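Steps 3 and 4 can also be cross-checked with base R (a sketch; chisq.test() and fisher.test() are in the stats package):

q1_table <- with(pub_tib, table(general_sample, race_ethn_reported))
chisq.test(q1_table)   # applies Yates' continuity correction for 2x2 tables
fisher.test(q1_table)  # exact test, preferred when any expected count is <= 5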
# 1. Contingency Table
q1_xtab <- pub_tib |> 
  with(gmodels::CrossTable(general_sample, race_ethn_reported, expected = TRUE,
                       prop.chisq = TRUE)) #use fisher=TRUE if expected counts <5
# 2. Cramer's V (round to 3 decimals)
cat("Cramer's V: ", round(DescTools::CramerV(q1_xtab$t),3), "\n")
## Registered S3 method overwritten by 'DescTools':
##   method         from 
##   reorder.factor gdata
# 3. Extra: odds ratio
q1_odds <- vcd::oddsratio(q1_xtab$t, log=FALSE)
cat("odds ratio general_sample and race_ethn_reported")
q1_odds  #interpretation: a general sample publication is YY as likely to report race/ethn compared to a non-general sample publication
confint(q1_odds) #confidence interval
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |              Expected N |
## | Chi-square contribution |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  306 
## 
##  
##                | race_ethn_reported 
## general_sample |        no |       yes | Row Total | 
## ---------------|-----------|-----------|-----------|
##             no |        45 |        48 |        93 | 
##                |    59.873 |    33.127 |           | 
##                |     3.694 |     6.677 |           | 
##                |     0.484 |     0.516 |     0.304 | 
##                |     0.228 |     0.440 |           | 
##                |     0.147 |     0.157 |           | 
## ---------------|-----------|-----------|-----------|
##            yes |       152 |        61 |       213 | 
##                |   137.127 |    75.873 |           | 
##                |     1.613 |     2.915 |           | 
##                |     0.714 |     0.286 |     0.696 | 
##                |     0.772 |     0.560 |           | 
##                |     0.497 |     0.199 |           | 
## ---------------|-----------|-----------|-----------|
##   Column Total |       197 |       109 |       306 | 
##                |     0.644 |     0.356 |           | 
## ---------------|-----------|-----------|-----------|
## 
##  
## Statistics for All Table Factors
## 
## 
## Pearson's Chi-squared test 
## ------------------------------------------------------------
## Chi^2 =  14.89978     d.f. =  1     p =  0.0001133763 
## 
## Pearson's Chi-squared test with Yates' continuity correction 
## ------------------------------------------------------------
## Chi^2 =  13.91479     d.f. =  1     p =  0.0001912875 
## 
##  
## Cramer's V:  0.221 
## odds ratio general_sample and race_ethn_reported odds ratios for x and y 
## 
## [1] 0.3762336
##                   2.5 %    97.5 %
## no:yes/no:yes 0.2273702 0.6225603
Understand the output
  1. Total observations - 306 means all cases were included

  2. N is the observed joint frequency. For example, out of [how many?] samples that represent the general population, [how many?] of those papers reported race/ethnicity.

  3. Expected N is the expected joint frequency.

  4. Chi-square contribution measures the amount that a cell contributes to the overall chi-square statistic for the table (the sum of all contributions equals the overall chi-squared value below the table); a worked check follows this list.
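For example, the values for the top-left cell (general_sample = no, race_ethn_reported = no) can be reproduced by hand:

# expected N = row total * column total / grand total
93 * 197 / 306              # = 59.873
# chi-square contribution = (observed - expected)^2 / expected
(45 - 59.873)^2 / 59.873    # = 3.694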

Answer the following
  1. Are any of the expected counts less than or equal to 5?
  2. Are the observed deviations from expected frequencies likely under the null hypothesis? (χ2(1, N = 306) = —–, p<.0001).
  3. Examine the observed and expected frequencies. What direction is the association between the categories?
  4. What is the effect size? (Cramer’s V, odds ratio; a hand computation follows this list)
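For item 4, both effect sizes can be checked by hand from the output above:

# Cramer's V = sqrt(X^2 / (N * (min(rows, cols) - 1)))
sqrt(14.89978 / (306 * 1))  # = 0.221, matching DescTools::CramerV()
# odds ratio = odds of "yes" for general samples / odds of "yes" for non-general samples
(61/152) / (48/45)          # = 0.376, matching vcd::oddsratio()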

Using the last example as a template, let’s ask whether research field and race/ethnicity reporting are related.

  • for a chi-square test, each case must fall in exactly one category of each variable, so we will limit our cases to publications that are classified as a single field
  • after limiting cases, there are six categories of research field (and “other”)
# 0. filter cases to keep only single field pubs
pub_tib <- pub_tib |> rowwise() |> 
  mutate(
    issinglefield = if_else(sum(c_across(soc:other))==1, 1, 0)
  ) |> ungroup()
pub_singlefield_tib <- pub_tib |> filter(issinglefield==1) |> 
  mutate(
    singlefield = case_when(
      soc == 1 ~ "soc",
      cog == 1 ~ "cog",
      dev == 1 ~ "dev",
      neuro == 1 ~ "neuro",
      pers == 1 ~ "pers",
      conbeh == 1 ~ "conbeh",
      other == 1 ~ "other"
    )
  )
pub_singlefield_tib |> count(singlefield)

# 1. Contingency Table
fieldxrace_xtab <- pub_singlefield_tib |> 
  with(gmodels::CrossTable(singlefield, race_ethn_reported, expected = TRUE,
                       prop.chisq = TRUE, simulate.p.value=TRUE)) #if expected counts <5
# 2. Cramer's V (round to 3 decimals) - a sketch follows the output below

# 3. Interpretation
## # A tibble: 7 × 2
##   singlefield     n
##   <chr>       <int>
## 1 cog            38
## 2 conbeh         14
## 3 dev             7
## 4 neuro          20
## 5 other          14
## 6 pers            7
## 7 soc            40
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |              Expected N |
## | Chi-square contribution |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  140 
## 
##  
##              | race_ethn_reported 
##  singlefield |        no |       yes | Row Total | 
## -------------|-----------|-----------|-----------|
##          cog |        30 |         8 |        38 | 
##              |    21.714 |    16.286 |           | 
##              |     3.162 |     4.216 |           | 
##              |     0.789 |     0.211 |     0.271 | 
##              |     0.375 |     0.133 |           | 
##              |     0.214 |     0.057 |           | 
## -------------|-----------|-----------|-----------|
##       conbeh |        13 |         1 |        14 | 
##              |     8.000 |     6.000 |           | 
##              |     3.125 |     4.167 |           | 
##              |     0.929 |     0.071 |     0.100 | 
##              |     0.163 |     0.017 |           | 
##              |     0.093 |     0.007 |           | 
## -------------|-----------|-----------|-----------|
##          dev |         1 |         6 |         7 | 
##              |     4.000 |     3.000 |           | 
##              |     2.250 |     3.000 |           | 
##              |     0.143 |     0.857 |     0.050 | 
##              |     0.013 |     0.100 |           | 
##              |     0.007 |     0.043 |           | 
## -------------|-----------|-----------|-----------|
##        neuro |        15 |         5 |        20 | 
##              |    11.429 |     8.571 |           | 
##              |     1.116 |     1.488 |           | 
##              |     0.750 |     0.250 |     0.143 | 
##              |     0.188 |     0.083 |           | 
##              |     0.107 |     0.036 |           | 
## -------------|-----------|-----------|-----------|
##        other |         5 |         9 |        14 | 
##              |     8.000 |     6.000 |           | 
##              |     1.125 |     1.500 |           | 
##              |     0.357 |     0.643 |     0.100 | 
##              |     0.062 |     0.150 |           | 
##              |     0.036 |     0.064 |           | 
## -------------|-----------|-----------|-----------|
##         pers |         3 |         4 |         7 | 
##              |     4.000 |     3.000 |           | 
##              |     0.250 |     0.333 |           | 
##              |     0.429 |     0.571 |     0.050 | 
##              |     0.037 |     0.067 |           | 
##              |     0.021 |     0.029 |           | 
## -------------|-----------|-----------|-----------|
##          soc |        13 |        27 |        40 | 
##              |    22.857 |    17.143 |           | 
##              |     4.251 |     5.668 |           | 
##              |     0.325 |     0.675 |     0.286 | 
##              |     0.163 |     0.450 |           | 
##              |     0.093 |     0.193 |           | 
## -------------|-----------|-----------|-----------|
## Column Total |        80 |        60 |       140 | 
##              |     0.571 |     0.429 |           | 
## -------------|-----------|-----------|-----------|
## 
##  
## Statistics for All Table Factors
## 
## 
## Pearson's Chi-squared test with simulated p-value
##   (based on 2000 replicates) 
## ------------------------------------------------------------
## Chi^2 =  35.65011     d.f. =  NA     p =  0.0004997501 
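One way to fill in step 2 above, following the Question 1 template (using the table saved in fieldxrace_xtab):

cat("Cramer's V: ", round(DescTools::CramerV(fieldxrace_xtab$t),3), "\n")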

Reporting

  • report the likelihood ratio statistic for the final model.
  • For any terms that are significant you should report the chi-square change.
  • If you break down any higher-order interactions in subsequent analyses then you need to report the relevant chi-square statistics (and odds ratios).
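
The loglinear model behind this write-up is not shown in this section. Below is a minimal sketch of how the model could be fit and the three-way term tested with MASS::loglm() (the approach used in the Field textbook); it assumes pub_tib as cleaned above and is not necessarily the exact code that produced the statistics reported next.

# three-way contingency table of the reporting variables
rep_ctab <- xtabs(~ race_ethn_reported + income_or_ses_reported + location_reported,
                  data = pub_tib)
# saturated model, then remove the three-way interaction and compare
sat_mod <- MASS::loglm(~ race_ethn_reported * income_or_ses_reported * location_reported,
                       data = rep_ctab)
no3way_mod <- update(sat_mod,
                     ~ . - race_ethn_reported:income_or_ses_reported:location_reported)
anova(sat_mod, no3way_mod)  # likelihood-ratio chi-square change for the dropped term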

For this example we could report:
The three-way loglinear analysis produced a final model that retained interactions of (a) race reporting by income/ses reporting and (b) location reporting by income/ses reporting. The likelihood ratio of this final model was χ2(2) = 3.222, p = .200, indicating the model did not significantly differ from the observed frequencies. The highest-order interaction (race reporting by income/ses reporting by location reporting) was not significant (comparison of the model without the three-way interaction to the full model: χ2(1) = 0.986, p = .320), and the race/ethnicity-reported by location-reported interaction was not significant (model without this term compared to the model with all 2nd-order interactions: χ2(1) = 2.236, p = .135). To break down the associations, separate chi-square tests were performed to examine (a) race/ethnicity reporting by income/ses reporting (combining location reporting categories) and (b) location reporting by income/ses reporting (combining race/ethnicity reporting categories). There was a significant association between race/ethnicity-reported and whether income/ses was reported, χ2(1) = 24.961, p < .0001, odds ratio = 5.325. Furthermore, there was a significant association between location-reported and whether income/ses was reported, χ2(1) = 11.447, p = .001, odds ratio = 4.055. [plots/contingency tables can be used to characterize the two-way associations completely]

Finishing up - export the cleaned data file and re-run analyses in SPSS

- see the [notes on the SPSS analysis](../spss/loglin-inclass2022-spss.html) linked on Canvas; it may be helpful to match up the SPSS output with the output from the same analysis in R above
pub_tib |> 
  mutate(
    race_ethn_reported = as.numeric(race_ethn_reported)-1, #factor levels are 1=no, 2=yes
    income_or_ses_reported = as.numeric(income_or_ses_reported)-1, #so we subtract 1
    location_reported = as.numeric(location_reported)-1    #to end up with 0, 1 values
  ) |> 
  readr::write_csv("data/collab_data_cleaned.csv")
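
If a native SPSS file is preferred over csv, haven::write_sav() is an option (a sketch; assumes the haven package is installed):

pub_tib |>
  mutate(across(c(race_ethn_reported, income_or_ses_reported, location_reported),
                \(x) as.numeric(x) - 1)) |>  # same 0/1 recode as above
  haven::write_sav("data/collab_data_cleaned.sav")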
