pub_tib <- readr::read_delim("data/Dataset-PublicationStatistics-2024.tsv",
na=c("n/a","na","N/A","NA","Na","N/a",""),
delim = "\t", show_col_types = FALSE) |>
janitor::clean_names() |> #fixes the column names
select(-student_name,-citation,-doi) |> #drop unused columns
janitor::remove_empty(which = "rows") |> # drop empty rows
mutate(id = row_number(), .before = 1) |> #add a column with id for each row
mutate(across(where(is.character), str_trim)) |> #remove trailing/leading whitespace
mutate(across(where(is.character), tolower)) #make all text columns lowercase
Check values of our variables:

1. binary variables (reported yes/no): store as factor and check levels
2. numerical (e.g., participants): convert "no" to NA and store as numerical
3. field: since one entry can belong to multiple fields, we create a series of dummy-coded variables (`soc`, `cog`, `dev`, etc.) where a row gets a value of 1 if the given field appears in the field column. There is some variance in how fields are entered ("social"/"soc", "cognitive"/"cog"), so we'll just check for a key part of the text for each.
4. sample_size_justification: store as factor and check levels
#fix typos where "no" was entered as "on"
pub_tib <- pub_tib |> mutate(
income_or_ses_reported = str_replace(income_or_ses_reported, pattern = "on", replacement = "no")
)
#1. binary variables - using "factor" instead of "as_factor" because it will put
# the levels in alpha order - NA values are recoded as "no"
pub_tib <- pub_tib |> mutate(
race_ethn_reported = factor(replace_na(race_ethn_reported,"no")),
income_or_ses_reported = factor(replace_na(income_or_ses_reported,"no")),
location_reported = factor(replace_na(location_reported,"no")),
general_sample = factor(replace_na(general_sample,"no")),
sample_size_justification = factor(sample_size_justification)
)
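The factor-versus-as_factor point in the comment above can be seen in a minimal sketch (made-up values, not data from the dataset):

```r
x <- c("yes", "no", "yes")

# base factor() sorts levels alphabetically, so "no" comes first
levels(factor(x))              # "no" "yes"

# forcats::as_factor() keeps levels in order of first appearance
levels(forcats::as_factor(x))  # "yes" "no"
```

Alphabetical ordering is convenient here because it makes "no" the reference level across all the binary variables.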
pub_tib |> select(race_ethn_reported:general_sample,sample_size_justification) |> purrr::map(levels)
## $race_ethn_reported
## [1] "no" "yes"
##
## $income_or_ses_reported
## [1] "no" "yes"
##
## $location_reported
## [1] "no" "yes"
##
## $general_sample
## [1] "no" "yes"
##
## $sample_size_justification
## [1] "a-priori" "accuracy" "constraints" "heuristics"
## [5] "no-justification" "no-statement" "population"
pub_tib |> count(race_ethn_reported)
## # A tibble: 2 × 2
## race_ethn_reported n
## <fct> <int>
## 1 no 197
## 2 yes 109
pub_tib |> count(income_or_ses_reported)
## # A tibble: 2 × 2
## income_or_ses_reported n
## <fct> <int>
## 1 no 233
## 2 yes 73
pub_tib |> count(location_reported)
## # A tibble: 2 × 2
## location_reported n
## <fct> <int>
## 1 no 97
## 2 yes 209
pub_tib |> count(general_sample)
## # A tibble: 2 × 2
## general_sample n
## <fct> <int>
## 1 no 93
## 2 yes 213
pub_tib |> count(sample_size_justification)
## # A tibble: 8 × 2
## sample_size_justification n
## <fct> <int>
## 1 a-priori 26
## 2 accuracy 7
## 3 constraints 13
## 4 heuristics 11
## 5 no-justification 57
## 6 no-statement 81
## 7 population 5
## 8 <NA> 106
#2. Numerical variables: participants_male and participants_female are stored as chr
# use parse_number() and any non-numeric values will get NA
# but there's an inconsistency in reporting NA/no versus 0 - we would need to
# resolve the inconsistency to make use of that data
pub_tib <- pub_tib |> mutate(
participants_n = parse_number(stringr::word(participants_n), na = "no"),
participants_male = parse_number(stringr::word(participants_male), na = "no"),
participants_female = parse_number(stringr::word(participants_female), na = "no"),
participants_nonbin = parse_number(stringr::word(participants_nonbin), na = "no")
)
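A quick illustration of why `stringr::word()` is applied before parsing (the input strings here are made-up examples, not values from the dataset):

```r
# word() keeps only the first whitespace-delimited token, so any
# trailing text is dropped before the number is parsed
readr::parse_number(stringr::word("120 adults"))     # 120

# values listed in `na` are converted to NA instead of erroring
readr::parse_number(stringr::word("no"), na = "no")  # NA
```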
pub_tib |> select(participants_n:participants_nonbin) |> psych::describe()
## vars n mean sd median trimmed mad min max
## participants_n 1 306 1143.83 8356.92 112 200.83 127.50 7 138625
## participants_male 2 264 506.53 3847.89 45 93.28 53.37 0 61226
## participants_female 3 268 698.06 5137.24 62 108.82 72.65 0 77399
## participants_nonbin 4 48 3.56 11.76 0 0.88 0.00 0 75
## range skew kurtosis se
## participants_n 138618 14.85 237.47 477.73
## participants_male 61226 14.94 231.95 236.82
## participants_female 77399 13.08 185.72 313.81
## participants_nonbin 75 4.90 25.82 1.70
#3. check out "field"
pub_tib |> janitor::tabyl(field) #or use count(field)
## field n percent
## clinical, cog, dev 1 0.003267974
## cog 36 0.117647059
## cog , other 1 0.003267974
## cog, dev 14 0.045751634
## cog, dev, soc 1 0.003267974
## cog, neuro 19 0.062091503
## cog, neuro, dev 7 0.022875817
## cog, neuro, soc 1 0.003267974
## cog, other 2 0.006535948
## cog, soc 6 0.019607843
## cog,dev 5 0.016339869
## cog,neuro 45 0.147058824
## cog,neuro,dev 4 0.013071895
## cog,neuro,pers 1 0.003267974
## cog,other 1 0.003267974
## cog,soc 9 0.029411765
## cog,soc,dev 1 0.003267974
## conbeh 14 0.045751634
## dev 7 0.022875817
## dev, cog 5 0.016339869
## dev, other 1 0.003267974
## dev, soc 1 0.003267974
## dev,cog 2 0.006535948
## dev,cog,neuro 2 0.006535948
## dev,neuro 1 0.003267974
## dev,neuro,cog 1 0.003267974
## dev,neuro,social 1 0.003267974
## dev,soc 1 0.003267974
## dev,social 2 0.006535948
## neuro 20 0.065359477
## neuro, cognitive 2 0.006535948
## neuro,cog 2 0.006535948
## neuro,dev 1 0.003267974
## neuro,soc 1 0.003267974
## other 14 0.045751634
## other, dev 2 0.006535948
## other, neuro 1 0.003267974
## pers 7 0.022875817
## pers, soc 6 0.019607843
## pers, soc, cog 1 0.003267974
## pers,quant,cog 2 0.006535948
## psych, bio, social 1 0.003267974
## psych, cognitive 2 0.006535948
## psych, soc, neuro 1 0.003267974
## psych, social 3 0.009803922
## psych, social, personality 2 0.006535948
## soc 31 0.101307190
## soc & pers 1 0.003267974
## soc, cog 3 0.009803922
## soc, pers 1 0.003267974
## soc,cog,neuro 2 0.006535948
## soc,dev 1 0.003267974
## soc,neuro 1 0.003267974
## social 5 0.016339869
## social,cog,neuro 1 0.003267974
# now dummy code the "field" variable, allowing for multiple fields per entry
pub_tib <- pub_tib |>
mutate(
soc = if_else(str_detect(field,"soc"), 1, 0),
cog = if_else(str_detect(field,"cog"), 1, 0),
dev = if_else(str_detect(field,"dev"), 1, 0),
pers = if_else(str_detect(field,"pers"), 1, 0),
conbeh = if_else(str_detect(field,"con"), 1, 0),
neuro = if_else(str_detect(field,"neuro"), 1, 0),
quant = if_else(str_detect(field,"quant"), 1, 0),
other = if_else(str_detect(field,"other"), 1, 0),
#field_combo = ""
)
# print counts of each field
pub_tib |> select(soc:other) |> colSums(na.rm = TRUE)
## soc cog dev pers conbeh neuro quant other
## 84 179 61 21 14 114 2 22
# # only 2 quant cases so drop that column
# pub_tib <- pub_tib |> select(-quant)
We can discuss what questions to ask with the data and explore as much as we have time for, but let's start with an example that makes use of a contingency table and the chi-square test of independence:

- H1: `general_sample` and `race_ethn_reported` are related
- H0: `general_sample` and `race_ethn_reported` are independent

# 1. Contingency Table
q1_xtab <- pub_tib |>
with(gmodels::CrossTable(general_sample, race_ethn_reported, expected = TRUE,
prop.chisq = TRUE)) #use fisher=TRUE if expected counts <5
# 2. Cramers V (round to 3 decimals)
cat("Cramer's V: ", round(DescTools::CramerV(q1_xtab$t),3), "\n")
## Registered S3 method overwritten by 'DescTools':
## method from
## reorder.factor gdata
# 3. Extra: odds ratio
q1_odds <- vcd::oddsratio(q1_xtab$t, log=FALSE)
cat("odds ratio general_sample and race_ethn_reported")
q1_odds #interpretation: a general sample publication is YY as likely to report race/ethn compared to a non-general sample publication
confint(q1_odds) #confidence interval
##
##
## Cell Contents
## |-------------------------|
## | N |
## | Expected N |
## | Chi-square contribution |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 306
##
##
## | race_ethn_reported
## general_sample | no | yes | Row Total |
## ---------------|-----------|-----------|-----------|
## no | 45 | 48 | 93 |
## | 59.873 | 33.127 | |
## | 3.694 | 6.677 | |
## | 0.484 | 0.516 | 0.304 |
## | 0.228 | 0.440 | |
## | 0.147 | 0.157 | |
## ---------------|-----------|-----------|-----------|
## yes | 152 | 61 | 213 |
## | 137.127 | 75.873 | |
## | 1.613 | 2.915 | |
## | 0.714 | 0.286 | 0.696 |
## | 0.772 | 0.560 | |
## | 0.497 | 0.199 | |
## ---------------|-----------|-----------|-----------|
## Column Total | 197 | 109 | 306 |
## | 0.644 | 0.356 | |
## ---------------|-----------|-----------|-----------|
##
##
## Statistics for All Table Factors
##
##
## Pearson's Chi-squared test
## ------------------------------------------------------------
## Chi^2 = 14.89978 d.f. = 1 p = 0.0001133763
##
## Pearson's Chi-squared test with Yates' continuity correction
## ------------------------------------------------------------
## Chi^2 = 13.91479 d.f. = 1 p = 0.0001912875
##
##
## Cramer's V: 0.221
## odds ratio general_sample and race_ethn_reported odds ratios for x and y
##
## [1] 0.3762336
## 2.5 % 97.5 %
## no:yes/no:yes 0.2273702 0.6225603
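The odds ratio and its confidence interval can be reproduced by hand from the 2×2 cell counts; a sketch using the cross-product ratio and a Wald interval on the log odds ratio (the same asymptotic interval vcd reports):

```r
# observed counts from the contingency table above
#                      race_ethn: no   yes
n <- matrix(c(45,  48,   # general_sample = no
              152, 61),  # general_sample = yes
            nrow = 2, byrow = TRUE)

or <- (n[1, 1] * n[2, 2]) / (n[1, 2] * n[2, 1])   # cross-product ratio
se <- sqrt(sum(1 / n))                            # SE of log(OR)
ci <- exp(log(or) + c(-1, 1) * qnorm(0.975) * se)

round(or, 4)  # 0.3762
round(ci, 4)  # 0.2274 0.6226
```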
- Total observations = 306 means all cases were included.
- N is the observed joint frequency. For example, out of [how many?] samples that represent the general population, [how many?] of those papers reported race/ethnicity.
- Expected N is the expected joint frequency.
- Chi-square contribution measures the amount that a cell contributes to the overall chi-square statistic for the table (the sum of all contributions equals the overall chi-square value below the table).
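The expected counts and cell contributions in the table above follow directly from the row and column totals; a sketch of the arithmetic:

```r
# observed counts from the contingency table above
obs <- matrix(c(45,  48,
                152, 61), nrow = 2, byrow = TRUE)

# expected N = row total * column total / grand total
expected <- outer(rowSums(obs), colSums(obs)) / sum(obs)
round(expected, 3)      # 59.873 33.127 / 137.127 75.873

# each cell's chi-square contribution; their sum is the overall statistic
contrib <- (obs - expected)^2 / expected
round(contrib, 3)       # 3.694 6.677 / 1.613 2.915
round(sum(contrib), 5)  # 14.89978
```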
For this example we could report:
The three-way loglinear analysis produced a final model that retained
interactions of (a) race reporting by income/ses reporting and (b)
location reporting by income/ses reporting. The likelihood ratio of this
final model was χ2(2) = 3.222, p = .200, indicating the model did not
significantly differ from the observed frequencies. The highest-order
interaction (race reporting by income/ses reporting by location
reporting) was not significant, (comparison of model without the
three-way interaction to the full model: χ2(1) = 0.986, p = .320), and
the race/ethnicity-reported by location-reported interaction was not
significant (model without this term compared to model with all
2nd-order interactions: χ2(1) = 2.236, p = .135). To break down the
associations, separate chi-square tests were performed to examine (a)
race/ethnicity reporting by income/ses reporting (combining location
reporting categories) and (b) location reporting by income/ses reporting
(combining race/ethnicity reporting categories). There was a significant
association between race/ethnicity-reported and whether location was
reported, χ2(1) = 24.961, p < .0001, odds ratio = 5.325. Furthermore,
there was a significant association between income/ses-reported and
whether location was reported, χ2(1) = 11.447, p = .001, odds ratio =
4.055. [plots/contingency tables can be used to characterize the two-way
associations completely]
- see [notes on the SPSS analysis](../spss/loglin-inclass2022-spss.html) linked on Canvas; it may be helpful to match up the SPSS output with the output from the same analysis in R above
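A hedged sketch of how the loglinear analysis reported above could be fit in R, assuming the three reporting variables are still factors in `pub_tib`. `step()` here eliminates terms backward by AIC; the reported model comparisons are likelihood-ratio tests, which `anova()` on two nested `loglm` fits would give directly:

```r
library(MASS)

# three-way frequency table of the reporting variables
rep_tab <- xtabs(~ race_ethn_reported + income_or_ses_reported + location_reported,
                 data = pub_tib)

# saturated model, then backward elimination of higher-order terms
sat <- loglm(~ race_ethn_reported * income_or_ses_reported * location_reported,
             data = rep_tab)
final <- step(sat, direction = "backward")

summary(final)  # likelihood-ratio test of the retained model
```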
pub_tib |>
mutate(
race_ethn_reported = as.numeric(race_ethn_reported)-1, #factor levels are 1=no, 2=yes
income_or_ses_reported = as.numeric(income_or_ses_reported)-1, #so we subtract 1
location_reported = as.numeric(location_reported)-1 #to end up with 0, 1 values
) |>
readr::write_csv("data/collab_data_cleaned.csv")