8.11 Pearson’s chi-squared and Fisher’s exact tests

Pearson’s chi-squared (\(\chi^2\)) test of independence is used to determine whether two categorical variables are independent in a given population. Independence here means that the relative frequencies of one variable are the same over all levels of another variable.
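For reference, the test statistic is usually written as below. The notation is an assumption introduced here rather than taken from this chapter: \(O_{ij}\) is the observed count in row \(i\) and column \(j\) of the table, \(E_{ij}\) is the count expected under independence, and \(N\) is the total number of observations.

```latex
\[
\chi^2 = \sum_{i}\sum_{j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}},
\qquad
E_{ij} = \frac{(\text{row } i \text{ total}) \times (\text{column } j \text{ total})}{N}
\]
```

Under the null hypothesis, this statistic approximately follows a chi-squared distribution with \((r - 1)(c - 1)\) degrees of freedom for a table with \(r\) rows and \(c\) columns, which gives 1 degree of freedom for a 2x2 table.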

A common setting for this is the classic 2x2 table. This refers to two categorical variables with exactly two levels each, as shown in Table 8.1 above. The null hypothesis of independence for this particular question is that the proportion of patients who die from melanoma is the same for ulcerated and non-ulcerated tumours. The observed proportions are 45.6% and 13.9% respectively. From the raw frequencies, there seems to be a large difference, as we noted in the plot we made above.

8.11.1 Base R

Base R has reliable functions for all common statistical tests, but extracting results from them is sometimes a little inconvenient.

A table of counts can be constructed either by using $ to identify columns or by using the with() function.

table(meldata$ulcer.factor, meldata$status_dss) # both give same result
with(meldata, table(ulcer.factor, status_dss))
##          
##           Alive Died melanoma
##   Absent     99            16
##   Present    49            41

When working with older R functions, a useful shortcut is the exposition pipe-operator (%$%) from the magrittr package, home of the standard forward pipe-operator (%>%). The exposition pipe-operator exposes data frame/tibble columns on the left to the function which follows on the right. It’s easier to see in action by making a table of counts.

library(magrittr)
meldata %$%        # note $ sign here
  table(ulcer.factor, status_dss)
##             status_dss
## ulcer.factor Alive Died melanoma
##      Absent     99            16
##      Present    49            41

The counts table can be passed to prop.table() for proportions.

meldata %$%
  table(ulcer.factor, status_dss) %>% 
  prop.table(margin = 1)     # 1: row, 2: column etc.
##             status_dss
## ulcer.factor     Alive Died melanoma
##      Absent  0.8608696     0.1391304
##      Present 0.5444444     0.4555556

Similarly, the counts table can be passed to chisq.test() to perform the chi-squared test.

meldata %$%        # note $ sign here
  table(ulcer.factor, status_dss) %>% 
  chisq.test()
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  .
## X-squared = 23.631, df = 1, p-value = 1.167e-06
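The object returned by chisq.test() can also be inspected directly, which shows where the printed values come from. The following is a minimal sketch rather than part of the original analysis: the counts are typed in by hand from the table above so that it runs without meldata, and the names counts and res are illustrative.

```r
# Counts typed in from the table above, so this sketch runs without meldata
counts <- matrix(
  c(99, 16,
    49, 41),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    ulcer.factor = c("Absent", "Present"),
    status_dss   = c("Alive", "Died melanoma")
  )
)

res <- chisq.test(counts)   # an object of class "htest", stored as a list

res$statistic   # the X-squared value
res$parameter   # degrees of freedom
res$p.value     # p-value as a plain number
res$expected    # expected counts under independence
```

Each element is accessed with $, which works but becomes repetitive when several tests need to be collected into one table.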

The result can be extracted into a tibble using the tidy() function from the broom package.

library(broom)
meldata %$%        # note $ sign here
  table(ulcer.factor, status_dss) %>% 
  chisq.test() %>% 
  tidy()
## # A tibble: 1 x 4
##   statistic    p.value parameter method                                         
##       <dbl>      <dbl>     <int> <chr>                                          
## 1      23.6 0.00000117         1 Pearson's Chi-squared test with Yates' continu…

The chisq.test() function applies Yates’ continuity correction by default when given a 2x2 table. The test approximates the discrete distribution of the observed counts with the continuous chi-squared distribution, which introduces some error. The correction subtracts 0.5 from the absolute difference between each observed and expected count before squaring. This is particularly helpful when counts are low, but it can be turned off if desired with chisq.test(..., correct = FALSE).
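To make the correction concrete, the statistic can be computed by hand and compared with chisq.test(). This is a sketch under stated assumptions: the counts are typed in from the table above instead of being taken from meldata, and the names counts, expected, x2_plain and x2_yates are illustrative.

```r
# Counts typed in from the table above, so this sketch runs without meldata
counts <- matrix(
  c(99, 16,
    49, 41),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    ulcer.factor = c("Absent", "Present"),
    status_dss   = c("Alive", "Died melanoma")
  )
)

# Expected counts under independence: row total * column total / grand total
expected <- outer(rowSums(counts), colSums(counts)) / sum(counts)

# Pearson statistic without the correction
x2_plain <- sum((counts - expected)^2 / expected)

# With Yates' correction: reduce each absolute difference by 0.5 before squaring
x2_yates <- sum((abs(counts - expected) - 0.5)^2 / expected)

# The hand calculations should agree with chisq.test()
all.equal(x2_plain, unname(chisq.test(counts, correct = FALSE)$statistic))
all.equal(x2_yates, unname(chisq.test(counts)$statistic))
```

The corrected statistic is always the smaller of the two, so the correction makes the test slightly more conservative.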