14.6 4. Check for associations between missing and observed data: missing_pairs | missing_compare

In deciding whether data are MCAR or MAR, one approach is to explore patterns of missingness between levels of included variables. This is particularly important (we would say absolutely required) for a primary outcome measure / dependent variable.

Take for example “death”. When that outcome is missing it is often for a particular reason. For example, perhaps patients undergoing emergency surgery were less likely to have complete records compared with those undergoing planned surgery. And of course, death is more likely after emergency surgery.
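The emergency surgery scenario can be made concrete with a small simulation. This is only a sketch in base R: the cohort, the variable names (surgery, death, death_obs) and all of the probabilities are invented for illustration and have nothing to do with colon_s.

```r
# Hypothetical simulated cohort; none of these names or numbers come from colon_s.
set.seed(1)
n <- 5000
surgery <- factor(
  sample(c("planned", "emergency"), n, replace = TRUE, prob = c(0.7, 0.3)),
  levels = c("planned", "emergency")
)

# Assumed true risk of death: higher after emergency surgery.
p_death <- ifelse(surgery == "emergency", 0.30, 0.10)
death <- rbinom(n, 1, p_death)

# Assumed missingness: the outcome is lost more often for emergency cases.
# Missingness depends only on an observed variable (surgery), i.e. MAR.
p_miss <- ifelse(surgery == "emergency", 0.40, 0.05)
death_obs <- ifelse(rbinom(n, 1, p_miss) == 1, NA, death)

# Proportion of missing outcomes by type of surgery.
miss_by_surgery <- tapply(is.na(death_obs), surgery, mean)
print(round(miss_by_surgery, 3))

# Death rate in the full (normally unseen) data versus complete cases only.
rate_full <- mean(death)
rate_complete <- mean(death_obs, na.rm = TRUE)
print(round(c(full = rate_full, complete_cases = rate_complete), 3))
```

Under these assumptions the complete cases under-represent emergency patients, so the complete-case death rate should come out lower than the rate in the full data. That is the reason for checking an outcome variable for patterns of missingness.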

missing_pairs() uses functions from the excellent GGally package. It produces pairs plots to show relationships between missing values and observed values in all variables.

explanatory = c("age", "sex.factor",
  "nodes", "obstruct.factor",
  "smoking_mcar", "smoking_mar")
dependent = "mort_5yr"
colon_s %>%
  missing_pairs(dependent, explanatory)

For continuous variables (age and nodes), the distributions of observed and missing data can be compared visually. In the plot above, does the distribution of age differ between patients with observed and missing mortality?

For discrete data, counts are presented by default. It is often easier to compare proportions:

colon_s %>%
  missing_pairs(dependent, explanatory, position = "fill")

It should be obvious that missingness in Smoking (MCAR) does not relate to sex (row 6, column 3). But missingness in Smoking (MAR) does differ by sex (last row, column 3) as was designed above when the missing data were created.
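The same comparison of counts and proportions can be made numerically. The sketch below uses base R on simulated data. The variables and the missingness probabilities (0.4 for women, 0.1 for men) are assumptions chosen for illustration and are not the values used to create smoking_mar.

```r
# Hypothetical data: smoking is made missing more often in women.
set.seed(2)
n <- 1000
sex <- factor(sample(c("Female", "Male"), n, replace = TRUE))
smoking <- factor(sample(c("Non-smoker", "Smoker"), n, replace = TRUE))
smoking[sex == "Female" & runif(n) < 0.4] <- NA
smoking[sex == "Male" & runif(n) < 0.1] <- NA

# Label each row by whether smoking was recorded.
smoking_status <- factor(
  ifelse(is.na(smoking), "Missing", "Not missing"),
  levels = c("Not missing", "Missing")
)

# Counts, as shown by default.
counts <- table(sex, smoking_status)
print(counts)

# Row proportions: the share of each sex with missing smoking data.
props <- prop.table(counts, margin = 1)
print(round(props, 3))
```

Row proportions remove the effect of unequal group sizes, which is what position = "fill" does for the bar charts.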

We can confirm this using missing_compare().

explanatory = c("age", "sex.factor",
  "nodes", "obstruct.factor")
dependent = "smoking_mcar"

missing_mcar = colon_s %>%
  missing_compare(dependent, explanatory)
## Warning: Factor obstruct.factor contains implicit NA, consider using
## forcats::fct_explicit_na
TABLE 10.1: Missing data comparison: Smoking (MCAR).

Missing data analysis: Smoking (MCAR)   Not missing    Missing        p
Age (years)         Mean (SD)           59.7 (11.9)    59.9 (12.6)    0.882
Sex                 Female              399 (48.2)     46 (45.5)      0.692
                    Male                429 (51.8)     55 (54.5)
nodes               Mean (SD)           3.6 (3.4)      4.0 (4.5)      0.302
Obstruction         No                  654 (80.7)     78 (79.6)      0.891
                    Yes                 156 (19.3)     20 (20.4)

dependent = "smoking_mar"

missing_mar = colon_s %>%
  missing_compare(dependent, explanatory)
## Warning: Factor obstruct.factor contains implicit NA, consider using
## forcats::fct_explicit_na
TABLE 10.2: Missing data comparison: Smoking (MAR).

Missing data analysis: Smoking (MAR)    Not missing    Missing        p
Age (years)         Mean (SD)           59.9 (11.8)    59.4 (12.6)    0.632
Sex                 Female              288 (39.7)     157 (77.3)     <0.001
                    Male                438 (60.3)     46 (22.7)
nodes               Mean (SD)           3.6 (3.5)      3.9 (3.9)      0.321
Obstruction         No                  568 (80.1)     164 (82.4)     0.533
                    Yes                 141 (19.9)     35 (17.6)

missing_compare() takes dependent and explanatory variables, but in this context dependent simply refers to the variable being tested for missingness against the explanatory variables.

Comparisons for continuous data use a Kruskal-Wallis test, and comparisons for discrete data use a chi-squared test.

As expected, a relationship is seen between Sex and Smoking (MAR) but not Smoking (MCAR).
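To make the two tests named above concrete, here is a minimal sketch that runs them by hand with base R's kruskal.test() and chisq.test(). It uses simulated data with hypothetical variables and an assumed missingness mechanism. It illustrates the idea and is not a reproduction of how missing_compare() works internally.

```r
# Hypothetical data: age is unrelated to missingness, sex is related to it.
set.seed(3)
n <- 1000
age <- round(rnorm(n, mean = 60, sd = 12))
sex <- factor(sample(c("Female", "Male"), n, replace = TRUE))

# Assumed MAR mechanism: smoking is missing more often in women.
p_miss <- ifelse(sex == "Female", 0.35, 0.10)
smoking_missing <- factor(
  ifelse(runif(n) < p_miss, "Missing", "Not missing"),
  levels = c("Not missing", "Missing")
)

# Continuous variable: Kruskal-Wallis test of age across missingness groups.
kw <- kruskal.test(age ~ smoking_missing)

# Discrete variable: chi-squared test of sex against missingness.
chi <- chisq.test(table(sex, smoking_missing))

print(c(age_p = kw$p.value, sex_p = chi$p.value))
```

Because missingness was generated from sex alone, the p-value for sex should be very small, while any association with age arises only by chance.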

14.6.1 For those who like an omnibus test

If you work predominantly with continuous rather than categorical data, you may find these tests from the MissMech package useful. The package and its output are well documented, and it provides two tests which can be used to determine whether data are MCAR.

library(MissMech)
explanatory = c("age", "nodes")
dependent = "mort_5yr"

colon_s %>%
  select(all_of(explanatory)) %>%
  MissMech::TestMCARNormality()
## Call:
## MissMech::TestMCARNormality(data = .)
##
## Number of Patterns:  2
##
## Total number of cases used in the analysis:  929
##
##  Pattern(s) used:
##           age   nodes   Number of cases
## group.1     1       1               911
## group.2     1      NA                18
##
##
##     Test of normality and Homoscedasticity:
##   -------------------------------------------
##
## Hawkins Test:
##
##     P-value for the Hawkins test of normality and homoscedasticity:  7.607252e-14
##
##     Either the test of multivariate normality or homoscedasticity (or both) is rejected.
##     Provided that normality can be assumed, the hypothesis of MCAR is
##     rejected at 0.05 significance level.
##
## Non-Parametric Test:
##
##     P-value for the non-parametric test of homoscedasticity:  0.6171955
##
##     Reject Normality at 0.05 significance level.
##     There is not sufficient evidence to reject MCAR at 0.05 significance level.