## 14.5 3. Look for patterns of missingness: missing_pattern

In finalfit, missing_pattern() wraps md.pattern() from the mice package. It produces a table and a plot showing which variables are missing together.
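For orientation, the underlying mice function can also be called directly. The sketch below assumes the mice package is installed and uses its bundled nhanes example data rather than colon_s; plot = FALSE suppresses the figure so that only the pattern matrix is returned.

```r
library(mice)

# nhanes ships with mice: 25 rows, 4 variables (age, bmi, hyp, chl)
patterns = md.pattern(nhanes, plot = FALSE)

# One row per missingness pattern, plus a final row of per-variable totals
patterns
```

The example that follows uses the finalfit wrapper instead, which takes the dependent and explanatory variable names.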

explanatory = c("age", "sex.factor",
                "obstruct.factor",
                "smoking_mcar", "smoking_mar")
dependent = "mort_5yr"

colon_s %>%
  missing_pattern(dependent, explanatory)

##     age sex.factor mort_5yr obstruct.factor smoking_mcar smoking_mar
## 631   1          1        1               1            1           1   0
## 167   1          1        1               1            1           0   1
## 69    1          1        1               1            0           1   1
## 27    1          1        1               1            0           0   2
## 14    1          1        1               0            1           1   1
## 4     1          1        1               0            1           0   2
## 3     1          1        1               0            0           1   2
## 8     1          1        0               1            1           1   1
## 4     1          1        0               1            1           0   2
## 1     1          1        0               1            0           1   2
## 1     1          1        0               1            0           0   3
##       0          0       14              21          101         203 339

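This allows us to look for patterns of missingness between variables. Each row of the table is one pattern: 1 means the variable is observed and 0 means it is missing. The number on the left is how many patients share that pattern, and the number on the right is how many variables are missing in it. The bottom row gives the number of missing values for each variable, with the overall total (339) in the final column. There are 11 patterns in these data, and the most common one (631 patients) has no missing values at all. The number and pattern of missingness help us to determine the likelihood of it being random rather than systematic.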
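To make the structure of this table concrete, the same kind of summary can be built by hand with base R. This is a minimal sketch on a hypothetical toy data frame (the `toy` object and its values are invented for illustration and are not part of colon_s); it is not how md.pattern() is implemented, only a way to see what its rows and margins count.

```r
# Hypothetical toy data, invented for illustration
toy = data.frame(
  age     = c(50, 61, NA, 45, 70, 38),
  smoking = c("yes", NA, "no", NA, "no", "yes"),
  outcome = c(1, 0, 1, NA, 0, 1)
)

# 1 = observed, 0 = missing, the same coding as the pattern table
observed = 1 * !is.na(toy)

# Collapse each row into a pattern string and count how often each occurs
pattern_counts = table(apply(observed, 1, paste, collapse = ""))
pattern_counts

# Missing values per variable (the bottom row of the pattern table)
missing_per_variable = colSums(is.na(toy))
missing_per_variable
```

Here the pattern "111" corresponds to complete cases, and the length of `pattern_counts` is the number of distinct patterns.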

### 14.5.1 Make sure you include missing data in demographics tables

Table 1 in a healthcare study is often a demographics table of an “explanatory variable of interest” against other explanatory variables/confounders. Do not silently drop missing values in this table. It is easy to do this correctly with summary_factorlist(). This function provides a useful summary of a dependent variable against explanatory variables. Despite its name, continuous variables are handled nicely.

na_include=TRUE ensures missing data from the explanatory variables (but not dependent) are included. Note that any p-values are generated across missing groups as well, so run a second time with na_include=FALSE if you wish a hypothesis test only over observed data.
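The effect of including the missing group in a hypothesis test can be seen with base R alone. The sketch below uses hypothetical counts (the `smoking` and `obstruct` vectors are invented for illustration) and assumes a chi-squared test; it illustrates the idea rather than reproducing what summary_factorlist() does internally.

```r
# Hypothetical data: smoking status with missing values, by bowel obstruction
smoking  = c(rep("Non-smoker", 40), rep("Smoker", 20), rep(NA, 20),
             rep("Non-smoker", 30), rep("Smoker", 25), rep(NA, 5))
obstruct = c(rep("No", 80), rep("Yes", 60))

# Missing treated as its own group: a 3 x 2 table
with_missing = table(smoking, obstruct, useNA = "ifany")

# Observed data only: table() drops NA by default, giving a 2 x 2 table
observed_only = table(smoking, obstruct)

# The two tests answer different questions and can give different p-values
chisq.test(with_missing)$p.value
chisq.test(observed_only)$p.value
```

The first test asks whether the distribution across all three groups, including missing, differs by obstruction; the second compares smokers and non-smokers among patients with a recorded smoking status only.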

# Explanatory or confounding variables
explanatory = c("age", "sex.factor",
                "nodes",
                "smoking_mcar", "smoking_mar")

# Explanatory variable of interest
dependent = "obstruct.factor" # Bowel obstruction

table1 = colon_s %>%
  summary_factorlist(dependent, explanatory,
                     na_include=TRUE, p=TRUE)
## Note: dependent includes missing data. These are dropped.
TABLE 12.3: Simulated missing completely at random (MCAR) and missing at random (MAR) dataset.

| label | levels | No | Yes | p |
|---|---|---|---|---|
| Age (years) | Mean (SD) | 60.2 (11.5) | 57.3 (13.3) | 0.004 |
| Sex | Female | 346 (47.3) | 91 (51.7) | 0.330 |
| | Male | 386 (52.7) | 85 (48.3) | |
| nodes | Mean (SD) | 3.7 (3.7) | 3.5 (3.2) | 0.435 |
| Smoking (MCAR) | Non-smoker | 500 (68.3) | 130 (73.9) | 0.080 |
| | Smoker | 154 (21.0) | 26 (14.8) | |
| | (Missing) | 78 (10.7) | 20 (11.4) | |
| Smoking (MAR) | Non-smoker | 456 (62.3) | 115 (65.3) | 0.822 |
| | Smoker | 112 (15.3) | 26 (14.8) | |
| | (Missing) | 164 (22.4) | 35 (19.9) | |