11.7 Handling missing data: MCAR

Prior to a standard regression analysis, we can do one of the following:

  • Delete the variable with the missing data
  • Delete the cases with the missing data
  • Impute (fill in) the missing data
  • Model the missing data

From the examples, we have identified that smoking (MCAR) is missing completely at random.

We know nothing about the missing values themselves, but we know of no plausible reason why the missing values for, say, people who died should differ from the missing values for those who survived. The pattern of missingness is therefore not thought to be MNAR.

11.7.1 Common solution: row-wise deletion

Depending on the number of data points that are missing, we may have sufficient power with complete cases to examine the relationships of interest.

We therefore elect to omit the patients in whom smoking is missing. This is known as list-wise deletion (also called row-wise deletion or complete case analysis). It is performed by default, and usually silently, by any standard regression function.

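library(finalfit)
library(dplyr)  # provides the %>% pipe

explanatory <- c("age", "sex.factor",
                 "nodes", "obstruct.factor",
                 "smoking_mcar")
dependent <- "mort_5yr"
# Rows with a missing value in any model variable are dropped (list-wise deletion)
fit <- colon_s %>%
  finalfit(dependent, explanatory)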
TABLE 8.2: Regression analysis with missing data: List-wise deletion.
Dependent: Mortality 5 year              Alive        Died         OR (univariable)           OR (multivariable)
Age (years)                  Mean (SD)   59.8 (11.4)  59.9 (12.5)  1.00 (0.99-1.01, p=0.986)  1.01 (1.00-1.02, p=0.200)
Sex                          Female      243 (55.6)   194 (44.4)   -                          -
                             Male        268 (56.1)   210 (43.9)   0.98 (0.76-1.27, p=0.889)  1.02 (0.76-1.38, p=0.872)
nodes                        Mean (SD)   2.7 (2.4)    4.9 (4.4)    1.24 (1.18-1.30, p<0.001)  1.25 (1.18-1.33, p<0.001)
Obstruction                  No          408 (56.7)   312 (43.3)   -                          -
                             Yes         89 (51.1)    85 (48.9)    1.25 (0.90-1.74, p=0.189)  1.53 (1.05-2.22, p=0.027)
Smoking (MCAR)               Non-smoker  358 (56.4)   277 (43.6)   -                          -
                             Smoker      90 (49.7)    91 (50.3)    1.31 (0.94-1.82, p=0.113)  1.37 (0.96-1.96, p=0.083)
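
Because list-wise deletion is usually silent, it can be worth confirming how many rows a model actually used. The following is a minimal sketch in base R on simulated data. The data frame `toy` and its columns are hypothetical and are not taken from colon_s. It compares the number of rows supplied with the number of rows used by `glm()`.

```r
set.seed(1)
n <- 200

# Hypothetical toy data for illustration only (not colon_s)
toy <- data.frame(
  age     = rnorm(n, mean = 60, sd = 12),
  smoking = factor(sample(c("Non-smoker", "Smoker"), n, replace = TRUE))
)
toy$died <- rbinom(n, 1, plogis(-0.5 + 0.5 * (toy$smoking == "Smoker")))

# Make 30 smoking values missing completely at random
toy$smoking[sample(n, 30)] <- NA

# glm() drops incomplete rows by default without a warning
fit_cc <- glm(died ~ age + smoking, data = toy, family = binomial)

n_total   <- nrow(toy)                 # rows supplied
n_used    <- nobs(fit_cc)              # rows used in the fit
n_dropped <- length(fit_cc$na.action)  # rows removed for missingness

c(total = n_total, used = n_used, dropped = n_dropped)
```

If the number of dropped rows is large relative to the total, the loss of power from a complete case analysis may be substantial.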

11.7.2 Other considerations

  • Sensitivity analysis
  • Omit the variable
  • Imputation
  • Model the missing data

If the variable in question is thought to be particularly important, you may wish to perform a sensitivity analysis. A sensitivity analysis in this context aims to capture the effect of uncertainty on the conclusions drawn from the model. Thus, you may choose to re-label all missing smoking values as “smoker”, and see if that changes the conclusions of your analysis. The same procedure can be performed labelling with “non-smoker”.
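
The relabelling described above can be sketched as follows. This is an illustration in base R on simulated data, not the analysis of colon_s. The data frame `toy`, its columns, and the helper `refit_with()` are all hypothetical names introduced here.

```r
set.seed(1)
n <- 200

# Hypothetical toy data for illustration only (not colon_s)
toy <- data.frame(
  age     = rnorm(n, mean = 60, sd = 12),
  smoking = factor(sample(c("Non-smoker", "Smoker"), n, replace = TRUE))
)
toy$died <- rbinom(n, 1, plogis(-0.5 + 0.5 * (toy$smoking == "Smoker")))
toy$smoking[sample(n, 30)] <- NA  # 30 values missing completely at random

# Hypothetical helper: relabel every missing smoking value as `level`,
# refit the model and return the odds ratio for smoking
refit_with <- function(data, level) {
  data$smoking[is.na(data$smoking)] <- level
  fit <- glm(died ~ age + smoking, data = data, family = binomial)
  unname(exp(coef(fit))["smokingSmoker"])
}

# Complete case analysis for comparison (missing rows dropped)
fit_cc <- glm(died ~ age + smoking, data = toy, family = binomial)

ors <- c(
  complete_case  = unname(exp(coef(fit_cc))["smokingSmoker"]),
  all_smoker     = refit_with(toy, "Smoker"),
  all_non_smoker = refit_with(toy, "Non-smoker")
)
round(ors, 2)
```

If the odds ratios from the two extreme scenarios lead to the same conclusion as the complete case analysis, the result is less likely to be driven by the missing values.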

If smoking is not associated with the explanatory variable of interest or the outcome, it may be considered not to be a confounder and so could be omitted. That deals with the missing data issue, but of course may not always be appropriate.
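
As a rough sketch of this option, again in base R on simulated data with hypothetical names (`toy` and its columns are not from colon_s), one might first look at the crude association between smoking and the outcome, and then confirm that dropping the variable restores the full sample. A non-significant test alone does not rule out confounding, so this is a starting point rather than a decision rule.

```r
set.seed(1)
n <- 200

# Hypothetical toy data for illustration only (not colon_s)
toy <- data.frame(
  age     = rnorm(n, mean = 60, sd = 12),
  smoking = factor(sample(c("Non-smoker", "Smoker"), n, replace = TRUE))
)
toy$died <- rbinom(n, 1, plogis(-0.5 + 0.5 * (toy$smoking == "Smoker")))
toy$smoking[sample(n, 30)] <- NA  # 30 values missing completely at random

# Crude association between smoking and outcome, observed values only
# (table() excludes NA by default)
p_assoc <- chisq.test(table(toy$smoking, toy$died))$p.value

# Model with smoking uses complete cases; model without uses every row
fit_with    <- glm(died ~ age + smoking, data = toy, family = binomial)
fit_without <- glm(died ~ age, data = toy, family = binomial)

c(rows_with = nobs(fit_with), rows_without = nobs(fit_without))
```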

Imputation and modelling are considered below.