11.5 Decide how to handle missing data

Prior to a standard regression analysis, we can either:

  • Delete the variable with the missing data
  • Delete the cases with the missing data
  • Impute (fill in) the missing data
  • Model the missing data

11.5.1 MCAR vs MAR

Using the examples, we identify that Smoking (MCAR) is missing completely at random.

We know nothing about the missing values themselves, but we know of no plausible reason that the values of the missing data for, say, people who died should be different to the values of the missing data for those who survived. The pattern of missingness is therefore not felt to be MNAR.

Common solution

Depending on the number of data points that are missing, we may have sufficient power with complete cases to examine the relationships of interest.

We therefore elect to omit the patients in whom smoking is missing. This is known as list-wise deletion and will be performed by default and usually silently by any standard regression function.
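As a minimal sketch of what "silently" means here, the following uses simulated data rather than the dataset analysed in this chapter; the data frame `d` and its columns are made up for illustration. Base R's `glm()` normally takes its missing-value handling from `options("na.action")`, which is `na.omit` unless changed, so comparing `nobs()` with `nrow()` makes the dropped rows visible.

```r
# Simulated data for illustration only (not the colon cancer data).
set.seed(1)
n <- 500
d <- data.frame(
  age     = rnorm(n, mean = 60, sd = 12),
  smoking = factor(sample(c("Non-smoker", "Smoker"), n, replace = TRUE,
                          prob = c(0.8, 0.2)))
)
d$died <- rbinom(n, 1, plogis(-0.5 + 0.4 * (d$smoking == "Smoker")))

# Set exactly 50 smoking values to missing, completely at random (MCAR).
d$smoking[sample(n, 50)] <- NA

# glm() drops rows with a missing value in any model variable, without a message.
fit <- glm(died ~ age + smoking, data = d, family = binomial)

n_total    <- nrow(d)   # rows supplied to glm()
n_used     <- nobs(fit) # rows actually used in the fit
n_complete <- sum(complete.cases(d[, c("died", "age", "smoking")]))

cat("Rows supplied:", n_total, "- rows used in the model:", n_used, "\n")
```

Checking the number of observations used against the number supplied is a cheap habit that catches unexpected list-wise deletion early.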

## Warning: Factor `obstruct.factor` contains implicit NA, consider using
## `forcats::fct_explicit_na`
## Warning: Factor `smoking_mcar` contains implicit NA, consider using
## `forcats::fct_explicit_na`
TABLE 11.2: Regression analysis with missing data: list-wise deletion.

| Dependent: Mortality 5 year |  | Alive | Died | OR (univariable) | OR (multivariable) |
|---|---|---|---|---|---|
| Age (years) | Mean (SD) | 59.8 (11.4) | 59.9 (12.5) | 1.00 (0.99-1.01, p=0.986) | 1.01 (1.00-1.02, p=0.200) |
| Sex | Female | 243 (55.6) | 194 (44.4) |  |  |
|  | Male | 268 (56.1) | 210 (43.9) | 0.98 (0.76-1.27, p=0.889) | 1.02 (0.76-1.38, p=0.872) |
| nodes | Mean (SD) | 2.7 (2.4) | 4.9 (4.4) | 1.24 (1.18-1.30, p<0.001) | 1.25 (1.18-1.33, p<0.001) |
| Obstruction | No | 408 (56.7) | 312 (43.3) |  |  |
|  | Yes | 89 (51.1) | 85 (48.9) | 1.25 (0.90-1.74, p=0.189) | 1.53 (1.05-2.22, p=0.027) |
| Smoking (MCAR) | Non-smoker | 358 (56.4) | 277 (43.6) |  |  |
|  | Smoker | 90 (49.7) | 91 (50.3) | 1.31 (0.94-1.82, p=0.113) | 1.37 (0.96-1.96, p=0.083) |

Other considerations

  • Sensitivity analysis
  • Omit the variable
  • Imputation
  • Model the missing data

If the variable in question is thought to be particularly important, you may wish to perform a sensitivity analysis. A sensitivity analysis in this context aims to capture the effect of uncertainty on the conclusions drawn from the model. Thus, you may choose to re-label all missing smoking values as “smoker”, and see if that changes the conclusions of your analysis. The same procedure can be performed labeling with “non-smoker”.
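The relabelling procedure can be sketched as follows. This is an illustration on simulated data, not the analysis from this chapter: the data frame `d`, the effect sizes and the helper `smoking_or()` are all hypothetical.

```r
# Simulated data for illustration only.
set.seed(2)
n <- 500
d <- data.frame(
  smoking = sample(c("Non-smoker", "Smoker"), n, replace = TRUE,
                   prob = c(0.8, 0.2)),
  stringsAsFactors = FALSE
)
d$died <- rbinom(n, 1, plogis(-0.5 + 0.4 * (d$smoking == "Smoker")))
d$smoking[sample(n, 50)] <- NA  # 50 values missing completely at random

# Hypothetical helper: optionally relabel missing smoking values as `fill`,
# refit the model, and return the odds ratio for Smoker vs Non-smoker.
smoking_or <- function(data, fill = NULL) {
  if (!is.null(fill)) data$smoking[is.na(data$smoking)] <- fill
  data$smoking <- factor(data$smoking, levels = c("Non-smoker", "Smoker"))
  fit <- glm(died ~ smoking, data = data, family = binomial)
  unname(exp(coef(fit)["smokingSmoker"]))
}

results <- c(
  complete_case  = smoking_or(d),               # list-wise deletion
  all_smoker     = smoking_or(d, "Smoker"),     # missing relabelled as smoker
  all_non_smoker = smoking_or(d, "Non-smoker")  # missing relabelled as non-smoker
)
print(round(results, 2))
```

If the three odds ratios lead to the same conclusion, the result is reasonably robust to the missing values; if they diverge, the missing data matter and deserve more careful handling.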

If smoking is not associated with the explanatory variable of interest or the outcome, it may be considered not to be a confounder and so could be omitted. That deals with the missing data issue, but of course may not always be appropriate.

Imputation and modelling are considered below.

11.5.2 MCAR vs MAR

But life is rarely that simple.

Consider that the smoking variable is more likely to be missing if the patient is female (missing_compare shows a relationship), but say that the missing values are not different from the observed values. Missingness is then MAR.

If we simply drop all the patients for whom smoking is missing (list-wise deletion), then we drop relatively more females than males. This may have consequences for our conclusions if sex is associated with our explanatory variable of interest or outcome.
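A small simulation can make this concrete. It is a sketch in base R on made-up data: the missingness probabilities (30% for females, 5% for males) are assumptions chosen for illustration, and the cross-tabulation stands in for the kind of comparison that missing_compare reports.

```r
# Simulated data for illustration only.
set.seed(3)
n <- 1000
d <- data.frame(
  sex     = factor(sample(c("Female", "Male"), n, replace = TRUE)),
  smoking = factor(sample(c("Non-smoker", "Smoker"), n, replace = TRUE,
                          prob = c(0.8, 0.2)))
)

# MAR: the chance of smoking being missing depends on observed sex,
# not on the (unobserved) smoking value itself.
p_missing <- ifelse(d$sex == "Female", 0.30, 0.05)
d$smoking[runif(n) < p_missing] <- NA

# Is missingness in smoking associated with sex?
d$smoking_missing <- is.na(d$smoking)
tab <- table(sex = d$sex, smoking_missing = d$smoking_missing)
print(round(prop.table(tab, margin = 1), 2))  # row proportions by sex
test <- chisq.test(tab)
print(test$p.value)

# Effect of list-wise deletion on the sex balance of the analysed sample.
prop_female_all <- mean(d$sex == "Female")
prop_female_cc  <- mean(d$sex[!d$smoking_missing] == "Female")
cat("Proportion female, all patients:  ", round(prop_female_all, 2), "\n")
cat("Proportion female, complete cases:", round(prop_female_cc, 2), "\n")
```

The complete-case sample under-represents females, which is exactly the mechanism by which list-wise deletion can distort results under MAR.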

Common solution

mice is our go-to package for multiple imputation. That’s the process of filling in missing data using a best estimate from all the other data that exist. When first encountered, this may not sound like a good idea.

However, taking our simple example, if missingness in smoking is predicted strongly by sex (and other observed variables), and the values of the missing data are random, then we can impute (best-guess) the missing smoking values using sex and other variables in the dataset.

Imputation is not usually appropriate for the explanatory variable of interest or the outcome variable. In both cases, the hypothesis is that there is a meaningful association with other variables in the dataset, therefore it doesn’t make sense to use these variables to impute them.

Here is some code to run mice. The package is well documented, and there are a number of checks and considerations that should be made to inform the imputation process. Read the documentation carefully prior to doing this yourself.

Note also missing_predictorMatrix() from finalfit. This provides a straightforward way to include or exclude variables to be imputed or to be used for imputation.
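The following is a minimal sketch of a multiple imputation workflow with mice. It is not the authors' code, and it runs on simulated data rather than the dataset analysed in this chapter; the data frame `d`, its columns and the missingness probabilities are assumptions made for illustration. It uses `make.predictorMatrix()` from mice to build the default predictor matrix, which is the object that missing_predictorMatrix() helps to construct, and `m = 4` imputed datasets. `printFlag = FALSE` suppresses the iteration log.

```r
library(mice)

# Simulated data for illustration only.
set.seed(4)
n <- 500
d <- data.frame(
  age = rnorm(n, mean = 60, sd = 12),
  sex = factor(sample(c("Female", "Male"), n, replace = TRUE))
)
d$smoking <- factor(sample(c("Non-smoker", "Smoker"), n, replace = TRUE,
                           prob = c(0.8, 0.2)))
d$died <- factor(rbinom(n, 1, plogis(-0.5 + 0.4 * (d$smoking == "Smoker"))),
                 levels = c(0, 1), labels = c("Alive", "Died"))

# MAR: smoking is more often missing in females (assumed probabilities).
d$smoking[runif(n) < ifelse(d$sex == "Female", 0.30, 0.05)] <- NA

# Predictor matrix: rows are variables to be imputed, columns are the
# variables used to impute them. Setting an entry to 0 drops that predictor.
pred <- make.predictorMatrix(d)

# Create 4 imputed datasets.
imp <- mice(d, m = 4, predictorMatrix = pred, printFlag = FALSE, seed = 4)

# Fit the same logistic regression in each imputed dataset, then pool
# the estimates using Rubin's rules.
fits   <- with(imp, glm(died ~ age + sex + smoking, family = binomial))
pooled <- pool(fits)

# Pooled odds ratios with confidence intervals.
pooled_or <- summary(pooled, conf.int = TRUE, exponentiate = TRUE)
print(pooled_or)
```

Before relying on output like this, it is worth inspecting the imputation methods chosen for each variable (`imp$method`) and the imputed values themselves, as the package documentation recommends.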

## 
##  iter imp variable
##   1   1  mort_5yr  nodes  obstruct.factor  smoking_mar
##   1   2  mort_5yr  nodes  obstruct.factor  smoking_mar
##   1   3  mort_5yr  nodes  obstruct.factor  smoking_mar
##   1   4  mort_5yr  nodes  obstruct.factor  smoking_mar
##   2   1  mort_5yr  nodes  obstruct.factor  smoking_mar
##   2   2  mort_5yr  nodes  obstruct.factor  smoking_mar
##   2   3  mort_5yr  nodes  obstruct.factor  smoking_mar
##   2   4  mort_5yr  nodes  obstruct.factor  smoking_mar
##   3   1  mort_5yr  nodes  obstruct.factor  smoking_mar
##   3   2  mort_5yr  nodes  obstruct.factor  smoking_mar
##   3   3  mort_5yr  nodes  obstruct.factor  smoking_mar
##   3   4  mort_5yr  nodes  obstruct.factor  smoking_mar
##   4   1  mort_5yr  nodes  obstruct.factor  smoking_mar
##   4   2  mort_5yr  nodes  obstruct.factor  smoking_mar
##   4   3  mort_5yr  nodes  obstruct.factor  smoking_mar
##   4   4  mort_5yr  nodes  obstruct.factor  smoking_mar
##   5   1  mort_5yr  nodes  obstruct.factor  smoking_mar
##   5   2  mort_5yr  nodes  obstruct.factor  smoking_mar
##   5   3  mort_5yr  nodes  obstruct.factor  smoking_mar
##   5   4  mort_5yr  nodes  obstruct.factor  smoking_mar
## Warning: Factor `obstruct.factor` contains implicit NA, consider using
## `forcats::fct_explicit_na`
## Warning: Factor `smoking_mar` contains implicit NA, consider using
## `forcats::fct_explicit_na`
TABLE 11.3: Regression analysis with missing data: multiple imputation using mice().

| label | levels | Alive | Died | OR (univariable) | OR (multivariable) | OR (multivariable inc. smoking) | OR (multiple imputation) |
|---|---|---|---|---|---|---|---|
| Age (years) | Mean (SD) | 59.8 (11.4) | 59.9 (12.5) | 1.00 (0.99-1.01, p=0.986) | 1.01 (1.00-1.02, p=0.122) | 1.02 (1.01-1.04, p=0.004) | 1.01 (1.00-1.02, p=0.213) |
| Sex | Female | 243 (47.6) | 194 (48.0) |  |  |  |  |
|  | Male | 268 (52.4) | 210 (52.0) | 0.98 (0.76-1.27, p=0.889) | 0.98 (0.74-1.30, p=0.890) | 0.97 (0.69-1.34, p=0.836) | 1.01 (0.77-1.34, p=0.924) |
| nodes | Mean (SD) | 2.7 (2.4) | 4.9 (4.4) | 1.24 (1.18-1.30, p<0.001) | 1.25 (1.19-1.32, p<0.001) | 1.28 (1.21-1.37, p<0.001) | 1.23 (1.17-1.29, p<0.001) |
| Obstruction | No | 408 (82.1) | 312 (78.6) |  |  |  |  |
|  | Yes | 89 (17.9) | 85 (21.4) | 1.25 (0.90-1.74, p=0.189) | 1.36 (0.95-1.93, p=0.089) | 1.49 (1.00-2.22, p=0.052) | 1.34 (0.95-1.90, p=0.098) |
| Smoking (MAR) | Non-smoker | 312 (78.2) | 266 (83.6) |  |  |  |  |
|  | Smoker | 87 (21.8) | 52 (16.4) | 0.70 (0.48-1.02, p=0.067) |  | 0.77 (0.51-1.16, p=0.221) | 0.75 (0.50-1.14, p=0.178) |

By examining the coefficients, the effect of the imputation compared with the complete case analysis can be clearly seen.

Other considerations

  • Omit the variable
  • Imputing factors with new level for missing data
  • Model the missing data

As above, if the variable does not appear to be important, it may be omitted from the analysis. A sensitivity analysis in this context is another form of imputation. But rather than using all other available information to best-guess the missing data, we simply assign the value as above. Imputation is therefore likely to be more appropriate.

There is an alternative method to model the missing data for a categorical variable in this setting: just consider the missing data as a factor level. This has the advantage of simplicity, with the disadvantage of increasing the number of terms in the model.
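In base R, one way to do this is to turn `NA` into a named factor level before fitting. The sketch below uses simulated data; the data frame `d`, the new column `smoking_explicit` and the label "(Missing)" are choices made for illustration.

```r
# Simulated data for illustration only.
set.seed(5)
n <- 500
d <- data.frame(
  smoking = factor(sample(c("Non-smoker", "Smoker"), n, replace = TRUE,
                          prob = c(0.8, 0.2)))
)
d$died <- rbinom(n, 1, plogis(-0.5 + 0.4 * (d$smoking == "Smoker")))
d$smoking[sample(n, 50)] <- NA  # exactly 50 missing values

# Add NA as a factor level, then give that level an explicit name.
d$smoking_explicit <- addNA(d$smoking)
levels(d$smoking_explicit)[is.na(levels(d$smoking_explicit))] <- "(Missing)"
print(table(d$smoking_explicit))

# No rows are dropped now: "(Missing)" is estimated as its own term.
fit <- glm(died ~ smoking_explicit, data = d, family = binomial)
print(round(exp(coef(fit)), 2))  # odds ratios relative to Non-smoker
```

All patients stay in the model, at the cost of one extra coefficient whose interpretation ("smoking status not recorded") needs care.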

## Warning: Factor `obstruct.factor` contains implicit NA, consider using
## `forcats::fct_explicit_na`
TABLE 11.4: Regression analysis with missing data: explicitly modelling missing data.

| Dependent: Mortality 5 year |  | Alive | Died | OR (univariable) | OR (multivariable) |
|---|---|---|---|---|---|
| Age (years) | Mean (SD) | 59.8 (11.4) | 59.9 (12.5) | 1.00 (0.99-1.01, p=0.986) | 1.01 (1.00-1.02, p=0.114) |
| Sex | Female | 243 (55.6) | 194 (44.4) |  |  |
|  | Male | 268 (56.1) | 210 (43.9) | 0.98 (0.76-1.27, p=0.889) | 0.95 (0.71-1.28, p=0.743) |
| nodes | Mean (SD) | 2.7 (2.4) | 4.9 (4.4) | 1.24 (1.18-1.30, p<0.001) | 1.25 (1.19-1.32, p<0.001) |
| Obstruction | No | 408 (56.7) | 312 (43.3) |  |  |
|  | Yes | 89 (51.1) | 85 (48.9) | 1.25 (0.90-1.74, p=0.189) | 1.35 (0.95-1.92, p=0.099) |
| Smoking (MAR) | Non-smoker | 312 (54.0) | 266 (46.0) |  |  |
|  | Smoker | 87 (62.6) | 52 (37.4) | 0.70 (0.48-1.02, p=0.067) | 0.78 (0.52-1.17, p=0.233) |
|  | (Missing) | 112 (56.6) | 86 (43.4) | 0.90 (0.65-1.25, p=0.528) | 0.85 (0.59-1.23, p=0.390) |

11.5.3 MNAR vs MAR

Data that are missing not at random are tough to deal with in healthcare. To determine for definite whether data are MNAR, we need to know their value in a subset of observations (patients).

Imagine that smoking status is poorly recorded in patients admitted to hospital as an emergency with an obstructing cancer. Obstructing bowel cancers may be larger or their position may make the prognosis worse. Smoking may relate to the aggressiveness of the cancer and may be an independent predictor of prognosis. The missing values for smoking may therefore not be random. Smoking may be more common in the emergency patients and may be more common in those that die.

There is no easy way to handle this. If at all possible, try to get the missing data. Otherwise, be careful when drawing conclusions from analyses where data are thought to be missing not at random.
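To see why caution is needed, consider a small simulation in which smokers are less likely to have their smoking status recorded. The numbers are made up for illustration (a true smoking prevalence of 30%, with 50% of smokers and 10% of non-smokers missing); none of them come from the data in this chapter.

```r
# Simulated data for illustration only.
set.seed(6)
n <- 5000
smoker_true <- rbinom(n, 1, 0.3) == 1  # true smoking status

# MNAR: the chance of being missing depends on the unobserved value itself.
p_missing  <- ifelse(smoker_true, 0.5, 0.1)
smoker_obs <- ifelse(runif(n) < p_missing, NA, smoker_true)

true_prev <- mean(smoker_true)               # prevalence in everyone
obs_prev  <- mean(smoker_obs, na.rm = TRUE)  # prevalence in recorded values

cat("True smoking prevalence:    ", round(true_prev, 2), "\n")
cat("Observed smoking prevalence:", round(obs_prev, 2), "\n")
```

The recorded values understate the true prevalence, and nothing in the observed data alone reveals this: the bias can only be detected with outside knowledge of the missing values.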