14.7 5. Decide how to handle missing data
Prior to a standard regression analysis, we can either:
- Delete the variable with the missing data
- Delete the cases with the missing data
- Impute (fill in) the missing data
- Model the missing data
14.7.1 MCAR
Using the examples above, we identify that Smoking (MCAR) is missing completely at random.
We know nothing about the missing values themselves, but we know of no plausible reason why the values of the missing data for, say, people who died should differ from the values of the missing data for those who survived. The pattern of missingness is therefore not felt to be MNAR.
Common solution

Depending on the number of data points that are missing, we may have sufficient power with complete cases to examine the relationships of interest. We therefore elect simply to omit the patients in whom smoking is missing. This is known as listwise deletion and will be performed by default in standard regression analyses in R.
explanatory = c("age", "sex.factor", 
  "nodes", "obstruct.factor", 
  "smoking_mcar")
dependent = "mort_5yr"
fit = colon_s %>% 
  finalfit(dependent, explanatory)
## Warning: Factor `obstruct.factor` contains implicit NA, consider using
## `forcats::fct_explicit_na`
## Warning: Factor `smoking_mcar` contains implicit NA, consider using
## `forcats::fct_explicit_na`
| Dependent: Mortality 5 year |            | Alive       | Died        | OR (univariable)          | OR (multivariable)        |
|-----------------------------|------------|-------------|-------------|---------------------------|---------------------------|
| Age (years)                 | Mean (SD)  | 59.8 (11.4) | 59.9 (12.5) | 1.00 (0.99-1.01, p=0.986) | 1.01 (1.00-1.02, p=0.200) |
| Sex                         | Female     | 243 (55.6)  | 194 (44.4)  | -                         | -                         |
|                             | Male       | 268 (56.1)  | 210 (43.9)  | 0.98 (0.76-1.27, p=0.889) | 1.02 (0.76-1.38, p=0.872) |
| nodes                       | Mean (SD)  | 2.7 (2.4)   | 4.9 (4.4)   | 1.24 (1.18-1.30, p<0.001) | 1.25 (1.18-1.33, p<0.001) |
| Obstruction                 | No         | 408 (56.7)  | 312 (43.3)  | -                         | -                         |
|                             | Yes        | 89 (51.1)   | 85 (48.9)   | 1.25 (0.90-1.74, p=0.189) | 1.53 (1.05-2.22, p=0.027) |
| Smoking (MCAR)              | Non-smoker | 358 (56.4)  | 277 (43.6)  | -                         | -                         |
|                             | Smoker     | 90 (49.7)   | 91 (50.3)   | 1.31 (0.94-1.82, p=0.113) | 1.37 (0.96-1.96, p=0.083) |
Other considerations

- Sensitivity analysis
- Omit the variable
- Imputation
- Model the missing data
If the variable in question is thought to be particularly important, you may wish to perform a sensitivity analysis. A sensitivity analysis in this context aims to capture the effect of uncertainty in the missing values on the conclusions drawn from the model. Thus, you may choose to relabel all missing smoking values as “smoker” and see if that changes the conclusions of your analysis. The same procedure can be performed labelling with “non-smoker”.
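Such a sensitivity analysis can be sketched as below. This is illustrative rather than definitive: it assumes the factor levels are “Smoker” and “Non-smoker” as in the tables in this section, and `replace()` works here only because the replacement value is an existing factor level.

```r
library(dplyr)
library(finalfit)

explanatory = c("age", "sex.factor", 
  "nodes", "obstruct.factor", "smoking_mcar")
dependent = "mort_5yr"

# Worst-case assumption: relabel all missing smoking values as "Smoker"
colon_s %>% 
  mutate(
    smoking_mcar = replace(smoking_mcar, is.na(smoking_mcar), "Smoker")
  ) %>% 
  finalfit(dependent, explanatory)

# Repeat with "Non-smoker" for the opposite assumption, 
# then compare the conclusions of the two models.
```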
If smoking is not associated with the explanatory variable of interest (bowel obstruction) or the outcome, it may be considered not to be a confounder and so could be omitted. That neatly deals with the missing data issue, but of course may not be appropriate.
Imputation and modelling are considered below.
14.7.2 MAR
But life is rarely that simple.
Consider instead that the smoking variable is more likely to be missing if the patient is female (missing_compare() shows a relationship), but that the missing values themselves are no different from the observed values. Missingness is then MAR.
If we simply drop all the cases (patients) in which smoking is missing (listwise deletion), then proportionally we drop more females than males. This may have consequences for our conclusions if sex is associated with our explanatory variable of interest or with the outcome.
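The relationship between missingness and the other variables can be examined directly. A minimal sketch using finalfit's missing_compare(), which compares observed against missing values of one variable across the others (this assumes the same colon_s variables used throughout this section):

```r
library(finalfit)

# Here the "dependent" is the variable whose missingness we are examining
explanatory = c("age", "sex.factor", 
  "nodes", "obstruct.factor")
dependent = "smoking_mar"
colon_s %>% 
  missing_compare(dependent, explanatory)
```

A significant difference in, say, sex between those with smoking recorded and those without supports MAR over MCAR.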
Common solution

mice is our go-to package for multiple imputation. That's the process of filling in missing data using a best estimate from all the other data that exist. When first encountered, this may not sound like a good idea.
However, taking our simple example, if missingness in smoking is predicted strongly by sex (and other observed variables), and the values of the missing data are random, then we can impute (best-guess) the missing smoking values using sex and the other variables in the dataset.
Imputation is not usually appropriate for the explanatory variable of interest or for the outcome variable. In both cases, the hypothesis is that there is a meaningful association with other variables in the dataset, so it doesn't make sense to use those variables to impute them.
Here is some code to run mice. The package is well documented, and there are a number of checks and considerations that should be made to inform the imputation process. Read the documentation carefully prior to doing this yourself.
Note also finalfit::missing_predictorMatrix(). This provides an easy way to include or exclude variables to be imputed or to be used for imputation.
# Multivariate Imputation by Chained Equations (mice)
library(finalfit)
library(dplyr)
library(mice)
explanatory = c("age", "sex.factor", 
  "nodes", "obstruct.factor", "smoking_mar")
dependent = "mort_5yr"

# Choose not to impute missing values 
# for explanatory variable of interest and 
# outcome variable, 
# but include them in the algorithm for imputation.
colon_s %>% 
  select(dependent, explanatory) %>% 
  missing_predictorMatrix(
    drop_from_imputed = c("obstruct.factor", "mort_5yr")
  ) -> predM

colon_s %>% 
  select(dependent, explanatory) %>% 
  # Usually run imputation with 10 imputed sets; 4 used here for demonstration
  mice(m = 4, predictorMatrix = predM) %>% 
  # Run logistic regression on each imputed set
  with(glm(formula(ff_formula(dependent, explanatory)), 
    family = "binomial")) %>% 
  # Pool and summarise results
  pool() %>% 
  summary(conf.int = TRUE, exponentiate = TRUE) %>% 
  # Jiggle into finalfit format
  mutate(explanatory_name = rownames(.)) %>% 
  select(explanatory_name, estimate, `2.5 %`, `97.5 %`, p.value) %>% 
  condense_fit(estimate_name = "OR (multiple imputation)") %>% 
  remove_intercept() -> fit_imputed
##
## iter imp variable
## 1 1 mort_5yr nodes obstruct.factor smoking_mar
## 1 2 mort_5yr nodes obstruct.factor smoking_mar
## 1 3 mort_5yr nodes obstruct.factor smoking_mar
## 1 4 mort_5yr nodes obstruct.factor smoking_mar
## 2 1 mort_5yr nodes obstruct.factor smoking_mar
## 2 2 mort_5yr nodes obstruct.factor smoking_mar
## 2 3 mort_5yr nodes obstruct.factor smoking_mar
## 2 4 mort_5yr nodes obstruct.factor smoking_mar
## 3 1 mort_5yr nodes obstruct.factor smoking_mar
## 3 2 mort_5yr nodes obstruct.factor smoking_mar
## 3 3 mort_5yr nodes obstruct.factor smoking_mar
## 3 4 mort_5yr nodes obstruct.factor smoking_mar
## 4 1 mort_5yr nodes obstruct.factor smoking_mar
## 4 2 mort_5yr nodes obstruct.factor smoking_mar
## 4 3 mort_5yr nodes obstruct.factor smoking_mar
## 4 4 mort_5yr nodes obstruct.factor smoking_mar
## 5 1 mort_5yr nodes obstruct.factor smoking_mar
## 5 2 mort_5yr nodes obstruct.factor smoking_mar
## 5 3 mort_5yr nodes obstruct.factor smoking_mar
## 5 4 mort_5yr nodes obstruct.factor smoking_mar
# Use finalfit merge methods to create and compare results
colon_s %>% 
  summary_factorlist(dependent, explanatory, fit_id = TRUE) -> summary1
## Warning: Factor `obstruct.factor` contains implicit NA, consider using
## `forcats::fct_explicit_na`
## Warning: Factor `smoking_mar` contains implicit NA, consider using
## `forcats::fct_explicit_na`
colon_s %>% 
  glmuni(dependent, explanatory) %>% 
  fit2df(estimate_suffix = " (univariable)") -> fit_uni

colon_s %>% 
  glmmulti(dependent, explanatory) %>% 
  fit2df(estimate_suffix = " (multivariable inc. smoking)") -> fit_multi

explanatory = c("age", "sex.factor", 
  "nodes", "obstruct.factor")
colon_s %>% 
  glmmulti(dependent, explanatory) %>% 
  fit2df(estimate_suffix = " (multivariable)") -> fit_multi_r

# Combine to final table
fit_impute = summary1 %>% 
  ff_merge(fit_uni) %>% 
  ff_merge(fit_multi_r) %>% 
  ff_merge(fit_multi) %>% 
  ff_merge(fit_imputed) %>% 
  select(-fit_id, -index)
| label          | levels     | Alive       | Died        | OR (univariable)          | OR (multivariable)        | OR (multivariable inc. smoking) | OR (multiple imputation)  |
|----------------|------------|-------------|-------------|---------------------------|---------------------------|---------------------------------|---------------------------|
| Age (years)    | Mean (SD)  | 59.8 (11.4) | 59.9 (12.5) | 1.00 (0.99-1.01, p=0.986) | 1.01 (1.00-1.02, p=0.122) | 1.02 (1.01-1.04, p=0.004)       | 1.01 (1.00-1.02, p=0.213) |
| Sex            | Female     | 243 (47.6)  | 194 (48.0)  | -                         | -                         | -                               | -                         |
|                | Male       | 268 (52.4)  | 210 (52.0)  | 0.98 (0.76-1.27, p=0.889) | 0.98 (0.74-1.30, p=0.890) | 0.97 (0.69-1.34, p=0.836)       | 1.01 (0.77-1.34, p=0.924) |
| nodes          | Mean (SD)  | 2.7 (2.4)   | 4.9 (4.4)   | 1.24 (1.18-1.30, p<0.001) | 1.25 (1.19-1.32, p<0.001) | 1.28 (1.21-1.37, p<0.001)       | 1.23 (1.17-1.29, p<0.001) |
| Obstruction    | No         | 408 (82.1)  | 312 (78.6)  | -                         | -                         | -                               | -                         |
|                | Yes        | 89 (17.9)   | 85 (21.4)   | 1.25 (0.90-1.74, p=0.189) | 1.36 (0.95-1.93, p=0.089) | 1.49 (1.00-2.22, p=0.052)       | 1.34 (0.95-1.90, p=0.098) |
| Smoking (MAR)  | Non-smoker | 312 (78.2)  | 266 (83.6)  | -                         | -                         | -                               | -                         |
|                | Smoker     | 87 (21.8)   | 52 (16.4)   | 0.70 (0.48-1.02, p=0.067) | -                         | 0.77 (0.51-1.16, p=0.221)       | 0.75 (0.50-1.14, p=0.178) |
By examining the coefficients, the effect of the imputation compared with the complete case analysis can be clearly seen.
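Before trusting the pooled estimates, the imputations themselves can also be inspected. A brief sketch, assuming `predM`, `dependent`, and `explanatory` from the code above are still in scope:

```r
library(dplyr)
library(mice)

# Re-run the imputation, keeping the mids object for inspection
colon_s %>% 
  select(dependent, explanatory) %>% 
  mice(m = 4, predictorMatrix = predM, printFlag = FALSE) -> imp

# Trace plots of chain means and variances to assess convergence
plot(imp)

# Values imputed for smoking in each of the 4 imputed sets
head(imp$imp$smoking_mar)
```

The mice documentation describes further diagnostics; read it carefully before relying on any imputation model.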
Other considerations

- Omit the variable
- Impute factors with a new level for missing data
- Model the missing data
As above, if the variable does not appear to be important, it may be omitted from the analysis. A sensitivity analysis in this context is another form of imputation: rather than using all other available information to best-guess the missing data, we simply assign a value as above. Imputation is therefore likely to be more appropriate.
There is an alternative method to model the missing data for a categorical variable in this setting: simply treat the missing data as a factor level in its own right. This has the advantage of simplicity, with the disadvantage of increasing the number of terms in the model. Multiple imputation is generally preferred.
library(dplyr)
explanatory = c("age", "sex.factor", 
  "nodes", "obstruct.factor", "smoking_mar")
fit_explicit_na = colon_s %>% 
  mutate(
    smoking_mar = forcats::fct_explicit_na(smoking_mar)
  ) %>% 
  finalfit(dependent, explanatory)
## Warning: Factor `obstruct.factor` contains implicit NA, consider using
## `forcats::fct_explicit_na`
| Dependent: Mortality 5 year |            | Alive       | Died        | OR (univariable)          | OR (multivariable)        |
|-----------------------------|------------|-------------|-------------|---------------------------|---------------------------|
| Age (years)                 | Mean (SD)  | 59.8 (11.4) | 59.9 (12.5) | 1.00 (0.99-1.01, p=0.986) | 1.01 (1.00-1.02, p=0.114) |
| Sex                         | Female     | 243 (55.6)  | 194 (44.4)  | -                         | -                         |
|                             | Male       | 268 (56.1)  | 210 (43.9)  | 0.98 (0.76-1.27, p=0.889) | 0.95 (0.71-1.28, p=0.743) |
| nodes                       | Mean (SD)  | 2.7 (2.4)   | 4.9 (4.4)   | 1.24 (1.18-1.30, p<0.001) | 1.25 (1.19-1.32, p<0.001) |
| Obstruction                 | No         | 408 (56.7)  | 312 (43.3)  | -                         | -                         |
|                             | Yes        | 89 (51.1)   | 85 (48.9)   | 1.25 (0.90-1.74, p=0.189) | 1.35 (0.95-1.92, p=0.099) |
| Smoking (MAR)               | Non-smoker | 312 (54.0)  | 266 (46.0)  | -                         | -                         |
|                             | Smoker     | 87 (62.6)   | 52 (37.4)   | 0.70 (0.48-1.02, p=0.067) | 0.78 (0.52-1.17, p=0.233) |
|                             | (Missing)  | 112 (56.6)  | 86 (43.4)   | 0.90 (0.65-1.25, p=0.528) | 0.85 (0.59-1.23, p=0.390) |
14.7.3 MNAR
Missing not at random data is tough in healthcare. To determine definitively whether data are MNAR, we would need to know the values of the missing data in at least a subset of observations (patients).
Using our example above: say smoking status is poorly recorded in patients admitted to hospital as an emergency with an obstructing cancer. Obstructing bowel cancers may be larger, or their position may make the prognosis worse. Smoking may relate to the aggressiveness of the cancer and may be an independent predictor of prognosis. The missing values for smoking may therefore not be random. Smoking may be more common in the emergency patients, and may be more common in those who die.
There is no easy way to handle this. If at all possible, try to get the missing data. Otherwise, take care when drawing conclusions from analyses where data are thought to be missing not at random.