11.1 Identification of missing data

As journal editors, we often receive studies in which the investigators fail to describe, analyse, or even acknowledge missing data. This is frustrating, as missing data are often of the utmost importance. Conclusions may (and do) change when missing data are accounted for. Some folk seem not even to appreciate that in a conventional regression, only rows with complete data are included. By reading this, you will not be one of them!
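To make the point about complete rows concrete, here is a minimal sketch in Python. The five patient records and the `complete_cases` helper are made up for illustration, and `None` stands in for a missing value; this is not the behaviour of any particular regression package, only the row-dropping that a conventional complete case analysis implies.

```python
# Hypothetical records: None marks a missing value.
patients = [
    {"gender": "F", "smoker": True,  "died": False},
    {"gender": "F", "smoker": None,  "died": True},
    {"gender": "M", "smoker": False, "died": False},
    {"gender": "M", "smoker": True,  "died": None},
    {"gender": "F", "smoker": False, "died": False},
]


def complete_cases(rows):
    """Keep only rows in which no variable is missing."""
    return [row for row in rows
            if all(value is not None for value in row.values())]


kept = complete_cases(patients)
print(f"{len(kept)} of {len(patients)} rows would enter the model")
```

Two of the five rows are silently lost here, even though each of them is missing only one value.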

These are the five steps to ensuring missing data are correctly identified and appropriately dealt with:

  1. Ensure your data are coded correctly.
  2. Identify missing values within each variable.
  3. Look for patterns of missingness.
  4. Check for associations between missing and observed data.
  5. Decide how to handle missing data.

We will work through a number of functions that will help with each of these. But first, here are some terms that are easy to mix up. These are important as they describe the mechanism of missingness and this determines how you can handle the missing data.

For each of the following examples we will imagine that we are collecting data on the relationship between gender, smoking and the outcome of cancer treatment. The ground truth in this imagined scenario is that both gender and smoking influence the outcome from cancer treatment.

11.1.1 Missing completely at random (MCAR)

As it says, values are randomly missing from your dataset. Missing data values do not relate to any other data in the dataset and there is no pattern to the actual values of the missing data themselves.

In our example, smoking status is missing from a random subset of male and female patients.

This may have the effect of making our population smaller, but the complete case population has the same characteristics as the missing data population. This is easy to handle, but unfortunately, data are almost never missing completely at random.
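A small simulation can illustrate this. The sketch below is in Python with made-up numbers: the 30% smoking prevalence, the 25% chance of missingness, and the helper `smoking_rate` are all assumptions chosen for illustration. Because the data are simulated, we keep the true smoking status even where it is flagged as missing, which is something real data never allow.

```python
import random

random.seed(1)  # fixed seed so the sketch is repeatable

# Simulated patients; the 30% smoking prevalence is an arbitrary assumption.
patients = [
    {"gender": random.choice(["F", "M"]), "smoker": random.random() < 0.3}
    for _ in range(20_000)
]

# MCAR: every patient has the same 25% chance of a missing smoking status,
# regardless of gender or of whether they smoke.
for p in patients:
    p["missing"] = random.random() < 0.25


def smoking_rate(rows):
    """Proportion of smokers, using the true (simulated) smoking status."""
    return sum(r["smoker"] for r in rows) / len(rows)


observed = [p for p in patients if not p["missing"]]
unobserved = [p for p in patients if p["missing"]]

print(f"complete cases: {smoking_rate(observed):.3f}")
print(f"missing cases:  {smoking_rate(unobserved):.3f}")
```

The two proportions should differ only by sampling noise, which is what "same characteristics" means here.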

11.1.2 Missing at random (MAR)

This term is confusing and would be better named missing conditionally at random. Here, missingness in a particular variable has an association with one or more other variables in the dataset. However, once those variables are taken into account, the actual values of the missing data are random.

In our example, smoking status is missing for some female patients but not for male patients.

But data are missing from the same proportion of female smokers as female non-smokers. So the complete case female patients have the same characteristics as the missing data female patients.
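The same kind of simulation can sketch missing at random. Again this is Python with assumed numbers: a 30% smoking prevalence and a 40% chance that smoking status is missing for female patients, with none missing for male patients. The helpers `smoking_rate` and `missing_rate` are hypothetical names used only in this example.

```python
import random

random.seed(2)  # fixed seed so the sketch is repeatable

patients = [
    {"gender": random.choice(["F", "M"]), "smoker": random.random() < 0.3}
    for _ in range(20_000)
]

# MAR: missingness depends on gender (which is observed),
# but not on the smoking status itself.
for p in patients:
    p["missing"] = p["gender"] == "F" and random.random() < 0.4


def smoking_rate(rows):
    """Proportion of smokers, using the true (simulated) smoking status."""
    return sum(r["smoker"] for r in rows) / len(rows)


def missing_rate(rows):
    """Proportion of rows with smoking status flagged as missing."""
    return sum(r["missing"] for r in rows) / len(rows)


females = [p for p in patients if p["gender"] == "F"]
males = [p for p in patients if p["gender"] == "M"]
female_observed = [p for p in females if not p["missing"]]
female_missing = [p for p in females if p["missing"]]

print(f"missing in females: {missing_rate(females):.3f}")
print(f"missing in males:   {missing_rate(males):.3f}")
print(f"smokers, complete case females: {smoking_rate(female_observed):.3f}")
print(f"smokers, missing data females:  {smoking_rate(female_missing):.3f}")
```

Missingness differs sharply by gender, yet within female patients the smoking rate should be about the same whether or not the value was recorded.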

11.1.3 Missing not at random (MNAR)

The pattern of missingness is related to other variables in the dataset, but in addition, the actual values of the missing data are not random.

In our example, smoking status is more likely to be missing for female patients who smoke than for those who do not, and is not missing for male patients.

Thus, the complete case female patients have different characteristics to the missing data female patients. For instance, the missing data female patients may be more likely to die after cancer treatment. Looking at our available population, we therefore underestimate the likelihood of a female patient dying after cancer treatment.
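The underestimate can be sketched with a simulation, again in Python and again with invented numbers. The death risks, the 60% and 10% chances of missingness, and the helpers `simulate_patient` and `death_rate` are all assumptions for illustration. The comparison is only possible because simulated data retain the outcome for every patient.

```python
import random

random.seed(3)  # fixed seed so the sketch is repeatable


def simulate_patient():
    """One hypothetical patient; all probabilities are assumed values."""
    gender = random.choice(["F", "M"])
    smoker = random.random() < 0.3
    # Both smoking and gender influence the outcome, as in the scenario.
    death_risk = 0.15 + 0.20 * smoker + 0.05 * (gender == "M")
    died = random.random() < death_risk
    # MNAR: in females, smokers are far more likely to have a missing
    # smoking status than non-smokers; males are never missing.
    if gender == "F":
        missing = random.random() < (0.6 if smoker else 0.1)
    else:
        missing = False
    return {"gender": gender, "smoker": smoker,
            "died": died, "missing": missing}


def death_rate(rows):
    """Proportion of patients who died."""
    return sum(r["died"] for r in rows) / len(rows)


patients = [simulate_patient() for _ in range(20_000)]
females = [p for p in patients if p["gender"] == "F"]
female_complete = [p for p in females if not p["missing"]]

print(f"death rate, all females:           {death_rate(females):.3f}")
print(f"death rate, complete case females: {death_rate(female_complete):.3f}")
```

Because smokers are over-represented among the missing rows, the complete case death rate for female patients should come out lower than the rate across all female patients.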

Missing not at random data are important, can alter your conclusions, and are the most difficult to diagnose and handle. They can only be detected by collecting and examining some of the missing data. This is often difficult or impossible to do.

How you deal with missing data depends on the type of missingness. Once you know the type, you can start addressing it. More on this below.