11.2 Ensure your data are coded correctly: ff_glimpse()

The first step in any analysis is robust data cleaning and coding. While this sounds obvious, it is often ignored in the rush to get results. Lots of packages have a glimpse-type function and our own finalfit is no different. This function has three specific goals:

  1. Ensure all variables are of the type you expect them to be. A mismatch here is the most common cause of errors in finalfit functions. Numbers should be numeric, categorical variables should be characters or factors, and dates should be dates (for a reminder on these, see Section 2.2).
  2. Ensure you know which variables have missing data. This presumes missing values are correctly assigned NA.
  3. Ensure factor levels and variable labels are assigned correctly.
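The three goals above can be sketched in base R on a small made-up data frame. The data frame `toy`, its columns, and the "Unknown" placeholder are hypothetical and are not part of colon_s; this is only an illustration of the kind of problem ff_glimpse() helps you spot, not of how ff_glimpse() works internally.

```r
# Hypothetical toy data with common coding problems (not from colon_s)
toy <- data.frame(
  age     = c("61", "54", "70", "48"),                        # numbers stored as text
  smoking = c("Smoker", "Unknown", "Non-smoker", "Smoker"),   # missing coded as "Unknown"
  surgery = c("2020-01-15", "2020-02-03", "2020-02-20", "2020-03-11"),
  stringsAsFactors = FALSE
)

# Goal 1: check the type of every variable, then fix the wrong ones
sapply(toy, class)
toy$age     <- as.numeric(toy$age)
toy$surgery <- as.Date(toy$surgery)

# Goal 2: recode the placeholder to NA so it is counted as missing
toy$smoking[toy$smoking == "Unknown"] <- NA
colSums(is.na(toy))

# Goal 3: store the variable as a factor with an explicit level order
toy$smoking <- factor(toy$smoking, levels = c("Non-smoker", "Smoker"))
levels(toy$smoking)
```

After these fixes, every column has the type you would expect and the single unknown smoking status is a true NA.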

11.2.1 The Question

Using the colon_s colon cancer dataset, we are interested in exploring the association between a cancer obstructing the bowel and 5-year survival, accounting for other patient and disease characteristics.

For demonstration purposes, we will make up two smoking variables, one missing completely at random (MCAR; smoking_mcar) and one missing at random (MAR; smoking_mar). Do not worry about understanding the long cascading mutate() and sample() calls below; they merely create the example variables. You would not be ‘creating’ your data, we hope.

# Create some extra missing data
library(finalfit)
library(dplyr)
set.seed(1)
colon_s <- colon_s %>% 
  mutate(
    ## Smoking missing completely at random
    smoking_mcar = sample(c("Smoker", "Non-smoker", NA), 
                          n(), replace = TRUE, 
                          prob = c(0.2, 0.7, 0.1)) %>% 
      factor() %>% 
      ff_label("Smoking (MCAR)"),
    
    ## Smoking missing conditional on patient sex
    ## (40% missing in females, 10% missing in males)
    smoking_mar = ifelse(sex.factor == "Female",
                         sample(c("Smoker", "Non-smoker", NA), 
                                sum(sex.factor == "Female"), 
                                replace = TRUE,
                                prob = c(0.1, 0.5, 0.4)),
                         
                         sample(c("Smoker", "Non-smoker", NA), 
                                sum(sex.factor == "Male"), 
                                replace = TRUE, prob = c(0.15, 0.75, 0.1))
    ) %>% 
      factor() %>% 
      ff_label("Smoking (MAR)")
  )
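To make the difference between the two mechanisms concrete, here is a minimal base R sketch on simulated data. It does not reproduce the code above: it uses runif() to draw a missingness indicator directly, and the names `sex`, `miss_mcar`, `miss_mar` and `p_miss` are made up for this illustration. Under MCAR the chance of being missing is the same for everyone; under MAR it depends on an observed variable (here, sex).

```r
set.seed(42)
n   <- 10000
sex <- sample(c("Female", "Male"), n, replace = TRUE)

# MCAR: every row has the same 10% chance of being missing
miss_mcar <- runif(n) < 0.1

# MAR: the chance of being missing depends on the observed sex
p_miss   <- ifelse(sex == "Female", 0.4, 0.1)
miss_mar <- runif(n) < p_miss

# Proportion missing by sex under each mechanism
prop_mcar <- tapply(miss_mcar, sex, mean)
prop_mar  <- tapply(miss_mar, sex, mean)
round(prop_mcar, 2)
round(prop_mar, 2)
```

The MCAR proportions should be roughly equal in both sexes, while the MAR proportions should differ markedly between females and males.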

We will then examine our variables of interest using ff_glimpse():

explanatory <- c("age", "sex.factor", 
                 "nodes", "obstruct.factor",  
                 "smoking_mcar", "smoking_mar")
dependent <- "mort_5yr"

colon_s %>% 
  ff_glimpse(dependent, explanatory)
## $Continuous
##             label var_type   n missing_n missing_percent mean   sd  min
## age   Age (years)    <dbl> 929         0             0.0 59.8 11.9 18.0
## nodes       nodes    <dbl> 911        18             1.9  3.7  3.6  0.0
##       quartile_25 median quartile_75  max
## age          53.0   61.0        69.0 85.0
## nodes         1.0    2.0         5.0 33.0
## 
## $Categorical
##                            label var_type   n missing_n missing_percent
## mort_5yr        Mortality 5 year    <fct> 915        14             1.5
## sex.factor                   Sex    <fct> 929         0             0.0
## obstruct.factor      Obstruction    <fct> 908        21             2.3
## smoking_mcar      Smoking (MCAR)    <fct> 828       101            10.9
## smoking_mar        Smoking (MAR)    <fct> 726       203            21.9
##                 levels_n                              levels  levels_count
## mort_5yr               2        "Alive", "Died", "(Missing)"  511, 404, 14
## sex.factor             2                    "Female", "Male"      445, 484
## obstruct.factor        2            "No", "Yes", "(Missing)"  732, 176, 21
## smoking_mcar           2 "Non-smoker", "Smoker", "(Missing)" 645, 183, 101
## smoking_mar            2 "Non-smoker", "Smoker", "(Missing)" 585, 141, 203
##                   levels_percent
## mort_5yr        55.0, 43.5,  1.5
## sex.factor                48, 52
## obstruct.factor 78.8, 18.9,  2.3
## smoking_mcar          69, 20, 11
## smoking_mar           63, 15, 22

You don’t need to specify the variables, and if you don’t, ff_glimpse() will summarise all variables:

colon_s %>%
  ff_glimpse()

Use this to check that the variables are all assigned and behaving as expected. The missing_n and missing_percent columns show how much data is missing for each variable; for example, smoking_mar has 22% missing data.
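If you only need the percentage of missing values per variable, a base R one-liner gives the same kind of figure. This is a sketch on a hypothetical data frame `toy`, not a replacement for ff_glimpse(), and it makes no claim about how ff_glimpse() computes its own columns.

```r
# Hypothetical toy data: one missing age, two missing smoking values
toy <- data.frame(
  age     = c(61, 54, NA, 48, 70),
  smoking = factor(c("Smoker", NA, NA, "Non-smoker", "Smoker"))
)

# is.na() gives a TRUE/FALSE matrix; column means are the proportions missing
missing_percent <- round(100 * colMeans(is.na(toy)), 1)
missing_percent
```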