14.3 1. Ensure your data are coded correctly: ff_glimpse()

Obvious as it may seem, this step is often skipped in the rush to get results. The first step in any analysis is robust data cleaning and coding. Lots of packages have a glimpse-type function and our own finalfit is no different. This function has three specific goals:

  1. Ensure all factors and numerics are correctly assigned. This is the most common reason for an error with a finalfit function: you think you’re using a factor variable, but it is in fact incorrectly coded as a continuous numeric.
  2. Ensure you know which variables have missing data. This presumes missing values are correctly assigned NA.
  3. Ensure factor levels and variable labels are assigned correctly.
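
The three checks above can also be sketched in base R. The `toy` data frame and its values below are hypothetical and exist only for illustration; they are not part of colon_s.

```r
# Hypothetical toy data, for illustration only
toy <- data.frame(
  age      = c(54, 61, NA, 70),
  obstruct = c(0, 1, 1, NA),  # categorical, but stored as a numeric
  sex      = factor(c("Female", "Male", "Male", "Female"))
)

# 1. Check how each variable is stored
sapply(toy, class)

# Recode the mis-typed variable as a factor with explicit levels
toy$obstruct <- factor(toy$obstruct, levels = c(0, 1), labels = c("No", "Yes"))

# 2. Count missing values per variable (assumes missing values are coded NA)
colSums(is.na(toy))

# 3. Inspect the factor levels
levels(toy$obstruct)
```

ff_glimpse() gathers this kind of information into a single summary, which is why it is a convenient first step.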

14.3.1 The Question

Using the colon_s colon cancer dataset, we are interested in exploring the association between a cancer obstructing the bowel and 5-year survival, accounting for other patient and disease characteristics.

For demonstration purposes, we will add two randomly generated smoking variables to the dataset: one with values missing completely at random (MCAR) and one with values missing at random (MAR).
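
One way such variables might be created and then inspected is sketched below. The seed and the sampling probabilities are assumptions chosen for illustration, so the counts produced will not necessarily match the output shown in this section. The MAR variable is made to depend on patient sex, which is also an assumption of this sketch.

```r
library(finalfit)
library(dplyr)

set.seed(1)  # assumed seed, only so that the sketch is repeatable

colon_s <- colon_s %>%
  mutate(
    # MCAR: missingness is unrelated to any other variable
    smoking_mcar = sample(c("Smoker", "Non-smoker", NA),
                          n(), replace = TRUE,
                          prob = c(0.2, 0.7, 0.1)) %>%
      factor() %>%
      ff_label("Smoking (MCAR)"),

    # MAR: missingness depends on an observed variable (here, sex)
    smoking_mar = ifelse(
      sex.factor == "Female",
      sample(c("Smoker", "Non-smoker", NA), n(), replace = TRUE,
             prob = c(0.1, 0.5, 0.4)),
      sample(c("Smoker", "Non-smoker", NA), n(), replace = TRUE,
             prob = c(0.15, 0.75, 0.1))
    ) %>%
      factor() %>%
      ff_label("Smoking (MAR)")
  )

# Summarise the outcome and the explanatory variables of interest
explanatory <- c("age", "sex.factor", "nodes", "obstruct.factor",
                 "smoking_mcar", "smoking_mar")
dependent <- "mort_5yr"

colon_s %>%
  ff_glimpse(dependent, explanatory)
```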

## Continuous
##             label var_type   n missing_n missing_percent mean   sd  min
## age   Age (years)    <dbl> 929         0             0.0 59.8 11.9 18.0
## nodes       nodes    <dbl> 911        18             1.9  3.7  3.6  0.0
##       quartile_25 median quartile_75  max
## age          53.0   61.0        69.0 85.0
## nodes         1.0    2.0         5.0 33.0
## 
## Categorical
##                            label var_type   n missing_n missing_percent
## mort_5yr        Mortality 5 year    <fct> 915        14             1.5
## sex.factor                   Sex    <fct> 929         0             0.0
## obstruct.factor      Obstruction    <fct> 908        21             2.3
## smoking_mcar      Smoking (MCAR)    <fct> 828       101            10.9
## smoking_mar        Smoking (MAR)    <fct> 726       203            21.9
##                 levels_n                              levels  levels_count
## mort_5yr               2        "Alive", "Died", "(Missing)"  511, 404, 14
## sex.factor             2                    "Female", "Male"      445, 484
## obstruct.factor        2            "No", "Yes", "(Missing)"  732, 176, 21
## smoking_mcar           2 "Non-smoker", "Smoker", "(Missing)" 645, 183, 101
## smoking_mar            2 "Non-smoker", "Smoker", "(Missing)" 585, 141, 203
##                   levels_percent
## mort_5yr        55.0, 43.5,  1.5
## sex.factor                48, 52
## obstruct.factor 78.8, 18.9,  2.3
## smoking_mcar          69, 20, 11
## smoking_mar           63, 15, 22

The function summarises a data frame or tibble, reporting numeric (continuous) variables and factor (discrete) variables separately. The dependent and explanatory arguments are for convenience: pass either, both, or neither. With neither, the whole data frame or tibble is summarised.
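
As a minimal sketch, calling the function with no dependent or explanatory arguments summarises every variable in the data:

```r
library(finalfit)
library(dplyr)

# No dependent or explanatory supplied: all columns of colon_s are summarised
colon_s %>%
  ff_glimpse()
```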

Use this to check that the variables are all assigned and behaving as expected. The proportion of missing data can be seen, e.g. smoking_mar has 22% missing data.