11.2 Ensure your data are coded correctly: ff_glimpse()
While it sounds obvious, this step is often ignored in the rush to get results. The first step in any analysis is robust data cleaning and coding. Lots of packages have a glimpse-type function and our own finalfit is no different. This function has three specific goals:
- Ensure all variables are of the type you expect them to be. A variable of the wrong type is the most common reason for an error with a finalfit function. Numbers should be numeric, categorical variables should be characters or factors, and dates should be dates (for a reminder on these, see Section 2.2).
- Ensure you know which variables have missing data. This presumes missing values are correctly assigned NA.
- Ensure factor levels and variable labels are assigned correctly.
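As a rough illustration of these checks outside finalfit, the sketch below uses base R only. The data frame raw, its columns, and the missing-value codes -99 and "unknown" are all hypothetical and are not part of the colon_s dataset.

```r
# Hypothetical raw data: ages stored as text, with -99 and "unknown"
# used as missing-value codes (assumptions for illustration only)
raw <- data.frame(
  age     = c("54", "61", "-99", "70"),
  sex     = c("Female", "Male", "unknown", "Male"),
  surgery = c("2001-03-14", "2001-05-02", "2002-11-20", "2003-01-09"),
  stringsAsFactors = FALSE
)

clean <- raw
clean$age <- as.numeric(clean$age)        # numbers should be numeric
clean$age[clean$age == -99] <- NA         # recode the numeric code to NA
clean$sex[clean$sex == "unknown"] <- NA   # recode the text code to NA
clean$sex <- factor(clean$sex)            # categorical variables as factors
clean$surgery <- as.Date(clean$surgery)   # dates should be dates

sapply(clean, class)    # check the type of each variable
colSums(is.na(clean))   # count missing values per variable
levels(clean$sex)       # check factor levels
```

Recoding to NA before converting to a factor keeps "unknown" from becoming a level in its own right.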
11.2.1 The Question
Using the colon_s colon cancer dataset, we are interested in exploring the association between a cancer obstructing the bowel and 5-year survival, accounting for other patient and disease characteristics.
For demonstration purposes, we will make up two smoking variables: one missing completely at random (MCAR, smoking_mcar) and one missing at random (MAR, smoking_mar).
Do not worry about understanding the long cascading mutate() and sample() functions below; they are only there to create the example variables.
You would not be ‘creating’ your data, we hope.
# Create some extra missing data
library(finalfit)
library(dplyr)
set.seed(1)
colon_s <- colon_s %>%
  mutate(
    ## Smoking missing completely at random
    smoking_mcar = sample(c("Smoker", "Non-smoker", NA),
                          n(), replace = TRUE,
                          prob = c(0.2, 0.7, 0.1)) %>%
      factor() %>%
      ff_label("Smoking (MCAR)"),

    ## Smoking missing conditional on patient sex
    smoking_mar = ifelse(sex.factor == "Female",
                         sample(c("Smoker", "Non-smoker", NA),
                                sum(sex.factor == "Female"),
                                replace = TRUE,
                                prob = c(0.1, 0.5, 0.4)),
                         sample(c("Smoker", "Non-smoker", NA),
                                sum(sex.factor == "Male"),
                                replace = TRUE,
                                prob = c(0.15, 0.75, 0.1))
    ) %>%
      factor() %>%
      ff_label("Smoking (MAR)")
  )
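To make the difference between the two mechanisms concrete, here is a self-contained toy simulation in base R. It does not use colon_s; the names n, sex, mcar_missing and mar_missing are hypothetical, and the probabilities simply mirror the missingness rates used above (10% overall for MCAR, 40% in females versus 10% in males for MAR).

```r
set.seed(1)
n <- 1000
sex <- sample(c("Female", "Male"), n, replace = TRUE)

# MCAR: every patient has the same 10% chance of a missing value
mcar_missing <- runif(n) < 0.1

# MAR: the chance of a missing value depends on an observed variable (sex)
mar_missing <- runif(n) < ifelse(sex == "Female", 0.4, 0.1)

# Proportion missing within each sex
tapply(mcar_missing, sex, mean)
tapply(mar_missing, sex, mean)
```

Under MCAR the two proportions should be similar, while under MAR the proportion missing should be clearly higher in females.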
We will then examine our variables of interest using ff_glimpse():
explanatory <- c("age", "sex.factor",
"nodes", "obstruct.factor",
"smoking_mcar", "smoking_mar")
dependent <- "mort_5yr"
colon_s %>%
ff_glimpse(dependent, explanatory)
## $Continuous
## label var_type n missing_n missing_percent mean sd min
## age Age (years) <dbl> 929 0 0.0 59.8 11.9 18.0
## nodes nodes <dbl> 911 18 1.9 3.7 3.6 0.0
## quartile_25 median quartile_75 max
## age 53.0 61.0 69.0 85.0
## nodes 1.0 2.0 5.0 33.0
##
## $Categorical
## label var_type n missing_n missing_percent
## mort_5yr Mortality 5 year <fct> 915 14 1.5
## sex.factor Sex <fct> 929 0 0.0
## obstruct.factor Obstruction <fct> 908 21 2.3
## smoking_mcar Smoking (MCAR) <fct> 828 101 10.9
## smoking_mar Smoking (MAR) <fct> 726 203 21.9
## levels_n levels levels_count
## mort_5yr 2 "Alive", "Died", "(Missing)" 511, 404, 14
## sex.factor 2 "Female", "Male" 445, 484
## obstruct.factor 2 "No", "Yes", "(Missing)" 732, 176, 21
## smoking_mcar 2 "Non-smoker", "Smoker", "(Missing)" 645, 183, 101
## smoking_mar 2 "Non-smoker", "Smoker", "(Missing)" 585, 141, 203
## levels_percent
## mort_5yr 55.0, 43.5, 1.5
## sex.factor 48, 52
## obstruct.factor 78.8, 18.9, 2.3
## smoking_mcar 69, 20, 11
## smoking_mar 63, 15, 22
You don’t need to specify the variables; if you omit them, ff_glimpse() will summarise all variables in the dataset.
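A minimal sketch of that call, assuming finalfit and dplyr are installed and that colon_s is the example dataset shipped with finalfit (the output is omitted here because it covers every column):

```r
library(finalfit)
library(dplyr)

# With no dependent or explanatory arguments, every column is summarised
colon_s %>%
  ff_glimpse()
```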
Use this to check that the variables are all assigned and behaving as expected.
The proportion of missing data can be seen, e.g., smoking_mar has 22% missing data.