2.2 Variable types and why we care

There are three broad types of data:

  • continuous (numbers), in R: numeric, double, or integer;
  • categorical, in R: character, factor, or logical (TRUE/FALSE);
  • date/time, in R: POSIXct date-time (see the footnote at the end of this section).

Values within a column all have to be the same type, but a tibble can of course hold columns of different types. Generally, R is good at figuring out what type of data you have (in programming, this ‘figuring out’ is called ‘parsing’).

For example, when reading in data, it will tell you what was assumed for each column:

library(tidyverse) # provides read_csv() and the %>% pipe
typesdata <- read_csv("data/typesdata.csv")
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   id = col_character(),
##   group = col_character(),
##   measurement = col_double(),
##   date = col_datetime(format = "")
## )
typesdata
## # A tibble: 3 x 4
##   id    group     measurement date               
##   <chr> <chr>           <dbl> <dttm>             
## 1 ID1   Control           1.8 2017-01-02 12:00:00
## 2 ID2   Treatment         4.5 2018-02-03 13:00:00
## 3 ID3   Treatment         3.7 2019-03-04 14:00:00

This means that a lot of the time you do not have to worry about those little <chr> vs <dbl> vs <dttm> labels. But with irregular or faulty input data, or when doing a lot of calculations and modifications to your data, you need to be aware of these different types to be able to find and fix mistakes.

For example, consider a similar file as above but with some data entry issues introduced:

typesdata_faulty <- read_csv("data/typesdata_faulty.csv")
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   id = col_character(),
##   group = col_character(),
##   measurement = col_character(),
##   date = col_character()
## )
typesdata_faulty
## # A tibble: 3 x 4
##   id    group     measurement date           
##   <chr> <chr>     <chr>       <chr>          
## 1 ID1   Control   1.8         02-Jan-17 12:00
## 2 ID2   Treatment 4.5         03-Feb-18 13:00
## 3 ID3   Treatment 3.7 or 3.8  04-Mar-19 14:00

Notice that R parsed both the measurement and date variables as characters. Measurement has been parsed as a character because of a data entry issue: the person taking the measurement couldn’t decide which value to note down (maybe the scale was shifting between the two values) so they included both values and text “or” in the cell.

A numeric variable will also get parsed as a character variable if it contains certain typos, e.g., if a value is entered as “3..7” instead of “3.7”.
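One way to track down such entries in a longer column is to attempt the conversion and look at what fails. The following is a minimal sketch in base R; measurement_raw is a hypothetical vector standing in for a faulty measurement column, not the actual file.

```r
# Hypothetical values mirroring the kinds of entry issues described above
measurement_raw <- c("1.8", "4.5", "3.7 or 3.8", "3..7")

# as.numeric() returns NA (with a warning) for anything it cannot convert;
# we silence the warning because we inspect the NAs ourselves
measurement_num <- suppressWarnings(as.numeric(measurement_raw))

# Entries that were present in the raw data but failed to convert
bad_entries <- measurement_raw[is.na(measurement_num) & !is.na(measurement_raw)]
bad_entries
## [1] "3.7 or 3.8" "3..7"
```

The entries listed in bad_entries are the ones to correct at the source or recode before converting the column to numeric.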

The reason R didn’t automatically make sense of the date column is that it couldn’t tell which number is the day and which is the year: 02-Jan-17 could stand for 02-Jan-2017 as well as 2002-Jan-17.
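To make the ambiguity concrete, the sketch below parses the same text under both readings using base R’s as.POSIXct() with an explicit format. The format strings are our own illustration, and matching the month abbreviation with %b assumes an English locale.

```r
date_text <- "02-Jan-17 12:00"

# Reading 1: day, month, 2-digit year
as_dmy <- as.POSIXct(date_text, format = "%d-%b-%y %H:%M", tz = "UTC")

# Reading 2: 2-digit year, month, day
as_ymd <- as.POSIXct(date_text, format = "%y-%b-%d %H:%M", tz = "UTC")

format(as_dmy, "%Y-%m-%d")
## [1] "2017-01-02"
format(as_ymd, "%Y-%m-%d")
## [1] "2002-01-17"
```

Both conversions succeed without complaint, which is why R needs to be told which reading is the right one.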

Therefore, while a lot of the time you do not have to worry about variable types and can just get on with your analysis, it is important to understand what the different types are to be ready to deal with them when issues arise.

Since health datasets are generally full of categorical data, it is crucial to understand the difference between characters and factors (both are types of categorical variables in R with pros and cons).

So here we go.

2.2.1 Numeric variables (continuous)

Numbers are straightforward to handle and don’t usually cause trouble. R usually refers to numbers as numeric (or num), but sometimes it really gets its nerd on and calls numbers integer or double. Integers are numbers without decimal places (e.g., 1, 2, 3), whereas double stands for “Double-precision floating-point” format (e.g., 1.234, 5.67890).

It doesn’t usually matter whether R is classifying your continuous data numeric/num/double/int, but it is good to be aware of these different terms as you will see them in R messages.
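If you are curious which of these terms applies to a particular value, typeof() and class() will tell you. A small sketch with made-up values (the object names are ours):

```r
whole_number   <- 3L   # the L suffix asks R for an integer
decimal_number <- 3.7
plain_three    <- 3    # no L suffix, so stored as a double

typeof(whole_number)
## [1] "integer"
typeof(decimal_number)
## [1] "double"
typeof(plain_three)
## [1] "double"
class(decimal_number)
## [1] "numeric"

# Both integers and doubles count as numeric
is.numeric(whole_number)
## [1] TRUE
```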

Something to note about numbers is that R prints 7 significant digits by default, but that doesn’t mean the remaining digits don’t exist. For example, from the typesdata tibble, we take the measurement column and send it to the mean() function. R then calculates the mean and prints it with 6 decimal places:

typesdata$measurement %>% mean()
## [1] 3.333333

Let’s save that in a new object:

measurement_mean <- typesdata$measurement %>% mean()

But when using the double equals operator to check if this is equivalent to a fixed value (you might do this when comparing to a threshold, or even another mean value), R returns FALSE:

measurement_mean == 3.333333
## [1] FALSE

Now this doesn’t seem right, does it? R clearly told us just above that the mean of this variable is 3.333333 (reminder: the actual values in the measurement column are 1.8, 4.5, 3.7). The above statement is FALSE because measurement_mean is quietly holding more than 6 decimal places.
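You can ask R to show the hidden digits. A minimal sketch, recreating the mean from the three values quoted above so that it runs without the data file:

```r
# Same three values as the measurement column
measurement_mean <- mean(c(1.8, 4.5, 3.7))

# Default printing: 7 significant digits
print(measurement_mean)
## [1] 3.333333

# Asking for 10 decimal places reveals more of the stored value
sprintf("%.10f", measurement_mean)
## [1] "3.3333333333"

# So the comparison with the 6-decimal value is FALSE
measurement_mean == 3.333333
## [1] FALSE
```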

And it gets worse. In this example, you may recognise that repeating decimals (0.333333…) usually mean there’s more of them somewhere. And you may think that rounding them down with the round() function would make your == behave as expected. Except, it’s not about rounding, it’s about how computers store numbers with decimals. Computers have issues with decimal numbers, and this simple example illustrates one:

(0.10 + 0.05) == 0.15
## [1] FALSE

This returns FALSE, meaning R does not seem to think that 0.10 + 0.05 is equal to 0.15. This issue isn’t specific to R; it affects programming languages in general. For example, Python also thinks that the sum of 0.10 and 0.05 does not equal 0.15.
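Printing both sides with 17 decimal places shows where the mismatch comes from. This is only an illustration of the stored values; the exact digits assume standard double-precision arithmetic, which is what R uses on common platforms.

```r
# The sum is stored slightly above 0.15 ...
sprintf("%.17f", 0.10 + 0.05)
## [1] "0.15000000000000002"

# ... while the literal 0.15 is stored slightly below it
sprintf("%.17f", 0.15)
## [1] "0.14999999999999999"
```

Neither number can be represented exactly in binary, and the two approximations land on different sides of 0.15.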

This is where the near() function comes in handy:

near(0.10+0.05, 0.15)
## [1] TRUE
near(measurement_mean, 3.333333, 0.000001)
## [1] TRUE

The first two arguments for near() are the numbers you are comparing; the third argument is the precision you are interested in. So if the numbers are equal within that precision, it returns TRUE. You can omit the third argument - the precision (in this case also known as the tolerance). If you do, near() will use a reasonable default tolerance value.
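The idea behind near() can be sketched in base R as a comparison of the absolute difference against the tolerance. The helper is_near() below is hypothetical and written only for illustration; its default tolerance (the square root of the machine precision, roughly 1.5e-8) is what we understand near() to use, so treat that as an assumption and check the dplyr documentation if it matters.

```r
# Hypothetical helper illustrating the idea behind near()
is_near <- function(x, y, tol = sqrt(.Machine$double.eps)) {
  abs(x - y) < tol
}

# The floating-point mismatch is far smaller than the default tolerance
is_near(0.10 + 0.05, 0.15)
## [1] TRUE

# 10/3 and 3.333333 differ by about 3.3e-7: equal at a tolerance of
# 0.000001, but not at the much stricter default
is_near(10 / 3, 3.333333, tol = 0.000001)
## [1] TRUE
is_near(10 / 3, 3.333333)
## [1] FALSE
```

This is why the comparison with measurement_mean above needed the tolerance spelled out, whereas the 0.10 + 0.05 example did not.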

2.2.2 Character variables

Characters (sometimes referred to as strings or character strings) in R are letters, words, or even whole sentences (an example of this may be free text comments). Characters are displayed in-between "" (or '').

A useful function for quickly investigating categorical variables is the count() function:

typesdata %>%
  count(group)
## # A tibble: 2 x 2
##   group         n
##   <chr>     <int>
## 1 Control       1
## 2 Treatment     2

count() can accept multiple variables and will count up the number of observations in each subgroup, e.g., mydata %>% count(var1, var2).

Another helpful option to count is sort = TRUE, which will order the result putting the highest count (n) to the top.

typesdata %>%
  count(group, sort = TRUE)
## # A tibble: 2 x 2
##   group         n
##   <chr>     <int>
## 1 Treatment     2
## 2 Control       1

count() with the sort = TRUE option is also useful for identifying duplicate IDs or misspellings in your data. With this example tibble (typesdata) that only has three rows, it is easy to see that the id column is a unique identifier whereas the group column is a categorical variable.

You can check everything by just eyeballing the tibble using the built-in Viewer tab (click on the dataset in the Environment tab).

But for larger datasets, you need to know how to check and then clean data programmatically - you can’t go through thousands of values checking they are all as intended without unexpected duplicates or typos.

For most variables (categorical or numeric), we recommend always plotting your data before starting analysis. But to check for duplicates in a unique identifier, use count() with sort = TRUE:

# all ids are unique:
typesdata %>% 
  count(id, sort = TRUE)
## # A tibble: 3 x 2
##   id        n
##   <chr> <int>
## 1 ID1       1
## 2 ID2       1
## 3 ID3       1
# we add in a duplicate row where id = ID3,
# then count again:
typesdata %>% 
  add_row(id = "ID3") %>% 
  count(id, sort = TRUE)
## # A tibble: 3 x 2
##   id        n
##   <chr> <int>
## 1 ID3       2
## 2 ID1       1
## 3 ID2       1
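With thousands of ids, the duplicates are easier to spot if only the ids that occur more than once are kept. A minimal sketch, assuming the dplyr package is installed; example_ids is a hypothetical data frame used here instead of typesdata so that the code runs on its own.

```r
library(dplyr)

# Hypothetical ids, with ID3 entered twice
example_ids <- data.frame(id = c("ID1", "ID2", "ID3", "ID3"))

# Count each id, then keep only those that appear more than once
duplicated_ids <- example_ids %>%
  count(id) %>%
  filter(n > 1)

duplicated_ids
```

An empty result means every id is unique; any rows that do come back are the ids to investigate.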

2.2.3 Factor variables (categorical)

Factors are fussy characters. Factors are fussy because they include something called levels. Levels are all the unique values a factor variable could take, e.g., the values listed when we looked at typesdata %>% count(group). Using factors rather than just characters can be useful because:

  • The values factor levels can take are fixed. For example, once you tell R that typesdata$group is a factor with two levels: Control and Treatment, combining it with other datasets with different spellings or abbreviations for the same variable will generate a warning. This can be helpful but can also be a nuisance when you really do want to add in another level to a factor variable.
  • Levels have an order. When running statistical tests on grouped data (e.g., Control vs Treatment, Adult vs Child) and the variable is just a character, not a factor, R will use the alphabetically first as the reference (comparison) level. Converting a character column into a factor column enables us to define and change the order of its levels. Level order affects many things including regression results and plots: by default, categorical variables are ordered alphabetically. If we want a different order in say a bar plot, we need to convert to a factor and reorder before we plot it. The plot will then order the groups correctly.
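Both points can be seen in a small base R sketch. The vector below is made up to mirror the group column, and the object names are ours.

```r
group <- c("Control", "Treatment", "Treatment")

# By default, levels are the unique values in alphabetical order
group_default <- factor(group)
levels(group_default)
## [1] "Control"   "Treatment"

# Supplying levels explicitly sets the order (first level = reference)
group_reordered <- factor(group, levels = c("Treatment", "Control"))
levels(group_reordered)
## [1] "Treatment" "Control"

# Fixed levels are fussy: a value that is not one of the levels
# (here a lower-case misspelling) becomes NA
group_strict <- factor(c("Control", "treatment"),
                       levels = c("Control", "Treatment"))
is.na(group_strict)
## [1] FALSE  TRUE
```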

So overall, since health data is often categorical and has a reference (comparison) level, factors are an essential way to work with these data in R. Nevertheless, the fussiness of factors can sometimes be unhelpful or even frustrating. A lot more about factor handling will be covered later (Chapter 8).

2.2.4 Date/time variables

R is good for working with dates. For example, it can calculate the number of days/weeks/months between two dates, or it can be used to find what a future date is (e.g., “what’s the date exactly 60 days from now?”). It also knows about time zones and is happy to parse dates in pretty much any format, as long as you tell R how your date is formatted (e.g., day before month, month name abbreviated, year in 2 or 4 digits, etc.). Since R displays dates and times between quotes (""), they look similar to characters. However, it is important to know whether R has understood which of your columns contain date/time information, and which are just normal characters.

library(lubridate) # lubridate makes working with dates easier
current_datetime <- Sys.time()
current_datetime
## [1] "2021-01-15 11:56:12 GMT"
my_datetime <- "2020-12-01 12:00"
my_datetime
## [1] "2020-12-01 12:00"

When printed, the two objects (current_datetime and my_datetime) seem to have a similar format. But if we try to calculate the difference between these two dates, we get an error:

my_datetime - current_datetime
## [1] "Error in `-.POSIXt`(my_datetime, current_datetime)"

That’s because when we assigned a value to my_datetime, R assumed the simpler type for it - so a character. We can check what the type of an object or variable is using the class() function:

current_datetime %>% class()
## [1] "POSIXct" "POSIXt"
my_datetime %>% class()
## [1] "character"

So we need to tell R that my_datetime does indeed include date/time information so we can then use it in calculations:

my_datetime_converted <- ymd_hm(my_datetime)
my_datetime_converted
## [1] "2020-12-01 12:00:00 UTC"

Calculating the difference will now work:

my_datetime_converted - current_datetime
## Time difference of -44.99737 days

Since R knows this is a difference between two date/time objects, it prints the result in a nicely readable way. Furthermore, the result has its own type; it is a “difftime”.

my_datesdiff <- my_datetime_converted - current_datetime
my_datesdiff %>% class()
## [1] "difftime"

This is useful if we want to apply this time difference on another date, e.g.:

ymd_hm("2021-01-02 12:00") + my_datesdiff
## [1] "2020-11-18 12:03:47 UTC"

But what if we want to use the number of days in a normal calculation? For example, suppose a measurement increased by 560 arbitrary units during this time period. We might try to calculate the increase per day like this:

560/my_datesdiff
## [1] "Error in `/.difftime`(560, my_datesdiff)"

That doesn’t work. We need to convert my_datesdiff (which is a difftime value) into a numeric value by using the as.numeric() function:

560/as.numeric(my_datesdiff)
## [1] -12.44517
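One thing to watch when calling as.numeric() on a difftime: R picks the units (seconds, minutes, hours, or days) based on the size of the difference, so the plain number can mean different things for different pairs of dates. The sketch below uses made-up date-times and base R only; stating the units explicitly removes the guesswork.

```r
start_time <- as.POSIXct("2021-01-15 12:00", tz = "UTC")
half_hour  <- as.POSIXct("2021-01-15 12:30", tz = "UTC")
weeks_on   <- as.POSIXct("2021-03-01 12:00", tz = "UTC")

# The units are chosen automatically and differ between the two results
units(half_hour - start_time)
## [1] "mins"
units(weeks_on - start_time)
## [1] "days"

# Asking for days explicitly gives comparable numbers
as.numeric(half_hour - start_time, units = "days")
## [1] 0.02083333
as.numeric(weeks_on - start_time, units = "days")
## [1] 45
```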

The lubridate package comes with several convenient functions for parsing dates, e.g., ymd(), mdy(), ymd_hm(), etc. - for a full list see lubridate.tidyverse.org.

However, if your date/time variable comes in an extra special format, then use the parse_date_time() function where the second argument specifies the format using the specifiers given in Table 2.2.

TABLE 2.2: Date/time format specifiers.
Notation  Meaning              Example
%d        day as number        01-31
%m        month as number      01-12
%B        month name           January-December
%b        abbreviated month    Jan-Dec
%Y        4-digit year         2019
%y        2-digit year         19
%H        hours                12
%M        minutes              01
%S        seconds              59
%A        weekday              Monday-Sunday
%a        abbreviated weekday  Mon-Sun

For example:

parse_date_time("12:34 07/Jan'20", "%H:%M %d/%b'%y")
## [1] "2020-01-07 12:34:00 UTC"

Furthermore, the same date/time specifiers can be used to rearrange your date and time for printing:

Sys.time()
## [1] "2021-01-15 11:56:12 GMT"
Sys.time() %>% format("%H:%M on %B-%d (%Y)")
## [1] "11:56 on January-15 (2021)"

You can even add plain text into the format() function, and R will know to put the right date/time values where the % specifiers are:

Sys.time() %>% format("Happy days, the current time is %H:%M %B-%d (%Y)!")
## [1] "Happy days, the current time is 11:56 January-15 (2021)!"
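Because Sys.time() changes every time it is run, it can be easier to experiment with the specifiers on a fixed date-time. A small sketch using base R; fixed_datetime is a made-up value matching the time printed above, and only numeric specifiers are used so that the results do not depend on the language settings of your computer.

```r
fixed_datetime <- as.POSIXct("2021-01-15 11:56:12", tz = "UTC")

# Day/month/2-digit year
format(fixed_datetime, "%d/%m/%y")
## [1] "15/01/21"

# Hours and minutes only
format(fixed_datetime, "%H:%M")
## [1] "11:56"

# Plain text mixed with specifiers
format(fixed_datetime, "Recorded in %Y at %H:%M:%S")
## [1] "Recorded in 2021 at 11:56:12"
```

Note that format() returns a character, so the result is for display only and can no longer be used in date calculations.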

  1. Portable Operating System Interface (POSIX) is a set of computing standards. There’s nothing more to understand about this other than when R starts shouting “POSIXct this or POSIXlt that” at you, check your date and time variables.