2.2 Variable types and why we care

There are three broad types of data:

  • continuous (numbers), in R: numeric, double, or integer;
  • categorical, in R: character, factor, or logical (TRUE/FALSE);
  • date/time, in R: POSIXct date-time (see footnote 1).

Values within a column all have to be the same type, but a tibble can of course hold columns of different types. Generally, R is good at figuring out what type of data you have (in programming, this ‘figuring out’ is called ‘parsing’).

For example, when reading in data, it will tell you what was assumed for each column:

## Parsed with column specification:
## cols(
##   id = col_character(),
##   group = col_character(),
##   measurement = col_double(),
##   date = col_datetime(format = "")
## )
## # A tibble: 3 x 4
##   id    group     measurement date               
##   <chr> <chr>           <dbl> <dttm>             
## 1 ID1   Control           1.8 2017-01-02 12:00:00
## 2 ID2   Treatment         4.5 2018-02-03 13:00:00
## 3 ID3   Treatment         3.7 2019-03-04 14:00:00
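Output like the above might be produced along the following lines. This is only a sketch: the original file is not shown here, so the temporary file and its contents are assumptions made for illustration, re-created from the printed tibble.

```r
library(readr)

# Hypothetical stand-in for the original file: write the example rows to a temporary CSV
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "id,group,measurement,date",
  "ID1,Control,1.8,2017-01-02 12:00:00",
  "ID2,Treatment,4.5,2018-02-03 13:00:00",
  "ID3,Treatment,3.7,2019-03-04 14:00:00"
), csv_path)

# read_csv() guesses a type for each column and reports what it assumed
typesdata <- read_csv(csv_path)

# spec() shows the column specification that was used
spec(typesdata)
typesdata
```

The exact wording of the message depends on the installed version of readr, but the guessed column types should match those listed above.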

This means that a lot of the time you do not have to worry about those little <chr> vs <dbl> vs <dttm> labels. But with irregular or faulty input data, or when doing a lot of calculations and modifications to your data, you need to be aware of these different types to be able to find and fix mistakes.

For example, consider a file similar to the one above but with some data entry issues introduced:

## Parsed with column specification:
## cols(
##   id = col_character(),
##   group = col_character(),
##   measurement = col_character(),
##   date = col_character()
## )
## # A tibble: 3 x 4
##   id    group     measurement date           
##   <chr> <chr>     <chr>       <chr>          
## 1 ID1   Control   1.8         02-Jan-17 12:00
## 2 ID2   Treatment 4.5         03-Feb-18 13:00
## 3 ID3   Treatment 3.7 or 3.8  04-Mar-19 14:00

Notice that R parsed both the measurement and date variables as characters. Measurement has been parsed as a character because of a data entry issue: the person taking the measurement couldn’t decide which value to note down (maybe the scale was shifting between the two values) so they included both values and the text “or” in the cell.

A numeric variable will also get parsed as a character variable if it contains certain typos, e.g., if entered as “3..7” instead of “3.7”.
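One way to locate such entries is to attempt the conversion and see which values fail. A minimal sketch using base R only; the vector below is made up for illustration from the two problem values mentioned in the text:

```r
# Values as they might arrive in a column that was parsed as character
measurement_raw <- c("1.8", "4.5", "3.7 or 3.8", "3..7")

# as.numeric() returns NA (with a warning) for anything it cannot convert
measurement_num <- suppressWarnings(as.numeric(measurement_raw))

# Entries that need fixing by hand before the column can be treated as numeric
measurement_raw[is.na(measurement_num)]
```

This assumes the column has no genuinely missing values; if it does, those will show up as NA too and need to be told apart from the typos.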

The reason R didn’t automatically make sense of the date column is that it couldn’t tell which part is the day and which is the year: 02-Jan-17 could stand for 02-Jan-2017 as well as 2002-Jan-17.
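The ambiguity can be resolved by stating the order of the components explicitly. A small sketch, assuming the lubridate package is installed and using the three date strings from the faulty example:

```r
library(lubridate)

date_raw <- c("02-Jan-17 12:00", "03-Feb-18 13:00", "04-Mar-19 14:00")

# dmy_hm() states the order explicitly: day, month, year, then hours and minutes
dmy_hm(date_raw)

# Reading the same text as year-month-day gives a different, equally valid, date
ymd_hm("02-Jan-17 12:00")
```

Only someone who knows how the data were recorded can say which reading is the right one, which is why R does not guess.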

Therefore, while a lot of the time you do not have to worry about variable types and can just get on with your analysis, it is important to understand what the different types are to be ready to deal with them when issues arise.

Since health datasets are generally full of categorical data, it is crucial to understand the difference between characters and factors (both are types of categorical variables in R with pros and cons).

So here we go.

2.2.1 Numeric variables (continuous)

Numbers are straightforward to handle and don’t usually cause trouble. R usually refers to numbers as numeric (or num), but sometimes it really gets its nerd on and calls numbers integer or double. Integers are numbers without decimal places (e.g., 1, 2, 3), whereas double stands for “Double-precision floating-point” format (e.g., 1.234, 5.67890).

It doesn’t usually matter whether R is classifying your continuous data as numeric/num/double/int, but it is good to be aware of these different terms as you will see them in R messages.

FRIENDLY WARNING: What’s about to follow is a bit dry. Furthermore, it is not essential for complete beginners - you might want to continue reading from Character variables. Before you leave, take a mental note that sometimes numbers in R have more decimal places than it seems, and that can cause funny behaviour when using the double equals operator (==).

Something to note about numbers is that, by default, R prints only 7 significant digits (6 decimal places for a number such as 3.333333), but that doesn’t mean further digits don’t exist. For example, from the typesdata tibble, we’re taking the measurement column and sending it to the mean() function. R then calculates the mean and tells us what it is with 6 decimal places:

## [1] 3.333333

Let’s save that in a new object:

But when using the double equals operator to check if this is equivalent to a fixed value (you might do this when comparing to a threshold, or even another mean value), R returns FALSE:

## [1] FALSE

Now this doesn’t seem right, does it - R clearly told us just above that the mean of this variable is 3.333333 (reminder: the actual values in the measurement column are 1.8, 4.5, 3.7). The reason the above statement is FALSE is that measurement_mean is quietly holding more than 6 decimal places.
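The steps so far can be sketched as follows. The tibble here is a cut-down stand-in for typesdata holding only the three measurements quoted above, and the object name measurement_mean follows the text:

```r
library(dplyr)

# Cut-down stand-in for typesdata: just the three measurements quoted in the text
typesdata <- tibble(measurement = c(1.8, 4.5, 3.7))

measurement_mean <- typesdata$measurement %>% mean()
measurement_mean                       # printed with 7 significant digits by default

# Asking for more digits reveals what is actually stored
print(measurement_mean, digits = 15)

# The stored value is not exactly 3.333333, so this comparison is FALSE
measurement_mean == 3.333333
```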

One way to go about this is to round the mean to a reasonable number of decimal places:

## [1] 3.333

The second argument of round() specifies the number of decimal places you want your number(s) rounded to. So when using round() in the equality statement like this, we get the expected TRUE:

## [1] TRUE

Which is usually fine, especially if you’ve finished applying calculations to that number. But when you intend to use it in further calculations, then rounding should be left to the very end - to minimise rounding errors. This is where the near() function comes in handy:

## [1] TRUE

The first two arguments for near() are the numbers you are comparing, and the third argument is the precision you are interested in. If the numbers are equal within that precision, it returns TRUE. This means you get the expected result without having to round the numbers off.
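Both approaches side by side, as a sketch. near() comes from the dplyr package; the tolerance of 0.001 is an illustrative choice, not a recommendation:

```r
library(dplyr)

measurement_mean <- mean(c(1.8, 4.5, 3.7))

# Option 1: round to 3 decimal places before comparing
round(measurement_mean, 3)
round(measurement_mean, 3) == 3.333

# Option 2: keep full precision and compare within a tolerance of 0.001
near(measurement_mean, 3.333, 0.001)
```

With near(), measurement_mean itself is left untouched, so any later calculation still uses the full-precision value.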

2.2.2 Character variables

Characters (sometimes referred to as strings or character strings) in R are letters, words, or even whole sentences (an example of this may be free text comments). Characters are displayed in-between "" (or '').

A useful function for quickly investigating categorical variables is the count() function:

## # A tibble: 2 x 2
##   group         n
##   <chr>     <int>
## 1 Control       1
## 2 Treatment     2

count() can accept multiple variables and will count up the number of observations in each subgroup, e.g., mydata %>% count(var1, var2).

Another helpful option to count() is sort = TRUE, which orders the result so that the highest count (n) is at the top.

## # A tibble: 2 x 2
##   group         n
##   <chr>     <int>
## 1 Treatment     2
## 2 Control       1
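The two tables above might be produced like this. A sketch, with typesdata re-created by hand from the values printed earlier (only the id and group columns are needed here):

```r
library(dplyr)

# Hand-built stand-in for typesdata
typesdata <- tibble(
  id    = c("ID1", "ID2", "ID3"),
  group = c("Control", "Treatment", "Treatment")
)

# Number of observations in each group
typesdata %>% count(group)

# Same counts, largest group first
typesdata %>% count(group, sort = TRUE)
```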

count() with the sort = TRUE option is also useful for identifying duplicate IDs or misspellings in your data. With this example tibble (typesdata) that only has three rows, it is easy to see that the id column is a unique identifier whereas the group column is a categorical variable.

You can check everything by just eyeballing the tibble using the built-in Viewer tab (click on the dataset in the Environment tab).

But for larger datasets, you need to know how to check and then clean data programmatically - you can’t go through thousands of values checking they are all as intended without unexpected duplicates or typos.

For most variables (categorical or numeric), we recommend always plotting your data before starting analysis. But to check for duplicates in a unique identifier, use count() with sort = TRUE:

## # A tibble: 3 x 2
##   id        n
##   <chr> <int>
## 1 ID1       1
## 2 ID2       1
## 3 ID3       1
## # A tibble: 3 x 2
##   id        n
##   <chr> <int>
## 1 ID3       2
## 2 ID1       1
## 3 ID2       1
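The first table above is what the check looks like when every id is unique; the second is what it looks like once an id has been repeated. A sketch of how this might be done, where the extra ID3 row is an assumption added purely to create a duplicate:

```r
library(dplyr)

# Hand-built stand-in for typesdata
typesdata <- tibble(
  id    = c("ID1", "ID2", "ID3"),
  group = c("Control", "Treatment", "Treatment")
)

# All ids are unique: every n is 1
typesdata %>% count(id, sort = TRUE)

# Assumed for illustration: append a row that reuses ID3, then count again
typesdata_dup <- bind_rows(typesdata, tibble(id = "ID3", group = "Treatment"))
typesdata_dup %>% count(id, sort = TRUE)

# For large datasets, keep only the ids that occur more than once
typesdata_dup %>% count(id) %>% filter(n > 1)
```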

2.2.3 Factor variables (categorical)

Factors are fussy characters. Factors are fussy because they include something called levels. Levels are all the unique values a factor variable could take, e.g., the values we saw when we looked at typesdata %>% count(group). Using factors rather than just characters can be useful because:

  • The values a factor can take are fixed. For example, once you tell R that typesdata$group is a factor with two levels: Control and Treatment, combining it with other datasets with different spellings or abbreviations for the same variable will generate a warning. This can be helpful but can also be a nuisance when you really do want to add in another level to a factor variable.
  • Levels have an order. When running statistical tests on grouped data (e.g., Control vs Treatment, Adult vs Child) and the variable is just a character, not a factor, R will use the alphabetically first value as the reference (comparison) level. Converting a character column into a factor column enables us to define and change the order of its levels. Level order affects many things including regression results and plots: by default, categorical variables are ordered alphabetically. If we want a different order in, say, a bar plot, we need to convert to a factor and reorder before we plot it. The plot will then order the groups correctly.

So overall, since health data is often categorical and has a reference (comparison) level, factors are an essential way to work with these data in R. Nevertheless, the fussiness of factors can sometimes be unhelpful or even frustrating. A lot more about factor handling will be covered later (Chapter 8).
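A small sketch of fixed and ordered levels, using base R’s factor() function. The vector is a hand-typed copy of the group column, and the misspelt “treatment” value is made up for illustration:

```r
group_chr <- c("Control", "Treatment", "Treatment")

# By default, the levels are the unique values in alphabetical order
group_fct <- factor(group_chr)
levels(group_fct)

# Setting the levels explicitly makes Treatment the first (reference) level
group_fct2 <- factor(group_chr, levels = c("Treatment", "Control"))
levels(group_fct2)

# A value that is not one of the declared levels becomes NA
factor(c("Control", "treatment"), levels = c("Control", "Treatment"))
```

Note that in this last case factor() itself turns the unexpected value into NA quietly, so it is worth counting missing values after converting.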

2.2.4 Date/time variables

R is good for working with dates. For example, it can calculate the number of days/weeks/months between two dates, or it can be used to find what a future date is (e.g., “what’s the date exactly 60 days from now?”). It also knows about time zones and is happy to parse dates in pretty much any format - as long as you tell R how your date is formatted (e.g., day before month, month name abbreviated, year in 2 or 4 digits, etc.). Since R displays dates and times between quotes (""), they look similar to characters. However, it is important to know whether R has understood which of your columns contain date/time information, and which are just normal characters.

## [1] "2020-04-15 12:43:05 BST"
## [1] "2020-12-01 12:00"

When printed, the two objects - current_datetime and my_datetime - seem to have a similar format. But if we try to calculate the difference between these two dates, we get an error:

## Error in `-.POSIXt`(my_datetime, current_datetime): can only subtract from "POSIXt" objects

That’s because when we assigned a value to my_datetime, R assumed the simpler type for it - so a character. We can check what the type of an object or variable is using the class() function:

## [1] "POSIXct" "POSIXt"
## [1] "character"
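The situation can be reproduced as follows. This is a sketch: the printed BST time suggests current_datetime came from the system clock, but a fixed value (with an assumed Europe/London time zone) is used here so the example gives the same result every time.

```r
library(lubridate)

# Fixed stand-in for the current time, so the example is reproducible
current_datetime <- ymd_hms("2020-04-15 12:43:05", tz = "Europe/London")

# Plain text, despite looking like a date-time
my_datetime <- "2020-12-01 12:00"

class(current_datetime)
class(my_datetime)

# Subtracting fails because my_datetime is a character;
# tryCatch() captures the error message instead of stopping the script
tryCatch(my_datetime - current_datetime, error = function(e) conditionMessage(e))
```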

So we need to tell R that my_datetime does indeed include date/time information so we can then use it in calculations:

## [1] "2020-12-01 12:00:00 UTC"

Calculating the difference will now work:

## Time difference of 230.0117 days

Since R knows this is a difference between two date/time objects, it prints the result in a nicely readable way. Furthermore, the result has its own type: it is a “difftime”.

## [1] "difftime"
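A sketch of the conversion and subtraction, assuming lubridate’s ymd_hm() is the parsing function (the text is in year-month-day hour:minute order). As before, a fixed value stands in for the current time, and the object name my_datesdiff follows the text:

```r
library(lubridate)

# Fixed stand-in for the current time (assumed Europe/London time zone)
current_datetime <- ymd_hms("2020-04-15 12:43:05", tz = "Europe/London")

# ymd_hm() turns the text into a proper date-time (UTC by default)
my_datetime <- ymd_hm("2020-12-01 12:00")
class(my_datetime)

# Both objects are now date-times, so subtraction works
my_datesdiff <- my_datetime - current_datetime
my_datesdiff
class(my_datesdiff)
```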

This is useful if we want to apply this time difference to another date, e.g.:

## [1] "2021-08-20 12:16:54 UTC"

But what if we want to use the number of days in a normal calculation? For example, suppose a measurement increased by 560 arbitrary units during this time period. We might want to calculate the increase per day like this:

## Error in `/.difftime`(560, my_datesdiff): second argument of / cannot be a "difftime" object

Doesn’t work, does it. We need to convert my_datesdiff (which is a difftime value) into a numeric value by using the as.numeric() function:

## [1] 2.434658
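Both uses of the time difference, as a sketch. The starting date of 2 January 2021 is an assumption chosen for illustration, and units = "days" is given explicitly so the result does not depend on which unit R happened to pick for the difftime:

```r
library(lubridate)

current_datetime <- ymd_hms("2020-04-15 12:43:05", tz = "Europe/London")
my_datetime      <- ymd_hm("2020-12-01 12:00")
my_datesdiff     <- my_datetime - current_datetime

# Adding a difftime to a date-time shifts it by that amount
ymd_hm("2021-01-02 12:00") + my_datesdiff

# For ordinary arithmetic, convert the difftime to a plain number of days first
my_days <- as.numeric(my_datesdiff, units = "days")
560 / my_days
```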

The lubridate package comes with several convenient functions for parsing dates, e.g., ymd(), mdy(), ymd_hm(), etc. - for a full list see lubridate.tidyverse.org.

However, if your date/time variable comes in an extra special format, then use the parse_date_time() function where the second argument specifies the format using the specifiers given in Table 2.2.

TABLE 2.2: Date/time format specifiers.

Notation  Meaning              Example
%d        day as number        01-31
%m        month as number      01-12
%B        month name           January-December
%b        abbreviated month    Jan-Dec
%Y        4-digit year         2019
%y        2-digit year         19
%H        hours                12
%M        minutes              01
%S        seconds              59
%A        weekday              Monday-Sunday
%a        abbreviated weekday  Mon-Sun

For example:

## [1] "2020-01-07 12:34:00 UTC"
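A sketch of parse_date_time() with an explicit format. The input string is made up for illustration and is not necessarily the one that produced the output above:

```r
library(lubridate)

# Hypothetical input: day/abbreviated month/2-digit year, then hours:minutes
odd_datetime <- "07/Jan/20 12:34"

# The second argument spells out the layout using the specifiers from the table
parsed <- parse_date_time(odd_datetime, "%d/%b/%y %H:%M")
parsed
class(parsed)
```

Month names are matched according to the current locale, so “Jan” is assumed to be a valid abbreviation in the locale in use.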

Furthermore, the same date/time specifiers can be used to rearrange your date and time for printing:

## [1] "2020-04-15 12:43:05 BST"
## [1] "12:43 on April-15 (2020)"

You can even add plain text into the format() function, R will know to put the right date/time values where the % are:

## [1] "Happy days, the current time is 12:43 April-15 (2020)!"
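Both kinds of formatting can be sketched with base R’s format() function. A fixed date-time (with an assumed Europe/London time zone) stands in for the current time so that the result is reproducible:

```r
# Fixed stand-in for the current time
current_datetime <- as.POSIXct("2020-04-15 12:43:05", tz = "Europe/London")

# Rearranging the components for printing
rearranged <- format(current_datetime, "%H:%M on %B-%d (%Y)")
rearranged

# Plain text can be mixed in; only the % specifiers are replaced
format(current_datetime, "Happy days, the current time is %H:%M %B-%d (%Y)!")
```

The month name produced by %B depends on the locale, so it will only read “April” in an English-language setup. The result of format() is a character, so it is meant for display rather than further date calculations.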

  1. Portable Operating System Interface (POSIX) is a set of computing standards. There’s nothing more to understand about this other than that when R starts shouting “POSIXct this or POSIXlt that” at you, you should check your date and time variables.