2.2 Variable types and why we care
There are three broad types of data:
- continuous (numbers), in R: numeric, double, or integer;
- categorical, in R: character, factor, or logical (TRUE/FALSE);
- date/time, in R: POSIXct date-time [4].
Values within a column all have to be the same type, but a tibble can of course hold columns of different types. Generally, R is good at figuring out what type of data you have (in programming, this ‘figuring out’ is called ‘parsing’).
For example, when reading in data, it will tell you what was assumed for each column:
## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   id = col_character(),
##   group = col_character(),
##   measurement = col_double(),
##   date = col_datetime(format = "")
## )
## # A tibble: 3 x 4
##   id    group     measurement date               
##   <chr> <chr>           <dbl> <dttm>             
## 1 ID1   Control           1.8 2017-01-02 12:00:00
## 2 ID2   Treatment         4.5 2018-02-03 13:00:00
## 3 ID3   Treatment         3.7 2019-03-04 14:00:00
This means that a lot of the time you do not have to worry about those little <chr>, <dbl>, <dttm> or <S3: POSIXct> labels.
But in cases of irregular or faulty input data, or when doing a lot of calculations and modifications to your data, we need to be aware of these different types to be able to find and fix mistakes.
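As a rough sketch of where such a column specification comes from, the example below reads the same three rows with readr. The CSV text is typed in by hand to mirror the printed tibble (the original file is not shown here), and wrapping it in I() assumes readr version 2.0 or later.

```r
library(readr)

# Assumption: hand-typed CSV content that mirrors the tibble printed above.
# I() tells readr (2.0 or later) that this is literal data, not a file path.
csv_text <- "id,group,measurement,date
ID1,Control,1.8,2017-01-02 12:00:00
ID2,Treatment,4.5,2018-02-03 13:00:00
ID3,Treatment,3.7,2019-03-04 14:00:00"

# Let readr guess the column types
typesdata <- read_csv(I(csv_text), show_col_types = FALSE)

# spec() prints the column specification that readr settled on
spec(typesdata)

# The same types can be stated up front instead of guessed
typesdata_explicit <- read_csv(
  I(csv_text),
  col_types = cols(
    id = col_character(),
    group = col_character(),
    measurement = col_double(),
    date = col_datetime(format = "")
  )
)
```

Stating the types explicitly is one way to make faulty input fail loudly rather than silently change a column's type.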
For example, consider a file similar to the one above but with some data entry issues introduced:
## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   id = col_character(),
##   group = col_character(),
##   measurement = col_character(),
##   date = col_character()
## )
## # A tibble: 3 x 4
##   id    group     measurement date           
##   <chr> <chr>     <chr>       <chr>          
## 1 ID1   Control   1.8         02-Jan-17 12:00
## 2 ID2   Treatment 4.5         03-Feb-18 13:00
## 3 ID3   Treatment 3.7 or 3.8  04-Mar-19 14:00
Notice that R parsed both the measurement and date variables as characters. Measurement has been parsed as a character because of a data entry issue: the person taking the measurement couldn’t decide which value to note down (maybe the scale was shifting between the two values) so they included both values and text “or” in the cell.
A numeric variable will also get parsed as a categorical variable if it contains certain typos, e.g., if entered as “3..7” instead of “3.7”.
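A minimal base R sketch of why a single stray entry changes the type of a whole column. The vectors are made up to mirror the measurement column; they are not read from the original file.

```r
# A vector holds one type only, so one text entry turns every value into text.
clean_measurement  <- c(1.8, 4.5, 3.7)
faulty_measurement <- c("1.8", "4.5", "3.7 or 3.8")

class(clean_measurement)   # "numeric"
class(faulty_measurement)  # "character"

# Forcing the conversion shows which entries cannot be read as numbers:
# they become NA (R also gives a coercion warning, silenced here).
converted <- suppressWarnings(as.numeric(faulty_measurement))
converted

# The same happens with a typo such as "3..7"
typo_converted <- suppressWarnings(as.numeric("3..7"))
is.na(typo_converted)
```

Looking for unexpected NA values after a conversion like this is a quick way to locate the offending cells.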
The reason R didn’t automatically make sense of the date column is that it couldn’t tell which number is the day and which is the year: 02-Jan-17 could stand for 02-Jan-2017 as well as 2002-Jan-17.
Therefore, while a lot of the time you do not have to worry about variable types and can just get on with your analysis, it is important to understand what the different types are to be ready to deal with them when issues arise.
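To make the date ambiguity concrete, here is a small base R sketch that reads the same text under two different assumptions about its layout. It assumes an English locale, because the %b specifier matches abbreviated month names in the current locale.

```r
# Assumes an English locale so that "Jan" is recognised by %b.
ambiguous <- "02-Jan-17"

# Reading 1: day-month-year
as_dmy <- as.Date(ambiguous, format = "%d-%b-%y")

# Reading 2: year-month-day
as_ymd <- as.Date(ambiguous, format = "%y-%b-%d")

format(as_dmy, "%Y-%m-%d")  # "2017-01-02"
format(as_ymd, "%Y-%m-%d")  # "2002-01-17"
```

Both readings are valid dates, which is why R needs to be told the layout rather than guess it.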
Since health datasets are generally full of categorical data, it is crucial to understand the difference between characters and factors (both are types of categorical variables in R with pros and cons).
So here we go.
2.2.1 Numeric variables (continuous)
Numbers are straightforward to handle and don’t usually cause trouble.
R usually refers to numbers as numeric (or num), but sometimes it really gets its nerd on and calls numbers integer or double. Integers are numbers without decimal places (e.g., 1, 2, 3), whereas double stands for “Double-precision floating-point” format (e.g., 1.8, 4.5, 3.7).
It doesn’t usually matter whether R is classifying your continuous data as numeric/num/double/int, but it is good to be aware of these different terms as you will see them in R messages.
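A short base R sketch of how these terms show up in practice. The object names are made up for illustration.

```r
whole_number   <- 3L    # the L suffix asks R for an integer
decimal_number <- 3.7   # numbers with decimals are stored as doubles
plain_three    <- 3     # without the L suffix, even 3 is a double

typeof(whole_number)    # "integer"
typeof(decimal_number)  # "double"
typeof(plain_three)     # "double"

# class() calls doubles "numeric"; integers keep their own class
class(decimal_number)   # "numeric"
class(whole_number)     # "integer"

# Both count as numeric for most practical purposes
is.numeric(whole_number)    # TRUE
is.numeric(decimal_number)  # TRUE
```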
Something to note about numbers is that R doesn’t usually print more than 6 decimal places, but that doesn’t mean they don’t exist.
For example, from the typesdata tibble, we’re taking the measurement column and sending it to the mean() function. R then calculates the mean and tells us what it is with 6 decimal places:
## [1] 3.333333
Let’s save that in a new object:
But when using the double equals operator (==) to check if this is equivalent to a fixed value (you might do this when comparing to a threshold, or even another mean value), R returns FALSE:
## [1] FALSE
Now this doesn’t seem right, does it - R clearly told us just above that the mean of this variable is 3.333333 (reminder: the actual values in the measurement column are 1.8, 4.5, 3.7).
The reason the above statement is FALSE is that measurement_mean is quietly holding more than 6 decimal places.
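The steps above can be sketched in base R as follows. The measurement values are the three quoted in the text, typed in directly rather than taken from the tibble.

```r
# The three measurement values quoted in the text
measurement <- c(1.8, 4.5, 3.7)
measurement_mean <- mean(measurement)

# Printed with the default number of significant digits
measurement_mean

# The comparison with the printed value is FALSE
measurement_mean == 3.333333

# Asking for more digits reveals what is actually stored
print(measurement_mean, digits = 15)
```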
And it gets worse. In this example, you may recognise that repeating decimals (0.333333…) usually mean there’s more of them somewhere. And you may think that rounding them down with the round() function would make your == behave as expected. Except, it’s not about rounding, it’s about how computers store numbers with decimals. Computers have issues with decimal numbers, and this simple example illustrates one:
## [1] FALSE
This returns FALSE, meaning R does not seem to think that 0.10 + 0.05 is equal to 0.15. This issue isn’t specific to R, but to programming languages in general. For example, Python also thinks that the sum of 0.10 and 0.05 does not equal 0.15.
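A small base R sketch of the comparison described above, printing more digits to show where the two values part ways.

```r
# The comparison that surprises people
(0.10 + 0.05) == 0.15

# Printing 20 decimal places shows the stored values are not identical
sprintf("%.20f", 0.10 + 0.05)
sprintf("%.20f", 0.15)

# The difference is tiny, but it is not zero
difference <- (0.10 + 0.05) - 0.15
difference
```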
This is where the near() function comes in handy:
## [1] TRUE
## [1] TRUE
The first two arguments for near() are the numbers you are comparing; the third argument is the precision you are interested in. So if the numbers are equal within that precision, it returns TRUE. You can omit the third argument - the precision (in this case also known as the tolerance). If you do, near() will use a reasonable default tolerance value.
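A minimal sketch of these comparisons using near() from the dplyr package. The mean is recomputed from the three measurement values quoted earlier so that the example runs on its own.

```r
library(dplyr)

measurement_mean <- mean(c(1.8, 4.5, 3.7))

# Default tolerance: fine for floating-point noise
near(0.10 + 0.05, 0.15)                     # TRUE

# Explicit tolerance of 0.000001: equal to 6 decimal places
near(measurement_mean, 3.333333, 0.000001)  # TRUE

# With the default tolerance the truncated value is still too far away
near(measurement_mean, 3.333333)            # FALSE
```

The last line is a reminder that the default tolerance is meant for rounding noise, not for values that were cut off at a few decimal places.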
2.2.2 Character variables
Characters (sometimes referred to as strings or character strings) in R are letters, words, or even whole sentences (an example of this may be free text comments).
Characters are displayed in-between quotes (" ").
A useful function for quickly investigating categorical variables is the count() function:
## # A tibble: 2 x 2
##   group         n
##   <chr>     <int>
## 1 Control       1
## 2 Treatment     2
count() can accept multiple variables and will count up the number of observations in each subgroup, e.g., mydata %>% count(var1, var2).
Another helpful option to count is sort = TRUE, which will order the result putting the highest count (n) to the top.
## # A tibble: 2 x 2
##   group         n
##   <chr>     <int>
## 1 Treatment     2
## 2 Control       1
The sort = TRUE option is also useful for identifying duplicate IDs or misspellings in your data.
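The counts above can be reproduced with a sketch like the following. typesdata is rebuilt by hand here from the printed rows, which is an assumption; in practice it would come from the file that was read in.

```r
library(dplyr)

# Assumption: a hand-built copy of the three-row example data
typesdata <- tibble(
  id          = c("ID1", "ID2", "ID3"),
  group       = c("Control", "Treatment", "Treatment"),
  measurement = c(1.8, 4.5, 3.7)
)

# Number of observations in each group
group_counts <- typesdata %>% count(group)
group_counts

# Same counts, largest group first
group_counts_sorted <- typesdata %>% count(group, sort = TRUE)
group_counts_sorted
```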
With this example (typesdata) that only has three rows, it is easy to see that the id column is a unique identifier whereas the group column is a categorical variable.
You can check everything by just eyeballing the tibble using the built-in Viewer tab (click on the dataset in the Environment tab).
But for larger datasets, you need to know how to check and then clean data programmatically - you can’t go through thousands of values checking they are all as intended without unexpected duplicates or typos.
For most variables (categorical or numeric), we recommend always plotting your data before starting analysis.
But to check for duplicates in a unique identifier, use count() with sort = TRUE:
## # A tibble: 3 x 2
##   id        n
##   <chr> <int>
## 1 ID1       1
## 2 ID2       1
## 3 ID3       1
## # A tibble: 3 x 2
##   id        n
##   <chr> <int>
## 1 ID3       2
## 2 ID1       1
## 3 ID2       1
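A sketch of how the two outputs above might be produced. The data are rebuilt by hand, and the duplicate is created by repeating the third row; both are assumptions made for illustration, as is the name typesdata_dup.

```r
library(dplyr)

# Assumption: a hand-built copy of the example data
typesdata <- tibble(
  id    = c("ID1", "ID2", "ID3"),
  group = c("Control", "Treatment", "Treatment")
)

# All ids are unique: every count is 1
typesdata %>% count(id, sort = TRUE)

# Add a duplicate of the third row (hypothetical), then count again
typesdata_dup <- bind_rows(typesdata, slice(typesdata, 3))
id_counts <- typesdata_dup %>% count(id, sort = TRUE)
id_counts

# For larger datasets, keep only the ids that appear more than once
duplicated_ids <- id_counts %>% filter(n > 1)
duplicated_ids
```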
2.2.3 Factor variables (categorical)
Factors are fussy characters.
Factors are fussy because they include something called levels.
Levels are all the unique values a factor variable could take, like the ones we saw with typesdata %>% count(group).
Using factors rather than just characters can be useful because:
- The values factor levels can take are fixed. For example, once you tell R that typesdata$group is a factor with two levels: Control and Treatment, combining it with other datasets with different spellings or abbreviations for the same variable will generate a warning. This can be helpful but can also be a nuisance when you really do want to add in another level to a factor variable.
- Levels have an order. When running statistical tests on grouped data (e.g., Control vs Treatment, Adult vs Child) and the variable is just a character, not a factor, R will use the alphabetically first as the reference (comparison) level. Converting a character column into a factor column enables us to define and change the order of its levels. Level order affects many things including regression results and plots: by default, categorical variables are ordered alphabetically. If we want a different order in say a bar plot, we need to convert to a factor and reorder before we plot it. The plot will then order the groups correctly.
So overall, since health data is often categorical and has a reference (comparison) level, factors are an essential way to work with these data in R. Nevertheless, the fussiness of factors can sometimes be unhelpful or even frustrating. A lot more about factor handling will be covered later (8).
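Both properties of factors can be illustrated with a short base R sketch. The group values are typed in by hand, and the "Placebo" entry is a made-up example of a value outside the known levels. Assigning it directly is used here as a stand-in for the combining-datasets situation described above, since the exact behaviour when combining data depends on the function used.

```r
group_chr <- c("Control", "Treatment", "Treatment")

# Converting to a factor: levels default to alphabetical order
group_fct <- factor(group_chr)
levels(group_fct)   # "Control" "Treatment"

# Levels are fixed: a value outside them becomes NA
# (R gives a warning about an invalid factor level, silenced here)
group_fct_bad <- group_fct
suppressWarnings(group_fct_bad[1] <- "Placebo")
is.na(group_fct_bad[1])   # TRUE

# Levels have an order: state it explicitly to change the reference level
group_reordered <- factor(group_chr, levels = c("Treatment", "Control"))
levels(group_reordered)   # "Treatment" "Control"
```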
2.2.4 Date/time variables
R is good for working with dates. For example, it can calculate the number of days/weeks/months between two dates, or it can be used to find what a future date is (e.g., “what’s the date exactly 60 days from now?”). It also knows about time zones and is happy to parse dates in pretty much any format - as long as you tell R how your date is formatted (e.g., day before month, month name abbreviated, year in 2 or 4 digits, etc.). Since R displays dates and times between quotes (" "), they look similar to characters. However, it is important to know whether R has understood which of your columns contain date/time information, and which are just normal characters.
## [1] "2021-01-15 11:56:12 GMT"
## [1] "2020-12-01 12:00"
When printed, the two objects - current_datetime and my_datetime - seem to have a similar format.
But if we try to calculate the difference between these two dates, we get an error:
## [1] "Error in `-.POSIXt`(my_datetime, current_datetime)"
That’s because when we assigned a value to
my_datetime, R assumed the simpler type for it - so a character.
We can check what the type of an object or variable is using the class() function:
## [1] "POSIXct" "POSIXt"
## [1] "character"
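A base R sketch of the situation described above. A fixed date-time stands in for the current time so that the example is reproducible; that substitution is an assumption made for illustration.

```r
# Assumption: a fixed stand-in for the current date-time
current_datetime <- as.POSIXct("2021-01-15 11:56:12", tz = "UTC")

# Looks like a date-time when printed, but is just text
my_datetime <- "2020-12-01 12:00"

class(current_datetime)  # "POSIXct" "POSIXt"
class(my_datetime)       # "character"

# Subtracting text from a date-time fails; catch the error instead of stopping
subtraction_result <- tryCatch(
  my_datetime - current_datetime,
  error = function(e) "error"
)
subtraction_result
```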
So we need to tell R that
my_datetime does indeed include date/time information so we can then use it in calculations:
## [1] "2020-12-01 12:00:00 UTC"
Calculating the difference will now work:
## Time difference of -44.99737 days
Since R knows this is a difference between two date/time objects, it prints it in a nicely readable way. Furthermore, the result has its own type; it is a “difftime”.
## [1] "difftime"
This is useful if we want to apply this time difference on another date, e.g.:
## [1] "2020-11-18 12:03:47 UTC"
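The conversion and the calculations above can be sketched with lubridate as follows. A fixed date-time stands in for the current time, and the name my_datetime_converted is made up for this example. The date being shifted is also an illustrative choice.

```r
library(lubridate)

# Assumption: a fixed stand-in for the current date-time
current_datetime <- ymd_hms("2021-01-15 11:56:12")
my_datetime <- "2020-12-01 12:00"

# Tell R that the text is year-month-day hour:minute
my_datetime_converted <- ymd_hm(my_datetime)
class(my_datetime_converted)   # "POSIXct" "POSIXt"

# The difference between two date-times is a difftime
my_datesdiff <- my_datetime_converted - current_datetime
my_datesdiff
class(my_datesdiff)            # "difftime"

# A difftime can be added to another date-time
shifted <- ymd_hm("2021-01-02 12:00") + my_datesdiff
shifted
```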
But what if we want to use the number of days in a normal calculation? For example, say a measurement increased by 560 arbitrary units during this time period. We might want to calculate the increase per day like this:
## [1] "Error in `/.difftime`(560, my_datesdiff)"
Doesn’t work, does it.
We need to convert my_datesdiff (which is a difftime value) into a numeric value by using the as.numeric() function:
## [1] -12.44517
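A base R sketch of the same idea with made-up dates that are exactly 45 days apart. The names start_time, end_time and elapsed are hypothetical. Passing units = "days" to as.numeric() makes the unit explicit rather than relying on whatever unit R picked for printing.

```r
# Hypothetical dates, exactly 45 days apart
start_time <- as.POSIXct("2020-12-01 12:00:00", tz = "UTC")
end_time   <- as.POSIXct("2021-01-15 12:00:00", tz = "UTC")

elapsed <- end_time - start_time
class(elapsed)   # "difftime"

# Dividing a number by a difftime is an error; catch it instead of stopping
division_result <- tryCatch(560 / elapsed, error = function(e) "error")
division_result

# Convert to a plain number of days first, then divide
elapsed_days <- as.numeric(elapsed, units = "days")
increase_per_day <- 560 / elapsed_days
increase_per_day
```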
The lubridate package comes with several convenient functions for parsing dates, e.g., ymd(), dmy(), ymd_hm(), etc. - for a full list see lubridate.tidyverse.org.
However, if your date/time variable comes in an extra special format, then use the
parse_date_time() function where the second argument specifies the format using the specifiers given in Table 2.2.
| Specifier | Meaning | Values |
|---|---|---|
| %d | day as number | 01-31 |
| %m | month as number | 01-12 |
## [1] "2020-01-07 12:34:00 UTC"
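A minimal sketch of parse_date_time() with an explicit format. The input string is made up for illustration and is not necessarily the one that produced the output above; it uses the %d and %m specifiers from the table together with %Y (4-digit year), %H (hour) and %M (minute).

```r
library(lubridate)

# Hypothetical input: day/month/year followed by hour:minute
raw_datetime <- "07/01/2020 12:34"

# The second argument describes the order and format of the pieces
parsed <- parse_date_time(raw_datetime, "%d/%m/%Y %H:%M")
parsed
class(parsed)   # "POSIXct" "POSIXt"
```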
Furthermore, the same date/time specifiers can be used to rearrange your date and time for printing:
## [1] "2021-01-15 11:56:12 GMT"
## [1] "11:56 on January-15 (2021)"
You can even add plain text into the format() function; R will know to put the right date/time values where the % specifiers are:
## [1] "Happy days, the current time is 11:56 January-15 (2021)!"
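The two printed strings above can be reproduced with a sketch like this. A fixed date-time stands in for the current time, and an English locale is assumed because %B prints the month name in the current locale.

```r
# Assumption: a fixed stand-in for the current date-time
fixed_datetime <- as.POSIXct("2021-01-15 11:56:12", tz = "UTC")

# Rearranging the pieces for printing
rearranged <- format(fixed_datetime, "%H:%M on %B-%d (%Y)")
rearranged

# Plain text is passed through unchanged; only the % specifiers are replaced
with_text <- format(
  fixed_datetime,
  "Happy days, the current time is %H:%M %B-%d (%Y)!"
)
with_text

# The result of format() is text, not a date-time
class(rearranged)   # "character"
```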
[4] Portable Operating System Interface (POSIX) is a set of computing standards. There’s nothing more to understand about this other than when R starts shouting “POSIXct this or POSIXlt that” at you, check your date and time variables.