## 2.2 Variable types and why we care

There are three broad types of data:

- continuous (numbers), in R: numeric, double, or integer;
- categorical, in R: character, factor, or logical (TRUE/FALSE);
- date/time, in R: POSIXct date-time^{4}.

Values within a column all have to be the same type, but a tibble can of course hold columns of different types. Generally, R is good at figuring out what type of data you have (in programming, this ‘figuring out’ is called ‘parsing’).

For example, when reading in data, it will tell you what was assumed for each column:

```
## Parsed with column specification:
## cols(
## id = col_character(),
## group = col_character(),
## measurement = col_double(),
## date = col_datetime(format = "")
## )
```

```
## # A tibble: 3 x 4
## id group measurement date
## <chr> <chr> <dbl> <dttm>
## 1 ID1 Control 1.8 2017-01-02 12:00:00
## 2 ID2 Treatment 4.5 2018-02-03 13:00:00
## 3 ID3 Treatment 3.7 2019-03-04 14:00:00
```
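Output like this can be reproduced without a file on disk. The following is a minimal sketch, assuming the **readr** package (part of the tidyverse, version 2.0 or later) is installed; the inline CSV text is a hand-typed stand-in for the real data file.

```r
library(readr)

# Inline CSV text standing in for the data file (an assumption for illustration)
csv_text <- "id,group,measurement,date
ID1,Control,1.8,2017-01-02 12:00:00
ID2,Treatment,4.5,2018-02-03 13:00:00
ID3,Treatment,3.7,2019-03-04 14:00:00"

# I() marks the string as literal data rather than a file path
typesdata <- read_csv(I(csv_text))

# Show the column types that were guessed, then the data itself
spec(typesdata)
typesdata
```

`spec()` returns the column specification that was guessed, which is handy when the message printed during reading has scrolled out of view.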

This means that a lot of the time you do not have to worry about those little `<chr>` vs `<dbl>` vs `<dttm>` labels.
But in cases of irregular or faulty input data, or when doing a lot of calculations and modifications to your data, we need to be aware of these different types to be able to find and fix mistakes.

For example, consider a file similar to the one above but with some data entry issues introduced:

```
## Parsed with column specification:
## cols(
## id = col_character(),
## group = col_character(),
## measurement = col_character(),
## date = col_character()
## )
```

```
## # A tibble: 3 x 4
## id group measurement date
## <chr> <chr> <chr> <chr>
## 1 ID1 Control 1.8 02-Jan-17 12:00
## 2 ID2 Treatment 4.5 03-Feb-18 13:00
## 3 ID3 Treatment 3.7 or 3.8 04-Mar-19 14:00
```

Notice that R parsed both the measurement and date variables as characters. Measurement has been parsed as a character because of a data entry issue: the person taking the measurement couldn’t decide which value to note down (maybe the scale was shifting between the two values), so they included both values and the text “or” in the cell.

A numeric variable will also get parsed as a categorical variable if it contains certain typos, e.g., if entered as “3..7” instead of “3.7”.
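One way to locate such entries is to attempt the conversion and see which values fail. The sketch below uses base R only; the vector is typed in by hand to mirror the problems described above and is not read from the actual file.

```r
# Hand-typed values mirroring the data entry issues described above
measurement_raw <- c("1.8", "4.5", "3.7 or 3.8", "3..7")

# Values that cannot be read as numbers become NA
# (as.numeric() also warns about this; the warning is silenced here)
measurement_num <- suppressWarnings(as.numeric(measurement_raw))
measurement_num

# Which entries failed to convert, and what did they contain?
measurement_raw[is.na(measurement_num)]
```

The entries returned by the last line are the ones that need to be corrected by hand or with a cleaning rule before the column can be treated as numeric.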

The reason R didn’t automatically make sense of the date column is that it couldn’t tell which part is the day and which is the year: `02-Jan-17` could stand for `02-Jan-2017` as well as `2002-Jan-17`.
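The ambiguity disappears once the format is stated explicitly. The sketch below uses base R's `as.POSIXct()` with two different format strings on the same text; it assumes an English locale (needed for `%b` to recognise "Jan"), and the object names are chosen here for illustration.

```r
date_raw <- "02-Jan-17 12:00"

# Reading 1: day, abbreviated month, 2-digit year
day_first <- as.POSIXct(date_raw, format = "%d-%b-%y %H:%M", tz = "UTC")
day_first

# Reading 2: 2-digit year, abbreviated month, day
year_first <- as.POSIXct(date_raw, format = "%y-%b-%d %H:%M", tz = "UTC")
year_first
```

Both calls succeed, but they give dates 15 years apart, which is why R does not guess.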

Therefore, while a lot of the time you do not have to worry about variable types and can just get on with your analysis, it is important to understand what the different types are to be ready to deal with them when issues arise.

Since health datasets are generally full of categorical data, it is crucial to understand the difference between characters and factors (both are types of categorical variables in R with pros and cons).

So here we go.

### 2.2.1 Numeric variables (continuous)

Numbers are straightforward to handle and don’t usually cause trouble.
R usually refers to numbers as `numeric` (or `num`), but sometimes it really gets its nerd on and calls numbers `integer` or `double`.
Integers are numbers without decimal places (e.g., `1, 2, 3`), whereas `double` stands for “Double-precision floating-point” format (e.g., `1.234, 5.67890`).

It doesn’t usually matter whether R is classifying your continuous data `numeric/num/double/int`, but it is good to be aware of these different terms as you will see them in R messages.

FRIENDLY WARNING: What’s about to follow is a bit dry. Furthermore, it is not essential for complete beginners - you might want to continue reading from Character variables. Before you leave, make a mental note that sometimes numbers in R have more decimal places than it seems, and that can cause funny behaviour when using the double equals operator (`==`).

Something to note about numbers is that R doesn’t usually print more than 6 decimal places, but that doesn’t mean they don’t exist.
For example, from the `typesdata` tibble, we’re taking the `measurement` column and sending it to the `mean()` function.
R then calculates the mean and tells us what it is with 6 decimal places:

`## [1] 3.333333`

Let’s save that in a new object, `measurement_mean`.

But when using the double equals operator to check if this is equivalent to a fixed value (you might do this when comparing to a threshold, or even another mean value), R returns `FALSE`:

`## [1] FALSE`

Now this doesn’t seem right, does it - R clearly told us just above that the mean of this variable is 3.333333 (reminder: the actual values in the measurement column are 1.8, 4.5, 3.7).
The reason the above statement is `FALSE` is because `measurement_mean` is quietly holding more than 6 decimal places.

One way to go about this is to round the mean to a reasonable number of decimal places:

`## [1] 3.333`

The second argument of `round()` specifies the number of decimal places you want your number(s) rounded to.
So when using `round()` in the equality statement like this, we get the expected `TRUE`:

`## [1] TRUE`

This is usually fine, especially if you’ve finished applying calculations to that number.
But when you intend to use it in further calculations, rounding should be left to the very end - to minimise rounding errors.
This is where the `near()` function comes in handy:

`## [1] TRUE`

The first two arguments for `near()` are the numbers you are comparing, the third argument is the precision you are interested in. So if the numbers are equal within that precision, it returns `TRUE`.
This means you get the expected result without having to round the numbers off.
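The whole sequence can be sketched as follows. This assumes the **dplyr** package is installed (it provides `near()`); the measurement values are typed in by hand from the example data rather than taken from the `typesdata` tibble.

```r
library(dplyr)  # provides near()

# The three measurement values from the example data
measurement <- c(1.8, 4.5, 3.7)
measurement_mean <- mean(measurement)

measurement_mean                      # printed in shortened form
print(measurement_mean, digits = 15)  # shows more of the stored value

# Exact comparison fails because the stored value has more decimals
measurement_mean == 3.333333          # FALSE

# Option 1: round first, then compare
round(measurement_mean, 3) == 3.333   # TRUE

# Option 2: compare within a tolerance, leaving the value unrounded
near(measurement_mean, 3.333, 0.001)  # TRUE
```

The tolerance of `0.001` is a choice made for this example; pick one that matches the precision that matters for your data.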

### 2.2.2 Character variables

*Characters* (sometimes referred to as *strings* or *character strings*) in R are letters, words, or even whole sentences (an example of this may be free text comments).
Characters are displayed in-between `""` (or `''`).

A useful function for quickly investigating categorical variables is the `count()` function:

```
## # A tibble: 2 x 2
## group n
## <chr> <int>
## 1 Control 1
## 2 Treatment 2
```

`count()` can accept multiple variables and will count up the number of observations in each subgroup, e.g., `mydata %>% count(var1, var2)`.

Another helpful option for `count()` is `sort = TRUE`, which will order the result, putting the highest count (`n`) at the top.

```
## # A tibble: 2 x 2
## group n
## <chr> <int>
## 1 Treatment 2
## 2 Control 1
```
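The two tables above can be produced with calls along these lines. This is a sketch that assumes the **dplyr** package is installed; `typesdata` is rebuilt by hand here, with only the columns needed, so that the example runs on its own.

```r
library(dplyr)

# Hand-typed copy of the three-row example data
typesdata <- tibble(
  id          = c("ID1", "ID2", "ID3"),
  group       = c("Control", "Treatment", "Treatment"),
  measurement = c(1.8, 4.5, 3.7)
)

# Number of rows in each group, in alphabetical order of group
typesdata %>%
  count(group)

# The same counts, largest first
group_counts <- typesdata %>%
  count(group, sort = TRUE)
group_counts
```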

`count()` with the `sort = TRUE` option is also useful for identifying duplicate IDs or misspellings in your data.
With this example `tibble` (`typesdata`) that only has three rows, it is easy to see that the `id` column is a unique identifier whereas the `group` column is a categorical variable.

You can check everything by just eyeballing the `tibble` using the built-in Viewer tab (click on the dataset in the Environment tab).

But for larger datasets, you need to know how to check and then clean data programmatically - you can’t go through thousands of values checking they are all as intended without unexpected duplicates or typos.

For most variables (categorical or numeric), we recommend always plotting your data before starting analysis.
But to check for duplicates in a unique identifier, use `count()` with `sort = TRUE`:

```
## # A tibble: 3 x 2
## id n
## <chr> <int>
## 1 ID1 1
## 2 ID2 1
## 3 ID3 1
```

```
# we add in a duplicate row where id = ID3,
# then count again:
typesdata %>%
  add_row(id = "ID3") %>%
  count(id, sort = TRUE)
```

```
## # A tibble: 3 x 2
## id n
## <chr> <int>
## 1 ID3 2
## 2 ID1 1
## 3 ID2 1
```

### 2.2.3 Factor variables (categorical)

*Factors* are fussy characters.
Factors are fussy because they include something called *levels*.
Levels are all the unique values a factor variable could take, e.g., the values listed when we looked at `typesdata %>% count(group)`.
Using factors rather than just characters can be useful because:

- The values a factor can take are fixed. For example, once you tell R that `typesdata$group` is a factor with two levels: Control and Treatment, combining it with other datasets with different spellings or abbreviations for the same variable will generate a warning. This can be helpful but can also be a nuisance when you really do want to add in another level to a `factor` variable.
- Levels have an order. When running statistical tests on grouped data (e.g., Control vs Treatment, Adult vs Child) and the variable is just a character, not a factor, R will use the alphabetically first as the reference (comparison) level. Converting a character column into a factor column enables us to define and change the order of its levels. Level order affects many things including regression results and plots: by default, categorical variables are ordered alphabetically. If we want a different order in say a bar plot, we need to convert to a factor and reorder before we plot it. The plot will then order the groups correctly.
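Both points can be illustrated with a short base R sketch. The vector below is typed in by hand to match the `group` column, and "Placebo" is a made-up value used only to show what happens when a value is not among the levels.

```r
group_chr <- c("Control", "Treatment", "Treatment")

# Default: levels are the unique values in alphabetical order
group_fct <- factor(group_chr)
levels(group_fct)

# Explicit level order: the first level acts as the reference level
group_reordered <- factor(group_chr, levels = c("Treatment", "Control"))
levels(group_reordered)

# Levels are fixed: assigning a value that is not a level
# produces NA and a warning
group_copy <- group_fct
group_copy[1] <- "Placebo"
group_copy
```

Here the fussiness shows up on assignment; combining datasets with mismatched spellings is a similar situation.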

So overall, since health data is often categorical and has a reference (comparison) level, factors are an essential way to work with these data in R. Nevertheless, the fussiness of factors can sometimes be unhelpful or even frustrating. A lot more about factor handling will be covered later (Chapter 8).

### 2.2.4 Date/time variables

R is good for working with dates. For example, it can calculate the number of days/weeks/months between two dates, or it can be used to find what a future date is (e.g., “what’s the date exactly 60 days from now?”). It also knows about time zones and is happy to parse dates in pretty much any format - as long as you tell R how your date is formatted (e.g., day before month, month name abbreviated, year in 2 or 4 digits, etc.). Since R displays dates and times between quotes (""), they look similar to characters. However, it is important to know whether R has understood which of your columns contain date/time information, and which are just normal characters.

```
library(lubridate) # lubridate makes working with dates easier
current_datetime <- Sys.time()
current_datetime
```

`## [1] "2020-04-15 12:43:05 BST"`

`## [1] "2020-12-01 12:00"`

When printed, the two objects - `current_datetime` and `my_datetime` - seem to have a similar format.
But if we try to calculate the difference between these two dates, we get an error:

`## Error in `-.POSIXt`(my_datetime, current_datetime): can only subtract from "POSIXt" objects`

That’s because when we assigned a value to `my_datetime`, R assumed the simplest type for it - a character.
We can check what the type of an object or variable is using the `class()` function:

`## [1] "POSIXct" "POSIXt"`

`## [1] "character"`

So we need to tell R that `my_datetime` does indeed include date/time information so we can then use it in calculations:

`## [1] "2020-12-01 12:00:00 UTC"`

Calculating the difference will now work:

`## Time difference of 230.0117 days`

Since R knows this is a difference between two date/time objects, it prints the result in a nicely readable way. Furthermore, the result has its own type: it is a “difftime”.

`## [1] "difftime"`

This is useful if we want to apply this time difference on another date, e.g.:

`## [1] "2021-08-20 12:16:54 UTC"`

But what if we want to use the number of days in a normal calculation? For example, say a measurement increased by 560 arbitrary units during this time period. We might want to calculate the increase per day like this:

`## Error in `/.difftime`(560, my_datesdiff): second argument of / cannot be a "difftime" object`

Doesn’t work, does it.
We need to convert `my_datesdiff` (which is a difftime value) into a numeric value by using the `as.numeric()` function:

`## [1] 2.434658`
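The steps of this section can be sketched end to end as follows, assuming the **lubridate** package is installed. To keep the numbers reproducible, a fixed date-time is used in place of `Sys.time()` (it is the printed BST time above expressed in UTC), and the name `my_datetime_converted` is chosen here for illustration.

```r
library(lubridate)

# Fixed stand-in for Sys.time() so the result is reproducible
current_datetime <- ymd_hms("2020-04-15 11:43:05", tz = "UTC")
my_datetime <- "2020-12-01 12:00"

class(current_datetime)  # "POSIXct" "POSIXt"
class(my_datetime)       # "character"

# Tell R the string holds a date-time (year-month-day hour:minute)
my_datetime_converted <- ymd_hm(my_datetime)

# Subtraction now works and gives a difftime
my_datesdiff <- my_datetime_converted - current_datetime
my_datesdiff
class(my_datesdiff)      # "difftime"

# Convert to a plain number (of days) before ordinary arithmetic
increase_per_day <- 560 / as.numeric(my_datesdiff)
increase_per_day
```

`as.numeric()` returns the value in whatever units the difftime is using (days here); passing `units = "days"` makes that explicit if there is any doubt.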

The **lubridate** package comes with several convenient functions for parsing dates, e.g., `ymd()`, `mdy()`, `ymd_hm()`, etc. - for a full list see lubridate.tidyverse.org.

However, if your date/time variable comes in an extra special format, then use the `parse_date_time()` function where the second argument specifies the format using the specifiers given in Table 2.2.

| Notation | Meaning | Example |
|---|---|---|
| %d | day as number | 01-31 |
| %m | month as number | 01-12 |
| %B | month name | January-December |
| %b | abbreviated month | Jan-Dec |
| %Y | 4-digit year | 2019 |
| %y | 2-digit year | 19 |
| %H | hours | 12 |
| %M | minutes | 01 |
| %S | seconds | 59 |
| %A | weekday | Monday-Sunday |
| %a | abbreviated weekday | Mon-Sun |

For example:

`## [1] "2020-01-07 12:34:00 UTC"`
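A call along the following lines would give a result like the one above. This is a sketch: it assumes the **lubridate** package and an English locale, and the oddly formatted input string is made up here to show how the specifiers map onto the text.

```r
library(lubridate)

# Time first, then day/abbreviated month, an apostrophe, and a 2-digit year
odd_datetime <- parse_date_time("12:34 07/Jan'20", "%H:%M %d/%b'%y")
odd_datetime
```

Each specifier in the second argument lines up with one piece of the input, and the literal characters (the colon, space, slash and apostrophe) are matched as they are.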

Furthermore, the same date/time specifiers can be used to rearrange your date and time for printing:

`## [1] "2020-04-15 12:43:05 BST"`

`## [1] "12:43 on April-15 (2020)"`

You can even add plain text into the `format()` function; R will know to put the right date/time values where the `%` specifiers are:

`## [1] "Happy days, the current time is 12:43 April-15 (2020)!"`
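Both kinds of output can be sketched with base R's `format()`. A fixed date-time is used here in place of `Sys.time()` so that the result does not change from run to run, and an English locale is assumed for the month name.

```r
# Fixed stand-in for Sys.time()
current_datetime <- as.POSIXct("2020-04-15 12:43:05", tz = "UTC")

# Rearranged date and time
rearranged <- format(current_datetime, "%H:%M on %B-%d (%Y)")
rearranged

# Plain text mixed with date/time specifiers
message_text <- format(
  current_datetime,
  "Happy days, the current time is %H:%M %B-%d (%Y)!"
)
message_text
```

Note that `format()` returns a character string, so the result is for display only and can no longer be used in date calculations.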


^{4} Portable Operating System Interface (POSIX) is a set of computing standards. There’s nothing more to understand about this other than when R starts shouting “POSIXct this or POSIXlt that” at you, check your date and time variables.