2.8 Creating new columns - mutate() | R for Health Data Science

2.8 Creating new columns - `mutate()`

The function for adding new columns (or making changes to existing ones) to a tibble is called mutate(). As a reminder, this is what typesdata looked like:

typesdata

## # A tibble: 3 x 4
##   id    group     measurement date               
##   <chr> <chr>           <dbl> <dttm>             
## 1 ID1   Control           1.8 2017-01-02 12:00:00
## 2 ID2   Treatment         4.5 2018-02-03 13:00:00
## 3 ID3   Treatment         3.7 2019-03-04 14:00:00

Let’s say we decide to divide the column measurement by 2. A quick way to see these values would be to pull them out using the $ operator and then divide by 2:

typesdata$measurement

## [1] 1.8 4.5 3.7

typesdata$measurement/2

## [1] 0.90 2.25 1.85

But this becomes cumbersome once we want to combine multiple variables from the same tibble in a calculation. So the mutate() is the way to go here:

typesdata %>% 
  mutate(measurement/2)

## # A tibble: 3 x 5
##   id    group     measurement date                `measurement/2`
##   <chr> <chr>           <dbl> <dttm>                        <dbl>
## 1 ID1   Control           1.8 2017-01-02 12:00:00            0.9 
## 2 ID2   Treatment         4.5 2018-02-03 13:00:00            2.25
## 3 ID3   Treatment         3.7 2019-03-04 14:00:00            1.85

Notice how the mutate() above returns the whole tibble with a new column called measurement/2. This is quite nice of mutate(), but it would be best to give columns names that don’t include characters other than letters, numbers, underscores (_) or dots (.). So let’s assign a more standard name for this new column:

typesdata %>% 
  mutate(measurement_half = measurement/2)

## # A tibble: 3 x 5
##   id    group     measurement date                measurement_half
##   <chr> <chr>           <dbl> <dttm>                         <dbl>
## 1 ID1   Control           1.8 2017-01-02 12:00:00             0.9 
## 2 ID2   Treatment         4.5 2018-02-03 13:00:00             2.25
## 3 ID3   Treatment         3.7 2019-03-04 14:00:00             1.85

Better. You can see that R likes the name we gave it a bit better as it’s now removed the back-ticks from around it. Overall, back-ticks can be used to call out non-standard column names, so if you are forced to read in data with, e.g., spaces in column names, then the back-ticks enable calling column names that would otherwise error¹⁰:

mydata$`Nasty column name`

# or

mydata %>% 
  select(`Nasty column name`)

But as usual, if it gets printed, it doesn’t get saved. We have two options - we can either overwrite the typesdata tibble (by changing the first line to typesdata = typesdata %>%), or we can create a new one (that appears in your Environment):

typesdata_modified <- typesdata %>% 
  mutate(measurement_half = measurement/2)

typesdata_modified

## # A tibble: 3 x 5
##   id    group     measurement date                measurement_half
##   <chr> <chr>           <dbl> <dttm>                         <dbl>
## 1 ID1   Control           1.8 2017-01-02 12:00:00             0.9 
## 2 ID2   Treatment         4.5 2018-02-03 13:00:00             2.25
## 3 ID3   Treatment         3.7 2019-03-04 14:00:00             1.85

The mutate() function can also be used to create a new column with a single constant value; which in return can be used to calculate a difference for each of the existing dates:

library(lubridate)
typesdata %>% 
  mutate(reference_date   = ymd_hm("2020-01-01 12:00"),
         dates_difference = reference_date - date) %>% 
  select(date, reference_date, dates_difference)

## # A tibble: 3 x 3
##   date                reference_date      dates_difference
##   <dttm>              <dttm>              <drtn>          
## 1 2017-01-02 12:00:00 2020-01-01 12:00:00 1094.0000 days  
## 2 2018-02-03 13:00:00 2020-01-01 12:00:00  696.9583 days  
## 3 2019-03-04 14:00:00 2020-01-01 12:00:00  302.9167 days

(We are then using the select() function to only choose the three relevant columns.)

Finally, the mutate function can be used to create a new column with a summarised value in it, e.g., the mean of another column:

typesdata %>% 
  mutate(mean_measurement = mean(measurement))

## # A tibble: 3 x 5
##   id    group     measurement date                mean_measurement
##   <chr> <chr>           <dbl> <dttm>                         <dbl>
## 1 ID1   Control           1.8 2017-01-02 12:00:00             3.33
## 2 ID2   Treatment         4.5 2018-02-03 13:00:00             3.33
## 3 ID3   Treatment         3.7 2019-03-04 14:00:00             3.33

Which in return can be useful for calculating a standardized measurement (i.e., relative to the mean):

typesdata %>% 
  mutate(mean_measurement     = mean(measurement)) %>% 
  mutate(measurement_relative = measurement/mean_measurement) %>% 
  select(matches("measurement"))

## # A tibble: 3 x 3
##   measurement mean_measurement measurement_relative
##         <dbl>            <dbl>                <dbl>
## 1         1.8             3.33                 0.54
## 2         4.5             3.33                 1.35
## 3         3.7             3.33                 1.11

2.8.1 Worked example/exercise

Round the difference to 0 decimal places using the round() function inside a mutate(). Then add a clever matches("date") inside the select() function to choose all matching columns.

Solution:

typesdata %>% 
  mutate(reference_date   = ymd_hm("2020-01-01 12:00"),
         dates_difference = reference_date - date) %>% 
  mutate(dates_difference = round(dates_difference)) %>% 
  select(matches("date"))

## # A tibble: 3 x 3
##   date                reference_date      dates_difference
##   <dttm>              <dttm>              <drtn>          
## 1 2017-01-02 12:00:00 2020-01-01 12:00:00 1094 days       
## 2 2018-02-03 13:00:00 2020-01-01 12:00:00  697 days       
## 3 2019-03-04 14:00:00 2020-01-01 12:00:00  303 days

You can shorten this by adding the round() function directly around the subtraction, so the third line becomes dates_difference = round(reference_date - date)) %>%. But sometimes writing calculations out longer than the absolute minimum can make them easier to understand when you return to an old script months later.

Furthermore, we didn’t have to save the reference_date as a new column, the calculation could have used the value directly: mutate(dates_difference = ymd_hm("2020-01-01 12:00") - date) %>%. But again, defining it makes it clearer for your future self to see what was done. And it makes reference_date available for reuse in more complicated calculations within the tibble.

If this happens to you a lot, then check out library(janitor) and its function clean_names() for automatically tidying non-standard column names.↩︎