2.8 Creating new columns - mutate()
The function for adding new columns (or making changes to existing ones) to a tibble is called mutate()
.
As a reminder, this is what typesdata
looked like:
## # A tibble: 3 x 4
## id group measurement date
## <chr> <chr> <dbl> <dttm>
## 1 ID1 Control 1.8 2017-01-02 12:00:00
## 2 ID2 Treatment 4.5 2018-02-03 13:00:00
## 3 ID3 Treatment 3.7 2019-03-04 14:00:00
Let’s say we decide to divide the column measurement
by 2.
A quick way to see these values would be to pull them out using the $
operator and then divide by 2:
## [1] 1.8 4.5 3.7
## [1] 0.90 2.25 1.85
But this becomes cumbersome once we want to combine multiple variables from the same tibble in a calculation. So the mutate()
is the way to go here:
## # A tibble: 3 x 5
## id group measurement date `measurement/2`
## <chr> <chr> <dbl> <dttm> <dbl>
## 1 ID1 Control 1.8 2017-01-02 12:00:00 0.9
## 2 ID2 Treatment 4.5 2018-02-03 13:00:00 2.25
## 3 ID3 Treatment 3.7 2019-03-04 14:00:00 1.85
Notice how the mutate()
above returns the whole tibble with a new column called measurement/2
.
This is quite nice of mutate()
, but it would be best to give columns names that don’t include characters other than letters, numbers, underscores (_
) or dots (.
).
So let’s assign a more standard name for this new column:
## # A tibble: 3 x 5
## id group measurement date measurement_half
## <chr> <chr> <dbl> <dttm> <dbl>
## 1 ID1 Control 1.8 2017-01-02 12:00:00 0.9
## 2 ID2 Treatment 4.5 2018-02-03 13:00:00 2.25
## 3 ID3 Treatment 3.7 2019-03-04 14:00:00 1.85
Better. You can see that R likes the name we gave it a bit better as it’s now removed the back-ticks from around it. Overall, back-ticks can be used to call out non-standard column names, so if you are forced to read in data with, e.g., spaces in column names, then the back-ticks enable calling column names that would otherwise error10:
But as usual, if it gets printed, it doesn’t get saved.
We have two options - we can either overwrite the typesdata
tibble (by changing the first line to typesdata = typesdata %>%
), or we can create a new one (that appears in your Environment):
## # A tibble: 3 x 5
## id group measurement date measurement_half
## <chr> <chr> <dbl> <dttm> <dbl>
## 1 ID1 Control 1.8 2017-01-02 12:00:00 0.9
## 2 ID2 Treatment 4.5 2018-02-03 13:00:00 2.25
## 3 ID3 Treatment 3.7 2019-03-04 14:00:00 1.85
The mutate()
function can also be used to create a new column with a single constant value; which in return can be used to calculate a difference for each of the existing dates:
library(lubridate)
typesdata %>%
mutate(reference_date = ymd_hm("2020-01-01 12:00"),
dates_difference = reference_date - date) %>%
select(date, reference_date, dates_difference)
## # A tibble: 3 x 3
## date reference_date dates_difference
## <dttm> <dttm> <drtn>
## 1 2017-01-02 12:00:00 2020-01-01 12:00:00 1094.0000 days
## 2 2018-02-03 13:00:00 2020-01-01 12:00:00 696.9583 days
## 3 2019-03-04 14:00:00 2020-01-01 12:00:00 302.9167 days
(We are then using the select()
function to only choose the three relevant columns.)
Finally, the mutate function can be used to create a new column with a summarised value in it, e.g., the mean of another column:
## # A tibble: 3 x 5
## id group measurement date mean_measurement
## <chr> <chr> <dbl> <dttm> <dbl>
## 1 ID1 Control 1.8 2017-01-02 12:00:00 3.33
## 2 ID2 Treatment 4.5 2018-02-03 13:00:00 3.33
## 3 ID3 Treatment 3.7 2019-03-04 14:00:00 3.33
Which in return can be useful for calculating a standardized measurement (i.e., relative to the mean):
typesdata %>%
mutate(mean_measurement = mean(measurement)) %>%
mutate(measurement_relative = measurement/mean_measurement) %>%
select(matches("measurement"))
## # A tibble: 3 x 3
## measurement mean_measurement measurement_relative
## <dbl> <dbl> <dbl>
## 1 1.8 3.33 0.54
## 2 4.5 3.33 1.35
## 3 3.7 3.33 1.11
2.8.1 Worked example/exercise
Round the difference to 0 decimal places using the round()
function inside a mutate()
.
Then add a clever matches("date")
inside the select()
function to choose all matching columns.
Solution:
typesdata %>%
mutate(reference_date = ymd_hm("2020-01-01 12:00"),
dates_difference = reference_date - date) %>%
mutate(dates_difference = round(dates_difference)) %>%
select(matches("date"))
## # A tibble: 3 x 3
## date reference_date dates_difference
## <dttm> <dttm> <drtn>
## 1 2017-01-02 12:00:00 2020-01-01 12:00:00 1094 days
## 2 2018-02-03 13:00:00 2020-01-01 12:00:00 697 days
## 3 2019-03-04 14:00:00 2020-01-01 12:00:00 303 days
You can shorten this by adding the round()
function directly around the subtraction, so the third line becomes dates_difference = round(reference_date - date)) %>%
.
But sometimes writing calculations out longer than the absolute minimum can make them easier to understand when you return to an old script months later.
Furthermore, we didn’t have to save the reference_date
as a new column, the calculation could have used the value directly: mutate(dates_difference = ymd_hm("2020-01-01 12:00") - date) %>%
.
But again, defining it makes it clearer for your future self to see what was done. And it makes reference_date
available for reuse in more complicated calculations within the tibble.
If this happens to you a lot, then check out
library(janitor)
and its functionclean_names()
for automatically tidying non-standard column names.↩︎