3.4 Add new columns: mutate()

We met mutate() in the last chapter. Let’s first give the summarised column a better name, e.g., deaths_per_group. We can then remove the grouping with ungroup(). This matters because summarise() only drops the last grouping variable (here sex), so its result is still grouped by cause, and any later calculation would be done within each cause rather than over the whole tibble. Combining ungroup() with mutate() adds a total deaths column, which will be used below to calculate a percentage:

gbd2017 %>% 
  group_by(cause, sex) %>% 
  summarise(deaths_per_group = sum(deaths_millions)) %>% 
  ungroup() %>% 
  mutate(deaths_total = sum(deaths_per_group))
## `summarise()` regrouping output by 'cause' (override with `.groups` argument)
## # A tibble: 6 x 4
##   cause                     sex    deaths_per_group deaths_total
##   <chr>                     <chr>             <dbl>        <dbl>
## 1 Communicable diseases     Female             4.91        55.74
## 2 Communicable diseases     Male               5.47        55.74
## 3 Injuries                  Female             1.42        55.74
## 4 Injuries                  Male               3.05        55.74
## 5 Non-communicable diseases Female            19.15        55.74
## 6 Non-communicable diseases Male              21.74        55.74
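To see why ungroup() matters here, the sketch below compares the same mutate() call with and without it. It does not use gbd2017 itself: the tibble deaths is a stand-in typed in by hand from the six per-group sums printed above, and the names deaths, grouped_totals and ungrouped_totals are only illustrative.

```r
library(dplyr)

# Stand-in data (an assumption for this example): the six per-group sums
# from the output above, entered by hand rather than computed from gbd2017
deaths <- tibble(
  cause = rep(c("Communicable diseases", "Injuries",
                "Non-communicable diseases"), each = 2),
  sex = rep(c("Female", "Male"), times = 3),
  deaths_per_group = c(4.91, 5.47, 1.42, 3.05, 19.15, 21.74)
)

# Still grouped by cause: sum() is calculated separately within each cause
grouped_totals <- deaths %>% 
  group_by(cause) %>% 
  mutate(deaths_total = sum(deaths_per_group))

# Grouping removed: sum() is calculated once, over all six rows
ungrouped_totals <- deaths %>% 
  group_by(cause) %>% 
  ungroup() %>% 
  mutate(deaths_total = sum(deaths_per_group))
```

In grouped_totals the deaths_total column holds three different values, one per cause (10.38, 4.47 and 40.89), whereas in ungrouped_totals every row holds the overall total of 55.74.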

3.4.1 Percentages formatting: percent()

So summarise() condenses a tibble, whereas mutate() retains its current size and adds columns. We can also add further lines to mutate() to calculate the percentage of each group:

# the percent() function for formatting percentages comes from library(scales)
library(scales)
gbd2017_summarised <- gbd2017 %>% 
  group_by(cause, sex) %>% 
  summarise(deaths_per_group = sum(deaths_millions)) %>% 
  ungroup() %>% 
  mutate(deaths_total    = sum(deaths_per_group),
         deaths_relative = percent(deaths_per_group/deaths_total))
gbd2017_summarised
## # A tibble: 6 x 5
##   cause                     sex    deaths_per_group deaths_total deaths_relative
##   <chr>                     <chr>             <dbl>        <dbl> <chr>          
## 1 Communicable diseases     Female             4.91        55.74 8.8%           
## 2 Communicable diseases     Male               5.47        55.74 9.8%           
## 3 Injuries                  Female             1.42        55.74 2.5%           
## 4 Injuries                  Male               3.05        55.74 5.5%           
## 5 Non-communicable diseases Female            19.15        55.74 34.4%          
## 6 Non-communicable diseases Male              21.74        55.74 39.0%

The percent() function comes from library(scales) and is a handy way of formatting percentages. Keep in mind that it changes the column from a number (denoted <dbl>) to a character (<chr>). The percent() function is roughly equivalent to:

# using values from the first row as an example:
round(100*4.91/55.74, 1) %>% paste0("%")
## [1] "8.8%"
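As a small check of that equivalence, the sketch below formats the first-row value both ways. It sets the accuracy argument of percent() to 0.1 to ask for one decimal place explicitly. The original code relies on the default instead. The object names fraction, by_hand and with_scales are only illustrative.

```r
library(scales)

# First-row values from the table above
fraction <- 4.91 / 55.74

# Hand-rolled version: scale to a percentage, round, then paste on the % sign
by_hand <- paste0(round(100 * fraction, 1), "%")

# scales version, with one decimal place requested via accuracy
with_scales <- percent(fraction, accuracy = 0.1)

# Both results are character strings, not numbers
is.character(by_hand)
is.character(with_scales)
```

The two approaches differ slightly for values that round to a whole number: the hand-rolled version drops the trailing zero (39.0 becomes "39%"), whereas percent() with accuracy = 0.1 keeps it ("39.0%").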

This is convenient for the final presentation of numbers, but if you intend to do further calculations with the percentages, or to plot or sort them, just calculate them as fractions with:

gbd2017_summarised %>% 
  mutate(deaths_relative = deaths_per_group/deaths_total)
## # A tibble: 6 x 5
##   cause                     sex    deaths_per_group deaths_total deaths_relative
##   <chr>                     <chr>             <dbl>        <dbl>           <dbl>
## 1 Communicable diseases     Female             4.91        55.74         0.08809
## 2 Communicable diseases     Male               5.47        55.74         0.09813
## 3 Injuries                  Female             1.42        55.74         0.02548
## 4 Injuries                  Male               3.05        55.74         0.05472
## 5 Non-communicable diseases Female            19.15        55.74         0.3436 
## 6 Non-communicable diseases Male              21.74        55.74         0.3900

and convert to nicely formatted percentages later with mutate(deaths_percentage = percent(deaths_relative)).
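The sketch below illustrates why the order of these steps matters when sorting. It uses a three-row stand-in tibble (an assumption for this example, with fractions copied from the output above) rather than gbd2017_summarised, and the names deaths, sorted_as_text and sorted_as_number are only illustrative.

```r
library(dplyr)
library(scales)

# Stand-in data: three of the fractions from the output above
deaths <- tibble(
  group           = c("Communicable, Female",
                      "Non-communicable, Female",
                      "Non-communicable, Male"),
  deaths_relative = c(0.08809, 0.3436, 0.3900)
)

# Formatting first, then sorting: the column is character,
# so the rows are ordered as text, not by size
sorted_as_text <- deaths %>% 
  mutate(deaths_percentage = percent(deaths_relative, accuracy = 0.1)) %>% 
  arrange(desc(deaths_percentage))

# Sorting the numeric fractions first, then formatting for presentation
sorted_as_number <- deaths %>% 
  arrange(desc(deaths_relative)) %>% 
  mutate(deaths_percentage = percent(deaths_relative, accuracy = 0.1))
```

In sorted_as_text the first row is "8.8%", because the character "8" sorts after "3". In sorted_as_number the first row is "39.0%", the largest value, as intended.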