3.5 summarise() vs mutate()

So far we’ve shown you examples of using summarise() on grouped data (following group_by()) and mutate() on the whole dataset (either without using group_by() at all, or after resetting the grouping information with ungroup()).

But here’s the thing: mutate() is also happy to work on grouped data.

Let’s save the aggregated example from above in a new tibble called gbd_summarised. We will then sort the rows by sex using arrange(), just for easier viewing (they were previously sorted by cause).

The arrange() function sorts the rows within a tibble:

## # A tibble: 6 x 3
## # Groups:   cause [3]
##   cause                     sex    deaths_pergroups
##   <chr>                     <chr>             <dbl>
## 1 Communicable diseases     Female             4.91
## 2 Injuries                  Female             1.42
## 3 Non-communicable diseases Female            19.15
## 4 Communicable diseases     Male               5.47
## 5 Injuries                  Male               3.05
## 6 Non-communicable diseases Male              21.74
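The code behind this output is not shown here. As a minimal sketch, assuming dplyr and tibble are installed, we can rebuild gbd_summarised by hand from the six printed rows (rather than deriving it from gbd2017) and sort it by sex. The name gbd_sorted is ours and only used for illustration:

```r
library(dplyr)

# Assumption: gbd_summarised is typed in from the printed output above,
# in its earlier order (sorted by cause), instead of computed from gbd2017.
gbd_summarised <- tibble::tribble(
  ~cause,                      ~sex,     ~deaths_pergroups,
  "Communicable diseases",     "Female",  4.91,
  "Communicable diseases",     "Male",    5.47,
  "Injuries",                  "Female",  1.42,
  "Injuries",                  "Male",    3.05,
  "Non-communicable diseases", "Female", 19.15,
  "Non-communicable diseases", "Male",   21.74
)

# arrange() reorders rows only; it does not add, drop, or change columns
gbd_sorted <- gbd_summarised %>%
  arrange(sex)

gbd_sorted
```

Note that this hand-built tibble is ungrouped, so it will not print the "Groups: cause [3]" line seen above.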

You should also notice that summarise() drops all variables that are neither listed in group_by() nor created inside summarise() itself. So year, income, and deaths_millions exist in gbd2017, but they do not exist in gbd_summarised.

We now want to calculate the percentage of deaths from each cause for each sex. We could use summarise() to calculate the totals per sex:

## # A tibble: 2 x 2
##   sex    deaths_persex
##   <chr>          <dbl>
## 1 Female         25.48
## 2 Male           30.26
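A sketch of how these totals could be computed, again assuming dplyr and tibble are installed and using a hand-typed copy of gbd_summarised taken from the printed output earlier in this section:

```r
library(dplyr)

# Assumption: values copied from the printed gbd_summarised output
gbd_summarised <- tibble::tribble(
  ~cause,                      ~sex,     ~deaths_pergroups,
  "Communicable diseases",     "Female",  4.91,
  "Injuries",                  "Female",  1.42,
  "Non-communicable diseases", "Female", 19.15,
  "Communicable diseases",     "Male",    5.47,
  "Injuries",                  "Male",    3.05,
  "Non-communicable diseases", "Male",   21.74
)

# One row per sex: summarise() collapses each group to a single total
gbd_summarised_sex <- gbd_summarised %>%
  group_by(sex) %>%
  summarise(deaths_persex = sum(deaths_pergroups))

gbd_summarised_sex
```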

But that drops the cause and deaths_pergroups columns. One way to get them back alongside the totals is to join gbd_summarised and gbd_summarised_sex:

## Joining, by = "sex"
## # A tibble: 6 x 4
## # Groups:   cause [3]
##   cause                     sex    deaths_pergroups deaths_persex
##   <chr>                     <chr>             <dbl>         <dbl>
## 1 Communicable diseases     Female             4.91         25.48
## 2 Injuries                  Female             1.42         25.48
## 3 Non-communicable diseases Female            19.15         25.48
## 4 Communicable diseases     Male               5.47         30.26
## 5 Injuries                  Male               3.05         30.26
## 6 Non-communicable diseases Male              21.74         30.26
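The join might be written as follows. This is a sketch under the same assumptions as before (dplyr and tibble installed, gbd_summarised typed in from the printed output); gbd_joined is a name we made up for the result. Here the key is stated explicitly with by = "sex"; leaving by out makes dplyr pick the shared column itself and print a message like the one above.

```r
library(dplyr)

# Assumption: values copied from the printed gbd_summarised output
gbd_summarised <- tibble::tribble(
  ~cause,                      ~sex,     ~deaths_pergroups,
  "Communicable diseases",     "Female",  4.91,
  "Injuries",                  "Female",  1.42,
  "Non-communicable diseases", "Female", 19.15,
  "Communicable diseases",     "Male",    5.47,
  "Injuries",                  "Male",    3.05,
  "Non-communicable diseases", "Male",   21.74
)

# Totals per sex (two rows)
gbd_summarised_sex <- gbd_summarised %>%
  group_by(sex) %>%
  summarise(deaths_persex = sum(deaths_pergroups))

# Every row of gbd_summarised picks up the total for its sex
gbd_joined <- full_join(gbd_summarised, gbd_summarised_sex, by = "sex")

gbd_joined
```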

Joining different summaries together can be useful, especially if the individual pipelines are quite long (e.g., over 5 lines of %>%). However, it does increase the chance of mistakes creeping in and is best avoided if possible.

An alternative is to use mutate() with group_by() to achieve the same result as the full_join() above:

## # A tibble: 6 x 4
## # Groups:   sex [2]
##   cause                     sex    deaths_pergroups deaths_persex
##   <chr>                     <chr>             <dbl>         <dbl>
## 1 Communicable diseases     Female             4.91         25.48
## 2 Injuries                  Female             1.42         25.48
## 3 Non-communicable diseases Female            19.15         25.48
## 4 Communicable diseases     Male               5.47         30.26
## 5 Injuries                  Male               3.05         30.26
## 6 Non-communicable diseases Male              21.74         30.26

So mutate() calculates the sums within each group (in this example, the groups defined by group_by(sex)) and puts the results in a new column, without condensing the tibble down or removing any of the existing columns.
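A minimal sketch of the grouped mutate() described here, assuming dplyr and tibble are installed and using a hand-typed copy of gbd_summarised; the name gbd_with_totals is ours:

```r
library(dplyr)

# Assumption: values copied from the printed gbd_summarised output
gbd_summarised <- tibble::tribble(
  ~cause,                      ~sex,     ~deaths_pergroups,
  "Communicable diseases",     "Female",  4.91,
  "Injuries",                  "Female",  1.42,
  "Non-communicable diseases", "Female", 19.15,
  "Communicable diseases",     "Male",    5.47,
  "Injuries",                  "Male",    3.05,
  "Non-communicable diseases", "Male",   21.74
)

# sum() is evaluated once per sex and the result is repeated
# on every row of that group, so all six rows are kept
gbd_with_totals <- gbd_summarised %>%
  group_by(sex) %>%
  mutate(deaths_persex = sum(deaths_pergroups))

gbd_with_totals
```

Compared with the join, there is no second tibble to keep track of, which is why this version is less error-prone.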

Let’s combine all of this together into a single pipeline and calculate the percentages per cause for each sex:

## # A tibble: 6 x 5
## # Groups:   sex [2]
##   cause                     sex    deaths_pergroups deaths_persex sex_cause_perc
##   <chr>                     <chr>             <dbl>         <dbl> <chr>         
## 1 Injuries                  Female             1.42         25.48 6%            
## 2 Communicable diseases     Female             4.91         25.48 19%           
## 3 Non-communicable diseases Female            19.15         25.48 75%           
## 4 Injuries                  Male               3.05         30.26 10.1%         
## 5 Communicable diseases     Male               5.47         30.26 18.1%         
## 6 Non-communicable diseases Male              21.74         30.26 71.8%
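One way such a pipeline could look is sketched below. It starts from a hand-typed copy of gbd_summarised rather than from gbd2017, so the initial group_by() and summarise() step is left out. The percentages are formatted with base R's sprintf(), which is our assumption; the text column in the output above was presumably produced by a different formatting helper, so the rounding will not match it exactly. gbd_percentages is a made-up name.

```r
library(dplyr)

# Assumption: values copied from the printed gbd_summarised output
gbd_summarised <- tibble::tribble(
  ~cause,                      ~sex,     ~deaths_pergroups,
  "Communicable diseases",     "Female",  4.91,
  "Injuries",                  "Female",  1.42,
  "Non-communicable diseases", "Female", 19.15,
  "Communicable diseases",     "Male",    5.47,
  "Injuries",                  "Male",    3.05,
  "Non-communicable diseases", "Male",   21.74
)

gbd_percentages <- gbd_summarised %>%
  group_by(sex) %>%
  mutate(
    # total deaths for the row's sex, repeated within the group
    deaths_persex  = sum(deaths_pergroups),
    # share of that total, formatted as text with one decimal place
    sex_cause_perc = sprintf("%.1f%%", 100 * deaths_pergroups / deaths_persex)
  ) %>%
  # sort by sex, then from the smallest to the largest cause
  arrange(sex, deaths_pergroups)

gbd_percentages
```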