3.5 summarise() vs mutate() | R for Health Data Science

3.5 `summarise()` vs `mutate()`

So far we’ve shown you examples of using summarise() on grouped data (following group_by()) and mutate() on the whole dataset (without using group_by()).

But here’s the thing: mutate() is also happy to work on grouped data.

Let’s save the aggregated example from above in a new tibble. We will then sort the rows using arrange() based on sex, just for easier viewing (it was previously sorted by cause).

The arrange() function sorts the rows within a tibble:

gbd_summarised <- gbd2017 %>% 
  group_by(cause, sex) %>% 
  summarise(deaths_per_group = sum(deaths_millions)) %>% 
  arrange(sex)

## `summarise()` regrouping output by 'cause' (override with `.groups` argument)

gbd_summarised

## # A tibble: 6 x 3
## # Groups:   cause [3]
##   cause                     sex    deaths_per_group
##   <chr>                     <chr>             <dbl>
## 1 Communicable diseases     Female             4.91
## 2 Injuries                  Female             1.42
## 3 Non-communicable diseases Female            19.15
## 4 Communicable diseases     Male               5.47
## 5 Injuries                  Male               3.05
## 6 Non-communicable diseases Male              21.74

You should also notice that summarise() drops all variables that are not listed in group_by() or created inside it. So year, income, and deaths_millions exist in gbd2017, but they do not exist in gbd_summarised.

We now want to calculate the percentage of deaths from each cause for each gender. We could use summarise() to calculate the totals:

gbd_summarised_sex <- gbd_summarised %>% 
  group_by(sex) %>% 
  summarise(deaths_per_sex = sum(deaths_per_group))

## `summarise()` ungrouping output (override with `.groups` argument)

gbd_summarised_sex

## # A tibble: 2 x 2
##   sex    deaths_per_sex
##   <chr>           <dbl>
## 1 Female          25.48
## 2 Male            30.26

But that drops the cause and deaths_per_group columns. One way would be to now use a join on gbd_summarised and gbd_summarised_sex:

full_join(gbd_summarised, gbd_summarised_sex)

## Joining, by = "sex"

## # A tibble: 6 x 4
## # Groups:   cause [3]
##   cause                     sex    deaths_per_group deaths_per_sex
##   <chr>                     <chr>             <dbl>          <dbl>
## 1 Communicable diseases     Female             4.91          25.48
## 2 Injuries                  Female             1.42          25.48
## 3 Non-communicable diseases Female            19.15          25.48
## 4 Communicable diseases     Male               5.47          30.26
## 5 Injuries                  Male               3.05          30.26
## 6 Non-communicable diseases Male              21.74          30.26

Joining different summaries together can be useful, especially if the individual pipelines are quite long (e.g., over 5 lines of %>%). However, it does increase the chance of mistakes creeping in and is best avoided if possible.

An alternative is to use mutate() with group_by() to achieve the same result as the full_join() above:

gbd_summarised %>% 
  group_by(sex) %>% 
  mutate(deaths_per_sex = sum(deaths_per_group))

## # A tibble: 6 x 4
## # Groups:   sex [2]
##   cause                     sex    deaths_per_group deaths_per_sex
##   <chr>                     <chr>             <dbl>          <dbl>
## 1 Communicable diseases     Female             4.91          25.48
## 2 Injuries                  Female             1.42          25.48
## 3 Non-communicable diseases Female            19.15          25.48
## 4 Communicable diseases     Male               5.47          30.26
## 5 Injuries                  Male               3.05          30.26
## 6 Non-communicable diseases Male              21.74          30.26

So mutate() calculates the sums within each grouping variable (in this example just group_by(sex)) and puts the results in a new column without condensing the tibble down or removing any of the existing columns.

Let’s combine all of this together into a single pipeline and calculate the percentages per cause for each gender:

gbd2017 %>% 
  group_by(cause, sex) %>% 
  summarise(deaths_per_group = sum(deaths_millions)) %>% 
  group_by(sex) %>% 
  mutate(deaths_per_sex  = sum(deaths_per_group),
         sex_cause_perc = percent(deaths_per_group/deaths_per_sex)) %>% 
  arrange(sex, deaths_per_group)

## `summarise()` regrouping output by 'cause' (override with `.groups` argument)

## # A tibble: 6 x 5
## # Groups:   sex [2]
##   cause                     sex   deaths_per_group deaths_per_sex sex_cause_perc
##   <chr>                     <chr>            <dbl>          <dbl> <chr>         
## 1 Injuries                  Fema…             1.42          25.48 6%            
## 2 Communicable diseases     Fema…             4.91          25.48 19%           
## 3 Non-communicable diseases Fema…            19.15          25.48 75%           
## 4 Injuries                  Male              3.05          30.26 10.1%         
## 5 Communicable diseases     Male              5.47          30.26 18.1%         
## 6 Non-communicable diseases Male             21.74          30.26 71.8%