3.5 summarise()
vs mutate()
So far we’ve shown you examples of using summarise()
on grouped data (following group_by()
) and mutate()
on the whole dataset (without using group_by()
).
But here’s the thing: mutate()
is also happy to work on grouped data.
Let’s save the aggregated example from above in a new tibble.
We will then sort the rows using arrange()
based on sex
, just for easier viewing (it was previously sorted by cause
).
The arrange()
function sorts the rows within a tibble:
gbd_summarised <- gbd2017 %>%
group_by(cause, sex) %>%
summarise(deaths_per_group = sum(deaths_millions)) %>%
arrange(sex)
## `summarise()` regrouping output by 'cause' (override with `.groups` argument)
## # A tibble: 6 x 3
## # Groups: cause [3]
## cause sex deaths_per_group
## <chr> <chr> <dbl>
## 1 Communicable diseases Female 4.91
## 2 Injuries Female 1.42
## 3 Non-communicable diseases Female 19.15
## 4 Communicable diseases Male 5.47
## 5 Injuries Male 3.05
## 6 Non-communicable diseases Male 21.74
You should also notice that summarise()
drops all variables that are not listed in group_by()
or created inside it.
So year
, income
, and deaths_millions
exist in gbd2017
, but they do not exist in gbd_summarised
.
We now want to calculate the percentage of deaths from each cause for each gender.
We could use summarise()
to calculate the totals:
gbd_summarised_sex <- gbd_summarised %>%
group_by(sex) %>%
summarise(deaths_per_sex = sum(deaths_per_group))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 2 x 2
## sex deaths_per_sex
## <chr> <dbl>
## 1 Female 25.48
## 2 Male 30.26
But that drops the cause
and deaths_per_group
columns.
One way would be to now use a join on gbd_summarised
and gbd_summarised_sex
:
## Joining, by = "sex"
## # A tibble: 6 x 4
## # Groups: cause [3]
## cause sex deaths_per_group deaths_per_sex
## <chr> <chr> <dbl> <dbl>
## 1 Communicable diseases Female 4.91 25.48
## 2 Injuries Female 1.42 25.48
## 3 Non-communicable diseases Female 19.15 25.48
## 4 Communicable diseases Male 5.47 30.26
## 5 Injuries Male 3.05 30.26
## 6 Non-communicable diseases Male 21.74 30.26
Joining different summaries together can be useful, especially if the individual pipelines are quite long (e.g., over 5 lines of %>%
).
However, it does increase the chance of mistakes creeping in and is best avoided if possible.
An alternative is to use mutate()
with group_by()
to achieve the same result as the full_join()
above:
## # A tibble: 6 x 4
## # Groups: sex [2]
## cause sex deaths_per_group deaths_per_sex
## <chr> <chr> <dbl> <dbl>
## 1 Communicable diseases Female 4.91 25.48
## 2 Injuries Female 1.42 25.48
## 3 Non-communicable diseases Female 19.15 25.48
## 4 Communicable diseases Male 5.47 30.26
## 5 Injuries Male 3.05 30.26
## 6 Non-communicable diseases Male 21.74 30.26
So mutate()
calculates the sums within each grouping variable (in this example just group_by(sex)
) and puts the results in a new column without condensing the tibble down or removing any of the existing columns.
Let’s combine all of this together into a single pipeline and calculate the percentages per cause for each gender:
gbd2017 %>%
group_by(cause, sex) %>%
summarise(deaths_per_group = sum(deaths_millions)) %>%
group_by(sex) %>%
mutate(deaths_per_sex = sum(deaths_per_group),
sex_cause_perc = percent(deaths_per_group/deaths_per_sex)) %>%
arrange(sex, deaths_per_group)
## `summarise()` regrouping output by 'cause' (override with `.groups` argument)
## # A tibble: 6 x 5
## # Groups: sex [2]
## cause sex deaths_per_group deaths_per_sex sex_cause_perc
## <chr> <chr> <dbl> <dbl> <chr>
## 1 Injuries Fema… 1.42 25.48 6%
## 2 Communicable diseases Fema… 4.91 25.48 19%
## 3 Non-communicable diseases Fema… 19.15 25.48 75%
## 4 Injuries Male 3.05 30.26 10.1%
## 5 Communicable diseases Male 5.47 30.26 18.1%
## 6 Non-communicable diseases Male 21.74 30.26 71.8%