3.3 Aggregating: group_by(), summarise()
Health data analysis is frequently concerned with making comparisons between groups: groups of genes, diseases, patients, or populations. An easy approach to comparing data by a categorical grouping is therefore essential.
We will introduce flexible functions from the tidyverse that you can apply in any setting. The examples intentionally get quite involved to demonstrate the different approaches that can be used.
To quickly calculate the total number of deaths in 2017, we can select the column and send it into the sum() function:
## [1] 55.74
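A call along the following lines would produce the output above. This is a minimal sketch, not the chapter's original code: the tibble name gbd2017 is an assumption, and its rows are a stand-in rebuilt from the cause-by-sex totals printed at the end of this section rather than read from the original data file.

```r
library(dplyr)
library(tibble)

# Hypothetical stand-in for the 2017 data (name and layout are assumptions),
# rebuilt from the cause-by-sex totals shown later in this section
gbd2017 <- tribble(
  ~cause,                      ~sex,     ~deaths_millions,
  "Communicable diseases",     "Female",  4.91,
  "Communicable diseases",     "Male",    5.47,
  "Injuries",                  "Female",  1.42,
  "Injuries",                  "Male",    3.05,
  "Non-communicable diseases", "Female", 19.15,
  "Non-communicable diseases", "Male",   21.74
)

# Extract the column as a plain numeric vector, then pipe it into sum()
total_deaths <- gbd2017$deaths_millions %>% sum()
total_deaths
```

The result is a bare number (a numeric vector of length one), not a tibble.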
But a much cleverer way of summarising data is using the summarise() function:
## # A tibble: 1 x 1
##   `sum(deaths_millions)`
##                    <dbl>
## 1                  55.74
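The summarise() version can be sketched in the same way. As before, gbd2017 is a hypothetical stand-in tibble rebuilt from the totals printed in this section, and the column name deaths_total in the second call is purely illustrative.

```r
library(dplyr)
library(tibble)

# Hypothetical stand-in for the 2017 data (name and layout are assumptions)
gbd2017 <- tribble(
  ~cause,                      ~sex,     ~deaths_millions,
  "Communicable diseases",     "Female",  4.91,
  "Communicable diseases",     "Male",    5.47,
  "Injuries",                  "Female",  1.42,
  "Injuries",                  "Male",    3.05,
  "Non-communicable diseases", "Female", 19.15,
  "Non-communicable diseases", "Male",   21.74
)

# summarise() collapses the whole tibble into a one-row tibble;
# the column is named after the expression unless a name is supplied
total_tbl <- gbd2017 %>%
  summarise(sum(deaths_millions))
total_tbl

# Same calculation with an explicit (illustrative) column name
total_named <- gbd2017 %>%
  summarise(deaths_total = sum(deaths_millions))
```

Unlike the bare sum() call, the result here is still a tibble, so it can be piped into further tidyverse functions.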
This indeed equals the total number of deaths we saw in the previous chapter using the shorter version of this data (deaths from the three causes were 10.38, 4.47, and 40.89 million, which add up to 55.74 million).
sum() is a function that adds numbers together, whereas summarise() is an efficient way of creating summarised tibbles.
The main strength of summarise() is how it works with the group_by() function.
group_by() and summarise() are like cheese and wine, a perfect complement for each other, seldom seen apart.
We use group_by() to tell summarise() which subgroups to apply the calculations to.
In the above example, without group_by(), summarise() just works on the whole dataset, yielding the same result as sending a single column into the sum() function.
We can subset on the cause variable using group_by():
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 3 x 2
##   cause                     `sum(deaths_millions)`
##   <chr>                                      <dbl>
## 1 Communicable diseases                      10.38
## 2 Injuries                                    4.47
## 3 Non-communicable diseases                  40.89
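The grouped summary above can be sketched as follows. The tibble gbd2017 is again a hypothetical stand-in rebuilt from the totals printed in this section; only the group_by() and summarise() calls are the point of the example.

```r
library(dplyr)
library(tibble)

# Hypothetical stand-in for the 2017 data (name and layout are assumptions)
gbd2017 <- tribble(
  ~cause,                      ~sex,     ~deaths_millions,
  "Communicable diseases",     "Female",  4.91,
  "Communicable diseases",     "Male",    5.47,
  "Injuries",                  "Female",  1.42,
  "Injuries",                  "Male",    3.05,
  "Non-communicable diseases", "Female", 19.15,
  "Non-communicable diseases", "Male",   21.74
)

# group_by() marks the subgroups; summarise() then returns one row per cause
by_cause <- gbd2017 %>%
  group_by(cause) %>%
  summarise(sum(deaths_millions))
by_cause
```

Each cause keeps its own row, and the sum is taken over the female and male rows within that cause.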
Furthermore, group_by() is happy to accept multiple grouping variables.
So by copying and editing the above code, we can quickly get summarised totals across multiple grouping variables (by adding sex inside the group_by() after cause):
## `summarise()` regrouping output by 'cause' (override with `.groups` argument)
## # A tibble: 6 x 3
## # Groups:   cause [3]
##   cause                     sex    `sum(deaths_millions)`
##   <chr>                     <chr>                   <dbl>
## 1 Communicable diseases     Female                   4.91
## 2 Communicable diseases     Male                     5.47
## 3 Injuries                  Female                   1.42
## 4 Injuries                  Male                     3.05
## 5 Non-communicable diseases Female                  19.15
## 6 Non-communicable diseases Male                    21.74
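A sketch of the two-variable grouping follows, again on a hypothetical stand-in tibble named gbd2017 that is rebuilt from the table above. Because the stand-in already holds one row per cause and sex, the sums simply reproduce its rows; on the chapter's full data each group would presumably span several rows. The example also shows the .groups argument mentioned in the message: by default the result stays grouped by cause, while .groups = "drop" returns an ungrouped tibble.

```r
library(dplyr)
library(tibble)

# Hypothetical stand-in for the 2017 data (name and layout are assumptions)
gbd2017 <- tribble(
  ~cause,                      ~sex,     ~deaths_millions,
  "Communicable diseases",     "Female",  4.91,
  "Communicable diseases",     "Male",    5.47,
  "Injuries",                  "Female",  1.42,
  "Injuries",                  "Male",    3.05,
  "Non-communicable diseases", "Female", 19.15,
  "Non-communicable diseases", "Male",   21.74
)

# Default behaviour: one row per cause-sex pair, result still grouped by cause
by_cause_sex <- gbd2017 %>%
  group_by(cause, sex) %>%
  summarise(sum(deaths_millions))

# Same totals, but all grouping is dropped from the result
by_cause_sex_flat <- gbd2017 %>%
  group_by(cause, sex) %>%
  summarise(sum(deaths_millions), .groups = "drop")
by_cause_sex_flat
```

Dropping the groups explicitly avoids surprises when further verbs such as mutate() are applied to the summarised tibble.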