3.3 Aggregating: group_by(), summarise()

Health data analysis frequently involves comparing groups: groups of genes, diseases, patients, populations, and so on. An easy way to compare data across a categorical grouping is therefore essential.

We will introduce flexible functions from the tidyverse that you can apply in any setting. The examples intentionally get quite involved to demonstrate the different approaches that can be used.

To quickly calculate the total number of deaths in 2017, we can select the column and send it into the sum() function:

gbd2017$deaths_millions %>% sum()
## [1] 55.74

But a more flexible way of summarising data is to use the summarise() function:

gbd2017 %>% 
  summarise(sum(deaths_millions))
## # A tibble: 1 x 1
##   `sum(deaths_millions)`
##                    <dbl>
## 1                  55.74

This is indeed equal to the total number of deaths we saw in the previous chapter using the shorter version of this data (deaths from the three causes were 10.38, 4.47 and 40.89, which add up to 55.74).

sum() is a function that adds numbers together, whereas summarise() is an efficient way of creating summarised tibbles. The main strength of summarise() is how it works with the group_by() function. group_by() and summarise() are like cheese and wine, a perfect complement for each other, seldom seen apart.

We use group_by() to tell summarise() which subgroups to apply the calculations to. In the above example, without group_by(), summarise() works on the whole dataset, yielding the same total as sending a single column into the sum() function. The difference is that summarise() returns a tibble rather than a bare number.
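To make the comparison concrete, here is a minimal sketch. Since gbd2017 itself is not reproduced here, it uses a small stand-in tibble, gbd_demo (a hypothetical name), built from the six cause-by-sex totals shown in this section. It assumes the dplyr package is installed.

```r
library(dplyr)

# gbd_demo is a hypothetical stand-in for gbd2017, not the real dataset:
# it holds only the six cause-by-sex totals printed in this section.
gbd_demo <- tibble(
  cause = rep(c("Communicable diseases", "Injuries",
                "Non-communicable diseases"), each = 2),
  sex = rep(c("Female", "Male"), times = 3),
  deaths_millions = c(4.91, 5.47, 1.42, 3.05, 19.15, 21.74)
)

# Base R approach: returns a bare numeric vector of length 1
total_vector <- gbd_demo$deaths_millions %>% sum()

# summarise() without group_by(): returns a one-row, one-column tibble
total_tibble <- gbd_demo %>%
  summarise(sum(deaths_millions))

# pull() extracts the (last) column as a vector so the two can be compared;
# all.equal() allows for floating-point rounding
all.equal(total_vector, pull(total_tibble))
```

Both approaches give the same total; only the shape of the result differs, which matters once you want to keep piping the result into other tidyverse functions.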

We can calculate a separate total for each level of the cause variable using group_by():

gbd2017 %>% 
  group_by(cause) %>% 
  summarise(sum(deaths_millions))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 3 x 2
##   cause                     `sum(deaths_millions)`
##   <chr>                                      <dbl>
## 1 Communicable diseases                      10.38
## 2 Injuries                                    4.47
## 3 Non-communicable diseases                  40.89
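The column header `sum(deaths_millions)` in the output above is awkward to refer to in later code. One option, sketched below, is to name the summary inside summarise(), which also makes it easy to compute several summaries at once. The tibble gbd_demo and the column names deaths_total and n_rows are hypothetical, chosen for illustration; gbd_demo holds only the six cause-by-sex totals shown in this section, not the full gbd2017 data.

```r
library(dplyr)

# gbd_demo is a hypothetical stand-in for gbd2017 (six cause-by-sex totals)
gbd_demo <- tibble(
  cause = rep(c("Communicable diseases", "Injuries",
                "Non-communicable diseases"), each = 2),
  sex = rep(c("Female", "Male"), times = 3),
  deaths_millions = c(4.91, 5.47, 1.42, 3.05, 19.15, 21.74)
)

deaths_by_cause <- gbd_demo %>%
  group_by(cause) %>%
  summarise(
    deaths_total = sum(deaths_millions),  # named column instead of `sum(...)`
    n_rows       = n()                    # number of rows in each group
  )

deaths_by_cause
```

The result has one row per cause, with columns cause, deaths_total and n_rows. In this small stand-in each cause has two rows (one per sex); in the full dataset n() would count however many rows each cause has.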

Furthermore, group_by() accepts multiple grouping variables. By copying the above code and adding sex after cause inside group_by(), we can quickly get totals for every combination of cause and sex:

gbd2017 %>% 
  group_by(cause, sex) %>% 
  summarise(sum(deaths_millions))
## `summarise()` regrouping output by 'cause' (override with `.groups` argument)
## # A tibble: 6 x 3
## # Groups:   cause [3]
##   cause                     sex    `sum(deaths_millions)`
##   <chr>                     <chr>                   <dbl>
## 1 Communicable diseases     Female                   4.91
## 2 Communicable diseases     Male                     5.47
## 3 Injuries                  Female                   1.42
## 4 Injuries                  Male                     3.05
## 5 Non-communicable diseases Female                  19.15
## 6 Non-communicable diseases Male                    21.74
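The message and the "# Groups:   cause [3]" line in the output above indicate that, after summarising by two variables, the result is still grouped by the first one. The sketch below shows how the `.groups` argument mentioned in that message can be used to control this. It assumes dplyr 1.0.0 or later, and again uses a hypothetical stand-in tibble, gbd_demo, built from the six totals printed above (so each group contains a single row). The exact wording of the message may differ between dplyr versions.

```r
library(dplyr)

# gbd_demo is a hypothetical stand-in for gbd2017 (six cause-by-sex totals)
gbd_demo <- tibble(
  cause = rep(c("Communicable diseases", "Injuries",
                "Non-communicable diseases"), each = 2),
  sex = rep(c("Female", "Male"), times = 3),
  deaths_millions = c(4.91, 5.47, 1.42, 3.05, 19.15, 21.74)
)

# Default behaviour: the last grouping variable (sex) is dropped,
# so the result remains grouped by cause
still_grouped <- gbd_demo %>%
  group_by(cause, sex) %>%
  summarise(deaths_total = sum(deaths_millions))

group_vars(still_grouped)
## [1] "cause"

# .groups = "drop" removes all grouping and silences the message
ungrouped <- gbd_demo %>%
  group_by(cause, sex) %>%
  summarise(deaths_total = sum(deaths_millions), .groups = "drop")

group_vars(ungrouped)
## character(0)
```

Dropping the grouping explicitly is a reasonable default when the summarised tibble will be passed on to further calculations, since any leftover grouping silently changes how functions such as mutate() and summarise() behave. Calling ungroup() on the result achieves the same thing.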