3.11 Exercise - group_by(), summarise()

Read in the full GBD dataset with variables cause, year, sex, income, deaths_millions.

## Observations: 168
## Variables: 5
## $ cause           <chr> "Communicable diseases", "Communicable diseases", "Co…
## $ year            <dbl> 1990, 1990, 1990, 1990, 1990, 1990, 1990, 1990, 1990,…
## $ sex             <chr> "Female", "Female", "Female", "Female", "Male", "Male…
## $ income          <chr> "High", "Upper-Middle", "Lower-Middle", "Low", "High"…
## $ deaths_millions <dbl> 0.21, 1.15, 4.43, 1.51, 0.26, 1.35, 4.73, 1.72, 0.20,…

Year 2017 of this dataset was shown in Table 3.1, the full dataset has seven times as many observations as Table 3.1 since it includes information about multiple years: 1990, 1995, 2000, 2005, 2010, 2015, 2017.

Investigate these code examples:

## # A tibble: 7 x 2
##    year total_per_year
##   <dbl>          <dbl>
## 1  1990          46.32
## 2  1995          48.91
## 3  2000          50.38
## 4  2005          51.25
## 5  2010          52.63
## 6  2015          54.62
## 7  2017          55.74
## # A tibble: 21 x 3
## # Groups:   year [7]
##     year cause                     total_per_cause
##    <dbl> <chr>                               <dbl>
##  1  1990 Communicable diseases               15.36
##  2  1990 Injuries                             4.25
##  3  1990 Non-communicable diseases           26.71
##  4  1995 Communicable diseases               15.11
##  5  1995 Injuries                             4.53
##  6  1995 Non-communicable diseases           29.27
##  7  2000 Communicable diseases               14.81
##  8  2000 Injuries                             4.56
##  9  2000 Non-communicable diseases           31.01
## 10  2005 Communicable diseases               13.89
## # … with 11 more rows

You should recognise that:

  • summary_data1 includes the total number of deaths per year.
  • summary_data2 includes the number of deaths per cause per year.
  • summary_data1 = means we are creating a new tibble called summary_data1 and saving (=) results into it. If summary_data1 was a tibble that already existed, it would get overwritten.
  • gbd_full is the data being sent to the group_by() and then summarise() functions.
  • group_by() tells summarise() that we want aggregated results for each year.
  • summarise() then creates a new variable called total_per_year that sums the deaths from each different observation (subcategory) together.
  • Calling summary_data1 on a separate line gets it printed.
  • We then do something very similar in summary_data2.

Compare the number of rows (observations) and number of columns (variables) of gbd_full, summary_data1, and summary_data2.

You should notice that: * summary_data2 has exactly 3 times as many rows (observations) as summary_data1. Why? * gbd_full has 5 variables, whereas the summarised tibbles have 2 and 3. Which variables got dropped? How?

Answers

  • gbd_full has 168 observations (rows),
  • summary_data1 has 7,
  • summary_data2 has 21.

summary_data1 was grouped by year, therefore it includes a (summarised) value for each year in the original dataset. summary_data2 was grouped by year and cause (Communicable diseases, Injuries, Non-communicable diseases), so it has 3 values for each year.

The columns a summarise() function returns are: variables listed in group_by() + variables created inside summarise() (e.g., in this case deaths_peryear). All others get aggregated.