3.10 Exercises

3.10.1 Exercise - pivot_wider()

Using the GBD dataset with variables cause, year (1990 and 2017 only), sex (as shown in Table 3.3):

Use pivot_wider() to put the cause variable into columns using the deaths_millions as values:

TABLE 3.4: Exercise: putting the cause variable into the wide format.
year sex Communicable diseases Injuries Non-communicable diseases
1990 Female 7.30 1.41 12.80
2017 Female 4.91 1.42 19.15
1990 Male 8.06 2.84 13.91
2017 Male 5.47 3.05 21.74

Solution

3.10.2 Exercise - group_by(), summarise()

Read in the full GBD dataset with variables cause, year, sex, income, deaths_millions.

## Observations: 168
## Variables: 5
## $ cause           <chr> "Communicable diseases", "Communicable diseases", "Co…
## $ year            <dbl> 1990, 1990, 1990, 1990, 1990, 1990, 1990, 1990, 1990,…
## $ sex             <chr> "Female", "Female", "Female", "Female", "Male", "Male…
## $ income          <chr> "High", "Upper-Middle", "Lower-Middle", "Low", "High"…
## $ deaths_millions <dbl> 0.21, 1.15, 4.43, 1.51, 0.26, 1.35, 4.73, 1.72, 0.20,…

Year 2017 of this dataset was shown in Table 3.1, the full dataset has seven times as many observations as Table 3.1 since it includes information about multiple years: 1990, 1995, 2000, 2005, 2010, 2015, 2017.

Investigate these code examples:

## # A tibble: 7 x 2
##    year total_per_year
##   <dbl>          <dbl>
## 1  1990          46.32
## 2  1995          48.91
## 3  2000          50.38
## 4  2005          51.25
## 5  2010          52.63
## 6  2015          54.62
## 7  2017          55.74
## # A tibble: 21 x 3
## # Groups:   year [7]
##     year cause                     total_per_cause
##    <dbl> <chr>                               <dbl>
##  1  1990 Communicable diseases               15.36
##  2  1990 Injuries                             4.25
##  3  1990 Non-communicable diseases           26.71
##  4  1995 Communicable diseases               15.11
##  5  1995 Injuries                             4.53
##  6  1995 Non-communicable diseases           29.27
##  7  2000 Communicable diseases               14.81
##  8  2000 Injuries                             4.56
##  9  2000 Non-communicable diseases           31.01
## 10  2005 Communicable diseases               13.89
## # … with 11 more rows

You should recognise that:

  • summary_data1 includes the total number of deaths per year.
  • summary_data2 includes the number of deaths per cause per year.
  • summary_data1 = means we are creating a new tibble called summary_data1 and saving (=) results into it. If summary_data1 was a tibble that already existed, it would get overwritten.
  • gbd_full is the data being sent to the group_by() and then summarise() functions.
  • group_by() tells summarise() that we want aggregated results for each year.
  • summarise() then creates a new variable called total_per_year that sums the deaths from each different observation (subcategory) together.
  • Calling summary_data1 on a separate line gets it printed.
  • We then do something similar in summary_data2.

Compare the number of rows (observations) and number of columns (variables) of gbd_full, summary_data1, and summary_data2.

You should notice that: * summary_data2 has exactly 3 times as many rows (observations) as summary_data1. Why? * gbd_full has 5 variables, whereas the summarised tibbles have 2 and 3. Which variables got dropped? How?

Answers

  • gbd_full has 168 observations (rows),
  • summary_data1 has 7,
  • summary_data2 has 21.

summary_data1 was grouped by year, therefore it includes a (summarised) value for each year in the original dataset. summary_data2 was grouped by year and cause (Communicable diseases, Injuries, Non-communicable diseases), so it has 3 values for each year.

The columns a summarise() function returns are: variables listed in group_by() + variables created inside summarise() (e.g., in this case deaths_peryear). All others get aggregated.

3.10.3 Exercise - full_join(), percent()

For each cause, calculate its percentage to total deaths in each year.

Hint: Use full_join() on summary_data1 and summary_data2, and then use mutate() to add a new column called percentage.

Example result for a single year:

## Joining, by = "year"
## # A tibble: 3 x 5
##    year total_per_year cause                     total_per_cause percentage
##   <dbl>          <dbl> <chr>                               <dbl> <chr>     
## 1  1990          46.32 Communicable diseases               15.36 33.161%   
## 2  1990          46.32 Injuries                             4.25 9.175%    
## 3  1990          46.32 Non-communicable diseases           26.71 57.664%

Solution

3.10.4 Exercise - mutate(), summarise()

Instead of creating the two summarised tibbles and using a full_join(), achieve the same result as in the previous Exercise by with a single pipeline using summarise() and then mutate().

Hint: you have to do it the either way round, so group_by(year, cause) %>% summarise(...) first, then group_by(year) %>% mutate().

Bonus: select() columns year, cause, percentage, then pivot_wider() the cause variable using percentage as values.

Solution

## # A tibble: 7 x 4
## # Groups:   year [7]
##    year `Communicable diseases` Injuries `Non-communicable diseases`
##   <dbl> <chr>                   <chr>    <chr>                      
## 1  1990 33%                     9%       58%                        
## 2  1995 31%                     9%       60%                        
## 3  2000 29%                     9%       62%                        
## 4  2005 27%                     9%       64%                        
## 5  2010 24%                     9%       67%                        
## 6  2015 20%                     8%       72%                        
## 7  2017 19%                     8%       73%

Note that your pipelines shouldn’t be much longer than this, and we often save interim results into separate tibbles for checking (like we did with summary_data1 and summary_data2, making sure the number of rows are what we expect and spot checking that the calculation worked as expected).

R doesn’t do what you want it to do, it does what you ask it to do. Testing and spot checking is essential as you will make mistakes. We sure do.

Do not feel like you should be able to just bash out these clever pipelines without a lot of trial and error first.

3.10.5 Exercise - filter(), summarise(), pivot_wider()

Still working with gbd_full:

  • Filter for 1990.
  • Calculate the total number of deaths in the different income groups (High, Upper-Middle, Lower-Middle, Low). Hint: use group_by(income) and summarise(new_column_name = sum(variable)).
  • Calculate the total number of deaths within each income group for males and females. Hint: this is as easy as adding , sex to group_by(income).

  • pivot_wider() the income column.

Solution

## # A tibble: 2 x 5
##   sex     High   Low `Lower-Middle` `Upper-Middle`
##   <chr>  <dbl> <dbl>          <dbl>          <dbl>
## 1 Female 4.140  2.22           8.47          6.68 
## 2 Male   4.46   2.57           9.83          7.950