## 3.13 Exercise - mutate(), summarise()

Instead of creating the two summarised tibbles and using a full_join(), achieve the same result as in the previous Exercise with a single pipeline using summarise() and then mutate().

Hint: you have to do it the other way round, so group_by(year, cause) %>% summarise(...) first, then group_by(year) %>% mutate().

Bonus: select() columns year, cause, percentage, then spread() the cause variable using percentage as values.

Solution

# percent() comes from the scales package
gbd_full %>%
  # aggregate to deaths per cause per year using summarise()
  group_by(year, cause) %>%
  summarise(total_per_cause = sum(deaths_millions)) %>%
  # then add a column of yearly totals using mutate()
  group_by(year) %>%
  mutate(total_per_year = sum(total_per_cause)) %>%
  mutate(percentage = percent(total_per_cause/total_per_year)) %>%
  # select the final variables and spread for better viewing
  select(year, cause, percentage) %>%
  spread(cause, percentage)
## # A tibble: 7 x 4
## # Groups:   year [7]
##    year Communicable diseases Injuries Non-communicable diseases
##   <dbl> <chr>                   <chr>    <chr>
## 1  1990 33%                     9%       58%
## 2  1995 31%                     9%       60%
## 3  2000 29%                     9%       62%
## 4  2005 27%                     9%       64%
## 5  2010 24%                     9%       67%
## 6  2015 20%                     8%       72%
## 7  2017 19%                     8%       73%
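
The same reshaping step can also be written with pivot_wider(), the newer tidyr function that supersedes spread(). The sketch below is only an illustration: toy_gbd is a small made-up tibble standing in for gbd_full (its numbers are not real GBD figures), and the percentage is kept as a plain proportion rather than formatted with percent().

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data, NOT the real gbd_full values
toy_gbd <- tibble(
  year = c(1990, 1990, 1990, 2017, 2017, 2017),
  cause = rep(c("Communicable diseases", "Injuries",
                "Non-communicable diseases"), times = 2),
  deaths_millions = c(3, 1, 6, 2, 1, 7)
)

toy_wide <- toy_gbd %>%
  # deaths per cause per year
  group_by(year, cause) %>%
  summarise(total_per_cause = sum(deaths_millions), .groups = "drop") %>%
  # proportion of each cause within its year
  group_by(year) %>%
  mutate(percentage = total_per_cause / sum(total_per_cause)) %>%
  ungroup() %>%
  select(year, cause, percentage) %>%
  # one column per cause, filled with the proportions
  pivot_wider(names_from = cause, values_from = percentage)

toy_wide
```

Keeping the proportion numeric until the very end makes it easier to check or plot; it can be formatted for display as a last step.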

Note that your pipelines shouldn’t be much longer than this. We often save interim results into separate tibbles for checking, as we did with summary_data1 and summary_data2: we make sure the number of rows is what we expect, and spot check that the calculation worked as intended.
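
As a rough sketch of what such checks might look like, the example below saves an interim tibble and tests it with stopifnot(). The data are hypothetical (toy_gbd is made up for illustration and is not gbd_full), and the object names interim and share_check are our own.

```r
library(dplyr)

# Hypothetical toy data standing in for gbd_full
toy_gbd <- tibble(
  year = rep(c(1990, 2017), each = 6),
  cause = rep(rep(c("Communicable diseases", "Injuries",
                    "Non-communicable diseases"), each = 2), times = 2),
  sex = rep(c("Female", "Male"), times = 6),
  deaths_millions = c(1.5, 1.5, 0.3, 0.7, 3, 3,
                      1, 1, 0.4, 0.6, 3.5, 3.5)
)

# Save the interim result instead of piping straight on
interim <- toy_gbd %>%
  group_by(year, cause) %>%
  summarise(total_per_cause = sum(deaths_millions), .groups = "drop") %>%
  group_by(year) %>%
  mutate(total_per_year = sum(total_per_cause),
         proportion = total_per_cause / total_per_year) %>%
  ungroup()

# Check 1: one row per year-cause combination
n_expected <- n_distinct(toy_gbd$year) * n_distinct(toy_gbd$cause)
stopifnot(nrow(interim) == n_expected)

# Check 2: no deaths were lost or double counted when aggregating
stopifnot(isTRUE(all.equal(sum(interim$total_per_cause),
                           sum(toy_gbd$deaths_millions))))

# Check 3: proportions add up to 1 within each year
share_check <- interim %>%
  group_by(year) %>%
  summarise(share_sum = sum(proportion))
stopifnot(all(abs(share_check$share_sum - 1) < 1e-8))
```

If any of these conditions fails, stopifnot() stops with an error naming the failed check, which is usually a quicker signal than noticing an odd number in the final table.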

R doesn’t do what you want it to do, it does what you ask it to do. Testing and spot checking are essential, as you will make mistakes. We sure do.

Do not feel like you should be able to just bash out these clever pipelines without a lot of trial and error first.