3.10 Exercises
3.10.1 Exercise - pivot_wider()
Using the GBD dataset with variables cause
, year
(1990 and 2017 only), sex
(as shown in Table 3.3):
Use pivot_wider()
to put the cause
variable into columns using the deaths_millions
as values:
year | sex | Communicable diseases | Injuries | Non-communicable diseases |
---|---|---|---|---|
1990 | Female | 7.30 | 1.41 | 12.80 |
2017 | Female | 4.91 | 1.42 | 19.15 |
1990 | Male | 8.06 | 2.84 | 13.91 |
2017 | Male | 5.47 | 3.05 | 21.74 |
Solution
3.10.2 Exercise - group_by()
, summarise()
Read in the full GBD dataset with variables cause
, year
, sex
, income
, deaths_millions
.
## Rows: 168
## Columns: 5
## $ cause <chr> "Communicable diseases", "Communicable diseases", "Co…
## $ year <dbl> 1990, 1990, 1990, 1990, 1990, 1990, 1990, 1990, 1990,…
## $ sex <chr> "Female", "Female", "Female", "Female", "Male", "Male…
## $ income <chr> "High", "Upper-Middle", "Lower-Middle", "Low", "High"…
## $ deaths_millions <dbl> 0.21, 1.15, 4.43, 1.51, 0.26, 1.35, 4.73, 1.72, 0.20,…
Year 2017 of this dataset was shown in Table 3.1, the full dataset has seven times as many observations as Table 3.1 since it includes information about multiple years: 1990, 1995, 2000, 2005, 2010, 2015, 2017.
Investigate these code examples:
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 7 x 2
## year total_per_year
## <dbl> <dbl>
## 1 1990 46.32
## 2 1995 48.91
## 3 2000 50.38
## 4 2005 51.25
## 5 2010 52.63
## 6 2015 54.62
## 7 2017 55.74
summary_data2 <-
gbd_full %>%
group_by(year, cause) %>%
summarise(total_per_cause = sum(deaths_millions))
## `summarise()` regrouping output by 'year' (override with `.groups` argument)
## # A tibble: 21 x 3
## # Groups: year [7]
## year cause total_per_cause
## <dbl> <chr> <dbl>
## 1 1990 Communicable diseases 15.36
## 2 1990 Injuries 4.25
## 3 1990 Non-communicable diseases 26.71
## 4 1995 Communicable diseases 15.11
## 5 1995 Injuries 4.53
## 6 1995 Non-communicable diseases 29.27
## 7 2000 Communicable diseases 14.81
## 8 2000 Injuries 4.56
## 9 2000 Non-communicable diseases 31.01
## 10 2005 Communicable diseases 13.89
## # … with 11 more rows
You should recognise that:
summary_data1
includes the total number of deaths per year.summary_data2
includes the number of deaths per cause per year.summary_data1 <-
means we are creating a new tibble calledsummary_data1
and saving (<-
) results into it. Ifsummary_data1
was a tibble that already existed, it would get overwritten.gbd_full
is the data being sent to thegroup_by()
and thensummarise()
functions.group_by()
tellssummarise()
that we want aggregated results for each year.summarise()
then creates a new variable calledtotal_per_year
that sums the deaths from each different observation (subcategory) together.- Calling
summary_data1
on a separate line gets it printed. - We then do something similar in
summary_data2
.
Compare the number of rows (observations) and number of columns (variables) of gbd_full
, summary_data1
, and summary_data2
.
You should notice that:
summary_data2
has exactly 3 times as many rows (observations) assummary_data1
. Why?gbd_full
has 5 variables, whereas the summarised tibbles have 2 and 3. Which variables got dropped? How?
Answers
gbd_full
has 168 observations (rows),summary_data1
has 7,summary_data2
has 21.
summary_data1
was grouped by year, therefore it includes a (summarised) value for each year in the original dataset.
summary_data2
was grouped by year and cause (Communicable diseases, Injuries, Non-communicable diseases), so it has 3 values for each year.
The columns a summarise()
function returns are: variables listed in group_by()
+ variables created inside summarise()
(e.g., in this case deaths_peryear
). All others get aggregated.
3.10.3 Exercise - full_join()
, percent()
For each cause, calculate its percentage to total deaths in each year.
Hint: Use full_join()
on summary_data1
and summary_data2
, and then use mutate()
to add a new column called percentage
.
Example result for a single year:
## Joining, by = "year"
## # A tibble: 3 x 5
## year total_per_year cause total_per_cause percentage
## <dbl> <dbl> <chr> <dbl> <chr>
## 1 1990 46.32 Communicable diseases 15.36 33.161%
## 2 1990 46.32 Injuries 4.25 9.175%
## 3 1990 46.32 Non-communicable diseases 26.71 57.664%
Solution
3.10.4 Exercise - mutate()
, summarise()
Instead of creating the two summarised tibbles and using a full_join()
, achieve the same result as in the previous exercise with a single pipeline using summarise()
and then mutate()
.
Hint: you have to do it the other way around, so group_by(year, cause) %>% summarise(...)
first, then group_by(year) %>% mutate()
.
Bonus: select()
columns year
, cause
, percentage
, then pivot_wider()
the cause
variable using percentage
as values.
Solution
gbd_full %>%
# aggregate to deaths per cause per year using summarise()
group_by(year, cause) %>%
summarise(total_per_cause = sum(deaths_millions)) %>%
# then add a column of yearly totals using mutate()
group_by(year) %>%
mutate(total_per_year = sum(total_per_cause)) %>%
# add the percentage column
mutate(percentage = percent(total_per_cause/total_per_year)) %>%
# select the final variables for better vieweing
select(year, cause, percentage) %>%
pivot_wider(names_from = cause, values_from = percentage)
## `summarise()` regrouping output by 'year' (override with `.groups` argument)
## # A tibble: 7 x 4
## # Groups: year [7]
## year `Communicable diseases` Injuries `Non-communicable diseases`
## <dbl> <chr> <chr> <chr>
## 1 1990 33% 9% 58%
## 2 1995 31% 9% 60%
## 3 2000 29% 9% 62%
## 4 2005 27% 9% 64%
## 5 2010 24% 9% 67%
## 6 2015 20% 8% 72%
## 7 2017 19% 8% 73%
Note that your pipelines shouldn’t be much longer than this, and we often save interim results into separate tibbles for checking (like we did with summary_data1
and summary_data2
, making sure the number of rows are what we expect and spot checking that the calculation worked as expected).
R doesn’t do what you want it to do, it does what you ask it to do. Testing and spot checking is essential as you will make mistakes. We sure do.
Do not feel like you should be able to just bash out these clever pipelines without a lot of trial and error first.
3.10.5 Exercise - filter()
, summarise()
, pivot_wider()
Still working with gbd_full
:
Filter for 1990.
Calculate the total number of deaths in the different income groups (High, Upper-Middle, Lower-Middle, Low). Hint: use
group_by(income)
andsummarise(new_column_name = sum(variable))
.Calculate the total number of deaths within each income group for males and females. Hint: this is as easy as adding
, sex
togroup_by(income)
.pivot_wider()
theincome
column.
Solution
gbd_full %>%
filter(year == 1990) %>%
group_by(income, sex) %>%
summarise(total_deaths = sum(deaths_millions)) %>%
pivot_wider(names_from = income, values_from = total_deaths)
## `summarise()` regrouping output by 'income' (override with `.groups` argument)
## # A tibble: 2 x 5
## sex High Low `Lower-Middle` `Upper-Middle`
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Female 4.140 2.22 8.47 6.68
## 2 Male 4.46 2.57 9.83 7.950