3.8 Reshaping data - long vs wide format
So far, all of the examples we’ve shown you have been using ‘tidy’ data. Data is ‘tidy’ when it follows a couple of rules: each variable is in its own column, and each observation is in its own row. Making data ‘tidy’ often means transforming the table from a “wide” format into a “long” format. Long format is efficient to use in data analysis and visualisation and can also be considered “computer readable”.
But sometimes when presenting data in tables for humans to read, or when collecting data directly into a spreadsheet, it can be convenient to have data in a wide format. Data is ‘wide’ when some or all of the columns are levels of a factor. An example makes this easier to see.
gbd_wide <- read_csv("data/global_burden_disease_wide-format.csv")
gbd_long <- read_csv("data/global_burden_disease_cause-year-sex.csv")
cause | Female_1990 | Female_2017 | Male_1990 | Male_2017 |
---|---|---|---|---|
Communicable diseases | 7.30 | 4.91 | 8.06 | 5.47 |
Injuries | 1.41 | 1.42 | 2.84 | 3.05 |
Non-communicable diseases | 12.80 | 19.15 | 13.91 | 21.74 |
cause | year | sex | deaths_millions |
---|---|---|---|
Communicable diseases | 1990 | Female | 7.30 |
Communicable diseases | 2017 | Female | 4.91 |
Communicable diseases | 1990 | Male | 8.06 |
Communicable diseases | 2017 | Male | 5.47 |
Injuries | 1990 | Female | 1.41 |
Injuries | 2017 | Female | 1.42 |
Injuries | 1990 | Male | 2.84 |
Injuries | 2017 | Male | 3.05 |
Non-communicable diseases | 1990 | Female | 12.80 |
Non-communicable diseases | 2017 | Female | 19.15 |
Non-communicable diseases | 1990 | Male | 13.91 |
Non-communicable diseases | 2017 | Male | 21.74 |
Tables 3.3 and 3.2 contain the exact same information, but in long (tidy) and wide formats, respectively.
3.8.1 Pivot values from rows into columns (wider)
If we want to take the long data from 3.3 and put some of the numbers next to each other for easier visualisation, then pivot_wider()
from the tidyr package is the function to do it.
It means we want to send a variable into columns, and it needs just two arguments: the variable we want to become the new columns, and the variable where the values currently are.
## # A tibble: 6 x 4
## cause sex `1990` `2017`
## <chr> <chr> <dbl> <dbl>
## 1 Communicable diseases Female 7.3 4.91
## 2 Communicable diseases Male 8.06 5.47
## 3 Injuries Female 1.41 1.42
## 4 Injuries Male 2.84 3.05
## 5 Non-communicable diseases Female 12.8 19.15
## 6 Non-communicable diseases Male 13.91 21.74
This means we can quickly eyeball how the number of deaths has changed from 1990 to 2017 for each cause category and sex.
Whereas if we wanted to quickly look at the difference in the number of deaths for females and males, we can change the names_from =
argument from = years
to = sex
.
Furthermore, we can also add a mutate()
to calculate the difference:
## # A tibble: 6 x 5
## cause year Female Male `Male - Female`
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Communicable diseases 1990 7.3 8.06 0.76
## 2 Communicable diseases 2017 4.91 5.47 0.5600
## 3 Injuries 1990 1.41 2.84 1.430
## 4 Injuries 2017 1.42 3.05 1.63
## 5 Non-communicable diseases 1990 12.8 13.91 1.110
## 6 Non-communicable diseases 2017 19.15 21.74 2.59
All of these differences are positive which means every year, more men die than women. Which make sense, as more boys are born than girls.
And what if we want to look at both year
and sex
at the same time, so to create Table 3.2 from Table 3.3?
No problem, pivot_wider()
can deal with multiple variables at the same time, names_from = c(sex, year)
:
## # A tibble: 3 x 5
## cause Female_1990 Female_2017 Male_1990 Male_2017
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Communicable diseases 7.3 4.91 8.06 5.47
## 2 Injuries 1.41 1.42 2.84 3.05
## 3 Non-communicable diseases 12.8 19.15 13.91 21.74
pivot_wider()
has a few optional arguments that may be useful for you.
For example, pivot_wider(..., values_fill = 0)
can be used to fill empty cases (if you have any) with a value you specified.
Or pivot_wider(..., names_sep = ": ")
can be used to change the separator that gets put between the values (e.g., you may want “Female: 1990” instead of the default “Female_1990”).
Remember that pressing F1 when your cursor is on a function opens it up in the Help tab where these extra options are listed.
3.8.2 Pivot values from columns to rows (longer)
The inverse of pivot_wider()
is pivot_longer()
.
If you’re lucky enough, your data comes from a proper database and is already in the long and tidy format.
But if you do get landed with something that looks like Table 3.2, you’ll need to know how to wrangle the variables currently spread across different columns into the tidy format (where each column is a variable, each row is an observation).
pivot_longer()
can be a little bit more difficult to use as you need to describe all the columns to be collected using a select_helper
. Run `?select_helpers and click on the first result in the Help tab for a reminder.
For example, here we want to collect all the columns that include the words Female or Male, the select helper for it is matches("Female|Male")
:
gbd_wide %>%
pivot_longer(matches("Female|Male"),
names_to = "sex_year",
values_to = "deaths_millions") %>%
slice(1:6)
## # A tibble: 6 x 3
## cause sex_year deaths_millions
## <chr> <chr> <dbl>
## 1 Communicable diseases Female_1990 7.3
## 2 Communicable diseases Female_2017 4.91
## 3 Communicable diseases Male_1990 8.06
## 4 Communicable diseases Male_2017 5.47
## 5 Injuries Female_1990 1.41
## 6 Injuries Female_2017 1.42
You’re probably looking at the example above and thinking that’s all nice and simple on this miniature example dataset, but how on earth will I figure this out on a real-world example.
And you’re right, we won’t deny that pivot_longer()
is one of the most technically complicated functions in this book, and it can take a lot of trial and error to get it to work.
How to get started with your own pivot_longer()
transformation is to first play with the select()
function to make sure you are telling R exactly which columns to pivot into the longer format.
For example, before working out the pivot_longer()
code for the above example, we would figure this out first:
## # A tibble: 3 x 4
## Female_1990 Female_2017 Male_1990 Male_2017
## <dbl> <dbl> <dbl> <dbl>
## 1 7.3 4.91 8.06 5.47
## 2 1.41 1.42 2.84 3.05
## 3 12.8 19.15 13.91 21.74
Then, knowing that matches("Female|Male")
works as expected inside our little select()
test, we can copy-paste it into pivot_longer()
and add the names_to
and values_to
arguments. Both of these arguments are new column names that you can make up (in the above example, we are using “sex_year” and “deaths_millions”).
3.8.3 separate()
a column into multiple columns
While pivot_longer()
did a great job fetching the different observations that were spread across multiple columns into a single one, it’s still a combination of two variables - sex and year.
We can use the separate()
function to deal with that.
gbd_wide %>%
# same pivot_longer as before
pivot_longer(matches("Female|Male"),
names_to = "sex_year",
values_to = "deaths_millions") %>%
separate(sex_year, into = c("sex", "year"), sep = "_", convert = TRUE)
## # A tibble: 12 x 4
## cause sex year deaths_millions
## <chr> <chr> <int> <dbl>
## 1 Communicable diseases Female 1990 7.3
## 2 Communicable diseases Female 2017 4.91
## 3 Communicable diseases Male 1990 8.06
## 4 Communicable diseases Male 2017 5.47
## 5 Injuries Female 1990 1.41
## 6 Injuries Female 2017 1.42
## 7 Injuries Male 1990 2.84
## 8 Injuries Male 2017 3.05
## 9 Non-communicable diseases Female 1990 12.8
## 10 Non-communicable diseases Female 2017 19.15
## 11 Non-communicable diseases Male 1990 13.91
## 12 Non-communicable diseases Male 2017 21.74
We’ve also added convert = TRUE
to separate()
so year
would get converted into a numeric variable.
The combination of, e.g., “Female-1990” is a character variable, so after separating them both sex
and year
would still be classified as characters.
But the convert = TRUE
recognises that year
is a number and will appropriately convert it into an integer.