3.8 Reshaping data - long vs wide format
So far, all of the examples we’ve shown you have been using ‘tidy’ data. Data is ‘tidy’ when it is in long format: each variable is in its own column, and each observation is in its own row. This long format is efficient to use in data analysis and visualisation and can also be considered “computer readable”.
But sometimes when presenting data in tables for humans to read, or when collecting data directly into a spreadsheet, it can be convenient to have data in a wide format. Data is ‘wide’ when some or all of the columns are levels of a factor. An example makes this easier to see.
3.8.1 Pivot values from rows into columns (wider)
If we want to take the long data from 3.3 and put some of the numbers next to each other for easier visualisation, then
pivot_wider() from the tidyr package is the function to do it.
It means we want to send a variable into columns, and it needs just two arguments: the variable we want to become the new columns, and the variable where the values currently are.
## # A tibble: 6 x 4 ## cause sex `1990` `2017` ## <chr> <chr> <dbl> <dbl> ## 1 Communicable diseases Female 7.3 4.91 ## 2 Communicable diseases Male 8.06 5.47 ## 3 Injuries Female 1.41 1.42 ## 4 Injuries Male 2.84 3.05 ## 5 Non-communicable diseases Female 12.8 19.15 ## 6 Non-communicable diseases Male 13.91 21.74
This means we can quickly eyeball how the number of deaths has changed from 1990 to 2017 for each cause category and sex.
Whereas if we wanted to quickly look at the difference in the number of deaths for females and males, we can change the
names_from = argument from
= years to
Furthermore, we can also add a
mutate() to calculate the difference:
## # A tibble: 6 x 5 ## cause year Female Male `Male - Female` ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 Communicable diseases 1990 7.3 8.06 0.76 ## 2 Communicable diseases 2017 4.91 5.47 0.5600 ## 3 Injuries 1990 1.41 2.84 1.430 ## 4 Injuries 2017 1.42 3.05 1.63 ## 5 Non-communicable diseases 1990 12.8 13.91 1.110 ## 6 Non-communicable diseases 2017 19.15 21.74 2.59
All of these differences are positive which means every year, more men die than women. Which make sense, as more boys are born than girls.
And what if we want to look at both
sex at the same time, so to create Table 3.2 from Table 3.3?
pivot_wider() can deal with multiple variables at the same time,
names_from = c(sex, year):
## # A tibble: 3 x 5 ## cause Female_1990 Female_2017 Male_1990 Male_2017 ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 Communicable diseases 7.3 4.91 8.06 5.47 ## 2 Injuries 1.41 1.42 2.84 3.05 ## 3 Non-communicable diseases 12.8 19.15 13.91 21.74
pivot_wider() has a few optional arguments that may be useful for you.
pivot_wider(..., values_fill = 0) can be used to fill empty cases (if you have any) with a value you specified.
pivot_wider(..., names_sep = ": ") can be used to change the separator that gets put between the values (e.g., you may want “Female: 1990” instead of the default “Female_1990”).
Remember that pressing F1 when your cursor is on a function opens it up in the Help tab where these extra options are listed.
3.8.2 Pivot values from columns to rows (longer)
The inverse of
If you’re lucky enough, your data comes from a proper database and is already in the long and tidy format.
But if you do get landed with something that looks like Table 3.2, you’ll need to know how to wrangle the variables currently spread across different columns into the tidy format (where each column is a variable, each row is an observation).
pivot_longer() can be a little bit more difficult to use as you need to describe all the columns to be collected using a
select_helper. Run `?select_helpers and click on the first result in the Help tab for a reminder.
For example, here we want to collect all the columns that include the words Female or Male, the select helper for it is
## # A tibble: 6 x 3 ## cause sex_year deaths_millions ## <chr> <chr> <dbl> ## 1 Communicable diseases Female_1990 7.3 ## 2 Communicable diseases Female_2017 4.91 ## 3 Communicable diseases Male_1990 8.06 ## 4 Communicable diseases Male_2017 5.47 ## 5 Injuries Female_1990 1.41 ## 6 Injuries Female_2017 1.42
You’re probably looking at the example above and thinking that’s all nice and simple on this miniature example dataset, but how on earth will I figure this out on a real-world example.
And you’re right, we won’t deny that
pivot_longer() is one of the most technically complicated functions in this book, and it can take a lot of trial and error to get it to work.
How to get started with your own
pivot_longer() transformation is to first play with the
select() function to make sure you are telling R exactly which columns to pivot into the longer format.
For example, before working out the
pivot_longer() code for the above example, we would figure this out first:
## # A tibble: 3 x 4 ## Female_1990 Female_2017 Male_1990 Male_2017 ## <dbl> <dbl> <dbl> <dbl> ## 1 7.3 4.91 8.06 5.47 ## 2 1.41 1.42 2.84 3.05 ## 3 12.8 19.15 13.91 21.74
Then, knowing that
matches("Female|Male") works as expected inside our little
select() test, we can copy-paste it into
pivot_longer() and add the
values_to arguments. Both of these arguments are new column names that you can make up (in the above example, we are using “sex_year” and “deaths_millions”).
separate() a column into multiple columns
pivot_longer() did a great job fetching the different observations that were spread across multiple columns into a single one, it’s still a combination of two variables - sex and year.
We can use the
separate() function to deal with that.
## # A tibble: 12 x 4 ## cause sex year deaths_millions ## <chr> <chr> <int> <dbl> ## 1 Communicable diseases Female 1990 7.3 ## 2 Communicable diseases Female 2017 4.91 ## 3 Communicable diseases Male 1990 8.06 ## 4 Communicable diseases Male 2017 5.47 ## 5 Injuries Female 1990 1.41 ## 6 Injuries Female 2017 1.42 ## 7 Injuries Male 1990 2.84 ## 8 Injuries Male 2017 3.05 ## 9 Non-communicable diseases Female 1990 12.8 ## 10 Non-communicable diseases Female 2017 19.15 ## 11 Non-communicable diseases Male 1990 13.91 ## 12 Non-communicable diseases Male 2017 21.74
We’ve also added
convert = TRUE to
year would get converted into a numeric variable.
The combination of, e.g., “Female-1990” is a character variable, so after separating them both
year would still be classified as characters.
convert = TRUE recognises that
year is a number and will appropriately convert it into an integer.