3.8 Reshaping data - long vs wide format | R for Health Data Science

3.8 Reshaping data - long vs wide format

So far, all of the examples we’ve shown you have been using ‘tidy’ data. Data is ‘tidy’ when it follows a couple of rules: each variable is in its own column, and each observation is in its own row. Making data ‘tidy’ often means transforming the table from a “wide” format into a “long” format. Long format is efficient to use in data analysis and visualisation and can also be considered “computer readable”.

But sometimes when presenting data in tables for humans to read, or when collecting data directly into a spreadsheet, it can be convenient to have data in a wide format. Data is ‘wide’ when some or all of the columns are levels of a factor. An example makes this easier to see.

gbd_wide <- read_csv("data/global_burden_disease_wide-format.csv")
gbd_long <- read_csv("data/global_burden_disease_cause-year-sex.csv")

TABLE 3.2: Global Burden of Disease data in human-readable wide format. This is not tidy data.
cause	Female_1990	Female_2017	Male_1990	Male_2017
Communicable diseases	7.30	4.91	8.06	5.47
Injuries	1.41	1.42	2.84	3.05
Non-communicable diseases	12.80	19.15	13.91	21.74

TABLE 3.3: Global Burden of Disease data in analysis-friendly long format. This is tidy data.
cause	year	sex	deaths_millions
Communicable diseases	1990	Female	7.30
Communicable diseases	2017	Female	4.91
Communicable diseases	1990	Male	8.06
Communicable diseases	2017	Male	5.47
Injuries	1990	Female	1.41
Injuries	2017	Female	1.42
Injuries	1990	Male	2.84
Injuries	2017	Male	3.05
Non-communicable diseases	1990	Female	12.80
Non-communicable diseases	2017	Female	19.15
Non-communicable diseases	1990	Male	13.91
Non-communicable diseases	2017	Male	21.74

Tables 3.3 and 3.2 contain the exact same information, but in long (tidy) and wide formats, respectively.

3.8.1 Pivot values from rows into columns (wider)

If we want to take the long data from 3.3 and put some of the numbers next to each other for easier visualisation, then pivot_wider() from the tidyr package is the function to do it. It means we want to send a variable into columns, and it needs just two arguments: the variable we want to become the new columns, and the variable where the values currently are.

gbd_long %>% 
  pivot_wider(names_from = year, values_from = deaths_millions)

## # A tibble: 6 x 4
##   cause                     sex    `1990` `2017`
##   <chr>                     <chr>   <dbl>  <dbl>
## 1 Communicable diseases     Female   7.3    4.91
## 2 Communicable diseases     Male     8.06   5.47
## 3 Injuries                  Female   1.41   1.42
## 4 Injuries                  Male     2.84   3.05
## 5 Non-communicable diseases Female  12.8   19.15
## 6 Non-communicable diseases Male    13.91  21.74

This means we can quickly eyeball how the number of deaths has changed from 1990 to 2017 for each cause category and sex. Whereas if we wanted to quickly look at the difference in the number of deaths for females and males, we can change the names_from = argument from = years to = sex. Furthermore, we can also add a mutate() to calculate the difference:

gbd_long %>% 
  pivot_wider(names_from = sex, values_from = deaths_millions) %>% 
  mutate(Male - Female)

## # A tibble: 6 x 5
##   cause                      year Female  Male `Male - Female`
##   <chr>                     <dbl>  <dbl> <dbl>           <dbl>
## 1 Communicable diseases      1990   7.3   8.06          0.76  
## 2 Communicable diseases      2017   4.91  5.47          0.5600
## 3 Injuries                   1990   1.41  2.84          1.430 
## 4 Injuries                   2017   1.42  3.05          1.63  
## 5 Non-communicable diseases  1990  12.8  13.91          1.110 
## 6 Non-communicable diseases  2017  19.15 21.74          2.59

All of these differences are positive which means every year, more men die than women. Which make sense, as more boys are born than girls.

And what if we want to look at both year and sex at the same time, so to create Table 3.2 from Table 3.3? No problem, pivot_wider() can deal with multiple variables at the same time, names_from = c(sex, year):

gbd_long %>% 
  pivot_wider(names_from = c(sex, year), values_from = deaths_millions)

## # A tibble: 3 x 5
##   cause                     Female_1990 Female_2017 Male_1990 Male_2017
##   <chr>                           <dbl>       <dbl>     <dbl>     <dbl>
## 1 Communicable diseases            7.3         4.91      8.06      5.47
## 2 Injuries                         1.41        1.42      2.84      3.05
## 3 Non-communicable diseases       12.8        19.15     13.91     21.74

pivot_wider() has a few optional arguments that may be useful for you. For example, pivot_wider(..., values_fill = 0) can be used to fill empty cases (if you have any) with a value you specified. Or pivot_wider(..., names_sep = ": ") can be used to change the separator that gets put between the values (e.g., you may want “Female: 1990” instead of the default “Female_1990”). Remember that pressing F1 when your cursor is on a function opens it up in the Help tab where these extra options are listed.

3.8.2 Pivot values from columns to rows (longer)

The inverse of pivot_wider() is pivot_longer(). If you’re lucky enough, your data comes from a proper database and is already in the long and tidy format. But if you do get landed with something that looks like Table 3.2, you’ll need to know how to wrangle the variables currently spread across different columns into the tidy format (where each column is a variable, each row is an observation).

pivot_longer() can be a little bit more difficult to use as you need to describe all the columns to be collected using a select_helper. Run `?select_helpers and click on the first result in the Help tab for a reminder.

For example, here we want to collect all the columns that include the words Female or Male, the select helper for it is matches("Female|Male"):

gbd_wide %>% 
  pivot_longer(matches("Female|Male"), 
               names_to = "sex_year", 
               values_to = "deaths_millions") %>% 
  slice(1:6)

## # A tibble: 6 x 3
##   cause                 sex_year    deaths_millions
##   <chr>                 <chr>                 <dbl>
## 1 Communicable diseases Female_1990            7.3 
## 2 Communicable diseases Female_2017            4.91
## 3 Communicable diseases Male_1990              8.06
## 4 Communicable diseases Male_2017              5.47
## 5 Injuries              Female_1990            1.41
## 6 Injuries              Female_2017            1.42

You’re probably looking at the example above and thinking that’s all nice and simple on this miniature example dataset, but how on earth will I figure this out on a real-world example. And you’re right, we won’t deny that pivot_longer() is one of the most technically complicated functions in this book, and it can take a lot of trial and error to get it to work. How to get started with your own pivot_longer() transformation is to first play with the select() function to make sure you are telling R exactly which columns to pivot into the longer format. For example, before working out the pivot_longer() code for the above example, we would figure this out first:

gbd_wide %>% 
  select(matches("Female|Male"))

## # A tibble: 3 x 4
##   Female_1990 Female_2017 Male_1990 Male_2017
##         <dbl>       <dbl>     <dbl>     <dbl>
## 1        7.3         4.91      8.06      5.47
## 2        1.41        1.42      2.84      3.05
## 3       12.8        19.15     13.91     21.74

Then, knowing that matches("Female|Male") works as expected inside our little select() test, we can copy-paste it into pivot_longer() and add the names_to and values_to arguments. Both of these arguments are new column names that you can make up (in the above example, we are using “sex_year” and “deaths_millions”).

3.8.3 `separate()` a column into multiple columns

While pivot_longer() did a great job fetching the different observations that were spread across multiple columns into a single one, it’s still a combination of two variables - sex and year. We can use the separate() function to deal with that.

gbd_wide %>% 
  # same pivot_longer as before
  pivot_longer(matches("Female|Male"), 
               names_to = "sex_year", 
               values_to = "deaths_millions") %>% 
  separate(sex_year, into = c("sex", "year"), sep = "_", convert = TRUE)

## # A tibble: 12 x 4
##    cause                     sex     year deaths_millions
##    <chr>                     <chr>  <int>           <dbl>
##  1 Communicable diseases     Female  1990            7.3 
##  2 Communicable diseases     Female  2017            4.91
##  3 Communicable diseases     Male    1990            8.06
##  4 Communicable diseases     Male    2017            5.47
##  5 Injuries                  Female  1990            1.41
##  6 Injuries                  Female  2017            1.42
##  7 Injuries                  Male    1990            2.84
##  8 Injuries                  Male    2017            3.05
##  9 Non-communicable diseases Female  1990           12.8 
## 10 Non-communicable diseases Female  2017           19.15
## 11 Non-communicable diseases Male    1990           13.91
## 12 Non-communicable diseases Male    2017           21.74

We’ve also added convert = TRUE to separate() so year would get converted into a numeric variable. The combination of, e.g., “Female-1990” is a character variable, so after separating them both sex and year would still be classified as characters. But the convert = TRUE recognises that year is a number and will appropriately convert it into an integer.