3.7 Reshaping data - long vs wide format

So far, all of the example we’ve shown you have been using ‘tidy’ data. Data is ‘long’ when each variable is in its own column, and each observation is in its own row. This long format is efficient to use in data analysis and visualisation and can also be considered “computer readable”.

But sometimes when presenting data in tables for humans to read, or when collecting data directly into a spreadsheet, it can be convenient to have data in a wide format. Data is ‘wide’ when some or all of columns are levels of a factor. An example makes this easier to see.

TABLE 3.2: Global Burden of Disease data in human-readable wide format. This is not tidy data.
cause Female-1990 Female-2017 Male-1990 Male-2017
Communicable diseases 7.30 4.91 8.06 5.47
Injuries 1.41 1.42 2.84 3.05
Non-communicable diseases 12.80 19.15 13.91 21.74
TABLE 3.3: Global Burden of Disease data in analysis-friendly long format. This is tidy data.
cause year sex deaths_millions
Communicable diseases 1990 Female 7.30
Communicable diseases 2017 Female 4.91
Communicable diseases 1990 Male 8.06
Communicable diseases 2017 Male 5.47
Injuries 1990 Female 1.41
Injuries 2017 Female 1.42
Injuries 1990 Male 2.84
Injuries 2017 Male 3.05
Non-communicable diseases 1990 Female 12.80
Non-communicable diseases 2017 Female 19.15
Non-communicable diseases 1990 Male 13.91
Non-communicable diseases 2017 Male 21.74

Tables 3.3 and 3.2 contain the exact same information, but in long (tidy) and wide formats, respectively.

Same data in the long ('tidy', necessary for efficient analysis) and wide (easier for human-readability/presentation/manual data entry) formats. TODO: replace with updated data.

FIGURE 3.2: Same data in the long (‘tidy’, necessary for efficient analysis) and wide (easier for human-readability/presentation/manual data entry) formats. TODO: replace with updated data.

3.7.1 spread() values from rows into columns

If we want to take the long data from 3.3 and put some of the numbers next to each other for easier visualisation, then spread() is the function to do it. It means we want to spread a variable into columns, and it needs just two arguments: the column we want to spread, and the column where the values are.

## # A tibble: 6 x 4
##   cause                     sex    `1990` `2017`
##   <chr>                     <chr>   <dbl>  <dbl>
## 1 Communicable diseases     Female   7.3    4.91
## 2 Communicable diseases     Male     8.06   5.47
## 3 Injuries                  Female   1.41   1.42
## 4 Injuries                  Male     2.84   3.05
## 5 Non-communicable diseases Female  12.8   19.15
## 6 Non-communicable diseases Male    13.91  21.74

In this example, we are sending gbd_long into the spread(year, deaths_millions) to put the year variable into different columns. The values to fill the new columns with are deaths_millions.

This means we can quickly eyeball how the number of deaths have changed from 1990 to 2017 for each cause category and sex. Whereas if we wanted to quickly look at the difference in the number of deaths for Females and Males, we can just the sex variable instead, so spread(sex, deaths_millions). Furthermore, we can now add a mutate() to show this difference in a new column:

## # A tibble: 6 x 5
##   cause                      year Female  Male `Male - Female`
##   <chr>                     <dbl>  <dbl> <dbl>           <dbl>
## 1 Communicable diseases      1990   7.3   8.06          0.76  
## 2 Communicable diseases      2017   4.91  5.47          0.5600
## 3 Injuries                   1990   1.41  2.84          1.430 
## 4 Injuries                   2017   1.42  3.05          1.63  
## 5 Non-communicable diseases  1990  12.8  13.91          1.110 
## 6 Non-communicable diseases  2017  19.15 21.74          2.59

All of these differences are positive which means every year, more men die than women. Which make sense, as more boys are born than girls.

And what if we want to spread both year and sex at the same time, so to create table 3.2 from Table 3.3? spread() can only accept two arguments. First is the variable to be spread to columns. Second is the values to used. Therefore, to spread by two columns we need to combine them beforehand. This is what the unite() function is for:

## # A tibble: 2 x 3
##   cause                 sex_year    deaths_millions
##   <chr>                 <chr>                 <dbl>
## 1 Communicable diseases Female_1990            7.3 
## 2 Communicable diseases Female_2017            4.91

We are using the slice(1:2) to select the first two rows - just for efficient printing (: in R is a shorthand for creating sequential numbers, e.g. 1:4 is 1, 2, 3, 4).

We can then spread() the united column:

## # A tibble: 3 x 5
##   cause                     Female_1990 Female_2017 Male_1990 Male_2017
##   <chr>                           <dbl>       <dbl>     <dbl>     <dbl>
## 1 Communicable diseases            7.3         4.91      8.06      5.47
## 2 Injuries                         1.41        1.42      2.84      3.05
## 3 Non-communicable diseases       12.8        19.15     13.91     21.74

Both spread() and unite() have a few optional arguments that you may be useful for you. For example, spread(..., fill = 0) is used to fill empty cells (default is fill = NA). Or unite(..., sep = " ") can be used to change the separator that gets put between the values (e.g. you may want “Female-1990” or “Female: 1990” instead of the default "_"). Remember that pressing F1 when your cursor is on a function opens it up in the Help tab where these extra options are listed.

The unite() is a very convenient function for pasting values from multiple columns together, but if you want to do something more special (e.g. also round numbers or add different separators between multiple different columns), then the paste() function inside mutate() will give you that extra flexibility and control.

For example, this:

## # A tibble: 1 x 3
##   cause                 sex_year    deaths_millions
##   <chr>                 <chr>                 <dbl>
## 1 Communicable diseases Female_1990             7.3

is similar to:

## # A tibble: 1 x 5
##   cause                  year sex    deaths_millions sex_year   
##   <chr>                 <dbl> <chr>            <dbl> <chr>      
## 1 Communicable diseases  1990 Female             7.3 Female_1990

They’re similar but not exactly the same as unite() drops the original columns (and only keeps the new united one), whereas mutate() creates a new column and keeps all existing ones as well.

To make them equivalent, you could either add drop = FALSE inside unite() (keeping all columns) or %>% select(-sex, -year) after the mutate() (to drop/deselect these).

3.7.2 gather() values from columns to rows

The opposite of spread() is gather(). If you’re lucky enough, your data comes from a proper database and is already in the long and tidy format. But if you do get landed with something that looks like the Table 3.2, you’ll need to know how to gather and separate the variables currently spread across different columns into the tidy format (each column is a variable, each row is an observation).

We could try and run gather() without any extra arguments (again, using slice(1:6) just for shorter printing, the first 6 lines this time):

## # A tibble: 6 x 2
##   key         value                    
##   <chr>       <chr>                    
## 1 cause       Communicable diseases    
## 2 cause       Injuries                 
## 3 cause       Non-communicable diseases
## 4 Female-1990 7.3                      
## 5 Female-1990 1.41                     
## 6 Female-1990 12.8

So it gathers all column names into a new variable called key, and puts everything in the rows into a column called value. However, the cause variable already was how we wanted it - in a column of its own, so we don’t want this gathered together the deaths_millions values. So we can tell gather() to leave it where it is:

## # A tibble: 6 x 3
##   cause                     sex_year    deaths_millions
##   <chr>                     <chr>                 <dbl>
## 1 Communicable diseases     Female-1990            7.3 
## 2 Injuries                  Female-1990            1.41
## 3 Non-communicable diseases Female-1990           12.8 
## 4 Communicable diseases     Female-2017            4.91
## 5 Injuries                  Female-2017            1.42
## 6 Non-communicable diseases Female-2017           19.15

Now there, because selection (or deselection) of columns needs to be the fourth argument to gather() (first on is the data that gets piped - %>% in, second and third the names of the new columns), we also need to include the names of the new columns before we can specify that we want -cause to stay where it is.

And finally, we need to use separate() to put sex and year into their own columns:

## # A tibble: 6 x 4
##   cause                     sex     year deaths_millions
##   <chr>                     <chr>  <int>           <dbl>
## 1 Communicable diseases     Female  1990            7.3 
## 2 Injuries                  Female  1990            1.41
## 3 Non-communicable diseases Female  1990           12.8 
## 4 Communicable diseases     Female  2017            4.91
## 5 Injuries                  Female  2017            1.42
## 6 Non-communicable diseases Female  2017           19.15

It is important to notice the quotes around the new column names: into = c("sex", "year"). Most tidyverse functions don’t want to use use quotes around column names, so this can be confusing.

We’ve also added convert = TRUE to separate() so year would get converted into a numeric variable. The combination of, e.g., “Female-1990” is a character variable, so after separating them both sex and year would still be classified as characters. But the convert = TRUE recognises that year is a number and will appropriately convert it into an integer.

When working with large datasets with a lot of columns that need gathering, then the select() helpers are extremely useful.