2.11 Joining multiple datasets

It is common for different pieces of information to be kept in different files or tables and you often want to combine them together. For example, consider you have some demographic information (id, sex, age) in one file:

library(tidyverse)
patientdata <- read_csv("data/patient_data.csv")
patientdata
## # A tibble: 6 x 3
##      id sex      age
##   <dbl> <chr>  <dbl>
## 1     1 Female    24
## 2     2 Male      59
## 3     3 Female    32
## 4     4 Female    84
## 5     5 Male      48
## 6     6 Female    65

And another one with some lab results (id, measurement):

labsdata <- read_csv("data/labs_data.csv")
labsdata
## # A tibble: 4 x 2
##      id measurement
##   <dbl>       <dbl>
## 1     5        3.47
## 2     6        7.31
## 3     8        9.91
## 4     7        6.11

Notice how these datasets are not only different sizes (6 rows in patientdata, 4 rows in labsdata), but include information on different patients: patiendata has ids 1, 2, 3, 4, 5, 6, labsdata has ids 5, 6, 8, 7.

A comprehensive way to join these is to use full_join() retaining all information from both tibbles (and matching up rows by shared columns, in this case id):

full_join(patientdata, labsdata)
## Joining, by = "id"
## # A tibble: 8 x 4
##      id sex      age measurement
##   <dbl> <chr>  <dbl>       <dbl>
## 1     1 Female    24       NA   
## 2     2 Male      59       NA   
## 3     3 Female    32       NA   
## 4     4 Female    84       NA   
## 5     5 Male      48        3.47
## 6     6 Female    65        7.31
## 7     8 <NA>      NA        9.91
## 8     7 <NA>      NA        6.11

However, if we are only interested in matching information, we use the inner join:

inner_join(patientdata, labsdata)
## Joining, by = "id"
## # A tibble: 2 x 4
##      id sex      age measurement
##   <dbl> <chr>  <dbl>       <dbl>
## 1     5 Male      48        3.47
## 2     6 Female    65        7.31

And finally, if we want to retain all information from one tibble, we use either the left_join() or the right_join():

left_join(patientdata, labsdata)
## Joining, by = "id"
## # A tibble: 6 x 4
##      id sex      age measurement
##   <dbl> <chr>  <dbl>       <dbl>
## 1     1 Female    24       NA   
## 2     2 Male      59       NA   
## 3     3 Female    32       NA   
## 4     4 Female    84       NA   
## 5     5 Male      48        3.47
## 6     6 Female    65        7.31
right_join(patientdata, labsdata)
## Joining, by = "id"
## # A tibble: 4 x 4
##      id sex      age measurement
##   <dbl> <chr>  <dbl>       <dbl>
## 1     5 Male      48        3.47
## 2     6 Female    65        7.31
## 3     8 <NA>      NA        9.91
## 4     7 <NA>      NA        6.11

2.11.1 Further notes about joins

  • The joins functions (full_join(), inner_join(), left_join(), right_join()) will automatically look for matching column names. You can use the by = argument to specify by hand. This is especially useful if the columns are named differently in the datasets, e.g., left_join(data1, data2, by = c("id" = "patient_id")).

  • The rows do not have to be ordered, the joins match on values within the rows, not the order of the rows within the tibble.

  • Joins are used to combine different variables (columns) into a single tibble. If you are getting more data of the same variables, use bind_rows() instead:

patientdata_new <- read_csv("data/patient_data_updated.csv")
patientdata_new
## # A tibble: 2 x 3
##      id sex      age
##   <dbl> <chr>  <dbl>
## 1     7 Female    38
## 2     8 Male      29
bind_rows(patientdata, patientdata_new)
## # A tibble: 8 x 3
##      id sex      age
##   <dbl> <chr>  <dbl>
## 1     1 Female    24
## 2     2 Male      59
## 3     3 Female    32
## 4     4 Female    84
## 5     5 Male      48
## 6     6 Female    65
## 7     7 Female    38
## 8     8 Male      29

Finally, it is important to understand how joins behave if there are multiple matches within the tibbles. For example, if patient id 4 had a second measurement as well:

labsdata_updated <- labsdata %>% 
  add_row(id = 5, measurement = 2.49)
labsdata_updated
## # A tibble: 5 x 2
##      id measurement
##   <dbl>       <dbl>
## 1     5        3.47
## 2     6        7.31
## 3     8        9.91
## 4     7        6.11
## 5     5        2.49

When we now do a left_join() with our main tibble - patientdata:

left_join(patientdata, labsdata_updated)
## Joining, by = "id"
## # A tibble: 7 x 4
##      id sex      age measurement
##   <dbl> <chr>  <dbl>       <dbl>
## 1     1 Female    24       NA   
## 2     2 Male      59       NA   
## 3     3 Female    32       NA   
## 4     4 Female    84       NA   
## 5     5 Male      48        3.47
## 6     5 Male      48        2.49
## 7     6 Female    65        7.31

We get 7 rows, instead of 6 - as patient id 5 now appears twice with the two different measurements. So it is important to either know your datasets well or keep an eye on the number of rows to make sure any increases/decreases in the tibble sizes are as you expect them to be.