2.11 Joining multiple datasets
It is common for different pieces of information to be kept in different files or tables and you often want to combine them together.
For example, consider you have some demographic information (id
, sex
, age
) in one file:
## # A tibble: 6 x 3
## id sex age
## <dbl> <chr> <dbl>
## 1 1 Female 24
## 2 2 Male 59
## 3 3 Female 32
## 4 4 Female 84
## 5 5 Male 48
## 6 6 Female 65
And another one with some lab results (id
, measurement
):
## # A tibble: 4 x 2
## id measurement
## <dbl> <dbl>
## 1 5 3.47
## 2 6 7.31
## 3 8 9.91
## 4 7 6.11
Notice how these datasets are not only different sizes (6 rows in patientdata
, 4 rows in labsdata
), but include information on different patients: patiendata
has ids 1, 2, 3, 4, 5, 6, labsdata
has ids 5, 6, 8, 7.
A comprehensive way to join these is to use full_join()
retaining all information from both tibbles (and matching up rows by shared columns, in this case id
):
## Joining, by = "id"
## # A tibble: 8 x 4
## id sex age measurement
## <dbl> <chr> <dbl> <dbl>
## 1 1 Female 24 NA
## 2 2 Male 59 NA
## 3 3 Female 32 NA
## 4 4 Female 84 NA
## 5 5 Male 48 3.47
## 6 6 Female 65 7.31
## 7 8 <NA> NA 9.91
## 8 7 <NA> NA 6.11
However, if we are only interested in matching information, we use the inner join:
## Joining, by = "id"
## # A tibble: 2 x 4
## id sex age measurement
## <dbl> <chr> <dbl> <dbl>
## 1 5 Male 48 3.47
## 2 6 Female 65 7.31
And finally, if we want to retain all information from one tibble, we use either the left_join()
or the right_join()
:
## Joining, by = "id"
## # A tibble: 6 x 4
## id sex age measurement
## <dbl> <chr> <dbl> <dbl>
## 1 1 Female 24 NA
## 2 2 Male 59 NA
## 3 3 Female 32 NA
## 4 4 Female 84 NA
## 5 5 Male 48 3.47
## 6 6 Female 65 7.31
## Joining, by = "id"
## # A tibble: 4 x 4
## id sex age measurement
## <dbl> <chr> <dbl> <dbl>
## 1 5 Male 48 3.47
## 2 6 Female 65 7.31
## 3 8 <NA> NA 9.91
## 4 7 <NA> NA 6.11
2.11.1 Further notes about joins
The joins functions (
full_join()
,inner_join()
,left_join()
,right_join()
) will automatically look for matching column names. You can use theby =
argument to specify by hand. This is especially useful if the columns are named differently in the datasets, e.g.,left_join(data1, data2, by = c("id" = "patient_id"))
.The rows do not have to be ordered, the joins match on values within the rows, not the order of the rows within the tibble.
Joins are used to combine different variables (columns) into a single tibble. If you are getting more data of the same variables, use
bind_rows()
instead:
## # A tibble: 2 x 3
## id sex age
## <dbl> <chr> <dbl>
## 1 7 Female 38
## 2 8 Male 29
## # A tibble: 8 x 3
## id sex age
## <dbl> <chr> <dbl>
## 1 1 Female 24
## 2 2 Male 59
## 3 3 Female 32
## 4 4 Female 84
## 5 5 Male 48
## 6 6 Female 65
## 7 7 Female 38
## 8 8 Male 29
Finally, it is important to understand how joins behave if there are multiple matches within the tibbles. For example, if patient id 4 had a second measurement as well:
## # A tibble: 5 x 2
## id measurement
## <dbl> <dbl>
## 1 5 3.47
## 2 6 7.31
## 3 8 9.91
## 4 7 6.11
## 5 5 2.49
When we now do a left_join()
with our main tibble
- patientdata
:
## Joining, by = "id"
## # A tibble: 7 x 4
## id sex age measurement
## <dbl> <chr> <dbl> <dbl>
## 1 1 Female 24 NA
## 2 2 Male 59 NA
## 3 3 Female 32 NA
## 4 4 Female 84 NA
## 5 5 Male 48 3.47
## 6 5 Male 48 2.49
## 7 6 Female 65 7.31
We get 7 rows, instead of 6 - as patient id 5 now appears twice with the two different measurements. So it is important to either know your datasets well or keep an eye on the number of rows to make sure any increases/decreases in the tibble sizes are as you expect them to be.