3.7 select() columns | R for Health Data Science

3.7 `select()` columns

The select() function can be used to choose, rename, or reorder columns of a tibble.

For the following select() examples, let’s create a new tibble called gbd_2rows by taking the first 2 rows of gbd_full (just for shorter printing):

gbd_2rows <- gbd_full %>% 
  slice(1:2)

gbd_2rows

## # A tibble: 2 x 5
##   cause                  year sex    income       deaths_millions
##   <chr>                 <dbl> <chr>  <chr>                  <dbl>
## 1 Communicable diseases  1990 Female High                   0.21 
## 2 Communicable diseases  1990 Female Upper-Middle           1.150
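
If you do not have gbd_full loaded, the examples in this section can still be followed with a hand-typed stand-in for gbd_2rows. This is only a sketch: the values below are copied from the printed output above, so deaths_millions may be rounded compared with the real data.

```r
library(dplyr)
library(tibble)

# Hand-typed stand-in for gbd_2rows (an assumption, not the real gbd_full data):
# values are copied from the printed tibble, so they may be rounded.
gbd_2rows <- tribble(
  ~cause,                  ~year, ~sex,     ~income,        ~deaths_millions,
  "Communicable diseases", 1990,  "Female", "High",         0.21,
  "Communicable diseases", 1990,  "Female", "Upper-Middle", 1.15
)

gbd_2rows
```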

Let’s select() two of these columns:

gbd_2rows %>% 
  select(cause, deaths_millions)

## # A tibble: 2 x 2
##   cause                 deaths_millions
##   <chr>                           <dbl>
## 1 Communicable diseases           0.21 
## 2 Communicable diseases           1.150

We can also use select() to rename the columns we are choosing:

gbd_2rows %>% 
  select(cause, deaths = deaths_millions)

## # A tibble: 2 x 2
##   cause                 deaths
##   <chr>                  <dbl>
## 1 Communicable diseases  0.21 
## 2 Communicable diseases  1.150

The function rename() is similar to select(), but it keeps all variables, whereas select() keeps only the ones we mention:

gbd_2rows %>% 
  rename(deaths = deaths_millions)

## # A tibble: 2 x 5
##   cause                  year sex    income       deaths
##   <chr>                 <dbl> <chr>  <chr>         <dbl>
## 1 Communicable diseases  1990 Female High          0.21 
## 2 Communicable diseases  1990 Female Upper-Middle  1.150

select() can also be used to reorder the columns in your tibble. Column order has no effect on data analysis, as the functions we showed you above, as well as plotting, refer to columns by name and not by their position in the tibble. Reordering is still useful for organising your tibble for easier viewing.

Here we list all five columns inside select(), in the order we want them:

gbd_2rows %>% 
  select(year, sex, income, cause, deaths_millions)

## # A tibble: 2 x 5
##    year sex    income       cause                 deaths_millions
##   <dbl> <chr>  <chr>        <chr>                           <dbl>
## 1  1990 Female High         Communicable diseases           0.21 
## 2  1990 Female Upper-Middle Communicable diseases           1.150

The columns are reordered.
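
As a quick check that column position does not matter for analysis, we can run the same summary on the original and the reordered tibble and compare the results. This is a minimal sketch: the tibble is a hand-typed stand-in with values copied from the printed output, and the names reordered, total_original and total_reordered are made up for this example.

```r
library(dplyr)
library(tibble)

# Hand-typed stand-in for gbd_2rows (values copied from the printed output)
gbd_2rows <- tribble(
  ~cause,                  ~year, ~sex,     ~income,        ~deaths_millions,
  "Communicable diseases", 1990,  "Female", "High",         0.21,
  "Communicable diseases", 1990,  "Female", "Upper-Middle", 1.15
)

# Same columns, different order
reordered <- gbd_2rows %>%
  select(year, sex, income, cause, deaths_millions)

# The summary refers to deaths_millions by name, not by position
total_original  <- gbd_2rows %>% summarise(total = sum(deaths_millions))
total_reordered <- reordered %>% summarise(total = sum(deaths_millions))

# Compare the two results
identical(total_original, total_reordered)
```

Both summaries look up deaths_millions by name, so the two results should be the same regardless of where the column sits.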

If you want to move specific column(s) to the front of the tibble, do:

gbd_2rows %>% 
  select(year, sex, everything())

## # A tibble: 2 x 5
##    year sex    cause                 income       deaths_millions
##   <dbl> <chr>  <chr>                 <chr>                  <dbl>
## 1  1990 Female Communicable diseases High                   0.21 
## 2  1990 Female Communicable diseases Upper-Middle           1.150

And this is where the true power of select() starts to come out. In addition to listing the columns explicitly (e.g., mydata %>% select(year, cause...)) there are several special functions that can be used inside select(). These special functions are called select helpers, and the first select helper we used is everything().

The most common select helpers are starts_with(), ends_with(), contains(), and matches(). There are several others that may be useful to you, so press F1 on select() for a full list, or search the web for more examples.

Let’s say you can’t remember whether the deaths column was called deaths_millions or just deaths or deaths_mil, or maybe there are other columns that include the word “deaths” that you want to select():

gbd_2rows %>% 
  select(starts_with("deaths"))

## # A tibble: 2 x 1
##   deaths_millions
##             <dbl>
## 1           0.21 
## 2           1.150

Note how “deaths” needs to be quoted inside starts_with(), as it’s a piece of text to look for in the column names, not the name of a column/variable itself.
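
The other select helpers mentioned above work in the same way. Here is a short sketch of ends_with(), contains(), and matches() on a hand-typed stand-in for gbd_2rows (values copied from the printed output). The search strings and the result names by_suffix, by_substring and by_regex are chosen only for illustration.

```r
library(dplyr)
library(tibble)

# Hand-typed stand-in for gbd_2rows (values copied from the printed output)
gbd_2rows <- tribble(
  ~cause,                  ~year, ~sex,     ~income,        ~deaths_millions,
  "Communicable diseases", 1990,  "Female", "High",         0.21,
  "Communicable diseases", 1990,  "Female", "Upper-Middle", 1.15
)

# ends_with(): column names that end with the given text
by_suffix <- gbd_2rows %>%
  select(ends_with("millions"))

# contains(): column names that include the given text anywhere
by_substring <- gbd_2rows %>%
  select(contains("ea"))

# matches(): column names that match a regular expression
# ("e$" means the name ends with the letter e)
by_regex <- gbd_2rows %>%
  select(matches("e$"))

names(by_suffix)
names(by_substring)
names(by_regex)
```

With these column names, ends_with("millions") picks deaths_millions, contains("ea") picks year and deaths_millions, and matches("e$") picks cause and income.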