2.5 Operators for filtering data

Operators are symbols that tell R how to handle different pieces of data or objects. We have already introduced three: $ (selects a column), <- (assigns values or results to a variable), and the pipe, %>% (sends data into a function).

Other common operators are the ones we use for filtering data: comparison operators and logical operators. We typically filter to create subgroups, or to exclude outliers or incomplete cases.

The comparison operators that work with numeric data are relatively straightforward: >, <, >=, <=. The first two check whether your values are greater or less than another value, the last two check for “greater than or equal to” and “less than or equal to”. These operators are most commonly spotted inside the filter() function:

gbd_short %>% 
  filter(year < 1995)
## # A tibble: 3 x 3
##    year cause                     deaths_millions
##   <dbl> <chr>                               <dbl>
## 1  1990 Communicable diseases               15.4 
## 2  1990 Injuries                             4.25
## 3  1990 Non-communicable diseases           26.7

Here we send the data (gbd_short) to filter() and ask it to retain all rows where year is less than 1995. The resulting tibble only includes the year 1990. Now, if we use the <= (less than or equal to) operator, both 1990 and 1995 pass the filter:

gbd_short %>% 
  filter(year <= 1995)
## # A tibble: 6 x 3
##    year cause                     deaths_millions
##   <dbl> <chr>                               <dbl>
## 1  1990 Communicable diseases               15.4 
## 2  1990 Injuries                             4.25
## 3  1990 Non-communicable diseases           26.7 
## 4  1995 Communicable diseases               15.1 
## 5  1995 Injuries                             4.53
## 6  1995 Non-communicable diseases           29.3

Furthermore, the values either side of the operator could both be variables, e.g., mydata %>% filter(var2 > var1).
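
As a minimal sketch of comparing two columns with each other, assuming the dplyr package is installed. The tibble mydata and its columns id, var1 and var2 are made up for illustration and are not part of gbd_short:

```r
library(dplyr)

# Hypothetical data for illustration only
mydata <- tibble(
  id   = 1:4,
  var1 = c(10, 20, 30, 40),
  var2 = c(15, 20, 25, 50)
)

# Keep rows where var2 is strictly greater than var1
# (the row where both are 20 is dropped, because > excludes equality)
var2_larger <- mydata %>%
  filter(var2 > var1)

var2_larger$id
```

The comparison is done row by row, so each row is kept or dropped based on its own pair of values.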

To filter for values that are equal to something, we use the == operator.

gbd_short %>% 
  filter(year == 1995)
## # A tibble: 3 x 3
##    year cause                     deaths_millions
##   <dbl> <chr>                               <dbl>
## 1  1995 Communicable diseases               15.1 
## 2  1995 Injuries                             4.53
## 3  1995 Non-communicable diseases           29.3

This reads: take the GBD dataset, send it to the filter and keep rows where year is equal to 1995.

Accidentally using a single equals (=) where a double equals (==) is needed is a common mistake that still happens to the best of us. It happens so often that the error filter() gives when we use the wrong one also reminds us what the correct one is:

gbd_short %>% 
  filter(year = 1995)
## Error: Problem with `filter()` input `..1`.
## x Input `..1` is named.
## ℹ This usually means that you've used `=` instead of `==`.
## ℹ Did you mean `year == 1995`?

The answer to “do you need ==?” is almost always, “Yes R, I do, thank you”.

But that’s just because filter() is a clever cookie and is used to this common mistake. There are other useful functions we use these operators in, but they don’t always know to tell us that we’ve just confused = for ==. So if you get an error when checking for an equality between variables, always check your == operators first.
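
To illustrate a less helpful error message, here is a small sketch using base R's ifelse(), which is not introduced in this section and is used here only as an example of another function that takes a comparison. The vector year is made up for illustration:

```r
# Hypothetical values for illustration only
year <- c(1990, 1995, 2017)

# Typo: a single = is read as an argument called `year`,
# which ifelse() does not have, so R stops with an error.
# tryCatch() is used here only to capture the message as text.
msg <- tryCatch(
  ifelse(year = 1995, "match", "other"),
  error = function(e) conditionMessage(e)
)
msg  # complains about an unused argument, with no hint about ==

# Intended comparison using ==
labels <- ifelse(year == 1995, "match", "other")
labels
```

Nothing in the captured message points towards ==, which is why it is worth checking the operators yourself first.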

R also has two operators for combining multiple comparisons: & and |, which stand for AND and OR, respectively. For example, we can filter to only keep the earliest and latest years in the dataset:

gbd_short %>% 
  filter(year == 1990 | year == 2017)
## # A tibble: 6 x 3
##    year cause                     deaths_millions
##   <dbl> <chr>                               <dbl>
## 1  1990 Communicable diseases               15.4 
## 2  1990 Injuries                             4.25
## 3  1990 Non-communicable diseases           26.7 
## 4  2017 Communicable diseases               10.4 
## 5  2017 Injuries                             4.47
## 6  2017 Non-communicable diseases           40.9

This reads: take the GBD dataset, send it to the filter and keep rows where year is equal to 1990 OR year is equal to 2017.

Using specific values like we’ve done here (1990/2017) is called “hard-coding”, which is fine if we know for sure that we will not want to use the same script on an updated dataset. But a cleverer way of achieving the same thing is to use the min() and max() functions, which keep working when the range of years changes:

gbd_short %>% 
  filter(year == max(year) | year == min(year))
## # A tibble: 6 x 3
##    year cause                     deaths_millions
##   <dbl> <chr>                               <dbl>
## 1  1990 Communicable diseases               15.4 
## 2  1990 Injuries                             4.25
## 3  1990 Non-communicable diseases           26.7 
## 4  2017 Communicable diseases               10.4 
## 5  2017 Injuries                             4.47
## 6  2017 Non-communicable diseases           40.9

TABLE 2.4: Filtering operators.
Operator    Meaning
==          Equal to
!=          Not equal to
<           Less than
>           Greater than
<=          Less than or equal to
>=          Greater than or equal to
&           AND
|           OR
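
The != and & operators from the table have not been shown in action yet, so here is a minimal sketch, assuming dplyr is installed. gbd_demo is a hypothetical stand-in for gbd_short that contains only the rows printed in this section, with values as rounded in that output:

```r
library(dplyr)

# Stand-in data: only the 1990, 1995 and 2017 rows shown above,
# with deaths_millions as rounded in the printed output
gbd_demo <- tibble(
  year = rep(c(1990, 1995, 2017), each = 3),
  cause = rep(c("Communicable diseases", "Injuries",
                "Non-communicable diseases"), times = 3),
  deaths_millions = c(15.4, 4.25, 26.7, 15.1, 4.53, 29.3, 10.4, 4.47, 40.9)
)

# Keep rows where year is NOT 1995 AND cause is "Injuries"
injuries_not_1995 <- gbd_demo %>%
  filter(year != 1995 & cause == "Injuries")

injuries_not_1995
```

Both conditions must be true for a row to be kept, so this leaves the Injuries rows for 1990 and 2017.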

2.5.1 Worked examples

Filter the dataset to only include the year 2000. Save this in a new variable using the assignment operator.

mydata_year2000 <- gbd_short %>% 
  filter(year == 2000)

Let’s practice combining multiple selections together.

Reminder: ‘|’ means OR and ‘&’ means AND.

From gbd_short, select the rows where year is either 1990 or 2017 and cause is “Communicable diseases”:

new_data_selection <- gbd_short %>% 
  filter((year == 1990 | year == 2017) & cause == "Communicable diseases")

# Or we can get rid of the extra brackets around the years
# by moving cause into a new filter on a new line:

new_data_selection <- gbd_short %>% 
  filter(year == 1990 | year == 2017) %>% 
  filter(cause == "Communicable diseases")
  
# Or even better, we can include both in one filter() call, as all
# separate conditions are by default joined with "&":

new_data_selection <- gbd_short %>% 
  filter(year == 1990 | year == 2017,
         cause == "Communicable diseases")
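
As a rough check that the three versions are interchangeable, and of why the brackets matter in the first one, here is a sketch assuming dplyr is installed. gbd_demo is a hypothetical stand-in for gbd_short containing only the rows printed in this section, and the object names are made up for illustration:

```r
library(dplyr)

# Stand-in data: only the 1990, 1995 and 2017 rows shown earlier,
# with deaths_millions as rounded in the printed output
gbd_demo <- tibble(
  year = rep(c(1990, 1995, 2017), each = 3),
  cause = rep(c("Communicable diseases", "Injuries",
                "Non-communicable diseases"), times = 3),
  deaths_millions = c(15.4, 4.25, 26.7, 15.1, 4.53, 29.3, 10.4, 4.47, 40.9)
)

with_brackets <- gbd_demo %>%
  filter((year == 1990 | year == 2017) & cause == "Communicable diseases")

two_filters <- gbd_demo %>%
  filter(year == 1990 | year == 2017) %>%
  filter(cause == "Communicable diseases")

comma_separated <- gbd_demo %>%
  filter(year == 1990 | year == 2017,
         cause == "Communicable diseases")

# Compare the three results with each other
identical(with_brackets, two_filters)
identical(with_brackets, comma_separated)

# Dropping the brackets inside a single condition changes the meaning:
# & is evaluated before |, so every 1990 row is kept as well
no_brackets <- gbd_demo %>%
  filter(year == 1990 | year == 2017 & cause == "Communicable diseases")
nrow(no_brackets)  # 4 rows rather than 2
```

Separating conditions with a comma avoids this pitfall, because each comma-separated condition is evaluated on its own before they are combined with AND.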

The hash symbol (#) is used to add free-text comments to R code. R will not try to run these lines; it simply ignores them. Comments are an essential part of any programming code: think of them as “Dear Diary” notes to your future self.