2.7 Missing values (NAs) and filters

Filtering for missing values (NAs) needs special attention and care. Remember the small example tibble from Table 2.3 - it has some NAs in columns var2 and var3:

mydata
## # A tibble: 4 x 5
##      id sex     var1  var2  var3
##   <int> <chr>  <dbl> <dbl> <dbl>
## 1     1 Male       4    NA     2
## 2     2 Female     1     4     1
## 3     3 Female     2     5    NA
## 4     4 Male       3    NA    NA

If we now want to filter for rows where var2 is missing, filter(var2 == NA) is not the way to do it, it will not work.

Since R is a programming language, it can be a bit stubborn with things like these. When you ask R to do a comparison using == (or <, >, etc.) it expects a value on each side, but NA is not a value, it is the lack thereof. The way to filter for missing values is using the is.na() function:

mydata %>% 
  filter(is.na(var2))
## # A tibble: 2 x 5
##      id sex    var1  var2  var3
##   <int> <chr> <dbl> <dbl> <dbl>
## 1     1 Male      4    NA     2
## 2     4 Male      3    NA    NA

We send mydata to the filter and keep rows where var2 is NA. Note the double brackets at the end: that’s because the inner one belongs to is.na(), and the outer one to filter(). Missing out a closing bracket is also a common source of errors, and still happens to the best of us.

If filtering for rows where var2 is not missing, we do this8

mydata %>% 
  filter(!is.na(var2))
## # A tibble: 2 x 5
##      id sex     var1  var2  var3
##   <int> <chr>  <dbl> <dbl> <dbl>
## 1     2 Female     1     4     1
## 2     3 Female     2     5    NA

In R, the exclamation mark (!) means “not”.

Sometimes you want to drop a specific value (e.g., an outlier) from the dataset like this. The small example tibble mydata has 4 rows, with the values for var2 as follows: NA, 4, 5, NA. We can exclude the row where var2 is equal to 5 by using the “not equals” (!=)9:

mydata %>% 
  filter(var2 != 5)
## # A tibble: 1 x 5
##      id sex     var1  var2  var3
##   <int> <chr>  <dbl> <dbl> <dbl>
## 1     2 Female     1     4     1

However, you’ll see that by doing this, R drops the rows where var2 is NA as well, as it can’t be sure these missing values were not equal to 5.

If you want to keep the missing values, you need to make use of the OR (|) operator and the is.na() function:

mydata %>% 
  filter(var2 != 5 | is.na(var2))
## # A tibble: 3 x 5
##      id sex     var1  var2  var3
##   <int> <chr>  <dbl> <dbl> <dbl>
## 1     1 Male       4    NA     2
## 2     2 Female     1     4     1
## 3     4 Male       3    NA    NA

Being caught out by missing values, either in filters or other functions is common (remember mydata$var2 %>% mean() returns NA unless you add na.rm = TRUE). This is also why we insist that you always plot your data first - outliers will reveal themselves and NA values usually become obvious too.

Another thing we do to stay safe around filters and missing values is saving the results and making sure the number of rows still add up:

subset1 <- mydata %>% 
  filter(var2 == 5)

subset2 <- mydata %>% 
  filter(! var2 == 5)

subset1
## # A tibble: 1 x 5
##      id sex     var1  var2  var3
##   <int> <chr>  <dbl> <dbl> <dbl>
## 1     3 Female     2     5    NA
subset2
## # A tibble: 1 x 5
##      id sex     var1  var2  var3
##   <int> <chr>  <dbl> <dbl> <dbl>
## 1     2 Female     1     4     1

If the numbers are small, you can now quickly look at RStudio’s Environment tab and figure out whether the number of observations (rows) in subset1 and subset2 add up to the whole dataset (mydata). Or use the nrow() function to check the number of rows in each dataset:

Rows in mydata:

nrow(mydata)
## [1] 4

Rows in subset1:

nrow(subset1)
## [1] 1

Rows in subset2:

nrow(subset2)
## [1] 1

Asking R whether adding these two up equals the original size:

nrow(subset1) + nrow(subset2) == nrow(mydata)
## [1] FALSE

As expected, this returns FALSE - because we didn’t add special handling for missing values. Let’s create a third subset only including rows where var3 is NA:

Rows in subset2:

subset3 <- mydata %>% 
  filter(is.na(var2))

nrow(subset1) + nrow(subset2) + nrow(subset3) == nrow(mydata)
## [1] TRUE

  1. In this simple example, mydata %>% filter(! is.na(var2)) could be replaced by a shorthand: mydata %>% drop_na(var2), but it is important to understand how the ! and is.na() work as there will be more complex situations where using these is necessary.↩︎

  2. filter(var2 != 5) is equivalent to filter(! var2 == 5)↩︎