2.7 Missing values (NAs) and filters
Filtering for missing values (NAs) needs special attention and care.
Remember the small example tibble from Table 2.3 - it has some NAs in columns var2
and var3
:
## # A tibble: 4 x 5
## id sex var1 var2 var3
## <int> <chr> <dbl> <dbl> <dbl>
## 1 1 Male 4 NA 2
## 2 2 Female 1 4 1
## 3 3 Female 2 5 NA
## 4 4 Male 3 NA NA
If we now want to filter for rows where var2
is missing, filter(var2 == NA)
is not the way to do it, it will not work.
Since R is a programming language, it can be a bit stubborn with things like these.
When you ask R to do a comparison using ==
(or <
, >
, etc.) it expects a value on each side, but NA is not a value, it is the lack thereof.
The way to filter for missing values is using the is.na()
function:
## # A tibble: 2 x 5
## id sex var1 var2 var3
## <int> <chr> <dbl> <dbl> <dbl>
## 1 1 Male 4 NA 2
## 2 4 Male 3 NA NA
We send mydata
to the filter and keep rows where var2
is NA
.
Note the double brackets at the end: that’s because the inner one belongs to is.na()
, and the outer one to filter()
.
Missing out a closing bracket is also a common source of errors, and still happens to the best of us.
If filtering for rows where var2
is not missing, we do this8
## # A tibble: 2 x 5
## id sex var1 var2 var3
## <int> <chr> <dbl> <dbl> <dbl>
## 1 2 Female 1 4 1
## 2 3 Female 2 5 NA
In R, the exclamation mark (!) means “not”.
Sometimes you want to drop a specific value (e.g., an outlier) from the dataset like this.
The small example tibble mydata
has 4 rows, with the values for var2
as follows: NA, 4, 5, NA.
We can exclude the row where var2
is equal to 5 by using the “not equals” (!=
)9:
## # A tibble: 1 x 5
## id sex var1 var2 var3
## <int> <chr> <dbl> <dbl> <dbl>
## 1 2 Female 1 4 1
However, you’ll see that by doing this, R drops the rows where var2
is NA as well, as it can’t be sure these missing values were not equal to 5.
If you want to keep the missing values, you need to make use of the OR (|
) operator and the is.na()
function:
## # A tibble: 3 x 5
## id sex var1 var2 var3
## <int> <chr> <dbl> <dbl> <dbl>
## 1 1 Male 4 NA 2
## 2 2 Female 1 4 1
## 3 4 Male 3 NA NA
Being caught out by missing values, either in filters or other functions is common (remember mydata$var2 %>% mean()
returns NA unless you add na.rm = TRUE
).
This is also why we insist that you always plot your data first - outliers will reveal themselves and NA values usually become obvious too.
Another thing we do to stay safe around filters and missing values is saving the results and making sure the number of rows still add up:
## # A tibble: 1 x 5
## id sex var1 var2 var3
## <int> <chr> <dbl> <dbl> <dbl>
## 1 3 Female 2 5 NA
## # A tibble: 1 x 5
## id sex var1 var2 var3
## <int> <chr> <dbl> <dbl> <dbl>
## 1 2 Female 1 4 1
If the numbers are small, you can now quickly look at RStudio’s Environment tab and figure out whether the number of observations (rows) in subset1
and subset2
add up to the whole dataset (mydata
). Or use the nrow()
function to check the number of rows in each dataset:
Rows in mydata
:
## [1] 4
Rows in subset1
:
## [1] 1
Rows in subset2
:
## [1] 1
Asking R whether adding these two up equals the original size:
## [1] FALSE
As expected, this returns FALSE - because we didn’t add special handling for missing values.
Let’s create a third subset only including rows where var3
is NA:
Rows in subset2
:
subset3 <- mydata %>%
filter(is.na(var2))
nrow(subset1) + nrow(subset2) + nrow(subset3) == nrow(mydata)
## [1] TRUE
In this simple example,
mydata %>% filter(! is.na(var2))
could be replaced by a shorthand:mydata %>% drop_na(var2)
, but it is important to understand how the ! andis.na()
work as there will be more complex situations where using these is necessary.↩︎filter(var2 != 5) is equivalent to filter(! var2 == 5)
↩︎