6.3 Get and check the data
# Load packages
library(tidyverse)
library(finalfit)
library(gapminder)
# Create object gapdata from object gapminder
gapdata <- gapminder
It is vital that datasets be carefully inspected when first read (for help reading data into R see Section 2.1). The three functions below provide a clear summary, allowing errors or miscoding to be quickly identified. It is particularly important to ensure that any missing data are identified (see Chapter 11). If you don’t do this you will regret it! There are many times when an analysis has reached a relatively advanced stage before the researcher was hit by the realisation that the dataset was far from complete.
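The calls that produce the summaries in this section are not shown in the text as it stands. Judging by the format of the output, they are most likely glimpse() from dplyr, and missing_glimpse() and ff_glimpse() from finalfit. A minimal sketch under that assumption:

```r
library(dplyr)      # glimpse()
library(finalfit)   # missing_glimpse(), ff_glimpse()
library(gapminder)  # the gapminder dataset

# Create object gapdata from object gapminder
gapdata <- gapminder

glimpse(gapdata)          # one line per variable: type and first few values
missing_glimpse(gapdata)  # number and percentage of missing values per variable
ff_glimpse(gapdata)       # summary statistics, continuous and categorical variables separately
```

Run in this order, the calls go from a quick look at the structure of the data, to a check for missing values, to a fuller summary of each variable.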
## Rows: 1,704
## Columns: 6
## $ country <fct> Afghanistan, Afghanistan, Afghanistan, Afghanistan, Afghani…
## $ continent <fct> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia,…
## $ year <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 1997,…
## $ lifeExp <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.854, 40.…
## $ pop <int> 8425333, 9240934, 10267083, 11537966, 13079460, 14880372, 1…
## $ gdpPercap <dbl> 779.4453, 820.8530, 853.1007, 836.1971, 739.9811, 786.1134,…
## label var_type n missing_n missing_percent
## country country <fct> 1704 0 0.0
## continent continent <fct> 1704 0 0.0
## year year <int> 1704 0 0.0
## lifeExp lifeExp <dbl> 1704 0 0.0
## pop pop <int> 1704 0 0.0
## gdpPercap gdpPercap <dbl> 1704 0 0.0
label | var_type | n | missing_n | mean | sd | median |
---|---|---|---|---|---|---|
year | <int> | 1704 | 0 | 1979.5 | 17.3 | 1979.5 |
lifeExp | <dbl> | 1704 | 0 | 59.5 | 12.9 | 60.7 |
pop | <int> | 1704 | 0 | 29601212.3 | 106157896.7 | 7023595.5 |
gdpPercap | <dbl> | 1704 | 0 | 7215.3 | 9857.5 | 3531.8 |
label | var_type | n | missing_n | levels_n | levels | levels_count |
---|---|---|---|---|---|---|
country | <fct> | 1704 | 0 | 142 | | |
continent | <fct> | 1704 | 0 | 5 | "Africa", "Americas", "Asia", "Europe", "Oceania" | 624, 300, 396, 360, 24 |
As can be seen, there are 6 variables: 4 are continuous and 2 are categorical. The categorical variables are already identified as factors. There are no missing data. Note that by default, the maximum number of factor levels shown is five, which is why the 142 country names are not printed. This can be adjusted using ff_glimpse(gapdata, levels_cut = 142).
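These observations can also be cross-checked with a few base R calls, which is useful when the finalfit summaries are not to hand. This is a sketch only; the object names n_factor, n_numeric and missing_per_variable are introduced here for illustration and are not part of the original text.

```r
library(gapminder)

# Create object gapdata from object gapminder
gapdata <- gapminder

# Count categorical (factor) and continuous (numeric) variables
n_factor  <- sum(sapply(gapdata, is.factor))
n_numeric <- sum(sapply(gapdata, is.numeric))
c(categorical = n_factor, continuous = n_numeric)

# Number of missing values in each variable
missing_per_variable <- colSums(is.na(gapdata))
missing_per_variable

# Number of levels of country, and number of observations per continent
nlevels(gapdata$country)
table(gapdata$continent)
```

If any of these results disagree with the summary tables, that is a sign the data were not read in as expected.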