6.3 Get and check the data

It is vital that datasets are carefully inspected when first read (for help reading data into R, see 2.1). The three functions below provide a clear summary, allowing errors or miscoding to be quickly identified. It is particularly important to ensure that any missing data are identified (see Chapter 11). If you don’t do this, you will regret it! There are many times when an analysis has reached a relatively advanced stage before the researcher was hit by the realisation that the dataset was far from complete.
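The code that produced the output shown next is not included here. A minimal sketch that should give equivalent summaries is below. It assumes the data come from the `gapminder` package and that the three functions are `glimpse()` from `dplyr` and `missing_glimpse()` and `ff_glimpse()` from `finalfit`. Only `ff_glimpse` is named in this section's table captions; the other two are inferred from the shape of the output. The object name `gapdata` is chosen for illustration.

```r
# Sketch only: package choices and the object name `gapdata` are assumptions.
library(dplyr)      # provides glimpse()
library(gapminder)  # provides the gapminder dataset
library(finalfit)   # provides missing_glimpse() and ff_glimpse()

gapdata <- gapminder

glimpse(gapdata)          # one line per variable: type and first few values
missing_glimpse(gapdata)  # count and percentage of missing values per variable
ff_glimpse(gapdata)       # summary statistics, continuous and categorical separately
```

Running all three immediately after reading the data in makes it less likely that a miscoded variable or an incomplete column goes unnoticed.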

## Observations: 1,704
## Variables: 6
## $ country   <fct> Afghanistan, Afghanistan, Afghanistan, Afghanistan, Afghani…
## $ continent <fct> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia,…
## $ year      <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 1997,…
## $ lifeExp   <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.854, 40.…
## $ pop       <int> 8425333, 9240934, 10267083, 11537966, 13079460, 14880372, 1…
## $ gdpPercap <dbl> 779.4453, 820.8530, 853.1007, 836.1971, 739.9811, 786.1134,…
##               label var_type    n missing_n missing_percent
## country     country    <fct> 1704         0             0.0
## continent continent    <fct> 1704         0             0.0
## year           year    <int> 1704         0             0.0
## lifeExp     lifeExp    <dbl> 1704         0             0.0
## pop             pop    <int> 1704         0             0.0
## gdpPercap gdpPercap    <dbl> 1704         0             0.0

TABLE 6.1: Gapminder dataset, ff_glimpse: continuous

label      var_type  n     missing_n  mean        sd           median
year       <int>     1704  0          1979.5      17.3         1979.5
lifeExp    <dbl>     1704  0          59.5        12.9         60.7
pop        <int>     1704  0          29601212.3  106157896.7  7023595.5
gdpPercap  <dbl>     1704  0          7215.3      9857.5       3531.8

TABLE 6.1: Gapminder dataset, ff_glimpse: categorical

label      var_type  n     missing_n  levels_n  levels                                             levels_count
country    <fct>     1704  0          142
continent  <fct>     1704  0          5         “Africa”, “Americas”, “Asia”, “Europe”, “Oceania”  624, 300, 396, 360, 24

As can be seen, there are 6 variables: 4 are continuous and 2 are categorical. The categorical variables are already identified as factors. There are no missing data.
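These conclusions can also be written as explicit checks, so that the script stops if the data ever change. The following is a sketch using base R functions only; it assumes the data come from the `gapminder` package and are stored in an object called `gapdata`, a name chosen for illustration.

```r
# Sketch only: the object name `gapdata` is an assumption.
library(gapminder)

gapdata <- gapminder

# Which columns are factors (categorical) and which are numeric (continuous)?
is_categorical <- vapply(gapdata, is.factor, logical(1))
is_continuous  <- vapply(gapdata, is.numeric, logical(1))

# Number of missing values in each column
missing_per_column <- colSums(is.na(gapdata))

# Stop with an error if any expectation from the summaries above fails
stopifnot(
  sum(is_categorical) == 2,
  sum(is_continuous) == 4,
  all(missing_per_column == 0)
)
```

Checks like these are cheap to keep at the top of an analysis script and fail loudly rather than silently.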