6.4 Check the data

It is vital that data is carefully inspected when first read (for help reading data into R, see Section 2.1). The three functions below provide a clear summary, allowing errors or miscoding to be identified quickly. It is particularly important to ensure that any missing data is identified (see Chapter 14). If you don’t do this, you will regret it! All too often, an analysis reaches a relatively advanced stage before the researcher realises that the dataset is incomplete.
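The code that produced the output that follows is not shown here. A minimal sketch that should reproduce it is given below. It assumes the data frame is `gapminder` from the gapminder package, that the first summary comes from `glimpse()` in dplyr, and that the second and third come from `missing_glimpse()` and `ff_glimpse()` in finalfit. Only `ff_glimpse()` is named in the table captions; the other two are inferred from the shape of the output.

```r
library(dplyr)      # glimpse()
library(finalfit)   # missing_glimpse(), ff_glimpse()
library(gapminder)  # assumed source of the gapminder data frame

glimpse(gapminder)          # one line per variable: type and first values
missing_glimpse(gapminder)  # number and percentage missing for each variable
ff_glimpse(gapminder)       # summary statistics, continuous then categorical
```

Running the three calls in this order keeps the quick structural check first and the more detailed summaries after it.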

## Observations: 1,704
## Variables: 6
## $ country   <fct> Afghanistan, Afghanistan, Afghanistan, Afghanistan, Afghani…
## $ continent <fct> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia,…
## $ year      <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 1997,…
## $ lifeExp   <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.854, 40.…
## $ pop       <int> 8425333, 9240934, 10267083, 11537966, 13079460, 14880372, 1…
## $ gdpPercap <dbl> 779.4453, 820.8530, 853.1007, 836.1971, 739.9811, 786.1134,…

##               label var_type    n missing_n missing_percent
## country     country    <fct> 1704         0             0.0
## continent continent    <fct> 1704         0             0.0
## year           year    <int> 1704         0             0.0
## lifeExp     lifeExp    <dbl> 1704         0             0.0
## pop             pop    <int> 1704         0             0.0
## gdpPercap gdpPercap    <dbl> 1704         0             0.0

## Continuous
##               label var_type    n missing_n missing_percent       mean
## year           year    <int> 1704         0             0.0     1979.5
## lifeExp     lifeExp    <dbl> 1704         0             0.0       59.5
## pop             pop    <int> 1704         0             0.0 29601212.3
## gdpPercap gdpPercap    <dbl> 1704         0             0.0     7215.3
##                    sd     min quartile_25    median quartile_75          max
## year             17.3  1952.0      1965.8    1979.5      1993.2       2007.0
## lifeExp          12.9    23.6        48.2      60.7        70.8         82.6
## pop       106157896.7 60011.0   2793664.0 7023595.5  19585221.8 1318683096.0
## gdpPercap      9857.5   241.2      1202.1    3531.8      9325.5     113523.1
## 
## Categorical
##               label var_type    n missing_n missing_percent levels_n
## country     country    <fct> 1704         0             0.0      142
## continent continent    <fct> 1704         0             0.0        5
##                                                      levels
## country                                                   -
## continent "Africa", "Americas", "Asia", "Europe", "Oceania"
##                     levels_count               levels_percent
## country                        -                            -
## continent 624, 300, 396, 360, 24 36.6, 17.6, 23.2, 21.1,  1.4

TABLE 6.1: Gapminder dataset, ff_glimpse: continuous
label      var_type  n     missing_n  mean        sd           median
year       <int>     1704  0          1979.5      17.3         1979.5
lifeExp    <dbl>     1704  0          59.5        12.9         60.7
pop        <int>     1704  0          29601212.3  106157896.7  7023595.5
gdpPercap  <dbl>     1704  0          7215.3      9857.5       3531.8

TABLE 6.1: Gapminder dataset, ff_glimpse: categorical
label      var_type  n     missing_n  levels_n  levels                                             levels_count
country    <fct>     1704  0          142       -                                                  -
continent  <fct>     1704  0          5         "Africa", "Americas", "Asia", "Europe", "Oceania"  624, 300, 396, 360, 24

As can be seen, there are 6 variables: 4 are continuous and 2 are categorical. The categorical variables are already identified as factors. There are no missing data.
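The same conclusions can also be checked directly rather than read off the printed summaries. The sketch below uses only base R and again assumes the data frame is `gapminder` from the gapminder package; it is an illustration, not part of the original analysis.

```r
library(gapminder)  # assumed source of the gapminder data frame

# Count how many variables there are of each class
table(sapply(gapminder, function(x) class(x)[1]))

# Count missing values in each variable
colSums(is.na(gapminder))

# Stop with an error if any value is missing
stopifnot(!any(is.na(gapminder)))
```

A check of this kind placed at the top of an analysis script fails immediately if a later version of the data arrives with missing values.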