6.3 Get and check the data

# Load packages
library(tidyverse)
library(finalfit)
library(gapminder)

# Create object gapdata from object gapminder
gapdata <- gapminder

It is vital that datasets be carefully inspected when first read (for help reading data into R see Section 2.1). The three functions below provide a clear summary, allowing errors or miscoding to be quickly identified. It is particularly important to ensure that any missing data are identified (see Chapter 11). If you don’t do this you will regret it! Many an analysis has reached a relatively advanced stage before the researcher realised that the dataset was far from complete.

glimpse(gapdata) # each variable as line, variable type, first values

## Rows: 1,704
## Columns: 6
## $ country   <fct> Afghanistan, Afghanistan, Afghanistan, Afghanistan, Afghani…
## $ continent <fct> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia,…
## $ year      <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 1997,…
## $ lifeExp   <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.854, 40.…
## $ pop       <int> 8425333, 9240934, 10267083, 11537966, 13079460, 14880372, 1…
## $ gdpPercap <dbl> 779.4453, 820.8530, 853.1007, 836.1971, 739.9811, 786.1134,…

missing_glimpse(gapdata) # missing data for each variable

##               label var_type    n missing_n missing_percent
## country     country    <fct> 1704         0             0.0
## continent continent    <fct> 1704         0             0.0
## year           year    <int> 1704         0             0.0
## lifeExp     lifeExp    <dbl> 1704         0             0.0
## pop             pop    <int> 1704         0             0.0
## gdpPercap gdpPercap    <dbl> 1704         0             0.0

ff_glimpse(gapdata) # summary statistics for each variable

TABLE 6.1: Gapminder dataset, ff_glimpse: continuous.
label	var_type	n	mean	sd	median
year	<int>	1704	1979.5	17.3	1979.5
lifeExp	<dbl>	1704	59.5	12.9	60.7
pop	<int>	1704	29601212.3	106157896.7	7023595.5
gdpPercap	<dbl>	1704	7215.3	9857.5	3531.8

TABLE 6.2: Gapminder dataset, ff_glimpse: categorical.
label	var_type	n	missing_n	levels_n	levels	levels_count
country	<fct>	1704	0	142
continent	<fct>	1704	0	5	"Africa", "Americas", "Asia", "Europe", "Oceania"	624, 300, 396, 360, 24

As can be seen, there are 6 variables: 4 are continuous and 2 are categorical. The categorical variables are already identified as factors. There are no missing data. Note that by default, the maximum number of factor levels shown is five, which is why the 142 country names are not printed. This can be adjusted using ff_glimpse(gapdata, levels_cut = 142).
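The variable types and factor levels described above can also be confirmed directly with base R. The following is a sketch only, assuming gapdata as created at the start of this section; the helper names n_factor, n_numeric and continent_counts are ours, chosen for illustration.

```r
library(gapminder)

# Recreate the object used in this section
gapdata <- gapminder

# Count categorical (factor) and continuous (integer or double) variables
n_factor  <- sum(vapply(gapdata, is.factor,  logical(1)))
n_numeric <- sum(vapply(gapdata, is.numeric, logical(1)))
c(categorical = n_factor, continuous = n_numeric)

# Number of levels in each factor
nlevels(gapdata$country)
nlevels(gapdata$continent)

# Observations per continent, for comparison with levels_count in Table 6.2
continent_counts <- table(gapdata$continent)
continent_counts
```

The counts should agree with the ff_glimpse() tables above. For a factor with many levels, such as country, levels(gapdata$country) lists the names without needing to change levels_cut.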