## 6.5 Plot the data

We will start by comparing life expectancy between the 5 continents of the world in two different years. Always plot your data first. Never skip this step! We are particularly interested in the distribution. There’s that word again. The shape of the data. Is it normal? Is it skewed? Does it differ between regions and years?

There are three useful plots which can help here:

• Histograms: examine shape of data and compare groups;
• Q-Q plots: are data normally distributed?
• Boxplots: identify outliers, compare shape and groups.

### 6.5.1 Histogram

    mydata %>%
      filter(year %in% c(2002, 2007)) %>%
      ggplot(aes(x = lifeExp)) +       # remember aes()
      geom_histogram(bins = 20) +      # histogram with 20 bars
      facet_grid(year ~ continent)     # optional: add scales="free"

What can we see? That life expectancy in Africa is lower than in other regions. That we have little data for Oceania given there are only two countries included, Australia and New Zealand. That Africa and Asia have greater variability in life expectancy by country than in the Americas or Europe. That the data follow a reasonably normal shape, with Africa 2002 a little right skewed.

### 6.5.2 Q-Q plot

Quantile-quantile sounds more complicated than it really is. It is a graphical method for comparing the distribution (think shape) of our own data to a theoretical distribution, such as the normal distribution. In this context, quantiles are just cut points which divide our data into bins each containing the same number of observations. For example, if we have the life expectancy for 100 countries, then quartiles (note the quar-) for life expectancy are the three ages which split the observations into 4 groups each containing 25 countries. A Q-Q plot simply plots the quantiles for our data against the theoretical quantiles for a particular distribution (the default shown below is the normal distribution). If our data follow that distribution (e.g. normal), then our data points fall on the theoretical straight line.
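To make the idea of quantiles concrete, here is a minimal sketch in base R. It uses 100 simulated values rather than the real data (the mean of 70 and standard deviation of 8 are assumptions chosen purely for illustration). It first splits the values into quartile groups, then builds the Q-Q plot coordinates by hand and checks them against base R's `qqnorm()`.

```r
# Simulated values for illustration only (not the gapminder data)
set.seed(1)
life_exp <- rnorm(100, mean = 70, sd = 8)

# Quartiles: three cut points giving four groups of 25 observations
quartiles   <- quantile(life_exp, probs = c(0.25, 0.5, 0.75))
groups      <- cut(life_exp, breaks = c(-Inf, quartiles, Inf))
group_sizes <- as.vector(table(groups))

# Q-Q plot coordinates by hand:
# x = theoretical normal quantiles, y = the sorted data
theoretical <- qnorm(ppoints(length(life_exp)))
observed    <- sort(life_exp)

# Base R's qqnorm() computes the same theoretical quantiles
qq <- qqnorm(life_exp, plot.it = FALSE)
same_as_qqnorm <- isTRUE(all.equal(sort(qq$x), theoretical))

# Draw the plot; qqline() passes through the first and third quartiles
plot(theoretical, observed)
qqline(life_exp, col = "blue")
```

Because the simulated values really are drawn from a normal distribution, the points should sit close to the line. Swapping `rnorm()` for a skewed generator such as `rexp()` is a quick way to see what a departure from the line looks like.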

    mydata %>%
      filter(year %in% c(2002, 2007)) %>%
      ggplot(aes(sample = lifeExp)) +      # Q-Q plot requires 'sample'
      geom_qq() +                          # defaults to normal distribution
      geom_qq_line(colour = "blue") +      # add the theoretical line
      facet_grid(year ~ continent)

What can we see? We are looking to see if the data follow the straight reference line which is included in the plot. These do reasonably, except for Africa which is curved upwards at each end. This is the right skew we could see on the histograms too. If your data do not follow a normal distribution, then you should avoid the t-test or ANOVA and use a non-parametric test instead, as shown in Section 6.10.

We are frequently asked about performing a hypothesis test to check the assumption of normality, such as the Shapiro-Wilk normality test. We do not recommend this, simply because it is often non-significant when the number of observations is small but the data look skewed, and often significant when the number of observations is high but the data look reasonably normal on inspection of plots. It is therefore not useful in practice - common sense should prevail.
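The sample-size problem can be explored with a small simulation. This is only a sketch under stated assumptions: the two distributions (an exponential for "small and skewed", a t-distribution with 10 degrees of freedom for "large and nearly normal"), the sample sizes and the 0.05 threshold are our choices for illustration, and the exact proportions will change with the seed.

```r
set.seed(42)
n_sims <- 200   # number of simulated samples per scenario (assumption)

# Proportion of simulated samples in which Shapiro-Wilk gives p < 0.05
prop_significant <- function(draw_sample) {
  p_values <- replicate(n_sims, shapiro.test(draw_sample())$p.value)
  mean(p_values < 0.05)
}

# Scenario 1: small sample from a clearly skewed distribution
small_skewed <- prop_significant(function() rexp(10))

# Scenario 2: large sample from a distribution close to normal
large_near_normal <- prop_significant(function() rt(5000, df = 10))

c(small_skewed = small_skewed, large_near_normal = large_near_normal)
```

Run it and compare the two proportions. If the test misses many of the small skewed samples while flagging most of the large near-normal ones, that is the behaviour described above, and a reason to rely on plots instead.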

### 6.5.3 Boxplot

Boxplots are our preferred method for comparing a continuous variable such as life expectancy across a categorical explanatory variable. For continuous data, boxplots are a lot more appropriate than bar plots with error bars (also known as dynamite plots). We intentionally do not even show you how to make dynamite plots.

The box represents the median (bold horizontal line in the middle) and interquartile range (where 50% of the data sits). The lines (whiskers) extend to the lowest and highest values that are still within 1.5 times the interquartile range of the edges of the box. Outliers (anything outwith the whiskers) are represented as points.

Thus a boxplot contains information not only on central tendency (median), but also on the variation and the distribution of the data. A skew, for instance, should be obvious.
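The pieces of a boxplot can be worked out by hand, which may help when reading one. The sketch below uses ten made-up values (not taken from the data) and base R's `quantile()` with its default settings. To the best of our understanding this mirrors how `geom_boxplot()` places the box, whiskers and outlier points, but treat that as an assumption rather than a guarantee.

```r
# Ten made-up life expectancies, for illustration only
x <- c(52, 58, 60, 61, 63, 64, 66, 67, 70, 95)

# The box: first quartile, median, third quartile
q <- unname(quantile(x, probs = c(0.25, 0.5, 0.75)))
median_value <- q[2]
iqr <- q[3] - q[1]                 # interquartile range

# Whiskers may reach at most 1.5 x IQR beyond the edges of the box
lower_fence <- q[1] - 1.5 * iqr
upper_fence <- q[3] + 1.5 * iqr

# Whiskers stop at the most extreme values inside the fences
whisker_low  <- min(x[x >= lower_fence])
whisker_high <- max(x[x <= upper_fence])

# Anything beyond the fences is drawn as an individual point
outliers <- x[x < lower_fence | x > upper_fence]
```

Here the box runs from 60.25 to 66.75 with the median at 63.5, the whiskers end at 52 and 70, and the value 95 is plotted as an outlier. The upper whisker is not longer than the lower one only because the extreme value has been separated out as a point.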

    mydata %>%
      filter(year %in% c(2002, 2007)) %>%
      ggplot(aes(x = continent, y = lifeExp)) +
      geom_boxplot() +
      facet_wrap(~ year)

What can we see? The median life expectancy is lower in Africa than in any other continent. The variation in life expectancy is greatest in Africa and smallest in Oceania. The data in Africa look skewed, particularly in 2002 - the lines/whiskers are of unequal length.

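    mydata %>%
      filter(year %in% c(2002, 2007)) %>%
      ggplot(aes(x = continent, y = lifeExp)) + geom_boxplot() + facet_wrap(~ year) +
      ggtitle("Life expectancy by continent in 2002 v 2007") # add title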