## 7.2 Fitting simple models

### 7.2.1 The Question (2)

We are interested in modelling the change in life expectancy for different countries over the past 60 years.

### 7.2.2 Get the data
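
The variable names used below (`lifeExp`, `country`, `continent`, `year`) suggest the data come from the **gapminder** package. A minimal sketch, assuming that is the source:

```
library(tidyverse)
library(gapminder)

mydata = gapminder
```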

### 7.2.3 Check the data

Always check a new dataset, as described in Section 6.4.
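
For example, a quick look at the structure and first few values of every variable (a sketch; `glimpse()` is loaded with the tidyverse):

```
mydata %>% 
  glimpse()
```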

### 7.2.4 Plot the data

Let’s plot the life expectancies in European countries over the past 60 years, focussing on the UK and Turkey.
We can add in simple best fit lines using `geom_smooth()`.

```
mydata %>% 
  filter(continent == "Europe") %>%     # Europe only
  ggplot(aes(x = year, y = lifeExp)) +  # lifeExp~year
  geom_point() +                        # plot points
  facet_wrap(~ country) +               # facet by country
  scale_x_continuous(
    breaks = c(1960, 2000)) +           # adjust x-axis
  geom_smooth(method = "lm")            # add regression lines
```

### 7.2.5 Simple linear regression

As you can see, `ggplot()` is very happy to run and plot linear regression models for us.
While this is convenient for a quick look, we usually want to build, run, and explore these models ourselves.
We can then investigate the intercepts and the slope coefficients (linear increase per year).

First let’s plot two countries to compare, Turkey and United Kingdom:

```
mydata %>% 
  filter(country %in% c("Turkey", "United Kingdom")) %>% 
  ggplot(aes(x = year, y = lifeExp, colour = country)) +
  geom_point()
```

The two non-parallel lines may remind you of what was discussed above (Figure **??**).

First, let’s model the two countries separately.

```
# United Kingdom
fit_uk = mydata %>% 
  filter(country == "United Kingdom") %>% 
  lm(lifeExp ~ year, data = .)

fit_uk %>% 
  summary()
```

```
##
## Call:
## lm(formula = lifeExp ~ year, data = .)
##
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.69767 -0.31962  0.06642  0.36601  0.68165
##
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.942e+02  1.464e+01  -20.10 2.05e-09 ***
## year         1.860e-01  7.394e-03   25.15 2.26e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4421 on 10 degrees of freedom
## Multiple R-squared: 0.9844, Adjusted R-squared: 0.9829
## F-statistic: 632.5 on 1 and 10 DF, p-value: 2.262e-10
```

```
# Turkey
fit_turkey = mydata %>% 
  filter(country == "Turkey") %>% 
  lm(lifeExp ~ year, data = .)

fit_turkey %>% 
  summary()
```

```
##
## Call:
## lm(formula = lifeExp ~ year, data = .)
##
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.4373 -0.3457  0.1653  0.9008  1.1033
##
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -924.58989   37.97715  -24.35 3.12e-10 ***
## year           0.49724    0.01918   25.92 1.68e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.147 on 10 degrees of freedom
## Multiple R-squared: 0.9853, Adjusted R-squared: 0.9839
## F-statistic: 671.8 on 1 and 10 DF, p-value: 1.681e-10
```

*Accessing the coefficients of linear regression*

A simple linear regression model returns two coefficients: the intercept and the slope (the second returned value).
Compare these coefficients to the `summary()` output above to see where these numbers came from.
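
The two outputs below can be produced by accessing the `coefficients` element of each fitted model (one way of doing it; the standard accessor `coef()` would work just as well):

```
fit_uk$coefficients
fit_turkey$coefficients
```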

```
##  (Intercept)         year 
## -294.1965876    0.1859657
```

```
##  (Intercept)         year 
## -924.5898865    0.4972399
```

The slopes make sense: the results of the linear regression say that in the UK, life expectancy increases by 0.186 years every year, whereas in Turkey the change is 0.497 years per year. The negative intercepts, however, are less obvious to think about. In this example, the intercept tells us that life expectancy at year 0 in the United Kingdom (some 2000 years ago) was -294 years. While this is mathematically correct (based on the data we have), it obviously makes no sense in practice. It is important to think about the range over which you can extend your model predictions, and where they just become unrealistic.

To make the intercepts meaningful, we will add in a new column called `year_from1952` and re-run `fit_uk` and `fit_turkey` using `year_from1952` instead of `year`.

```
mydata = mydata %>% 
  mutate(year_from1952 = year - 1952)

fit_uk = mydata %>% 
  filter(country == "United Kingdom") %>% 
  lm(lifeExp ~ year_from1952, data = .)

fit_turkey = mydata %>% 
  filter(country == "Turkey") %>% 
  lm(lifeExp ~ year_from1952, data = .)
```

```
##   (Intercept) year_from1952 
##    68.8085256     0.1859657
```

```
##   (Intercept) year_from1952 
##    46.0223205     0.4972399
```

Now the updated results tell us that in 1952, the life expectancy in the United Kingdom was 69 years. Note that the slopes do not change: there was nothing wrong with the original model and the results were correct; the intercept was just not very meaningful.

*Accessing all model information: tidy() and glance()*

In the `fit_uk` and `fit_turkey` examples above, we were using `fit_uk %>% summary()` to get R to print out a summary of the model.
This summary is not, however, in a rectangular shape, so we cannot easily access the values, put them in a table, or use them in plot labels.

We use the `tidy()` function from `library(broom)` to get the variable(s) and specific values in a nice tibble:
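
The output below can be produced with a call like this (a sketch, assuming `library(broom)` has been loaded):

```
fit_uk %>% 
  tidy()
```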

```
## # A tibble: 2 x 5
##   term          estimate std.error statistic  p.value
##   <chr>            <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)    68.8      0.240      287.   6.58e-21
## 2 year_from1952   0.186    0.00739     25.1  2.26e-10
```

In the `tidy()` output, the column `estimate` includes both the intercepts and slopes.

And we use the `glance()` function to get overall model statistics (mostly the r.squared).
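
A sketch of the corresponding call (again assuming **broom** is loaded):

```
fit_uk %>% 
  glance()
```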

```
## # A tibble: 1 x 11
##   r.squared adj.r.squared sigma statistic  p.value    df logLik   AIC   BIC
##       <dbl>         <dbl> <dbl>     <dbl>    <dbl> <int>  <dbl> <dbl> <dbl>
## 1     0.984         0.983 0.442      633. 2.26e-10     2  -6.14  18.3  19.7
## # … with 2 more variables: deviance <dbl>, df.residual <int>
```

### 7.2.6 Multivariable linear regression

Multivariable linear regression includes more than one explanatory variable. There are a few ways to include more variables, depending on whether they should share the intercept and how they interact:

Simple linear regression (exactly one predictor variable):

`myfit = lm(lifeExp ~ year, data = mydata)`

Multivariable linear regression (additive):

`myfit = lm(lifeExp ~ year + country, data = mydata)`

Multivariable linear regression (interaction):

`myfit = lm(lifeExp ~ year * country, data = mydata)`

This is equivalent to:
`myfit = lm(lifeExp ~ year + country + year:country, data = mydata)`

These examples of multivariable regression include two variables, `year` and `country`, but we could include more by adding them with `+`; it does not have to be just two.

We will now create three different linear regression models to further illustrate the difference between a simple model, additive model, and multiplicative model.

*Model 1: year only*

```
# UK and Turkey dataset
mydata_UK_T = mydata %>% 
  filter(country %in% c("Turkey", "United Kingdom"))

fit_both1 = mydata_UK_T %>% 
  lm(lifeExp ~ year_from1952, data = .)
fit_both1
```

```
##
## Call:
## lm(formula = lifeExp ~ year_from1952, data = .)
##
## Coefficients:
##   (Intercept)  year_from1952  
##       57.4154         0.3416
```

```
mydata_UK_T %>% 
  mutate(pred_lifeExp = predict(fit_both1)) %>% 
  ggplot() + 
  geom_point(aes(x = year, y = lifeExp, colour = country)) +
  geom_line(aes(x = year, y = pred_lifeExp))
```

By fitting to year only (`lifeExp ~ year_from1952`), the model ignores country.
This gives us a fitted line which is the average of life expectancy in the UK and Turkey.
This may be desirable, depending on the question.
But here we want to best describe the data.

**How we made the plot and what does predict() do?**
Previously, we were using `geom_smooth(method = "lm")` to conveniently add linear regression lines to a scatterplot.
When a scatterplot includes a categorical variable (e.g., the points are coloured by a variable), the regression lines `geom_smooth()` draws are multiplicative.
That is great, and almost always exactly what we want.
Here, however, to illustrate the difference between the different models, we have to use the `predict()` function and `geom_line()` to have full control over the plotted regression lines.

```
mydata_UK_T %>% 
  mutate(pred_lifeExp = predict(fit_both1)) %>% 
  select(country, year, lifeExp, pred_lifeExp) %>% 
  group_by(country) %>% 
  slice(1, 6, 12)
```

```
## # A tibble: 6 x 4
## # Groups:   country [2]
##   country         year lifeExp pred_lifeExp
##   <fct>          <int>   <dbl>        <dbl>
## 1 Turkey          1952    43.6         57.4
## 2 Turkey          1977    59.5         66.0
## 3 Turkey          2007    71.8         76.2
## 4 United Kingdom  1952    69.2         57.4
## 5 United Kingdom  1977    72.8         66.0
## 6 United Kingdom  2007    79.4         76.2
```

Note how the `slice()` function recognises `group_by()` and, in this case, shows us the 1st, 6th, and 12th observation within each group.

*Model 2: year + country*
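
The additive model output shown below can be fitted following the same pattern as `fit_both1` (a sketch; the name `fit_both2` is the object used in the plotting code that follows):

```
fit_both2 = mydata_UK_T %>% 
  lm(lifeExp ~ year_from1952 + country, data = .)
fit_both2
```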

```
##
## Call:
## lm(formula = lifeExp ~ year_from1952 + country, data = .)
##
## Coefficients:
##           (Intercept)          year_from1952  countryUnited Kingdom  
##               50.3023                 0.3416                14.2262
```

```
mydata_UK_T %>% 
  mutate(pred_lifeExp = predict(fit_both2)) %>% 
  ggplot() + 
  geom_point(aes(x = year, y = lifeExp, colour = country)) +
  geom_line(aes(x = year, y = pred_lifeExp, colour = country))
```

This is better: by including country in the model, we now have fitted lines that represent the data more closely. However, the lines are constrained to be parallel. This is the additive model that was discussed above. We need to include an interaction term to allow the effect of year on life expectancy to vary by country in a non-additive manner.

*Model 3: year * country*
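
The interaction model output shown below can be fitted similarly (a sketch; `fit_both3` is the name referenced later in this section):

```
fit_both3 = mydata_UK_T %>% 
  lm(lifeExp ~ year_from1952 * country, data = .)
fit_both3
```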

```
##
## Call:
## lm(formula = lifeExp ~ year_from1952 * country, data = .)
##
## Coefficients:
##                         (Intercept)                        year_from1952  
##                             46.0223                               0.4972  
##               countryUnited Kingdom  year_from1952:countryUnited Kingdom  
##                             22.7862                              -0.3113
```

```
mydata_UK_T %>% 
  mutate(pred_lifeExp = predict(fit_both3)) %>% 
  ggplot() + 
  geom_point(aes(x = year, y = lifeExp, colour = country)) +
  geom_line(aes(x = year, y = pred_lifeExp, colour = country))
```

This fits the data much better than the previous two models.
You can check the R-squared using `summary(fit_both3)`.

**Advanced tip:** we can apply a function on multiple objects at once by putting them in a `list()` and using a `map_()` function from the **purrr** package. `library(purrr)` gets installed and loaded with `library(tidyverse)`, but it is outwith the scope of this book. Do look it up once you get a little more comfortable with using R, and realise that you are starting to do similar things over and over again. For example, this code:

```
mod_stats1 = glance(fit_both1)
mod_stats2 = glance(fit_both2)
mod_stats3 = glance(fit_both3)
bind_rows(mod_stats1, mod_stats2, mod_stats3)
```

returns the exact same thing as:
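
The shorter purrr version might look like this (a sketch: a `list()` of the three fits passed to `map_df()` with `glance`, matching the description below):

```
list(fit_both1, fit_both2, fit_both3) %>% 
  map_df(glance)
```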

```
## # A tibble: 3 x 11
##   r.squared adj.r.squared sigma statistic  p.value    df logLik   AIC   BIC
##       <dbl>         <dbl> <dbl>     <dbl>    <dbl> <int>  <dbl> <dbl> <dbl>
## 1     0.373         0.344 7.98       13.1 1.53e- 3     2  -82.9  172.  175.
## 2     0.916         0.908 2.99      114.  5.18e-12     3  -58.8  126.  130.
## 3     0.993         0.992 0.869     980.  7.30e-22     4  -28.5  67.0  72.9
## # … with 2 more variables: deviance <dbl>, df.residual <int>
```

What happens here is that `map_df()` applies a function to each object in the list it gets passed, and returns a df (data frame). In this case, the function is `glance()` (note that once inside `map_df()`, `glance` does not have its own brackets).

### 7.2.7 Check assumptions

The assumptions of linear regression can be checked with diagnostic plots, either by passing the fitted object (`lm()` output) to base R `plot()`, or by using the more convenient function below.
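
For example, the base R route looks like this (a sketch using one of the models fitted above; `plot()` on an `lm` object draws four diagnostic plots):

```
par(mfrow = c(2, 2))  # arrange the four diagnostic plots in a 2x2 grid
plot(fit_uk)
```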