9.5 Fitting logistic regression models in base R
The glm()
stands for generalised linear model
and is the standard base R approach to logistic regression.
The glm()
function has several options and many different types of model can be run.
For instance, ‘Poisson regression’ for count data.
To run binary logistic regression use family = binomial
.
This defaults to family = binomial(link = 'logit')
.
Other link functions exist, such as the probit
function, but this makes little difference to final conclusions.
Let’s start with a simple univariable model using the classical R approach.
##
## Call:
## glm(formula = mort_5yr ~ ulcer.factor, family = binomial, data = melanoma)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.9925 -0.9925 -0.4265 -0.4265 2.2101
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.3514 0.3309 -7.105 1.20e-12 ***
## ulcer.factorPresent 1.8994 0.3953 4.805 1.55e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 215.78 on 204 degrees of freedom
## Residual deviance: 188.24 on 203 degrees of freedom
## AIC: 192.24
##
## Number of Fisher Scoring iterations: 5
This is the standard R output which you should become familiar with. It is included in the previous figures. The estimates of the coefficients (slopes) in this output are on the log-odds scale and always will be.
Easier approaches for doing this in practice are shown below, but for completeness here we will show how to extract the results.
str()
shows all the information included in the model object, which is useful for experts but a bit off-putting if you are starting out.
The coefficients and their 95% confidence intervals can be extracted and exponentiated like this.
## (Intercept) ulcer.factorPresent
## 0.0952381 6.6818182
## Waiting for profiling to be done...
## 2.5 % 97.5 %
## (Intercept) 0.04662675 0.1730265
## ulcer.factorPresent 3.18089978 15.1827225
Note that the 95% confidence interval is between the 2.5% and 97.5% quantiles of the distribution, hence why the results appear in this way.
A good alternative is the tidy()
function from the broom package.
## # A tibble: 2 x 7
## term estimate std.error statistic p.value conf.low conf.high
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 0.0952 0.331 -7.11 1.20e-12 0.0466 0.173
## 2 ulcer.factorPresent 6.68 0.395 4.80 1.55e- 6 3.18 15.2
We can see from these results that there is a strong association between tumour ulceration and 5-year mortality (OR 6.68, 95%CI 3.18, 15.18).
Model metrics can be extracted using the glance()
function.
## # A tibble: 1 x 8
## null.deviance df.null logLik AIC BIC deviance df.residual nobs
## <dbl> <int> <dbl> <dbl> <dbl> <dbl> <int> <int>
## 1 216. 204 -94.1 192. 199. 188. 203 205