9.5 Fitting logistic regression models in base R

The glm() stands for generalised linear model and is the standard base R approach to logistic regression.

The glm() function has several options and many different types of model can be run. For instance, ‘Poisson regression’ for count data.

To run binary logistic regression use family = binomial. This defaults to family = binomial(link = 'logit'). Other link functions exists, such as the probit function, but this makes little difference to final conclusions.

Let’s start with a simple univariable model using the classical R approach.

## 
## Call:
## glm(formula = mort_5yr ~ ulcer.factor, family = binomial, data = melanoma)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.9925  -0.9925  -0.4265  -0.4265   2.2101  
## 
## Coefficients:
##                     Estimate Std. Error z value Pr(>|z|)    
## (Intercept)          -2.3514     0.3309  -7.105 1.20e-12 ***
## ulcer.factorPresent   1.8994     0.3953   4.805 1.55e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 215.78  on 204  degrees of freedom
## Residual deviance: 188.24  on 203  degrees of freedom
## AIC: 192.24
## 
## Number of Fisher Scoring iterations: 5

This is the standard R output which you should become familiar with. It what is included in the previous figures. The estimates of the coefficients (slopes) in this output are on the log-odds scale and always will be.

Easier approaches for doing this in practice are shown below, but for completeness here we will show how to extract the results. str() shows all the information included in the model object, which is useful for experts but a bit off-putting if you are starting out.

The coefficients and their 95% confidence intervals can be extracted and exponentiated like this.

##         (Intercept) ulcer.factorPresent 
##           0.0952381           6.6818182
## Waiting for profiling to be done...
##                          2.5 %     97.5 %
## (Intercept)         0.04662675  0.1730265
## ulcer.factorPresent 3.18089978 15.1827225

Note that the 95% confidence interval is between the 2.5% and 97.5% quantiles of the distribution, hence why the results appear in this way.

A good alternative is the tidy() function from the broom package.

## # A tibble: 2 x 7
##   term                estimate std.error statistic  p.value conf.low conf.high
##   <chr>                  <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>
## 1 (Intercept)           0.0952     0.331     -7.11 1.20e-12   0.0466     0.173
## 2 ulcer.factorPresent   6.68       0.395      4.80 1.55e- 6   3.18      15.2

We can see from these results that there is a strong association between tumour ulceration and 5-year mortality (OR 6.68, 95%CI 3.18, 15.18).

Model metrics can be extracted using the glance() function.

## # A tibble: 1 x 7
##   null.deviance df.null logLik   AIC   BIC deviance df.residual
##           <dbl>   <int>  <dbl> <dbl> <dbl>    <dbl>       <int>
## 1          216.     204  -94.1  192.  199.     188.         203