## 9.4 Model assumptions

Binary logistic regression is robust to many of the assumptions that can cause problems in other statistical analyses. The main assumptions are:

- Binary dependent variable - this is obvious, but as above we need to check (a three-level outcome such as alive, death from disease, death from other causes doesn’t work);
- Independence of observations - the observations should not be repeated measurements or matched data;
- Linearity of continuous explanatory variables and the log-odds of the outcome - take age as an example. If the outcome, say death, gets steadily more frequent or steadily less frequent as age rises, the model will work well. However, if children and the elderly are at high risk of death but those in the middle years are not, then the relationship is not linear. Or more correctly, it is not monotonic, meaning that the response does not go in only one direction;
- No multicollinearity - explanatory variables should not be highly correlated with each other.
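To make the linearity assumption concrete, here is a minimal sketch in base R. The intercept and slope are hypothetical values chosen only for illustration, not estimates from the melanoma data. It shows that a straight line on the log-odds scale becomes a curved, but still monotonic, line on the probability scale.

```r
# Hypothetical coefficients, invented for illustration only
b0 = -4
b1 = 0.06
age = seq(20, 90, by = 10)

log_odds = b0 + b1 * age   # a straight line in age
prob = plogis(log_odds)    # inverse logit: exp(x) / (1 + exp(x))

data.frame(age, log_odds, prob)

# Equal steps on the log-odds scale ...
diff(log_odds)
# ... become unequal steps on the probability scale, all in the same direction
diff(prob)
```

This is why a plot on the probability scale is judged on whether it only goes one way, rather than on whether it is straight.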

### 9.4.1 Linearity of continuous variables to the response

A graphical check of linearity can be performed using a best fit “loess” line. This is on the probability scale, so it is not going to be straight. But it should be monotonic - it should only ever go up, or only ever go down.

```
library(tidyverse)
melanoma %>%
  mutate(
    # Convert the two-level factor to 0/1 so it can be smoothed
    mort_5yr.num = as.numeric(mort_5yr) - 1
  ) %>%
  select(mort_5yr.num, age, year) %>%
  # Reshape to long format: one row per patient per predictor
  gather(key = "predictors", value = "value", -mort_5yr.num) %>%
  ggplot(aes(x = value, y = mort_5yr.num)) +
  geom_point(size = 0.5, alpha = 0.5) +
  geom_smooth(method = "loess") +
  facet_wrap(~predictors, scales = "free_x")
```

Age is interesting here: the relationship is u-shaped. The chance of death is higher in the young and the old compared with the middle-aged. This will need to be accounted for in any model including age as a predictor.
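One common way to account for a u-shaped relationship is to add a quadratic term for the variable. The text here does not say which approach is used later, so the following is only a sketch on simulated data. The variable names and coefficients are invented, and categorising age or using splines would be reasonable alternatives.

```r
set.seed(1)
n = 2000
age = runif(n, 10, 90)

# Simulated truth: log-odds of death are u-shaped in age, lowest around 50
log_odds = -2 + 0.002 * (age - 50)^2
death = rbinom(n, 1, plogis(log_odds))
sim = data.frame(death, age)

# Age as a linear term only, then with an added quadratic term
fit_linear = glm(death ~ age, data = sim, family = binomial)
fit_quad = glm(death ~ age + I(age^2), data = sim, family = binomial)

# Compare the two fits: lower AIC indicates the better-supported model
AIC(fit_linear, fit_quad)
anova(fit_linear, fit_quad, test = "Chisq")
```

In this simulation the linear-only model misses the pattern almost entirely, because the risk rises on both sides of middle age and the two slopes cancel out.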

### 9.4.2 Multicollinearity

Multicollinearity occurs when two highly correlated explanatory variables are included in a model. If both variables describe the same thing, then their coefficients (ORs) can become unstable, potentially leading to erroneous conclusions. Think about your variables before you start - would any be expected to be highly correlated?
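What "unstable" means can be illustrated with a small simulation. This is a sketch on invented data, not the melanoma dataset: `x2` is constructed as a near-copy of `x1`, and only `x1` truly affects the outcome.

```r
set.seed(42)
n = 500
x1 = rnorm(n)
x2 = x1 + rnorm(n, sd = 0.05)   # near-duplicate of x1
y = rbinom(n, 1, plogis(-0.5 + 1 * x1))
sim = data.frame(y, x1, x2)

fit_one = glm(y ~ x1, data = sim, family = binomial)
fit_both = glm(y ~ x1 + x2, data = sim, family = binomial)

# The two predictors are almost perfectly correlated
cor(sim$x1, sim$x2)

# Compare estimates and standard errors for x1 with and without x2
summary(fit_one)$coefficients[, c("Estimate", "Std. Error")]
summary(fit_both)$coefficients[, c("Estimate", "Std. Error")]
```

With both variables in the model, the standard error for `x1` is many times larger, because the model cannot tell which of the two near-identical variables is responsible for the effect.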

The `ggpairs()` function from `library(GGally)` gives you all the plots you can dream of and more, but it is a lot:

```
library(GGally)
explanatory = c("ulcer.factor", "age", "sex.factor",
                "year", "t_stage.factor")
melanoma %>%
  remove_labels() %>% # ggpairs is older and doesn't like labels
  ggpairs(columns = explanatory)
```

If you have many variables you want to check you can split them up.

**Continuous to continuous**

```
select_explanatory = c("age", "year")
melanoma %>%
  remove_labels() %>%
  ggpairs(columns = select_explanatory)
```

**Continuous to categorical**

Let’s split that up a bit and use a clever `gather()` and `facet_wrap()` combination.
We want to compare everything against, for example, age, so we need to add `-age` to the `gather()` call so it doesn’t get lumped up with everything else.
But because the excluded variable has to be in the third argument of `gather()`, we need to type in `key, value` as placeholders:

```
select_explanatory = c("age", "ulcer.factor",
                       "sex.factor", "t_stage.factor")
melanoma %>%
  select(one_of(select_explanatory)) %>%
  # Stack every variable except age into key/value pairs
  gather(key, value, -age) %>%
  ggplot(aes(value, age)) +
  geom_boxplot() +
  facet_wrap(~key, scales = "free", ncol = 3) +
  coord_flip()
```
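The tidyr documentation marks `gather()` as superseded and recommends `pivot_longer()` for new code. As a sketch of the same reshaping step, assuming a small toy data frame invented here for illustration (character columns are used to keep it simple), the excluded variable is named in `cols` and no placeholders are needed:

```r
library(tidyr)

# Toy data, invented for illustration only
toy = data.frame(
  age = c(34, 51, 67, 45),
  ulcer.factor = c("Absent", "Present", "Present", "Absent"),
  sex.factor = c("Female", "Male", "Male", "Female")
)

# Stack every column except age into key/value pairs
long = pivot_longer(toy, cols = -age, names_to = "key", values_to = "value")
long
```

The result has one row per person per categorical variable, which is the shape `facet_wrap(~key)` needs.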

**Categorical to categorical**

```
select_explanatory = c("ulcer.factor", "sex.factor", "t_stage.factor")
melanoma %>%
  select(one_of(select_explanatory)) %>%
  # Stack every variable except sex.factor into key/value pairs
  gather(key, value, -sex.factor) %>%
  ggplot(aes(value, fill = sex.factor)) +
  geom_bar(position = "fill") +
  ylab("proportion") +
  facet_wrap(~key, scales = "free", ncol = 2) +
  coord_flip()
```

None of the explanatory variables are highly correlated with one another.
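A numerical check can sit alongside the plots. The sketch below assumes the raw data are `boot::melanoma` with its numerically coded columns, and uses `thickness` in place of the derived `t_stage.factor`; neither assumption is confirmed in this section. It prints pairwise correlations and then a variance inflation factor (VIF) for each variable, calculated by hand as 1 / (1 - R²) from regressing that variable on all the others.

```r
# Assumption: raw numeric columns from boot::melanoma, before any
# *.factor variables were created
dat = boot::melanoma[, c("age", "year", "sex", "ulcer", "thickness")]

# Pairwise correlations between the numerically coded variables
round(cor(dat), 2)

# VIF for each variable: regress it on all the others, then 1 / (1 - R^2)
vif_by_hand = sapply(names(dat), function(v) {
  fit = lm(reformulate(setdiff(names(dat), v), response = v), data = dat)
  1 / (1 - summary(fit)$r.squared)
})
round(vif_by_hand, 2)
```

A VIF of 1 means a variable is unrelated to the others. As a rough rule of thumb only, values well above about 5 to 10 are often treated as a warning sign. Multi-level categorical variables need a generalised version of the VIF, which this hand calculation does not cover.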

We are not trying to over-egg this, but multicollinearity can be important. The message, as always, is the same: understand the underlying data using plotting and tables, and you are unlikely to come unstuck.