## 6.8 Multiple testing

### 6.8.1 Pairwise testing and multiple comparisons

When the F-test is significant, we will often want to determine where the differences lie.
This should of course be obvious from the boxplot you have made.
However, some are fixated on the *p*-value!
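The output below can be produced with base R's `pairwise.t.test()`. A minimal sketch follows; the `aov_data` object stands in for the dataset used in this chapter, so we simulate a small version here to keep the call self-contained:

```r
# Pairwise t-tests with pooled SD and a Bonferroni correction.
# `aov_data` is a small simulated stand-in for the chapter's dataset,
# with the same `lifeExp` and `continent` columns.
set.seed(1)
aov_data <- data.frame(
  lifeExp   = c(rnorm(20, 72), rnorm(20, 70), rnorm(20, 76)),
  continent = rep(c("Americas", "Asia", "Europe"), each = 20)
)

pairwise.t.test(aov_data$lifeExp, aov_data$continent,
                p.adjust.method = "bonferroni")
```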

```
##
## Pairwise comparisons using t tests with pooled SD
##
## data: aov_data$lifeExp and aov_data$continent
##
## Americas Asia
## Asia 0.180 -
## Europe 0.031 1.9e-05
##
## P value adjustment method: bonferroni
```

A matrix of pairwise *p*-values can be produced using R's `pairwise.t.test()` function, as above.
Here we can see that there is good evidence of a difference in means between Europe and Asia, and between Europe and the Americas.

We have to keep in mind that a significance threshold of 0.05 means we accept a 5% chance of finding a difference in our sample which doesn't exist in the overall population.

Therefore, the more statistical tests performed, the greater the chance of a false positive result. This is known as a type I error: finding a difference when no difference exists.

There are three approaches to dealing with situations where multiple statistical tests are performed. The first is not to perform any correction at all. Some advocate that the best approach is simply to present the results of all the tests that were performed, and let sceptical readers make adjustments for themselves. This is attractive, but presupposes a sophisticated readership who will take the time to consider the results in their entirety.

The second and classical approach is to control for the so-called family-wise error rate.
The “Bonferroni” correction is the most famous and most conservative, where the threshold for significance is lowered in proportion to the number of comparisons made.
For example, if three comparisons are made, the threshold for significance should be lowered to 0.05/3 ≈ 0.017.
Equivalently, all *p*-values should be multiplied by the number of tests performed (in this case 3).
The adjusted values can then be compared to a threshold of 0.05, as is the case above.
The Bonferroni method is particularly conservative, meaning that type II errors (failing to identify true differences, or false negatives) may occur in favour of minimising type I errors (false positives).
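This multiply-and-cap behaviour can be checked directly with base R's `p.adjust()`; the raw *p*-values below are made up purely for illustration:

```r
# Three hypothetical raw p-values from three comparisons.
raw_p <- c(0.020, 0.040, 0.001)

# Bonferroni: each p-value is multiplied by the number of tests (capped at 1).
p.adjust(raw_p, method = "bonferroni")
# 0.060 0.120 0.003

# The same result by hand:
pmin(raw_p * length(raw_p), 1)
```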

The third approach controls for something called the false-discovery rate.
The development of these methods has been driven in part by the needs of areas of science where many different statistical tests are performed at the same time, for instance, examining the influence of 1000 genes simultaneously.
In these hypothesis-generating settings, a higher tolerance to type I errors may be preferable to missing potential findings through type II errors.
You can see in our example that the *p*-values are lower with the `fdr` correction than with the `bonferroni` correction.
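The output below comes from the same `pairwise.t.test()` call with a different adjustment method. As before, this is a sketch with a small simulated stand-in for `aov_data` so it runs on its own:

```r
# Same pairwise comparisons, now with the Benjamini-Hochberg
# false-discovery-rate adjustment instead of Bonferroni.
set.seed(1)
aov_data <- data.frame(
  lifeExp   = c(rnorm(20, 72), rnorm(20, 70), rnorm(20, 76)),
  continent = rep(c("Americas", "Asia", "Europe"), each = 20)
)

pairwise.t.test(aov_data$lifeExp, aov_data$continent,
                p.adjust.method = "fdr")
```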

```
##
## Pairwise comparisons using t tests with pooled SD
##
## data: aov_data$lifeExp and aov_data$continent
##
## Americas Asia
## Asia 0.060 -
## Europe 0.016 1.9e-05
##
## P value adjustment method: fdr
```

Try not to get too hung up on this.
Be sensible.
Plot the data and look for differences.
Focus on effect size.
For instance, what is the actual difference in life expectancy in years, rather than just the *p*-value of a comparison test?
Choose a method which fits with your overall aims.
If you are generating hypotheses which you will proceed to test with other methods, the `fdr` approach may be preferable.
If you are trying to capture robust effects and want to minimise type II errors, use a family-wise approach.

If your head is spinning at this point, don’t worry. The rest of the book will continuously revisit these and other similar concepts, e.g., “know your data”, “be sensible, look at the effect size”, using several different examples and datasets. So do not feel like you should be able to understand everything immediately. Furthermore, these things are easier to conceptualise when using your own dataset - especially if that’s something you’ve put your blood, sweat and tears into collecting.