P-values and confidence intervals

When you present your research, it seems the first question is either “what’s your p-value?” or “is it significant?”. Despite their importance, basic misunderstandings of p-values and statistical significance abound. Even if you’ve had a statistics course, it’s good to take another look at these concepts. It’s also good to reconsider what we are trying to accomplish and whether p-values are the best way to do that.

What is a p-value?

Let’s start with the definition of a p-value: it is the probability of observing a statistic or one more extreme if the null hypothesis is true. It’s worth committing that phrase to memory, exactly as is, because many common rewordings are completely wrong. In other words, we make an assumption about the world (that the null hypothesis is true), then calculate the probability we would observe our statistic or one more extreme, based on that assumption. If that probability is small, we will decide to reject the null hypothesis. These are decisions about how we will act; we never prove that the null hypothesis is true or false, we only decide to accept or reject it.
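
To make that definition concrete, here is a minimal sketch of estimating a p-value by simulation; the numbers and variable names are invented for illustration. We assume the null hypothesis is true, repeatedly generate the statistic under that assumption, and find the proportion of simulated statistics at least as extreme as the one we observed.

# suppose the null hypothesis is that a population mean is 10, and we observed
# a sample mean of 10.4 from 30 measurements; assume a standard deviation of 1
observedMean <- 10.4
nullMean <- 10
n <- 30
numTrials <- 10000
# simulate many sample means under the assumption that the null hypothesis is true
simulatedMeans <- replicate(numTrials, mean(rnorm(n=n, mean=nullMean, sd=1)))
# two-tailed p-value: the proportion of simulated means at least as far from the
# null value as the observed mean
mean(abs(simulatedMeans - nullMean) >= abs(observedMean - nullMean))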

There are several common misconceptions about p-values, and I’ll discuss three of them.

Misconception #1

Most people intuitively expect a p-value to be small if the statistic is very different from the null hypothesis. However, a small p-value does not necessarily mean that the statistic is very different from the null hypothesis, because something else exerts an equally strong control on the p-value: sample size.

A simple demonstration makes this clear. Suppose we measured the lengths of a bivalve species from two different environments, and we are interested in whether the mean lengths are different. Our null hypothesis would be that they are the same, that is, that the difference in means is zero. We would run a t-test on the difference in means; if the p-value is small, we would reject the null hypothesis, and if it is large, we would accept it.

Let’s try this with simulated data. We’ll suppose that we collected 25 bivalves from each environment and that their mean lengths differ only slightly (9.11 vs. 9.14, or an actual difference in means of 0.03).

options(digits=3) # round all results to 3 decimal places
set.seed(1688)    # set the seed value for the random number generator, so the same results can be reproduced
shallowEnvt <- round(rnorm(n=25, mean=9.14, sd=0.5), 1)  # 25 simulated lengths, measured to the nearest 0.1
deepEnvt <- round(rnorm(n=25, mean=9.11, sd=0.5), 1)     # same, but with a mean only 0.03 smaller
t.test(shallowEnvt, deepEnvt)
## 
##  Welch Two Sample t-test
## 
## data:  shallowEnvt and deepEnvt
## t = -1, df = 40, p-value = 0.3
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.463  0.135
## sample estimates:
## mean of x mean of y 
##      9.15      9.31

The output shows that the difference in means is not statistically significant: the p-value of 0.3 is greater than the common 0.05 cutoff.

Suppose we collect more data, say 5000 shells from each environment.

shallowEnvt <- round(rnorm(n=5000, mean=9.14, sd=0.5), 1)  # the same simulation, but with 5000 shells per environment
deepEnvt <- round(rnorm(n=5000, mean=9.11, sd=0.5), 1)
t.test(shallowEnvt, deepEnvt)
## 
##  Welch Two Sample t-test
## 
## data:  shallowEnvt and deepEnvt
## t = 3, df = 10000, p-value = 0.006
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.00782 0.04694
## sample estimates:
## mean of x mean of y 
##      9.14      9.11

Re-running our t-test, we now find that the difference in means is significant: the 0.006 p-value is smaller than the 0.05 cutoff.

The difference in mean lengths for the populations is the same in both cases; it is 0.03 (9.14 - 9.11). That small p-value is not telling us that the means are greatly different, only that with more data, we are now able to reject the null hypothesis of zero difference.

In short, the p-value does not tell us what the difference in means is. All it does is guide us as to whether we should accept or reject the null hypothesis. When sample size was small, we did not have enough data to detect such a small difference in means, but as we collected more data, we could.
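
You can watch this happen by simulating the same true difference in means (0.03) at a range of sample sizes. This is only a sketch, and because the data are random, the exact p-values will differ from run to run, but they will tend to fall as sample size grows.

# p-values for the same true difference in means (0.03) at increasing sample sizes
sampleSizes <- c(25, 100, 500, 2500, 10000)
pValues <- sapply(sampleSizes, function(n) {
  shallow <- rnorm(n=n, mean=9.14, sd=0.5)
  deep <- rnorm(n=n, mean=9.11, sd=0.5)
  t.test(shallow, deep)$p.value
})
data.frame(sampleSize=sampleSizes, pValue=pValues)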

Misconception #2

The second misconception is that a p-value tells you the probability that the null hypothesis is correct. This misconception is extremely widespread, and it arises because people invert the definition of a p-value. Instead of it being (correctly) “the probability of observing a statistic or one more extreme if the null hypothesis is true”, it is flipped to become “the probability the null hypothesis is true given the observed statistic (or one more extreme)”.

There are several ways to understand why you cannot invert the definition, and I’ll describe two.

First, imagine that you’ve censused frogs in an area and have found that 90% of frogs are green. You could then state, “If it is a frog, there’s a 90% probability that it is green”, and you should see that this logical statement has the same form as the definition of a p-value: if X, there’s a probability Y. If we invert our statement, it becomes: “If it is green, there’s a 90% probability that it is a frog”. That’s obviously wrong: if you find something green in nature, it is most likely a plant! This is a logical fallacy called Affirming the Consequent.
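
A few hypothetical counts make the asymmetry obvious; the numbers below are invented purely for illustration.

# invented counts for a survey area: frogs are mostly green, but green things are mostly not frogs
frogs <- 100
greenFrogs <- 90
greenNonFrogs <- 10000   # leaves, grass, insects, and other green things
greenFrogs / frogs                        # P(green | frog) = 0.9
greenFrogs / (greenFrogs + greenNonFrogs) # P(frog | green), just under 0.01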

The second way to understand why we can’t treat the p-value as the probability that the null hypothesis is true is to consider what the null hypothesis describes. The null hypothesis is a statement about how we think things actually are in nature. We don’t usually know whether such a statement is true, so we collect a limited amount of data and try to make an inference. The only way to know for certain whether the null hypothesis is true would be to collect and measure the entire population (e.g., every shell from each environment), which is usually difficult or impossible to do. Regardless of whether you can actually measure every shell, you should realize that the mean of the population exists: it has an actual value that is fixed. Therefore, there is no chance element to whether the null hypothesis is true, and so there is no probability that it is true: it is either true or it is false.

Misconception #3

The third misconception is that p-values test a plausible hypothesis.

Think about the null hypothesis for these bivalves, that the difference in the mean length is zero. I will posit that we all know, or should know, whether that statement is true or false. Furthermore, we know the answer without collecting a single bit of data: this null hypothesis is false.

How can we say this? Imagine that we were able to measure every individual in the two populations precisely (hopefully this isn’t Donax!). If you calculated those two means, they would almost certainly be different. It might be in the fifth or sixth decimal place, but they would not be identical. In other words, this null hypothesis of zero difference in mean length is assuredly false.

For a great many problems, the null hypothesis is called a zero null: zero difference in means, zero correlation, zero slope, etc. For almost all of these problems, the null hypothesis is almost certainly false. Given that, rejecting the null is only a matter of collecting enough data. Collect too small a data set and you’ll accept the null (as we did for our bivalves). Increase your sample size, and at some point you will be able to detect the difference, no matter how small it is. The more data you collect, the smaller the difference you will be able to detect.
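
A power analysis makes the same point from the other direction. For example, power.t.test() from R’s built-in stats package can report roughly how many shells per environment would be needed to have an 80% chance of detecting a true difference of 0.03, assuming the same standard deviation (0.5) used in the simulations above; the answer runs to several thousand shells per environment.

# sample size per group needed to detect a difference in means of 0.03
# with 80% power, assuming sd = 0.5 and the usual 0.05 significance level
power.t.test(delta=0.03, sd=0.5, sig.level=0.05, power=0.8)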

A great many published p-values and statistical tests have little value, as they evaluate hypotheses that are manifestly false. There is a way to avoid such pointless exercises, a way that is also more informative.

An alternative: confidence intervals

Our hypothesis test evaluated only the case of zero difference in means. We also saw that you could detect any difference in means if you collect a large enough sample, but we know that some of those differences in means would not be biologically important (for example, suppose the difference in mean size was 0.001 cm). It would be good if we could test not just the zero-null hypothesis, but all other hypotheses that correspond to biologically irrelevant differences.

There’s an easy way to test the null hypothesis and every other possible hypothesis, and that is with a confidence interval. A confidence interval is the set of acceptable null hypotheses; that is, it defines the range of null hypotheses that you would have to accept.

In many cases, the confidence interval is symmetric about the measured statistic. Where it is, you often see it stated as X ± Y, where X is the statistic, and Y is a distance that defines the confidence limits. Specifically, X - Y would mark one end of the confidence interval, and X + Y would mark the other. Anything at those two confidence limits or between them would be an acceptable null hypothesis.
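
For a difference in means, you can assemble such an interval by hand. The sketch below uses the same Welch approximation that t.test() applies by default; x and y are hypothetical samples, and the t quantile (qt) supplies the multiplier that converts the standard error into the half-width Y.

# 95% confidence interval for a difference in means, built by hand
x <- rnorm(n=25, mean=9.14, sd=0.5)
y <- rnorm(n=25, mean=9.11, sd=0.5)
estimate <- mean(x) - mean(y)                    # the statistic, X
se <- sqrt(var(x)/length(x) + var(y)/length(y))  # standard error of the difference
# Welch-Satterthwaite approximation to the degrees of freedom
df <- se^4 / ((var(x)/length(x))^2/(length(x)-1) + (var(y)/length(y))^2/(length(y)-1))
halfWidth <- qt(0.975, df) * se                  # the distance Y
c(lower=estimate - halfWidth, upper=estimate + halfWidth)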

Given a confidence interval, it is simple to test any hypothesis: if the hypothesis lies within the confidence interval, it is acceptable. If a hypothesis lies outside of the confidence interval, you can reject it.
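
In R, this is easy to check, because t.test() accepts a mu argument giving the hypothesized difference in means, letting you test any hypothesis you like and compare the result with the confidence interval. Using the shallowEnvt and deepEnvt vectors from above (the exact p-values will depend on the simulated data):

# test any hypothesized difference in means with the mu argument; hypotheses inside
# the 95% confidence interval give p > 0.05, hypotheses outside it give p < 0.05
t.test(shallowEnvt, deepEnvt)$conf.int          # the range of acceptable null hypotheses
t.test(shallowEnvt, deepEnvt, mu=0)$p.value     # the conventional zero null
t.test(shallowEnvt, deepEnvt, mu=0.03)$p.value  # the difference used in the simulations
t.test(shallowEnvt, deepEnvt, mu=0.25)$p.value  # a difference that should lie far outside the interval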

This approach of measuring a quantity (called the effect size) and its confidence interval is known as estimation and uncertainty. You are using your statistic (the measured difference in means) as an estimate of the parameter (the true difference in means). The confidence interval provides you with an estimate of your uncertainty: the larger your confidence interval, the greater your uncertainty. What’s useful is that this estimate and its uncertainty can be carried into other calculations, letting you propagate your uncertainty through a series of steps.

All of this is good: confidence intervals let you test hypotheses just like a p-value will, but they let you do much more. You should use confidence intervals in lieu of p-values wherever you can.

Confidence intervals in R

Confidence intervals in R are simple, and most of the time, they are calculated automatically. Going back to our example from before:

shallowEnvt <- round(rnorm(n=25, mean=9.14, sd=0.5), 1)
deepEnvt <- round(rnorm(n=25, mean=9.11, sd=0.5), 1)
t.test(shallowEnvt, deepEnvt)
## 
##  Welch Two Sample t-test
## 
## data:  shallowEnvt and deepEnvt
## t = 1, df = 40, p-value = 0.3
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.140  0.396
## sample estimates:
## mean of x mean of y 
##      9.26      9.13

For the small data set, the estimate of the difference in means (the effect size) is obtained by subtracting the two measured means shown on the last line: 0.13 (9.26 - 9.13). The 95% confidence interval is labeled in the output. You could report this as “Difference of means is 0.13 (95% CI: -0.140 to 0.396)”, or, with a little math, you could report this as “Difference of means is 0.13 ± 0.266 (95% CI)”.

shallowEnvt <- round(rnorm(n=5000, mean=9.14, sd=0.5), 1)
deepEnvt <- round(rnorm(n=5000, mean=9.11, sd=0.5), 1)
t.test(shallowEnvt, deepEnvt)
## 
##  Welch Two Sample t-test
## 
## data:  shallowEnvt and deepEnvt
## t = 2, df = 10000, p-value = 0.02
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.00444 0.04348
## sample estimates:
## mean of x mean of y 
##      9.13      9.11

For the large data set, we have a different, and likely somewhat better, estimate of the difference in means: 0.02 (the difference of the two values on the last line, 9.13 - 9.11). You could report the estimate and the uncertainty as “Difference of means is 0.02 (95% CI: 0.00444 to 0.04348)” or “Difference of means is 0.02 ± 0.0196 (95% CI)”.

Note that the confidence interval is narrower in the second case: confidence intervals shrink as your sample size grows. In other words, your uncertainty decreases as your sample size grows.
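
If you want to carry the estimate and its uncertainty into later calculations instead of retyping them from the printed output, you can save the result of t.test() and extract its components; the name bivalveTest below is arbitrary.

# save the test result and pull out the parts needed for reporting or later calculations
bivalveTest <- t.test(shallowEnvt, deepEnvt)
bivalveTest$conf.int                                # lower and upper 95% confidence limits
bivalveTest$estimate[1] - bivalveTest$estimate[2]   # the difference in means (the effect size)
diff(bivalveTest$conf.int) / 2                      # half-width of the interval, for reporting as a ± value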

Finally, you may have heard of Bayesian credible intervals. For parametric problems like this, credible intervals and confidence intervals typically give similar results and lead you to similar conclusions.

The bottom line is, wherever possible, use confidence intervals or Bayesian credible intervals instead of p-values.

Comments/Questions/Corrections: Steven Holland (stratum@uga.edu)

Peer-review: This document has been peer-reviewed.

Acknowledgments: The author thanks Michal Kowalewski (Florida Museum of Natural History) for his helpful comments.

Note: This tutorial may have been expanded or revised after the peer-review process was completed. The author is responsible for all conceptual and scripting errors in the current version of this document.

Our Sponsors: National Science Foundation (Sedimentary Geology and Paleobiology Program), National Science Foundation (Earth Rates Initiative), Paleontological Society, Society of Vertebrate Paleontology

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.