Sunday seems like a good day for a meta. That is, instead of writing about what people write about health science, I will write about what people write about people writing about science. Got that?

Following the debate about review of research in the New York Times, which I wrote about in Unhealthful News 7, the NYT published an interesting article on debates about statistical methods, including attempts to explain the advantages of likelihood ratios and Bayesian updating compared to the common practice of statistical significance testing. It was good to see someone trying, but I think it was pretty hopeless to try to do all that in an article shorter than many of my blog posts. He (reporter Benedict Carey) gets serious props for not getting things wrong (in contrast to the reporters who are characterized in the amusing commentary on the state of science reporting found in this other NYT piece). However, I am afraid that in order to cover so much without being able to assume his readers had any particular background, he had to simplify things to the point that they lost most meaning to anyone who did not already understand everything he was writing about.

I am curious: For those of you who do not already understand the content, did it really make any sense? Please let me know if you get a chance.

I expect to cover most of the topics touched on in that article during the course of the series. I will cover only two of them today.

The article pointed out limitations of the statistical approach that is based on significance testing. But before we look at statistics that do more (another day), it is worth taking a step back to consider that significance testing has very limited value but is not inherently misleading – it means what it means, which is useful, though it does not mean what most people think it means.

To make that a little more concrete, frequentist statistics (the type of statistics that produce statistical significance and confidence intervals) were created to look at studies where there is random error (which includes any study where there is sampling, like all of experimental or observational epidemiology) and deal with the important question, could the result we saw have just been an artifact of that random error? The answer, when it is put that way, is always “yes” – it is always possible. But we know that the bigger the study, the less likely it is that random error produces results far away from the truth (though study size does not help with non-random error like confounding or systematic measurement error). So how big is big enough?

One way to answer the question is to make sure it is sufficiently unlikely to have been random error that we are comfortable in ignoring that possibility. To come to that conclusion, the frequentist statistics give us the answer to the question, “if there really were no association between the exposure and outcome we studied, is it unlikely that we could have seen an association as large as we saw in the data due to bad luck (i.e., random error) alone?” That question is not quite detailed enough, however, because we have to define “unlikely”. Back in the ancient days of error statistics (early 20th century) someone made the arbitrary choice of the .05 (5%) level because that seemed about the right size for “unlikely” and we have five fingers on each hand so we like that number. So, this means something is statistically significant if we (a) assume the “null” (which in an epidemiology study usually means “there is no association between exposure and outcome”), (b) calculate the distribution of outcomes that would occur given that assumption of the null and the random error process that occurs, (c) add up, from that distribution, the probability that the actual observed association *or something even stronger* would occur, and then (d) find that this probability is less than 2.5% (yes, it is called “the .05 level” but for reasons that are not really relevant we actually look for a probability less than .025).
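To make steps (a) through (d) a bit more concrete, here is a minimal sketch using a coin-flip study rather than an epidemiologic one, since the arithmetic is simpler. The function names and the counts (60 or 61 heads out of 100 flips) are invented for illustration, and the null is assumed to be a fair coin.

```python
from math import comb


def upper_tail_probability(n_flips, n_heads, p_null=0.5):
    """Steps (a)-(c): assume the null (a fair coin), then add up the
    probability of the observed number of heads or anything higher."""
    return sum(
        comb(n_flips, k) * p_null**k * (1 - p_null) ** (n_flips - k)
        for k in range(n_heads, n_flips + 1)
    )


def is_statistically_significant(n_flips, n_heads, one_tail_cutoff=0.025):
    """Step (d): compare that tail probability with the 2.5% cutoff."""
    return upper_tail_probability(n_flips, n_heads) < one_tail_cutoff


# Hypothetical results: 60 heads, then 61 heads, out of 100 flips.
for heads in (60, 61):
    p = upper_tail_probability(100, heads)
    print(heads, round(p, 4), is_statistically_significant(100, heads))
```

With these made-up numbers, 60 heads falls just short of the cutoff and 61 heads clears it, which gives a sense of how sharp (and arbitrary) the line is.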

What a mess, huh? It might be reassuring to know that less than half of epidemiology students can get that definition right on a test, and far fewer can explain it five years after they graduate. I would guess that of all the people who have spoken or typed some variation of the phrase “statistically significant” in the last day less than 1/10th of them could get the actual definition even roughly right.

Here is a better way to think of it: If something is statistically significant, it is less likely that the result was random error than if the result was not statistically significant. Pretty much nothing else is true. The result is not “right”. It is not necessarily likely to be close to right. It is not necessarily even “sure enough”, since the .05 level is completely arbitrary and ignores the specifics of the situation. There are any number of other errors that are not measured at all. And it is still possible that the result was bad luck.

Making things even worse, there are biases in the way results are analyzed and reported. Researchers routinely abuse the statistics so badly (e.g., by running dozens of different statistical models and only reporting the one they like) that the frequentist statistics lose all meaning, but are still reported.
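A rough sense of why running dozens of models destroys the meaning of the test can be had from a few lines of arithmetic. This sketch assumes, unrealistically, that each model is an independent test at the .05 level with no real association present; models fit to the same data are correlated, so treat the numbers as illustrative only. The function name is made up.

```python
def chance_of_spurious_hit(num_models, alpha=0.05):
    """Probability that at least one of num_models independent tests
    comes up "significant" when there is truly nothing to find."""
    return 1 - (1 - alpha) ** num_models


# How the chance of a reportable "finding" grows with the number of tries.
for num_models in (1, 5, 20, 50):
    print(num_models, round(chance_of_spurious_hit(num_models), 3))
```

Under that independence assumption, twenty tries already give roughly a 64% chance of at least one "significant" result from pure noise.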

But the biggest problem is that these statistics do not tell us what we want to know, even ignoring all of these limitations. Most people want to say “statistically significant at the 5% level means there is a 95% chance the result is true” which is so incredibly wrong, in at least three ways, that it does not even get partial credit. But you can see why someone would want to say that. Saying that (a)-(d) bit above, along with the “nothing else is true” bit and caveats from the later paragraphs is not what we want to hear. We want to know what the chance is that something is true.

That is what Carey was trying to explain in his article: There are methods for addressing that, but frequentist statistics (significance testing, p-values, even confidence intervals) do not do it because *they cannot do it*. Carey mentions a coin-flipping experiment. Here is an extreme one, based on the “extraordinary claims require extraordinary evidence” theme that keeps coming up in this series. Imagine you flipped a coin, which you inspected and determined to be a normal coin, 500 times and it came up heads about half the time, but then you turned on some music (I suggest the new Metric iTunes sessions – good stuff) and did it again and the coin came up heads every time. This would be a statistically significant departure from the null assumption that “playing music does not make flipped coins any more likely to come up heads”. (Not only would it be “significant at the .05 level” but it would be significant at the .000…more than a hundred more zeroes…0001 level.) So would you believe playing the new album causes those results? Of course not. The frequentist statistics say that chance alone (really really extreme luck) was vanishingly unlikely to have produced that result, but though we rule out chance, other explanations – like I am asleep and just dreaming this all – are much more plausible than “playing the new Metric album causes coins to land heads all the time”.
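For anyone who wants to check the "more than a hundred zeroes" bit, here is a small sketch of the calculation, assuming a fair coin and 500 independent flips. The variable names are just for illustration.

```python
from math import log10

# Probability that a fair coin comes up heads on all 500 flips.
p_all_heads = 0.5 ** 500

# Number of zeroes after the decimal point before the first nonzero digit.
zeros_before_first_digit = int(-log10(p_all_heads))

print(p_all_heads)
print(zeros_before_first_digit)
```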

So what does a significance test tell us about the probability our result reflects the true probability of flipping heads with music playing? By itself, nothing. Notice in the coin flip story, our explanation depended on our knowledge of how the world really works, the possibility of dreaming, etc., things that the significance test completely ignores because it is not about what is happening in the world. Answering that question requires an entirely different kind of statistics, one the average newspaper reader has never heard of: Bayesian analysis.
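Here is a very rough sketch of how that kind of reasoning could be written down for the coin story. The three explanations and, especially, the prior probabilities are numbers I made up purely for illustration; the only calculated input is the chance of 500 straight heads from a fair coin. The helper name `posterior` is also just illustrative.

```python
def posterior(priors, likelihoods):
    """Bayes' rule: weight each explanation's prior by how well it
    predicts the data, then rescale so the results sum to 1."""
    unnormalized = {h: priors[h] * likelihoods[h] for h in priors}
    total = sum(unnormalized.values())
    return {h: value / total for h, value in unnormalized.items()}


# Made-up prior beliefs before seeing 500 heads in a row.
priors = {
    "chance": 1 - 1e-6,           # fair coin, music irrelevant
    "something_else": 1e-6,       # dreaming, trick coin, miscounting, ...
    "music_causes_heads": 1e-30,  # the extraordinary claim
}

# Probability of 500 straight heads under each explanation.
likelihoods = {
    "chance": 0.5 ** 500,
    "something_else": 1.0,
    "music_causes_heads": 1.0,
}

result = posterior(priors, likelihoods)
for explanation, probability in result.items():
    print(explanation, probability)
```

With these invented priors, chance is effectively ruled out, just as the significance test says, but nearly all of the belief lands on "something else" rather than on the music. Different priors would give different numbers; the point is only that the answer depends on them.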

The main lesson here seems to be that it is not possible to successfully explain statistical significance in 1000 words. Well, that was the main lesson for me. For my readers, I propose this lesson: Statistical significance is useful in epidemiology because it has mostly stopped clinical researchers from doing studies on 5 people and generalizing the quite-likely-random result. But anyone who emphasizes the statistical significance of a result in epidemiology probably does not understand epidemiology. In kindergarten we learn “do not hit people”, and while that remains a good rule of conduct throughout our lives, we learn that it is not sufficient. Researchers and reporters who emphasize statistical significance above all else are like an adult remembering the kindergarten-level rule and thinking “so long as I obey the rule to not hit people then I am being moral”.

More on that later. But the second point, which I will cover (much more briefly), is that the seeking of statistically significant results to report not only overstates the importance of statistical significance, but overstates the true values of the phenomena being reported about.

The NYT article quoted Andrew Gelman, one of the leading Bayesian statisticians:

But if the true effect of what you are measuring is small, then by necessity anything you discover is going to be an overestimate [of that effect].

Did you get that? I believe he meant the point more broadly, but it is a good one about statistical significance, if we treat “got a statistically significant result” as “discovered something”. The principle here can be stated this way: if we gather together the set of “estimated associations between someone having a particular exposure and getting a particular outcome that achieved the level of statistical significance” and consider their quantitative estimates of the association (e.g., smoking increases your risk of heart attack by a factor of 2.5), the results (like the 2.5) will be higher than the true value. Why? Because results that have errors – random or otherwise, perhaps even intentional – that make them further from the null (bigger) are more likely to rise to the level of statistical significance. So the “interesting” results reported in the newspaper are likely to be overstated compared to the average study result. This is not just because the media likes to hype big, scary, or strange results (though that is a big problem in itself) but because an emphasis on statistical significance actually messes up our estimates. Nice, huh?
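A small simulation can show the principle in action. This is only a sketch under simplifying assumptions: every study estimates the same small true effect (0.1 on some arbitrary scale), the only error is random and normally distributed with a standard error of 0.1, and "significant" means the estimate is more than 1.96 standard errors from zero. All of the names and numbers are invented for illustration.

```python
import random
from statistics import mean


def simulate_estimates(true_effect, standard_error, num_studies, seed=0):
    """Each study's estimate is the truth plus random error."""
    rng = random.Random(seed)
    return [rng.gauss(true_effect, standard_error) for _ in range(num_studies)]


def significant_only(estimates, standard_error, z_cutoff=1.96):
    """Keep only the estimates that reach statistical significance."""
    return [e for e in estimates if abs(e) / standard_error > z_cutoff]


TRUE_EFFECT = 0.1      # hypothetical small true effect
STANDARD_ERROR = 0.1   # hypothetical study precision

estimates = simulate_estimates(TRUE_EFFECT, STANDARD_ERROR, 10_000)
significant = significant_only(estimates, STANDARD_ERROR)

print("average of all estimates:        ", round(mean(estimates), 3))
print("average of significant estimates:", round(mean(significant), 3))
print("share of studies significant:    ", len(significant) / len(estimates))
```

The average across all simulated studies sits close to the true value, while the average of only the significant ones is well above it, since in this setup an estimate has to be nearly twice the true effect just to cross the threshold.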

By the same token, if you read journal articles, you may notice authors often gleefully proclaim “ours is the first study to show…”. When I read that, I always mentally append to the end of the sentence, “…and therefore our results are probably wrong.” Why? If no one has ever observed the association before, then the result is probably not representative of the truth, either due to chance or due to something worse. The one statistically significant result someone has gotten is not more likely to be right, it is more likely to be wrong.

Ok, that’s all for today. I hope that by taking about 1/10th of what the NYT article was trying to cover, and writing more about it than the entire length of that article, I have provided a bit of insight to those who did not already understand the material. This has suggested a new goal: I aim to help anyone who follows this series to the end to be better at thinking about uncertainty statistics than most people with degrees in epidemiology, and certainly better than those with degrees in medicine or other clinical health fields. You probably will not be able to do all the calculations an epidemiology student knows how to do, but that is just simple mechanical stuff that a computer can do. Much more useful is to know how to make better sense of it (and if I run out of other material, I will even show you how to do the calculations).

Importantly, if you tell me which bits of today’s post did not make sense, I will be sure to address them better the next time I cover the topic.