Someone just asked me the following. Since I have not covered how to interpret such things in a while, I thought I would put it in a post.
Someone tweeted about some segment of Oregon’s youth smoking rate going up.
95% confidence interval for 2008 was 8.0% – 9.3%
95% confidence interval for 2009 was 8.7% – 11.2%
Doesn’t this mean we can’t be 95% sure that the smoking rate actually increased?
First, I will answer a fundamentally different, but similar sounding question that is consistent with the numbers provided: Is the change statistically significant at the .05 level, or equivalently, does the 95% confidence interval for the difference between the two percentages include zero?
A quick answer to that requires only observing that the (unreported) point estimate for 2008 is in the range of 8.6 or 8.7, the middle of the confidence interval (note for other cases if you do this: for a ratio measure, “middle” means the geometric mean, and when the CI pushes up toward a limit of possible values — like 0% in this case — it gets more complicated). If it was 8.7, even if that were perfectly precise with no random sampling error, the difference would not be statistically significant since that point falls within the CI for the 2009 value — that is, the random error for the 2009 number alone is enough to make the difference not statistically significant. Since the point estimate might be a bit below that, it is not quite so clean, but it is still easy to conclude that the difference is not statistically significant because it is so close and there is random error for the 2008 figure.
If you want to do a better job of it, you can back out the missing statistics (the whole thing would be cleaner and easier if they reported the actual data, so you could just compare the sample proportions). After calculating the point estimate, you can calculate the standard error, because the ends of the CI are 1.96*SE away from the point estimate. With those estimates you can use the formula (e.g., here) for the SE of the difference, which gives us the CI for the difference (multiply the SE by 1.96, then add to and subtract from the difference): -0.1 to 2.7 percentage points. That interval includes zero, so the change is not statistically significant at the .05 level.
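For anyone who wants to check that arithmetic, here is a minimal sketch in Python. It assumes the reported intervals are symmetric normal-approximation intervals and that the 2008 and 2009 samples are independent, so the SE of the difference is the square root of the sum of the squared SEs. Neither assumption is confirmed by the original report, and the function names are just illustrative.

```python
import math

Z95 = 1.96  # normal critical value for a 95% interval


def estimate_from_ci(lower, upper, z=Z95):
    """Back out the point estimate and SE from a symmetric CI (assumed)."""
    point = (lower + upper) / 2
    se = (upper - lower) / (2 * z)
    return point, se


def difference_ci(ci_a, ci_b, z=Z95):
    """CI for (b - a), assuming the two estimates are independent."""
    p_a, se_a = estimate_from_ci(*ci_a, z=z)
    p_b, se_b = estimate_from_ci(*ci_b, z=z)
    diff = p_b - p_a
    se_diff = math.sqrt(se_a**2 + se_b**2)
    return diff, diff - z * se_diff, diff + z * se_diff


# Oregon figures from the question, in percentage points
diff, lo, hi = difference_ci((8.0, 9.3), (8.7, 11.2))
print(f"difference: {diff:.1f} points, 95% CI: {lo:.1f} to {hi:.1f}")
# prints: difference: 1.3 points, 95% CI: -0.1 to 2.7
```

If the survey used a complex design or a different interval method, the backed-out SEs would be approximations at best, which is one more reason to treat the exact endpoints loosely.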
But much more interesting than “is the difference statistically significant?” is some variation on the question actually asked: how sure are we that there is an increase? The answer to that is not available from these statistics. You see, frequentist statistics never answer the question “how likely is…?” (If “frequentist” is meaningless jargon to you, suffice to say it includes p-values, confidence intervals, about 99.99% of the statistics about error you see in medicine or public health, and about 100% of those you see in the newspaper.) A 95% confidence interval is defined by an answer to a complicated hypothetical question (you can find it in earlier posts here, or look it up) about what would happen if a particular number (the one at the border of the CI, not the point estimate) were the true value. It does not address what the chances of particular values being true are. Indeed, it is based on an epistemic philosophy that denies the validity of that question.
But the thing is that such a question is what we want the answer to. This is true to such an extent that when you see someone try to translate the frequentist statistics into words, they pretty much always phrase it in terms of the answer we want — i.e., incorrectly. But it should be obvious this is wrong if you just think about it: What if the survey that produced those percentages is known to be of terrible quality? Then it obviously should not make you feel extremely sure of anything, regardless of how low the random sampling error might be (which would happen if it were a large sample, even if the survey was fatally flawed — size matters, but a lot less than other things). Or, what if you had a boatload of other evidence that there was a decrease? Then you might be quite sure that was true, even though this result nudged you in the direction of believing there was an increase.
Drawing conclusions about the probability of a worldly phenomenon requires taking into consideration everything we know. It also calls for Bayesian statistics, the need for which is usually mentioned first, but really this is a technical layer on top of the need to consider everything you know. This has all kinds of annoying features, like the probability existing in your thoughts rather than having any “real” existence. Which is why it is tempting to focus on the much less useful, but well-defined, probabilities that appear in frequentist statistics, which are then misinterpreted.
As for what I believe knowing the little that I learned from the question I got, combined with other knowledge about how the world is: It seems really unlikely that the smoking rate would go up (or down) by 15% in one year. It is mostly the same population, after all, and smoking behavior is highly serially correlated (i.e., what an individual does in 2008 is very predictive of 2009). Thus, I am pretty confident the change is overstated, whatever it really was. Based on this, any government official or other activist trying to make a big deal about this number must not understand statistics, though I would have been 95% sure of that even before I heard what they had to say.
“You see, frequentist statistics never answer the question “how likely is…?”
Surely they do answer the question 'how likely is…[something]', but it's just not the something that people intuitively expect (i.e. probability of an outcome), which is why it gets mangled so much.
In your difference of proportions example, a p value of a test of differences is a measure of 'how likely is it we'd observe the difference in proportions we are observing (or a larger difference) if there genuinely was no difference between the proportions in the underlying population'.
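To put a rough number on that for the Oregon figures, here is a sketch of the corresponding two-sided z-test in Python. It rests on assumptions not stated in the original report: the intervals are symmetric normal-approximation intervals, the two years are independent samples, and the SEs backed out of the CIs are good enough to use in the test.

```python
import math

Z95 = 1.96  # normal critical value assumed to underlie the reported CIs


def se_from_ci(lower, upper, z=Z95):
    """Back out the standard error from a symmetric CI (assumed)."""
    return (upper - lower) / (2 * z)


# Midpoints and backed-out SEs, in percentage points
p_2008, se_2008 = (8.0 + 9.3) / 2, se_from_ci(8.0, 9.3)
p_2009, se_2009 = (8.7 + 11.2) / 2, se_from_ci(8.7, 11.2)

# z statistic for the difference, assuming independent samples
z_stat = (p_2009 - p_2008) / math.sqrt(se_2008**2 + se_2009**2)

# Two-sided p-value from the standard normal: P(|Z| >= |z_stat|)
p_value = math.erfc(abs(z_stat) / math.sqrt(2))
print(f"z = {z_stat:.2f}, two-sided p = {p_value:.3f}")
```

Under those assumptions the p-value comes out at roughly 0.07: a difference at least this large would turn up about 7% of the time if there were truly no change in the underlying population, which is not the same thing as a 7% chance that there was no change.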
I'd agree that there is the whole 'cult of the p value' thing where people don't consider the other issues of measurement quality. But there are still many spheres of data presentation where things like standard error & p values aren't even considered when it'd be appropriate to do so. A frequentist approach has many limits, but even its strengths aren't always (or maybe even often) used.
Thanks for the reply. To clarify, what I meant is that they never answer a probability question about the real world. So, they answer questions of “how likely would…?” (subjunctive mood reflecting the hypothetical nature of the question) about some mathematical construct, as in “if X were true, how likely would Y be?”, but not “is” questions that refer to reality. Even such an apparently mathematical question as “how likely is it that a fair die rolls a 1” can only be answered with the subjective (Bayesian) analysis, and only hypothetical questions about non-real constructs can be answered with the nice clean math.
I definitely agree that ignoring random error in reporting or analysis is bad. But reporting just a p-value, as is often done, can be just as bad in its own way — it is not very informative and to most readers it implies things that are not true. Reporting a CI is much better, so long as readers recognize that it is just a rough reporting of about how much random error there is, and that the exact numbers are meaningless.
But while we are at it, the version of the “how likely is it….” hypothetical (which is better phrased “how likely would it be…”, btw) is incomplete in important ways that are often overlooked. The question also needs to include caveats about there being no measurement error (or else be rephrased to refer to the measured rather than true values) and that selection is unbiased. Furthermore — quite important and always ignored — the question needs to refer to the statistical model: “analyzing the data using only the particular model that was used”.
But if multiple models were tried and only the one the researchers liked best was reported (less likely to have happened for the simple proportion from Oregon, but possible even there), then the “using the particular model” error statistics are misleading. What is needed instead is an incredibly complicated (and never reported) statistic described by “if all of the following models were tried and the one that was the most X was reported”. The error statistics that are reported are based on the “using this particular model” fiction, and are thus incorrect. (I have written several papers about this.)