In the previous two parts, I offered a back-of-the-envelope assessment of the possible impact of a restaurant/bar smoking ban on heart attacks. I estimated that the effect, if it exists, would be on the order of 1% of the population’s total, and only that high if we take near-worst-case estimates of the effects of second-hand smoke and assume, without scientific support, that intermittent medium-term exposure accounts for much of the effect. Let us now set aside the latter parts of that sentence and just assume that we are trying to detect a 1% decrease.
Is it possible to detect a 1% change in something? Yes, of course, but only under the right circumstances. If we have a rigid object with a length X, and we are about to do something that we think will change the length to .99*X, then two measurements — one before and one after — would suffice to confirm this, assuming we have an accurate measuring device and are confident in our ability to use it precisely. But, of course, for the heart attack case we are not talking about something that was fixed and constant for all time before the event and again after at its new value. The most important difference is that we are counting up random events that result in a series of realizations of an incidence rate that we are hypothesizing has dropped by 1%.
Is it possible to detect a 1% change in such a probability of a random draw? Yes, of course, but only if you have enough observations and some other conditions hold. Imagine you have a coin that you have altered so that you hypothesize that when you flip it, it lands heads only 49.5% of the time, a 1% decrease from before. Could we flip the coin enough times to detect the change with good confidence? Yes, but the number of heads we would need to observe to be confident of the change is greater than the number of heart attacks in the North Carolina results. What does this mean for those results? It means that even setting aside other complications, pretending that the pattern of heart attacks was as regular and reliable as flipping a single coin, we would still have enough random noise that it would be difficult to detect the signal.
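To put a rough number on that, here is a back-of-the-envelope power calculation for the coin example, detecting a drop from 50.0% to 49.5% heads. The 5% significance level and 80% power are conventional choices I am assuming for illustration, not figures from the study:

```python
# Sketch of the sample size needed to detect a 1% relative drop in a
# probability (0.500 -> 0.495), using the standard two-proportion power
# formula. Significance level and power are my own illustrative choices.
from statistics import NormalDist

p1, p2 = 0.500, 0.495          # heads-probability before and after
alpha, power = 0.05, 0.80      # conventional test settings (assumed)

z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # ~1.96
z_beta = NormalDist().inv_cdf(power)            # ~0.84

# flips needed in EACH of the before and after samples
n_per_group = (z_alpha + z_beta) ** 2 * (p1*(1-p1) + p2*(1-p2)) / (p1 - p2) ** 2
print(round(n_per_group))   # roughly 157,000 flips per group
```

Roughly 157,000 flips in each of the before and after samples, over 300,000 in total — far more events than the analysis in question had to work with.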
In that scenario in which heart attacks are like coin flips, however, it would be extremely unlikely that we would err so much as to estimate an effect 20 times as great as the plausible true maximum. So what happened?
The problem was that the “some other conditions hold” caveat I stated was not met — not by a long shot — and the analysts tried to deal with this by forcing the data into a complicated model. Instead of just looking at before and after collections of a few tens of thousands of coin flips that vary only as a result of the one change, they were trying to deal with a series that showed large changes over time that had nothing to do with the event they were trying to assess. In particular, there was a downward trend in heart attacks over time. So obviously if you just compare before the change with after, the latter will show a reduction. This is exactly what some of the lowest-skilled dishonest analysts have done over the years, effectively claiming that the impact of the downward trend that existed before the smoking ban was due to the ban when it continued afterwards.
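To see how badly the naive before/after comparison fails, here is a toy calculation with invented numbers: monthly counts declining smoothly by 0.5% per month, with no ban effect built in at all:

```python
# Hypothetical monthly heart-attack counts declining ~0.5% per month,
# with NO ban effect anywhere in the data. A naive before/after
# comparison still "finds" a reduction, because it attributes the
# pre-existing trend to the ban. All numbers are invented.
months = range(48)
ban_month = 24
counts = [1000 * (0.995 ** t) for t in months]   # smooth 0.5%/month decline

before = sum(counts[:ban_month]) / ban_month
after = sum(counts[ban_month:]) / (48 - ban_month)
apparent_effect = (before - after) / before
print(f"apparent 'ban effect': {apparent_effect:.1%}")   # ~11% reduction
```

An apparent double-digit “reduction” appears out of nothing but the pre-existing trend.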
More sophisticated dishonest analysts used complicated models to try to trick naive readers (i.e., policy makers, news reporters, and 99.9% of everyone else) into believing that they had accounted for this obvious problem. But before getting to them, what would an honest and sophisticated analyst do? The answer in this North Carolina case is:
Give up without even bothering to try.
As I already noted, merely being able to detect the hypothesized signal in the noise, assuming the ban is the only change between before and after, requires rather more data than was used for this analysis. Using up some of the statistical power to model the downward trend, even if it were a very simple curve whose shape you knew, would leave even less power to detect the impact of the policy shock. So an honest analyst who knew what he was doing would insist on getting a lot more data before doing the analysis. And as it turns out, honest analysts who have gathered such data, for much larger populations with longer periods for estimating the time trend, have found no measurable effects on heart attacks from smoking bans.
So what did the present analysts do? The only approach that would have any hope of working would be to assume that the downward trend was constant (in some sense, such as the percentage reduction from year to year was constant) except for the effect of the ban. But that option did not work for the government employees who were tasked with “proving” that their department’s pet regulation had a big effect. Sadly for them, the time trend clearly flattened out, so the gains after were less than those from before. If the trend had accelerated they might well have claimed it was caused by the ban, but because it decelerated, they were not willing to do that simple analysis, which would blame the reduction in reduction on the ban.
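A toy version of that “constant percentage trend” approach, with invented numbers, shows why the flattening trend was so inconvenient for them. Here the pre-ban decline is 6% per year, the post-ban decline slows to 3% per year, and the ban has no true effect; extrapolating the pre-ban trend and comparing prediction with observation would then blame the ban for an increase:

```python
# Invented numbers: the pre-ban decline is 6%/year; after the (no-effect)
# ban the decline slows to 3%/year. Extrapolating the constant pre-ban
# trend makes the ban look like it INCREASED heart attacks.
pre_rate, post_rate = 0.06, 0.03     # assumed annual declines
baseline = 10000                     # annual heart attacks at the ban date

# two post-ban years, trend flattened, no true ban effect
observed = [baseline * (1 - post_rate) ** y for y in (1, 2)]
# what the constant pre-ban trend predicts for those years
predicted = [baseline * (1 - pre_rate) ** y for y in (1, 2)]

excess = sum(observed) / sum(predicted) - 1
print(f"apparent 'ban effect': {excess:+.1%}")   # a few percent INCREASE
```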
So they set out to use other data to try to explain the time trend. This is not wrong in theory. After all, the time trend is being caused by something — it is not just a magical effect of calendar pages turning. So if we had all the data in the world and knew what to do with it, we would not have to deal with the trend itself since we could predict the rate of heart attacks at any given time, however it was trending, with the other data. But here we run into the problem again of not having nearly enough data, not only not enough observations (events) but not enough other variables to really explain the changes over time. Very sophisticated analysts with lots of data might attempt to explain complicated phenomena like this.
Such sophistication is more common in economics, but there are a few examples in public health, like the attempt to estimate the mortality effects of outdoor air pollution: by collecting daily data on mortality, air pollution, weather, and many other variables from multiple cities over many years, researchers attempted to estimate the effects of the air pollution. Unfortunately, this was extremely difficult because hot weather is strongly correlated with high air pollution days, and the hot weather itself is at least ten times as deadly as the worst-case estimate for the pollution. So the estimate for the effects of pollution is determined by exactly what is estimated to be the effect of weather: make the estimate for that a bit low and it will look like pollution is doing a lot more harm than it really is. The air pollution research is notorious for the researchers making a technical error in their first published estimate and having to revise it down by half. (Added bonus lesson for the day: do not believe the oft-repeated claims about how many people are killed by outdoor air pollution. They are little more than guesses.)
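A small simulation makes the confounding problem concrete. Nothing here comes from the actual air-pollution studies; the numbers are invented so that weather’s true effect is ten times pollution’s, and pollution is correlated with weather. Forcing the weather coefficient just 5% below its true value inflates the apparent pollution effect by roughly 40%:

```python
# Simulated data illustrating the confounding described above.
# True effects: weather 10.0, pollution 1.0, with the two correlated.
# Setting the weather coefficient slightly low (9.5 instead of 10.0)
# pushes the leftover weather effect onto pollution's estimate.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
weather = rng.standard_normal(n)
pollution = 0.8 * weather + 0.6 * rng.standard_normal(n)  # correlated
deaths = 10.0 * weather + 1.0 * pollution + 0.1 * rng.standard_normal(n)

# weather coefficient assumed slightly low (9.5 vs. the true 10.0)
residual = deaths - 9.5 * weather
biased = np.cov(pollution, residual)[0, 1] / np.var(pollution, ddof=1)
print(f"pollution effect, true 1.0, estimated: {biased:.2f}")  # ~1.4
```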
In the case of the smoking ban, it is the time trend that has effects on the order of ten times the hypothesized effect, but the implication is the same: Unless you have a very good estimate of the effect of that other factor, or have enough data to figure it out, there is no way to estimate the effect of interest. So, once again, no honest analyst who knew what he was doing would attempt this.
A dishonest analyst, however, would find that he had all manner of options for getting the result he wanted by using different combinations of the variables he has and employing many different statistical forms. The analysts could experiment with different options and report only one of them, as if it were the only one tried and as if it were somehow the right choice among the many different models. This is the most nefarious type of what I labeled publication bias in situ, and is almost certainly what the smoking ban advocates have done in the various cases where they used complicated analyses to “show” that the effects of the ban are far greater than is plausible.
Finally, we might ask what an honest researcher might do if tempted to just give this a go, even realizing that the chances are that it would not be possible to get a stable estimate (i.e., one that does not change a lot as a result of whims of model details or the random sampling error in the data). One thing that would be required would be to do some tests to see if the estimate was sensitive to reasonable changes in the model or data, and most importantly to report the results of those tests. To their credit, the authors of the NC study actually did a bit of that. You would never know it from reading the political documents surrounding the study, like the press release, but they did one of the more telling tests: They took their model and calculated what it would estimate the effects of the ban were if it had been implemented at a different time. That is, they kept the same data and used the model to estimate the apparent effect of a new regulation that started at a different time from when it really did. The results for the two alternative times they report are a 27% decrease in heart attacks (recall that the touted “result” of the study was a 21% reduction) and a 12% increase. That is, during months when their estimate of the effect of the new ban should have been zero (since it did not happen then), the estimates ranged from bigger than their estimated effect from the actual ban to a substantial effect in the other direction. Put another way, their model is generating noise, and the 21% result is just an accident of when the ban was implemented and the details of their model; had the ban been implemented a month sooner or later, the same data would have “shown” a very different effect, though almost certainly one that was still far larger than was plausible, one way or the other. They could just as easily have gotten any other number within tens of percentage points.
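That kind of placebo-date test can be sketched in a few lines. This is a minimal simulation of my own, not the NC model: monthly counts with a decelerating decline and Poisson noise, no ban effect anywhere, fit with a deliberately misspecified log-linear trend plus a step dummy at each candidate "ban date". Moving the fictitious date around changes the estimated "effect" by many percentage points, even though the true effect is zero everywhere:

```python
# Placebo-date sketch: simulated monthly counts with a decelerating
# decline (quadratic in the log) and Poisson noise, and NO ban effect.
# Fitting a misspecified linear-trend-plus-step model at different fake
# ban dates yields wildly different "effects" of a ban that never was.
import numpy as np

rng = np.random.default_rng(1)
t = np.arange(48, dtype=float)
true_log_rate = np.log(400) - 0.015 * t + 0.0002 * t**2  # flattening decline
counts = rng.poisson(np.exp(true_log_rate))
log_counts = np.log(counts)

estimates = []
for ban_month in range(12, 37, 3):          # candidate placebo dates
    step = (t >= ban_month).astype(float)
    X = np.column_stack([np.ones_like(t), t, step])   # linear trend + step
    coef, *_ = np.linalg.lstsq(X, log_counts, rcond=None)
    estimates.append(coef[2])               # log-scale "ban effect"

spread = max(estimates) - min(estimates)
print([f"{e:+.1%}" for e in estimates], f"spread: {spread:.1%}")
```

Estimates of this "effect" swing across a double-digit range of percentage points depending on nothing but where the fictitious date is placed, which is the same behavior the NC study's own placebo results revealed.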
And maybe they did. That is, maybe they tried models that produced those results and buried them. But I am thinking maybe not. After all, the analysts would not have even reported the results of the little validation analysis if they were trying hard to lie. If they were ASH-UK or Glantz, they would have buried those results or, more likely, never done the test. If I had to guess, I might go with the story that the analysts tried to do an honest job and report that their model could not really find anything, but their political bosses insisted that they report something favorable without caveat. The analysts left in the information that shows the political claim to be a lie because they could get away with that in the text. The “public health” politicos are not smart enough to understand what that means, if they take the time to read it at all. If that is really the story, however, it is not so good — anyone who would allow the politicos to claim that their analysis showed something that it clearly did not, and who stayed in their job and accepted the behavior, shares much of the guilt even if they tried to sneak the truth in.