A few days ago I made a comment about clinical trial stopping rules being based on dubious ethics, as practiced. In part 1 and part 2, I made the following points: clinical trials almost always assign some people to an inferior treatment, an action which serves the greater good by giving better information for many future decisions; hurting people for the greater good is not necessarily unethical, but pretending you are not doing it seems indefensible; people who do clinical trials and act as “medical ethicists” for them usually deny that they are hurting some subjects, though this is clearly false; clinical trials are an example of what are known as “bandit problems” (a reference to playing slot machines), characterized by a tradeoff between gathering more information and making the best use of the information you have; there is a well-worked mathematics for optimizing that tradeoff.
Today I will conclude the analysis, starting by expanding on that last point. The optimization depends on every variable in sight, notably including how many more times you are going to “play” (i.e., make the decision in question), as well as more subtle points like your prior beliefs and whether you only care about average payoff or might care about other details of the distribution of payoffs (e.g., you might want to take an option whose outcomes vary less, even though it is somewhat inferior on average). Obviously the decision predominantly depends on how much evidence you have to support the claim that one option is better than the other, on average, and how different the apparent payoffs are, but the key is that that is not all it depends on.
I hinted at some of this yesterday, pointing out that you would obviously choose to stop focusing on gathering information, switching over to making the apparently best choice all the time, at different times when you were expecting to play five or a thousand times. Obviously the value of information varies greatly, with the value of being more certain of the best choice increasing with the number of future plays. On the more subtle points, if you are pretty sure at the start that option X is better, but the data you collect is favoring option Y a bit, you would want to gather a bit more data before abandoning your old belief, as compared to demanding a bit less if the data was tending to support X after all. And if the payoffs are complicated, rather than being simply “win P% of the time, lose (100-P)% of the time”, with varying outcomes, maybe even some good and some bad, then more data gathering will be optimal. This is the case even if you are just concerned with the average payoff, but even more so if people might have varying preferences about those outcomes, such as worrying more about one disease than another (I have to leave the slot machine metaphor to make that point).
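The dependence on the number of future plays can be illustrated with a deliberately crude sketch. It is not the full bandit optimization, just an "explore, then commit" rule: play each of two slot machines n times, then stick with whichever looked better for the rest of the plays. The win probabilities (0.5 and 0.6) and all function names here are assumptions chosen purely for illustration.

```python
from math import comb


def binom_pmf(n, p):
    """Probabilities of 0..n wins in n independent plays with win chance p."""
    return [comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]


def prob_pick_inferior(n, p_inferior, p_superior):
    """Chance that the inferior machine looks at least as good after n plays
    of each machine. Ties are broken by a coin flip."""
    if n == 0:
        return 0.5  # no data at all, so the choice is a coin flip
    pmf_inf = binom_pmf(n, p_inferior)
    pmf_sup = binom_pmf(n, p_superior)
    wrong = 0.0
    sup_below = 0.0  # running P(superior machine wins fewer than a times)
    for a in range(n + 1):
        wrong += pmf_inf[a] * (sup_below + 0.5 * pmf_sup[a])
        sup_below += pmf_sup[a]
    return wrong


def expected_loss(n, total_plays, p_inferior=0.5, p_superior=0.6):
    """Expected wins forgone, relative to always playing the better machine."""
    gap = p_superior - p_inferior
    exploring = n * gap  # n plays are spent on the inferior machine
    committing = (total_plays - 2 * n) * gap * prob_pick_inferior(
        n, p_inferior, p_superior
    )
    return exploring + committing


def best_exploration(total_plays):
    """Number of exploratory plays per machine that minimizes expected loss."""
    candidates = range(total_plays // 2 + 1)
    return min(candidates, key=lambda n: expected_loss(n, total_plays))


for plays in (5, 1000):
    print(plays, "total plays -> explore each machine", best_exploration(plays), "times")
```

Even in this simplified setup, the amount of information gathering worth paying for grows with the number of plays still to come, which is the point a fixed significance threshold cannot capture.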
So, stopping rules make sense and can be optimized mathematically. That optimization is based on a lot of information, but thanks to years of existing research it can be done with a lot less cost than, say, performing medical tests on a few study subjects. So there is no excuse for not doing it right.
So what actually happens when these stopping rules that are demanded by “ethics” committees are designed in practice? Nothing very similar to what I just described. Typically the rule is some close variation on “stop if, when you check on the data gathered so far, one of the regimens is statistically significantly better than the other(s)”. Why this rule, which ignores all of the factors that go into the bandit optimization other than how sure you are about which regimen is best, based only on the study data, ignoring all other sources of information?
It goes back to the first point I made in this exploration, the fiction that clinical trials do not involve giving some people a believed-inferior regimen. As I noted, as soon as you make one false declaration, others tend to follow from it. One resulting falsehood that is necessary to maintain the fiction is that in any trial, we actually know and believe absolutely nothing until the data from this study is available, so we are not giving someone a believed-inferior treatment. A second resulting falsehood is that we must stop the study as soon as we believe that one treatment is inferior. An additional falsehood that is needed to make the second one function is that we know nothing until we reach some threshold (“significance”), otherwise we would quit once the first round of data was gathered (at which time we would clearly know something).
The first and last of these are obviously wrong, as I illustrated by pointing out that an expert faced with choosing one of the regimens for himself or a relative before the study was done would never flip a coin, as he would pretty much always have an opinion about which was better. But they do follow from that “equipoise” assumption, the assumption that until the trial gives us an answer, we know nothing. That assumption was, recall, what was required to maintain the fiction that no group in the trial was being assigned an inferior treatment.
As for stopping when one regimen is shown to be inferior, based on statistical significance, I believe this is the most compelling point of the whole story: Based on the equipoise assumption, the statistical significance standard basically says stop when we believe there is a 97.5% chance that one choice is better. (It is not quite that simple, since statistical significance cannot be directly interpreted in terms of phrases like “there is an X% chance”, but since we are pretending we have no outside knowledge, it is pretty close for simple cases.) So what is wrong with being this sure? The problem is that this threshold is pretty much never (I have never heard of an exception) chosen based on how many more times we are going to play – that is, how many total people are going to follow the advice generated by the study. If there are tens of millions of people who might follow the advice (as is the case with many preventive medicines or bits of nutritional advice), then that 2.5% chance of being wrong seems pretty large, especially if all you need to do is keep a thousand people on the believed-inferior regimen for just a few more years.
But *sputter* *gasp* we can’t do that! We cannot intentionally keep people on an inferior regimen!
Now we have come full circle. Yes we can do that, and indeed always do that every time we start a trial. That is, we have someone on a believed-inferior regimen because we never prove the answer to an empirical question. There is nothing magical about statistical significance (or any variation thereof) – it is just an arbitrary threshold with no ethical significance (indeed, it also lacks any real significance statistically, despite the name). It usually means that we have a stronger belief about what is inferior when we have statistically significant data as compared to when we do not, but there is no bright line between ignorance and believing, let alone between believing and “knowing”.
So, if we are asking people to make a sacrifice of accepting assignment to the believed-inferior option in the first place, we must be willing to allow them to keep making the sacrifice after we become more sure of that belief, up to a point. But since there is clearly no bright line, that point should start to consider some bandit optimization, like how many plays are yet to happen.
This is not to say that we should just use the standard bandit problem optimizations from operations research, which typically assume we are equally worried about losses during the data gathering phase as during the information exploiting phase. It is perfectly reasonable that we are more concerned with the people in the trial, perhaps because we are more concerned with assigning someone to do something as compared to merely not advising people correctly. We would probably not accept nine excess deaths in the study population (in expected value terms) to prevent ten expected excess deaths among those taking the advice. We might even put the tradeoff at 1-to-1000, which might justify the above case, making 97.5% sure the right point to quit even though millions of people’s actions were at stake. But whatever that tradeoff, it should be reasonably consistent. Thus, for other cases where the number of people who might heed the advice is only thousands, or a hundred million, the stopping rule should be pegged at a different point.
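A back-of-the-envelope sketch shows what "reasonably consistent" would imply. It rests on strong simplifying assumptions that are mine, made only for illustration: the harm from being on the wrong regimen is the same per person for trial subjects and for people following the advice, continuing the trial would fully resolve which regimen is better, and harm to a trial subject is weighted `subject_weight` times as heavily as harm to a follower. Under those assumptions, continuing is worthwhile only while the chance of being wrong exceeds weight × subjects ÷ followers. The function name and the scenario numbers are hypothetical.

```python
def break_even_error_probability(future_followers, subjects_on_inferior_arm, subject_weight):
    """Chance of being wrong above which continuing the trial beats stopping.

    Assumes equal per-person harm in both groups (so it cancels out) and that
    continuing would remove all remaining uncertainty. Both are illustrative
    simplifications, not features of any real stopping rule.
    """
    threshold = subject_weight * subjects_on_inferior_arm / future_followers
    return min(1.0, threshold)  # a probability cannot exceed 1


CHANCE_WRONG_AT_SIGNIFICANCE = 0.025  # the "97.5% sure" stopping point

for followers in (5_000, 10_000_000, 100_000_000):
    threshold = break_even_error_probability(
        future_followers=followers,
        subjects_on_inferior_arm=1_000,
        subject_weight=1_000,  # the 1-to-1000 tradeoff
    )
    decision = "continue" if CHANCE_WRONG_AT_SIGNIFICANCE > threshold else "stop"
    print(f"{followers:>11,} followers: break-even {threshold:.3f} -> {decision}")
```

With the same 1-to-1000 tradeoff and the same thousand subjects, the break-even point moves by orders of magnitude as the number of future followers changes, so a single fixed threshold cannot be consistent across all three cases.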
So there is the critical problem. Whatever you think about the right tradeoff, or how much to consider outside information, or other details, there is a tradeoff and there is an inconsistency. Either we are asking people to make unreasonable levels of sacrifice when there is less at stake (fewer future “plays”) or we are not calling for enough sacrifice when there is more at stake. There is a lot of room for criticism on many other points that I have alluded to, and I would argue that almost all stopping rules kick in too soon and that most trials that accumulate data slowly should also be planned for a longer run (i.e., they should not yet stop at the scheduled ending time), though some trials should not be done at all because the existing evidence is already sufficient. But those are debatable points and I can see the other side of them, while the failure to consider how many more plays seems inescapable. The current practice can only be justified based on the underlying patent fiction.
When the niacin study that prompted this analysis was stopped, it was apparently because an unexpected side effect, stroke, had reached the level of statistical significance but perhaps also because there was no apparent benefit. This one kind of feels like it was in the right ballpark in terms of when to stop – they were seeing no benefit, after all, and there was prior knowledge that made it plausible that there was indeed none. But imagine a variation where the initial belief was complete confidence in the preventive regimen, and there was some apparent heart attack benefit in the study data, but the extra strokes (which were completely unexpected and thus more likely to have been a statistical fluke) outweighed the benefit by an amount that achieved statistical significance. Would we really want to give up so quickly on something that we had thought would be beneficial to tens of millions of people?
The situation becomes even more complicated when there are multiple outcomes. An example is the Women’s Health Initiative, the trial that resulted in post-menopausal estrogen regimens being declared to be unhealthy rather than healthy. It was stopped because the excess breast cancer cases in the treatment group hit the magic threshold. But there were offsetting benefits in terms of hip fracture and other diseases, so the bottom line was really unclear. Someone with particularly low risk of breast cancer and high risk of fracture might have still wanted to go with the therapy, but we cannot tease out enough detail because the trial ended too soon. Whatever we might have learned from continuing longer could have helped millions and really would not have hurt subjects much on net, but now we will never know. (Arguably the trial had become a train wreck by the time it ended, with a huge portion of each trial arm being “noncompliant” – i.e., those assigned to hormones having stopped taking them and those assigned to placebo having sought out hormone treatment – and many being lost to follow-up. Still, those were not the reasons the study was stopped, and everyone mostly just pretended they had not happened.)
Bottom line: Pretending that trials do not hurt (in expected value terms) some subjects is unethical. Engineering their design in a way that provides suboptimal information in order to maintain that fiction is even worse.