Monthly Archives: May 2011

Unhealthful News 150 – Understanding (some of) the ethics of trials and stopping rules, part 3

A few days ago I made a comment about clinical trial stopping rules being based on dubious ethics, as practiced.  In part 1 and part 2, I made the following points: clinical trials almost always assign some people to an inferior treatment, an action which serves the greater good by giving better information for many future decisions; hurting people for the greater good is not necessarily unethical, but pretending you are not doing it seems indefensible; people who do clinical trials and act as “medical ethicists” for them usually deny that they are hurting some subjects, though this is clearly false; clinical trials are an example of what are known as “bandit problems” (a reference to playing slot machines), characterized by a tradeoff between gathering more information and making the best use of the information you have; there is a well-worked mathematics for optimizing that tradeoff.

Today I will conclude the analysis, starting by expanding on that last point.  The optimization depends on every variable in sight, notably including how many more times you are going to “play” (i.e., make the decision in question), as well as more subtle points like your prior beliefs and whether you only care about average payoff or might care about other details of the distribution of payoffs (e.g., you might want to take an option whose outcomes vary less, even though it is somewhat inferior on average).  Obviously the decision predominantly depends on how much evidence you have to support the claim that one option is better than the other, on average, and how different the apparent payoffs are, but the key is that that is not all it depends on.

I hinted at some of this point yesterday, pointing out that you would obviously choose to stop focusing on gathering information, switching over to making the apparently best choice all the time, at different times when you were expecting to play five or a thousand times.  Obviously the value of information varies greatly, with the value of being more certain of the best choice increasing with the number of future plays.  On the more subtle points, if you are pretty sure at the start that option X is better, but the data you collect is favoring option Y a bit, you would want to gather a bit more data before abandoning your old belief, as compared to demanding a bit less if the data was tending to support X after all.  And if the payoffs are complicated, rather than being simply “win P% of the time, lose (100-P)% of the time”, with varying outcomes, maybe even some good and some bad, then more data gathering will be optimal.  This is the case even if you are just concerned with the average payoff, but even more so if people might have varying preferences about those outcomes, such as worrying more about one disease than another (I have to leave the slot machine metaphor to make that point).

So, stopping rules make sense and can be optimized mathematically.  That optimization is based on a lot of information, but thanks to years of existing research it can be done with a lot less cost than, say, performing medical tests on a few study subjects.  So there is no excuse for not doing it right.

So what actually happens when these stopping rules that are demanded by “ethics” committees are designed in practice?  Nothing very similar to what I just described.  Typically the rule is some close variation on “stop if, when you check on the data gathered so far, one of the regimens is statistically significantly better than the other(s)”.  Why this rule, which ignores all of the factors that go into the bandit optimization other than how sure you are about which regimen is best, based only on the study data, ignoring all other sources of information?

It goes back to the first point I made in this exploration, the fiction that clinical trials do not involve giving some people a believed-inferior regimen.  As I noted, as soon as you make one false declaration, others tend to follow from it.  One resulting falsehood that is necessary to maintain the fiction is that in any trial, we actually know and believe absolutely nothing until the data from this study is available, so we are not giving someone a believed-inferior treatment.  A second resulting falsehood is that we must stop the study as soon as we believe that one treatment is inferior.  An additional falsehood that is needed to make the second one function is that we know nothing until we reach some threshold (“significance”), otherwise we would quit once the first round of data was gathered (at which time we would clearly know something). 

The first and last of these are obviously wrong, as I illustrated by pointing out that an expert faced with choosing one of the regimens for himself or a relative before the study was done would never flip a coin, as he would pretty much always have an opinion about which was better.  But they do follow from that “equipoise” assumption, the assumption that until the trial gives us an answer, we know nothing.  That assumption was, recall, what was required to maintain the fiction that no group in the trial was being assigned an inferior treatment.

As for stopping when one regimen is shown to be inferior, based on statistical significance, I believe this is the most compelling point of the whole story:  Based on the equipoise assumption, the statistical significance standard basically says stop when we believe there is a 97.5% chance that one choice is better.  (It is not quite that simple, since statistical significance cannot be directly interpreted in terms of phrases like “there is an X% chance”, but since we are pretending we have no outside knowledge, it is pretty close for simple cases.)  So what is wrong with being this sure?  The problem is that this threshold is pretty much never (I have never heard of an exception) chosen based on how many more times we are going to play – that is, how many total people are going to follow the advice generated by the study.  If there are tens of millions of people who might follow the advice (as is the case with many preventive medicines or bits of nutritional advice), then that 2.5% chance of being wrong seems pretty large, especially if all you need to do is keep a thousand people on the believed-inferior regimen for just a few more years.
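To see where the 97.5% figure comes from, here is a minimal sketch.  It assumes a flat prior (the “we know nothing” pretense) and a normally distributed estimate of the difference between regimens; both are simplifications, and the function names are just illustrative.  Under those assumptions, the usual two-sided 5% significance threshold lines up with being about 97.5% sure which regimen is better.

```python
from statistics import NormalDist

standard_normal = NormalDist()  # mean 0, standard deviation 1


def two_sided_p_value(z_score):
    """Conventional two-sided p-value for an observed z-score."""
    return 2 * (1 - standard_normal.cdf(abs(z_score)))


def posterior_prob_better(z_score):
    """P(true difference > 0) given the z-score, ASSUMING a flat prior and a
    normal estimate. With real prior knowledge this number would differ."""
    return standard_normal.cdf(z_score)


# z-score sitting exactly at the usual significance threshold (about 1.96)
z_at_threshold = standard_normal.inv_cdf(0.975)
print(round(two_sided_p_value(z_at_threshold), 3))      # 0.05
print(round(posterior_prob_better(z_at_threshold), 3))  # 0.975
```

Nothing in this calculation refers to how many people will eventually follow the advice, which is the gap the rest of the argument is about.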

But *sputter* *gasp* we can’t do that!  We cannot intentionally keep people on an inferior regimen!

Now we have come full circle.  Yes we can do that, and indeed always do that every time we start a trial.  That is, we have someone on a believed-inferior regimen because we never prove the answer to an empirical question.  There is nothing magical about statistical significance (or any variation thereof) – it is just an arbitrary threshold with no ethical significance (indeed, it also lacks any real significance statistically, despite the name).  It usually means that we have a stronger belief about what is inferior when we have statistically significant data as compared to when we do not, but there is no bright line between ignorance and believing, let alone between believing and “knowing”.

So, if we are asking people to make a sacrifice of accepting assignment to the believed-inferior option in the first place, we must be willing to allow them to keep making the sacrifice after we become more sure of that belief, up to a point.  But since there is clearly no bright line, that point should start to consider some bandit optimization, like how many plays are yet to happen.

This is not to say that we should just use the standard bandit problem optimizations from operations research, which typically assume we are equally worried about losses during the data gathering phase as during the information exploiting phase.  It is perfectly reasonable that we are more concerned with the people in the trial, perhaps because we are more concerned with assigning someone to do something as compared to merely not advising people correctly.  We would probably not accept nine excess deaths in the study population (in expected value terms) to prevent ten expected excess deaths among those taking the advice.  We might even put the tradeoff at 1-to-1000, which might justify the above case, making 97.5% sure the right point to quit even though millions of people’s actions were at stake.  But whatever that tradeoff, it should be reasonably consistent.  Thus, for other cases where the number of people who might heed the advice is only thousands, or a hundred million, the stopping rule should be pegged at a different point.
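The consistency point can be made concrete with a rough sketch.  Every number below is hypothetical, chosen only for illustration: the chance of being wrong after continuing (1%), the per-person excess risks (1%), and the function and parameter names are my assumptions, not anything from a real trial.  The only thing held fixed is the 1-to-1000 weighting of harm to trial subjects versus harm to future advice-takers.

```python
def should_continue(followers, p_wrong_now, p_wrong_later,
                    subjects_on_inferior, subject_weight,
                    trial_excess_risk=0.01, future_excess_risk=0.01):
    """Continue the trial if the expected future harm avoided exceeds the
    weighted expected harm to subjects kept on the believed-inferior arm.
    All inputs are illustrative assumptions."""
    # Expected bad outcomes avoided among future followers by being more sure.
    future_harm_avoided = ((p_wrong_now - p_wrong_later)
                           * followers * future_excess_risk)
    # Net expected harm to the believed-inferior arm: they are harmed if the
    # belief is right and helped if it is wrong.
    trial_harm = (subjects_on_inferior * trial_excess_risk
                  * (1 - 2 * p_wrong_now))
    return future_harm_avoided > subject_weight * trial_harm


# Same 97.5% certainty, same 1-to-1000 weighting, different numbers of followers.
for followers in (5_000, 10_000_000, 100_000_000):
    decision = should_continue(followers, p_wrong_now=0.025, p_wrong_later=0.01,
                               subjects_on_inferior=1_000, subject_weight=1_000)
    print(followers, "continue" if decision else "stop")
# 5000 stop
# 10000000 stop
# 100000000 continue
```

With these made-up inputs, stopping at 97.5% is defensible for thousands or ten million followers but not for a hundred million, which is the inconsistency a fixed significance threshold cannot capture.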

So there is the critical problem.  Whatever you think about the right tradeoff, or how much to consider outside information, or other details, there is a tradeoff and there is an inconsistency.  Either we are asking people to make unreasonable levels of sacrifice when there is less at stake (fewer future “plays”) or we are not calling for enough sacrifice when there is more at stake.  There is a lot of room for criticism on many other points that I have alluded to, and I would argue that almost all stopping rules kick in too soon and that most trials that accumulate data slowly should also be planned for a longer run (i.e., they should not yet stop at the scheduled ending time), though some trials should not be done at all because the existing evidence is already sufficient.  But those are debatable points and I can see the other side of them, while the failure to consider how many more plays seems inescapable.  The current practice can only be justified based on the underlying patent fiction.

When the niacin study that prompted this analysis was stopped, it was apparently because an unexpected side effect, stroke, had reached the level of statistical significance but perhaps also because there was no apparent benefit.  This one kind of feels like it was in the right ballpark in terms of when to stop – they were seeing no benefit, after all, and there was prior knowledge that made it plausible that there was indeed none.  But imagine a variation where the initial belief was complete confidence in the preventive regimen, and there was some apparent heart attack benefit in the study data, but the extra strokes (which were completely unexpected and thus more likely to have been a statistical fluke) outweighed the benefit by an amount that achieved statistical significance.  Would we really want to give up so quickly on something that we had thought would be beneficial to tens of millions of people?

The situation becomes even more complicated when there are multiple outcomes.  An example is the Women’s Health Initiative, the trial that resulted in post-menopausal estrogen regimens being declared to be unhealthy rather than healthy.  It was stopped because the excess breast cancer cases in the treatment group hit the magic threshold.  But there were offsetting benefits in terms of hip fracture and other diseases, so the bottom line was really unclear.  Someone with particularly low risk of breast cancer and high risk of fracture might have still wanted to go with the therapy, but we cannot tease out enough detail because the trial ended too soon.  Whatever we might have learned from continuing longer could have helped millions and really would not have hurt subjects much on net, but now we will never know.  (Arguably the trial had become a train wreck by the time it ended, with a huge portion of each trial arm being “noncompliant” – i.e., those assigned to hormones having stopped taking them and those assigned to placebo having sought out hormone treatment – and many being lost to follow up.  Still, those were not the reasons the study was stopped, and everyone mostly just pretended they had not happened.)

Bottom line:  Pretending that trials do not hurt (in expected value terms) some subjects is unethical.  Engineering their design in a way that provides suboptimal information in order to maintain that fiction is even worse.

Unhealthful News 149 – Understanding (some of) the ethics of trials and stopping rules, part 2

Yesterday I explained why clinical trials (aka randomized clinical trials, RCTs, medical experiments on people) almost always inflict harm on some of their subjects, as assessed based on current knowledge (which is, of course, the only way we can measure anything).  To clarify, this means that one group or another in the trial experiences harm in expected value terms.  “Expected value” means it is true for the average person, though some individuals might benefit while others suffer loss, and averaging across hypothetical repetitions of the world, because sometimes the luck of the draw causes an overall result that is very different from what would occur on average.

The critical ethical observation about this is that causing this harm is ok.  Some people have to suffer some loss – in this case by volunteering to be a study subject and getting assigned to the believed-inferior regimen – for the greater good.  In this case, the greater good is the knowledge that lets us choose/recommend a regimen for everyone in the future based on the additional knowledge we gained from the study.  There is nothing inherently unethical with causing some people harm for a greater good.  Sometimes that is unethical, certainly, but not always.  If we tried to impose an ethical rule that said we could never make some people worse off for the greater good (or even the much narrower variation, “…some identified people…”), a large fraction of human activity would grind to a halt.  Now it does turn out that it is always possible, when an action has a net gain for society, to compensate those who are being hurt so that everyone comes out ahead in expected value terms (for those with some economics, I am referring to potential Pareto improvement).  But it turns out that for most clinical trials, no such compensation is offered and, bizarrely, it is often considered “unethical” to provide it (another pseudo-ethical rule that some “health ethicists” subscribe to, and another story for another day:  they claim that it would be too coercive to offer someone decent compensation to be in a trial, which …um… explains why it is considered unethical coercion to pay people to work their jobs?)

However, though it is not necessarily unethical to take actions that hurt people, there is a good case to be made that it is per se unethical to hurt people but claim to not be doing so.  Thus, an argument could be made that invading Iraq was ethical even though it was devastating for the Iraqi people (I am not saying I believe that, I am just saying there is room to argue).  But when US government apologists claim that the invasion was ethical because it made the average Iraqi better off, they are conceding the situation is unethical:  Not only are they lying, but they are implicitly admitting that the invasion was unethical because their defense of it requires making a false claim.  Similarly, banning smoking in bars/pubs is the subject of legitimate ethical debate even though it clearly hurts smokers and the pub business.  But when supporters of the ban pretend that pubs have not suffered they are being unethical and are implying that they think the truth (“the bans are costly for the pubs in most places, but we feel the benefits are worth the cost”) would not be considered ethical or convincing.

So, it seems that those doing and justifying clinical trials are on rather shaky ethical ground based on their rhetoric alone, because they pretend that no one is being hurt.  This is simply false.  Their claim is that if we are doing the trial then we must not know which of the regimens being compared is better, so no one is being assigned to an inferior choice.  But as I explained yesterday, this is simply false in almost all cases – they are misrepresenting the inevitable uncertainty as being complete ignorance.  But it gets worse because, as is usually the case, once you take one nonsensical step, others follow from it (which you can interpret as either “one false assumption leads to bad conclusions via logical reasoning” or “trying to defend the indefensible usually requires more indefensible steps to patch over the mess you have made”).  The stopping rules, as they now exist, are one of those bad steps that follow.

But it occurs to me that I need to explain one more epistemic principle before making the final point, so I will do that today and add a “part 3” to the plan here (you need to read part 1 to know what I am talking about here, btw).  I hope that anyone who likes reading what I write will find this worthwhile.

Clinical trials are an example of the tradeoff between gathering more information about which choice is better and exploiting the information you have to make the apparent best choice.  Yesterday I pointed out that if an expert is making a decision about a health regimen (e.g., a treatment option) for himself or a close relative right now, he almost certainly would have a first choice.  This is a case of just exploiting current knowledge because there is no time to learn more, so the choice is whichever seems to be better right now, even if it only seems a little better and is quite uncertain.  But if we are worried not just about the next member of the target population, but the next thousand or million who could benefit from a treatment or health-improving action, it would be worth resolving the uncertainty some.  The best way to do that is to mix up what we are doing a bit.  That is, instead of just going with the apparently better regimen (which would provide some information – it would help narrow down exactly what the expected outcomes are for that regimen) we seek the additional information of clarifying the effects of the other regimen.

Aside – yes, sorry; it is hard for me to present complicated topics that have subtle subpoints without getting all David Foster Wallace-esque – I already use his sentence structure, after all.  For a lot of trials, one of the regimens represents the current common practice, being used for comparison to the new drug/intervention/whatever of interest.  This is a regimen that we actually already have a lot of data about, and for which more usually continues to accumulate.  Thus, you might say, we can just assign everyone to the other regimen, if it is believed to be better, and use the data about the old standard from other sources.  This is true, and it is yet another epistemic disgrace that we do not make better use of that information in evaluating the new regimen.  But there are big advantages to having the data come from the same study that examined the new regimen.  This is often attributed to the value of randomization and blinding, but the main benefits have to do with people in studies being enough different from average that it is tricky to compare them to the population average.  People in studies experience placebo effects and Hawthorne effects (effects of merely being studied, apart from receiving any intervention, which are often confused with placebo effects – ironically including in the study that generated the name “Hawthorne effect”), and are just plain different.  Thus, though we should make better use of data from outside the study, there is still great value in assigning some people to each of the regimens that is being studied.

The tradeoff between exploiting best-available information and paying the price to improve our information is called a “two-armed bandit problem” (or more generally, just a “bandit problem”), a metaphor based on the slot machine, which used to be a mechanical device with an arm that you pulled to spin real mechanical dials, thus earning the epithet, “one-armed bandit” (this was back before it became all digital and able to take your money as fast as you could push a button).  Imagine a slot machine with a choice of two arms you can pull, which almost certainly have different expected payoffs.  If you are only going to play once, you should obviously act on whatever information you have.  If you are going to play a handful of times, and you have good information about which pays off better you should probably just stick with that one.  If you have no good information you could try something like alternating until one of them paid off, and then sticking with that one for the rest of your plays.  This strategy might well have you playing the poorer choice – winning is random, so the first win can easily come from the one that wins less – but you do not have much chance to learn any better. 

But imagine you planned to play a thousand times.  In that case, you would want to plan to play each of them some number of times to get a comparison.  If there is an apparent clear advantage for one of the choices, play it for the remainder of your plays (actually, if it starts to look like the test phase was a fluke because you are not winning as much in the later plays, you might reopen your inquiry – think of this as post-marketing surveillance).  On the other hand, if it still seems close, keep playing both of them some to improve your information.  The value of potential future information is that it might change your mind about which of the options is better (further information that confirms what you already believe has less practical value because it does not change your choice, though it does create a warm fuzzy feeling).  Now imagine an even more extreme case, where you can keep betting pennies for as long as you want, but eventually you have to bet the rest of your life’s savings on one spin.  In that case you would want to play many times – we are talking perhaps tens of thousands of times (let’s assume that the effort of playing does not matter) – to be extremely sure about which offers the better payoff.
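For those who like to see the numbers, here is a minimal sketch of the “try each arm some number of times, then stick with the apparent winner” strategy.  To be clear about what it is: this is a simple explore-then-commit rule, not the optimal bandit solution, and the true payoff rates (50% and 60%) are hypothetical values that a real player would not know.  It computes the exact expected number of wins and finds how many test plays per arm would have been best for a given total number of plays.

```python
from math import comb


def binomial_pmf(n, p):
    """Return [P(X = k) for k in 0..n] for X ~ Binomial(n, p)."""
    return [comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)]


def expected_wins(total_plays, explore_each, p_a, p_b):
    """Exact expected wins when each arm is tried `explore_each` times and the
    arm with more wins (coin flip on a tie) gets all the remaining plays."""
    pmf_a = binomial_pmf(explore_each, p_a)
    pmf_b = binomial_pmf(explore_each, p_b)
    prob_a_ahead = prob_b_ahead = 0.0
    below_a = below_b = 0.0  # P(A < k) and P(B < k), updated as k grows
    for k in range(explore_each + 1):
        prob_b_ahead += pmf_b[k] * below_a
        prob_a_ahead += pmf_a[k] * below_b
        below_a += pmf_a[k]
        below_b += pmf_b[k]
    prob_tie = 1.0 - prob_a_ahead - prob_b_ahead
    commit_rate = (prob_a_ahead * p_a + prob_b_ahead * p_b
                   + prob_tie * (p_a + p_b) / 2)
    remaining = total_plays - 2 * explore_each
    return explore_each * (p_a + p_b) + remaining * commit_rate


def best_explore_each(total_plays, p_a, p_b):
    """Number of test plays per arm that maximizes expected wins."""
    return max(range(total_plays // 2 + 1),
               key=lambda n: expected_wins(total_plays, n, p_a, p_b))


few = best_explore_each(10, 0.5, 0.6)      # a handful of plays
many = best_explore_each(1000, 0.5, 0.6)   # a thousand plays
print(few, many)
```

The best amount of testing is much larger for a thousand plays than for ten, even though the two arms and the evidence-gathering process are identical.  Only the number of future plays changed.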

There actually is an exact mathematics to this, with a large literature and some well-worked problems.  It is the type of problem that a particular kind of math geek really likes to work out (guess who?).  The calculations hinge on your prior beliefs about probability distributions and Bayesian updating, two things that are well understood by many people, but not by those who design the rules for most (not all) clinical trials.
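As a rough illustration of what “prior beliefs and Bayesian updating” means here, the sketch below estimates the probability that regimen Y has a higher success rate than regimen X, given the same study data but two different priors.  It uses Beta priors expressed as pseudo-counts of (successes, failures) and simple Monte Carlo sampling; the data, the priors, and the function name are all made up for illustration.

```python
import random


def prob_y_better(prior_x, prior_y, data_x, data_y, draws=100_000, seed=0):
    """Estimate P(success rate of Y > success rate of X) by sampling from the
    Beta posteriors. Each argument pair is (successes, failures)."""
    rng = random.Random(seed)  # fixed seed so the estimate is reproducible
    ax, bx = prior_x[0] + data_x[0], prior_x[1] + data_x[1]
    ay, by = prior_y[0] + data_y[0], prior_y[1] + data_y[1]
    wins = sum(rng.betavariate(ay, by) > rng.betavariate(ax, bx)
               for _ in range(draws))
    return wins / draws


# Hypothetical study data: Y looks better (50% vs 40% success in 100 each).
data_x = (40, 60)
data_y = (50, 50)

# "Know nothing" flat prior versus a prior belief that X is the better one.
flat = prob_y_better((1, 1), (1, 1), data_x, data_y)
favoring_x = prob_y_better((30, 20), (20, 30), data_x, data_y)
print(flat, favoring_x)
```

With the flat prior the data alone make Y look better with probability above 90%, while someone who started out fairly sure X was better ends up at roughly a coin flip and would reasonably want more data before switching.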

Clinical trials are a bandit problem.  Each person in the study is a pull of the arm, just like everyone that comes after during the “exploit the information from the study to always play the best choice from now on” phase.  Many types of research are not like this because the study does not involve taking exactly the action that you want to eventually optimize, but clinical trials have this characteristic.

You may have seen emerging hints of the stopping rule.  The period of gathering more information in the bandit problem is, of course, the clinical trial period, while the exploitation of that knowledge is everyone else who is or will be part of the target population, now and into the future until some new development renders the regimen obsolete or reopens the question.  The stopping rule, then, is the point when we calculate that going further with the research phase has more costs (assigning some people to the inferior treatment) than benefits (the possibility of updating our understanding in a way that changes our mind about what is the better regimen).  It should also already be clear that the stopping rule should vary based on several different pieces of information.  Therein lies part (not all) of the ethical problem with existing stopping rules.

I hope to pull these threads together in part 3 (either tomorrow, or later in the week if a news story occurs that I do not want to pass up).

Unhealthful News 148 – Understanding the ethics of trials and stopping rules, part 1, with an aside about alcohol and the NHS

A couple of people asked me about an allusion I made to clinical trial stopping rules yesterday – rules which are based on a very weak understanding of statistics and epistemology, and thus, arguably, weak ethics – which I said was a story for another day.  But since there is nothing I particularly want to cover in today’s health news, I will let today be that day I start the explanation.  (For those looking for a more standard Unhealthful News style analysis, you can find it in the second part of this post where I link to a couple of other bloggers who did that for recent UK statistics about alcohol and hospital admissions.)  Besides, whenever I move something to the category “to do later” it joins such a long list that it is often lost – note that this observation should serve as a hint to those of you who have asked me to analyze something and I said I would get to it: please ask again if you are still interested!  (If you do not want to post a comment, my gmail I use for work is cvphilo.)

Clinical trials (a prettied-up name for medical or health experiments conducted on people) which follow the study subjects for a long period of time (e.g., they give one group a drug they hope will prevent heart attacks and the other group a placebo, and then watch them for years to count heart attacks) often have a stopping rule.  Such rules basically say that someone will look at the accumulated data periodically, rather than waiting until the planned end, to make sure that it does not already clearly show one group is doing better (in terms of the main outcome of the study and major side effects).  If the data support the claim that one group is apparently suffering inferior health outcomes because of their treatment, the argument goes, then it would be unethical to continue the trial and thus continue to assign them to the inferior regimen.  Oh, except those who recite the textbook justification for the stopping rules would probably phrase that as something like “if one treatment is clearly inferior” rather than the much longer version of the conditional I wrote; therein lies much of the problem.

Backing up a couple of steps, to understand the problem it is useful to realize that most trials involve assigning some people to a treatment that is believed to be inferior.  Realizing this is not necessary for figuring out a statistically optimal stopping rule, but it does immediately get rid of a persistent ethical fantasy that interferes with good analysis.  A typical trial involves comparing some new treatment, preventative medicine, or public health intervention to whatever is currently being done.  Almost always this is because those who initiated, funded, and approved of the research believe that the new regimen will produce better outcomes than the old one.  There are other scenarios too, of course, such as comparing two existing competing regimens, but the point is that those with the greatest expertise almost always have a belief about which is better.  If they had to decide, right now, which would be used for the next few decades, ignoring all future information from the trial or any other source, they would be able to make a decision.  More realistically, if they had to decide which regimen to follow/use for themselves, or their parent or child, right now (because what we might learn over the next ten years cannot aid in today’s decision), they would be able to make a decision.  Just because we are not sure which regimen is better (or how much better), and thus want to do research to become more sure, does not mean that there is not a prevailing expert opinion.

Many people who fancy themselves ethicists (and many more who just want to do trials without feeling guilty about it) take refuge in a fantasy concept called “equipoise”.  The term (which is actually a rather odd jargonistic adoption of that word – not that it is used in conversation anyway) is used to claim that when we do a trial, we are exactly balanced in our beliefs about which regimen produces better outcomes.  Obviously this might be true on rare occasions (though incredibly rare – we are talking about perfect balance here).  But most of the time the user of the word is confusing uncertainty with complete ignorance.  That is, someone obviously feels inadequately certain about which regimen is better, but this is not the same as having no belief at all.  Keep in mind that we are talking about the experts here, not random people or policy makers.  They know what the existing evidence shows and, if forced to make a decision right now about which regimen to assign to a close relative who is in the target population, it would be an incredibly rare case where they were happy to flip a coin. 

Every now and then, there is a case of such incredible ignorance that no one has any guess as to whether a treatment will help or hurt (e.g., this condition is always fatal in a few weeks, so let’s just start doing whatever we can think of – the results will be pretty random, but we have nothing to lose), and occasionally a situation is so complex that it starts bordering on chaos theory (e.g., what will the new cigarette plain packaging rule in Australia do? nothing? discourage smoking? expand the black market? provoke a political backlash? reinstate smoking’s role as a source of rebellion?).  But such examples are extremely rare.

It is also sometimes the case that no one is being assigned to an option inferior to their best available option had they not been in the trial.  For example, offering a promising new drug – or even just condoms and education – for people at high risk of HIV in Africa, comparing them to a group that does not get the intervention, may hurt no one.  If the researchers only had enough budget to give the treatment to a limited group of people, that group is helped (according to our prior belief) while the other group is right where they otherwise would have been.  Their lack of access to the better regimen is due to their inability to afford a better lot in life, so while they are not helped, they are in no way hindered by being in the control arm of the trial.  (Bizarrely, it is situations like these that often provoke greater ethical objections than cases where people are assigned to a believed-inferior regimen when they could afford to buy either regimen for themselves, but that is yet another story of the confused world of “health ethics”.)  Another example is the study I wrote about recently in which some smokers are given snus while others are not; setting aside all that is wrongheaded about the approach of this study, it does have the advantage that one group benefits (at least they get some free product they can resell) and the other is exactly where they would have been had there been no study.  There is a similar circumstance in which the trial only assigns people to the believed-better treatment, with the plan of comparing them to the known outcomes for people not getting that treatment.  This is similar to having a control group that just gets the standard treatment, though people who do trials do not like this approach because the data is harder to analyze (they have to consider the same challenges that exist for observational data).  But all of these cases, while certainly not rare, are unusual.

I will reiterate one point here, in case it is not yet clear (one of the challenges in turning seminar material into written essays is I get no realtime feedback, so I cannot be sure if I have not made something clear):  We are never sure about which of the regimens is better, so we might be wrong.  Handing out the condoms might actually increase HIV transmission; we are pretty sure that is not the case, but it is possible we are wrong.  Or niacin might not actually prevent any heart attacks, even though it seems like it should.  But there is still a belief about what is better when we start.

The bottom line, then, is that most trials involve assigning some people to do something that is believed to produce inferior health outcomes.  Why is this ok?  It is because it is for the greater good.  We want to be more sure about what is the better regimen so we can give better treatment/advice to thousands or millions of people, and so judge that it is ethical to let a few hundred informed volunteers follow the believed-inferior option to do so.  Also we usually want to measure how much better the better regimen is, perhaps because it costs more and we want to decide if it is worth the cost, because we want to be able to compare it to new competing regimens that might emerge in the future, or perhaps just out of curiosity. 

Asking people to suffer for what is declared to be the greater good is, of course, not an unusual act.  Every time someone writes a check to a humanitarian charity, they are doing this, and governments force such choices (taxation, zoning and eminent domain, conscription).  But the people who chatter about medical ethics, and make the rules about trials, like to pretend that they are not doing that.  From that pretense comes the stopping rules, which I realize I have not mentioned yet.  But this is a complex subject and requires some foundations.  I will end there for today and continue tomorrow.


On a completely unrelated note, for those of you who want some regular Unhealthful News and do not read Chris Snowdon (I know a lot of you do), check out what he wrote, based on what Nigel Hawkes wrote, about a recent UK report that hospital admissions due to alcohol consumption have skyrocketed.  I will not repeat their analysis and do not have much to add to it.  The simple summary is (a) the claim makes no sense because dangerous drinking has decreased a lot, as has alcohol-caused mortality, and (b) it is obvious that the apparent increase was due to a change in the way the counting was done.

It is pretty clear that the government knew they were making a misleading claim when they released this information.  Their own reports recognized the true observations, but their press release about their new estimate did not.  The National Health Service is on a crusade to vilify health-affecting behaviors they do not approve of.  But governments lie – we know that.  While the commonality of that does not make it any less disgraceful, the greater disgrace belongs to the press that is supposed to resist government lies, not transcribe them.  But, as Hawkes and Snowdon predicted (they wrote their posts right after the report came out, before the news cycle), the press picked it up, with hundreds of articles that report the misleading claims and seem to completely lack skepticism (examples here here here here).  This is not a difficult error to catch, either by running a Google search for the blogs that had already debunked the claim before the stories ran, or simply by asking “hey, we know that heavy drinking is way down, so how can this possibly be true?”

I suppose it is not too surprising that the average reader has no idea what stopping rules do when they read one was employed, let alone what is wrong with them, when the health reporters cannot even do simple arithmetic or fact checking.