There are cancer cluster stories like this in the local press all the time, but this one caught my eye because it went national and is near places where I have spent a lot of my life. The phenomenon is that people in a community have had a substantially higher than expected rate of cancer. This results in fear, demands for investigation, blame, and so on. In this case, it is childhood cancers in a particular rural town.
One interesting thing about all these cases: the second sentence in the previous paragraph has three serious problems, but we are so used to reading such statements that you probably did not even notice. I will come back to this point below. The first thing to know about a cancer cluster, a hot streak in sports, or any other concentration of events that are substantially random in their occurrence is that some clustering is statistically inevitable. The second thing to know is that much of the time there is a bit more clustering than the statistics alone would predict, but we have a difficult time knowing which of the clusters (usually a small minority of them) represent real phenomena as opposed to purely random occurrences.
The statistical process at work here is called a Poisson distribution, the result of a collection of independent rare random events like a disease in an individual. The graphic on this page shows that distribution, and illustrates that even if the most likely outcome is just 2 or 3 (or even one or zero) events in the population over a time period, there is still a non-trivial chance that there will be 6 or 8, and given enough chances (e.g., given the number of communities in America) there will be several cases where there are 20. To see (or imagine) this in practice, roll a single die 12 times and count how often it comes up 6. (This actually follows a binomial distribution, but it is very similar, illustrates the point, and is easier to imagine than any do-it-yourself Poisson distribution generator.) You probably immediately figured out that on average the result is 2. But I suspect that without doing the experiment you can also predict that 1 and 3 are fairly likely, with 0 and 4 rather less likely, and so on. Even rolling 5 sixes would not be terribly rare, and given enough tries you will sometimes get “clusters” of 10 or more.
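For readers who would rather simulate than imagine, here is a quick sketch of the die-rolling experiment; the number of trials and the seed are arbitrary choices of mine, not anything from the data above:

```python
import random
from collections import Counter

random.seed(0)  # arbitrary seed, just for reproducibility
trials = 100_000

# For each trial, roll a die 12 times and count the sixes.
counts = Counter(
    sum(1 for _ in range(12) if random.randint(1, 6) == 6)
    for _ in range(trials)
)

for k in sorted(counts):
    print(f"{k} sixes: {counts[k] / trials:.1%} of trials")

# Seeing 5 or more sixes is uncommon but far from impossible
# (the exact binomial probability is about 3.6%).
tail = sum(v for k, v in counts.items() if k >= 5) / trials
print(f"5 or more sixes: {tail:.1%}")
```

The simulated frequencies cluster around 2 sixes per trial, with the "surprising" outcomes of 5 or more showing up in a few percent of trials, which is exactly the point: rare-looking clusters are routine when you give chance enough opportunities.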
So that means that cancer clusters are inevitable even if everyone in the entire population has exactly the same risk — that is, if there is nothing that is causing any community to have higher risk. Of course, most people, even those not gripped with fear, do not understand such statistics. Moreover, they are unlikely to accept this information because unethical activists (including many ostensible scientists) so often abuse statistics, or deny the validity of statistics, to confuse people.
In any case, the lesson should not be that every cluster is a statistical accident. Typically the statistics show something like this: due to random clustering alone we should see, say, 10 communities with 20 or more cases, but the epidemiology shows that there are actually 14. That might just be a lot of bad luck, but it very likely means that a few of those 14 have experienced some real specific cause (e.g., their groundwater or air has nasty industrial chemicals in it), so their risk really is higher than average (recall that the predicted 10 is based on everyone having exactly average risk). The hard part, then, is that most of the clusters were random, but a few represent a huge elevation in the local risk, and we cannot know which. We could ask “is there something about that community that could be causing lots of extra cancers?”, but it would not do us any good: the answer is always “yes”. Once you observe a statistical pattern, coming up with a plausible explanation for it is quite easy.
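To put a number on "a lot of bad luck": if pure chance predicts 10 such communities, the probability of seeing 14 or more anyway follows directly from the Poisson distribution. The 10 and 14 here are the illustrative figures from the paragraph above, not real data:

```python
import math

def poisson_pmf(k: int, lam: float) -> float:
    """Probability of exactly k events when lam are expected."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

expected = 10   # communities with >= 20 cases predicted by chance alone
observed = 14   # communities actually found (hypothetical)

p_tail = 1 - sum(poisson_pmf(k, expected) for k in range(observed))
print(f"P(seeing {observed}+ when {expected} expected) = {p_tail:.3f}")
# about 0.135 -- bad luck alone is plausible, but far from certain
```

A roughly one-in-seven chance is exactly the uncomfortable middle ground the paragraph describes: too likely to rule out coincidence, too unlikely to dismiss a real excess.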
But wait: the statistics get a lot more complicated, making a mess of the whole thing. Recall the sentence above that referred to “community”, “rate”, and “cancer”. All of these are rather plastic, and once people are convinced there is a cluster, they are likely to alter the definitions to exaggerate the apparent rate of disease. There are a lot of different cancers, many of which are largely unrelated to each other. For a given set of people and time period, some will occur less often than the population average and some more. If you pick out only the ones that are elevated, the apparent cluster will tend to grow. So, if in Clyde, Ohio there were an elevated risk of breast cancer, you can bet that it would be noticed and mentioned as further “evidence” that the childhood cancers were caused by some local exposure. The same is true for bladder, pancreatic, or any other cancer with complicated causes. But since (I am surmising) these other cancers were not elevated, they were not included. If they had been included, the apparent increase would have been smaller (because the elevated cancers would have been diluted). The particular cancers that were elevated do not necessarily have similar causes, but they can all be grouped together as if they were a single disease.
A community is also not so well defined. In this case it is actually clearer than most since there is a free-standing town. But even then, it seems safe to assume that if outlying areas or a neighboring town had elevated rates of the particular cancers they would have been included in the reports of a cluster, but if not they would have been excluded. If the extra cancers were among 3-15 year-olds, you can be sure that they would have been the ones being discussed, but if the elevation was greater for 10-22 year-olds, the report would change to emphasize them. Finally, epidemiology has some precise definitions of what we measure in particular studies, but someone vaguely referring to a “rate” can pick and choose several variables, such as what time period to look at and how to define the outcome (e.g., once people start trying to diagnose cancers they find a lot of “pre-cancerous” conditions that would not have been noticed if the worry did not exist).
In short, people are much more likely to find a cluster than the simple statistics indicate, because they look at the data in many different ways and report whatever looks “best”. This is sometimes called the “Texas sharpshooter” fallacy (I am not sure why): fire six shots into the side of a barn and then draw a target around each one of them to produce six bull's-eyes.
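A small simulation shows how powerful this sharpshooter effect is. Every number below is invented purely for illustration: 500 towns, 20 unrelated cancer types, each type expected to cause 2 cases per town, and everyone at exactly the same risk:

```python
import math
import random

random.seed(1)  # arbitrary seed, for reproducibility

def poisson_draw(lam: float) -> int:
    # Knuth's algorithm for sampling a Poisson variate (fine for small lam)
    limit = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= random.random()
        if p <= limit:
            return k
        k += 1

# Hypothetical numbers: no town has any elevated risk at all.
towns, cancer_types, expected_cases = 500, 20, 2.0
threshold = 8  # call >= 8 cases of any one type a "cluster"

apparent_clusters = sum(
    1 for _ in range(towns)
    if any(poisson_draw(expected_cases) >= threshold
           for _ in range(cancer_types))
)

# Any single pre-chosen cancer type in a pre-chosen town crosses the
# threshold only about 0.1% of the time, yet scanning all types in all
# towns reliably turns up roughly a dozen "clusters" by chance alone.
print(apparent_clusters, "towns show an apparent cluster")
```

The trick is the scanning: nobody decided in advance which town or which cancer to watch, so every town gets twenty chances to look alarming, and some always will.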
Of course, that still leaves us with a real problem. Some statistical correlations represent real, important phenomena, even if they were found by torturing the data. This is the problem with dealing with publication bias in situ: we do not always know what we are looking for, so we sometimes have legitimate reasons for changing our analysis after we see the data. Thus a simplistic rule (“always analyze people in age groups of 5 years and only look at county-level data”) that avoids some of the Texas sharpshooting might cost us important information. But allowing ourselves to pursue that information opens the door for less ethical researchers (or worried lay people) to concoct “evidence” for phenomena that do not really exist.
As the series continues, I hope to show that epidemiology is not so arcane that interested readers cannot understand it. But I hope this is sufficient to show that you should not let anyone convince you it is simple.