Statistical inference isn't easy, either

I was just playing around with the citation management software / website Mendeley (recommended by Amir, and worth checking out for the auto-formatting of citations alone) when I strolled over to their "most read articles in all disciplines" section and saw that the 3rd most read article was a PLoS Medicine piece titled "Why most published research findings are false: author's reply to Goodman and Greenland," by Ioannidis. Ignoring the fact that it's the response and not the original paper (huh?) and led on by the rather provocative title, I poked around and discovered that Ioannidis' work just got written up in The Atlantic and was covered in pretty nice detail by Marginal Revolution back when it came out. So blogging about it does feel a bit like trying to review a restaurant that's already been covered by Frank Bruni and Food and Wine, but I'm going to go ahead and do so anyway since the point is so worthwhile.

The crux of the paper rests on a pretty simple idea: if you're running a huge number of one-off statistical tests (i.e., not testing the same hypothesis over and over), a fraction of your results proportional to the significance level of your tests will be false positives (i.e., type I errors). This is a pretty straightforward concept for anyone doing applied work: if you're checking to make sure you've got balance across treatment and control populations in a randomized trial, for example, having an occasional statistically significant difference between the two populations isn't a huge deal as long as the percentage of variables that turn up that way is proportional to the significance level you're setting. Yes, you should follow through as a good little applied researcher and make sure something's not hiding there, but some portion of your results will always end up that way due to random variation.
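You can see this mechanic directly with a quick simulation (a sketch using only Python's standard library; the sample sizes and the number of tests are made-up illustrative choices): run a pile of two-sample tests where the null is true by construction, and roughly 5% of them come back "significant" at the 0.05 level.

```python
import random

random.seed(0)
alpha = 0.05
n_tests = 10_000
n = 50  # observations per group in each test


def significant(a, b):
    # Two-sample z-test on means, with known unit variance,
    # rejecting at the two-sided 5% level (|z| > 1.96).
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    se = (1 / len(a) + 1 / len(b)) ** 0.5
    z = (mean_a - mean_b) / se
    return abs(z) > 1.96


# Every test here compares two samples from the SAME distribution,
# so any "significant" result is a false positive by construction.
false_positives = sum(
    significant(
        [random.gauss(0, 1) for _ in range(n)],
        [random.gauss(0, 1) for _ in range(n)],
    )
    for _ in range(n_tests)
)
rate = false_positives / n_tests
print(f"false positive rate: {rate:.3f}")  # hovers around alpha = 0.05
```

That ~5% is exactly the "occasional significant imbalance" you expect to see in a balance table even when randomization worked perfectly.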

The nice step that Ioannidis takes is to look at the entire field of medical research and apply the same logic, effectively viewing the suite of randomized trials as a game where we keep picking new potential treatments for the same problems over and over again, some subset of which are guaranteed to come back as false positives. To quote Alex Tabarrok's pithy wording of it in the Marginal Revolution post:
Want to avoid colon cancer? Let's see if an apple a day keeps the doctor away. No? What about a serving of bananas? Let's try vitamin C and don't forget red wine.
Moreover, since the number of things that actually, say, help avoid colon cancer is likely small, and the number of tests being run to find things which do is large, Ioannidis concludes that a large portion ("most") of positive results are in fact false positives and thus meaningless. It's a pretty simple premise which leads to a pretty deep statement about how we think about learning about the world.
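The arithmetic behind that conclusion is just Bayes' rule, and it's worth working through once. Following the pre-study-odds framing in Ioannidis' paper, the probability that a significant finding is true (the positive predictive value) depends on how rare true relationships are among the hypotheses being tested. The numbers below (1-in-100 prior odds, 80% power) are illustrative choices, not figures from the paper:

```python
def ppv(prior_odds, power, alpha=0.05):
    """P(relationship is real | test came back significant).

    prior_odds: pre-study odds that a tested relationship is true,
                i.e. (# true relationships) / (# false ones) in the field.
    power:      probability a real relationship is detected (1 - beta).
    alpha:      significance level, the false positive rate on nulls.
    """
    true_positives = power * prior_odds  # real effects, detected
    false_positives = alpha * 1.0        # null effects, "detected" anyway
    return true_positives / (true_positives + false_positives)


# If only 1 in 100 candidate interventions actually works, even a
# well-powered (80%) study's significant result is probably wrong:
print(ppv(prior_odds=1 / 100, power=0.8))  # ~0.14, i.e. ~86% false

# With even odds going in, significance is much more convincing:
print(ppv(prior_odds=1, power=0.8))  # ~0.94
```

The punchline is that "most findings are false" isn't a claim about sloppy statistics per se; it falls straight out of low prior odds plus a fixed type I error rate.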

So the solutions to this are, of course, pretty intuitive: don't trust small sample size studies; insist on retesting hypotheses; be skeptical of results in any field where a large number of researchers are pursuing solutions to the same problem. In short, demand robustness checks on everything, and make sure that what's being shown is not just an artifact of your specific data set. Good lessons that all applied researchers should have tattooed across their proverbial chests already, but nonetheless a nice thing to be reminded of.


  1. hey - did you try Mendeley? Is it any better than BibTeX (I write all my papers in LaTeX)?


  2. I haven't tried it out, but Mendeley apparently will put together a BibTeX'd up version of the citation for any paper you're looking at, though it's currently still in beta. It's the tab at the bottom of the paper's abstract that says "BibTeX".