The phrase “studies show…” is almost always followed by bs. To understand why, I’ll point you to a post by David Epstein.
David is the author of Range, The Sports Gene, and the new book Range Adapted for Young Readers which I bought for my 12-year-old son and nephew.
I’ve been a long-time reader of David’s letter and this post is both useful and timeless.
Everything in Your Fridge Causes and Prevents Cancer
It’s a reminder that outlier studies and results make headlines, but outliers are statistically inevitable if you run enough studies.
Excerpt:
It wasn’t every sauna enthusiast who reaped the supposed protective effect against dementia; it was specifically those who used a sauna 9-12 times a month. Sauna bathers who hit the wooden bench 5-8 times a month — sorry, no effect. And those who went more than 12 times a month — again, no luck.
That should raise a caution flag in your head.
When only a very specific subpopulation in a study experiences a benefit, it may indeed be that there is some extremely nuanced sweet spot. But it is more likely that the researchers collected a lot of data, which in turn allowed them to analyze many different correlations between sauna use and dementia; the more different analyses they can do, the more likely some of those analyses will generate false positives, just by statistical chance. And then, of course, those titillating positive results are the ones that end up at the top of the paper, and in the press release.
Here’s the point I want to hammer home: when you see a tantalizing health headline — like that saunas prevent dementia — keep an eye out for indications that the effect only applies to specific subgroups of the study population. Even if the headline is very authoritative, revealing nuggets are often buried lower in the story.
I want to stress that you shouldn’t assume the sauna results can’t possibly be true. But when you see Bears-undefeated-in-alternate-jerseys type conclusions — and someone is claiming one thing causes the other — you should hold out for more evidence.
This doesn’t just happen in health news. Investing/trading is another area where making a mountain out of a statistical molehill is rampant. Unless you are specifically studying a phenomenon that you’d expect to be discontinuous (binary, “phase change”, threshold cutoff), you should be wary of any signal that only appears in a specific range of an otherwise continuous function.
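The multiple-comparisons trap is easy to demonstrate with a simulation. Here's a toy sketch (my own illustration, not from either post): generate an outcome that is pure noise, slice the population into many subgroups, and test each one at the usual p < 0.05 threshold. Some subgroups will look "significant" by chance alone.

```python
import numpy as np

rng = np.random.default_rng(0)

# A study with NO true effect: a noise "outcome" for 2,000 subjects,
# sliced into 40 subgroups (think usage-frequency bands crossed with
# demographics).
n_subjects, n_subgroups = 2_000, 40
outcome = rng.normal(size=n_subjects)                 # pure noise
subgroup = rng.integers(0, n_subgroups, n_subjects)

false_positives = 0
for g in range(n_subgroups):
    x = outcome[subgroup == g]
    # Approximate z-test of the subgroup mean against zero; |z| > 1.96
    # is "significant at p < 0.05" even though there is no real effect.
    z = x.mean() / (x.std(ddof=1) / np.sqrt(len(x)))
    if abs(z) > 1.96:
        false_positives += 1

print(f"{false_positives} of {n_subgroups} subgroups look 'significant' by chance")
```

With 40 tests at the 5% level, you expect roughly two spurious "effects" per study, and those are exactly the results that make it into the press release.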
I’ll take a simple example from Kris Longmore’s article explaining how month-end rebalance trades work. The post is titled How Wealth Managers Pay You To Trade. He writes (emphasis mine):
How I’d Test This
So here’s the hypothesis: if we can identify which asset outperformed during the first part of the month, the underperformer should outperform as we approach month-end, when rebalancing pressure is likely to be greatest.
The first step is simple. Pull daily data for SPY and TLT going back as far as you can get it (I used data from 2007). You can get this from Yahoo Finance – nothing fancy.
Then ask a straightforward question: If I know which asset outperformed during the first 15 trading days of the month, can I predict which will outperform during the last ~7 trading days?
Why 15 days? Because it’s roughly two-thirds of a trading month, and it gives us a reasonable window to identify the outperformer before month-end rebalancing kicks in.
Could you use 10 days? 20 days? Sure. But 15 seemed reasonable and shouldn’t really matter much. If it did, then that would be a big red flag. We want stuff that’s fairly robust to the actual implementation details.
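To make Kris's recipe concrete, here's a minimal sketch of the test structure (my paraphrase, not his code). I substitute a random-walk stand-in for the real SPY/TLT daily closes he pulls from Yahoo Finance, so on this data the hypothesis should show no edge, roughly a coin flip:

```python
import numpy as np
import pandas as pd

# Stand-in prices: two random walks labeled SPY/TLT (hypothetical data;
# replace with real daily closes from Yahoo Finance back to 2007).
rng = np.random.default_rng(1)
dates = pd.bdate_range("2007-01-01", "2024-12-31")
prices = pd.DataFrame(
    np.exp(np.cumsum(rng.normal(0, 0.01, size=(len(dates), 2)), axis=0)),
    index=dates, columns=["SPY", "TLT"],
)

hits = total = 0
for _, month in prices.groupby([prices.index.year, prices.index.month]):
    if len(month) < 18:        # need a full-ish trading month
        continue
    first, last = month.iloc[:15], month.iloc[15:]
    early = first.iloc[-1] / first.iloc[0] - 1   # first 15 trading days
    late = last.iloc[-1] / last.iloc[0] - 1      # last ~7 trading days
    laggard = early.idxmin()
    # Hypothesis: the early laggard outperforms into month-end as
    # rebalancers buy the underperformer.
    hits += late[laggard] > late.drop(laggard).iloc[0]
    total += 1

print(f"laggard wins into month-end {hits}/{total} months ({hits / total:.0%})")
```

On random data this hit rate should hover near 50%; the interesting question is whether real SPY/TLT data pushes it meaningfully above that, and whether it still does when you vary the 15-day window.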
Back in my floor days, my biz partner was incubating a futures trend strategy and he’d have me look at the backtest results. I’m no scientist, but I knew enough to realize that if the signal depended on a particular parameter value (i.e., the exact threshold defining a “breakout”, the number of lookback days, etc.), then the result was overfit.
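One way to operationalize that red-flag check is to sweep the parameter and see whether the result survives. A hypothetical sketch (the breakout rule and lookback values are illustrative, not my partner's actual strategy):

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical backtest on a random walk: signal fires when price
# exceeds the trailing `lookback`-day high ("breakout"); the score is
# the mean next-day return following a signal.
prices = np.exp(np.cumsum(rng.normal(0, 0.01, 2_000)))
returns = np.diff(prices) / prices[:-1]   # returns[i]: day i -> i+1

def breakout_score(lookback: int) -> float:
    highs = np.array([prices[i - lookback:i].max()
                      for i in range(lookback, len(prices) - 1)])
    signal = prices[lookback:-1] > highs
    return float(returns[lookback:][signal].mean()) if signal.any() else 0.0

# Robustness check: a real edge shouldn't hinge on one exact lookback.
scores = {lb: breakout_score(lb) for lb in (10, 15, 20, 30, 40)}
for lb, s in scores.items():
    print(f"lookback={lb:>2}: mean next-day return {s:+.4%}")
# If only one lookback shows an effect while its neighbors show nothing,
# suspect overfitting rather than a genuine signal.
```

The point isn't the specific rule; it's the habit of checking the neighbors of whatever parameter value produced the exciting backtest.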
It’s the same idea as the sauna study David dissects.
When you are in a competitive domain where many people are constantly mining, “too good to be true” discoveries should be met with extra skepticism.
A current example of this is the so-called Mississippi Miracle, in which both the left and the right appear to have an axe to grind regarding the childhood literacy improvement in Mississippi schools. It checks the box of “domain where many people are constantly mining”, so interventions that show huge returns deserve a lot of skepticism. You can count on Freddie deBoer to deliver that, but I think the pushback in the comments section of his post shows the complexity:
