Normally, I try to be wary of analogies. They can lull us into the complacency of thinking we understand something complex just because we grasp the result of passing it through a neat filter. (Yes, I did write a post that used an extremely convoluted analogy to explain brain stuff, shhh. I’ve grown up or something.)
But I’m comfortable using the language of statistics for things in life that would seem to lie outside that field’s domain. Perhaps because “seem” is the operative word here: statistics really is everywhere, as long as there is anything of which we can’t be certain and we have ways of quantifying that uncertainty. While there are of course a lot of formalisms and technical details of this field (which I love!) that are invisible in daily life and probably don’t deserve analogies, ever since I learned how fundamental the problem of overfitting is in statistical models and machine learning, I’ve found it to be a recurring theme in unexpected places. Don’t worry, this isn’t a Math Post. Not at its core.
For the uninitiated, “overfitting” basically means that a model intended to represent/predict properties of a general population fails to do so, because the process used to develop that model was based on capturing the special quirks of a smaller, more manageable subset of data. This post is a neat and intuitive explanation of the idea in more detail, but you should be fine if you just understood that sentence—or, failing that, you agree that in this graph, the squiggly line looks suspicious and unlikely to be a good model even though it passes through all the points, while the straight line doesn’t. The squiggly line is guilty of the mortal sin of overfitting.
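To make this concrete, here is a minimal sketch (assuming NumPy, with made-up data) of why the squiggly line's perfect score is a red flag: a degree-7 polynomial through eight noisy points beats the straight line on those points precisely because it has memorized the noise.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up data: the true relationship is a straight line,
# and the observed points are that line plus symmetric noise.
def true_line(t):
    return 2.0 * t + 1.0

x = np.linspace(0, 1, 8)
y = true_line(x) + rng.normal(scale=0.3, size=x.size)

# Degree 1 is the honest straight line; degree 7 has enough
# freedom to pass through all eight points -- the "squiggle".
line_fit = np.polyfit(x, y, 1)
squiggle_fit = np.polyfit(x, y, 7)

# On the points it was fit to, the squiggle always wins...
train_err_line = np.sum((y - np.polyval(line_fit, x)) ** 2)
train_err_squiggle = np.sum((y - np.polyval(squiggle_fit, x)) ** 2)

# ...but on fresh points from the very same line, it tends to lose,
# because what it so faithfully memorized was mostly noise.
x_new = np.linspace(0.05, 0.95, 8)
y_new = true_line(x_new) + rng.normal(scale=0.3, size=x_new.size)
test_err_line = np.sum((y_new - np.polyval(line_fit, x_new)) ** 2)
test_err_squiggle = np.sum((y_new - np.polyval(squiggle_fit, x_new)) ** 2)
```

(The function and data here are invented for illustration; nothing about the lesson depends on the particular numbers.)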
Understanding why the squiggly line is suspicious is a lot more subtle. You could vaguely gesture to Occam’s razor, sure, but that just pushes the question back. Why are simpler explanations generally better? Stated in the original form (“don’t multiply entities beyond necessity”), it’s a tautology—of course including unnecessary stuff in an explanation is bad, yet the phenomenon of overfitting is about the badness of complex explanations that are actually better at explaining the subset of data in question than the simpler one. So the weak form of Occam’s razor won’t do. It turns out there’s some fascinating mathematical justification for Occam’s razor beyond the fuzzy sense that it just works, but I don’t understand it enough to confidently write about it here (I know, I know, hold back your tears). [Update: I wrote about it here.]
I confess that even though I could point to the fancy statistical theory that proves that more complex models have more variance, the intuitions underlying those proofs aren’t exactly trivial. Not that we should necessarily expect them to be. Half the purpose of this post is to give myself the chance to check that I really understand what I think I’ve learned.
With that caveat out of the way, I can say this much: as Gardner’s post explains, an overfit model tries to find patterns in noise. It tries to explain stuff that shouldn’t be explained by the variables in question: the stuff that (we assume, reasonably or otherwise) results from measurement error or any other perturbation that isn’t systematic (unless we’ve left out a relevant variable, which is often very plausible). Suffice it to say that if you have some really simple statistical software, you can simulate data that follow a straight-line pattern with this noise thrown in, and it’s surprising what sorts of spurious “patterns” you can find in there besides lines. Look at this graph if you don’t believe me:
Pretty close, right? Closer than this, at least:
But nope. The points in both graphs are fake data generated basically by taking that straight line and adding noise to it that is completely undetermined by the data themselves. And that noise is completely symmetrical (more precisely, the population it comes from is symmetrical).
Yet I would have every reason to forgive you if you thought there was a squiggly W-shaped pattern in those data.
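The simulation behind those graphs is easy to reproduce. Here is a minimal sketch (assuming NumPy; the data are fake, just as in the graphs): the generating process is literally a straight line plus symmetric noise, yet a flexible polynomial fit will still bend to chase the noise.

```python
import numpy as np

rng = np.random.default_rng(1)

# Fake data, generated exactly as described: a straight line plus
# symmetric noise that is completely undetermined by the data.
x = np.linspace(0, 10, 20)
y = 2.0 * x + 1.0 + rng.normal(scale=3.0, size=x.size)

# A degree-4 polynomial has enough wiggle room for a W shape,
# and least squares will happily bend it toward the noise.
squiggle = np.polyval(np.polyfit(x, y, 4), x)
line = np.polyval(np.polyfit(x, y, 1), x)

# Adding flexibility never hurts the fit to *these* points, which is
# exactly why a close fit alone proves nothing about a real pattern.
assert np.sum((y - squiggle) ** 2) <= np.sum((y - line) ** 2)
```

The noise scale and polynomial degree here are arbitrary choices; try others and the squiggle will chase a different mirage.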
The fact that literally random junk can create the appearance of patterns like that should give us reason to be cautious about complex explanations, unless we have good causal justifications for thinking that complexity exists. You might think this is just a problem for the abstract models of stats geeks, but I wouldn’t be so sure. Pretty much any time you consider how one factor of reality might be related to others—in terms of scale, not just the binary “does this thing cause this other thing?” sense—you form a model, no matter how implicit. It doesn’t have to be a straight line, but the concept is the same. When you use a relatively small collection of observations to infer a pattern about a whole population, complexity might fool you.
To be concrete here for once, look at the woes of anyone who tries to apply the anecdotal advice of a friend or family member to their own life, yet ends up disappointed when it doesn’t work. On one level, this could be due to noise. Either the trusted counsel was lucky, or the recipient of the advice was unlucky. But the problem might be more persistent, no matter how many times they try again. If so, I think the issue more likely lies in the fact that the counsel has overfit their advice model to themselves. They’ve stumbled upon a way of relating threads of reality that makes perfect sense for their situation, precisely because it was designed based on the nuances of their own life.
A similar story seems to be going on with the disconnect between parts of modern civilization and the environments in which we evolved. (Speculation alert, though this is hardly original to me.) Is it any wonder that our genetic model, fit to (or at least surviving in) settings drastically different from those where many of us now live, and shaped above all to serve the prime directive of reproduction rather than grasping truth or relieving suffering, would fail outside those settings and objectives unless we made conscious efforts to the contrary? This is even worse than a lack of wide sampling on evolution’s part. It’s a change in the distribution itself.
To be clear, I’m absolutely not advocating a reversion to Paleolithic lifestyles, since as I just said, evolution serves masters that we shouldn’t necessarily expect to align with our welfare. Still, this gives us reason to be suspicious of intuitions and heuristics that plausibly originated from an evolutionary process blind to the data we now have. It’s relatively easy for us to see such disconnects when we look at cognitive biases, but our assessments of right and wrong could be subject to the very same dangers. If you go through your life basically never encountering a situation in which pushing someone to their death would save five people’s lives, and your moral model has settled on “don’t push people to their deaths” as a rule with no counterexamples in its experience, of course you’ll say no to pushing the fat man to stop the trolley. If you’re used to the forcible taking of money always being associated with threats of violence against the innocent, of course you’ll cry “taxation is theft” when money is redistributed from the rich to the poor. If you’ve basically never encountered a risk of low probability or a distant future arrival time, whose stakes could be disastrous for generations to come according to the judgments of people who have studied such a risk diligently, of course you’ll scoff at climate change or unaligned artificial intelligence.
Even if the distribution has changed. Even if the new data contradict the old model.
Lest you think I’m just trying to be cute here: in treating these as overfitting problems, we get more than a surface-level resemblance. We can predict that the remedies are the same as in statistics. I don’t know yet whether these remedies actually work, but I’m inviting the reader to join me in testing them.
One prescription, as I’ve mentioned, is penalizing complexity unless it pulls its weight in consistently explaining that which simplicity can’t. This is what statisticians call regularization, and it can be especially handy when you don’t have the luxury of just getting more data (so that it’s less likely you’ve failed to capture the properties of the general population). Practically, I’ve tried to apply this by making my career plans as robust as possible. I reject options that would require me to specialize in many particular skills that, while valuable if I stuck with such options, would probably be useless in most other domains that could suit me—unless the advantages of these options outweighed this penalty.
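For what it’s worth, the statistical version of this prescription is nearly a one-liner. Here is a minimal ridge-regression sketch (assuming NumPy; the data and penalty value are made up): complexity is still allowed, but every coefficient has to pay rent, so the penalized fit’s coefficients shrink.

```python
import numpy as np

rng = np.random.default_rng(2)

# Made-up noisy straight-line data, as before.
x = np.linspace(0, 1, 12)
y = 2.0 * x + 1.0 + rng.normal(scale=0.2, size=x.size)

# Design matrix for a degree-6 polynomial: plenty of room to overfit.
X = np.vander(x, 7)

def fit(X, y, penalty):
    # Least squares with an L2 (ridge) penalty: minimizing
    # ||y - X b||^2 + penalty * ||b||^2 has this closed form.
    return np.linalg.solve(X.T @ X + penalty * np.eye(X.shape[1]), X.T @ y)

unregularized = fit(X, y, 0.0)
regularized = fit(X, y, 1.0)

# The penalty shrinks the coefficient vector: complexity pays rent.
assert np.linalg.norm(regularized) < np.linalg.norm(unregularized)
```

(Real ridge regression usually leaves the intercept unpenalized and standardizes the predictors; this sketch skips both for brevity.)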
Another strategy, common in algorithms for visual recognition, is data augmentation. Jargon aside, this just means forming your model (of whatever it is you’re trying to understand/predict) with multiple different perspectives on each data point. The point here is to make sure that you’re extracting the relevant information from each piece of evidence in itself, rather than mistaking the particularities of your picture of that evidence for important patterns. Hilariously, visual recognition algorithms that don’t use this technique will often correctly categorize a picture of a cat, but if the picture is rotated ever so slightly or some pixels are replaced, they haven’t the faintest clue what to make of the image. So the solution is to copy the cat and intentionally rotate it. Conjuring real-life examples of this is trickier, but that makes me even more curious as to what we could learn from trying it out. Go forth and rotate your cats! If you changed a few peripheral details of something in a pool of things you’re trying to understand, which aspects would stay the same, and which ones would fade into irrelevance?
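In code, augmentation can be exactly as literal as the cat rotation. A minimal sketch, assuming NumPy and a toy array standing in for the cat photo:

```python
import numpy as np

def augment(image):
    # Extra "perspectives" on a single data point: here, a mirror
    # image plus the three 90-degree rotations.
    return [image, np.fliplr(image)] + [np.rot90(image, k) for k in (1, 2, 3)]

# A toy 3x3 "image" standing in for the cat picture. Its label
# shouldn't depend on orientation, so every view keeps the label.
cat = np.arange(9).reshape(3, 3)
training_set = [(view, "cat") for view in augment(cat)]

# One original example has become five training examples.
assert len(training_set) == 5
```

Production systems augment far more aggressively (small random rotations, crops, color shifts, added noise), but the principle is the one above: the label survives every transformation, so the model is forced to learn what actually makes the cat a cat.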
The last example I want to mention is the ensemble method. Data scientists sometimes improve the performance of predictive models by developing a bunch of separate models of subsets of the whole data (note that these can overlap), then averaging their predictions or taking a majority vote. Basically, the hope is that a large enough number of independent approaches to the same problem can smooth out each other’s weak spots, doing better than any individual approach would. That “independent” part is very important, though. Just as with the wisdom of the crowd, any number of people making the same systematic error won’t correct that error. This strategy especially fascinates me because it shows just how symbiotic the relationship between human and machine problem-solving can be. Ensemble methods were almost certainly inspired by humans trying the same thing and succeeding, but since the feedback we receive on our predictions is often slow and noisy, perhaps our machine friends can harness their speed and objectivity to offer some wisdom we’ve missed.
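A minimal sketch of the idea, using only the Python standard library (the data and the tiny slope-through-the-origin “model” are invented for illustration): fit many copies of the model on bootstrap resamples of the data, then average their answers.

```python
import random
import statistics

random.seed(3)

# Made-up data: y = 2x plus symmetric noise.
data = [(x, 2.0 * x + random.gauss(0, 1)) for x in range(1, 21)]

def fit_slope(sample):
    # One deliberately simple "model": the least-squares slope
    # of a line through the origin.
    return sum(x * y for x, y in sample) / sum(x * x for x, y in sample)

# The ensemble: fit the model on many bootstrap resamples
# (overlapping subsets of the data), then average the answers.
slopes = []
for _ in range(200):
    resample = [random.choice(data) for _ in data]
    slopes.append(fit_slope(resample))
ensemble_slope = statistics.mean(slopes)

# The average lands near the true slope of 2.
```

Each resample omits some points and repeats others, so each fitted model has slightly different blind spots; averaging them smooths those out. (This particular recipe is what statisticians call bagging; other ensemble methods vary the model rather than the data.)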