[Disclaimer: I should flag all the claims here with “…assuming you have a good model of the world.” A lot of things can go wrong when you try to use Bayesianism, or any other model-based decision-making method for that matter, on a terrible model. Garbage in, garbage out.]
Plenty of folks I admire love Bayesianism, a school of statistics and philosophy of knowledge. As do I, though as I’ll discuss here, it’s not magic, and sometimes Bayes gets credit for a principle that’s a lot broader. (Sometimes, though less commonly, this happens for frequentism too!)
So much ink has been spilled on Bayes that I doubt I’ll make any strictly original claims. But I can at least give a synthesis of both philosophical and statistical perspectives (most rabid philosophical Bayesians on the Internet are not statisticians, and vice versa), and touch on the significance of these ideas for practical decision-making.
First, the math. I’ll state Bayes’ rule in words:
The credibility of some claim given some evidence is proportional to the chance of seeing that evidence if the claim is true, times the credibility of the claim before considering the evidence.
I used the weaselly “proportional” partly because the sentence would’ve been too long and confusing otherwise. But more importantly, the factor that’s missing is one that doesn’t matter when you want to assess the odds of a claim relative to its denial. In fact, lots of popular methods for using this rule in complicated stats models don’t compute that missing factor, precisely for this reason.
The upshot is: to assess how much more credible a claim is than its denial, you adjust your starting judgment (the prior) in the direction of whichever of the two makes the evidence less surprising. And the more skewed this ratio of surprise (the likelihood ratio) is, the more you adjust. More on this later. But as one example, the fossil record is much more surprising in a universe where creationism is true than one where evolution is—awfully kind of God to not put any rabbit fossils in the Precambrian strata!—hence it makes sense to interpret the fossil record as evidence for evolution. Not that these are the only two options, but the principle is the same.
Here’s the symbolic way to state it, for your viewing pleasure: P(claim | evidence) ∝ P(evidence | claim) × P(claim). (The omitted constant of proportionality is 1 / P(evidence), the factor I mentioned above.)
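To make the odds form of the update concrete, here’s a minimal Python sketch. The function name and all the numbers are mine, purely for illustration:

```python
def posterior_odds(prior_odds, p_evidence_if_true, p_evidence_if_false):
    """Odds form of Bayes' rule: multiply the prior odds by the likelihood ratio."""
    return prior_odds * (p_evidence_if_true / p_evidence_if_false)

# Made-up numbers: start at even odds (1:1), and suppose the evidence is
# 20 times less surprising if the claim is true than if its denial is.
odds = posterior_odds(1.0, 0.10, 0.005)
print(odds)  # ≈ 20, i.e. posterior odds of about 20:1 (probability ≈ 0.95)
```

Note that only the ratio of the two likelihoods matters here, which is exactly why the missing normalizing factor can be ignored.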
Perhaps an obvious point: since this is just a mathematical fact, a tiny bit of algebra applied to the definition of conditional probability, there’s ironically nothing uniquely “Bayesian” about Bayes’ rule. Frequentists acknowledge it as much as Bayesians do. They definitely acknowledge the likelihood ratio principle in designing some hypothesis tests, though as we’ll see, the notorious P-value muddies the waters. (See RustyStatistician’s answer here.)
Where frequentists and Bayesians disagree, among other points, is on whether it makes sense to apply Bayes’ rule to make sense of quantities like “the probability that this claim [about some deterministic value] is true given that evidence.” On the Bayesian interpretation, probability is a measure of our uncertainty, of how plausible some proposition in question is. The frequentist interpretation, as I understand it, distinguishes how subjectively plausible a claim is (or “should” be) from its probability. Probability computations other than 0% or 100% are reserved for things that are—apologies for scare quote abuse—”inherently random/stochastic.” Technically this seems to be a feature of the propensity interpretation, but it’s also how frequentists in practice often object to Bayesian claims, and it’s consistent with frequentism when an event (supposedly) can’t be put in a class of relevantly similar events. By “relevantly similar” I mean, for instance, if I flip a quarter we might say that quarter is sufficiently similar to every other quarter that has been flipped before, that we can model it as having 50% probability of heads based on past flip frequencies.
There’s some intuitive appeal to this. Things like coin flips, poker hands, fluctuations of financial graphs around a trend … these seem like random things, while there seems to be nothing random about the fraction of Americans who support raising the minimum wage. There definitely doesn’t appear to be anything random about whether the 1000th digit of pi is even; it’s just a number, and you can look it up and confirm that it’s odd.
On closer inspection, though, this distinction doesn’t really make sense. The difference is not qualitative, but merely one of degree: how readily we could resolve our uncertainty if we gathered the data within our grasp. To explain that argument, we need to go where angels and undergrads fear to tread.
Why introductory stats is confusing
(Or at least, one reason it’s confusing. Another reason is that combinatorics is hard, but I digress.)
Here’s an innocent-enough claim: “There is a 95% probability that the true effect size in [insert scientific regression study here] is in the 95% confidence interval.”
The way confidence intervals are defined in frequentist hypothesis testing, this is false. Much to the chagrin of the students in the intro stats courses I’ve TA’d. The canonically “correct” statement in this paradigm is: “The probability that the true effect size in [insert scientific regression study here] is in the 95% confidence interval is either 0% or 100%, we just don’t know which. But if arbitrarily many 95% confidence intervals were constructed from this same sampling procedure, 95% of them would contain the true effect size.”
Clear as mud.
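The repeated-sampling claim in that canonical statement is at least easy to check by simulation. Here’s a Python sketch; all the numbers, including the known standard deviation, are made up for illustration:

```python
import random

# Construct many 95% confidence intervals for a known mean, and count
# what fraction of them contain it (known-variance normal intervals).
random.seed(0)
true_mean, sigma, n, trials = 3.0, 1.0, 100, 2000
hits = 0
for _ in range(trials):
    sample = [random.gauss(true_mean, sigma) for _ in range(n)]
    xbar = sum(sample) / n
    half_width = 1.96 * sigma / n ** 0.5
    if xbar - half_width <= true_mean <= xbar + half_width:
        hits += 1
print(hits / trials)  # ≈ 0.95: about 95% of the intervals cover the true mean
```

The frequentist statement is true of this long-run procedure; the question is what it licenses you to say about any one interval.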
Now, I’ll be the first to say that statistics does involve a lot of genuine subtleties that need to be respected if you’re going to understand it. “Linear” regression doesn’t mean you can’t have powers greater than one in your predictors. In a continuous distribution, every outcome has probability 0%. Simpson’s paradox is a thing. I could go on.
But I think this is one concept that is so needlessly confusing that students shouldn’t be blamed for not getting it.
Suppose our frequentist hero is an exit pollster who, on November 4, 2020, estimated a 95% confidence interval of [-0.01, 0.10] for Joe Biden’s popular vote margin of victory. If you had offered them a bet for 90 cents paying a dollar if the true margin ended up in this interval, I claim they’d be a fool not to take it—assuming they truly believed this interval, as something based on a representative sample of voters. Why? Well, the expected monetary value is +5 cents if you assign 95% credibility to the true margin being in [-0.01, 0.10]. (And I’m assuming that for money on the scale of cents, utility increases linearly with money.)
But the frequentist view apparently can’t make sense of this. It’s just one confidence interval, and the true margin either is or isn’t in that interval. The expected value of the bet is undefined, on this view. If your philosophy of probability is inconsistent with how you would ideally make decisions under uncertainty, the very sort of thing that motivates humanity to study probability in the first place, I think that’s a damning mark against such a philosophy.
We can see the same logic in the case of our “1000th digit of pi” question. If someone offers you a bet that costs 40 cents if that digit is odd and pays 60 if it’s even, then—unless you memorize digits of pi as a hobby, and assuming you are banned from checking the Internet, and further assuming we ignore the psychological evidence provided by the fact that this person was willing to offer the bet at all—clearly you should take it. If you’re very risk averse, fine, let’s modify it to let the law of large numbers do its magic. If you get offered the equivalent of this bet for every thousandth digit of pi up to the trillionth, for heaven’s sake, take those bets and become a multimillionaire on average. To be fair, the frequentist will not necessarily call you irrational here. But as far as I can tell, they will have no basis for agreeing with the very plausible claim that you are doing something positively rational, either. They have to suspend judgment. (See also: this video.)
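For what it’s worth, the arithmetic behind both wagers is exactly the kind of expected-value calculation a Bayesian can write down (amounts in cents; the credences are the ones assumed above):

```python
# Exit-poll bet: pay 90 cents, receive 100 if the true margin is in the
# interval, with 95% credibility assigned to that outcome.
ev_poll = 0.95 * 100 - 90
print(ev_poll)  # ≈ 5 cents

# Pi-digit bet: lose 40 cents if the digit is odd, win 60 if even,
# at 50/50 credence either way.
ev_digit = 0.5 * 60 - 0.5 * 40
print(ev_digit)  # ≈ 10 cents

# One such bet per thousandth digit up to the trillionth: a billion bets.
print(1_000_000_000 * ev_digit / 100)  # ≈ $100 million on average
```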
“But what about quantum mechanics? Radioactive decay?” you might protest. Surely these are genuine instances of objective randomness?
Well, not necessarily. The many-worlds interpretation of quantum mechanics is deterministic, in the sense that it says not even these phenomena are intrinsically random. It happens that an observer, confined to one branch, would find it physically impossible to predict the future within their branch if they knew a Theory of Everything, but this does not mean the cosmos taken as a whole is fundamentally unpredictable even from a God’s-eye view. I am certainly not a quantum physicist, but a sizable proportion of experts in this field endorse MWI, and for what it’s worth I find its mathematical consistency more convincing than objections that it is simply counterintuitive.
This might seem like hair-splitting, since I conceded that an individual simply cannot predict quantum events within the branch they experience. That sure sounds like deep randomness, no? But the key is that this is at bottom a limitation of the individual’s knowledge, or access to all the physical data. It’s not a property of the quantum events themselves.
And the same is true for chaos, coin flips, and so on. If you had to guess a coin flip as soon as the coin left the flipper’s thumb, with sufficiently sophisticated physics knowledge you would be able to beat the 50% success rate by at least a small margin. There doesn’t seem to be any reason this couldn’t be taken to an extreme of practical certainty, with arbitrarily accurate knowledge of the initial conditions. Given this, it’s not clear why a frequentist wouldn’t say of the coin flip as well, “Its probability of landing heads is either 0% or 100%, you just don’t know which.”
One response I could foresee is that they actually would agree with that statement when it comes to a real-world coin flip; “coin flips” in stats textbooks are just a convenient abstraction. To which I say, sure, but why doesn’t this reasoning equally apply to the exit poll case? It is a convenient abstraction that, to the pollster, Biden’s margin of victory was random, since they couldn’t practically know that margin without counting all the votes. We can’t explain this in terms of time, either, since I assume a frequentist will agree that a coin that has already been flipped, and is covered by someone’s hand, has 50% probability of being heads. If that assumption is wrong, well, this is my incredulous stare.
I should note: some of my credence in MWI comes from pre-existing credence in Bayesianism—the idea of “inherent randomness” just doesn’t make intuitive sense to me, upon examination of everything else in physics and analysis of the concept. So it would not make sense to count this point as an argument in favor of Bayesianism per se. But it does show that physics is not inconsistent with the Bayesian view of probability.
The incompleteness of P-values
Recall that The Rule for adjusting your belief in some claim versus its denial is to go in the direction of whichever of the two makes the evidence more likely. You have to consider both how likely the evidence is if the claim is true, and how likely it is if it’s false.
A P-value by definition only reports one of the two. It is nothing more or less than the probability of the evidence—or anything more “extreme” than that evidence, which sounds fuzzy but is generally clear from context—if a certain privileged claim called the null hypothesis is true.
I explain this at length here. In that piece, I lamented the frequency (pun fully intended) with which people get the definition of a P-value backwards, and yet, as with the confidence interval confusion, can I blame them? The probability of a hypothesis given evidence is a perfectly reasonable thing to want to quantify. It’s what science, indeed basic truth-seeking, is in the business of assessing. The probability of the evidence given the hypothesis, while obviously useful information, is not sufficient to make decisions.
Coming back to our fossil record example: imagine that a creationist argued, “Let’s be generous to the godless Darwinists, and call evolution the null hypothesis. Innocent until proven guilty. It would be absurdly unlikely for the fossils to be arranged in exactly the configuration we see, if evolution were true. Null hypothesis rejected!” This is technically true! Any particular arrangement of fossils is unlikely under basically any hypothesis other than the super-specific one that says, “The laws of physics are such that fossils will accumulate in exactly the pattern observed by contemporary paleontologists.” But of course, 1) creationism is not that specific hypothesis, and by considering the ratio of the very tiny probabilities these two hypotheses assign to the evidence, evolution wins on this score; and 2) Occam’s razor is not kind to that specific hypothesis, if we interpret it as more than just a tautology, i.e., as claiming that the laws of physics are “rigged” in some way that favors this figuration of fossils.
(To be fair, there’s a sense in which our straw creationist is not just wrong about the implications of this evidence, but also cheating. Real hypothesis tests that frequentist statisticians perform have some directionality to them.)
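To put toy numbers on the fossil example (these magnitudes are invented; only their ratio matters):

```python
# The probability of the exact observed evidence can be astronomically
# small under *both* hypotheses; the update depends only on the ratio.
p_evidence_given_evolution = 1e-100
p_evidence_given_creationism = 1e-130

likelihood_ratio = p_evidence_given_evolution / p_evidence_given_creationism
print(likelihood_ratio)  # roughly 1e30: the evidence favors evolution
# enormously, even though it is "absurdly unlikely" under evolution alone.
```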
The Problem with Popper
Much of the sympathy for frequentist philosophy seems to have its roots in a view that science is all about falsifying claims, not supporting them; deduction, not induction. The great philosopher Karl Popper championed this view, and I’d be surprised if the dear reader were not taught Popper’s philosophy implicitly (not by name, certainly!) in grade school science classes. I’ll let this paper speak for itself:
Some people may want to think about whether it makes scientific sense to “directly address whether the hypothesis is correct.” Some people may have already concluded that usually it does not, and be surprised that a statement on hypothesis testing that is at odds with mainstream scientific thought is apparently being advocated by the ASA leadership. Albert Einstein’s views on the scientific method are paraphrased by the assertion that, “No amount of experimentation can ever prove me right; a single experiment can prove me wrong” (Calaprice 2005). This approach to the logic of scientific progress, that data can serve to falsify scientific hypotheses but not to demonstrate their truth, was developed by Popper (1959) and has broad acceptance within the scientific community. … It is held widely, though less than universally, that only deductive reasoning is appropriate for generating scientific knowledge. Usually, frequentist statistical analysis is associated with deductive reasoning and Bayesian analysis is associated with inductive reasoning …
This essay makes the excellent counterpoint that scientists usually do not, and should not, take some evidence from experiment that appears inconsistent with theory H as deductive proof that H is false. What they actually do is consider background assumptions they’ve made in the experimental design and analysis (chief of which is “I made no mistakes in data collection and there was no measurement error”), weigh how plausible it is that one of those assumptions is false rather than H, and, if the track record of H is quite strong, often settle on rejecting one of the assumptions. If, however, the track record of the assumptions is even stronger, then it’s arguably time to throw H into the dustbin.
This should remind you of a prior.
Our Platonic ideal scientist might also ask, “Okay, even if H isn’t perfectly consistent with this evidence, is there another theory that does any better, without cheating/overfitting?”
This should remind you of a likelihood ratio, and the prior of the alternative.
Point being, even the most precise physical theories, like Newton’s laws, can’t be logically refuted by experiment without a tapestry of assumptions about humans’ ability to perfectly measure nature. And fields like biology make a whole host of theoretical claims that are not mathematically precise, and therefore aren’t subject to strict deduction. This doesn’t make them unscientific.
While some readers may take this as a pedantic point, strictly speaking “deduction” doesn’t even seem possible in anything other than pure mathematics and logic:
If all premises are true, the terms are clear, and the rules of deductive logic are followed, then the conclusion reached is necessarily true. (Wikipedia entry, “Deductive reasoning.”)
I cannot overstate how strong a standard this is. As anyone familiar with theoretical math knows, deduction only allows you to make the most modest, conservative claims, since you are constrained by rules that demand absolute certainty (up to human margin of error, and the objections of radical skeptics, anyway). This is the difference between a “theorem” and a “theory.” Doesn’t matter how many examples you test the Riemann hypothesis on, if you don’t prove it, no Millennium Prize for you. Unless you prove one of the others, anyway.
Which is why I find it bizarre to uphold frequentist philosophy as on the side of deduction. “P < 0.00001” doesn’t give you deductive certainty that the null hypothesis is false. Not even “P < 0.000000000000000000000001” would. (Technically, some null hypotheses are of the form “the effect of X on Y is exactly 0,” which you can certainly reject without any evidence in the first place, if the definition of effect in question pertains to values on a continuum. But that’s tangential to this point, and would render frequentist hypothesis testing unnecessary anyway.)
This is not a vice of frequentism. It’s a virtue. I claim that frequentists already in practice agree with Bayesians that deduction is not feasible in science. What they seem to be doing is an approximation to (an attempt at) deduction. In everyday language, you can round off “this is extremely unlikely if H is true” to “this is impossible if H is true,” and “deduce” that H is false. And the appeal of this strategy to the frequentist is that it doesn’t require claims about probabilities of hypotheses, other than 0%. But this isn’t really deduction, nor is it the proper method of induction either, and I can only wonder how many frequentists would be more open to Bayesianism if they embraced this. Induction is not epistemological anarchy, Hume notwithstanding. We have a nice mathematical formalism, Bayes’ rule, telling us how to do induction just fine.
Bayesianism is accused of being too subjective, since probabilities depend on priors that are at the researcher’s discretion. Gelman has a decent reply to this:
The prior distribution requires information and user input, that’s for sure, but I don’t see this as being any more “subjective” than other aspects of a statistical procedure, such as the choice of model for the data (for example, logistic regression) or the choice of which variables to include in a prediction, the choice of which coefficients should vary over time or across situations, the choice of statistical test, and so forth.
Yet, as elegantly explained in the dialogue I linked at the start, the frequentist P-value for a given hypothesis test can depend on the subjective intentions of the researcher. For example, the P-value of a sequence of coin flips “HHHHT,” relative to the null hypothesis “this coin is fair,” depends on whether you resolved to flip the coin exactly 5 times or stop at the first tails. Or stop at the first “HT.” The adjusted P-value that won’t get your paper rejected depends on how many models you truly, deeply decided to test, and are trusted on your honor to report honestly. (Even I, fan of Bayes that I am, was a bit surprised to learn that multiple testing isn’t really a problem for Bayesian inference.)
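The coin-flip example is concrete enough to compute. With data HHHHT and a fair-coin null, here is a sketch of two of the designs, where “extreme” means at least 4 heads out of 5 flips, or needing at least 5 flips to see the first tails, respectively:

```python
from math import comb

# Design A: flip exactly 5 times; extreme = 4 or more heads.
p_value_fixed_n = sum(comb(5, k) for k in (4, 5)) / 2**5
print(p_value_fixed_n)  # 0.1875

# Design B: flip until the first tails; extreme = needing 5 or more
# flips, i.e. the first 4 flips all coming up heads.
p_value_stop_at_tails = (1 / 2) ** 4
print(p_value_stop_at_tails)  # 0.0625

# The likelihood of the exact observed sequence -- all a Bayesian
# update uses -- is identical under both designs:
likelihood = (1 / 2) ** 5
print(likelihood)  # 0.03125
```

Same data, same null hypothesis, different P-values; the Bayesian answer doesn’t move.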
If your tool for establishing scientific truth can be gamed so systematically, and avoiding this gaming dissuades researchers from asking as many questions as they please, that speaks to something perverse about the tool.
Moreover, many frequentist procedures are equivalent to Bayesian ones with an uninformative, or “flat,” prior. In the context of a study about the sizes of drug effects on some health outcome, this would be the prior that models all effect sizes as equally likely. Does that model sound at all plausible to you? Do you expect, before considering the evidence contained in a given study, that it’s just as likely a dose of a new antidepressant instantly and permanently cures depression as that it makes the user feel moderately better for a day? Of course not. By considering the record of similar drugs, you probably expect a bit more than a placebo’s worth of effect (or, heck, not even a bit more), and not much variance around that mean. This is not a bald, unscientific assumption or fuzzy subjectivity. It’s an entirely reasonable summary of previous relevant evidence and common sense, that is, a prior.
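As a sketch of how much an informative prior can matter, here is a standard normal-normal conjugate update in Python; the function name, effect sizes, and variances are all invented for illustration:

```python
def posterior_normal(prior_mean, prior_var, obs_mean, obs_var):
    """Normal-normal conjugate update: a precision-weighted average."""
    post_var = 1 / (1 / prior_var + 1 / obs_var)
    post_mean = post_var * (prior_mean / prior_var + obs_mean / obs_var)
    return post_mean, post_var

# Skeptical prior: effects of drugs like this cluster near zero (sd 0.2).
# A study then reports a large estimated effect, 0.8, with sd 0.4.
mean, var = posterior_normal(prior_mean=0.0, prior_var=0.04,
                             obs_mean=0.8, obs_var=0.16)
print(round(mean, 2))  # 0.16: the prior pulls the headline estimate way down
# With a "flat" prior (prior_var -> infinity), you'd just get back 0.8.
```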
As that dialogue also states, proponents of Bayesianism are not recommending that researchers just report the final answer of their prior-times-update-factor computation and call it a day. The sensible standard is to require researchers to report the likelihoods they use in their analysis. And if you, intrepid reader, disagree with their choice of prior, it’s your prerogative to take a prior that makes more sense to you, and update it with the evidence.
“Why should I care?”
You’re probably not a statistician, scientist, or anyone else who uses this stuff in your day job. (My prior says so!)
So what’s at stake for you?
For one, these are some compelling reasons to be generally skeptical of claims backed up by P-values alone. Including negative claims! “P > 0.05” does not mean “no evidence for a decision-relevant effect.” This has some pretty far-reaching implications for assessing evidence about public health, policy, charity, and such.
But I think this stuff is important at a more general level. As incredibly simple as this idea is—that something counts as “evidence for” a claim to the extent (and only to the extent) that it is more likely under that claim than its denial—it’s all too easy not to abide by it when evaluating non-deductive arguments, or incorporating new information. Having the mathematical concept on hand helps me try to follow the evidence where it actually leads. There’s some evidence that Bayesian reasoning is partly responsible for the success of the world’s best predictors.
I suspect this could have crucial applications in philosophy, too. Many philosophical arguments rely on the strength of intuitions in favor of different ideas. I do think intuition is basically all we’ve got as the bedrock of, for example, ethics. But not all intuitions are created equal. An important question to ask of any given intuition is, “Would I expect people to believe this even in a world where it was false?” If so, all else equal, the fact that you find an intuition compelling is not particularly good evidence for that intuition. I highly recommend that the reader try cross-examining their beliefs with this standard.
One area where I’ve found this exercise enlightening is in moral consideration of animals. In a world where animals hypothetically did have all the features we’d consider necessary for a being to be morally important, would it be surprising for humans to nonetheless feel that we are overwhelmingly more important than animals—creatures that cannot protest verbally against the harm we inflict on them, and that are so genetically distant from us that helping them provides “us” basically no benefit in inclusive fitness? No. This suggests that the fact that most of us consider it weird to care as much about a chicken’s suffering as a human’s is weak evidence, if evidence at all, that we wouldn’t care as much upon careful moral reflection.
Then there’s The Future. I will give the question of cluelessness a more thorough treatment in a forthcoming post. But as a teaser, the Bayesian approach at least gives us a way to confront uncertainty about radically unprecedented events. We don’t have to simply say no probabilities can be assigned because the events aren’t repetitions of near-identical processes. This certainly isn’t to say we should be cavalier about making decisions based on extremely imperfect estimates, only that we aren’t totally hopeless. Choosing to ignore that which you can’t confidently predict isn’t noble skepticism; it’s an implicit prediction that the thing you’re ignoring washes out in the end, which is a strong assumption!
Appendix: Why you might not want to use Bayesian statistical methods anyway
This essay is fundamentally about philosophical frequentism vs Bayesianism, not the clusters of statistical methods that have been labeled frequentist vs Bayesian. I don’t have much of a problem with many “frequentist” methods, any more than I have a problem as a pure consequentialist with following rules. Null hypothesis significance testing, as we’ve seen, is a glaring exception. Non-Bayesian methods can be super useful heuristics when, as is often the case, computing the full Bayesian solution isn’t feasible. Andrew Gelman at Columbia University is as Bayesian as they come, and he agrees with this assessment.
In my research territory of online and reinforcement learning, for instance, “frequentist regret bound” is a common term for a guarantee about the performance of an algorithm that holds universally across some class of problems. There’s nothing inherently frequentist about this. I’ve recently done some work building on a paper that uses a Bayesian approach and proves exactly this sort of guarantee. (For the record, this is why I strongly prefer to call it a “worst-case regret bound.”) (For the second record, so far Bayesian methods have been found superior in reinforcement learning, but who’s counting?)
Though I don’t agree with every point in this Fervent Defense of Frequentist Statistics, it’s worth a read.