This is a linkpost for https://confusopoly.com/2019/04/03/the-optimizers-curse-wrong-way-reductions/.
Summary
I spent about two and a half years as a research analyst at GiveWell. For most of my time there, I was the point person on GiveWell’s main cost-effectiveness analyses. I’ve come to believe there are serious, underappreciated issues with the methods the effective altruism (EA) community at large uses to prioritize causes and programs. While effective altruists approach prioritization in a number of different ways, most approaches involve (a) roughly estimating the possible impacts funding opportunities could have and (b) assessing the probability that possible impacts will be realized if an opportunity is funded.
I discuss the phenomenon of the optimizer’s curse: when assessments of activities’ impacts are uncertain, engaging in the activities that look most promising will tend to have a smaller impact than anticipated. I argue that the optimizer’s curse should be extremely concerning when prioritizing among funding opportunities that involve substantial, poorly understood uncertainty. I further argue that proposed Bayesian approaches to avoiding the optimizer’s curse are often unrealistic. I maintain that it is a mistake to try to understand all uncertainty in terms of precise probability estimates.
I go into a lot more detail in the full post.
I'm feeling confused.
I basically agree with this entire post. Over many years of conversations with GiveWell staff or former staff, I can't readily recall speaking to anyone affiliated with GiveWell who I'd expect to substantively disagree with the suggestions in this post. But you obviously feel that some (reasonably large?) group of people disagrees with some (reasonably large?) part of your post. I understand a reluctance to give names, but focusing on GiveWell specifically, since much of their thinking on these matters is public record here, can you identify what specifically in that post or the linked extra reading you disagree with? Or are you talking to EAs-not-at-GiveWell? Or do you think GiveWell's blog posts are reasonable but their internal decision-making process nonetheless commits the errors they warn against? Or some possibility I'm not considering?
I particularly note that your first suggestion to 'entertain multiple models' sounds extremely similar to 'cluster thinking' as described and advocated for here, and the other suggestions also don't sound like things I would expect GiveWell to disagree with. This leaves me at a bit of a loss as to what you would like to see change, and how you would like to see it change.
Thanks for raising this.
To be clear, I'm still a huge fan of GiveWell. GiveWell only shows up in so many examples in my post because I'm so familiar with the organization.
I mostly agree with the points Holden makes in his cluster thinking post (and his other related posts). Despite that, I still have serious reservations about some of the decision-making strategies used both at GW and in the EA community at large. It could be that Holden and I mostly agree, but other people take different positions. It could be that Holden and I agree about a lot of things at a high-level but then have significantly different perspectives about how those things we agree on at a high-level should actually manifest themselves in concrete decision making.
For what it's worth, I do feel like the page you linked to from GiveWell's website may downplay the role cost-effectiveness plays in its final recommendations (though GiveWell may have a good rebuttal).
In a response to Taymon's comment, I left a specific example of something I'd like to see change. In general, I'd like people to be more reluctant to brute-force push their way through uncertainty by putting numbers on things. I don't think people need to stop doing that entirely, but I think it should be done while keeping in mind something like: "I'm using lots of probabilities in a domain where I have no idea if I'm well-calibrated...I need to be extra skeptical of whatever conclusions I reach."
Fair enough. I remain in almost-total agreement, so I guess I'll just have to try and keep an eye out for what you describe. But based on what I've seen within EA, which is evidently very different to what you've seen, I'm more worried about little-to-zero quantification than excessive quantification.
That's interesting—and something I may not have considered enough. I think there's a real possibility that there could be excessive quantification in some areas of EA but not enough of it in other areas.
For what it's worth, I may have made this post too broad. I wanted to point out a handful of issues that I felt all kind of fell under the umbrella of "having excessive faith in systematic or mathematical thinking styles." Maybe I should have written several posts on specific topics that get at areas of disagreement a bit more concretely. I might get around to those posts at some point in the future.
FWIW, as someone who was and is broadly sympathetic to the aims of the OP, my general impression agrees with "excessive quantification in some areas of EA but not enough of it in other areas."
(I think the full picture has more nuance than I can easily convey, e.g. rather than 'more vs. less quantification' it often seems more important to me how quantitative estimates are being used - what role they play in the overall decision-making or discussion process.)
Can you elaborate on which areas of EA might tend towards each extreme? Specific examples (as vague as needed) would be awesome too, but I understand if you can't give any.
Unfortunately I find it hard to give examples that are comprehensible without context that is either confidential or would take me a lot of time to describe. Very very roughly I'm often not convinced by the use of quantitative models in research (e.g. the "Racing to the Precipice" paper on several teams racing to develop AGI) or for demonstrating impact (e.g. the model behind ALLFED's impact which David Denkenberger presented in some recent EA Forum posts). OTOH I often wish that for organizational decisions or in direct feedback more quantitative statements were being made -- e.g. "this was one of the two most interesting papers I read this year" is much more informative than "I enjoyed reading your paper". Again, this is somewhat more subtle than I can easily convey: in particular, I'm definitely not saying that e.g. the ALLFED model or the "Racing to the Precipice" paper shouldn't have been made - it's more that I wish they would have been accompanied by a more careful qualitative analysis, and would have been used to find conceptual insights and test assumptions rather than as a direct argument for certain practical conclusions.
I'd also be excited to see more people in the EA movement doing the sort of work that I think would put society in a good position for handling future problems when they arrive. E.g., I think a lot of people who associate with EA might be awfully good at pushing for progress in metascience/open science or promoting a free & open internet.
A recent example of this happening might be EA LTF Fund grants to various organizations trying to improve societal epistemic rationality (e.g. by supporting prediction markets)
I haven't had time yet to think about your specific claims, but I'm glad to see attention being paid to this issue. Thank you for contributing to what I suspect is an important discussion!
You might be interested in the following paper which essentially shows that under an additional assumption the Optimizer's Curse not only makes us overestimate the value of the apparent top option but in fact can make us predictably choose the wrong option.
The crucial assumption roughly is that the reliability of our assessments varies sufficiently much between options. Intuitively, I'm concerned that this might apply when EAs consider interventions across different cause areas: e.g., our uncertainty about the value of AI safety research is much larger than our uncertainty about the short-term benefits of unconditional cash transfers.
(See also the part on the Optimizer's Curse and endnote [6] on Denrell and Liu (2012) in this post by me, though I suspect it won't teach you anything new.)
Kind of an odd assumption that dependence on luck varies from player to player.
If we are talking about charity evaluations then reliability can be estimated directly so this is no longer a predictable error.
Can you expand on how you would directly estimate the reliability of charity evaluations? I feel like there are a lot of realistic situations where this would be extremely difficult to do well.
I mean do the adjustment for the optimizer's curse. Or whatever else is in that paper.
I think talk of doing things "well" or "reliably" should be tabooed from this discussion, because no one has any coherent idea of what the threshold for 'well enough' or 'reliable enough' means or is in this context. "Better" or "more reliable" makes sense.
Intuitively, it strikes me as appropriate for some realistic situations. For example, you might try to estimate the performance of people based on quite different kinds or magnitudes of inputs; e.g. one applicant might have a long relevant track record, for another one you might just have a brief work test. Or you might compare the impact of interventions that are backed by very different kinds of evidence - say, a RCT vs. a speculative, qualitative argument.
Maybe there is something I'm missing here about why the assumption is odd, or perhaps even why the examples I gave don't have the property required in the paper? (The latter would certainly be plausible as I read the paper a while ago, and even back then not very closely.)
Hmm. This made me wonder whether the paper's results depend on the decision-maker being uncertain about which options have been estimated reliably vs. unreliably. It seems possible that the effect could disappear if the reliability of my estimates varies but I know that the variance of my value estimate for option 1 is v_1, the one for option 2 is v_2, etc. (even if the v_i vary a lot). (I don't have time to check the paper or get clear on this, I'm afraid.)
Is this what you were trying to say here?
Thanks Max! That paper looks interesting—I'll have to give it a closer read at some point.
I agree with you that how the reliability of assessments varies between options is crucial.
Can you give an example of a time when you believe that the EA community got the wrong answer to an important question as a result of not following your advice here, and how we could have gotten the right answer by following it?
Sure. To be clear, I think most of what I'm concerned about applies to prioritization decisions made in highly-uncertain scenarios. So far, I think the EA community has had very few opportunities to look back and conclusively assess whether highly-uncertain things it prioritized turned out to be worthwhile. (Ben makes a similar point at https://www.lesswrong.com/posts/Kb9HeG2jHy2GehHDY/effective-altruism-is-self-recommending.)
That said, there are cases where I believe mistakes are being made. For example, I think mass deworming in areas where almost all worm infections are light cases of trichuriasis or ascariasis is almost certainly not among the most cost-effective global health interventions.
Neither trichuriasis nor ascariasis appears to have common/significant/easily-measured symptoms when infections are light (i.e., when there are not many worms in an infected person's body). To reach the conclusion that treating these infections has a high expected value, extrapolations are made from the results of a study that had some weird features and occurred in a very different environment (an environment with far heavier infections and additional types of worm infections). When GiveWell makes its extrapolations, lots of discounts, assumptions, probabilities, etc. are used. I don't think people can make this kind of extrapolation reliably (even if they're skeptical, smart, and thinking carefully). When unreliable estimates are combined with an optimization procedure, I worry about the optimizer's curse.
Someone who is generally skeptical of people's ability to productively use models in highly-uncertain situations might instead survey experts about the value of treating light trichuriasis & ascariasis infections. Faced with the decision of funding either this kind of deworming or a different health program that looked highly-effective, I think the example person who ran surveys would choose the latter.
Just to be clear, much of the deworming work supported by people in the EA community happens in areas where worm infections are more intense or are caused by worm species other than Trichuris & Ascaris. However, I believe a non-trivial amount of deworming done by charities supported by the EA community occurs in areas w/ primarily light infections from those worms.
FYI I asked about this on GiveWell's most recent open thread, Josh replied:
There's actually a thing called the Satisficer's Curse (pdf) which is even more general:
Also, if your criterion for choosing an intervention is how frequently it still looks good under different models and priors, as people seem to be suggesting in lieu of EV maximization, you will still get similar curses - they'll just apply to the number of models/priors, rather than the number in the EV estimate.
Isn't this essentially a reformulation of the common EA argument that the most high-impact ideas are likely to be "weird-sounding" or unintuitive? I think it's a strong point in favor of explicit modelling, but I want to avoid double-counting evidence if they are in fact similar arguments.
Nah, I'm just saying that a curse applies to every method, so it doesn't tell us to use a particular method. I'm excluding arguments from the issue, not bringing them in. So if we were previously thinking that weird causes are good and common sense/model pluralism aren't useful, then we should just stick to our guns. But if we were previously thinking that common sense/model pluralism are generally more accurate anyway, then we should stick with them.
Well it does not change the ordering of options. You're kind of doing a wrong-way reduction here: you're taking the question of what project should I support and "reducing" it to literal quantitative estimation of effectiveness. Optimizer's curse only matters when comparing better-understood projects to worse-understood projects, but you are talking about "prioritizing among funding opportunities that involve substantial, poorly understood uncertainty".
We can specify a prior distribution.
Well no, but it's better if you do. That Deutsch quote seems to say that it could allow people to take bad reasons and overstate them; that sounds like a problem with thinking in general. And there is no reason to assume that probabilistic decision makers will overestimate as opposed to underestimate. There have been many times when I had a vague, scarce prejudice/suspicion based on personal ignorance, and deeper analysis of reliable sources showed that I was correct and underconfident. If you think your vague suspicions aren't useful, then just don't trust them! Every system of thinking is going to bottom out in "be rational, don't be irrational" at some point, so this is not a problem with probabilism in particular.
The reason it's better is that it allows better rigor and accuracy. For instance, look how this post revolves around the optimizer's curse. Here's a question: how are you going to adjust for the optimizer's curse if you don't use probability (implicitly or explicitly)? And if people weren't using probabilistic decision theory, no one would have discovered the optimizer's curse in the first place!
Hey! I didn't consent to being included in your post!!!
Here's what it means, formally: given that I have an equal desire to be right about the existence of God and the nonexistence of God, and given some basic assumptions about my money and my desire for money, I would make a bet with at most 50:1 odds that all-powerful-God exists.
But in Bayesian decision theory, they aren't on the same footing. They have very different levels of robustness. They are not well-grounded and this matters for how readily we update away from them. Is the notion of robustness inadequate for solving some problem here? In the Norton paper that you cite later on this point, I ctrl-F for "robust" and find nothing.
All of your suggestions make perfect sense under standard, Bayesian probability and decision theory. As stated, they are kind of platitudinous. Moreover, it's not clear to me that abandoning these principles in favor of some deeper concept of ignorance actually helps motivate any of your recommendations. Why, exactly, is it important that I embrace model skepticism for instance - just because I have decided to abandon probabilities? Does abandoning probabilities reduce the variance in the usefulness of different models? It can't, actually, because without probabilities the variance is going to be undefined.
In practice, I haven't done things with multiple quantitative models because (a) models are tough to build, and (b) a good model accommodates all kinds of uncertainty anyway. It's never been the case where I've found some new information/ideas, decided to update my model, and then realized "uh oh, I can't do this in this model." I can always just add new calculations for the new considerations, and it becomes a bit kludgy but still seems more accurate. So yeah this is good in theory but the practical value seems very limited. To be sure, I haven't really tried it yet.
If we want to test the accuracy of a model, we need to test a statistically significant number of the things predicted by the model. It's not sufficient for us to donate to AMF, see that AMF seems to work pretty well (or not), and then judge Givewell accordingly. We need to see whether Givewell's ordering of multiple charities holds.
Testing works well in some contexts. In others it's just unrealistic.
Improving social capacity tends to work better when society is trusted to actually do the right thing.
But these are exactly the things that you are objecting to. Where do you think probability estimates of deeply uncertain things come from? If there's some disagreement here about the actual reliability of things like intuition and tradition, it hasn't been made explicit. Instead, you've just said that such things should not be expressed in the form of quantitative probabilities.
Thanks for the detailed comment!
I expect we’ll remain in disagreement, but I’ll clarify where I stand on a couple of points you raised:
Certainly, the optimizer’s curse may be a big deal when well-understood projects are compared with poorly-understood projects. However, I don’t think it’s the case that all projects involving "substantial, poorly understood uncertainty" are on the same footing. Rather, each project is on its own footing, and we're somewhat ignorant about how firm that footing is.
Yes, absolutely. What I worry about is how reliable those priors will be. I maintain that, in many situations, it’s very hard to defend any particular prior.
This gets at what I’m really worried about! Let’s assume decisionmakers coming up with probabilistic estimates to assess potential activities don’t have a tendency to overestimate or underestimate. However, once a decisionmaker has made many estimates, there is reason to believe the activities that look most promising likely involve overestimates (because of the optimizer’s curse).
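To make that concrete, here's a toy simulation (a sketch in Python with arbitrary numbers, not tied to any particular charity or real estimate):

```python
import numpy as np

rng = np.random.default_rng(0)
n_options, n_trials = 20, 10_000

gap = 0.0
for _ in range(n_trials):
    true_values = rng.normal(0.0, 1.0, n_options)               # each option's true impact
    estimates = true_values + rng.normal(0.0, 1.0, n_options)   # unbiased but noisy estimates
    best = np.argmax(estimates)                                  # fund whatever looks best
    gap += estimates[best] - true_values[best]

print(f"average (estimate - true value) for the chosen option: {gap / n_trials:.2f}")
# Every individual estimate is unbiased, yet the option that comes out on top
# is systematically an overestimate of its own true value.
```

The more options you compare and the noisier the estimates, the bigger that gap gets.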
This is a great question!
Rather than saying, "This is a hard problem, and I have an awesome solution no one else has proposed," I'm trying to say something more like, "This is a problem we should acknowledge! Let's also acknowledge that it's a damn hard problem and may not have an easy solution!"
That said, I think there are approaches that have promise (but are not complete solutions):
-Favoring opportunities that look promising under multiple models.
-Being skeptical of opportunities that look promising under only a single model.
-Learning more (if that can cause probability estimates to become less uncertain & hazy).
-Doing more things to put society in a good position to handle problems when they arise (or become apparent) instead of trying to predict problems before they arise (or become apparent).
This is how a lot of people think about statements of probability, and I think that’s usually reasonable. I’m concerned that people are sometimes accidentally equivocating between: “I would bet on this with at most 50:1 odds” and “this is as likely to occur as a perfectly fair 50-sided die being rolled and coming up ‘17’”
The notion of robustness points in the right direction, but I think it’s difficult (perhaps impossible) to reliably and explicitly quantify robustness in the situations we’re concerned about.
"Footing" here is about the robustness of our credences, so I'm not sure that we can really be ignorant of them. Yes different projects in a poorly understood domain will have different levels of poorly understood uncertainty, but it's not clear that this is more important than the different levels of uncertainty in better-understood domains (e.g. comparisons across Givewell charities).
What do you mean by reliable?
Yes, but it's very hard to attack any particular prior as well.
Yes I know but again it's the ordering that matters. And we can correct for optimizer's curse, and we don't know if these corrections will overcorrect or undercorrect.
"The problem" should be precisely defined. Identifying the correct intervention is hard because the optimizer's curse complicates comparisons between better- and worse-substantiated projects? Yes we acknowledge that. And you are not just saying that there's a problem, you are saying that there is a problem with a particular methodology, Bayesian probability. That is very unclear.
This is just a generic bucket of "stuff that makes estimates more accurate, sometimes" without any more connection to the optimizer's curse than to any other facets of uncertainty.
Let's imagine I make a new group whose job is to randomly select projects and then estimate each project's expected utility as accurately and precisely as possible. In this case the optimizer's curse will not apply to me. But I'll still want to evaluate things with multiple models, learn more and use proxies such as social capacity.
What is some advice that my group should not follow, that Givewell or Open Philanthropy should follow? Aside from the existing advice for how to make adjustments for the Optimizer's Curse.
If you want, you can define some set of future updates (e.g. researching something for 1 week) and specify a probability distribution for your belief state after that process. I don't think that level of explicit detail is typically necessary though. You can just give a rough idea of your confidence level alongside likelihood estimates.
I don't think this leaves you in a good position if your estimates and rankings are very sensitive to the choice of "reasonable" priors. Chris illustrated this in his post at the end of part 2 (with the atheist example), and in part 3.
You could try to choose some compromise between these priors, but there are multiple "reasonable" ways to compromise. You could introduce a prior on these priors, but you could run into the same problem with multiple "reasonable" choices for this new prior.
What do you mean by "a good position"?
Ah, I guess we'll have to switch to a system of epistemology which doesn't bottom out in unproven assumptions. Hey hold on a minute, there is none.
I'm getting a little confused about what sorts of concrete conclusions we are supposed to take away from here.
I'm not saying we shouldn't use priors or that they'll never help. What I am saying is that they don't address the optimizer's curse just by including them, and I suspect they won't help at all on their own in some cases.
Maybe checking sensitivity to priors and further promoting interventions whose value depends less on them (among some set of "reasonable" priors) would help. You could see this as a special case of Chris's suggestion to "Entertain multiple models".
Perhaps you could even use an explicit model to combine the estimates or posteriors from multiple models into a single one in a way that either penalizes sensitivity to priors or gives less weight to more extreme estimates, but a simpler decision rule might be more transparent or otherwise preferable. From my understanding, GiveWell already uses medians of its analysts' estimates this way.
I get your point, but the snark isn't helpful.
You seem to be using "people all agree" as a stand-in for "the optimizer's curse has been addressed". I don't get this. Addressing the optimizer's curse has been mathematically demonstrated. Different people can disagree about the specific inputs, so people will disagree, but that doesn't mean they haven't addressed the optimizer's curse.
I think combining into a single model is generally appropriate. And the sub-models need not be fully, explicitly laid out.
Suppose I'm demonstrating that poverty charity > animal charity. I don't have to build one model assuming "1 human = 50 chickens", another model assuming "1 human = 100 chickens", and so on.
Instead I just set a general standard for how robust my claims are going to be, and I feel sufficiently confident saying "1 human = at least 60 chickens", so I use that rather than my mean expectation (e.g. 90).
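Here's a toy version of what I mean; every number below (costs, moral weights) is a hypothetical placeholder rather than a real estimate:

```python
# Toy check: does "poverty charity beats animal charity" survive the robust
# lower bound on the moral weight, or only the mean expectation?
human_equivalents_per_dollar_poverty = 1 / 4000   # hypothetical placeholder
chickens_helped_per_dollar_animal = 1 / 80        # hypothetical placeholder

for label, chickens_per_human in [("mean expectation", 90), ("robust lower bound", 60)]:
    poverty_in_chicken_units = human_equivalents_per_dollar_poverty * chickens_per_human
    winner = "poverty" if poverty_in_chicken_units > chickens_helped_per_dollar_animal else "animal"
    print(f"{label} (1 human = {chickens_per_human} chickens): {winner} charity wins")
# If the conclusion already holds at the robust bound of 60, I don't need
# separate models for 60, 90, 100, and so on.
```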
Maybe we're thinking about the optimizer's curse in different ways.
The proposed solution of using priors just pushes the problem to selecting good priors. It's also only a solution in the sense that it reduces the likelihood of mistakes happening (discovered in hindsight, and under the assumption of good priors), but not provably to its minimum, since it does not eliminate the impacts of noise. (I don't think there's any complete solution to the optimizer's curse, since, as long as estimates are at least somewhat sensitive to noise, "lucky" estimates will tend to be favoured, and you can't tell in principle between "lucky" and "better" interventions.)
If you're presented with multiple priors, and they all seem similarly reasonable to you, but depending on which ones you choose, different actions will be favoured, how would you choose how to act? It's not just a matter of different people disagreeing on priors, it's also a matter of committing to particular priors in the first place.
If one action is preferred with almost all of the priors (perhaps rare in practice), isn't that a reason (perhaps insufficient) to prefer it? To me, using this could be an improvement over just using priors, because I suspect it will further reduce the impacts of noise, and if it is an improvement, then just using priors never fully solved the problem in practice in the first place.
I agree with the rest of your comment. I think something like that would be useful.
The problem of the optimizer's curse is that the EV estimates of high-EV-options are predictably over-optimistic in proportion with how unreliable the estimates are. That problem doesn't exist anymore.
The fact that you don't have guaranteed accurate information doesn't mean the optimizer's curse still exists.
Well there is, just spend too much time worrying about model uncertainty and other people's priors and too little time worrying about expected value estimation. Then you're solving the optimizer's curse too much, so that your charity selections will be less accurate and predictably biased in favor of low EV, high reliability options. So it's a bad idea, but you've solved the optimizer's curse.
Maximize the expected outcome over the distribution of possibilities.
What do you mean by "the priors"? Other people's priors? Well if they're other people's priors and I don't have reason to update my beliefs based on their priors, then it's trivially true that this doesn't give me a reason to prefer the action. But you seem to think that other people's priors will be "reasonable", so obviously I should update based on their priors, in which case of course this is true - but only in a banal, trivial sense that has nothing to do with the optimizer's curse.
Hm? You're just suggesting updating one's prior by looking at other people's priors. Assuming that other people's priors might be rational, this is banal - of course we should be reasonable, epistemically modest, etc. But this has nothing to do with the optimizer's curse in particular, it's equally true either way.
I ask the same question I asked of OP: give me some guidance that applies for estimating the impact of maximizing actions that doesn't apply for estimating the impact of randomly selected actions. So far it still seems like there is none - aside from the basic idea given by Muelhauser.
Is the problem the lack of guaranteed knowledge about charity impacts, or is the problem the optimizer's curse? You seem to (incorrectly) think that chipping away at the former necessarily means chipping away at the latter.
It's always worth entertaining multiple models if you can do that at no cost. However, doing that often comes at some cost (money, time, etc). In situations with lots of uncertainty (where the optimizer's curse is liable to cause significant problems), it's worth paying much higher costs to entertain multiple models (or do other things I suggested) than it is in cases where the optimizer's curse is unlikely to cause serious problems.
I don't agree. Why is the uncertainty that comes from model uncertainty - as opposed to any other kind of uncertainty - uniquely important for the optimizer's curse? The optimizer's curse does not discriminate between estimates that are too high for modeling reasons, versus estimates that are too high for any other reason.
The mere fact that there's more uncertainty is not relevant, because we are talking about how much time we should spend worrying about one kind of uncertainty versus another. "Do more to reduce uncertainty" is just a platitude, we always want to reduce uncertainty.
I made a long top-level comment that I hope will clarify some problems with the solution proposed in the original paper.
This is a good point. Somehow, I think you’d want to adjust your posterior downward based on the set or the number of options under consideration, and on how unlikely the data that makes the intervention look good is. This is not really useful, since I don't know how much you should adjust these. Maybe there's a way to model this explicitly, but it seems like you'd be trying to model your selection process itself before you've defined it, and then you look for a selection process which satisfies some properties.
You might also want to spend more effort looking for arguments and evidence against each option the more options you're considering.
When considering a larger number of options, you could use some randomness in your selection process or spread funding further (although the latter will be vulnerable to the satisficer's curse if you're using cutoffs).
If I haven’t decided on a prior, and multiple different priors (even an infinite set of them) seem equally reasonable to me.
That's the basic idea given by Muelhauser. Corrected posterior EV estimates.
As opposed to equal effort for and against? OK, I'm satisfied. However, if I've done the corrected posterior EV estimation, and then my specific search for arguments-against turns up short, then I should increase my EV estimates back towards the original naive estimate.
As I recall, that post found that randomized funding doesn't make sense. Which 100% matches my presumptions, I do not see how it could improve funding outcomes.
I don't see how that would improve funding outcomes.
In Bayesian rationality, you always have a prior. You seem to be considering or defining things differently.
Here we would probably say that your actual prior exists and is simply some kind of aggregate of these possible priors, therefore it's not the case that we should leap outside our own priors in some sort of violation of standard Bayesian rationality.
+1
In conversations I've had about this stuff, it seems like the crux is often the question of how easy it is to choose good priors, and whether a "good" prior is even an intelligible concept.
Compare Chris' piece ("selecting good priors is really hard!") with this piece by Luke Muehlhauser ("the optimizer's curse is trivial, just choose an appropriate prior!")
Before anything like a crux can be identified, complainants need to identify what a "good prior" even means, or what strategies are better than others. Until then, they're not even wrong - it's not even possible to say what disagreement exists. To airily talk about "good priors" or "bad priors", being "easy" or "hard" to identify, is just empty phrasing and suggests confusion about rationality and probability.
Hey Kyle, I'd stopped responding since I felt like we were well beyond the point where we were likely to convince one another or say things that those reading the comments would find insightful.
I understand why you think "good prior" needs to be defined better.
As I try to communicate (but may not quite say explicitly) in my post, I think that in situations where uncertainty is poorly understood, it's hard to come up with priors that are good enough that choosing actions based on explicit Bayesian calculations will lead to better outcomes than choosing actions based on a combination of careful skepticism, information gathering, hunches, and critical thinking.
As a real world example:
Venture capitalists frequently fund things that they're extremely uncertain about. It's my impression that Bayesian calculations rarely play into these situations. Instead, smart VCs think hard and critically and come to conclusions based on processes that they probably don't fully understand themselves.
It could be that VCs have just failed to realize the amazingness of Bayesianism. However, given that they're smart & there's a ton of money on the table, I think the much more plausible explanation is that hardcore Bayesianism wouldn't lead to better results than whatever it is that successful VCs actually do.
Again, none of this is to say that Bayesianism is fundamentally broken or that high-level Bayesian-ish things like "I have a very skeptical prior so I should not take this estimate of impact at face value" are crazy.
I interned for a VC, albeit a small and unknown one. Sure, they don't do Bayesian calculations, if you want to be really precise. But they make extensive use of quantitative estimates all the same. If anything, they are cruder than what EAs do. As far as I know, they don't bother correcting for the optimizer's curse! I never heard it mentioned. VCs don't primarily rely on the quantitative models, but other areas of finance do. If what they do is OK, then what EAs do is better. This is consistent with what finance professionals told me about the financial modeling that I did.
Plus, this is not about the optimizer's curse. Imagine that you told those VCs that they were no longer choosing which startups are best, instead they now have to select which ones are better-than-average and which ones are worse-than-average. The optimizer's curse will no longer interfere. Yet they're not going to start relying more on explicit Bayesian calculations. They're going to use the same way of thinking as always.
And explicit Bayesian calculation is rarely used by anyone anywhere. Humans encounter many problems which are not about optimizing, and they still don't use explicit Bayesian calculation. So clearly the optimizer's curse is not the issue. Instead, it's a matter of which kinds of cognition and calculation people are more or less comfortable with.
Explicit Bayesian calculation is a way of choosing actions based on a combination of careful skepticism, information gathering, hunches, and critical thinking. (With math too.)
I'm guessing you mean we should use intuition for the final selection, instead of quantitative estimates. OK, but I don't see how the original post is supposed to back it up; I don't see what the optimizer's curse has to do with it.
I'm struggling to understand how your proposed new group avoids the optimizer's curse, and I'm worried we're already talking past each other. To be clear, I don't believe there's something wrong with Bayesian methods in the abstract. Those methods are correct in a technical sense. They clearly work in situations where everything that matters can be completely quantified.
The position I'm taking is that the scope of real-world problems that those methods are useful for is limited because our ability to precisely quantify things is severely limited in many real-world scenarios. In my post, I try to build the case for why attempting Bayesian approaches in scenarios where things are really hard to quantify might be misguided.
Because I'm not optimizing!
Of course it is still the case that the highest-scoring estimates will probably be overestimates in my new group. The difference is, I don't care about getting the right scores on the highest-scoring estimates. Now I care about getting the best scores on all my estimates.
Or to phrase it another way, suppose that the intervention will be randomly selected rather than picked from the top.
Well yes, but I think the methods work better than anything else for all these scenarios.
This paper (Schuyler, J. R., & Nieman, T. (2007, January 1). Optimizer's Curse: Removing the Effect of this Bias in Portfolio Planning. Society of Petroleum Engineers. doi:10.2118/107852-MS; earlier version) has some simple recommendations for dealing with the Optimizer's Curse:
The paper's focus is actually on a more concrete Bayesian approach, based on modelling the population from which potential projects are sampled.
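To give a rough sense of what modelling the population of projects can look like, here's a small empirical-Bayes-style sketch with made-up numbers (my own illustration, not the paper's exact procedure):

```python
import numpy as np

# Treat the projects' estimates as draws from a common population, estimate that
# population's mean and spread from the data, then shrink each estimate accordingly.
estimates = np.array([12.0, 7.0, 5.0, 3.0, 2.5, 1.0])   # hypothetical value estimates
se = np.array([6.0, 1.0, 4.0, 0.5, 2.0, 1.5])            # hypothetical standard errors

pop_mean = estimates.mean()
# Method-of-moments estimate of the population variance (floored at zero):
pop_var = max(estimates.var(ddof=1) - np.mean(se**2), 0.0)

shrinkage = pop_var / (pop_var + se**2)    # 0 = trust only the population, 1 = trust the estimate
corrected = pop_mean + shrinkage * (estimates - pop_mean)

for raw, s, corr in zip(estimates, se, corrected):
    print(f"raw {raw:5.1f} (se {s:3.1f}) -> corrected {corr:5.1f}")
# The noisiest estimates get pulled furthest toward the population mean, so an
# apparent winner that rests on an unreliable estimate can lose its lead entirely.
```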
Perhaps a related phenomenon is that "adding maximal value on the margin" can look a lot like "defecting from longterm alliances & relationships" when viewed from a different framing?
Most clearly when ongoing support to longterm allies is no longer as leveraged as marginal support towards a new effort.
It's definitely an interesting phenomenon & worth thinking about seriously.
Any procedures for optimizing for expected impact could go wrong if the value of long-term alliances and relationships isn't accounted for.
Do you have any thoughts on Tetlock's work which recommends the use of probabilistic reasoning and breaking questions down to make accurate forecasts?
I think it's super exciting—a really useful application of probability!
I don't know as much as I'd like to about Tetlock's work. My understanding is that the work has focused mostly on geopolitical events where forecasters have been awfully successful. Geopolitical events are a kind of thing I think people are in an OK position for predicting—i.e. we've seen a lot of geopolitical events in the past that are similar to the events we expect to see in the future. We have decent theories that can explain why certain events came to pass while others didn't.
I doubt that Tetlock-style forecasting would be as fruitful in unfamiliar domains that involve Knightian-ish uncertainty. Forecasting may not be particularly reliable for questions like:
-Will we have a detailed, broadly accepted theory of consciousness this century?
-Will quantum computers take off in the next 50 years?
-Will any humans leave the solar system by 2100?
(That said, following Tetlock's guidelines may still be worthwhile if you're trying to predict hard-to-predict things.)
I think I agree with everything you've said there, except that I'd prefer to stay away from the term "Knightian", as it seems to be so often taken to refer to an absolute, binary distinction. It seems you wouldn't endorse that binary distinction yourself, given that you say "Knightian-ish", and that in your post you write:
But I think, whatever one's own intentions, the term "Knightian" sneaks in a lot of baggage and connotations. And on top of that, the term is interpreted in so many different ways by different people. For example, I happened to have recently seen events very similar to those you contrasted against cases of Knightian-ish uncertainty used as examples to explain the concept of Knightian uncertainty (in this paper):
So I see the term "Knightian" as introducing more confusion than it's worth, and I'd prefer to only use it if I also give caveats to that effect, or to highlight the confusions it causes. Typically, I'd prefer to rely instead on terms like more or less resilient, precise, or (your term) hazy probabilities/credences. (I collected various terms that can be used for this sort of idea here.)
[I know this comment is very late to the party, but I'm working on some posts about the idea of a risk-uncertainty distinction, and was re-reading your post to help inform that.]
Thanks for this!
Is there a tl;dr of these issues?
Thanks Milan—I probably should have been a bit more detailed in my summary.
Here are the main issues I see:
-The optimizer's curse is an underappreciated threat to those who prioritize among causes and programs that involve substantial, poorly understood uncertainty.
-I think EAs are unusually prone to wrong-way reductions: a fallacy where people try to solve messy, hard problems with tidy, formulaic approaches that actually create more issues than they resolve.
--I argue that trying to turn all uncertainty into something like numeric probability estimates is a wrong-way reduction that can have serious consequences.
--I argue that trying to use Bayesian methods in situations where well-grounded priors are unavailable is often a wrong-way reduction. (For what it's worth, I rarely see EAs actually deploy these Bayesian methods, but I often see people suggest that the proper approaches in hard situations involve "making a Bayesian adjustment." In many of these situations, I'd argue that something closer to run-of-the-mill critical thinking beats Bayesianism.)
-I think EAs sometimes have an unwarranted bias towards numerical, formulaic approaches over less-quantitative approaches.
Late to the party, but I was re-reading this as it relates to another post I'm working on, and I realised I have a question. You write: (note that I say "you" in this comment a lot, but I'd also be interested in anyone else's thoughts on my questions)
That makes sense to me, and seems a very worthwhile point. (It actually seems to me it might have been worth emphasising more, as I think a casual reader could think this post was a critique of formal/explicit/quantitative models in particular.)
But then in a footnote, you add:
I'm not sure I understand what you mean by that, or if it's true/makes sense. It seems to me that, ultimately, if we're engaging in a process that effectively provides a ranking of how good the options seem (whether based on cost-effectiveness estimates or just how we "feel" about them), and there's uncertainty involved, and we pick the option that seems to come out on top, the optimizer's curse will be relevant. Even if we use multiple separate informal ways of looking at the problem, we still ultimately end up with a top ranked option, and, given that that option's ended up on top, we should still expect that errors have inflated its apparent value (whether that's in numerical terms or in terms of how we feel) more than average. Right?
Or did you simply mean that using multiple perspectives means that the various different errors and uncertainties might be more likely to balance out (in the same sort of way that converging lines of evidence based on different methodologies make us more confident that we've really found something real), and that, given that there'd effectively be less uncertainty, the significance of the optimizer's curse would be smaller? (This seems to fit with "the risk of postdecision surprise may be reduced".)
If that's what you meant, that seems reasonable to me, but it seems that we could get the same sort of benefits just by doing something like gathering more data or improving our formal models. (Though of course that may often be more expensive and difficult than cluster thinking, so highlighting that we also have the option of cluster thinking does seem useful.)
Just saw this comment, I'm also super late to the party responding to you!
Totally agree! Honestly, I had several goals with this post, and I almost completely failed on two of them:
Instead, I think this post came off as primarily a criticism of certain kinds of models and a criticism of GiveWell's approach to prioritization (which is unfortunate since I think the Optimizer's Curse isn't as big an issue for GiveWell & global health as it is for many other EA orgs/cause areas).
--
On the second piece of your comment, I think we mostly agree. Informal/cluster-style thinking is probably helpful, but it definitely doesn't make the Optimizer's Curse a non-issue.
I don't know how promising others think this is, but I quite liked Concepts for Decision Making under Severe Uncertainty with Partial Ordinal and Partial Cardinal Preferences. It tries to outline possible decision procedures once you relax some of the subjective expected utility theory assumptions you object to. For example, it talks about the possibility of having a credal set of beliefs (if one objects to the idea of assigning a single probability) and then doing maximin on this, i.e. selecting the outcome that has the best expected utility according to its least favorable credences.
I’m going to try to clarify further why I think the Bayesian solution in the original paper on the Optimizer’s Curse is inadequate.
The Optimizer's Curse is defined by Proposition 1: informally, the expectation of the estimated value of your chosen intervention overestimates the expectation of its true value when you select the intervention with the maximum estimate.
The proposed solution is to instead maximize the posterior expected value of the variable being estimated (conditional on your estimates, the data, etc.), with a prior distribution for this variable, and this is purported to be justified by Proposition 2.
However, Proposition 2 holds no matter which priors and models you use; there are no restrictions at all in its statement (or proof). It doesn’t actually tell you that your posterior distributions will tend to better predict values you will later measure in the real world (e.g. by checking if they fall in your 95% credence intervals), because there need not be any connection between your models or priors and the real world. It only tells you that your maximum posterior EV equals your corresponding prior’s EV (taking both conditional on the data, or neither, although the posterior EV is already conditional on the data).
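To be fair to the proposed solution, here's a quick toy sketch (my own illustration) of the simplest normal-normal setting, where the prior happens to exactly match how the true values are generated:

```python
import numpy as np

rng = np.random.default_rng(1)
n_options, n_trials = 30, 20_000
prior_mean, prior_sd, noise_sd = 0.0, 1.0, 2.0

raw_gap = post_gap = 0.0
for _ in range(n_trials):
    true_vals = rng.normal(prior_mean, prior_sd, n_options)
    ests = true_vals + rng.normal(0.0, noise_sd, n_options)
    # Conjugate posterior mean: shrink each estimate toward the prior mean.
    w = prior_sd**2 / (prior_sd**2 + noise_sd**2)
    post_means = prior_mean + w * (ests - prior_mean)

    i = np.argmax(ests)          # naive: pick the highest raw estimate, read off that estimate
    raw_gap += ests[i] - true_vals[i]
    j = np.argmax(post_means)    # corrected: pick by posterior mean, read off the posterior mean
    post_gap += post_means[j] - true_vals[j]
    # (With equal noise on every option the two rules pick the same option here;
    # only the value you read off changes.)

print(f"naive estimate of the pick overshoots its true value by {raw_gap / n_trials:+.3f} on average")
print(f"posterior mean of the pick overshoots its true value by {post_gap / n_trials:+.3f} on average")
# The second number is ~0 only because the prior matches the true data-generating
# process; Proposition 2 says nothing about what happens when it doesn't.
```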
Something I would still call an “optimizer’s curse” can remain even with this solution when we are concerned with the values of future measurements rather than just the expected values of our posterior distributions based on our subjective priors. I’ll give 4 examples, the first just to illustrate, and the other 3 real-world examples:
1. Suppose you have n different fair coins, but you aren’t 100% sure they’re all fair, so you have a prior distribution over the future frequency of heads (it could be symmetric in heads and tails, so the expected value would be 1/2 for each), and you use the same prior for each coin. You want to choose the coin which has the maximum future frequency of landing heads, based on information about the results of finitely many new coin flips from each coin. If you select the one with the maximum expected posterior, and repeat this trial many times (flip each coin multiple times, select the one with the max posterior EV, and then repeat), you will tend to find the posterior EV of your chosen coin to be greater than 1/2, but since the coins are actually fair, your estimate will be too high more than half of the time on average. I would still call this an “optimizer’s curse”, even though it followed the recommendations of the original paper. Of course, in this scenario, it doesn’t matter which coin is chosen.
Now, suppose all the coins are as before except for one which is actually biased towards heads, and you have a prior for it which will give a lower posterior EV conditional on k heads and no tails than the other coins would (e.g. you’ve flipped it many times before with particular results to achieve this; or maybe you already know its bias with certainty). You will record the results of k coin flips for each coin. With enough coins, and depending on the actual probabilities involved, you could be less likely to select the biased coin (on average, over repeated trials) based on maximum posterior EV than by choosing a coin randomly; you'll do worse than chance.
(Math to demonstrate the possibility of the posteriors working this way for k heads out of k: you could have a uniform prior on the true future long-run average frequency of heads for the unbiased coins, i.e. p(μ_i) = 1 for μ_i in the interval [0,1], then p(μ_i | k heads) = (k+1)μ_i^k, and E[μ_i | k heads] = (k+1)/(k+2), which goes to 1 as k goes to infinity. You could have a prior which gives certainty to your biased coin having any true average frequency < 1, so any of the unbiased coins which lands heads k out of k times will beat it for k large enough.)
If you flip each coin k times, there's a number of coins, n, so that the true probability (not your modelled probability) of at least one of the n−1 other coins getting k heads is strictly greater than 1 − 1/n, i.e. 1 − (1 − 1/2^k)^(n−1) > 1 − 1/n (for k=2, you need n>8, and for k=10, you need n>9360, so n grows pretty fast as a function of k). This means, with probability strictly greater than 1 − 1/n, you won't select the biased coin, so with probability strictly less than 1/n, you will select the biased coin. So, you actually do worse than random choice, because of how many different coins you have and how likely one of them is to get very lucky. You would have even been better off on average ignoring all of the new k×n coin flips and sticking to your priors, if you already suspected the biased coin was better (if you had a prior with mean > 1/2). (A small simulation of this setup appears after this list of examples.)
2. A common practice in machine learning is to select the model with the greatest accuracy on a validation set among multiple candidates. Suppose that the validation and test sets are a random split of a common dataset for each problem. You will find that under repeated trials (not necessarily identical; they could be over different datasets/problems, with different models) that by choosing the model with the greatest validation accuracy, this value will tend to be greater than its accuracy on the test set. If you build enough models each trial, you might find the models you select are actually overfitting to the validation set (memorizing it), sometimes to the point that the models with highest validation accuracy will tend to have worse test accuracy than models with validation accuracy in a lower interval. This depends on the particular dataset and machine learning models being used. Part of this problem is just that we aren’t accounting for the possibility of overfitting in our model of the accuracies, but fixing this on its own wouldn’t solve the extra bias introduced by having more models to choose from.
3. Due to the related satisficer’s curse, when doing multiple hypothesis tests, you should adjust your p-values upward or your p-value cutoffs (false positive rate, significance level threshold) downward in specific ways to better predict replicability. There are corrections for the cutoff that account for the number of tests being performed; a simple one is that if you want a false positive rate of α and you’re doing m tests, you could instead use a per-test cutoff of 1 − (1−α)^(1/m).
4. The satisficer’s curse also guarantees that empirical study publication based on p-value cutoffs will cause published studies to replicate less often than their p-values alone would suggest. I think this is basically the same problem as 3.
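Here's the small simulation of example 1 promised above, with one particular (purely illustrative) parameterization:

```python
import numpy as np

# n-1 actually-fair coins with uniform priors on their biases, plus one coin
# known with certainty to land heads 70% of the time. Selecting by maximum
# posterior EV picks the genuinely best coin less often than random choice would.
rng = np.random.default_rng(2)
n_coins, k_flips, n_trials = 100, 10, 20_000
known_bias = 0.7

picked_best = 0
for _ in range(n_trials):
    heads = rng.binomial(k_flips, 0.5, n_coins - 1)    # the fair coins' flip results
    post_ev_fair = (heads + 1) / (k_flips + 2)         # Beta(1,1) posterior means
    if known_bias >= post_ev_fair.max():               # biased coin's posterior EV is fixed at 0.7
        picked_best += 1

print(f"picked the genuinely best coin in {picked_best / n_trials:.1%} of trials")
print(f"random choice would pick it in {1 / n_coins:.1%} of trials")
```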
Now, if you treat your priors as posteriors that are conditional on a sample of random observations and arguments you’ve been exposed to or thought of yourself, you’d similarly find a bias towards interventions with “lucky” observations and arguments. For the intervention you do select compared to an intervention chosen at random, you’re more likely to have been convinced by poor arguments that support it and less likely to have seen good arguments against it, regardless of the intervention’s actual merits, and this bias increases the more interventions you consider. The solution supported by Proposition 2 doesn’t correct for the number of interventions under consideration.
This is an issue of the models and priors. If your models and priors are not right... then you should update over your priors and use better models. Of course they can still be wrong... but that's true of all beliefs, all reasoning, etc.
If you assume from the outside (unbeknownst to the agent) that they are all fair, then you're not showing a problem with the agent's reasoning, you're just using relevant information which they lack.
My prior would not be uniform, it would be 0.5! What else could "unbiased coins" mean? This solves the problem, because then a coin with few head flips and zero tail flips will always have posterior of p > 0.5.
In this case we have a prior expectation that simpler models are more likely to be effective.
Do we have a prior expectation that one kind of charity is better? Well if so, just factor that in, business as usual. I don't see the problem exactly.
Bayesian EV estimation doesn't do hypothesis testing with p-value cutoffs. This is the same problem popping up in a different framework, yes it will require a different solution in that context, but they are separate.
The proposed solution applies here too, just do (simplistic, informal) posterior EV correction for your (simplistic, informal) estimates.
Of course that's not going to be very reliable. But that's the whole point of using such simplistic, informal thinking. All kinds of rigor get sacrificed when charities are dismissed for sloppy reasons. If you think your informally-excluded charities might actually turn out to be optimal then you shouldn't be informally excluding them in the first place.
tl;dr: even using priors, with more options and hazier probabilities, you tend to increase the number of options which are too sensitive to supporting information (or just optimistically biased due to your priors), and these options look disproportionately good. This is still an optimizer’s curse in practice.
In practice, your models and priors will almost always be wrong, because you lack information; there's some truth of the matter of which you aren't aware. It's unrealistic to expect us to have good guesses for the priors in all cases, especially with little information or precedent as in hazy probabilities, a major point of the OP.
You'd hope that more information would tend to allow you to make better predictions and bring you closer to the truth, but when optimizing, even with correctly specified likelihoods and after updating over priors as you said should be done, the predictions for the selected coin can be more biased in expectation with more information (results of coin flips). On the other hand, the predictions for any fixed coin will not be any more biased in expectation over the new information, and if the prior's EV hadn't matched the true mean, the predictions would tend to be less biased.
More information (flips) per option (coin) would reduce the bias of the selection on average, but, as I showed, more options (coins) would increase it, too, because you get more chances to be unusually lucky.
The intent here again is that you don't know the coins are fair.
Fair enough.
How would you do this in practice? Specifically, how would you get an idea of the magnitude for the correction you should make?
Maybe you could test your own (or your group's) prediction calibration and bias, but it's not clear how exactly you should incorporate this information, and it's likely these tests won't be very representative when you're considering the kinds of problems with hazy probabilities mentioned in the OP.
I'm interested in what you think about using subjective confidence intervals to estimate the effectiveness of charities and then comparing them. To account for the optimizer's curse, we can penalize charities that have wider confidence intervals. Not sure how it would be done in practice, but there probably is a mathematical method to calculate how much they should be penalized. Confidence intervals communicate both value and uncertainty at the same time and therefore avoid some of the problems that you talk about.
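For concreteness, here's the sort of thing I have in mind (a sketch with made-up numbers; the normal approximation and the shared prior are assumptions, not an established method):

```python
# Read each subjective 90% interval as a normal summary, then shrink toward a
# shared prior. Wider intervals get less weight, which is the penalty I mean.
# Hypothetical charities: (lower, upper) bounds of 90% intervals for value per dollar.
intervals = {"A": (0.5, 30.0), "B": (4.0, 9.0), "C": (1.0, 14.0)}

prior_mean, prior_sd = 3.0, 3.0          # assumed prior over charities in general

for name, (lo, hi) in intervals.items():
    mean = (lo + hi) / 2
    sd = (hi - lo) / (2 * 1.645)         # a 90% normal interval spans ~3.29 standard deviations
    w = prior_sd**2 / (prior_sd**2 + sd**2)
    adjusted = prior_mean + w * (mean - prior_mean)
    print(f"{name}: stated midpoint {mean:5.2f} -> width-penalized estimate {adjusted:5.2f}")
# Charity A has the highest midpoint but the widest interval, so it ends up
# penalized below charity B after the adjustment.
```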