From time to time, someone makes the case for why transparency in reasoning is important. The latest conceptualization is Epistemic Legibility by Elizabeth, but the core concept is similar to reasoning transparency used by OpenPhil, and also has some similarity to A Sketch of Good Communication by Ben Pace.
I'd like to offer a gentle pushback. The tl;dr is in my comment on Ben's post, but it seems useful enough for a standalone post.
“How odd I can have all this inside me and to you it's just words.” ― David Foster Wallace
When and why reasoning legibility is hard
Say you demand transparent reasoning from AlphaGo. The algorithm has roughly two parts: tree search and a neural network. Tree search reasoning is naturally legible: the "argument" is simply a sequence of board states. In contrast, the neural network is mostly illegible - its output is a figurative "feeling" about how promising a position is, but that feeling depends on the aggregate experience of a huge number of games, and it is extremely difficult to explain transparently how a particular feeling depends on particular past experiences. So AlphaGo would be able to present part of its reasoning to you, but not the most important part.[1]
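To make the split concrete, here is a minimal sketch on a toy game rather than Go. Everything in it is an assumption made for illustration: the game (a pile of stones, take 1 to 3, whoever takes the last stone wins), the names `opaque_value` and `search`, and the hard-coded evaluation standing in for a trained network. It is not how AlphaGo is implemented. The point is only that the search returns a line of positions you can read, while the leaf score arrives with no explanation attached.

```python
from typing import Callable, List, Tuple


def opaque_value(pile: int) -> float:
    """Hypothetical black box: a score in [-1, 1] for the player to move.

    In the metaphor this would be a trained network; here it is a stub
    whose inner workings the search never inspects.
    """
    return -1.0 if pile % 4 == 0 else 1.0


def search(pile: int, depth: int,
           value: Callable[[int], float]) -> Tuple[float, List[int]]:
    """Depth-limited negamax over the toy game.

    Returns (score for the player to move, sequence of pile sizes).
    The sequence is the legible part of the "argument".
    """
    if pile == 0:
        # The previous player took the last stone, so the mover has lost.
        return -1.0, [pile]
    if depth == 0:
        # The illegible part: a bare number from the black box.
        return value(pile), [pile]
    best_score, best_line = float("-inf"), []
    for take in (1, 2, 3):
        if take > pile:
            break
        child_score, child_line = search(pile - take, depth - 1, value)
        if -child_score > best_score:
            best_score, best_line = -child_score, child_line
    return best_score, [pile] + best_line


score, line = search(5, 2, opaque_value)
print(score, line)
```

The returned `line` can be shown to anyone, but the reason its last position is scored the way it is lives entirely inside `opaque_value`.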
Human reasoning uses both: cognition similar to tree search (where the steps can be described, written down, and explained to someone else) and processes not amenable to introspection (which function essentially as a black box that produces a "feeling"). People sometimes call these latter signals “intuition”, “implicit knowledge”, “taste”, “S1 reasoning” and the like. Explicit reasoning often rides on top of this.
Extending the machine learning metaphor, the problem with human interpretability is that "mastery" in a field often consists precisely in having some well-trained black box neural network that performs fairly opaque background computations.
Bad things can happen when you demand explanations from black boxes
The second thesis is that it often makes sense to assume the mind runs distinct computational processes: one that actually makes decisions and reaches conclusions, and another that produces justifications and rationalizations.
In my experience, if you have good introspective access to your own reasoning, you may occasionally notice that a conclusion C depends mainly on some black box, but at the same time, you generated a plausible legible argument A for the same conclusion after you reached the conclusion C.
If you try running, say, Double Crux over such situations, you'll notice that even if someone refutes the explicit reasoning A, you won't quite change the conclusion to ¬C. The legible argument A was not the real crux. It is quite often the case that A is essentially fake (or low-weight), whereas the black box is hiding a reality-tracking model.
Stretching the AlphaGo metaphor a bit: AlphaGo could be easily modified to find a few specific game "rollouts" that turned out to "explain" the mysterious signal from the neural network. Using tree search, it would produce a few specific examples of how such a position may evolve, which would be selected to agree with the neural net prediction. If AlphaGo showed them to you, it might convince you! But you would get a completely superficial understanding of why it evaluates the situation the way it does, or why it makes certain moves.
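A toy sketch of such a "justification generator" follows. It is illustrative only: `black_box_value`, `random_rollout`, `cherry_picked_rollouts` and the coin-flip rollout model are all hypothetical stand-ins, not anything AlphaGo actually does. The rollouts shown to the audience are filtered to agree with the black box, so they agree every time, whatever the unfiltered samples say.

```python
import random
from typing import List, Tuple


def black_box_value(position: int) -> float:
    """Hypothetical stand-in for the value network: an opaque win estimate in [0, 1]."""
    return 0.8 if position % 2 == 0 else 0.3


def random_rollout(rng: random.Random, true_win_prob: float) -> bool:
    """Toy rollout: True means the game ended in a win, False in a loss."""
    return rng.random() < true_win_prob


def cherry_picked_rollouts(position: int, n_samples: int, n_shown: int,
                           true_win_prob: float = 0.5,
                           seed: int = 0) -> Tuple[float, List[bool]]:
    """Sample rollouts, then keep only those that agree with the black box.

    Returns (honest agreement rate over all samples, rollouts shown).
    """
    rng = random.Random(seed)
    predicted_win = black_box_value(position) >= 0.5
    outcomes = [random_rollout(rng, true_win_prob) for _ in range(n_samples)]
    agreeing = [o for o in outcomes if o == predicted_win]
    honest_rate = len(agreeing) / len(outcomes)
    # Selection step: disagreeing rollouts are silently dropped.
    return honest_rate, agreeing[:n_shown]


honest_rate, shown = cherry_picked_rollouts(position=4, n_samples=200, n_shown=3)
print(f"agreement across all samples: {honest_rate:.2f}")
print(f"rollouts presented as the explanation: {shown}")
```

The presented rollouts carry no information about why the black box produced its estimate, or whether it is right. They only show that the selection step worked.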
Risks from the legibility norm
When you make a strong norm pushing for too straightforward "epistemic legibility", you risk several bad things:
First, you increase the pressure on the "justification generator" to mask various black boxes by generating arguments supporting their conclusions.
Second, you make individual people dumber. Imagine asking a Go grandmaster to transparently justify his moves to you, and to play the moves that are best justified - if he tries to play that way, he will become a much weaker player. A similar thing applies to AlphaGo - if you allocate computational resources in such a way that a much larger fraction is consumed by tree search at each position, and less of the neural network is used overall, you will get worse outputs.
Third, there's a risk that people get convinced based on bad arguments - because their "justification generator" generated a weak legible explanation, you managed to refute it, and they updated. The problem comes if this involves discarding the output of the neural network, which was much smarter than the reasoning they accepted.
What we can do about it
My personal impression is that society as a whole would benefit from more transparent reasoning on the margin.
What I'm not convinced of, at all, is that trying to reason much more transparently is a good goal for aspiring rationalists, or that some naive (but memetically fit) norms around epistemic legibility should spread.
To me, it makes sense for some people to specialize in very transparent reasoning. On the other hand, it also makes sense for some people to mostly "try to be better at Go", because legibility has various hidden costs.
A version of transparency that seems more robustly good to me is the one that takes legibility to a meta level. It's perfectly fine to refer to various non-interpretable processes and structures, but we should ideally add a description of what data they are trained on (e.g. “I played at the national level”). At the same time, if such black-box models outperform legible reasoning, it should be considered fine and virtuous to use models which work. You should play to win, if you can.
Examples
An example of a common non-legible communication:
A: Can you explain why you feel that getting this person to implement a "Getting Things Done" system is not a good idea?
B: I don't know exactly, I feel it won't do him any good.
An example of how to make the same conversation worse by naive optimization for legibility:
A: Can you explain why you feel that getting this person to implement a "Getting Things Done" system is not a good idea?
B: I read a thread on Twitter yesterday where someone explained that research on similar motivational techniques does not replicate, and also another thread where someone referenced research that people who over-organize their lives are less creative.
A: Those studies are pretty weak though.
B: Ah I guess you’re right.
An example of how to actually improve the same conversation by striving for legibility:
A: Can you explain why you feel that getting this person to implement a "Getting Things Done" system is not a good idea?
B: I guess I can't explain it transparently to you. My model of this person just tells me that there is a fairly high risk that teaching them GTD won't have good results. I think it's based on experience with a hundred people I've met on various courses who are trying to have a positive impact on the world. Also, when I had similar feelings in the past, it turned out they were predictive in more than half of the cases.
If you've always understood the terms "reasoning transparency" or "epistemic legibility" in the spirit of the third conversation, and your epistemology routinely involves steps like "I'm going to trust this black-box trained on lots of data a lot more than this transparent analysis based on published research", then you're probably safe.
How this looks in practice
In my view, it is pretty clear that some of the main cruxes of current disagreements about AI alignment are beyond the limits of legible reasoning. (The current limits, anyway.)
In my view, some of these intuitions have roughly the "black-box" form explained above. If you try to understand the disagreements between e.g. Paul Christiano and Eliezer Yudkowsky, you often end up in a situation where the real difference is "taste", which influences how much weight they give to arguments, how good or bad they judge various future "board positions" to be, etc. Both Eliezer and Paul are extremely smart, have spent more than a decade thinking about AI safety and even more time on relevant topics such as ML or decision theory or epistemics.
A person new to AI safety evaluating their arguments is roughly in a similar position to a Go novice trying to make sense of two Go grandmasters disagreeing about a board, with the further unfortunate feature that you can't just make them play against each other, because in some sense they are both playing for the same side.
This isn't a great position to be in. But in my view it's better to understand where you are rather than, for example, naively updating on a few cherry-picked rollouts.
See also
Thanks to Gavin for help with writing this post.
[1] We can go even further if we note that the later AlphaZero policy network doesn’t use tree search when playing.
I liked this post by Katja Grace on these themes.
I enjoyed reading this post, and I think I agree with your assessment:
(In addition to the Christiano-Yudkowsky example you give, one could also point to the Hanson-Yudkowsky AI-Foom Debate of 2008.)
In addition to "Epistemic Legibility" and "A Sketch of Good Communication," which you mention, I'd recommend "Public beliefs vs. Private beliefs" (Tyre, 2022) to others who enjoyed this post – Tyre explores a somewhat related theme.
On the other hand, if someone in EA is making decisions about high-stakes interventions while their judgement is being influenced by a subconscious optimization for things like status and power, I think it's probably beneficial to subject their "justification generator" to a lot of pressure (in the hope that that will cause them, and onlookers, to end up making the best decisions from an EA perspective).