Conjecture recently released an AI safety proposal. The three of us spent a few hours discussing the proposal and identifying questions that we have. (First, we each re-read the post and independently brainstormed a few questions. Then we discussed the post, exchanged questions and uncertainties, and consolidated our lists.)
Conjecture's post is concise, which means it leaves out many details. Many of our questions are requests for more details that would allow us (and others) to better understand the proposal and evaluate it more thoroughly.
Requesting examples and details
- What are the building blocks that the CoEms approach will draw from? What are examples of past work that has shown us how to build powerful systems that are human-understandable?
- What are examples of “knowledge of building systems that are broadly beneficial and safe while operating in the human capabilities regime”? (see Wei_Dai’s comment)
- What’s an example of an experiment that would be considered part of the CoEm agenda? (see Garrett Baker’s comment)
- What kinds of approaches does Conjecture intend to use to extract alignment insights “purely from mining current level systems”? (Is this the same as interpretability research and digital neuroscience?)
- The “minimize magic” section feels like where the juice is, but it’s not really explained much, which makes it difficult to evaluate. Can you offer more details about how you intend to minimize magic?
Conceptual questions
- Assume you had a fully human-understandable system, and you could understand its current capabilities. How would you be able to forecast its future capabilities (e.g., if deployed or if given certain commands)?
- If we solve human neuroscience such that we could understand the brain of a 2-year-old, we would be able to accurately assess the (current) capabilities of the 2-year-old. However, we would not necessarily be able to predict the (future) capabilities of this brain once it is 30 years old. Analogously, if we had a human-understandable AI (that may be superintelligent) through the CoEms agenda, would we only be able to understand its current capabilities, or would there be a reliable way to forecast its future capabilities?
- Charlotte thinks that humans and advanced AIs are universal Turing machines, so predicting capabilities is not about whether a capability is present at all, but whether it is feasible in finite time with a low enough error rate. Predicting how such error rates decline with experience and learning seems roughly equally hard for human-understandable AIs and other AIs.
- How easy is it to retarget humans?
- When you refer to “retargetability”, we assume you mean something like the following: “Currently the AI has goal X, and you want to train it to have goal Y. If you do that, you truly change its goals to Y (rather than making it pretend to follow Y and then, when you are no longer in control, switch back to X).”
- We agree that in some sense, humans are retargetable. For example, a person's goals can be changed if someone has very advanced persuasion tools or if the “persuader” is significantly stronger than the “persuadee” (e.g., a dictator persuading a citizen).
- But even that is very hard, and often one just changes their incentives/strategy rather than their actual goals.
- However, humans seem to be much less retargetable by other agents who are similarly powerful. For example, how would you retarget the goals of an (equally intelligent and equally powerful) neighbor?
- Alternatively, you might refer to a much weaker version of “retargetability”, e.g., a very weak version of corrigible alignment. If this is what you mean, I am wondering why this is a particularly important property.
Other questions
- Does Conjecture believe this approach is competitive with approaches that rely on Magic? Does this plan only work if we have ambitious global coordination (e.g., governments say that people are no longer allowed to use Magic when training systems)?
- How many technical researchers does Conjecture have, and what % of its alignment labor will be going into the CoEms agenda (as opposed to other research directions)?
- When you talk about CoEms, to what extent does this mean that you are using “cognitive architectures”?
- If you're using cognitive architectures, why do you expect them to be human-like?
It is possible that satisfactory answers to some of these questions would involve revealing infohazards, but we’re hopeful that some of them could be addressed without revealing infohazards.