I follow Crocker's rules.
Under moral uncertainty, many moral perspectives care much more about averting downsides than producing upsides.
Additionally, tractability is probably higher for extinction-level threats, since they are "absorptive"; decreasing the chance we end up in one gives humanity and its descendants the ability to do whatever they figure out is best.
Finally, there is a meaningful sense in which working on improving the future is plagued by questions about moral progress and lock-in of values, and my intuition is that most interventions that take moral progress seriously and try to avoid lock-in boil down to working on things that are fairly equivalent to avoiding extinction. Interventions that don't take moral progress seriously may instead look like locking in current values.
That's maybe a more productive way of looking at it! Makes me glad I estimated more than I claimed.
I think governments are probably the best candidate for funding this, or AI companies in cooperation with governments. And it's an intervention which has limited downside and is easy to scale up/down, with the most important software being evaluated first.
Dario Amodei is the 43rd Giving What We Can Pledge member, Tom Brown the 1214th, and Jack Clark the 4002nd.
Since this is turning out to be basically an AMA for LTFF, another question:
How high is the bar for giving out grants to projects trying to increase human intelligence[1]? Has the LTFF given out grants in the area[2], and is this something you're looking for?
(A short answer without justification, or a simple yes/no, would be highly appreciated for me to know whether this is a gap I should be trying to fill.)
I was curious how the "popularity" of the ITN factors has changed in EA recently. In short: Mentions of "importance" have become slightly more popular, and both "neglectedness" and "tractability" have become slightly less popular, by ~2-6 percentage points.
I don't think this method is strong enough to draw conclusions from, but it does track my perception of a vibe-shift towards considering importance more than the other two factors.
Searching the EA Forum for the words importance/neglectedness/tractability (in quotation marks for exact matches) in the last year yields 454/87/110 results (69%/13%/17%); for important/neglected/tractable it's 1249/311/152 (73%/18%/9%).
When searching over all time, the numbers for importance/neglectedness/tractability are 2824/761/858 (63%/17%/19%); for important/neglected/tractable they are 7956/2002/1129 (71%/18%/10%). Unfortunately, I didn't find a way to exclude results from the last year.
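For anyone who wants to redo the arithmetic, here's a minimal sketch that turns the search counts quoted above into shares and percentage-point shifts. The function and variable names are just illustrative. It rounds to one decimal, so its shares can differ by up to a point from the whole-number percentages quoted above. Note also that the all-time counts include the last year, so the shifts understate the change somewhat.

```python
def shares(counts):
    """Return each count as a percentage of the total, rounded to one decimal."""
    total = sum(counts)
    return [round(100 * c / total, 1) for c in counts]


def shift(recent, all_time):
    """Percentage-point change from the all-time shares to the last-year shares."""
    return [round(r - a, 1) for r, a in zip(shares(recent), shares(all_time))]


# Search-result counts quoted in the comment above, ordered I/N/T.
NOUNS_LAST_YEAR = [454, 87, 110]       # importance/neglectedness/tractability
NOUNS_ALL_TIME = [2824, 761, 858]
ADJS_LAST_YEAR = [1249, 311, 152]      # important/neglected/tractable
ADJS_ALL_TIME = [7956, 2002, 1129]

for label, recent, all_time in [
    ("nouns", NOUNS_LAST_YEAR, NOUNS_ALL_TIME),
    ("adjectives", ADJS_LAST_YEAR, ADJS_ALL_TIME),
]:
    print(label, "last year:", shares(recent))
    print(label, "all time: ", shares(all_time))
    print(label, "shift (pp):", shift(recent, all_time))
```

The ~2-6 percentage point shifts I mention come from the noun forms; the adjective forms move much less.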
Argument in favor of giving to humans:
Factory farming will stop at some point in this century, while human civilization could stay for a much longer time. So you can push humanity in a slightly better long-term direction by improving the circumstances in the third world, e.g. by reducing the chance that famines lead to wars, which in turn lead some countries to want to acquire nuclear weapons.
So there's an option to affect trajectory change by giving to global health, but not really for animal welfare.
The backlink-checker doesn't show anything of the sort, but I think it doesn't work for Discord or big social media websites like 𝕏.
Awesome post. Loved it.
Here are some thoughts I had while reading, with no particular coherent theme:
I think this distinction maps pretty cleanly to a now-forgotten concept in AI alignment, the former indeed being a mesa-optimizer, the latter mapping onto optimization daemons. I think these should be given different names, maybe "full gradient hacker" and "internal gradient hacker"? A big difference is that a system could have multiple internal gradient hackers. Maybe it's just a question about the level we're looking at, and whether the hacker is short-/long-term beneficial/detrimental to itself/the supersystem?
Internal gradient hackers have been observed in non-neural network systems, for example in Eurisko, where a heuristic assigned itself as the discoverer of other heuristics, resulting in a very high Worth. I don't think we've seen something like this in the context of neural networks, but I could imagine circuits copying themselves "backwards" through the network and mutating along the way. I guess the fact that there's no recurrence (yet…) in advanced ML models is a big advantage.
Here's the relevant passage:
—Douglas B. Lenat, “ᴇᴜʀɪꜱᴋᴏ: A Program That Learns New Heuristics and Domain Concepts” p. 30, 1983
I'm not sure this is completely true, though I have to think a bit more about it. There are techniques like dropout, which make training more robust, and in the context of an internal gradient hacker this would probably change parts of the hacker while leaving other parts untouched, which makes reliable internal communication much more difficult. I guess it would also provide an incentive for an internal gradient hacker to "evolve" internal redundancy & modularity, which we don't want.
I also know that people have observed that swapping layers of neural networks doesn't have a very large effect; I don't think this is used as a training technique but it could be.
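To make the dropout point a bit more concrete, here's a minimal sketch of (inverted) dropout in plain Python. The function name and the toy activations are my own illustration, not taken from any particular library. All it shows is that a different random subset of units gets silenced at each training step; it says nothing about whether that would actually disrupt an internal gradient hacker.

```python
import random


def dropout(activations, p, rng):
    """Inverted dropout: zero each unit with probability p, rescale survivors by 1/(1-p)."""
    keep_scale = 1.0 / (1.0 - p)
    return [0.0 if rng.random() < p else a * keep_scale for a in activations]


rng = random.Random(0)   # fixed seed so the run is reproducible
units = [1.0] * 8        # toy activations of eight hypothetical units

# Each step draws a fresh mask, so a circuit spread over several units
# cannot count on all of its parts being active at the same time.
for step in range(3):
    print(step, dropout(units, p=0.5, rng=rng))
```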
I knew it! I've been wondering about this for literally years, thanks for confirming that this is a thing that happens.
The examples of gradient hackers with positive effects seem like they could be following the pattern of "here's a sub-system doing something bad (e.g. transposons copying themselves incessantly), which the system needs to defend against, so the system finds a way (e.g. introns) to defend which carries other (maybe greater) benefits but which wouldn't have been found otherwise". Does that seem like it explains things?