I have never been satisfied by the "AI infers that it is simulated and changes its behavior" argument, because the root issue always seems to be that some information has leaked into the simulation. The problem shifts from "How do we prevent AI from escaping a box?" to "How do we prevent information from entering a box?" The components of this problem being:
These questions seem relatively approachable compared to other avenues of AI safety research.
Hi - this is my first post on the EA forum!
Be warned that this topic is extremely speculative. If I were to write this up as a full-length post, I'd dig up some actual numbers.
Dysonian SETI ("artifact SETI") is a general strategy for SETI that starts from the principle that we may be much more likely to find the remnants of an advanced civilization than to encounter an active one. If this principle is true, it may seem like bad news for anybody who cares about X-Risk, but we should instead consider it an opportunity to learn from those who came before us.
Premise 1: Data regarding what X-Risks pose the biggest threat and how to deal with them is extremely valuable. [80% certainty]
Premise 2: Discovering remnants of civilizations would (almost instantaneously) provide centuries/millennia worth of insight into X-Risks. [80% certainty]
Premise 3: Not discovering remnants of civilizations would at least provide some insight into what "great filters" we have already passed as a civilization. (This is a topic for a different post regarding what we ought to do if we find that humanity is doomed in the near future.) [70% certainty]
Premise 4: Restricting our efforts to Dysonian SETI - rather than Active SETI ("communication SETI") - eliminates the hypothetical dark-forest risk by ensuring that we only ever observe. [90% certainty]
Premise 5: SETI as a cause area is virtually non-existent and thus likely to benefit immensely from marginal funding. [90% certainty]
Conclusion/Hypothesis: Promoting Dysonian SETI could be a very valuable cause area.
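As a rough sanity check on the credences above, here is a minimal sketch of how they might combine. It rests on assumptions that are not part of the argument itself: that the conclusion needs all five premises to hold, and (for the first figure) that the premises are independent. Premises 2 and 3 cover complementary outcomes, so treating them as a strict conjunction is probably too pessimistic. The function names are made up for illustration.

```python
import math

# Credences copied from Premises 1-5 above.
credences = [0.80, 0.80, 0.70, 0.90, 0.90]


def joint_if_independent(ps):
    """Joint probability that every premise holds, ASSUMING independence."""
    return math.prod(ps)


def frechet_lower_bound(ps):
    """Lowest possible joint probability with no independence assumption
    (the Frechet bound: sum of probabilities minus (n - 1), floored at 0)."""
    return max(0.0, sum(ps) - (len(ps) - 1))


joint = joint_if_independent(credences)
lower = frechet_lower_bound(credences)

print(f"Joint credence if independent: {joint:.3f}")
print(f"Worst-case lower bound:        {lower:.3f}")
```

Under these assumptions the joint credence comes out at roughly 36%, with a worst-case floor of about 10%. That seems consistent with the hedged "could be" in the conclusion rather than a confident claim.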
What other methods are there that would in principle allow iteration?
If it is true that "a failed AGI attempt could result in unrecoverable loss of human potential within the bounds of everything that it can affect", then our options are to A) not fail or B) limit the bounds of everything that it can affect. In this sense, any strategy that hopes to allow for iteration is abstractly equivalent to a box, simulation, or sandbox, whatever you may call it.