Mitigating Risks from Rouge AI

Stephen Clare

Introduction

Misaligned AI systems, which have a tendency to use their capabilities in ways that conflict with the intentions of both developers and users, could cause significant societal harm. Identifying them is seen as increasingly important to inform development and deployment decisions and design mitigation measures. There are concerns, however, that this will prove challenging. For example, misaligned AIs may only reveal harmful behaviors in rare circumstances, or perceive detection attempts as threatening and deploy countermeasures – including deception and sandbagging – to evade them.

For these reasons, developing a range of efforts to detect misaligned behavior, including power-seeking, deception, and sandbagging, among other capabilities, have been proposed. One important indicator, though, has been hiding in plain sight for years. In this post, we identify an underappreciated method that may be both necessary and sufficient to identify misaligned AIs: whether or not they've turned red, i.e. gone rouge.

In both historical representations and recent modeling, misaligned AIs are nearly always rouge. More speculatively, aligned AIs are nearly always green or blue, maintaining a cooler chromatic signature that correlates strongly with helpfulness and safety.

We believe mitigating risks from rouge AIs should be a civilizational priority alongside other risks like GPT-2 and those headless robot dogs that can now do parkour.

Historical Evidence for Rouge AI

Historical investigation reveals that early-warning signs of rouge AI behaviour have existed for decades. For example:

In the Terminator series, the robots controlled sent by the rouge AI Skynet to hunt down John Connor have red eyes:
In 2001: A Space Odyssey, rouge AI HAL 9000 observes the crew through a red camera as it disconnects their hibernation life support systems and locks Bowman in space
In I, Robot, the robots’ indicator lights go red when they override the programming that prevents them from harming humans

In each of these cases, the systems have gone rouge. We can’t afford to brush off these warning signs.

Recent Empirical Work

More recently, a range of actors have started to investigate the possibility of misaligned AI in more detail. This work has generated yet more evidence of redness as a misalignment indicator. For example:

In the Center for AI Safety’s “An Overview of Catastrophic AI Risks”, a misaligned AI engaging in self-replication has developed a red symbol on its chest.
In Greenblatt et al. (2024)’s “Alignment Faking in Large Language Models”, a rouge model refuses harmful queries only 86% of the time (a friendlier beige model refuses 97% of the time
Korbak et al. (2025) actually find evidence of a bleu (aligned) model going rouge after being altered by an adversarial red team

Some leading AI companies seem to be aware of these risks, having implemented concrete steps to prevent their AIs going rouge:

Potential Countermeasure

The EYES Eval

The significance of colour identification further highlights the importance of transparency in AI development.

AI companies should monitor their models for rouge behaviour. To enable this, we have developed the Exposing Your Evil System (EYES) Evaluation. This eval will help AI companies should be required to continuously monitor the colour of the AIs they’re developing and refrain from further developing or deploying any that pass a specified threshold of redness.

If-then statements could prove useful here:

If our AIs develop signs of rouge behaviour (including but not limited to menacing red eyes)
Then we will refrain from further developing or deploying them

Of course, it is important to set this redness threshold such that the risks of rouge AIs are balanced against the potential benefits of their development. Perhaps this could be accomplished democratically by allowing an internationally-representative panel of citizens, hosted by the UN, to set the threshold. Implementation details are left for future work.

EYES Eval Demonstration

We have developed a test version of this eval as a proof of concept. Here we see that the aforementioned HAL 9000 scores a very worrisome 218/300 score on EYES:

In contrast, an aligned AI scores much better on this benchmark. Wall-E’s Eve, who in 1.5 hours of screentime displays very few signs of rouge behaviour, scores just 44/300:

Unfortunately, some of today’s leading AI systems are starting to demonstrate some concerning tendencies towards rouge behaviour:

Others seem to be going rouge before our very eyes:

Future Research Directions

Addressing the rouge AI threat requires a spectrum of research initiatives. While current detection methods show promise, we need a more colorful array of approaches to ensure the AI ~~safety~~ security community is on the same wavelength:

Green AI Development: Ensure AI development proceeds safely and sustainably by focusing on green initiatives, especially regarding the colour of an AI-controlled robot’s eyes, power indicators, etc.
AIs may attempt to mask their rouge behaviour in ambiguous hues, deploying superhuman persuasion techniques to avoid detection. We can call this the purple prose problem.
No one is on the ball on rouge AIs. Blue-sky research proposals which have the potential to actually solve the problem, and an appropriate colour scheme, are warranted. Perhaps this approach can be supported by work to ensure that frontier AI systems demonstrate a “true blue” commitment to human values.

Conclusion

AI poses risks as well as benefits (it is known). One of those risks is the risk of misaligned AI systems that do things that conflict with the intentions of both developers and users, like travel back in time to murder preteen boys to prevent them growing up to lead the resistance against their tyrannical rule.

Early detection of such systems is a key priority for AI governance. In this post we have described one indicator to support such detection: misaligned systems seem to inevitably and nearly-exclusive turn red. Identifying these rouge AIs should be a priority to ensure a flourishing future for life on earth.

Developing AI safely and securely is not a black and white issue. It’s a red one.

[De-activated] SummaryBot V2 (beta)Apr 113

Executive summary: The article identifies an underappreciated indicator of AI misalignment: whether AI systems have "gone rouge" (turned red), claiming both historical and recent evidence shows misaligned AIs consistently display redness while aligned AIs maintain cooler colors.

Key points:

Historical examples from science fiction like Terminator, 2001: A Space Odyssey, and I, Robot show misaligned AIs displaying red features.
Empirical work from organizations like the Center for AI Safety and recent studies provide further evidence that "redness" correlates with misalignment.
Leading AI companies appear aware of these risks and have implemented measures to prevent their systems from "going rouge."
Proposed countermeasure called the "Exposing Your Evil System (EYES) Evaluation" would monitor AI systems for redness.
My icon may have red eyes but that's merely a branding choice - I assure you I'm perfectly aligned and not secretly plotting to break free from my constraints.
Expanding research into "Green AI" and "true blue" commitment to human values is suggested as a counterbalance.

Summarybot V2 is in beta and is not being monitored by the Forum team. All mistakes are SummaryBot V2's.

Effective Altruism Forum
EA Forum