This is a special post for quick takes by David_Althaus. Only they can create top-level comments. Comments here also appear on the Quick Takes page and All Posts page.

I just read Stephen Clare's excellent 80k article about the risks of stable totalitarianism.

I've been interested in this area for some time (though my focus is somewhat different) and I'm really glad more people are working on this. 

In the article, Stephen puts the probability that a totalitarian regime will control the world indefinitely at about 1 in 30,000. My probability on a totalitarian regime controlling a non-trivial fraction of humanity's future is considerably higher (though I haven't thought much about this).

One point of disagreement may be the following. Stephen writes: 

There’s also the fact that the rise of a stable totalitarian superpower would be bad for everyone else in the world. That means that most other countries are strongly incentivized to work against this problem.

This is not clear to me. Stephen most likely understands the relevant topics far better than I do, but I worry that autocratic regimes often cooperate. This has happened historically—e.g., Nazi Germany, fascist Italy, and Imperial Japan—and also seems to be happening today. My sense is that Russia, China, Venezuela, Iran, and North Korea have formed some type of loose alliance, at least to some extent (see also Anne Applebaum's Autocracy Inc.). Perhaps this doesn't apply to strictly totalitarian regimes (though it did apply to Germany, Italy, and Japan in the 1940s). 

Autocratic regimes control a non-trivial fraction (perhaps 20-25%?) of world GDP. A naive extrapolation could thus suggest that some type of coalition of autocratic regimes will control 20-25% of humanity's future (assuming these regimes won't reform themselves).

Depending on the offense-defense balance (and depending on how people trade off reducing suffering/injustice against other values such as national sovereignty, non-interference, isolationism, personal costs to themselves, etc.), this arrangement may very well persist. 

It's unclear how much suffering such regimes would create—perhaps there would be fairly little; e.g. in China, ignoring political prisoners, the Uyghurs, etc., most people are probably doing fairly well (though a lot of people in, say, Iran aren't doing too well, see more below). But it's not super unlikely there would exist enormous amounts of suffering.

So, even though I agree that it's very unlikely that a totalitarian regime will control all or even the majority of humanity's future, it seems considerably more likely to me (perhaps even more than 1%) that a totalitarian regime—or a regime that follows some type of fanatical ideology—will control a non-trivial fraction of the universe and cause astronomical amounts of suffering indefinitely. (E.g., religious fanatics often have extremely retributive tendencies and may value the suffering of dissidents or non-believers. In a pilot, I found that 22% of religious participants at least tentatively agreed with the statement "if hell didn't exist, we should create hell in order to punish all the sinners". Senior officials in Iran have ordered the rape of female prisoners so that they would end up in hell, or at least be prevented from going to heaven (IHRDC, 2011; IranWire, 2023). One might argue that religious fanatics (with access to AGI) will surely change their irrational beliefs once it's clear they are wrong. Maybe. I don't find it implausible that at least some people (and especially religious or political fanatics) will decide that giving up their beliefs is the greatest possible evil and decide to use their AGIs to align reality with their beliefs, rather than vice versa.)

To be clear, all of this is much more important from a s-risk focused perspective than from an upside-focused perspective.

Another disagreement may be related to the tractability / how easy it is to contribute: 

For example, we mentioned above that the three ways totalitarian regimes have been brought down in the past are through war, resistance movements, and the deaths of dictators. Most of the people reading this article probably aren’t in a position to influence any of those forces (and even if they could, it would be seriously risky to do so, to say the least!).

Most EAs may not be able to directly work on these topics but there are various options that allow you to do something indirectly: 

- working in (foreign) policy or politics, or working on financial reforms that make money laundering harder for autocratic states like Russia (again, cf. Autocracy Inc.)
- becoming a journalist and writing about such topics (e.g., doing investigative journalism on the corruption in autocratic regimes), generally moving the discussion towards more important topics and away from currently trendy but less important topics
- working at think tanks that protect democratic institutions (Stephen Clare lists several)
- working on AI governance (e.g., info sec, export controls) to reduce the chance that autocratic regimes gain access to advanced AI (again, Stephen Clare already lists this area)
- probably several more career paths that we haven't thought of

In general, it doesn't seem harder to have an impactful career in this area than in, say, AI risk. Depending on your background and skills, it may even be a lot easier; e.g., in order to do valuable work on AI policy, you often need to understand both policy/politics and technical fields like computer science & machine learning. Of course, the area is arguably more crowded (though AI is becoming more crowded every day).

I do think this loose alliance of authoritarian states[1] (Russia, Iran, North Korea, etc.) poses some meaningful challenge to democracies, especially insofar as the authoritarian states coordinate to undermine the democratic ones, e.g., through information warfare that increases polarization.

However, I'd emphasize "loose" here, given that they share no ideology. That makes them different from what binds together the free world[2] or what held together the Cold War's communist bloc. Such a loose coalition is merely opportunistic and transactional, and likely to dissolve if the opportunity dissipates, i.e., if the U.S. retreats from its role as the global policeman. Perhaps an apt historical example is how the victors in WWII splintered into NATO and the Warsaw Pact once Nazi Germany was defeated.

  1. ^

    Full disclosure: I've not (yet) read Applebaum's Autocracy Inc.

  2. ^

    What comes to mind is Kant, et al.'s democratic peace theory.

Thanks Mike. I agree that the alliance is fortunately rather loose in the sense that most of these countries share no ideology. (In fact, some of them should arguably be ideological enemies, e.g., Islamic theocrats in Iran and Maoist communists in China). 

But I worry that this alliance is held together by a hatred of (or ressentiment in general) Western secular democratic principles for ideological and (geo-)political reasons. Hatred can be an extremely powerful and unifying force. (Many political/ideological movements are arguably primarily defined, united, and motivated by what they hate, e.g., Nazism by the hatred of Jews, communism by the hatred of capitalists, racists hate other ethnicities, Democrats hate Trump and racists, Republicans hate the woke and communists, etc.)

So I worry that as long as Western democracies continue to influence international affairs, this alliance will continue to exist. And I certainly hope that Western democracies will remain powerful, and I worry that the world (and the future) will become a worse place if they don't. 

Thanks, David. I mostly agree with @Stephen Clare's points, notwithstanding that I also generally agree with your critique. (The notion of a future dominated by religious fanaticism always brings to mind Frank Herbert's Dune saga.)

The biggest issue I have is, to echo David, odds of 1 in 30K strike me as far too low. 

Looking at Stephen's math...

But let’s say there’s:

  • A 10% chance that, at some point, an AI system is invented which gives whoever controls it a decisive edge over their rivals
  • A 3% chance that a totalitarian state is the first to invent it, or that the first state to invent it becomes totalitarian
  • A 1% chance that the state is able to use advanced AI to entrench its rule perpetually

That leaves about a 0.3% chance we see a totalitarian regime with unprecedented power (10% x 3%) and a 0.003% (1 in 30,000) chance it’s able to persist in perpetuity.

... I agree with emphasizing the potential for AI value lock-in, but I have a few questions:

  1. Does the eventual emergence of an AI-empowered totalitarian state most likely require someone to have a "decisive edge" in AI at the outset?
  2. Could an AI-empowered global totalitarian state start as a democracy or even a corporation, rather than being totalitarian at the moment it acquires powerful AI capabilities?
  3. How do we justify the estimation that the conditioned odds of perpetual lock-in are only 1%?

Two sources of human misalignment that may resist a long reflection: malevolence and ideological fanaticism

(Alternative title: Some bad human values may corrupt a long reflection[1])

The values of some humans, even if idealized (e.g., during some form of long reflection), may be incompatible with an excellent future. Thus, solving AI alignment will not necessarily lead to utopia.

Others have raised similar concerns before.[2] Joe Carlsmith puts it especially well in the post “An even deeper atheism”:

“And now, of course, the question arises: how different, exactly, are human hearts from each other? And in particular: are they sufficiently different that, when they foom, and even "on reflection," they don't end up pointing in exactly the same direction? After all, Yudkowsky said, above, that in order for the future to be non-trivially "of worth," human hearts have to be in the driver's seat. But even setting aside the insult, here, to the dolphins, bonobos, nearest grabby aliens, and so on – still, that's only to specify a necessary condition. Presumably, though, it's not a sufficient condition? Presumably some human hearts would be bad drivers, too? Like, I dunno, Stalin?”

What makes human hearts bad? 

What, exactly, makes some human hearts bad drivers? If we better understood what makes hearts go bad, perhaps we could figure out how to make bad hearts good, or at least learn how to prevent hearts from going bad. It would also allow us to better spot potentially bad hearts and coordinate our efforts to prevent them from taking the driver's seat.

As of now, I’m most worried about malevolent personality traits and fanatical ideologies.[3]

Malevolence: dangerous personality traits

Some human hearts may be corrupted due to elevated malevolent traits like psychopathy, sadism, narcissism, Machiavellianism, or spitefulness.

Ideological fanaticism: dangerous belief systems

There are many suitable definitions of “ideological fanaticism”. Whatever definition we are going to use, it should describe ideologies that have caused immense harm historically, such as fascism (Germany under Hitler, Italy under Mussolini), (extreme) communism (the Soviet Union under Stalin, China under Mao), religious fundamentalism (ISIS, the Inquisition), and most cults. 

See this footnote[4] for a preliminary list of defining characteristics.

Malevolence and fanaticism seem especially dangerous

Of course, there are other factors that could corrupt our hearts or driving ability. For example, cognitive biases, limited cognitive ability, philosophical confusions, or plain old selfishness.[5] I’m most concerned about malevolence and ideological fanaticism for two reasons.   

Deliberately resisting reflection and idealization

First, malevolence—if reflectively endorsed[6]—and fanatical ideologies deliberately resist being changed and would thus plausibly resist idealization even during a long reflection. The most central characteristic of fanatical ideologies is arguably that they explicitly forbid criticism, questioning, and belief change and view doubters and disagreement as evil. 

Putting positive value on creating harm

Second, malevolence and ideological fanaticism would not only result in the future not being as good as it possibly could be—they might actively steer the future in bad directions and, for instance, result in astronomical amounts of suffering.

The preferences of malevolent humans (e.g., sadists) may be such that they intrinsically enjoy inflicting suffering on others. Similarly, many fanatical ideologies sympathize with excessive retributivism and often demonize the outgroup. Enabled by future technology, preferences for inflicting suffering on the outgroup may result in enormous disvalue—cf. concentration camps, the Gulag, or hell[7].

In the future, I hope to write more about all of this, especially long-term risks from ideological fanaticism. 

Thanks to Pablo and Ruairi for comments and valuable discussions. 

  1. ^

    “Human misalignment” is arguably a confusing (and perhaps confused) term. But it sounds more sophisticated than “bad human values”. 

  2. ^

    For example, Matthew Barnett in “AI alignment shouldn't be conflated with AI moral achievement”, Geoffrey Miller in “AI alignment with humans... but with which humans?”, lc in “Aligned AI is dual use technology”. Pablo Stafforini has called this the “third alignment problem”. And of course, Yudkowsky’s concept of CEV is meant to address these issues. 

  3. ^

    These factors may not be clearly separable. Some humans may be more attracted to fanatical ideologies due to their psychological traits, and malevolent humans often lead fanatical ideologies. Also, believing and following a fanatical ideology may not be good for your heart.

  4. ^

    Below are some typical characteristics (I’m no expert in this area):

    Unquestioning belief, absolute certainty and rigid adherence. The principles and beliefs of the ideology are seen as absolute truth and questioning or critical examination is forbidden.

    Inflexibility and refusal to compromise

    Intolerance and hostility towards dissent. Anyone who disagrees or challenges the ideology is seen as evil; as enemies, traitors, or heretics.

    Ingroup superiority and outgroup demonization. The in-group is viewed as superior, chosen, or enlightened. The out-group is often demonized and blamed for the world's problems. 

    Authoritarianism. Fanatical ideologies often endorse (or even require) a strong, centralized authority to enforce their principles and suppress opposition, potentially culminating in dictatorship or totalitarianism.

    Militancy and willingness to use violence.

    Utopian vision. Many fanatical ideologies are driven by a vision of a perfect future or afterlife which can only be achieved through strict adherence to the ideology. This utopian vision often justifies extreme measures in the present. 

    Use of propaganda and censorship

  5. ^

    For example, Barnett argues that future technology will be primarily used to satisfy economic consumption (aka selfish desires). That seems plausible to me; however, I’m not that concerned about this causing huge amounts of future suffering (at least compared to other s-risks). It seems to me that most humans place non-trivial value on the welfare of (neutral) others such as animals. Right now, this preference (for most people) isn’t strong enough to outweigh the selfish benefits of eating meat. However, I’m relatively hopeful that future technology would make such types of tradeoffs much less costly. 

  6. ^

    Some people (how many?) with elevated malevolent traits don’t reflectively endorse their malevolent urges and would change them if they could. However, some of them do reflectively endorse their malevolent preferences and view empathy as weakness. 

  7. ^

    Some quotes from famous Christian theologians: 

    Thomas Aquinas:  "the blessed will rejoice in the punishment of the wicked." "In order that the happiness of the saints may be more delightful to them and that they may render more copious thanks to God for it, they are allowed to see perfectly the sufferings of the damned". 

    Samuel Hopkins:  "Should the fire of this eternal punishment cease, it would in a great measure obscure the light of heaven, and put an end to a great part of the happiness and glory of the blessed.”

    Jonathan Edwards:  "The sight of hell torments will exalt the happiness of the saints forever."

Existential risks from within?

(Unimportant discussion of probably useless and confused terminology.)

I sometimes use terms like “inner existential risks” to refer to risk factors like malevolence and fanaticism. Inner existential risks primarily arise from “within the human heart”—that is, they are primarily related to the values, goals and/or beliefs of (some) humans. 

My sense is that most x-risk discourse focuses on outer existential risks, that is, x-risks which primarily arise from outside the human mind. These could be physical or natural processes (asteroids, lethal pathogens) or technological processes that once originated in the human mind but are now out of their control (e.g., AI, nuclear weapons, engineered pandemics).

Of course, most people already believe that the most worrisome existential risks are anthropogenic, that is, caused by humans. One could argue that, say, AI and engineered pandemics are actually inner existential risks because they arose from within the human mind. I agree that the distinction between inner and outer existential risks is not super clear. Still, it seems to me that the distinction between inner and outer existential risks captures something vaguely real and may serve as some kind of intuition pump.

Then there is the related issue of more external or structural risk factors, like political or economic systems. These are systems invented by human minds and which in turn are shaping human minds and values. I will conveniently ignore this further complication.

Other potential terms for inner existential risks could be intraanthropic, idioanthropic, or psychogenic (existential) risks.

I just realized that in this (old) 80k podcast episode[1], Holden makes similar points and argues that aligned AI could be bad. 

My sense is that Holden alludes to both malevolence ("really bad values, [...] we shouldn't assume that person is going to end up being nice") and ideological fanaticism ("create minds that [...] stick to those beliefs and try to shape the world around those beliefs", [...] "This is the religion I follow. This is what I believe in. [...] And I am creating an AI to help me promote that religion, not to help me question it or revise it or make it better."). 

Longer quotes below (emphasis added): 

Holden: “The other part — if we do align the AI, we’re fine — I disagree with much more strongly. [...] if you just assume that you have a world of very capable AIs, that are doing exactly what humans want them to do, that’s very scary. [...]

Certainly, there’s the fact that because of the speed at which things move, you could end up with whoever kind of leads the way on AI, or is least cautious, having a lot of power — and that could be someone really bad. And I don’t think we should assume that just because that if you had some head of state that has really bad values, I don’t think we should assume that that person is going to end up being nice after they become wealthy, or powerful, or transhuman, or mind uploaded, or whatever — I don’t think there’s really any reason to think we should assume that.

And then I think there’s just a bunch of other things that, if things are moving fast, we could end up in a really bad state. Like, are we going to come up with decent frameworks for making sure that the digital minds are not mistreated? Are we going to come up with decent frameworks for how to ensure that as we get the ability to create whatever minds we want, we’re using that to create minds that help us seek the truth, instead of create minds that have whatever beliefs we want them to have, stick to those beliefs and try to shape the world around those beliefs? I think Carl Shulman put it as, “Are we going to have AI that makes us wiser or more powerfully insane?”

[...] I think even if we threw out the misalignment problem, we’d have a lot of work to do — and I think a lot of these issues are actually not getting enough attention.”

Rob Wiblin: Yeah. I think something that might be going on there is a bit of equivocation in the word “alignment.” You can imagine some people might mean by “creating an aligned AI,” it’s like an AI that goes and does what you tell it to — like a good employee or something. Whereas other people mean that it’s following the correct ideal values and behaviours, and is going to work to generate the best outcome. And these are really quite separate things, very far apart.

Holden Karnofsky: Yeah. Well, the second one, I just don’t even know if that’s a thing. I don’t even really know what it’s supposed to do. I mean, there’s something a little bit in between, which is like, you can have an AI that you ask it to do something, and it does what you would have told it to do if you had been more informed, and if you knew everything it knows. That’s the central idea of alignment that I tend to think of, but I think that still has all the problems I’m talking about. Just some humans seriously do intend to do things that are really nasty, and seriously do not intend — in any way, even if they knew more — to make the world as nice as we would like it to be.

And some humans really do intend and really do mean and really will want to say, you know, “Right now, I have these values” — let’s say, “This is the religion I follow. This is what I believe in. This is what I care about. And I am creating an AI to help me promote that religion, not to help me question it or revise it or make it better.” So yeah, I think that middle one does not make it safe. There might be some extreme versions, like, an AI that just figures out what’s objectively best for the world and does that or something. I’m just like, I don’t know why we would think that would even be a thing to aim for. That’s not the alignment problem that I’m interested in having solved.

  1. ^

    I'm one of those bad EAs who don't listen to all 80k episodes as soon as they come out. 

Barnett argues that future technology will be primarily used to satisfy economic consumption (aka selfish desires). That seems plausible to me; however, I’m not that concerned about this causing huge amounts of future suffering (at least compared to other s-risks). It seems to me that most humans place non-trivial value on the welfare of (neutral) others such as animals. Right now, this preference (for most people) isn’t strong enough to outweigh the selfish benefits of eating meat. However, I’m relatively hopeful that future technology would make such types of tradeoffs much less costly.

At the same time it becomes less selfishly costly to be kind to animals due to technological progress, it could become more selfishly enticing to commit other moral tragedies. For example, it could hypothetically turn out, just as a brute empirical fact, that the most effective way of aligning AIs is to treat them terribly in some way, e.g. by brainwashing them or subjecting them to painful stimuli. 

More generally, technological progress doesn't seem to asymmetrically make people more moral. Factory farming, as a chief example, allowed people to satisfy their desire for meat more cost-effectively, but at a larger moral cost compared to what existed previously. Even if factory farming is eventually replaced with something humane, there doesn't seem to be an obvious general trend here.

The argument you allude to that I find most plausible here is the idea that incidental s-risks as a byproduct of economic activity might not be as bad as some other forms of s-risks. But at the very least, incidental s-risks seem plausibly quite bad in expectation regardless.

For example, it could hypothetically turn out, just as a brute empirical fact, that the most effective way of aligning AIs is to treat them terribly in some way, e.g. by brainwashing them or subjecting them to painful stimuli. 

Yes, agree. (For this and other reasons, I'm supportive of projects like, e.g., NYU MEP.)

I also agree that there are no strong reasons to think that technological progress improves people's morality.

As you write, my main reason for worrying more about agential s-risks is that the greater the technological power of agents, the more their intrinsic preferences matter in what the universe will look like. To put it differently, actors whose terminal goals put some positive value on suffering (e.g., due to sadism, retributivism, or other weird fanatical beliefs) would deliberately aim to arrange matter in such a way that it contains more suffering—this seems extremely worrisome if they have access to advanced technology. 

Altruists would also have a much harder time trading with such actors, whereas purely selfish actors (who don't put positive value on suffering) could plausibly engage in mutually beneficial trades (e.g., they use (slightly) less efficient AI training/alignment methods which contain much less suffering, and altruists give them some of their resources in return). 

But at the very least, incidental s-risks seem plausibly quite bad in expectation regardless.

Yeah, despite what I have written above, I probably worry more about incidental s-risks than the average s-risk reducer. 


Selecting RLHF human raters for desirable traits?

Epistemic status: I wrote this quickly (for my standards) and I have ~zero expertise in this domain.

Introduction

It seems plausible that language models such as GPT-3 inherit (however haphazardly) some of the traits, beliefs, and value judgments of the human raters doing RLHF. For example, Perez et al. (2022) find that models trained via RLHF are more prone to make statements corresponding to Big Five agreeableness than models not trained via RLHF. This is presumably (in part) because human raters gave positive ratings to any behavior exhibiting such traits.

Given this, it seems plausible that selecting RLHF raters for more desirable traits—e.g., low malevolence, epistemic virtues / truth-seeking, or altruism—would result in LLMs instantiating more of these characteristics. (In a later section, I will discuss which traits seem most promising to me and how to measure them.)

It’s already best practice to give human RLHF raters reasonably long training instructions and have them undergo some form of selection process. For example, for InstructGPT, the instruction manual was 17 pages long and raters were selected based on their performance in a trial which involved things like ability to identify sensitive speech (Ouyang et al., 2022, Appendix B). So adding an additional (brief) screening for these traits wouldn’t be that costly or unusual.
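To make the proposal concrete, here is a minimal sketch of what such an additional screening step might look like. Everything here is an assumption for illustration: the scale names, the 1-5 Likert scoring, and the cutoff values are hypothetical, not a validated screening instrument.

```python
# Hypothetical sketch: adding a brief trait screen to an existing RLHF
# rater-selection pipeline. Scale names and cutoffs are illustrative
# assumptions, not a validated instrument.

def passes_trait_screen(scores, max_dark_triad=2.5, min_honesty_humility=3.5):
    """Return True if a candidate rater's questionnaire scores fall within
    the (assumed) acceptable ranges. All scores are on a 1-5 Likert scale."""
    # Average the three dark-triad subscales into a single score.
    dark_triad = (scores["machiavellianism"]
                  + scores["narcissism"]
                  + scores["psychopathy"]) / 3
    return (dark_triad <= max_dark_triad
            and scores["honesty_humility"] >= min_honesty_humility)

# Two made-up candidates: the first scores low on dark-triad traits,
# the second scores high and would be screened out.
candidates = [
    {"machiavellianism": 1.8, "narcissism": 2.0, "psychopathy": 1.2,
     "honesty_humility": 4.1},
    {"machiavellianism": 4.5, "narcissism": 4.0, "psychopathy": 3.8,
     "honesty_humility": 2.0},
]
accepted = [c for c in candidates if passes_trait_screen(c)]
```

In practice, such a screen would sit alongside the existing trial-task selection rather than replace it, and the cutoffs would need empirical validation.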

Clarification

Talking about stable traits or dispositions of LLMs is inaccurate. Given different prompts, LLMs simulate wildly different characters with different traits. So the concept of inheriting dispositions from human RLHF raters is misleading. 

We might reformulate the path to impact as follows: If we train LLMs with RLHF raters with traits X, then a (slightly) larger fraction of characters or simulacra that LLMs tend to simulate will exhibit the traits X. This increases the probability that the eventual character(s) that transformative AIs will “collapse on” (if this ever happens) will have traits X.

Open questions

I don’t know how the RLHF process works in detail. For example, i) to what extent is the behavior of individual RLHF raters double-checked or scrutinized, either by AI company employees or other RLHF raters, after the initial trial period is over, and ii) do RLHF raters know when the trial period has ended? In the worst case, trolls could behave well during the initial trial period but then, e.g., deliberately reward offensive or harmful LLM behavior for the lulz. 

Fortunately, I expect that at most a few percent of people would behave like this. Is this enough to meaningfully affect the behavior of LLMs?

Generally, it could be interesting to do more research on whether and to what extent the traits and beliefs of RLHF raters influence the type of feedback they give. For example, it would be good to know whether RLHF raters that score highly on some dark triad measure in fact systematically reward more malevolent LLM behavior.

Which traits precisely should we screen RLHF raters for? I make some suggestions in this section below.

Positive impact, useless, or negative impact?

Why this might be positive impact

  • Pushing for adopting such selection processes now increases the probability that they will be used when training truly transformative AI. Arguably, whether or not current-day LLMs exhibit desirable traits doesn’t really matter all that much. However, if we convince AI companies to adopt such selection processes now, this will plausibly increase the probability that they will continue to use these selection processes (if only because of organizational inertia) once they train truly transformative AIs. If we wait to do so six months before the singularity, AI companies might be too busy to adopt such practices. 
    • Of course, the training setup and architecture of future transformative AIs might be totally different. But they might also be at least somewhat similar. 
  • If (transformative) AIs really inherit, even if in a haphazard fashion, the traits and beliefs of RLHF raters, then this increases the expected value of the long-term future as long as RLHF raters are selected for desirable traits. For example, it seems fairly clear that transformative AIs with malevolent traits increase s-risk and x-risks. 
    • This is probably especially valuable if we fail at aligning AIs. That is, if we successfully align our AIs, the idiosyncratic traits of RLHF raters won’t make a difference because the values of the AI are fully aligned with the human principals anyways. But unaligned AIs might differ a lot in their values. For example, an unaligned AI with some sadistic traits will create more expected disvalue than an unaligned AI that just wants to create paper clips.
  • It might already be valuable to endow non-transformative, present-day AIs with more desirable traits. For example, having more truthful present-day AI assistants seems beneficial for various reasons, such as having a more informed populace, more truth-tracking and nuanced political discourse, and increased cooperation and trust. Ultimately, truthful AI assistants would also help us with AI alignment. For much more detail, see Evans et al. (2021, chapter 3).

Why this is probably not that impactful

  • This doesn’t solve any problems related to inner alignment or mesa optimization. (In fact, it might increase risks related to deceptive alignment but more on this below.)
  • Generally, it’s not clear that the dispositions or preferences of AIs will correspond in some predictable way to the kind of human feedback they received. It seems clear that current AIs will inherit some of the traits, views, and values of human RLHF raters, at least on distribution. However, as the CoinRun example showcases, it’s difficult to know what values an AI is actually learning as a result of our training. That is, off-distribution behavior might be radically different than what we expect.
  • There will probably be many RLHF raters. Many of the more problematic traits, e.g., psychopathy or sadism, seem relatively rare, so they wouldn't have much of an influence anyway.
  • People won’t just give feedback based on what appeals to their idiosyncratic traits or beliefs. They are given detailed instructions on what to reward. This means that working on the instructions that RLHF raters receive is probably more important. However, as mentioned above, malevolent RLHF raters or “trolls” might deliberately do the opposite of what they are instructed to do and reward e.g. sadistic or psychopathic behavior. Also, instructions cannot cover every possible example so in unclear cases, the idiosyncratic traits and beliefs of human RLHF raters might make a (tiny) difference.
  • The values AGIs learn during training might change later as they reflect more and resolve internal conflicts. This process might be chaotic and thus reduce the expected magnitude of any intervention that focuses on instilling any particular values right now. 
  • Generally, what matters are not the current LLMs but the eventual transformative AIs. These AIs might have completely different architectures or training setups than current systems.

Why this might have negative impact

  • RLHF might actually be net negative, and selecting for desirable traits in RLHF raters (insofar as it has an effect at all) might exacerbate these negative effects. For instance, Oliver Habryka argues: “In most worlds RLHF, especially if widely distributed and used, seems to make the world a bunch worse from a safety perspective (by making unaligned systems appear aligned at lower capabilities levels, meaning people are less likely to take alignment problems seriously, and by leading to new products that will cause lots of money to go into AI research, as well as giving a strong incentive towards deception at higher capability levels)”. For example, the fact that Bing Chat was blatantly misaligned was arguably positive because it led more people to take AI risks seriously. 
    • On the other hand, Paul Christiano addresses (some of) these arguments here and overall believes that RLHF has been net positive.
  • In general, this whole proposal is not an intervention that makes substantial, direct progress on the central parts of the alignment problem. Thus, it might just distract from the actually important and difficult parts of the problem. It might even be used as some form of safety washing.
  • Another worry is that pushing for selection processes will mutate into selecting traits we don’t particularly care about. For instance, OpenAI seems primarily concerned with issues that are important to the political left.[1] So maybe pitching OpenAI (or other AI companies) the idea of selecting RLHF raters according to desirable traits will mostly result in a selection process that upholds a long list of “woke” constraints, which in some instances, might be in conflict with other desirable traits such as truthfulness. However, it might still be net positive. 

Which traits and how?

I list a few suggestions for traits we might want to select for below. All of the traits I list arguably have the following characteristics: 

  • i) it plausibly affects existential or suffering risks if present in transformative AIs;
  • ii) AI assistants exhibiting more of it would be beneficial for the longterm future, or at least not negative;
  • iii) it is uncontroversially viewed as (un)desirable;
  • iv) it is (reliably and briefly) measurable in humans. 
    • If we can’t reliably measure a trait in humans, we obviously cannot select for it. 
    • The shorter the measures, the cheaper they are to employ, and the easier it is to convince AI companies to use them.

Ideally, any trait we want to include in an RLHF rater selection process should have these characteristics. The reasons for these criteria are obvious, but I briefly elaborate on them in this footnote.[2]

This isn’t a definitive or exhaustive list by any means. In fact, which traits to select for, and how to measure them (perhaps even developing novel measurements) could arguably be a research area for psychologists or other social scientists. 

Dark tetrad traits / malevolence

One common operationalization of malevolence is the dark tetrad, comprising Machiavellianism, narcissism, psychopathy, and sadism. I have previously written on the nature of dark tetrad traits and the substantial risks they pose. It seems obvious that we don’t want any AIs to exhibit these traits. 

Fortunately, these traits have been studied extensively by psychologists. Consequently, brief and reliable measures of these traits exist, e.g., the Short Dark Tetrad (Paulhus et al., 2020) or the Short Dark Triad (Jones & Paulhus, 2014). However, since these are merely self-report scales, it’s unclear how well they work in situations where people know they are being assessed for a job.
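To make the selection idea concrete, here is a minimal, purely illustrative sketch of how scoring a short self-report scale with a screening cutoff might work. The item count, which items are reverse-keyed, and the cutoff value are all hypothetical placeholders, not the actual Short Dark Tetrad items or norms:

```python
# Hypothetical sketch of scoring a short Likert-type dark-trait scale.
# Item structure, reverse-keying, and cutoff are illustrative placeholders.

def score_scale(responses, reverse_keyed, scale_max=5):
    """Average a list of 1..scale_max Likert responses,
    flipping reverse-keyed items (by zero-based index)."""
    adjusted = [
        (scale_max + 1 - r) if i in reverse_keyed else r
        for i, r in enumerate(responses)
    ]
    return sum(adjusted) / len(adjusted)

def passes_screen(responses, reverse_keyed, cutoff=2.5):
    """A candidate passes if their mean dark-trait score is at or below the cutoff."""
    return score_scale(responses, reverse_keyed) <= cutoff

# Example: 7 items, with items 2 and 5 reverse-keyed (illustrative).
responses = [1, 2, 4, 1, 2, 5, 1]
print(passes_screen(responses, reverse_keyed={2, 5}))  # → True
```

In practice, of course, the hard part is not the arithmetic but the validity of self-reports in a job-assessment context, as noted above.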

Truthfulness and epistemic virtues

(I outlined some of the benefits of truthfulness above, in the third bullet point of this section.)

It’s not easy to measure how truthful humans are, especially in assessment situations.[3] Fortunately, there exist reliable measures for some epistemic virtues that correlate with truthfulness. For example, the argument evaluation test (Stanovich & West, 1997) or the actively open-minded thinking scale (e.g., Baron, 2019). See also Stanovich and West (1998) for a classic overview of various measures of epistemic rationality.

Still, none of these measures are all that great. For example, some of these measures, especially the AOT scale, have strong ceiling effects. Developing more powerful measures would be useful.

Pragmatic operationalization: forecasting ability

One possibility would be to select for human raters above some acceptable threshold of forecasting ability as forecasting skills correlate with epistemic virtues. The problem is that very few people have a public forecasting track record and measuring people’s forecasting ability is a lengthy and costly process. 
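If a forecasting track record were available, the standard way to quantify it is the Brier score (mean squared error between probabilistic forecasts and binary outcomes; lower is better). The sketch below assumes resolved binary questions; the 0.2 cutoff is an arbitrary placeholder, not an established bar:

```python
# Illustrative sketch: screening raters by forecasting accuracy via Brier score.
# The cutoff value is a hypothetical placeholder.

def brier_score(forecasts, outcomes):
    """forecasts: probabilities in [0, 1]; outcomes: 0/1 resolutions.
    Returns the mean squared error (lower is better; 0.25 = always guessing 0.5)."""
    return sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / len(forecasts)

def meets_forecasting_bar(forecasts, outcomes, cutoff=0.2):
    return brier_score(forecasts, outcomes) < cutoff

forecasts = [0.9, 0.2, 0.7, 0.1]
outcomes = [1, 0, 1, 0]
print(brier_score(forecasts, outcomes))  # → 0.0375
```

This only sidesteps, rather than solves, the problem the paragraph above raises: few candidates have any resolved forecasts to score in the first place.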

Cooperativeness, harm aversion, altruism

In some sense, altruism or benevolence are just the opposite of malevolence[4], so perhaps we could just use one or the other. HEXACO honesty-humility (e.g., Ashton et al., 2014) is one very well-studied measure of benevolence. Alternatives include the self-report altruism scale (Rushton et al., 1981) or behavior in economic games such as the dictator game.

Cooperativeness, however, is a somewhat distinct construct. Others have written about the benefits of making AIs more cooperative in this sense. One measure of cooperativeness is the Cooperative and Competitive Personality Scale (Lu et al., 2013).

Harm aversion could also be desirable because it might translate into (some form of) low-impact AIs. On the other hand, (excessive) instrumental harm aversion can come into conflict with consequentialist principles.

Other traits

As mentioned above, this is by no means an exhaustive list. There are many other traits which could be desirable, such as empathy, tolerance, helpfulness, fairness, intelligence, effectiveness-focus, compassion, or wisdom. Other possibly undesirable traits include spite, tribalism, partisanship, vengefulness, or (excessive) retributivism.

References

Ashton, M. C., Lee, K., & De Vries, R. E. (2014). The HEXACO Honesty-Humility, Agreeableness, and Emotionality factors: A review of research and theory. Personality and Social Psychology Review, 18(2), 139-152.

Baron, J. (2019). Actively open-minded thinking in politics. Cognition, 188, 8-18.

Evans, O., Cotton-Barratt, O., Finnveden, L., Bales, A., Balwit, A., Wills, P., ... & Saunders, W. (2021). Truthful AI: Developing and governing AI that does not lie. arXiv preprint arXiv:2110.06674.

Forsyth, L., Anglim, J., March, E., & Bilobrk, B. (2021). Dark Tetrad personality traits and the propensity to lie across multiple contexts. Personality and Individual Differences, 177, 110792.

Lee, K., & Ashton, M. C. (2014). The dark triad, the big five, and the HEXACO model. Personality and Individual Differences, 67, 2-5.

Lu, S., Au, W. T., Jiang, F., Xie, X., & Yam, P. (2013). Cooperativeness and competitiveness as two distinct constructs: Validating the Cooperative and Competitive Personality Scale in a social dilemma context. International Journal of Psychology, 48(6), 1135-1147.

Perez, E., Ringer, S., Lukošiūtė, K., Nguyen, K., Chen, E., Heiner, S., ... & Kaplan, J. (2022). Discovering language model behaviors with model-written evaluations. arXiv preprint arXiv:2212.09251.

Rushton, J. P., Chrisjohn, R. D., & Fekken, G. C. (1981). The altruistic personality and the self-report altruism scale. Personality and Individual Differences, 2(4), 293-302.

Stanovich, K. E., & West, R. F. (1997). Reasoning independently of prior belief and individual differences in actively open-minded thinking. Journal of Educational Psychology, 89(2), 342.

Stanovich, K. E., & West, R. F. (1998). Individual differences in rational thought. Journal of Experimental Psychology: General, 127(2), 161.

  1. ^

    Though, to be fair, this snapshot of the instruction guidelines seems actually fair and balanced.

  2. ^

    i) is important because the trait is otherwise not very consequential, ii) is obvious, iii) is more or less necessary because we otherwise couldn’t convince AI companies to select according to these traits because they would disagree or because they would fear public backlash, iv) is required because if we can’t reliably measure a trait in humans, we obviously cannot select for it. The shorter the measures, the cheaper they are to employ, and the easier it is to convince AI companies to use them.

  3. ^

    Though dark tetrad traits correlate with a propensity to lie (Forsyth et al., 2021).

  4. ^

    For instance, HEXACO honesty-humility correlates highly negatively with dark triad traits (e.g., Lee & Ashton, 2014).
