
sammyboiz🔸

CS undergrad & EA organizer
432 karma · Joined

Bio

I might do AI safety or ML after undergrad

Comments
61

You refer to alignment faking/deceptive alignment, where a model in training expects negative reward and gives responses accordingly, but outputs its true desires outside of training. This is a solvable problem, which is why I say alignment is not that hard.


Some other counterarguments:

  1. LLMs will have no reason to take over the world before or after RLHF. They do not value it as a terminal goal. It is possible that they gain a coherent, consistent, and misaligned goal purely by accident midway through RLHF and then fake their way through the rest of the fine-tuning. But this is unlikely and, again, solvable.
  2. Making LLMs unaware they are in training is possible.

You say that an LLM would optimize toward a reward function that is not reasonable or balanced by humans.

But human preference IS what a model optimizes for; how do you reconcile this?


For example, construct a scenario or give a moral thought experiment in which you believe GPT acts in an unbalanced way. Could you find one? If so, can this not be solved with more and better RLHF?

You say that there is a gap between how the model professes it will act and how it will actually act. However, a model trained to obey the RLHF objective would expect negative reward if it decided to take over the world, so why would it? Saying that a model will make harmful, unhelpful choices is akin to saying the base model will output typos: both of these things are trained against. If you are referring to deceptive alignment, that is an engineering problem, as I stated.

I agree with the "fresh sheet of paper." Reading the alignment faking paper and about the current alignment challenges has been far more informative than reading Yudkowsky.

 

I think these circles have granted him too many Bayes points for predicting the alignment problem when the technical details of his version of it basically don't apply to deep learning, as you said.

Oh, I see. I was quick to bifurcate between deontology and utilitarianism; I guess I'm less familiar with other branches of consequentialism. Sorry for being unclear in the critique. My whole reply was just centered on the claim that this is bad deontologically.

I see your point.

In the interest of the people alive today, there is an argument to be made for taking on some risk of extinction. However, outside a purely utilitarian framing, I think it's extremely careless and condemnable to impose this risk on humanity just because you have personally deemed it acceptable. This would be a deontological nightmare. Who gave AI labs the right to risk the lives of 8 billion people?

I was reluctant to get into the weeds here, but how can anything near this model be possible if 2^300 is around how many atoms there are in the universe and we have already conquered 2^150 of them? At some point, there will likely be no more growing, and then there will be millions of stable utopia years.
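To make the bound explicit, here is a minimal back-of-the-envelope sketch using the figures from this comment (2^300 as the resource ceiling, 2^150 as what is already used); these are illustrative numbers from the comment, not independent estimates:

```latex
\documentclass{article}
\usepackage{amsmath}
\begin{document}
% Remaining room for exponential growth, using this comment's own figures:
% a ceiling of roughly $2^{300}$ atoms, of which roughly $2^{150}$ are already used.
\[
  \text{remaining doublings} \;\le\; \log_2\!\frac{2^{300}}{2^{150}} \;=\; 150
\]
% Even doubling once per year, growth saturates within about 150 years;
% after that, each further year adds a roughly constant amount of value,
% i.e. the "millions of stable utopia years."
\end{document}
```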

Reaping the benefits of AGI later is pretty insignificant in my opinion. If we get an aligned-AGI utopia, we will have utopia for millions of years. Acceleration by a few years is negligible if it increases p(doom) by >1%.

1% × 1 million utopia years = 10 thousand utopia years (far more than the ~2 utopia years gained by accelerating)
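Spelled out as an expected-value comparison (a rough sketch; the million-year horizon, the ~2-year acceleration, and the 1% increase in p(doom) are the illustrative numbers used in this comment, not estimates of mine):

```latex
\documentclass{article}
\usepackage{amsmath}
\usepackage{amssymb}
\begin{document}
% Expected cost of a 1% increase in p(doom) against a ~10^6 utopia-year horizon,
% versus the expected gain from ~2 years of acceleration.
\begin{align*}
  \mathbb{E}[\text{loss}] &= 0.01 \times 10^{6}\ \text{utopia-years} = 10^{4}\ \text{utopia-years}\\
  \mathbb{E}[\text{gain}] &\approx 2\ \text{utopia-years}\\
  \mathbb{E}[\text{loss}] &\gg \mathbb{E}[\text{gain}]
\end{align*}
\end{document}
```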

Dario gives a 25% p(doom), if I'm not mistaken. He still knowingly continues to build the tech that could bring doom. Dario and Anthropic are pro-acceleration in their messaging and actions, according to a LessWronger. How is this position coherent?

I don't think you can name another company that admits to building technology with a >1% chance of killing everyone... besides maybe OpenAI.
