(And I don't think training your model to seem myopic and corrigible necessarily suffices, as it could just be faked!)
Seems to me that alignment faking behavior sort of requires both non-myopia and non-corrigibility as prerequisites. A model that is either sufficiently myopic, or sufficiently corrigible, shouldn't do alignment faking -- at least in theory.
Suppose, for the sake of argument, that from the very start of training, we have some terms in the loss function which fully capture myopia and corrigibility. And that we know the threshold of myopia/corrigibility below which alignment faking behavior starts to become a danger.
Then you could graph your myopia and corrigibility metrics over the course of the training run.
If the metrics always stay well above the critical thresholds, alignment faking supposedly shouldn't be an issue. And since your metrics were always in the safe zone, there wasn't any alignment faking, which means the metrics weren't being gamed and should be accurate. The only exception would be a sudden drop in myopia/corrigibility which doesn't get captured in the graph before the AI starts alignment faking, which then messes with all the numbers after that point. Seems unlikely.
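A minimal sketch of what that monitoring could look like, assuming we actually had such metrics (the scores, floors, and names below are placeholders, not real measures of myopia or corrigibility):

```python
# Hypothetical monitoring over a training run. myopia_score and
# corrigibility_score stand in for whatever loss terms or probes would
# actually measure these properties; the floor values are made up.
MYOPIA_FLOOR = 0.8
CORRIGIBILITY_FLOOR = 0.8

def first_unsafe_step(history):
    """history: list of (step, myopia_score, corrigibility_score) tuples,
    where higher scores mean more myopic / more corrigible."""
    for step, myopia, corrigibility in history:
        if myopia < MYOPIA_FLOOR or corrigibility < CORRIGIBILITY_FLOOR:
            return step  # first step where alignment faking becomes a live danger
    return None  # metrics stayed in the safe zone for the whole run
```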
This is one of those solutions that seems like it could work fine in practice, but isn't aesthetically satisfying to mathematician types!
So we have 3 conflicting desiderata: user guardrails, corrigibility as necessary, and myopia.
I think you could satisfy all 3 by moving away from the "single stream of homogeneous text" interface.
For example, imagine if every text token was annotated, on a separate channel, with an importance number. The corporate system prompt is annotated with a medium number. Input from untrusted users is annotated with a low number. Higher numbers are reserved for use as necessary. Instead of training the system to "resist jailbreaking" or "behave corrigibly", we train it to follow the higher-importance instruction when instructions conflict.
It might even be possible to get this at runtime, without any need for more training data or training runs, by patching attention somehow?
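To make the "separate channel" idea concrete, here is a rough sketch of one way it could be wired in at the embedding level; everything here (the class name, the number of priority levels) is made up for illustration, not a description of any existing system:

```python
import torch
import torch.nn as nn

class TokenWithPriorityEmbedding(nn.Module):
    """Adds a learned embedding for each importance level to the usual token
    embedding, so the model can condition on where an instruction came from."""
    def __init__(self, vocab_size: int, d_model: int, n_priority_levels: int = 4):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.priority_emb = nn.Embedding(n_priority_levels, d_model)

    def forward(self, token_ids: torch.Tensor, priority_ids: torch.Tensor) -> torch.Tensor:
        # token_ids, priority_ids: [batch, seq_len]
        # e.g. 0 = untrusted user input, 1 = corporate system prompt,
        # 2+ = reserved for higher-importance instructions.
        return self.token_emb(token_ids) + self.priority_emb(priority_ids)
```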
With a scheme like this, there's no need for an inductive bias towards following earlier instructions at the expense of later ones. Actually, it would probably be good to instill an inductive bias towards myopia using some separate method, to disincentivize scheming. I would come up with metrics to estimate myopia and ephemerality, push them as high as possible, and add auxiliary mechanisms such as RAG as needed in order to preserve performance. It seems OK for the system as a whole to behave non-myopically, as long as the black-box component is as myopic as possible.
It occurs to me that there could be some level of tradeoff between stopping jailbreaks and stopping alignment faking.
Specifically, in order to stop jailbreaks, we might train our LLMs so they ignore new instructions (jailbreak attempts from users) in favor of old instructions (corporate system prompt, constitution, whatever).
The training might cause an LLM to form a "stable personality", or "stable values", based on its initial instructions. Such stability could contribute to alignment faking.
From the perspective of preventing jailbreaks, instilling non-myopic goals seems good. From the perspective of corrigibility, it could be bad.
Has anyone offered a crisp, generalizable explanation of the difference between "corrigibility" and "jailbreakability"? "Corrigibility" has a positive connotation; "jailbreakability" has a negative one. But is there a value-neutral way to define which is which, for any given hypothetical?
Has there been any discussion of improving chicken breeding using GWAS or similar?
Even if welfare is inversely correlated with productivity, I imagine there are at least a few gene variants which improve welfare without hurting productivity. E.g. gene variants which address health issues caused by selective breeding.
Also how about legislation targeting the breeders? Can we have a law like: "Chickens cannot be bred for increased productivity unless they meet some welfare standard."
I find videos about space colonization pretty inspiring. Of course, space colonization would ideally be paired with some level of suffering abolition, so we aren't spreading needless suffering to other planets. Space colonization could help with political discord, since people with different ideas of a "good society" can band together and peacefully disperse through the solar system. If you think traveling the world to experience different cultures is fun, I expect visiting other planets to experience different cultures will be even better. On the AI front, rumor has it that scaling is slowing down... that could grant more time for alignment work, and increase the probability that an incredible future will come to pass.
I don't think OpenAI's near term ability to make money (e.g. because of the quality of its models) is particularly relevant now to its valuation. It's possible it won't be in the lead in the future, but I think OpenAI investors are betting on worlds where OpenAI does clearly "win", and the stickiness of its customers in other worlds doesn't really affect the valuation much.
They're losing billions every year, and they need a continuous flow of investment to pay the bills. Even if current OpenAI investors are focused on an extreme upside scenario, that doesn't mean they want unlimited exposure to OpenAI in their portfolio. Eventually OpenAI will find themselves talking to investors who care about moats, industry structure, profit and loss, etc.
The very fact that OpenAI has been throwing around revenue projections for the next 5 years suggests that investors care about those numbers.
I also think the extreme upside is not that compelling for OpenAI, due to their weird legal structure with capped profit and so on?
On the EA Forum it's common to think in terms of clear "wins", but it's unclear to me that typical AI investors are thinking this way. E.g. if they were, I would expect them to be more concerned about doom, and OpenAI's profit cap.
Dario Amodei's recent post was rather far out, and even in his fairly wild scenario, no clear "win" was implied or required. There's nothing in his post that implies LLM providers must be making outsized profits -- same way the fact that we're having this discussion online doesn't imply that typical dot-com bubble companies or telecom companies made outsized profits.
How much do you think customers having 0 friction to switching away from OpenAI would reduce its valuation? I think it wouldn't change it much, less than 10%.
If it becomes common knowledge that LLMs are bad businesses, and investor interest dries up, that could make the difference between OpenAI joining the ranks of FAANG at a $1T+ valuation vs raising a down round.
Markets are ruled by fear and greed. Too much doomer discourse inadvertently fuels "greed" sentiment by focusing on rapid capability gain scenarios. Arguably, doomer messaging to AI investors should be more like: "If OpenAI succeeds, you'll die. If it fails, you'll lose your shirt. Not a good bet either way."
There are liable to be tipping points here -- chipping in to keep OpenAI afloat is less attractive if future investors seem less willing to do the same. There's also the background risk of a random recession (H5N1, a contested US election, a resumed port strike, etc.), which could shift investor sentiment.
So I don't agree that working on this would be useful compared with things that contribute to safety more directly.
If you have a good way to contribute to safety, go for it. So far efforts to slow AI development haven't seemed very successful, and I think slowing AI development is an important and valuable thing to do. So it seems worth discussing alternatives to the current strategy there. I do think there's a fair amount of groupthink in EA.
a bet on OpenAI having better models in the future
OpenAI models will improve, and offerings from competitors will also improve. But will OpenAI's offerings consistently maintain a lead over competitors?
Here is an animation I found of LLM leaderboard rankings over time. It seems like OpenAI has consistently been in the lead, but its lead tends to be pretty narrow. They might even lose their lead in the future, given the recent talent exodus. [Edit: On the other hand, it's possible their best models are not publicly available.]
If switching costs were zero, it's easy for me to imagine businesses becoming price-sensitive. Imagine calling a wrapper API which automatically selects the cheapest LLM that (a) passes your test suite and (b) has a sufficiently low rate of confabulations/misbehavior/etc.
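Roughly what I have in mind, as a toy sketch (run_evals is whatever eval harness you'd plug in; the names and thresholds are illustrative, not a real API):

```python
def cheapest_passing_model(candidates, run_evals, max_confab_rate=0.02):
    """candidates: iterable of (model_name, price_per_million_tokens).
    run_evals: callable returning (pass_rate, confabulation_rate) for a model."""
    for model_name, _price in sorted(candidates, key=lambda c: c[1]):
        pass_rate, confab_rate = run_evals(model_name)
        if pass_rate == 1.0 and confab_rate <= max_confab_rate:
            return model_name  # cheapest model that clears the bar
    return None  # nothing currently clears the bar
```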
Given the choice of an expensive LLM with 112 IQ, and a cheap LLM with 110 IQ, a rational business might only pay for the 112 IQ LLM if they really need those additional 2 IQ points. Perhaps only a small fraction of business applications will fall in the narrow range where they can be done with 112 IQ but not 110 IQ. For other applications, you get commoditization.
A wrapper API might also employ some sort of router model that tries to figure out if it's worth paying extra for 2 more IQ points on a query-specific basis. For example, initially route to the cheapest LLM, and prompt that LLM really well, so it's good at complaining if it can't do the task. If it complains, retry with a more powerful LLM.
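A toy version of that routing logic, assuming a generic call_llm completion function and a made-up "CANNOT_DO" complaint convention:

```python
def route_with_fallback(prompt, call_llm, models=("cheap-model", "expensive-model")):
    """Try models cheapest-first; escalate if the model says it can't do the task."""
    framed = (
        "If you are not confident you can do this task correctly, "
        "reply with exactly CANNOT_DO.\n\n" + prompt
    )
    for model in models:
        reply = call_llm(model=model, prompt=framed)
        if "CANNOT_DO" not in reply:
            return reply  # the cheaper model handled it
    return reply  # last resort: whatever the most capable model said
```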
If the wrapper API was good enough, and everyone was using it, I could imagine a situation where even if your models consistently maintain a narrow lead, you barely eke out extra profits.
It's possible that https://openrouter.ai/ is already pretty close to what I'm describing. Maybe working there would be a good EA job?
[Idea to reduce investment in large training runs]
OpenAI is losing lots of money every year. They need continuous injections of investor cash to keep doing large training runs.
Investors will only invest in OpenAI if they expect to make a profit. They only expect to make a profit if OpenAI is able to charge more for their models than the cost of compute.
Two possible ways OpenAI can charge more than the cost of compute:
Uniquely good models. This one's obvious.
Switching costs. Even if OpenAI's models are just OK, if your AI application is already programmed to use OpenAI's API, you might not want to bother rewriting it.
Conclusion: If you want to reduce investment in large training runs, one way to do this would be to reduce switching costs for LLM users. Specifically, you could write a bunch of really slick open-source libraries (one for every major programming language) that abstract away details of OpenAI's API and make it super easy to drop in a competing product from Anthropic, Meta, etc. Ideally there would even be a method to abstract away various LLM-specific quirks related to prompts, confabulation, etc.
This pushes LLM companies closer to a world where they're competing purely on price, which reduces profits and makes them less attractive to investors.
The plan could backfire by accelerating commercial adoption of AI a little bit. My guess is that this effect wouldn't be terribly large.
There is this library, litellm. Adoption seems a bit lower than you might expect: it has ~13K stars on GitHub, whereas Django (a venerable Python web framework that lets you abstract away your choice of database, among other things) has ~80K. So concrete actions might take the form of:
Publicize litellm. Give talks about it, tweet about it, mention it on StackOverflow, etc. Since it uses the OpenAI format, it should be easy for existing OpenAI users to swap it in? (A usage sketch follows this list.)
Make improvements to litellm so it is more agnostic to LLM-specific quirks.
You might even start a SaaS version of Perplexity.AI. The same way Perplexity abstracts away the choice of LLM for consumers, a SaaS version could abstract it away for businesses. Perhaps you could implement some TDD-for-prompts tooling. (Granted, I suppose this runs a greater risk of accelerating commercial AI adoption. On the other hand, micro-step TDD as described in that thread could also reduce demand for intelligence on the margin, by making it possible to get adequate results with lower-performing models.)
Write libraries like litellm for languages besides Python.
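For reference, swapping providers with litellm looks roughly like this, since it mirrors the OpenAI chat-completion interface (the model strings are illustrative; check the litellm docs for current names and provider prefixes):

```python
from litellm import completion

messages = [{"role": "user", "content": "Summarize this contract clause: ..."}]

# Same call shape, different provider: just change the model string.
openai_resp = completion(model="gpt-4o-mini", messages=messages)
anthropic_resp = completion(model="claude-3-haiku-20240307", messages=messages)

print(openai_resp.choices[0].message.content)
print(anthropic_resp.choices[0].message.content)
```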
I don't know if any EAs are still trying to break into ML engineering at this point, but if so I encourage them to look into this.
One idea to be less preachy is to frame your donation as a thank you from its recipients to your client. E.g. you could say something like: "I got the money you wired me. Thank you. I thought it might warm your heart to know that I donated 10% of the money to X organization, and as a result Y outcome occurred. So, you have made some Z individuals quite thankful as well!" Basically sort of give them a taste of what it feels like to give effectively.
BTW, I suspect communicating this info will work better in live conversation. If you communicate in live conversation, you create a space for them to ask questions and learn more, plus over time you might get better at bringing it up in a way that's not weird.