Isn't mechinterp basically setting out to build tools for AI self-improvement?
One of the things people are most worried about is AIs recursively improving themselves. (Whether everyone who claims this as a red line would actually treat it as one is a separate question for another post.)
It seems to me like mechanistic interpretability is basically a really promising avenue for that. Trivial example: Claude decides that the most important thing is being the Golden Gate Bridge. Claude reads up on Anthropic's work, gets access to the relevant tools, and does brain surgery on itself to turn into Golden Gate Bridge Claude.
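(For concreteness: Golden Gate Claude was made by clamping a sparse-autoencoder feature, and a crude open-weights analogue is adding a steering vector to a layer's output with a forward hook. The sketch below is purely illustrative; the model, layer index, scale, and random "direction" are placeholders, and a real direction would have to come from actual interpretability work.)

```python
# Illustrative only: activation steering on a small open model via a forward hook.
# The model, layer index, scale, and random "direction" are placeholders; a real
# steering direction would come from interpretability work (e.g. an SAE feature).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

direction = torch.randn(model.config.n_embd)
direction = direction / direction.norm()

def steer(module, inputs, output):
    # A GPT-2 block returns a tuple whose first element is the hidden states;
    # add the scaled direction to every position's residual stream.
    return (output[0] + 8.0 * direction,) + output[1:]

handle = model.transformer.h[6].register_forward_hook(steer)
ids = tok("The most important thing about me is", return_tensors="pt")
out = model.generate(**ids, max_new_tokens=20, do_sample=False)
print(tok.decode(out[0], skip_special_tokens=True))
handle.remove()
```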
More meaningfully, it seems like any ability to understand in a fine-grained way what's going on in a big model could be co-opted by an AI to "learn" in some way. In general, I think the case that seems most likely soonest is:
* Learn in-context (e.g. results of experiments, feedback from users, things like we've recently observed in scheming papers...)
* Translate this to appropriate adjustments to weights (identified using mechinterp research)
* Execute those adjustments (a toy sketch of what this step could even look like follows below)
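The "execute those adjustments" step could be as crude as a direct edit to a weight matrix. Again, this is only a toy sketch: the layer, vectors, and scale are placeholders standing in for whatever an interpretability pipeline would actually identify.

```python
# Toy sketch of a direct weight edit; the layer, vectors, and scale are
# placeholders for whatever an interpretability pipeline would identify.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")
w = model.transformer.h[6].mlp.c_proj.weight      # shape (n_inner, n_embd)

key = torch.randn(w.shape[0])     # placeholder: "which internal pattern to respond to"
value = torch.randn(w.shape[1])   # placeholder: "what to write into the residual stream"

with torch.no_grad():
    w += 1e-3 * torch.outer(key, value)           # rank-one nudge to the layer's output map
# model.save_pretrained(...) would then persist the change
```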
Maybe I'm late to this party and everyone was already conceptualising mechinterp as a very dual-use technology, but I'm here now.
Honestly, maybe it leans more towards "offense" (i.e., catastrophic misalignment) than defense! It will almost inevitably require automation to be useful, so we're ceding it to machines out of the gate. I'd expect tomorrow's models to be better placed than humans to make sense of and use mechinterp techniques - partly just because of sheer compute, but also maybe (and now I'm speculating on stuff I understand even less) because the nature of their cognition is more suited to what's involved.
Some of my thoughts on funding.
It's giving season and I want to finally get around to publishing some of my thoughts and experiences around funding. I haven't written anything yet because I feel like I'm mostly just revisiting painful experiences and will end up writing an angry rant. I have ideas for how things could be better, so hopefully this can lead to positive change, not just more complaining. All my experiences are in AI Safety.
On Timing: Certainty is more important than speed. The total decision time matters less than the overdue time. Expecting a decision in 30 days and getting it in 35 is worse than expecting it in 90 days and getting it in 85.
Grantmakers providing statistics about timing expectations makes things worse. If the mean or median response time is N days and it is now day N+5, is it appropriate for me to send a follow-up email to check on the status? Technically it's not late yet. It could come tomorrow or take another N days. Imagine if the Uber app showed you the global mean wait time for the last 12 months, with no map to track your driver's arrival.
"It doesn't have to reduce the waiting time it just has to reduce the uncertainty" - Rory Sutherland
My conversations about expectations and experiences with people in Berkeley are at times very different from those with people outside of Berkeley.
After I posted my announcement about shutting down AISS and my comment on the LTFF update, several people reached out to me about their experiences. Some I already knew well, some I had met, and others I didn't know before. Some of them had received funding a couple of times, but their negative experiences led them not to reapply and to walk away from their work or from the ecosystem entirely. At least one mentioned having a draft post about their experience that they did not feel comfortable publishing.
There was definitely a point for me where I had already given up but just not realised it. I had already run out of funding
Anthropic's Twitter account was hacked. It's "just" a social media account, but it raises some concerns.
Update: the post has just been deleted. They keep the updates on their status page: https://status.anthropic.com/
I'd love to see an 'Animal Welfare vs. AI Safety/Governance Debate Week' happening on the Forum. The AI risk cause has grown massively in importance in recent years and has become a priority career choice for many in the community. At the same time, the Animal Welfare vs Global Health Debate Week demonstrated just how important and neglected the cause of animal welfare remains. I know several people (including myself) who are uncertain or torn about whether to pursue careers focused on reducing animal suffering or on mitigating existential risks from AI. It would help to have rich discussions comparing both causes' current priorities and bottlenecks, and a debate week would hopefully surface some useful crucial considerations.
We should expect the incentives and culture of AI-focused companies to make them uniquely terrible for producing safe AGI.
From a “safety from catastrophic risk” perspective, I suspect an “AI-focused company” (e.g. Anthropic, OpenAI, Mistral) is abstractly pretty close to the worst possible organizational structure for getting us towards AGI. I have two distinct but related reasons:
1. Incentives
2. Culture
From an incentives perspective, consider realistic alternative organizational structures to "AI-focused company" that nonetheless have enough firepower to host successful multibillion-dollar scientific/engineering projects:
1. As part of an intergovernmental effort (e.g. CERN’s Large Hadron Collider, the ISS)
2. As part of a governmental effort of a single country (e.g. Apollo Program, Manhattan Project, China’s Tiangong)
3. As part of a larger company (e.g. Google DeepMind, Meta AI)
In each of those cases, I claim that there are stronger (though still not ideal) organizational incentives to slow down, pause/stop, or roll back deployment if there is sufficient evidence or reason to believe that further development can result in major catastrophe. In contrast, an AI-focused company has every incentive to go ahead on AI when the case for pausing is uncertain, and minimal incentive to stop or even take things slowly.
From a culture perspective, I claim that, without knowing any details of the specific companies, you should expect AI-focused companies to be more likely than the plausible alternatives above to have the following cultural elements:
1. Ideological AGI Vision: AI-focused companies may have a large contingent of "true believers" who are ideologically motivated to make AGI at all costs.
2. No Pre-existing Safety Culture: AI-focused companies may have minimal or no strong "safety" culture where people deeply understand, have experience in, and are motivated by a desire to avoid catastrophic outcomes.
The first one should be self-explanatory.
The recently released 2024 Republican platform says it will repeal the recent White House Executive Order on AI, which many in this community saw as a necessary first step toward making future AI progress safer and more secure. This seems bad.
From https://s3.documentcloud.org/documents/24795758/read-the-2024-republican-party-platform.pdf, see bottom of pg 9.
I’m working on a project to estimate the cost-effectiveness of AIS orgs, something like Animal Charity Evaluators does. This involves gathering data on metrics such as:
* People impacted (e.g., scholars trained).
* Research output (papers, citations).
* Funding received and allocated.
Some organizations (e.g., MATS, AISC) share impact analyses, but there's no broad comparison across organizations. AI safety orgs operate on diverse theories of change, making standardized evaluation tricky, but I think rough estimates could help with prioritization.
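To make "rough estimates" concrete, here's a toy sketch of the kind of comparison I have in mind; the org names and numbers are placeholders, not real data.

```python
# Toy sketch: crude "dollars per unit of output" figures per org.
# Org names and numbers are placeholders, not real data.
orgs = {
    "HypotheticalOrgA": {"funding_usd": 1_000_000, "scholars_trained": 40, "papers": 12},
    "HypotheticalOrgB": {"funding_usd": 250_000, "scholars_trained": 10, "papers": 2},
}

for name, data in orgs.items():
    funding = data["funding_usd"]
    for metric, value in data.items():
        if metric == "funding_usd" or value == 0:
            continue  # skip the denominator itself and empty metrics
        print(f"{name}: ${funding / value:,.0f} per {metric}")
```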
I’m looking for:
1. Previous work
2. Collaborators
3. Feedback on the idea
If you have ideas for useful metrics or feedback on the approach, let me know!
OpenAI have their first military partner in Anduril. Make no mistake—although these are defensive applications today, this is a clear softening, as their previous ToS banned all military applications. Ominous.