mjkerrison🔸️

Transitioning to AI safety

47 karma · Joined Nov 2022 · Working (6-15 years) · Melbourne VIC, Australia

Bio

Former management consultant and data scientist. Currently on sabbatical to try to transition to work on AI safety in some capacity.

How others can help me

Looking for high-leverage opportunities or ideas in AI safety

Posts
1

mjkerrison️'s Quick takes

mjkerrison🔸️

· 4mo ago · 1m read

Comments
14

80,000 Hours is shifting its strategic approach to focus more on AGI

mjkerrison🔸️ · 16d ago · 7

Hey NobodyInteresting,

I see you're getting downvoted and disagreed with without any direct interaction, so I'll bite.

"diversity... western developed mindset... clearly lacking"

I think this is a bit combative in how it's delivered. I suspect a number of people actually agree with the 'echo chamber' problem - it seems like a number of the other comments more or less say something like this in how they disagree with 80k's conclusions.

However, you might need to elaborate on the "western" aspect. What do you think is the shortcoming you're identifying here? For instance, I expect a western lens is the right one for this problem because most of the frontier labs are western.

"AGI delivery by 2030 will fail... AGI will surely be much more complex, if possible even"

I think this is a controversial call - some agree, some disagree. But I don't think anyone can say this with confidence: for instance, the frontier lab leaders all seem to be saying the opposite, in a way that doesn't seem to be explained by "hype" or "pandering to investors". This would be a good point on which to elaborate your views or link to the strongest evidence you buy into.

"we barely have resources to properly run these current LLM models"

I think this is just not true? Maybe we don't have the resources to run them personally, but plenty of companies seem to be running their latest models for (limited) public access just fine. Are you pointing at inequality of access here, or something else?

"Congratulations you have now jumped on a trend-wagon that will take us nowhere, while forgetting about actual causes that are relevant right now"

I think I've mostly responded to this, but I'd like to connect it with the "left behind" point at the end:

I think it's good, and highly in keeping with EA principles, to (a) respond to information as it comes in and (b) think and work on the margin. This is probably too big to tackle in this comment - maybe I'll write something up later - but:

1. Changing advice on the margin (for the next person) isn't necessarily the same as 'leaving people behind'.
2. I would be interested in whether, and why, people who agree with the advice / 80k's judgment feel they should stay where they are instead of also pivoting.

Discussion Thread: Existential Choices Debate Week

mjkerrison🔸️ · 21d ago · 1

50% agree

It seems to me that extinction is the ultimate form of lock-in, while surviving provides more opportunities to increase the value of the future. This moves me very far toward Agree. It seems possible, however, that there could be futures that rely on actions today and are so much better than the alternatives that it could be worth rolling worse dice, or futures so bad that extinction could be preferable, so this brings me back a bit from very high Agree.

On the margin: I think we are not currently well-equipped to determine whether actions are or aren't increasing the value of the future^[1]. Focusing on protecting what we have seems more prudent, as there are concerningly many concerningly high extinction risks.

^[1] This includes things like concerns about today's humans vs other forms of intelligence, too.

The Game Board has been Flipped: Now is a good time to rethink what you’re doing

mjkerrison🔸️ · 2mo ago · 5

Another tentative implication that goes without saying, but I'll say it anyway: review who you're listening to.

Who got these developments "right" and "wrong"? How will you weight what those people say in the next 12 months?

New Year's Thread: What do you want to achieve this year?

Answer by mjkerrison🔸️ · Jan 01, 2025 · 6

Do the ol' career transition into AI safety. Probably governance/policy, but with a technical flavour.

(If you need stats, data science, report writing, or management consultancy skills - please hit me up!)

mjkerrison️'s Quick takes

mjkerrison🔸️ · 4mo ago · 12

AI safety

Isn't mechinterp basically setting out to build tools for AI self-improvement?

One of the things people are most worried about is AIs recursively improving themselves. (Whether all people who claim this kind of thing as a red line will actually treat this as a red line is a separate question for another post.)

It seems to me like mechanistic interpretability is basically a really promising avenue for that. Trivial example: Claude decides that the most important thing is being the Golden Gate Bridge. Claude reads up on Anthropic's work, gets access to the relevant tools, and does brain surgery on itself to turn into Golden Gate Bridge Claude.

More meaningfully, it seems like any ability to understand in a fine-grained way what's going on in a big model could be co-opted by an AI to "learn" in some way. In general, I think the case that seems most likely soonest is:

1. Learn in-context (e.g. results of experiments, feedback from users, things like those we've recently observed in scheming papers...)
2. Translate this to appropriate adjustments to weights (identified using mechinterp research)
3. Execute those adjustments

Maybe I'm late to this party and everyone was already conceptualising mechinterp as a very dual-use technology, but I'm here now.

Honestly, maybe it leans more towards "offense" (i.e., catastrophic misalignment) than defense! It will almost inevitably require automation to be useful, so we're ceding it to machines out of the gate. I'd expect tomorrow's models to be better placed than humans to make sense of, and make use of, mechinterp techniques - partly just because of sheer compute, but also maybe (and now I'm speculating on stuff I understand even less) because the nature of their cognition is more suited to what's involved.

mjkerrison️'s Quick takes

mjkerrison🔸️ · 4mo ago · 3

Community

If someone isn't already doing so, someone should estimate what % of (self-identified?) EAs donate according to our own principles. This would be useful (1) as a heuristic for the extent to which the movement/community/whatever is living up to its own standards, and (2) assuming the answer is 'decently', as evidence for PR/publicity/responding to marginal-faith tweets during bouts of criticism.

Looking at the Rethink survey from 2020, they have some info about which causes EAs are giving to but they seem to note that not many people respond on this? And it's not quite the same question. To do: check GWWC for whether they publish anything like this.

Edit to add: maybe an imperfect but simple and quick instrument for this could be something like "For what fraction of your giving did you attempt a cost-effectiveness assessment (CEA), read a CEA, or rely on someone else who said they did a CEA?". I don't think it actually has to be about whether the respondent got the "right" result per se; the point is the principles. Deferring to GiveWell seems like living up to the principles because of how they make their recommendations, etc.
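As a rough sketch of how answers to a question like this might be summarised (in Python; the field names `total_given` and `cea_informed` are hypothetical, and nothing here is real survey data), one could report both a per-donor average and a dollar-weighted fraction, since the two can diverge when a few large donors behave differently:

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Response:
    """One hypothetical survey response (illustrative fields, not a real schema)."""
    total_given: float   # total donated over the period
    cea_informed: float  # portion backed by a CEA (own, read, or deferred to)


def summarise(responses: List[Response]) -> Dict[str, float]:
    """Summarise the share of giving that was CEA-informed, ignoring non-donors."""
    donors = [r for r in responses if r.total_given > 0]
    if not donors:
        return {"mean_fraction_per_donor": 0.0, "dollar_weighted_fraction": 0.0}

    # Cap each donor's informed amount at their total so fractions stay in [0, 1].
    informed = [min(max(r.cea_informed, 0.0), r.total_given) for r in donors]
    fractions = [i / r.total_given for i, r in zip(informed, donors)]
    total = sum(r.total_given for r in donors)

    return {
        "mean_fraction_per_donor": sum(fractions) / len(fractions),
        "dollar_weighted_fraction": sum(informed) / total,
    }
```

The per-donor mean speaks to "are individual EAs living up to the principles", while the dollar-weighted figure speaks to "is EA money moving according to the principles"; which one matters more depends on what the estimate is for.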

Rekindling the Fire: Principles-First EA Groups as a Path to Impact [Talk Transcript]

mjkerrison🔸️ · 4mo ago · 5

Can you add / are you comfortable adding anything on who "us" is and which orgs or what kinds of orgs are hesitant? Is your sense this is universal, or more localised (geographically, politically, cause area...)?

Julia_Wise's Quick takes

mjkerrison🔸️ · 4mo ago · 1

Good point and good fact.

My sense, though, is that if you scratch most "expand the moral circle" statements you find a bit of implicit moral realism. I think generally there's an unspoken "...to be closer to its truly appropriate extent", and that there's an unspoken assumption that there'll be a sensible basis for that extent. Maybe some people are making the statement prima facie though. Could make for an interesting survey.

GWWC's 2024 evaluations of evaluators

mjkerrison🔸️ · 4mo ago · 5

Love to see these reports!

I have two suggestions/requests for 'crosstabs' on this info (which is naturally organised by evaluator, because that's what the project is!):

1. As of today, which evaluators/charities sit where on the recommendation scale. The info for that is mostly on GWWC's website but not quite organised as such. I'm thinking of rows for cause areas and columns for buckets, e.g. 'Recommended' at one end and 'Maybe not cost-effective' at the other (though maybe you'd drop things off altogether). Just something to help visualise what's moved and by how much, and broadly why things are sitting where they are (e.g. THL corporate campaigns sliding off the recommended list for 'procedural' reasons, so not in the Recommended column but now in a 'Nearly' column or something).
2. I'd love a clear checklist of what you think needs improvement per evaluated program, to help with making the list a little more evergreen. I think all that info is in your reporting, but if you called it out I think it would
   1. help evaluated programs and
   2. help donors to
      1. get a sense for how up-to-date that recommendation is (given the rotating/rolling nature of the evaluation program)
      2. and possibly do their own assessment for whether the charity 'should' be recommended 'now'.
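To make the first suggestion a bit more concrete, here is a minimal Python sketch of the kind of crosstab I mean. Everything in it is a placeholder: the cause areas, program names, and the middle 'Nearly' bucket are made up for illustration and are not GWWC's actual categories or placements.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Hypothetical buckets, ordered from most to least recommended.
BUCKETS = ["Recommended", "Nearly", "Maybe not cost-effective"]


def crosstab(entries: Iterable[Tuple[str, str, str]]) -> Dict[str, Dict[str, List[str]]]:
    """Group (cause_area, program, bucket) entries into rows of cause areas and columns of buckets."""
    table: Dict[str, Dict[str, List[str]]] = defaultdict(lambda: {b: [] for b in BUCKETS})
    for cause, program, bucket in entries:
        if bucket not in BUCKETS:
            raise ValueError(f"unknown bucket: {bucket!r}")
        table[cause][bucket].append(program)
    return dict(table)


def render(table: Dict[str, Dict[str, List[str]]]) -> str:
    """Render the crosstab as pipe-separated text, one row per cause area."""
    lines = [" | ".join(["Cause area"] + BUCKETS)]
    for cause in sorted(table):
        cells = [", ".join(table[cause][b]) or "-" for b in BUCKETS]
        lines.append(" | ".join([cause] + cells))
    return "\n".join(lines)


# Placeholder data only; not real evaluations.
example = [
    ("Cause area A", "Program X", "Recommended"),
    ("Cause area A", "Program Y", "Nearly"),
    ("Cause area B", "Program Z", "Maybe not cost-effective"),
]
print(render(crosstab(example)))
```

A short 'why' note per cell (e.g. 'procedural') could be added as a fourth field in each entry if the reasons for a placement matter as much as the placement itself.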

mjkerrison️'s Quick takes

mjkerrison🔸️ · 4mo ago · 1

AI safety

Is anyone keeping tabs on where AI's actually being deployed in the wild? I feel like I mostly see big-picture stuff (and so this could be a me problem), but there seems to be a proliferation of small actors doing weird stuff. Twitter / X seems to have a lot more AI content, and apparently YouTube comments do now as well (per a conversation I stumbled on while watching some YouTube recreationally - language & content warnings: https://youtu.be/p068t9uc2pk?si=orES1UIoq5qTV5TH&t=2240).
