I've just opened summer MATS applications (where I'll supervise people to write mech interp papers). I'd love to get applications from any readers who are interested! Apply here, due Feb 28.

As part of this, I wrote up a list of research areas I'm currently excited about, with thoughts on promising directions within each. I thought this might be of wider interest, so I've copied it in below:

Understanding thinking models

Eg o1, r1, Gemini Flash Thinking, etc - ie models that produce a really long chain of thought when reasoning through complex problems, and seem to be much more capable as a result. These seem like a big deal, and we understand so little about them! Now that we have small thinking models like the r1-distilled Qwen 1.5B, they also seem quite tractable to study (larger distilled versions of r1 will be better, though I doubt you need full r1).

  • This is so understudied that I’m excited about even just rigorous black box investigations here, like seeing what happens when you intervene on the chain of thought
    • For example, when a model solves a hard maths problem by thinking a bunch until it finds the key idea, what’s going on? Did it just “get lucky”, with every line being another roll of the dice for finding the key idea? Or was it steadily making progress towards that goal?
    • How deterministic are these models? If you take the first K lines of the CoT and regenerate from there, do you get the same output? (See the first sketch after this list for a minimal version of this experiment.)
  • My mental model is that when o1 is solving a maths problem it’s essentially exploring a “reasoning tree” - at each step in a solution, there’s a bunch of possible next steps. o1 chooses promising ones based on heuristics, and explores that branch of the tree, but with a bunch of reflexes to eg stop and say “that doesn’t sound right” or “I think I’ve gone wrong, let’s backtrack and try another path” or “let’s list out all relevant factors, and hope this helps me choose the next step better”
    • Can you find any evidence for or against this hypothesis?
  • An obvious safety strategy is “let’s just read the chain of thought and see if it thinks anything suspicious”. Can you find evidence for or against this being effective?
    • Eg, is the chain of thought faithful to the model’s underlying computation? Can you find evidence of unfaithfulness? What happens if you edit seemingly key steps in the chain of thought?
    • Can you create situations where the model does use suspicious reasoning? What happens if you tell it that its chain of thought is being monitored - does it do the bad behaviour but without suspicious reasoning?
  • How well do existing interpretability methods work?
    • Can we probe for anything interesting in the chain of thought?
    • Can we make steering vectors to steer the reasoning? Eg make it backtrack, make it more creative, etc (see the second sketch after this list for a crude version)
    • Do sparse autoencoders tell us anything interesting?
      • You could use the r1 LLaMA 3.1 8B distill and see how well the LLaMA Scope SAEs transfer; you will likely get better results if you finetune the SAEs on the finetuned model’s activations, since they were trained on base LLaMA 3.1 8B
  • Note: In all of these ideas you likely want some kind of dataset of problems for the model to reason about. Maths (eg GSM8K) and hard multiple choice (eg MMLU) are probably decent for a tiny reasoning model, though may be too easy.
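
A minimal sketch of the truncate-and-regenerate experiment mentioned above, assuming the HuggingFace transformers library and the r1-distilled Qwen 1.5B checkpoint. The model name, sampling settings, and the line-based truncation heuristic are my own choices, not a prescribed setup, and a real version should apply the model's chat template rather than raw concatenation:

```python
# Sketch: keep the first K lines of a sampled chain of thought, then resample the
# continuation several times and check how often the final answer changes.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"  # assumption: any small thinking model works
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16, device_map="auto")

def generate(prompt: str, max_new_tokens: int = 1024) -> str:
    # For brevity this uses the raw prompt; in practice, apply the chat template.
    ids = tok(prompt, return_tensors="pt").to(model.device)
    out = model.generate(**ids, max_new_tokens=max_new_tokens, do_sample=True, temperature=0.6)
    return tok.decode(out[0, ids["input_ids"].shape[1]:], skip_special_tokens=True)

question = "What is 17 * 23? Think step by step, then give the final answer.\n"
full_cot = generate(question)

K = 5  # keep the first K lines of the original chain of thought
prefix = "\n".join(full_cot.split("\n")[:K])

# Regenerate the continuation several times from the fixed prefix.
continuations = [generate(question + prefix + "\n") for _ in range(5)]
for i, c in enumerate(continuations):
    print(f"--- rollout {i} ---\n{c[-200:]}\n")  # eyeball the final answers
```

If the final answer is stable across rollouts from an early prefix, that is some evidence the model had already “committed” to a path; high variance looks more like repeated rolls of the dice.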
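
And a sketch of the steering-vector idea: take the difference of mean residual-stream activations on contrasting “backtracking” vs “non-backtracking” snippets, and add it to one layer’s output during generation via a forward hook. The layer index, scale, and tiny contrast sets below are placeholders you would want to tune and expand:

```python
# Sketch: a crude "backtracking" steering vector for a small thinking model.
# Model name, layer index, scale, and contrast snippets are all placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"
LAYER = 12  # assumption: a middle layer

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16, device_map="auto")

def mean_resid(snippets: list[str]) -> torch.Tensor:
    """Mean residual-stream activation at the final token, taken at LAYER's output."""
    acts = []
    for s in snippets:
        ids = tok(s, return_tensors="pt").to(model.device)
        with torch.no_grad():
            hs = model(**ids, output_hidden_states=True).hidden_states
        acts.append(hs[LAYER + 1][0, -1].float())  # hidden_states[0] is the embeddings
    return torch.stack(acts).mean(0)

backtrack = ["Wait, that doesn't look right. Let me go back and recheck the last step."]
no_backtrack = ["That step checks out, so the next step is straightforward."]
steer = mean_resid(backtrack) - mean_resid(no_backtrack)

def add_steering(module, inputs, output):
    # Assumes the decoder layer returns a tuple whose first element is the residual stream.
    return (output[0] + 4.0 * steer.to(output[0].dtype),) + output[1:]

handle = model.model.layers[LAYER].register_forward_hook(add_steering)
ids = tok("What is 17 * 23? Think step by step.", return_tensors="pt").to(model.device)
out = model.generate(**ids, max_new_tokens=256, do_sample=True, temperature=0.6)
print(tok.decode(out[0], skip_special_tokens=True))
handle.remove()
```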

Sparse Autoencoders

In previous rounds I was predominantly interested in Sparse Autoencoder projects, but I’m comparatively less excited about SAEs now - I still think they’re cool, and am happy to get SAE applications/supervise SAE projects, but think they’re unlikely to be a silver bullet and expect to diversify my projects a bit more (I’ll hopefully write more on my overall takes soon).

Within SAEs, I’m most excited about:

  • Work that tries to understand and measure fundamental problems with SAEs, eg:
    • Feature absorption

    • Whether SAEs learn the “right” concepts

    • Whether our interpretations of SAE latents (aka features)[1] are correct

      • I suspect our explanations are often way too general, and the true explanation is more specific (preliminary evidence)
        • Standard autointerp metrics do concerningly well on randomly initialized transformers… (largely because they score latents that light up on a specific token highly, I think)
      • Doing a really deep dive into rigorously interpreting a latent, including checking for false negatives, could be cool!
  • Work that tries to fix these problems, eg
  • Attempts to make SAEs practically useful (or show that they’re not), in a way that involves comparing rigorously to baselines. Eg
  • Exploring very different approaches to decomposing model concepts
  • Sanity checking how well the underlying assumptions behind SAEs actually apply to real language models
    • Can we find the “true” direction corresponding to a concept? How could we tell if we’ve succeeded?
    • Can we find a compelling case study of concepts represented in superposition, that couldn’t just be made up of a smaller set of orthogonal concepts? How confident can we be that superposition is really a thing?
    • Can we find examples of non-linear representations? (Note: it’s insufficient to just find concepts that live in subspaces of greater than one dimension)
  • Basic science of SAEs
    • Why are some concepts learned, and not others? How is this affected by the data, SAE size, etc.?
    • How big an improvement *are* Matryoshka SAEs? Should we just switch to using them all the time, or do they have some flaws?
    • What’s up with high-frequency latents? (ie latents that activate on >10% of input tokens - they seem notably more common in JumpReLU and TopK SAEs, and are very annoying; see the sketch below for one way to measure this)
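
For reference, here is a sketch of what a minimal SAE looks like, plus one way to measure latent firing frequencies (relevant to the high-frequency latent question above). This is a toy TopK variant with placeholder dimensions and random stand-in activations, not any particular released SAE:

```python
# Minimal TopK sparse autoencoder sketch, plus a firing-frequency count over a batch
# of activations. Dimensions, k, and the random "activations" are placeholders.
import torch
import torch.nn as nn

class TopKSAE(nn.Module):
    def __init__(self, d_model: int, d_sae: int, k: int):
        super().__init__()
        self.W_enc = nn.Parameter(torch.randn(d_model, d_sae) * 0.01)
        self.W_dec = nn.Parameter(torch.randn(d_sae, d_model) * 0.01)
        self.b_enc = nn.Parameter(torch.zeros(d_sae))
        self.b_dec = nn.Parameter(torch.zeros(d_model))
        self.k = k

    def encode(self, x: torch.Tensor) -> torch.Tensor:
        pre = (x - self.b_dec) @ self.W_enc + self.b_enc
        # Keep only the top-k pre-activations per token; zero out the rest.
        topk = torch.topk(pre, self.k, dim=-1)
        latents = torch.zeros_like(pre)
        latents.scatter_(-1, topk.indices, torch.relu(topk.values))
        return latents

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.encode(x) @ self.W_dec + self.b_dec

sae = TopKSAE(d_model=2048, d_sae=16384, k=64)
acts = torch.randn(4096, 2048)  # placeholder: real residual-stream activations go here
with torch.no_grad():
    firing_freq = (sae.encode(acts) > 0).float().mean(0)
print("latents firing on >10% of tokens:", (firing_freq > 0.10).sum().item())
```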

Model diffing

What happens to a model during finetuning? If we have both the original and tuned model, can we somehow take the “diff” between the two to just interpret what changed during finetuning?

  • I think model diffing could be a very big deal, and has largely been under-explored! Many alignment-relevant concerns, like goal seeking, planning capabilities, etc, seem like they could be introduced during finetuning. Intuitively it seems like it *should* be a big advantage to have access to the original model, but I’ve not seen much work demonstrating this so far.
  • Relevant prior work:
  • What to diff?
    • Applying this to some of the small thinking models, like Qwen 1.5B r1 distilled, could be super interesting
    • Base vs chat models is another super interesting direction
      • It’d be best to start with a specific capability here. Eg refusal, instruction following, chain of thought (especially circuits re what the chain of thought should “look like”, even if it’s not faithful to the model’s reasoning), conversational style/tone, specialized knowledge (eg finetuning on Python code or French), hallucination/saying ‘I don’t know’
    • Taking a model finetuned for some specific task (maybe finetuning it yourself) might be easier to analyse
      • Using a LoRA might make it even cleaner
  • There are various low-tech ideas here, like looking at the KL divergence between the two models at each token on some prompts/rollouts (see the sketch after this list), or patching activations/swapping weights between the original and tuned model to try to isolate where the key changes were
  • Crosscoders for model diffing (basically, an SAE trained on the concatenation of a residual stream from the original model and the tuned model) seem like they have a lot of potential for finding model diffing insights - the tentative explorations in the paper are a good start, but there’s a lot more that can be done. I’d love to see someone train a crosscoder on an r1 distill and analyse it
  • Applying the stage-wise model diffing approach from Bricken et al to something else and seeing what you can learn
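
A sketch of the lowest-tech diffing idea above: per-token KL divergence between a base model and its finetune on the same prompt. The model names are assumptions (any base/tuned pair sharing a tokenizer and vocabulary works), and in practice you would average over many prompts or rollouts:

```python
# Sketch: per-token KL divergence between a base model and its finetune on one prompt.
# Model names are placeholders; the pair must share a tokenizer and vocabulary.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE = "Qwen/Qwen2.5-Math-1.5B"                      # assumption
TUNED = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"  # assumption

tok = AutoTokenizer.from_pretrained(TUNED)
base = AutoModelForCausalLM.from_pretrained(BASE, torch_dtype=torch.bfloat16, device_map="auto")
tuned = AutoModelForCausalLM.from_pretrained(TUNED, torch_dtype=torch.bfloat16, device_map="auto")

prompt = "To solve this equation, first move all terms to one side, then"
ids = tok(prompt, return_tensors="pt")

with torch.no_grad():
    logp_base = F.log_softmax(base(**ids.to(base.device)).logits.float(), dim=-1).cpu()
    logp_tuned = F.log_softmax(tuned(**ids.to(tuned.device)).logits.float(), dim=-1).cpu()

# KL(tuned || base) at each position: where does finetuning change the predictions most?
kl = (logp_tuned.exp() * (logp_tuned - logp_base)).sum(-1)[0]
for t, k in zip(tok.convert_ids_to_tokens(ids["input_ids"][0].tolist()), kl.tolist()):
    print(f"{t!r:>15}  KL = {k:.3f}")
```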

Understanding sophisticated/safety relevant behaviour

LLMs are getting good enough that they start to directly demonstrate some alignment relevant behaviours. Most interpretability work tries to advance the field in general by studying arbitrary, often toy, problems, but I’d be very excited to study these phenomena directly!

  • Warning: The most interesting behaviours tend to happen in the largest models, eg LLaMA 405B. This is a pain to run yourself, and I do not recommend it unless you have experience with this kind of thing.
  • In the recent alignment faking paper, they found that LLaMA 405B would sometimes fake alignment when prompted, *without* needing to reason aloud - I’d be really excited if you can learn anything about what’s going on here, eg with probes or activation patching from edited prompts
  • Chen et al show that LLMs form surprisingly accurate and detailed models of the user, eg their gender, age, socioeconomic status, and level of education, and do this from very little information. They find these attributes with probes, and can steer on them to change the model’s actions in weird ways.
    • This is wild! What else can we learn here? What else do models represent about the user? How are these inferred? How else do they shape behaviour?
    • Do LLMs form dynamic models of users for attributes that vary across turns, eg emotion, what the user knows, etc.?
      • As a stretch goal, do LLMs ever try to intentionally manipulate these? Eg detect when a user is sad and try to make them happy
    • You could try making probe training data by having an LLM generate conversations while modelling various desired attributes (see the probing sketch after this list)
  • Can we give LLMs a simple social deception game, like werewolf, maybe prompt them with some strategy advice, and have them play competently? If so, can we say anything about whether they’re modelling other players, or acting strategically?
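
A sketch of the probing setup for user attributes: extract residual-stream activations at the last token of each conversation and fit a logistic-regression probe on a labelled attribute. The model name, layer, and the two toy conversations are placeholders; the interesting work is building a good labelled dataset (eg LLM-generated, as suggested above) and evaluating on held-out data:

```python
# Sketch: logistic-regression probe for a binary "user is an expert" attribute, read off
# the residual stream at the last token of each conversation. Tiny placeholder data only.
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Llama-3.1-8B-Instruct"  # assumption: any chat model
LAYER = 16                                  # assumption: a middle layer

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16, device_map="auto")

conversations = [
    ("Hi, could you explain what a for loop is? I'm very new to this.", 0),
    ("What's the amortized complexity of union-find with path compression?", 1),
    # ... many more generated conversations, labelled by the attribute you care about
]

def last_token_resid(text: str) -> torch.Tensor:
    ids = tok(text, return_tensors="pt").to(model.device)
    with torch.no_grad():
        hs = model(**ids, output_hidden_states=True).hidden_states
    return hs[LAYER][0, -1].float().cpu()

X = torch.stack([last_token_resid(t) for t, _ in conversations]).numpy()
y = [label for _, label in conversations]

probe = LogisticRegression(max_iter=1000).fit(X, y)  # use a proper train/test split in practice
print("train accuracy:", probe.score(X, y))
```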

Being useful

Interpretability is often pretty abstract, pursuing lofty blue-skies goals, and it’s hard to tell if your work is total BS or not. I’m excited about projects that take a real task, one that can be defined without ever referencing interpretability, and try to beat non-interp baselines in a fair(ish) fight - if you can do this, it’s strong evidence you’ve learned *something* real.

Investigate fundamental assumptions

There are a lot of assumptions behind common mechanistic interpretability work, both scientific assumptions and theory of change assumptions, that in my opinion have insufficient evidence. I’d be keen to gather evidence for and against!

  • Can we find examples of features in real language models that are not linearly represented?
  • Is circuit analysis even needed, if SAEs can find the right concepts for us? A common take is that understanding the circuits an SAE latent participates in could help us interpret it, but is this true? Can you find any case studies where it actually helps?
    • Feel free to use very high level circuit analysis, like attribution patching between that latent and latents in a much earlier layer, or even to input tokens
  • Is superposition real?
  • Is sparsity actually a good proxy for interpretability?
  • Do the directions corresponding to features mean the same thing across layers, or is there some systematic “drift”? If there’s drift, I weakly predict it can largely be explained as each layer applying a fixed linear transform plus a non-linear component (see the sketch after this list for a crude test of the linear part).
  • Fuzzier things: (I have no clue how to research these, but I’d love to see progress!)
    • Are circuits real? Is this the right way to think about models?
    • Are features real? Do models actually think in concepts?
    • Does it make sense to expect features to be linearly represented?
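
A crude way to test the linear part of the drift prediction above: collect paired activations from adjacent layers on the same tokens, fit a single least-squares linear map, and measure how much variance it explains. The activations below are random placeholders; with a real model you would collect them from the residual stream (or from matched feature directions) and evaluate on held-out tokens:

```python
# Sketch: how much of the layer-L -> layer-(L+1) map is explained by one fixed linear
# transform? Fit least squares on paired activations and measure explained variance.
# The "activations" here are random placeholders, so the answer will be ~0.
import torch

n_tokens, d_model = 20_000, 1024
acts_L = torch.randn(n_tokens, d_model)   # placeholder: residual stream at layer L
acts_L1 = torch.randn(n_tokens, d_model)  # placeholder: residual stream at layer L+1

# Least-squares fit of acts_L1 ≈ acts_L @ W (append a ones column to acts_L for a bias term)
W = torch.linalg.lstsq(acts_L, acts_L1).solution

pred = acts_L @ W
explained = 1 - (acts_L1 - pred).pow(2).sum() / (acts_L1 - acts_L1.mean(0)).pow(2).sum()
print(f"fraction of variance explained by a fixed linear map: {explained:.3f}")
```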

  1. I favour the term latent over feature, because feature also refers to the subtly but importantly different concept of “the interpretable concept”, which an SAE “feature” imperfectly corresponds to, and it’s very confusing for it to mean both. ↩︎
