Effective Altruism Forum

AI interpretability

Interpretability is the degree to which the decision processes and inner workings of AI and machine learning systems can be understood by humans or other outside observers.^[1]

Present-day machine learning systems are typically not very transparent or interpretable. You can use a model's output, but the model can't tell you why it produced that output. This makes it hard to determine the cause of biases in ML models.^[1]

Interpretability is a focus of Chris Olah's and Anthropic's work, though most AI alignment organisations, such as Redwood Research,^[2] work on interpretability to some extent.

...

Posts tagged AI interpretability

35

Interpreting Neural Networks through the Polytope Lens

· 2y ago

3

3

24

PhD Position: AI Interpretability in Berlin, Germany

Martian Moonshine

· 2y ago · 1m read

3

3

21

Rational Animations' intro to mechanistic interpretability

· 9mo ago

3

3

15

Chris Olah on working at top AI labs without an undergrad degree

· 3y ago · 88m read

3

3

10

Sentience in Machines - How Do We Test for This Objectively?

· 2y ago · 3m read

3

3

158

Announcing Apollo Research

· 2y ago

2

2

123

High-level hopes for AI alignment

Holden Karnofsky

· 2y ago · 23m read

2

2

90

The case for becoming a black-box investigator of language models

· 3y ago · 3m read

2

2

54

A Barebones Guide to Mechanistic Interpretability Prerequisites

· 2y ago · 4m read

2

2

32

Against LLM Reductionism

Erich_Grunewald 🔸

· 2y ago

2

2

25

[MLSN #8]: Mechanistic interpretability, using law to inform AI alignment, scaling laws for proxy gaming

· 2y ago · 5m read

2

2

23

The limited upside of interpretability

· 2y ago · 12m read

2

2

18

Concrete Steps to Get Started in Transformer Mechanistic Interpretability

· 2y ago · 15m read

2

2

17

Why mechanistic interpretability does not and cannot contribute to long-term AGI safety (from messages with a friend)

· 2y ago

2

2

6

Why and When Interpretability Work is Dangerous

Nicholas / Heather Kross

· 2y ago

2

2