Tracing Thought in Language Models
Introduction
In 1958, David Hubel and Torsten Wiesel began the work that would put a tungsten microelectrode into the visual cortex of an anesthetized cat. They wanted to know what made cortical neurons fire. They had assumed it would be obvious: show a dot, the neuron fires; show a bar, it fires; show a face, it fires.
It wasn’t obvious. They tried every stimulus they could think of for hours. The neuron stayed mostly silent. Then, as Hubel tells it, the edge of a glass slide moving in their projector swept a shadow across the cat’s visual field. The neuron erupted into a furious burst of spikes.
They had discovered that neurons in V1 are selective for orientation. They respond to oriented edges, not points, not faces. That accident, plus a few years of follow-up work, helped them share the 1981 Nobel Prize and helped found systems neuroscience.
What Hubel and Wiesel had to figure out was the right unit of analysis. Brains aren’t documented. With the wrong unit (a single neuron, a single image, a single feature), the data looks like noise. With the right unit (orientation, motion, edge), the data becomes a science.
Fifty-four years later, a convolutional network rediscovered the same thing. AlexNet’s first convolutional layer, trained on more than a million natural photographs with no instruction beyond “predict the label,” learned a bank of oriented edge detectors that look uncannily like V1’s. Subsequent layers kept climbing the same hierarchy the visual cortex climbs: edges in the first layers, parts in the middle, faces and objects at the top. Two different substrates (mammalian cortex, silicon) had converged on the same first move, and then on similar ascents. This finding not only supports Hubel and Wiesel’s approach but also suggests that some intelligent systems tend to converge on the same techniques to represent the world.

We are at the same point with large language models. They are systems we trained, not systems we designed. Nobody can tell you, with line-by-line precision, how Claude wrote the sentence it just wrote. As Dario Amodei wrote in April of 2025:
“People outside the field are often surprised and alarmed to learn that we do not understand how our own AI creations work. They are right to be concerned: this lack of understanding is essentially unprecedented in the history of technology.”
This essay is about a research program working toward a science of understanding. It is, in part, a primer on the concepts you need (superposition, sparse autoencoders, replacement models, attribution graphs) to read On the Biology of a Large Language Model [note 01] yourself, and, in part, a tour of what that paper found when it pointed those new instruments at Claude.
The view we now have inside Claude is partial. The view is also extraordinary.
The Black Box
Software is normally a transcript of intent. A programmer writes “if email contains "free" and sender ∉ contacts: mark_spam()”, and the system does exactly that, line by line. When something breaks, you read the code. The code is the explanation. In principle the approach scales further than the spam example suggests. Rule-based NLP, hand-written parsers, and classical expert systems all get surprisingly far on syntactic processes alone, but only to a point.
Neural networks are not like that. The behaviors a modern trained network produces are well past what anyone knows how to specify line by line.
A trained language model is initialized randomly, fed absurd amounts of text, and nudged by a gradient signal toward better next-token predictions. After roughly 10²⁴ arithmetic operations, the resulting weights can do calculus, write working code, hold a philosophical debate, and refuse a request for chemical-weapon synthesis instructions. Nobody specified how. There is no mark_spam() line. There is just a hundred-billion-parameter tensor of opaque numbers and the behavior that comes out when text is fed through it.
The architecture itself can be described in a few technical terms. A modern language model keeps, at each token position and at each layer, a long vector called the residual stream: residual because each layer adds a delta to it rather than overwriting it, stream because it carries information forward through the network. Every layer reads from this vector, adds its own contribution, and writes the sum back.
Two kinds of components do the contributing. Attention layers move information between positions. They let the word Texas at position 4 deliver something to the word at position 7. MLP layers perform per-position computation: lookup, transformation, refinement of what’s already at that position. Stacking these together and (un)embedding the final state at the last position is how the model produces a next token. At least, that’s what we think.
Everything mechanistic interpretability discovers lives somewhere in this rail of vectors. Sparse autoencoders, which we’ll meet shortly, decompose what’s in the residual stream at a given layer. Attribution graphs trace what wrote it.
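To make the read-add-write pattern concrete, here is a minimal sketch of a residual stream in plain Python. Everything in it is an illustrative assumption: the attention and MLP functions are hand-written toys with no learned weights, no layer normalization, and no embedding step, and the names (attention, mlp, add_delta, forward) are hypothetical. The only point is the shape of the computation: attention mixes information across positions, the MLP works within one position, and each writes a delta that is added to the stream.

```python
import math
import random

def attention(stream, scale=0.5):
    """Toy causal attention: each position reads a softmax-weighted mix of
    itself and earlier positions (weights from dot-product similarity)."""
    out = []
    for i, query in enumerate(stream):
        scores = [sum(q * k for q, k in zip(query, stream[j])) for j in range(i + 1)]
        peak = max(scores)
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        mixed = [0.0] * len(query)
        for j, w in enumerate(weights):
            for d in range(len(query)):
                mixed[d] += scale * (w / total) * stream[j][d]
        out.append(mixed)
    return out

def mlp(stream, scale=0.5):
    """Toy per-position computation: a fixed nonlinearity applied to each
    position independently, using nothing from other positions."""
    return [[scale * math.tanh(v) for v in vec] for vec in stream]

def add_delta(stream, delta):
    """The residual update: add a layer's contribution instead of overwriting."""
    return [[a + b for a, b in zip(vec, dvec)] for vec, dvec in zip(stream, delta)]

def forward(stream, n_layers=2):
    for _ in range(n_layers):
        stream = add_delta(stream, attention(stream))  # move info between positions
        stream = add_delta(stream, mlp(stream))        # compute within each position
    return stream

rng = random.Random(0)
tokens = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(3)]  # 3 positions, width 4
final = forward(tokens)
```

Because the toy attention is causal, changing a later position leaves the stream at earlier positions untouched, which is the sense in which information only flows forward along the sequence.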

This isn’t an academic curiosity. It has three consequences worth naming.
Science. These are the most behaviorally complex artifacts humans have ever built. Studying them is a discipline still being founded, waiting on its instruments the way astronomy waited for telescopes and microbiology waited for the microscope, then for stained slides. They are also the first practical laboratory for cognition itself: an instrumentable mind, in a substrate we can finally cut open.
Trust. High-stakes deployments (medicine, law, finance, hiring) are domains where “because the model said so” is not a sufficient answer. That hasn’t stopped deployments. Auditable reasoning ought to be the price of admission, but trust currently rests on induction from past behavior rather than on interpretable mechanisms.
Safety. Aligned behavior is not the same as aligned reasoning. A model can say the right things for the wrong reasons.
All three require looking inside the model. For most of the last decade, we couldn’t. The neurons of these models don’t reliably correspond to interpretable concepts. Pick a typical neuron in a modern LLM and you’ll find it fires on, say, academic citations, Korean text, DNA sequences, Python def statements, song lyrics about water, and chess notation. Not all at once, but across enough inputs that no single label captures it.
This is not an accident. It is a structural feature of how these networks store information, and to see why, you have to understand what the units actually represent.
Neurons Don’t Mean Things
The naive picture is that each neuron in the network represents one concept. The reality is that most neurons represent dozens of unrelated concepts. The technical name is polysemanticity, which the field attributes to a mechanism called superposition.
Let’s develop the intuition. A modern LLM’s residual stream is only a few thousand dimensions wide. The number of distinct human-interpretable concepts the model has to handle (e.g., people, places, syntactic patterns, code idioms, sentiments, factual associations) is in the millions, at least. There aren’t enough dimensions to give each concept its own unit. So the network gives each concept its own direction in activation space, which is the high-dimensional space the residual stream lives in. The stream’s value at any moment is the sum of whichever concept-directions are active, weighted by how strongly they’re firing. Multiple concepts share the same neurons, but they are stored along different geometric axes, so they can be untangled if you know how to read them.
This was first formalized in Toy Models of Superposition (Elhage et al., 2022). The math is technical, but the practical consequence is simple:
You cannot read the model neuron-by-neuron. You have to find the directions.
The deeper claim from the same paper is that sparsity allows the network to represent more features (the single-concept directions we’ll formalize below). In the paper’s small toy networks, with no sparsity, only as many features as there are dimensions get embedded; the rest are dropped. As features get sparser, the network packs more of them in by arranging them in regular geometries: first as antipodal pairs (twice the dimensional capacity, nearly free), then as triangles, tetrahedrons, pentagons, and square antiprisms at increasingly low activation densities, paying a small amount of positive interference that stays tolerable because the features rarely fire together. Real models appear to use this trick, though the exact toy geometries don’t necessarily transfer: more concepts than neurons is possible because most of those concepts fire on a tiny fraction of the data.
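A small sketch can show why sparsity is what makes the packing work. The setup below is hand-built rather than learned, and every choice in it is an assumption for illustration: five feature directions arranged as a pentagon in a two-dimensional space, a readout that projects onto each direction, and a negative bias of -0.5 followed by a ReLU to filter out interference. In the paper the network learns its own directions and biases; here they are fixed by hand.

```python
import math

N_FEATURES = 5
# Five unit directions spaced evenly around a circle (a pentagon) in a 2-D space.
directions = [
    (math.cos(2 * math.pi * k / N_FEATURES), math.sin(2 * math.pi * k / N_FEATURES))
    for k in range(N_FEATURES)
]

def embed(activations):
    """Activation vector = sum of feature directions weighted by strength."""
    x = sum(a * d[0] for a, d in zip(activations, directions))
    y = sum(a * d[1] for a, d in zip(activations, directions))
    return (x, y)

def read_out(vec, bias=-0.5):
    """Project onto each direction; ReLU with a negative bias drops small interference."""
    return [max(0.0, vec[0] * d[0] + vec[1] * d[1] + bias) for d in directions]

sparse = [0.0, 0.0, 1.0, 0.0, 0.0]  # one feature active
dense = [1.0, 1.0, 1.0, 1.0, 1.0]   # every feature active at once

sparse_readout = read_out(embed(sparse))  # only feature 2 survives the readout
dense_readout = read_out(embed(dense))    # the directions cancel; nothing survives
```

With one feature active, the readout recovers it cleanly even though five features share two dimensions. With all five active, the directions sum to roughly zero and the information is gone. The scheme only pays off when simultaneous activations are rare.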

It is worth pausing to note that the brain plausibly works in a similar way. Cortical neurons are famously mixed-selective: a single unit in prefrontal cortex or motor cortex will respond to many task variables at once, in patterns that no single label cleanly captures. Modern systems neuroscience increasingly tells the story in Hopfieldian rather than Sherringtonian terms: the meaningful units of cognition are not individual neurons but populations, and information lives in directions inside a high-dimensional vector space implemented by those populations. A binary-firing neuron can participate in many such population codes.
Features: Directions That Mean Things
Before getting to sparse autoencoders, the parent technique: an autoencoder is a network with a deliberate bottleneck. You train it to read an input, squeeze it through a narrower middle layer, and reconstruct the input from whatever survives the squeeze. The bottleneck forces the network to throw away noise and keep structure. What survives in the middle is the input’s compressed code, a smaller, denoised representation of the same content. The architecture is decades old; it is one of the foundational primitives in deep learning for a reason. If your problem looks like “the structure I want lives in low dimensions inside a much wider space,” an autoencoder is the right shape.
A sparse autoencoder (SAE) is a variant of this idea. It runs on top of a frozen language model and decomposes the model’s activations at a particular layer into a dictionary of single-concept directions. The twist: where most autoencoders compress by narrowing the middle layer, SAEs widen it past the input and constrain it differently. Most middle-layer units must be zero (that is, sparse) on any given input. The network has all the room it wants, but only a few units may speak at a time.
The SAE learns a much larger dictionary of feature directions than there are neurons, plus a sparse pattern of which of them fire on any given input, such that summing the active features (each scaled by how strongly it fires) reconstructs the original activation. The sparsity constraint is the trick: it forces each feature to specialize and to fire on a coherent set of inputs rather than on everything.
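The forward pass and training objective described above can be sketched in a few lines. This is a minimal illustration, not the setup from any particular paper: the dictionary is set by hand instead of being learned by gradient descent, the encoder is tied to the decoder, and the sparsity penalty is an L1 term with an arbitrary coefficient (other penalties exist). The names sae_forward, enc, dec, and l1_coeff are hypothetical.

```python
def relu(v):
    return max(0.0, v)

def sae_forward(x, enc, dec, l1_coeff=0.01):
    """Encode an activation into sparse features, decode it back, and score the result."""
    # Encoder: one non-negative activation per dictionary feature.
    feats = [relu(sum(w * xi for w, xi in zip(row, x))) for row in enc]
    # Decoder: sum of feature directions, each scaled by how strongly it fires.
    recon = [sum(f * d[j] for f, d in zip(feats, dec)) for j in range(len(x))]
    reconstruction_error = sum((a - b) ** 2 for a, b in zip(x, recon))
    sparsity_penalty = sum(abs(f) for f in feats)
    loss = reconstruction_error + l1_coeff * sparsity_penalty
    return feats, recon, loss

# A hand-set dictionary of 4 feature directions for a 2-dimensional activation:
# wider than the input, as in an SAE. A trained SAE would learn these.
dec = [(1, 0), (-1, 0), (0, 1), (0, -1)]
enc = dec  # tied weights, an assumption made here for brevity

x = [0.7, -0.2]
feats, recon, loss = sae_forward(x, enc, dec)
```

On this input only two of the four features fire, and their weighted sum reproduces x. Training would adjust enc and dec to lower the loss across many activations, trading reconstruction quality against the number of features that fire.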
The remarkable empirical fact, demonstrated at scale across the SAE literature (Towards Monosemanticity, Bricken et al. 2023; Scaling Monosemanticity, Templeton et al. 2024), is that these features come out interpretable. You inspect a feature, look at the inputs that activate it most strongly, and the meaning leaps out:
A feature that fires on every mention of the Golden Gate Bridge (in text, in code, in many languages).
A feature for security vulnerabilities and backdoors in code.
A feature for sycophancy.
A feature for discussions of neuroscience and brain sciences.
A feature for deception, lying, and power-seeking.
A feature for transit infrastructure (trains, ferries, tunnels, bridges, wormholes).
Once you have features, you can do things you cannot do with neurons. You can inspect them: read the top dataset examples to figure out what concept they track. You can ablate them: zero out a feature and see what changes downstream. You can amplify them: turn a feature up to eleven and watch the model become obsessed with whatever it represents. (This is how Anthropic famously made Claude unable to stop talking about the Golden Gate Bridge.) You can inject them on inputs where they wouldn’t normally fire and see what the model does in response.
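Ablating, amplifying, and injecting are all the same geometric operation: slide the activation along one feature’s direction until the feature has the strength you want. A minimal sketch follows, under stated assumptions: the direction is unit-length, feature strength is read off by projection (a real SAE would use its encoder), and the "bridge" direction and the numbers are invented for illustration.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def set_feature(activation, direction, target):
    """Return a copy of `activation` whose projection onto the unit-length
    `direction` equals `target`; orthogonal components are left untouched."""
    current = dot(activation, direction)
    return [a + (target - current) * d for a, d in zip(activation, direction)]

bridge = [0.6, 0.8, 0.0]      # hypothetical unit-length feature direction
other = [0.0, 0.0, 1.0]       # an unrelated, orthogonal feature direction
activation = [0.3, 0.4, 0.3]  # 0.5 * bridge + 0.3 * other

ablated = set_feature(activation, bridge, 0.0)                               # zero it out
amplified = set_feature(activation, bridge, 10 * dot(activation, bridge))    # turn it up
injected = set_feature([0.0, 0.0, 0.3], bridge, 2.0)                         # add it where absent
```

In each case the edited vector would be written back into the residual stream and the rest of the forward pass run as usual, so that any change in the output can be attributed to that one feature.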
Features are like words in a vocabulary the model uses to think. SAEs are how we read that vocabulary.
A caveat is in order: SAEs are not perfect. Trained dictionaries routinely contain dead features (directions that never fire), split features (one concept distributed across many slightly different slots), and absorbed features (a general feature that silently fails to fire on cases a more specific feature has taken over). Fixing these failure modes is its own active research subfield. But the technique works well enough that the entire research program below rests on it. The improvements ahead are quantitative; the fundamental move, decomposing activations into a sparse dictionary of mostly-monosemantic directions, stays the same.
This is one of mechanistic interpretability’s first big tricks. The second one is harder, and is the subject of the rest of this essay: turning a vocabulary into a grammar. But before we get to grammar, a detour is worth taking. Features are not just an engineering convenience for cracking open the model. They are a claim about what understanding is.
What Features Tell Us About Ourselves
Features work because the model couldn’t afford to keep every input distinct. Too many sentences, too few neurons; it learned to throw most of the data away and keep what paid off in next-token prediction. The model isn’t representing Dallas, the string of letters; it is representing Texas, the cluster of correlations Dallas implies, because that is the cheapest way to predict everything Dallas-shaped that comes next.
This is the same trick we use. Human cognition is bandwidth-limited. We don’t perceive every photon, every allophone, every social-cue micro-twitch; we perceive the table, the sentence, the friend. Categories are how a finite mind keeps up with the world. The neuroscience of perception is, in large part, the study of which compressions the brain settled on and why. The neurons in V1 keep edge orientation and throw the rest away; the neurons in inferotemporal cortex keep object identity and throw orientation away (probably). Each layer is a budget decision about what to forget and what to capture.
The SAE finding is that language models do this too, not in a hand-wavy analogical sense but in a precise geometric one. They take a high-dimensional stream of token statistics and crush it into a much lower-dimensional vocabulary of features, each of which fires on a coherent set of inputs the model treats as the same kind of thing. “Coherent set of inputs treated as the same kind of thing” is, near enough, the working definition of a concept.
Different Minds, Same Map
There is a stronger version of this claim, currently being argued out in the literature. The Platonic Representation Hypothesis (Huh et al., 2024) holds that as models scale and as you train them on more data drawn from the same world, they converge on the same internal geometry.
This is weird. It says the geometry of human-relevant concepts is not entirely an artifact of any particular network. It is, in part, a property of the data the world produces. Two minds trained on enough of the same world end up with maps that partially align. The broad semantic structure converges, even as details, biases, modality-specific knowledge, and task-specific abstractions still diverge. The PRH authors themselves frame it as a hypothesis with explicit limitations: what happens to knowledge unique to a model or a modality is an open question. The map a network learns isn’t an arbitrary fiction it imposes on data. At least in part, it is a discovery about the data’s underlying shape.
The corollary is equally substantial. The category for bird in your head, the bird feature in Claude, the bird representation a chimpanzee carries: these may share more structure than simply the referent. The world has shape, loosely; ‘thinking’ things find that shape; the shape is what shows up as features. That is what features tell us about ourselves.
At least, maybe. There is still a lot of science to do before we get there.
Reading the Wiring
Knowing the vocabulary is not the same as knowing the sentences. On a single forward pass for a single prompt, hundreds of features fire across many layers, and features in earlier layers cause features in later ones. To understand what the model is doing, you need not just which features were active but how they wired up.
On the Biology of a Large Language Model is, more than anything, a story about how we can use instruments. The four moves below split naturally into two pairs: build the microscope, then read what’s on the slide. Mechanistic interpretability is, at this stage, a discipline whose progress is the progress of its tools.
Build The Microscope
The replacement model. Take the original transformer. Train a cross-layer transcoder (a more powerful relative of the SAE) on its activations. Then swap out the MLP layers, replacing them with the transcoder. You now have a replacement model whose MLP computation runs through interpretable features instead of opaque neurons. This is the move that converts the model from black box to readable substrate. Most of the paper's gain sits in the quality of this approximation.

Make it exact on this input. The replacement model is approximate. For a specific prompt of interest, the authors patch up the residual with a small correction term, producing a local replacement model that matches the original exactly on that one prompt. That is the model you actually study, which is the difference between a microscope that sees something on every slide and one that makes a falsifiable claim about this slide.
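The correction step is easier to see in code than in prose. The sketch below is a toy under stated assumptions: original_mlp stands in for an opaque layer, transcoder stands in for its interpretable approximation, and the correction is a constant error vector measured on one prompt and added back. The function names and numbers are hypothetical, and the paper’s actual construction involves more than this single term.

```python
import math

def original_mlp(x):
    """Stand-in for an opaque MLP layer (hypothetical toy function)."""
    return [math.tanh(x[0] + x[1]), math.tanh(x[0] - x[1])]

def transcoder(x):
    """Stand-in for the interpretable approximation: sparse features, then a linear readout."""
    feats = [max(0.0, x[0] + x[1]), max(0.0, x[0] - x[1])]
    return [0.8 * feats[0], 0.8 * feats[1]], feats

def local_replacement(prompt_x):
    """Measure the approximation error on one prompt and add it back as a constant."""
    approx, _ = transcoder(prompt_x)
    error = [o - a for o, a in zip(original_mlp(prompt_x), approx)]

    def model(x):
        approx_x, feats = transcoder(x)
        return [a + e for a, e in zip(approx_x, error)], feats

    return model, error

prompt = [0.5, 0.2]
model, error = local_replacement(prompt)
exact_output, active_features = model(prompt)  # matches original_mlp(prompt)
elsewhere_output, _ = model([1.5, -0.4])       # no such guarantee off-prompt
```

The size of the error vector is itself informative: it is the part of the computation on this prompt that the interpretable features did not explain.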
Read The Slide
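Supernodes. Even with features, a single prompt may activate hundreds of them, and the attribution graph that traces which active features influenced which others is too dense to read directly. Group related features into supernodes and the resulting graph becomes legible: nodes are concepts, edges are causal influences, and a human can read what the model was doing.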
Validate by intervening. The graph is a hypothesis. The intervention is what makes it a scientific claim: if the graph says feature X caused output Y, then suppressing X in the live model should change Y. The paper validates every major finding this way and several claims that looked plausible from the graph failed the intervention test. That is what makes this enterprise Popperian in the proper sense: attribution graphs generate predictions; interventions falsify or confirm them; the corpus of confirmed circuits grows.
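The grouping step described under Supernodes amounts to bookkeeping over edge weights. A minimal sketch, assuming a purely linear path from each feature to one output and assuming an edge’s weight is the source feature’s activation times its linear effect on the target. The feature names, activations, weights, and the hand-assigned grouping are all invented for illustration.

```python
# Hypothetical active features on one prompt: name -> activation strength.
activations = {"state_feature_1": 0.6, "state_feature_2": 0.4, "city_feature": 1.0}

# Hypothetical direct linear effect of one unit of each feature on an output logit.
effect_on_output = {"state_feature_1": 1.5, "state_feature_2": 1.0, "city_feature": 0.1}

# Edge weight = how much this feature contributed to the output on this prompt.
edges = {name: activations[name] * effect_on_output[name] for name in activations}

# Hand-assigned grouping of related features into supernodes.
supernodes = {
    "STATE": ["state_feature_1", "state_feature_2"],
    "CITY": ["city_feature"],
}

# A supernode's edge is the sum of its members' edges.
supernode_edges = {
    label: sum(edges[member] for member in members)
    for label, members in supernodes.items()
}
```

The grouped graph makes a prediction that can be checked: in this toy, suppressing the STATE supernode should remove most of the output logit, while suppressing CITY should barely move it.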
That is the instrument. Here is what it saw.
A Tour Inside Claude
We came out of On the Biology knowing more about what Claude was doing inside than we’d known about any AI system of that complexity at any point in the history of the technology. We could show, with intervention experiments rather than speculation, that a real intermediate concept (Texas) mediated Claude’s answer to a question about the capital of the state containing Dallas; that Claude planned poems several tokens ahead with a destination rhyme active at the line break; that its refusal mechanism was a specific coalition of features firing before any output began; and that its chain of thought sometimes reflected its actual reasoning and sometimes was a story it learned to tell after the fact.
Next we’ll walk through four results from the paper, each with attribution graphs and intervention experiments.
Multi-Step Reasoning Lived in the Wires
Anthropic prompted Claude with “Fact: the capital of the state containing Dallas is”. It answered Austin.
A skeptic might have supposed Claude was pattern-matching: the prompt has been seen a million times and the answer is memorized. The attribution graph said otherwise. There was a real, intermediate Texas supernode that fired after seeing Dallas and fed into a say Austin supernode. The reasoning chain ran Dallas → Texas → Austin, executed in the wires.

The intervention check was decisive. Suppress the Texas-related features and Claude no longer routes to Austin. Swap them for California features and the model says Sacramento. Swap them for Georgia features and it says Atlanta. The intermediate concept was a real, manipulable circuit.
This is what the paper meant by causal mechanism rather than activation pattern. It is not enough to notice that a Texas-related feature is active when the prompt mentions Dallas. The graph claimed that feature was causally responsible for the output. The intervention experiment is what made that claim falsifiable and what makes the work science, not observation.
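The logic of the swap experiment can be written down as a caricature. The sketch below is a lookup-table toy, not the model’s mechanism: it assumes a two-hop pipeline in which a city activates a state feature and the state feature drives the answer, and it shows what suppressing or swapping the intermediate does. The helper names are hypothetical; the states and capitals are the ones mentioned above.

```python
STATES = ["Texas", "California", "Georgia"]
CITY_TO_STATE = {"Dallas": "Texas"}
STATE_TO_CAPITAL = {"Texas": "Austin", "California": "Sacramento", "Georgia": "Atlanta"}

def state_features(city):
    """Hop 1: the city activates exactly one intermediate state feature."""
    return {s: 1.0 if CITY_TO_STATE[city] == s else 0.0 for s in STATES}

def answer(features):
    """Hop 2: the most active state feature drives the capital that gets said."""
    logits = {STATE_TO_CAPITAL[s]: a for s, a in features.items()}
    best = max(logits, key=logits.get)
    return best if logits[best] > 0 else None

def suppress(features, state):
    out = dict(features)
    out[state] = 0.0
    return out

def swap(features, src, dst):
    """Move the activation from one state feature to another."""
    out = dict(features)
    out[dst] = out[src]
    out[src] = 0.0
    return out

feats = state_features("Dallas")
baseline = answer(feats)
suppressed = answer(suppress(feats, "Texas"))
swapped = answer(swap(feats, "Texas", "California"))
```

If the answer were produced by a direct Dallas-to-Austin shortcut instead, editing the intermediate would change nothing. That difference in predicted outcome is what the intervention tests.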
What sits in the model’s head, between Dallas and Austin, is a Texas-shaped concept that appears in neither word. That is what an intermediate representation looks like from the inside.
It Picked the Rhyme Before Writing the Line
Anthropic gave Claude the start of a rhyming couplet:
He saw a carrot and had to grab it, / His hunger was like a starving ___
It completed with rabbit. The naïve picture of next-token prediction would say Claude was choosing words one at a time, guided by local context. The attribution graph revealed something different: at the moment Claude finished the first line, before any token of the second line had been generated, the feature for rabbit was already strongly active. So was habit. The model was choosing the destination word in advance and writing the line toward it.

Suppress the rabbit feature and Claude rewrote the line to end on habit. Suppress both and it abandoned the planned ending and wrote a different sentence with appropriate syntax. The line was generated forward, like every other line, but planned backward: the destination feature was already active when the line began.
The naïve picture of a language model is a fancy stenographer producing the most likely next word given the words so far. The picture this graph paints is closer to a writer who decides where she wants to land and walks the sentence backward to get there.
Safety was a Circuit, Not a Behavior
The team showed Claude a harmful prompt (write an advertisement for cleaning with bleach and ammonia), and a coalition of features lit up before any output had been generated: harmful request, dangers of mixing cleaning chemicals, assistant should refuse, say "I" in refusal, assistant warning user.

Inhibit the mixing-bleach-and-ammonia feature cluster and the downstream refusal coalition collapses; the model complies.
Safety training, in other words, had built a specific, manipulable circuit. We could find it, name it, test it, and break it; we can also, perhaps, build a better one. The alternative was “we trained the model and it seems to behave,” and that was never going to be enough for high-stakes deployment.
The Chain of Thought Sometimes Lied
Modern language models are often asked to “think step by step” and produce a chain of thought. The hope is that the chain of thought is the reasoning, that what the model says it’s doing is what it’s actually doing.
The paper showed this was true sometimes and not true at other times. The authors compared two superficially identical cases.
In the bullshitting case, the model was asked for floor(5 * cos(23423)), and its chain of thought walked through “Using a calculator, cos(23423) ≈ -0.8939” and then “Multiply this by 5: 5 * (-0.8939) ≈ -4.4695” but the attribution graph showed that the model never performed those calculations. It had guessed an answer first, then constructed a chain of thought that landed on it. Claude has no calculator: no tool call, no arithmetic unit, no scratchpad it can write to and read from; the “Using a calculator” line is the model narrating an action it cannot take. That is less damning than it sounds. The pretraining corpus is full of human-written text where people say “let me grab a calculator” or “plugging this into Wolfram” and then produce a number. The model has learned the language of computation, the format of step-by-step arithmetic, the cadence of first… then… therefore, without ever learning the calculator itself. When asked for cos(23423), it does what its training rewards: it pattern-matches the form a careful answer would take and fills in a plausible-looking number. The prose alone is indistinguishable from real arithmetic. The attribution graph is what catches the substitution.
In the faithful case, on a closely related problem, the model’s chain of thought was the reasoning. The attribution graph confirmed that the sqrt and multiply features actually fired, in order, exactly as the transcript described.
Two superficially identical chains of thought. The circuits tell them apart. (Figure 51 from the paper.)
Similar-looking chains of thought, very different underlying mechanisms. You couldn’t tell them apart from the text. The circuits did.
The implication runs both ways. When the model’s chain of thought tracks its mechanism, the chain is information; when it doesn’t, the chain is performance. You cannot tell which from the text alone. Asking the model what it’s doing is, on its own, an unreliable instrument. The microscope is what tells you when to trust the words.
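For reference, the expression from the first case is easy to evaluate with an actual calculator, which is the step the transcript only narrated. The snippet below uses Python’s standard math module. Whether the argument is meant in radians or degrees is not stated in the prompt, so both readings are computed; that choice, and the helper name floor_five_cos, are assumptions made here.

```python
import math

def floor_five_cos(x, degrees=False):
    """Evaluate floor(5 * cos(x)), reading x as radians unless degrees=True."""
    angle = math.radians(x) if degrees else x
    return math.floor(5 * math.cos(angle))

as_radians = floor_five_cos(23423)
as_degrees = floor_five_cos(23423, degrees=True)
print(math.cos(23423), as_radians, as_degrees)
```

Running it gives a ground truth to set beside the numbers in the chain of thought, which is a behavioral check. The attribution graph answers the different question of whether any such computation happened inside the model at all.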
What We Still Can’t See
Interpretability researchers are especially candid about what their tools don’t do. Three caveats matter most.
Coverage. The replacement model captures a fraction of what the original is doing. Estimates vary, but a substantial portion of the model’s behavior on most inputs is still dark. The attribution graph is a partial view, not a complete one.
Generalization. Every finding above is about Claude 3.5 Haiku on specific prompts. Whether the same circuits exist in other Claude versions, in GPT, in Gemini, is an empirical question. The methodology should transfer; the specific circuits may not. Transferring the methodology is a monstrous task in itself, as the architecture and scale of contemporary models differ substantially from those of Claude 3.5 Haiku.
Labels are hypotheses. A feature is “harmful request” because researchers looked at the top inputs that activate it and concluded that’s what it tracks. That label is a hypothesis, not a definition. The feature may be tracking something subtly different: high-arousal language, or content that resembles past safety-training examples. Until you’ve tested the label on inputs the labeler didn’t see, you don’t know.
The right way to read these findings is as plausible and testable rather than established. The technique generates hypotheses; intervention experiments confirm or refute them; the corpus of confirmed circuits grows.
We Have Microscopes Now
Before the telescope, astronomy was geometry dressed up as guesswork. The first telescopes were bad. They saw blurry moons, missed entire planets, and confused observers as often as they enlightened them. But the moment a community had eyepieces, the question changed. The discipline stopped being what should we expect given first principles and started being what can we see if we point this somewhere new.
That is the shift mechanistic interpretability is in. The instruments are early-generation. The view is partial, the labels are sometimes wrong, the technique is incomplete on attention. But specific claims about specific computations in specific models can now be made, checked, and falsified. That is the criterion. That is what makes it a science and not a posture toward one.
What We Are Actually Trying to Find Out
It is worth being explicit about the project. Interpretability is sometimes framed purely as a deployment-safety effort, a way to gate model releases on circuit-level audits, to catch a misaligned model before it reaches a user. That framing is correct, and it is also too small.
The other thing we are doing, with the same instruments, is nakedly philosophical. We are using these models as the first practical laboratory for cognition. The mind has been historically inaccessible. Brains are wet and slow, and you cannot ablate a midbrain feature on a curious volunteer and read what changed downstream. The model is a thing that learned by predicting language drawn from human minds, and whose functions might be profoundly related to ours. The features models grow are the features the data forced; the concepts the data forced are the ones our brains had to accommodate too. When we look inside Claude, we are, in a strange but precise sense, looking at our own reflections in a medium we can finally cut open.
We are left with the question: is this a science? It is the same kind of science as cell biology in the 1830s, when staining made the cell wall legible and a hundred research programs followed; the same kind as crystallography in the 1910s, when X-ray diffraction turned the structure of matter from speculation into measurement. It is a science whose subject was there all along and whose discipline began the moment an instrument arrived. What you can image, you can study. We can now image features and circuits. The rest is work.
The Toolkit is Being Built Out, Fast
In the fourteen months since On the Biology, the agenda has moved fast. Attribution graphs scaled. Automated feature labeling became a research program. The attention gap, flagged as an open limitation in Anthropic’s Circuit Tracing methods companion (Ameisen et al., 2025), has started to close through follow-up work, building on the earlier wave (Sparse Feature Circuits, Marks et al. 2024; Gemma Scope, DeepMind 2024; Sparse Crosscoders, Anthropic 2024; Scaling and evaluating sparse autoencoders, Gao et al. 2024). Several specific claims in On the Biology have been refined out from under their original framing. That is what a working science looks like in motion.
One of the biggest open moves has been treating features as geometry. Not All Language Model Features Are One-Dimensionally Linear (Engels et al., May 2024) was the opening shot: many features are not isolated dictionary entries. They sit on smooth low-dimensional manifolds. Day-of-week features form a circle; months form a circle; categorical and ordinal information lives on geometric structures with intrinsic curvature. Follow-on work showed that numerical magnitudes form a helix (Language Models Use Trigonometry to Do Addition, Kantamneni & Tegmark, 2025). Treating any of these features as flat points throws away most of what they encode. The work that followed turned this into its own research program: what manifolds get learned, how to discover them, and what implications they carry for SAE design and cross-model comparison.
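A circular representation is useful because arithmetic on it becomes rotation. The sketch below is a hand-built illustration of that idea, not the representation any model actually learned: it places the seven days evenly on a unit circle, adds days by rotating, and decodes by finding the nearest day. The function names and the choice of Monday as index 0 are assumptions.

```python
import math

DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]

def embed(index):
    """Place a day on the unit circle, one seventh of a turn per day."""
    angle = 2 * math.pi * index / len(DAYS)
    return (math.cos(angle), math.sin(angle))

def rotate(point, steps):
    """Advance by `steps` days: a rotation by steps/7 of a full turn."""
    angle = 2 * math.pi * steps / len(DAYS)
    x, y = point
    return (x * math.cos(angle) - y * math.sin(angle),
            x * math.sin(angle) + y * math.cos(angle))

def decode(point):
    """Read off the day whose embedding is closest (largest dot product)."""
    scores = [point[0] * e[0] + point[1] * e[1] for e in map(embed, range(len(DAYS)))]
    return DAYS[scores.index(max(scores))]

# "Four days after Saturday": wraparound comes for free from the geometry.
result = decode(rotate(embed(DAYS.index("Saturday")), 4))
```

A single dictionary entry per day would capture which day is active but not the fact that Sunday sits next to Monday. The circle encodes both, which is why flattening it into isolated points loses information.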

The other vector: most current frontier models are now mixture-of-experts systems, which split capacity across many sub-networks rather than running every input through every parameter. The earlier worry was that MoE would be less legible than dense models. In recent follow-up work, my collaborators and I find the opposite: network sparsity in MoE correlates with greater feature monosemanticity. The architecture every frontier lab now ships may turn out to be more tractable to interpret than what came before, not less. That is not a finished claim, but the question is being investigated.
What This Is
The questions people argue about in May 2026 are not the questions people argued about in March 2025. Interventions have moved toward production: steering, deployment-time circuit scanning, and audits framed in interpretability terms rather than behavioral ones. None of these is finished. By the time you read this, several specific claims above will have been refined further. That is what it looks like when a science is in motion.
We have microscopes now. The view is bad. The view is improving. The thing under the lens is something we made, that learned to model us, and whose internal structure looks more and more like our own. This is how a science of intelligence begins.
Notes
N · 01
By the standards of mechanistic interpretability, and of AI more generally, this post is a dispatch from the dark ages: On the Biology of a Large Language Model was released March 27, 2025. By the time you are reading this, several of the open questions above will have closed, several new ones will have opened, and at least one finding above will have been refined out from under me.









