DLA Study — Phase 1
TransformerLens Analysis
Jasdeep Jaitla · 2026 · 7 models · 124M–32.5B parameters
Abstract
Using TransformerLens, a mechanistic interpretability library used by Anthropic, DeepMind, and the ARENA MI curriculum, we test Metaphori Engine™ Structured Notation (MESN™) across 7 models spanning a 262x scale range (124M to 32.5B parameters). This proof-of-concept establishes the experimental methodology, defines the 8-family head specialization taxonomy, and validates the multi-specialization hypothesis that the larger 43-model study later confirmed.
Critically, Phase 1 tests three input variants — prose, MESN™, and code — rather than just two. This three-way comparison reveals that MESN™ occupies a unique position: it is neither prose nor code, but simultaneously engages attention heads associated with both.
Models Tested
| Model | Parameters | Layers | Heads | Device |
|---|---|---|---|---|
| GPT-2 | 124M | 12 | 144 | M2 Max (MPS) |
| GPT-2 Medium | 345M | 24 | 384 | M2 Max (MPS) |
| Pythia-1.4B | 1.4B | 24 | 384 | M2 Max (MPS) |
| Mistral-7B-v0.1 | 7.2B | 32 | 1,024 | A100 40GB |
| Pythia-12B | 12B | 36 | 1,440 | A100 80GB |
| GPT-NeoX-20B | 20B | 44 | 2,816 | A100 80GB |
| Qwen 2.5 32B | 32.5B | 64 | 2,560 | A100 80GB |
Scaling Behavior
The primary metric is per-token head engagement: aggregate attention head activation normalized by token count. This controls for the length difference between variants — MESN™ uses ~14% fewer tokens than prose on average.
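The exact aggregation pipeline is not reproduced here; a minimal sketch of the length normalization, assuming each head's engagement is summarized as one scalar per head per token (the array shapes and values below are illustrative, not the study's data):

```python
import numpy as np

def per_token_head_engagement(activations: np.ndarray) -> float:
    """Aggregate head engagement normalized by token count.

    activations: array of shape [n_tokens, n_heads], one scalar
    engagement value per head per token (hypothetical summary;
    the study derives these from DLA magnitudes).
    """
    n_tokens = activations.shape[0]
    return float(activations.sum() / n_tokens)

# Length control: MESN uses fewer tokens than prose, so raw activation
# sums are not comparable, but per-token values are.
prose = np.full((100, 144), 0.5)   # 100 tokens, GPT-2's 144 heads
mesn = np.full((86, 144), 0.62)    # ~14% fewer tokens, higher engagement

advantage = per_token_head_engagement(mesn) / per_token_head_engagement(prose) - 1
print(f"{advantage:+.1%}")  # prints +24.0%
```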
The effect is stable-to-increasing across scale. At 124M parameters, GPT-2 already shows +23.9% advantage. At 1.4B, Pythia amplifies this to +26.3%. The multi-specialization signal is robust at minimal scale and doesn't require large models to manifest.
Per-token head engagement advantage by model scale
The Three-Variant Finding
Phase 1 uniquely tests three input variants — prose, MESN™, and Python-style pseudocode. This three-way comparison was dropped in Phase 2 for cleaner statistical testing, but it produced a critical early insight:
- Code activates code_syntactic heads strongly but semantic_conceptual heads weakly
- Prose activates semantic_conceptual heads strongly but code_syntactic heads weakly
- MESN™ activates both simultaneously — code_syntactic AND semantic_conceptual heads
This was the first empirical evidence for the multi-specialization hypothesis: MESN™ uniquely bridges the specialization gap between “language” heads and “code” heads, engaging both in the same expression.
Activation Distribution Shape
Beyond raw magnitude, the early experiments measured the shape of the activation distribution — how engagement is spread across heads:
| Metric | Direction | Interpretation |
|---|---|---|
| Gini coefficient | MESN™ < Prose | More equal distribution across heads |
| IQR | MESN™ < Prose | Tighter distribution, fewer extreme outliers |
| Kurtosis | MESN™ > Prose | More peaked — more heads at similar activation |
MESN™ doesn't just activate heads more strongly; it activates them more evenly. Prose creates a few highly active heads and many dormant ones. MESN™ creates a broader, more uniform engagement pattern. This observation directly motivated the cross-family diversity hypothesis in the 43-model study.
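The three shape metrics are standard statistics. A sketch of how they can be computed over a per-head engagement vector, using synthetic data chosen to illustrate the reported directions (a peaked-but-even MESN profile versus a broad prose profile, not the study's measurements):

```python
import numpy as np

def gini(x: np.ndarray) -> float:
    """Gini coefficient of a non-negative vector (0 = perfectly even)."""
    x = np.sort(x)
    n = x.size
    # Standard Lorenz-curve formula on sorted values.
    return float((2 * np.arange(1, n + 1) - n - 1) @ x / (n * x.sum()))

def iqr(x: np.ndarray) -> float:
    """Interquartile range: spread of the middle 50% of heads."""
    q75, q25 = np.percentile(x, [75, 25])
    return float(q75 - q25)

def excess_kurtosis(x: np.ndarray) -> float:
    """Fourth standardized moment minus 3 (0 for a normal distribution)."""
    z = (x - x.mean()) / x.std()
    return float((z ** 4).mean() - 3)

rng = np.random.default_rng(0)
# Illustrative profiles over 144 heads: prose spread broadly,
# MESN tightly clustered (peaked) around a common engagement level.
prose = rng.uniform(0, 4, 144)
mesn = rng.laplace(2, 0.2, 144)

assert gini(mesn) < gini(prose)                        # more equal
assert iqr(mesn) < iqr(prose)                          # tighter
assert excess_kurtosis(mesn) > excess_kurtosis(prose)  # more peaked
```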
The Qwen Step-Change
Per-family activation analysis across the 4 models with surviving data reveals two distinct clusters — not a smooth gradient:
- Cluster 1 (7B–20B): Mistral, Pythia-12B, GPT-NeoX-20B — consistent ~27–33% advantage across all 8 families
- Cluster 2 (32B): Qwen 2.5 32B — dramatically higher ~46–50% advantage, roughly 1.6x the other models
The jump is not gradual. It's a step-change that may reflect Qwen's larger scale, its GQA architecture, its 152K vocabulary (3x larger than the other models), or training data composition. 32 of 32 family-direction checks are positive.
Per-family activation advantage: 7B–20B cluster vs Qwen 2.5 32B
Late-Layer Concentration
Head specialization concentrates in the later half of network depth across all architectures. Early layers show low specialization scores; Mistral 7B and Qwen 32B peak in the final 25%, while Pythia 12B and NeoX 20B peak in the 50–75% zone before flattening slightly.
| Depth Zone | Mistral 7B | Pythia 12B | NeoX 20B | Qwen 32B |
|---|---|---|---|---|
| Early (0–25%) | 0.018 | 0.044 | 0.035 | 0.007 |
| Mid-Early (25–50%) | 0.018 | 0.054 | 0.054 | 0.008 |
| Mid-Late (50–75%) | 0.042 | 0.068 | 0.062 | 0.009 |
| Late (75–100%) | 0.083 | 0.054 | 0.059 | 0.035 |
Qwen 2.5 32B shows the most extreme late-layer concentration: its early layers (0.007) are nearly inert while late layers (0.035) carry 5x the specialization. All of Qwen's top-10 most specialized heads fall in the final 10% of the network (layers 57–63 of 64).
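The depth zones in the table can be reproduced by bucketing layer indices into quartiles of total depth (a sketch; the exact boundary convention used by the study is assumed, not documented here):

```python
def depth_zone(layer: int, n_layers: int) -> str:
    """Map a layer index to one of the four depth-fraction zones."""
    frac = layer / n_layers
    if frac < 0.25:
        return "early"
    if frac < 0.50:
        return "mid-early"
    if frac < 0.75:
        return "mid-late"
    return "late"

# Qwen 2.5 32B: layers 57-63 of 64 sit in the final ~10% of depth.
assert all(depth_zone(l, 64) == "late" for l in range(57, 64))
# Mistral-7B: 32 layers, so layers 24-31 form the late zone.
assert depth_zone(24, 32) == "late" and depth_zone(23, 32) == "mid-late"
```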
Semantic Category Hierarchy
Not all semantic categories benefit equally. The advantage follows a consistent hierarchy across all 4 models:
| Category | Mean Advantage | Tier |
|---|---|---|
| Bidirectional | +60.0% | Strong |
| Metaphor | +55.0% | Strong |
| Transformation | +51.6% | Strong |
| Contrastive | +49.6% | Strong |
| Hierarchical | +18.0% | Moderate |
| Definition | +7.9% | Modest |
| Causal | +6.5% | Modest |
| Complex | +6.3% | Modest |
Bidirectional, metaphor, and transformation categories — where MESN™ encodes mutual influence, analogy, and state change — show the largest advantages (49–60%). Causal, complex, and definition categories show more modest gains. Only one negative entry exists in the entire 32-cell matrix: Mistral 7B on causal at −3.5%.
Why TransformerLens
TransformerLens provides zero-friction DLA: `z @ W_O @ W_U` is computable directly from built-in weight matrices. A single `run_with_cache()` call captures all intermediate activations. This made it ideal for iterative proof-of-concept work.
At the Mistral-7B scale, TransformerLens hit limitations: ~30% memory overhead from internal format conversion, GQA head expansion that changes activation patterns, and no support for newer architectures. The migration to nnsight addressed all of these while preserving the core DLA methodology. Cross-validation on Mistral-7B confirmed both tools produce the same directional results.
What This Phase Established
- The 8-family taxonomy — developed on GPT-2, validated across 7 models, carried unchanged into the 43-model study
- The multi-specialization hypothesis — first articulated after the three-variant comparison showed MESN™ bridging language and code heads
- DLA as the primary measurement tool — per-head contribution to vocabulary logits, not attention weights alone
- The "targeted vs shotgun" insight — prose activates many families weakly (shotgun); MESN™ activates the right families strongly (targeted)
- Late-layer concentration — MESN™-responsive heads cluster in the final 25% of network depth
- The scaling signal — detectable at 124M parameters, amplifies with scale, suggesting a fundamental architectural response
- The Qwen step-change — a qualitative jump from ~28% to ~47% advantage hinting that scale and architecture interact
Limitations
- 32 stimulus pairs — statistically adequate (p < 0.001) but too small for the complexity tier analysis that became central to Phase 2
- No generation capture, meaning no perplexity analysis or spontaneous operator detection
- GPT-2, GPT-2 Medium, and Pythia-1.4B data were not preserved — their metrics survive in notebook outputs, but raw per-head data cannot be reanalyzed
- Single-run results with no automated replication
- No MoE or MLA architectures tested