DLA Study — Phase 1
TransformerLens Analysis
Jasdeep Jaitla · 2026 · 7 models · 124M–32.5B parameters
Abstract
Using TransformerLens, a mechanistic interpretability library used by Anthropic, DeepMind, and the ARENA MI curriculum, we test Metaphori Engine™ Structured Notation (MESN™) across 7 models spanning a 262x scale range (124M to 32.5B parameters). This proof-of-concept establishes the experimental methodology, defines the 8-family head specialization taxonomy, and validates the multi-specialization hypothesis that the larger 43-model study later confirmed.
Critically, Phase 1 tests three input variants — prose, MESN™, and code — rather than just two. This three-way comparison reveals that MESN™ occupies a unique position: it is neither prose nor code, but simultaneously engages attention heads associated with both.
Models Tested
| Model | Parameters | Layers | Heads | Device |
|---|---|---|---|---|
| GPT-2 | 124M | 12 | 144 | M2 Max (MPS) |
| GPT-2 Medium | 345M | 24 | 384 | M2 Max (MPS) |
| Pythia-1.4B | 1.4B | 24 | 384 | M2 Max (MPS) |
| Mistral-7B-v0.1 | 7.2B | 32 | 1,024 | A100 40GB |
| Pythia-12B | 12B | 36 | 1,440 | A100 80GB |
| GPT-NeoX-20B | 20B | 44 | 2,816 | A100 80GB |
| Qwen 2.5 32B | 32.5B | 64 | 2,560 | A100 80GB |
Scaling Behavior
The primary metric is per-token head engagement: aggregate attention head activation normalized by token count. This controls for the length difference between variants — MESN™ uses ~14% fewer tokens than prose on average.
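The exact aggregation pipeline is not reproduced here; a minimal sketch of the length normalization, assuming each head's engagement is summarized as one scalar per head per token (the array shapes and values below are illustrative, not the study's data):

```python
import numpy as np

def per_token_head_engagement(activations: np.ndarray) -> float:
    """Aggregate head engagement normalized by token count.

    activations: array of shape [n_tokens, n_heads], one scalar
    engagement value per head per token (hypothetical summary;
    the study derives these from DLA magnitudes).
    """
    n_tokens = activations.shape[0]
    return float(activations.sum() / n_tokens)

# Length control: MESN uses fewer tokens than prose, so raw activation
# sums are not comparable, but per-token values are.
prose = np.full((100, 144), 0.5)   # 100 tokens, GPT-2's 144 heads
mesn = np.full((86, 144), 0.62)    # ~14% fewer tokens, higher engagement

advantage = per_token_head_engagement(mesn) / per_token_head_engagement(prose) - 1
print(f"{advantage:+.1%}")  # prints +24.0%
```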
The effect is stable-to-increasing across scale. At 124M parameters, GPT-2 already shows +23.9% advantage. At 1.4B, Pythia amplifies this to +26.3%. The multi-specialization signal is robust at minimal scale and doesn't require large models to manifest.
Per-token head engagement advantage by model scale
The Three-Variant Finding
Phase 1 uniquely tests three input variants — prose, MESN™, and Python-style pseudocode. This three-way comparison was dropped in Phase 2 for cleaner statistical testing, but it produced a critical early insight:
- Code activates code_syntactic heads strongly but semantic_conceptual heads weakly
- Prose activates semantic_conceptual heads strongly but code_syntactic heads weakly
- MESN™ activates both simultaneously — code_syntactic AND semantic_conceptual heads
This was the first empirical evidence for the multi-specialization hypothesis: MESN™ uniquely bridges the specialization gap between “language” heads and “code” heads, engaging both in the same expression.
Activation Distribution Shape
Beyond raw magnitude, the early experiments measured the shape of the activation distribution — how engagement is spread across heads:
| Metric | Direction | Interpretation |
|---|---|---|
| Gini coefficient | MESN™ < Prose | More equal distribution across heads |
| IQR | MESN™ < Prose | Tighter distribution, fewer extreme outliers |
| Kurtosis | MESN™ > Prose | More peaked — more heads at similar activation |
MESN™ doesn't just activate heads more strongly; it activates them more evenly. Prose creates a few highly active heads and many dormant ones. MESN™ creates a broader, more uniform engagement pattern. This observation directly motivated the cross-family diversity hypothesis in the 43-model study.
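The three shape metrics are standard statistics. A sketch of how they can be computed over a per-head engagement vector, using synthetic data chosen to illustrate the reported directions (a peaked-but-even MESN profile versus a broad prose profile, not the study's measurements):

```python
import numpy as np

def gini(x: np.ndarray) -> float:
    """Gini coefficient of a non-negative vector (0 = perfectly even)."""
    x = np.sort(x)
    n = x.size
    # Standard Lorenz-curve formula on sorted values.
    return float((2 * np.arange(1, n + 1) - n - 1) @ x / (n * x.sum()))

def iqr(x: np.ndarray) -> float:
    """Interquartile range: spread of the middle 50% of heads."""
    q75, q25 = np.percentile(x, [75, 25])
    return float(q75 - q25)

def excess_kurtosis(x: np.ndarray) -> float:
    """Fourth standardized moment minus 3 (0 for a normal distribution)."""
    z = (x - x.mean()) / x.std()
    return float((z ** 4).mean() - 3)

rng = np.random.default_rng(0)
# Illustrative profiles over 144 heads: prose spread broadly,
# MESN tightly clustered (peaked) around a common engagement level.
prose = rng.uniform(0, 4, 144)
mesn = rng.laplace(2, 0.2, 144)

assert gini(mesn) < gini(prose)                        # more equal
assert iqr(mesn) < iqr(prose)                          # tighter
assert excess_kurtosis(mesn) > excess_kurtosis(prose)  # more peaked
```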
The Qwen Step-Change
Per-family activation analysis across the 4 models with surviving data reveals two distinct clusters — not a smooth gradient:
- Cluster 1 (7B–20B): Mistral, Pythia-12B, GPT-NeoX-20B — consistent ~27–33% advantage across all 8 families
- Cluster 2 (32B): Qwen 2.5 32B — dramatically higher ~46–50% advantage, roughly 1.6x the other models
The jump is not gradual. It's a step-change that may reflect Qwen's larger scale, its GQA architecture, its 152K vocabulary (3x larger than the other models), or training data composition. 32 of 32 family-direction checks are positive.
Per-family activation advantage: 7B–20B cluster vs Qwen 2.5 32B
Late-Layer Concentration
Head specialization concentrates in the later half of network depth across all architectures. Early layers show low specialization scores; Mistral 7B and Qwen 32B peak in the final 25%, while Pythia 12B and NeoX 20B peak in the 50–75% zone before flattening slightly.
| Depth Zone | Mistral 7B | Pythia 12B | NeoX 20B | Qwen 32B |
|---|---|---|---|---|
| Early (0–25%) | 0.018 | 0.044 | 0.035 | 0.007 |
| Mid-Early (25–50%) | 0.018 | 0.054 | 0.054 | 0.008 |
| Mid-Late (50–75%) | 0.042 | 0.068 | 0.062 | 0.009 |
| Late (75–100%) | 0.083 | 0.054 | 0.059 | 0.035 |
Qwen 2.5 32B shows the most extreme late-layer concentration: its early layers (0.007) are nearly inert while late layers (0.035) carry 5x the specialization. All of Qwen's top-10 most specialized heads fall in the final 10% of the network (layers 57–63 of 64).
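The depth zones in the table can be reproduced by bucketing layer indices into quartiles of total depth (a sketch; the exact boundary convention used by the study is assumed, not documented here):

```python
def depth_zone(layer: int, n_layers: int) -> str:
    """Map a layer index to one of the four depth-fraction zones."""
    frac = layer / n_layers
    if frac < 0.25:
        return "early"
    if frac < 0.50:
        return "mid-early"
    if frac < 0.75:
        return "mid-late"
    return "late"

# Qwen 2.5 32B: layers 57-63 of 64 sit in the final ~10% of depth.
assert all(depth_zone(l, 64) == "late" for l in range(57, 64))
# Mistral-7B: 32 layers, so layers 24-31 form the late zone.
assert depth_zone(24, 32) == "late" and depth_zone(23, 32) == "mid-late"
```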
Semantic Category Hierarchy
Not all semantic categories benefit equally. The advantage follows a consistent hierarchy across all 4 models:
| Category | Mean Advantage | Tier |
|---|---|---|
| Bidirectional | +60.0% | Strong |
| Metaphor | +55.0% | Strong |
| Transformation | +51.6% | Strong |
| Contrastive | +49.6% | Strong |
| Hierarchical | +18.0% | Moderate |
| Definition | +7.9% | Modest |
| Causal | +6.5% | Modest |
| Complex | +6.3% | Modest |
Bidirectional, metaphor, and transformation categories — where MESN™ encodes mutual influence, analogy, and state change — show the largest advantages (49–60%). Causal, complex, and definition categories show more modest gains. Only one negative entry exists in the entire 32-cell matrix: Mistral 7B on causal at −3.5%.
Why TransformerLens
TransformerLens provides zero-friction DLA: `z @ W_O @ W_U` is computable directly from built-in weight matrices. A single `run_with_cache()` call captures all intermediate activations. This made it ideal for iterative proof-of-concept work.
At the Mistral-7B scale, TransformerLens hit limitations: ~30% memory overhead from internal format conversion, GQA head expansion that changes activation patterns, and no support for newer architectures. The migration to nnsight addressed all of these while preserving the core DLA methodology. Cross-validation on Mistral-7B confirmed both tools produce the same directional results.
What This Phase Established
- The 8-family taxonomy — developed on GPT-2, validated across 7 models, carried unchanged into the 43-model study
- The multi-specialization hypothesis — first articulated after the three-variant comparison showed MESN™ bridging language and code heads
- DLA as the primary measurement tool — per-head contribution to vocabulary logits, not attention weights alone
- The "targeted vs shotgun" insight — prose activates many families weakly (shotgun); MESN™ activates the right families strongly (targeted)
- Late-layer concentration — MESN™-responsive heads cluster in the final 25% of network depth
- The scaling signal — detectable at 124M parameters, amplifies with scale, suggesting a fundamental architectural response
- The Qwen step-change — a qualitative jump from ~28% to ~47% advantage hinting that scale and architecture interact
Limitations
- 32 stimulus pairs — statistically adequate (p < 0.001) but too small for the complexity tier analysis that became central to Phase 2
- No generation capture, meaning no perplexity analysis or spontaneous operator detection
- GPT-2, GPT-2 Medium, and Pythia-1.4B data were not preserved — their metrics survive in notebook outputs, but raw per-head data cannot be reanalyzed
- Single-run results with no automated replication
- No MoE or MLA architectures tested