DLA+MLP Study — Phase 1

Two-Pathway Analysis: Attention vs MLP Signal Routing

Jasdeep Jaitla · 2026 · 54 models · simultaneous dual-pathway attribution

Abstract

We present the first systematic comparison of how the Metaphori Engine™ Structured Notation (MESN™) signal routes through both computational pathways in transformer models: attention heads (direct logit attribution, DLA) and MLP layers (feed-forward attribution). By measuring both simultaneously across 54 models, we find that the structured notation response is a dual-pathway system with architecture-dependent routing.

  • 54 models, dual-pathway attribution
  • 35× MLP magnitude vs attention
  • r = 0.080 pathway correlation (p = 0.56)
  • 3 routing patterns discovered

The magnitude gap

| Pathway | Mean Advantage | Models Positive | Absolute Magnitude (mean) |
| --- | --- | --- | --- |
| DLA (Attention) | +8.4% | 48/54 (89%) | ~22 |
| MLP (Feed-forward) | −1.8% | 23/54 (43%) | ~779 |
| Combined | +6.5% | 33/54 (61%) | |

Despite carrying roughly 35× more raw signal, MLPs show a slightly negative mean advantage (−1.8%). The attention pathway carries far less magnitude but is more discriminating. Think of MLPs as the base signal (the highway) and attention heads as the steering (differential routing); MESN™ engages the steering mechanism more strongly.
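
To make the two quantities concrete, here is a minimal sketch of how per-pathway magnitude and advantage could be computed for one model. The attribution arrays, the relative-difference definition of advantage, and the function name are illustrative assumptions, not the study's actual pipeline.

```python
import numpy as np

def pathway_stats(mesn_attr: np.ndarray, prose_attr: np.ndarray):
    """Summarize one pathway (DLA or MLP) for one model.

    mesn_attr / prose_attr: per-prompt attribution scores for the target
    tokens under MESN™ vs. plain-prose phrasing (hypothetical inputs).
    """
    # Absolute magnitude: how much raw signal the pathway carries overall.
    magnitude = np.mean(np.abs(np.concatenate([mesn_attr, prose_attr])))
    # Advantage: relative difference of MESN™ over prose, in percent (assumed definition).
    advantage = 100.0 * (mesn_attr.mean() - prose_attr.mean()) / abs(prose_attr.mean())
    return magnitude, advantage

# Toy numbers chosen to mimic the reported pattern (not study data):
rng = np.random.default_rng(0)
print(pathway_stats(rng.normal(23, 3, 200), rng.normal(21, 3, 200)))      # small magnitude, positive advantage
print(pathway_stats(rng.normal(770, 40, 200), rng.normal(785, 40, 200)))  # large magnitude, slightly negative advantage
```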

Pathway independence

DLA and MLP advantages are statistically uncorrelated (Pearson r=0.080, p=0.56). They are genuinely independent pathways — knowing a model's DLA advantage tells you nothing about its MLP advantage. This means DLA-only studies capture at most half the picture.
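
The independence claim comes down to a single correlation test across the per-model advantage pairs. A minimal sketch with placeholder values (the six dual-positive rows from the table further down); the real test uses all 54 models:

```python
from scipy.stats import pearsonr

# Per-model advantage percentages (placeholders; 54 entries in the real analysis).
dla_adv = [23.2, 15.0, 18.4, 10.5, 15.1, 22.9]
mlp_adv = [24.5, 29.4, 15.0, 21.9, 13.9, 4.6]

r, p = pearsonr(dla_adv, mlp_adv)
print(f"Pearson r = {r:.3f}, p = {p:.2f}")
# With the full 54-model vectors the study reports r = 0.080, p = 0.56.
```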

Three routing patterns

| Pattern | Count | DLA Direction | MLP Direction | Interpretation |
| --- | --- | --- | --- | --- |
| Dual-Positive | 16 | + | + | Both pathways favor MESN™ |
| Attention-Dominant | 26 | + | − | Attention favors MESN™, MLPs favor prose |
| MLP-Compensating | 3 | − | + | MLPs compensate for negative attention |
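
The patterns follow from the signs of the two advantages. Below is a minimal sketch of the classification rule as the table implies it; the study may apply thresholds rather than a strict sign test, and the data structure is hypothetical.

```python
from dataclasses import dataclass

@dataclass
class ModelResult:
    name: str
    dla_adv: float  # attention-pathway (DLA) advantage, %
    mlp_adv: float  # MLP-pathway advantage, %

def routing_pattern(m: ModelResult) -> str:
    """Assign a routing pattern from the signs of the two pathway advantages."""
    if m.dla_adv > 0 and m.mlp_adv > 0:
        return "Dual-Positive"
    if m.dla_adv > 0:
        return "Attention-Dominant"
    if m.mlp_adv > 0:
        return "MLP-Compensating"
    return "Dual-Negative"  # not in the table; included for completeness

print(routing_pattern(ModelResult("Gemma 4 E4B IT", -1.5, 12.1)))  # MLP-Compensating
```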

Dual-positive models

The 16 models where both pathways favor MESN™ show the highest combined signal. Notable: Llama 3.1 70B Base leads with +23.2% DLA and +24.5% MLP for a +47.7% combined advantage. The DS-R1-Distill family is entirely dual-positive — all 4 models — suggesting that distillation preserves MLP response. Mistral 7B Base reaches +29.0% combined.

| Model | DLA | MLP | Combined |
| --- | --- | --- | --- |
| Llama 3.1 70B Base | +23.2% | +24.5% | +47.7% |
| DS-R1-Distill Qwen 14B | +15.0% | +29.4% | +44.4% |
| DS-R1-Distill Qwen 32B | +18.4% | +15.0% | +33.4% |
| DS-R1-Distill Qwen 7B | +10.5% | +21.9% | +32.4% |
| Mistral 7B Base | +15.1% | +13.9% | +29.0% |
| Qwen 2.5 14B Base | +22.9% | +4.6% | +27.5% |

The Gemma resolution

The MLP-Compensating pattern is exclusively Gemma. Gemma's negative DLA is not a failure to respond — it's a rerouting. Structured notation signal travels through MLP layers instead of attention heads.

| Model | DLA | MLP | Combined |
| --- | --- | --- | --- |
| Gemma 4 E4B IT | −1.5% | +12.1% | +10.6% |
| Gemma 2 27B | −5.9% | +4.3% | −1.6% |
| Gemma 4 31B IT | −6.8% | +2.3% | −4.5% |

Gemma 4 E4B IT is “rescued” by MLP: its −1.5% DLA would classify it as a negative responder, but +12.1% MLP lifts the combined signal to +10.6%.

Combined signal reshuffles the leaderboard

Adding MLP attribution changes which models appear to respond most strongly. Models with strong DLA but strongly negative MLP drop dramatically:

  • Aya Expanse 32B: +8.1% DLA, −46.1% MLP = −38.0% combined
  • Ministral 3 14B: +15.2% DLA, −25.5% MLP = −10.3% combined

A DLA-only study would call these positive responders. The full picture is more complex.
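
A minimal sketch of that reshuffling, using only the figures quoted in this section (the model selection and data layout are illustrative):

```python
# (model, DLA advantage %, MLP advantage %) for a few models quoted above.
results = [
    ("Llama 3.1 70B Base", 23.2, 24.5),
    ("Ministral 3 14B",    15.2, -25.5),
    ("Aya Expanse 32B",     8.1, -46.1),
    ("Gemma 4 E4B IT",     -1.5,  12.1),
]

dla_rank = sorted(results, key=lambda r: r[1], reverse=True)
combined_rank = sorted(results, key=lambda r: r[1] + r[2], reverse=True)

print([name for name, *_ in dla_rank])       # DLA-only ordering
print([name for name, *_ in combined_rank])  # Aya and Ministral sink; Gemma 4 E4B IT rises
```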

Implications

Two-pathway analysis reveals that structured notation doesn't simply “activate attention heads” — it creates architecture-dependent routing patterns that involve the entire forward pass. Future work should always measure both pathways simultaneously.