DLA+MLP Study — Phase 1
Two-Pathway Analysis: Attention vs MLP Signal Routing
Jasdeep Jaitla · 2026 · 54 models · simultaneous dual-pathway attribution
Abstract
We present the first systematic comparison of how Metaphori Engine™ Structured Notation (MESN™) signal routes through both computational pathways in transformer models: attention heads (direct logit attribution, DLA) and MLP layers (feed-forward attribution). By measuring both simultaneously across 54 models, we find that the structured-notation response is a dual-pathway system with architecture-dependent routing.
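To make the measurement concrete, here is a minimal sketch of dual-pathway attribution in the style of TransformerLens. The model, prompt, and target token are placeholders, and the final-LayerNorm folding that a faithful DLA applies is omitted; this illustrates the technique, not the study's exact pipeline.

```python
# Minimal dual-pathway attribution sketch (TransformerLens-style).
# Model, prompt, and target token are illustrative placeholders.
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")  # stand-in model
tokens = model.to_tokens("The structured notation maps concept A to")
logits, cache = model.run_with_cache(tokens)

# Unembedding direction whose dot product with a residual-stream write
# gives that component's contribution to the target token's logit.
target_id = model.to_single_token(" B")
logit_dir = model.W_U[:, target_id]  # shape: (d_model,)

pos = -1  # attribute at the final token position
dla, mlp = 0.0, 0.0
for layer in range(model.cfg.n_layers):
    # Attention pathway: the attention block's write to the residual stream.
    dla += cache["attn_out", layer][0, pos] @ logit_dir
    # MLP pathway: the feed-forward block's write to the residual stream.
    mlp += cache["mlp_out", layer][0, pos] @ logit_dir

# Note: a faithful DLA would also fold in the final LayerNorm
# (e.g. via cache.apply_ln_to_stack); omitted to keep the sketch short.
print(f"attention pathway: {dla:.3f}, MLP pathway: {mlp:.3f}")
```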
The magnitude gap
| Pathway | Mean Advantage | Models Positive | Mean Absolute Magnitude |
|---|---|---|---|
| DLA (Attention) | +8.4% | 48/54 (89%) | ~22 |
| MLP (Feed-forward) | −1.8% | 23/54 (43%) | ~779 |
| Combined | +6.5% | 33/54 (61%) | — |
Despite carrying roughly 35× more raw signal (~779 vs. ~22 mean absolute magnitude), MLPs show a slightly negative mean advantage of −1.8%. The attention pathway is smaller but more discriminating. Think of MLPs as providing the base signal (the highway) and attention heads as providing the steering (differential routing): MESN™ engages the steering mechanism more strongly.
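The exact normalization behind the percent advantages is not spelled out above; one plausible reading, sketched below with illustrative numbers, is a percent difference in total pathway attribution between structured-notation and prose prompts. It shows how a pathway can carry a much larger base signal while posting a smaller, or negative, relative advantage.

```python
# Assumed form of the per-pathway "advantage" metric: percent difference in
# total attribution between structured-notation and prose prompts. The
# study's exact normalization is not specified; this is one plausible reading.
def pathway_advantage(structured: float, prose: float) -> float:
    """Percent advantage of the structured prompt over the prose control."""
    return 100.0 * (structured - prose) / abs(prose)

# Illustrative numbers only: a small base signal can show a clear positive
# advantage while a far larger base signal shows a slight negative one.
print(pathway_advantage(structured=21.0, prose=19.4))    # ~ +8.2%
print(pathway_advantage(structured=765.0, prose=779.0))  # ~ -1.8%
```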
Pathway independence
DLA and MLP advantages are statistically uncorrelated across models (Pearson r = 0.080, p = 0.56): knowing a model's DLA advantage tells you essentially nothing about its MLP advantage. The two pathways behave as genuinely independent channels, which means DLA-only studies capture at most half the picture.
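The check itself is an ordinary correlation test. A sketch, using stand-in data in place of the study's 54 per-model advantage scores:

```python
# Independence check across models: correlate per-model DLA and MLP
# advantages. The arrays below are stand-in data, not the study's scores.
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
dla_adv = rng.normal(8.4, 10.0, size=54)   # stand-in per-model DLA advantages
mlp_adv = rng.normal(-1.8, 20.0, size=54)  # stand-in per-model MLP advantages

r, p = pearsonr(dla_adv, mlp_adv)
print(f"Pearson r = {r:.3f}, p = {p:.2f}")  # study reports r = 0.080, p = 0.56
```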
Three routing patterns
| Pattern | Count | DLA Direction | MLP Direction | Interpretation |
|---|---|---|---|---|
| Dual-Positive | 16 | + | + | Both pathways favor MESN™ |
| Attention-Dominant | 26 | + | − | Attention favors MESN™, MLPs favor prose |
| MLP-Compensating | 3 | − | + | MLPs compensate for negative attention |
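Since the patterns are defined by the sign of each pathway's advantage, classification reduces to a four-way sign test. The three named patterns cover 45 of the 54 models; the remaining nine presumably fall in the unnamed dual-negative quadrant. A sketch (the treatment of exact zeros is an arbitrary choice here):

```python
def routing_pattern(dla_adv: float, mlp_adv: float) -> str:
    """Classify a model by the sign of each pathway's advantage."""
    if dla_adv >= 0 and mlp_adv >= 0:
        return "Dual-Positive"       # both pathways favor MESN
    if dla_adv >= 0:
        return "Attention-Dominant"  # attention +, MLP -
    if mlp_adv >= 0:
        return "MLP-Compensating"    # attention -, MLP + (the Gemma pattern)
    return "Dual-Negative"           # neither favors MESN (not in the table)

print(routing_pattern(-1.5, 12.1))   # Gemma 4 E4B IT -> "MLP-Compensating"
```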
Dual-positive models
The 16 models where both pathways favor MESN™ show the highest combined signal. Llama 3.1 70B Base leads with +23.2% DLA and +24.5% MLP, for a +47.7% combined advantage. The DS-R1-Distill family is entirely dual-positive (all four models), suggesting that distillation preserves the MLP response. Mistral 7B Base reaches +29.0% combined.
| Model | DLA | MLP | Combined |
|---|---|---|---|
| Llama 3.1 70B Base | +23.2% | +24.5% | +47.7% |
| DS-R1-Distill Qwen 14B | +15.0% | +29.4% | +44.4% |
| DS-R1-Distill Qwen 32B | +18.4% | +15.0% | +33.4% |
| DS-R1-Distill Qwen 7B | +10.5% | +21.9% | +32.4% |
| Mistral 7B Base | +15.1% | +13.9% | +29.0% |
| Qwen 2.5 14B Base | +22.9% | +4.6% | +27.5% |
The Gemma resolution
The MLP-Compensating pattern is exclusively Gemma. Gemma's negative DLA is not a failure to respond — it's a rerouting. Structured notation signal travels through MLP layers instead of attention heads.
| Model | DLA | MLP | Combined |
|---|---|---|---|
| Gemma 4 E4B IT | −1.5% | +12.1% | +10.6% |
| Gemma 2 27B | −5.9% | +4.3% | −1.6% |
| Gemma 4 31B IT | −6.8% | +2.3% | −4.5% |
Gemma 4 E4B IT is “rescued” by MLP: its −1.5% DLA would classify it as a negative responder, but +12.1% MLP lifts the combined signal to +10.6%.
Combined signal reshuffles the leaderboard
Adding MLP attribution changes which models appear to respond most strongly. Models with strong DLA but strongly negative MLP drop dramatically:
- Aya Expanse 32B: +8.1% DLA, −46.1% MLP = −38.0% combined
- Ministral 3 14B: +15.2% DLA, −25.5% MLP = −10.3% combined
A DLA-only study would call these positive responders. The full picture is more complex.
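Because the combined score is simply the sum of the two advantages, re-ranking is mechanical once both pathways are measured. A sketch using the figures quoted above:

```python
# Combined score = DLA advantage + MLP advantage; re-rank once both are known.
models = {
    "Llama 3.1 70B Base": (23.2, 24.5),
    "Aya Expanse 32B":    (8.1, -46.1),
    "Ministral 3 14B":    (15.2, -25.5),
}

leaderboard = sorted(
    ((name, dla + mlp) for name, (dla, mlp) in models.items()),
    key=lambda item: item[1],
    reverse=True,
)
for name, combined in leaderboard:
    print(f"{name}: {combined:+.1f}%")
# Llama 3.1 70B Base: +47.7%
# Ministral 3 14B: -10.3%
# Aya Expanse 32B: -38.0%
```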
Implications
Two-pathway analysis reveals that structured notation doesn't simply “activate attention heads” — it creates architecture-dependent routing patterns that involve the entire forward pass. Future work should always measure both pathways simultaneously.