In 1914, the outbreak of war left Bronisław Malinowski stranded in the Pacific. What was supposed to be a brief field visit turned into years of living among the Trobriand Islanders, eating with them, arguing with them, sitting through the boring parts. He didn't just observe. He participated. And he came back with something no armchair anthropologist had ever produced: an understanding of meaning that could only come from sustained, embedded presence.
For the last two years, I have been conducting a version of that study with AI.
Over 10,000 hours of direct prompting. Millions of lines of transcripts. Between 200 and 300 conversations pushed all the way to context max. Not skimming the surface, not running benchmarks and reading charts, but living inside the conversation, staying with a model through its entire context window and watching what happens as attention degrades, as coherence fragments, as the model starts forgetting what it knew 40,000 tokens ago.
This is not how most AI research gets done.
Outside the Norm
Most AI researchers, and I say this with respect for the work they do, spend their time on the quantitative side. Gradients, weights, loss curves, training data distributions, benchmark scores. They study the model from the outside, looking at what goes in and what comes out, measuring the delta. The interaction itself, the conversation, is usually a means to an end. A way to generate evaluation data. A way to test a hypothesis about some internal mechanism.
I'm not saying that work isn't valuable. It obviously is. But there is a gap. Very few people have spent thousands of hours simply being with these models, paying close attention to how they construct meaning in real-time, noticing where they stumble, where they shine, where they develop strange habits over the course of a long conversation that you'd never see in a 500-token benchmark prompt.
Participant observation requires presence. It requires patience. It requires sitting with the boring parts, the repetitive failures, the conversations that go nowhere, because those are often where the most interesting patterns hide.
What Happens at Context Max, and What I Learned to Do About It
Short conversations with AI are misleading. In a 2,000-token exchange, most models are coherent, helpful, articulate. They sound smart. The problems only emerge when you stay.
For most users, the degradation follows a familiar arc. Around 30,000 tokens, subtle repetition creeps in. The model recycles phrases. Positions that were nuanced early in the conversation harden into oversimplified versions of themselves. By 60,000 tokens, coherence visibly strains. The model contradicts things it said 40,000 tokens ago without noticing. At context max, it's operating on a compressed, lossy version of everything you've discussed, and what it remembers and what it forgets follow the geometry of attention, not the logic of the conversation.
But that's what happens when you just talk to it.
Over hundreds of context-max conversations, I started learning how to work with the attention mechanism instead of against it. I found ways to structure information, to reinforce important concepts, to hold threads together across enormous context windows. I got genuinely good at maintaining coherence at scales where most interactions completely fall apart. This wasn't theory. It was practice, the kind of knowledge that only comes from doing something hundreds of times and paying close attention to what works and what doesn't.
That practice eventually became MESN™, the Metaphori Engine™ Structured Notation, a notation system I created in early 2025 for directly shaping how AI allocates attention. MESN™ didn't emerge from a lab. It emerged from the lived experience of trying to have deep, sustained, coherent conversations with AI across every major model architecture, and refusing to accept the conventional wisdom that long contexts simply degrade.
The insight was simple but had deep implications: if I could manually maintain coherence through how I structured my input, then the structure of information was doing real work inside the model. It wasn't just about what you said. It was about how the tokens arranged themselves in the attention geometry. And if a human could learn to do this intuitively through practice, then it could be formalized, measured, and systematized.
Human Memory Is Not What We Think It Is
One of the most important things I learned from this study is how poorly we understand our own memory, and how that misunderstanding shapes our expectations of AI.
We tend to think of memory as storage. You experience something, it goes into the filing cabinet, and later you retrieve it. But that's not how human memory works at all. Memory is not monolithic. There is memory centered around factual recall, like dates and names, and memory that operates more like cognitive shapes, impressions that organize experience without being explicitly retrievable. You might not remember what someone said at dinner, but you remember how it made you feel, and that feeling shapes every subsequent interaction with that person.
Memory can be malleable. It can decay. And like “Rosebud” in Citizen Kane, small and distant memories can be foundational and transformative. A seemingly minor experience from childhood can quietly organize decades of behavior. The memory itself might be fragmented, almost unrecognizable, but its influence on the system is enormous.
Humans have another trait that is easy to take for granted: we can think of a memory and know its relevance, its relationship to the current moment, its meaning, almost instantly. We automatically scale a memory's influence in the present context. We amplify what matters and suppress what doesn't. We do this so effortlessly that we don't even notice we're doing it. And when we feel that someone doesn't quite understand us, we have built-in correction mechanisms. We rephrase. We add detail. We read the other person's face and adjust.
AI has none of this.
What AI Actually Does With Memory
AI doesn't retrieve memories. It processes tokens through attention. Every token in the context window participates in a high-dimensional geometry that determines what the model pays attention to and how much weight each piece of information gets. There is no separate step where the model evaluates relevance. There is no mechanism for saying “this thing from 50,000 tokens ago is more important than this thing from 500 tokens ago.” The attention mechanism makes those determinations implicitly, based on the geometric relationships between token embeddings.
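To make that concrete, here is a minimal sketch of scaled dot-product attention in NumPy. It is an illustration of the mechanism described above, not any particular model's implementation: the weight each token receives falls out of the dot-product geometry of query and key vectors, and nothing in the computation ever asks whether a token is semantically important.

```python
import numpy as np

def attention_weights(queries: np.ndarray, keys: np.ndarray) -> np.ndarray:
    """Attention weights for one head; queries and keys are (seq_len, d)."""
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)        # pairwise geometric similarity
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)  # softmax over the context

# Toy context: 4 tokens projected into an 8-dimensional space.
rng = np.random.default_rng(0)
q = rng.normal(size=(4, 8))
k = rng.normal(size=(4, 8))
print(attention_weights(q, k))  # each row sums to 1; no row encodes "importance"
```

Relevance, to the extent it exists at all, has to be implicit in where those vectors happen to sit relative to one another.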
This creates a fundamental problem. With no easy way to judge the relevance of any particular phrase, the model treats the context window as a flat landscape where proximity and token-level similarity matter more than semantic importance. A throwaway comment from early in the conversation occupies the same representational space as a critical decision made later. The model can't know which is which.
And when humans try to correct misunderstandings with AI, they do what they'd do with another human: they add more words, more specificity, more context. But with AI, this often makes things worse. Every additional token introduces more noise into the attention geometry. More words don't mean more clarity. They mean more pairwise attention computations, more opportunities for the signal to get diluted.
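A toy calculation makes the dilution visible. This is my own illustrative framing, not a measurement from any model: hold the similarity of one "signal" key fixed, keep adding moderately similar distractor keys, and watch the softmax mass left for the signal shrink.

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

signal_score = 3.0      # similarity between the query and the key that actually matters
distractor_score = 1.0  # similarity to each additional, less relevant token

for n_distractors in (5, 50, 500):
    scores = np.concatenate(([signal_score], np.full(n_distractors, distractor_score)))
    weight_on_signal = softmax(scores)[0]
    print(f"{n_distractors:4d} extra tokens -> weight on signal: {weight_on_signal:.3f}")

# Trend: roughly 0.60, 0.13, 0.01 -- the same signal gets a smaller share
# of attention as the context around it grows.
```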
This is the opposite of how it works in human conversation, and it's one of the most counterintuitive findings from thousands of hours of observation.
What the Patterns Told Us
The ethnographic approach didn't just reveal problems. It revealed structure. I started seeing recurring failure modes that weren't model-specific but architecture-specific. Certain types of coherence loss that appeared identically across GPT, Claude, Gemini, Llama, and every other model I tested. This told me something important: the patterns weren't bugs in any particular model's training. They were features of how transformer attention works.
MESN™ was the practical tool born from that observation, a notation for working with attention rather than against it. But it also opened a scientific question: if structured input could measurably change how models process information, what exactly was happening inside the model? What were the attention heads doing differently? Could we prove it?
That question led to the Metaphori Engine, and eventually to our DLA studies across 43 models, where we could finally measure what I'd been observing intuitively for years. The structured advantage wasn't a placebo. It was a geometric phenomenon, operating at the level of residual stream trajectories and attention head specialization. The 344-out-of-344 positive family-direction checks didn't surprise me. They confirmed what thousands of hours of participant observation had already made clear: the structure of information matters as much as its content.
The Case for AI Ethnography
The AI research community has powerful tools. Mechanistic interpretability, DLA analysis, probing classifiers, activation patching. These tools let you look inside the model with remarkable precision. But they all share a limitation: they answer specific questions about specific mechanisms. They don't tell you what questions to ask.
That's where ethnography comes in. Living with a system, observing its behavior in naturalistic conditions over extended periods, is how you discover the questions that matter. You don't start with a hypothesis about attention head specialization. You start by noticing, after your 150th context-max conversation, that the model always loses track of a specific type of information first. Then you form the hypothesis. Then you go measure.
Malinowski didn't know what he was going to find in the Trobriand Islands. He just knew he needed to be there. The act of sustained, embedded observation generated insights that no questionnaire or survey could have produced.
AI deserves the same approach. Not because models are people, but because complex systems reveal their nature through behavior over time, not through snapshots. The black box is never going to open from the outside. You have to live in it. I did, and the result was a language for talking to the geometry itself.
This post describes the research methodology behind Metaphori's work on attention engineering and the Metaphori Engine. For technical details on our findings, see our DLA Studies and Metaphori Engine documentation.