Beyond the Basics · 17 Min Read

Deep Dive

The crash course gets you started. This gets you fluent. Five articles that explain what's happening under the hood.

In This Section

  01 — How Large Language Models Work
  02 — Tokens, Context Windows & Why They Matter
  03 — Inside the Transformer: How LLMs Think
  04 — What Is RAG and Why Should You Care?
  05 — Agents, Automation & What's Coming Next
01

How Large Language Models Work

A Mental Model

When you type a message to Claude or ChatGPT, something specific is happening on the other end. It's not a search engine. It's not a database lookup. And it's not a human. Understanding what it is changes how you use it, and why it sometimes gets things wrong (hallucinates).

It Started with Prediction

At the core, a large language model is trained to do one thing: predict the next word. Or more precisely, the next token — a chunk of text that might be a word, part of a word, or a punctuation mark. Given "The capital of France is," the model predicts "Paris" because that sequence appeared countless times in its training data.

That sounds trivial. But when you train a model on hundreds of billions of words — books, websites, academic papers, code, conversations — the model has to develop an extraordinarily sophisticated internal understanding of language, facts, reasoning patterns, and context in order to predict well. The prediction task forces the model to "understand" the material it's predicting from.
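The core move can be sketched with a toy distribution (the numbers are invented for illustration, not real model output): the model scores every candidate next token, and the highest-scoring one is a natural continuation.

```python
# Toy next-token distribution for the prefix "The capital of France is".
# A real model assigns a probability to every token in its vocabulary.
next_token_probs = {
    "Paris": 0.87,       # seen after this prefix countless times in training
    "Lyon": 0.04,
    "beautiful": 0.03,
    "the": 0.01,
}

def most_likely(probs):
    """Pick the highest-probability candidate (greedy decoding)."""
    return max(probs, key=probs.get)

print(most_likely(next_token_probs))  # Paris
```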

What Training Means

Training a language model involves feeding it enormous amounts of text and having it repeatedly guess the next token. Every time it gets it wrong, the model's internal parameters — billions of numerical weights — are adjusted slightly. Over trillions of these adjustments, the model gets better and better at prediction.

Those weights are the model. They're a compressed, mathematical representation of everything the model "learned" from the training data. When you ask Claude a question, you're not querying a database, you're activating a complex web of numerical relationships that, together, produce a probable next sequence of tokens that constitutes a useful answer.

A Note on Masking

During training, the model never gets to "look ahead" at the answer. Future tokens are masked — hidden — so the model must predict each token based only on what came before. This is called causal masking, and it's what forces the model to learn the structure of language rather than memorize sequences. Without masking, the training task would be trivial and the model would learn nothing useful.
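A minimal sketch of what causal masking looks like as a matrix: each row is a position in the sequence, and `True` marks the positions it is allowed to see.

```python
import numpy as np

def causal_mask(n):
    """Lower-triangular n x n mask: position i may attend only to
    positions 0..i. True = visible, False = masked (the future)."""
    return np.tril(np.ones((n, n), dtype=bool))

m = causal_mask(4)
# Row 0 sees only token 0; row 3 (the last token) sees all four.
```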

The Key Insight

LLMs don't retrieve facts — they generate text that is statistically likely to be correct based on patterns in training data. This is why they're so fluent, and why they can confidently be wrong. They're always producing plausible-sounding output. Whether it's accurate is a separate question.


Then Came Instruction Tuning

A raw pre-trained model is a powerful text predictor, but it won't helpfully answer your questions — it'll just continue your text as if it were a document. The models you use today have been through further phases of fine-tuning: instruction tuning, then RLHF (Reinforcement Learning from Human Feedback), in which human raters score outputs and the model is tuned to produce responses that are helpful, honest, and safe.

This is why Claude and ChatGPT feel like assistants rather than autocomplete engines. The underlying mechanism is still prediction — but the model has been shaped to predict responses that a helpful, knowledgeable assistant would give.

A Technical Nuance

Instruction tuning and RLHF are technically sequential steps, not the same thing. Instruction tuning happens first — it gives the model a baseline of assistant-like behavior by training it on examples of good responses. RLHF comes second, using human preference ratings to polish the model's personality and sharpen its safety guardrails. Both phases are often grouped under "fine-tuning," but they do different jobs.

What This Means for How You Use It

This mental model has practical implications:

  • Confidence ≠ accuracy. The model generates fluent text. Fluency and correctness are not the same thing. Always verify factual claims on anything that matters.
  • Context shapes output. The model is predicting what a good response looks like given everything in the conversation. More context = better prediction = better output.
  • It doesn't "know" things the way you do. Its knowledge is frozen at its training cutoff. It has no access to your company's internal data, recent news, or anything not in its training set — unless you explicitly provide it, or you use a model with built-in web search.

02

Tokens, Context Windows & Why They Matter

The Invisible Limits

Every time you interact with an AI, two invisible constraints are at work: the size of a token and the size of the context window. Understanding both makes you a significantly more effective user.

What Is a Token?

Models don't process text word by word — they process it in chunks called tokens. A token is roughly 3–4 characters of English text, which works out to about ¾ of a word on average. "Unbelievable" might be two tokens. "AI" is one. "GPT-4" is three.

Why does this matter? Because AI pricing, speed, and limits are all measured in tokens — not words or characters. When a model has a "200,000 token context window," that's roughly 150,000 words, or about 600 pages of text. When you're billed for API usage, you're billed per token in and out.

For most everyday users, tokens are invisible. But if you're doing large-scale document analysis or building on top of AI APIs, they become very relevant very fast.
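The 3–4 characters-per-token rule of thumb above can be turned into a quick cost estimator. This is a ballpark sketch only: real tokenizers (e.g. tiktoken for GPT models) give exact counts, and the price per million tokens used here is a placeholder, not any provider's actual rate.

```python
def estimate_tokens(text):
    """Ballpark count using the ~4 characters-per-token rule of thumb."""
    return max(1, round(len(text) / 4))

def estimate_cost_usd(text, usd_per_million_tokens=3.0):
    """Input-side cost estimate. The default price is a placeholder."""
    return estimate_tokens(text) / 1_000_000 * usd_per_million_tokens

print(estimate_tokens("A 200,000 token context window is roughly 600 pages."))
```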

What Is a Context Window?

The context window is the amount of text a model can "see" at once — your full conversation, any documents you've uploaded, the system instructions, and the response it's generating. Think of it as the model's working memory.

Everything inside the context window is active and available. Everything outside it simply doesn't exist to the model. This has a counterintuitive implication: in a very long conversation, early messages may eventually get pushed out of context — and the model will genuinely have no memory of them.

Practical Implications

  • For long conversations: If a chat session drags on for hours, the model may start to "forget" early context. Starting a fresh conversation with a summary is sometimes more effective than continuing indefinitely.
  • For quality: Research suggests that models pay more attention to content at the beginning and end of a long context than in the middle. If you're feeding in a long document and want the model to focus on a specific section, consider putting that section first.

03

Inside the Transformer: How LLMs Think

The Real Explanation

The word "transformer" gets thrown around constantly. It's in the name GPT (Generative Pre-trained Transformer). It's the architecture underneath Claude, Gemini, LLaMA, and every major LLM. But what is it? This article explains the mechanics — the attention mechanism, activation functions, backpropagation, temperature — in plain language.

The Paper That Started Everything

In 2017, eight researchers at Google published a paper titled "Attention Is All You Need." It introduced the transformer architecture and quietly made everything that came after it possible — GPT, BERT, Claude, Gemini, all of it.

Before transformers, the dominant approach to language modeling used recurrent neural networks (RNNs), which processed text sequentially — one word at a time, left to right. This worked, but it was slow, struggled with long sequences, and had trouble connecting ideas that were far apart in a document. Transformers threw out the sequential constraint entirely. The key insight: instead of reading word by word, the model should be able to look at every word in a sequence simultaneously and dynamically decide which words are most relevant to each other. They called this mechanism attention.

Why the Title Matters

"Attention Is All You Need" was a deliberate provocation. Prior architectures combined attention with recurrence and convolution. The paper's claim was that attention alone — no recurrence, no convolution — was sufficient to build a state-of-the-art language model. They were right.

Step 1: Turning Words Into Numbers (Embeddings)

Before any attention can happen, the model needs to convert text into something it can compute with. Text gets broken into tokens (see Article 02), and each token is mapped to a high-dimensional vector — a list of hundreds or thousands of numbers. This is called an embedding.

These numbers aren't arbitrary. Through training, tokens that appear in similar contexts end up with similar embeddings. "Dog" and "puppy" end up close together in the embedding space. "Dog" and "democracy" end up far apart. The geometry of this space encodes meaning — which is why the model can reason about analogies, synonyms, and semantic relationships.

The transformer also adds positional encodings — another set of numbers layered on top of the embeddings to tell the model where each token sits in the sequence. Without this, the model would have no sense of word order.
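The sinusoidal positional encodings from the original transformer paper are simple enough to sketch: each position gets a unique pattern of sines and cosines that can be added to the token embeddings so the model can recover word order.

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encodings ("Attention Is All You Need").
    Even dimensions get sine waves, odd dimensions get cosines, each
    pair at a different frequency. Assumes d_model is even."""
    positions = np.arange(seq_len)[:, None]     # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]    # (1, d_model / 2)
    angles = positions / (10000 ** (dims / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = positional_encoding(seq_len=16, d_model=8)
# Every row (position) is distinct, so word order is recoverable.
```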

Interactive: The Geometry of Words

The tokens in your prompt each live at a coordinate in a high-dimensional space. A 3D projection of that space for a small vocabulary of animals, colors, and foods shows related tokens clustering together, with each token's nearest neighbors measured by cosine similarity.

Cosine Similarity

We measure "closeness" by the angle between two vectors, not their Euclidean distance. Two tokens pointing the same direction score near 1.00. Perpendicular = 0. Opposite = −1.
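The measure itself is a one-liner:

```python
import numpy as np

def cosine_similarity(a, b):
    """Closeness as an angle: 1 = same direction, 0 = perpendicular,
    -1 = opposite. Vector length is ignored."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity([1, 0], [1, 0]))   # 1.0
print(cosine_similarity([1, 0], [0, 1]))   # 0.0
print(cosine_similarity([1, 0], [-1, 0]))  # -1.0
```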

Interactive: Arithmetic on Meaning

The result that made everyone care about embeddings: take the vector for "king," subtract "man," add "woman" — and the nearest token is "queen." The direction from man→woman is the same direction as king→queen. Meaning is encoded in geometry.

One Caveat

Analogies work cleanly in toy demos and messily in the wild. Real embedding spaces inside frontier models are richer, more entangled, and harder to interpret — but the core geometric intuition holds.
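The arithmetic can be demonstrated with hand-made toy vectors. The three dimensions here are invented for illustration (loosely: royalty, masculinity, femininity); real embeddings are learned, not hand-written, and have hundreds or thousands of dimensions.

```python
import numpy as np

vocab = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.1, 0.8]),
    "man":   np.array([0.1, 0.8, 0.1]),
    "woman": np.array([0.1, 0.1, 0.8]),
}

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def nearest(vec, vocab, exclude=()):
    """Closest vocabulary token by cosine similarity."""
    return max((w for w in vocab if w not in exclude),
               key=lambda w: cos(vec, vocab[w]))

# king - man + woman lands on queen's coordinates.
target = vocab["king"] - vocab["man"] + vocab["woman"]
print(nearest(target, vocab, exclude={"king", "man", "woman"}))  # queen
```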

Step 2: The Attention Mechanism

This is the core of what makes transformers work. The idea: when processing any given token, the model should be able to "pay attention" to other tokens in the sequence — and the amount of attention should be determined dynamically based on the content, not hardcoded by position.

Here's how it works. For each token, the model computes three vectors:

  • Query (Q): "What am I looking for?" — a representation of what this token needs from the others.
  • Key (K): "What do I offer?" — a representation of what each token contains.
  • Value (V): "What should I pass on?" — the content that gets weighted and summed.

The model computes a similarity score between a token's Query and every other token's Key. High similarity = high attention weight. These weights are then used to create a weighted sum of the Values. The result: each token's representation gets updated by pulling in information from the tokens most relevant to it.
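The computation just described is scaled dot-product attention, and it fits in a few lines of NumPy (toy sizes, random vectors):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V.
    Each output row is a weighted mix of Value vectors, weighted by
    how well that token's Query matches every token's Key."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)       # Query-Key similarity
    weights = softmax(scores, axis=-1)    # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(3, 4)) for _ in range(3))
out, w = attention(Q, K, V)
# w[i, j] is how much token i attends to token j.
```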

Analogy

Imagine reading the sentence "The trophy didn't fit in the suitcase because it was too big." When your brain processes "it," it automatically looks back and decides "it" refers to the trophy, not the suitcase. That's exactly what the attention mechanism does — it figures out which other tokens are relevant to resolving the meaning of the current one.

Multi-Head Attention

Rather than running attention once, transformers run it in parallel across many "heads" — typically 12, 16, or more in large models. Each head learns to pay attention to different kinds of relationships. One head might learn syntactic structure (subject-verb agreement). Another might track coreferences (what "it" refers to). Another might pick up on semantic relationships (synonyms, opposites).

The outputs from all heads are concatenated and projected back into a single representation. The result is a much richer understanding of context than any single attention pass could produce.

Step 3: Feed-Forward Layers & Activation Functions

After the attention mechanism updates each token's representation, that representation passes through a feed-forward neural network — a simpler layer applied independently to each token position. This is where a large portion of the model's "knowledge" is thought to be stored: the pattern associations learned during training.

Inside these layers, an activation function introduces non-linearity — meaning the model can learn complex, curved patterns rather than just straight-line relationships. Without activation functions, stacking layers wouldn't add any expressive power; the whole network would collapse to a single linear transformation.

The most common activation functions in modern LLMs:

  • ReLU (Rectified Linear Unit): The classic. Outputs zero for any negative input, and the value itself for positive inputs. Simple, fast, effective.
  • GELU (Gaussian Error Linear Unit): A smoother version of ReLU. Most modern LLMs — including GPT and Claude — use GELU or variants of it because the smoothness helps with gradient flow during training.
  • SwiGLU: Used in newer architectures like LLaMA. A gated variant that has empirically shown better performance on large models.
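ReLU and the widely used tanh approximation of GELU are short enough to sketch side by side:

```python
import numpy as np

def relu(x):
    """Rectified Linear Unit: clamps negative inputs to zero."""
    return np.maximum(0.0, x)

def gelu(x):
    """Tanh approximation of GELU: smooth near zero, so small negative
    inputs leak through slightly instead of being cut off."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

xs = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(xs))   # zero for the negative inputs, identity for the rest
print(gelu(xs))   # note the small negative outputs near zero
```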

Step 4: How the Model Learns — Backpropagation

Training a transformer means adjusting billions of numerical parameters (the weights in the attention and feed-forward layers) so that the model gets better at predicting the next token. The mechanism that makes this work is backpropagation.

Here's the loop:

  1. Forward pass. The model takes an input sequence and predicts the next token. It produces a probability distribution over the entire vocabulary — e.g., "Paris" has 40% probability, "London" has 12%, "Berlin" has 8%, etc.
  2. Loss calculation. The model's prediction is compared to the actual next token in the training data. A loss function (typically cross-entropy) measures how wrong the prediction was. A confident wrong answer produces a high loss; a confident correct answer produces a low one.
  3. Backward pass. The loss is propagated backwards through the network layer by layer, computing the gradient — the direction and magnitude of how each parameter should change to reduce the loss.
  4. Weight update. An optimizer (typically Adam) nudges each parameter slightly in the direction that reduces the loss. Repeat this trillions of times across the training corpus and the model converges toward useful predictions.

The remarkable thing is that no one programs the model to "know" facts or grammar. All of that emerges from this simple training loop applied at scale. The structure of knowledge is an emergent property of learning to predict text well.
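The four-step loop can be sketched end to end on a toy one-layer "model", with plain SGD rather than Adam and hand-written gradients rather than autodiff, to keep the mechanics visible:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# Toy "model": one weight matrix mapping a context token id to logits
# over a 4-token vocabulary. The single training pair teaches it that
# token 0 (think "France is") is followed by token 1 (think "Paris").
rng = np.random.default_rng(0)
vocab_size = 4
W = rng.normal(scale=0.1, size=(vocab_size, vocab_size))  # the parameters

context, target = 0, 1
lr = 0.5
for step in range(100):
    logits = W[context]                  # 1. forward pass
    probs = softmax(logits)
    loss = -np.log(probs[target])        # 2. cross-entropy loss
    grad = probs.copy()                  # 3. backward pass:
    grad[target] -= 1.0                  #    d(loss)/d(logits) = probs - one_hot
    W[context] = W[context] - lr * grad  # 4. weight update (plain SGD)

# After training, token 1 is by far the most likely prediction after token 0.
```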

Scale is the secret ingredient

The transformer architecture was published in 2017. But ChatGPT didn't launch until 2022. The gap isn't explained by new architectural breakthroughs — it's mostly explained by scale: more parameters, more training data, more compute. GPT-3 has 175 billion parameters and was trained on roughly 300 billion tokens. That scale is what pushed LLMs from "interesting research" to "genuinely useful."

Temperature: Controlling Randomness at Inference

When the model generates a response, it doesn't just always pick the highest-probability next token. If it did, responses would be deterministic and repetitive. Instead, a parameter called temperature controls how the model samples from its probability distribution.

  • Low temperature (close to 0): The model becomes more deterministic, almost always picking the highest-probability token. Outputs are focused, consistent, and conservative. Good for factual tasks, code generation, and structured outputs where you want precision over creativity.
  • High temperature (closer to 1 or above): The probability distribution is flattened — lower-probability tokens get more of a chance. Outputs become more varied, surprising, and creative. Good for brainstorming, creative writing, and generating diverse options. Also increases the chance of incoherence at extremes.
  • Temperature = 1: The model samples proportionally to the raw probabilities — no adjustment. This is typically the default for general use.
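Mechanically, temperature is just a division applied to the logits before the softmax; a minimal sketch:

```python
import numpy as np

def sample_with_temperature(logits, temperature, rng):
    """Divide logits by temperature before the softmax. Low T sharpens
    the distribution toward the top token; high T flattens it."""
    z = np.asarray(logits, dtype=float) / max(temperature, 1e-8)
    z -= z.max()                        # numerical stability
    p = np.exp(z) / np.exp(z).sum()
    return int(rng.choice(len(p), p=p)), p

rng = np.random.default_rng(0)
logits = [2.0, 1.0, 0.1]                # e.g. "Paris", "London", "Berlin"
_, cold = sample_with_temperature(logits, 0.1, rng)
_, hot = sample_with_temperature(logits, 2.0, rng)
# cold is nearly one-hot on the top token; hot spreads probability out.
```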

If you've ever noticed that asking Claude the same question twice produces slightly different answers, temperature is why. You're not getting a cached lookup — you're getting a fresh sample from the model's probability distribution each time.

Most consumer-facing tools don't expose temperature as a setting, but API users can set it directly. It's one of the most practically useful parameters to understand if you're building with LLMs.

Why This Matters for How You Use These Tools

You don't need to understand transformer internals to use AI well. But knowing them changes a few things:

  • Hallucinations make sense. The model is always generating the most statistically plausible next token — not retrieving facts from a database. Plausible and true are not the same thing. Now you know why.
  • Context window limits make sense. The attention mechanism operates over the full context window — every token attends to every other token. This is computationally expensive (quadratic in sequence length), which is why context windows are a real constraint, not an arbitrary one.
  • Consistency and creativity are a dial. If you need the model to be precise and consistent, ask for it (or use a low temperature via API). If you want divergent ideas, ask for 10 options instead of one — you're sampling from a distribution, so diversity is achievable.
  • More context = better attention. Because attention is dynamic and content-based, giving the model rich context genuinely helps. It's not just good practice — it's architecturally why better prompts get better results.

04

What Is RAG and Why Should You Care?

The Architecture Behind AI That Knows Your Business

You've probably heard "RAG" thrown around in AI conversations. It sounds technical — and it is, under the hood — but the core idea is simple and important, especially if you're thinking about how AI could work with your company's internal information.

The Problem RAG Solves

A standard language model is trained once, then deployed. Its knowledge is frozen at the training cutoff. It knows nothing about your company's internal documents, your proprietary processes, your customer data, or anything that happened after it was trained.

You could retrain the model on your data — but that's enormously expensive and slow. You could fine-tune it — but fine-tuning teaches behavior patterns, not specific facts. Neither approach is practical for most organizations that want AI to "know" their internal knowledge base.

RAG — Retrieval-Augmented Generation — is the practical solution.

How RAG Works

Instead of baking knowledge into the model, RAG keeps knowledge in an external database and retrieves it at query time. When you ask a question, the system:

  1. Searches a vector database of your documents for content relevant to your question.
  2. Retrieves the most relevant chunks of text.
  3. Injects that retrieved content into the model's context window along with your question.
  4. The model generates a response grounded in that retrieved content — and can cite it.

The model's job is no longer to remember facts — it's to reason over facts you hand it in real time. This makes the output far more accurate and verifiable for domain-specific questions.
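A minimal sketch of the retrieve-then-generate flow. Naive keyword overlap stands in for a real vector database here, and `call_llm` is a hypothetical stand-in for any chat-completion API:

```python
import re

# Toy knowledge base; a real system would chunk and embed documents
# into a vector database and search by semantic similarity.
DOCUMENTS = [
    "Refund policy: customers may return items within 30 days.",
    "Shipping: standard delivery takes 3-5 business days.",
    "Support hours: weekdays 9am-5pm Eastern.",
]

def words(text):
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(question, docs, top_k=1):
    """Rank docs by keyword overlap with the question (toy retrieval)."""
    return sorted(docs, key=lambda d: -len(words(question) & words(d)))[:top_k]

def build_prompt(question, docs):
    context = "\n".join(retrieve(question, docs))
    return (f"Answer using only the context below.\n\n"
            f"Context:\n{context}\n\n"
            f"Question: {question}")

prompt = build_prompt("How many days do I have to return items?", DOCUMENTS)
# response = call_llm(prompt)   # hypothetical chat-completion call
```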

Analogy

Think of the model as a brilliant analyst and the RAG system as a research assistant who pulls the right documents before the analyst speaks. The analyst doesn't need to have memorized everything — they just need to reason well over what's in front of them.

Why This Matters for Non-Engineers

You're unlikely to build a RAG system yourself. But you'll increasingly encounter enterprise AI tools — internal chatbots, knowledge base assistants, document search tools — that are built on this architecture. Understanding RAG helps you:

  • Evaluate vendor claims. "Our AI knows your company's data" almost always means RAG. Now you can ask smart questions about data freshness, retrieval quality, and citation accuracy.
  • Understand output quality. RAG systems are only as good as their underlying document store. If the knowledge base is outdated or poorly organized, the AI answers will reflect that.
  • Do it yourself, simply. NotebookLM is essentially a no-code RAG tool. You upload documents; it retrieves and reasons over them. That's RAG at your fingertips today.

05

Agents, Automation & What's Coming Next

From Assistant to Actor — The Shift Already Underway

So far, every AI tool we've discussed has one thing in common: you give it a prompt, it gives you a response, you decide what to do next. The model is reactive. You're the one taking action in the world.

That's changing. The next major shift in how AI is used — already underway — is the move from AI as a responder to AI as an actor.

What Is an AI Agent?

An AI agent is a model that can take a sequence of actions to accomplish a goal — not just produce a response. It can use tools: browse the web, run code, search databases, send emails, fill out forms, interact with software. You give it a goal; it figures out the steps and executes them.

The difference from a chat model is autonomy and action. A chat model tells you how to search for flights. An agent books the flight.
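A goal-driven loop is the core of every agent framework: the model proposes an action, a harness executes it, and the observation is fed back in until the model declares it is done. This sketch uses a scripted stand-in for the model and an invented tool name, so it shows the shape of the loop rather than any real framework's API:

```python
def run_agent(goal, call_llm, tools, max_steps=10):
    """Propose-act-observe loop. `call_llm` returns the next action as a
    dict; the loop runs the named tool and feeds the observation back."""
    history = [f"Goal: {goal}"]
    for _ in range(max_steps):
        action = call_llm(history)               # model picks the next step
        if action["tool"] == "finish":
            return action["result"]
        observation = tools[action["tool"]](**action["args"])
        history.append(f"{action['tool']} -> {observation}")
    return None  # safety valve: stop runaway loops

# Scripted stand-in for the model, plus one fake tool:
script = iter([
    {"tool": "search_flights", "args": {"route": "SFO-JFK"}},
    {"tool": "finish", "result": "Cheapest option: $212 nonstop"},
])
tools = {"search_flights": lambda route: f"3 flights found on {route}"}
result = run_agent("find a cheap SFO-JFK flight", lambda h: next(script), tools)
```

Real agent harnesses add structured tool schemas, error handling, and human approval steps on top of this loop.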

What Agentic Workflows Look Like Today

You're already seeing early versions of this:

  • Manus and similar fully autonomous agents can take a high-level goal — research a market, produce a report, book a meeting — and execute dozens of steps across the web and your tools without any hand-holding.

What This Means for Knowledge Workers

The honest answer is: a lot, over the next few years. Tasks that currently require a human to orchestrate a sequence of decisions and actions — research synthesis, report generation, scheduling, data processing, operational monitoring — are increasingly automatable with agents.

The ratio of time spent on routine orchestration vs. judgment and strategy shifts dramatically. The professionals who adapt fastest are the ones who start thinking now about which parts of their work are mechanical sequences and which parts genuinely require human judgment, relationships, and accountability.

The Practical Move Right Now

You don't need to understand agentic frameworks to benefit from this shift. Start by identifying one multi-step, recurring task in your work. Think through which steps are mechanical and which require your judgment. That mapping exercise is the foundation of every AI workflow — and the skill that will matter most as agents become more capable.

The Limits Worth Knowing

Agents are powerful and also brittle. They fail in ways that are harder to catch than a single bad response — a chain of plausible-looking steps can lead somewhere wrong. The current best practice is "human in the loop": agents handle the mechanical steps, humans review and approve at key decision points. That's not a temporary workaround; it's a thoughtful design principle for consequential workflows.

Test Your Understanding

Quick Check

Six questions on the concepts from the Deep Dive. Harder than the Crash Course — the explanations do the teaching.

Put It to Work — Browse Prompts →

Confused by a term? See the Glossary →