Beyond the Basics · 17 Min Read
The crash course gets you started. This gets you fluent. Five articles that explain what's happening under the hood.
In This Section
A Mental Model
When you type a message to Claude or ChatGPT, something specific is happening on the other end. It's not a search engine. It's not a database lookup. And it's not a human. Understanding what it is changes how you use it and explains why it sometimes gets things wrong (hallucinates).
At the core, a large language model is trained to do one thing: predict the next word. Or more precisely, the next token — a chunk of text that might be a word, part of a word, or a punctuation mark. Given "The capital of France is," the model predicts "Paris" because that sequence appeared countless times in its training data.
That sounds trivial. But when you train a model on hundreds of billions of words — books, websites, academic papers, code, conversations — the model has to develop an extraordinarily sophisticated internal understanding of language, facts, reasoning patterns, and context in order to predict well. The prediction task forces the model to "understand" the material it's predicting from.
Training a language model involves feeding it enormous amounts of text and having it repeatedly guess the next token. Every time it gets it wrong, the model's internal parameters — billions of numerical weights — are adjusted slightly. Over trillions of these adjustments, the model gets better and better at prediction.
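To make that loop concrete, here's a toy sketch in Python: a bigram model that "trains" by counting which token follows which. Real models learn billions of continuous weights by gradient descent rather than keeping counts, but the task is the same: score every candidate next token given what came before.

```python
# A minimal sketch of next-token prediction with a toy bigram model.
# Real LLMs use billions of learned weights, not counts, but the task
# is identical: given context, score every possible next token.
from collections import Counter, defaultdict

corpus = "the capital of france is paris . the capital of italy is rome .".split()

# "Training": count which token follows which.
following = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    following[prev][nxt] += 1

def predict_next(token):
    counts = following[token]
    total = sum(counts.values())
    # Turn raw counts into a probability distribution over next tokens.
    return {tok: n / total for tok, n in counts.items()}

print(predict_next("is"))  # {'paris': 0.5, 'rome': 0.5}
```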
Those weights are the model. They're a compressed, mathematical representation of everything the model "learned" from the training data. When you ask Claude a question, you're not querying a database; you're activating a complex web of numerical relationships that, together, produce a probable next sequence of tokens that constitutes a useful answer.
A Note on Masking
During training, the model never gets to "look ahead" at the answer. Future tokens are masked — hidden — so the model must predict each token based only on what came before. This is called causal masking, and it's what forces the model to learn the structure of language rather than memorize sequences. Without masking, the training task would be trivial and the model would learn nothing useful.
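Here's a quick sketch of what that mask looks like, using NumPy: a lower-triangular matrix where row i marks which positions token i is allowed to attend to.

```python
# A minimal sketch of a causal mask for a 5-token sequence.
import numpy as np

seq_len = 5
# Lower-triangular matrix: row i may attend to columns 0..i only.
mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))
print(mask.astype(int))
# [[1 0 0 0 0]
#  [1 1 0 0 0]
#  [1 1 1 0 0]   <- the third token sees tokens 0-2 and nothing after
#  [1 1 1 1 0]
#  [1 1 1 1 1]]
```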
The Key Insight
LLMs don't retrieve facts — they generate text that is statistically likely to be correct based on patterns in training data. This is why they're so fluent, and why they can confidently be wrong. They're always producing plausible-sounding output. Whether it's accurate is a separate question.
Interactive Playground
See it in action. Pick a prompt and step through the loop — one token at a time.
A raw pre-trained model is a powerful text predictor, but it won't helpfully answer your questions — it'll just continue your text as if it were a document. The models you use today have been through a second training phase: instruction tuning followed by RLHF (Reinforcement Learning from Human Feedback), in which human raters score outputs and the model is fine-tuned to produce responses that are helpful, honest, and safe.
This is why Claude and ChatGPT feel like assistants rather than autocomplete engines. The underlying mechanism is still prediction — but the model has been shaped to predict responses that a helpful, knowledgeable assistant would give.
A Technical Nuance
Instruction tuning and RLHF are technically sequential steps, not the same thing. Instruction tuning happens first — it gives the model a baseline of assistant-like behavior by training it on examples of good responses. RLHF comes second, using human preference ratings to polish the model's personality and sharpen its safety guardrails. Both phases are often grouped under "fine-tuning," but they do different jobs.
This mental model has practical implications: because the model generates statistically likely text rather than retrieving stored facts, fluency is no guarantee of accuracy, important claims need independent verification, and the context you provide directly shapes what gets predicted.
The Invisible Limits
Every time you interact with an AI, two invisible constraints are at work: the size of a token and the size of the context window. Understanding both makes you a significantly more effective user.
Models don't process text word by word — they process it in chunks called tokens. A token is roughly 3–4 characters of English text, which works out to about ¾ of a word on average. "Unbelievable" might be two tokens. "AI" is one. "GPT-4" is three.
Why does this matter? Because AI pricing, speed, and limits are all measured in tokens — not words or characters. When a model has a "200,000 token context window," that's roughly 150,000 words, or about 600 pages of text. When you're billed for API usage, you're billed per token in and out.
For most everyday users, tokens are invisible. But if you're doing large-scale document analysis or building on top of AI APIs, they become very relevant very fast.
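If you want to see real token counts, OpenAI publishes its tokenizer as the tiktoken library; other vendors tokenize differently, so treat any one tokenizer's count as an estimate for other models. A minimal sketch:

```python
# Counting tokens with OpenAI's tiktoken library, as one example.
# Other vendors use different tokenizers, so treat counts as estimates.
import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("cl100k_base")
tokens = enc.encode("Unbelievable! GPT-4 is here.")
print(len(tokens), tokens[:5])

# A no-dependency fallback: roughly 4 characters per token of English.
def estimate_tokens(text: str) -> int:
    return max(1, len(text) // 4)
```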
The context window is the amount of text a model can "see" at once — your full conversation, any documents you've uploaded, the system instructions, and the response it's generating. Think of it as the model's working memory.
Everything inside the context window is active and available. Everything outside it simply doesn't exist to the model. This has a counterintuitive implication: in a very long conversation, early messages may eventually get pushed out of context — and the model will genuinely have no memory of them.
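Here's a minimal sketch of that trimming logic. The function, budget, and token counter are hypothetical stand-ins, but chat systems do something similar: keep the newest messages that fit, and drop the rest.

```python
# A minimal, hypothetical sketch of why early messages "fall out"
# of a conversation once the context window fills up.
def fit_to_window(messages, budget_tokens, count_tokens):
    kept, used = [], 0
    # Walk backward from the newest message; stop once the budget is full.
    for msg in reversed(messages):
        cost = count_tokens(msg)
        if used + cost > budget_tokens:
            break  # everything older than this simply doesn't exist to the model
        kept.append(msg)
        used += cost
    return list(reversed(kept))
```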
What Is a Transformer? A Real Explanation
The word "transformer" gets thrown around constantly. It's in the name GPT (Generative Pre-trained Transformer). It's the architecture underneath Claude, Gemini, LLaMA, and every major LLM. But what is it? This article explains the mechanics — the attention mechanism, activation functions, backpropagation, temperature — in plain language.
In 2017, eight researchers at Google published a paper titled "Attention Is All You Need." It introduced the transformer architecture and quietly made everything that came after it possible — GPT, BERT, Claude, Gemini, all of it.
Before transformers, the dominant approach to language modeling used recurrent neural networks (RNNs), which processed text sequentially — one word at a time, left to right. This worked, but it was slow, struggled with long sequences, and had trouble connecting ideas that were far apart in a document. Transformers threw out the sequential constraint entirely. The key insight: instead of reading word by word, the model should be able to look at every word in a sequence simultaneously and dynamically decide which words are most relevant to each other. They called this mechanism attention.
Why the Title Matters
"Attention Is All You Need" was a deliberate provocation. Prior architectures combined attention with recurrence and convolution. The paper's claim was that attention alone — no recurrence, no convolution — was sufficient to build a state-of-the-art language model. They were right.
Before any attention can happen, the model needs to convert text into something it can compute with. Text gets broken into tokens (see Article 02), and each token is mapped to a high-dimensional vector — a list of hundreds or thousands of numbers. This is called an embedding.
These numbers aren't arbitrary. Through training, tokens that appear in similar contexts end up with similar embeddings. "Dog" and "puppy" end up close together in the embedding space. "Dog" and "democracy" end up far apart. The geometry of this space encodes meaning — which is why the model can reason about analogies, synonyms, and semantic relationships.
The transformer also adds positional encodings — another set of numbers layered on top of the embeddings to tell the model where each token sits in the sequence. Without this, the model would have no sense of word order.
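For the curious, here's the original sinusoidal scheme from the 2017 paper, sketched in NumPy. Many newer models use learned or rotary position encodings instead, but the goal is identical: give each position a distinctive numerical signature.

```python
# A minimal sketch of sinusoidal positional encodings, following
# "Attention Is All You Need".
import numpy as np

def positional_encoding(seq_len, d_model):
    pos = np.arange(seq_len)[:, None]    # token positions, shape (seq_len, 1)
    i = np.arange(d_model)[None, :]      # embedding dimensions, shape (1, d_model)
    angle = pos / np.power(10000, (2 * (i // 2)) / d_model)
    # Even dimensions get sine, odd dimensions get cosine.
    pe = np.where(i % 2 == 0, np.sin(angle), np.cos(angle))
    return pe  # shape (seq_len, d_model), added to the token embeddings

print(positional_encoding(4, 8).round(2))
```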
Interactive: The Geometry of Words
The tokens in your prompt each live at a coordinate in a high-dimensional space. The cloud below is a 3D projection of that space for a small vocabulary of animals, colors, and foods. Drag to rotate. Hover a token to see its nearest neighbors by cosine similarity. Click to pin.
Cosine Similarity
We measure "closeness" by the angle between two vectors, not their Euclidean distance. Two tokens pointing the same direction score near 1.00. Perpendicular = 0. Opposite = −1.
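In code, cosine similarity is one line. The three-dimensional "embeddings" below are toy values for illustration; real ones have hundreds or thousands of dimensions.

```python
# Cosine similarity between two embedding vectors.
import numpy as np

def cosine_similarity(a, b):
    # Angle-based closeness: 1.0 = same direction, 0 = perpendicular.
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

dog   = np.array([0.8, 0.1, 0.3])  # toy 3-d "embeddings"
puppy = np.array([0.7, 0.2, 0.3])
print(cosine_similarity(dog, puppy))  # close to 1.0
```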
Interactive: Arithmetic on Meaning
The result that made everyone care about embeddings: take the vector for "king," subtract "man," add "woman" — and the nearest token is "queen." The direction from man→woman is the same direction as king→queen. Meaning is encoded in geometry.
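Here's that arithmetic with hand-picked toy vectors; real embedding values would be learned, not chosen, but the geometry works the same way.

```python
# The classic analogy as vector arithmetic, with illustrative toy vectors.
import numpy as np

vectors = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "man":   np.array([0.5, 0.1, 0.1]),
    "woman": np.array([0.5, 0.1, 0.9]),
    "queen": np.array([0.9, 0.8, 0.9]),
}
target = vectors["king"] - vectors["man"] + vectors["woman"]

def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Find the vocabulary token whose vector points closest to the target.
nearest = max(vectors, key=lambda w: cosine(vectors[w], target))
print(nearest)  # queen
```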
One Caveat
Analogies work cleanly in toy demos and messily in the wild. Real embedding spaces inside frontier models are richer, more entangled, and harder to interpret — but the core geometric intuition holds.
This is the core of what makes transformers work. The idea: when processing any given token, the model should be able to "pay attention" to other tokens in the sequence — and the amount of attention should be determined dynamically based on the content, not hardcoded by position.
Here's how it works. For each token, the model computes three vectors: a Query (what this token is looking for), a Key (what this token offers for other tokens to match against), and a Value (the information this token passes along when attended to).
The model computes a similarity score between a token's Query and every other token's Key. High similarity = high attention weight. These weights are then used to create a weighted sum of the Values. The result: each token's representation gets updated by pulling in information from the tokens most relevant to it.
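Here's the whole mechanism as a short NumPy sketch: scaled dot-product attention, minus the learned projection matrices that produce Q, K, and V in a real model.

```python
# A minimal single-head attention sketch: scores from Q.K, softmax,
# weighted sum of V. Shapes are (seq_len, d).
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)       # similarity of each Query to each Key
    weights = softmax(scores, axis=-1)  # attention weights; each row sums to 1
    return weights @ V                  # weighted sum of Values

seq_len, d = 4, 8
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(seq_len, d)) for _ in range(3))
print(attention(Q, K, V).shape)  # (4, 8)
```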
Analogy
Imagine reading the sentence "The trophy didn't fit in the suitcase because it was too big." When your brain processes "it," it automatically looks back and decides "it" refers to the trophy, not the suitcase. That's exactly what the attention mechanism does — it figures out which other tokens are relevant to resolving the meaning of the current one.
Rather than running attention once, transformers run it in parallel across many "heads" — typically 12, 16, or more in large models. Each head learns to pay attention to different kinds of relationships. One head might learn syntactic structure (subject-verb agreement). Another might track coreferences (what "it" refers to). Another might pick up on semantic relationships (synonyms, opposites).
The outputs from all heads are concatenated and projected back into a single representation. The result is a much richer understanding of context than any single attention pass could produce.
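A sketch of that split-run-concatenate pattern, reusing the attention() function from the sketch above. A real implementation would also apply learned per-head projections and a learned output projection.

```python
# A sketch of multi-head attention. Each head attends over its own
# slice of the model dimension; assumes attention() from above.
import numpy as np

def multi_head_attention(Q, K, V, n_heads):
    d = Q.shape[-1]
    assert d % n_heads == 0, "model dim must divide evenly across heads"
    head_outputs = []
    for h in range(n_heads):
        cols = slice(h * d // n_heads, (h + 1) * d // n_heads)
        # Each head runs attention in its own low-dimensional subspace.
        head_outputs.append(attention(Q[:, cols], K[:, cols], V[:, cols]))
    # Concatenate heads; a real model then applies an output projection.
    return np.concatenate(head_outputs, axis=-1)
```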
After the attention mechanism updates each token's representation, that representation passes through a feed-forward neural network — a simpler layer applied independently to each token position. This is where a large portion of the model's "knowledge" is thought to be stored: the pattern associations learned during training.
Inside these layers, an activation function introduces non-linearity — meaning the model can learn complex, curved patterns rather than just straight-line relationships. Without activation functions, stacking layers wouldn't add any expressive power; the whole network would collapse to a single linear transformation.
The most common activation functions in modern LLMs are ReLU (the longtime default, which simply zeroes out negative inputs), GELU (a smoother variant used in the GPT family), and SwiGLU (a gated variant common in recent open models such as LLaMA).
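As one concrete example, here's the tanh approximation of GELU used in the GPT-2 family:

```python
# GELU activation (tanh approximation), as used in GPT-2-family models.
import numpy as np

def gelu(x):
    # Smooth, non-linear curve: near-zero for negative inputs,
    # near-identity for positive ones.
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

print(gelu(np.array([-2.0, 0.0, 2.0])))
```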
Training a transformer means adjusting billions of numerical parameters (the weights in the attention and feed-forward layers) so that the model gets better at predicting the next token. The mechanism that makes this work is backpropagation.
Here's the loop: the model predicts the next token (the forward pass); the prediction is compared against the actual next token to compute an error score, the loss; the loss is propagated backward through the network to work out how much each weight contributed to the error (backpropagation); and each weight is nudged slightly in the direction that would have reduced the error (gradient descent). Repeat, trillions of times.
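Sketched with PyTorch, one pass through that loop looks roughly like this. Here `model` and `optimizer` are stand-ins for a real transformer and its optimizer, and the shapes are illustrative.

```python
# One iteration of the training loop above, sketched with PyTorch.
import torch
import torch.nn.functional as F

def training_step(model, optimizer, tokens):
    # Predict each token from the tokens before it.
    inputs, targets = tokens[:, :-1], tokens[:, 1:]
    logits = model(inputs)                      # (batch, seq, vocab) scores
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           targets.reshape(-1))  # how wrong were we?
    optimizer.zero_grad()
    loss.backward()   # backpropagation: gradient of loss w.r.t. every weight
    optimizer.step()  # nudge billions of weights slightly downhill
    return loss.item()
```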
The remarkable thing is that no one programs the model to "know" facts or grammar. All of that emerges from this simple training loop applied at scale. The structure of knowledge is an emergent property of learning to predict text well.
Scale is the secret ingredient
The transformer architecture was published in 2017. But ChatGPT didn't launch until 2022. The gap isn't explained by new architectural breakthroughs — it's mostly explained by scale: more parameters, more training data, more compute. GPT-3 has 175 billion parameters and was trained on a dataset of roughly 500 billion tokens. That scale is what pushed language models from "interesting research" to "genuinely useful."
When the model generates a response, it doesn't just always pick the highest-probability next token. If it did, responses would be deterministic and repetitive. Instead, a parameter called temperature controls how the model samples from its probability distribution.
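A minimal sketch of temperature at work: divide the model's raw scores (logits) by the temperature before converting them to probabilities. Low temperature sharpens the distribution toward the top token; high temperature flattens it.

```python
# Temperature sampling over toy next-token logits.
import numpy as np

def sample(logits, temperature=1.0, rng=np.random.default_rng()):
    logits = np.asarray(logits, dtype=float) / max(temperature, 1e-6)
    probs = np.exp(logits - logits.max())  # softmax, numerically stable
    probs /= probs.sum()
    return rng.choice(len(probs), p=probs)

logits = [2.0, 1.0, 0.1]                 # scores for three candidate tokens
print(sample(logits, temperature=0.2))   # almost always token 0
print(sample(logits, temperature=1.5))   # much more varied
```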
If you've ever noticed that asking Claude the same question twice produces slightly different answers, temperature is why. You're not getting a cached lookup — you're getting a fresh sample from the model's probability distribution each time.
Most consumer-facing tools don't expose temperature as a setting, but API users can set it directly. It's one of the most practically useful parameters to understand if you're building with LLMs.
You don't need to understand transformer internals to use AI well. But knowing them changes a few things: you understand why long conversations lose their earliest messages (the context window), why the same prompt can produce different answers (temperature sampling), and why fluent output can still be factually wrong (prediction, not retrieval).
The Architecture Behind AI That Knows Your Business
You've probably heard "RAG" thrown around in AI conversations. It sounds technical — and it is, under the hood — but the core idea is simple and important, especially if you're thinking about how AI could work with your company's internal information.
A standard language model is trained once, then deployed. Its knowledge is frozen at the training cutoff. It knows nothing about your company's internal documents, your proprietary processes, your customer data, or anything that happened after it was trained.
You could retrain the model on your data — but that's enormously expensive and slow. You could fine-tune it — but fine-tuning teaches behavior patterns, not specific facts. Neither approach is practical for most organizations that want AI to "know" their internal knowledge base.
RAG — Retrieval-Augmented Generation — is the practical solution.
Instead of baking knowledge into the model, RAG keeps knowledge in an external database and retrieves it at query time. When you ask a question, the system converts your question into an embedding, searches that database for the most relevant passages, inserts what it finds into the model's prompt as context, and has the model generate an answer grounded in the retrieved material.
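In code, the skeleton is small. Everything named here (embed, vector_db, llm) is a hypothetical stand-in for a real embedding model, vector store, and chat model.

```python
# A minimal RAG sketch. All names are hypothetical stand-ins.
def answer(question, embed, vector_db, llm, k=3):
    query_vec = embed(question)                    # 1. embed the question
    chunks = vector_db.search(query_vec, top_k=k)  # 2. retrieve relevant passages
    context = "\n\n".join(chunks)
    prompt = (                                     # 3. stuff them into the prompt
        f"Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    return llm(prompt)                             # 4. model reasons over the facts
```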
The model's job is no longer to remember facts — it's to reason over facts you hand it in real time. This makes the output far more accurate and verifiable for domain-specific questions.
Analogy
Think of the model as a brilliant analyst and the RAG system as a research assistant who pulls the right documents before the analyst speaks. The analyst doesn't need to have memorized everything — they just need to reason well over what's in front of them.
You're unlikely to build a RAG system yourself. But you'll increasingly encounter enterprise AI tools — internal chatbots, knowledge base assistants, document search tools — that are built on this architecture. Understanding RAG helps you judge what these tools can and can't know, ask where their answers actually come from, and distinguish a retrieval failure (the right document never reached the model) from a model failure (the model reasoned poorly over what it was given).
From Assistant to Actor — The Shift Already Underway
So far, every AI tool we've discussed has one thing in common: you give it a prompt, it gives you a response, you decide what to do next. The model is reactive. You're the one taking action in the world.
That's changing. The next major shift in how AI is used — already underway — is the move from AI as a responder to AI as an actor.
An AI agent is a model that can take a sequence of actions to accomplish a goal — not just produce a response. It can use tools: browse the web, run code, search databases, send emails, fill out forms, interact with software. You give it a goal; it figures out the steps and executes them.
The difference from a chat model is autonomy and action. A chat model tells you how to search for flights. An agent books the flight.
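Stripped to its skeleton, an agent is a loop: the model picks an action, the system executes it, and the result feeds back in. This sketch is purely illustrative; llm and tools are hypothetical stand-ins, and real frameworks add planning, memory, and error handling.

```python
# A minimal, hypothetical agent loop. `llm` returns a decision dict;
# `tools` maps action names to callable functions.
def run_agent(goal, llm, tools, max_steps=10):
    history = [f"Goal: {goal}"]
    for _ in range(max_steps):
        decision = llm("\n".join(history))     # model chooses the next action
        if decision["action"] == "finish":
            return decision["answer"]
        result = tools[decision["action"]](**decision["args"])  # execute the tool
        history.append(f"Did {decision['action']}, got: {result}")
    return "Stopped: step limit reached"       # guardrail against runaway loops
```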
You're already seeing early versions of this: coding assistants that write, run, and debug code on their own; research tools that browse dozens of sources and assemble a synthesis; workflow agents that read a request, query the right systems, and draft the response.
How much of your work will this change? The honest answer: a lot, over the next few years. Tasks that currently require a human to orchestrate a sequence of decisions and actions — research synthesis, report generation, scheduling, data processing, operational monitoring — are increasingly automatable with agents.
The ratio of time spent on routine orchestration vs. judgment and strategy shifts dramatically. The professionals who adapt fastest are the ones who start thinking now about which parts of their work are mechanical sequences and which parts genuinely require human judgment, relationships, and accountability.
The Practical Move Right Now
You don't need to understand agentic frameworks to benefit from this shift. Start by identifying one multi-step, recurring task in your work. Think through which steps are mechanical and which require your judgment. That mapping exercise is the foundation of every AI workflow — and the skill that will matter most as agents become more capable.
Agents are powerful and also brittle. They fail in ways that are harder to catch than a single bad response — a chain of plausible-looking steps can lead somewhere wrong. The current best practice is "human in the loop": agents handle the mechanical steps, humans review and approve at key decision points. That's not a temporary workaround; it's a thoughtful design principle for consequential workflows.
Test Your Understanding
Six questions on the concepts from the Deep Dive. Harder than the Crash Course — the explanations do the teaching.
Confused by a term? See the Glossary →