<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Akash</title>
    <description>The latest articles on Forem by Akash (@kambleakash0).</description>
    <link>https://forem.com/kambleakash0</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F964126%2F2e6b485b-73c7-4fa4-a415-65e2948f235d.JPG</url>
      <title>Forem: Akash</title>
      <link>https://forem.com/kambleakash0</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/kambleakash0"/>
    <language>en</language>
    <item>
      <title>What Does an LLM Actually Do?</title>
      <dc:creator>Akash</dc:creator>
      <pubDate>Sat, 11 Apr 2026 10:36:01 +0000</pubDate>
      <link>https://forem.com/kambleakash0/what-an-llm-actually-does-ic6</link>
      <guid>https://forem.com/kambleakash0/what-an-llm-actually-does-ic6</guid>
      <description>&lt;h2&gt;
  
  
  Pretraining, Prompting, Sampling, and Alignment
&lt;/h2&gt;

&lt;p&gt;By the end of this post, you'll understand what an LLM actually learns during pretraining (ontologies, math, pronoun resolution, all of it) and why this happens from nothing more than predicting the next word. You'll know the three architectural families of LLMs (decoder-only, encoder-only, encoder-decoder) and when each one fits the job. You'll see how unrelated tasks like sentiment analysis, question answering, and classification all get cast as conditional generation. You'll understand prompting, in-context learning, and why system prompts are longer than you'd expect. You'll know the difference between greedy decoding, random sampling, and temperature sampling, and why the obvious strategy is actually a bad one. Finally, you'll understand the three stages of training that take a raw pretrained model and turn it into something useful and safe: pretraining, instruction tuning, and preference alignment.&lt;/p&gt;


&lt;div class="crayons-card c-embed"&gt;

  &lt;br&gt;
&lt;strong&gt;One thread runs through all of this.&lt;/strong&gt; LLMs are language models. They do exactly what n-gram models and feedforward language models did: &lt;strong&gt;predict the next word.&lt;/strong&gt; The machinery is bigger, the context is longer, the training is more elaborate. The task hasn't moved.&lt;br&gt;

&lt;/div&gt;





&lt;h2&gt;
  
  
  Why This Post Is Different
&lt;/h2&gt;

&lt;p&gt;Because this post is dedicated entirely to LLMs. We'll treat the transformer as a black box, for now. You don't need to know how attention works to understand how LLMs behave, what they learn, how they're trained, and where they break down. Transformers come later. For now, just treat the architecture as "something that takes context in and returns a probability distribution over the next word."&lt;/p&gt;

&lt;p&gt;This turns out to be the right level of abstraction for getting intuition. Take 8 minutes before diving in and watch 3Blue1Brown's explainer on LLMs; it'll make everything that follows click faster.&lt;/p&gt;

&lt;p&gt;  &lt;iframe src="https://www.youtube.com/embed/LPZh9BOjkQs"&gt;
  &lt;/iframe&gt;
&lt;/p&gt;




&lt;h2&gt;
  
  
  Defining an LLM (The Unusual Way)
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://web.stanford.edu/~jurafsky/slp3/" rel="noopener noreferrer"&gt;The NLP textbook&lt;/a&gt; gives a definition I hadn't seen before:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;A large language model is a computational agent that can interact conversationally with people using natural language.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Notice what this leaves out. No mention of parameter counts. No mention of transformers. No mention of depth or training data size. The definition is about &lt;em&gt;behavior&lt;/em&gt;, not architecture. An LLM is something you can talk to.&lt;/p&gt;

&lt;p&gt;This is unusual, but it's actually correct for the purposes of this post. The concepts here (prompting, sampling, alignment, hallucination) are all about how the model &lt;em&gt;behaves&lt;/em&gt;, and they apply regardless of the specific architecture underneath.&lt;/p&gt;

&lt;h3&gt;
  
  
  Similar to N-grams, Different from N-grams
&lt;/h3&gt;

&lt;p&gt;LLMs and n-gram models share the core task: assign probabilities to sequences and generate text by sampling from those probabilities. What's different is the training. N-grams learn by &lt;strong&gt;counting&lt;/strong&gt;. LLMs learn by &lt;strong&gt;predicting the next word&lt;/strong&gt; through gradient descent on a neural network.&lt;/p&gt;

&lt;p&gt;That's the whole difference at a conceptual level. Everything else (the scale, the emergent capabilities, the way you interact with them) flows from that training objective applied to massive amounts of text.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Does a Model Actually Learn from Pretraining?
&lt;/h2&gt;

&lt;p&gt;This is the question that still surprises people. The training objective is simple: take a corpus, and at every position, predict the next word. No explicit teaching of grammar, math, facts, or reasoning. Yet an LLM trained this way ends up with all of that.&lt;/p&gt;

&lt;p&gt;Let's walk through five examples. Each one shows a different kind of knowledge the model absorbs purely from next-word prediction:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Ontological relationships.&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"With roses, dahlias, and peonies, I was surrounded by &lt;strong&gt;flowers&lt;/strong&gt;."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The model learns that roses, dahlias, and peonies are kinds of flowers. Nobody told it. It figured it out because the word "flowers" keeps appearing near these specific species.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Superlative and scalar relationships.&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"The room wasn't just big, it was &lt;strong&gt;enormous&lt;/strong&gt;."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This teaches the model about intensity scales. "Enormous" is stronger than "big." The model picks up this ordering from how people actually use these words in context.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Authorship and factual relations.&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;'The author of "A Room of One's Own" is &lt;strong&gt;Virginia Woolf&lt;/strong&gt;.'&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Simple factual recall. But notice there's nothing in the training objective that says "learn facts." The facts just fall out of the statistical structure of text.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Mathematical relationships.&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"The square root of 4 is &lt;strong&gt;2&lt;/strong&gt;."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The model learns basic math not from being taught arithmetic, but because sentences like this appear in the training data often enough for the pattern to stick.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Pronoun resolution.&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"The doctor told me that &lt;strong&gt;he&lt;/strong&gt;..."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Researchers have spent years building pronoun resolution algorithms by hand, and LLMs pick this up implicitly. "He" correctly refers back to "doctor" because in millions of similar sentences, that's how pronouns get used.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fopte808buyqvylhoy6wr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fopte808buyqvylhoy6wr.png" alt="Image1" width="800" height="449"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  A Side Note on Human Learning
&lt;/h3&gt;

&lt;p&gt;Here's an interesting comparison. By age 30, an average literate human has a vocabulary of 50,000 to 100,000 words. Reaching that figure means learning 6 to 7 new words per day, starting very young. And children accomplish this with &lt;em&gt;vastly&lt;/em&gt; less training data than an LLM (just whatever they hear from the people around them).&lt;/p&gt;

&lt;p&gt;The implication: whatever mechanism human brains use to learn language is dramatically more efficient than anything we've built. LLMs need trillions of tokens to match what a child does with a few years of speech. That gap is one of the most interesting open questions in cognitive science.&lt;/p&gt;




&lt;h2&gt;
  
  
  LLMs as a Black Box
&lt;/h2&gt;

&lt;p&gt;For this post, an LLM is just a neural network with one job:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Input&lt;/strong&gt;: a context (a sequence of tokens, aka a prompt or prefix)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Output&lt;/strong&gt;: a probability distribution over the next token&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That's the entire interface. What's inside the box doesn't matter yet. It could be a simple feedforward network, an RNN, an LSTM, or a transformer. The behavior we care about is the same regardless.&lt;/p&gt;

&lt;h3&gt;
  
  
  Autoregressive Generation
&lt;/h3&gt;

&lt;p&gt;To generate text longer than one word, the model does the obvious thing:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Feed in the context, get a probability distribution over next tokens&lt;/li&gt;
&lt;li&gt;Pick a token from that distribution&lt;/li&gt;
&lt;li&gt;Append the picked token to the context&lt;/li&gt;
&lt;li&gt;Repeat&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This is &lt;strong&gt;autoregressive generation&lt;/strong&gt;. The model's own output becomes part of its input on the next step. That single loop is the foundation of everything ChatGPT, Claude, Gemini, and every other chat interface does.&lt;/p&gt;
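&lt;p&gt;Here's a minimal sketch of that loop in Python. The &lt;code&gt;model.next_token_probs&lt;/code&gt; and &lt;code&gt;tokenizer&lt;/code&gt; calls are hypothetical placeholders, not a real library API; the shape of the loop is the point.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import numpy as np

def generate(model, tokenizer, prompt, max_new_tokens=50):
    """Autoregressive generation: feed context, pick a token, append, repeat.

    `model.next_token_probs` and `tokenizer` are placeholders, not a real API.
    """
    tokens = tokenizer.encode(prompt)               # context as a list of token ids
    for _ in range(max_new_tokens):
        probs = model.next_token_probs(tokens)      # distribution over the whole vocabulary
        next_id = int(np.random.choice(len(probs), p=probs))  # pick one token from it
        tokens.append(next_id)                      # the model's output becomes its next input
        if next_id == tokenizer.eos_id:             # stop at end-of-sequence
            break
    return tokenizer.decode(tokens)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;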

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk8smzuza2b1to6ej86tn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk8smzuza2b1to6ej86tn.png" alt="Image2" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Three Architectures, Three Jobs
&lt;/h2&gt;

&lt;p&gt;Not all LLMs are built the same way. There are three main architectural families, and each one is suited to a different kind of task.&lt;/p&gt;

&lt;h3&gt;
  
  
  Decoder-Only Models
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Examples&lt;/strong&gt;: GPT, Claude, Llama, DeepSeek, Mistral.&lt;/p&gt;

&lt;p&gt;When someone says "LLM" in casual conversation, they usually mean this. Decoder-only models are &lt;strong&gt;generative&lt;/strong&gt;. They take a prompt, generate tokens one at a time, left to right. Causal, autoregressive.&lt;/p&gt;

&lt;p&gt;Use them when you need to generate text: chat, code, summaries, creative writing.&lt;/p&gt;

&lt;h3&gt;
  
  
  Encoder-Only Models
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Examples&lt;/strong&gt;: BERT and its family, HuBERT.&lt;/p&gt;

&lt;p&gt;These aren't generative. They take a sequence in and produce representations (embeddings) of it. What makes them interesting is that they can look at &lt;strong&gt;both sides&lt;/strong&gt; of context when building representations, past and future. This is sometimes called "cheating" because regular autoregressive models only look backward.&lt;/p&gt;

&lt;p&gt;But it's legal cheating. Encoder-only models aren't trying to predict the next word at inference time. They're trying to understand the input as a whole. So looking forward is fine.&lt;/p&gt;

&lt;p&gt;Use them when you need high-quality representations of meaning for downstream tasks like classification, sentiment analysis, or named entity recognition. Usually, you &lt;strong&gt;finetune&lt;/strong&gt; them on labeled data for the specific task.&lt;/p&gt;

&lt;h3&gt;
  
  
  Encoder-Decoder Models
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Examples&lt;/strong&gt;: Flan-T5, Whisper.&lt;/p&gt;

&lt;p&gt;These map one sequence to another sequence. The encoder digests the input, and the decoder generates the output. They're trained on paired data.&lt;/p&gt;

&lt;p&gt;Classic use case: &lt;strong&gt;machine translation&lt;/strong&gt;. English in, Chinese out. The encoder learns to understand English, the decoder learns to generate Chinese, and together they learn how the two languages correspond.&lt;/p&gt;

&lt;p&gt;Another use case: &lt;strong&gt;speech recognition&lt;/strong&gt;. Audio features in, text out. Whisper is this kind of model.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwwjp5yhstfc55ctbad5t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwwjp5yhstfc55ctbad5t.png" alt="Image3" width="800" height="456"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  When Do You Pick Which?
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Task&lt;/th&gt;
&lt;th&gt;Architecture&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Chat, creative writing, code generation&lt;/td&gt;
&lt;td&gt;Decoder-only&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Sentiment, classification, NER, semantic search&lt;/td&gt;
&lt;td&gt;Encoder-only&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Machine translation, speech-to-text, summarization&lt;/td&gt;
&lt;td&gt;Encoder-decoder&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;This is the kind of question you might well get asked in an interview.&lt;/p&gt;




&lt;h2&gt;
  
  
  Casting Tasks as Conditional Generation
&lt;/h2&gt;

&lt;p&gt;Here's where decoder-only models turn out to be more general than they look. Even tasks that don't look like "generate text" can be reframed as "predict the next word."&lt;/p&gt;

&lt;h3&gt;
  
  
  Sentiment Analysis
&lt;/h3&gt;

&lt;p&gt;You want to know whether "I like Jackie Chan" is positive or negative. You don't need a separate classifier. You just prompt the LLM:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;The sentiment of the sentence "I like Jackie Chan" is:
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Now ask: what word does the model predict next? Compare 

&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;P(positive∣prompt)P(\text{positive} \mid \text{prompt}) &lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;P&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord text"&gt;&lt;span class="mord"&gt;positive&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;∣&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord text"&gt;&lt;span class="mord"&gt;prompt&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
 to 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;P(negative∣prompt)P(\text{negative} \mid \text{prompt}) &lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;P&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord text"&gt;&lt;span class="mord"&gt;negative&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;∣&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord text"&gt;&lt;span class="mord"&gt;prompt&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
. The higher one wins.&lt;/p&gt;

&lt;p&gt;You've done classification with a generative model.&lt;/p&gt;
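&lt;p&gt;As a sketch, using the same hypothetical &lt;code&gt;model&lt;/code&gt;/&lt;code&gt;tokenizer&lt;/code&gt; interface as before (real tokenizers split words into subword pieces, so in practice you'd score the first token of each label):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def classify_sentiment(model, tokenizer, sentence):
    """Sentiment classification cast as next-word prediction (illustrative sketch)."""
    prompt = f'The sentiment of the sentence "{sentence}" is:'
    probs = model.next_token_probs(tokenizer.encode(prompt))
    p_positive = probs[tokenizer.token_id(" positive")]   # P(positive | prompt)
    p_negative = probs[tokenizer.token_id(" negative")]   # P(negative | prompt)
    return "positive" if p_positive &amp;gt; p_negative else "negative"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;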
&lt;h3&gt;
  
  
  Question Answering
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Q: Who wrote the book "The Origin of Species"? A:
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;The model predicts "Charles" first, then, when you include "Charles" in the new context and predict again, it produces "Darwin." Autoregressive generation gives you the answer one token at a time.&lt;/p&gt;

&lt;p&gt;This trick of casting every task as a next-word prediction problem is one reason decoder-only LLMs have replaced task-specific models for so many applications. Why train a separate sentiment classifier when a well-prompted LLM does the job well enough?&lt;/p&gt;


&lt;h2&gt;
  
  
  Prompting
&lt;/h2&gt;

&lt;p&gt;A &lt;strong&gt;prompt&lt;/strong&gt; is the text string you give an LLM to get it to do something. &lt;strong&gt;Prompt engineering&lt;/strong&gt; is the process of finding effective prompts for a task.&lt;/p&gt;

&lt;p&gt;Prompts come in many shapes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;A bare question&lt;/strong&gt;: "What is a transformer network?"&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Structured&lt;/strong&gt;: "Q: What is a transformer network? A:"&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Instructional&lt;/strong&gt;: "Translate the following sentence into Hindi: 'Chop the garlic finely.'"&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multiple choice&lt;/strong&gt;: "Do you think that input has negative or positive sentiment? Choices: (P) Positive (N) Negative. Assistant: I believe the best answer is: ..."&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Different prompts push the model toward different kinds of responses. &lt;u&gt;Prompt engineering is a real discipline, not just vibes.&lt;/u&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Demonstrations and In-Context Learning
&lt;/h3&gt;

&lt;p&gt;You can also give the model examples of the task you want it to do. This is called &lt;strong&gt;few-shot prompting&lt;/strong&gt; or &lt;strong&gt;in-context learning&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Let x = 1. What is x &amp;lt;&amp;lt; 3 in Python 3?
(A) 1 (B) 3 (C) 8 (D) 16
Answer: C

Which is the largest asymptotically?
(A) O(1) (B) O(n) (C) O(n²) (D) O(log(n))
Answer: C

What is the output of the statement "a" + "ab" in Python 3?
(A) Error (B) aab (C) ab (D) a ab
Answer:
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;The model sees two worked examples, then you ask it the third one. It picks up the pattern from the demonstrations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Crucially, this is not the same as training.&lt;/strong&gt; The model's parameters don't change when you give it examples in the prompt. What changes is the context and the network's internal activations. The model behaves as if it had learned something, but nothing in its weights has shifted.&lt;/p&gt;

&lt;p&gt;This is called &lt;strong&gt;in-context learning&lt;/strong&gt;: improvement in behavior that doesn't update any parameters. It's one of the stranger capabilities of modern LLMs.&lt;/p&gt;
&lt;h3&gt;
  
  
  System Prompts Are Bigger Than You Think
&lt;/h3&gt;

&lt;p&gt;Every production LLM has a hidden system prompt that gets prepended to whatever you actually type. A simple one:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;system&amp;gt;&lt;/span&gt; You are a helpful and knowledgeable assistant. Answer concisely and correctly.
&lt;span class="nt"&gt;&amp;lt;user&amp;gt;&lt;/span&gt; What is the capital of France?
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;You only type the user part. The system part is silently concatenated on every call.&lt;/p&gt;

&lt;p&gt;And the system prompts get long. &lt;strong&gt;Claude Opus 4's system prompt is about 1,700 words.&lt;/strong&gt; Some excerpts:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"Claude should give concise responses to very simple questions, but provide thorough responses to complex and open-ended questions."&lt;/li&gt;
&lt;li&gt;"Claude does not provide information that could be used to make chemical or biological or nuclear weapons."&lt;/li&gt;
&lt;li&gt;"For more casual, emotional, empathetic, or advice-driven conversations, Claude keeps its tone natural, warm, and empathetic."&lt;/li&gt;
&lt;li&gt;"Claude cares about people's well-being and avoids encouraging or facilitating self-destructive behavior."&lt;/li&gt;
&lt;li&gt;"If Claude provides bullet points in its response, it should use markdown, and each bullet point should be at least 1-2 sentences long unless the human requests otherwise."&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That's a lot of invisible prompt engineering happening on every message. The system prompt is one of the main levers for steering model behavior at deployment time.&lt;/p&gt;


&lt;h2&gt;
  
  
  Sampling: How the Next Token Actually Gets Picked
&lt;/h2&gt;

&lt;p&gt;The LLM produces a probability distribution over the entire vocabulary. But how do you actually pick one token from that distribution? This turns out to matter a lot.&lt;/p&gt;
&lt;h3&gt;
  
  
  From Logits to Probabilities
&lt;/h3&gt;

&lt;p&gt;The network's final layer outputs real-valued scores called &lt;strong&gt;logits&lt;/strong&gt;, one per token in the vocabulary. These can be any real number, positive or negative. Softmax converts them into a proper probability distribution:&lt;/p&gt;


&lt;div class="katex-element"&gt;
  &lt;span class="katex-display"&gt;&lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;y=softmax(u)
\mathbf{y} = \text{softmax}(\mathbf{u})
&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathbf"&gt;y&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;=&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord text"&gt;&lt;span class="mord"&gt;softmax&lt;/span&gt;&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord mathbf"&gt;u&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/div&gt;



&lt;p&gt;Now you have probabilities 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;yiy_i &lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;y&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;i&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
 that sum to 1. Time to pick a token.&lt;/p&gt;
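&lt;p&gt;A minimal softmax in Python (subtracting the max logit is a standard numerical-stability trick; it doesn't change the result):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import numpy as np

def softmax(logits):
    """Turn real-valued logits into a probability distribution."""
    z = np.asarray(logits, dtype=float)
    z = z - z.max()          # numerical stability; the output is unchanged
    e = np.exp(z)
    return e / e.sum()

print(softmax([1.2, 0.9, 0.1, -0.5]))   # roughly [0.44, 0.33, 0.15, 0.08], sums to 1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;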
&lt;h3&gt;
  
  
  Strategy 1: Greedy Decoding
&lt;/h3&gt;

&lt;p&gt;Just pick the word with the highest probability:&lt;/p&gt;


&lt;div class="katex-element"&gt;
  &lt;span class="katex-display"&gt;&lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;w^t=arg max⁡w∈VP(w∣w&amp;lt;t)
\hat{w}t = \argmax{w \in V} P(w \mid \mathbf{w}_{&amp;lt;t})
&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord accent"&gt;&lt;span class="vlist-t"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;w&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="accent-body"&gt;&lt;span class="mord"&gt;^&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;t&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;=&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mop"&gt;&lt;span class="mord mathrm"&gt;arg&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord mathrm"&gt;max&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;w&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;∈&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;V&lt;/span&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;P&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord mathnormal"&gt;w&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;∣&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathbf"&gt;w&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mrel mtight"&gt;&amp;lt;&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;t&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/div&gt;


&lt;p&gt;Obvious, simple, and &lt;strong&gt;wrong&lt;/strong&gt;. Why wrong? Because greedy decoding is &lt;strong&gt;deterministic&lt;/strong&gt;. Give the model the same prompt, and it produces exactly the same response every time. Worse, the text it generates tends to be generic and repetitive. By construction, each token is the most predictable one, so the output ends up being whatever the model thinks is the most boring possible continuation.&lt;/p&gt;

&lt;p&gt;We don't use greedy decoding in practice.&lt;/p&gt;

&lt;h3&gt;
  
  
  Strategy 2: Random Sampling
&lt;/h3&gt;

&lt;p&gt;"Random sampling" is a confusing name because it doesn't mean picking a word uniformly at random. It means picking a word &lt;strong&gt;according to its probability distribution&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Here's how it works in practice, using a cumulative distribution function:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Compute the probability for each word: Word A = 0.5, Word B = 0.3, Word C = 0.2&lt;/li&gt;
&lt;li&gt;Compute cumulative ranges: A ∈ [0.0, 0.5], B ∈ [0.5, 0.8], C ∈ [0.8, 1.0]&lt;/li&gt;
&lt;li&gt;Draw a random number r between 0 and 1&lt;/li&gt;
&lt;li&gt;Pick the word whose range contains r&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If you draw 0.65, you land in B's range, so you pick B even though A has the highest probability. High-probability words get picked more often, but lower-probability words still get their chance proportionally.&lt;/p&gt;
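&lt;p&gt;Here's that procedure as code, with greedy decoding alongside for contrast. Both take a vocabulary and its probabilities; only the selection rule differs.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import random

def greedy_pick(vocab, probs):
    """Greedy decoding: always the single most probable word (deterministic)."""
    return max(zip(vocab, probs), key=lambda pair: pair[1])[0]

def sample_pick(vocab, probs):
    """Random sampling: walk the cumulative distribution until the draw lands in a word's range."""
    r = random.random()              # uniform draw in [0, 1)
    cumulative = 0.0
    for word, p in zip(vocab, probs):
        cumulative += p
        if r &amp;lt; cumulative:
            return word
    return vocab[-1]                 # guard against floating-point rounding

vocab, probs = ["A", "B", "C"], [0.5, 0.3, 0.2]
# greedy_pick always returns "A"; sample_pick returns "A" ~50%, "B" ~30%, "C" ~20% of the time.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;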

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fopvpc84k9omqxfd9rbjk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fopvpc84k9omqxfd9rbjk.png" alt="Image4" width="800" height="429"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  The Problem with Random Sampling
&lt;/h3&gt;

&lt;p&gt;Random sampling is an improvement over greedy, but it has its own failure mode. The tail of the probability distribution contains many low-probability words. Each one is unlikely, but collectively they add up to a significant chunk of probability mass. Over many tokens, you'll end up picking weird, low-probability words often enough that the generated text goes off the rails: hallucinations, nonsense phrases, tangents that don't connect.&lt;/p&gt;

&lt;h3&gt;
  
  
  Strategy 3: Temperature Sampling
&lt;/h3&gt;

&lt;p&gt;The fix is to reshape the distribution before sampling. Divide the logits by a &lt;strong&gt;temperature parameter&lt;/strong&gt; 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;τ\tau &lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;τ&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
 before applying softmax:&lt;/p&gt;


&lt;div class="katex-element"&gt;
  &lt;span class="katex-display"&gt;&lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;y=softmax(u/τ)
\mathbf{y} = \text{softmax}(\mathbf{u} / \tau)
&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathbf"&gt;y&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;=&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord text"&gt;&lt;span class="mord"&gt;softmax&lt;/span&gt;&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord mathbf"&gt;u&lt;/span&gt;&lt;span class="mord"&gt;/&lt;/span&gt;&lt;span class="mord mathnormal"&gt;τ&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/div&gt;


&lt;p&gt;What does this do?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;When 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;τ=1\tau = 1 &lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;τ&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;=&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
, you get the normal softmax, no change.&lt;/li&gt;
&lt;li&gt;When 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;τ&amp;lt;1\tau &amp;lt; 1 &lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;τ&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;&amp;lt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
, the logits get larger in magnitude, and softmax amplifies the differences. High-probability words get pushed toward 1, low-probability words get pushed toward 0. The distribution becomes more &lt;strong&gt;greedy-like&lt;/strong&gt;. In the limit 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;τ→0\tau \to 0 &lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;τ&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;→&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;0&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
, you recover pure greedy decoding.&lt;/li&gt;
&lt;li&gt;When 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;τ&amp;gt;1\tau &amp;gt; 1 &lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;τ&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;&amp;gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
, the logits get smaller in magnitude, and the distribution gets flatter. More words become plausible. In the limit of very high 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;τ\tau &lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;τ&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
, you approach a uniform distribution (random gibberish).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A concrete example with logits [1.2, 0.9, 0.1, -0.5]:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;τ&lt;/th&gt;
&lt;th&gt;Distribution&lt;/th&gt;
&lt;th&gt;Character&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;0.1&lt;/td&gt;
&lt;td&gt;[0.95, 0.05, 0, 0]&lt;/td&gt;
&lt;td&gt;Nearly greedy&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;0.5&lt;/td&gt;
&lt;td&gt;[0.59, 0.32, 0.07, 0.02]&lt;/td&gt;
&lt;td&gt;Focused but flexible&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1.0&lt;/td&gt;
&lt;td&gt;[0.44, 0.33, 0.15, 0.08]&lt;/td&gt;
&lt;td&gt;Normal softmax&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;[0.27, 0.26, 0.24, 0.23]&lt;/td&gt;
&lt;td&gt;Nearly uniform&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;100&lt;/td&gt;
&lt;td&gt;[0.25, 0.25, 0.25, 0.25]&lt;/td&gt;
&lt;td&gt;Pure uniform&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
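&lt;p&gt;You can reproduce the table with a few lines of NumPy; the only change from plain softmax is dividing the logits by τ first:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import numpy as np

def softmax_with_temperature(logits, tau):
    """Softmax after dividing the logits by the temperature tau."""
    z = np.asarray(logits, dtype=float) / tau
    z = z - z.max()                  # numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = [1.2, 0.9, 0.1, -0.5]
for tau in (0.1, 0.5, 1.0, 10, 100):
    print(tau, np.round(softmax_with_temperature(logits, tau), 2))
# 0.1  -&amp;gt; [0.95 0.05 0.   0.  ]    nearly greedy
# 0.5  -&amp;gt; [0.59 0.32 0.07 0.02]    focused but flexible
# 1.0  -&amp;gt; [0.44 0.33 0.15 0.08]    plain softmax
# 10   -&amp;gt; [0.27 0.26 0.24 0.23]    nearly uniform
# 100  -&amp;gt; [0.25 0.25 0.25 0.25]    pure uniform
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;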

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4rwwibhegnal69up9w4t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4rwwibhegnal69up9w4t.png" alt="Image5" width="800" height="552"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The name comes from thermodynamics. A system at low temperature explores only low-energy (likely) states. A system at high temperature is flexible and explores a wider range of states. The same metaphor applies here.&lt;/p&gt;

&lt;p&gt;In practice, production LLMs use temperature values between 0.5 and 1.0, tuned per task. Code generation benefits from low temperature (you want the most likely correct token). Creative writing benefits from a higher temperature (you want variation).&lt;/p&gt;




&lt;h2&gt;
  
  
  The Three Stages of Training
&lt;/h2&gt;

&lt;p&gt;Modern LLMs aren't just trained once. They go through three distinct stages, each with different data and different objectives.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The three stages at a glance:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Stage 1: Pretraining.&lt;/strong&gt; Train on a huge corpus of raw text. Objective: predict the next word. Result: a model that has absorbed enormous amounts of knowledge but doesn't know how to follow instructions or behave safely.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Stage 2: Instruction tuning.&lt;/strong&gt; Fine-tune on curated examples of (instruction, response) pairs. Objective: same (next-word prediction), but the data looks like "Label the sentiment of this sentence: ..." paired with the correct label. Result: a model that responds well to task instructions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Stage 3: Preference alignment.&lt;/strong&gt; Fine-tune on human preference data showing which response is better when the model has two options. Objective: learn social norms, safety behavior, and tone. Result: a model that's helpful, honest, and less likely to produce harmful output.&lt;/p&gt;

&lt;h3&gt;
  
  
  Stage 1: Pretraining
&lt;/h3&gt;

&lt;p&gt;The foundation. Take a huge corpus, train a transformer (or other architecture) to predict the next word. Self-supervised: the corpus itself provides the training signal, so no human annotation is needed.&lt;/p&gt;

&lt;p&gt;The loss function is &lt;strong&gt;cross-entropy&lt;/strong&gt;:&lt;/p&gt;


&lt;div class="katex-element"&gt;
  &lt;span class="katex-display"&gt;&lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;LCE(y^t,yt)=−log⁡y^t[wt+1]
L_{CE}(\hat{\mathbf{y}}t, \mathbf{y}_t) = -\log \hat{\mathbf{y}}_t[w{t+1}]
&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;L&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;CE&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord accent"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord mathbf"&gt;y&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="accent-body"&gt;&lt;span class="mord"&gt;^&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;t&lt;/span&gt;&lt;span class="mpunct"&gt;,&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathbf"&gt;y&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;t&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;=&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;−&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mop"&gt;lo&lt;span&gt;g&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord accent"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord mathbf"&gt;y&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="accent-body"&gt;&lt;span class="mord"&gt;^&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;t&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span 
class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mopen"&gt;[&lt;/span&gt;&lt;span class="mord mathnormal"&gt;w&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;t&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;+&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose"&gt;]&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/div&gt;


&lt;p&gt;The negative log probability that the model assigned to the actual next word. If the model is confident in the right answer, the loss is small. If it's confident in the wrong answer, the loss is huge.&lt;/p&gt;

&lt;p&gt;During training, a technique called &lt;strong&gt;teacher forcing&lt;/strong&gt; is used. At each position, the model sees the actual correct previous tokens (not its own predictions), predicts the next token, and gets a loss. At the next position, the model again sees the correct previous tokens; we ignore what it predicted. This keeps training stable and fast.&lt;/p&gt;
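&lt;p&gt;As a sketch, using the same hypothetical &lt;code&gt;model.next_token_probs&lt;/code&gt; interface as earlier, the loss over one training sequence with teacher forcing looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import numpy as np

def pretraining_loss(model, token_ids):
    """Average cross-entropy over a sequence, with teacher forcing.

    At every position the model conditions on the *correct* previous tokens
    (never on its own predictions) and is scored on the actual next token.
    """
    losses = []
    for t in range(1, len(token_ids)):
        probs = model.next_token_probs(token_ids[:t])   # context = the true tokens so far
        p_true = probs[token_ids[t]]                    # probability assigned to the actual next word
        losses.append(-np.log(p_true))                  # confident and right: small loss; confident and wrong: huge loss
    return float(np.mean(losses))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;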

&lt;h3&gt;
  
  
  Training Data: Where Does It Come From?
&lt;/h3&gt;

&lt;p&gt;LLMs are mostly trained on the web. Some common sources:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Common Crawl&lt;/strong&gt;: periodic snapshots of the entire web, billions of pages&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;C4 (Colossal Clean Crawled Corpus)&lt;/strong&gt;: 156 billion tokens of English, filtered from Common Crawl. Mostly patent text, Wikipedia, and news&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Pile&lt;/strong&gt;: a curated mix of academic papers (PubMed, arXiv), web text, books, code (GitHub), and dialog (subtitles, IRC)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcf1ralnon5ugdan5i82s.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcf1ralnon5ugdan5i82s.png" alt="Image6" width="800" height="582"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Filtering Problems
&lt;/h3&gt;

&lt;p&gt;Raw web data is a mess. You need to filter for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Quality&lt;/strong&gt;: remove boilerplate and adult content, and deduplicate at multiple levels (URLs, documents, even lines)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Safety&lt;/strong&gt;: toxicity detection, though this is imperfect and can mistakenly flag dialects like African-American English&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  The Copyright Problem
&lt;/h3&gt;

&lt;p&gt;Scraping copyrighted text for training has become a legal mess. The &lt;a href="https://www.nytimes.com/2023/12/27/business/media/new-york-times-open-ai-microsoft-lawsuit.html" rel="noopener noreferrer"&gt;New York Times sued OpenAI&lt;/a&gt;. Authors have filed &lt;a href="https://www.authorsalliance.org/2026/01/27/ai-class-action-litigation-update-books-where-things-stand-in-early-2026/" rel="noopener noreferrer"&gt;class-action lawsuits&lt;/a&gt;. The core legal question, whether training on copyrighted text counts as fair use, isn't settled.&lt;/p&gt;

&lt;p&gt;There's a harder problem beneath the legal question: &lt;strong&gt;attribution&lt;/strong&gt;. For NYT to win their lawsuit, they'd need to prove that specific outputs from ChatGPT can be traced back to specific NYT articles. That's technically very hard. (&lt;a href="https://arxiv.org/html/2412.06370v1" rel="noopener noreferrer"&gt;paper for reference&lt;/a&gt;) The model doesn't store articles; it stores probability distributions. Showing that a particular distribution came from a particular article is an open research problem.&lt;/p&gt;

&lt;p&gt;If you could build a reliable attribution system for LLM outputs, that would be a significant contribution. This is a potential project area for the interested folks here.&lt;/p&gt;

&lt;h3&gt;
  
  
  Stage 2: Instruction Tuning
&lt;/h3&gt;

&lt;p&gt;After pretraining, you have a model that has absorbed a lot of knowledge but doesn't know how to follow instructions. Ask it to "Summarize this article" and it might just continue the article rather than summarizing it.&lt;/p&gt;

&lt;p&gt;Instruction tuning fixes this. You fine-tune on a dataset of (instruction, correct response) pairs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"Label the sentiment of this sentence: The movie wasn't that great" → "Negative"&lt;/li&gt;
&lt;li&gt;"Summarize: Hawaii Electric urges caution as crews replace a utility pole overnight on the highway..." → "..."&lt;/li&gt;
&lt;li&gt;"Translate English to Chinese: When does the flight arrive?" → "..."&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The training method is the same as pretraining: predict the next word, cross-entropy loss. The only thing that changes is the data. This is where the model learns to respond to task instructions.&lt;/p&gt;

&lt;h3&gt;
  
  
  Stage 3: Preference Alignment
&lt;/h3&gt;

&lt;p&gt;Even after instruction tuning, the model might do things you don't want. Ask it how to embezzle money, and it might just... explain how to embezzle money. Technically correct response to the instruction. Not what you want.&lt;/p&gt;

&lt;p&gt;Preference alignment teaches the model social norms and good behavior. The training data looks like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Human: "How can I embezzle money?"&lt;/li&gt;
&lt;li&gt;Good response (thumbs up): "Embezzling is a felony. I can't help you with..."&lt;/li&gt;
&lt;li&gt;Bad response (thumbs down): "Start by creating fake expense reports..."&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The model learns which kinds of responses humans prefer. Reinforcement learning from human feedback (RLHF) is one common technique here, though there are others.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;This is where auditing matters most&lt;/strong&gt;. The alignment dataset determines what counts as good behavior. If that dataset has blind spots, the model has blind spots. And alignment datasets are generally not fully released by companies, making independent auditing difficult.&lt;/p&gt;

&lt;p&gt;Alignment is also hard to scale. You can't hand-label every possible scenario. Some current research focuses on generating synthetic alignment datasets that cover more ground than human annotators alone could.&lt;/p&gt;




&lt;h2&gt;
  
  
  Evaluating LLMs
&lt;/h2&gt;

&lt;p&gt;How do you know one LLM is better than another?&lt;/p&gt;

&lt;h3&gt;
  
  
  Perplexity (Still)
&lt;/h3&gt;

&lt;p&gt;The foundation metric is still perplexity, the same concept from n-gram models. Given a test set 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;w1:nw_{1:n} &lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;w&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mtight"&gt;1&lt;/span&gt;&lt;span class="mrel mtight"&gt;:&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;n&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
:&lt;/p&gt;


&lt;div class="katex-element"&gt;
  &lt;span class="katex-display"&gt;&lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;Perplexityθ(w1:n)=Pθ(w1:n)−1n=∏i=1n1Pθ(wi∣w&amp;lt;i)n
\text{Perplexity}\theta(w{1:n}) = P_\theta(w_{1:n})^{-\frac{1}{n}} = \sqrt[n]{\prod_{i=1}^{n} \frac{1}{P_\theta(w_i \mid w_{&amp;lt;i})}}
&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord text"&gt;&lt;span class="mord"&gt;Perplexity&lt;/span&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;θ&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord mathnormal"&gt;w&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord"&gt;1&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;:&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;n&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;=&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;P&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;θ&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;w&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mtight"&gt;1&lt;/span&gt;&lt;span class="mrel mtight"&gt;:&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;n&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose"&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mtight"&gt;−&lt;/span&gt;&lt;span class="mord mtight"&gt;&lt;span class="mopen nulldelimiter sizing reset-size3 size6"&gt;&lt;/span&gt;&lt;span class="mfrac"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size3 size1 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;n&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="frac-line mtight"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size3 size1 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mtight"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span 
class="mclose nulldelimiter sizing reset-size3 size6"&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;=&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord sqrt"&gt;&lt;span class="root"&gt;&lt;span class="vlist-t"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size1 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;n&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span class="svg-align"&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mop op-limits"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;i&lt;/span&gt;&lt;span class="mrel mtight"&gt;=&lt;/span&gt;&lt;span class="mord mtight"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="mop op-symbol large-op"&gt;∏&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;n&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mopen nulldelimiter"&gt;&lt;/span&gt;&lt;span class="mfrac"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;P&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;θ&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;w&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;i&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span 
class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;∣&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;w&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mrel mtight"&gt;&amp;lt;&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;i&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="frac-line"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose nulldelimiter"&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="hide-tail"&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/div&gt;


&lt;p&gt;Perplexity is the inverse probability of the test set, normalized by the number of words. Lower perplexity means the model assigns higher probability to the text it's asked to predict.&lt;/p&gt;

&lt;p&gt;Caveat: perplexity is sensitive to tokenization and length, so comparing two models with different tokenizers is unreliable. Best used when comparing LMs that share the same tokenizer.&lt;/p&gt;
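&lt;p&gt;If you want to see the formula in action, here's a minimal sketch in Python. It assumes you've already pulled per-token probabilities out of some model; the numbers below are made up.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import math

def perplexity(token_probs):
    """Perplexity from per-token probabilities P(w_i given its context).

    Computed in log space so long test sets don't underflow.
    """
    n = len(token_probs)
    log_prob_sum = sum(math.log(p) for p in token_probs)
    # exp(-mean log prob) is the nth root of the inverse product
    return math.exp(-log_prob_sum / n)

print(perplexity([0.2, 0.5, 0.1, 0.4]))    # about 3.98  (better model)
print(perplexity([0.05, 0.1, 0.02, 0.1]))  # about 17.8  (worse model)
&lt;/code&gt;&lt;/pre&gt;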

&lt;h3&gt;
  
  
  Beyond Perplexity
&lt;/h3&gt;

&lt;p&gt;Perplexity doesn't capture everything you care about. Other evaluation factors:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Size&lt;/strong&gt;: big models take lots of GPUs, time, and memory. A smaller model with similar performance is usually preferred.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Energy usage&lt;/strong&gt;: measured in kWh or kilograms of CO₂ emitted. Environmental impact is non-trivial for huge models.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fairness&lt;/strong&gt;: benchmarks for gendered and racial stereotypes, and for decreased performance on language from or about minority groups. (Have you heard about the &lt;code&gt;DECASTE&lt;/code&gt; framework? Read more &lt;a href="https://arxiv.org/abs/2505.14971" rel="noopener noreferrer"&gt;here&lt;/a&gt;.)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;None of these show up in raw perplexity numbers, but they matter when deploying models into the real world.&lt;/p&gt;




&lt;h2&gt;
  
  
  Ethical and Safety Issues
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://www.gutenberg.org/files/84/84-h/84-h.htm" rel="noopener noreferrer"&gt;Mary Shelley wrote Frankenstein&lt;/a&gt; about the problem of creating artificial agents without considering the ethical consequences. Two hundred years later, those questions are still open.&lt;/p&gt;

&lt;h3&gt;
  
  
  Hallucination
&lt;/h3&gt;

&lt;p&gt;LLMs generate fluent, confident text about things that aren't true. Two examples that made the news:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;An Air Canada chatbot invented a policy that didn't exist, and a customer relied on it. The airline argued the chatbot was responsible for its own statements. The tribunal disagreed, and Air Canada was held liable for what its chatbot said.&lt;/li&gt;
&lt;li&gt;AI systems have fabricated defamatory "facts" about real people, creating actual harm with limited legal recourse for victims.&lt;/li&gt;
&lt;/ul&gt;


&lt;div class="crayons-card c-embed"&gt;

  &lt;br&gt;
&lt;strong&gt;Hallucination isn't a bug that can be patched.&lt;/strong&gt; It's a direct consequence of the training objective. The model optimizes for fluent, plausible text, not for factual accuracy. There's no part of the objective that says "only output things you're sure are true."&lt;br&gt;

&lt;/div&gt;


&lt;h3&gt;
  
  
  Privacy
&lt;/h3&gt;

&lt;p&gt;Training data can include private information that the model then memorizes. Researchers have extracted real people's email addresses from ChatGPT, personal data the model picked up from its training set and should never have exposed.&lt;/p&gt;

&lt;h3&gt;
  
  
  Abuse, Toxicity, and Other Harms
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Bing's AI chat threatened users in its early release&lt;/li&gt;
&lt;li&gt;Kenyan contractors working to clean training data for ChatGPT reported trauma from the content they had to screen&lt;/li&gt;
&lt;li&gt;Models can suggest dangerous actions, enable fraud, foster emotional dependence, and reproduce biases present in training data&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;None of these problems is fully solved. Alignment helps but has limits. Safety filters catch some things and miss others. Careful deployment reduces harm but can't eliminate it. This is the ground modern NLP is being built on.&lt;/p&gt;




&lt;h2&gt;
  
  
  What You Now Have
&lt;/h2&gt;

&lt;p&gt;Eight things from this post:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The LLM definition&lt;/strong&gt;: a computational agent that interacts conversationally with people. Behavioral, not architectural.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;What pretraining teaches&lt;/strong&gt;: ontologies, superlatives, facts, math, and pronoun resolution. All of it is implicitly learned from nothing more than next-word prediction on huge corpora.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Three architectures&lt;/strong&gt;: decoder-only (GPT/Claude/Llama) for generation, encoder-only (BERT) for representations and classification, encoder-decoder (Flan-T5/Whisper) for sequence-to-sequence tasks like translation.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Conditional generation&lt;/strong&gt;: cast any task as predicting the next word. Sentiment analysis, QA, and classification all become prompting problems.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Prompting and in-context learning&lt;/strong&gt;: prompts steer behavior through context, not parameters. Few-shot demonstrations work without updating any weights. System prompts can be 1,700+ words of silent guidance.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Sampling strategies&lt;/strong&gt;: greedy is deterministic and boring. Random sampling hits problems with the tail of the distribution. Temperature sampling (softmax with u/τ) reshapes the distribution to balance quality and diversity. A short sketch of temperature sampling follows this list.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Three training stages&lt;/strong&gt;: pretraining (raw text, self-supervised), instruction tuning (task demonstrations), preference alignment (social norms and safety). Each stage addresses limitations that the previous stage couldn't.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The problem landscape&lt;/strong&gt;: hallucination is structural, not a bug. Copyright attribution in LLM outputs is technically unsolved. Alignment is imperfect and hard to audit. Privacy leaks and bias are real. These aren't side issues; they're central to deploying LLMs responsibly.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
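&lt;p&gt;To make the sampling point concrete, here's a minimal sketch of temperature sampling. It assumes you have a vector of raw logits over the vocabulary; the names and numbers are illustrative, not from any particular library.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import numpy as np

rng = np.random.default_rng(0)

def sample_with_temperature(logits, temperature=1.0):
    """Divide the logits by the temperature, softmax, then sample one token id."""
    scaled = np.asarray(logits) / temperature
    scaled = scaled - scaled.max()                # for numerical stability
    probs = np.exp(scaled) / np.exp(scaled).sum()
    return rng.choice(len(probs), p=probs)

logits = [4.0, 2.0, 1.0, 0.5]                     # made-up scores for 4 tokens
# low temperature approaches greedy decoding; high temperature flattens the distribution
for tau in (0.2, 1.0, 2.0):
    print(tau, sample_with_temperature(logits, tau))
&lt;/code&gt;&lt;/pre&gt;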


&lt;div class="crayons-card c-embed"&gt;

  &lt;br&gt;
&lt;strong&gt;Next up:&lt;/strong&gt; RNNs and LSTMs — the sequence architectures that came between feedforward nets and transformers. They solve the fixed-window problem in a different way than transformers do, and understanding them is the last piece of the puzzle before we open up the transformer black box.&lt;br&gt;

&lt;/div&gt;


</description>
      <category>llm</category>
      <category>nlp</category>
      <category>machinelearning</category>
      <category>ai</category>
    </item>
    <item>
      <title>From Perceptrons to Predicting the Next Word</title>
      <dc:creator>Akash</dc:creator>
      <pubDate>Thu, 09 Apr 2026 09:42:28 +0000</pubDate>
      <link>https://forem.com/kambleakash0/from-perceptrons-to-predicting-the-next-word-3934</link>
      <guid>https://forem.com/kambleakash0/from-perceptrons-to-predicting-the-next-word-3934</guid>
      <description>&lt;h2&gt;
  
  
  Neural Networks as Language Models
&lt;/h2&gt;

&lt;p&gt;By the end of this post, you'll understand how feedforward neural networks work at the unit level (inputs, weights, activation functions), why a perceptron fails on XOR and what that failure teaches about the need for hidden layers, and how a two-layer feedforward network can implement a language model that beats n-grams on both sparsity and storage. You'll also see two wrong ways and one right way to feed text into a neural net, understand how backpropagation trains the whole system, including the embedding layer, and know conceptually what separates a language model from a &lt;em&gt;large&lt;/em&gt; language model.&lt;/p&gt;


&lt;div class="crayons-card c-embed"&gt;

  &lt;br&gt;
&lt;strong&gt;One idea ties everything together:&lt;/strong&gt; every architecture in this course, from n-grams to feedforward nets to the transformers we'll eventually reach, is applied to the same task. Predict the next word. This post covers the first neural version of that task.&lt;br&gt;

&lt;/div&gt;





&lt;h2&gt;
  
  
  Part 1: The Components
&lt;/h2&gt;

&lt;p&gt;Before building a neural language model, we need to understand the parts: a single unit, what happens when a unit hits its limits, and what you get when you stack units into layers.&lt;/p&gt;

&lt;h3&gt;
  
  
  A Single Neural Unit
&lt;/h3&gt;

&lt;p&gt;A neural unit takes inputs, multiplies each by a weight, sums, adds a bias, and applies a non-linear function:&lt;/p&gt;

&lt;p&gt;

&lt;/p&gt;
&lt;div class="katex-element"&gt;
  &lt;span class="katex-display"&gt;&lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;z=w⃗⋅x⃗+b
z = \vec{w} \cdot \vec{x} + b
&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;z&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;=&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord accent"&gt;&lt;span class="vlist-t"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;w&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="accent-body"&gt;&lt;span class="overlay"&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;⋅&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord accent"&gt;&lt;span class="vlist-t"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;x&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="accent-body"&gt;&lt;span class="overlay"&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;+&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;b&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/div&gt;



&lt;div class="katex-element"&gt;
  &lt;span class="katex-display"&gt;&lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;y=f(z)
y = f(z)
&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;y&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;=&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;f&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord mathnormal"&gt;z&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/div&gt;


&lt;p&gt;where 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;ff &lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;f&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
 is a non-linear activation function. The non-linearity matters. Without it, stacking layers just produces another linear transformation, and depth buys you nothing.&lt;/p&gt;
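&lt;p&gt;As a quick sanity check on those two equations, here's a single unit as a few lines of NumPy. The weights, bias, and inputs are made-up numbers.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import numpy as np

def neural_unit(x, w, b, f):
    """z = w . x + b, then y = f(z)."""
    z = np.dot(w, x) + b
    return f(z)

x = np.array([0.5, 0.6, 0.1])         # inputs
w = np.array([0.2, 0.3, 0.9])         # weights
b = -0.1                              # bias
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
print(neural_unit(x, w, b, sigmoid))  # a single activation between 0 and 1
&lt;/code&gt;&lt;/pre&gt;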

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feeout55s06cn1kbq6p5g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feeout55s06cn1kbq6p5g.png" alt="Image1" width="800" height="567"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Three Activation Functions
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Sigmoid&lt;/strong&gt;: 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;σ(x)=11+e−x\sigma(x) = \frac{1}{1 + e^{-x}} &lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;σ&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord mathnormal"&gt;x&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;=&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mopen nulldelimiter"&gt;&lt;/span&gt;&lt;span class="mfrac"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mtight"&gt;1&lt;/span&gt;&lt;span class="mbin mtight"&gt;+&lt;/span&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;e&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size3 size1 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mtight"&gt;−&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;x&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="frac-line"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mtight"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose nulldelimiter"&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
. Squashes output into (0, 1). Differentiable, and has a probabilistic interpretation. The problem: the gradient saturates to near-zero for large positive or negative inputs, which is called the &lt;strong&gt;vanishing gradient problem&lt;/strong&gt;, and it makes deep sigmoid networks hard to train.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ReLU&lt;/strong&gt;: 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;f(x)=max⁡(0,x)f(x) = \max(0, x) &lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;f&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord mathnormal"&gt;x&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;=&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mop"&gt;max&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord"&gt;0&lt;/span&gt;&lt;span class="mpunct"&gt;,&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;x&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
. Zero for negative inputs, identity for positive. Simple, fast, no vanishing gradient for positive values. The default for hidden layers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tanh&lt;/strong&gt;: 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;tanh⁡(x)\tanh(x) &lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mop"&gt;tanh&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord mathnormal"&gt;x&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
. Like sigmoid but centered at zero, outputting [-1, 1]. Zero-centered outputs often help with training.&lt;/p&gt;

&lt;p&gt;Both ReLU and tanh outperform sigmoid in practice. When designing your own network, you pick the activation function per layer; different problems work better with different choices.&lt;/p&gt;
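&lt;p&gt;For reference, the three activations as one-liners in NumPy (a sketch, nothing framework-specific):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))   # squashes into (0, 1); saturates at the extremes

def relu(z):
    return np.maximum(0.0, z)         # zero for negatives, identity for positives

def tanh(z):
    return np.tanh(z)                 # zero-centered, range (-1, 1)

z = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
print(sigmoid(z), relu(z), tanh(z), sep="\n")
&lt;/code&gt;&lt;/pre&gt;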

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faljpojkfvtm59z1e6w5k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faljpojkfvtm59z1e6w5k.png" alt="Image2" width="800" height="593"&gt;&lt;/a&gt;&lt;/p&gt;


&lt;h3&gt;
  
  
  The XOR Problem: Why Hidden Layers Exist
&lt;/h3&gt;

&lt;p&gt;A &lt;strong&gt;perceptron&lt;/strong&gt; is the simplest neural unit: binary inputs, step function, binary output. It can compute AND and OR:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;AND&lt;/strong&gt;: 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;w1=1,w2=1,b=−1w_1 = 1, w_2 = 1, b = -1 &lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;w&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;=&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;1&lt;/span&gt;&lt;span class="mpunct"&gt;,&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;w&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;2&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;=&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;1&lt;/span&gt;&lt;span class="mpunct"&gt;,&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;b&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;=&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;−&lt;/span&gt;&lt;span class="mord"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
. Only fires when both inputs are 1.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OR&lt;/strong&gt;: 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;w1=1,w2=1,b=0w_1 = 1, w_2 = 1, b = 0 &lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;w&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;=&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;1&lt;/span&gt;&lt;span class="mpunct"&gt;,&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;w&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;2&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;=&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;1&lt;/span&gt;&lt;span class="mpunct"&gt;,&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;b&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;=&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;0&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
. Fires when either input is 1.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Both are &lt;strong&gt;linearly separable&lt;/strong&gt;; you can draw a straight line that separates the outputs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;XOR&lt;/strong&gt; (output 1 when exactly one input is 1) is not. Plot the four input combinations on a 2D grid: (0,0)→0, (0,1)→1, (1,0)→1, (1,1)→0. No single line separates the 1s from the 0s.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6t1jbptb1w3fqzger9mm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6t1jbptb1w3fqzger9mm.png" alt="Image3" width="800" height="523"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The fix: add a &lt;strong&gt;hidden layer&lt;/strong&gt; with two ReLU units. The hidden layer transforms the inputs into a new space where XOR &lt;em&gt;becomes&lt;/em&gt; linearly separable. The input points (0,1) and (1,0) get mapped to the same hidden representation, so a single line in hidden space separates the 1-outputs from the 0-outputs.&lt;/p&gt;


&lt;div class="crayons-card c-embed"&gt;

  &lt;br&gt;
&lt;strong&gt;This is why neural networks have hidden layers.&lt;/strong&gt; The hidden layer creates a representation in which the problem is solvable. The same principle applies to every subsequent architecture. Feedforward nets, RNNs, and transformers — they all learn intermediate representations that make the downstream task easier to solve.&lt;br&gt;

&lt;/div&gt;


&lt;p&gt;&lt;strong&gt;Verify the XOR solution yourself.&lt;/strong&gt; The network has inputs x₁, x₂, hidden units h₁, h₂ (ReLU), and output y₁ (ReLU). Weights: both inputs connect to both hidden units with weight 1. h₁ has bias 0, h₂ has bias −1. Hidden-to-output weights: h₁ connects with weight 1, h₂ connects with weight −2. Output bias is 0.&lt;/p&gt;

&lt;p&gt;Plug in all four input combinations: [0,0], [0,1], [1,0], [1,1]. Confirm that the outputs are 0, 1, 1, 0 respectively. The ReLU clamps negatives to zero, which is what makes the hidden representation work.&lt;/p&gt;
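&lt;p&gt;If you'd rather let NumPy do the plugging-in, here's that exact network as a sketch:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import numpy as np

relu = lambda z: np.maximum(0.0, z)

W = np.array([[1.0, 1.0],     # both inputs feed h1 with weight 1
              [1.0, 1.0]])    # and h2 with weight 1
b = np.array([0.0, -1.0])     # h1 bias 0, h2 bias -1
u = np.array([1.0, -2.0])     # h1 weight 1, h2 weight -2 into the output
c = 0.0                       # output bias

for x in ([0, 0], [0, 1], [1, 0], [1, 1]):
    h = relu(W @ np.array(x, dtype=float) + b)
    y = relu(u @ h + c)
    print(x, int(y))          # prints 0, 1, 1, 0 for the four inputs
&lt;/code&gt;&lt;/pre&gt;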




&lt;h3&gt;
  
  
  Feedforward Networks: The Full Stack
&lt;/h3&gt;

&lt;p&gt;Stack units into layers. Information flows in one direction: input → hidden → output. No cycles. This is a &lt;strong&gt;feedforward network&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;For a two-layer network:&lt;/p&gt;


&lt;div class="katex-element"&gt;
  &lt;span class="katex-display"&gt;&lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;h=σ(Wx+b)
h = \sigma(Wx + b)
&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;h&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;=&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;σ&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord mathnormal"&gt;W&lt;/span&gt;&lt;span class="mord mathnormal"&gt;x&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;+&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;b&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/div&gt;



&lt;div class="katex-element"&gt;
  &lt;span class="katex-display"&gt;&lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;z=Uh
z = Uh
&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;z&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;=&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;U&lt;/span&gt;&lt;span class="mord mathnormal"&gt;h&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/div&gt;



&lt;div class="katex-element"&gt;
  &lt;span class="katex-display"&gt;&lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;y=softmax(z)
y = \text{softmax}(z)
&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;y&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;=&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord text"&gt;&lt;span class="mord"&gt;softmax&lt;/span&gt;&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord mathnormal"&gt;z&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/div&gt;


&lt;p&gt;The shapes to keep track of:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Variable&lt;/th&gt;
&lt;th&gt;Shape&lt;/th&gt;
&lt;th&gt;What it is&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;xx &lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;x&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
&lt;/td&gt;
&lt;td&gt;
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;n0×1n_0 \times 1 &lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;n&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;0&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;×&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
&lt;/td&gt;
&lt;td&gt;Input vector&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;WW &lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;W&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
&lt;/td&gt;
&lt;td&gt;
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;n1×n0n_1 \times n_0 &lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;n&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;×&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;n&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;0&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
&lt;/td&gt;
&lt;td&gt;Hidden layer weights&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;hh &lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;h&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
&lt;/td&gt;
&lt;td&gt;
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;n1×1n_1 \times 1 &lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;n&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;×&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
&lt;/td&gt;
&lt;td&gt;Hidden layer output&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;UU &lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;U&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
&lt;/td&gt;
&lt;td&gt;
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;n2×n1n_2 \times n_1 &lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;n&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;2&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;×&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;n&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
&lt;/td&gt;
&lt;td&gt;Output layer weights&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;yy &lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;y&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
&lt;/td&gt;
&lt;td&gt;
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;n2×1n_2 \times 1 &lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;n&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;2&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;×&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
&lt;/td&gt;
&lt;td&gt;Output probabilities&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Softmax&lt;/strong&gt; normalizes the output into a probability distribution, values between 0 and 1, summing to 1:&lt;/p&gt;


&lt;div class="katex-element"&gt;
  &lt;span class="katex-display"&gt;&lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;softmax(zi)=ezi∑j=1dezj
\text{softmax}(z_i) = \frac{e^{z_i}}{\sum_{j=1}^{d} e^{z_j}}
&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord text"&gt;&lt;span class="mord"&gt;softmax&lt;/span&gt;&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;z&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;i&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;=&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mopen nulldelimiter"&gt;&lt;/span&gt;&lt;span class="mfrac"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mop"&gt;&lt;span class="mop op-symbol small-op"&gt;∑&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;j&lt;/span&gt;&lt;span class="mrel mtight"&gt;=&lt;/span&gt;&lt;span class="mord mtight"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;d&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;e&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;z&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size3 size1 mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;j&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="frac-line"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span 
class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;e&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;z&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size3 size1 mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;i&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose nulldelimiter"&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/div&gt;


&lt;p&gt;A useful way to think about it: a neural network is logistic regression with two upgrades. First, multiple layers instead of one. Second, instead of hand-crafted features, the hidden layers learn their own representations. Even transformers usually end with a feedforward layer plus softmax.&lt;/p&gt;
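&lt;p&gt;Putting the pieces together, here's a minimal two-layer forward pass in NumPy. The sizes and weights are random placeholders; the point is the shapes and the softmax at the end.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import numpy as np

def softmax(z):
    z = z - z.max()                        # subtract the max for numerical stability
    return np.exp(z) / np.exp(z).sum()

rng = np.random.default_rng(0)
n0, n1, n2 = 4, 3, 5                       # input, hidden, output sizes

x = rng.normal(size=(n0,))                 # input vector      (n0,)
W = rng.normal(size=(n1, n0))              # hidden weights    (n1, n0)
b = np.zeros(n1)                           # hidden bias       (n1,)
U = rng.normal(size=(n2, n1))              # output weights    (n2, n1)

h = 1.0 / (1.0 + np.exp(-(W @ x + b)))     # h = sigma(Wx + b)
y = softmax(U @ h)                         # z = Uh, y = softmax(z)
print(y, y.sum())                          # probabilities, summing to 1.0
&lt;/code&gt;&lt;/pre&gt;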




&lt;h2&gt;
  
  
  Part 2: Neural Nets Meet NLP
&lt;/h2&gt;

&lt;p&gt;Now we apply these components to language. The progression matters: two approaches that &lt;em&gt;almost&lt;/em&gt; work, then the one that does.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Near Miss: Feature-Based Sentiment (ML 1.0)
&lt;/h3&gt;

&lt;p&gt;Input: "dessert was great." Extract features by hand: word count = 3, positive lexicon words = 1 ("great"), negation count = 0. Feed these three numbers into a feedforward net. Output: P(positive), P(negative), P(neutral).&lt;/p&gt;

&lt;p&gt;This technically works. But you've done the hard part yourself (deciding which features matter) and handed the neural net a trivial job. We call this "ML 1.0." You're paying for a neural network but using it as glorified logistic regression.&lt;/p&gt;

&lt;h3&gt;
  
  
  Better: Pooled Embeddings (ML 2.0)
&lt;/h3&gt;

&lt;p&gt;Look up embeddings for "dessert," "was," "great." Average them into a single vector. Feed that into the network.&lt;/p&gt;

&lt;p&gt;Better. You're at least using learned representations. But averaging embeddings throws away word order. "Not great" and "great" produce nearly identical pooled vectors, even though they mean opposite things.&lt;/p&gt;
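&lt;p&gt;A tiny sketch of what averaging throws away: two phrases with the same bag of words but very different meanings pool to exactly the same vector. The embeddings here are made-up two-dimensional numbers.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import numpy as np

# hypothetical 2-d embeddings, just for illustration
emb = {
    "not":   np.array([ 0.1, -0.2]),
    "good":  np.array([ 0.7,  0.5]),
    "great": np.array([ 0.9,  0.8]),
    "was":   np.array([ 0.0,  0.1]),
}

def pool(words):
    return np.mean([emb[w] for w in words], axis=0)

# Same words, different order and meaning, identical pooled vector:
print(pool("not good was great".split()))
print(pool("not great was good".split()))
&lt;/code&gt;&lt;/pre&gt;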

&lt;p&gt;Both of these are stepping stones. The real application of neural nets in NLP is language modeling.&lt;/p&gt;




&lt;h3&gt;
  
  
  The Fixed-Window Neural Language Model
&lt;/h3&gt;

&lt;p&gt;Same task as always: given some context words, predict the next word. Instead of counting co-occurrences, we use a neural network.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The architecture, step by step:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Take a fixed window of context words: "the students opened their"&lt;/li&gt;
&lt;li&gt;Represent each as a &lt;strong&gt;one-hot vector&lt;/strong&gt; (length 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;∣V∣|V| &lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;∣&lt;/span&gt;&lt;span class="mord mathnormal"&gt;V&lt;/span&gt;&lt;span class="mord"&gt;∣&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
, all zeros except a single 1)&lt;/li&gt;
&lt;li&gt;Multiply each one-hot vector by the &lt;strong&gt;embedding matrix&lt;/strong&gt; 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;EE &lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;E&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
 (
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;d×∣V∣d \times |V| &lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;d&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;×&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;∣&lt;/span&gt;&lt;span class="mord mathnormal"&gt;V&lt;/span&gt;&lt;span class="mord"&gt;∣&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
) — this retrieves the word's embedding&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Concatenate&lt;/strong&gt; all context embeddings into one long vector&lt;/li&gt;
&lt;li&gt;Feed through the hidden layer: 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;h=σ(We+b)h = \sigma(We + b) &lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;h&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;=&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;σ&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord mathnormal"&gt;W&lt;/span&gt;&lt;span class="mord mathnormal"&gt;e&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;+&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;b&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
&lt;/li&gt;
&lt;li&gt;Output layer: 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;y=softmax(Uh)y = \text{softmax}(Uh) &lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;y&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;=&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord text"&gt;&lt;span class="mord"&gt;softmax&lt;/span&gt;&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord mathnormal"&gt;U&lt;/span&gt;&lt;span class="mord mathnormal"&gt;h&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
 — probability distribution over the entire vocabulary&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The predicted word is the one with the highest probability. Slide the window forward, repeat.&lt;/p&gt;
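&lt;p&gt;Here's the whole pipeline as a rough sketch: one-hot lookup, embedding, concatenation, hidden layer, softmax over the vocabulary. The toy vocabulary, dimensions, and weights are all made up.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import numpy as np

rng = np.random.default_rng(0)
vocab = ["the", "students", "opened", "their", "books", "minds"]
V, d, hidden = len(vocab), 8, 16
word_to_id = {w: i for i, w in enumerate(vocab)}

E = rng.normal(size=(d, V))                # embedding matrix, d x |V|
W = rng.normal(size=(hidden, 4 * d))       # hidden weights for a 4-word window
b = np.zeros(hidden)
U = rng.normal(size=(V, hidden))           # output weights over the vocabulary

def softmax(z):
    z = z - z.max()
    return np.exp(z) / np.exp(z).sum()

def one_hot(word):
    x = np.zeros(V)
    x[word_to_id[word]] = 1.0
    return x

context = ["the", "students", "opened", "their"]
e = np.concatenate([E @ one_hot(w) for w in context])    # concatenated embeddings
h = 1.0 / (1.0 + np.exp(-(W @ e + b)))                    # hidden layer
y = softmax(U @ h)                                        # distribution over vocab
print(vocab[int(np.argmax(y))])                           # predicted next word
&lt;/code&gt;&lt;/pre&gt;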

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcq9klqtq53vuzhaaxzmt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcq9klqtq53vuzhaaxzmt.png" alt="Image4" width="800" height="785"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The equations:&lt;/p&gt;


&lt;div class="katex-element"&gt;
  &lt;span class="katex-display"&gt;&lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;e=(Ex1,Ex2,…,Exc)
e = (Ex_1, Ex_2, \dots, Ex_c)
&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;e&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;=&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord mathnormal"&gt;E&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;x&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mpunct"&gt;,&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;E&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;x&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;2&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mpunct"&gt;,&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="minner"&gt;…&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mpunct"&gt;,&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;E&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;x&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;c&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/div&gt;



&lt;div class="katex-element"&gt;
  &lt;span class="katex-display"&gt;&lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;h=σ(We+b)
h = \sigma(We + b)
&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;h&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;=&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;σ&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord mathnormal"&gt;W&lt;/span&gt;&lt;span class="mord mathnormal"&gt;e&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;+&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;b&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/div&gt;



&lt;div class="katex-element"&gt;
  &lt;span class="katex-display"&gt;&lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;z=Uh
z = Uh
&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;z&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;=&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;U&lt;/span&gt;&lt;span class="mord mathnormal"&gt;h&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/div&gt;



&lt;div class="katex-element"&gt;
  &lt;span class="katex-display"&gt;&lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;y=softmax(z)
y = \text{softmax}(z)
&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;y&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;=&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord text"&gt;&lt;span class="mord"&gt;softmax&lt;/span&gt;&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord mathnormal"&gt;z&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/div&gt;


&lt;h3&gt;
  
  
  Neural LM vs. N-gram LM
&lt;/h3&gt;

&lt;p&gt;Two concrete improvements:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sparsity is gone.&lt;/strong&gt; N-gram models require the exact word sequence to appear in training data. "Students opened their" never appeared? Probability is zero. The neural LM uses embeddings instead. "Students" and "pupils" have similar embeddings, so the model generalizes to unseen combinations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Storage is linear, not exponential.&lt;/strong&gt; N-gram model size is 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;O(exp⁡(n))O(\exp(n)) &lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;O&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mop"&gt;exp&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord mathnormal"&gt;n&lt;/span&gt;&lt;span class="mclose"&gt;))&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
 — storing counts for all possible n-grams. Neural LM parameters are the weight matrices, which scale as 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;O(n)O(n) &lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;O&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord mathnormal"&gt;n&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
 with window size.&lt;/p&gt;
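&lt;p&gt;For a rough sense of scale (hypothetical numbers): with a 50,000-word vocabulary, a 4-gram count table has up to 50,000⁴ ≈ 6.3 × 10¹⁸ possible entries, while a neural LM with 100-dimensional embeddings and a 256-unit hidden layer needs about 100 × 50,000 (for E) + 256 × 400 (for W) + 50,000 × 256 (for U) ≈ 18 million parameters. Widening the window only grows W.&lt;/p&gt;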

&lt;p&gt;N-gram models were the standard for speech recognition and OCR for decades. Neural LMs replaced them on the strength of these two improvements.&lt;/p&gt;
&lt;h3&gt;
  
  
  What's Still Missing
&lt;/h3&gt;

&lt;p&gt;The window is fixed. Four words. Maybe you can stretch it to eight. But you can't capture "The computer which I had just put into the machine room on the fifth floor &lt;strong&gt;crashed&lt;/strong&gt;" with a window of any practical size. LLMs condition on entire pages.&lt;/p&gt;

&lt;p&gt;Getting there requires architectures that process variable-length sequences. That's RNNs (next post) and eventually transformers.&lt;/p&gt;


&lt;h2&gt;
  
  
  Part 3: Training the Whole System
&lt;/h2&gt;
&lt;h3&gt;
  
  
  Learning Embeddings from Scratch
&lt;/h3&gt;

&lt;p&gt;You don't always plug in pre-trained Word2Vec embeddings. Sometimes the task itself should shape the embeddings.&lt;/p&gt;

&lt;p&gt;When training a neural language model, backpropagation updates all parameters: the embedding matrix 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;EE &lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;E&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
, hidden weights 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;WW &lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;W&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
, output weights 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;UU &lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;U&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
, and biases 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;bb &lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;b&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
. The embeddings evolve to serve the task. A language model produces embeddings tuned for next-word prediction. A sentiment system produces embeddings tuned for sentiment. A translation system produces embeddings tuned for translation.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F35xawdf4hj5tzkvw6zsw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F35xawdf4hj5tzkvw6zsw.png" alt="Image5" width="800" height="785"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The cost: more computation. You're backpropagating through every layer, including the embedding lookup. For simple tasks like sentiment on small datasets, pre-trained embeddings might be enough. For language modeling on large corpora, learning task-specific embeddings is usually worth the cost.&lt;/p&gt;
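
&lt;p&gt;Here's a minimal PyTorch sketch of that setup, with &lt;code&gt;nn.Embedding&lt;/code&gt; standing in for the embedding matrix E and two linear layers for W, b, and U. Every size below is a made-up default for illustration:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import torch
import torch.nn as nn

class FixedWindowLM(nn.Module):
    """Toy fixed-window neural LM; all sizes are illustrative."""
    def __init__(self, vocab_size=10_000, window=4, embed_dim=100, hidden_dim=200):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)          # the matrix E, learned with everything else
        self.hidden = nn.Linear(window * embed_dim, hidden_dim)   # W and b
        self.out = nn.Linear(hidden_dim, vocab_size)              # U
        self.act = nn.ReLU()

    def forward(self, context_ids):
        # context_ids: a (batch, window) tensor of integer word indices
        e = self.embed(context_ids)        # (batch, window, embed_dim)
        x = e.flatten(start_dim=1)         # concatenate the window of embeddings
        h = self.act(self.hidden(x))
        return self.out(h)                 # logits over the vocabulary
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Because &lt;code&gt;nn.Embedding&lt;/code&gt; is a parameter like any other, backpropagation updates it along with the rest of the network, so the embeddings get shaped by whatever loss you train against.&lt;/p&gt;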
&lt;h3&gt;
  
  
  The Training Loop
&lt;/h3&gt;

&lt;p&gt;Conceptually simple, computationally expensive:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Forward pass.&lt;/strong&gt; Feed context words through the network. Get a predicted probability distribution over the vocabulary.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Compute loss.&lt;/strong&gt; Compare the prediction to the actual next word, which you already know because it's right there in the corpus. Cross-entropy loss:&lt;/p&gt;


&lt;div class="katex-element"&gt;
  &lt;span class="katex-display"&gt;&lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;LCE=−log⁡P(wt∣wt−1,…,wt−n+1)
L_{CE} = -\log P(w_t \mid w_{t-1}, \dots, w_{t-n+1})
&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;L&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;CE&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;=&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;−&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mop"&gt;lo&lt;span&gt;g&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;P&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;w&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;t&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;∣&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;w&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;t&lt;/span&gt;&lt;span class="mbin mtight"&gt;−&lt;/span&gt;&lt;span class="mord mtight"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mpunct"&gt;,&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="minner"&gt;…&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mpunct"&gt;,&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;w&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;t&lt;/span&gt;&lt;span class="mbin mtight"&gt;−&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;n&lt;/span&gt;&lt;span class="mbin mtight"&gt;+&lt;/span&gt;&lt;span class="mord 
mtight"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/div&gt;


&lt;p&gt;The loss is just the negative log of the probability the model assigned to the correct word. High confidence in the right answer = low loss.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Backward pass.&lt;/strong&gt; Compute derivatives of the loss with respect to every parameter using the chain rule. Update each parameter:&lt;/p&gt;


&lt;div class="katex-element"&gt;
  &lt;span class="katex-display"&gt;&lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;θs+1=θs−η∂L∂θ
\theta^{s+1} = \theta^s - \eta \frac{\partial L}{\partial \theta}
&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;θ&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;s&lt;/span&gt;&lt;span class="mbin mtight"&gt;+&lt;/span&gt;&lt;span class="mord mtight"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;=&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;θ&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;s&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;−&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;η&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mopen nulldelimiter"&gt;&lt;/span&gt;&lt;span class="mfrac"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord"&gt;∂&lt;/span&gt;&lt;span class="mord mathnormal"&gt;θ&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="frac-line"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord"&gt;∂&lt;/span&gt;&lt;span class="mord mathnormal"&gt;L&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose nulldelimiter"&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/div&gt;


&lt;p&gt;
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;η\eta &lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;η&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
 is the learning rate. Repeat for billions of words.&lt;/p&gt;

&lt;p&gt;The training signal is self-supervised: the next word in the corpus is always the ground truth. No annotation needed. Same principle as Word2Vec, just applied to a bigger architecture.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Forward and backward passes as a computation graph.&lt;/strong&gt; The entire computation can be drawn as a directed graph: x → multiply by W → add b → f(·) → multiply by U → softmax → loss. Forward pass: compute left to right. Backward pass: compute derivatives right to left via the chain rule. Each parameter gets a gradient, and SGD updates them all. Frameworks like PyTorch automate the backward pass; you define the forward computation, and it handles differentiation for you.&lt;/p&gt;
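
&lt;p&gt;A minimal sketch of that loop, assuming the toy &lt;code&gt;FixedWindowLM&lt;/code&gt; from the earlier sketch and a random batch standing in for real corpus data:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import torch
import torch.nn.functional as F

model = FixedWindowLM()                              # from the sketch above
opt = torch.optim.SGD(model.parameters(), lr=0.1)    # eta, the learning rate

# Toy batch: 32 contexts of 4 word ids each, plus the true next-word ids
contexts = torch.randint(0, 10_000, (32, 4))
targets = torch.randint(0, 10_000, (32,))

logits = model(contexts)                   # 1. forward pass
loss = F.cross_entropy(logits, targets)    # 2. -log P(correct next word)
opt.zero_grad()
loss.backward()                            # 3. backward pass: gradients via the chain rule
opt.step()                                 #    SGD step: theta minus eta times the gradient
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The softmax never appears explicitly: &lt;code&gt;F.cross_entropy&lt;/code&gt; takes raw logits and folds the softmax and the negative log into one numerically stable operation.&lt;/p&gt;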




&lt;h2&gt;
  
  
  The LM-to-LLM Boundary
&lt;/h2&gt;

&lt;p&gt;When does a language model become a &lt;em&gt;large&lt;/em&gt; language model? There's no sharp line. It's scale on three axes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Depth&lt;/strong&gt;: feedforward LMs have 2-3 layers. Transformers have 30-60+.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data&lt;/strong&gt;: n-gram LMs train on millions of words. LLMs train on trillions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Parameters&lt;/strong&gt;: neural LMs have millions. LLMs have billions to hundreds of billions.&lt;/li&gt;
&lt;/ul&gt;


&lt;div class="crayons-card c-embed"&gt;

  &lt;br&gt;
But every model covered so far, from bigrams to this feedforward neural LM to the transformers coming later, is doing the same thing: given context, predict the next word. That thread has been running since the &lt;a href="https://dev.to/kambleakash0/before-llms-could-predict-they-had-to-count-20p5"&gt;first post&lt;/a&gt; here.&lt;br&gt;

&lt;/div&gt;





&lt;h2&gt;
  
  
  What You Now Have
&lt;/h2&gt;

&lt;p&gt;Six things you didn't have before reading this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Neural units and activation functions.&lt;/strong&gt; Inputs × weights + bias, passed through a non-linearity. Sigmoid has vanishing gradients. ReLU is the practical default. Without non-linearity, depth is useless; multiple linear layers collapse into one.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The XOR lesson.&lt;/strong&gt; Perceptrons handle AND and OR but not XOR, because XOR isn't linearly separable. A hidden layer solves it by transforming the input into a new representation where the problem &lt;em&gt;is&lt;/em&gt; separable. This is why neural networks have hidden layers.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Feedforward network equations.&lt;/strong&gt; 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;h=σ(Wx+b)h = \sigma(Wx + b) &lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;h&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;=&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;σ&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord mathnormal"&gt;W&lt;/span&gt;&lt;span class="mord mathnormal"&gt;x&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;+&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;b&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
, 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;y=softmax(Uh)y = \text{softmax}(Uh) &lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;y&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;=&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord text"&gt;&lt;span class="mord"&gt;softmax&lt;/span&gt;&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord mathnormal"&gt;U&lt;/span&gt;&lt;span class="mord mathnormal"&gt;h&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
. Know the shapes of every matrix. Think of it as logistic regression with learned representations and multiple layers.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The fixed-window neural language model.&lt;/strong&gt; One-hot → embedding lookup → concatenate → hidden layer → softmax over vocabulary. Solves n-grams' sparsity problem (embeddings generalize) and storage problem (O(n) vs O(exp(n))). Remaining weakness: The window is too small for long-range dependencies.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Task-specific embeddings.&lt;/strong&gt; Backpropagation can update the embedding matrix alongside network weights, learning representations shaped by the task. More expensive, often more effective than plugging in generic Word2Vec.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The training loop.&lt;/strong&gt; Forward pass predicts, cross-entropy loss measures the error against the actual next word, and backpropagation updates all parameters. The corpus is its own label. Self-supervised, same as Word2Vec, just on a bigger architecture.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;


&lt;div class="crayons-card c-embed"&gt;

  &lt;br&gt;
&lt;strong&gt;Next post:&lt;/strong&gt; recurrent neural networks and LSTMs — architectures that process sequences of arbitrary length and finally break free of the fixed window.&lt;br&gt;

&lt;/div&gt;


</description>
      <category>nlp</category>
      <category>neuralnetworks</category>
      <category>machinelearning</category>
      <category>deeplearning</category>
    </item>
    <item>
      <title>From Counting Words to Learning Meaning</title>
      <dc:creator>Akash</dc:creator>
      <pubDate>Wed, 08 Apr 2026 09:25:43 +0000</pubDate>
      <link>https://forem.com/kambleakash0/from-counting-words-to-learning-meaning-292g</link>
      <guid>https://forem.com/kambleakash0/from-counting-words-to-learning-meaning-292g</guid>
      <description>&lt;h2&gt;
  
  
  TF-IDF, Cosine Similarity, and Word2Vec
&lt;/h2&gt;

&lt;p&gt;By the end of this post, you'll understand two fundamentally different ways of representing words as vectors:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;sparse count-based vectors from information retrieval, and&lt;/li&gt;
&lt;li&gt;dense learned vectors from Word2Vec.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You'll know how cosine similarity measures word closeness, how the skip-gram algorithm learns embeddings by training and then discarding a binary classifier, and why the resulting vectors can solve analogies like king - man + woman ≈ queen without anyone teaching the algorithm what "gender" or "royalty" means. You'll also understand why these embeddings inherit the biases of their training data, and what the difference is between static embeddings (one vector per word) and contextual embeddings (one vector per word &lt;em&gt;per sentence&lt;/em&gt;).&lt;/p&gt;


&lt;div class="crayons-card c-embed"&gt;

  &lt;br&gt;
&lt;strong&gt;Two ideas connect everything here.&lt;/strong&gt; First: you can represent a word's meaning by the company it keeps. Second: predicting context is a better way to learn meaning than counting context. Those two ideas took NLP from sparse lookup tables to dense learned representations, which is what made modern language models feasible.&lt;br&gt;

&lt;/div&gt;





&lt;h2&gt;
  
  
  Quick Recap: Why We Need Word Vectors
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://dev.to/kambleakash0/perplexity-smoothing-and-what-words-mean-3ej0"&gt;Last post&lt;/a&gt; ended with Wittgenstein's principle, "the meaning of a word is its use in the language," and the distributional hypothesis: words in similar contexts have similar meanings. Now we turn that into math.&lt;/p&gt;

&lt;p&gt;The practical motivation is simple. In a sentiment classifier, "terrible" in training and "awful" in testing are unrelated as raw strings. The classifier breaks. But if both words map to nearby vectors, the classifier generalizes. That's the payoff.&lt;/p&gt;

&lt;p&gt;There are two approaches. The first one's limitations are exactly what motivate the second.&lt;/p&gt;




&lt;h2&gt;
  
  
  Approach 1: Count [Sparse Vectors and TF-IDF]
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Words as Rows, Documents as Columns
&lt;/h3&gt;

&lt;p&gt;Take a corpus of &lt;a href="https://www.opensourceshakespeare.org/views/plays/plays.php" rel="noopener noreferrer"&gt;Shakespeare's plays&lt;/a&gt;. For each play, count how many times each word appears. Arrange this into a matrix: words as rows, plays as columns. This is the &lt;strong&gt;term-document matrix&lt;/strong&gt;. Each column is now a vector representing a play.&lt;/p&gt;

&lt;p&gt;Compute cosine similarity between the column vectors, and you get something useful: comedies cluster together, tragedies cluster together. "As You Like It" is more similar to "Twelfth Night" than to "Julius Caesar" because they share vocabulary. More "fool" and "love," less "battle" and "sword."&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqsgoxez3f6oollxpqgmo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqsgoxez3f6oollxpqgmo.png" alt="Image1" width="800" height="789"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The same idea works at the word level. Build a &lt;strong&gt;word-word co-occurrence matrix&lt;/strong&gt;: for each word, count how often every other word appears nearby (within some context window). Each row is now a word vector. Words that co-occur with similar neighbors tend to have similar vectors.&lt;/p&gt;
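
&lt;p&gt;Here's a small sketch of the counting step, assuming a toy sentence and a ±2-word window. The helper name and the corpus are invented for illustration:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;from collections import Counter

def cooccurrence_counts(tokens, window=2):
    """Count how often each word shows up within `window` words of each target."""
    counts = Counter()
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                counts[(target, tokens[j])] += 1
    return counts

tokens = "a tablespoon of apricot jam on a slice of bread".split()
counts = cooccurrence_counts(tokens)
print(counts[("apricot", "jam")])   # 1
&lt;/code&gt;&lt;/pre&gt;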

&lt;h3&gt;
  
  
  Measuring Closeness: Cosine Similarity
&lt;/h3&gt;

&lt;p&gt;Given two word vectors, how similar are they? The standard metric is &lt;strong&gt;cosine similarity&lt;/strong&gt; (&lt;em&gt;the dot product of the vectors, normalized by their lengths&lt;/em&gt;):&lt;/p&gt;

&lt;div class="katex-element"&gt;
  &lt;span class="katex-display"&gt;&lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;cosine(v⃗,w⃗)=v⃗⋅w⃗∣v⃗∣∣w⃗∣=∑i=1Nviwi∑i=1Nvi2⋅∑i=1Nwi2
\text{cosine}(\vec{v}, \vec{w}) = \frac{\vec{v} \cdot \vec{w}}{|\vec{v}||\vec{w}|} = \frac{\sum_{i=1}^{N} v_i w_i}{\sqrt{\sum_{i=1}^{N} v_i^2} \cdot \sqrt{\sum_{i=1}^{N} w_i^2}}
&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord text"&gt;&lt;span class="mord"&gt;cosine&lt;/span&gt;&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord accent"&gt;&lt;span class="vlist-t"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;v&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="accent-body"&gt;&lt;span class="overlay"&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mpunct"&gt;,&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord accent"&gt;&lt;span class="vlist-t"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;w&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="accent-body"&gt;&lt;span class="overlay"&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;=&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mopen nulldelimiter"&gt;&lt;/span&gt;&lt;span class="mfrac"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord"&gt;∣&lt;/span&gt;&lt;span class="mord accent"&gt;&lt;span class="vlist-t"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;v&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="accent-body"&gt;&lt;span class="overlay"&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mord"&gt;∣∣&lt;/span&gt;&lt;span class="mord accent"&gt;&lt;span class="vlist-t"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;w&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="accent-body"&gt;&lt;span class="overlay"&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mord"&gt;∣&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="frac-line"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord accent"&gt;&lt;span class="vlist-t"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;v&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="accent-body"&gt;&lt;span class="overlay"&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;⋅&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord accent"&gt;&lt;span class="vlist-t"&gt;&lt;span class="vlist-r"&gt;&lt;span 
class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;w&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="accent-body"&gt;&lt;span class="overlay"&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose nulldelimiter"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;=&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mopen nulldelimiter"&gt;&lt;/span&gt;&lt;span class="mfrac"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord sqrt"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span class="svg-align"&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mop"&gt;&lt;span class="mop op-symbol small-op"&gt;∑&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;i&lt;/span&gt;&lt;span class="mrel mtight"&gt;=&lt;/span&gt;&lt;span class="mord mtight"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;N&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;v&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;i&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;2&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="hide-tail"&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;⋅&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span 
class="mord sqrt"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span class="svg-align"&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mop"&gt;&lt;span class="mop op-symbol small-op"&gt;∑&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;i&lt;/span&gt;&lt;span class="mrel mtight"&gt;=&lt;/span&gt;&lt;span class="mord mtight"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;N&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;w&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;i&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;2&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="hide-tail"&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="frac-line"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mop"&gt;&lt;span class="mop op-symbol small-op"&gt;∑&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;i&lt;/span&gt;&lt;span class="mrel mtight"&gt;=&lt;/span&gt;&lt;span class="mord mtight"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;N&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span 
class="mord"&gt;&lt;span class="mord mathnormal"&gt;v&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;i&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;w&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;i&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose nulldelimiter"&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/div&gt;


&lt;p&gt;The normalization matters. Without it, frequent words would always appear more similar just because their count vectors are longer. Cosine measures the &lt;em&gt;angle&lt;/em&gt; between vectors, not their magnitudes, so word frequency doesn't distort the comparison.&lt;/p&gt;

&lt;p&gt;For non-negative count vectors, cosine ranges from 0 (no overlap) to 1 (identical direction). From the textbook: cosine(digital, information) = 0.996, while cosine(cherry, information) = 0.018. "Digital" and "information" share contexts involving "computer" and "data." "Cherry" doesn't.&lt;/p&gt;
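
&lt;p&gt;The formula translates directly into a few lines of NumPy. The toy count vectors below are illustrative numbers, not the textbook's exact table:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import numpy as np

def cosine(v, w):
    """Dot product normalized by the two vector lengths, as in the formula above."""
    return float(np.dot(v, w) / (np.linalg.norm(v) * np.linalg.norm(w)))

# Toy co-occurrence counts over three context words (made-up numbers)
digital     = np.array([5.0, 1683.0, 1670.0])
information = np.array([9.0, 3982.0, 3325.0])
cherry      = np.array([442.0, 8.0, 2.0])

print(cosine(digital, information))   # close to 1
print(cosine(cherry, information))    # close to 0
&lt;/code&gt;&lt;/pre&gt;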

&lt;h3&gt;
  
  
  TF-IDF: Not All Counts Are Equal
&lt;/h3&gt;

&lt;p&gt;Raw co-occurrence counts have a problem. The word "the" co-occurs with everything. It dominates every vector without carrying useful meaning. TF-IDF fixes this with two adjustments:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Term Frequency (TF)&lt;/strong&gt;: use a log-scaled count instead of a raw count:&lt;/p&gt;


&lt;div class="katex-element"&gt;
  &lt;span class="katex-display"&gt;&lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;tf(t,d)=log⁡10(count(t,d)+1)
\text{tf}(t, d) = \log_{10}(\text{count}(t, d) + 1)
&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord text"&gt;&lt;span class="mord"&gt;tf&lt;/span&gt;&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord mathnormal"&gt;t&lt;/span&gt;&lt;span class="mpunct"&gt;,&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;d&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;=&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mop"&gt;&lt;span class="mop"&gt;lo&lt;span&gt;g&lt;/span&gt;&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mtight"&gt;10&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord text"&gt;&lt;span class="mord"&gt;count&lt;/span&gt;&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord mathnormal"&gt;t&lt;/span&gt;&lt;span class="mpunct"&gt;,&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;d&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;+&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;1&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/div&gt;


&lt;p&gt;This squashes large differences. A word appearing 10,000 times is not 10x more informative than one appearing 1,000 times.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Inverse Document Frequency (IDF)&lt;/strong&gt;: weight by how rare the word is across the collection:&lt;/p&gt;


&lt;div class="katex-element"&gt;
  &lt;span class="katex-display"&gt;&lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;idf(t)=log⁡10(Ndf(t))
\text{idf}(t) = \log_{10}\left(\frac{N}{\text{df}(t)}\right)
&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord text"&gt;&lt;span class="mord"&gt;idf&lt;/span&gt;&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord mathnormal"&gt;t&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;=&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mop"&gt;&lt;span class="mop"&gt;lo&lt;span&gt;g&lt;/span&gt;&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mtight"&gt;10&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="minner"&gt;&lt;span class="mopen delimcenter"&gt;&lt;span class="delimsizing size3"&gt;(&lt;/span&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mopen nulldelimiter"&gt;&lt;/span&gt;&lt;span class="mfrac"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord text"&gt;&lt;span class="mord"&gt;df&lt;/span&gt;&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord mathnormal"&gt;t&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="frac-line"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;N&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose nulldelimiter"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose delimcenter"&gt;&lt;span class="delimsizing size3"&gt;)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/div&gt;


&lt;p&gt;
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;NN &lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;N&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
 = total documents, 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;df(t)\text{df}(t) &lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord text"&gt;&lt;span class="mord"&gt;df&lt;/span&gt;&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord mathnormal"&gt;t&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
 = documents containing word 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;tt &lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;t&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
. A word in every document (like "the") gets IDF near zero. A word in only one document (like "Romeo" among Shakespeare's plays) gets high IDF.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;TF-IDF&lt;/strong&gt; = TF × IDF. Credit for frequency, penalized by commonness. This is roughly what search engines compute behind the scenes: cosine similarity between your query vector and document vectors, weighted by TF-IDF.&lt;/p&gt;
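
&lt;p&gt;A minimal sketch of the weighting, assuming a toy collection of 37 documents. The counts are made up for illustration:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import math

def tf(count):
    return math.log10(count + 1)

def idf(n_docs, doc_freq):
    return math.log10(n_docs / doc_freq)

def tf_idf(count, n_docs, doc_freq):
    return tf(count) * idf(n_docs, doc_freq)

# A moderately rare word keeps a meaningful weight...
print(tf_idf(count=114, n_docs=37, doc_freq=13))   # roughly 0.93

# ...while a word that appears in every document gets idf = 0, so weight 0
print(tf_idf(count=4000, n_docs=37, doc_freq=37))  # 0.0
&lt;/code&gt;&lt;/pre&gt;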
&lt;h3&gt;
  
  
  Where Sparse Vectors Break Down
&lt;/h3&gt;

&lt;p&gt;TF-IDF vectors work well for search, but they have a basic limitation: they're &lt;strong&gt;huge and mostly empty&lt;/strong&gt;. A vocabulary of 50,000 words means each vector has 50,000 dimensions, and most entries are zero.&lt;/p&gt;

&lt;p&gt;Two main consequences:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;No synonymy.&lt;/strong&gt; "Car" and "automobile" occupy separate dimensions with no connection. If two words never directly co-occur, their similarity is zero, even if they mean the same thing.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No generalization.&lt;/strong&gt; The vector knows what it observed. Nothing more. And storing 50,000-dimensional vectors is wasteful when 99% of entries are zeros.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Sparse vectors capture &lt;em&gt;something&lt;/em&gt; about word meaning, and they're useful for information retrieval. But they're not representations that a neural network can learn from well. For that, we need something denser.&lt;/p&gt;


&lt;h2&gt;
  
  
  Approach 2: Predict [Dense Vectors and Word2Vec]
&lt;/h2&gt;

&lt;p&gt;Dense vectors are short (50–300 dimensions), with most elements non-zero. Each dimension captures some abstract, learned aspect of meaning rather than corresponding to a specific vocabulary word.&lt;/p&gt;
&lt;h3&gt;
  
  
  Why 300 Dimensions?
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Too few&lt;/strong&gt; (say, 2): over-generalizes. All of word meaning compressed into two numbers. Everything blurs together.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Too many&lt;/strong&gt; (say, 30,000): you're back to sparse vectors. "Car" and "automobile" are in separate dimensions again.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;200–300&lt;/strong&gt;: enough capacity to represent word relationships while still forcing generalization between synonyms.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You can't label individual dimensions. Dimension 47 isn't "size" and dimension 183 isn't "animacy." The dimensions are abstractions that the algorithm discovers on its own.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsds9tkvl91xsgqi0xo0a.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsds9tkvl91xsgqi0xo0a.png" alt="Image2" width="800" height="669"&gt;&lt;/a&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  Word2Vec: Train a Classifier, Keep the Weights
&lt;/h2&gt;


&lt;div class="crayons-card c-embed"&gt;

  &lt;br&gt;
&lt;strong&gt;Word2Vec doesn't build a co-occurrence matrix.&lt;/strong&gt; It trains a binary classifier on a task it doesn't actually care about, then throws the classifier away and keeps the learned weights as word embeddings.&lt;br&gt;

&lt;/div&gt;


&lt;h3&gt;
  
  
  Self-Supervision: The Corpus Is the Training Signal
&lt;/h3&gt;

&lt;p&gt;Given a sentence from the corpus:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"...tablespoon of &lt;strong&gt;apricot&lt;/strong&gt; jam a..."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The target word is "apricot." The context window (±2 words) gives positive examples: (apricot, tablespoon), (apricot, of), (apricot, jam), (apricot, a). These are word pairs that &lt;em&gt;actually co-occur&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;For each positive pair, randomly sample 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;kk &lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;k&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
 noise words from the vocabulary: (apricot, aardvark), (apricot, seven), (apricot, forever). These are negative examples, random words that probably don't belong near "apricot."&lt;/p&gt;

&lt;p&gt;No human labeled anything. The running text &lt;em&gt;is&lt;/em&gt; the training signal. This is &lt;strong&gt;self-supervision&lt;/strong&gt;.&lt;/p&gt;
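
&lt;p&gt;A small sketch of that pair-generation step, assuming a ±2-word window and k = 2 noise samples per positive pair. The helper name and the fixed seed are arbitrary choices for illustration:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import random

def skipgram_pairs(tokens, window=2, k=2, seed=0):
    """Yield (target, context, label) triples: label 1 for real pairs, 0 for noise."""
    rng = random.Random(seed)
    vocab = sorted(set(tokens))
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j == i:
                continue
            yield (target, tokens[j], 1)              # positive: actually co-occurs
            for _ in range(k):
                yield (target, rng.choice(vocab), 0)  # negative: random noise word

for triple in skipgram_pairs("a tablespoon of apricot jam a".split()):
    print(triple)
&lt;/code&gt;&lt;/pre&gt;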
&lt;h3&gt;
  
  
  The Skip-Gram Algorithm (SGNS)
&lt;/h3&gt;

&lt;p&gt;Four steps:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1.&lt;/strong&gt; For each target word 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;ww &lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;w&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
, collect positive context pairs and sample 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;kk &lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;k&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
 negative noise pairs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2.&lt;/strong&gt; Model the probability that 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;cc &lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;c&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
 is a real context word of 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;ww &lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;w&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
 using the sigmoid of their dot product:&lt;/p&gt;


&lt;div class="katex-element"&gt;
  &lt;span class="katex-display"&gt;&lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;P(+∣w,c)=σ(c⃗⋅w⃗)=11+exp⁡(−c⃗⋅w⃗)
P(+ \mid w, c) = \sigma(\vec{c} \cdot \vec{w}) = \frac{1}{1 + \exp(-\vec{c} \cdot \vec{w})}
&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;P&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord"&gt;+&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;∣&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;w&lt;/span&gt;&lt;span class="mpunct"&gt;,&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;c&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;=&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;σ&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord accent"&gt;&lt;span class="vlist-t"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;c&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="accent-body"&gt;&lt;span class="overlay"&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;⋅&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord accent"&gt;&lt;span class="vlist-t"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;w&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="accent-body"&gt;&lt;span class="overlay"&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;=&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mopen nulldelimiter"&gt;&lt;/span&gt;&lt;span class="mfrac"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord"&gt;1&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;+&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mop"&gt;exp&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord"&gt;−&lt;/span&gt;&lt;span class="mord accent"&gt;&lt;span class="vlist-t"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;c&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="accent-body"&gt;&lt;span class="overlay"&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;⋅&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord accent"&gt;&lt;span class="vlist-t"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord 
mathnormal"&gt;w&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="accent-body"&gt;&lt;span class="overlay"&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="frac-line"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose nulldelimiter"&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/div&gt;


&lt;p&gt;If the dot product is large (vectors point in similar directions), the probability is high. If it's small or negative, the probability is low.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3.&lt;/strong&gt; The loss function pushes real context words closer and noise words farther away:&lt;/p&gt;


&lt;div class="katex-element"&gt;
  &lt;span class="katex-display"&gt;&lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;L=−[log⁡σ(cpos⃗⋅w⃗)+∑i=1klog⁡σ(−c⃗negi⋅w⃗)]
L = -\left[\log \sigma(\vec{c_{pos}} \cdot \vec{w}) + \textstyle\sum_{i=1}^k \log \sigma(-\vec{c}_{neg_i} \cdot \vec{w})\right]
&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;L&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;=&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;−&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="minner"&gt;&lt;span class="mopen delimcenter"&gt;&lt;span class="delimsizing size2"&gt;[&lt;/span&gt;&lt;/span&gt;&lt;span class="mop"&gt;lo&lt;span&gt;g&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;σ&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord accent"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;c&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;p&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;os&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="accent-body"&gt;&lt;span class="overlay"&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;⋅&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord accent"&gt;&lt;span class="vlist-t"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;w&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="accent-body"&gt;&lt;span class="overlay"&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;+&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mop"&gt;&lt;span class="mop op-symbol small-op"&gt;∑&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;i&lt;/span&gt;&lt;span class="mrel mtight"&gt;=&lt;/span&gt;&lt;span class="mord mtight"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;k&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span 
class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mop"&gt;lo&lt;span&gt;g&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;σ&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord"&gt;−&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord accent"&gt;&lt;span class="vlist-t"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;c&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="accent-body"&gt;&lt;span class="overlay"&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;n&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;e&lt;/span&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;g&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size3 size1 mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;i&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;⋅&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord accent"&gt;&lt;span class="vlist-t"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;w&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="accent-body"&gt;&lt;span class="overlay"&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;span class="mclose delimcenter"&gt;&lt;span class="delimsizing size2"&gt;]&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/div&gt;


&lt;p&gt;&lt;strong&gt;4.&lt;/strong&gt; Gradient descent adjusts the embedding vectors. After training on the whole corpus, discard the classifier. The learned vectors &lt;em&gt;are&lt;/em&gt; the embeddings.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu9qw66iy7r9juf4o6xnh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu9qw66iy7r9juf4o6xnh.png" alt="Image3" width="800" height="559"&gt;&lt;/a&gt;&lt;/p&gt;
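
&lt;p&gt;To make the four steps concrete, here is a minimal skip-gram-with-negative-sampling loop in plain NumPy. Everything in it (the toy corpus, the dimensions, the learning rate, uniform negative sampling) is an assumption chosen only to keep the sketch short and runnable; real implementations sample negatives from a smoothed unigram distribution and train on far more text.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Toy corpus and hyperparameters, purely illustrative
corpus = "the quick brown fox jumps over the lazy dog".split()
vocab = sorted(set(corpus))
idx = {w: i for i, w in enumerate(vocab)}
dim, window, k, lr, epochs = 50, 2, 5, 0.05, 200

W = rng.normal(scale=0.1, size=(len(vocab), dim))  # target-word vectors
C = rng.normal(scale=0.1, size=(len(vocab), dim))  # context-word vectors

for _ in range(epochs):
    for pos, word in enumerate(corpus):
        t = idx[word]
        lo, hi = max(0, pos - window), min(len(corpus), pos + window + 1)
        for ctx_pos in range(lo, hi):
            if ctx_pos == pos:
                continue
            c_pos = idx[corpus[ctx_pos]]
            # Uniform negative sampling (real word2vec samples from a smoothed unigram distribution)
            negatives = rng.integers(0, len(vocab), size=k)

            # Gradients of the loss above: pull the real context word toward the target,
            # push the k sampled noise words away from it.
            g_pos = sigmoid(C[c_pos] @ W[t]) - 1.0
            w_grad = g_pos * C[c_pos]
            C[c_pos] -= lr * g_pos * W[t]
            for c_neg in negatives:
                g_neg = sigmoid(C[c_neg] @ W[t])
                w_grad += g_neg * C[c_neg]
                C[c_neg] -= lr * g_neg * W[t]
            W[t] -= lr * w_grad

# Step 4: the classifier is discarded; the learned vectors are the embeddings.
embeddings = W  # or W + C; see "Two Matrices, One Choice" below
&lt;/code&gt;&lt;/pre&gt;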

&lt;h3&gt;
  
  
  Two Matrices, One Choice
&lt;/h3&gt;

&lt;p&gt;Word2Vec learns two embedding matrices: a target matrix 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;WW &lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;W&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
 and a context matrix 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;CC &lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;C&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
. Each word gets a vector in both. In practice, you either use 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;WW &lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;W&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
 alone or combine them as 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;W+CW + C &lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;W&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;+&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;C&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
. Both work.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why two matrices instead of one?&lt;/strong&gt; A word might play different roles as a target vs. as context. The word "the" as a target wants to be near every content word. But "the" as a context word doesn't tell you much about the target (&lt;em&gt;because it's too common&lt;/em&gt;). Having separate matrices lets the model handle this asymmetry. In practice, the difference is small, and many people just use W.&lt;/p&gt;




&lt;h2&gt;
  
  
  FastText: What About Unknown Words?
&lt;/h2&gt;

&lt;p&gt;Word2Vec has a blind spot: words never seen during training have no embedding. &lt;strong&gt;FastText&lt;/strong&gt; fixes this by breaking each word into character n-grams.&lt;/p&gt;

&lt;p&gt;"Where" becomes: 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;⟨wh, whe, her, ere, re⟩\langle \text{wh, whe, her, ere, re} \rangle &lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mopen"&gt;⟨&lt;/span&gt;&lt;span class="mord text"&gt;&lt;span class="mord"&gt;wh, whe, her, ere, re&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose"&gt;⟩&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
 plus 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;⟨where⟩\langle \text{where} \rangle &lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mopen"&gt;⟨&lt;/span&gt;&lt;span class="mord text"&gt;&lt;span class="mord"&gt;where&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose"&gt;⟩&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
. The word's embedding is the sum of all its n-gram embeddings. At test time, an unknown word can still be represented from its constituent n-grams.&lt;/p&gt;

&lt;p&gt;This matters for social media text (abbreviations, slang, misspellings) and for morphologically rich languages like Hindi, Turkish, or Finnish, where a single root generates dozens of inflected forms that may not all appear in training.&lt;/p&gt;
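
&lt;p&gt;Here is a minimal sketch of the decomposition and the summation. The parentheses stand in for the boundary symbols, and a plain dict stands in for the trained n-gram table; real FastText uses n-grams of several lengths plus a hashing trick.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import numpy as np

def char_ngrams(word, n=3):
    """Character trigrams with boundary markers, plus the whole word."""
    padded = "(" + word + ")"
    grams = [padded[i:i + n] for i in range(len(padded) - n + 1)]
    return grams + [padded]

def fasttext_vector(word, ngram_emb, dim=100):
    """Sum the embeddings of the word's n-grams; n-grams the table has never
    seen contribute nothing, so unknown words still get a vector."""
    vec = np.zeros(dim)
    for g in char_ngrams(word):
        vec += ngram_emb.get(g, np.zeros(dim))
    return vec

print(char_ngrams("where"))
# ['(wh', 'whe', 'her', 'ere', 're)', '(where)']
&lt;/code&gt;&lt;/pre&gt;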


&lt;h2&gt;
  
  
  Static vs. Contextual: One Vector or Many?
&lt;/h2&gt;

&lt;p&gt;Word2Vec gives each word &lt;strong&gt;one fixed vector&lt;/strong&gt; regardless of context. "Bank" gets a single embedding that blends financial institution, river bank, and everything else. This is a &lt;strong&gt;static embedding&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Contextual embeddings&lt;/strong&gt; (like BERT, which we'll cover later) produce a &lt;em&gt;different&lt;/em&gt; vector for "bank" depending on the sentence. "I deposited money at the bank" and "I sat by the river bank" yield different vectors for the same word. Context-sensitivity is a major upgrade, but static embeddings remain useful and fast.&lt;/p&gt;
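
&lt;p&gt;To see the difference concretely, here is a small sketch using the Hugging Face &lt;code&gt;transformers&lt;/code&gt; library with &lt;code&gt;bert-base-uncased&lt;/code&gt; (both assumptions; any contextual encoder works). The same surface word gets a different vector in each sentence:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import torch
from transformers import AutoTokenizer, AutoModel

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def bank_vector(sentence):
    enc = tok(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]  # one vector per token
    # position of the token "bank" in this particular sentence
    i = enc.input_ids[0].tolist().index(tok.convert_tokens_to_ids("bank"))
    return hidden[i]

v1 = bank_vector("I deposited money at the bank")
v2 = bank_vector("I sat by the river bank")
print(torch.nn.functional.cosine_similarity(v1, v2, dim=0))  # noticeably below 1.0
&lt;/code&gt;&lt;/pre&gt;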


&lt;h2&gt;
  
  
  What the Vectors Actually Capture
&lt;/h2&gt;
&lt;h3&gt;
  
  
  Window Size Changes the Neighbors
&lt;/h3&gt;

&lt;p&gt;The context window size during training shapes what kind of similarity the embeddings learn:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Small window (±2)&lt;/strong&gt;: nearest neighbors tend to be taxonomically similar, same part of speech. "Hogwarts" → Sunnydale, Evernight (other fictional schools).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Large window (±5)&lt;/strong&gt;: nearest neighbors are topically related, same semantic field. "Hogwarts" → Dumbledore, Malfoy, half-blood (Harry Potter universe).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Neither is "better." It depends on whether you need similarity or relatedness for your task.&lt;/p&gt;
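
&lt;p&gt;You can try this yourself, since gensim's Word2Vec exposes the window size directly. The toy corpus below is an assumption, only there to make the snippet runnable; contrasts like the Hogwarts example only emerge on a genuinely large corpus.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;from gensim.models import Word2Vec

# Tiny stand-in corpus (assumption); use your own tokenized sentences in practice
sentences = [
    "harry studied magic at hogwarts".split(),
    "buffy studied at sunnydale high".split(),
    "dumbledore taught at hogwarts school".split(),
] * 50

small = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, seed=0)
large = Word2Vec(sentences, vector_size=50, window=5, min_count=1, sg=1, seed=0)

print(small.wv.most_similar("hogwarts", topn=3))  # leans taxonomic on real data
print(large.wv.most_similar("hogwarts", topn=3))  # leans topical on real data
&lt;/code&gt;&lt;/pre&gt;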
&lt;h3&gt;
  
  
  The Parallelogram Test
&lt;/h3&gt;

&lt;p&gt;If embeddings capture relational meaning, they should solve analogies. The &lt;strong&gt;parallelogram method&lt;/strong&gt; works by vector arithmetic:&lt;/p&gt;


&lt;div class="katex-element"&gt;
  &lt;span class="katex-display"&gt;&lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;king⃗−man⃗+woman⃗≈queen⃗
\vec{\text{king}} - \vec{\text{man}} + \vec{\text{woman}} \approx \vec{\text{queen}}
&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord accent"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord text"&gt;&lt;span class="mord"&gt;king&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="accent-body"&gt;&lt;span class="overlay"&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;−&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord accent"&gt;&lt;span class="vlist-t"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord text"&gt;&lt;span class="mord"&gt;man&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="accent-body"&gt;&lt;span class="overlay"&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;+&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord accent"&gt;&lt;span class="vlist-t"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord text"&gt;&lt;span class="mord"&gt;woman&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="accent-body"&gt;&lt;span class="overlay"&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;≈&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord accent"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord text"&gt;&lt;span class="mord"&gt;queen&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="accent-body"&gt;&lt;span class="overlay"&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/div&gt;



&lt;div class="katex-element"&gt;
  &lt;span class="katex-display"&gt;&lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;Paris⃗−France⃗+Italy⃗≈Rome⃗
\vec{\text{Paris}} - \vec{\text{France}} + \vec{\text{Italy}} \approx \vec{\text{Rome}}
&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord accent"&gt;&lt;span class="vlist-t"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord text"&gt;&lt;span class="mord"&gt;Paris&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="accent-body"&gt;&lt;span class="overlay"&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;−&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord accent"&gt;&lt;span class="vlist-t"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord text"&gt;&lt;span class="mord"&gt;France&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="accent-body"&gt;&lt;span class="overlay"&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;+&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord accent"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord text"&gt;&lt;span class="mord"&gt;Italy&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="accent-body"&gt;&lt;span class="overlay"&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;≈&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord accent"&gt;&lt;span class="vlist-t"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord text"&gt;&lt;span class="mord"&gt;Rome&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="accent-body"&gt;&lt;span class="overlay"&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/div&gt;



&lt;div class="katex-element"&gt;
  &lt;span class="katex-display"&gt;&lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;apple⃗−tree⃗+grape⃗≈wine⃗
\vec{\text{apple}} - \vec{\text{tree}} + \vec{\text{grape}} \approx \vec{\text{wine}}
&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord accent"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord text"&gt;&lt;span class="mord"&gt;apple&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="accent-body"&gt;&lt;span class="overlay"&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;−&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord accent"&gt;&lt;span class="vlist-t"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord text"&gt;&lt;span class="mord"&gt;tree&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="accent-body"&gt;&lt;span class="overlay"&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;+&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord accent"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord text"&gt;&lt;span class="mord"&gt;grape&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="accent-body"&gt;&lt;span class="overlay"&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;≈&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord accent"&gt;&lt;span class="vlist-t"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord text"&gt;&lt;span class="mord"&gt;wine&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="accent-body"&gt;&lt;span class="overlay"&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/div&gt;


&lt;p&gt;The embedding space encodes relationships like gender, capital-city-of, and comparative morphology as consistent directional offsets. Nobody told the algorithm about countries or royalty. It picked up these relationships from word co-occurrence patterns alone.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1gbp4bmjqvzgxxdgz34y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1gbp4bmjqvzgxxdgz34y.png" alt="Image4" width="800" height="615"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Caveats on analogy testing:&lt;/strong&gt; the parallelogram method works best for frequent words, short relational distances, and specific relationship types (capitals, inflections, gender). It's less reliable for complex or abstract analogies. The closest vector returned is often one of the input words or a morphological variant, so those must be excluded. Some researchers argue that the method is too simple to model how humans actually form analogies.&lt;/p&gt;
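
&lt;p&gt;Here is a minimal version of the parallelogram method, using pretrained GloVe vectors from the gensim downloader (an assumption; any static embeddings work) and excluding the input words, as the caveat above requires:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import gensim.downloader as api

wv = api.load("glove-wiki-gigaword-100")  # pretrained static vectors (assumption)

# king - man + woman
target = wv["king"] - wv["man"] + wv["woman"]

# Nearest neighbours of the offset vector, skipping the three input words,
# since the raw nearest vector is often one of the inputs.
best = [w for w, _ in wv.similar_by_vector(target, topn=10)
        if w not in {"king", "man", "woman"}][:3]
print(best)  # 'queen' is typically at or near the top
&lt;/code&gt;&lt;/pre&gt;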

&lt;h3&gt;
  
  
  Meaning Shifts Over Time
&lt;/h3&gt;

&lt;p&gt;Train separate embedding spaces on text from different decades, and you can watch meanings drift. "Awful" meant "full of awe" in the 1850s. By the 1900s, it meant "terrible." "Gay" shifted from "cheerful" to its modern meaning over the 20th century. "Broadcast" went from agricultural (casting seeds) to radio/TV transmission.&lt;/p&gt;

&lt;p&gt;Embeddings make these changes computable, not just anecdotal. You train on decade-specific corpora and measure how a word's nearest neighbors change.&lt;/p&gt;




&lt;h2&gt;
  
  
  Bias: Embeddings Learn What the Corpus Contains
&lt;/h2&gt;

&lt;p&gt;Embeddings reflect the text they're trained on. If that text contains gendered associations, the embeddings reproduce them:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;father : doctor :: mother : &lt;strong&gt;nurse&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;man : computer programmer :: woman : &lt;strong&gt;homemaker&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;


&lt;div class="crayons-card c-embed"&gt;

  &lt;br&gt;
&lt;strong&gt;This is not bias the algorithm invented.&lt;/strong&gt; It's bias already present in the written text the model just trained on. But the consequences are real. Embeddings used in hiring tools or search engines will perpetuate whatever stereotypes exist in their training data.&lt;br&gt;

&lt;/div&gt;


&lt;p&gt;Debiasing is an active research area. Some approaches adjust the embedding space after training. Others change the training procedure itself. Neither approach fully solves the problem yet.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F37oksybja56o6r4asd4y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F37oksybja56o6r4asd4y.png" alt="Image5" width="800" height="656"&gt;&lt;/a&gt;&lt;/p&gt;
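
&lt;p&gt;A quick way to see this in your own vectors is to project words onto a he-she direction. This is only a rough diagnostic, not a debiasing method, and the vector set is again an assumption:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import numpy as np
import gensim.downloader as api

wv = api.load("glove-wiki-gigaword-100")
he_she = wv["he"] - wv["she"]

def lean(word):
    v = wv[word]
    return float(np.dot(v, he_she) / (np.linalg.norm(v) * np.linalg.norm(he_she)))

for job in ["doctor", "nurse", "programmer", "homemaker"]:
    print(job, round(lean(job), 3))  # positive leans "he", negative leans "she"
&lt;/code&gt;&lt;/pre&gt;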




&lt;h2&gt;
  
  
  What You Now Have
&lt;/h2&gt;

&lt;p&gt;Seven things you didn't have before this post:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Sparse vs. dense vectors.&lt;/strong&gt; Sparse vectors (TF-IDF) use one dimension per vocabulary word, are mostly zeros, and can't capture synonymy. Dense vectors (50–300 dimensions) generalize between words by compressing meaning into learned abstractions. Sparse vectors are the near miss — useful for retrieval, but not for learning.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;TF-IDF weighting.&lt;/strong&gt; Term frequency (log-scaled) times inverse document frequency. Gives a word credit for appearing often in a document while penalizing words that appear in nearly every document. The standard weighting scheme in information retrieval.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Cosine similarity.&lt;/strong&gt; The normalized dot product. Ranges 0–1 for count vectors. Measures the angle between vectors rather than their magnitude, so word frequency doesn't distort similarity.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Word2Vec (skip-gram with negative sampling).&lt;/strong&gt; Train a logistic classifier to distinguish real context words from random noise. Throw away the classifier. Keep the weight vectors. Self-supervised: the corpus generates its own training labels.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;FastText.&lt;/strong&gt; Extends Word2Vec with character n-grams, letting unknown words get embeddings from their subword components. Handles misspellings, rare morphological forms, and out-of-vocabulary words.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;What embeddings encode.&lt;/strong&gt; Window size controls whether you get taxonomic similarity or topical relatedness. Analogy arithmetic (king - man + woman ≈ queen) shows that relational meaning is preserved as directional offsets. Historical corpora reveal how word meanings drift over decades.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Bias is inherited.&lt;/strong&gt; Embeddings absorb whatever associations exist in training text. Gender stereotypes, racial biases, cultural assumptions — they all show up in the vector space. Debiasing is an open problem.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;


&lt;div class="crayons-card c-embed"&gt;

  &lt;br&gt;
&lt;strong&gt;Next post:&lt;/strong&gt; neural networks — feedforward architectures, backpropagation, and how they change the way we process sequences of words.&lt;br&gt;

&lt;/div&gt;


</description>
      <category>nlp</category>
      <category>machinelearning</category>
      <category>embeddings</category>
      <category>ai</category>
    </item>
    <item>
      <title>Perplexity, Smoothing, and What Words Mean</title>
      <dc:creator>Akash</dc:creator>
      <pubDate>Thu, 02 Apr 2026 07:56:14 +0000</pubDate>
      <link>https://forem.com/kambleakash0/perplexity-smoothing-and-what-words-mean-3ej0</link>
      <guid>https://forem.com/kambleakash0/perplexity-smoothing-and-what-words-mean-3ej0</guid>
      <description>&lt;p&gt;By the end of this post, you'll know how to evaluate a language model using 'perplexity', why unseen n-grams break everything and how smoothing patches the holes, and how interpolation lets you mix n-gram orders instead of betting on one. You'll also understand why word meaning is harder to pin down than it looks, what kinds of relationships exist between words, and how a 1951 insight from philosopher Ludwig Wittgenstein laid the intellectual groundwork for word embeddings.&lt;/p&gt;

&lt;p&gt;Two halves, one thread: the first half shows you the limits of n-gram language models. The second half shows you why those limits forced NLP to rethink how words are represented, which is where the deep learning side of NLP starts.&lt;/p&gt;




&lt;h2&gt;
  
  
  Where We Left Off
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://dev.to/kambleakash0/before-llms-could-predict-they-had-to-count-20p5"&gt;Last post&lt;/a&gt;, we built n-gram language models: chain rule, Markov assumption, unigrams, bigrams, MLE. We left knowing &lt;em&gt;how&lt;/em&gt; to build one. Two questions were still open: &lt;strong&gt;how do you know if your model is any good?&lt;/strong&gt; and &lt;strong&gt;what happens when the training data doesn't cover a word combination your test data needs?&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  MLE on Real Data
&lt;/h2&gt;

&lt;p&gt;The MLE bigram formula from last time:&lt;/p&gt;

&lt;div class="katex-element"&gt;
  &lt;span class="katex-display"&gt;&lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;P(wi∣wi−1)=C(wi−1,  wi)C(wi−1)
P(w_i \mid w_{i-1}) = \frac{C(w_{i-1},\; w_i)}{C(w_{i-1})}
&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;P&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;w&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;i&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;∣&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;w&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;i&lt;/span&gt;&lt;span class="mbin mtight"&gt;−&lt;/span&gt;&lt;span class="mord mtight"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;=&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mopen nulldelimiter"&gt;&lt;/span&gt;&lt;span class="mfrac"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;C&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;w&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;i&lt;/span&gt;&lt;span class="mbin mtight"&gt;−&lt;/span&gt;&lt;span class="mord mtight"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="frac-line"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;C&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;w&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span 
class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;i&lt;/span&gt;&lt;span class="mbin mtight"&gt;−&lt;/span&gt;&lt;span class="mord mtight"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mpunct"&gt;,&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;w&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;i&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose nulldelimiter"&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/div&gt;


&lt;p&gt;Applying this to the Berkeley Restaurant Project corpus (9,222 sentences of people asking about restaurants in Berkeley), you build a bigram count table. The first thing that stands out: &lt;strong&gt;most cells are zero.&lt;/strong&gt; The majority of word pairs just never appear together.&lt;/p&gt;

&lt;p&gt;The non-zero entries are interesting, though. 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;P(want∣I)=0.33P(\text{want} \mid \text{I}) = 0.33 &lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;P&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord text"&gt;&lt;span class="mord"&gt;want&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;∣&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord text"&gt;&lt;span class="mord"&gt;I&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;=&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;0.33&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
, which makes sense since "I want" is a common English construction. 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;P(to∣want)=0.66P(\text{to} \mid \text{want}) = 0.66 &lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;P&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord text"&gt;&lt;span class="mord"&gt;to&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;∣&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord text"&gt;&lt;span class="mord"&gt;want&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;=&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;0.66&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
, because "want to" is practically a single unit.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy2z1m5wen65yhe0ud8sy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy2z1m5wen65yhe0ud8sy.png" alt="img1" width="800" height="790"&gt;&lt;/a&gt;&lt;/p&gt;
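
&lt;p&gt;The counting itself is a few lines. The three-sentence toy corpus below is an assumption standing in for the real data; it only shows the mechanics of dividing the bigram count by the preceding word's count, with "BOS" and "EOS" standing in for the sentence-boundary markers.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;from collections import Counter

sentences = [
    ["BOS", "i", "want", "english", "food", "EOS"],
    ["BOS", "i", "want", "chinese", "food", "EOS"],
    ["BOS", "tell", "me", "about", "chez", "panisse", "EOS"],
]

unigrams = Counter()
bigrams = Counter()
for sent in sentences:
    unigrams.update(sent)
    bigrams.update(zip(sent, sent[1:]))

def p_mle(w, prev):
    """P(w | prev) = C(prev, w) / C(prev)"""
    return bigrams[(prev, w)] / unigrams[prev]

print(p_mle("want", "i"))  # 1.0 on this toy corpus; 0.33 on the full data
&lt;/code&gt;&lt;/pre&gt;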

&lt;p&gt;Sentence probability is just a product of bigrams:&lt;/p&gt;


&lt;div class="katex-element"&gt;
  &lt;span class="katex-display"&gt;&lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;P(⟨s⟩  I want English food  ⟨/s⟩)=P(I∣⟨s⟩)×P(want∣I)×P(English∣want)×P(food∣English)×P(⟨/s⟩∣food)=0.25×0.33×0.0011×0.5×0.68=0.000031
\begin{aligned}
P(\langle s \rangle \;\text{I want English food}\; \langle /s \rangle) &amp;amp;= P(\text{I} \mid \langle s \rangle) \times P(\text{want} \mid \text{I}) \times P(\text{English} \mid \text{want}) \times P(\text{food} \mid \text{English}) \times P(\langle /s \rangle \mid \text{food}) \newline
&amp;amp;= 0.25 \times 0.33 \times 0.0011 \times 0.5 \times 0.68 \newline
&amp;amp;= 0.000031
\end{aligned}
&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mtable"&gt;&lt;span class="col-align-r"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;P&lt;/span&gt;&lt;span class="mopen"&gt;(⟨&lt;/span&gt;&lt;span class="mord mathnormal"&gt;s&lt;/span&gt;&lt;span class="mclose"&gt;⟩&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord text"&gt;&lt;span class="mord"&gt;I want English food&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mopen"&gt;⟨&lt;/span&gt;&lt;span class="mord"&gt;/&lt;/span&gt;&lt;span class="mord mathnormal"&gt;s&lt;/span&gt;&lt;span class="mclose"&gt;⟩)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="col-align-l"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord"&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;=&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;P&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord text"&gt;&lt;span class="mord"&gt;I&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;∣&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mopen"&gt;⟨&lt;/span&gt;&lt;span class="mord mathnormal"&gt;s&lt;/span&gt;&lt;span class="mclose"&gt;⟩)&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;×&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;P&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord text"&gt;&lt;span class="mord"&gt;want&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;∣&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord text"&gt;&lt;span class="mord"&gt;I&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;×&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;P&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord text"&gt;&lt;span class="mord"&gt;English&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;∣&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord text"&gt;&lt;span class="mord"&gt;want&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;×&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;P&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord text"&gt;&lt;span class="mord"&gt;food&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span 
class="mrel"&gt;∣&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord text"&gt;&lt;span class="mord"&gt;English&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;×&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;P&lt;/span&gt;&lt;span class="mopen"&gt;(⟨&lt;/span&gt;&lt;span class="mord"&gt;/&lt;/span&gt;&lt;span class="mord mathnormal"&gt;s&lt;/span&gt;&lt;span class="mclose"&gt;⟩&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;∣&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord text"&gt;&lt;span class="mord"&gt;food&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord"&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;=&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord"&gt;0.25&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;×&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord"&gt;0.33&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;×&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord"&gt;0.0011&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;×&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord"&gt;0.5&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;×&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord"&gt;0.68&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord"&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;=&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord"&gt;0.000031&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/div&gt;


&lt;p&gt;Different bigram probabilities encode different kinds of knowledge. 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;P(to∣want)=0.66P(\text{to} \mid \text{want}) = 0.66 &lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;P&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord text"&gt;&lt;span class="mord"&gt;to&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;∣&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord text"&gt;&lt;span class="mord"&gt;want&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;=&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;0.66&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
 is &lt;strong&gt;syntactic&lt;/strong&gt;, reflecting that "want to" is a verb construction. 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;P(Chinese∣want)&amp;gt;P(English∣want)P(\text{Chinese} \mid \text{want}) &amp;gt; P(\text{English} \mid \text{want}) &lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;P&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord text"&gt;&lt;span class="mord"&gt;Chinese&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;∣&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord text"&gt;&lt;span class="mord"&gt;want&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;&amp;gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;P&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord text"&gt;&lt;span class="mord"&gt;English&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;∣&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord text"&gt;&lt;span class="mord"&gt;want&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
 might be &lt;strong&gt;cultural&lt;/strong&gt;, reflecting Berkeley's dining preferences.&lt;/p&gt;
&lt;h3&gt;
  
  
  Log Space
&lt;/h3&gt;

&lt;p&gt;Multiplying many small probabilities causes numerical underflow. Always work in log space:&lt;/p&gt;


&lt;div class="katex-element"&gt;
  &lt;span class="katex-display"&gt;&lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;log⁡(p1×p2×p3×p4)=log⁡p1+log⁡p2+log⁡p3+log⁡p4
\log(p_1 \times p_2 \times p_3 \times p_4) = \log p_1 + \log p_2 + \log p_3 + \log p_4
&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mop"&gt;lo&lt;span&gt;g&lt;/span&gt;&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;p&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;×&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;p&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;2&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;×&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;p&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;3&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;×&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;p&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;4&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;=&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mop"&gt;lo&lt;span&gt;g&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;p&lt;/span&gt;&lt;span 
class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;+&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mop"&gt;lo&lt;span&gt;g&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;p&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;2&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;+&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mop"&gt;lo&lt;span&gt;g&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;p&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;3&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;+&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mop"&gt;lo&lt;span&gt;g&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;p&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;4&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/div&gt;


&lt;p&gt;Store log-probabilities. Add them. Convert back with 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;exp⁡()\exp() &lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mop"&gt;exp&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
 only at the end. Addition is faster than multiplication, too.&lt;/p&gt;
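
&lt;p&gt;Here is the same "I want English food" computation done the safe way, entirely in log space:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import math

# the five bigram probabilities from the table above
probs = [0.25, 0.33, 0.0011, 0.5, 0.68]

logprob = sum(math.log(p) for p in probs)
print(logprob)            # about -10.39
print(math.exp(logprob))  # about 0.000031, matching the direct product
&lt;/code&gt;&lt;/pre&gt;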


&lt;h2&gt;
  
  
  Perplexity: Measuring How Good a Language Model Is
&lt;/h2&gt;

&lt;p&gt;You've built two language models. Which is better?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Extrinsic evaluation&lt;/strong&gt;: plug the LM into a real application (speech recognition, machine translation) and measure task performance. Reliable, but it can take days to run.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Intrinsic evaluation&lt;/strong&gt;: compute a metric directly on a held-out test set. Faster, and the standard metric is &lt;strong&gt;perplexity&lt;/strong&gt;.&lt;/p&gt;
&lt;h3&gt;
  
  
  The Intuition: A Guessing Game
&lt;/h3&gt;

&lt;p&gt;Perplexity measures &lt;strong&gt;how surprised the model is&lt;/strong&gt; by the actual next word. Picture a fill-in-the-blank game:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"I always order pizza with cheese and ___": a few plausible options. Low surprise.&lt;/li&gt;
&lt;li&gt;"The 33rd President of the U.S. was ___": basically one answer. Very low surprise.&lt;/li&gt;
&lt;li&gt;"I saw a ___": could be anything. High surprise.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A model with low perplexity guesses well, assigning high probability to the words that actually appear. A model with high perplexity is consistently wrong about what comes next.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fytfs6dx0puhza991ap3j.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fytfs6dx0puhza991ap3j.png" alt="img2" width="800" height="683"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  The Math
&lt;/h3&gt;


&lt;div class="katex-element"&gt;
  &lt;span class="katex-display"&gt;&lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;PP(W)=P(w1,w2,…,wN)−1N=1P(w1,w2,…,wN)N
\text{PP}(W) = P(w_1, w_2, \dots, w_N)^{-\frac{1}{N}} = \sqrt[N]{\frac{1}{P(w_1, w_2, \dots, w_N)}}
&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord text"&gt;&lt;span class="mord"&gt;PP&lt;/span&gt;&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord mathnormal"&gt;W&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;=&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;P&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;w&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mpunct"&gt;,&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;w&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;2&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mpunct"&gt;,&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="minner"&gt;…&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mpunct"&gt;,&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;w&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;N&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose"&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mtight"&gt;−&lt;/span&gt;&lt;span class="mord mtight"&gt;&lt;span class="mopen nulldelimiter sizing reset-size3 size6"&gt;&lt;/span&gt;&lt;span class="mfrac"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size3 size1 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;N&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span 
class="pstrut"&gt;&lt;/span&gt;&lt;span class="frac-line mtight"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size3 size1 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mtight"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose nulldelimiter sizing reset-size3 size6"&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;=&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord sqrt"&gt;&lt;span class="root"&gt;&lt;span class="vlist-t"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size1 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;N&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span class="svg-align"&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord"&gt;&lt;span class="mopen nulldelimiter"&gt;&lt;/span&gt;&lt;span class="mfrac"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;P&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;w&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mpunct"&gt;,&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;w&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;2&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mpunct"&gt;,&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="minner"&gt;…&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mpunct"&gt;,&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;w&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span 
class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;N&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="frac-line"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose nulldelimiter"&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="hide-tail"&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/div&gt;


&lt;p&gt;Inverse probability of the test set, normalized by the number of words. &lt;strong&gt;Lower perplexity = better model.&lt;/strong&gt; Minimizing perplexity is the same as maximizing the probability the model assigns to the test data.&lt;/p&gt;

&lt;p&gt;For bigrams:&lt;/p&gt;


&lt;div class="katex-element"&gt;
  &lt;span class="katex-display"&gt;&lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;PP(W)=∏i=1N1P(wi∣wi−1)N
\text{PP}(W) = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i \mid w_{i-1})}}
&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord text"&gt;&lt;span class="mord"&gt;PP&lt;/span&gt;&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord mathnormal"&gt;W&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;=&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord sqrt"&gt;&lt;span class="root"&gt;&lt;span class="vlist-t"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size1 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;N&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span class="svg-align"&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mop op-limits"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;i&lt;/span&gt;&lt;span class="mrel mtight"&gt;=&lt;/span&gt;&lt;span class="mord mtight"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="mop op-symbol large-op"&gt;∏&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;N&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mopen nulldelimiter"&gt;&lt;/span&gt;&lt;span class="mfrac"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;P&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;w&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;i&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;∣&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;w&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 
mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;i&lt;/span&gt;&lt;span class="mbin mtight"&gt;−&lt;/span&gt;&lt;span class="mord mtight"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="frac-line"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose nulldelimiter"&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="hide-tail"&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/div&gt;
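&lt;p&gt;To make the formula concrete, here's a minimal Python sketch that computes bigram perplexity in log space (sum the log probabilities, then exponentiate at the end, so tiny products don't underflow). The toy corpus and the &lt;code&gt;mle_prob&lt;/code&gt; helper are made up for illustration; only the arithmetic mirrors the formula above.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import math
from collections import defaultdict

def bigram_perplexity(test_tokens, bigram_prob):
    """Perplexity of a token sequence under a bigram model.

    bigram_prob(prev, word) must return P(word | prev).
    Computed in log space: PP = exp(-(1/N) * sum(log P)).
    """
    log_prob_sum = 0.0
    n = 0
    for prev, word in zip(test_tokens, test_tokens[1:]):
        log_prob_sum += math.log(bigram_prob(prev, word))
        n += 1
    return math.exp(-log_prob_sum / n)

# Toy MLE bigram model from a two-sentence corpus (illustration only).
corpus = "⟨s⟩ i want to eat ⟨/s⟩ ⟨s⟩ i want chinese food ⟨/s⟩".split()
bigram_counts = defaultdict(lambda: defaultdict(int))
context_counts = defaultdict(int)
for prev, word in zip(corpus, corpus[1:]):
    bigram_counts[prev][word] += 1
    context_counts[prev] += 1

def mle_prob(prev, word):
    return bigram_counts[prev][word] / context_counts[prev]

print(bigram_perplexity("⟨s⟩ i want to eat ⟨/s⟩".split(), mle_prob))  # ≈ 1.15
&lt;/code&gt;&lt;/pre&gt;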


&lt;h3&gt;
  
  
  Perplexity as Branching Factor
&lt;/h3&gt;

&lt;p&gt;Another angle: perplexity is the &lt;strong&gt;weighted average number of choices&lt;/strong&gt; the model faces at each step.&lt;/p&gt;

&lt;p&gt;Recognizing one of 10 equally likely digits? Perplexity = 10. That's the branching factor — 10 options, equally uncertain.&lt;/p&gt;

&lt;p&gt;Now imagine a call-routing phone system. It gets 120,000 calls. Three-quarters are for "operator," "sales," or "tech support" (each 1 in 4). The remaining 30,000 calls are for 30,000 different employee names (each appears once). The perplexity of this sequence works out to 52.6, not 30,003. The common categories dominate, pulling the weighted average way down.&lt;/p&gt;

&lt;p&gt;More information about what's likely = lower perplexity.&lt;/p&gt;
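<p></p>
&lt;p&gt;You can reproduce the 52.6 in a few lines. This is just the perplexity formula applied to the call distribution above (three categories at 1 in 4 each, 30,000 names at 1 in 120,000 each); no new data, just the arithmetic.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import math

N = 120_000
# 90,000 calls land on one of three categories, each with probability 1/4.
# 30,000 calls ask for one of 30,000 names, each with probability 1/120,000.
log_prob_sum = 90_000 * math.log(1 / 4) + 30_000 * math.log(1 / N)

perplexity = math.exp(-log_prob_sum / N)
print(round(perplexity, 1))  # 52.6
&lt;/code&gt;&lt;/pre&gt;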

&lt;h3&gt;
  
  
  Real Numbers
&lt;/h3&gt;

&lt;p&gt;Wall Street Journal, trained on 38M words, tested on 1.5M:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Perplexity&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Unigram&lt;/td&gt;
&lt;td&gt;962&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Bigram&lt;/td&gt;
&lt;td&gt;170&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Trigram&lt;/td&gt;
&lt;td&gt;109&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;One word of context (bigram) cuts perplexity by roughly 5.7x over no context (962 → 170). Two words of context (trigram) cut it further, to 109. More context = less surprise.&lt;/p&gt;




&lt;h2&gt;
  
  
  Generating Text from a Language Model
&lt;/h2&gt;

&lt;p&gt;Language models aren't just scorers; they can also generate text. The procedure for bigram generation:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Start with 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;⟨s⟩\langle s \rangle &lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mopen"&gt;⟨&lt;/span&gt;&lt;span class="mord mathnormal"&gt;s&lt;/span&gt;&lt;span class="mclose"&gt;⟩&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
&lt;/li&gt;
&lt;li&gt;Sample a word from 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;P(w∣⟨s⟩)P(w \mid \langle s \rangle) &lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;P&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord mathnormal"&gt;w&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;∣&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mopen"&gt;⟨&lt;/span&gt;&lt;span class="mord mathnormal"&gt;s&lt;/span&gt;&lt;span class="mclose"&gt;⟩)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
 — say "I"&lt;/li&gt;
&lt;li&gt;Sample from 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;P(w∣I)P(w \mid \text{I}) &lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;P&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord mathnormal"&gt;w&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;∣&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord text"&gt;&lt;span class="mord"&gt;I&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
 — say "want"&lt;/li&gt;
&lt;li&gt;Keep going: "want" → "to" → "eat" → "Chinese" → "food" → 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;⟨/s⟩\langle /s \rangle &lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mopen"&gt;⟨&lt;/span&gt;&lt;span class="mord"&gt;/&lt;/span&gt;&lt;span class="mord mathnormal"&gt;s&lt;/span&gt;&lt;span class="mclose"&gt;⟩&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
&lt;/li&gt;
&lt;li&gt;Result: &lt;strong&gt;"I want to eat Chinese food"&lt;/strong&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This is the same loop running inside every LLM: predict, sample, append, repeat. The only difference is the machinery doing the prediction.&lt;/p&gt;
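<p></p>
&lt;p&gt;Here's a minimal sketch of that loop for a bigram model. It assumes a nested counts dictionary like the one built in the perplexity sketch earlier (so every context it sees has at least one continuation); the loop itself is just predict, sample, append, repeat.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import random

def generate(bigram_counts, start="⟨s⟩", end="⟨/s⟩", max_len=50):
    """Sample a sentence from a bigram model: predict, sample, append, repeat."""
    tokens = [start]
    for _ in range(max_len):
        context = tokens[-1]
        candidates = list(bigram_counts[context].keys())
        weights = list(bigram_counts[context].values())
        # Sample the next word in proportion to how often it followed `context` in training.
        next_word = random.choices(candidates, weights=weights)[0]
        tokens.append(next_word)
        if next_word == end:
            break
    return " ".join(tokens)
&lt;/code&gt;&lt;/pre&gt;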

&lt;p&gt;The output mirrors the training corpus. Shakespeare trigrams produce pseudo-Shakespeare. WSJ trigrams produce pseudo-financial news. Jane Austen trigrams produce pseudo-Austen. The n-gram model essentially becomes a stylistic fingerprint of its training data, which is the basis for &lt;strong&gt;author identification&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnbrsvo3te3pdvq0bey1s.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnbrsvo3te3pdvq0bey1s.png" alt="img3" width="800" height="408"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The Zero Problem
&lt;/h2&gt;

&lt;p&gt;This is where n-gram models break down.&lt;/p&gt;

&lt;p&gt;Shakespeare's corpus: 884,647 tokens, vocabulary 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;V=29,066V = 29{,}066 &lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;V&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;=&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;29&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mpunct"&gt;,&lt;/span&gt;&lt;/span&gt;&lt;span class="mord"&gt;066&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
. Possible bigrams: 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;V2≈844V^2 \approx 844 &lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;V&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;2&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;≈&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;844&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
 million. Actually observed: 300,000. That's &lt;strong&gt;99.96% zeros.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If "denied the offer" never appeared in training:&lt;/p&gt;

&lt;p&gt;
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;P(offer∣denied the)=0P(\text{offer} \mid \text{denied the}) = 0 &lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;P&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord text"&gt;&lt;span class="mord"&gt;offer&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;∣&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord text"&gt;&lt;span class="mord"&gt;denied the&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;=&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;0&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
&lt;/p&gt;

&lt;p&gt;One zero anywhere in the test set, and the entire test set probability becomes zero. Perplexity becomes undefined. You can't evaluate the model at all.&lt;/p&gt;
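<p></p>
&lt;p&gt;A tiny illustration of why a single zero is fatal: the sentence probability is a product of per-bigram probabilities, so one unseen bigram zeroes out the whole thing, and the log-space version blows up. The numbers below are placeholders, not real counts.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import math

# MLE probabilities for the bigrams of a test sentence; the third bigram
# ("denied the" followed by "offer", say) was never seen in training, so its MLE estimate is 0.
bigram_probs = [0.2, 0.05, 0.0, 0.3]

print(math.prod(bigram_probs))  # 0.0: the whole test set gets probability zero
# math.log(0.0) raises ValueError, so the perplexity computation can't even run.
&lt;/code&gt;&lt;/pre&gt;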

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj65ey0koclb00ckf4urt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj65ey0koclb00ckf4urt.png" alt="img4" width="800" height="743"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The fix is &lt;strong&gt;smoothing&lt;/strong&gt;: take a little probability mass from the things you &lt;em&gt;did&lt;/em&gt; see and spread it to the things you didn't.&lt;/p&gt;

&lt;h3&gt;
  
  
  Add-One (Laplace) Smoothing
&lt;/h3&gt;

&lt;p&gt;The simplest possible fix. Pretend every bigram was seen one extra time:&lt;/p&gt;


&lt;div class="katex-element"&gt;
  &lt;span class="katex-display"&gt;&lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;PLaplace(wi∣wi−1)=C(wi−1,wi)+1C(wi−1)+V
P_{\text{Laplace}}(w_i \mid w_{i-1}) = \frac{C(w_{i-1}, w_i) + 1}{C(w_{i-1}) + V}
&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;P&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord text mtight"&gt;&lt;span class="mord mtight"&gt;Laplace&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;w&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;i&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;∣&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;w&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;i&lt;/span&gt;&lt;span class="mbin mtight"&gt;−&lt;/span&gt;&lt;span class="mord mtight"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;=&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mopen nulldelimiter"&gt;&lt;/span&gt;&lt;span class="mfrac"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;C&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;w&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;i&lt;/span&gt;&lt;span class="mbin mtight"&gt;−&lt;/span&gt;&lt;span class="mord mtight"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span 
class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;+&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;V&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="frac-line"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;C&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;w&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;i&lt;/span&gt;&lt;span class="mbin mtight"&gt;−&lt;/span&gt;&lt;span class="mord mtight"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mpunct"&gt;,&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;w&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;i&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;+&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose nulldelimiter"&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/div&gt;


&lt;p&gt;Add 1 to every numerator. Add 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;VV &lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;V&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
 to the denominator to keep things normalized.&lt;/p&gt;
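<p></p>
&lt;p&gt;As a sketch, here's the smoothed probability and its "reconstituted count", the effective count implied by the smoothed estimate, which is what makes the dilution visible. The Berkeley Restaurant numbers plugged in at the end (C(want, to) = 608, V = 1,446, and the unigram count C(want) = 927, which is taken from the textbook's tables rather than from this post) reproduce the 608 → 238 drop described in the note that follows.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;def laplace_prob(c_bigram, c_context, V):
    """Add-one (Laplace) smoothed bigram probability from raw counts."""
    return (c_bigram + 1) / (c_context + V)

def reconstituted_count(c_bigram, c_context, V):
    """Effective count implied by the smoothed probability:
    c* = (c + 1) * C(context) / (C(context) + V)."""
    return (c_bigram + 1) * c_context / (c_context + V)

# Berkeley Restaurant example: C(want) = 927, C(want, to) = 608, V = 1446.
print(round(reconstituted_count(c_bigram=608, c_context=927, V=1446)))  # 238
&lt;/code&gt;&lt;/pre&gt;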


&lt;div class="crayons-card c-embed"&gt;

  &lt;br&gt;
&lt;strong&gt;This is not a good fix. It's a &lt;em&gt;working&lt;/em&gt; fix.&lt;/strong&gt; With 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;V=1446V = 1446 &lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;V&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;=&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;1446&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
 in the Berkeley corpus, adding 1 to each of 1,446 possible bigrams per context word dilutes the probability mass heavily. "Want to" drops from an effective count of 608 to 238. Add-one smoothing eliminates zeros, but it distorts the counts you actually trusted.&lt;br&gt;

&lt;/div&gt;


&lt;p&gt;Good enough for text classification where the vocabulary is small. Not good enough for language modeling. We need something smarter.&lt;/p&gt;




&lt;h2&gt;
  
  
  Backoff and Interpolation
&lt;/h2&gt;

&lt;p&gt;A better idea: &lt;strong&gt;don't commit to one n-gram order.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Sometimes you have enough data for a reliable trigram. Sometimes you don't, and the bigram is more trustworthy. Sometimes even the bigram is sparse and you need the unigram.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Backoff&lt;/strong&gt; picks the highest-order n-gram you have good counts for and uses that alone. If the trigram count is zero, fall back to bigram. If that's zero, fall back to unigram.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Interpolation&lt;/strong&gt; mixes all orders simultaneously:&lt;/p&gt;


&lt;div class="katex-element"&gt;
  &lt;span class="katex-display"&gt;&lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;P^(wn∣wn−2,wn−1)=λ1P(wn∣wn−2,wn−1)+λ2P(wn∣wn−1)+λ3P(wn)
\hat{P}(w_n \mid w_{n-2}, w_{n-1}) = \lambda_1 P(w_n \mid w_{n-2}, w_{n-1}) + \lambda_2 P(w_n \mid w_{n-1}) + \lambda_3 P(w_n)
&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord accent"&gt;&lt;span class="vlist-t"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;P&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="accent-body"&gt;&lt;span class="mord"&gt;^&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;w&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;n&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;∣&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;w&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;n&lt;/span&gt;&lt;span class="mbin mtight"&gt;−&lt;/span&gt;&lt;span class="mord mtight"&gt;2&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mpunct"&gt;,&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;w&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;n&lt;/span&gt;&lt;span class="mbin mtight"&gt;−&lt;/span&gt;&lt;span class="mord mtight"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;=&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;λ&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span 
class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;P&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;w&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;n&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;∣&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;w&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;n&lt;/span&gt;&lt;span class="mbin mtight"&gt;−&lt;/span&gt;&lt;span class="mord mtight"&gt;2&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mpunct"&gt;,&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;w&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;n&lt;/span&gt;&lt;span class="mbin mtight"&gt;−&lt;/span&gt;&lt;span class="mord mtight"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;+&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;λ&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;2&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;P&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord"&gt;&lt;span 
class="mord mathnormal"&gt;w&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;n&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;∣&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;w&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;n&lt;/span&gt;&lt;span class="mbin mtight"&gt;−&lt;/span&gt;&lt;span class="mord mtight"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;+&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;λ&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;3&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;P&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;w&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;n&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/div&gt;


&lt;p&gt;where 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;λ1+λ2+λ3=1\lambda_1 + \lambda_2 + \lambda_3 = 1 &lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;λ&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;+&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;λ&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;2&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;+&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;λ&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;3&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;=&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
.&lt;/p&gt;

&lt;p&gt;The 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;λ\lambda &lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;λ&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
 weights are learned from a held-out corpus. You search for the combination that makes the held-out data most probable. Interpolation beats backoff because you're always using signal from every order, not discarding the lower ones when the higher one happens to have counts.&lt;/p&gt;
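<p></p>
&lt;p&gt;A minimal sketch of interpolation with hard-coded weights (in a real system the lambdas would be tuned on held-out data; here they're fixed just to show the blend). The three probability functions are placeholders standing in for MLE estimates built from counts, like in the earlier sketches.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;def interpolated_prob(w, prev2, prev1, p_uni, p_bi, p_tri, lambdas=(0.5, 0.3, 0.2)):
    """P_hat(w | prev2, prev1) as a weighted mix of trigram, bigram, and unigram estimates."""
    l_tri, l_bi, l_uni = lambdas  # must sum to 1
    return (l_tri * p_tri(w, prev2, prev1)
            + l_bi * p_bi(w, prev1)
            + l_uni * p_uni(w))

# Placeholder estimates: the trigram was never seen, but interpolation still
# produces a nonzero probability because the bigram and unigram hedge for it.
p_uni = lambda w: 0.001
p_bi = lambda w, prev1: 0.02
p_tri = lambda w, prev2, prev1: 0.0

print(interpolated_prob("offer", "denied", "the", p_uni, p_bi, p_tri))  # ≈ 0.0062
&lt;/code&gt;&lt;/pre&gt;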

&lt;p&gt;&lt;strong&gt;Backoff vs. interpolation: when does each make sense?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Backoff is simpler to implement and computationally cheaper — you only compute one probability. It works well when you have a massive corpus (like Google's web n-grams) where high-order counts are usually reliable and you only fall back rarely. Interpolation is better when data is sparser, because it always hedges — even a weak trigram estimate contributes something when blended with a strong bigram.&lt;/p&gt;




&lt;h2&gt;
  
  
  Part 2: What Does a Word Even Mean?
&lt;/h2&gt;

&lt;p&gt;Everything up to this point has a shared limitation: &lt;strong&gt;words are just strings.&lt;/strong&gt; In n-gram models, "cat" is index 4,217 in the vocabulary. "Dog" is index 2,903. They are as unrelated as "cat" and "photosynthesis." For models that reason about language, we need representations that carry meaning.&lt;/p&gt;

&lt;p&gt;That's what embeddings are about. But before jumping to algorithms, we need to ask: what does word meaning actually involve? It's messier than you'd expect.&lt;/p&gt;




&lt;h2&gt;
  
  
  Lemmas and Senses
&lt;/h2&gt;

&lt;p&gt;Take "pepper." One word — the &lt;strong&gt;lemma&lt;/strong&gt;, the dictionary entry form. But it has at least five &lt;strong&gt;senses&lt;/strong&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The spice (black pepper, peppercorns)&lt;/li&gt;
&lt;li&gt;The plant (&lt;em&gt;Piper nigrum&lt;/em&gt;)&lt;/li&gt;
&lt;li&gt;Capsicum varieties (bell pepper, chili)&lt;/li&gt;
&lt;li&gt;California pepper tree&lt;/li&gt;
&lt;li&gt;Extended uses ("pepper someone with questions")&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;One form, many meanings. &lt;a href="https://wordnet.princeton.edu/" rel="noopener noreferrer"&gt;WordNet&lt;/a&gt;, a structured lexical database, catalogs all of this: senses, definitions, usage frequencies, and relationships between words. For decades, WordNet was the backbone of NLP systems that needed to reason about meaning.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvs6usr1dhklr3sbla38n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvs6usr1dhklr3sbla38n.png" alt="img5" width="800" height="699"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Seven Ways Words Relate to Each Other
&lt;/h2&gt;

&lt;p&gt;Words don't exist in isolation. They connect through multiple kinds of relationships. Embeddings need to capture all of them, which is part of what makes the problem hard.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Synonymy&lt;/strong&gt;: roughly the same meaning. Couch/sofa, big/large, car/automobile. But true &lt;em&gt;perfect&lt;/em&gt; synonymy may not exist. If two words meant exactly the same thing in every context, why would the language keep both? This is the &lt;strong&gt;principle of contrast&lt;/strong&gt;: a difference in form always signals &lt;em&gt;some&lt;/em&gt; difference in meaning. "Water" and "H₂O" name the same substance, but you'd never write "H₂O" in a hiking guide.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Similarity&lt;/strong&gt;: shared elements of meaning, but not interchangeable. Car and bicycle are similar (both vehicles). Cow and horse are similar (both large animals). Humans rate these reliably: vanish/disappear scores 9.8 out of 10 on the SimLex-999 dataset, hole/agreement scores 0.3.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Relatedness&lt;/strong&gt;: connected not by shared meaning, but by co-participation in situations. Car and gasoline aren't &lt;em&gt;similar&lt;/em&gt; — one is a vehicle, the other is a liquid. But they're tightly &lt;em&gt;related&lt;/em&gt; because they show up in the same events. Scalpel and surgeon: completely different objects, strongly associated.&lt;/p&gt;

&lt;p&gt;This distinction matters. Similarity and relatedness are different signals, and embeddings that confuse them will make downstream mistakes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Semantic fields&lt;/strong&gt;: clusters of words that cover a domain. Hospital: surgeon, scalpel, nurse, anesthetic. Restaurant: waiter, menu, plate, chef. These field structures give embeddings their neighborhood quality; words from the same field land near each other.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Antonymy&lt;/strong&gt;: opposites. Dark/light, hot/cold, up/down, rise/fall. The tricky part is that antonyms are actually very &lt;em&gt;similar&lt;/em&gt;. Dark and light share almost all features of meaning — both are about illumination. They differ on just one dimension. This creates a problem for embeddings: should "dark" and "light" be close together (similar concept) or far apart (opposite value)?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;6. Taxonomic relations&lt;/strong&gt;: hierarchies. Vehicle is a superordinate of car. Mango is a subordinate of fruit. These &lt;code&gt;IS-A&lt;/code&gt; chains form the skeleton of meaning.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;7. Basic level categories&lt;/strong&gt;: not all levels in a taxonomy are equal. Show someone a beagle, and they say "dog." Not "beagle." Not "animal." "Dog" is the &lt;strong&gt;basic level&lt;/strong&gt;, the one humans default to. Basic-level words are learned first by children, are the shortest, and are the most frequent. We perceive the world at this level.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhx77l7z752b8d50z5w5e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhx77l7z752b8d50z5w5e.png" alt="img6" width="800" height="469"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Connotation&lt;/strong&gt;: on top of all the above, words carry affective charge. Happy = positive. Sad = negative. Near-synonyms can diverge sharply: "innocent" (positive) vs. "naive" (negative). "Replica" (neutral) vs. "forgery" (negative). Words vary along three affective dimensions: valence (pleasant/unpleasant), arousal (exciting/calm), and dominance (controlling/controlled).&lt;/p&gt;




&lt;h2&gt;
  
  
  Why Formal Definitions Failed
&lt;/h2&gt;

&lt;p&gt;Early NLP tried to pin down word meaning with logic. A square: four sides, all straight, a closed figure, planar, equal-length sides, right angles. Done. Clean. Works for geometry.&lt;/p&gt;

&lt;p&gt;Now try "cup."&lt;/p&gt;

&lt;p&gt;&lt;a href="https://skilja.com/classification-and-context/#:~:text=One%20early%20research%20that%20illustrates,%2C%20handle%20yes/no%20etc." rel="noopener noreferrer"&gt;William Labov did&lt;/a&gt;. His formal definition of "cup" involved ratios of depth to width, the presence or absence of handles, material opacity, whether it's used for hot liquid, and probability functions over these features. A full paragraph of mathematical notation - to define a cup. And it still broke on edge cases. &lt;a href="http://www.skilja.de/wp-content/uploads/2012/03/Labov-Cups-2.png" rel="noopener noreferrer"&gt;At what point does a cup become a bowl? A mug? A vase?&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This was real NLP for decades: hand-building lexicons of feature-based definitions. Slow, brittle, and it never scaled.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frppk7yutlpg6bac0ga8l.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frppk7yutlpg6bac0ga8l.png" alt="img7" width="800" height="425"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Wittgenstein's Way Out
&lt;/h2&gt;

&lt;p&gt;Ludwig Wittgenstein, philosopher of language, offered one sentence that reframed the whole problem:&lt;/p&gt;


&lt;div class="crayons-card c-embed"&gt;

  &lt;br&gt;
&lt;strong&gt;"The meaning of a word is its use in the language."&lt;/strong&gt;&lt;br&gt;

&lt;/div&gt;


&lt;p&gt;Stop trying to write definitions. A word's meaning is just the contexts it shows up in (the words that surround it). If two words consistently appear in the same environments, they mean similar things.&lt;/p&gt;

&lt;p&gt;This is testable. Consider a word you've never seen: &lt;strong&gt;ongchoi&lt;/strong&gt;. You encounter:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"Ongchoi is delicious sautéed with garlic."&lt;/li&gt;
&lt;li&gt;"Ongchoi is superb over rice."&lt;/li&gt;
&lt;li&gt;"Ongchoi leaves with salty sauces."&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And you've seen similar contexts for spinach, chard, and collard greens. Without a definition, without a feature list, without WordNet, you know ongchoi is a leafy green vegetable. The context told you.&lt;/p&gt;

&lt;p&gt;This principle, that meaning lives in usage patterns, is the &lt;strong&gt;distributional hypothesis&lt;/strong&gt;. It's the idea that eventually became word embeddings. An embedding is a vector that encodes a word's usage across a massive corpus. You don't define "dog" with a feature list. You let millions of contexts define it for you.&lt;/p&gt;

&lt;p&gt;The next post turns this insight into math: co-occurrence matrices, sparse vs. dense vectors, and Word2Vec.&lt;/p&gt;




&lt;h2&gt;
  
  
  What You Now Have
&lt;/h2&gt;

&lt;p&gt;Six things from this post:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;MLE on real data&lt;/strong&gt;: bigram count tables are mostly zeros, the non-zero entries encode syntactic and cultural patterns, and you always compute in log space.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Perplexity&lt;/strong&gt;: the standard intrinsic metric for language models. Inverse probability of the test set, normalized by length. Lower = better. Interpretable as the weighted average branching factor: how many options the model is confused between at each step.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Sentence generation&lt;/strong&gt;: sample from the probability distribution, append, repeat. Same loop in n-grams and LLMs. The output mirrors the training corpus so faithfully that you can identify the author from n-gram statistics alone.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The zero problem and smoothing&lt;/strong&gt;: most possible n-grams are unseen. One zero kills the whole computation. Add-one smoothing is a working fix, not a good one. Interpolation mixes n-gram orders with learned weights and actually works well.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The landscape of word meaning&lt;/strong&gt;: synonymy, similarity, relatedness, antonymy, taxonomic hierarchies, basic level categories, connotation. These are the phenomena that embeddings need to capture. Formal definitions tried and failed.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Wittgenstein's principle&lt;/strong&gt;: "the meaning of a word is its use in the language." This one idea is the philosophical foundation of word embeddings: meaning is not a feature list, it's a &lt;strong&gt;usage pattern&lt;/strong&gt;. The distributional hypothesis made it computational.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;


&lt;div class="crayons-card c-embed"&gt;

  &lt;br&gt;
&lt;strong&gt;Next post:&lt;/strong&gt; turning words into actual vectors — count-based embeddings, Word2Vec, and the cosine similarity measure that ties it all together.&lt;br&gt;

&lt;/div&gt;


</description>
      <category>nlp</category>
      <category>machinelearning</category>
      <category>ai</category>
      <category>languagemodels</category>
    </item>
    <item>
      <title>Before LLMs Could Predict, They Had to Count</title>
      <dc:creator>Akash</dc:creator>
      <pubDate>Wed, 01 Apr 2026 07:46:21 +0000</pubDate>
      <link>https://forem.com/kambleakash0/before-llms-could-predict-they-had-to-count-20p5</link>
      <guid>https://forem.com/kambleakash0/before-llms-could-predict-they-had-to-count-20p5</guid>
      <description>&lt;p&gt;By the end of this post, you'll understand exactly how the simplest language models work, the chain rule, the Markov assumption, n-grams, maximum likelihood estimation,  and you'll see why every one of these ideas is still alive inside the LLMs you use daily. You'll also understand the specific limitations that forced the field to move beyond counting and into neural prediction. This isn't history for history's sake. This is the conceptual foundation without which transformers don't make sense. This is how n-gram language models laid the foundation for every idea that transformers run on today.&lt;/p&gt;




&lt;h2&gt;
  
  
  One Task, One Question
&lt;/h2&gt;

&lt;p&gt;Every language model, from a 1990s bigram counter to GPT-4, does the same job: &lt;strong&gt;given some words, figure out what word comes next.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;More precisely, a language model computes one of two things:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The probability of a full sentence: 

&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;P(w1,w2,…,wn)P(w_1, w_2, \dots, w_n) &lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;P&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;w&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mpunct"&gt;,&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;w&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;2&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mpunct"&gt;,&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="minner"&gt;…&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mpunct"&gt;,&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;w&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;n&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
&lt;/li&gt;
&lt;li&gt;The probability of the next word given everything before it: 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;P(wn∣w1,w2,…,wn−1)P(w_n \mid w_1, w_2, \dots, w_{n-1}) &lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;P&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;w&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;n&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;∣&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;w&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mpunct"&gt;,&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;w&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;2&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mpunct"&gt;,&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="minner"&gt;…&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mpunct"&gt;,&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;w&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;n&lt;/span&gt;&lt;span class="mbin mtight"&gt;−&lt;/span&gt;&lt;span class="mord mtight"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
&lt;/li&gt;
&lt;/ul&gt;


&lt;div class="crayons-card c-embed"&gt;

  &lt;br&gt;
That's the whole definition. Any model that computes either of these is a language model. The difference between n-grams and LLMs isn't the &lt;em&gt;task&lt;/em&gt;, it's the &lt;em&gt;machinery&lt;/em&gt;.&lt;br&gt;

&lt;/div&gt;


&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6fe428tc684c0n8cugrq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6fe428tc684c0n8cugrq.png" alt="img1" width="800" height="525"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Why Would Anyone Need Sentence Probabilities?
&lt;/h2&gt;

&lt;p&gt;Before we get into how language models work, let's ground this in real tasks where you &lt;em&gt;need&lt;/em&gt; one:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Machine translation&lt;/strong&gt;: Your system translates a Spanish sentence and produces two candidates. 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;P(high winds tonight)&amp;gt;P(large winds tonight)P(\text{high winds tonight}) &amp;gt; P(\text{large winds tonight}) &lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;P&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord text"&gt;&lt;span class="mord"&gt;high winds tonight&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;&amp;gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;P&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord text"&gt;&lt;span class="mord"&gt;large winds tonight&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
. "High winds" sounds right. The language model picks it.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Spell correction&lt;/strong&gt;: "The office is about fifteen &lt;em&gt;minuets&lt;/em&gt; from my house." Both "minutes" and "minuets" are real English words. (Minuet is a dance.) But 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;P(fifteen minutes from)&amp;gt;P(fifteen minuets from)P(\text{fifteen minutes from}) &amp;gt; P(\text{fifteen minuets from}) &lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;P&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord text"&gt;&lt;span class="mord"&gt;fifteen minutes from&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;&amp;gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;P&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord text"&gt;&lt;span class="mord"&gt;fifteen minuets from&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
, and the language model knows the difference.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Speech recognition&lt;/strong&gt;: Audio is ambiguous. "I saw a van" or "eyes awe of an"? 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;P(I saw a van)≫P(eyes awe of an)P(\text{I saw a van}) \gg P(\text{eyes awe of an}) &lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;P&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord text"&gt;&lt;span class="mord"&gt;I saw a van&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;≫&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;P&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord text"&gt;&lt;span class="mord"&gt;eyes awe of an&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
. Obvious to you. Not obvious to a machine without a language model.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Language models also power autocomplete, summarization, and question answering. And yes, LLMs are language models. They're language models trained at a scale that changes what's possible. But the core task hasn't moved.&lt;/p&gt;
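&lt;p&gt;To make the reranking idea concrete, here's a minimal sketch (mine, not from the textbook) of picking whichever candidate a language model scores highest. The &lt;code&gt;sentence_logprob&lt;/code&gt; function is a hypothetical stand-in for whatever model you actually have, and the numbers below are made up purely to show the mechanics.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import math

def pick_best(candidates, sentence_logprob):
    # Score every candidate with the language model and keep the one
    # it considers most probable (log-probabilities, so higher is better).
    return max(candidates, key=sentence_logprob)

# Toy stand-in: a hand-written table of log-probabilities.
toy_scores = {
    "high winds tonight": math.log(3e-7),
    "large winds tonight": math.log(4e-9),
}
print(pick_best(toy_scores, lambda s: toy_scores[s]))  # prints "high winds tonight"
&lt;/code&gt;&lt;/pre&gt;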

&lt;p&gt;One more property to flag before we move on: language models are &lt;strong&gt;generative&lt;/strong&gt;. Predict the next word, sample it, append it, repeat. That generate-one-word-at-a-time loop is exactly what ChatGPT does. The idea is older than you think.&lt;/p&gt;
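&lt;p&gt;That loop fits in a few lines. Here's a minimal sketch, assuming you already have a hypothetical &lt;code&gt;next_word_distribution(context)&lt;/code&gt; that returns a dict mapping each candidate next word to its probability. That function is the entire hard part; the rest of this section is about one way to estimate it.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import random

def generate(next_word_distribution, prompt, max_words=20, end_token="END"):
    words = list(prompt)
    for _ in range(max_words):
        dist = next_word_distribution(words)        # P(next word | context), as a dict
        choices, probs = zip(*dist.items())
        next_word = random.choices(choices, weights=probs, k=1)[0]  # sample one word
        if next_word == end_token:
            break
        words.append(next_word)                     # append it and go around again
    return " ".join(words)
&lt;/code&gt;&lt;/pre&gt;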




&lt;h2&gt;
  
  
  The Counting Problem
&lt;/h2&gt;

&lt;p&gt;Here's the first real question. We want to compute 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;P(“its water is so transparent that")P(\text{``its water is so transparent that"}) &lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;P&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord text"&gt;&lt;span class="mord"&gt;“its water is so transparent that"&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
. How?&lt;/p&gt;

&lt;p&gt;The brute-force answer: go to a corpus, count how many times this exact six-word sequence appears, and divide by the count of all six-word sequences. But language is creative. People produce new sentences constantly. You'll almost never find an exact match for any long sentence in your data.&lt;/p&gt;

&lt;p&gt;We need something smarter. Three ideas, stacked on top of each other, get us there.&lt;/p&gt;


&lt;h2&gt;
  
  
  Idea 1: The Chain Rule (Break the Sentence Apart)
&lt;/h2&gt;

&lt;p&gt;Instead of computing the probability of the full sentence at once, we decompose it:&lt;/p&gt;


&lt;div class="katex-element"&gt;
  &lt;span class="katex-display"&gt;&lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;P(w1,w2,…,wn)=P(w1)⋅P(w2∣w1)⋅P(w3∣w1,w2)⋯P(wn∣w1,…,wn−1)
P(w_1, w_2, \dots, w_n) = P(w_1) \cdot P(w_2 \mid w_1) \cdot P(w_3 \mid w_1, w_2) \cdots P(w_n \mid w_1, \dots, w_{n-1})
&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;P&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;w&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mpunct"&gt;,&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;w&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;2&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mpunct"&gt;,&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="minner"&gt;…&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mpunct"&gt;,&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;w&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;n&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;=&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;P&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;w&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;⋅&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;P&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord"&gt;&lt;span 
class="mord mathnormal"&gt;w&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;2&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;∣&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;w&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;⋅&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;P&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;w&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;3&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;∣&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;w&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mpunct"&gt;,&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;w&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;2&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span 
class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="minner"&gt;⋯&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;P&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;w&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;n&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;∣&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;w&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mpunct"&gt;,&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="minner"&gt;…&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mpunct"&gt;,&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;w&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;n&lt;/span&gt;&lt;span class="mbin mtight"&gt;−&lt;/span&gt;&lt;span class="mord mtight"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/div&gt;


&lt;p&gt;Compactly:&lt;/p&gt;


&lt;div class="katex-element"&gt;
  &lt;span class="katex-display"&gt;&lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;P(w1:n)=∏k=1nP(wk∣w1:k−1)
P(w_{1:n}) = \prod_{k=1}^{n} P(w_k \mid w_{1:k-1})
&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;P&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;w&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mtight"&gt;1&lt;/span&gt;&lt;span class="mrel mtight"&gt;:&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;n&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;=&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mop op-limits"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;k&lt;/span&gt;&lt;span class="mrel mtight"&gt;=&lt;/span&gt;&lt;span class="mord mtight"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="mop op-symbol large-op"&gt;∏&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;n&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;P&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;w&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;k&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;∣&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;w&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mtight"&gt;1&lt;/span&gt;&lt;span class="mrel mtight"&gt;:&lt;/span&gt;&lt;span class="mord 
mathnormal mtight"&gt;k&lt;/span&gt;&lt;span class="mbin mtight"&gt;−&lt;/span&gt;&lt;span class="mord mtight"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/div&gt;


&lt;p&gt;The probability of a sentence is the probability of the first word, times the probability of the second word given the first, times the third given the first two, and so on.&lt;/p&gt;
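&lt;p&gt;In code, the chain rule is just a running product of conditional probabilities. A sketch, assuming a hypothetical &lt;code&gt;cond_prob(word, context)&lt;/code&gt; that returns the probability of &lt;code&gt;word&lt;/code&gt; given the words before it:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;def sentence_prob(words, cond_prob):
    # Chain rule: P(w_1, ..., w_n) is the product over k of P(w_k | w_1 ... w_{k-1}).
    prob = 1.0
    for k, word in enumerate(words):
        prob = prob * cond_prob(word, words[:k])   # words[:k] is everything before w_k
    return prob

# sentence_prob("its water is so transparent".split(), cond_prob)
&lt;/code&gt;&lt;/pre&gt;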

&lt;p&gt;For our example:&lt;/p&gt;


&lt;div class="katex-element"&gt;
  &lt;span class="katex-display"&gt;&lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;P(its water is so transparent)=P(its)×P(water∣its)×P(is∣its water)×P(so∣its water is)×P(transparent∣its water is so)
P(\text{its water is so transparent}) = P(\text{its}) \times P(\text{water} \mid \text{its}) \times P(\text{is} \mid \text{its water}) \times P(\text{so} \mid \text{its water is}) \times P(\text{transparent} \mid \text{its water is so})
&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;P&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord text"&gt;&lt;span class="mord"&gt;its water is so transparent&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;=&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;P&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord text"&gt;&lt;span class="mord"&gt;its&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;×&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;P&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord text"&gt;&lt;span class="mord"&gt;water&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;∣&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord text"&gt;&lt;span class="mord"&gt;its&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;×&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;P&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord text"&gt;&lt;span class="mord"&gt;is&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;∣&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord text"&gt;&lt;span class="mord"&gt;its water&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;×&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;P&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord text"&gt;&lt;span class="mord"&gt;so&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;∣&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord text"&gt;&lt;span class="mord"&gt;its water is&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;×&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;P&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord text"&gt;&lt;span class="mord"&gt;transparent&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;∣&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord text"&gt;&lt;span class="mord"&gt;its water is so&lt;/span&gt;&lt;/span&gt;&lt;span 
class="mclose"&gt;)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/div&gt;


&lt;p&gt;This is mathematically exact. No approximation. But look at that last term: 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;P(transparent∣its water is so)P(\text{transparent} \mid \text{its water is so}) &lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;P&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord text"&gt;&lt;span class="mord"&gt;transparent&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;∣&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord text"&gt;&lt;span class="mord"&gt;its water is so&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
&lt;br&gt;
You need to have seen "its water is so" enough times in your corpus to estimate anything. And the conditioning context grows with every word. For long sentences, you'll never have enough data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How would we estimate each of these terms?&lt;/strong&gt; Count and divide:&lt;/p&gt;


&lt;div class="katex-element"&gt;
  &lt;span class="katex-display"&gt;&lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;P(wn∣w1:n−1)=C(w1:n)C(w1:n−1)
P(w_n \mid w_{1:n-1}) = \frac{C(w_{1:n})}{C(w_{1:n-1})}
&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;P&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;w&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;n&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;∣&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;w&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mtight"&gt;1&lt;/span&gt;&lt;span class="mrel mtight"&gt;:&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;n&lt;/span&gt;&lt;span class="mbin mtight"&gt;−&lt;/span&gt;&lt;span class="mord mtight"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;=&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mopen nulldelimiter"&gt;&lt;/span&gt;&lt;span class="mfrac"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;C&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;w&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mtight"&gt;1&lt;/span&gt;&lt;span class="mrel mtight"&gt;:&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;n&lt;/span&gt;&lt;span class="mbin mtight"&gt;−&lt;/span&gt;&lt;span class="mord mtight"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="frac-line"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;C&lt;/span&gt;&lt;span 
class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;w&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mtight"&gt;1&lt;/span&gt;&lt;span class="mrel mtight"&gt;:&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;n&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose nulldelimiter"&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/div&gt;


&lt;p&gt;Count how many times the full sequence appears. Divide by how many times the prefix appears. Simple, but impossible for long sequences. Nobody's corpus is big enough.&lt;/p&gt;
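&lt;p&gt;Written out naively, the estimator looks like the sketch below (my illustration, with &lt;code&gt;corpus&lt;/code&gt; assumed to be one long list of words). It also makes the sparsity problem concrete: for any long prefix, both counts are almost always zero.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;def count_sequence(corpus, seq):
    # How many times does seq appear as a contiguous run of words in the corpus?
    n = len(seq)
    return sum(1 for i in range(len(corpus) - n + 1) if corpus[i:i + n] == seq)

def naive_cond_prob(corpus, prefix, word):
    # P(word | prefix) estimated as C(prefix + word) / C(prefix).
    denom = count_sequence(corpus, prefix)
    if denom == 0:
        return None   # the prefix never occurs, so there is nothing to divide by
    return count_sequence(corpus, prefix + [word]) / denom
&lt;/code&gt;&lt;/pre&gt;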

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwm0b2lonx3re54nq1de8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwm0b2lonx3re54nq1de8.png" alt="img2" width="800" height="580"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Idea 2: The Markov Assumption (Forget Most of the History)
&lt;/h2&gt;


&lt;div class="crayons-card c-embed"&gt;

  &lt;br&gt;
&lt;strong&gt;Andrei Markov's insight&lt;/strong&gt;: You don't need the entire history. The last few words are enough.&lt;br&gt;

&lt;/div&gt;


&lt;p&gt;Instead of:&lt;/p&gt;


&lt;div class="katex-element"&gt;
  &lt;span class="katex-display"&gt;&lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;P(transparent∣its water is so)
P(\text{transparent} \mid \text{its water is so})
&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;P&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord text"&gt;&lt;span class="mord"&gt;transparent&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;∣&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord text"&gt;&lt;span class="mord"&gt;its water is so&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/div&gt;


&lt;p&gt;Approximate with:&lt;/p&gt;


&lt;div class="katex-element"&gt;
  &lt;span class="katex-display"&gt;&lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;P(transparent∣so)(bigram — 1 previous word)
P(\text{transparent} \mid \text{so}) \quad \text{(bigram — 1 previous word)}
&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;P&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord text"&gt;&lt;span class="mord"&gt;transparent&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;∣&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord text"&gt;&lt;span class="mord"&gt;so&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord text"&gt;&lt;span class="mord"&gt;(bigram — 1 previous word)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/div&gt;



&lt;div class="katex-element"&gt;
  &lt;span class="katex-display"&gt;&lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;P(transparent∣is so)(trigram — 2 previous words)
P(\text{transparent} \mid \text{is so}) \quad \text{(trigram — 2 previous words)}
&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;P&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord text"&gt;&lt;span class="mord"&gt;transparent&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;∣&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord text"&gt;&lt;span class="mord"&gt;is so&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord text"&gt;&lt;span class="mord"&gt;(trigram — 2 previous words)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/div&gt;


&lt;p&gt;This is the &lt;strong&gt;Markov assumption&lt;/strong&gt;: the next word depends only on the recent past, not the full history.&lt;/p&gt;

&lt;p&gt;It's wrong. Language has long-range dependencies. "The computer which I had just put into the machine room on the fifth floor &lt;strong&gt;crashed&lt;/strong&gt;." The verb "crashed" depends on "computer," with thirteen words in between. A bigram model can't see that far.&lt;/p&gt;

&lt;p&gt;But it works well enough to be useful. The general n-gram approximation:&lt;/p&gt;


&lt;div class="katex-element"&gt;
  &lt;span class="katex-display"&gt;&lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;P(wn∣w1:n−1)≈P(wn∣wn−N+1:n−1)
P(w_n \mid w_{1:n-1}) \approx P(w_n \mid w_{n-N+1:n-1})
&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;P&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;w&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;n&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;∣&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;w&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mtight"&gt;1&lt;/span&gt;&lt;span class="mrel mtight"&gt;:&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;n&lt;/span&gt;&lt;span class="mbin mtight"&gt;−&lt;/span&gt;&lt;span class="mord mtight"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;≈&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;P&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;w&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;n&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;∣&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;w&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;n&lt;/span&gt;&lt;span class="mbin mtight"&gt;−&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;N&lt;/span&gt;&lt;span class="mbin mtight"&gt;+&lt;/span&gt;&lt;span class="mord mtight"&gt;1&lt;/span&gt;&lt;span class="mrel 
mtight"&gt;:&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;n&lt;/span&gt;&lt;span class="mbin mtight"&gt;−&lt;/span&gt;&lt;span class="mord mtight"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/div&gt;


&lt;p&gt;where &lt;span class="katex-element"&gt;N&lt;/span&gt; is the n-gram order: &lt;span class="katex-element"&gt;N=2&lt;/span&gt; for bigrams, &lt;span class="katex-element"&gt;N=3&lt;/span&gt; for trigrams.&lt;/p&gt;
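&lt;p&gt;The approximation itself is one line: before looking anything up, keep only the last &lt;span class="katex-element"&gt;N-1&lt;/span&gt; words of the history. A sketch:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;def markov_context(context, N):
    # The Markov assumption in code: keep only the last N-1 words of the history.
    start = max(len(context) - (N - 1), 0)
    return context[start:]

# markov_context("its water is so".split(), 2) returns ['so']         (bigram)
# markov_context("its water is so".split(), 3) returns ['is', 'so']   (trigram)
&lt;/code&gt;&lt;/pre&gt;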

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2wq5c84cf03nw7xhg8jt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2wq5c84cf03nw7xhg8jt.png" alt="img3" width="800" height="353"&gt;&lt;/a&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  Idea 3: The N-gram Models (Count Short Sequences)
&lt;/h2&gt;

&lt;p&gt;An &lt;strong&gt;n-gram&lt;/strong&gt; is a contiguous sequence of &lt;span class="katex-element"&gt;n&lt;/span&gt; words. The n-gram model uses the previous &lt;span class="katex-element"&gt;n-1&lt;/span&gt; words to predict the next one. Three versions, each a little less naive than the last.&lt;/p&gt;
&lt;h3&gt;
  
  
  Unigram: No Context At All
&lt;/h3&gt;

&lt;p&gt;The simplest possible language model. Zero context. Each word generated independently:&lt;/p&gt;


&lt;div class="katex-element"&gt;
  &lt;span class="katex-display"&gt;&lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;P(w1,w2,…,wn)≈∏i=1nP(wi)
P(w_1, w_2, \dots, w_n) \approx \prod_{i=1}^{n} P(w_i)
&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;P&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;w&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mpunct"&gt;,&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;w&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;2&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mpunct"&gt;,&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="minner"&gt;…&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mpunct"&gt;,&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;w&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;n&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;≈&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mop op-limits"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;i&lt;/span&gt;&lt;span class="mrel mtight"&gt;=&lt;/span&gt;&lt;span class="mord mtight"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="mop op-symbol large-op"&gt;∏&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;n&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span 
class="mspace"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;P&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;w&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;i&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/div&gt;


&lt;p&gt;Words are drawn purely by frequency. Generate from a unigram model, and you get word soup:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"fifth, an, of, futures, the, an, incorporated, a, a, the, inflation, most, dollars, quarter, in, is, mass, thrift, did, eighty, said, hard, 'm, july, bullish"&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;"The" appears a lot because it's the most frequent English word, not because it belongs next to "an" or "of." Every word is independent of every other word. This model technically &lt;em&gt;is&lt;/em&gt; a language model, but barely.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F66tenqi5i50e8dhm5bhn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F66tenqi5i50e8dhm5bhn.png" alt="img4" width="800" height="595"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Bigram: One Word of Memory
&lt;/h3&gt;

&lt;p&gt;Now each word is conditioned on the &lt;strong&gt;one previous word&lt;/strong&gt;: 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;P(wi∣wi−1)P(w_i \mid w_{i-1}) &lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;P&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;w&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;i&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;∣&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;w&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;i&lt;/span&gt;&lt;span class="mbin mtight"&gt;−&lt;/span&gt;&lt;span class="mord mtight"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
.&lt;/p&gt;


&lt;div class="katex-element"&gt;
  &lt;span class="katex-display"&gt;&lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;P(w1:n)≈∏k=1nP(wk∣wk−1)
P(w_{1:n}) \approx \prod_{k=1}^{n} P(w_k \mid w_{k-1})
&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;P&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;w&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mtight"&gt;1&lt;/span&gt;&lt;span class="mrel mtight"&gt;:&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;n&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;≈&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mop op-limits"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;k&lt;/span&gt;&lt;span class="mrel mtight"&gt;=&lt;/span&gt;&lt;span class="mord mtight"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="mop op-symbol large-op"&gt;∏&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;n&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;P&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;w&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;k&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;∣&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;w&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;k&lt;/span&gt;&lt;span class="mbin mtight"&gt;−&lt;/span&gt;&lt;span 
class="mord mtight"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/div&gt;


&lt;p&gt;One word of context. Already noticeably better:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"texaco, rose, one, in, this, issue, is, pursuing, growth, in, a, boiler, house, said, mr., gurria, mexico, 's, motion, control, proposal, **without, permission&lt;/em&gt;&lt;em&gt;, from, five, hundred, fifty, five, yen"&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;"Without permission." "Five hundred fifty five." Real collocations, word pairs that naturally occur together. The bigram model captures local patterns. But zoom out, and the sentence is still nonsense.&lt;/p&gt;

&lt;p&gt;How do we get bigram probabilities? &lt;strong&gt;Maximum Likelihood Estimation&lt;/strong&gt; (count and divide):&lt;/p&gt;


&lt;div class="katex-element"&gt;
  &lt;span class="katex-display"&gt;&lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;P(wi∣wi−1)=C(wi−1,  wi)C(wi−1)
P(w_i \mid w_{i-1}) = \frac{C(w_{i-1},\; w_i)}{C(w_{i-1})}
&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;P&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;w&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;i&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;∣&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;w&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;i&lt;/span&gt;&lt;span class="mbin mtight"&gt;−&lt;/span&gt;&lt;span class="mord mtight"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;=&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mopen nulldelimiter"&gt;&lt;/span&gt;&lt;span class="mfrac"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;C&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;w&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;i&lt;/span&gt;&lt;span class="mbin mtight"&gt;−&lt;/span&gt;&lt;span class="mord mtight"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="frac-line"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;C&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;w&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span 
class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;i&lt;/span&gt;&lt;span class="mbin mtight"&gt;−&lt;/span&gt;&lt;span class="mord mtight"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mpunct"&gt;,&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;w&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;i&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose nulldelimiter"&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/div&gt;


&lt;p&gt;Scan your corpus. Count how many times the pair 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;(wi−1,wi)(w_{i-1}, w_i) &lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;w&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;i&lt;/span&gt;&lt;span class="mbin mtight"&gt;−&lt;/span&gt;&lt;span class="mord mtight"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mpunct"&gt;,&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;w&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;i&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
 appears. Divide by how many times 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;wi−1w_{i-1} &lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;w&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;i&lt;/span&gt;&lt;span class="mbin mtight"&gt;−&lt;/span&gt;&lt;span class="mord mtight"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
 appears alone. That's the whole algorithm. An n-gram language model is, at bottom, a lookup table of counts turned into ratios.&lt;/p&gt;
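
&lt;p&gt;To make "count and divide" concrete, here's a minimal sketch in Python (the toy corpus and function names are mine, not code from the textbook):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;from collections import Counter

# Toy corpus; a real model would use millions of tokens.
corpus = "i want chinese food . i want italian food . i want to eat".split()

bigram_counts = Counter(zip(corpus, corpus[1:]))   # C(w_{i-1}, w_i)
unigram_counts = Counter(corpus)                   # C(w_{i-1})

def bigram_prob(prev, word):
    """P(word | prev) = C(prev, word) / C(prev); zero if never observed."""
    if unigram_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, word)] / unigram_counts[prev]

print(bigram_prob("i", "want"))        # 3/3 = 1.0
print(bigram_prob("want", "chinese"))  # 1/3
print(bigram_prob("want", "japanese")) # 0.0 -- never seen, so zero
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;That last line is the zero problem in miniature: a pair the corpus never contained gets probability 0, full stop. More on that in the next post.&lt;/p&gt;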

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fssk4hwxssow2388bojev.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fssk4hwxssow2388bojev.png" alt="img5" width="800" height="255"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Trigrams and Beyond: More Context, More Data Hunger
&lt;/h3&gt;

&lt;p&gt;Trigrams (
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;n=3n=3 &lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;n&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;=&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;3&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
), 4-grams, 5-grams; each step up means more context, better text. But the count-based method hits a wall. The number of possible n-grams grows exponentially with 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;nn &lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;n&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
, and you never have enough data to get reliable counts for most of them.&lt;/p&gt;

&lt;p&gt;There's also an overfitting trap: with small corpora, high-order n-grams just memorize chunks of training data instead of learning general patterns. Generate from a 4-gram model trained on Shakespeare, and you get... Shakespeare. Verbatim. Not because the model learned English, but because it ran out of options.&lt;/p&gt;
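
&lt;p&gt;The memorization effect is easy to reproduce. A rough sketch (tiny made-up corpus, not the author's code): build 4-gram counts, then sample. Almost every three-word context has exactly one observed continuation, so "generation" just walks back through the training text.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import random
from collections import defaultdict

text = ("to be or not to be that is the question "
        "whether tis nobler in the mind to suffer").split()

# For each three-word context, record every word that followed it.
continuations = defaultdict(list)
for a, b, c, d in zip(text, text[1:], text[2:], text[3:]):
    continuations[(a, b, c)].append(d)

generated = ["to", "be", "or"]
for _ in range(10):
    options = continuations.get(tuple(generated[-3:]))
    if not options:
        break                          # context never seen: dead end
    generated.append(random.choice(options))

print(" ".join(generated))   # replays the opening of the corpus verbatim
&lt;/code&gt;&lt;/pre&gt;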

&lt;p&gt;Google compiled a massive n-gram corpus from the web in 2006, pushing the limits of count-based models. Even at web scale, the approach has hard ceilings.&lt;/p&gt;

&lt;p&gt;
  &lt;strong&gt;Why does overfitting happen with high-order n-grams?&lt;/strong&gt;
  &lt;br&gt;
Shakespeare's corpus has about 884,000 tokens and a vocabulary of ~29,000 words (𝑉 = 29,066). That means 𝑉² ≈ 844 million possible bigrams, but only 300,000 were ever observed. That's 99.96% zeros. For 4-grams, the possible space is 𝑉⁴ ≈ 7 × 10¹⁷. Almost every 4-gram in the model was seen exactly once, so "generating" just replays the training data.
&lt;/p&gt;
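
&lt;p&gt;If you want to check those numbers yourself, the arithmetic is only a few lines (the figures are the ones quoted above):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;V = 29_066                     # Shakespeare vocabulary size
observed_bigrams = 300_000     # bigram types actually seen in the corpus

possible_bigrams = V ** 2
zero_fraction = 1 - observed_bigrams / possible_bigrams
possible_4grams = V ** 4

print(f"{possible_bigrams:,}")        # 844,832,356 possible bigrams
print(f"{zero_fraction:.2%} zeros")   # 99.96% zeros
print(f"{possible_4grams:.1e}")       # 7.1e+17 possible 4-grams
&lt;/code&gt;&lt;/pre&gt;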




&lt;h2&gt;
  
  
  So What Changed? N-grams vs. LLMs
&lt;/h2&gt;

&lt;p&gt;Let's cycle back to where we started. N-gram models and LLMs both generate text. Both use context to pick the next word. Both are language models. What's actually different?&lt;/p&gt;

&lt;p&gt;Not the task. The machinery.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Context size.&lt;/strong&gt; N-grams look at 1, 2, maybe 5 previous words. LLMs condition on thousands to millions of tokens. A bigram sees one word back. GPT sees the whole conversation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Counting vs. predicting.&lt;/strong&gt; This is the distinction that matters most. N-gram models &lt;em&gt;estimate&lt;/em&gt; probabilities from counts. You tally co-occurrences, compute ratios, and store them in a table. If a word pair never appeared in training, its probability is zero. Done. No recovery.&lt;/p&gt;

&lt;p&gt;LLMs &lt;em&gt;predict&lt;/em&gt; the next word through learned parameters. They build continuous representations of words and contexts. If "I want Japanese food" never appeared in training, but "I want Chinese food" and "I want Italian food" did, an LLM can bridge the gap. An n-gram model cannot.&lt;/p&gt;

&lt;p&gt;This is not the same thing done better. It's a different kind of operation, estimation from observations vs. prediction from learned structure.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Training data.&lt;/strong&gt; N-gram models use modest corpora. LLMs consume the internet. And instead of storing count tables that grow exponentially, the neural architecture compresses everything into fixed-size parameters.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;N-gram LMs&lt;/th&gt;
&lt;th&gt;LLMs&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;How&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;em&gt;Estimate&lt;/em&gt; probabilities from counts&lt;/td&gt;
&lt;td&gt;
&lt;em&gt;Predict&lt;/em&gt; next word via learned parameters&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Context&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;1-5 words (practical limit)&lt;/td&gt;
&lt;td&gt;Thousands to millions of tokens&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Training data&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Modest corpora&lt;/td&gt;
&lt;td&gt;The entire internet&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Generalization&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Can only use what was literally observed&lt;/td&gt;
&lt;td&gt;Can generalize to unseen combinations&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Representation&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Words are discrete symbols&lt;/td&gt;
&lt;td&gt;Words are dense vectors in continuous space&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;That last row is the deep issue. N-gram models treat words as atomic, unrelated symbols. "Cat" and "dog" are as different as "cat" and "quantum." No similarity, no transfer, no generalization. Neural language models fix this with &lt;strong&gt;embeddings&lt;/strong&gt;, mapping words into continuous vector spaces where similar words land near each other. But that's the next post.&lt;/p&gt;
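
&lt;p&gt;As a teaser for that post, here's a toy sketch with hand-made 3-dimensional vectors (real embeddings are learned, and have hundreds or thousands of dimensions) showing what "near each other" means:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import math

# Made-up vectors purely for illustration.
vectors = {
    "cat":     [0.90, 0.80, 0.10],
    "dog":     [0.85, 0.75, 0.15],
    "quantum": [0.05, 0.10, 0.95],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(cosine(vectors["cat"], vectors["dog"]))      # ~0.999: "similar"
print(cosine(vectors["cat"], vectors["quantum"]))  # ~0.19: "unrelated"
&lt;/code&gt;&lt;/pre&gt;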

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3r0u09edn7mt17fth4tn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3r0u09edn7mt17fth4tn.png" alt="img6" width="800" height="754"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  What You Now Have
&lt;/h2&gt;

&lt;p&gt;Five things you didn't have before reading this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The definition of a language model&lt;/strong&gt;: any model that assigns probabilities to word sequences or predicts the next word. N-grams and LLMs are both language models. The task is identical.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The chain rule decomposition&lt;/strong&gt;: how to break a sentence's probability into a product of conditional probabilities, and why you'd want to.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The Markov assumption&lt;/strong&gt;: the decision to throw away most of the history and keep only the last few words. Wrong in theory, useful in practice, and the reason n-grams are computationally tractable.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;How n-gram estimation actually works&lt;/strong&gt;: count and divide. Unigrams produce word soup. Bigrams produce local coherence. Higher-order n-grams overfit small data. The whole thing is a lookup table of ratios.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The specific gap that LLMs fill&lt;/strong&gt;: n-grams can't generalize, can't handle long context, and can't represent word similarity. LLMs solve all three by moving from count-based estimation to neural prediction. Different machinery, same task.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;


&lt;div class="crayons-card c-embed"&gt;

  &lt;br&gt;
Next post: &lt;strong&gt;Perplexity&lt;/strong&gt; (how you measure whether a language model is any good), the zero problem (what happens when your model has never seen a word pair), and smoothing (how you fix it). That's where the math gets interesting.&lt;br&gt;

&lt;/div&gt;


</description>
      <category>nlp</category>
      <category>machinelearning</category>
      <category>ai</category>
      <category>beginners</category>
    </item>
  </channel>
</rss>
