A large language model is a neural network trained to model sequences. Given preceding tokens, it produces a score for every token in its vocabulary. A decoding strategy turns those scores into a choice. The chosen token becomes part of the next input, and the loop continues until a stopping condition is reached.
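The loop described above can be sketched in a few lines. This is a minimal illustration, not a real model: `toy_scores`, `VOCAB`, and `NEXT` are hypothetical stand-ins, and the decoding strategy is simply "pick the highest score".

```python
# Hypothetical four-token vocabulary and a hard-coded preference table.
VOCAB = ["<eos>", "the", "cat", "sat"]
NEXT = {"the": "cat", "cat": "sat", "sat": "<eos>"}

def toy_scores(tokens):
    """Stand-in for a model: one score per vocabulary entry."""
    preferred = NEXT.get(tokens[-1], "the")
    return [1.0 if tok == preferred else 0.0 for tok in VOCAB]

def generate(prompt, max_new_tokens=10):
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        scores = toy_scores(tokens)  # score every vocabulary token
        best = max(range(len(VOCAB)), key=lambda i: scores[i])  # decoding choice
        choice = VOCAB[best]
        if choice == "<eos>":  # stopping condition
            break
        tokens.append(choice)  # the choice becomes part of the next input
    return tokens

print(generate(["the"]))  # ['the', 'cat', 'sat']
```

The two stopping conditions here, an end-of-sequence token and a length limit, are common choices but not the only ones.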
Text becomes tokens
The tokenizer breaks text into vocabulary units and maps each unit to an integer ID. Tokens may be whole words, punctuation, common word pieces, bytes, or other learned fragments. The same visible word can tokenize differently depending on spacing, language, capitalization, and the model’s tokenizer.
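To make the spacing and capitalization effect concrete, here is a toy tokenizer that greedily takes the longest matching vocabulary entry. The vocabulary is invented, and greedy longest-match is a simplification: widely used schemes such as byte-pair encoding build tokens from learned merge rules instead.

```python
# Invented vocabulary: note that "token" and " token" are distinct entries.
VOCAB = {"token": 0, "ization": 1, " token": 2, "s": 3, " ": 4, "T": 5, "oken": 6}

def tokenize(text, vocab=VOCAB):
    ids = []
    i = 0
    while i < len(text):
        # Try the longest candidate piece first, then shorter ones.
        for end in range(len(text), i, -1):
            piece = text[i:end]
            if piece in vocab:
                ids.append(vocab[piece])
                i = end
                break
        else:
            raise ValueError(f"no vocabulary entry covers {text[i]!r}")
    return ids

print(tokenize("tokens"))   # [0, 3]
print(tokenize(" tokens"))  # [2, 3]  leading space changes the first ID
print(tokenize("Token"))    # [5, 6]  capitalization splits the word differently
```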
IDs become positions in a learned space
An integer token ID by itself has no useful geometry. An embedding table maps each token ID to a dense vector. Training arranges these vectors, and later layers transform them in context, so the network can use patterns of similarity, contrast, grammar, and meaning. Position information is also introduced because a bag of words is not a sentence.
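As a rough sketch of the lookup, the code below adds a position vector to each token's embedding. The table values are invented, and the fixed sinusoidal scheme is only one option, assumed here for simplicity; many models use learned or rotary position information instead.

```python
import math

# Hypothetical 3-token vocabulary with 4-dimensional embeddings.
EMBEDDINGS = [
    [0.1, 0.3, -0.2, 0.0],
    [0.5, -0.1, 0.4, 0.2],
    [-0.3, 0.2, 0.1, 0.6],
]

def positional_encoding(position, dim):
    """Fixed sinusoidal position vector: sine on even indices, cosine on odd."""
    vec = []
    for i in range(dim):
        angle = position / (10000 ** (2 * (i // 2) / dim))
        vec.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return vec

def embed(token_ids):
    out = []
    for pos, tid in enumerate(token_ids):
        tok_vec = EMBEDDINGS[tid]  # table lookup by integer ID
        pos_vec = positional_encoding(pos, len(tok_vec))
        out.append([t + p for t, p in zip(tok_vec, pos_vec)])
    return out

# The same token ID at two positions yields two different input vectors.
first, second = embed([1, 1])
print(first != second)  # True
```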
Attention lets each position gather context
For each token position, the model produces query, key, and value vectors. A query is compared with allowed keys to produce attention scores. After scaling, masking, and normalization, those scores mix value vectors into a context-aware result. Multiple attention heads can learn different relationships in parallel.
“it” looks backward
When the model processes a word such as “it”, it can assign more attention weight to the earlier tokens that help resolve what the word refers to. Causal masking prevents a generation position from reading future tokens.
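The query, key, and value steps, together with the causal mask, can be sketched as single-head scaled dot-product attention. In a real model the three inputs come from learned projections of the hidden states; here they are passed in directly, which is an assumption made to keep the example short.

```python
import math

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention(queries, keys, values):
    """Single-head scaled dot-product attention with a causal mask."""
    d = len(queries[0])
    outputs, all_weights = [], []
    for i, q in enumerate(queries):
        # Causal masking: position i compares only against keys 0..i.
        scores = [sum(a * b for a, b in zip(q, keys[j])) / math.sqrt(d)
                  for j in range(i + 1)]
        weights = softmax(scores)  # normalization
        # Mix the allowed value vectors using the attention weights.
        mixed = [sum(w * values[j][k] for j, w in enumerate(weights))
                 for k in range(len(values[0]))]
        outputs.append(mixed)
        # Pad with zeros so every row has one weight per position.
        all_weights.append(weights + [0.0] * (len(queries) - i - 1))
    return outputs, all_weights

x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = causal_attention(x, x, x)
print(weights[0])  # [1.0, 0.0, 0.0]  the first position can only see itself
```

Multi-head attention runs several such computations in parallel on different projections and combines the results.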
A transformer block does more than attention
Attention moves information between token positions. A feed-forward network then performs a learned transformation at each position. Residual connections preserve and add signals; normalization keeps activations numerically manageable. This structure repeats through many blocks.
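One way to picture the block structure is the sketch below. It assumes the "pre-norm" arrangement, where normalization is applied before each sublayer; other arrangements exist. The `average_of_prefix` and `relu_ffn` functions are hypothetical stand-ins for the learned attention and feed-forward sublayers.

```python
import math

def layer_norm(vec, eps=1e-5):
    """Normalize one vector to zero mean and roughly unit variance."""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]

def add(a, b):
    return [x + y for x, y in zip(a, b)]

def transformer_block(xs, attend, feed_forward):
    """Pre-norm block; xs is a list of per-position vectors."""
    # Attention sublayer: moves information between positions.
    normed = [layer_norm(x) for x in xs]
    xs = [add(x, a) for x, a in zip(xs, attend(normed))]  # residual connection
    # Feed-forward sublayer: applied to each position independently.
    xs = [add(x, feed_forward(layer_norm(x))) for x in xs]  # residual connection
    return xs

def average_of_prefix(vectors):
    """Stand-in for attention: each position averages itself and earlier ones."""
    out = []
    for i in range(len(vectors)):
        prefix = vectors[: i + 1]
        out.append([sum(col) / len(prefix) for col in zip(*prefix)])
    return out

def relu_ffn(vec):
    """Stand-in for the learned feed-forward network."""
    return [max(0.0, v) for v in vec]

hidden = [[1.0, 2.0, 3.0], [0.0, -1.0, 4.0]]
for _ in range(2):  # the structure repeats; real blocks each have their own weights
    hidden = transformer_block(hidden, average_of_prefix, relu_ffn)
```

If both sublayers return zeros, the block's output equals its input, which is the sense in which residual connections preserve signals.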
The model predicts a distribution, not destiny
The final hidden representation is projected into one logit per vocabulary token. Softmax turns the logits, after any adjustments such as temperature scaling, into probabilities. Greedy decoding selects the highest-probability token. Sampling permits alternatives; temperature sharpens or flattens the distribution; top-k and top-p restrict the candidate set. The decoding policy changes style and variability without retraining the model.
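A small sketch shows how the same logits yield different distributions under different policies. The logit values are invented, and the helper names are illustrative rather than taken from any library.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def greedy(logits):
    """Index of the highest-scoring token."""
    return max(range(len(logits)), key=lambda i: logits[i])

def top_k_indices(logits, k):
    """Indices of the k highest-scoring tokens, best first."""
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    return order[:k]

logits = [2.0, 1.0, 0.1, -1.0]  # invented scores for a 4-token vocabulary

print(greedy(logits))            # 0
print(top_k_indices(logits, 2))  # [0, 1]
# Lower temperature concentrates probability on the top token; higher spreads it.
sharp = softmax(logits, temperature=0.5)
flat = softmax(logits, temperature=2.0)
print(sharp[0] > flat[0])        # True
```

Nothing about the model changes between these calls; only the step that turns scores into a choice does.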
Reshape and sample the next token
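Temperature divides the logits by a positive constant before softmax: values below 1 sharpen the distribution, and values above 1 flatten it. Top-p keeps the smallest set of highest-probability tokens whose cumulative mass reaches the selected threshold. Varying both and sampling repeatedly shows how they change the spread of outcomes.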
The logits are invented for this small demonstration. Real models score their full vocabulary, and production decoders may add repetition penalties, minimum lengths, grammar constraints, or other policies.
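In the same spirit, temperature and top-p sampling can be sketched over a handful of invented logits. The function names are illustrative, and the policies mentioned above (repetition penalties, grammar constraints, and so on) are deliberately left out.

```python
import math
import random

def softmax(logits, temperature=1.0):
    scaled = [x / temperature for x in logits]  # temperature rescales the logits
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(probs, top_p):
    """Return {index: renormalized probability} for the smallest qualifying set."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:  # cumulative mass has reached the threshold
            break
    return {i: probs[i] / mass for i in kept}

def sample_next(logits, temperature=1.0, top_p=1.0, rng=random):
    probs = softmax(logits, temperature)
    nucleus = top_p_filter(probs, top_p)
    indices = list(nucleus)
    weights = [nucleus[i] for i in indices]
    return rng.choices(indices, weights=weights, k=1)[0]

logits = [2.0, 1.0, 0.1, -1.0]  # invented scores for a 4-token vocabulary
rng = random.Random(0)          # seeded so repeated runs match
draws = [sample_next(logits, temperature=1.5, top_p=0.9, rng=rng) for _ in range(10)]
print(draws)  # a mix of indices; lower temperature or top_p reduces the variety
```

With a very small `top_p`, only the single most likely token survives the filter, so sampling behaves like greedy decoding.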
Generation has a memory bill
Without caching, every new token would force repeated computation for the earlier context. A key-value cache stores attention keys and values from previous positions so they can be reused. This speeds generation, but the cache grows with context length, batch size, layer count, and representation size. Long conversations therefore consume real accelerator memory even when the model weights do not change.
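A back-of-the-envelope estimate makes the memory bill concrete. The formula below assumes a standard cache that stores one key and one value vector per layer, per key-value head, per position; the model dimensions in the example are hypothetical, not those of any particular model.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch_size, bytes_per_value=2):
    """Approximate key-value cache size in bytes."""
    # The leading 2 accounts for storing both keys and values.
    return 2 * layers * kv_heads * head_dim * seq_len * batch_size * bytes_per_value

# Hypothetical configuration: 32 layers, 32 key-value heads of size 128,
# a 4096-token context, batch size 1, 16-bit values.
size = kv_cache_bytes(layers=32, kv_heads=32, head_dim=128, seq_len=4096, batch_size=1)
print(size / 1024**3)  # 2.0 (GiB)
```

Every factor enters linearly, so doubling the context length or the batch size doubles the cache.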
Where the behavior comes from
Pretraining commonly asks the model to predict tokens across a very large corpus, turning linguistic regularities into parameters. Later stages may include supervised instruction tuning, preference optimization, safety training, tool-use training, or domain adaptation. The prompt then steers those learned patterns at inference time; it does not rewrite the model’s weights.