How an LLM produces a sentence

A large language model is a neural network trained to model sequences. Given preceding tokens, it produces a score for every token in its vocabulary. A decoding strategy turns those scores into a choice. The chosen token becomes part of the next input, and the loop continues until a stopping condition is reached.

01 · Text: Tokenize (text → IDs)
02 · Vectors: Embed (IDs → numbers)
03 · Context: Transform (attention + MLP)
04 · Scores: Predict (logits → probabilities)
05 · Loop: Choose + append (one new token)

The high-level inference path of a decoder-style autoregressive language model.
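The loop described above can be sketched in a few lines of Python. This is a minimal illustration, not a real model: `generate`, `toy_scores`, and `VOCAB_SIZE` are hypothetical names, the stand-in scoring function simply favors the next integer ID, and decoding is greedy.

```python
from typing import Callable, List


def generate(
    next_token_scores: Callable[[List[int]], List[float]],
    prompt_ids: List[int],
    eos_id: int,
    max_new_tokens: int = 20,
) -> List[int]:
    """Greedy autoregressive loop: score, choose, append, repeat."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        scores = next_token_scores(ids)  # one score per vocabulary entry
        next_id = max(range(len(scores)), key=lambda i: scores[i])  # greedy choice
        ids.append(next_id)  # the chosen token becomes part of the next input
        if next_id == eos_id:  # stopping condition: end-of-sequence token
            break
    return ids


# Hypothetical stand-in for a model: it always favors (last ID + 1) mod VOCAB_SIZE.
VOCAB_SIZE = 5


def toy_scores(ids: List[int]) -> List[float]:
    favored = (ids[-1] + 1) % VOCAB_SIZE
    return [1.0 if i == favored else 0.0 for i in range(VOCAB_SIZE)]


print(generate(toy_scores, prompt_ids=[0], eos_id=4))  # [0, 1, 2, 3, 4]
```

A real system would replace `toy_scores` with a forward pass through the network and would usually stop on a maximum length as well as an end-of-sequence token, which `max_new_tokens` approximates here.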

Text becomes tokens

The tokenizer breaks text into vocabulary units and maps each unit to an integer ID. Tokens may be whole words, punctuation, common word pieces, bytes, or other learned fragments. The same visible word can tokenize differently depending on spacing, language, capitalization, and the model’s tokenizer.


With most modern tokenizers the token IDs can be decoded back into the original text, but token boundaries are not guaranteed to align with words.
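To make the mapping concrete, here is a toy greedy longest-match tokenizer. It is only a sketch: the pieces in `MULTI` are an invented vocabulary, and real tokenizers learn their vocabularies from data with algorithms such as byte-pair encoding rather than using a hand-written list.

```python
import string
from typing import List

# Hypothetical multi-character pieces, followed by single printable characters
# as a fallback so that any printable ASCII text can be encoded.
MULTI = [" token", "token", "izer", " the", "The"]
VOCAB: List[str] = MULTI + list(string.printable)
TOKEN_TO_ID = {piece: i for i, piece in enumerate(VOCAB)}
MAX_LEN = max(len(piece) for piece in VOCAB)


def encode(text: str) -> List[int]:
    """Greedy longest-match: at each position take the longest known piece."""
    ids: List[int] = []
    pos = 0
    while pos < len(text):
        for length in range(min(MAX_LEN, len(text) - pos), 0, -1):
            piece = text[pos:pos + length]
            if piece in TOKEN_TO_ID:
                ids.append(TOKEN_TO_ID[piece])
                pos += length
                break
        else:
            raise ValueError(f"character not in toy vocabulary: {text[pos]!r}")
    return ids


def decode(ids: List[int]) -> str:
    """Concatenating the pieces restores the original text."""
    return "".join(VOCAB[i] for i in ids)


for text in ["tokenizer", " tokenizer", "Tokenizer"]:
    ids = encode(text)
    print(repr(text), "->", [VOCAB[i] for i in ids])
    assert decode(ids) == text  # round trip holds for this toy vocabulary
```

In this toy, a leading space or a capital letter changes the split: "tokenizer" and " tokenizer" each become two pieces, while "Tokenizer" falls back to single characters before reaching "izer".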

IDs become positions in a learned space

An integer token ID by itself has no useful geometry. An embedding table maps each token to a dense vector. Training arranges these vectors—and transforms them in context—so the network can use patterns of similarity, contrast, grammar, and meaning. Position information is also introduced because a bag of words is not a sentence.

[Figure: the tokens “cat”, “dog”, “blue”, “run”, and “walk” plotted as points in a two-dimensional space.]

A toy 2D projection. Actual embeddings have many dimensions, and contextual representations change as tokens pass through layers.
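The lookup and the position signal can be sketched as follows. The assumptions are loud ones: the table is filled with random numbers rather than trained values, the sizes are tiny, and positions use the sinusoidal scheme from the original transformer paper, whereas many models use learned or rotary position encodings instead. All function names are illustrative.

```python
import math
import random
from typing import List

Vector = List[float]


def make_embedding_table(vocab_size: int, dim: int, seed: int = 0) -> List[Vector]:
    """One dense vector per token ID. Random here; learned in a real model."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(vocab_size)]


def sinusoidal_position(pos: int, dim: int) -> Vector:
    """Sinusoidal encoding: sine on even indices, cosine on odd indices."""
    vec = []
    for i in range(dim):
        angle = pos / (10000 ** ((i - i % 2) / dim))
        vec.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return vec


def embed(token_ids: List[int], table: List[Vector]) -> List[Vector]:
    """Look up each token vector and add the vector for its position."""
    dim = len(table[0])
    out = []
    for pos, token_id in enumerate(token_ids):
        position = sinusoidal_position(pos, dim)
        out.append([t + p for t, p in zip(table[token_id], position)])
    return out


table = make_embedding_table(vocab_size=10, dim=4)
x = embed([3, 7, 3], table)
# The same token ID at positions 0 and 2 yields different input vectors.
print(x[0] != x[2])  # True
```

Adding position information is what lets the network tell the first occurrence of a token from the second, even though both share one row of the table.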

Attention lets each position gather context

For each token position, the model produces query, key, and value vectors. A query is compared with allowed keys to produce attention scores. After scaling, masking, and normalization, those scores mix value vectors into a context-aware result. Multiple attention heads can learn different relationships in parallel.

“it” looks backward

The model can assign more attention weight to earlier tokens that help resolve the current representation. Causal masking prevents a generation position from reading future tokens.

Brighter cells mean greater conceptual attention weight. The empty upper triangle represents future positions hidden by a causal mask.
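A single attention head with a causal mask can be sketched in plain Python. This assumes the query, key, and value vectors are already computed; a real model derives them from learned projections. The mask is applied here by scoring only keys at or before the current position, which gives the same weights as adding negative infinity to future scores.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]


def softmax(xs: List[float]) -> List[float]:
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def causal_attention(q: Matrix, k: Matrix, v: Matrix) -> Tuple[Matrix, Matrix]:
    """Scaled dot-product attention where position i sees only positions 0..i."""
    d = len(q[0])
    n = len(q)
    outputs: Matrix = []
    weights: Matrix = []
    for i, qi in enumerate(q):
        # Compare the query with each allowed key, then scale by sqrt(d).
        scores = [
            sum(a * b for a, b in zip(qi, k[j])) / math.sqrt(d)
            for j in range(i + 1)
        ]
        w = softmax(scores)  # normalize into attention weights
        # Mix the allowed value vectors using those weights.
        out = [sum(w[j] * v[j][c] for j in range(i + 1)) for c in range(len(v[0]))]
        outputs.append(out)
        weights.append(w + [0.0] * (n - i - 1))  # future positions get zero weight
    return outputs, weights


# Invented toy vectors for three positions.
q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
v = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
outputs, weights = causal_attention(q, k, v)
for row in weights:
    print([round(w, 2) for w in row])
```

The printed weight matrix is lower triangular, matching the empty upper triangle in the figure. Multi-head attention runs several such heads on different projections and concatenates their outputs.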

A transformer block does more than attention

Attention moves information between token positions. A feed-forward network then performs a learned transformation at each position. Residual connections preserve and add signals; normalization keeps activations numerically manageable. This structure repeats through many blocks.

01 · Masked multi-head attention: mix positions
02 · Residual connection + normalization: stabilize
03 · Feed-forward network: transform each position
04 · Residual connection + normalization: pass onward
× N · Repeat across the model: deeper representations

Exact ordering and components vary among transformer families, but attention, feed-forward transformations, residual paths, and normalization are recurring pieces.
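The wiring of one block, in the order listed above, might look like the sketch below. It shows only the plumbing: `mean_of_prefix` and `relu_only` are hypothetical stand-ins with no learned weights, and the normalization omits the learned scale and shift that real layers usually carry. Many current models apply normalization before each sublayer rather than after it.

```python
import math
from typing import Callable, List, Tuple

Vector = List[float]
Sequence = List[Vector]
Sublayer = Callable[[Sequence], Sequence]


def layer_norm(x: Vector, eps: float = 1e-5) -> Vector:
    """Normalize one position to zero mean and roughly unit variance."""
    mean = sum(x) / len(x)
    var = sum((a - mean) ** 2 for a in x) / len(x)
    return [(a - mean) / math.sqrt(var + eps) for a in x]


def add_and_norm(x: Sequence, sub_out: Sequence) -> Sequence:
    """Residual connection followed by normalization, per position."""
    return [layer_norm([a + b for a, b in zip(xi, si)]) for xi, si in zip(x, sub_out)]


def transformer_block(x: Sequence, attention: Sublayer, feed_forward: Sublayer) -> Sequence:
    x = add_and_norm(x, attention(x))      # steps 01 and 02
    x = add_and_norm(x, feed_forward(x))   # steps 03 and 04
    return x


def run_stack(x: Sequence, blocks: List[Tuple[Sublayer, Sublayer]]) -> Sequence:
    for attention, feed_forward in blocks:  # the "x N" repetition
        x = transformer_block(x, attention, feed_forward)
    return x


def mean_of_prefix(x: Sequence) -> Sequence:
    """Stand-in for masked attention: average each position with earlier ones."""
    out = []
    for i in range(len(x)):
        prefix = x[: i + 1]
        out.append([sum(col) / len(prefix) for col in zip(*prefix)])
    return out


def relu_only(x: Sequence) -> Sequence:
    """Stand-in for the feed-forward network: an elementwise ReLU per position."""
    return [[max(0.0, a) for a in xi] for xi in x]


x = [[1.0, 2.0, 3.0], [0.0, -1.0, 4.0]]
y = run_stack(x, [(mean_of_prefix, relu_only)] * 2)
print(len(y), len(y[0]))  # 2 3
```

The shape of the sequence is unchanged by each block, which is what allows the same structure to be stacked many times.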

The model predicts a distribution, not destiny

The final hidden representation is projected into one logit per vocabulary token. Softmax turns adjusted logits into probabilities. Greedy decoding selects the highest-probability token. Sampling draws from the distribution and so permits alternatives; temperature sharpens or flattens the distribution; top-k and top-p restrict the candidate set. The decoding policy changes style and variability without retraining the model.
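The step from logits to a choice can be sketched as below. The logits are invented, and `softmax`, `top_k_filter`, and `greedy` are illustrative helper names rather than any library's API.

```python
import math
from typing import List


def softmax(logits: List[float], temperature: float = 1.0) -> List[float]:
    """Divide logits by the temperature, then normalize into probabilities."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_filter(probs: List[float], k: int) -> List[float]:
    """Keep the k most probable tokens and renormalize; zero out the rest."""
    keep = set(sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k])
    mass = sum(probs[i] for i in keep)
    return [probs[i] / mass if i in keep else 0.0 for i in range(len(probs))]


def greedy(probs: List[float]) -> int:
    """Return the index of the highest-probability token."""
    return max(range(len(probs)), key=lambda i: probs[i])


logits = [2.0, 1.0, 0.5, -1.0]  # invented scores for a four-token vocabulary
for t in (0.5, 1.0, 2.0):
    print(t, [round(p, 3) for p in softmax(logits, temperature=t)])
print(greedy(softmax(logits)))  # 0
print([round(p, 3) for p in top_k_filter(softmax(logits), k=2)])
```

Lower temperatures concentrate probability on the top token and higher temperatures spread it out, while the ranking of tokens stays the same.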

Reshape and sample the next token

Temperature rescales the logits before softmax. Top-p keeps the smallest set of highest-probability tokens whose cumulative mass reaches the selected threshold.

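[Interactive demo: a toy next-token distribution for the prompt “The model …” over the candidates “works”, “learns”, “changes”, “predicts”, and “…other”, with temperature and top-p threshold controls that both start at 1.00.]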

The logits are invented for this small demonstration. Real models score their full vocabulary, and production decoders may add repetition penalties, minimum lengths, grammar constraints, or other policies.
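Top-p (nucleus) filtering followed by sampling can be sketched as follows. The probabilities are hypothetical values chosen for illustration, not the values used by the demonstration, and `top_p_filter` and `sample` are illustrative names.

```python
import random
from typing import Dict


def top_p_filter(probs: Dict[str, float], top_p: float) -> Dict[str, float]:
    """Keep the smallest set of top tokens whose cumulative mass reaches top_p."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept = []
    mass = 0.0
    for token, p in ranked:
        kept.append((token, p))
        mass += p
        if mass >= top_p:
            break
    # Renormalize so the kept probabilities sum to 1.
    return {token: p / mass for token, p in kept}


def sample(probs: Dict[str, float], rng: random.Random) -> str:
    """Draw one token in proportion to its probability."""
    tokens = list(probs)
    return rng.choices(tokens, weights=[probs[t] for t in tokens], k=1)[0]


# Hypothetical probabilities for the candidate continuations.
probs = {"works": 0.40, "learns": 0.25, "changes": 0.15, "predicts": 0.15, "…other": 0.05}

nucleus = top_p_filter(probs, top_p=0.6)
print(sorted(nucleus))  # ['learns', 'works']

rng = random.Random(0)  # seeded so repeated runs draw the same sequence
print([sample(nucleus, rng) for _ in range(5)])
```

With a threshold of 1.00 every candidate stays eligible, so the filter only matters once the threshold drops below the total mass.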

Generation has a memory bill

Without caching, every new token would force repeated computation for the earlier context. A key-value cache stores attention keys and values from previous positions so they can be reused. This speeds generation, but the cache grows with context length, batch size, layer count, and representation size. Long conversations therefore consume real accelerator memory even when the model weights do not change.
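A rough size estimate follows directly from those factors. The function below is a back-of-the-envelope sketch; the configuration is hypothetical and does not describe any particular model, and real runtimes may add overhead or shrink the cache with techniques such as grouped-query attention or quantization.

```python
def kv_cache_bytes(
    layers: int,
    kv_heads: int,
    head_dim: int,
    seq_len: int,
    batch_size: int,
    bytes_per_value: int = 2,  # assumes 16-bit values
) -> int:
    """Estimate key-value cache size in bytes.

    The leading 2 counts one key tensor plus one value tensor per layer.
    """
    return 2 * layers * kv_heads * head_dim * seq_len * batch_size * bytes_per_value


# Hypothetical configuration chosen for round numbers.
size = kv_cache_bytes(layers=32, kv_heads=32, head_dim=128, seq_len=4096, batch_size=1)
print(size / 2**30)  # 2.0 (GiB)
```

Doubling either the context length or the batch size doubles the estimate, which is why long conversations and large batches compete for the same accelerator memory.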

Where the behavior comes from

Pretraining commonly asks the model to predict tokens across a very large corpus, turning linguistic regularities into parameters. Later stages may include supervised instruction tuning, preference optimization, safety training, tool-use training, or domain adaptation. The prompt then steers those learned patterns at inference time; it does not rewrite the model’s weights.
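The next-token objective is commonly expressed as a cross-entropy loss: the negative log-probability the model assigns to the token that actually came next, averaged over positions. The sketch below computes that quantity for invented logits; `next_token_loss` is an illustrative name, and actual training pipelines compute this in batches and backpropagate through the network.

```python
import math
from typing import List


def next_token_loss(logits_per_position: List[List[float]], target_ids: List[int]) -> float:
    """Average of -log softmax(logits)[target] over all positions."""
    total = 0.0
    for logits, target in zip(logits_per_position, target_ids):
        m = max(logits)
        # log of the softmax denominator, computed stably
        log_norm = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_norm - logits[target]
    return total / len(target_ids)


# Invented logits over a four-token vocabulary at two positions.
uniform = [[0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]]
confident = [[5.0, 0.0, 0.0, 0.0], [0.0, 0.0, 5.0, 0.0]]
targets = [0, 2]

print(round(next_token_loss(uniform, targets), 4))    # 1.3863, which is log(4)
print(next_token_loss(confident, targets) < next_token_loss(uniform, targets))  # True
```

A model with no preference among four tokens scores log(4); training lowers the loss by shifting probability toward the tokens that actually follow.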

An LLM is not a database lookup and its fluency is not a truth guarantee. It generates plausible continuations from learned statistical structure. Reliable systems add retrieval, tools, citations, verification, constraints, and human judgment when the task demands them.
