How a neural network learns

Suppose we want to recognize a handwritten digit. The input is not handed to the computer as the idea of “seven.” It arrives as a rectangular array of pixel values. The network transforms that array through layers of arithmetic and produces ten scores, one for each possible digit. At first the network's parameters are typically set to small random values, so its scores are little better than guessing. Training nudges those parameters toward better predictions, one batch at a time.

Input: A tensor

What training changes: Parameters

Everything begins as a tensor

A tensor is a multidimensional array with a shape and a numerical data type. A grayscale image might have shape 28 × 28. A batch of 64 color images might have shape 64 × 3 × 224 × 224: examples, channels, height, width. Text, audio, video, model weights, intermediate activations, and gradients can all be represented as tensors.
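To make shape and data type concrete, here is a small sketch using only the Python standard library. The helper names `num_elements` and `nested_shape` are invented for this example; real frameworks expose the same information on their tensor objects. The byte count assumes 32-bit floats (4 bytes per value), which is common but not the only choice.

```python
import math

def num_elements(shape):
    """Total number of values in a tensor with the given shape."""
    return math.prod(shape)

def nested_shape(data):
    """Shape of a regular (non-ragged, non-empty) nested list."""
    shape = []
    while isinstance(data, list):
        shape.append(len(data))
        data = data[0]
    return tuple(shape)

# A 28 x 28 grayscale image stored as a list of rows of pixel values.
image = [[0.0] * 28 for _ in range(28)]
image_shape = nested_shape(image)

# A batch of 64 color images: examples, channels, height, width.
batch_shape = (64, 3, 224, 224)
batch_values = num_elements(batch_shape)

# Assumption: float32 storage, 4 bytes per value.
batch_bytes = batch_values * 4

print(image_shape, num_elements(image_shape))
print(batch_shape, batch_values, f"{batch_bytes / 2**20:.2f} MiB")
```

The shape alone fixes how many numbers must be stored and moved, which is why batch size and image resolution show up so directly in memory use.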

“Tensor” is not a claim that the data understands anything. It is a compact structure that lets software perform large groups of numerical operations efficiently.

The forward pass

Each connection carries a weight. A neuron-like unit multiplies each input by its weight, sums the results, adds a bias, and applies an activation function. Stacking many such transformations lets the model build increasingly useful internal representations.

Figure: a small network with inputs x₁, x₂, x₃ and outputs ŷ₁, ŷ₂.

Lines are learned weights; circles hold intermediate activations. Real networks usually represent these operations as matrix multiplications rather than individual drawn neurons.

inputs × weights + bias → activation → output
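The pipeline above can be sketched as one fully connected layer in plain Python. The function name `dense_layer`, the choice of ReLU, and all numeric values are assumptions made for illustration; a real framework would perform the same arithmetic as a matrix multiplication.

```python
def relu(z):
    return max(0.0, z)

def dense_layer(inputs, weights, biases, activation):
    """One fully connected layer.

    weights[j][i] connects input i to unit j; biases[j] belongs to unit j.
    """
    outputs = []
    for unit_weights, bias in zip(weights, biases):
        # Weighted sum of the inputs, plus the unit's bias.
        z = sum(w * x for w, x in zip(unit_weights, inputs)) + bias
        outputs.append(activation(z))
    return outputs

# Illustrative values only: three inputs, two output units.
x = [1.0, 2.0, 3.0]
W = [[0.5, -1.0, 0.5],
     [1.0, 0.0, -1.0]]
b = [0.5, 0.0]

y_hat = dense_layer(x, W, b, relu)
print(y_hat)
```

Each row of `W` plays the role of the lines entering one circle in the diagram, and each entry of `y_hat` is the activation that circle would hold.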

Why activations matter

Without nonlinear activations, any number of stacked linear layers collapses into a single linear transformation, so extra depth adds no expressive power. Functions such as ReLU, GELU, sigmoid, or tanh bend the computation, allowing a network to approximate complex boundaries and interactions. Different architectures choose activations for different numerical and modeling properties.
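The collapse is easy to check with one-dimensional layers. In this sketch the coefficients are arbitrary and `linear` is a helper invented for the example; the point is only that two stacked linear maps equal one merged linear map, while inserting a ReLU breaks that equality.

```python
def linear(a, b):
    """Return the function x -> a * x + b."""
    return lambda x: a * x + b

def relu(z):
    return max(0, z)

layer1 = linear(2, -3)
layer2 = linear(-1, 4)

# Two stacked linear layers: layer2(layer1(x)) = (-1 * 2) x + (-1 * -3 + 4).
merged = linear(-1 * 2, -1 * -3 + 4)

xs = range(-5, 6)
stacked_linear = [layer2(layer1(x)) for x in xs]
single_linear = [merged(x) for x in xs]

# With a ReLU in between, the composition is no longer a straight line.
stacked_relu = [layer2(relu(layer1(x))) for x in xs]

print(stacked_linear == single_linear)
print(stacked_relu)
```

The ReLU version is flat where the first layer's output is negative and sloped elsewhere, a shape no single linear layer can produce.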

A loss turns “wrong” into a number

The model’s output is compared with the desired target using a loss function. Classification often uses cross-entropy; regression may use squared or absolute error. The loss is not intelligence or disappointment. It is a scalar objective: lower should mean better according to the task we specified.
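As a minimal sketch of how a loss becomes a single number, the code below implements softmax cross-entropy and squared error with the standard library. The score values are made up for illustration, and the function names are this example's own rather than any framework's API.

```python
import math

def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(scores)  # subtracting the max avoids overflow in exp
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(scores, target_index):
    """Negative log-probability assigned to the correct class."""
    return -math.log(softmax(scores)[target_index])

def squared_error(prediction, target):
    return (prediction - target) ** 2

# Ten equal scores: the model has no preference among the digits.
uninformed = [0.0] * 10
# Illustrative scores that strongly favor class 7.
confident = [0.0] * 10
confident[7] = 5.0

loss_uninformed = cross_entropy(uninformed, 7)
loss_right = cross_entropy(confident, 7)
loss_wrong = cross_entropy(confident, 3)
print(loss_uninformed, loss_right, loss_wrong)
```

With ten equal scores the cross-entropy equals ln(10); a confident correct prediction scores lower than that, and a confident wrong one scores higher.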

Walk the loss curve yourself

Change the starting weight, learning rate, momentum, and step count. Then step the optimizer or run the complete trail. High learning rates can overshoot or diverge; momentum can accelerate and also carry the update past the minimum.

Default settings: starting weight −4.0, learning rate 0.80, momentum 0.00, run length 16 steps.

Initial state: weight −4.000, loss 9.400, gradient −3.000, 0 steps completed.

The demo deliberately uses one parameter and the loss L(w) = 0.25(w − 2)² + 0.4, whose minimum lies at w = 2, so the actual update path can be drawn. Real networks repeat the same gradient logic across vastly higher-dimensional parameter spaces.
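The same walk can be reproduced in a few lines of Python. This is a sketch under one stated assumption: momentum is implemented as a velocity that accumulates past gradients, which is a common formulation but not necessarily the one the interactive demo uses. The function name `run` and the alternative settings (learning rate 5.0, momentum 0.9) are chosen here for illustration.

```python
def loss(w):
    return 0.25 * (w - 2) ** 2 + 0.4

def gradient(w):
    # dL/dw for L(w) = 0.25 (w - 2)^2 + 0.4
    return 0.5 * (w - 2)

def run(start, learning_rate, momentum, steps):
    """Return the list of weights visited, including the starting point."""
    w, velocity = start, 0.0
    trail = [w]
    for _ in range(steps):
        # Assumed momentum form: velocity accumulates past gradients.
        velocity = momentum * velocity + gradient(w)
        w = w - learning_rate * velocity
        trail.append(w)
    return trail

default_trail = run(start=-4.0, learning_rate=0.8, momentum=0.0, steps=16)
diverging_trail = run(start=-4.0, learning_rate=5.0, momentum=0.0, steps=16)
momentum_trail = run(start=-4.0, learning_rate=0.8, momentum=0.9, steps=16)

for name, trail in [("default", default_trail),
                    ("lr=5.0", diverging_trail),
                    ("momentum=0.9", momentum_trail)]:
    print(f"{name:>13}: final w = {trail[-1]:.4f}, final loss = {loss(trail[-1]):.4f}")
```

Without momentum, each step multiplies the distance from the minimum by (1 − 0.5 × learning rate), so for this particular loss any learning rate above 4 makes the distance grow instead of shrink. With momentum 0.9 the trail passes w = 2 and swings beyond it before settling.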

Backpropagation assigns responsibility

The crucial question is not merely “how wrong was the answer?” but “which parameter contributed how much to that error?” Backpropagation applies the chain rule from the loss backward through the recorded computation. It produces a gradient for each trainable parameter: the local direction in which a tiny change would increase the loss. Moving against that direction should reduce it, at least nearby.

Frameworks such as PyTorch build a computation graph during the forward pass and use automatic differentiation to calculate these gradients. Some intermediate tensors must be retained for backward computation, which is one reason training generally needs more memory than inference.
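For a single unit the chain rule can be written out by hand. The sketch below is not how a framework implements automatic differentiation; it only illustrates the idea, with a tanh activation, a squared-error loss, and parameter values all chosen for the example. The `saved` tuple stands in for the intermediate values a framework would retain for the backward pass, and a finite-difference estimate serves as an independent check.

```python
import math

def forward(w, b, x, target):
    """Forward pass for one unit; returns the loss and the values backward needs."""
    z = w * x + b            # weighted input
    y = math.tanh(z)         # activation
    loss = (y - target) ** 2
    return loss, (x, y, target)

def backward(saved):
    """Chain rule from the loss back to w and b, using the saved values."""
    x, y, target = saved
    dloss_dy = 2 * (y - target)
    dy_dz = 1 - y ** 2       # derivative of tanh(z), written in terms of y
    dloss_dz = dloss_dy * dy_dz
    return dloss_dz * x, dloss_dz   # (dloss/dw, dloss/db)

w, b, x, target = 0.5, -0.2, 1.5, 1.0
loss_value, saved = forward(w, b, x, target)
grad_w, grad_b = backward(saved)

# Independent check: nudge each parameter slightly and watch the loss.
eps = 1e-6
numeric_w = (forward(w + eps, b, x, target)[0] - forward(w - eps, b, x, target)[0]) / (2 * eps)
numeric_b = (forward(w, b + eps, x, target)[0] - forward(w, b - eps, x, target)[0]) / (2 * eps)

print(grad_w, numeric_w)
print(grad_b, numeric_b)
```

Both gradients come out negative here: the prediction is below the target, so increasing either parameter would lower the loss, and a step against the gradient does exactly that.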

The complete training step

01 · Batch: Load examples (inputs + targets)

02 · Forward: Predict (run the layers)

03 · Loss: Measure error (one objective)

04 · Backward: Find gradients (chain rule)

05 · Update: Move parameters (optimizer step)

The loop repeats across many batches and epochs. The optimizer state may carry momentum or adaptive statistics between steps.
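The five steps can be sketched end to end on a toy problem. Everything specific here is an assumption made for the demonstration: the model is a line with two parameters, the data is synthetic and noise-free, gradients are written by hand rather than produced by automatic differentiation, and the update is plain gradient descent with no optimizer state.

```python
import random

# Synthetic, noise-free data from the line t = 3x - 1 (an assumption for this demo).
xs = [i / 10 for i in range(-10, 11)]
data = [(x, 3 * x - 1) for x in xs]

w, b = 0.0, 0.0            # parameters, deliberately started far from the answer
learning_rate = 0.1
batch_size = 7
rng = random.Random(0)     # fixed seed so runs are repeatable

def mean_loss(w, b, examples):
    return sum((w * x + b - t) ** 2 for x, t in examples) / len(examples)

initial_loss = mean_loss(w, b, data)

for epoch in range(200):
    rng.shuffle(data)
    for start in range(0, len(data), batch_size):
        # 01 Batch: load inputs and targets.
        batch = data[start:start + batch_size]
        # 02 Forward: run the model.
        preds = [w * x + b for x, _ in batch]
        # 03 Loss: mean squared error over the batch, a single number.
        loss = sum((p - t) ** 2 for p, (_, t) in zip(preds, batch)) / len(batch)
        # 04 Backward: gradients of the loss with respect to w and b.
        grad_w = sum(2 * (p - t) * x for p, (x, t) in zip(preds, batch)) / len(batch)
        grad_b = sum(2 * (p - t) for p, (_, t) in zip(preds, batch)) / len(batch)
        # 05 Update: plain gradient descent step.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b

final_loss = mean_loss(w, b, data)
print(f"w = {w:.3f}, b = {b:.3f}, loss {initial_loss:.3f} -> {final_loss:.6f}")
```

One pass of the outer loop is an epoch; one pass of the inner loop is a single training step on one batch.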

Optimizers are update rules

Basic stochastic gradient descent subtracts from each parameter its gradient multiplied by a learning rate. Momentum smooths the direction across steps. Adam keeps moving averages of the gradient and of its square, and uses them to adapt the update size per parameter. None removes the need for careful data, objectives, validation, and hyperparameter choices.
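To show what "adapts update sizes" means in practice, the sketch below compares a plain gradient step with a single-parameter version of Adam. The class is a simplified illustration written for this article, not a framework's implementation; its default hyperparameters are the commonly used ones, and the two gradient values are arbitrary.

```python
import math

def sgd_step(w, grad, learning_rate):
    """Plain gradient descent: step size scales with the gradient."""
    return w - learning_rate * grad

class Adam:
    """Adam for a single scalar parameter (a simplified sketch)."""

    def __init__(self, learning_rate=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
        self.learning_rate = learning_rate
        self.beta1, self.beta2, self.eps = beta1, beta2, eps
        self.m = 0.0   # moving average of gradients
        self.v = 0.0   # moving average of squared gradients
        self.t = 0     # number of steps taken

    def step(self, w, grad):
        self.t += 1
        self.m = self.beta1 * self.m + (1 - self.beta1) * grad
        self.v = self.beta2 * self.v + (1 - self.beta2) * grad ** 2
        # Bias correction compensates for the averages starting at zero.
        m_hat = self.m / (1 - self.beta1 ** self.t)
        v_hat = self.v / (1 - self.beta2 ** self.t)
        return w - self.learning_rate * m_hat / (math.sqrt(v_hat) + self.eps)

# Same starting weight, gradients that differ by a factor of a million.
small_grad, large_grad = 0.001, 1000.0
sgd_small = sgd_step(1.0, small_grad, learning_rate=0.01)
sgd_large = sgd_step(1.0, large_grad, learning_rate=0.01)
adam_small = Adam(learning_rate=0.01).step(1.0, small_grad)
adam_large = Adam(learning_rate=0.01).step(1.0, large_grad)

print(1.0 - sgd_small, 1.0 - sgd_large)
print(1.0 - adam_small, 1.0 - adam_large)
```

The plain step moves a million times farther for the large gradient, while Adam's first step is close to the learning rate in both cases because the gradient is divided by an estimate of its own magnitude.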

Learning the task, not memorizing the sheet

Low training loss alone is not enough. A model can memorize quirks in its training examples and then fail on new data, a failure known as overfitting. A separate validation set, held out from training, helps measure generalization. Regularization techniques such as dropout and weight decay, along with data augmentation, early stopping, and simply obtaining more representative data, can all narrow the gap between training and validation performance.
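Two of these ideas, holding out a validation set and stopping early, fit in a short sketch. The function names, the 20% split, and the patience value are assumptions for illustration, and the loss curves are made-up numbers shaped like a typical overfitting run, not measurements from a real model.

```python
import random

def train_validation_split(examples, validation_fraction=0.2, seed=0):
    """Shuffle a copy of the examples and hold out a fraction for validation."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_val = max(1, int(len(shuffled) * validation_fraction))
    return shuffled[n_val:], shuffled[:n_val]

def early_stopping_epoch(validation_losses, patience=2):
    """Return the index of the best epoch, stopping once validation loss has
    failed to improve for `patience` consecutive epochs."""
    best_epoch, best_loss, waited = 0, float("inf"), 0
    for epoch, loss in enumerate(validation_losses):
        if loss < best_loss:
            best_epoch, best_loss, waited = epoch, loss, 0
        else:
            waited += 1
            if waited >= patience:
                break
    return best_epoch

train_set, val_set = train_validation_split(range(100))

# Made-up loss curves for illustration, not measurements from a real model.
train_losses = [0.90, 0.60, 0.40, 0.25, 0.15, 0.08, 0.04]
val_losses   = [0.95, 0.70, 0.55, 0.50, 0.53, 0.58, 0.66]

best = early_stopping_epoch(val_losses, patience=2)
print(len(train_set), len(val_set), best)
```

In the illustrative curves the training loss keeps falling while the validation loss turns upward after epoch 3; early stopping keeps the parameters from that epoch rather than the last one.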

Epoch: One pass through training data

Batch: Examples processed together

Learning rate: Update step size

Gradient: Local loss direction

Training ends; inference remains

During inference the model normally performs only the forward pass. The learned parameter values are fixed, gradient tracking is disabled, and the system can optimize for latency, throughput, power, or memory. This same distinction holds for a tiny classifier and a large language model: training discovers parameters; inference uses them.
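A minimal sketch of inference, assuming a tiny linear classifier whose parameters were learned earlier. The weights, biases, and function names are hypothetical; the point is that nothing here computes a loss, a gradient, or an update.

```python
# Assumption: a tiny linear classifier whose parameters were learned earlier.
# Tuples are used so the values cannot be modified in place during inference.
WEIGHTS = ((1.0, -1.0),
           (-1.0, 1.0),
           (0.5, 0.5))
BIASES = (0.0, 0.0, -0.25)

def scores_for(features):
    """Forward pass only: one score per class."""
    return [sum(w * x for w, x in zip(row, features)) + bias
            for row, bias in zip(WEIGHTS, BIASES)]

def predict(features):
    """Return the index of the highest-scoring class."""
    scores = scores_for(features)
    return max(range(len(scores)), key=scores.__getitem__)

print(predict((2.0, 0.0)), predict((0.0, 2.0)), predict((1.0, 1.0)))
```

Because the prediction only needs the largest score, the sketch skips converting scores to probabilities; that kind of shortcut is one small example of optimizing the forward pass once training is over.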

References & next reads

PyTorch: Autograd mechanics. Computation graphs, saved tensors, and reverse automatic differentiation.

PyTorch tutorial: Automatic differentiation. A concrete weight, bias, loss, and backward example.

Next in the series: How an LLM produces a sentence. Apply these foundations to transformer language models.