Suppose we want to recognize a handwritten digit. The input is not handed to the computer as the idea of “seven.” It arrives as a rectangular array of pixel values. The network transforms that array through layers of arithmetic and produces ten scores, one for each possible digit. At first its internal numbers are mostly unhelpful. Training nudges them toward better predictions, one batch at a time.
Everything begins as a tensor
A tensor is a multidimensional array with a shape and a numerical data type. A grayscale image might have shape 28 × 28. A batch of 64 color images might have shape 64 × 3 × 224 × 224: examples, channels, height, width. Text, audio, video, model weights, intermediate activations, and gradients can all be represented as tensors.
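To make the idea of a shape concrete, here is a minimal sketch that uses nested Python lists as stand-in tensors. The `shape` helper is a hypothetical illustration and assumes a rectangular, non-empty nesting; in practice a library such as NumPy or PyTorch would track shape and data type for you.

```python
def shape(tensor):
    """Return the shape of a rectangular, non-empty nested list as a tuple."""
    dims = []
    while isinstance(tensor, list):
        dims.append(len(tensor))
        tensor = tensor[0]  # descend into the first element of this axis
    return tuple(dims)

# A 28 x 28 grayscale image filled with zeros (pixel values stored as floats).
image = [[0.0 for _ in range(28)] for _ in range(28)]

# A small batch of 4 color images, 8 x 8 pixels each:
# axes are (examples, channels, height, width).
batch = [
    [[[0.0 for _ in range(8)] for _ in range(8)] for _ in range(3)]
    for _ in range(4)
]

print(shape(image))  # (28, 28)
print(shape(batch))  # (4, 3, 8, 8)
```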
The forward pass
The forward pass computes the output from the input, one layer at a time. Each connection carries a weight. A neuron-like unit multiplies each input by its weight, sums the results, adds a bias, and applies an activation function. Stacking many such transformations lets the model build increasingly useful internal representations.
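As a rough sketch of that arithmetic, the function below computes one small layer in plain Python. The names `dense_forward` and `relu` and all the numbers are illustrative assumptions, not taken from any particular library.

```python
def relu(x):
    """Rectified linear unit: pass positive values, clamp negatives to zero."""
    return max(0.0, x)

def dense_forward(inputs, weights, biases, activation):
    """Compute one layer: each unit outputs activation(sum(w_i * x_i) + bias)."""
    outputs = []
    for unit_weights, bias in zip(weights, biases):
        pre_activation = sum(w * x for w, x in zip(unit_weights, inputs)) + bias
        outputs.append(activation(pre_activation))
    return outputs

inputs = [1.0, 2.0, 3.0]
weights = [
    [0.5, -1.0, 0.25],  # weights feeding unit 0
    [1.0, 1.0, -0.5],   # weights feeding unit 1
]
biases = [0.5, -1.0]

hidden = dense_forward(inputs, weights, biases, relu)
print(hidden)  # [0.0, 0.5]
```

Unit 0 ends up with a negative pre-activation (-0.25), so ReLU outputs zero; unit 1 reaches 0.5 and passes through unchanged.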
Why activations matter
Without nonlinear activations, many stacked linear layers collapse into one linear transformation. Functions such as ReLU, GELU, sigmoid, or tanh bend the computation, allowing a network to approximate complex boundaries and interactions. Different architectures choose activations for different numerical and modeling properties.
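The collapse can be checked directly on a one-dimensional toy case. This sketch assumes two scalar layers with made-up weights: without an activation they are equivalent to a single layer, and with a ReLU between them the slope changes partway along the input range, which no single linear layer can reproduce.

```python
def affine(x, w, b):
    """One scalar 'layer': multiply by a weight and add a bias."""
    return w * x + b

def relu(x):
    return max(0.0, x)

# Illustrative weights and biases for two stacked layers.
w1, b1 = 2.0, 1.0
w2, b2 = -3.0, 0.5

def two_linear_layers(x):
    return affine(affine(x, w1, b1), w2, b2)

# Algebraically: w2 * (w1 * x + b1) + b2 = (w2 * w1) * x + (w2 * b1 + b2)
w_combined = w2 * w1
b_combined = w2 * b1 + b2

def with_relu(x):
    return affine(relu(affine(x, w1, b1)), w2, b2)

for x in [-2.0, -1.0, 0.0, 1.0, 2.0]:
    stacked = two_linear_layers(x)
    collapsed = affine(x, w_combined, b_combined)
    print(x, stacked, collapsed, with_relu(x))

# Slope of the ReLU network on two different parts of the input range.
slope_left = with_relu(-1.0) - with_relu(-2.0)
slope_right = with_relu(1.0) - with_relu(0.0)
print(slope_left, slope_right)  # 0.0 -6.0
```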
A loss turns “wrong” into a number
The model’s output is compared with the desired target using a loss function. Classification often uses cross-entropy; regression may use squared or absolute error. The loss is not intelligence or disappointment. It is a scalar objective: lower should mean better according to the task we specified.
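For the ten-digit example, cross-entropy can be sketched as a softmax over the scores followed by the negative log of the probability assigned to the correct digit. The function names and scores below are illustrative assumptions; frameworks usually provide a fused, numerically safer version.

```python
import math

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]  # shift for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(scores, target_index):
    """Negative log-probability the model assigns to the correct class."""
    return -math.log(softmax(scores)[target_index])

def squared_error(prediction, target):
    """A common regression loss for a single prediction."""
    return (prediction - target) ** 2

uniform = [0.0] * 10    # no preference among the ten digits
confident = [0.0] * 10
confident[7] = 5.0      # strongly favors the digit 7

print(cross_entropy(uniform, 7))    # equals log(10), about 2.30
print(cross_entropy(confident, 7))  # small: confident and correct
print(cross_entropy(confident, 3))  # large: confident and wrong
```

The ordering of the three numbers is the point: the loss rewards confidence in the right answer and penalizes confidence in the wrong one.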
Walk the loss curve yourself
Pick a starting weight, a learning rate, a momentum value, and a step count, then follow the optimizer one update at a time or over the whole run. High learning rates can overshoot the minimum or diverge; momentum can speed up progress but can also carry the update past the minimum.
The example deliberately uses one parameter and the loss L(w) = 0.25(w − 2)² + 0.4 so that the actual update path can be drawn. Its minimum sits at w = 2, where the loss is 0.4. Real networks repeat the same gradient logic across vastly higher-dimensional parameter spaces.
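The same experiment can be reproduced in a few lines. This is a minimal sketch that assumes the classical "heavy-ball" form of momentum (a velocity that decays by the momentum factor and accumulates the negative gradient); the function name `walk` and the chosen settings are illustrative.

```python
def loss(w):
    return 0.25 * (w - 2.0) ** 2 + 0.4

def gradient(w):
    # Derivative of the loss above with respect to w.
    return 0.5 * (w - 2.0)

def walk(start_weight, learning_rate, momentum, steps):
    """Return the list of weights visited, starting point included."""
    w = start_weight
    velocity = 0.0
    path = [w]
    for _ in range(steps):
        velocity = momentum * velocity - learning_rate * gradient(w)
        w += velocity
        path.append(w)
    return path

plain = walk(start_weight=-4.0, learning_rate=0.5, momentum=0.0, steps=200)
with_momentum = walk(start_weight=-4.0, learning_rate=0.5, momentum=0.9, steps=200)
too_large = walk(start_weight=-4.0, learning_rate=5.0, momentum=0.0, steps=20)

print(plain[-1], loss(plain[-1]))   # settles at the minimum near w = 2
print(max(with_momentum))           # momentum carries w past 2 before settling
print(too_large[-1])                # learning rate too high: w runs away
```

For this particular loss, plain gradient descent multiplies the distance to the minimum by (1 − 0.5 × learning rate) at each step, so any learning rate above 4 makes the distance grow instead of shrink.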
Backpropagation assigns responsibility
The crucial question is not merely “how wrong was the answer?” but “which parameter contributed how much to that error?” Backpropagation applies the chain rule from the loss backward through the recorded computation. It produces a gradient for each trainable parameter: the local direction in which a tiny change would increase the loss. Moving against that direction should reduce it, at least nearby.
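To see the chain rule at work, consider a hypothetical two-layer scalar model, y = w2 · tanh(w1 · x + b1) + b2, with a squared-error loss. The sketch below computes the gradients by hand, working backward from the loss, and checks them against finite differences. The model, names, and numbers are assumptions chosen only for illustration.

```python
import math

def forward(params, x, target):
    """Run the model and return the loss plus the intermediate values."""
    w1, b1, w2, b2 = params
    z = w1 * x + b1
    h = math.tanh(z)
    y = w2 * h + b2
    loss = (y - target) ** 2
    return loss, (z, h, y)

def backward(params, x, target):
    """Apply the chain rule from the loss back to each parameter."""
    w1, b1, w2, b2 = params
    _, (z, h, y) = forward(params, x, target)
    dloss_dy = 2.0 * (y - target)
    dloss_dw2 = dloss_dy * h
    dloss_db2 = dloss_dy
    dloss_dh = dloss_dy * w2
    dloss_dz = dloss_dh * (1.0 - h ** 2)  # derivative of tanh(z) is 1 - tanh(z)^2
    dloss_dw1 = dloss_dz * x
    dloss_db1 = dloss_dz
    return [dloss_dw1, dloss_db1, dloss_dw2, dloss_db2]

def numerical_gradient(params, x, target, eps=1e-6):
    """Central finite differences, used here only as an independent check."""
    grads = []
    for i in range(len(params)):
        up = list(params)
        down = list(params)
        up[i] += eps
        down[i] -= eps
        grads.append((forward(up, x, target)[0] - forward(down, x, target)[0]) / (2 * eps))
    return grads

params = [0.5, -0.3, 1.2, 0.1]
analytic = backward(params, x=1.5, target=2.0)
numeric = numerical_gradient(params, x=1.5, target=2.0)
for name, a, n in zip(["w1", "b1", "w2", "b2"], analytic, numeric):
    print(name, a, n)
```

Note that `backward` reuses `h` and `y` from the forward computation, a small-scale version of why intermediate values must be kept around during training.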
Frameworks such as PyTorch build a computation graph during the forward pass and use automatic differentiation to calculate these gradients. Some intermediate tensors must be retained for backward computation, which is one reason training generally needs more memory than inference.
The complete training step
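One way to put the pieces together is a single function that runs the forward pass, computes the loss, derives the gradients, and updates the parameters. The sketch below assumes a deliberately tiny linear model, y = w · x + b, trained with mean squared error on synthetic data generated from y = 3x + 1. The model, data, and hyperparameters are illustrative, and the gradients are written by hand where a framework would use automatic differentiation.

```python
def training_step(params, batch, learning_rate):
    """Run one step on a batch of (x, target) pairs; return new params and loss."""
    w, b = params
    n = len(batch)

    # 1. Forward pass: predictions for every example in the batch.
    predictions = [w * x + b for x, _ in batch]

    # 2. Loss: mean squared error over the batch.
    loss = sum((p - t) ** 2 for p, (_, t) in zip(predictions, batch)) / n

    # 3. Backward pass: gradient of the loss with respect to w and b.
    grad_w = sum(2.0 * (p - t) * x for p, (x, t) in zip(predictions, batch)) / n
    grad_b = sum(2.0 * (p - t) for p, (_, t) in zip(predictions, batch)) / n

    # 4. Update: move each parameter against its gradient.
    return (w - learning_rate * grad_w, b - learning_rate * grad_b), loss

# Synthetic, noise-free data from the line y = 3x + 1.
data = [(x, 3.0 * x + 1.0) for x in [-2.0, -1.0, 0.0, 1.0, 2.0, 3.0]]

params = (0.0, 0.0)  # start from uninformative values
losses = []
for step in range(300):
    start = (step % 3) * 2
    batch = data[start:start + 2]  # cycle through three batches of two examples
    params, batch_loss = training_step(params, batch, learning_rate=0.05)
    losses.append(batch_loss)

print(params)                 # approaches (3.0, 1.0)
print(losses[0], losses[-1])  # loss falls from 14.5 toward zero
```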
Optimizers are update rules
Basic stochastic gradient descent subtracts from each parameter its gradient multiplied by a learning rate. Momentum accumulates a running direction across steps, which smooths noisy updates. Adam keeps moving averages of the gradient and of its square and uses them to adapt the update size per parameter. None of them removes the need for careful data, objectives, validation, and hyperparameter choices.
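The three rules can be written as single-parameter update functions. This is a sketch under stated assumptions: momentum is the classical heavy-ball form, Adam uses its commonly cited default decay rates with bias correction, and the function names, the `state` dictionary, and the learning rate are illustrative choices. The quadratic L(w) = 0.25(w − 2)² + 0.4 serves as the test loss.

```python
import math

def gradient(w):
    # Gradient of L(w) = 0.25 * (w - 2)^2 + 0.4
    return 0.5 * (w - 2.0)

def sgd_step(w, grad, state, lr=0.1):
    return w - lr * grad

def momentum_step(w, grad, state, lr=0.1, beta=0.9):
    state["velocity"] = beta * state.get("velocity", 0.0) - lr * grad
    return w + state["velocity"]

def adam_step(w, grad, state, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    t = state.get("t", 0) + 1
    m = beta1 * state.get("m", 0.0) + (1 - beta1) * grad         # average of gradients
    v = beta2 * state.get("v", 0.0) + (1 - beta2) * grad * grad  # average of squared gradients
    state.update(t=t, m=m, v=v)
    m_hat = m / (1 - beta1 ** t)  # bias correction for the zero initialization
    v_hat = v / (1 - beta2 ** t)
    return w - lr * m_hat / (math.sqrt(v_hat) + eps)

def run(step_fn, start=-4.0, steps=200):
    w, state = start, {}
    for _ in range(steps):
        w = step_fn(w, gradient(w), state)
    return w

print(run(sgd_step), run(momentum_step), run(adam_step))

# Adam's first step has size close to lr, whatever the gradient's scale.
first_small = adam_step(-4.0, -3.0, {})
first_large = adam_step(-4.0, -300.0, {})
print(first_small, first_large)  # both close to -3.9
```

The last two lines show the per-parameter adaptation: scaling the gradient by 100 leaves Adam's first step essentially unchanged, whereas plain SGD would take a step 100 times larger.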
Learning the task, not memorizing the sheet
Low training loss alone is not enough. A model can memorize quirks in its training examples and fail on new data, a failure known as overfitting. A separate validation set helps measure generalization. Regularization techniques such as dropout, weight decay, and early stopping, along with data augmentation and simply obtaining more representative data, can all affect the gap between training and validation performance.
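Early stopping is the easiest of these to sketch: watch the validation loss and stop once it has failed to improve for a set number of epochs. The helper below and its `patience` parameter are hypothetical, and the loss values are invented for illustration rather than measured from a real run.

```python
def early_stopping_epoch(validation_losses, patience=2):
    """Return (stop_epoch, best_epoch) for a sequence of per-epoch validation losses."""
    best_loss = float("inf")
    best_epoch = -1
    for epoch, val_loss in enumerate(validation_losses):
        if val_loss < best_loss:
            best_loss = val_loss
            best_epoch = epoch
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch  # no improvement for `patience` epochs
    return len(validation_losses) - 1, best_epoch

# Invented numbers: training loss keeps falling while validation loss turns upward.
train_losses = [0.90, 0.60, 0.40, 0.25, 0.15, 0.08]
val_losses = [0.95, 0.70, 0.55, 0.58, 0.63, 0.71]

stop_epoch, best_epoch = early_stopping_epoch(val_losses, patience=2)
print(stop_epoch, best_epoch)  # 4 2
```

In this made-up run the training loss alone would suggest continuing, while the validation loss indicates the parameters from epoch 2 generalize best.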
Training ends; inference remains
During inference the model normally performs only the forward pass. The learned parameter values are fixed, gradient tracking is typically disabled, and the system can be tuned for latency, throughput, power, or memory. The same distinction holds for a tiny classifier and a large language model: training discovers parameters; inference uses them.
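In code, inference reduces to a forward pass with frozen parameters followed by picking the highest score. The sketch below assumes a toy linear classifier with three classes and two inputs; the "trained" values are made up, and tuples are used to signal that nothing is updated.

```python
def predict_class(parameters, inputs):
    """Forward pass only: no loss, no gradients, no parameter update."""
    weights, biases = parameters
    scores = [
        sum(w * x for w, x in zip(class_weights, inputs)) + bias
        for class_weights, bias in zip(weights, biases)
    ]
    # The prediction is the index of the highest score.
    return max(range(len(scores)), key=lambda i: scores[i])

# Hypothetical parameters, treated as the fixed result of earlier training.
trained_parameters = (
    ((1.0, -1.0), (-1.0, 1.0), (0.5, 0.5)),  # one weight row per class
    (0.0, 0.0, -0.2),                        # one bias per class
)

print(predict_class(trained_parameters, (2.0, 0.5)))  # 0
print(predict_class(trained_parameters, (0.0, 3.0)))  # 1
```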