“The model runs on a GPU” is useful shorthand, but it skips most of the stack. An application prepares input. A framework or inference runtime loads model artifacts. A compiler selects and fuses operations. Hardware-specific libraries launch optimized kernels. Memory systems feed values to arithmetic units. Then the output climbs back up the stack.
What is inside the model artifact?
A deployable package may contain a computation graph, named operations, tensor shapes, learned weights, numerical types, metadata, and preprocessing configuration. Language models also need a tokenizer and generation settings. Formats differ: some preserve flexible framework code; others describe a portable static graph; specialized engines may be compiled for a particular accelerator.
Graph: the operations and how tensor outputs feed later inputs.
Weights: learned tensor values, often the majority of the file size.
Metadata: architecture, tokenizer vocabulary, labels, normalization, or generation defaults.
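As a rough illustration of those three parts, the sketch below models an artifact as a plain data structure and totals the bytes its weights occupy. The class names, field layout, and the tiny two-node model are hypothetical and do not correspond to any real file format.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    op: str        # operation name, e.g. "MatMul"
    inputs: list   # names of the tensors this node consumes
    output: str    # name of the tensor this node produces

@dataclass
class ModelArtifact:
    graph: list                                   # ordered list of Node objects
    weights: dict                                 # tensor name -> (shape, dtype)
    metadata: dict = field(default_factory=dict)  # labels, tokenizer info, defaults

# Bytes per stored element for a few common numerical types.
BYTES_PER_DTYPE = {"float32": 4, "float16": 2, "int8": 1}

def weight_bytes(artifact):
    """Sum the storage needed by every weight tensor in the artifact."""
    total = 0
    for shape, dtype in artifact.weights.values():
        count = 1
        for dim in shape:
            count *= dim
        total += count * BYTES_PER_DTYPE[dtype]
    return total

# Hypothetical model: y = x @ W + b with a 4x3 weight matrix and a 3-element bias.
artifact = ModelArtifact(
    graph=[Node("MatMul", ["x", "W"], "h"), Node("Add", ["h", "b"], "y")],
    weights={"W": ((4, 3), "float32"), "b": ((3,), "float32")},
    metadata={"labels": ["cat", "dog", "bird"]},
)
total_weight_bytes = weight_bytes(artifact)  # (12 + 3) elements * 4 bytes
```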
The runtime plans the computation
When a runtime loads a model, it validates operators and shapes, applies graph optimizations, allocates or reuses buffers, and assigns work to available execution providers. Constant folding can calculate fixed expressions once. Operator fusion can combine several steps into one kernel, reducing launches and intermediate memory traffic. Operations that the preferred provider does not support may fall back to another device, typically the CPU, which can add data transfers between devices.
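Constant folding and operator fusion can be sketched on a toy graph. This is a minimal illustration, not how any particular runtime implements its passes: the node format `(output, op, inputs)`, the `fused_mul_add` operator name, and the example graph are all assumptions, and the nodes are assumed to be listed in execution order.

```python
import operator

OPS = {"add": operator.add, "mul": operator.mul}

def fold_constants(nodes, constants):
    """Evaluate nodes whose inputs are all known constants, once, at load time."""
    known = dict(constants)
    remaining = []
    for out, op, inputs in nodes:
        if all(name in known for name in inputs):
            known[out] = OPS[op](*(known[name] for name in inputs))
        else:
            remaining.append((out, op, inputs))
    return remaining, known

def fuse_mul_add(nodes):
    """Fuse t = mul(a, b) directly followed by y = add(t, c) into one node.

    The fusion is applied only when t has no other consumer, so dropping the
    intermediate tensor cannot change any other result.
    """
    fused, i = [], 0
    while i < len(nodes):
        out, op, inputs = nodes[i]
        nxt = nodes[i + 1] if i + 1 < len(nodes) else None
        consumers = sum(out in ins for _, _, ins in nodes)
        if (op == "mul" and nxt is not None and nxt[1] == "add"
                and nxt[2][0] == out and consumers == 1):
            fused.append((nxt[0], "fused_mul_add", inputs + [nxt[2][1]]))
            i += 2
        else:
            fused.append((out, op, inputs))
            i += 1
    return fused

# y = x * (two * three) + bias, where only x varies at run time.
nodes = [
    ("scale", "mul", ["two", "three"]),
    ("t", "mul", ["x", "scale"]),
    ("y", "add", ["t", "bias"]),
]
constants = {"two": 2.0, "three": 3.0, "bias": 1.0}

remaining, known = fold_constants(nodes, constants)
planned = fuse_mul_add(remaining)
```

After both passes the three original nodes collapse into a single fused node, and `scale` is stored as a precomputed constant instead of being recalculated on every request.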
Portable runtimes such as ONNX Runtime expose hardware-specific execution providers for CPUs, GPUs, mobile accelerators, and NPUs. The same high-level graph can therefore use very different libraries beneath it—within the limits of provider support.
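The idea of provider priority with fallback can be illustrated with a toy assignment function. This is not ONNX Runtime's API or its partitioning algorithm (real runtimes typically assign whole subgraphs rather than single operators); the provider names and their supported-operator sets are hypothetical.

```python
def assign_providers(ops, providers):
    """Assign each op to the first provider, in priority order, that supports it."""
    plan = []
    for op in ops:
        for name, supported in providers:
            if op in supported:
                plan.append((op, name))
                break
        else:
            # No provider in the list can run this operator at all.
            raise ValueError(f"no provider supports {op!r}")
    return plan

# Hypothetical providers, listed from most to least preferred.
providers = [
    ("npu", {"Conv", "MatMul", "Relu"}),
    ("gpu", {"Conv", "MatMul", "Relu", "Softmax", "LayerNorm"}),
    ("cpu", {"Conv", "MatMul", "Relu", "Softmax", "LayerNorm", "TopK"}),
]

plan = assign_providers(["MatMul", "LayerNorm", "TopK"], providers)
```

In this sketch `MatMul` lands on the most preferred provider, while `LayerNorm` and `TopK` fall through to the first provider that lists them.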
Most of the work is tensor arithmetic
Neural networks repeatedly multiply matrices, reduce values, normalize, apply elementwise functions, gather indexed rows, and move or reshape data. A “kernel” is an implementation of such an operation for a target processor. Kernel quality matters: tile sizes, data layout, vectorization, parallel occupancy, fusion, and memory access can decide whether expensive hardware spends its time calculating or waiting.
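To make the tiling idea concrete, here is a sketch of two kernels for the same operation: a naive matrix multiply and a tiled one that works on small blocks so the values in a block can be reused while they are still close to the arithmetic units. Pure Python will not show the speed difference; the point is only that both loop orders compute the same result. The tile size of 2 is an arbitrary choice for illustration.

```python
def matmul_naive(a, b):
    """Row-by-column matrix multiply on lists of lists."""
    n, k, m = len(a), len(b), len(b[0])
    return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(m)]
            for i in range(n)]

def matmul_tiled(a, b, tile=2):
    """Same product, computed block by block over tile x tile regions."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0] * m for _ in range(n)]
    for i0 in range(0, n, tile):           # block of output rows
        for j0 in range(0, m, tile):       # block of output columns
            for p0 in range(0, k, tile):   # block of the shared dimension
                for i in range(i0, min(i0 + tile, n)):
                    for j in range(j0, min(j0 + tile, m)):
                        acc = c[i][j]
                        for p in range(p0, min(p0 + tile, k)):
                            acc += a[i][p] * b[p][j]
                        c[i][j] = acc
    return c

a = [[1, 2, 3], [4, 5, 6]]
b = [[7, 8], [9, 10], [11, 12]]
naive = matmul_naive(a, b)
tiled = matmul_tiled(a, b)
```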
CPU, GPU, or NPU?
CPU: Few powerful, flexible cores; excellent for control flow, preprocessing, small models, and broad operator support. Multiple cores and SIMD instructions still provide substantial parallelism, and caches keep frequently used data close.
GPU: Many parallel lanes plus high memory bandwidth. Strong for large matrix operations and batches, provided enough work is available to keep the device occupied.
NPU: Specialized data paths for neural operations, often optimized for lower precision and power efficiency on phones, laptops, edge devices, or data-center accelerators.
Memory movement is part of the computation
A processor cannot multiply parameters it cannot reach. Weights may travel from storage into system memory, then accelerator memory, cache, and registers. Intermediate activations and caches compete for the same capacity. For large models, memory bandwidth—the rate at which values arrive—can matter as much as peak arithmetic throughput.
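A back-of-the-envelope estimate shows why bandwidth can dominate. The sketch below assumes one pass that reads every weight once and performs about two arithmetic operations per weight (a multiply and an add). The model size, bandwidth, and throughput figures are hypothetical round numbers, not measurements of any device.

```python
def pass_time_estimates(num_params, bytes_per_param, bandwidth_bytes_per_s, flops_per_s):
    """Return (seconds to stream the weights, seconds to do the arithmetic)."""
    memory_s = num_params * bytes_per_param / bandwidth_bytes_per_s
    compute_s = 2 * num_params / flops_per_s  # assumed: 2 operations per weight
    return memory_s, compute_s

# Hypothetical: 7 billion 2-byte weights, 1 TB/s of bandwidth, 100 TFLOP/s peak.
memory_s, compute_s = pass_time_estimates(7e9, 2, 1e12, 100e12)
bound = "memory" if memory_s > compute_s else "compute"
```

Under these assumptions, streaming the weights takes about 14 ms while the arithmetic takes about 0.14 ms, so the arithmetic units would spend most of the pass waiting for data.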
Why fewer bits can run faster
Quantization represents values using fewer bits—perhaps FP16, BF16, FP8, INT8, or 4-bit schemes instead of FP32. Smaller values reduce model storage and memory traffic and may unlock faster specialized arithmetic. The tradeoff is approximation error, scale metadata, calibration work, and the possibility that some layers remain sensitive and need higher precision.
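One simple scheme, symmetric per-tensor INT8 quantization, can be sketched as follows. It is only one of many possible schemes, chosen here for brevity; the function names and the sample weights are illustrative.

```python
def quantize_int8(values):
    """Map floats to integers in [-127, 127] using one shared scale."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs else 1.0  # the scale is the stored metadata
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale

def dequantize(q, scale):
    """Recover approximate floats from the integers and the scale."""
    return [x * scale for x in q]

weights = [0.02, -1.27, 0.5, 0.731]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# Rounding to the nearest step bounds the error by half a step (scale / 2).
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

Each weight now needs one byte plus a share of the single stored scale, and the approximation error in this example stays within half a quantization step.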
| Representation | Approx. weight bytes | Typical concern |
|---|---|---|
| FP32 | 4 per parameter | Large memory and bandwidth cost |
| FP16 / BF16 | 2 per parameter | FP16 keeps more precision; BF16 keeps more range |
| INT8 / FP8 | 1 per parameter | Calibration, scales, hardware support |
| 4-bit families | ~0.5 per parameter + metadata | Greater approximation and packing complexity |
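
The table's arithmetic can be checked with a few lines. The sketch assumes a hypothetical model with 7 billion parameters, and, for the 4-bit row, an assumed layout with one 16-bit scale per group of 64 weights; real 4-bit formats vary in group size and metadata.

```python
def weight_gb(num_params, bits, group_size=None, scale_bits=16):
    """Approximate weight storage in gigabytes (10**9 bytes)."""
    total_bits = num_params * bits
    if group_size:
        # One scale value is stored for every group of weights.
        total_bits += (num_params / group_size) * scale_bits
    return total_bits / 8 / 1e9

num_params = 7_000_000_000  # hypothetical model size
sizes = {
    "FP32": weight_gb(num_params, 32),
    "FP16/BF16": weight_gb(num_params, 16),
    "INT8/FP8": weight_gb(num_params, 8),
    "4-bit": weight_gb(num_params, 4, group_size=64),
}
```

With these assumptions the results are 28, 14, 7, and roughly 3.7 GB, the last slightly above 0.5 bytes per parameter because of the per-group scales.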
Latency and throughput pull differently
Latency asks how long one request waits. Throughput asks how much total work finishes per unit time. Batching combines requests so parallel hardware is better utilized, improving throughput but potentially making an individual request wait. Production servers use queues, dynamic batches, caches, replicas, and model parallelism to balance these goals.
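The tension can be seen in a toy cost model. It assumes each batch costs a fixed overhead plus a per-request cost, and that requests arrive at a steady interval so the first request in a batch waits for the rest to arrive. All of the millisecond figures are hypothetical.

```python
def batch_stats(batch_size, overhead_ms=10.0, per_item_ms=1.0, arrival_gap_ms=2.0):
    """Return (worst-case latency in ms, service capacity in requests/s)."""
    service_ms = overhead_ms + per_item_ms * batch_size   # time to run one batch
    fill_wait_ms = (batch_size - 1) * arrival_gap_ms      # first request waits for the batch to fill
    worst_latency_ms = fill_wait_ms + service_ms
    capacity_per_s = batch_size / service_ms * 1000.0
    return worst_latency_ms, capacity_per_s

latency_1, capacity_1 = batch_stats(1)
latency_8, capacity_8 = batch_stats(8)
```

In this model a batch of 8 raises capacity from about 91 to about 444 requests per second, while the worst-case wait for one request grows from 11 ms to 32 ms.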