“The model runs on a GPU” is useful shorthand, but it skips most of the stack. An application prepares input. A framework or inference runtime loads model artifacts. A compiler selects and fuses operations. Hardware-specific libraries launch optimized kernels. Memory systems feed values to arithmetic units. Then the output climbs back up the stack.

Application: prompt, image, request handling, preprocessing, output logic
Model runtime: graph execution, memory planning, scheduling, device selection
Kernels + drivers: matrix multiplication, attention, convolution, activation, communication
Hardware: CPU, GPU, NPU, accelerator memory, interconnect, storage
Each layer hides complexity while passing constraints—shape, precision, memory, and supported operations—to the layer below.

What is inside the model artifact?

A deployable package may contain a computation graph, named operations, tensor shapes, learned weights, numerical types, metadata, and preprocessing configuration. Language models also need a tokenizer and generation settings. Formats differ: some preserve flexible framework code; others describe a portable static graph; specialized engines may be compiled for a particular accelerator.

Graph

The operations and how tensor outputs feed later inputs.

shapes · operators

Weights

Learned tensor values—often the majority of the file size.

FP32 · FP16 · INT8 / 4

Configuration

Architecture, tokenizer vocabulary, labels, normalization, or generation defaults.

metadata · tokenizer
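As a rough illustration of the three parts above, the sketch below models an artifact as plain Python data. The class names (`TensorSpec`, `Node`, `ModelArtifact`), the tiny three-node graph, and the configuration keys are all hypothetical and do not correspond to any real file format; the point is only that graph, weights, and configuration are separate pieces and that weight bytes follow from shape and numerical type.

```python
from dataclasses import dataclass, field
from math import prod

# Bytes per element for a few common numerical types (4-bit packs two per byte).
BYTES_PER_ELEMENT = {"fp32": 4.0, "fp16": 2.0, "bf16": 2.0, "int8": 1.0, "int4": 0.5}


@dataclass
class TensorSpec:
    name: str
    shape: tuple[int, ...]
    dtype: str

    def num_bytes(self) -> float:
        # Element count times element width; ignores any packing metadata.
        return prod(self.shape) * BYTES_PER_ELEMENT[self.dtype]


@dataclass
class Node:
    op: str              # operator name, e.g. "MatMul"
    inputs: list[str]    # names of tensors consumed
    outputs: list[str]   # names of tensors produced


@dataclass
class ModelArtifact:
    graph: list[Node]
    weights: list[TensorSpec]
    config: dict = field(default_factory=dict)

    def weight_bytes(self) -> float:
        return sum(w.num_bytes() for w in self.weights)


# Hypothetical one-layer classifier: y = relu(x @ w1 + b1)
artifact = ModelArtifact(
    graph=[
        Node("MatMul", ["x", "w1"], ["h"]),
        Node("Add", ["h", "b1"], ["h_bias"]),
        Node("Relu", ["h_bias"], ["y"]),
    ],
    weights=[
        TensorSpec("w1", (512, 256), "fp16"),
        TensorSpec("b1", (256,), "fp16"),
    ],
    config={"labels": ["cat", "dog"], "normalization": "per-channel"},
)

print(artifact.weight_bytes())  # 262656.0 bytes for the two fp16 tensors
```

Real formats carry far more detail (operator attributes, versioning, tokenizer files), but the same three-way split usually remains recognizable.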

The runtime plans the computation

When a runtime loads a model, it validates operators and shapes, applies graph optimizations, allocates or reuses buffers, and assigns work to available execution providers. Constant folding can calculate fixed expressions once. Operator fusion can combine several steps into one kernel, reducing launches and intermediate memory traffic. Unsupported operations may fall back to another device.
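To make constant folding and operator fusion concrete, here is a minimal sketch on a toy scalar graph. The node layout `(output, op, inputs)`, the function names, and the `fma` fused operator are assumptions made for this example; production runtimes work on tensors and use far richer pattern matching.

```python
import operator

OPS = {
    "add": operator.add,
    "mul": operator.mul,
    "fma": lambda a, b, c: a * b + c,  # fused multiply-add
}


def fold_constants(nodes, constants):
    """Evaluate nodes whose inputs are all known constants, once, at load time."""
    constants = dict(constants)
    remaining = []
    for out, op, inputs in nodes:  # nodes are assumed to be in topological order
        if all(name in constants for name in inputs):
            constants[out] = OPS[op](*(constants[name] for name in inputs))
        else:
            remaining.append((out, op, inputs))
    return remaining, constants


def fuse_mul_add(nodes):
    """Fuse `t = a * b; y = t + c` into one node when t has no other consumer.

    Assumes the intermediate t is not itself a graph output.
    """
    uses = {}
    for _, _, inputs in nodes:
        for name in inputs:
            uses[name] = uses.get(name, 0) + 1
    fused, skip = [], set()
    for i, (out, op, inputs) in enumerate(nodes):
        if i in skip:
            continue
        nxt = nodes[i + 1] if i + 1 < len(nodes) else None
        if (op == "mul" and nxt is not None and nxt[1] == "add"
                and nxt[2][0] == out and uses.get(out) == 1):
            fused.append((nxt[0], "fma", inputs + [nxt[2][1]]))
            skip.add(i + 1)
        else:
            fused.append((out, op, inputs))
    return fused


def run(nodes, values):
    """Reference executor: evaluate nodes in order and return all tensor values."""
    values = dict(values)
    for out, op, inputs in nodes:
        values[out] = OPS[op](*(values[name] for name in inputs))
    return values


nodes = [
    ("scale", "mul", ["gain", "two"]),  # depends only on constants
    ("t", "mul", ["x", "scale"]),
    ("y", "add", ["t", "bias"]),
]
constants = {"gain": 0.5, "two": 2.0, "bias": 1.0}

remaining, folded = fold_constants(nodes, constants)
optimized = fuse_mul_add(remaining)

baseline = run(nodes, {**constants, "x": 3.0})["y"]
fast = run(optimized, {**folded, "x": 3.0})["y"]
print(optimized, baseline, fast)
```

The optimized graph has one node instead of three and never materializes the intermediate `t`, which is the saving in launches and memory traffic that the paragraph describes.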

Portable runtimes such as ONNX Runtime expose hardware-specific execution providers for CPUs, GPUs, mobile accelerators, and NPUs. The same high-level graph can therefore use very different libraries beneath it—within the limits of provider support.
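A small sketch of how an application might choose execution providers with ONNX Runtime's Python package. The preference order and the helper names `choose_providers` and `load_session` are assumptions for this example; `onnxruntime.get_available_providers()` and the `providers` argument of `onnxruntime.InferenceSession` are the library calls it relies on, and the model path is left to the caller.

```python
# Hypothetical preference order, most specialized first.
PREFERRED = [
    "TensorrtExecutionProvider",
    "CUDAExecutionProvider",
    "CoreMLExecutionProvider",
    "CPUExecutionProvider",
]


def choose_providers(available, preferred=PREFERRED):
    """Keep the preferred providers that this installation actually offers."""
    return [name for name in preferred if name in available]


def load_session(model_path):
    """Create a session that tries accelerators first and the CPU last."""
    # Imported lazily so choose_providers() works without onnxruntime installed.
    import onnxruntime as ort

    providers = choose_providers(ort.get_available_providers())
    return ort.InferenceSession(model_path, providers=providers)


# Example: a machine whose build exposes only CUDA and CPU providers.
print(choose_providers(["CUDAExecutionProvider", "CPUExecutionProvider"]))
```

Listing the CPU provider last gives operators that an accelerator provider cannot handle somewhere to run, which matches the fallback behaviour described above.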

Most of the work is tensor arithmetic

Neural networks repeatedly multiply matrices, reduce values, normalize, apply elementwise functions, gather indexed rows, and move or reshape data. A “kernel” is an implementation of such an operation for a target processor. Kernel quality matters: tile sizes, data layout, vectorization, parallel occupancy, fusion, and memory access can decide whether expensive hardware spends its time calculating or waiting.

activation matrix × weight matrix + bias → next tensor
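The pattern above can be written out as a deliberately naive reference implementation. This is a sketch in plain Python lists, not how an optimized kernel looks; the function names `dense` and `relu` and the sample values are chosen for this example. A tuned kernel computes the same numbers but tiles, vectorizes, and parallelizes the three loops.

```python
def dense(x, w, b):
    """Compute x @ w + b for row-major nested lists.

    x has shape (rows, inner), w has shape (inner, cols), b has length cols.
    """
    rows, inner, cols = len(x), len(w), len(w[0])
    assert all(len(row) == inner for row in x), "inner dimensions must match"
    assert len(b) == cols, "bias length must match output columns"
    out = []
    for i in range(rows):
        out_row = []
        for j in range(cols):
            acc = b[j]                      # start from the bias
            for k in range(inner):          # dot product of row i and column j
                acc += x[i][k] * w[k][j]
            out_row.append(acc)
        out.append(out_row)
    return out


def relu(t):
    """Elementwise activation: negative values become zero."""
    return [[max(0.0, v) for v in row] for row in t]


x = [[1.0, 2.0], [0.0, -1.0]]               # activations: 2 rows, 2 features
w = [[1.0, 0.0, -1.0], [2.0, 1.0, 0.0]]     # weights: 2 inputs, 3 outputs
b = [0.5, 0.0, 0.0]

y = relu(dense(x, w, b))
print(y)  # [[5.5, 2.0, 0.0], [0.0, 0.0, 0.0]]
```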

CPU, GPU, or NPU?

CPU

A few powerful, flexible cores; excellent for control flow, preprocessing, and small models, with broad operator support. Multiple cores and SIMD instructions still provide substantial parallelism, and large caches help keep them fed.

GPU

Many parallel lanes plus high memory bandwidth. Strong for large matrix operations and batches, provided enough work is available to keep the device occupied.

NPU / ASIC

Specialized data paths for neural operations, often optimized for lower precision and power efficiency on phones, laptops, edge devices, or data-center accelerators.

The fastest device is workload-dependent. A GPU can lose to a CPU on a tiny request once transfer and launch overhead are counted. An NPU may be efficient but support fewer operations. Real runtimes partition around these constraints.
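A back-of-the-envelope cost model shows why a tiny request can favor the CPU. Every number below (peak rates, transfer bandwidth, launch overhead, request sizes) is a made-up illustrative value, not a measurement of any real device, and the model ignores overlap between transfer and compute.

```python
def estimated_seconds(flops, bytes_moved, *, peak_flops, transfer_bandwidth, fixed_overhead):
    """Fixed launch cost + time to move inputs to the device + time to compute."""
    return fixed_overhead + bytes_moved / transfer_bandwidth + flops / peak_flops


# Hypothetical devices. The CPU already holds the data, so transfer is free.
CPU = dict(peak_flops=1e11, transfer_bandwidth=float("inf"), fixed_overhead=0.0)
GPU = dict(peak_flops=1e13, transfer_bandwidth=1e10, fixed_overhead=50e-6)

# Hypothetical workloads: (floating-point operations, bytes to transfer).
tiny = (1e6, 1e4)
large = (1e11, 1e7)

for name, (flops, nbytes) in {"tiny": tiny, "large": large}.items():
    cpu_s = estimated_seconds(flops, nbytes, **CPU)
    gpu_s = estimated_seconds(flops, nbytes, **GPU)
    winner = "CPU" if cpu_s < gpu_s else "GPU"
    print(f"{name}: cpu={cpu_s:.6f}s gpu={gpu_s:.6f}s -> {winner}")
```

Under these assumptions the tiny request finishes on the CPU in about 10 microseconds while the GPU spends roughly 51 microseconds, mostly on fixed overhead; for the large request the GPU is about ninety times faster.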

Memory movement is part of the computation

A processor cannot multiply parameters it cannot reach. Weights may travel from storage into system memory, then accelerator memory, cache, and registers. Intermediate activations and, for language models, the key-value (KV) cache compete with the weights for the same capacity. For large models, memory bandwidth (the rate at which values arrive) can matter as much as peak arithmetic throughput.

Storage: model file · large · persistent
System RAM: loading · staging · CPU work
VRAM / shared memory: weights · activations · KV cache
Cache + registers: small · fast · feeding arithmetic
Conceptual hierarchy only. Integrated systems may share physical memory, while discrete accelerators require explicit transfers across an interconnect.
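The bandwidth point can be turned into a rough ceiling. The sketch below assumes a simplified situation: a dense language model generating one token at a time for a single request, where every weight is read from accelerator memory once per token and nothing else limits speed. The parameter count, bandwidth, and capacity figures are hypothetical round numbers.

```python
def weight_bytes(num_params, bytes_per_param):
    return num_params * bytes_per_param


def fits_in_memory(num_params, bytes_per_param, capacity_bytes, reserve_fraction=0.2):
    """Leave a share of capacity for activations and the KV cache (assumed 20%)."""
    return weight_bytes(num_params, bytes_per_param) <= capacity_bytes * (1 - reserve_fraction)


def max_tokens_per_second(num_params, bytes_per_param, bandwidth_bytes_per_s):
    """Upper bound when each generated token must stream all weights once."""
    return bandwidth_bytes_per_s / weight_bytes(num_params, bytes_per_param)


PARAMS = 7e9          # hypothetical 7-billion-parameter model
BANDWIDTH = 1e12      # hypothetical 1 TB/s accelerator memory
CAPACITY = 16e9       # hypothetical 16 GB accelerator memory

for label, width in [("FP16", 2), ("INT8", 1), ("4-bit", 0.5)]:
    ceiling = max_tokens_per_second(PARAMS, width, BANDWIDTH)
    ok = fits_in_memory(PARAMS, width, CAPACITY)
    print(f"{label}: <= {ceiling:.0f} tokens/s, fits with reserve: {ok}")
```

Real throughput lands below this ceiling, and batching changes the picture because one pass over the weights then serves several requests.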

Why fewer bits can run faster

Quantization represents values using fewer bits—perhaps FP16, BF16, FP8, INT8, or 4-bit schemes instead of FP32. Smaller values reduce model storage and memory traffic and may unlock faster specialized arithmetic. The tradeoff is approximation error, scale metadata, calibration work, and the possibility that some layers remain sensitive and need higher precision.

Representation | Approx. weight bytes | Typical concern
FP32 | 4 per parameter | Large memory and bandwidth cost
FP16 / BF16 | 2 per parameter | Range and precision differ between formats
INT8 / FP8 | 1 per parameter | Calibration, scales, hardware support
4-bit families | ~0.5 per parameter + metadata | Greater approximation and packing complexity
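One simple scheme that shows both the saving and the approximation error is symmetric per-tensor INT8 quantization, sketched below. It is only one of many schemes (per-channel scales, zero points, and block-wise 4-bit formats are common alternatives), and the function names and sample weights are chosen for this example.

```python
def quantize_int8(values):
    """Map floats to integers in [-127, 127] using a single shared scale."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0   # the "scale metadata"
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate floats; the difference is the quantization error."""
    return [q * scale for q in quantized]


weights = [0.024, -0.5, 0.313, 1.27, -1.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))

print(q)           # [2, -50, 31, 127, -100]
print(max_error)   # at most half a quantization step (scale / 2)
```

Each value now needs one byte instead of four, plus one stored scale per tensor. A single large outlier stretches the scale and coarsens every other value, which is one reason some layers stay in higher precision.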

Latency and throughput pull differently

Latency asks how long one request waits. Throughput asks how much total work finishes per unit time. Batching combines requests so parallel hardware is better utilized, improving throughput but potentially making an individual request wait. Production servers use queues, dynamic batches, caches, replicas, and model parallelism to balance these goals.

Cold start: load + compile + allocate
First-token latency: process the prompt
Token throughput: generate the continuation
Batching: share hardware efficiently
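The tension between the two metrics shows up even in a toy model where a batch costs a fixed amount plus a small amount per request. The cost constants and arrival spacing below are invented for illustration, and the model ignores queueing effects beyond waiting for the batch to fill.

```python
def batch_seconds(batch_size, fixed=0.008, per_item=0.002):
    """Hypothetical cost: 8 ms of fixed work plus 2 ms per request in the batch."""
    return fixed + per_item * batch_size


def throughput(batch_size):
    """Requests completed per second while the hardware is busy."""
    return batch_size / batch_seconds(batch_size)


def worst_case_latency(batch_size, interarrival=0.001):
    """The first request waits for the batch to fill, then for the batch to run."""
    wait_to_fill = (batch_size - 1) * interarrival
    return wait_to_fill + batch_seconds(batch_size)


for size in (1, 4, 8, 32):
    print(f"batch={size:2d}  throughput={throughput(size):6.1f} req/s  "
          f"worst latency={worst_case_latency(size) * 1000:5.1f} ms")
```

With these numbers, going from a batch of 1 to a batch of 8 raises throughput from 100 to about 333 requests per second while the worst-case wait grows from 10 ms to 31 ms. Dynamic batching policies usually cap both the batch size and the time a request may wait.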

One inference request, end to end

01 · Prepare: encode input (tokens or tensors)
02 · Schedule: batch + place (choose devices)
03 · Execute: run kernels (move + multiply)
04 · Decode: interpret output (sample or postprocess)
05 · Serve: stream result (measure + repeat)
For generated text, stages three and four repeat token by token while caches and scheduler state evolve.
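The token-by-token loop can be sketched with a toy stand-in for the model. Everything here is hypothetical: the seven-word vocabulary, the `execute` function that simply predicts the next vocabulary entry, and the list used as a "cache". Scheduling (stage two) is omitted because there is only one request and one device.

```python
VOCAB = ["<eos>", "the", "model", "runs", "on", "a", "gpu"]
WORD_TO_ID = {word: i for i, word in enumerate(VOCAB)}


def prepare(prompt):
    """Stage 1: encode text into token ids (whitespace split, toy vocabulary)."""
    return [WORD_TO_ID[word] for word in prompt.split()]


def execute(token_ids, cache):
    """Stage 3: stand-in for running kernels; returns one score per vocabulary id.

    Only tokens not yet in the cache are processed: the whole prompt on the
    first call, then a single new token on each later call.
    """
    cache.extend(token_ids[len(cache):])
    predicted = (cache[-1] + 1) % len(VOCAB)   # toy rule, not a real model
    return [1.0 if i == predicted else 0.0 for i in range(len(VOCAB))]


def decode(scores):
    """Stage 4: greedy choice of the highest-scoring token id."""
    return max(range(len(scores)), key=scores.__getitem__)


def generate(prompt, max_new_tokens=10):
    """Stage 5: yield words as they are produced so a caller can stream them."""
    token_ids = prepare(prompt)
    cache = []
    for _ in range(max_new_tokens):
        next_id = decode(execute(token_ids, cache))
        if VOCAB[next_id] == "<eos>":
            break
        token_ids.append(next_id)
        yield VOCAB[next_id]


print(" ".join(generate("the model")))  # runs on a gpu
```

Sampling strategies, batching across requests, and a real key-value cache would replace the toy pieces, but the shape of the loop stays the same.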

References & next reads