What actually runs an ML model?

“The model runs on a GPU” is useful shorthand, but it skips most of the stack. An application prepares input. A framework or inference runtime loads model artifacts. A compiler selects and fuses operations. Hardware-specific libraries launch optimized kernels. Memory systems feed values to arithmetic units. Then the output climbs back up the stack.

Application: Prompt, image, request handling, preprocessing, output logic

Model runtime: Graph execution, memory planning, scheduling, device selection

Kernels + drivers: Matrix multiplication, attention, convolution, activation, communication

Hardware: CPU, GPU, NPU, accelerator memory, interconnect, storage

Each layer hides complexity while passing constraints—shape, precision, memory, and supported operations—to the layer below.

What is inside the model artifact?

A deployable package may contain a computation graph, named operations, tensor shapes, learned weights, numerical types, metadata, and preprocessing configuration. Language models also need a tokenizer and generation settings. Formats differ: some preserve flexible framework code; others describe a portable static graph; specialized engines may be compiled for a particular accelerator.

Graph

The operations and how tensor outputs feed later inputs.

shapes · operators

Weights

Learned tensor values—often the majority of the file size.

FP32 · FP16 · INT8 / 4

Configuration

Architecture, tokenizer vocabulary, labels, normalization, or generation defaults.

metadata · tokenizer
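To make the three parts concrete, here is a minimal sketch of an artifact as a plain data structure. The `ModelArtifact` class, its field names, and the tiny example graph are all hypothetical; real formats lay these pieces out differently. It only shows how shapes and numerical types determine the weight footprint.

```python
from dataclasses import dataclass, field
from math import prod

# Bytes per element for a few numerical types (4-bit packs two values per byte).
BYTES_PER_ELEMENT = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}


@dataclass
class ModelArtifact:
    """Hypothetical container; real formats differ in layout and naming."""
    graph: list      # (op_name, input_names, output_name) triples
    weights: dict    # tensor name -> (shape, dtype)
    config: dict = field(default_factory=dict)

    def weight_bytes(self) -> float:
        # Element count times bytes per element, summed over all tensors.
        return sum(prod(shape) * BYTES_PER_ELEMENT[dtype]
                   for shape, dtype in self.weights.values())


artifact = ModelArtifact(
    graph=[("matmul", ["x", "w"], "h"), ("add", ["h", "b"], "y")],
    weights={"w": ((512, 256), "fp16"), "b": ((256,), "fp32")},
    config={"labels": ["cat", "dog"]},
)
print(artifact.weight_bytes())
```

Even in this toy case the weight matrix dominates: the graph and configuration are a few dozen bytes of description, while the tensors account for nearly all of the size.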

The runtime plans the computation

When a runtime loads a model, it validates operators and shapes, applies graph optimizations, allocates or reuses buffers, and assigns work to available execution providers. Constant folding can calculate fixed expressions once. Operator fusion can combine several steps into one kernel, reducing launches and intermediate memory traffic. Operations that the preferred backend does not support may fall back to another device, often the CPU, at the cost of extra data movement.
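Constant folding is easy to illustrate on a toy graph. The sketch below assumes a made-up node format of `(op, input_names, output_name)` and only two scalar operations; it is not how any particular runtime represents graphs, but the idea is the same: anything computable from constants alone is evaluated once at load time.

```python
import operator

OPS = {"add": operator.add, "mul": operator.mul}


def fold_constants(nodes, constants):
    """Evaluate nodes whose inputs are all known constants, once, at load time.

    nodes: list of (op, input_names, output_name) in topological order.
    constants: dict of name -> value known before any request arrives.
    Returns the nodes that still run per request, plus the enlarged constants.
    """
    known = dict(constants)
    remaining = []
    for op, inputs, output in nodes:
        if all(name in known for name in inputs):
            known[output] = OPS[op](*(known[name] for name in inputs))
        else:
            remaining.append((op, inputs, output))
    return remaining, known


nodes = [
    ("mul", ["scale", "gain"], "s"),  # depends only on constants
    ("mul", ["x", "s"], "t"),         # depends on the runtime input x
    ("add", ["t", "bias"], "y"),
]
remaining, known = fold_constants(
    nodes, {"scale": 0.5, "gain": 4.0, "bias": 1.0}
)
print(len(nodes), "nodes before,", len(remaining), "after folding")
```

The first multiplication disappears from the per-request graph because its result is stored as a new constant.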

Portable runtimes such as ONNX Runtime expose hardware-specific execution providers for CPUs, GPUs, mobile accelerators, and NPUs. The same high-level graph can therefore use very different libraries beneath it—within the limits of provider support.
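The effect of provider support can be sketched with a simple priority rule: each operation goes to the first provider that supports it. The provider names and support sets below are invented for illustration, and this per-operation greedy rule is a simplification, not ONNX Runtime's actual partitioning algorithm.

```python
def assign_providers(ops, providers):
    """Give each op to the first provider, in priority order, that supports it."""
    placement = []
    for op in ops:
        for name, supported in providers:
            if op in supported:
                placement.append((op, name))
                break
        else:
            raise ValueError(f"no provider supports {op!r}")
    return placement


# Hypothetical support sets; real providers publish their own operator coverage.
providers = [
    ("npu", {"matmul", "relu"}),
    ("gpu", {"matmul", "relu", "softmax", "layernorm"}),
    ("cpu", {"matmul", "relu", "softmax", "layernorm", "topk"}),
]
placement = assign_providers(
    ["matmul", "relu", "layernorm", "softmax", "topk"], providers
)
for op, provider in placement:
    print(f"{op:10s} -> {provider}")
```

Every change of provider along the graph implies a hand-off of tensors between devices, which is one reason real runtimes try to keep contiguous subgraphs on a single provider.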

Most of the work is tensor arithmetic

Neural networks repeatedly multiply matrices, reduce values, normalize, apply elementwise functions, gather indexed rows, and move or reshape data. A “kernel” is an implementation of such an operation for a target processor. Kernel quality matters: tile sizes, data layout, vectorization, parallel occupancy, fusion, and memory access can decide whether expensive hardware spends its time calculating or waiting.

activation matrix × weight matrix + bias → next tensor
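Written out as plain loops, that pattern looks like the sketch below. The function name `linear_relu` and the choice to fuse a ReLU activation into the same loop are illustrative assumptions; a production kernel would add tiling, vectorization, and parallelism on top of this arithmetic.

```python
def linear_relu(x, w, b):
    """Compute relu(x @ w + b) in one pass, without storing x @ w separately.

    x: activation rows (m x k), w: weights (k x n), b: bias (length n).
    """
    m, k, n = len(x), len(w), len(w[0])
    out = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            acc = b[j]                      # start the accumulator at the bias
            for p in range(k):
                acc += x[i][p] * w[p][j]    # multiply-accumulate
            out[i][j] = acc if acc > 0.0 else 0.0  # activation fused in place
    return out


x = [[1.0, 2.0]]
w = [[1.0, -1.0], [0.5, 0.5]]
b = [0.0, -1.0]
print(linear_relu(x, w, b))
```

Because the bias and activation are applied while the accumulator is still in hand, no intermediate matrix has to be written out and read back.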

CPU, GPU, or NPU?

CPU

Few powerful, flexible cores; excellent for control flow, preprocessing, small models, and broad operator support. Multiple cores and SIMD instructions still provide substantial parallelism, and large caches help keep them fed.

GPU

Many parallel lanes plus high memory bandwidth. Strong for large matrix operations and batches, provided enough work is available to keep the device occupied.

NPU / ASIC

Specialized data paths for neural operations, often optimized for lower precision and power efficiency on phones, laptops, edge devices, or data-center accelerators.

The fastest device is workload-dependent. A GPU can lose to a CPU on a tiny request once transfer and launch overhead are counted. An NPU may be efficient but support fewer operations. Real runtimes partition around these constraints.
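A rough cost model shows why the answer depends on request size. Every number below is made up for illustration, and the model ignores memory bandwidth and overlap of transfer with compute; it only captures a fixed overhead plus time proportional to the work.

```python
def estimated_seconds(flops, device):
    """Fixed overhead plus compute time; ignores memory bandwidth and overlap."""
    return device["overhead_s"] + flops / device["flops_per_s"]


# Illustrative, made-up numbers; measure real hardware instead of trusting these.
cpu = {"overhead_s": 5e-6, "flops_per_s": 1e11}
gpu = {"overhead_s": 1e-4, "flops_per_s": 1e13}  # overhead covers transfer + launch


def faster_device(flops):
    cpu_s = estimated_seconds(flops, cpu)
    gpu_s = estimated_seconds(flops, gpu)
    return "cpu" if cpu_s <= gpu_s else "gpu"


for flops in (1e6, 1e10):
    print(f"{flops:.0e} FLOPs -> {faster_device(flops)}")
```

Under these assumed figures the small request finishes on the CPU before the GPU has paid its fixed overhead, while the large one is dominated by arithmetic and favors the GPU.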

Memory movement is part of the computation

A processor cannot multiply parameters it cannot reach. Weights may travel from storage into system memory, then accelerator memory, cache, and registers. Intermediate activations and runtime caches, such as a language model's KV cache, compete for the same capacity. For large models, memory bandwidth (the rate at which values arrive) can matter as much as peak arithmetic throughput.

Storage: model file · large · persistent

System RAM: loading · staging · CPU work

VRAM / shared memory: weights · activations · KV cache

Cache + registers: small · fast · feeding arithmetic

Conceptual hierarchy only. Integrated systems may share physical memory, while discrete accelerators require explicit transfers across an interconnect.
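A back-of-the-envelope estimate makes the bandwidth point concrete. If one pass over the model must read every weight once, the read time alone sets a floor on how fast that pass can be. The parameter count and bandwidth below are hypothetical round numbers, not figures for any specific model or device.

```python
def min_seconds_to_stream_weights(num_params, bytes_per_param, bandwidth_bytes_per_s):
    """Lower bound on one pass that must read every weight once from memory."""
    return num_params * bytes_per_param / bandwidth_bytes_per_s


# Hypothetical model and device: 7e9 parameters, FP16 weights, 1e12 bytes/s.
seconds = min_seconds_to_stream_weights(7e9, 2, 1e12)
print(f"{seconds * 1e3:.1f} ms per pass at minimum")

# Halving the bytes per parameter halves this floor, whatever the peak FLOP rate.
seconds_int8 = min_seconds_to_stream_weights(7e9, 1, 1e12)
print(f"{seconds_int8 * 1e3:.1f} ms per pass at minimum with 1-byte weights")
```

This bound says nothing about arithmetic; it only shows that a device can sit idle waiting for values if the weights cannot arrive fast enough.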

Why fewer bits can run faster

Quantization represents values using fewer bits—perhaps FP16, BF16, FP8, INT8, or 4-bit schemes instead of FP32. Smaller values reduce model storage and memory traffic and may unlock faster specialized arithmetic. The tradeoff is approximation error, scale metadata, calibration work, and the possibility that some layers remain sensitive and need higher precision.

Representation	Approx. weight bytes	Typical concern
FP32	4 per parameter	Large memory and bandwidth cost
FP16 / BF16	2 per parameter	Range and precision differ between formats
INT8 / FP8	1 per parameter	Calibration, scales, hardware support
4-bit families	~0.5 per parameter + metadata	Greater approximation and packing complexity
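The simplest scheme to sketch is symmetric per-tensor INT8 quantization: one scale for the whole tensor, integers in the range -127 to 127. The function names and sample weights below are illustrative; practical schemes often use per-channel or per-group scales and calibration data.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization: one scale, integers in [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale


def dequantize(q, scale):
    """Map the stored integers back to approximate real values."""
    return [i * scale for i in q]


weights = [0.02, -0.5, 0.31, 1.27]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, scale, max_error)
```

Each value is stored in one byte instead of four, plus a single scale for the tensor. The rounding error per value is at most half the scale, which is the approximation cost the tradeoff refers to.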

Latency and throughput pull differently

Latency asks how long one request waits. Throughput asks how much total work finishes per unit time. Batching combines requests so parallel hardware is better utilized, improving throughput but potentially making an individual request wait. Production servers use queues, dynamic batches, caches, replicas, and model parallelism to balance these goals.

Cold start: Load + compile + allocate

First-token latency: Process the prompt

Token throughput: Generate the continuation

Batching: Share hardware efficiently
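The tension between the two goals shows up even in a toy model. The sketch below assumes a batch costs a fixed amount plus a small amount per item, and that the first request in a batch waits for the others to arrive. All timings are invented, and throughput is computed as if execution were the only bottleneck.

```python
def batch_stats(batch_size, fixed_s=0.010, per_item_s=0.001, arrival_gap_s=0.002):
    """Toy model: one batch costs fixed_s + per_item_s * batch_size to execute.

    The first request waits for the rest of the batch to arrive before
    execution starts. All timings are made-up illustrative values.
    """
    wait_s = (batch_size - 1) * arrival_gap_s   # worst-case queueing delay
    exec_s = fixed_s + per_item_s * batch_size
    return {
        "worst_latency_s": wait_s + exec_s,
        "throughput_per_s": batch_size / exec_s,
    }


for size in (1, 8, 32):
    stats = batch_stats(size)
    print(size, round(stats["worst_latency_s"], 3), round(stats["throughput_per_s"], 1))
```

Larger batches amortize the fixed cost, so throughput rises, while the earliest request in each batch waits longer. Dynamic batching policies typically cap both the batch size and the waiting time to bound that delay.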

One inference request, end to end

01 · Prepare: Encode input · tokens or tensors

02 · Schedule: Batch + place · choose devices

03 · Execute: Run kernels · move + multiply

04 · Decode: Interpret output · sample or postprocess

05 · Serve: Stream result · measure + repeat

For generated text, stages three and four repeat token by token while caches and scheduler state evolve.
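The repeating execute and decode stages can be sketched as a small generation loop. Everything here is a stand-in: the four-word vocabulary, the whitespace tokenizer, and the `execute` function that pretends to be a model by favoring the next vocabulary entry. Scheduling (stage 02) is omitted because there is only one request and one device.

```python
VOCAB = ["<eos>", "the", "model", "runs"]


def prepare(text):
    """Stage 01: encode input text as token ids (toy whitespace tokenizer)."""
    return [VOCAB.index(word) for word in text.split()]


def execute(token_ids):
    """Stage 03: stand-in for running kernels; one score per vocabulary entry.

    This toy 'model' favors the entry after the last token, wrapping to <eos>.
    """
    favored = (token_ids[-1] + 1) % len(VOCAB)
    return [1.0 if i == favored else 0.0 for i in range(len(VOCAB))]


def decode(scores):
    """Stage 04: greedy choice of the highest-scoring token id."""
    return max(range(len(scores)), key=scores.__getitem__)


def generate(text, max_new_tokens=8):
    tokens = prepare(text)
    for _ in range(max_new_tokens):      # stages 03 and 04 repeat per token
        next_id = decode(execute(tokens))
        if VOCAB[next_id] == "<eos>":
            break
        tokens.append(next_id)
        yield VOCAB[next_id]             # stage 05: stream each token when ready


print(list(generate("the")))
```

A real server would keep a KV cache between iterations instead of passing the full token list to the model each time, and would sample rather than always taking the top score.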

References & next reads

ONNX Runtime: Execution Providers

How a runtime assigns graphs across CPU, GPU, NPU, and specialized libraries.

NVIDIA: TensorRT

Graph optimization, kernel tuning, fusion, and lower-precision inference.

Start of the series: How a neural network learns

Return from deployment to gradients and training.