One 1920 × 1080 frame in 24-bit RGB is about 6.22 MB. At 30 frames per second, uncompressed picture data alone approaches 187 MB every second—or roughly 11.2 GB per minute. Add audio, higher frame rates, greater bit depth, or 4K resolution and the torrent grows quickly.
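The arithmetic behind those figures can be checked in a few lines. This is a minimal sketch that assumes decimal units (1 MB = 10^6 bytes) and counts picture data only, with no audio or container overhead; the function name is illustrative.

```python
# Raw (uncompressed) video size, using the figures from the paragraph above.
WIDTH, HEIGHT = 1920, 1080
BYTES_PER_PIXEL = 3   # 24-bit RGB: 8 bits each for R, G, and B
FPS = 30


def raw_frame_bytes(width: int, height: int, bytes_per_pixel: int) -> int:
    """Bytes needed to store one uncompressed frame."""
    return width * height * bytes_per_pixel


frame = raw_frame_bytes(WIDTH, HEIGHT, BYTES_PER_PIXEL)
per_second = frame * FPS
per_minute = per_second * 60

print(f"frame:      {frame / 1e6:.2f} MB")
print(f"per second: {per_second / 1e6:.1f} MB")
print(f"per minute: {per_minute / 1e9:.1f} GB")
```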
Watch prediction save a frame
A generated object moves across a static scene. Compare the current source with the previous decoded frame used as a crude prediction, then watch a bitrate-controlled block approximation become the next decoded reference.
Yellow marks motion; the middle panel shows what the previous frame cannot predict.
This is an explanatory block-and-residual model, not an H.264 or AV1 encoder. Real codecs search motion vectors, choose variable block shapes and transforms, quantize residual coefficients, and entropy-code syntax.
MP4 is the box, not the compressor
An MP4 file is a container. It can hold encoded video, encoded audio, timing information, captions, metadata, and the indexes a player needs to seek. H.264/AVC, H.265/HEVC, and AV1 are codecs: agreed methods for encoding and decoding a stream. So “MP4 versus H.264” is like “envelope versus language”—they solve different layers of the problem.
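The "box" idea is quite literal: an MP4 (ISO BMFF) file is a sequence of boxes, each starting with a 32-bit big-endian size and a four-character type. The sketch below walks the top-level boxes of a byte string. The sample data is synthetic and built by the hypothetical helper `make_box`; a real file would be read from disk, and a real parser would also descend into nested boxes.

```python
import struct
from typing import Iterator, Tuple


def iter_boxes(data: bytes) -> Iterator[Tuple[str, int]]:
    """Yield (type, total_size) for each top-level box in an ISO BMFF byte string."""
    offset = 0
    while offset + 8 <= len(data):
        size, box_type = struct.unpack_from(">I4s", data, offset)
        header = 8
        if size == 1:
            # A size of 1 means a 64-bit size follows the type field.
            if offset + 16 > len(data):
                break
            (size,) = struct.unpack_from(">Q", data, offset + 8)
            header = 16
        elif size == 0:
            # A size of 0 means the box runs to the end of the data.
            size = len(data) - offset
        if size < header:
            break  # malformed box; stop rather than loop forever
        yield box_type.decode("ascii", errors="replace"), size
        offset += size


def make_box(box_type: bytes, payload: bytes) -> bytes:
    """Hypothetical helper: build a box with a 32-bit size for the demo."""
    return struct.pack(">I4s", 8 + len(payload), box_type) + payload


# Synthetic example: a file-type box, an empty movie box, and 32 bytes of media data.
sample = (
    make_box(b"ftyp", b"isom" + b"\x00\x00\x02\x00" + b"isomavc1")
    + make_box(b"moov", b"")
    + make_box(b"mdat", b"\x00" * 32)
)
boxes = list(iter_boxes(sample))
print(boxes)
```

Nothing in this walk needs to know which codec produced the bytes inside `mdat`, which is the sense in which the container and the codec sit at different layers.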
First compress each picture
Video codecs reuse many ideas from still-image compression: separate brightness from color, sample color more sparsely, split the picture into blocks, transform spatial detail into frequency-like coefficients, quantize those values, and entropy-code the result. Modern block sizes and transforms are more flexible than classic JPEG, but the family resemblance is strong.
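The transform-and-quantize step can be illustrated on a single row of eight samples. This is a minimal sketch using an orthonormal one-dimensional DCT-II and one uniform step size; real codecs use two-dimensional integer transforms, per-frequency quantization, and entropy coding, none of which is modeled here. The sample values and the step size are arbitrary.

```python
import math


def _scale(k: int, n: int) -> float:
    """Normalization that makes the DCT orthonormal."""
    return math.sqrt(1 / n) if k == 0 else math.sqrt(2 / n)


def dct_1d(samples):
    """Orthonormal DCT-II: spatial samples -> frequency-like coefficients."""
    n = len(samples)
    return [
        _scale(k, n)
        * sum(x * math.cos(math.pi * (i + 0.5) * k / n) for i, x in enumerate(samples))
        for k in range(n)
    ]


def idct_1d(coeffs):
    """Inverse of dct_1d (an orthonormal DCT-III)."""
    n = len(coeffs)
    return [
        sum(_scale(k, n) * c * math.cos(math.pi * (i + 0.5) * k / n)
            for k, c in enumerate(coeffs))
        for i in range(n)
    ]


def quantize(coeffs, step):
    """Lossy step: keep only the nearest multiple of `step`."""
    return [round(c / step) for c in coeffs]


def dequantize(levels, step):
    return [level * step for level in levels]


row = [52, 55, 61, 66, 70, 61, 64, 73]   # arbitrary luma samples
coeffs = dct_1d(row)
levels = quantize(coeffs, step=10)        # many small coefficients become 0
restored = idct_1d(dequantize(levels, step=10))

print("levels:  ", levels)
print("restored:", [round(v) for v in restored])
```

A larger step produces more zero levels, which are cheap to entropy-code, at the price of a larger reconstruction error.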
Pixel format is not the same as codec
A decoded frame still needs an arrangement of numerical samples. RGB is intuitive for displays, while Y′CbCr families are common in compressed video. A label such as 4:2:0 describes chroma sampling; 8-bit, 10-bit, or 12-bit describes precision; limited versus full range and color metadata describe how code values map to light and color.
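To see how much the pixel format alone changes the size of a decoded frame, the sketch below counts samples for three common chroma layouts. It assumes planar storage and that samples deeper than 8 bits occupy 16-bit words, which is a common in-memory convention but not the only one; the function and table names are illustrative.

```python
# Horizontal and vertical chroma divisors for common sampling labels.
CHROMA_DIVISORS = {
    "4:4:4": (1, 1),   # chroma at full resolution
    "4:2:2": (2, 1),   # chroma halved horizontally
    "4:2:0": (2, 2),   # chroma halved horizontally and vertically
}


def planar_frame_bytes(width: int, height: int, chroma: str, bit_depth: int) -> int:
    """Bytes for one decoded Y'CbCr frame stored as three planes."""
    h_div, v_div = CHROMA_DIVISORS[chroma]
    chroma_w = -(-width // h_div)    # ceiling division
    chroma_h = -(-height // v_div)
    samples = width * height + 2 * chroma_w * chroma_h
    # Assumption: 8-bit samples use one byte, deeper samples use two.
    bytes_per_sample = 1 if bit_depth <= 8 else 2
    return samples * bytes_per_sample


for label in CHROMA_DIVISORS:
    size = planar_frame_bytes(1920, 1080, label, bit_depth=8)
    print(f"{label} 8-bit:  {size / 1e6:.2f} MB")
print(f"4:2:0 10-bit: {planar_frame_bytes(1920, 1080, '4:2:0', 10) / 1e6:.2f} MB")
```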
Then borrow from time
The decisive trick is temporal prediction. If a ball moves across an otherwise unchanged room, there is little reason to encode the entire room thirty times each second. The encoder divides frames into regions, searches nearby frames for similar regions, records motion vectors, and then encodes the smaller residual—the difference between its prediction and what actually appeared.
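A toy version of that search is an exhaustive block match: for one block of the current frame, try every offset in a small window of the reference frame and keep the one with the lowest sum of absolute differences (SAD). The sketch below is an illustration under simplifying assumptions (grayscale frames as nested lists, one fixed 4 × 4 block, whole-pixel offsets); production encoders use much faster and more elaborate searches. All names are illustrative.

```python
def make_frame(width, height, ball_x, ball_y, ball_size=4, value=200):
    """Black frame with a bright square 'ball' at (ball_x, ball_y)."""
    frame = [[0] * width for _ in range(height)]
    for y in range(ball_y, ball_y + ball_size):
        for x in range(ball_x, ball_x + ball_size):
            frame[y][x] = value
    return frame


def sad(cur, ref, cx, cy, rx, ry, size):
    """Sum of absolute differences between a current block and a reference block."""
    return sum(
        abs(cur[cy + j][cx + i] - ref[ry + j][rx + i])
        for j in range(size)
        for i in range(size)
    )


def search_block(cur, ref, cx, cy, size=4, search=4):
    """Return (cost, dx, dy): the best offset from the current block into ref."""
    height, width = len(ref), len(ref[0])
    best = None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            rx, ry = cx + dx, cy + dy
            if rx < 0 or ry < 0 or rx + size > width or ry + size > height:
                continue  # candidate falls outside the reference frame
            cost = sad(cur, ref, cx, cy, rx, ry, size)
            if best is None or cost < best[0]:
                best = (cost, dx, dy)
    return best


reference = make_frame(16, 16, ball_x=5, ball_y=3)   # previous decoded frame
current = make_frame(16, 16, ball_x=8, ball_y=4)     # ball moved right 3, down 1

cost, dx, dy = search_block(current, reference, cx=8, cy=4)
zero_motion_cost = sad(current, reference, 8, 4, 8, 4, 4)
print("motion vector:", (dx, dy), "residual cost:", cost)
print("cost with no motion compensation:", zero_motion_cost)
```

Here the motion vector points from the block in the current frame to its best match in the reference frame. With the right vector the residual for this block is zero, whereas predicting from the same position in the previous frame leaves a large difference to encode.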
I-frames, often called keyframes, are coded without depending on other pictures. P-frames can predict from earlier reference pictures. B-frames can use references on both sides in display time. A group of pictures combines these roles, balancing compression, random access, error recovery, and decoding complexity.
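Because a B-frame may depend on a picture that is displayed after it, frames are typically stored and decoded in a different order than they are shown. The sketch below reorders a display-order pattern under a deliberately simple assumption: each B-frame needs the nearest I- or P-frame on either side, so every I- or P-frame is decoded before the B-frames that precede it in display order. Real codecs allow more flexible reference structures.

```python
def decode_order(display_types: str):
    """Map a display-order string such as 'IBBPBBP' to decode-order frame indexes."""
    order, pending_b = [], []
    for index, frame_type in enumerate(display_types):
        if frame_type == "B":
            pending_b.append(index)      # must wait for its future reference
        else:
            order.append(index)          # I or P: decode it first
            order.extend(pending_b)      # then the B-frames that needed it
            pending_b.clear()
    order.extend(pending_b)              # trailing B-frames, if any
    return order


pattern = "IBBPBBP"
order = decode_order(pattern)
print([f"{pattern[i]}{i}" for i in order])
```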
Quality, bitrate, and work
An encoder can spend more bits to preserve detail, or spend more computation searching for better predictions. A low bitrate forces harder choices: smooth gradients may band, texture may smear, blocks may become visible, and fast motion may break into mosquito-like noise. A more efficient codec may reach similar perceived quality with fewer bits, but often asks more of the encoder, decoder, hardware, or licensing environment.
Rate control decides where bits go
Constant bitrate targets predictable delivery but may waste bits on easy scenes and starve difficult ones. Variable bitrate spends more on motion, texture, grain, or scene changes and less on static material. Quality-targeted modes let size vary to hold a more consistent quality. Two-pass encoding can inspect the whole program before allocating its final budget.
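The difference between the first two strategies can be shown with a toy allocator. The complexity scores below are hypothetical stand-ins for what a first pass might measure per scene; actual rate control also tracks buffer levels and quantizer behavior, which this sketch ignores.

```python
def allocate_constant(complexities, total_bits):
    """Constant-bitrate flavor: every scene gets the same share."""
    share = total_bits // len(complexities)
    return [share] * len(complexities)


def allocate_by_complexity(complexities, total_bits):
    """Two-pass VBR flavor: split the budget in proportion to measured complexity."""
    total = sum(complexities)
    return [total_bits * c // total for c in complexities]


# Hypothetical first-pass scores: static title, talking head, fast action, slow pan.
scene_complexity = [1, 1, 6, 2]
budget = 1000  # arbitrary units

cbr = allocate_constant(scene_complexity, budget)
vbr = allocate_by_complexity(scene_complexity, budget)
print("constant:     ", cbr)
print("by complexity:", vbr)
```

Both plans spend the same total, but the proportional plan moves bits from the easy scenes to the action scene, which is where a fixed share would most likely show artifacts.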
Codec generations trade simplicity for efficiency
H.264/AVC: Mature motion-compensated block coding with extremely broad hardware and software support. A durable delivery baseline.
H.265/HEVC: Larger and more flexible coding units, stronger prediction, and improved efficiency, paired with more complexity and licensing considerations.
VP9: An open web-oriented codec widely associated with WebM and streaming. Supports modern resolution and color capabilities.
AV1: A royalty-free AOMedia codec with a large toolset for high compression efficiency, at the cost of heavier encoding and newer hardware requirements.
Editing (intermediate) codecs: Editing-oriented codec families favor independent or lightly dependent frames, responsive seeking, and repeated post-production work over tiny delivery files.
Lossless codecs: Lossless coding preserves decoded samples exactly. Useful for preservation and specialist workflows, but far larger than ordinary web delivery.
The GOP controls dependence
A long group of pictures can improve compression because more frames are predicted from other frames rather than coded independently, but it increases seek distance and vulnerability to missing dependencies. Short GOPs cost more bits but are friendlier to editing, low-latency contribution, and random access. All-intra codecs make every frame independently decodable; inter-frame delivery codecs deliberately build a web of temporal dependence.
| Priority | Likely design | Tradeoff |
|---|---|---|
| Small streaming file | Longer GOP, inter prediction, modern codec | More encode/decode work and dependencies |
| Fast editing and scrubbing | All-intra or short GOP, high bitrate | Larger files |
| Live low latency | Short buffers, restricted reordering | Lower compression efficiency |
| Preservation | Lossless or lightly compressed mezzanine | Storage and bandwidth |
Why seeking and streaming need structure
A player cannot always begin from an arbitrary predicted frame because its references may be missing. Containers provide timestamps and indexes; encoders place random-access points; streaming systems divide media into segments. Together, those structures let a player seek, switch quality levels, buffer ahead, and keep audio aligned with video.
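A seek can be sketched as a lookup in a sorted list of random-access points. The example assumes frame indexes in display order, closed GOPs, and that every frame from the keyframe up to the target must be decoded, which is a simple upper bound rather than what every decoder does. The keyframe positions are hypothetical.

```python
from bisect import bisect_right


def seek_start(keyframes, target):
    """Latest random-access point at or before the target frame."""
    position = bisect_right(keyframes, target) - 1
    if position < 0:
        raise ValueError("no keyframe at or before the target")
    return keyframes[position]


def frames_to_decode(keyframes, target):
    """Frames decoded before the target can be shown (simple upper bound)."""
    return target - seek_start(keyframes, target) + 1


long_gop = [0, 250, 500, 750]            # hypothetical: a keyframe every 250 frames
short_gop = list(range(0, 1000, 30))     # hypothetical: a keyframe every 30 frames

target = 612
print("long GOP: ", seek_start(long_gop, target), frames_to_decode(long_gop, target))
print("short GOP:", seek_start(short_gop, target), frames_to_decode(short_gop, target))
```

The shorter GOP reaches the target after far fewer decoded frames, which is the seeking side of the compression tradeoff.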
Adaptive streaming stores several answers
A streaming service often encodes a ladder of resolutions and bitrates. A manifest describes short aligned segments for each level. The player estimates network and buffer conditions, then switches at a segment boundary. The “video” you watch may therefore be assembled from multiple encoded representations during one session.
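A very simple player heuristic picks, at each segment boundary, the highest rung whose bitrate fits under a fraction of the measured throughput. The ladder values and the 0.8 safety factor below are hypothetical, and the sketch ignores buffer level, which real players usually weigh as well.

```python
# Hypothetical ladder: (height in pixels, bitrate in bits per second).
LADDER = [
    (240, 400_000),
    (480, 1_200_000),
    (720, 2_800_000),
    (1080, 5_000_000),
]


def pick_rung(ladder, throughput_bps, safety=0.8):
    """Highest-bitrate rung within the safety-scaled throughput, else the lowest rung."""
    budget = throughput_bps * safety
    affordable = [rung for rung in ladder if rung[1] <= budget]
    if affordable:
        return max(affordable, key=lambda rung: rung[1])
    return min(ladder, key=lambda rung: rung[1])


# Throughput estimates for four consecutive segments of one session.
for estimate in (10_000_000, 4_000_000, 300_000, 2_000_000):
    height, bitrate = pick_rung(LADDER, estimate)
    print(f"{estimate / 1e6:>5.1f} Mbit/s -> {height}p at {bitrate / 1e6:.1f} Mbit/s")
```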