How video formats actually work

One 1920 × 1080 frame in 24-bit RGB is about 6.22 MB. At 30 frames per second, uncompressed picture data alone approaches 187 MB every second—or roughly 11.2 GB per minute. Add audio, higher frame rates, greater bit depth, or 4K resolution and the torrent grows quickly.

1080p RGB frame: 6.22 MB

30 fps for one minute: ~11.2 GB
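The figures above can be reproduced with a few lines of arithmetic. This is a minimal sketch that assumes decimal units (1 MB = 10^6 bytes) and picture data only; the function name `raw_frame_bytes` is an illustrative choice, not part of any library.

```python
# Raw (uncompressed) picture size, using decimal units (1 MB = 10**6 bytes).
def raw_frame_bytes(width: int, height: int, bits_per_pixel: int) -> int:
    """Bytes needed for one uncompressed frame."""
    return width * height * bits_per_pixel // 8

frame = raw_frame_bytes(1920, 1080, 24)   # one 24-bit RGB frame
per_second = frame * 30                   # 30 frames per second
per_minute = per_second * 60

print(f"frame:  {frame / 1e6:.2f} MB")       # frame:  6.22 MB
print(f"second: {per_second / 1e6:.1f} MB")  # second: 186.6 MB
print(f"minute: {per_minute / 1e9:.1f} GB")  # minute: 11.2 GB
```

Changing the arguments to 3840 × 2160 or to 30 bits per pixel shows how quickly the raw figure grows.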

Watch prediction save a frame

A generated object moves across a static scene. Compare the current source with the previous decoded frame used as a crude prediction, then watch a bitrate-controlled block approximation become the next decoded reference.

[Interactive canvas simulation. Controls: bit budget, object motion, keyframe interval. Readouts: current frame and type, block approximation size, changed samples, estimated payload. Yellow marks motion; the middle panel shows what the previous frame cannot predict.]

This is an explanatory block-and-residual model, not an H.264 or AV1 encoder. Real codecs search motion vectors, choose variable block shapes and transforms, quantize residual coefficients, and entropy-code syntax.

MP4 is the box, not the compressor

An MP4 file is a container. It can hold encoded video, encoded audio, timing information, captions, metadata, and the indexes a player needs to seek. H.264/AVC, H.265/HEVC, and AV1 are codecs: agreed methods for encoding and decoding a stream. So “MP4 versus H.264” is like “envelope versus language”—they solve different layers of the problem.

Tracks shown: Video · H.264; Audio · AAC; Captions; Timing + index

A conceptual MP4 container. The exact tracks and codecs vary; the filename alone does not reveal every ingredient.
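To make the "box" idea concrete, here is a minimal sketch of walking the top-level boxes of an ISO base media file. Each box starts with a 4-byte big-endian size and a 4-byte type; a size of 1 means a 64-bit size follows, and a size of 0 means the box runs to the end of the data. The helper `make_box` and the sample bytes are made up for illustration; a real file would be read from disk, and the codec itself is named deeper inside the `moov` box rather than at this level.

```python
import struct
from typing import Iterator, Tuple

def iter_boxes(data: bytes) -> Iterator[Tuple[str, int, int]]:
    """Yield (type, offset, size) for each top-level box in `data`."""
    offset = 0
    while offset + 8 <= len(data):
        size, box_type = struct.unpack_from(">I4s", data, offset)
        header = 8
        if size == 1:                     # 64-bit size follows the type
            if offset + 16 > len(data):
                break
            (size,) = struct.unpack_from(">Q", data, offset + 8)
            header = 16
        elif size == 0:                   # box extends to the end of the data
            size = len(data) - offset
        if size < header:                 # malformed; stop instead of looping
            break
        yield box_type.decode("latin-1"), offset, size
        offset += size

def make_box(box_type: bytes, payload: bytes) -> bytes:
    """Hypothetical helper: build a box with a 32-bit size field."""
    return struct.pack(">I4s", 8 + len(payload), box_type) + payload

# Synthetic stand-in for a file: file type, movie metadata, media data.
sample = (make_box(b"ftyp", b"isom\x00\x00\x02\x00isomiso2")
          + make_box(b"moov", b"")
          + make_box(b"mdat", b"\x00" * 16))

boxes = list(iter_boxes(sample))
print(boxes)  # [('ftyp', 0, 24), ('moov', 24, 8), ('mdat', 32, 24)]
```

Nothing in this loop knows or cares whether the payload is H.264, HEVC, or AV1, which is the point of the container layer.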

Containers and codecs form combinations

Container	Family	Video codecs	Audio and other tracks
MP4	ISO base media family	H.264 / HEVC / AV1	AAC / Opus + text
WebM	Restricted Matroska family	VP8 / VP9 / AV1	Vorbis / Opus + text
MOV	QuickTime container	H.264 / HEVC / ProRes	PCM / AAC + metadata
Matroska (MKV)	Flexible open container	Many video codecs	Many audio + subtitle tracks

First compress each picture

Video codecs reuse many ideas from still-image compression: separate brightness from color, sample color more sparsely, split the picture into blocks, transform spatial detail into frequency-like coefficients, quantize those values, and entropy-code the result. Modern block sizes and transforms are more flexible than classic JPEG, but the family resemblance is strong.
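The transform-and-quantize step can be sketched in one dimension. The example below applies an orthonormal DCT-II to a row of eight samples, divides the coefficients by a step size, rounds them to integers, and then reverses the process. The sample values and the step size are hypothetical, and real codecs use two-dimensional integer transforms with per-frequency quantization rather than this floating-point toy.

```python
import math

N = 8  # samples per block row in this sketch

def _scale(k: int) -> float:
    """Orthonormal scaling factor for coefficient k."""
    return math.sqrt(1 / N) if k == 0 else math.sqrt(2 / N)

def dct(samples):
    """Orthonormal 1-D DCT-II: spatial samples -> frequency-like coefficients."""
    return [
        _scale(k) * sum(x * math.cos(math.pi * (n + 0.5) * k / N)
                        for n, x in enumerate(samples))
        for k in range(N)
    ]

def idct(coeffs):
    """Inverse of dct(): coefficients -> spatial samples."""
    return [
        sum(_scale(k) * c * math.cos(math.pi * (n + 0.5) * k / N)
            for k, c in enumerate(coeffs))
        for n in range(N)
    ]

row = [52, 55, 61, 66, 70, 61, 64, 73]   # hypothetical luma samples
step = 10                                # hypothetical quantizer step size

# Quantization is the lossy part: small coefficients collapse to zero.
levels = [round(c / step) for c in dct(row)]

# The decoder rescales the integer levels and inverts the transform.
rebuilt = [round(v) for v in idct([q * step for q in levels])]

print("levels: ", levels)
print("rebuilt:", rebuilt)
```

A larger `step` produces more zero levels, which entropy coding can represent cheaply, at the price of a reconstruction that drifts further from the source row.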

Pixel format is not the same as codec

A decoded frame still needs an arrangement of numerical samples. RGB is intuitive for displays, while Y′CbCr families are common in compressed video. A label such as 4:2:0 describes chroma sampling; 8-bit, 10-bit, or 12-bit describes sample precision; limited versus full range, together with color metadata, describes how code values map to light and color.

Figure panels: 4:4:4 · Y; 4:4:4 · Cb/Cr; 4:2:2 · chroma; 4:2:0 · chroma

Conceptual sampling density only. 4:2:2 reduces horizontal chroma detail; 4:2:0 reduces chroma sampling across both dimensions while preserving luma resolution.
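The saving from chroma subsampling is easy to count. The sketch below assumes even frame dimensions, planar Y′CbCr, and unpacked storage (10-bit samples stored in two bytes each); `ycbcr_frame_bytes` and `subsample_420` are illustrative names, and the 2 × 2 average is only one simple choice of downsampling filter.

```python
def ycbcr_frame_bytes(width, height, scheme, bit_depth=8):
    """Bytes for one planar Y'CbCr frame under a given chroma sampling scheme."""
    # (horizontal, vertical) divisors applied to each chroma plane
    divisors = {"4:4:4": (1, 1), "4:2:2": (2, 1), "4:2:0": (2, 2)}
    dx, dy = divisors[scheme]
    luma = width * height                          # luma is never reduced
    chroma = 2 * (width // dx) * (height // dy)    # Cb plane + Cr plane
    bytes_per_sample = (bit_depth + 7) // 8        # assumes unpacked storage
    return (luma + chroma) * bytes_per_sample

def subsample_420(plane):
    """Average each 2x2 block of a chroma plane (rounded to nearest)."""
    return [
        [(plane[y][x] + plane[y][x + 1]
          + plane[y + 1][x] + plane[y + 1][x + 1] + 2) // 4
         for x in range(0, len(plane[0]), 2)]
        for y in range(0, len(plane), 2)
    ]

for scheme in ("4:4:4", "4:2:2", "4:2:0"):
    print(scheme, ycbcr_frame_bytes(1920, 1080, scheme))
# 4:4:4 6220800
# 4:2:2 4147200
# 4:2:0 3110400

print(subsample_420([[10, 20], [30, 40]]))  # [[25]]
```

At 8 bits, 4:2:0 halves the raw sample count relative to 4:4:4 before any transform or prediction has been applied.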

Then borrow from time

The decisive trick is temporal prediction. If a ball moves across an otherwise unchanged room, there is little reason to encode the entire room thirty times each second. The encoder divides frames into regions, searches nearby frames for similar regions, records motion vectors, and then encodes the smaller residual—the difference between its prediction and what actually appeared.

Frame sequence: I (whole reference) → P (motion + residue) → B (between references) → P (motion + residue) → B (between references) → I (new reference)

I-frames can stand alone. Predicted frames describe changes relative to reference pictures. The exact pattern is chosen by the encoder and codec.

I-frames, often called keyframes, are coded without depending on other pictures. P-frames can predict from earlier reference pictures. B-frames can use references on both sides in display time. A group of pictures combines these roles, balancing compression, random access, error recovery, and decoding complexity.

A prediction is not a finished image. The decoder combines reference data, motion instructions, and residual corrections to reconstruct each displayed frame.
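The search, predict, and correct cycle can be sketched on a tiny synthetic frame. This is a toy under stated assumptions: one fixed-size block, an exhaustive search scored by sum of absolute differences, whole-pixel motion, and an unquantized residual. The frame contents and the helper names (`make_frame`, `find_motion`) are invented for the illustration; real encoders use faster searches, sub-pixel motion, and lossy residuals.

```python
def block(frame, x, y, size):
    """Extract a size x size block whose top-left corner is (x, y)."""
    return [row[x:x + size] for row in frame[y:y + size]]

def sad(a, b):
    """Sum of absolute differences between two equally sized blocks."""
    return sum(abs(p - q) for ra, rb in zip(a, b) for p, q in zip(ra, rb))

def find_motion(ref, cur, x, y, size, radius):
    """Exhaustively search ref for the offset (dx, dy) that best predicts cur's block."""
    target = block(cur, x, y, size)
    h, w = len(ref), len(ref[0])
    best = None
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            rx, ry = x + dx, y + dy
            if 0 <= rx <= w - size and 0 <= ry <= h - size:
                cost = sad(block(ref, rx, ry, size), target)
                if best is None or cost < best[0]:
                    best = (cost, dx, dy)
    return best[1], best[2]

def make_frame(obj_x, obj_y, w=8, h=8):
    """Hypothetical scene: flat background with a textured 4x4 object."""
    frame = [[0] * w for _ in range(h)]
    for j in range(4):
        for i in range(4):
            frame[obj_y + j][obj_x + i] = 100 + 10 * j + i
    return frame

ref = make_frame(1, 2)   # previous decoded frame
cur = make_frame(3, 2)   # object has moved two pixels to the right
cur[2][3] += 3           # a small change that motion alone cannot explain

# Encoder side: find the motion vector, then code only the leftover difference.
dx, dy = find_motion(ref, cur, x=3, y=2, size=4, radius=2)
pred = block(ref, 3 + dx, 2 + dy, 4)
actual = block(cur, 3, 2, 4)
residual = [[c - p for c, p in zip(rc, rp)] for rc, rp in zip(actual, pred)]

# Decoder side: reference data + motion vector + residual rebuild the block.
rebuilt = [[p + r for p, r in zip(rp, rr)] for rp, rr in zip(pred, residual)]

print("motion vector:", (dx, dy))   # motion vector: (-2, 0)
print("residual row 0:", residual[0])  # residual row 0: [3, 0, 0, 0]
```

Sixteen samples are described by one vector and a residual with a single nonzero value, which is the saving that temporal prediction is after.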

Quality, bitrate, and work

An encoder can spend more bits to preserve detail, or spend more computation searching for better predictions. A low bitrate forces harder choices: smooth gradients may band, texture may smear, blocks may become visible, and edges, especially moving ones, may shimmer with mosquito noise. A more efficient codec may reach similar perceived quality with fewer bits, but often asks more of the encoder, decoder, hardware, or licensing environment.

More source detail: more data to explain
More motion or noise: harder predictions
More bitrate: fewer forced losses
More encoder effort: smarter bit spending

Rate control decides where bits go

Constant bitrate targets predictable delivery but may waste bits on easy scenes and starve difficult ones. Variable bitrate spends more on motion, texture, grain, or scene changes and less on static material. Quality-targeted modes let size vary to hold a more consistent quality. Two-pass encoding can inspect the whole program before allocating its final budget.
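The difference between constant and variable allocation can be shown with a toy budget. The sketch below assumes each scene has a known difficulty score and splits a fixed total either evenly or in proportion to difficulty. The scores, the budget, and the function names are hypothetical; real rate control also tracks buffer limits and estimates complexity as it goes.

```python
def allocate_cbr(total_bits, complexities):
    """Constant allocation: every scene gets the same share."""
    share = total_bits / len(complexities)
    return [share] * len(complexities)

def allocate_vbr(total_bits, complexities):
    """Variable allocation: shares proportional to scene difficulty."""
    total_complexity = sum(complexities)
    return [total_bits * c / total_complexity for c in complexities]

complexity = [1.0, 1.0, 4.0, 2.0]   # hypothetical per-scene difficulty scores
budget = 8_000_000                  # hypothetical total bits for the program

cbr = allocate_cbr(budget, complexity)
vbr = allocate_vbr(budget, complexity)

# Bits available per unit of difficulty: a rough proxy for consistency.
cbr_density = [b / c for b, c in zip(cbr, complexity)]
vbr_density = [b / c for b, c in zip(vbr, complexity)]

print("CBR bits:", cbr)          # CBR bits: [2000000.0, 2000000.0, 2000000.0, 2000000.0]
print("VBR bits:", vbr)          # VBR bits: [1000000.0, 1000000.0, 4000000.0, 2000000.0]
print("CBR density:", cbr_density)
print("VBR density:", vbr_density)
```

Both plans spend the same total, but the even split leaves the hardest scene with a quarter of the bits per unit of difficulty that the easy scenes receive. Knowing all the scores in advance, as this sketch does, is roughly what a first pass provides in two-pass encoding.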

Codec generations trade simplicity for efficiency

H.264 · AVC

Mature motion-compensated block coding with extremely broad hardware and software support. A durable delivery baseline.

Tags: compatible · MP4 · hardware

H.265 · HEVC

Larger and more flexible coding units, stronger prediction, and improved efficiency—paired with more complexity and licensing considerations.

Tags: HDR · 4K+ · MP4 / MOV

VP9

An open web-oriented codec widely associated with WebM and streaming. Supports modern resolution and color capabilities.

Tags: open · WebM · streaming

AV1

A royalty-free AOMedia codec with a large toolset for high compression efficiency, at the cost of heavier encoding and newer hardware requirements.

Tags: open · MP4 / WebM · efficient

ProRes / DNx

Editing-oriented codec families favor independent or lightly dependent frames, responsive seeking, and repeated post-production work over tiny delivery files.

Tags: intra · editing · high bitrate

FFV1 / lossless

Lossless coding preserves decoded samples exactly. Useful for preservation and specialist workflows, but far larger than ordinary web delivery.

Tags: lossless · archive · specialist

The GOP controls dependence

A long group of pictures can improve compression because more frames share references, but it increases seek distance and vulnerability to missing dependencies. Short GOPs cost more bits and are friendlier to editing, low-latency contribution, and random access. All-intra codecs make every frame independently decodable; inter-frame delivery codecs deliberately build a web of temporal dependence.

Priority	Likely design	Tradeoff
Small streaming file	Longer GOP, inter prediction, modern codec	More encode/decode work and dependencies
Fast editing and scrubbing	All-intra or short GOP, high bitrate	Larger files
Live low latency	Short buffers, restricted reordering	Lower compression efficiency
Preservation	Lossless or lightly compressed mezzanine	Storage and bandwidth

Why seeking and streaming need structure

A player cannot always begin from an arbitrary predicted frame because its references may be missing. Containers provide timestamps and indexes; encoders place random-access points; streaming systems divide media into segments. Together, those structures let a player seek, switch quality levels, buffer ahead, and keep audio aligned with video.
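A seek can be sketched as a lookup in the container's index of random-access points. The example assumes a sorted list of keyframe positions, closed GOPs, and decode order equal to display order, which real streams with B-frames do not guarantee. The index values and the function name `decode_start` are hypothetical.

```python
from bisect import bisect_right

def decode_start(keyframes, target):
    """Return the last keyframe at or before `target` (keyframes sorted ascending)."""
    i = bisect_right(keyframes, target)
    if i == 0:
        raise ValueError("no random-access point at or before target")
    return keyframes[i - 1]

keyframes = [0, 48, 96, 144]   # hypothetical index: a keyframe every 48 frames

for target in (100, 95):
    start = decode_start(keyframes, target)
    # Every frame from the keyframe up to the target must be decoded.
    print(f"seek to {target}: start at {start}, decode {target - start + 1} frames")
# seek to 100: start at 96, decode 5 frames
# seek to 95: start at 48, decode 48 frames
```

The second seek shows the cost of a long GOP: landing just before a keyframe means decoding almost the whole preceding group.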

Adaptive streaming stores several answers

A streaming service often encodes a ladder of resolutions and bitrates. A manifest describes short aligned segments for each level. The player estimates network and buffer conditions, then switches at a segment boundary. The “video” you watch may therefore be assembled from multiple encoded representations during one session.
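The switching decision can be sketched as a simple throughput rule. This example assumes a three-rung ladder and picks the highest rung whose bitrate fits within a safety fraction of the measured throughput. The ladder values, the 0.8 safety factor, and the name `pick_rung` are all hypothetical, and the sketch ignores buffer level, which real players also weigh.

```python
# Hypothetical ladder: (picture height, bitrate in bits per second), lowest first.
LADDER = [
    (360, 800_000),
    (720, 2_500_000),
    (1080, 5_000_000),
]

def pick_rung(throughput_bps, safety=0.8):
    """Highest rung whose bitrate fits within a safety fraction of throughput."""
    usable = throughput_bps * safety
    chosen = LADDER[0]              # always fall back to the lowest rung
    for rung in LADDER:
        if rung[1] <= usable:
            chosen = rung
    return chosen

# One decision per segment boundary, from hypothetical throughput estimates.
measurements = [10_000_000, 4_000_000, 500_000, 7_000_000]
session = [pick_rung(m)[0] for m in measurements]
print(session)  # [1080, 720, 360, 1080]
```

The printed list is the "video" from the viewer's side: four consecutive segments drawn from three different encodes.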
