Last updated: April 5, 2026 · Pricing & Deployment · by Daniel Ashford
What is VRAM?
The GPU memory that determines which models can run on which hardware.
Definition
VRAM is the dedicated memory on a GPU. For LLM inference, the model's weights must fit into VRAM, along with the KV cache and activations. The amount required depends mainly on model size and numerical precision.
How It Works
Rule of thumb: ~2 bytes per parameter at 16-bit precision. A 70B model needs ~140GB of VRAM for weights alone (or ~35GB at 4-bit quantization). Consumer GPUs top out around 24-32GB (RTX 4090, RTX 5090). Data-center GPUs offer 40-80GB (A100, H100). Tensor parallelism splits a model across multiple GPUs when it won't fit on one.
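The rule of thumb above is just parameters times bytes per parameter. A minimal sketch of that arithmetic (weights only, ignoring KV cache and activation overhead):

```python
def vram_gb(params_billion: float, bits: int = 16) -> float:
    """Estimate VRAM needed for model weights, in GB.

    params_billion: model size in billions of parameters.
    bits: numerical precision per parameter (16 = FP16/BF16, 4 = 4-bit quant).
    Covers weights only; KV cache and activations add more on top.
    """
    bytes_per_param = bits / 8
    return params_billion * bytes_per_param

print(vram_gb(70))       # 140.0 -> a 70B model at 16-bit
print(vram_gb(70, 4))    # 35.0  -> the same model at 4-bit
print(vram_gb(405, 4))   # 202.5 -> a 405B model at 4-bit, ~200GB
```

Dividing the result by a GPU's VRAM (e.g. 80GB for an H100) gives a first-pass count of how many GPUs tensor parallelism would need.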
Example
To run Llama 3.1 405B at 4-bit quantization, you need approximately 200GB of VRAM — achievable with 3x H100 GPUs (240GB total).
Related Terms
See How Models Compare
Understanding VRAM is important when choosing the right AI model. See how 12 models compare on our leaderboard.