Last updated: April 5, 2026 · Pricing & Deployment · by Daniel Ashford

What is VRAM?

QUICK ANSWER

The GPU memory that determines which models can run on which hardware.

Definition

VRAM is the dedicated memory on a GPU. For LLM inference, the model must fit into VRAM. The amount required depends on model size and numerical precision.

How It Works

Rule of thumb: ~2 bytes per parameter at 16-bit precision. A 70B model therefore needs ~140GB of VRAM at 16-bit, or ~35GB at 4-bit quantization. Consumer GPUs top out at 24-32GB (RTX 4090, RTX 5090), while data-center GPUs offer 40-80GB (A100, H100). When a model doesn't fit on a single GPU, tensor parallelism splits it across several.
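The rule of thumb above can be sketched as a small calculation. This is a rough estimate for the weights only; it ignores the KV cache, activations, and framework overhead, which add real memory on top. The function name is ours, not from any library.

```python
def estimate_vram_gb(num_params: float, bits_per_param: int = 16) -> float:
    """Rough VRAM (in GB) needed to hold a model's weights.

    bits_per_param: 16 for FP16/BF16, 8 for INT8, 4 for 4-bit quantization.
    """
    bytes_per_param = bits_per_param / 8
    return num_params * bytes_per_param / 1e9

# A 70B model at 16-bit vs. 4-bit, matching the figures above:
print(estimate_vram_gb(70e9, 16))  # -> 140.0
print(estimate_vram_gb(70e9, 4))   # -> 35.0
```

In practice you would pad this estimate by 10-20% or more to leave room for the KV cache, which grows with context length and batch size.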

Example

To run Llama 3.1 405B at 4-bit quantization, you need approximately 200GB of VRAM for the weights alone, achievable with 3x 80GB H100 GPUs (240GB total, leaving headroom for the KV cache).
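The arithmetic behind this example can be checked directly. This sketch assumes the same weights-only estimate as above and divides by per-GPU capacity; the helper name is ours, and real deployments need extra headroom for the KV cache and parallelism overhead.

```python
import math

def gpus_needed(num_params: float, bits_per_param: int, gpu_vram_gb: float) -> int:
    """Minimum GPU count to hold the weights (weights-only, no overhead)."""
    total_gb = num_params * (bits_per_param / 8) / 1e9
    return math.ceil(total_gb / gpu_vram_gb)

# 405B parameters at 4-bit is ~202.5 GB; an H100 has 80 GB:
print(gpus_needed(405e9, 4, 80))  # -> 3
```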

Related Terms

GPU (Graphics Processing Unit)
The specialized hardware that LLMs run on.
Quantization
Compressing an LLM to use less memory by reducing numerical precision.
Self-Hosting
Running an LLM on your own hardware instead of using a cloud API.
Parameters
The numerical weights inside an LLM that encode its learned knowledge.

See How Models Compare

Understanding VRAM is important when choosing the right AI model. See how 12 models compare on our leaderboard.
