vLLM
vLLM is an open source LLM inference and serving engine that runs large language models on your own GPUs with high throughput — exposing an OpenAI-compatible API so you can self-host open models instead of paying per token for a hosted one.
What is vLLM?
vLLM is an open source LLM inference and serving engine for large language models, originally built at UC Berkeley’s Sky Computing Lab. It runs open-weight models like Llama, Qwen, and Mixtral on your own GPUs and serves them through an OpenAI-compatible API, using a memory technique called PagedAttention to push far more concurrent requests through the same hardware.
What is vLLM best for?
Teams serving open models to many concurrent users in production, where throughput and cost per token matter. vLLM shines when you have GPU infrastructure and want the maximum tokens-per-second out of it. It’s a poor fit for casual experiments on a laptop — for that, a one-command tool like Ollama is far simpler.
What can vLLM do?
- Serve 200+ model architectures — decoder-only LLMs, mixture-of-experts, multimodal, and embedding models
- Maximize GPU throughput with PagedAttention, continuous batching, and prefix caching
- Expose an OpenAI-compatible API server (plus an Anthropic Messages API and gRPC)
- Shard large models across GPUs and nodes with tensor, pipeline, and expert parallelism (via Ray)
- Run quantized weights — FP8, INT8/INT4, GPTQ, AWQ, GGUF, and compressed-tensors
- Speed up generation with speculative decoding, CUDA graphs, and FlashAttention kernels
- Run on NVIDIA and AMD GPUs, x86/ARM/PowerPC CPUs, plus TPU, Intel Gaudi, and Apple Silicon plugins
Where does vLLM fall short?
- It’s built for datacenter GPUs. You need real NVIDIA or AMD hardware to see the benefit — roughly 16–24GB of VRAM for an 8B model, and two or more 80GB GPUs to serve a 70B model in FP16.
- Setup and tuning are more involved than one-command runners. Getting batching, memory, and parallelism right takes effort, and CPU and Apple Silicon support lags well behind the GPU path.
- Each instance serves a single model. Hosting several models means running several instances and GPUs — there’s no built-in multi-model router.
Is vLLM free?
Yes — vLLM is fully open source under the permissive Apache-2.0 license, free to use, modify, and run commercially with no paid tier or license fee. Your only cost is the hardware, or cloud GPUs, you run it on. There’s no official managed cloud, though several commercial inference providers run vLLM under the hood.
FAQ
Is vLLM open source? Yes. vLLM is released under the Apache-2.0 license — genuine OSI-approved open source, free for commercial use, and maintained by a community of 2,000+ contributors.
What do I need to run vLLM? A machine with a supported GPU (NVIDIA or AMD) and enough VRAM for your model — roughly 16–24GB for an 8B model, and multiple 80GB GPUs for a 70B one. Install it with pip install vllm. CPU-only runs are supported but much slower.
Is vLLM better than Ollama? For production serving to many users, usually yes — vLLM delivers far higher throughput under concurrent load. For quick local experiments on a single machine, Ollama is easier to set up. They’re built for different jobs.
Does vLLM work with the OpenAI API? Yes. vLLM ships an OpenAI-compatible server, so existing OpenAI client code and SDKs work by pointing them at your vLLM endpoint instead of OpenAI’s.