~/tools/ollama
Ollama
tool

Ollama

Ollama is an open source tool for running large language models locally — one command pulls a model like Llama, Gemma, or Qwen and serves it on your own machine through a REST API, so you can build with open models offline instead of calling a hosted API.

What is Ollama?

Ollama is an open source tool for running large language models locally on your own machine. A single command — ollama run llama3 — pulls a model, loads it onto your CPU or GPU, and serves it through a REST API. Built on the llama.cpp engine, it handles model downloads, quantization, and memory management so you can run open models offline without wiring anything up yourself.

What is Ollama best for?

Developers and tinkerers who want to run open models on a laptop or workstation with minimal setup — for local development, private chat, prototyping AI features, or keeping data off third-party servers. It’s the fastest way to go from “I want to try Qwen” to a running model, and its always-on API makes it easy to plug local inference into apps.

What can Ollama do?

  • Pull and run open models — Llama, Gemma, Qwen, Mistral, DeepSeek, Phi, and hundreds more from its model library
  • Serve models through a REST API with OpenAI-compatible endpoints, so existing OpenAI client code works by pointing at your local server
  • Run on macOS, Windows, and Linux natively, plus an official Docker image
  • Use either CPU or GPU (NVIDIA, AMD, and Apple Silicon) automatically
  • Import and customize models with a Modelfile — set the system prompt, parameters, or bring your own GGUF weights
  • Drive everything from a simple CLI, or call the official Python and JavaScript libraries
  • Run multimodal and embedding models alongside text generation

Where does Ollama fall short?

  • It’s CLI- and API-first, with no built-in chat window — you talk to it from the terminal or pair it with a separate front-end like LobeChat or LibreChat.
  • It’s tuned for single-user local use, not high-throughput production serving. For serving open models to many concurrent users, a dedicated engine like vLLM delivers far more tokens per second.
  • On Apple Silicon it’s less memory-efficient than tools that use native MLX models, so large models can be tighter on RAM than alternatives.
  • Local model quality and speed are capped by your hardware — big models need lots of RAM or VRAM, and run slowly without a capable GPU.

Is Ollama free?

Yes — Ollama is fully open source under the permissive MIT license, free to use, modify, and run commercially, with no license fee. Running models locally costs nothing beyond your own hardware. Ollama also offers an optional paid cloud for running larger models on its servers: a free tier, Pro at $20/mo (3 concurrent cloud models), and Max at $100/mo (10 concurrent cloud models). The local tool itself stays free.

FAQ

Is Ollama open source? Yes. Ollama is released under the MIT license — genuine OSI-approved open source, free for commercial use, with a large community of contributors on GitHub.

Can I run Ollama for free? Yes. The local tool is free and runs models entirely on your own machine at no cost beyond hardware. The hosted Ollama Cloud (Pro and Max plans) is the only paid part, and it’s optional.

What do I need to run Ollama? A macOS, Windows, or Linux machine. It runs on CPU, but a GPU with enough VRAM is much faster — roughly 8GB of RAM/VRAM for small models, and more for larger ones. Install it, then ollama run <model>.

Does Ollama work with the OpenAI API? Yes. Ollama exposes OpenAI-compatible endpoints, so existing OpenAI client libraries and SDKs work by pointing them at your local Ollama server instead of OpenAI’s.

Is Ollama or vLLM better? They’re built for different jobs. Ollama is easiest for local, single-user experiments and development; vLLM is built for high-throughput production serving on datacenter GPUs.