Ollama is the fastest way to run open large language models on your own machine. One install command, ollama pull to download a model, ollama run to chat, and you have a private, offline model with an OpenAI-compatible API on localhost:11434. Under the hood it wraps llama.cpp, and on Apple Silicon it now uses Apple's MLX backend. The latest release is v0.30.8 (June 12, 2026). Getting started takes two minutes; the friction comes later, when the model quietly runs on your CPU instead of your GPU, or falls over with an out-of-memory error, or buckles the moment a second user connects. This guide covers both halves: the exact install and run commands, then the problems people actually hit and the fix for each, plus an honest comparison with llama.cpp, LM Studio, and vLLM.
Key takeaways
- Install is one command. On Linux, curl -fsSL https://ollama.com/install.sh | sh. macOS and Windows have native installers. Then ollama run llama3.2.
- The number-one problem is "it is using my CPU, not my GPU." Run ollama ps to see the GPU/CPU split. If the model does not fit in VRAM, Ollama spills layers to the CPU and it crawls.
- Out-of-memory is almost always context size. Lower num_ctx, pick a smaller quantization, or use a smaller model. Recent versions auto-shrink context to fit VRAM.
- Ollama is single-user by design. It has no PagedAttention or continuous batching, so latency spikes past five to ten concurrent requests. For multi-user serving, use vLLM.
- It exposes an OpenAI-compatible API, so most OpenAI SDK code works by pointing base_url at http://localhost:11434/v1.
- Pick the runtime for the job: Ollama for local development, llama.cpp for raw speed and embedded use, LM Studio for a GUI, vLLM for production serving.
Install Ollama
Ollama runs on Linux, macOS, and Windows, and in Docker. Pick your platform.
# Linux: one-line install (sets up a systemd service and starts the server)
curl -fsSL https://ollama.com/install.sh | sh
# macOS: download the app, or via Homebrew
brew install --cask ollama
# Windows: download and run the installer from ollama.com/download
# Docker (CPU)
docker run -d -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
# Docker with NVIDIA GPU: add --gpus=all
After installing, confirm the version and that the background server is reachable:
ollama --version
curl http://localhost:11434/api/tags # should return your (empty) model list
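The same reachability check can be scripted. Here is a minimal sketch using only the Python standard library; it assumes the default address, and the helper names model_names and installed_models are mine, not part of Ollama.

```python
import json
import urllib.error
import urllib.request

DEFAULT_BASE_URL = "http://localhost:11434"  # Ollama's default address


def model_names(tags_payload: dict) -> list[str]:
    """Extract model names from a decoded /api/tags response body."""
    return [m["name"] for m in tags_payload.get("models", [])]


def installed_models(base_url: str = DEFAULT_BASE_URL, timeout: float = 3.0):
    """Return the installed model names, or None if the server is unreachable."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            return model_names(json.load(resp))
    except (urllib.error.URLError, OSError):
        # Nothing listening, connection refused, or timed out
        return None
```

Calling installed_models() should give an empty list on a fresh install and None when nothing is listening on port 11434.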
Run your first model
Models are pulled from the Ollama library by name and tag. The tag encodes the parameter size and quantization, which is what determines whether it fits on your hardware.
# Download and chat with a model
ollama run llama3.2
# Pull without running, then list what you have
ollama pull qwen3
ollama list
# See what is loaded and whether it is on GPU or CPU
ollama ps
# Show generation stats, including peak memory use
ollama run llama3.2 --verbose
# Remove a model to reclaim disk
ollama rm qwen3
Inside a chat, type /bye to exit. If you are unsure which models to choose for your hardware, see my companion piece on the best local models to run and the broader case for running LLMs locally.
Use the OpenAI-compatible API
This is the feature that makes Ollama genuinely useful in apps: it speaks the OpenAI API. Point any OpenAI client at the local endpoint and most code works unchanged. That also means a local Ollama model can back an agent framework such as CrewAI or Agent Zero for fully private, no-API-cost development.
# Raw HTTP
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
"model": "llama3.2",
"messages": [{"role": "user", "content": "Say hello in one sentence."}]
}'
# Python, using the official OpenAI SDK
from openai import OpenAI
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama") # the SDK requires a key; Ollama ignores it
resp = client.chat.completions.create(
model="llama3.2",
messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(resp.choices[0].message.content)
Customize a model with a Modelfile
A Modelfile lets you bake a system prompt and parameters into a reusable named model, the same idea as a Dockerfile.
# Modelfile
FROM llama3.2
PARAMETER temperature 0.3
PARAMETER num_ctx 8192
SYSTEM "You are a terse assistant. Answer in at most three sentences."
# Build and run it
ollama create terse-llama -f Modelfile
ollama run terse-llama
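If you would rather not create a named model, the same settings can be sent with each request. The sketch below builds a call to Ollama's native /api/chat endpoint, passing the Modelfile's parameters in the options field and the system prompt as a message. It uses only the standard library; build_chat_request and send are hypothetical helper names, and the default address is assumed.

```python
import json
import urllib.request


def build_chat_request(prompt: str,
                       model: str = "llama3.2",
                       base_url: str = "http://localhost:11434") -> urllib.request.Request:
    """Build a POST to Ollama's native /api/chat with per-request overrides."""
    payload = {
        "model": model,
        "stream": False,  # ask for one JSON object instead of a stream of chunks
        "messages": [
            {"role": "system",
             "content": "You are a terse assistant. Answer in at most three sentences."},
            {"role": "user", "content": prompt},
        ],
        # Same values as the Modelfile's PARAMETER lines
        "options": {"temperature": 0.3, "num_ctx": 8192},
    }
    return urllib.request.Request(
        f"{base_url}/api/chat",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request) -> str:
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(request) as resp:
        return json.load(resp)["message"]["content"]
```

A Modelfile is the better fit when several tools should share one configuration; per-request options are handy when you are still experimenting with values.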
1. Ollama is using the CPU instead of the GPU
What you see: generation is painfully slow, your GPU sits near idle, and the fans barely spin up.
What it is: the most common Ollama complaint. If a model plus its context does not fit in available VRAM, Ollama silently offloads the overflow layers to the CPU, and those layers drag the whole generation down to CPU speed.
The fix: run ollama ps, which shows the GPU/CPU split for each loaded model. If it says anything other than 100% GPU, the model is too big for your VRAM. Pick a smaller parameter size or a smaller quantization, lower num_ctx, or set num_gpu to control how many layers go to the GPU. Also confirm your GPU drivers and, on NVIDIA, CUDA are installed, and remember that browsers and the desktop compositor are already eating some VRAM.
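To check the split from a script rather than by eye, the server's /api/ps endpoint reports a size and a size_vram for each loaded model. The sketch below treats their ratio as the GPU share, which is my reading of those fields rather than anything the source states; the function names are hypothetical.

```python
import json
import urllib.request


def gpu_percent(model_entry: dict) -> float:
    """Percentage of a loaded model's memory that resides in VRAM."""
    size = model_entry["size"]
    if size == 0:
        return 0.0
    return 100.0 * model_entry.get("size_vram", 0) / size


def spilled_models(ps_payload: dict) -> list[str]:
    """Names of loaded models that are not fully on the GPU."""
    return [m["name"] for m in ps_payload.get("models", [])
            if gpu_percent(m) < 100.0]


def fetch_ps(base_url: str = "http://localhost:11434") -> dict:
    """Fetch the list of currently loaded models from a running server."""
    with urllib.request.urlopen(f"{base_url}/api/ps", timeout=3.0) as resp:
        return json.load(resp)
```

With a server running, spilled_models(fetch_ps()) lists any model that has partly fallen back to the CPU; an empty list means everything loaded is fully on the GPU.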
2. Out-of-memory and CUDA OOM errors
What you see: CUDA error: out of memory, or the model loads and then crashes mid-generation, often once you raise the context window.
What it is: the context window is usually the culprit. The KV cache grows linearly with context length, so every doubling of num_ctx doubles the memory it needs on top of the model weights. On cards with limited VRAM, OOM shows up consistently once num_ctx is pushed to 4096 or higher.
The fix: lower num_ctx to what you actually need, choose a smaller quantization (a Q4 quant uses far less memory than Q8 or full precision), or step down a model size. Run ollama run <model> --verbose to see peak memory and how close you are to the VRAM ceiling. Recent Ollama versions will automatically shrink the context to fit available VRAM rather than crashing, but setting a sane num_ctx yourself is still the reliable fix.
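A rough calculation shows why context is so expensive. The sketch below uses the standard estimate of one key vector and one value vector per layer per token; the layer count, KV head count, and head dimension are illustrative values for an 8B-class model with grouped-query attention, not figures from any specific model card, and the cache is assumed to be stored in 16-bit precision.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   num_ctx: int, bytes_per_value: int = 2) -> int:
    """Approximate KV cache size: one key and one value vector per layer per token."""
    return 2 * n_layers * n_kv_heads * head_dim * num_ctx * bytes_per_value


# Illustrative shape for an 8B-class model (assumed values)
LAYERS, KV_HEADS, HEAD_DIM = 32, 8, 128

for ctx in (2048, 8192, 32768):
    gib = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, ctx) / 2**30
    print(f"num_ctx={ctx:>6}: {gib:.2f} GiB")
```

With these assumed dimensions the cache takes 0.25 GiB at 2048 tokens, 1 GiB at 8192, and 4 GiB at 32768, all in addition to the weights.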
3. It falls apart under concurrent requests
What you see: Ollama is snappy for one user, but as soon as a handful of requests arrive at once, latency spikes and requests queue up behind each other.
What it is: this is architectural, not a bug. Ollama does not implement PagedAttention or continuous batching, so it is not built to serve many users in parallel. Benchmarks show P95 time-to-first-token climbing from a few seconds to over a minute once concurrency passes roughly ten users.
The fix: for light parallelism, set OLLAMA_NUM_PARALLEL to control how many requests each loaded model handles at once, and OLLAMA_MAX_LOADED_MODELS to control how many models stay in memory together. But if you are genuinely serving multiple users in production, switch the serving layer to vLLM, which uses PagedAttention and continuous batching to deliver roughly 16 to 20 times Ollama's concurrent throughput. Keep Ollama for development and single-user workloads.
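Before changing your stack, it is worth measuring where your own setup starts to queue. The sketch below is a small probe, not a benchmark harness: it fires a batch of requests at once and reports a nearest-rank P95. It measures end-to-end latency rather than time-to-first-token, and send is a hypothetical placeholder for any zero-argument callable that performs one chat request.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def timed(send: Callable[[], object]) -> float:
    """Run one request and return its wall-clock duration in seconds."""
    start = time.perf_counter()
    send()
    return time.perf_counter() - start


def latencies_at(concurrency: int, send: Callable[[], object]) -> list[float]:
    """Fire `concurrency` requests at once and return their sorted latencies."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        futures = [pool.submit(timed, send) for _ in range(concurrency)]
        return sorted(f.result() for f in futures)


def p95(sorted_values: list[float]) -> float:
    """Nearest-rank 95th percentile of an already sorted, non-empty list."""
    rank = max(1, -(-95 * len(sorted_values) // 100))  # integer ceil(0.95 * n)
    return sorted_values[rank - 1]
```

Run it at increasing concurrency, for example 1, 2, 4, 8, and 16, wrapping the OpenAI client call from earlier in a lambda, and watch where p95(latencies_at(n, send)) stops scaling gently and starts to jump.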
4. You cannot reach Ollama from another machine or a web app
What you see: Ollama works locally but a different computer, a container, or a browser front end cannot connect, or gets blocked by CORS.
What it is: by default the server binds to 127.0.0.1, so it is reachable only from the same machine, and it rejects browser requests that come from origins other than localhost.
The fix: set OLLAMA_HOST=0.0.0.0 so it listens on all interfaces, make sure port 11434 is open, and set OLLAMA_ORIGINS to allow your web app's origin. On Linux these go in the systemd service environment; restart with systemctl restart ollama.
# Linux: expose Ollama on the network
sudo systemctl edit ollama
# Add under [Service]:
# Environment="OLLAMA_HOST=0.0.0.0"
# Environment="OLLAMA_ORIGINS=*"
sudo systemctl restart ollama
5. Picking a model that fits your hardware
What you see: you do not know whether a 7B, 13B, or 70B model will run on your machine, so you guess and hit OOM or molasses-slow CPU inference.
What it is: the deciding factor is VRAM versus the model's size at its quantization. As a rough rule for common Q4 quants: a 7 to 8B model wants about 5 to 6 GB, a 13 to 14B model about 8 to 10 GB, and a 70B model roughly 40 to 48 GB. Add headroom for the context window.
The fix: match the model to your VRAM with margin to spare, prefer a Q4_K_M quant for the best size-to-quality trade-off, and verify with ollama ps that it loaded fully onto the GPU. When in doubt, start smaller; a model that runs entirely on the GPU beats a larger one spilling to CPU.
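The rule of thumb above can be turned into a quick estimate. This sketch assumes roughly 4.8 bits per weight for a Q4_K_M quant and a flat 1.5 GB of headroom for context and runtime overhead; both numbers are my approximations, so treat the result as a first filter and confirm with ollama ps.

```python
def weights_gb(n_params: float, bits_per_weight: float = 4.8) -> float:
    """Approximate size of quantized weights in GB (10^9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


def fits(n_params: float, vram_gb: float, headroom_gb: float = 1.5,
         bits_per_weight: float = 4.8) -> bool:
    """True if weights plus headroom for context and runtime fit in VRAM."""
    return weights_gb(n_params, bits_per_weight) + headroom_gb <= vram_gb


for name, params in [("8B", 8e9), ("14B", 14e9), ("70B", 70e9)]:
    print(f"{name}: ~{weights_gb(params):.1f} GB of weights at ~4.8 bits/weight")
```

Under these assumptions an 8B model passes fits(8e9, 8.0) on an 8 GB card, while fits(70e9, 24.0) is false, which matches the experience that a 70B model will not sit entirely on a single 24 GB GPU at Q4.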
Ollama vs llama.cpp vs LM Studio vs vLLM (2026)
Ollama is not the only way to run models locally, and the ecosystem has split cleanly by workload. Ollama and LM Studio are experience layers; llama.cpp and MLX are the underlying engines; vLLM is a serving system. Here is how to choose.
| Tool | Best for | Interface | Concurrency | Notes |
|---|---|---|---|---|
| Ollama | Single-developer local prototyping on any OS | CLI plus OpenAI-compatible API | Weak past ~5 to 10 users | Wraps llama.cpp, MLX on Apple Silicon; easiest on-ramp |
| llama.cpp | Raw speed, embedded and weird-hardware deployments | CLI and library | Manual | The engine Ollama is built on; roughly 15 to 25% faster single-user |
| LM Studio | GUI-first model browsing on Mac or Windows | Desktop GUI plus local server | Single-user | Comparable to Ollama for sequential GGUF inference; closed-source app |
| vLLM | Multi-user production serving on NVIDIA or AMD GPUs | Python server, OpenAI-compatible | Very high (PagedAttention, continuous batching) | ~16 to 20x Ollama concurrent throughput; plan 20 to 30% more VRAM |
What changed in Ollama in 2026
If you are reading an older guide, here is what is current as of v0.30.8 (June 12, 2026):
- MLX is the default Apple Silicon backend (v0.19 and later), with benchmarks around 1.6x prefill and roughly 2x decode versus the older path. Recent releases stabilized the MLX runner further.
- Prompt caching was decoupled from context shift for better KV cache reuse, and speculative decoding landed in the MLX runner.
- Smarter VRAM handling: ollama run --verbose reports peak memory, and the server auto-shrinks context to fit available VRAM instead of crashing.
- The ollama launch command starts integrated apps and tools alongside the core pull and run workflow.
Frequently asked questions
Is Ollama free?
Yes. Ollama is free and open source, and the models it runs are open-weight models you download and run locally at no per-token cost. Your only cost is the hardware and electricity.
Does Ollama use my GPU automatically?
Yes, when the model fits in VRAM and your GPU drivers are installed. If a model is too large, Ollama offloads the overflow layers to the CPU, which is slow. Run ollama ps to see the GPU/CPU split, and pick a smaller model or quantization if it is not fully on the GPU.
Why is Ollama so slow?
The usual cause is CPU fallback because the model does not fit in VRAM. Check ollama ps, reduce num_ctx, use a smaller quantization, or choose a smaller model. Under many concurrent requests, Ollama is slow by design because it lacks continuous batching; use vLLM for that.
Ollama vs llama.cpp: what is the difference?
Ollama is a convenience layer built on top of llama.cpp. llama.cpp is the underlying inference engine and runs roughly 15 to 25% faster single-user, but Ollama adds one-line install, model management, and an OpenAI-compatible API. Use Ollama for ease; drop to llama.cpp for maximum speed or embedded deployments.
Can Ollama serve multiple users in production?
Not well. Ollama has no PagedAttention or continuous batching, so latency spikes past roughly five to ten concurrent users. For production multi-user serving, vLLM delivers around 16 to 20 times the concurrent throughput. Keep Ollama for development and single-user use.
How do I expose the Ollama API on my network?
Set OLLAMA_HOST=0.0.0.0 so the server listens on all interfaces, open port 11434, and set OLLAMA_ORIGINS to allow your web app's origin. On Linux, add these to the systemd service with sudo systemctl edit ollama and restart.
Resources
- Ollama Download – official installers for macOS, Windows, and Linux
- Ollama on GitHub – source, releases, and issues
- Ollama Model Library – browse models and tags
- llama.cpp and vLLM – the speed engine and the serving system
- Best local models to run and why run LLMs locally – companion guides on this site
Last updated: June 13, 2026.
Josh writes about AI agents, local AI, and GEO, and runs nowservingto.com, a daily-fresh directory of Toronto's newest restaurants.
