Joshua Opolko

Ollama Setup Guide (2026): Install, Run LLMs Locally, and Fix the Common Issues

A playful orange line-art llama, the mascot of the Ollama project

Ollama is the fastest way to run open large language models on your own machine. One install command, ollama pull to download a model, ollama run to chat, and you have a private, offline model with an OpenAI-compatible API on localhost:11434. Under the hood it wraps llama.cpp, and on Apple Silicon it now uses Apple's MLX backend. The latest release is v0.30.8 (June 12, 2026). Getting started takes two minutes; the friction comes later, when the model quietly runs on your CPU instead of your GPU, or falls over with an out-of-memory error, or buckles the moment a second user connects. This guide covers both halves: the exact install and run commands, then the problems people actually hit and the fix for each, plus an honest comparison with llama.cpp, LM Studio, and vLLM.

Key takeaways

How do I install Ollama?

Ollama runs on Linux, macOS, and Windows, and in Docker. Pick your platform.

# Linux: one-line install (sets up a systemd service and starts the server)
curl -fsSL https://ollama.com/install.sh | sh

# macOS: download the app, or via Homebrew
brew install --cask ollama

# Windows: download and run the installer from ollama.com/download

# Docker (CPU)
docker run -d -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
# Docker with NVIDIA GPU: add --gpus=all

After installing, confirm the version and that the background server is reachable:

ollama --version
curl http://localhost:11434/api/tags   # should return your (empty) model list
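If you would rather script that check, here is a minimal Python sketch using only the standard library. It assumes the /api/tags response is a JSON object with a models array whose entries carry a name field. The helper names model_names and fetch_model_names are my own, and the sample payloads are hand-written, not captured server output.

```python
import json
from urllib.request import urlopen


def model_names(tags_payload: dict) -> list[str]:
    """Return the model names from a decoded /api/tags response."""
    return [m["name"] for m in tags_payload.get("models", [])]


def fetch_model_names(base_url: str = "http://localhost:11434") -> list[str]:
    """Query a running Ollama server; raises URLError if it is unreachable."""
    with urlopen(f"{base_url}/api/tags", timeout=5) as resp:
        return model_names(json.load(resp))


# Offline demonstration with hand-written payloads (assumed shape).
fresh_install = {"models": []}
after_pull = {"models": [{"name": "llama3.2:latest"}, {"name": "qwen3:latest"}]}
print(model_names(fresh_install))
print(model_names(after_pull))
```

Call fetch_model_names() once the server is up. An empty list on a fresh install is the expected result, not an error.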

How do I run my first Ollama model?

Models are pulled from the Ollama library by name and tag. The tag encodes the parameter size and quantization (for example, llama3.2:3b-instruct-q4_K_M), and together those determine whether the model fits on your hardware and how fast it runs. If you leave out the tag, Ollama pulls a sensible default (usually a 4-bit quantized version of a mid-size variant).
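To make the name:tag structure concrete, here is a small Python sketch that splits a model reference and guesses the size and quantization from the tag. The regular expressions reflect common naming habits in the library, not a formal spec, and the function names split_model_ref and describe_tag are hypothetical helpers of mine.

```python
import re

# Heuristics only: typical size ("3b", "70b") and quantization ("q4_K_M", "fp16") pieces.
SIZE_RE = re.compile(r"^\d+(\.\d+)?[bm]$", re.IGNORECASE)
QUANT_RE = re.compile(r"^(q\d|fp16|f16|bf16)", re.IGNORECASE)


def split_model_ref(ref: str) -> tuple[str, str]:
    """Split 'name:tag' into (name, tag); a missing tag means 'latest'."""
    head, slash, last = ref.rpartition("/")
    name, colon, tag = last.partition(":")
    return head + slash + name, (tag if colon and tag else "latest")


def describe_tag(tag: str) -> dict:
    """Guess the parameter size and quantization encoded in a tag."""
    parts = tag.split("-")
    size = next((p for p in parts if SIZE_RE.match(p)), None)
    quant = next((p for p in parts if QUANT_RE.match(p)), None)
    return {"size": size, "quantization": quant}


name, tag = split_model_ref("llama3.2:3b-instruct-q4_K_M")
print(name, tag, describe_tag(tag))
print(split_model_ref("llama3.2"))
```

A tag such as latest carries no size or quantization hint, so both fields come back as None and you need ollama show or the library page to find out what you actually pulled.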

# Download and chat with a model
ollama run llama3.2

# Pull without running, then list what you have
ollama pull qwen3
ollama list

# See what is loaded and whether it is on GPU or CPU
ollama ps

# Show generation stats (token counts, durations, tokens per second)
ollama run llama3.2 --verbose

# Remove a model to reclaim disk
ollama rm qwen3

Before you run, check how much VRAM you have available. A 7 to 8B model at Q4 needs roughly 5 to 6 GB, a 13B model needs about 8 to 10 GB, and a 70B model needs 40 GB or more. If the model is too large, Ollama will load it partly on the CPU, which is much slower. Run ollama ps after loading to see the GPU/CPU split and confirm the model is fully on the GPU. When in doubt, start with a smaller model that fits entirely in VRAM; it will outperform a larger model that spills over to system RAM.

Inside a chat, type /bye to exit. If you are unsure which models to choose for your hardware, see my companion piece on the best local models to run and the broader case for running LLMs locally.

How do I use Ollama's OpenAI-compatible API?

Ollama's OpenAI-compatible API is the feature that makes it genuinely useful in real apps. Point any OpenAI client at http://localhost:11434/v1 with an arbitrary API key (Ollama ignores the key value) and the standard SDK calls work without modification: chat completions, streaming, model listing, and embeddings are all supported. That means any code written against the OpenAI SDK can be redirected to a fully local, zero-cost model by changing a single base_url parameter. It also means a local Ollama model can back an agent framework such as CrewAI or Agent Zero for fully private, no-API-cost development. If you are building a tool that needs to stay entirely on-device, whether for privacy, offline capability, or cost control, the API compatibility makes Ollama the easiest drop-in replacement for any OpenAI-dependent project.

# Raw HTTP
curl http://localhost:11434/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "llama3.2",
  "messages": [{"role": "user", "content": "Say hello in one sentence."}]
}'

# Python, using the official OpenAI SDK
from openai import OpenAI
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")  # key is ignored
resp = client.chat.completions.create(
    model="llama3.2",
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(resp.choices[0].message.content)

How do I customize a model with a Modelfile?

A Modelfile lets you bake a system prompt and inference parameters into a reusable named model, following the same idea as a Dockerfile. Instead of passing the same flags and system prompt every time, you define them once in a text file and build a named model layer with ollama create. The result shows up in ollama list and behaves exactly as configured every time you call it.

The most useful parameters to set are temperature (lower for factual tasks, higher for creative work), num_ctx (the context window; raise it when you need long conversations, lower it to save VRAM), and top_p or top_k for sampling behavior. You can also add a TEMPLATE directive to override the chat template if the model's default does not match your use case. Baking these settings into a named model is particularly useful when you want a consistent, stripped-down assistant persona across all your local tooling without repeating the system prompt in every call.

# Modelfile
FROM llama3.2
PARAMETER temperature 0.3
PARAMETER num_ctx 8192
SYSTEM "You are a terse assistant. Answer in at most three sentences."

# Build and run it
ollama create terse-llama -f Modelfile
ollama run terse-llama
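If you maintain several personas, generating the Modelfile text from code can keep them consistent. The sketch below is one way to do that in Python. It only uses the FROM, PARAMETER, and SYSTEM directives shown above, and render_modelfile is a hypothetical helper, not part of Ollama.

```python
def render_modelfile(base: str, system: str, **parameters) -> str:
    """Build Modelfile text from a base model, a system prompt, and parameters."""
    lines = [f"FROM {base}"]
    for key, value in parameters.items():
        lines.append(f"PARAMETER {key} {value}")
    # Triple quotes let the system prompt span several lines.
    lines.append(f'SYSTEM """{system}"""')
    return "\n".join(lines) + "\n"


text = render_modelfile(
    "llama3.2",
    "You are a terse assistant. Answer in at most three sentences.",
    temperature=0.3,
    num_ctx=8192,
)
print(text)
```

Write the result to a file named Modelfile and build it with ollama create as before.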

1. Ollama is using the CPU instead of the GPU

What you see: generation is painfully slow, your GPU sits near idle, and the fans barely spin up.

What it is: the most common Ollama complaint. If a model plus its context does not fit in available VRAM, Ollama silently offloads the overflow layers to the CPU, and those layers drag the whole generation down to CPU speed.

The fix: run ollama ps, which shows the GPU/CPU split for each loaded model. If it says anything other than 100% GPU, the model is too big for your VRAM. Pick a smaller parameter size or a smaller quantization, lower num_ctx, or set num_gpu to control how many layers go to the GPU. Also confirm your GPU drivers and, on NVIDIA, CUDA are installed, and remember that browsers and the desktop compositor are already eating some VRAM.
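The same split is available programmatically. As a rough sketch, assuming the /api/ps endpoint returns a models array whose entries include size and size_vram in bytes, the GPU share of a loaded model is simply the ratio of the two. The gpu_share helper and the sample payload below are illustrative, not real server output.

```python
import json
from urllib.request import urlopen


def gpu_share(model: dict) -> float:
    """Fraction of a loaded model that sits in VRAM (1.0 means fully on GPU)."""
    size = model["size"]
    return model["size_vram"] / size if size else 0.0


def loaded_models(base_url: str = "http://localhost:11434") -> list[dict]:
    """Fetch the currently loaded models from a running server."""
    with urlopen(f"{base_url}/api/ps", timeout=5) as resp:
        return json.load(resp).get("models", [])


# Hand-written example payload (assumed field names, made-up sizes).
sample = [
    {"name": "llama3.2:latest", "size": 6_000_000_000, "size_vram": 6_000_000_000},
    {"name": "big-model:latest", "size": 10_000_000_000, "size_vram": 4_000_000_000},
]
for m in sample:
    print(f"{m['name']}: {gpu_share(m):.0%} on GPU")
```

Anything below 100% means some layers are running on the CPU, which is the slow path described above.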

2. Out-of-memory and CUDA OOM errors

What you see: CUDA error: out of memory, or the model loads and then crashes mid-generation, often once you raise the context window.

What it is: the context window is usually the culprit. OOM appears consistently when num_ctx is pushed to 4096 or higher on cards with limited VRAM, because the KV cache grows with context length.

The fix: lower num_ctx to what you actually need, choose a smaller quantization (a Q4 quant uses far less memory than Q8 or full precision), or step down a model size. Run ollama ps after the model loads to see its loaded size and GPU/CPU split, which tells you how close you are to the VRAM ceiling. Recent Ollama versions will automatically shrink the context to fit available VRAM rather than crashing, but setting a sane num_ctx yourself is still the reliable fix.
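To see why context length matters so much, it helps to work the arithmetic. The sketch below uses the standard KV cache estimate: one key and one value tensor per layer, each of size KV heads times head dimension, per token. The architecture numbers (32 layers, 8 KV heads, head dimension 128, 16-bit cache) are assumptions that match a Llama-3-8B-style model; check your own model's values before trusting the totals.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   num_ctx: int, bytes_per_value: int = 2) -> int:
    """Approximate KV cache size in bytes for one sequence."""
    # Factor 2: one key tensor plus one value tensor per layer.
    return 2 * n_layers * n_kv_heads * head_dim * num_ctx * bytes_per_value


# Assumed architecture: Llama-3-8B-style model with a 16-bit KV cache.
LAYERS, KV_HEADS, HEAD_DIM = 32, 8, 128

for ctx in (2048, 8192, 32768):
    gib = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, ctx) / 2**30
    print(f"num_ctx={ctx:>6}: {gib:.2f} GiB of KV cache")
```

Under these assumptions the cache grows linearly, from a quarter of a GiB at 2048 tokens to 4 GiB at 32768, all of it on top of the model weights.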

3. It falls apart under concurrent requests

What you see: Ollama is snappy for one user, but as soon as a handful of requests arrive at once, latency spikes and requests queue up behind each other.

What it is: this is architectural, not a bug. Ollama does not implement PagedAttention or continuous batching, so it is not built to serve many users in parallel. Benchmarks show P95 time-to-first-token climbing from a few seconds to over a minute once concurrency passes roughly ten users.

The fix: for light parallelism, set OLLAMA_NUM_PARALLEL to control how many requests each model handles at once and OLLAMA_MAX_LOADED_MODELS to control how many models stay loaded at the same time. But if you are genuinely serving multiple users in production, switch the serving layer to vLLM, which uses PagedAttention and continuous batching to deliver roughly 16 to 20 times Ollama's concurrent throughput. Keep Ollama for development and single-user workloads.
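Before deciding, it is worth measuring how your own setup behaves as concurrency rises. Here is a minimal load-probe sketch in Python. It times whole requests rather than time-to-first-token, uses a nearest-rank P95, and takes the request function as a parameter so you can plug in a real client call. The names timed, probe, and p95 are mine, and fake_send is a stand-in that just sleeps.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def timed(send, prompt: str) -> float:
    """Return the wall-clock seconds one call to send(prompt) takes."""
    start = time.perf_counter()
    send(prompt)
    return time.perf_counter() - start


def probe(send, prompts: list[str], concurrency: int) -> list[float]:
    """Run send() over all prompts with a fixed number of worker threads."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        return list(pool.map(lambda p: timed(send, p), prompts))


def p95(latencies: list[float]) -> float:
    """Nearest-rank 95th percentile."""
    ordered = sorted(latencies)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n)
    return ordered[rank - 1]


def fake_send(prompt: str) -> None:
    time.sleep(0.01)  # stand-in for a real request to the server


latencies = probe(fake_send, ["hello"] * 8, concurrency=4)
print(f"P95 latency: {p95(latencies):.3f} s")
```

Replace fake_send with a function that calls client.chat.completions.create, then repeat the run at concurrency 1, 4, and 16 to see where latency starts to climb on your hardware.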

4. You cannot reach Ollama from another machine or a web app

What you see: Ollama works locally but a different computer, a container, or a browser front end cannot connect, or gets blocked by CORS.

What it is: by default the server binds to 127.0.0.1 (localhost only), and it rejects cross-origin browser requests unless they come from a local origin.

The fix: set OLLAMA_HOST=0.0.0.0 so it listens on all interfaces, make sure port 11434 is open, and set OLLAMA_ORIGINS to allow your web app's origin. On Linux these go in the systemd service environment; restart with systemctl restart ollama.

# Linux: expose Ollama on the network
sudo systemctl edit ollama
# Add under [Service]:
#   Environment="OLLAMA_HOST=0.0.0.0"
#   Environment="OLLAMA_ORIGINS=*"
sudo systemctl restart ollama
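After restarting, you can confirm from the other machine that the port is actually reachable before debugging anything at the application level. The sketch below is a plain TCP check in Python. port_open is a hypothetical helper, and the demonstration uses a throwaway local listener rather than a real Ollama server.

```python
import socket


def port_open(host: str, port: int = 11434, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Demonstration against a temporary local listener on an OS-chosen free port.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))
listener.listen(1)
demo_port = listener.getsockname()[1]
print("while listening:", port_open("127.0.0.1", demo_port))
listener.close()
print("after closing:", port_open("127.0.0.1", demo_port))
```

In practice you would call port_open with the server's LAN address. If it returns False, look at OLLAMA_HOST and the firewall; if it returns True but the browser still fails, the problem is more likely OLLAMA_ORIGINS.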

5. Picking a model that fits your hardware

What you see: you do not know whether a 7B, 13B, or 70B model will run on your machine, so you guess and hit OOM or molasses-slow CPU inference.

What it is: the deciding factor is VRAM versus the model's size at its quantization. As a rough rule for common Q4 quants: a 7 to 8B model wants about 5 to 6 GB, a 13 to 14B model about 8 to 10 GB, and a 70B model roughly 40 to 48 GB. Add headroom for the context window.

The fix: match the model to your VRAM with margin to spare, prefer a Q4_K_M quant for the best size-to-quality trade-off, and verify with ollama ps that it loaded fully onto the GPU. When in doubt, start smaller; a model that runs entirely on the GPU beats a larger one spilling to CPU.
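The rule of thumb can be turned into a quick estimate. The Python sketch below multiplies the parameter count by an assumed average of about 4.85 bits per weight for a Q4_K_M quant and adds an assumed 1 GB of overhead for the runtime and a modest context. Both numbers are my assumptions, chosen to land inside the ranges quoted above, so treat the output as a first guess rather than a guarantee.

```python
def estimate_vram_gb(params_billion: float,
                     bits_per_weight: float = 4.85,  # assumed Q4_K_M average
                     overhead_gb: float = 1.0) -> float:  # assumed runtime + small context
    """Rough VRAM estimate in GB for a quantized model."""
    weights_gb = params_billion * bits_per_weight / 8
    return weights_gb + overhead_gb


def fits(params_billion: float, vram_gb: float, margin_gb: float = 1.0) -> bool:
    """True if the estimate fits in vram_gb with some margin to spare."""
    return estimate_vram_gb(params_billion) + margin_gb <= vram_gb


for size in (8, 13, 70):
    print(f"{size}B at Q4: about {estimate_vram_gb(size):.1f} GB")
print("13B on a 12 GB card:", fits(13, 12))
print("70B on a 24 GB card:", fits(70, 24))
```

Raise overhead_gb if you plan to use a large num_ctx, since the KV cache is not included in the weight estimate.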

Ollama vs llama.cpp vs LM Studio vs vLLM: which should I use in 2026?

Ollama is not the only way to run models locally, and the ecosystem has split cleanly by workload. Ollama and LM Studio are experience layers; llama.cpp and MLX are the underlying engines; vLLM is a serving system. Here is how to choose.

| Tool | Best for | Interface | Concurrency | Notes |
|---|---|---|---|---|
| Ollama | Single-developer local prototyping on any OS | CLI plus OpenAI-compatible API | Weak past ~5 to 10 users | Wraps llama.cpp, MLX on Apple Silicon; easiest on-ramp |
| llama.cpp | Raw speed, embedded and weird-hardware deployments | CLI and library | Manual | The engine Ollama is built on; roughly 15 to 25% faster single-user |
| LM Studio | GUI-first model browsing on Mac or Windows | Desktop GUI plus local server | Single-user | Comparable to Ollama for sequential GGUF inference; closed-source app |
| vLLM | Multi-user production serving on NVIDIA or AMD GPUs | Python server, OpenAI-compatible | Very high (PagedAttention, continuous batching) | ~16 to 20x Ollama concurrent throughput; plan 20 to 30% more VRAM |

What changed in Ollama in 2026

If you are reading an older guide, here is what is current as of v0.30.8 (June 12, 2026):

Frequently asked questions

Is Ollama free?

Yes. Ollama is free and open source, and the models it runs are open-weight models you download and run locally at no per-token cost. Your only cost is the hardware and electricity.

Does Ollama use my GPU automatically?

Yes, when the model fits in VRAM and your GPU drivers are installed. If a model is too large, Ollama offloads the overflow layers to the CPU, which is slow. Run ollama ps to see the GPU/CPU split, and pick a smaller model or quantization if it is not fully on the GPU.

Why is Ollama so slow?

The usual cause is CPU fallback because the model does not fit in VRAM. Check ollama ps, reduce num_ctx, use a smaller quantization, or choose a smaller model. Under many concurrent requests, Ollama is slow by design because it lacks continuous batching; use vLLM for that.

Ollama vs llama.cpp: what is the difference?

Ollama is a convenience layer built on top of llama.cpp. llama.cpp is the underlying inference engine and runs roughly 15 to 25% faster single-user, but Ollama adds one-line install, model management, and an OpenAI-compatible API. Use Ollama for ease; drop to llama.cpp for maximum speed or embedded deployments.

Can Ollama serve multiple users in production?

Not well. Ollama has no PagedAttention or continuous batching, so latency spikes past roughly five to ten concurrent users. For production multi-user serving, vLLM delivers around 16 to 20 times the concurrent throughput. Keep Ollama for development and single-user use.

How do I expose the Ollama API on my network?

Set OLLAMA_HOST=0.0.0.0 so the server listens on all interfaces, open port 11434, and set OLLAMA_ORIGINS to allow your web app's origin. On Linux, add these to the systemd service with sudo systemctl edit ollama and restart.


Resources

Last updated: June 13, 2026.

Josh writes about AI agents, local AI, and GEO, and runs nowservingto.com, a daily-fresh directory of Toronto's newest restaurants.