# Set Up Local Ollama
Ollama lets you run large language models locally. When you connect it to Pinchy, your agents run entirely on your own hardware — no API keys, no cloud calls, no data leaving your infrastructure.
This is the setup for teams that need full air-gap compliance or simply want to keep everything in-house.
## Prerequisites

- A running Pinchy instance (Installation)
- Ollama installed and running with at least one model pulled
## Connect Ollama to Pinchy

1. Go to Settings → LLM Provider
2. Click Ollama (Local)
3. Enter the URL where Ollama is running (see deployment options below)
4. Click Save — Pinchy validates the connection and discovers your models
That’s it. Your agents now use your local Ollama models.
## Deployment Options

Where you run Ollama depends on your setup. Here are the most common options.
### A. Ollama on the host machine

The simplest path. Install Ollama directly on the machine that runs Pinchy.
```sh
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Pull a model
ollama pull qwen3.5:9b
```

In Pinchy, set the URL to:

```
http://host.docker.internal:11434
```

### B. Ollama as a Docker service
Run Ollama alongside Pinchy in Docker. Create a `docker-compose.override.yml` in your project root:
```yaml
services:
  ollama:
    image: ollama/ollama
    volumes:
      - ollama_data:/root/.ollama

volumes:
  ollama_data:
```

Then restart:

```sh
docker compose up -d
```

In Pinchy, set the URL to:

```
http://ollama:11434
```

### C. Ollama with NVIDIA GPU (Docker)
For GPU acceleration, install the NVIDIA Container Toolkit first, then add GPU reservations to your override:
```yaml
services:
  ollama:
    image: ollama/ollama
    volumes:
      - ollama_data:/root/.ollama
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

volumes:
  ollama_data:
```

Everything else is the same — the Pinchy URL is `http://ollama:11434`.
### D. Ollama on a remote server

Run Ollama on a dedicated GPU machine and point Pinchy at it over the network.
On the Ollama server:
```sh
# Allow remote connections
OLLAMA_HOST=0.0.0.0 ollama serve
```

In Pinchy, set the URL to:

```
http://<server-ip>:11434
```

## Recommended Models
| Use Case | Model | Size | Tool Support | Why |
|---|---|---|---|---|
| General agent | qwen3.5:9b | 6.6 GB | Yes | Reliable tool calling, good multilingual quality, multimodal — recommended default |
| Coding tasks | qwen2.5-coder:32b | 19 GB | Yes | Strong code generation |
| Large context | qwen3.5:27b | 17 GB | Yes | 256k context, highest local quality |
| Lightweight | phi3:mini | 2.3 GB | No | Fast, but no tool support — not compatible with Pinchy agents |
Pull the recommended default with:

```sh
ollama pull qwen3.5:9b
```

### Models for agent templates
Agent templates have model recommendations built in. When you create an agent from a template, Pinchy automatically picks a model that fits the template’s needs — fast models for simple lookups, larger models for complex analysis.
Some templates require specific capabilities that not all models support. If your installed models can’t satisfy a template’s requirements, the template card will appear greyed out with a tooltip explaining what’s missing.
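The gating behaviour can be sketched as a small shell function — a hypothetical illustration, not Pinchy's actual code; the `template_ok` helper and the capability names are invented for the example:

```sh
# Hypothetical sketch of the template gating logic (not Pinchy's real code):
# a template stays enabled only if every capability it requires appears in
# the installed model's capability list.
template_ok() {
  required=$1
  installed=$2
  for cap in $required; do
    case " $installed " in
      *" $cap "*) ;;      # capability present, keep checking
      *) return 1 ;;      # missing capability -> grey the card out
    esac
  done
  return 0
}

# A tools-only model satisfies a tools-only template...
template_ok "tools" "tools" && echo "enabled"
# ...but not a template that also needs vision.
template_ok "tools vision" "tools" || echo "greyed out"
```

The same check runs per template card, which is why pulling a single vision-capable model can light up several templates at once.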
### Vision templates

The following templates analyze documents and images and require a model with vision support:
- Contract Analyzer — reads and summarizes contract clauses
- Resume Screener — extracts structured data from uploaded CVs
- Proposal Comparator — compares multiple document uploads side by side
- Compliance Checker — audits documents against policy requirements
To enable these templates, pull a vision-capable model:

```sh
# Recommended: Qwen2.5-VL (strong vision + tool calling)
ollama pull qwen2.5vl:7b

# Alternative: LLaMA 3.2 Vision
ollama pull llama3.2-vision:11b
```

After pulling, verify vision support:

```sh
ollama show qwen2.5vl:7b | grep vision
```

### Tier-to-size mapping
Pinchy groups models into three tiers based on parameter count:
| Tier | Parameter range | Example |
|---|---|---|
| Fast | < 10B | qwen3.5:9b |
| Balanced | 10B – 39B | qwen2.5-coder:32b |
| Reasoning | 40B+ | qwen3.5:72b |
Templates declare a preferred tier; Pinchy picks the best installed match. If no model at the preferred tier is installed, it falls back to whatever is available.
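The tier cutoffs above can be expressed as a small shell function — an illustrative sketch of the published thresholds, not a real Pinchy or Ollama command:

```sh
# Map a model's parameter count (in billions) to a Pinchy tier,
# following the thresholds in the table above. Illustrative only.
tier_for() {
  params_b=$1                      # parameter count in billions
  if [ "$params_b" -lt 10 ]; then
    echo "fast"
  elif [ "$params_b" -lt 40 ]; then
    echo "balanced"
  else
    echo "reasoning"
  fi
}

tier_for 9    # qwen3.5:9b        -> fast
tier_for 32   # qwen2.5-coder:32b -> balanced
tier_for 72   # qwen3.5:72b       -> reasoning
```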
## Performance Expectations

Local models are slower than cloud APIs — sometimes by an order of magnitude. Plan for the following on a modern Apple Silicon Mac or a single mid-range GPU:
- Simple chat reply: 5–15 seconds
- Tool-using reply (e.g. Smithers consulting documentation): 60–120 seconds
- First request after a long idle: add 10–30 seconds for model load
The reason tool-using replies are so much slower is that each tool round-trip forces a fresh inference pass over the entire growing context — system prompt, conversation history, and previous tool outputs all get re-processed. On cloud GPUs this is invisible; on local hardware the prefill phase dominates.
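A back-of-envelope sketch makes the compounding visible — all token counts below are invented for illustration:

```sh
# Each tool round-trip re-processes the entire context so far, so total
# prefill work grows roughly quadratically with the number of rounds.
# The numbers here are assumptions chosen purely for illustration.
context=2000      # tokens: system prompt + conversation history (assumed)
per_tool=500      # tokens added by each tool result (assumed)
total=0

for round in 1 2 3 4; do
  context=$((context + per_tool))   # tool output appended to context
  total=$((total + context))        # full context re-processed this round
  echo "round $round: prefill $context tokens (cumulative $total)"
done
```

Four tool rounds here cost 13,000 prefill tokens in total — more than three times the final context size — which is where the 60–120 second figure comes from.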
Pinchy keeps the WebSocket alive and shows a thinking indicator the whole time, so the UI never looks stuck — but the wait is real. If responsiveness matters more than air-gap compliance, consider mixing local (privacy-sensitive agents) with cloud providers (interactive agents).
## Troubleshooting

### “Could not connect to Ollama at this URL”
- Check that Ollama is running: `ollama list` should show your models
- Verify the URL matches your deployment option (see above)
- If Ollama runs on the host and Pinchy in Docker, use `http://host.docker.internal:11434` — not `http://localhost:11434`
### No models appear after connecting

- Pull at least one model first: `ollama pull qwen2.5:7b`
- If Ollama runs in Docker, pull from inside the container: `docker compose exec ollama ollama pull qwen2.5:7b`
### “No compatible models found”

- Pinchy agents require models with tool-calling support; not all Ollama models support this.
- Pull a compatible model: `ollama pull qwen2.5:7b`
- Check a model’s capabilities with `ollama show <model>` — look for “tools” in the capabilities list.
### Slow responses

- See Performance Expectations above — tool-using replies on local hardware genuinely take 1–2 minutes; this is not a bug
- A GPU makes a big difference — even a modest one speeds up inference significantly
- Quantized models (e.g., `qwen3.5:9b-q4_0`) trade some quality for speed
- For maximum responsiveness, use a cloud provider for interactive agents and reserve local models for privacy-sensitive workloads