
Set Up Local Ollama

Ollama lets you run large language models locally. When you connect it to Pinchy, your agents run entirely on your own hardware — no API keys, no cloud calls, no data leaving your infrastructure.

This is the setup for teams that need full air-gap compliance or simply want to keep everything in-house.

  • A running Pinchy instance (Installation)
  • Ollama installed and running with at least one model pulled

  1. Go to Settings → LLM Provider
  2. Click Ollama (Local)
  3. Enter the URL where Ollama is running (see deployment options below)
  4. Click Save — Pinchy validates the connection and discovers your models

That’s it. Your agents now use your local Ollama models.
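You can also sanity-check the connection outside Pinchy. Assuming Ollama is on its default port (11434), its `/api/tags` endpoint lists the models Pinchy will discover:

```shell
# List installed models via Ollama's REST API (default port assumed).
# Prints JSON along the lines of {"models":[{"name":"qwen3.5:9b", ...}]} when Ollama is up.
curl -s --max-time 2 http://localhost:11434/api/tags || echo "Ollama is not reachable"
```

If this prints the fallback message instead of JSON, fix connectivity before configuring Pinchy.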

Where you run Ollama depends on your setup. Here are the most common options.

The simplest path. Install Ollama directly on the machine that runs Pinchy.

```shell
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Pull a model
ollama pull qwen3.5:9b
```

In Pinchy, set the URL to:

http://host.docker.internal:11434
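Note that `host.docker.internal` resolves out of the box on Docker Desktop (macOS and Windows) but often not on a plain Linux host. If the connection fails there, a common fix is mapping the name to the host gateway in your compose file. A sketch, assuming your Pinchy service is named `pinchy` (substitute your actual service name):

```yaml
services:
  pinchy:   # hypothetical service name; use your actual Pinchy service
    extra_hosts:
      - "host.docker.internal:host-gateway"
```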

Run Ollama alongside Pinchy in Docker. Create a docker-compose.override.yml in your project root:

```yaml
services:
  ollama:
    image: ollama/ollama
    volumes:
      - ollama_data:/root/.ollama

volumes:
  ollama_data:
```

Then restart:

```shell
docker compose up -d
```

In Pinchy, set the URL to:

http://ollama:11434

For GPU acceleration, install the NVIDIA Container Toolkit first, then add GPU reservations to your override:

```yaml
services:
  ollama:
    image: ollama/ollama
    volumes:
      - ollama_data:/root/.ollama
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

volumes:
  ollama_data:
```
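After restarting, you can check that the container actually sees the GPU. A quick sanity check, assuming the `ollama` service name from the override above:

```shell
# Run nvidia-smi inside the Ollama container; it should list your GPU.
# Falls back to a message if docker or the GPU is unavailable.
docker compose exec ollama nvidia-smi || echo "GPU not visible in container"
```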

Everything else is the same — Pinchy URL is http://ollama:11434.

Run Ollama on a dedicated GPU machine and point Pinchy at it over the network.

On the Ollama server:

```shell
# Allow remote connections
OLLAMA_HOST=0.0.0.0 ollama serve
```

In Pinchy, set the URL to:

http://<server-ip>:11434
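Before saving the URL in Pinchy, confirm the server answers over the network and that port 11434 is open on its firewall. A sketch, where `SERVER_IP` is a placeholder for your GPU machine's address:

```shell
# Ollama's /api/version endpoint returns a small JSON blob when reachable.
SERVER_IP=192.168.1.50   # placeholder; substitute your server's IP
curl -s --max-time 2 "http://${SERVER_IP}:11434/api/version" || echo "unreachable"
```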
| Use case | Model | Size | Tool support | Why |
| --- | --- | --- | --- | --- |
| General agent | qwen3.5:9b | 6.6 GB | Yes | Reliable tool calling, good multilingual quality, multimodal — recommended default |
| Coding tasks | qwen2.5-coder:32b | 19 GB | Yes | Strong code generation |
| Large context | qwen3.5:27b | 17 GB | Yes | 256k context, highest local quality |
| Lightweight | phi3:mini | 2.3 GB | No | Fast, but no tool support — not compatible with Pinchy agents |

Pull the recommended default with:

```shell
ollama pull qwen3.5:9b
```

Agent templates have model recommendations built in. When you create an agent from a template, Pinchy automatically picks a model that fits the template’s needs — fast models for simple lookups, larger models for complex analysis.

Some templates require specific capabilities that not all models support. If your installed models can’t satisfy a template’s requirements, the template card will appear greyed out with a tooltip explaining what’s missing.

The following templates analyze documents and images and require a model with vision support:

  • Contract Analyzer — reads and summarizes contract clauses
  • Resume Screener — extracts structured data from uploaded CVs
  • Proposal Comparator — compares multiple document uploads side by side
  • Compliance Checker — audits documents against policy requirements

To enable these templates, pull a vision-capable model:

```shell
# Recommended: Qwen2.5-VL (strong vision + tool calling)
ollama pull qwen2.5vl:7b

# Alternative: LLaMA 3.2 Vision
ollama pull llama3.2-vision:11b
```

Verify vision support before pulling:

```shell
ollama show qwen2.5vl:7b | grep vision
```

Pinchy groups models into three tiers based on parameter count:

| Tier | Parameter range | Example |
| --- | --- | --- |
| Fast | < 10B | qwen3.5:9b |
| Balanced | 10B – 39B | qwen2.5-coder:32b |
| Reasoning | 40B+ | qwen3.5:72b |
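The cutoffs in the table can be sketched as a simple classification (illustrative only, not Pinchy's actual selection code):

```shell
# Map a model's parameter count (in billions) to its tier.
tier() {
  if [ "$1" -lt 10 ]; then echo "Fast"
  elif [ "$1" -lt 40 ]; then echo "Balanced"
  else echo "Reasoning"
  fi
}

tier 9    # Fast      (qwen3.5:9b)
tier 32   # Balanced  (qwen2.5-coder:32b)
tier 72   # Reasoning (qwen3.5:72b)
```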

Templates declare a preferred tier; Pinchy picks the best installed match. If no model at the preferred tier is installed, it falls back to whatever is available.

Local models are slower than cloud APIs — sometimes by an order of magnitude. Plan for the following on a modern Apple Silicon Mac or a single mid-range GPU:

  • Simple chat reply: 5–15 seconds
  • Tool-using reply (e.g. Smithers consulting documentation): 60–120 seconds
  • First request after a long idle: add 10–30 seconds for model load

The reason tool-using replies are so much slower is that each tool round-trip forces a fresh inference pass over the entire growing context — system prompt, conversation history, and previous tool outputs all get re-processed. On cloud GPUs this is invisible; on local hardware the prefill phase dominates.
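To see how this compounds, here is a toy calculation with assumed numbers: a 4,000-token starting context that grows by 2,000 tokens per tool round trip.

```shell
# Each round trip re-processes the entire context so far,
# so total prefill work is the sum of the growing context sizes:
# 4000 + 6000 + 8000 + 10000 + 12000 tokens over 5 round trips.
total=0
ctx=4000
for round in 1 2 3 4 5; do
  total=$((total + ctx))   # this round's prefill cost
  ctx=$((ctx + 2000))      # context grows with the tool output
done
echo "$total"   # 40000
```

Five round trips re-process 40,000 tokens in total, far more than any single reply's context, which is why tool-heavy turns dominate local latency.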

Pinchy keeps the WebSocket alive and shows a thinking indicator the whole time, so the UI never looks stuck — but the wait is real. If responsiveness matters more than air-gap compliance, consider mixing local (privacy-sensitive agents) with cloud providers (interactive agents).

“Could not connect to Ollama at this URL”

  • Check that Ollama is running: ollama list should show your models
  • Verify the URL matches your deployment option (see above)
  • If Ollama runs on the host and Pinchy in Docker, use http://host.docker.internal:11434 — not http://localhost:11434

No models appear after connecting

  • Pull at least one model first: ollama pull qwen3.5:9b
  • If Ollama runs in Docker, pull from inside the container: docker compose exec ollama ollama pull qwen3.5:9b

“No compatible models found”

  • Pinchy agents require models with tool calling support. Not all Ollama models support this.
  • Pull a compatible model: ollama pull qwen3.5:9b
  • You can check a model’s capabilities with ollama show <model> — look for “tools” in the capabilities list.

Slow responses

  • See Performance Expectations above — tool-using replies on local hardware genuinely take 1–2 minutes; this is not a bug
  • A GPU makes a big difference — even a modest one speeds up inference significantly
  • Quantized models (e.g., qwen3.5:9b-q4_0) trade some quality for speed
  • For maximum responsiveness, use a cloud provider for interactive agents and reserve local models for privacy-sensitive workloads