
Set Up Local Ollama

Ollama lets you run large language models locally. When you connect it to Pinchy, your agents run entirely on your own hardware — no API keys, no cloud calls, no data leaving your infrastructure.

This is the setup for teams that need full air-gap compliance or simply want to keep everything in-house.

  • A running Pinchy instance (Installation)
  • Ollama installed and running with at least one model pulled

  1. Go to Settings → LLM Provider
  2. Click Ollama (Local)
  3. Enter the URL where Ollama is running (see deployment options below)
  4. Click Save — Pinchy validates the connection and discovers your models

That’s it. Your agents now use your local Ollama models.
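You can also sanity-check the connection outside Pinchy. Assuming Ollama is on its default port (11434), its `/api/tags` endpoint lists the models Pinchy will discover:

```shell
# List installed models via Ollama's REST API (default port assumed).
# Prints JSON along the lines of {"models":[{"name":"qwen3.5:9b", ...}]} when Ollama is up.
curl -s --max-time 2 http://localhost:11434/api/tags || echo "Ollama is not reachable"
```

If this prints the fallback message instead of JSON, fix connectivity before configuring Pinchy.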

Where you run Ollama depends on your setup. Here are the most common options.

The simplest path. Install Ollama directly on the machine that runs Pinchy.

```shell
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Pull a model
ollama pull qwen3.5:9b
```

In Pinchy, set the URL to:

http://host.docker.internal:11434
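Note that `host.docker.internal` resolves out of the box on Docker Desktop (macOS and Windows) but often not on a plain Linux host. If the connection fails there, a common fix is mapping the name to the host gateway in your compose file. A sketch, assuming your Pinchy service is named `pinchy` (substitute your actual service name):

```yaml
services:
  pinchy:   # hypothetical service name; use your actual Pinchy service
    extra_hosts:
      - "host.docker.internal:host-gateway"
```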

Run Ollama alongside Pinchy in Docker. Create a docker-compose.override.yml in your project root:

```yaml
services:
  ollama:
    image: ollama/ollama
    volumes:
      - ollama_data:/root/.ollama

volumes:
  ollama_data:
```

Then restart:

```shell
docker compose up -d
```

In Pinchy, set the URL to:

http://ollama:11434

For GPU acceleration, install the NVIDIA Container Toolkit first, then add GPU reservations to your override:

```yaml
services:
  ollama:
    image: ollama/ollama
    volumes:
      - ollama_data:/root/.ollama
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

volumes:
  ollama_data:
```
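After restarting, you can check that the container actually sees the GPU. A quick sanity check, assuming the `ollama` service name from the override above:

```shell
# Run nvidia-smi inside the Ollama container; it should list your GPU.
# Falls back to a message if docker or the GPU is unavailable.
docker compose exec ollama nvidia-smi || echo "GPU not visible in container"
```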

Everything else is the same — Pinchy URL is http://ollama:11434.

Run Ollama on a dedicated GPU machine and point Pinchy at it over the network.

On the Ollama server:

```shell
# Allow remote connections
OLLAMA_HOST=0.0.0.0 ollama serve
```

In Pinchy, set the URL to:

http://<server-ip>:11434
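Before saving the URL in Pinchy, confirm the server answers over the network and that port 11434 is open on its firewall. A sketch, where `SERVER_IP` is a placeholder for your GPU machine's address:

```shell
# Ollama's /api/version endpoint returns a small JSON blob when reachable.
SERVER_IP=192.168.1.50   # placeholder; substitute your server's IP
curl -s --max-time 2 "http://${SERVER_IP}:11434/api/version" || echo "unreachable"
```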
| Use case | Model | Size | Tool support | Why |
| --- | --- | --- | --- | --- |
| General agent | qwen3.5:9b | 6.6 GB | Yes | Reliable tool calling, good multilingual quality, multimodal — recommended default |
| Coding tasks | qwen2.5-coder:32b | 19 GB | Yes | Strong code generation |
| Large context | qwen3.5:27b | 17 GB | Yes | 256k context, highest local quality |
| Lightweight | phi3:mini | 2.3 GB | No | Fast, but no tool support — not compatible with Pinchy agents |

Pull the recommended default with:

```shell
ollama pull qwen3.5:9b
```

Agent templates have model recommendations built in. When you create an agent from a template, Pinchy automatically picks a model that fits the template’s needs — fast models for simple lookups, larger models for complex analysis.

Some templates require specific capabilities that not all models support. If your installed models can’t satisfy a template’s requirements, the template card will appear greyed out with a tooltip explaining what’s missing.

The following templates analyze documents and images and require a model with vision support:

  • Contract Analyzer — reads and summarizes contract clauses
  • Resume Screener — extracts structured data from uploaded CVs
  • Proposal Comparator — compares multiple document uploads side by side
  • Compliance Checker — audits documents against policy requirements

To enable these templates, pull a vision-capable model:

```shell
# Recommended: Qwen2.5-VL (strong vision + tool calling)
ollama pull qwen2.5vl:7b

# Alternative: LLaMA 3.2 Vision
ollama pull llama3.2-vision:11b
```

Verify vision support before pulling:

```shell
ollama show qwen2.5vl:7b | grep vision
```

Pinchy groups models into three tiers based on parameter count:

| Tier | Parameter range | Example |
| --- | --- | --- |
| Fast | < 10B | qwen3.5:9b |
| Balanced | 10B – 39B | qwen2.5-coder:32b |
| Reasoning | 40B+ | qwen3.5:72b |
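The cutoffs in the table can be sketched as a simple classification (illustrative only, not Pinchy's actual selection code):

```shell
# Map a model's parameter count (in billions) to its tier.
tier() {
  if [ "$1" -lt 10 ]; then echo "Fast"
  elif [ "$1" -lt 40 ]; then echo "Balanced"
  else echo "Reasoning"
  fi
}

tier 9    # Fast      (qwen3.5:9b)
tier 32   # Balanced  (qwen2.5-coder:32b)
tier 72   # Reasoning (qwen3.5:72b)
```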

Templates declare a preferred tier; Pinchy picks the best installed match. If no model at the preferred tier is installed, it falls back to whatever is available.

Local models are slower than cloud APIs — sometimes by an order of magnitude. Plan for the following on a modern Apple Silicon Mac or a single mid-range GPU:

  • Simple chat reply: 5–15 seconds
  • Tool-using reply (e.g. Smithers consulting documentation): 60–120 seconds
  • First request after a long idle: add 10–30 seconds for model load

The reason tool-using replies are so much slower is that each tool round-trip forces a fresh inference pass over the entire growing context — system prompt, conversation history, and previous tool outputs all get re-processed. On cloud GPUs this is invisible; on local hardware the prefill phase dominates.
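To see how this compounds, here is a toy calculation with assumed numbers: a 4,000-token starting context that grows by 2,000 tokens per tool round trip.

```shell
# Each round trip re-processes the entire context so far,
# so total prefill work is the sum of the growing context sizes:
# 4000 + 6000 + 8000 + 10000 + 12000 tokens over 5 round trips.
total=0
ctx=4000
for round in 1 2 3 4 5; do
  total=$((total + ctx))   # this round's prefill cost
  ctx=$((ctx + 2000))      # context grows with the tool output
done
echo "$total"   # 40000
```

Five round trips re-process 40,000 tokens in total, far more than any single reply's context, which is why tool-heavy turns dominate local latency.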

Pinchy keeps the WebSocket alive and shows a thinking indicator the whole time, so the UI never looks stuck — but the wait is real. If responsiveness matters more than air-gap compliance, consider mixing local (privacy-sensitive agents) with cloud providers (interactive agents).

“Could not connect to Ollama at this URL”

  • Check that Ollama is running: ollama list should show your models
  • Verify the URL matches your deployment option (see above)
  • If Ollama runs on the host and Pinchy in Docker, use http://host.docker.internal:11434 — not http://localhost:11434

No models appear after connecting

  • Pull at least one model first: ollama pull qwen3.5:9b
  • If Ollama runs in Docker, pull from inside the container: docker compose exec ollama ollama pull qwen3.5:9b

“No compatible models found”

  • Pinchy agents require models with tool calling support. Not all Ollama models support this.
  • Pull a compatible model: ollama pull qwen3.5:9b
  • You can check a model’s capabilities with ollama show <model> — look for “tools” in the capabilities list.

Slow responses

  • See Performance Expectations above — tool-using replies on local hardware genuinely take 1–2 minutes; this is not a bug
  • A GPU makes a big difference — even a modest one speeds up inference significantly
  • Quantized models (e.g., qwen3.5:9b-q4_0) trade some quality for speed
  • For maximum responsiveness, use a cloud provider for interactive agents and reserve local models for privacy-sensitive workloads