Image Attachments

You can attach images directly to chat messages — drag and drop, paste from your clipboard, or use the attachment button in the composer. Pinchy takes care of the rest.

What you can send

| Format | Status |
| ------ | ------ |
| JPEG | Sent as-is when small, re-encoded to WebP when large |
| PNG | Always re-encoded to WebP (PNG → WebP usually shrinks 50–90%) |
| WebP | Sent as-is when small, re-encoded to fit when large |
| HEIC | Accepted directly; Pinchy re-encodes it to WebP before sending |
| Others | Convert to JPEG, PNG, or WebP first |

Each image can be up to 15 MB. If a file exceeds that, the server rejects the upload with "File exceeds maximum size of 15 MB", and the message appears on the failed upload chip once the server responds.

What happens automatically

Modern smartphone photos are often 5–12 MB. To make sure your agent's vision model actually sees them, Pinchy resizes and re-encodes large images to WebP under 1.9 MB before sending them inline. The result looks identical to the original at chat-sized viewing, and the file fits within the model's inline image budget. No setting is required; it happens automatically.
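The format table and the size limits above can be summarized as a small decision function. This is a minimal sketch, not Pinchy's actual code: the function name, the action labels, the assumption that "small" means "already under the 1.9 MB inline budget", and the use of decimal megabytes are all illustrative guesses.

```python
# Hypothetical constants; the docs give the limits but not how MB is counted.
MAX_UPLOAD_BYTES = 15_000_000      # 15 MB upload limit (decimal MB assumed)
INLINE_BUDGET_BYTES = 1_900_000    # 1.9 MB inline target (decimal MB assumed)

ALWAYS_REENCODE = {"png", "heic"}          # always converted to WebP
PASS_THROUGH_WHEN_SMALL = {"jpeg", "webp"}  # sent unchanged when small enough


def plan_upload(fmt: str, size_bytes: int) -> str:
    """Return an illustrative action label for one attached image."""
    fmt = fmt.lower()
    if fmt not in ALWAYS_REENCODE | PASS_THROUGH_WHEN_SMALL:
        return "convert-first"       # "Others" row: convert before attaching
    if size_bytes > MAX_UPLOAD_BYTES:
        return "reject"              # server refuses files over 15 MB
    if fmt in PASS_THROUGH_WHEN_SMALL and size_bytes < INLINE_BUDGET_BYTES:
        return "send-as-is"
    return "re-encode-to-webp"       # resize and re-encode to fit the budget
```

For example, `plan_upload("jpeg", 8_000_000)` falls through to the re-encode branch, which matches the smartphone-photo case described above.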

You can verify the conversion worked: the agent's response should reference what's actually in the picture. If the agent answers as if no image was attached, see "When something goes wrong" below.

Two paths, one upload

Each image you attach travels two routes simultaneously:

Inline to the model — the resized version is sent with the message itself, so a vision-capable model can "see" the image immediately without an extra step.
Saved to the agent's workspace — the original file lands in uploads/<filename> so the agent can re-read it later (e.g. when you ask follow-up questions about details, or when a sub-agent needs the full-resolution version).

Most of the time you don't need to think about this — the agent picks the right path. The workspace copy matters mainly for shared agents, where it's visible to anyone with access to the agent.
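To make the two routes concrete, the sketch below builds both destinations for one attachment. The function name and the returned labels are hypothetical, and stripping directory components from the filename is an assumption for illustration; only the `uploads/<filename>` location comes from the description above.

```python
from pathlib import PurePosixPath


def delivery_routes(filename: str) -> dict[str, str]:
    """Describe where one attachment ends up (illustrative only)."""
    # Assumption: keep only the final path component so the copy
    # always lands directly inside uploads/.
    safe_name = PurePosixPath(filename).name
    return {
        "inline": "resized version sent with the message",
        "workspace": f"uploads/{safe_name}",
    }
```

Calling `delivery_routes("receipt.jpg")` yields a workspace path of `uploads/receipt.jpg`, which is the copy the agent can re-read on follow-up questions.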

When something goes wrong

| You see | What it means | What to do |
| ------- | ------------- | ---------- |
| "Couldn't process this image format" | The image is in a format Pinchy can't re-encode (e.g. HEIC) and is too large. | Convert it to JPEG or PNG and try again. |
| "Image too large" | The encoded message exceeded the WebSocket frame limit. | Use a smaller image, or crop before attaching. |
| "No image-capable model is configured" | Your agent's model is text-only and no vision model exists to fall back to. | Ask an admin to configure a vision-capable provider (see below). |

Model capability matrix

Pinchy tracks one input-modality capability per model (vision) plus two model-trait capabilities (long-context and tools). Each capability is either present or absent; there is no partial support. Templates can require any of these (see "How templates declare required capabilities" below).

| Capability | What it unlocks |
| ---------- | --------------- |
| vision | The model can "see" images sent inline with a message. |
| long-context | The model handles very large contexts (200K+ tokens). |
| tools | The model emits structured tool calls; required by all agents. |

PDFs don't need a per-model capability: they are analyzed by a dedicated PDF tool whose model Pinchy resolves from your configured providers, independent of the agent's chat model. Audio and video files are not currently accepted as uploads.

The models table is seeded automatically at boot from Pinchy's built-in model catalog. For Anthropic, OpenAI, Google, and Ollama Cloud, no manual entry is required. For local Ollama, Pinchy detects vision-capable models from the Ollama API when you configure the URL during setup.

Vision support by model

Whether the agent can actually see an image depends on the underlying LLM. At the time of writing:

OpenAI GPT-5.x — full vision support
Anthropic Claude 4.x (Opus/Sonnet/Haiku) — full vision support
Google Gemini Pro/Flash — full vision support
Local Ollama — only models marked as vision-capable (e.g. llama3.2-vision, gemma3)

Text-only models: automatic image handling

You don't have to put your agent on a vision model just to send the occasional screenshot. When you attach an image to an agent whose chat model is text-only, Pinchy routes that one turn to a vision-capable model for you — the agent's configured model is left untouched, and the very next text-only message goes straight back to it. Nothing to toggle; the image just works.

Which model handles it: Pinchy prefers a vision model from the same provider as your agent's model (so the conversation stays coherent), and falls back to the image model configured for your deployment otherwise. If your agent uses tools, Pinchy only picks a fallback that handles tool calls reliably — so sending a screenshot never costs the agent its tools for that turn. The switch is recorded in the audit trail as chat.image_model_fallback, so it's always clear which model saw an image.

The only time this can't work is when no image-capable model is configured anywhere. In that case Pinchy tells you so directly: ask an admin to configure a vision-capable provider (Anthropic, OpenAI, Google, Ollama Cloud, or local Ollama with a vision model). Pinchy never silently drops the image and answers as if it weren't there.
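The selection order described in this section can be sketched as follows. This is an illustration under stated assumptions rather than Pinchy's implementation: the `Model` type and `pick_image_model` function are hypothetical, and "handles tool calls reliably" is approximated here by the model simply having the tools capability. The audit event name and the error text are taken from the text above.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Tuple


@dataclass(frozen=True)
class Model:
    id: str
    provider: str
    capabilities: frozenset


def pick_image_model(
    agent_model: Model,
    candidates: Iterable[Model],
    deployment_image_model: Optional[Model],
    agent_uses_tools: bool,
) -> Tuple[Model, Optional[str]]:
    """Return (model for this turn, audit event or None if no switch)."""

    def usable(m: Model) -> bool:
        if "vision" not in m.capabilities:
            return False
        # Assumption: tool reliability is modeled as the tools capability.
        return not agent_uses_tools or "tools" in m.capabilities

    if "vision" in agent_model.capabilities:
        return agent_model, None  # no fallback needed
    for m in candidates:  # 1) prefer the agent's own provider
        if m.provider == agent_model.provider and usable(m):
            return m, "chat.image_model_fallback"
    # 2) otherwise use the deployment-wide image model, if suitable
    if deployment_image_model is not None and usable(deployment_image_model):
        return deployment_image_model, "chat.image_model_fallback"
    raise LookupError("No image-capable model is configured")
```

The fallback applies to a single turn only; the agent's configured model is not modified, which is why the sketch returns a model instead of mutating anything.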

How templates declare required capabilities

When you create an agent from a template, Pinchy automatically picks a model that meets the template's capability requirements. Each template declares the capabilities it needs — for example, a document-analyzer template requires vision and long-context. Pinchy's model resolver searches your configured providers for a model that satisfies both.

If no configured model meets the requirements, Pinchy shows a warning at agent-creation time and offers to proceed with the best available match. Adding a vision-capable provider (Anthropic, OpenAI, Google, or Ollama Cloud with a vision model) gives the resolver more options to work with.
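As a rough illustration of capability matching, the sketch below returns the first model that satisfies every required capability, or else the closest match together with what it lacks. The function name is hypothetical, and ranking "best available match" by the fewest missing capabilities is an assumption; the documentation does not say how Pinchy ranks partial matches.

```python
from typing import Dict, Iterable, Optional, Set, Tuple


def resolve_model(
    required: Iterable[str],
    models: Dict[str, Set[str]],
) -> Tuple[Optional[str], Optional[Set[str]]]:
    """Return (model_id, missing_capabilities).

    missing_capabilities is empty for a full match. With no configured
    models at all, the result is (None, None).
    """
    required = set(required)
    best_id: Optional[str] = None
    best_missing: Optional[Set[str]] = None
    for model_id, caps in models.items():
        missing = required - set(caps)
        if not missing:
            return model_id, set()  # full match: stop searching
        # Assumption: fewer missing capabilities means a better match.
        if best_missing is None or len(missing) < len(best_missing):
            best_id, best_missing = model_id, missing
    return best_id, best_missing
```

A non-empty second value corresponds to the warning case: the caller could show which capabilities are missing before offering to proceed with the partial match.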

Custom agents created from the "Custom Agent" template have no required capabilities — you choose the model manually in Agent Settings → General.