How it works · Receipt Scanner

How Receipt Scanner works

One API route. One vision call. One Zod schema. Token-conscious image preprocessing, schema-enforced JSON output, graceful degradation on poor-quality images. The whole loop in plain English.

TL;DR

Six steps. One API call.
No regex.

Resize the image with sharp. Base64 encode it. Send to Claude with a strict JSON prompt. Parse the response. Validate against Zod. Return the typed object.

The hard problems — line item parsing, vendor normalisation, total reconciliation — are now things the language model does for you. What used to be a 2,000-line OCR pipeline is now 400 lines of TypeScript that you can read in an afternoon.

The remaining engineering is token economics, EXIF rotation, prompt strictness, and graceful degradation when the photo is genuinely unreadable. That is what this codebase makes explicit.

User uploads: tesco-receipt.jpg (4032×3024, 4.2MB, EXIF rotated)

Step 1 · Rotate     sharp.rotate()               // honour EXIF
Step 2 · Resize     .resize(1568, inside)        // longest-edge bound
Step 3 · Encode     .jpeg(85)                    // 4MB → 850KB
Step 4 · Vision     anthropic.messages.create    // ~1.4s
Step 5 · Parse      JSON.parse(reply)
Step 6 · Validate   Receipt.parse(json)          // zod

{ vendor: "Tesco Express", date: "2026-04-12",
  items: [...12], subtotal: 23.40, tax: 1.20,
  total: 24.60, payment_method: "Visa Debit" }

~2 seconds. ~£0.013. Fully validated.
Core Data Flow

The scan loop

┌──────────────── /api/scan ──────────────────────────────────┐
│  Browser                                                    │
│   │ POST FormData(image)                                    │
│   ▼                                                         │
│  Route handler (lib/vision.ts)                              │
│   │                                                         │
│   │  Buffer ──▶ sharp.rotate()                              │
│   │  rotated ──▶ .resize(1568, fit:'inside')                │
│   │  resized ──▶ .jpeg(85)                                  │
│   │  jpeg    ──▶ .toString('base64')                        │
│   │                                                         │
│   │  base64 ──▶  anthropic.messages.create({                │
│   │              model: 'claude-3-5-sonnet-latest',         │
│   │              messages: [{ role:'user', content:[        │
│   │                { type:'image', source: base64 },        │
│   │                { type:'text',  text: SYS_PROMPT } ]}]   │
│   │            })                                           │
│   │                                                         │
│   │  text  ──▶  JSON.parse                                  │
│   │  json  ──▶  Receipt.parse  (zod)                        │
│   │  rcpt  ──▶  persist.save  (optional)                    │
│   │                                                         │
│   ▼                                                         │
│  200 OK { ok: true, id, receipt }                           │
└─────────────────────────────────────────────────────────────┘
Subsystems

Each piece, deep-dived

Image preprocessing

Why it exists

Vision APIs charge per image token, and image tokens scale with resolution. A naive implementation that uploads the full 4032px photo pays roughly 4× the tokens per scan.

How it actually works

sharp.rotate() applies EXIF orientation. resize({ width: 1568, height: 1568, fit: "inside" }) bounds the longest edge at 1568px (both dimensions are needed; with only a width, sharp scales to that exact width and a portrait receipt's height could still exceed the bound). jpeg({ quality: 85 }) re-encodes to a compact format. The output is a Buffer roughly 60% smaller than the input image, with no measurable accuracy loss on receipts.

Vision call

Why it exists

A single language model call replaces a Tesseract pipeline with regex line-item parsers. The model sees the layout and extracts structure in one round trip.

How it actually works

anthropic.messages.create() with max_tokens 1024, model claude-3-5-sonnet-latest, content array containing one image block (base64 JPEG) and one text block (the system prompt). No streaming — we want the full JSON to validate before responding.
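The request body has the shape sketched below. The helper name and the prompt text are illustrative; the params object is what gets passed to anthropic.messages.create(...) via the Anthropic SDK.

```typescript
// Illustrative stand-in for the real system prompt (see "Prompt design").
const SYS_PROMPT =
  "Extract this receipt as JSON only. Return null for any field you cannot read.";

// Builds the Messages API params: one base64 image block, one text block.
export function buildScanRequest(base64Jpeg: string) {
  return {
    model: "claude-3-5-sonnet-latest",
    max_tokens: 1024,
    messages: [
      {
        role: "user" as const,
        content: [
          {
            type: "image" as const,
            source: {
              type: "base64" as const,
              media_type: "image/jpeg" as const,
              data: base64Jpeg,
            },
          },
          { type: "text" as const, text: SYS_PROMPT },
        ],
      },
    ],
  };
}
```

The handler awaits the full (non-streamed) response so the JSON can be validated before anything is returned to the browser.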

Prompt design

Why it exists

The model must produce parseable JSON, must not invent values, and must use null for fields it cannot read. The prompt is the contract.

How it actually works

The prompt embeds a TypeScript-style type definition, demands "JSON only, no markdown, no backticks", and explicitly instructs "do not invent values" and "return null if uncertain". This combination eliminates the regex-strip-markdown step that earlier prototypes needed.
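A minimal sketch of that contract is below. The exact wording and field list in the repo may differ; the structure (type definition plus explicit rules) is the point.

```typescript
// TypeScript-style type embedded in the prompt so the model knows the shape.
export const RECEIPT_TYPE = `type Receipt = {
  vendor: string | null;
  date: string | null;          // ISO 8601
  items: { name: string; price: number }[] | null;
  subtotal: number | null;
  tax: number | null;
  total: number | null;
  payment_method: string | null;
}`;

// The contract: shape first, then the rules that make the output parseable.
export const SYS_PROMPT = [
  "Read the receipt image and return a single JSON object matching this type:",
  RECEIPT_TYPE,
  "Rules:",
  "- JSON only. No markdown, no backticks, no commentary.",
  "- Do not invent values.",
  "- Return null for any field you cannot read with confidence.",
].join("\n");
```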

Schema validation

Why it exists

Vision models return text that claims to be JSON. Without runtime validation, malformed output crashes downstream code or — worse — silently propagates wrong types.

How it actually works

lib/schema.ts defines a Zod schema with every field nullable. Receipt.parse(json) either returns a typed object or throws a ZodError with the specific path that failed. Errors bubble up as a 422 with the validation message.

Persistence stub

Why it exists

Receipt Scanner is a starter. The "save the result" step belongs to your stack, not ours.

How it actually works

lib/persist.ts exports save(receipt: Receipt) that does nothing by default. Replace its body with a Supabase, Prisma, or raw pg insert. The Postgres schema is documented in docs/schema.sql and reproduced in the whitepaper.
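The stub amounts to the sketch below. The inline Receipt type and the returned id are stand-ins for the real schema types (the route's 200 response includes an id, so returning one here keeps the contract intact).

```typescript
import { randomUUID } from "node:crypto";

// Stand-in for the Receipt type exported by lib/schema.ts.
type Receipt = { vendor: string | null; total: number | null };

// Sketch of lib/persist.ts: a no-op by default. Replace the body with your
// own insert, e.g. a Supabase, Prisma, or raw pg call.
export async function save(_receipt: Receipt): Promise<string> {
  // e.g. await prisma.receipt.create({ data: _receipt }) in your stack
  return randomUUID(); // id echoed back in the 200 response
}
```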

UI rendering

Why it exists

The user wants a confidence check before they save. Show the image and the extracted fields side by side; nulls render as em dashes.

How it actually works

app/page.tsx is a single client component. Drag-and-drop upload, optimistic preview using URL.createObjectURL, fetch to /api/scan, render the structured table from the typed Receipt response. No state library, no form library, no toast library.
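Two fragments of that component, sketched with illustrative names: the upload call, and the helper that renders nulls as em dashes.

```typescript
// Upload-and-scan call made by the client component (name illustrative).
export async function scan(image: Blob): Promise<unknown> {
  const form = new FormData();
  form.append("image", image);
  const res = await fetch("/api/scan", { method: "POST", body: form });
  if (!res.ok) throw new Error(`scan failed: ${res.status}`);
  return res.json();
}

// Table-cell renderer: nulls become em dashes, amounts get two decimals.
export function display(value: string | number | null): string {
  if (value === null) return "—";
  return typeof value === "number" ? value.toFixed(2) : value;
}
```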

Technology choices

Why this, not that

Next.js 14 App Router

Why we use it

File-based API routes, native multipart parsing, edge-ready deployment in one framework. The whole backend is one route file.

Why not the alternative

Express + separate React frontend — two repos to maintain, CORS to configure, no benefit for a single-route service.

TypeScript + Zod

Why we use it

Schema-driven from start to finish. The same shape defines the runtime validator and the compile-time type. Single source of truth.

Why not the alternative

JSON Schema + ajv — duplicates the type elsewhere. Manual class definitions — diverges from the runtime check.

sharp (libvips)

Why we use it

4× faster than ImageMagick, fraction of the memory, ships natively on Vercel. Best-in-class for server-side image work.

Why not the alternative

jimp — pure JS, slower, missing some formats. ImageMagick — heavy native dependency, slower, and a more complicated licensing story around its delegate libraries.

Anthropic Claude 3.5 Sonnet

Why we use it

Best vision OCR I have benchmarked on real-world UK receipts: stricter JSON adherence than GPT-4o, and a lower hallucination rate on missing fields.

Why not the alternative

GPT-4o — slightly worse line-item accuracy on supermarket receipts. Gemini 2.5 Pro — better latency, slightly weaker structured-output adherence.

Single API call per scan

Why we use it

Predictable cost, predictable latency, no re-prompt loops. The model either reads the receipt or returns nulls.

Why not the alternative

Multi-pass agents — 3-5× cost, unbounded latency, marginal accuracy gain on a task this constrained.

Vercel deployment

Why we use it

sharp ships on Vercel's Linux runtime. Push to GitHub, set ANTHROPIC_API_KEY, you are live in 60 seconds.

Why not the alternative

Lambda — sharp native bindings need a custom layer. ECS — six AWS services for what one git push solves.

Performance & observability

What you can measure

~2s
end-to-end per scan
Resize ~200ms + vision ~1.4s

£0.013
per scan
Image + prompt + JSON output

~4×
token saving from resize
vs a naive 4032px upload

Failure modes you should expect

Image too dark or blurry
What happens: the model returns mostly null fields
Fix: the UI surfaces a "low confidence" hint when most fields are null; the user retakes the photo

Multi-page PDF receipt
What happens: only a single image is scanned per request; PDFs are out of scope for v1
Fix: rasterise upstream, scan each page, merge in your application layer

Hand-written receipt
What happens: vision models read printed receipts well, scribbled tips less so
Fix: acceptable degradation; partial extraction, with the notes field flagging the issue

Unsupported currency symbol
What happens: the model returns the symbol as raw text in the currency field
Fix: normalise to ISO 4217 in your downstream layer

Vision API 429
What happens: the request is rate limited during a busy period
Fix: a single tier of retries with exponential backoff, or upgrade your Anthropic plan

EXIF-rotated image
What happens: iPhone photos store orientation in EXIF rather than baking it into the pixels
Fix: sharp.rotate() with no arguments reads the EXIF orientation and applies it before re-encoding
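The 429 fix above can be sketched as a small retry wrapper. Names, attempt counts, and delays are illustrative, not values from the repo:

```typescript
// Single-tier retry with exponential backoff. Retryable errors get up to
// `attempts` tries with doubling delays; anything else is rethrown at once.
export async function withRetry<T>(
  fn: () => Promise<T>,
  isRetryable: (err: unknown) => boolean,
  attempts = 3,
  baseMs = 500,
): Promise<T> {
  for (let i = 0; ; i++) {
    try {
      return await fn();
    } catch (err) {
      if (i >= attempts - 1 || !isRetryable(err)) throw err;
      const delay = baseMs * 2 ** i; // 500ms, 1s, 2s, ...
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
}
```

Wrapping the vision call would look like `withRetry(() => anthropic.messages.create(params), (e) => (e as any)?.status === 429)`.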
Future direction

What’s next

Multi-page PDF receipts

Rasterise via pdfjs, scan each page, merge totals. Hotel folios, multi-day rentals.

Email-to-receipt ingestion

Inbound email parser. Forward a Pret receipt to receipts@yourdomain, get it in your database in seconds.

Bulk batch upload

Drag a folder, queue scans, show progress. Background worker processing rather than UI route.

HMRC-compatible export

CSV format that drops into Self Assessment expense schedules. Currency normalisation included.

Confidence scoring

Ask the model for a self-assessed confidence per field. Surface low-confidence fields in the UI for review.

Receipt deduplication

Hash the extracted fields, alert when the same receipt is scanned twice. Common in expense fraud workflows.
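The hashing step could look like the sketch below. The fingerprint fields chosen here (vendor, date, total) are an assumption; pick whichever fields identify a receipt in your domain.

```typescript
import { createHash } from "node:crypto";

// Deterministic fingerprint over the identifying fields of a scan; two
// scans of the same receipt hash to the same value.
export function receiptFingerprint(r: {
  vendor: string | null;
  date: string | null;
  total: number | null;
}): string {
  const key = [r.vendor ?? "", r.date ?? "", r.total?.toFixed(2) ?? ""].join("|");
  return createHash("sha256").update(key).digest("hex");
}
```

Store the fingerprint alongside the saved receipt and alert on a unique-constraint violation.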

Ready to try it?

Clone the repo. Add ANTHROPIC_API_KEY. Drop a receipt photo in. Get JSON back. Five minutes from zero to working.