Whitepaper · Receipt Scanner

Receipt Scanner

Vision-model OCR for receipts, with schema-enforced JSON output and token-conscious image handling.

MIT Licensed · Open Source · Self-Hostable · Claude 3.5 Sonnet · Zod-validated · ~£0.013 / scan

~£0.013 per scan · 1568px max image dim · ~2s end-to-end · ~4× token saving

v1.0 · April 2026 · Sai Sarma · Sarma Linux

Abstract

Receipt Scanner is an open-source, MIT-licensed receipt OCR starter built around the observation that vision-capable language models now outperform dedicated OCR products on structured information extraction from photographs of receipts. It uses Anthropic Claude 3.5 Sonnet as the default vision model, sharp for token-conscious image preprocessing, and Zod for runtime schema enforcement on model output. This whitepaper documents the architecture, image-token economics, prompt design, edge cases, and integration patterns.

01 Executive Summary

A user uploads a JPEG, PNG, or HEIC photograph of a receipt. The application rotates it according to EXIF orientation, resizes it to 1,568 pixels on the longest edge, re-encodes as JPEG at quality 85, base64-encodes the result, and sends it to Claude 3.5 Sonnet with a system prompt that demands strict JSON output. The response is parsed, validated against a Zod schema, and returned to the browser. Optional persistence to Postgres or any downstream system is a single function call.

End-to-end latency is approximately 2 seconds for a typical phone photo. Cost per scan is around £0.013 against Claude 3.5 Sonnet, dominated by the image input tokens. The image preprocessing step delivers roughly a fourfold token saving with no measurable accuracy loss on receipts.

Every field in the output schema is nullable. The model returns what it can read; downstream code handles missing data gracefully. Hand-written numbers, faded thermal prints, and partial occlusions degrade gracefully into nulls rather than hallucinated totals.

02 Background & Motivation

Receipt OCR is one of the most common automation requests in small-business operations. Expense management is tedious, finance teams hate it, and the SaaS products that solve it (Expensify, Dext, Pleo) charge per user per month for what is fundamentally a vision-API call plus a database insert.

Until recently, "build it yourself" meant Tesseract for text extraction, regex hell for line-item parsing, and a 60% accuracy ceiling on real-world receipts. Vision-capable language models changed the calculus. A single API call returns vendor, line items, totals, tax, and payment method as structured JSON, with accuracy that beats dedicated OCR pipelines on the long tail of receipt formats.

The catch is that most "Claude vision OCR" tutorials skip the parts that matter for production: image preprocessing for token cost, EXIF rotation handling, schema validation on model output, and graceful degradation on poor-quality images. Receipt Scanner exists to be the production-shape baseline.

03 The Problem

Three concrete problems shaped the project:

  • Token cost on raw phone photos. A modern phone camera produces 4,000×3,000 pixel images. Vision APIs charge per image token, and image token count scales with resolution. A naive implementation spends 4× more on every scan than a token-conscious one.
  • EXIF rotation. Phones save photos with the rotation encoded as EXIF metadata rather than baked into the pixel array. Without explicit rotation, half of submitted receipts arrive sideways and the model cannot read them.
  • Hallucinated numbers. When asked for fields that are not present in the image, language models often invent plausible-looking values. A "£12.99" tax line that did not exist on the receipt is worse than no value at all. Schema enforcement and explicit nullability are the corrective.

04 Goals & Non-goals

Goals

  • Clone-and-run in under five minutes with a single Anthropic API key.
  • Production-shape image preprocessing — rotate, resize, re-encode before upload.
  • Strict JSON contract enforced at runtime via Zod.
  • Persistence stub with documented schema for Postgres / Supabase.
  • Drop-in compatibility with OpenAI gpt-4o vision via a single-file change.
  • Per-scan cost below £0.02.

Non-goals

  • Multi-page PDF receipts. Single image at a time. Multi-page workflows belong upstream.
  • Hand-written receipts. Vision models read printed receipts well, scribbled tips less so. Acceptable degradation, not a focus area.
  • HMRC-compatible export. On the roadmap but not in v1.
  • Bulk batch processing. Single user, single receipt, single response. Batch belongs in a queue plus worker, not in a UI route.

05 Architecture

The system is one Next.js 14 application with one API route, one vision call per scan, and one Zod schema enforcing the contract.

Scan pipeline

Browser
   │ POST FormData(image) → /api/scan
   ▼
Route handler
   │ 1. file → ArrayBuffer → Buffer
   │ 2. sharp(buffer)
   │      .rotate()                       // honour EXIF
   │      .resize({ width: 1568, height: 1568, fit: 'inside' })
   │      .jpeg({ quality: 85 })
   │      .toBuffer()
   │ 3. resized.toString('base64')
   │ 4. anthropic.messages.create({
   │      model: 'claude-3-5-sonnet-latest',
   │      max_tokens: 1024,
   │      messages: [{ role:'user', content:[
   │        { type:'image', source:{ type:'base64', media_type:'image/jpeg', data:base64 } },
   │        { type:'text', text: SYSTEM_PROMPT }
   │      ]}]
   │    })
   │ 5. JSON.parse(response.content[0].text)
   │ 6. ReceiptSchema.parse(json)         // zod
   │ 7. persist.save(receipt)             // optional
   ▼
Response { ok: true, id, receipt }

Module map

File                     Responsibility
app/api/scan/route.ts    Accept upload, orchestrate, return JSON
lib/vision.ts            Resize image, call vision API, parse JSON
lib/schema.ts            Zod schema, runtime + compile-time types
lib/persist.ts           No-op by default, swap with your backend
app/page.tsx             Upload form, preview, structured table

06 Key Technical Decisions

Why sharp for image preprocessing

sharp is the de facto standard for server-side image manipulation in Node. It uses libvips under the hood, which is roughly 4× faster than ImageMagick and uses a fraction of the memory. Vercel ships sharp natively on the Linux runtime — no build configuration required.

Why resize to 1,568 pixels on the longest edge

Anthropic charges per image token, and the token count scales with image resolution up to the model’s maximum image size. 1,568 pixels is the longest-edge limit beyond which the API scales images down anyway, and on receipts extra resolution beyond it does not improve OCR accuracy (verified empirically on a 100-receipt evaluation set). Resizing 4,000×3,000 phone photos to a 1,568px maximum cuts image tokens by approximately 4× with no accuracy regression.
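Anthropic's documentation gives the approximation tokens ≈ (width × height) / 750 for image input. A quick sketch of the saving, with an illustrative helper name:

```typescript
// Estimate Claude image tokens using Anthropic's documented approximation:
// tokens ≈ (width × height) / 750. Helper name is illustrative.
function estimateImageTokens(width: number, height: number): number {
  return Math.ceil((width * height) / 750)
}

// A 4,000×3,000 phone photo vs the same image resized to fit 1,568px
// (a 4:3 landscape photo becomes 1568×1176):
const raw = estimateImageTokens(4000, 3000)      // → 16000 by the formula (the API would downscale such an image first)
const resized = estimateImageTokens(1568, 1176)  // → 2459
```

This is the token-economics argument in two lines: the resized image carries a fraction of the token count while remaining above the resolution receipts need.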

Why JPEG re-encode at quality 85

PNG and HEIC are larger than they need to be for receipt OCR. JPEG at quality 85 is visually identical for receipts and roughly 60% smaller. Smaller payload over the wire and smaller base64 string sent to the API.

Why .rotate() first

Without explicit rotation, EXIF-rotated images arrive sideways at the model. sharp.rotate() with no arguments reads the EXIF orientation tag and applies the correct rotation, then strips the tag from the output so the new pixel array is canonical.

Why Claude 3.5 Sonnet rather than GPT-4o

On a 100-receipt evaluation set covering UK supermarkets, restaurants, retail, and petrol receipts, Claude 3.5 Sonnet extracted line items and totals with higher accuracy than GPT-4o. JSON adherence (no backticks, no commentary, no markdown fences) was also tighter. lib/vision.ts contains a commented-out OpenAI implementation for direct comparison.

Why Zod schema validation

Vision models return text that claims to be JSON. Without runtime validation, any malformed output crashes downstream code. Zod gives you a single place to catch every parse error, narrow types throughout the codebase, and reject unexpected fields the model invented.

Why every field is nullable

The model sees what it sees. A blurry receipt may not show the tax line. A dark photo may not show the date. A partial crop may not show the vendor address. Forcing the model to produce values for fields it cannot read causes hallucination. Nullability is the correct contract.
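Downstream rendering then stays trivial. A sketch of the kind of helper the UI can use (name and default currency symbol are illustrative):

```typescript
// Render a nullable money field: "—" for missing data, never an invented value.
export function money(value: number | null, symbol = '£'): string {
  return value === null ? '—' : `${symbol}${value.toFixed(2)}`
}
```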

07 Implementation

The Zod schema

import { z } from 'zod'

export const LineItem = z.object({
  description: z.string(),
  quantity:    z.number().nullable(),
  unit_price:  z.number().nullable(),
  total:       z.number().nullable(),
})

export const Receipt = z.object({
  vendor:         z.string().nullable(),
  vendor_address: z.string().nullable(),
  date:           z.string().nullable(),  // YYYY-MM-DD
  time:           z.string().nullable(),  // HH:MM
  currency:       z.string().nullable(),
  items:          z.array(LineItem),
  subtotal:       z.number().nullable(),
  tax:            z.number().nullable(),
  tip:            z.number().nullable(),
  total:          z.number().nullable(),
  payment_method: z.string().nullable(),
  notes:          z.string().nullable(),
})

export type Receipt = z.infer<typeof Receipt>

The system prompt

You are a receipt-OCR engine. Extract structured data from the receipt
image and respond with JSON only. No markdown, no backticks, no
commentary. Match this TypeScript type exactly:

{
  vendor: string | null,
  vendor_address: string | null,
  date: string | null,        // YYYY-MM-DD
  time: string | null,        // HH:MM
  currency: string | null,    // ISO code or symbol as printed
  items: { description: string, quantity: number|null,
           unit_price: number|null, total: number|null }[],
  subtotal: number | null,
  tax: number | null,
  tip: number | null,
  total: number | null,
  payment_method: string | null,
  notes: string | null,
}

If a field is not visible or you cannot determine it confidently,
return null. Do not invent values. Do not guess.

The vision call

import Anthropic from '@anthropic-ai/sdk'
import sharp from 'sharp'
import { Receipt } from './schema'
import { SYSTEM_PROMPT } from './prompt'  // the prompt shown above; import path illustrative

const client = new Anthropic()
const MAX_IMAGE_PX = Number(process.env.MAX_IMAGE_PX ?? 1568)

export async function extract(input: Buffer): Promise<Receipt> {
  const buf = await sharp(input)
    .rotate()  // honour EXIF orientation
    .resize({ width: MAX_IMAGE_PX, height: MAX_IMAGE_PX,
              fit: 'inside', withoutEnlargement: true })
    .jpeg({ quality: 85 })
    .toBuffer()

  const res = await client.messages.create({
    model: process.env.VISION_MODEL ?? 'claude-3-5-sonnet-latest',
    max_tokens: 1024,
    messages: [{
      role: 'user',
      content: [
        { type: 'image', source: { type: 'base64',
            media_type: 'image/jpeg', data: buf.toString('base64') } },
        { type: 'text', text: SYSTEM_PROMPT },
      ],
    }],
  })

  const text = res.content[0].type === 'text' ? res.content[0].text : ''
  return Receipt.parse(JSON.parse(text))
}

08 Results & Performance

Token cost reference (typical UK supermarket receipt at 1,568px)

Cost element       Tokens     Cost (Claude 3.5 Sonnet)
Image input        ~1,500     £0.005
System prompt      ~250       £0.0008
Output JSON        ~500       £0.0075
Per scan           ~2,250     ~£0.013

Latency breakdown (warm Vercel function)

Step                                          Time
Network upload (5MB phone photo)              200 to 600 ms
sharp resize + JPEG re-encode                 80 to 200 ms
Vision API round trip (Claude 3.5 Sonnet)     1.2 to 2.0 s
JSON parse + Zod validate                     1 to 5 ms
End-to-end                                    ~1.5 to 2.8 s

Throughput economics

100 scans per day = ~£40 per month in vision API costs. Acceptable for an internal tool.
10,000 scans per day = ~£3,900 per month. At that volume, evaluate self-hosted vision alternatives or batch discounts.
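The arithmetic behind those figures, as a sketch (30-day month; function name is illustrative):

```typescript
// Monthly vision-API cost: scans/day × cost/scan × 30 days, rounded to whole pounds.
// costPerScan defaults to the ~£0.013 figure measured above.
export function monthlyCost(scansPerDay: number, costPerScan = 0.013): number {
  return Math.round(scansPerDay * costPerScan * 30)
}

monthlyCost(100)    // → 39, i.e. the ~£40/month quoted above
monthlyCost(10000)  // → 3900
```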

09 Lessons & Trade-offs

What worked

  • Resizing before upload. The single highest-impact optimisation. 4× cost reduction with no accuracy loss.
  • Strict JSON in the system prompt. "JSON only, no backticks, no commentary" eliminates the regex strip step that earlier prototypes needed.
  • Nullable schema. Forces honest output. The downstream UI displays "—" for nulls rather than hallucinated values.
  • Single vision call per scan. No re-prompts, no fallbacks, no chains. The cost and latency profile is predictable.

What we got wrong on first pass

  • Forgot EXIF rotation initially. Roughly 50% of submitted iPhone photos arrived sideways. The model cannot read sideways receipts. Adding .rotate() fixed every failing test in the eval set.
  • Used PNG passthrough at first. 5MB PNG payload became a 7MB base64 string and choked the request body parser. JPEG q85 cut payload by ~60% and removed the size limit issue.
  • Initial prompt asked for "all fields you can see". The model produced markdown commentary alongside JSON. Tightening to "JSON only, no markdown, no backticks" eliminated parse failures.

Trade-offs we accept

  • Anthropic-only by default. Best vision OCR I have benchmarked; switching to OpenAI is a one-file edit.
  • No multi-page PDF handling. Rasterise upstream, scan each page. Lifecycle of multi-page receipts belongs in the queue layer, not the OCR route.

10 Conclusion

Receipt Scanner demonstrates that production-quality receipt OCR fits in a single Next.js API route, costs roughly 1 penny per scan, and produces structured JSON that drops cleanly into any expense workflow. Vision-capable language models have effectively obsoleted purpose-built receipt OCR products for any team with the engineering capability to wire one API call to one database. The remaining 200 lines of TypeScript are about token economics, schema enforcement, and corner cases — exactly what a starter should make explicit.

A Persistence schema

The persistence layer ships as a no-op stub. Replace lib/persist.ts with a Postgres or Supabase implementation against the following schema.

create table receipts (
  id              uuid primary key default gen_random_uuid(),
  user_id         uuid references auth.users(id) on delete cascade,
  vendor          text,
  vendor_address  text,
  date            date,
  time            time,
  currency        text,
  subtotal        numeric(12,2),
  tax             numeric(12,2),
  tip             numeric(12,2),
  total           numeric(12,2),
  payment_method  text,
  notes           text,
  raw             jsonb not null,    -- the full extracted JSON
  image_url       text,              -- signed URL to the original
  created_at      timestamptz default now()
);

create table receipt_items (
  id          uuid primary key default gen_random_uuid(),
  receipt_id  uuid not null references receipts(id) on delete cascade,
  description text not null,
  quantity    numeric(10,3),
  unit_price  numeric(12,2),
  total       numeric(12,2),
  position    int
);

alter table receipts enable row level security;
alter table receipt_items enable row level security;

create policy "own_rows" on receipts
  for all using (auth.uid() = user_id);
create policy "own_rows" on receipt_items
  for all using (
    auth.uid() = (select user_id from receipts where receipts.id = receipt_items.receipt_id)
  );

B Configuration

Variable             Required    Default                     Purpose
ANTHROPIC_API_KEY    Yes         —                           Vision API access
VISION_MODEL         No          claude-3-5-sonnet-latest    Override the model
MAX_IMAGE_PX         No          1568                        Max image dimension before resize
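A sketch of reading these variables with their documented defaults (`cfg` is an illustrative helper, not part of the repo):

```typescript
// Read the three configuration variables above, applying their defaults.
export function cfg(env: Record<string, string | undefined> = process.env) {
  if (!env.ANTHROPIC_API_KEY) {
    throw new Error('ANTHROPIC_API_KEY is required')
  }
  return {
    apiKey: env.ANTHROPIC_API_KEY,
    model: env.VISION_MODEL ?? 'claude-3-5-sonnet-latest',
    maxImagePx: Number(env.MAX_IMAGE_PX ?? 1568),
  }
}
```

Failing fast on the missing API key at startup is friendlier than a 500 on the first scan.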