How RAG-over-PDF works

The whole loop in plain English. Chunk, embed, retrieve, generate. Why every piece earns its place, how each one breaks, and how to extend each subsystem.

TL;DR

Seven steps, six subsystems.
No framework tax.

RAG is not magic. It is seven steps in a row: parse the document, split it into chunks, turn each chunk into a vector, store the vectors, embed the question, retrieve the closest chunks, then ask the LLM to answer using only those chunks.

Frameworks like LangChain abstract those seven steps behind class hierarchies. That is fine when you have a hundred-document pipeline with reranking, query rewriting, and multi-vector retrieval. It is overkill when you have one PDF and a chat box.

This project exposes every step in roughly 600 lines of TypeScript. Read it once and you understand how every "chat with my docs" product on the internet works under the hood.

User uploads: company-handbook.pdf (50 pages, 100k chars)

Step 1 · Parse      pdf-parse → 100,000 chars of text
Step 2 · Chunk      1,000-char windows, 200 overlap → ~125 chunks
Step 3 · Embed      openai.embeddings → 125 × Float32[1536]
Step 4 · Store      vectorStore.add(chunks)

User asks: "what is the refund policy?"

Step 5 · Embed      openai.embeddings([question]) → Float32[1536]
Step 6 · Retrieve   cosine search → top-5 chunks
Step 7 · Generate   gpt-4o-mini stream → tokens → user

600ms first token. ~£0.001 total cost.
Core Data Flow

Indexing & Retrieval

Two passes through the system: indexing runs once per PDF, retrieval runs on every question.

┌──────────────── INDEXING (one time per PDF) ────────────────┐
│  Browser                                                    │
│   │ POST FormData(file)                                     │
│   ▼                                                         │
│  /api/upload                                                │
│   │  buffer  ──▶  pdf-parse  ──▶  text                     │
│   │  text    ──▶  chunker    ──▶  string[]                 │
│   │  chunks  ──▶  openai.embed ──▶ number[][]              │
│   │  pairs   ──▶  vectorStore.add()                        │
│   ▼                                                         │
│  200 OK { chunks: 125 }                                     │
└─────────────────────────────────────────────────────────────┘

┌─────────────── RETRIEVAL (per question) ────────────────────┐
│  Browser                                                    │
│   │ POST { question }                                       │
│   ▼                                                         │
│  /api/chat                                                  │
│   │  question  ──▶  openai.embed  ──▶  qVec                │
│   │  qVec      ──▶  vectorStore.search(k=5)  ──▶  top-5    │
│   │  top-5     ──▶  buildPrompt(system, ctx, q)            │
│   │  prompt    ──▶  openai.chat.stream                     │
│   │  delta     ──▶  ReadableStream  ──▶  Browser           │
│   ▼                                                         │
│  Tokens stream until completion                             │
└─────────────────────────────────────────────────────────────┘
Subsystems

Each piece, deep-dived

Why it exists. How it works. Where it breaks.

PDF parsing

Why it exists

Text-only extraction. We do not need rasterised pages, layout reconstruction, or font metrics — only the readable text the user expects to ask questions about.

How it actually works

pdf-parse walks the PDF's text objects and concatenates them in reading order. It runs in pure JavaScript with no native bindings, which means it works on every Node host including Vercel's serverless runtime. Scanned PDFs return empty text — that is a deliberate non-goal handled with a 400 response.
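A sketch of the parse step as it might sit in the upload route; the error message and response shape here are illustrative, not the repo's exact code:

// app/api/upload/route.ts (parse step only; chunk/embed/store elided)
// pdf-parse's default export takes a Buffer and resolves to { text, ... }.
import pdf from "pdf-parse";

export async function POST(req: Request) {
  const form = await req.formData();
  const file = form.get("file") as File;
  const buffer = Buffer.from(await file.arrayBuffer());

  const { text } = await pdf(buffer);
  if (!text.trim()) {
    // Scanned, image-only PDFs have no text layer: a deliberate non-goal.
    return new Response("no extractable text", { status: 400 });
  }
  // ...steps 2-4: chunk, embed, vectorStore.add...
  return Response.json({ chars: text.length });
}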

Chunking

Why it exists

Whole documents are too big to send to an LLM. Whole pages are too coarse for retrieval. Sentences are too narrow to carry context. Paragraph-sized chunks with overlap are the practical sweet spot.

How it actually works

The chunker walks the text in 1,000-character windows, sliding by 800 characters each step (so each pair of consecutive chunks shares 200 characters of overlap). A sentence that spans two chunks is therefore findable in either chunk. Configurable via CHUNK_SIZE and CHUNK_OVERLAP env vars.
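The whole chunker fits in a few lines. A minimal sketch under those defaults; the repo's actual implementation may differ in edge-case handling:

const CHUNK_SIZE = Number(process.env.CHUNK_SIZE ?? 1000);
const CHUNK_OVERLAP = Number(process.env.CHUNK_OVERLAP ?? 200);

export function chunkText(text: string): string[] {
  const step = CHUNK_SIZE - CHUNK_OVERLAP; // 800 chars per slide by default
  const chunks: string[] = [];
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + CHUNK_SIZE));
    if (start + CHUNK_SIZE >= text.length) break; // final window reached the tail
  }
  return chunks;
}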

Embedding

Why it exists

Cosine similarity over high-dimensional vectors is the cheapest reliable way to surface "this chunk is about the same thing as this question". OpenAI's text-embedding-3-small is best-in-class for the price.

How it actually works

The chunks are sent to OpenAI in a single batch request, which returns one 1,536-dimensional Float32 vector per chunk. The same model embeds the user question at retrieval time so the two vectors live in the same space. £0.000016 per 1k tokens means a 500-page PDF indexes for about 2 pence.
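Sketched against the official openai SDK; the helper name embed is an assumption, not necessarily the repo's:

import OpenAI from "openai";

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

// One batched request: input accepts a string[] and the response
// preserves order, so res.data[i] is the vector for texts[i].
export async function embed(texts: string[]): Promise<number[][]> {
  const res = await openai.embeddings.create({
    model: "text-embedding-3-small",
    input: texts,
  });
  return res.data.map((d) => d.embedding);
}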

In-memory cosine index

Why it exists

For corpora under ten thousand chunks, the round-trip to a vector database is slower than the cosine math itself. In-memory wins on latency and zero infrastructure.

How it actually works

The store is a JavaScript array of { content, embedding, source }. Search computes cosine similarity against every chunk in O(N · 1536) and sorts. At 1,000 chunks the search completes in 12ms. At 10,000 chunks, 60ms. Beyond that, switch to pgvector.
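The whole store is small enough to show. A sketch that relies on one property: OpenAI embeddings come back unit-normalised, so a plain dot product equals cosine similarity (divide by the magnitudes if you swap providers):

type Entry = { content: string; embedding: number[]; source: string };

const store: Entry[] = [];

export function add(entries: Entry[]): void {
  store.push(...entries);
}

export function clear(): void {
  store.length = 0; // e.g. reset between uploads
}

export function search(qVec: number[], k = 5): Entry[] {
  // O(N · 1536): one dot product per stored chunk, then a sort.
  return store
    .map((entry) => {
      let dot = 0;
      for (let i = 0; i < qVec.length; i++) dot += qVec[i] * entry.embedding[i];
      return { entry, score: dot };
    })
    .sort((a, b) => b.score - a.score)
    .slice(0, k)
    .map((s) => s.entry);
}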

Prompt assembly

Why it exists

The LLM is a powerful reader, not a powerful researcher. The prompt forces it to ground answers in retrieved chunks and to refuse rather than hallucinate when the chunks do not cover the question.

How it actually works

The system prompt pins the refusal instruction. The retrieved chunks are inserted as numbered passages so the model can reference them. The user question follows. We deliberately do not include conversation history in the demo — single-shot question answering is the default.
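A sketch of buildPrompt matching the (system, ctx, q) shape in the diagram above; the refusal wording is illustrative, not the repo's verbatim prompt:

type Msg = { role: "system" | "user"; content: string };

export const SYSTEM_PROMPT =
  "Answer using ONLY the numbered passages below. " +
  "If they do not contain the answer, say so plainly.";

export function buildPrompt(system: string, chunks: string[], question: string): Msg[] {
  // Numbered passages let the model point back at its sources.
  const context = chunks.map((c, i) => `[${i + 1}] ${c}`).join("\n\n");
  return [
    { role: "system", content: `${system}\n\n${context}` },
    { role: "user", content: question }, // no conversation history, by design
  ];
}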

Streaming generation

Why it exists

Time to first token dominates perceived latency. Streaming the answer through a ReadableStream in an App Router route handler means users see characters appear within 600 to 900 milliseconds of pressing send.

How it actually works

The OpenAI SDK's stream:true option yields delta chunks. Each non-empty content delta is encoded and pushed into a Response stream. The client reads the body via the Fetch streams API and renders tokens as they arrive. No SSE library, no WebSocket setup.
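The chat route, sketched end to end; the import path and the retrieval stub are placeholders for the pieces sketched in earlier sections:

// app/api/chat/route.ts
import OpenAI from "openai";
import { buildPrompt, SYSTEM_PROMPT } from "@/lib/prompt"; // hypothetical path

const openai = new OpenAI();

export async function POST(req: Request) {
  const { question } = await req.json();

  // Stand-in for steps 5 and 6: embed the question, cosine-search top-5.
  const topChunks: string[] = []; // wire in embed() + search() from above

  const completion = await openai.chat.completions.create({
    model: "gpt-4o-mini",
    messages: buildPrompt(SYSTEM_PROMPT, topChunks, question),
    stream: true,
  });

  const encoder = new TextEncoder();
  const stream = new ReadableStream({
    async start(controller) {
      for await (const part of completion) {
        const delta = part.choices[0]?.delta?.content;
        if (delta) controller.enqueue(encoder.encode(delta)); // push each delta
      }
      controller.close();
    },
  });
  return new Response(stream, {
    headers: { "Content-Type": "text/plain; charset=utf-8" },
  });
}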

Technology choices

Why this, not that

Next.js 14 App Router

Why we use it

Streaming responses, file-based API routes, and edge-ready deployment in one framework. The whole backend is two route files.

Why not the alternative

Express + Vite frontend — two repos, two deployments, two CORS configurations to maintain. Zero benefit for this workload.

TypeScript strict mode

Why we use it

Vector dimensions, embedding shapes, and OpenAI response types are all checked at compile time. Refactors are safe; provider API changes break the build, not your users.

Why not the alternative

JavaScript — undefined property access at runtime is the most common failure mode in RAG systems. TypeScript catches that class entirely.

pdf-parse

Why we use it

Pure JavaScript, no native deps, single dependency, handles 95% of real PDFs by extracting the text layer.

Why not the alternative

unstructured — Python dependency awkward on Node serverless. pdf2pic — rasterises every page wastefully when you only need text.

OpenAI text-embedding-3-small

Why we use it

Cheapest credible embedding model on the market, 1,536 dims, faster than ada-002, scores higher on MTEB.

Why not the alternative

text-embedding-3-large — 3,072 dims is overkill for most corpora and doubles storage cost without proportionally better retrieval.

In-memory cosine

Why we use it

Zero infrastructure, zero network hops, sub-60ms search up to ten thousand chunks. Migration to pgvector is a single-file change when you outgrow it.

Why not the alternative

Pinecone — network round trip, billing surface, vendor lock-in. Worth it at a million chunks. Wasteful at a thousand.

gpt-4o-mini

Why we use it

Cheap, fast, smart enough to read context and answer accurately. $0.15 per million input tokens makes per-question cost trivial.

Why not the alternative

gpt-4o — an order of magnitude more expensive without measurably better answers when the context is already pre-filtered by retrieval.

Vercel deploy

Why we use it

Push to GitHub, live in 60 seconds. Streams work natively. The only secret is OPENAI_API_KEY.

Why not the alternative

AWS / GCP — 50x more configuration for the same result. Lambda + API Gateway + CloudFront for what Vercel does in one button click.

Performance & observability

What you can measure

Latency, cost, throughput. The numbers that matter.

600ms · time to first token · warm Vercel function, UK region
£0.001 · per question end-to-end · embedding + retrieval + generation
10k · chunks before pgvector pays off · in-memory cosine stays under 60ms

Failure modes you should expect

Scanned image PDF
Cause: pdf-parse returns empty text
Fix: Return 400 with "no extractable text". Add tesseract.js or a vision model upstream if OCR is needed.
OpenAI 429 on embeddings
Cause: Hit the per-minute rate limit during a large indexing run
Fix: Retry with exponential backoff, or move to a higher usage tier for larger rate limits. Not implemented in v1 — single-user starter.
Token limit exceeded
Cause: Very long PDF + high TOP_K + long chunks overflow the gpt-4o-mini context window
Fix: Reduce TOP_K, reduce CHUNK_SIZE, or upgrade the chat model with the env var.
Server restart loses index
Cause: In-memory store is volatile by design
Fix: Migrate to pgvector. The vector-store interface has three methods — replace them with Postgres calls.
Hallucinated answer
Cause: Retrieved chunks did not cover the question; model invented an answer
Fix: System prompt pins "say so plainly" — test adversarial questions to verify the refusal still fires after prompt edits.
Future direction

What’s next

Each item below is one pull request away. Contributions welcome.

Cross-encoder re-ranking

Take top-20 by cosine, re-rank with a cross-encoder, keep top-5. Measurable quality bump on ambiguous questions.

Hybrid search (BM25 + dense)

Sparse keyword matching catches rare terms that dense embeddings miss. Reciprocal rank fusion for the merge.
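Reciprocal rank fusion is small enough to sketch now; each document scores the sum of 1/(k + rank) across the ranked lists it appears in, with k conventionally set to 60:

export function reciprocalRankFusion(rankings: string[][], k = 60): string[] {
  const scores = new Map<string, number>();
  for (const list of rankings) {
    list.forEach((id, rank) => {
      // rank is 0-based here, so rank + 1 is the 1-based position.
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank + 1));
    });
  }
  return [...scores.entries()].sort((a, b) => b[1] - a[1]).map(([id]) => id);
}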

Multi-document UI

The store API already supports many sources. The demo UI is single-PDF for clarity. Extend to namespaces.

Citation rendering

Pass chunk IDs through the prompt and into the response. Render footnote-style citations in the chat view.

pgvector adapter

A drop-in lib/vector-store.pgvector.ts implementing the same interface. Documented schema in the whitepaper.

Local embedding option

Ollama-hosted bge-small or nomic-embed-text-v1.5. Stay on-prem for sensitive corpora.

Ready to try it?

Clone the repo. Add OPENAI_API_KEY. Upload a PDF. Ask a question. Roughly five minutes from zero to working.