RAG-over-PDF
A minimal, production-shaped RAG starter for PDF question answering.
Abstract
RAG-over-PDF is an open-source, MIT-licensed Retrieval-Augmented Generation starter designed around the principle that the load-bearing parts of RAG fit in roughly 600 lines of TypeScript. It uses Next.js 14, the OpenAI text-embedding-3-small model, an in-memory cosine similarity index, and gpt-4o-mini for streaming generation. This whitepaper documents the architecture, chunking strategy, embedding economics, retrieval pipeline, latency profile, and the migration path to pgvector for teams who outgrow the in-memory store.
01 Executive Summary
RAG-over-PDF answers questions about a PDF. Upload the document and the application chunks it, embeds each chunk, and stores the vectors in memory. When a user asks a question, the question is embedded, the five most similar chunks are retrieved by cosine similarity, the chunks are stuffed into the system prompt, and an answer is streamed back through Server-Sent Events.
Time-to-first-token sits between 600 and 900 milliseconds on a warm Vercel function in the UK region. End-to-end answer cost is roughly £0.001 per question against gpt-4o-mini. Indexing a 500-page PDF costs about 2 pence in embedding fees. The whole application runs in one process with no external infrastructure beyond OpenAI itself.
The vector store interface has three methods: add, search, clear. Replacing it with a pgvector-backed implementation is a single-file change. Everything else in the pipeline — parsing, chunking, prompting, streaming — is independent of the storage layer.
02 Background & Motivation
"Chat with our docs" is the most common AI request from non-technical stakeholders. The default response is to reach for LangChain, provision Pinecone, write 400 lines of glue code, and produce something that nobody on the team understands six months later. The total cost — in dollars, time, and operational debt — is wildly out of proportion to the problem.
Most of that complexity is not load-bearing. Frameworks bring in abstractions for ingestion pipelines, agent loops, multi-vector retrievers, and tool routing — none of which a "ground a chatbot in a 50-page PDF" use case actually needs. Pinecone brings in a network hop, a billing surface, an SDK, and a vendor relationship for what is fundamentally a 1,536-dimensional dot product.
RAG-over-PDF is a deliberate counterweight. It does the simplest thing that works: parse the PDF in pure JavaScript, chunk by character count, embed with OpenAI, store in a JavaScript array, retrieve by cosine similarity, generate by streaming completion. Every line of code is in the repository, documented in the wiki, and ready to be rewritten when the requirements outgrow the defaults.
03 The Problem
Three concrete pain points motivated this project:
- Framework opacity. LangChain hides the chunking, embedding, retrieval, and prompt-construction steps behind class hierarchies. When retrieval quality drops, the team cannot tell whether the chunker, the embedder, the index, or the prompt is the cause.
- Infrastructure premature optimisation. Pinecone, Weaviate, Qdrant, and pgvector are all excellent vector databases. None of them are needed for a corpus of fewer than ten thousand chunks, where in-memory cosine on Float32Arrays is fast enough.
- Tutorial-quality starters. Most "RAG in 100 lines" examples skip streaming, ignore EXIF rotation on images, do not handle empty PDFs, do not validate model output, and do not run on real serverless platforms.
This project addresses all three: the code is exposed and small, the storage layer scales from zero to ten thousand chunks before needing replacement, and the corner cases (scanned PDFs, rate limits, token limits) are documented as deliberate non-goals rather than silent failures.
04 Goals & Non-goals
Goals
- Clone-and-run in under five minutes on a developer laptop.
- Production-shape: streaming responses, schema validation, error boundaries, deploy button.
- Readable end-to-end in 30 minutes by an intermediate TypeScript developer.
- Cost per question under £0.001 against OpenAI’s smallest models.
- Storage layer that scales to ten thousand chunks before needing replacement.
- Documented migration path to pgvector when the in-memory store is outgrown.
Non-goals
- Scanned PDFs. pdf-parse returns empty text on image-only PDFs. OCR is a deliberate omission to keep the stack lean — add tesseract.js or a vision model upstream if needed.
- Multi-document UI. The vector store API supports many sources; the demo UI exposes one PDF at a time.
- Re-ranking. A cross-encoder re-ranker on the top-20 retrieved chunks measurably improves quality. It is not included to keep the surface area small.
- Hybrid search. Dense plus BM25 retrieval is more robust than dense alone. Out of scope for a starter.
- Local embeddings. Ollama-based local embeddings are on the roadmap but not in v1.
05 Architecture
The system is one Next.js 14 application with two API routes, one in-memory store, and one UI page.
Indexing pipeline
Browser
│ POST FormData(file) → /api/upload
▼
Route handler
│ 1. file → ArrayBuffer → Buffer
│ 2. pdf-parse(buffer) → text
│ 3. chunk(text, 1000, 200) → string[]
│ 4. openai.embeddings.create({input: chunks})
│ 5. vectorStore.clear() then vectorStore.add(chunks, vectors)
▼
Response { chunks: 47 }

Question pipeline
Browser
│ POST { question } → /api/chat
▼
Route handler
│ 1. openai.embeddings.create({input: [question]})
│ 2. vectorStore.search(qVec, k=5) → top-5 chunks
│ 3. system prompt = base + chunks
│ 4. openai.chat.completions.create({stream: true})
▼
SSE → Browser (token by token)

Module map
| File | Responsibility |
|---|---|
| app/api/upload/route.ts | Accept PDF, parse, chunk, embed, store |
| app/api/chat/route.ts | Embed question, retrieve, stream completion |
| lib/chunker.ts | Fixed-size character chunking with overlap |
| lib/vector-store.ts | In-memory cosine index — three methods |
| lib/openai.ts | Single client, single embed helper |
| app/page.tsx | Upload form, chat view, streaming render |
06 Key Technical Decisions
Why pdf-parse rather than unstructured or pdf2pic
Both alternatives are heavier. pdf2pic rasterises every page into images that then need OCR — wasted work for born-digital PDFs. unstructured is excellent but has Python dependencies that are awkward on serverless Node hosts. pdf-parse is pure JavaScript, ships as a single dependency, and handles roughly 95 percent of real-world PDFs by extracting their embedded text layer.
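As a concrete sketch of the parse step (assuming pdf-parse's default export; the helper name and surrounding code are illustrative, not necessarily what the upload route contains):

```typescript
// Sketch: extract the embedded text layer from an uploaded PDF with pdf-parse.
import pdf from 'pdf-parse'

export async function extractText(file: File): Promise<string> {
  const buffer = Buffer.from(await file.arrayBuffer())
  const parsed = await pdf(buffer) // pure-JS parse of the text layer, no rasterisation
  return parsed.text               // empty string for image-only (scanned) PDFs
}
```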
Why fixed-size chunking with overlap
Semantic and layout-aware chunking are objectively better for documents with complex structure. They also require tuning per corpus, more code, and content-type-specific heuristics. For a starter, consistency beats cleverness: 1,000 characters with 200 overlap works for prose, contracts, manuals, and academic papers. Teams measure retrieval quality on their corpus and tune from there.
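A minimal sketch of fixed-size chunking with overlap, consistent with the chunk(text, 1000, 200) call in the indexing pipeline (the actual lib/chunker.ts may differ in detail):

```typescript
// Fixed-size character chunking with overlap.
export function chunk(text: string, size = 1000, overlap = 200): string[] {
  const chunks: string[] = []
  let start = 0
  while (start < text.length) {
    const end = Math.min(start + size, text.length)
    chunks.push(text.slice(start, end))
    if (end === text.length) break
    // Step back by the overlap so sentences near a boundary appear in both chunks.
    start = end - overlap
  }
  return chunks
}
```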
Why text-embedding-3-small
At £0.000016 per 1k tokens it is roughly 5x cheaper than the previous ada-002 generation while scoring higher on MTEB benchmarks. 1,536 dimensions is a sensible default — large enough to be expressive, small enough that in-memory cosine search stays in the tens of milliseconds even at ten thousand chunks. The -large variant doubles the dimensionality but rarely improves end-to-end answer quality enough to justify the storage and latency cost.
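The single client and embed helper mentioned in the module map could look something like this (a sketch; only the env var names are taken from Appendix A):

```typescript
// lib/openai.ts (sketch): one client, one batch embed helper.
import OpenAI from 'openai'

export const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY })

const EMBEDDING_MODEL = process.env.EMBEDDING_MODEL ?? 'text-embedding-3-small'

// Returns one 1,536-dimensional vector per input string, in input order.
export async function embed(inputs: string[]): Promise<number[][]> {
  const res = await openai.embeddings.create({ model: EMBEDDING_MODEL, input: inputs })
  return res.data.map(d => d.embedding)
}
```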
Why an in-memory cosine store
The pure compute cost of cosine similarity over 1,000 chunks of 1,536-dimensional vectors is approximately 12 milliseconds in JavaScript on a single core; for 10,000 chunks it is roughly 60 ms. Beyond that the in-memory store becomes the dominant cost and pgvector with HNSW indexing wins decisively. That crossover is roughly the point at which teams have enough data to justify provisioning a database anyway.
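Those figures are easy to sanity-check with a throwaway benchmark like the one below (illustrative only, not part of the repository; absolute timings vary with CPU and runtime):

```typescript
// Micro-benchmark: brute-force cosine search over N random 1,536-dim vectors.
const DIM = 1536
const N = 10_000
const rand = () => Array.from({ length: DIM }, () => Math.random())
const store = Array.from({ length: N }, rand)
const q = rand()

const t0 = performance.now()
let best = -Infinity
for (const v of store) {
  let dot = 0, nq = 0, nv = 0
  for (let i = 0; i < DIM; i++) {
    dot += q[i] * v[i]
    nq += q[i] * q[i]
    nv += v[i] * v[i]
  }
  best = Math.max(best, dot / (Math.sqrt(nq) * Math.sqrt(nv)))
}
console.log(`searched ${N} vectors in ${(performance.now() - t0).toFixed(1)} ms`)
```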
Why streaming over SSE rather than WebSockets
Chat needs one-way streaming from server to client. SSE is HTTP-native, works through every CDN and corporate proxy, has built-in browser support via EventSource, and survives Vercel's function timeout via flushable streams. WebSockets are bidirectional, require connection management, and fail behind some proxies — overkill for chat output.
Why gpt-4o-mini for generation
Cheap, fast, and smart enough. The retrieval step has already done the heavy lifting; the generation model only needs to read context and answer. gpt-4o-mini at roughly £0.15 per million input tokens makes the per-question cost trivially small. Override via the CHAT_MODEL environment variable if you want a stronger reader.
07 Implementation
Vector store interface
export interface VectorStore {
add(chunks: { content: string; embedding: number[]; source?: string }[]): void
search(query: number[], k: number): { content: string; score: number }[]
clear(): void
}

In-memory cosine
let store: { content: string; embedding: number[] }[] = []
function cosine(a: number[], b: number[]): number {
let dot = 0, na = 0, nb = 0
for (let i = 0; i < a.length; i++) {
dot += a[i] * b[i]
na += a[i] * a[i]
nb += b[i] * b[i]
}
return dot / (Math.sqrt(na) * Math.sqrt(nb))
}
export function search(q: number[], k: number) {
return store
.map(c => ({ content: c.content, score: cosine(q, c.embedding) }))
.sort((a, b) => b.score - a.score)
.slice(0, k)
}

Streaming chat handler
export async function POST(req: Request) {
const { question } = await req.json()
const [qVec] = await embed([question])
const chunks = search(qVec, TOP_K)
const context = chunks.map((c, i) => `[${i+1}]\n${c.content}`).join('\n\n')
const stream = await openai.chat.completions.create({
model: CHAT_MODEL,
stream: true,
messages: [
{ role: 'system', content: SYSTEM_PROMPT + '\n\n' + context },
{ role: 'user', content: question },
],
})
const encoder = new TextEncoder()
return new Response(new ReadableStream({
async start(controller) {
for await (const part of stream) {
const token = part.choices[0]?.delta?.content || ''
if (token) controller.enqueue(encoder.encode(token))
}
controller.close()
}
}), { headers: { 'Content-Type': 'text/plain; charset=utf-8' } })
}

System prompt
The system prompt explicitly grounds the answer in the retrieved chunks and instructs the model to refuse rather than hallucinate when the answer is absent.
You answer questions strictly from the provided document chunks. If the answer is not in the chunks, say so plainly. Be concise. Quote short passages when useful. Do not invent facts. Do not use prior knowledge.
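For completeness, the browser side can consume the streamed response with a plain fetch reader; the following is a sketch of the pattern, not necessarily what app/page.tsx does verbatim:

```typescript
// Read the streamed answer token by token and hand each piece to the UI.
async function ask(question: string, onToken: (token: string) => void) {
  const res = await fetch('/api/chat', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ question }),
  })
  const reader = res.body!.getReader()
  const decoder = new TextDecoder()
  while (true) {
    const { done, value } = await reader.read()
    if (done) break
    onToken(decoder.decode(value, { stream: true }))
  }
}
```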
08 Results & Performance
Latency breakdown (warm Vercel function, UK region)
| Step | Time |
|---|---|
| Network round trip to Vercel | 30 to 80 ms |
| Embed question (OpenAI) | 100 to 200 ms |
| Cosine search (in-memory, 100 chunks) | 1 to 3 ms |
| First token from gpt-4o-mini | 300 to 700 ms |
| Stream remaining ~300 token answer | 800 to 1500 ms |
| Time to first token | ~600 to 900 ms |
| Time to completion | ~2 to 3 seconds |
Latency at scale (in-memory cosine)
| Chunks | Search time |
|---|---|
| 100 | 2 ms |
| 1,000 | 12 ms |
| 10,000 | 60 ms |
| 100,000 | 600 ms |
Above roughly ten thousand chunks the in-memory store becomes the dominant latency. Switching to pgvector with HNSW or IVFFlat indexing keeps search under 25 milliseconds even at one million chunks.
Cost
- Per question against a 50-page reference PDF: ~£0.001
- Per 500-page indexing pass: ~£0.02
- 1,000 questions per day on a 50-page PDF: ~£30 per month in OpenAI costs
- Vercel free tier: 100k function invocations per month — roughly 100k questions
09 Lessons & Trade-offs
What worked
- Resisting framework adoption. Every line of code in the project earns its place. There is no library between the developer and the moving parts.
- One vendor for embeddings and generation. One billing surface, one rate-limit budget, one set of documentation. Changes are a single env var away from being undone.
- Streaming from day one. Users perceive a streaming answer as faster even when total time is identical. Adding streaming after the fact is harder than building it in.
- Documenting non-goals explicitly. Listing scanned-PDF support, re-ranking, and hybrid search as deliberate omissions invites contributions for the right reasons rather than complaints about missing features.
What we got wrong on first pass
- Initial chunk size of 500 was too small. Retrieval was too noisy because individual chunks lacked context. 1,000 with 200 overlap is the sweet spot for prose.
- First version had no overlap. Sentences spanning chunk boundaries became unfindable in either chunk. Re-introducing overlap measurably improved retrieval quality.
- Original system prompt did not pin "say so plainly". The model hallucinated when chunks were irrelevant. Pinning the refusal instruction in the system prompt and testing it with adversarial questions fixed the regression.
Trade-offs we accept
- In-memory store loses state on cold start. Acceptable for demos and single-PDF workflows. Unacceptable for multi-tenant production — that is when you migrate to pgvector.
- OpenAI is the only embedding provider out of the box. Switching requires editing lib/openai.ts. Multi-provider failover (à la SarmaLink-AI) is overkill for a starter.
10 Conclusion
RAG-over-PDF demonstrates that a working, production-shape Retrieval-Augmented Generation application fits in roughly 600 lines of TypeScript and runs on free-tier infrastructure. The complexity that frameworks add to this problem is not load-bearing for the typical "chat with our docs" use case. By exposing the chunker, embedder, index, and prompt explicitly, the project lets teams understand exactly what their RAG system is doing and tune the parts that matter for their corpus.
The migration path to pgvector is a single-file change. Cross-encoder re-ranking is a 50-line addition. Multi-document support is a UI change. None of these require throwing away the foundation. That is the point of a starter: start, measure, extend.
A Configuration
| Variable | Required | Default | Purpose |
|---|---|---|---|
| OPENAI_API_KEY | Yes | — | Embeddings and generation |
| EMBEDDING_MODEL | No | text-embedding-3-small | Override embedding model |
| CHAT_MODEL | No | gpt-4o-mini | Override generation model |
| CHUNK_SIZE | No | 1000 | Characters per chunk |
| CHUNK_OVERLAP | No | 200 | Overlap between chunks |
| TOP_K | No | 5 | Chunks retrieved per question |
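A sketch of how these variables might be wired to defaults (a hypothetical lib/config.ts; the repository may organise this differently):

```typescript
// Env-driven defaults mirroring the table above.
export const EMBEDDING_MODEL = process.env.EMBEDDING_MODEL ?? 'text-embedding-3-small'
export const CHAT_MODEL = process.env.CHAT_MODEL ?? 'gpt-4o-mini'
export const CHUNK_SIZE = Number(process.env.CHUNK_SIZE ?? 1000)
export const CHUNK_OVERLAP = Number(process.env.CHUNK_OVERLAP ?? 200)
export const TOP_K = Number(process.env.TOP_K ?? 5)
```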
B pgvector schema
When the in-memory store is outgrown, replace lib/vector-store.ts with the following Postgres-backed implementation. The retrieval pipeline does not change.
-- Migration: enable pgvector and create the chunks table
create extension if not exists vector;

create table chunks (
  id uuid primary key default gen_random_uuid(),
  content text not null,
  embedding vector(1536) not null,
  source text,
  created_at timestamptz default now()
);

create index on chunks using ivfflat (embedding vector_cosine_ops) with (lists = 100);

-- For very large corpora, prefer HNSW:
-- create index on chunks
--   using hnsw (embedding vector_cosine_ops);
// lib/vector-store.ts (pgvector implementation)
import { sql } from '@/lib/db'
export async function add(items: { content: string; embedding: number[]; source?: string }[]) {
for (const it of items) {
await sql`
insert into chunks (content, embedding, source)
values (${it.content}, ${JSON.stringify(it.embedding)}::vector, ${it.source ?? null})
`
}
}
export async function search(q: number[], k: number) {
return await sql`
select content, 1 - (embedding <=> ${JSON.stringify(q)}::vector) as score
from chunks
order by embedding <=> ${JSON.stringify(q)}::vector
limit ${k}
`
}
export async function clear() {
await sql`truncate chunks`
}