AI / Retrieval · Concept

RAG over PDF, but actually useful.

Upload a PDF. Ask it anything. Get answers grounded in the document with citations to the page they came from.

Most RAG demos fall apart on a forty-page contract or a scanned report. This one was built to survive both. Semantic chunking with overlap, dense embeddings, an HNSW index in Postgres, and a small re-ranker before Claude sees a single token. It is the version I reach for when a client says: we have ten years of policy documents and nobody can find anything.

01 / 03

What it does

You drop a PDF into a small FastAPI endpoint. The service extracts the text, splits it into semantically coherent chunks, embeds each chunk with text-embedding-3-small, and writes the rows to Postgres with pgvector. That is the boring half.
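
Roughly this, minus error handling. A sketch rather than the repo's actual worker; the names are mine, and chunk_text is the chunker sketched under the architecture list below.

python·ingest sketch
import fitz  # PyMuPDF

from openai import OpenAI

oai = OpenAI()

def extract_pages(path: str) -> list[tuple[int, str]]:
    # Keep the page number with every span of text; the citations depend on it.
    with fitz.open(path) as doc:
        return [(i + 1, page.get_text()) for i, page in enumerate(doc)]

def embed_batched(texts: list[str], batch: int = 100) -> list[list[float]]:
    # One API call per hundred chunks, as described above.
    out: list[list[float]] = []
    for i in range(0, len(texts), batch):
        resp = oai.embeddings.create(
            model="text-embedding-3-small",
            input=texts[i : i + batch],
        )
        out.extend(d.embedding for d in resp.data)
    return out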

The interesting half is the query path. A question comes in, gets embedded, and is matched against the index using cosine similarity over an HNSW graph. The top twenty hits are run through a tiny BGE re-ranker, the top six survive, and Claude is given them inside a strict prompt that requires page-level citations. If Claude cannot ground an answer, it says so. That last bit is the whole game.

It runs locally in Docker, deploys to Fly or Railway, and costs about twenty pence per thousand questions on top of OpenAI embeddings. Cheap enough to leave running.

02 / 03

The problem it solves

A solicitor I work with had nine hundred PDFs of legacy contracts. Search inside Acrobat is fine if you remember the exact phrase. It is useless if you remember the gist. Every junior on the team had a different system of folder names and screenshot bookmarks.

This service replaces that. They drag a contract in, and the model can find the indemnity clause whether it is called indemnity, hold harmless, or a paragraph in clause 14.3 with no heading. It cites the page. They open the PDF, jump to the page, and they are done.

03 / 03

Architecture

Three moving pieces. An ingest worker that runs on upload, a query API, and Postgres in the middle holding both the chunks and the vectors. No vector DB rental, no Pinecone bill, no separate re-ranker service. Everything that needs to be durable is in one Postgres.

  • FastAPI front door for upload and query.
  • PyMuPDF for text extraction with page numbers preserved.
  • A semantic chunker that prefers to split on headings, then sentences, never mid-token. Sketched after this list.
  • OpenAI text-embedding-3-small at 1536 dimensions, batched 100 at a time.
  • Postgres 16 with pgvector and an HNSW index tuned for cosine.
  • Claude Sonnet 4.5 as the answerer, with a system prompt that forbids ungrounded claims.
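
The chunker is where the judgement lives. A crude approximation of the strategy, assuming blank lines separate heading-led blocks; the real heading detection is smarter:

python·chunker sketch
import re

def chunk_text(text: str, target: int = 1200, overlap: int = 200) -> list[str]:
    # Prefer block boundaries (headings open new blocks), fall back to
    # sentences for oversized blocks, carry `overlap` characters forward.
    blocks = re.split(r"\n\s*\n", text)
    pieces: list[str] = []
    for block in blocks:
        if len(block) <= target:
            pieces.append(block)
        else:
            pieces.extend(re.split(r"(?<=[.!?])\s+", block))
    chunks: list[str] = []
    current = ""
    for piece in pieces:
        if current and len(current) + len(piece) > target:
            chunks.append(current)
            current = current[-overlap:]
        current = (current + "\n" + piece).strip()
    if current:
        chunks.append(current)
    return chunks
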
Code

The interesting bits.

sql·migrations/001_init.sql
create extension if not exists vector;

create table documents (
  id          uuid primary key default gen_random_uuid(),
  filename    text not null,
  uploaded_at timestamptz not null default now()
);

create table chunks (
  id          bigserial primary key,
  document_id uuid not null references documents(id) on delete cascade,
  page        int  not null,
  content     text not null,
  embedding   vector(1536) not null
);

create index chunks_embedding_hnsw
  on chunks
  using hnsw (embedding vector_cosine_ops)
  with (m = 16, ef_construction = 64);

create index chunks_document_id_idx on chunks(document_id);
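
One query-time knob worth knowing: pgvector's HNSW search breadth is a session setting, and the default leans fast over thorough.

sql·per session, not part of the migration
set hnsw.ef_search = 100;  -- pgvector default is 40; higher buys recall, costs latency
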
python·app/query.py
from anthropic import Anthropic
from openai import OpenAI

oai = OpenAI()
claude = Anthropic()
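
# `db` (an asyncpg pool) and `rerank` (the BGE cross-encoder) are wired up
# elsewhere in the app; a rerank sketch follows this file.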

SYSTEM = """You answer questions using only the provided context.
If the context does not contain the answer, say so plainly.
Every claim must end with (page X) where X is the source page.
Be terse. No preamble."""

async def answer(question: str, document_id: str) -> str:
    q_emb = oai.embeddings.create(
        model="text-embedding-3-small",
        input=question,
    ).data[0].embedding

    # asyncpg must have the pgvector type registered on the pool
    # (pgvector.asyncpg.register_vector) before $2 will accept a Python list.
    rows = await db.fetch(
        """
        select page, content
        from chunks
        where document_id = $1
        order by embedding <=> $2  -- cosine distance, served by the HNSW index
        limit 20
        """,
        document_id, q_emb,
    )

    top = rerank(question, rows)[:6]
    context = "\n\n".join(f"[page {r['page']}] {r['content']}" for r in top)

    msg = claude.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=600,
        system=SYSTEM,
        messages=[{"role": "user", "content": f"{context}\n\nQ: {question}"}],
    )
    return msg.content[0].text
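
The rerank call is the one piece not shown. A minimal version with sentence-transformers' CrossEncoder; the model choice and the signature are my assumptions, not the repo's exact code:

python·rerank sketch
from sentence_transformers import CrossEncoder

_reranker = CrossEncoder("BAAI/bge-reranker-base")  # loaded once at import

def rerank(question: str, rows: list) -> list:
    # Score every (question, chunk) pair, return rows best-first.
    scores = _reranker.predict([(question, r["content"]) for r in rows])
    return [r for _, r in sorted(zip(scores, rows), key=lambda t: -t[0])]
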
Tech stack

Tools, picked deliberately.

Python 3.12 · FastAPI · PyMuPDF · OpenAI Embeddings · Claude Sonnet · Postgres 16 · pgvector · HNSW · BGE re-ranker · Docker · Fly.io
Run it yourself

From clone to working.

01

Clone and set up Postgres

Clone the repo, copy .env.example to .env, fill in OPENAI_API_KEY and ANTHROPIC_API_KEY. The compose file spins up Postgres 16 with pgvector preinstalled.
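
The template amounts to something like this; the third variable is my guess at its name, not something the README promises:

OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...
DATABASE_URL=postgres://postgres:postgres@localhost:5432/rag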

02

Run the migrations

docker compose up -d db, then make migrate. You should end up with two tables and an HNSW index.
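
To confirm, from plain psql:

sql·sanity check
select indexname from pg_indexes where tablename = 'chunks';
-- expect chunks_pkey, chunks_embedding_hnsw and chunks_document_id_idx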

03

Start the API

make dev runs uvicorn with reload. The OpenAPI docs live at /docs.

04

Ingest a PDF

curl -F file=@contract.pdf http://localhost:8000/ingest. The response gives you a document_id. Ingest is idempotent on filename plus hash.

05

Ask a question

POST /query with the document_id and your question. The response is the answer text plus the page citations.
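
Something like this; the field names follow the two inputs above, and the exact schema is in the /docs page from step 03:

curl -X POST http://localhost:8000/query \
  -H 'content-type: application/json' \
  -d '{"document_id": "<from step 04>", "question": "What is the indemnity cap?"}'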

06

Deploy

fly deploy with the included fly.toml. Use Supabase or Neon for managed Postgres if you do not want to run your own.

Read the source.

The repository ships with a working example, an env file template, and a short README. Star it if it helps.