How AI Eval Runner works
A complete tour of the architecture: data flow, subsystems, technology choices, performance, and where the project is heading next.
A Python CLI reads a dataset file, calls a model adapter for each example, runs every configured scorer, and writes a per-example row into a DuckDB file. The viewer reads the file and renders pass rates, slices, and individual traces with HTMX. Regression mode compares one run to a baseline and exits non-zero on regressions.
Core data flow
From the moment a dataset enters the system to the moment a summary and exit code leave it.
Dataset file (jsonl/csv/parquet)
        │
        ▼
CLI reads policy.yaml: model + scorers + slice keys + thresholds
        │
        ▼
for example in dataset:
    ├──▶ adapter.call(example.input)
    │         │  output, latency_ms, cost
    │         ▼
    ├──▶ for scorer in scorers:
    │         score = scorer(output, example.target)
    │         │  { score, sub_scores, details }
    │         ▼
    │    trace.write(run_id, example, output, score)
    │
end loop
        │
        ▼
Compare-against-baseline (optional)
        │  delta per scorer, per slice
        ▼
Summary file (json) + exit code
        │
        ▼
Viewer reads DuckDB → HTMX pages
Each subsystem, deep-dived
Every component in the data flow above, opened up and explained.
Dataset loader
JSONL, CSV, and Parquet are the supported formats. The loader infers the schema and yields example dicts with input, target, and arbitrary metadata fields. Datasets can be split files (a directory with one file per slice) or single files. Versioning is via the file’s git commit hash, recorded with every run.
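As a rough sketch of that contract (not the project's actual loader code), the job is to yield plain dicts regardless of file format; the load_examples name and the pyarrow dependency here are assumptions:

```python
# Illustrative sketch of the loader contract; load_examples is a hypothetical name.
import csv
import json
from pathlib import Path
from typing import Any, Iterator

def load_examples(path: str) -> Iterator[dict[str, Any]]:
    """Yield example dicts with input, target, and arbitrary metadata fields."""
    p = Path(path)
    if p.suffix == ".jsonl":
        with p.open() as f:
            for line in f:
                yield json.loads(line)
    elif p.suffix == ".csv":
        with p.open(newline="") as f:
            yield from csv.DictReader(f)
    elif p.suffix == ".parquet":
        import pyarrow.parquet as pq  # assumes pyarrow is installed
        yield from pq.read_table(p).to_pylist()
    else:
        raise ValueError(f"unsupported dataset format: {p.suffix}")
```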
Datasets can be sampled or filtered at run time via the policy file. sample: 100 takes the first 100 examples; filter: metadata.lang == "es" runs only Spanish examples. Filters are evaluated by a small expression engine, not by exec, so untrusted policy files are safe.
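A minimal sketch of what such an expression engine can look like for the equality filters shown above; the matches helper and its single-operator grammar are illustrative, and the real engine supports more than this:

```python
# Hypothetical safe filter evaluator: only "dotted.path == <literal>" is allowed,
# so an untrusted policy file can never execute arbitrary code.
import ast
from typing import Any

def matches(example: dict[str, Any], expr: str) -> bool:
    """Evaluate e.g. 'metadata.lang == "es"' against one example dict."""
    path, _, literal = expr.partition("==")
    value: Any = example
    for key in path.strip().split("."):          # walk metadata -> lang
        value = value[key]
    return value == ast.literal_eval(literal.strip())  # parse the literal safely
```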
Scorer contract
A scorer is a Python callable: def score(output: str, target: Any, metadata: dict) -> ScoreResult. ScoreResult has score (float between 0 and 1), sub_scores (dict, optional), and details (dict, optional, surfaced in the viewer).
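The contract is small enough to show in full. Here is a hedged example of a custom scorer; only the signature and field names come from the text, while the dataclass shape is an assumption standing in for the project's own ScoreResult class:

```python
# Example custom scorer following the contract above; ScoreResult here is a
# stand-in dataclass with the same field names as the project's class.
from dataclasses import dataclass, field
from typing import Any

@dataclass
class ScoreResult:
    score: float                                   # 0.0 .. 1.0
    sub_scores: dict[str, float] = field(default_factory=dict)
    details: dict[str, Any] = field(default_factory=dict)

def score(output: str, target: Any, metadata: dict) -> ScoreResult:
    """Pass if the output contains the target string, case-insensitively."""
    hit = str(target).lower() in output.lower()
    return ScoreResult(score=1.0 if hit else 0.0,
                       details={"target": target, "lang": metadata.get("lang")})
```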
Built-in scorers: exact-match, json-schema, bleu, rouge, regex, semantic (embedding cosine), llm-judge, rubric. Each is a small module under scorers/; copy any of them as a starting point for a custom scorer.
Composition is by listing multiple scorers in the policy. The viewer renders each scorer’s aggregate independently and the combined headline is computed by a configurable aggregator (mean, weighted mean, or all-pass).
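The three aggregators reduce per-scorer aggregates to a single headline number. A sketch, with the function name, arguments, and pass threshold assumed:

```python
# Illustrative aggregator; mode names mirror the text (mean, weighted mean, all-pass).
def aggregate(per_scorer: dict[str, float], mode: str = "mean",
              weights: dict[str, float] | None = None,
              pass_threshold: float = 1.0) -> float:
    if mode == "mean":
        return sum(per_scorer.values()) / len(per_scorer)
    if mode == "weighted_mean":
        w = weights or {}
        total = sum(w.get(name, 1.0) for name in per_scorer)
        return sum(v * w.get(name, 1.0) for name, v in per_scorer.items()) / total
    if mode == "all_pass":
        return 1.0 if all(v >= pass_threshold for v in per_scorer.values()) else 0.0
    raise ValueError(f"unknown aggregator: {mode}")
```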
Adapter API
An adapter is a Python module exporting call(input: dict) -> AdapterResult. AdapterResult has output, latency_ms, cost, and raw (provider response, optional). Built-in adapters: openai, anthropic, openai-compatible (any base URL). Adding a new adapter typically takes fewer than fifty lines.
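A sketch of what an openai-compatible adapter can look like under that contract; the AdapterResult dataclass shape, the "prompt" key, the default model name, and the zeroed cost are all assumptions:

```python
# Hypothetical adapter module for an OpenAI-compatible endpoint.
import time
from dataclasses import dataclass
from typing import Any

from openai import OpenAI

client = OpenAI()  # honours OPENAI_API_KEY and OPENAI_BASE_URL

@dataclass
class AdapterResult:
    output: str
    latency_ms: float
    cost: float
    raw: Any = None

def call(input: dict) -> AdapterResult:
    start = time.perf_counter()
    resp = client.chat.completions.create(
        model=input.get("model", "gpt-4o-mini"),                   # illustrative default
        messages=[{"role": "user", "content": input["prompt"]}],   # assumes a prompt key
    )
    return AdapterResult(
        output=resp.choices[0].message.content,
        latency_ms=(time.perf_counter() - start) * 1000,
        cost=0.0,   # fill in from resp.usage and your provider's price table
        raw=resp,
    )
```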
DuckDB trace storage
One DuckDB file per project. Schema: runs, examples, scores. Indexes on run_id and example_id. The runner inserts in a single transaction per batch of fifty examples to keep write latency reasonable.
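A sketch of the storage layer under those constraints; the column lists are simplified and the runs table is omitted, but the batched single-transaction insert mirrors the description above:

```python
# Simplified trace schema plus batched inserts (one transaction per batch).
import duckdb

con = duckdb.connect("traces.duckdb")
con.execute("""CREATE TABLE IF NOT EXISTS examples (
                   example_id VARCHAR, input VARCHAR, target VARCHAR, lang VARCHAR)""")
con.execute("""CREATE TABLE IF NOT EXISTS scores (
                   run_id VARCHAR, example_id VARCHAR, scorer VARCHAR,
                   score DOUBLE, details VARCHAR)""")
con.execute("CREATE INDEX IF NOT EXISTS idx_scores_run ON scores (run_id)")
con.execute("CREATE INDEX IF NOT EXISTS idx_scores_example ON scores (example_id)")

def write_batch(rows: list[tuple]) -> None:
    """Insert one batch of (run_id, example_id, scorer, score, details_json) rows."""
    con.execute("BEGIN TRANSACTION")
    con.executemany("INSERT INTO scores VALUES (?, ?, ?, ?, ?)", rows)
    con.execute("COMMIT")
```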
Slice queries are SQL group by. The viewer issues them on demand; DuckDB returns in milliseconds even for large datasets. The trace file can be checked into git (small datasets) or stored in object storage (larger ones); the wiki documents both patterns.
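Reusing the illustrative tables from the sketch above, a per-language slice view is just a join and a GROUP BY; the lang column and the run id are placeholders:

```python
# Aggregate pass rate per (slice, scorer) for one run.
rows = con.execute("""
    SELECT e.lang, s.scorer, avg(s.score) AS pass_rate, count(*) AS n
    FROM scores s
    JOIN examples e USING (example_id)
    WHERE s.run_id = ?
    GROUP BY e.lang, s.scorer
    ORDER BY e.lang, s.scorer
""", ["run-2024-001"]).fetchall()
```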
Regression mode
evalrunner run … --against-baseline main reads the most recent run on the named baseline (branch, tag, or run id), aggregates per-scorer per-slice, and computes the delta. If any scorer regresses by more than the threshold (default 1.5%), the CLI exits with code 2 and writes regressions.json.
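The comparison itself is small. A sketch, assuming the aggregates are keyed by scorer and slice, that the threshold is an absolute drop on the 0..1 scale, and that the helper name is hypothetical:

```python
# Illustrative regression check: exit 2 and write regressions.json on any drop
# larger than the threshold (1.5% per the text, read here as 0.015 absolute).
import json
import sys

def check_regressions(current: dict[str, float], baseline: dict[str, float],
                      threshold: float = 0.015) -> None:
    regressions = {
        key: {"baseline": baseline[key], "current": score, "delta": score - baseline[key]}
        for key, score in current.items()
        if key in baseline and baseline[key] - score > threshold
    }
    if regressions:
        with open("regressions.json", "w") as f:
            json.dump(regressions, f, indent=2)
        sys.exit(2)
```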
In CI, the failing exit code fails the build; regressions.json is uploaded as an artefact and surfaced in a PR comment by a small Action. The wiki has the workflow YAML.
HTMX viewer
FastAPI app. Templates in Jinja2. HTMX for swap-in interactivity (sort, filter, slice). Each page issues one or two SQL queries against DuckDB. The viewer streams updates while a run is in progress via SSE on /runs/{run_id}/events. Total JS payload is under 100 kilobytes.
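A sketch of the SSE endpoint's shape; only the route comes from the text, while the event payloads and the stub events_for generator are hypothetical stand-ins for tailing the trace file:

```python
# Minimal FastAPI SSE endpoint; events_for is a hypothetical event source.
import asyncio
import json

from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

async def events_for(run_id: str):
    """Stub event source; the real viewer watches the DuckDB trace file."""
    for i in range(3):
        yield json.dumps({"run_id": run_id, "completed": i + 1})
        await asyncio.sleep(1)

@app.get("/runs/{run_id}/events")
async def run_events(run_id: str) -> StreamingResponse:
    async def stream():
        async for event in events_for(run_id):
            yield f"data: {event}\n\n"             # SSE wire format
    return StreamingResponse(stream(), media_type="text/event-stream")
```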
CLI ergonomics
Typer for the CLI. Subcommands: run, serve, compare, export. Coloured output, progress bars, machine-readable summary. The CLI is the primary interface; the viewer is secondary.
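The shape of that CLI, sketched with Typer; the --against-baseline flag appears earlier on this page, but the other option names, defaults, and echo output are illustrative:

```python
# Skeleton of the four subcommands; bodies are stubs.
from typing import Optional

import typer

app = typer.Typer(help="AI Eval Runner")

@app.command()
def run(policy: str = "policy.yaml",
        against_baseline: Optional[str] = typer.Option(None, "--against-baseline")):
    """Run every example through the adapter and scorers, then write traces."""
    typer.echo(f"running {policy} (baseline: {against_baseline})")

@app.command()
def serve(port: int = 8000):
    """Start the HTMX viewer."""
    typer.echo(f"serving on :{port}")

@app.command()
def compare(run_a: str, run_b: str):
    """Print per-scorer, per-slice deltas between two runs."""
    typer.echo(f"comparing {run_a} against {run_b}")

@app.command()
def export(run_id: str, fmt: str = "json"):
    """Export a run summary in a machine-readable format."""
    typer.echo(f"exporting {run_id} as {fmt}")

if __name__ == "__main__":
    app()
```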
Why this stack
The road not taken matters as much as the road taken. Here is what was picked, why, and what was rejected and why.
Python 3.12
The audience writes Python. Scorers are functions. Adapters are functions. Python is the right language.
Rejected: TypeScript, which would have been a fight with the data-science workflow most teams already have.
DuckDB
Analytical, file-based, fast group-bys, no server. Perfect for trace storage with slice queries.
Rejected: SQLite (slower for analytical queries) and Postgres (server overhead, worse for read-heavy analytical loads).
HTMX
Sub-100KB JS, instant load, perfect for data-dense interactive views.
Rejected: React, which would have been overkill: slower load, more dependencies, all for interactions HTMX handles natively.
FastAPI
Easy SSE, easy templating, easy CLI integration via shared modules.
Rejected: Flask, which would have worked but is less typed and missing modern niceties.
uv
Fastest dependency resolver. CI-friendly. Reproducible.
Rejected: pip + venv, which is slower and offers looser reproducibility.
Typer
Clean CLI ergonomics, type-checked arguments, free help text.
Rejected: argparse, which works but needs more boilerplate and offers less polish.
JSONL/CSV/Parquet datasets
Files in git is the simplest version control story for evaluation data.
Rejected: a dataset platform, which most teams do not need.
Performance & observability
Throughput is bounded by the model adapter. The runner uses bounded concurrency (default 8) to overlap adapter calls; CPU is rarely the bottleneck. A 1,000-example dataset against a fast model (Groq, Cerebras) finishes in two to three minutes. Against a slow model it finishes when the model finishes.
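One way to get that bounded overlap, sketched with a semaphore around a synchronous adapter call; run_all and the call parameter are illustrative names, not the runner's actual code:

```python
# Bounded concurrency over adapter calls (default 8, matching the text).
import asyncio
from typing import Any, Callable

async def run_all(examples: list[dict], call: Callable[[dict], Any],
                  concurrency: int = 8) -> list:
    sem = asyncio.Semaphore(concurrency)

    async def run_one(example: dict):
        async with sem:
            # adapter.call is synchronous, so run it on a worker thread and let
            # the event loop overlap the network waits
            return await asyncio.to_thread(call, example["input"])

    return await asyncio.gather(*(run_one(e) for e in examples))
```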
Storage is dominated by the per-example output text. A run of 10,000 examples with 500-token outputs occupies roughly 25 megabytes in DuckDB. Production users keep a year of trace files comfortably on a single machine.
The viewer’s slice queries return in milliseconds for normal datasets. Larger datasets (hundreds of thousands of examples) benefit from DuckDB’s native indices; the wiki has the relevant CREATE INDEX guidance.
Where it is heading
- Built-in calibration set for LLM-as-judge with documented procedure.
- Adapter for Hugging Face Inference Endpoints.
- Dataset diff tool that highlights which examples were added or relabelled between two runs.
- Concurrent runs against a shared baseline, for batched A/B testing.
- Native Parquet output for trace shipping into a data warehouse.
Read the full whitepaper for the formal technical write-up.