Technical Whitepaper · v1.0

AI Eval Runner

Evals as code. Datasets as files. Scorers as functions. Traces in DuckDB. Regressions in CI.

MIT Licence · Python 3.12 · DuckDB · HTMX Viewer · CI-Friendly · Regression Mode

  • 6+ built-in scorers
  • DuckDB trace storage
  • < 100 KB viewer JS
  • First-class CI

§ Abstract

Every team building with LLMs ends up running evals. Most of them run evals badly. The first version is a Python script with print statements; the second is a notebook with a chart; the third is a half-written internal dashboard. The team eventually evaluates a vendor product, gets put off by data egress concerns, and goes back to the dashboard. Months pass. Releases ship without confidence.

AI Eval Runner is the open-source consolidation. Datasets live as files in your repository. Scorers are Python functions. Runs produce per-example traces stored in DuckDB. A FastAPI viewer with HTMX renders pass rates, slices, regressions, and individual traces. The CLI is CI-friendly: a regression run exits non-zero when the score drops by more than the configured threshold, and writes a summary file the build can attach to a PR comment.

This whitepaper documents the data model, the scorer contract, the regression algorithm, the trace storage, and the operational lessons from running this harness on real release pipelines.

1 · Executive Summary

AI Eval Runner is a Python 3.12 CLI plus a small FastAPI viewer. The CLI runs evaluations: read a dataset, call a model adapter, score the output, write a trace. The viewer reads the trace database and shows the results.
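
A minimal sketch of that loop, assuming a JSONL dataset, an adapter with a complete() method, and float-returning scorers; the names here are illustrative, not the shipped API:

  # Sketch of the run loop: read dataset, call adapter, score, write one trace row per score.
  # The adapter.complete() method and the scorer signature are assumptions for illustration.
  import json
  import time
  import uuid

  import duckdb

  def run_eval(dataset_path: str, adapter, scorers: dict, db_path: str = "traces.duckdb") -> str:
      con = duckdb.connect(db_path)
      run_id = str(uuid.uuid4())
      with open(dataset_path) as f:
          examples = [json.loads(line) for line in f]          # JSONL: one example per line
      for ex in examples:
          started = time.perf_counter()
          output = adapter.complete(ex["input"])               # model adapter call
          latency_ms = (time.perf_counter() - started) * 1000
          for name, scorer in scorers.items():
              score = scorer(output, ex["expected"], ex.get("metadata", {}))
              con.execute(
                  "INSERT INTO scores (run_id, example_id, scorer, score, latency_ms) "
                  "VALUES (?, ?, ?, ?, ?)",
                  [run_id, ex["id"], name, score, latency_ms],
              )
      return run_id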

Eight built-in scorers ship in the repository: exact-match, JSON schema validation, BLEU, ROUGE, regex match, semantic similarity, LLM-as-judge, and rubric grading. Each is a small Python module; adding a new scorer takes fewer than fifty lines. Datasets are JSONL, CSV, or Parquet; the runner does not care which.

Regression mode is the feature that closes the loop with CI. A run can be compared against a baseline (previous run, named branch, named tag). The CLI exits non-zero when the score regresses by more than a threshold. The build fails the PR. The team sees the regression in the comment.
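
A sketch of what the regression check amounts to, assuming the trace schema shown in the architecture section; the threshold default and the summary filename are illustrative:

  # Sketch of regression mode: compare mean score against a baseline run,
  # write a summary artefact, and exit non-zero when the drop exceeds the threshold.
  import json
  import sys

  import duckdb

  def check_regression(db_path: str, run_id: str, baseline_id: str, threshold: float = 0.02) -> None:
      con = duckdb.connect(db_path, read_only=True)

      def mean_score(rid: str) -> float:
          return con.execute("SELECT avg(score) FROM scores WHERE run_id = ?", [rid]).fetchone()[0]

      delta = mean_score(run_id) - mean_score(baseline_id)
      summary = {"run": run_id, "baseline": baseline_id, "delta": delta, "threshold": threshold}
      with open("eval-summary.json", "w") as f:                # artefact for the PR comment
          json.dump(summary, f, indent=2)
      if delta < -threshold:
          sys.exit(1)                                          # fail the build on regression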

2 · Background

The eval problem is simple to describe and easy to get wrong. You have a dataset of inputs and expected outputs. You have a model. You want to know how well the model does on the dataset. You want to know whether the next version of the model does better or worse. You want to slice the results by metadata fields. You want to do this in CI.

The vendor offerings (Braintrust, Langfuse, Helicone, others) handle this in increasingly polished ways. They are also opinionated about hosting your data, and that is a poor fit for many use cases. Internal datasets, regulated industries, and air-gapped environments cannot send their evaluation data to a third-party service.

The library offerings (DSPy, LangSmith eval, the various eval kits on PyPI) handle parts of the problem. None of them, in our experience, handle CI gracefully. Most assume a notebook context. That CI mismatch is why every team we worked with ended up writing their own.

3 · Problem in detail

Dataset versioning

Datasets evolve. Adding examples, fixing labels, splitting train and test. If the dataset version is not tracked, comparing two runs is meaningless. The runner ships datasets as files in git; the dataset version is its commit hash.
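
One way to pin that version is to ask git for the last commit that touched the dataset file, as in this sketch:

  # Sketch: derive a dataset version from the last commit that touched the file.
  import subprocess

  def dataset_version(path: str) -> str:
      result = subprocess.run(
          ["git", "log", "-n", "1", "--format=%H", "--", path],
          capture_output=True, text=True, check=True,
      )
      return result.stdout.strip()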

Scorer composition

Few real evaluations have one scorer. A useful eval combines exact-match on the structured field, BLEU on the prose part, and an LLM-as-judge on the holistic quality. The runner’s scorer contract is small and composable, so a new compound scorer is a Python function that returns a dict of sub-scores.
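
A sketch of such a compound scorer, assuming a (output, expected, metadata) signature; the field names and the crude token-overlap stand-in for BLEU are illustrative:

  # Sketch of a compound scorer returning a dict of sub-scores.
  # The (output, expected, metadata) signature and the field names are assumptions.
  import json

  def token_overlap(a: str, b: str) -> float:
      """Crude stand-in for a BLEU/ROUGE-style prose score."""
      a_tokens, b_tokens = set(a.lower().split()), set(b.lower().split())
      return len(a_tokens & b_tokens) / max(len(b_tokens), 1)

  def report_scorer(output: str, expected: str, metadata: dict) -> dict:
      out_doc, exp_doc = json.loads(output), json.loads(expected)
      return {
          # exact match on the structured part
          "header_exact": 1.0 if out_doc["header"] == exp_doc["header"] else 0.0,
          # fuzzier score on the prose part
          "body_overlap": token_overlap(out_doc["body"], exp_doc["body"]),
      }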

Slice analysis

The headline number can hide a regression in a slice. A model that improves overall while regressing in Spanish, or in long documents, or in a specific industry, has shipped a regression. Every example carries metadata; every aggregate can be sliced by metadata fields.

CI integration

The eval has to run in CI to matter. That means: deterministic, no background server, exits with the right code, writes the right artefacts, finishes inside a reasonable time budget. The runner is built around these constraints from the first commit.

4 · Goals + non-goals

Goals

  • Datasets as files. No external dataset service.
  • Scorers as Python functions. Composable, easy to write.
  • Per-example trace storage in DuckDB on disk.
  • Slice analysis as a first-class query.
  • Regression mode that exits non-zero on regressions.
  • HTMX viewer with sub-100KB JS.
  • Adapter API for any model.

Non-goals

  • Hosted evals. The runner is local-first.
  • A prompt-management product. Prompts live in your application code.
  • A general-purpose dataset platform. Datasets are files.
  • Training pipelines. The runner evaluates; it does not fine-tune.

5 · Architecture

Three things on disk: the dataset (a JSONL file or directory), the policy (which model, which scorers), and the trace database (a single DuckDB file). The CLI reads dataset and policy, writes traces. The viewer reads traces, renders pages.

Trace schema (simplified DuckDB)
  runs (run_id, dataset_id, dataset_version, model, started_at, finished_at, summary json)
  examples (run_id, example_id, input json, output text, expected text, metadata json)
  scores (run_id, example_id, scorer, score double, details json, latency_ms, cost double)

The relational shape is exactly what you want for slice analysis. Aggregating scores per slice is a SQL group-by, and DuckDB executes that on tens of millions of rows in under a second.
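
For example, slicing one run's mean score by a metadata field is a single query against the tables above; the "language" key is only an example, and any metadata field works the same way:

  # Sketch of a slice query against the trace schema above.
  import duckdb

  def slice_by_language(db_path: str, run_id: str) -> list:
      con = duckdb.connect(db_path, read_only=True)
      return con.execute("""
          SELECT e.metadata ->> 'language' AS language,
                 avg(s.score)              AS mean_score,
                 count(*)                  AS n
          FROM scores s
          JOIN examples e USING (run_id, example_id)
          WHERE s.run_id = ?
          GROUP BY 1
          ORDER BY mean_score
      """, [run_id]).fetchall()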

6 · Key technical decisions

DuckDB, not Postgres or SQLite

The runner’s storage is read-heavy at query time and write-heavy during a run. DuckDB is purpose-built for analytical workloads, ships as a file, has no server to deploy, and runs the slice queries in milliseconds. SQLite would have been almost as good but is row-oriented and slower for this exact shape. Postgres would have been overkill and required hosting.

HTMX, not React

The viewer is data-dense, not interaction-rich. HTMX gives interactivity without a JS framework. The viewer ships under 100KB of JS. It loads instantly, even on a flaky office connection.
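
The shape of the viewer code, as a sketch: each interaction is a FastAPI route that returns an HTML fragment for HTMX to swap into the page. The route path and markup here are illustrative, not the shipped viewer.

  # Sketch of the viewer pattern: HTMX requests a fragment, FastAPI renders it server-side.
  import duckdb
  from fastapi import FastAPI
  from fastapi.responses import HTMLResponse

  app = FastAPI()

  @app.get("/runs/{run_id}/pass-rate", response_class=HTMLResponse)
  def pass_rate_fragment(run_id: str) -> str:
      con = duckdb.connect("traces.duckdb", read_only=True)
      (rate,) = con.execute(
          "SELECT avg(score) FROM scores WHERE run_id = ?", [run_id]
      ).fetchone()
      return f"<span class='pass-rate'>{rate:.1%}</span>"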

Python, not TypeScript

The audience is data scientists and ML engineers as much as software engineers. Python is the lingua franca of evaluation. Scorers are Python functions because that is the language the people writing them already think in.

uv for packaging

uv builds the virtual environment, resolves dependencies, and runs scripts. It is the fastest reasonable option in 2026, and it installs on a CI runner in under ten seconds.

7 · Implementation milestones

Milestone 1 · CLI core and exact-match

The first thing built was the CLI loop: read dataset, call adapter, score, write trace. Exact-match was the only scorer. The acceptance test was a hand-built dataset of fifteen examples and a known model.

Milestone 2 · scorer family and JSON schema

The remaining scorers were added with the contract stable. JSON schema validation was a particular focus because it is the most common scorer in production deployments.

Milestone 3 · regression mode

Compare-against-baseline landed with the right exit codes and summary artefacts. CI integration was tested against GitHub Actions and CircleCI.

Milestone 4 · HTMX viewer

The viewer was added against the stable trace schema. SSE provides a live tail of in-progress evals; tRPC was rejected because the backend is Python.

8 · Lessons / honest limits

Lessons

  • Slice analysis is the difference. Headline numbers hide everything important. Every dataset should carry metadata; every aggregate should be sliceable.
  • LLM-as-judge has to be calibrated. A judge model that grades 4.7/5 on every output is useless. The runner ships a calibration set and a documented procedure.
  • CI integration is the feature, not the afterthought. A runner that does not exit non-zero on regression will not be used as a release gate.

Honest limits

  • No hosted mode. If you want a hosted evaluation product, use one. The runner is local-first by design.
  • No training feedback loop. Evaluation only.
  • LLM-as-judge cost. Judge models cost money; evaluating thousands of examples with a judge is not free. The runner reports the cost.
  • Slow scorers slow the whole run. Parallelism inside the runner is concurrent.futures with bounded concurrency; no GPU-aware batching. Acceptable for the dataset sizes we target.

9 · Conclusion

AI Eval Runner is what evaluation tooling looks like when CI is the first-class user. Datasets are files; scorers are functions; traces are DuckDB rows; the viewer is HTMX. Regressions fail the PR. Slices catch hidden problems before they ship.

The repository is MIT licensed. The wiki contains a scorer authoring guide, an adapter authoring guide, a CI recipe for GitHub Actions, and a calibration procedure for LLM-as-judge.

AI Eval Runner · Built by Sarma Linux · MIT licensed