Evals as code, with traces and regressions.
One CLI. Datasets, scorers, traces, regressions. The boring half of AI engineering done well.
AI Eval Runner is the evaluation harness most teams half-write. Datasets live as files in your repo. Scorers are Python functions. Runs produce traces in DuckDB. A FastAPI viewer (HTMX, no JS framework) shows pass rates, regressions, and individual traces. Built-in scorers cover exact-match, JSON schema, BLEU, ROUGE, LLM-as-judge, and rubric grading.
Why this exists
Every team building with LLMs runs evals. Almost none of them run evals well. The first version is a Python script with print statements; the second version is a Jupyter notebook with a chart; the third version is a half-finished homemade dashboard. By the fourth version the team is debating whether to adopt a vendor product.
The vendor products are good. They are also opinionated about your stack, and they prefer to host your data. For projects where the dataset is sensitive, the model is internal, and the evaluations are part of CI, hosted vendors are a poor fit.
AI Eval Runner is the open-source middle. Datasets are files. Scorers are functions. Traces live in DuckDB on disk. The viewer is local, fast, and built on HTMX so the JS surface is tiny. Regression mode runs in CI and fails the build when a release loses performance against a baseline. The same harness powers your local development loop and your release gate.
What it does
Every feature below ships in the public repository today. Clone, configure, run.
Datasets as files
JSONL, CSV, or Parquet. Versioned in git. No external dataset service. Datasets can be parameterised at runtime.
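A JSONL dataset might look like the sketch below; the field names (input, target, meta) are illustrative assumptions, not the repo's required schema.

```jsonl
{"input": "What is the capital of France?", "target": "Paris", "meta": {"language": "en"}}
{"input": "Quelle est la capitale de l'Allemagne ?", "target": "Berlin", "meta": {"language": "fr"}}
```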
Scorers as functions
A scorer is a Python function. Inputs: model output and ground truth. Output: a score and a metadata dict. Compose them.
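A minimal sketch of that contract: model output and ground truth in, a score and a metadata dict out. The function names, return shape, and the all_of combinator here are illustrative assumptions, not the repo's actual API.

```python
def exact_match(output: str, target: str) -> tuple[float, dict]:
    """Score 1.0 on an exact (whitespace-stripped) match, 0.0 otherwise."""
    hit = output.strip() == target.strip()
    return (1.0 if hit else 0.0), {"matched": hit}

def contains_target(output: str, target: str) -> tuple[float, dict]:
    """Looser check: did the target string appear anywhere in the output?"""
    hit = target in output
    return (1.0 if hit else 0.0), {"matched": hit}

def all_of(*scorers):
    """Compose scorers: the combined score is the minimum component score."""
    def composed(output: str, target: str) -> tuple[float, dict]:
        results = [s(output, target) for s in scorers]
        score = min(component for component, _ in results)
        meta = {s.__name__: m for s, (_, m) in zip(scorers, results)}
        return score, meta
    return composed
```

Because scorers are plain functions, composing them needs no framework machinery: all_of is just a closure over other scorers.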
Built-in scorers
Exact-match, JSON schema validation, BLEU, ROUGE, regex match, semantic similarity, LLM-as-judge, rubric grading. All as opt-in modules.
Per-trace storage
Every run stores a per-example trace: input, output, score, latency, cost, and model used. DuckDB makes ad-hoc queries fast.
Pass rates and aggregates
Aggregate scores per dataset, per slice, per model. Time series across runs. Slice by metadata fields you provide.
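As a sketch, an ad-hoc DuckDB query over the trace store might slice pass rate, latency, and cost by a metadata field. The table and column names below are assumptions; check the repo for the real schema.

```sql
-- Hypothetical schema: a traces table keyed by run_id, with a JSON meta column.
SELECT
    run_id,
    meta ->> 'language'  AS language,       -- slice by a metadata field you provide
    avg(score)           AS pass_rate,
    avg(latency_ms)      AS avg_latency_ms,
    sum(cost_usd)        AS total_cost
FROM traces
GROUP BY run_id, language
ORDER BY run_id, language;
```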
Regression mode
A run can be compared against a baseline; the CLI exits non-zero on regressions over a configured threshold. Drop into GitHub Actions.
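Dropped into GitHub Actions, regression mode might look like this sketch. The CLI flags follow the quick start; the workflow layout, secret name, and baseline ref are assumptions.

```yaml
# Hypothetical workflow -- adapt job names, secrets, and baseline to your repo.
name: evals
on: pull_request
jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: uv sync
      - run: |
          evalrunner run examples/qna --model gpt-4o-mini \
            --scorers exact-match,json-schema --against-baseline main
        env:
          OPENAI_KEY: ${{ secrets.OPENAI_KEY }}
```

Because the CLI exits non-zero on a regression, no extra glue is needed: the failing step fails the PR.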
HTMX viewer
A FastAPI app with HTMX for interactivity. Sub-100KB JS. Loads instantly, even on a flaky connection.
Cost and latency
Every run records cost and latency per example. Catch regressions in £ as well as accuracy.
Adapter API
Bring your own model. Adapters are small Python modules. The default adapters cover OpenAI, Anthropic, and OpenAI-compatible endpoints.
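A custom adapter might look like the sketch below. The repo says adapters are small Python modules; the class shape, call signature, and response fields here are assumptions for illustration.

```python
from dataclasses import dataclass

@dataclass
class ModelResponse:
    """What the harness plausibly needs back from a model call."""
    text: str
    latency_ms: float
    cost_usd: float

class EchoAdapter:
    """A trivial adapter that returns the input unchanged -- useful for
    smoke-testing the harness without spending tokens."""
    name = "echo"

    def call(self, example: dict) -> ModelResponse:
        return ModelResponse(text=example["input"], latency_ms=0.0, cost_usd=0.0)
```

An echo adapter like this lets you validate datasets, scorers, and the trace store end-to-end before pointing the harness at a paid endpoint.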
CI-friendly
No background server required. The CLI runs to completion, writes a summary file, and reports a clean exit code.
Architecture, in one diagram
The whole system on a single screen. Every box maps to a real folder in the repo.
┌─────────────────────────┐ ┌──────────────────────────┐
│ Dataset (JSONL/CSV) │ │ Adapter (model client) │
│ · examples in git │ ──▶ │ · OpenAI │
│ · slices, metadata │ │ · Anthropic │
└──────────────┬──────────┘ │ · OpenAI-compatible │
│ └──────────────┬───────────┘
▼ ▼
┌──────────────────────────────────────────────┐
│ CLI · evalrunner run │
│ for example in dataset: │
│ output = adapter.call(example) │
│ for scorer in scorers: │
│ score = scorer(output, example.target) │
│ trace.write(example, output, score) │
└──────────────┬───────────────────────────────┘
▼
┌──────────────────────────────────────────────┐
│ DuckDB · traces.duckdb │
│ · per-example rows │
│ · runs, datasets, scorers tables │
└──────────────┬───────────────────────────────┘
▼
┌──────────────────────────────────────────────┐
│ FastAPI viewer (HTMX, Jinja2) │
│ · pass rates, slices, regressions, traces │
└──────────────────────────────────────────────┘
Quick start
From clone to first eval run in under five minutes.
git clone https://github.com/sarmakska/ai-eval-runner.git
cd ai-eval-runner
uv sync
cp .env.example .env  # add OPENAI_KEY etc.
evalrunner run examples/qna --model gpt-4o-mini \
  --scorers exact-match,json-schema --against-baseline main
evalrunner serve # FastAPI viewer on :8088
Where it fits
The patterns this repository was built around.
Release gates
A new prompt or a new model goes through eval before it merges. Regression mode fails the PR if the score drops more than the threshold.
Prompt iteration
Iterate prompts locally with the CLI, see traces side-by-side in the viewer, ship when the win is real and not just one example.
Vendor comparison
Run the same dataset across OpenAI, Anthropic, and a local model via the adapter API. Compare cost and accuracy honestly.
Customer-bound metrics
Slice eval results by customer-relevant metadata (industry, doc length, language) so the headline number is not hiding a regression in a slice.
Treat evals like tests, not like dashboards.
Clone the repo, follow the four-step quick start, ship something real.