Evals as code, with traces and regressions.
One CLI. Datasets, scorers, traces, regressions. The boring half of AI engineering done well.
AI Eval Runner is the evaluation harness most teams half-write. Datasets live as files in your repo. Scorers are Python functions. Runs produce traces in DuckDB. A FastAPI viewer (HTMX, no JS framework) shows pass rates, regressions, and individual traces. Built-in scorers cover exact-match, JSON schema, BLEU, ROUGE, LLM-as-judge, and rubric grading.
Why this exists
Every team building with LLMs runs evals. Almost none of them run evals well. The first version is a Python script with print statements; the second version is a Jupyter notebook with a chart; the third version is a half-finished homemade dashboard. By the fourth version the team is debating whether to adopt a vendor product.
The vendor products are good. They are also opinionated about your stack, and they prefer to host your data. For projects where the dataset is sensitive, the model is internal, and the evaluations are part of CI, hosted vendors are a poor fit.
AI Eval Runner is the open-source middle. Datasets are files. Scorers are functions. Traces live in DuckDB on disk. The viewer is local, fast, and built on HTMX so the JS surface is tiny. Regression mode runs in CI and fails the build when a release loses performance against a baseline. The same harness powers your local development loop and your release gate.
What it does
Every feature below ships in the public repository today. Clone, configure, run.
Datasets as files
JSONL, CSV, or Parquet. Versioned in git. No external dataset service. Datasets can be parameterised at runtime.
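A JSONL dataset might look like the sketch below; the field names (input, target, meta) are illustrative assumptions, not the repo's required schema.

```jsonl
{"input": "What is the capital of France?", "target": "Paris", "meta": {"language": "en"}}
{"input": "Quelle est la capitale de l'Allemagne ?", "target": "Berlin", "meta": {"language": "fr"}}
```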
Scorers as functions
A scorer is a Python function. Inputs: model output and ground truth. Output: a score and a metadata dict. Compose them.
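A minimal sketch of that contract: model output and ground truth in, a score and a metadata dict out. The function names, return shape, and the all_of combinator here are illustrative assumptions, not the repo's actual API.

```python
def exact_match(output: str, target: str) -> tuple[float, dict]:
    """Score 1.0 on an exact (whitespace-stripped) match, 0.0 otherwise."""
    hit = output.strip() == target.strip()
    return (1.0 if hit else 0.0), {"matched": hit}

def contains_target(output: str, target: str) -> tuple[float, dict]:
    """Looser check: did the target string appear anywhere in the output?"""
    hit = target in output
    return (1.0 if hit else 0.0), {"matched": hit}

def all_of(*scorers):
    """Compose scorers: the combined score is the minimum component score."""
    def composed(output: str, target: str) -> tuple[float, dict]:
        results = [s(output, target) for s in scorers]
        score = min(component for component, _ in results)
        meta = {s.__name__: m for s, (_, m) in zip(scorers, results)}
        return score, meta
    return composed
```

Because scorers are plain functions, composing them needs no framework machinery: all_of is just a closure over other scorers.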
Built-in scorers
Exact-match, JSON schema validation, BLEU, ROUGE, regex match, semantic similarity, LLM-as-judge, rubric grading. All as opt-in modules.
Per-trace storage
Every run stores a per-example trace: input, output, score, latency, cost, and model used. DuckDB makes ad-hoc queries fast.
Pass rates and aggregates
Aggregate scores per dataset, per slice, per model. Time series across runs. Slice by metadata fields you provide.
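As a sketch, an ad-hoc DuckDB query over the trace store might slice pass rate, latency, and cost by a metadata field. The table and column names below are assumptions; check the repo for the real schema.

```sql
-- Hypothetical schema: a traces table keyed by run_id, with a JSON meta column.
SELECT
    run_id,
    meta ->> 'language'  AS language,       -- slice by a metadata field you provide
    avg(score)           AS pass_rate,
    avg(latency_ms)      AS avg_latency_ms,
    sum(cost_usd)        AS total_cost
FROM traces
GROUP BY run_id, language
ORDER BY run_id, language;
```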
Regression mode
A run can be compared against a baseline; the CLI exits non-zero on regressions over a configured threshold. Drop into GitHub Actions.
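Dropped into GitHub Actions, regression mode might look like this sketch. The CLI flags follow the quick start; the workflow layout, secret name, and baseline ref are assumptions.

```yaml
# Hypothetical workflow -- adapt job names, secrets, and baseline to your repo.
name: evals
on: pull_request
jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: uv sync
      - run: |
          evalrunner run examples/qna --model gpt-4o-mini \
            --scorers exact-match,json-schema --against-baseline main
        env:
          OPENAI_KEY: ${{ secrets.OPENAI_KEY }}
```

Because the CLI exits non-zero on a regression, no extra glue is needed: the failing step fails the PR.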
HTMX viewer
A FastAPI app with HTMX for interactivity. Sub-100KB JS. Loads instantly, even on a flaky connection.
Cost and latency
Every run records cost and latency per example. Catch regressions in £ as well as accuracy.
Adapter API
Bring your own model. Adapters are small Python modules. The default adapters cover OpenAI, Anthropic, and OpenAI-compatible endpoints.
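A custom adapter might look like the sketch below. The repo says adapters are small Python modules; the class shape, call signature, and response fields here are assumptions for illustration.

```python
from dataclasses import dataclass

@dataclass
class ModelResponse:
    """What the harness plausibly needs back from a model call."""
    text: str
    latency_ms: float
    cost_usd: float

class EchoAdapter:
    """A trivial adapter that returns the input unchanged -- useful for
    smoke-testing the harness without spending tokens."""
    name = "echo"

    def call(self, example: dict) -> ModelResponse:
        return ModelResponse(text=example["input"], latency_ms=0.0, cost_usd=0.0)
```

An echo adapter like this lets you validate datasets, scorers, and the trace store end-to-end before pointing the harness at a paid endpoint.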
CI-friendly
No background server required. The CLI runs to completion, writes a summary file, and reports a clean exit code.
Architecture, in one diagram
The whole system on a single screen. Every box maps to a real folder in the repo.
┌─────────────────────────┐ ┌──────────────────────────┐
│ Dataset (JSONL/CSV) │ │ Adapter (model client) │
│ · examples in git │ ──▶ │ · OpenAI │
│ · slices, metadata │ │ · Anthropic │
└──────────────┬──────────┘ │ · OpenAI-compatible │
│ └──────────────┬───────────┘
▼ ▼
┌──────────────────────────────────────────────┐
│ CLI · evalrunner run │
│ for example in dataset: │
│ output = adapter.call(example) │
│ for scorer in scorers: │
│ score = scorer(output, example.target) │
│ trace.write(example, output, score) │
└──────────────┬───────────────────────────────┘
▼
┌──────────────────────────────────────────────┐
│ DuckDB · traces.duckdb │
│ · per-example rows │
│ · runs, datasets, scorers tables │
└──────────────┬───────────────────────────────┘
▼
┌──────────────────────────────────────────────┐
│ FastAPI viewer (HTMX, Jinja2) │
│ · pass rates, slices, regressions, traces │
└──────────────────────────────────────────────┘
Quick start
From clone to first eval run in under five minutes.
git clone https://github.com/sarmakska/ai-eval-runner.git
cd ai-eval-runner
uv sync
cp .env.example .env  # add OPENAI_KEY etc.
evalrunner run examples/qna --model gpt-4o-mini \
  --scorers exact-match,json-schema --against-baseline main
evalrunner serve # FastAPI viewer on :8088
Where it fits
The patterns this repository was built around.
Release gates
A new prompt or a new model goes through eval before it merges. Regression mode fails the PR if the score drops more than the threshold.
Prompt iteration
Iterate prompts locally with the CLI, see traces side-by-side in the viewer, ship when the win is real and not just one example.
Vendor comparison
Run the same dataset across OpenAI, Anthropic, and a local model via the adapter API. Compare cost and accuracy honestly.
Customer-bound metrics
Slice eval results by customer-relevant metadata (industry, doc length, language) so the headline number is not hiding a regression in a slice.
Treat evals like tests, not like dashboards.
Clone the repo, follow the four-step quick start, ship something real.