Open Source · MIT · Python 3.12

Evals as code, with traces and regressions.

One CLI. Datasets, scorers, traces, regressions. The boring half of AI engineering done well.

AI Eval Runner is the evaluation harness most teams half-write. Datasets live as files in your repo. Scorers are Python functions. Runs produce traces in DuckDB. A FastAPI viewer (HTMX, no JS framework) shows pass rates, regressions, and individual traces. Built-in scorers cover exact-match, JSON schema, BLEU, ROUGE, LLM-as-judge, and rubric grading.

6+
Built-in scorers
DuckDB
Trace storage
HTMX
Viewer UI
CI
Regression mode
MIT
Licence

Why this exists

Every team building with LLMs runs evals. Almost none of them run evals well. The first version is a Python script with print statements; the second version is a Jupyter notebook with a chart; the third version is a half-finished homemade dashboard. By the fourth version the team is debating whether to adopt a vendor product.

The vendor products are good. They are also opinionated about your stack, and they prefer to host your data. For projects where the dataset is sensitive, the AI is internal, and the evaluations are part of CI, hosted vendors are a poor fit.

AI Eval Runner is the open-source middle. Datasets are files. Scorers are functions. Traces live in DuckDB on disk. The viewer is local, fast, and built on HTMX so the JS surface is tiny. Regression mode runs in CI and fails the build when a release loses performance against a baseline. The same harness powers your local development loop and your release gate.

What it does

Every feature below ships in the public repository today. Clone, configure, run.

Datasets as files

JSONL, CSV, or Parquet. Versioned in git. No external dataset service. Datasets can be parameterised at runtime.
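
A minimal sketch of what a dataset file can look like, written from Python for illustration. The target field matches what scorers receive in the architecture diagram below; the other field names, the metadata keys, and the file path are assumptions, not the tool's fixed schema.

import json
from pathlib import Path

# Illustrative examples only — "input", "metadata" and the slice keys are
# assumed names; "target" is the ground truth the scorers compare against.
examples = [
    {"input": "What is the capital of France?", "target": "Paris",
     "metadata": {"slice": "geography", "language": "en"}},
    {"input": "Summarise the attached paragraph.", "target": "A one-sentence summary.",
     "metadata": {"slice": "summarisation", "language": "en"}},
]

with Path("examples/qna/dataset.jsonl").open("w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")

Because the file lives in git, a change to the dataset shows up in review like any other diff.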

Scorers as functions

A scorer is a Python function. Inputs: model output and ground truth. Output: a score and a metadata dict. Compose them.
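
A sketch of a custom scorer under that contract — the exact way the repo discovers or registers scorers may differ, but the shape is the documented one: take the model output and the ground truth, return a score and a metadata dict.

import json

def json_keys_match(output: str, target: dict) -> tuple[float, dict]:
    # Score 1.0 when the model's JSON output has exactly the expected keys.
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return 0.0, {"error": "output is not valid JSON"}
    missing = sorted(set(target) - set(parsed))
    extra = sorted(set(parsed) - set(target))
    score = 1.0 if not missing and not extra else 0.0
    return score, {"missing_keys": missing, "extra_keys": extra}

Composition is plain Python: call several scorers and combine their scores however the eval demands.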

Built-in scorers

Exact-match, JSON schema validation, BLEU, ROUGE, regex match, semantic similarity, LLM-as-judge, rubric grading. All as opt-in modules.

Per-trace storage

Every run stores a per-example trace: input, output, score, latency, cost, and model used. DuckDB makes ad-hoc queries fast.
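
Ad-hoc inspection is a few lines with the duckdb Python module. The table and column names below are assumptions based on the fields listed above; check the actual schema in the repo before copying.

import duckdb

con = duckdb.connect("traces.duckdb", read_only=True)

# Ten slowest failing examples from the latest run. "traces", "run_id",
# "latency_ms" and friends are illustrative names, not a documented schema.
print(con.sql("""
    SELECT input, output, score, latency_ms, cost, model
    FROM traces
    WHERE run_id = (SELECT max(run_id) FROM traces)
      AND score < 1.0
    ORDER BY latency_ms DESC
    LIMIT 10
"""))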

Pass rates and aggregates

Aggregate scores per dataset, per slice, per model. Time series across runs. Slice by metadata fields you provide.
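
Slice-level pass rates are one GROUP BY away — again a sketch with assumed table and column names.

import duckdb

con = duckdb.connect("traces.duckdb", read_only=True)

# Pass rate, mean cost and example count per model and slice. Column names
# ("slice", "model", "score", "cost") are illustrative.
print(con.sql("""
    SELECT model, slice,
           avg(CASE WHEN score >= 1.0 THEN 1 ELSE 0 END) AS pass_rate,
           avg(cost) AS mean_cost,
           count(*) AS n
    FROM traces
    GROUP BY model, slice
    ORDER BY model, pass_rate
"""))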

Regression mode

A run can be compared against a baseline; the CLI exits non-zero on regressions over a configured threshold. Drop into GitHub Actions.
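
One low-ceremony way to wire this into CI is a pytest test that shells out to the CLI and asserts a clean exit; the command mirrors the quick start further down, and the regression threshold is whatever the run is configured with. A sketch, assuming evalrunner is on PATH in the CI job:

import subprocess

def test_no_eval_regression():
    # A non-zero exit on a regression against the baseline fails this test,
    # and therefore the build.
    result = subprocess.run(
        ["evalrunner", "run", "examples/qna",
         "--model", "gpt-4o-mini",
         "--scorers", "exact-match",
         "--against-baseline", "main"],
        capture_output=True,
        text=True,
    )
    assert result.returncode == 0, result.stdout + result.stderr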

HTMX viewer

A FastAPI app with HTMX for interactivity. Sub-100KB JS. Loads instantly, even on a flaky connection.

Cost and latency

Every run records cost and latency per example. Catch regressions in £ as well as accuracy.

Adapter API

Bring your own model. Adapters are small Python modules. The default adapters cover OpenAI, Anthropic, and OpenAI-compatible endpoints.
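
The only adapter contract the architecture diagram relies on is adapter.call(example). A sketch of a custom adapter for an OpenAI-compatible endpoint under that assumption — the real base class, registration hook, and HTTP client in the repo may differ (httpx is used here purely for illustration).

import os
import httpx

class LocalModelAdapter:
    # Hypothetical adapter: everything except the call(example) shape
    # is illustrative.
    def __init__(self, base_url="http://localhost:8000/v1", model="my-local-model"):
        self.base_url = base_url
        self.model = model

    def call(self, example: dict) -> str:
        resp = httpx.post(
            f"{self.base_url}/chat/completions",
            headers={"Authorization": f"Bearer {os.environ.get('API_KEY', '')}"},
            json={
                "model": self.model,
                "messages": [{"role": "user", "content": example["input"]}],
            },
            timeout=60,
        )
        resp.raise_for_status()
        return resp.json()["choices"][0]["message"]["content"]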

CI-friendly

No background server required. The CLI runs to completion, writes a summary file, and reports a clean exit code.

Tech stack

Python 3.12 · Typer · DuckDB · Pandas · FastAPI · HTMX · Jinja2 · Pydantic v2 · pytest · ruff · uv

Architecture, in one diagram

The whole system on a single screen. Every box maps to a real folder in the repo.

┌──────────────────────┐      ┌─────────────────────────┐
│ Dataset (JSONL/CSV)  │      │ Adapter (model client)  │
│ · examples in git    │ ──▶  │ · OpenAI                │
│ · slices, metadata   │      │ · Anthropic             │
└──────────┬───────────┘      │ · OpenAI-compatible     │
           │                  └────────────┬────────────┘
           ▼                               ▼
        ┌──────────────────────────────────────────────┐
        │           CLI · evalrunner run               │
        │   for example in dataset:                    │
        │     output = adapter.call(example)           │
        │     for scorer in scorers:                   │
        │       score = scorer(output, example.target) │
        │     trace.write(example, output, score)      │
        └──────────────────────┬───────────────────────┘
                               ▼
        ┌──────────────────────────────────────────────┐
        │   DuckDB · traces.duckdb                     │
        │   · per-example rows                         │
        │   · runs, datasets, scorers tables           │
        └──────────────────────┬───────────────────────┘
                               ▼
        ┌──────────────────────────────────────────────┐
        │   FastAPI viewer (HTMX, Jinja2)              │
        │   · pass rates, slices, regressions, traces  │
        └──────────────────────────────────────────────┘

Quick start

From clone to first eval run in under five minutes.

01
git clone https://github.com/sarmakska/ai-eval-runner.git
cd ai-eval-runner
02
uv sync
cp .env.example .env  # add OPENAI_KEY etc.
03
evalrunner run examples/qna --model gpt-4o-mini \
   --scorers exact-match,json-schema --against-baseline main
04
evalrunner serve  # FastAPI viewer on :8088

Where it fits

The patterns this repository was built around.

Release gates

A new prompt or a new model goes through eval before it merges. Regression mode fails the PR if the score drops by more than the threshold.

Prompt iteration

Iterate prompts locally with the CLI, see traces side-by-side in the viewer, ship when the win is real and not just one example.

Vendor comparison

Run the same dataset across OpenAI, Anthropic, and a local model via the adapter API. Compare cost and accuracy honestly.

Customer-bound metrics

Slice eval results by customer-relevant metadata (industry, doc length, language) so the headline number is not hiding a regression in a slice.

Treat evals like tests, not like dashboards.

Clone the repo, follow the four-step quick start, ship something real.