Open Source · MIT · Python 3.12

Evals as code, with traces and regressions.

One CLI. Datasets, scorers, judges, traces, regression gates, paired A/B. The boring half of AI engineering, done well.

AI Eval Runner is the evaluation harness most teams half-write. Datasets live as JSONL files in your repo. Scorers are plain Python functions. Every run is tagged with the git SHA and the dataset content version, so two runs are only compared when they are genuinely comparable. A regression diff, a CI gate, and a paired bootstrap A/B come built in. A small HTMX viewer browses runs, traces, and diffs without a JavaScript framework.

6 CLI commands · 4+ built-in scorers · Paired bootstrap A/B · SQLite + DuckDB backends · MIT licence

Why this exists

Every team building with LLMs runs evals. Almost none of them run evals well. The first version is a Python script with print statements; the second version is a Jupyter notebook with a chart; the third version is a half-finished homemade dashboard. By the fourth version the team is debating whether to adopt a vendor product.

The vendor products are good. They are also opinionated about your stack and prefer to host your data. For projects where the dataset is sensitive, the AI is internal, and the evaluations are part of CI, hosted vendors are a poor fit.

AI Eval Runner is the open-source middle. Datasets are files. Scorers are functions. Traces live in SQLite or DuckDB, your choice. The viewer is local, fast, and HTMX-based so the JS surface is tiny. Regression mode runs in CI and fails the build when a release loses against a baseline. The same harness powers your local development loop and your release gate.

What is in the box

Every feature below ships in the public repository today. Clone, configure, run.

Scorers as plain Python functions

A scorer is any callable returning a float in 0.0 to 1.0. The runner applies every scorer to every prediction and stores the result under the scorer name. IDE support, real debuggers, no DSL.
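To make the "any callable returning a float" idea concrete, here is a minimal sketch that uses no aieval imports at all. The function names and the `score_all` helper are hypothetical, written only to show the shape: two plain functions, applied in turn, with results stored under each function's name.

```python
from typing import Callable

# A scorer, as described above: (prediction, expected) -> float in [0.0, 1.0].
Scorer = Callable[[str, str], float]


def starts_with_expected(prediction: str, expected: str) -> float:
    """1.0 when the prediction begins with the expected answer, else 0.0."""
    return 1.0 if prediction.strip().startswith(expected.strip()) else 0.0


def token_overlap(prediction: str, expected: str) -> float:
    """Fraction of expected tokens that also appear in the prediction."""
    expected_tokens = set(expected.lower().split())
    if not expected_tokens:
        return 0.0
    predicted_tokens = set(prediction.lower().split())
    return len(expected_tokens & predicted_tokens) / len(expected_tokens)


def score_all(prediction: str, expected: str, scorers: list[Scorer]) -> dict[str, float]:
    """Hypothetical helper: apply every scorer, keyed by function name."""
    return {fn.__name__: fn(prediction, expected) for fn in scorers}


scores = score_all("Paris is the capital", "Paris", [starts_with_expected, token_overlap])
print(scores)
```

Because these are ordinary functions, they can be unit-tested and stepped through in a debugger before they ever touch a dataset.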

Built-in scorers

exact_match, json_valid, rouge_l, and llm_judge ship as opt-in modules. Compose them, override their names, or write your own next to them.

LLM as judge with rubrics

llm_judge builds a scorer that asks a grading model to score against a rubric. Verdicts are parsed strictly as JSON and normalised to 0.0-1.0. The grading provider is fixed independently of the model under evaluation.

Datasets as files

JSONL or in-process lists. Each dataset gets an order-independent content version (SHA-256 over canonical JSON), and a DatasetRegistry records each distinct version under a name so drift is auditable.
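One way an order-independent content version could be computed is sketched below. This is an assumption about the approach, not the project's actual code: each row is serialised to canonical JSON (sorted keys, compact separators), hashed, and the sorted row digests are hashed again. The names `row_digest` and `sketch_content_version` are hypothetical.

```python
import hashlib
import json


def row_digest(row: dict) -> str:
    """SHA-256 of one row's canonical JSON (sorted keys, no extra whitespace)."""
    canonical = json.dumps(row, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def sketch_content_version(rows: list[dict]) -> str:
    """Hash the sorted row digests so that row order does not matter."""
    digests = sorted(row_digest(row) for row in rows)
    return hashlib.sha256("\n".join(digests).encode("utf-8")).hexdigest()


rows = [
    {"input": "Summarise A", "expected": "A, briefly"},
    {"input": "Summarise B", "expected": "B, briefly"},
]
reordered = list(reversed(rows))
edited = [rows[0], {"input": "Summarise B", "expected": "B, at length"}]

print(sketch_content_version(rows) == sketch_content_version(reordered))  # same rows, new order
print(sketch_content_version(rows) == sketch_content_version(edited))     # one row changed
```

Sorting the per-row digests is what makes a reordered file hash the same, while any edit to a row changes its digest and therefore the version.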

Versioning by git SHA + dataset version

Every run is tagged with the working-tree git SHA and the dataset content version. Two runs are only compared when both the code and the data line up.

Regression diff

aieval diff prints per-scorer mean deltas and the examples that moved most. Same view in the viewer at /diff/<run_a>/<run_b>, with movers colour coded in both directions.
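The two numbers a regression diff reports can be sketched in a few lines. The data shape here (one dict of scorer name to score per example, aligned by index, non-empty) is an assumption for illustration, and `scorer_deltas` and `top_movers` are hypothetical names rather than the library's API.

```python
from statistics import mean

# Assumed shape: one {scorer_name: score} dict per example, aligned by index.
Run = list[dict[str, float]]


def scorer_deltas(run_a: Run, run_b: Run) -> dict[str, float]:
    """Mean score of run B minus mean score of run A, per shared scorer."""
    names = sorted(set(run_a[0]) & set(run_b[0]))
    return {
        name: mean(row[name] for row in run_b) - mean(row[name] for row in run_a)
        for name in names
    }


def top_movers(run_a: Run, run_b: Run, scorer: str, k: int = 3) -> list[tuple[int, float]]:
    """The k examples whose score changed most, in either direction."""
    diffs = [(i, b[scorer] - a[scorer]) for i, (a, b) in enumerate(zip(run_a, run_b))]
    return sorted(diffs, key=lambda item: abs(item[1]), reverse=True)[:k]


run_a = [{"rouge_l": 0.5}, {"rouge_l": 0.75}, {"rouge_l": 1.0}]
run_b = [{"rouge_l": 0.5}, {"rouge_l": 0.25}, {"rouge_l": 1.0}]

print(scorer_deltas(run_a, run_b))
print(top_movers(run_a, run_b, "rouge_l", k=1))
```

The mean delta says whether the run moved; the movers say which examples to open first.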

CI gate

aieval ci <run_id> --threshold 0.05 compares a candidate run to a baseline and exits non-zero when any scorer regresses past the threshold. Drop into GitHub Actions.

Paired bootstrap A/B

aieval pairwise resamples per-example differences with replacement and reports a percentile confidence interval. Winner is declared only when the interval clears zero. Seeded, reproducible.

HTMX viewer

FastAPI + HTMX + Jinja2. Run list, per-example traces, regression diff. Sub-100KB JS surface; loads instantly on a flaky connection.

OpenTelemetry with GenAI semantics

aieval.run and aieval.example spans, with gen_ai.request.model, gen_ai.system, aieval.run.dataset_version, aieval.score.<scorer> attributes. Behind an optional extra; no hard telemetry dependency.

SQLite or DuckDB backends

Both file-based, both server-less. AIEVAL_BACKEND switches them. SQLite by default for portability; DuckDB for ad-hoc analytics on traces.

Provider adapters

SarmaLink and OpenAI-compatible adapters ship in aieval/providers. Bring your own model by adding a small Python module that returns predictions for a list of prompts.

Architecture, in one diagram

A small Python library and CLI built around one idea: an eval run is a function over a dataset, scored by plain Python, persisted by git SHA and dataset version.

Diagram: dataset and scorers feed the runner, traces land in SQLite or DuckDB, the HTMX viewer and the compare commands read them back.

aieval/cli.py

Typer CLI: run, list, view, diff, ci, pairwise.

aieval/core/runner.py

Parallel execution with retry, scoring, telemetry, storage. Records git SHA and dataset version.

aieval/core/dataset.py

JSONL and in-process loaders. content_version() and DatasetRegistry.

aieval/core/scorer.py

@scorer decorator, scorer naming, invoke_scorer context injection.

aieval/core/regression.py

compare() for per-scorer deltas and per-example diffs. The CI gate uses it.

aieval/core/pairwise.py

Paired bootstrap A/B with percentile confidence intervals.

aieval/core/telemetry.py

OpenTelemetry span and attribute capture, with a no-op fallback.

aieval/scorers/*.py

Built-in scorers: exact_match, json_valid, rouge_l, llm_judge.

aieval/backends/*.py

SQLite and DuckDB result stores, both file-based.

aieval/providers/*.py

SarmaLink and OpenAI-compatible adapters.

aieval/viewer/app.py

FastAPI + HTMX run browser and regression diff view.

Run lifecycle

Examples are fanned out under a concurrency semaphore. Each prediction is scored by every scorer. Results are written to the backend; the run is closed with a summary span.

Diagram: run lifecycle. Failures land in Failed; successes land in Done with a summary span attached.
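The fan-out described above can be sketched with asyncio. This is a simplified illustration, not the runner's code: retries, telemetry, and storage are left out, and `run_examples`, `fake_provider`, and the result dict shape are all hypothetical.

```python
import asyncio
from typing import Awaitable, Callable


def exact_match(prediction: str, expected: str) -> float:
    return 1.0 if prediction.strip() == expected.strip() else 0.0


async def fake_provider(prompt: str) -> str:
    """Stand-in for a model call; fails on one input to show the Failed path."""
    if prompt == "boom":
        raise RuntimeError("provider error")
    return prompt.upper()


async def run_examples(
    examples: list[dict],
    predict: Callable[[str], Awaitable[str]],
    scorers: list[Callable[[str, str], float]],
    concurrency: int = 4,
) -> list[dict]:
    semaphore = asyncio.Semaphore(concurrency)  # caps in-flight provider calls

    async def one(index: int, example: dict) -> dict:
        async with semaphore:
            try:
                prediction = await predict(example["input"])
            except Exception as exc:  # a failed example must not stop the run
                return {"index": index, "status": "failed", "error": str(exc)}
        # Every scorer sees every successful prediction.
        scores = {fn.__name__: fn(prediction, example["expected"]) for fn in scorers}
        return {"index": index, "status": "done", "prediction": prediction, "scores": scores}

    # gather returns results in input order, so indexes line up with the dataset.
    return await asyncio.gather(*(one(i, ex) for i, ex in enumerate(examples)))


examples = [{"input": "hi", "expected": "HI"}, {"input": "boom", "expected": ""}]
results = asyncio.run(run_examples(examples, fake_provider, [exact_match], concurrency=2))
for result in results:
    print(result)
```

Catching the exception per example is what lets one bad provider call land in Failed while the rest of the run completes.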

Built-in scorers

Importable from aieval.scorers. Each one is a normal scorer, so it composes with anything you write yourself.

exact_match

1.0 when the trimmed prediction equals the trimmed expected answer, else 0.0.

json_valid

1.0 when the prediction parses as JSON, else 0.0.

rouge_l

ROUGE-L F1 between prediction and expected, on whitespace tokens.

llm_judge

Asks a grading model to score against a rubric. Verdicts parsed strictly as {"score": 0-10, "reason": "..."} and normalised to 0.0-1.0. Malformed verdicts score 0.0 rather than crashing the run.
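For readers who have not met ROUGE-L, the sketch below shows the metric on whitespace tokens: the longest common subsequence (LCS) between prediction and expected gives precision and recall, and the score is their F1. It is an independent illustration with equal weighting of precision and recall; the bundled rouge_l may differ in details such as casing or weighting.

```python
def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence, one DP row at a time."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(prediction: str, expected: str) -> float:
    """ROUGE-L F1 on whitespace tokens; 0.0 when either side is empty."""
    pred, ref = prediction.split(), expected.split()
    if not pred or not ref:
        return 0.0
    lcs = lcs_length(pred, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(pred)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


print(rouge_l_f1("the cat sat", "the cat sat"))
print(rouge_l_f1("the cat sat on the mat", "the cat sat"))
```

In the second call the prediction contains the whole reference (recall 1.0) but is twice as long (precision 0.5), so padding a summary costs score.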

Quick start

Five steps from clone to a first eval. Commands taken straight from the README.

01 Clone the repo
   git clone https://github.com/sarmakska/ai-eval-runner.git
   cd ai-eval-runner

02 Install with uv
   uv sync
   cp .env.example .env   # set SARMALINK_API_KEY or OPENAI_API_KEY

03 Run the bundled example eval
   uv run aieval run examples/summarisation/eval.py

04 List recent runs
   uv run aieval list

05 Open the HTMX viewer
   uv run aieval view   # http://localhost:8000

Writing evals, in real code

Snippets from the wiki and the bundled examples. Each one runs once you have set a provider key; the comparison snippets also need the ids of two existing runs.

Writing an eval (Python)

from aieval import dataset, run, scorer
from aieval.scorers import llm_judge, rouge_l


@scorer
def length_under_120_words(prediction: str, _expected: str) -> float:
    return 1.0 if len(prediction.split()) <= 120 else 0.0


faithful = llm_judge(
    rubric="Reward summaries faithful to the source that omit nothing important.",
    provider="sarmalink",
    model="smart",
    name="faithfulness",
)


if __name__ == "__main__":
    run(
        name="summarisation",
        dataset=dataset.jsonl("examples/summarisation/dataset.jsonl"),
        scorers=[rouge_l, length_under_120_words, faithful],
        provider="sarmalink",
        model="smart",
    )

Scorer context: prompt, example, model (Python)

from aieval import scorer

# Declare 'example', 'model', 'provider' or 'prompt' as keyword-only
# parameters and the runner fills them in via invoke_scorer.

@scorer
def grounded_in_prompt(
    prediction: str,
    _expected: str,
    *,
    prompt: str | None = None,
) -> float:
    return 1.0 if prompt and prompt.split()[0].lower() in prediction.lower() else 0.0

# The common two-argument case is unchanged.
# The extra parameters are supplied only when you declare them.

Comparing runs: diff, CI gate, paired A/B (bash)

uv run aieval list                              # find run ids
uv run aieval diff <run_a> <run_b>              # per-scorer mean deltas + movers
uv run aieval pairwise <run_a> <run_b> \
    --confidence 0.95 --iterations 2000         # paired bootstrap, winner column
uv run aieval ci <run_id> --threshold 0.05      # exits non-zero on regression

In code: compare and pairwise (Python)

from aieval import compare, pairwise
from aieval.backends import get_backend

backend = get_backend()

# Placeholders: substitute two run ids printed by `aieval list`.
run_a = "<run_a>"
run_b = "<run_b>"

# Regression diff
report = compare(backend.get_results(run_a), backend.get_results(run_b))
for d in report.scorer_deltas:
    print(d.scorer, d.delta)

# Paired bootstrap with a 95% confidence interval
for r in pairwise(backend.get_results(run_a), backend.get_results(run_b)):
    print(r.scorer, r.mean_diff, (r.ci_low, r.ci_high), r.winner)
    # winner reads: 'a', 'b', or 'tie'

CI gate in a workflow (YAML)

- run: uv run aieval run my_eval.py
- run: |
    RUN_ID=$(uv run python -c \
      "from aieval.backends import get_backend; print(get_backend().list_runs()[0]['id'])")
    uv run aieval ci "$RUN_ID" --threshold 0.05

OpenTelemetry, GenAI semantics

Two span types: aieval.run once per run, aieval.example once per example. Attribute names follow the GenAI semantic conventions where they exist, with the aieval.* namespace for runner-specific signal.

| Attribute | Span | Meaning |
| --- | --- | --- |
| gen_ai.request.model | both | The model under evaluation. |
| gen_ai.system | both | The provider (sarmalink, openai). |
| aieval.run.name | run | The eval name. |
| aieval.run.dataset_version | run | The dataset content version. |
| aieval.run.git_sha | run | The working-tree git SHA. |
| aieval.run.pass_rate | run | Final pass rate. |
| aieval.run.avg_latency_ms | run | Mean provider latency. |
| aieval.example.index | example | The example index. |
| aieval.example.latency_ms | example | Provider latency for the example. |
| aieval.score.<scorer> | example | Score from each scorer. |
| aieval.example.error | example | Error message when an example fails. |

Where it fits

The patterns this repository was built around, and the ones it deliberately was not.

Release gates in CI

A new prompt or model goes through the eval before it merges. The CI gate fails the PR if any scorer regresses by more than the threshold against the baseline run.

Prompt iteration on the desk

Run locally, see traces side by side in the viewer, ship only when the win is real on the whole dataset and not just one cherry-picked example.

Honest vendor comparison

Run the same dataset across OpenAI, SarmaLink-AI, and any OpenAI-compatible provider via the adapter API. Compare cost, latency, and accuracy with paired statistics.

Slice-aware metrics

Tag examples with metadata (industry, doc length, language) and slice the headline number so a regression in a slice is not hidden by the aggregate.
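A small sketch shows why slicing matters. The result shape (a `metadata` dict and a single `score` per example) and the helper `slice_means` are assumptions made for illustration; how the project stores or exposes slices is not specified here.

```python
from collections import defaultdict
from statistics import mean


def slice_means(results: list[dict], key: str) -> dict[str, float]:
    """Mean score per value of one metadata key; missing keys go to 'unknown'."""
    buckets: dict[str, list[float]] = defaultdict(list)
    for result in results:
        buckets[result["metadata"].get(key, "unknown")].append(result["score"])
    return {value: mean(scores) for value, scores in buckets.items()}


results = [
    {"metadata": {"language": "en"}, "score": 1.0},
    {"metadata": {"language": "en"}, "score": 1.0},
    {"metadata": {"language": "en"}, "score": 1.0},
    {"metadata": {"language": "de"}, "score": 0.0},
]

overall = mean(r["score"] for r in results)
by_language = slice_means(results, "language")
print(overall)       # the headline number
print(by_language)   # the same data, split by language
```

The headline number looks tolerable while one slice has failed completely, which is exactly the case an aggregate hides.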

When NOT to reach for it

You need a hosted multi-tenant platform with dashboards and team management out of the box. This is a self-hosted toolkit; the viewer runs locally and there is no SaaS tier.

Not a labelling tool

Programmatic scoring is the focus. If your evaluation is pure human review with no programmatic scoring, the regression and pairwise commands give you nothing.

Tech stack

Python 3.12 · Typer · Pydantic v2 · DuckDB · SQLite · Pandas · FastAPI · HTMX · Jinja2 · OpenTelemetry (extra) · pytest · uv

Compared to the alternatives

Three popular hosted eval platforms and rolling your own. Honest comparisons.

| Feature | AI Eval Runner | Braintrust | LangSmith | PromptLayer | DIY |
| --- | --- | --- | --- | --- | --- |
| Self-hosted, MIT | Yes | No | No | No | Yes |
| Datasets as plain files | JSONL in git | Hosted | Hosted | Hosted | Yours |
| Scorers as plain Python | Yes | TypeScript / Python SDKs | SDK | Limited | Yours |
| LLM-as-judge with rubric | Yes, strict JSON parse | Yes | Yes | Limited | You build it |
| Paired bootstrap A/B | Yes, seeded | No | No | No | You build it |
| CI gate exit code | Yes | Yes | Yes | Partial | You build it |
| OTel GenAI semantics | Yes (extra) | Partial | Yes | No | You write it |
| Versioning by git SHA + dataset | Yes | Partial | Partial | No | Yours |

Frequently asked

Eight real questions from teams that have wired this into their release path.

Why plain Python scorers rather than a DSL?

IDE support, real debuggers, and no leaky abstraction between what you want to measure and how you express it. A scorer is any callable returning a float in 0.0 to 1.0. Scorers that need more signal declare example, model, provider, or prompt as keyword-only parameters and the runner fills them via invoke_scorer. This is what lets the LLM judge see the original request without changing the common two-argument case.

What does the paired bootstrap actually do?

It resamples the per-example differences between two runs with replacement many times, recomputes the mean difference on each resample, and reports a percentile confidence interval. The winner column reads b when the interval sits above zero, a when it sits below zero, or tie when the interval straddles zero. The generator is seeded so results are reproducible.
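The procedure just described fits in a short function. The sketch below is an independent illustration using the standard library, not the project's implementation: the function name, the default seed, and the simple index-based percentile are assumptions.

```python
import random
from statistics import mean


def paired_bootstrap(
    scores_a: list[float],
    scores_b: list[float],
    iterations: int = 2000,
    confidence: float = 0.95,
    seed: int = 0,
) -> dict:
    """Percentile bootstrap CI on the mean per-example difference (B minus A)."""
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("runs must be non-empty and aligned example by example")
    diffs = [b - a for a, b in zip(scores_a, scores_b)]
    rng = random.Random(seed)  # seeded, so the interval is reproducible
    n = len(diffs)
    # Resample the differences with replacement and record each resample's mean.
    means = sorted(mean(rng.choices(diffs, k=n)) for _ in range(iterations))
    alpha = (1.0 - confidence) / 2.0
    ci_low = means[int(alpha * (iterations - 1))]
    ci_high = means[int((1.0 - alpha) * (iterations - 1))]
    # A winner is declared only when the whole interval clears zero.
    winner = "b" if ci_low > 0 else "a" if ci_high < 0 else "tie"
    return {"mean_diff": mean(diffs), "ci_low": ci_low, "ci_high": ci_high, "winner": winner}


clear_win = paired_bootstrap([0.5] * 10, [0.75] * 10)
no_change = paired_bootstrap([0.5] * 10, [0.5] * 10)
print(clear_win["winner"], no_change["winner"])
```

Resampling the differences, rather than each run separately, is what makes the test paired: per-example difficulty cancels out, so a small but consistent gain can still clear zero.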

What makes two runs comparable?

Each run is tagged with the git SHA and the dataset content version. The dataset version is an order-independent SHA-256 over the canonical JSON of every row, so reordering rows does not invalidate a comparison, but editing a row does. Comparing runs over different datasets is possible but rarely meaningful, since examples no longer line up by index.

How does the CI gate decide there is a regression?

aieval ci compares the candidate run against a baseline and exits non-zero when any scorer mean drops by more than the threshold. By default the baseline is the previous run of the same name; pass --baseline <run_id> to pin it. If there is no baseline the gate passes, so the first run on a new eval never blocks.

Do I need a server for the backend?

No. SQLite and DuckDB are both file-based. Set AIEVAL_BACKEND and AIEVAL_DB_PATH to switch. SQLite is the default for portability; DuckDB is there when you want ad-hoc analytics on traces.

How does the LLM judge avoid crashing the run?

The grading model is prompted to answer with a single JSON object containing score and reason. The verdict is parsed strictly. A malformed verdict scores 0.0 rather than crashing the run, so a flaky judge does not bring the eval down.
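Strict parsing with a safe fallback might look like the sketch below. The 0 to 10 score range comes from the llm_judge description; the function name and the decisions to require a string reason and to treat out-of-range scores as malformed (rather than clamping them) are assumptions.

```python
import json


def parse_verdict(raw: str) -> float:
    """Return the judge's score normalised to 0.0-1.0, or 0.0 if malformed."""
    try:
        verdict = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        return 0.0
    if not isinstance(verdict, dict):
        return 0.0
    score = verdict.get("score")
    reason = verdict.get("reason")
    # bool is a subclass of int in Python, so rule it out explicitly.
    if isinstance(score, bool) or not isinstance(score, (int, float)):
        return 0.0
    if not isinstance(reason, str):
        return 0.0
    if not 0 <= score <= 10:  # also rejects NaN, which fails both comparisons
        return 0.0
    return score / 10.0


print(parse_verdict('{"score": 7, "reason": "Faithful, one omission."}'))
print(parse_verdict("I would give this a 7 out of 10."))
```

Returning 0.0 instead of raising keeps one chatty or truncated judge reply from ending the whole run, at the cost of counting that example as a miss.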

How does telemetry stay optional?

Capture lives in aieval/core/telemetry.py behind an optional extra. A span() context manager yields a SpanHandle whose set() method records attributes once they are known. The captured attribute dict is available in both modes, so the same code path works whether or not OpenTelemetry is installed, and tests assert on captured attributes without standing up a collector.
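The pattern can be sketched as follows. Only the names span(), SpanHandle, and set() come from the description above; the internals are an assumption about how such a fallback could be built, using the OpenTelemetry tracing API when it is importable and plain dict capture otherwise.

```python
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import Any, Iterator

try:  # the optional extra: use OpenTelemetry only if it is installed
    from opentelemetry import trace

    _tracer = trace.get_tracer("aieval-sketch")
except ImportError:
    _tracer = None


@dataclass
class SpanHandle:
    name: str
    attributes: dict[str, Any] = field(default_factory=dict)
    _otel_span: Any = field(default=None, repr=False)

    def set(self, key: str, value: Any) -> None:
        """Record an attribute locally and, if present, on the real span."""
        self.attributes[key] = value
        if self._otel_span is not None:
            self._otel_span.set_attribute(key, value)


@contextmanager
def span(name: str) -> Iterator[SpanHandle]:
    if _tracer is None:
        yield SpanHandle(name)  # no-op mode: attributes are still captured
        return
    with _tracer.start_as_current_span(name) as otel_span:
        yield SpanHandle(name, _otel_span=otel_span)


with span("aieval.run") as run_span:
    run_span.set("aieval.run.name", "summarisation")
    run_span.set("gen_ai.request.model", "smart")

print(run_span.attributes)
```

Because the attribute dict is filled in both modes, a test can assert on `run_span.attributes` directly, with no collector or exporter running.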

Can the LLM judge grade itself?

It can, but the design lets you fix the grading provider and model independently of the model under evaluation. That means you can grade a cheap model with a stronger judge, which usually surfaces more meaningful signal than a model marking its own homework.

Related products

The rest of the Sarma Linux toolkit. Same opinions throughout: open source, MIT, real depth.

SarmaLink-AI

Multi-provider AI backend with sub-50ms failover across 36 engines.

MCP Server Toolkit

Production-ready Model Context Protocol server starter, with plugins.

Voice Agent Starter

Sub-second real-time voice loop with WebRTC, barge-in, and pluggable STT/TTS.

Agent Orchestrator

Deterministic-replay multi-agent workflows with durable state.

Local LLM Router

OpenAI-compatible proxy that routes between local Ollama and cloud LLMs.

StaffPortal

Open-source HR + ops platform built to replace three SaaS subscriptions.

RAG-over-PDF

A minimal, production-shaped RAG starter with cited streaming answers.

Receipt Scanner

Vision-OCR receipt scanning starter with Zod-typed JSON output.

Webhook-to-Email

A tiny, production-grade webhook receiver with HMAC and React Email.

k8s-ops-toolkit

Helm chart for Next.js + bootstrap script for the full observability stack.

terraform-stack

Vercel + Supabase + Cloudflare + DigitalOcean as one Terraform repo.

slipstream

Claude Code plugin v1.0: React dashboard with code graph, cross-tab agent bus, ~95% per-read savings, 75 skills.

forge-infer

Minimal LLM inference server with paged KV-cache and speculative decoding.

shipyard

Multi-tenant SaaS starter with isolation, RBAC, billing, audit and rate limits.

lsmdb

Log-structured merge-tree storage engine in Go with WAL and MVCC snapshots.

raftkv

Raft key-value store with a fault-injection harness that proves linearizability.

sandboxd

WebAssembly sandbox for running untrusted code under strict CPU and memory limits.

Treat evals like tests, not like dashboards.

Clone the repo, write a scorer, run the example eval, wire the CI gate.
