Evals as code, with traces and regressions.
One CLI. Datasets, scorers, judges, traces, regression gates, paired A/B. The boring half of AI engineering, done well.
AI Eval Runner is the evaluation harness most teams half-write. Datasets live as JSONL files in your repo. Scorers are plain Python functions. Every run is tagged with the git SHA and the dataset content version, so two runs are only compared when they are genuinely comparable. A regression diff, a CI gate, and a paired bootstrap A/B come built in. A small HTMX viewer browses runs, traces, and diffs without a JavaScript framework.
Why this exists
Every team building with LLMs runs evals. Almost none of them run evals well. The first version is a Python script with print statements; the second version is a Jupyter notebook with a chart; the third version is a half-finished homemade dashboard. By the fourth version the team is debating whether to adopt a vendor product.
The vendor products are good. They are also opinionated about your stack and prefer to host your data. For projects where the dataset is sensitive, the AI is internal, and the evaluations are part of CI, hosted vendors are a poor fit.
AI Eval Runner is the open-source middle. Datasets are files. Scorers are functions. Traces live in SQLite or DuckDB, your choice. The viewer is local, fast, and HTMX-based so the JS surface is tiny. Regression mode runs in CI and fails the build when a release loses against a baseline. The same harness powers your local development loop and your release gate.
What is in the box
Every feature below ships in the public repository today. Clone, configure, run.
Scorers as plain Python functions
A scorer is any callable returning a float in 0.0 to 1.0. The runner applies every scorer to every prediction and stores the result under the scorer name. IDE support, real debuggers, no DSL.
Built-in scorers
exact_match, json_valid, rouge_l, and llm_judge ship as opt-in modules. Compose them, override their names, or write your own next to them.
LLM as judge with rubrics
llm_judge builds a scorer that asks a grading model to score against a rubric. Verdicts are parsed strictly as JSON and normalised to 0.0-1.0. The grading provider is fixed independently of the model under evaluation.
Datasets as files
JSONL or in-process lists. Each dataset gets an order-independent content version (SHA-256 over canonical JSON), and a DatasetRegistry records each distinct version under a name so drift is auditable.
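One way to make a content version order-independent is to hash each row's canonical JSON and then hash the sorted row digests. The sketch below illustrates that idea only; `content_version_sketch` is a hypothetical name, and the exact canonicalisation the library uses is an assumption.

```python
import hashlib
import json


def content_version_sketch(rows: list[dict]) -> str:
    """Hypothetical helper: order-independent SHA-256 over canonical JSON rows."""
    # Canonical JSON: sorted keys and fixed separators, so equal rows
    # always serialise to the same bytes.
    digests = sorted(
        hashlib.sha256(
            json.dumps(row, sort_keys=True, separators=(",", ":")).encode("utf-8")
        ).hexdigest()
        for row in rows
    )
    # Sorting the per-row digests removes any dependence on row order.
    return hashlib.sha256("\n".join(digests).encode("utf-8")).hexdigest()


rows = [{"input": "a", "expected": "b"}, {"input": "c", "expected": "d"}]
reordered = list(reversed(rows))
edited = [{"input": "a", "expected": "B"}, {"input": "c", "expected": "d"}]

same_after_reorder = content_version_sketch(rows) == content_version_sketch(reordered)
changed_after_edit = content_version_sketch(rows) != content_version_sketch(edited)
```

Reordering rows leaves the version unchanged, while editing any row produces a new one, which is the property that makes drift auditable.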
Versioning by git SHA + dataset version
Every run is tagged with the working-tree git SHA and the dataset content version. Two runs are only compared when both the code and the data line up.
Regression diff
aieval diff prints per-scorer mean deltas and the examples that moved most. Same view in the viewer at /diff/<run_a>/<run_b>, with movers colour coded in both directions.
CI gate
aieval ci <run_id> --threshold 0.05 compares a candidate run to a baseline and exits non-zero when any scorer regresses past the threshold. Drop into GitHub Actions.
Paired bootstrap A/B
aieval pairwise resamples per-example differences with replacement and reports a percentile confidence interval. Winner is declared only when the interval clears zero. Seeded, reproducible.
HTMX viewer
FastAPI + HTMX + Jinja2. Run list, per-example traces, regression diff. Sub-100KB JS surface; loads instantly on a flaky connection.
OpenTelemetry with GenAI semantics
aieval.run and aieval.example spans, with gen_ai.request.model, gen_ai.system, aieval.run.dataset_version, aieval.score.<scorer> attributes. Behind an optional extra; no hard telemetry dependency.
SQLite or DuckDB backends
Both file-based, both server-less. AIEVAL_BACKEND switches them. SQLite by default for portability; DuckDB for ad-hoc analytics on traces.
Provider adapters
SarmaLink and OpenAI-compatible adapters ship in aieval/providers. Bring your own model by adding a small Python module that returns predictions for a list of prompts.
Architecture, in one diagram
A small Python library and CLI built around one idea: an eval run is a function over a dataset, scored by plain Python, persisted by git SHA and dataset version.
aieval/cli.py - Typer CLI: run, list, view, diff, ci, pairwise.
aieval/core/runner.py - Parallel execution with retry, scoring, telemetry, storage. Records git SHA and dataset version.
aieval/core/dataset.py - JSONL and in-process loaders. content_version() and DatasetRegistry.
aieval/core/scorer.py - @scorer decorator, scorer naming, invoke_scorer context injection.
aieval/core/regression.py - compare() for per-scorer deltas and per-example diffs. The CI gate uses it.
aieval/core/pairwise.py - Paired bootstrap A/B with percentile confidence intervals.
aieval/core/telemetry.py - OpenTelemetry span and attribute capture, with a no-op fallback.
aieval/scorers/*.py - Built-in scorers: exact_match, json_valid, rouge_l, llm_judge.
aieval/backends/*.py - SQLite and DuckDB result stores, both file-based.
aieval/providers/*.py - SarmaLink and OpenAI-compatible adapters.
aieval/viewer/app.py - FastAPI + HTMX run browser and regression diff view.
Run lifecycle
Examples are fanned out under a concurrency semaphore. Each prediction is scored by every scorer. Results are written to the backend; the run is closed with a summary span.
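The fan-out step can be pictured with a small asyncio sketch. This is not the runner's code: `run_examples`, the `input` and `expected` field names, and the `echo` provider are all hypothetical, and retry, telemetry, and storage are left out to keep the shape visible.

```python
import asyncio
from typing import Awaitable, Callable

Scorer = Callable[[str, str], float]


async def run_examples(
    examples: list[dict],
    predict: Callable[[str], Awaitable[str]],
    scorers: dict[str, Scorer],
    concurrency: int = 4,
) -> list[dict]:
    """Hypothetical sketch: fan out examples, score each prediction with every scorer."""
    semaphore = asyncio.Semaphore(concurrency)

    async def one(index: int, example: dict) -> dict:
        # The semaphore caps how many provider calls are in flight at once.
        async with semaphore:
            prediction = await predict(example["input"])
        scores = {
            name: fn(prediction, example["expected"]) for name, fn in scorers.items()
        }
        return {"index": index, "prediction": prediction, "scores": scores}

    # gather preserves input order, so results line up with the dataset.
    return await asyncio.gather(*(one(i, ex) for i, ex in enumerate(examples)))


async def echo(prompt: str) -> str:
    # Stand-in for a real provider call.
    await asyncio.sleep(0)
    return prompt.upper()


results = asyncio.run(
    run_examples(
        [{"input": "hi", "expected": "HI"}, {"input": "yo", "expected": "no"}],
        predict=echo,
        scorers={"exact": lambda p, e: 1.0 if p.strip() == e.strip() else 0.0},
    )
)
```

Scoring happens outside the semaphore here, so a slow scorer does not hold a provider slot; whether the real runner makes the same choice is an assumption.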
Built-in scorers
Importable from aieval.scorers. Each one is a normal scorer, so it composes with anything you write yourself.
exact_match: 1.0 when the trimmed prediction equals the trimmed expected answer, else 0.0.
json_valid: 1.0 when the prediction parses as JSON, else 0.0.
rouge_l: ROUGE-L F1 between prediction and expected, on whitespace tokens.
llm_judge: Asks a grading model to score against a rubric. Verdicts parsed strictly as {"score": 0-10, "reason": "..."} and normalised to 0.0-1.0. Malformed verdicts score 0.0 rather than crashing the run.
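For readers who want to see what ROUGE-L F1 on whitespace tokens amounts to, here is a minimal standalone sketch based on the longest common subsequence. `rouge_l_f1` is a hypothetical name, and the shipped rouge_l may differ in details such as casing or empty-input handling.

```python
def rouge_l_f1(prediction: str, expected: str) -> float:
    """Hypothetical sketch: ROUGE-L F1 from the LCS of whitespace tokens."""
    pred, ref = prediction.split(), expected.split()
    if not pred or not ref:
        return 0.0
    # Dynamic-programming LCS, keeping only the previous row.
    prev = [0] * (len(ref) + 1)
    for p_tok in pred:
        curr = [0]
        for j, r_tok in enumerate(ref, start=1):
            if p_tok == r_tok:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    lcs = prev[-1]
    if lcs == 0:
        return 0.0
    # Precision is LCS over prediction length, recall is LCS over reference length.
    precision, recall = lcs / len(pred), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


identical = rouge_l_f1("the cat sat", "the cat sat")
partial = rouge_l_f1("a b c d", "a c")
disjoint = rouge_l_f1("x", "y")
```

Because it takes a prediction and an expected string and returns a float in 0.0 to 1.0, a function like this already has the shape of a scorer.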
Quick start
Five steps from clone to a first eval. Commands taken straight from the README.
```bash
git clone https://github.com/sarmakska/ai-eval-runner.git
cd ai-eval-runner
uv sync
cp .env.example .env   # set SARMALINK_API_KEY or OPENAI_API_KEY
uv run aieval run examples/summarisation/eval.py
uv run aieval list
uv run aieval view     # http://localhost:8000
```
Writing evals, in real code
Snippets from the wiki and the bundled examples. Every one runs as-is once you have set a provider key.
Writing an eval
```python
from aieval import dataset, run, scorer
from aieval.scorers import llm_judge, rouge_l


# A custom scorer: 1.0 when the summary stays within 120 words.
@scorer
def length_under_120_words(prediction: str, _expected: str) -> float:
    return 1.0 if len(prediction.split()) <= 120 else 0.0


# An LLM judge bound to a rubric, stored under the name "faithfulness".
faithful = llm_judge(
    rubric="Reward summaries faithful to the source that omit nothing important.",
    provider="sarmalink",
    model="smart",
    name="faithfulness",
)

if __name__ == "__main__":
    run(
        name="summarisation",
        dataset=dataset.jsonl("examples/summarisation/dataset.jsonl"),
        scorers=[rouge_l, length_under_120_words, faithful],
        provider="sarmalink",
        model="smart",
    )
```
Scorer context (prompt, example, model)
```python
from aieval import scorer


# Declare 'example', 'model', 'provider' or 'prompt' as keyword-only
# parameters and the runner fills them in via invoke_scorer.
@scorer
def grounded_in_prompt(
    prediction: str,
    _expected: str,
    *,
    prompt: str | None = None,
) -> float:
    return 1.0 if prompt and prompt.split()[0].lower() in prediction.lower() else 0.0


# The common two-argument case is unchanged.
# The extra parameters are supplied only when you declare them.
```
Comparing runs (diff, CI gate, paired A/B)
```bash
uv run aieval list                           # find run ids
uv run aieval diff <run_a> <run_b>           # per-scorer mean deltas + movers
uv run aieval pairwise <run_a> <run_b> \
  --confidence 0.95 --iterations 2000        # paired bootstrap, winner column
uv run aieval ci <run_id> --threshold 0.05   # exits non-zero on regression
```
In code: compare and pairwise
```python
from aieval import compare, pairwise
from aieval.backends import get_backend

backend = get_backend()

# Placeholders: substitute two run ids from `aieval list`.
run_a, run_b = "<run_a>", "<run_b>"

# Regression diff
report = compare(backend.get_results(run_a), backend.get_results(run_b))
for d in report.scorer_deltas:
    print(d.scorer, d.delta)

# Paired bootstrap with a 95% confidence interval
for r in pairwise(backend.get_results(run_a), backend.get_results(run_b)):
    print(r.scorer, r.mean_diff, (r.ci_low, r.ci_high), r.winner)
    # winner reads: 'a', 'b', or 'tie'
```
CI gate in a workflow
```yaml
- run: uv run aieval run my_eval.py
- run: |
    RUN_ID=$(uv run python -c \
      "from aieval.backends import get_backend; print(get_backend().list_runs()[0]['id'])")
    uv run aieval ci "$RUN_ID" --threshold 0.05
```
OpenTelemetry, GenAI semantics
Two span types: aieval.run once per run, aieval.example once per example. Attribute names follow the GenAI semantic conventions where they exist, with the aieval.* namespace for runner-specific signal.
| Attribute | Span | Meaning |
|---|---|---|
| gen_ai.request.model | both | The model under evaluation. |
| gen_ai.system | both | The provider (sarmalink, openai). |
| aieval.run.name | run | The eval name. |
| aieval.run.dataset_version | run | The dataset content version. |
| aieval.run.git_sha | run | The working-tree git SHA. |
| aieval.run.pass_rate | run | Final pass rate. |
| aieval.run.avg_latency_ms | run | Mean provider latency. |
| aieval.example.index | example | The example index. |
| aieval.example.latency_ms | example | Provider latency for the example. |
| aieval.score.<scorer> | example | Score from each scorer. |
| aieval.example.error | example | Error message when an example fails. |
Where it fits
The patterns this repository was built around, and the ones it deliberately leaves out.
Release gates in CI
A new prompt or model goes through the eval before it merges. The CI gate fails the PR if any scorer regresses by more than the threshold against the baseline run.
Prompt iteration on the desk
Run locally, see traces side by side in the viewer, ship only when the win is real on the whole dataset and not just one cherry-picked example.
Honest vendor comparison
Run the same dataset across OpenAI, SarmaLink-AI, and any OpenAI-compatible provider via the adapter API. Compare cost, latency, and accuracy with paired statistics.
Slice-aware metrics
Tag examples with metadata (industry, doc length, language) and slice the headline number so a regression in a slice is not hidden by the aggregate.
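Slicing comes down to grouping per-example scores by a metadata key before averaging. The sketch below assumes a hypothetical result shape, a dict with `metadata` and `score` keys; the shape the backend actually returns may differ.

```python
from collections import defaultdict
from statistics import mean


def mean_by_slice(results: list[dict], key: str) -> dict[str, float]:
    """Hypothetical helper: mean score per value of one metadata key."""
    buckets: dict[str, list[float]] = defaultdict(list)
    for r in results:
        # Examples without the key land in an explicit "unknown" slice.
        buckets[r["metadata"].get(key, "unknown")].append(r["score"])
    return {name: mean(scores) for name, scores in buckets.items()}


results = [
    {"metadata": {"language": "en"}, "score": 1.0},
    {"metadata": {"language": "en"}, "score": 1.0},
    {"metadata": {"language": "de"}, "score": 0.0},
    {"metadata": {}, "score": 0.5},
]
by_language = mean_by_slice(results, "language")
overall = mean(r["score"] for r in results)
```

In this toy data the aggregate sits at 0.625 while the `de` slice is at 0.0, which is the kind of gap a single headline number hides.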
When NOT to reach for it
You need a hosted multi-tenant platform with dashboards and team management out of the box. This is a self-hosted toolkit; the viewer runs locally and there is no SaaS tier.
Not a labelling tool
Programmatic scoring is the focus. If your evaluation is pure human review with no programmatic scoring, the regression and pairwise commands give you nothing.
Compared to the alternatives
Three popular hosted eval platforms and rolling your own. Honest comparisons.
| Feature | AI Eval Runner | Braintrust | LangSmith | PromptLayer | DIY |
|---|---|---|---|---|---|
| Self-hosted, MIT | Yes | No | No | No | Yes |
| Datasets as plain files | JSONL in git | Hosted | Hosted | Hosted | Yours |
| Scorers as plain Python | Yes | TypeScript / Python SDKs | SDK | Limited | Yours |
| LLM-as-judge with rubric | Yes, strict JSON parse | Yes | Yes | Limited | You build it |
| Paired bootstrap A/B | Yes, seeded | No | No | No | You build it |
| CI gate exit code | Yes | Yes | Yes | Partial | You build it |
| OTel GenAI semantics | Yes (extra) | Partial | Yes | No | You write it |
| Versioning by git SHA + dataset | Yes | Partial | Partial | No | Yours |
Documentation, all in the wiki
Focused pages with no marketing in between. Each one answers a single operational question.
Frequently asked
Eight real questions from teams that have wired this into their release path.
Why plain Python scorers rather than a DSL?
IDE support, real debuggers, and no leaky abstraction between what you want to measure and how you express it. A scorer is any callable returning a float in 0.0 to 1.0. Scorers that need more signal declare example, model, provider, or prompt as keyword-only parameters and the runner fills them via invoke_scorer. This is what lets the LLM judge see the original request without changing the common two-argument case.
What does the paired bootstrap actually do?
It resamples the per-example differences between two runs with replacement many times, recomputes the mean difference on each resample, and reports a percentile confidence interval. The winner column reads b when the interval sits above zero, a when it sits below zero, or tie when the interval straddles zero. The generator is seeded so results are reproducible.
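The procedure fits in a few lines of standard-library Python. This is a sketch under stated assumptions rather than the shipped implementation: `paired_bootstrap` is a hypothetical name, differences are taken as b minus a, and the percentile indexing is one of several reasonable conventions.

```python
import random
from statistics import mean


def paired_bootstrap(
    scores_a: list[float],
    scores_b: list[float],
    iterations: int = 2000,
    confidence: float = 0.95,
    seed: int = 0,
) -> dict:
    """Hypothetical sketch of a seeded paired bootstrap with a percentile interval."""
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("runs must cover the same non-empty set of examples")
    # Assumption: differences are b minus a, so a positive interval favours b.
    diffs = [b - a for a, b in zip(scores_a, scores_b)]
    rng = random.Random(seed)  # seeded generator makes the result reproducible
    n = len(diffs)
    # Resample the differences with replacement and record each resample's mean.
    means = sorted(mean(rng.choices(diffs, k=n)) for _ in range(iterations))
    alpha = (1.0 - confidence) / 2.0
    low = means[int(alpha * (iterations - 1))]
    high = means[int((1.0 - alpha) * (iterations - 1))]
    winner = "b" if low > 0 else "a" if high < 0 else "tie"
    return {"mean_diff": mean(diffs), "ci_low": low, "ci_high": high, "winner": winner}


clear_win = paired_bootstrap([0.0] * 20, [1.0] * 20)
no_change = paired_bootstrap([0.5] * 20, [0.5] * 20)
```

Pairing matters because it cancels per-example difficulty: only the difference on each example is resampled, not the raw scores of either run.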
What makes two runs comparable?
Each run is tagged with the git SHA and the dataset content version. The dataset version is an order-independent SHA-256 over the canonical JSON of every row, so reordering rows does not invalidate a comparison, but editing a row does. Comparing runs over different datasets is possible but rarely meaningful, since examples no longer line up by index.
How does the CI gate decide there is a regression?
aieval ci compares the candidate run against a baseline and exits non-zero when any scorer mean drops by more than the threshold. By default the baseline is the previous run of the same name; pass --baseline <run_id> to pin it. If there is no baseline the gate passes, so the first run on a new eval never blocks.
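The decision rule described above can be written as a small pure function. `gate_exit_code` and the dict-of-means inputs are hypothetical; the sketch only mirrors the stated behaviour, including the pass when no baseline exists.

```python
from typing import Optional


def gate_exit_code(
    baseline: Optional[dict[str, float]],
    candidate: dict[str, float],
    threshold: float = 0.05,
) -> int:
    """Hypothetical sketch: 1 when any scorer mean drops by more than threshold."""
    if baseline is None:
        # No baseline yet: the first run of a new eval never blocks.
        return 0
    for scorer, base_mean in baseline.items():
        # Assumption: scorers missing from the candidate are skipped, not failed.
        if scorer in candidate and base_mean - candidate[scorer] > threshold:
            return 1
    return 0


regressed = gate_exit_code({"rouge_l": 0.80}, {"rouge_l": 0.70})
within_noise = gate_exit_code({"rouge_l": 0.80}, {"rouge_l": 0.78})
first_run = gate_exit_code(None, {"rouge_l": 0.10})
```

Improvements never trip the gate under this rule, since only drops larger than the threshold count.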
Do I need a server for the backend?
No. SQLite and DuckDB are both file-based. Set AIEVAL_BACKEND and AIEVAL_DB_PATH to switch. SQLite is the default for portability; DuckDB is there when you want ad-hoc analytics on traces.
How does the LLM judge avoid crashing the run?
The grading model is prompted to answer with a single JSON object containing score and reason. The verdict is parsed strictly. A malformed verdict scores 0.0 rather than crashing the run, so a flaky judge does not bring the eval down.
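Strict parsing with a safe fallback might look like the sketch below. `parse_verdict` is a hypothetical name, and treating an out-of-range score as malformed is an assumption rather than documented behaviour.

```python
import json


def parse_verdict(raw: str) -> float:
    """Hypothetical sketch: map a judge reply to 0.0-1.0, or 0.0 when malformed."""
    try:
        verdict = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        return 0.0
    if not isinstance(verdict, dict):
        return 0.0
    score = verdict.get("score")
    # bool is a subclass of int, so reject it explicitly.
    if isinstance(score, bool) or not isinstance(score, (int, float)):
        return 0.0
    # Assumption: scores outside 0-10 count as malformed.
    if not 0 <= score <= 10:
        return 0.0
    return score / 10.0


good = parse_verdict('{"score": 7, "reason": "mostly faithful"}')
chatty = parse_verdict("Sure! I would give this a 7 out of 10.")
out_of_range = parse_verdict('{"score": 11, "reason": "great"}')
```

Returning 0.0 keeps the run alive, at the cost of making a flaky judge look like a bad model, so malformed verdicts are worth counting separately when you review a run.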
How does telemetry stay optional?
Capture lives in aieval/core/telemetry.py behind an optional extra. A span() context manager yields a SpanHandle whose set() method records attributes once they are known. The captured attribute dict is available in both modes, so the same code path works whether or not OpenTelemetry is installed, and tests assert on captured attributes without standing up a collector.
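A minimal sketch of the pattern, assuming only the names mentioned above (span() and SpanHandle with a set() method); the internals here are illustrative and not the contents of aieval/core/telemetry.py. The tracer name `aieval.sketch` is made up.

```python
from contextlib import contextmanager
from typing import Any, Iterator

try:  # OpenTelemetry is optional; fall back to capture-only mode without it.
    from opentelemetry import trace

    _tracer = trace.get_tracer("aieval.sketch")
except ImportError:
    _tracer = None


class SpanHandle:
    """Records attributes locally and mirrors them to a real span if one exists."""

    def __init__(self, otel_span: Any = None) -> None:
        self.attributes: dict[str, Any] = {}
        self._otel_span = otel_span

    def set(self, key: str, value: Any) -> None:
        self.attributes[key] = value
        if self._otel_span is not None:
            self._otel_span.set_attribute(key, value)


@contextmanager
def span(name: str) -> Iterator[SpanHandle]:
    if _tracer is None:
        # No-op mode: attributes are still captured on the handle.
        yield SpanHandle()
        return
    with _tracer.start_as_current_span(name) as otel_span:
        yield SpanHandle(otel_span)


with span("aieval.run") as handle:
    handle.set("aieval.run.name", "summarisation")
    handle.set("gen_ai.request.model", "smart")
```

Because the handle keeps its own attribute dict in both modes, a test can assert on `handle.attributes` directly.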
Can the LLM judge grade itself?
It can, but the design lets you fix the grading provider and model independently of the model under evaluation. That means you can grade a cheap model with a stronger judge, which usually surfaces more meaningful signal than a model marking its own homework.
Related products
The rest of the Sarma Linux toolkit. Same opinions throughout: open source, MIT, real depth.
Treat evals like tests, not like dashboards.
Clone the repo, write a scorer, run the example eval, wire the CI gate.