Open Source · MIT · Python 3.12

Evals as code, with traces and regressions.

One CLI. Datasets, scorers, traces, regressions. The boring half of AI engineering done well.

AI Eval Runner is the evaluation harness most teams half-write. Datasets live as files in your repo. Scorers are Python functions. Runs produce traces in DuckDB. A FastAPI viewer (HTMX, no JS framework) shows pass rates, regressions, and individual traces. Built-in scorers cover exact-match, JSON schema, BLEU, ROUGE, LLM-as-judge, and rubric grading.

Built-in scorers: 6+
Trace storage: DuckDB
Viewer UI: HTMX
Regression mode: CI
Licence: MIT

Why this exists

Every team building with LLMs runs evals. Almost none of them run evals well. The first version is a Python script with print statements; the second version is a Jupyter notebook with a chart; the third version is a half-finished homemade dashboard. By the fourth version the team is debating whether to adopt a vendor product.

The vendor products are good. They are also opinionated about your stack, and they prefer to host your data. For projects where the dataset is sensitive, the AI is internal, and the evaluations are part of CI, hosted vendors are a poor fit.

AI Eval Runner is the open-source middle. Datasets are files. Scorers are functions. Traces live in DuckDB on disk. The viewer is local, fast, and built on HTMX so the JS surface is tiny. Regression mode runs in CI and fails the build when a release loses performance against a baseline. The same harness powers your local development loop and your release gate.

What it does

Every feature below ships in the public repository today. Clone, configure, run.

Datasets as files

JSONL, CSV, or Parquet. Versioned in git. No external dataset service. Datasets can be parameterised at runtime.
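
As a rough illustration of the file-based approach, the sketch below reads a JSONL dataset and fills runtime parameters into each example. The `load_jsonl` helper and the `input` and `target` field names are assumptions made for this example, not the project's documented schema or API.

```python
import json
from pathlib import Path
from string import Template


def load_jsonl(path, params=None):
    """Read one JSON object per line and fill $placeholders in `input`.

    Hypothetical helper: the record fields used here are assumed,
    not taken from the AI Eval Runner schema.
    """
    params = params or {}
    examples = []
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue  # tolerate blank lines between records
        record = json.loads(line)
        # safe_substitute leaves unknown placeholders untouched
        record["input"] = Template(record["input"]).safe_substitute(params)
        examples.append(record)
    return examples
```

With a line such as `{"input": "Capital of $country?", "target": "Paris"}` in the file, passing `{"country": "France"}` as `params` yields the prompt "Capital of France?" without touching the file in git.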

Scorers as functions

A scorer is a Python function. Inputs: model output and ground truth. Output: a score and a metadata dict. Compose them.
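
A minimal sketch of that contract, assuming a scorer returns a `(score, metadata)` tuple. The names `exact_match`, `max_length`, and `all_of` are illustrative and are not claimed to match the built-in modules.

```python
from typing import Any, Callable

# Assumed shape: a scorer takes (output, target) and returns (score, metadata).
Scorer = Callable[[str, str], tuple[float, dict[str, Any]]]


def exact_match(output: str, target: str) -> tuple[float, dict[str, Any]]:
    """1.0 when the whitespace-trimmed strings are equal, else 0.0."""
    matched = output.strip() == target.strip()
    return (1.0 if matched else 0.0), {"matched": matched}


def max_length(limit: int) -> Scorer:
    """Build a scorer that passes when the output has at most `limit` characters."""
    def score(output: str, target: str) -> tuple[float, dict[str, Any]]:
        n = len(output)
        return (1.0 if n <= limit else 0.0), {"length": n, "limit": limit}
    return score


def all_of(*scorers: Scorer) -> Scorer:
    """Compose scorers: the combined score is the minimum of the parts."""
    def score(output: str, target: str) -> tuple[float, dict[str, Any]]:
        results = [s(output, target) for s in scorers]
        combined = min(r[0] for r in results)
        return combined, {"parts": [r[1] for r in results]}
    return score


strict = all_of(exact_match, max_length(20))
print(strict(" Paris ", "Paris"))
```

Taking the minimum means one failing part fails the whole composite; a weighted mean would be an equally reasonable choice for softer grading.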

Built-in scorers

Exact-match, JSON schema validation, BLEU, ROUGE, regex match, semantic similarity, LLM-as-judge, rubric grading. All as opt-in modules.

Per-trace storage

Every run stores a per-example trace: input, output, score, latency, cost, and model used. DuckDB makes ad-hoc queries fast.

Pass rates and aggregates

Aggregate scores per dataset, per slice, per model. Time series across runs. Slice by metadata fields you provide.
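
To make slicing concrete, here is a small sketch that computes pass rates per metadata value over a list of trace rows. The row shape (`score` plus a `metadata` dict) and the `pass_rates` helper are assumptions; the tool itself stores traces in DuckDB rather than in Python lists.

```python
from collections import defaultdict


def pass_rates(traces, slice_key, threshold=1.0):
    """Return {slice value: fraction of traces scoring >= threshold}."""
    passed = defaultdict(int)
    total = defaultdict(int)
    for trace in traces:
        value = trace["metadata"].get(slice_key, "unknown")
        total[value] += 1
        if trace["score"] >= threshold:
            passed[value] += 1
    return {value: passed[value] / total[value] for value in total}


# Hypothetical rows: the overall pass rate is 0.75, but one slice is at 0.5.
traces = [
    {"score": 1.0, "metadata": {"language": "en"}},
    {"score": 1.0, "metadata": {"language": "en"}},
    {"score": 0.0, "metadata": {"language": "de"}},
    {"score": 1.0, "metadata": {"language": "de"}},
]
print(pass_rates(traces, "language"))  # {'en': 1.0, 'de': 0.5}
```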

Regression mode

A run can be compared against a baseline; the CLI exits non-zero when any score regresses by more than a configured threshold. Drop it into a GitHub Actions workflow.
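
The comparison can be sketched as follows, assuming each run is summarised as a mapping from scorer name to mean score. `find_regressions`, `gate`, and the 0.02 default threshold are hypothetical and only illustrate the idea of turning a score drop into an exit code.

```python
import sys


def find_regressions(baseline, candidate, threshold=0.02):
    """Return {metric: drop} for metrics that fell by more than `threshold`.

    Both arguments map a metric name to a mean score in [0, 1].
    Metrics missing from the candidate are skipped rather than flagged.
    """
    regressions = {}
    for name, base_score in baseline.items():
        if name not in candidate:
            continue
        drop = base_score - candidate[name]
        if drop > threshold:
            regressions[name] = round(drop, 4)
    return regressions


def gate(baseline, candidate, threshold=0.02):
    """Return a process exit code: 0 when clean, 1 when anything regressed."""
    regressions = find_regressions(baseline, candidate, threshold)
    for name, drop in regressions.items():
        print(f"REGRESSION {name}: -{drop}", file=sys.stderr)
    return 1 if regressions else 0


# Made-up numbers for illustration only.
baseline = {"exact-match": 0.91, "json-schema": 0.99}
candidate = {"exact-match": 0.86, "json-schema": 0.99}
exit_code = gate(baseline, candidate)  # a CLI would pass this to sys.exit()
print(exit_code)
```

A CI step needs nothing beyond the exit status: any non-zero value fails the job.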

HTMX viewer

A FastAPI app with HTMX for interactivity. Sub-100KB JS. Loads instantly, even on a flaky connection.

Cost and latency

Every run records cost and latency per example. Catch regressions in £ as well as accuracy.

Adapter API

Bring your own model. Adapters are small Python modules. The default adapters cover OpenAI, Anthropic, and OpenAI-compatible endpoints.
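
One way to picture an adapter, as a sketch under stated assumptions: a small object with a `call` method (the name used in the architecture diagram) that returns the text along with latency and cost. The `Completion` dataclass, the `Adapter` protocol, and `EchoAdapter` are hypothetical and do not reproduce the repository's actual interface.

```python
import time
from dataclasses import dataclass
from typing import Protocol


@dataclass
class Completion:
    text: str
    latency_s: float
    cost: float  # in whatever currency the adapter reports


class Adapter(Protocol):
    """Assumed interface: anything with a name and a call() method."""
    name: str

    def call(self, prompt: str) -> Completion: ...


class EchoAdapter:
    """Offline stand-in for a real model client; it upper-cases the prompt."""
    name = "echo"

    def call(self, prompt: str) -> Completion:
        start = time.perf_counter()
        text = prompt.upper()
        return Completion(text=text, latency_s=time.perf_counter() - start, cost=0.0)


def run_one(adapter: Adapter, prompt: str) -> Completion:
    """Works with any object that satisfies the Adapter protocol."""
    return adapter.call(prompt)


result = run_one(EchoAdapter(), "hello")
print(result.text)  # HELLO
```

Because `Adapter` is a structural protocol, a custom client only has to provide the same attributes; it does not need to inherit from anything.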

CI-friendly

No background server required. The CLI runs to completion, writes a summary file, and exits with a status code that CI can gate on.

Architecture, in one diagram

The whole system on a single screen. Every box maps to a real folder in the repo.

┌───────────────────────────┐      ┌────────────────────────────┐
│  Dataset (JSONL/CSV)      │      │  Adapter (model client)    │
│  · examples in git        │ ──▶  │  · OpenAI                  │
│  · slices, metadata       │      │  · Anthropic               │
└─────────────┬─────────────┘      │  · OpenAI-compatible       │
              │                    └──────────────┬─────────────┘
              ▼                                   ▼
        ┌──────────────────────────────────────────────┐
        │  CLI · evalrunner run                        │
        │  for example in dataset:                     │
        │    output = adapter.call(example)            │
        │    for scorer in scorers:                    │
        │      score = scorer(output, example.target)  │
        │    trace.write(example, output, score)       │
        └──────────────┬───────────────────────────────┘
                       ▼
        ┌──────────────────────────────────────────────┐
        │  DuckDB · traces.duckdb                      │
        │  · per-example rows                          │
        │  · runs, datasets, scorers tables            │
        └──────────────┬───────────────────────────────┘
                       ▼
        ┌──────────────────────────────────────────────┐
        │  FastAPI viewer (HTMX, Jinja2)               │
        │  · pass rates, slices, regressions, traces   │
        └──────────────────────────────────────────────┘
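
The loop in the CLI box can be sketched as runnable Python. This is an illustration under assumptions, not the repository's code: a list of dicts stands in for the DuckDB table, one row is written per (example, scorer) pair, and `run`, `call_model`, and the toy data are hypothetical.

```python
import time


def run(dataset, call_model, scorers):
    """Run every example through the model and every scorer.

    Returns one trace row per (example, scorer) pair. A list of dicts
    stands in for the DuckDB table here.
    """
    traces = []
    for example in dataset:
        start = time.perf_counter()
        output = call_model(example["input"])
        latency_s = time.perf_counter() - start
        for name, scorer in scorers.items():
            score, metadata = scorer(output, example["target"])
            traces.append({
                "input": example["input"],
                "output": output,
                "scorer": name,
                "score": score,
                "metadata": metadata,
                "latency_s": latency_s,
            })
    return traces


# Toy stand-ins so the sketch runs offline.
ANSWERS = {"2+2": "4", "3+3": "6"}


def call_model(prompt):
    return ANSWERS[prompt]


dataset = [{"input": "2+2", "target": "4"}, {"input": "3+3", "target": "7"}]
scorers = {"exact-match": lambda out, tgt: (float(out == tgt), {})}

traces = run(dataset, call_model, scorers)
print([t["score"] for t in traces])  # [1.0, 0.0]
```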

Quick start

From clone to first eval run in under five minutes.

01

git clone https://github.com/sarmakska/ai-eval-runner.git
cd ai-eval-runner

02

uv sync
cp .env.example .env  # add OPENAI_KEY etc.

03

evalrunner run examples/qna --model gpt-4o-mini \
   --scorers exact-match,json-schema --against-baseline main

04

evalrunner serve  # FastAPI viewer on :8088

Where it fits

The patterns this repository was built around.

Release gates

A new prompt or a new model goes through eval before it merges. Regression mode fails the PR if the score drops more than the threshold.

Prompt iteration

Iterate prompts locally with the CLI, see traces side-by-side in the viewer, ship when the win is real and not just one example.

Vendor comparison

Run the same dataset across OpenAI, Anthropic, and a local model via the adapter API. Compare cost and accuracy honestly.

Customer-bound metrics

Slice eval results by customer-relevant metadata (industry, doc length, language) so the headline number is not hiding a regression in a slice.

Related products

The wider Sarma Linux toolkit. Every project ships with the same opinions: open source, MIT, real depth, no marketing fluff.

SarmaLink-AI

Multi-provider AI assistant with sub-50ms failover across 36 engines.

MCP Server Toolkit

Production-ready Model Context Protocol server starter, with plugins.

Voice Agent Starter

Sub-second real-time voice loop with WebRTC, barge-in, and pluggable STT/TTS.

Agent Orchestrator

Deterministic-replay multi-agent workflows with durable state.

Local LLM Router

OpenAI-compatible proxy that routes between local Ollama and cloud LLMs.

StaffPortal

Open-source HR + ops platform built to replace three SaaS subscriptions.

RAG-over-PDF

A minimal, production-shaped RAG starter with cited streaming answers.

Receipt Scanner

Vision-OCR receipt scanning starter with Zod-typed JSON output.

Webhook-to-Email

A tiny, production-grade webhook receiver with HMAC and React Email.

Treat evals like tests, not like dashboards.