Capability

Data Engineering

I build data pipelines that are simple enough for a single engineer to operate: Postgres as the primary store, DuckDB for analytical queries, structured extraction from documents, and RAG pipelines that ground LLM answers in your actual data.

At a glance

Backed by public open-source code, not just a description on a page.
Long-form essays on the same topics, with sources cited.
Production patterns a hiring team can lift straight into its stack.

About Sarma

Sarma is a UK-based software engineer running Sarmalinux as a one-person studio. He ships nineteen open-source repositories spanning LLM gateways, coding agents, inference, storage engines and consensus, and writes long-form engineering essays at sarmalinux.com/blog. Senior IC, end to end.

Most data engineering work at small-to-mid scale does not need Spark, Databricks, or a dedicated data team. It needs well-designed Postgres schemas, a reliable ETL job, DuckDB for the analytical layer, and a sensible approach to documents and unstructured input. I have shipped ai-eval-runner (evals stored in DuckDB, served via FastAPI and an HTMX viewer), rag-over-pdf (a minimal but production-quality RAG pipeline), and receipt-scanner (vision-based structured extraction). These are the building blocks I wire together for client data work.

What this covers in practice

Postgres schema design and ETL

Normalised schemas with RLS, migrations via Supabase CLI, and ETL jobs in Python using uv. Incremental loads, idempotent upserts, and clear error handling.
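As an illustration of the idempotent upsert pattern mentioned above, here is a minimal sketch. The `customers` table, its columns, and the `load_batch` helper are hypothetical, and SQLite stands in for Postgres only so the example runs without a server. Both engines accept this `ON CONFLICT ... DO UPDATE` form, although a Postgres driver such as psycopg would use `%s` placeholders instead of `?`.

```python
import sqlite3

# Hypothetical table and columns, for illustration only.
UPSERT_SQL = """
INSERT INTO customers (id, email, updated_at)
VALUES (?, ?, ?)
ON CONFLICT (id) DO UPDATE SET
    email = excluded.email,
    updated_at = excluded.updated_at
WHERE excluded.updated_at > customers.updated_at
"""


def load_batch(conn: sqlite3.Connection, rows: list[tuple]) -> None:
    """Apply a batch in one transaction; replaying the same batch changes nothing."""
    with conn:  # commits on success, rolls back if any row fails
        conn.executemany(UPSERT_SQL, rows)


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers ("
    "id INTEGER PRIMARY KEY, email TEXT NOT NULL, updated_at TEXT NOT NULL)"
)

batch = [(1, "a@example.com", "2024-01-01"), (2, "b@example.com", "2024-01-02")]
load_batch(conn, batch)
load_batch(conn, batch)  # replayed batch: no duplicates, no changes
load_batch(conn, [(1, "new@example.com", "2024-02-01")])  # newer row wins
load_batch(conn, [(2, "stale@example.com", "2023-12-31")])  # older row is ignored

rows = conn.execute(
    "SELECT id, email, updated_at FROM customers ORDER BY id"
).fetchall()
print(rows)
```

The `WHERE` guard on the update is one possible design choice: it lets late-arriving, older records be replayed safely, which helps when an incremental load is retried after a partial failure.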

DuckDB analytics layer

Analytical queries over Postgres data or flat files using DuckDB: fast columnar reads without a separate data warehouse. Integrated with FastAPI for query-on-demand.

RAG pipelines

Retrieval-augmented generation over PDFs, documentation, or your internal knowledge base. Chunking, embedding, vector search, and re-ranking with quality evals baked in.
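The chunk, embed, and search steps above can be sketched in a few lines. This is a toy illustration under stated assumptions, not the rag-over-pdf implementation: the bag-of-words `embed` function stands in for a real embedding model, the in-memory scan stands in for a vector index such as pgvector, and re-ranking is left out. All function names are hypothetical.

```python
import math
import re
from collections import Counter


def chunk(text: str, size: int = 10, overlap: int = 3) -> list[str]:
    """Split text into word windows of `size`, each sharing `overlap` words with the last."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    words = text.split()
    if not words:
        return []
    step = size - overlap
    last_start = max(len(words) - overlap, 1)
    return [" ".join(words[i:i + size]) for i in range(0, last_start, step)]


def embed(text: str) -> Counter:
    """Toy embedding: lowercase token counts. A real pipeline would call a model here."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query: str, chunks: list[str], k: int = 2) -> list[tuple[float, str]]:
    """Score every chunk against the query and return the k best (score, chunk) pairs."""
    q = embed(query)
    scored = [(cosine(q, embed(c)), c) for c in chunks]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]


text = (
    "Invoices are stored in Postgres with row level security. "
    "Analytical queries run in DuckDB over exported Parquet files. "
    "Receipts are parsed into validated JSON records."
)
chunks = chunk(text)
top = retrieve("analytical queries DuckDB", chunks, k=1)
print(top[0][1])
```

The overlap between windows is there so that a sentence straddling a chunk boundary still appears whole in at least one chunk; the sizes used here are arbitrary.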

Document and receipt extraction

Vision-based structured extraction from receipts, invoices, and scanned documents. Outputs JSON validated with Pydantic v2, not raw OCR strings.

Eval harnesses for LLM output quality

Measure answer quality, faithfulness to retrieved context, and regression across model versions. DuckDB-backed persistence, FastAPI query endpoint, HTMX viewer.
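To make the idea of a faithfulness score and a regression check concrete, here is a small sketch. It is an assumption-laden stand-in rather than the ai-eval-runner code: the token-overlap `faithfulness` metric is a crude proxy for a real judge, the `regressed` helper and its `tolerance` parameter are hypothetical, and scores are held in memory instead of being persisted to DuckDB.

```python
import re
from statistics import mean


def tokens(text: str) -> set[str]:
    """Lowercase alphanumeric tokens, used as a rough unit of content."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def faithfulness(answer: str, context: str) -> float:
    """Fraction of answer tokens that also appear in the retrieved context."""
    answer_tokens = tokens(answer)
    if not answer_tokens:
        return 0.0
    return len(answer_tokens & tokens(context)) / len(answer_tokens)


def regressed(baseline: list[float], candidate: list[float], tolerance: float = 0.05) -> bool:
    """True when the candidate's mean score falls more than `tolerance` below the baseline's."""
    return mean(candidate) < mean(baseline) - tolerance


context = "The invoice total is 42 pounds and it was issued in March."

# Answers from two hypothetical model versions to the same question.
v1_scores = [faithfulness("The invoice total is 42 pounds.", context)]
v2_scores = [faithfulness("The invoice total is 57 euros.", context)]

print(v1_scores, v2_scores, regressed(v1_scores, v2_scores))
```

In practice the scores would be computed over a fixed question set and stored per model version, so the same comparison can be rerun whenever the model or the prompt changes.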

Analytics dashboards

Recharts-backed dashboards embedded in Next.js, with no separate BI tool required. The blog chart renderer on sarmalinux.com is the live reference.

Stack

Postgres + Supabase, DuckDB, Python 3.12 / uv, FastAPI, Pydantic v2, pgvector, Recharts, Next.js, n8n, Docker

Recent work in this lane

Open-source repositories

Related writing

What a hiring team gets

Postgres-first, no separate warehouse for SME scale

DuckDB analytics without infrastructure overhead

Grounded RAG answers, not hallucinated summaries

Structured output from documents: JSON, not raw text

Evals so you know when quality degrades

Operable by a single engineer after handoff

Read the evidence

Open the public repositories, browse past work, then look at the hiring page if a PAYE shape fits your team.

Open-source repositories | Past work | Hire me, PAYE only