Open Source · MIT License · Hybrid retrieval · Streaming citations

RAG-over-PDF

A readable, framework-free RAG starter that takes retrieval past the toy stage. Hybrid search (dense plus BM25, fused with RRF), an LLM reranker on the recall pool, NDJSON citation streaming, and page-level highlights. Clone it, read it, ship it.

View on GitHub · Read whitepaper · How it works · Read the wiki

~£0.001 per question

<3s streaming answer

Dense+BM25 hybrid retrieval

RRF fusion method

MIT license

Why this exists

Every product team eventually wants to chat with its own docs. The default response is to reach for a heavy framework, spin up a managed vector database, write four hundred lines of glue code, and ship something nobody understands six months later.

Most of that complexity is not load-bearing. The actual moving parts of a working RAG system are short and readable: chunk the document by page, embed the chunks, build a sparse BM25 index alongside the dense vectors, fuse the rankings, rerank the recall pool for precision, put the survivors in the prompt, and stream the answer with citations that point back to the exact source and page.

That is roughly six hundred lines of TypeScript. RAG-over-PDF is those six hundred lines, written cleanly, with hybrid search, RRF fusion, an LLM reranker with a deterministic fallback, and NDJSON citation streaming built in. No framework hiding the indexing or retrieval stages. No managed vector store to pay for on day one.

Clone it. Read it. Ship it. Swap the in-memory store for pgvector when you outgrow it. Add semantic chunking when you measure that you need it. Do not pay framework tax up front for capabilities you have not yet exercised.

Built-in features

Everything below ships in the repository. Clone, set OPENAI_API_KEY, deploy.

Hybrid search with RRF fusion

Dense cosine for semantic similarity, BM25 for exact terms like error codes and product identifiers. Fused with Reciprocal Rank Fusion, no score normalisation required. Robust where either alone would miss.
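
As a rough illustration of the fusion step, the sketch below merges two ranked lists of chunk ids with Reciprocal Rank Fusion. The function name rrfFuse, the example ids, and the constant k = 60 (a commonly used default) are assumptions for illustration, not taken from the repository.

```typescript
// Reciprocal Rank Fusion: score(id) = sum over lists of 1 / (k + rank),
// where rank is the 1-based position of the id in each list.
// rrfFuse and the default k = 60 are illustrative assumptions.
function rrfFuse(rankings: string[][], k = 60): string[] {
  const scores = new Map<string, number>()
  for (const ranking of rankings) {
    ranking.forEach((id, index) => {
      const rank = index + 1
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank))
    })
  }
  // Highest fused score first
  return Array.from(scores.entries())
    .sort((a, b) => b[1] - a[1])
    .map(([id]) => id)
}

const denseRanking = ['c3', 'c1', 'c7'] // hypothetical ids from cosine search
const bm25Ranking = ['c1', 'c9', 'c3'] // hypothetical ids from BM25
const fused = rrfFuse([denseRanking, bm25Ranking])
```

Only ranks enter the formula, which is why the raw cosine and BM25 scores never need to share a scale. An id that appears in both lists collects two contributions and rises above ids found by one retriever alone.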

LLM reranker on the recall pool

Hybrid search casts a wide net; the reranker reorders that pool for precision. The default LLM reranker scores each candidate against the question, and falls back to a deterministic lexical reranker when the LLM call fails.
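
A minimal sketch of the fallback pattern, assuming the LLM scorer is passed in as a function. The Candidate shape, the Scorer type, and the term-overlap heuristic are all hypothetical; the repository's lexical reranker may score differently.

```typescript
// Candidate shape and function names are assumptions for this sketch.
interface Candidate { id: string; content: string }
type Scorer = (question: string, candidates: Candidate[]) => Promise<number[]>

// Deterministic fallback: fraction of distinct question terms present in the chunk
function lexicalScore(question: string, content: string): number {
  const terms = new Set(question.toLowerCase().match(/[a-z0-9]+/g) ?? [])
  if (terms.size === 0) return 0
  const words = new Set(content.toLowerCase().match(/[a-z0-9]+/g) ?? [])
  let hits = 0
  terms.forEach((term) => { if (words.has(term)) hits++ })
  return hits / terms.size
}

async function rerankWithFallback(
  question: string,
  candidates: Candidate[],
  llmScorer: Scorer,
  topK = 5,
): Promise<Candidate[]> {
  let scores: number[] = []
  try {
    scores = await llmScorer(question, candidates)
    if (scores.length !== candidates.length) throw new Error('score count mismatch')
  } catch {
    // LLM call failed or returned malformed output: use the lexical scores
    scores = candidates.map((c) => lexicalScore(question, c.content))
  }
  return candidates
    .map((candidate, i) => ({ candidate, score: scores[i], i }))
    .sort((a, b) => b.score - a.score || a.i - b.i) // ties keep the fused order
    .slice(0, topK)
    .map((entry) => entry.candidate)
}
```

Because the fallback is a pure function of the question and the chunk text, the same inputs always give the same order, which keeps the failure path testable without a network.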

NDJSON citation streaming

The chat response is newline-delimited JSON. Citations arrive first so the UI can render sources immediately, then answer tokens stream, then a done event closes the connection.
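
The wire format can be sketched as a small typed union plus an encoder. The field names follow the ones used elsewhere on this page (citations with filename, page, and snippet; a token value; a done marker), but the exact types and the encodeEvent helper are assumptions.

```typescript
// Event shapes inferred from the fields named on this page; treat them as assumptions.
interface Citation { filename: string; page: number; snippet: string }
type ChatEvent =
  | { type: 'citations'; citations: Citation[] }
  | { type: 'token'; value: string }
  | { type: 'done' }

const encoder = new TextEncoder()

// One JSON object per line; the trailing newline is the record delimiter
function encodeEvent(event: ChatEvent): Uint8Array {
  return encoder.encode(JSON.stringify(event) + '\n')
}

const wire = new TextDecoder().decode(encodeEvent({ type: 'token', value: 'Hello' }))
```

JSON.stringify escapes any newline inside a string value, so a single event never spans two lines and the client can split on newlines safely.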

Multi-document chat

Index many PDFs at once and ask across all of them, or tick a subset to scope a question. Each chunk records which document it came from, and the store holds many documents side by side.

Page-level highlights

Text is extracted page by page via the pdf-parse pagerender hook. Every chunk carries a page number, so every citation can point at filename, page, and a snippet.
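
One way to keep page numbers is to collect text inside the pagerender callback. The sketch below is a hypothetical helper, not the repository's code: makePageCollector and the two structural types are assumptions, and it relies on pdf-parse invoking pagerender once per page in page order.

```typescript
// Minimal structural types for what pdf.js hands to the pagerender callback.
// makePageCollector is a hypothetical helper, not the repository's code.
interface TextItem { str: string }
interface PageData { getTextContent(): Promise<{ items: TextItem[] }> }

function makePageCollector() {
  const pages: { page: number; text: string }[] = []
  // Assumes pagerender is invoked sequentially, once per page, in page order
  async function pagerender(pageData: PageData): Promise<string> {
    const content = await pageData.getTextContent()
    const text = content.items.map((item) => item.str).join(' ')
    pages.push({ page: pages.length + 1, text })
    return text
  }
  return { pages, pagerender }
}
```

The callback would be passed in the options object of the pdf-parse call (the pagerender option), after which pages holds one entry per page for the chunker to consume.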

Fixed-size chunking with overlap

1,000 character chunks with 200 character overlap, tracked by page. Sentences that span boundaries are still findable from either side. Tunable through CHUNK_SIZE and CHUNK_OVERLAP.
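
A minimal sketch of fixed-size chunking with overlap, using the defaults quoted above. The names chunkPages, PageText, and Chunk are illustrative, and the sketch assumes chunks never cross a page boundary, which may differ from the repository's chunker.

```typescript
// Hypothetical types and names; defaults mirror the 1,000 / 200 figures above.
interface PageText { page: number; text: string }
interface Chunk { page: number; index: number; content: string }

// Fixed-size windows that advance by (size - overlap) characters,
// so consecutive chunks on a page share `overlap` characters.
function chunkPages(pages: PageText[], size = 1000, overlap = 200): Chunk[] {
  if (overlap >= size) throw new Error('overlap must be smaller than size')
  const step = size - overlap
  const chunks: Chunk[] = []
  for (const { page, text } of pages) {
    for (let start = 0; start < text.length; start += step) {
      chunks.push({ page, index: chunks.length, content: text.slice(start, start + size) })
      if (start + size >= text.length) break // this window reached the end of the page
    }
  }
  return chunks
}
```

With these defaults a 2,500 character page yields three chunks starting at offsets 0, 800, and 1,600, and the last 200 characters of each chunk reappear at the start of the next.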

OpenAI text-embedding-3-small

1,536-dimensional vectors at low cost. Indexing a 500 page PDF costs roughly two pence. Override with EMBEDDING_MODEL if you want a different model in the same family.

In-memory store, pgvector when ready

The vector store is one file behind a small interface (add, search, hybridSearch, documents, clear). Swap to pgvector, Supabase Vector, Pinecone, or Qdrant without touching retrieval.
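
The seam might look like the sketch below. The five method names come from this page; every signature, the StoredChunk shape, and the denseSearch helper are assumptions made for illustration.

```typescript
// Method names come from this page; every signature below is an assumption.
interface StoredChunk {
  id: string
  docId: string
  source: string
  page: number
  content: string
  embedding: number[]
}
interface SearchOptions { k: number; docIds?: string[] }

interface VectorStore {
  add(chunks: StoredChunk[]): Promise<void>
  search(queryVector: number[], options: SearchOptions): Promise<StoredChunk[]>
  hybridSearch(queryVector: number[], queryText: string, options: SearchOptions): Promise<StoredChunk[]>
  documents(): Promise<string[]>
  clear(): Promise<void>
}

function cosine(a: number[], b: number[]): number {
  let dot = 0
  let normA = 0
  let normB = 0
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i]
    normA += a[i] * a[i]
    normB += b[i] * b[i]
  }
  return normA === 0 || normB === 0 ? 0 : dot / Math.sqrt(normA * normB)
}

// Brute-force dense search: the core of an in-memory `search`
function denseSearch(chunks: StoredChunk[], queryVector: number[], { k, docIds }: SearchOptions): StoredChunk[] {
  const scope = docIds && docIds.length > 0 ? new Set(docIds) : null
  return chunks
    .filter((c) => scope === null || scope.has(c.docId))
    .map((c) => ({ c, score: cosine(queryVector, c.embedding) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k)
    .map((x) => x.c)
}
```

Because callers only see the interface, a Postgres-backed implementation can replace the array scan without changes to the chat handler.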

Streaming via App Router

gpt-4o-mini emits tokens that stream through the Next.js App Router stream API. Time to first token sits between 600 and 900ms, so users perceive responsiveness rather than latency.

Grounded answers, not hallucinations

The system prompt pins the model to the retrieved chunks. If the answer is not there the model says so plainly. Verified by fixture tests that ship in the repo.
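
A grounding prompt along these lines is one way to get that behaviour. The wording and the RetrievedChunk shape are hypothetical; groundingPrompt and buildPrompt reuse the names from the handler sketch further down, but the repository's actual prompt may differ.

```typescript
// Hypothetical prompt wording and helper; the repository's actual prompt may differ.
interface RetrievedChunk { source: string; page: number; content: string }

const groundingPrompt =
  'Answer using only the numbered sources provided. ' +
  'If the sources do not contain the answer, say that you could not find it in the documents.'

// Number each chunk so the model can refer to sources by index
function buildPrompt(question: string, chunks: RetrievedChunk[]): string {
  const sources = chunks
    .map((c, i) => `[${i + 1}] ${c.source}, page ${c.page}\n${c.content}`)
    .join('\n\n')
  return `Sources:\n${sources}\n\nQuestion: ${question}`
}
```

Numbering the sources in the prompt gives the model something concrete to cite, and the same indices can line up with the citations array sent to the UI.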

TypeScript end to end

Strict mode throughout. Every chunk, embedding, and streamed event is typed. Schema-first means provider API changes break the build, not your users.

Wiki with the full theory

Eight wiki pages covering architecture, how RAG works under the hood, configuration tuning, cost and performance, and the pgvector migration. Read end to end in under thirty minutes.

One-click Vercel deploy

Vercel ships the App Router stream API on the free tier and the only secret is your OpenAI key. From clone to live URL in around sixty seconds.

Committed fixture tests

Unit and end-to-end tests in tests/ run against committed fixture PDFs. No network, no API key needed for CI. Catches regressions in chunking, BM25, and retrieval orchestration.

Architecture at a glance

Two pipelines that share one store. Indexing builds the dense plus BM25 indexes from the PDF. Query embeds the question, runs hybrid retrieval with RRF, reranks for precision, and streams citations followed by answer tokens.

Indexing and query flow

From an uploaded PDF to a streamed answer. The vector store is the seam: swap it for Postgres without touching anything else.

RAG-over-PDF: indexing builds the store; query embeds, fuses, reranks, then streams citations and tokens as NDJSON.

Hybrid retrieval with RRF fusion

Dense and sparse rankings are merged with Reciprocal Rank Fusion. No score normalisation needed. The fused list is the input to the reranker.

Hybrid retrieval: dense recall, BM25 recall, RRF fusion, reranker precision pass.

NDJSON citation streaming

The chat handler emits a citation event first so the UI renders sources immediately, then answer tokens stream one event per chunk, then a done event closes.

NDJSON protocol: source-first streaming so the UI shows citations before the answer arrives.

Quick start

From clone to a running app in five commands. The only secret you need is an OpenAI key.

# 1. Clone
git clone https://github.com/sarmakska/rag-over-pdf.git
cd rag-over-pdf

# 2. Install (pnpm is the committed lockfile)
pnpm install

# 3. Configure
cp .env.example .env.local
# then edit .env.local and set:
#   OPENAI_API_KEY=sk-...

# 4. Run
pnpm dev

# 5. Visit http://localhost:3000
#    - drop in one or more PDFs
#    - tick which documents to include
#    - ask a question
#    - watch citations stream first, then the answer

Full walkthrough including env var reference and tuning: Quick-Start wiki page.

The chat handler

The whole retrieval pipeline, end to end, in fewer than fifty lines. Embed the question, fuse dense and BM25, rerank, stream NDJSON. Read the real file at app/api/chat/route.ts.

// app/api/chat/route.ts (sketch of the real handler)
import { embed, chat } from '@/lib/openai'
import { store } from '@/lib/vector-store'
import { rerank } from '@/lib/reranker'
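import { buildCitationsEvent, buildTokenEvent, buildDoneEvent } from '@/lib/citations'
// groundingPrompt and buildPrompt are used below; this module path is an assumption in the sketch
import { groundingPrompt, buildPrompt } from '@/lib/prompt'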

export async function POST(req: Request) {
  const { question, docIds } = await req.json()

  // 1. Embed the question
  const [qVec] = await embed([question])

  // 2. Hybrid retrieval (dense + BM25, fused with RRF)
  const candidates = await store.hybridSearch(qVec, question, {
    k: 40,
    docIds, // optional scope to a subset of uploaded PDFs
  })

  // 3. Rerank the recall pool for precision
  const top = await rerank(question, candidates, { topK: 5 })

  // 4. Stream NDJSON: citations first, then tokens, then done
  const stream = new ReadableStream({
    async start(controller) {
      controller.enqueue(buildCitationsEvent(top))
      const completion = await chat.stream({
        system: groundingPrompt,
        messages: [{ role: 'user', content: buildPrompt(question, top) }],
      })
      for await (const token of completion) {
        controller.enqueue(buildTokenEvent(token))
      }
      controller.enqueue(buildDoneEvent())
      controller.close()
    },
  })

  return new Response(stream, {
    headers: { 'Content-Type': 'application/x-ndjson' },
  })
}

Reading the NDJSON stream

Citations first, then tokens, then done. Render sources immediately, append tokens as they arrive, close on done. Twenty lines, no library.

// Client reading the NDJSON stream
const res = await fetch('/api/chat', {
  method: 'POST',
  body: JSON.stringify({ question, docIds }),
})

const reader = res.body!.getReader()
const decoder = new TextDecoder()
let buf = ''

while (true) {
  const { value, done } = await reader.read()
  if (done) break
  buf += decoder.decode(value, { stream: true })
  const lines = buf.split('\n')
  buf = lines.pop()!
  for (const line of lines) {
    if (!line) continue
    const event = JSON.parse(line)
    if (event.type === 'citations') showSources(event.citations)
    if (event.type === 'token') appendToken(event.value)
    if (event.type === 'done') finish()
  }
}
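
The loop above keeps any incomplete tail in buf between reads. If a server ever ended the stream without a final newline, that last record would stay unparsed. A defensive variant, sketched here as a hypothetical createNdjsonParser helper, separates the buffering from the fetch loop and adds a flush step.

```typescript
// Hypothetical helper: accumulates decoded text and returns complete NDJSON records.
function createNdjsonParser<T>() {
  let buffer = ''
  return {
    // Feed a decoded text fragment; returns the events completed by it
    push(text: string): T[] {
      buffer += text
      const lines = buffer.split('\n')
      buffer = lines.pop() ?? '' // keep the incomplete tail for the next fragment
      return lines
        .filter((line) => line.trim() !== '')
        .map((line) => JSON.parse(line) as T)
    },
    // Call once the stream ends to parse a final record that had no trailing newline
    flush(): T[] {
      const rest = buffer.trim()
      buffer = ''
      return rest === '' ? [] : [JSON.parse(rest) as T]
    },
  }
}
```

In the reader loop, each decoded fragment would go through push, and flush would run once after the read reports done.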

Use cases

What people actually build with this.

Internal documentation chat

Index policies, runbooks, supplier contracts. Staff ask in plain language. BM25 catches form numbers and error codes verbatim, while dense search covers questions phrased differently from the source. Citations point at the exact page.

Product documentation chatbot

Embed your docs and expose /api/chat behind a widget so users get answers grounded in your content rather than the open web. Update docs, re-index, done.

Multi-contract question answering

Upload several contracts and ask across all of them, or scope a question to one document by ticking it ("what is the termination notice in the supplier agreement").

Research assistant for long papers

Skim fifty-page papers in seconds. After reranking, top-k retrieval is precise enough for academic prose without any further ranking stage.

Customer support copilot

Ground answers in real product docs, not the training data. Hybrid retrieval finds exact SKUs and error codes that pure dense search misses.

Learning RAG end to end

Read the library files end to end and understand hybrid search, fusion, reranking, citation streaming. No framework hiding the moving parts.

Tech stack

Next.js 14 · TypeScript · OpenAI text-embedding-3-small · OpenAI gpt-4o-mini · pdf-parse · in-repo BM25 · Cosine similarity · Tailwind CSS · Vercel · pgvector (optional)

RAG-over-PDF vs alternatives

LangChain and LlamaIndex are frameworks. Pinecone is a managed vector database. RAG-over-PDF is a small, readable application you can fork and ship. Capability rows are taken from each project’s public documentation.

Capability	RAG-over-PDF	LangChain	LlamaIndex	Pinecone
Dense embeddings	Yes	Yes (heavy)	Yes	Yes (paid)
BM25 sparse index	Yes, in-repo	Plug-in	Plug-in	Hybrid endpoint (paid)
RRF fusion	Yes	Manual	Manual	Built-in
Reranker step	LLM + lexical fallback	Plug-in	Plug-in	Not included
Citation streaming	NDJSON, source-first	Manual	Manual	N/A
Page-level metadata	Yes	Manual	Manual	Manual
Multi-document scoping	docIds in /api/chat	Manual	Manual	Namespaces
Self-host on free tier	Vercel free	Library only	Library only	Pay per index
Lines of code to read	~600	~thousands	~thousands	Closed source
License	MIT	MIT	MIT	Commercial

When you outgrow in-memory

The vector store lives in one file behind a tiny interface. To move to Postgres, apply the schema below and replace the body of that file with queries against it. Retrieval stays untouched.

-- Migration to swap the in-memory store for Postgres + pgvector.
-- Drop this into your Supabase SQL editor or any Postgres 15+ instance.

create extension if not exists vector;

create table chunks (
  id          text primary key,
  doc_id      text not null,
  source      text,
  page        int,
  content     text not null,
  embedding   vector(1536) not null,
  created_at  timestamptz default now()
);

-- HNSW for fast approximate cosine search (requires pgvector 0.5.0 or later)
create index chunks_embedding_idx
  on chunks
  using hnsw (embedding vector_cosine_ops);

-- Scope queries to a document set efficiently
create index chunks_doc_id_idx on chunks (doc_id);

-- Optional: keep the sparse side in Postgres via tsvector full-text search
-- (ts_rank is not BM25, but it fills the same keyword-matching role)
alter table chunks add column content_tsv tsvector
  generated always as (to_tsvector('english', content)) stored;

create index chunks_content_tsv_idx on chunks using gin (content_tsv);

Full migration including how to keep BM25 inside Postgres via tsvector: Swap-to-pgvector wiki page.
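
On the application side, the dense half of the store then becomes a single query. The sketch below builds a parameterised statement against the chunks table above. buildDenseQuery and SqlQuery are hypothetical names; the { text, values } shape is chosen to match what a client such as node-postgres accepts, which is an assumption about the driver you pick.

```typescript
// Hypothetical helper that builds a parameterised query for the `chunks` table above.
interface SqlQuery { text: string; values: unknown[] }

function buildDenseQuery(queryVector: number[], k: number, docIds?: string[]): SqlQuery {
  // pgvector accepts a vector as a bracketed text literal, e.g. '[0.1,0.2,0.3]'
  const vectorLiteral = `[${queryVector.join(',')}]`
  const scoped = docIds !== undefined && docIds.length > 0
  const text = [
    'select id, doc_id, source, page, content,',
    '       1 - (embedding <=> $1::vector) as similarity', // <=> is cosine distance
    'from chunks',
    scoped ? 'where doc_id = any($3::text[])' : '',
    'order by embedding <=> $1::vector',
    'limit $2',
  ].filter((line) => line !== '').join('\n')
  return { text, values: scoped ? [vectorLiteral, k, docIds] : [vectorLiteral, k] }
}
```

Ordering by the distance operator with a limit is the query shape the HNSW index is built to serve, and the optional doc_id filter mirrors the docIds scoping in /api/chat.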

An honest limitations list

Every starter has trade-offs. These are the trade-offs you should know about before adopting this one.

In-memory store clears on restart

The default store lives in process. Restarting the server drops the index. Fine for demos and personal use. Move to pgvector when you need persistence.

BM25 reindexes on every upload

The corpus statistics rebuild when documents change. Trivial at starter scale, but at very large scale move term statistics into Postgres or a search engine.
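
To see why the statistics rebuild, here is a minimal BM25 scorer. It is a sketch, not the in-repo implementation: the tokenizer, the buildBm25 name, and the parameters k1 = 1.2 and b = 0.75 (conventional defaults) are assumptions.

```typescript
// Minimal BM25 scorer. k1 = 1.2 and b = 0.75 are conventional defaults, assumed here.
function tokenize(text: string): string[] {
  return text.toLowerCase().match(/[a-z0-9]+/g) ?? []
}

function buildBm25(docs: string[], k1 = 1.2, b = 0.75) {
  const tokenized = docs.map(tokenize)
  // Corpus-wide statistics: both change whenever a document is added or removed
  const avgLength = tokenized.reduce((sum, t) => sum + t.length, 0) / Math.max(docs.length, 1)
  const df = new Map<string, number>() // in how many chunks each term appears
  for (const tokens of tokenized) {
    new Set(tokens).forEach((term) => df.set(term, (df.get(term) ?? 0) + 1))
  }
  return function score(query: string, docIndex: number): number {
    const tokens = tokenized[docIndex]
    let total = 0
    Array.from(new Set(tokenize(query))).forEach((term) => {
      const n = df.get(term) ?? 0
      if (n === 0) return // term absent from the corpus
      const tf = tokens.filter((t) => t === term).length
      const idf = Math.log(1 + (docs.length - n + 0.5) / (n + 0.5))
      total += (idf * tf * (k1 + 1)) / (tf + k1 * (1 - b + (b * tokens.length) / avgLength))
    })
    return total
  }
}
```

Both the document frequencies and the average length are properties of the whole corpus, so a new upload shifts the score of every existing chunk. That is cheap to recompute for a handful of PDFs and is the part worth moving into a database or search engine at larger scale.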

Fixed-size chunking

Production RAG benefits from semantic or structure-aware chunking. That is out of scope for a starter but on the roadmap. The current chunker is a forty-line file you can replace.

Per-question cost is small but not zero

Each question is one embedding call, one rerank call, one generation. With the small models that lands well under a penny but it adds up at scale. Drop the reranker if you want to halve the LLM cost.

Scanned PDFs return no text

pdf-parse extracts embedded text, not pixels. Image-only PDFs need an OCR pre-pass before upload. A vision pre-pass is on the roadmap.

Single OpenAI provider by default

The starter ships against the OpenAI API. Swap to any OpenAI-compatible endpoint in lib/openai.ts. Local embeddings via Ollama are on the roadmap.

Wiki documentation

Eight wiki pages: architecture, quick start, retrieval theory, configuration, the pgvector migration, cost and performance, deployment, and the roadmap.

Architecture

Indexing and query flow diagrams, the component table, failure modes.

Quick-Start

Clone, install, configure, first question answered. Five minutes start to finish.

How RAG Works

Hybrid retrieval, RRF fusion, reranking, citation streaming explained in depth.

Configuration

Every environment variable. Tuning chunk size, overlap, top-k, model overrides.

Swap to pgvector

SQL schema and migration path to Postgres, including keeping BM25 in tsvector.

Cost and Performance

Per-question cost breakdown, latency tuning, where the time goes.

Deployment

Vercel one-click, Docker, self-hosted recipes.

Roadmap

What is shipped, what is next, where contributions are wanted.

Full index: rag-over-pdf wiki home.

Frequently asked

Why hybrid search instead of just dense embeddings?

Dense embeddings excel at meaning and paraphrase but routinely miss exact strings: product codes, error messages, form numbers, statute references. BM25 catches those. Fusing the two with Reciprocal Rank Fusion gives recall on both axes without score normalisation. It is one of the cheapest retrieval wins you can ship.

Why a reranker after hybrid search?

Hybrid search is tuned for recall; it returns a wide candidate pool. The reranker reorders that pool for precision. The default uses an LLM to score each candidate against the question. If the LLM call fails, a deterministic lexical reranker takes over so the pipeline still completes. Both stages are short, readable modules in lib/.

Do I need a vector database?

Not to start. The default in-memory store exposes a small interface (add, search, hybridSearch, documents, clear) and runs in-process. Indexing and retrieval cost nothing extra. When you need persistence across restarts, replace the body of lib/vector-store.ts with Postgres calls and add a pgvector index. The retrieval pipeline never reaches past the interface, so the swap is contained.

What does a question cost?

One embedding call for the question, one LLM call for reranking, one streaming generation. With text-embedding-3-small and gpt-4o-mini the per-question cost sits around a tenth of a penny in typical usage. Drop the reranker if you want to halve the LLM cost. Cost and performance breakdown in the wiki.

How big a PDF can I index?

In-memory cosine search runs in milliseconds on corpora up to a few tens of thousands of chunks. A 500 page PDF produces roughly 1,500 chunks at the default chunk size, so that range covers a couple of dozen such PDFs, and questions still answer in under three seconds. Past that, move to pgvector.

How are citations rendered?

The chat endpoint emits newline-delimited JSON. The first event carries the citations array (filename, page, snippet) so the UI can show sources immediately. Then answer tokens stream as separate events. A done event closes the connection. The UI renders citations as numbered footnote links in the answer.

What about scanned or image-only PDFs?

pdf-parse extracts embedded text, not pixels. Scanned receipts and image-only PDFs return no text. Run the file through OCR first and upload the text-bearing output. Roadmap includes a vision-OCR pre-pass for the common case.

Can I run it without OpenAI?

The OpenAI client is lazily constructed so next build succeeds without a key. To use a different provider, swap the embedding and chat clients in lib/openai.ts for any OpenAI-compatible endpoint. Roadmap includes a local-embedding option via Ollama.

Open source · MIT

Use it. Fork it. Ship it.

MIT licensed. No strings attached. Attribution appreciated, not required. Pull requests welcome, especially around semantic chunking, local embeddings, and richer citation rendering.

Ready to ship RAG?

Clone the repo, set OPENAI_API_KEY, deploy. The starter ships with hybrid search, reranking, NDJSON streaming, and page-level citations out of the box.
