Readable LLM inference · Rust 1.96 · MIT

forge-infer

A minimal LLM inference server with a real paged KV-cache, continuous batching and speculative decoding. Readable in one afternoon, testable without a GPU.

The serving stack is the deliverable. The model is a deterministic placeholder you swap for real weights by implementing one trait with four methods.

3 core systems · 37 tests, all green · ~52% draft acceptance · 16 tokens per block · MIT licence
forge-infer · serving stack · 37/37 green
  paged kv-cache: 512 blocks of 16 · transactional append · OutOfBlocks safe
  scheduler: admit · reserve · preempt · retire · StepPlan
  speculative: 52% accept
  the seam: trait Model { forward(ctx), vocab_size(), num_layers(), eos_token() }
  forge | engine step | batch 16 | spec k=4 | model det-hash

Three interlocking systems, named the way the literature names them.

Why this matters

vLLM is the canonical PagedAttention implementation, in tens of thousands of lines of CUDA and Python. The PagedAttention paper is six diagrams. forge-infer is the middle: a faithful implementation of the paged KV-cache, continuous batching and speculative decoding, in Rust, with the awkward cases handled and the model deliberately a deterministic hash so the policy itself can be read, tested and benchmarked on any laptop.

Why this exists

I kept hitting the same wall when I tried to learn how vLLM-style serving works. The PagedAttention paper sketches the idea in a few diagrams. The real engines implement it under tens of thousands of lines of CUDA and Python glue, where the scheduling logic is tangled up with kernel launches and memory pools.

The tutorials that promise to explain it quietly mock out the bit that matters. They call paging a block table and then never evict anything. They batch requests that all happen to be the same length. The interesting cases, the ones that decide whether a real engine stays up, never appear.

forge-infer is the middle. The cache allocator, the scheduler and the speculative decoder are written for real, with the awkward cases handled. The model is deliberately a deterministic hash so the systems above the seam can be read, tested and benchmarked without a single CUDA kernel in sight. The serving stack is the deliverable. The model is a placeholder you swap out by implementing one trait with four methods.

The seam, in four lines

The entire serving stack hangs off this trait. Everything above it is the engine. Everything below it is the model.

pub trait Model: Send + Sync {
    fn forward(&self, context: &[TokenId]) -> StepLogits;
    fn vocab_size(&self) -> usize;
    fn num_layers(&self) -> usize;
    fn eos_token(&self) -> TokenId;
}
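
To make the seam concrete, here is a minimal sketch of what a deterministic placeholder behind this trait could look like. It is not the repository's model: TokenId, StepLogits, HashModel and argmax are stand-ins defined here for illustration, and the FNV-style mixing is just one way to get logits that are a pure function of the context.

```rust
// Hypothetical stand-ins: the page names these types but does not show them.
pub type TokenId = u32;
pub struct StepLogits(pub Vec<f32>);

pub trait Model: Send + Sync {
    fn forward(&self, context: &[TokenId]) -> StepLogits;
    fn vocab_size(&self) -> usize;
    fn num_layers(&self) -> usize;
    fn eos_token(&self) -> TokenId;
}

/// A toy model whose logits depend only on the context tokens.
pub struct HashModel {
    pub vocab: usize,
}

impl Model for HashModel {
    fn forward(&self, context: &[TokenId]) -> StepLogits {
        // FNV-1a-style fold over the tokens gives a reproducible seed.
        let mut h: u64 = 0xcbf29ce484222325;
        for &t in context {
            h ^= t as u64;
            h = h.wrapping_mul(0x100000001b3);
        }
        // One logit per vocabulary entry. Only 24 bits are kept so the
        // conversion to f32 is exact and the result is bit-stable.
        let logits = (0..self.vocab)
            .map(|i| {
                let mixed = (h ^ i as u64).wrapping_mul(0x9E3779B97F4A7C15);
                (mixed >> 40) as f32 / (1u64 << 24) as f32
            })
            .collect();
        StepLogits(logits)
    }

    fn vocab_size(&self) -> usize {
        self.vocab
    }

    fn num_layers(&self) -> usize {
        1 // arbitrary for a placeholder
    }

    fn eos_token(&self) -> TokenId {
        0 // arbitrary choice for this sketch
    }
}

/// Greedy pick: index of the largest logit, first one wins on ties.
pub fn argmax(logits: &StepLogits) -> TokenId {
    let mut best = 0usize;
    for (i, &v) in logits.0.iter().enumerate() {
        if v > logits.0[best] {
            best = i;
        }
    }
    best as TokenId
}
```

Two calls to forward with the same context return identical vectors, which is the property the engine above the seam relies on. A real backend would replace the body of forward and leave the other three methods as cheap accessors.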

Built-in features

Every component a real engine has, named the way the literature names it. Only the model itself is deliberately faked.

Real paged KV-cache

Memory is split into fixed-size blocks, and each sequence carries a block table of physical blocks. External fragmentation disappears entirely; internal waste is bounded by block_size minus one slots per sequence. The append is transactional: it returns OutOfBlocks without touching state, so the scheduler can preempt and retry safely.
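
The idea can be sketched in a few dozen lines. This is an illustration under assumptions, not the repository's allocator: PagedCache, its fields and its method names are hypothetical, and only OutOfBlocks and the block size of 16 come from this page. The point to notice is that append checks capacity before it mutates anything.

```rust
use std::collections::HashMap;

pub const BLOCK_SIZE: usize = 16; // tokens per block, as stated on this page

#[derive(Debug, PartialEq)]
pub struct OutOfBlocks;

/// Hypothetical allocator: a free list plus one block table per sequence.
pub struct PagedCache {
    free: Vec<usize>,                 // physical block ids not in use
    tables: HashMap<u64, Vec<usize>>, // sequence id -> block table
    lens: HashMap<u64, usize>,        // sequence id -> token slots used
}

impl PagedCache {
    pub fn new(num_blocks: usize) -> Self {
        Self {
            free: (0..num_blocks).rev().collect(),
            tables: HashMap::new(),
            lens: HashMap::new(),
        }
    }

    pub fn free_blocks(&self) -> usize {
        self.free.len()
    }

    /// Append `n` token slots to `seq`. Capacity is checked first, so an
    /// `Err(OutOfBlocks)` leaves every field exactly as it was.
    pub fn append(&mut self, seq: u64, n: usize) -> Result<(), OutOfBlocks> {
        let len = self.lens.get(&seq).copied().unwrap_or(0);
        let have = self.tables.get(&seq).map_or(0, |t| t.len());
        let need = (len + n + BLOCK_SIZE - 1) / BLOCK_SIZE; // ceiling division
        let extra = need.saturating_sub(have);
        if extra > self.free.len() {
            return Err(OutOfBlocks);
        }
        let table = self.tables.entry(seq).or_default();
        for _ in 0..extra {
            table.push(self.free.pop().expect("capacity checked above"));
        }
        self.lens.insert(seq, len + n);
        Ok(())
    }

    /// Return all of a sequence's blocks to the free list.
    pub fn release(&mut self, seq: u64) {
        if let Some(table) = self.tables.remove(&seq) {
            self.free.extend(table);
        }
        self.lens.remove(&seq);
    }

    /// Unused slots in the sequence's last block: at most BLOCK_SIZE - 1.
    pub fn internal_waste(&self, seq: u64) -> usize {
        let len = self.lens.get(&seq).copied().unwrap_or(0);
        let blocks = self.tables.get(&seq).map_or(0, |t| t.len());
        blocks * BLOCK_SIZE - len
    }
}
```

With two blocks, a 17-token sequence takes both and wastes 15 slots, the worst case for a block size of 16. A second sequence asking for a single slot then gets OutOfBlocks and the free count is unchanged, which is what lets a scheduler preempt and call append again.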

Continuous batching every iteration

A scheduling decision per decode iteration, not per batch. Each call admits what fits, reserves one block per running sequence, preempts the least-progressed sequences when blocks run out, runs the decode batch, and retires anything that hit eos or its token limit. Recompute-based preemption never deadlocks.
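
A rough sketch of one such scheduling decision follows. Only the name StepPlan and the admit, reserve, preempt, retire vocabulary come from this page; Seq, Scheduler, the field names and the block-counting model are assumptions made for illustration. The sketch also reserves before it admits, which is a simplification chosen here so that a sequence is never admitted and preempted in the same step.

```rust
use std::collections::VecDeque;

const BLOCK_SIZE: usize = 16; // tokens per block

#[derive(Clone, Debug, PartialEq)]
pub struct Seq {
    pub id: u64,
    pub prompt_len: usize,
    pub generated: usize, // tokens decoded so far
    pub blocks: usize,    // physical blocks currently held
}

/// What the engine should run next. Plain data, no model inside.
#[derive(Debug, Default, PartialEq)]
pub struct StepPlan {
    pub admitted: Vec<u64>,
    pub preempted: Vec<u64>,
    pub decode: Vec<u64>,
}

pub struct Scheduler {
    pub free_blocks: usize,
    pub max_batch: usize,
    pub waiting: VecDeque<Seq>,
    pub running: Vec<Seq>,
}

fn blocks_for(tokens: usize) -> usize {
    (tokens + BLOCK_SIZE - 1) / BLOCK_SIZE
}

impl Scheduler {
    /// One decision per decode iteration.
    pub fn step(&mut self) -> StepPlan {
        let mut plan = StepPlan::default();

        // Reserve room for one more token per running sequence.
        let mut i = 0;
        while i < self.running.len() {
            let s = &self.running[i];
            let need = blocks_for(s.prompt_len + s.generated + 1).saturating_sub(s.blocks);
            if need <= self.free_blocks {
                self.free_blocks -= need;
                self.running[i].blocks += need;
                i += 1;
                continue;
            }
            // Out of blocks: evict the least-progressed sequence. It keeps
            // its token counts and gives up its blocks (recompute later).
            let v = (0..self.running.len())
                .min_by_key(|&j| self.running[j].generated)
                .expect("running is non-empty here");
            let mut victim = self.running.remove(v);
            self.free_blocks += victim.blocks;
            victim.blocks = 0;
            plan.preempted.push(victim.id);
            self.waiting.push_front(victim);
            if v < i {
                i -= 1; // keep `i` on the sequence that still needs room
            }
        }

        // Admit from the front of the queue while the whole context fits.
        while let Some(next) = self.waiting.front() {
            let need = blocks_for(next.prompt_len + next.generated + 1);
            if self.running.len() >= self.max_batch || need > self.free_blocks {
                break;
            }
            let mut s = self.waiting.pop_front().expect("front was Some");
            self.free_blocks -= need;
            s.blocks = need;
            plan.admitted.push(s.id);
            self.running.push(s);
        }

        plan.decode = self.running.iter().map(|s| s.id).collect();
        plan
    }

    /// Retire a finished sequence and give its blocks back.
    pub fn retire(&mut self, id: u64) {
        if let Some(pos) = self.running.iter().position(|s| s.id == id) {
            self.free_blocks += self.running.remove(pos).blocks;
        }
    }
}
```

Every pass through the reserve loop either advances or removes a sequence, so the loop always terminates, and a preempted sequence goes back to the front of the queue holding no blocks. Because step returns data rather than calling a model, a test can build a Scheduler by hand, call step, and assert on the plan.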

Speculative decoding, output exact

A cheap draft proposes k tokens; the target verifies the run and accepts each with probability min(1, p/q), resampling on the first rejection. The output is provably identical to plain target decoding, pinned by a token-for-token test. A fully accepted round of four drafts emits five tokens for one target step.
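
The accept-or-reject loop for one round can be sketched as below. This is a simplified illustration, not the repository's decoder: Verified and verify are hypothetical names, the uniform draws are passed in so the function stays deterministic, and the closure `next` stands in for whatever produces the replacement token on a rejection or the extra target token after a fully accepted run.

```rust
/// Outcome of verifying one round of drafted tokens.
#[derive(Debug, PartialEq)]
pub struct Verified {
    pub accepted: usize,   // how many draft tokens were kept
    pub emitted: Vec<u32>, // kept drafts plus exactly one target token
}

/// `drafts[i]` is the i-th proposed token, `q[i]` the draft's probability
/// for it, `p[i]` the target's probability for it, and `u[i]` a uniform
/// draw in [0, 1). `next(i)` supplies the token the target emits at
/// position `i` (a stand-in for resampling or the bonus token).
pub fn verify(
    drafts: &[u32],
    p: &[f64],
    q: &[f64],
    u: &[f64],
    next: impl Fn(usize) -> u32,
) -> Verified {
    let mut emitted = Vec::with_capacity(drafts.len() + 1);
    for i in 0..drafts.len() {
        // Accept with probability min(1, p/q).
        let ratio = (p[i] / q[i]).min(1.0);
        if u[i] < ratio {
            emitted.push(drafts[i]);
        } else {
            // First rejection: emit the target's token and stop the round.
            emitted.push(next(i));
            return Verified { accepted: i, emitted };
        }
    }
    // Every draft survived: the target step still yields one more token.
    emitted.push(next(drafts.len()));
    Verified { accepted: drafts.len(), emitted }
}
```

With four drafts and p equal to q at every position, all four are kept and the round emits five tokens, matching the count above. With greedy decoding the ratio collapses to 1 when draft and target agree and 0 when they do not, so the test reduces to a token comparison.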

One four-line Model trait

The entire serving stack hangs off a single trait with forward, vocab_size, num_layers and eos_token. Above the seam sit the three systems that decide how fast an inference server runs. Below it sits a model. Swap the model and the rest does not change.

axum HTTP surface

A native /generate endpoint and an OpenAI-compatible /v1/completions. Existing OpenAI client code can point base_url at forge-infer and call /v1/completions unchanged, including with stream set to true.

SSE streaming

Decoded tokens stream back as Server-Sent Events deltas terminated with [DONE], so clients render output progressively instead of waiting for the whole completion.
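
On the wire, an SSE stream is a sequence of "data:" lines, each event closed by a blank line. A minimal sketch of the framing, assuming each delta is already serialised to a string (the actual payload shape forge-infer sends is not shown on this page, and sse_frames is a hypothetical helper):

```rust
/// Frame a list of already-serialised deltas as Server-Sent Events and
/// close the stream with the `[DONE]` sentinel.
pub fn sse_frames(deltas: &[&str]) -> String {
    let mut out = String::new();
    for delta in deltas {
        // A payload containing newlines becomes several `data:` lines
        // within the same event.
        for line in delta.split('\n') {
            out.push_str("data: ");
            out.push_str(line);
            out.push('\n');
        }
        out.push('\n'); // blank line ends the event
    }
    out.push_str("data: [DONE]\n\n");
    out
}
```

A client reads events until it sees the [DONE] payload, appending each delta to what it has rendered so far.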

Determinism by construction

The model is a deterministic hash, on purpose, so cache, scheduler and decoder can be verified bit-stably without a GPU. The acceptance test in speculative decoding needs p(t)/q(t) reproducible to the bit; floating-point attention across two model instances is not stable enough to assert on.

StepPlan as a value

The scheduler returns a StepPlan describing what to run rather than calling the model directly. That split is what makes preempts_when_blocks_run_out and admission_blocks_when_prompt_does_not_fit assertable with no model in the test at all.

Built-in benchmark binary

cargo run --release --bin forge-bench prints a throughput table across sequential, continuous batching and speculative strategies. The figures isolate the cost of the serving machinery rather than model compute, which is the only honest thing the benchmark can measure on a CPU.

Runs anywhere Rust does

No CUDA, no Python glue, no heavy dependency tree. cargo test finishes in seconds, and cargo run --release gets to a serving binary in under a second. The whole point is that you can read it on a laptop and run it on a laptop.

Tech stack

Rust 1.96 · axum · tokio · Serde · Cargo · Server-Sent Events · OpenAI completions shape · MIT

Architecture sketch

One engine loop. One scheduler decision per decode iteration. One model call per iteration. SSE deltas out.

Engine loop: schedule, decode, push token, repeat until finished, then stream SSE deltas.

Quick start

Clone, test, serve. No GPU, no Python, no model download.

git clone https://github.com/sarmakska/forge-infer
cd forge-infer
cargo test                                   # 37 tests across the serving stack
cargo run --release --bin forge-infer        # serves 127.0.0.1:8080

# native shape
curl -s localhost:8080/generate \
  -d '{"prompt":"hello forge","max_tokens":24}'

# OpenAI-compatible, streamed over SSE
curl -sN localhost:8080/v1/completions \
  -d '{"prompt":"stream me","max_tokens":20,"stream":true}'

# print the throughput table on your own hardware
cargo run --release --bin forge-bench

# change the listen address
FORGE_ADDR=0.0.0.0:9090 cargo run --release --bin forge-infer

What the benchmark measures

Apple M3 Pro, 64 requests, sixteen-token prompts, sixty-four max new tokens. The deterministic model is near-free, so these figures isolate the cost of the serving machinery rather than model compute. Read them for shape, not magnitude.

Strategy                         Tokens/sec   ms/request   Notes
sequential (batch 1)             ~1.88M       0.031        static-batching baseline, fresh engine per request
continuous batching (batch 16)   ~2.07M       0.028        sixty-four requests share one engine loop
speculative (lookahead 4)        ~0.59M       0.10         acceptance 52%, output exact

The figure that carries across to a real model is the 52% acceptance rate. Just over half the draft's proposals are reused without a target recompute, and each accepted token skips a target step.
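
What 52% means in tokens per target step depends on how the rate is defined, which this page does not spell out. The sketch below works the arithmetic under two labelled readings, both of them assumptions: reading A takes the rate as accepted drafts divided by proposed drafts, reading B takes it as an independent per-position acceptance probability with the round stopping at the first rejection. The function names are hypothetical.

```rust
/// Reading A: `rate` is accepted drafts divided by proposed drafts.
/// Each round proposes `k` drafts and always emits one target token.
pub fn tokens_per_step_from_ratio(rate: f64, k: u32) -> f64 {
    1.0 + rate * k as f64
}

/// Reading B: every draft position is accepted independently with
/// probability `alpha`, and the round stops at the first rejection.
/// The expected count is 1 + alpha + alpha^2 + ... + alpha^k.
pub fn tokens_per_step_iid(alpha: f64, k: u32) -> f64 {
    (0..=k).map(|i| alpha.powi(i as i32)).sum()
}
```

For a lookahead of 4, reading A gives 3.08 tokens per target step and reading B gives about 2.00. Both readings give 5 at a rate of 1.0, the fully accepted round described earlier.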

Use cases

What people actually use this for.

Learn how vLLM-style serving actually works

The PagedAttention paper sketches the idea in diagrams. Real engines bury it under tens of thousands of lines of CUDA and Python glue. forge-infer is the middle: the allocator, scheduler and decoder for real, with the awkward cases handled and the model deliberately trivial.

Prototype a serving policy against a swap-in model

Implement the four-line Model trait against a candle backend or a remote API and the cache, batcher and speculator above the seam are unchanged. Use forge-infer as the harness while you iterate on what the model does.

Teach an internal team paged attention

A cargo build finishes in a couple of seconds. The tests never flake. The scheduler is unit-testable with no forward pass. A workshop that would have needed a GPU room runs on any laptop with Rust installed.

A reference for design reviews

When you are debating recompute versus swap, lookahead values, or block sizes for your real engine, point at code rather than slides. Every trade-off has a sentence in the design-decisions section and a test pinning the chosen behaviour.

Roads I picked, and the ones I did not

Recompute over swap

vLLM can swap KV blocks to host memory under pressure. forge-infer recomputes instead. Swap saves the recompute but adds a copy path, a host pool and a second eviction policy. Recompute is a dozen lines, never deadlocks, and makes the cost of preemption legible. Swap is a latency optimisation that only pays off once the model is real.

Deterministic hash model

A real tiny transformer in candle would let the engine produce prose, at the cost of a multi-second compile, a heavy dependency tree, and a model that is not bit-stable between instances. The speculative acceptance test needs p(t)/q(t) reproducible to the bit. A pure function of the context is what the cache, scheduler and decoder need to be verified.

StepPlan as a value

The scheduler hands back a description of what to run, not a tick that also runs the model. That separation is what makes admission, preemption and retirement assertable with no forward pass in the test at all. One indirection in the engine loop, paid every time for a testable policy.

What this is not

forge-infer is not a production inference server and it is not pretending to be one. Stated plainly:

  • No real weights. The model generates a reproducible byte stream, not language. Implement Model::forward against your own weights for prose out.
  • No shared server-side engine yet. The HTTP layer spins up a fresh Engine per request over a shared Arc<dyn Model>. The continuous-batching path is exercised by the benchmark, which drives sixty-four requests through one engine.
  • No carried KV state. A real backend would attend over the physical blocks the cache hands it; the demo model recomputes from the full context each forward.
  • Greedy argmax only. No temperature, top-p or top-k sampling on the server path. Speculative decoding still uses the standard rejection-sampling test, so the exactness guarantee holds.
MIT · zero CUDA · runs on a laptop

Read it. Fork it. Swap the model.

A shared server-side engine loop and prefix sharing across sequences with copy-on-write block tables are on the roadmap. A candle feature flag behind the Model trait is the natural next step for real text out.
