raftkv
A from-scratch Raft KV store in Go, with a built-in Jepsen-style fault-injection harness and a Wing and Gong linearizability checker that prove the cluster stays correct under partitions, crashes and torn writes.
Abstract
raftkv is an open-source, MIT-licensed Raft key-value store written in Go on the standard library, with no external dependencies. Its distinguishing feature is not the consensus code, it is the apparatus that judges it. A Jepsen-style fault-injection harness installs as a Filter on an in-memory network and partitions, drops, delays and crashes traffic; a leader-aware client records every operation into a History; a Wing and Gong linearizability checker runs a per-key backtracking search over the History and returns a one-line verdict. The flagship test, TestLinearizableUnderChaos, drives 200 operations across a 5-node cluster while the nemesis runs and asserts the recorded history is linearizable on every test run. This whitepaper documents the architecture, the harness, the checker, the failure modes the system catches, and the design decisions that went against the obvious option.
01Executive Summary
raftkv implements Raft from scratch: randomised-timeout leader election with a pre-vote phase, log replication with fast conflict backtracking, crash-safe persistence of term, vote and log in a length-prefixed CRC-checked append-only file, log compaction via InstallSnapshot, and the read-index path from paper section 6.4 with a leader-lease optimisation strictly shorter than the minimum election timeout.
Around the consensus core sits the apparatus that proves it works. The fault/ package ships an Injector that partitions, drops, delays and reorders messages and a Nemesis that schedules these faults against a live cluster. The Injector installs as a Filter on the in-memory network defined in cluster/network.go, so it interposes on every message without the Raft code knowing it is being attacked. The linz/ package records every client invoke and return into a History and runs a Wing and Gong backtracking search over it, partitioned per key with memoisation on the pair (linearized set, model state).
The result is a Raft implementation whose correctness claim is not "the tests passed" but "here is a recorded history under chaos, and a checker accepted it". The rule the project holds itself to is that no consistency claim ships without a recorded history a checker has accepted.
02Background & Motivation
Most Raft tutorials stop at "the test passed". That has never been satisfying, because a green test on a happy network proves almost nothing about a consensus protocol. The interesting question is what the cluster does when the network is on fire.
I had implemented Raft once before, the way most people do. Follow the paper, write a few tests that elect a leader and replicate a value, watch them go green, move on. Months later I could not honestly say whether that implementation was correct under a partition, because I had never put it under one. The tests only ever exercised the path the paper draws on its happy diagram.
This implementation inverts the priorities. The linearizability checker and the fault harness were written first and treated as the product. The Raft core was built to satisfy them. The wider lineage is Jepsen, which set the modern bar for empirically checking distributed-system claims, and the Wing and Gong paper, which gave the algorithmic backbone for a backtracking linearizability search.
03The Problem
The specific problems this project addresses:
- Unfalsifiable Raft implementations. An implementation whose tests only run on a happy network has not earned the claim that it is linearizable. The check has to survive partition, delay, reorder, crash and torn write.
- Outsourced proofs. Bolting on Porcupine or Knossos is a real option, but it puts the load-bearing algorithm in someone else\'s repo. For a teaching artefact, the checker itself has to be readable.
- Hidden disk-recovery code. Persisting the log to SQLite or bbolt would skip the most subtle thing in a Raft implementation: what the on-disk format does when the process dies mid-write.
- Clock-dependent reads. Pure leader leases are tempting for fast reads but make correctness depend on bounded clock drift, which a paused VM or an NTP step can break silently.
04Goals & Non-goals
Goals
- A from-scratch Raft core with election, replication, persistence, snapshots and read-index, mapped onto the paper.
- A Jepsen-style nemesis: partition, drop, delay, reorder, crash, against a live cluster.
- A Wing and Gong linearizability checker that returns a one-line verdict on a recorded History.
- A flagship chaos test that drives an end-to-end workload under the nemesis and asserts linearizability on every run.
- The Go standard library only. Nothing to
go get.
Non-goals
- A production data store. The transport is in-process, the on-disk format favours clarity over speed, membership is static. This is correctness-first, teaching-grade.
- A million-operation checker. Porcupine is the right answer at that scale. The Wing and Gong implementation here is sized for the history a test produces.
- Dynamic reconfiguration. No joint consensus yet. The node set is fixed at startup.
- Client sessions or idempotency keys. An unconfirmed write that the client retries can apply twice. The checker models this honestly so it stays sound.
05Architecture
Package layout
| Package | Responsibility |
|---|---|
raft/ | Consensus core: election, replication, persistence, snapshot, read-index |
kv/ | Replicated state machine: Get, Put, Delete, Snapshot, Restore |
cluster/ | In-process multi-node cluster, in-memory network, leader-aware client |
fault/ | Injector (partition, drop, delay, reorder, crash) and Nemesis scheduler |
linz/ | History recorder, Wing and Gong linearizability checker |
transport/ | Transport interface, the seam a real gRPC or TCP transport slots into |
cmd/raftkvd/ | Demo: 5-node cluster under chaos, prints the verdict |
End-to-end flow under chaos
06Key Technical Decisions
Read-index over leader leases alone
The tempting shortcut for fast reads is a pure leader lease: trust a wall-clock timer, skip the read index, answer locally. I rejected that as the only mechanism because it makes correctness depend on bounded clock drift between machines, and a paused VM or an NTP step quietly breaks linearizability with no way for the checker to have caught it in a deterministic test. So the lease here is an optimisation layered on top of the read index, never a replacement for it. Under the lease a read is local and fast; without it the leader pays for a heartbeat round. Correctness never rests on the clock alone.
Wing and Gong search over a Knossos-style solver
The well-known checkers (Knossos, Porcupine) are excellent, but they are also large bodies of someone else\'s code, and pulling one in would have meant the proof was outsourced. The point of this project is that the property can be demonstrated end to end, so I implemented the classic Wing and Gong backtracking search by hand, partitioned per key with memoisation on (linearized set, model state). It is enough for the history sizes a test produces and it is small enough to read in one sitting. Porcupine is the right answer if you need to check million-operation histories; that is explicitly not what this is.
A single mutex per node over channel-per-actor
Go nudges you toward modelling each node as a goroutine that owns its state and communicates over channels. I tried it and the code drifted away from the paper, turning Raft\'s already-subtle invariants into a message-ordering puzzle on top. I went back to one sync.Mutex per node with internal helpers suffixed Locked, so the code reads next to the paper\'s pseudocode and the lock discipline is obvious. The trade-off is that I cannot fan out RPCs while holding the lock, so sends happen on goroutines that re-acquire only to apply the reply.
A custom append-only log over an embedded database
I could have persisted the log to SQLite or bbolt and saved myself the recovery code. I wrote a length-prefixed, CRC-checked append-only file instead, because torn-write recovery from a crash is part of what I wanted to demonstrate, and burying it inside a database would have hidden the one mechanism a reviewer most wants to see. TestTornTrailingRecordDiscarded appends garbage to simulate a crash mid-write and asserts the log drops the torn record cleanly on reopen, keeping the good entries.
07The Fault-Injection Harness
fault/fault.go is the harness. An Injector exposes four operations: Partition (split the cluster into groups that cannot reach each other), Drop (silently discard messages matching a predicate), Delay (hold messages for a configurable interval), and Crash (terminate a node and recreate it from its persisted state on resume). A Nemesis wraps the Injector with a scheduler that picks operations on a randomised cadence.
The Injector installs as a Filter on the in-memory network defined in cluster/network.go. Every cross-node message passes through the Filter, so the Injector interposes on the whole cluster from a single point. The Raft code never knows it is being attacked; from its point of view, partitions look like timeouts, drops look like silence, delays look like a slow network, and crashes look like a peer that stopped acknowledging.
Crash, importantly, is not "stop the goroutine". It tears down the node and reconstitutes it from state.json, raft.log and the most recent snapshot, exercising the same recovery path a real crashed process would take.
TestLinearizableUnderChaos in cluster/cluster_test.go wires the Nemesis to a 5-node cluster, drives a 200-operation workload through the leader-aware client, and asserts the recorded History is linearizable. It runs as part of go test ./....
08The Linearizability Checker
linz/checker.go implements the classic Wing and Gong backtracking search. The input is a History, a sequence of (operation, invoke timestamp, return timestamp, response) tuples produced by the leader-aware client. The output is a boolean and, on failure, the offending key.
Two ideas keep the search tractable. First, operations on different keys cannot interfere with each other, so the checker partitions the History by key and runs the search independently on each partition. Second, within a key, the search memoises on the pair (linearized set, model state), so the same intermediate position is never re-explored. This is enough for the few-hundred-operation histories the test suite produces.
linz/example_test.go pins three diagnostic cases. ExampleCheck_staleRead writes 1, writes 2, reads 1, and asserts the checker rejects it. ExampleCheck_phantomRead reads a value that was never written and asserts rejection. A corrupted-history case asserts rejection on internally inconsistent input. The pinned expected output means go test ./linz/ fails the moment the checker stops rejecting any of them. TestUnconfirmedWriteIsFlexible covers the converse: a write whose return was never recorded is treated as may-or-may-not-have-taken-effect, so the checker stays sound under client-side ambiguity.
09Results & Performance
Measured on an Apple M3 Pro, Go 1.26.3, fast test timeouts (30ms heartbeat). Reproduce with go test ./cluster/ -bench Benchmark -run \'^$\' -benchmem.
| Workload | Result |
|---|---|
| Linearizable read under the leader lease | 457 ns/op, 0 allocs |
| Committed write, 3 nodes | 11.9 ms/op, 68 allocs |
| 200-op workload under the nemesis, 5 nodes | linearizable, checked every run |
The write figure is dominated by the heartbeat-driven replication cycle in the test configuration. A write waits for the next heartbeat to carry it to a majority, and the heartbeat interval is 30ms here. Production timeouts and entry batching would cut that sharply. Reads under the lease never leave the leader, which is why they are sub-microsecond. These are the actual numbers from the run above, not projections.
10Failure Modes Caught
The harness has empirically caught the following classes of failure while the implementation was under development. Each maps to a test that now defends against regression.
| Failure | How it surfaced | Defended by |
|---|---|---|
| Stale read served past lease expiry | Checker rejected a Get that returned an overwritten value | TestStaleReadIsRejected, lease window strictly shorter than min election timeout |
| Phantom read of an unwritten key | Checker rejected a Get that returned a value never put | TestReadOfUnwrittenValueRejected |
| Torn write on crash | Reopened log replayed half-written record as a valid entry | TestTornTrailingRecordDiscarded, CRC scan + truncate at first bad record |
| Split-brain leader after partition | Two leaders briefly accepted writes after a heal | Pre-vote phase + term-mismatched reply step-down, TestElectionAfterPartition |
| Log divergence on heal | Followers retained uncommitted entries the new leader had overwritten | Fast conflict backtracking in AppendEntries, TestLogConvergesAfterHeal |
| Snapshot install on a far-behind follower | Follower could not catch up because the leader had compacted past its next-index | InstallSnapshot RPC, TestSnapshotInstall |
| Committed entries lost on restart | State machine forgot entries past the most recent snapshot on reopen | TestRecoveryOfCommittedEntriesFromDisk |
| Checker accepts a corrupted history | Internally inconsistent History was returned as linearizable | Corrupted-history example test in linz/ |
11Lessons & Trade-offs
What worked
- Writing the checker first. Built backwards from "what would catch a stale read", which made the Raft code\'s correctness goals concrete from day one.
- One Filter seam. Routing all cross-node messages through a single Filter chokepoint made the fault harness an order of magnitude easier to write than per-edge injection.
- Pinning example test output. Encoding the checker\'s verdict in
Exampletest docstrings means the checker silently regressing is impossible. - Crash as recreate-from-disk. Crashes that restart the node from persisted state exercised the on-disk recovery path on every chaos run, not just the dedicated storage tests.
What we got wrong on first pass
- First storage layer JSON-encoded each log entry separately and concatenated. A torn write produced unparseable JSON for the whole tail, not just the bad entry. Switched to length-prefixed + CRC per record, scanned forward on reopen.
- Original election timer was a fixed interval. Two followers timed out together repeatedly under partition. Randomising the timeout inside a window above the heartbeat interval is the canonical fix and the implementation was missing it.
- Pre-vote phase was bolted on later. Without it, a partitioned follower that kept timing out would bump the term and disrupt the live cluster on heal. The checker caught the disruption window before the pre-vote phase was added.
Trade-offs we accept
- In-process transport. Real-network numbers will differ. A gRPC or TCP transport slots in behind the existing
Transportinterface. - Static membership. No joint consensus. The node set is fixed at startup.
- No client sessions. An unconfirmed write that the client retries can apply twice. The checker models this honestly (an unconfirmed write may or may not have taken effect), which keeps it sound but means the system is at-least-once, not exactly-once.
- JSON for state and snapshots. Slower than a binary format. Far easier to inspect when something is wrong, which is the right trade for a teaching artefact.
12Conclusion
raftkv demonstrates that the right way to convince yourself a Raft implementation is correct is to write the apparatus that judges it first and the implementation second. The Wing and Gong checker and the Jepsen-style nemesis are 1,000-odd lines of Go between them, fit on the Go standard library, and produce a one-line verdict that the consensus code either earns or does not. The Raft core, leader election with pre-vote, log replication with fast conflict backtracking, crash-safe persistence, snapshots and a read-index path with a clock-independent fallback, is the second half of the story, written to satisfy the first.
The rule the project holds itself to is that no consistency claim ships without a recorded history a checker has accepted. TestLinearizableUnderChaos is that contract. If the cluster ever lies, the verdict turns red, and the History is the regression test.
ATest Inventory
The repo ships 27 tests across the packages. The flagship and most-cited are listed below.
| Test | Package | What it pins |
|---|---|---|
TestElectsLeader | cluster/ | Cluster elects exactly one leader from a quiet start |
TestBasicReadWrite | cluster/ | Put then Get returns the written value |
TestElectionAfterPartition | cluster/ | Majority side elects a new leader when isolated |
TestLogConvergesAfterHeal | cluster/ | All nodes share the same committed log after a heal |
TestLeadershipChange | cluster/ | Graceful leadership transfer keeps the cluster live |
TestRecoveryOfCommittedEntriesFromDisk | cluster/ | Crashed and restarted nodes resume from persisted state |
TestSnapshotInstall | cluster/ | A far-behind follower is re-seeded via InstallSnapshot |
TestLinearizableUnderChaos | cluster/ | 200-op workload under the nemesis is linearizable |
TestTornTrailingRecordDiscarded | raft/ | CRC scan drops a torn final record on reopen |
TestStateRoundTrip · TestLogAppendAndReload · TestTruncateSuffix · TestSnapshotCompactsLog | raft/ | Storage primitives round-trip correctly |
TestSequentialValidHistory · TestConcurrentValidHistory | linz/ | Checker accepts well-formed histories |
TestStaleReadIsRejected · TestReadOfUnwrittenValueRejected | linz/ | Checker rejects classic linearizability violations |
TestDeleteThenGetNil · TestPerKeyIndependence | linz/ | Model semantics, per-key partitioning |
TestUnconfirmedWriteIsFlexible | linz/ | Unreturned writes are modelled as may-or-may-not |
ExampleCheck_staleRead | linz/ | Pinned verdict output, regression-proof |
BReading Order
linz/example_test.go, the pinned diagnostic. Five lines of Go that show the checker rejecting a stale read.linz/checker.go, the Wing and Gong search. The product.fault/fault.go, the harness. Injector + Nemesis on the Filter seam.cluster/network.go, the seam itself. One file, one interface.cluster/client.go, the leader-aware client. How operations enter the history.raft/raft.go, the consensus core. Election and replication, mapped onto the paper.raft/read.go, the read-index path and lease optimisation.raft/storage.go, the length-prefixed, CRC-checked append-only log.cluster/cluster_test.go, finishing onTestLinearizableUnderChaoswith all the prior pieces in mind.