How raftkv works
Leader election with pre-vote, log replication with fast conflict backtracking, the read-index path with leader leases, the Jepsen-style nemesis, and the Wing and Gong checker. Every claim is backed by a test in the repo.
A cluster.
A nemesis.
A checker that judges.
Nodes talk through a Transport interface. The in-memory network exposes a Filter seam. The nemesis installs as a Filter and partitions, drops, delays and crashes traffic. The Raft code never knows.
The leader-aware client records every invoke and return into a History. At the end, linz.Check runs a Wing and Gong backtracking search and returns one line: linearizable=true, or linearizable=false with the offending key.
If the consensus code ever served a stale or invented value, the verdict turns red. TestLinearizableUnderChaos drives this for 200 operations across a 5-node cluster on every test run.
A single Put, end to end
Each piece, deep-dived
Leader election with a pre-vote phase
A partitioned follower whose election timer keeps firing will keep bumping its term and disrupt the live cluster the moment the partition heals. A pre-vote phase makes a candidate ask permission before incrementing the term.
Each follower randomises its election timeout inside a window strictly larger than the heartbeat interval. On expiry it sends a PreVote RPC for term+1, increments its term and asks for real votes only if a majority acknowledges. Implemented in raft/raft.go alongside the RequestVote handler. Verified by TestElectsLeader and TestElectionAfterPartition.
Log replication with fast conflict backtracking
The naive AppendEntries retry walks back one index at a time, which is O(log length) round trips against a far-behind follower. The paper sketches a faster scheme that the implementation has to actually do.
On rejection a follower returns its conflicting term and the first index of that term. The leader skips directly past the whole bad term in one round trip, then resumes one-by-one in the contested term. TestLogConvergesAfterHeal heals a partition and asserts every node converges on the same log.
Read-index path with leader leases
Routing reads through the log doubles write traffic for a read-mostly workload. A pure leader lease, "trust the clock, answer locally", quietly breaks linearizability under a paused VM or an NTP step.
raft/read.go implements paper section 6.4. A no-op committed at the start of each term anchors the read index in the current term. The leader lease is strictly shorter than the minimum election timeout, renewed when a majority acknowledges a heartbeat. Under the lease a read is local. Outside it, the leader pays for a heartbeat round before answering. Correctness never rests on the clock alone.
Crash-safe append-only log
A torn write mid-fsync after a crash must not turn into a half-replayed entry on reopen. Burying that recovery inside SQLite or bbolt would hide the one mechanism a reviewer most wants to see.
raft/storage.go is a length-prefixed, CRC-checked append-only file. On open it scans forward, validates each CRC, and truncates at the first bad record. TestTornTrailingRecordDiscarded appends garbage to simulate a crash mid-write and asserts the log drops the torn record cleanly while keeping the good entries.
Snapshots and InstallSnapshot
A leader cannot replicate forever from an uncompacted log. A follower that has fallen behind the leader's log start has to be re-seeded from a snapshot.
The kv state machine implements Snapshot and Restore. The leader compacts its log past the snapshot index and ships the snapshot via InstallSnapshot to followers that are too far behind. TestSnapshotInstall drives a follower past compaction and asserts it catches up correctly.
Jepsen-style fault-injection harness
A green test on a happy network proves almost nothing about a consensus protocol. The interesting question is what the cluster does when the network is on fire.
fault/fault.go ships an Injector with Partition, Drop, Delay and Crash operations and a Nemesis that schedules them against a live cluster. The Injector installs as a Filter on the in-memory network (cluster/network.go), so it interposes on every message without the Raft code knowing. Crashes recreate the node from its persisted state to exercise recovery, not just stop the goroutine.
Wing and Gong linearizability checker
Porcupine and Knossos are excellent but pulling one in means the proof is outsourced. The point of the project is that I can demonstrate the property myself.
linz/checker.go is a classic Wing and Gong backtracking search, partitioned per key with memoisation on (linearized set, model state). The History records every invoke and return with monotonic timestamps. Check returns linearizable=true or linearizable=false with the offending key. example_test.go pins the output of staleRead, phantomRead and a corrupted history so the checker fails if it ever stops rejecting them.
Why this, not that
One sync.Mutex per node
Code reads next to the paper's pseudocode and the lock discipline is obvious. Locked-suffixed helpers make it impossible to forget which path holds the mutex.
Channel-per-actor goroutines, Go nudges you toward this but it turned Raft's already-subtle invariants into a message-ordering puzzle on top. I tried it, the code drifted away from the paper, I went back.
In-memory transport with a Filter seam
Deterministic, race-clean tests under chaos. The Filter is the single chokepoint where the nemesis interposes, so injecting faults is one interface, not a sprawl across the cluster.
A real gRPC or TCP transport, the right answer in production but it would have made the test suite slow and the fault harness an order of magnitude harder to write. There is a Transport interface; a real network slots in behind it.
Wing and Gong over Porcupine
Small enough to read in one sitting. Per-key partitioning and memoisation on (linearized set, model state) is enough for the history sizes a test produces.
Porcupine is the right answer for million-operation histories, that is explicitly not what this is. Pulling it in would have outsourced the proof.
Hand-rolled append log
Torn-write recovery is part of what the project demonstrates. Length prefix plus CRC32 per record, truncate at the first bad CRC on reopen. Visible end to end.
SQLite or bbolt, would have saved the recovery code but buried the most interesting mechanism. Same argument for JSON state files over Protobuf.
Read-index with lease optimisation
Lease is fast, read-index is the safety net. The lease is shorter than the minimum election timeout so it cannot outlive a leader that has actually been deposed.
Lease-only reads, correctness depends on bounded clock drift, a paused VM or an NTP step quietly breaks linearizability with no way for a deterministic test to have caught it.
Standard library only
go build ./... has nothing to fetch. The proof is self-contained, end to end, with no vendor dependencies to audit.
A protobuf wire format, a structured logger, an assertion library, would shave a hundred lines each. Not worth the dependency surface for a teaching repo.
What you can measure
Failure modes the harness actually catches
What is next
Real network transport
A gRPC or TCP Transport slotted in behind the existing seam. The nemesis, checker and Raft core stay unchanged.
Seeded chaos runs in CI
A recorded chaos run with its seed checked into the repo. A violation, if one ever surfaced, would be replayable bit for bit.
Joint-consensus reconfiguration
Today the node set is fixed at startup. Joint consensus would let nodes join and leave at runtime, which the checker would have to model.
Client sessions and idempotency
An idempotency key per write to turn the current at-least-once into exactly-once, removing the "unconfirmed write may apply twice" caveat.
Entry batching
Today a write waits for the next heartbeat. Coalescing entries into a single AppendEntries cuts the committed-write latency sharply.
Pluggable storage backend
The append-only log behind an interface, so a bbolt or LMDB backend can be benchmarked against the hand-rolled file.
Ready to read the proof?
Clone the repo. go test ./.... Then read linz/checker.go first, the Raft core second. The checker is the product.