How it works · raftkv

How raftkv works

Leader election with pre-vote, log replication with fast conflict backtracking, the read-index path with leader leases, the Jepsen-style nemesis, and the Wing and Gong checker. Every claim is backed by a test in the repo.

TL;DR

A cluster.
A nemesis.
A checker that judges.

Nodes talk through a Transport interface. The in-memory network exposes a Filter seam. The nemesis installs as a Filter and partitions, drops, delays and crashes traffic. The Raft code never knows.

The leader-aware client records every invoke and return into a History. At the end, linz.Check runs a Wing and Gong backtracking search and returns one line: linearizable=true, or linearizable=false with the offending key.

If the consensus code ever served a stale or invented value, the verdict turns red. TestLinearizableUnderChaos drives this for 200 operations across a 5-node cluster on every test run.

<span class="dim">Client: Put("x", "1") · Put("x", "2") · Get("x")</span> <span class="hl">Step 1 · Elect</span> pre-vote → vote → leader <span class="hl">Step 2 · Replicate</span> AppendEntries to majority <span class="ok">[OK] committed</span> <span class="hl">Step 3 · Apply</span> kv.Apply on each node <span class="hl">Step 4 · Nemesis</span> partition node 0,1 | 2,3,4 <span class="dim">drop · delay · crash</span> <span class="hl">Step 5 · Heal</span> logs converge after merge <span class="hl">Step 6 · Check</span> Wing and Gong per key <span class="ok">[OK] linearizable=true</span> <span class="ok">If the cluster lied, the checker would say so.</span>
Core data flow

A single Put, end to end

rendering
client.Put(x, 1) under chaos: leader replicates, majority persists, apply loop runs, verdict at end of run.
Subsystems

Each piece, deep-dived

Leader election with a pre-vote phase

Why it exists

A partitioned follower whose election timer keeps firing will keep bumping its term and disrupt the live cluster the moment the partition heals. A pre-vote phase makes a candidate ask permission before incrementing the term.

How it actually works

Each follower randomises its election timeout inside a window strictly larger than the heartbeat interval. On expiry it sends a PreVote RPC for term+1, increments its term and asks for real votes only if a majority acknowledges. Implemented in raft/raft.go alongside the RequestVote handler. Verified by TestElectsLeader and TestElectionAfterPartition.

Log replication with fast conflict backtracking

Why it exists

The naive AppendEntries retry walks back one index at a time, which is O(log length) round trips against a far-behind follower. The paper sketches a faster scheme that the implementation has to actually do.

How it actually works

On rejection a follower returns its conflicting term and the first index of that term. The leader skips directly past the whole bad term in one round trip, then resumes one-by-one in the contested term. TestLogConvergesAfterHeal heals a partition and asserts every node converges on the same log.

Read-index path with leader leases

Why it exists

Routing reads through the log doubles write traffic for a read-mostly workload. A pure leader lease, "trust the clock, answer locally", quietly breaks linearizability under a paused VM or an NTP step.

How it actually works

raft/read.go implements paper section 6.4. A no-op committed at the start of each term anchors the read index in the current term. The leader lease is strictly shorter than the minimum election timeout, renewed when a majority acknowledges a heartbeat. Under the lease a read is local. Outside it, the leader pays for a heartbeat round before answering. Correctness never rests on the clock alone.

Crash-safe append-only log

Why it exists

A torn write mid-fsync after a crash must not turn into a half-replayed entry on reopen. Burying that recovery inside SQLite or bbolt would hide the one mechanism a reviewer most wants to see.

How it actually works

raft/storage.go is a length-prefixed, CRC-checked append-only file. On open it scans forward, validates each CRC, and truncates at the first bad record. TestTornTrailingRecordDiscarded appends garbage to simulate a crash mid-write and asserts the log drops the torn record cleanly while keeping the good entries.

Snapshots and InstallSnapshot

Why it exists

A leader cannot replicate forever from an uncompacted log. A follower that has fallen behind the leader's log start has to be re-seeded from a snapshot.

How it actually works

The kv state machine implements Snapshot and Restore. The leader compacts its log past the snapshot index and ships the snapshot via InstallSnapshot to followers that are too far behind. TestSnapshotInstall drives a follower past compaction and asserts it catches up correctly.

Jepsen-style fault-injection harness

Why it exists

A green test on a happy network proves almost nothing about a consensus protocol. The interesting question is what the cluster does when the network is on fire.

How it actually works

fault/fault.go ships an Injector with Partition, Drop, Delay and Crash operations and a Nemesis that schedules them against a live cluster. The Injector installs as a Filter on the in-memory network (cluster/network.go), so it interposes on every message without the Raft code knowing. Crashes recreate the node from its persisted state to exercise recovery, not just stop the goroutine.

Wing and Gong linearizability checker

Why it exists

Porcupine and Knossos are excellent but pulling one in means the proof is outsourced. The point of the project is that I can demonstrate the property myself.

How it actually works

linz/checker.go is a classic Wing and Gong backtracking search, partitioned per key with memoisation on (linearized set, model state). The History records every invoke and return with monotonic timestamps. Check returns linearizable=true or linearizable=false with the offending key. example_test.go pins the output of staleRead, phantomRead and a corrupted history so the checker fails if it ever stops rejecting them.

Design decisions

Why this, not that

One sync.Mutex per node

Why it is here

Code reads next to the paper's pseudocode and the lock discipline is obvious. Locked-suffixed helpers make it impossible to forget which path holds the mutex.

Why not the alternative

Channel-per-actor goroutines, Go nudges you toward this but it turned Raft's already-subtle invariants into a message-ordering puzzle on top. I tried it, the code drifted away from the paper, I went back.

In-memory transport with a Filter seam

Why it is here

Deterministic, race-clean tests under chaos. The Filter is the single chokepoint where the nemesis interposes, so injecting faults is one interface, not a sprawl across the cluster.

Why not the alternative

A real gRPC or TCP transport, the right answer in production but it would have made the test suite slow and the fault harness an order of magnitude harder to write. There is a Transport interface; a real network slots in behind it.

Wing and Gong over Porcupine

Why it is here

Small enough to read in one sitting. Per-key partitioning and memoisation on (linearized set, model state) is enough for the history sizes a test produces.

Why not the alternative

Porcupine is the right answer for million-operation histories, that is explicitly not what this is. Pulling it in would have outsourced the proof.

Hand-rolled append log

Why it is here

Torn-write recovery is part of what the project demonstrates. Length prefix plus CRC32 per record, truncate at the first bad CRC on reopen. Visible end to end.

Why not the alternative

SQLite or bbolt, would have saved the recovery code but buried the most interesting mechanism. Same argument for JSON state files over Protobuf.

Read-index with lease optimisation

Why it is here

Lease is fast, read-index is the safety net. The lease is shorter than the minimum election timeout so it cannot outlive a leader that has actually been deposed.

Why not the alternative

Lease-only reads, correctness depends on bounded clock drift, a paused VM or an NTP step quietly breaks linearizability with no way for a deterministic test to have caught it.

Standard library only

Why it is here

go build ./... has nothing to fetch. The proof is self-contained, end to end, with no vendor dependencies to audit.

Why not the alternative

A protobuf wire format, a structured logger, an assertion library, would shave a hundred lines each. Not worth the dependency surface for a teaching repo.

Measured numbers

What you can measure

457ns
Linearizable read under the lease
0 allocs, leader local
11.9ms
Committed write, 3 nodes
30ms heartbeat, no batching
200
Ops under the nemesis
5 nodes, checked every run

Failure modes the harness actually catches

Partition isolates the leader
Cause: A majority partition no longer hears the heartbeat
What happens: Followers in the majority side time out, run pre-vote, elect a new leader. Old leader steps down on next term-mismatched reply. TestElectionAfterPartition asserts this.
Disk torn write on crash
Cause: Process killed mid-fsync, trailing record has bad CRC
What happens: On reopen the storage scanner stops at the first bad CRC and truncates. Good entries survive. TestTornTrailingRecordDiscarded pins this.
Follower miles behind, leader compacted
Cause: A lagging follower's next-index is below the leader's log start
What happens: Leader ships an InstallSnapshot RPC and resumes replication from snapshot.lastIndex + 1. TestSnapshotInstall pins this.
Client retries an unconfirmed write
Cause: Network delay made the client unsure the write committed
What happens: No client sessions yet, so the retry may apply twice. The checker models this honestly: an unconfirmed write may or may not have taken effect. TestUnconfirmedWriteIsFlexible captures the contract.
Stale read served by a deposed leader
Cause: Leader served a read locally past the end of its lease
What happens: The lease window is strictly shorter than the minimum election timeout, so a deposed leader's lease has already expired. Outside the lease the leader confirms with a heartbeat round before answering.
Checker rejects a recorded history
Cause: A real consistency bug surfaced under chaos
What happens: The history is the artefact. Re-run with the same seed, narrow the operation window, fix the Raft code. The verdict is the regression test.
Roadmap

What is next

Real network transport

A gRPC or TCP Transport slotted in behind the existing seam. The nemesis, checker and Raft core stay unchanged.

Seeded chaos runs in CI

A recorded chaos run with its seed checked into the repo. A violation, if one ever surfaced, would be replayable bit for bit.

Joint-consensus reconfiguration

Today the node set is fixed at startup. Joint consensus would let nodes join and leave at runtime, which the checker would have to model.

Client sessions and idempotency

An idempotency key per write to turn the current at-least-once into exactly-once, removing the "unconfirmed write may apply twice" caveat.

Entry batching

Today a write waits for the next heartbeat. Coalescing entries into a single AppendEntries cuts the committed-write latency sharply.

Pluggable storage backend

The append-only log behind an interface, so a bbolt or LMDB backend can be benchmarked against the hand-rolled file.

Ready to read the proof?

Clone the repo. go test ./.... Then read linz/checker.go first, the Raft core second. The checker is the product.