How lsmdb works
The storage layout, the write path, the compaction policy and the MVCC isolation model, all grounded in the real Go code that runs in the repo. No hand-waving over the parts that matter.
Write goes to the log.
Read walks the levels.
Compaction tidies up.
A Put appends to the WAL, fsyncs, then lands in the skip-list MemTable. When the MemTable hits MemTableSize (4 MiB default) it freezes, flushes to an L0 SSTable, and a fresh WAL and MemTable take over.
A Get walks newest first: active MemTable, immutable MemTable, every L0 table (overlapping), then a single binary-searched table per disjoint level L1 through L6. The first source that holds any version of the key decides the result.
Compaction merges L0 into L1 when L0 grows past four tables, and one oversized table down each deeper level. Newest version of a user key wins; tombstones drop only at the bottom level.
The level hierarchy
L0 holds overlapping MemTable flushes. L1 through L6 are disjoint, each ten times larger than the one above.
./data/ MANIFEST append-only JSON edit log, the atomic commit point 000007.log active WAL, fsynced before every Put returns 000004.sst L0 SSTable, may overlap with other L0 tables 000005.sst L0 SSTable 000009.sst L1 SSTable, disjoint with every other L1 table 000010.sst L1 SSTable
From Put to durable
// db.go, the durability barrier behind every acknowledged Put.
func (db *DB) write(kind encoding.Kind, key, value []byte) error {
db.mu.Lock()
defer db.mu.Unlock()
if db.closed {
return ErrClosed
}
seq := db.lastSeq + 1
// Append to the WAL and fsync before touching memory or acknowledging.
// If the process dies after this returns, recovery replays the record.
rec := encodeRecord(seq, kind, key, value)
if err := db.log.Append(rec); err != nil {
return err
}
if err := db.log.Sync(); err != nil {
return err
}
db.lastSeq = seq
db.mem.Add(seq, kind, key, value)
if db.mem.ApproximateSize() >= db.opts.MemTableSize {
if err := db.rotateMemtableLocked(); err != nil {
return err
}
}
return nil
}The WAL record framing is [crc32 4B little-endian | len 4B little-endian | payload]. CRC is Castagnoli, the same polynomial RocksDB uses. wal.Writer.Sync flushes the bufio writer and calls f.Sync, which is the fsync barrier. wal.Reader.Next treats a short read or a CRC mismatch as io.EOF, so a torn trailing record from a crash is dropped on replay instead of returning as corrupt data.
Sequences, trailers and order
// internal/encoding/encoding.go
// Trailer packs a 56-bit sequence in the high bits and the Kind in the low 8.
// Stored big-endian so a plain bytewise comparison orders keys correctly.
const MaxSequence uint64 = (1 << 56) - 1
func PackTrailer(seq uint64, kind Kind) uint64 {
return (seq << 8) | uint64(kind)
}
func MakeInternalKey(userKey []byte, seq uint64, kind Kind) InternalKey {
ik := make([]byte, len(userKey)+8)
copy(ik, userKey)
binary.BigEndian.PutUint64(ik[len(userKey):], PackTrailer(seq, kind))
return ik
}
// User keys ascending, then trailer descending so the newest version sorts first.
func CompareInternal(a, b InternalKey) int {
ua, ub := a.UserKey(), b.UserKey()
if c := compareBytes(ua, ub); c != 0 {
return c
}
ta, tb := a.Trailer(), b.Trailer()
switch {
case ta > tb:
return -1
case ta < tb:
return 1
default:
return 0
}
}Every user key gets paired with a monotonic 56-bit sequence number and a Kind byte. Within one user key the higher sequence sorts first, so a merging iterator naturally lands on the newest version. A Snapshot captures lastSeq, and the read path filters out every version with sequence higher than the snapshot. That is the entire isolation story: snapshot reads see exactly the writes committed before the snapshot, and the engine guarantees this until compaction reclaims the older versions.
// public_iterator.go
type Snapshot struct {
db *DB
seq uint64
}
func (db *DB) Snapshot() *Snapshot {
db.mu.RLock()
defer db.mu.RUnlock()
return &Snapshot{db: db, seq: db.lastSeq}
}
func (s *Snapshot) Get(key []byte) ([]byte, error) {
return s.db.getAt(key, s.seq)
}
// db.getAt walks sources newest first. The first source that holds any version
// of the key, live or tombstone, decides the result.
func (db *DB) getAt(key []byte, snap uint64) ([]byte, error) {
db.mu.RLock()
defer db.mu.RUnlock()
if v, found, ok := db.mem.Get(key, snap); ok { return finish(v, found) }
if db.imm != nil {
if v, found, ok := db.imm.Get(key, snap); ok { return finish(v, found) }
}
for i := len(db.levels[0]) - 1; i >= 0; i-- {
if v, found, ok := db.levels[0][i].Get(key, snap); ok { return finish(v, found) }
}
for lvl := 1; lvl < numLevels; lvl++ {
r := db.findTable(db.levels[lvl], key)
if r == nil { continue }
if v, found, ok := r.Get(key, snap); ok { return finish(v, found) }
}
return nil, ErrNotFound
}Newest wins, tombstones drop at the bottom
// compaction.go
// runCompaction merges inputs into one or more outputs on the target level.
// Tombstone handling: a deletion is dropped only when the compaction reaches
// the bottom level, because no older version can exist below it.
for ; merged.Valid(); merged.Next() {
key := merged.Key()
uk := key.UserKey()
// Skip superseded versions: the merging iterator yields the newest version
// of a user key first, so any later entry with the same user key is older.
if haveLast && encoding.CompareBytes(uk, lastUserKey) == 0 {
continue
}
lastUserKey = append(lastUserKey[:0], uk...)
haveLast = true
// Drop a tombstone outright when this is the bottom level.
if key.Kind() == encoding.KindDelete && isBottom {
continue
}
if writer == nil {
if err := openWriter(); err != nil { return err }
}
if err := writer.Add(key, merged.Value()); err != nil { return err }
if writer.Count() >= compactionMaxEntries {
if err := closeWriter(); err != nil { return err }
}
}pickCompaction triggers L0 to L1 when L0 grows to L0CompactionTrigger tables (default 4) and merges every L0 table with the overlapping L1 tables. Below L0 it walks levels, multiplying the byte budget by LevelSizeMultiplier (default 10) at each step, and compacts the first oversized table down. The output is split into multiple tables once a writer hits compactionMaxEntries (100,000) so individual tables stay small enough to map and to compact again later.
Why this, not that
Inline flush and compaction under the write lock
Deterministic ordering makes durability and recovery semantics trivial to reason about and to test. No concurrent mutation of the level layout to race against, every test is deterministic.
Background goroutine. The textbook design, lifts write throughput, but a regression in the recovery semantics is much harder to catch without a concurrency test harness I trust.
Append-only JSON manifest
Each edit is a few hundred bytes. One fsync per compaction. You can cat the manifest and read the history of the level layout. Adding outputs and deleting inputs in one edit is the atomic swap.
Rewritten snapshot file. Simpler to read back, but each compaction rewrites the whole list, and a rename is not the only thing that has to be atomic across a crash.
Whole keys in data blocks
Block reader is a straight scan with no restart-point bookkeeping. Format stays describable in a paragraph. Prefix compression is a clean later addition that does not touch the index or footer.
LevelDB-style prefix compression. Saves disk, complicates the reader, hard to debug.
Skip list, not B-tree or sorted slice
O(log n) insert and search with a structure that allows a single writer plus concurrent readers. That is exactly the MemTable access pattern.
Sorted slice, O(n) insert, fatal for a write buffer. Balanced B-tree, right complexity, but rebalancing makes lock-free reads hard.
CRC32C plus length framing for the WAL
A torn record from a crash either short-reads or fails the CRC. Either way the reader stops at the last fully durable record. CRC32C is fast and the polynomial RocksDB uses.
No CRC. A torn payload of the right length replays as live data on next open, which is a silent corruption bug.
What you can measure
Failure modes you should expect
What is next
WriteBatch behind one fsync
Batch many writes behind a single Sync to lift throughput by orders of magnitude when an application can tolerate a few milliseconds of loss on a crash.
Background flush and compaction
Move flush and compaction onto a dedicated goroutine once a concurrency test harness is in place to catch any regression in the recovery semantics.
Snapshot pinning
A retained minimum sequence that compaction must not drop below. Lets a long-lived snapshot survive heavy write volume without losing its versions.
Orphaned-file sweep on open
Delete unreferenced .sst files left by a crash mid-compaction. The manifest already records the live set, so anything not in it is safe to remove.
Wider bench coverage
Sustained mixed-workload benches with read/write ratios, plus latency histograms not just averages.
Optional prefix compression
Restart-point encoding inside data blocks, behind an Options flag, for workloads with long shared key prefixes where disk cost matters.
Ready to read the code?
Roughly 2,900 lines of Go and the standard library. An afternoon end to end. Start at db.go, follow into compaction.go, then internal/wal and internal/sstable.