Whitepaper · lsmdb

lsmdb

A log-structured merge-tree storage engine in pure Go: write-ahead log, skip-list MemTable, block-based SSTables, bloom filters, levelled compaction, MVCC snapshots. Standard library only.

MIT LicensedOpen SourceEmbeddedPure GoStandard library only~2.9k lines
~2.9klines of Go
2.82msdurable Put with fsync
2.15usSSTable point read
7public methods

v1.0 · 2026 · Sarma · Sarma Linux

Abstract

lsmdb is an open-source, MIT-licensed log-structured merge-tree storage engine written in Go using only the standard library. It provides ordered key-value storage with a write-ahead log for durability, a skip-list MemTable, immutable block-based SSTables with per-table bloom filters, levelled compaction with correct tombstone handling, and MVCC snapshot reads over monotonic 56-bit sequence numbers. The public surface is seven methods: Open, Close, Put, Delete, Get, NewIterator and Snapshot. Roughly 2,900 lines of Go. The whole engine is small enough to read end to end in an afternoon and real enough to stress with millions of writes. This paper documents the architecture, the durability contract, the MVCC model, the compaction policy and the trade-offs against BoltDB, Badger and Pebble.

01Executive Summary

A Put encodes a record, appends it to the WAL, fsyncs and only then adds the entry to the in-memory skip list. When the MemTable reaches MemTableSize (4 MiB by default) it freezes, flushes to a fresh L0 SSTable, and a new WAL and MemTable take over for subsequent writes. A Get walks newest first: active MemTable, immutable MemTable, every L0 table in newest-first order, then a single binary-searched table per disjoint level L1 through L6. The first source that holds any version of the key, live or tombstone, decides the result. Compaction merges L0 to L1 when L0 grows past four tables, and merges one oversized table down each deeper level otherwise.

Durability is the contract: every acknowledged Put is on disk before Put returns, and a process that dies mid-write recovers to a clean state with the torn tail dropped. Snapshots are taken in constant time and provide a stable read view at a fixed sequence number, with the caveat that compaction reclaims old versions over time so a long-lived snapshot can lose versions it needed if heavy writes run alongside it.

The engine is deliberately small. The roadmap (batched writes behind one fsync, background compaction, snapshot pinning, orphaned-file sweep) does not change the data formats or the public surface, so improvements arrive as drop-in upgrades.

02Why this engine exists

The LSM-tree as a data structure is a single page in the original paper. The production engines that implement it (LevelDB, RocksDB, Pebble) are excellent, but each carries twenty years of flags, formats and compaction strategies on top of that page. Reading their source is a worthwhile exercise that almost always ends in the same place: you understood the diagram but you cannot point at the exact lines that make a write durable, or the exact loop that drops a tombstone at the right moment.

lsmdb exists to be the version where you can. It implements every hard part for real: the write-ahead log with CRC framing and torn-tail recovery, the block-based table format with a sparse index and a per-table bloom filter, the heap-based merging iterator, levelled compaction with correct tombstone handling, MVCC snapshots over monotonic sequence numbers, and the append-only manifest that makes compaction's swap atomic. No part is faked or stubbed. There is no FFI to RocksDB, no shelling out to leveldb, no in-memory shortcut.

It is a portfolio engine and a teaching engine, not a RocksDB replacement, and the paper is honest about that line throughout. The audience is engineers learning storage engine internals, interviewers looking for a reference implementation small enough to discuss in an hour, and authors of distributed systems who want a local engine they can read in full while they build the layer above.

03Goals and non-goals

Goals

  • Correct durability: a successful Put survives a crash; a torn record is dropped on replay.
  • Snapshot isolation for reads, with the standard MVCC ordering rules.
  • A complete and readable LSM-tree implementation: WAL, MemTable, SSTable, bloom filter, manifest, compaction, merging iterator.
  • Standard library only, no external dependencies.
  • Small public surface: seven methods.
  • Deterministic test suite: flush and compaction run inline under the write lock for now, so every test is reproducible.

Non-goals

  • A networked or multi-node database. No replication, no sharding, no server. The raftkv sibling project layers Raft on top for that.
  • Multi-key transactions. Snapshots give a consistent read view; they do not give read-modify-write transactions across keys.
  • A drop-in for RocksDB or Pebble. Those engines have a decade of production hardening this one does not.
  • Pluggable comparators, column families, custom block formats. They would pull the project away from being readable.

04Architecture

Data flow

rendering
lsmdb write path: WAL fsync, memtable, flush to L0 SSTable, manifest commit, compaction.

Module map

FileResponsibility
db.goOpen, Close, Put, Delete, Get, write path, level layout, recovery
compaction.gopickCompaction, runCompaction, overlap and budget policy
iterator.gomergingIterator, min-heap over every sorted source
public_iterator.goSnapshot, Iterator, internal-key filtering by snapshot sequence
manifest.goappend-only edit log, loadManifest, openManifest
record.goWAL record encoding
internal/walWriter, Reader, CRC32C framing, torn-tail handling
internal/skiplistSkip-list primitives
internal/memtableMemTable wrapper over the skip list, MVCC-aware Get
internal/sstablewriter.go, reader.go, format.go, block-based table format
internal/bloomDouble-hashing bloom filter, target false-positive sizing
internal/encodingInternalKey, MakeInternalKey, CompareInternal, MaxSequence

05Durability and recovery

The WAL framing is [crc32 4B little-endian | len 4B little-endian | payload], CRC32C with the Castagnoli polynomial (the same polynomial RocksDB uses). wal.Writer.Append writes the framed record into a bufio buffer; Sync flushes the buffer and calls os.File.Sync. db.write calls both, in that order, before it acknowledges the Put. A process killed after Sync returns cannot lose the write; a process killed during Append leaves a short or CRC-failing trailing record that the reader silently drops as io.EOF.

Recovery happens inside Open. loadManifest replays the manifest to rebuild the live table set per level. recoverLog then iterates every pre-existing NNNNNN.log file in sequence-number order, replays each record into a fresh MemTable, and, if recovery rebuilt anything non-trivial, flushes the MemTable to an L0 SSTable before normal operation resumes. The shipped suite proves both halves of the contract: TestDurabilityAndRecovery writes two thousand keys, abandons the handle, reopens, and verifies every committed key is back. TestRecoveryDropsTornTail appends garbage to a log and asserts it is dropped on replay.

The one durability gap is mid-compaction. A crash after some output .sst files are written but before the manifest edit is fsynced leaves orphaned files on disk that the manifest does not reference. Correctness is unaffected (the manifest replay is the source of truth) but disk space leaks. An orphaned-file sweep on open is a roadmap item.

06MVCC and snapshots

Every user key is stored as an internal key: the user-visible bytes followed by an 8-byte trailer. PackTrailer(seq, kind) shifts a 56-bit sequence number eight bits left and ORs in the Kind byte; the trailer is written big-endian so that a plain bytewise comparison orders keys correctly. CompareInternal sorts user keys ascending and trailers descending, which means that within one user key, the higher sequence number sorts first. A merging iterator naturally yields the newest version of each user key first, which is what compaction and reads both rely on.

db.Snapshot takes the read lock and captures db.lastSeq. A read through the snapshot calls db.getAt(key, snap.seq), which walks sources newest first and asks each source for the highest-sequence version whose sequence is less than or equal to the snapshot. The MemTable does this by walking the version chain in the skip list; the SSTable does this by binary searching the block index, scanning the candidate block and stopping at the first matching internal key whose sequence fits. The first source that holds any version of the key decides the result.

Snapshots do not pin storage. The engine only retains old versions until compaction reclaims them, so a snapshot held across heavy write volume can lose the versions it needed. The current advice is to keep snapshots short, or to set DisableAutoCompaction while a long-lived snapshot is open. Snapshot pinning (a retained minimum sequence that compaction must not drop below) is a roadmap item.

07Levelled compaction

maybeCompactLocked calls pickCompaction and runs at most one compaction per call. The policy is intentionally simple. If L0 has accumulated L0CompactionTrigger tables (4 by default) it merges every L0 table together with the overlapping L1 tables into L1. Otherwise it walks levels, multiplying a byte budget by LevelSizeMultiplier (10 by default) at each step, and compacts the first oversized table down. The choice to start with the first table of an oversized level (rather than the one with the largest overlap or the oldest one) is deliberately a placeholder; production engines apply scoring heuristics here, and lsmdb's roadmap allows for the same.

runCompaction opens a min-heap merging iterator over the input tables and walks it in order. Three rules govern the output. First, within one user key only the newest version is kept; the merging iterator's ordering means any later entry with the same user key is older and discarded. Second, a tombstone is dropped only when the target level is the bottom level, because anywhere else a deeper level might hold a live version that the tombstone needs to keep shadowing. Third, the output is split into multiple tables once a writer hits compactionMaxEntries (100,000), which keeps individual tables small enough to compact again later.

The atomic commit is one fsynced manifest edit that records both the added outputs and the deleted inputs. Only after the edit is durable are the input files deleted. A crash between writing outputs and fsyncing the edit leaves the outputs orphaned but does not corrupt the live set.

08Comparison: BoltDB, Badger, Pebble

The Go ecosystem has three widely used embedded key-value stores. They cover different points in the design space, and lsmdb sits somewhere quite different to any of them. The comparison below states only properties that can be verified against each project’s source and documentation.

PropertylsmdbBoltDB / bboltBadgerPebble
On-disk structureLSM-tree, levels L0 to L6B+ tree, single mmapped fileLSM-treeLSM-tree, LevelDB-derived
Write durabilityWAL append + fsync per Putmmap + msync at commitWAL with configurable sync modeWAL with configurable sync mode
TransactionsNone across keys; per-key MVCCSerialisable read-write tx, single writerOptimistic SSI transactionsBatch + snapshot, no multi-key tx
SnapshotsYes, sequence-number basedYes, MVCC read transactionsYes, managed transactionsYes, sequence-number based
Value separation (key-value log)NoNoYes (vlog), helpful for large valuesNo
CompactionLevelled, inlineNone (B+ tree, in-place)Levelled, backgroundLevelled, background, multiple strategies
Concurrent writersSingle writer behind a MutexSingle writer behind a MutexConcurrent writersConcurrent writers
External dependenciesNone (standard library only)NoneSeveral (golang/protobuf, ristretto)Several
Approximate code size~2.9k lines~5k lines (bbolt)tens of thousands of linestens of thousands of lines
LicenseMITMITApache 2.0BSD 3-Clause

When each one fits

BoltDB / bbolt is the right choice when reads dominate, write throughput is moderate, and you want serialisable transactions across multiple keys for free. The B+ tree has excellent read latency and a tiny operational footprint (one file), and bbolt is the engine behind etcd and Consul. Where it struggles is heavy random-write workloads, because every update goes through the in-place B+ tree.

Badger is the right choice when values are large (its key-value separation keeps the LSM tree small and predictable) and when you need optimistic SSI transactions. It is the storage layer behind Dgraph and is hardened by production use.

Pebble is the right choice for a serious write-heavy workload on Go that needs LevelDB-compatible semantics. It descends from CockroachDB’s LevelDB port, has a decade of production hardening, supports multiple compaction strategies and a wide array of tunables.

lsmdb is not trying to compete with any of them at scale. It is the engine that fits in your head. The right choice for it is when you want to learn an LSM-tree by reading and modifying one, when you need a small embedded engine for a side project, when you want a teaching reference, or when you are writing a distributed layer on top of a local engine and you want the local part to be one you wrote yourself.

09Results and benchmarks

Measured on an Apple M3 Pro, Go 1.26.3, with go test -bench . -run '^$' -benchtime=2000x ./. These are the numbers go test printed on the author’s machine; nothing has been rounded.

BenchmarkResultWhat it measures
BenchmarkPutSync2.82 ms/opDurable write: WAL append plus fsync
BenchmarkGetMemTable10.6 us/opPoint read served from the MemTable
BenchmarkGetSSTable2.15 us/opPoint read through bloom filter and block index

The PutSync figure is fsync-bound, not engine-bound. It is the cost of forcing one record to durable storage on this laptop, dominated by the device’s fsync latency. The WAL keeps Append and Sync as separate calls precisely so an application that can tolerate a few milliseconds of loss on a crash can batch many writes behind one fsync and lift that number by orders of magnitude. The roadmap exposes that as a WriteBatch API.

The SSTable point read at 2.15 us shows the bloom filter and sparse index earning their place: a successful point read touches one filter probe, one binary search and one block. internal/bloom/bloom_test.go probes twenty thousand random keys and asserts the observed false-positive rate stays within bounds of the one-percent target. TestDurabilityAndRecovery in db_test.go writes two thousand keys, abandons the handle without Close, reopens and verifies every committed key is back.

10Conclusion

lsmdb demonstrates that a complete, correct LSM-tree storage engine fits in roughly 2,900 lines of Go using only the standard library. The complexity that production engines accumulate over the years is not, in the main, complexity of the data structure; it is complexity of the operational features layered on top: concurrency, batching, background compaction, multiple compaction strategies, pluggable comparators, column families, snapshots that pin storage, online schema changes, monitoring hooks. Stripping all of that away leaves a core that is small enough to read and modify in a sitting, and that core implements the durability contract, the MVCC model and the compaction policy with the same correctness as the production engines do.

For teaching, for portfolio work and as the local layer beneath a distributed system being built for understanding, that is the right point on the curve. For a production write-heavy workload, the production engines exist for good reasons and lsmdb does not pretend to replace them. The roadmap moves lsmdb closer to the production point without changing the data formats or the public surface, so each improvement is a drop-in upgrade.

AOptions

OptionDefaultPurpose
MemTableSize4 MiBByte threshold at which the active MemTable is frozen and flushed
BloomFalsePositiveRate0.01Target false-positive rate for per-table bloom filters
L0CompactionTrigger4Number of L0 tables that triggers a compaction into L1
LevelSizeMultiplier10How much larger each level is than the one above
DisableAutoCompactionfalseDisables automatic compaction; used by the test suite for determinism

BOperations checklist

  • Pick a data directory on a real disk. Network filesystems and tmpfs break the durability contract because their fsync semantics do not match a local disk.
  • Tune MemTableSize for your write rate. Larger tables mean fewer flushes but longer rotate stalls. Defaults are fine until they are not.
  • Keep snapshots short. Compaction reclaims old versions; a snapshot held across heavy writes can lose the versions it needed.
  • Watch L0 table count. A backlog of L0 tables raises read amplification. L0CompactionTrigger bounds it.
  • Run go test ./... against your fork. The shipped suite covers durability, recovery, torn tails, bloom-filter accuracy, compaction correctness and iteration.
  • Back up the data directory atomically. Take a filesystem-level snapshot or stop the process; copying live files mid-compaction can capture an inconsistent set.
  • Do not run two processes against one directory. The engine has no file lock yet; two writers will corrupt the manifest.