lsmdb, an LSM-tree you can actually read
A log-structured merge-tree storage engine in pure Go. Write-ahead log with CRC framing, skip-list MemTable, block-based SSTables with per-table bloom filters, levelled compaction, MVCC snapshots over a 56-bit sequence.
Small enough to read end to end in an afternoon. Real enough to stress with millions of writes. The Go standard library and one go.mod with no require block.
Why this exists
I have read the LevelDB and RocksDB source more times than I can count, and every time I came away with the same feeling: I understood the diagram but not the code. The papers explain the idea of an LSM-tree in a page. The production engines bury that idea under twenty years of flags, formats and compaction strategies.
I wanted one I could hold in my head, where I could point at the exact lines that make a write durable, that decide which table a read touches, that drop a tombstone at the right moment and not a moment sooner. So I wrote lsmdb top to bottom with the Go standard library and nothing else.
Every hard part is implemented for real: the WAL with CRC framing and torn-tail recovery, the block-based table format with a sparse index and a per-table bloom filter, the heap-based merging iterator, levelled compaction with correct tombstone handling, and MVCC snapshots over monotonic sequence numbers. It is a portfolio engine and a teaching engine, not a RocksDB replacement, and I have tried to be honest about that line throughout.
A complete LSM-tree in ~2,900 lines
Production LSM engines carry twenty years of flags, formats and compaction strategies on top of an idea the paper covers in a page. lsmdb strips back to the core that defines the data structure: WAL with torn-tail recovery, block-based SSTable with a sparse index and a per-table bloom filter, heap-based merging iterator, levelled compaction with correct tombstone handling, MVCC over monotonic sequences. Every box on every paper diagram corresponds to a real Go file you can open.
What is in there
No part is faked or stubbed. Each piece corresponds to a real file and a real test.
Write-ahead log with CRC framing
Every mutation is appended to the WAL and fsynced before Put returns. Records are length-prefixed and CRC32C-checked (Castagnoli). A torn trailing record from a crash is detected by short read or CRC mismatch and dropped on replay, so an uncommitted write never resurrects.
Skip-list MemTable
The in-memory write buffer is a probabilistic skip list with O(log n) insert and search. A single writer appends while concurrent readers walk forward, which is the access pattern the engine actually needs. Lives in internal/skiplist and internal/memtable.
Block-based SSTable format
Immutable on-disk tables with a sparse block index, a properties block, an 8-byte magic footer and a per-table bloom filter. Whole keys are stored in data blocks, not prefix-compressed, which keeps the block reader a straight scan with no restart-point bookkeeping.
Bloom filter, one percent target
Per-table double-hashing bloom filter sized from m/n = -ln(p)/ln(2)^2 with a default false-positive rate of 0.01. A point read that misses the filter skips the table entirely. The test suite holds the observed false-positive rate within bounds across twenty thousand probes.
Levelled compaction, newest wins
L0 holds overlapping MemTable flushes. L1 through L6 are disjoint and each ten times larger than the one above. pickCompaction triggers on L0 table count or per-level byte budget, runCompaction merges newest-wins through a heap-based iterator and drops tombstones only at the bottom level.
MVCC snapshots over a 56-bit sequence
Every key is stored as an internal key: user key followed by an 8-byte trailer packing a 56-bit monotonic sequence in the high bits and a Kind byte in the low bits. A Snapshot captures lastSeq; reads through it see only versions with seq less than or equal to the snapshot.
Append-only manifest, atomic swap
manifest.go writes newline-delimited JSON edits. A compaction records both the added outputs and the deleted inputs in one fsynced edit, so the swap is atomic against a crash. The manifest is small, human-readable and the natural commit point for every change to the level layout.
Heap-based merging iterator
iterator.go runs a min-heap over every sorted source (active MemTable, immutable MemTable, every L0 table, every L1 through L6 table the read touches). The first source that holds any version of a key, live or tombstone, decides the result, because newer always shadows older.
Standard library only
No external dependencies. crypto, encoding/binary, hash/crc32, os, sync, sort. go build ./... produces a static binary with no surprises. The whole engine sits in one package plus internal/ and you can read it in an afternoon.
Wiki goes byte-deep
25 wiki pages cover the SSTable byte layout, the WAL framing, the bloom-filter sizing, the internal-key trailer, the compaction state machine, the manifest format and the recovery walkthrough. Every page references real Go file paths and struct names.
Write path, end to end
WAL fsync, MemTable, flush to L0 SSTable, manifest commit, compaction.
Tech stack
The source tree
One package at the root, four subsystems under internal/. No external dependencies.
lsmdb/ db.go Open, Close, Put, Delete, Get, write path compaction.go pickCompaction, runCompaction, overlap policy iterator.go mergingIterator over every sorted source public_iterator.go Snapshot, Iterator, snapshot-filtered reads manifest.go append-only edit log, atomic commit point record.go WAL record encoding internal/wal/ Writer, Reader, CRC32C framing internal/skiplist/ Skip-list primitives internal/memtable/ MVCC-aware MemTable internal/sstable/ writer.go, reader.go, format.go, block-based table internal/bloom/ Double-hashing bloom filter internal/encoding/ InternalKey, MakeInternalKey, CompareInternal cmd/lsmdb-demo/ End-to-end demo binary
Quick start
Seven methods. Open, Close, Put, Delete, Get, NewIterator, Snapshot.
git clone https://github.com/sarmakska/lsmdb.git cd lsmdb go build ./... go test ./... go run ./cmd/lsmdb-demo # end-to-end demo
db, _ := lsmdb.Open("./data", lsmdb.Options{})
db.Put([]byte("greeting"), []byte("hello"))
snap := db.Snapshot() // freeze a view
db.Put([]byte("greeting"), []byte("updated"))
live, _ := db.Get([]byte("greeting")) // updated
old, _ := snap.Get([]byte("greeting")) // helloUse cases
Who this is actually for.
Reference engine for an interview or whiteboard
The whole engine is small enough that you can point at the exact lines that make a write durable, decide which table a read touches, or drop a tombstone at the right moment. That is the production-engine internals interview, with code you can read.
Embedded key-value store for a Go service
Open a directory, get a DB. Put, Delete, Get, NewIterator, Snapshot. A successful Put is durable across a crash. No server, no network, no daemon. Suitable for a CLI cache, a local-first app, or a test harness that wants real persistence.
Teaching the LSM-tree top to bottom
Every paper diagram corresponds to a real file: write path in db.go, compaction in compaction.go, levels in the same, internal keys in internal/encoding, the WAL framing in internal/wal. Run the test suite and watch every box light up.
Foundation for raftkv, the distributed sibling
lsmdb is the local state machine that the raftkv project replicates with Raft. Two repos, one for the engine, one for the consensus layer; both small enough to hold in your head at the same time.
Read it. Fork it. Learn it.
MIT licensed. Standard library only. Roadmap includes batched writes behind one fsync, background flush and compaction, snapshot pinning, and an orphaned-file sweep on open.