How it works · lsmdb

How lsmdb works

The storage layout, the write path, the compaction policy and the MVCC isolation model, all grounded in the real Go code that runs in the repo. No hand-waving over the parts that matter.

TL;DR

Write goes to the log.
Read walks the levels.
Compaction tidies up.

A Put appends to the WAL, fsyncs, then lands in the skip-list MemTable. When the MemTable hits MemTableSize (4 MiB default) it freezes, flushes to an L0 SSTable, and a fresh WAL and MemTable take over.

A Get walks newest first: active MemTable, immutable MemTable, every L0 table (overlapping), then a single binary-searched table per disjoint level L1 through L6. The first source that holds any version of the key decides the result.

Compaction merges L0 into L1 when L0 grows past four tables, and one oversized table down each deeper level. Newest version of a user key wins; tombstones drop only at the bottom level.

<span class="dim">Put("greeting", "hello")</span> <span class="hl">1 ·</span> encodeRecord(seq, KindSet, key, value) <span class="hl">2 ·</span> wal.Append(rec) <span class="dim">crc32 + len framed</span> <span class="hl">3 ·</span> wal.Sync() <span class="ok">fsync, barrier</span> <span class="hl">4 ·</span> mem.Add(seq, kind, k, v) <span class="hl">5 ·</span> if size &gt;= 4MiB → flush <span class="dim">Get("greeting")</span> <span class="hl">1 ·</span> mem.Get(key, MaxSequence) <span class="hl">2 ·</span> imm.Get(key, MaxSequence) <span class="hl">3 ·</span> L0 newest first (overlapping) <span class="hl">4 ·</span> L1..L6 one table each (disjoint) <span class="hl">5 ·</span> first source that has any version <span class="ok">decides</span> <span class="ok">2.15us per SSTable point read on M3 Pro</span>
Storage layout

The level hierarchy

L0 holds overlapping MemTable flushes. L1 through L6 are disjoint, each ten times larger than the one above.

./data/
  MANIFEST           append-only JSON edit log, the atomic commit point
  000007.log         active WAL, fsynced before every Put returns
  000004.sst         L0 SSTable, may overlap with other L0 tables
  000005.sst         L0 SSTable
  000009.sst         L1 SSTable, disjoint with every other L1 table
  000010.sst         L1 SSTable
rendering
Level hierarchy: L0 overlapping, L1 and below disjoint, each level ten times the size of the one above.
Write path

From Put to durable

// db.go, the durability barrier behind every acknowledged Put.
func (db *DB) write(kind encoding.Kind, key, value []byte) error {
    db.mu.Lock()
    defer db.mu.Unlock()
    if db.closed {
        return ErrClosed
    }

    seq := db.lastSeq + 1

    // Append to the WAL and fsync before touching memory or acknowledging.
    // If the process dies after this returns, recovery replays the record.
    rec := encodeRecord(seq, kind, key, value)
    if err := db.log.Append(rec); err != nil {
        return err
    }
    if err := db.log.Sync(); err != nil {
        return err
    }

    db.lastSeq = seq
    db.mem.Add(seq, kind, key, value)

    if db.mem.ApproximateSize() >= db.opts.MemTableSize {
        if err := db.rotateMemtableLocked(); err != nil {
            return err
        }
    }
    return nil
}

The WAL record framing is [crc32 4B little-endian | len 4B little-endian | payload]. CRC is Castagnoli, the same polynomial RocksDB uses. wal.Writer.Sync flushes the bufio writer and calls f.Sync, which is the fsync barrier. wal.Reader.Next treats a short read or a CRC mismatch as io.EOF, so a torn trailing record from a crash is dropped on replay instead of returning as corrupt data.

MVCC isolation model

Sequences, trailers and order

// internal/encoding/encoding.go
// Trailer packs a 56-bit sequence in the high bits and the Kind in the low 8.
// Stored big-endian so a plain bytewise comparison orders keys correctly.
const MaxSequence uint64 = (1 << 56) - 1

func PackTrailer(seq uint64, kind Kind) uint64 {
    return (seq << 8) | uint64(kind)
}

func MakeInternalKey(userKey []byte, seq uint64, kind Kind) InternalKey {
    ik := make([]byte, len(userKey)+8)
    copy(ik, userKey)
    binary.BigEndian.PutUint64(ik[len(userKey):], PackTrailer(seq, kind))
    return ik
}

// User keys ascending, then trailer descending so the newest version sorts first.
func CompareInternal(a, b InternalKey) int {
    ua, ub := a.UserKey(), b.UserKey()
    if c := compareBytes(ua, ub); c != 0 {
        return c
    }
    ta, tb := a.Trailer(), b.Trailer()
    switch {
    case ta > tb:
        return -1
    case ta < tb:
        return 1
    default:
        return 0
    }
}

Every user key gets paired with a monotonic 56-bit sequence number and a Kind byte. Within one user key the higher sequence sorts first, so a merging iterator naturally lands on the newest version. A Snapshot captures lastSeq, and the read path filters out every version with sequence higher than the snapshot. That is the entire isolation story: snapshot reads see exactly the writes committed before the snapshot, and the engine guarantees this until compaction reclaims the older versions.

// public_iterator.go
type Snapshot struct {
    db  *DB
    seq uint64
}

func (db *DB) Snapshot() *Snapshot {
    db.mu.RLock()
    defer db.mu.RUnlock()
    return &Snapshot{db: db, seq: db.lastSeq}
}

func (s *Snapshot) Get(key []byte) ([]byte, error) {
    return s.db.getAt(key, s.seq)
}

// db.getAt walks sources newest first. The first source that holds any version
// of the key, live or tombstone, decides the result.
func (db *DB) getAt(key []byte, snap uint64) ([]byte, error) {
    db.mu.RLock()
    defer db.mu.RUnlock()
    if v, found, ok := db.mem.Get(key, snap); ok { return finish(v, found) }
    if db.imm != nil {
        if v, found, ok := db.imm.Get(key, snap); ok { return finish(v, found) }
    }
    for i := len(db.levels[0]) - 1; i >= 0; i-- {
        if v, found, ok := db.levels[0][i].Get(key, snap); ok { return finish(v, found) }
    }
    for lvl := 1; lvl < numLevels; lvl++ {
        r := db.findTable(db.levels[lvl], key)
        if r == nil { continue }
        if v, found, ok := r.Get(key, snap); ok { return finish(v, found) }
    }
    return nil, ErrNotFound
}
Compaction policy

Newest wins, tombstones drop at the bottom

// compaction.go
// runCompaction merges inputs into one or more outputs on the target level.
// Tombstone handling: a deletion is dropped only when the compaction reaches
// the bottom level, because no older version can exist below it.
for ; merged.Valid(); merged.Next() {
    key := merged.Key()
    uk := key.UserKey()

    // Skip superseded versions: the merging iterator yields the newest version
    // of a user key first, so any later entry with the same user key is older.
    if haveLast && encoding.CompareBytes(uk, lastUserKey) == 0 {
        continue
    }
    lastUserKey = append(lastUserKey[:0], uk...)
    haveLast = true

    // Drop a tombstone outright when this is the bottom level.
    if key.Kind() == encoding.KindDelete && isBottom {
        continue
    }

    if writer == nil {
        if err := openWriter(); err != nil { return err }
    }
    if err := writer.Add(key, merged.Value()); err != nil { return err }
    if writer.Count() >= compactionMaxEntries {
        if err := closeWriter(); err != nil { return err }
    }
}

pickCompaction triggers L0 to L1 when L0 grows to L0CompactionTrigger tables (default 4) and merges every L0 table with the overlapping L1 tables. Below L0 it walks levels, multiplying the byte budget by LevelSizeMultiplier (default 10) at each step, and compacts the first oversized table down. The output is split into multiple tables once a writer hits compactionMaxEntries (100,000) so individual tables stay small enough to map and to compact again later.

Design decisions

Why this, not that

Inline flush and compaction under the write lock

Why we use it

Deterministic ordering makes durability and recovery semantics trivial to reason about and to test. No concurrent mutation of the level layout to race against, every test is deterministic.

Why not the alternative

Background goroutine. The textbook design, lifts write throughput, but a regression in the recovery semantics is much harder to catch without a concurrency test harness I trust.

Append-only JSON manifest

Why we use it

Each edit is a few hundred bytes. One fsync per compaction. You can cat the manifest and read the history of the level layout. Adding outputs and deleting inputs in one edit is the atomic swap.

Why not the alternative

Rewritten snapshot file. Simpler to read back, but each compaction rewrites the whole list, and a rename is not the only thing that has to be atomic across a crash.

Whole keys in data blocks

Why we use it

Block reader is a straight scan with no restart-point bookkeeping. Format stays describable in a paragraph. Prefix compression is a clean later addition that does not touch the index or footer.

Why not the alternative

LevelDB-style prefix compression. Saves disk, complicates the reader, hard to debug.

Skip list, not B-tree or sorted slice

Why we use it

O(log n) insert and search with a structure that allows a single writer plus concurrent readers. That is exactly the MemTable access pattern.

Why not the alternative

Sorted slice, O(n) insert, fatal for a write buffer. Balanced B-tree, right complexity, but rebalancing makes lock-free reads hard.

CRC32C plus length framing for the WAL

Why we use it

A torn record from a crash either short-reads or fails the CRC. Either way the reader stops at the last fully durable record. CRC32C is fast and the polynomial RocksDB uses.

Why not the alternative

No CRC. A torn payload of the right length replays as live data on next open, which is a silent corruption bug.

Performance and observability

What you can measure

2.82ms
BenchmarkPutSync
fsync-bound, not engine-bound
10.6us
BenchmarkGetMemTable
Skip-list point lookup
2.15us
BenchmarkGetSSTable
Bloom + binary search + block

Failure modes you should expect

Process killed between Put returning and a flush
Cause: MemTable is lost, WAL is on disk
Fix: recoverLog replays every *.log file in order, rebuilds the MemTable, then flushes it to an L0 table before normal operation resumes (db.go lines 153 to 199)
Process killed mid-WAL-write
Cause: Trailing record is short or CRC fails
Fix: wal.Reader.Next returns io.EOF on the torn tail, replay stops cleanly at the last durable record. TestRecoveryDropsTornTail proves it
Process killed mid-compaction
Cause: Some output .sst files written but the manifest edit was not fsynced
Fix: On open, the manifest replay does not include the half-written outputs, so they are simply unreferenced. Orphaned-file sweep on open is on the roadmap to delete them
Snapshot held across heavy write volume
Cause: Compaction reclaims the versions the snapshot needed
Fix: Keep snapshots short, or set DisableAutoCompaction while a long snapshot is open. Snapshot pinning (retained minimum sequence) is a roadmap item
Read amplification rising over time
Cause: L0 has accumulated near the trigger threshold
Fix: L0CompactionTrigger defaults to 4, lower it for read-heavy workloads or raise MemTableSize to flush less often
Bloom filter false-positive rate higher than expected
Cause: BloomFalsePositiveRate is per-table not per-key, the global rate can drift slightly under skewed workloads
Fix: Lower BloomFalsePositiveRate from 0.01 toward 0.001, doubles filter size but halves wasted block reads
Roadmap

What is next

WriteBatch behind one fsync

Batch many writes behind a single Sync to lift throughput by orders of magnitude when an application can tolerate a few milliseconds of loss on a crash.

Background flush and compaction

Move flush and compaction onto a dedicated goroutine once a concurrency test harness is in place to catch any regression in the recovery semantics.

Snapshot pinning

A retained minimum sequence that compaction must not drop below. Lets a long-lived snapshot survive heavy write volume without losing its versions.

Orphaned-file sweep on open

Delete unreferenced .sst files left by a crash mid-compaction. The manifest already records the live set, so anything not in it is safe to remove.

Wider bench coverage

Sustained mixed-workload benches with read/write ratios, plus latency histograms not just averages.

Optional prefix compression

Restart-point encoding inside data blocks, behind an Options flag, for workloads with long shared key prefixes where disk cost matters.

Ready to read the code?

Roughly 2,900 lines of Go and the standard library. An afternoon end to end. Start at db.go, follow into compaction.go, then internal/wal and internal/sstable.