How it works · lsmdb

How lsmdb works

The storage layout, the write path, the compaction policy and the MVCC isolation model, all grounded in the real Go code that runs in the repo. No hand-waving over the parts that matter.

View source code Product page Whitepaper

TL;DR

Write goes to the log.
Read walks the levels.
Compaction tidies up.

A Put appends to the WAL, fsyncs, then lands in the skip-list MemTable. When the MemTable hits MemTableSize (4 MiB default) it freezes, flushes to an L0 SSTable, and a fresh WAL and MemTable take over.

A Get walks newest first: active MemTable, immutable MemTable, every L0 table (overlapping), then a single binary-searched table per disjoint level L1 through L6. The first source that holds any version of the key decides the result.

Compaction merges L0 into L1 when L0 grows past four tables, and one oversized table down each deeper level. Newest version of a user key wins; tombstones drop only at the bottom level.

Put("greeting", "hello") 1 · encodeRecord(seq, KindSet, key, value) 2 · wal.Append(rec) crc32 + len framed 3 · wal.Sync() fsync, barrier 4 · mem.Add(seq, kind, k, v) 5 · if size >= 4MiB → flush Get("greeting") 1 · mem.Get(key, MaxSequence) 2 · imm.Get(key, MaxSequence) 3 · L0 newest first (overlapping) 4 · L1..L6 one table each (disjoint) 5 · first source that has any version decides 2.15us per SSTable point read on M3 Pro

Storage layout

The level hierarchy

L0 holds overlapping MemTable flushes. L1 through L6 are disjoint, each ten times larger than the one above.

./data/
  MANIFEST           append-only JSON edit log, the atomic commit point
  000007.log         active WAL, fsynced before every Put returns
  000004.sst         L0 SSTable, may overlap with other L0 tables
  000005.sst         L0 SSTable
  000009.sst         L1 SSTable, disjoint with every other L1 table
  000010.sst         L1 SSTable

rendering

Level hierarchy: L0 overlapping, L1 and below disjoint, each level ten times the size of the one above.

Write path

From Put to durable

// db.go, the durability barrier behind every acknowledged Put.
func (db *DB) write(kind encoding.Kind, key, value []byte) error {
    db.mu.Lock()
    defer db.mu.Unlock()
    if db.closed {
        return ErrClosed
    }

    seq := db.lastSeq + 1

    // Append to the WAL and fsync before touching memory or acknowledging.
    // If the process dies after this returns, recovery replays the record.
    rec := encodeRecord(seq, kind, key, value)
    if err := db.log.Append(rec); err != nil {
        return err
    }
    if err := db.log.Sync(); err != nil {
        return err
    }

    db.lastSeq = seq
    db.mem.Add(seq, kind, key, value)

    if db.mem.ApproximateSize() >= db.opts.MemTableSize {
        if err := db.rotateMemtableLocked(); err != nil {
            return err
        }
    }
    return nil
}

The WAL record framing is [crc32 4B little-endian | len 4B little-endian | payload]. CRC is Castagnoli, the same polynomial RocksDB uses. wal.Writer.Sync flushes the bufio writer and calls f.Sync, which is the fsync barrier. wal.Reader.Next treats a short read or a CRC mismatch as io.EOF, so a torn trailing record from a crash is dropped on replay instead of returning as corrupt data.

MVCC isolation model

Sequences, trailers and order

// internal/encoding/encoding.go
// Trailer packs a 56-bit sequence in the high bits and the Kind in the low 8.
// Stored big-endian so a plain bytewise comparison orders keys correctly.
const MaxSequence uint64 = (1 << 56) - 1

func PackTrailer(seq uint64, kind Kind) uint64 {
    return (seq << 8) | uint64(kind)
}

func MakeInternalKey(userKey []byte, seq uint64, kind Kind) InternalKey {
    ik := make([]byte, len(userKey)+8)
    copy(ik, userKey)
    binary.BigEndian.PutUint64(ik[len(userKey):], PackTrailer(seq, kind))
    return ik
}

// User keys ascending, then trailer descending so the newest version sorts first.
func CompareInternal(a, b InternalKey) int {
    ua, ub := a.UserKey(), b.UserKey()
    if c := compareBytes(ua, ub); c != 0 {
        return c
    }
    ta, tb := a.Trailer(), b.Trailer()
    switch {
    case ta > tb:
        return -1
    case ta < tb:
        return 1
    default:
        return 0
    }
}

Every user key gets paired with a monotonic 56-bit sequence number and a Kind byte. Within one user key the higher sequence sorts first, so a merging iterator naturally lands on the newest version. A Snapshot captures lastSeq, and the read path filters out every version with sequence higher than the snapshot. That is the entire isolation story: snapshot reads see exactly the writes committed before the snapshot, and the engine guarantees this until compaction reclaims the older versions.

// public_iterator.go
type Snapshot struct {
    db  *DB
    seq uint64
}

func (db *DB) Snapshot() *Snapshot {
    db.mu.RLock()
    defer db.mu.RUnlock()
    return &Snapshot{db: db, seq: db.lastSeq}
}

func (s *Snapshot) Get(key []byte) ([]byte, error) {
    return s.db.getAt(key, s.seq)
}

// db.getAt walks sources newest first. The first source that holds any version
// of the key, live or tombstone, decides the result.
func (db *DB) getAt(key []byte, snap uint64) ([]byte, error) {
    db.mu.RLock()
    defer db.mu.RUnlock()
    if v, found, ok := db.mem.Get(key, snap); ok { return finish(v, found) }
    if db.imm != nil {
        if v, found, ok := db.imm.Get(key, snap); ok { return finish(v, found) }
    }
    for i := len(db.levels[0]) - 1; i >= 0; i-- {
        if v, found, ok := db.levels[0][i].Get(key, snap); ok { return finish(v, found) }
    }
    for lvl := 1; lvl < numLevels; lvl++ {
        r := db.findTable(db.levels[lvl], key)
        if r == nil { continue }
        if v, found, ok := r.Get(key, snap); ok { return finish(v, found) }
    }
    return nil, ErrNotFound
}

Compaction policy

Newest wins, tombstones drop at the bottom

// compaction.go
// runCompaction merges inputs into one or more outputs on the target level.
// Tombstone handling: a deletion is dropped only when the compaction reaches
// the bottom level, because no older version can exist below it.
for ; merged.Valid(); merged.Next() {
    key := merged.Key()
    uk := key.UserKey()

    // Skip superseded versions: the merging iterator yields the newest version
    // of a user key first, so any later entry with the same user key is older.
    if haveLast && encoding.CompareBytes(uk, lastUserKey) == 0 {
        continue
    }
    lastUserKey = append(lastUserKey[:0], uk...)
    haveLast = true

    // Drop a tombstone outright when this is the bottom level.
    if key.Kind() == encoding.KindDelete && isBottom {
        continue
    }

    if writer == nil {
        if err := openWriter(); err != nil { return err }
    }
    if err := writer.Add(key, merged.Value()); err != nil { return err }
    if writer.Count() >= compactionMaxEntries {
        if err := closeWriter(); err != nil { return err }
    }
}

A torn record from a crash either short-reads or fails the CRC. Either way the reader stops at the last fully durable record. CRC32C is fast and the polynomial RocksDB uses.

Why not the alternative

No CRC. A torn payload of the right length replays as live data on next open, which is a silent corruption bug.

Performance and observability

What you can measure

2.82ms

BenchmarkPutSync

fsync-bound, not engine-bound

10.6us

BenchmarkGetMemTable

Skip-list point lookup

2.15us

BenchmarkGetSSTable

Bloom + binary search + block

Failure modes you should expect

Process killed between Put returning and a flush

Cause: MemTable is lost, WAL is on disk

Fix: recoverLog replays every *.log file in order, rebuilds the MemTable, then flushes it to an L0 table before normal operation resumes (db.go lines 153 to 199)

Process killed mid-WAL-write

Cause: Trailing record is short or CRC fails

Fix: wal.Reader.Next returns io.EOF on the torn tail, replay stops cleanly at the last durable record. TestRecoveryDropsTornTail proves it

Process killed mid-compaction

Cause: Some output .sst files written but the manifest edit was not fsynced

Fix: On open, the manifest replay does not include the half-written outputs, so they are simply unreferenced. Orphaned-file sweep on open is on the roadmap to delete them

Snapshot held across heavy write volume

Cause: Compaction reclaims the versions the snapshot needed

Fix: Keep snapshots short, or set DisableAutoCompaction while a long snapshot is open. Snapshot pinning (retained minimum sequence) is a roadmap item

Read amplification rising over time

Cause: L0 has accumulated near the trigger threshold

Fix: L0CompactionTrigger defaults to 4, lower it for read-heavy workloads or raise MemTableSize to flush less often

Bloom filter false-positive rate higher than expected

Cause: BloomFalsePositiveRate is per-table not per-key, the global rate can drift slightly under skewed workloads

Fix: Lower BloomFalsePositiveRate from 0.01 toward 0.001, doubles filter size but halves wasted block reads

Roadmap

What is next

WriteBatch behind one fsync

Batch many writes behind a single Sync to lift throughput by orders of magnitude when an application can tolerate a few milliseconds of loss on a crash.

Background flush and compaction

Move flush and compaction onto a dedicated goroutine once a concurrency test harness is in place to catch any regression in the recovery semantics.

Snapshot pinning

A retained minimum sequence that compaction must not drop below. Lets a long-lived snapshot survive heavy write volume without losing its versions.

Orphaned-file sweep on open

Delete unreferenced .sst files left by a crash mid-compaction. The manifest already records the live set, so anything not in it is safe to remove.

Wider bench coverage

Sustained mixed-workload benches with read/write ratios, plus latency histograms not just averages.

Optional prefix compression

Restart-point encoding inside data blocks, behind an Options flag, for workloads with long shared key prefixes where disk cost matters.

Ready to read the code?

Roughly 2,900 lines of Go and the standard library. An afternoon end to end. Start at db.go, follow into compaction.go, then internal/wal and internal/sstable.

View on GitHub Read whitepaper Product page Read the wiki Hire me to build with it

How lsmdb works

Write goes to the log.Read walks the levels.Compaction tidies up.