Architecture deep dive . forge-infer

How forge-infer works

An honest walk through the three systems that decide how fast an inference server runs: a block-paged KV-cache, an iteration-level batching scheduler, and a draft-then- verify speculative decoder. The model is deliberately the boring part.

15
subsystems
37
tests
~52%
accept
MIT
licence

Thirty-second orientation

A request arrives over HTTP, is tokenised, and joins a waiting queue. An engine loop runs one decode iteration at a time. The scheduler admits what fits, reserves KV blocks, preempts under pressure, runs a forward pass through the model, and retires finished sequences. Tokens stream back as Server-Sent Events. Three systems do the real work; the model is a one-method trait you swap out.

Module map

src/
  lib.rs              re-exports the public API
  main.rs             cargo run --release --bin forge-infer
  server.rs           axum, /generate and /v1/completions, SSE
  engine.rs           the loop that joins scheduler + model
  scheduler.rs        admission, preemption, retirement, StepPlan
  paged_cache.rs      PagedKVCache, BlockTable, transactional append
  speculative.rs      SpeculativeDecoder, rejection-sampling acceptance
  model.rs            Model trait, deterministic TinyTransformer
  tokeniser.rs        byte-level encode / decode
  bin/
    bench.rs          cargo run --release --bin forge-bench

The subsystems

Fifteen pieces, in roughly the order a request touches them. Each has a name, the file it lives in, why it exists, and what it actually does.

01.PagedKVCache: block allocation

src/paged_cache.rs

Why. A contiguous KV buffer per sequence wastes memory two ways: internal waste from oversized buffers, and external fragmentation as different-sized sequences come and go. Paging eliminates external fragmentation entirely and bounds internal waste to block_size minus one slots per sequence.

How. Memory is split into fixed-size blocks. Each sequence carries a BlockTable of the physical blocks holding its tokens. Blocks are allocated lazily, one at a time. The free list is a stack so freshly freed blocks are reused first. The default cache is 512 blocks of 16 token slots each.

02.Transactional append

src/paged_cache.rs (PagedKVCache::append)

Why. The scheduler needs to ask whether a step will fit and act on the answer atomically. If append could half-succeed it would leak blocks under pressure, exactly when the engine can least afford it.

How. append pre-checks the free count against blocks_needed_for. If the cache cannot grow the sequence it returns CacheError::OutOfBlocks { needed, free } without touching state. The sequence is unchanged, no blocks have moved, and the scheduler can preempt and retry safely. out_of_blocks_is_reported_and_leaves_state_intact pins this behaviour.

03.Lazy growth in detail

src/paged_cache.rs (blocks_needed_for)

Why. Pulling a block per token would waste allocator work. Pulling a block ahead would waste memory on sequences that finish early. Lazy growth pulls a new block only when the existing slack runs out.

How. slack = capacity - num_tokens, where capacity is blocks_held * block_size. If extra_tokens fits in slack, no new block is needed. Otherwise the additional blocks count is ceil((extra_tokens - slack) / block_size). The first token of a sequence pulls one block, the next fifteen reuse it, and the seventeenth pulls another.

04.Scheduler: admission phase

src/scheduler.rs (Scheduler::schedule)

Why. A continuous-batching engine wants to keep the running batch full. The admission loop is what lets a freshly arrived request join the very next decode iteration rather than waiting for a batch boundary.

How. While running.len() < max_batch_size and the next waiting request has tokens that fit, pull it in. Admission reserves blocks for the full current length, including any output a resumed sequence already produced. On a reservation failure the admit is rolled back via cache.free and admission stops for this step.

05.Scheduler: decode reservation

src/scheduler.rs (Scheduler::schedule)

Why. Every running sequence needs one more block this step. The scheduler has to decide before the forward pass whether the whole batch fits, so it can preempt deterministically and not partway through.

How. Sum blocks_needed_for(seq, 1) across the running batch. Compare to the free block count. If the batch fits, reserve one decode block per sequence and add it to plan.decode_batch. If it does not, drop into preemption.

06.Scheduler: preemption

src/scheduler.rs (Scheduler::schedule)

Why. When blocks run out, something has to give. The choice is which sequence to drop. Preempting the least-progressed sequence minimises the recompute cost when it resumes, the same intuition as shortest-remaining-time scheduling.

How. Pick the running sequence with the fewest output tokens, breaking ties towards the higher id for determinism. Free its blocks entirely via cache.free, set its state to Waiting, and push it back to the front of the waiting queue so it resumes before fresh requests. Repeat until the remaining batch fits. preempts_when_blocks_run_out and preempted_sequence_resumes_later pin this exact behaviour.

07.Why recompute, not swap

src/scheduler.rs (preemption rationale)

Why. vLLM offers two preemption paths: recompute and KV swap to host memory. forge-infer picks recompute. The reasoning is simple, the code is short, and the worst case is provably bounded.

How. When a sequence is preempted, its prompt plus output is re-prefilled when the cache can fit it again. No host memory pool, no copy path, no second eviction policy. Freeing any running sequence releases at least as many blocks as the next step needs, so the loop cannot deadlock. The cost is wasted recompute proportional to the preempted output length, which is why the policy preempts the least-progressed sequence.

08.Scheduler: retirement

src/scheduler.rs (Scheduler::schedule)

Why. Sequences end in two ways: they hit max_new_tokens, or the model emits eos. Both paths need to free blocks promptly so the next admission has room.

How. After the decode batch is committed, walk the running set and retire anything past its limit. Free its blocks via cache.free and list it in plan.finished. push_token(id, token, true) caps the sequence limit to current output length so the next schedule call retires it through the same path eos_retires_a_sequence_early checks.

09.StepPlan as a value

src/scheduler.rs (StepPlan)

Why. A tick() that schedules and runs the forward pass together is shorter to write but impossible to test without a model. The whole policy belongs to the scheduler; the model belongs below the seam.

How. schedule() returns a StepPlan describing what was admitted, what to decode, and what was retired. It mutates only the cache and the queues. The engine consumes the plan: for each id in plan.decode_batch it calls model.forward, takes argmax, and pushes the token back via push_token. admission_blocks_when_prompt_does_not_fit asserts the policy with no model in the test at all.

10.Engine loop

src/engine.rs (Engine::step)

Why. The engine is the thin layer that joins the scheduler to the model. It is intentionally boring; all the interesting decisions are above and below it.

How. Engine::step calls scheduler.schedule, iterates plan.decode_batch, calls model.forward over prompt plus output, takes argmax, checks against eos_token, and routes the token back through scheduler.push_token. The engine never decides what to run, and it never knows how the cache is laid out. It just connects them.

11.Speculative decoding: the idea

src/speculative.rs (SpeculativeDecoder::step)

Why. Autoregressive generation is latency bound: each token needs one target forward pass, and you cannot start token n+1 before you have token n. If a cheap draft model can guess the next few tokens, the expensive target can verify them in one shot.

How. Pair the target with a small draft model. The draft proposes a run of k tokens via draft_run. The target verifies the whole run in one forward pass. Every accepted draft token is a free token. A fully accepted round of k drafts emits k + 1 tokens for one target step: the k accepted plus the bonus token the target sampled after the run.

12.Speculative decoding: the acceptance test

src/speculative.rs (SpeculativeDecoder::step)

Why. The subtlety is keeping the output exact. Speculative decoding must be indistinguishable from plain sampling from the target. Without the rejection-sampling test it would be an approximation, not an optimisation.

How. For each drafted token t with draft probability q(t) and target probability p(t): accept with probability min(1, p(t)/q(t)). On the first rejection, stop, resample that one position from the target distribution, and discard the rest of the draft. The acceptance draw is a seeded function of the context, not a thread RNG, so the test speculative_output_matches_plain_target_decoding can assert the speculatively decoded stream equals plain greedy target decoding token for token.

13.Speculative decoding: failure modes

src/speculative.rs

Why. The loop still has to make progress when the draft is wrong. Three failure modes need explicit handling: a rejected proposal, a draft of probability zero, and an early eos token mid-run.

How. On rejection, the rejected position is resampled from the target distribution and the rest of the draft is discarded, so the round still emits at least one token (rejection_falls_back_to_a_target_token). On q(t) <= 0 the ratio is undefined; the code clamps accept_prob to 1, the correct limit, which keeps the exactness test passing. On accepted or bonus eos the round returns immediately and generate stops the stream.

14.Tokeniser

src/tokeniser.rs

Why. A serving stack needs a tokeniser. The forge-infer one is byte-level on purpose: every input is encodable, every output decodable, no out-of-vocabulary path, no dependency on a model-specific BPE.

How. encode maps bytes to TokenIds. decode maps TokenIds back. The codec is a pure function of its input, so the same prompt produces the same token stream and the same output bytes every time, which is what the determinism guarantees depend on.

15.HTTP surface

src/server.rs

Why. A serving binary has to be callable from the same client code people are already using. The OpenAI completions shape is the de facto interface; matching it lets existing OpenAI client code point base_url at forge-infer and call /v1/completions unchanged.

How. axum exposes /generate (native shape) and /v1/completions (OpenAI-compatible). Both accept prompt and max_tokens. /v1/completions adds stream true, which switches the response to Server-Sent Events: one data: chunk per emitted token until the final data: [DONE]. The address is configurable through FORGE_ADDR.

A worked preemption

The test preempts_when_blocks_run_out runs this sequence on a tiny cache of six blocks, block size one, so every token costs a block. It is the smallest case that forces the scheduler to make a real preemption decision.

rendering
Worked preemption on a 6-block cache: admit, decode, hit OutOfBlocks, evict the least progressed sequence, finish the leader, readmit the preempted one.

The tie-break towards the higher id is what makes step three pick sequence two rather than sequence one. Without it the test would still pass but a different sequence would be evicted run to run, which is exactly the kind of nondeterminism a teaching engine should not have.

Why speculative decoding is exact

The test speculative_output_matches_plain_target_decoding is the load-bearing one for the whole speculative path. With identical draft and target models it generates a sequence speculatively, then generates the same length with plain greedy target decoding, and asserts the two token streams are equal.

let (spec_tokens, _) = dec.generate(&[1, 2, 3, 4], 12);
// ... plain greedy decode from the target alone ...
assert_eq!(spec_tokens, plain, "speculation must be exact");

The acceptance probability min(1, p/q) is what makes that test pass. Anything else, for instance accepting whenever p > q, would bias the output away from the target distribution and would fail this assertion the moment p and q disagreed. The rejection-sampling formulation is not an implementation choice; it is the only one that keeps the output exact.

The wiki has the long-form pages: Architecture, Paged-KV-Cache, Continuous-Batching, Speculative-Decoding, Engine, Model-and-Tokeniser, HTTP-Server, plus reference and operations.