forge-infer
A minimal LLM inference server in Rust with a real paged KV-cache, continuous batching and speculative decoding. The serving stack read end to end, with the model deliberately deterministic so the policy can be verified without a GPU.
Abstract
Modern LLM inference engines such as vLLM and TGI implement three interlocking systems that decide how fast a server runs: a paged KV-cache, an iteration-level batching scheduler, and a speculative decoder. The published papers sketch each idea in a handful of diagrams. The real implementations bury them under tens of thousands of lines of CUDA and Python, where the scheduling logic is tangled up with kernel launches and memory pools. forge-infer is the middle: a faithful implementation of all three systems, in Rust, with a deliberately trivial deterministic model below the seam so the policy itself can be read, tested and benchmarked on any laptop. Thirty-seven tests pin the behaviour, including a token-for-token exactness test that proves the speculative output matches plain greedy target decoding.
1. Executive Summary
forge-infer is an open-source, MIT-licensed inference server that demonstrates how modern serving stacks work without pretending to be a production engine. It exposes an axum HTTP layer with an OpenAI-compatible /v1/completions endpoint and a native /generate endpoint, both streamed over Server-Sent Events. Above the model trait sit three concrete subsystems: PagedKVCache, Scheduler and SpeculativeDecoder. Below it sits a deterministic TinyTransformer that exists solely to make the policy verifiable. The Model trait has four methods. Implementing it against real weights is the supported path for serving real text.
- Three serving algorithms implemented in full, with the awkward cases handled.
- One Model trait, four methods, swap-in path for real weights.
- Thirty-seven tests, deterministic, no GPU required.
- OpenAI-compatible HTTP surface with SSE streaming.
- MIT licensed. Built by Sarma.
2. Background and Motivation
The serving side of LLMs has matured into a small but interlocking set of techniques. PagedAttention from vLLM showed that managing the KV-cache as fixed-size blocks with per-sequence block tables removes external fragmentation entirely and bounds internal waste. Continuous batching from Orca and the OSDI line of work showed that making a scheduling decision per decode iteration, rather than per batch, keeps the engine saturated even when individual sequences finish at wildly different times. Speculative decoding from Leviathan and Chen showed that pairing a target with a cheap draft and verifying its proposals with a rejection-sampling test reduces the average target steps per emitted token while leaving the output distribution exact.
The papers are short and the ideas are clean. The implementations are not. vLLM is tens of thousands of lines of CUDA and Python where the policy and the kernels share files. TGI is similar. Tutorials that promise to explain the ideas tend to mock out the case that actually matters. They call paging a block table and never evict anything. They batch identical-length requests and call it continuous. The cases that decide whether a real engine stays up under load never appear in the teaching material.
3. The Problem
There is a gap between the literature and the production engines. An engineer who wants to understand how a real serving stack behaves cannot read the papers alone, because the papers do not describe the failure cases. They cannot read vLLM either, because the CUDA and the bookkeeping are interleaved in a way that needs deep prior knowledge of both. The gap is what stops people contributing to the engines they depend on. It is the same gap that lets a vendor present a serving config as opaque magic instead of a small number of decisions with names.
forge-infer is one attempt to close that gap. The serving systems are real, named the way the literature names them, and tested for the awkward cases. The model is the part that gets faked, on purpose, so a reader spends their time on policy rather than on getting CUDA to compile.
4. Goals and Non-goals
Goals
- Implement a paged KV-cache, an iteration-level batching scheduler and a rejection-sampling speculative decoder in full.
- Keep the model trivial and deterministic so the policy is reproducible to the bit and testable without a GPU.
- Expose an OpenAI-compatible HTTP surface so existing client code points at the server unchanged.
- Keep the codebase readable in one afternoon. Every named subsystem has a file, a set of tests, and a page in the wiki.
- Make the swap-in path for real weights one trait with four methods, no other changes anywhere else in the stack.
Non-goals
- Multi-GPU or distributed serving. The bookkeeping would bury the policy.
- A model zoo. Anyone with weights writes a Model implementation; that is the point of the seam.
- A web UI. The repository is about three serving algorithms read end to end. Scope creep would bury the thing it exists to teach.
- Top-p, top-k, temperature on the server path. Greedy argmax is what the speculative exactness proof rests on; sampling would arrive later, behind a flag, with a new proof.
5. Architecture
The architecture is one engine loop and one seam. The HTTP layer is axum, the tokeniser is byte-level, and the engine consumes a StepPlan from the scheduler each iteration. The scheduler owns admission, preemption and retirement. The cache owns block allocation and the block tables. The speculative decoder is a separate path that wraps a target Model with a draft Model for single-stream decoding.
The split between schedule() returning a plan and the engine running the model is load-bearing. It is what lets the policy be unit-tested with no forward pass at all. Tests such as admission_blocks_when_prompt_does_not_fit and preempts_when_blocks_run_out exercise the scheduler against a cache and a queue, with no model in the test.
6. Key Design Decisions
6.1 Recompute-based preemption, not KV swap
vLLM offers both recompute and swap. Recompute drops a sequence under pressure and re-prefills it later. Swap copies its KV blocks to host memory and back. Swap saves the recompute but adds a copy path, a host memory pool and a second eviction policy. forge-infer picks recompute. It is a dozen lines, it never deadlocks because freeing any running sequence releases at least the blocks the next step needs, and it makes the cost of preemption legible. Swap is a latency optimisation that only pays off once the model is the expensive part; for a teaching engine it would double the cache surface area without earning its keep.
6.2 A deterministic hash model below the seam
A real tiny transformer in candle or tch would let the engine produce prose. It would also introduce a multi-second compile, a heavy dependency tree, and a forward pass that is not bit-stable between instances. The speculative acceptance test needs p(t)/q(t) reproducible to the bit. A deterministic hash is a pure function of its context, which is the property the cache, scheduler and decoder all need to be verified deterministically.
6.3 StepPlan as a value
The tempting shape is a single tick() method that schedules and runs the forward pass together. forge-infer splits them. schedule() mutates only the cache and the queues and hands back a StepPlan describing what to run. The engine consumes the plan and calls the model. The cost is one extra indirection in the engine loop. The benefit is that admission, preemption and retirement are testable without any model.
6.4 Least-progressed preemption
When blocks run out, the scheduler preempts the running sequence with the fewest output tokens, breaking ties towards the higher id. This is shortest-remaining-time applied to recompute cost: a sequence with two output tokens loses two on resume; a sequence with two hundred would lose two hundred. The tie-break exists because without it the choice would not be reproducible across runs, which is exactly what a teaching engine should not have.
6.5 Byte-level tokeniser
A byte-level codec means every input is encodable and every output decodable. There is no out-of-vocabulary path, and the tokeniser does not depend on a model-specific BPE merge file. For the serving stack the tokeniser is just an injective pair of functions; the choice of byte level keeps it that way.
6.6 OpenAI completions shape
The /v1/completions endpoint matches the OpenAI completions request and response shape, including stream true and SSE deltas terminated with [DONE]. Existing OpenAI client code points base_url at the server and works without modification. The native /generate endpoint sits next to it for callers that do not need the compatibility layer.
7. Implementation Notes
The implementation is intentionally small. PagedKVCache is a HashMap of BlockTable values keyed by SeqId, plus a free list represented as a Vec used as a stack so freshly freed blocks are reused first. blocks_needed_for is a pure function over a BlockTable and an extra token count; the scheduler asks it before calling append. append pre-checks the free count and returns OutOfBlocks { needed, free } without mutating anything if the cache cannot grow the sequence.
Scheduler::schedule runs four phases: admission, decode-block reservation, preemption, and commit/retire. Each phase mutates the cache and the queues in a way that any of the others can examine afterwards. The returned StepPlan is a plain struct of vectors, never a closure, so the engine can consume it without holding any borrow into the scheduler.
SpeculativeDecoder::step drives the draft to produce a run, calls the target on the resulting context, computes accept_prob = min(1.0, p / q), and compares it against a deterministic acceptance_draw seeded from the context and position. The seeded draw is what makes the exactness test assertable; in a real engine the draw would come from a thread RNG, but the maths is identical.
The HTTP layer reads a prompt and max_tokens, encodes the prompt, runs the engine loop, and pushes decoded tokens out as SSE deltas. The native /generate path returns a single JSON body. The OpenAI-compatible /v1/completions path returns either a single body or an SSE stream depending on the stream field.
8. Results
Measured with cargo run --release --bin forge-bench on an Apple M3 Pro (macOS 25.3, Rust 1.96), workload fixed at sixty-four requests, sixteen-token prompts and sixty-four max new tokens. The deterministic model is near-free, so these figures isolate the cost of the serving machinery rather than model compute. That is deliberate; it is the only honest thing this benchmark can measure without a GPU.
| Strategy | Tokens/sec | ms/request | Notes |
|---|---|---|---|
| sequential (batch 1) | ~1.88M | 0.031 | static-batching baseline, fresh engine per request |
| continuous batching (batch 16) | ~2.07M | 0.028 | sixty-four requests share one engine loop |
| speculative (lookahead 4) | ~0.59M | 0.10 | acceptance 52%, output exact |
Read these for shape, not magnitude. With a near-free model the continuous-versus- sequential gap (around 1.1x here) is mostly scheduler overhead; the win only grows large once the model forward pass is the expensive part, because keeping the batch full then dwarfs the bookkeeping. The figure that carries across to a real model is the 52% acceptance rate. Just over half the draft proposals are reused without a target recompute, and each accepted token skips a target step. Numbers drift a few percent run to run with machine load.
The reproducibility claim is verified by the test speculative_output_matches_plain_target_decoding, which generates a sequence speculatively, generates the same length with plain greedy target decoding, and asserts the two token streams are equal.
9. Comparisons
Comparisons here are about scope and shape, not raw throughput. forge-infer is a CPU-resident teaching engine; vLLM and TGI are production GPU engines. Comparing tokens-per-second between them would be dishonest. The useful comparison is what each project commits to and what it deliberately does not.
9.1 forge-infer vs vLLM
vLLM is the canonical PagedAttention implementation. It includes a paged KV-cache, continuous batching, speculative decoding, prefix sharing, CUDA kernels, multi-GPU tensor parallelism, and a much richer scheduler. forge-infer implements the first three, with recompute-based preemption rather than the swap variant vLLM also offers, and a deterministic stand-in model rather than CUDA kernels. The systems are recognisably the same; the surface area is two orders of magnitude smaller.
9.2 forge-infer vs TGI
TGI exposes a serving API on top of an attention implementation in Rust and Python. It supports a broader model zoo and richer sampling. forge-infer is narrower: Rust-only, one trait between the engine and the model, no model zoo. The point of difference is the seam, not the breadth.
9.3 forge-infer vs a tutorial
A typical PagedAttention tutorial sketches the cache and then never evicts. A typical continuous-batching tutorial batches identical-length requests. forge-infer runs the awkward cases: out-of-blocks preemption, resume, the rejection-sampling acceptance test. The tests pin them.
10. Trade-offs and Lessons
The biggest single decision is the deterministic model. It is the right call for this project, but it costs the ability to read prose out without writing a Model implementation. The next biggest is recompute over swap: it costs latency under heavy preemption, but it earns simplicity and provable non-deadlock. The HTTP layer today spins up a fresh Engine per request over a shared Arc<dyn Model>; that keeps the example readable but means concurrent requests do not yet share one long-lived engine loop on the server side. The continuous-batching path is exercised by the benchmark, which drives sixty-four requests through a single engine, but a shared server-side engine loop is on the roadmap.
The lesson from writing this is that the scheduling policy and the cache policy are much smaller than they look when you read the production engines. The bulk of the code in a real engine is kernels, ops fusion, and accelerator-specific bookkeeping. Pulling the policy out and putting it under tests is the move that makes the rest of the engine intelligible.
11. Conclusion
forge-infer is the engine I wish I had when I was learning how vLLM-style serving works. The serving stack is the deliverable. The model is the placeholder. The trait between them is the seam. Anyone who wants real text out implements forward against their own weights; nothing else in the stack changes. The repository is on GitHub under MIT, the wiki has long-form pages per subsystem, and the tests are the spec.
Appendix A . Configuration
Defaults are intended to match common production settings closely enough to be instructive. The two cache knobs are num_blocks and block_size; the server uses 512 blocks of 16 in default_state. Larger blocks mean fewer allocator operations but more internal fragmentation; smaller blocks mean tighter packing but more bookkeeping. The scheduler knob is max_batch_size. The speculative knob is lookahead; the benchmark uses four, a common production default. The HTTP listen address is set through FORGE_ADDR.
# environment FORGE_ADDR=127.0.0.1:8080 # axum listen address (default) # in code (Scheduler::new, PagedKVCache::new, SpeculativeDecoder::new) max_batch_size: 16 num_blocks: 512 block_size: 16 lookahead: 4
Appendix B . Swap-in Checklist
The supported path to real text is implementing the Model trait against your own weights. The trait has four methods. The rest of the stack does not change.
- Implement Model::forward over the prompt-plus-output context for a single sequence. The required output is a StepLogits value; argmax over it produces the next token. For a first pass, recomputing from the full context per call is fine; the cache lookup path is an optimisation, not a correctness requirement.
- Implement vocab_size, num_layers, eos_token. The scheduler uses eos_token to retire sequences early through the same path as the max-new-tokens limit.
- Wire the cache through. A real backend would attend over the physical blocks PagedKVCache::block_table hands it rather than recomputing the prefix on every step. The scheduler does not need to change for that to work; the change is inside forward.
- Decide on sampling. The server path is greedy argmax today. Sampling (temperature, top-p, top-k) would be added inside the forward consumer in Engine::step. The speculative path stays exact under greedy; sampling under speculation needs its own proof and is deliberately out of scope until then.
- Keep the tests honest. The thirty-seven existing tests pin behaviour over the deterministic model. When you swap the model, add a determinism test of your own and keep the speculative exactness assertion intact for the configurations you actually serve.