forge-infer
A minimal LLM inference server with a real paged KV-cache, continuous batching and speculative decoding. Readable in one afternoon, testable without a GPU.
The serving stack is the deliverable. The model is a deterministic placeholder you swap for real weights by implementing one trait with four methods.
trait Model {
    forward(ctx)
    vocab_size()
    num_layers()
    eos_token()
}
Three interlocking systems, named the way the literature names them.
Why this matters
vLLM is the canonical PagedAttention implementation, in tens of thousands of lines of CUDA and Python. The PagedAttention paper is six diagrams. forge-infer is the middle: a faithful implementation of the paged KV-cache, continuous batching and speculative decoding, in Rust, with the awkward cases handled and the model deliberately a deterministic hash so the policy itself can be read, tested and benchmarked on any laptop.
Why this exists
I kept hitting the same wall when I tried to learn how vLLM-style serving works. The PagedAttention paper sketches the idea in a few diagrams. The real engines implement it under tens of thousands of lines of CUDA and Python glue, where the scheduling logic is tangled up with kernel launches and memory pools.
The tutorials that promise to explain it quietly mock out the bit that matters. They call paging a block table and then never evict anything. They batch requests that all happen to be the same length. The interesting cases, the ones that decide whether a real engine stays up, never appear.
forge-infer is the middle. The cache allocator, the scheduler and the speculative decoder are written for real, with the awkward cases handled. The model is deliberately a deterministic hash so the systems above the seam can be read, tested and benchmarked without a single CUDA kernel in sight. The serving stack is the deliverable. The model is a placeholder you swap out by implementing one trait.
The seam, in four lines
The entire serving stack hangs off this trait. Everything above it is the engine. Everything below it is the model.
pub trait Model: Send + Sync {
    fn forward(&self, context: &[TokenId]) -> StepLogits;
    fn vocab_size(&self) -> usize;
    fn num_layers(&self) -> usize;
    fn eos_token(&self) -> TokenId;
}
Built-in features
Every component a real engine has, named the way the literature names it. Only the model itself is deliberately faked.
Real paged KV-cache
Memory is split into fixed-size blocks, and each sequence carries a block table of the physical blocks it owns. External fragmentation disappears entirely; internal waste is bounded by block_size minus one slots per sequence. Append is transactional: on failure it returns OutOfBlocks without touching state, so the scheduler can preempt and retry safely.
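The block-table idea can be sketched in a few dozen lines. This is a minimal illustration, not forge-infer's actual allocator: the names PagedCache, SeqId, append, release and wasted_slots are assumptions made for the example, and only block ownership is tracked, not tensor data. The one name taken from the text is OutOfBlocks.

```rust
use std::collections::HashMap;

/// Returned when an append needs a fresh block and the pool is empty.
#[derive(Debug, PartialEq, Eq)]
pub struct OutOfBlocks;

pub type SeqId = u64; // hypothetical sequence identifier

/// Hypothetical paged cache: tracks which physical blocks each sequence owns.
pub struct PagedCache {
    block_size: usize,
    free: Vec<usize>,                   // ids of free physical blocks
    tables: HashMap<SeqId, Vec<usize>>, // per-sequence block table
    lens: HashMap<SeqId, usize>,        // tokens stored per sequence
}

impl PagedCache {
    pub fn new(num_blocks: usize, block_size: usize) -> Self {
        assert!(block_size > 0);
        Self {
            block_size,
            free: (0..num_blocks).rev().collect(),
            tables: HashMap::new(),
            lens: HashMap::new(),
        }
    }

    pub fn free_blocks(&self) -> usize {
        self.free.len()
    }

    pub fn block_table(&self, seq: SeqId) -> &[usize] {
        self.tables.get(&seq).map(|t| t.as_slice()).unwrap_or(&[])
    }

    /// Reserve one token slot for `seq`. Either succeeds or changes nothing.
    pub fn append(&mut self, seq: SeqId) -> Result<(), OutOfBlocks> {
        let len = self.lens.get(&seq).copied().unwrap_or(0);
        // A new block is needed only when the last block is full (or absent).
        if len % self.block_size == 0 {
            // The only fallible step runs before any state is modified.
            let block = self.free.pop().ok_or(OutOfBlocks)?;
            self.tables.entry(seq).or_default().push(block);
        }
        self.lens.insert(seq, len + 1);
        Ok(())
    }

    /// Return every block owned by `seq` to the free pool.
    pub fn release(&mut self, seq: SeqId) {
        if let Some(blocks) = self.tables.remove(&seq) {
            self.free.extend(blocks);
        }
        self.lens.remove(&seq);
    }

    /// Slots allocated but unused: at most block_size - 1 once a token is stored.
    pub fn wasted_slots(&self, seq: SeqId) -> usize {
        let len = self.lens.get(&seq).copied().unwrap_or(0);
        self.block_table(seq).len() * self.block_size - len
    }
}
```

Because the free-list pop is the only step that can fail and it happens first, a caller that sees OutOfBlocks can release another sequence and call append again with no cleanup in between.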
Continuous batching every iteration
A scheduling decision per decode iteration, not per batch. Each call admits what fits, reserves one block per running sequence, preempts the least-progressed sequences when blocks run out, runs the decode batch, and retires anything that hit eos or its token limit. Recompute-based preemption never deadlocks.
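One iteration of that loop might look like the sketch below. It is a simplified model under stated assumptions, not the project's scheduler: Scheduler, Plan, Seq, submit, plan and commit are hypothetical names, blocks are counted rather than allocated, a sequence only takes a new block when it crosses a block boundary, a preempted sequence keeps its token count and gives up its blocks (standing in for recompute), and eos is ignored because no model is involved.

```rust
use std::collections::VecDeque;

#[derive(Debug, Clone)]
struct Seq {
    id: u64,
    prompt_len: usize,
    generated: usize, // tokens decoded so far; kept across preemption
    max_new: usize,
    blocks: usize,    // physical blocks currently held
}

/// What one iteration should run. A plain value: no model is called here.
#[derive(Debug, Default, PartialEq, Eq)]
pub struct Plan {
    pub decode: Vec<u64>,
    pub preempted: Vec<u64>,
}

pub struct Scheduler {
    block_size: usize,
    free_blocks: usize,
    max_batch: usize,
    waiting: VecDeque<Seq>,
    running: Vec<Seq>,
}

impl Scheduler {
    pub fn new(num_blocks: usize, block_size: usize, max_batch: usize) -> Self {
        Self { block_size, free_blocks: num_blocks, max_batch,
               waiting: VecDeque::new(), running: Vec::new() }
    }

    pub fn submit(&mut self, id: u64, prompt_len: usize, max_new: usize) {
        self.waiting.push_back(Seq { id, prompt_len, generated: 0, max_new, blocks: 0 });
    }

    pub fn free_blocks(&self) -> usize {
        self.free_blocks
    }

    fn blocks_for(&self, tokens: usize) -> usize {
        (tokens + self.block_size - 1) / self.block_size
    }

    pub fn plan(&mut self) -> Plan {
        let mut plan = Plan::default();
        // 1. Admit waiting sequences in FIFO order while their context fits.
        while self.running.len() < self.max_batch {
            let need = match self.waiting.front() {
                Some(s) => self.blocks_for(s.prompt_len + s.generated),
                None => break,
            };
            if need > self.free_blocks {
                break;
            }
            let mut seq = self.waiting.pop_front().expect("front was Some");
            self.free_blocks -= need;
            seq.blocks = need;
            self.running.push(seq);
        }
        // 2. Reserve room for one more token per running sequence,
        //    preempting the least-progressed sequence when blocks run out.
        let mut i = 0;
        while i < self.running.len() {
            let s = &self.running[i];
            let need = self.blocks_for(s.prompt_len + s.generated + 1) - s.blocks;
            if need <= self.free_blocks {
                self.free_blocks -= need;
                self.running[i].blocks += need;
                i += 1;
                continue;
            }
            let victim = self.running.iter().enumerate()
                .min_by_key(|(_, s)| s.generated)
                .map(|(j, _)| j)
                .expect("running is non-empty");
            let v = self.running.remove(victim);
            self.free_blocks += v.blocks;
            plan.preempted.push(v.id);
            self.waiting.push_front(Seq { blocks: 0, ..v });
            if victim < i {
                i -= 1; // the list shifted left under the cursor
            }
        }
        plan.decode = self.running.iter().map(|s| s.id).collect();
        plan
    }

    /// Record one decoded token per running sequence; retire finished ones.
    pub fn commit(&mut self) -> Vec<u64> {
        let mut finished = Vec::new();
        let mut kept = Vec::new();
        for mut s in self.running.drain(..) {
            s.generated += 1;
            if s.generated >= s.max_new {
                self.free_blocks += s.blocks;
                finished.push(s.id);
            } else {
                kept.push(s);
            }
        }
        self.running = kept;
        finished
    }
}
```

Every preemption removes one sequence from the running set, so a single call to plan always terminates. Returning the decision as a value also means admission and preemption can be asserted without a forward pass.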
Speculative decoding, output exact
A cheap draft proposes k tokens; the target verifies the run and accepts each with probability min(1, p/q), where p and q are the target and draft probabilities of the proposed token, resampling on the first rejection. The output is provably identical to plain target decoding, pinned by a token-for-token test. A fully accepted round of four drafts emits five tokens for one target step.
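The verification step can be sketched as follows, using the standard speculative-sampling rule: accept with probability min(1, p/q), and on the first rejection resample from the normalised positive part of p minus q. The function names, the slice-of-distributions layout and the tiny SplitMix64 generator are assumptions made so the example needs no crates; forge-infer's own decoder may be organised differently.

```rust
/// Tiny deterministic generator (SplitMix64), used only to keep the sketch crate-free.
pub struct Rng(pub u64);

impl Rng {
    /// Uniform value in [0, 1).
    pub fn next_f64(&mut self) -> f64 {
        self.0 = self.0.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.0;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^= z >> 31;
        (z >> 11) as f64 / (1u64 << 53) as f64
    }
}

/// Sample an index from non-negative weights (they need not sum to one).
pub fn sample(weights: &[f64], rng: &mut Rng) -> usize {
    let total: f64 = weights.iter().sum();
    let mut u = rng.next_f64() * total;
    for (i, &w) in weights.iter().enumerate() {
        if u < w {
            return i;
        }
        u -= w;
    }
    // Rounding fallback: the last index with positive weight.
    weights.iter().rposition(|&w| w > 0.0).unwrap_or(0)
}

/// Verify `draft` tokens against the target.
/// `q[i]` is the draft distribution that produced `draft[i]`;
/// `p[i]` is the target distribution at the same position, and
/// `p[draft.len()]` is the target distribution after a fully accepted run.
pub fn verify(draft: &[usize], q: &[Vec<f64>], p: &[Vec<f64>], rng: &mut Rng) -> Vec<usize> {
    assert_eq!(q.len(), draft.len());
    assert_eq!(p.len(), draft.len() + 1);
    let mut out = Vec::with_capacity(draft.len() + 1);
    for (i, &tok) in draft.iter().enumerate() {
        let (pt, qt) = (p[i][tok], q[i][tok]);
        // u < min(1, p/q) is equivalent to u * q < p, because u < 1.
        if rng.next_f64() * qt < pt {
            out.push(tok);
            continue;
        }
        // First rejection: resample from max(0, p - q) and stop the round.
        let residual: Vec<f64> = p[i]
            .iter()
            .zip(&q[i])
            .map(|(&a, &b)| (a - b).max(0.0))
            .collect();
        out.push(sample(&residual, rng));
        return out;
    }
    // Every draft accepted: one bonus token from the target.
    out.push(sample(&p[draft.len()], rng));
    out
}
```

With greedy decoding both distributions are one-hot, so the rule collapses to "accept while the draft matches the target's argmax, then emit the target's token", which is why the output can match plain target decoding token for token.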
One four-line Model trait
The entire serving stack hangs off a single trait with forward, vocab_size, num_layers and eos_token. Above the seam sit the three systems that decide how fast an inference server runs. Below it sits a model. Swap the model and the rest does not change.
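To make the seam concrete, here is a sketch of a deterministic hash model behind that trait. The trait is copied from the source; everything else is assumed for illustration. In particular, TokenId as u32 and StepLogits as a wrapper around Vec<f32> are guesses at types the page does not show, and HashModel and greedy are hypothetical names, not the project's own implementation.

```rust
pub type TokenId = u32;              // assumption: the real alias is not shown
pub struct StepLogits(pub Vec<f32>); // assumption: one logit per vocabulary entry

pub trait Model: Send + Sync {
    fn forward(&self, context: &[TokenId]) -> StepLogits;
    fn vocab_size(&self) -> usize;
    fn num_layers(&self) -> usize;
    fn eos_token(&self) -> TokenId;
}

/// Hypothetical placeholder model: logits are a pure function of the context.
/// `vocab` is assumed to be greater than zero.
pub struct HashModel {
    pub vocab: usize,
}

impl Model for HashModel {
    fn forward(&self, context: &[TokenId]) -> StepLogits {
        // FNV-1a over the context bytes gives one 64-bit state per context.
        let mut h: u64 = 0xcbf2_9ce4_8422_2325;
        for &t in context {
            for &b in t.to_le_bytes().iter() {
                h ^= u64::from(b);
                h = h.wrapping_mul(0x0000_0100_0000_01b3);
            }
        }
        // Mix the state with each token id to get a pseudo-logit in [0, 1).
        let logits: Vec<f32> = (0..self.vocab as u64)
            .map(|i| {
                let mut x = h ^ i.wrapping_mul(0x9E37_79B9_7F4A_7C15);
                x ^= x >> 33;
                x = x.wrapping_mul(0xff51_afd7_ed55_8ccd);
                x ^= x >> 33;
                (x >> 40) as f32 / (1u32 << 24) as f32
            })
            .collect();
        StepLogits(logits)
    }
    fn vocab_size(&self) -> usize { self.vocab }
    fn num_layers(&self) -> usize { 1 }
    fn eos_token(&self) -> TokenId { 0 }
}

/// Greedy decoding against any Model; stops at eos or after `max_new` tokens.
pub fn greedy(model: &dyn Model, prompt: &[TokenId], max_new: usize) -> Vec<TokenId> {
    let mut ctx = prompt.to_vec();
    let mut out = Vec::new();
    for _ in 0..max_new {
        let StepLogits(logits) = model.forward(&ctx);
        let mut best = 0usize;
        for (i, &v) in logits.iter().enumerate() {
            if v > logits[best] {
                best = i;
            }
        }
        let tok = best as TokenId;
        if tok == model.eos_token() {
            break;
        }
        ctx.push(tok);
        out.push(tok);
    }
    out
}
```

Two separate HashModel instances return identical logits for the same context, which is the property the bit-exact tests rely on. A real backend would replace only the body of forward.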
axum HTTP surface
A native /generate endpoint and an OpenAI-compatible /v1/completions. Existing OpenAI client code points base_url at forge-infer and calls /v1/completions unchanged, including with stream set to true.
SSE streaming
Decoded tokens stream back as Server-Sent Events deltas terminated with [DONE], so clients render output progressively instead of waiting for the whole completion.
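On the wire, Server-Sent Events are text frames of the form "data: payload" separated by a blank line. The sketch below shows a client-side reader that collects payloads until the [DONE] sentinel. The payload contents are an assumption: the page does not show the JSON shape forge-infer emits, so the example treats each payload as an opaque string, and sse_frame and sse_payloads are hypothetical helper names.

```rust
/// Wrap one payload as a single SSE event.
pub fn sse_frame(payload: &str) -> String {
    format!("data: {}\n\n", payload)
}

/// Collect event payloads from a complete SSE body, stopping at "[DONE]".
pub fn sse_payloads(body: &str) -> Vec<String> {
    let mut out = Vec::new();
    // Events are separated by a blank line.
    for event in body.split("\n\n") {
        // An event may carry several data lines; they are joined with newlines.
        let data: Vec<&str> = event
            .lines()
            .filter_map(|line| line.strip_prefix("data:"))
            .map(|d| d.strip_prefix(' ').unwrap_or(d))
            .collect();
        if data.is_empty() {
            continue;
        }
        let payload = data.join("\n");
        if payload == "[DONE]" {
            break;
        }
        out.push(payload);
    }
    out
}
```

A streaming client would apply the same rule incrementally, rendering each payload as it arrives and closing the connection on the sentinel.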
Determinism by construction
The model is a deterministic hash, on purpose, so cache, scheduler and decoder can be verified bit-stably without a GPU. The acceptance test in speculative decoding needs p(t)/q(t) reproducible to the bit; floating-point attention across two model instances is not stable enough to assert on.
StepPlan as a value
The scheduler returns a StepPlan describing what to run rather than calling the model directly. That split is what makes preempts_when_blocks_run_out and admission_blocks_when_prompt_does_not_fit assertable with no model in the test at all.
Built-in benchmark binary
cargo run --release --bin forge-bench prints a throughput table across sequential, continuous batching and speculative strategies. The figures isolate the cost of the serving machinery rather than model compute, which is the only honest thing the benchmark can measure on a CPU.
Runs anywhere Rust does
No CUDA, no Python glue, no heavy dependency tree. cargo test finishes in seconds, and cargo run --release gets to a serving binary in under a second. The whole point is that you can read it on a laptop and run it on a laptop.
Architecture sketch
One engine loop. One scheduler decision per decode iteration. One model call per iteration. SSE deltas out.
Quick start
Clone, test, serve. No GPU, no Python, no model download.
git clone https://github.com/sarmakska/forge-infer
cd forge-infer
cargo test                              # 37 tests across the serving stack
cargo run --release --bin forge-infer   # serves 127.0.0.1:8080

# native shape
curl -s localhost:8080/generate \
  -d '{"prompt":"hello forge","max_tokens":24}'

# OpenAI-compatible, streamed over SSE
curl -sN localhost:8080/v1/completions \
  -d '{"prompt":"stream me","max_tokens":20,"stream":true}'

# print the throughput table on your own hardware
cargo run --release --bin forge-bench

# change the listen address
FORGE_ADDR=0.0.0.0:9090 cargo run --release --bin forge-infer
What the benchmark measures
Apple M3 Pro, 64 requests, sixteen-token prompts, sixty-four max new tokens. The deterministic model is near-free, so these figures isolate the cost of the serving machinery rather than model compute. Read them for shape, not magnitude.
| Strategy | Tokens/sec | ms/request | Notes |
|---|---|---|---|
| sequential (batch 1) | ~1.88M | 0.031 | static-batching baseline, fresh engine per request |
| continuous batching (batch 16) | ~2.07M | 0.028 | sixty-four requests share one engine loop |
| speculative (lookahead 4) | ~0.59M | 0.10 | acceptance 52%, output exact |
The figure that carries across to a real model is the 52% acceptance rate. Just over half the draft's proposals are reused without a target recompute, and each accepted token skips a target step.
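As a rough guide to what an acceptance rate buys, assume each draft token is accepted independently with probability α, and read the 52% figure as that per-token rate. Both are idealisations for illustration; the benchmark does not state how its figure is computed. With lookahead k, a round emits the accepted prefix plus one token from the target, so the expected number of tokens per target step is:

```latex
\mathbb{E}[\text{tokens per target step}]
  = \sum_{i=0}^{k} \alpha^{i}
  = \frac{1 - \alpha^{k+1}}{1 - \alpha},
\qquad
\alpha = 0.52,\ k = 4:\quad
\frac{1 - 0.52^{5}}{0.48} \approx \frac{0.962}{0.48} \approx 2.0
```

Under those assumptions a target step yields about two tokens instead of one. At α = 1 the sum is k + 1, which matches the five tokens from a fully accepted round of four drafts.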
Use cases
What people actually use this for.
Learn how vLLM-style serving actually works
The PagedAttention paper sketches the idea in diagrams. Real engines bury it under tens of thousands of lines of CUDA and Python glue. forge-infer is the middle: the allocator, scheduler and decoder for real, with the awkward cases handled and the model deliberately trivial.
Prototype a serving policy against a swap-in model
Implement the four-line Model trait against a candle backend or a remote API and the cache, batcher and speculator above the seam are unchanged. Use forge-infer as the harness while you iterate on what the model does.
Teach an internal team paged attention
cargo build finishes in a couple of seconds. The tests never flake. The scheduler is unit-testable with no forward pass. A workshop that would have needed a GPU room runs on any laptop with Rust installed.
A reference for design reviews
When you are debating recompute versus swap, lookahead values, or block sizes for your real engine, point at code rather than slides. Every trade-off has a sentence in the design-decisions section and a test pinning the chosen behaviour.
Roads I picked, and the ones I did not
Recompute over swap
vLLM can swap KV blocks to host memory under pressure. forge-infer recomputes instead. Swap saves the recompute but adds a copy path, a host pool and a second eviction policy. Recompute is a dozen lines, never deadlocks, and makes the cost of preemption legible. Swap is a latency optimisation that only pays off once the model is real.
Deterministic hash model
A real tiny transformer in candle would let the engine produce prose, at the cost of a multi-second compile, a heavy dependency tree, and a model that is not bit-stable between instances. The speculative acceptance test needs p(t)/q(t) reproducible to the bit. A pure function of the context is what the cache, scheduler and decoder need to be verified.
StepPlan as a value
The scheduler hands back a description of what to run, not a tick that also runs the model. That separation is what makes admission, preemption and retirement assertable with no forward pass in the test at all. The cost is one indirection in the engine loop on every iteration, paid in exchange for a testable policy.
What this is not
forge-infer is not a production inference server and it is not pretending to be one. Stated plainly:
- No real weights. The model generates a reproducible byte stream, not language. Implement Model::forward against your own weights for prose out.
- No shared server-side engine yet. The HTTP layer spins up a fresh Engine per request over a shared Arc<dyn Model>. The continuous-batching path is exercised by the benchmark, which drives sixty-four requests through one engine.
- No carried KV state. A real backend would attend over the physical blocks the cache hands it; the demo model recomputes from the full context each forward.
- Greedy argmax only. No temperature, top-p or top-k sampling on the server path. Speculative decoding still uses the standard rejection-sampling test, so the exactness guarantee holds.
Read it. Fork it. Swap the model.
A shared server-side engine loop and prefix sharing across sequences with copy-on-write block tables are on the roadmap. A candle feature flag behind the Model trait is the natural next step for real text out.