Research Explainer · Zheng et al. (2024)

Treat LLM calls like a program, and the cache starts doing real work

SGLang pairs a small Python DSL with a runtime that understands prompt structure, shared prefixes, and batching. That combination makes agent, reasoning, long-document, and vision workloads measurably faster, while also cutting the amount of glue code needed to build them.

30%
Agent workload gain
On the generative-agent benchmark, SGLang beat vLLM by 30% in both throughput and latency.
80%
First-token latency cut
Prefetching a hot prefix reduced first-token latency from 1.0 seconds to 0.2 seconds.
55%
Less orchestration code
A real article-analysis flow dropped from 206 lines with raw OpenAI APIs to 91 lines with SGLang.

Representative speedups from reusable prompt structure

Figure based on reported results in the paper: 1.3x on the generative-agent task, 1.2x and 2.9x on two long-document benchmarks, and 1.7x on LLaVA-Bench. These are representative workload gains, not a single uniform benchmark.

SGLang starts from a blunt observation: most serious LLM applications are not one completion; they are little programs. They branch, call the model several times, reuse old prompt text, and wait for one result before deciding what to do next. If your runtime only sees isolated requests, it misses the part that matters.

The language is embedded in Python, so you write ordinary code and add a few purpose-built primitives such as gen, select, fork, and join. The interpreter treats each prompt as an asynchronous stream, which means generation calls can run in the background and only synchronise when the program actually needs the answer. That gives you normal Python ergonomics without forcing the runtime to behave like a plodding request queue.
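
To make that concrete, here is a small sketch in the spirit of the paper's examples. The function names and prompt text are invented for illustration, and the exact frontend surface may differ between the paper's pseudocode and the released sglang package, where constrained selection is spelled as gen with a choices argument and reading a fork's captured variable performs the join.

import sglang as sgl

# Requires a running backend, e.g.:
# sgl.set_default_backend(sgl.RuntimeEndpoint("http://localhost:30000"))

@sgl.function
def tool_use(s, question):
    # += extends the prompt; plain strings and gen() calls compose freely.
    s += "Question: " + question + "\n"
    # Constrained selection: the model must return one of the choices.
    s += "Tool: " + sgl.gen("tool", choices=["calculator", "search engine"]) + "\n"
    # Open-ended generation, retrievable later by name.
    s += "Answer: " + sgl.gen("answer", max_tokens=64)

@sgl.function
def parallel_tips(s, topic):
    s += "Give two tips about " + topic + ".\n"
    forks = s.fork(2)  # both branches share, and reuse, the prefix above
    for i, f in enumerate(forks):
        f += f"Tip {i + 1}: " + sgl.gen("tip", max_tokens=48)
    # Reading a fork's captured text synchronises that branch: the join.
    s += "Tip 1: " + forks[0]["tip"] + "\nTip 2: " + forks[1]["tip"] + "\n"
    s += "Summary: " + sgl.gen("summary", max_tokens=48)

state = tool_use.run(question="What is 48 * 12?")
print(state["answer"])

The generation calls return before the model finishes; nothing blocks until the program reads state["answer"], which is the asynchronous-stream behaviour described above.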

The paper then adds a compiler and a specialised runtime. The compiler traces programs into an intermediate graph when control flow is static enough, and the runtime uses that structure to batch calls, share prefixes, and move KV cache state around before the model stalls waiting for it. This is language design with its sleeves rolled up.

The centrepiece is KV-cache reuse. SGLang stores prompt prefixes in a CPU-resident radix tree, keeps reference counts, evicts old entries with LRU logic, and schedules waiting requests by how much cached prefix they can reuse rather than simple first-come, first-served order. That sounds fussy. It is also where the speed comes from.
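
A toy sketch makes the bookkeeping concrete. Every name below is invented (PrefixCache, match_prefix, schedule), and the real tree compresses token runs into edges and maps nodes to GPU-resident KV tensors rather than holding data itself; this is the shape of the idea, not the implementation.

import time

class RadixNode:
    def __init__(self):
        self.children = {}      # next token id -> RadixNode
        self.ref_count = 0      # running requests pinned to this prefix
        self.last_access = 0.0  # timestamp for LRU eviction

class PrefixCache:
    # One node per token for clarity; eviction (not shown) walks leaves
    # in LRU order and skips any node with ref_count > 0.
    def __init__(self):
        self.root = RadixNode()

    def match_prefix(self, tokens):
        # Count leading tokens whose KV state is already cached.
        node, matched = self.root, 0
        for t in tokens:
            child = node.children.get(t)
            if child is None:
                break
            child.last_access = time.monotonic()
            node, matched = child, matched + 1
        return matched

    def insert(self, tokens):
        # Record a finished request's tokens so later calls can reuse them.
        node = self.root
        for t in tokens:
            node = node.children.setdefault(t, RadixNode())
            node.last_access = time.monotonic()

def schedule(waiting, cache):
    # Cache-aware order: longest reusable prefix first rather than FIFO,
    # so requests that hit the cache run while their prefix is still hot.
    return sorted(waiting, key=cache.match_prefix, reverse=True)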

On the generative-agent workload, that cache-aware approach made SGLang 30% better than vLLM in both throughput and latency, even though each simulation only made one model call. Long-document tasks showed 1.2x and 2.9x speedups, which is the sort of gain you only get when the runtime stops forgetting repeated prompt material like a goldfish and recomputing it from scratch. On LLaVA-Bench, the same ideas carried over to vision-language serving and improved throughput by 1.7x over llama.cpp on an M2 Ultra.

The compiler contributes as well. Code movement tries to rewrite prompt templates so more text becomes sharable prefix, and GPT-4 managed that correctly for 12 of 15 templates, adding about 60 tokens of reusable prefix on average. Prefetching then helps when a long prefix is predictably hot, cutting first-token latency from 1.0 seconds to 0.2 seconds. The forward pass still dominates the bill, but the runtime overhead for maintaining the radix tree stayed tiny in the authors' trace, about 0.07 seconds versus 17.6 seconds for model execution.
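
The goal of code movement is easy to show on a hypothetical template; in the paper, GPT-4 performs the rewrite automatically and the result is checked against the original. The template text below is invented.

import sglang as sgl

# Before: the variable article text comes first, so two requests share no
# common prefix and every call recomputes the instructions' KV state.
@sgl.function
def analyze_v1(s, article):
    s += article + "\n"
    s += "You are an expert editor. Summarise the article, then rate its clarity from 1 to 10.\n"
    s += sgl.gen("analysis", max_tokens=128)

# After code movement: the constant instructions lead, so every request
# starts with the same prefix and the radix tree caches it once.
@sgl.function
def analyze_v2(s, article):
    s += "You are an expert editor. Summarise the article, then rate its clarity from 1 to 10.\n"
    s += article + "\n"
    s += sgl.gen("analysis", max_tokens=128)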

The practical message is not merely “here is another LLM framework.” The paper argues that multi-call workflows need language and runtime co-design, because the performance-critical information sits in the structure of the whole program, not in any one prompt. Reasoning setups, agent loops, retrieval pipelines, and multimodal apps all benefit when the system can see repeated prefixes and plan around them.

There is also a developer productivity angle, and it is not decorative. The article-analysis case study dropped from 206 lines of OpenAI API code to 91 lines in SGLang, a 55% reduction. If you are building an LLM workflow with branches, selections, and reused context, fewer lines usually means fewer places to forget a state transition or duplicate a prompt fragment.

The limits are clear enough. The compiler cannot handle data-dependent control flow, so arbitrary Python logic falls back to the interpreter. Grammar-constrained decoding is not there yet, tokenisation boundaries can still produce artefacts, and multimodal support is only partway done. Still, the core claim lands: once you treat LLM use as a program rather than a chatty API wrapper, old systems ideas like tracing, scheduling, and cache locality become useful again.

Frontend language

SGLang gives Python a compact set of LLM-specific primitives, so prompt extension, branching, constrained selection, and parallel forks become first-class operations instead of string handling tricks.

Compiler path

When control flow is static enough to trace, the compiler turns programs into an IR graph and applies optimisations such as code movement and prefix prefetching to increase cache reuse.
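
A stripped-down sketch of the tracing idea, with invented class names: run the program once against a recording state whose appends become IR nodes. This only works while no branch depends on text the model has yet to generate, which is the static-control-flow restriction noted earlier.

class GenOp:
    # Placeholder for a generation call recorded during tracing.
    def __init__(self, name, **params):
        self.name, self.params = name, params

class TraceState:
    # Stands in for the prompt state: appends become IR nodes.
    def __init__(self):
        self.ops = []  # linear IR; a real graph also tracks fork/join edges

    def __iadd__(self, piece):
        if isinstance(piece, GenOp):
            self.ops.append(("gen", piece.name, piece.params))
        else:
            self.ops.append(("const", piece))
        return self

def trace(program, **placeholder_args):
    # Run the program once on a recording state to obtain its IR.
    s = TraceState()
    program(s, **placeholder_args)
    return s.ops

def two_step(s, question):
    s += "You are a careful assistant.\n"
    s += "Q: " + question + "\nA: "
    s += GenOp("answer", max_tokens=64)

ir = trace(two_step, question="<placeholder>")
# ir == [('const', 'You are a careful assistant.\n'),
#        ('const', 'Q: <placeholder>\nA: '),
#        ('gen', 'answer', {'max_tokens': 64})]

Once the whole graph is visible, optimisations such as code movement and prefetching become graph rewrites rather than guesses.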

SGVM runtime

The runtime uses RadixAttention, cache-aware scheduling, and a fused extend kernel so repeated prefixes survive across calls and batches instead of being recomputed every time.
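
For intuition, here is a dense NumPy stand-in for what an extend step computes; the real kernel is fused, paged, and GPU-resident, so treat this as toy maths only. Queries exist just for the new tokens, yet they attend over the cached prefix KV too, which is exactly why a cache hit saves work.

import numpy as np

def extend_attention(q_new, k_cached, v_cached, k_new, v_new):
    # q_new: (N, d) queries for N new tokens; the cached tensors cover the
    # P prefix tokens whose KV state the radix tree already holds.
    k = np.concatenate([k_cached, k_new])  # (P + N, d)
    v = np.concatenate([v_cached, v_new])
    P, N, d = k_cached.shape[0], q_new.shape[0], q_new.shape[-1]
    scores = q_new @ k.T / np.sqrt(d)      # (N, P + N)
    # Causal mask: new token i sees all P cached tokens plus new tokens <= i.
    mask = np.concatenate(
        [np.ones((N, P), dtype=bool), np.tril(np.ones((N, N), dtype=bool))],
        axis=1,
    )
    scores = np.where(mask, scores, -np.inf)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v                     # (N, d): outputs for new tokens only

# A single decode step is the N == 1 case; a cold prompt is the P == 0 case.
rng = np.random.default_rng(0)
P, N, d = 16, 4, 8
out = extend_attention(rng.normal(size=(N, d)), rng.normal(size=(P, d)),
                       rng.normal(size=(P, d)), rng.normal(size=(N, d)),
                       rng.normal(size=(N, d)))
assert out.shape == (N, d)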

Key Finding

SGLang matters because it treats an LLM workflow as a real program with reusable structure. That shift turns prefix sharing, batching, and compilation into practical performance tools, and the measured gains are large enough to matter to both the person paying for inference and the person writing the code.

APA Reference

Zheng, L., Yin, L., Xie, Z., Sun, C., Huang, J., Yu, C. H., Cao, S., Kozyrakis, C., Stoica, I., Gonzalez, J. E., Barrett, C., & Sheng, Y. (2024). SGLang: Efficient execution of structured language model programs. arXiv. https://doi.org/10.48550/arXiv.2312.07104