Serving a large language model involves two distinct phases. Prefill processes the input prompt in parallel, a compute-intensive burst that benefits from high GPU utilisation. Decode generates tokens one at a time, which is memory-bandwidth-bound and benefits from low occupancy to minimise queuing latency. Because these profiles conflict, the standard solution is prefill-decode (PD) disaggregation: dedicate separate GPU pools to each phase.
The approach works well for throughput, but it carries a structural inefficiency that prior work largely ignored. During decode, each active request occupies a KV-cache slot on the decode instance. As the server fills up, new requests must wait for slots to free, creating a storage shortage condition. The paper's key measurement is stark: in standard PD disaggregation, prefill instances sit at roughly 89% idle KV-cache capacity while decode instances are the bottleneck. The hardware split is approximately right; the utilisation split is badly wrong.
When decode instances hit storage shortage, latency degrades non-linearly. The paper's measurements show time-per-output-token (TPOT) holding stable at low shortage levels, then climbing sharply past roughly 30% shortage and reaching multiples of its baseline value before the system saturates entirely. This is the latency cliff that semi-PD is designed to avoid.
Semi-PD serving breaks the strict assignment between GPU pools. Prefill instances still handle all prefill work, but they are permitted to accept short decode requests during their idle periods. The routing decision is made at the scheduler level: when a decode instance approaches storage shortage, incoming short requests are deflected to prefill instances, which have ample KV-cache capacity and sufficient memory bandwidth to handle them efficiently.
The insight is that short sequences, typically under a few hundred tokens, are disproportionately cheap to decode. They occupy KV-cache slots briefly, generate few tokens, and complete before they can accumulate queuing delay. By routing these requests to prefill instances rather than waiting for decode capacity, the system reduces the average occupancy of decode instances and keeps them below the storage shortage threshold where latency explodes.
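Putting the two previous paragraphs together, the sketch below shows what such a scheduler-level decision might look like. It is illustrative only: the pool interface, the thresholds, and the assumption that an output-length estimate is available are mine, not InfiniCore's API.

```python
from dataclasses import dataclass

# Illustrative thresholds; a real deployment would tune these empirically.
SHORTAGE_THRESHOLD = 0.30   # fraction of decode demand stuck waiting for KV-cache slots
SHORT_REQUEST_TOKENS = 256  # "a few hundred tokens" cutoff for offloadable requests


@dataclass
class PoolState:
    """Hypothetical view of one instance pool as seen by the scheduler."""
    kv_slots_total: int
    kv_slots_used: int
    waiting_requests: int

    @property
    def shortage_level(self) -> float:
        # Fraction of demand that cannot be admitted for lack of KV-cache slots.
        demand = self.kv_slots_used + self.waiting_requests
        return 0.0 if demand == 0 else self.waiting_requests / demand


def route_decode(predicted_output_tokens: int,
                 decode_pool: PoolState,
                 prefill_pool: PoolState) -> str:
    """Decide which pool should run the decode phase of a new request."""
    decode_is_short_on_storage = decode_pool.shortage_level >= SHORTAGE_THRESHOLD
    request_is_short = predicted_output_tokens <= SHORT_REQUEST_TOKENS
    prefill_has_room = prefill_pool.kv_slots_used < prefill_pool.kv_slots_total

    if decode_is_short_on_storage and request_is_short and prefill_has_room:
        return "prefill_pool"   # deflect the short request to idle prefill capacity
    return "decode_pool"        # default path: dedicated decode instances
```

Note that this presumes some way of predicting output length up front; the paper's description summarised here does not spell out how that estimate is obtained.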
Implementation requires a modified scheduler that tracks shortage levels across both instance pools and a KV-cache migration mechanism to transfer state when requests switch pools mid-generation. The authors implement this in InfiniCore, their production serving framework at Infinigence-AI, and validate it against DeepSeek-V2, DeepSeek-V3, and Llama 3.1-70B under realistic load distributions. All three models show substantial improvements in both latency and SLO attainment.
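To make the migration step concrete, here is a minimal sketch of what moving a partially decoded request between pools might involve, assuming a block-based KV cache and a point-to-point transfer path. None of these interface names come from InfiniCore; they are stand-ins for whatever the framework actually exposes.

```python
from typing import Protocol, Sequence


class Instance(Protocol):
    """Hypothetical interface a serving instance would need to expose for migration."""
    def pause(self, request_id: str) -> None: ...
    def kv_blocks(self, request_id: str) -> Sequence[int]: ...
    def allocate_kv_blocks(self, n: int) -> Sequence[int]: ...
    def copy_blocks_from(self, src: "Instance", src_blocks: Sequence[int],
                         dst_blocks: Sequence[int]) -> None: ...
    def export_state(self, request_id: str) -> dict: ...
    def resume(self, request_id: str, blocks: Sequence[int], state: dict) -> None: ...
    def release(self, request_id: str) -> None: ...


def migrate_request(request_id: str, src: Instance, dst: Instance) -> None:
    """Move a mid-generation request's KV cache from one pool to the other."""
    src.pause(request_id)                               # stop the cache growing mid-copy
    src_blocks = src.kv_blocks(request_id)
    dst_blocks = dst.allocate_kv_blocks(len(src_blocks))
    dst.copy_blocks_from(src, src_blocks, dst_blocks)   # e.g. over NVLink or RDMA
    state = src.export_state(request_id)                # position, sampler state, etc.
    dst.resume(request_id, dst_blocks, state)
    src.release(request_id)                             # free the source KV-cache slots
```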
The 2.58× latency improvement on DeepSeek models is the headline number, but the SLO results are arguably more meaningful for production operators. SLO attainment measures the fraction of requests that complete within a target latency budget, which is what most commercial services actually contract around. A 1.55–1.72× improvement in SLO-compliant request volume means an operator can serve substantially more traffic at the same quality of service, without adding GPUs.
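For concreteness, SLO attainment reduces to a simple fraction over observed requests; the latency targets below are placeholders, not the paper's SLOs.

```python
def slo_attainment(requests, ttft_target_ms=500.0, tpot_target_ms=50.0):
    """Fraction of requests whose time-to-first-token (TTFT) and
    time-per-output-token (TPOT) both stay within their latency budgets."""
    if not requests:
        return 0.0
    within = sum(
        1 for r in requests
        if r["ttft_ms"] <= ttft_target_ms and r["tpot_ms"] <= tpot_target_ms
    )
    return within / len(requests)


# Example: 920 of 1000 requests meet both targets -> 92% SLO attainment.
sample = [{"ttft_ms": 320.0, "tpot_ms": 38.0}] * 920 + \
         [{"ttft_ms": 900.0, "tpot_ms": 71.0}] * 80
print(f"SLO attainment: {slo_attainment(sample):.1%}")
```

Multiplying attainment by request rate gives the SLO-compliant request volume (goodput) that the 1.55–1.72× figure refers to.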
The paper situates semi-PD within a broader debate about the right granularity for disaggregation. Full disaggregation maximises specialisation but wastes capacity. Fully collocated serving avoids waste but suffers from phase interference. Semi-PD occupies a middle position: preserve the separation where it matters (heavy prefill workloads) and relax it where the cost is low (short decode sequences). The authors argue this is the natural operating point for most production traffic distributions, which are skewed toward short completions.
There is a caveat. The benefits of semi-PD are most pronounced under high load, when decode instances are genuinely approaching shortage. At low utilisation, the overhead of cross-pool routing and KV-cache migration can slightly worsen latency for affected requests. For services with highly variable traffic, this means the scheduler logic needs to be conservative about triggering the offload path, a tuning problem the paper addresses but does not fully resolve.
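One common way to make such a trigger conservative, not necessarily the paper's, is hysteresis: enable offloading only above a high watermark and disable it only below a lower one, so brief load spikes do not flip the routing path back and forth. A sketch with illustrative thresholds:

```python
class OffloadTrigger:
    """Hysteresis around the decode pool's shortage level.

    Offloading turns on only when shortage crosses `high` and turns off only
    when it falls back below `low`, so transient spikes do not cause the
    scheduler to oscillate between the two routing policies.
    """

    def __init__(self, high: float = 0.25, low: float = 0.10):
        self.high = high
        self.low = low
        self.enabled = False

    def update(self, shortage_level: float) -> bool:
        if not self.enabled and shortage_level >= self.high:
            self.enabled = True
        elif self.enabled and shortage_level <= self.low:
            self.enabled = False
        return self.enabled
```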
Semi-PD serving demonstrates that the standard PD disaggregation architecture leaves a large efficiency gap on the table, and that closing it requires no new hardware or model changes. As model serving costs remain a primary concern for AI infrastructure operators, the approach offers a practical path to better utilisation through smarter scheduling: treating idle prefill capacity as a latency buffer for the decode bottleneck.