Running your own LLM can pay for itself in months, but only if you pick the right model size
A cost-benefit analysis of 54 deployment scenarios finds that small open-source models break even against commercial APIs in under three months on a $2,000 GPU, while large models can take years to justify their quarter-million-dollar hardware.
Break-Even Months: Open-Source Models vs. GPT-5
Source: Table IV, Pan et al. (2025). Break-even time in months for each open-source model when replacing GPT-5 API usage with equivalent on-premise throughput. Hardware costs range from $2k (RTX 5090) to $240k (16× A100).
Pan and colleagues built a total-cost-of-ownership model that pits nine open-source LLMs against five leading commercial APIs across every plausible combination. The open models span three size classes: small (24B to 32B parameters, running on a single $2,000 NVIDIA RTX 5090), medium (70B to 120B, needing one or two $15,000 A100 GPUs), and large (235B to 1T, requiring clusters of four to sixteen A100s costing $60k to $240k). On the commercial side, they priced out GPT-5, Claude-4 Opus, Claude-4 Sonnet, Grok-4, and Gemini 2.5 Pro using published per-token API rates.
The cost model itself is straightforward. Hardware is a one-time capital expense. Electricity accrues monthly at $0.15 per kWh, assuming eight hours of operation per business day. The comparison is fair because the authors normalise on throughput: they calculate how many tokens the local hardware could generate each month, then compute what the same volume would cost through a commercial API. Where the two cumulative-cost lines cross is the break-even point.
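The crossover logic can be sketched in a few lines of Python. All concrete numbers below (power draw, throughput, the $10-per-million-token API rate) are illustrative assumptions, not figures from the paper:

```python
def break_even_months(hardware_usd, power_kw, tok_per_sec,
                      api_usd_per_mtok, kwh_usd=0.15,
                      hours_per_day=8, days_per_month=21):
    """Months until cumulative API spend would exceed hardware plus
    electricity, following the paper's model: hardware is a one-time
    capex, electricity is the only recurring cost, and the API bill
    prices the same monthly token volume at commercial rates."""
    tokens_mtok = tok_per_sec * 3600 * hours_per_day * days_per_month / 1e6
    api_cost = tokens_mtok * api_usd_per_mtok        # what the cloud would charge
    electricity = power_kw * hours_per_day * days_per_month * kwh_usd
    saving = api_cost - electricity                  # net monthly saving
    return hardware_usd / saving if saving > 0 else float("inf")

# Assumed workstation: $2,000 GPU drawing 0.6 kW at 40 tokens/s,
# replacing an API billed at $10 per million output tokens.
months = break_even_months(2_000, 0.6, 40, 10.0)
```

At these assumed numbers the payback lands just under nine months; the result is sensitive mainly to throughput and to the API rate being displaced.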
They also track performance parity. Every open model is benchmarked on GPQA (graduate-level Q&A), MATH-500, LiveCodeBench, and MMLU-Pro, with percentage deltas reported alongside each break-even figure so readers can see exactly how much accuracy they trade for cost savings.
The headline is a story of scale. Small models pay for themselves almost immediately. EXAONE 4.0 32B, deployed on a single consumer GPU, breaks even against Claude-4 Opus in nine days (0.3 months) and against GPT-5 in 2.3 months. Performance loss relative to GPT-5 sits at about 2.7 percentage points on the composite benchmark score. For a small business running customer-support chatbots or internal document search, that gap barely registers.
Medium models occupy the middle ground you would expect. Llama-3.3-70B on a single A100 ($15k) breaks even with GPT-5 in 17.8 months, though it takes a 28-percentage-point hit on the combined benchmarks. GLM-4.5-Air and gpt-oss-120B perform better (within 5 points of GPT-5) but need two A100s and roughly 20 to 34 months to recoup the hardware spend. The practical sweet spot for mid-sized organisations is a hybrid strategy: run sensitive or high-volume workloads locally and push burst traffic to cloud APIs.
Large models tell a cautionary tale. Qwen3-235B matches GPT-5's benchmarks (scoring slightly higher on MATH-500 and MMLU-Pro) and breaks even in 34 months on $60k of hardware. But Kimi-K2, a 1-trillion-parameter model needing $240k in GPUs, does not break even against GPT-5 for nearly six years. Against aggressively priced Gemini 2.5 Pro, even Qwen3-235B takes 31 months. At that time horizon, hardware depreciation and the next generation of models become real risks.
One of the paper's sharpest observations concerns the enormous variance across commercial providers. Claude-4 Opus charges $15 per million input tokens and $75 per million output tokens. GPT-5 charges $1.25 and $10 respectively. That 12x price difference on the input side means the same open-source model can break even in under a month against Claude-4 Opus while taking over two years against GPT-5. The choice of which commercial API you are replacing matters at least as much as the open model you choose.
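Plugging the published rates into a small sketch makes the spread concrete; the 1:3 input-to-output token split is an assumption for illustration, not a figure from the paper:

```python
# Per-million-token rates quoted above: (input, output).
RATES = {"Claude-4 Opus": (15.00, 75.00), "GPT-5": (1.25, 10.00)}

def monthly_bill(total_mtok, in_rate, out_rate, in_frac=0.25):
    """Blended monthly cost for a given token volume, assuming
    in_frac of tokens are input and the remainder output."""
    return total_mtok * (in_frac * in_rate + (1 - in_frac) * out_rate)

# The same 50M-token monthly workload, priced per provider.
bills = {name: monthly_bill(50, i, o) for name, (i, o) in RATES.items()}
```

At this split the Opus bill comes out roughly 7.7x the GPT-5 bill ($3,000 versus about $391 per month), so an identical local deployment runs its break-even clock almost eight times faster against Opus.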
This has a practical implication the paper does not fully spell out. If your organisation currently pays Claude-4 Opus rates, almost any local deployment looks brilliant. If you are on GPT-5 or Gemini pricing, the case for going on-premise weakens dramatically, and you need high sustained volume (50 million tokens per month or more) to make the numbers work. Vendor selection is the first lever, not the last.
The cost model is deliberately lean. It accounts for GPU hardware and electricity but excludes staffing (someone has to maintain those servers), networking, cooling infrastructure, rack space, and software licensing. The authors acknowledge this and flag it as future work. For a small team deploying a 30B model on a workstation under a desk, these omissions are minor. For an enterprise standing up a 16-GPU cluster, they could add 30 to 50 percent to the total cost and push break-even horizons well past what Table IV suggests.
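A back-of-envelope extension shows how sensitive the payback horizon is to those exclusions. The dollar figures here are assumed for illustration, not taken from the paper:

```python
def payback_months(hardware_usd, monthly_api_saving, monthly_ops_usd=0.0):
    """Payback when recurring operational costs (staffing, cooling,
    networking, rack space) eat into the monthly API saving --
    exactly the line items the paper's lean model excludes."""
    net = monthly_api_saving - monthly_ops_usd
    return hardware_usd / net if net > 0 else float("inf")

# Hypothetical 16-GPU cluster displacing $8,000/month of API spend.
lean = payback_months(240_000, 8_000)            # paper-style model
loaded = payback_months(240_000, 8_000, 3_000)   # plus $3k/month of ops
```

Under these assumed figures, a $3,000 monthly ops line stretches a 30-month payback to 48 months, a 60 percent slip caused entirely by costs the lean model never sees.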
The analysis also assumes eight hours of daily operation, which suits a standard business-day workflow but undercounts always-on production services. Running 24/7 multiplies both token output and electricity cost severalfold (three times the daily hours, plus weekends), shortening break-even because the same hardware displaces more API spend, at the price of higher monthly OpEx. The model is transparent enough that organisations can plug in their own assumptions, and the authors provide an interactive calculator for exactly that purpose.
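The utilisation sensitivity is easy to probe with the same kind of back-of-envelope model; hardware specs and the API rate are again illustrative assumptions rather than the paper's figures:

```python
def break_even(hardware_usd, power_kw, tok_per_sec, api_usd_per_mtok,
               hours_per_month, kwh_usd=0.15):
    """Break-even in months as a function of monthly operating hours:
    more hours means more displaced API spend, but also more OpEx."""
    tokens_mtok = tok_per_sec * 3600 * hours_per_month / 1e6
    saving = (tokens_mtok * api_usd_per_mtok
              - power_kw * hours_per_month * kwh_usd)
    return hardware_usd / saving if saving > 0 else float("inf")

# Assumed $2,000 GPU at 0.6 kW and 40 tok/s, vs a $10/Mtok API.
business_hours = break_even(2_000, 0.6, 40, 10.0, hours_per_month=8 * 21)
always_on = break_even(2_000, 0.6, 40, 10.0, hours_per_month=24 * 30)
```

Under these assumptions the always-on service pays off roughly four times faster, because the extra token volume displaces far more API spend than the extra electricity costs.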
Hardware depreciation is the elephant in the room. GPU technology turns over roughly every two years. A $240k cluster bought today competes against hardware that will be twice as fast per dollar by the time Kimi-K2 hits its break-even point. The paper wisely steers readers toward small and medium models where payback arrives before the hardware becomes obsolete.
For most organisations, self-hosting a small open-source model (24B to 32B parameters) on a $2,000 consumer GPU pays for itself within weeks to months, with less than a 5-percentage-point accuracy penalty compared to leading commercial APIs. Medium models offer a viable middle path for higher-volume workloads, typically breaking even within one to two years. Large on-premise deployments remain a niche play, justified primarily by privacy mandates or vendor lock-in concerns rather than pure economics. The single biggest variable is not which open model you pick; it is which commercial API bill you are trying to eliminate.
Pan, G., Chodnekar, V., Roy, A., & Wang, H. (2025). A cost-benefit analysis of on-premise large language model deployment: Breaking even with commercial LLM services. arXiv. https://arxiv.org/abs/2509.18101