Research Explainer · Zhang, Takeuchi, Kawahara et al. (2025)

General-purpose LLM benchmarks miss the mark because domain-specific enterprise tasks reshuffle the leaderboard

A 27-benchmark evaluation across finance, legal, climate, and cybersecurity shows that the model topping generic tests rarely wins on specialised enterprise tasks, and that smaller models routinely outperform larger ones in specific domains.

27 publicly available enterprise benchmarks spanning four domains, used to evaluate 8 open-source LLMs of up to 70B parameters

+5 rank positions gained by Flan UL2 (20B) on general summarisation, only to drop the same amount on legal summarisation

0.928 Binary F1 achieved by Llama 3.1-70B on IoTSpotter cybersecurity classification, the single highest score in the entire benchmark suite

Finance benchmark: model performance across task types

Source: Table 3 in Zhang, Takeuchi, Kawahara et al. (2025). Scores are Weighted F1 for classification and sentiment, Entity F1 for NER (FiNER-139), RR@10 for QA (FiQA-Opinion), Accuracy for ConvFinQA, and Rouge-L for summarisation (EDT). Higher is better across all metrics.
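
For readers who want to recompute scores like these, here is a minimal sketch of two of the metrics named above, Weighted F1 and Rouge-L, using scikit-learn and the rouge-score package. The labels and text strings are placeholders for illustration, not data from the paper.

```python
from sklearn.metrics import f1_score
from rouge_score import rouge_scorer

# Weighted F1 for a classification or sentiment task (placeholder labels).
y_true = ["positive", "negative", "neutral", "negative"]
y_pred = ["positive", "neutral", "neutral", "negative"]
print(f1_score(y_true, y_pred, average="weighted"))

# Rouge-L for a summarisation task (placeholder reference and prediction).
scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
reference = "Operating margin narrowed for the third consecutive quarter."
prediction = "The company's operating margin narrowed again this quarter."
print(scorer.score(reference, prediction)["rougeL"].fmeasure)
```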

The researchers took Stanford's HELM evaluation framework and bolted on 27 enterprise-focused benchmarks that HELM had never covered. The datasets span four domains: finance (10 English datasets plus 2 Japanese), legal (7 datasets), climate (3 datasets), and cybersecurity (6 datasets). Tasks include classification, named entity recognition, question answering, summarisation, and translation. Every dataset is publicly available and contains at least 100 labelled test cases.
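
To make the shape of the suite concrete, here is a rough sketch (not the authors' HELM integration code) of how such a catalogue could be represented before wiring it into run specs. The entries shown are a small illustrative subset; the task and metric labels come from the descriptions in this explainer.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EnterpriseBenchmark:
    name: str      # public dataset name
    domain: str    # finance | legal | climate | cybersecurity
    task: str      # classification, NER, QA, summarisation, translation
    metric: str    # score reported for this task
    min_test_cases: int = 100  # every dataset has at least 100 labelled test cases

# Illustrative subset of the 27-benchmark catalogue (not the full list).
CATALOGUE = [
    EnterpriseBenchmark("FiNER-139", "finance", "NER", "Entity F1"),
    EnterpriseBenchmark("ConvFinQA", "finance", "QA", "Accuracy"),
    EnterpriseBenchmark("BillSum", "legal", "summarisation", "Rouge-L"),
    EnterpriseBenchmark("IoTSpotter", "cybersecurity", "classification", "Binary F1"),
]

# Group by domain to see coverage at a glance.
by_domain: dict[str, list[str]] = {}
for bench in CATALOGUE:
    by_domain.setdefault(bench.domain, []).append(bench.name)
print(by_domain)
```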

They then ran 8 open-source models through the lot, none larger than 70 billion parameters. The roster includes Llama 3.1 (8B and 70B), Phi 3.5 (3.8B), Mistral 7B, Granite 3 (8B), Flan UL2 (20B), and two Japanese-specialised models. All models used identical prompts, identical few-shot examples, greedy decoding at temperature zero, and a fixed random seed. No chain-of-thought prompting, no system prompts, no tricks.
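
As a rough illustration of that protocol (not the authors' harness), the same setup can be reproduced with Hugging Face transformers: a fixed seed, a shared few-shot prompt, and greedy decoding. The model name and prompt below are placeholders.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, set_seed

set_seed(42)  # fixed random seed, mirroring the paper's protocol

model_name = "mistralai/Mistral-7B-v0.1"  # placeholder; any causal LM in the roster
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)

# Shared few-shot prompt: identical examples for every model, no system prompt.
prompt = (
    "Classify the sentiment of each headline as positive, negative, or neutral.\n\n"
    "Headline: Q3 revenue beats guidance\nSentiment: positive\n\n"
    "Headline: Auditor flags going-concern risk\nSentiment: negative\n\n"
    "Headline: Operating margin narrows for the third straight quarter\nSentiment:"
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, do_sample=False, max_new_tokens=5)  # greedy decoding
answer = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(answer.strip())
```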

The headline result is that no single model dominates across enterprise domains. Llama 3.1-70B tops the finance benchmark on 7 of 10 tasks, scoring 0.874 Weighted F1 on news headline classification and 0.802 Entity F1 on the challenging FiNER-139 numerical NER task. In cybersecurity it leads on 4 of 6 tasks, reaching 0.896 F1 on threat intelligence mapping. So far, so predictable for the largest model in the set.

The story gets interesting when you compare domain performance to general benchmarks. Flan UL2 ranks first on IMDb sentiment classification (0.975 accuracy) and first on CNN-DailyMail summarisation (0.299 Rouge-L). Move to domain-specific sentiment in finance or legal, and it drops to sixth and fifth respectively. The reason is telling: financial and legal texts express positive or negative sentiment through domain idiom ("year-over-year decline in operating margin") rather than the straightforward polarity words ("great", "terrible") that movie reviews favour.

Smaller models occasionally punch above their weight. Granite 3 (8B), trained on finance and legal data, holds the top summarisation score in both BillSum (0.312 Rouge-L) and Legal Summarisation (0.271 Rouge-L), beating Llama 3.1-70B, a model with roughly nine times its parameter count. Mistral 7B leads on legal judgement prediction (0.845 Weighted F1). In climate classification, Phi 3.5 at only 3.8B parameters tops all models on wildfire tweet classification at 0.796 Weighted F1.

Finance

10 English + 2 Japanese datasets covering earnings calls, NER on SEC filings, financial QA, and summarisation. Llama 3.1-70B dominates; Granite 3-8B leads on sentiment.

Legal

7 datasets spanning sentiment, terms-of-service classification, judgement prediction, and contract summarisation. Mistral 7B and Granite 3-8B outperform larger models on several tasks.

Climate

3 datasets on Reddit posts, wildfire tweets, and climate claim summarisation. Flan UL2 and Phi 3.5 lead here, not the largest model.

Cybersecurity

6 datasets on 5G protocol analysis, threat intelligence mapping, malware reports, and IoT app detection. Llama 3.1-70B leads most tasks; Flan UL2 leads summarisation.

The authors attribute the rank shuffles primarily to training data composition. Flan UL2, pre-trained on the C4 corpus (filtered Common Crawl), was published in early 2023, before the push to include domain-specific corpora became standard. Its training data almost certainly lacked the volume of financial, legal, and cybersecurity text found in more recent datasets. Granite 3, by contrast, explicitly includes FDIC filings, finance textbooks, and EDGAR documents in its training pipeline, which explains its stable performance across both finance and legal domains despite having only 8 billion parameters.

The NER results reveal a separate problem. Traditional BIO-tag sequence labelling simply fails with decoder-only LLMs. The team adopted an extraction-based approach (the model outputs "New York (location)" rather than a tag sequence) and found that it needs 20 or more few-shot examples to work properly. Even then, numerical NER (extracting KPIs from SEC filings, say) remains "a particularly challenging task" where the 70B model manages only 0.697 Adjusted F1 on KPI-Edgar. Domain specialisation is not just a matter of sentiment vocabulary; it extends to structural conventions governing how numbers, entities, and relations appear in enterprise text.
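
A minimal sketch of that extraction-style evaluation, assuming the model emits one "entity (type)" line per entity: parse the output into (entity, type) pairs and score entity-level F1 against gold annotations. The output format, helper names, and example strings are illustrative, not the paper's exact implementation.

```python
import re
from typing import Set, Tuple

def parse_entities(text: str) -> Set[Tuple[str, str]]:
    """Parse lines like 'New York (location)' into (entity, type) pairs."""
    pattern = re.compile(r"^(.+?)\s*\((\w+)\)\s*$")
    entities = set()
    for line in text.strip().splitlines():
        match = pattern.match(line.strip())
        if match:
            entities.add((match.group(1).strip().lower(), match.group(2).lower()))
    return entities

def entity_f1(predicted: Set[Tuple[str, str]], gold: Set[Tuple[str, str]]) -> float:
    """Entity-level F1: a prediction counts only if both span text and type match."""
    if not predicted and not gold:
        return 1.0
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Example: hypothetical model output vs. gold annotations for one document.
model_output = "New York (location)\nAcme Corp (organization)"
gold = {("new york", "location"), ("acme corp", "organization"), ("q3 2024", "date")}
print(entity_f1(parse_entities(model_output), gold))  # 0.8
```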

Practitioners choosing an LLM for an enterprise application face a concrete problem: the model that tops MMLU or IMDb might rank fifth or sixth on the task they actually care about. This paper provides the benchmarks, prompts, and HELM integration code to run that comparison directly. The entire framework is open-sourced and being merged into the main HELM repository.

The Japanese results add a useful data point for multilingual enterprise deployment. Granite 8B Japanese outperformed Llama 3 ELYZA JP 8B across all four Japanese finance tasks, including a BLEU score of 0.123 on English-to-Japanese financial translation. That is still a low number in absolute terms, and it signals how far specialised multilingual enterprise NLP has to go.

Key Finding

General-purpose benchmarks hide critical performance differences in enterprise domains. A model that leads on movie reviews, trivia, and news summarisation can lose to a model one-ninth its size on legal contract summarisation or financial sentiment analysis. The only reliable way to choose an LLM for an enterprise task is to evaluate it on data from that domain, and this framework makes that possible within HELM for the first time across finance, legal, climate, and cybersecurity.

Reference

Zhang, B., Takeuchi, M., Kawahara, R., Asthana, S., Hossain, M. M., Ren, G.-J., Soule, K., Mai, Y., & Zhu, Y. (2025). Evaluating large language models with enterprise benchmarks. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Industry Track), 485–505. https://aclanthology.org/2025.naacl-industry.1/