Tags: Benchmarking, Electric Power, Large Language Models, LLMs
What EPRI’s new domain-specific benchmark reveals about LLM reliability and what it means for utilities
Utilities are actively exploring large language models (LLMs) to accelerate knowledge work such as summarizing technical documents, answering engineering questions, drafting procedures, and supporting decision-making. But in a safety- and compliance-critical sector, adoption requires evidence: how accurate are today's models on utility-relevant questions, and where do they fall short?
To answer that, EPRI developed the first rigorously constructed, domain-specific LLM benchmark for the electric power sector. The benchmark is grounded in real-world power-system questions, authored and reviewed by subject-matter experts across 35 domains, and is designed to be repeatable across models and time.
Why Benchmarking Matters for Utilities
Many public benchmarks emphasize broad academic knowledge (e.g., math, science, coding). They rarely reflect the operational context utilities face: equipment constraints, protection and control considerations, regulatory requirements, and real-world tradeoffs. A power-sector benchmark helps utilities understand whether LLMs can provide reliable support in their environment and where expert oversight remains essential.
EPRI’s evaluation consists of multiple trials per model and reports confidence intervals to capture variability and stability across runs. Automated evaluation is performed using Inspect (inspect_ai), an open-source framework created by the UK AI Security Institute, with the option of SME review for higher-stakes items and edge cases.
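To make the evaluation setup concrete, here is a minimal sketch of how a multiple-choice benchmark task might be expressed with Inspect (inspect_ai). The sample question, model name, and epoch count below are illustrative placeholders, not items from EPRI's actual benchmark.

```python
from inspect_ai import Task, task, eval
from inspect_ai.dataset import Sample
from inspect_ai.solver import multiple_choice
from inspect_ai.scorer import choice

@task
def power_sector_mcq():
    # Hypothetical multiple-choice item; EPRI's real questions are SME-authored and reviewed
    samples = [
        Sample(
            input="Which device is primarily intended to interrupt fault current "
                  "on a transmission line?",
            choices=[
                "Disconnect switch",
                "Circuit breaker",
                "Surge arrester",
                "Current transformer",
            ],
            target="B",  # letter of the correct choice
        )
    ]
    return Task(dataset=samples, solver=multiple_choice(), scorer=choice())

# Run repeated trials (epochs) so run-to-run variability can be measured and reported
eval(power_sector_mcq(), model="openai/gpt-4o", epochs=5)
```

Running the same task across several models and epochs is what allows confidence intervals to be reported rather than a single point score.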
What This Means for Adoption
The benchmark results suggest a pragmatic path for utilities:
- Start with decision support in low-consequence contexts. Use LLMs where errors are easy to detect and have limited impact (e.g., drafting, summarization, Q&A support), and treat outputs as decision support, not decisions, especially given the operational consequences of inaccuracies.
- Require human oversight for safety, reliability, and compliance-critical work. Results on open-ended reformulations indicate materially lower reliability than in multiple-choice formats, reinforcing the need for SME/engineer review.
- If you enable retrieval (e.g., web search), validate the retrieval layer. Average gains can mask failure modes from irrelevant or misleading sources, so pair model evaluation with retrieval-quality checks and guardrails (a minimal check is sketched after this list).
- Consider open-weight options for sensitive deployments but benchmark them. Open-weight models are closing the gap and can be self-hosted for deployment flexibility (including secure in-house deployments). However, utilities should still validate performance, robustness, and operational fit on domain-specific tests.
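One simple way to check the retrieval layer independently of the model is to score it against a small SME-labeled set of queries. The sketch below computes recall@k over such a set; the queries, document IDs, and the `retrieve` stub are hypothetical placeholders for a utility's own retrieval stack (vector store, enterprise search, or web search).

```python
def recall_at_k(retrieved_ids, relevant_ids, k=5):
    """Fraction of SME-labeled relevant documents that appear in the top-k results."""
    if not relevant_ids:
        return 0.0
    return len(set(retrieved_ids[:k]) & relevant_ids) / len(relevant_ids)

# Hypothetical labeled set: query -> document IDs an SME marked as relevant
labeled_queries = {
    "transformer oil sampling interval": {"doc_017", "doc_042"},
    "relay coordination for distribution feeders": {"doc_103"},
}

def retrieve(query):
    # Placeholder for the real retrieval layer; returns ranked document IDs
    return ["doc_042", "doc_250", "doc_017", "doc_011", "doc_099"]

for query, relevant in labeled_queries.items():
    score = recall_at_k(retrieve(query), relevant, k=5)
    print(f"{query!r}: recall@5 = {score:.2f}")
```

Tracking a metric like this alongside model accuracy helps separate "the model answered poorly" from "the model was fed poor sources."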
What’s Next
Future phases will build on this foundation by benchmarking domain-augmented tools and real utility use cases, such as retrieval-augmented assistants, knowledge-graph-enhanced systems, and workflow automation, alongside general-purpose models. EPRI will also expand applied evaluations with member utilities to measure not only accuracy, but also trust, operational impact, and integration considerations.
Learn More
- WattWorks: The Power Sector’s AI Benchmarking Hub
- Benchmarking Large Language Models for the Electric Power Sector
Microsoft Copilot was used to generate a draft of this article from an EPRI publication. AI-generated content was reviewed, edited, and fact-checked by an EPRI expert to ensure accuracy and quality.
Banner image created using ChatGPT