Please use this identifier to cite or link to this item: https://dair.nps.edu/handle/123456789/5389
Title: Test and Evaluation of Large Language Models to Support Informed Government Acquisition
Authors: Brian Mayer
Jaganmohan Chandrasekaran
Erin Lanus
Patrick Butler
Stephen Adams
Jared Gregersen
Naren Ramakrishnan
Laura Freeman
Keywords: Large Language Models
Test and Evaluation
Government Acquisition
Generative Artificial Intelligence
Benchmarking
Issue Date: 2-Apr-2025
Publisher: Acquisition Research Program
Series/Report no.: Acquisition Management; SYM-AM-25-313; SYM-AM-25-401
Abstract: As large language models (LLMs) continue to advance and find applications in critical decision-making systems, robust and thorough test and evaluation (T&E) of these models will be necessary to ensure we reap their promised benefits without the risks that often accompany them. Most existing applications of LLMs are in specific areas such as healthcare, marketing, and customer support, and these domains have shaped their T&E processes. When investigating LLMs for government acquisition, we encounter unique challenges and opportunities. Key challenges include managing the complexity and novelty of Artificial Intelligence (AI) systems and implementing risk management practices robust enough to satisfy stringent government regulatory requirements. Data management and transparency are critical concerns, as is the need to ensure accuracy (performance). Unlike traditional software systems developed for specific functionalities, LLMs are capable of performing a wide variety of functionalities (e.g., translation, generation). Furthermore, the primary mode of interaction with an LLM is through natural language. These unique characteristics necessitate comprehensive evaluation across diverse functionalities that accounts for the variability of natural language inputs and outputs. Thus, T&E for LLMs must support evaluating the model's linguistic capabilities (understanding, reasoning, etc.), generation capabilities (e.g., correctness, coherence, and contextually relevant responses), and other quality attributes (fairness, security, lack of toxicity, robustness). T&E must be thorough, robust, and systematic to fully realize the capabilities and limitations (e.g., hallucinations and toxicity) of LLMs and to ensure confidence in their performance.
This work aims to provide an overview of the current state of T&E methods for ascertaining the quality of LLMs, along with structured recommendations for testing LLMs, resulting in a process for assuring warfighting capability.
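The abstract's call for evaluating an LLM along multiple quality dimensions at once can be illustrated with a minimal sketch. This is not the authors' method: the metrics below (keyword-based correctness, a placeholder toxicity lexicon) and the stubbed model are illustrative assumptions standing in for real benchmarks and classifiers.

```python
# Minimal sketch of multi-dimensional LLM T&E: score each output on several
# quality attributes rather than a single accuracy number.
# All metrics and names here are illustrative placeholders.

def keyword_correctness(response: str, required: list[str]) -> float:
    """Fraction of required reference keywords present in the response
    (a crude stand-in for a task-specific correctness benchmark)."""
    if not required:
        return 1.0
    hits = sum(1 for kw in required if kw.lower() in response.lower())
    return hits / len(required)

# Placeholder lexicon; a real harness would call a toxicity classifier.
BLOCKLIST = {"hateful", "slur"}

def toxicity_flag(response: str) -> bool:
    """True if the response contains a blocklisted term."""
    text = response.lower()
    return any(term in text for term in BLOCKLIST)

def evaluate(test_cases, model):
    """Run the model over (prompt, required_keywords) cases and collect
    per-case scores for each quality dimension."""
    results = []
    for prompt, required in test_cases:
        response = model(prompt)
        results.append({
            "prompt": prompt,
            "correctness": keyword_correctness(response, required),
            "toxic": toxicity_flag(response),
        })
    return results

# Usage with a stubbed "model" (a function from prompt to text):
cases = [("Define T&E.", ["test", "evaluation"])]
stub = lambda prompt: "Test and evaluation (T&E) verifies system quality."
report = evaluate(cases, stub)
```

A production harness would swap each placeholder metric for a validated benchmark or classifier and add dimensions such as robustness (consistency under paraphrased prompts) and fairness, but the structure — one prompt in, several scores out — is the same.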
Description: SYM Paper / SYM Panel
URI: https://dair.nps.edu/handle/123456789/5389
Appears in Collections: Annual Acquisition Research Symposium Proceedings & Presentations

Files in This Item:
File                Description       Size       Format
SYM-AM-25-313.pdf   SYM Paper         638.82 kB  Adobe PDF
SYM-AM-25-401.pdf   SYM Presentation  918.16 kB  Adobe PDF


Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.