Please use this identifier to cite or link to this item: https://dair.nps.edu/handle/123456789/5389
Title: Test and Evaluation of Large Language Models to Support Informed Government Acquisition
Authors: Brian Mayer
Jaganmohan Chandrasekaran
Erin Lanus
Patrick Butler
Stephen Adams
Jared Gregersen
Naren Ramakrishnan
Laura Freeman
Keywords: Large Language Models
Test and Evaluation
Government Acquisition
Generative Artificial Intelligence
Benchmarking
Issue Date: 2-Apr-2025
Publisher: Acquisition Research Program
Series/Report no.: Acquisition Management; SYM-AM-25-313; SYM-AM-25-401
Abstract: As large language models (LLMs) continue to advance and find applications in critical decision-making systems, robust and thorough test and evaluation (T&E) of these models will be necessary to ensure we reap their promised benefits without the risks that often accompany them. Most existing applications of LLMs are in specific areas such as healthcare, marketing, and customer support, and these domains have shaped their T&E processes. When investigating LLMs for government acquisition, we encounter unique challenges and opportunities. Key challenges include managing the complexity and novelty of Artificial Intelligence (AI) systems and implementing risk management practices robust enough to satisfy stringent government regulatory requirements. Data management and transparency are critical concerns, as is the need to ensure accuracy (performance). Unlike traditional software systems developed for specific functionalities, LLMs are capable of performing a wide variety of functionalities (e.g., translation, generation). Furthermore, the primary mode of interaction with an LLM is through natural language. These unique characteristics necessitate comprehensive evaluation across diverse functionalities that accounts for the variability of natural language inputs and outputs. Thus, T&E for LLMs must support evaluating the model's linguistic capabilities (understanding, reasoning, etc.), generation capabilities (e.g., correctness, coherence, and contextually relevant responses), and other quality attributes (fairness, security, lack of toxicity, robustness). T&E must be thorough, robust, and systematic to fully realize the capabilities and limitations (e.g., hallucinations and toxicity) of LLMs and to ensure confidence in their performance.
This work aims to provide an overview of the current state of T&E methods for ascertaining the quality of LLMs, along with structured recommendations for testing LLMs, resulting in a process for assuring warfighting capability.
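The abstract's call for evaluating an LLM along multiple quality dimensions at once can be illustrated with a minimal sketch. This is not the authors' method: the metrics below (keyword-based correctness, a placeholder toxicity lexicon) and the stubbed model are illustrative assumptions standing in for real benchmarks and classifiers.

```python
# Minimal sketch of multi-dimensional LLM T&E: score each output on several
# quality attributes rather than a single accuracy number.
# All metrics and names here are illustrative placeholders.

def keyword_correctness(response: str, required: list[str]) -> float:
    """Fraction of required reference keywords present in the response
    (a crude stand-in for a task-specific correctness benchmark)."""
    if not required:
        return 1.0
    hits = sum(1 for kw in required if kw.lower() in response.lower())
    return hits / len(required)

# Placeholder lexicon; a real harness would call a toxicity classifier.
BLOCKLIST = {"hateful", "slur"}

def toxicity_flag(response: str) -> bool:
    """True if the response contains a blocklisted term."""
    text = response.lower()
    return any(term in text for term in BLOCKLIST)

def evaluate(test_cases, model):
    """Run the model over (prompt, required_keywords) cases and collect
    per-case scores for each quality dimension."""
    results = []
    for prompt, required in test_cases:
        response = model(prompt)
        results.append({
            "prompt": prompt,
            "correctness": keyword_correctness(response, required),
            "toxic": toxicity_flag(response),
        })
    return results

# Usage with a stubbed "model" (a function from prompt to text):
cases = [("Define T&E.", ["test", "evaluation"])]
stub = lambda prompt: "Test and evaluation (T&E) verifies system quality."
report = evaluate(cases, stub)
```

A production harness would swap each placeholder metric for a validated benchmark or classifier and add dimensions such as robustness (consistency under paraphrased prompts) and fairness, but the structure — one prompt in, several scores out — is the same.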
Description: SYM Paper / SYM Panel
URI: https://dair.nps.edu/handle/123456789/5389
Appears in Collections: Annual Acquisition Research Symposium Proceedings & Presentations

Files in This Item:
File                Description       Size       Format
SYM-AM-25-313.pdf   SYM Paper         638.82 kB  Adobe PDF
SYM-AM-25-401.pdf   SYM Presentation  918.16 kB  Adobe PDF


Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.