Please use this identifier to cite or link to this item: https://dair.nps.edu/handle/123456789/5389

| Title: | Test and Evaluation of Large Language Models to Support Informed Government Acquisition | 
| Authors: | Brian Mayer; Jaganmohan Chandrasekaran; Erin Lanus; Patrick Butler; Stephen Adams; Jared Gregersen; Naren Ramakrishnan; Laura Freeman | 
| Keywords: | Large Language Models; Test and Evaluation; Government Acquisition; Generative Artificial Intelligence; Benchmarking | 
| Issue Date: | 2-Apr-2025 | 
| Publisher: | Acquisition Research Program | 
| Citation: | APA | 
| Series/Report no.: | Acquisition Management; SYM-AM-25-313; SYM-AM-25-401 | 
| Abstract: | "As large language models (LLMs) continue to advance and find applications in critical decision-making systems, robust and thorough test and evaluation (T&E) of these models will be necessary to ensure we reap their promised benefits without the risks that often come with LLMs. Most existing applications of LLMs are in specific areas like healthcare, marketing, and customer support and thus these domains have influenced their T&E processes. When investigating LLMs for government acquisition, we encounter unique challenges and opportunities. Key challenges include managing the complexity and novelty of Artificial Intelligence (AI) systems and implementing robust risk management practices that can pass muster with the stringency of government regulatory requirements. Data management and transparency are critical concerns, as is the need for ensuring accuracy (performance). Unlike traditional software systems developed for specific functionalities, LLMs are capable of performing a wide variety of functionalities (e.g., translation, generation). Furthermore, the primary mode of interaction with an LLM is through natural language. These unique characteristics necessitate a comprehensive evaluation across diverse functionalities and accounting for the variability in the natural language inputs/outputs. Thus, the T&E for LLMs must support evaluating the model’s linguistic capabilities (understanding, reasoning, etc.), generation capabilities (e.g., correctness, coherence, and contextually relevant responses), and other quality attributes (fairness, security, lack of toxicity, robustness). T&E must be thorough, robust, and systematic to fully realize the capabilities and limitations (e.g., hallucinations and toxicity) of LLMs and to ensure confidence in their performance. 
This work aims to provide an overview of the current state of T&E methods for ascertaining the quality of LLMs and structured recommendations for testing LLMs, thus resulting in a process for assuring warfighting capability." | 
| Description: | SYM Paper / SYM Panel | 
| URI: | https://dair.nps.edu/handle/123456789/5389 | 
| Appears in Collections: | Annual Acquisition Research Symposium Proceedings & Presentations | 
Files in This Item:
| File | Description | Size | Format |
|---|---|---|---|
| SYM-AM-25-313.pdf | SYM Paper | 638.82 kB | Adobe PDF |
| SYM-AM-25-401.pdf | SYM Presentation | 918.16 kB | Adobe PDF |
Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.

