Please use this identifier to cite or link to this item: https://dair.nps.edu/handle/123456789/5389
Full metadata record
DC Field | Value | Language
dc.contributor.author | Brian Mayer | -
dc.contributor.author | Jaganmohan Chandrasekaran | -
dc.contributor.author | Erin Lanus | -
dc.contributor.author | Patrick Butler | -
dc.contributor.author | Stephen Adams | -
dc.contributor.author | Jared Gregersen | -
dc.contributor.author | Naren Ramakrishnan | -
dc.contributor.author | Laura Freeman | -
dc.date.accessioned | 2025-05-02T17:01:31Z | -
dc.date.available | 2025-05-02T17:01:31Z | -
dc.date.issued | 2025-04-02 | -
dc.identifier.citation | APA | en_US
dc.identifier.uri | https://dair.nps.edu/handle/123456789/5389 | -
dc.description | SYM Paper / SYM Panel | en_US
dc.description.abstract | As large language models (LLMs) continue to advance and find applications in critical decision-making systems, robust and thorough test and evaluation (T&E) of these models will be necessary to ensure we reap their promised benefits without the risks that often accompany LLMs. Most existing applications of LLMs are in specific areas such as healthcare, marketing, and customer support, and these domains have accordingly shaped their T&E processes. When investigating LLMs for government acquisition, we encounter unique challenges and opportunities. Key challenges include managing the complexity and novelty of Artificial Intelligence (AI) systems and implementing robust risk management practices that can pass muster with stringent government regulatory requirements. Data management and transparency are critical concerns, as is the need to ensure accuracy (performance). Unlike traditional software systems developed for specific functionalities, LLMs can perform a wide variety of functions (e.g., translation, generation). Furthermore, the primary mode of interaction with an LLM is through natural language. These unique characteristics necessitate comprehensive evaluation across diverse functionalities that accounts for the variability in natural language inputs and outputs. Thus, T&E for LLMs must support evaluating the model's linguistic capabilities (understanding, reasoning, etc.), generation capabilities (e.g., correctness, coherence, and contextually relevant responses), and other quality attributes (fairness, security, lack of toxicity, robustness). T&E must be thorough, robust, and systematic to fully realize the capabilities and limitations (e.g., hallucinations and toxicity) of LLMs and to ensure confidence in their performance. This work aims to provide an overview of the current state of T&E methods for ascertaining the quality of LLMs and structured recommendations for testing LLMs, thus resulting in a process for assuring warfighting capability. | en_US
dc.description.sponsorship | Acquisition Research Program | en_US
dc.language.iso | en_US | en_US
dc.publisher | Acquisition Research Program | en_US
dc.relation.ispartofseries | Acquisition Management;SYM-AM-25-313 | -
dc.relation.ispartofseries | ;SYM-AM-25-401 | -
dc.subject | Large Language Models | en_US
dc.subject | Test and Evaluation | en_US
dc.subject | Government Acquisition | en_US
dc.subject | Generative Artificial Intelligence | en_US
dc.subject | Benchmarking | en_US
dc.title | Test and Evaluation of Large Language Models to Support Informed Government Acquisition | en_US
dc.type | Presentation | en_US
dc.type | Technical Report | en_US
Appears in Collections: Annual Acquisition Research Symposium Proceedings & Presentations

Files in This Item:
File | Description | Size | Format
SYM-AM-25-313.pdf | SYM Paper | 638.82 kB | Adobe PDF
SYM-AM-25-401.pdf | SYM Presentation | 918.16 kB | Adobe PDF


Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.