2. Describe how AI, specifically ChatGPT 4.0, can be trained to evaluate narrative assessments based on a validated narrative assessment scoring tool
3. Interpret key statistical measures of inter-rater reliability (ICC, Cronbach's alpha) used to evaluate AI performance against human scorers
4. Evaluate the potential for AI integration in streamlining educational feedback across military GME programs
INTRODUCTION: In the pursuit of operational readiness, optimizing the training of future military physicians is essential. One of the foundational pillars of clinical competency development within graduate medical education (GME) is the delivery of high-quality feedback from attendings to trainees. Accurate, timely, and specific feedback not only enhances the clinical growth of resident physicians but also directly supports the creation of a flexible and adaptable medical force. Despite its importance, the quality of narrative feedback within residency training is often inconsistent and time-consuming to evaluate. This variability may hinder the standardization necessary to uphold the Defense Health Agency's (DHA's) commitment to excellence across its joint medical enterprise. Recent literature highlights the emerging role of artificial intelligence (AI) in automating and refining evaluation processes in educational contexts, including the scoring of narrative responses. This study examines the use of AI, specifically ChatGPT 4.0, to assess the quality of resident narrative feedback using the validated Narrative Evaluation Quality Instrument (NEQI). By comparing AI-generated scores with those of trained human reviewers, we aim to explore how such technologies can advance the DHA's vision: delivering scalable, efficient, and high-quality feedback systems that ensure both a medically ready force and a ready medical force.

METHODS: Narrative assessments from a midsize, multi-site, university-affiliated internal medicine residency program were compiled and de-identified. ChatGPT 4.0 was trained to score narrative assessments across the three domains of the NEQI, with iterative refinement to improve scoring accuracy by drawing on the model's natural language processing and its adaptation to human nuance in previously human-scored narrative assessments.
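The abstract does not specify the mechanism by which ChatGPT 4.0 was "trained"; a common approach for this kind of task is few-shot prompting, in which previously human-scored assessments are supplied as worked examples alongside the rubric. The sketch below illustrates that idea only; the rubric wording, domain names, and example texts are placeholders of my own, not the study's actual materials.

```python
# Hypothetical sketch of few-shot prompt assembly for NEQI scoring.
# The rubric text and domain names below are placeholders, not the
# validated NEQI wording; the study's actual prompts are not published
# in this abstract.

NEQI_RUBRIC = (
    "You score resident narrative assessments using the NEQI across "
    "three domains (placeholder names: coverage, specificity, "
    "actionability). Reply with one 'domain: score' line per domain."
)

def build_fewshot_messages(scored_examples, new_narrative):
    """Assemble a chat-message list: the rubric as a system message,
    prior human-scored narratives as worked user/assistant example
    pairs, then the new narrative to be scored."""
    messages = [{"role": "system", "content": NEQI_RUBRIC}]
    for narrative, human_scores in scored_examples:
        messages.append({"role": "user", "content": narrative})
        messages.append({"role": "assistant", "content": human_scores})
    messages.append({"role": "user", "content": new_narrative})
    return messages
```

In practice the resulting message list would be sent to the model's chat endpoint, and the reply parsed into per-domain scores; iterative refinement would then amount to adjusting the rubric text and example set against human-scored references.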
Once the ChatGPT 4.0 model was trained on several iterations of prior, distinct assessments, it was used to score 180 narrative assessments. For each assessment, the average human NEQI score was compared with the ChatGPT 4.0 NEQI score using inter-rater reliability analysis in IBM SPSS to determine the ICC and Cronbach's alpha between ChatGPT and human scorers.

RESULTS: The intraclass correlation coefficient (ICC) for average measures between ChatGPT and human scorers was 0.826 (95% CI: 0.766–0.870). The result was statistically significant (F(179, 179) = 5.743, p < .001). Cronbach's alpha was 0.826.

DISCUSSION: This study explored the use of ChatGPT 4.0, a natural language processing AI model, to evaluate the quality of resident narrative assessments using the validated NEQI. With an ICC and Cronbach's alpha of 0.826, ChatGPT 4.0 demonstrated strong agreement with human scorers, supporting its reliability and consistency as an evaluative tool. These findings underscore AI's emerging potential to transform administrative and educational processes across military graduate medical education. High-quality, timely feedback is foundational to the development of competent, confident military physicians capable of delivering exceptional outcomes in any operational setting. The scalability of this approach offers a joint-force benefit, enabling Army, Navy, and Air Force medical education programs to optimize feedback processes uniformly, contributing to a fighting force that is medically ready and an education system that is mission-focused, efficient, and future-ready.
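The study computed its reliability statistics in IBM SPSS. For readers without SPSS, both quantities can be sketched in plain NumPy. Note that the consistency-type, average-measures ICC (ICC(3,k) in the Shrout–Fleiss taxonomy) is algebraically identical to Cronbach's alpha, which is likely why the abstract reports the same value (0.826) for both. This is an illustrative sketch, not the study's SPSS procedure; function names are my own.

```python
import numpy as np

def cronbach_alpha(ratings):
    """Cronbach's alpha for an (n_subjects, k_raters) score matrix."""
    ratings = np.asarray(ratings, dtype=float)
    k = ratings.shape[1]
    rater_vars = ratings.var(axis=0, ddof=1)       # each rater's score variance
    total_var = ratings.sum(axis=1).var(ddof=1)    # variance of per-subject sums
    return (k / (k - 1)) * (1 - rater_vars.sum() / total_var)

def icc_3k(ratings):
    """ICC(3,k): two-way mixed effects, consistency, average measures."""
    ratings = np.asarray(ratings, dtype=float)
    n, k = ratings.shape
    grand = ratings.mean()
    row_means = ratings.mean(axis=1)               # per-subject means
    col_means = ratings.mean(axis=0)               # per-rater means
    ss_rows = k * ((row_means - grand) ** 2).sum()
    ss_cols = n * ((col_means - grand) ** 2).sum()
    ss_err = ((ratings - grand) ** 2).sum() - ss_rows - ss_cols
    ms_rows = ss_rows / (n - 1)                    # between-subjects mean square
    ms_err = ss_err / ((n - 1) * (k - 1))          # residual mean square
    return (ms_rows - ms_err) / ms_rows
```

Applied to a two-column matrix of AI scores and average human scores, the two functions return the same value, mirroring the matched ICC and alpha in the results above.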