Name
#163 Evaluating Source-Based Large Language Models for Preclinical Dermatology Education
Content Presented On Behalf Of:
Uniformed Services University
Session Type
Poster
Date
Tuesday, March 3, 2026
Start Time
5:00 PM
End Time
7:00 PM
Location
Prince Georges Expo Hall E
Focus Areas/Topics
Technology, Trending/Hot Topics or Other not listed
Learning Outcomes
Following this session, the attendee will be able to describe current limitations and ethical considerations of large language models (LLMs) in medical education.

Following this session, the attendee will be able to compare the performance of different LLMs, including ChatGPT, Google Gemini, NotebookLM with Notes, and NotebookLM without Notes, in answering standardized dermatology Step 1 questions.

Following this session, the attendee will be able to discuss how source-grounded AI models differ from traditional LLMs and how they might fit within the future classroom under existing learning frameworks (specifically Vygotsky's Zone of Proximal Development and Cognitive Load Theory).
Description
Large language models (LLMs) are artificial intelligence systems that generate responses by predicting the output a user desires. Gaps in dermatology education could benefit from the incorporation of LLMs, but adoption has been hindered by concerns over the accuracy, transparency, and reproducibility of their responses. LLMs have also historically performed inconsistently on standardized medical questions, possibly owing to a lack of representative data in the model's training corpus. NotebookLM (NLM) by Google, a source-based LLM advertised as developing answers from user-uploaded sources and providing reliable citations, may offer a solution to these shortcomings.

Here, we assessed four configurations: NLM with inputted pre-clerkship study guides, NLM with an inputted blank sheet of paper, ChatGPT, and Google Gemini, using 3 trials of all 121 text-based Step 1 dermatology questions in a popular preparation question bank, AMBOSS. Models were evaluated for overall accuracy, accuracy by question difficulty, reproducibility of responses across trials, and agreement in answer selection between models. Data for each of these categories were gathered, charted, and analyzed using chi-squared tests of independence and Fleiss' kappa statistics.

We found that NLM with Notes exhibited significantly more omissions (unanswered questions) than the other LLMs. When omissions were excluded from statistical analysis, ChatGPT-4o Mini had the greatest accuracy (86%). NLM accuracy was unchanged with and without inputted study guides (76% vs. 76%); however, among all LLMs tested, NLM with inputted material had the highest reproducibility (Fleiss' kappa of 0.939). All LLMs tested here performed better than previously reported in the literature, demonstrating rapid progression in LLM capabilities. User-inputted data improved NLM's response completeness and reproducibility, but not its factual accuracy.
One interpretation is that user-inputted data serves more as an end-state 'filter' than as material integrated into core reasoning processes, though the opaque nature of LLM cognition precludes definitive answers. More research is needed to harness the potential of source-based LLMs in the classroom under structured, theory-informed educational roles.
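As an illustrative aside (not part of the poster itself): the reproducibility metric used above, Fleiss' kappa, measures chance-corrected agreement across repeated trials. The sketch below computes it from scratch for the kind of data described in the abstract, where each question has one answer letter per trial; the function name, input shape, and example letters are assumptions for illustration, not the authors' actual analysis code.

```python
from collections import Counter

def fleiss_kappa(ratings):
    """Fleiss' kappa for agreement across repeated trials.

    ratings: one inner list per question, holding that model's chosen
    answer letter on each trial, e.g. [["A", "A", "A"], ["B", "C", "B"], ...].
    Every question must have the same number of trials.
    """
    n = len(ratings[0])                      # trials per question ("raters")
    N = len(ratings)                         # number of questions ("subjects")
    categories = sorted({a for row in ratings for a in row})

    # n_ij table: how many trials chose category j for question i
    table = [[Counter(row)[c] for c in categories] for row in ratings]

    # mean per-question agreement P_bar
    P_bar = sum((sum(x * x for x in row) - n) / (n * (n - 1))
                for row in table) / N

    # chance agreement P_e from marginal category proportions
    p = [sum(row[j] for row in table) / (N * n) for j in range(len(categories))]
    P_e = sum(pj * pj for pj in p)

    return (P_bar - P_e) / (1 - P_e)
```

With perfect trial-to-trial agreement the statistic is 1.0, so a value such as the reported 0.939 indicates near-identical answer selection across the 3 trials.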