Name
#27 Self-Hosted, Open-Source RAG for Federal Health: From Laptop to Low-Cost Cloud
Speakers
Content Presented On Behalf Of:
DHA
Session Type
Poster
Date
Tuesday, March 3, 2026
Start Time
5:00 PM
End Time
7:00 PM
Location
Prince Georges Expo Hall E
Focus Areas/Topics
Clinical Care, Technology, Policy/Management/Administrative
Learning Outcomes
1) Explain why hybrid retrieval paired with reciprocal rank fusion often outperforms single-method retrieval in retrieval-augmented generation systems, and identify when to use it
2) Assemble a fully offline retrieval-augmented generation stack using safetensors models and a local vector database, and benchmark large language model × retriever × re-ranker combinations with a reproducible harness
3) Apply parameter-efficient fine-tuning to adapt small local large language models to a domain while keeping compute and storage affordable
4) Incorporate AI governance into enterprise reporting to build trust in and ensure the performance of generative AI solutions
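The reciprocal rank fusion named in outcome 1 can be sketched minimally as follows; the document IDs and rankings are illustrative, and k = 60 is the constant commonly used in the RRF literature, not a value stated in this abstract:

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranking.

    Each document's fused score is the sum over input rankings of
    1 / (k + rank), where rank is 1-based.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Sparse (BM25) and dense rankings that partially disagree:
bm25_ranking = ["d1", "d2", "d3"]
dense_ranking = ["d3", "d1", "d4"]
fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
# d1 ranks first because it sits near the top of both lists.
```

Documents favored by both retrievers accumulate score from both lists, which is why fusion tends to be more robust than either single method alone.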
Description
Problem:
The Defense Health Agency (DHA) relies on large internal document corpora for clinical, operational, and policy decisions. Searching, citing, understanding, and synthesizing information across these corpora are standard capabilities of large language models (LLMs) equipped with retrieval-augmented generation (RAG). However, the LLM solutions available within the DHA enterprise are cost-prohibitive at scale, lack RAG capability, and/or do not facilitate appropriate governance of these corpora.
Objective:
We present and evaluate a reproducible, end-to-end RAG blueprint that runs on a standard government laptop, requires no paid models or tokens, and is designed for self-hosted environments while preserving a clear path to cloud scale.
Methods:
Our approach combines sparse (Best Match 25 [BM25]) and dense retrieval via Reciprocal Rank Fusion (RRF) to improve coverage and robustness. We add lightweight cross-encoder re-ranking to improve answer focus and use parameter-efficient fine-tuning (PEFT) to specialize open-source LLMs to DHA domains without retraining full models. A compact offline harness evaluates RAG pipelines end to end (LLM × retriever × re-ranker) on DHA-like Q/A sets, with a local vector index to ensure portability and reproducibility. Governance checks quantify retrieval quality, groundedness, and hallucination risk, enabling defensible, stakeholder-friendly reporting. To address cost, we benchmark laptop performance and map those footprints to small cloud instances, demonstrating that the same design translates to low cloud compute costs. Because the stack is fully open-source and self-hosted, we avoid per-token fees and subscription costs while achieving comparable results on targeted retrieval tasks.
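The LLM × retriever × re-ranker sweep described above can be sketched as a small grid loop. The component names, the `run_pipeline` stand-in, and the exact-match scorer below are illustrative assumptions, not the actual harness:

```python
from itertools import product

# Hypothetical component names; real runs would load open-source
# safetensors checkpoints and query a local vector index.
LLMS = ["llm-small", "llm-medium"]
RETRIEVERS = ["bm25", "dense", "hybrid-rrf"]
RERANKERS = ["none", "cross-encoder"]

def run_pipeline(llm, retriever, reranker, question):
    """Stand-in for the real pipeline: retrieve passages, optionally
    re-rank them, then generate a grounded answer."""
    raise NotImplementedError

def benchmark(qa_set, run=run_pipeline, score=None):
    """Score every LLM x retriever x re-ranker combination on qa_set.

    Returns a dict mapping (llm, retriever, reranker) to mean score.
    A real harness would also record latency and governance metrics
    (groundedness, hallucination risk) per configuration.
    """
    if score is None:
        score = lambda pred, gold: float(pred == gold)
    results = {}
    for llm, retriever, reranker in product(LLMS, RETRIEVERS, RERANKERS):
        total = sum(score(run(llm, retriever, reranker, q), gold)
                    for q, gold in qa_set)
        results[(llm, retriever, reranker)] = total / len(qa_set)
    return results
```

Keeping the sweep as a plain nested product over named components is what makes the evaluation reproducible: every configuration is enumerated explicitly, and results are keyed by the full component triple.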
Results:
We will report accuracy, latency, and governance metrics across multiple configurations, showing that hybrid retrieval with RRF, cross-encoder re-ranking, and PEFT specialization consistently improves answer faithfulness and relevance over single-method baselines, with stable performance on modest hardware. We will also provide cost analyses that quantify the eliminated token fees and demonstrate the feasibility of low-compute cloud deployment derived directly from the laptop profile.