Retrieval-Augmented Generation for Medical Question Answering on a Heart Failure Dataset: Performance Analysis
Background: The integration of retrieval-augmented generation (RAG) systems into medical question answering (QA) presents a significant opportunity to enhance the effectiveness and accuracy of clinical support systems.

Objective: This study aimed to explore design choices within the RAG framework and the use of large language model (LLM) classifiers to optimize medical QA systems, improving response quality for patient and caregiver queries of varying risk levels.

Methods: We curated a dataset of 109 patient and caregiver questions related to heart failure (HF), categorized into answerable (direct, fact-based queries), helpful deferral (general guidance or lifestyle advisory queries), and nonanswerable (out-of-scope, high-risk, or medical intervention queries) types, along with relevant documents and a target answer for each question from the website . Using a system architecture that combines RAG with a structured query taxonomy and robust classification mechanisms, this paper provides an empirical assessment of medical QA on an HF dataset and introduces a QA system pipeline design, offering a foundation for extension to other medical fields. Specifically, we evaluated design choices in the initial retrieval stage of RAG and their impact on performance. We assessed final answer quality from the generation stage using popular passage scoring methods for QA: Recall-Oriented Understudy for Gisting Evaluation (ROUGE), BERTScore, and Intersection over Union score.

Results: The pipeline first applies an LLM-based classifier, which achieved 65% accuracy for answerable and helpful deferral queries and 100% accuracy for identifying nonanswerable queries.
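The answer-quality metrics named above can be sketched concretely. The following minimal Python sketch shows token-level Intersection over Union and ROUGE-1 recall; the tokenization and any preprocessing are illustrative assumptions, not the study's actual evaluation code (which typically relies on standard ROUGE implementations).

```python
import re
from collections import Counter


def _tokens(text: str) -> list[str]:
    # Simple lowercase word tokenization; the study's exact preprocessing
    # is not specified, so this is an assumption for illustration.
    return re.findall(r"[a-z0-9]+", text.lower())


def iou_score(gold: str, system: str) -> float:
    """Intersection over Union of the two answers' token sets."""
    g, s = set(_tokens(gold)), set(_tokens(system))
    union = g | s
    return len(g & s) / len(union) if union else 0.0


def rouge1_recall(gold: str, system: str) -> float:
    """ROUGE-1 recall: fraction of gold-answer unigrams (counted with
    multiplicity) that also appear in the system answer."""
    g, s = Counter(_tokens(gold)), Counter(_tokens(system))
    overlap = sum(min(count, s[tok]) for tok, count in g.items())
    total = sum(g.values())
    return overlap / total if total else 0.0
```

A higher IoU rewards answers that overlap the gold answer without extra material, which is why a more concise system answer can raise IoU even while ROUGE-1 recall drops.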
In information retrieval, the BioMedical Contrastive Pre-trained Transformers (MedCPT) cross-encoder performed best as a dense retrieval method, achieving an average recall@7 of 93% by ranking documents by relevance score, where recall@k denotes recall computed over the top-k retrieved items. For further retrieving snippets from those documents, its average performance was 72.5% for sentence-level snippets and 83% for paragraph-level snippets. A second LLM-based classifier, used to refine the generated responses, reduced ROUGE-1 recall by 13% and BERTScore precision by 11%. However, Intersection over Union scores, which measure the overlap between "gold" answers and system answers, increased by 24%, demonstrating closer alignment with ground-truth responses and an improved ability to generate concise, accurate medical responses.

Conclusions: The implementation of a structured RAG framework paired with LLM classifiers for medical QA offers a promising avenue for enhancing clinical decision support systems. By systematically analyzing the impact of query taxonomy, retrieval configurations, and response strategies, this approach clarifies the relative importance of each component within a medical RAG system using an HF dataset. Our findings provide actionable guidance on design choices that maximize retrieval and response accuracy, informing the development of robust, scalable medical QA systems.
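The retrieval metric used above, recall@k, can be computed as in the following minimal sketch. The function names and the representation of queries as (ranked list, relevant set) pairs are illustrative assumptions; the study's evaluation code is not shown here.

```python
def recall_at_k(ranked_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top-k
    positions of the ranked retrieval list."""
    if not relevant_ids:
        return 0.0
    top_k = set(ranked_ids[:k])
    return len(top_k & relevant_ids) / len(relevant_ids)


def mean_recall_at_k(runs: list[tuple[list[str], set[str]]], k: int) -> float:
    """Average recall@k over a set of queries, e.g. the dataset's
    questions; `runs` pairs each query's ranked list with its gold set."""
    return sum(recall_at_k(ranked, rel, k) for ranked, rel in runs) / len(runs)
```

Under this definition, the reported 93% recall@7 means that, averaged over queries, 93% of each query's relevant documents appear among the cross-encoder's top 7 ranked results.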