Nam Dao, MD
Rank
Fellow or Postdoc
Department
Medicine
Pulmonary and Critical Care
Authors
Nam Dao*, MD, Aaron B. Waxman, MD, PhD, George R. Washko, MD, MS, Farbob N. Rahaghi, MD, PhD
Principal Investigator
Nam Dao, MD
Twitter / Website
Categories
Introduction
Accurate extraction of unstructured data, such as hemodynamic measurements from right heart catheterization (RHC) reports, is crucial for clinical care and research. Leveraging large language models (LLMs), introduces a significant challenge: hallucinations. Minimizing these hallucinations is crucial to maintaining the integrity of clinical data and ensuring reproducible research outcomes.
Methods
We developed an extraction pipeline incorporating advanced techniques to constrain LLM outputs to only valid data. The process begins with prompt engineering that employs chain-of-thought techniques, a predefined JSON schema, and role assignments. This approach guides the LLM to focus exclusively on extracting requested values.
Our pipeline includes multiple layers of validation. During extraction, we use the Pydantic framework to enforce compliance with specified formats, and clinically valid ranges. This step safeguards against unstructured results and prevents invalid outputs. Afterwards, a source-validation procedure verifies LLM outputs against the original reports, ensuring that only documented values are retained.
Results
Preliminary results, based on clinician evaluation, demonstrate an average accuracy of 95% for LLM outputs compared to original data, based on 1,820 requested values from 130 RHC reports.
Conclusion
This modular pipeline effectively reduces LLM-induced hallucinations while expediting unstructured data extraction. It highlights the importance of integrating validation techniques with clinical expertise to advance AI healthcare applications.