Job Title
Physician
Academic Rank
Instructor
Department
Medicine
Oncology Hospitalist
Authors
Praveen Meka MD
Background:
Large language models (LLMs) have demonstrated exceptional proficiency in various specialized tasks, yet their effectiveness in clinical decision support has not been thoroughly investigated. Although they achieve remarkable accuracy in other domains such as coding, a true assessment of their potential in medicine requires comprehensive evaluation. This study evaluates the ability of the LLM Claude v3 to analyze blood gases (BGs), a common clinical task, using different decision architectures.
Methods:
We began by creating a comprehensive bank of blood gases (BGs) covering a range of clinical scenarios, each featuring a simple primary and a secondary abnormality. The target sample size was calculated to be 50 to achieve a statistical power of 90%, assuming an effect size >30%. The first step was to use the BG database directly to gauge the baseline accuracy of the Claude v3 LLM. Subsequently, we incorporated different decision architectures and assessed the accuracy of each: prompt engineering, a retrieval-augmented generation (RAG) architecture, a math scratchpad, and finally a query preprocessing layer.
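The abstract does not include implementation details for the query preprocessing layer; the following Python sketch is only an illustration of what such a layer might look like. The helper names (preprocess_bg, build_prompt) and the specific calculations chosen (anion gap, Winter's formula for expected pCO2, delta ratio) are assumptions, not the study's actual code.

```python
# Illustrative sketch only: a query-preprocessing layer that computes
# standard acid-base quantities before the blood gas is sent to the LLM.
# Helper names and formula choices are assumptions, not the study's code.

def preprocess_bg(ph: float, pco2: float, hco3: float,
                  na: float, cl: float) -> dict:
    """Derive values the model tends to miscalculate when left to do arithmetic itself."""
    anion_gap = na - (cl + hco3)                  # normal roughly 8-12 mEq/L
    expected_pco2 = 1.5 * hco3 + 8                # Winter's formula (metabolic acidosis)
    delta_ratio = ((anion_gap - 12) / (24 - hco3)) if hco3 < 24 else None
    return {
        "pH": ph, "pCO2": pco2, "HCO3": hco3,
        "anion_gap": anion_gap,
        "expected_pCO2_if_metabolic_acidosis": expected_pco2,
        "delta_ratio": delta_ratio,
    }

def build_prompt(values: dict) -> str:
    """Embed the precomputed numbers so the model reasons over them directly."""
    lines = [f"{k}: {v}" for k, v in values.items() if v is not None]
    return ("Interpret this blood gas, naming the primary disorder and any "
            "secondary disturbance:\n" + "\n".join(lines))

# Example: anion-gap metabolic acidosis with appropriate respiratory compensation
prompt = build_prompt(preprocess_bg(ph=7.28, pco2=26, hco3=12, na=138, cl=104))
print(prompt)
```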
Results:
Initial results with the Claude large language model (LLM) showed only 50% (25/50) accuracy in interpreting blood gases (BGs). However, incorporating decision architectures, namely the retrieval-augmented generation (RAG) architecture, refined prompts, and a math scratchpad, increased accuracy to 86% (43/50). Finally, the latest architecture, which preprocesses the query by performing the relevant calculations and creating a prompt that is then passed with RAG to the LLM, reached 98% (49/50) accuracy when tested on the database of 50 ABGs.
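For context, a minimal end-to-end sketch of the final architecture (precomputed values, prompt construction, retrieval of reference material, single model call) might look like the following; retrieve_reference and the call_llm callable are hypothetical placeholders standing in for a real vector store and the Claude API, not the study's implementation.

```python
# Illustrative end-to-end sketch of the final architecture described above:
# preprocessed prompt -> retrieved reference text -> one LLM call.
# `retrieve_reference` and `call_llm` are hypothetical placeholders.
from typing import Callable, List

def retrieve_reference(prompt: str, corpus: List[str], k: int = 3) -> List[str]:
    """Naive word-overlap retrieval standing in for a real RAG vector store."""
    query_words = set(prompt.lower().split())
    return sorted(corpus,
                  key=lambda doc: len(query_words & set(doc.lower().split())),
                  reverse=True)[:k]

def interpret_bg(prompt: str, corpus: List[str],
                 call_llm: Callable[[str], str]) -> str:
    """Prepend retrieved acid-base reference text, then make a single model call."""
    context = "\n\n".join(retrieve_reference(prompt, corpus))
    return call_llm("Reference material:\n" + context + "\n\nQuestion:\n" + prompt)
```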
Conclusions:
As a hospitalist, I rely on BG interpretation regularly, and it remains a crucial skill; mixed acid-base disturbances can be particularly challenging. In this study, I compared the LLM's ability to evaluate a database of ABGs with and without decision architectures. I found that the accuracy of LLMs in interpreting BGs can be significantly improved by preprocessing the query.