Application of LLMs in CAD-RADS Classification and Patient Management.
Authors: Tarkowski P, Muscogiuri G, Casartelli D, Coraducci F, Sassi F, Usai J, Staśkiewicz G, Siek E, Byczkowski J, Licu RA
Coronary heart disease
Abstract
PURPOSE: To evaluate the capability of four publicly available large language models (LLMs) to assign Coronary Artery Disease-Reporting and Data System (CAD-RADS) scores and to provide patient management recommendations based on synthetic coronary CT angiography (CCTA) reports.
METHODS: Four LLMs (ChatGPT 4o, Claude 3.7, DeepSeek, and Gemini 2.5 Pro) were tasked with analyzing the reports and suggesting next steps. Prompts were framed from the perspective of both a cardiologist and a radiologist. Agreement with a human reference standard was assessed using weighted Cohen's kappa, Fleiss' kappa, and Krippendorff's alpha for CAD-RADS scoring, and unweighted Cohen's kappa for management recommendations. A Bayesian Wilcoxon signed-rank test was performed to assess directional bias.
RESULTS: Performance varied across LLMs and prompt identities. Claude 3.7 achieved almost perfect agreement for CAD-RADS scoring (κ = 0.997) regardless of prompt identity. Gemini similarly achieved almost perfect agreement (radiologist: κ = 0.962; cardiologist: κ = 0.990). ChatGPT demonstrated almost perfect agreement when prompted as a radiologist (κ = 0.896) but only substantial agreement when prompted as a cardiologist (κ = 0.715). DeepSeek showed the lowest overall performance (radiologist: κ = 0.637; cardiologist: κ = 0.768). By category, all LLMs correctly identified CAD-RADS 0, whereas higher-grade stenosis (4A/4B) remained the most challenging, with the non-Claude models showing low-to-null agreement in some configurations. The LLMs' accuracy in proposing further management was considerably lower than their scoring accuracy, with CAD-RADS 3 showing the greatest variability in management recommendations across models and between human specialists. Furthermore, both CAD-RADS scoring and management recommendations varied depending on the professional identity specified in the prompt.
CONCLUSION: While LLMs demonstrated reliable scoring performance for lower-grade CAD-RADS categories (0-2), agreement was substantially reduced for higher-grade stenosis categories (4A/4B) and non-diagnostic studies, a shortfall that could pose risks to patients. The models' current ability to generate dependable clinical management recommendations is limited.
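
To illustrate the agreement statistic named in METHODS, the following is a minimal Python sketch of weighted Cohen's kappa for ordinal CAD-RADS labels. It is not the authors' analysis code. The abstract does not state the weighting scheme, so linear weights are assumed by default. The function names `weighted_kappa` and `landis_koch` are hypothetical, the example labels are made up, and category N (non-diagnostic) is omitted because it has no ordinal position. The verbal bands follow the Landis and Koch convention, which matches the abstract's wording ("substantial", "almost perfect") but is not named in it.

```python
import math
from collections import Counter

# CAD-RADS categories in ordinal order. "N" (non-diagnostic) is left out
# of this sketch because it has no natural place on the ordinal scale.
CATEGORIES = ["0", "1", "2", "3", "4A", "4B", "5"]


def weighted_kappa(rater_a, rater_b, categories=CATEGORIES, weighting="linear"):
    """Weighted Cohen's kappa for two raters scoring the same cases.

    weighting: "linear" (assumed default) or "quadratic".
    Returns NaN when chance-expected disagreement is zero (kappa undefined).
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must score the same non-empty set of cases")
    if weighting not in ("linear", "quadratic"):
        raise ValueError("weighting must be 'linear' or 'quadratic'")

    index = {c: i for i, c in enumerate(categories)}
    k, n = len(categories), len(rater_a)
    power = 1 if weighting == "linear" else 2

    def weight(i, j):
        # Disagreement weight: 0 on the diagonal, 1 for the farthest pair.
        return (abs(i - j) / (k - 1)) ** power

    pairs = Counter((index[a], index[b]) for a, b in zip(rater_a, rater_b))
    marg_a = Counter(index[a] for a in rater_a)
    marg_b = Counter(index[b] for b in rater_b)

    # Observed weighted disagreement, from the joint label counts.
    observed = sum(weight(i, j) * c / n for (i, j), c in pairs.items())
    # Chance-expected weighted disagreement, from the two marginals.
    expected = sum(
        weight(i, j) * marg_a[i] * marg_b[j] / n**2
        for i in range(k)
        for j in range(k)
    )
    if expected == 0:
        return float("nan")
    return 1.0 - observed / expected


def landis_koch(kappa):
    """Map a kappa value to the Landis and Koch verbal band (assumed scale)."""
    if math.isnan(kappa):
        raise ValueError("kappa is undefined")
    if kappa < 0:
        return "poor"
    for upper, label in [(0.20, "slight"), (0.40, "fair"),
                         (0.60, "moderate"), (0.80, "substantial")]:
        if kappa <= upper:
            return label
    return "almost perfect"


if __name__ == "__main__":
    # Made-up labels for illustration only; not data from the study.
    reference = ["0", "1", "2", "3", "4A", "4B", "5", "3"]
    model = ["0", "1", "2", "3", "4B", "4A", "5", "2"]
    kappa = weighted_kappa(reference, model)
    print(f"weighted kappa = {kappa:.3f} ({landis_koch(kappa)})")
```

Linear weights penalize a disagreement in proportion to its distance on the scale, so confusing 4A with 4B costs less than confusing 1 with 4B. Quadratic weights penalize distant disagreements more heavily and would generally give different values on the same labels.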