Eposter Presentation
 
Accepted format: PDF. The file size must not exceed 5 MB
 
Accepted format: PNG/JPG/WEBP. The file size must not exceed 2 MB
 
Withdrawn
Abstract
An In-Depth Comparative Analysis of Four Large Language AI Models for Risk Assessment and Information Retrieval from Multi-Modality Prostate Cancer Work-up Reports
Moderated Poster Abstract
Clinical Research
AI in Urology
Author's Information
4
No more than 10 authors can be listed (as per the Good Publication Practice (GPP) Guidelines).
Please ensure the authors are listed in the right order.
Taiwan
Yen-Chun Lin u102001412@gmail.com National Taiwan University Hospital, Yunlin Branch Department of Urology Yunlin Taiwan *
Lun-Hsiang Yuan lunhsiang.yuan@gmail.com National Taiwan University Hospital, Yunlin Branch Department of Urology Yunlin Taiwan
Chung-You Tsai pgtsai@gmail.com Far Eastern Memorial Hospital Division of Urology, Department of Surgery New Taipei Taiwan
Shi-Wei Huang will6438.huang@gmail.com National Taiwan University Hospital, Yunlin Branch Department of Urology Yunlin Taiwan
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
Abstract Content
This study compares four general-purpose large language models (LLMs) in prostate cancer information retrieval (IR) and risk assessment (RA), highlighting performance differences across multifaceted clinical tasks.
We compared the performance of four LLMs (ChatGPT-4-turbo, Claude-3-opus, Gemini-pro-1.0, and ChatGPT-3.5-turbo) on three RA tasks (LATITUDE, CHAARTED, TwNHI) and seven IR tasks. The study used simulated text reports of computed tomography, magnetic resonance imaging, bone scans, and biopsy pathology for stage IV PC patients. The tasks covered TNM staging and the detection and quantification of bone and visceral metastases, offering a broad evaluation of the models' ability to process diverse clinical data. We used zero-shot chain-of-thought prompting via API to query the LLMs with the multi-modality reports. Model performance was assessed against a consensus standard set by three adjudicators, using repeated single-query methods and ensemble voting, and evaluated on six outcome metrics.
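For illustration only, a minimal Python sketch of the repeated-query and ensemble-voting procedure described above. The query_llm helper stands in for the vendor-specific API call, and the prompt wording, label format, and three-vote majority rule are assumptions rather than the exact protocol used in the study.

```python
from collections import Counter

N_QUERIES = 3  # assumed number of repeated single-round queries per report


def query_llm(model: str, prompt: str) -> str:
    """Placeholder for the vendor-specific chat-completion API call
    (ChatGPT-4-turbo, Claude-3-opus, Gemini-pro-1.0, ChatGPT-3.5-turbo)."""
    raise NotImplementedError


def classify_report(model: str, report_text: str, task: str) -> str:
    # Zero-shot chain-of-thought prompt; the wording here is illustrative only.
    prompt = (
        f"Task: {task}.\n"
        f"Report:\n{report_text}\n"
        "Think step by step, then answer with a single risk label."
    )
    return query_llm(model, prompt).strip().lower()


def ensemble_vote(model: str, report_text: str, task: str) -> str:
    # Repeat the single-round query and take the majority label across runs.
    answers = [classify_report(model, report_text, task) for _ in range(N_QUERIES)]
    return Counter(answers).most_common(1)[0][0]
```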
In a simulated analysis of 350 stage IV PC patient reports, 115 (32.8%) were classified as LATITUDE high risk, 128 (36.5%) as CHAARTED high volume, and 94 (27.0%) as TwNHI high risk. Ensemble voting based on three repeated single-round queries consistently enhanced accuracy, with a higher likelihood of achieving non-inferior results than a single query. The four models showed small differences in IR tasks, achieving high accuracy rates (87.4%-94.2%) and consistent TNM staging results (ICC > 0.8). In contrast, notable variations emerged in RA tasks, with performance ranked from highest to lowest as ChatGPT-4-turbo, Claude-3-opus, Gemini-pro-1.0, and ChatGPT-3.5-turbo. ChatGPT-4-turbo outperformed the others in accuracy (90.1%, 90.7%, 91.6%) and consistency (ICC 0.86, 0.93, 0.76) across the three RA tasks. Its high sensitivity and NPV also make it suitable for ruling out high-risk patients.
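The rule-out claim rests on sensitivity and negative predictive value (NPV). Below is a minimal sketch of how these are derived from binary model predictions against the adjudicated consensus standard; the function name is illustrative, and the abstract does not enumerate all six outcome metrics used in the study.

```python
def binary_metrics(y_true: list[bool], y_pred: list[bool]) -> dict[str, float]:
    """Accuracy, sensitivity, and NPV for a binary high-risk classification,
    where True = high risk according to the consensus standard."""
    tp = sum(t and p for t, p in zip(y_true, y_pred))
    tn = sum((not t) and (not p) for t, p in zip(y_true, y_pred))
    fn = sum(t and (not p) for t, p in zip(y_true, y_pred))
    return {
        "accuracy": (tp + tn) / len(y_true),
        # High sensitivity: few truly high-risk patients are missed.
        "sensitivity": tp / (tp + fn) if (tp + fn) else float("nan"),
        # High NPV: a "not high risk" call is rarely wrong, supporting rule-out use.
        "npv": tn / (tn + fn) if (tn + fn) else float("nan"),
    }
```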
This study shows that ChatGPT-4-turbo was the most effective of the LLMs tested, excelling in RA tasks for stage IV PC with high accuracy and consistency, and highlights its potential as a clinical decision support tool.
Prostate cancer, large language model, risk assessment, information retrieval, clinical decision support, ChatGPT
https://storage.unitedwebnetwork.com/files/1237/77af9645be407bb8225abf07a7c75750.png
 
 
 
 
 
 
 
 
 
2230
 
Presentation Details