Eposter Presentation

Abstract
Comparing AI Language Models and Human Experts: Benchmarking Risk Assessment and Feature Stratification Among Nine Models in Stage IV Prostate Cancer
Podium Abstract
Clinical Research
AI in Urology
Authors' Information
Number of authors: 3
Country: Taiwan
Chung-You Tsai (pgtsai@gmail.com), Division of Urology, Department of Surgery, Far Eastern Memorial Hospital, New Taipei City, Taiwan; Department of Electrical Engineering, Yuan Ze University, Taoyuan, Taiwan
Wei Tu* (ms0344624@gmail.com), Division of Urology, Department of Surgery, Far Eastern Memorial Hospital, New Taipei City, Taiwan
Shi-Wei Huang (swhunag38@ntu.edu.tw), Department of Urology, National Taiwan University Hospital, Yunlin Branch, Yunlin, Taiwan; Department of Urology, National Taiwan University Hospital, Taipei, Taiwan

Abstract Content
Feature stratification (FS) and risk assessment (RA) based on multimodal imaging and pathology reports are essential in guiding treatment decisions for stage IV prostate cancer (PC). This study assesses the performance of nine large language models (LLMs) in FS and RA tasks compared to human experts.
We obtained text-based clinical reports from computed tomography, magnetic resonance imaging, bone scans, and biopsy pathology from 314 patients with stage IV PC. The study assessed the performance of nine LLMs categorized by scale: large-scale (o1-preview, Claude-3.5-sonnet, ChatGPT-4o, ChatGPT-4-turbo, Gemini-1.5-pro, Meta-Llama-3.1-405B), medium-scale (Meta-Llama-3.1-70B), and small-scale (Meta-Llama-3.1-8B, Medllama3-v20). Each LLM was evaluated on three RA tasks (LATITUDE, CHAARTED, TwNHI) and seven FS tasks, including TNM staging, detection of bone and visceral metastases, and quantification of metastatic sites. The models were queried via application programming interfaces (APIs) using zero-shot chain-of-thought prompting, and their outputs were assessed through repeated single-round queries and an ensemble voting strategy (sketched below). Performance was benchmarked against a gold-standard consensus from three human experts, with accuracy and consistency (measured using the intraclass correlation coefficient, ICC) as the primary evaluation metrics. Generalized estimating equations were used to compare model performance across multiple queries with that of human experts.
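For illustration, the following is a minimal sketch of the repeated single-round query and ensemble-voting protocol described above, assuming an OpenAI-style chat-completions client. The model name, prompt wording, number of repetitions, and the parse_label() helper are illustrative assumptions, not the study's actual materials.

```python
# Minimal sketch: zero-shot chain-of-thought querying with majority voting.
# Assumes an OpenAI-style chat-completions API (openai>=1.0); the prompt
# text, model name, and label parsing are hypothetical stand-ins.
from collections import Counter

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

ZERO_SHOT_COT_PROMPT = (
    "Based on the clinical reports below, classify this stage IV prostate "
    "cancer patient as LATITUDE high-risk or low-risk. Let's think step by "
    "step, then state the final label alone on the last line.\n\n{report}"
)


def parse_label(text: str) -> str:
    """Hypothetical parser: read the final label from the model's last line."""
    last_line = text.strip().splitlines()[-1].lower()
    return "high-risk" if "high" in last_line else "low-risk"


def query_once(report: str, model: str = "gpt-4o") -> str:
    """One single-round, zero-shot chain-of-thought query via the API."""
    resp = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "user", "content": ZERO_SHOT_COT_PROMPT.format(report=report)}
        ],
    )
    return parse_label(resp.choices[0].message.content)


def ensemble_vote(report: str, n_queries: int = 5) -> str:
    """Repeat the single-round query and return the majority label."""
    votes = Counter(query_once(report) for _ in range(n_queries))
    return votes.most_common(1)[0][0]
```

Per-query labels produced this way would then feed the accuracy comparison against the expert consensus and the ICC computation across repeated runs.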
Among the 314 patients, 115 (32.8%) were classified as LATITUDE high-risk, 128 (36.5%) as CHAARTED high-volume, and 94 (27%) as TwNHI high-risk. State-of-the-art (SOTA) LLMs achieved accuracy comparable to human experts in RA and FS tasks. Specifically, o1-preview (95.22%-96.50%) and Claude-3.5-sonnet (93.63%-96.50%) matched human expert accuracy (92.36%-96.73%) across three RA and five FS tasks, with no significant differences in most comparisons. Notably, SOTA LLMs outperformed human experts in detecting regional lymph node (N1) and distant (M1a) metastases, showing higher accuracy and ICC. Closed-source LLMs consistently outperformed open-source models in both RA and FS tasks: in the LATITUDE RA task, o1-preview achieved 95.22% accuracy, while the open-source Meta-Llama-3.1 models (405B, 70B, and 8B) scored 88.54%, 72.61%, and 42.68%, respectively, underscoring the advantage of larger proprietary models. LLM performance followed the expected scaling trend, with large-scale models outperforming medium- and small-scale ones (large > medium > small). In addition, higher-accuracy models exhibited greater consistency across multiple queries, with SOTA LLMs surpassing human experts in ICC.
Among the evaluated LLMs, o1-preview and Claude-3.5-sonnet—despite not being specifically trained for medical applications—achieved state-of-the-art performance, matching human expert accuracy while demonstrating superior consistency in FS and RA for stage IV PC. These results highlight their potential for clinical decision support.
Artificial intelligence, Prostate cancer, Large language model, Risk assessment, Feature stratification, Clinical decision support, ChatGPT
 
Presentation Details