Eposter Presentation

Abstract
Comparing AI Language Models and Human Experts: Benchmarking Risk Assessment and Feature Stratification Among Nine Models in Stage IV Prostate Cancer
Podium Abstract
Clinical Research
AI in Urology
Authors' Information
Number of authors: 3
Country: Taiwan
Chung-You Tsai (pgtsai@gmail.com), Division of Urology, Department of Surgery, Far Eastern Memorial Hospital, New Taipei City, Taiwan; Department of Electrical Engineering, Yuan Ze University, Taoyuan, Taiwan
Wei Tu* (ms0344624@gmail.com), Division of Urology, Department of Surgery, Far Eastern Memorial Hospital, New Taipei City, Taiwan
Shi-Wei Huang (swhunag38@ntu.edu.tw), Department of Urology, National Taiwan University Hospital, Yunlin Branch, Yunlin, Taiwan; Department of Urology, National Taiwan University Hospital, Taipei, Taiwan

Abstract Content
Feature stratification (FS) and risk assessment (RA) based on multimodal imaging and pathology reports are essential in guiding treatment decisions for stage IV prostate cancer (PC). This study assesses the performance of nine large language models (LLMs) in FS and RA tasks compared to human experts.
We obtained text-based clinical reports from computed tomography, magnetic resonance imaging, bone scans, and biopsy pathology from 314 patients with stage IV PC. The study assessed the performance of nine LLMs categorized by scale: large-scale (o1-preview, Claude-3.5-sonnet, ChatGPT-4o, ChatGPT-4-turbo, Gemini-1.5-pro, Meta-Llama-3.1-405B), medium-scale (Meta-Llama-3.1-70B), and small-scale (Meta-Llama-3.1-8B, Medllama3-v20). Each LLM was evaluated on three RA tasks (LATITUDE, CHAARTED, TwNHI) and seven FS tasks, including TNM staging, detection of bone and visceral metastases, and quantification of metastatic sites. The models were queried via application programming interfaces (APIs) using zero-shot chain-of-thought prompting, and their outputs were assessed through repeated single-round queries and an ensemble voting strategy (sketched below). Performance was benchmarked against a gold-standard consensus from three human experts, with accuracy and consistency (measured using the intraclass correlation coefficient, ICC) as the primary evaluation metrics. Generalized estimating equations were used to compare model performance across multiple queries with that of human experts.
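For illustration, the following is a minimal sketch of the repeated single-round query and ensemble-voting protocol described above, assuming an OpenAI-style chat-completions client. The model name, prompt wording, number of repetitions, and the parse_label() helper are illustrative assumptions, not the study's actual materials.

```python
# Minimal sketch: zero-shot chain-of-thought querying with majority voting.
# Assumes an OpenAI-style chat-completions API (openai>=1.0); the prompt
# text, model name, and label parsing are hypothetical stand-ins.
from collections import Counter

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

ZERO_SHOT_COT_PROMPT = (
    "Based on the clinical reports below, classify this stage IV prostate "
    "cancer patient as LATITUDE high-risk or low-risk. Let's think step by "
    "step, then state the final label alone on the last line.\n\n{report}"
)


def parse_label(text: str) -> str:
    """Hypothetical parser: read the final label from the model's last line."""
    last_line = text.strip().splitlines()[-1].lower()
    return "high-risk" if "high" in last_line else "low-risk"


def query_once(report: str, model: str = "gpt-4o") -> str:
    """One single-round, zero-shot chain-of-thought query via the API."""
    resp = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "user", "content": ZERO_SHOT_COT_PROMPT.format(report=report)}
        ],
    )
    return parse_label(resp.choices[0].message.content)


def ensemble_vote(report: str, n_queries: int = 5) -> str:
    """Repeat the single-round query and return the majority label."""
    votes = Counter(query_once(report) for _ in range(n_queries))
    return votes.most_common(1)[0][0]
```

Per-query labels produced this way would then feed the accuracy comparison against the expert consensus and the ICC computation across repeated runs.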
Among the 314 patients, 115 (32.8%) were classified as LATITUDE high-risk, 128 (36.5%) as CHAARTED high-volume, and 94 (27%) as TwNHI high-risk. State-of-the-art (SOTA) LLMs achieved accuracy comparable to human experts in RA and FS tasks. Specifically, o1-preview (95.22%-96.50%) and Claude-3.5-sonnet (93.63%-96.50%) matched human expert accuracy (92.36%-96.73%) across three RA and five FS tasks, with no significant differences in most comparisons. Notably, SOTA LLMs outperformed human experts in detecting regional lymph node (N1) and distant (M1a) metastases, showing higher accuracy and ICC. Closed-source LLMs consistently outperformed open-source models in both RA and FS tasks: in the LATITUDE RA task, o1-preview achieved 95.22% accuracy, while the open-source Meta-Llama-3.1 models (405B, 70B, and 8B) scored 88.54%, 72.61%, and 42.68%, respectively, underscoring the advantage of larger proprietary models. LLM performance followed the expected scaling trend, with large-scale models outperforming medium- and small-scale ones (large > medium > small). In addition, higher-accuracy models exhibited greater consistency across multiple queries, with SOTA LLMs surpassing human experts in ICC.
Among the evaluated LLMs, o1-preview and Claude-3.5-sonnet—despite not being specifically trained for medical applications—achieved state-of-the-art performance, matching human expert accuracy while demonstrating superior consistency in FS and RA for stage IV PC. These results highlight their potential for clinical decision support.
Artificial intelligence, Prostate cancer, Large language model, Risk assessment, Feature stratification, Clinical decision support, ChatGPT
 
Presentation Details