Abstract
Background: Acute myeloid leukemia (AML) is a highly heterogeneous malignancy requiring accurate prediction of treatment response and relapse risk. Despite advances in genomics, complex mutation patterns limit prognostic precision. Large language models (LLMs) show promise in healthcare, but their utility in AML outcome prediction remains unexplored.
Methods: We curated multimodal data from 684 patients with newly diagnosed de novo AML (mean age 54±16 years) treated at Zhejiang University Hospital from 2019 to 2022, encompassing demographics, blood counts, genomics (43 fusions and 138 mutations), cytogenetics, treatments, and outcomes. Five state-of-the-art LLMs (Kimi, Qwen, SparkDesk, ChatGPT, DeepSeek) were evaluated using structured prompts on three tasks: (1) treatment response prediction (remission: CRc vs non-CRc); (2) relapse risk prediction; and (3) prognostic feature ranking. Performance on the prediction tasks was assessed via accuracy, precision, recall, and F1-score, and agreement with expert judgments was assessed via cosine similarity.
Results: In treatment response prediction, ChatGPT (with O1) achieved the highest accuracy (72.22%) and F1-score (82.01%), while Kimi performed poorest (57.89% accuracy). In relapse prediction, SparkDesk had the highest accuracy (58.77%), but all models showed low precision (26.36–30.74%), reflecting many false positives, and correspondingly low F1-scores (33.49% for SparkDesk). Notably, in feature ranking, LLMs aligned closely with experts (cosine similarity >0.85). Top-ranked features (e.g., TP53 mutation, CBFB::MYH11 fusion, chromosome 7/17 abnormalities) differed significantly between the CRc and non-CRc groups (p<0.001). However, LLMs overvalued non-discriminative features (e.g., WBC, FCM) compared with experts.
Conclusions: Current LLMs demonstrate insufficient reliability for independent AML outcome prediction (relapse accuracy ≤58.77%). However, their robust capability in identifying clinically relevant prognostic features supports their potential as adjunctive tools to augment decision-making in hematologic malignancies. Future integration of longitudinal data and domain-specific fine-tuning may enhance clinical utility.
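The evaluation metrics named in the Methods can be illustrated with a minimal Python sketch. This is not the authors' code: the function names, the 0/1 label encoding (1 = positive class, e.g. CRc or relapse), and the toy labels are assumptions made purely for illustration.

```python
import math


def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for 0/1 labels (1 = positive class)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    accuracy = (tp + tn) / len(pairs)
    # Guard against empty denominators (no predicted or no actual positives).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length score vectors,
    e.g. LLM-assigned vs expert-assigned feature importance scores."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical toy labels, not data from the study.
y_true = [1, 1, 1, 0, 0, 1, 0, 1]
y_pred = [1, 1, 0, 1, 0, 1, 1, 1]
metrics = binary_metrics(y_true, y_pred)
```

Because precision is computed as TP / (TP + FP), many false positives lower precision and therefore F1 even when accuracy looks moderate, which matches the pattern reported for relapse prediction.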