Introduction Identifying bleeding and clotting events in the electronic health record (EHR) is critical for quality improvement and patient safety efforts but has long been a challenge. Diagnosis codes provide an incomplete picture, while much of the relevant clinical context is buried in free-text notes. Natural language processing, a branch of artificial intelligence, offers tools for extracting this unstructured information. Recent studies have demonstrated excellent performance with fine-tuned transformer models, but these models required substantial labeled data and task-specific training tailored to bleeding or clotting. In this work, we use zero-shot prompting with Generative Pre-trained Transformer 4 (GPT-4) to identify both event types in clinical text. Rather than aiming to optimize model performance, we evaluate whether an off-the-shelf large language model (LLM) can perform at a level comparable to human annotators, such that it could serve as an additional reviewer in manual chart abstraction or consensus-based coding workflows.

Methods We used clinical text from MIMIC-IV, which includes de-identified records from over 200,000 patients treated at an academic tertiary care center. For bleeding events, we extracted 400 excerpts from 200 notes containing at least one keyword related to bleeding (e.g., hemorrhage, laceration). For clotting events, we extracted 3182 CT pulmonary angiogram radiology reports.
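As an illustration of the excerpt-extraction step, a minimal Python sketch is shown below; the keyword list beyond the two examples given in the abstract, the character window size, and the input format are assumptions for illustration, not the study's actual configuration.

    import re

    # Bleeding-related keywords; "hemorrhage" and "laceration" come from the
    # abstract, and the study's full keyword list is not reproduced here.
    BLEEDING_KEYWORDS = ["hemorrhage", "laceration"]

    def extract_excerpts(note_text, window=400):
        """Return a fixed-size character window around each keyword match in a
        clinical note (the window size is an illustrative assumption)."""
        excerpts = []
        for keyword in BLEEDING_KEYWORDS:
            for match in re.finditer(keyword, note_text, flags=re.IGNORECASE):
                start = max(0, match.start() - window)
                end = min(len(note_text), match.end() + window)
                excerpts.append(note_text[start:end])
        return excerpts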

Three physicians (SR, RP, BL; “original annotators”) annotated both corpora with binary labels according to International Society on Thrombosis and Haemostasis (ISTH) criteria: major or clinically relevant non-major bleeding (yes/no) for the bleeding excerpts, and acute pulmonary embolism (yes/no) for the radiology reports. We prompted GPT-4 with the ISTH definitions and asked it to apply the same labels in a zero-shot setting. A fourth physician (AK), blinded to all LLM and human labels, annotated a sample of agreements and disagreements between the original annotators and the LLM. We assessed the percent agreement between the LLM and the original annotators, as well as the weighted accuracy of the blinded physician and of the LLM relative to the original annotators. We also conducted a qualitative error analysis to characterize patterns of disagreement between the LLM and the human annotators.
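A minimal sketch of the zero-shot classification step is shown below, assuming the OpenAI chat completions API; the prompt wording, temperature setting, and output parsing are illustrative, since the abstract states only that GPT-4 was given the ISTH definitions and asked to apply the same binary labels. Real clinical text would additionally require a deployment that satisfies the data-use terms of MIMIC-IV.

    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    def classify_excerpt(excerpt: str, isth_definitions: str, question: str) -> str:
        """Zero-shot yes/no classification of a clinical excerpt.
        `question` would be, e.g., 'Does this excerpt document a major or
        clinically relevant non-major bleeding event?' (paraphrased, not the
        study's exact prompt)."""
        response = client.chat.completions.create(
            model="gpt-4",
            temperature=0,  # illustrative choice; not stated in the abstract
            messages=[
                {"role": "system",
                 "content": "Apply the following ISTH definitions and answer "
                            "only 'yes' or 'no'.\n\n" + isth_definitions},
                {"role": "user", "content": question + "\n\n" + excerpt},
            ],
        )
        return response.choices[0].message.content.strip().lower()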

Results Percent agreement between the original annotators and the LLM was higher for clotting events (99.2%) than for bleeding events (90.5%). For clotting, the weighted accuracy of the blinded physician against the original annotators, computed across a 50/50 sample of agreements and disagreements between the LLM and the original annotators, was 0.994 (LLM: 0.992). For bleeding, the corresponding weighted accuracy was 0.780 (LLM: 0.908).
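For reference, the sketch below shows one way these two metrics could be computed; the abstract does not spell out the weighting scheme, so the weighted accuracy here assumes stratum-level accuracies from the 50/50 sample of agreements and disagreements, re-weighted by each stratum's share of the full corpus.

    def percent_agreement(labels_a, labels_b):
        """Fraction of documents on which two raters assign the same label."""
        matches = sum(a == b for a, b in zip(labels_a, labels_b))
        return matches / len(labels_a)

    def weighted_accuracy(acc_agree, acc_disagree, prop_agree):
        """Combine accuracy measured within the agreement stratum and within the
        disagreement stratum, weighting by the corpus-level proportion of
        agreements (an assumed weighting scheme, not confirmed by the abstract)."""
        return prop_agree * acc_agree + (1 - prop_agree) * acc_disagree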

Three themes emerged in our error analysis: human error, over-reasoning by the LLM, and incorrect reasoning by the LLM. An example of over-reasoning was the LLM noting that "a filling defect...with web-like appearance" was likely to be chronic, whereas the annotators labeled this event as acute. An example of incorrect reasoning was the LLM labeling the segment "HEENT: - Oral/Gum bleeding" as positive for bleeding, not recognizing that it was part of a negative review of systems. Most human errors were due to annotators failing to follow ISTH guidelines or missing part of the text.

Error analysis of the 26 clotting discrepancies between the original annotators and the LLM revealed that most disagreements were due to human error (18/26, 69.2%), followed by incorrect reasoning (5/26, 19.2%) and over-reasoning (3/26, 11.5%) by the LLM. Of the 38 bleeding disagreements, most were due to over-reasoning by the LLM (25/38, 65.8%), about one-third were due to human error (12/38, 31.6%), and one (1/38, 2.6%) was caused by incorrect reasoning by the LLM.

Conclusions There was more disagreement between the LLM and the original annotators in the identification of bleeding events than of clotting events. For clotting events, human error was the most frequent source of disagreement between the original annotators and the LLM, while over-reasoning by the LLM was the most frequent source for bleeding events. For clotting, the blinded physician and the LLM had nearly identical levels of agreement with the original annotators; for bleeding, the blinded physician agreed with the original annotators less often than the LLM did. Our findings suggest that an LLM can perform comparably to a second human annotator in a blinded study. LLMs may serve as effective secondary reviewers or screening tools in quality improvement initiatives, clinical registry development, and other data abstraction workflows.
