RESEARCH LETTER | May 15, 2025

Cytogenetic annotation automation in myelodysplastic syndrome research databases

Samuel Urrutia,1 Kelly S. Chien,2 Ziyi Li,3 Alexandre Bazinet,2 Guilin Tang,4 Guillermo Garcia-Manero2

1Division of Oncology, Washington University School of Medicine, St Louis, MO
2Department of Leukemia, The University of Texas MD Anderson Cancer Center, Houston, TX
3Department of Biostatistics, The University of Texas MD Anderson Cancer Center, Houston, TX
4Department of Hematopathology, The University of Texas MD Anderson Cancer Center, Houston, TX

ORCID: Samuel Urrutia, https://orcid.org/0000-0003-4634-2279; Kelly S. Chien, https://orcid.org/0000-0001-6024-1613; Ziyi Li, https://orcid.org/0000-0001-8359-0533

Blood Adv (2025) 9 (10): 2428–2430.

https://doi.org/10.1182/bloodadvances.2024015362

Article history

Submitted: December 6, 2024
Accepted: February 2, 2025
First Edition: February 21, 2025


Citation

Samuel Urrutia, Kelly S. Chien, Ziyi Li, Alexandre Bazinet, Guilin Tang, Guillermo Garcia-Manero; Cytogenetic annotation automation in myelodysplastic syndrome research databases. Blood Adv 2025; 9 (10): 2428–2430. doi: https://doi.org/10.1182/bloodadvances.2024015362


Subjects:

Health Services and Outcomes, Myeloid Neoplasia

TO THE EDITOR:

In myelodysplastic syndromes (MDSs), cytogenetics and the risk scores derived from them are fundamental to prognostication and treatment.¹ Because cytogenetic strings are complex to annotate and carry valuable information, clinical researchers can spend many hours parsing data, selecting abnormalities of interest, and calculating the cytogenetic risk score for each case. Furthermore, given the specialized rules around what constitutes a true abnormality,² training personnel requires significant time and resources. Despite such training, annotation errors can still occur.³ To solve this problem, we developed and validated autocyto, a free, web-based tool that calculates cytogenetic risk scores from a cytogenetic string.

The tool parses cytogenetic strings (eg, 46,XY,del(5)(q23)[20]), extracts abnormalities of interest (eg, del(5q)), and calculates the MDS cytogenetic risk score.¹ If needed, the tool first converts strings written in earlier nomenclatures to the International System for Human Cytogenomic Nomenclature 2020. The tool's operations are outlined in the flowchart in Figure 1. The tool expects processed cytogenetic strings to follow specific formatting: cytogenetic abnormalities separated by commas (,), clones separated by forward slashes (/), and metaphase counts enclosed in square brackets ([20]). As recommended by the International System for Human Cytogenomic Nomenclature,² single metaphases (ie, [1]) are excluded and not used to calculate the final score.
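The formatting rules above can be illustrated with a minimal sketch. This is not the autocyto source code; the function name `parse_karyotype` and the dictionary keys are hypothetical, and the sketch assumes each clone begins with the chromosome count and the sex chromosomes and that commas appear only between fields. Constructs such as `idem` in subclones are not handled.

```python
import re

# Matches the metaphase count in square brackets at the end of a clone, eg "[20]".
COUNT_RE = re.compile(r"\[(\d+)\]\s*$")


def parse_karyotype(karyotype: str) -> list[dict]:
    """Split a cytogenetic string into clones, dropping single-metaphase clones."""
    clones = []
    for raw_clone in karyotype.split("/"):  # clones are separated by "/"
        raw_clone = raw_clone.strip()
        match = COUNT_RE.search(raw_clone)
        if match is None:
            # A missing count would be a case for manual review.
            raise ValueError(f"missing metaphase count in clone: {raw_clone!r}")
        metaphases = int(match.group(1))
        if metaphases < 2:
            continue  # single metaphases ([1]) are excluded
        # Fields inside a clone are separated by ",".
        fields = [f.strip() for f in raw_clone[: match.start()].split(",")]
        if len(fields) < 2:
            raise ValueError(f"unexpected clone format: {raw_clone!r}")
        clones.append(
            {
                "chromosome_count": fields[0],
                "sex_chromosomes": fields[1],
                "abnormalities": fields[2:],  # empty list for a normal clone
                "metaphases": metaphases,
            }
        )
    return clones


if __name__ == "__main__":
    print(parse_karyotype("46,XY,del(5)(q23)[20]"))
    print(parse_karyotype("47,XX,+8[3]/46,XX,del(7)(q22)[1]/46,XX[16]"))
```

In the second example the middle clone is dropped because it was seen in only one metaphase, so two clones remain.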

Figure 1. Visual flowchart of autocyto cytogenetic string function process.

Using natural language processing,⁴ the tool then finds all prognostic abnormalities of interest in MDSs (supplemental Table 1) and creates a binary variable (1 for presence and 0 for absence) for each abnormality (supplemental Table 2). An additional function outputs the metaphases of the dominant clone. A further function calculates the cytogenetic risk score and outputs an integer from 0 (very good) to 4 (very poor). Several safety functions check for missing components, empty rows, or lack of integers. If any safety function is triggered, the tool asks for a manual check of that case in the final database. Finally, a master function calls all previous functions and provides the final database in .csv format. The tool was written entirely in Python 3.12.4 (Python Software Foundation, Wilmington, DE) using the libraries pandas,⁵ numpy,⁶ re,⁷ and lifelines.⁸ We used the streamlit⁹ library to develop the web application. This study was approved by the institutional review board of The University of Texas MD Anderson Cancer Center.
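As a rough illustration of the binary flags and the 0-to-4 score, the sketch below uses regular expressions on a list of abnormality tokens. It is an assumption-laden simplification, not the tool's implementation: the pattern list is a small illustrative subset rather than supplemental Table 1, the names `PATTERNS`, `flag_abnormalities`, and `risk_score` are hypothetical, the categories are a reduced reading of the scoring system in reference 1, and the number of abnormalities is approximated by the number of tokens.

```python
import re

# Illustrative subset of patterns; the tool's actual list is in supplemental Table 1.
PATTERNS = {
    "loss_of_Y": re.compile(r"^-Y$"),
    "del11q": re.compile(r"^del\(11\)\(q"),
    "del5q": re.compile(r"^del\(5\)\(q"),
    "del12p": re.compile(r"^del\(12\)\(p"),
    "del20q": re.compile(r"^del\(20\)\(q"),
    "del7q": re.compile(r"^del\(7\)\(q"),
    "monosomy7": re.compile(r"^-7$"),
    "chr3_inv_or_del3q": re.compile(r"^(inv\(3\)|del\(3\)\(q)"),
}


def flag_abnormalities(abnormalities: list[str]) -> dict[str, int]:
    """Return 1 (present) or 0 (absent) for each pattern of interest."""
    return {
        name: int(any(pattern.search(a) for a in abnormalities))
        for name, pattern in PATTERNS.items()
    }


def risk_score(abnormalities: list[str]) -> int:
    """Simplified cytogenetic risk score: 0 (very good) to 4 (very poor)."""
    n = len(abnormalities)  # simplification: one token counts as one abnormality
    flags = flag_abnormalities(abnormalities)
    if n > 3:
        return 4  # complex karyotype with more than 3 abnormalities
    if n == 3:
        return 3  # complex karyotype with 3 abnormalities
    if n == 0:
        return 1  # normal karyotype
    if n == 2:
        if flags["monosomy7"] or flags["del7q"]:
            return 3  # double including -7 or del(7q)
        if flags["del5q"]:
            return 1  # double including del(5q)
        return 2  # any other double
    # Single abnormality.
    if flags["loss_of_Y"] or flags["del11q"]:
        return 0
    if flags["del5q"] or flags["del12p"] or flags["del20q"]:
        return 1
    if flags["monosomy7"] or flags["chr3_inv_or_del3q"]:
        return 3
    return 2  # del(7q), +8, +19, i(17q), or any other single abnormality


if __name__ == "__main__":
    print(flag_abnormalities(["del(5)(q23)"]))
    print(risk_score(["del(5)(q23)"]))
```

A production version would also need the safety checks described above, for example returning a manual-check flag instead of a score when a row is empty or malformed.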

We then tested the tool on a database of 456 patients with MDSs with manual cytogenetic annotation, of whom 167 (36.6%) had a complex karyotype. The tool asked for a manual check in 21 cases (4.6%). It produced abnormality annotation and cytogenetic risk scores in 435 cases (95.3%). There was a discrepancy between manual and automatic annotation in 46 cases (10%), of which 38 (82%) were erroneous manual annotations and 8 (18%) were erroneous tool annotations. The concordance index between manual and autocyto annotations was 0.99 (Figure 2). Kaplan-Meier estimates using manually annotated cytogenetic risk scores vs autocyto annotated cytogenetic risk scores (supplemental Figure 1) showed excellent correlation in survival. For the 456-patient database, the processing time was 0.22 seconds.
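A comparison of this kind can be sketched in a few lines. The example below computes simple percent agreement and lists discrepant cases; it does not reproduce the concordance index reported above, and the function name `compare_annotations`, the case identifiers, and the scores are invented for illustration only.

```python
def compare_annotations(manual: dict, automated: dict) -> dict:
    """Compare manual and automated risk scores keyed by case identifier.

    Cases whose automated score is None (flagged for manual check) are skipped.
    """
    compared = [cid for cid in manual if automated.get(cid) is not None]
    discrepant = [cid for cid in compared if manual[cid] != automated[cid]]
    if compared:
        agreement = 1 - len(discrepant) / len(compared)
    else:
        agreement = float("nan")  # nothing to compare
    return {
        "compared": len(compared),
        "discrepant": discrepant,
        "agreement": agreement,
    }


if __name__ == "__main__":
    # Toy data, not from the study.
    manual_scores = {"A": 1, "B": 3, "C": 2, "D": 4}
    automated_scores = {"A": 1, "B": 2, "C": 2, "D": None}
    print(compare_annotations(manual_scores, automated_scores))
```

Each discrepant case would then be reviewed against the original cytogenetic report to decide whether the manual or the automated annotation was in error.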

Figure 2. Bar chart of concordance between autocyto and manual input.

The tool expects a comma-delimited file (.csv). Other file types such as Excel (.xls or .xlsx) can be converted to .csv in Microsoft Excel (Microsoft Corporation, Redmond, WA). To process a database, the user types the header of the column to be processed in the field provided and then uploads the .csv file using the Browse files button. This loads the database in the lower part of the website under “Original Data,” and almost immediately a second database appears as “Processed Data” along with a Download Processed Data button, which allows the user to download their data to a directory of their choice. All data are deleted once the user closes their browser. The supplemental file (supplemental Figure 1) contains a link to a video for usage demonstration.
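The same workflow (read a .csv file, pick a column by its header, append processed columns, and write a .csv file) can be sketched with the Python standard library. This is an illustrative stand-in rather than the web application's code: `process_csv` and the output columns `n_clones` and `manual_check` are hypothetical, and the per-row processing is reduced to counting clones. Note that cytogenetic strings contain commas, so they must be quoted inside a .csv file.

```python
import csv
import io


def process_csv(csv_text: str, column: str) -> str:
    """Append illustrative processed columns to every row of a CSV document."""
    reader = csv.DictReader(io.StringIO(csv_text))
    if not reader.fieldnames or column not in reader.fieldnames:
        raise KeyError(f"column {column!r} not found in header")
    out = io.StringIO()
    fieldnames = list(reader.fieldnames) + ["n_clones", "manual_check"]
    writer = csv.DictWriter(
        out, fieldnames=fieldnames, extrasaction="ignore", lineterminator="\n"
    )
    writer.writeheader()
    for row in reader:
        karyotype = (row[column] or "").strip()
        if karyotype:
            row["n_clones"] = len(karyotype.split("/"))  # clones are "/"-separated
            row["manual_check"] = 0
        else:
            row["n_clones"] = ""
            row["manual_check"] = 1  # empty row: ask for manual review
        writer.writerow(row)
    return out.getvalue()


if __name__ == "__main__":
    sample = (
        "id,karyotype\n"
        '1,"46,XY,del(5)(q23)[20]"\n'
        "2,\n"
        '3,"47,XX,+8[3]/46,XX[17]"\n'
    )
    print(process_csv(sample, "karyotype"))
```

Working on in-memory text keeps the sketch self-contained; reading from and writing to files on disk would only change how the text is loaded and saved.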

Clinical research conclusions depend on the accuracy of data entry and interpretation. This process has been partially improved by clinical data transfer from the medical record for discrete data points such as laboratory values, dates, or diagnosis codes. However, data entry errors can happen in up to 20% of clinical research databases and the main reasons continue to be errors in annotation and interpretation.¹⁰ Tools exist for data integrity assessment such as impossible dates (death dates before diagnosis) and for impossible values (eg, “stage I MDS”), but these are rarely used within clinical trials¹¹ and almost never used in retrospective data analyses, which represent more than 70% of clinical research found in PubMed.¹²
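Integrity checks of the kind mentioned above are straightforward to express in code. The following is a minimal sketch under stated assumptions: the record layout and the field names `diagnosis_date`, `death_date`, and `cyto_risk` are hypothetical, and the two rules shown (death before diagnosis, risk score outside 0 to 4) are examples rather than a complete validation scheme.

```python
from datetime import date

# Cytogenetic risk scores are integers from 0 (very good) to 4 (very poor).
ALLOWED_RISK_SCORES = {0, 1, 2, 3, 4}


def integrity_issues(record: dict) -> list[str]:
    """Return a list of human-readable problems found in one database record."""
    issues = []
    diagnosis = record.get("diagnosis_date")
    death = record.get("death_date")  # None when the patient is alive
    if diagnosis is None:
        issues.append("missing diagnosis date")
    elif death is not None and death < diagnosis:
        issues.append("death date precedes diagnosis date")
    if record.get("cyto_risk") not in ALLOWED_RISK_SCORES:
        issues.append("cytogenetic risk score outside 0-4")
    return issues


if __name__ == "__main__":
    record = {
        "diagnosis_date": date(2020, 5, 1),
        "death_date": date(2019, 1, 1),
        "cyto_risk": 7,
    }
    for issue in integrity_issues(record):
        print(issue)
```

Records that return a non-empty list would be routed to manual review rather than corrected automatically.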

Cytogenetic strings have multiple potential annotation pitfalls: at the time of the abnormality annotation by the cytotechnologist, at the time of transcription into the clinical record, and at the time of transcription into the clinical database. Unfortunately, given that most cytogenetic reports are uploaded as a document, instead of a specific field value, this last step is almost always manual. Furthermore, the current nomenclature of cytogenetics includes multiple clicks and multiple special character keystrokes that make transcription more error prone. Artificial intelligence technologies are being developed to automate this process.¹³

To calculate the International Prognostic Scoring System or its revised and molecular versions,¹⁴,¹⁵ clinical researchers with different levels of training extract each cytogenetic abnormality from the medical record manually. This process includes direct observation of the cytogenetic string, manual extraction, and calculation of each score for each patient. Errors notwithstanding, this process is tedious, takes long hours, and takes longer to verify for each patient. In MDS clinical research, cytogenomic data have the highest prognostic value and data integrity is critical for reporting. Accurate clinical research database annotation is mandatory to be able to extract insights from retrospective data and inform hypotheses or treatment practices. We found that discordance between manual and autocyto annotations was caused by manual errors in more than 80% of cases, highlighting the need for the automation of cytogenetic string analysis. Our tool can reduce research workhours in a user-friendly interface. The tool can be accessed at https://autocyto.streamlit.app.

Contribution: S.U. designed the tool and wrote the manuscript; G.T. reviewed the tool’s nomenclature assessment for accuracy; and K.S.C., Z.L., A.B., and G.G.-M. reviewed the tool’s performance and approved the final version of the manuscript.

Conflict-of-interest disclosure: The authors declare no competing financial interests.

Correspondence: Samuel Urrutia, Division of Oncology, Washington University School of Medicine, 660 S Euclid Ave, St Louis, MO 63104; email: surrutia@wustl.edu.

References

1. Schanz J, Tüchler H, Solé F, et al. New comprehensive cytogenetic scoring system for primary myelodysplastic syndromes (MDS) and oligoblastic acute myeloid leukemia after MDS derived from an international database merge. J Clin Oncol. 2012;30(8):820-829.

2. McGowan-Jordan J, Hastings RJ, Moore S. ISCN 2020: An International System for Human Cytogenomic Nomenclature 2020. Karger; 2020.

3. Yen KH, Lee C, Liu HS, Ho CL. A precise and scalable method for querying genes in chromosomal banding regions based on cytogenetic annotations. Bioinformatics. 2005;21(17):3469-3474.

4. Aho AV. Pattern matching in strings. In: Book RV, eds. Formal Language Theory. 1st ed. Academic Press; 1980:325-347.

5. McKinney W. pandas: a foundational Python library for data analysis and statistics. Python for High Performance and Scientific Computing. 2011;14(9):1-9.

6. Harris CR, Millman KJ, van der Walt SJ, et al. Array programming with NumPy. Nature. 2020;585(7825):357-362.

7. Chapman C, Stolee KT. Exploring regular expression usage and context in Python. Proceedings of the 25th International Symposium on Software Testing and Analysis. ISSTA 2016. 2016:282-293.

8. Davidson-Pilon C. lifelines: survival analysis in Python. J Open Source Softw. 2019;4(40):1317.

9. Khorasani M, Abdou M, Hernández Fernández J. Streamlit basics. In: Khorasani M, Abdou M, Hernández Fernández J, eds. Web Application Development with Streamlit: Develop and Deploy Secure and Scalable Web Applications to the Cloud Using a Pure Python Framework. 1st ed. Apress; 2022:31-62.

10. Goldberg SI, Niemierko A, Turchin A. Analysis of data errors in clinical research databases. AMIA Annu Symp Proc. 2008;2008:242-246.

11. Olsen R, Bihlet AR, Kalakou F, Andersen JR. The impact of clinical trial monitoring approaches on data integrity and cost—a review of current literature. Eur J Clin Pharmacol. 2016;72(4):399-412.

12. Zhao X, Jiang H, Yin J, et al. Changing trends in clinical research literature on PubMed database from 1991 to 2020. Eur J Med Res. 2022;27(1):95.

13. Hong B, Fedderson B, Zhao J, Andersen E. AI-based algorithms for neoplastic metaphase cells boost efficiencies in the cytogenetics laboratory [abstract]. Cancer Genetics. 2023;278-279:2-3.

14. Greenberg PL, Tuechler H, Schanz J, et al. Revised International Prognostic Scoring System for myelodysplastic syndromes. Blood. 2012;120(12):2454-2465.

15. Bernard E, Tuechler H, Greenberg PL, et al. Molecular International Prognostic Scoring System for myelodysplastic syndromes. NEJM Evid. 2022;1(7):EVIDoa2200008.

Author notes

The data used to validate the tool are not available. The source code and the application are freely available online at https://github.com/eotf/autocyto, and the application may be used directly at https://autocyto.streamlit.app.

The full-text version of this article contains a data supplement.

© 2025 American Society of Hematology. Published by Elsevier Inc. Licensed under Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0), permitting only noncommercial, nonderivative use with attribution. All other rights reserved.

Supplemental data

Supplemental Media- pdf file

Supplemental Tables and Figure- xlsx file

Volume 9, Issue 10

May 27 2025
