Introduction: Cytogenetic analyses are critical in the management of a variety of malignancies. In acute myeloid leukemia (AML), tumor karyotype is among the most valuable features for prognosis and therapy selection. The reporting of patient karyotypes is reasonably standardized through the use of the International System for Human Cytogenetic Nomenclature (ISCN). However, karyotypes can be extremely complicated, sometimes leading to confusion in their interpretation. In addition, secondary use of karyotype data requires re-interpretation of ISCN formulas, which acts as a barrier to research. To improve the speed and accuracy of karyotype interpretation and classification, we developed an automated, modular pipeline to parse and interpret cytogenetics reports and have initially applied it to risk prediction in AML.

Methods: We developed a modular set of Python scripts that (1) process a batch of cytogenetics reports, (2) identify and clean ISCN formulas, (3) parse ISCN into an efficient representation of cells and their mutations, (4) classify karyotypes according to Southwest Oncology Group (SWOG) risk categories (Slovak, 2000), and (5) output a JSON document that details salient mutations, risk categories, algorithm confidence levels, and references to the supporting evidence for interpretation. We classified each case into one of the following AML risk categories: unfavorable, intermediate, miscellaneous, favorable, unknown, or insufficient.
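As an illustration of steps (3) and (5), the following minimal sketch splits a simple ISCN formula into clones and serializes the result as JSON. It is not the pipeline described here: the function name `parse_iscn`, the dictionary keys, and the example karyotype are hypothetical, and the sketch assumes well-formed formulas in which clones are separated by "/" and fields by ",". Real reports would need considerably more handling (for example, count ranges, uncertain breakpoints, and nested derivative chromosomes).

```python
import json
import re

# An optional trailing "[n]" gives the number of cells observed for a clone.
_CLONE_RE = re.compile(r"^(?P<body>.*?)(?:\[(?P<cells>\d+)\])?$")


def parse_iscn(formula):
    """Split a simple ISCN formula into clones (hypothetical representation)."""
    clones = []
    compact = "".join(formula.split())  # drop all whitespace
    for part in compact.split("/"):  # "/" separates clones
        match = _CLONE_RE.match(part)
        fields = match.group("body").split(",")
        if len(fields) < 2 or not fields[0]:
            # Mirror the abstract's behavior of flagging unparseable formulas.
            raise ValueError(f"cannot parse clone: {part!r}")
        cells = match.group("cells")
        clones.append({
            "chromosome_count": fields[0],
            "sex_chromosomes": fields[1],
            "abnormalities": fields[2:],
            "cells": int(cells) if cells is not None else None,
        })
    return clones


if __name__ == "__main__":
    clones = parse_iscn("46,XX,t(8;21)(q22;q22)[15]/46,XX[5]")
    print(json.dumps({"clones": clones}, indent=2))
```

Keeping the parsed clones as plain dictionaries makes the JSON output step trivial and leaves classification as a separate function over the same structure, which is one way to obtain the modularity described above.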

This program will be embedded within the incoming document pipeline for an enterprise-wide data repository and archive being constructed to facilitate research. The modular design will facilitate extension to additional data sources and classification schemas.

We collected reports (N=4,169) from two cytogenetics laboratories for patients newly diagnosed with AML at an academic hospital between 2008 and 2014. We split random subsets of these reports into training and testing sets. For training and testing, cytogenetic reports were matched to a research database of manual cytogenetic interpretations using patient identifier and report date. In addition to the standard SWOG risk category labels, the research database supplied a label of "not done."
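The pairing of reports with manual interpretations can be sketched as a lookup keyed on patient identifier and report date. The record layout (`patient_id`, `report_date`, `label`) and the function name `match_reports` are assumptions made for illustration; the actual database schema is not described in this abstract. The sketch sets aside any report whose key is missing or ambiguous rather than guessing, since the key is not a true unique identifier.

```python
from datetime import date


def match_reports(reports, interpretations):
    """Pair each report with the manual interpretation sharing its key.

    Both inputs are lists of dicts with hypothetical keys
    "patient_id" and "report_date".
    """
    index = {}
    for interp in interpretations:
        key = (interp["patient_id"], interp["report_date"])
        index.setdefault(key, []).append(interp)

    matched, unmatched = [], []
    for report in reports:
        key = (report["patient_id"], report["report_date"])
        candidates = index.get(key, [])
        if len(candidates) == 1:
            matched.append((report, candidates[0]))
        else:
            # No candidate, or several on the same day: needs manual review.
            unmatched.append(report)
    return matched, unmatched


if __name__ == "__main__":
    reports = [
        {"patient_id": "P1", "report_date": date(2010, 3, 1)},
        {"patient_id": "P2", "report_date": date(2011, 6, 9)},
    ]
    interpretations = [
        {"patient_id": "P1", "report_date": date(2010, 3, 1), "label": "favorable"},
    ]
    matched, unmatched = match_reports(reports, interpretations)
    print(len(matched), "matched;", len(unmatched), "unmatched")
```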

Results: We trained our algorithm on 1,058 reports and tested it on 1,301 reports (see Table 1). Strict accuracy was 95.5% on the training set and 94.7% on the testing set.

In testing, the algorithm failed to interpret 29 documents (2.2% of all testing records) due to incorrectly specified ISCN formulas that could not be definitively parsed. All of these reports were assigned an "unknown" label, indicating the need for further manual review. There were 40 (3.1% of testing records) disagreements between automated and manual interpretations. On further review, 14 (35%) of these appeared to be due to mismatched underlying data, in part due to imperfect pairing of reports and manual interpretations, in the absence of unique identifiers. Other errors included differences in the applied clonality criteria and/or the limitations of interpreting a single karyotype without considering historical and concurrent testing (45%) and the inability of our algorithm to identify some implied abnormalities not explicitly specified by ISCN (15%).
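The testing-set figures above are mutually consistent if strict accuracy counts both unparsed reports and disagreements as errors. That reading is an assumption, but the short check below shows that it reproduces the reported percentages from the counts given in the text.

```python
total = 1301          # testing reports
unparsed = 29         # assigned "unknown" because the ISCN could not be parsed
disagreements = 40    # automated label differed from manual label
mismatched_data = 14  # disagreements attributed to mismatched underlying data

# Assumption: every report that is neither unparsed nor a disagreement is correct.
correct = total - unparsed - disagreements

strict_accuracy = correct / total
unparsed_rate = unparsed / total
disagreement_rate = disagreements / total
mismatch_share = mismatched_data / disagreements

print(f"correct: {correct}")
print(f"strict accuracy: {strict_accuracy:.1%}")
print(f"unparsed: {unparsed_rate:.1%}")
print(f"disagreements: {disagreement_rate:.1%}")
print(f"mismatched data share of disagreements: {mismatch_share:.0%}")
```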

Conclusions: We developed an automated program that rapidly interprets karyotype reports for AML risk classification with ~95% accuracy. We plan to improve our interpretation algorithm to better identify implied abnormalities and update our risk classification to reflect the most recent SWOG risk classification guidelines. This tool should dramatically decrease the barrier to incorporating cytogenetic data in research studies and potentially improve the accuracy of clinical interpretations.

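Table 1. Confusion matrix of test set results (rows: manual data labels; columns: system output labels)

| Manual data label | Favorable | Insufficient | Intermediate | Miscellaneous | Not done | Unfavorable | Unknown | Total |
|---|---|---|---|---|---|---|---|---|
| Favorable | 20 | | | | | | | 28 |
| Insufficient | | 74 | | | | | | 85 |
| Intermediate | | | 818 | | | | | 822 |
| Miscellaneous | | | | 45 | | | | 49 |
| Not done | | | | | | | | 11 |
| Unfavorable | | | | | | 275 | 11 | 304 |
| Unknown | | | | | | | | |
| Total | 21 | 84 | 830 | 52 | | 285 | 29 | 1301 |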
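Per-label agreement can be read off the table by dividing the count on which manual and system labels agree by the manual-label row total. The sketch below does this for the five labels whose agreement count and row total both appear in the table. It assumes that the first number in each of those rows is the agreeing count and the last is the row total; the variable names are illustrative.

```python
# (agreeing count, manual-label row total), taken from Table 1 under the
# assumption stated above about which cell each number belongs to.
rows = {
    "Favorable": (20, 28),
    "Insufficient": (74, 85),
    "Intermediate": (818, 822),
    "Miscellaneous": (45, 49),
    "Unfavorable": (275, 304),
}

# Fraction of manually labeled cases that the system labeled identically.
agreement = {label: agree / total for label, (agree, total) in rows.items()}

for label, value in agreement.items():
    print(f"{label}: {value:.1%}")
```

Under these assumptions, agreement is highest for the intermediate category, which also accounts for the majority of testing reports.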

Disclosures

No relevant conflicts of interest to declare.

Author notes

*

Asterisk with author names denotes non-ASH members.
