TO THE EDITOR:
T-cell acute lymphoblastic leukemia (T-ALL) is a rare and aggressive hematologic malignancy affecting both children and adults.1,2 T-ALL subtype identification is an emerging area of active research; as recently as 2016, the World Health Organization suggested only 1 provisional distinct T-ALL classification: the early T-cell precursor (ETP).3
However, recent revisions by the International Consensus Classification in 2022 further subclassified ETP based on BCL11B deregulation while introducing 8 provisional classifications for non-ETP T-ALL.4,5 These include subtypes characterized by rearrangements in TAL1/2, TLX1, TLX3, LMO1/2, SPI1, NKX2, and BHLH and dysregulation of HOXA.4,5 Other recent work by Dai et al, using gene expression analysis of 8 international cohorts of T-ALL samples from adult and pediatric patients, similarly developed classifications based on molecular abnormalities in TAL1, TLX1, TLX3, NKX2, LMO1/2, LYL1, and SPI1 while subclassifying HOXA-dysregulated samples based on the presence of KMT2A or MLLT10 fusions,6 resulting in 10 subtypes. Furthermore, Brady et al recently described 10 subtypes, defined again by TAL1/2, TLX1, TLX3, NKX2, LMO1/2, SPI1, and HOXA abnormalities, along with BCL11B abnormalities7 and an otherwise nonspecified group.8
As with other leukemias, understanding the heterogeneity within T-ALL holds clinical significance because T-ALL subtypes may influence risk stratification and treatment decisions. For instance, TLX1 and TLX3 lesions indicate good prognosis, TAL1/2 lesions indicate intermediate prognosis, and the rare SPI1 rearrangement subtype indicates very poor prognosis.5
Considering the recent expansion of T-ALL subtype classifications and the potential clinical utility, we developed TALLSorts, a machine-learning classifier algorithm that uses bulk RNA sequencing (RNA-seq) gene expression data to attribute subtypes to T-ALL samples. The architecture of TALLSorts is similar to that of ALLSorts, an analogous B-ALL subtype classifier, which we have previously published.9 TALLSorts (in its current iteration) accepts an expression matrix of gene counts from RNA-seq reads aligned to the hg19 or hg38 reference genomes. These data are analyzed by pretrained logistic regression models using the one-vs-rest strategy for each subtype. Each sample is then assigned a probability score for each subtype, with a positive identification if 1 or several subtypes exceed the 50% threshold (see supplemental Information for further details). Logistic regression models were built using the scikit-learn package in Python.10
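The one-vs-rest thresholding described above can be illustrated with a minimal sketch. It is not the TALLSorts implementation. The synthetic three-gene data, the function names `fit_one_vs_rest` and `call_subtypes`, and the `Unclassified` fallback label are assumptions made purely for illustration. The sketch fits one binary scikit-learn logistic regression per subtype and reports every subtype whose probability exceeds 0.5, so a sample can receive more than one call.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression


def fit_one_vs_rest(X, labels):
    """Fit one binary logistic regression per subtype (subtype vs all others)."""
    labels = np.asarray(labels)
    models = {}
    for subtype in sorted(set(labels.tolist())):
        y = (labels == subtype).astype(int)
        models[subtype] = LogisticRegression(max_iter=1000).fit(X, y)
    return models


def call_subtypes(models, X, threshold=0.5):
    """Return per-subtype probabilities and, per sample, all subtypes above threshold."""
    probs = {s: m.predict_proba(X)[:, 1] for s, m in models.items()}
    calls = []
    for i in range(X.shape[0]):
        hits = [s for s in models if probs[s][i] > threshold]
        calls.append(hits or ["Unclassified"])  # hypothetical fallback label
    return probs, calls


# Synthetic stand-in for a standardized expression matrix (samples x genes).
rng = np.random.default_rng(0)
subtypes = ["TLX1", "TLX3", "NKX2"]
X = np.vstack([rng.normal(size=(30, 3)) + 4.0 * np.eye(3)[k] for k in range(3)])
truth = [s for s in subtypes for _ in range(30)]

models = fit_one_vs_rest(X, truth)
probs, calls = call_subtypes(models, X)
hit_rate = np.mean([t in c for t, c in zip(truth, calls)])
print(f"fraction of samples whose calls include the true label: {hit_rate:.2f}")
```

Because each subtype has its own independent binary model, the per-subtype probabilities for a sample need not sum to 1, which is what allows several subtypes to exceed the threshold at once.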
In building a training set, we selected publicly available RNA-seq expression data sets comprising 376 T-ALL samples from 4 international cohorts: 26 samples from the Royal Children’s Hospital in Australia,11 265 samples from the Therapeutically Applicable Research To Generate Effective Treatments (TARGET) program in the United States,12 and 2 cohorts of 25 and 60 samples from the Saint-Louis Hospital in France.13,14
For these 4 cohorts, we used ComBat15 and RUVseq16 to remove batch effects. We then performed dimension reduction and unsupervised clustering on gene expression using the Uniform Manifold Approximation and Projection and density-based spatial clustering methods, respectively,17,18 yielding 8 distinct clusters (Figure 1A).
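The two-step pattern of dimension reduction followed by density-based clustering can be sketched as follows. This is an illustration only, under stated assumptions: principal component analysis (PCA) is swapped in for Uniform Manifold Approximation and Projection so that the example depends only on scikit-learn, the data are synthetic, and the `eps` and `min_samples` values are arbitrary choices rather than the settings used in the study.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)

# Three synthetic groups of 40 samples in a 10-gene space (toy data only).
centers = 10.0 * np.eye(10)[:3]
X = np.vstack([rng.normal(size=(40, 10)) + c for c in centers])

# Step 1: reduce to 2 dimensions (PCA here as a stand-in; the study used UMAP).
embedding = PCA(n_components=2, random_state=0).fit_transform(X)

# Step 2: density-based clustering on the embedding; the label -1 marks noise.
labels = DBSCAN(eps=1.5, min_samples=5).fit_predict(embedding)
n_clusters = len(set(labels.tolist()) - {-1})
print(f"clusters found: {n_clusters}")
```

Density-based clustering does not require the number of clusters to be specified in advance, which suits a setting where the number of subtypes is itself a question.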
Because the TARGET cohort was analyzed separately by Dai et al and Brady et al, we compared their subtype labels for the TARGET samples with our clusters and found an almost direct correspondence (Figure 1B-C). Treating each cluster as a distinct subtype, we gave each cluster a subtype name reflecting the likely underlying pathology. These names largely coincided with the labels from Dai et al and Brady et al. Specifically, our 8 subtypes correspond to transcriptomic landscapes associated with BCL11B deregulation, NKX2 overexpression, TAL1/2 or LMO1/2 deregulation, TLX1 overexpression, TLX3 overexpression, HOXA overexpression and fusions involving KMT2A or MLLT10, and a Diverse category. In the case of the BCL11B cluster, we manually separated it from the Diverse cluster by identifying samples within the BCL11B subgroup from the classification system by Brady et al (Figure 1C).
In the case of the TAL/LMO subtype classification, TALLSorts performs a further hierarchical classification step by separating samples with overexpression of TAL2 from the rest, labeled as TAL2 and TAL1/LMO, respectively.
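The hierarchical step can be pictured as a second-stage rule applied only to samples already called TAL/LMO. The sketch below is a hedged illustration and not the rule TALLSorts uses: the function name `refine_tal_lmo`, the simple expression cutoff, and its value of 5.0 on an unspecified log-expression scale are all hypothetical.

```python
def refine_tal_lmo(primary_calls, tal2_expression, cutoff=5.0):
    """Split a sample's TAL/LMO call into TAL2 or TAL1/LMO.

    primary_calls: subtype calls for one sample from the first-stage classifier.
    tal2_expression: that sample's TAL2 expression (hypothetical scale).
    cutoff: hypothetical overexpression threshold, for illustration only.
    """
    refined = []
    for call in primary_calls:
        if call != "TAL/LMO":
            refined.append(call)  # other subtypes pass through unchanged
        elif tal2_expression > cutoff:
            refined.append("TAL2")
        else:
            refined.append("TAL1/LMO")
    return refined


print(refine_tal_lmo(["TAL/LMO"], tal2_expression=8.2))
print(refine_tal_lmo(["TAL/LMO", "HOXA"], tal2_expression=1.0))
```

Keeping the refinement separate from the first-stage classifier means the second stage only ever sees samples for which the TAL2 versus TAL1/LMO distinction is meaningful.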
In the case of the Diverse subtype, its analogous groupings within the systems of both Dai et al and Brady et al suggest it is composed of samples with heterogeneous genetic profiles not otherwise captured by the other subtypes and with lesions not otherwise defined. The lack of an SPI1 label is acknowledged as a limitation, because only 1 SPI1-labeled sample was available within the TARGET data set. Training a cluster label from a single sample is inappropriate; however, we hope to create an SPI1 label in the future as we retrain TALLSorts on more SPI1 samples.
The Rand index for concordance of TARGET samples between our cluster labels and the subgroupings of Dai et al was 0.97, indicating a high degree of agreement. Similarly, the Rand index between our cluster labels and the subgroupings of Brady et al was 0.87. This lower value was likely due largely to the absence of SPI1 and LMO1/2 as TALLSorts labels and to the further discrimination among HOXA samples in the TALLSorts labels relative to those of Brady et al.
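For readers unfamiliar with the Rand index, it is the fraction of sample pairs on which two labelings agree, where agreement means that both labelings place the pair in the same group or both place it in different groups. The sketch below computes the unadjusted index on toy labels. The function name `rand_index` and the 4 example samples are illustrative assumptions, and whether the reported values used this exact unadjusted form is not stated here.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Unadjusted Rand index: fraction of sample pairs treated consistently."""
    if len(labels_a) != len(labels_b):
        raise ValueError("labelings must cover the same samples")
    pairs = list(combinations(range(len(labels_a)), 2))
    if not pairs:
        raise ValueError("need at least 2 samples")
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)


# Toy example: the second labeling splits one group that the first keeps whole.
ours = ["TLX3", "TLX3", "HOXA", "HOXA"]
theirs = ["TLX3", "TLX3", "KMT2A", "MLLT10"]
print(round(rand_index(ours, theirs), 2))  # 5 of 6 pairs agree -> 0.83
```

The toy example shows how the index drops when one scheme subdivides a group that the other keeps together, even though neither labeling is wrong.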
Validating TALLSorts on samples held out from training demonstrated 97% accuracy (109 of 112 samples; Figure 2E). Samples classified into multiple subtypes were considered accurately labeled if any of the classified subtypes matched the true label. Notably, all 109 correctly labeled samples were classified with a probability >70%, and 93 of them (85%) were classified with a probability >90%, demonstrating model confidence in its assignments. Of the 37 TARGET-cohort samples in the holdout set that were classified as TAL/LMO, all 4 TAL2 samples (as labeled by Brady et al) were identified by the hierarchical step, with no misidentifications.
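The scoring rule for multiply classified samples, and the arithmetic behind the reported percentages, can be checked with a short sketch. The function name `multilabel_accuracy` and the toy samples are assumptions for illustration. Only the counts 109, 112, and 93 come from the text.

```python
def multilabel_accuracy(true_labels, called_subtypes):
    """Fraction of samples whose set of called subtypes includes the true label."""
    if len(true_labels) != len(called_subtypes):
        raise ValueError("one list of calls is required per sample")
    correct = sum(t in calls for t, calls in zip(true_labels, called_subtypes))
    return correct / len(true_labels)


# Toy example: the second sample has 2 calls and counts as correct,
# whereas the third sample's only call misses the true label.
truth = ["TLX1", "HOXA", "TAL/LMO", "NKX2"]
calls = [["TLX1"], ["HOXA", "Diverse"], ["TLX3"], ["NKX2"]]
print(multilabel_accuracy(truth, calls))  # 0.75

# Reported holdout counts, re-expressed as rounded percentages.
print(round(100 * 109 / 112))  # 97
print(round(100 * 93 / 109))   # 85
```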
Next, we validated TALLSorts using 66 samples from 2 cohorts that were included in the study by Dai et al but were incorporated into neither the training nor the holdout data set: a study from St Jude Children’s Research Hospital19 (n = 41) and the Ma-Spore trial20 (n = 25). From the St Jude data set, 30 of the 41 samples were also analyzed by Brady et al. Truth values for the subtypes of these samples were determined by cross-referencing their Dai et al and/or Brady et al labels (supplemental Table 5). TALLSorts similarly showed good accuracy on this data set, at 94% (62 of 66 samples; Figure 2F). Of the 62 correctly classified samples, 58 (94%) had a probability >70% and 50 (81%) had a probability >90%. Although these samples were independent from the training and holdout sets, we acknowledge the limitation that their truth values were informed by the labeling from Dai et al, which was also used to generate truth values for the training and holdout sets.
The multiclass logistic regression model is a unique feature of TALLSorts and permits samples to be classified under multiple subtypes. This feature may prove beneficial in samples with >1 driver lesion, which may be only partially classified using decision-tree or clustering-based algorithms as suggested by Brady et al or Dai et al, respectively. Importantly, 8 samples in the holdout data set and 3 samples in the independent data set were classified into multiple subtypes (supplemental Table 7A,B).
TALLSorts also offers the user the ability to train custom classification models using the same standardization and logistic regression pipeline used by our classification model.
The TALLSorts software will undergo active maintenance, including retraining, to ensure its subtype classifications remain current with the latest developments in T-ALL subclassification. Specifically, we expect that as more molecular studies investigate specific lesions and their underlying mechanisms and clinical outcomes, the subtype definitions will become further refined. At this stage, we do not know the defining features of the Diverse classification, which we expect contains multiple distinct driving mechanisms. Similarly, as more samples become available, it will become possible to discriminate more refined subtypes from the gene expression data.
In conclusion, TALLSorts is an accurate, publicly available algorithm that uses RNA-seq expression to classify T-ALL subtypes reflecting the latest research in T-ALL subclassification. TALLSorts produces easily interpretable output files and images.
Further details regarding the construction and testing of TALLSorts are available in the accompanying supplemental Information.
Acknowledgments: Royal Children’s Hospital (RCH) tumor samples and coded data were supplied by the Children’s Cancer Centre Biobank at the Murdoch Children’s Research Institute and The Royal Children’s Hospital. This research was supported by the Peter MacCallum Cancer Foundation. Establishment and running of the Children’s Cancer Centre were made possible through generous support by Cancer In Kids at RCH, The Royal Children’s Hospital Foundation, and the Murdoch Children’s Research Institute. The authors acknowledge the support of the Specialized Center of Research Program (SCOR) grant (7015-18) from the Lymphoma and Leukemia Society and of Perpetual Trustees and the Samuel Nissen Foundation. This work is supported by a Gilead Research Scholars Award (T.S.), funding from The Kid’s Cancer Project Mid-Career Fellowship (T.S.), and a National Health and Medical Research Council (NHMRC) investigator grant GNT1196256 (A.O.).
Contribution: A.G., B.S., and A.O. conceptualized the study and created the methodology; P.G.E., L.M.B., H.J.K., and T.S. provided technical, clinical, and biological expertise; A.G. and B.S. wrote the software; A.G., R.J., and A.L. performed data extraction and sequence alignment; A.G. performed the formal analysis and provided the original draft of the manuscript; A.G. and A.O. edited and reviewed the manuscript; and all authors read and approved the manuscript.
Conflict-of-interest disclosure: The authors declare no competing financial interests.
Correspondence: Alicia Oshlack, Computational Biology, Peter MacCallum Cancer Centre, 305 Grattan St, Melbourne, VIC 3000, Australia; email: alicia.oshlack@petermac.org.
Author notes
The TARGET data set is hosted by the NCBI database of Genotypes and Phenotypes (accession number phs000218).
The RNA-seq data sets are available at NCBI Gene Expression Omnibus database (accession numbers GSE110633 and GSE110677).
The RCH data set is available at the European Genome-Phenome Archive (accession number EGAS00001004212).
The Ma-Spore data set is available at the European Genome-Phenome Archive (accession number EGAD00001002151).
The data set from St Jude is available at NCBI Gene Expression Omnibus database (accession numbers GSE115525 and GSE124824).
TALLSorts, its source code, and instructions for use are available for download via GitHub at https://github.com/Oshlack/TALLSorts.
All other code is available on request from the corresponding author, Alicia Oshlack (alicia.oshlack@petermac.org).
The full-text version of this article contains a data supplement.