TO THE EDITOR:

T-cell acute lymphoblastic leukemia (T-ALL) is a rare and aggressive hematologic malignancy affecting both children and adults.1,2 T-ALL subtype identification is an emerging area of active research; as recently as 2016, the World Health Organization suggested only 1 provisional distinct T-ALL classification: the early T-cell precursor (ETP).3 

However, recent revisions by the International Consensus Classification in 2022 further subclassified ETP based on BCL11B deregulation while introducing 8 provisional classifications for non-ETP T-ALL.4,5 These include subtypes characterized by rearrangements in TAL1/2, TLX1, TLX3, LMO1/2, SPI1, NKX2, and BHLH and dysregulation of HOXA.4,5 Other recent work by Dai et al, using gene expression analysis of 8 international cohorts of T-ALL samples from adult and pediatric patients, similarly developed classifications based on molecular abnormalities in TAL1, TLX1, TLX3, NKX2, LMO1/2, LYL1, and SPI1 while subclassifying HOXA-dysregulated samples based on the presence of KMT2A or MLLT10 fusions,6 resulting in 10 subtypes. Furthermore, Brady et al recently described 10 subtypes, defined again by TAL1/2, TLX1, TLX3, NKX2, LMO1/2, SPI1, and HOXA abnormalities, along with BCL11B abnormalities7 and an otherwise nonspecified group.8 

As with other leukemias, understanding the heterogeneity within T-ALL holds clinical significance because T-ALL subtypes may influence risk stratification and treatment decisions. For instance, TLX1 and TLX3 lesions indicate good prognosis, TAL1/2 lesions indicate intermediate prognosis, and the rare SPI1 rearrangement subtype indicates very poor prognosis.5 

Considering the recent expansion of T-ALL subtype classifications and their potential clinical utility, we developed TALLSorts, a machine-learning classifier that uses bulk RNA sequencing (RNA-seq) gene expression data to attribute subtypes to T-ALL samples. The architecture of TALLSorts is similar to that of ALLSorts, an analogous B-ALL subtype classifier that we have previously published.9 TALLSorts (in its current iteration) accepts an expression matrix of gene counts from RNA-seq reads aligned to the hg19 or hg38 reference genomes. These data are analyzed by pretrained logistic regression models using the one-vs-rest strategy for each subtype. Each sample is then assigned a probability score for each subtype and is called positive for every subtype whose probability exceeds the 50% threshold, so a sample may receive 1 or several subtype labels (see supplemental Information for further details). Logistic regression models were built using the scikit-learn package in Python.10 
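
The scoring and thresholding step described above can be illustrated with a minimal sketch. It assumes that each subtype has its own independent binary logistic model, so the probabilities need not sum to 1 and a sample can exceed the threshold for more than 1 subtype. The gene weights, intercepts, and function names are hypothetical and are not taken from TALLSorts or its trained models.

```python
import math

THRESHOLD = 0.5  # the 50% probability cut-off described in the text


def sigmoid(z):
    """Logistic function mapping a linear score to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def subtype_probabilities(expression, models):
    """Score one sample with one binary (subtype vs rest) model per subtype.

    expression: dict of gene -> standardized expression value
    models: dict of subtype -> (dict of gene -> weight, intercept)
    """
    probs = {}
    for subtype, (weights, intercept) in models.items():
        z = intercept + sum(w * expression.get(g, 0.0) for g, w in weights.items())
        probs[subtype] = sigmoid(z)
    return probs


def call_subtypes(probs, threshold=THRESHOLD):
    """Return every subtype whose probability exceeds the threshold."""
    return sorted(s for s, p in probs.items() if p > threshold)


# Hypothetical toy models: coefficients are invented for illustration only.
toy_models = {
    "TLX1": ({"TLX1": 2.0}, -3.0),
    "TLX3": ({"TLX3": 2.0}, -3.0),
    "NKX2": ({"NKX2-1": 2.0}, -3.0),
}

single = subtype_probabilities({"TLX1": 3.0, "TLX3": 0.2, "NKX2-1": 0.1}, toy_models)
double = subtype_probabilities({"TLX1": 3.0, "TLX3": 3.0, "NKX2-1": 0.1}, toy_models)
print(call_subtypes(single))
print(call_subtypes(double))
```

In this toy example the second sample exceeds the threshold for 2 subtypes, which is the situation a one-vs-rest design allows and a single-label design would not.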

In building a training set, we selected publicly available RNA-seq expression data sets comprising 376 T-ALL samples from 4 international cohorts: 26 samples from the Royal Children’s Hospital in Australia,11 265 samples from the Therapeutically Applicable Research To Generate Effective Treatments (TARGET) program in the United States,12 and 2 cohorts of 25 and 60 samples from the Saint-Louis Hospital in France.13,14 

For these 4 cohorts, we used ComBat15 and RUVseq16 to remove batch effects. We then performed dimension reduction and unsupervised clustering on gene expression using the Uniform Manifold Approximation and Projection and density-based spatial clustering methods, respectively,17,18 yielding 8 distinct clusters (Figure 1A).
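
To make the density-based clustering step concrete, the following is a minimal pure-Python sketch of the algorithm described in reference 18, applied to a handful of made-up 2-dimensional points that stand in for an embedding. It does not perform the dimension reduction itself, and the `eps` and `min_samples` values are hypothetical; the parameters used for the published analysis are not given in this letter.

```python
import math

NOISE = -1  # label for points that belong to no cluster


def region_query(points, i, eps):
    """Indices of all points within distance eps of point i (including i)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]


def dbscan(points, eps, min_samples):
    """Assign a cluster index to each point, or NOISE for outliers."""
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue  # already visited
        neighbours = region_query(points, i, eps)
        if len(neighbours) < min_samples:
            labels[i] = NOISE  # may later be relabeled as a border point
            continue
        cluster += 1  # point i is a core point: start a new cluster
        labels[i] = cluster
        queue = [j for j in neighbours if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins but is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbours = region_query(points, j, eps)
            if len(j_neighbours) >= min_samples:
                queue.extend(j_neighbours)  # j is also a core point
    return labels


# Hypothetical embedding coordinates: two tight groups and one outlier.
embedding = [(0, 0), (0, 0.1), (0.1, 0), (5, 5), (5, 5.1), (5.1, 5), (10, 0)]
labels = dbscan(embedding, eps=0.5, min_samples=3)
print(labels)
```

Because the number of clusters is an outcome of the density parameters rather than an input, the count obtained depends on the embedding and on the parameter values chosen.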

Figure 1.

Identification of subtypes through clustering and dimension reduction. (A) Uniform Manifold Approximation and Projection (UMAP) dimension reduction and density-based spatial clustering of samples used for training and holdout testing (n = 376), colored according to clusters, with cluster labels informed by Dai et al and Brady et al. (B-C) TARGET samples (n = 265) within our clustering colored according to subtype labels assigned by Dai et al (B) and Brady et al (C).

Because the TARGET cohort was analyzed separately by Dai et al and Brady et al, we compared their subtype labels for the TARGET samples with our clusters and found an almost direct correspondence (Figure 1B-C). Considering each cluster as a distinct subtype, we assigned each cluster a subtype name reflecting the likely underlying pathology. These names largely coincided with the labels from Dai et al and Brady et al. Specifically, our 8 subtypes correspond to transcriptomic landscapes associated with BCL11B deregulation, NKX2 overexpression, TAL1/2 or LMO1/2 deregulation, TLX1 overexpression, TLX3 overexpression, HOXA overexpression and fusions involving KMT2A or MLLT10, and a Diverse category. In the case of the BCL11B cluster, we manually separated it from the Diverse cluster by identifying samples within the BCL11B subgroup from the classification system by Brady et al (Figure 1C).

In the case of the TAL/LMO subtype classification, TALLSorts performs a further hierarchical classification step by separating samples with overexpression of TAL2 from the rest, labeled as TAL2 and TAL1/LMO, respectively.
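
The hierarchical step can be pictured as a second-level relabeling applied only to samples that receive the parent TAL/LMO call. The sketch below is illustrative only: the letter does not state how TAL2 overexpression is detected, so the expression value, the cutoff, and the function name are hypothetical placeholders rather than the TALLSorts implementation.

```python
def refine_tal_lmo(first_level_calls, tal2_expression, tal2_cutoff):
    """Split the parent TAL/LMO call into TAL2 or TAL1/LMO.

    first_level_calls: list of subtype names from the first-level classifier
    tal2_expression:   TAL2 expression value for the sample (hypothetical scale)
    tal2_cutoff:       hypothetical cutoff above which TAL2 counts as overexpressed
    """
    refined = []
    for subtype in first_level_calls:
        if subtype == "TAL/LMO":
            # Only samples already called TAL/LMO reach the second level.
            child = "TAL2" if tal2_expression > tal2_cutoff else "TAL1/LMO"
            refined.append(child)
        else:
            refined.append(subtype)  # other subtypes pass through unchanged
    return refined


print(refine_tal_lmo(["TAL/LMO"], tal2_expression=8.0, tal2_cutoff=5.0))
print(refine_tal_lmo(["TAL/LMO"], tal2_expression=1.0, tal2_cutoff=5.0))
print(refine_tal_lmo(["TLX3"], tal2_expression=8.0, tal2_cutoff=5.0))
```

Keeping the split as a separate step means that the rarer TAL2 group is only ever distinguished among samples that already resemble the TAL/LMO expression profile.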

In the case of the Diverse subtype, its analogous groupings within the systems of both Dai et al and Brady et al suggest it is composed of samples with heterogeneous genetic profiles not otherwise captured by the other subtypes and with lesions not otherwise defined. The lack of an SPI1 label is acknowledged as a limitation, because only 1 SPI1-labeled sample was available within the TARGET data set. Training a cluster label from a single sample is inappropriate; however, we hope to create an SPI1 label in the future as we retrain TALLSorts on more SPI1 samples.

The Rand index for concordance of TARGET samples between our cluster labels and the subgroupings of Dai et al was 0.97, indicating a high degree of agreement. Similarly, the Rand index between our cluster labels and the subgroupings of Brady et al was 0.87. This lower value likely reflects mainly the absence of SPI1 and LMO1/2 as TALLSorts labels, together with the discrimination between HOXA samples in the TALLSorts labels that is not made by Brady et al.
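
For readers unfamiliar with the measure, the Rand index is the fraction of sample pairs on which 2 labelings agree, that is, pairs placed together in both or apart in both. The sketch below computes the unadjusted index on made-up labels; whether an adjusted variant was used for the values reported here is not stated in this letter, and the example labels are hypothetical.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Unadjusted Rand index between two labelings of the same samples."""
    if len(labels_a) != len(labels_b):
        raise ValueError("labelings must cover the same samples")
    pairs = list(combinations(range(len(labels_a)), 2))
    if not pairs:
        raise ValueError("need at least two samples")
    # A pair agrees when both labelings group it together or both separate it.
    agreeing = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agreeing / len(pairs)


# Hypothetical labels for 5 samples under two naming schemes.
ours = ["TLX1", "TLX1", "TLX3", "TLX3", "HOXA"]
theirs = ["T1", "T1", "T3", "HOX", "HOX"]
print(rand_index(ours, theirs))
```

The index depends only on which samples are grouped together, not on the names of the groups, so 2 systems with different subtype names can still be compared directly. Splitting 1 group that the other system keeps whole, or merging 2 that it separates, lowers the value.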

Validating TALLSorts on samples held out from training demonstrated 97% accuracy (109 of 112 samples; Figure 2E). A sample classified into multiple subtypes was considered accurately labeled if its true label was among the assigned subtypes. Notably, all 109 correctly labeled samples were classified with >70% probability, and 93 (85%) were classified with a probability >90%, demonstrating model confidence in its assignments. Of the 37 samples in the holdout set classified as TAL/LMO and from the TARGET cohort, all 4 TAL2 samples (as labeled by Brady et al) were identified by the hierarchical step, with no misidentifications.
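
The scoring rule for samples with multiple calls can be written down in a few lines. The sketch below uses hypothetical sample labels to show the rule and then rechecks the rounded percentages quoted in the preceding paragraph; the function names are illustrative and not part of the TALLSorts package.

```python
def is_correct(true_label, predicted_subtypes):
    """A sample counts as correct if its true label is among its calls."""
    return true_label in predicted_subtypes


def accuracy(truth, predictions):
    """Fraction of samples whose true label appears in their predicted set."""
    if len(truth) != len(predictions) or not truth:
        raise ValueError("need matching, non-empty truth and predictions")
    hits = sum(is_correct(t, p) for t, p in zip(truth, predictions))
    return hits / len(truth)


# Hypothetical toy data: the second sample has two calls, one of them correct;
# the fourth sample exceeds the threshold for no subtype.
truth = ["TLX1", "TLX3", "BCL11B", "Diverse"]
predictions = [{"TLX1"}, {"TLX3", "NKX2"}, {"Diverse"}, set()]
print(accuracy(truth, predictions))

# Arithmetic behind the percentages reported for the holdout set.
print(round(100 * 109 / 112))  # correctly labeled samples out of all holdout samples
print(round(100 * 93 / 109))   # correct samples with probability >90%
```

Under this rule a sample with several calls is never penalized for the extra labels, so the reported accuracy is best read alongside the number of multiply classified samples.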

Figure 2.

Validation of TALLSorts and example TALLSorts output figures. (A-B) Predicted probabilities for each sample (dot) by each subtype classifier (x-axis) within the holdout testing set (A) and the St Jude/Ma-Spore independent testing set (B). A red dot indicates a sample labeled as the subtype being tested; a blue dot indicates a different label. The black line is the threshold of 50% probability; all samples above this line are classified as positive for the relevant subtypes. These probability plots are generated as output by the TALLSorts package, albeit without the true positive/negative colorings in the case of data sets without known truth values. (C-D) Waterfall plots for the holdout test set (C) and independent test set (D). Each vertical bar is a sample colored by most likely subtype and with height proportional to its probability according to the classifier. Samples that do not exceed the 50% probability threshold for any subtype are labeled as “none/other.” These waterfall plots are also generated as output by the TALLSorts package. (E-F) Confusion matrices for the holdout test set (E) and independent test set (F). The y-axes correspond to subtype ground truth, and the x-axes are the TALLSorts predicted subtypes. A sample predicted positive for multiple subtypes is considered correctly labeled if its true label is one of the predictions.

Next, we validated TALLSorts using 66 samples from 2 cohorts that were included in the study by Dai et al but were incorporated into neither the training nor the holdout data sets: a study from St Jude Children’s Research Hospital19 (n = 41) and the Ma-Spore trial20 (n = 25). Of the 41 samples in the St Jude data set, 30 were also analyzed by Brady et al. Truth values for the subtypes of these samples were determined by cross-referencing their Dai et al and/or Brady et al labels (supplemental Table 5). Accuracy on this data set was similarly good at 94% (62 of 66 samples; Figure 2F). Of the 62 correctly classified samples, 58 (94%) had a probability >70% and 50 (81%) had a probability >90%. Although these samples were independent from the training and holdout sets, we acknowledge the limitation that their truth values were informed by the labeling from Dai et al, which was also used to generate truth values for the training and holdout sets.

The multiclass logistic regression model is a unique feature of TALLSorts and permits samples to be classified under multiple subtypes. This feature may prove beneficial for samples with >1 driver lesion, which may be only partially classified using decision-tree or clustering-based algorithms as suggested by Brady et al or Dai et al, respectively. Importantly, we identified 8 samples in the holdout data set and 3 samples in the independent data set that were classified into multiple subtypes (supplemental Table 7A,B).

TALLSorts also offers the user the ability to train custom classification models using the same standardization and logistic regression pipeline used by our classification model.

The TALLSorts software will undergo active maintenance, including retraining, to ensure its subtype classifications remain current with the latest developments in T-ALL subclassification. Specifically, we expect that as more molecular studies investigate specific lesions and their underlying mechanisms and clinical outcomes, the subtype definitions will become further refined. At this stage, we do not know the defining features of the Diverse classification, which we expect contains multiple distinct driving mechanisms. Similarly, as more samples become available, finer subtypes should become distinguishable from the gene expression data.

In conclusion, TALLSorts is an accurate, publicly available algorithm that uses RNA-seq expression to classify T-ALL subtypes reflecting the latest research in T-ALL subclassification. TALLSorts produces easily interpretable output files and images.

Further details regarding the construction and testing of TALLSorts are available in the accompanying supplemental Information.

Acknowledgments: Royal Children's Hospital (RCH) tumor samples and coded data were supplied by the Children’s Cancer Centre Biobank at the Murdoch Children’s Research Institute and The Royal Children’s Hospital. This research was supported by the Peter MacCallum Cancer Foundation. Establishment and running of the Children’s Cancer Centre was made possible through generous support by Cancer In Kids at RCH, The Royal Children’s Hospital Foundation, and the Murdoch Children’s Research Institute. The authors acknowledge the support of the Specialized Center of Research Program (SCOR) grant (7015-18) from the Lymphoma and Leukemia Society and of Perpetual Trustees and the Samuel Nissen Foundation. This work is supported by a Gilead Research Scholars Award (T.S.), funding from The Kid’s Cancer Project Mid-Career Fellowship (T.S.), and a National Health and Medical Research Council (NHMRC) investigator grant GNT1196256 (A.O.).

Contribution: A.G., B.S., and A.O. conceptualized the study and created the methodology; P.G.E., L.M.B., H.J.K., and T.S. provided technical, clinical, and biological expertise; A.G. and B.S. wrote the software; A.G., R.J., and A.L. performed data extraction and sequence alignment; A.G. performed the formal analysis and provided the original draft of the manuscript; A.G. and A.O. edited and reviewed the manuscript; and all authors read and approved the manuscript.

Conflict-of-interest disclosure: The authors declare no competing financial interests.

Correspondence: Alicia Oshlack, Computational Biology, Peter MacCallum Cancer Centre, 305 Grattan St, Melbourne, VIC 3000, Australia; email: alicia.oshlack@petermac.org.

1. Belver L, Ferrando A. The genetics and mechanisms of T cell acute lymphoblastic leukaemia. Nat Rev Cancer. 2016;16(8):494-507.
2. Girardi T, Vicente C, Cools J, De Keersmaecker K. The genetics and molecular biology of T-ALL. Blood. 2017;129(9):1113-1123.
3. Arber DA, Orazi A, Hasserjian R, et al. The 2016 revision to the World Health Organization classification of myeloid neoplasms and acute leukemia. Blood. 2016;127(20):2391-2405.
4. Arber DA, Orazi A, Hasserjian RP, et al. International consensus classification of myeloid neoplasms and acute leukemias: integrating morphologic, clinical, and genomic data. Blood. 2022;140(11):1200-1228.
5. Duffield AS, Mullighan CG, Borowitz MJ. International Consensus classification of acute lymphoblastic leukemia/lymphoma. Virchows Arch. 2022;482(1):11-26.
6. Dai Y-T, Zhang F, Fang H, et al. Transcriptome-wide subtyping of pediatric and adult T cell acute lymphoblastic leukemia in an international study of 707 cases. Proc Natl Acad Sci U S A. 2022;119(15):e2120787119.
7. Montefiori LE, Bendig S, Gu Z, et al. Enhancer hijacking drives oncogenic BCL11B expression in lineage-ambiguous stem cell leukemia. Cancer Discov. 2021;11(11):2846-2867.
8. Brady SW, Roberts KG, Gu Z, et al. The genomic landscape of pediatric acute lymphoblastic leukemia. Nat Genet. 2022;54(9):1376-1389.
9. Schmidt B, Brown LM, Ryland GL, et al. ALLSorts: an RNA-Seq subtype classifier for B-cell acute lymphoblastic leukemia. Blood Adv. 2022;6(14):4093-4097.
10. Pedregosa F, Varoquaux G, Gramfort A, et al. Scikit-learn: machine learning in Python. J Mach Learn Res. 2011;12(85):2825-2830.
11. Brown LM, Lonsdale A, Zhu A, et al. The application of RNA sequencing for the diagnosis and genomic classification of pediatric acute lymphoblastic leukemia. Blood Adv. 2020;4(5):930-942.
12. Liu Y, Easton J, Shao Y, et al. The genomic landscape of pediatric and young adult T-lineage acute lymphoblastic leukemia. Nat Genet. 2017;49(8):1211-1218.
13. Verboom K, Van Loocke W, Volders PJ, et al. A comprehensive inventory of TLX1 controlled long non-coding RNAs in T-cell acute lymphoblastic leukemia through polyA+ and total RNA sequencing. Haematologica. 2018;103(12):e585-e589.
14. Buratin A, Paganin M, Gaffo E, et al. Large-scale circular RNA deregulation in T-ALL: unlocking unique ectopic expression of molecular subtypes. Blood Adv. 2020;4(23):5902-5914.
15. Johnson WE, Li C, Rabinovic A. Adjusting batch effects in microarray expression data using empirical Bayes methods. Biostatistics. 2007;8(1):118-127.
16. Risso D, Ngai J, Speed TP, Dudoit S. Normalization of RNA-seq data using factor analysis of control genes or samples. Nat Biotechnol. 2014;32(9):896-902.
17. McInnes L, Healy J, Melville J. UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. ArXiv eprints. 1802.03426. Accessed 1 April 2023. https://doi.org/10.48550/arXiv.1802.03426.
18. Ester M, Kriegel H-P, Sander J, Xu X. A density-based algorithm for discovering clusters in large spatial databases with noise. 2nd International Conference on Knowledge Discovery and Data Mining. AAAI Press; 1996:226-231.
19. Autry RJ, Paugh SW, Carter R, et al. Integrative genomic analyses reveal mechanisms of glucocorticoid resistance in acute lymphoblastic leukemia. Nat Cancer. 2020;1(3):329-344.
20. Qian M, Zhang H, Kham SK-Y, et al. Whole-transcriptome sequencing identifies a distinct subtype of acute lymphoblastic leukemia with predominant genomic abnormalities of EP300 and CREBBP. Genome Res. 2017;27(2):185-195.

Author notes

The TARGET data set is hosted by the NCBI database of Genotypes and Phenotypes (accession number phs000218).

The RNA-seq data sets are available at NCBI Gene Expression Omnibus database (accession numbers GSE110633 and GSE110677).

The RCH data set is available at the European Genome-Phenome Archive (accession number EGAS00001004212).

The Ma-Spore data set is available at the European Genome-Phenome Archive (accession number EGAD00001002151).

The data set from St Jude is available at NCBI Gene Expression Omnibus database (accession numbers GSE115525 and GSE124824).

TALLSorts, its source code, and instructions for use are available for download via GitHub at https://github.com/Oshlack/TALLSorts.

All other code is available on request from the corresponding author, Alicia Oshlack (alicia.oshlack@petermac.org).

The full-text version of this article contains a data supplement.