TO THE EDITOR:

Machine learning (ML) has revolutionized many industries, including health care, by providing innovative solutions to some of their most pressing problems. With the advancement of technology and the increasing amounts of data being generated, ML has become a central tool for health care professionals in various fields, such as diagnostics, drug discovery, and personalized medicine.1-6 The ability of ML algorithms to analyze vast amounts of complex data has led to improved accuracy and speed in diagnosis, better targeting of treatments, and more personalized care for patients.

Reading a scientific paper that uses ML methodologies can be a challenging task for those who are not familiar with the field.6 However, with a clear understanding of the basic concepts and a critical approach, it is possible to gain valuable insights from these papers. In this commentary, we provide a step-by-step guide on how to read a scientific paper that uses ML methodologies.

Step 1: Understand the problem being addressed. The first step in reading an ML paper is to understand the problem that the authors are trying to solve and, more importantly, the clinical or scientific impact of solving it.7 In other words, if the aim of the study is to solve a clinical problem, how does the answer or recommendation provided by the algorithm help physicians or researchers in their day-to-day practice, and is the solution mature enough to be implemented in clinical workflows? Major clinical problems in health care mainly affect either patient outcomes or operations (ie, can the process be made easier and faster for the patient and the health care system?).

Step 2: Assess the quality of the data. The quality of the data used to build the ML model is crucial for the validity of the results. The following questions can be used to evaluate the data (a short audit sketch follows the list):

  1. Sample size: Are the training, validation, and test sets large enough to build a reproducible and generalizable ML model? Is the size of the data appropriate for the chosen methods (ie, some methods are “data-hungry,” and understanding which methods require larger data sets is key)? Note that different algorithms require different data types (image, tabular, text, or others) and sizes, and there is no rule of thumb or formula that can estimate the ideal data size.

  2. Relevance: Are the data appropriate and relevant to the problem that the model is trying to solve?

  3. Accuracy: How were the data collected and annotated (eg, by humans or by natural language processing)? How were the data transformed to make them ready for ML use?

  4. Consistency: Are the data consistent? Do they have missing values, and how did the authors deal with them?

  5. Representativeness: The data should be representative of the population being studied.

  6. Balance: The data should be balanced, with roughly equal representation of all relevant classes or groups. However, most health care data are unbalanced, so it is critical to understand how the authors dealt with the imbalance.

  7. Bias: To evaluate bias in data, it is important to look at the distribution of certain characteristics, such as race, gender, or socioeconomic status, among the samples in the data set,8 and how the data were collected. This will help to identify any disparities or overrepresentation of certain groups, which can indicate the presence of bias in the data. It is critical to evaluate bias at this stage because if this is not addressed properly, it could produce a biased model.8-10 
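As a brief illustration of how a reader might spot-check several of these points (consistency, balance, and representativeness) on a tabular data set, consider the following minimal Python sketch with pandas. The cohort, column names, and injected missingness are entirely hypothetical and are not drawn from any of the cited papers.

```python
import numpy as np
import pandas as pd

# Hypothetical cohort: 1000 patients, an unbalanced outcome label, and some
# deliberately injected missing ages.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "age": rng.integers(20, 90, 1000).astype(float),
    "gender": rng.choice(["F", "M"], 1000),
    "outcome": rng.choice([0, 1], 1000, p=[0.9, 0.1]),
})
df.loc[rng.choice(1000, 50, replace=False), "age"] = np.nan

# Consistency: fraction of missing values per column.
print(df.isna().mean())

# Balance: class distribution of the outcome label.
print(df["outcome"].value_counts(normalize=True))

# Representativeness/bias: cohort size and outcome rate across a sensitive attribute.
print(df.groupby("gender")["outcome"].agg(["count", "mean"]))
```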

Step 3: Familiarize yourself with the ML methods used. The next step is to understand the ML methods that the authors have used to solve the problem. Many papers will provide a brief overview of the methods used (in clinical or applied journals), but it is important to have a good understanding of the underlying concepts.1-6 It is critical to familiarize yourself with some of these terminologies presented in Table 1. There are many papers that explain these terminologies in a very simple manner.1-6 It is also important to understand the key issues in building ML (Figure 1) models and what the authors did to address these at each step.

Figure 1.

Steps to build a machine learning model. Problem formulation: The first step is to clearly define the problem that you want to solve. This involves defining the inputs and outputs of your model, as well as the type of problem you are trying to solve (classification, regression, clustering, etc). It is important to have a clear understanding of the problem you are trying to solve before you start building a model. Data collection: Once you have formulated the problem, the next step is to collect the relevant data. This may involve scraping data from websites, downloading data sets from public repositories, or collecting data through surveys or experiments. It is important to collect enough data to train your model and validate its performance. Data preparation: After collecting the data, you will need to clean and preprocess it. This involves removing any irrelevant data, dealing with missing values, and transforming the data into a suitable format for ML algorithms. It also includes dividing the data set into training, validation, and test cohorts. This step can take a lot of time and effort, but it is essential for building an accurate and effective model. Feature engineering: Feature engineering is the process of selecting and transforming the input variables (features) in a way that will improve the performance of the model. This may involve selecting the most relevant features, transforming them into a different representation (eg, using one-hot encoding), or creating new features based on existing ones. Feature engineering can have a significant impact on the performance of the model. Model selection: Once you have prepared the data and engineered the features, the next step is to select a suitable ML algorithm. This involves choosing the type of algorithm (eg, decision trees, neural networks, support vector machines) and the specific parameters of the algorithm. This step requires some knowledge of ML and experience with different algorithms. Model training: After selecting the algorithm, the next step is to train the model on the prepared data. This involves feeding the input data into the algorithm and adjusting the model parameters to optimize its performance. This step can take a lot of time and computational resources, especially for large data sets and complex models. Model evaluation: Once the model has been trained, the next step is to evaluate its performance on a separate test set of data. This involves measuring metrics, such as accuracy, precision, recall, and F1 score, to assess the performance of the model. It is important to test the model on data that it has not seen before to ensure that it can be generalized to new data. Model optimization: If the model performance is not satisfactory, then the next step is to optimize the model. This involves tweaking the model parameters, changing the algorithm, or modifying the feature engineering process to improve the model’s performance. This step may require several iterations until the desired level of performance is achieved. Model deployment: Once you have built a satisfactory model, the final step is to deploy it in a production environment. This may involve integrating the model into a web application, creating an application programming interface for other developers to use, or deploying it as a stand-alone application. It is important to ensure that the model is well documented and tested thoroughly before it is deployed.
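As a companion to Figure 1, the following is a minimal sketch of the same workflow in Python using scikit-learn and synthetic data. It is a generic illustration under assumed defaults, not any specific paper’s method; the data, model choice, and parameters are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Data collection (synthetic stand-in) with a 90/10 class imbalance.
X, y = make_classification(n_samples=2000, n_features=20, weights=[0.9], random_state=0)

# Data preparation: divide into training, validation, and test cohorts.
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, stratify=y, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, stratify=y_tmp, random_state=0)

# Model selection and training.
model = RandomForestClassifier(n_estimators=200, random_state=0)
model.fit(X_train, y_train)

# Model evaluation and optimization: iterate against the validation cohort only.
print(classification_report(y_val, model.predict(X_val)))

# The test cohort is touched once, at the very end, on data the model has
# never seen, to estimate how well it generalizes.
print(classification_report(y_test, model.predict(X_test)))
```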


Step 4: Evaluate the results and how they are presented in the paper. Ask these questions:

  1. How did the authors divide the cohort (eg, into training, validation, and test sets)? Ideally, the test cohort should be completely independent of the data used for training. Where and how were these cohorts collected? Is this a single-center or multicenter study? Are the centers located in 1 country or worldwide? Was the model tested at another site?

  2. How did the authors report the efficacy of the model? Reporting accuracy and the area under the curve (AUC) alone can be misleading, especially when the data are unbalanced (eg, if the authors build an ML model to predict bleeding in the brain on computed tomography, and bleeding occurs in 3% of scans, a model can be correct 97% of the time simply by always answering “no bleed,” yet such a model is clinically useless). Authors should report the entire confusion matrix (true positives, false positives, true negatives, and false negatives) along with other important metrics, such as precision, recall, precision-recall AUC, and F1 score, plus any metrics specific to the type of ML algorithm applied (Table 1). These metrics should be reported on the validation and test cohorts, not on the training cohort (see the first sketch after this list).

  3. Is there evidence of overfitting or underfitting (Table 1)? To look for such evidence, one can examine the training and validation accuracy (a training accuracy much higher than the validation accuracy suggests overfitting, whereas low accuracy on both suggests underfitting), learning curves, cross-validation results, and test set performance. These techniques provide insight into whether the model is overfitting or underfitting the data and can help in selecting a model with optimal performance.

  4. Is the model (and its subsequent predictions) explainable? Explainability of ML models in health care is very important. It allows the end user (health care provider, researcher, etc) to understand the model and learn from it and, more importantly, provides assurance that the model is not making its final prediction from patterns in the image or data set that are irrelevant. Several studies have shown that deep learning algorithms evaluating outcomes from imaging data can base predictions on areas of the image that are not of interest.11 That said, some ML models can be useful without explainability if they are constructed and validated properly (see the explainability sketch after this list).
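To make the 3% prevalence example in point 2 concrete, the following hedged sketch (Python, scikit-learn, synthetic labels) shows how a majority-class “model” reaches roughly 97% accuracy while the confusion matrix, recall, and F1 score expose its clinical uselessness.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score, recall_score

rng = np.random.default_rng(0)
y_true = (rng.random(10_000) < 0.03).astype(int)  # ~3% positives (eg, brain bleed)
y_naive = np.zeros_like(y_true)                   # always predicts "no bleed"

print(accuracy_score(y_true, y_naive))            # ~0.97, yet clinically useless
print(confusion_matrix(y_true, y_naive))          # every bleed is a false negative
print(recall_score(y_true, y_naive))              # 0.0: no bleed is ever caught
print(f1_score(y_true, y_naive, zero_division=0)) # 0.0 once precision is accounted for
```

A related quick check for the overfitting gap in point 3 is to compute the same metrics on both the training and validation cohorts; a large difference between them favors overfitting.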
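For the explainability question in point 4, model-agnostic tools offer a first pass at what a model relies on. The sketch below uses scikit-learn’s permutation importance on synthetic data; this is just one of many possible approaches, chosen here for illustration, and is not a method prescribed by any of the cited papers.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a clinical data set and model.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, stratify=y, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# How much does validation performance drop when each feature is shuffled?
result = permutation_importance(model, X_val, y_val, n_repeats=10, random_state=0)
for i in result.importances_mean.argsort()[::-1][:5]:  # 5 most influential features
    print(f"feature {i}: importance {result.importances_mean[i]:.3f}")
```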

Step 5: Critically evaluate the conclusions and implications. Finally, it is important to critically evaluate the conclusions and implications of the study and whether the results support the conclusions. This includes considering the limitations of the study, the generalizability of the results, and the potential impact of the findings on the field. If the intention of the study is to develop a novel model, rather than to use ML as an analytic tool, it is also important to consider the practical deployment of that model. Deployment options for ML models include developing a user-friendly interface for inputting data and receiving outputs, integrating the model into an electronic health record or imaging database within a hospital, or other methods. Regardless of the chosen deployment strategy, authors should outline their plans for making the model accessible to the public and address the steps they will take to deploy the model after publication.
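As a minimal illustration of one such deployment pattern, the following sketch serializes a trained model and wraps prediction behind a small, documented function that a user interface or electronic health record integration could call. The model, file name, and function interface are all hypothetical.

```python
import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Train a throwaway model on synthetic data, then persist it to disk.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X, y)
joblib.dump(model, "model.joblib")

def predict_risk(features):
    """Return the predicted probability of the positive class for one record."""
    loaded = joblib.load("model.joblib")
    return float(loaded.predict_proba([features])[0, 1])

print(predict_risk(list(X[0])))
```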

With the widespread use of ML methodology in scientific papers, it has become important for all physicians and researchers to understand how these models are built, validated, and deployed. This understanding will enable us to distinguish poor scientific studies from rigorous ones, appreciate the strengths and limitations of these algorithms, and learn how to overcome those limitations.

Contribution: A.N. wrote the initial draft, and O.E., S.M., T.H., and M.M. reviewed, edited, and approved the final manuscript.

Conflict-of-interest disclosure: A.N. works at Incyte Pharma and owns stocks at Incyte and Amazon. T.H. works at Munich Leukemia Laboratory. The remaining authors declare no competing financial interests.

See "Appendix" for members of the American Society of Hematology Artificial Intelligence Taskforce.

Correspondence: Aziz Nazha, Thomas Jefferson University, 1007 Stewart St, Philadelphia, PA 98101; e-mail: ANazha@incyte.com.

The members of the American Society of Hematology Artificial Intelligence Taskforce are Aziz Nazha, Olivier Elemento, Shannon McWeeney, Moses Miles, and Torsten Haferlach.

1. Rajpurkar P, Chen E, Banerjee O, Topol EJ. AI in health and medicine. Nat Med. 2022;28(1):31-38.

2. Topol EJ. High-performance medicine: the convergence of human and artificial intelligence. Nat Med. 2019;25(1):44-56.

3. Vamathevan J, Clark D, Czodrowski P, et al. Applications of machine learning in drug discovery and development. Nat Rev Drug Discov. 2019;18(6):463-477.

4. Radakovich N, Nagy M, Nazha A. Machine learning in haematological malignancies. Lancet Haematol. 2020;7(7):e541-e550.

5. Nagy M, Radakovich N, Nazha A. Machine learning in oncology: what should clinicians know? JCO Clin Cancer Inform. 2020;4:799-810.

6. Liu Y, Chen PHC, Krause J, Peng L. How to read articles that use machine learning: users’ guides to the medical literature. JAMA. 2019;322(18):1806-1816.

7. Kelly CJ, Karthikesalingam A, Suleyman M, Corrado G, King D. Key challenges for delivering clinical impact with artificial intelligence. BMC Med. 2019;17(1):195.

8. Vokinger KN, Feuerriegel S, Kesselheim AS. Mitigating bias in machine learning for medicine. Commun Med. 2021;1(1):25.

9. Seyyed-Kalantari L, Zhang H, McDermott MBA, Chen IY, Ghassemi M. Underdiagnosis bias of artificial intelligence algorithms applied to chest radiographs in under-served patient populations. Nat Med. 2021;27(12):2176-2182.

10. Ravi N, Chaturvedi P, Huerta EA, et al. FAIR principles for AI models with a practical application for accelerated high energy diffraction microscopy. Sci Data. 2022;9(1):657.

11. DeGrave AJ, Janizek JD, Lee S-I. AI for radiographic COVID-19 detection selects shortcuts over signal. Nat Mach Intell. 2021;3(7):610-619.