Abstract 5095

Method

The goal of this research is to assess inter-reader variability in identifying centroblast (CB) cells in digitized H&E-stained follicular lymphoma (FL) cases. We enrolled three board-certified hematopathologists experienced in FL grading to complete reading sessions on 500 high-power field (HPF; 40× magnification) images that three hematopathologists selected from 17 H&E digital slides. Each slide represents one patient, and the dataset comprises lymphoma cases of FL grades 1, 2, and 3. Each pathologist graded the same set of 500 images, examining the digital images and recording the spatial coordinates of CBs with in-house software that allowed CB cells to be marked using only a computer mouse.
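As a rough sketch of such an annotation step (assuming a matplotlib-based viewer and a hypothetical image file name; this is not the authors' in-house software), click coordinates can be captured and saved roughly as follows:

    import csv
    import matplotlib.pyplot as plt
    import matplotlib.image as mpimg

    marks = []  # collected (x, y) coordinates for the current image

    img = mpimg.imread("hpf_image_001.png")  # hypothetical file name
    fig, ax = plt.subplots()
    ax.imshow(img)

    def on_click(event):
        # Ignore clicks that fall outside the image axes.
        if event.xdata is not None and event.ydata is not None:
            marks.append((event.xdata, event.ydata))
            ax.plot(event.xdata, event.ydata, "r+")  # show the mark on screen
            fig.canvas.draw_idle()

    fig.canvas.mpl_connect("button_press_event", on_click)
    plt.show()  # blocks until the reader closes the window

    # Persist the marked CB coordinates for later analysis.
    with open("hpf_image_001_cb_marks.csv", "w", newline="") as f:
        csv.writer(f).writerows(marks)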

Experimental Results

The results from each reading session were analyzed in terms of FL grade, which was determined by averaging the centroblast counts across the 28–30 images for each patient and assigning a grade using the standard WHO guidelines: Grade I = 0–5, Grade II = 6–15, and Grade III = >15 centroblasts per image. We first used kappa statistics and p-values to measure inter-reader agreement on the three-level grade, and then computed the same metrics to measure agreement on a two-level diagnosis: Grade I or II (no chemoprevention assigned) versus Grade III (chemoprevention assigned).
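As an illustrative sketch (not the authors' code), the snippet below maps hypothetical per-patient mean CB counts to the WHO thresholds above and computes a weighted Cohen's kappa with scikit-learn; the linear weighting is an assumption, since the abstract does not state the weighting scheme.

    from sklearn.metrics import cohen_kappa_score

    def who_grade(mean_cb_per_image):
        # WHO thresholds from the text: Grade I = 0-5, Grade II = 6-15, Grade III = >15 CB/image
        if mean_cb_per_image <= 5:
            return 1
        elif mean_cb_per_image <= 15:
            return 2
        return 3

    # Hypothetical per-patient mean CB counts for two readers (illustration only).
    reader1_means = [2.1, 7.4, 18.0, 4.9, 12.3]
    reader2_means = [3.0, 9.1, 22.5, 6.2, 16.8]

    grades1 = [who_grade(m) for m in reader1_means]
    grades2 = [who_grade(m) for m in reader2_means]

    # Linear weights penalize disagreements by how many grade levels apart they fall.
    kappa = cohen_kappa_score(grades1, grades2, weights="linear")
    print(f"weighted kappa = {kappa:.3f}")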

Table 1. Kappa and p-values for three-level grade (Grades I, II, III)

    Pathologists Comparison    Kappa (1)    p-value (2)
    1&2                        0.514        0.0084
    1&3                        0.377        0.103
    2&3                        0.261        0.103

(1) Landis and Koch [1] guidelines for the degree of kappa agreement: < 0 poor, 0–0.20 slight, 0.21–0.40 fair, 0.41–0.60 moderate, 0.61–0.80 substantial, 0.81–1.00 almost perfect.

(2) Holm's method [2] used to correct for multiple tests.
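For footnote (2), a small worked example of Holm's step-down correction applied to three hypothetical uncorrected pairwise p-values (made-up values, not the study's raw p-values), using statsmodels:

    from statsmodels.stats.multitest import multipletests

    raw_p = [0.010, 0.030, 0.040]   # hypothetical uncorrected p-values, one per reader pair
    reject, p_holm, _, _ = multipletests(raw_p, alpha=0.05, method="holm")

    # Holm by hand: sort ascending, multiply the k-th smallest p by (m - k + 1)
    # with m = 3 tests, then enforce that adjusted p-values are non-decreasing.
    print(p_holm)   # [0.03 0.06 0.06]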

Table 2. Kappa and p-values for two-level diagnosis (Grade I or II vs. III)

    Pathologists Comparison    Kappa (1)    p-value (2)
    1&2                        0.564        0.0822
    1&3                        0.290
    2&3                        0.191

(1) Landis and Koch [1] guidelines for the degree of kappa agreement: < 0 poor, 0–0.20 slight, 0.21–0.40 fair, 0.41–0.60 moderate, 0.61–0.80 substantial, 0.81–1.00 almost perfect.

(2) Holm's method [2] used to correct for multiple tests.

Table 3. Grade determination by pathologist (columns: Pathologist, Grade I, Grade II, Grade III)

Table 4. Summary of centroblast count determination by pathologist

    Pathologist    Mean    Standard deviation*
    1              12.2    16.1
    2              23.5    27.7
    3              6.7     4.6

* Standard deviation of the means for each slide.
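A brief sketch of how the starred quantity could be computed from per-image counts grouped by slide (the counts and slide names are made up, and treating the Mean column as the mean of the per-slide means is an assumption):

    import statistics

    # Hypothetical per-image CB counts, keyed by slide (each slide has 28-30 HPF images in the study).
    counts_by_slide = {
        "slide_01": [3, 5, 2, 4],
        "slide_02": [11, 9, 14, 12],
        "slide_03": [25, 30, 22, 27],
    }

    slide_means = [statistics.mean(v) for v in counts_by_slide.values()]
    overall_mean = statistics.mean(slide_means)          # reported "Mean" (assumed definition)
    sd_of_slide_means = statistics.stdev(slide_means)    # "Standard deviation of the means for each slide"
    print(f"mean = {overall_mean:.1f}, SD of slide means = {sd_of_slide_means:.1f}")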

Discussion

Table 1 provides the weighted kappa statistics based on the three-level grading system. There was significant, moderate agreement between pathologists 1 and 2 on grade. Pathologist 3, however, disagreed considerably with pathologists 1 and 2 on grade. We also examined agreement on the clinically significant diagnosis (Grade I or II versus III; see Table 2): the kappa statistics show that pathologists 1 and 2 moderately agreed in their diagnosis, though the agreement was only marginally significant. Again, pathologist 3 did not agree with pathologists 1 and 2; in these cases, the weighted kappas are equal to zero, suggesting that there is no agreement between pathologist 3 and pathologists 1 and 2. Table 3 summarizes the per-patient grade determinations for each pathologist, and Table 4 reports the mean and standard deviation of the per-patient centroblast counts for each pathologist. These tables demonstrate a large amount of variability in both grade and centroblast count: pathologist 2 identified the most centroblasts and consequently identified the highest percentage of grade 3 cases, whereas pathologist 3 was considerably more conservative than pathologists 1 and 2 in identifying centroblasts and did not identify any grade 3 cases.

Conclusion

In this study, we examined inter-reader variability in grading follicular lymphoma from digital images based on centroblast count. We found high variability in centroblast counts and grades across pathologists, with inter-reader agreement ranging from none to, at best, moderate. A larger data set and more pathologists will be considered in the near future to improve the generalizability of our results.

References

1. J. R. Landis and G. G. Koch, "The measurement of observer agreement for categorical data," Biometrics, vol. 33, pp. 159–174, 1977.

2. S. Holm, "A simple sequentially rejective multiple test procedure," Scandinavian Journal of Statistics, vol. 6, pp. 65–70, 1979.

Disclosures:

No relevant conflicts of interest to declare.

