Introduction: Epidemiologic research in hematologic malignancies has depended on three sources of data: SEER data, patients accrued to large clinical trials, and databases generated by individual providers or departments. Each of these data sets has major limitations. In theory, the adoption of computerized databases by pathology departments and of electronic medical records should have greatly expanded the ability to identify patients with particular types of malignancies. In practice, the need to manually classify tens of thousands of free text pathology reports has made this resource unavailable. We have developed a computer program to extract and codify the final diagnoses described in free text pathology reports according to the World Health Organization (WHO) classification of hematologic malignancies. This enables us to identify sets of cases with desired pathologies for further study.

Methods: A medical records review protocol was approved by the Dana-Farber/Harvard Cancer Center Institutional Review Board. Using our clinical lymphoma database, we collected records of patients at the Massachusetts General Hospital (MGH) Cancer Center who carried a diagnosis of follicular lymphoma (FL) or diffuse large B-cell lymphoma (DLBCL). The program is modified from one developed to extract diagnoses from discharge summaries in a different context [Long, AMIA 2007]. The approach is to use punctuation and a few words (conjunctions and some common verbs) to divide the text into phrases and then use a search procedure to find the most specific matching concepts in the Unified Medical Language System (UMLS). The search uses all of the alternate phrases for each concept as included in the UMLS normalized string table, matched against normalized subphrases from the text. We mapped UMLS concepts to the desired WHO concepts. To ensure a complete list of UMLS concepts, we used the hierarchical relations in the UMLS and examined both more specific and more general concepts for possible inclusion in the search. Since not all diseases mentioned in a report are part of the diagnosis, we developed pattern matching procedures to identify the parts of the pathology report containing the final diagnosis (as opposed to a “note” or “clinical data” section, which might contain patient-specific clinical information that is not diagnosed in the current pathology report). The program also uses a strategy for identifying modifiers that change the sense of the diagnosis (“suggestive of”, “rule out”, “no”, etc.) to exclude diseases that are absent or only possible.
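The phrase-splitting, lookup, and modifier steps described above can be illustrated with a minimal Python sketch. It is not the program used in the study. The small LEXICON dictionary is a hypothetical stand-in for the UMLS normalized string table and the UMLS-to-WHO mapping, the list of splitting words is assumed, "longest matching subphrase" is used as a simple proxy for "most specific concept", and all function names are invented for illustration. Detection of the final diagnosis section is omitted.

```python
import re

# Hypothetical toy lexicon: normalized phrase -> WHO category.
# It stands in for the UMLS normalized string table plus the UMLS-to-WHO mapping.
LEXICON = {
    "follicular lymphoma": "FL",
    "malignant lymphoma follicle center": "FL",
    "diffuse large b cell lymphoma": "DLBCL",
    "histiocyte rich large b cell lymphoma": "DLBCL",
}

# Modifiers named in the text that mark a disease as absent or only possible.
MODIFIERS = ("suggestive of", "rule out", "no")

# Assumed phrase boundaries: sentence punctuation, conjunctions, a few common verbs.
# Commas are deliberately not used as boundaries in this sketch.
SPLIT_RE = re.compile(r"[.;:]|\b(?:and|or|but|is|are|shows?)\b", re.IGNORECASE)


def normalize(text):
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    return " ".join(re.sub(r"[^a-z0-9 ]", " ", text.lower()).split())


def split_phrases(text):
    """Divide report text into candidate phrases."""
    return [p.strip() for p in SPLIT_RE.split(text) if p.strip()]


def has_modifier(phrase):
    """True if the phrase contains a modifier that negates or hedges the diagnosis."""
    norm = normalize(phrase)
    return any(re.search(rf"\b{re.escape(m)}\b", norm) for m in MODIFIERS)


def best_match(phrase):
    """Return the category of the longest contiguous subphrase found in the lexicon."""
    words = normalize(phrase).split()
    for n in range(len(words), 0, -1):          # longest subphrases first
        for i in range(len(words) - n + 1):
            sub = " ".join(words[i:i + n])
            if sub in LEXICON:
                return LEXICON[sub]
    return None


def extract_diagnoses(text):
    """Collect categories from phrases that carry no excluding modifier."""
    found = []
    for phrase in split_phrases(text):
        if has_modifier(phrase):
            continue
        category = best_match(phrase)
        if category and category not in found:
            found.append(category)
    return found


print(extract_diagnoses("MALIGNANT LYMPHOMA, FOLLICLE CENTER."))
print(extract_diagnoses("No evidence of follicular lymphoma."))
```

Because the lookup in this sketch requires a contiguous subphrase, a string such as "B-CELL LYMPHOMA, CONSISTENT WITH FOLLICLE CENTER CELL TYPE" yields no match here.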

Results: We used the system to identify cases of FL and DLBCL and compared the results to lists of cases generated manually. The current program was 90% accurate in automatically classifying pathology reports as describing follicular lymphoma. Of 150 cases of FL, 3 were in reports not available to the program; of the remaining 147, the program found 133 (133/147, 90%) (eg, MALIGNANT LYMPHOMA, FOLLICLE CENTER). Of 100 DLBCL cases, 76 were available and 63 of those were found (63/76, 83%) (eg, HISTIOCYTE-RICH LARGE B-CELL LYMPHOMA). There were several reasons for the missed cases. Most commonly (13 cases), the diagnosis was not in the identified diagnosis section, either because the section was divided by a note or because the diagnosis was in an addendum. Only 3 cases had phrases not found in the UMLS. In 2 others, the program missed the diagnosis because the phrase was not contiguous (eg, B-CELL LYMPHOMA, CONSISTENT WITH FOLLICLE CENTER CELL TYPE). In 5 of the DLBCL cases, the stated final diagnosis was more general than the desired diagnoses, and 2 of the FL cases were listed as “strongly suggestive”, which the program concluded was not a definite diagnosis.
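As a check on the arithmetic, the reported percentages are the number of cases found divided by the number of cases whose reports were available to the program. The short Python sketch below recomputes them from the counts stated above. The function name is hypothetical.

```python
def found_fraction(found, available):
    """Fraction of available cases that the program identified."""
    return found / available


fl = found_fraction(133, 150 - 3)   # 3 FL reports were not available
dlbcl = found_fraction(63, 76)      # 76 of 100 DLBCL cases were available

print(f"FL: {fl:.1%}")        # prints "FL: 90.5%"
print(f"DLBCL: {dlbcl:.1%}")  # prints "DLBCL: 82.9%"
```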

Discussion: This program is useful for identifying desired cohorts of cases. It can be improved with better identification of the sections containing the diagnosis, the addition of a few missing phrases as they are discovered, and the addition of techniques for handling common discontinuities in disease descriptions (eg, allowing “consistent with” in the phrase). We have shown that very simple natural language techniques are sufficient to extract most of the desired disease descriptions from free text reports, enabling automatic selection of cases and greatly enhancing the usefulness of large repositories of pathology reports.

Disclosures: No relevant conflicts of interest to declare.
