Patterns. 2024 Feb 21;5(3):100933. doi: 10.1016/j.patter.2024.100933

TCGA-Reports: A machine-readable pathology report resource for benchmarking text-based AI models

Jenna Kefeli 1, Nicholas Tatonetti 2,3
PMCID: PMC10935496  PMID: 38487800

Summary

In cancer research, pathology report text is a largely untapped data source. Pathology reports are routinely generated, more nuanced than structured data, and contain added insight from pathologists. However, there are no publicly available datasets for benchmarking report-based models. Two recent advances suggest the urgent need for a benchmark dataset. First, improved optical character recognition (OCR) techniques will make it possible to access older pathology reports in an automated way, increasing the data available for analysis. Second, recent improvements in natural language processing (NLP) techniques using artificial intelligence (AI) allow more accurate prediction of clinical targets from text. We apply state-of-the-art OCR and customized post-processing to report PDFs from The Cancer Genome Atlas, generating a machine-readable corpus of 9,523 reports. Finally, we perform a proof-of-principle cancer-type classification across 32 tissues, achieving 0.992 average AU-ROC. This dataset will be useful to researchers across specialties, including research clinicians, clinical trial investigators, and clinical NLP researchers.

Keywords: TCGA, resource, cancer pathology, pathology reports, cancer type, machine learning, classification, transformer model, large language models, AI

Highlights

  • Introduction of TCGA-Reports, a machine-readable set of ∼10,000 pathology reports

  • Benchmark for researchers interested in utilizing LLMs for pathology applications

  • Dataset utility demonstrated through proof-of-principle cancer-type classification

The bigger picture

The Cancer Genome Atlas (TCGA) is a crucial resource in oncology research, comprising many different data types from patients with cancer across the United States. The TCGA data collection includes genetic sequencing from tumor samples, long-term survival tracking, histopathology whole-slide images, and pathology reports. Pathology reports are text documents containing pathologists' assessments of tumor samples. They serve as the foundation for medical diagnosis and prognostic determination. Because reports are free text, large language models can use them directly to extract or predict a wide variety of clinical attributes, in contrast to structured electronic health record data. The dataset that we present is a curated, usable collection of the publicly available pathology reports generated by TCGA, which were previously underutilized due to accessibility issues. TCGA-Reports will serve as a useful resource for cancer researchers and as a benchmark for the broader community.


This paper presents TCGA-Reports, a publicly available dataset of pathology reports in machine-readable text derived from The Cancer Genome Atlas (TCGA) that is designed to be useful to researchers applying large language models (LLMs) to medical text. The dataset adds to the already rich TCGA data landscape and can be combined with sequencing or image data from TCGA. Its utility is demonstrated through cancer-type classification using TCGA-defined categories.

Introduction

Patient data derived from structured electronic health records (EHRs) or molecular sequencing are frequently used as input for clinical models across cancer types. However, unstructured free text, such as pathology reports or clinical notes, is less frequently used in biomedical data analysis, despite being regularly generated as part of the EHR. Tumor pathology reports in particular are an essential source of clinical data, often containing nuanced information that is not always captured within structured datasets. Report text generally includes a macroscopic or gross description of tumor appearance, location, and size; a microscopic description of tissue structure and cell differentiation; an evaluation of margins; and sometimes genetic or immunohistochemistry results. Reports can also contain the patient’s stage and grade, which help inform treatment, clinical care management, and prognosis.

Despite the potential utility of pathology report data in clinical research efforts, there currently exists no large, de-identified public dataset of pathology report text. However, The Cancer Genome Atlas (TCGA) has made available over 11,000 de-identified PDF reports associated with patient samples in its repository.1 TCGA is a particularly rich data source, containing clinical metadata, tumor genomic data, histopathology slide images, and follow-up patient tracking for survival outcomes. Pathology reports sent to TCGA were written in the course of routine clinical care. Although available for download, pathology reports from TCGA have not been extensively utilized for research purposes: their PDF formatting, highly varied structure, and image artifacts make automated analysis difficult.

Numerous recent studies have utilized information derived from pathology reports, both in general2,3,4,5,6,7,8,9,10,11,12,13 and using subsets of the TCGA pathology report dataset.14,15,16,17,18,19,20 Previous studies incorporating TCGA pathology report data have largely relied on manual curation and limited term-set extraction across smaller subsets of the dataset. More recently, one study21 used optical character recognition (OCR) and natural language processing (NLP) techniques on a TCGA subset of patients with breast cancer for an information retrieval task. Another study combined OCR and traditional machine learning techniques to classify tumor grade using a TCGA subset of approximately 500 patients with prostate cancer.22 Additionally, Allada et al. compared different NLP classification methods for the prediction of seven disease classes within a TCGA subset of roughly 2,000 patients.23 The recent acceleration of research efforts using pathology report text for a variety of clinical prediction tasks demonstrates both the utility of this type of data and the need for a benchmark dataset.

Here, we describe the curation of a text corpus derived from the set of all TCGA pathology reports. To convert pathology reports from PDF to machine-readable text, we employ OCR as well as significant OCR post-processing. After processing, translating from image to text, and cleaning, we leverage recent advances in NLP24 and its application to clinical text25,26,27 to demonstrate the utility of this dataset by training a cancer-type prediction model. We make the final corpus of 9,523 patient reports publicly available for researchers to use for data mining or machine learning applications. The corpus will be particularly useful for researchers who may not have access to institution-specific or otherwise controlled-access corpora and can potentially provide a benchmark in this field going forward.

Results

TCGA pathology report pre-processing and data selection

11,108 pathology reports, corresponding to 11,010 patients, were downloaded from the TCGA data portal. The dataset was pre-processed as follows: first, we removed 82 patients with multiple reports and 399 patients with non-primary tumors. Then, to ensure that the final dataset is complete with respect to associated outcome data, we removed 72 patients who did not have survival data in the TCGA Clinical Data Resource (CDR).28 This resulted in a selection of 10,457 patients, each with one corresponding report, all describing primary tumors. Next, we removed 381 “Missing Pathology” reports, which were placeholder forms indicating a lack of pathology report for specific patients, as well as 14 reports of poor scan quality (see methods). We additionally removed 212 “TCGA Pathologic Diagnosis Discrepancy Form” reports, which consisted mostly of diagnosis discrepancies indicative of inaccurate pathology reporting. After these filters were applied, 9,850 reports were processed through text extraction.

Text extraction and OCR post-processing

We used OCR to transform reports from PDF into text. We qualitatively evaluated several OCR programs; Textract29 produced the most accurate results and was best at handling report artifacts (see methods). We processed 9,850 reports (25,478 pages) through Textract and then parsed and post-processed the resultant output files. Redaction bars and TCGA barcodes, artifacts of the TCGA quality control (QC) process, were removed by Textract (Figures S1A and S1B).

We removed reports that consisted partially or entirely of multiple-choice forms (see methods). We identified and removed 210 “Consolidated Diagnostic Pathology Form” reports using a combination of keyword and selection-element (check-box) filters. We also manually reviewed and removed any “Synoptic Translated” forms as well as any reports with a large number of multiple-choice selection elements. After removal, 9,547 reports (24,214 pages) remained.

We additionally removed within-report TCGA metadata insertions, which occur at inconsistent coordinates and varying angles. These insertions can interrupt sentences in the OCR translation and need to be removed for clean final text (Figure S1C). We also removed QC tables added to the reports by TCGA, which contain information about sample quality but are irrelevant for diagnosis and not included in standard pathology reports. See methods for full details on this process; briefly, we used Textract to identify the QC table’s section headers and drew a custom bounding box around each detected term. We validated this approach on a prototype set of 50 randomly selected reports and found 100% concordance between manual and automatic identification of QC tables using our bounding-box technique. We additionally utilized Textract’s word-level text-type annotation, removing lines that contained only handwritten words. These lines were added by TCGA during QC and were typically mis-translated by OCR; their removal improves the overall quality of the dataset. After these steps, 24,099 pages remained.

Finally, we removed clinically irrelevant and potentially confounding clinic-specific section headers from the remaining reports. To maintain the general utility of this dataset, we employed a conservative approach, only removing content clearly irrelevant to pathological description and patient diagnosis. We manually reviewed 500 randomly selected report pages to compile a list of 312 regular expressions, which were used to remove report lines (see methods). More than 100,000 lines were removed in this step (Figure 1B).

Figure 1. Patient and line distributions after dataset processing

(A) Distribution of patients remaining in the dataset after data selection, OCR, and post-processing, presented per cancer type. See also Table S1.

(B) Distribution of number of lines removed per report during the final post-processing step of matched regular expression removal.

In total, 9,523 reports (23,909 pages, or 842,134 lines) remain in the final dataset (Figures 1A and 2A‒2D). The frequency of cancer type within the dataset varies: breast invasive carcinoma is the most prevalent, with 1,034 patients, and cholangiocarcinoma is the least prevalent, with 43 patients (Figure 2A). We compiled the demographic characteristics of the patient population in the final pathology report dataset overall (Table 1) and plotted the distribution of demographics by cancer type (Figure S5).

Figure 2. Final dataset characteristics

(A) Cancer-type distribution, ordered by prevalence.

(B) Distribution of number of pages per report (left) and lines per report (right).

(C) Distribution of number of pages per report, segmented by tissue. Darker hue indicates greater prevalence of cancer type within this dataset.

(D) Distribution of report-generating institutions (tissue source sites) and average number of reports per institution, presented separately by cancer type.

Table 1. Demographic characteristics of patients in final pathology report dataset

Characteristic No. of patients Total patients (%)
Age

<18 13 0.1
18–29 279 2.9
30–39 631 6.6
40–49 1,226 12.9
50–59 2,230 23.4
60–69 2,671 28.0
70–79 1,850 19.4
80+ 600 6.3
Not reported 23 0.2

Gender

Female 5,035 52.9
Male 4,488 47.1

Ethnicity

Hispanic or Latino 343 3.6
Not Hispanic or Latino 6,995 73.5
Not reported 2,185 22.9

Race

American Indian or Alaska Native 27 0.3
Asian 423 4.4
Black or African American 925 9.7
Native Hawaiian or Other Pacific Islander 13 0.1
Not reported 933 9.8
White 7,202 75.6

See also Figure S5.

Cancer-type classification

To demonstrate that the resultant corpus is machine readable and usable by modern NLP methods, we performed cancer-type classification across 32 tissues. Ground-truth labels were provided by TCGA project classifications. To utilize domain-relevant pre-trained weights, we fine-tuned an existing model, ClinicalBERT.26 ClinicalBERT is a BERT-based model that was trained on the clinical note set MIMIC-III30 and initialized with BioBERT25 weights. (BioBERT itself was trained on PMC articles and PubMed abstracts and initialized with BERT-BASE.24) To prepare the text for input, we joined all lines and pages across each patient report. We partitioned the data into a train/validation/test split, stratifying by tissue type and holding out the test set until final evaluation. Train/validation and test patient sets were consistent across demographic strata (Table S2). We trained 32 ClinicalBERT models26 in parallel, one per tissue type, for 10 epochs across 10 random seeds each (see methods). We identified models with maximal validation-set AU-ROC and evaluated their performance on the held-out test set, achieving an average test-set AU-ROC of 0.992 and an average test-set AU-PRC (area under the precision-recall curve) of 0.903 (Figures 3A and 3B).

Figure 3. Model performance for proof-of-concept classification task

Horizontal bar charts of (A) AU-ROC and (B) AU-PRC for test-set performance of models trained across 10 random seeds, with 95% confidence intervals. (A) and (B) are ordered by tissue prevalence, with higher-prevalence cancer types toward the top of each panel.

See also Figure S4.

Discussion

Pathology report text is generated routinely and ubiquitously across cancer care sites. In some medical centers, records can span decades, allowing for the research use of pathology reports in both retrospective and prospective analyses. Compared with whole-slide image data, report text is substantially smaller and easier to work with. Text files require far less storage, and model training requires much less run time, which is especially important as memory and computing power can be cost prohibitive. Reports reflect the expertise of practicing pathologists, who typically have years of specialty training, and the morphological features they describe may prove helpful in training models to predict clinical targets.

The TCGA pathology report corpus can be utilized by researchers for a variety of analyses. For example, the text may be used as input for cancer-subtype classification, survival prediction for increased prognostic accuracy, and information retrieval or named entity recognition (i.e., consistent extraction of specific information from report text). In practice, a clinical researcher could train and validate a model of interest on the TCGA corpus and then apply that trained model to private patient data at their institution. This type of research could be performed for a specific cancer type or in a pan-cancer capacity. As models increase in capability, e.g., with recent advances in artificial intelligence (AI) language models such as GPT-4, the availability of relevant public text data will be essential for benchmarking relative model performance on pathology report text.

One of the main strengths of this dataset is that it is derived from the notes of many different pathologists at a wide range of institutions (Figure 2D). This diversity should yield greater generalizability for models trained on the corpus, particularly compared with models trained at single institutions. An additional benefit is that the dataset is already de-identified for public use and does not require specialized or controlled access, allowing its use as a convenient benchmark with which to compare different text-based models.

The TCGA pathology report corpus is enriched by additional patient data gathered by TCGA and accessible through its portal. These include histopathology slide imaging, clinical metadata, and survival data, among other information (Table 2). The availability of these data opens the possibility of performing multimodal analyses, which may increase performance on downstream tasks.31 However, limitations of the TCGA dataset are that (1) it does not contain clinical notes or symptom timelines, (2) the reports are slightly older and may not contain the most up-to-date terminology as oncological classifications evolve, (3) the length of survival follow-up varies by cancer type, and (4) some cancer types are minimally represented (e.g., SKCM). In addition, we were not able to process multiple-choice or synoptic reports through our OCR-based pipeline; this is an area ripe for further research.

Table 2. Selected available data for patients in final pathology report dataset

Cancer type n Age Race Eth. Prior malig. Tumor slides Normal slides OS events PFI events DFI events
BRCA 1,034 100.00 90.81 83.17 99.90 100.00 15.18 14.02 13.54 7.93
UCEC 546 99.63 94.14 71.43 100.00 100.00 7.33 16.48 22.53 10.44
KIRC 525 100.00 98.67 71.05 100.00 100.00 82.29 33.52 29.71 2.29
HNSC 520 100.00 97.31 93.27 100.00 100.00 15.00 42.50 37.88 5.38
LUAD 488 97.95 88.93 77.46 100.00 100.00 42.01 36.68 41.19 17.83
THCA 487 100.00 81.72 79.88 100.00 100.00 19.92 3.29 10.06 5.13
LGG 469 100.00 97.87 93.18 100.00 100.00 0.00 23.67 34.33 3.41
LUSC 468 98.72 76.50 66.24 99.79 100.00 49.36 43.59 29.70 13.25
PRAD 446 100.00 97.09 80.72 100.00 100.00 26.01 1.79 16.14 4.93
COAD 418 100.00 58.61 56.46 100.00 99.76 21.05 19.86 27.03 5.50
GBM 399 100.00 96.99 82.21 7.77 100.00 1.25 75.69 79.95 0.50
BLCA 379 100.00 95.51 92.88 100.00 100.00 9.23 45.38 44.59 7.39
OV 371 100.00 93.26 47.17 2.70 100.00 19.95 52.02 65.77 30.73
STAD 361 99.17 84.21 70.36 100.00 100.00 25.76 40.44 31.58 8.86
LIHC 341 100.00 97.07 94.43 100.00 100.00 26.10 32.26 47.51 38.12
CESC 289 100.00 88.24 64.36 100.00 100.00 2.42 22.84 22.84 8.65
KIRP 280 99.29 95.00 87.50 100.00 100.00 30.00 13.93 19.29 8.93
SARC 249 100.00 96.39 87.55 100.00 100.00 8.43 37.75 53.01 24.90
PAAD 176 100.00 97.73 76.70 100.00 100.00 22.16 52.27 58.52 13.07
PCPG 174 100.00 97.70 80.46 100.00 100.00 2.87 2.87 10.92 2.30
READ 162 100.00 50.00 46.91 100.00 100.00 11.11 12.35 22.22 4.32
ESCA 146 100.00 86.30 38.36 100.00 100.00 44.52 48.63 45.89 3.42
THYM 114 100.00 98.25 87.72 100.00 100.00 7.02 5.26 16.67 0.00
KICH 112 100.00 98.21 66.07 100.00 100.00 62.50 10.71 14.29 5.36
SKCM 102 100.00 98.04 95.10 100.00 100.00 0.00 27.45 35.29 0.00
ACC 90 100.00 87.78 52.22 100.00 100.00 4.44 37.78 54.44 15.56
TGCT 87 100.00 94.25 88.51 100.00 100.00 0.00 2.30 18.39 13.79
MESO 79 100.00 100.00 87.34 100.00 100.00 1.27 83.54 72.15 8.86
UVM 65 100.00 61.54 60.00 100.00 100.00 0.00 26.15 33.85 0.00
UCS 56 100.00 98.21 76.79 100.00 100.00 10.71 62.50 66.07 17.86
DLBC 47 100.00 100.00 100.00 100.00 100.00 0.00 17.02 23.40 6.38
CHOL 43 100.00 97.67 93.02 100.00 100.00 39.53 48.84 51.16 20.93

Percentage of patients with data availability, per cancer type. n, number of patients; Eth., ethnicity; Prior malig., prior malignancy; OS, overall survival; PFI, progression-free interval; DFI, disease-free interval, derived from TCGA-CDR.28 All other columns derived from TCGA clinical and biospecimen metadata.32 Additional patient data, such as ICD-10 codes, sequencing data, transcriptomic data, and epigenetic data, are available through the TCGA portal.32

The final text corpus presented here is moderately curated; data quality could be enhanced by applying additional cleaning steps for future analyses. For example, automated spelling correction could be applied to ensure that spelling mistakes made either in the original text or during OCR are corrected prior to model input tokenization. Depending on the model and tokenizer being applied, other pre-processing steps could include automated editing of punctuation or uncasing of the input text. For other PDF-based datasets, our pipeline would need to be adapted so that artifact removal is customized to the specific report set of interest.
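
As a hypothetical illustration of such a spelling-correction pass (not part of the released pipeline), out-of-vocabulary tokens could be mapped to their closest matches in a reference vocabulary, e.g., one compiled from a medical lexicon, using the standard-library difflib:

```python
import difflib

def correct_token(token: str, vocabulary: list[str]) -> str:
    """Map an out-of-vocabulary token to its closest vocabulary word, if any.
    Illustrative only: the vocabulary source and cutoff are assumptions."""
    if token.lower() in vocabulary:
        return token
    match = difflib.get_close_matches(token.lower(), vocabulary, n=1, cutoff=0.85)
    return match[0] if match else token
```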

Finally, the cancer-type classification performed in this study illustrates that the corpus is trainable and information rich. However, four cancer types (READ, KICH, UCS, CHOL) showed relatively poor performance (low AU-PRC) on this classification task. Low AU-PRC may result from ClinicalBERT confusing one cancer type (e.g., UCS) with a similar cancer type (e.g., UCEC), particularly when the relative prevalences are severely imbalanced. Future work involving pathology reports for these low-prevalence cancer types could consider class balancing33 or testing different models and tokenizers to potentially improve performance.

Experimental procedures

Resource availability

Lead contact

Please direct requests for additional information to the lead contact, Nicholas Tatonetti (nicholas.tatonetti@csmc.edu).

Materials availability

Text reports are available on Mendeley.32

Data and code availability

Source code is available through the GitHub repository (https://github.com/jkefeli/tcga-path-reports) and archived through Zenodo.34

Methods

TCGA pathology report pre-processing and data selection

We downloaded pathology reports, clinical metadata, and biospecimen metadata for all TCGA patients from the GDC portal.35 Each tumor sample has at most one associated pathology report (pathology_report_uuid), and each patient can have multiple samples. For case-based selection, we used sample.tsv (biospecimen directory). We removed patients with empty pathology_report_uuid values, removed patients with multiple pathology_report_uuid values, and selected patients with “primary tumor” in the sample_type column. Next, we checked which patients matched with the TCGA-CDR28 (a curated, comprehensive resource for TCGA outcomes data). We removed 72 patients either not found or found but lacking survival time within the TCGA-CDR.
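For illustration, this case-based selection can be expressed in a few lines of pandas. The sketch below is not the released pipeline: the GDC column names (case_id, case_submitter_id, pathology_report_uuid, sample_type) and the TCGA-CDR file layout (OS.time, bcr_patient_barcode) are assumptions that may differ across portal releases.

```python
import pandas as pd

# Minimal sketch of the case-based selection described above.
samples = pd.read_csv("sample.tsv", sep="\t")

# Drop samples with no associated pathology report.
samples = samples[samples["pathology_report_uuid"].notna()]

# Keep patients with exactly one report across all of their samples.
n_reports = samples.groupby("case_id")["pathology_report_uuid"].nunique()
samples = samples[samples["case_id"].isin(n_reports[n_reports == 1].index)]

# Restrict to primary tumor samples.
samples = samples[samples["sample_type"].str.lower() == "primary tumor"]

# Require a match with survival time in the TCGA-CDR
# (hypothetical file name and column names).
cdr = pd.read_csv("TCGA-CDR.tsv", sep="\t").dropna(subset=["OS.time"])
samples = samples[samples["case_submitter_id"].isin(cdr["bcr_patient_barcode"])]
```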

For report-based filtering, we used OCR to identify reports for removal. We converted the reports from PDF to image and then from image to text using pdf2image and pytesseract.36,37 We scanned the resultant text for key phrases for report exclusion. We removed 381 reports that contained the phrase “TCGA Missing Pathology Report Form” within any page (Figure S2A), 212 reports that contained the phrase “TCGA Pathologic Diagnosis Discrepancy Form” within any page (Figure S2B), and 14 reports of poor scan quality. Ultimately, 9,850 reports were selected for full text extraction and post-processing.
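The report-based exclusion step reduces, in sketch form, to a page-level phrase scan. The function below assumes poppler and Tesseract are installed locally and uses the two exclusion phrases named above; the real pipeline also flags poor-quality scans, which this sketch omits.

```python
from pdf2image import convert_from_path
import pytesseract

EXCLUSION_PHRASES = (
    "TCGA Missing Pathology Report Form",
    "TCGA Pathologic Diagnosis Discrepancy Form",
)

def should_exclude(pdf_path: str) -> bool:
    """Return True if any page of the report contains an exclusion phrase."""
    for page_image in convert_from_path(pdf_path):
        text = pytesseract.image_to_string(page_image)
        if any(phrase in text for phrase in EXCLUSION_PHRASES):
            return True
    return False
```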

Text extraction and OCR post-processing

Text extraction

Multiple OCR packages were tested for translation accuracy and output formatting consistency. A set of 50 randomly chosen reports was used as a basis for comparison. First, we evaluated PyPDF2,38 a Python package that converts PDF files directly to text. Although text translation performed reasonably well on the prototype set, there were a number of issues with the output files, including poorly translated TCGA QC tables, incorrectly spaced words, and redaction bar artifacts in various sections of text. These factors made it infeasible to parse the output and achieve clean report text. Next, we evaluated the performance of pytesseract37 and Textract29 on the pathology report dataset. To use these packages, we performed a high-fidelity conversion of each page of the PDF prototype set to JPG image files using pdf2image.36 The Python package pytesseract produced better-quality text files than PyPDF2. The output text was largely structured the same as the input files, with no major word-spacing issues. Barcodes and redaction bars were not translated at all, resulting in much cleaner output. However, pytesseract failed at handwriting translation, leading to mis-translated text in variable sections of each report that would be difficult to parse out in post-OCR processing.

Finally, we tested Textract, a service from Amazon Web Services (AWS) that uses OCR and machine learning to convert images into text.29 In contrast to pytesseract and PyPDF2, Textract’s output files include structural annotation in addition to text. For example, tables, selection elements, and handwritten lines are identified and annotated with bounding-box coordinates within each report page. This feature is particularly helpful for parsing out mis-translated handwriting during post-processing and for filtering reports using table- or selection-element-based filters (see form detection and removal). In addition, we found that Textract produced cleaner output and consistently achieved higher translation accuracy on the prototype set (although, as with pytesseract, handwriting was not well translated). Based on these considerations, we selected Textract for use on the entire pathology report dataset.

For each report, page images were converted into byte arrays and processed on the AWS server. Due to AWS hard limits (<10,000 pixels per edge and <5 MB total), we slightly lowered the resolution of 58 pages to conform. We converted 9,850 reports (25,478 pages) using the AnalyzeDocument function of Textract, with the table annotation option selected. We manually reviewed outlier short reports consisting of five or fewer lines of text (n = 24 reports); these contained clinically relevant information and were therefore kept in the dataset.
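A minimal sketch of the per-page Textract call is shown below, assuming AWS credentials are configured in the environment; the real pipeline additionally downsizes the oversized pages noted above before submission.

```python
import io

import boto3
from pdf2image import convert_from_path

textract = boto3.client("textract")

def analyze_page(pdf_path: str, page_index: int) -> dict:
    """Convert one report page to JPEG bytes and run Textract AnalyzeDocument
    with the table annotation option, as described above."""
    page = convert_from_path(pdf_path)[page_index]
    buffer = io.BytesIO()
    page.save(buffer, format="JPEG")  # must stay under 5 MB and 10,000 px/edge
    return textract.analyze_document(
        Document={"Bytes": buffer.getvalue()},
        FeatureTypes=["TABLES"],
    )
```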

Form detection and removal

Multiple-choice forms, consisting of questions with multiple-choice answer options, were identified and removed from the dataset (Figure S2C). Because the selected option for each multiple-choice question is not detected by OCR, the resultant output text contains all multiple-choice options for each question and is therefore not learnable. The multiple-choice selection elements were most frequently check boxes, but the exact format varied. The forms themselves were variable in content (with disease-specific questions and answers), overall format, and number of selection elements per question. Some reports consisted entirely of multiple-choice forms and were fully removed, while others contained a mix of page types, in which case only form-containing pages were removed.

We first searched for reports that potentially contained multiple-choice content. As an initial filter, we selected reports based on structural elements, including the total number of tables per report, the total number of selection elements per report, and the average and maximum numbers of selection elements per page. All structural elements considered in this section were annotated by Textract, with annotation data represented by the BlockType attribute of each page response block. We employed various empirically derived thresholds for this initial filter, starting with the clearest outliers and then including medium outliers, finding additional form reports in the medium outlier set upon manual review. However, we found that this first-level filter was not specific enough, including many non-forms in the selected report sets. We also observed that only a few cancer types had form-style pathology reports in this dataset.

We therefore added a second filter consisting of custom disease-specific keywords, based on manual review of a subset of reports selected by the structural-elements filter. Keywords were drawn from both question and answer text, with a preference for unique phrases unlikely to appear elsewhere in standard report text. The number of matched keywords required for report selection was adjusted depending on the results for each disease filter. For example, colon cancer pathology reports were identified as likely forms if they contained at least 2 of the following keywords: “signet ring feature:”, “histologic heterogeneity:”, “Crohn’s like reaction”, “plasma cell rich stroma”, “angiolymphatic invasion:”, “Garland necrosis present:”, “TIL cells/HPF”, or “pathologist comment:”. As another example, liver-related forms were identified by 14 keywords, including “hepatitis (specify type)” and “(check all that apply)”, and cervix-related forms were identified using 21 keywords across multiple pages. We additionally incorporated fuzzy matching into the keyword filter to account for mis-spelled text (either mis-spelled via OCR translation error or within the original pathology report text).
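As an illustration of this keyword filter, the sketch below uses the standard-library difflib as a stand-in fuzzy matcher (the specific matching library is not prescribed here) with a subset of the colon-cancer keywords listed above; the 0.85 similarity threshold is an assumption.

```python
import difflib

# Illustrative subset of the colon-cancer form keywords listed above.
COLON_FORM_KEYWORDS = [
    "signet ring feature:",
    "histologic heterogeneity:",
    "angiolymphatic invasion:",
]

def fuzzy_contains(line: str, keyword: str, threshold: float = 0.85) -> bool:
    """Approximate substring match, tolerant of OCR misspellings."""
    kw, line = keyword.lower(), line.lower()
    window = len(kw)
    return any(
        difflib.SequenceMatcher(None, kw, line[i:i + window]).ratio() >= threshold
        for i in range(max(1, len(line) - window + 1))
    )

def count_matched_keywords(report_lines: list[str]) -> int:
    """Count how many form keywords fuzzy-match anywhere in the report."""
    return sum(
        any(fuzzy_contains(line, kw) for line in report_lines)
        for kw in COLON_FORM_KEYWORDS
    )

# A report passing the structural filter is flagged as a likely form when
# enough keywords match (at least 2 for the colon-cancer filter above).
```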

Although the keyword filter greatly increased specificity, enriching report sets for form content, the final filtered report sets were not perfectly specific. We therefore manually reviewed all reports that passed the structural element and keyword-matching filters to ensure we ultimately removed only form reports from the overall dataset. After form removal, 9,547 reports (24,214 pages) remained.

Table detection and removal

We used 9 section header keywords to identify TCGA QC tables within each report (Figure S3). The section headers were largely typed text, free of handwriting annotation, and Textract transcribed these keywords well enough for fuzzy detection. Fuzzy-matching error allowance varied according to the observed frequency of mis-translation for each keyword. The relative location of the section headers within each table was consistent across reports. As such, we drew a custom bounding box for each detected keyword and then merged the keyword-based bounding boxes to form a “maximum bounding box” (max bounding box) around the entire detected table.

To check that this table detection method performs accurately across the overall dataset, we probed the results in a number of ways. First, we manually scored a prototype set of 50 randomly selected reports, finding that all detected tables were true tables and that all true tables were detected (no false negatives or false positives). Next, we tested whether a single matched keyword was sufficient to distinguish table content from main text. Checking for false positives, we manually reviewed all reports for which only one keyword was fuzzy matched. Upon reviewing 115 pages that met this criterion, we found that all detected bounding boxes were true QC tables. This aligns with our observation that the terms used in QC table section headers are distinct from the general vocabulary used in the main text. We also examined large max bounding box outliers to confirm that main report text was not fuzzy matched by our table detection method. We found that these reports (n = 39) had reasonably sized max bounding boxes and that the bounding boxes did not overlap with any clinically relevant, non-QC-table lines.

Once the tables had been detected, we removed them by removing any lines overlapping with the max bounding box. We set an overlap threshold, i.e., the minimum area overlap between the bounding box of a given line and the max bounding box of a QC table, for the line to be considered part of the table. A smaller overlap threshold would include lines farther from the table, as less overlap area would be required for a line to be considered part of it. To determine the appropriate overlap threshold, we assembled a randomly selected subset (n = 4,000 pages) and manually examined pages containing lines within specified overlap thresholds. Between thresholds of 0.35 and 0.25, no clinically relevant, non-table-related lines were selected; however, when the threshold was lowered below 0.25, some clinically relevant, non-table-related lines were captured. We therefore moved forward with a minimum area overlap threshold of 0.25 and removed all QC-table-related lines from the dataset.
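The max-bounding-box construction and the 0.25 overlap rule can be sketched directly from Textract's normalized geometry. Below, keyword_boxes and page_lines are assumed inputs: the bounding boxes of fuzzy-matched section headers and the LINE blocks of one page response, respectively.

```python
def merge_boxes(keyword_boxes: list[dict]) -> dict:
    """Merge detected keyword boxes into a single 'max bounding box' around
    the QC table. Boxes use Textract's normalized Left/Top/Width/Height."""
    left = min(b["Left"] for b in keyword_boxes)
    top = min(b["Top"] for b in keyword_boxes)
    right = max(b["Left"] + b["Width"] for b in keyword_boxes)
    bottom = max(b["Top"] + b["Height"] for b in keyword_boxes)
    return {"Left": left, "Top": top, "Width": right - left, "Height": bottom - top}

def overlap_fraction(line_box: dict, max_box: dict) -> float:
    """Fraction of a line's bounding-box area falling inside the table box."""
    dx = (min(line_box["Left"] + line_box["Width"], max_box["Left"] + max_box["Width"])
          - max(line_box["Left"], max_box["Left"]))
    dy = (min(line_box["Top"] + line_box["Height"], max_box["Top"] + max_box["Height"])
          - max(line_box["Top"], max_box["Top"]))
    area = line_box["Width"] * line_box["Height"]
    return (max(0.0, dx) * max(0.0, dy)) / area if area else 0.0

# Keep only lines whose overlap with the detected table is below 0.25.
max_box = merge_boxes(keyword_boxes)
kept = [ln for ln in page_lines
        if overlap_fraction(ln["Geometry"]["BoundingBox"], max_box) < 0.25]
```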

Handwriting and keyword removal

We implemented additional filters to clean the text before dataset finalization. First, we sought to remove TCGA handwritten annotations. We removed handwritten notes because they are not part of standard pathology report text generated during routine care (i.e., they were an artifact of the TCGA data collection process) and were also largely incorrectly OCR translated. We selected lines that consisted entirely of Textract-annotated handwritten words, removing approximately 120,000 lines in this step. In addition, we sought to remove any clinically irrelevant TCGA identification data or site-specific text (such as clinic-specific section headers), with the goal of reducing any potentially confounding elements within the text itself. We manually reviewed 500 randomly selected report pages and compiled a list of 312 regular expressions. The list of keywords and regular expressions can be found in the final_report_cleaning Python script in the GitHub repository associated with this paper. Approximately 100,000 additional lines were removed at this stage. After joining all lines with period delimiters and joining report pages, the final dataset consisted of 9,523 reports (23,909 pages) across 32 cancer types.
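In sketch form, this pass filters each report's lines against a compiled pattern list and then joins what remains. The three patterns shown are hypothetical stand-ins; the actual list of 312 expressions is in the final_report_cleaning script noted above.

```python
import re

# Hypothetical example patterns; the real list lives in the repository.
CLEANING_PATTERNS = [
    re.compile(r"^\s*page \d+ of \d+\s*$", re.IGNORECASE),
    re.compile(r"^\s*surgical pathology report\s*$", re.IGNORECASE),
    re.compile(r"^\s*uuid[:\s]", re.IGNORECASE),
]

def clean_report(lines: list[str]) -> str:
    """Drop lines matching any cleaning pattern, then join with periods,
    as in the final dataset construction."""
    kept = [ln for ln in lines
            if not any(p.search(ln) for p in CLEANING_PATTERNS)]
    return ". ".join(ln.strip() for ln in kept if ln.strip())
```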

Cancer-type classification

We performed binary cancer-type classification by fine-tuning Bio+Clinical BERT26 and using TCGA project_id as the prediction target. We trained each model in parallel, with 32 separate cancer-type experiments. We split the data into train/validation/test sets, stratifying by cancer type. To establish confidence intervals for model performance, we ran 10 different random seeds for each experiment, resulting in 320 trained and evaluated models. We trained the models with default parameters, except for the following: per_device_train_batch_size was set to 16 (for smoother training curves and reduced run time), AU-ROC was used for performance evaluation, and models were saved and evaluated every 32 steps (more often than the default). Model input was truncated at 512 tokens per patient report, the maximum number of input tokens that ClinicalBERT can accept. For evaluation, we applied a softmax on raw model scores and used the transformed values for ROC and PR curve construction.
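A condensed sketch of one binary experiment using the Hugging Face transformers API follows. It is not the exact training script: train_texts/train_labels and the validation split are assumed to exist, and argument names reflect a recent transformers release.

```python
from datasets import Dataset
from scipy.special import softmax
from sklearn.metrics import roc_auc_score
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

MODEL = "emilyalsentzer/Bio_ClinicalBERT"
tokenizer = AutoTokenizer.from_pretrained(MODEL)

def tokenize(batch):
    # Truncate at 512 tokens, the model's maximum input length.
    return tokenizer(batch["text"], truncation=True, max_length=512)

# Assumed: joined report text and binary labels (target cancer type vs. rest).
train_ds = Dataset.from_dict({"text": train_texts, "label": train_labels}).map(tokenize, batched=True)
val_ds = Dataset.from_dict({"text": val_texts, "label": val_labels}).map(tokenize, batched=True)

def compute_metrics(eval_pred):
    probs = softmax(eval_pred.predictions, axis=1)[:, 1]
    return {"auroc": roc_auc_score(eval_pred.label_ids, probs)}

model = AutoModelForSequenceClassification.from_pretrained(MODEL, num_labels=2)
args = TrainingArguments(
    output_dir="ctype_vs_rest",
    num_train_epochs=10,
    per_device_train_batch_size=16,
    evaluation_strategy="steps",   # evaluate and save every 32 steps
    eval_steps=32,
    save_steps=32,
    load_best_model_at_end=True,
    metric_for_best_model="auroc",
    seed=0,                        # one of the 10 random seeds
)
trainer = Trainer(model=model, args=args, train_dataset=train_ds,
                  eval_dataset=val_ds, tokenizer=tokenizer,
                  compute_metrics=compute_metrics)
trainer.train()
```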

We trained all cancer-type models across 10 random seeds for 10 epochs. Each model used approximately 7,620 s of run time, for a total training time of 11.3 days. The best models, as determined by the highest validation set AU-ROC, were then applied to the test set for evaluation. AU-ROC was consistently high across cancer types, with narrow confidence intervals (Figure 3A). AU-PRC was more variable across cancer types and exhibited wider confidence intervals (Figure 3B). Performance on tissues with lower prevalence was generally worse than on tissues with higher prevalence. This is expected: models trained on a limited sample size see fewer examples from which to learn their classification target and generally benefit from larger samples with greater total information content. Individual per-tissue ROC and PR curves are plotted for comparison (Figure S4).
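Continuing the training sketch above (with trainer holding the best seed's model and test_ds an assumed held-out split), the test-set metrics follow from softmaxed scores and curve construction:

```python
from scipy.special import softmax
from sklearn.metrics import auc, precision_recall_curve, roc_auc_score

# Softmax raw model scores, then build ROC and PR curves as described above.
pred = trainer.predict(test_ds)
probs = softmax(pred.predictions, axis=1)[:, 1]

test_auroc = roc_auc_score(pred.label_ids, probs)
precision, recall, _ = precision_recall_curve(pred.label_ids, probs)
test_auprc = auc(recall, precision)
```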

Acknowledgments

Our work was supported by the following NIH NIGMS grant: R35GM131905.

Author contributions

J.K. performed conceptualization, data collection and cleaning, software development, and model output analysis. N.T. aided in conceptualization and manuscript revision.

Declaration of interests

The authors declare no competing interests.

Published: February 21, 2024

Footnotes

Supplemental information can be found online at https://doi.org/10.1016/j.patter.2024.100933.

Supplemental information

Document S1. Figures S1‒S5 and Tables S1 and S2
mmc1.pdf (3.3MB, pdf)
Document S2. Article plus supplemental information
mmc2.pdf (6.4MB, pdf)

References

  • 1. Cancer Genome Atlas Research Network; Weinstein J.N., Collisson E.A., Mills G.B., Shaw K.R.M., Ozenberger B.A., Ellrott K., Shmulevich I., Sander C., Stuart J.M. The Cancer Genome Atlas Pan-Cancer analysis project. Nat. Genet. 2013;45:1113–1120. doi: 10.1038/ng.2764.
  • 2. Yala A., Barzilay R., Salama L., Griffin M., Sollender G., Bardia A., Lehman C., Buckley J.M., Coopey S.B., Polubriaginof F., et al. Using machine learning to parse breast pathology reports. Breast Cancer Res. Treat. 2017;161:203–211. doi: 10.1007/s10549-016-4035-1.
  • 3. Alawad M., Gao S., Qiu J.X., Yoon H.J., Blair Christian J., Penberthy L., Mumphrey B., Wu X.C., Coyle L., Tourassi G. Automatic extraction of cancer registry reportable information from free-text pathology reports using multitask convolutional neural networks. J. Am. Med. Inf. Assoc. 2020;27:89–98. doi: 10.1093/jamia/ocz153.
  • 4. Levy J., Vattikonda N., Haudenschild C., Christensen B., Vaickus L. Comparison of Machine-Learning Algorithms for the Prediction of Current Procedural Terminology (CPT) Codes from Pathology Reports. J. Pathol. Inf. 2022;13. doi: 10.4103/jpi.jpi_52_21.
  • 5. Ma R., Chen P.H.C., Li G., Weng W.H., Lin A., Gadepalli K., Cai Y. Human-centric Metric for Accelerating Pathology Reports Annotation. Preprint at arXiv. 2019. doi: 10.48550/arXiv.1911.01226.
  • 6. Nguyen A., O'Dwyer J., Vu T., Webb P.M., Johnatty S.E., Spurdle A.B. Generating high-quality data abstractions from scanned clinical records: text-mining-assisted extraction of endometrial carcinoma pathology features as proof of principle. BMJ Open. 2020;10. doi: 10.1136/bmjopen-2020-037740.
  • 7. Gao S., Qiu J.X., Alawad M., Hinkle J.D., Schaefferkoetter N., Yoon H.J., Christian B., Fearn P.A., Penberthy L., Wu X.C., et al. Classifying cancer pathology reports with hierarchical self-attention networks. Artif. Intell. Med. 2019;101. doi: 10.1016/j.artmed.2019.101726.
  • 8. Altieri N., Park B., Olson M., DeNero J., Odisho A.Y., Yu B. Supervised line attention for tumor attribute classification from pathology reports: Higher performance with less data. J. Biomed. Inf. 2021;122. doi: 10.1016/j.jbi.2021.103872.
  • 9. Miettinen J., Tanskanen T., Degerlund H., Nevala A., Malila N., Pitkäniemi J. Accurate pattern-based extraction of complex Gleason score expressions from pathology reports. J. Biomed. Inf. 2021;120. doi: 10.1016/j.jbi.2021.103850.
  • 10. Alawad M., Gao S., Shekar M.C., Hasan S.M., Christian J.B., Wu X.C., Durbin E.B., Doherty J., Stroup A., Coyle L., Penberthy L., et al. Integration of Domain Knowledge using Medical Knowledge Graph Deep Learning for Cancer Phenotyping. Preprint at arXiv. 2021. doi: 10.48550/arXiv.2101.01337.
  • 11. Zhou S., Wang N., Wang L., Liu H., Zhang R. CancerBERT: a cancer domain-specific language model for extracting breast cancer phenotypes from electronic health records. J. Am. Med. Inf. Assoc. 2022;29:1208–1216. doi: 10.1093/jamia/ocac040.
  • 12. Laique S.N., Hayat U., Sarvepalli S., Vaughn B., Ibrahim M., McMichael J., Qaiser K.N., Burke C., Bhatt A., Rhodes C., Rizk M.K. Application of optical character recognition with natural language processing for large-scale quality metric data extraction in colonoscopy reports. Gastrointest. Endosc. 2021;93:750–757. doi: 10.1016/j.gie.2020.08.038.
  • 13. Park B., Altieri N., DeNero J., Odisho A.Y., Yu B. Improving natural language information extraction from cancer pathology reports using transfer learning and zero-shot string similarity. JAMIA Open. 2021;4. doi: 10.1093/jamiaopen/ooab085.
  • 14. Asaoka M., Patnaik S.K., Zhang F., Ishikawa T., Takabe K. Lymphovascular invasion in breast cancer is associated with gene expression signatures of cell proliferation but not lymphangiogenesis or immune response. Breast Cancer Res. Treat. 2020;181:309–322. doi: 10.1007/s10549-020-05630-5.
  • 15. Sorgini A., Kim H.A.J., Zeng P.Y.F., Shaikh M.H., Mundi N., Ghasemi F., Di Gravio E., Khan H., MacNeil D., Khan M.I., et al. Analysis of the TCGA Dataset Reveals that Subsites of Laryngeal Squamous Cell Carcinoma are Molecularly Distinct. Cancers. 2020;13:105–131. doi: 10.3390/cancers13010105.
  • 16. Yu K.H., Berry G.J., Rubin D.L., Ré C., Altman R.B., Snyder M. Association of Omics Features with Histopathology Patterns in Lung Adenocarcinoma. Cell Syst. 2017;5:620–627.e3. doi: 10.1016/j.cels.2017.10.014.
  • 17. Chappidi M.R., Welty C., Choi W., Meng M.V., Porten S.P. Evaluation of the Cancer of Bladder Risk Assessment (COBRA) Score in the Cancer Genome Atlas (TCGA) Bladder Cancer Cohort. Urology. 2021;156:104–109. doi: 10.1016/j.urology.2021.04.047.
  • 18. Harmon S.A., Sanford T.H., Brown G.T., Yang C., Mehralivand S., Jacob J.M., Valera V.A., Shih J.H., Agarwal P.K., Choyke P.L., Turkbey B. Multiresolution Application of Artificial Intelligence in Digital Pathology for Prediction of Positive Lymph Nodes From Primary Tumors in Bladder Cancer. JCO Clin. Cancer Inform. 2020;4:367–382. doi: 10.1200/CCI.19.00155.
  • 19. Kalra S., Li L., Tizhoosh H.R. Automatic classification of pathology reports using TF-IDF Features. Preprint at arXiv. 2019. doi: 10.48550/arXiv.1903.07406.
  • 20. Wu J., Liu Y., Gao Z., Gong T., Wang C., Li C. BioIE: Biomedical information extraction with multi-head attention enhanced graph convolutional network. Preprint at arXiv. 2021. doi: 10.48550/arXiv.2110.13683.
  • 21. Rinaldi J., Sokol E.S., Hartmaier R.J., Trabucco S.E., Frampton G.M., Goldberg M.E., Albacker L.A., Daemen A., Manning G. The genomic landscape of metastatic breast cancer: Insights from 11,000 tumors. PLoS One. 2020;15. doi: 10.1371/journal.pone.0231999.
  • 22. Dhrangadhariya A., Otálora S., Atzori M., Müller H. Classification of Noisy Free-Text Prostate Cancer Pathology Reports Using Natural Language Processing. In: ICPR International Workshops and Challenges. Springer; 2021.
  • 23. Allada A.K., Wang Y., Jindal V., Babee M., Tizhoosh H.R., Crowley M. Analysis of Language Embeddings for Classification of Unstructured Pathology Reports. Annu. Int. Conf. IEEE Eng. Med. Biol. Soc. 2021:2378–2381. doi: 10.1109/EMBC46164.2021.9630347.
  • 24. Vaswani A., Shazeer N., Parmar N., Uszkoreit J., Jones L., Gomez A.N., Kaiser Ł., Polosukhin I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017;30.
  • 25. Lee J., Yoon W., Kim S., Kim D., Kim S., So C.H., Kang J. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics. 2020;36:1234–1240. doi: 10.1093/bioinformatics/btz682.
  • 26. Alsentzer E., Murphy J.R., Boag W., Weng W.H., Jin D., Naumann T., McDermott M. Publicly available clinical BERT embeddings. Preprint at arXiv. 2019. doi: 10.48550/arXiv.1904.03323.
  • 27. Huang K., Altosaar J., Ranganath R. ClinicalBERT: Modeling Clinical Notes and Predicting Hospital Readmission. Preprint at arXiv. 2019. doi: 10.48550/arXiv.1904.05342.
  • 28. Liu J., Lichtenberg T., Hoadley K.A., Poisson L.M., Lazar A.J., Cherniack A.D., Kovatich A.J., Benz C.C., Levine D.A., Lee A.V., et al. An Integrated TCGA Pan-Cancer Clinical Data Resource to Drive High-Quality Survival Outcome Analytics. Cell. 2018;173:400–416.e11. doi: 10.1016/j.cell.2018.02.052.
  • 29. Amazon Web Services. Textract. https://aws.amazon.com/textract/. Accessed 2021.
  • 30. Johnson A.E.W., Pollard T.J., Shen L., Lehman L.W.H., Feng M., Ghassemi M., Moody B., Szolovits P., Celi L.A., Mark R.G. MIMIC-III, a freely accessible critical care database. Sci. Data. 2016;3. doi: 10.1038/sdata.2016.35.
  • 31. Tran K.A., Kondrashova O., Bradley A., Williams E.D., Pearson J.V., Waddell N. Deep learning in cancer diagnosis, prognosis and treatment selection. Genome Med. 2021;13:152. doi: 10.1186/s13073-021-00968-x.
  • 32. Kefeli J., Tatonetti N. TCGA-Reports: A Machine-Readable Pathology Report Resource for Benchmarking Text-Based AI Models. Mendeley Data. 2024. doi: 10.17632/hyg5xkznpx.1.
  • 33. De Angeli K., Gao S., Danciu I., Durbin E.B., Wu X.C., Stroup A., Doherty J., Schwartz S., Wiggins C., Damesyn M., et al. Class imbalance in out-of-distribution datasets: Improving the robustness of the TextCNN for the classification of rare cancer types. J. Biomed. Inf. 2022;125. doi: 10.1016/j.jbi.2021.103957.
  • 34. Kefeli J., Tatonetti N. Code for TCGA Pathology Report Corpus Pipeline. Zenodo. 2024. doi: 10.5281/zenodo.10452345.
  • 35. Grossman R.L., Heath A.P., Ferretti V., Varmus H.E., Lowy D.R., Kibbe W.A., Staudt L.M. Toward a Shared Vision for Cancer Genomic Data. N. Engl. J. Med. 2016;375:1109–1112. doi: 10.1056/NEJMp1607591.
  • 36. Belval E. pdf2image. GitHub; 2020. https://github.com/Belval/pdf2image.
  • 37. Tesseract-OCR Team. Tesseract Open Source OCR Engine. GitHub; 2021. https://github.com/tesseract-ocr/tesseract.
  • 38. Py-PDF Team. PyPDF2. GitHub; 2016. https://github.com/mstamy2/PyPDF2.
