
Deep Transfer Learning Across Cancer Registries for Information Extraction from Pathology Reports

Mohammed Alawad, Shang Gao, John Qiu, Noah Schaefferkoetter, Jacob D. Hinkle, Hong-Jun Yoon, J. Blair Christian, Xiao-Cheng Wu, Eric B. Durbin, Jong Cheol Jeong, Isaac Hands, David Rust, Georgia Tourassi

Abstract

Automated text information extraction from cancer pathology reports is an active area of research to support national cancer surveillance. A well-known challenge is how to develop information extraction tools with robust performance across cancer registries. In this study, we investigated whether transfer learning (TL) with a convolutional neural network (CNN) can facilitate cross-registry knowledge sharing. Specifically, we performed a series of experiments to determine whether a CNN trained with single-registry data is capable of transferring knowledge to another registry, or whether developing a cross-registry knowledge database produces a more effective and generalizable model. Using data from two cancer registries and primary tumor site and topography as the information extraction task of interest, our study showed that TL yields 6.90% and 17.22% improvements in classification macro F-score over the baseline single-registry models. Detailed analysis illustrated that the observed improvement is most evident in the low-prevalence classes.

Index Terms— Transfer learning, convolutional neural network, information extraction, pathology reports, NLP

I. Introduction

One of the challenges facing the National Cancer Institute's Surveillance, Epidemiology, and End Results (SEER) cancer surveillance program is the development of natural language processing (NLP) approaches to augment and automate information extraction from cancer pathology reports. Due to the large volume of cancer pathology reports generated annually, the manual annotation process currently in practice across SEER registries is not easily scalable to capture all the data elements needed to fill in the longitudinal stories of cancer patients. Leveraging state-of-the-art NLP technologies could significantly increase the efficiency, data quality, and timeliness of cancer reporting by the SEER program.

The linguistic variability of cancer pathology reports sourced from hundreds of healthcare providers and pathology laboratories poses unique challenges for the generalizability of NLP tools, particularly those relying on expert-driven feature engineering. Deep learning has been successfully used for clinical NLP tasks such as extracting clinical variables from cancer pathology reports [1], [2]. Specifically, CNNs operating on word embeddings have demonstrated superior performance for document-level information extraction and classification, outperforming traditional machine learning approaches in classification accuracy across a range of information extraction tasks. This is due to their ability to learn complex feature representations that capture both the semantic and syntactic content of clinical text, without requiring explicit knowledge of the clinical language.

Training a deep learning NLP model requires a large training dataset with characteristics similar to those of the prospective testing dataset. However, obtaining a large labelled clinical text corpus is time-consuming and labor-intensive. Therefore, different techniques have been proposed to overcome this practical limitation, such as multi-task learning [3] and transfer learning [4]. TL can be very useful for enabling robust information extraction from cancer pathology reports across SEER cancer registries, since it addresses two important bottlenecks: the cost of producing a large labelled corpus to train deep learning models, and the privacy concerns of sharing patient data across registries. In this paper, we applied TL to information extraction from cancer pathology reports collected from two different SEER cancer registries. Specifically, we studied the performance of a previously developed CNN model for extracting the primary tumor's site and topography. We developed two training approaches. In the first approach, a CNN model was trained on one cancer registry's dataset (source data) and used to transfer knowledge to the other registry's dataset (target data), with the pre-trained model parameters either frozen or fine-tuned on the target dataset. In the second approach, both cancer registry datasets were combined to train a global CNN model, which was then tested on each registry's data separately. From a clinical perspective, the first approach is appropriate when only single-registry trained NLP models can be shared with other cancer registries but not the actual training patient data. The second approach is feasible when cancer registries can combine their patient data to train a global NLP model while benefiting from a much larger training text corpus.

II. Materials and Methods

A. Datasets and Pre-processing

We used text corpora of cancer pathology reports obtained from two independent SEER program sources: the Louisiana Tumor Registry (LA) and the Kentucky Cancer Registry (KY). The study was executed in accordance with institutional review board protocol DOE000152. The LA and KY datasets consist of 374,899 and 172,128 pathology reports, respectively. The LA corpus spans the period 2004–2018, while the KY corpus spans 2009–2018. Each pathology report is identified by a combination of patient ID and tumor ID, called the case ID. Ground truth labels for each unique case are obtained from the registry record associated with the pathology report. In this paper, we consider the International Classification of Diseases for Oncology, Third Edition (ICD-O-3) topography (i.e., site/subsite) as the data element of interest, as it is a fundamental information extraction task for cancer reporting. According to the SEER coding manual, there are 321 possible subsite labels, representing tumor topographies across more than 70 organs where cancer may appear. The LA and KY datasets include 306 and 299 labels, respectively, while the combined dataset includes 314 subsite labels, indicating that not all possible cancer subsite labels are observed in the available data. To simulate a real-world production environment, we used the pathology report date to split each registry dataset into train, validation, and test sets. Specifically, reports collected before 2016 were used for training and validation with an 80–20 ratio, while the remaining reports were kept for testing. Since multiple cancer pathology reports may share the same case ID, we ensured that each unique case ID appears in only one of the train, validation, or test sets to avoid any positive bias in the reported results. This data handling process resulted in a train set of 236,255 reports, a validation set of 59,711 reports, and a test set of 78,860 reports for the LA dataset, and a train set of 99,328 reports, a validation set of 24,499 reports, and a test set of 48,153 reports for the KY dataset. When combined, the train set included 335,650 reports, the validation set included 84,143 reports, and the test set included 127,013 reports.
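As an illustration only, the following Python sketch reproduces the split logic described above. The DataFrame column names (case_id, report_date) and the handling of case IDs whose reports straddle the 2016 boundary are our assumptions, not details taken from the paper.

```python
import pandas as pd

def split_by_date_and_case(df: pd.DataFrame, seed: int = 0):
    """Date-based train/validation/test split with case-level disjointness.

    Assumes hypothetical columns: 'case_id' (patient ID + tumor ID) and
    'report_date' (datetime). Reports dated before 2016 go to train and
    validation (80-20 by case ID); the remaining reports go to test.
    """
    pre_2016 = df[df["report_date"] < "2016-01-01"]
    test = df[df["report_date"] >= "2016-01-01"]

    # Keep splits disjoint at the case level, not the report level:
    # any case that appears in the test period is excluded from
    # train/validation (one possible way to enforce the constraint).
    test_cases = set(test["case_id"])
    pre_2016 = pre_2016[~pre_2016["case_id"].isin(test_cases)]

    # Shuffle unique case IDs and assign 80% of them to training.
    cases = pre_2016["case_id"].drop_duplicates().sample(frac=1.0, random_state=seed)
    train_cases = set(cases.iloc[: int(0.8 * len(cases))])

    train = pre_2016[pre_2016["case_id"].isin(train_cases)]
    val = pre_2016[~pre_2016["case_id"].isin(train_cases)]
    return train, val, test
```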

After excluding metadata from the cancer pathology reports, the text is cleaned by removing consecutive punctuation and lowercasing all alphabetical characters. To reduce the vocabulary space, all words with a document frequency of less than five are replaced with an "unknown_word" token, all decimals are converted to a "decimal" token, and all integers larger than 100 are converted to a "large_integer" token. Each cancer pathology report is represented as a one-dimensional vector whose elements are word tokens. Reports of different lengths are accommodated by specifying a fixed length of N = 1,500 words for all reports.
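A minimal Python sketch of these preprocessing rules follows. It assumes whitespace tokenization and interprets "removing consecutive punctuation" as collapsing runs of repeated punctuation marks; the function names and the padding token are illustrative, not the authors' code.

```python
import re
from collections import Counter

UNK, DECIMAL, LARGE_INT = "unknown_word", "decimal", "large_integer"
MAX_LEN = 1500  # fixed report length N

def clean(text):
    """Lowercase the text, collapse runs of repeated punctuation, tokenize."""
    text = text.lower()
    text = re.sub(r"([^\w\s])\1+", r"\1", text)
    return text.split()

def document_frequency(tokenized_reports):
    """Count, for each token, how many training reports contain it."""
    df = Counter()
    for tokens in tokenized_reports:
        df.update(set(tokens))
    return df

def normalize(tokens, doc_freq):
    """Apply the vocabulary-reduction rules and fix the report length."""
    out = []
    for tok in tokens:
        if re.fullmatch(r"\d+\.\d+", tok):
            out.append(DECIMAL)       # decimals -> 'decimal'
        elif tok.isdigit() and int(tok) > 100:
            out.append(LARGE_INT)     # integers > 100 -> 'large_integer'
        elif doc_freq[tok] < 5:
            out.append(UNK)           # document frequency < 5 -> 'unknown_word'
        else:
            out.append(tok)
    out = out[:MAX_LEN]               # truncate to N = 1,500 words...
    return out + ["<pad>"] * (MAX_LEN - len(out))  # ...or pad up to it
```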

B. Deep Learning Model

In this study, we used a previously reported CNN model for automated feature extraction from cancer pathology reports [1]. The pathology reports are converted into appropriate inputs for deep learning models through a word embedding layer. Based on previous experience [1], we used word vectors of length K = 300. The output of the word embedding layer is a two-dimensional document matrix A ∈ ℝ^{N×K} that serves as the input to the convolution layers. Convolution layers in CNNs for NLP are not stacked as in computer vision applications; instead, they are structured as parallel layers with different filters that run simultaneously on the document matrix. Varying the filter size enables the CNN to extract different n-gram expressions from the text. In this paper, the kernel sizes of the convolutional filters are 3, 4, and 5, with 300 feature maps each, optimized as reported previously [1]. The Rectified Linear Unit (ReLU) is used as the activation function. After the convolution layers, max pooling layers capture the most important features by taking the maximum value from each filter's output. The selected contexts from the max pooling layers are then aggregated by a concatenation layer. Finally, the output is connected to a fully connected layer with a softmax output to produce a score for each label.
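A minimal PyTorch sketch of this architecture is shown below. It reflects the hyperparameters stated above (K = 300; kernel sizes 3, 4, and 5 with 300 feature maps each), but the class and variable names are ours, and it is an illustration rather than the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextCNN(nn.Module):
    """Word-level CNN with parallel convolution layers, as described above."""

    def __init__(self, vocab_size, num_classes, embed_dim=300,
                 kernel_sizes=(3, 4, 5), num_filters=300):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        # Parallel (not stacked) 1-D convolutions; each kernel size
        # captures a different n-gram length.
        self.convs = nn.ModuleList(
            [nn.Conv1d(embed_dim, num_filters, k) for k in kernel_sizes]
        )
        self.fc = nn.Linear(num_filters * len(kernel_sizes), num_classes)

    def forward(self, tokens):                      # tokens: (batch, N)
        x = self.embedding(tokens).transpose(1, 2)  # (batch, K, N)
        # ReLU activation, then max pooling over each filter's output.
        pooled = [F.relu(conv(x)).max(dim=2).values for conv in self.convs]
        features = torch.cat(pooled, dim=1)         # concatenation layer
        return self.fc(features)  # class scores; softmax applied in the loss
```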

C. Transfer Learning (TL)

TL is defined as the process of transferring knowledge learned from a source task, which can be a dataset, to a target task [5]. It can be done by transferring the low-level layers [6], the high-level layers [7], or all of the model's layers; a detailed study of the transferability of each layer can be found in [8]. TL is very popular in clinical imaging applications [9], [10], where computer vision models pre-trained on very large but general image data (e.g., ImageNet) are used to transfer knowledge to specialized clinical tasks, with a small but relevant clinical imaging dataset used for further fine-tuning. However, applying TL of deep learning models to clinical NLP tasks remains an active subject of research.

Word embeddings pre-trained on unlabeled corpora using unsupervised learning approaches such as Word2Vec [11] or GloVe [12] have been extensively used to transfer knowledge across NLP tasks. This approach has also been used successfully for clinical tasks by transferring embeddings of medical concepts learned from multimodal medical data [13]. However, it failed to improve the performance of CNN models for information extraction from cancer pathology reports [1]. One explanation is that knowledge was transferred from datasets (e.g., Google News, PubMed) that are not semantically similar to the target dataset (pathology reports), which runs counter to the guidance in [4], [14] that source and target data should be semantically similar. In this paper, we developed a CNN-based model for primary cancer subsite extraction from pathology reports collected from the LA and KY tumor registries. The model was trained following one of four approaches: 1) a CNN model was trained and tested on the same dataset (Baseline); 2) a CNN model was trained on a source dataset and tested on a target dataset, with all parameters kept frozen as learned from the source dataset (Frozen); 3) a CNN model was trained on a source dataset and then fine-tuned on the target dataset before testing (Fine-Tuned); 4) a CNN model was trained on both datasets combined and tested on each dataset separately (Cross-Registry).
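To make the Frozen and Fine-Tuned strategies concrete, here is a short PyTorch sketch; the function name and strategy flags are ours, not the paper's. The Cross-Registry approach needs no such helper: it simply trains a fresh model on the concatenated LA and KY training sets.

```python
import copy

def make_transfer_model(source_model, strategy):
    """Derive a target-registry model from a CNN trained on the source registry."""
    model = copy.deepcopy(source_model)
    if strategy == "frozen":
        # Frozen: the source-trained parameters are applied to the
        # target data unchanged; no further training takes place.
        for p in model.parameters():
            p.requires_grad = False
        model.eval()
    elif strategy == "fine_tuned":
        # Fine-Tuned: all parameters stay trainable and are further
        # optimized on the target registry's training set.
        for p in model.parameters():
            p.requires_grad = True
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return model
```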

D. Performance Evaluation

We evaluated the performance of all models using the standard NLP metrics of micro- and macro-averaged F-scores. Micro-averaged metrics weight each class roughly in proportion to its test set representation, whereas macro-averaged metrics average over classes without weighting by class prevalence. For both micro and macro F-scores, we calculated 95% confidence intervals by bootstrapping the test set, and we used these intervals to determine the statistical significance of performance differences between training approaches.
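A sketch of this evaluation using scikit-learn's F-score and a simple nonparametric bootstrap is given below; the number of resamples and the random seed are illustrative choices not specified in the paper.

```python
import numpy as np
from sklearn.metrics import f1_score

def bootstrap_f1_ci(y_true, y_pred, average="macro",
                    n_boot=1000, alpha=0.05, seed=0):
    """Point estimate and 95% bootstrap CI for the micro or macro F-score."""
    rng = np.random.default_rng(seed)
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    scores = []
    for _ in range(n_boot):
        # Resample the test set with replacement and rescore.
        idx = rng.integers(0, len(y_true), size=len(y_true))
        scores.append(f1_score(y_true[idx], y_pred[idx],
                               average=average, zero_division=0))
    lo, hi = np.quantile(scores, [alpha / 2, 1 - alpha / 2])
    point = f1_score(y_true, y_pred, average=average, zero_division=0)
    return point, (lo, hi)
```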

III. Results and Analysis

Tables I and II summarize the performance of all training approaches for extracting the primary cancer subsite from cancer pathology reports when the targets are the LA and KY datasets, respectively. The results show that the Fine-Tuned and Cross-Registry training approaches significantly outperform the baseline model on both datasets in terms of micro and macro F-scores. However, the Frozen approach does not generalize as well between the two registries. Although the LA model transfers successfully to the KY dataset, with performance as good as the model trained on the KY dataset alone, the KY model's performance drops when the LA dataset is used as the target. This performance decline can be attributed to the larger size and more inclusive class representation of the LA dataset compared to the KY dataset. Fig. 1 shows the distribution of the best and worst represented classes with at least 100 samples, together with the F-score performance of CNN models trained on one registry's dataset and tested on the other.

TABLE I.

Micro and macro F-scores (with 95% confidence intervals) when the LA dataset is the target

Source Dataset           Micro F-score           Macro F-score
LA (Baseline)            0.611 (0.608, 0.613)    0.290 (0.287, 0.300)
KY (Frozen)              0.566 (0.565, 0.568)    0.226 (0.224, 0.230)
KY (Fine-Tuned)          0.614 (0.612, 0.617)    0.300 (0.297, 0.312)
LA+KY (Cross-Registry)   0.623 (0.620, 0.626)    0.310 (0.305, 0.319)

TABLE II.

Micro and macro F-scores (with 95% confidence intervals) when the KY dataset is the target

Source Dataset           Micro F-score           Macro F-score
KY (Baseline)            0.608 (0.605, 0.612)    0.273 (0.268, 0.285)
LA (Frozen)              0.604 (0.602, 0.606)    0.271 (0.268, 0.276)
LA (Fine-Tuned)          0.616 (0.613, 0.620)    0.309 (0.306, 0.324)
LA+KY (Cross-Registry)   0.618 (0.614, 0.622)    0.320 (0.313, 0.333)

Fig. 1. F-score and number of samples for the most (a) and least (b) represented classes with at least 100 samples. A model is trained on either the LA or the KY train set and tested on the other registry's test set, with all transferred parameters frozen.

One of the issues in our dataset, and in most real-world datasets, is class imbalance. This is a common challenge in cancer registries, as some cancer types are highly prevalent (e.g., breast, lung, prostate) while others are not (e.g., esophagus, gum, sinuses). To study the impact of TL per class in more detail, we selected the most and least represented classes with at least 100 samples; performance is mixed when the sample size is below 100, as shown in Fig. 2. Figs. 3 and 4 show the performance of the different training approaches on the least represented classes. In all figures, the primary y-axis shows the number of samples per class, while the secondary y-axis shows the F-score. The figures clearly show that increasing the sample size of underrepresented classes by applying TL approaches improves classification performance. For well-represented classes, however, adding more samples has marginal impact on model performance, as shown in Figs. 5 and 6 for the ten best represented classes from the LA and KY datasets.

Fig. 2. F-score and number of samples for randomly selected classes from the LA (a) and KY (b) datasets with fewer than 100 samples.

Fig. 3. F-score and number of samples for the ten least represented classes from the LA dataset with at least 100 samples.

Fig. 4. F-score and number of samples for the ten least represented classes from the KY dataset with at least 100 samples.

Fig. 5. F-score and number of samples for the ten most represented classes from the LA dataset.

Fig. 6. F-score and number of samples for the ten most represented classes from the KY dataset.

IV. Conclusion

This study exploited different TL approaches to transfer the knowledge of CNN models for clinical information extraction from cancer pathology reports. The study focused on primary cancer site and topography extraction, a fundamental information abstraction task for cancer registries. Using data from two different SEER registries, we demonstrated that TL is an effective technique for this application, leading to superior performance relative to the baseline models. The results show the ability of TL to improve model performance, especially for low-prevalence classes. Fine-tuning the pre-trained parameters yields gains in micro and macro F-scores with either LA or KY as the target dataset, and combining both datasets to train one general model achieves the best performance among all models. The success of applying TL across these two cancer registries suggests that there is sufficient semantic similarity between their data, opening up the possibility of exploiting this approach for other information extraction tasks, including under a multi-task learning framework.

Acknowledgment

This work has been supported in part by the Joint Design of Advanced Computing Solutions for Cancer (JDACS4C) program established by the U.S. Department of Energy (DOE) and the National Cancer Institute (NCI) of the National Institutes of Health. This work was performed under the auspices of the U.S. Department of Energy by Argonne National Laboratory under Contract DE-AC02-06CH11357, Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344, Los Alamos National Laboratory under Contract DE-AC52-06NA25396, and Oak Ridge National Laboratory under Contract DE-AC05-00OR22725.

This work has also been supported by National Cancer Institute under Contract No. HHSN261201800013I and NCI Cancer Center Support Grant (P30CA177558).

This manuscript has been authored by UT-Battelle, LLC under Contract No. DE-AC05-00OR22725 with the U.S. Department of Energy. The United States Government retains and the publisher, by accepting the article for publication, acknowledges that the United States Government retains a non-exclusive, paid-up, irrevocable, world-wide license to publish or reproduce the published form of the manuscript, or allow others to do so, for United States Government purposes. The Department of Energy will provide public access to these results of federally sponsored research in accordance with the DOE Public Access Plan (http://energy.gov/downloads/doe-public-access-plan).

References

[1] Qiu JX, Yoon H, Fearn PA, and Tourassi GD, "Deep learning for automated extraction of primary sites from cancer pathology reports," IEEE Journal of Biomedical and Health Informatics, vol. 22, pp. 244–251, Jan. 2018.
[2] Gao S, Young MT, Qiu JX, Yoon H, Christian JB, Fearn PA, Tourassi GD, and Ramanathan A, "Hierarchical attention networks for information extraction from cancer pathology reports," JAMIA, vol. 25, no. 3, pp. 321–330, 2018.
[3] Alawad M, Yoon H, and Tourassi GD, "Coarse-to-fine multi-task training of convolutional neural networks for automated information extraction from cancer pathology reports," in IEEE EMBS International Conference on Biomedical and Health Informatics (BHI), Mar. 2018.
[4] Semwal T, Yenigalla P, Mathur G, and Nair SB, "A practitioners' guide to transfer learning for text classification using convolutional neural networks," in Proceedings of the 2018 SIAM International Conference on Data Mining (SDM), pp. 513–521, May 2018.
[5] Weiss K, Khoshgoftaar TM, and Wang D, "A survey of transfer learning," Journal of Big Data, vol. 3, p. 9, May 2016.
[6] Krizhevsky A, Sutskever I, and Hinton GE, "ImageNet classification with deep convolutional neural networks," in Proceedings of the 25th International Conference on Neural Information Processing Systems (NIPS'12), vol. 1, pp. 1097–1105, 2012.
[7] Sermanet P, Eigen D, Zhang X, Mathieu M, Fergus R, and LeCun Y, "OverFeat: Integrated recognition, localization and detection using convolutional networks," in International Conference on Learning Representations (ICLR 2014), Apr. 2014.
[8] Yosinski J, Clune J, Bengio Y, and Lipson H, "How transferable are features in deep neural networks?," in Proceedings of the 27th International Conference on Neural Information Processing Systems (NIPS'14), vol. 2, pp. 3320–3328, 2014.
[9] Cheng PM and Malhi HS, "Transfer learning with convolutional neural networks for classification of abdominal ultrasound images," Journal of Digital Imaging, vol. 30, no. 2, pp. 234–243, 2016.
[10] Rajpurkar P, Irvin J, Zhu K, Yang B, Mehta H, Duan T, Ding D, Bagul A, Langlotz C, Shpanskaya K, Lungren MP, and Ng AY, "CheXNet: Radiologist-level pneumonia detection on chest X-rays with deep learning," CoRR, vol. abs/1711.05225, 2017.
[11] Mikolov T, Sutskever I, Chen K, Corrado GS, and Dean J, "Distributed representations of words and phrases and their compositionality," in Advances in Neural Information Processing Systems 26, pp. 3111–3119, Curran Associates, Inc., 2013.
[12] Pennington J, Socher R, and Manning CD, "GloVe: Global vectors for word representation," in Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543, 2014.
[13] Beam AL, Kompa B, Fried I, Palmer NP, Shi X, Cai T, and Kohane IS, "Clinical concept embeddings learned from massive sources of medical data," CoRR, vol. abs/1804.01486, 2018.
[14] Mou L, Meng Z, Yan R, Li G, Xu Y, Zhang L, and Jin Z, "How transferable are neural networks in NLP applications?," in Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 479–489, Association for Computational Linguistics, 2016.
