Journal of the American Medical Informatics Association (JAMIA). 2019 Sep 24;26(12):1632–1636. doi: 10.1093/jamia/ocz164

Traditional Chinese medicine clinical records classification with BERT and domain specific corpora

Liang Yao 1, Zhe Jin 2, Chengsheng Mao 1, Yin Zhang 2, Yuan Luo 1
PMCID: PMC7647141  PMID: 31550356

Abstract

Traditional Chinese Medicine (TCM) has been developed over several thousand years and plays a significant role in health care for Chinese people. This paper studies the problem of classifying TCM clinical records into the 5 main disease categories of TCM. We explored a number of state-of-the-art deep learning models and found that the recent Bidirectional Encoder Representations from Transformers (BERT) model achieves better results than other deep learning models and other state-of-the-art methods. We further utilized an unlabeled clinical corpus to fine-tune the BERT language model before training the text classifier. The method uses only Chinese characters in clinical text as input, without preprocessing or feature engineering. We evaluated deep learning models and traditional text classifiers on a benchmark data set. Our method achieves a state-of-the-art accuracy of 89.39% ± 0.35%, a Macro F1 score of 88.64% ± 0.40%, and a Micro F1 score of 89.39% ± 0.35%. We also visualized the attention weights in our method, which can reveal indicative characters in clinical text.

Keywords: natural language processing, clinical records classification, BERT, domain knowledge, TCM

INTRODUCTION

As a medical system with ancient roots, traditional Chinese medicine (TCM) has played an indispensable role in the health care of China for several thousand years and is increasingly adopted as a complementary therapy to modern medicine around the world.1 In TCM, historically accumulated clinical records are the main knowledge sources for generating appropriate clinical hypotheses.2 As a fundamental task of natural language processing (NLP), text classification plays an important role in organizing and retrieving clinical records to support diagnosis and prescription, and TCM is no exception.

Existing studies on clinical text classification mainly focus on modern medicine clinical text written in English, and their methods face obstacles in generalizing to TCM clinical records.3,4 In our previous work,5 we investigated features and learning algorithms for TCM clinical records classification, which required heavy and sometimes manual feature engineering.

In this paper, we investigate multiple deep learning NLP models for TCM clinical text classification. Using 7037 benchmark TCM clinical records from notable TCM doctors, we evaluated the performance of these deep models. In addition, we propose a new training procedure that combines the state-of-the-art Bidirectional Encoder Representations from Transformers (BERT)6 with TCM domain corpora. The model takes raw Chinese characters in clinical text as input, without any preprocessing or feature engineering such as Chinese word segmentation or stop/rare word removal. The new method achieves the best results so far and can automatically discover characters indicative of category labels. Our source code is available at https://github.com/yao8839836/tcm_bert.

The main contributions of the study are summarized as follows:

  • To the best of our knowledge, this is the first comprehensive study of deep learning models for TCM clinical text classification, which we also compare with traditional machine learning methods.

  • We propose to fine-tune pretrained deep language models with large-scale unlabeled TCM clinical data, which produces the best results so far.

BACKGROUND

Comprehensive literature surveys of clinical text classification studies have been conducted by Stanfill et al3 and Mujtaba et al.4 Most clinical text classification work focuses on English clinical text and feature engineering, whereas only limited work has been done on Chinese clinical text and deep learning models. In our previous work,5 we systematically explored features and classifiers for TCM clinical text and further proposed a knowledge-based text embedding method, but recent deep learning models were not included. Recently, Hu et al7 modeled TCM syndrome differentiation as a text classification task and experimented with 2 deep learning models, the Convolutional Neural Network (CNN)8 and fastText,9 but many state-of-the-art deep learning models have not been explored for the task.

Three dominant deep learning models for text classification are the CNN,8 recurrent neural networks (RNNs) such as the long short-term memory (LSTM) network,10 and fastText.9 The input of these models can be words or characters. To improve the representation capability of these models, an attention mechanism11 is often introduced as an integral component. More recent self-attention models such as the Transformer12 have also shown superiority in NLP tasks such as machine translation. Recently, deep transfer learning models such as ELMo,13 ULMFiT,14 OpenAI GPT,15 and BERT6 have shown great success in NLP; these models learn contextualized word embeddings from large amounts of unlabeled free text and achieve state-of-the-art accuracy in multiple language understanding tasks. Among them, BERT is the most prominent: it pretrains bidirectional encoder representations on a large general domain corpus through masked language modeling and next sentence prediction, and it provides pretrained models for Chinese. BERT has also been adapted to English clinical texts.16,17 Alsentzer et al16 pretrained BERT on MIMIC-III clinical notes and fine-tuned the model for natural language inference and named entity recognition tasks. Huang et al17 also pretrained BERT on MIMIC-III clinical notes but fine-tuned BERT for readmission prediction. Unlike these 2 studies, we first fine-tune a general domain language model on unlabeled clinical text and then fine-tune the domain specific model for clinical text classification.

DATA

We used the benchmark TCM clinical records data set,5 which was collected from Classified Medical Records of Distinguished Physicians Continued Two (二续名医类案 in Chinese, ISBN 7-5381-2372-5). The data set contains 7037 records in 5 disease categories: internal medicine, surgery, gynecology, ear-nose-throat-stomatology, and pediatrics. The class system, provided in the book, organizes diseases by medical specialty; most TCM diseases (except eye and bone diseases) can be classified into these 5 classes. Each record belongs to exactly 1 of the 5 categories. The training set contains 4882 records and the test set contains 2155 records, with an average record length of 316 characters. More details of the data set can be found in Yao et al.5 An example surgery record from the training set is shown in Figure 1.

Figure 1. An example record from the training set.

We also used 46 205 unlabeled clinical records from Chinese Knowledge Center for Engineering Science and Technology (CKCEST) as the domain specific corpus which provides domain knowledge (http://zcy.ckcest.cn/tcm/). There are 18 327 723 tokens in the corpus.

MATERIALS AND METHODS

Bidirectional encoder representations from transformers

The BERT model6 is the current state-of-the-art pretrained contextual representation model, built on a multilayer bidirectional Transformer encoder architecture12 that relies on a self-attention mechanism. There are 2 steps in the BERT framework: pretraining and fine-tuning. During pretraining, BERT is trained on unlabeled general domain data (Chinese Wikipedia in this work) with 2 self-supervised tasks: masked language modeling (Mask LM) and next sentence prediction (NSP). For fine-tuning, BERT is initialized with the pretrained parameters, and all of the parameters are fine-tuned using labeled data from downstream tasks such as text classification, sentence pair classification, and question answering.
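
To make the classifier fine-tuning step concrete, the following is a minimal sketch of fine-tuning a pretrained Chinese BERT as a 5-way text classifier using the HuggingFace transformers library; this is not the authors' released implementation (see the GitHub link above), and the record texts and labels are placeholders.

```python
# Minimal sketch of BERT fine-tuning for 5-way classification using the
# HuggingFace `transformers` library (illustrative, not the authors' code).
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = BertForSequenceClassification.from_pretrained("bert-base-chinese", num_labels=5)

# Toy batch: 2 placeholder clinical records and their category ids.
texts = ["患者鼻衄屡发，发疡焮肿。", "产后腹痛，恶露不净。"]   # placeholder records
labels = torch.tensor([1, 2])                                  # e.g. surgery, gynecology

inputs = tokenizer(texts, padding=True, truncation=True, max_length=512, return_tensors="pt")
outputs = model(**inputs, labels=labels)   # cross-entropy loss on the [CLS] representation
outputs.loss.backward()                    # one gradient step of fine-tuning
```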

Our TCM-BERT

Because the original Chinese BERT was pretrained on Chinese Wikipedia text, it captures general syntactic and semantic information in Chinese. However, TCM clinical notes are quite different from general domain texts: they contain many TCM-specific symptoms, syndromes, and herbs, and many TCM clinical records are written in ancient-style Chinese, in which characters can take on different meanings and orders within a sentence.

To bridge the gap between the pretrained BERT model and TCM clinical notes, we fine-tuned the pretrained BERT language model on the external unlabeled clinical corpus before finally fine-tuning BERT as a text classifier. The intermediate step uses the same objective as the pretraining step. Our method is similar to the 3-step procedure of ULMFiT,14 although we follow BERT's learning rate decay.
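
As an illustration of this intermediate step, here is a minimal sketch of fine-tuning the BERT language model on unlabeled clinical text with the masked-LM objective, again using the HuggingFace transformers library; it omits the NSP objective and the learning rate schedule described above, and the record texts are placeholders.

```python
# Sketch of domain language-model fine-tuning (masked-LM part only);
# the paper additionally keeps the next-sentence-prediction objective.
import torch
from transformers import BertTokenizer, BertForMaskedLM, DataCollatorForLanguageModeling

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = BertForMaskedLM.from_pretrained("bert-base-chinese")

# The 46 205 unlabeled CKCEST records would go here; placeholders shown.
unlabeled_records = ["患者头痛发热，脉浮数，苔薄白，治以辛凉解表。",
                     "咳嗽痰多，胸闷气短，舌苔白腻，宜健脾化痰。"]
encodings = [tokenizer(t, truncation=True, max_length=128) for t in unlabeled_records]

collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)
batch = collator(encodings)                       # randomly masks ~15% of the tokens
loss = model(input_ids=batch["input_ids"],
             attention_mask=batch["attention_mask"],
             labels=batch["labels"]).loss         # predict the masked characters
loss.backward()
```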

Our 3-step TCM-BERT method is depicted in Figure 2. Step 1 is the same as BERT pretraining: we directly use the publicly available BERT-Base-Chinese model released by Google as the output of this step. In Step 2, we use real sentences from the 46 205 unlabeled records to construct sentence pairs. If Sentence 1 and Sentence 2 are consecutive, the NSP label is positive; negative sentence pairs are generated by replacing Sentence 2 with a random sentence from another clinical record 50% of the time. The Mask LM task is the same as in the original BERT pretraining. In Step 3, we treat all sentences in a record as a single input sequence to BERT. Step 2 uses the pretrained weights of Step 1 as initialization, and Step 3 uses the fine-tuned weights of Step 2 as initialization.
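
The sentence pair construction for Step 2 could look roughly like the following sketch; this reflects our reading of the procedure above, and the function and variable names are illustrative rather than taken from the authors' code.

```python
# Rough sketch of building NSP training pairs from unlabeled records:
# consecutive sentences form a positive pair; with probability 0.5 the second
# sentence is replaced by a random sentence from another record (negative pair).
import random

def build_nsp_pairs(records, rng=random.Random(0)):
    """records: list of records, each a list of sentence strings."""
    pairs = []
    for i, sentences in enumerate(records):
        for s1, s2 in zip(sentences, sentences[1:]):
            if rng.random() < 0.5:                            # keep the true next sentence
                pairs.append((s1, s2, 1))                     # label 1 = IsNext
            else:                                             # sample from another record
                j = rng.choice([k for k in range(len(records)) if k != i])
                pairs.append((s1, rng.choice(records[j]), 0)) # label 0 = NotNext
    return pairs
```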

Figure 2. The 3 steps of our TCM-BERT method. The main difference between standard BERT and TCM-BERT is that we perform clinical domain language model fine-tuning in Step 2. LM: language modeling; NSP: next sentence prediction. In Mask LM, a fraction of input tokens is held out for prediction. In NSP, BERT predicts whether 2 input sentences are consecutive.

As in the original BERT, the input to the TCM-BERT model in each of the 3 steps can be a single sentence or a pair of sentences. A "sentence" can be an arbitrary span of contiguous text in Step 1 and Step 3, rather than an actual linguistic sentence. The first token of the input is always a special classification token, [CLS]. Sentence pairs are packed into a single input sequence separated by a special token, [SEP]. Each input token i has a token embedding E_i. The H-dimensional final hidden states C ∈ R^H and T_i ∈ R^H, corresponding to the [CLS] token and the i-th input token, are used as the aggregate sequence representation and the token representation, respectively, for classification tasks.
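
For illustration, the sketch below shows the input format and the extraction of the [CLS] representation C from the final hidden states with the HuggingFace transformers library; the example characters are taken from the discussion of Figure 1, the printed tokenization is approximate (rare characters may map to [UNK]), and the 5-class head is a hypothetical stand-in for the classifier.

```python
# Sketch of BERT's input packing and of the [CLS] representation used for
# classification (illustrative only).
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = BertModel.from_pretrained("bert-base-chinese")

# A sentence pair is packed as: [CLS] sentence 1 [SEP] sentence 2 [SEP]
enc = tokenizer("鼻衄屡发", "发疡焮肿", return_tensors="pt")
print(tokenizer.convert_ids_to_tokens(enc["input_ids"][0]))
# roughly: ['[CLS]', '鼻', '衄', '屡', '发', '[SEP]', '发', '疡', '焮', '肿', '[SEP]']

hidden = model(**enc).last_hidden_state                     # shape (1, sequence_length, H)
C = hidden[:, 0]                                            # H-dimensional [CLS] state
T = hidden[:, 1:]                                           # per-token states T_i
logits = torch.nn.Linear(model.config.hidden_size, 5)(C)    # hypothetical 5-class head
```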

RESULTS AND DISCUSSION

We compared our TCM-BERT with the previous best-performing models and a number of representative deep learning models:

  • PV-DBOW + SVM (support vector machine)5: a document embedding method with an SVM classifier.

  • ESA-PV-DBOW + SVM5: a knowledge-based document embedding method with an SVM classifier, which achieved the best results previously.

  • Word TFIDF + SVM5: a TFIDF-weighted bag-of-words model with an SVM classifier; Chinese word segmentation was applied.

  • Char TFIDF + SVM: a TFIDF-weighted bag-of-characters model; Chinese word segmentation was not applied.

  • Char CNN: the CNN model8 trained on character sequences without any preprocessing such as Chinese word segmentation.

  • Word CNN: the CNN model trained on segmented word sequences, with pretrained Chinese word embeddings from Tencent AI Lab18 as the initialization.

  • Char Bi-LSTM: a bidirectional LSTM trained on characters without any preprocessing such as Chinese word segmentation.

  • Char Bi-LSTM + Attention: Char Bi-LSTM with an attention mechanism.11

  • Word Bi-LSTM: a bidirectional LSTM trained on segmented word sequences with pretrained Tencent AI Lab Chinese word embeddings.

  • Char fastText: fastText9 trained on character sequences without word segmentation; we evaluated it with and without bigrams.

  • Word fastText: fastText trained on word sequences.

  • BERT: standard BERT with 2 steps (pretraining and classifier fine-tuning).

Following Devlin et al,6 we fine-tuned TCM-BERT and BERT as text classifiers with a training batch size of 32, 3 training epochs, and a learning rate of 2e-5; for the second step of TCM-BERT, we used a training batch size of 32, 3 training epochs, and a learning rate of 3e-5. For the best-performing methods in Yao et al,5 we used their default parameter settings. For the CNN models, we used 256 filters, a kernel size of 5, a dropout rate of 0.5, a learning rate of 1e-3, and a batch size of 128; for the LSTM models, a dropout rate of 0.5, a learning rate of 1e-3, and a batch size of 128. For fastText, we used the default settings in its implementation. We found that small changes to these parameters did not change the results much. For the TCM-BERT, BERT, CNN, and Bi-LSTM models, we randomly selected 10% of the training records as the validation set.
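
To make one of the character-level baselines concrete, below is a minimal PyTorch sketch of a Kim-style Char CNN with the hyperparameters listed above (256 filters, kernel size 5, dropout 0.5); the character embedding dimension and vocabulary size are assumptions, and the authors' implementation may differ in details.

```python
# Minimal Char CNN baseline sketch (Kim-style CNN over raw character ids).
import torch
import torch.nn as nn

class CharCNN(nn.Module):
    def __init__(self, vocab_size, embed_dim=300, num_filters=256,
                 kernel_size=5, num_classes=5, dropout=0.5):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.conv = nn.Conv1d(embed_dim, num_filters, kernel_size)
        self.dropout = nn.Dropout(dropout)
        self.fc = nn.Linear(num_filters, num_classes)

    def forward(self, char_ids):                     # (batch, seq_len)
        x = self.embed(char_ids).transpose(1, 2)     # (batch, embed_dim, seq_len)
        x = torch.relu(self.conv(x))                 # character n-gram features
        x = torch.max(x, dim=2).values               # max-pooling over time
        return self.fc(self.dropout(x))              # (batch, num_classes)

# e.g. logits = CharCNN(vocab_size=8000)(torch.randint(1, 8000, (128, 316)))
```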

Table 1 presents the Accuracy, Macro F1 score, and Micro F1 score of the different models. TCM-BERT performs the best and significantly outperforms all baseline models (P < .001, Student t test) on all 3 metrics, which demonstrates the effectiveness of our proposed method.

For a more in-depth analysis, we note that PV-DBOW performs better than the bag-of-words method, which suggests that the local word co-occurrence information captured by document embeddings is more predictive than word occurrence in documents. The previous best-performing method, ESA-PV-DBOW + SVM, performs well and outperforms some deep learning methods such as Word CNN, the Bi-LSTM models, and the fastText models, which shows that word relatedness in TCM domain knowledge can overcome sparseness and improve document embeddings. The bag of characters scores lower than the bag of words, indicating that word occurrence is more predictive than character occurrence. In contrast, Word CNN and Word Bi-LSTM clearly performed worse than Char CNN and Char Bi-LSTM. Deep learning models prioritize local consecutive word/character sequences rather than word/character occurrence, so they behave differently. These results are also consistent with the findings of Meng et al,19 in which char-based deep learning models consistently outperform word-based deep learning models. Meng et al19 showed that the inferiority of word-based models is due to the sparseness of word distributions (there are far more distinct Chinese words than Chinese characters), which leads to more out-of-vocabulary words and overfitting. In addition, characters in ancient-style TCM clinical records can carry sufficient semantics on their own, as ancient Chinese is very concise.

CNN models perform much better than LSTM models, as they can learn word/character n-gram features that are indicative of category labels. For example, 鼻衄屡发 (recurrent epistaxis) and 发疡焮肿 (swollen ulcer) are closely related to the Surgery category. LSTM models learn sequential patterns in text but may not perform well when documents are long. When an attention mechanism is incorporated, the performance of Char Bi-LSTM improves greatly, because the most predictive characters are selected. fastText also shows decent results, as it learns document embeddings in a supervised way.

BERT clearly outperforms the previously mentioned methods, because the Transformer in BERT can learn n-gram features and select the most predictive characters simultaneously via self-attention. In addition, the pretrained BERT trained on Wikipedia encodes rich general domain knowledge. Our TCM-BERT further improves performance with TCM domain knowledge extracted from the unlabeled clinical corpus, which shows the effectiveness of the domain knowledge in the unlabeled corpus. The general knowledge and the TCM domain knowledge are encoded in the parameters of the language model fine-tuned in the second step. The results could be further improved if more clinical texts and domain knowledge were available.

Table 1.

Test performance of different methods, in percentage. For document embedding and deep learning models with randomness in the learning process, we ran all models 10 times and report mean ± standard deviation. Bold numbers are the best results. TCM-BERT significantly outperforms all baselines based on the Student t test (P < .001)

Method Accuracy Macro F1 Micro F1
PV-DBOW + SVM 78.35 ± 0.17 76.35 ± 0.24 78.35 ± 0.17
ESA-PV-DBOW + SVM 82.61 ± 0.31 81.20 ± 0.32 82.61 ± 0.31
Word TFIDF + SVM 77.31 75.64 77.31
Char TFIDF + SVM 76.01 73.86 76.01
Char CNN 85.58 ± 0.67 84.11 ± 0.83 85.58 ± 0.67
Word CNN 82.36 ± 0.39 80.27 ± 0.43 82.36 ± 0.39
Char Bi-LSTM 62.84 ± 1.66 56.63 ± 3.17 62.84 ± 1.66
Char Bi-LSTM + Attention 81.33 ± 0.36 79.86 ± 0.30 81.33 ± 0.36
Word Bi-LSTM 60.60 ± 1.52 56.23 ± 3.21 60.60 ± 1.52
Char fastText 78.33 ± 0.04 76.95 ± 0.05 78.33 ± 0.04
Char fastText (bigrams) 81.10 ± 0.04 79.55 ± 0.06 81.10 ± 0.04
Word fastText 75.87 ± 0.19 73.42 ± 0.19 75.87 ± 0.19
BERT 87.87 ± 0.69 87.08 ± 0.75 87.87 ± 0.69
TCM-BERT 89.39 ± 0.35 88.67 ± 0.40 89.39 ± 0.35
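
The reported metrics (Accuracy, Macro F1, Micro F1) and the significance test can be reproduced in form with standard tooling, as in the sketch below; the per-run scores here are synthetic values drawn to match the reported mean and standard deviation, for illustration only, and are not the authors' actual runs.

```python
# Sketch of metric computation and significance testing (illustrative only).
import numpy as np
from sklearn.metrics import accuracy_score, f1_score
from scipy import stats

def evaluate(y_true, y_pred):
    """Accuracy, Macro F1, and Micro F1 for one run."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "macro_f1": f1_score(y_true, y_pred, average="macro"),
        "micro_f1": f1_score(y_true, y_pred, average="micro"),
    }

# Significance over 10 repeated runs; accuracies are synthetic draws from the
# reported mean +/- SD of TCM-BERT and BERT, not real experimental outputs.
rng = np.random.default_rng(0)
tcm_bert_acc = rng.normal(0.8939, 0.0035, size=10)
bert_acc = rng.normal(0.8787, 0.0069, size=10)
t_stat, p_value = stats.ttest_ind(tcm_bert_acc, bert_acc)   # Student t test
```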

Table 2 shows the F1 scores of each category achieved by several best-performing models. TCM-BERT achieves the highest score in every category. The improvements over the other models are larger for categories with fewer training records, such as Ear-Nose-Throat-Stomatology, because TCM domain knowledge in the unlabeled corpus can overcome the sparseness of these categories. Among the categories, Gynecology records are the easiest to identify, because they contain many characters/words specific to female patients, and all 4 models in Table 2 predict this category well. Categories with fewer records are more difficult to classify. ESA-PV-DBOW + SVM and many of the methods in Yao et al5 could not achieve satisfactory F1 scores for Pediatrics records, as children and babies share many diseases and symptoms with adults. Deep learning models like BERT and Char CNN perform much better, which suggests that the self-attention mechanism and max-pooling over n-grams can identify Pediatrics-specific elements such as age descriptions.

Table 2.

F1 score of each category for several best-performing methods, in percentage. We ran all models 10 times and report mean ± standard deviation. Bold numbers are the best results. TCM-BERT significantly outperforms the others based on the Student t test (P < .001)

Category (# Training / # Test) TCM-BERT BERT Char CNN ESA-PV-DBOW
Internal medicine (1905/896) 89.97 ± 0.37 88.15 ± 1.05 87.23 ± 0.40 84.50 ± 0.12
Surgery (574/233) 83.42 ± 1.37 81.55 ± 0.66 80.13 ± 0.69 77.70 ± 0.24
Gynecology (1044/423) 91.95 ± 0.55 89.96 ± 1.07 89.23 ± 0.36 89.38 ± 0.30
Ear-nose-throat-stomatology (522/241) 87.29 ± 0.96 84.65 ± 2.45 81.67 ± 0.64 79.67 ± 0.77
Pediatrics (837/362) 89.93 ± 0.39 88.70 ± 0.61 83.51 ± 0.84 75.50 ± 0.78

Figure 3 illustrates attention patterns in TCM-BERT; we visualize layer 11 using the BertViz toolkit.20 The record in Figure 1 is used as the input. Indicative symptom characters in the record, such as 鼻衄 (epistaxis) and 发疡 (ulcer), have higher attention weights connecting them to the classification token [CLS]. This example helps explain how BERT achieves better results. The learned attention could also reduce a doctor's reading burden by highlighting key symptoms in an interactive system.
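
The attention inspection behind Figure 3 can be approximated as in the sketch below: we load a BERT model with attention outputs enabled and read off the layer-11 attention from [CLS] to each character. BertViz20 renders this interactively; here the general pretrained Chinese model stands in for the fine-tuned TCM-BERT, and the input string is an illustrative fragment rather than the full record of Figure 1.

```python
# Sketch of extracting [CLS] attention weights (simplified stand-in for BertViz).
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = BertModel.from_pretrained("bert-base-chinese", output_attentions=True)

enc = tokenizer("鼻衄屡发，发疡焮肿。", return_tensors="pt")
attentions = model(**enc).attentions        # one (batch, heads, seq, seq) tensor per layer
layer11 = attentions[11][0]                 # "layer 11"; adjust index if counting from 1
cls_to_tokens = layer11.mean(dim=0)[0]      # average heads; attention from [CLS] to each token

tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0])
for tok, w in zip(tokens, cls_to_tokens.tolist()):
    print(f"{tok}\t{w:.3f}")                # higher-weight characters are more indicative
```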

Figure 3. The visualization of attention patterns in TCM-BERT (final step). The example record in Figure 1 is taken as the input. We show attention weights between [CLS] and characters in layer 11 of the Transformer model. Weights of lines and colors reflect the attention score. The indicative characters have higher attention weights.

CONCLUSION

In this study, we explored deep learning models for classifying TCM clinical records. We found that deep learning models such as the CNN and BERT achieve better results than traditional methods, without any preprocessing or feature engineering, and that char-based deep learning models perform better than word-based ones. We also found that the pretrained BERT model achieves state-of-the-art results and can be further improved by fine-tuning on unlabeled clinical corpora. Interesting future directions include improving classification performance by fine-tuning on more unlabeled domain corpora and knowledge graphs; the pretraining procedure could also be enhanced by adding knowledge-based objectives.

FUNDING

Yin Zhang and Zhe Jin are supported by the China Knowledge Centre for Engineering Sciences and Technology (No. CKCEST-2019-1-12). Liang Yao, Chengsheng Mao, and Yuan Luo are partially supported by US NIH Grant R21LM012618.

AUTHOR CONTRIBUTIONS

LY designed and implemented the TCM-BERT and comparison models, evaluated the systems’ performance, and wrote the first draft of the paper. CM helped with debugging the models. LY, ZJ, and YZ formulated the application and prepared the data. YL guided model design and provided helpful feedback and revisions to the paper.

Conflict of Interest statement

None declared.

REFERENCES

  • 1. Cheung F. TCM: Made in China. Nature 2011; 480 (7378): S82–S83.
  • 2. Zhou X, Peng Y, Liu B. Text mining for traditional Chinese medical knowledge discovery: a survey. J Biomed Inform 2010; 43 (4): 650–60.
  • 3. Stanfill MH, Williams M, Fenton SH, Jenders RA, Hersh WR. A systematic literature review of automated clinical coding and classification systems. J Am Med Inform Assoc 2010; 17 (6): 646–51.
  • 4. Mujtaba G, Shuib L, Idris N, et al. Clinical text classification research trends: systematic literature review and open issues. Expert Syst Appl 2019; 116: 494–520.
  • 5. Yao L, Zhang Y, Wei B, Li Z, Huang X. Traditional Chinese medicine clinical records classification using knowledge-powered document embedding. In: 2016 IEEE International Conference on Bioinformatics and Biomedicine (BIBM); December 15-18, 2016; Shenzhen, China.
  • 6. Devlin J, Chang M-W, Lee K, Toutanova K. BERT: pre-training of deep bidirectional transformers for language understanding. In: NAACL-HLT; June 2-7, 2019; Minneapolis, Minnesota.
  • 7. Hu Q, Yu T, Li J, Yu Q, Zhu L, Gu Y. End-to-end syndrome differentiation of Yin deficiency and Yang deficiency in traditional Chinese medicine. Comput Methods Programs Biomed 2019; 174: 9–15.
  • 8. Kim Y. Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP); October 25-29, 2014; Doha, Qatar.
  • 9. Joulin A, Grave E, Bojanowski P, Mikolov T. Bag of tricks for efficient text classification. In: Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers; April 3-7, 2017; Valencia, Spain.
  • 10. Liu P, Qiu X, Huang X. Recurrent neural network for text classification with multi-task learning. In: Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence; July 9-15, 2016; New York, NY.
  • 11. Yang Z, Yang D, Dyer C, He X, Smola A, Hovy E. Hierarchical attention networks for document classification. In: Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies; June 12-17, 2016; San Diego, California.
  • 12. Vaswani A, Shazeer N, Parmar N, et al. Attention is all you need. In: Advances in Neural Information Processing Systems; December 4-9, 2017; Long Beach, CA.
  • 13. Peters M, Neumann M, Iyyer M, et al. Deep contextualized word representations. In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers); June 1-6, 2018; New Orleans, Louisiana.
  • 14. Howard J, Ruder S. Universal language model fine-tuning for text classification. In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers); July 15-20, 2018; Melbourne, Australia.
  • 15. Radford A, Narasimhan K, Salimans T, Sutskever I. Improving language understanding with unsupervised learning. Technical report, OpenAI; 2018.
  • 16. Alsentzer E, Murphy JR, Boag W, et al. Publicly available clinical BERT embeddings. arXiv preprint arXiv:1904.03323; 2019.
  • 17. Huang K, Altosaar J, Ranganath R. ClinicalBERT: modeling clinical notes and predicting hospital readmission. arXiv preprint arXiv:1904.05342; 2019.
  • 18. Song Y, Shi S, Li J, Zhang H. Directional skip-gram: explicitly distinguishing left and right context for word embeddings. In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers); June 1-6, 2018; New Orleans, Louisiana.
  • 19. Meng Y, Li X, Sun X, Han Q, Yuan A, Li J. Is word segmentation necessary for deep learning of Chinese representations? In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL); July 28-August 2, 2019; Florence, Italy.
  • 20. Vig J, Belinkov Y. Analyzing the structure of attention in a Transformer language model. arXiv preprint arXiv:1906.04284; 2019.
