Abstract
Purpose
In recent years, rapid advances in medical technology have allowed patients to receive more accurate and detailed examinations. Many examination reports are not represented as numerical data but as text documents written by medical examiners based on observations from instruments and biochemical tests. If such unstructured data can be organized into a structured report, it will help doctors understand a patient's status across the various examinations more efficiently. In addition, further association analysis can be performed on the structured data to identify potential factors that affect a disease.
Methods
In this paper, we applied the part-of-speech (POS) tagging results of natural language analysis to automatically extract keyword phrases from pathology examination reports of renal diseases. A medical dictionary of the examination items in a report is then established, which serves as the basic information for retrieving the terms used to construct a structured form of the report. Moreover, a topic probability modeling method is applied to automatically discover candidate keyword phrases for the examination items in the reports. Finally, a system is implemented that generates the structured form for the various examination items in a report according to the constructed medical dictionary.
Results and conclusion
The experimental results showed that the methods proposed in this paper can effectively construct a structured form of the examination reports. Furthermore, the keywords of the common examination items can be extracted correctly. These techniques will support automatic processing and analysis of medical text reports.
Keywords: Extraction, Structuralization of medical report, Medical dictionary construction
Introduction
With the advance of technology, medical and healthcare data continues to grow and has become easier to collect. How to discover useful information from such medical data in order to further improve various medical services has become an important research topic. An electronic medical records (EMR) system provides an environment for storing all of a patient's health information in electronic format. The data includes descriptions of medications, laboratory test results, and patient symptoms recorded by doctors. Accordingly, analyzing this type of record is a direct way to understand the progress of a disease, which can help healthcare professionals provide accurate and efficient treatment for patients.
Recent advances in medical technology allow more meticulous and detailed examinations for patients. However, many examination reports are not represented in numerical form. Instead, they contain text descriptions of the results shown on medical equipment and the observations recorded by a medical technologist. From such unstructured examination reports, it often takes time for a doctor to read and identify the abnormalities in order to diagnose medical conditions and propose proper treatments. Converting an unstructured text examination report into a structured one can help doctors understand a patient's condition for the various examined items more effectively and efficiently. Moreover, it allows doctors to break medical records down by different classification criteria and to further analyze the relevance of clinical symptoms so as to identify potential factors affecting a disease. Accordingly, the technology for transforming an unstructured examination report into a structured one is important and of practical need, and it can further improve the working efficiency and quality of healthcare professionals.
The research goal of this paper is to construct a system that converts an unstructured English medical report into a structured form for display. In this study, we focused on nephrology medical examination reports. Sethi et al. [21] proposed that a nephrology medical examination report should include the following eight paragraphs, each containing different examination items and contents: (1) main diagnosis (diagnosis), (2) electron microscopy examination (EM), (3) examination status of the electron microscope (comment/narrative), (4) the size and number of the sliced specimens (specimen type), (5) general description of the sliced specimen (gross description), (6) light microscope examination (LM), (7) chromosome examination (DIF), and (8) report conclusion (summary). Within an examination report, the contents corresponding to these paragraphs follow a fixed order. Paragraphs (1), (3), and (8) are summaries of the symptoms and examination results written by a doctor. The goal of processing these paragraphs is to extract the main diagnosis, including the judgment of the main disease and the descriptions of abnormalities in the pathological examinations. Paragraphs (2), (4), (5), (6), and (7) record the observations for the various specific inspection items. The goal of processing them is to indicate the resulting options for each specific inspection item. Figure 1 shows an example of a nephrology examination report, with the content separated according to the paragraphs. The strategies proposed in this paper automatically extract the result of each examination item, as shown in Table 1.
Fig. 1.
Example of a nephrology examination report
Table 1.
Example of a structuralized examination report
| Diagnosis |
Procedure: echo-guided percutaneous needle core biopsy Primary diagnosis: moderate arteriolosclerosis Additional features: 1. Acute tubular necrosis 2. Acute glomerular ischemia 3. Acute vascular insult |
| Specimen type |
Specimen: 2 tissue cores Size: 2.0 × 0.1 × 0.1 cm □ Fixed in formalin ■ fresh |
| Gross description |
Gross looking (Grossly): whitish gray and soft Amount: more than 25 glomeruli |
| LM |
Sclerotic lesion: ■ Yes □ No Segmental glomerulosclerosis: □ Absent (S0) ■ Presence (S1) Mesangial: ■ Mesangial hypercellularity ■ matrix expansion □ mesangiolysis Degree of Mesangial hypercellularity ■ Present in ≤ 50% (M0) □ Present in > 50% (M1) Capillary: delicate and soft capillary walls Endocapillary hypercellularity: ■ Absent (E0) □ Presence (E1) Crescent formation: ■ Yes □ No Subendothelial deposit: □ Yes ■ No Subepithelial deposit: □ Yes ■ No |
| DIF |
Staining pattern: □granular □ linear Location: □focal □diffuse □segmental □global □mesangial □glomerular capillary wall IgA deposition/expression ■ absent □ present IgG deposition/expression ■ absent □ present IgM deposition/expression ■ absent □ present C3 deposition/expression ■ absent □ present C4 deposition/expression ■ absent □ present C1q deposition/expression ■ absent □ present C4d deposition/expression □ absent □ present Fibrinogen insignificant ■ absent □ present |
| Summary |
1. Moderate arteriolosclerosis Additional features 1. Acute tubular necrosis 2. Acute glomerular ischemia (due to acute vascular insult) |
For example, from the paragraph of Light Microscope (LM) shown in Fig. 1, the following pathological features are observed: ‘sclerotic lesion’, ‘segmental glomerulosclerosis’, ‘mesangial hypercellularity’, ‘matrix expansion’, ‘degree of mesangial hypercellularity ≤ 50%’, ‘delicate and soft capillary walls’, and ‘crescent formation’. Therefore, in the result shown in Table 1, the corresponding items of LM are marked by the observed feature names or the option ‘Yes’/‘Presence.’ If negative words appear with the pathological feature words, the option ‘No’/‘Absent’ is marked. The options will be represented as attribute values of the examination report for storing in the database.
To achieve the above-mentioned goal, extracting topical keywords from each paragraph is a critical step. However, a supervised keyword extraction method is costly because it requires professional medical staff to annotate all the keywords in the examination reports in advance. Therefore, our study aims to design an unsupervised approach for keyword extraction.
The data set used in this study consists of 476 examination reports of anonymized patients from the Department of Nephrology of China Medical University Hospital. All the reports are written in English. The contributions of this paper are as follows:
We design a keyword phrase extraction method, which can be used to construct a vocabulary dictionary from the clinical examination reports automatically.
A strategy is proposed to automatically discover the keyword terms which represent the examination items from the examination reports by an unsupervised approach.
According to the constructed vocabulary dictionary, a system prototype is constructed to organize the results of the various examination items in structured form.
The proposed system consists of two parts: offline training and online processing. The offline processing is shown in Fig. 2: the reports are first input in batches to a natural language processing tool to obtain their part-of-speech tags. Then a medical dictionary is built for each paragraph of the report. In addition, another module automatically discovers the keyword terms of the examination items by analyzing the topical terms in each paragraph of an examination report. The online processing is shown in Fig. 3: an examination report is converted into a structured report based on the dictionary and the keyword phrases extracted in the offline training.
Fig. 2.
Flowchart of the offline processing
Fig. 3.
Flowchart of the online processing
First, each examination report is segmented into paragraphs. According to the established vocabulary dictionary, continuous words in the paragraph contents are combined into keyword phrases according to their part-of-speech tags. Words with typos are then corrected using similar keywords with higher frequencies. To extract the key terms, each paragraph in a report is assumed to describe a specific topic of the examination. Therefore, the LDA statistical generative model is used to discover the representative topic words of each paragraph as candidate keywords of the examination items. By computing the entropy of the topic words over the various paragraphs of the reports, the general terms and the specific terms of the examination items can be distinguished. The specific item terms extracted in this way are used as binary attributes to represent the examination result in a structured form.
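The entropy criterion for separating general terms from specific examination-item terms can be sketched as follows. This is a minimal illustration, not the system's implementation; the per-paragraph frequency counts and the example terms are hypothetical.

```python
import math

def term_entropy(counts):
    """Shannon entropy of a term's frequency distribution across paragraphs.

    A term spread evenly over all paragraphs (a general term) has high
    entropy; a term concentrated in one paragraph (a specific examination
    item) has entropy near zero.
    """
    total = sum(counts)
    probs = [c / total for c in counts if c > 0]
    return -sum(p * math.log2(p) for p in probs)

# Hypothetical frequencies of two candidate terms over four paragraphs.
general = [10, 9, 11, 10]   # spread evenly, e.g. a generic word
specific = [0, 0, 38, 2]    # concentrated, e.g. a term mostly in the LM paragraph

# term_entropy(general) is close to log2(4) = 2; term_entropy(specific) is near 0,
# so a low-entropy threshold keeps the specific examination-item terms.
```

A fixed or data-driven entropy threshold can then split the candidate vocabulary into general terms (discarded) and specific item terms (kept as binary attributes).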
The rest of this paper is organized as follows: Sect. 2 describes related work, and Sect. 3 explains the method for constructing the medical vocabulary dictionary. Section 4 introduces the strategies for discovering the keyword terms of the examination items, which are used to show the examination results in a structured form. Section 5 provides the evaluation results of the proposed system. Finally, Sect. 6 concludes this paper and discusses future work.
Related works
With the advance of information technology, medical records can be collected and maintained easily. Therefore, the clinical data of many patients, such as medical diagnoses, medications, and examination reports, are stored in databases in electronic form. How to properly apply data mining strategies to find association information in medical records, which can serve as a decision-making reference, has become an attractive research direction in recent years.
Unstructured clinical notes contain a wealth of information about each patient, but extracting it is difficult because of their lack of structure. Deriving patient characteristics from clinical notes is therefore a computationally challenging task that requires sophisticated NLP tools and techniques such as cTAKES [20] and MetaMap [2].
Taira et al. [23] aimed to automatically structure the important medical information contained in free-text radiology documents, including the existence, properties, location, and diagnostic interpretation of findings. Li et al. [13] studied how to automatically analyze and map digestive endoscopic narrative records to structured minimal standard terminology (MST) records. The goals of these two works are very similar to ours. However, in Taira et al. [23], the words and phrases were labeled by looking up manually developed medical lexicons. The strategies proposed by Li et al. [13] also depend on the MST specialist lexicon from a knowledge base. In contrast, our approach does not need external lexicons or knowledge bases.
Information extraction (IE) has been exploited in several clinical research domains. Wang et al. [25] showed that the most commonly used clinical IE approaches are rule-based and machine-learning-based. A common form of rule-based method relies on regular expressions, where many of the search patterns are defined and written manually. Savova et al. [19] used regular expressions to identify peripheral arterial disease (PAD); a positive PAD was extracted when the predefined patterns were matched.
Machine-learning-based methods have gained attention because of their effectiveness. Nandhakumar et al. [16] used word-level and sentence-level features and applied a Conditional Random Field (CRF) model [10] to extract the clinically significant parts of radiology reports. The reports are then classified into critical and non-critical categories, which helps physicians identify high-priority reports that need urgent treatment. Jo et al. [9] proposed a mortality prediction method for patients in intensive care units in order to support appropriate decisions. The authors believe that the nursing notes within a recent time period contain hidden clues about the physical condition of a patient, which are useful for deciding the priority of handling matters. A state transition topic model is established to capture the semantic information in a nursing note. Then the n-grams, standard topics, state-aware topics, states, etc. are used as extracted features for a cost-sensitive SVM classifier to perform mortality prediction. Ghassemi et al. [6] and Lehman et al. [12] also considered mortality prediction from nursing notes. The former [6] used LDA (Latent Dirichlet Allocation) [4] to decompose free-text notes into meaningful features, after which an SVM classifier is used to predict mortality. The latter [12] used the topic model distribution as features and then applied a logistic regression method to predict the mortality probability.
Goodwin et al. [7] used electronic medical records (EMRs) to automatically construct a medical knowledge graph in the hope of improving treatment decision-making for patients. For a given medical question, the answer is first discovered, and then the scientific articles containing the answer are selected and ranked. The proposed method models the information in the EMRs with a Markov network to create the medical knowledge graph, and the probability of each connected edge in the graph structure is computed. Finally, the probability values of possible answers are inferred from this probabilistic knowledge graph, which is automatically generated by processing the unstructured text in a large collection of EMRs.
Leaman et al. [11] argued that medication feedback can be extracted from patients' posts on social media sites, which is useful for learning possible drug reactions from patient responses. The studies in [5, 14] aimed to find the status and factors of adverse drug reactions from Internet users' discussions about drug usage. The method proposed in [5] identifies the grammatical structure between drugs, symptoms, and diseases in sentences from drug discussions on Internet forums. An unsupervised relation extraction method is then used to discover domain-specific relation patterns, and a post-processing algorithm merges missing or incomplete sentences into complete ones. Finally, from the extracted sentences, the lift measure is applied to evaluate the correlation between a drug and a symptom.
Balaneshin-kordan et al. [3] considered it a complicated problem to determine what diseases a patient has based only on a textual description of the patient's symptoms. The reason is that the same symptoms may occur in many diseases, which makes it difficult to decide on a patient's disease from only part of the symptoms; this property also makes it hard to search for similar cases. The problems described above show the demand for the research studied in this paper. Given a structured form of textual examination reports, a database of diseases and clinical features can be established from the extracted structured content. Doctors can then query diseases using different clinical features as conditions and further analyze the correlation between diseases and the results of examination items.
Keyword extraction is a critical technology for obtaining semantic information from examination reports; it aims to find the keywords that represent the main points of a report, which are then used for subsequent analysis. Rong et al. [18] considered the problem of entity set expansion, whose purpose is to find a set of entities belonging to the same semantic class given one or a few seed examples. For example, the entity 'apple' is a kind of fruit, so entity set expansion would find entities such as 'banana' and 'orange', which are also fruits. To extract sibling relations from text, one important feature is the skip-gram, because it provides positional constraints on the contextual words with respect to the target term. Teneva et al. [24] considered the problem of unsupervised key phrase extraction and proposed the Salience Rank algorithm, a modification of the Topical PageRank algorithm. Both algorithms first apply Latent Dirichlet Allocation (LDA) to rank the noun phrases extracted from the documents. Inspired by these works, we applied the LDA topic modeling method to extract candidate keywords from each paragraph of the examination items.
Regarding spell-checking techniques, [8] discussed the various types of spelling errors and illustrated error detection and correction strategies for Indian languages. Amorim and Zampieri [1] proposed an unsupervised spell-checking method based on a clustering algorithm.
Method of medical dictionary construction
In this section, we describe the pre-processing of the examination reports and the construction of a dictionary of the medical vocabulary used in the reports.
Pre-processing of examination reports
In order to establish a dictionary of medical vocabulary, the whole corpus of the examination reports is pre-processed, which includes paragraph segmentation and part-of-speech tagging.
Paragraph segmentation
As described in the introduction, a nephrology medical examination report has the structure of eight paragraphs: (1) diagnosis, (2) EM, (3) comment/narrative, (4) specimen type, (5) gross description, (6) LM, (7) DIF, and (8) summary.
Because the text descriptions of the paragraphs are distinct and the word frequencies within them also differ, we establish a separate dictionary for each paragraph. To segment the paragraphs automatically, we observed that distinct topic words/phrases often appear at the beginning of each paragraph. For example, as shown in Table 2, the EM paragraph often starts with the key phrase 'The EM examination' and the LM paragraph starts with the keyword 'Microscopically.' Therefore, paragraph segmentation is performed by sequentially looking up a table of beginning topic words/phrases for each paragraph. This method achieves 99.6% segmentation accuracy on the 476 examination reports. Table 3 shows the result after a report is segmented into paragraphs, which is then further processed by part-of-speech tagging.
Table 2.
Example of an examination report
| Addendum on 2011-12-07 | |
|
1. Kidney, left, echo-guided percutaneous needle core biopsy, focal segmental glomerulosclerosis (4/26) with mild tubular atrophy (up to 5% to 7% in area) 2. The EM examination pathologic diagnosis: Electron microscopic study: 2 Glomeruli were examined ultrastructurally, which show no mesangial expansion, cellular proliferation or electron dense deposition. The glomerular basement membrane (GBM) show no remarkable change. Diffuse effacement of the podocytes foot processes also present Comment: The EM findings most consistent with minimal change disease/primary focal segmental glomerulosclerosis. Further clinical correlation is needed The submitted specimen consists of 2 tissue cores measuring up to 1.5 × 0.1 × 0.1 cm. in size in fresh state Grossly, they are whitish gray and soft. More than 10 glomeruli are visible under dissecting microscope All for sections and prepared for routine serial H&E, PAS/CSM, DIF, and EM studies. Jar 0 Microscopically, the section of renal biopsy contains three completely obsolescent and another 26 non-obsolescent glomeruli revealing minimal glomerular change, except four loci of focal segmental glomerulosclerosis (4/26, tip regions), with minimal mesangiopathy, indistinct intraglomerular leukocyte infiltration, thin and soft glomerular capillary walls, and no definite crescent formation noted. The tubulointerstitial compartment shows patchy foamy change of tubular epithelium, mild interstitial edema, minimal to focally mild interstitial chronic inflammatory infiltrates, areas of tubular atrophy (up to 5% to 7% in area), indistinct lymphocytic tubulitis, and inconspicuous interstitial fibrosis. The vascular compartment is unremarkable. The PAS and CSM stains delineate foci of focal segmental glomerulslerosis and tubular atrophy, otherwise nothing particular, without significant subendothelial/subepithelial deposit, visceral epithelial proliferation, nor spike formation. 
The DIF study demonstrates no significant immunodeposition of IgG, IgM, IgA, C3, C1q, C4, or fibrinogen. According to the above features, focal segmental glomerulosclerosis (4/26) with mild tubular atrophy (up to 5% to 7% in area) in the background of minimal glomerular change is firstly considered |
Table 3.
Example of the examination report after performing paragraph segmentation
| Main diagnosis (Diagnosis) | Kidney, left, echo-guided percutaneous needle core biopsy, focal segmental glomerulosclerosis (4/26) with mild tubular atrophy (up to 5% to 7% in area) |
| Electron microscopy examination (EM) | Electron microscopic study: 2 glomeruli were examined ultrastructurally, which show no mesangial expansion, cellular proliferation or electron dense deposition. The glomerular basement membrane (GBM) show no remarkable change. Diffuse effacement of the podocytes foot processes also present |
| Examination status of electron microscope (comment) | Comment: The EM findings most consistent with minimal change disease/primary focal segmental glomerulosclerosis. Further clinical correlation is needed. |
| The size and condition of sliced specimen (specimen) | The submitted specimen consists of 2 tissue cores measuring up to 1.5 × 0.1 × 0.1 cm. in size in fresh state |
| Description of spliced specimen (gross) | Grossly, they are whitish gray and soft. More than 10 glomeruli are visible under dissecting microscope |
| Examination of light microscope (LM) | Microscopically, the section of renal biopsy contains three completely obsolescent and another 26 non-obsolescent glomeruli revealing minimal glomerular change, except four loci of focal segmental glomerulosclerosis (4/26, tip regions), with minimal mesangiopathy, indistinct intraglomerular leukocyte infiltration, thin and soft glomerular capillary walls, and no definite crescent formation noted. The tubulointerstitial compartment shows patchy foamy change of tubular epithelium, mild interstitial edema, minimal to focally mild interstitial chronic inflammatory infiltrates, areas of tubular atrophy (up to 5% to 7% in area), indistinct lymphocytic tubulitis, and inconspicuous interstitial fibrosis. The vascular compartment is unremarkable. The PAS and CSM stains delineate foci of focal segmental glomerulslerosis and tubular atrophy, otherwise nothing particular, without significant subendothelial/subepithelial deposit, visceral epithelial proliferation, nor spike formation |
| Chromosome examination (DIF) | The DIF study demonstrates no significant immunodeposition of IgG, IgM, IgA, C3, C1q, C4, or fibrinogen |
| Report conclusion (summary) | According to the above features, focal segmental glomerulosclerosis (4/26) with mild tubular atrophy (up to 5% to 7% in area) in the background of minimal glomerular change is firstly considered |
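The lookup-based segmentation can be sketched as follows. This is a minimal sketch: the opener table below is illustrative, built from the example report above (e.g. 'The EM examination' opens the EM paragraph, 'Microscopically' opens the LM paragraph); the paper derives its own table from the corpus.

```python
# Illustrative opener-phrase table: (opening phrase, paragraph label).
# Phrases are searched sequentially, so the report order is preserved.
OPENERS = [
    ("The EM examination", "EM"),
    ("Comment:", "comment"),
    ("The submitted specimen", "specimen"),
    ("Grossly", "gross"),
    ("Microscopically", "LM"),
    ("The DIF study", "DIF"),
    ("According to the above features", "summary"),
]

def segment(report):
    """Split a report into (label, text) pairs by sequential opener lookup.

    Everything before the first matched opener is the diagnosis paragraph;
    openers that do not occur in the report are simply skipped.
    """
    cuts = [(0, "diagnosis")]
    pos = 0
    for phrase, label in OPENERS:
        idx = report.find(phrase, pos)
        if idx >= 0:
            cuts.append((idx, label))
            pos = idx + len(phrase)
    cuts.append((len(report), None))
    return [(label, report[start:end].strip())
            for (start, label), (end, _) in zip(cuts, cuts[1:])]

sample = ("Kidney, left, echo-guided biopsy. The EM examination shows no deposition. "
          "Microscopically, the section contains 26 glomeruli. "
          "The DIF study demonstrates no immunodeposition.")
paragraphs = segment(sample)
```

On the sample above, the missing paragraphs (comment, specimen, gross, summary) are skipped and the remaining text is split into diagnosis, EM, LM, and DIF parts.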
Part-of-speech tagging
We used the Stanford CoreNLP API, a natural language processing tool (https://stanfordnlp.github.io/CoreNLP) developed by Stanford University's Natural Language Processing Group [22], to perform POS tagging. Figure 4 shows the result of part-of-speech tagging on a sentence from a Diagnosis paragraph, which labels each word in the sentence with tags such as NN (noun), VBD (past-tense verb), JJ (adjective), and CD (cardinal number). According to the part-of-speech tags, words tagged NNS (plural noun), VBG (gerund), NNP (proper noun), etc. are classified as nouns, while words tagged VBN (past participle), VBD (past-tense verb), JJR (comparative adjective), etc. are classified as adjectives.
Fig. 4.
Example of the Part-of-Speech tagging result
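The coarse grouping of Penn Treebank tags into noun-like and adjective-like classes can be written directly as a small lookup. The tag sets here follow the examples given in the text; tags not listed there are assumed to fall outside both classes.

```python
# Penn Treebank tags grouped into the two coarse classes used for
# dictionary construction: noun-like tags and adjective-like tags.
NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS", "VBG"}
ADJECTIVE_TAGS = {"JJ", "JJR", "JJS", "VBN", "VBD"}

def coarse_class(penn_tag):
    """Map a Penn Treebank POS tag to 'noun', 'adjective', or 'other'."""
    if penn_tag in NOUN_TAGS:
        return "noun"
    if penn_tag in ADJECTIVE_TAGS:
        return "adjective"
    return "other"
```

The coarse class, rather than the raw tag, is what the pattern rules of the next section operate on.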
Medical dictionary construction
After the corpus of examination reports is pre-processed, the next step is to construct the medical dictionary for the different paragraphs of a clinical examination report. The construction of the dictionary consists of the following three steps.
Step 1
First, the pre-processing results of the examination reports are collected, as shown in Table 4. Next, continuous words appearing in the sentences are combined into vocabulary terms according to specific pattern rules, as shown in Table 5. The extracted vocabulary terms are then organized into the following two dictionaries:
Adjective vocabulary dictionary: includes the adjectives and the compound adjectives (continuous adjectives).
Proper noun vocabulary dictionary: includes the nouns and the compound nouns. The compound nouns are further divided into (1) a combination of nouns, which uses the last noun as the base word, and (2) a compound noun with an adjective as its prefix word.
Table 4.
Example of the contents collected from a Diagnosis paragraph
| Main diagnosis (diagnosis) | |
|---|---|
| Report No. 201101639 | |
| Report No. 201102475 | |
| Report No. 201103835 | |
Table 5.
Example of the extracted vocabulary terms by matching the pattern rules
| Adjective vocabulary dictionary | |
|---|---|
| Vocabulary | Frequency |
| Immune complex proliferative | 3 |
| Large sized subendothelial | 1 |
| Splitting | 1 |

| Proper noun vocabulary dictionary | |
|---|---|
| Vocabulary | Frequency |
| Interstitium show fibrosis change | 1 |
| Foot processes effacement | 314 |
| Foot process effacement | 7 |
| Extensive foot processes effacement | 66 |
| Partial foot processes effacement | 62 |
| Mesangium | 68 |
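A simplified sketch of the combination rules in Step 1, assuming each token has already been mapped to the coarse noun/adjective classes; the paper's actual pattern rules (Table 5) operate on the full Penn Treebank tags.

```python
def extract_terms(tagged):
    """Combine adjective runs and (optionally adjective-prefixed) noun runs
    into candidate vocabulary terms.

    `tagged` is a list of (word, coarse_class) pairs, with class
    'noun', 'adjective', or 'other'. Returns (adjective_terms, noun_terms).
    """
    adjectives, nouns = [], []
    i, n = 0, len(tagged)
    while i < n:
        if tagged[i][1] == "other":
            i += 1
            continue
        # Collect an optional run of adjectives.
        j = i
        while j < n and tagged[j][1] == "adjective":
            j += 1
        # Collect a following run of nouns.
        k = j
        while k < n and tagged[k][1] == "noun":
            k += 1
        if k > j:    # noun run present: compound noun, possibly adjective-prefixed
            nouns.append(" ".join(w for w, _ in tagged[i:k]))
        elif j > i:  # pure adjective run: (compound) adjective
            adjectives.append(" ".join(w for w, _ in tagged[i:j]))
        i = max(k, j, i + 1)
    return adjectives, nouns

adjs, noun_terms = extract_terms([
    ("partial", "adjective"), ("foot", "noun"), ("processes", "noun"),
    ("effacement", "noun"), ("are", "other"), ("soft", "adjective"),
])
```

In this example the adjective-prefixed noun run yields the compound noun 'partial foot processes effacement' (base word 'effacement'), and the trailing adjective 'soft' goes into the adjective dictionary.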
Step 2 In this step, the phrases extracted in Step 1 are refined to filter out semantically meaningless words according to the following rules: (1) noun phrases whose ending words are medically meaningless are removed, and (2) medically meaningless adjectives at the beginning of the phrases are removed. Some examples of the meaningless ending words and adjectives are shown in Table 6. These meaningless words are specified manually.
Table 6.
Example of deleting base words and starting words
| Meaningless ending words | Example of deletion |
|---|---|
| History | Clinical history → (phrase removed) |
| Management | Further management → (phrase removed) |
| Finding | EM finding → (phrase removed) |

| Meaningless starting words | Example of deletion |
|---|---|
| Including | Including IgA nephropathy → IgA nephropathy |
| Only | Only few faint mesangial deposits → few faint mesangial deposits |
| Otherwise | Otherwise minimally changed glomeruli → minimally changed glomeruli |
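The two filtering rules of Step 2 can be sketched as below. The stop sets here are only the examples from Table 6; the paper specifies the full lists manually.

```python
# Illustrative stop sets taken from the examples in Table 6.
MEANINGLESS_ENDINGS = {"history", "management", "finding"}
MEANINGLESS_STARTS = {"including", "only", "otherwise"}

def refine(phrase):
    """Apply the Step 2 rules to one extracted phrase.

    Rule (1): a phrase whose ending word is meaningless is removed entirely
    (returns None). Rule (2): meaningless leading words are stripped.
    """
    words = phrase.lower().split()
    if words and words[-1] in MEANINGLESS_ENDINGS:
        return None              # rule (1): drop the whole phrase
    while words and words[0] in MEANINGLESS_STARTS:
        words = words[1:]        # rule (2): strip the meaningless prefix
    return " ".join(words) or None
```

Note the sketch lowercases phrases for matching, so 'Including IgA nephropathy' becomes 'iga nephropathy'; a production version would preserve the original casing.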
Step 3 Typos in the clinical reports usually arise from additional or missing characters. Qin et al. [15] stated that the Longest Common Subsequence (LCS) distance is a good and fast approach for typo detection. Accordingly, in order to efficiently filter out the typos in the reports, we apply the LCS algorithm [17] to estimate the similarity between extracted phrases. If two words are similar to each other, the one with the lower frequency is replaced by the other.
Step 3-1: Filter out the typos based on ending words. Let bi and bj denote two different base words, i.e., the last words of the extracted phrases. Because the additional or missing characters of a typo usually occur in the middle or at the end of a word, the initial characters of bi and bj should be the same. Let max_len(bi, bj) denote the maximum of the lengths of bi and bj, and LCS(bi, bj) denote the length of their longest common subsequence. The typing error between bi and bj, denoted ErrBaseW(bi, bj), is computed as max_len(bi, bj) − LCS(bi, bj), as shown in Eq. (1).
ErrBaseW(bi, bj) = max_len(bi, bj) − LCS(bi, bj)  (1)
Next, max_len(bi, bj) is multiplied by 1/d and rounded up to obtain a threshold value for the typing error between bi and bj, denoted ComBaseT(bi, bj), as shown in Eq. (2).
ComBaseT(bi, bj) = ⌈max_len(bi, bj) / d⌉  (2)
When the typing error is less than or equal to the threshold ComBaseT(bi, bj), bi and bj are considered to differ by a typing error and ComBaseF(bi, bj) is set to 1; otherwise ComBaseF(bi, bj) is set to 0, as shown in Eq. (3). Let Bi and Bj denote the sets of phrases whose base words are bi and bj, respectively, and let F(bi) and F(bj) denote the frequencies of bi and bj. If ComBaseF(bi, bj) is 1 and F(bi) > F(bj), all the base words in Bj are modified into bi, and the phrases in Bj are merged with those in Bi.
ComBaseF(bi, bj) = 1 if ErrBaseW(bi, bj) ≤ ComBaseT(bi, bj), and 0 otherwise  (3)
Example 3-1 Assume there are two sets of phrases with different base words, as shown in Table 7. First, the base words with the initial letter 'g' are checked in pairs. Next, Eq. (1) is used to compute the typing error between the base words 'glomerulonephritis' and 'glomerulonephritiss': ErrBaseW('glomerulonephritis', 'glomerulonephritiss') is 1, and the merging threshold ComBaseT('glomerulonephritis', 'glomerulonephritiss') is 4 when d is set to 5. According to Eq. (3), the typing error 1 is acceptable, so the two base words are unified into the one with the higher frequency. Because F('glomerulonephritis') > F('glomerulonephritiss'), the phrases in the set with base word 'glomerulonephritiss' are modified to have the base word 'glomerulonephritis'.
Table 7.
Example of combining a set of different base words
| Base word | Frequency | Sets phrases with the same base words | Frequency |
|---|---|---|---|
| Before merging | |||
| Glomerulonephritis | 150 | Glomerulonephritis | 90 |
| Membranous glomerulonephritis | 35 | ||
| Lupus glomerulonephritis | 25 | ||
| Glomerulonephritiss | 36 | Glomerulonephritiss | 20 |
| Lupus glomerulonephritiss | 8 | ||
| Membranous glomerulonephritiss | 7 | ||
| Focal glomerulonephritiss | 1 | ||
| Base word | Frequency | Sets of same ending words | Frequency |
|---|---|---|---|
| After merging | |||
| Glomerulonephritis | 186 | Glomerulonephritis | 110 |
| Membranous glomerulonephritis | 42 | ||
| Lupus glomerulonephritis | 33 | ||
| Focal glomerulonephritis | 1 | ||
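The base-word merging criterion of Example 3-1 can be sketched in Python. This is a minimal sketch under two assumptions: the typing error of Eq. (1), which is defined outside this section, is taken to be the number of characters of the longer word not covered by the longest common subsequence (consistent with the LCS matching of Sect. 3.2), and the threshold is rounded up.

```python
from math import ceil

def lcs_len(a: str, b: str) -> int:
    # Dynamic-programming longest common subsequence (LCS) length.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, ca in enumerate(a, 1):
        for j, cb in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if ca == cb else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def err_base_w(bi: str, bj: str) -> int:
    # Assumed typing error: characters of the longer base word not covered by the LCS.
    return max(len(bi), len(bj)) - lcs_len(bi, bj)

def com_base_t(bi: str, bj: str, d: int = 5) -> int:
    # Eq. (2): the longer length scaled by 1/d, rounded up.
    return ceil(max(len(bi), len(bj)) / d)

def com_base_f(bi: str, bj: str, d: int = 5) -> int:
    # Eq. (3): 1 when the typing error is within the threshold, else 0.
    return int(err_base_w(bi, bj) <= com_base_t(bi, bj, d))
```

For the pair in Example 3-1, the error is 1 and the threshold is 4, so the two base words are merged.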
Step 3-2 Perform a filtering operation on a set Bi consisting of the phrases with the same base word.
Let Pi and Pj denote two phrases in the set Bi, which have the same base word. Let Pi.w1, …, Pi.wn denote the sequence of words in Pi and Pj.w1, …, Pj.wm denote the sequence of words in Pj. If both Pi and Pj consist of the same number of words, the typing errors between these two phrases are counted; otherwise, Pi and Pj are considered to be different phrases, as shown in Eq. (4).
| Pi and Pj are compared word by word if n = m; otherwise Pi and Pj are different phrases | (4) |
Equation (5) is used to compute typing errors between Pi.wk and Pj.wk for k = 1 to n. Equation 6 is used to compute the threshold value of typing errors for w1 to wn.
| ErrContentW(Pi.wk, Pj.wk) = max_len(Pi.wk, Pj.wk) − LCS(Pi.wk, Pj.wk) | (5) |
| ComContentT(Pi.wk, Pj.wk) = ⌈max_len(Pi.wk, Pj.wk) × (1/d)⌉ | (6) |
If the typing error of any word wk in Pi and Pj is larger than the threshold ComContentT(Pi.wk, Pj.wk), Pi and Pj are considered to be two different phrases, as shown in Eq. (7). Otherwise, if the frequency of Pi is larger than that of Pj, i.e. F(Pi) > F(Pj), Pj is considered a typo of Pi and thus modified to be Pi.
| ComContentF(Pi, Pj) = 1 if ErrContentW(Pi.wk, Pj.wk) ≤ ComContentT(Pi.wk, Pj.wk) for all k = 1, …, n; 0 otherwise | (7) |
Example 3-2 Suppose that a set of phrases with the same base word is shown in Table 8. Let Pi and Pj denote ‘mildd tubular atrophy’ and ‘mild tubular atrophy’ in the set, respectively. At first, Eq. (4) is used to determine whether the two phrases consist of the same number of words. Then, for each word position in the two phrases, the typing error is compared with the error threshold, where ErrContentW(‘tubular’, ‘tubular’) = 0 and ErrContentW(‘mildd’, ‘mild’) = 1, and the threshold values are ComContentT(‘tubular’, ‘tubular’) = 2 and ComContentT(‘mildd’, ‘mild’) = 1. Because every word pair in the two phrases passes the typing-error check, these two phrases are combined. Because F(‘mild tubular atrophy’) > F(‘mildd tubular atrophy’), the phrase ‘mildd tubular atrophy’ is modified to be ‘mild tubular atrophy’. Accordingly, the frequency count of ‘mild tubular atrophy’ is updated to 45.
Table 8.
Example of combining vocabularies with the same base word
| Vocabularies with the same base words | Frequency of appearance |
|---|---|
| Before merging | |
| Atrophy | 130 |
| Mild tubular atrophy | 40 |
| Focal tubular atrophy | 30 |
| Evident tubular atrophy | 20 |
| IgA atrophy | 10 |
| Mildd tubular atrophy | 5 |
| Focal tubularrr atrophy | 2 |
| IgAA atrophy | 1 |
| After merging | |
| Atrophy | 130 |
| Mild tubular atrophy | 45 |
| Focal tubular atrophy | 32 |
| Evident tubular atrophy | 20 |
| IgA atrophy | 11 |
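Step 3-2 on the data of Table 8 can be sketched end to end under the same assumptions as before (LCS-based word error, ceiling threshold, d = 5); phrases are folded into the spelling with the higher frequency:

```python
from math import ceil

def lcs_len(a: str, b: str) -> int:
    # Dynamic-programming longest common subsequence length.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, ca in enumerate(a, 1):
        for j, cb in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if ca == cb else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def words_match(a: str, b: str, d: int = 5) -> bool:
    # Eqs. (5)-(6): per-word typing error within the ceiling threshold.
    err = max(len(a), len(b)) - lcs_len(a, b)
    return err <= ceil(max(len(a), len(b)) / d)

def merge_phrases(freq: dict, d: int = 5) -> dict:
    # freq maps phrase -> frequency; all phrases share the same base word.
    merged = {}
    for p in sorted(freq, key=freq.get, reverse=True):  # frequent spelling wins
        wp = p.split()
        target = p
        for q in merged:
            wq = q.split()
            # Eq. (4): phrases with different word counts are different phrases.
            if len(wp) == len(wq) and all(words_match(a, b, d) for a, b in zip(wp, wq)):
                target = q
                break
        merged[target] = merged.get(target, 0) + freq[p]
    return merged
```

Running this on the "before merging" column of Table 8 reproduces the "after merging" frequencies (45, 32, 11).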
Usage of medical dictionary
This section introduces how to use the medical vocabulary dictionary to perform structuralization for the medical reports. The following two subsections will introduce the overall processing of structuralization and how to automatically extract the keywords for each special inspection item.
Methods of structuralization
Based on the examination items given by the physicians, the structuralized processing proposed in this paper focuses on two kinds of contents: (1) the summary content in the report, and (2) the results of special inspection items. The summary content includes three paragraphs in a report: the diagnosis, the comment/narrative, and the summary paragraphs. The results of special inspection items consist of five paragraphs: the paragraphs of EM, specimen type, gross description, LM, and DIF.
Match dictionary to extract keyword list
Each paragraph with the summary content in an examination report is inputted into the structuralization module. Entity extraction is performed by matching the words in a paragraph with the vocabularies in the dictionary according to the pre-defined matching priorities. Then the phrases which represent entity concepts are extracted from the examination report to generate the examination report's keyword phrases. The matching priorities are as follows:
The vocabularies in the dictionary with length less than 2 are not matched.
The compound vocabularies consisting of more words have higher priorities to be matched.
If the vocabularies have the same number of words, the vocabularies with higher frequency have higher priorities to be matched.
Finally, the negative word list, as shown in Table 9, is matched.
Table 9.
Example of combining the negative words with their following words
| Negative word list | |
| No, neither, nor, without, negative | |
| Example | |
| Sentence | No significant immunodepostion of IgG, IgM, IgA, C3 |
| Corresponding keywords |
Keyword1: no IgG Keyword2: no IgM Keyword3: no IgA Keyword4: no C3 |
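The matching priorities and the negative-word handling of Table 9 can be sketched as follows. This is a simplified, hypothetical matcher: the tokenizer and the rule that a negative word scopes over all keywords following it in the sentence are illustrative assumptions.

```python
import re

NEGATIVES = {'no', 'neither', 'nor', 'without', 'negative'}

def extract_keywords(sentence: str, dictionary: dict) -> list:
    # dictionary maps a vocabulary entry to its corpus frequency.
    # Priorities: entries shorter than 2 characters are skipped; entries with
    # more words are matched first; ties are broken by higher frequency.
    entries = sorted(
        (v for v in dictionary if len(v) >= 2),
        key=lambda v: (len(v.split()), dictionary[v]),
        reverse=True,
    )
    tokens = re.findall(r"[\w/%-]+", sentence.lower())
    matched = [None] * len(tokens)
    for entry in entries:
        ew = entry.split()
        n = len(ew)
        for i in range(len(tokens) - n + 1):
            if all(m is None for m in matched[i:i + n]) and tokens[i:i + n] == ew:
                matched[i] = entry
                for k in range(i + 1, i + n):
                    matched[k] = ''      # token consumed by a longer entry
    keywords, neg = [], None
    for tok, m in zip(tokens, matched):
        if tok in NEGATIVES:
            neg = tok                    # negative word attaches to what follows
        elif m:
            keywords.append(f'{neg} {m}' if neg else m)
    return keywords
```

On the sentence of Table 9 this yields the four negated keywords shown there.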
After completing the matching, the extracted phrases are combined sequentially according to some syntactic rules to generate a narrative short sentence with sufficient semantics. The syntactic rules for combination and the corresponding examples are shown in Table 10. When a negative word appears, it is combined with its following keyword. After completing the above processing, a list KP of keyword phrases for the paragraph of the summary content is created.
Table 10.
Example of combining special words
| Special words | Example of combination |
|---|---|
| Numeral (measure word) |
Keyword1: stage Keyword2: 3 combine into: stage 3 |
| Words inside bracket |
Keyword1: stage 3 Keyword2: (ins/rps class 5) combine into: stage 3 (ins/rps class 5) |
| Stage, class, grade, type |
Keyword1: membranous lupus glomerulonephritis Keyword2: stage 3 (ins/rps class 5) combine into: membranous lupus glomerulonephritis ( stage 3(ins/rps class 5)) |
| Show |
Keyword1: glomerular change show Keyword2: sclerosing change combine into: glomerular change show sclerosing change |
| For |
Keyword1: poor quality Keyword2: for Keyword3: sclerosing change combine into: poor quality for sclerosing change |
| Of |
Keyword1: thrombotic microangiopathy change Keyword2: of Keyword3: glomerulus combine into: thrombotic microangiopathy change of glomerulus |
| % |
Keyword1: up to 60 Keyword2: % combine into: up to 60% |
| JJ to JJ |
Keyword1: mild Keyword2: to Keyword3: moderate combine into: mild to moderate |
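A partial rule-based combiner illustrating a few of the rules in Table 10 (numerals after ‘stage/class/grade/type’, bracketed keywords, ‘%’, and the connectives ‘of’, ‘for’, ‘show’, ‘to’); the remaining rules are omitted, so this is a sketch rather than the full combination procedure:

```python
GLUE = {'of', 'for', 'show', 'to'}           # connectives joined with both neighbours
STAGE = {'stage', 'class', 'grade', 'type'}  # words followed by a numeral

def combine(keywords: list) -> list:
    out = []
    for kw in keywords:
        if out and (
            kw == '%'                              # "up to 60" + "%" -> "up to 60%"
            or kw.startswith('(')                  # bracketed keyword attaches left
            or kw in GLUE                          # connective attaches left
            or out[-1].split()[-1] in GLUE | STAGE # previous keyword expects a follower
        ):
            sep = '' if kw == '%' else ' '
            out[-1] += sep + kw
        else:
            out.append(kw)
    return out
```

For instance, `combine(['stage 3', '(ins/rps class 5)'])` merges the bracketed keyword into `'stage 3 (ins/rps class 5)'` as in Table 10.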
Categorize the summary content
The keyword phrases in the summary content need to be divided into different categories: procedure, primary diagnosis, and additional features. The vocabularies with the same base word are discovered when constructing the vocabulary dictionary. Accordingly, by inputting the base keywords of a certain procedure/disease as shown in Table 11, the system can automatically match all the specific procedures and disease subtypes with the given base keywords.
Table 11.
Example of a list of structuralized keywords’ ending words
| Summary content | Base keywords | |
|---|---|---|
| Sprocedure | Biopsy | Transplantation |
| Sdiagnosis | Failure | Disease |
| Nephropathy | Nephritis | |
| Nephrosclerosis | Glomerulonephritis | |
| Glomerulonephropathy | Arteriolosclerosis | |
| Glomerulosclerosis | Glomerulopathy | |
| Arteriosclerosis | Pyelonephritis | |
| Carcinoma | Glomerulitis | |
| Nepropathy | ||
Let Sprocedure and Sdiagnosis correspond to the sets of base keywords of procedures and diseases, respectively. By matching the base word of each keyword phrase in KP with the base keywords in Sprocedure and Sdiagnosis individually, the matched keyword phrases are assigned to the corresponding categories. The rest of the keyword phrases, whose base words match neither the base keywords of the procedure nor those of the primary diagnosis, are categorized into the additional features.
Example 4-1 Assume the content in the diagnosis paragraph of an examination report is shown in Table 12. The phrases extracted by matching the vocabularies in the dictionary form the keyword phrase list KP, which consists of 7 phrases numbered 1 to 7. The base word of keyword phrase 1 matches the base keyword ‘biopsy’ of procedure; therefore, keyword phrase 1 is assigned to the procedure category. By similar processing, keyword phrases 2, 3, 6, and 7 are assigned to the primary diagnosis category because their base words ‘glomerulopathy’, ‘glomerulosclerosis’, ‘nephropathy’, and ‘glomerulosclerosis’ belong to the base keywords of the primary diagnosis. Finally, the remaining keyword phrases 4 and 5 are categorized into the additional features.
Table 12.
Example of structuralizing main diagnosis (diagnosis) content of the examination report
| Diagnosis | Kidney, left, echo-guided percutaneous needle core biopsy, focal mesangial proliferative and sclerotic glomerulopathy with focal segmental glomerulosclerosis (11/33), patchy tubular atrophy (up to 15% in area), and scattered to clustered interstitial CD20-positive lymphocytic infiltration, c / w IgA nephropathy (class II) with focal segmental glomerulosclerosis |
| Keyword phrases list KP |
Keyword phrase 1: echo-guided percutaneous needle core biopsy Keyword phrase 2: focal mesangial proliferative and sclerotic glomerulopathy Keyword phrase 3: focal segmental glomerulosclerosis ( 11/33) Keyword phrase 4: patchy tubular atrophy ( up to 15% in area) Keyword phrase 5: clustered interstitial cd20-positive lymphocytic infiltration Keyword phrase 6: c/w IgA nephropathy ( class II) Keyword phrase 7: focal segmental glomerulosclerosis |
| Structuralized result |
(1) Procedure 1. Echo-guided percutaneous needle core biopsy (2) Primary diagnosis 1. Focal mesangial proliferative and sclerotic glomerulopathy 2. Focal segmental glomerulosclerosis ( 11/33) 3. c/w IgA nephropathy ( class II) 4. Focal segmental glomerulosclerosis (3) Additional features 1. Patchy tubular atrophy ( up to 15% in area) 2. Clustered interstitial cd20-positive lymphocytic infiltration |
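The categorization of Example 4-1 can be sketched with the base keywords of Table 11. One assumption is made explicit in code: the base word of a phrase is taken to be its last word before any bracketed tail.

```python
S_PROCEDURE = {'biopsy', 'transplantation'}
S_DIAGNOSIS = {'failure', 'disease', 'nephropathy', 'nephritis',
               'nephrosclerosis', 'glomerulonephritis', 'glomerulonephropathy',
               'arteriolosclerosis', 'glomerulosclerosis', 'glomerulopathy',
               'arteriosclerosis', 'pyelonephritis', 'carcinoma', 'glomerulitis',
               'nepropathy'}

def base_word(phrase: str) -> str:
    # Assumption: the base word is the last word before any bracketed tail.
    head = phrase.split('(')[0].split()
    return head[-1].lower() if head else ''

def categorize(kp: list) -> dict:
    result = {'procedure': [], 'primary diagnosis': [], 'additional features': []}
    for phrase in kp:
        b = base_word(phrase)
        if b in S_PROCEDURE:
            result['procedure'].append(phrase)
        elif b in S_DIAGNOSIS:
            result['primary diagnosis'].append(phrase)
        else:
            result['additional features'].append(phrase)
    return result
```

Applied to the 7 keyword phrases of Table 12, this reproduces the structuralized result: 1 procedure, 4 primary diagnoses, and 2 additional features.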
For the results of special inspection items, the structuralized result must show the detailed observations for specific inspection items. According to the given specific keywords of each examination item for the structured report, the words in the extracted keyword phrases in KP are stemmed and compared with the keywords of each examination item to decide which options of the examination item appear. Negative words should be considered to decide whether an examination item is ‘present’ or ‘absent’.
Example 4-2 Table 13 shows an example with the results of special inspection items, which is a paragraph of DIF. The detailed results are usually described by continuous adjectives, such as ‘diffuse segmental coarse granular.’ Therefore, in addition to the noun dictionary, an adjective dictionary is also used for matching when extracting the keyword phrase list KP from the results of special inspection items. In the example, there are 9 keyword phrases extracted, as shown in Table 13. Finally, the option ‘granular’ in the staining pattern is marked because keyword phrase 3 is extracted. Besides, the options ‘focal’ and ‘diffuse’ are marked because keyword phrases 7 and 1 are extracted. Similarly, ‘IgA,’ ‘IgG,’ and ‘C3’ are labeled as ‘present.’ The other inspection items are labeled as ‘absent’ because the negative phrase ‘negative staining’ appears with them.
Table 13.
Example of structuralizing chromosome examination (DIF) content of the examination report
|
Main diagnosis (diagnosis) |
The DIF study demonstrates diffuse segmental coarse granular to lumpy depositions of IgA (grade 3–4/4) and C3 (grade 3/4) with focal segmental grade 2–3/4 mesangial deposition of IgG and negative staining to IgM, C1q, C4, and fibrinogen |
|
Keyword phrase list KP |
Keyword phrase 1: diffuse Keyword phrase 2: segmental Keyword phrase 3: granular Keyword phrase 4: mesangial Keyword phrase 5: IgA (grade 3–4/4) Keyword phrase 6: C3 (grade 3/4) Keyword phrase 7: focal Keyword phrase 8: IgG (grade 2–3/4) Keyword phrase 9: negative staining to IgM, C1q, C4, and fibrinogen |
|
Structuralized Result |
Staining pattern: ■granular □linear Location: ■focal ■diffuse ■segmental □global ■mesangial □glomerular capillary wall IgA deposition/expression □absence ■present (grade 3–4/4) IgG deposition/expression □absence ■present (grade 2–3/4) IgM deposition/expression ■absence □present C3 deposition/expression □absence ■present (grade 3/4) C4 deposition/expression ■absence □present C1q deposition/expression ■absence □present C4d deposition/expression □absence □present Fibrinogen insignificant ■absence □present
Automatic keyword extraction for special inspection items
Due to the large number of inspection items, it is cumbersome and time consuming for physicians to enumerate them. Moreover, some items or their options may appear in the examination report without being listed. Therefore, we propose a probabilistic topic modeling method to automatically extract the candidate keywords of the examination items from the extracted keyword phrase list. The extracted candidate keywords can be provided to the physicians for verification, in order to reduce the effort of manual keyword enumeration.
For the five paragraphs with the results of special inspection items: EM, specimen type, gross description, LM, and DIF, the following keyword extraction processing is performed on each paragraph individually.
Noise removal for the paragraph dictionary
The dictionary consists of two kinds of compound nouns: those combining an adjective with a noun, and those combining continuous nouns.
We use the Lift measure to estimate the association degree between the singular nouns composing a compound noun c, as shown in Eq. (8).
| Lift(c) = P(c.w1, c.w2) / (P(c.w1) × P(c.w2)) | (8) |
If the compound noun c is in the form of an adjective followed by nouns, let c.w1 denote the adjective word in c and c.w2 denote the following noun phrase in c. Otherwise, c is in the form of continuous nouns; in this case, let c.w1 denote the first noun in c and c.w2 denote the following noun phrase in c. F(c.w1, c.w2) denotes the number of times c.w1 and c.w2 co-occur. Besides, F(c.w1) and F(c.w2) denote the numbers of occurrences of c.w1 and c.w2, respectively. The probabilities in Eq. (8) are estimated from these frequency counts.
For a compound noun whose Lift value is greater than or equal to a given threshold value θ, it will be retained. Otherwise, it will be considered as a noisy vocabulary and removed from the dictionary.
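A minimal sketch of the Lift filtering follows. The standard lift form, with probabilities estimated from frequency counts over the corpus, is assumed here; the exact normalization in Eq. (8) may differ.

```python
def lift(pair_count: int, w1_count: int, w2_count: int, total: int) -> float:
    # Lift: observed co-occurrence probability of the two parts of a compound
    # noun divided by the probability expected if the parts were independent.
    p12 = pair_count / total
    p1, p2 = w1_count / total, w2_count / total
    return p12 / (p1 * p2)

def filter_noise(compound_counts: dict, unigram_counts: dict,
                 total: int, theta: float = 0.2) -> set:
    # Keep a compound noun only when its Lift reaches the threshold theta.
    kept = set()
    for (w1, w2), c12 in compound_counts.items():
        if lift(c12, unigram_counts[w1], unigram_counts[w2], total) >= theta:
            kept.add(f'{w1} {w2}')
    return kept
```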
Find general adjectives
Among the compound nouns in the form of an adjective followed by a noun, we compute the entropy of each adjective to find general adjectives. Let a denote an adjective, and n1, …, nm denote the m different nouns appearing after a. Besides, P(ni | a) denotes the probability of ni appearing after a.
| Entropy(a) = −Σi=1..m P(ni | a) log P(ni | a) | (9) |
According to the result of Eq. (9), if Entropy(a) is larger than the threshold value 1, a is added into the list of general adjectives. A list of general adjectives is created after completing the entropy computation for all the adjectives in the constructed dictionary.
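The entropy computation can be sketched as follows; the logarithm base is not stated in the paper, so base 2 is assumed here.

```python
from math import log2

def adjective_entropy(noun_counts: dict) -> float:
    # noun_counts: how often each distinct noun follows the adjective.
    total = sum(noun_counts.values())
    return -sum((c / total) * log2(c / total) for c in noun_counts.values())

def general_adjectives(adj_to_noun_counts: dict, threshold: float = 1.0) -> list:
    # An adjective whose entropy exceeds the threshold modifies many different
    # nouns, so it is treated as a general adjective.
    return [a for a, counts in adj_to_noun_counts.items()
            if adjective_entropy(counts) > threshold]
```

An adjective such as ‘mild’ that precedes many different nouns gets high entropy, while a disease-specific modifier such as ‘lupus’ that precedes a single noun gets entropy 0.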
Create candidate keyword list for examination items
We applied the Latent Dirichlet allocation (LDA) topic modeling method [4] to analyze the compound keyword phrases in the paragraphs of certain special inspection items from the whole database, in order to extract the candidate keywords of the examination items. The LDA model assumes that each document is a mixture of various topics and that each topic is described by a number of different hidden topic words. Accordingly, LDA modeling aims to find the Dirichlet distribution that best fits the observed words of the documents in the database. Finally, we can obtain the probability of each document belonging to the various hidden topics and the distribution of topic words for each hidden topic. For each paragraph of a certain special inspection item, we collect the compound keyword phrases extracted from the corresponding paragraph in a report as the words of a document. After performing LDA topic modeling on the same paragraph over all the reports, the topic words with higher probabilities for each hidden topic are extracted as candidate keywords of the examination items.
For each paragraph with the results of special inspection items in a report, the sentences are parsed to get their POS tagging. The words with POS tags labeled as conjunctions, articles, pronouns, adverbs, auxiliary verbs, or prepositions are removed, as shown in Fig. 5. Then, the remaining words are compared with the vocabularies in the corresponding dictionary of the paragraph. The matched phrases are extracted to be the content of a document for LDA topic modeling, as shown in Fig. 6. According to a given number of topics numT, the result of LDA topic modeling provides the topic words and their probabilities for each topic. The topic words with the top k highest probabilities for each topic are chosen. Let Tn denote the nth topic and w denote a topic word of Tn. Besides, avgP(Tn) denotes the average of the top k highest probabilities for Tn, and P(w | Tn) denotes the probability of w appearing in topic Tn. If P(w | Tn) ≥ avgP(Tn), w is selected into the candidate keywords of the examination items, as shown in Fig. 7. After processing the topic words for each topic, the selected candidate keywords from each topic are collected into a set to provide the candidate keywords of the examination items in the paragraph.
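The selection rule at the end of this pipeline (a topic word is kept when its probability reaches the average of the top k probabilities of its topic) can be sketched as follows; fitting the LDA model itself, e.g. with an off-the-shelf library, is omitted.

```python
def select_candidates(topic_word_probs: dict, k: int = 10) -> set:
    # topic_word_probs: {topic id: {topic word: P(word | topic)}} from an LDA model.
    selected = set()
    for words in topic_word_probs.values():
        top_k = sorted(words.values(), reverse=True)[:k]
        avg_p = sum(top_k) / len(top_k)
        # keep the words whose probability reaches the top-k average
        selected |= {w for w, p in words.items() if p >= avg_p}
    return selected
```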
Fig. 5.
Example of sentence separation and removal of semantically meaningless words
Fig. 6.
The extracted keyword phrases according to the paragraph dictionary
Fig. 7.
Example of selecting candidate keywords of the examination items
Extension for the candidate keyword list
In the result of the LDA topic modeling, the extracted candidate keyword list may miss adjectives because they do not appear consecutively with the noun keywords in the report. This results in incomplete options for the examination items. Therefore, the goal of this processing step is to further find the general adjectives that serve as keywords for the options of the examination items.
Let p denote a phrase in the candidate keyword list. For each adjective a in the general adjective list, if a + p forms a compound noun phrase in the dictionary, a + p is inserted into the candidate keyword list of the examination items.
Example 4-3 Assume that a candidate keyword list, a general adjective list, and a dictionary of compound nouns are given as shown in Table 14. Because the phrase ‘basement membrane’ combined with the general adjective ‘thick’ appears in the dictionary, ‘thick basement membrane’ is inserted into the candidate keyword list, where ‘thick’ is a keyword of the options for the observation item ‘basement membrane’. Similarly, ‘diffuse’ is a keyword of the options for ‘foot processes effacement’ and ‘mesangial’ is a keyword of the options for ‘expansion’.
Table 14.
Example of extending candidate keywords
| Candidate keyword list |
|---|
| Basement membrane, cytoplasmic vacuolization, expansion, subepithelial deposits, cell, cellularity |
| General adjective list |
| Thick, diffuse, mesangial, significant, segmental, mild |
| A dictionary of the compound nouns |
| Thick basement membrane, diffuse foot processes effacement, mesangial expansion, lupus glomerulonephritis, dense deposits, necrosis |
| Candidate keyword extension list |
| Basement membrane, foot processes effacement, expansion, subepithelial deposits, cell, cellularity, thick basement membrane, diffuse foot processes effacement, mesangial expansion |
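The extension step can be sketched on the data of Table 14; the function and parameter names are illustrative.

```python
def extend_candidates(candidates: list, general_adjs: list, compound_dict: set) -> list:
    # For each candidate phrase p and general adjective a, insert "a p" when
    # that compound noun phrase already exists in the dictionary.
    extended = list(candidates)
    for p in candidates:
        for a in general_adjs:
            if f'{a} {p}' in compound_dict:
                extended.append(f'{a} {p}')
    return extended
```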
Performance evaluation
Three parts of experiments are performed to evaluate the proposed methods. The first part evaluates the effectiveness of eliminating typos by using the LCS method after establishing the medical dictionary. The second part computes the accuracy of the extracted examination item terms under different parameter settings in the calculation formulas. Finally, the quality of the automatically constructed structured examination reports is evaluated.
Experimental data source
The data set used in the experiments includes 476 examination reports of the patients in the Division of Nephrology of a University Hospital. All the contents of the reports are in English. At first, each examination report is automatically segmented into the following eight paragraphs: (1) Main diagnosis (Diagnosis), (2) Electron Microscopy Examination (EM), (3) Examination Status of Electron Microscope (Comment/Narrative), (4) The Size and Condition of Sliced Specimen (Specimen type), (5) Description of Sliced Specimen (Gross description), (6) Examination of Light Microscope (LM), (7) Chromosome examination (DIF), (8) Report Conclusion (Summary). However, not all examination reports contain all eight paragraphs.
Evaluation on typo elimination using LCS
Evaluation method
In Sect. 3.2, a method is proposed for typo correction. Paragraphs (1), (6), and (8) of the reports contain more text descriptions, and typos occur frequently in them. Accordingly, noun dictionaries are established for the three paragraphs separately, and a noun dictionary for the whole corpus of the examination reports is also constructed. For each of the four dictionaries, 40 correction cases are randomly selected and manually labelled to decide whether the typos are correctly eliminated by the LCS matching. Then a precision value is computed to show the effectiveness of the typo elimination strategy for each paragraph, and the precision achieved on the three paragraphs is averaged. After determining the appropriate 1/d value for typo correction by LCS matching, the Precision, Recall, and F1-score values of the typo correction using the noun dictionary of the whole corpus are then computed.
Result of experiments
Figures 8 and 9 show the precision of typo correction when correcting the base words and the content words, respectively, and Fig. 10 shows the Precision, Recall, and F1-score of typo correction using the three paragraph dictionaries. The results show that the precision of typo correction performed on paragraph (1) is the best. Although the precision on paragraph (6) is worse, its recall is the best. The reason is that the number of distinct words appearing in paragraph (1) is smaller than in the other paragraphs, so wrong matching of typos is less likely. On the other hand, on paragraph (6), the LCS method detects more typos than the actual number of typos, so a higher Recall value is achieved. Overall, in terms of F1-score, the performance of typo correction on these three paragraphs achieves 0.62 or higher. Tables 15 and 16 show some examples of correct and wrong typo corrections, respectively.
Fig. 8.
The precision of typo correction by using LCS to correct the base words
Fig. 9.
The precision of typo correction by using LCS to correct the content words
Fig. 10.
The Precision, Recall, and F1-score of typo correction by using the paragraph dictionaries
Table 15.
Example of the correct typo correction
| Before correction | After correction |
|---|---|
| Oxalte | Oxalate |
| Glomeronephropathy | Glomerulonephropathy |
| Glomeurulosclerosis | Glomerulosclerosis |
| Atrophy | Atrophy |
| Diffuse | Diffuse |
Table 16.
Example of the wrong typo correction
| Before correction | After correction |
|---|---|
| Infarction | Infection |
| Arteriosclerosis | Arteriolosclerosis |
| Endotheliosis | Endotheliitis |
| Stage | State |
| Cords | Cores |
The total execution time, including paragraph separation, dictionary construction, and typo correction, for the 476 examination reports is 5250 s.
Evaluation on keyword extraction of examination items
Evaluation method
Figure 11 shows the proposed methods for extracting the keywords introduced in Sect. 4.2. We consider the following four processing components which may influence the extraction effectiveness: (1) dictionary construction, (2) the threshold value of Lift filtering, (3) the topic number of LDA, and (4) whether to perform the keyword extension step. The corresponding experiments are [Exp. 2-1] to [Exp. 2-4].
Fig. 11.
Method and Experiment Flowchart
In this part of the experiments, the data set consists of paragraphs (2), (6), and (7) of the reports, which describe the special examination items. The terms of the examination items in a structured form of the report, listed by a medical expert, are used as the correct answers. An extracted key phrase that contains a correct answer is considered to be predicted correctly. Accordingly, Precision, Recall, and F1-score are measured for the extracted keywords of each paragraph.
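The scoring rule above can be sketched as follows; substring containment is an assumed approximation of "contains a correct answer".

```python
def evaluate(extracted: list, answers: list) -> tuple:
    # An extracted phrase is correct if it contains some ground-truth term;
    # a ground-truth term is recalled if some extracted phrase contains it.
    correct = [p for p in extracted if any(a in p for a in answers)]
    recalled = [a for a in answers if any(a in p for p in extracted)]
    precision = len(correct) / len(extracted) if extracted else 0.0
    recall = len(recalled) / len(answers) if answers else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```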
Experiment result
Evaluation of keyword extraction based on various dictionary constructing strategies
This experiment compares three different strategies for constructing the noun dictionaries: (1) nouns or compound nouns that do not contain adjectives (labeled as NN + NN), (2) compound nouns that contain adjectives (labeled as JJ + NN), and (3) the union of the compound nouns in the former two sets.
The precision, recall, and F1-score of the extracted keywords by using the dictionaries constructed with the three different strategies are shown in Figs. 12, 13, and 14, respectively.
Fig. 12.
The precision of keyword extraction based on different methods of dictionary construction
Fig. 13.
The recall of keyword extraction based on different methods of dictionary construction
Fig. 14.
The F1-score of keyword extraction based on different methods of dictionary construction
The results show that the dictionary established by NN + NN alone is not as good as the others. The main reason is that the keywords composing an examination item usually contain adjective terms. Accordingly, the discovered NN + NN phrases are not all correct answers, and the dictionary constructed from them is incomplete. By using the dictionary constructed from the union of the NN + NN and JJ + NN phrases, the recall of the extracted keywords is significantly higher than with the other two dictionaries, although the precision is lower than with the JJ + NN dictionary. Overall, according to the average F1-score over the three paragraphs, the dictionary composed of the NN + NN phrases merged with the JJ + NN phrases is selected as the basis for extracting the examination keywords.
Evaluation of keyword extraction by changing the threshold setting of Lift measure
In this experiment, the threshold value θ of the Lift measure for filtering the noun phrases is varied. The precision, recall, and F1-score of the extracted keywords under the various θ values are shown in Figs. 15, 16, and 17, respectively. The results show that, when θ is set to 0.2, all the paragraphs achieve the best performance on the evaluation metrics. Accordingly, in the following experiments, θ is set to 0.2.
Fig. 15.
The precision of keyword extraction based on different threshold setting on Lift measure
Fig. 16.
The recall of keyword extraction based on different threshold setting on Lift measure
Fig. 17.
The F1-score of keyword extraction based on different threshold setting on Lift measure
Evaluation of keyword extraction by changing the number of LDA topics
In this experiment, the number of LDA topics is varied, and the precision, recall, and F1-score of the extracted keywords are measured. The results are shown in Figs. 18, 19, and 20, respectively.
Fig. 18.
The precision of the extracted candidate keyword list by setting different LDA topic number
Fig. 19.
The recall of the extracted candidate keyword list by setting different LDA topic number
Fig. 20.
The F1-score of the extracted candidate keyword list by setting different LDA topic number
According to the results of the experiments, when numT is set to 10, the highest precision of the extracted candidate keyword list is obtained. Besides, when numT is set to 30, the highest recall is obtained; for the DIF paragraph, numbered (7), the recall is even higher than 0.9. The reason is that the larger the topic number, the more keywords are extracted by the LDA topic modeling, so more keywords of the examination items are discovered. However, a larger topic number also introduces more divergent keywords, which causes the precision of the extracted keywords to drop slightly. Overall, when numT is 10, the best F1-score is achieved.
Evaluation on the extension for the candidate keywords list
This experiment aims to evaluate whether the method for expanding the candidate keyword list proposed in Sect. 4.2 improves the effectiveness of the extracted keyword list. The precision, recall, and F1 measures before and after performing the proposed method are compared in Fig. 21. It shows that the proposed method indeed improves the precision and recall of the extracted candidate keyword list for each paragraph. In other words, finding the corresponding general adjective vocabularies provides a more correct and complete list of keywords for the examination items. Table 17 shows the keywords which are not extracted by the proposed method; most of them have a low frequency of occurrence in the reports. The other lost keywords have probabilities for a specific topic lower than the average of the top k probabilities for that topic, which implies that these observations rarely occur in the reports of patients.
Fig. 21.
The performance of examination content keyword based on expanding keyword candidate list
Table 17.
The un-discovered keywords by the extraction method
| Paragraph 2 | ||
|---|---|---|
| The keywords not in the top k LDA topic words | The keywords with probabilities lower than the average probability of the top k topic words | The keywords not appearing in the reports |
| Epithelial | Occluded | Duplication |
| Hypercellularity | Microfilament condensation | |
| Rupture | Occlusion | |
| Fenestration | Endothelium | |
| Protein droplets | Tubuloreticular | |
| Swollen | Effacement | |
| Fibrin | ||
| Paragraph 6 | ||
| Arteriosclerosis | Mesangiolysis | |
| Thinning | Thrombotic | |
| Wire-loop | Basement membrane | |
| Interstitium | ||
| Fibrocellular | ||
| Fibrinoid | ||
| Double-contour | ||
| Arteritis | ||
| Microthrombi | ||
| Hyaline | ||
| Paragraph 7 | ||
| Pseudolinear | ||
Evaluation on structuralization of the summary content
Evaluation method
According to the method proposed in Sect. 4.1, this experiment takes the structuralization results of paragraphs (1) main diagnosis, (3) comment/narrative, and (8) summary for evaluation. The experts manually mark whether the extracted keyword phrases provide important medical information. If an extracted phrase misses some word that is not extracted from the report, it is considered an incorrect result. In this experiment, 50 structured results of the reports were randomly sampled for each of the three paragraphs, and the Precision metric is computed according to the manually labeled results.
Experiment result
The result of this experiment is shown in Fig. 22. For these abstract paragraphs, the extracted diagnoses and observations achieve a precision of at least 0.9; for the Comment/Narrative paragraph, the precision reaches 0.98. This implies that the proposed method can provide a well-formed structured form for most of the examination reports in the abstract paragraphs. The small number of errors, as shown in Table 18, is mainly caused by keywords that are not completely extracted into the keyword phrases. The reason is that the corresponding phrases have POS tag patterns which rarely appear and thus do not satisfy the combination rules of keyword phrase extraction.
Fig. 22.
The precision of the structured form for the abstract paragraphs
Table 18.
Example of missing word in the extracted keyword phrases
| Phrases with missing words | POS tag pattern |
|---|---|
| Diffuse and nodular diabetic nephropathy | JJ CC JJ JJ NN |
| Chronic allograft rejection superimposed | JJ NN NN VBN |
| Minimal glomerular, tubulointerstitial, and vascular changes | JJ JJ, JJ CC JJ NN |
The total execution time of paragraph structuralization for the 476 examination reports is 1103 s.
Conclusion and future work
Conclusion
This study aims to automatically generate a structured form for a textual examination report. At first, based on the part-of-speech tagging results, patterns are designed to extract candidate medical vocabularies from the corpus of nephrology examination reports. Besides, possible typos and clinically meaningless words are filtered out to construct a medical vocabulary dictionary for each examination paragraph. For the paragraphs of special examination items, the noisy words are removed from the phrases in the established dictionaries, and LDA topic modeling is applied to extract the candidate keyword phrases of the examination items. For the abstract paragraphs, the content of each paragraph is matched with the vocabulary dictionary to extract keyword phrases, which are then merged into complete medical terms and assigned into different categories to form a structured version of the paragraph. The results of a series of experiments show that the proposed methods can effectively construct a structured form of examination reports. Furthermore, the keywords of the popular examination items can be extracted correctly. These techniques will help automatic processing and analysis of medical text reports.
Future work
According to the experimental results, a small number of keywords in the structured forms of the paragraphs are missing because they are not extracted by the currently proposed part-of-speech pattern rules. We will consider collecting a larger database of examination reports, combined with a pre-constructed medical vocabulary, to automatically learn syntactic patterns and thus provide a more complete and effective extraction method for keyword phrases. Furthermore, how to analyze the causes of the main diagnosis from the keywords of the examination items will be studied in the future.
Acknowledgements
This research was partially supported by Project Number ASIA-105-CMUH-20, China Medical University Hospital.
Footnotes
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Contributor Information
Pei-Hao Wu, Email: h29489538@hotmail.com.
Avon Yu, Email: the.avon.yu@gmail.com.
Ching-Wei Tsai, Email: cwtsai2007@gmail.com.
Jia-Ling Koh, Email: jlkoh@csie.ntnu.edu.tw.
Chin-Chi Kuo, Email: chinchik@gmail.com.
Arbee L. P. Chen, Email: arbee@asia.edu.tw.
References
- 1.Amorim RC, Zampieri M. Effective spell checking methods using clustering algorithms. In: Proceedings of Recent Advances in Natural Language Processing; 2013. p. 172–178.
- 2.Aronson AR. Effective mapping of biomedical text to the UMLS Metathesaurus: the MetaMap program. In: Proceedings of the AMIA Symposium. American Medical Informatics Association; 2001. [PMC free article] [PubMed]
- 3.Balaneshin-kordan S, Kotov A, Xisto R. WSU-IR: joint weighting of explicit and latent medical query concepts from diverse sources. In: Proceedings of the Text REtrieval Conference (TREC); 2015.
- 4.Blei DM, Ng AY, Jordan MI. Latent Dirichlet allocation. J Mach Learn Res. 2003;3:993–1022. [Google Scholar]
- 5.Feldman R, Netzer O, Peretz A, Rosenfeld B. Utilizing text mining on online medical forums to predict label change due to adverse drug reactions. In: Proceedings of Knowledge Discovery and Data Mining (KDD); 2015.
- 6.Ghassemi M, Naumann T, Doshi-Velez F, Brimmer N, Joshi R, Rumshisky A, Szolovits P. Unfolding physiological state: mortality modelling in intensive care units. In: Proceedings of the Knowledge Discovery and Data Mining (KDD); 2014. [DOI] [PMC free article] [PubMed]
- 7.Goodwin TR, Harabagiu SM. Medical question answering for clinical decision support. In: Proceedings of the International Conference on Information and Knowledge Management (CIKM); 2016. [DOI] [PMC free article] [PubMed]
- 8.Gupta N, Mathur P. Spell checking techniques in NLP: a survey. Int J Adv Res Comput Sci Softw Eng. 2012;2(12).
- 9.Jo Y, Loghmanpour N, Rose CP. Time series analysis of nursing notes for mortality prediction via a state transition topic model. In: Proceedings of the International Conference on Information and Knowledge Management (CIKM); 2015.
- 10.Lafferty J, McCallum A, Pereira FCN. Conditional random fields: probabilistic models for segmenting and labeling sequence data. In: Proceedings of the Eighteenth International Conference on Machine Learning (ICML); 2001.
- 11.Leaman R, Wojtulewicz L, Sullivan R, Skariah A, Yang J, Gonzalez G. Towards internet-age pharmacovigilance: extracting adverse drug reactions from user posts to health-related social networks. In: Proceedings of the 2010 workshop on Biomedical Natural Language Processing, Association for Computational Linguistics; 2010.
- 12.Lehman L-W, Saeed M, Long W, Lee J, Mark R. Risk stratification of ICU patients using topic models inferred from unstructured progress notes. In: Proceedings of the American Medical Informatics Association (AMIA); 2012. [PMC free article] [PubMed]
- 13.Li Y, Li J, Duan H, Lu X. Structuralization of digestive endoscopic report based on NLP. In: Proceedings of the 2008 International Conference on BioMedical Engineering and Informatics; 2008.
- 14.Liu X, Chen H. Azdrugminer: an information extraction system for mining patient-reported adverse drug events in online patient forums. In: Proceedings of the International Conference on Smart Health (ICSH); 2013.
- 15.Loo M, Jonge E. Statistical data cleaning with applications in R. New York: Wiley; 2018. [Google Scholar]
- 16.Nandhakumar N, et al. Clinically significant information extraction from radiology reports. In: Proceedings of the 2017 ACM Symposium on Document Engineering. ACM; 2017.
- 17.Paterson M, Dančík V. Longest common subsequences. In: Proceedings of the Mathematical Foundations of Computer Science (MFCS); 1994.
- 18.Rong X, Chen Z, Mei Q, Adar E. EgoSet: exploiting word ego-networks and user-generated ontology for multifaceted set expansion. In: Proceedings of the International Conference on Web Search and Data Mining (WSDM); 2016.
- 19.Savova GK, et al. Discovering peripheral arterial disease cases from radiology notes using natural language processing. In: AMIA Annual Symposium Proceedings, vol. 2010. American Medical Informatics Association; 2010. [PMC free article] [PubMed]
- 20.Savova GK, et al. Mayo clinical Text Analysis and Knowledge Extraction System (cTAKES): architecture, component evaluation and applications. J Am Med Inform Assoc. 2010;17(5):507–513. doi: 10.1136/jamia.2009.001560. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 21.Sethi S, et al. Mayo Clinic/Renal Pathology Society consensus report on pathologic classification, diagnosis, and reporting of GN. J Am Soc Nephrol. 2015;27(5):1278–1287. doi: 10.1681/ASN.2015060612. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 22.Stanford CoreNLP—Core natural language software. https://stanfordnlp.github.io/CoreNLP.
- 23.Taira RK, Soderland SG, Jakobovits RM. Automatic structuring of radiology free-text reports. Radiographics. 2001;21(1):237–245. doi: 10.1148/radiographics.21.1.g01ja18237. [DOI] [PubMed] [Google Scholar]
- 24.Teneva N, Cheng W. Salience rank: efficient keyphrase extraction with topic modeling. In: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics; 2017. p. 530–535.
- 25.Wang Y, et al. Clinical information extraction applications: a literature review. J Biomed Inform. 2018;77:34–49. doi: 10.1016/j.jbi.2017.11.011. [DOI] [PMC free article] [PubMed] [Google Scholar]