PLOS Digital Health. 2022 Sep 15;1(9):e0000099. doi: 10.1371/journal.pdig.0000099

Exploring optimal granularity for extractive summarization of unstructured health records: Analysis of the largest multi-institutional archive of health records in Japan

Kenichiro Ando 1,2,3, Takashi Okumura 4,*, Mamoru Komachi 1, Hiromasa Horiguchi 3, Yuji Matsumoto 2
Editor: Tom J Pollard
PMCID: PMC9931252  PMID: 36812582

Abstract

Automated summarization of clinical texts can reduce the burden of medical professionals. “Discharge summaries” are one promising application of summarization, because they can be generated from daily inpatient records. Our preliminary experiment suggests that 20–31% of the descriptions in discharge summaries overlap with the content of the inpatient records. However, it remains unclear how the summaries should be generated from the unstructured source. To decompose the physician’s summarization process, this study aimed to identify the optimal granularity in summarization. We first defined three types of summarization units with different granularities to compare the performance of discharge summary generation: whole sentences, clinical segments, and clauses. We defined clinical segments in this study, aiming to express the smallest medically meaningful concepts. To obtain the clinical segments, it was necessary to split the texts automatically in the first stage of the pipeline. Accordingly, we compared rule-based methods and a machine learning method, and the latter outperformed the former with an F1 score of 0.846 in the splitting task. Next, we experimentally measured the accuracy of extractive summarization using the three types of units, based on the ROUGE-1 metric, on a multi-institutional national archive of health records in Japan. The measured accuracies of extractive summarization using whole sentences, clinical segments, and clauses were 31.91, 36.15, and 25.18, respectively. We found that clinical segments yielded higher accuracy than sentences and clauses. This result indicates that summarization of inpatient records demands finer granularity than sentence-oriented processing. Although we used only Japanese health records, the result can be interpreted as follows: when summarizing chronological clinical records, physicians extract “concepts of medical significance” from patient records and recombine them in new contexts, rather than simply copying and pasting topic sentences. This observation suggests that a discharge summary is created by higher-order information processing over concepts at the sub-sentence level, which may guide future research in this field.

Author summary

Medical practice involves significant paperwork, and therefore automated processing of clinical texts can reduce medical professionals’ burden. Accordingly, we focused on generating hospital discharge summaries from the daily inpatient records stored in Electronic Health Records. By applying summarization technologies, which are well studied in Natural Language Processing, discharge summaries could be generated automatically from the source texts. However, automated summarization of daily inpatient records involves various technical topics and challenges, and the generation of discharge summaries is a complex process mixing extractive and abstractive summarization. Thus, in this study, we explored the optimal granularity for extractive summarization, attempting to decompose actual physicians’ processing. In the experiments, we used three types of summarization units with different granularities to compare the performance of discharge summary generation: whole sentences, clinical segments, and clauses. We originally defined clinical segments, aiming to express the smallest medically meaningful concepts. The result indicated that sub-sentence processing, with units larger than clauses, improves the quality of the summaries. This finding can guide future development of automated summarization of medical documents.

1 Introduction

Automated summarization of clinical texts can reduce the burden of medical professionals because their practice includes significant paperwork. A recent study found that family physicians spent 5.9 h of an 11.4 h workday on electronic health records (EHRs) [1]. In 2019, 74% of physicians reported spending more than 10 h per week on such tasks [2]. Another study reported that physicians spent 26.6% of their daily working time on documentation [3].

Compilation of hospital discharge summaries is an onerous task for physicians. Because daily inpatient records are already filed in the systems, computers might efficiently support physicians by generating summaries of clinical records. Although research has been conducted to identify certain classes of clinical information in clinical texts [4–8], there has been limited research on acquiring expressions that can be used to write discharge summaries [9–14]. Because many summarization techniques have been developed in natural language processing (NLP), the generation of discharge summaries can be a promising application of the technology.

However, automated summarization of daily inpatient records involves various technical topics and challenges. For example, descriptions of important findings related to a patient’s diagnosis require an extractive summary. Our preliminary experiments revealed that 20–31% of the sentences in discharge summaries were created by copying and pasting. This result indicates that a certain amount of content can be generated automatically by extractive summarization. Meanwhile, when a patient is discharged from the hospital after surgery without any major problems, it is necessary to summarize the clinical record as the patient “recovered well after the surgery,” even if more details of the postoperative process are described in the records. Such descriptions cannot be created by copy and paste and need to be abstracted. These observations suggest that the generation of discharge summaries is a complex process mixing extractive and abstractive summarization, and it remains unclear how to process the unstructured source texts, i.e., free text. To advance this research field, it is desirable to properly decompose these summarization processes and clarify their interactions.
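As an illustration only, the following sketch estimates such copy-and-paste overlap by exact sentence matching. The matching criterion is our assumption here, since the exact procedure of the preliminary experiment is not detailed in this paper.

```python
def copy_paste_ratio(summary_sents: list[str], record_sents: list[str]) -> float:
    """Fraction of summary sentences that appear verbatim in the inpatient
    records; a crude lower bound on extractive (copy-and-paste) overlap."""
    record_set = {s.strip() for s in record_sents}
    if not summary_sents:
        return 0.0
    copied = sum(1 for s in summary_sents if s.strip() in record_set)
    return copied / len(summary_sents)
```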

To this end, this study focuses on the extractive summarization process performed by physicians. Some recent studies investigated the best granularity units in this type of summarization [15, 16]. However, the granularity of extraction has not been explored for the summarization of medical documents. Thus, we attempted to identify the optimal granularity in this context by defining three units with different granularities and comparing their summarization performance: whole sentences, clinical segments, and clauses. The clinical segment is a novel unit we define to express the smallest medically meaningful concepts; it is detailed in the methodology section (Section 3).

This paper is organized as follows. In Section 2, we survey related work. Section 3 describes the materials and methods. Section 4 presents the experiment and its results, and Section 5 discusses the experiment. Finally, Section 6 concludes the paper.

2 Related work

Automated summarization is an actively studied field [15–19] with two main approaches: extractive and abstractive summarization. The former extracts content from source texts, whereas the latter creates new content. Generally, the abstractive approach provides more flexibility in summarization but often produces fake content that does not match the reference summary, which is referred to as “hallucination” [19–21]. Thus, in the medical field, extractive summarization has mainly been used for knowledge acquisition of clinical features such as diseases, prescriptions, and examinations. Determining the optimal granularity would lead to more reliable information. In addition, precise extraction spans would avoid extracting unnecessary information, keeping the precision of the processing high.

Meanwhile, natural language processing on unstructured medical text has focused on normalization and prediction, such as ICD codes, mortality, or readmission risk [22–27]. However, these tasks are not summarization in the narrow sense of distilling important information from the input. Several works targeted the acquisition of key information such as diseases, examination results, or medications from EHRs [6, 8, 28, 29], but these studies collected fragmented information and did not try to generate a contextualized passage. A line of research has aimed to help physicians grasp the key points quickly by generating a few key sentences [7, 30–32]. However, most studies producing discharge summaries used structured data as input [33–35]. Some other studies attempted to generate discharge summaries from free-form inpatient records, as in the present study [9–14]. In some of them, an encoder-decoder model was used to generate sentences for abstractive summarization [9–11]. These studies can create a whole discharge summary document. However, this approach may result in hallucinations, which limits its clinical use, although the output can be corrected manually by physicians before filing. The other studies summarized sentences using extractive summarization [11–14], and unsupervised generation using prompt engineering [36, 37] could further improve the performance, although such methods cannot generate entire texts.

Appropriate language resources are indispensable for advancing research on the summarization of clinical texts. In English, public corpora of medical records are available, such as MIMIC-III [38, 39] and others [40]. However, the number of resources available in Japanese is highly limited. The largest publicly available corpus is the one used for a shared task at an international conference, NTCIR [41]. A non-profit organization for language resources maintains another corpus, GSK2012-D [42]. However, their data volumes are small, and their statistics differ significantly from those of large-scale data, as illustrated in Table 1. This low-resource situation makes the processing of Japanese medical documents more challenging. First, Japanese medical texts often contain excessively shortened sentences and orthographic variants of terms originating from foreign languages. Besides, Japanese requires word segmentation. Most importantly, there is no Japanese parallel corpus of inpatient records and discharge summaries. Therefore, we built a new corpus, as detailed in the next section.

Table 1. Statistics of the target data.

Inpatient records
Dataset Cases Sentences/Document Words/Sentence Characters/Sentence
NHO data 24,641 192.0 9.0 18.1
GSK2012-D 45 97.4 7.5 15.1
MedNLP 278 22.6 12.7 22.4
Our corpus 108 274.1 9.1 18.5
Discharge summary
Dataset Cases Sentences/Document Words/Sentence Characters/Sentence
NHO data 24,641 35.0 12.4 23.3
Our corpus 108 17.4 18.6 34.4

3 Materials, method, and preprocessing

3.1 Target text

Clinical records can be expressed in various dialects and jargons. Accordingly, a study on a single institution would lead to highly biased results in medical NLP tasks because of local and hospital-specific dialects. To explore the optimal granularity for clinical document summarization, it is necessary to conduct a multi-institutional study to mitigate the potential bias caused by the medical records stored in a single EHR source. For this purpose, we designed an experiment using the largest multi-institutional health records archive in Japan, National Hospital Organization Clinical Data Archives (NCDA) [43]. NCDA is a data archive operated by the National Hospital Organization (NHO), which stores replicated EHR data for 66 national hospitals owned by this organization. Thus, the archive has become a valuable data source for multi-institutional studies that span across the country.

On this research infrastructure, informed consent and patient privacy are ensured in the following manner. At the national hospitals, notices about the policy and the use of EHR data are posted in the facilities. Patients who disagree with the policies can notify the hospital through an opt-out form to be excluded from the archive. Likewise, minors and their parents can submit the opt-out form at will. To conduct a study on the archive, researchers must submit their research proposals to the institutional review board. Once the study is approved, the data are extracted from NCDA and anonymized to construct a dataset for further analysis. The data are accessible only in a secured room at the NHO headquarters, and only statistics may be carried out of the secured room, to protect patients’ privacy.

In the present research, the analysis was conducted under the IRB approval (IRB Approval No.: Wako3 2019-22) of the Institute of Physical and Chemical Research (RIKEN), Japan, which has a collaboration agreement with the National Hospital Organization. The dataset we used for the study, referred to as NHO data hereafter, is an anonymized subset of the archive, which includes 24,641 cases collected from five hospitals that belong to the NHO. Each case includes inpatient records and a discharge summary for patients of internal medicine departments. The statistics of the target data are summarized in Table 1. As shown, the scale of the NHO data is much larger than that of GSK2012-D and MedNLP, which have been used in previous studies [41]. Accordingly, the results obtained using the NHO dataset are expected to be more general.

3.2 Design of the analysis

There are two approaches to identifying the optimal granularity of extractive summarization. One is to take word sequences of arbitrary length (n-grams) and compare them as summarization units. The other is to use predefined linguistic units. Previous studies in this domain took the latter approach and found that a sentence is a longer-than-optimal granularity unit for extractive summarization, as mentioned in Section 1. Another study adopted a clause as a shorter self-contained linguistic unit [44] instead of a sentence [15]. However, it remains unclear whether the clause performs best in the summarization of clinical records or whether there are further possibilities. In this study, we build on both of these methods. However, the examination using linguistic units in Japanese differs somewhat from that in English. In particular, clauses in Japanese have significantly different characteristics from clauses in English because they can be formed by simply adding a particle to a noun. Owing to this characteristic, Japanese clauses are often very short, at the phrase level. Accordingly, they cannot constitute a meaningful unit that carries concepts of medical significance. Therefore, we need a self-contained linguistic unit that has a longer span than a Japanese clause and expresses the smallest medically meaningful concept.

For this reason, we defined the clinical segment that spans several clauses but is shorter than a sentence. As exemplified in Table 2, segments may comprise clauses connected by a conjunction to form a medically meaningful unit; alternatively, they may be identical to clauses. For the statistical analysis, the clinical segment must be defined formally so that a splitter can automatically divide sentences into segments. We also need a corpus to train the splitter and evaluate its performance.

Table 2. Examples of the three types of units.

Units Examples
Sentence 認知症が進んでおり自宅退院は困難であること、 施設入居のためにはご家族の手続きが必要になることを説明
(We explained that it would be difficult to discharge her due to her advanced dementia, and that her family would need to make arrangements to move her into another facility.)
Segment 認知症が進んでおり SEP 自宅退院は困難であること、 SEP 施設入居のためにはご家族の手続きが必要になることを SEP 説明
(Due to her advanced dementia SEP it would be difficult to discharge SEP her family would need to make arrangements to move her into another facility SEP we explained)
Clause 認知症が進んでおり SEP 自宅退院は SEP 困難である SEP こと、 SEP 施設入居のためには SEP ご家族の手続きが必要になることを SEP 説明
(Due to her advanced dementia SEP discharge SEP it would be difficult SEP (verb nominalizer) SEP to move her into another facility SEP her family would need to make arrangements SEP we explained)
SEP indicates the boundary of either a segment or a clause.

When designing the clinical segment, we attempted to extract the atomic events related to medical care as a single unit. For example, statements such as “jaundice was observed in the patient’s conjunctiva,” “the patient was diagnosed with hepatitis,” and “a CT scan was performed” would lose their medical meaning if they are further split. In addition, medical events are the central statements in medical documents, whereas non-medical events play a relatively small role. Therefore, in this study, we considered only medical events as a component of self-contained units, and non-medical events were interpreted as noise. In previous studies, a self-contained unit was defined with respect to semantics. In our study, it was extended to a pragmatic unit based on domain knowledge. The details of the six segmentation rules are listed in Table 3.

Table 3. Segmentation rules.

  • Rule 1 Split at the end position of a predicate, by a comma or a verbal noun.

    This is the base rule for segmentation, and others are exception rules.

    (e.g., “絶食、 SEP 抗菌薬投与で SEP 肺炎は軽快。”)

    (e.g., “(After) fasting and SEP antibiotic use, SEP pneumonia was relieved.”)

  • Rule 2 If a segment is enclosed in parentheses, split a sentence at the positions of parentheses.

    To extract the clinical segment inside parentheses, parentheses sometimes become segment boundaries.

    (e.g., “画像で「 SEP 両側肺門部に陰影あり、 SEP CT で両肺に多彩な浸潤影を認め SEP 重症肺炎」 SEP として4 月10 日に入院。”)

    (e.g., “On imaging, “SEP there are bilateral hilar shadows and SEP widespread consolidation in both lungs on CT scan, SEP (suspected of) severe pneumonia” SEP (the patient was) admitted to the hospital on April 10.”)

  • Rule 3 Split content that includes disease name.

    Disease names are often written as diagnoses and play an important role in EHRs. Therefore, even if rule 1 does not match, the content that includes disease names should be split.

    (e.g., “肺炎疑いで SEP 当院紹介となった。”)

    (e.g., “Due to suspected pneumonia, SEP he was referred to our hospital.”)

  • Rule 4 Split examination results and their evaluation.

    Examination results and their evaluation are often written in a single sentence. Because the meaning of the examination results and their evaluation are clearly different, they should be divided even if rule 1 does not match.

    (e.g., “血清クレアチニンキナーゼは4512 U/L と SEP 高度に上昇していた。”)

    (e.g., “Serum creatinine kinase level was 4512 U/L, SEP which was highly elevated.”)

  • Rule 5 Do not split anything that is not related to the medical treatment.

    If the content is medically meaningless, its role is not important in its document, and it is not worthy of analysis. Therefore, the content with little relevance to medical treatment is not split, even if it matches rule 1.

    (e.g., “ケアマネジャーに同伴されて来院した。”)

    (e.g., “She came to our hospital accompanied by her care manager.”)

  • Rule 6 Do not split content that does not add meaning.

    If the content that supplements the meaning of the previous description does not add meaning (e.g., “…schedule to [VP] …” and “…continue the treatment …”), it is not split even if it matches rule 1.

    (e.g., “外来で抜糸を行う方針とした。”)

    (e.g., “It was planned to remove sutures as an outpatient.”)

    This includes contents where the semantic label does not change before and after the split.

    (e.g., “発熱、 盗汗、 体重減少、 喀痰、 血痰は否定。”)

    (e.g., “Fever, sweating, weight loss, sputum, and bloody sputum were not observed.”)

    It also includes contents that represent the passage of time or assumptions.

    (e.g., “抗菌薬開始後、 発熱 腹痛は徐々に改善し”)

    (e.g., “After starting antibiotic use, fever and abdominal pain gradually improved.”)

Based on this definition, we built a small corpus for the segmentation task. We used an independent dataset that included inpatient records and their discharge summaries for 108 cases. This corpus was built because annotation over the NHO data was restricted due to privacy concerns. The statistics of the resulting corpus are given in Table 1 (Our corpus). With respect to the inpatient records, the corpus is closer to real data than those of previous studies, except for the number of sentences in a document. For the discharge summary, there are no publicly available Japanese corpora besides the one we built. Because of the summarization process, the sentences contain more words and characters than the source inpatient records. The total number of segments in the corpus was 3,816, the average number of segments per sentence was 2.18, and the average number of segment boundaries per sentence was 1.18. The agreement rate between the participants of the segmentation task and an author was 0.82, which is sufficiently high for further study. The agreement rate is the accuracy of the workers’ labels against the correct boundaries annotated by an author. Throughout this task, we adopted the labels annotated by one of the authors.

3.3 Preprocessing

Table 4 shows a discharge summary—a type of medical record written by a Japanese physician. As illustrated, it is a noisy document: punctuation marks are missing, and line breaks appear in the middle of a sentence. Sentence boundaries may be denoted by spaces instead of punctuation marks. Therefore, for the further analysis of the three types of extraction units, we first need preprocessing for sentence splitting and segment splitting, which are shown in the upper part of Fig 1.

Table 4. Example of a discharge summary.

#1 細菌性髄膜炎 #1 Bacterial meningitis
4/20 〜 5/8 VCM 1250mg(q12h) 4/20-5/8 VCM 1250mg (q12h)
4/20 SBT/ABPC 1.5g単回 4/20 SBT/ABPC 1.5g single dose
4/20 〜 MEPM 2g(q8h) 4/20- MEPM 2g (q8h)
4/20 〜 4/23 デキサート 6.6mg(q6h) 4/20-4/23 Dexate 6.6mg (q6h)
4/20 〜 4/22 日赤ポログロビン 4/20-4/22 Nisseki polyglobin
4/20 腰椎穿刺1回目髄液糖定量 30 mg/dl(血中糖 95mg/dl) 細胞数 2475/μl. 4/20 1st lumbar puncture, cerebrospinal fluid glucose level 30 mg/dl (blood glucose level 95 mg/dl), cell count 2475/μl.
グラム染色するも明らかな菌が見つからず、 髄液培養でも優位な菌は培養されなかった。 Gram stain did not reveal any obvious bacteria, and cerebrospinal fluid culture also did not reveal any predominant bacteria.
細菌性髄膜炎に対するグラム染色の感度は60%程度であり、 培養に関しても感度は高くない。 The sensitivity of the gram stain for bacterial meningitis is about 60%, and the sensitivity of the culture is not high either.
また髄液中の糖はもう少し減るのではないだろうか。 Also, the glucose in the cerebrospinal fluid would have been slightly lower.
確定診断はつかないものの、 最も疑わしい疾患であった。 Although no definitive diagnosis could be made, bacterial meningitis was the most suspicious disease.
起因菌はMRSA, 腸内細菌等を広域にカバーするためバンコマイシン, メロペネム(髄膜炎dose)とした。 The causative organism was assumed to be MRSA, and vancomycin and meropenem (meningitis dose) were used to cover a wide range of enteric bacteria.

The left column shows the original Japanese texts, and the right column shows corresponding English translations.

Fig 1. Outline of our pipeline.

Fig 1

The top block is an example of the inpatient record, and the subsequent blocks indicate the chain of processes up to adding summarization labels.

For sentence splitting, we adopt two naive rules below to define the boundaries of a sentence:

  1. A statement that ends with a full-stop mark.

  2. A statement that ends with a new line and has no full-stop mark.

This is an oversimplification compared with the sentence-splitting tasks studied in medical NLP [45, 46]. However, since sentence splitting is not a focus of this study, we adopted this naive approach for its simplicity. In this process, we also used MeCab [47] as a tokenizer, with the mecab-ipadic-NEologd [48] and J-MeDic [49] (MANBYO 201905) dictionaries.
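A minimal sketch of these two rules follows, assuming the Japanese full-stop mark 「。」; the actual preprocessing may handle additional punctuation.

```python
import re

def split_sentences(text: str) -> list[str]:
    """Naive sentence splitter implementing the two rules above:
    (1) a statement ending with a full-stop mark is a sentence;
    (2) a statement ending with a newline and no full-stop mark
        is also treated as a sentence."""
    sentences = []
    for line in text.splitlines():
        # Split after every full-stop mark; keep the mark attached.
        for part in re.split(r"(?<=。)", line):
            if part.strip():
                sentences.append(part.strip())
    return sentences

# Example: two sentences on one line, one sentence ended by a newline.
print(split_sentences("絶食、抗菌薬投与で肺炎は軽快。退院方針とした。\n外来で抜糸を行う"))
```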

Next, sentences must be automatically split into clinical segments to efficiently analyze the huge dataset, NHO data. We compared several approaches to achieve the best splitting performance. In this study, we used 3,816 annotated segments in the corpus and applied six-fold cross-validation.

We used three rule-based splitters as baselines: a simple rule-based model that splits at full-stop marks (Full-stop), another simple rule-based model that splits at full-stop marks and verbs (Full-stop & Verb), and a complex rule-based model that splits at clause boundaries (CBAP) [50]. More precisely, the Full-stop & Verb model starts at a verb and splits immediately before the next noun, excluding non-independent nouns. The last model, which includes 332 rules manually built on morphemes, was used to confirm that clinical segments have different boundaries than traditional clauses.

We used SEGBOT [51], a machine learning method based on a pointer network architecture [52], for the splitting task. The method comprises three phases: encoding, decoding, and pointing. An overview is shown in Fig 2. Medical records may include local dialects and technical terms that are not listed in public language resources. Accordingly, the splitter must handle even unknown words. In our approach, each input word is first represented by a distributed representation using fastText [53, 54]. fastText is a model that learns vector representations of words from their surrounding contexts. Notably, fastText can produce vectors for unknown words by decomposing them into character n-grams. These vectors capture latent information about a language, such as word analogies and semantics.
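As a brief illustration of this property, the following sketch uses the fasttext Python package; the model file name is illustrative (e.g., a pretrained Japanese model from the fastText distribution [54]), and the queried term is a hypothetical out-of-vocabulary token.

```python
import fasttext

# Load a pretrained Japanese model (the file name here is illustrative).
model = fasttext.load_model("cc.ja.300.bin")

# Even a token unseen during training receives a vector, because
# fastText composes it from character n-gram embeddings.
vec = model.get_word_vector("髄膜炎dose")  # hypothetical unseen clinical term
print(vec.shape)  # e.g., (300,)
```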

Fig 2. Overview of SEGBOT.

Fig 2

The performance of the splitter methods is summarized in Table 5. The machine-learning-based SEGBOT outperformed the others, with an F1 score 0.257 points higher than that of the Full-stop & Verb model, the second best. Since its precision of 0.864 is higher than the inter-annotator agreement, it can be considered close to the upper bound. In addition, CBAP, a clause segmentation model, has a low F1 score of 0.411, suggesting that the definitions of the clause and the clinical segment are inherently different. The precision of the model that splits at full-stop marks (Full-stop) is only 0.521, indicating that clinical segment boundaries do not always coincide with full-stop marks and that context must be considered for splitting. Overall, the results suggest that machine learning is the best fit for the segmentation task. Thus, the data preprocessed by this method were used for the main experiment of this study.

Table 5. Results of the segmentation task.

Method Precision Recall F1 score
Full-stop 0.521 0.187 0.275
Full-stop & Verb 0.569 0.610 0.589
CBAP [50] 0.368 0.464 0.411
SEGBOT [51] 0.864 0.829 0.846

The numbers in bold indicate the best performing methods.

4 Main experiment

In this section, we describe our experimental settings and results of automatic summarization. First, we present the performance metric of the experiments; specifically, the ROUGE score is used as a quality measure for a summary. Next, we describe a summarization model used in the experiments, followed by the datasets used to train the model. Finally, we present the experiments and their results.

4.1 Evaluation metric

Measurement of the summarization quality must be automated to avoid costly manual evaluation. ROUGE [55] has been used as a standardized metric to measure the summarization quality in NLP tasks. Formally, ROUGE-N is an n-gram recall between a candidate summary and the reference summaries. When we have only one reference document, ROUGE-N is computed as follows:

$$\text{ROUGE-N} = \frac{\sum_{\text{gram}_n \in \text{Reference}} \text{Count}_{\text{match}}(\text{gram}_n)}{\sum_{\text{gram}_n \in \text{Reference}} \text{Count}(\text{gram}_n)}, \tag{1}$$

where $\text{Count}_{\text{match}}(\text{gram}_n)$ is the maximum number of n-grams that co-occur in a candidate summary and a reference summary.

ROUGE-L is the longest common subsequence (LCS) score between a candidate summary and the reference summaries, of which there may be several. As it can assess word relationships, it is generally considered a more context-aware evaluation measure than ROUGE-N. Specifically, ROUGE-L is computed as follows:

$$\text{Recall}_{\text{lcs}} = \frac{\sum_{i=1}^{u} \text{LCS}(r_i, C)}{\text{Reference tokens}}, \tag{2}$$

$$\text{Precision}_{\text{lcs}} = \frac{\sum_{i=1}^{u} \text{LCS}(r_i, C)}{\text{Summary tokens}}, \tag{3}$$

$$\text{ROUGE-L} = \frac{2 \cdot \text{Recall}_{\text{lcs}} \cdot \text{Precision}_{\text{lcs}}}{\text{Recall}_{\text{lcs}} + \text{Precision}_{\text{lcs}}}, \tag{4}$$

where $u$ is the number of reference sentences, and $\text{LCS}(r_i, C)$ is the LCS score of the union of the longest common subsequences between the reference sentence $r_i$ and $C$, where $C$ is the sequence of candidate summary sentences. For example, if $r_i = (w_1, w_2, w_3, w_4)$ and $C$ contains two sentences $c_1 = (w_1, w_2, w_6, w_7)$ and $c_2 = (w_1, w_8, w_4, w_9)$, the longest common subsequence of $r_i$ and $c_1$ is $(w_1, w_2)$, and that of $r_i$ and $c_2$ is $(w_1, w_4)$. The union of the longest common subsequences of $r_i$, $c_1$, and $c_2$ is $(w_1, w_2, w_4)$, and $\text{LCS}(r_i, C) = 3/4$.
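As a concrete reading of Eq (1), the following is a minimal sketch of ROUGE-N recall for a single reference; token lists are assumed to be pre-tokenized (e.g., by MeCab).

```python
from collections import Counter

def rouge_n_recall(candidate: list[str], reference: list[str], n: int = 1) -> float:
    """ROUGE-N recall (Eq 1): clipped n-gram matches over reference n-grams."""
    def ngrams(tokens: list[str]) -> Counter:
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    cand, ref = ngrams(candidate), ngrams(reference)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    # Count_match: n-grams co-occurring in both, clipped by reference counts.
    matched = sum(min(count, ref[g]) for g, count in cand.items())
    return matched / total
```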

4.2 Summarization model

In an extractive summarization task, the goal is to automatically assign a binary label to each unit of the input to indicate whether this unit should be included in the summary. Therefore, we adopted a single classification model to cover the three types of units.

Following Zhou et al. [15], we used a model based on BERT [56], as shown in Fig 3. BERT is a pretrained neural network whose parameters are learned from a large number of documents in advance, and it is known to achieve good accuracy even with few training samples. Unlike the original work, which adopted a general-domain BERT as the encoder for extractive summarization, we adopted UTH-BERT [57]. In contrast to previous Japanese BERT models [58–60], which were pretrained mainly on web data such as Wikipedia, UTH-BERT was pretrained on a large number of Japanese health records and is expected to perform better on documents in the target domain.

Fig 3. Overview of classification model for clinical segments.

Fig 3

Formally, let the $i$-th sentence contain $l$ segments $S_i = (s_{i,1}, s_{i,2}, \ldots, s_{i,l})$. The $j$-th segment with $k$ words in $S_i$ is denoted by $s_{i,j} = (w_{i,j,1}, w_{i,j,2}, \ldots, w_{i,j,k})$. We add [CLS] and [SEP] tokens to the boundaries between sentences. After applying the UTH-BERT encoder, the token vectors are represented as $(w^{BT}_{i,j,1}, w^{BT}_{i,j,2}, \ldots, w^{BT}_{i,j,k})$. Next, we apply average pooling at the segment level. The pooled representation $s_{i,j}$ is formulated as follows:

$$s_{i,j} = \frac{1}{k} \sum_{m=1}^{k} w^{BT}_{i,j,m}. \tag{5}$$

Note that segments and clauses do not include the [CLS] and [SEP] tokens in average pooling. Subsequently, we apply a segment-level transformer [61] to capture the relationships among the units for extracting summaries. The model predicts the summarization probability from the transformer outputs as follows:

$$\hat{S} = \mathrm{Transformer}(S), \tag{6}$$

$$p(s_{i,j}) = \sigma(W_o \hat{s}_{i,j} + b_o), \tag{7}$$

where $S = (s_{1,1}, s_{1,2}, \ldots, s_{i,j})$ is the sequence of segment representations input to the transformer, and $\hat{S} = (\hat{s}_{1,1}, \hat{s}_{1,2}, \ldots, \hat{s}_{i,j})$ is the output sequence of the transformer. The training objective of the model is the binary cross-entropy loss given the gold label $y_{i,j}$ and the predicted probability $p(s_{i,j})$.

This model does not need to change its structure depending on the input unit. For clauses, the spans of the segments are replaced by those of the clauses. In the case of sentences, average pooling is not performed; instead, we input the [CLS] token representation into the transformer.
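A minimal PyTorch sketch of this classifier follows. It is a sketch under assumptions: the hyperparameters are illustrative, and the encoder is any HuggingFace-style model returning last_hidden_state (e.g., UTH-BERT), not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class UnitExtractor(nn.Module):
    """Sketch of the extractive classifier in Fig 3: token vectors from a
    pretrained encoder are average-pooled per unit (Eq 5), contextualized
    by a unit-level transformer (Eq 6), and scored by a sigmoid head (Eq 7)."""

    def __init__(self, encoder, hidden: int = 768, layers: int = 2, heads: int = 8):
        super().__init__()
        self.encoder = encoder  # e.g., a loaded UTH-BERT model
        block = nn.TransformerEncoderLayer(hidden, heads, batch_first=True)
        self.unit_transformer = nn.TransformerEncoder(block, layers)
        self.out = nn.Linear(hidden, 1)

    def forward(self, input_ids, attention_mask, unit_spans):
        # Token-level contextual vectors; batch size 1 is assumed here,
        # and [CLS]/[SEP] positions lie outside the unit spans.
        tokens = self.encoder(input_ids=input_ids,
                              attention_mask=attention_mask).last_hidden_state
        # Average pooling over each (start, end) token span (Eq 5).
        units = torch.stack([tokens[0, s:e].mean(dim=0) for s, e in unit_spans])
        ctx = self.unit_transformer(units.unsqueeze(0))     # Eq (6)
        return torch.sigmoid(self.out(ctx)).squeeze(-1)     # Eq (7)

# Training uses binary cross-entropy against the pseudo labels of Section 4.3:
# loss = nn.functional.binary_cross_entropy(probs, gold_labels)
```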

4.3 Training data

Our model requires an entire document for training. However, our corpus could be too small to train the model and would compromise its robustness. Accordingly, we used the NHO data as training data by assigning pseudo labels. Following previous studies [15, 16], we used ROUGE scores to automatically assign gold labels to the three units. We thus used the ROUGE score both to create the gold labels and to evaluate the model. This may seem unusual, but it is a common approach in previous studies. As ROUGE correlates with human scores [62], the best summary can be obtained by building a system that maximizes this score at evaluation time, regardless of whether the score was used during training. The labeling steps were as follows.

First, we applied the splitter created in Section 3.3 to the NHO dataset and split it into clauses and clinical segments. In this manner, we easily obtained a larger dataset. We used CBAP as a splitter for clauses and SEGBOT as a splitter for clinical segments.

Second, we measured ROUGE-2 F1 for each unit of the source documents (against the discharge summaries), which were then sorted in descending order of their scores. Thus, we obtained a list of units that were important for our summary.

Third, we selected the units from the topmost part of the list. At this stage, we stopped selecting units when the result exceeded 1,200 characters, which was the average length of the summaries in the NHO data.

Finally, we assigned positive labels to the selected units. The entire process yielded the gold standard for the training and evaluation without manual annotation. We randomly selected 1,000 documents each for the development and test sets, and we used the remaining 22,641 documents for the training data.
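The labeling procedure can be sketched as follows. Here rouge2_f1 is a hypothetical helper (e.g., ROUGE-2 precision and recall combined from the recall function sketched in Section 4.1), and 1,200 characters is the budget stated above.

```python
def assign_pseudo_labels(units: list[str], summary: str,
                         budget_chars: int = 1200) -> list[tuple[str, bool]]:
    """Greedy pseudo-labeling: rank source units by ROUGE-2 F1 against the
    discharge summary, then mark top units positive until the selected
    text exceeds the character budget."""
    order = sorted(range(len(units)),
                   key=lambda i: rouge2_f1(units[i], summary),  # hypothetical helper
                   reverse=True)
    selected, total = set(), 0
    for i in order:
        selected.add(i)
        total += len(units[i])
        if total > budget_chars:  # stop once the budget is exceeded
            break
    return [(u, i in selected) for i, u in enumerate(units)]
```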

4.4 Experiments and results

In this experiment, we used the three contextual units instead of n-gram units and evaluated their impact on summarization performance to determine which unit performs best. The results of summarization using the three types of units are shown in Table 6. Comparing the three units, the model with clinical segments scored the highest in ROUGE-1, ROUGE-2, and ROUGE-L, outperforming sentences and clauses in summarizing inpatient records.

Table 6. Results of the summarization task.

Units ROUGE-1 ROUGE-2 ROUGE-L
Sentence 31.91 2.50 7.93
Segment 36.15 3.12 8.26
Clause 25.18 1.30 6.62

The numbers in bold indicate the best performing methods.

Table 2 shows that a sentence can contain multiple events and thus has room for further segmentation. Sentences are clearly longer than clinical segments and clauses, but the relation between clinical segments and clauses is less obvious. Because ROUGE-1 and ROUGE-2 are measured on the basis of 1-grams and 2-grams, respectively, smaller units are more advantageous in the ROUGE evaluation. Table 7 shows the statistical relation of the three types of units. The first column shows how many units a sentence contains on average. The second and third columns show the average number of tokens and characters in each type of unit. The results suggest that segments are longer than clauses on average. Nevertheless, the difference between a clause and a segment is small, at least in the average number of characters. Accordingly, the relationship between clause and clinical segment granularity is worth a more detailed analysis.

Table 7. Granularity of three units.

Units Units/Sentence Tokens/Unit Characters/Unit
Sentence 1 8.98 18.06
Segment 2.18 6.42 11.83
Clause 2.75 5.74 10.74

The numbers in bold indicate the smallest units.

We verified the ordering of the three types of linguistic units by an additional experiment on the word-wise relation between clauses and clinical segments. For any two such units in a sentence, there are four possible relationships (Fig 4): “Equal,” where the two match exactly; “Inclusive,” where a segment completely includes a clause; “Included,” where a clause completely includes a segment; and “Overlap,” where the two partially overlap.

Fig 4. The four types of relationship between clause and clinical segment.

Fig 4
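These four cases can be stated precisely over word-offset spans. The sketch below is illustrative; it assumes half-open (start, end) intervals and pairs that share at least one word.

```python
def relation(segment: tuple[int, int], clause: tuple[int, int]) -> str:
    """Classify a (segment, clause) pair of word spans into the four
    relationships of Fig 4; spans are half-open (start, end) offsets."""
    s0, s1 = segment
    c0, c1 = clause
    if (s0, s1) == (c0, c1):
        return "Equal"
    if s0 <= c0 and c1 <= s1:
        return "Inclusive"  # the segment completely includes the clause
    if c0 <= s0 and s1 <= c1:
        return "Included"   # the clause completely includes the segment
    return "Overlap"        # partial overlap of the two spans
```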

We obtained statistics of the four relationships from all inpatient records and discharge summaries in the NHO data. The results are shown in Table 8. We found that 59.6% of the pairs have the same boundaries, which is influenced by the many short sentences that have no internal boundaries. “Inclusive” accounted for 20.0% of the relations. The sum of “Equal” and “Inclusive” was 79.6%, about six times “Included,” which accounted for only 13.1%. These figures give the detailed dynamics of the relation between segments and clauses, which appeared only as 11.83 versus 10.74 characters/unit in Table 7. Although the difference in average length between segments and clauses is small, there is a significant difference in their relative sizes when each corresponding pair of actual units is compared.

Table 8. The Relationships between clauses and clinical segments.

Relation types Equal Inclusive Included Overlap
Number of relationships 6,687,046 2,239,839 1,469,423 821,663
(59.6%) (20.0%) (13.1%) (7.3%)

In sum, clinical segments exhibited the best ROUGE performance, and their size lies between that of sentences and clauses. Combining the results in this section, we conclude that the segment unit introduced in this paper is the optimal unit, lying between the sentence and the clause.

5 Discussion

The result that extractive summarization with sentences is less effective than with finer granularities is consistent with previous studies [15, 16]. Given the consistency of these results, this could be a general property that should be exploited in further summarization tasks in NLP research.

In the summarization of medical documents, the experimental results using linguistic units suggest that physicians create discharge summaries by capturing clinical concepts from the inpatient records. On the other hand, sentences and clauses performed poorly, probably because they are chunked using only syntactic information and do not capture medical concepts. Accordingly, automatic summarization in the medical field requires not only syntactic information but also high-level semantic and pragmatic information grounded in domain knowledge. Clinical segments are reasonable candidates for atomic units that carry medical information. Therefore, clinical segments can potentially be used to quantify the quality of medical documentation and to acquire more detailed medical knowledge expressed in texts.

The limitations of the current study and analysis are twofold: language and cultural dependency. First, Japanese grammar and Japanese medical practices are very different from those of European languages and healthcare systems, and there can be differences in the description, summarization, and evaluation processes. Accordingly, this extractive pipeline might be applicable only to Japanese clinical settings. In particular, the clinical segment was defined for Japanese, and a labeled corpus exists only for Japanese, so it is not directly applicable to other languages. However, the idea of capturing medical concepts may be useful for other languages. Also, more research at various institutions would be preferable to confirm the generalizability of our results, although our study used the largest multi-institutional health records archive in Japan. Second, in some countries with different cultural backgrounds, dictation is used to produce clinical records and their summaries [63]. In this regard, Japanese hospitals do not use dictation to produce discharge summaries, which could result in frequent copying and pasting from sources to summaries. This custom could have contributed to the use of extractive text in discharge summaries in Japan. The analysis of the influence of this customary difference is left for future work.

6 Conclusion

In this study, we explored the best granularity for the automatic summarization of medical documents. The results indicated that clinically motivated semantic units, larger than clauses, are the best granularity for extractive summarization.

Other contributions of this study are summarized as follows. First, we defined clinical segments that capture clinical concepts and showed that they can be reliably split off automatically by a machine-learning-based method. Second, we identified the optimal granularity of extractive summarization that can be used for automated summarization of medical documents. Third, we built a Japanese parallel corpus of medical records with inpatient data and discharge summaries.

The results of this study suggest that the clinical segments we introduced are useful for automated summarization in the medical domain. This provides an important insight into how physicians write discharge summaries. Previous studies have used other entities to analyze medical documents [64–66]. Our results will help to provide more effective assistance in the writing process and automated acquisition of clinical knowledge.

Acknowledgments

The authors would like to thank Dr. Yoshinobu Kano and Dr. Mizuki Morita for their cooperation in our previous research that served as the foundation for this study. We also thank Ms. Mai Tagusari, Ms. Nobuko Nakagomi, and Dr. Hiroko Miyamoto, who served as annotators.

Data Availability

The NHO data we used in this study is a subset of NCDA (NHO Clinical Data Archives), an archive of actual patient health records replicated from national hospitals throughout Japan. Because of the nature of the data, access to the dataset is strictly restricted to research approved by the Ethics Review Board of the National Hospital Organization, Japan, and the data cannot be released publicly. However, once the IRB approves, researchers can obtain the dataset in the same manner as we did for this study. The overview and contact address of the archive are available at: https://nho.hosp.go.jp/cnt1-1_000070.html.

Funding Statement

The work was funded by the Center for Advanced Intelligence Project, RIKEN, Japan. The funder had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

  • 1. Arndt BG, Beasley JW, Watkinson MD, Temte JL, Tuan WJ, Sinsky CA, et al. Tethered to the EHR: Primary Care Physician Workload Assessment Using EHR Event Log Data and Time-Motion Observations. The Annals of Family Medicine. 2017;15(5):419–426. doi: 10.1370/afm.2121
  • 2. Leslie Kane MA. Medscape Physician Compensation Report 2019; 2019 [cited 2021 Aug 6]. Available from: https://www.medscape.com/slideshow/2019-compensation-overview-6011286.
  • 3. Ammenwerth E, Spötl HP. The Time Needed for Clinical Documentation versus Direct Patient Care. A Work-sampling Analysis of Physicians’ Activities. Methods of Information in Medicine. 2009;48(01):84–91. doi: 10.3414/ME0569
  • 4. Hirsch JS, Tanenbaum JS, Lipsky Gorman S, Liu C, Schmitz E, Hashorva D, et al. HARVEST, a Longitudinal Patient Record Summarizer. Journal of the American Medical Informatics Association. 2014;22(2):263–274. doi: 10.1136/amiajnl-2014-002945
  • 5. Feblowitz JC, Wright A, Singh H, Samal L, Sittig DF. Summarization of Clinical Information: A Conceptual Model. Journal of Biomedical Informatics. 2011;44(4):688–699. doi: 10.1016/j.jbi.2011.03.008
  • 6. Aramaki E, Miura Y, Tonoike M, Ohkuma T, Masuichi H, Ohe K. TEXT2TABLE: Medical Text Summarization System Based on Named Entity Recognition and Modality Identification. Proceedings of the BioNLP 2009 Workshop. 2009; p. 185–192.
  • 7. Liang J, Tsou CH, Poddar A. A Novel System for Extractive Clinical Note Summarization using EHR Data. Proceedings of the 2nd Clinical Natural Language Processing Workshop. 2019; p. 46–54.
  • 8. Reeve LH, Han H, Brooks AD. The Use of Domain-Specific Concepts in Biomedical Text Summarization. Information Processing & Management. 2007;43(6):1765–1776. doi: 10.1016/j.ipm.2007.01.026
  • 9. Diaz D, Cintas C, Ogallo W, Walcott-Bryant A. Towards Automatic Generation of Context-Based Abstractive Discharge Summaries for Supporting Transition of Care. AAAI Fall Symposium 2020 on AI for Social Good. 2020.
  • 10. Shing HC, Shivade C, Pourdamghani N, Nan F, Resnik P, Oard D, et al. Towards Clinical Encounter Summarization: Learning to Compose Discharge Summaries from Prior Notes. ArXiv. 2021;abs/2104.13498.
  • 11. Adams G, Alsentzer E, Ketenci M, Zucker J, Elhadad N. What’s in a Summary? Laying the Groundwork for Advances in Hospital-Course Summarization. Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 2021; p. 4794–4811.
  • 12. Moen H, Heimonen J, Murtola LM, Airola A, Pahikkala T, Terävä V, et al. On Evaluation of Automatically Generated Clinical Discharge Summaries. Proceedings of the 2nd European Workshop on Practical Aspects of Health Informatics. 2014;1251:101–114.
  • 13. Moen H, Peltonen LM, Heimonen J, Airola A, Pahikkala T, Salakoski T, et al. Comparison of Automatic Summarisation Methods for Clinical Free Text Notes. Artificial Intelligence in Medicine. 2016;67:25–37. doi: 10.1016/j.artmed.2016.01.003
  • 14. Alsentzer E, Kim A. Extractive Summarization of EHR Discharge Notes. ArXiv. 2018;abs/1810.12085.
  • 15. Zhou Q, Wei F, Zhou M. At Which Level Should We Extract? An Empirical Analysis on Extractive Document Summarization. Proceedings of the 28th International Conference on Computational Linguistics. 2020; p. 5617–5628.
  • 16. Cho S, Song K, Li C, Yu D, Foroosh H, Liu F. Better Highlighting: Creating Sub-Sentence Summary Highlights. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing. 2020; p. 6282–6300.
  • 17. Erkan G, Radev DR. LexRank: Graph-Based Lexical Centrality as Salience in Text Summarization. Journal of Artificial Intelligence Research. 2004;22(1):457–479. doi: 10.1613/jair.1523
  • 18. Mihalcea R, Tarau P. TextRank: Bringing Order into Text. Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing. 2004; p. 404–411.
  • 19. Haonan W, Yang G, Yu B, Lapata M, Heyan H. Exploring Explainable Selection to Control Abstractive Summarization. Proceedings of the Thirty-Fifth AAAI Conference on Artificial Intelligence. 2021;(15):13933–13941.
  • 20. Dong Y, Wang S, Gan Z, Cheng Y, Cheung JCK, Liu J. Multi-Fact Correction in Abstractive Text Summarization. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing. 2020; p. 9320–9331.
  • 21. Cao M, Dong Y, Wu J, Cheung JCK. Factual Error Correction for Abstractive Summarization Models. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing. 2020; p. 6251–6258.
  • 22. Sakishita M, Kano Y. Inference of ICD Codes from Japanese Medical Records by Searching Disease Names. Proceedings of the Clinical Natural Language Processing Workshop (ClinicalNLP). 2016; p. 64–68.
  • 23. Lee HG, Sholle E, Beecy A, Al’Aref S, Peng Y. Leveraging Deep Representations of Radiology Reports in Survival Analysis for Predicting Heart Failure Patient Mortality. Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 2021; p. 4533–4538.
  • 24. Lu Q, Nguyen TH, Dou D. Predicting Patient Readmission Risk from Medical Text via Knowledge Graph Enhanced Multiview Graph Convolution. Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval. 2021; p. 1990–1994.
  • 25. Komaki S, Muranaga F, Uto Y, Iwaanakuchi T, Kumamoto I. Supporting the Early Detection of Disease Onset and Change Using Document Vector Analysis of Nursing Observation Records. Evaluation & the Health Professions. 2021;44(4):436–442. doi: 10.1177/01632787211014270
  • 26. Nakatani H, Nakao M, Uchiyama H, Toyoshiba H, Ochiai C. Predicting Inpatient Falls Using Natural Language Processing of Nursing Records Obtained From Japanese Electronic Medical Records: Case-Control Study. JMIR Medical Informatics. 2020;8(4):e16970. doi: 10.2196/16970
  • 27. Katsuki M, Narita N, Matsumori Y, Ishida N, Watanabe O, Cai S, et al. Preliminary Development of a Deep Learning-based Automated Primary Headache Diagnosis Model Using Japanese Natural Language Processing of Medical Questionnaire. Surgical Neurology International. 2020;11. doi: 10.25259/SNI_827_2020
  • 28. Gurulingappa H, Mateen-Rajpu A, Toldo L. Extraction of Potential Adverse Drug Events from Medical Case Reports. Journal of Biomedical Semantics. 2012;3(1):1–10. doi: 10.1186/2041-1480-3-15
  • 29. Mashima Y, Tamura T, Kunikata J, Tada S, Yamada A, Tanigawa M, et al. Using Natural Language Processing Techniques to Detect Adverse Events from Progress Notes due to Chemotherapy. Cancer Informatics. 2022;21. doi: 10.1177/11769351221085064
  • 30. Lee SH. Natural Language Generation for Electronic Health Records. NPJ Digital Medicine. 2018;1(1):1–7. doi: 10.1038/s41746-018-0070-0
  • 31. MacAvaney S, Sotudeh S, Cohan A, Goharian N, Talati I, Filice RW. Ontology-Aware Clinical Abstractive Summarization. Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval. 2019; p. 1013–1016.
  • 32. Liu X, Xu K, Xie P, Xing E. Unsupervised Pseudo-labeling for Extractive Summarization on Electronic Health Records. Machine Learning for Health (ML4H) Workshop at NeurIPS 2018. 2018.
  • 33. Hunter J, Freer Y, Gatt A, Logie R, McIntosh N, Van Der Meulen M, et al. Summarising Complex ICU Data in Natural Language. AMIA Annual Symposium Proceedings. 2008;2008:323.
  • 34. Portet F, Reiter E, Gatt A, Hunter J, Sripada S, Freer Y, et al. Automatic Generation of Textual Summaries from Neonatal Intensive Care Data. Artificial Intelligence. 2009;173(7):789–816. doi: 10.1016/j.artint.2008.12.002
  • 35. Goldstein A, Shahar Y. An Automated Knowledge-based Textual Summarization System for Longitudinal, Multivariate Clinical Data. Journal of Biomedical Informatics. 2016;61:159–175. doi: 10.1016/j.jbi.2016.03.022
  • 36. Brown TB, Mann B, Ryder N, Subbiah M, Kaplan J, Dhariwal P, et al. Language Models are Few-Shot Learners. ArXiv. 2020;abs/2005.14165.
  • 37. Goodwin T, Savery M, Demner-Fushman D. Towards Zero-Shot Conditional Summarization with Adaptive Multi-Task Fine-Tuning. Findings of the Association for Computational Linguistics: EMNLP 2020. 2020; p. 3215–3226.
  • 38. Johnson AE, Pollard TJ, Shen L, Li-Wei HL, Feng M, Ghassemi M, et al. MIMIC-III, a Freely Accessible Critical Care Database. Scientific Data. 2016;3(1):1–9. doi: 10.1038/sdata.2016.35
  • 39. Voorhees EM, Hersh WR. Overview of the TREC 2012 Medical Records Track. Proceedings of the Twentieth Text REtrieval Conference. 2012.
  • 40. Uzuner Ö, Goldstein I, Luo Y, Kohane I. Identifying Patient Smoking Status from Medical Discharge Records. Journal of the American Medical Informatics Association. 2008;15(1):14–24. doi: 10.1197/jamia.M2408
  • 41. Aramaki E, Morita M, Kano Y, Ohkuma T. Overview of the NTCIR-12 MedNLPDoc Task. Proceedings of NTCIR-12. 2016.
  • 42. Aramaki E. GSK2012-D Dummy Electronic Health Record Text Data [Internet]. Gengo-Shigen-Kyokai; 2013 Feb [cited 2021 Aug 6]. Available from: https://www.gsk.or.jp/catalog/gsk2012-d.
  • 43. National Hospital Organization [Internet]. 診療情報集積基盤 (clinical data archive; in Japanese); 2015 Aug 5– [cited 2021 Aug 6]. Available from: https://nho.hosp.go.jp/cnt1-1_000070.html.
  • 44. Vladutz G. Natural Language Text Segmentation Techniques Applied to the Automatic Compilation of Printed Subject Indexes and for Online Database Access. Proceedings of the First Conference on Applied Natural Language Processing. 1983; p. 136–142.
  • 45. Kreuzthaler M, Schulz S. Detection of Sentence Boundaries and Abbreviations in Clinical Narratives. BMC Medical Informatics and Decision Making. 2015;15(2):1–13. doi: 10.1186/1472-6947-15-S2-S4
  • 46. Griffis D, Shivade C, Fosler-Lussier E, Lai AM. A Quantitative and Qualitative Evaluation of Sentence Boundary Detection for the Clinical Domain. AMIA Joint Summits on Translational Science Proceedings. 2016; p. 88–97.
  • 47. Kudo T. MeCab: Yet Another Part-of-Speech and Morphological Analyzer. Version 0.996 [software]; 2006 Mar 26 [cited 2021 Aug 6]. Available from: https://taku910.github.io/mecab.
  • 48. Sato T, Hashimoto T, Okumura M. Implementation of a Word Segmentation Dictionary Called Mecab-ipadic-NEologd and Study on How to Use It Effectively for Information Retrieval. Proceedings of the Twenty-third Annual Meeting of the Association for Natural Language Processing. 2017; p. NLP2017–B6–1.
  • 49. Ito K, Nagai H, Okahisa T, Wakamiya S, Iwao T, Aramaki E. J-MeDic: A Japanese Disease Name Dictionary based on Real Clinical Usage. Proceedings of the Eleventh International Conference on Language Resources and Evaluation. 2018.
  • 50. Maruyama T, Kashioka H, Kumano T, Tanaka H. Development and Evaluation of Japanese Clause Boundaries Annotation Program. Journal of Natural Language Processing. 2004;11(3):39–68.
  • 51. Li J, Sun A, Joty SR. SegBot: A Generic Neural Text Segmentation Model with Pointer Network. Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence. 2018; p. 4166–4172.
  • 52. Vinyals O, Fortunato M, Jaitly N. Pointer Networks. Advances in Neural Information Processing Systems 28. 2015; p. 2692–2700.
  • 53. Bojanowski P, Grave E, Joulin A, Mikolov T. Enriching Word Vectors with Subword Information. Transactions of the Association for Computational Linguistics. 2017;5:135–146. doi: 10.1162/tacl_a_00051
  • 54. Grave E, Bojanowski P, Gupta P, Joulin A, Mikolov T. Learning Word Vectors for 157 Languages. Proceedings of the Eleventh International Conference on Language Resources and Evaluation. 2018.
  • 55. Lin CY. ROUGE: A Package for Automatic Evaluation of Summaries. Proceedings of the Workshop on Text Summarization Branches Out. 2004; p. 74–81.
  • 56. Devlin J, Chang MW, Lee K, Toutanova K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 2019; p. 4171–4186.
  • 57. Kawazoe Y, Shibata D, Shinohara E, Aramaki E, Ohe K. A Clinical Specific BERT Developed Using a Huge Japanese Clinical Text Corpus. PLOS ONE. 2021;16(11):1–11. doi: 10.1371/journal.pone.0259763
  • 58. Kurohashi-Kawahara Laboratory. ku_bert_japanese [software]; 2019 [cited 2021 Aug 6]. Available from: https://nlp.ist.i.kyoto-u.ac.jp/index.php?ku_bert_japanese.
  • 59. Inui Laboratory. BERT models for Japanese text [software]; 2019 [cited 2021 Aug 6]. Available from: https://github.com/cl-tohoku/bert-japanese.
  • 60. National Institute of Information and Communications Technology. NICT BERT 日本語 Pre-trained モデル (NICT BERT pre-trained models for Japanese) [software]; 2020 [cited 2021 Aug 6]. Available from: https://alaginrc.nict.go.jp/nict-bert/index.html.
  • 61. Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, et al. Attention is All you Need. Advances in Neural Information Processing Systems 31. 2017; p. 6000–6010.
  • 62. Liu F, Liu Y. Correlation between ROUGE and Human Evaluation of Extractive Meeting Summaries. Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies. 2008; p. 201–204.
  • 63. Cannon J, Lucci S. Transcription and EHRs: Benefits of a Blended Approach. Journal of American Health Information Management Association. 2010;81(2):36–40.
  • 64. Skeppstedt M, Kvist M, Nilsson GH, Dalianis H. Automatic Recognition of Disorders, Findings, Pharmaceuticals and Body Structures from Clinical Text: An Annotation and Machine Learning Study. Journal of Biomedical Informatics. 2014;49:148–158. doi: 10.1016/j.jbi.2014.01.012
  • 65. Wu Y, Lei J, Wei WQ, Tang B, Denny JC, Rosenbloom ST, et al. Analyzing Differences between Chinese and English Clinical Text: A Cross-Institution Comparison of Discharge Summaries in Two Languages. Studies in Health Technology and Informatics. 2013;192:662–666.
  • 66. Pradhan S, Elhadad N, South BR, Martinez D, Christensen L, Vogel A, et al. Evaluating the State of the Art in Disorder Recognition and Normalization of the Clinical Narrative. Journal of the American Medical Informatics Association. 2015;22(1):143–154. doi: 10.1136/amiajnl-2013-002544
PLOS Digit Health. doi: 10.1371/journal.pdig.0000099.r001

Decision Letter 0

Tom J Pollard, Imon Banerjee

24 May 2022

PDIG-D-21-00099

Exploring Optimal Granularity for Extractive Summarization of Unstructured Health Records: Analysis of the Largest Multi-Institutional Archive of Health Records in Japan

PLOS Digital Health

Dear Dr. Okumura,

Thank you for submitting your manuscript to PLOS Digital Health. After careful consideration, we feel that it has merit but does not fully meet PLOS Digital Health's publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

Please submit your revised manuscript by Jul 23 2022 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at digitalhealth@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pdig/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

* A rebuttal letter that responds to each point raised by the editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

* A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

* An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

We look forward to receiving your revised manuscript.

Kind regards,

Tom J. Pollard, Ph.D.

Academic Editor

PLOS Digital Health

Journal Requirements:

1. Please amend your detailed Financial Disclosure statement. This is published with the article. It must therefore be completed in full sentences and contain the exact wording you wish to be published.

State what role the funders took in the study. If the funders had no role in your study, please state: “The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.”

2. Please update your Competing Interests statement. If you have no competing interests to declare, please state: “The authors have declared that no competing interests exist.”

3. In the online submission form, you indicated that “Our corpus, created in this article, is available upon request. The NHO data is not publicly available for privacy reason.”. All PLOS journals now require all data underlying the findings described in their manuscript to be freely available to other researchers, either 1. In a public repository, 2. Within the manuscript itself, or 3. Uploaded as supplementary information.

This policy applies to all data except where public deposition would breach compliance with the protocol approved by your research ethics board. If your data cannot be made publicly available for ethical or legal reasons (e.g., public availability would compromise patient privacy), please explain your reasons by return email and your exemption request will be escalated to the editor for approval. Your exemption request will be handled independently and will not hold up the peer review process, but will need to be resolved should your manuscript be accepted for publication. One of the Editorial team will then be in touch if there are any issues.

4. Please provide separate figure files in .tif or .eps format only, and ensure that all files are under our size limit of 10MB.

For more information about how to convert your figure files please see our guidelines: https://journals.plos.org/digitalhealth/s/figures

5. Please ensure that you provide a single, cohesive .tex source file for your LaTeX revision. You may upload this file as the item type 'LaTeX Source File.' As stated in the PLOS template, your references should be included in your .tex file (not submitted separately as .bib or .bbl). Please also ensure that you are making any formatting changes to both your .tex file and the PDF of your manuscript. If you have any questions, please contact Latex@plos.org. You can find our LaTeX guidelines here: https://journals.plos.org/digitalhealth/s/latex

Additional Editor Comments (if provided):

The study is interesting and explores an important topic. Our reviewers - in particular Reviewer 1 - have raised some points that I think should be addressed prior publication. I would be grateful if you could submit a new version of the paper after responding to these points.

[Note: HTML markup is below. Please do not edit.]

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Does this manuscript meet PLOS Digital Health’s publication criteria? Is the manuscript technically sound, and do the data support the conclusions? The manuscript must describe methodologically and ethically rigorous research with conclusions that are appropriately drawn based on the data presented.

Reviewer #1: Yes

Reviewer #2: Yes

--------------------

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: I don't know

--------------------

3. Have the authors made all data underlying the findings in their manuscript fully available (please refer to the Data Availability Statement at the start of the manuscript PDF file)?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception. The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: No

--------------------

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS Digital Health does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

--------------------

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: An insightful presentation of the state of the art for clinical health records and discharge summaries in Japanese hospitals as well as the application of natural language processing (NLP) on said clinical text data. NLP in Japanese, let alone for clinical discharge summaries, presents a unique challenge and we applaud the authors for gaining access to the largest clinical text database and the efforts for conducting this study.

However, there were a few key elements that the paper could be revised in order to showcase and highlight the novelty of the study. Suggestions are not limited to but include:

● The paper did not describe the unique difficulties of conducting NLP in Japanese, such as the lack of spaces between characters and words. This is further exacerbated by the scarcity of open-source resources to address it, and including such limitations would further emphasize the quality of this study (see the tokenization sketch after this list).

● The authors do not mention any other NLP-related models or research available in Japan, and the study does not cite many related studies, such as the following, which would have been helpful for comparison:

→ Preliminary development of a deep learning-based automated primary headache diagnosis model using Japanese natural language processing of medical questionnaire: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7827501/

→ Predicting Inpatient Falls Using Natural Language Processing of Nursing Records Obtained From Japanese Electronic Medical Records: Case-Control Study: https://medinform.jmir.org/2020/4/e16970/

→ A clinical specific BERT developed using a huge Japanese clinical text corpus: https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0259763

● Related to the aforementioned point, because of the lack of cited related studies, the authors do not explain how different or novel their research is in comparison to other Japanese or overseas models.

● Furthermore, the authors mention that the results might be applicable only to Japanese clinical settings; yet, since no other examples are given, this claim cannot be generalized, and further examples, background information, and explanation are required to support it.

● The authors mention that their model requires entire documents for training, and that the number of documents in their corpus was too small for training the model; however, they do not explain why this is the case. Including examples from Japan, if not from abroad, would help readers understand their model’s novelty.

● The authors mention that, in this experiment, they used three contextual units instead of n-gram units to evaluate their impact on summarization performance and to determine which unit performs best. While this is interesting, the authors once again do not describe why they did this or why it is meaningful.

● The authors discuss that this study used the largest multi-institutional health records archive in Japan and that it would therefore be worthwhile to validate the results in multiple languages. However, if there are in fact no other comparable databases, perhaps the authors should first validate on other Japanese clinical text databases, or on other Asian languages with similar grammatical structures, before applying the approach to languages that may have more open-source resources available.

● The authors raised an interesting point that Japanese hospitals do not use dictation to produce discharge summaries, which could result in frequent copying and pasting from sources to summaries. However, the paper would be stronger if the authors provided examples of how this enhances (or does not enhance) the model, expanded further on the clinical significance of, or rationale for, dictating or not dictating, and discussed whether there are differences in clinical outcomes or clinical workflow, in order to explore the strengths and weaknesses of their model.

● Finally (and related to the point above), the authors do not clearly discuss why this study is useful, especially its application to real-world clinical settings.
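
To ground the first point above, the following sketch illustrates why whitespace-based splitting fails for Japanese and why a morphological analyzer is needed before any token-level NLP can begin. This is an editorial illustration only, not code from the study; it assumes the fugashi package (a Python wrapper for the MeCab analyzer) and the unidic-lite dictionary are installed (pip install fugashi unidic-lite).

    # Japanese has no spaces between words, so str.split() cannot find
    # word boundaries; a morphological analyzer (here MeCab via fugashi)
    # is required to segment the text first.
    from fugashi import Tagger

    text = "退院時サマリを自動生成する"  # "automatically generate a discharge summary"
    print(text.split())   # ['退院時サマリを自動生成する'] -- one unsplittable chunk

    tagger = Tagger()     # loads the installed UniDic dictionary
    print([word.surface for word in tagger(text)])
    # e.g. ['退院', '時', 'サマリ', 'を', '自動', '生成', 'する']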

While the English and Japanese translations provided are also interesting, some of the translations seem non-native or slightly incorrect, and could therefore use proofreading and reconsideration (e.g., 髄液糖定量 - "glucose determination" → "cerebrospinal fluid glucose level" or "glycorrhachia").

Given all these considerations, priority for PLOS Digital Health will depend on significant revisions, and only after the authors convince the editor that their model has significance in real-world clinical settings in comparison to pre-existing models.

Reviewer #2: The study aims to identify the optimal granularity between 3 proposed granularities for performing extractive summary of Japanese clinical texts.

The paper is clear and well-written.

Comments:

1) Taking into consideration that optimal granularity for extractive summarization had previously been explored, the innovation of the research is the choice of the clinical domain. However, it is not clear whether the results differ from those reached in other, non-clinical domains.

2) The fact that "20-31% of the sentences in discharge summaries were created by copying and pasting" led the authors to conclude that a certain amount of content can be automatically generated by extractive summarization.

However, since most of the summary needs to be generated by other methods (e.g., abstractive summarization), the fact that extractive summarization cannot produce a full summary raises questions about its use in the summarization process of clinical summaries at all. It would be worth clarifying whether other methods would make any use of the 20% that can be generated by extractive summarization.
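
As context for the extractive side of this discussion, the sketch below shows the general shape of greedy extractive selection scored with ROUGE-1 recall, the metric used in the paper. It is not the authors' pipeline; the greedy strategy, the unit lists, and the max_units limit are illustrative assumptions, and the units could be sentences, clinical segments, or clauses.

    # Minimal sketch: greedily pick pre-tokenized units (sentences,
    # segments, or clauses) that most improve ROUGE-1 recall against a
    # reference summary. Illustrative only, not the authors' method.
    from collections import Counter

    def rouge1_recall(candidate, reference):
        """Fraction of reference unigrams covered by the candidate tokens."""
        ref, cand = Counter(reference), Counter(candidate)
        overlap = sum(min(cand[t], n) for t, n in ref.items())
        return overlap / max(len(reference), 1)

    def greedy_extract(units, reference, max_units=5):
        """Greedily add the unit that most improves ROUGE-1 recall."""
        selected, tokens, pool = [], [], list(units)
        while pool and len(selected) < max_units:
            best = max(pool, key=lambda u: rouge1_recall(tokens + u, reference))
            if rouge1_recall(tokens + best, reference) <= rouge1_recall(tokens, reference):
                break  # no remaining unit improves the score
            selected.append(best)
            tokens += best
            pool.remove(best)
        return selected

    # Toy usage with hypothetical, already-tokenized units:
    units = [["fever", "on", "admission"], ["no", "rash"], ["started", "antibiotics"]]
    reference = ["fever", "on", "admission", "started", "antibiotics"]
    print(greedy_extract(units, reference))
    # [['fever', 'on', 'admission'], ['started', 'antibiotics']]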

3) The background should be further expanded; for example, there is no mention of abstractive methods that create summaries without hallucinations. The authors only mention that abstractive summarization "often produces fake contents that do not match the reference summary". A more complete review of previous studies should also include abstractive methods, such as those implemented in the CliniText system and BT-45, which did not produce any hallucinations.

Minor comments:

1) Typos: "In this study,105we adopt both of the two methods. in106Japanese"

2)

"However, it remains unclear how the summaries should be generated from the unstructured source"

It is unclear what the authors meant by "unstructured source". Images? Free text from other clinical experts? Raw data?

--------------------

6. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

Do you want your identity to be public for this peer review? If you choose “no”, your identity will remain anonymous but your review may still be made public.

For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

--------------------

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

PLOS Digit Health. doi: 10.1371/journal.pdig.0000099.r003

Decision Letter 1

Tom J Pollard, Imon Banerjee

28 Jul 2022

Exploring Optimal Granularity for Extractive Summarization of Unstructured Health Records: Analysis of the Largest Multi-Institutional Archive of Health Records in Japan

PDIG-D-21-00099R1

Dear Dr. Okumura,

We are pleased to inform you that your manuscript 'Exploring Optimal Granularity for Extractive Summarization of Unstructured Health Records: Analysis of the Largest Multi-Institutional Archive of Health Records in Japan' has been provisionally accepted for publication in PLOS Digital Health.

Before your manuscript can be formally accepted you will need to complete some formatting changes, which you will receive in a follow-up email from a member of our team. 

Please note that your manuscript will not be scheduled for publication until you have made the required changes, so a swift response is appreciated.

IMPORTANT: The editorial review process is now complete. PLOS will only permit corrections to spelling, formatting or significant scientific errors from this point onwards. Requests for major changes, or any which affect the scientific understanding of your work, will cause delays to the publication date of your manuscript.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they'll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact digitalhealth@plos.org.

Thank you again for supporting Open Access publishing; we are looking forward to publishing your work in PLOS Digital Health.

Best regards,

Imon Banerjee

Section Editor

PLOS Digital Health

***********************************************************

Many thanks for your detailed response to the reviewer comments. I am satisfied that the concerns have been addressed and would be happy to move ahead with publication.

Reviewer Comments (if any, and for reference):

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Supplementary Materials

    Attachment

    Submitted filename: Rebuttal letter-20220627.pdf

    Data Availability Statement

    The NHO data we used in this study are a subset of the NCDA (NHO Clinical Data Archives), an archive of actual patient health records replicated from national hospitals throughout Japan. Because of the nature of the data, access to the dataset is strictly restricted to research approved by the Ethics Review Board of the National Hospital Organization, Japan, and the data cannot be released publicly. However, once the IRB approves, researchers can obtain the dataset in the same manner as we did for this study. The overview and the contact address of the archive are available at https://nho.hosp.go.jp/cnt1-1_000070.html.

