Abstract
There is often a mismatch between patients’ literacy levels and the educational materials written for them. In response, we are developing an online medical text simplification editor. In this paper, we describe generating grammar simplification rules from a large parallel corpus (N=141,500) containing original sentences and their simplified variants. We algorithmically identified grammatical transformations between sentences (N=26,600) and used distributional characteristics in two corpora to select transformations with the broadest application and the least ambiguity. This resulted in a top set of 146 rules. Two experts evaluated 20 representative rules reflecting four conditions (long/short crossed with medium/strong), each with 5 example sentences. Generally, we found that the rules are helpful for guiding simplification. Using a 5-point Likert scale (5=best), stronger rules scored higher for ease of application (4.11), overall helpfulness (4.40) and usefulness of examples (4.05). Rule length did not affect the expert scores. The grammar simplification rules are being integrated into our text editor.
Introduction
Seventy-seven million adults in the United States have basic or below basic health literacy1. Low health literacy can severely affect the long-term participation of patients in actively managing their health2. The complexity of medical text is the primary challenge and can be attributed to both difficult sentence construction as well as difficult terminology3. Because text is one of the most common forms used to disseminate information, it is crucial to provide patients with simple, easy-to-understand medical text.
There have been many approaches to text simplification. Early work consisted mostly of hand-crafted rules4. More recently, data-driven approaches have been developed that leverage large corpora and other resources to simplify text5,6. These approaches employ deeper analysis of parse structures and hybrid simplification methods, which rely on expanding hand-crafted rules via machine learning algorithms7,8. Although these algorithms provide significantly more flexibility than previous rule-based approaches, the quality of the output is not good enough for many real-world applications, including patient education. Additionally, they often fall short in specific domains6, such as medicine, where full, end-to-end text simplification requires domain-specific texts. Automated text simplification systems focused on the medical domain often have restricted application due to the challenges associated with generalized text simplification10.
Our overall goal is to create a text simplification tool that medical writers can use to create text that both appears easier and is easier to understand. We address simplification at four levels: word level, sub-sentence level, sentence level, and document level. In this paper, we examine sub-sentence-level simplification, specifically the transformation of complex grammar structures into simpler versions within sentences. We utilize a corpus of aligned difficult and simple sentence pairs, from which we extracted a large set of grammar transformation rules that we then narrowed down based on distributional characteristics. A representative set of these rules was evaluated in a study by experts who used them to simplify text.
Methods
Dataset Creation
Corpus
We use two corpora to generate a set of grammar simplification rules: Newsela and Wikipedia. We opted to use these larger, general-domain corpora rather than limiting ourselves to a subset of medical and healthcare text because we focus on grammatical transformations, which avoid many of the domain-specific lexical challenges by generalizing over the syntactic representation of sentences. Additionally, no sentence-aligned resources currently exist for exclusively medical text.
To generate simplification rules we used the Newsela corpus, which contains 141,500 sentence pairs automatically extracted from 1,130 news articles that were simplified by professionals for different reading difficulties7. Each difficult article has been rewritten by editors for up to four different reading levels for pre-college classroom use. For our study, we did not distinguish between the different simplified levels and treated them all as simpler variants of the original. Table 1 shows a sample of the sentence pairs from the dataset.
Table 1.
Example aligned sentence pairs from the Newsela corpus.
Difficult Sentence | Simple Sentence |
---|---|
Two-thirds of people polled said women should be allowed in fighting units. | Two out of three people in a poll said women should be allowed to fight. |
But an important pediatricians group says parents need to know that unrestricted media use can have serious effects. | But an important pediatricians group says too much Internet use can have serious results. |
People who served in the military will still receive medical help at veterans hospitals. | People who served in the military will still get treated at veterans hospitals. |
If a positive identification is made, officials from pertinent service branch notify the family. | If someone is identified, officials notify that person’s family. |
The announcement comes even though its holding capacity for captured wild horses has nearly reached its limit at 50,000 animals nationwide. | The bureau has announced the plan even though it has run out of room to house the horses. |
Wikipedia is used to limit the generated rules to those that are broadly applicable; for this we used a corpus of 60K aligned documents from English Wikipedia and Simple English Wikipedia9. This corpus is significantly larger than the Newsela corpus and helps better measure the general applicability of the rules.
Sentence Filtering
Many of the aligned sentence pairs had a large difference in length, due to the corpus creation process in which the sentences were automatically aligned from the original articles7. Some of these sentence pairs were mistakes in the sentence alignment process, while others represented large transformations between the original and simplified sentence. Both cases provide data that is inappropriate for learning sub-sentence rules: the former are errors, and the latter represent larger transformations beyond the scope of this work. We therefore filtered the sentence pairs, removing those pairs where the simple sentence was shorter than 73.5% of the difficult sentence’s length (this threshold was decided empirically by tuning on a small development set). Additionally, we removed any pairs where the difficult sentence contained an ending punctuation character (e.g., period, question mark, exclamation mark) that did not appear in its aligned simple sentence. Based on a sample of 100 manually evaluated pairs, these two heuristics were reasonably accurate at identifying problematic sentences (85% accuracy), which ensures that enough good sentence pairs remain while removing the examples most likely to lead to unhelpful rules. After filtering, 73,150 sentence pairs remained for learning grammar rules.
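The following is a minimal sketch of these two filtering heuristics, assuming the sentence pairs are available as (difficult, simple) string tuples; the function and variable names are illustrative.

```python
# Minimal sketch of the two sentence-pair filtering heuristics.
END_PUNCTUATION = (".", "?", "!")

def keep_pair(difficult: str, simple: str, ratio: float = 0.735) -> bool:
    """Return True if an aligned pair is suitable for rule learning."""
    # Heuristic 1: drop pairs where the simple sentence is shorter than
    # 73.5% of the difficult sentence (likely misalignments or rewrites).
    if len(simple) < ratio * len(difficult):
        return False
    # Heuristic 2: drop pairs where the difficult sentence contains ending
    # punctuation that never appears in the simple sentence.
    for mark in END_PUNCTUATION:
        if mark in difficult and mark not in simple:
            return False
    return True

pairs = [
    ("People who served in the military will still receive medical help at veterans hospitals.",
     "People who served in the military will still get treated at veterans hospitals."),
]
filtered = [pair for pair in pairs if keep_pair(*pair)]
```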
We parsed the sentence pairs with the Stanford CoreNLP 3.7.0 package11 and extracted the parse tree. Figure 1 shows an example sentence, the full parse of the sentence, and the final parse tree that we utilized as input to our rule learning algorithm. We did not consider the words themselves for rule generation since we were interested in generating grammar simplification rules independent of the lexical components. Sentences where the Stanford parser failed due to formatting or non-supported characters were removed (2.13% of the sentences).
Figure 1.
The parsing process, performed using Stanford CoreNLP11.
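As a rough illustration of this step, the sketch below parses a sentence with a locally running CoreNLP server (via NLTK’s client) and strips the words from the resulting tree, keeping only the bracketed grammatical structure. The server URL and the grammar_only rendering are assumptions for illustration, not part of CoreNLP itself.

```python
# Sketch of the parsing step. Assumes a Stanford CoreNLP server is running
# locally on port 9000 (started from the CoreNLP 3.7.0 distribution).
from nltk.parse.corenlp import CoreNLPParser

parser = CoreNLPParser(url="http://localhost:9000")

def grammar_only(tree) -> str:
    """Render a parse tree without its words, e.g. (NP(DT)(NN))."""
    if isinstance(tree, str):      # a leaf is the word itself; drop it
        return ""
    children = "".join(grammar_only(child) for child in tree)
    return f"({tree.label()}{children})"

tree = next(parser.raw_parse("The animals would not come soon enough."))
print(grammar_only(tree))          # lexical-free parse tree string
```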
To better generalize and avoid learning grammar simplification rules that were too detailed, we aggregated several parts-of-speech (POS). For example, NNS (noun, plural), NNP (proper noun, singular) and NNPS (proper noun, plural) were all collapsed into a single category NN (noun). Table 2 shows the list of all POS that were aggregated.
Table 2.
List of parts of speech that were aggregated to better generalize from the original parse trees.
Label (POS) | Original (POS) | Meaning |
---|---|---|
JJ | JJ | Adjective |
JJR | Adjective, comparative | |
JJS | Adjective, superlative | |
RB | RB | Adverb |
RBR | Adverb, comparative | |
RBS | Adverb, superlative | |
NN | NN | Noun, singular mass |
NNS | Noun, plural | |
NNP | Proper noun, singular | |
NNPS | Proper noun, plural | |
VB | VB | Verb, base form |
VBD | Verb, past tense | |
VBG | Verb, gerund or present participle | |
VBN | Verb, past participle | |
VBP | Verb, non-3rd person singular present | |
VBZ | Verb, 3rd person singular present |
Aggregating the POS increased the number of sentences to which rules are applicable and allowed for better generalization during rule learning, resulting in a 2% decrease in the number of distinct rules learned.
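A minimal sketch of this aggregation, applying the Table 2 mapping to the labels of a bracketed parse-tree string (the regular-expression approach is illustrative):

```python
# Sketch of the POS aggregation from Table 2: fine-grained Penn Treebank
# tags are collapsed to a base label before rule learning.
import re

POS_AGGREGATION = {
    "JJR": "JJ", "JJS": "JJ",                    # adjectives
    "RBR": "RB", "RBS": "RB",                    # adverbs
    "NNS": "NN", "NNP": "NN", "NNPS": "NN",      # nouns
    "VBD": "VB", "VBG": "VB", "VBN": "VB",       # verbs
    "VBP": "VB", "VBZ": "VB",
}

def aggregate_tree(tree_string: str) -> str:
    """Replace every label in a bracketed tree string with its base tag."""
    return re.sub(r"[A-Z]+",
                  lambda match: POS_AGGREGATION.get(match.group(), match.group()),
                  tree_string)

print(aggregate_tree("(NP(DT)(JJS)(NNS))"))   # -> (NP(DT)(JJ)(NN))
```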
Rule Generation
Identifying Grammatical Transformations
Given the aligned, parsed difficult and easy sentences, we first identified sub-trees in the parse tree where a transformation was made from the difficult to the easy sentence. Our goal for this project was to identify high-precision rules that could be incorporated into our text simplification tool. Therefore, we designed the algorithm to be biased towards high-quality rules rather than a large number of rules.
To identify the grammatical changes, we processed the parse trees of the aligned sentence pairs and then extracted the sub-tree structure where the changes occurred. Figure 2 shows a flow chart of the algorithm with an example pair of parse trees. The input is the parse trees of the aligned difficult and simple sentences. The first step is to identify where the change occurs in the parse tree by finding the first and last places where the parse tree strings differ. This approach treats multiple separate changes as one single change; however, it is computationally efficient and, after our sentence filtering step, only a small number of sentences had multiple changes. In the example below, the two strings are identical from the start until after the “VB” and from the end until after the period.
Figure 2.
Grammatical change identification.
The second step is to extract the parse sub-trees where the change occurs in the two sentences. In each sentence, the fragment where the parse structure changes spans some text, i.e., in the example below it is “come soon enough” in the difficult sentence and “wait” in the simple sentence. For each of these changes, we extracted the lowest parse sub-trees that spanned the entire changed text. In many cases, these were exactly the portions of the parse tree string that differed, though sometimes they enclosed a larger portion. In the example below, the text is spanned exactly by the differing parse tree structures and we extract the candidate rule “(ADVP (RB (RB))) → (VP (VB))”.
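A minimal sketch of the string-diff step, assuming the grammar-only parse strings described above; mapping the differing region back to the lowest enclosing sub-trees (step two) is noted in a comment but not implemented here.

```python
# Sketch of step one: locate the changed region by finding the longest
# common prefix and suffix of the two bracketed parse strings.
def changed_spans(difficult: str, simple: str):
    """Return the differing substring of each parse string."""
    limit = min(len(difficult), len(simple))
    prefix = 0
    while prefix < limit and difficult[prefix] == simple[prefix]:
        prefix += 1
    suffix = 0
    while (suffix < limit - prefix
           and difficult[-1 - suffix] == simple[-1 - suffix]):
        suffix += 1
    return (difficult[prefix:len(difficult) - suffix],
            simple[prefix:len(simple) - suffix])

d = "(S(NP(NN))(VP(VB)(ADVP(RB(RB))))(..))"
s = "(S(NP(NN))(VP(VB)(VP(VB)))(..))"
# Raw differing regions; step two then expands these to the lowest parse
# sub-trees spanning the changed text, yielding the candidate rule.
print(changed_spans(d, s))
```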
This algorithm generates pairs of a difficult sub-tree and the matching simple sub-tree for each sentence pair. We are not interested in sentence-level changes for this project, particularly since their generalizability is minimal, so we did not create rules if the smallest sub-tree change encompassed the entire sentence. Using this filter resulted in 32,700 sub-tree transformation pairs for learning rules.
Table 3 shows three pairs of aligned parse tree fragments extracted using our approach where the same difficult sub-tree was transformed into three different simpler sub-trees. This diversity in changes shows why it is critical to further analyze the changes to identify those difficult sub-trees that have consistent simplifications.
Table 3.
Sample sub-tree pairs.
Difficult Sub-tree | Simple Sub-tree |
---|---|
(ADJP(JJ)(PP(IN)(NP(JJ)(NN)))) | (ADJP(RB)(JJ)) |
 | (ADJP(RB)(JJ)) |
 | (VP(VB)(NP(JJ)(NN))(PP(IN)(NP(JJ)(NN)))) |
Rule Set Generation
We define a rule as each unique combination of a left-hand side (LHS) difficult sub-tree that should be changed to a simplified right-hand side (RHS) sub-tree. All unique grammatical pairs from the previous step are candidate rules. For example, from Table 3 we could generate three separate rules, all with the same LHS. However, not all rules are useful; thus, the final step is to filter the rules to identify those that are the most useful and most accurate. These final rules will then be used in our editor.
Figure 3.
For each rule, we provided a descriptive form and a highlighted example.
First, we removed rules consisting of a single POS on the difficult side, i.e., rules that spanned a single word, as we are not interested in lexical rules for this project. Then, we grouped candidate rules by unique left-hand sides and calculated three metrics (frequency, percentage of conformity, and entropy) to evaluate the quality of the remaining rules. These metrics were all calculated using the Newsela corpus.
Rule frequency is the number of times a unique LHS and RHS combination appeared in the list of candidate rules. For instance, a rule with a frequency of five was seen in five different sentence pairs in the corpus.
We defined the percentage of conformity of a rule as the frequency of its unique LHS and RHS combination divided by the total number of appearances of the LHS. Examples of the percentage of conformity for five rules are shown in Table 4. Each rule’s percentage of conformity score shows in what percentage of cases the LHS turns into that specific RHS. The higher a rule’s conformity score, the more consistently that LHS was transformed into that particular RHS. For instance, “(NP(DT)(JJ)(NN))” is a difficult sub-tree that appears a total of 8 times in the Newsela corpus and turns into “(NP(DT)(NN)(POS))” 6/8 times. Therefore, for a new difficult sentence that contains this difficult sub-tree, the “(NP(DT)(NN)(POS))” simplification should be selected over other options as it has the highest percentage of conformity.
While conformity measures how precise a particular LHS/RHS combination is, entropy measures how much a particular LHS varies. For each LHS, entropy was calculated as:

$$H(\mathrm{LHS}) = -\sum_{i=1}^{n} p_i \log_2 p_i$$

where the LHS has n different simple sub-trees associated with it and p_i is the percentage of conformity for the combination of that LHS with the i-th RHS. The entropy value for each difficult sub-tree aggregates the different percentage of conformity values into one metric per LHS. The lower the entropy score, the more straightforward it is to select the simple sub-tree to use in the simplification. For example, the entropy for a LHS that turns into RHS1 half the time and RHS2 half the time is 1, whereas the entropy for a LHS that turns into RHS1 80% of the time and RHS2 20% of the time is 0.72. (A sketch showing how all three metrics can be computed follows Table 4.)
Table 4.
Example percentage of conformity scores.
Difficult Sub-tree | Simple Sub-tree | Frequency | Conformity (%) |
---|---|---|---|
(NP(JJ)(NN)(NN)) | (NP(NP(NN)(NN))(,,)(SBAR(WHNP(WDT))(S(VP(VB)(VP(VB)))))) | 1 | 25% |
 | (NP(DT)(JJ)(NN)) | 2 | 50% |
 | (NP(NP(NN)(NN))(,,)(SBAR(WHNP(WDT))(S(VP(VB)(ADJP(JJ)))))) | 1 | 25% |
(NP(DT)(JJ)(NN)) | (NP(DT)(NN)(POS)) | 6 | 75% |
 | (NP(NP(DT)(NN))(PP(IN)(NP(NN)))) | 2 | 25% |
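The following sketch shows how the three metrics can be computed from the candidate rules, assuming each candidate is an (LHS, RHS) pair of sub-tree strings; the toy data mirrors the second group in Table 4.

```python
# Sketch of the three rule-quality metrics: frequency, percentage of
# conformity, and entropy over the RHS alternatives of each LHS.
import math
from collections import Counter

def rule_metrics(candidates):
    """Map each (lhs, rhs) rule to (frequency, conformity, lhs_entropy)."""
    rule_freq = Counter(candidates)                    # per LHS->RHS pair
    lhs_freq = Counter(lhs for lhs, _ in candidates)   # total LHS appearances
    metrics = {}
    for (lhs, rhs), freq in rule_freq.items():
        conformity = freq / lhs_freq[lhs]
        entropy = -sum((f / lhs_freq[lhs]) * math.log2(f / lhs_freq[lhs])
                       for (l, _), f in rule_freq.items() if l == lhs)
        metrics[(lhs, rhs)] = (freq, conformity, entropy)
    return metrics

# Toy data mirroring Table 4: the LHS appears 8 times in total.
candidates = (6 * [("(NP(DT)(JJ)(NN))", "(NP(DT)(NN)(POS))")]
              + 2 * [("(NP(DT)(JJ)(NN))", "(NP(NP(DT)(NN))(PP(IN)(NP(NN))))")])
for rule, (freq, conf, ent) in rule_metrics(candidates).items():
    print(rule[1], freq, f"{conf:.0%}", f"{ent:.2f}")  # 75%/25%, entropy 0.81
```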
Rule Set Filtering
Of the 32,700 pairs, 26,600 were unique and were considered as candidate rules. However, many of these rules contained larger sub-trees and only appeared once in the set of pairs. Furthermore, there were candidate rules with higher frequency but also high entropy: rules whose LHS turned into many different RHSs inconsistently. We categorized all rules that either occurred only once or had an entropy of 2 or more as ‘weak’ and discarded them, since they either would not generalize well or would be hard to apply in practice. This discarded set contained 19,863 rules.
The remaining rules were categorized into three additional categories based on their frequency of occurrence and percentage of conformity: ‘medium’, ‘strong’ and ‘very strong’. The naming of these categories is arbitrary, but we relied on them for identification purposes. The categories are nested subsets of each other: the weak set contains every rule, medium rules additionally require a lower entropy (<1.5) and contain 1,400 rules, strong rules contain 125 rules, and there are 13 very strong rules. The percentage of conformity and frequency thresholds selected for each category are shown in Table 5, and a sketch of this categorization follows the table.
Table 5.
Strength categorization of rules.
Strength | Frequency | Percentage of Conformity | Number of Rules |
---|---|---|---|
Weak | >0 | >0% | 26,600 |
Medium | >=2 | >34% | 1400 |
Strong | >=3 | >40% | 125 |
Very Strong | >=5 | >50% | 13 |
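A minimal sketch of this categorization using the Table 5 thresholds, assuming the per-rule metrics computed earlier; applying the entropy cutoff from the medium level upward follows our description in the text.

```python
# Sketch of the strength categories from Table 5. The entropy < 1.5 cutoff
# applies from the medium category upward, per the text.
def strength(freq: int, conformity: float, entropy: float) -> str:
    if entropy < 1.5 and conformity > 0.50 and freq >= 5:
        return "very strong"
    if entropy < 1.5 and conformity > 0.40 and freq >= 3:
        return "strong"
    if entropy < 1.5 and conformity > 0.34 and freq >= 2:
        return "medium"
    return "weak"

print(strength(freq=6, conformity=0.75, entropy=0.81))   # -> very strong
```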
To maximize the applicability of the final set of rules to new text, we removed all rules with coverage less than 0.00064 on the Wikipedia corpus. We chose the Wikipedia corpus since it is large, contains general content, and differs from the corpus the rules were learned from. The threshold was chosen based on manual examination of a small set of rules. After this filtering, 146 rules remained, each with an entropy score below 1.5, a percentage of conformity >34%, and a frequency >=2.
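A minimal sketch of this filter, under the assumption that coverage is measured as the fraction of Wikipedia sentences whose grammar-only parse string contains the rule’s difficult sub-tree (the exact definition is one plausible reading, not spelled out above):

```python
# Sketch of the Wikipedia coverage filter. Assumes coverage is the fraction
# of Wikipedia sentences whose grammar-only parse string contains the LHS.
COVERAGE_THRESHOLD = 0.00064   # manually tuned threshold

def coverage(lhs: str, wikipedia_parses) -> float:
    hits = sum(1 for parse in wikipedia_parses if lhs in parse)
    return hits / len(wikipedia_parses)

def filter_by_coverage(rules, wikipedia_parses):
    """Keep only rules whose LHS is common enough in Wikipedia."""
    return [(lhs, rhs) for lhs, rhs in rules
            if coverage(lhs, wikipedia_parses) >= COVERAGE_THRESHOLD]
```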
Example Rules
Table 6 shows a set of four example rules from our final set of rules along with the sentence pairs they were extracted from. The LHS of the rule occurred in the difficult sentence while the RHS of the rule occurred in the simple sentence.
Table 6.
Example rules extracted.
Parse rule (very strong) | (VP(VB)(VP(VB))) → (VP(VB)) |
---|---|
Normal Sentence | In cognitive therapy, it’s believed, that clients change their behavior only by altering their cognitions. |
Simple Sentence | In cognitive therapy, they believe, that clients change their behavior only by altering their cognitions. |
Parse rule (very strong) | (NP(DT)) → (NP(DT)(NN)) |
Normal Sentence | This was added on to treatment with beta blockers. |
Simple Sentence | This therapy was added on to treatment with beta blockers. |
Parse rule (strong) | (NP(DT)(NN)(NN)) → (NP(DT)(NN)) |
Normal Sentence | Acupressure points for neck pain relief are situated at various key crossings on body. |
Simple Sentence | Acupressure Points for Neck Pain Relief are situated at various spots on body. |
Parse rule (medium) | (ADJP(JJ)(PP(IN)(NP(NN)))) → (VP(VB)(PP(IN)(NP(NN)))) |
Normal Sentence | Centrum is rich in alkaline ingredients that makes the vitamins harder to absorb |
Simple Sentence | Centrum has a lot of alkaline ingredients that makes the vitamins harder to absorb |
Rule Evaluation
We conducted a user study to evaluate how well the rules guide text simplification. Two experts who were not involved in any of the development steps of the project performed the evaluation. Both have extensive experience (18 and 23 years, respectively) in digital literacy, qualitative data analysis, and medical research. This study indicates how well these rules can support text simplification in our online text editor.
Applying a Rule
In the medical text simplification tool, it is important to present the rules in a useful form for the writer. To facilitate this, a descriptive form of each rule was generated. Instead of providing the rule to the user in raw form with the LHS and RHS parse sub-trees, i.e. “(NP (NP (DT) (NN) (NN)) (PP (IN) (NP (NN) (NN)))) turns into (NP (DT) (NN) (NN)),” we created a descriptive form, i.e. “Remove prepositional phrase from noun phrase.” Note that this descriptive form of a rule was the only human-generated portion of the rule creation.
To further help writers apply a rule to a new difficult sentence, we also display an example application of the rule taken from the Newsela corpus, with the portion where the rule applies highlighted. Figure 3 shows an example.
In the simplification tool, our algorithm parses the text to check whether any of the rules apply. If so, the sentence portion that should be simplified is highlighted and the window in Figure 3 pops up below it.
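A minimal sketch of this matching step, assuming rules are stored with their descriptive form and example and that sentences are rendered with the same grammar-only convention as above; the sample rule description and example sentence are hypothetical, included only to show the data shape.

```python
# Sketch of how the editor can flag rule matches. Rules are assumed to be
# stored as {lhs: (rhs, description, example)}; the entry below is made up.
def find_applicable_rules(parse_string: str, rules: dict):
    """Return rules whose difficult sub-tree occurs in this parse string."""
    hits = []
    for lhs, (rhs, description, example) in rules.items():
        position = parse_string.find(lhs)
        if position != -1:
            # the match offset lets the editor highlight the affected span
            hits.append((lhs, rhs, description, example, position))
    return hits

rules = {
    "(NP(DT)(JJ)(NN))": ("(NP(DT)(NN)(POS))",
                         "Rewrite the modified noun phrase as a possessive",
                         "the federal government -> the government's"),
}
print(find_applicable_rules("(S(NP(DT)(JJ)(NN))(VP(VB))(..))", rules))
```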
Study Design
We had two main goals for our study. First, to validate that the rules learned can be used to guide grammatical text simplification of new sentences. Second, to understand how different rule characteristics affect the usefulness of the rules. This can help further decide which rules to deploy into the text simplification tool and direct future research into learning grammatical transformation rules. We examined two rule characteristics: the strength of the rule and the length of the rule.
We used a 2x2 design with strength and length as the independent variables in the experiment. We compared “strong” rules (which included both strong and very strong rules) with medium rules. We compared short versus long rules, where length was determined by the character count of the difficult sub-tree; the cutoff for long/short rules was the median length of 16 characters. We tested 20 rules (5 per condition) with 5 examples each (N=100). The examples were randomly chosen sentences from the Newsela and/or the Wikipedia datasets where the rule applied.
To align the study design with the format of our text simplification tool, the descriptive form of each rule was provided to the experts together with an example application of the rule.
Procedure & Metrics
Both experts were given the same document where each page contained one simplification rule in descriptive form and 5 sentences to simplify using that rule. The part of a difficult sentence that needed simplification (i.e., the part that grammatically matched the rule’s LHS) was underlined and its font color was changed to red. Each expert was given 20 such pages, one for each rule. The order of the pages was randomized.
For each sentence, the experts simplified the sentence guided by the rule and then rated the difficulty of the process on a 5-point Likert scale (1 = “Very Difficult to Simplify” to 5 = “Very Easy to Simplify”). Upon completion of the 5 sentences for a rule, the experts rated the overall helpfulness of the simplification rule and of the examples on a 5-point Likert scale (1 = “least helpful” to 5 = “most helpful”). We also provided the option to add comments for each rule and example. Finally, we asked whether more examples would have been useful for the simplification. This is important primarily from an implementation perspective, but it can also serve as a secondary measure of how difficult a given rule is to apply, as we expect that experts will likely need more examples for more difficult rules.
Results
We analyzed the ease-of-use ratings from all individual examples simplified by the experts (Table 7) (N=195: 20 rules x 5 examples x 2 experts, with one expert not providing ratings for one rule and its 5 examples). A high score (5-point Likert scale) indicates that the rule was easy to use. Overall, we found that strong rules received a slightly higher score (4.11) than medium rules (3.88), and long rules received a slightly higher score (4.09) than short rules (3.90). However, a 2x2 ANOVA showed that these differences were not significant (a minimal sketch of this analysis follows Table 7).
Table 7.
Ease-of-use (5-point Likert, with higher being easier to apply).
Rule Length | Rule Strength: Strong | Rule Strength: Medium | Mean |
---|---|---|---|
Long | 4.36 | 3.82 | 4.09 |
Short | 3.86 | 3.93 | 3.90 |
Mean | 4.11 | 3.88 | |
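The following is a minimal sketch of such an analysis, assuming the per-example ratings are collected in a CSV with columns rating, strength, and length; the file and column names are illustrative, not from our materials.

```python
# Sketch of the 2x2 ANOVA over expert ratings using statsmodels.
# ratings.csv and its column names are hypothetical.
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

df = pd.read_csv("ratings.csv")    # one row per rated example
model = smf.ols("rating ~ C(strength) * C(length)", data=df).fit()
print(anova_lm(model, typ=2))      # main effects and interaction
```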
Next, we analyzed the overall ratings by the experts per rule (helpfulness of the rules and the examples). Table 8 shows the results for overall helpfulness, i.e., the helpfulness rating for the rule after completing the five examples (N=40: 20 rules x 2 experts). Strong rules received a higher score (4.40) than medium rules (3.45). A 2x2 ANOVA showed this to be a significant main effect (F(1,36) = 4.991, p = .03). The difference in ratings between long (3.90) and short (3.95) rules was small and the effect was not significant. The interaction was similarly not significant.
Table 8.
Rule helpfulness (5-point Likert, with higher being better).
Rule Length | Rule Strength: Strong | Rule Strength: Medium | Mean |
---|---|---|---|
Long | 4.50 | 3.30 | 3.90 |
Short | 4.30 | 3.60 | 3.95 |
Mean | 4.40 | 3.45 | |
Table 9 shows the results for overall helpfulness of the examples. We see the same pattern emerge. Strong rules were rated higher (4.05) than medium rules (3.25). Longer rules were rated lower (3.55) than shorter rules (3.75). A 2x2 ANOVA showed these results were not significant, although there was a strong trend for rule strength (p = .066).
Table 9.
Example helpfulness (5-point Likert, with higher being better).
Rule Length | Rule Strength: Strong | Rule Strength: Medium | Mean |
---|---|---|---|
Long | 4.00 | 3.10 | 3.55 |
Short | 4.10 | 3.40 | 3.75 |
Mean | 4.05 | 3.25 | |
In response to our additional question, the experts indicated they would have liked to see additional examples 35% of the time overall: 40% of the time for medium rules and 20% of the time for strong rules.
Conclusion
Grammar simplification rules based on the parsed structure of a sentence can be automatically generated using a data-driven approach. From a dataset of 141,500 sentence pairs, we extracted 1,400 medium rules and a top set of 146 rules. Based on expert evaluation, the rules are applicable to new difficult sentences and can make text simplification easier by guiding the simplification process.
We find that rule characteristics impact how useful the rules are. We hypothesized that strong rules are more helpful than medium rules and that long rules are more helpful than short rules. While the differences in ratings are small, our results indicate that strong rules are preferred over medium rules. This finding is important for our future use of these rules. When integrating them into our text editor, sentences that should be simplified based on our rules will be flagged. However, this may result in many or all sentences being flagged. We will provide writers the opportunity to reduce the number of rules being applied and sentences tagged. Additionally, the metrics can be used to rank the rules in an editor, with stronger rules and shorter rules shown first. Future research will look at additional rule characteristics to further improve this filtering and ranking process and will include a user study of the automatically generated sub-sentence-level rules after their implementation in our medical text simplification tool. In addition, we are writing plain and simple English descriptions for the rules, with examples for each. The section in the sentence that needs simplification will be highlighted and the plain English rule with examples will be shown. Our online prototype is available at http://simple.cs.pomona.edu:3000/.
Acknowledgements
Research reported in this paper was supported by the National Library of Medicine of the National Institutes of Health under Award Number R01LM011975. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.
References
- 1. U.S. Department of Health and Human Services. America’s Health Literacy: Why We Need Accessible Health Information. 2008. Available from: https://health.gov/communication/literacy/issuebrief/
- 2. Scott TL, Gazmararian JA, Baker DW, et al. Health literacy and preventive health care use among Medicare enrollees in a managed care organization. Medical Care. 2002 May;40(5):395–404. doi: 10.1097/00005650-200205000-00005.
- 3. Kickbusch I, Pelikan JM, Apfel F, et al. Health Literacy: The Solid Facts. World Health Organization; 2013.
- 4. Chandrasekar R, Doran C, Srinivas B. Motivations and methods for text simplification. Proceedings of the 16th Conference on Computational Linguistics; 1996. pp. 1041–1044.
- 5. Shardlow M. A survey of automated text simplification. International Journal of Advanced Computer Science and Applications. 2014; Special Issue on Natural Language Processing.
- 6. Siddharthan A. A survey of research on text simplification. International Journal of Applied Linguistics. 2014.
- 7. Xu W, Napoles C, Pavlick E, et al. Optimizing statistical machine translation for text simplification. Transactions of the Association for Computational Linguistics. 2016;4:401–415.
- 8. Siddharthan A, Angrosh MA. Hybrid text simplification using synchronous dependency grammars with hand-written and automatically harvested rules. Proceedings of the European Chapter of the Association for Computational Linguistics; 2016. pp. 722–731.
- 9. Coster W, Kauchak D. Simple English Wikipedia: a new text simplification task. Proceedings of the Association for Computational Linguistics: Human Language Technologies; 2011. pp. 665–669.
- 10. Damay J, Lojico G, Lu K, Tarantan D, et al. SIMTEXT: the simplification of medical literature. Journal of Research in Science, Computing and Engineering. 2006.
- 11. Stanford University. Stanford CoreNLP. 2016 Oct 31. Available from: https://stanfordnlp.github.io/CoreNLP/history.html