Abstract
Background
Readability in medical notes is critical for patient engagement and comprehension, especially in mental health, where readable documentation reduces stigmatizing language, promotes patient-friendly communication, and strengthens trust. The importance of readability has been amplified by the federal OpenNotes policy (mandated by the 21st Century Cures Act since April 2021), which requires providers to offer patients immediate electronic access to their clinical notes. This mandate makes it imperative that mental health documentation is both clinically useful and easily understood.
Methods
This study evaluated psychiatric discharge summaries from the MIMIC-IV dataset. Specifically, we compared patient-facing “Discharge Instructions” against provider-facing “Brief Hospital Course” sections to assess documentation practices. We analyzed 1,745 notes associated with four major psychiatric diagnoses (Major Depression, Bipolar Disorder, Schizophrenia, and Eating Disorders), with a paired subset of 880 notes containing both sections. Readability was assessed using a multidimensional framework of 12 metrics, spanning traditional grade-level formulas (e.g., SMOG, KFGL) and deep learning–based models (ClinicalBERT, MedReadMe). The internal correlation across these metrics was also evaluated.
Results
Discharge Instructions were significantly shorter, structurally simpler, and scored lower on difficulty indices compared to Brief Hospital Course sections. Additionally, complexity varied significantly by diagnosis: the Eating Disorder cohort uniquely displayed a “readability inversion,” having the simplest instructions but the most complex hospital course summaries. Furthermore, traditional statistical measures showed moderate to high internal consistency. While they demonstrated moderate correlation with MedReadMe, they showed weaker correlation with ClinicalBERT rankings, suggesting that deep learning models capture semantic dimensions of text difficulty distinct from surface-level syntactic complexity.
Conclusion
Psychiatric documentation practices successfully simplify patient-facing text. However, the observed variability across diagnostic groups and the divergence between traditional and neural readability metrics highlight the necessity of using multipronged evaluation frameworks to accurately assess the accessibility and comprehension of clinical text in the OpenNotes era.
Supplementary Information
The online version contains supplementary material available at 10.1186/s13040-026-00528-2.
Introduction
Effective communication between patients and healthcare providers is a cornerstone of high-quality care. Clear, readable documentation significantly influences patients’ understanding of their conditions, adherence to treatment plans, and overall satisfaction. However, patients without medical training often struggle to understand technical language in their electronic health records, which can lead to unnecessary anxiety and confusion [1, 2]. The importance of readability has been amplified by the federal OpenNotes initiative, launched in 2010 to advocate for transparency and subsequently mandated by the 21st Century Cures Act [3, 4]. Since April 2021, this federal mandate has required U.S. healthcare providers to offer patients immediate electronic access to their clinical notes [1]. This regulatory shift makes it imperative that mental health clinicians write notes that patients can understand while maintaining clinical utility [5, 6].
Despite these regulatory advances, access does not guarantee comprehension. Patients frequently report difficulties understanding clinical documentation, citing complex terminology and dense abbreviations as major barriers [7–9]. This challenge is particularly acute in mental health care. Psychiatric documentation often includes subjective narratives, sensitive observations, and specialized terminology that may be difficult for laypeople to interpret [5, 10]. Furthermore, patients with severe mental illness may face additional cognitive or emotional barriers that affect how they process written health information [11–13]. Consequently, assessing and improving the readability of psychiatric notes is a critical step toward ensuring equitable care.
Historically, readability has been evaluated using manual methods [14–16], which provide detailed feedback but lack scalability. As a result, automated metrics have become the standard for large-scale evaluation. However, the field currently relies heavily on traditional “surface-level” formulas, such as the Kincaid-Flesch Grade Level (KFGL) or the Simple Measure of Gobbledygook (SMOG), which estimate difficulty based only on word complexity and sentence length [7, 17–21]. While easy to calculate, these metrics may fail to capture the semantic and contextual complexities of medical text. Recently, deep learning-based approaches, such as ClinicalBERT and MedReadMe, have emerged as potential alternatives [22, 23], yet few studies have systematically compared these advanced models against traditional formulas. This challenge is particularly salient in psychiatric care. Prior studies in this field have documented improvements in documentation practices and patient experiences after OpenNotes, yet when readability was evaluated, assessments typically relied on human judgment or traditional statistical formulas alone [5, 11–13]. To our knowledge, this is the first study to evaluate psychiatric discharge summaries using a combined framework of lexical and semantic readability metrics.
To bridge these gaps, we aim to answer three core questions in this study. First, is there a readability gap between patient-facing and provider-facing sections of the note? Second, does documentation complexity vary systematically across psychiatric diagnoses? Third, how consistent are traditional readability formulas with modern deep learning models in this domain? To answer these questions, we analyzed a cohort of psychiatric discharge summaries from the MIMIC-IV dataset (version 3.1), a large, freely accessible database comprising deidentified health data from patients admitted to the Beth Israel Deaconess Medical Center in Boston, MA [24–27]. We focused on admissions for major depressive disorder, bipolar disorder, schizophrenia, and eating disorders, as they are common diagnoses well-represented in MIMIC-IV that offer clinically distinct profiles with varying degrees of complexity. Other diagnoses were excluded to maintain a robust and coherent comparison; future work can extend to additional conditions. Our analysis introduces distinct contributions to the literature. First, we move beyond treating the discharge summary as a monolithic document. Instead, we separately extract and compare “Discharge Instructions”—sections explicitly intended for the patient—and the “Brief Hospital Course”—sections summarizing clinical trajectories for other providers. This distinction allows us to assess how documentation practices differ when text is targeted to a patient versus a clinical audience. Second, we employ a comprehensive, multidimensional evaluation framework to analyze clinical text. In this study, we group metrics into complementary families: (1) traditional and lexical measures that quantify surface features such as length, word complexity, vocabulary frequency, and domain-specific terminology; and (2) semantic and contextual measures that capture contextual predictability and sentence difficulty based on a human-annotated database. This framework applies both statistical metrics and deep learning metrics to measure readability according to syntactic and semantic characteristics. We aimed to validate three hypotheses through this study design: (1) patient-facing Discharge Instructions would be more readable than Brief Hospital Course sections; (2) documentation complexity would differ by diagnosis; and (3) traditional and lexical metrics would correlate positively with semantic metrics, although the correlations would not be strong, as the two families focus on complementary dimensions of readability.
Methods
Data selection and preprocessing
We identified inpatient admissions from the MIMIC-IV dataset (version 3.1) based on primary ICD-10 diagnoses corresponding to four major psychiatric diagnostic categories: major depressive disorder (F32, F33), bipolar disorder (F25.0, F30, F31), eating disorders (F50), and schizophrenia (F20). The unit of analysis for this study was the individual clinical note. While some patients had multiple admissions, we treated each discharge summary as an independent event to capture admission-specific variations in clinical context and documentation complexity. This initial query yielded 9,111 admissions. From this cohort, we filtered for admissions containing at least one discharge note in the MIMIC-IV-Note dataset (version 2.2), resulting in a corpus of 1,823 discharge notes.
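As an illustration, this cohort selection can be reproduced with a short pandas query. The sketch below is a minimal reconstruction, assuming the MIMIC-IV diagnoses_icd table (which stores ICD codes without dots) has been downloaded locally; the file path and helper names are hypothetical.

```python
import pandas as pd

# ICD-10 prefixes per study category (codes stored without dots in MIMIC-IV)
DIAGNOSIS_PREFIXES = {
    "Major Depression": ("F32", "F33"),
    "Bipolar Disorder": ("F250", "F30", "F31"),
    "Eating Disorder": ("F50",),
    "Schizophrenia": ("F20",),
}

def label_admission(icd_code: str) -> str | None:
    """Map a primary ICD-10 code to one of the four study categories."""
    for label, prefixes in DIAGNOSIS_PREFIXES.items():
        if icd_code.startswith(prefixes):
            return label
    return None

# Columns follow the published MIMIC-IV schema: hadm_id, seq_num, icd_code, icd_version
diagnoses = pd.read_csv("diagnoses_icd.csv.gz")
primary = diagnoses[(diagnoses.icd_version == 10) & (diagnoses.seq_num == 1)].copy()
primary["diagnosis"] = primary.icd_code.map(label_admission)
cohort = primary.dropna(subset=["diagnosis"])  # one row per qualifying admission
```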
To mitigate structural variation across documents, we extracted two specific sections common to discharge notes that possess high clinical utility. The first, Discharge Instructions, was selected for its relevance to patient behavior and post-discharge guidance. The second, Brief Hospital Course, was selected as it summarizes the clinical trajectory, including diagnoses, laboratory results, and medications.
We performed automatic extraction by locating section headers (case-insensitive). The end of a section was determined by the detection of a subsequent header, defined as a single line ending with a colon, containing fewer than 5 words and fewer than 100 characters. Because clinical note headers are not fully standardized, our rule-based approach may have missed headers containing typos or non-standard variations, potentially leading to under-capture. This process resulted in a final dataset of 1,745 notes containing “Discharge Instructions” and a subset of 880 notes containing a “Brief Hospital Course.” All notes containing a Brief Hospital Course section also contained a Discharge Instructions section, allowing for paired comparisons.
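The sketch below is a minimal reconstruction of this header-based extraction, assuming the rule described above; the regular expression and helper name are illustrative rather than the exact implementation, and `note_text` stands for the raw text of one discharge note.

```python
import re

# A candidate header: a single line of non-colon text ending with a colon
HEADER_RE = re.compile(r"^(?P<name>[^\n:]{1,100}):[ \t]*$", re.MULTILINE)

def extract_section(note: str, header: str) -> str | None:
    """Return the text between `header` and the next short header-like line."""
    start = re.search(rf"(?im)^{re.escape(header)}[ \t]*:", note)
    if start is None:
        return None
    body = note[start.end():]
    for match in HEADER_RE.finditer(body):
        name = match.group("name")
        # Subsequent header: fewer than 5 words and fewer than 100 characters
        if len(name.split()) < 5 and len(name) < 100:
            return body[: match.start()].strip()
    return body.strip()  # section runs to the end of the note

instructions = extract_section(note_text, "Discharge Instructions")
course = extract_section(note_text, "Brief Hospital Course")
```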
Readability evaluation based on statistical and lexical metrics
To evaluate the readability of clinical notes, we employed several metrics that provide surface-level statistics at both the word and sentence levels. We organized these measures by their theoretical constructs: (a) length and lexical diversity (sentence count, word count, Shannon entropy); (b) lexical difficulty (SMOG, KFGL, GFI; Mean Zipf frequency); and (c) medical specificity (UMLS concept counts). For each metric described below and in Sect. “Readability evaluation based on semantic and contextual metrics”, we computed descriptive statistics (mean and standard deviation) and visualized their distributions (reported in the Supplementary Materials).
We further conducted comparative analyses across the four diagnostic categories to explore systematic variations in documentation practices. For multi-group comparisons across diagnoses, distributions approximated normality for most metrics. Accordingly, we used one-way ANOVA followed by Bonferroni-corrected pairwise t-tests. Effect sizes were measured using Cohen’s d. For paired comparisons between Discharge Instructions and Brief Hospital Course, we used the Wilcoxon signed-rank test due to deviations from normality observed in several metrics (Supplement Fig. 1).
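This testing pipeline can be sketched with SciPy as follows; the input arrays of per-note metric values are hypothetical placeholders.

```python
from itertools import combinations

import numpy as np
from scipy import stats

def compare_diagnoses(scores_by_dx: dict[str, np.ndarray], alpha: float = 0.05):
    """One-way ANOVA across diagnoses, then Bonferroni-corrected pairwise t-tests."""
    _, p_anova = stats.f_oneway(*scores_by_dx.values())
    pairwise = {}
    if p_anova < alpha:
        pairs = list(combinations(scores_by_dx, 2))
        for a, b in pairs:
            _, p = stats.ttest_ind(scores_by_dx[a], scores_by_dx[b])
            pooled_sd = np.sqrt(
                (scores_by_dx[a].var(ddof=1) + scores_by_dx[b].var(ddof=1)) / 2
            )
            cohens_d = (scores_by_dx[a].mean() - scores_by_dx[b].mean()) / pooled_sd
            pairwise[(a, b)] = (min(p * len(pairs), 1.0), cohens_d)  # Bonferroni
    return p_anova, pairwise

def compare_sections(di_scores: np.ndarray, bhc_scores: np.ndarray):
    """Paired Discharge Instructions vs. Brief Hospital Course comparison."""
    return stats.wilcoxon(di_scores, bhc_scores)
```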
Fig. 1.
Pairwise comparison of readability metrics between discharge instructions (x-axis) and Brief Hospital Course (y-axis) for matched clinical notes. The red dashed line represents the line of identity (y = x). r values represent Pearson correlation coefficients
Length and lexical diversity
We processed each clinical note using the spaCy NLP library (model: en_core_web_sm) [28], tokenizing text into sentences and word-level tokens (excluding punctuation and numerical values). Note length was quantified via both sentence count and word count, as longer notes generally correlate with higher cognitive load. To assess lexical diversity, we computed Shannon entropy over the distribution of word tokens [29]. Higher entropy indicates greater word variety, which may correspond to increased linguistic complexity and reduced readability.
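A minimal sketch of these three measures, using the same spaCy model:

```python
import math
from collections import Counter

import spacy

nlp = spacy.load("en_core_web_sm")

def length_and_entropy(text: str) -> tuple[int, int, float]:
    """Sentence count, word count, and Shannon entropy over word tokens."""
    doc = nlp(text)
    n_sentences = sum(1 for _ in doc.sents)
    # Word tokens exclude punctuation and numerical values, as described above
    words = [t.text.lower() for t in doc if not (t.is_punct or t.like_num or t.is_space)]
    counts = Counter(words)
    entropy = -sum(
        (c / len(words)) * math.log2(c / len(words)) for c in counts.values()
    )
    return n_sentences, len(words), entropy
```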
Lexical difficulty
To evaluate lexical difficulty, we computed three widely used U.S. grade-level readability scores: the Simple Measure of Gobbledygook (SMOG), Kincaid-Flesch Grade Level (KFGL), and Gunning Fog Index (GFI) [30–32], using the Python package textstat (see footnote 1). These metrics estimate the years of formal education required to understand a text based on surface features such as the prevalence of polysyllabic words, sentence length, and word and sentence counts.
Simple Measure of Gobbledygook (SMOG): This formula estimates the years of education needed to understand a piece of writing. It is calculated as:
$$\text{SMOG} = 1.0430\sqrt{\text{polysyllable count} \times \frac{30}{\text{sentence count}}} + 3.1291$$
where polysyllables are words with three or more syllables.
Kincaid-Flesch Grade Level (KFGL): This formula indicates the U.S. grade level required to comprehend a text. It is calculated as:
$$\text{KFGL} = 0.39\left(\frac{\text{total words}}{\text{total sentences}}\right) + 11.8\left(\frac{\text{total syllables}}{\text{total words}}\right) - 15.59$$
Gunning Fog Index (GFI): This index estimates the years of formal education a person needs to understand the text on a first reading. It is calculated as:
$$\text{GFI} = 0.4\left[\left(\frac{\text{words}}{\text{sentences}}\right) + 100\left(\frac{\text{complex words}}{\text{words}}\right)\right]$$
where complex words are those with three or more syllables, excluding proper nouns, hyphenated words, and common jargon.
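In practice, all three scores can be obtained directly from textstat; a minimal example:

```python
import textstat

def grade_level_scores(text: str) -> dict[str, float]:
    """Three U.S. grade-level estimates computed from surface features."""
    return {
        "SMOG": textstat.smog_index(text),
        "KFGL": textstat.flesch_kincaid_grade(text),
        "GFI": textstat.gunning_fog(text),
    }
```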
An additional estimate of lexical complexity we used is the Mean Zipf Frequency for each note. This metric is derived from the Zipf scale, a logarithmic transformation of raw term frequency representing the base-10 logarithm of a word’s occurrence per billion words (ranging from 0 to 8). Unlike linear frequency counts, the Zipf scale better approximates human perception of word prevalence and difficulty. This metric relies on the assumption that simple words appear more frequently in general language than difficult words [21, 33].
We first lemmatized all words to their root forms using the en_core_web_sm model in spaCy to ensure grammatical variations (e.g., “treating” vs. “treated”) were assessed as a single concept. We then assigned a Zipf frequency to each lemma using the ‘large’ English wordlist from the Python library wordfreq (v3.0.2; see footnote 2). The final score was the arithmetic mean of the Zipf scores of all valid tokens, where a higher score indicates simpler, more common vocabulary within a note.
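A sketch of the Mean Zipf Frequency computation using spaCy and wordfreq; dropping tokens absent from the wordlist (score 0) is our assumption, as the original filtering of “valid tokens” may differ.

```python
import spacy
from wordfreq import zipf_frequency

nlp = spacy.load("en_core_web_sm")

def mean_zipf(text: str) -> float:
    """Average Zipf frequency (0-8 scale) over lemmatized word tokens."""
    doc = nlp(text)
    scores = [
        zipf_frequency(tok.lemma_.lower(), "en", wordlist="large")
        for tok in doc
        if tok.is_alpha
    ]
    scores = [s for s in scores if s > 0]  # drop out-of-vocabulary tokens (assumption)
    return sum(scores) / len(scores) if scores else 0.0
```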
Medical specificity
Medical terminology poses a specific challenge to lay understanding [34]. To quantify this, we used the comprehensive UMLS terminology lexicon to calculate the prevalence of medical terms [35]. We limited our analysis to UMLS semantic type T048 (Mental or Behavioral Dysfunction) due to its direct relevance to the study cohort. By focusing on a single semantic type, we prioritized psychiatric relevance and reduced false positives from lay terms present in broader medical categories. However, this restriction may underestimate the total burden of medical terminology and should be interpreted as a conservative estimate of domain-specific terminology exposure [36–38].
Using the quickUMLS package, we matched all n-grams (lengths 1–5) from the notes to T048 concepts [39]. Among the available similarity metrics (Jaccard Index, Dice similarity, and cosine similarity), we selected the Jaccard Index as the standard. This decision was based on a manual inspection of terms extracted from one sample clinical note. The Jaccard Index was the most conservative metric, as nearly all terms identified by the Jaccard Index were also identified by the other methods. Furthermore, the Dice similarity score by nature tends to overestimate similarity when phrase lengths differ significantly. Our manual inspection also led us to set a Jaccard Index similarity score of 0.6 as the cutoff threshold for a matched term, acknowledging that this threshold can be adjusted to balance false positives and negatives.
For each note, we recorded the total count of unique matched concepts (unique CUI count) and the total count of all matched terms. Higher counts for both metrics indicate a greater prevalence of specialized medical terminology, corresponding to increased reading difficulty for lay audiences.
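The matching and counting steps can be sketched as follows. A local QuickUMLS index built from UMLS files is required; the index path is hypothetical, and counting each matched span once via its top-ranked candidate is our simplification.

```python
from quickumls import QuickUMLS

matcher = QuickUMLS(
    "/path/to/quickumls_index",   # hypothetical local index location
    threshold=0.6,                # Jaccard cutoff chosen by manual inspection
    similarity_name="jaccard",
    window=5,                     # n-grams of lengths 1-5
    accepted_semtypes={"T048"},   # Mental or Behavioral Dysfunction
)

def umls_counts(text: str) -> tuple[int, int]:
    """Total matched terms and unique matched concepts (CUIs) in a note."""
    matches = matcher.match(text, best_match=True, ignore_syntax=False)
    best_cuis = [group[0]["cui"] for group in matches if group]
    return len(best_cuis), len(set(best_cuis))
```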
Readability evaluation based on semantic and contextual metrics
In addition to statistical metrics, we evaluated text difficulty using two deep learning-based approaches, aiming to understand the semantic predictability and perceived difficulty of clinical notes. For these metrics, we performed the same set of analyses described in Sect. “Readability evaluation based on statistical and lexical metrics”, including the computation of descriptive statistics, distribution visualization, comparative testing across diagnostic categories (ANOVA and pairwise t-tests with Bonferroni correction), and comparative testing between Discharge Instructions and Brief Hospital Course (Wilcoxon signed-rank test).
Semantic predictability
Language Model Fill-Mask (LMFM) employs a language model to measure semantic predictability of masked tokens. Masked language models like BERT can predict a masked word from context. We computed the rank of the original word among the model’s predictions and averaged ranks across masked tokens in a section. Higher average rank indicates the original word was less predictable given the surrounding clinical context—suggesting greater semantic complexity.
Specifically, we tokenized notes into sentences using spaCy and randomly masked 15% of tokens in each sentence, following the methodology of Scholz and Wenzel (2025) [21]. ClinicalBERT, a BERT model pre-trained specifically on large-scale clinical corpora (including the MIMIC database), was then utilized to predict the masked tokens [22]. To quantify contextual readability, we calculated the rank of the original masked word within the model’s vocabulary predictions [21, 40]. This rank was averaged across all masked tokens in a note section.
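The sketch below illustrates the fill-mask ranking for a single sentence, using the public Bio_ClinicalBERT checkpoint released by Alsentzer et al. [22]; the masking and ranking details are our reconstruction of the procedure described above.

```python
import random

import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

MODEL = "emilyalsentzer/Bio_ClinicalBERT"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForMaskedLM.from_pretrained(MODEL).eval()

def mean_fill_mask_rank(sentence: str, mask_rate: float = 0.15) -> float:
    """Average rank of each masked token's original word in the model's predictions."""
    ids = tokenizer(sentence, return_tensors="pt", truncation=True)["input_ids"][0]
    positions = list(range(1, len(ids) - 1))  # skip [CLS] and [SEP]
    masked = random.sample(positions, max(1, int(mask_rate * len(positions))))
    ranks = []
    for pos in masked:
        inp = ids.clone()
        original = inp[pos].item()
        inp[pos] = tokenizer.mask_token_id
        with torch.no_grad():
            logits = model(inp.unsqueeze(0)).logits[0, pos]
        # Rank 1 = the original token was the model's top prediction
        ranks.append(int((logits > logits[original]).sum()) + 1)
    return sum(ranks) / len(ranks)
```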
Perceived difficulty
MedReadMe is a manually annotated dataset providing readability ratings and fine-grained complex-span annotations for 4,520 sentences in the medical domain [23]. Readability scores range from 1 (easiest) to 6 (hardest), with “+” and “−” modifiers (representing ± 0.3) to reflect finer-grained difficulty levels. The readability score represents the perceived difficulty of a sentence to human readers. We adopted the language model released by the authors, which was fine-tuned on this dataset (see footnote 3). Consistent with our previous preprocessing, we utilized spaCy to tokenize notes into sentences and computed a readability score for each sentence. For each note section, we calculated the mean and median readability scores across its sentences to characterize overall difficulty.
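A heavily hedged sketch of this sentence-level scoring is shown below; it assumes the authors’ fine-tuned model has been exported as a single-output regression checkpoint loadable through Hugging Face transformers (the checkpoint path is hypothetical; see footnote 3 for the actual release).

```python
import statistics

import spacy
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

nlp = spacy.load("en_core_web_sm")
CKPT = "/path/to/medreadme-checkpoint"  # hypothetical local path
tokenizer = AutoTokenizer.from_pretrained(CKPT)
model = AutoModelForSequenceClassification.from_pretrained(CKPT).eval()

def section_readability(text: str) -> tuple[float, float]:
    """Mean and median sentence-level scores (1 = easiest, 6 = hardest)."""
    scores = []
    for sent in nlp(text).sents:
        enc = tokenizer(sent.text, return_tensors="pt", truncation=True)
        with torch.no_grad():
            scores.append(model(**enc).logits.item())
    return statistics.mean(scores), statistics.median(scores)
```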
Comparison across metrics
To evaluate the consistency and agreement among different readability constructs, we computed Spearman rank correlation coefficients between all pairs of metrics described in Sects. “Readability evaluation based on statistical and lexical metrics” and “Readability evaluation based on semantic and contextual metrics”. Spearman correlation is a non-parametric measure suitable for comparing metrics with different scales and underlying assumptions.
To ensure consistent directionality, we inverted the sign of metrics where higher values indicate easier text (specifically, Mean Zipf Frequency). This transformation ensured that for all metrics, higher values corresponded to greater difficulty. As such, a high positive correlation indicates that two metrics consistently rank notes in a similar order of difficulty, while a low correlation suggests they capture distinct dimensions of readability. The resulting correlation matrix was visualized using a heatmap (Fig. 3) to identify clusters of highly collinear metrics and detect divergences between statistical and deep learning approaches.
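The correlation analysis and heatmap can be sketched as follows; `metrics_df` (one row per note section, one column per metric) is a hypothetical layout for the values computed above.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

def plot_metric_correlations(metrics_df: pd.DataFrame) -> None:
    """Spearman correlation heatmap with difficulty-aligned directionality."""
    df = metrics_df.copy()
    df["mean_zipf"] = -df["mean_zipf"]  # invert so higher values = harder text
    corr = df.corr(method="spearman")
    sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm", vmin=-1, vmax=1)
    plt.title("Spearman rank correlations among readability metrics")
    plt.tight_layout()
    plt.show()
```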
Fig. 2.
Distribution of readability metrics across diagnostic categories. (A) Discharge Instructions and (B) Brief Hospital Course
Ethics
This study used the de-identified MIMIC-IV and MIMIC-IV-Note public datasets. We adhered to all requirements of the PhysioNet Data Use Agreement (DUA) and conducted all model development and evaluation on HIPAA-compliant servers. Because the data are de-identified, the study did not require separate institutional review board (IRB) approval.
Results
Data characterization
The final dataset consisted of 1,745 Discharge Instructions sections (representing 1,432 unique patients), representing the patient-facing component of the clinical notes, and a subset of 880 notes that also contained a Brief Hospital Course section (representing 767 unique patients), representing the clinician-facing component. Table 1 summarizes the demographic and clinical characteristics of both cohorts.
Table 1.
Demographic and clinical characteristics of the study cohorts. Values are presented as count (%) or mean (SD)
| Characteristic | Discharge Instructions (Full Set) | Brief Hospital Course (Subset) |
|---|---|---|
| Total Notes | 1,745 | 880 |
| Total Unique Patients | 1,432 | 767 |
| Age (years) | ||
| Mean (SD) | 39.18 (17.47) | 39.82 (17.79) |
| Range | 18 to > 89 | 18 to > 89 |
| Gender | ||
| Female | 933 (53.5%) | 495 (56.2%) |
| Male | 812 (46.5%) | 385 (43.8%) |
| Primary Diagnosis | ||
| Major Depression | 851 (48.8%) | 408 (46.4%) |
| Bipolar Disorder | 565 (32.4%) | 275 (31.2%) |
| Schizophrenia | 224 (12.8%) | 132 (15.0%) |
| Eating Disorder | 105 (6.0%) | 65 (7.4%) |
Demographic distributions were consistent between the full dataset and the paired subset. The mean age was 39.2 years for the full cohort and 39.8 years for the subset. Females comprised slightly more than half of the population in both groups.
Regarding clinical diagnoses, Major Depression was the most prevalent condition, accounting for 48.8% of Discharge Instructions and 46.4% of Brief Hospital Course notes, followed by Bipolar Disorder and Schizophrenia. The similarity in diagnostic and demographic distributions confirms that the subset of notes containing a Brief Hospital Course is representative of the larger cohort.
Overview of all metrics
The most prominent finding was a significant readability gap between note sections. Across all metrics, patient-facing Discharge Instructions were consistently shorter, simpler, and more readable than the provider-facing Brief Hospital Course (Table 2).
Table 2.
Comparison of structural and readability metrics between Discharge Instructions and Brief Hospital Course. Values are presented as mean (SD)

| Metric | Discharge Instructions | Brief Hospital Course |
|---|---|---|
| Statistical and Lexical Metrics | | |
| Number of sentences | 6.84 (2.45) | 18.28 (15.88) |
| Number of words | 113.78 (34.94) | 319.08 (288.25) |
| Shannon entropy | 5.91 (0.49) | 6.28 (1.35) |
| SMOG | 14.16 (1.61) | 14.94 (3.81) |
| KFGL | 10.63 (1.89) | 13.46 (4.68) |
| GFI | 15.48 (2.47) | 16.32 (5.61) |
| UMLS term count | 5.11 (3.69) | 32.93 (32.44) |
| Unique UMLS term count | 4.90 (3.03) | 21.00 (18.90) |
| Mean Zipf frequency | 5.77 (0.19) | 5.41 (0.24) |
| Semantic and Contextual Metrics | | |
| ClinicalBERT average rank | 65.57 (189.06) | 168.05 (339.89) |
| MedReadMe mean score | 3.59 (0.17) | 3.91 (0.30) |
| MedReadMe median score | 3.70 (0.16) | 3.99 (0.30) |
The Brief Hospital Course sections were significantly longer and structurally more variable than Discharge Instructions, with a nearly threefold increase in mean word count. Across all metrics, the Brief Hospital Course consistently exhibited higher difficulty than Discharge Instructions, encompassing both traditional statistical and lexical metrics and deep learning–based semantic metrics. These differences were statistically significant for all comparisons (p < 0.05; Supplement Fig. 2). Given that the data distributions for several metrics deviated from normality (Supplement Fig. 1), the non-parametric Wilcoxon signed-rank test was employed.
Notably, the correlation between the two sections was negligible (∣r∣<0.1) across nearly all metrics, with the sole exception of mean Zipf frequency, which showed a weak anti-correlation (r = − 0.2) (Fig. 1). This lack of association suggests that the linguistic complexity of a patient’s internal clinical documentation (Brief Hospital Course) is not predictive of the complexity of their patient-facing Discharge Instructions.
Variability in readability across diagnostic categories
To determine if text complexity varied based on the patient’s primary diagnosis, we performed a one-way ANOVA across the four diagnostic categories for both the Discharge Instructions and Brief Hospital Course sections. Significant findings were followed by pairwise t-tests with Bonferroni correction to adjust for multiple comparisons. The distribution of all metrics across diagnosis groups is visualized in Fig. 2, and the detailed ANOVA and t-test statistics are presented in the Supplementary Results.
Discharge instructions
For Discharge Instructions (Fig. 2A), ANOVA revealed significant differences across groups for all metrics except total word count and raw UMLS term count. The most distinct pattern was observed in the Eating Disorder cohort. Pairwise comparisons indicated that discharge instructions for these patients were significantly easier to read than those for all other diagnostic groups. This was evidenced by significantly lower scores on Shannon entropy, SMOG, KFGL, GFI, ClinicalBERT average rank, and median MedReadMe score, as well as a higher Mean Zipf frequency. A secondary finding was that discharge instructions for Major Depression exhibited significantly higher GFI scores and a higher count of unique UMLS terms than those for Bipolar Disorder, indicating that they were harder to read.
Brief hospital course
For Brief Hospital Courses (Fig. 2B), significant differences were observed across diagnostic groups for traditional readability formulae, Mean Zipf frequency, and MedReadMe scores. Notably, basic structural metrics such as sentence count, word count, and Shannon entropy did not differ significantly between groups.
In contrast to the trend observed in the discharge instructions, the Eating Disorder cohort exhibited the most complex clinical documentation. Pairwise comparisons revealed that brief hospital courses for this group were significantly harder to read than those of all other three categories, indicated by lower Mean Zipf frequency and higher MedReadMe scores (all p < 0.005). Apart from this finding, the only other significant pairwise difference was observed between Major Depression and Schizophrenia, where notes for Major Depression were found to be easier to read across traditional readability formulae (p < 0.005).
In summary, documentation complexity varies significantly by diagnostic category. Most notably, the Eating Disorder cohort demonstrated a distinct “inversion” pattern, characterized by the most readable Discharge Instructions but the most complex Brief Hospital Course.
Comparison across metrics
Across both sections, the metrics formed distinct, highly correlated clusters corresponding to their theoretical constructs. Figure 3 displays the Spearman rank correlations among all assessed readability metrics for Discharge Instructions (Fig. 3a) and Brief Hospital Course (Fig. 3b). To ensure consistent directionality, where higher values uniformly indicate greater reading difficulty, the sign for Mean Zipf Frequency was inverted prior to analysis. Generally, metrics demonstrated stronger internal consistency within the Brief Hospital Course section.
Fig. 3.
Spearman rank correlation matrices among readability metrics. (a) Discharge Instructions; (b) Brief Hospital Course
Statistical and lexical metrics formed three clusters with moderate positive correlations with one another, particularly within the Brief Hospital Course section. First, metrics reflecting note length and lexical diversity—specifically sentence count, word count, and Shannon entropy—formed a tight cluster with strong positive correlations, capturing the overall volume and information density of the notes. Second, the three traditional grade-level formulae (SMOG, KFGL, and GFI) exhibited near-perfect collinearity (r > 0.90), confirming that they measure highly similar constructs of surface-level syntactic complexity. Third, the count of total UMLS terms and unique UMLS CUIs demonstrated perfect collinearity.
Regarding metrics representing perceived difficulty, MedReadMe mean and median scores demonstrated moderate positive correlations with most statistical metrics. Notably, MedReadMe aligned relatively well with the three traditional grade-level scores (r ≈ 0.50), suggesting that this supervised model captures features of difficulty that overlap with classical heuristics.
However, the semantic predictability metric, ClinicalBERT average rank, demonstrated weak correlation with other metrics. In Discharge Instructions, it showed negligible correlation with most other readability metrics. In the Brief Hospital Course, it exhibited a weak positive correlation with length-based and UMLS-based metrics, but a weak negative correlation with traditional grade-level scores and MedReadMe. This divergence suggests that semantic predictability might be a dimension of textual difficulty that is distinct from traditional statistical and lexical definitions. In practical terms, a note can be complex (e.g., long sentences) yet semantically predictable within clinical context, or vice versa.
Discussion
Our study evaluated the readability of psychiatric discharge notes in the MIMIC-IV dataset after the implementation of the OpenNotes policy. By investigating the patient-facing “Discharge Instructions” and the provider-facing “Brief Hospital Course” sections, we aimed to assess whether documentation practices effectively adapt to their intended audience. Using a multidimensional framework of twelve metrics—ranging from traditional grade-level formulas to advanced deep learning models—we identified three primary findings: (1) clinicians create significantly more readable discharge instructions compared to hospital course summaries; (2) documentation complexity varies significantly by diagnostic category, with a distinct “inversion” pattern observed in the Eating Disorder cohort; and (3) most readability metrics demonstrated moderate internal consistency, but deep learning metrics, particularly those based on masked language modeling, might capture dimensions of text difficulty that are distinct from traditional surface-level readability formulas.
The most robust finding of our study is the statistically significant gap in complexity between Discharge Instructions and the Brief Hospital Course. Across all metrics, Discharge Instructions were shorter, syntactically simpler, and semantically more accessible. This confirms that clinicians are, to a significant degree, adhering to the principles of patient-centered communication advocated by the OpenNotes initiative. The Brief Hospital Course, conversely, exhibited nearly threefold greater length and consistently higher difficulty scores. This divergence is appropriate, as the Brief Hospital Course serves as a technical handover for other medical professionals, necessitating precise medical terminology and dense informational content that naturally inflates traditional readability scores. The lack of correlation between the readability of a patient’s discharge instructions and their hospital course (∣r∣<0.1 in most metrics) further suggests that clinicians might reformulate information when addressing patients.
The analysis revealed significant variability in documentation complexity across different diagnostic categories, with the most significant and consistent differences observed in the Eating Disorder cohort. This group displayed a unique “readability inversion”: their Discharge Instructions were the simplest among all categories (evidenced by the lowest Shannon entropy, grade-level scores, and MedReadMe scores, alongside the highest Mean Zipf frequency), whereas their Brief Hospital Course notes were the most complex (lowest Mean Zipf frequency, highest MedReadMe). This pattern may suggest a tailored communication strategy, where clinicians might simplify patient-facing instructions to optimize comprehension. Simultaneously, the Brief Hospital Course for these patients likely necessitates greater density to accurately document the multifaceted medical complications and strict behavioral protocols inherent to their management. Beyond this, we observed that discharge instructions for Major Depression were generally simpler than those for Bipolar Disorder and Schizophrenia, further reinforcing that documentation style is sensitive to the patient’s clinical profile. However, without qualitative data on clinician intent, we cannot rule out that these differences are driven by unmeasured confounders and further studies are necessary to validate our hypothesis.
Third, our analysis of metric correlations demonstrated substantial consistency across the majority of readability metrics, while simultaneously highlighting the limitations of relying solely on traditional formulas. As expected, the most commonly used SMOG, KFGL, and GFI indices were nearly perfectly correlated (r > 0.90), confirming that they measure largely the same construct: syntactic complexity based on sentence length and syllable count. MedReadMe scores demonstrated moderate correlation with these traditional formulas (r ≈ 0.50), a finding consistent with the original study [23]. This suggested that supervised deep learning models capture features of difficulty that overlap with grade-level heuristics. Distinct clusters also emerged for other traditional metrics, grouping clearly by the specific dimension of readability they measure, such as note length and lexical diversity or medical term prevalence. In contrast, the ClinicalBERT rank showed a divergent pattern, particularly in the Brief Hospital Course, where it exhibited a weak negative correlation with traditional grade-level scores. Given that ClinicalBERT was pre-trained on MIMIC-IV data, its fill-mask rank in theory reflects the likelihood of word occurrence specifically within this clinical corpus. Consequently, this metric likely captures semantic probability or “surprise”—a dimension of difficulty distinct from surface-level metrics. A text might be syntactically complex (e.g., long sentences) yet semantically predictable to a clinician (low ClinicalBERT rank), underscoring the necessity of a multipronged approach to assessment. In addition, we noted that rank-based measures can be heavy-tailed and sensitive to rare tokens, so normalization strategies could be explored in future work to improve robustness.
Our study has several limitations. First and foremost, while we employed a diverse set of statistical and deep learning metrics, these remain computational proxies for text difficulty. A central constraint of this study is the absence of validation with human participants, which remains the gold standard for assessing actual patient comprehension and usability. Second, the extraction of the Brief Hospital Course relied on strict, header-based regular expressions. Given the structural heterogeneity of clinical notes in MIMIC-IV, this rule-based approach likely resulted in the exclusion of valid sections that used non-standard headers, limiting our paired analysis to a subset of the total cohort (n = 880). Third, our scope was restricted to four broad psychiatric diagnostic categories; future research could explore whether readability varies among finer-grained diagnostic subtypes or in the presence of comorbidities. Fourth, while MIMIC-IV is a well-powered public dataset, it has significant limiting factors. The rigorous de-identification protocols of the MIMIC-IV dataset precluded the analysis of detailed sociodemographic factors, preventing us from assessing potential disparities in readability related to race, ethnicity, or socioeconomic status. Also, the MIMIC-IV dataset only contains ICU and Emergency Department admissions. Consequently, our findings may reflect the documentation style of acute care settings rather than general psychiatric wards. Fifth, our unit of analysis was the clinical note rather than the patient. While our approach captures admission-specific variations in documentation, it does not account for potential clustering effects in patients with multiple readmissions. Lastly, to our knowledge, this is the first study to systematically apply deep learning–based readability metrics to psychiatric discharge summaries. While our findings broadly align with the original MedReadMe evaluation, the robustness and generalizability of our findings in psychiatric documentation should be validated in future studies.
Future research can build upon this framework to address these gaps and strengthen our understanding of clinical documentation practices. Specifically, subsequent studies should utilize available demographic data to examine readability differences across subgroups, such as gender and age, to identify potential biases in patient communication. Furthermore, applying this multi-metric approach to longitudinal data, such as comparing notes written before and after the implementation of the OpenNotes policy, would yield critical insights into how regulatory mandates actively shape clinical documentation practices over time. Notably, if datasets with richer patient-level sociodemographic information are available, patient-level analyses in addition to note-level analyses would be valuable, enabling assessment of potential population disparities and longitudinal trajectories. Finally, future work should systematically investigate the role of clinical confounders. Factors such as length of stay, care setting (e.g., psychiatry units, general medical wards, ICU), and author role (e.g., physician versus nurse) likely contribute to the variations in complexity we observed.
As the 21st Century Cures Act expands patient access to records, these insights have direct utility for clinical informatics, emphasizing that relying on a single score like Grade Level is insufficient for mental health text. To improve patient-facing documentation in practice, EHRs could adopt readability dashboards that leverage a multi-construct framework—combining traditional and lexical measures with semantic, machine learning–based measures—to flag readability issues in real time. When used with tools that suggest plain-language alternatives, such a system can nudge clinicians to simplify patient-facing Discharge Instructions without compromising clinical precision. Beyond technical solutions, institutions and professional societies could use these metrics to drive quality-improvement programs, establishing audience-specific readability targets and educational curricula that balance clarity with completeness. In this effort, the “readability inversion” observed in the Eating Disorder cohort offers a compelling training benchmark, demonstrating that high clinical complexity does not preclude writing simple, accessible patient instructions.
In short, this study demonstrates that psychiatric discharge instructions in the MIMIC-IV dataset are significantly more readable than their corresponding hospital course summaries. Moreover, this simplification is not uniform; it varies significantly by diagnosis, with the most pronounced adaptation observed in Eating Disorder cases. Our findings also illustrate that while most readability metrics demonstrated moderate consistency, deep learning–based measures capture dimensions of text difficulty that are distinct from those assessed by traditional formulas. As access to clinical notes expands, multipronged readability assessment can inform writing assistance, training, and quality monitoring that ultimately supports clearer, more patient-centered documentation.
Conclusion
This study demonstrates that psychiatric discharge instructions in the MIMIC-IV dataset are significantly more readable than their corresponding hospital course summaries. This simplification is not uniform, however; it varies significantly by diagnosis, with the most pronounced adaptation observed in Eating Disorder cases. Our findings further illustrate that while different readability metrics generally show moderate consistency, deep learning–based measures capture dimensions of text difficulty distinct from traditional formulas. Ultimately, this study provides an extendable framework for evaluating the complexity of medical documentation from a multifaceted perspective.
Supplementary Information
Below is the link to the electronic supplementary material.
Acknowledgements
The authors declare no conflict of interest. We gratefully acknowledge the funding support from the National Institute on Minority Health and Health Disparities (NIMHD) of the National Institutes of Health (NIH) under grant number 1R21MD019870-01A1. The content is solely the responsibility of the authors and does not necessarily represent the official views of the NIH.
Author contributions
F.L. executed the data analysis and drafted the original manuscript. C.T. and A.Z. conceptualized the study design and defined the methodological framework. All authors contributed to the interpretation of results, reviewed the manuscript, and approved the final version for submission.
Data availability
The data that support the findings of this study are available from the MIMIC-IV (version 3.1) and MIMIC-IV-Note (version 2.2) databases, hosted on PhysioNet (https://physionet.org/content/mimiciv/3.1/ and https://physionet.org/content/mimic-iv-note/2.2/). Access to these datasets is restricted to credentialed users who have completed the required CITI Program training in human subjects research and signed the data use agreement.
Declarations
Competing interests
The authors declare no competing interests.
Footnotes
1. Available at https://github.com/textstat/textstat.
2. Available at https://github.com/rspeer/wordfreq.
3. Available at https://github.com/chaojiang06/medreadme.
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
- 1. Office of the National Coordinator for Health Information Technology (ONC). 21st Century Cures Act: interoperability, information blocking, and the ONC Health IT Certification Program. 2020. Available from: https://www.healthit.gov/curesrule
- 2. Zheng J, Yu H. Assessing the readability of medical documents: a ranking approach. JMIR Med Inform. 2018;6(1):e17.
- 3. Delbanco T, Walker J, Bell SK, et al. Inviting patients to read their doctors’ notes: a quasi-experimental study and a look ahead. Ann Intern Med. 2012;157(7):461–70.
- 4. Wolff JL, Darer JD, Berger A, et al. Inviting patients and care partners to read doctors’ notes: OpenNotes and shared access to electronic medical records. J Am Med Inform Assoc. 2017;24(e1):e166–72. 10.1093/jamia/ocw108.
- 5. Meier-Diedrich E, Blease C, Heinze M, Wördemann J, Schwarz J. Changes in documentation after implementing open notes in mental health care: pre-post mixed methods study. J Med Internet Res. 2025;27:e72667.
- 6. Blease C, Torous J, Kharko A, DesRoches CM, Harcourt K, O’Neill S, Salmi L, Wachenheim D, Hägglund M. Preparing patients and clinicians for open notes in mental health: qualitative inquiry of international experts. JMIR Ment Health. 2021;8(4):e27397. 10.2196/27397.
- 7. Rahimian M, Warner JL, Salmi L, Rosenbloom ST, Davis RB, Joyce RM. Open notes sounds great, but will a provider’s documentation change? An exploratory study of the effect of open notes on oncology documentation. JAMIA Open. 2021;4(3):ooab051. 10.1093/jamiaopen/ooab051.
- 8. Cho JK, Zafar HM, Cook TS. Use of an online crowdsourcing platform to assess patient comprehension of radiology reports and colloquialisms. Am J Roentgenol. 2020;214(6):1316–20.
- 9. Keselman A, Smith CA. A classification of errors in lay comprehension of medical documents. J Biomed Inform. 2012;45(6):1151–63.
- 10. Lee J, et al. Prevalence of sensitive terms in clinical notes using natural language processing techniques: observational study. JMIR Med Inform. 2022;10(6):e38482.
- 11. Schwarz J, et al. Sharing clinical notes and electronic health records with people affected by mental health conditions: scoping review. JMIR Ment Health. 2021;8(12):e34170.
- 12. Blease C, et al. Preparing patients and clinicians for open notes in mental health: qualitative inquiry of international experts. JMIR Ment Health. 2021;8(4):e27397.
- 13. O’Neill S, et al. Embracing the new age of transparency: mental health patients reading their psychotherapy notes online. J Ment Health. 2019;28(5):527–35.
- 14. Rogers C, Willis S, Gillard S, Chudleigh J. Patient experience of imaging reports: a systematic literature review. Ultrasound. 2023;31(3):164–75.
- 15. Jeblick K, Schachtner B, Dexl J, Mittermeier A, Stüber AT, Topalis J, Weber T, Wesp P, Sabel B, Ricke J, Ingrisch M. ChatGPT makes medicine easy to swallow: an exploratory case study on simplified radiology reports. Eur Radiol. 2024;34(5):2817–25.
- 16. Lyu Q, Tan J, Zapadka ME, Ponnatapura J, Niu C, Myers KJ, Wang G, Whitlow CT. Translating radiology reports into plain language using ChatGPT and GPT-4 with prompt learning: results, limitations, and potential. Vis Comput Ind Biomed Art. 2023;6(1):9.
- 17. Moramarco F, Juric D, Savkov A, Flann J, Lehl M, Boda K, Grafen T, Zhelezniak V, Gohil S, Papadopoulos Korfiatis A, Hammerla N. Towards more patient friendly clinical notes through language models and ontologies. In: AMIA Annu Symp Proc. American Medical Informatics Association; 2021. p. 881–90.
- 18. Imperial JM, Tayyar Madabushi H. Flesch or fumble? Evaluating readability standard alignment of instruction-tuned language models. arXiv preprint. 2023;arXiv:2309.05454.
- 19. Bhatt C, et al. Evaluating readability, understandability, and actionability of online printable patient education materials for cholesterol management: a systematic review. J Am Heart Assoc. 2024;13:e030140.
- 20. García-Álvarez JM, García-Sánchez A. Readability of informed consent forms for medical and surgical clinical procedures: a systematic review. Clin Pract. 2025;15(2):26.
- 21. Scholz K, Wenzel M. Evaluating readability metrics for German medical text simplification. In: Proceedings of the 31st International Conference on Computational Linguistics. 2025.
- 22. Alsentzer E, et al. Publicly available clinical BERT embeddings. arXiv preprint. 2019;arXiv:1904.03323.
- 23. Jiang C, Xu W. MedReadMe: a systematic study for fine-grained sentence readability in medical domain. In: Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. Miami, Florida, USA: Association for Computational Linguistics; 2024. p. 17293–319.
- 24. Johnson A, Bulgarelli L, Pollard T, Gow B, Moody B, Horng S, Celi LA, Mark R. MIMIC-IV (version 3.1). PhysioNet. 2024. 10.13026/kpb9-mt58. RRID:SCR_007345.
- 25. Johnson AEW, Bulgarelli L, Shen L, et al. MIMIC-IV, a freely accessible electronic health record dataset. Sci Data. 2023;10:1. 10.1038/s41597-022-01899-x.
- 26. Goldberger A, Amaral L, Glass L, et al. PhysioBank, PhysioToolkit, and PhysioNet: components of a new research resource for complex physiologic signals. Circulation. 2000;101(23):e215–20. RRID:SCR_007345.
- 27. Johnson A, Pollard T, Horng S, Celi LA, Mark R. MIMIC-IV-Note: deidentified free-text clinical notes (version 2.2). PhysioNet. 2023. 10.13026/1n74-ne17. RRID:SCR_007345.
- 28. Honnibal M, Montani I, Van Landeghem S, Boyd A. spaCy: industrial-strength natural language processing in Python. 2020. Available from: https://spacy.io
- 29. Shannon CE. A mathematical theory of communication. Bell Syst Tech J. 1948;27(3):379–423.
- 30. Kincaid JP, et al. Derivation of new readability formulas (Automated Readability Index, Fog Count and Flesch Reading Ease Formula) for Navy enlisted personnel. 1975.
- 31. McLaughlin GH. SMOG grading: a new readability formula. J Read. 1969;12:639–46.
- 32. Gunning R. The technique of clear writing. McGraw-Hill; 1952.
- 33. Leroy G, Endicott JE, Kauchak D, Mouradi O, Just M. User evaluation of the effects of a text simplification algorithm using term familiarity on perception, understanding, learning, and information retention. J Med Internet Res. 2013;15(7):e2569.
- 34. Estopà R, Montané MA. Terminology in medical reports: textual parameters and their lexical indicators that hinder patient understanding. Terminology. 2020;26(2):213–36.
- 35. Michalopoulos G, et al. UmlsBERT: clinical domain knowledge augmentation of contextual embeddings using the Unified Medical Language System Metathesaurus. arXiv preprint. 2020;arXiv:2010.10391.
- 36. Bodenreider O. The Unified Medical Language System (UMLS): integrating biomedical terminology. Nucleic Acids Res. 2004;32:D267–70.
- 37. McCray AT, Burgun A, Bodenreider O. Aggregating UMLS semantic types for reducing conceptual complexity. In: MEDINFO 2001. IOS Press; 2001.
- 38. Bodenreider O, McCray AT. Exploring semantic groups through visual approaches. J Biomed Inform. 2003;36(6):414–32.
- 39. Soldaini L, Goharian N. QuickUMLS: a fast, unsupervised approach for medical concept extraction. In: MedIR Workshop, SIGIR. 2016.
- 40. Menta A, Garcia-Serrano A. Controllable sentence simplification using transfer learning. In: Proceedings of the Working Notes of CLEF. 2022.