JAMIA Open. 2024 Feb 9;7(1):ooae013. doi: 10.1093/jamiaopen/ooae013

Bridging information gaps in menopause status classification through natural language processing

Hannah Eyre 1,2, Patrick R Alba 3,4, Carolyn J Gibson 5,6, Elise Gatsby 7, Kristine E Lynch 8,9, Olga V Patterson 10,11, Scott L DuVall 12,13
PMCID: PMC10901606  PMID: 38419670

Abstract

Objective

To use natural language processing (NLP) of clinical notes to augment existing structured electronic health record (EHR) data for classification of a patient’s menopausal status.

Materials and methods

A rule-based NLP system was designed to capture evidence of a patient’s menopause status including dates of a patient’s last menstrual period, reproductive surgeries, and postmenopause diagnosis as well as their use of birth control and menstrual interruptions. NLP-derived output was used in combination with structured EHR data to classify a patient’s menopausal status. NLP processing and patient classification were performed on a cohort of 307 512 female Veterans receiving healthcare at the US Department of Veterans Affairs (VA).

Results

NLP was validated at 99.6% precision. Including the NLP-derived data in a menopause phenotype increased the number of patients with data relevant to their menopausal status by 118%. Using structured codes alone, 83 173 (27.0%) patients could be classified as postmenopausal or premenopausal. With the inclusion of NLP, this number increased to 167 804 (54.6%) patients. The premenopausal category grew by 532.7% with the inclusion of NLP data.

Discussion

By employing NLP, it became possible to identify documented data elements that predate VA care, originate outside VA networks, or have no corresponding structured field in the VA EHR that would be otherwise inaccessible for further analysis.

Conclusion

NLP can be used to identify concepts relevant to a patient’s menopausal status in clinical notes. Adding NLP-derived data to an algorithm classifying a patient’s menopausal status significantly increases the number of patients classified using EHR data, ultimately enabling more detailed assessments of the impact of menopause on health outcomes.

Keywords: natural language processing, menopause, phenotyping, women’s health

Introduction

Menopause is a defining milestone in women’s health, characterized by changing health risks and unique healthcare needs. These changes are attributed to the hypoestrogenic state of postmenopause, as well as to aging-related symptoms, psychosocial changes, and behavioral risk factors common to this period. Menopause is a risk factor for a variety of health concerns including cardiovascular disease,1 osteoporosis,2 and all-cause mortality.3 The significant impact of menopause status on women’s health makes it a common data element for many research studies as a covariate, inclusion or exclusion criteria, or as an element of stratified analysis.

Menopause is clinically defined as 12 months following the last menstrual period or following bilateral oophorectomy, with information on menstrual cycle and surgical history necessary for clear determination.4 Despite the frequent use of menopause status in research, no unified definition derived from observational data exists, and definitions can vary greatly with the availability and structure of the data source. Prospective studies such as the Study of Women’s Health Across the Nation (SWAN)5 define menopause status using analysis of menstrual patterns and medical history through self-report, interviews, and chart abstraction. While these studies have helped clarify the health risks related to menopause, resource-intensive approaches to gathering data are not always available or feasible for research studies.

Electronic health record (EHR) systems can serve as repositories of real-world longitudinal data on large, diverse patient samples and allow researchers to derive insights based on a large number of potential data sources across a healthcare system. However, in women’s health research using EHR data, menopause status is often not considered or is classified through an age proxy.6,7 Using an age-based classification excludes important health information and cannot account for differences among patients. For example, natural menopause typically occurs between the ages of 45-55 years; age alone is insufficient to determine pre-, peri-, or postmenopause status during that period. Further, an age-based classification may misclassify patients with early natural or surgical menopause. Incorporating other sources of data such as a patient’s last menstrual period (LMP) can change the classification of large subsets of a study population.8

Clinical notes have been the subject of analysis using natural language processing (NLP) across many domains9–11 for extracting information that is otherwise not contained in the structured data elements (ie, diagnoses, procedures, medications, lab tests) of the EHR. NLP methods have been applied in a variety of women’s health studies using clinical notes, including topics such as breast cancer,12 suicide risk during pregnancy,13 and COVID-19 disparities.14 However, there is little work studying or identifying menopause with NLP methods. Existing approaches to identifying evidence of menopause status with NLP often have limited applicability to an EHR-based classification, such as only identifying explicit statements of pre- or postmenopause status within specific EHR note types15 or using social media sources rather than clinical notes.16

The US Department of Veterans Affairs (VA) is a healthcare system with over 1961 locations in the United States, US territories, and the Philippines that serves over 600 000 women annually.17 The VA Corporate Data Warehouse (CDW) has collected decades of data including patient demographic information, diagnoses, procedures, medications, and unstructured clinical notes from all VA locations. The data available in the CDW are an excellent resource for developing an EHR-based method of classifying a patient’s menopause status.

In this article, we describe our analysis of VA structured and text sources as well as the resulting NLP system used to augment structured sources of evidence for an EHR-based algorithm for classifying a patient’s menopause status as either premenopausal or postmenopausal relative to a study-specific index date.

Methods

Study population

The VA COVID-19 Shared Data Resource (CSDR) is a collaboratively developed data repository containing phenotypes derived from both structured data and unstructured clinical notes related to COVID-19. The data are updated monthly and include information about patient demographics, pre-existing conditions, pharmaceutical and non-pharmaceutical interventions, outcomes, and sociodemographic factors aggregated from the CDW and Department of Defense (DoD). Patients are identified by a variety of surveillance efforts at VA.18,19 Each patient in the CSDR receives an index date, which is either the date of a COVID-19 test or the patient’s hospital admission date.

In this work, the menopause status of all female Veterans is given with respect to their CSDR index date. In this initial exploration, identified menopause status is limited to postmenopause and premenopause (which here includes perimenopause). Female Veterans 65 years or older at the index date are assumed to be postmenopausal and are excluded from NLP processing and further analysis. A summary of the patient cohort can be seen in Table 1. For patients who are younger than 65 years as of the index date, we collect relevant ICD9/10 codes from CDW and DoD sources, and all CDW clinical notes for NLP processing. To guide initial NLP development, we identified a sample of 54 065 patients who had indicators of menopause status in their structured data, targeting the cases most likely to contain documentation of menopausal status.

Table 1.

Summary statistics of the cohort of female Veterans younger than 65 years in the CSDR.

N (%)
Female Veterans <65 years at index date 307 512
Age
 18-25 14 213 (4.6)
 26-35 69 944 (22.7)
 36-45 79 405 (25.8)
 46-55 75 671 (24.6)
 56-64 68 279 (22.2)
Race
 American Indian or Alaska Native 2456 (0.8)
 Asian 4269 (1.4)
 Black or African American 74 263 (24.1)
 Native Hawaiian or Other Pacific Islander 2618 (0.9)
 White 118 842 (38.6)
 Unknown 105 064 (34.2)
Ethnicity
 Hispanic or Latino 21 289 (6.9)
 Not Hispanic or Latino 183 201 (59.6)
 Unknown 103 022 (33.5)
Healthcare utilization
 VA primary care appointment within 18 months prior to index 203 665 (66.2)
 Any VA clinical appointment within 24 months prior to index 281 154 (91.4)

Structured EHR data

We gather all data before and after each patient’s index date, including International Classification of Diseases Ninth and Tenth Revision (ICD9 and ICD10) diagnosis and procedure codes as well as Current Procedural Terminology (CPT) codes. These codes include those related to menstrual diagnoses, pregnancy, menopausal diagnoses, and surgical procedures that involve bilateral oophorectomies. Other administrative information related to pregnancy, such as health factors and pregnancy consults, is also included. All administrative codes used in this study are provided in the Supplementary Material.

Structured EHR data relating to pregnancy and menstruation within 1 year of the index date are considered evidence of premenopause, while data related to menopausal diagnoses and surgical procedures are considered evidence of postmenopause.

Natural language processing

We implemented an NLP system with the intent to extract evidence of a patient’s menopausal status with 3 primary goals in mind. First, evidence had to be relevant to a patient-level evaluation algorithm. Evidence could not be captured if it discussed menopause in terms of family history, hypothetical or future statements, or in educational materials. Second, evidence had to be placed on a patient timeline and evaluated at a specific date. Evidence had to have an associated date referencing the described event, which was not always the date of the clinical note. Finally, the evidence had to augment structured data sources and capture evidence that was unavailable or had notable gaps in VA and DoD structured data.

To reach these goals, we selected several forms of textual evidence for extraction. The first kind of evidence was a date of a postmenopausal diagnosis (eg, “menopausal since the 1980s”), which is definitive evidence that may predate available structured data. The second was a date of a relevant surgical procedure, such as a hysterectomy or bilateral oophorectomy (eg, “TAH-BSO Jan 2022”), some of which occur at facilities outside VA or DoD and may only be captured in clinical notes. The third was a date of a patient’s last menstrual period (eg, “LMP: 3/4/2009”), which has no corresponding structured element in the VA EHR. Each of these concepts must be linked with a date identified in the text to ensure both its relevance and its usability on a patient timeline.

As supplemental evidence, primarily for contextualizing a patient’s menstrual period, we also added evidence of menstrual interruption due to birth control (“LMP: has IUD”) or by the use of birth control when mentioned near evidence of a patient’s last menstrual period (eg, “LMP: 2 years ago Birth control: Depo-provera”).

These concepts were captured using a rule-based NLP system built using Leo,20 an NLP framework that primarily utilizes regular expressions designed for deployment on large text corpora. Concepts and dates were identified using a curated list of regular expressions to identify each concept and possible variants or misspellings in the note. Dates identified were exact (eg, “01/01/1994”), partial (eg, “03/2014”), or relative (eg, “3 years ago”). Additionally, concepts with explicit negation (eg, “not menopausal”) were extracted, but not utilized in the final classification. Examples of NLP logic are found in Table 2.
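As an illustration of this kind of regular-expression matching, the following is a minimal Python sketch that finds a last menstrual period mention and labels the date that follows it as exact, partial, or relative. It does not use Leo, and the patterns and the function name `extract_lmp` are hypothetical simplifications, not the curated expressions used in the study.

```python
import re

# Hypothetical, simplified patterns; the study's curated expressions are not reproduced here.
LMP = re.compile(r"\b(?:LMP|last\s+(?:menses|menstrual\s+period))\b[:\s]*", re.IGNORECASE)
EXACT = re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b")                     # eg, 01/01/1994
RELATIVE = re.compile(r"\b\d+\s+(?:day|week|month|year)s?\s+ago\b", re.IGNORECASE)
PARTIAL = re.compile(r"\b(?:\d{1,2}/\d{2,4}|(?:19|20)\d{2}s?)\b")      # eg, 03/2014, 2003

def extract_lmp(text):
    """Return (date_text, date_type) for the first LMP mention, or None if no date follows."""
    mention = LMP.search(text)
    if mention is None:
        return None
    rest = text[mention.end():]
    # Try the most specific pattern first so "3/4/2009" is not read as a partial date.
    for kind, pattern in (("exact", EXACT), ("relative", RELATIVE), ("partial", PARTIAL)):
        found = pattern.match(rest)
        if found:
            return found.group(0), kind
    return None
```

For example, `extract_lmp("LMP: 3/4/2009")` yields an exact date, while `extract_lmp("LMP: IUD")` returns `None` because no date follows the mention.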

Table 2.

Synthetic examples of date resolution, NLP evidence categorization, and patient-level classification.

Note text | Note date | Concept | Extracted date | Resolved date | NLP categorization | Patient-level classification (January 1, 2023 index date) | Explanation
TAH-BSO in the 1980s | January 14, 2019 | Surgical procedure | 1980s | January 1, 1980 | Postmenopause | Postmenopause | Surgical procedures are evidence of postmenopause
Last menses 03/1994 | June 2, 2014 | Last menstrual period | 03/1994 | March 1, 1994 | Postmenopause | Postmenopause | LMP occurs prior to index and >1 year from note’s creation date
LMP: 3 years ago | March 5, 2023 | Last menstrual period | 3 years ago | March 5, 2020 | Postmenopause | Postmenopause | This note originates after index, but the LMP occurs >1 year prior to index
LMP: IUD | May 21, 2022 | Menstrual interruption | n/a | May 21, 2022 | Premenopause | Premenopause | Menstrual interruptions due to birth control are evidence of premenopause
LMP: 2 weeks ago | December 10, 2022 | Last menstrual period | 2 weeks ago | November 26, 2022 | Premenopause | Premenopause | LMP occurs within 1 year of index
LMP: 2/1/2023 | February 10, 2023 | Last menstrual period | 2/1/2023 | February 1, 2023 | Premenopause | Premenopause | LMP identified after index is evidence of premenopause at index
LMP: 2003 | November 1, 2003 | Last menstrual period | 2003 | January 1, 2003 | Premenopause | Insufficient evidence | LMP occurs within 1 year of the note’s creation, but >1 year from the index date
Hysterectomy 3/15/2022 | February 25, 2022 | Surgical procedure | 3/15/2022 | March 15, 2022 | DISCARDED | Insufficient evidence | The date occurs after the document date and is discarded

Patient-level classifications assume that each row is a unique patient, that the example index date applies, and that no other NLP evidence or structured data are available.

Date resolution

The NLP algorithm extracts partial dates, where one or more of the day, month, or year is missing from the text, and relative dates, where the concept is described in some quantity of days, weeks, months, or years in the past. These dates must be resolved to an exact date in order to be used in a patient’s timeline along with structured data, which have a timestamp representing the moment of entry into the EHR.

Relative dates are assumed to be relative to the note’s creation date, rather than to another event described in the note. This allows resolution of the date to simply involve subtracting the time extracted from the document’s creation date. A document created on December 15, 2022 with the phrase “LMP: 3 years ago” will result in a recorded instance of a patient’s last menstrual period on December 15, 2019.

Partial dates are resolved depending on the textual format extracted. Instances with possible ambiguity between formats are resolved preferring larger units of time and American date formats. When a smaller unit of time is missing, such as a day or month, the date is resolved to the earliest available date that meets the criteria of the available information. These rules resolve text such as “2003” to become January 1, 2003 and “3/12” to become March 1, 2012.

Birth control and menstrual interruption extractions do not get paired with dates in the note itself and are instead assigned the note’s creation date.
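The resolution rules above can be sketched in Python as follows. This is an illustrative sketch rather than the study's implementation: the function names are hypothetical, only a few textual formats are handled, and two details are assumptions made here, namely the pivot for 2-digit years and the clamping of the day when the target month is shorter.

```python
import calendar
import re
from datetime import date, timedelta

def resolve_relative(text, note_date):
    """Subtract an expression such as '3 years ago' from the note's creation date."""
    m = re.fullmatch(r"(\d+)\s+(day|week|month|year)s?\s+ago", text.strip(), re.IGNORECASE)
    if m is None:
        return None
    n, unit = int(m.group(1)), m.group(2).lower()
    if unit == "day":
        return note_date - timedelta(days=n)
    if unit == "week":
        return note_date - timedelta(weeks=n)
    months_back = n if unit == "month" else 12 * n
    total = note_date.year * 12 + (note_date.month - 1) - months_back
    year, month0 = divmod(total, 12)
    month = month0 + 1
    # Assumption: clamp the day if the target month is shorter (eg, March 31 minus 1 month).
    day = min(note_date.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)

def resolve_partial(text):
    """Resolve a partial date to the earliest date consistent with the text."""
    t = text.strip()
    m = re.fullmatch(r"((?:19|20)\d{2})s?", t)        # "2003" or "1980s"
    if m:
        return date(int(m.group(1)), 1, 1)
    m = re.fullmatch(r"(\d{1,2})/(\d{4}|\d{2})", t)   # month/year, preferring the larger unit
    if m and 1 <= int(m.group(1)) <= 12:
        year = int(m.group(2))
        if year < 100:
            # Assumption: 2-digit years up to 30 fall in the 2000s, others in the 1900s.
            year += 2000 if year <= 30 else 1900
        return date(year, int(m.group(1)), 1)
    return None
```

With these rules, `resolve_relative("3 years ago", date(2022, 12, 15))` gives December 15, 2019, and `resolve_partial("3/12")` gives March 1, 2012, matching the examples in the text.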

Categorizing NLP evidence

No piece of evidence identified by the NLP system is evidence of a patient’s menopausal status in isolation. All text identified by NLP must be interpreted in the context of the document creation date and the date at which menopausal status is being evaluated.

NLP extractions of an explicit postmenopause diagnosis or surgical procedure occurring prior to the index date are always considered evidence of postmenopause. However, NLP identification of a patient’s last menstrual period can be used as evidence of postmenopause, premenopause, or neither depending on the menstrual period’s relation to the document creation date and date of evaluation of menopausal status.

A patient’s last menstrual period is used as evidence of postmenopause when it occurred more than 1 year prior to the note’s creation. A last menstrual period is evidence of premenopause when it occurs <1 year prior to the index date. Finally, a last menstrual period mention is indeterminate when it occurs <1 year prior to the document creation but >1 year prior to the index date.

Evidence identified by NLP may be kept or discarded depending on the creation date of the note. Surgical procedures and postmenopause diagnoses occurring prior to the index date are evidence of postmenopause at index date, even if the note was created after the index date. Last menstrual period, menstrual interruption, or use of birth control occurring after the index date is evidence of premenopause at index. Additionally, any date that resolves to a date after the note’s creation date is discarded. While surgical procedures may be scheduled for a future date, a scheduled date cannot be used as evidence that the procedure was performed.
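The rules for a last menstrual period mention can be sketched as a small Python function. This is a sketch under stated assumptions, not the study's code: the function name `categorize_lmp` is hypothetical, 1 year is approximated as 365 days, and the order in which the rules are checked is an assumption chosen so that the synthetic rows of Table 2 are reproduced.

```python
from datetime import date, timedelta

ONE_YEAR = timedelta(days=365)  # assumption: "1 year" approximated as 365 days

def categorize_lmp(lmp, note_date, index_date):
    """Categorize a resolved last menstrual period (LMP) date relative to note and index dates."""
    if lmp > note_date:
        return "discarded"      # dates after the note's creation are discarded
    if lmp > index_date:
        return "premenopause"   # menstruation after index implies premenopause at index
    if note_date - lmp > ONE_YEAR:
        return "postmenopause"  # no period for more than 1 year at the time of the note
    if index_date - lmp < ONE_YEAR:
        return "premenopause"   # period within 1 year before index
    return "indeterminate"      # recent at the note, but more than 1 year before index
```

For instance, with a January 1, 2023 index date, an LMP of January 1, 2003 documented on November 1, 2003 is indeterminate, while an LMP of November 26, 2022 documented on December 10, 2022 is evidence of premenopause.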

Patient classification

All evidence sources are aggregated and patients are classified hierarchically by evidence type using evidence of premenopause and postmenopause relative to each patient’s index date. While this allows us to classify patients as either premenopausal or postmenopausal as of the index date, it is not meant to identify the precise date of menopause onset. It is also not meant to determine perimenopause status, which is subsumed in the definition of premenopause in this analysis.

Patients may end up classified as “Premenopausal” with evidence of premenopause/not being postmenopausal within 1 year of index date, “Postmenopausal” with evidence of postmenopause any time before the index date, “Insufficient Evidence” when there is some evidence but it is inconclusive, or “No Data” when a patient has no relevant data for determining their menopause status. A full description of the algorithm is seen in Algorithm 1.

Algorithm 1.

Classification of a patient’s menopausal status.

Require: P, patients in the CSDR
 for each p ∈ P do
  if Sex(p) == F then
   if Age(p) ≥ 65 then
    p ← postmenopausal
   else if isPregnant(p) then
    p ← premenopausal
   else if usesBirthControl(p) or hasMenstrualInterruption(p) then ▷ NLP-derived evidence
    p ← premenopausal
   else if premenopauseCodes(p) then
    p ← premenopausal
   else if premenopausalLMP(p) then ▷ NLP-derived evidence
    p ← premenopausal
   else if menopausalLMP(p) then ▷ NLP-derived evidence
    p ← postmenopausal
   else if menopauseDiagnosisNLP(p) or menopauseCode(p) then ▷ part NLP-derived evidence
    p ← postmenopausal
   else if surgicalProcedureNLP(p) or surgicalProcedureCode(p) then ▷ part NLP-derived evidence
    p ← postmenopausal
   else if indeterminateNLP(p) then ▷ NLP-derived evidence
    p ← insufficient evidence
   else
    p ← no data
   end if
  end if
 end for
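The hierarchy in Algorithm 1 can be expressed as a short Python sketch. The `PatientEvidence` class and its boolean fields are hypothetical stand-ins for the predicates named in the algorithm, and each flag is assumed to have already been evaluated relative to the patient's index date.

```python
from dataclasses import dataclass

@dataclass
class PatientEvidence:
    """Hypothetical per-patient flags, assumed to be evaluated relative to the index date."""
    sex: str
    age: int
    is_pregnant: bool = False
    uses_birth_control: bool = False          # NLP-derived
    has_menstrual_interruption: bool = False  # NLP-derived
    premenopause_codes: bool = False          # structured
    premenopausal_lmp: bool = False           # NLP-derived
    menopausal_lmp: bool = False              # NLP-derived
    menopause_diagnosis_nlp: bool = False     # NLP-derived
    menopause_code: bool = False              # structured
    surgical_procedure_nlp: bool = False      # NLP-derived
    surgical_procedure_code: bool = False     # structured
    indeterminate_nlp: bool = False           # NLP-derived

def classify(p):
    """Return the first matching label in the order given by Algorithm 1."""
    if p.sex != "F":
        return None  # the algorithm assigns no label to these patients
    rules = [
        (p.age >= 65, "postmenopausal"),
        (p.is_pregnant, "premenopausal"),
        (p.uses_birth_control or p.has_menstrual_interruption, "premenopausal"),
        (p.premenopause_codes, "premenopausal"),
        (p.premenopausal_lmp, "premenopausal"),
        (p.menopausal_lmp, "postmenopausal"),
        (p.menopause_diagnosis_nlp or p.menopause_code, "postmenopausal"),
        (p.surgical_procedure_nlp or p.surgical_procedure_code, "postmenopausal"),
        (p.indeterminate_nlp, "insufficient evidence"),
    ]
    for condition, label in rules:
        if condition:
            return label
    return "no data"
```

Because the rules are checked in order, a patient with both a premenopausal LMP and a surgical procedure code is labeled premenopausal, reflecting the precedence of premenopause evidence in the algorithm as written.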

Results

Validation was completed by selecting, for each evidence type, 100 documents that were held out from the initial sample and contained at least 1 NLP-derived instance of that type, totaling 500 documents. All instances in the documents were then manually reviewed, resulting in a total of 554 instances. Expert human annotators evaluated the evidence extracted by the NLP system at 99.6% precision for both identification of the concept and date normalization. NLP errors occurred when other numeric values in a note were extracted as dates or when a partial date was incorrectly resolved.

Of the 307 512 female Veterans younger than 65 years in the CSDR cohort, 277 999 (90.4%) had at least one clinical note and 83 173 (27.0%) had at least one structured code related to menopause status. The NLP system has processed 148 million clinical notes as of March 2023, resulting in 163 976 (53.3%) patients who had at least one NLP-derived result. When combined, 181 325 (59.0%) patients had at least one source of evidence for their menopausal status.

Overall, 167 804 (54.6%) of all patients are able to be classified as either premenopausal or postmenopausal. Limiting the results to patients with at least one primary care appointment at the VA in the 18 months prior to their index date results in 203 665 total patients. Of these, 159 372 (78.3%) are able to be classified as either premenopausal or postmenopausal. An ablation study of the contributions of structured data, NLP-derived evidence, and primary care appointments on classification can be seen in Table 3.

Table 3.

Ablation study of the classification algorithm by evidence source.

N (%)
Classification | All patients | Patients with a primary care appointment in the 18 months prior to index
Female Veterans <65 years at index date | 307 512 | 203 665
Structured data only classification
 Postmenopausal | 71 537 (23.3) | 69 303 (34.0)
 Premenopausal | 11 636 (3.8) | 11 005 (5.4)
 No data | 224 339 (72.9) | 123 357 (60.6)
NLP only classification
 Postmenopausal | 76 252 (24.8) | 74 445 (36.6)
 Premenopausal | 70 628 (23.0) | 66 451 (32.6)
 Insufficient evidence | 16 991 (5.5) | 14 371 (7.1)
 No data | 143 641 (46.7) | 49 398 (24.3)
Structured data+NLP classification
 Postmenopausal | 94 180 (30.6) | 90 172 (44.3)
 Premenopausal | 73 624 (23.9) | 69 200 (34.0)
 Insufficient evidence | 13 521 (4.4) | 11 114 (5.5)
 No data | 126 187 (41.0) | 33 179 (16.3)

By adding NLP-generated evidence to the classification algorithm, patients otherwise classified as “No Data” were identified, resulting in a 31.7% increase in postmenopausal classifications and a substantial 532.7% increase to premenopausal classifications. Figure 1 shows the distributions of patient classifications by age using only structured data sources and with the combination of NLP and structured data.
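The percentage increases quoted above follow directly from the "All patients" counts in Table 3. A minimal Python check, using a hypothetical helper `pct_increase`, is:

```python
def pct_increase(before, after):
    """Percentage change from `before` to `after`, rounded to 1 decimal place."""
    return round(100 * (after - before) / before, 1)

# Counts taken from Table 3 (all patients): structured data only vs structured data+NLP.
postmenopausal = pct_increase(71_537, 94_180)   # 31.7
premenopausal = pct_increase(11_636, 73_624)    # 532.7
```

The premenopausal increase is much larger because structured data alone classify only 11 636 patients as premenopausal, so the baseline is small.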

Figure 1.

Classification of Veteran patients by age with and without NLP.

Discussion

We developed a rule-based NLP system to extract evidence of a patient’s menopause status that, when combined with structured data sources, significantly improves the number of patients able to be classified using EHR data. The NLP system was designed to capture patient information from clinical text that is typically absent from structured data, including healthcare services provided outside the VA network, data predating the VA EHR or a patient’s use of VA services, and data without a corresponding structured component in the VA EHR.

The system has processed over 148 million clinical notes for patients from the CSDR, which includes patients from across the VA healthcare system. Although we evaluate a patient’s menopause status as of their index date into the CSDR, which represents the date of a COVID-19 test or COVID-19-related hospital admission, the date of evaluation is arbitrary and can be substituted for any date relevant to a particular research question.

Increased patient classification

Structured data in the EHR largely capture patients who have required medical care or diagnosis. Patients who have not received care related to menopause or reproductive health, or who received care outside the VA, may not be captured by the structured data. Clinical notes, on the other hand, contain information about routine care or nonintervention. Elements such as a patient’s last menstrual period, which has no corresponding structured element in the VA EHR, are contained exclusively in the notes.

The substantial increase to patients with evidence of premenopause allows finer-grained classifications of patients and enables researchers to distinguish between patients with evidence against being postmenopausal and patients who do not have enough evidence to be classified at all.

Patients classified as “Insufficient Evidence” in our algorithm appear when a patient has NLP-derived evidence, such as having their last menstrual period recorded in a clinical note within 1 year of the note’s creation, indicating premenopause at the time of documentation, but the note was created more than 1 year prior to the index date. If there is no other evidence between the documentation of that menstrual period and the index date, a patient is classified as “Insufficient Evidence.” Patients with insufficient evidence for classification require further investigation and may represent opportunities for improvement in documentation practices or patient care at the VA.

Temporal expression variation

Capture and normalization of temporal expressions in clinical notes is a significant challenge in any clinical NLP problem involving a patient’s timeline during analysis. Our NLP system extracted dates in 3 forms: exact dates, which contained a day, month, and year and represented 61.5% of all dates identified; partial dates, which were missing one or more of day, month, or year and represented 34.4% of all dates identified; and relative dates, which were expressed as some number of days, weeks, months, or years in the past and represented 4.1% of all dates identified. Table 4 shows a distribution of date types and NLP-identified concepts.

Table 4.

NLP identified concepts and date types.

N (%)
All NLP-identified concepts 2 333 038
 Exact date 1 434 970 (61.5)
 Partial date 802 556 (34.4)
 Relative date 95 512 (4.1)
Postmenopause diagnosis 46 742
 Exact date 27 877 (59.6)
 Partial date 16 350 (35.0)
 Relative date 2515 (5.4)
Surgical procedure 537 750
 Exact date 125 173 (23.3)
 Partial date 401 494 (74.7)
 Relative date 11 083 (2.1)
Last menstrual period 1 692 720
 Exact date 1 281 920 (75.7)
 Partial date 327 886 (19.4)
 Relative date 81 914 (4.8)
Menstrual interruption 18 932
Birth control use 37 894

Exact dates include a day, month, and year. Partial dates are missing one or more of the day, month, and year. Relative dates are expressed as some quantity of time in the past.

The expression of dates varies in the text, with 75.7% of all last menstrual period mentions using an exact date, but only 23.3% of all surgical procedures using an exact date. This likely reflects each concept’s differing relation to time. Surgeries occur rarely for most patients, and specificity beyond a month or year may be unnecessary for future decisions. A patient’s menstrual period, on the other hand, usually occurs regularly for a significant portion of their life, and specificity is important for diagnosing health concerns, pregnancy, or menopause.

Documentation of a patient’s last menstrual period in particular differs by the patient age at the time of documentation. The average age of a patient who had their last menstrual period documented using an exact date was 38.1, while the average age of a patient who had their last menstrual period documented with partial or relative dates was 47.9. Figure 2 shows the distribution of last menstrual period dates in clinical notes by age of the patient at the time of note creation and the type of date expression used.

Figure 2.

Age at documentation of last menstrual period, by type of date identified.

These differences in the expression of dates in clinical notes may be a result of each concept’s unique relation to time and a patient’s medical record at a specific point in a patient’s life. Robust capture and normalization of temporal expressions are necessary to fully represent the concepts.

Limitations

This NLP system’s primary limitations are its narrow concept definitions and its strict requirement that concepts be linked to dates in the text. While the NLP system captured evidence used to classify many more patients than structured codes alone, other forms of evidence in clinical notes were not extracted and might have allowed additional patients to be classified. Further exploration of additional sources of evidence is warranted in future iterations of the NLP system, as the prevalence of textual evidence of a patient’s menopause status in VA clinical notes is not known.

The overall algorithm is also unable to identify perimenopause, an important transitional period marked by hormonal fluctuation and changing health risks. Many perimenopausal patients will be classified as premenopausal due to NLP extraction of their last menstrual period. Focused development on the age-restricted patients most likely to be perimenopausal (patients aged 45-55 years and not yet classified as postmenopausal) may provide insights on how to identify evidence of perimenopause with NLP.

While efforts to incorporate gender identity are underway,21 the VA EHR does not have a standard method of distinguishing sex and gender, nor does it record when the patient’s sex has changed. Therefore, our NLP system and classification algorithm do not evaluate all patients who may experience menopause and have evaluated some who will not experience it. Evaluation on patients regardless of sex recorded in the EHR was not performed.

After the US Supreme Court case Dobbs v Jackson overturned a constitutional right to abortion care, there are a variety of implications on reproductive healthcare privacy beyond direct access to abortion procedures or medications.22 Although the VA is working to ensure abortion access for Veterans in all states, the frequency and nature of reporting and documenting reproductive health indicators may shift, potentially impacting the availability and quality of observational data for researchers utilizing EHR data from any source.

While the NLP system was measured to be precise, further work is necessary to evaluate the accuracy and clinical implications of the algorithm as a whole. In addition, 45.4% of all patients and 21.7% of patients with a primary care appointment in the 18 months prior to index remain unable to be classified. Patients who have no relevant menopause data are also most likely to be missing data elsewhere in the EHR; reflecting this, 93 710 (74.3%) patients with no menopause data are also missing race or ethnicity data whereas only 23 316 (14.6%) patients who are classified as either premenopausal or postmenopausal are missing race or ethnicity data. These patients may underscore a lack of documentation that continues to pose a challenge for developing a comprehensive method of classifying a patient’s menopausal status with EHR data. Further work investigating these patients is needed to identify potential improvements to the NLP system and classification algorithm, and to identify potential gaps in care or documentation at VA.

Conclusion

In this article, we demonstrate that the addition of a strict rule-based NLP system aided substantially in classifying a patient’s menopausal status using EHR data. The NLP system effectively captures relevant evidence that was otherwise incomplete or entirely absent in the structured data, resulting in a 43.8% reduction in unclassified patients compared to structured data alone.

While the number of patients who continue to have no data related to their menopausal status is large and may represent documentation gaps, the inclusion of NLP-derived evidence enabled a substantial increase in patient classifications as a whole, particularly premenopausal patients. This represents a significant improvement over using only structured data for classification or relying on a proxy by age. Improved classifications of a patient’s menopausal status enable more individualized analysis of a patient’s health at a large scale to support clinical decision making, policy, and research for women’s health.

Contributor Information

Hannah Eyre, VA Informatics and Computing Infrastructure, VA Salt Lake City Health Care System, Salt Lake City, UT 84113, United States; Department of Internal Medicine, School of Medicine, University of Utah, Salt Lake City, UT 84112, United States.

Patrick R Alba, VA Informatics and Computing Infrastructure, VA Salt Lake City Health Care System, Salt Lake City, UT 84113, United States; Department of Internal Medicine, School of Medicine, University of Utah, Salt Lake City, UT 84112, United States.

Carolyn J Gibson, San Francisco VA Healthcare System, San Francisco, CA 94121, United States; University of California, San Francisco, San Francisco, CA 94115, United States.

Elise Gatsby, VA Informatics and Computing Infrastructure, VA Salt Lake City Health Care System, Salt Lake City, UT 84113, United States.

Kristine E Lynch, VA Informatics and Computing Infrastructure, VA Salt Lake City Health Care System, Salt Lake City, UT 84113, United States; Department of Internal Medicine, School of Medicine, University of Utah, Salt Lake City, UT 84112, United States.

Olga V Patterson, VA Informatics and Computing Infrastructure, VA Salt Lake City Health Care System, Salt Lake City, UT 84113, United States; Department of Internal Medicine, School of Medicine, University of Utah, Salt Lake City, UT 84112, United States.

Scott L DuVall, VA Informatics and Computing Infrastructure, VA Salt Lake City Health Care System, Salt Lake City, UT 84113, United States; Department of Internal Medicine, School of Medicine, University of Utah, Salt Lake City, UT 84112, United States.

Author contributions

H.E., C.J.G., O.V.P., K.E.L., and S.L.D. contributed to conceptualization and design of the study. Data collection, software development, and implementation of the patient classification algorithm were performed by H.E. and E.G. Data analysis was performed by H.E. and P.R.A. H.E. wrote the original draft, which was reviewed and edited by all co-authors. All authors contributed to writing and revising the article and approved the submitted version.

Supplementary material

Supplementary material is available at JAMIA Open online.

Funding

This work was supported using resources and facilities of the VA Informatics and Computing Infrastructure (VINCI), funded under the research priority to Put VA Data to Work for Veterans (VA ORD 22-D4V) and VA Health Services Research & Development Career Development Award (IK2 HX002402) to C.J.G. The views expressed are those of the authors and do not necessarily represent the views or policy of the Department of Veterans Affairs or the United States Government. This study was supported using data from the VA COVID-19 Shared Data Resource (CSDR).

Conflicts of interest

None declared.

Data availability

The data in the CSDR are part of the VA system of record and can be analyzed within the VA firewall by VA-affiliated researchers upon request and with proper regulatory approval. Code for the NLP system is available at: https://github.com/department-of-veterans-affairs/menopausal_status

Articles from JAMIA Open are provided here courtesy of Oxford University Press
