Future Cardiology. 2021 Jan 11;17(7):1215–1224. doi: 10.2217/fca-2020-0083

Identifying knowledge gaps in heart failure research among women using unsupervised machine-learning methods

Khalid Alhussain 1,*, Kazuhiko Kido 2, Nilanjana Dwibedi 3, Traci LeMasters 3, Danielle E Rose 4, Ranjita Misra 5, Usha Sambamoorthi 6
PMCID: PMC8656318  PMID: 33426899

Abstract

Aim:

To identify knowledge gaps in heart failure (HF) research among women, especially postmenopausal women.

Materials & methods:

We retrieved HF articles from PubMed. Natural language processing and text mining techniques were used to screen relevant articles and identify study objective(s) from abstracts. After text preprocessing, we performed topic modeling with non-negative matrix factorization to cluster articles based on the primary topic. Clusters were independently validated and labeled by three investigators familiar with HF research.

Results:

Our model yielded 15 topic clusters from articles on HF among women. Atrial fibrillation was found to be the most understudied topic. From articles specific to postmenopausal women, five clusters were identified. The smallest cluster was about stress-induced cardiomyopathy.

Conclusion:

Topic modeling can help identify understudied areas in medical research.

Keywords: heart failure research, postmenopausal women, topic modeling, unsupervised learning, women

Lay abstract

There is evidence that heart failure (HF) affects men and women differently. This means that researchers should consider gender when studying HF. However, women, especially postmenopausal women, are underrepresented in HF research. With the advancement in technology, we were able to identify and review almost all published studies on HF among women. Then, we summarized the topics of those studies. We observed that the co-occurrence of atrial fibrillation and HF in women has not been well-studied. We also found that the number of articles studying HF in postmenopausal women was very small compared with those studying women in general. Stress-induced cardiomyopathy, a rare condition, was found to be the most understudied topic. Therefore, more research should be conducted in these areas.


Heart failure (HF) affects at least 26 million people worldwide, and its prevalence has been increasing over the past decades [1]. For example, HF prevalence in the USA is expected to rise from 2.42% in 2012 to 2.97% in 2030 [2]. The growing prevalence of HF, along with its high mortality and morbidity [3] and poor health-related quality of life (HRQoL) [4], makes HF a major global health problem. HF mortality has been assessed in several countries [1]. In a registry-based study enrolling 12,440 patients with acute or chronic HF from 21 European and/or Mediterranean countries, the 1-year mortality rates varied across countries, ranging from 21.6 to 36.5% in patients with acute HF and from 6.9 to 15.6% in those with chronic HF [5]. In the USA, the 1-year mortality in patients with HF ranged from 35.1 to 37.5% [6]. Even if they survive, patients with HF have poorer HRQoL, in both its physical and mental components, than the general population [4]. In addition, HF has a high economic burden. Healthcare spending on HF constitutes 1–2% of the global healthcare budget, mainly due to hospitalization costs [7]. Cost estimates vary from one country to another. For instance, total annual costs per patient with HF ranged from $868 in South Korea to $25,532 in Germany [7]. Regardless of the differences across countries, HF imposes a significant health and economic burden worldwide.

Given this burden, there is a need to study HF. A major consideration for future studies is the sex differences in HF burden and risk factors. For example, women with HF have poorer HRQoL than their male counterparts [8]. Furthermore, women tend to develop HF at an older age than men [3,9], which can be explained by the female sex hormone estrogen. Estrogen has anti-atherosclerotic and anti-inflammatory properties, which positively affect the inner layer of the artery wall [10,11]. However, estrogen levels decrease after menopause, and the decline in endogenous estrogen increases the risk of HF in postmenopausal women [12,13]. In terms of risk factors for HF, hypertension is more common in women, whereas myocardial infarction is more prevalent in men [9].

Despite these differences between women and men with HF, women are underrepresented in clinical trials for HF [14,15]. A recent systematic review examined the enrollment of women and other minorities in 118 HF clinical trials [15]. This study revealed that women represented only 27% of participants in clinical trials for HF, and women's participation has not changed significantly over time.

With such underrepresentation of women in HF clinical trials, significant knowledge gaps in HF research among women may exist. These knowledge gaps need to be identified and addressed. To date, no study has reviewed all published HF research among women, specifically among postmenopausal women. Systematic reviews and meta-analyses focus on a single topic (e.g., mortality, treatment, biological markers) [16,17]. However, conducting a broad search of “heart failure” and women in the PubMed database yields over 100,000 articles. Manually reading all these articles and summarizing the topics will not be feasible.

With the widespread digital transformation and the ability of machines to process and understand text through natural language processing, it is now possible to use digital technology to cluster all HF research among women based on the primary objectives of the articles. Such an approach can not only save researchers’ time by substituting computer time [18] but also uncover knowledge gaps in HF research among women. Therefore, the objective of the current study is to identify knowledge gaps in HF research among women, especially postmenopausal women, using unsupervised machine learning methods and articles indexed in the PubMed database.

Materials & methods

Data source, search strategies & procedures

Our data source was PubMed, a free database comprising more than 30 million citations and abstracts of biomedical literature from MEDLINE, life science journals and online books [19]. We only searched PubMed (i.e., no other databases) because we wanted to assess the feasibility of using unsupervised machine learning methods for identifying knowledge gaps. We identified articles on HF research in women from the inception (1959) until 3 December 2019. We conducted two search strategies: the first was broad, where we focused on all women; and the second was specific to postmenopausal women. For search #1, we used the following keywords and medical subject headings: (“heart failure” OR “congestive heart failure” OR “cardiac failure” OR “heart failure therapy” OR “ejection fraction”). For search #2, we used the following strategy: (“heart failure” OR “congestive heart failure” OR “cardiac failure” OR “heart failure therapy” OR “ejection fraction” AND (“postmenopause” OR “menopause”)). We included “ejection fraction” as one of the search terms because ejection fraction plays a key role in HF diagnosis and outcomes [20]. For both searches, we used PubMed search filters on sex (female), species (humans) and text availability (abstract) to enhance our search strategies. For the purpose of this study, no restrictions (e.g., study design or country) were used.
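The two search strategies can be illustrated with a small query builder. This is only a sketch: the helper names are hypothetical, the filter strings (female[MeSH Terms], humans[MeSH Terms], hasabstract) are our assumption about how the PubMed sex, species and abstract filters translate into query syntax, and search #2 is written here as (HF terms) AND (menopause terms), which is our reading of the strategy reported above.

```python
# Search terms as reported in the text.
HF_TERMS = [
    "heart failure",
    "congestive heart failure",
    "cardiac failure",
    "heart failure therapy",
    "ejection fraction",
]
MENOPAUSE_TERMS = ["postmenopause", "menopause"]

# ASSUMPTION: query-syntax equivalents of the PubMed filters
# sex (female), species (humans) and text availability (abstract).
FILTERS = ["female[MeSH Terms]", "humans[MeSH Terms]", "hasabstract"]


def or_group(terms):
    """Join quoted terms with OR inside one pair of parentheses."""
    return "(" + " OR ".join(f'"{term}"' for term in terms) + ")"


def build_query(term_groups, filters=FILTERS):
    """AND together the OR-groups, then append the filters."""
    parts = [or_group(group) for group in term_groups] + list(filters)
    return " AND ".join(parts)


query1 = build_query([HF_TERMS])                   # search #1: all women
query2 = build_query([HF_TERMS, MENOPAUSE_TERMS])  # search #2: postmenopausal women
print(query1)
print(query2)
```

The resulting strings could be pasted into the PubMed search box or passed to a programmatic client; the exact record counts would depend on the search date.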

Procedures

Articles retrieved from the PubMed searches were stored in comma-separated values (CSV) files. We removed duplicates based on article titles. We identified relevant articles based on ‘study objectives’ because the objectives of an article state the intent of the study clearly and precisely. We only included studies having at least one of the HF terms (i.e., “heart failure” or “cardiac failure”) in their objectives.
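The deduplication and relevance screening described above can be sketched in a few lines. The record fields ("title", "objective") and the example records are hypothetical, and normalizing titles by case and whitespace before comparison is our assumption; the text states only that duplicates were removed based on article titles.

```python
import re

# The two HF terms that must appear in the study objective(s).
HF_PATTERN = re.compile(r"\b(heart failure|cardiac failure)\b", re.IGNORECASE)


def normalize_title(title):
    """ASSUMPTION: compare titles case-insensitively, ignoring extra spaces."""
    return re.sub(r"\s+", " ", title).strip().lower()


def drop_duplicate_titles(records):
    """Keep the first record for each normalized title."""
    seen = set()
    unique = []
    for record in records:
        key = normalize_title(record["title"])
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


def is_relevant(objective):
    """True if the objective mentions at least one HF term."""
    return bool(HF_PATTERN.search(objective))


# Hypothetical records standing in for rows read from the CSV files.
records = [
    {"title": "Heart failure in women", "objective": "To describe heart failure outcomes."},
    {"title": "Heart  failure in Women", "objective": "To describe heart failure outcomes."},
    {"title": "Bone density after menopause", "objective": "To assess fracture risk."},
]
unique_records = drop_duplicate_titles(records)
eligible = [r for r in unique_records if is_relevant(r["objective"])]
print(len(records), len(unique_records), len(eligible))
```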

As our main interest was in summarizing the HF research in women and postmenopausal women, we used ‘topic models’, a type of statistical model for identifying a set of ‘topics’ that best describes a given document (in this case, a given PubMed article). Topic modeling is an unsupervised machine learning method that automatically clusters a set of documents according to their shared ‘semantic structures’, or topics. Notably, topic modeling can group words used within the same context and distinguish uses of the same word in different contexts. Furthermore, topic modeling does not require pre-existing knowledge of the categories of the articles [18]. Topic modeling has been applied to different medical datasets, including lung cancer, breast cancer and Salmonella PFGE genotyping datasets [21]. Following the framework for the smart literature review of big data, we used three key steps: preprocessing, topic modeling and postprocessing of outcomes [18]. All procedures and modeling were conducted with Python 3.7.

Text preprocessing

Text preprocessing is a crucial step in the process of building any model. Typically, text preprocessing helps machine learning algorithms by removing or filtering less useful parts of the text through various methods, such as punctuation and stop word removal. In the current study, we were interested in analyzing study objectives instead of full texts and abstracts. This is because study objectives contain information relevant to the primary topic of the study, while abstracts contain information that may not be directly related to the primary topic of the study (e.g., literature review and statistical analysis). We used text mining techniques to extract the objective(s) of each study. However, these methods were not applicable to unstructured abstracts; in those cases, we analyzed the full abstract. We preprocessed the text using the Natural Language Toolkit, a Python library for processing human language [22]. We first removed stop words (e.g., a, is, the, and) and common words that carry less meaning than keywords. Examples of such words are “introduction”, “background”, “aim”, “objective” and “purpose”. These words usually precede the study objectives in structured abstracts. After removing unnecessary words, we conducted two more steps (i.e., tokenization and lemmatization). Tokenization is the process of splitting text into a list of tokens, and lemmatization is a morphological analysis that reduces words to their base form (e.g., using the lemma “study” for studies, study, studied and studying).
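The preprocessing steps above can be illustrated with a dependency-free sketch. Everything here is an assumption made for illustration: the section-label regular expression, the short stop word list and the toy lemma dictionary are stand-ins for the text mining rules and the Natural Language Toolkit resources used in the study, and the sketch tokenizes before removing words for simplicity.

```python
import re

# ASSUMPTION: structured abstracts mark sections with labels such as "Aim:".
SECTION_RE = re.compile(
    r"\b((?:introduction|background|aims?|objectives?|purpose|"
    r"materials? (?:&|and) methods|methods|results|conclusions?))\s*:",
    re.IGNORECASE,
)
OBJECTIVE_LABELS = {"aim", "aims", "objective", "objectives", "purpose"}

# Toy stand-ins for the NLTK stop word list and lemmatizer.
STOP_WORDS = {"a", "an", "is", "the", "and", "to", "in", "of", "among", "we"}
COMMON_WORDS = {"introduction", "background", "aim", "objective", "purpose"}
TOY_LEMMAS = {"studies": "study", "studied": "study", "studying": "study",
              "gaps": "gap", "women": "woman"}


def extract_objective(abstract):
    """Return the objective section, or the full abstract if unstructured."""
    parts = SECTION_RE.split(abstract)  # [preamble, label, text, label, text, ...]
    for label, text in zip(parts[1::2], parts[2::2]):
        if label.lower() in OBJECTIVE_LABELS:
            return text.strip()
    return abstract.strip()


def preprocess(text):
    """Lowercase, tokenize, drop stop/common words, then map to toy lemmas."""
    tokens = re.findall(r"[a-z]+", text.lower())
    kept = [t for t in tokens if t not in STOP_WORDS and t not in COMMON_WORDS]
    return [TOY_LEMMAS.get(t, t) for t in kept]


abstract = ("Aim: To identify knowledge gaps in heart failure research among women. "
            "Materials & methods: We retrieved articles from PubMed. "
            "Results: Fifteen clusters were found.")
objective = extract_objective(abstract)
tokens = preprocess(objective)
print(objective)
print(tokens)
```

In practice the toy dictionaries would be replaced by a full stop word list and a real lemmatizer.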

Topic modeling with non-negative matrix factorization

Topic modeling involves grouping similar word patterns to identify topics, and several algorithms are available for this purpose, including non-negative matrix factorization (NMF), which is based on linear algebra [23]. We selected NMF to identify topics and classify the documents according to these topics at the same time. NMF was applied to term frequency-inverse document frequency (TF-IDF) weights, a weighting scheme that assigns each word in our dataset (i.e., PubMed abstracts) a weight. The higher the weight, the more important the word is. To compute the TF-IDF weighting, we used TfidfVectorizer from the scikit-learn Python module with an n-gram range of 1 to 2.
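For readers unfamiliar with the method, the weighting and factorization can be written compactly. The notation below is ours, not the authors': N is the number of documents, M the number of terms (unigrams and bigrams) and K the number of topics. The idf shown is the textbook form; implementations often use a smoothed variant. Assigning each document to its highest-weighted topic is an assumption about how cluster membership could be derived.

```latex
% TF-IDF weight of term t in document d (textbook form)
\mathrm{tfidf}(t,d) = \mathrm{tf}(t,d)\,\cdot\,\mathrm{idf}(t),
\qquad
\mathrm{idf}(t) = \log\frac{N}{\mathrm{df}(t)}

% NMF: approximate the N x M TF-IDF matrix V by two non-negative factors
V \approx W H,
\qquad
\min_{W \ge 0,\; H \ge 0} \; \lVert V - W H \rVert_F^2,
\qquad
W \in \mathbb{R}_{\ge 0}^{N \times K},\;
H \in \mathbb{R}_{\ge 0}^{K \times M}

% Assumed cluster assignment: document d goes to its dominant topic
\mathrm{cluster}(d) = \arg\max_{k \in \{1,\dots,K\}} W_{dk}
```

Each row of H gives the term weights of one topic, so its largest entries serve as the topic's keywords, and each row of W gives the topic weights of one document.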

We performed topic modeling on all studies of women with HF (search #1) and studies specific to postmenopausal women (search #2). To identify the optimal number of clusters, we ran the algorithm with different numbers of topics (n); for example, we set n to 5, 10, 15, 20 and 25. Then, we manually evaluated the outputs from all models and selected the most interpretable model. All analyses were performed using Python 3.7.
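A minimal sketch of this model-selection loop with scikit-learn is shown below. The six toy documents, the candidate topic numbers (2 and 3 instead of 5 to 25), the NMF settings (init, random_state, max_iter) and the argmax cluster assignment are all assumptions made for illustration; the study's actual settings beyond TfidfVectorizer with an n-gram range of 1 to 2 are not reported.

```python
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-ins for preprocessed study objectives.
docs = [
    "heart failure atrial fibrillation stroke risk",
    "atrial fibrillation rate control heart failure",
    "quality of life depression heart failure",
    "depression symptom quality of life questionnaire",
    "natriuretic peptide bnp level heart failure",
    "bnp natriuretic peptide prognostic value",
]

# TF-IDF weights over unigrams and bigrams, as described in the text.
vectorizer = TfidfVectorizer(ngram_range=(1, 2))
tfidf = vectorizer.fit_transform(docs)
# Column index -> term, built from the fitted vocabulary.
terms = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)


def fit_topics(n_topics, top_n=5):
    """Fit NMF and return document weights, cluster labels and topic keywords."""
    model = NMF(n_components=n_topics, init="nndsvd", random_state=0, max_iter=500)
    W = model.fit_transform(tfidf)   # documents x topics
    H = model.components_            # topics x terms
    labels = W.argmax(axis=1)        # ASSUMPTION: dominant topic = cluster
    keywords = [[terms[i] for i in row.argsort()[::-1][:top_n]] for row in H]
    return W, labels, keywords


# Fit one model per candidate number of topics, then inspect them manually.
results = {n: fit_topics(n) for n in (2, 3)}
for n, (_, labels, keywords) in results.items():
    print(n, labels.tolist(), keywords)
```

The printed keywords and cluster sizes for each candidate n are what a reviewer would compare when judging which model is most interpretable.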

Postprocessing

Validation of topic modeling: human intelligence

During postprocessing, we reviewed the identified clusters to ensure that they were interpretable. Moreover, we used expert evaluation to validate the topic models. Clusters yielded from our model were independently labeled and validated by three investigators familiar with HF research. In case of a disagreement on a cluster label, the investigators first discussed the label, and disagreements were resolved by consensus. If a disagreement could not be resolved, the investigators reviewed that cluster in depth by randomly selecting 40 articles within the cluster and reviewing their titles and abstracts. Finally, we reported the frequency and percentage of agreements and disagreements.
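The agreement summary can be computed as in the following sketch. The function name and the example labels are hypothetical, and counting a cluster as an agreement only when all three investigators assigned the same label is our assumption about how agreement was defined.

```python
def count_agreements(labels_by_cluster):
    """Return (agreements, clusters, percentage of agreement).

    ASSUMPTION: a cluster counts as an agreement only if every
    investigator assigned it exactly the same label.
    """
    agreed = sum(1 for labels in labels_by_cluster if len(set(labels)) == 1)
    total = len(labels_by_cluster)
    return agreed, total, 100.0 * agreed / total


# Hypothetical labels from three investigators for five clusters.
example_labels = [
    ("Pharmacotherapy", "Pharmacotherapy", "Pharmacotherapy"),
    ("Exercise", "Exercise", "Exercise"),
    ("HRQoL", "HRQoL", "HRQoL"),
    ("Atrial fibrillation", "Atrial fibrillation", "Atrial fibrillation"),
    ("Hemodynamic effects", "Blood pressure", "Hemodynamic effects"),
]
summary = count_agreements(example_labels)
print(summary)
```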

Results

Study retrieval & selection

Automated extraction using search strategy #1 yielded 69,558 articles related to HF in women. Of these, 6 articles with no abstract and 53 duplicates were removed. The remaining (69,499 articles) were electronically screened for relevance (i.e., study objective[s] must have at least one of the HF terms). This process yielded 32,946 eligible HF articles for topic modeling.

Using a separate search strategy focusing on postmenopausal women (search #2), there were 41,519 articles with abstracts after removing 150 duplicates. After electronic screening, 41,442 articles were excluded because they were not relevant based on the study objective(s) (i.e., absence of all HF terms in the study objective[s]). A total of 77 articles were included in the topic modeling. Flow charts illustrating each step of this process are shown in Figure 1.

Figure 1. Flow chart for the selection of studies (A) on women with heart failure and (B) specific to postmenopausal women with heart failure.

Topic clusters

A description of the topic clusters is shown in Table 1. For search strategy #1, the topic model with 15 topic clusters was selected because it was the most interpretable model for HF articles in women. In terms of size, the largest topic cluster consisted of 4578 articles (13.9%), whereas the smallest topic cluster consisted of 808 articles (2.5%) (Figure 2A). The most studied topic in HF among women was epidemiology and disease burden of HF. For search strategy #2, the most interpretable topic model yielded 5 clusters out of 77 articles on HF in postmenopausal women. The largest cluster included 34 articles (44.2%), while the smallest cluster had 6 articles (7.8%) (Figure 2B). The most studied topic in postmenopausal women was cardiovascular risk (e.g., effects of lipid accumulation product and blood pressure on cardiovascular risk in postmenopausal women).

Table 1. Description of the topic clusters.

Topic/cluster label | Keywords (examples) | Articles (n)
Clusters yielded from search #1 (n = 15)
Epidemiology/disease burden of HF | “prevalence”, “hf risk”, “factor”, “obesity”, “incident”, “chronic hf”, “acute hf”, “systolic hf”, “advanced hf”, “hf preserved”, “hf outcome”, “hf hospitalization”, “outpatient”, “mortality”, “population” | 4578
Heart procedures - mainly valvular | “surgery”, “operation”, “valve replacement”, “mitral regurgitation”, “aortic valve”, “tricuspid”, “bypass”, “coronary artery”, “echocardiography”, “dilated cardiomyopathy”, “stenosis”, “treatment”, “cardiac failure”, “congestive heart”, “severe”, “underwent”, “complication” | 4515
Clinical markers in chronic HF | “inflammation”, “tnfalpha”, “endothelial”, “cytokine”, “cell”, “marker”, “activation”, “oxidative”, “sympathetic”, “muscle”, “serum”, “breathing”, “sdb”, “sleep”, “severity”, “renal”, “copd”, “anemia”, “prognosis”, “elderly”, “congestive heart”, “chronic heart” | 3243
Myocardial infarction | “myocardial infarction”, “acute myocardial infarction”, “coronary artery”, “cardiac index”, “incidence”, “age”, “diabetes”, “stroke”, “outcome”, “hospitalization”, “survival”, “all-cause”, “sudden”, “death”, “mortality” | 2990
HRQoL | “quality of life”, “health-related quality”, “depressive symptom”, “depression”, “physical”, “symptom”, “status”, “selfcare”, “questionnaire”, “program”, “intervention”, “education”, “social”, “service”, “caregiver” | 2909
Hemodynamic effects | “hemodynamic”, “pulmonary artery”, “pulmonary capillary”, “systemic vascular”, “vascular resistance”, “arterial pressure”, “heart rate”, “wedge pressure”, “blood pressure”, “cardiac index”, “stroke” | 2562
Pharmacotherapy | “ace inhibitor”, “beta-blockers”, “diuretic”, “receptor blocker”, “arb”, “captopril”, “digoxin”, “enalapril”, “angiotensin receptor”, “antagonist”, “inhibition”, “dose”, “mg”, “placebo”, “drug”, “class” | 1713
Cardiac biomarkers | “brain natriuretic”, “bnp level”, “anp”, “b-type natriuretic”, “nt-pro-bnp”, “natriuretic peptide”, “n-terminal pro-brain”, “serum”, “plasma”, “pgml”, “marker”, “concentration”, “measurement”, “prognostic value”, “diagnosis” | 1654
Acute decompensated heart failure | “acute decompensated”, “adhf”, “worsening renal”, “wrf”, “renal dysfunction”, “aki”, “emergency department”, “inhospital”, “admission”, “hospitalized”, “nesiritide”, “diuretic” | 1545
Exercise | “aerobic”, “cardiopulmonary exercise”, “peak exercise”, “training”, “ventilation”, “exercise test”, “exercise tolerance”, “functional capacity”, “oxygen uptake”, “oxygen consumption”, “peak vo”, “vevco” | 1469
Cardiac resynchronization therapy | “cardiac resynchronization”, “crt”, “crt-d”, “icd”, “defibrillator”, “implantation”, “dyssynchrony”, “pacing”, “bundle branch”, “branch block”, “lbbb”, “delay”, “remodeling”, “biventricular”, “lead”, “qrs duration” | 1295
Left ventricular assist device & heart transplantation | “lvad implantation”, “pump”, “bridge”, “mechanical circulatory”, “assist device”, “heartmate”, “cardiac transplantation”, “recovery”, “experience”, “survival”, “advanced heart”, “end-stage heart” | 1255
Left ventricular ejection fraction phenotypes | “hfpef”, “hfmref”, “hfref”, “reduced ef”, “preserved ef”, “midrange”, “pathophysiology”, “hypertension”, “prognostic”, “outcome”, “ejection fraction” | 1209
Systolic & diastolic dysfunction | “systolic dysfunction”, “diastolic dysfunction”, “lv systolic”, “lv dysfunction”, “lv diastolic”, “velocity”, “right ventricular”, “myocardial”, “doppler”, “pacing”, “filling”, “volume”, “echocardiography”, “diastolic function”, “ejection fraction” | 1201
Atrial fibrillation | “atrial fibrillation”, “af”, “af sinus”, “paroxysmal”, “sinus rhythm”, “permanent atrial”, “persistent atrial”, “af hf”, “incidence”, “new-onset af”, “rate control”, “cardioversion”, “pacing”, “catheter ablation”, “digoxin” | 808
Clusters yielded from search #2 (n = 5)
Cardiovascular disease risk | “cardiovascular risk”, “risk factor”, “myocardial infarction”, “coronary artery”, “sex”, “estrogen”, “hrt”, “postmenopausal woman”, “blood pressure”, “hypertension”, “stroke”, “obesity”, “diabetes”, “morbidity”, “mortality”, “death” | 34
Role of female sex hormone in HF | “sex hormone”, “female”, “estrogen”, “menopause”, “age”, “protective”, “endothelial”, “risk marker”, “lvdd”, “diastolic dysfunction”, “preserved ejection”, “ejection fraction”, “hfpef”, “microvascular”, “role”, “mechanism” | 13
Effect of breast cancer and chemotherapy on HF | “breast cancer”, “advanced breast”, “chemotherapy”, “cyclophosphamide”, “tamoxifen”, “methotrexate”, “doxorubicin”, “mitoxantrone”, “cmf”, “combination”, “course”, “regimen”, “drug”, “agent”, “dose”, “toxicity”, “progression”, “remission”, “alopecia”, “response” | 12
HF incidence | “hf incidence”, “incident hf”, “incident heart”, “risk incident”, “risk heart”, “age”, “early”, “age menopause”, “effect cardiac”, “cvd”, “hf postmenopausal”, “sex hormone”, “hrt”, “deficit”, “vitamin”, “supplementation” | 12
Stress-induced cardiomyopathy | “stress”, “takotsubo syndrome”, “takotsubo cardiomyopathy”, “tt”, “acute”, “syndrome”, “condition”, “reversible”, “rare”, “segment”, “pathophysiology”, “coronary artery”, “left ventricle”, “activation”, “diagnosis”, “imaging”, “admitted”, “morbidity”, “mortality” | 6

VO2 is the rate of oxygen consumption measured during incremental exercise, and vevco refers to minute ventilation-to-carbon dioxide output (VE/VCO2).

Cluster that was reviewed in depth.

adhf: Acute decompensated heart failure; af: Atrial fibrillation; aki: Acute kidney injury; anp: Atrial natriuretic peptide; arb: Angiotensin II receptor blocker; bnp: Brain or B-type natriuretic peptide; cmf: Cyclophosphamide, methotrexate, fluorouracil; copd: Chronic obstructive pulmonary disease; crt: Cardiac resynchronization therapy; crt-d: Cardiac resynchronization therapy defibrillator; cvd: Cardiovascular disease; ef: Ejection fraction; HF: Heart failure; hfmref: Heart failure with mid-range ejection fraction; hfpef: Heart failure with preserved ejection fraction; hfref: Heart failure with reduced ejection fraction; hrt: Hormone replacement therapy; HRQoL: Health-related quality of life; icd: Implantable cardioverter defibrillator; lbbb: Left bundle branch block; lv: Left ventricular; lvad: Left ventricular assist device; lvdd: Left ventricular diastolic dysfunction; mg: Milligram; sdb: Sleep disordered breathing; tnfalpha: Tumor necrosis factor alpha; tt: Takotsubo; wrf: Worsening renal function.

Figure 2. Distribution of studies (A) on women with heart failure (Search #1) and (B) specific to postmenopausal women with heart failure (Search #2) based on main topics.

HF: Heart failure.

Understudied research topics in the literature of HF among women

Based on the cluster size, the three most understudied topics are, first, atrial fibrillation; second, systolic and diastolic dysfunction; and third, left ventricular ejection fraction phenotypes. The knowledge gaps are even greater in the literature of HF among postmenopausal women. Only 6 articles studied stress-induced cardiomyopathy. The effect of breast cancer and chemotherapy on HF was discussed in 12 articles. Also, the incidence of HF in postmenopausal women was studied in 12 articles.
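The ranking by cluster size can be reproduced from the article counts in Table 1, as in the short sketch below. The counts are taken from the table; the variable names are ours.

```python
# Article counts per cluster for search #1, taken from Table 1.
cluster_sizes = {
    "Epidemiology/disease burden of HF": 4578,
    "Heart procedures - mainly valvular": 4515,
    "Clinical markers in chronic HF": 3243,
    "Myocardial infarction": 2990,
    "HRQoL": 2909,
    "Hemodynamic effects": 2562,
    "Pharmacotherapy": 1713,
    "Cardiac biomarkers": 1654,
    "Acute decompensated heart failure": 1545,
    "Exercise": 1469,
    "Cardiac resynchronization therapy": 1295,
    "Left ventricular assist device & heart transplantation": 1255,
    "Left ventricular ejection fraction phenotypes": 1209,
    "Systolic & diastolic dysfunction": 1201,
    "Atrial fibrillation": 808,
}

total = sum(cluster_sizes.values())
# Share of all eligible articles per cluster, in percent.
shares = {label: round(100 * n / total, 1) for label, n in cluster_sizes.items()}
# The three smallest clusters, i.e. the most understudied topics by size.
smallest = sorted(cluster_sizes, key=cluster_sizes.get)[:3]

print(total)
for label in smallest:
    print(label, cluster_sizes[label], shares[label])
```

Cluster size is only a proxy for research attention, so the ranking is best read alongside the expert labeling described above.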

Cluster validation & labeling

Topic clusters were independently validated and labeled by the first, second and seventh authors. The percentage of agreement among authors on topic labels is presented in Table 2. For search strategy #1, the agreement percentage was 80%, which means the authors agreed on 12 out of 15 topic labels. Regarding the other three clusters, disagreements were resolved by reviewing those clusters in depth. For search strategy #2, there were no disagreements on the topic labels.

Table 2. Percentage of agreement among authors on topic labels.

Search strategy | Topic clusters (n) | Agreements, n (%)
Search #1 | 15 | 12 (80)
Search #2 | 5 | 5 (100)

Disagreements were on the following topic labels: systolic and diastolic dysfunction, clinical markers in chronic heart failure, and hemodynamic effects.

Discussion

The main objective of this study was to explore knowledge gaps in HF research among all women and postmenopausal women. We achieved this objective by using topic modeling, an unsupervised machine learning method. Our approach saved researchers’ time once the program was developed. Our program took only 1 min and 4 sec to cluster 32,946 articles into 15 topics. This hybrid approach was more comprehensive and less time-consuming than the expert-based manual literature review method. For example, a study by Myers et al. was conducted to assess the progress of cardiovascular disease research output between 2002 and 2011 using the expert-based manual literature review method [24]. In that study, a physician read the abstracts and decided whether a study was relevant. Although there were 47,897 articles related to cardiovascular disease in 2002 and 54,488 articles in 2011, only 3000 articles randomly selected each year were reviewed. This is mainly because it was difficult to manually review more than 100,000 abstracts. Our approach adds to the emerging body of literature showing the promise of implementing unsupervised machine learning techniques in the field of HF. Several studies have utilized unsupervised machine learning methods to improve the diagnosis and treatment of HF [25–28]. For example, unsupervised cluster analysis has been used to identify HF patients who may respond to a particular therapy [27,28].

Although our approach is comprehensive and efficient, it should be noted that it does not replace systematic reviews and meta-analyses. These two approaches significantly differ in their goals. Topic modeling is broad and focuses on the distribution of topics in a large number of articles without an in-depth analysis of the methods and results of the included articles. On the other hand, systematic reviews and meta-analyses focus on specific research questions and analyze findings from the included articles.

Our current study has revealed that atrial fibrillation is the most understudied area in the literature of HF among women. Prior research in this area has discussed the epidemiology of atrial fibrillation, the role of the natriuretic peptide and the risk of stroke in patients with atrial fibrillation and HF. Nevertheless, this research area should be further explored for several reasons. First, there is a positive association between atrial fibrillation and HF [29,30], and this association can be explained by shared risk factors and pathophysiology [31]. Thus, these two diseases can be regularly encountered concomitantly in clinical practice. Patients with concomitant HF and atrial fibrillation may have even worse symptoms and poorer prognosis, which means they may respond to treatment differently than those with HF or atrial fibrillation alone [30,31]. Furthermore, the co-occurrence of HF and atrial fibrillation may increase the risk of HF hospitalization and all-cause mortality, as previous studies have shown [32,33]. With that being said, future research focusing on the comorbidity of HF and atrial fibrillation in women is needed. This can improve the health outcomes of women affected by these two conditions and the cost–effectiveness of their care.

Another important finding was that the volume of research on HF in postmenopausal women is small. In this study, we identified only 77 articles on HF in postmenopausal women compared with 32,946 in women in general. Based on the content of those articles, the most understudied topic is stress-induced cardiomyopathy. This may be because this condition is rare. In the USA, stress-induced cardiomyopathy was diagnosed in about 0.02% of all nationwide hospitalizations [34]. Of those, 90.6% were women. It is well known that this condition is more common in women than men [35–39]. Therefore, future studies should investigate this topic and address knowledge gaps in this area.

Another major understudied area is the incidence of HF in postmenopausal women. For instance, only a few studies have examined risk factors for the incidence of HF in postmenopausal women [40–43]. There is a critical need to identify factors associated with HF incidence in this population and address the modifiable risk factors. There may be emerging risk factors, such as medication use. Medications that may increase the risk of HF should be identified. For example, one of the clusters yielded from our model was related to cardio-oncology in advanced breast cancer.

Identification of research gaps is the first step toward reducing HF risk and improving health outcomes in women. Our findings provide an overview of HF research among women. Such information can help researchers and funding agencies to prioritize and address research gaps. Using data from this study along with the insights of the professional community may contribute to the development of a research roadmap for HF in women.

Potential limitations and strengths of this study should be noted. First, no evaluation metrics were used to assess the accuracy of the clusters yielded from unsupervised machine learning. However, this limitation was addressed by having three investigators familiar with HF research independently validate and label the clusters. Second, we were not able to extract the study objective(s) from unstructured abstracts; in those cases, we analyzed the full abstract. Finally, we searched only one database (i.e., PubMed) to retrieve HF articles, which might affect the number of articles included in this study. However, this is not a major issue in our paper for two reasons. First, the main purpose of the current paper is to broadly examine the literature of HF among women and identify knowledge gaps. Our approach focuses on very broad research questions (i.e., HF research in all women and postmenopausal women). As a result, we had a relatively large number of eligible articles compared with other types of literature review methods (e.g., systematic reviews and meta-analyses). For example, we ended up with 32,946 eligible articles for HF among women. The number of additional eligible articles retrieved from another database would most likely be small after removing duplicates, and adding a small number of articles to our analysis would probably not significantly affect the final results and conclusions of our study, given the large number of included articles. Second, this study performed topic modeling on the objective(s) of each study, when possible. As far as we know, this is the first study focusing on the study objectives. Therefore, we were interested in testing the feasibility of this approach. Despite these limitations, this study had several strengths. To our knowledge, this was the first study to use big data (PubMed) and unsupervised machine learning methods to identify research topics in the literature of HF among women. In addition, we used natural language processing and text mining techniques to screen and identify relevant articles and extract the objective(s) of each study from the PubMed abstract.

Conclusion

The present study was able to identify gaps in the literature of HF among women, particularly postmenopausal women, using unsupervised machine learning methods. This approach is promising and effective for the discovery of knowledge gaps in medical research. Once unsupervised machine learning procedures are established, clustering a large number of research articles can be performed within a short time. However, human intelligence is required to interpret and validate the results.

Summary points.

  • Women, especially postmenopausal women, are underrepresented in clinical trials for heart failure (HF) despite sex differences in heart failure risk factors and disease burden.

  • Identification of knowledge gaps in HF research among women in general and postmenopausal women in particular is the first step toward reducing heart failure risk and improving health outcomes.

  • In the current study, knowledge gaps in the literature of HF among women, especially postmenopausal women, were identified using unsupervised machine learning methods.

  • Text preprocessing was conducted on the retrieved articles using natural language processing and text mining techniques.

  • Topic modeling with non-negative matrix factorization was performed on the preprocessed text to cluster articles based on their primary topics.

  • Study results illustrate that the largest knowledge gap is in the area of atrial fibrillation as a comorbid condition in women with HF.

  • Stress-induced cardiomyopathy is the most understudied research topic in postmenopausal women.

  • Topic modeling (unsupervised machine learning method) saves researchers’ time once the program is developed.

Acknowledgments

The authors thank all contributors of scikit-learn and Biopython for making their software freely available in Python.

Footnotes

Author contributions

K Alhussain and U Sambamoorthi contributed toward conceptualization. K Alhussain, U Sambamoorthi and K Kido contributed toward validation of topic modeling and cluster labeling. K Alhussain contributed toward writing-original draft. U Sambamoorthi, K Kido, N Dwibedi, T LeMasters, DE Rose and R Misra contributed toward writing-review and editing.

Financial & competing interests disclosure

The authors have no relevant affiliations or financial involvement with any organization or entity with a financial interest in or financial conflict with the subject matter or materials discussed in the manuscript. This includes employment, consultancies, honoraria, stock ownership or options, expert testimony, grants or patents received or pending, or royalties.

No writing assistance was utilized in the production of this manuscript.

References

Papers of special note have been highlighted as: • of interest

  • 1. Savarese G, Lund LH. Global public health burden of heart failure. Card. Fail. Rev. 3(1), 7 (2017). • Shows the need for heart failure research.
  • 2. Heidenreich PA, Albert NM, Allen LA et al. Forecasting the impact of heart failure in the United States: a policy statement from the American Heart Association. Circ. Heart Fail. 6(3), 606–619 (2013).
  • 3. Benjamin EJ, Muntner P, Alonso A et al. Heart disease and stroke statistics-2019 update: a report from the American Heart Association. Circulation 139(1), e56–e528 (2019).
  • 4. Juenger J, Schellberg D, Kraemer S et al. Health related quality of life in patients with congestive heart failure: comparison with other chronic diseases and relation to functional variables. Heart 87(3), 235–241 (2002).
  • 5. Crespo-Leiro MG, Anker SD, Maggioni AP et al. European Society of Cardiology Heart Failure Long-Term Registry (ESC-HF-LT): 1-year follow-up outcomes and differences across regions. Eur. J. Heart Fail. 18(6), 613–625 (2016).
  • 6. Cheng RK, Cox M, Neely ML et al. Outcomes in patients with heart failure with preserved, borderline, and reduced ejection fraction in the Medicare population. Am. Heart J. 168(5), 721–730 e723 (2014).
  • 7. Lesyuk W, Kriza C, Kolominsky-Rabas P. Cost-of-illness studies in heart failure: a systematic review 2004–2016. BMC Cardiovasc. Disord. 18(1), 74 (2018).
  • 8. Dewan P, Rørth R, Jhund PS et al. Differential impact of heart failure with reduced ejection fraction on men and women. J. Am. Coll. Cardiol. 73(1), 29–40 (2019).
  • 9. Azad N, Kathiravelu A, Minoosepeher S, Hebert P, Fergusson D. Gender differences in the etiology of heart failure: a systematic review. J. Geriatr. Cardiol. 8(1), 15 (2011).
  • 10. Nofer J-R. Estrogens and atherosclerosis: insights from animal models and cell systems. J. Mol. Endocrinol. 48(2), R13–R29 (2012).
  • 11. Williams JK, Adams M, Klopfenstein H. Estrogen modulates responses of atherosclerotic coronary arteries. Circulation 81(5), 1680–1687 (1990).
  • 12. American Heart Association. Menopause and heart disease. https://www.heart.org/en/health-topics/consumer-healthcare/what-is-cardiovascular-disease/menopause-and-heart-disease
  • 13. Pardhe BD, Ghimire S, Shakya J et al. Elevated cardiovascular risks among postmenopausal women: a community based case control study from Nepal. Biochem. Res. Int. 2017, 3824903 (2017).
  • 14. Heiat A, Gross CP, Krumholz HM. Representation of the elderly, women, and minorities in heart failure clinical trials. Arch. Intern. Med. 162(15), 1682–1688 (2002).
  • 15. Tahhan AS, Vaduganathan M, Greene SJ et al. Enrollment of older patients, women, and racial and ethnic minorities in contemporary heart failure clinical trials: a systematic review. JAMA Cardiol. 3(10), 1011–1019 (2018). • Shows the underrepresentation of women in heart failure clinical trials.
  • 16. Gwadry-Sridhar FH, Flintoft V, Lee DS, Lee H, Guyatt GH. A systematic review and meta-analysis of studies comparing readmission rates and mortality rates in patients with heart failure. Arch. Intern. Med. 164(21), 2315–2320 (2004).
  • 17. Pufulete M, Maishman R, Dabner L et al. B-type natriuretic peptide-guided therapy for heart failure (HF): a systematic review and meta-analysis of individual participant data (IPD) and aggregate data. Syst. Rev. 7(1), 112 (2018).
  • 18. Asmussen CB, Møller C. Smart literature review: a practical topic modelling approach to exploratory literature review. Journal of Big Data 6(1), 93 (2019). • Presents a framework on how to use topic modeling on a large collection of papers for an exploratory literature review.
  • 19. PubMed. PubMed Overview. https://pubmed.ncbi.nlm.nih.gov/about/
  • 20. Solomon SD, Anavekar N, Skali H et al. Influence of ejection fraction on cardiovascular outcomes in a broad spectrum of heart failure patients. Circulation 112(24), 3738–3744 (2005).
  • 21. Zhao W, Zou W, Chen JJ. Topic modeling for cluster analysis of large biological and medical datasets. BMC Bioinformatics 15(S11), S11 (2014). • Shows applications of topic modeling on medical datasets.
  • 22. Lobur M, Romanyuk A, Romanyshyn M. Using NLTK for educational and scientific purposes. Presented at: 2011 11th International Conference the Experience of Designing and Application of CAD Systems in Microelectronics (CADSM). Polyana-Svalyava, Ukraine, 23–25 February 2011.
  • 23. Kherwa P, Bansal P. Topic modeling: a comprehensive review. EAI Endorsed Transactions on Scalable Information Systems 7(24), (2020).
  • 24. Myers L, Mendis S. Cardiovascular disease research output in WHO priority areas between 2002 and 2011. J. Epidemiol. Glob. Health 4(1), 23–28 (2014).
  • 25. Tabassian M, Sunderji I, Erdei T et al. Diagnosis of heart failure with preserved ejection fraction: machine learning of spatiotemporal variations in left ventricular deformation. J. Am. Soc. Echocardiogr. 31(12), 1272–1284 e1279 (2018).
  • 26. Sanchez-Martinez S, Duchateau N, Erdei T et al. Machine learning analysis of left ventricular function to characterize heart failure with preserved ejection fraction. Circ. Cardiovasc. Imaging 11(4), e007138 (2018).
  • 27. Cikes M, Sanchez-Martinez S, Claggett B et al. Machine learning-based phenogrouping in heart failure to identify responders to cardiac resynchronization therapy. Eur. J. Heart Fail. 21(1), 74–85 (2019).
  • 28. Bose E, Radhakrishnan K. Using unsupervised machine learning to identify subgroups among home health patients with heart failure using telehealth. Comput. Inform. Nurs. 36(5), 242–248 (2018).
  • 29. Anter E, Jessup M, Callans DJ. Atrial fibrillation and heart failure: treatment considerations for a dual epidemic. Circulation 119(18), 2516–2525 (2009).
  • 30. Wang TJ, Larson MG, Levy D et al. Temporal relations of atrial fibrillation and congestive heart failure and their joint influence on mortality: the Framingham Heart Study. Circulation 107(23), 2920–2925 (2003).
  • 31. Kotecha D, Piccini JP. Atrial fibrillation in heart failure: what should we do? Eur. Heart J. 36(46), 3250–3257 (2015).
  • 32. Chamberlain AM, Redfield MM, Alonso A, Weston SA, Roger VL. Atrial fibrillation and mortality in heart failure: a community study. Circ. Heart Fail. 4(6), 740–746 (2011).
  • 33. Zareba W, Steinberg JS, McNitt S et al. Implantable cardioverter-defibrillator therapy and risk of congestive heart failure or death in MADIT II patients with atrial fibrillation. Heart Rhythm 3(6), 631–637 (2006).
  • 34. Deshmukh A, Kumar G, Pant S, Rihal C, Murugiah K, Mehta JL. Prevalence of Takotsubo cardiomyopathy in the United States. Am. Heart J. 164(1), 66–71 e61 (2012).
  • 35. Akashi YJ, Goldstein DS, Barbaro G, Ueyama T. Takotsubo cardiomyopathy: a new form of acute, reversible heart failure. Circulation 118(25), 2754–2762 (2008).
  • 36. Bybee KA, Kara T, Prasad A et al. Systematic review: transient left ventricular apical ballooning: a syndrome that mimics ST-segment elevation myocardial infarction. Ann. Intern. Med. 141(11), 858–865 (2004).
  • 37. Kurowski V, Kaiser A, Von Hof K et al. Apical and midventricular transient left ventricular dysfunction syndrome (tako-tsubo cardiomyopathy): frequency, mechanisms, and prognosis. Chest 132(3), 809–816 (2007).
  • 38. Sharkey SW, Lesser JR, Zenovich AG et al. Acute and reversible cardiomyopathy provoked by stress in women from the United States. Circulation 111(4), 472–479 (2005).
  • 39. Tsuchihashi K, Ueshima K, Uchida T et al. Transient left ventricular apical ballooning without coronary artery stenosis: a novel heart syndrome mimicking acute myocardial infarction. J. Am. Coll. Cardiol. 38(1), 11–18 (2001).
  • 40. Donneyong MM, Hornung CA, Taylor KC et al. Risk of heart failure among postmenopausal women: a secondary analysis of the randomized trial of vitamin D plus calcium of the Women's Health Initiative. Circ. Heart Fail. 8(1), 49–56 (2015).
  • 41. Agha G, Loucks EB, Tinker LF et al. Healthy lifestyle and decreasing risk of heart failure in women: the Women's Health Initiative observational study. J. Am. Coll. Cardiol. 64(17), 1777–1785 (2014).
  • 42. Belin RJ, Greenland P, Martin L et al. Fish intake and the risk of incident heart failure: the Women's Health Initiative. Circ. Heart Fail. 4(4), 404–413 (2011).
  • 43. Shah RU, Winkleby MA, Van Horn L et al. Education, income, and incident heart failure in post-menopausal women: the Women's Health Initiative Hormone Therapy trials. J. Am. Coll. Cardiol. 58(14), 1457–1464 (2011).

Articles from Future Cardiology are provided here courtesy of Taylor & Francis
