Abstract
This dataset compiles breast cancer risk factors from 1697 Cuban women who attended consultations at the Hospital Universitario Clínico-Quirúrgico Comandante Manuel Fajardo in Havana, Cuba. The data were collected to develop a breast cancer risk estimation model specifically tailored to the Cuban population. The dataset includes 23 variables encompassing internationally recognized risk factors such as family history of breast cancer, lifestyle habits, demographic characteristics, and clinical outcomes. The data were extracted from electronic records and anonymized to protect patient privacy, in compliance with the principles of the Declaration of Helsinki and with the approval of the hospital's scientific and ethics committees. This dataset can be employed in the development of predictive models and in comparative studies of risk factors across different populations. It is important to note that the data originate from a single hospital, which may limit their representativeness at the national level.
Keywords: Breast cancer, Risk factors, Cuba, Epidemiology, Machine learning, Public health, Data analysis
Specifications Table
Subject | Health and Medical Sciences. |
Specific subject area | Breast cancer risk factors and epidemiology in Cuban women. |
Type of data | Table, Raw, csv. |
Data collection | Retrospective data were collected from the electronic records of women who attended breast cancer consultations at the Hospital Universitario Clínico-Quirúrgico Comandante Manuel Fajardo. |
Data source location | Hospital Universitario Clínico-Quirúrgico Comandante Manuel Fajardo in Havana, Cuba. |
Data accessibility | Repository name: Mendeley Data; Data identification number: 10.17632/7jhddnpz2p.1; Direct URL to data: https://data.mendeley.com/datasets/7jhddnpz2p/1 |
Related research article | Jose Manuel Valencia-Moreno, Jose Angel Gonzalez-Fraga, Everardo Gutierrez-Lopez, Vivian Estrada-Senti, Hugo Alexis Cantero-Ronquillo, Vitaly Kober. Breast cancer risk estimation with intelligent algorithms and risk factors for Cuban women, Computers in Biology and Medicine, Volume 179, 2024, 108818, ISSN 0010-4825, https://doi.org/10.1016/j.compbiomed.2024.108818 |
1. Value of the Data
- The data can guide public health policymakers in Cuba in formulating strategies for the prevention and control of breast cancer.
- This dataset serves as a valuable resource for understanding the epidemiology of breast cancer among Cuban women.
- It enables comparative analysis of breast cancer risk factors in Cuban women residing in Havana, helping to identify potential differences when compared to other populations.
- The collected data provide a solid foundation for developing new predictive models for breast cancer, tailored to populations with sociodemographic and genetic characteristics similar to those of Cuban women.
- The dataset can be utilized to improve the accuracy of existing predictive models by incorporating risk factors specific to the Cuban population, thereby enabling the calibration of global models to local contexts.
- The dataset is a valuable resource for students, academics, and researchers in epidemiology, public health, data science and machine learning.
2. Background
The motivation for compiling this dataset arose from the need to develop a breast cancer risk estimation model specifically adapted to Cuban women [1]. While models such as the Gail model in the United States and the Tyrer–Cuzick model in the United Kingdom exist, they have been designed and validated in different populations, limiting their accuracy and applicability to the Cuban population due to genetic, environmental, and socioeconomic differences [2,3,4]. Comparative studies have shown that these differences can significantly impact the effectiveness of predictive models when applied in diverse contexts [5,6]. For this reason, many countries have undertaken efforts to collect data and develop risk estimation models tailored to their specific populations [7,8,9,10]. Following this approach, this dataset was compiled with the aim of developing a predictive model based on intelligent algorithms that more accurately reflects the unique characteristics of Cuban women [1].
3. Data Description
The dataset is provided in raw form as a comma-separated values (CSV) file. The data were extracted from digital databases and comprise 1697 records of women who attended consultations. The dataset includes 23 variables, among them the breast cancer diagnosis and a unique serial number for each case, which serves to ensure patient anonymity in accordance with the Declaration of Helsinki [11].
Table 1 presents the profile of the variables included in the dataset. These variables represent significant risk factors and clinical characteristics associated with breast cancer in the studied population of Cuban women. Each variable is described below, providing key information about its nature and its relevance in the context of the study.
Table 1.
Name | Description | Role | Type | Values | Missing |
---|---|---|---|---|---|
Id | Sequential registration number | id | Quantitative | 1 - 1697 | 0 |
Age | Patient's age in completed years | regular | Quantitative | 20 - 90 | 0 |
Menarche | Age at first menstruation | regular | Quantitative | 8 - 17 | 0 |
menopause | Age at menopause | regular | Qualitative | No, 0, 30 - 60 | 0 |
Agefirst | Age at first successful birth | regular | Qualitative | No, 0, 9–46 | 0 |
Children | Number of live births the patient has had | regular | Qualitative | 0 - 5, 5+ | 0 |
breastfeeding | Months of breastfeeding | regular | Qualitative | No, 0 - 72 (months) | 0 |
Nrelbc | Family history of first-degree breast cancer | regular | Qualitative | Aunt, cousin, daughter, grandmother, mother, no, sister (and combinations) | 0 |
Biopsies | Number of biopsies performed on the patient | regular | Quantitative | 0 - 5 | 1 |
hyperplasia | Presence of atypical hyperplasia | regular | Qualitative | No, yes | 0 |
Race | Patient's race | regular | Qualitative | Black, mixed, white | 0 |
Year | Year of breast cancer diagnosis | regular | Quantitative | 2001 - 2018 | 537 |
Imc | Body mass index | regular | Quantitative | 5.0 - 88.8 | 7 |
Weight | Patient's weight in kilograms | regular | Quantitative | 13 - 240 | 10 |
Exercise | Number of days of physical activity per week | regular | Qualitative | No, NO, diary, 0 - 7 | 0 |
Alcohol | Alcohol consumption | regular | Qualitative | No, yes | 0 |
Tobacco | Tobacco consumption | regular | Qualitative | No, yes | 0 |
Allergies | Type of allergies the patient has | regular | Qualitative | Dermatitis, laryngitis, medicines, no, none, other, rhinitis (and combination) | 0 |
Emotional | Emotional predisposition | regular | Qualitative | Sad, joy | 0 |
Depressive | Presence of depressive symptoms | regular | Qualitative | No, yes | 0 |
histologicalclass | Histological classification of cancer | regular | Quantitative | 1 - 11 | 537 |
Birads | Birads classification of mammography | regular | Qualitative | 3A, 3B, 3C, 4B, 5B, 5C, 6 | 537 |
Cancer | Breast cancer diagnosis | label, target | Qualitative | No, yes | 0 |
In addition, Table 1 details the role of each variable, its type, the amount of missing data, and its domain. It should be noted that the variables “nrelbc” and “allergies” contain multivalued entries, i.e. combinations of different reported values. These details are critical for assessing data quality, performing necessary cleaning tasks, and identifying potential limitations in the analysis.
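As a first data-quality check, the row count and the per-variable missing counts reported in Table 1 can be recomputed from the CSV file. The following is a minimal sketch using only the Python standard library. The set of tokens treated as missing, the helper name `count_missing`, and the column headers in the sample are assumptions made for illustration; the published file may use different headers or encode missing cells differently.

```python
import csv
import io
from collections import Counter

# Tokens treated as missing. This list is an assumption; the published file
# may encode missing cells differently (for example, as empty strings only).
MISSING_TOKENS = {"", "na", "nan", "null"}


def count_missing(csv_file):
    """Return (number of rows, Counter of missing cells per column)."""
    reader = csv.DictReader(csv_file)
    missing = Counter({name: 0 for name in reader.fieldnames})
    n_rows = 0
    for row in reader:
        n_rows += 1
        for name, value in row.items():
            if name is None:
                # Extra cells beyond the header row; ignored in this sketch.
                continue
            if value is None or value.strip().lower() in MISSING_TOKENS:
                missing[name] += 1
    return n_rows, missing


# Tiny illustrative sample, not real records from the dataset.
sample = io.StringIO(
    "id,age,biopsies,year,cancer\n"
    "1,45,0,,No\n"
    "2,61,,2010,Yes\n"
    "3,52,1,2015,Yes\n"
)
n_rows, missing = count_missing(sample)
print(n_rows, dict(missing))
```

To run the same check on the downloaded file, pass a handle opened with `open(path, newline="", encoding="utf-8")`, where `path` is wherever the CSV was saved. If the assumptions above hold, the result should agree with Table 1: 1697 rows and, for example, 537 missing values for the year of diagnosis.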
4. Experimental Design, Materials and Methods
International health organizations have identified a set of key risk factors for breast cancer [12,13,14]. Based on this information, we conducted a comparative review of these risk factors to determine which should be collected in the context of Cuban women. From this review, a list of internationally accepted risk factors was compiled and selected for data collection.
First, the project was presented to the Scientific Committee of the Hospital Universitario Clínico-Quirúrgico Comandante Manuel Fajardo in Havana, Cuba, where it received the necessary approval. Subsequently, the proposal was submitted for review by the Medical Ethics Committee of the same hospital, which also granted its approval.
With the necessary approvals, retrospective data were requested from the hospital's electronic database. Due to the nature of the data, informed consent could not be obtained directly from patients; instead, the hospital anonymized the data to ensure privacy and confidentiality.
Most variables were based on patient self-reports, while others, such as histological classification, BIRADS, year of diagnosis, and breast cancer diagnosis, were obtained through standard medical procedures.
Data were collected from patients undergoing mammography and from women with a clinical suspicion of breast cancer. Thus, the study population included Cuban women who attended consultations for breast cancer evaluation and received a diagnosis, either positive or negative.
Limitations
There are limitations in the dataset due to variability in capturing and interpreting certain qualitative values, particularly the “nrelbc” variable. This variable includes multiple values such as “no”, “mother”, “sister”, “daughter”, and combinations thereof, adding complexity to its analysis. This heterogeneity may cause inconsistencies in categorizing family history, which could affect the accuracy of predictive models.
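One possible way to handle the multivalued “nrelbc” entries is to split each cell into a set of relatives and derive a simple first-degree indicator from it. The sketch below is illustrative only: the function names are hypothetical, the separator used in the actual file is not documented here, so the code extracts alphabetic tokens regardless of the delimiter, and the list of relatives is taken from the values reported in Table 1.

```python
import re

# First-degree female relatives among the values listed in Table 1.
FIRST_DEGREE = {"mother", "sister", "daughter"}
KNOWN_RELATIVES = FIRST_DEGREE | {"aunt", "cousin", "grandmother"}


def parse_relatives(raw):
    """Split one nrelbc cell into a set of lower-case relative names.

    Tokens that are not in KNOWN_RELATIVES (such as "no" or "and") are
    dropped, so a cell meaning "no family history" yields an empty set.
    """
    tokens = re.findall(r"[a-z]+", raw.lower())
    return {t for t in tokens if t in KNOWN_RELATIVES}


def has_first_degree(raw):
    """True if at least one reported relative is first-degree."""
    return bool(parse_relatives(raw) & FIRST_DEGREE)


print(parse_relatives("Mother, sister"))
print(has_first_degree("aunt/cousin"))
```

Because unknown tokens are silently dropped, spelling variants or plurals in the real data would be lost; listing the distinct raw values first and extending `KNOWN_RELATIVES` accordingly would be a sensible precaution.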
For the variables “menopause”, “agefirst”, and “exercise”, the values “No” and “0” (zero) were used to indicate the absence of these characteristics. Additionally, in the “breastfeeding” variable, some values include the word “months.” These inconsistencies may lead to misinterpretation during analysis.
Moreover, the primary limitation is that the data come from a single hospital in Havana, which may not represent the entire Cuban population, limiting the generalizability of the findings. These issues could introduce biases and reduce the precision of predictive models.
To mitigate these limitations, future studies should standardize coding and analysis of qualitative variables like “nrelbc,” re-code “No” and “0” (zero) values, and collect data from multiple health centers in different Cuban regions to enhance representativeness and improve model robustness.
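The re-coding of “No” and “0” values, together with the removal of the word “months” from “breastfeeding” entries, could be sketched as follows. This is one possible convention rather than the authors' procedure: the function name is hypothetical, and mapping every absence marker to 0 is an assumption. Values that cannot be interpreted, such as “diary” in “exercise” or “5+” in “children”, are returned as `None` so that they can be reviewed by hand instead of being guessed.

```python
import re


def to_int_or_none(raw):
    """Normalize one raw cell to an integer where that is unambiguous.

    "No", "NO" and "0" map to 0, a trailing "months" (with or without
    parentheses) is stripped, and plain integers are parsed. Anything
    else returns None so it can be reviewed manually.
    """
    text = raw.strip().lower()
    text = re.sub(r"\s*\(?months?\)?\s*$", "", text)
    if text in {"no", "0"}:
        return 0
    if re.fullmatch(r"\d+", text):
        return int(text)
    return None


for value in ["No", "NO", "0", "12 months", "24", "diary", "5+"]:
    print(repr(value), "->", to_int_or_none(value))
```

For “menopause” and “agefirst”, where 0 is not a meaningful age, an analyst may prefer to keep absence as a separate indicator variable rather than as a numeric zero.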
Ethics Statement
Ethical approval for the use of these data was granted by the Scientific Committee and the Medical Ethics Committee of the Hospital Universitario Clínico-Quirúrgico Comandante Manuel Fajardo in September 2023. Individual informed consent was not required for the analysis of depersonalized health records provided by the hospital, as all data were anonymized, and patients were given the option to opt out of clinical data sharing.
CRediT authorship contribution statement
José Manuel Valencia-Moreno: Conceptualization, Methodology, Investigation, Writing – original draft. Jose Angel Gonzalez-Fraga: Methodology, Writing – review & editing. Everardo Gutierrez-Lopez: Methodology, Writing – original draft. Hugo Alexis Cantero-Ronquillo: Resources, Data curation.
Acknowledgments
We would like to express our gratitude to Dr. Maria Vidal Ledo of the National School of Public Health, Havana, Cuba. We also thank the following physicians for their invaluable assistance: Dr. Dalsy Torres Avila, Dr. Manuel Ortega Soto, Dr. Heydi Bustamante Abreu, and Dr. Monica Gomez Ramos. We also acknowledge the support of the Universidad Autónoma de Baja California in the realization of this work.
This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Data Availability
Mendeley Data: Breast cancer risk factors in Cuban women (Original data).
References
- 1. Valencia-Moreno J.M., Gonzalez-Fraga J.A., Gutierrez-Lopez E., Estrada-Senti V., Cantero-Ronquillo H.A., Kober V. Breast cancer risk estimation with intelligent algorithms and risk factors for Cuban women. Comput. Biol. Med. 2024;179. doi: 10.1016/j.compbiomed.2024.108818.
- 2. Gail M.H., Brinton L.A., Byar D.P., et al. Projecting individualized probabilities of developing breast cancer for white females who are being examined annually. J. Natl. Cancer Inst. 1989;81(24):1879–1886. doi: 10.1093/jnci/81.24.1879.
- 3. Tyrer J., Duffy S.W., Cuzick J. A breast cancer prediction model incorporating familial and personal risk factors. Stat. Med. 2004;23(7):1111–1130. doi: 10.1002/sim.1668.
- 4. Cuzick J., Brentnall A.R., Segal C., et al. A new model for breast cancer risk prediction and its validation in a prospective study of 2 Million Women in the UK. Br. J. Cancer. 2017;117(6):877–883. doi: 10.1038/bjc.2017.229.
- 5. Costantino J.P., Gail M.H., Pee D., et al. Validation studies for models projecting the risk of invasive and total breast cancer incidence. J. Natl. Cancer Inst. 1999;91(18):1541–1548. doi: 10.1093/jnci/91.18.1541.
- 6. McCarthy A.M., Armstrong K. Risk prediction models for breast cancer: challenges to implementation. Cancer Epidemiol., Biomark. Prev. 2014;23(10):2324–2334. doi: 10.1158/1055-9965.EPI-14-0590.
- 7. Nindrea R.D., Usman E., Katar Y., Darma I.Y., Warsiti H.H., Sari N.P. Dataset of Indonesian women's reproductive, high-fat diet and body mass index risk factors for breast cancer. Data Br. 2021;36. doi: 10.1016/j.dib.2021.107107.
- 8. Oguntunde P.E., Adejumo A.O., Okagbue H.I. Breast cancer patients in Nigeria: data exploration approach. Data Br. 2017;15:47–57. doi: 10.1016/j.dib.2017.08.038.
- 9. Awodutire P.O., Kolawole O.A., Ilori O.R. Data on the survival times of breast cancer patients in a Teaching Hospital, Osogbo. Data Br. 2020;32. doi: 10.1016/j.dib.2020.106109.
- 10. Matsuno R.K., Costantino J.P., Ziegler R.G., et al. Projecting individualized absolute invasive breast cancer risk in Asian and Pacific Islander American Women. J. Natl. Cancer Inst. 2011;103(12):951–961. doi: 10.1093/jnci/djr154.
- 11. World Medical Association. World Medical Association Declaration of Helsinki: ethical principles for medical research involving human subjects. JAMA. 2013;310(20):2191–2194. doi: 10.1001/jama.2013.281053.
- 12. World Health Organization (WHO) (2024). Global breast cancer initiative implementation framework: assessing, strengthening and scaling up of services for the early detection and management of breast cancer: executive summary. Available from: https://www.who.int/publications/i/item/9789240067134.
- 13. Pan American Health Organization. Prevention: Breast Cancer Risk Factors and Prevention. Pan American Health Organization; Geneva, Switzerland: 2016. Available from: https://www.paho.org/en/documents/prevention-breast-cancer-risk-factors-and-prevention.
- 14. Centers for Disease Control and Prevention. (2024). Breast Cancer Risk Factors. Available from: https://www.cdc.gov/breast-cancer/risk-factors/index.html.