Obstetrics & Gynecology Science
2026 Feb 26;69(2):119–127. doi: 10.5468/ogs.25376

Clinical utility assessment framework for machine learning-based fetal health classification in cardiotocography: an observational study

YooKyung Lee 1, So Yun Kim 1, Hana Park 1
PMCID: PMC13017172  PMID: 41747742

Abstract

Objective

To evaluate the clinical utility and implementation considerations of artificial intelligence (AI)-based fetal health classification systems using the Kaggle Fetal Health Classification dataset, with a focus on obstetric physicians’ perspectives.

Methods

We analyzed the Kaggle Fetal Health Classification dataset (n=2,126), containing 21 cardiotocography parameters. Five machine-learning algorithms were evaluated: logistic regression, random forest, gradient boosting, support vector machine, and decision tree. Class weighting was applied to address the dataset imbalance. The model performance was assessed using standard classification metrics. An expert opinion-based clinical utility assessment framework was developed to assess interpretability, workflow integration, and safety.

Results

With class weighting applied, gradient boosting achieved the highest accuracy (89.67%), followed by random forest (88.50%) and logistic regression (82.16%). The most important predictive features were abnormal short-term variability (16.23% importance) and the percentage of time with abnormal long-term variability (13.21% importance). An analysis of all 21 features revealed that contraction-related parameters, including uterine_contractions, contributed minimally to the classification performance. The 35.3% false negative rate for pathological cases represents a significant safety concern and requires physician oversight.

Conclusion

AI-based fetal health classification systems show potential for future applications when properly validated. However, the significant false negative rate for pathological cases indicates that these systems cannot function independently. External validation using multicenter clinical data and prospective outcome studies is essential before clinical implementation.

Keywords: Cardiotocography; Machine learning; Artificial intelligence; Fetal monitoring; Pregnancy, high-risk

Introduction

Among the major technological advances in obstetric practice, cardiotocography (CTG) and ultrasonography are cornerstone methods for fetal health assessment during pregnancy and labor [1]. CTG provides real-time insights into fetal heart rate patterns and uterine contractions, offering healthcare providers valuable information about fetal well-being and enabling timely clinical interventions when necessary [2].

Despite its widespread clinical adoption and proven utility, CTG trace interpretation remains one of the most challenging and subjective aspects of modern obstetric practice. The complexity of fetal heart rate patterns, combined with the dynamic nature of labor and delivery, creates scenarios in which even experienced obstetricians and midwives may disagree on the interpretation of identical CTG recordings [3]. This inherent subjectivity has been extensively documented, with studies consistently showing significant inter- and intra-observer variability among healthcare professionals across different levels of experience and training [4].

The clinical implications of this interpretive variability extend beyond academic interest, directly impacting patient care quality, clinical decision-making, and ultimately maternal and fetal outcomes. When healthcare providers disagree on CTG interpretation, inconsistencies in clinical management can lead to false-positive and false-negative assessments [5]. False-positive interpretations may result in unnecessary interventions, including emergency cesarean deliveries or other invasive procedures that carry inherent risks for both the mother and baby. Conversely, false-negative interpretations may lead to missed opportunities for timely intervention in cases where fetal compromise is present, potentially resulting in adverse outcomes.

Recently, artificial intelligence (AI) and machine learning technologies have opened up new possibilities for addressing these longstanding challenges in fetal health monitoring. Machine learning algorithms have demonstrated capabilities in pattern recognition, data analysis, and predictive modeling across diverse medical domains, ranging from diagnostic imaging and cardiac arrhythmia detection [6,7] to clinical decision support systems [8]. The application of these computational approaches to fetal health classification represents a promising avenue for improving the objectivity, consistency, and accuracy of CTG interpretation, while maintaining the essential role of clinical expertise in patient care [9].

Recent developments in large language models and conversational AI systems have further expanded the potential applications of AI in healthcare, showing promise for diagnostic support, clinical education, and patient communication in obstetric and gynecological practices [10]. These technological advances provide an important context for understanding the broader landscape of AI applications in obstetric care. Ahn and Lee [11] comprehensively reviewed AI applications in obstetrics, including preterm birth prediction, fetal growth assessment, and CTG interpretation, highlighting both the potential and limitations of various machine learning approaches in maternal-fetal medicine. Kim et al. [12] reviewed AI applications in obstetrics, demonstrating how AI technologies were integrated into CTG, ultrasonography, and magnetic resonance imaging diagnostics in clinical settings, particularly emphasizing the potential of CTG automated interpretation systems.

This study aimed to provide a preliminary clinical utility assessment of AI-based fetal health classification from the perspective of an obstetric physician. We examined not only the technical performance of various machine learning algorithms but also their potential clinical utility, practical applicability, and considerations for future integration into real-world obstetric care settings. Our analysis considers the unique requirements and challenges of clinical practice, providing insights that can guide the responsible implementation of AI technologies in fetal health monitoring, while ensuring patient safety remains the primary consideration.

Materials and methods

1. Study design and clinical framework

This study employed an expert opinion-based clinical utility assessment approach to evaluate AI-based fetal health classification systems from an obstetric physician’s perspective. This study was limited to internal validation using a single publicly available dataset and did not constitute external validation. An expert opinion-based clinical utility assessment framework was developed through consultation with experienced obstetricians and maternal-fetal medicine specialists to ensure that the evaluation criteria reflected real-world clinical needs and priorities.

2. Dataset description

The Kaggle Fetal Health Classification dataset served as the foundation for our analysis, providing a standardized benchmark for machine learning approaches to fetal health classification [13]. This dataset contains 2,126 CTG records collected from fetal monitoring sessions conducted in clinical settings, with each record characterized by 21 carefully extracted features representing various aspects of fetal heart rate patterns and uterine activity (Table 1). The dataset classification scheme divided the fetal health status into three clinically relevant categories: normal (n=1,693 [79.6%]), suspect (n=265 [12.5%]), and pathological (n=168 [7.9%]). This classification system aligns closely with established clinical practice in which CTG interpretations are typically categorized into similar risk-based classifications that guide clinical decision-making processes [14].

Table 1.

Dataset characteristics

Characteristic N (%)
Total samples 2,126
Features 21
Normal cases 1,693 (79.6)
Suspect cases 265 (12.5)
Pathological cases 168 (7.9)
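The class distribution in Table 1 can be recomputed from the reported counts; a quick sanity check (counts taken directly from the paper, percentages recomputed):

```python
# Sanity-check the class distribution reported in Table 1
# (counts taken from the paper; percentages are recomputed).
counts = {"normal": 1693, "suspect": 265, "pathological": 168}

total = sum(counts.values())
assert total == 2126  # matches the reported dataset size

percentages = {label: round(100 * n / total, 1) for label, n in counts.items()}
print(percentages)  # {'normal': 79.6, 'suspect': 12.5, 'pathological': 7.9}
```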

3. Machine learning model development

Five machine learning algorithms were selected for evaluation based on their widespread use in medical applications and clinical interpretability requirements: logistic regression, random forest, gradient boosting, support vector machine (SVM), and decision tree classifiers [15–17]. Similar machine-learning classifiers have been applied in other areas of gynecologic oncology, demonstrating the versatility of these approaches across the obstetric and gynecologic domains [18]. Our methodological approach prioritized interpretable machine learning models over deep learning approaches to ensure clinical transparency and physician trust, aligning with the study’s core objective of evaluating AI systems from a physician’s perspective. Model training was conducted using a stratified train-test split approach (80–20%) to ensure representative sampling across all three fetal health categories.

To address the significant class imbalance in the dataset (normal: 79.6%; suspect: 12.5%; pathological: 7.9%), class weighting was applied using weights that were inversely proportional to the class frequencies. This approach ensures that minority classes (suspect and pathological) receive appropriate emphasis during model training without requiring synthetic data generation.
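The inverse-frequency weighting described above mirrors scikit-learn's `class_weight="balanced"` heuristic, w_c = n_samples / (n_classes × n_c). A minimal sketch using this study's class counts:

```python
# Inverse-frequency class weights, mirroring scikit-learn's
# class_weight="balanced" heuristic: w_c = n_samples / (n_classes * n_c).
counts = {"normal": 1693, "suspect": 265, "pathological": 168}
n_samples = sum(counts.values())
n_classes = len(counts)

weights = {c: n_samples / (n_classes * n) for c, n in counts.items()}
for c, w in weights.items():
    print(f"{c}: {w:.3f}")
# The minority classes receive proportionally larger weights, so a
# misclassified pathological trace costs roughly 10x a normal one.
```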

All experiments were conducted using Python 3.8 with scikit-learn version 0.24.2 [19]. A random seed of 42 was used for reproducibility. The dataset was divided into training (80%) and testing (20%) sets using stratified sampling. A five-fold cross-validation was performed on the training set for model selection. Specific hyperparameters were as follows: logistic regression (C=1.0; max_iter=1000; solver='lbfgs'; multi_class='multinomial'); random forest (n_estimators=100; max_depth=None; min_samples_split=2); gradient boosting (n_estimators=100; learning_rate=0.1; max_depth=3); SVM (C=1.0; kernel='rbf'; gamma='scale'); decision tree (max_depth=None; min_samples_split=2; criterion='gini').
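The training protocol above can be sketched as follows. The Kaggle CSV is not bundled here, so `make_classification` generates a synthetic stand-in of the same shape and imbalance; the gradient-boosting hyperparameters match those reported, and because `GradientBoostingClassifier` exposes no `class_weight` argument, class weighting is applied per sample at fit time:

```python
# Sketch of the training protocol described above. Synthetic stand-in data
# replaces the Kaggle CSV (2,126 samples x 21 features, 79.6/12.5/7.9 split).
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.utils.class_weight import compute_sample_weight

X, y = make_classification(
    n_samples=2126, n_features=21, n_informative=10, n_classes=3,
    weights=[0.796, 0.125, 0.079], random_state=42,
)

# Stratified 80/20 split with the paper's random seed.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42,
)

# Gradient boosting with the reported hyperparameters.
model = GradientBoostingClassifier(
    n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42,
)

# Five-fold cross-validation on the training set (model selection step).
cv_scores = cross_val_score(model, X_train, y_train, cv=5)

# GradientBoostingClassifier has no class_weight argument, so inverse-
# frequency weighting is applied per sample when fitting.
sample_weights = compute_sample_weight("balanced", y_train)
model.fit(X_train, y_train, sample_weight=sample_weights)

test_acc = accuracy_score(y_test, model.predict(X_test))
print(f"CV accuracy: {cv_scores.mean():.3f}; test accuracy: {test_acc:.3f}")
```

Numbers obtained on synthetic data will not reproduce the paper's reported accuracies; the sketch only illustrates the split, cross-validation, and weighting steps.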

4. Expert opinion-based clinical utility assessment framework

An expert opinion-based clinical utility assessment framework was developed to systematically assess the clinical suitability of AI models beyond the technical performance metrics. This framework evaluates six key criteria: technical performance, interpretability, workflow integration, safety considerations, training requirements, and cost-effectiveness. Each criterion was assessed on a five-point scale based on expert opinions from three maternal-fetal medicine specialists with more than 10 years of CTG interpretation experience. Inter-rater reliability was assessed using Fleiss’ kappa (kappa=0.68, indicating substantial agreement). This assessment represented preliminary expert opinions rather than validated clinical outcomes.
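Fleiss' kappa can be computed directly from a subjects × categories rating table. A minimal implementation follows; the rating matrix is a hypothetical example for three raters on a five-point scale (the study's actual expert ratings are not published):

```python
# Fleiss' kappa for m raters assigning N subjects to k categories.
# Input: one row per subject, counting how many raters chose each category.
def fleiss_kappa(table):
    n_sub = len(table)
    n_rat = sum(table[0])  # raters per subject (assumed constant)
    total = n_sub * n_rat

    # Per-subject agreement P_i and overall category proportions p_j.
    p_i = [(sum(c * c for c in row) - n_rat) / (n_rat * (n_rat - 1))
           for row in table]
    p_j = [sum(row[j] for row in table) / total for j in range(len(table[0]))]

    p_bar = sum(p_i) / n_sub       # mean observed agreement
    p_e = sum(p * p for p in p_j)  # chance agreement
    return (p_bar - p_e) / (1 - p_e)

# Hypothetical example: 3 raters scoring 4 items on a 1-5 scale
# (columns = scores 1..5); not the study's actual rating data.
ratings = [
    [0, 0, 3, 0, 0],
    [0, 1, 2, 0, 0],
    [0, 0, 0, 3, 0],
    [0, 0, 1, 2, 0],
]
print(round(fleiss_kappa(ratings), 3))
```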

5. Ethical approval

This study used a publicly available, de-identified dataset (Kaggle Fetal Health Classification dataset). Because the dataset did not contain personally identifiable information, institutional review board approval and informed consent were not required.

Results

1. Model performance analysis

The evaluation of the five machine-learning algorithms revealed substantial variations in performance across different metrics. Table 2 presents the results with and without class weighting applied to address dataset imbalance. With class weighting applied, gradient boosting achieved the highest overall accuracy of 89.67%, followed by random forest (88.50%), and logistic regression (82.16%). The superior performance of ensemble methods such as random forest and gradient boosting aligns with theoretical expectations because these algorithms combine multiple weak learners to create more robust predictive models [16,17].

Table 2.

Model performance comparison with and without class weighting

Algorithm Approach Accuracy (%) Precision (%) Recall (%) F1-score (%)
Logistic regression Original 80.05 71.61 80.05 71.60
Logistic regression Balanced 82.16 74.28 82.16 74.85
Random forest Original 84.74 83.05 84.74 80.73
Random forest Balanced 88.50 86.42 88.50 85.17
Gradient boosting Original 85.45 85.03 85.45 82.15
Gradient boosting Balanced 89.67 87.91 89.67 87.03
SVM Original 79.58 63.33 79.58 70.53
SVM Balanced 81.22 68.45 81.22 73.16
Decision tree Original 72.30 73.43 72.30 72.82
Decision tree Balanced 75.84 76.19 75.84 75.93

SVM, support vector machine.

2. Feature importance analysis

Analysis of feature importance provided insights into which CTG parameters were the most influential in determining fetal health classification (Table 3). All 21 features from the dataset were analyzed and ranked according to their contribution to the predictive performance of the random forest model. The most important predictive feature was abnormal short-term variability, accounting for 16.23% of the predictive power of the model, followed by the percentage of time with abnormal long-term variability (13.21%). This finding aligns well with established clinical knowledge, as reduced or absent short-term variability in fetal heart rate is widely recognized as an indicator of potential fetal compromise that requires clinical attention [20].

Table 3.

Feature importance analysis of all 21 features (random forest model)

Rank Feature Importance (%)
1 abnormal_short_term_variability 16.23
2 percentage_of_time_with_abnormal_long_term_variability 13.21
3 histogram_variance 6.84
4 histogram_max 6.73
5 histogram_mean 6.61
6 baseline_value 6.50
7 histogram_mode 6.47
8 histogram_width 6.31
9 histogram_median 6.05
10 histogram_min 5.89
11 accelerations 5.33
12 fetal_movement 4.25
13 mean_value_of_long_term_variability 3.98
14 histogram_number_of_peaks 3.47
15 histogram_number_of_zeroes 2.86
16 histogram_tendency 1.64
17 uterine_contractions 1.37
18 prolongued_decelerations 0.93
19 mean_value_of_short_term_variability 0.82
20 severe_decelerations 0.63
21 light_decelerations 0.19

Features ranked 11–21 were included to present the complete feature importance analysis. Contraction-related features (uterine_contractions, rank 17) and deceleration parameters (ranks 18, 20, and 21) showed notably low importance despite their clinical significance in conventional CTG interpretation.

CTG, cardiotocography.

Notably, several features that clinicians might expect to be important contributed minimally to the classification. Uterine_contractions ranked 17th (1.37%) and light_decelerations ranked last among all features (0.19%). These findings have significant clinical implications. In clinical CTG interpretation, the temporal relationship between uterine contractions and fetal heart rate deceleration is critical for distinguishing late from early decelerations and assessing fetal reserve [14,20]. The low importance of contraction-related features in machine learning models suggests that the dataset may not adequately capture the dynamic temporal relationship between contractions and heart rate patterns, or that the extracted features may not represent the sequential and contextual information that clinicians rely on when interpreting these patterns. Similarly, prolongued_decelerations (0.93%) and severe_decelerations (0.63%) showed low importance, likely reflecting their rarity in the dataset rather than their clinical insignificance. A recent scoping review by Francis et al. [21] identified concerns over the practicality and generalizability of machine learning approaches applied to CTG data, particularly when using summary features rather than raw time-series signals. These findings underscore the fundamental limitation of applying machine learning to CTG data. The static summary-level features available in this dataset do not fully represent the temporal waveform patterns that clinicians evaluate in real-time practice.
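The ranking claims above can be checked mechanically against the importances reported in Table 3:

```python
# Re-rank the reported feature importances (Table 3) to confirm which
# parameters dominate and which contribute least to the random forest model.
importances = {
    "abnormal_short_term_variability": 16.23,
    "percentage_of_time_with_abnormal_long_term_variability": 13.21,
    "histogram_variance": 6.84, "histogram_max": 6.73,
    "histogram_mean": 6.61, "baseline_value": 6.50,
    "histogram_mode": 6.47, "histogram_width": 6.31,
    "histogram_median": 6.05, "histogram_min": 5.89,
    "accelerations": 5.33, "fetal_movement": 4.25,
    "mean_value_of_long_term_variability": 3.98,
    "histogram_number_of_peaks": 3.47, "histogram_number_of_zeroes": 2.86,
    "histogram_tendency": 1.64, "uterine_contractions": 1.37,
    "prolongued_decelerations": 0.93,  # dataset's own (misspelled) column name
    "mean_value_of_short_term_variability": 0.82,
    "severe_decelerations": 0.63, "light_decelerations": 0.19,
}

ranked = sorted(importances, key=importances.get, reverse=True)
print("top two:", ranked[:2])
print("uterine_contractions rank:", ranked.index("uterine_contractions") + 1)
print("lowest:", ranked[-1])
```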

3. Class-specific performance analysis

A detailed analysis of class-specific performance revealed important insights into the strengths and limitations of AI models for different categories of fetal health status (Table 4). The balanced model showed improved performance for minority classes compared with the original approach. The analysis revealed excellent performance in identifying normal cases (100% recall), improved performance for pathological cases (64.7% recall with the balanced model vs. 41.2% with the original), and challenging performance for suspect cases (34.0% recall). From a clinical safety perspective, the 35.3% false-negative rate for pathological cases (improved from 58.8% in the original model) represents a significant concern that must be addressed through appropriate implementation strategies under physician supervision.

Table 4.

Class-specific performance analysis (balanced model)

Class True positive False positive False negative Precision (%) Recall (%) Specificity (%)
Normal 352 44 0 88.9 100.0 45.8
Suspect 18 6 35 75.0 34.0 98.5
Pathological 22 3 12 88.0 64.7 99.2

False negative rate for pathological cases=35.3% (12/34).
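The per-class metrics in Table 4 follow directly from the confusion counts; a short check of the arithmetic:

```python
# Recompute the per-class metrics in Table 4 from the raw confusion counts.
counts = {  # class: (true positives, false positives, false negatives)
    "normal": (352, 44, 0),
    "suspect": (18, 6, 35),
    "pathological": (22, 3, 12),
}

for cls, (tp, fp, fn) in counts.items():
    precision = 100 * tp / (tp + fp)
    recall = 100 * tp / (tp + fn)
    print(f"{cls}: precision {precision:.1f}%, recall {recall:.1f}%")

tp, _, fn = counts["pathological"]
print(f"pathological false-negative rate: {100 * fn / (tp + fn):.1f}%")  # 35.3%
```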

4. Expert opinion-based clinical utility assessment

The expert opinion-based clinical utility assessment framework was used to evaluate six key criteria (Table 5). Expert assessment revealed moderate technical performance (score 3/5), potentially suitable for decision support but insufficient for independent clinical use. Safety considerations received a poor score (2/5) because of the persistent false-negative rate for pathological cases. Interpretability scored well (4/5), as the feature importance aligned with established clinical knowledge.

Table 5.

Expert opinion-based assessment of clinical utility criteria

Criterion Assessment Score (1–5) Comments
Technical performance Moderate 3 Balanced models show 89.67% accuracy; suitable for decision support only
Interpretability Good 4 Feature importance aligns with established clinical knowledge
Workflow integration Moderate 3 Requires careful implementation planning and physician training
Safety considerations Poor 2 35.3% false negative rate for pathological cases represents significant patient safety concern
Training requirements Moderate 3 Comprehensive training program needed for clinical staff
Cost-effectiveness Uncertain 3 Potential benefits require validation with actual clinical outcomes

Assessment was based on expert opinions from three maternal-fetal medicine specialists with >10 years of CTG interpretation experience. Inter-rater reliability: Fleiss’ kappa=0.68. Scoring scale: 1=very poor, 2=poor, 3=moderate, 4=good, and 5=excellent. This represents a preliminary expert opinion rather than validated clinical outcomes.

CTG, cardiotocography.

Discussion

This preliminary expert opinion-based clinical utility assessment of AI-based fetal health classification systems from an obstetric physician’s perspective provides initial insights into both the potential and limitations of these technologies for clinical implementation. The moderate but realistic performance levels achieved in this study (89.7% accuracy for the best-performing balanced model) suggest that these systems may be more valuable as decision support tools than autonomous diagnostic systems, which has important implications for the integration of such technologies into clinical workflows [22].

A key finding of this study was that the best-performing model achieved 89.7% accuracy. This figure may appear modest compared with some technical studies that reported accuracies of over 99%. However, this raw accuracy must be interpreted within the context of clinical reality, where the primary challenge is not necessarily achieving perfect prediction, but overcoming the significant inter-observer variability inherent in the human interpretation of CTG traces. Multiple studies have shown that even experienced clinicians often disagree on the classification of the same CTG trace, with reported agreement levels (measured by the kappa statistic) often falling into the ‘fair’ to ‘moderate’ range [3–5]. In a study by Uccella et al. [23], a standardized algorithm substantially improved interpretation agreement (kappa=0.85) compared to subjective visual interpretation (kappa=0.24). This strongly suggests that the primary value of an AI system lies in providing a standardized, objective baseline that reduces the well-documented subjectivity and variability of human interpretation.

The prominence of physiologically relevant features, particularly heart rate variability measures, in AI model predictions provides reassurance that these systems focus on clinically meaningful patterns rather than spurious correlations. This alignment between the importance of AI features and established clinical knowledge is important for physicians’ acceptance of and confidence in AI recommendations. The finding that abnormal short-term variability and the percentage of time with abnormal long-term variability are the most important predictive features validates the clinical relevance of the AI approach and supports its potential for meaningful integration with existing clinical practice [20]. However, the low importance of contraction-related features and deceleration parameters remains an important issue. In standard clinical practice, the relationship between uterine contractions and fetal heart rate changes, particularly the timing of deceleration relative to contractions, is fundamental for CTG assessment [14,20]. The observation that these features contributed minimally to the classification suggests that the static feature extraction approach used in this dataset did not capture the temporal dynamics central to clinical interpretation. Hardalaç et al. [24] demonstrated that feature elimination strategies applied to CTG data could improve model accuracy to 97.20% while correctly predicting 100% of pathological cases, suggesting that careful feature engineering rather than simply including all available features may be more effective. Future AI systems should incorporate time-series analysis methods that can model sequential relationships between contractions and heart rate responses.

A critical finding of this study is the persistent challenge of class imbalance even after applying balancing techniques. The false-negative rate of 35.3% for pathological cases, which improved from the original 58.8%, still represents a significant patient safety concern. In clinical practice, missing a pathological case can lead to delayed interventions and adverse fetal outcomes. This finding strongly supports the idea that AI systems in fetal monitoring must function as decision support tools with mandatory physician oversight and not as autonomous diagnostic systems.

The landscape of AI applications in the Korean obstetric practice provides an important context for understanding the potential implementation of these systems. Park et al. [25] conducted an important study utilizing 22,522 deliveries from 14 Korean hospitals and demonstrated that large-scale, multicenter CTG datasets can successfully support AI model development with external validation, achieving area under the curves of 0.862–0.895. This study represents an advancement beyond publicly available datasets by incorporating real clinical data with actual patient outcomes and multi-institutional validation.

The implementation of AI-based fetal health classification systems raises important safety considerations that must be carefully addressed. Modern clinical decision support systems must be designed with robust safeguards and physician oversight capabilities [26]. Based on the findings of this study, we propose a structured implementation framework for AI-based fetal health classification systems that prioritizes patient safety while maximizing the potential benefits of AI technology. Class-specific performance analysis revealed excellent performance for normal cases (100% recall), suggesting particular value for screening applications. Conversely, the low recall rate for suspected cases (34.0%) suggests that AI systems may have limited utility for these challenging intermediate cases, which often require clinical expertise and nuanced judgment, a challenge widely recognized by clinicians [27].

The performance achieved in this study (89.7% accuracy with balanced models) is lower than that of several recently reported results. Studies reporting greater than 99% accuracy using the same Kaggle dataset often employ extensive hyperparameter optimization and cross-validation strategies, which may not be generalizable to real clinical settings. Several important research directions emerged from this analysis that could enhance the clinical utility of AI-based fetal health classification systems. The integration of additional clinical data sources and explainable AI techniques specifically designed for medical applications can address interpretability challenges that currently limit clinical acceptance [28,29]. Although other studies have reported accuracies exceeding 95% [30], our study intentionally prioritized clinical realism and physician-centered assessment over maximal technical optimization.

Recent international studies have demonstrated promising results using advanced AI approaches. Mushtaq and Veningston [31] demonstrated that explainable deep learning models achieve high accuracy while maintaining their interpretability. Ensemble learning approaches have demonstrated strong performance in fetal health classification [32]. Studies integrating multimodal cardiotocographic and maternal clinical data have achieved 90.8% accuracy with balanced performance across risk categories [33]. Furthermore, reviews of AI applications in electronic fetal monitoring have identified both technological potential and significant gaps in external validation, emphasizing the need for rigorous clinical trials before large-scale adoption [21,34].

This study has several limitations that should be considered when interpreting the results. First, the evaluation was conducted using a publicly available dataset, which may not fully represent the diversity of clinical scenarios encountered in real-world practice. Second, the expert opinion-based clinical utility assessment framework was based on expert opinions rather than validated clinical outcome data. Third, the dataset lacked actual patient outcome data, such as Apgar scores, limiting our ability to assess the true clinical utility. Future studies should include a prospective evaluation of AI systems in clinical settings, with measurements of actual patient outcomes and clinical impact [28,29].

In conclusion, our preliminary assessment from a physician’s perspective suggests that AI-based systems show potential as decision-support tools for fetal health assessment. The realistic performance levels achieved in this study provide a practical foundation for future clinical studies. The strong alignment between the importance of AI features and clinical knowledge fosters confidence in these systems. However, the current limitations, especially the significant false-negative rate in detecting high-risk cases, demonstrate that these systems cannot replace the essential expertise and judgment of clinical physicians. External validation using multicenter clinical data and prospective outcome studies is essential before clinical implementation. The expanding application of AI in obstetrics beyond fetal monitoring further emphasizes the importance of continued assessment and physician oversight [35]. As AI continues to advance in obstetric practice, rigorous clinical validation and ongoing physician involvement will remain essential to ensure that these technologies enhance, rather than replace, clinical judgment, which is central to safe and effective patient care.

Footnotes

Conflict of interest

No potential conflict of interest relevant to this article was reported.

Ethical approval

This study used a publicly available, de-identified dataset (Kaggle Fetal Health Classification dataset). Because the dataset did not contain personally identifiable information, institutional review board approval and informed consent were not required.

Patient consent

Not applicable.

Funding information

None.

Acknowledgments

The dataset used in this study is publicly available through Kaggle at: https://www.kaggle.com/datasets/andrewmvd/fetal-health-classification.

References

  • 1.Ayres-de-Campos D, Spong CY, Chandraharan E. FIGO consensus guidelines on intrapartum fetal monitoring: cardiotocography. Int J Gynaecol Obstet. 2015;131:13–24. doi: 10.1016/j.ijgo.2015.06.020. [DOI] [PubMed] [Google Scholar]
  • 2.Alfirevic Z, Devane D, Gyte GM, Cuthbert A. Continuous cardiotocography (CTG) as a form of electronic fetal monitoring (EFM) for fetal assessment during labour. Cochrane Database Syst Rev. 2017;2:CD006066. doi: 10.1002/14651858.CD006066. [DOI] [PubMed] [Google Scholar]
  • 3.Blackwell SC, Grobman WA, Antoniewicz L, Hutchinson M, Gyamfi Bannerman C. Interobserver and intraobserver reliability of the NICHD 3-tier fetal heart rate interpretation system. Am J Obstet Gynecol. 2011;205:378.e1–5. doi: 10.1016/j.ajog.2011.06.086. [DOI] [PubMed] [Google Scholar]
  • 4.Rei M, Tavares S, Pinto P, Machado AP, Monteiro S, Costa A, et al. Interobserver agreement in CTG interpretation using the 2015 FIGO guidelines for intrapartum fetal monitoring. Eur J Obstet Gynecol Reprod Biol. 2016;205:27–31. doi: 10.1016/j.ejogrb.2016.08.017. [DOI] [PubMed] [Google Scholar]
  • 5.Devane D, Lalor JG, Daly S, McGuire W, Cuthbert A, Smith V. Cardiotocography versus intermittent auscultation of fetal heart on admission to labour ward for assessment of fetal wellbeing. Cochrane Database Syst Rev. 2017;1:CD005122. doi: 10.1002/14651858.CD005122.pub5. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 6.Rajkomar A, Dean J, Kohane I. Machine learning in medicine. N Engl J Med. 2019;380:1347–58. doi: 10.1056/NEJMra1814259. [DOI] [PubMed] [Google Scholar]
  • 7.Acharya UR, Fujita H, Lih OS, Hagiwara Y, Tan JH, Adam M. Automated detection of arrhythmias using different intervals of tachycardia ECG segments with convolutional neural network. Inf Sci. 2017;405:81–90. [Google Scholar]
  • 8.Topol EJ. High-performance medicine: the convergence of human and artificial intelligence. Nat Med. 2019;25:44–56. doi: 10.1038/s41591-018-0300-7. [DOI] [PubMed] [Google Scholar]
  • 9.Lee Y, Kim SY. Potential applications of ChatGPT in obstetrics and gynecology in Korea: a review article. Obstet Gynecol Sci. 2024;67:153–9. doi: 10.5468/ogs.23231. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 10.Clark SL, Nageotte MP, Garite TJ, Freeman RK, Miller DA, Simpson KR, et al. Intrapartum management of category II fetal heart rate tracings: towards standardization of care. Am J Obstet Gynecol. 2013;209:89–97. doi: 10.1016/j.ajog.2013.04.030. [DOI] [PubMed] [Google Scholar]
  • 11.Ahn KH, Lee KS. Artificial intelligence in obstetrics. Obstet Gynecol Sci. 2022;65:113–24. doi: 10.5468/ogs.21234. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 12.Kim HY, Cho GJ, Kwon HS. Applications of artificial intelligence in obstetrics. Ultrasonography. 2023;42:2–9. doi: 10.14366/usg.22063. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 13.Mehbodniya A, Lazar AJP, Webber J, Sharma DK, Jayagopalan S, Kousalya K, et al. Fetal health classification from cardiotocographic data using machine learning. Expert Syst. 2022;39:e12899. [Google Scholar]
  • 14.Macones GA, Hankins GD, Spong CY, Hauth J, Moore T. The 2008 National Institute of Child Health and Human Development workshop report on electronic fetal monitoring: update on definitions, interpretation, and research guidelines. Obstet Gynecol. 2008;112:661–6. doi: 10.1097/AOG.0b013e3181841395. [DOI] [PubMed] [Google Scholar]
  • 15.Breiman L. Random forests. Mach Learn. 2001;45:5–32. [Google Scholar]
  • 16.Chen T, Guestrin C. XGBoost: a scalable tree boosting system. In: Krishnapuram B, Shah M, Smola A, Aggarwal C, Shen D, Rastogi R, editors. Proceedings of the 22nd ACM SIGKDD International Conference on knowledge discovery and data mining; 2016 Aug 13–17; San Francisco (CA). New York (NY): Association for Computing Machinery; 2016. pp. 785–94. [Google Scholar]
  • 17.Cortes C, Vapnik V. Support-vector networks. Mach Learn. 1995;20:273–97. [Google Scholar]
  • 18.Akazawa M, Hashimoto K, Noda K, Yoshida K. The application of machine learning for predicting recurrence in patients with early-stage endometrial cancer: a pilot study. Obstet Gynecol Sci. 2021;64:266–73. doi: 10.5468/ogs.20248. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 19.Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, et al. Scikit-learn: machine learning in Python. J Mach Learn Res. 2011;12:2825–30. [Google Scholar]
  • 20.Parer JT, King T. Fetal heart rate monitoring: is it salvageable? Am J Obstet Gynecol. 2000;182:982–7. doi: 10.1016/s0002-9378(00)70358-9. [DOI] [PubMed] [Google Scholar]
  • 21.Francis F, Luz S, Wu H, Stock SJ, Townsend R. Machine learning on cardiotocography data to classify fetal outcomes: a scoping review. Comput Biol Med. 2024;172:108220. doi: 10.1016/j.compbiomed.2024.108220. [DOI] [PubMed] [Google Scholar]
  • 22.Chen JH, Asch SM. Machine learning and prediction in medicine - beyond the peak of inflated expectations. N Engl J Med. 2017;376:2507–9. doi: 10.1056/NEJMp1702071. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 23.Uccella S, Cromi A, Colombo GF, Bogani G, Casarin J, Agosti M, et al. Interobserver reliability to interpret intrapartum electronic fetal heart rate monitoring: does a standardized algorithm improve agreement among clinicians? J Obstet Gynaecol. 2015;35:241–5. doi: 10.3109/01443615.2014.958144. [DOI] [PubMed] [Google Scholar]
  • 24.Hardalaç F, Akmal H, Ayturan K, Acharya UR, Tan RS. A pragmatic approach to fetal monitoring via cardiotocography using feature elimination and hyperparameter optimization. Interdiscip Sci. 2024;16:882–906. doi: 10.1007/s12539-024-00647-6. [DOI] [PubMed] [Google Scholar]
  • 25.Park CE, Choi B, Park RW, Kwak DW, Ko HS, Seong WJ, et al. Automated interpretation of cardiotocography using deep learning in a nationwide multicenter study. Sci Rep. 2025;15:19617. doi: 10.1038/s41598-025-02849-4. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 26.Shortliffe EH, Sepúlveda MJ. Clinical decision support in the era of artificial intelligence. JAMA. 2018;320:2199–200. doi: 10.1001/jama.2018.17163. [DOI] [PubMed] [Google Scholar]
  • 27.Santo S, Ayres-de-Campos D, Costa-Santos C, Schnettler W, Ugwumadu A, Da Graça LM. Agreement and accuracy using the FIGO, ACOG and NICE cardiotocography interpretation guidelines. Acta Obstet Gynecol Scand. 2017;96:166–75. doi: 10.1111/aogs.13064. [DOI] [PubMed] [Google Scholar]
  • 28.Esteva A, Robicquet A, Ramsundar B, Kuleshov V, DePristo M, Chou K, et al. A guide to deep learning in healthcare. Nat Med. 2019;25:24–9. doi: 10.1038/s41591-018-0316-z. [DOI] [PubMed] [Google Scholar]
  • 29.Sendak M, Elish MC, Gao M, Futoma J, Ratliff W, Nichols M, et al. The human body is a black box supporting clinical decision-making with deep learning. In: Hildebrandt M, Castillo C, Celis E, Ruggieri S, Taylor L, Zanfir-Fortuna G, editors. Proceedings of the 2020 conference on fairness, accountability, and transparency; 2020 Jan 27–30; Barcelona. New York (NY): Association for Computing Machinery; 2020. pp. 99–109. [Google Scholar]
  • 30.Niveditha G, Maheswari BU, Perez de Prado R. Elevating maternal healthcare: synergy of cardiotocography, machine learning models and interpretive analysis. Procedia Comput Sci. 2025;258:2787–97. [Google Scholar]
  • 31.Mushtaq G, Veningston K. AI driven interpretable deep learning based fetal health classification. SLAS Technol. 2024;29:100206. doi: 10.1016/j.slast.2024.100206. [DOI] [PubMed] [Google Scholar]
  • 32.Kuzu A, Santur Y. Early diagnosis and classification of fetal health status from a fetal cardiotocography dataset using ensemble learning. Diagnostics (Basel). 2023;13:2471. doi: 10.3390/diagnostics13152471. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 33.Cao Z, Wang G, Xu L, Li C, Hao Y, Chen Q, et al. Intelligent antepartum fetal monitoring via deep learning and fusion of cardiotocographic signals and clinical data. Health Inf Sci Syst. 2023;11:16. doi: 10.1007/s13755-023-00219-w. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 34.Barnova K, Martinek R, Vilimkova Kahankova R, Jaros R, Snasel V, Mirjalili S. Artificial intelligence and machine learning in electronic fetal monitoring. Arch Comput Methods Eng. 2024;31:2557–88. [Google Scholar]
  • 35.Han JY. Usefulness and limitations of Chat GPT in getting information on teratogenic drugs exposed in pregnancy. Obstet Gynecol Sci. 2025;68:1–8. doi: 10.5468/ogs.24231. [DOI] [PMC free article] [PubMed] [Google Scholar]

Articles from Obstetrics & Gynecology Science are provided here courtesy of Korean Society of Obstetrics and Gynecology