Abstract
Background
Advances in machine learning (ML) offer an innovative approach to accurate fetal weight estimation by integrating multiple biometric and clinical variables.
Objective
To develop and validate ML models for estimating fetal weight using biometric data obtained via ultrasonography, evaluating their accuracy and comparing them with traditional formulas, such as Hadlock and Shepard.
Methods
A retrospective observational study was conducted at the National Maternal Perinatal Institute of Peru from 2009 to 2022, including 3525 low-risk pregnancies with singleton gestations. ML models, including Gradient Boosting, Support Vector Machine (SVM), Random Forest and TabPFN (Tabular Prior-data Fitted Network), were trained and validated using ultrasonographic measurements such as biparietal diameter, abdominal circumference, head circumference, femur length, and gestational age. Accuracy was assessed using the coefficient of determination (R²) and mean squared error (MSE), comparing the ML models to the Hadlock and Shepard formulas.
Results
Data from the first study stage (2009–2018) indicated that the TabPFN model was the most accurate (R² = 0.856; MSE = 0.146), outperforming the Hadlock (R² = 0.807; MSE = 0.195) and Shepard (R² = 0.801; MSE = 0.201) formulas. In the independent validation sample (2019–2022), TabPFN consistently outperformed other methods (R² = 0.873; MSE = 0.144). Model consistency was evaluated through cross-validation and randomization of samples.
Conclusions
The TabPFN model outperformed traditional formulas, including Hadlock and Shepard, and other evaluated machine learning methods in estimating fetal weight. Its high predictive accuracy, robustness across temporally distinct cohorts, and independence from hyperparameter tuning support its potential as a reliable clinical decision-support tool in obstetric care.
Keywords: Fetal weight, ultrasonography, prenatal, artificial intelligence, machine learning, pregnancy, perinatal care
Introduction
Accurate estimation of fetal growth and weight is crucial in obstetric and neonatology practice, as it directly impacts the planning of prenatal and neonatal care and plays a key role in reducing perinatal morbidity and mortality.1,2 Precision in fetal weight estimation facilitates the early identification of complications, such as intrauterine growth restriction, and optimizes clinical decision-making, from labor management to neonatal care planning. 3 However, fetal weight estimation encounters inherent challenges due to variability in biological and clinical factors, as well as limitations in traditional models, which can be imprecise, particularly in diverse populations. 4
Ultrasound-based formulas, such as Hadlock's, have been widely utilized to estimate fetal weight based on standard biometric measurements. 5 While these formulas have demonstrated significant clinical utility, they possess important limitations regarding universal applicability, as they assume a normal weight distribution and inadequately account for demographic and contextual factors, including ethnicity, fetal sex, and individual maternal characteristics.5–7 Previous studies have documented notable discrepancies between estimated and actual fetal weights, underscoring the necessity for more adaptive and precise approaches.8–10
Personalizing fetal growth references by incorporating specific characteristics such as maternal parity, ethnicity, and body mass index has shown promising improvements in the accuracy of fetal growth assessment.11,12 However, traditional methods encounter challenges related to adaptability and the integration of complex, heterogeneous datasets. In this context, machine learning (ML) has emerged as a transformative approach, providing significant advantages by analyzing intricate and non-linear patterns within large biometric datasets.13,14 Recent advances, such as the Tabular Prior-data Fitted Network (TabPFN), introduce transformer-based architectures trained on synthetic data using structural causal models, enabling robust, uncertainty-aware predictions even in small datasets with missing values and diverse feature types. 15
Previous studies have highlighted the potential of ML to enhance fetal growth assessment and birthweight prediction across diverse obstetric populations. Naimi et al. showed that ensemble models, including generalized boosted models and Bayesian additive regression trees, outperform traditional regression techniques in estimating fetal weight across gestation, yielding more accurate and generalizable results. 16 Ranjbar et al. demonstrated the effectiveness of gradient boosting methods in predicting low birth weight by identifying complex risk factors such as gestational age and prior obstetric history. 17 Similarly, Tao et al. developed a hybrid LSTM model that integrates temporal maternal and fetal biometric data, achieving superior performance in classifying fetal growth categories compared to conventional formulas and other ML models. 18 Collectively, these studies underscore the value of ML in improving the accuracy of fetal weight estimation and supporting perinatal clinical decision-making.
This study aimed to evaluate ML models for estimating fetal weight from ultrasound-derived biometric data and to compare their accuracy with that of traditional formulas, including the widely used Hadlock formulas and the Shepard formula.
Methods
Study design and participants
This was a cross-sectional observational study based on data collected from pregnant women with singleton, low-risk pregnancies and their newborns, who were attended at the Instituto Nacional Materno Perinatal (INMP) in Peru between 2009 and 2022.
Inclusion criteria were singleton pregnancies with ultrasound-based fetal biometry performed within seven days prior to delivery. The exclusion criteria comprised: fetal malformations, maternal comorbidities (cancer, stroke, pneumonia, epilepsy, or other systemic conditions), major obstetric complications (preeclampsia, hemorrhage, chorioamnionitis, sepsis, hyperemesis, placenta previa, accreta, or abruption), gestational age at delivery <21 weeks, birth weight ≤500 g, and multiple pregnancies. Additionally, cases lacking complete records of the variables of interest in the medical histories were also excluded. A total of 3525 cases met the eligibility criteria and were included in the final analysis.
Data collection and variables:
This study received approval from the Research Ethics Committee of the Instituto Nacional Materno Perinatal (INMP), Lima, Peru (Approval ID: 032-2023-CIEI/INMP) and received the corresponding institutional authorization.
The identification of eligible pregnant women and their corresponding newborns was conducted through a systematic review of the electronic database of the Fetal Medicine Service. In the initial phase, medical records of patients who had undergone ultrasonographic fetal biometry evaluations were identified. Each record was subsequently reviewed to verify compliance with the predefined inclusion and exclusion criteria. Only mother–newborn pairs with complete and consistent clinical information, and who strictly met the eligibility requirements, were included in the study. This selection process resulted in a final analytical dataset comprising 3525 observations.
The extracted variables included maternal age, gestational age (GA) at the time of fetal biometry and at delivery, as well as the following fetal biometric parameters: femur length (FL), head circumference (HC), biparietal diameter (BPD), and abdominal circumference (AC). Actual neonatal weight was obtained from the delivery records. Based on the fetal biometric measurements, estimated fetal weight (EFW) was calculated using conventional ultrasonographic formulas, such as Hadlock I–IV and Shepard, which incorporate various mathematical combinations of the aforementioned biometric parameters.
The formulas used to compare fetal weight estimation between ML models and conventional formulas were those of Shepard 19 and Hadlock: 5
Preparation of the database and development of ML models:
The initial dataset was collected in Excel (xlsx) format for subsequent processing using Google Colab, a free platform for using Jupyter Notebooks with the Python 3.10 programming language.
The complete dataset was temporally segmented into two periods for analysis. The development dataset (2009–2018) included 2738 newborns randomly split into a training set (80%) and a test set (20%). The validation dataset (2019–2022) comprised 787 newborns used as an independent set for external model validation.
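The temporal segmentation followed by a random 80/20 split can be sketched as follows. This is a minimal illustration on synthetic records, not the authors' code: the field names `id` and `year`, the random seed, and the helper name are assumptions made for the example.

```python
import random


def temporal_then_random_split(records, cutoff_year=2018, test_fraction=0.2, seed=42):
    """Split by period first, then randomly split the earlier period."""
    # Development period (up to the cutoff year) and later validation period.
    development = [r for r in records if r["year"] <= cutoff_year]
    validation = [r for r in records if r["year"] > cutoff_year]

    # Shuffle a copy of the development records with a fixed seed,
    # then hold out the requested fraction as the test set.
    shuffled = development[:]
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    test, train = shuffled[:n_test], shuffled[n_test:]
    return train, test, validation


# Synthetic records (hypothetical): 100 births per year from 2009 to 2022.
records = [{"id": i, "year": 2009 + i % 14} for i in range(1400)]
train, test, validation = temporal_then_random_split(records)
print(len(train), len(test), len(validation))
```

Keeping the 2019–2022 records entirely outside the random split is what makes the later cohort usable as an independent validation set.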
During the database preparation, a normalization process was applied, adjusting variables to have a mean of 0 and a standard deviation of 1, thus ensuring compatibility and optimal performance in predictive models. The predictor variables included BPD, AC, HC, FL, and GA, which were used as inputs to estimate fetal weight using ML models.
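The normalization described above (mean 0, standard deviation 1) can be illustrated with a short sketch. The values are hypothetical abdominal circumference measurements. Estimating the mean and standard deviation on the training set only, and reusing them for the test set, is a common convention assumed here; the text does not specify how the authors handled this.

```python
import statistics


def fit_standardizer(values):
    """Return the mean and population standard deviation of the training values."""
    return statistics.fmean(values), statistics.pstdev(values)


def standardize(values, mean, sd):
    """Apply z-score scaling using previously estimated parameters."""
    return [(v - mean) / sd for v in values]


# Hypothetical abdominal circumference values in cm (not study data).
train_ac = [28.1, 30.4, 33.0, 34.2, 35.5]
test_ac = [31.0, 36.0]

mean, sd = fit_standardizer(train_ac)
train_z = standardize(train_ac, mean, sd)
test_z = standardize(test_ac, mean, sd)  # reuses the training mean and SD

print([round(z, 3) for z in train_z])
print([round(z, 3) for z in test_z])
```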
The ML algorithms utilized in this study include Decision Tree, Random Forest, Gradient Boosting, Extreme Gradient Boosting (XGBoost), Support Vector Machine (SVM), and TabPFN. The methodology employed for model development and comparison is illustrated in Figure 1.
Figure 1.
Description of techniques applied in the ML modeling process.
Note: If TabPFN is the best model, hyperparameter tuning cannot be applied to it (because it is a pre-trained model); hyperparameter tuning is therefore applied to the other models for comparison.
Source: prepared by the authors based on study data.
Data analysis
Machine learning techniques were employed to construct models for estimating fetal weight using the training sample. The algorithms evaluated included Decision Tree, Random Forest, Gradient Boosting, Extreme Gradient Boosting (XGBoost), SVM, and TabPFN. Model development relied on the training data, with initial performance assessed using the coefficient of determination (R²) and mean squared error (MSE).
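For reference, the two metrics can be computed as in the sketch below, which uses hypothetical weights. It also illustrates a property that may help when reading Table 1: if the outcome is standardized to unit variance, R² equals 1 − MSE on that scale. The near-complementary R² and MSE values reported in Table 1 are consistent with such a standardized outcome, although the text does not state this explicitly, so it is treated here as an assumption.

```python
import statistics


def mean_squared_error(y_true, y_pred):
    """Average squared difference between observed and predicted values."""
    return statistics.fmean([(t - p) ** 2 for t, p in zip(y_true, y_pred)])


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = statistics.fmean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical birth weights and predictions in grams (not study data).
weights_g = [2900.0, 3100.0, 3350.0, 3600.0, 4050.0]
predicted_g = [3000.0, 3050.0, 3400.0, 3500.0, 3900.0]

# Standardize both series with the mean and SD of the observed weights.
mu = statistics.fmean(weights_g)
sd = statistics.pstdev(weights_g)
y_z = [(w - mu) / sd for w in weights_g]
pred_z = [(p - mu) / sd for p in predicted_g]

r2 = r_squared(y_z, pred_z)
mse = mean_squared_error(y_z, pred_z)
print(round(r2, 3), round(mse, 3), round(r2 + mse, 3))
```

R² is unchanged by this rescaling, whereas MSE is expressed in squared standard-deviation units rather than squared grams.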
The ML models were calibrated by optimizing their performance parameters using non-linear algorithms, with the exception of TabPFN. The TabPFN model was not calibrated, as its pre-trained architecture does not support hyperparameter tuning. To ensure the robustness of the performance estimates, the standard deviations of R² and MSE were calculated using five-fold cross-validation. Subsequently, hyperparameter tuning was performed to further optimize model accuracy. For Gradient Boosting, candidate values were tested for the learning rate (0.07, 0.08, and 0.09), alpha regularization (0.8, 0.9, and 0.95), and the number of estimators (50, 75, and 100). For the SVM model, values of the regularization parameter C (1, 5, and 10) and kernel functions (rbf and poly) were evaluated. This process was conducted using a nested cross-validation methodology to maximize model accuracy and generalizability. The model that produced the estimates closest to the actual weight was applied to the validation sample to evaluate its performance and to compare the MSE and R² indicators with those from the training sample.
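A nested cross-validation of this kind can be sketched with scikit-learn as follows. The example uses the SVM grid given above (C of 1, 5, and 10; rbf and poly kernels) on synthetic data. The use of scikit-learn, the three inner folds, the random seeds, and the scaling step inside the pipeline are assumptions made for illustration, not details confirmed by the text.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in for five predictors (BPD, HC, AC, FL, GA); not study data.
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# Grid taken from the text: C in {1, 5, 10}, kernel in {rbf, poly}.
param_grid = {"svr__C": [1, 5, 10], "svr__kernel": ["rbf", "poly"]}

# Inner loop selects hyperparameters; outer loop estimates performance.
inner_cv = KFold(n_splits=3, shuffle=True, random_state=1)  # assumed fold count
outer_cv = KFold(n_splits=5, shuffle=True, random_state=2)

search = GridSearchCV(
    make_pipeline(StandardScaler(), SVR()),
    param_grid,
    cv=inner_cv,
    scoring="r2",
)

# Each outer fold refits the whole grid search on its own training portion,
# so the held-out fold never influences hyperparameter selection.
outer_r2 = cross_val_score(search, X, y, cv=outer_cv, scoring="r2")
print(outer_r2.mean(), outer_r2.std())
```

Because the selected hyperparameters can differ between outer folds, a table of per-fold "best parameters" such as Table 2 is a natural by-product of this procedure.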
In addition, to enhance model interpretability and assess the contribution of individual features and their interactions to the predicted fetal weight, we conducted a feature importance analysis using the Faithful Shapley Interaction Index (FSII). This SHAP-based approach allows for a consistent and theoretically grounded estimation of both main effects and feature interactions, offering insights into how specific biometric variables influence model predictions across validation scenarios.
To ensure the robustness and stability of the performance metrics, additional randomizations of the training and test datasets were performed, generating 1000 resampled datasets through repeated random sampling.
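The repeated random resampling can be sketched as below. To keep the example self-contained, a simple one-predictor linear regression on synthetic data stands in for TabPFN; the data-generating values, seeds, and function names are hypothetical.

```python
import random
import statistics


def fit_line(xs, ys):
    """Ordinary least squares for one predictor; returns (slope, intercept)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return slope, my - slope * mx


def r_squared(xs, ys, slope, intercept):
    """R² of the fitted line on the given data."""
    my = statistics.fmean(ys)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot


def repeated_split_r2(xs, ys, n_repeats=1000, test_fraction=0.2, seed=0):
    """Mean and SD of test-set R² over repeated random train/test splits."""
    rng = random.Random(seed)
    idx = list(range(len(xs)))
    n_test = round(len(idx) * test_fraction)
    scores = []
    for _ in range(n_repeats):
        rng.shuffle(idx)
        test_idx, train_idx = idx[:n_test], idx[n_test:]
        slope, intercept = fit_line([xs[i] for i in train_idx],
                                    [ys[i] for i in train_idx])
        scores.append(r_squared([xs[i] for i in test_idx],
                                [ys[i] for i in test_idx],
                                slope, intercept))
    return statistics.fmean(scores), statistics.stdev(scores)


# Synthetic predictor and outcome (hypothetical, not study data).
data_rng = random.Random(1)
xs = [data_rng.uniform(25, 40) for _ in range(300)]
ys = [100 * x + data_rng.gauss(0, 300) for x in xs]

mean_r2, sd_r2 = repeated_split_r2(xs, ys)
print(round(mean_r2, 3), round(sd_r2, 3))
```

A small standard deviation across replications indicates that the performance estimate does not depend strongly on one particular partition of the data.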
Ethical considerations
This study was approved by the Research Ethics Committee of the Instituto Nacional Materno Perinatal (INMP), Lima, Peru (Approval ID: 032-2023-CIEI/INMP). As the study employed a retrospective design based on pre-existing clinical records, informed consent was not required, in compliance with both international and local ethical regulations. Data handling strictly adhered to the guidelines established by the Council for International Organizations of Medical Sciences (CIOMS) and Peru's Personal Data Protection Law, ensuring confidentiality and privacy.
The data collected were anonymized by removing personal identifiers and replacing them with alphanumeric codes. Only the research team had access to the anonymized data, which was protected under strict security protocols to prevent unauthorized access. Additional measures were implemented to ensure that the data were used exclusively for research purposes, safeguarding the identity and privacy of participants throughout the study. Data collection, storage, and processing were conducted in accordance with ethical standards for retrospective studies, thereby minimizing potential risks to confidentiality.
Results
Participant description
A total of 3525 pregnant women and their newborns met the inclusion criteria and were analyzed. GAs at birth ranged from 22 to 41 weeks, with the majority of newborns (81.1%) being full term (≥37 weeks of gestation). Birth weights varied from less than 1000 g to more than 4000 g, reflecting a wide distribution.
Estimated models
The models were developed using training data from the first study phase (2009–2018). The performance of various ML models, as well as the conventional Hadlock and Shepard models, was evaluated in terms of R² and MSE using both training and test datasets. The ML models incorporated predictive variables derived from fetal biometric measurements, including BPD, HC, AC, FL, and GA at the time of ultrasonographic evaluation.
Seven ML models were evaluated during the first study phase: Linear Regression, Decision Tree, Random Forest, Gradient Boosting, Extreme Gradient Boosting (XGBoost), SVM, and TabPFN. These models were trained under three specific scenarios:
Scenario 1: Predictive variables included four fetal biometric measurements (AC, FL, HC, and BPD).
Scenario 2: The four fetal biometric measurements were combined with GA.
Scenario 3: Three biometric measurements (AC, FL, and HC) were combined with GA.
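The three scenarios amount to three feature subsets, which can be written down as a simple mapping. The sketch below is illustrative only; the dictionary, the helper function, and the example record are hypothetical.

```python
# Feature subsets for each scenario, as described in the text.
SCENARIOS = {
    "Scenario 1": ["AC", "FL", "HC", "BPD"],
    "Scenario 2": ["AC", "FL", "HC", "BPD", "GA"],
    "Scenario 3": ["AC", "FL", "HC", "GA"],
}


def select_features(record, scenario):
    """Return the predictor values used in the given scenario, in order."""
    return [record[name] for name in SCENARIOS[scenario]]


# Hypothetical record (biometry in cm, GA in weeks); not study data.
record = {"AC": 34.1, "FL": 7.2, "HC": 33.0, "BPD": 9.1, "GA": 38.5}
for scenario in SCENARIOS:
    print(scenario, select_features(record, scenario))
```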
In the test dataset, TabPFN demonstrated the best overall performance among all evaluated algorithms, achieving the highest R² and the lowest MSE for fetal weight estimation. Its optimal performance was observed in Scenario 2, where it reached an R² of 0.856 and a MSE of 0.146 (Table 1).
Table 1.
R² and MSE results of predictive models using fetal biometry and gestational age in training and test datasets.
| Model | Dataset | Scenario 1ᵃ (HC, AC, FL, BPD): R² ± SD | Scenario 1ᵃ: MSE ± SD | Scenario 2ᵃ (HC, AC, FL, BPD, GA): R² ± SD | Scenario 2ᵃ: MSE ± SD | Scenario 3ᵃ (HC, AC, FL, GA): R² ± SD | Scenario 3ᵃ: MSE ± SD |
|---|---|---|---|---|---|---|---|
| Shepard | Train | 0.772 | 0.228 | 0.772 | 0.228 | 0.772 | 0.228 |
| Shepard | Test | 0.801 | 0.201 | 0.801 | 0.201 | 0.801 | 0.201 |
| Hadlock 1 | Train | 0.753 | 0.247 | 0.753 | 0.247 | 0.753 | 0.247 |
| Hadlock 1 | Test | 0.75 | 0.253 | 0.75 | 0.253 | 0.75 | 0.253 |
| Hadlock 2 | Train | 0.784 | 0.216 | 0.784 | 0.216 | 0.784 | 0.216 |
| Hadlock 2 | Test | 0.794 | 0.209 | 0.794 | 0.209 | 0.794 | 0.209 |
| Hadlock 3 | Train | 0.784 | 0.216 | 0.784 | 0.216 | 0.784 | 0.216 |
| Hadlock 3 | Test | 0.798 | 0.204 | 0.798 | 0.204 | 0.798 | 0.204 |
| Hadlock 4 | Train | 0.791 | 0.209 | 0.791 | 0.209 | 0.791 | 0.209 |
| Hadlock 4 | Test | 0.807 | 0.195 | 0.807 | 0.195 | 0.807 | 0.195 |
| Linear Regression | Train | 0.791 (±0.044) | 0.209 (±0.044) | 0.8 (±0.038) | 0.2 (±0.038) | 0.796 (±0.04) | 0.204 (±0.039) |
| Linear Regression | Test | 0.818 (±0.044) | 0.185 (±0.044) | 0.826 (±0.038) | 0.176 (±0.038) | 0.825 (±0.04) | 0.177 (±0.039) |
| Decision Tree Regressor | Train | 1.0 (±0.024) | 0.0 (±0.024) | 1.0 (±0.022) | 0.0 (±0.022) | 0.999 (±0.023) | 0.001 (±0.02) |
| Decision Tree Regressor | Test | 0.613 (±0.024) | 0.392 (±0.024) | 0.641 (±0.022) | 0.363 (±0.022) | 0.641 (±0.023) | 0.363 (±0.02) |
| Random Forest Regressor | Train | 0.97 (±0.026) | 0.03 (±0.026) | 0.972 (±0.019) | 0.028 (±0.019) | 0.971 (±0.02) | 0.029 (±0.018) |
| Random Forest Regressor | Test | 0.806 (±0.026) | 0.196 (±0.026) | 0.827 (±0.019) | 0.175 (±0.019) | 0.821 (±0.02) | 0.181 (±0.018) |
| Gradient Boosting Regressor | Train | 0.856 (±0.024) | 0.144 (±0.023) | 0.865 (±0.017) | 0.135 (±0.016) | 0.859 (±0.016) | 0.141 (±0.015) |
| Gradient Boosting Regressor | Test | 0.832 (±0.024) | 0.17 (±0.023) | 0.847 (±0.017) | 0.155 (±0.016) | 0.846 (±0.016) | 0.156 (±0.015) |
| XGB Regressor | Train | 0.959 (±0.024) | 0.041 (±0.024) | 0.967 (±0.015) | 0.033 (±0.015) | 0.954 (±0.023) | 0.046 (±0.022) |
| XGB Regressor | Test | 0.789 (±0.024) | 0.214 (±0.024) | 0.808 (±0.015) | 0.195 (±0.015) | 0.803 (±0.023) | 0.199 (±0.022) |
| SVM | Train | 0.83 (±0.022) | 0.17 (±0.022) | 0.842 (±0.023) | 0.158 (±0.023) | 0.837 (±0.023) | 0.163 (±0.023) |
| SVM | Test | 0.831 (±0.022) | 0.171 (±0.022) | 0.843 (±0.023) | 0.159 (±0.023) | 0.838 (±0.023) | 0.164 (±0.023) |
| TabPFN | Train | 0.835 (±0.025) | 0.165 (±0.025) | 0.846 (±0.013) | 0.154 (±0.013) | 0.842 (±0.011) | 0.158 (±0.011) |
| TabPFN | Test | 0.846 (±0.025) | 0.156 (±0.025) | 0.856 (±0.013) | 0.146 (±0.013) | 0.854 (±0.011) | 0.148 (±0.011) |
R²: coefficient of determination; MSE: mean squared error; SD: standard deviation of R² and MSE values calculated via 5-fold cross-validation; GA: gestational age.
ᵃ Scenarios 1, 2, and 3 were applied exclusively to the machine learning models, as the Hadlock and Shepard formulas are based on predefined fetal biometry parameters specific to their own estimation frameworks.
Source: Prepared by the authors based on study data.
The Gradient Boosting Regressor demonstrated the second-best predictive performance. In Scenario 1, this model achieved an R² of 0.832 and an MSE of 0.17. Performance improved in Scenario 2, with an R² of 0.847 and an MSE of 0.155. The SVM model demonstrated competitive results, achieving an R² of 0.831 and an MSE of 0.171 in Scenario 1, which improved to an R² of 0.843 and an MSE of 0.159 in Scenario 2. The Random Forest Regressor exhibited slightly lower performance, with an R² of 0.806 and an MSE of 0.196 in Scenario 1, improving to an R² of 0.827 and an MSE of 0.175 in Scenario 2. While the XGB Regressor produced satisfactory outcomes, its performance was inferior to that of the TabPFN and Gradient Boosting models, achieving an R² of 0.789 and an MSE of 0.214 in Scenario 1, and an R² of 0.808 and an MSE of 0.195 in Scenario 2 (Table 1).
Model calibration:
The ML models, except TabPFN, were calibrated by optimizing their performance through non-linear algorithms. TabPFN was excluded from tuning due to its pre-trained architecture, which does not support hyperparameter adjustment.
Among the tuned models, Gradient Boosting and SVM emerged as the top performers. For Gradient Boosting, various combinations of learning rate, alpha regularization, and the number of estimators were evaluated, while for SVM, different values of the regularization parameter C and kernel functions were tested (Table 2).
Table 2.
R² and MSE results for models using fetal biometry and gestational age, including hyperparameter tuning with nested cross-validation.
| Model | Scenario | Train CV R² | Train CV MSE | Test CV R² | Test CV MSE | Test R² | Test MSE | Best parameters |
|---|---|---|---|---|---|---|---|---|
| Gradient Boosting | Scenario 1 (HC, AC, FL, BPD) | 0.855 | 0.147 | 0.796 | 0.191 | 0.82 | 0.183 | {'alpha': 0.8, 'learning_rate': 0.07, 'n_estimators': 75} |
| Gradient Boosting | Scenario 1 (HC, AC, FL, BPD) | 0.85 | 0.149 | 0.782 | 0.222 | 0.832 | 0.17 | {'alpha': 0.8, 'learning_rate': 0.09, 'n_estimators': 50} |
| Gradient Boosting | Scenario 1 (HC, AC, FL, BPD) | 0.855 | 0.15 | 0.804 | 0.171 | 0.829 | 0.173 | {'alpha': 0.8, 'learning_rate': 0.08, 'n_estimators': 75} |
| Gradient Boosting | Scenario 1 (HC, AC, FL, BPD) | 0.858 | 0.142 | 0.809 | 0.191 | 0.832 | 0.17 | {'alpha': 0.8, 'learning_rate': 0.08, 'n_estimators': 75} |
| Gradient Boosting | Scenario 1 (HC, AC, FL, BPD) | 0.849 | 0.144 | 0.85 | 0.176 | 0.828 | 0.174 | {'alpha': 0.8, 'learning_rate': 0.07, 'n_estimators': 100} |
| Gradient Boosting (mean ± SD) | Scenario 1 (HC, AC, FL, BPD) | 0.853 (±0.004) | 0.146 (±0.003) | 0.808 (±0.025) | 0.19 (±0.02) | 0.828 (±0.005) | 0.174 (±0.005) | |
| SVM | Scenario 1 (HC, AC, FL, BPD) | 0.832 | 0.171 | 0.798 | 0.189 | 0.827 | 0.175 | {'C': 1, 'kernel': 'rbf'} |
| SVM | Scenario 1 (HC, AC, FL, BPD) | 0.84 | 0.159 | 0.786 | 0.217 | 0.826 | 0.176 | {'C': 5, 'kernel': 'rbf'} |
| SVM | Scenario 1 (HC, AC, FL, BPD) | 0.831 | 0.175 | 0.82 | 0.157 | 0.83 | 0.172 | {'C': 1, 'kernel': 'rbf'} |
| SVM | Scenario 1 (HC, AC, FL, BPD) | 0.832 | 0.168 | 0.812 | 0.189 | 0.829 | 0.173 | {'C': 1, 'kernel': 'rbf'} |
| SVM | Scenario 1 (HC, AC, FL, BPD) | 0.833 | 0.159 | 0.839 | 0.188 | 0.828 | 0.174 | {'C': 5, 'kernel': 'rbf'} |
| SVM (mean ± SD) | Scenario 1 (HC, AC, FL, BPD) | 0.834 (±0.004) | 0.166 (±0.007) | 0.811 (±0.02) | 0.188 (±0.021) | 0.828 (±0.002) | 0.174 (±0.002) | |
| Gradient Boosting | Scenario 2 (HC, AC, FL, BPD, GA) | 0.86 | 0.143 | 0.821 | 0.168 | 0.841 | 0.161 | {'alpha': 0.8, 'learning_rate': 0.09, 'n_estimators': 50} |
| Gradient Boosting | Scenario 2 (HC, AC, FL, BPD, GA) | 0.858 | 0.141 | 0.818 | 0.185 | 0.846 | 0.156 | {'alpha': 0.8, 'learning_rate': 0.07, 'n_estimators': 75} |
| Gradient Boosting | Scenario 2 (HC, AC, FL, BPD, GA) | 0.863 | 0.142 | 0.809 | 0.166 | 0.844 | 0.158 | {'alpha': 0.8, 'learning_rate': 0.07, 'n_estimators': 75} |
| Gradient Boosting | Scenario 2 (HC, AC, FL, BPD, GA) | 0.865 | 0.135 | 0.81 | 0.191 | 0.846 | 0.156 | {'alpha': 0.8, 'learning_rate': 0.07, 'n_estimators': 75} |
| Gradient Boosting | Scenario 2 (HC, AC, FL, BPD, GA) | 0.852 | 0.142 | 0.853 | 0.171 | 0.842 | 0.16 | {'alpha': 0.8, 'learning_rate': 0.09, 'n_estimators': 50} |
| Gradient Boosting (mean ± SD) | Scenario 2 (HC, AC, FL, BPD, GA) | 0.86 (±0.005) | 0.141 (±0.003) | 0.822 (±0.018) | 0.176 (±0.011) | 0.844 (±0.002) | 0.158 (±0.002) | |
| SVM | Scenario 2 (HC, AC, FL, BPD, GA) | 0.842 | 0.16 | 0.811 | 0.177 | 0.84 | 0.162 | {'C': 1, 'kernel': 'rbf'} |
| SVM | Scenario 2 (HC, AC, FL, BPD, GA) | 0.843 | 0.156 | 0.806 | 0.197 | 0.839 | 0.163 | {'C': 1, 'kernel': 'rbf'} |
| SVM | Scenario 2 (HC, AC, FL, BPD, GA) | 0.845 | 0.16 | 0.823 | 0.154 | 0.844 | 0.158 | {'C': 1, 'kernel': 'rbf'} |
| SVM | Scenario 2 (HC, AC, FL, BPD, GA) | 0.845 | 0.154 | 0.819 | 0.181 | 0.843 | 0.159 | {'C': 1, 'kernel': 'rbf'} |
| SVM | Scenario 2 (HC, AC, FL, BPD, GA) | 0.836 | 0.157 | 0.849 | 0.176 | 0.841 | 0.161 | {'C': 1, 'kernel': 'rbf'} |
| SVM (mean ± SD) | Scenario 2 (HC, AC, FL, BPD, GA) | 0.842 (±0.004) | 0.157 (±0.003) | 0.822 (±0.017) | 0.177 (±0.015) | 0.841 (±0.002) | 0.161 (±0.002) | |
| Gradient Boosting | Scenario 3 (HC, AC, FL, GA) | 0.858 | 0.144 | 0.82 | 0.169 | 0.839 | 0.163 | {'alpha': 0.8, 'learning_rate': 0.07, 'n_estimators': 75} |
| Gradient Boosting | Scenario 3 (HC, AC, FL, GA) | 0.852 | 0.148 | 0.816 | 0.187 | 0.845 | 0.157 | {'alpha': 0.8, 'learning_rate': 0.09, 'n_estimators': 50} |
| Gradient Boosting | Scenario 3 (HC, AC, FL, GA) | 0.859 | 0.146 | 0.807 | 0.167 | 0.843 | 0.159 | {'alpha': 0.8, 'learning_rate': 0.07, 'n_estimators': 75} |
| Gradient Boosting | Scenario 3 (HC, AC, FL, GA) | 0.864 | 0.136 | 0.804 | 0.196 | 0.844 | 0.157 | {'alpha': 0.8, 'learning_rate': 0.09, 'n_estimators': 75} |
| Gradient Boosting | Scenario 3 (HC, AC, FL, GA) | 0.847 | 0.146 | 0.851 | 0.174 | 0.839 | 0.163 | {'alpha': 0.8, 'learning_rate': 0.09, 'n_estimators': 50} |
| Gradient Boosting (mean ± SD) | Scenario 3 (HC, AC, FL, GA) | 0.856 (±0.007) | 0.144 (±0.005) | 0.82 (±0.019) | 0.179 (±0.012) | 0.842 (±0.003) | 0.16 (±0.003) | |
| SVM | Scenario 3 (HC, AC, FL, GA) | 0.839 | 0.164 | 0.804 | 0.184 | 0.835 | 0.167 | {'C': 1, 'kernel': 'rbf'} |
| SVM | Scenario 3 (HC, AC, FL, GA) | 0.839 | 0.161 | 0.801 | 0.202 | 0.833 | 0.169 | {'C': 1, 'kernel': 'rbf'} |
| SVM | Scenario 3 (HC, AC, FL, GA) | 0.84 | 0.165 | 0.819 | 0.157 | 0.839 | 0.163 | {'C': 1, 'kernel': 'rbf'} |
| SVM | Scenario 3 (HC, AC, FL, GA) | 0.841 | 0.159 | 0.815 | 0.185 | 0.839 | 0.163 | {'C': 1, 'kernel': 'rbf'} |
| SVM | Scenario 3 (HC, AC, FL, GA) | 0.831 | 0.161 | 0.843 | 0.183 | 0.836 | 0.166 | {'C': 1, 'kernel': 'rbf'} |
| SVM (mean ± SD) | Scenario 3 (HC, AC, FL, GA) | 0.838 (±0.004) | 0.162 (±0.002) | 0.816 (±0.017) | 0.182 (±0.016) | 0.836 (±0.003) | 0.166 (±0.003) | |
R²: coefficient of determination; MSE: mean squared error; SD: standard deviation of R² and MSE values calculated via 5-fold cross-validation; GA: gestational age.
Source: Prepared by the authors based on study data.
To determine the optimal model between Gradient Boosting and SVM, the model with the highest average R² and the lowest standard deviation over five iterations on the test sample was selected. Gradient Boosting achieved an R² of 0.844 ± 0.002, slightly outperforming the SVM model, which recorded an R² of 0.841 ± 0.002 in Scenario 2. Within Gradient Boosting, the best iteration for Scenario 2 was achieved with the hyperparameters 'alpha': 0.8, 'learning_rate': 0.07, and 'n_estimators': 75 (R² = 0.846). The SVM model selected for final comparison in the same scenario used the hyperparameters 'C': 1 and 'kernel': 'rbf', yielding an R² of 0.844 (Table 2).
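The selection rule can be reproduced from the five Scenario 2 test-set R² values listed in Table 2. The sketch below ranks by mean R² first and uses the lower standard deviation as a tie-breaker, which is one reading of the stated criterion. The use of the sample standard deviation is an assumption; the population version gives the same values after rounding.

```python
import statistics

# Test-set R² per iteration for Scenario 2, copied from Table 2.
test_r2 = {
    "Gradient Boosting": [0.841, 0.846, 0.844, 0.846, 0.842],
    "SVM": [0.840, 0.839, 0.844, 0.843, 0.841],
}

# Mean and sample standard deviation across the five iterations.
summary = {
    name: (statistics.fmean(values), statistics.stdev(values))
    for name, values in test_r2.items()
}

# Prefer the higher mean R²; break ties with the lower standard deviation.
best = max(summary, key=lambda name: (summary[name][0], -summary[name][1]))

for name, (mean, sd) in summary.items():
    print(f"{name}: {mean:.3f} ± {sd:.3f}")
print("Selected:", best)
```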
However, despite hyperparameter tuning, neither model surpassed the performance of TabPFN, which remained the most accurate model overall. Therefore, these tuned models are not the best option for fetal weight estimation in this context.
Model validation
The calibrated models were validated using an independent sample of newborns from the second stage of the study (2019–2022). Both the calibrated Gradient Boosting and SVM, as well as TabPFN, outperformed the conventional Hadlock and Shepard models. The TabPFN model achieved an R² of 0.873 and an MSE of 0.144, surpassing the performance of the best Hadlock model (Hadlock 2) and the calibrated SVM and Gradient Boosting models (Table 3).
Table 3.
R² and MSE results in the validation sample (2019–2022).
| Model | Scenarioᵃ | R² | MSE |
|---|---|---|---|
| Shepard | - | 0.826 | 0.198 |
| Hadlock 1 | - | 0.819 | 0.205 |
| Hadlock 2 | - | 0.845 | 0.176 |
| Hadlock 3 | - | 0.822 | 0.202 |
| Hadlock 4 | - | 0.839 | 0.183 |
| Gradient Boosting (Calibrated) | Scenario 2 (HC, AC, FL, BPD, GA) | 0.869 | 0.148 |
| SVM (Calibrated) | Scenario 2 (HC, AC, FL, BPD, GA) | 0.856 | 0.163 |
| TabPFN | Scenario 2 (HC, AC, FL, BPD, GA) | 0.873 | 0.144 |
R²: coefficient of determination; MSE: mean squared error.
ᵃ Scenario 2 was applied exclusively to the machine learning models, as the Hadlock and Shepard formulas are based on predefined fetal biometry parameters specific to their own estimation frameworks.
Source: Prepared by the authors based on study data.
Model interpretation using FSII
To improve the interpretability of the TabPFN model, FSII (Faithful SHapley Interaction Index) values were computed to analyze the contribution of each input feature and its interactions to the predicted fetal weight.
In the test dataset (2009–2018), the model estimated a fetal weight of 4051.38 grams, based on a base value of 3203.89 g, which represents the model's expected output in the absence of any feature influence. Individual features, including gestational age (GA), femur length (FL), head circumference (HC), biparietal diameter (BPD), and abdominal circumference (AC), showed strong positive contributions to the final prediction. The highest additive effects were observed for BPD and AC. However, several interaction terms, particularly HC × BPD, AC × BPD, HC × AC, FL × AC, and FL × GA, contributed negatively, partially offsetting the individual effects and indicating complex compensatory mechanisms within the model (Figure 2: test dataset).
Figure 2.
Contribution of features and their interactions to the mean estimated fetal weight using the TabPFN model, based on Faithful Shapley Interaction Index (FSII) values in the test and validation datasets.
Source: Prepared by the authors based on study data.
In the validation dataset (2019–2022), the estimated fetal weight was 4066.02 g, consistent with the performance observed in the test dataset. Individual features again contributed positively, with HC and AC exerting the most substantial influence. Negative interaction effects were observed between HC × BPD, HC × AC, AC × BPD, HC × FL, and FL × AC, confirming the model's capacity to capture nonlinear interactions and adjust predictions accordingly (Figure 2: validation dataset).
These results demonstrate that the TabPFN model not only achieves high predictive accuracy but also provides interpretable explanations that align with known fetal biometric relationships. The FSII analysis supports the model's reliability and potential for clinical application in fetal weight estimation.
Model consistency
To verify model consistency, the selected TabPFN model was re-estimated by randomizing the training and testing samples 1000 times. A plot of the R² and MSE estimates from each randomized sample showed stable values across replications. MSE averaged 0.151 ± 0.003 for the training sample and 0.161 ± 0.013 for the testing sample, while R² averaged 0.849 ± 0.004 for the training sample and 0.838 ± 0.016 for the testing sample (Figure 3).
Figure 3.
Trend graph of MSE and R² obtained from replications of the fetal weight predictive model.
Source: Prepared by the authors based on study data.
Discussion
In this study, ML models for estimating fetal weight using ultrasonographic biometric data were developed and validated, with TabPFN emerging as the most accurate and consistent model throughout the various stages of analysis. During the training phase, multiple ML models were developed, and TabPFN outperformed the other evaluated approaches. In the testing phase, the model maintained its superiority, achieving an R² of 0.856 and an MSE of 0.146, demonstrating robust performance compared to calibrated models such as SVM (R² = 0.844, MSE = 0.158) and Gradient Boosting (R² = 0.846, MSE = 0.156). During validation with independent data from 2019 to 2022, the TabPFN model achieved an R² of 0.873 and an MSE of 0.144, outperforming both the best-performing Hadlock model (R² = 0.845, MSE = 0.176) and the calibrated Gradient Boosting model (R² = 0.869, MSE = 0.148). These results were achieved by incorporating the four fetal biometric measurements and GA into the model.
Our study developed and validated ML models using data from 2009 to 2018 for model training and testing, and data from 2019 to 2022 for validation. We employed nested cross-validation and randomization to enhance the robustness of the models and minimize the risk of overfitting. 20 Additionally, we utilized fetal biometric measurements that are widely accepted for fetal weight estimation,5,8 in alignment with the work of Miotto et al. 14 and Helm et al., 21 who emphasize the importance of developing interpretable models. Such models are essential for building trust and ensuring their applicability in clinical settings.
The TabPFN model utilizing the four fetal biometric measurements (HC, BPD, AC, and FL) and GA demonstrated the highest accuracy in fetal weight estimation compared to other ML models and conventional formulas, such as those developed by Hadlock and Shepard. While Hadlock's formulas have historically been considered the standard, 8 previous studies have highlighted their limitations in extreme fetal weight ranges due to their static nature, which fails to capture the complexities of fetal growth.4,10 Monier et al. 10 demonstrated that these formulas can underestimate or overestimate fetal weight, potentially impacting obstetric planning. Similarly, our study included biometric data from women who delivered live, singleton fetuses without congenital anomalies and with a low risk of adverse outcomes. These criteria align with the INTERGROWTH-21st Project, which developed international standards for assessing fetal growth.6,22
Personalization in fetal weight estimation is an evolving field, recognized for its critical relevance to clinical practice. In this context, ML models have emerged as pivotal tools due to their ability to adapt and provide personalized predictions for specific population groups exhibiting significant variations in fetal growth. These variations are often attributed to socioeconomic, racial, or environmental factors.7,23 Consistent with this evidence, the TabPFN model developed in the present study outperformed the conventional formulas of Hadlock and Shepard for fetal weight estimation, reinforcing the notion, supported by prior research, that data-driven approaches can yield more accurate and personalized estimations.17,18 Hollmann et al. demonstrated that the TabPFN, a tabular foundation model, enables highly accurate predictions on small datasets, outperforming traditional machine learning methods without the need for hyperparameter tuning. This represents a significant advancement in applying artificial intelligence to clinical and scientific settings with limited data availability. 15
Compared to previous studies, our findings represent a substantial advancement in fetal weight estimation. Specifically, Solt et al. reported a maximum R² of 0.70 using a K-means clustered regressor model, a value that the TabPFN model in our study surpassed by more than 17 percentage points in the validation cohort (R² = 0.873). 24 Similarly, Cohen et al. proposed a multivariable clinical regression model that improved the proportion of estimates within 10% of the actual birth weight. 25 However, they did not report standardized performance metrics such as R² or mean squared error, limiting the potential for quantitative comparisons. In contrast, our results, supported by higher R² values and external validation on temporally distinct cohorts, suggest a more accurate and generalizable model.
Multiple prior studies have demonstrated the value of ML in obstetrics. Lu et al. developed a model that did not require ultrasound data and achieved a mean relative error (MRE) of 6%. While this may be useful in resource-limited settings, the magnitude of error indicates an overall lower precision compared to our TabPFN model. 26 Tao et al. implemented a hybrid model based on long short-term memory (LSTM) neural networks, which achieved an accuracy of 79.2% and an MRE of 5.65%. 18 However, they also did not report R² or MSE, nor did they perform external validation. In this context, our findings indicate superior performance of the TabPFN model, with simpler input features and a more rigorous methodological approach.
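The studies compared above report different metrics (R², MSE, proportion of estimates within 10% of birth weight, and MRE), which is what makes direct comparison difficult. As a minimal sketch, the four metrics can be computed from the same pairs of actual and predicted weights as follows. The function names and the example weights are hypothetical, and MRE is assumed here to mean the average absolute error divided by the actual weight, which may differ from the definition used in the cited studies.

```python
def mean_squared_error(actual, predicted):
    """Average squared difference between actual and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def r_squared(actual, predicted):
    """Coefficient of determination: 1 - residual SS / total SS."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


def mean_relative_error(actual, predicted):
    """Average of |predicted - actual| / actual (assumed definition of MRE)."""
    return sum(abs(p - a) / a for a, p in zip(actual, predicted)) / len(actual)


def proportion_within(actual, predicted, tolerance=0.10):
    """Share of predictions whose relative error is at most `tolerance`."""
    hits = sum(1 for a, p in zip(actual, predicted) if abs(p - a) / a <= tolerance)
    return hits / len(actual)


# Hypothetical birth weights and estimates in kilograms (not study data).
actual_kg = [2.9, 3.2, 3.5, 3.8]
predicted_kg = [3.0, 3.1, 3.6, 3.7]

print("MSE:", round(mean_squared_error(actual_kg, predicted_kg), 4))
print("R2:", round(r_squared(actual_kg, predicted_kg), 4))
print("MRE:", round(mean_relative_error(actual_kg, predicted_kg), 4))
print("Within 10%:", proportion_within(actual_kg, predicted_kg))
```

Note that a model can score well on one metric and poorly on another: every estimate in this toy example falls within 10% of the actual weight, yet R² also depends on how much the actual weights vary in the sample.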
Additional evidence is provided by Naimi et al., who demonstrated that models such as Bayesian Additive Regression Trees (BART) and Gradient Boosting Machines (GBM) outperform traditional regression methods in estimating fetal weight by capturing complex relationships between clinical variables and birth weight. 16 Likewise, Victor et al. applied ML models to predict gestational weight gain categories using boosting algorithms such as XGBoost and LightGBM. 27 Although this represents a different maternal outcome, their results further confirm the effectiveness of ML algorithms in analyzing structured clinical data. These findings also support the strong performance observed in our study using Gradient Boosting and reinforce the added value of employing the TabPFN model, which is specifically designed to optimize learning from complex structured tabular data.
In line with previous studies, our findings also highlight the value of integrating ML in fetal weight estimation using routinely collected biometric parameters (HC, BPD, AC, and FL) along with gestational age. However, unlike previous studies, our use of both nested cross-validation and external validation with data collected during a different period strengthens the robustness of our model and addresses a common methodological limitation in the literature, which often relies solely on internal or cross-sectional validation. Furthermore, the application of the TabPFN model, specifically optimized for structured tabular data, demonstrates superior performance in terms of both precision and clinical applicability in fetal weight estimation.
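The combination of nested cross-validation and temporally separated external validation described above can be sketched as follows. This is an illustration under stated assumptions and not the study's pipeline: the data are synthetic, the model is a one-variable ridge regression standing in for the ML models, and all names (`fit_ridge_1d`, `nested_cv`, `temporal_split`) are hypothetical. The year boundary (2018) follows the two study stages reported in the abstract.

```python
import random
from statistics import mean


def fit_ridge_1d(xs, ys, lam):
    """Fit y = a + b*x with an L2 penalty `lam` on the slope (closed form)."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / (sxx + lam)
    return my - slope * mx, slope


def mse(model, xs, ys):
    a, b = model
    return mean((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))


def k_folds(n, k, rng):
    """Shuffle indices 0..n-1 and deal them into k disjoint folds."""
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cv_mse(xs, ys, lam, k, rng):
    """Mean held-out MSE of one penalty value across k folds."""
    scores = []
    for fold in k_folds(len(xs), k, rng):
        held = set(fold)
        train = [i for i in range(len(xs)) if i not in held]
        model = fit_ridge_1d([xs[i] for i in train], [ys[i] for i in train], lam)
        scores.append(mse(model, [xs[i] for i in fold], [ys[i] for i in fold]))
    return mean(scores)


def nested_cv(xs, ys, lams, outer_k=5, inner_k=3, seed=0):
    """Outer folds estimate error; inner folds choose the penalty."""
    rng = random.Random(seed)
    outer_scores = []
    for fold in k_folds(len(xs), outer_k, rng):
        held = set(fold)
        train = [i for i in range(len(xs)) if i not in held]
        x_tr, y_tr = [xs[i] for i in train], [ys[i] for i in train]
        # Tuning sees only the outer-training data, never the outer test fold.
        best = min(lams, key=lambda lam: cv_mse(x_tr, y_tr, lam, inner_k, rng))
        model = fit_ridge_1d(x_tr, y_tr, best)
        outer_scores.append(mse(model, [xs[i] for i in fold], [ys[i] for i in fold]))
    return mean(outer_scores)


def temporal_split(records, last_dev_year):
    """Development cohort up to `last_dev_year`; later years held out."""
    dev = [r for r in records if r["year"] <= last_dev_year]
    val = [r for r in records if r["year"] > last_dev_year]
    return dev, val


# Synthetic records: abdominal circumference (cm) and weight (kg), hypothetical.
rng = random.Random(42)
records = []
for _ in range(400):
    ac = rng.uniform(25, 37)
    records.append({"year": rng.randint(2009, 2022), "ac": ac,
                    "kg": 0.2 * ac - 3.5 + rng.gauss(0, 0.3)})

dev, val = temporal_split(records, last_dev_year=2018)
dev_x, dev_y = [r["ac"] for r in dev], [r["kg"] for r in dev]
nested_mse = nested_cv(dev_x, dev_y, lams=[0.0, 1.0, 10.0])

# The final model is fitted on all development data (penalty fixed here for
# brevity), then scored once on the later cohort.
final_model = fit_ridge_1d(dev_x, dev_y, lam=1.0)
external_mse = mse(final_model, [r["ac"] for r in val], [r["kg"] for r in val])
print(f"nested CV MSE: {nested_mse:.3f}, external MSE: {external_mse:.3f}")
```

The design point is that neither the outer test folds nor the later cohort influence model selection, so both scores estimate performance on unseen data, and the later cohort additionally probes stability over time.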
The clinical implementation of advanced ML models, such as TabPFN, presents several practical challenges, particularly in terms of interpretability and effective integration into existing clinical workflows. Although ensemble-based models are well-documented for their high predictive accuracy, their “black-box” nature can hinder trust and adoption among healthcare professionals. 28 In this context, the development of interpretable models is essential to foster clinical confidence and ensure the safe application of AI-driven tools in medical decision-making. 29
In the present study, we utilized fetal biometric measurements obtained via ultrasound, established inputs in conventional fetal weight estimation formulas, as the primary variables for model construction. To enhance model transparency, we applied the Faithful Shapley Interaction Index to interpret both the individual and interactive contributions of variables to the predicted mean fetal weight in the test and validation datasets using the TabPFN model. This approach enables a transparent exploration of the model's underlying logic and its clinical implications. 30 Future applications of ML in healthcare should prioritize the integration of robust explanatory tools that accompany model predictions, facilitating validation and interpretation within clinical reasoning processes.
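To make the idea of individual and interactive contributions concrete, the sketch below computes exact Shapley values and the classical pairwise Shapley interaction index for a small toy model. This is a simpler, related technique and not the Faithful Shapley Interaction Index used in the study. The toy model, its coefficients, the baseline, and the example measurements are all hypothetical.

```python
from itertools import combinations
from math import factorial


def toy_model(ac, fl, ga):
    """Hypothetical weight model (kg) with one AC x FL interaction term."""
    return 0.05 * ac + 0.10 * fl + 0.02 * ga + 0.01 * ac * fl


def value(subset, x, baseline):
    """Prediction when features in `subset` take the instance's values
    and all other features stay at the baseline."""
    mixed = [x[i] if i in subset else baseline[i] for i in range(len(x))]
    return toy_model(*mixed)


def shapley_values(x, baseline):
    """Exact Shapley value of each feature by enumerating all subsets."""
    n = len(x)
    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for combo in combinations(others, size):
                s = set(combo)
                total += weight * (value(s | {i}, x, baseline) - value(s, x, baseline))
        phi.append(total)
    return phi


def pair_interaction(i, j, x, baseline):
    """Classical Shapley interaction index for the feature pair (i, j)."""
    n = len(x)
    others = [k for k in range(n) if k not in (i, j)]
    total = 0.0
    for size in range(n - 1):
        weight = factorial(size) * factorial(n - size - 2) / factorial(n - 1)
        for combo in combinations(others, size):
            s = set(combo)
            total += weight * (
                value(s | {i, j}, x, baseline)
                - value(s | {i}, x, baseline)
                - value(s | {j}, x, baseline)
                + value(s, x, baseline)
            )
    return total


# Hypothetical baseline (reference fetus) and instance: AC cm, FL cm, GA weeks.
baseline = [30.0, 6.5, 36.0]
instance = [34.0, 7.2, 39.0]

phi = shapley_values(instance, baseline)
print("Shapley values (AC, FL, GA):", [round(p, 4) for p in phi])
print("AC x FL interaction:", round(pair_interaction(0, 1, instance, baseline), 4))
print("AC x GA interaction:", round(pair_interaction(0, 2, instance, baseline), 4))
```

The Shapley values sum to the difference between the instance's prediction and the baseline prediction, and only the pair that actually interacts in the model (AC and FL) receives a non-zero interaction score. Exhaustive enumeration is feasible here because there are only three features; practical tools rely on approximations.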
Beyond high-resource settings, ML models have the potential to offer substantial benefits in low- and middle-income countries, where access to trained sonographers and standardized imaging equipment remains limited. The importance of task-shifting strategies and portable technologies in addressing maternal-fetal health disparities has been highlighted. 31 In such contexts, AI-powered tools can support mid-level providers by delivering standardized fetal weight estimations from minimal input, thereby enhancing clinical decision-making even in the absence of specialists or advanced equipment. 32 Our study provides evidence that models like TabPFN could be deployed in these settings, potentially embedded in mobile ultrasound devices or telehealth platforms, to improve access to and quality of prenatal care.
Despite the promising results, several limitations must be acknowledged that may affect the generalizability of our findings. First, the analysis was based on a large sample from a single institution, potentially limiting the applicability of our ML models to other populations or clinical contexts with distinct characteristics. Second, although the ML models demonstrated superior performance in estimating fetal weight, the accuracy of the ultrasonographic measurements used as predictors may vary depending on the operator's level of expertise and the technology available. Third, the focus on specific variables, such as fetal biometric parameters and GA, may not fully capture other clinical or sociodemographic factors that could influence fetal growth.
Conclusion
This study demonstrated that machine learning models, particularly the TabPFN model, provide a significant improvement in fetal weight estimation based on ultrasonographic biometric measurements, surpassing both other ML approaches and traditional formulas such as those developed by Hadlock and Shepard. The integration of nested cross-validation, external validation on temporally distinct cohorts, and the application of explanatory tools such as the Faithful Shapley Interaction Index reinforces the robustness and transparency of the proposed approach. The superior accuracy and consistency of the TabPFN model, even without the need for hyperparameter tuning, position it as a promising alternative for clinical use, especially in resource-constrained settings or heterogeneous populations.
Our findings support the value of incorporating advanced artificial intelligence models into obstetric practice, particularly for personalized fetal weight estimation, which remains a critical variable in perinatal planning. Nevertheless, further studies are needed to evaluate the performance of this model across diverse clinical environments, populations, and ultrasound technologies to confirm its generalizability. In addition, the clinical adoption of these models should be accompanied by strategies that ensure interpretability, build trust among healthcare professionals, and enable seamless integration into existing clinical workflows.
Acknowledgments
We would like to express our sincere gratitude to the individuals who supported and facilitated this project, including our research assistants, coordinators, and healthcare staff at INMP. Their dedication and hard work were essential to the completion of this study.
Footnotes
ORCID iDs: Marcos Espinola-Sánchez https://orcid.org/0000-0002-1005-5158
Antonio Limay-Rios https://orcid.org/0000-0001-6012-3705
Andrés Campaña-Acuña https://orcid.org/0000-0001-6055-0416
Silvia Sanca-Valeriano https://orcid.org/0000-0002-0517-2114
Ethical considerations: This study received approval from the Research Ethics Committee of the Instituto Nacional Materno Perinatal (INMP), Lima, Peru (Approval ID: 032-2023-CIEI/INMP; Date: April 12, 2023).
Consent to participate: As it was a retrospective study based on pre-existing clinical records, informed consent was not required, in compliance with both international and local ethical regulations.
Author contributions: Marcos Espinola-Sánchez: Conceptualized and designed the study, led data acquisition, and supervised statistical analysis. He contributed substantially to interpreting the results and drafting the manuscript.
Antonio Limay-Rios: Collaborated in collecting and processing ultrasound data, leveraging his expertise in obstetric ultrasonography. He contributed to the manuscript's review and editing, focusing on clinical and methodological aspects.
Andrés Campaña-Acuña: Responsible for developing and implementing the ML models and comparing them with conventional methods. He participated in validating the results and critically reviewed the manuscript to enhance its scientific content.
Silvia Sanca-Valeriano: Provided overall supervision of the study, reviewed the analysis and interpretation of the results, and critically revised the manuscript to improve its clinical accuracy and relevance.
All authors approved the final version of the manuscript and are accountable for all aspects of the work, ensuring that questions related to the accuracy or integrity of any part of the study are appropriately investigated and resolved.
Funding: The authors received no financial support for the research, authorship, and/or publication of this article.
Declaration of conflicting interests: The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Data availability statement: Data will be made available upon request.
References
- 1. Azcorra H, Dickinson F, Mendez-Dominguez N, et al. Development of birthweight and length for gestational age and sex references in Yucatan, Mexico. Am J Hum Biol 2022; 34: e23732.
- 2. Wilcox AJ, Cortese M, McConnaughey DR, et al. The limits of small-for-gestational-age as a high-risk category. Eur J Epidemiol 2021; 36: 985–991.
- 3. Halimeh R, Melchiorre K, Thilaganathan B. Preventing term stillbirth: benefits and limitations of using fetal growth reference charts. Curr Opin Obstet Gynecol 2019; 31: 365–374.
- 4. Sotiriadis A, Eleftheriades M, Papadopoulos V, et al. Divergence of estimated fetal weight and birth weight in singleton fetuses. J Matern Fetal Neonatal Med 2018; 31: 761–769.
- 5. Hadlock FP, Harrist RB, Sharman RS, et al. Estimation of fetal weight with the use of head, body, and femur measurements––a prospective study. Am J Obstet Gynecol 1985; 151: 333–337.
- 6. Papageorghiou AT, Ohuma EO, Altman DG, et al. International standards for fetal growth based on serial ultrasound measurements: the fetal growth longitudinal study of the INTERGROWTH-21st project. Lancet 2014; 384: 869–879.
- 7. Buck Louis GM, Grewal J, Albert PS, et al. Racial/ethnic standards for fetal growth: the NICHD fetal growth studies. Am J Obstet Gynecol 2015; 213: 449.e1–449.e41.
- 8. Blue NR, Savabi M, Beddow ME, et al. The Hadlock method is superior to newer methods for the prediction of the birth weight percentile. J Ultrasound Med 2019; 38: 587–596.
- 9. Alves ALF, Carvalho AAV, Carvalho JAB, et al. Evaluation of the adequacy of Hadlock's reference chart for identification of fetuses with growth restriction. J Matern Fetal Neonatal Med 2018; 31: 967–971.
- 10. Monier I, Ego A, Benachi A, et al. Comparison of the Hadlock and INTERGROWTH formulas for calculating estimated fetal weight in a preterm population in France. Am J Obstet Gynecol 2018; 219: 476.e1–476.e12.
- 11. Grantz KL, Hinkle SN, He D, et al. A new method for customized fetal growth reference percentiles. PLoS One 2023; 18: e0282791.
- 12. Rizzo G, Prefumo F, Ferrazzi E, et al. The effect of fetal sex on customized fetal growth charts. J Matern Fetal Neonatal Med 2016; 29: 3768–3775.
- 13. Topol EJ. High-performance medicine: the convergence of human and artificial intelligence. Nat Med 2019; 25: 44–56.
- 14. Miotto R, Wang F, Wang S, et al. Deep learning for healthcare: review, opportunities and challenges. Brief Bioinform 2018; 19: 1236–1246.
- 15. Hollmann N, Müller S, Purucker L, et al. Accurate predictions on small data with a tabular foundation model. Nature 2025; 637: 319–326.
- 16. Naimi AI, Platt RW, Larkin JC. Machine learning for fetal growth prediction. Epidemiology 2018; 29: 290–298.
- 17. Ranjbar A, Montazeri F, Farashah MV, et al. Machine learning-based approach for predicting low birth weight. BMC Pregnancy Childbirth 2023; 23: 03.
- 18. Tao J, Yuan Z, Sun L, et al. Fetal birthweight prediction with measured data by a temporal machine learning method. BMC Med Inform Decis Mak 2021; 21: 26.
- 19. Shepard MJ, Richards VA, Berkowitz RL, et al. An evaluation of two equations for predicting fetal weight by ultrasound. Am J Obstet Gynecol 1982; 142: 47–54.
- 20. Wilimitis D, Walsh CG. Practical considerations and applied examples of cross-validation for model development and evaluation in health care: tutorial. JMIR AI 2023; 2: e49023.
- 21. Helm JM, Swiergosz AM, Haeberle HS, et al. Machine learning and artificial intelligence: definitions, applications, and future directions. Curr Rev Musculoskelet Med 2020; 13: 69–76.
- 22. Stirnemann J, Villar J, Salomon LJ, et al. International estimated fetal weight standards of the INTERGROWTH-21st project. Ultrasound Obstet Gynecol 2017; 49: 478–486.
- 23. Fung R, Villar J, Dashti A, et al. Achieving accurate estimates of fetal gestational age and personalized predictions of fetal growth based on data from an international prospective cohort study: a population-based machine learning study. Lancet Digit Health 2020; 2: e368–e375.
- 24. Solt I, Caspi O, Beloosesky R, et al. Machine learning approach to fetal weight estimation. Am J Obstet Gynecol 2019; 220: S666–S667.
- 25. Cohen G, Girshovitz I, Amit G, et al. Improving clinical fetal weight estimation using machine learning. Am J Obstet Gynecol 2023; 228: S131.
- 26. Lu Y, Fu X, Chen F, et al. Prediction of fetal weight at varying gestational age in the absence of ultrasound examination using ensemble learning. Artif Intell Med 2020; 102: 101748.
- 27. Victor A, Geremias Dos Santos H, Silva GFS, et al. Predictive modeling of gestational weight gain: a machine learning multiclass classification study. BMC Pregnancy Childbirth 2024; 24: 33.
- 28. Tonekaboni S, Joshi S, McCradden MD, et al. What clinicians want: contextualizing explainable machine learning for clinical end use. Mach. Learn. Healthc. Conf 2019; 106: 359–380. Available from: https://proceedings.mlr.press/v106/tonekaboni19a.html.
- 29. Holzinger A, Langs G, Denk H, et al. Causability and explainability of artificial intelligence in medicine. Wiley Interdiscip Rev Data Min Knowl Discov 2019; 9: e1312.
- 30. Tsai Y, Bailis P, Zaharia M. Faith-Shap: the faithful Shapley interaction index. J Mach Learn Res 2023; 24: 1–42. Available from: https://dl.acm.org/doi/pdf/10.5555/3648699.3648793.
- 31. Wanyonyi SZ, Mutiso SK. Monitoring fetal growth in settings with limited ultrasound access. Best Pract Res Clin Obstet Gynaecol 2018; 49: 29–36.
- 32. Horgan R, Nehme L, Abuhamad A. Artificial intelligence in obstetric ultrasound: a scoping review. Prenat Diagn 2023; 43: 1176–1219.



