BMC Medical Research Methodology
2026 Feb 4;26:55. doi: 10.1186/s12874-026-02791-7

Development and validation of machine learning models for predicting operative duration in assisted reproductive technology procedures

Seoyoung Oh 1,#, Hyo Young Kim 2,#, Young Soo Park 1
PMCID: PMC12964782  PMID: 41634589

Abstract

Background

Accurate prediction of operative duration is essential for efficient scheduling and resource allocation in surgical settings. In assisted reproductive technology (ART), procedures are brief yet highly time-sensitive, such that even small timing errors can propagate delays across tightly coupled workflows. This study aimed to develop and validate interpretable prediction models for operative duration in ART procedures using routinely available electronic medical record (EMR) data.

Methods

We retrospectively analyzed 763 operative cases from a high-volume fertility clinic in South Korea. Operative duration was defined using EMR-recorded start and end timestamps. Predictors included procedure type, patient age, reservation characteristics, attending physician, and day of surgery. We evaluated representative models from three commonly used predictive modeling paradigms—linear models, tree-based ensembles, and kernel/neural-network approaches—within a unified preprocessing and five-fold cross-validation framework. Model performance was assessed using standard error-based metrics and benchmarked against a commonly used moving-average heuristic.

Results

Among the evaluated approaches, regularized linear models—particularly ridge and Bayesian ridge regression—demonstrated stable and interpretable performance. After back-transformation to the original time scale, these models achieved a mean absolute error of approximately 3.1 min, corresponding to a 7–8% reduction compared with the moving-average (MA5) baseline. Procedure type, patient age, and reservation type emerged as the most influential predictors of operative duration.

Conclusions

Using routinely available EMR data, interpretable linear models demonstrated consistent performance gains over a simple operational baseline. These results highlight the value of transparent and well-validated modeling strategies for operative duration prediction in high-throughput clinical settings.

Trial registration

Not applicable (retrospective study using de-identified EMR data).

Keywords: Operative duration, Machine learning, Assisted reproductive technology, Electronic medical records

Background

Accurate prediction of operative duration is essential for optimizing operating room (OR) planning, case sequencing, staff allocation, and resource utilization [1, 2]. When procedures exceed their estimated times, delays cascade through subsequent cases, extend post-anesthesia care unit occupancy, and slow bed turnover, leading to overtime costs, cancellations, and reduced patient satisfaction. Because operative durations often follow right-skewed distributions with high variability [3], even small prediction errors can cause systemic scheduling bottlenecks. Developing reliable, data-driven models to forecast operative duration is therefore critical for improving operational efficiency while maintaining patient safety and quality of care [4].

The need for precise forecasts is amplified in assisted reproductive technology (ART) settings, where procedures are short, frequent, and tightly coupled within narrow biological windows. Interventions such as oocyte retrieval and embryo transfer typically last less than 10 min but must align with sequential laboratory processes—such as intracytoplasmic sperm injection (ICSI) and embryo culture—occurring over a few hours. Even minute-level deviations can disrupt this chain and impair reproductive outcomes [5, 6]. In such time-sensitive environments, predictive models must be not only accurate but also interpretable and deployable in real-time scheduling systems.

Existing studies on operative time prediction have evolved along two main lines. Operations research models have used probabilistic distributions to optimize OR scheduling and block allocation [1]. More recently, medical informatics approaches have applied machine learning to forecast operative duration and related time-based outcomes, incorporating structured and unstructured data and enhancing model transparency through SHAP [7–10]. However, most prior work has focused on tertiary hospitals and long-duration surgeries, relying on detailed clinical covariates such as American Society of Anesthesiologists (ASA) classification, body mass index (BMI), or anesthesia type [8]. These variables are often unavailable or impractical to collect in routine outpatient or specialty clinic settings.

To address this gap, the present study develops and validates parsimonious, interpretable machine learning models for predicting operative duration in ART procedures using only routinely available electronic medical record (EMR) data. We benchmark these models against practical heuristics to assess whether simplified predictors can achieve clinically meaningful accuracy while remaining feasible for real-world clinical deployment. By focusing on model interpretability and parsimony, this study contributes to the methodological evidence base for deploying machine learning in low-data, high-frequency surgical environments.

Methods

We analyzed EMR data from one of South Korea’s leading reproductive medicine specialty clinics, a high-volume institution internationally recognized for infertility treatment. The clinic primarily serves older female patients and is characterized by frequent, high-complexity, and repeat procedures. All data were fully de-identified in accordance with institutional research regulations.

Data sources and participants

The dataset was constructed by integrating three independent sources: the hospital information system (UniHIS), the outpatient queue management system, and physician-entered Excel records. The observation period spanned 17 consecutive days (April 18–May 7, 2022), during which 5,118 encounters were documented, including 4,343 outpatient visits and 775 surgical cases. As the study focused on operative duration prediction, we restricted the sample to obstetrics and gynecology surgeries performed on female patients. Male patients (n = 3) and urology cases (n = 3) were excluded to ensure cohort homogeneity, yielding 769 eligible gynecologic surgeries. Operative duration was defined as the time interval (minutes) between EMR-recorded procedure start and end timestamps.

Procedures with durations ≥ 30 min (n = 6) were excluded as clinical outliers. This threshold reflects the clinic’s operating reality—oocyte retrieval, the predominant procedure, typically lasts 5–10 min—and is consistent with prior evidence linking prolonged retrievals (> 12 min) to delayed recovery and lower embryo quality [5]. Statistically, this cutoff excluded approximately 1% of cases. The same exclusion rule was consistently applied to training, validation, and test sets to prevent data leakage. The final analytic cohort comprised 763 surgeries. The data selection and exclusion process is summarized in Fig. 1.
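The cohort construction described above can be sketched in pandas; the frame, column names, and timestamps below are hypothetical stand-ins for the EMR schema, not the study's actual fields:

```python
import pandas as pd

# Hypothetical EMR extract; column names and values are illustrative only.
cases = pd.DataFrame({
    "sex": ["F", "F", "M", "F"],
    "department": ["OBGYN", "OBGYN", "Urology", "OBGYN"],
    "op_start": pd.to_datetime(["2022-04-18 09:00", "2022-04-18 09:20",
                                "2022-04-18 10:00", "2022-04-18 10:30"]),
    "op_end": pd.to_datetime(["2022-04-18 09:08", "2022-04-18 09:55",
                              "2022-04-18 10:12", "2022-04-18 10:37"]),
})

# Operative duration in minutes from EMR start/end timestamps.
cases["duration_min"] = (cases["op_end"] - cases["op_start"]).dt.total_seconds() / 60

# Cohort rules: female OBGYN cases with duration < 30 min.
cohort = cases[(cases["sex"] == "F")
               & (cases["department"] == "OBGYN")
               & (cases["duration_min"] < 30)]
print(len(cohort))  # 2
```

Applied once as a fixed rule, the same filter can be reused unchanged on training, validation, and test partitions, which is how the paper avoids leakage from the exclusion step.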

Fig. 1. Flow diagram of study cohort (N = 763)

Variables and outcomes

We used a parsimonious and interpretable feature set derived from routinely collected EMR fields. Predictors were selected based on their availability prior to the start of surgery to avoid target leakage and to ensure real-world deployability. The dataset included structured variables capturing patient demographics (age, nationality), operational attributes (reservation type, department, attending physician), clinical information (diagnosis and procedure), and workflow timestamps (reservation, arrival, surgery start, and surgery end). Table 1 summarizes the study variables, including their type, operational definitions, and rationale for inclusion in the predictive models.

Table 1.

Variables and definitions

Variable Type Description
Operative duration Continuous Duration of surgery (minutes), calculated from EMR start–end timestamps
Procedure type Categorical Procedure name simplified into top 5 most frequent procedures plus “Other”
Age Continuous Patient age (years) at time of surgery
Age group Categorical Age grouped in 5-year intervals
Reservation type Categorical Scheduled reservation vs. same-day visit
Attending physician Categorical Physician who performed the surgery
Nationality Categorical Domestic vs. foreign patients
Day of surgery Categorical Day of the week (Monday–Saturday)
Procedure × Age group Interaction Interaction between procedure type and age group
Reservation × Age group Interaction Interaction between reservation type and age group

Procedure type was simplified into a categorical variable with six levels (five most frequent categories plus “Other”). Among 18 distinct procedures identified, the five most frequent categories accounted for approximately 94% of all surgeries, with oocyte retrieval and preparation representing 57.9% of cases. To mitigate sparsity while retaining clinical granularity, the top five procedures were modeled as separate categories, and the remainder were collapsed into an “Other” group.
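The "top-5 plus Other" collapsing can be expressed compactly in pandas; the procedure labels and counts below are illustrative, not the study's distribution:

```python
import pandas as pd

# Illustrative procedure labels; the real names come from the EMR.
procs = pd.Series(
    ["Oocyte retrieval"] * 6 + ["Thawing ET"] * 5 + ["Polypectomy"] * 4
    + ["Hysteroscopy"] * 3 + ["ET"] * 2 + ["Rare proc A", "Rare proc B"])

# Keep the five most frequent categories; collapse the rest into "Other".
top5 = procs.value_counts().nlargest(5).index
procs_top5 = procs.where(procs.isin(top5), "Other")
print(procs_top5.value_counts())
```

The resulting six-level factor can then be one-hot encoded like any other categorical predictor.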

Age was included as both a continuous and a categorical variable grouped in five-year intervals to capture nonlinear effects. This approach aligns with prior studies that modeled age either through splines or grouped categories when analyzing operative duration and reproductive outcomes [9, 11]. The mean age was 42.5 years (SD 4.3), with 39.7% of patients aged 40–44, 28.3% aged 35–39, and 27.0% aged ≥ 45. The high proportion of patients aged ≥ 40 years reflects the age profile typical of infertility populations and ensures sufficient discriminability across clinically meaningful age strata. Nationality (domestic vs. foreign) was included as a demographic control, reflecting the predominance of domestic patients (97.8%).
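The five-year grouping can be built with `pd.cut`; the bin edges below span the cohort's observed age range (30–54), and the label format is an assumption for illustration:

```python
import pandas as pd

ages = pd.Series([31, 36, 42, 44, 47, 51])

# 5-year bins covering ages 30-54; left-closed so 35 falls in "35-39".
edges = list(range(30, 60, 5))                    # 30, 35, 40, 45, 50, 55
labels = [f"{lo}-{lo + 4}" for lo in edges[:-1]]  # "30-34", ..., "50-54"
age_group = pd.cut(ages, bins=edges, right=False, labels=labels)
print(age_group.tolist())
```

Keeping both the continuous age and the binned version in the design matrix, as the paper does, lets a linear model capture a coarse nonlinearity without splines.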

Reservation type was included as a binary predictor (scheduled vs. same-day) because it may directly influence operative duration through differences in urgency. In our cohort, 75.9% of surgeries were scheduled in advance, while 24.1% were same-day visits. Consistent with prior work highlighting admission type or urgency status as important predictors of operative duration [12], we incorporated reservation type to capture this potential source of heterogeneity. Day of surgery (Monday–Saturday) was included to capture potential variation related to weekly scheduling patterns. Attending physician identifiers were retained to account for heterogeneity in surgical style across the 11 surgeons. We also examined candidate interaction terms to test whether operative duration varied systematically across patient and operational subgroups, specifically procedure × age group and reservation type × age group.

The primary outcome was operative duration, defined as the difference between surgery start and end timestamps recorded in the EMR and expressed in minutes. All timestamps were standardized to the second and converted into minutes. In the final analytic cohort, the raw distribution of operative duration exhibited a mean of 10.8 min (SD 6.2), consistent with expected durations for short, high-frequency ART procedures. Cases ≥ 30 min were excluded as outliers, leaving 763 observations for modeling.

The distribution was markedly right-skewed (skewness: 0.90), consistent with findings from large-scale surgical datasets [3, 9]. To address this skewness, we applied log transformation (log1p) to the target variable for model training. Critically, all model evaluations and performance metrics were calculated on the original scale (minutes) after back-transformation using Duan smearing correction, ensuring clinical interpretability of all reported results. We evaluated alternative transformation approaches (robust scaling, quantile transformation, Box–Cox) but found that log transformation using log1p provided the optimal balance between distribution improvement, predictive performance, and clinical interpretability.
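The log1p training target and Duan smearing back-transformation can be sketched as follows; the synthetic right-skewed durations are stand-ins, and for log1p the smearing identity is E[y] = exp(pred) · E[exp(residual)] − 1:

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
# Synthetic right-skewed durations in minutes (illustrative, not study data).
y = 8 * np.exp(0.3 * X[:, 0] + rng.normal(scale=0.4, size=200))

# Train on log1p(minutes); evaluate on the original scale after back-transform.
model = Ridge().fit(X, np.log1p(y))
resid = np.log1p(y) - model.predict(X)
smear = np.exp(resid).mean()                  # Duan smearing factor
y_hat = np.exp(model.predict(X)) * smear - 1  # E[y] = exp(pred)*smear - 1 for log1p
```

In practice the smearing factor would be estimated on training residuals only and applied to held-out predictions, so the correction does not leak test information.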

Model development and evaluation

To identify an optimal balance between interpretability and predictive performance, we systematically compared machine-learning algorithms across three model families:

  (i) Linear regression methods (linear regression, ridge regression, Bayesian ridge);

  (ii) Tree-based ensembles (random forest, XGBoost, CatBoost); and

  (iii) Kernel-based and neural-network models (support vector regression with linear and RBF kernels, and a multilayer perceptron).
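The scikit-learn members of these three families can be instantiated as a dictionary for a unified comparison loop; this is a sketch with default hyperparameters (consistent with the untuned Ridge/BayesianRidge described later), and the XGBoost and CatBoost entries are omitted because they live in their own packages (`xgboost.XGBRegressor`, `catboost.CatBoostRegressor`):

```python
from sklearn.linear_model import LinearRegression, Ridge, BayesianRidge
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
from sklearn.neural_network import MLPRegressor

# One candidate per family member available in scikit-learn itself.
models = {
    "linear": LinearRegression(),
    "ridge": Ridge(),
    "bayes_ridge": BayesianRidge(),
    "rf": RandomForestRegressor(n_estimators=200, random_state=0),
    "svr_linear": SVR(kernel="linear"),
    "svr_rbf": SVR(kernel="rbf"),
    "mlp": MLPRegressor(hidden_layer_sizes=(32,), max_iter=500, random_state=0),
}
```

Iterating this dictionary inside one preprocessing and cross-validation pipeline keeps the comparison fair across families, as the Methods section requires.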

Linear regression models assume additive linear relationships and provide transparent coefficient estimates, making them clinically interpretable and easy to integrate into decision-support systems. Ordinary least squares (OLS) provided a computationally efficient baseline, while ridge regression introduced L2 regularization to mitigate multicollinearity and reduce variance, thereby improving generalization. Bayesian ridge regression further places priors on regularization parameters and generates posterior coefficient distributions, quantifying predictive uncertainty—a valuable property in repetitive, variable surgical workflows. Prior research on spinal surgery has shown that linear regression achieved mean absolute error (MAE) of ~ 127 min, providing a benchmark against which more complex learners can be compared [7]. Linear models have also been widely adopted as comparators in predicting emergency department length of stay and inpatient duration due to their simplicity and transparency [13].

Tree-ensemble models aggregate multiple decision trees to capture variable interactions and accommodate categorical predictors. We included random forest, XGBoost, and CatBoost. Random forest aggregates bootstrap-sampled trees through averaging in regression tasks, yielding stable predictions and providing variable-importance metrics. In spinal surgery cohorts, department-specific random-forest models reduced root mean squared error (RMSE) from 57 to 31 min compared with single-hospital models, demonstrating improved performance and interpretability [14]. XGBoost employs gradient boosting with iterative weighting of residuals, enabling precise modeling of nonlinear effects. On spinal surgery data, XGBoost achieved R² of 0.77 and MAE of ~ 44 min, with SHAP analyses highlighting contributions of high-cardinality variables such as BMI, surgical scope, and attending surgeon [8]. CatBoost further enhances gradient boosting by incorporating ordered target encoding and sequential boosting designed for categorical features, facilitating robust training in EMR-based contexts characterized by sparse or high-dimensional categorical data [15].

Kernel- and neural-network methods were also considered for their structural advantages in learning high-dimensional nonlinear patterns. Support vector regression (SVR) fits a regression function that tolerates small deviations within a margin, relying on support vectors to define the model. We tested both linear and radial basis function (RBF) kernels, with the latter particularly effective for capturing nonlinear and irregular variation observed in operative durations. Prior studies have shown the utility of SVR for modeling surgical waiting times, where nonlinear relationships are common [16]. Multilayer perceptrons (MLPs), which use hidden layers to approximate complex functions, have been applied in exploratory medical contexts such as simulations and bed-turnover prediction. However, given the modest sample size in our dataset, MLPs carry a heightened risk of overfitting and were therefore included primarily as reference models rather than production candidates.

Taken together, our model-selection strategy encompassed linear models offering interpretability, tree ensembles leveraging variable interactions and categorical handling, and kernel/neural-network models capable of capturing nonlinear complexity. This balanced comparison enabled evaluation of both statistical characteristics and clinical relevance, with the aim of identifying a model that optimally balances interpretability and predictive accuracy for real-world clinical operations.

Model training and evaluation

The study followed a four-stage experimental workflow to ensure statistical validity and operational feasibility (Fig. 2).

Fig. 2. Experimental workflow for operative duration prediction

Stage 1

Data preparation. Predefined exclusion and transformation steps were applied, including removal of extreme outliers, log transformation of operative duration (with Box–Cox as a robustness check), and imputation of missing values. All procedures were performed within training folds and applied unchanged to validation and test folds to prevent data leakage.

Stage 2

Feature engineering. Clinical and operational variables were recoded into model-ready predictors. Procedure names were consolidated into the “top-5 plus Other” schema, age was recoded into 5-year groups, and two interaction terms were constructed (procedure × age group, reservation type × age group). Reservation type and physician identifiers were one-hot encoded, and continuous variables were standardized.

Stage 3

Model comparison and evaluation. Approximately 500 experiments were conducted by combining 20 algorithms across three model families with 25 feature sets. All models were trained and evaluated within a unified preprocessing and encoding pipeline to ensure fairness and reproducibility. For benchmarking, two heuristic baselines were included: (i) the procedure-specific mean operative duration and (ii) the moving average of the five most recent cases.

Stage 3.1

Cross-validation strategy: We selected 5-fold cross-validation based on several considerations. With a final analytic cohort of n = 763, 5-fold splitting yields approximately 152–153 samples per fold, providing sufficient statistical power while maintaining reasonable fold counts. StratifiedKFold was used to preserve the distribution of procedure types (Diagnosis_top5) across folds, ensuring that each fold maintains similar proportions of procedure categories. ART procedures are considered independent events with low temporal dependence, making time-series splitting unnecessary. Nested cross-validation was not used because Ridge and BayesianRidge employ default hyperparameters without tuning, and the computational cost would outweigh potential benefits.
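Stratifying regression folds on the categorical procedure label can be done by passing that label as the stratification target; the label counts below are illustrative, not the cohort's actual distribution:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Stratify on the procedure label (Diagnosis_top5 in the paper) so each of
# the 5 folds preserves the case-mix; counts here are illustrative.
proc = np.array(["Oocyte"] * 60 + ["Thawing"] * 25 + ["Other"] * 15)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
folds = list(skf.split(np.zeros((len(proc), 1)), proc))
for _, test_idx in folds:
    print(dict(zip(*np.unique(proc[test_idx], return_counts=True))))
```

Because `StratifiedKFold` stratifies on the labels passed as `y`, the continuous duration outcome stays out of the splitting step entirely.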

Stage 3.2

Cross-validation stability: The 5-fold cross-validation results demonstrated stable performance across folds after back-transformation. Ridge regression showed MAE of 3.10 ± 0.23 min (coefficient of variation: 7.4%), and BayesianRidge showed MAE of 3.11 ± 0.22 min (coefficient of variation: 7.1%), indicating consistent and reliable model performance. This stability supports the validity of our cross-validation approach and the generalizability of our findings.

Stage 3.3

Collinearity assessment: To assess multicollinearity among predictors, we calculated variance inflation factors (VIF) for all features after one-hot encoding. All features had VIF < 10, with the highest VIF being 1.96 for physician-related variables, indicating no multicollinearity concerns. Age had a VIF of 1.09, and Diagnosis_top5-related features had VIFs ranging from 1.05 to 1.21. Correlation analysis for continuous variables and chi-square tests for categorical variable associations were also performed, confirming no problematic associations. Chi-square tests revealed significant associations between Diagnosis_top5 and Btype, and between Diagnosis_top5 and Doctor, but these associations did not compromise model stability, as confirmed by the low VIF values.
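The VIF computation can be written directly from its definition, VIF_j = 1 / (1 − R²_j), where R²_j comes from regressing feature j on the remaining features; this is a minimal numpy sketch on synthetic columns, not the study's design matrix:

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of a design matrix."""
    X = np.asarray(X, dtype=float)
    out = []
    for j in range(X.shape[1]):
        yj = X[:, j]
        others = np.delete(X, j, axis=1)
        others = np.column_stack([np.ones(len(others)), others])  # intercept
        beta, *_ = np.linalg.lstsq(others, yj, rcond=None)
        resid = yj - others @ beta
        r2 = 1 - resid.var() / yj.var()
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

rng = np.random.default_rng(0)
a = rng.normal(size=300)
b = rng.normal(size=300)
# Third column nearly collinear with the first -> its VIF should be large.
X = np.column_stack([a, b, 0.9 * a + 0.1 * rng.normal(size=300)])
print(vif(X))
```

The same function applied after one-hot encoding reproduces the kind of per-feature VIF screen the paper reports.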

Stage 4

Explainability and operational impact. Model interpretability and clinical implications were assessed using permutation importance and SHAP analyses. Improvements in mean absolute error relative to baseline were reported in minutes and translated into estimated operating-room cost savings using published benchmarks (22–133 USD/min) [17, 18], emphasizing not only predictive accuracy but also reductions in cost, surgical delays, and staff overtime.

Model performance was summarized using four statistics: mean absolute error (MAE), root mean squared error (RMSE), coefficient of determination (R²), and mean absolute percentage error (MAPE). MAE calculates the average magnitude of errors between predicted and actual values and is directly interpretable in minutes of operative duration, making it the primary accuracy measure. RMSE penalizes larger errors more heavily by squaring differences before averaging. R² indicates the proportion of variance explained by the predictors. MAPE expresses error as a percentage of actual values; because denominators near zero can destabilize MAPE, it was used as a complementary metric.

MAE was prioritized as the primary metric due to its direct interpretability in minutes and robustness to outliers. RMSE served as a complementary metric to understand error distribution, while R² assessed explanatory power. MAPE provided relative error interpretation but was limited to original-scale models. All performance metrics were calculated on the original scale (minutes) after back-transformation from log space. This ensures clinical interpretability while benefiting from log transformation’s distribution normalization. To ensure robust uncertainty quantification of model performance estimates, 95% confidence intervals were calculated.
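The four metrics and the fold-level confidence intervals can be implemented in a few lines; the per-fold MAEs below are hypothetical, and the t critical value 2.776 is the 97.5th percentile for 4 degrees of freedom, matching the t-based intervals the paper reports:

```python
import numpy as np

def mae(y, p):  return float(np.mean(np.abs(y - p)))
def rmse(y, p): return float(np.sqrt(np.mean((y - p) ** 2)))
def r2(y, p):   return float(1 - np.sum((y - p) ** 2) / np.sum((y - y.mean()) ** 2))
def mape(y, p): return float(100 * np.mean(np.abs((y - p) / y)))  # unstable near y = 0

# 95% CI over k = 5 folds via the t-distribution with 4 df (t_crit ~= 2.776).
fold_mae = np.array([3.0, 3.2, 2.9, 3.3, 3.1])  # hypothetical per-fold MAEs
mean = fold_mae.mean()
se = fold_mae.std(ddof=1) / np.sqrt(len(fold_mae))
ci = (mean - 2.776 * se, mean + 2.776 * se)
```

All metrics would be evaluated on back-transformed predictions in minutes, per the evaluation protocol above.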

Importantly, we distinguish between explanatory fit (measured by R²) and operational improvement (measured by reductions in absolute prediction error relative to heuristic baselines). While R² provides a descriptive measure of variance explained, model usefulness in operational scheduling contexts is better evaluated by the extent to which predictions improve upon commonly used heuristic baselines. Accordingly, practical relevance was assessed primarily through consistent reductions in MAE relative to a moving-average heuristic, rather than by absolute levels of variance explained.

Results

Baseline characteristics

Baseline characteristics are presented separately for the full cohort prior to exclusions (n = 5,118) and the final analytic cohort used for model development (n = 763). Table 2 presents baseline characteristics of the entire dataset, including 4,343 outpatient visits and 775 surgical cases. Continuous variables are reported as mean ± SD, and categorical variables as frequencies and percentages. Table 3 presents baseline characteristics of the final analytic cohort after applying exclusion criteria (Fig. 1). The final cohort comprised 763 surgeries, with oocyte retrieval and preparation representing the majority of procedures (57.9%), followed by Thawing ET/transfer (25.6%). Most surgeries were scheduled in advance (75.9%), and all patients were female return visitors.

Table 2.

Descriptive statistics of entire dataset (N = 5,118)

Variable Type Mean ± SD or N (%) Range
Age (years) Continuous 42.06 ± 4.35 21.0–66.0 (median: 42.0)
Operative duration Continuous 7.58 ± 10.10 0.0–176.0 (median: 5.0)
Before-queue time Continuous 61.44 ± 57.57 0.0–582.0 (median: 45.0)
Waiting-queue time Continuous 62.94 ± 46.28 0.0–330.0 (median: 55.0)
Gender Categorical Female: 4,830 (94.4%); Male: 288 (5.6%)
Reservation type Categorical Scheduled: 4,396 (85.9%); Walk-in: 722 (14.1%)
Pattern Categorical Consultation: 4,343 (84.9%); Surgery: 775 (15.1%)
Patient type Categorical Return: 4,585 (89.6%); First: 533 (10.4%)

Table 3.

Descriptive statistics of final analytic cohort (N = 763)

Variable Type Mean ± SD or N (%) Range
Age (years) Continuous 42.49 ± 4.25 30.0–54.0 (median: 42.0)
Operative duration Continuous 10.54 ± 5.48 1.0–30.0 (median: 10.0)
Before-queue time Continuous 30.87 ± 26.21 0.0–220.0 (median: 25.0)
Waiting-queue time Continuous 61.14 ± 33.82 8.0–265.0 (median: 55.0)
Gender Categorical Female: 763 (100.0%)
Procedure type Categorical Oocyte retrieval: 442 (57.9%); Thawing ET: 195 (25.6%); Other: 47 (6.2%); Hysteroscopic polypectomy: 40 (5.2%); ET: 27 (3.5%); Diagnostic hysteroscopy: 12 (1.6%)
Reservation type Categorical Scheduled: 579 (75.9%); Same-day: 184 (24.1%)
Patient type Categorical Return: 763 (100.0%)

Model performance

The predictive performance of competing algorithms after back-transformation to the original scale (minutes) is summarized in Table 4. Across all models, linear regression methods consistently delivered the strongest performance. Ridge regression achieved the lowest MAE (3.10 min), while Bayesian ridge attained the lowest RMSE (4.21 min) and Linear regression achieved the highest R² (0.41). Ordinary least squares regression performed nearly identically (MAE 3.09 min, RMSE 4.20 min, R² 0.41), confirming the stability of linear specifications in this setting.

Table 4.

Predictive performance of machine learning models after back-transformation to original scale (5-fold cross-validation, N = 763)

Family Model MAE RMSE R²
Linear models Ridge regression 3.10 (2.82–3.38) 4.20 (3.80–4.61) 0.41 (0.33–0.48)
Bayesian ridge 3.11 (2.84–3.39) 4.21 (3.82–4.61) 0.40 (0.33–0.47)
Linear regression 3.09 (2.79–3.39) 4.20 (3.78–4.61) 0.41 (0.33–0.49)
Tree ensembles Random forest 3.27 (2.89–3.64) 4.50 (3.92–5.07) 0.32 (0.21–0.43)
XGBoost 3.59 (2.99–4.18) 4.91 (4.13–5.68) 0.19 (0.01–0.37)
CatBoost 3.23 (2.83–3.63) 4.39 (3.80–4.99) 0.35 (0.23–0.47)
Kernel / NN SVR (linear) 3.09 (2.79–3.40) 4.20 (3.79–4.60) 0.41 (0.33–0.48)
SVR (RBF) 3.23 (2.90–3.57) 4.49 (3.98–4.99) 0.32 (0.23–0.42)
MLP 3.18 (2.78–3.57) 4.32 (3.77–4.86) 0.37 (0.26–0.48)

Tree-ensemble models (random forest, XGBoost, CatBoost) showed higher MAE values (3.23–3.59 min) and lower R² values (0.19–0.35), indicating no additional performance benefit from their greater complexity. Kernel-based methods yielded mixed results: SVR with a linear kernel reached competitive accuracy (MAE 3.09 min, R² 0.41), whereas the RBF kernel underperformed (MAE 3.23 min, R² 0.32). The multilayer perceptron (MLP) achieved an MAE of 3.18 min but showed lower explained variance (R² 0.37) and a higher RMSE (4.32 min) than the linear models.

Note: 95% confidence intervals shown in parentheses. Confidence intervals for all performance metrics were calculated using a t-distribution with 4 degrees of freedom (n − 1, where n = 5 folds).

To evaluate real-world applicability, models were benchmarked against a moving-average heuristic of the five most recent cases (MA5), which mirrors empirical scheduling practices in ART clinics. All comparisons were conducted under a five-fold cross-validation framework with Duan smearing back-transformation to recover predicted durations in minutes.

The best-performing models—ridge and Bayesian ridge regression—achieved mean absolute errors (MAE) of 3.10–3.11 min, compared with 3.36 min for the MA5 baseline, yielding an absolute improvement of 0.25–0.26 min (≈ 15–16 s per case) and a relative reduction of 7.4–7.6%. While the per-case difference appears modest, such improvement scales cumulatively in high-throughput ART settings, where dozens of procedures are performed daily, translating into measurable scheduling precision and reduced downstream delays.

We compared all models evaluated in Table 4 against the MA5 baseline (MAE 3.36 min). Linear models showed the greatest improvement: linear regression 0.27 min (8.0%), ridge regression 0.26 min (7.7%), and Bayesian ridge 0.25 min (7.4%). Tree-ensemble models showed smaller improvements: CatBoost 0.13 min (3.9%) and random forest 0.09 min (2.7%), while XGBoost performed worse than the baseline (−0.23 min, −6.8%). Kernel/neural-network models showed moderate improvements: SVR (linear) 0.27 min (8.0%), MLP 0.18 min (5.4%), and SVR (RBF) 0.13 min (3.9%). These results support the selection of linear models as the optimal choice.
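One plausible reading of the MA5 heuristic is the mean of the five most recent completed cases, with no prediction available for a procedure's first five cases; the helper below is a sketch under that assumption:

```python
import numpy as np

def ma5_forecast(durations):
    """Predict each case as the mean of the five most recent prior cases."""
    d = np.asarray(durations, dtype=float)
    preds = np.full(len(d), np.nan)  # first 5 cases have no MA5 prediction
    for i in range(5, len(d)):
        preds[i] = d[i - 5:i].mean()
    return preds

d = [8, 9, 10, 11, 12, 10, 9]
p = ma5_forecast(d)
# p[5] = mean(8, 9, 10, 11, 12) = 10.0; p[6] = mean(9, 10, 11, 12, 10) = 10.4
```

In a deployed version the rolling window would likely be kept per procedure type, since durations differ sharply across procedures.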

We compared simple exponential back-transformation with Duan smearing correction in Table 5. For Ridge regression, simple exp transformation yielded MAE 3.03 min, RMSE 4.26 min, and R² 0.39, while Duan smearing yielded MAE 3.10 min, RMSE 4.20 min, and R² 0.41. The difference in MAE was minimal (0.07 min), but Duan smearing showed slightly better R² performance. For BayesianRidge, similar patterns were observed. We selected Duan smearing for final reporting to correct for retransformation bias, though the choice of back-transformation method had minimal impact on final metrics.

Table 5.

Comparison of back-transformation methods: simple exponential vs. Duan smearing correction (5-fold cross-validation, after back-transformation to original scale in minutes)

Family Model Back-transformation MAE ± SD RMSE ± SD R²± SD
Linear models Ridge regression Simple exp 3.03 ± 0.24 4.26 ± 0.35 0.39 ± 0.07
Duan smearing 3.10 ± 0.23 4.20 ± 0.32 0.41 ± 0.06
Bayesian ridge Simple exp 3.04 ± 0.24 4.28 ± 0.35 0.39 ± 0.06
Duan smearing 3.11 ± 0.22 4.21 ± 0.32 0.40 ± 0.06
Linear regression Simple exp 3.02 ± 0.25 4.19 ± 0.33 0.40 ± 0.06
Duan smearing 3.09 ± 0.25 4.20 ± 0.33 0.41 ± 0.06
Tree ensembles Random forest Simple exp 3.22 ± 0.31 4.48 ± 0.39 0.31 ± 0.08
Duan smearing 3.27 ± 0.31 4.50 ± 0.39 0.32 ± 0.08
XGBoost Simple exp 3.54 ± 0.48 4.88 ± 0.52 0.18 ± 0.12
Duan smearing 3.59 ± 0.48 4.91 ± 0.52 0.19 ± 0.12
CatBoost Simple exp 3.19 ± 0.34 4.36 ± 0.40 0.34 ± 0.08
Duan smearing 3.23 ± 0.34 4.39 ± 0.40 0.35 ± 0.08
Kernel / NN SVR (linear) Simple exp 3.05 ± 0.27 4.18 ± 0.33 0.40 ± 0.06
Duan smearing 3.09 ± 0.27 4.20 ± 0.33 0.41 ± 0.06
SVR (RBF) Simple exp 3.20 ± 0.28 4.46 ± 0.36 0.31 ± 0.06
Duan smearing 3.23 ± 0.28 4.49 ± 0.36 0.32 ± 0.06
MLP Simple exp 3.15 ± 0.32 4.29 ± 0.38 0.36 ± 0.07
Duan smearing 3.18 ± 0.32 4.32 ± 0.38 0.37 ± 0.07

Interpretation of predictive variables

Permutation importance and SHAP analyses consistently identified procedure type as the dominant predictor of operative duration, followed by patient age and reservation type. These three variables captured most of the predictive signal, while other features such as attending physician, day of surgery, and nationality contributed negligibly after adjustment. Interaction terms (procedure × age group, reservation × age group) added minor explanatory power, confirming that procedure category—rather than demographic or operational variability—primarily drives the observed duration differences. Together, this compact, interpretable feature set provides sufficient information for accurate, clinically meaningful predictions in short, high-frequency ART workflows.

The consistent identification of procedure type as the dominant predictor across both SHAP analysis in Fig. 3 and ablation studies confirms its primary role. Patient age ranked second, with ablation studies showing a 0.07-minute MAE increase when removed. Attending physician, while included in the model, showed relatively low importance (0.04-minute MAE increase when removed), and day of surgery was excluded from the final model due to low importance. This hierarchy remained consistent across different analytical approaches, supporting the robustness of our findings.

Fig. 3. SHAP summary plot (variable importance visualization)

To quantify the contribution of individual features, we performed an ablation study using the Ridge model by systematically removing each feature and measuring performance degradation (Table 6). Removing procedure type (Diagnosis_top5) increased MAE by 1.13 min and decreased R² by 0.37, confirming its dominant importance. Removing age increased MAE by 0.07 min and decreased R² by 0.05. Removing attending physician increased MAE by 0.04 min and decreased R² by 0.03. Removing reservation type (Btype) had minimal impact (MAE change: −0.001 min, R² decrease: 0.003), suggesting limited predictive value. These results align with SHAP analysis findings and support the focus on procedure type, age, and reservation type as the core predictive features.
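The leave-one-feature-out ablation can be sketched as a loop over cross-validated MAE; the synthetic predictors and effect sizes below are hypothetical stand-ins chosen so that the procedure variable dominates, mirroring the pattern in Table 6:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 300
# Synthetic stand-ins for the three key predictors (hypothetical effects).
proc = rng.integers(0, 3, n).astype(float)   # procedure type (dominant)
age = rng.normal(42, 4, n)
btype = rng.integers(0, 2, n).astype(float)  # reservation type
y = 6 + 3 * proc + 0.1 * (age - 42) + 0.2 * btype + rng.normal(0, 1, n)

features = {"procedure": proc, "age": age, "btype": btype}
full = np.column_stack(list(features.values()))
base_mae = -cross_val_score(Ridge(), full, y, cv=5,
                            scoring="neg_mean_absolute_error").mean()

# Ablation: drop one feature at a time and record the MAE increase.
for name in features:
    Xs = np.column_stack([v for k, v in features.items() if k != name])
    m = -cross_val_score(Ridge(), Xs, y, cv=5,
                         scoring="neg_mean_absolute_error").mean()
    print(name, round(m - base_mae, 3))
```

Removing the dominant predictor degrades cross-validated MAE far more than removing the minor ones, which is exactly the ranking logic behind Table 6.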

Table 6.

Ablation study: feature importance quantified by performance degradation when removed

| Removed feature | MAE | MAE increase | R² | R² decrease | Importance rank |
|---|---|---|---|---|---|
| None (baseline) | 3.03 | 0.00 | 0.39 | 0.00 | — |
| Procedure type | 4.16 | 1.13 | 0.02 | 0.37 | 1 (most important) |
| Age | 3.10 | 0.07 | 0.34 | 0.05 | 2 |
| Attending physician | 3.08 | 0.04 | 0.36 | 0.03 | 3 |
| Reservation type | 3.03 | −0.001 | 0.39 | 0.003 | 4 (least important) |
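The leave-one-feature-out ablation described above can be sketched as follows; the data, coefficients, and feature names are synthetic stand-ins, not the study's variables.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
n = 763  # matches the study's sample size; everything else is synthetic
proc = rng.integers(0, 5, n).astype(float)   # stand-in for procedure type
age = rng.normal(35.0, 5.0, n)               # stand-in for patient age
btype = rng.integers(0, 2, n).astype(float)  # stand-in for reservation type
y = 3.0 * proc + 0.05 * age + 0.2 * btype + rng.normal(0.0, 1.0, n)

features = {"procedure": proc, "age": age, "reservation": btype}

def cv_mae(cols):
    """Five-fold cross-validated MAE for a Ridge model on the given columns."""
    X = np.column_stack([features[c] for c in cols])
    scores = cross_val_score(Ridge(alpha=1.0), X, y,
                             scoring="neg_mean_absolute_error", cv=5)
    return -scores.mean()

baseline = cv_mae(list(features))
deltas = {}
for removed in features:
    kept = [c for c in features if c != removed]
    deltas[removed] = cv_mae(kept) - baseline
    print(f"without {removed}: MAE increase = {deltas[removed]:+.3f}")
```

The MAE increase on removal is the quantity tabulated in Table 6; the dominant synthetic feature produces by far the largest degradation.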

Variance inflation factor (VIF) analysis confirmed no multicollinearity concerns. All features had VIF < 10, with the highest VIF being 1.96 for physician-related variables. Age had a VIF of 1.09, and Diagnosis_top5-related features had VIFs ranging from 1.05 to 1.21. Correlation analysis for continuous variables and chi-square tests for categorical associations revealed no problematic associations, supporting the stability of linear model coefficients.
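VIF can be computed directly from its definition, VIF_j = 1/(1 − R²_j), where R²_j comes from regressing column j on the remaining columns; the sketch below does so with scikit-learn on synthetic data rather than the study's design matrix.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def vif(X):
    """VIF for each column: regress it on the others and take 1 / (1 - R^2)."""
    out = []
    for j in range(X.shape[1]):
        others = np.delete(X, j, axis=1)
        r2 = LinearRegression().fit(others, X[:, j]).score(others, X[:, j])
        out.append(1.0 / (1.0 - r2))
    return out

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 3))
X[:, 2] += 0.3 * X[:, 0]  # mild correlation; VIF stays well below 10
vifs = vif(X)
print([round(v, 2) for v in vifs])
```

Values near 1 indicate negligible collinearity; the conventional concern threshold of 10 is the one used in the text.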

Discussion

Principal findings

Using a compact, routinely available predictor set (procedure type, age, reservation type) and log-transformed outcomes, linear models provided the most reliable performance for short, high-frequency ART procedures. Ridge regression yielded the lowest MAE (3.10 min), while Bayesian ridge achieved the lowest RMSE (4.21 min) and Linear regression achieved the highest R² (0.41). Tree ensembles and neural or kernel methods did not provide consistent gains, likely reflecting the low feature cardinality and dominance of a few high-volume procedures in this context, which favors models emphasizing bias reduction over variance minimization. After Duan smearing back-transformation, the best models achieved an MAE of 3.10–3.11 min—marginally better than the procedure-mean baseline (by 0.01–0.02 min) and up to 7.6% better than the moving-average heuristic (by 0.25–0.26 min).
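Duan's smearing back-transformation rescales the naively exponentiated prediction by the mean of the exponentiated log-scale residuals, correcting the downward bias of plain exponentiation. A minimal sketch on synthetic log-normal durations (not the study's data):

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(3)
n = 500
x = rng.normal(size=(n, 1))
# Positive, right-skewed "durations" with log-normal noise
y = np.exp(1.0 + 0.5 * x[:, 0] + rng.normal(0.0, 0.3, n))

log_y = np.log(y)
model = Ridge(alpha=1.0).fit(x, log_y)
resid = log_y - model.predict(x)

# Duan's smearing estimator: E[y | x] ~= exp(x'beta) * mean(exp(residual))
smear = np.exp(resid).mean()
naive = np.exp(model.predict(x))  # plain exponentiation (biased low)
y_hat = naive * smear             # smearing-corrected prediction
print(f"smearing factor: {smear:.3f}")
```

The smearing factor exceeds 1 whenever the log-scale residuals have positive variance, which is exactly the bias that plain exponentiation ignores.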

To quantify the trade-off between interpretability and accuracy, we calculated a composite score combining accuracy (normalized MAE) and interpretability (SHAP consistency across folds) in Table 7. Ridge regression achieved the highest composite score (0.987), combining the top accuracy score (1.000) with a high interpretability score (0.974). Bayesian ridge followed with a composite score of 0.967. In contrast, tree-based models scored lower: random forest 0.601 and XGBoost 0.485, primarily due to lower accuracy relative to the linear models. This analysis indicates that linear models provide the optimal balance, achieving both superior accuracy and interpretability compared with more complex ensemble methods.

Table 7.

Interpretability-accuracy trade-off analysis

| Family | Model | MAE | R² | Interpretability score | Accuracy score | Composite score |
|---|---|---|---|---|---|---|
| Linear models | Ridge regression | 3.10 | 0.41 | 0.974 | 1.000 | 0.987 |
| Linear models | Bayesian ridge | 3.11 | 0.40 | 0.975 | 0.959 | 0.967 |
| Linear models | Linear regression | 3.09 | 0.41 | 0.974 | 0.989 | 0.982 |
| Tree ensembles | Random forest | 3.27 | 0.32 | 0.973 | 0.229 | 0.601 |
| Tree ensembles | XGBoost | 3.59 | 0.19 | 0.970 | 0.000 | 0.485 |
| Tree ensembles | CatBoost | 3.23 | 0.35 | 0.972 | 0.294 | 0.633 |
| Kernel / NN | SVR (linear) | 3.09 | 0.41 | 0.974 | 0.989 | 0.982 |
| Kernel / NN | SVR (RBF) | 3.23 | 0.32 | 0.971 | 0.294 | 0.633 |
| Kernel / NN | MLP | 3.18 | 0.37 | 0.970 | 0.471 | 0.721 |
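The exact normalization behind the accuracy score is not fully specified in the text, but the reported composite column is consistent with an equal-weight mean of the accuracy and interpretability scores. A minimal check, assuming that weighting:

```python
# (accuracy score, interpretability score) pairs taken from Table 7;
# the 50/50 weighting below is an assumption that reproduces the
# reported composite column.
rows = {
    "Ridge regression": (1.000, 0.974),
    "Bayesian ridge":   (0.959, 0.975),
    "Random forest":    (0.229, 0.973),
    "XGBoost":          (0.000, 0.970),
}
composites = {name: 0.5 * acc + 0.5 * interp
              for name, (acc, interp) in rows.items()}
for name, score in composites.items():
    print(f"{name}: composite = {score:.3f}")
```

Each computed value matches the corresponding composite score in Table 7 to three decimal places.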

Comparison with prior work

The observed predictor hierarchy—procedure type as the dominant factor, followed by patient age and reservation type—aligns with previous studies of longer and more complex surgeries, where procedural and scheduling variables consistently explain the majority of operative time variance even in the presence of detailed clinical covariates [7, 8, 10]. Our findings extend this evidence to the context of assisted reproductive technology (ART), a short-duration and high-frequency surgical environment rarely examined in predictive modeling research. In this setting, the limited feature diversity and standardized workflows allow parsimonious linear models to achieve performance comparable to or exceeding that of more flexible machine-learning algorithms, while offering superior interpretability and ease of clinical integration.

Methodologically, this study demonstrates that model simplicity can be advantageous when the outcome is low-variance and the predictors are routinely standardized, a pattern that complements rather than contradicts prior findings from large tertiary surgical datasets. Clinically, it addresses a persistent gap in the literature, which has predominantly focused on tertiary hospitals, lengthy procedures, and data-intensive covariates such as ASA class, BMI, or anesthesia type, leaving outpatient and specialty-clinic workflows underrepresented.

Practical implications

A minimal predictor set reduces integration burden for EMR-based deployment and supports near-real-time scheduling. Ridge models deliver strong point estimates, while Bayesian ridge additionally quantifies predictive uncertainty, enabling risk-adjusted buffer allocation. Although per-case gains are modest, applying published OR cost benchmarks (22–133 USD/min) suggests approximately 5–35 USD saved per case versus moving-average heuristics. At the scale of high-throughput ART clinics, these improvements can tighten scheduling buffers, reduce cascading delays and overtime, alleviate PACU bottlenecks, and improve reliability across tightly coupled laboratory workflows.
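The quoted per-case savings follow directly from the MAE improvement and the published cost benchmarks; a minimal arithmetic check, assuming the improvement applies minute-for-minute to OR time:

```python
# Ranges stated in the text: MAE improvement over the MA5 heuristic
# and published OR cost benchmarks.
mae_gain_min = (0.25, 0.26)   # minutes saved per case
cost_per_min = (22, 133)      # USD per OR minute

low = mae_gain_min[0] * cost_per_min[0]
high = mae_gain_min[1] * cost_per_min[1]
print(f"approx. {low:.1f}-{high:.1f} USD saved per case")
```

This reproduces the roughly 5–35 USD per-case range cited above.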

Strengths and limitations

This study demonstrates that even in brief, high-frequency surgical settings such as assisted reproductive technology (ART), interpretable and parsimonious models can provide robust and reproducible predictions using only routinely collected EMR data. A unified analytical pipeline implemented across all algorithms ensured methodological consistency and transparency, while five-fold cross-validation combined with Duan smearing enabled unbiased performance evaluation and accurate back-transformation from the log scale. By showing that simple, well-regularized linear models can achieve accuracy comparable to complex ensemble or neural approaches, the study highlights that interpretability and operational stability are often more valuable than algorithmic complexity in real-world clinical prediction problems. Practically, the findings demonstrate that accurate operative-duration prediction is achievable without additional data collection or manual preprocessing, facilitating scalable deployment within existing EMR systems.

Several limitations should be acknowledged. The single-institution, high-volume clinic setting and short observation period (17 days) may limit generalizability to other ART clinics, particularly low-volume centers or those with different patient demographics. The model’s performance in different healthcare systems, countries, or clinic types requires external validation. However, the focus on routinely available EMR variables enhances potential transferability, as these variables are commonly collected across ART clinics.

The limited feature set represents a key limitation. While we used all routinely available EMR data, additional features such as BMI, previous surgical history, comorbidities, surgical complexity scores, and anesthesia type were not available in structured format. The ablation study confirmed that procedure type is the dominant predictor, but inclusion of additional clinical covariates could potentially improve model performance and explanatory power. This limitation is inherent to single-institution EMR data, where some clinical information may not be structured or readily available.

The R² of 0.40–0.41 (95% CI: 0.33, 0.49) indicates that our models explain approximately 40% of the variance in operative duration, leaving the remainder unexplained. We present R² as a descriptive measure of explained variance rather than a benchmark of operational adequacy: in scheduling contexts, relative improvement over baseline matters more than absolute fit, and R² values of 0.3–0.5 are typical of comparable surgical duration prediction studies. The 7.4–7.6% reduction in MAE over the MA5 baseline therefore carries practical clinical value; with 20 procedures per day, it translates to approximately 6 min of daily time savings, contributing to reduced waiting times and improved scheduling efficiency.

Conclusions

This study demonstrates that operative duration in ART procedures can be accurately predicted using simple, interpretable linear models trained on routinely collected EMR data. Benchmarking against a moving-average (MA5) heuristic, the proposed models achieved a 7.4–7.6% reduction in mean absolute error, corresponding to approximately 15–16 s per procedure.

These findings indicate that transparent and easily deployable models can provide reliable predictive performance without the data demands or complexity of advanced algorithms. When implemented in high-volume ART clinics, their cumulative impact can improve scheduling precision, minimize downstream delays, and enhance overall operational efficiency. Future work should validate these findings in multi-institution datasets and explore integration with real-time scheduling systems for prospective evaluation.

Acknowledgements

Not applicable.

Abbreviations

ART: Assisted Reproductive Technology
ASA: American Society of Anesthesiologists
BMI: Body Mass Index
EMR: Electronic Medical Record
ICSI: Intracytoplasmic Sperm Injection
OR: Operating Room
PACU: Post-Anesthesia Care Unit
RBF: Radial Basis Function
SVR: Support Vector Regression
USD: United States Dollar

Authors’ contributions

YSP: Conceptualization, formal analysis, and writing – original draft. SO: Data curation, preprocessing, model development, and writing – original draft. HYK: Visualization, validation, and writing – review & editing. All authors read and approved the final manuscript.

Funding

This work was supported by the Academic Promotion System, Tech University of Korea. This research was also supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF), funded by the Ministry of Education (RS-2023-00248403).

Data availability

The datasets generated and analyzed during the current study are not publicly available due to institutional privacy regulations but are available from the corresponding author on reasonable request.

Declarations

Ethics approval and consent to participate

This study was conducted using fully de-identified electronic medical record (EMR) data obtained from a reproductive medicine specialty clinic in South Korea. According to the Bioethics and Safety Act of the Republic of Korea, ethical approval and informed consent were not required for retrospective research using anonymized data.

Consent for publication

Not applicable.

Competing interests

The authors declare no competing interests.

Footnotes

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Seoyoung Oh and Hyo Young Kim contributed equally to this work.

References

  • 1.Cardoen B, Demeulemeester E, Beliën J. Operating room planning and scheduling: A literature review. Eur J Oper Res. 2010;201(3):921–32. 10.1016/j.ejor.2009.04.011. [Google Scholar]
  • 2.Dexter F, Traub RD. How to schedule elective surgical cases into specific operating rooms to maximize the efficiency of use of operating room time. Anesth Analg. 2002;94(4):933–42. 10.1097/00000539-200204000-00030. [DOI] [PubMed] [Google Scholar]
  • 3.Wullink G, van Houdenhoven M, Hans EW, van Oostrum JM, Kazemier G. Surgery capacity allocation planning and case mix scheduling using a decisional model. Anaesthesia. 2007;62(6):571–8. 10.1111/j.1365-2044.2007.05041.x. [Google Scholar]
  • 4.Lee H, Kim S, Moon H, Lee H, Kim K, Jung S, Yoo S. Hospital length of stay prediction for planned admissions using observational medical outcomes partnership common data model: retrospective study. J Med Internet Res. 2024;26:e59260. 10.2196/59260. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 5.Matsota P, Kalimeris K, Kostopanagiotou G. Effects on Propofol consumption and IVF outcome: a narrative review. J Clin Med. 2021;10(5):1112. 10.3390/jcm10051112. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 6.Practice Committees of the American Society for Reproductive Medicine (ASRM) and the Society for Reproductive Biologists and Technologists (SRBT). Comprehensive guidance for human embryology, andrology, and endocrinology laboratories: management and operations: a committee opinion. Fertil Steril. 2022;117(6):1183–202. 10.1016/j.fertnstert.2022.02.016. [DOI] [PubMed] [Google Scholar]
  • 7.Spence RT, Vardy J, Reti AM, et al. A scoping review of machine learning for surgical capacity and two perioperative outcomes. BJS Open. 2023;7(3):zrad098. 10.1093/bjsopen/zrad098. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 8.Park JB, Roh GH, Kim K, Kim HS. Development of predictive model of surgical case durations using machine learning approach. J Med Syst. 2025;49(1):8. 10.1007/s10916-025-02141-y. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 9.Riahi V, Hassanzadeh H, Khanna S, et al. Improving preoperative prediction of surgery duration. BMC Health Serv Res. 2023;23:1343. 10.1186/s12913-023-10264-6. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 10.Kendale S, Bishara A, Burns M, Solomon S, Corriere M, Mathis M. Machine learning for the prediction of procedural case durations developed using a large multicenter database: algorithm development and validation study. JMIR AI. 2023;2:e44909. 10.2196/44909. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 11.Gabriel R, Harjai B, Simpson S, Du A, Tully J, George O, Waterman R. An ensemble learning approach to improving prediction of case duration for spine surgery: algorithm development and validation. JMIR Perioper Med. 2023;6:e39650. 10.2196/39650. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 12.Azriel D, Rinott Y, Tal O, Abbou B, Rappoport N. Surgery duration prediction using multi-task feature selection. IEEE J Biomed Health Inf. 2024;28(7):4216–23. 10.1109/JBHI.2024.3374783. [DOI] [PubMed] [Google Scholar]
  • 13.Ko H, et al. Prediction of emergency department length of stay using regression models. Int J Med Inf. 2020;139:104143. 10.1016/j.ijmedinf.2020.104143. [Google Scholar]
  • 14.Şentürk S, et al. Spine surgery duration prediction with random forest. J Spine Surg. 2020;6(4):681–8. 10.21037/jss-20-598. PMID: 33447670. [Google Scholar]
  • 15.Dorogush AV, Ershov V, Gulin A. CatBoost: gradient boosting with categorical features. NeurIPS Workshop. 2018. Available at: https://arxiv.org/abs/1810.11363 (accessed 2025-09-26).
  • 16.Zhang Y, et al. Support vector regression for predicting hospital wait times. Health Inf J. 2019;25(3):1115–26. 10.1177/1460458217747989. [Google Scholar]
  • 17.Maskal S, Jain R, Fedrigon D 3rd, Rose E, Monga M, Sivalingam S. The cost of operating room delays in an endourology center. Can Urol Assoc J. 2020;14(7):E304–8. 10.5489/cuaj.6099. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 18.Macario A. What does one minute of operating room time cost? J Clin Anesth. 2010;22(4):233–6. 10.1016/j.jclinane.2010.02.003. [DOI] [PubMed] [Google Scholar]
