Abstract
The study aimed to identify critical factors associated with the surgical stability of pogonion (Pog) by applying machine learning (ML) to predict relapse following two-jaw orthognathic surgery (2 J-OGS). The sample comprised 227 patients (110 males and 117 females; 207 training and 20 test samples). Using lateral cephalograms taken at the initial evaluation (T0), before surgery (T1), after 2 J-OGS (T2), and post treatment (T3), 55 linear and angular skeletal and dental surgical movements (T2-T1) were measured. Six ML models were utilized, including classification and regression trees (CART), conditional inference tree (CTREE), and random forest (RF). The training samples were classified into three groups according to the amount of Pog relapse: highly significant (HS, ≥ 4 mm), significant (S, ≥ 2 and < 4 mm), and insignificant (N, < 2 mm). RF indicated that the most important variable affecting relapse rank prediction was ramus inclination (RI), and CTREE and CART revealed that clockwise (CW) rotation of RI of more than 3.7 and 1.8 degrees was a risk factor for the HS and S groups, respectively. RF, CTREE, and CART were practical tools for predicting surgical stability. More than 1.8 degrees of CW rotation of the ramus during surgery would lead to significant Pog relapse.
Subject terms: Biophysics, Computational biology and bioinformatics
Introduction
Orthognathic surgery is performed to overcome skeletal discrepancies, obtain esthetics, and achieve normal occlusion. However, unstable outcomes often require dental compensation during postoperative orthodontic treatment and additional surgical procedures1. A hierarchy of post-surgical stability, depending on the direction of surgical movement, is well established. Changes > 2 mm or 2° were defined as moderately unstable, and > 4 mm or 4° as highly unstable2–4. A comprehensive report on this hierarchy5 indicated that post-surgical instability after mandibular setback was related to "a technical problem": the chin occasionally underwent clockwise (CW) rotation during the operation, and the pterygomasseteric sling later induced rotation in the opposite direction even with rigid fixation. The quantity of CW rotation of the proximal segment was correlated with the linear measurement of pogonion (Pog)6. Although two-jaw orthognathic surgery (2 J-OGS) was expected to overcome this situation, counterclockwise rotation of the proximal segment after surgery, measured as ramus inclination (RI), was significantly associated with the amount of mandibular relapse7. Based on the literature above, the major relapse follows CW rotation of the ramus (proximal segment) during surgery, which is related to forward movement of Pog after surgery. Therefore, training a machine learning (ML) algorithm on a dataset that includes the change in RI between the pre-operative (T1) and post-operative (T2, six to eight weeks later) stages may allow prediction of the change in Pog during retention in the testing set.
Artificial intelligence (AI) refers to the development of computer systems that can perform tasks requiring human intelligence. ML is a subfield of AI that focuses on devising algorithms and statistical models with which computers can "learn" from data without explicit programming. Deep learning is a subset of ML that uses artificial neural networks, inspired by the structure and function of the human brain, to process and analyze large amounts of data8. Studies on ML and deep learning in orthodontics and the temporomandibular joint (TMJ) field have been reported9–13. Jung stated that extraction versus non-extraction cases could be classified with a 93% success rate using ML9. Etemad reported ranking the factors determining extraction using random forest (RF)10. Li suggested that the K-nearest neighbors (KNN) method was the best model for distinguishing between extraction and non-extraction, extraction patterns, and anchorage determination11. Fang used multivariate logistic regression to detect cephalometric variables associated with degenerative joint disease12. Lee et al.13 adopted RF to rank the risk factors related to temporomandibular disorders. ML has also demonstrated potential for predicting surgical outcomes14.
To our knowledge, prediction of the stability of 2 J-OGS using ML has not been reported. Since the most obvious clinical feature in patients with skeletal class III malocclusion is the sagittal chin projection (Pog), the quantitative change in Pog was selected for investigation. The purpose of the present study was to identify the critical factors associated with the surgical stability of Pog by applying ML to predict relapse following 2 J-OGS.
Methods
Subjects
The study sample consisted of 319 adult Korean patients diagnosed with skeletal class III malocclusion who underwent combined surgical-orthodontic treatment with 2 J-OGS at Seoul National University Dental Hospital or Ajou University Dental Hospital, located in the Republic of Korea, between 2006 and 2017. The inclusion criteria were as follows: (1) patients who had undergone 2 J-OGS, with Le Fort I osteotomy in the maxilla and bilateral sagittal split osteotomy in the mandible; (2) patients who underwent rigid fixation of the osteotomized bony segments with a metal plate and monocortical screws; (3) patients for whom photographs and lateral cephalograms were taken at the initial visit (T0), at least one month before surgery (T1), at least one month after surgery (T2), and at debonding (T3); and (4) patients treated by faculty orthodontists with more than 30 years of experience (SHB and YHK). The exclusion criteria were (1) patients with cleft lip and/or palate or congenital craniofacial deformities, (2) patients with a history of trauma in the craniofacial area, (3) patients with severe facial asymmetry (menton deviation > 5 mm), and (4) patients who underwent vertical genioplasty. Supplementary Table 1 describes the age, sex, and posterior movement of Pog (1.59 ± 2.76 mm). Consequently, the final study sample included 227 patients (110 males and 117 females). This retrospective case–control study was reviewed and approved by the Institutional Review Boards of Seoul National University Dental Hospital (IRB no. ERI20022) and Ajou University Hospital (IRB no. AJIRB-MED-MDB-19–039). All experimental protocols were approved by the two institutional committees, which waived the need for patient informed consent. Previous studies have indicated that the major relapse after 2 J-OGS occurs within 8 weeks7 to 1 year5; thus, 1 year of follow-up was sufficient to examine relapse.
Sample size calculation
Power analyses were conducted following Cohen15, with a significance level (α) of 0.05 and a power (1 − β) of 0.9. Based on the mean and standard deviation (SD) of the postsurgical linear change in Pog from a previous study7, reported as 1.87 and 2.6 mm, respectively, sample size calculations were performed using R software (ver. 4.0.3, Vienna, Austria). The results indicated that a minimum of 20 individuals was required to achieve the desired statistical power. According to Rajput's suggestion16, a suitable sample size for machine learning algorithms should have an effect size greater than 0.5 and an ML accuracy of over 80%; Rajput also indicated that increasing the sample size beyond this threshold would not significantly improve performance. In this study, the standardized effect size was 1.14, which exceeds the threshold of 0.5, indicating a substantial effect size. Therefore, among the machine learning algorithms used in this study, those demonstrating an accuracy of more than 80% can be considered acceptable in terms of performance.
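A minimal sketch of such a power calculation in R is shown below, assuming a one-sample, two-sided t-test on the postsurgical Pog change with Cohen's d computed from the mean and SD quoted above; the `pwr` package and the test type are our assumptions, not stated in the text.

```r
# Hedged sketch of the sample size calculation (assumptions: one-sample,
# two-sided t-test; d computed as mean/SD of the Pog change from ref. 7).
library(pwr)  # install.packages("pwr") if needed

d <- 1.87 / 2.6  # standardized effect size (Cohen's d), approx. 0.72
pwr.t.test(d = d, sig.level = 0.05, power = 0.9,
           type = "one.sample", alternative = "two.sided")
# The study reports a required minimum of 20 subjects; the exact n returned
# depends on the test type assumed here.
```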
Landmarks and variables used in this study
Figures 1 and 2 illustrate the definitions of the landmarks and the linear and angular variables. Fifty-five linear and angular skeletal and dental surgical movements (T2-T1) were measured, of which 16 were calibrated relative to the horizontal and vertical reference planes for further analysis of linear changes, to assess the magnitude of surgical movement. Postoperative relapse was estimated by measuring Pog movement (T3-T2). Landmark identification and variable measurement were performed by a single operator (YHK).
Intra-examiner reliability assessment
To evaluate intra-examiner reliability, the same investigator (YHK) re-measured all variables in 20 randomly selected subjects one month after the initial measurement. Paired t-tests showed no significant differences between the first and second measurements; therefore, the first set of measurements was used for the subsequent statistical analyses.
Statistical analyses
The normality of the data distribution for each variable was assessed using the Shapiro–Wilk test. Group differences were analyzed using one-way analysis of variance for normally distributed variables and the Kruskal–Wallis test otherwise, followed by Bonferroni post-hoc analysis. Statistical analysis was performed using R version 4.2.2. A significance level of p < 0.05 was set for all statistical tests.
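The sketch below illustrates this testing pipeline in base R; the data frame `dat`, its grouping factor `group`, and the column name `RI_change` are hypothetical placeholders for the study's variables.

```r
# Illustrative sketch of the group comparisons (hypothetical objects:
# data frame `dat` with factor `group` and numeric column `RI_change`).
p_norm <- shapiro.test(dat$RI_change)$p.value   # Shapiro-Wilk normality test

if (p_norm > 0.05) {
  print(summary(aov(RI_change ~ group, data = dat)))       # one-way ANOVA
  print(pairwise.t.test(dat$RI_change, dat$group,
                        p.adjust.method = "bonferroni"))   # Bonferroni post-hoc
} else {
  print(kruskal.test(RI_change ~ group, data = dat))       # Kruskal-Wallis
  print(pairwise.wilcox.test(dat$RI_change, dat$group,
                             p.adjust.method = "bonferroni"))
}
```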
ML algorithms
Six ML approaches were utilized to identify factors contributing to Pog relapse, and their performance was compared to determine the optimal prediction method: classification and regression trees (CART)17, conditional inference tree (CTREE)18, linear discriminant analysis (LDA)19, support vector machine (SVM)20, KNN21, and RF22. Tenfold cross-validation, repeated ten times, was performed to reduce the variance of the results. K-fold cross-validation is an effective resampling technique for mitigating overfitting in machine learning models, particularly when data samples are limited23. The dataset is partitioned into k equally sized subsets, or "folds"; the model is then trained and tested k times, each time holding out one fold as the test set and training on the remaining k − 1 folds24,25. Because the model is repeatedly evaluated on different data partitions, this procedure reduces the risk of overfitting, in which a model becomes too specialized to the training data and performs poorly on new, unseen data, and it indicates whether the model generalizes well across various data distributions. It also provides a more reliable estimate of performance metrics, such as accuracy, precision, recall, and F1 score, than a single train-test split, and it aids in optimizing hyperparameters and selecting the model architecture that generalizes best. The training and testing sets consisted of 207 and 20 samples, respectively.
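The resampling and training setup can be sketched with the 'caret' package reported in the footnote of Table 1; the data frame `train_df` and outcome column `relapse` are illustrative names, and only three of the six models are shown.

```r
# Sketch of the cross-validated training described above (R 4.2.2, caret).
# `train_df`: 207 rows, 55 surgical-movement predictors, outcome `relapse`
# with three classes (N / S / HS); names are assumptions for illustration.
library(caret)

set.seed(42)
ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 10)

rf_fit    <- train(relapse ~ ., data = train_df, method = "rf",
                   preProcess = c("center", "scale"), trControl = ctrl)
cart_fit  <- train(relapse ~ ., data = train_df, method = "rpart",
                   preProcess = c("center", "scale"), trControl = ctrl)
ctree_fit <- train(relapse ~ ., data = train_df, method = "ctree",
                   preProcess = c("center", "scale"), trControl = ctrl)

# Compare the resampling distributions and test pairwise differences,
# which is how a table like Table 2 can be produced.
res <- resamples(list(RF = rf_fit, CART = cart_fit, CTREE = ctree_fit))
summary(res)
summary(diff(res))   # estimates above the diagonal, p-values below
```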
Metrics
The evaluation metrics included accuracy, kappa, mean balanced accuracy, mean F1 score, mean precision, mean recall, mean sensitivity, and mean specificity.
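In caret, these per-class statistics can be obtained from a confusion matrix on held-out predictions, as in the sketch below (reusing the assumed `rf_fit` from the previous sketch and a hypothetical `test_df`).

```r
# Sketch: per-class metrics on the held-out test set via caret.
pred <- predict(rf_fit, newdata = test_df)
confusionMatrix(pred, test_df$relapse, mode = "everything")
# mode = "everything" adds precision, recall, and F1 to the per-class output.
```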
Ethics declaration
The study design followed the principles of the Declaration of Helsinki. This retrospective case–control study was reviewed and approved by the Institutional Review Boards of Seoul National University Dental Hospital (IRB no. ERI20022) and Ajou University Hospital (IRB no. AJIRB-MED-MDB-19–039). The IRB committees of both institutions waived the requirement for obtaining patient consent.
Results
Based on previous studies2–4, the training set was classified into three subgroups according to the rank of relapse: highly significant (HS, n = 19), defined as more than 4 mm of relapse; significant (S, n = 62), defined as relapse between 2 and 4 mm; and insignificant (N, n = 126), defined as less than 2 mm. Relapse was evaluated as the change in the position of Pog between T2 (after surgery) and T3 (debonding). The differences in cephalometric variables among the three groups in the training set (n = 207) are presented in Supplementary Table 2. Bjork sum, articular angle, gonial angle, lower anterior-posterior height ratio (ANS-Me/N-Me), FMA, SN to MP, SNA, FM_UOP, and A-point to the vertical reference plane (VRP) displayed statistically significant differences (Fig. 1).

The metric evaluation among the ML models is shown in Fig. 2 and summarized in Table 1; values close to 1.0 indicate a higher prediction level. The significance of the differences between the metric distributions of the ML algorithms is shown in Table 2, in which each number above the diagonal indicates the estimated difference between two algorithms and each number below the diagonal the corresponding p-value. For example, for accuracy, the mean difference between CART and CTREE was 0.008, obtained by subtracting their means in Table 1 (|0.966 − 0.958| = 0.008). In general, RF presented the most significant differences. The performance metrics of the ML algorithms in the testing set (n = 20) are compared in Table 3. CART, CTREE, and RF displayed the best prediction results. For example, RF correctly predicted a surgical relapse of the sagittal chin point (Pog) of more than 2 mm in 95% (19/20) of cases, and when the classification between HS and S was also considered, 90% (18/20) of predictions matched the actual outcomes (Supplementary Table 2).

In RF, "VarImp" stands for "variable importance," a measure of the relative contribution of each predictor variable to the RF model. The six most important variables were RI, articular angle, Bjork sum, gonial angle, SN to MP, and FMA (Supplementary Fig. 1). Although RF predicted the rank of relapse and identified critical variables, quantitative cutoff points can be obtained from decision-tree models, which also visualize the prediction process so that it is easy to understand (CTREE, Fig. 3a; CART, Fig. 3b). In Fig. 3a, the CTREE prediction model is illustrated. The first step was evaluating the amount of CW rotation of the ramus to predict the Pog relapse rank (N, S, or HS): no significant relapse was forecast if the rotation was less than 1.86° (− 1.86 in the tree). When more than 1.86° of CW rotation occurred during surgery, the next step was to evaluate whether the rotation exceeded 3.72°. The third step determined whether the articular angle changed by more than 9.25° in the same direction; if so, the next step assessed the increased vertical position of point A (Apoint_y), and an HS relapse was anticipated if it exceeded 1.12 mm. CART (Fig. 3b) likewise revealed that CW rotation of the ramus, with cutoff points of 1.8° and 3.7°, was essential for forecasting the relapse rank.
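For illustration, the decision path of Fig. 3a can be restated as explicit logic, as in the hedged sketch below. The thresholds come from the text, but the placement of the articular-angle and point-A splits within the tree is our interpretation, and CW rotation is expressed here as positive degrees; the published tree remains the authoritative structure.

```r
# Hedged restatement of the CTREE decision path (Fig. 3a) as plain R logic.
# Thresholds are from the text; branch placement is an interpretation.
predict_relapse_rank <- function(ri_cw, articular_cw, apoint_y) {
  if (ri_cw < 1.86) return("N")     # < 1.86 deg CW rotation: insignificant
  if (ri_cw >= 3.72) return("HS")   # >= 3.72 deg CW rotation: highly significant
  # 1.86-3.72 deg: articular angle and point-A vertical change decide HS vs. S
  if (articular_cw > 9.25 && apoint_y > 1.12) return("HS")
  "S"                               # otherwise: significant (2-4 mm) relapse
}

predict_relapse_rank(ri_cw = 2.5, articular_cw = 4.0, apoint_y = 0.5)  # "S"
```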
Table 1.
Metric | Model | Min | 1st Qu | Median | Mean | 3rd Qu | Max | NA's
---|---|---|---|---|---|---|---|---
Accuracy | CART | 0.850 | 0.950 | 0.955 | 0.966 | 1.000 | 1.000 | 0 |
Accuracy | CTREE | 0.818 | 0.950 | 0.952 | 0.958 | 1.000 | 1.000 | 0 |
Accuracy | LDA | 0.650 | 0.818 | 0.857 | 0.860 | 0.905 | 1.000 | 0 |
Accuracy | SVM | 0.800 | 0.900 | 0.905 | 0.912 | 0.951 | 1.000 | 0 |
Accuracy | KNN | 0.571 | 0.724 | 0.773 | 0.776 | 0.818 | 0.950 | 0 |
Accuracy | RF | 0.850 | 0.952 | 1.000 | 0.974 | 1.000 | 1.000 | 0 |
Kappa | CART | 0.752 | 0.911 | 0.918 | 0.939 | 1.000 | 1.000 | 0 |
Kappa | CTREE | 0.648 | 0.906 | 0.914 | 0.923 | 1.000 | 1.000 | 0 |
Kappa | LDA | 0.426 | 0.667 | 0.732 | 0.740 | 0.820 | 1.000 | 0 |
Kappa | SVM | 0.592 | 0.801 | 0.821 | 0.832 | 0.907 | 1.000 | 0 |
Kappa | KNN | 0.050 | 0.405 | 0.521 | 0.521 | 0.634 | 0.906 | 0 |
Kappa | RF | 0.752 | 0.912 | 1.000 | 0.953 | 1.000 | 1.000 | 0 |
Mean_Balanced_Accuracy | CART | 0.849 | 0.967 | 0.978 | 0.975 | 1.000 | 1.000 | 0 |
Mean_Balanced_Accuracy | CTREE | 0.755 | 0.951 | 0.977 | 0.964 | 1.000 | 1.000 | 0 |
Mean_Balanced_Accuracy | LDA | 0.690 | 0.809 | 0.855 | 0.851 | 0.905 | 1.000 | 0 |
Mean_Balanced_Accuracy | SVM | 0.712 | 0.810 | 0.868 | 0.865 | 0.906 | 1.000 | 0 |
Mean_Balanced_Accuracy | KNN | 0.513 | 0.634 | 0.683 | 0.692 | 0.739 | 0.905 | 0 |
Mean_Balanced_Accuracy | RF | 0.801 | 0.973 | 1.000 | 0.977 | 1.000 | 1.000 | 0 |
Mean_F1 | CART | 0.753 | 0.907 | 0.964 | 0.949 | 1.000 | 1.000 | 0 |
Mean_F1 | CTREE | 0.722 | 0.881 | 0.961 | 0.945 | 1.000 | 1.000 | 3 |
Mean_F1 | LDA | 0.563 | 0.730 | 0.796 | 0.809 | 0.864 | 1.000 | 20 |
Mean_F1 | SVM | 0.694 | 0.828 | 0.863 | 0.876 | 0.957 | 1.000 | 39 |
Mean_F1 | KNN | 0.686 | 0.741 | 0.785 | 0.783 | 0.820 | 0.863 | 87 |
Mean_F1 | RF | 0.786 | 0.919 | 1.000 | 0.956 | 1.000 | 1.000 | 1 |
Mean_Precision | CART | 0.750 | 0.889 | 0.958 | 0.948 | 1.000 | 1.000 | 0 |
Mean_Precision | CTREE | 0.556 | 0.889 | 0.958 | 0.939 | 1.000 | 1.000 | 1 |
Mean_Precision | LDA | 0.498 | 0.733 | 0.838 | 0.806 | 0.917 | 1.000 | 12 |
Mean_Precision | SVM | 0.542 | 0.917 | 0.952 | 0.912 | 0.956 | 1.000 | 35 |
Mean_Precision | KNN | 0.795 | 0.871 | 0.922 | 0.903 | 0.938 | 0.956 | 87 |
Mean_Precision | RF | 0.786 | 0.889 | 1.000 | 0.956 | 1.000 | 1.000 | 1 |
Mean_Recall | CART | 0.760 | 0.951 | 0.974 | 0.964 | 1.000 | 1.000 | 0 |
Mean_Recall | CTREE | 0.593 | 0.944 | 0.974 | 0.949 | 1.000 | 1.000 | 0 |
Mean_Recall | LDA | 0.504 | 0.712 | 0.778 | 0.774 | 0.889 | 1.000 | 0 |
Mean_Recall | SVM | 0.556 | 0.667 | 0.786 | 0.775 | 0.833 | 1.000 | 0 |
Mean_Recall | KNN | 0.333 | 0.474 | 0.529 | 0.546 | 0.598 | 0.833 | 0 |
Mean_Recall | RF | 0.667 | 0.967 | 1.000 | 0.966 | 1.000 | 1.000 | 0 |
Mean_Sensitivity | CART | 0.760 | 0.951 | 0.974 | 0.964 | 1.000 | 1.000 | 0 |
Mean_Sensitivity | CTREE | 0.593 | 0.944 | 0.974 | 0.949 | 1.000 | 1.000 | 0 |
Mean_Sensitivity | LDA | 0.504 | 0.712 | 0.778 | 0.774 | 0.889 | 1.000 | 0 |
Mean_Sensitivity | SVM | 0.556 | 0.667 | 0.786 | 0.775 | 0.833 | 1.000 | 0 |
Mean_Sensitivity | KNN | 0.333 | 0.474 | 0.529 | 0.546 | 0.598 | 0.833 | 0 |
Mean_Sensitivity | RF | 0.667 | 0.967 | 1.000 | 0.966 | 1.000 | 1.000 | 0 |
Mean_Specificity | CART | 0.939 | 0.978 | 0.983 | 0.985 | 1.000 | 1.000 | 0 |
Mean_Specificity | CTREE | 0.867 | 0.964 | 0.982 | 0.979 | 1.000 | 1.000 | 0 |
Mean_Specificity | LDA | 0.821 | 0.909 | 0.933 | 0.928 | 0.956 | 1.000 | 0 |
Mean_Specificity | SVM | 0.869 | 0.935 | 0.956 | 0.954 | 0.976 | 1.000 | 0 |
Mean_Specificity | KNN | 0.693 | 0.792 | 0.841 | 0.839 | 0.879 | 0.976 | 0 |
Mean_Specificity | RF | 0.935 | 0.981 | 1.000 | 0.989 | 1.000 | 1.000 | 0 |
Pre-processing: centered (55), scaled (55), Resampling: Cross-Validated (tenfold, repeated 10 times).
CART, classification and regression trees (complexity parameter = 0.176); CTREE, conditional inference tree (mincriterion = 0.9); LDA, linear discriminant analysis; SVM, support vector machine (sigma = 0.01225348, C = 2); KNN, K-nearest neighbors (k = 13); RF, random forest (mtry = 28). Computed in R 4.2.2 with package 'caret'; 207 samples, 55 predictors, 3 classes: HS, highly significant relapse; S, significant relapse; N, no significant relapse.
Table 2.
Metric | Model | CART | CTREE | LDA | SVM | KNN | RF
---|---|---|---|---|---|---|---
Accuracy | CART | | 0.008 | 0.106 | 0.054 | 0.190 | − 0.008 |
Accuracy | CTREE | 0.004 | | 0.098 | 0.046 | 0.182 | − 0.016 |
Accuracy | LDA | < 0.001 | < 0.001 | | − 0.052 | 0.084 | − 0.114 |
Accuracy | SVM | < 0.001 | < 0.001 | < 0.001 | | 0.136 | − 0.062 |
Accuracy | KNN | < 0.001 | < 0.001 | < 0.001 | < 0.001 | | − 0.198 |
Accuracy | RF | 0.005 | < 0.001 | < 0.001 | < 0.001 | < 0.001 | |
Kappa | CART | | 0.016 | 0.199 | 0.107 | 0.418 | − 0.014 |
Kappa | CTREE | 0.005 | | 0.183 | 0.091 | 0.403 | − 0.030 |
Kappa | LDA | < 0.001 | < 0.001 | | − 0.092 | 0.220 | − 0.213 |
Kappa | SVM | < 0.001 | < 0.001 | < 0.001 | | 0.312 | − 0.121 |
Kappa | KNN | < 0.001 | < 0.001 | < 0.001 | < 0.001 | | − 0.433 |
Kappa | RF | 0.009 | < 0.001 | < 0.001 | < 0.001 | < 0.001 | |
Mean_Balanced_Accuracy | CART | | 0.011 | 0.124 | 0.110 | 0.283 | − 0.003 |
Mean_Balanced_Accuracy | CTREE | 0.023 | | 0.113 | 0.099 | 0.272 | − 0.014 |
Mean_Balanced_Accuracy | LDA | < 0.001 | < 0.001 | | − 0.013 | 0.159 | − 0.126 |
Mean_Balanced_Accuracy | SVM | < 0.001 | < 0.001 | 1.000 | | 0.172 | − 0.113 |
Mean_Balanced_Accuracy | KNN | < 0.001 | < 0.001 | < 0.001 | < 0.001 | | − 0.285 |
Mean_Balanced_Accuracy | RF | 1.000 | < 0.001 | < 0.001 | < 0.001 | < 0.001 | |
Mean_F1 | CART | | 0.006 | 0.142 | 0.076 | 0.161 | − 0.007 |
Mean_F1 | CTREE | 0.031 | | 0.137 | 0.074 | 0.158 | − 0.013 |
Mean_F1 | LDA | < 0.001 | < 0.001 | | − 0.069 | 0.028 | − 0.149 |
Mean_F1 | SVM | < 0.001 | < 0.001 | < 0.001 | | 0.090 | − 0.083 |
Mean_F1 | KNN | 0.001 | 0.001 | 1.000 | 0.156 | | − 0.170 |
Mean_F1 | RF | < 0.001 | < 0.001 | < 0.001 | < 0.001 | 0.001 | |
Mean_Precision | CART | | 0.009 | 0.143 | 0.037 | 0.037 | − 0.009 |
Mean_Precision | CTREE | 0.287 | | 0.132 | 0.025 | 0.035 | − 0.018 |
Mean_Precision | LDA | < 0.001 | < 0.001 | | − 0.100 | − 0.093 | − 0.151 |
Mean_Precision | SVM | 0.256 | 1.000 | < 0.001 | | 0.011 | − 0.045 |
Mean_Precision | KNN | 1.000 | 1.000 | 0.173 | 1.000 | | − 0.047 |
Mean_Precision | RF | < 0.001 | 0.001 | < 0.001 | 0.058 | 1.000 | |
Mean_Recall | CART | | 0.016 | 0.190 | 0.189 | 0.419 | − 0.002 |
Mean_Recall | CTREE | 0.063 | | 0.174 | 0.173 | 0.403 | − 0.017 |
Mean_Recall | LDA | < 0.001 | < 0.001 | | − 0.001 | 0.229 | − 0.192 |
Mean_Recall | SVM | < 0.001 | < 0.001 | 1.000 | | 0.230 | − 0.191 |
Mean_Recall | KNN | < 0.001 | < 0.001 | < 0.001 | < 0.001 | | − 0.420 |
Mean_Recall | RF | 1.000 | 0.003 | < 0.001 | < 0.001 | < 0.001 | |
Mean_Sensitivity | CART | | 0.016 | 0.190 | 0.189 | 0.419 | − 0.002 |
Mean_Sensitivity | CTREE | 0.063 | | 0.174 | 0.173 | 0.403 | − 0.017 |
Mean_Sensitivity | LDA | < 0.001 | < 0.001 | | − 0.001 | 0.229 | − 0.192 |
Mean_Sensitivity | SVM | < 0.001 | < 0.001 | 1.000 | | 0.230 | − 0.191 |
Mean_Sensitivity | KNN | < 0.001 | < 0.001 | < 0.001 | < 0.001 | | − 0.420 |
Mean_Sensitivity | RF | 1.000 | 0.003 | < 0.001 | < 0.001 | < 0.001 | |
Mean_Specificity | CART | | 0.006 | 0.057 | 0.031 | 0.147 | − 0.004 |
Mean_Specificity | CTREE | 0.005 | | 0.051 | 0.025 | 0.140 | − 0.010 |
Mean_Specificity | LDA | < 0.001 | < 0.001 | | − 0.026 | 0.089 | − 0.061 |
Mean_Specificity | SVM | < 0.001 | < 0.001 | < 0.001 | | 0.115 | − 0.035 |
Mean_Specificity | KNN | < 0.001 | < 0.001 | < 0.001 | < 0.001 | | − 0.150 |
Mean_Specificity | RF | 0.026 | < 0.001 | < 0.001 | < 0.001 | < 0.001 | |
Significant values are in bold.
p-value adjustment: Bonferroni.
Upper diagonal: estimates of the difference.
Lower diagonal: p-value for H0: difference = 0.
Table 3.
Model (n = 20) | Overall statistics | Class | Sensitivity | Specificity | Precision | Recall | F1 | Prevalence | Detection rate | Detection prevalence | Balanced accuracy
---|---|---|---|---|---|---|---|---|---|---|---
Test_CART | Accuracy: 1.000 (0.832, 1.000) | Class: HS | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 0.150 | 0.150 | 0.150 | 1.000 |
Test_CART | P-value [Acc > NIR]: < 0.001 | Class: N | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 0.550 | 0.550 | 0.550 | 1.000 |
Test_CART | Kappa: 1.000 | Class: S | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 0.300 | 0.300 | 0.300 | 1.000 |
Test_CTREE | Accuracy: 1.000 (0.832, 1.000) | Class: HS | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 0.150 | 0.150 | 0.150 | 1.000 |
Test_CTREE | P-value [Acc > NIR]: < 0.001 | Class: N | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 0.550 | 0.550 | 0.550 | 1.000 |
Test_CTREE | Kappa: 1.000 | Class: S | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 0.300 | 0.300 | 0.300 | 1.000 |
Test_KNN | Accuracy: 0.850 (0.621, 0.968) | Class: HS | 0.000 | 1.000 | NA | 0.000 | NA | 0.150 | 0.000 | 0.000 | 0.500 |
Test_KNN | P-value [Acc > NIR]: 0.005 | Class: N | 1.000 | 0.889 | 0.917 | 1.000 | 0.957 | 0.550 | 0.550 | 0.600 | 0.944 |
Test_KNN | Kappa: 0.727 | Class: S | 1.000 | 0.857 | 0.750 | 1.000 | 0.857 | 0.300 | 0.300 | 0.400 | 0.929 |
Test_LDA | Accuracy: 0.900 (0.683, 0.9877) | Class: HS | 0.667 | 1.000 | 1.000 | 0.667 | 0.800 | 0.150 | 0.100 | 0.100 | 0.833 |
Test_LDA | P-value [Acc > NIR]: < 0.001 | Class: N | 1.000 | 0.889 | 0.917 | 1.000 | 0.957 | 0.550 | 0.550 | 0.600 | 0.944 |
Test_LDA | Kappa: 0.823 | Class: S | 0.833 | 0.929 | 0.833 | 0.833 | 0.833 | 0.300 | 0.250 | 0.300 | 0.881 |
Test_RF | Accuracy: 1.000 (0.832, 1.000) | Class: HS | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 0.150 | 0.150 | 0.150 | 1.000 |
Test_RF | P-value [Acc > NIR]: < 0.001 | Class: N | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 0.550 | 0.550 | 0.550 | 1.000 |
Test_RF | Kappa: 1.000 | Class: S | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 0.300 | 0.300 | 0.300 | 1.000 |
Test_SVM | Accuracy: 0.900 (0.683, 0.9877) | Class: HS | 0.667 | 1.000 | 1.000 | 0.667 | 0.800 | 0.150 | 0.100 | 0.100 | 0.833 |
Test_SVM | P-value [Acc > NIR]: < 0.001 | Class: N | 1.000 | 0.889 | 0.917 | 1.000 | 0.957 | 0.550 | 0.550 | 0.600 | 0.944 |
Test_SVM | Kappa: 0.823 | Class: S | 0.833 | 0.929 | 0.833 | 0.833 | 0.833 | 0.300 | 0.250 | 0.300 | 0.881 |
No information rate: 0.550.
Discussion
This study aimed to predict the stability of the sagittal chin projection (Pog) following 2 J-OGS using ML. The changes in Pog during surgery, between the preoperative (T1) and postoperative (T2) stages, were used to predict the change in Pog at the debonding stage (T3), and ML algorithms were employed to identify the critical factors associated with the surgical stability of Pog. In agreement with earlier research6,7, our study emphasizes the significance of the changes between the pre-operative and post-operative stages as indicators of surgical instability. This supports the idea that rotation of the proximal segment of the mandible in the clockwise (CW) direction during surgery, and in the counterclockwise (CCW) direction during the retention period, is a crucial factor determining the stability of Pog. The application of ML algorithms to predict surgical stability in orthodontics and orthognathic surgery has been gaining interest in recent years. In this context, our study builds upon previous work by Jung9, Etemad10, and Li11, who successfully utilized AI techniques to classify extraction versus non-extraction cases, rank the factors determining extraction, and distinguish between extraction patterns, respectively. The current study expands this research, with comparable performance, by employing ML algorithms to predict Pog stability following 2 J-OGS, which had not been previously explored.
In this study, tenfold cross-validation was used to evaluate the predictive performance of the ML models, and six popular ML algorithms were compared across multiple evaluation metrics. Since the group sizes differed and the HS group was the smallest (n = 19), the mean balanced accuracy, precision, recall, and sensitivity were also investigated to account for the class imbalance. In the current study, "false negative" detection was clinically critical, since the prediction of relapse should not miss patients who will relapse; by contrast, "false positives" in the HS and S groups were not as consequential as false negatives. Therefore, the mean balanced accuracy, precision, recall, and sensitivity, which are useful metrics when the cost of a false-negative prediction is high, were examined (Fig. 2). CART and CTREE performed better than the other models, and RF displayed the best scores; for example, RF exhibited the highest mean balanced accuracy, followed by CART, CTREE, SVM, LDA, and KNN (Table 1). Statistical differences were examined among the ML models (Table 2); for example, the mean balanced accuracy of RF differed from that of all other models except CART. Table 3 presents the testing set results (n = 20), in which RF, CART, and CTREE again exhibited superior performance; the results of these three algorithms were therefore investigated further.
As shown in Supplementary Table 2, RF predicted correctly in 18 of 20 samples. Case number four underwent 4.78 mm of relapse (HS in reality) but was predicted to be in the S group, which was inaccurate yet partially correct in the sense that relapse was detected. Case number five showed 2 mm of backward Pog movement but was also misclassified. A unique feature of RF is that it reveals variable importance (Supplementary Fig. 1); the most important variable affecting the rank prediction was RI, followed by the articular angle, Bjork sum, gonial angle, SN to MP, and FMA. These variables are all related to the vertical increment during surgery, implicating the importance of maintaining the vertical dimension of the mandible. The composition of the decision tree is illustrated in Fig. 3. CTREE forecast that the first and second critical values of RI CW rotation were 1.86° (S) and 3.72° (HS), respectively; the articular angle and Bjork sum were nominated in the next branches, followed by point A vertical and saddle angle increments. The most crucial advantage of decision trees is that they suggest critical values, and comparable cutoffs for RI CW rotation were obtained with CART (Fig. 3b).
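As a sketch, caret exposes this ranking through `varImp()`, reusing the assumed `rf_fit` object from the Methods sketch.

```r
# Sketch: extracting variable importance from the fitted RF model.
imp <- varImp(rf_fit)   # caret wrapper around the randomForest importance
print(imp)
plot(imp, top = 6)      # top six reported: RI, articular angle, Bjork sum,
                        # gonial angle, SN to MP, FMA
```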
The present study has several limitations. The first is potential overfitting of the ML algorithms. Overfitting is a common problem in ML in which a model fits the training data so closely that it memorizes rather than generalizes; such a model performs very well on the training data but poorly on new, unseen data, leading to reduced generalizability and accuracy26. Furthermore, this study collected samples from only two universities, with two oral surgeons performing the operations and two orthodontists performing the orthodontic treatment. Considering differences in treatment plans, techniques, and ethnic backgrounds, other institutions may obtain different predictions. Nonetheless, it may be more appropriate to make predictions based on data from each institution, given that most institutions likely employ specific surgical techniques and orthodontic mechanics. Another limitation is that Pog was the only measurement; other measurements, such as the maxillary occlusal plane27, the vertical bony step28, and points A and B, should be addressed in future studies.
This study provides valuable insights into the application of ML in predicting Pog stability after 2 J-OGS. The findings indicate that the ML models developed here can accurately predict the relapse of Pog and suggest critical cutoff values for the variables associated with its surgical stability. The clinical implication is that ML applications could be used to identify patients at high risk of surgical relapse and to develop appropriate postoperative management strategies to improve surgical stability. Accurate prediction of Pog relapse could reduce the need for further surgical procedures, thereby reducing treatment cost and duration.
Conclusions
The primary objective of this study was to utilize ML algorithms to predict the stability of the sagittal chin projection (Pog) after 2 J-OGS and to identify the key factors contributing to surgical stability. Changes in Pog with CW rotation of the mandibular ramus during surgery served as indicators of surgical instability. Of the six ML algorithms assessed, RF, CART, and CTREE demonstrated the most robust predictive performance. The study revealed that CW rotation of RI of more than 3.7° and 1.8° was the most significant risk factor for HS (≥ 4 mm) and S (≥ 2 and < 4 mm) Pog relapse, respectively. These findings suggest that ML algorithms, particularly RF and decision-tree models, are practical tools for predicting surgical stability; decision-tree models additionally enable visualization of the prediction process as a tree diagram.
Acknowledgements
This research was supported by a grant from the Korea Health Technology R&D Project through the Korea Health Industry Development Institute, funded by the Ministry of Health & Welfare, Republic of Korea (Grant number: HI18C1638). The authors would like to express their sincere gratitude to Dr. Sungsu Heo for his invaluable contribution to this research through insightful illustrations of the figures presented in this article.
Author contributions
Y.H.K.: collection of data, analysis of data, interpretation of data, construction of manuscript. I.K.: analysis of data, interpretation of data, conception and design of the article. Y.K.: conception and design of the article. M.K.: conception and design of the article. J.C.: conception and design of the article. M.H.: conception and design of the article. K.K.: conception and design of the article. S.L.: conception and design of the article. S.K.: conception and design of the article. N.K.: conception and design of the article. J.W.S.: conception and design of the article. S.S.: conception and design of the article. S.B.: collection of data, conception and design of the article. H.S.C.: conception and design of the article, collection of data, analysis of data, interpretation of data, construction of manuscript.
Data availability
The test set data can be obtained via github (https://github.com/pfChae/The-prediction-of-sagittal-chin-point-relapse-following-double-jaw-surgery-using-machine-learning).
Competing interests
The authors declare no competing interests.
Footnotes
The original online version of this Article was revised: The original version of this Article contained an error in the spelling of the author Minji Kim, which was incorrectly given as Minji Ki.
Publisher's note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Change history
2/2/2024
A Correction to this paper has been published: 10.1038/s41598-024-53035-x
Supplementary Information
The online version contains supplementary material available at 10.1038/s41598-023-44207-2.
References
- 1. Troy BA, Shanker S, Fields HW, Vig K, Johnston W. Comparison of incisor inclination in patients with Class III malocclusion treated with orthognathic surgery or orthodontic camouflage. Am. J. Orthod. Dentofacial Orthop. 2009;135:146.e1–146.e9. doi: 10.1016/j.ajodo.2008.07.012.
- 2. Proffit WR, Turvey TA, Phillips C. Orthognathic surgery: A hierarchy of stability. Int. J. Adult Orthodon. Orthognath. Surg. 1996;11:191–204.
- 3. Proffit WR, Bailey LJ, Phillips C, Turvey TA. Long-term stability of surgical open-bite correction by Le Fort I osteotomy. Angle Orthod. 2000;70:112–117. doi: 10.1043/0003-3219(2000)070<0112:LTSOSO>2.0.CO;2.
- 4. Bailey L, Cevidanes LH, Proffit WR. Stability and predictability of orthognathic surgery. Am. J. Orthod. Dentofac. Orthop. 2004;126:273–277. doi: 10.1016/j.ajodo.2004.06.003.
- 5. Proffit WR, Turvey TA, Phillips C. The hierarchy of stability and predictability in orthognathic surgery with rigid fixation: An update and extension. Head Face Med. 2007;3:21. doi: 10.1186/1746-160X-3-21.
- 6. Cho HJ. Long-term stability of surgical mandibular setback. Angle Orthod. 2007;77:851–856. doi: 10.2319/052306-209.1.
- 7. Al-Delayme R, Al-Khen M, Hamdoon Z, Jerjes W. Skeletal and dental relapses after skeletal class III deformity correction surgery: Single-jaw versus double-jaw procedures. Oral Surg. Oral Med. Oral Pathol. Oral Radiol. 2013;115:466–472. doi: 10.1016/j.oooo.2012.08.443.
- 8. Sarker IH. Deep learning: A comprehensive overview on techniques, taxonomy, applications and research directions. SN Comput. Sci. 2021;2:420. doi: 10.1007/s42979-021-00815-1.
- 9. Jung SK, Kim TW. New approach for the diagnosis of extractions with neural network machine learning. Am. J. Orthod. Dentofacial Orthop. 2016;149:127–133. doi: 10.1016/j.ajodo.2015.07.030.
- 10. Etemad L, et al. Machine learning from clinical data sets of a contemporary decision for orthodontic tooth extraction. Orthod. Craniofac. Res. 2021;24(Suppl 2):193–200. doi: 10.1111/ocr.12502.
- 11. Li P, et al. Orthodontic treatment planning based on artificial neural networks. Sci. Rep. 2019;9:2037. doi: 10.1038/s41598-018-38439-w.
- 12. Fang X, et al. Machine-learning-based detection of degenerative temporomandibular joint diseases using lateral cephalograms. Am. J. Orthod. Dentofacial Orthop. 2023;163:260–271.e5. doi: 10.1016/j.ajodo.2022.10.015.
- 13. Lee KS, Jha N, Kim YJ. Risk factor assessments of temporomandibular disorders via machine learning. Sci. Rep. 2021;11:19802. doi: 10.1038/s41598-021-98837-5.
- 14. Elfanagely O, et al. Machine learning and surgical outcomes prediction: A systematic review. J. Surg. Res. 2021;264:346–361. doi: 10.1016/j.jss.2021.02.045.
- 15. Cohen J. Statistical power analysis. Curr. Dir. Psychol. Sci. 1992;1:98–101. doi: 10.1111/1467-8721.ep10768783.
- 16. Rajput D, Wang W-J, Chen C-C. Evaluation of a decided sample size in machine learning applications. BMC Bioinform. 2023;24:48. doi: 10.1186/s12859-023-05156-9.
- 17. Batra M, Agrawal R. Comparative analysis of decision tree algorithms. In: Panigrahi B, Hoda M, Sharma V, Goel S, editors. Nature Inspired Computing (Advances in Intelligent Systems and Computing). Springer Singapore; 2018. pp. 31–36.
- 18. Hothorn T, Hornik K, Zeileis A. Unbiased recursive partitioning: A conditional inference framework. J. Comput. Graph. Stat. 2006;15:651–674. doi: 10.1198/106186006X133933.
- 19. Wu L, Shen C, van den Hengel A. Deep linear discriminant analysis on Fisher networks: A hybrid architecture for person re-identification. Pattern Recogn. 2017;65:238–250. doi: 10.1016/j.patcog.2016.12.022.
- 20. Schölkopf B. Support Vector Learning. Oldenbourg; 1997.
- 21. Bhatia N, Vandana. Survey of nearest neighbor techniques. Int. J. Comput. Sci. Inf. Secur. 2010;8:302–305.
- 22. Breiman L, Last M, Rice J. Random forests: Finding quasars. In: Statistical Challenges in Astronomy. Springer-Verlag; 2003. pp. 243–254.
- 23. Brodeur ZP, Herman JD, Steinschneider S. Bootstrap aggregation and cross-validation methods to reduce overfitting in reservoir control policy search. Water Resour. Res. 2020;56:e2020WR027184. doi: 10.1029/2020WR027184.
- 24. Nematzadeh Z, Ibrahim R, Selamat A. Comparative studies on breast cancer classifications with k-fold cross validations using machine learning techniques. In: 2015 10th Asian Control Conference (ASCC). IEEE; 2015. pp. 1–6.
- 25. Prusty S, Patnaik S, Dash SK. SKCV: Stratified K-fold cross-validation on ML classifiers for predicting cervical cancer. Front. Nanotechnol. 2022;4:972421. doi: 10.3389/fnano.2022.972421.
- 26. Friedrich S, et al. Is there a role for statistics in artificial intelligence? Adv. Data Anal. Classif. 2021;16:823–846. doi: 10.1007/s11634-021-00455-6.
- 27. Kang SY, et al. Stability of clockwise rotation of the maxillary occlusal plane in skeletal Class III patients treated with two-jaw surgery. Orthod. Craniofac. Res. 2022. doi: 10.1111/ocr.12601.
- 28. Batbold M, et al. Vertical bony step between proximal and distal segments after mandibular setback is related with relapse: A cone-beam computed tomographic study. Am. J. Orthod. Dentofacial Orthop. 2022;161:e524–e533. doi: 10.1016/j.ajodo.2021.10.016.