A novel machine learning model for screening the risk of obstructive sleep apnea using craniofacial photography with questionnaires

June-Young Park; Hye-Rim Shin; Min Hye Kim; Yunsoo Kim; Wi-Sun Ryu; Eun Young Kim; Hyeyeon Chang; Woo-Jin Lee; Jee Hyun Kim; Tae-Joon Kim

doi:10.5664/jcsm.11560

J Clin Sleep Med. 2025 May 1;21(5):843–854. doi: 10.5664/jcsm.11560


PMCID: PMC12048310 PMID: 39815737

Abstract

Study Objectives:

Undiagnosed or untreated moderate-to-severe obstructive sleep apnea (OSA) increases cardiovascular risks and mortality. Early and efficient detection is critical, given its high prevalence. We aimed to develop a practical and efficient approach for OSA screening, using simple facial photography and sleep questionnaires.

Methods:

We retrospectively included 748 participants who completed polysomnography, sleep questionnaires (STOP-BANG), and facial photographs at a university hospital between 2012 and 2023. Participants were categorized into the moderate/severe or no/mild OSA group based on an apnea-hypopnea index threshold of 15 events/h and, owing to class imbalance, were randomly undersampled. Using a validated convolutional neural network, we extracted OSA probability scores from the photographs and used them as an input feature alongside the questionnaire responses. Four machine learning models were trained to classify the moderate/severe vs no/mild groups and were evaluated in the test dataset.

Results:

We analyzed 426 participants (213 each in the moderate/severe and no/mild groups). The mean (standard deviation) age was 44.6 (14.7) years; 80.8% were men. Logistic regression achieved the highest performance: the area under the receiver operating characteristic curve was 97.2%, and accuracy was 91.9%. Adding the OSA probability retrieved from facial photographs to the questionnaires improved performance compared with using questionnaires or photographs alone (area under the receiver operating characteristic curve: 97.2% using both, 85.7% for photographs alone, and 64% and 79.1% for STOP-BANG score thresholds of 3 and 4, respectively).

Conclusions:

Using simple facial photographs and sleep questionnaires, a 2-stage approach (convolutional neural network + machine learning) accurately classified OSA into moderate/severe vs no/mild OSA groups. This method may facilitate optimal OSA treatment and avoid unnecessary costly evaluations.

Citation:

Park J-Y, Shin H-R, Kim MH, et al. A novel machine learning model for screening the risk of obstructive sleep apnea using craniofacial photography with questionnaires. J Clin Sleep Med. 2025;21(5):843–854.

Keywords: obstructive sleep apnea, facial photography, sleep questionnaires, machine learning, screening tool

BRIEF SUMMARY

Current Knowledge/Study Rationale: The high prevalence of obstructive sleep apnea and its association with increased cardiovascular risks and mortality underscore the urgent need for effective early detection methods. This study addressed this need by developing a practical approach that combined simple facial photography with sleep questionnaires using machine learning techniques.

Study Impact: By employing a convolutional neural network to analyze facial photography and integrating it with questionnaire data, this two-step approach enhanced screening accuracy compared to questionnaires or photographs alone. This method not only advances diagnostic efficiency but also has the potential to reduce the need for costly, resource-intensive evaluations, ultimately improving patient outcomes through timely intervention.

INTRODUCTION

Obstructive sleep apnea (OSA) is a sleep-disordered breathing condition characterized by intermittent partial or complete airway obstruction during sleep. OSA causes sleep disturbances, and many studies demonstrate that OSA is a risk factor for comorbidities such as cardiovascular diseases, metabolic disorders, and cognitive impairment.¹ The overall prevalence of OSA ranges from 9% to 38% in the general adult population and is much higher among older adults.² More than 85% of patients with clinically significant OSA are never diagnosed.³ Given the high prevalence of undiagnosed or untreated OSA, early detection is crucial because individuals with severe OSA are at an increased risk of cardiovascular complications and mortality if the condition remains untreated.⁴ Studies indicate that individuals with moderate OSA have a 17% higher risk of all-cause mortality, and those with severe OSA face a 46% higher risk, compared with individuals without OSA.⁵ These findings highlight the importance of timely identification and management of OSA to prevent associated cardiovascular complications and mortality.

The most common measure of OSA severity is the apnea-hypopnea index (AHI), which is the average number of apnea and hypopnea events per hour of sleep. AHI can be measured using either polysomnography (PSG) or a home sleep apnea test.¹,⁶ PSG is a comprehensive and reliable method for diagnosing OSA, although it requires expertise and is costly and time-consuming.⁶ A home sleep apnea test serves as a substitute for PSG but is typically recommended only for patients who are already suspected of having OSA based on clinical indications, so it may not be suitable for OSA screening.⁷ Thus, sleep questionnaires such as the STOP-BANG and Berlin questionnaires have been developed. However, they have low specificity, produce more false-positive results, and fail to exclude individuals who are at low risk of OSA.⁸ Furthermore, their anatomical assessment is typically limited to neck circumference and neglects a comprehensive craniofacial assessment. Craniofacial abnormalities are increasingly recognized as essential risk factors for OSA. Anomalies of the facial skeleton and enlargement of soft tissues increase the possibility of narrowing the upper airway, which can lead to the onset of OSA.⁹ In addition, Asians, including Koreans with OSA, tend to be less obese than those of other ethnicities.¹⁰ Craniofacial anatomical risk factors would contribute more in patients with a low body mass index (BMI), who exhibit more skeletal abnormalities, such as a smaller and retropositioned mandible, a smaller maxilla, and a shorter, steeper anterior cranial base.¹¹ Therefore, a single measurement is insufficient to capture the multifactorial nature of OSA.¹²

Machine learning methods can simultaneously consider multidimensional interactions between variables, eliminating the need to summarize or validate them individually.¹³ Because various variables influence OSA onset, recent studies employ machine learning techniques to enhance the accuracy of risk prediction by leveraging the complex correlations among these variables.¹⁴–¹⁶ However, there is currently no method that reliably detects OSA using sleep questionnaires and readily accessible craniofacial images in machine learning models.

Thus, this study aimed to develop a machine learning model that can effectively detect OSA risk using easily obtainable facial photographs and conventional simple sleep questionnaires. We hypothesized that combining anatomical information from facial images with clinical data from questionnaires would improve performance and yield a highly usable screening tool.

METHODS

Study population

This retrospective cohort study included Korean individuals, aged > 19 years, who underwent PSG at the Dankook University Hospital between 2012 and 2023. Of 2,149 individuals, 748 with complete data on clinical parameters and photographic images were eligible for analysis. The Institutional Review Board of Dankook University Hospital (Cheonan, Republic of Korea; Institutional Review Board no. 2023-03-031) approved the study protocol. Facial photography was conducted on the night of PSG, with consent, as part of the clinical assessment of OSA risk in patients undergoing PSG. In actual clinical settings, many hospitals use craniofacial photography as a clinical indicator for OSA in addition to intraoral examination. The Institutional Review Board approved the waiver of informed consent for this retrospective study because the data were obtained from the hospital database and deidentified prior to analysis. Personal identifiers were removed or altered to prevent any linkage between individuals and their data in compliance with Institutional Review Board procedures. Moreover, the analysts involved in the evaluation were blinded to any personal information about the participants, preventing any possibility of identification during the analysis.

Data preparation

STOP-BANG questionnaires

Before undergoing PSG, participants completed sleep-related questionnaires, including the Berlin Questionnaire, as part of the usual care protocol for sleep-disorder screening. The responses from the Berlin Questionnaire were converted into STOP-BANG scores. Because the Berlin Questionnaire has a complex scoring system and an extensive list of questions, it can be challenging to apply consistently. Therefore, in this study, we chose to use the STOP-BANG scoring system, which is simpler and more accurate for quantifying OSA risk.¹⁷ The STOP-BANG questionnaire comprises 4 self-reported answers [Q1(S): loud snoring; Q2(T): tiredness; Q3(O): being observed to stop breathing during sleep; and Q4(P): high blood pressure] and 4 demographic and anthropometric assessments [Q5(B): BMI > 35 kg/m²; Q6(A): age > 50 years; Q7(N): neck circumference > 40 cm; Q8(G): male sex]. A positive answer was scored as “1.” The total score ranged from 0 to 8; scores ≥ 3 indicated a high-risk group.¹⁸
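The scoring rule described above can be sketched in a few lines of Python. This is a minimal illustration rather than the study's code: the class and field names (`StopBangResponse`, `neck_cm`, and so on) are hypothetical, and only the item definitions and the cutoff of 3 come from the text.

```python
from dataclasses import dataclass


@dataclass
class StopBangResponse:
    """One participant's STOP-BANG inputs (field names are illustrative)."""
    snoring: bool          # Q1(S): loud snoring
    tired: bool            # Q2(T): tiredness
    observed_apnea: bool   # Q3(O): observed to stop breathing during sleep
    hypertension: bool     # Q4(P): high blood pressure
    bmi: float             # kg/m^2, used for Q5(B)
    age: float             # years, used for Q6(A)
    neck_cm: float         # neck circumference in cm, used for Q7(N)
    male: bool             # Q8(G): male sex


def stop_bang_items(r):
    """Return the eight binary items (1 = positive) in Q1..Q8 order."""
    return [
        int(r.snoring),
        int(r.tired),
        int(r.observed_apnea),
        int(r.hypertension),
        int(r.bmi > 35),      # Q5(B): BMI > 35 kg/m^2
        int(r.age > 50),      # Q6(A): age > 50 years
        int(r.neck_cm > 40),  # Q7(N): neck circumference > 40 cm
        int(r.male),
    ]


def stop_bang_score(r):
    """Total score, from 0 to 8."""
    return sum(stop_bang_items(r))


def is_high_risk(r, threshold=3):
    """High-risk flag; the default cutoff of 3 follows the text."""
    return stop_bang_score(r) >= threshold
```

The eight binary items, rather than only the total, are what the later modeling step uses as inputs, which is why `stop_bang_items` is kept separate from `stop_bang_score`.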

PSG

Each participant underwent PSG using the Comet-PLUS PSG system by Grass Technologies (Astro-Med, Inc., West Warwick, Rhode Island) and Embla N7000 (Medcare-Embla, Reykjavik, Iceland). The PSG recorded various parameters, including electroencephalogram, chin and leg electromyogram, electrocardiogram, electrooculogram, oronasal airflow using thermistor and nasal pressure transducer, thoracic and abdominal effort, microphone for snoring, body position, and oxygen saturation. The PSG data were analyzed by trained technicians and sleep specialists following The AASM Manual for the Scoring of Sleep and Associated Events: Rules, Terminology and Technical Specifications, version 2.4.¹⁹ Apnea was defined as a reduction of at least 90% in airflow from baseline for 10 seconds or more, and hypopnea was characterized by a ≥ 30% reduction in airflow lasting at least 10 seconds, associated with either arousal or a decrease in oxygen saturation of ≥ 3%.²⁰

Facial photo

Standardized facial photographs were taken on the same day as the PSG to evaluate craniofacial abnormalities, following the method by Park et al.²¹ Two-dimensional (2D) images capturing the front and lateral views of the patient’s face, including the lower chin and upper neck, were taken with a digital camera (WB350F; Samsung, Seoul, Republic of Korea). Participants were instructed to stand naturally with a straight posture, facing forward for the frontal photograph and then to rotate 90 degrees to the left for the lateral photograph. We exclusively used lateral images due to their ability to provide a 3-dimensional perspective of critical facial features relevant to OSA, such as the jaw, neck, and nose. Although combining frontal and lateral images could offer a more comprehensive view, it poses challenges in image alignment and consistency, which may affect model performance. Lateral photographs are particularly effective for visualizing anatomical features associated with airway obstruction and upper airway anatomy, key factors in OSA assessment.²² By exclusively using lateral images, this streamlined approach simplified the methodology and ensured greater dataset consistency, which is essential for robust and reproducible model development. Additionally, extraneous elements, such as backgrounds, were removed to focus on the facial area. Using MediaPipe,²³ an open-source Python library for facial recognition, we extracted coordinates (top-left, top-right, bottom-left, bottom-right) defining the facial region. These coordinates were used to crop the original images, isolating the side profile. All images were standardized to a resolution of 256 × 256 pixels in RGB format. To address inconsistencies in chin positioning, we applied a fixed-size crop centered around the bounding box, ensuring the entire face, including the chin and neck, was captured and standardized regardless of face orientation. 
Instead of relying on rotation-based adjustments, which often require precise and consistent landmark detection, we employed a convolutional neural network (CNN)-based approach that has demonstrated robustness in handling real-world variability, such as minor inconsistencies in facial positioning, pose, and lighting conditions. This approach effectively balances accuracy and ease of implementation, making it suitable for scalable screening applications.²⁴
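The fixed-size crop centered on the detected bounding box can be sketched as follows. This is an illustrative sketch, not the study's preprocessing code: the function names and the choice to shift (rather than shrink) a window that would leave the image are assumptions. The face detection step and the final resize to 256 × 256 pixels are omitted.

```python
def centered_crop_box(bbox, image_size, crop_size):
    """Return (left, top, right, bottom) of a crop_size x crop_size window.

    bbox is (x_min, y_min, x_max, y_max) of the detected face region and
    image_size is (width, height). The window is centered on the bounding
    box and shifted, if needed, so that it stays inside the image.
    """
    x_min, y_min, x_max, y_max = bbox
    width, height = image_size
    if crop_size > min(width, height):
        raise ValueError("crop_size must fit inside the image")
    center_x = (x_min + x_max) // 2
    center_y = (y_min + y_max) // 2
    left = center_x - crop_size // 2
    top = center_y - crop_size // 2
    # Shift the window back inside the image instead of shrinking it,
    # so every crop has the same size regardless of face position.
    left = max(0, min(left, width - crop_size))
    top = max(0, min(top, height - crop_size))
    return left, top, left + crop_size, top + crop_size


def crop(image, box):
    """Crop a row-major image (a list of pixel rows) to the given box."""
    left, top, right, bottom = box
    return [row[left:right] for row in image[top:bottom]]
```

Because the window size is fixed, a crop chosen generously relative to the face box keeps the chin and upper neck in frame even when the detected box is tight.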

CNN–based facial photo algorithm: OSA-Net

In our previous study, we developed a CNN named OSA-Net, designed to extract anatomical features from lateral craniofacial photograph data.²¹ The network is composed of 6 stages, integrating key components such as the initial feature extraction layer, 2D convolutional layer, batch normalization, activation function, depthwise separable convolutional layer, and Squeeze-and-Excitation block, along with global average pooling, and a fully connected layer. CNNs are typically divided into 2 main parts: feature extraction and classification. In OSA-Net, Stages 1 through 5 handle feature extraction and Stage 6 is responsible for classification, identifying the presence of OSA. The model was trained using 5-fold cross-validation, with each fold running for 100 epochs. The training was conducted using the LazyAdam optimizer, an initial learning rate of 0.001, and a batch size of 16. To mitigate overfitting, callback functions such as Early Stopping and Reduce Learning Rate were applied. Early Stopping halted training if the validation loss did not improve for 20 consecutive epochs, and Reduce Learning Rate decreased the learning rate by 20% if no improvement was observed in the validation loss over 10 epochs. With demonstrated effectiveness (area under the receiver operating characteristic curve [AUROC], 90%; accuracy, 85%), we used the architecture of OSA-Net to pretrain using standardized facial photos from the current dataset. The pretrained CNN model generated probability values for each photograph, which were converted into a novel feature termed the facial photo score using the softmax function. These probabilities represent the likelihood of OSA risk. Finally, a new machine-learning model was trained by integrating the facial photo scores with responses to the eight questions in the STOP-BANG questionnaire.
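The hand-off between the two stages, from CNN output to the 9-feature input of the second model, can be sketched as below. The sketch assumes the CNN emits two raw outputs (no/mild, moderate/severe) per photograph; that output layout, the `positive_index` argument, and the function names are assumptions for illustration only.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of raw network outputs."""
    shift = max(logits)  # subtracting the max avoids overflow in exp
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def facial_photo_score(logits, positive_index=1):
    """Probability assigned to the moderate/severe class (assumed index 1)."""
    return softmax(logits)[positive_index]


def build_feature_vector(logits, stop_bang_items):
    """Concatenate the facial photo score with the eight STOP-BANG items."""
    if len(stop_bang_items) != 8:
        raise ValueError("expected the eight STOP-BANG items Q1..Q8")
    return [facial_photo_score(logits)] + [float(v) for v in stop_bang_items]
```

Each participant is thus represented by one continuous score plus eight binary answers, which is a small enough input that simple classifiers such as logistic regression are a natural fit.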

Training machine learning models

OSA severity was based on the AHI obtained from PSG data, and participants were divided into 2 groups, the no/mild OSA group and the moderate/severe OSA group, using AHI ≥ 15 events/h as the threshold. Data imbalance is a significant issue in classification. When the number of samples for a particular class significantly exceeds that of other classes, the model can overfit by excessively learning the pattern of the majority class. To address this problem, we employed random undersampling (RUS), which randomly discards samples from the majority class (ie, the moderate/severe OSA group) until its size matches that of the minority class (ie, the no/mild OSA group).²⁵ The balanced data from the 2 groups were randomly divided into training (n = 340) and test (n = 86) datasets at a ratio of 8:2, a common practice in machine learning model development for datasets that are not extremely large.²⁶
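Random undersampling followed by an 8:2 split can be sketched with the Python standard library. This is a sketch under stated assumptions: the split is stratified by class here, which the text does not specify, and the seeds and function names are arbitrary.

```python
import random


def random_undersample(samples, labels, seed=0):
    """Randomly keep n_min samples per class, n_min = minority class size."""
    rng = random.Random(seed)
    by_class = {}
    for sample, label in zip(samples, labels):
        by_class.setdefault(label, []).append(sample)
    n_min = min(len(members) for members in by_class.values())
    balanced = []
    for label, members in by_class.items():
        kept = rng.sample(members, n_min)  # sampling without replacement
        balanced.extend((sample, label) for sample in kept)
    rng.shuffle(balanced)
    return balanced


def stratified_split(pairs, test_fraction=0.2, seed=0):
    """Split (sample, label) pairs into train/test, class by class."""
    rng = random.Random(seed)
    by_class = {}
    for sample, label in pairs:
        by_class.setdefault(label, []).append((sample, label))
    train, test = [], []
    for members in by_class.values():
        members = members[:]
        rng.shuffle(members)
        n_test = round(len(members) * test_fraction)
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return train, test
```

With 535 and 213 participants in the two groups, undersampling leaves 213 per group, and a stratified 20% hold-out gives 43 test samples per group, that is, 340 training and 86 test samples, matching the reported dataset sizes.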

We selected 4 common machine learning techniques to build a model predicting OSA risk²⁷,²⁸: logistic regression (LR), support vector machine, random forest, and extreme gradient boosting. During the training phase, each machine learning model was optimized by hyperparameter tuning, using a grid search with leave-one-out cross-validation. We applied leave-one-out cross-validation, a well-known validation method, to reduce overfitting biases in small sample sizes.²⁹ This cross-validation uses 1 sample as the validation set and the remaining samples as the training set. The same round was repeated until all samples were tested. The average accuracy was calculated as the percentage of all left-out samples classified correctly.³⁰ During tuning, the parameter set with the best average accuracy among all parameter combinations was selected for each model.³¹ The final tuned models were validated using a test dataset to evaluate their performance. The AUROC, a commonly used performance metric for determining the best classifier, was used to compare the 4 models.³² The best-performing model was the model with the highest AUROC. The SHapley Additive exPlanations (SHAP) value was calculated for each feature to improve the interpretability of the model. The models were implemented using Python, version 3.8 (Python Software Foundation, Wilmington, Delaware) with Scikit-learn, version 0.24 (Figure 1).³³

AHI = apnea-hypopnea index, AUROC = area under the receiver operating characteristic curve, BN = batch normalization, C = regularization values, CNN = convolutional neural network, Conv = convolutional layer, FC = fully connected layer, GAP = global average pooling, LOO-CV = leave-one-out cross-validation, LR = logistic regression, n_estimators = number of gradient-boosted trees, n_trees = number of classifications and regression trees, RF = random forest, RUS = random undersampling, SHAP = SHapley Additive exPlanations, SVM = support vector machine, XGB = extreme gradient boosting.
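The grid search with leave-one-out cross-validation described in this section can be sketched generically. The study used Scikit-learn; the version below is a dependency-free illustration in which `fit` is any function that trains on the given data and returns a prediction function. The toy k-nearest-neighbour classifier is only a stand-in and is not one of the 4 models used in the study.

```python
from itertools import product


def loo_accuracy(fit, X, y, params):
    """Leave-one-out accuracy: each sample is the validation set once."""
    correct = 0
    for i in range(len(X)):
        X_train = X[:i] + X[i + 1:]
        y_train = y[:i] + y[i + 1:]
        predict = fit(X_train, y_train, **params)
        correct += int(predict(X[i]) == y[i])
    return correct / len(X)


def grid_search_loo(fit, X, y, grid):
    """Return (best_params, best_accuracy) over all grid combinations."""
    names = sorted(grid)
    best_params, best_acc = None, -1.0
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        acc = loo_accuracy(fit, X, y, params)
        if acc > best_acc:  # ties keep the earlier combination
            best_params, best_acc = params, acc
    return best_params, best_acc


def fit_knn(X_train, y_train, k=1):
    """Toy k-nearest-neighbour classifier, used only as a stand-in model."""
    def predict(x):
        order = sorted(
            range(len(X_train)),
            key=lambda j: sum((a - b) ** 2 for a, b in zip(X_train[j], x)),
        )
        votes = [y_train[j] for j in order[:k]]
        return int(sum(votes) * 2 > len(votes))  # majority vote for class 1
    return predict
```

Leave-one-out cross-validation trains one model per sample and per parameter combination, so its cost grows quickly with the grid size; with a few hundred training samples, as here, that cost remains manageable.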

Statistical analysis and model assessment

Participants’ characteristics are presented with descriptive statistics as the mean ± standard deviation for continuous variables or as frequency (percentage) for categorical variables. We compared the moderate/severe OSA group before and after RUS, and the no/mild OSA group with the moderate/severe OSA group. Student’s t test was used to analyze normally distributed continuous data. The Mann–Whitney U test was used to compare nonnormally distributed data. The chi-squared test was conducted to compare categorical data. Statistical significance was set at P < .05 (SAS version 9.4; SAS Institute, Cary, North Carolina). The model performance on the test set was primarily assessed with the AUROC, and additional metrics were derived for a fuller evaluation. The optimal receiver operating characteristic threshold was selected based on Youden’s index.³⁴ The accuracy, sensitivity (ie, recall), specificity, positive predictive value (ie, precision), negative predictive value, and F1 score were evaluated at this threshold. The area under the precision–recall curve was also assessed for prediction performance. This metric shows precision values for the corresponding recall values for different thresholds and is more useful than AUROC when dealing with highly skewed datasets.³⁵ We also report various performance metrics for the trained models, accompanied by 95% confidence intervals estimated using 1,000 bootstrapping resamples.³⁶
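Threshold selection by Youden's index and bootstrap confidence intervals can be sketched as follows. This is an illustration under assumptions: a score at or above the threshold is treated as a positive prediction, the interval is a simple percentile bootstrap, and resamples containing only one class are redrawn. The exact conventions used in the study are not stated in the text.

```python
import random


def sensitivity_specificity(labels, scores, threshold):
    """Sensitivity and specificity when score >= threshold means positive."""
    tp = fp = tn = fn = 0
    for y, s in zip(labels, scores):
        predicted = s >= threshold
        if predicted and y:
            tp += 1
        elif predicted and not y:
            fp += 1
        elif y:
            fn += 1
        else:
            tn += 1
    return tp / (tp + fn), tn / (tn + fp)


def youden_threshold(labels, scores):
    """Threshold maximizing J = sensitivity + specificity - 1."""
    best_t, best_j = None, -1.0
    for t in sorted(set(scores)):
        sens, spec = sensitivity_specificity(labels, scores, t)
        j = sens + spec - 1
        if j > best_j:  # ties keep the lower threshold
            best_t, best_j = t, j
    return best_t, best_j


def bootstrap_ci(labels, scores, metric, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for metric(labels, scores)."""
    rng = random.Random(seed)
    n = len(labels)
    values = []
    while len(values) < n_resamples:
        idx = [rng.randrange(n) for _ in range(n)]  # resample with replacement
        ys = [labels[i] for i in idx]
        if len(set(ys)) < 2:
            continue  # redraw: both classes are needed for most metrics
        values.append(metric(ys, [scores[i] for i in idx]))
    values.sort()
    lower = values[round(alpha / 2 * n_resamples)]
    upper = values[round((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper
```

A typical use is to fix the threshold once with `youden_threshold` and then pass a metric evaluated at that threshold, for example `lambda ys, ss: sensitivity_specificity(ys, ss, t)[0]`, to `bootstrap_ci`.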

RESULTS

Participants’ characteristics

After excluding participants without clinical variables or photographs, an analysis was conducted on 748 participants, comprising 213 individuals with no/mild OSA (28.5%) and 535 with moderate/severe OSA (71.5%). Table 1 shows the characteristics of the eligible participants, after applying RUS, and identifies differences between the no/mild and moderate/severe OSA groups, based on demographic variables (ie, age and sex), anthropometric variables (ie, neck circumference and BMI), and clinical variables (ie, AHI, oxygen desaturation index, total sleep time below 90% oxygen saturation, and STOP-BANG questionnaire). Descriptive statistics of the variables in the moderate/severe OSA group before and after RUS were checked, and no significant differences were observed (Table S1 in the supplemental material). We then used a balanced dataset consisting of 426 individuals to develop a machine learning model. The mean age of the participants was 44.6 years, and most individuals were male (n = 344, 80.8%). The average age, BMI, AHI, neck circumference, STOP-BANG scores, and frequencies of Q1(S), Q3(O), Q4(P), Q5(B), Q6(A), Q7(N), Q8(G) were higher in the moderate/severe OSA group than in the no/mild OSA group. All variables, except the question about tiredness [Q2(T)], showed significant differences between the moderate/severe and no/mild OSA groups.

Table 1.

Characteristics of the analyzed participants.

	Total (n = 426)	No/Mild OSA Group (n = 213)	Moderate/Severe OSA Group (n = 213)	P*
Age (years)	44.6 ± 14.7	39.7 ± 14.7	49.4 ± 13.0	<.001
Sex				<.001
Male	344 (80.8)	156 (73.2)	188 (88.3)
Female	82 (19.2)	57 (26.8)	25 (11.7)
BMI (kg/m²)	25.7 ± 4.1	24.4 ± 3.6	26.9 ± 4.2	<.001
AHI (events/h)	27.4 ± 27.9	6.8 ± 4.4	48.0 ± 26.3	<.001
3% ODI (events/h)	21.0 ± 26.8	4.8 ± 10.3	37.2 ± 28.5	<.001
T90 (% of TST)	8.0 ± 15.7	1.5 ± 6.6	14.5 ± 19.2	<.001
Neck circumference (cm)	38.6 ± 3.6	37.2 ± 3.5	40.0 ± 3.2	<.001
STOP-BANG questionnaire
Q1. Do you snore loudly? (yes)	326 (76.5)	131 (61.5)	195 (91.5)	<.001
Q2. Do you often feel tired, fatigued, or sleepy during daytime? (yes)	368 (86.4)	185 (86.9)	183 (85.9)	.778
Q3. Has anyone observed you stop breathing during your sleep? (yes)	254 (59.6)	83 (39.0)	171 (80.3)	<.001
Q4. Do you have or are you being treated for high blood pressure? (yes)	136 (31.9)	47 (22.1)	89 (41.8)	<.001
Q5. BMI > 35 kg/m²? (yes)	9 (2.1)	0	9 (4.2)	.002
Q6. Age > 50-year-old? (yes)	155 (36.4)	54 (25.4)	101 (47.4)	<.001
Q7. Neck circumference > 40 cm? (yes)	128 (30.0)	30 (14.1)	98 (46.0)	<.001
Q8. Sex male? (yes)	344 (80.8)	156 (73.2)	188 (88.3)	<.001
Total score	4.0 ± 1.5	3.2 ± 1.4	4.9 ± 1.2	<.001
Total score ≥ 3	354 (83.1)	146 (68.5)	208 (97.7)	<.001


Data are presented as the mean ± standard deviation or n (%). *P value between the no/mild and moderate/severe OSA groups. AHI = apnea-hypopnea index, BMI = body mass index, ODI = oxygen desaturation index, T90 = total sleep time below 90% oxygen saturation, TST = total sleep time.

Performance of the machine learning models

Four machine learning models based on different techniques were trained, and each model was validated on the test dataset to evaluate its performance in predicting OSA risk. The performance results of the 4 models are summarized in Table 2. Figure 2A illustrates the AUROCs of the LR, support vector machine, random forest, extreme gradient boosting, STOP-BANG, and facial photo classifiers for comparison. Compared with the other machine learning models in the test dataset, the LR model had the highest AUROC and accuracy values. Therefore, the LR model was selected as the best prediction model for further analysis. The distributions of the STOP-BANG scores, the facial photo scores, and the predicted values from the LR model are depicted in Figure 2B, Figure 2C, and Figure 2D, respectively. STOP-BANG scores were highly concentrated around a score of 4. The distribution of the facial photo scores tended toward a bimodal pattern, with scores clustering around lower and higher values, though the separation was not strongly pronounced. In contrast, the predicted values displayed a clearer division, closely aligning with the binary classes of 0 and 1.

Table 2.

Comparison of model performance predicting OSA risk in different machine learning techniques.

	LR	RF	SVM	XGBoost
AUROC (%)	97.2	96.0	91.6	96.7
(95% CI)	(93.9–99.5)	(91.7–99.2)	(85.7–96.8)	(92.4–99.5)
AUPRC (%)	97.0	95.3	91.9	95.7
(95% CI)	(93.7–99.5)	(91.0–99.1)	(85.1–96.8)	(92.4–99.5)
Sensitivity (%)	93.0	97.7	90.7	95.3
(95% CI)	(85.9–97.6)	(84.5–96.2)	(75.9–90.9)	(84.6–96.3)
Specificity (%)	90.7	83.7	76.7	86.0
(95% CI)	(85.1–97.6)	(84.9–96.3)	(75.7–91.6)	(84.2–96.4)
Accuracy (%)	91.9	90.7	83.7	90.7
(95% CI)	(86.0–97.6)	(84.8–96.3)	(75.4–90.7)	(84.5–96.4)
PPV (%)	90.9	85.7	79.6	87.2
(95% CI)	(85.3–97.1)	(84.6–96.3)	(75.6–90.8)	(84.4–96.5)
NPV (%)	92.9	97.3	89.2	94.9
(95% CI)	(85.3–96.8)	(84.4–96.4)	(76.1–91.1)	(84.5–96.5)
F1 score (%)	92.0	91.3	84.8	91.1
(95% CI)	(85.5–96.8)	(84.3–96.0)	(76.0–91.1)	(83.8–96.4)
Threshold	0.547	0.308	0.298	0.396


The F1 score is calculated by 2 × sensitivity × PPV/(sensitivity + PPV). AUROC = area under receiver operating characteristic, AUPRC = area under the precision–recall curve, CI = confidence interval, LR = logistic regression, NPV = negative predictive value, PPV = positive predictive value, RF = random forest, RUS = random undersampling, SVM = support vector machine, XGBoost = extreme gradient boosting.
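As a worked check of the formulas behind Table 2, the sketch below derives each metric from confusion-matrix counts. The counts used in the example (40 true positives, 3 false negatives, 4 false positives, 39 true negatives) are not reported in the article; they are back-calculated here under the assumption of a 43/43 test split, and they are simply one set of counts that reproduces the rounded point estimates in the LR column.

```python
def metrics_from_counts(tp, fn, fp, tn):
    """Return screening metrics, in percent rounded to 1 decimal place."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    ppv = tp / (tp + fp)
    npv = tn / (tn + fn)
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    # F1 = 2 x sensitivity x PPV / (sensitivity + PPV), as in the footnote.
    f1 = 2 * sensitivity * ppv / (sensitivity + ppv)
    values = {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "ppv": ppv,
        "npv": npv,
        "f1": f1,
    }
    return {name: round(100 * value, 1) for name, value in values.items()}


# Hypothetical counts inferred for illustration (assumed 43/43 test split).
lr_metrics = metrics_from_counts(tp=40, fn=3, fp=4, tn=39)
```

Note that computing F1 from the already rounded percentages (93.0 and 90.9) gives 91.9 rather than 92.0, so small discrepancies of this kind can arise purely from rounding order.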

**(A)** The performance of the screening models built by different machine learning techniques. **(B)** The kernel density plot of STOP-BANG scores. **(C)** The kernel density plot of facial photo scores. **(D)** The kernel density plot of values calculated from the logistic regression model. AUROC = area under the receiver operating characteristic, LR = logistic regression, RF = random forest, SBQ = STOP-BANG questionnaire, SVM = support vector machine, XGBoost = extreme gradient boosting.

Table 3 shows that our developed model stands out among comparable studies that have analyzed similar factors to predict the risk of OSA. Of note, the integration of facial photographs and STOP-BANG questionnaire responses enhanced the discrimination ability of the model, achieving an AUROC of 97.2%. A machine learning model using STOP-BANG responses alone (random forest) and the CNN model using only facial photographs both achieved an AUROC of 85.7%. These results demonstrate the added value of combining both modalities, resulting in a substantial improvement in predictive accuracy. Furthermore, Figure 3A compares the accuracy of the STOP-BANG threshold at 4 and the LR model in predicting OSA risk in individual participants. Participants were categorized into the moderate/severe OSA or no/mild OSA group based on their actual AHI values, and each of the 2 methods then assigned a predicted group to the same person. Among individuals with an AHI < 15 events/h (n = 213), STOP-BANG misclassified 41.3% (n = 88) of individuals into the moderate/severe OSA group, whereas our model misclassified only 10.5% (n = 22). A significant difference in performance between the 2 methods was confirmed using McNemar’s test.³⁷ The McNemar change index [ie, (b − c)²/(b + c)] was 32.4; under the null hypothesis that the 2 methods do not differ in performance, this index is well approximated by a chi-squared distribution with 1 degree of freedom. Because the index exceeded the critical value of 3.841, we rejected the null hypothesis (Figure 3B). Similarly, when we applied the STOP-BANG threshold of 3, the results were consistent with those observed at threshold 4. STOP-BANG misclassified 68.5% (n = 146) of these individuals, a higher rate than that of our model, further emphasizing the model’s superior performance in predicting OSA risk even when the threshold was adjusted (McNemar change index, 63.2) (Figure S2 in the supplemental material).
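The McNemar change index and its comparison with the critical value of 3.841 can be sketched as below. Here b and c denote the two discordant cell counts of the paired contingency table (cases classified correctly by one method but not by the other). The counts in the usage note are hypothetical, because the discordant counts themselves appear only in Figure 3B. The P value uses the identity that, for 1 degree of freedom, the chi-squared upper tail equals erfc(sqrt(x/2)).

```python
import math

# Critical value of the chi-squared distribution, 1 degree of freedom, alpha = .05.
CHI2_CRITICAL_1DF_05 = 3.841


def mcnemar_statistic(b, c):
    """McNemar change index (b - c)^2 / (b + c) from the discordant counts."""
    if b + c == 0:
        raise ValueError("no discordant pairs: the statistic is undefined")
    return (b - c) ** 2 / (b + c)


def chi2_sf_1df(x):
    """Upper-tail probability of a chi-squared variable with 1 df."""
    return math.erfc(math.sqrt(x / 2))


def mcnemar_test(b, c, critical=CHI2_CRITICAL_1DF_05):
    """Return (statistic, p_value, reject_null)."""
    statistic = mcnemar_statistic(b, c)
    return statistic, chi2_sf_1df(statistic), statistic > critical
```

For example, hypothetical discordant counts of b = 30 and c = 10 give an index of 10.0, which exceeds 3.841. Only the discordant pairs enter the statistic; pairs on which both methods agree carry no information about which method is better.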

Table 3.

Evaluation of discriminating performances of OSA risk screening models.

Study	Sample Size (Whole; Test)	Moderate/Severe OSA Definition	Diagnostic Cut-Off Criteria for Labeling	Method	Features Analyzed, (n)	AUROC	AUPRC	Sen	Spe	Acc	PPV	NPV	F1 Score	Threshold
Our model	342; 86	AHI ≥ 15	50%	CNN + LR	Facial photo, (1); SBQ, (8)	97.2%	97.0%	93.0%	90.7%	91.9%	90.9%	92.9%	92.0%	0.547
				CNN	Facial photo, (1)	85.7%	84.6%	79.1%	88.4%	76.7%	87.2%	80.9%	82.9%	0.573
				RF	SBQ, (8)	85.7%	81.1%	78.1%	82.8%	80.5%	82.0%	79.1%	80.0%	0.56
				—	SBQ ≥ 4, (1)	79.1%	71.4%	90.7%	67.4%	79.1%	73.6%	87.9%	81.3%	—
He et al 2021¹⁴	197; —	AHI ≥ 10	74%	LR	Facial photo, (4)	90%	—	85.6%	84.3%	—	—	—	—	—
He et al 2021¹⁴	197; —	AHI ≥ 10	74%	LR	Facial photo, (4); physical measurements, (2)	93%	—	88.4%	86.3%	—	—	—	—	—
Chen et al 2023¹⁵	653; 187	AHI ≥ 15	55.3%	CatBoost	Facial photo, (68); clinical variables, (19)	76%	—	75%	—	71%	72%	—	73%	—
Chen et al 2023¹⁵	653; 187	AHI ≥ 15	55.3%	—	SBQ ≥ 3, (1)	69%	—	—	—	—	—	—	—	—
Remya et al 2017¹⁶	76; —	AHI ≥ 10	68.4%	LR	Facial photo, (28); Anthropometric parameters, (11)	—	—	93.1%	20.0%	74.4%	—	—	—	—
Huo et al 2023⁴⁰	2,357; 1,237	AHI ≥ 15	45.2%	LR	Clinical variables (6)	78%	72%	77%	68%	—	—	—	—	—
Huo et al 2023⁴⁰	2,357; 1,237	AHI ≥ 15	45.2%	—	SBQ ≥ 3, (1)	69%	59%	—	—	—	—	—	—	—
He et al 2022⁴¹	202; 101	AHI ≥ 15	62.3%	LR	Clinical variables (4)	83.7%	—	81.1%	76.0%	81.2%	—	—	—	—
He et al 2022⁴¹	202; 101	AHI ≥ 15	62.3%	—	SBQ ≥ 3, (1)	73.8%	—	—	—	—	—	—	—	—


The F1 score is calculated by 2 × sensitivity × PPV/(sensitivity + PPV). Acc = accuracy, AHI = apnea-hypopnea index, AUROC = area under the receiver operating characteristic, AUPRC = area under the precision–recall curve, CNN = convolutional neural network, LR = logistic regression, NPV = negative predictive value, OSA = obstructive sleep apnea, PPV = positive predictive value, RF = random forest, Sen = sensitivity, Spe = specificity, SBQ = STOP-BANG questionnaire.

**(A)** Visualization plot of individuals in the actual group (ie, no/mild OSA group and moderate/severe OSA group) and the estimated group screened using method 1 and method 2. Red indicates the moderate/severe OSA group and green indicates the no/mild OSA group. **(B)** Contingency table for the McNemar test. AHI = apnea-hypopnea index, OSA = obstructive sleep apnea, SBQ = STOP-BANG questionnaire.

Model interpretability

To further analyze the influence of each input feature on the screening model, we created a summary plot of the SHAP values (Figure 4A). The features are listed in descending order of their contribution to the prediction, and each point represents the SHAP value of a variable for one individual in the test dataset. Lower feature values are drawn in stronger blue and were associated with lower OSA risk; higher feature values are drawn in more intense red and were associated with higher OSA risk. For example, as the facial photo score increased, the likelihood of OSA increased. For the questionnaire items, a positive response was coded as 1 and shown in red; thus, participants who answered positively to Q3 had a higher likelihood of having OSA. Among the independent features, the facial photo score had a major role in identifying OSA risk, followed by Q3(O), Q7(N), Q1(S), Q4(P), Q8(G), Q2(T), Q6(A), and Q5(B). In Figure 4B, the heatmap presents the impact of every feature on OSA prediction for all participants in the test dataset. This heatmap helps reveal the SHAP values behind similar prediction values for different individuals. The heatmap shows that individuals with high prediction values had high SHAP values for the facial photo score.

**(A)** Summary plot of feature importance ranking and positive and negative influence in SHAP. Horizontal position indicates the impact of a feature value on the prediction. The thickness and color denote the sample size and original value for each feature, respectively. The topmost position denotes a large impact on prediction. **(B)** The heatmap plot with the model’s predictions (f(x)) for input features. The function f(x) is the sum of the SHAP values for each instance. The width of the black bar on the right side illustrates the average of the absolute SHAP values of each feature. SHAP = SHapley Additive exPlanations.
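For a linear model such as logistic regression, SHAP values have a simple closed form when features are treated as independent: in log-odds space, the contribution of feature i is its coefficient times the deviation of the feature from its background mean. The sketch below illustrates this form and the ranking by mean absolute SHAP value. It is not the study's code, and the coefficients, background means, and function names are hypothetical.

```python
def linear_shap_values(weights, background_means, x):
    """SHAP values (log-odds scale) of one instance for a linear model.

    Assumes feature independence: phi_i = w_i * (x_i - mean_i).
    """
    return [w * (xi - m) for w, xi, m in zip(weights, x, background_means)]


def base_value(intercept, weights, background_means):
    """Expected model output (log-odds) over the background data."""
    return intercept + sum(w * m for w, m in zip(weights, background_means))


def rank_features(names, shap_rows):
    """Order features by mean absolute SHAP value, largest first."""
    n = len(shap_rows)
    mean_abs = [
        sum(abs(row[i]) for row in shap_rows) / n for i in range(len(names))
    ]
    return sorted(zip(names, mean_abs), key=lambda pair: pair[1], reverse=True)
```

Under this form, the model's log-odds output for an instance equals the base value plus the sum of that instance's SHAP values, which is the additivity property that the heatmap's f(x) reflects.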

DISCUSSION

OSA is a sleep-related disorder associated with numerous chronic comorbidities. However, accurately screening for OSA remains a challenge before patients undergo a time- and labor-intensive diagnostic process, whether PSG, the gold standard for OSA diagnosis, or a home sleep apnea test, which serves as an alternative to PSG. To address this issue, we developed a screening tool that identifies the risk of OSA from 2D facial photographs and questionnaires by using machine learning techniques. The model performed better when the facial images were combined with the questionnaires than when either was used separately. LR was the most suitable machine learning technique, with an AUROC, sensitivity, specificity, and accuracy of 97.2%, 93%, 90.7%, and 91.9%, respectively. Given that facial images and questionnaires can be easily obtained, our model, derived from artificial intelligence methodologies, can provide a screening tool that is accurate and easy to use.
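For readers less familiar with these metrics, the following minimal sketch shows how AUROC, sensitivity, specificity, and accuracy can be computed from predicted probabilities. The labels and scores are made-up toy values, and the 0.5 decision threshold is an assumption for illustration; neither comes from the study.

```python
def auroc(labels, scores):
    """AUROC as the probability that a random positive case scores higher
    than a random negative case (ties count as one half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg
    )
    return wins / (len(pos) * len(neg))


def confusion_metrics(labels, scores, threshold=0.5):
    """Sensitivity, specificity, and accuracy at a fixed probability threshold."""
    tp = fn = tn = fp = 0
    for y, s in zip(labels, scores):
        predicted = 1 if s >= threshold else 0
        if y == 1 and predicted == 1:
            tp += 1
        elif y == 1:
            fn += 1
        elif predicted == 0:
            tn += 1
        else:
            fp += 1
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / len(labels),
    }


# Toy data: 1 = moderate/severe OSA, 0 = no/mild OSA (hypothetical values).
labels = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.6, 0.4, 0.7, 0.3, 0.2, 0.1]

auc = auroc(labels, scores)
metrics = confusion_metrics(labels, scores)
```

AUROC does not depend on a threshold, whereas sensitivity, specificity, and accuracy change with the chosen cutoff.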

Our model outperformed the widely used STOP-BANG classifier in predicting OSA risk, whether the classifier applied a threshold of 3, 4, or 5 points or was rebuilt as a machine learning model based on the individual components of the STOP-BANG questionnaire (Table S3 in the supplemental material). Although the STOP-BANG questionnaire was originally designed to predict AHI > 5 events/h with a threshold of 3, our findings in Table S3 indicate that a threshold of 4 is more appropriate for predicting AHI > 15 events/h. This supports its application in identifying patients at higher risk for moderate-to-severe OSA. The STOP-BANG, which detects OSA risk at a cutoff score of 3 or higher, is known for its high sensitivity.⁸ We obtained a consistent result (90.7%). However, despite the high sensitivity, the STOP-BANG more often misclassified individuals in the no/mild OSA group as belonging to the moderate/severe OSA group than our model did. This tendency to overestimate may result in patients undergoing unnecessary treatments or incurring additional expenses. Hence, screening tools should be evaluated with regard to both underestimation and overestimation of risk.
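The trade-off between cutoffs described above can be sketched as follows. The STOP-BANG score is the number of positive answers among the 8 items, and a participant is flagged as high risk when the score reaches the cutoff. The participant records below are invented for illustration and do not reproduce Table S3.

```python
ITEMS = ["S", "T", "O", "P", "B", "A", "N", "G"]


def stop_bang_score(answers):
    """Count positive (True) answers across the 8 STOP-BANG items."""
    return sum(1 for item in ITEMS if answers[item])


def sensitivity_specificity(records, cutoff):
    """records: list of (score, has_moderate_severe_osa) pairs."""
    tp = sum(1 for score, osa in records if osa and score >= cutoff)
    fn = sum(1 for score, osa in records if osa and score < cutoff)
    tn = sum(1 for score, osa in records if not osa and score < cutoff)
    fp = sum(1 for score, osa in records if not osa and score >= cutoff)
    return tp / (tp + fn), tn / (tn + fp)


# One hypothetical participant: snoring, observed apnea, large neck, male.
example = {"S": True, "T": False, "O": True, "P": False,
           "B": False, "A": False, "N": True, "G": True}
example_score = stop_bang_score(example)

# Hypothetical cohort of (score, moderate/severe OSA on PSG) pairs.
records = [(6, True), (5, True), (4, True), (3, True), (2, True),
           (4, False), (3, False), (3, False), (2, False), (1, False)]

# Raising the cutoff lowers sensitivity and raises specificity.
results = {c: sensitivity_specificity(records, c) for c in (3, 4, 5)}
```

In this toy cohort, a cutoff of 3 catches most true cases but flags many no/mild participants, which is the overestimation pattern discussed above.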

The improved performance of our model underscores the significant role of craniofacial information in predicting the risk of OSA. SHAP plots revealed that the photograph contributed the most to screening OSA risks, with higher SHAP values corresponding to higher prediction values. Specifically, the facial photo score exhibited the highest SHAP value of 2.158, far surpassing other features such as Q3(O) (0.502), Q7(N) (0.370), and Q1(S) (0.344). This emphasizes the dominant influence of the facial photo score in model decision-making. By contrast, the contribution of Q5(B) to OSA prediction was the lowest. Obesity is often reflected in face and neck images, potentially allowing the SHAP value of the facial photograph to dominate and thus diminishing the significance of Q5(B). However, craniofacial anatomical factors may be more crucial than BMI in screening for OSA risk. Nonetheless, ethnicity and obesity likely influenced the results. In our study, the BMI was higher in the moderate/severe OSA group than in the no/mild OSA group (26.9 vs 24.4). Only a small percentage of individuals in the moderate/severe OSA group (4.2%) had a BMI above the STOP-BANG cutoff of 35. Other studies³⁸ have reported that the BMI range for OSA in non-Korean populations is typically higher. This suggests that some individuals with OSA in Korea may not exhibit high BMI. In such cases, craniofacial anatomical risk factors may play a more significant role than obesity in the development of OSA.¹¹ Some researchers have consequently advocated lowering the Q5(B) cutoff to 30 for Koreans.³⁹ However, our study used the standard STOP-BANG cutoff, which may have reduced the accuracy of the Q5(B) classification.

Our study employed a CNN-based photo score derived from facial images alongside the STOP-BANG questionnaire to predict OSA, achieving an AUROC of 97.2%, outperforming similar models in Table 3. Similar studies, such as those by He et al¹⁴ and Remya et al,¹⁶ used facial photos and craniofacial measurements; however, their approaches relied on specific distance measurements that captured only isolated aspects of facial anatomy. In contrast, our model extracted a comprehensive set of anatomical features using CNNs, enabling a holistic analysis of facial structures, which likely contributed to its superior predictive accuracy. Other studies, including those by Huo et al⁴⁰ and He et al,⁴¹ relied on clinical variables such as age, BMI, and neck circumference. Although these models achieved AUROC scores of 78% and 83.7%, respectively, the absence of anatomical data limited their predictive power. The incorporation of CNN-based anatomical features in our study provided a more nuanced and accurate assessment of OSA risk.

Several methods have been employed to obtain craniofacial information, such as computed tomography,⁴² lateral cephalometry (X-ray),⁴³ and the distance values of facial landmarks from 3-dimensional photographs.⁴⁴ However, these methods are challenging to implement in practical situations. In contrast, our model requires only a single 2D image that can be easily captured with a digital camera or phone camera. Furthermore, the facial photographs were taken as part of the usual care protocol for OSA screening rather than under controlled research conditions, which may introduce some variability in positioning. Nevertheless, this approach reflects real-world clinical practice and may enhance practical applicability in everyday settings, including nonhospital and public contexts.

Unlike Chen et al,¹⁵ who used a high-dimensional feature set combining 68 facial and 19 clinical variables but achieved a much lower AUROC of 76%, our model balanced simplicity and complexity effectively. By integrating the STOP-BANG questionnaire—a validated and simplified clinical tool—and leveraging advanced computational methods such as CNNs for comprehensive facial feature extraction, our model achieved superior performance with fewer variables. This finding indicates that extensive use of features does not necessarily translate to better performance, particularly when overfitting is a risk.

A previous study²¹ demonstrated high performance (AUROC, 90%; accuracy, 85%) in predicting OSA risk based solely on photo information. When this photo-only prediction model was applied to the current test dataset, it achieved an AUROC of 85.7% and an accuracy of 76.7%. Although the photo-only model outperformed the STOP-BANG classifiers, our current model surpassed these results, further enhancing screening accuracy. This finding highlights the importance of incorporating data gathered from questionnaires (such as observed stopped breathing, neck circumference, loud snoring, hypertension, sex, and age) to attain maximum accuracy. Although various studies have proposed models using either facial images⁹,¹⁴–¹⁶ or questionnaires,¹⁸,²⁸ each modality alone is limited in capturing both anatomical and clinical information. To bridge this gap, we combined these 2 data sources through a 2-stage approach focused on simplicity and clarity. This method reduces computational complexity while enhancing interpretability and practicality for clinical application.⁴⁵
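The 2-stage idea can be sketched in a few lines: stage 1 produces a single photo-based probability, and stage 2 feeds that score together with the 8 binary questionnaire answers into a logistic regression. In the sketch below, the stage 1 CNN is replaced by precomputed, hypothetical photo scores, the participants are invented, and the logistic regression is a plain gradient-descent implementation rather than the study's actual software or hyperparameters.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(X, y, lr=0.3, epochs=3000):
    """Fit logistic regression by batch gradient descent on the mean log loss."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) - yi
            grad_b += err
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict_proba(w, b, x):
    return sigmoid(b + sum(wj * xj for wj, xj in zip(w, x)))


# Stage 1 (stand-in): photo scores a pretrained CNN might output (hypothetical).
# Each tuple: (photo_score, [Q1(S)..Q8(G)] answers, 1 = moderate/severe OSA).
participants = [
    (0.92, [1, 1, 1, 1, 0, 1, 1, 1], 1),
    (0.81, [1, 0, 1, 0, 0, 1, 1, 1], 1),
    (0.67, [1, 1, 0, 1, 0, 0, 1, 1], 1),
    (0.35, [1, 0, 0, 0, 0, 0, 0, 1], 0),
    (0.22, [0, 1, 0, 0, 0, 0, 0, 0], 0),
    (0.10, [0, 0, 0, 0, 0, 1, 0, 0], 0),
]

# Stage 2: concatenate the photo score with the questionnaire answers.
X = [[photo] + answers for photo, answers, _ in participants]
y = [label for _, _, label in participants]
w, b = fit_logistic(X, y)

train_pred = [int(predict_proba(w, b, x) >= 0.5) for x in X]
new_risk = predict_proba(w, b, [0.75, 1, 0, 1, 0, 0, 0, 1, 1])
```

Because the second stage sees only 9 inputs, its coefficients remain directly inspectable, which is one way to read the interpretability argument made above.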

Among the various machine learning methods, LR performed best in identifying OSA risk. LR has emerged as the most common machine learning technique for the early diagnosis of OSA.⁴⁶ Its strong performance may be because LR performs better when the underlying linearity assumption holds.⁴⁷ We found significant differences in characteristics between the groups. In addition, LR performs better with moderate sample sizes and a limited set of clinical predictors,⁴⁸ which is why it was a suitable model for our study.

In practical clinical settings, an imbalance can arise between the control and case groups. This imbalance could potentially introduce bias into model development. Many studies predicting OSA have used the RUS method.⁴⁹,⁵⁰ We compared the performance of models developed from the original data with that of models developed after applying RUS at 1:1 and 1:2 ratios. We found that the model performed better on balanced 1:1 data than on the original or 1:2 data (Table S2 in the supplemental material).
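A minimal sketch of RUS at a chosen ratio is shown below. It keeps every member of the smaller group and draws, without replacement, a multiple of that size from the larger group. The group sizes (213 and 535) are derived from the 748 included and 426 analyzed participants on the assumption that no other exclusions applied; the participant IDs, the seed, and the reading of "1:2" as two larger-group members per smaller-group member are also assumptions.

```python
import random


def random_undersample(larger, smaller, ratio=1, seed=0):
    """Keep all of `smaller`; sample ratio * len(smaller) items from `larger`
    without replacement (capped at the size of `larger`)."""
    rng = random.Random(seed)
    n_keep = min(len(larger), ratio * len(smaller))
    return rng.sample(larger, n_keep), list(smaller)


# Hypothetical participant IDs for the two severity groups.
larger_group = [f"L{i:03d}" for i in range(535)]
smaller_group = [f"S{i:03d}" for i in range(213)]

balanced_larger, balanced_smaller = random_undersample(
    larger_group, smaller_group, ratio=1
)
ratio2_larger, ratio2_smaller = random_undersample(
    larger_group, smaller_group, ratio=2
)
```

Fixing the seed makes the sampled subset reproducible, which matters when models built on different ratios are compared.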

The AHI is a commonly used metric that counts apneas and hypopneas per hour of sleep. By contrast, the respiratory disturbance index is calculated by combining the AHI and the respiratory effort-related arousal index. We first developed the model based on each of the 2 indices and found that it performed better when using the AHI. The model developed by using the respiratory disturbance index had an AUROC and accuracy of 94.1% and 88.1%, respectively. In addition, we judged the AHI-based model to be appropriate because the AHI can be measured more clearly and is more universal. We used the total AHI, which combines both supine and lateral measurements. When we considered the supine AHI alone, 16 of the 213 individuals initially categorized as having no/mild OSA (AHI < 15 events/h) had a supine AHI greater than 30 events/h, indicating severe OSA. This represents 7.5% of the no/mild group, suggesting that such misclassification due to a short supine sleep time was relatively infrequent.
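The relationship between these indices can be made concrete with a small worked example. The event counts and sleep times below describe one invented participant with a short supine sleep time; only the 15 events/h cutoff and the 16-of-213 figure come from the text.

```python
def events_per_hour(n_events, hours):
    return n_events / hours


def osa_group(index, cutoff=15.0):
    """Group by the 15 events/h cutoff used for the total AHI in this study."""
    return "moderate/severe" if index >= cutoff else "no/mild"


# Hypothetical night: 1 h supine with 32 events, 5 h lateral with 28 events,
# plus 36 respiratory effort-related arousals (RERAs) over the whole night.
supine_events, supine_hours = 32, 1.0
lateral_events, lateral_hours = 28, 5.0
rera_events = 36

total_hours = supine_hours + lateral_hours
total_ahi = events_per_hour(supine_events + lateral_events, total_hours)
supine_ahi = events_per_hour(supine_events, supine_hours)
rera_index = events_per_hour(rera_events, total_hours)
rdi = total_ahi + rera_index  # respiratory disturbance index

group_by_ahi = osa_group(total_ahi)
group_by_rdi = osa_group(rdi)

# Share of the no/mild group with supine AHI > 30 events/h reported above.
supine_severe_share = round(100 * 16 / 213, 1)
```

Here the total AHI places the participant in the no/mild group even though the supine AHI exceeds 30 events/h, and the respiratory disturbance index crosses the cutoff, which illustrates why the choice of index can change the assigned group.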

This study had several limitations. First, because this was a single-center study with only Korean individuals, the results should be generalized with caution and may only be applicable to a single ethnic group. Second, owing to the small sample size, the final dataset was insufficient for a detailed analysis across various OSA severity thresholds. Third, despite securing a larger dataset than similar studies that use facial photos with clinical variables, our study lacks an extensive independent test dataset. Additionally, although we employed rigorous validation techniques, the performance of the model may still be affected by location-specific factors and biases inherent to a single-center study. Future validation will be conducted as additional prospective datasets are collected. Therefore, further studies with more extensive data from multicenter collections, preferably involving multiple ethnicities, are required.

Our screening tool, which combines 2D photography with various symptoms of OSA, offers a promising approach for investigating distinct clinical OSA phenotypes such as OSA with insomnia or OSA with bruxism. Furthermore, this tool may enable clinicians to explore whether craniofacial anatomical risk factors contribute to severity variables beyond the AHI, including metrics such as the hypoxic burden. Finally, this tool could broaden public access to OSA screening, particularly in nonobese populations.

CONCLUSIONS

In this study, we demonstrated an innovative methodology that integrates facial photographs with sleep questionnaires by using CNN and machine learning techniques. Our results validated this 2-stage approach for accurately detecting individuals at higher risk of OSA. By providing a simple and cost-effective screening solution, this tool can improve OSA management and potentially reduce the need for costly and unnecessary assessments.

DISCLOSURE STATEMENT

All authors have seen and approved this manuscript. Work for this study was performed at Ajou University School of Medicine, Suwon, Republic of Korea and Dankook University Hospital, Cheonan, Republic of Korea. Hye-Rim Shin reports that the study was supported by Dankook University Hospital Research Grant in 2023. Tae-Joon Kim reports that the study was supported by the National Research Foundation of Korea (NRF) grants funded by the Korea government (Ministry of Science and Information and Communication Technology [MSIT]) (NRF-2022R1F1A1076135), Bio-convergence Technology Education Program through the Korea Institute for Advancement Technology (KIAT) funded by the Ministry of Trade, Industry and Energy (No. P0017805), and Clinical-Basic Translational Collaboration Grant by Ajou University Medical Center in 2021. The funding bodies had no role in the conception, design, conduct, interpretation, or analysis of the study or in the approval of the publication. Tae-Joon Kim, Hye-Rim Shin, June-Young Park, Min Hye Kim, and Yunsoo Kim have patent #KR10-2024-0064573 pending to Assignee. The other authors declare no conflicts of interest.

Supplemental Materials

jcsm.11560.sm001.pdf (423.6 KB, pdf)

DOI: 10.5664/jcsm.11560

ACKNOWLEDGMENTS

The authors thank Suk Hyun Lee and Ji Heon Choi, the polysomnography technicians at Dankook University Hospital (Cheonan, Republic of Korea). Author contributions: Conception and design of the study: H-R.S., M.H.K., J.H.K., and T-J.K. Acquisition, analysis, and interpretation of data: J-Y.P., H-R.S., M.H.K., and T-J.K. First drafting of the manuscript: J-Y.P., H-R.S., M.H.K., and Y.K. Critical revisions for important intellectual content: W-S.R., E.Y.K., H.C., W-J.L., J.H.K., and T-J.K.

ABBREVIATIONS

AHI: apnea-hypopnea index
AUROC: area under the receiver operating characteristic curve
BMI: body mass index
CNN: convolutional neural network
LR: logistic regression
OSA: obstructive sleep apnea
PSG: polysomnography
RUS: random undersampling
SHAP: SHapley Additive exPlanations
2D: 2-dimensional

REFERENCES

1. Kapur VK, Auckley DH, Chowdhuri S, et al. Clinical practice guideline for diagnostic testing for adult obstructive sleep apnea: an American Academy of Sleep Medicine clinical practice guideline. J Clin Sleep Med. 2017;13(3):479–504.
2. Senaratna CV, Perret JL, Lodge CJ, et al. Prevalence of obstructive sleep apnea in the general population: a systematic review. Sleep Med Rev. 2017;34:70–81.
3. Kato M, Adachi T, Koshino Y, Somers VK. Obstructive sleep apnea and cardiovascular disease. Circ J. 2009;73(8):1363–1370.
4. Marin JM, Carrizo SJ, Vicente E, Agusti AG. Long-term cardiovascular outcomes in men with obstructive sleep apnoea-hypopnoea with or without treatment with continuous positive airway pressure: an observational study. Lancet. 2005;365(9464):1046–1053.
5. Gottlieb DJ, Yenokyan G, Newman AB, et al. Prospective study of obstructive sleep apnea and incident coronary heart disease and heart failure: the Sleep Heart Health Study. Circulation. 2010;122(4):352–360.
6. Hajipour F, Jozani MJ, Moussavi Z. A comparison of regularized logistic regression and random forest machine learning models for daytime diagnosis of obstructive sleep apnea. Med Biol Eng Comput. 2020;58(10):2517–2529.
7. Rosen IM, Kirsch DB, Chervin RD, et al. Clinical use of a home sleep apnea test: an American Academy of Sleep Medicine position statement. J Clin Sleep Med. 2017;13(10):1205–1207.
8. Alqudah AM, Elwali A, Kupiak B, Hajipour F, Jacobson N, Moussavi Z. Obstructive sleep apnea detection during wakefulness: a comprehensive methodological review. Med Biol Eng Comput. 2024;62(5):1277–1311.
9. Vidigal TA, Haddad FL, Guimaraes TM, et al. Can intraoral and facial photos predict obstructive sleep apnea in the general and clinical population? Sleep. 2024;47(3):zsad307.
10. Tan K. Appropriate body-mass index for Asian populations and its implications for policy and intervention strategies. Lancet. 2004;363(9403):157–163.
11. Sutherland K, Lee RW, Cistulli PA. Obesity and craniofacial structure as risk factors for obstructive sleep apnoea: impact of ethnicity. Respirology. 2012;17(2):213–222.
12. Casale M, Pappacena M, Rinaldi V, Bressi F, Baptista P, Salvinelli F. Obstructive sleep apnea syndrome: from phenotype to genetic basis. Curr Genomics. 2009;10(2):119–126.
13. Holfinger SJ, Lyons MM, Keenan BT, et al. Diagnostic performance of machine learning-derived OSA prediction tools in large clinical and community-based samples. Chest. 2022;161(3):807–817.
14. He S, Li Y, Xu W, et al. The predictive value of photogrammetry for obstructive sleep apnea. J Clin Sleep Med. 2021;17(2):193–202.
15. Chen Q, Liang Z, Wang Q, et al. Self-helped detection of obstructive sleep apnea based on automated facial recognition and machine learning. Sleep Breath. 2023;27(6):2379–2388.
16. Remya KJ, Mathangi K, Mathangi DC, et al. Predictive value of craniofacial and anthropometric measures in obstructive sleep apnea (OSA). Cranio. 2017;35(3):162–167.
17. Boynton G, Vahabzadeh A, Hammoud S, Ruzicka DL, Chervin RD. Validation of the STOP-BANG questionnaire among patients referred for suspected obstructive sleep apnea. J Sleep Disord Treat Care. 2013;2(4):1000121.
18. Chung F, Abdullah HR, Liao P. STOP-Bang questionnaire: a practical approach to screen for obstructive sleep apnea. Chest. 2016;149(3):631–638.
19. Berry RB, Brooks R, Gamaldo C, et al. AASM Scoring Manual updates for 2017 (version 2.4). J Clin Sleep Med. 2017;13(5):665–666.
20. Berry RB, Budhiraja R, Gottlieb DJ, et al. Rules for scoring respiratory events in sleep: update of the 2007 AASM Manual for the Scoring of Sleep and Associated Events. J Clin Sleep Med. 2012;8(5):597–619.
21. Park J-Y, Shin H-R, Kim T-J. OSA-NET: an efficient convolutional neural network for OSA diagnosis screening tool. In: 2023 Twelfth International Conference on Image Processing Theory, Tools and Applications. Piscataway, NJ: IEEE; 2023:1–6.
22. Tan SN, Yang HC, Lim SC. Anatomy and pathophysiology of upper airway obstructive sleep apnoea: review of the current literature. Sleep Med Res. 2021;12(1):1–8.
23. Lugaresi C, Tang J, Nash H, et al. Mediapipe: a framework for building perception pipelines. arXiv. Preprint posted online June 14, 2019.
24. Wu Y, Hassner T, Kim K, Medioni G, Natarajan P. Facial landmark detection with tweaked convolutional neural networks. IEEE Trans Pattern Anal Mach Intell. 2018;40(12):3067–3074.
25. Seiffert C, Khoshgoftaar TM, Van Hulse J, Napolitano A. RUSBoost: a hybrid approach to alleviating class imbalance. IEEE Trans Syst, Man, Cybern A. 2010;40(1):185–197.
26. Raschka S. Model evaluation, model selection, and algorithm selection in machine learning. arXiv. Preprint posted online November 13, 2018.
27. Kim YJ, Jeon JS, Cho S-E, Kim KG, Kang S-G. Prediction models for obstructive sleep apnea in Korean adults using machine learning techniques. Diagnostics. 2021;11(4):612.
28. Ha S, Choi SJ, Lee S, et al. Predicting the risk of sleep disorders using a machine learning–based simple questionnaire: development and validation study. J Med Internet Res. 2023;25:e46520.
29. Molinaro AM, Simon R, Pfeiffer RM. Prediction error estimation: a comparison of resampling methods. Bioinformatics. 2005;21(15):3301–3307.
30. Wong T-T. Performance evaluation of classification algorithms by k-fold and leave-one-out cross validation. Pattern Recognit. 2015;48(9):2839–2846.
31. Xie Y, Zhu C, Zhou W, Li Z, Liu X, Tu M. Evaluation of machine learning methods for formation lithology identification: a comparison of tuning processes and model performances. J Petrol Sci Engin. 2018;160:182–193.
32. Bradley AP. The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognit. 1997;30(7):1145–1159.
33. Pedregosa F, Varoquaux G, Gramfort A, et al. Scikit-Learn: machine learning in Python. J Mach Learn Res. 2011;12:2825–2830.
34. Youden WJ. Index for rating diagnostic tests. Cancer. 1950;3(1):32–35.
35. Davis J, Goadrich M. The relationship between Precision-Recall and ROC curves. In: Proceedings of the 23rd International Conference on Machine Learning. New York: ACM; 2006:233–240.
36. Jung K, Lee J, Gupta V, Cho G. Comparison of bootstrap confidence interval methods for GSCA using a Monte Carlo simulation. Front Psychol. 2019;10:2215.
37. Caronni A, Sciumè L. Is my patient actually getting better? Application of the McNemar test for demonstrating the change at a single subject level. Disabil Rehabil. 2017;39(13):1341–1347.
38. Hnin K, Mukherjee S, Antic NA, et al. The impact of ethnicity on the prevalence and severity of obstructive sleep apnea. Sleep Med Rev. 2018;41:78–86.
39. Ong TH, Raudha S, Fook-Chong S, Lew N, Hsu A. Simplifying STOP-BANG: use of a simple questionnaire to screen for OSA in an Asian population. Sleep Breath. 2010;14(4):371–376.
40. Huo J, Quan SF, Roveda J, Li A. BASH-GN: a new machine learning–derived questionnaire for screening obstructive sleep apnea. Sleep Breath. 2023;27(2):449–457.
41. He S, Li Y, Xu W, Han D. Using clinical data to predict obstructive sleep apnea. J Thorac Dis. 2022;14(2):227–237.
42. Kim J-W, Lee K, Kim HJ, et al. Predicting obstructive sleep apnea based on computed tomography scan using deep learning models. Am J Respir Crit Care Med. 2024;210(2):211–221.
43. Chan H-L, Yuen H-M, Au C-T, Chan KC-C, Li AM, Lui L-M. Classification of childhood obstructive sleep apnea based on X-ray images analysis by quasi-conformal geometry. Pattern Recognit. 2024;152:110454.
44. Eastwood P, Gilani SZ, McArdle N, et al. Predicting sleep apnea from three-dimensional face photography. J Clin Sleep Med. 2020;16(4):493–502.
45. Kline A, Wang H, Li Y, et al. Multimodal machine learning in precision health: a scoping review. NPJ Digit Med. 2022;5(1):171.
46. Ferreira-Santos D, Amorim P, Silva Martins T, Monteiro-Soares M, Pereira Rodrigues P. Enabling early obstructive sleep apnea diagnosis with machine learning: systematic review. J Med Internet Res. 2022;24(9):e39452.
47. Nusinovici S, Tham YC, Yan MYC, et al. Logistic regression was as good as machine learning for predicting major chronic diseases. J Clin Epidemiol. 2020;122:56–69.
48. van der Ploeg T, Austin PC, Steyerberg EW. Modern modelling techniques are data hungry: a simulation study for predicting dichotomous endpoints. BMC Med Res Methodol. 2014;14(1):137.
49. Arslan RS, Ulutas H, Köksal AS, Bakir M, Çiftçi B. End-to end decision support system for sleep apnea detection and apnea-hypopnea index calculation using hybrid feature vector and machine learning. Biocybernet Biomed Engin. 2023;43(4):684–699.
50. Jansri U, Tretriluxana S. Effect of resampling techniques on deep learning model training in sleep apnea classification. In: 2022 International Electrical Engineering Congress. Piscataway, NJ: IEEE; 2022:1–4.

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

Supplemental Materials

jcsm.11560.sm001.pdf^{(423.6KB, pdf)}

DOI: 10.5664/jcsm.11560

[b1] 1. Kapur VK, Auckley DH, Chowdhuri S, et al . Clinical practice guideline for diagnostic testing for adult obstructive sleep apnea: an American Academy of Sleep Medicine clinical practice guideline . J Clin Sleep Med. 2017. ; 13 ( 3 ): 479 – 504 . [DOI] [PMC free article] [PubMed] [Google Scholar]

[b2] 2. Senaratna CV, Perret JL, Lodge CJ, et al . Prevalence of obstructive sleep apnea in the general population: a systematic review . Sleep Med Rev. 2017. ; 34 : 70 – 81 . [DOI] [PubMed] [Google Scholar]

[b3] 3. Kato M, Adachi T, Koshino Y, Somers VK . Obstructive sleep apnea and cardiovascular disease . Circ J. 2009. ; 73 ( 8 ): 1363 – 1370 . [DOI] [PubMed] [Google Scholar]

[b4] 4. Marin JM, Carrizo SJ, Vicente E, Agusti AG . Long-term cardiovascular outcomes in men with obstructive sleep apnoea-hypopnoea with or without treatment with continuous positive airway pressure: an observational study . Lancet. 2005. ; 365 ( 9464 ): 1046 – 1053 . [DOI] [PubMed] [Google Scholar]

[b5] 5. Gottlieb DJ, Yenokyan G, Newman AB, et al . Prospective study of obstructive sleep apnea and incident coronary heart disease and heart failure: the Sleep Heart Health Study . Circulation. 2010. ; 122 ( 4 ): 352 – 360 . [DOI] [PMC free article] [PubMed] [Google Scholar]

[b6] 6. Hajipour F, Jozani MJ, Moussavi Z . A comparison of regularized logistic regression and random forest machine learning models for daytime diagnosis of obstructive sleep apnea . Med Biol Eng Comput. 2020. ; 58 ( 10 ): 2517 – 2529 . [DOI] [PubMed] [Google Scholar]

[b7] 7. Rosen IM, Kirsch DB, Chervin RD, et al . Clinical use of a home sleep apnea test: an American Academy of Sleep Medicine position statement . J Clin Sleep Med. 2017. ; 13 ( 10 ): 1205 – 1207 . [DOI] [PMC free article] [PubMed] [Google Scholar]

[b8] 8. Alqudah AM, Elwali A, Kupiak B, Hajipour F, Jacobson N, Moussavi Z . Obstructive sleep apnea detection during wakefulness: a comprehensive methodological review . Med Biol Eng Comput. 2024. ; 62 ( 5 ): 1277 – 1311 . [DOI] [PMC free article] [PubMed] [Google Scholar]

[b9] 9. Vidigal TA, Haddad FL, Guimaraes TM, et al . Can intraoral and facial photos predict obstructive sleep apnea in the general and clinical population? Sleep. 2024. ; 47 ( 3 ): zsad307 . [DOI] [PubMed] [Google Scholar]

[b10] 10. Tan K . Appropriate body-mass index for Asian populations and its implications for policy and intervention strategies . Lancet. 2004. ; 363 ( 9403 ): 157 – 163 . [DOI] [PubMed] [Google Scholar]

[b11] 11. Sutherland K, Lee RW, Cistulli PA . Obesity and craniofacial structure as risk factors for obstructive sleep apnoea: impact of ethnicity . Respirology. 2012. ; 17 ( 2 ): 213 – 222 . [DOI] [PubMed] [Google Scholar]

[b12] 12. Casale M, Pappacena M, Rinaldi V, Bressi F, Baptista P, Salvinelli F . Obstructive sleep apnea syndrome: from phenotype to genetic basis . Curr Genomics. 2009. ; 10 ( 2 ): 119 – 126 . [DOI] [PMC free article] [PubMed] [Google Scholar]

[b13] 13. Holfinger SJ, Lyons MM, Keenan BT, et al . Diagnostic performance of machine learning-derived OSA prediction tools in large clinical and community-based samples . Chest. 2022. ; 161 ( 3 ): 807 – 817 . [DOI] [PMC free article] [PubMed] [Google Scholar]

[b14] 14. He S, Li Y, Xu W, et al . The predictive value of photogrammetry for obstructive sleep apnea . J Clin Sleep Med. 2021. ; 17 ( 2 ): 193 – 202 . [DOI] [PMC free article] [PubMed] [Google Scholar]

[b15] 15. Chen Q, Liang Z, Wang Q, et al . Self-helped detection of obstructive sleep apnea based on automated facial recognition and machine learning . Sleep Breath. 2023. ; 27 ( 6 ): 2379 – 2388 . [DOI] [PubMed] [Google Scholar]

[b16] 16. Remya KJ, Mathangi K, Mathangi DC, et al . Predictive value of craniofacial and anthropometric measures in obstructive sleep apnea (OSA) . Cranio. 2017. ; 35 ( 3 ): 162 – 167 . [DOI] [PubMed] [Google Scholar]

[b17] 17. Boynton G, Vahabzadeh A, Hammoud S, Ruzicka DL, Chervin RD . Validation of the STOP-BANG questionnaire among patients referred for suspected obstructive sleep apnea . J Sleep Disord Treat Care. 2013. ; 2 ( 4 ): 1000121 . [DOI] [PMC free article] [PubMed] [Google Scholar]

[b18] 18. Chung F, Abdullah HR, Liao P . STOP-Bang questionnaire: a practical approach to screen for obstructive sleep apnea . Chest. 2016. ; 149 ( 3 ): 631 – 638 . [DOI] [PubMed] [Google Scholar]

[b19] 19. Berry RB, Brooks R, Gamaldo C, et al . AASM Scoring Manual updates for 2017 (version 2.4) . J Clin Sleep Med. 2017. ; 13 ( 5 ): 665 – 666 . [DOI] [PMC free article] [PubMed] [Google Scholar]

[b20] 20. Berry RB, Budhiraja R, Gottlieb DJ, et al . Rules for scoring respiratory events in sleep: update of the 2007 AASM Manual for the Scoring of Sleep and Associated Events . J Clin Sleep Med. 2012. ; 8 ( 5 ): 597 – 619 . [DOI] [PMC free article] [PubMed] [Google Scholar]

[b21] 21. Park J-Y, Shin H-R, Kim T-J . OSA-NET: an efficient convolutional neural network for OSA diagnosis screening tool . In: 2023 Twelfth International Conference on Image Processing Theory, Tools and Applications . Piscataway, NJ: : IEEE; ; 2023. : 1 – 6 . [Google Scholar]

[b22] 22. Tan SN, Yang HC, Lim SC . Anatomy and pathophysiology of upper airway obstructive sleep apnoea: review of the current literature . Sleep Med Res. 2021. ; 12 ( 1 ): 1 – 8 . [Google Scholar]

[b23] 23. Lugaresi C, Tang J, Nash H, et al . Mediapipe: a framework for building perception pipelines . arXiv. Preprint posted online June 14, 2019. .

[b24] 24. Wu Y, Hassner T, Kim K, Medioni G, Natarajan P . Facial landmark detection with tweaked convolutional neural networks . IEEE Trans Pattern Anal Mach Intell. 2018. ; 40 ( 12 ): 3067 – 3074 . [DOI] [PubMed] [Google Scholar]

[b25] 25. Seiffert C, Khoshgoftaar TM, Van Hulse J, Napolitano A . RUSBoost: a hybrid approach to alleviating class imbalance . IEEE Trans Syst, Man, Cybern A. 2010. ; 40 ( 1 ): 185 – 197 . [Google Scholar]

[b26] 26. Raschka S . Model evaluation, model selection, and algorithm selection in machine learning . arXiv. Preprint posted online November 13, 2018. .

[b27] 27. Kim YJ, Jeon JS, Cho S-E, Kim KG, Kang S-G . Prediction models for obstructive sleep apnea in korean adults using machine learning techniques . Diagnostics. 2021. ; 11 ( 4 ): 612 . [DOI] [PMC free article] [PubMed] [Google Scholar]

[b28] 28. Ha S, Choi SJ, Lee S, et al . Predicting the risk of sleep disorders using a machine learning–based simple questionnaire: development and validation study . J Med Internet Res. 2023. ; 25 : e46520 . [DOI] [PMC free article] [PubMed] [Google Scholar]

[b29] 29. Molinaro AM, Simon R, Pfeiffer RM . Prediction error estimation: a comparison of resampling methods . Bioinformatics. 2005. ; 21 ( 15 ): 3301 – 3307 . [DOI] [PubMed] [Google Scholar]

[b30] 30. Wong T-T . Performance evaluation of classification algorithms by k-fold and leave-one-out cross validation . Pattern Recognit. 2015. ; 48 ( 9 ): 2839 – 2846 . [Google Scholar]

[b31] 31. Xie Y, Zhu C, Zhou W, Li Z, Liu X, Tu M . Evaluation of machine learning methods for formation lithology identification: a comparison of tuning processes and model performances . J Petrol Sci Engin. 2018. ; 160 : 182 – 193 . [Google Scholar]

[b32] 32. Bradley AP. The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognit. 1997;30(7):1145–1159.

[b33] 33. Pedregosa F, Varoquaux G, Gramfort A, et al. Scikit-Learn: machine learning in Python. J Mach Learn Res. 2011;12:2825–2830.

[b34] 34. Youden WJ. Index for rating diagnostic tests. Cancer. 1950;3(1):32–35.

[b35] 35. Davis J, Goadrich M. The relationship between Precision-Recall and ROC curves. In: Proceedings of the 23rd International Conference on Machine Learning. New York: ACM; 2006:233–240.

[b36] 36. Jung K, Lee J, Gupta V, Cho G. Comparison of bootstrap confidence interval methods for GSCA using a Monte Carlo simulation. Front Psychol. 2019;10:2215.

[b37] 37. Caronni A, Sciumè L. Is my patient actually getting better? Application of the McNemar test for demonstrating the change at a single subject level. Disabil Rehabil. 2017;39(13):1341–1347.

[b38] 38. Hnin K, Mukherjee S, Antic NA, et al. The impact of ethnicity on the prevalence and severity of obstructive sleep apnea. Sleep Med Rev. 2018;41:78–86.

[b39] 39. Ong TH, Raudha S, Fook-Chong S, Lew N, Hsu A. Simplifying STOP-BANG: use of a simple questionnaire to screen for OSA in an Asian population. Sleep Breath. 2010;14(4):371–376.

[b40] 40. Huo J, Quan SF, Roveda J, Li A. BASH-GN: a new machine learning–derived questionnaire for screening obstructive sleep apnea. Sleep Breath. 2023;27(2):449–457.

[b41] 41. He S, Li Y, Xu W, Han D. Using clinical data to predict obstructive sleep apnea. J Thorac Dis. 2022;14(2):227–237.

[b42] 42. Kim J-W, Lee K, Kim HJ, et al. Predicting obstructive sleep apnea based on computed tomography scan using deep learning models. Am J Respir Crit Care Med. 2024;210(2):211–221.

[b43] 43. Chan H-L, Yuen H-M, Au C-T, Chan KC-C, Li AM, Lui L-M. Classification of childhood obstructive sleep apnea based on X-ray images analysis by quasi-conformal geometry. Pattern Recognit. 2024;152:110454.

[b44] 44. Eastwood P, Gilani SZ, McArdle N, et al. Predicting sleep apnea from three-dimensional face photography. J Clin Sleep Med. 2020;16(4):493–502.

[b45] 45. Kline A, Wang H, Li Y, et al. Multimodal machine learning in precision health: a scoping review. NPJ Digit Med. 2022;5(1):171.

[b46] 46. Ferreira-Santos D, Amorim P, Silva Martins T, Monteiro-Soares M, Pereira Rodrigues P. Enabling early obstructive sleep apnea diagnosis with machine learning: systematic review. J Med Internet Res. 2022;24(9):e39452.

[b47] 47. Nusinovici S, Tham YC, Yan MYC, et al. Logistic regression was as good as machine learning for predicting major chronic diseases. J Clin Epidemiol. 2020;122:56–69.

[b48] 48. van der Ploeg T, Austin PC, Steyerberg EW. Modern modelling techniques are data hungry: a simulation study for predicting dichotomous endpoints. BMC Med Res Methodol. 2014;14(1):137.

[b49] 49. Arslan RS, Ulutas H, Köksal AS, Bakir M, Çiftçi B. End-to end decision support system for sleep apnea detection and apnea-hypopnea index calculation using hybrid feature vector and machine learning. Biocybernet Biomed Engin. 2023;43(4):684–699.

[b50] 50. Jansri U, Tretriluxana S. Effect of resampling techniques on deep learning model training in sleep apnea classification. In: 2022 International Electrical Engineering Congress. Piscataway, NJ: IEEE; 2022:1–4.
