Abstract
Objective
Sleep-related respiratory abnormalities are typically detected using polysomnography. There is a need in general medicine and critical care for a more convenient method to detect sleep apnea automatically from a simple, easy-to-wear device. The objective was to detect abnormal respiration and estimate the Apnea–Hypopnea Index (AHI) automatically with a wearable respiratory device with and without SpO2 signals using a large (n = 412) dataset serving as ground truth.
Design
Simultaneously recorded polysomnography (PSG) and wearable respiratory effort data were used to train and evaluate models in a cross-validation fashion. Time-domain and complexity features were extracted, important features were identified, and a random forest model was employed to detect events and predict AHI. Four models were trained: one using the respiratory features only, one using a single SpO2 (%) feature only, and two using both the respiratory features and the SpO2 (%) feature, one of which tolerated a time lag of up to 30 s between the two signals.
Results
Event-based classification resulted in areas under the receiver operating characteristic curves of 0.94, 0.86, and 0.82, and areas under the precision-recall curves of 0.48, 0.32, and 0.51 for the models using respiration and SpO2, respiration-only, and SpO2-only, respectively. Correlation between expert-labelled and predicted AHI was 0.96, 0.78, and 0.93, respectively.
Conclusions
A wearable respiratory effort signal, with or without an SpO2 signal, predicted AHI accurately; the best performance was achieved using both signals.
Keywords: Respiration disorders, Apnea, Machine learning, Wearable
Introduction
Abnormalities of breathing during sleep may reflect both a primary sleep disorder (sleep apnea) and the health of the brain, lungs, or heart. There would be broad utility in an easy-to-apply, minimally obtrusive, and accurate measure of respiration, either for sleep apnea diagnostics or to track key vital signs in a range of situations, including intensive care units [1] and home tracking. Central sleep apnea, which occurs when breathing stops during sleep because the brain fails to send signals to the muscles that control breathing [2], is especially common in hospitalized patients [3].
Obstructive sleep apnea is more common in the general population than central sleep apnea, with a wide range of reported prevalence (4–24% of adults, or even higher) that varies by race, age, associated medical conditions, and the exact diagnostic criteria used [4-7]. A recent study of the global prevalence of obstructive sleep apnea reports that almost one billion people are affected and stresses the urgency of improving both healthcare delivery and cost-effectiveness [8].
The standard method for sleep apnea detection has evolved from full polysomnography (PSG) in a sleep center or hospital, to various home sleep apnea test (HSAT) systems, to innovations in minimal-contact or contact-free wearables/nearables, though most of the latter are not yet certified as medical devices. Problems with PSG and HSATs include the need for manual scoring or score review, limited repeatability, and the inability to capture dynamics across multiple nights. Conventional sleep testing is especially difficult to accomplish in medically challenging environments such as an intensive care unit or nursing home, or in situations where a change in medical condition is expected, such as heart failure.
Several studies report implementation of machine learning and deep learning techniques to automate the apnea–hypopnea detection process from PSG signals [9-12], but the previously noted limitations persist.
The ideal respiratory monitoring device should have at least three characteristics: ease of use and application, automated analytics with interpretable decision-making, and accuracy. In the present study, we focused on a method for detecting sleep apnea from a practical wearable respiratory belt, on its own or in combination with SpO2. Our analytics approach consisted of the extraction of clinically relevant and interpretable features and a machine learning model (random forest). The features were designed based on physiological considerations and were expected to be indicative of apneic events across patients and situations, addressing potential covariate shifts. This modelling approach stands in contrast to more complex, difficult-to-interpret approaches such as deep learning.
We present a novel, automated method for the detection of sleep apnea and AHI categorization with patient-friendly equipment suitable for a variety of clinical settings. We further provide the main code used in this study, ten PSGs with annotations and the corresponding wearable respiratory signals, and all trained models on our GitHub page [13].
Materials and methods
Data collection
The data, PSGs (Natus System [14]) from patients who also wore a wearable respiration belt (AirGo [15]), were collected at the Massachusetts General Hospital Sleep Laboratory between January 2019 and January 2020 with approval of the Partners Institutional Review Board under protocol 2018P002937. Participants came to the sleep lab for an overnight clinical sleep study, typically because of suspected sleep apnea or for sleep apnea therapy titration, and were enrolled through verbal consent shortly before the onset of PSG. Patients who agreed to participate wore the respiratory belt during their sleep study. SpO2 signals were taken from the PSG system. There were no exclusion criteria. Experts at the sleep lab annotated the PSG recordings according to the American Academy of Sleep Medicine 2007 manual [16]. Synchronization between the PSG and wearable recordings was manually reviewed and corrected where necessary.
While most sleep studies use respiratory inductance plethysmography (RIP)-based effort belts, the wearable respiratory belt used in this study, a CE Class IIa-certified wearable medical device, contains a silver band whose resistance changes proportionally with the belt's length, thereby measuring changes in thoracic circumference during breathing. Placed around the lower thorax near the floating ribs (see Figure E1), it provides respiratory effort and volumetric information from a single location [17]. Participants in this study wore the wearable belt between the abdominal and thoracic RIP effort belts.
Apnea and hypopnea scoring rule
We used expert (4% hypopnea rule [16]) and automatic (3% and 4% hypopnea rules [16]) labels (see online supplement) to train machine learning models on the wearable respiration and oximetry signals, resulting in the following ground truth labels: expert labels 4% rule (“EL4”), auto labels 4% rule (“AL4”), and auto labels 3% rule (“AL3”).
Definition of apnea event
All events labelled as any type of apnea or hypopnea are treated as an "apnea event". All remaining data are categorized as "non-apnea event".
Cross-validation
To obtain an unbiased performance estimate for the models, we split the data subject-wise into training, validation, and test sets in a tenfold cross-validation fashion (see online supplement).
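The subject-wise split described above can be sketched with scikit-learn's `GroupKFold`, which keeps all windows from one participant in the same fold. The variable names and data below are placeholders, not the study's actual pipeline:

```python
import numpy as np
from sklearn.model_selection import GroupKFold

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))             # placeholder per-window feature matrix
y = rng.integers(0, 2, size=1000)          # placeholder apnea/non-apnea labels
subjects = rng.integers(0, 50, size=1000)  # hypothetical subject ID per window

gkf = GroupKFold(n_splits=10)
for train_idx, test_idx in gkf.split(X, y, groups=subjects):
    # every subject's windows land entirely in train or entirely in test
    assert set(subjects[train_idx]).isdisjoint(subjects[test_idx])
```

Grouping by subject prevents windows from the same recording from leaking between the training and test sets, which would otherwise inflate the performance estimate.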
Overall modelling approach (see Fig. 1 and Figure E2)
Fig. 1.
Flowchart of the modelling approach. We train models to detect apnea events and predict Apnea–Hypopnea Index based on either the wearable respiration signal only or on respiration and oxygen saturation signals
We used a 90-s window, providing context before and after a typical apnea event lasting 10–30 s, and a step size of 1 s to extract features from the respiration signal, which were used to train a random forest model [18] for the binary apnea classification task ("Model Respiration-Only"). Furthermore, one feature was extracted from the SpO2 signal (the desaturation drop within 45 s of the detected event) and, together with the respiration features, was used to train another model with the same goal ("Model Respiration + SpO2"). A third model was trained with a very similar approach to the Respiration + SpO2 model, but the SpO2 feature was extracted so as to be insensitive to a time lag of up to ± 30 s between the respiration and SpO2 signals ("Model Respiration + SpO2 (robust)"). The latter is intended for scenarios where the time alignment of signals from different sensors cannot be guaranteed. Finally, we constructed an "SpO2-Only" model that simply checks whether the magnitude of a desaturation drop exceeds a certain threshold.
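Training such a per-window random forest classifier can be sketched as follows, assuming a precomputed feature matrix; the synthetic data and hyperparameters are illustrative only, not the configuration used in the study:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
X_train = rng.normal(size=(500, 10))    # placeholder per-window features
y_train = rng.integers(0, 2, size=500)  # placeholder apnea/non-apnea labels
X_test = rng.normal(size=(100, 10))
y_test = rng.integers(0, 2, size=100)

# class_weight="balanced" compensates for the rarity of apnea windows;
# all hyperparameters here are assumptions, not the authors' tuned values
clf = RandomForestClassifier(n_estimators=200, class_weight="balanced",
                             random_state=0)
clf.fit(X_train, y_train)
window_scores = clf.predict_proba(X_test)[:, 1]  # per-window apnea probability
auc = roc_auc_score(y_test, window_scores)
```

Thresholding `window_scores` yields per-window apnea/non-apnea decisions, which can then be merged into events.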
The following four types of features were extracted at various timepoints (see Figure E3): (1) standard deviation of the signal, (2) sum of the positive first derivative, (3) sample entropy [19, 20], and (4) Katz fractal dimension [21]. Features 1 and 2 were aimed at encoding the expected decrease in standard deviation and ventilation during an apnea period compared to before and after an event. Features 3 and 4 were aimed at encoding the regularity of the respiration, as we expected periods with apneas to have a more irregular, unpredictable pattern than normal breathing. Features 3 and 4 were computed both from the raw respiration signal and from a 1.2-s moving average of the respiration signal, intended to be less affected by short, small artifacts such as movement and heartbeats. From a total of 242 features, we performed feature selection [22, 23] and determined the number of features that achieved an optimal median area under the receiver operating characteristic curve (ROC AUC) [24] across the ten folds. Furthermore, we performed model hyperparameter optimization.
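The four feature types can be sketched as below. The function names, the sample-entropy tolerance (`r_factor`), and the simplified O(n²) entropy implementation are our own illustrative choices rather than the study's exact implementation (the actual code is available on the GitHub page cited above):

```python
import numpy as np

def std_feature(x):
    """Feature 1: standard deviation of the window."""
    return float(np.std(x))

def positive_derivative_sum(x):
    """Feature 2: sum of the positive first differences (a ventilation proxy)."""
    dx = np.diff(x)
    return float(dx[dx > 0].sum())

def sample_entropy(x, m=2, r_factor=0.2):
    """Feature 3: sample entropy, i.e., -log of the probability that windows
    matching for m samples also match for m + 1 (simplified O(n^2) version)."""
    x = np.asarray(x, dtype=float)
    r = r_factor * np.std(x)
    def n_matches(k):
        t = np.array([x[i:i + k] for i in range(len(x) - k + 1)])
        d = np.max(np.abs(t[:, None, :] - t[None, :, :]), axis=2)
        return np.sum(d <= r) - len(t)  # exclude self-matches
    return -np.log(n_matches(m + 1) / n_matches(m))

def katz_fractal_dimension(x):
    """Feature 4: Katz fractal dimension, log10(n) / (log10(n) + log10(d / L)),
    with L the total curve length and d the max distance from the first sample."""
    x = np.asarray(x, dtype=float)
    n = len(x) - 1
    L = np.sum(np.abs(np.diff(x)))
    d = np.max(np.abs(x - x[0]))
    return float(np.log10(n) / (np.log10(n) + np.log10(d / L)))

def smooth(x, fs, window_s=1.2):
    """1.2-s moving average applied before features 3 and 4."""
    k = max(1, int(window_s * fs))
    return np.convolve(x, np.ones(k) / k, mode="same")
```

Regular breathing yields low sample entropy and a Katz fractal dimension near 1, while fragmented apneic breathing pushes both upward, matching the intuition described above.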
We used the models to detect apnea events and estimate AHI, and evaluated performance using ROC AUC, area under the precision-recall curve (PRC AUC) [25], accuracy, sensitivity, precision, F1 score [26], and the correlation between expert-labelled and model-predicted AHI. AHI was computed both for the ground truth labels and for the model-predicted apnea events following the AASM 2007 manual [16]: AHI = number of events / hours of sleep. We also considered the situation where expert-based sleep stages are unavailable. In this case, one can use computational models to stage sleep [27, 28] or use the total PSG recording time as the denominator when computing AHI. We did both and computed the Pearson correlation coefficients between the AHIs using the expert sleep stages and these two alternative approaches (all using the Respiration + SpO2 (4% rule) apnea prediction model). The computational model used to stage sleep was a deep neural network that takes the wearable respiratory effort signal as input, as in our previous work [28] (see online supplement for further details on methodology).
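The AHI computation and the severity bins used throughout the paper reduce to a few lines (a direct transcription of the formula and categories above; the event counting itself is omitted):

```python
def compute_ahi(n_events, hours_of_sleep):
    """AASM 2007: AHI = number of apnea/hypopnea events per hour of sleep."""
    return n_events / hours_of_sleep

def ahi_category(ahi):
    """Severity bins used in this study:
    normal < 5, mild 5-15, moderate 15-30, severe >= 30."""
    if ahi < 5:
        return "normal"
    if ahi < 15:
        return "mild"
    if ahi < 30:
        return "moderate"
    return "severe"

# e.g. 45 detected events over 6.2 h of sleep (the cohort's mean sleep time)
ahi = compute_ahi(45, 6.2)
severity = ahi_category(ahi)
```

Swapping `hours_of_sleep` for the total recording duration gives the alternative denominator discussed above for the case where sleep stages are unavailable.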
Results
Data collection
Participants (n = 404) yielding 412 PSGs had a mean age of 56 years and were mostly Caucasian (71%) and overweight (80% had a BMI > 25); 46% were female; and 61% were diagnosed with sleep apnea (33% mild, 20% moderate, 8% severe). In all 412 PSGs, the wearable belt was worn from the beginning until the end of the study. Mean (STD) hours of data and sleep recorded per PSG were 7.5 (0.7) and 6.2 (1.1), respectively; see Table 1 for further demographic variables and sleep indices.
Table 1.
Baseline characteristics and polysomnography results
| Demographic | n (%) |
|---|---|
| Age (years)a | 56 (16) |
| Sex | |
| Male | 217 (54%) |
| Female | 186 (46%) |
| Unknown | 1 (0.1%) |
| Race | |
| White or Caucasian | 286 (71%) |
| Black or African American | 22 (5%) |
| Asian | 21 (5%) |
| Native Hawaiian, Pacific Islander, American Indian, or Alaska Native | 4 (1%) |
| Other or unknown | 71 (18%) |
| Hispanic ethnicity | 17 (14%) |
| Body mass index | |
| 19–25 | 73 (18%) |
| 25–30 | 122 (30%) |
| 30–65 | 201 (50%) |
| Unknown | 10 (2%) |
| Charlson Comorbidity Indexa | 1.4 (1.6) |
| Epworth Sleepiness Scale (ESS) | |
| ESS 0–5 (low normal) | 101 (25%) |
| ESS 6–10 (high normal) | 108 (27%) |
| ESS 11–15 (mild/moderate) | 61 (15%) |
| ESS 16–24 (severe) | 35 (9%) |
| Unknown | 99 (25%) |
| Previously diagnosed with high blood pressure | |
| Yes | 139 (34%) |
| Unknown | 18 (4%) |
| Polysomnography results | |
| Study type | |
| Diagnostic | 193 (47%) |
| Split night | 112 (28%) |
| Full night titration | 106 (26%) |
| Recording length (h) | 7.5 (0.7) |
| Hours of sleep | 6.2 (1.1) |
| Apnea–Hypopnea Index (AHI)a | 11 (13) |
| AHI < 5 | 161 (39%) |
| 5 ≤ AHI < 15 | 134 (33%) |
| 15 ≤ AHI < 30 | 82 (20%) |
| 30 ≤ AHI | 32 (8%) |
| Sleep efficiency (%)a | 82 (12) |
| Stage R (%)a | 16 (8) |
| Stage N1 (%)a | 10 (8) |
| Stage N2 (%)a | 59 (13) |
| Stage N3 (%)a | 14 (10) |
| Sleep Fragmentation Indexa | 7 (5) |
aMean (standard deviation)
Feature selection and feature importance
As illustrated in Figure E6, the individual folds resulted in AUCs between 0.8 and 0.92 and F1 scores between 0.55 and 0.75 on the validation sets. The mean and median performances increased up to approximately 40 features, plateaued until approximately 55 features, and then decreased ("overfitting").
Based on these results, we decided to use a maximum of 51 features (per fold) and to exclude all sample entropy and fractal dimension features computed on the raw (non-smoothed) respiration signal, as the smoothed-signal versions showed higher importance.
The most important respiration features among all folds were (descending order):
Ventilation features at positions 55, 60, 65, 75, 80, 90 (all with width 10 s, and 12 s width for position 60)
Katz fractal dimension with width 10 s at positions 65 and 60
Ventilation width 10 min (“reference”)
Katz fractal dimension with width 60 s at position 90
Standard deviation with widths 8 and 10 s at positions 60 and 80 and width 10 min (“reference”)
Sample entropy with width 30 s at positions 70 and 75
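Rankings like the one above can be obtained from a fitted random forest via its impurity-based importances; below is a toy sketch with synthetic data and hypothetical feature names:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 6))
y = (X[:, 2] > 0).astype(int)        # synthetic target driven by feature 2
names = [f"feature_{i}" for i in range(6)]

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
order = np.argsort(clf.feature_importances_)[::-1]
ranked = [names[i] for i in order]    # feature names in descending importance
```

Averaging such per-fold rankings across the ten cross-validation folds yields an overall ordering like the list above.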
Figure 2 shows 15 min of sleep data, including signals recorded with the PSG system as well as the wearable belt, and all four types of features computed from the wearable belt. Figure E7 shows the signals for 4 h.
Fig. 2.
Signals recorded with the PSG system and the wearable belt, and features computed from the wearable belt. The 15-min period shows a change from stable to unstable breathing with expert-annotated apnea events (red vertical lines). During apneic events, the ventilation and standard deviation features show reduced values, and the Katz fractal dimension and sample entropy features, encoding non-linear complexity information, show distinct patterns. The expert-labelled events shown in this example were all detected by both the Respiration-Only and Respiration + SpO2 models
Event-based performance
Table E1 shows results for the event-based performance evaluation of the three models with different inputs. Adding the SpO2 feature increased the overall sensitivity from 55.6 to 66.7% and reduced the number of false positives by nearly 50%, to a false-positive rate of 2.7%. The lag-robust version performed similarly to the "non-robust" model, being marginally more sensitive (67.6% vs. 66.7%) and slightly less precise. When the SpO2 feature was included, the largest increase in class-based sensitivity occurred for hypopneas, which is not surprising given that their definition requires "a reduction of airflow of more than 50%" but not a complete absence of respiration; such events were therefore expected to be harder to detect from the respiration signal alone. Sensitivity values for obstructive apnea events were 68% and 55% for the Respiration + SpO2 and Respiration-Only models, respectively. These results suggest that, although thorax movements during obstructive apneic events are not always captured by this wearable effort belt, the combination of the effort belt and SpO2 signals still carries a detectable signature: breathing during such events tends to be fragmented and chaotic during both inhalation and exhalation, whereas normal breathing typically shows a smooth inhalation and an s-shaped exhalation.
Mean patient performance
Model performances averaged over all PSGs that contained at least five expert-labelled apnea events (n = 360) (see Table 2 for expert 4% and auto 3% rules, and Table E2 for auto 4% rule) showed:
Table 2.
Mean patient performance
| Performance metric | Respiration + SpO2 | Respiration + SpO2 (robust) | Respiration-Only | SpO2-Only |
|---|---|---|---|---|
| Expert labels (4% hypopnea rule) | ||||
| ROC AUC | 0.94 (0.93 to 0.94) | 0.93 (0.92 to 0.93) | 0.86 (0.85 to 0.87) | 0.82 (0.81 to 0.83) |
| PRC AUC | 0.48 (0.46 to 0.50) | 0.45 (0.42 to 0.47) | 0.32 (0.30 to 0.35) | 0.51 (0.50 to 0.53) |
| Accuracy | 0.94 (0.93 to 0.95) | 0.94 (0.93 to 0.95) | 0.92 (0.92 to 0.93) | 0.96 (0.95 to 0.97) |
| Sensitivity | 0.58 (0.56 to 0.61) | 0.58 (0.56 to 0.61) | 0.50 (0.48 to 0.53) | 0.73 (0.71 to 0.76) |
| Precision | 0.50 (0.48 to 0.53) | 0.44 (0.42 to 0.46) | 0.32 (0.30 to 0.34) | 0.57 (0.55 to 0.58) |
| F1 score | 0.52 (0.50 to 0.54) | 0.48 (0.46 to 0.50) | 0.36 (0.34 to 0.38) | 0.62 (0.61 to 0.64) |
| Automatic labels (3% hypopnea rule) | ||||
| ROC AUC | 0.93 (0.92 to 0.93) | 0.92 (0.92 to 0.93) | 0.86 (0.85 to 0.87) | 0.83 (0.82 to 0.83) |
| PRC AUC | 0.44 (0.42 to 0.46) | 0.42 (0.40 to 0.44) | 0.32 (0.30 to 0.34) | 0.52 (0.51 to 0.53) |
| Accuracy | 0.95 (0.94 to 0.96) | 0.95 (0.94 to 0.96) | 0.93 (0.93 to 0.94) | 0.95 (0.94 to 0.96) |
| Sensitivity | 0.51 (0.48 to 0.53) | 0.54 (0.51 to 0.56) | 0.53 (0.51 to 0.56) | 0.85 (0.83 to 0.87) |
| Precision | 0.52 (0.50 to 0.54) | 0.48 (0.45 to 0.50) | 0.36 (0.34 to 0.38) | 0.49 (0.46 to 0.51) |
| F1 score | 0.50 (0.48 to 0.52) | 0.49 (0.47 to 0.51) | 0.40 (0.37 to 0.42) | 0.59 (0.57 to 0.61) |
The results are similar for all ground truth label versions (3% and 4% hypopnea scoring rules), showing that the modelling approach is suitable for either scoring rule.
ROC AUC is greatest for Respiration + SpO2 models (0.94), followed by Respiration-Only (0.86) and SpO2-Only (0.82).
PRC AUC is greatest for SpO2-Only (0.51), followed by Respiration + SpO2 (0.48) and Respiration-Only (0.32).
The Respiration + SpO2 robust version shows only a slight performance decrease compared to Respiration + SpO2 (ROC AUC 0.93 vs. 0.94, PRC AUC 0.45 vs. 0.48, respectively).
Apnea–Hypopnea Index
The AHI evaluation from the Respiration + SpO2 model resulted in coefficients of determination (r2) of 0.92 and 0.85 for the expert labels (4% hypopnea rule) and auto labels (3% hypopnea rule), respectively. The respective AHI categorization accuracies were 0.8 and 0.7 (see Table E3 in the online supplement for r2 and accuracy values for the Respiration-Only and SpO2-Only models).
Figure 3 shows the scatterplot of detected versus expert-labelled AHI for the Respiration + SpO2, Respiration-Only, and SpO2-Only models (panels A–C); the distribution of the absolute AHI difference (panels D–F); and the AHI categorization confusion matrices (panels G–I). The coefficients of determination of 0.92, 0.61, and 0.87, respectively, show that all models predict AHI in good agreement with the true AHI, and performance increased when the SpO2 feature was added to the respiration features. Of all PSG recordings determined by the experts to have normal, mild, moderate, and severe AHI, 88.2%, 73.1%, 72.0%, and 93.8%, respectively, were placed in the same AHI severity category by the model, showing decreased performance in the mild and moderate categories compared to the normal and severe categories. Figure E8 and Figure E9 show similar trends for the auto 3% and auto 4% rules.
Fig. 3.
Apnea–Hypopnea Index (AHI)–based model performance evaluation (based on expert labels) for models trained on features from respiration and SpO2 (left column), Respiration-Only (middle column), and SpO2-Only (right column). Results are obtained from left-out test data in a tenfold cross-validation fashion; n = 409 polysomnographic recordings including wearable respiration. A–C Scatterplots of detected (model-based) and expert-labelled AHI per recording. D–F Difference of detected and expert-labelled AHI, including 95% confidence interval (CI). All models show a unimodal error distribution. G–I Confusion matrix (in %) with AHI categorization (normal: 0–5, mild: 5–15, moderate: 15–30, severe: > 30), where rows: expert-labelled AHI category, columns: detected AHI category. Accuracies for categorizations: 80% (Respiration + SpO2), 67% (Respiration-Only), and 77% (SpO2-Only)
Correlations between the true AHI and the detected AHIs were not significantly different for patients stratified by sex, age, race, BMI, ESS, PSG study type, and history of hypertension, showing that model performance is robust across demographic and clinical variables (see Table 3). The ROC AUCs for the binarized AHI categorization tasks range from 0.94 to 0.98 (Respiration + SpO2), 0.85 to 0.95 (Respiration-Only), and 0.91 to 0.97 (SpO2-Only) for the different hypopnea rules, indicating that the models accurately distinguish between AHI categories for any input signal and hypopnea rule used (see Fig. 4).
Table 3.
Pearson correlation coefficients and 95% confidence interval for expert-based and model-based Apnea–Hypopnea indices for subgroups
| Subgroup | N PSGs | AirGo + SpO2 model | AirGo model | SpO2 model |
|---|---|---|---|---|
| Female | 189 | 0.96 (0.93, 0.98) | 0.81 (0.72, 0.87) | 0.94 (0.88, 0.97) |
| Male | 219 | 0.95 (0.94, 0.97) | 0.76 (0.66, 0.84) | 0.93 (0.90, 0.95) |
| Age < 45 | 88 | 0.94 (0.90, 0.97) | 0.67 (0.56, 0.80) | 0.91 (0.83, 0.96) |
| Age 45–60 | 97 | 0.97 (0.94, 0.99) | 0.83 (0.73, 0.90) | 0.97 (0.95, 0.99) |
| Age > 60 | 169 | 0.95 (0.93, 0.97) | 0.78 (0.68, 0.85) | 0.91 (0.87, 0.94) |
| Race White | 290 | 0.95 (0.93, 0.97) | 0.74 (0.66, 0.81) | 0.93 (0.9, 0.96) |
| Race Black | 22 | 0.97 (0.83, 0.99) | 0.86 (0.37, 0.96) | 0.93 (0.73, 0.97) |
| Race Asian | 21 | 0.99 (0.91, 1.0) | 0.86 (0.3, 0.97) | 0.98 (0.86, 0.99) |
| BMI 19–25 | 73 | 0.94 (0.90, 0.98) | 0.81 (0.71, 0.90) | 0.87 (0.78, 0.95) |
| BMI 25–30 | 123 | 0.96 (0.93, 0.98) | 0.87 (0.77, 0.93) | 0.93 (0.90, 0.96) |
| BMI 30–65 | 203 | 0.96 (0.94, 0.97) | 0.75 (0.65, 0.82) | 0.95 (0.91, 0.97) |
| ESS 0–5 | 101 | 0.97 (0.91, 0.98) | 0.76 (0.57, 0.87) | 0.93 (0.84, 0.97) |
| ESS 6–10 | 108 | 0.95 (0.92, 0.98) | 0.75 (0.63, 0.85) | 0.94 (0.90, 0.96) |
| ESS 11–15 | 61 | 0.95 (0.92, 0.98) | 0.72 (0.58, 0.86) | 0.91 (0.83, 0.96) |
| ESS 16–24 | 35 | 0.97 (0.93, 0.99) | 0.88 (0.70, 0.96) | 0.92 (0.80, 0.98) |
| Dx high BP | 139 | 0.95 (0.92, 0.97) | 0.68 (0.57, 0.78) | 0.93 (0.87, 0.96) |
| No Dx high BP | 247 | 0.95 (0.93, 0.97) | 0.82 (0.72, 0.88) | 0.92 (0.89, 0.95) |
| Diagnostic night | 193 | 0.95 (0.91, 0.97) | 0.80 (0.72, 0.88) | 0.91 (0.86, 0.95) |
| Split night | 110 | 0.95 (0.92, 0.97) | 0.78 (0.69, 0.85) | 0.93 (0.87, 0.96) |
| Full night titration | 105 | 0.93 (0.88, 0.96) | 0.62 (0.46, 0.75) | 0.87 (0.77, 0.93) |
Fig. 4.
Receiver operating characteristic curves and the corresponding areas under the curve (AUC) evaluated on binary AHI categorization tasks, for different hypopnea scoring rules. All models result in AUCs greater than 0.85, showing accurate AHI categorization
Example model detections and expert labels are shown together with the respiration and oxygen saturation signals in Fig. 5.
Fig. 5.
Example signals and detections: 60 min of a recording from a 47-year-old subject, where the AHI detection is comparable to the mean model performance (true AHI: 7, detected AHI Respiration + SpO2: 8, detected AHI Respiration-Only: 10). The respiration and SpO2 signals as well as the apnea events annotated by the sleep experts and detected by our two models are presented. Point (a) shows an instance where both models labelled an apnea event but the sleep experts did not, even though a decrease in the respiration signal and a large drop in blood oxygenation are visible. Such instances are treated as false positives, even though the event was either missed by the sleep experts or narrowly fails to qualify as an apnea event. Point (b) is an example of an event where the sleep experts and both models label an apnea event. Point (c) shows an instance where the Respiration + SpO2 model incorrectly labels an apnea event, but the Respiration-Only model does not. Point (d) is labelled as an apnea event by the sleep expert but is not detected by either model (false negative). The region around point (e) shows a series of similar events, all showing a cyclic decrease and increase in respiration amplitude and desaturation. Some of those events are marked as apnea events by the sleep expert; some are not—illustrating the difficulty for a model to fully agree with the human labels. At point (f), the respiration clearly switches from an unstable period to a more stable, more regular pattern—neither the sleep expert nor the models detect events here
The detected AHI changed only marginally when the respiratory effort-based, neural-network-scored sleep stages were used in place of the expert-scored sleep stages (r = 0.986). The same was true when the recording duration was used as the denominator for computing the AHI (r = 0.965).
Discussion
Detecting abnormal sleep respiration including classic sleep apnea in a patient-friendly and low-cost manner can enable clinicians and researchers to better understand and treat primary sleep disorders and abnormal sleep breathing in a range of medical and neurological conditions. The present study provides new measurement insights in this area as it:
Uses a large dataset of simultaneously recorded wearable signals and polysomnographies (as the gold standard) with 404 subjects and a rigorous method of model performance evaluation
Quantifies the performance increase when combining a respiratory belt and SpO2(%) sensor instead of using both sensors individually
Evaluates the AHI detection performance for the two AASM-accepted hypopnea scoring definitions, using the “3% desaturations and arousal” and “4% desaturations” rules, where we found slight differences (r2 = 0.85 and r2 = 0.92, respectively)
Shows state-of-the-art performance in automatically detecting individual apnea events and assessing AHI with a feature-based machine learning approach, compared to other studies published using wearable or portable signals (see Table E5)
We found a total of thirty-nine studies that detected sleep apnea using either SpO2 signals, respiratory signals, or a combination of the two. Of those studies, seven [29-35] used a wearable or portable device: three used SpO2 signals alone, two used respiration signals alone, and two used both respiration and SpO2 signals. Comparing our results with these seven studies is challenging, as four evaluated performance on the subject level only, two on the event level only, and only one study on both levels but with a custom definition of sensitivity that makes direct comparison infeasible. Of the two studies with event-level performance evaluations, only Van Steenkiste et al. [32] separated subjects into training and test sets. That study (N = 25) used either PSG effort belts or a wearable respiratory sensor together with a deep learning model and reports ROC AUCs of 0.73–0.80, PRC AUCs of 0.42–0.57, sensitivities of 0.58–0.72, and precisions of 0.38–0.48 [32]. These performance values are comparable to ours (Respiration + SpO2: ROC AUC of 0.93, PRC AUC of 0.47, sensitivity of 0.51, and precision of 0.59). In another recent publication, using subject-level performance evaluation, Álvarez et al. [34] (N test = 96) developed a machine learning model with oximetry and airflow and reported a comparable AHI detection performance of 0.93 correlation between model and expert AHI when using both signals, with reduced performance when using only one of the two. Taken together, our method shows performance comparable to state-of-the-art methods in the literature and is the only study that reports both event-based and subject-based performances (with standard metrics) for a wearable apnea detection method, further providing code for reproducibility.
While some of the observed event-based performances may seem moderate (e.g., a precision of 0.59), the good subject-level AHI performance (r2 = 0.92) indicates that event-based inaccuracies largely wash out over a full night of sleep. One potential explanation is the binary nature of the event labels, i.e., yes/no for a particular event, both for the ground truth (expert) and output (model) labels. "Softer" labels, such as probabilities or severities, might be needed to better interpret such results. Further, only one expert annotated each PSG recording. Considering previously reported inter-expert disagreement in sleep analysis, having multiple experts score the same recordings would allow a richer performance evaluation.
In the context of sleep apnea, a diagnosis from a wearable respiration device with or without the addition of SpO2 allows greater flexibility of sleep apnea diagnosis in cases where full PSGs or even home testing is not readily feasible. High-resolution respiratory effort monitoring also enables extraction of sleep stage information [28], and sleep breathing phenotype data, which may further enable improved precision management. As AHI might not be the only variable of interest to assess sleep-disordered breathing, we have included a section “Beyond AHI” in the supplement material, where we show how the presented respiratory features and apnea detection method can be used to (a) derive duration of apnea events (enabling the possibility to filter detected events, e.g., with a minimum duration), (b) derive AH event-specific desaturations [36], and (c) detect hypoventilation and/or bradypnea.
The concept of home sleep and sleep apnea testing continues to evolve rapidly with advances in technology. Current guidelines to assess obstructive sleep apnea do not include single respiratory effort belts [37] though several commercially available and clinically used devices use this configuration (e.g., ApneaLink Air, ResMed, Inc. [38]). Our results show that we have the potential to obtain clinically useful accuracy with a sensitive respiratory effort signal alone, combined with advanced analytics.
Limitations to the study include potential selection bias from the fact that the study population was chosen from a single center and included only patients already coming to the sleep lab for testing. Most of the patients in our cohort were older than 40 years old, Caucasian, and obese. Although we did not find statistically significant differences in our results across age, sex, race, and BMI, data with a wider range of demographic variables is needed for more detailed validation. Including more patients from multiple hospitals and patients from the general public could have increased the breadth of the study population and the generalizability of the findings. Additionally, only one (out of seven total) sleep expert scored each PSG recording, as is typical of clinical recordings. Therefore, there is a possibility that there is some variation or label noise when it comes to the ground truth sleep expert labels, because scoring PSGs is a subjective process and can be biased based on the expert that scored the recording. This concern is largely mitigated by the finding that results were similar when evaluating the model against ground truth labels determined from the PSG by automated methods. Further, every analytics approach comes with trade-offs. We used hand-engineered features and shallow machine learning because we believe these features to be relatively stable across situations (sleep lab, home, ICU) and demographics. While more complex methods such as deep learning may lead to improvements in prediction performance using the current dataset that was collected in one specific situation (sleep lab), higher uncertainty of model behavior is expected in new situations such as at home, in the ICU, or in other demographics.
In future work, we plan to apply this model to ICU patients. Patients enrolled in a clinical trial (https://clinicaltrials.gov/ct2/show/NCT03355053?term=Investigation+of+Sleep+in+the+Intensive+Care+Unit) in the ICU at Massachusetts General Hospital wear a respiratory belt for the study, and we plan to use this information in addition to SpO2 to detect apneic events in this population. For the patients already enrolled, we performed a preliminary check using our models and obtained promising initial results—Figure E13 shows the resulting apnea detections for one example patient.
Conclusions
Respiratory effort signals obtained with a single patient-friendly wearable band, standard oxygen saturation signals, and an automated analytics approach allowed us to extract clinically relevant respiratory features, train a shallow machine learning model, and detect sleep apnea with good accuracy. For the most accurate detection of sleep apnea and AHI, both respiratory and SpO2 signals were needed.
Supplementary Material
Funding
M.B.W. was supported by the Glenn Foundation for Medical Research and American Federation for Aging Research (Breakthroughs in Gerontology Grant); the American Academy of Sleep Medicine (AASM Foundation Strategic Research Award); the Football Players Health Study (FPHS) at Harvard University; the Department of Defense through a subcontract from Moberg ICU Solutions, Inc.; and by the NIH (1R01NS102190, 1R01NS102574, 1R01NS107291, 1RF1AG064312).
Footnotes
Conflict of interest Wolfgang Ganglberger declares that he/she has no conflict of interest. Abigail A. Bucklin declares that he/she has no conflict of interest. Ryan A. Tesh declares that he/she has no conflict of interest. Madalena Da Silva Cardoso declares that he/she has no conflict of interest. Haoqi Sun declares that he/she has no conflict of interest. Michael J. Leone declares that he/she has no conflict of interest. Luis Paixao declares that he/she has no conflict of interest. Ezhil Panneerselvam declares that he/she has no conflict of interest. Elissa M. Ye declares that he/she has no conflict of interest. B. Taylor Thompson reports personal fees from Bayer and Thetis, outside the submitted work. Oluwaseun Akeju declares that he/she has no conflict of interest. David Kuller reports non-financial support from Myair Inc., during the conduct of the study; non-financial support from Myair Inc., outside the submitted work; in addition, Dr. Kuller has a Patent US10123724B2 "Breath volume monitoring system and method" issued. Dr. Thomas reports personal fees from GLG Councils, Guidepoint Global, and Jazz Pharmaceuticals, outside the submitted work. In addition, Dr. Thomas has a patent ECG-spectrogram with royalties paid by MyCardio, LLC, a patent Auto-CPAP with royalties paid by DeVilbiss-Drive, and an unlicensed patent CO2 device for central/complex sleep apnea issued. Dr. Westover reports grants from NIH, during the conduct of the study.
Ethical approval All procedures performed in studies involving human participants were in accordance with the ethical standards of the institutional and/or national research committee and with the 1964 Helsinki declaration and its later amendments or comparable ethical standards.
Informed consent Informed consent was obtained from all individual participants included in the study.
Supplementary Information The online version contains supplementary material available at https://doi.org/10.1007/s11325-021-02465-2.
References
- 1.Revels SL, Cameron BH, Cameron RB (2019) Obstructive sleep apnea and perioperative delirium among thoracic surgery intensive care unit patients: perspective on the STOP-BANG questionnaire and postoperative outcomes. J Thorac Dis 11:S1292–S1295. 10.21037/jtd.2019.04.63
- 2.Sacchetti LM, Mangiardi P (2012) Obstructive sleep apnea: causes, treatment and health implications. Nova Science Publishers
- 3.Khayat R, Abraham W, Patt B et al. (2012) Central sleep apnea is a predictor of cardiac readmission in hospitalized patients with systolic heart failure. J Card Fail 18:534–540. 10.1016/j.cardfail.2012.05.003
- 4.Epstein LJ, Kristo D, Strollo PJ et al. (2009) Clinical guideline for the evaluation, management and long-term care of obstructive sleep apnea in adults. J Clin Sleep Med 5:263–276
- 5.Johnson DA, Guo N, Rueschman M et al. (2018) Prevalence and correlates of obstructive sleep apnea among African Americans: the Jackson heart sleep study. Sleep 41. 10.1093/sleep/zsy154
- 6.Jackson M (2017) Prevalence and correlates of sleep apnea among US male veterans, 2005–2014. Prev Chronic Dis 14. 10.5888/pcd14.160365
- 7.Soler X, Gaio E, Powell FL et al. (2015) High prevalence of obstructive sleep apnea in patients with moderate to severe chronic obstructive pulmonary disease. Annals ATS 12:1219–1225. 10.1513/AnnalsATS.201407-336OC
- 8.Benjafield AV, Ayas NT, Eastwood PR et al. (2019) Estimation of the global prevalence and burden of obstructive sleep apnoea: a literature-based analysis. Lancet Respir Med 7:687–698. 10.1016/S2213-2600(19)30198-5
- 9.Deviaene M, Testelmans D, Borzée P et al. (2019) Feature selection algorithm based on random forest applied to sleep apnea detection. In: 2019 41st Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC). pp 2580–2583
- 10.Marcos JV, Hornero R, Alvarez D et al. (2010) Automated detection of obstructive sleep apnoea syndrome from oxygen saturation recordings using linear discriminant analysis. Med Biol Eng Comput 48:895–902. 10.1007/s11517-010-0646-6
- 11.Varon C, Caicedo A, Testelmans D et al. (2015) A novel algorithm for the automatic detection of sleep apnea from single-lead ECG. IEEE Trans Biomed Eng 62:2269–2278. 10.1109/TBME.2015.2422378
- 12.Almazaydeh L, Elleithy K, Faezipour M (2012) Obstructive sleep apnea detection using SVM-based classification of ECG signal features. In: 2012 Annual International Conference of the IEEE Engineering in Medicine and Biology Society. pp 4938–4941
- 13.(2020) mghcdac/respiratory_event_detection_wearable. MGH Clinical Data Animation Center
- 14.Welcome to Neuro. In: Natus. https://neuro.natus.com/. Accessed 6 Apr 2020
- 15.MyAir LLC. In: AirGo™. https://www.myairgo.com. Accessed 6 Apr 2020
- 16.Berry RB, Budhiraja R, Gottlieb DJ et al. (2012) Rules for scoring respiratory events in sleep: update of the 2007 AASM manual for the scoring of sleep and associated events. J Clin Sleep Med 8:597–619. 10.5664/jcsm.2172
- 17.Antonelli A, Guilizzoni D, Angelucci A et al. (2020) Comparison between the Airgo™ device and a metabolic cart during rest and exercise. Sensors 20:3943. 10.3390/s20143943
- 18.Breiman L (2001) Random forests. Mach Learn 45:5–32. 10.1023/A:1010933404324
- 19.Richman JS, Moorman JR (2000) Physiological time-series analysis using approximate entropy and sample entropy. American Journal of Physiology-Heart and Circulatory Physiology 278:H2039–H2049. 10.1152/ajpheart.2000.278.6.H2039
- 20.Vallat R (2020) raphaelvallat/entropy. https://github.com/raphaelvallat/entropy. Accessed 6 Apr 2020
- 21.Katz MJ (1988) Fractals and the analysis of waveforms. Comput Biol Med 18:145–156. 10.1016/0010-4825(88)90041-8
- 22.Bellman R (2010) Dynamic programming. Princeton University Press, Princeton, NJ, USA
- 23.Hastie T, Tibshirani R, Friedman J (2009) The elements of statistical learning: data mining, inference, and prediction, 2nd edn. Springer-Verlag, New York
- 24.Bradley AP (1997) The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recogn 30:1145–1159. 10.1016/S0031-3203(96)00142-2
- 25.Davis J, Goadrich M (2006) The relationship between precision-recall and ROC curves. In: Proceedings of the 23rd international conference on Machine learning - ICML '06. ACM Press, Pittsburgh, Pennsylvania, pp 233–240
- 26.Van Rijsbergen CJ (1979) Information retrieval, 2nd edn. Butterworths, London
- 27.Fiorillo L, Puiatti A, Papandrea M et al. (2019) Automated sleep scoring: a review of the latest approaches. Sleep Med Rev 48:101204. 10.1016/j.smrv.2019.07.007
- 28.Sun H, Ganglberger W, Panneerselvam E et al. (2019) Sleep staging from electrocardiography and respiration with deep learning. Sleep. 10.1093/sleep/zsz306
- 29.Álvarez D, Gutiérrez-Tobal GC, Vaquerizo-Villar F et al. (2016) Automated analysis of unattended portable oximetry by means of Bayesian neural networks to assist in the diagnosis of sleep apnea. In: 2016 Global Medical Engineering Physics Exchanges/Pan American Health Care Exchanges (GMEPE/PAHCE). pp 1–4
- 30.Garde A, Dehkordi P, Karlen W et al. (2014) Development of a screening tool for sleep disordered breathing in children using the phone oximeter™. PLoS One 9. 10.1371/journal.pone.0112959
- 31.Pathinarupothi RK, DP J, Rangan ES et al. (2017) Single sensor techniques for sleep apnea diagnosis using deep learning. In: 2017 IEEE International Conference on Healthcare Informatics (ICHI). pp 524–529
- 32.Van Steenkiste T, Groenendaal W, Dreesen P et al. (2020) Portable detection of apnea and hypopnea events using bio-impedance of the chest and deep learning. IEEE J Biomed Health Inform. 10.1109/JBHI.2020.2967872
- 33.Koley BL, Dey D (2013) Real-time adaptive apnea and hypopnea event detection methodology for portable sleep apnea monitoring devices. IEEE Trans Biomed Eng 60:3354–3363. 10.1109/TBME.2013.2282337
- 34.Álvarez D, Cerezo-Hernández A, Crespo A et al. (2020) A machine learning-based test for adult sleep apnoea screening at home using oximetry and airflow. Sci Rep 10:1–12. 10.1038/s41598-020-62223-4
- 35.Wu J-C, Wang C-W, Huang Y-H et al. (2018) A portable monitoring system with automatic event detection for sleep apnea level-IV evaluation. In: 2018 IEEE International Symposium on Circuits and Systems (ISCAS). pp 1–4
- 36.Azarbarzin A, Sands SA, Stone KL et al. (2019) The hypoxic burden of sleep apnoea predicts cardiovascular disease-related mortality: the Osteoporotic Fractures in Men Study and the Sleep Heart Health Study. Eur Heart J 40:1149–1157. 10.1093/eurheartj/ehy624
- 37.Collop NA, Tracy SL, Kapur V et al. (2011) Obstructive sleep apnea devices for out-of-center (OOC) testing: technology evaluation. J Clin Sleep Med 7:531–548. 10.5664/JCSM.1328
- 38.ApneaLink Air™ home sleep testing solution - ResMed healthcare professional. In: Healthcare Professional. https://www.resmed.com/en-us/healthcare-professional/products-and-support/home-sleep-testing/apnealink-air/. Accessed 3 May 2021