Abstract
Study Objectives:
The objective of this meta-analysis was to analyze agreement in apnea-hypopnea index (AHI) determination between peripheral arterial tonometry (PAT) and polysomnography (PSG) studies.
Methods:
Mean AHI bias and standard deviation extracted from Bland-Altman plots reported in studies were pooled in a meta-analysis, which was then used to calculate percentage errors of limit agreement in AHI determination by PAT using PSG AHI as the reference. Individual participant data (where reported in studies) were used to compute Cohen’s kappa to assess agreement between PSG and PAT on sleep apnea severity and for computing the sensitivity and specificity of PAT at different AHI thresholds using PSG AHI as the reference.
Results:
From 17 studies and 1,318 participants (all underwent simultaneous PSG and use of the WatchPAT device), a pooled mean AHI bias of 0.30 (standard error [SE], 0.74) and a WatchPAT AHI percentage error of 230% was calculated. The meta-analysis of Cohen’s kappa for agreement between PSG and WatchPAT studies for classifying patients with no sleep apnea, mild, moderate, or severe sleep apnea severity was 0.45 (SE, 0.06), 0.29 (SE, 0.05), 0.25 (SE, 0.07), and 0.64 (SE, 0.05), respectively. At AHI thresholds of 5, 15 and 30 events/h, WatchPAT studies showed pooled sensitivities and specificities of 94.11% and 43.47%, 92.21% and 72.39%, and 74.11% and 87.10%, respectively. Likelihood ratios were not significant at any AHI threshold.
Conclusions:
The results of this meta-analysis suggest clinically significant discordance between WatchPAT and PSG measurements of AHI, significant sleep apnea severity misclassification by PAT studies, and poor diagnostic test performance.
Citation:
Iftikhar IH, Finch CE, Shah AS, Augunstein CA, Ioachimescu OC. A meta-analysis of diagnostic test performance of peripheral arterial tonometry studies. J Clin Sleep Med. 2022;18(4):1093–1102.
Keywords: polysomnography, peripheral arterial tonometry, diagnostic test performance, meta-analysis
BRIEF SUMMARY
Current Knowledge/Study Rationale: Peripheral arterial tonometry–based technology is a validated sleep apnea diagnostic testing modality. However, its accuracy in determining sleep apnea severity has not been studied in a systematic fashion.
Study Impact: There is significant discordance between peripheral arterial tonometry and polysomnography studies in determining the apnea-hypopnea index and disagreement in sleep apnea severity classification. This meta-analysis also showed poor specificity of peripheral arterial tonometry studies in an apnea-hypopnea index range of 5–15 events/h, and we recommend confirmation with polysomnography in such situations.
INTRODUCTION
Peripheral arterial tonometry (PAT)–based home sleep apnea testing is a relatively novel method for studying sleep apnea. There are 2 PAT devices currently approved by the U.S. Food & Drug Administration: WatchPAT (Itamar Medical Inc., Caesarea, Israel) and NightOwl (Ectosense, Belgium). WatchPAT devices measure several physiological functions such as pulse waveform, heart rate, heart rate variability, oximetry, actigraphy, body position, and snoring. From a combination of a PAT signal, oximetry, and actigraphy, the proprietary respiratory events scoring algorithm calculates an apnea-hypopnea index (AHI).
Several prior studies1–8 and a meta-analysis9 that assessed the use of WatchPAT devices showed strong correlations between the gold-standard polysomnography (PSG)-derived AHI and that derived from WatchPAT studies and made an argument that WatchPAT devices represented a reliable sleep apnea detecting modality. Some of these studies,3–5,7,8 by showing Bland-Altman (B&A) plots, have also attempted to show good “agreement” between PSG and WatchPAT. B&A plots are a way of quantifying the agreement between 2 quantitative measurement methods by studying the mean difference and constructing limits of agreement.10 Some researchers have analyzed the sensitivity and specificity of PAT studies in detecting sleep apnea, albeit with unknown levels of sleep apnea prevalence in their respective study populations.1,3,11 However, neither the reported high correlation nor the mean bias from B&A plots (with no established “clinically” acceptable limits of agreement) completely allay the concerns of clinicians practicing in the field of sleep medicine, because there are concerns not only for high rates of sleep apnea misclassification12,13 but also of significant night-to-night variability13 in sleep apnea severity as measured by these devices.
Hence, there are issues with both “accuracy” and “precision.” Whereas “accuracy” describes the systematic error of the measurement tool and indicates how close the measured value is to the true value, “precision” describes the reproducibility of measurements considering the variability of repeated values resulting from random error. A measurement tool may be precise but inaccurate, meaning that resultant values are consistent but similarly far from the true value. Correlation studies can be misleading because they evaluate only the linear association of 2 sets of observations. B&A plots, on the other hand, show agreement but may not actually show accuracy, which is an inherent limitation of B&A plots.10
Although the limits of agreement on B&A plots do refer to precision (if the limits are narrow, then the precision is high, and vice versa), these limits may only be interpreted properly if the confidence intervals for the limits are known, and these, unfortunately, are seldom reported adequately in studies. The suggested solution to this problem is to report the percentage error (PE) of the limits of agreement, which can be used as a cutoff for whether to accept a new technique.14,15 PE is calculated by dividing the limits of agreement by the mean value of measurements taken using the reference method in a population. Because some random error is expected when the precisions of 2 measurement methods are combined, one can assume that, once an acceptable PE cutoff has been set, a new technique whose calculated PE falls below that cutoff has an error similar to that of the reference standard and can therefore be considered an acceptable alternative to it.14
This assumption formed the basis of this meta-analysis, in which we sought to analyze the PE data from such studies published in the literature. Secondary objectives were to analyze data on sleep apnea severity misclassification and to conduct a meta-analysis of diagnostic test performance at different AHI thresholds.
METHODS
This systematic review and meta-analysis were conducted following the established guidelines from the Preferred Reporting Items for Systematic Reviews and Meta-Analyses,15 reported in Table S1 in the supplemental material.
Search strategy and study eligibility criteria
PubMed and Web of Science were searched for studies of adult participants, published in English, from inception to September 22, 2021. Studies were considered eligible for inclusion if they included adult participants who underwent simultaneous (ie, during the same night) diagnostic PSGs (ie, without positive airway pressure treatment) and PAT tests for AHI and included extractable data from B&A plots, specifically mean differences between PSG and PAT studies and their standard deviations/limits of agreement. Conference proceedings or abstracts/posters were excluded. After the initial screening of the titles from the search databases, full-text articles of shortlisted abstracts were independently assessed by all authors for inclusion in this meta-analysis. The studies were assessed for study quality based on Cochrane methods. Disagreement on any study selection or study quality was resolved by consensus. Study quality assessment is reported in Table S2 in the supplemental material.
Data extraction and synthesis
Data extracted from studies included first author’s name, publication year, population characteristics, mean AHIs from PSG and PAT studies, and the criteria used in PSG for scoring hypopneas.
GraphGrabber software (v2.0.2; Quintessa, U.K.) was used by 1 of the authors (I.H.I.) to independently extract numerical data from the mean bias and limits of agreement lines shown on B&A plots. These data were validated by confirming them with the reported data in the texts of the manuscripts (where reported). Calculated standard deviations (SDs; 1 × SD) and mean bias from individual studies were then used to compute the random-effects pooled estimate of bias, pooled standard error (SE), and pooled 95% confidence intervals using Comprehensive Meta-Analysis software (CMA version 2.2.064; Biostat, Englewood, NJ). From the pooled SEx, we calculated the pooled SD, using the formula $\mathrm{SD} = \mathrm{SE}_{x} \times \sqrt{n}$, where n is the sample size.
The pooled mean AHI (used as the denominator when calculating the PE) was estimated by separate meta-analyses (Figure S1, Figure S2, and Figure S3 in the supplemental material).
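As a rough illustration of the two-step calculation described in this section, the following Python sketch converts a pooled SE into an SD and then into a PE. It assumes the conventional relation SD = SE × √n and 95% limits of agreement of ± 1.96 SD around the bias. The function names and all input values are hypothetical and are not taken from this meta-analysis.

```python
import math


def sd_from_se(se: float, n: int) -> float:
    """Convert a standard error to a standard deviation, assuming SD = SE * sqrt(n)."""
    return se * math.sqrt(n)


def percentage_error(sd_diff: float, reference_mean: float, z: float = 1.96) -> float:
    """Half-width of the limits of agreement (z * SD) as a percentage of the reference mean."""
    return 100.0 * z * sd_diff / reference_mean


# Hypothetical inputs, chosen only to keep the arithmetic easy to follow.
sd = sd_from_se(se=0.5, n=400)                  # 0.5 * 20 = 10 events/h
pe = percentage_error(sd, reference_mean=20.0)  # 1.96 * 10 / 20, i.e. about 98%
print(f"SD = {sd:.1f} events/h, PE = {pe:.0f}%")
```

With these made-up inputs, the limits of agreement would span roughly ± 20 events/h around the bias, which is about as large as the assumed mean AHI itself.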
Heterogeneity was assessed using the I2 index. High heterogeneity was explored with subgroup analyses based on the different definitions of hypopneas used in previous studies and on the different PAT technologies. Publication bias was assessed using funnel plots and the Begg and Mazumdar rank correlation test.
To meta-analyze how much PSG and PAT studies agreed with each other in classifying patients into 4 different sleep apnea severity categories (normal/AHI < 5 events/h, mild/AHI 5–14.9 events/h, moderate/AHI 15–29.9 events/h, or severe sleep apnea/AHI ≥ 30 events/h), we shortlisted studies that reported sufficient information to calculate Cohen’s κ and its variance and then pooled this information in a meta-analysis. The benchmarks for interpreting Cohen’s κ as proposed by Landis and Koch16 are as follows: 0.8–1.0 indicates near-perfect agreement, 0.6–0.8 indicates substantial agreement, 0.4–0.6 indicates moderate agreement, 0.2–0.4 indicates fair agreement, 0–0.2 indicates slight agreement, and zero or lower indicates poor agreement.
The meta-analysis of diagnostic test performance (sensitivity, specificity, diagnostic odds ratio [DOR], and positive and negative likelihood ratios [LR+, LR–]) for PAT studies at different AHI thresholds was performed in Stata version 13 (StataCorp, College Station, TX) using the “metandi” function. Further details of these methods are described under Additional details on Methods in the supplemental material.
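The agreement statistic described above can be sketched in Python as follows. This is a minimal illustration that assumes each severity category is analyzed as a one-vs-rest 2 × 2 table and uses a simple large-sample approximation for the standard error of κ. The counts, the function names, and the handling of values that fall exactly on a benchmark boundary are assumptions made for illustration, not details of the original analysis.

```python
import math


def cohens_kappa_2x2(a: int, b: int, c: int, d: int) -> tuple[float, float]:
    """Cohen's kappa and an approximate SE for one severity category.

    a = PSG+ and PAT+, b = PSG+ and PAT-, c = PSG- and PAT+, d = PSG- and PAT-.
    """
    n = a + b + c + d
    p_obs = (a + d) / n                                       # observed agreement
    p_exp = ((a + b) * (a + c) + (c + d) * (b + d)) / n ** 2  # chance agreement
    kappa = (p_obs - p_exp) / (1 - p_exp)
    se = math.sqrt(p_obs * (1 - p_obs) / (n * (1 - p_exp) ** 2))  # simple approximation
    return kappa, se


def landis_koch(kappa: float) -> str:
    """Map kappa to the benchmark labels quoted in the text."""
    if kappa <= 0:
        return "poor"
    for upper, label in [(0.2, "slight"), (0.4, "fair"),
                         (0.6, "moderate"), (0.8, "substantial")]:
        if kappa <= upper:
            return label
    return "near-perfect"


# Hypothetical counts for a single category in a single study.
kappa, se = cohens_kappa_2x2(a=20, b=10, c=15, d=55)
print(f"kappa = {kappa:.2f} (SE {se:.2f}): {landis_koch(kappa)} agreement")
```

Per-study κ values and their variances obtained this way are the kind of inputs that a pooled estimate would combine.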
RESULTS
Meta-analysis of mean AHI bias and PE
Figure 1 shows the study selection process, and Table S3 in the supplemental material details the reasons for the exclusion of some studies (including the one that studied NightOwl17). After excluding these studies, a meta-analysis of mean AHI bias was performed based on the remaining 17 studies1–4,7,11,12,18–27 with a total of 1,318 participants who were studied using PSG and WatchPAT simultaneously. Table 1 details the population characteristics. The pooled mean bias was 0.30 (SE, 0.74; I2, 78%; Figure 2). The PE based on these results was then calculated as 230%.
Figure 1. Search strategy and selection process.
AHI = apnea-hypopnea index.
Table 1.
Baseline characteristics of study population.
Study | Country | Study Population | Age (y) | n | Sex (male, %) | BMI (kg/m²) |
---|---|---|---|---|---|---|
Ayas et al 200318 | United States | Adults either suspected of sleep apnea or not suspected | 47 | 30 | 66.6% | 31 |
Pittman et al 20044 | United States | Adults suspected of sleep apnea referred to a sleep laboratory | 43.2 (10.8) | 29 | 72% | 33.9 (7.1) |
Penzel et al 20042 | Germany | Adults suspected of or already diagnosed with sleep apnea | NR | 17 | NR | NR |
Zou et al 20061 | Sweden | Some participants hypertensive, some normotensive | 60 (7) | 98 | 56% | 28 (4) |
Holmedahl et al 201920 | Norway | Participants had COPD | 61.4 (9.1) | 16 | 43.7% | 26.4 (5.3) |
O’Brien et al 201223 | United States | Pregnant participants in third trimester | 30.2 (7.1) | 31 | 0% | 31.9 (8.1) |
Onder et al 201224 | Turkey | Adults suspected of sleep apnea referred to a sleep laboratory | Group 1: 30.72 (2.89); Group 2: 55 (4.77) | Group 1: 29; Group 2: 27 | Group 1: 65.5%; Group 2: 63% | Group 1: 30.27 (5.58); Group 2: 30.81 (3.22) |
Weimin et al 201326 | China | Adults suspected of sleep apnea presenting to an otolaryngology clinic | 47.45 (13.46) | 28 | 71% | 29.99 (5.74) |
Garg et al 201427 | United States | Adult African Americans screened by Berlin questionnaire | 44.7 (10.6) | 75 | 24% | NR |
Körkuyu et al 201522 | Turkey | Adults suspected of sleep apnea presenting to a sleep medicine clinic | 49.2 (9.6) | 30 | 83% | 29.6 (4.4) |
Gan 201719 | Singapore | Adults suspected of sleep apnea | 39 (16) | 20 | 90% | 27.2 (5.5) |
Ioachimescu et al 202012 | United States | Adults suspected of sleep apnea referred to a sleep laboratory | 52.5 (41.8–62.5) | 500 | 80% | 31.6 (28–35.9) |
Jen et al 202021 | United States | Adults with COPD | 63 (7) | 33 | 61% | 28.1 (6.7) |
Kasai et al 20207 | Japan | Adults suspected of sleep apnea referred to a sleep laboratory | 58.0 ± 11.9 | 120 | 85% | 26.4 ± 5.4 |
Pillar et al 202025 | Canada, Israel, United States, Germany | Investigators selectively recruited patients with heart failure, resulting in 50/84 participants with either CHF or Afib or both | 57 (16) | 84 | 67.5% | 29.8 (5.7) |
Tauman et al 20203 | United States, Canada, Germany, Israel | Patients diagnosed with Afib and suspected of sleep apnea (not all patients had Afib during the sleep study night) | 68 ± 12 | 101 | 69% | 31 ± 5.2 |
Tondo et al 202111 | Italy | Participants included those thought to be at low risk for sleep apnea | 51.70 ± 14.28 | 47 | 62% | 26 ± 5.67 |
Data are presented as mean and standard deviation or median with range; n indicates total number of participants analyzed in meta-analysis. Afib = atrial fibrillation, BMI = body mass index, CHF = congestive heart failure, COPD = chronic obstructive pulmonary disease, NR = not reported in source article.
Figure 2. Pooled mean AHI bias and PE.
The size of the square boxes indicates the weight of the effect size as determined by the number of participants and the crossing blue horizontal lines indicate confidence limits; the diamond is the pooled effect size with its edges on the horizontal plane indicating the confidence limits. AHI = apnea-hypopnea index, CI = confidence interval, I2 = heterogeneity, PE = percentage error of limits of agreement.
Exploring heterogeneity
Heterogeneity was explored using subgroup analyses. Data were parsed into 4 subgroups, as follows: subgroup A, with studies7,12,19,25 that used the 3% desaturation (for hypopneas) along with the 30% airflow reduction (from baseline) rule; subgroup B,3,20,24,26,27 with similar scoring criteria as the former but using the 4% desaturation rule; subgroup C, with studies4,11,18,21–23 that used the 50% airflow reduction (from baseline) rule along with 3% desaturation (for hypopneas); and subgroup D, which grouped the 2 studies by Penzel et al2 and Zou et al.1 Although both of these studies used a similar hypopnea scoring rule as the others in subgroup C, the study by Penzel et al2 did not specify whether a 3% or 4% desaturation rule was used for hypopneas, and the study by Zou et al1 used the 4% desaturation rule for scoring hypopneas. Results are displayed in a forest plot (Figure 3), showing PEs of 498%, 105%, 86%, and 293% for subgroups A, B, C, and D, respectively.
Figure 3. Subgroup mean AHI bias and PE analyses based on scoring criteria.
The size of the square boxes indicates the weight of the effect size as determined by the number of participants and the crossing blue horizontal lines indicate confidence limits; the diamonds are the pooled effect size with its edges on the horizontal plane indicating the confidence limits. AHI = apnea-hypopnea index, CI = confidence interval, I2 = heterogeneity, PE = percentage error of limits of agreement.
Further subgroup analyses were conducted based on the specific version of WatchPAT technology used (WatchPAT 100 or WatchPAT 200). Results are shown in Figure 4. The PEs calculated for WatchPAT 100 and WatchPAT 200 were 151% and 250%, respectively.
Figure 4. Subgroup mean AHI bias and PE analyses based on PAT technology.
The size of the square boxes indicates the weight of the effect size as determined by the number of participants and the crossing blue horizontal lines indicate confidence limits; the diamonds are the pooled effect size with its edges on the horizontal plane indicating the confidence limits. The WP100 and WP200 denote the WatchPAT 100 and 200 devices. AHI = apnea-hypopnea index, CI = confidence interval, I2 = heterogeneity, PAT = peripheral arterial tonometry, PE = percentage error of limits of agreement.
Meta-analysis of Cohen’s κ for sleep apnea severity classification
Based on 6 studies4,11,12,18,23,26 and a total of 665 participants, the pooled estimate of Cohen’s κ for agreement on the classification of patients into those with no sleep apnea was 0.45 (SE, 0.06; I2, 0%), for those with mild sleep apnea it was 0.29 (SE, 0.05; I2, 0%), for those with moderate sleep apnea it was 0.25 (SE, 0.07; I2, 17.97%), and for those with severe sleep apnea it was 0.64 (SE, 0.05; I2, 21.16%).
Meta-analysis of diagnostic test performance
Based on 6 studies4,11,12,18,23,26 and a total of 665 participants, the computed pooled sensitivity, specificity, and DOR at AHI thresholds of 5, 15, and 30 events/h are reported in Table 2. Hierarchical summary receiver operating curves are shown in Figure 6.
Table 2.
Meta-analysis of diagnostic test accuracy at different AHI thresholds.
AHI threshold | Sensitivity | Specificity | DOR | LR+ | LR– |
---|---|---|---|---|---|
AHI ≥ 5 events/h | 94.11%; SE, 2.6% | 43.47%; SE, 12.9% | 12.30; SE, 3.62 | 1.66; SE, 0.34 | 0.13; SE, 0.03 |
AHI ≥ 15 events/h | 92.21%; SE, 2.4% | 72.39%; SE, 7.8% | 31.08; SE, 17.97 | 3.34; SE, 0.97 | 0.10; SE, 0.03 |
AHI ≥ 30 events/h | 74.11%; SE, 5.6% | 87.10%; SE, 3.4% | 19.33; SE, 6.46 | 5.74; SE, 1.41 | 0.29; SE, 0.06 |
AHI = apnea-hypopnea index, DOR = diagnostic odds ratio, LR+ and LR– = positive and negative likelihood ratios, SE = standard error.
Figure 5. Meta-analysis of Cohen’s κ for sleep apnea severity classification.
The size of the square boxes indicates the weight of the effect size as determined by the number of participants and the crossing blue horizontal lines indicate confidence limits; the diamonds are the pooled effect size with its edges on the horizontal plane indicating the confidence limits. CI = confidence interval, I2 = heterogeneity.
Publication bias assessment
Meta-analysis data of mean bias assessment were used to construct a funnel plot for publication bias assessment. This plot (Figure S4 in the supplemental material) did not show any asymmetry on visual inspection, and the Begg and Mazumdar rank correlation test showed that the Kendall’s tau b (corrected for ties) was 0.13, with a 1-tailed P value (recommended) of .22 or a 2-tailed P value of .44 (based on continuity-corrected normal approximation).
DISCUSSION
With regard to the accuracy of WatchPAT in determining the true AHI and its agreement with PSG, the PEs calculated from the pooled mean bias in this meta-analysis show significant discordance between the 2 tests. As for their degree of agreement in correctly classifying patients into the 4 different sleep apnea severity classifications, aside from the severe sleep apnea and no sleep apnea classifications, there seemed to be only fair agreement between the PSG and WatchPAT testing modalities in categorizing mild and moderate sleep apnea severity. Similarly, at lower AHI thresholds, WatchPAT showed lower specificity with a higher chance of false positives for sleep apnea diagnosis.
The results of this meta-analysis contradict in some ways those of earlier studies1–4 that have argued in favor of WatchPAT studies. As mentioned earlier, the basis for such an argument in these studies is either good correlation (between PSG and WatchPAT AHIs) or the display of B&A plots showing values falling within the limits of agreement, even though in the literature there exist no accepted limits of agreement for B&A plots on sleep studies. With regard to the correlation studies, one needs to consider that correlation expresses the relationship between 2 variables (if related and how strongly related) and not the differences. A high correlation does not always imply that a good agreement exists between the 2 methods. As such, the correlation coefficient in a linear regression model can sometimes be misleading when assessing agreement because it evaluates only the linear association of 2 sets of observations (or a goodness of fit to a line).
On the other hand, a B&A plot is a method to quantify agreement (graphically) between 2 quantitative measurements by constructing limits of agreement, calculated by using the mean and the SD of the differences between 2 measurements. The graph is an XY scatter plot, in which the Y axis shows the difference between the 2 paired measurements (A–B) and the X axis represents the average of these measures ([A+B]/2). It is recommended that 95% of the data points should lie within ± 2 SD of the mean difference. In a B&A plot, when comparing a new clinical measurement technique with an established one, one assumes that measurements from neither of the testing techniques are closer to the true value, which often remains unknown, and therefore, per Bland and Altman, the plot of the differences in measurements is best compared to the mean of the 2 measurements (from the 2 testing techniques), which is often the best estimate when the true value is unknown.28 A B&A plot, by itself, does not show whether the agreement is sufficient or suitable to use 1 or the other testing modality indifferently, because it only quantifies the bias and a range of agreement, within which 95% of the differences are included. It is generally accepted that the best way to use the B&A plots is to a priori define the limits of maximum acceptable differences (expected limits of agreement) based on clinically relevant criteria.10
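The quantities behind a B&A plot can be sketched in a few lines of Python. The example below is illustrative only: the paired AHI values are invented, the difference is taken as PAT minus PSG by assumption, and the limits are computed as the bias ± 1.96 SD (the text above rounds this to ± 2 SD).

```python
import statistics


def bland_altman(a: list[float], b: list[float], z: float = 1.96):
    """Return bias, SD of differences, limits of agreement, and plot coordinates."""
    diffs = [x - y for x, y in zip(a, b)]        # Y axis: A - B
    means = [(x + y) / 2 for x, y in zip(a, b)]  # X axis: (A + B) / 2
    bias = statistics.mean(diffs)
    sd = statistics.stdev(diffs)                 # sample SD of the differences
    return bias, sd, (bias - z * sd, bias + z * sd), means, diffs


# Hypothetical paired AHI values (events/h) from the same night.
pat_ahi = [7.0, 10.0, 27.0, 31.0, 56.0]
psg_ahi = [4.0, 12.0, 22.0, 35.0, 50.0]

bias, sd, (lower, upper), means, diffs = bland_altman(pat_ahi, psg_ahi)
print(f"bias = {bias:.1f}, limits of agreement = {lower:.1f} to {upper:.1f} events/h")
```

In this invented data set the bias is small while the limits of agreement are comparatively wide, which is the situation in which a low mean bias alone can give a misleading impression of agreement.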
Because a B&A plot does not statistically determine the superiority of 1 testing method over another and because the precision of both methods is quantitatively dependent on the value of the mean measurement (in this case, mean AHI), PEs are sometimes calculated for accuracy and precision.14,15 Here again, there needs to be an agreement within the field of sleep medicine as to what should be considered an acceptable PE. Examples of this strategy can be obtained from the cardiology literature, where 2 meta-analyses on cardiac output measurement using 2 different testing methods suggested that a PE of 30%29 and 45%30 should be considered acceptable. However, PEs in AHI determination as computed from this meta-analysis seem clinically significant for sleep studies, even by the most liberal criteria. Even when parsing the data based on the different scoring criteria for hypopneas, high PEs were noted across all subgroups (analyses). Although the B&A and correlation analyses have their value when applied and interpreted in the right setting, the reality is that correlation and agreement can be interpreted differently by clinicians and have the potential to mislead many to equate them to accuracy. Calculating PEs from B&A plots (as in this meta-analysis) may offer a reliable supplemental metric of accuracy.
Although the accuracy of WatchPAT studies remains in question and leaves much to be debated, it is also worth pointing out that there can be significant night-to-night variation in AHI determination from WatchPAT studies. In a recent study that evaluated the night-to-night variability of WatchPAT studies, AHI varied by an average of 56.7% from the previous night and the misclassification rate of sleep apnea severity on 3 different nights was between 22% and 25%.13 However, to be fair, one can also expect some night-to-night variability with PSG, and in 1 study 25% of the participants were found to have an AHI increase of at least 20 events/h when they were studied again on a second night using PSG.31
Misclassification was also studied in this meta-analysis using Cohen’s κ. Although some studies have considered Cohen’s κ for agreement in classification into mild (0.29) and moderate sleep apnea (0.25) as fair agreement,16 others have argued that little confidence should be placed in results with Cohen’s κ < 0.60.32 However, determining misclassification by this method has its limitations. For example, if one evaluates the agreement between PAT and PSG studies for mild sleep apnea (AHI 5–14.9 events/h) and the PSG study determines the AHI to be 5.9 events/h but the PAT study for the same patient during the same night shows an AHI of 31 events/h, then in the 2 × 2 contingency table, the output is recorded as “PSG+ve and PAT-ve” for mild sleep apnea. Contrast this scenario with one in which PSG shows an AHI of 5.9 events/h but PAT shows an AHI of 4 events/h. In such a situation, this output would still be marked in the 2 × 2 contingency table as “PSG+ve and PAT-ve” for mild sleep apnea. Although both scenarios would allow one to compute a Cohen’s κ, the latter would be a less-relevant misclassification and the former would be more misleading.
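The limitation described above can be made concrete with a short Python sketch. It uses the severity categories defined in the Methods and the 2 scenarios from the preceding paragraph. The function names and the cell labels are illustrative choices.

```python
def severity(ahi: float) -> str:
    """Classify an AHI (events/h) into the 4 severity categories used in this article."""
    if ahi < 5:
        return "none"
    if ahi < 15:
        return "mild"
    if ahi < 30:
        return "moderate"
    return "severe"


def mild_cell(psg_ahi: float, pat_ahi: float) -> str:
    """Cell of the 2 x 2 contingency table for the 'mild' category (one-vs-rest)."""
    psg = "PSG+" if severity(psg_ahi) == "mild" else "PSG-"
    pat = "PAT+" if severity(pat_ahi) == "mild" else "PAT-"
    return f"{psg}/{pat}"


# Scenario 1: PAT overshoots into the severe range.
print(severity(31.0), mild_cell(5.9, 31.0))
# Scenario 2: PAT falls just below the diagnostic threshold.
print(severity(4.0), mild_cell(5.9, 4.0))
```

Both scenarios land in the same cell even though the PAT result is "severe" in the first and "none" in the second, so the magnitude of the disagreement is invisible to κ computed in this way.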
Given that this statistical approach has its inherent limitations, another approach was used in this meta-analysis in which data from eligible studies were analyzed for pooled sensitivity, specificity, DOR (defined as the ratio of the odds of the test being positive if the patient has a disease relative to the odds of the test being positive if the patient does not have the disease33), and LR+ and LR– at different AHI thresholds of 5, 15, and 30 events/h. The rationale for the DOR and LRs is that they are indicators of test performance and unlike accuracy they are independent of disease prevalence.34 From Table 2 and hierarchical summary receiver operating curves (Figure 6), one sees that as the AHI thresholds go higher, sensitivity declines and specificity increases. At the AHI 5 events/h threshold, the increased sensitivity comes at the expense of reduced specificity, which means more false positives. At the other extreme, at the AHI ≥ 30 events/h threshold, although specificity increases significantly, sensitivity does decrease but not significantly, and hence false negatives are to be expected but not to the extent that one would encounter false positives in the former situation (ie, the AHI 5 events/h threshold).
Figure 6. HSROCs for different AHI thresholds.
Summary points with 95% confidence limits and prediction regions are displayed for different AHI thresholds. AHI = apnea-hypopnea index, HSROC = hierarchical summary receiver operating curve.
Although it may seem that the optimal combination of sensitivity and specificity is obtained with an AHI threshold of 15 events/h, which also has the highest DOR, one also needs to interpret this possibility alongside the reported LR+ and LR–. In general, as the LRs move farther away from a value of 1.0, the strength of their association with the presence or absence of a disease increases. It is generally accepted that tests with a very high LR+ and a very low LR– have greater discriminatory ability, and only tests with LRs > 10 or < 0.1 are useful in establishing or excluding a diagnosis.35 It is clear from Table 2 that the values of LR+ and LR– do not meet these criteria for any category of sleep apnea severity. In addition, between the AHI thresholds of 5 and 15 events/h, one wonders how much more valuable WatchPAT studies are than sleep apnea screening questionnaires such as STOP-Bang, with similar sensitivity and specificity at lower AHI thresholds,36 and, given the high chance of false positives in that AHI range, whether PSG should be conducted to confirm the findings.
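The relationship between these indices can be illustrated with simple arithmetic in Python. The sketch below applies the textbook definitions LR+ = sensitivity / (1 − specificity), LR– = (1 − sensitivity) / specificity, and DOR = LR+ / LR– to the pooled sensitivities and specificities from Table 2. It is only an approximation of the reported values, which came from a hierarchical model, so small differences from Table 2 are to be expected. The function and variable names are illustrative.

```python
def likelihood_ratios(sens: float, spec: float) -> tuple[float, float, float]:
    """Return (LR+, LR-, DOR) from sensitivity and specificity given as proportions."""
    lr_pos = sens / (1 - spec)
    lr_neg = (1 - sens) / spec
    return lr_pos, lr_neg, lr_pos / lr_neg


# Pooled sensitivity and specificity from Table 2, keyed by AHI threshold (events/h).
table2 = {5: (0.9411, 0.4347), 15: (0.9221, 0.7239), 30: (0.7411, 0.8710)}

results = {}
for threshold, (sens, spec) in table2.items():
    lr_pos, lr_neg, dor = likelihood_ratios(sens, spec)
    # Rule of thumb quoted in the text: LR+ > 10 or LR- < 0.1.
    meets_rule = lr_pos > 10 or lr_neg < 0.1
    results[threshold] = (lr_pos, lr_neg, dor, meets_rule)
    print(f"AHI >= {threshold}: LR+ = {lr_pos:.2f}, LR- = {lr_neg:.2f}, "
          f"DOR = {dor:.1f}, meets rule of thumb: {meets_rule}")
```

Under these definitions none of the 3 thresholds reaches an LR+ above 10 or an LR– below 0.1, in line with the interpretation given above.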
One limitation of this analysis (as detailed under Additional details on Methods in the supplemental material) is that at the AHI thresholds of 5 and 30 events/h, because there were a few cells in the 2 × 2 contingency table that had zero values, analyzing hierarchical summary receiver operating curve summary points was not possible without the application of 0.5 as a correction factor in those cells, an approach previously studied.37 Another limitation to this analysis and of the meta-analysis of Cohen’s κ is that some studies that reported calculated sensitivity and specificity could not be included because of the absence of analyzable individual patient data reported in those studies. Nevertheless, the total number of participants from the excluded studies was far less than the total number of participants analyzed in the analyses in this meta-analysis.
There are some other limitations to this meta-analysis. Although the funnel plot did not show statistical evidence of publication bias, there were a few studies (including one that studied NightOwl17) that could not be included in the meta-analysis because data were not extractable from their B&A plots (see Table S3 in the supplemental material). The main analysis also showed high heterogeneity. However, this heterogeneity was explored with several subgroup analyses and even in those subgroups with acceptable heterogeneity, the PEs remained clinically significant. Last, and although not entirely a limitation per se (as it was not a priori defined to be the objective of this meta-analysis), this meta-analysis did not analyze other data from PAT and PSG studies, such as total sleep time, percentages of nonrapid eye movement sleep and rapid eye movement sleep, and oxygen desaturation index. These data would constitute important analyses because in a recent study, the misestimation of the total sleep time (26 ± 63 minutes) by a proprietary WatchPAT algorithm was thought to be responsible for some error in the correct classification of sleep apnea severity.12
In conclusion, this meta-analysis shows clinically significant discordance between PAT (specifically, WatchPAT) and PSG measurements of AHI, significant misclassification by WatchPAT studies especially for mild- and moderate-severity classes of sleep apnea, and poor diagnostic test performance. Not only does the field of sleep medicine need to define what should be considered clinically acceptable limits of agreement in the AHI differences plotted on B&A plots (or for that matter, other indices of sleep apnea severity) for future studies of any new sleep wearables or other existing home sleep apnea testing devices, but sleep medicine professionals also need more adequately powered and large-scale clinical studies that would ideally use the same respiratory scoring criteria for both PAT (NightOwl or WatchPAT) and PSG studies to better understand the differences/discrepancies between PAT and PSG estimation of AHIs.
DISCLOSURE STATEMENT
All authors have seen and approved the manuscript. Work for this study was performed at the Atlanta Veterans Affairs Medical Center, Decatur, Georgia. The authors report no conflicts of interest.
ACKNOWLEDGMENTS
Author contributions: IHI conceptualized the study design and protocol; contributed to the literature search, data extraction and analysis, and risk of bias assessment of studies; and wrote the manuscript. CEF contributed to the literature search and risk of bias assessment. ASS contributed to the literature search. CAG contributed to data validation. OCI partly contributed to data extraction. All authors contributed to the manuscript.
ABBREVIATIONS
- AHI: apnea-hypopnea index
- B&A: Bland-Altman
- DOR: diagnostic odds ratio
- LR+: positive likelihood ratio
- LR–: negative likelihood ratio
- PAT: peripheral arterial tonometry
- PE: percentage error (of limits of agreement)
- PSG: polysomnography
- SD: standard deviation
- SE: standard error
REFERENCES
- 1. Zou D, Grote L, Peker Y, Lindblad U, Hedner J. Validation a portable monitoring device for sleep apnea diagnosis in a population based cohort using synchronized home polysomnography. Sleep. 2006;29(3):367–374.
- 2. Penzel T, Kesper K, Pinnow I, Becker HF, Vogelmeier C. Peripheral arterial tonometry, oximetry and actigraphy for ambulatory recording of sleep apnea. Physiol Meas. 2004;25(4):1025–1036.
- 3. Tauman R, Berall M, Berry R, et al. Watch-PAT is useful in the diagnosis of sleep apnea in patients with atrial fibrillation. Nat Sci Sleep. 2020;12:1115–1121.
- 4. Pittman SD, Ayas NT, MacDonald MM, Malhotra A, Fogel RB, White DP. Using a wrist-worn device based on peripheral arterial tonometry to diagnose obstructive sleep apnea: in-laboratory and ambulatory validation. Sleep. 2004;27(5):923–933.
- 5. Pinto JA, Godoy LB, Ribeiro RC, Mizoguchi EI, Hirsch LA, Gomes LM. Accuracy of peripheral arterial tonometry in the diagnosis of obstructive sleep apnea. Braz J Otorhinolaryngol. 2015;81(5):473–478.
- 6. Pang KP, Gourin CG, Terris DJ. A comparison of polysomnography and the WatchPAT in the diagnosis of obstructive sleep apnea. Otolaryngol Head Neck Surg. 2007;137(4):665–668.
- 7. Kasai T, Takata Y, Yoshihisa A, et al. Comparison of the apnea-hypopnea index determined by a peripheral arterial tonometry-based device with that determined by polysomnography—results from a multicenter study. Circ Rep. 2020;2(11):674–681.
- 8. Choi JH, Kim EJ, Kim YS, et al. Validation study of portable device for the diagnosis of obstructive sleep apnea according to the new AASM scoring criteria: Watch-PAT 100. Acta Otolaryngol. 2010;130(7):838–843.
- 9. Yalamanchali S, Farajian V, Hamilton C, Pott TR, Samuelson CG, Friedman M. Diagnosis of obstructive sleep apnea by peripheral arterial tonometry: meta-analysis. JAMA Otolaryngol Head Neck Surg. 2013;139(12):1343–1350.
- 10. Giavarina D. Understanding Bland Altman analysis. Biochem Med (Zagreb). 2015;25(2):141–151.
- 11. Tondo P, Drigo R, Scioscia G, et al. Usefulness of sleep events detection using a wrist worn peripheral arterial tone signal device (WatchPAT) in a population at low risk of obstructive sleep apnea. J Sleep Res. 2021;30(6):e13352.
- 12. Ioachimescu OC, Allam JS, Samarghandi A, et al. Performance of peripheral arterial tonometry-based testing for the diagnosis of obstructive sleep apnea in a large sleep clinic cohort. J Clin Sleep Med. 2020;16(10):1663–1674.
- 13. Tschopp S, Wimmer W, Caversaccio M, Borner U, Tschopp K. Night-to-night variability in obstructive sleep apnea using peripheral arterial tonometry: a case for multiple night testing. J Clin Sleep Med. 2021;17(9):1751–1758.
- 14. Odor PM, Bampoe S, Cecconi M. Cardiac output monitoring: validation studies—how results should be presented. Curr Anesthesiol Rep. 2017;7(4):410–415.
- 15. Page MJ, McKenzie JE, Bossuyt PM, et al. The PRISMA 2020 statement: an updated guideline for reporting systematic reviews. J Clin Epidemiol. 2021;134:178–189.
- 16. Landis JR, Koch GG. The measurement of observer agreement for categorical data. Biometrics. 1977;33(1):159–174.
- 17. Massie F, Mendes de Almeida D, Dreesen P, Thijs I, Vranken J, Klerkx S. An evaluation of the NightOwl home sleep apnea testing system. J Clin Sleep Med. 2018;14(10):1791–1796.
- 18. Ayas NT, Pittman S, MacDonald M, White DP. Assessment of a wrist-worn device in the detection of obstructive sleep apnea. Sleep Med. 2003;4(5):435–442.
- 19. Gan YJ, Lim L, Chong YK. Validation study of WatchPat 200 for diagnosis of OSA in an Asian cohort. Eur Arch Otorhinolaryngol. 2017;274(3):1741–1745.
- 20. Holmedahl NH, Fjeldstad OM, Engan H, Saxvig IW, Grønli J. Validation of peripheral arterial tonometry as tool for sleep assessment in chronic obstructive pulmonary disease. Sci Rep. 2019;9(1):19392.
- 21. Jen R, Orr JE, Li Y, et al. Accuracy of WatchPAT for the diagnosis of obstructive sleep apnea in patients with chronic obstructive pulmonary disease. COPD. 2020;17(1):34–39.
- 22. Körkuyu E, Düzlü M, Karamert R, et al. The efficacy of Watch PAT in obstructive sleep apnea syndrome diagnosis. Eur Arch Otorhinolaryngol. 2015;272(1):111–116.
- 23. O’Brien LM, Bullough AS, Shelgikar AV, Chames MC, Armitage R, Chervin RD. Validation of Watch-PAT-200 against polysomnography during pregnancy. J Clin Sleep Med. 2012;8(3):287–294.
- 24. Onder NS, Akpinar ME, Yigit O, Gor AP. Watch peripheral arterial tonometry in the diagnosis of obstructive sleep apnea: influence of aging. Laryngoscope. 2012;122(6):1409–1414.
- 25. Pillar G, Berall M, Berry R, et al. Detecting central sleep apnea in adult patients using WatchPAT—a multicenter validation study. Sleep Breath. 2020;24(1):387–398.
- 26. Weimin L, Rongguang W, Dongyan H, Xiaoli L, Wei J, Shiming Y. Assessment of a portable monitoring device WatchPAT 200 in the diagnosis of obstructive sleep apnea. Eur Arch Otorhinolaryngol. 2013;270(12):3099–3105.
- 27. Garg N, Rolle AJ, Lee TA, Prasad B. Home-based diagnosis of obstructive sleep apnea in an urban population. J Clin Sleep Med. 2014;10(8):879–885.
- 28. Bland JM, Altman DG. Statistical methods for assessing agreement between two methods of clinical measurement. Lancet. 1986;1(8476):307–310.
- 29. Critchley LA, Critchley JA. A meta-analysis of studies using bias and precision statistics to compare cardiac output measurement techniques. J Clin Monit Comput. 1999;15(2):85–91.
- 30. Peyton PJ, Chong SW. Minimally invasive measurement of cardiac output during surgery and critical care: a meta-analysis of accuracy and precision. Anesthesiology. 2010;113(5):1220–1235.
- 31. Levendowski DJ, Zack N, Rao S, et al. Assessment of the test-retest reliability of laboratory polysomnography. Sleep Breath. 2009;13(2):163–167.
- 32. McHugh ML. Interrater reliability: the kappa statistic. Biochem Med (Zagreb). 2012;22(3):276–282.
- 33. Glas AS, Lijmer JG, Prins MH, Bonsel GJ, Bossuyt PM. The diagnostic odds ratio: a single indicator of test performance. J Clin Epidemiol. 2003;56(11):1129–1135.
- 34. Šimundić AM. Measures of diagnostic accuracy: basic definitions. EJIFCC. 2009;19(4):203–211.
- 35. Deeks JJ, Altman DG. Diagnostic tests 4: likelihood ratios. BMJ. 2004;329(7458):168–169.
- 36. Chung F, Abdullah HR, Liao P. STOP-Bang questionnaire: a practical approach to screen for obstructive sleep apnea. Chest. 2016;149(3):631–638.
- 37. Moses LE, Shapiro D, Littenberg B. Combining independent studies of a diagnostic test into a summary ROC curve: data-analytic approaches and some additional considerations. Stat Med. 1993;12(14):1293–1316.