Abstract
Study objectives
Changes in sleep with aging are associated with risk for Alzheimer’s and other neurological diseases and with risk of accidents, and can be a predictor of health decline. For this reason, continuous sleep monitoring is of great interest for researchers, clinicians, and family members. The objective of this study was to assess the validity of consumer sleep-tracking devices in older relative to young adults.
Methods
Analyses were based on one night of sleep assessed in young (19-24 years; n = 13) and older adults (56-80 years; n = 19). Participants wore sleep-tracking wearables (Fitbit Sense 2, Oura Ring), and nearables (Withings Sleep Mat, Sleep Score Max) were positioned nearby. Sleep measures were compared to polysomnography.
Results
Results suggest that devices may be less accurate in older relative to young adults in commonly reported measures. In older adults, devices underestimated total sleep time (Fitbit bias = -74.5 minutes, p = .012; Oura bias = -75.5, p < .0001; Withings bias = -45.7, p = .083; Sleep Score Max bias = -56.5, p = .001) and wake after sleep onset (Fitbit bias = -44.1, p = .012; Oura bias = -19.8, p = .282; Withings bias = -32.1, p = .129; Sleep Score Max bias = -71.6, p = .006) and overestimated deep sleep time (Fitbit bias = 29.3, p = .013; Oura bias = 71.5, p = .001; Withings bias = 97.4, p < .0001; Sleep Score Max bias = 88.8, p < .0001). Devices performed poorly in identifying individual sleep stages, particularly deep sleep. Limits of agreement were generally greater in older adults than younger adults across all measures, suggesting less precision in measuring older adults’ sleep.
Conclusions
Older adults interested in tracking their sleep and clinicians and researchers using consumer devices as a replacement for polysomnography should use caution when interpreting results from these devices.
This paper is part of the Consumer Sleep Technology Collection
Keywords: sleep tracking, devices, polysomnography, sleep, aging, wearables, nearables
Statement of Significance
Continuous monitoring of older adult sleep holds potential for predicting health decline and disease onset. While consumer sleep-tracking devices provide an acceptable option for in-home, long-term sleep tracking for young adults, our results suggest these devices are much less accurate when used in older adults. As such, wearable and nearable consumer sleep-tracking devices should be used with caution in older adults and more accurate hardware and software are needed for potential health interventions in older populations.
Introduction
Poor sleep is a harbinger of poor health in older adults. Short and fragmented sleep is associated with Alzheimer’s disease and related dementias and other neurodegenerative diseases and often precedes the onset of these diseases [1–3]. Poor sleep in older adults is also associated with risk of falls and accidents [4,5], cognitive decline [3,6], and generally poor health [7]. For these reasons, early detection of changes in sleep as individuals age is of great interest. Clinicians, researchers, individuals, and family members would all benefit from continuous sleep monitoring in-home to detect the subtle changes that may take place over time.
Consumer sleep tracking devices (CSTs) provide a possible solution. The number of CSTs has exploded in recent years, with devices claiming to track or measure sleep from wearable wrist, finger, or head-worn devices, to “nearables” including mattress pads and bedside devices. Consumers with an interest in tracking their own sleep are able to access these devices at a reasonable price and at relative convenience. In the world of research, CSTs represent a novel opportunity, offering the potential of lower participant burden, greater participant comfort, greater accessibility, lower costs, and less invasiveness relative to polysomnography (PSG). While research-grade actigraphy watches are a potential alternative, these devices often require time-consuming hand-scoring of data and are typically more expensive than consumer devices. Polysomnography is the gold standard of sleep measurement, but requires expensive equipment, a technician for setup and data scoring, and involves potentially disruptive electrode wires and sensors.
However, the benefits of CSTs presume that the device is accurate enough to serve as a reasonable substitute for PSG or research actigraphy, making validation an important criterion. Additionally, CSTs rely on algorithmic interpretation and generally do not make raw data available. Performance therefore depends on “black box” algorithms that may change, so each new generation of sensors must be validated [8]. While prior work has shown that some CSTs are fairly accurate in young adults for measuring total sleep time compared to PSG [9–12], limited work has assessed these measures in an older adult population. Much of the previous literature in older adults has focused on sleep diaries or actigraphy as a comparison, but these methods do not capture sleep periods in the same way as PSG [13–15]. A study including PSG in middle-aged (40-69 years) and younger adults (23-39 years) did find lower accuracy and agreement of the Oura Ring in the middle-aged cohort compared to younger adults [16]. Another study assessed the epoch-by-epoch agreement, bias, and accuracy of wearables (Dreem headband, Fitbit Sense, Oura Ring, Xiaomi Mi Band 7, Axtro Fit3) against PSG in young (18-30 years), middle-aged (31-50 years), and older adults (51-70 years), finding reduced accuracy of devices in older relative to young adults [17]. However, that work did not include nearables, and used a minimal PSG montage recorded in a laboratory setting.
It is necessary to assess these devices against the gold-standard PSG in older adults, as changes in sleep due to aging must be accounted for by these devices and their algorithms. Sleep in older adults differs notably from that of younger adults, including decreased total sleep time (TST), increased wake after sleep onset (WASO), and decreased slow-wave and deep sleep [18, 19]. Additionally, CSTs typically rely on proxy measures such as accelerometry, heart rate, or temperature, using these as part of an algorithm to estimate sleep without direct measures of the brain activity that would indicate sleep. In older adults, these proxies may be less clearly correlated with actual sleep. For example, CSTs often detect electrodermal activity [8], which may be included in their sleep detection algorithms, but older adults have significant differences in electrodermal activity patterns compared to young adults [20]. In addition to physiological differences, older adults are also more likely to be diagnosed with sleep disorders [21], which may not be accounted for by a CST’s algorithm. Given these sleep differences, it is important to assess the validity of CSTs directly in older adult populations. The present work aims to address these gaps by directly comparing performance of consumer sleep tracking wearable and nearable devices with PSG in a cohort of healthy older adults in the home, with a direct comparison group of younger adults.
Methods
Participants
The study initially enrolled 18 young adults and 24 older adults. Ten participants were removed from analysis: 3 young adults and 2 older adults were removed for not achieving at least 4.5 hours of total sleep time on the PSG night, and 2 younger adults and 3 older adults withdrew from the study. The two younger adults and two older adults withdrew for scheduling conflicts; one older adult withdrew due to device frustration. Participants included in final analyses were 19 older adults (13 female, mean age = 65.8 ± 7.0 years, age range = 56-80 years) and 13 younger adults (13 female, mean age = 20.9 ± 1.4 years, age range = 19-24 years). Participants were screened against diagnosed sleep, neurological, or psychiatric disorders, and scored within healthy ranges on the Telephone Interview for Cognitive Status (TICS) [22]. Eligible participants were also required to have a stable home Wi-Fi connection and be comfortable with smartphone technology. All participants gave written consent for the study.
Procedure
Study procedures were approved by the University of Massachusetts’s Institutional Review Board. Data collection took place over seven nights. Nights 1 and 3-7 included only the devices, to assess usability and comfort; PSG, used to assess the validity of sleep measures, was recorded only on night 2. On night 1, research technicians arrived at the participant’s home approximately two hours before the participant’s reported bedtime, connected the CST devices to the participant’s home internet, and oriented the participant to the devices by providing instructions on how to use them. To improve device accuracy, basic demographic information, including sex and age, was entered in each CST device app. Participants were instructed to begin using the CSTs to track their sleep. To minimize the burden of using multiple devices, participants were asked to use the devices only at night and to remove wearables during the day. On the second night, research technicians returned to the participant’s home to confirm that the devices were set up correctly and then apply the PSG. Participants were instructed to continue using the CST devices on their own for the remaining five nights and were provided with a manual reviewing device usage, as well as daily reminders to charge and use the devices.
Each morning, participants filled out an online survey that collected information on their sleep and wake times, their sleeping environment, such as bedsharing status, and anything that may have affected their sleep, such as caffeine intake. At the end of the week, they also completed surveys on the usability and comfort of each device. These surveys included the NASA Task Load Index [23] and the Comfort Rating Scale for Wearable Computers [24]. The NASA Task Load Index consists of 6 sliding scales from 1 to 21 on which participants rate the devices: mental demand, physical demand, temporal demand, frustration, performance, and effort. The Comfort Rating Scale for Wearable Computers consists of a series of statements related to the comfort of the devices, such as “The device is painful to wear,” which participants then rate on a Likert scale ranging from “Strongly Disagree” to “Strongly Agree.” These allowed participants to rate the comfort, frustration, and physical and mental demands of each device.
Devices
Four consumer sleep tracking devices were assessed: the Fitbit Sense 2 (version 194.61, Fitbit Inc., San Francisco, CA, USA), the Oura Ring (3rd generation version 2.8.41, Oura Inc., Oulu, Finland), the Sleep Score Max (version 2.1.0.1, SleepScore Labs, Carlsbad, CA, USA), and the Withings Sleep Mat (version 2481, Withings Inc., Issy-les-Moulineaux, France). Participants were also provided with a smartphone to use the corresponding apps for each of the devices.
The Fitbit Sense 2 is a wrist-worn device that uses accelerometry, electrodermal activity sensors, and photoplethysmography to estimate sleep metrics. The Oura Ring is a smart ring worn on the finger that uses accelerometry, skin temperature sensors, and photoplethysmography to estimate sleep metrics. Participants were provided with multiple sizing options for wearables (e.g., both a small and large wristband for the Fitbit and multiple sizes of ring for the Oura), so that both devices could be correctly fit to the participant. Both devices have corresponding phone apps to access data.
The other two devices were “nearables,” devices that are not directly worn by participants but placed in close proximity. The Withings Sleep Mat is a smart mat that was placed under participants’ mattress, and uses a pneumatic sensor to track movement, heart rate, and respiration. The Sleep Score Max is a bedside device that uses sonar to listen to breathing, snoring, and movement and also tracks light levels and room temperature. Both the Sleep Score Max and the Withings have their own phone apps that display the summarized data.
All devices reported light and deep sleep (LS, DS), REM sleep, WASO, and TST. Three devices (Fitbit, Oura, Sleep Score Max) reported sleep efficiency (SE). Three devices (Oura, Withings, Sleep Score Max) reported sleep onset latency (SOL). Data was extracted from each device via their corresponding smartphone apps. Epoch-by-epoch data was not available for all devices, so only summary measures were extracted.
Polysomnography
Polysomnography was collected using the Embletta MPR and the Embletta ST+ Proxy, an ambulatory device that allowed participants to move freely in their homes after application. The PSG montage included ten EEG electrodes (GND, REF, M1 & M2, F3 & F4, C3 & C4, O1 & O2), two electrooculogram (EOG) channels and four electromyogram (EMG) channels (two on the chin and one on each hand), as well as an electrocardiogram (EKG) sensor on the collarbones. Two respiratory sensor belts and a pulse oximeter on the finger were also included.
Analysis
Polysomnography data was recorded directly onto the Embletta MPR device and later exported via RemLogic. Polysomnography data was manually scored in 30s epochs according to American Academy of Sleep Medicine standards [25]. Possible stages included wake time, NREM1, NREM2, NREM3, and REM. Based on previous literature [12,26], NREM1 and NREM2 were summed together and classified as LS while NREM3 was classified as DS.
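The collapsing of scored epochs into summary measures can be illustrated with a minimal sketch. The stage labels (`W`, `N1`, `N2`, `N3`, `REM`), the function name, and the exact definitions of WASO and SE used here are assumptions for illustration; only the 30 s epoch length and the grouping of NREM1 and NREM2 into LS and NREM3 into DS come from the text.

```python
from collections import Counter

EPOCH_MINUTES = 0.5  # one 30 s epoch, as in the scoring described above

# Hypothetical stage labels; the grouping into LS and DS follows the text.
STAGE_TO_CLASS = {"W": "wake", "N1": "LS", "N2": "LS", "N3": "DS", "REM": "REM"}


def summarize_hypnogram(stages):
    """Collapse per-epoch stage labels into summary measures (minutes)."""
    classes = [STAGE_TO_CLASS[s] for s in stages]
    asleep = [i for i, c in enumerate(classes) if c != "wake"]
    if not asleep:
        raise ValueError("no sleep epochs scored")
    onset, final = asleep[0], asleep[-1]
    counts = Counter(classes)
    # Assumption: WASO counts wake epochs between sleep onset and the last sleep epoch.
    waso_epochs = sum(1 for c in classes[onset:final + 1] if c == "wake")
    tst_epochs = len(asleep)
    return {
        "SOL": onset * EPOCH_MINUTES,
        "TST": tst_epochs * EPOCH_MINUTES,
        "WASO": waso_epochs * EPOCH_MINUTES,
        "LS": counts["LS"] * EPOCH_MINUTES,
        "DS": counts["DS"] * EPOCH_MINUTES,
        "REM": counts["REM"] * EPOCH_MINUTES,
        # Assumption: SE is TST divided by the full recording length, in percent.
        "SE": 100.0 * tst_epochs / len(classes),
    }


# Made-up toy recording, far shorter than a real night.
night = ["W"] * 4 + ["N1"] * 2 + ["N2"] * 6 + ["W"] * 2 + ["N3"] * 4 + ["REM"] * 2 + ["W"] * 2
summary = summarize_hypnogram(night)
print(summary)
```

Because LS, DS, and REM partition the sleep epochs, their durations sum to TST, which makes a convenient sanity check on any scored night.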
To evaluate the differences in sleep tracking between gold-standard PSG data and the devices, analyses included Bland–Altman plots, bias and limits of agreement, Lin’s Concordance Correlation Coefficients (CCCs), and mean absolute percent error (MAPE). Bland–Altman plots were created following the framework developed by Menghini et al. [26]. Mean comfort, physical and mental demand, and frustration were also calculated to evaluate the devices from a user perspective, assessed after 7 days of device usage. Differences between device and PSG measures were tested using two-tailed paired t-tests, and effect sizes were reported as Cohen’s d for paired samples. Device failures and data loss across the full 7 days were also tabulated to assess the devices’ reliability.
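As an illustration of these agreement measures, the sketch below computes bias, 95% limits of agreement, MAPE, and Lin’s CCC from paired nightly values. It uses textbook definitions (limits of agreement as bias ± 1.96 SD of the differences, MAPE as the mean of per-night absolute percent errors, CCC from population moments); these are assumptions and may differ in detail from the framework the study followed. The function name and the example values are hypothetical.

```python
from statistics import mean, stdev


def agreement_metrics(device, psg):
    """Bias, 95% limits of agreement, MAPE (%), and Lin's CCC for paired values."""
    diffs = [d - p for d, p in zip(device, psg)]  # device minus PSG
    bias = mean(diffs)
    sd_diff = stdev(diffs)
    loa = (bias - 1.96 * sd_diff, bias + 1.96 * sd_diff)
    # Per-night absolute percent error; undefined when a PSG value is zero.
    mape = 100.0 * mean(abs(d - p) / p for d, p in zip(device, psg))
    # Lin's CCC from population (divide-by-n) variances and covariance.
    n = len(psg)
    mean_d, mean_p = mean(device), mean(psg)
    var_d = sum((x - mean_d) ** 2 for x in device) / n
    var_p = sum((y - mean_p) ** 2 for y in psg) / n
    cov = sum((x - mean_d) * (y - mean_p) for x, y in zip(device, psg)) / n
    ccc = 2.0 * cov / (var_d + var_p + (mean_d - mean_p) ** 2)
    return {"bias": bias, "loa": loa, "mape": mape, "ccc": ccc}


# Made-up TST values in minutes, for illustration only.
device_tst = [400, 420, 380, 450]
psg_tst = [410, 430, 400, 440]
print(agreement_metrics(device_tst, psg_tst))
```

Note that a per-night percent error is unstable when the PSG value is near zero, as can happen for deep sleep in older adults, so very large MAPE values for that measure should be read with this in mind.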
Results
Sample description
Participant demographics are presented in Table 1. Notably, the two groups scored similarly on the TICS and had similar BMI. Older and younger adults differed overall in average NREM1, NREM3, and WASO based on PSG. Older adults had higher WASO on average and spent much less time in DS (NREM3) in particular. They also spent more time in NREM1 than their younger adult counterparts. Older adults also experienced lower sleep efficiency compared to younger adults. These reflect common differences in sleep for older relative to young adults [18, 19] and provide assurance that age-related differences in sleep that may be of interest are indeed present in our participant sample.
Table 1.
Summary demographic and sleep data
| | Older adults | Younger adults |
|---|---|---|
| N | 19 | 13 |
| Female:Male | 13:6 | 12:1 |
| Age (years)* | 65.0 (6.96) | 21.2 (1.4) |
| Race & ethnicity | | |
| Asian | 0 | 7 |
| Black/African | 0 | 1 |
| White/Caucasian | 18 | 6 |
| Hispanic/Latinx | 0 | 3 |
| Middle Eastern/North African | 0 | 1 |
| More than one selected | 0 | 5 |
| DNR | 1 | 0 |
| Height (m) | 1.7 (0.1) | 1.6 (0.1) |
| Weight (kg) | 71.9 (12.56) | 66.8 (12.2) |
| BMI | 25.0 (3.32) | 24.9 (2.9) |
| TICS score | 36.9 (2.05) | 36.1 (2.2) |
| Bedsharing* | 0.75 (0.42) | 0.05 (0.12) |
| Sleep characteristics | | |
| TST | 464.6 (80.0) | 414.4 (68.7) |
| WASO* | 99.1 (68.3) | 38.1 (13.2) |
| NREM1* | 47.3 (24.2) | 10.4 (7.0) |
| NREM2 | 241.8 (70.0) | 216.2 (49.6) |
| NREM3* | 23.2 (38.0) | 93.5 (34.2) |
| REM | 65.3 (22.4) | 81.3 (33.3) |
| SE* | 72.7% (14.8%) | 86.0% (10.0%) |
| SOL | 52.2 (43.8) | 28.8 (46.7) |
DNR = did not respond; BMI = body mass index; TICS = Telephone Interview for Cognitive Status. Standard deviations are given in parentheses. Bedsharing was calculated for each participant as the proportion of nights on which the bed was shared, out of the nights (of seven) with available morning questionnaire data (where bedsharing status was reported), and then averaged across participants. Asterisks indicate statistically significant differences (p<.05) between older adults and younger adults.
Device usability and failures
Participants reported subjective comfort, frustration, and the physical and mental demands associated with the devices throughout the study (Table 2). The Sleep Score Max and Withings nearable devices were rated as significantly more comfortable than the wearables by both younger and older adults; however, the Sleep Score Max was also rated significantly higher in frustration than the Withings in both groups, and significantly higher in mental demand among younger adults (Table 3). Comfort, demand, and frustration metrics were similar between older adults and younger adults. The only significant differences were in the comfort ratings of the Fitbit and Oura, which older adults rated as slightly more comfortable than young adults did.
Table 2.
Comfort, mental demand, physical demand, and frustration metrics for each device by age group, with standard deviations in parentheses
| | Comfort | | | | Mental Demand | | | | Physical Demand | | | | Frustration | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Device | OA | YA | p-value | Hg | OA | YA | p-value | Hg | OA | YA | p-value | Hg | OA | YA | p-value | Hg |
| Fitbit | 14.3 (4.1) | 18.5 (5.9) | 0.037* | 0.84 | 4.0 (3.18) | 2.7 (1.8) | 0.147 | 0.47 | 3.6 (4.1) | 3.2 (2.4) | 0.682 | 0.13 | 5.0 (5.0) | 3.7 (2.87) | 0.357 | 0.30 |
| Oura | 17.3 (5.6) | 22.2 (6.2) | 0.034* | 0.79 | 2.2 (1.3) | 1.9 (1.3) | 0.608 | 0.18 | 3.3 (3.7) | 3.2 (3.0) | 0.978 | <0.01 | 5.0 (5.25) | 4.2 (3.26) | 0.579 | 0.18 |
| Withings | 10.4 (0.98) | 11.0 (2.8) | 0.395 | 0.39 | 2.7 (3.0) | 1.5 (0.8) | 0.115 | 0.49 | 2.3 (2.5) | 2.9 (4.3) | 0.625 | 0.19 | 3.7 (4.0) | 2.5 (3.3) | 0.362 | 0.31 |
| Sleep Score Max | 11.5 (3.2) | 11.8 (4.8) | 0.864 | 0.06 | 5.2 (4.67) | 4.0 (3.1) | 0.405 | 0.27 | 3.1 (2.6) | 3.3 (3.6) | 0.829 | 0.08 | 6.5 (6.1) | 7.4 (6.4) | 0.690 | 0.14 |
Comfort ratings were on a scale from 0 to 50, where lower ratings equate to more comfortable. For demand and frustration metrics the scale was 0 to 21, where lower ratings equate to lower demands. Asterisks indicate statistically significant difference between older and younger adults using a Welch’s unpaired t-test. Hg = Hedges g.
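To make the effect size and test in Table 2 concrete, the sketch below computes Hedges’ g and the Welch t statistic from summary statistics, using the Fitbit comfort ratings as input. The group sizes (19 older and 13 younger adults) are an assumption, since the number of completed surveys per device is not stated; the function names are hypothetical, and the small-sample correction shown is the common approximation 1 − 3/(4(n1 + n2) − 9).

```python
import math


def hedges_g(mean1, sd1, n1, mean2, sd2, n2):
    """Standardized mean difference (group 2 minus group 1) with small-sample correction."""
    pooled_sd = math.sqrt(((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / (n1 + n2 - 2))
    cohen_d = (mean2 - mean1) / pooled_sd
    correction = 1.0 - 3.0 / (4.0 * (n1 + n2) - 9.0)
    return cohen_d * correction


def welch_t(mean1, sd1, n1, mean2, sd2, n2):
    """Welch's unequal-variance t statistic and Welch-Satterthwaite degrees of freedom."""
    v1, v2 = sd1 ** 2 / n1, sd2 ** 2 / n2
    t_stat = (mean2 - mean1) / math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t_stat, df


# Fitbit comfort from Table 2: OA 14.3 (4.1), YA 18.5 (5.9).
# Assumed group sizes: 19 older adults, 13 younger adults.
g = hedges_g(14.3, 4.1, 19, 18.5, 5.9, 13)
t_stat, df = welch_t(14.3, 4.1, 19, 18.5, 5.9, 13)
print(round(g, 2), round(t_stat, 2), round(df, 1))
```

Under these assumed group sizes, g rounds to the 0.84 reported for Fitbit comfort in Table 2. Converting the t statistic to a p-value requires the t distribution, which is left to a statistics library.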
Table 3.
Between-device comparisons for comfort, mental demand, physical demand, and frustration metrics using paired Wilcoxon signed rank tests
| | Older Adults | | | Younger Adults | | |
|---|---|---|---|---|---|---|
| | Fitbit | Oura | Withings | Fitbit | Oura | Withings |
| Comfort | | | | | | |
| Oura | 0.089 | – | | 0.082 | – | |
| Withings | 0.001* | <0.001* | – | 0.002* | 0.002* | – |
| Sleep Score Max | 0.014* | 0.003* | 0.410 | 0.008* | 0.003* | 0.344 |
| Physical Demand | | | | | | |
| Oura | 0.886 | – | | 0.905 | – | |
| Withings | 0.205 | 0.399 | – | 0.514 | 0.753 | – |
| Sleep Score Max | 0.554 | 0.621 | 0.281 | 0.810 | 0.953 | 0.268 |
| Mental Demand | | | | | | |
| Oura | 0.020* | – | | 0.037* | – | |
| Withings | 0.046* | 0.751 | – | 0.013* | 0.089 | – |
| Sleep Score Max | 0.601 | 0.016* | 0.100 | 0.291 | 0.030* | 0.010* |
| Frustration | | | | | | |
| Oura | 0.755 | – | | 0.639 | – | |
| Withings | 0.123 | 0.243 | – | 0.169 | 0.090 | – |
| Sleep Score Max | 0.440 | 0.266 | 0.036* | 0.050* | 0.114 | 0.029* |
We considered device reliability by measuring data loss. Data loss included both technical errors, such as a failed Wi-Fi or Bluetooth connection, and user errors, such as failing to charge a device or to start a sleep session when necessary. Data loss on the PSG night did not differ significantly between groups or devices (Fig. 1). For wearable devices (Fitbit and Oura), greater data loss across the week occurred in older adults, although the difference was significant only for the Fitbit (Table 4; Fig. 2). For nearable devices (Withings and Sleep Score Max), greater data loss occurred in younger adults.
Figure 1.
Comparison of data loss on the PSG night. No significant differences were found between device data loss.
Table 4.
Data loss reported by device and group for the one night of PSG data collected as well as device data lost throughout the duration of the study. On the PSG night, there were no significant differences in data loss between groups, or between devices within groups (per a Fisher’s exact test)
| Device | Nights | Older adults | Young adults | p-value (between) | Odds ratio |
|---|---|---|---|---|---|
| Fitbit | PSG nights | 5.3% | 7.7% | >0.999 | 1.481 |
| | All nights | 13.6% | 4.1% | 0.047* | 0.350 |
| Oura | PSG nights | 21.1% | 7.7% | 0.625 | 0.364 |
| | All nights | 24.3% | 14.3% | 0.071 | 0.531 |
| Withings | PSG nights | 10.5% | 15.4% | >0.999 | 1.524 |
| | All nights | 9.3% | 22.5% | <0.001* | 4.287 |
| Sleep Score Max | PSG nights | 26.3% | 30.8% | >0.999 | 1.236 |
| | All nights | 25.7% | 38.8% | 0.042* | 1.841 |
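The between-group comparison of data loss can be sketched as a two-sided Fisher’s exact test on a 2×2 table of lost and retained nights. The counts below are inferred from the reported percentages and group sizes (1 of 19 older adults and 1 of 13 younger adults losing Fitbit data on the PSG night) and are therefore an assumption, as are the function names. The two-sided p-value here sums the probabilities of all tables no more likely than the observed one, which is one common convention.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total = row1 + row2
    denom = comb(total, col1)

    def prob(x):
        # Hypergeometric probability of x events in row 1 with all margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lo, hi = max(0, col1 - row2), min(row1, col1)
    p_obs = prob(a)
    # Small relative tolerance so ties with the observed table are included.
    p = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-7))
    return min(1.0, p)


def sample_odds_ratio(a, b, c, d):
    """Cross-product odds ratio of row 2 relative to row 1."""
    return (c * b) / (a * d)


# Rows: older adults, younger adults. Columns: nights lost, nights retained.
# Counts inferred from 5.3% of 19 and 7.7% of 13 (assumption).
p_value = fisher_exact_two_sided(1, 18, 1, 12)
odds = sample_odds_ratio(1, 18, 1, 12)
print(round(p_value, 3), round(odds, 2))
```

The simple cross-product ratio need not equal the odds ratios in Table 4 exactly, since statistical software often reports a different estimator (for example, a conditional maximum likelihood estimate).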
Figure 2.
Comparison of data loss throughout the full week of data collection. In older adults, data collection was most successful with the Withings and Fitbit, with significant differences between both devices and the Oura and Sleep Score Max. In younger adults, data collection was most successful with the Fitbit and Oura, with significant differences between those devices and the Withings and Sleep Score Max. * = p<.05; ** = p<.01; *** = p<.001; ns = no significance.
Sleep measurement
Device performance was compared to PSG for the following measures: TST, WASO, LS (compared to combined NREM1 and NREM2 from PSG), DS (compared to NREM3 from PSG), and REM sleep. In all analyses, participant BMI and bedsharing status were not found to have significant impacts on accuracy.
Total sleep time
In younger adults, all devices measured TST with no statistically significant difference from PSG (Table 5). In the older adult group, all devices except the Withings differed significantly from PSG-determined TST, and all four devices underestimated TST. Both bias and mean absolute percent error (MAPE) values for each device were larger in older adults than in younger adults. All devices misestimated TST by at least 40 minutes more in older than younger adults (Fig. 3). The device with the largest discrepancy between older and younger adults was the Oura, with an overall bias of –75.5 minutes for OAs and –15.5 minutes for YAs; however, the Oura also had the smallest discrepancy in limits of agreement (LOA) between older and younger adults, with the LOA only 10.7 minutes wider in older adults. Limits of agreement were wider in older adults than in younger adults across all devices, and CCCs were lower in older adults than younger adults across all devices.
Table 5.
TST for each device. Standard deviations are indicated in parentheses
| Device | Group | # Nights | PSG TST (SD) | Device TST (SD) | p-value | Cohen’s d | Bias | MAPE | LOA | CCC (CI) |
|---|---|---|---|---|---|---|---|---|---|---|
| Fitbit | OA | 18 | 464.3 (80.2) | 389.8 (109.2) | 0.012* | 0.67 | −74.5 | 16.0% | −293.1 – 144.1 | 0.24 (-0.11 - 0.54) |
| | YA | 12 | 415.2 (71.7) | 392.2 (56.3) | 0.148 | 0.45 | −23.1 | 5.6% | −123.7 – 77.6 | 0.64 (0.2 - 0.86) |
| Oura | OA | 15 | 471.8 (75.3) | 396.3 (63.4) | <0.0001* | 1.49 | −75.5 | 16.0% | −175.0 – 24.0 | 0.45 (0.15 - 0.68) |
| | YA | 12 | 420.1 (68.5) | 404.6 (60.4) | 0.288 | 0.32 | −15.5 | 3.7% | −109.6 – 78.7 | 0.7 (0.27 - 0.9) |
| Withings | OA | 17 | 480.9 (64.5) | 435.2 (106.6) | 0.083 | 0.45 | −45.7 | 9.5% | −245.4 – 154.1 | 0.29 (-0.09 - 0.6) |
| | YA | 11 | 399.0 (61.7) | 394.2 (51.0) | 0.732 | 0.11 | −4.8 | 1.2% | −94.0 – 84.4 | 0.67 (0.19 - 0.9) |
| Sleep Score Max | OA | 14 | 471.2 (79.5) | 414.7 (65.6) | 0.001* | 1.12 | −56.5 | 12.0% | −155.8 – 42.8 | 0.57 (0.23 - 0.79) |
| | YA | 9 | 413.5 (81.0) | 398.8 (69.7) | 0.135 | 0.55 | −14.7 | 3.6% | −66.7 – 37.3 | 0.92 (0.72 - 0.98) |
Asterisks indicate statistically significant differences between device and PSG (p-values obtained using a two-tailed paired t-test). Nights included reflect data loss on the night of PSG collection. TST is reported in minutes.
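The columns of Table 5 are linked by simple arithmetic, which the sketch below works through for the older adult rows. It assumes that the limits of agreement are bias ± 1.96 SD of the paired differences and that Cohen’s d for paired samples is the absolute mean difference divided by that SD; both are common conventions rather than details confirmed by the text, and the function name is hypothetical.

```python
import math

Z = 1.96  # assumed multiplier for 95% limits of agreement


def recover_from_loa(bias, loa_lower, loa_upper, n):
    """Recover the SD of differences, paired Cohen's d, and paired t statistic."""
    sd_diff = (loa_upper - loa_lower) / (2 * Z)
    cohen_d = abs(bias) / sd_diff
    t_stat = cohen_d * math.sqrt(n)  # paired t = mean difference / (SD / sqrt(n))
    return sd_diff, cohen_d, t_stat


# Older adult rows of Table 5: (bias, lower LOA, upper LOA, nights).
table5_oa = {
    "Fitbit": (-74.5, -293.1, 144.1, 18),
    "Oura": (-75.5, -175.0, 24.0, 15),
    "Withings": (-45.7, -245.4, 154.1, 17),
    "Sleep Score Max": (-56.5, -155.8, 42.8, 14),
}
recovered_d = {}
for device, row in table5_oa.items():
    sd_diff, cohen_d, t_stat = recover_from_loa(*row)
    recovered_d[device] = round(cohen_d, 2)
    print(device, round(sd_diff, 1), round(cohen_d, 2), round(t_stat, 2))

# Width of the Oura limits of agreement in older versus younger adults.
oura_width_gap = (24.0 - (-175.0)) - (78.7 - (-109.6))
print(round(oura_width_gap, 1))
```

Under these assumptions the recovered effect sizes match the Cohen’s d column for the older adult rows, and the Oura width gap reproduces the 10.7 minutes quoted in the text.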
Figure 3.
Bland–Altman plot showing the comparison of TST between CSTs and PSG (in minutes). Older adult data are indicated in green points and green solid lines (green = bias[mean(device – PSG)], light green = upper and lower LOA); younger adult data are indicated in gray points and gray dashed lines (dark gray = bias, light gray = upper and lower LOA).
Wake after sleep onset
All devices underestimated WASO in older adults, and this bias was significant in the Fitbit and Sleep Score Max (Table 6). In younger adults, the Fitbit and Withings overestimated WASO and the Oura and Sleep Score Max underestimated WASO (Fig. 4), but these differences were non-significant. Limits of agreement were wider across all devices for older adults than for younger adults, and MAPE was greater in older adults than younger adults in all devices but the Withings. CCCs were low for all devices in both groups, but were lower in older adults than in younger adults.
Table 6.
WASO for each device, separated by age group
| Device | Group | # Nights | PSG WASO | Device WASO | p-value | Cohen’s d | Bias | MAPE | LOA | CCC (CI) |
|---|---|---|---|---|---|---|---|---|---|---|
| Fitbit | OA | 18 | 97.9 (70.1) | 53.8 (24.0) | 0.012* | 0.66 | −44.1 | 45.0% | −174.3 – 86.08 | 0.14 (-0.07 - 0.35) |
| | YA | 12 | 37.0 (13.1) | 41.1 (12.4) | 0.400 | 0.25 | 4.1 | 11.1% | −27.8 – 36.0 | −0.17 (-0.39 - 0.64) |
| Oura | OA | 15 | 104.3 (72.2) | 84.5 (35) | 0.282 | 0.29 | −19.8 | 19.0% | −154.1 – 114.5 | 0.25 (-0.13 - 0.57) |
| | YA | 12 | 39.8 (12.2) | 36.0 (11.9) | 0.425 | 0.24 | −3.8 | 9.6% | −35.3 – 27.6 | 0.1 (-0.45 - 0.6) |
| Withings | OA | 17 | 105.1 (69.7) | 72.9 (54.8) | 0.129 | 0.39 | −32.1 | 30.6% | −194.3 – 130.1 | 0.11 (-0.31 - 0.5) |
| | YA | 11 | 38.6 (14.4) | 51.6 (37.4) | 0.255 | 0.36 | 13.1 | 33.9% | −57.3 – 83.5 | 0.18 (-0.21 - 0.51) |
| Sleep Score Max | OA | 14 | 101.0 (76.5) | 29.4 (23.6) | 0.006* | 0.88 | −71.6 | 70.9% | −230.6 – 87.4 | −0.01 (-0.18 - 0.16) |
| | YA | 9 | 38.51 (14.9) | 32.7 (25.4) | 0.439 | 0.27 | −5.8 | 15.2% | −48.0 – 36.3 | 0.45 (-0.12 - 0.79) |
Standard deviations are indicated in parentheses. Asterisks indicate statistically significant differences between device and PSG (p-values obtained using a two-tailed paired t-test). Nights included reflect data loss on the night of PSG collection. WASO is reported in minutes.
Figure 4.
Bland–Altman plot showing the comparison of WASO between CSTs and PSG (in minutes). Older adult data are indicated in black and green points and green solid lines (green = bias[mean(device – PSG)], light green = upper and lower LOA); younger adult data are indicated in gray points and gray dashed lines (dark gray = bias, light gray = upper and lower LOA).
All four Bland–Altman plots showed some visual signs of heteroscedasticity, which may indicate that the devices are worse at detecting WASO when more of it occurs. We further tested WASO for heteroscedasticity via studentized Breusch-Pagan tests, which showed no statistically significant heteroscedasticity for any device in either older or younger adults (all p-values above 0.05). However, this test may be underpowered given our small sample sizes.
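For readers unfamiliar with the test, the following is a minimal sketch of a studentized (Koenker) Breusch-Pagan test with a single predictor. It assumes the device-minus-PSG differences are regressed on the PSG value, which is one plausible setup and not necessarily the model used in the study; the function names and the WASO values are hypothetical. The statistic is n times the R² from regressing the squared residuals on the predictor, referred to a chi-square distribution with one degree of freedom.

```python
import math
from statistics import mean


def ols_fit(x, y):
    """Least-squares intercept and slope for y ~ x."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    return my - slope * mx, slope


def r_squared(x, y):
    """Coefficient of determination of y ~ x (0.0 if y is constant)."""
    intercept, slope = ols_fit(x, y)
    my = mean(y)
    ss_tot = sum((yi - my) ** 2 for yi in y)
    if ss_tot == 0:
        return 0.0
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    return 1.0 - ss_res / ss_tot


def studentized_breusch_pagan(x, y):
    """Return (LM statistic, p-value) for heteroscedasticity of y ~ x."""
    intercept, slope = ols_fit(x, y)
    squared_resid = [(yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y)]
    lm = max(0.0, len(x) * r_squared(x, squared_resid))
    # Survival function of a chi-square with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(lm / 2.0))
    return lm, p_value


# Made-up WASO values (minutes): PSG reference and device-minus-PSG differences.
psg_waso = [20, 35, 50, 80, 110, 150, 200, 240]
diff_waso = [5, -4, 8, -15, 20, -40, 55, -70]
print(studentized_breusch_pagan(psg_waso, diff_waso))
```

With only a handful of nights per group, a statistic of this kind has little power, which is consistent with the caution expressed above.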
Light sleep
Device-reported time spent in LS differed significantly from PSG for the Oura and Sleep Score Max in older adults, and for the Oura and Withings in younger adults (Table 7). CSTs struggled to recognize LS, the Fitbit being the only device that did not significantly differ in either group. MAPE remained higher when device reports differed significantly from PSG. Limits of agreement were greater in older adults than in younger adults across all devices (Fig. 5). CCCs were lower in older adults than younger adults across all devices, indicating poorer agreement between PSG and devices for older adults.
Table 7.
Light sleep for each device, separated by age group
| Device | Group | # Nights | PSG LS | Device LS | p-value | Cohen’s d | Bias | MAPE | LOA | CCC (CI) |
|---|---|---|---|---|---|---|---|---|---|---|
| Fitbit | OA | 18 | 285.0 (79.6) | 261.2 (55.8) | 0.279 | 0.27 | −23.8 | 8.4% | −195.4 – 147.8 | 0.18 (-0.26 - 0.56) |
| | YA | 12 | 222.5 (50.4) | 218.7 (37.5) | 0.701 | 0.11 | −3.8 | 1.7% | −69.8 – 62.1 | 0.71 (0.31 - 0.9) |
| Oura | OA | 15 | 297.2 (87.5) | 220.9 (59.3) | 0.012* | 0.75 | −76.3 | 25.7% | −276.0 – 123.4 | 0.05 (-0.27 - 0.35) |
| | YA | 12 | 227.8 (52.9) | 146.4 (56.5) | <0.001* | 1.9 | −81.4 | 35.7% | −164.5 – 1.7 | 0.32 (0.03 - 0.55) |
| Withings | OA | 17 | 309.3 (76.6) | 265.1 (81.6) | 0.133 | 0.40 | −44.3 | 14.3% | −263.0 – 174.5 | 0.01 (-0.42 - 0.42) |
| | YA | 11 | 218.7 (51.1) | 179.4 (56.6) | 0.034* | 0.74 | −39.4 | 18.0% | −143.6 – 64.9 | 0.4 (-0.09 - 0.73) |
| Sleep Score Max | OA | 14 | 299.1 (86.9) | 246.4 (64.9) | 0.038* | 0.62 | −52.7 | 17.6% | −220.4 – 115.0 | 0.3 (-0.13 - 0.63) |
| | YA | 9 | 223.1 (51.5) | 232.1 (42.8) | 0.605 | 0.18 | 9.0 | 4.0% | −89.2 – 107.2 | 0.43 (-0.25 - 0.83) |
Standard deviations are indicated in parentheses. Asterisks indicate statistically significant differences between device and PSG (p-values obtained using a two tailed paired t-test). Nights included reflect data loss on the night of PSG collection. LS is reported in minutes.
Figure 5.
Bland–Altman plot showing the comparison of light sleep (LS) between CSTs and PSG (in minutes). Older adult data are indicated in green points and green solid lines(green = bias[mean(device – PSG)], light green = upper and lower LOA); younger adult data are indicated in gray points and gray dashed lines(dark gray = bias, light gray = upper and lower LOA). Light sleep values from PSG are the sum of epochs scored as NREM1 and NREM2.
Deep sleep
The Oura differed significantly from PSG in reported DS in both older and younger adults (Table 8). The Fitbit, Withings, and Sleep Score Max differed significantly from PSG only in the older adult group; the Fitbit had the narrowest limits of agreement in both older and younger adults (Fig. 6). Limits of agreement and MAPE were greater in older adults than younger adults across all devices, and bias was greater in older adults for all devices except the Oura. CCCs were low across all devices and groups, but lower in older adults than in younger adults.
Table 8.
Deep Sleep for each device, separated by age group
| Device | Group | Nights included | PSG DS | Device DS | p-value | Cohen’s d | Bias | MAPE | LOA | CCC (CI) |
|---|---|---|---|---|---|---|---|---|---|---|
| Fitbit | OA | 18 | 25.9 (39.3) | 55.2 (23.1) | 0.013* | 0.68 | 29.3 | 113.2% | −55.6 – 114.2 | 0.07 (-0.24 - 0.36) |
| | YA | 12 | 95.6 (34.85) | 77.3 (14.1) | 0.083 | 0.55 | −18.3 | 19.1% | −83.3 – 46.8 | 0.18 (-0.16 - 0.48) |
| Oura | OA | 15 | 16.9 (37.2) | 88.3 (52.0) | 0.001* | 1.06 | 71.5 | 423.3% | −60.3 – 203.4 | −0.05 (-0.26 - 0.17) |
| | YA | 12 | 96.5 (33.9) | 177.5 (48.0) | <0.0001* | 1.75 | 81.0 | 84.0% | −10.0 – 172.1 | 0.12 (-0.08 - 0.31) |
| Withings | OA | 17 | 13.5 (22.4) | 110.9 (65.7) | <0.0001* | 1.33 | 97.4 | 720.0% | −45.6 – 240.5 | −0.03 (-0.13 - 0.07) |
| | YA | 11 | 94.5 (37.3) | 129.2 (47.07) | 0.068 | 0.62 | 34.7 | 36.7% | −75.7 – 145.1 | 0.09 (-0.36 - 0.5) |
| Sleep Score Max | OA | 14 | 12.7 (20.6) | 101.5 (43.9) | <0.0001* | 1.81 | 88.8 | 698.3% | −7.5 – 185.1 | −0.01 (-0.1 - 0.09) |
| | YA | 9 | 91.7 (36.0) | 95.3 (22.7) | 0.794 | 0.09 | 3.7 | 4.0% | −76.0 – 83.3 | 0.08 (-0.52 - 0.63) |
Standard deviations are indicated in parentheses. Asterisks indicate statistically significant differences between device and PSG (p-values obtained using a two tailed paired t-test). Nights included reflect data loss on the night of PSG collection.
Figure 6.
Bland–Altman plot showing the comparison of deep sleep (DS) between CSTs and PSG (in minutes). Older adult data are indicated in green points and green solid lines (green = bias[mean(device – PSG)], light green = upper and lower LOA); younger adult data are indicated in gray points and gray dashed lines (dark gray = bias, light gray = upper and lower LOA). Deep sleep values for PSG include all epochs scored as NREM3.
REM sleep
Only the Fitbit was significantly different from PSG in older adults, and no devices were significantly different in younger adults (Table 9). However, limits of agreement were wider in all devices for older adults (Fig. 7), and CCCs were low across all devices and groups. CCCs were lower in older adults than in younger adults in all devices except the Oura.
Table 9.
REM sleep for each device, separated by age group
| Device | Group | Nights included | PSG REM (minutes) | Device REM (minutes) | p-value | Cohen’s d | Bias | MAPE | LOA | CCC (CI) |
|---|---|---|---|---|---|---|---|---|---|---|
| Fitbit | OA | 18 | 67.7 (22.4) | 90.4 (36.3) | 0.049* | 0.52 | 22.7 | 33.5% | −63.2 – 108.6 | −0.04 (-0.37 – 0.3) |
| Fitbit | YA | 12 | 83.3 (33.9) | 96.2 (26.6) | 0.169 | 0.42 | 12.9 | 15.5% | −46.6 – 72.3 | 0.46 (-0.05 – 0.78) |
| Oura | OA | 15 | 64.4 (23.9) | 87.0 (50.9) | 0.106 | 0.45 | 22.7 | 35.2% | −77.0 – 122.3 | 0.16 (-0.2 – 0.47) |
| Oura | YA | 12 | 82.0 (34.7) | 80.7 (38.4) | 0.910 | 0.03 | −1.3 | 1.6% | −77.5 – 74.9 | 0.12 (-0.08 - 0.31) |
| Withings | OA | 17 | 66.7 (22.7) | 76.5 (54.8) | 0.521 | 0.43 | 9.9 | 14.8% | −111.3 – 131.0 | −0.03 (-0.13 - 0.07) |
| Withings | YA | 11 | 74.1 (29.7) | 85.6 (40.7) | 0.238 | 0.38 | 11.6 | 15.6% | −48.3 – 71.3 | 0.60 (0.12 - 0.85) |
| Sleep Score Max | OA | 14 | 67.9 (23.0) | 66.2 (26.4) | 0.869 | 0.04 | −1.6 | 2.4% | −73.4 – 70.1 | −0.09 (-0.57 - 0.44) |
| Sleep Score Max | YA | 9 | 83.0 (39.0) | 70.9 (19.3) | 0.242 | 0.42 | −12.8 | 15.3% | −72.3 – 46.7 | 0.47 (0.01 - 0.76) |
Standard deviations are indicated in parentheses. Asterisks indicate statistically significant differences between device and PSG (p-values obtained using a two tailed paired t-test). Nights included reflect data loss on the night of PSG collection.
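As a worked check on how the tabulated statistics relate to one another, the Python sketch below recovers the SD of the paired differences from the limits of agreement and derives a paired-samples effect size and t statistic from it. It assumes that the limits are bias ± 1.96 × SD and that Cohen's d is the mean difference divided by the SD of the differences; neither convention is stated in this section, and the helper names are hypothetical. Under those assumptions, the Fitbit older adult row of Table 9 yields d ≈ 0.52, matching the reported value.

```python
from math import sqrt


def sd_from_loa(lower, upper, z=1.96):
    """SD of paired differences implied by limits of agreement (assumes bias +/- z*SD)."""
    return (upper - lower) / (2 * z)


def cohens_dz(bias, sd_diff):
    """Paired-samples effect size: absolute mean difference over SD of differences."""
    return abs(bias) / sd_diff


def paired_t(bias, sd_diff, n):
    """Paired t statistic: mean difference over its standard error."""
    return bias / (sd_diff / sqrt(n))


# Fitbit, older adults, REM (Table 9): bias 22.7, LOA -63.2 to 108.6, 18 nights.
sd = sd_from_loa(-63.2, 108.6)
d = cohens_dz(22.7, sd)
t = paired_t(22.7, sd, 18)
print(f"SD of differences = {sd:.1f} min, d = {d:.2f}, t = {t:.2f} (df = 17)")
```

The same arithmetic can be applied to other rows to see how bias, limits of agreement, and effect size fit together, though small discrepancies are expected because the tabulated values are rounded.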
Figure 7.
Bland–Altman plot showing comparison of time in REM between CSTs and PSG. Older adult data are indicated in green points and green solid lines (green = bias [mean(device – PSG)], light green = upper and lower LOA); younger adult data are indicated in gray points and gray dashed lines (dark gray = bias, light gray = upper and lower LOA).
Sleep efficiency
All three devices that reported sleep efficiency overestimated it in older adults, and limits of agreement were wider in older adults in all devices (Fig. 8). The Oura Ring also overestimated sleep efficiency in younger adults; however, this bias was nearly twice as large in older adults, and its CCC was again lower in older adults (Table 10). The Fitbit and Sleep Score Max underestimated sleep efficiency slightly in younger adults, but this difference was not significant for either device.
Figure 8.
Bland–Altman plot showing comparison of sleep efficiency (SE) between CSTs and PSG. Older adult data are indicated in green points and green solid lines (green = bias [mean(device – PSG)], light green = upper and lower LOA); younger adult data are indicated in gray points and gray dashed lines (dark gray = bias, light gray = upper and lower LOA).
Table 10.
Sleep efficiency (SE) for each device, separated by age group
| Device | Group | Nights included | PSG SE (%) | Device SE (%) | p-value | Cohen’s d | Bias | MAPE | LOA | CCC (CI) |
|---|---|---|---|---|---|---|---|---|---|---|
| Fitbit | OA | 18 | 72.6 (15.4) | 81.1 (17.2) | 0.087* | 0.44 | 8.5 | 11.80% | −46.23 – 29.12 | 0.27 (-0.15 – 0.61) |
| Fitbit | YA | 12 | 86.2 (10.4) | 81.9 (19.3) | 0.492 | 0.21 | −4.3 | 5.00% | −29.61 – 41.58 | 0.18 (-0.17 – 0.48) |
| Oura | OA | 15 | 70.8 (15.4) | 82.6 (6.4) | 0.008* | 0.79 | 11.8 | 16.70% | −39.9 – 18.2 | 0.18 (-0.08 – 0.41) |
| Oura | YA | 12 | 85.4 (10.2) | 91.7 (3.4) | 0.038* | 0.68 | 6.2 | 7.30% | −11.14 – 3.52 | 0.37 (-0.04 – 0.68) |
| Sleep Score Max | OA | 14 | 71.8 (16.5) | 85.7 (6.6) | 0.002* | 1.02 | 14 | 19.40% | −40.7 – 12.8 | 0.25 (0.01 – 0.46) |
| Sleep Score Max | YA | 9 | 84.7 (11.8) | 84.7 (10.4) | 0.986 | 0.01 | −0.1 | 0.08% | −18.21 – 22.11 | 0.11 (-0.48 – 0.63) |
Standard deviations are indicated in parentheses. Asterisks indicate statistically significant differences between device and PSG (p-values obtained using a two tailed paired t-test). Nights included reflect data loss on the night of PSG collection. Sleep efficiency was not calculated by the Withings mat.
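The MAPE values in these tables appear close to the absolute bias divided by the mean PSG value (for example, 8.5 / 72.6 ≈ 11.7% for the Fitbit in older adults), although the exact formula is not given in this section. The Python sketch below contrasts that aggregate quantity with a per-night mean absolute percentage error, since the two can differ considerably; both function names and the example values are hypothetical.

```python
from statistics import mean


def mape_per_night(device, psg):
    """Mean of nightly absolute percentage errors (undefined if any PSG value is 0)."""
    return 100 * mean(abs(d - p) / p for d, p in zip(device, psg))


def percent_error_of_means(device, psg):
    """Absolute difference of the group means as a percentage of the PSG mean."""
    return 100 * abs(mean(device) - mean(psg)) / mean(psg)


# Hypothetical sleep-efficiency percentages for four nights (not study data).
psg_se = [60.0, 80.0, 90.0, 70.0]
device_se = [80.0, 85.0, 80.0, 85.0]

per_night = mape_per_night(device_se, psg_se)
of_means = percent_error_of_means(device_se, psg_se)
print(f"per-night MAPE = {per_night:.1f}%, error of means = {of_means:.1f}%")
```

Because over- and underestimates on different nights cancel in the mean bias but not in the per-night version, the aggregate figure can understate night-to-night error.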
Sleep onset latency
All three devices that reported sleep onset latency underestimated it in older adults, and limits of agreement were wider in older adults in all devices (Fig. 9). MAPE was also higher in older adults for all three devices, and CCCs were lower in older adults for the Oura and Sleep Score Max (Table 11). Neither the Withings nor the Sleep Score Max deviated significantly from PSG in their measurement of sleep onset latency, although the Withings was trending towards a significant difference in older adults (p = .0684).
Figure 9.
Bland–Altman plot showing comparison of sleep onset latency (SOL) between CSTs and PSG. Older adult data are indicated in green points and green solid lines (green = bias [mean(device – PSG)], light green = upper and lower LOA); younger adult data are indicated in gray points and gray dashed lines (dark gray = bias, light gray = upper and lower LOA).
Table 11.
Sleep onset latency (SOL) for each device, separated by age group
| Device | Group | Nights included | PSG SOL (minutes) | Device SOL (minutes) | p-value | Cohen’s d | Bias | MAPE | LOA | CCC (CI) |
|---|---|---|---|---|---|---|---|---|---|---|
| Oura | OA | 15 | 57.2 (46.9) | 9.6 (6.2) | 0.002* | 0.98 | −47.6 | 83.2% | −51.0 – 140.0 | −0.02 (-0.09 – 0.05) |
| Oura | YA | 12 | 30.6 (48.) | 5.6 (3.6) | 0.096* | 0.53 | −25.0 | 81.7% | −33.8 – 59.6 | 0.05 (-0.10 – 0.20) |
| Withings | OA | 17 | 57.4 (43.4) | 32.4 (26.7) | 0.068 | 0.47 | −24.9 | 43.5% | −78.2 – 128.1 | −0.05 (-0.39 – 0.30) |
| Withings | YA | 11 | 23.9 (47.5) | 32.1 (27.2) | 0.602 | 0.16 | 8.2 | 34.2% | −76.1 – 43.7 | −0.08 (-0.35 – 0.20) |
| Sleep Score Max | OA | 14 | 55.9 (49.3) | 28.9 (31.1) | 0.111 | 0.46 | −27.0 | 48.2% | −88.8 – 142.8 | −0.02 (-0.41 – 0.37) |
| Sleep Score Max | YA | 9 | 35.5 (55.3) | 26.9 (30.3) | 0.376 | 0.31 | −8.6 | 24.3% | −34.69 – 37.56 | 0.66 (0.28 – 0.86) |
Standard deviations are indicated in parentheses. Asterisks indicate statistically significant differences between device and PSG (p-values obtained using a two tailed paired t-test). Nights included reflect data loss on the night of PSG collection. Sleep onset latency was not calculated by the Fitbit.
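The concordance correlation coefficients (CCC) reported in these tables penalize both scatter and systematic offset. The minimal Python sketch below implements Lin's concordance correlation coefficient using population (1/n) moments; whether the study used this exact variant is an assumption, and the function name and example values are hypothetical. It illustrates why a device can track PSG night to night yet still show low concordance when it is consistently offset.

```python
from statistics import mean


def lins_ccc(x, y):
    """Lin's concordance correlation coefficient using population (1/n) moments."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need at least two paired observations")
    n = len(x)
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x) / n
    syy = sum((b - my) ** 2 for b in y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    # The squared mean difference in the denominator penalizes systematic bias.
    return 2 * sxy / (sxx + syy + (mx - my) ** 2)


# Hypothetical sleep onset latencies in minutes (not study data).
psg_sol = [10.0, 20.0, 30.0, 40.0]
device_offset = [30.0, 40.0, 50.0, 60.0]  # perfectly correlated, 20 min too long

print(f"identical series: {lins_ccc(psg_sol, psg_sol):.2f}")
print(f"constant offset:  {lins_ccc(psg_sol, device_offset):.2f}")
```

In this example the offset series has a Pearson correlation of 1 with the reference, but its concordance is well below 1 because of the constant 20-minute difference.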
Discussion
These results provide insight into the unique user experience and validity of consumer sleep-tracking devices in older compared to young adults. We provide strong evidence that the agreement of wearables and nearables with PSG in assessments of sleep is reduced in older adults relative to their performance in young adults.
Device usability, comfort, and data loss
Subjectively, all participants reported similar levels of device usability and comfort. While the wearables were not as comfortable as the nearables, the older adults were less bothered by them. Older and younger adults reported generally similar levels of mental demand, physical demand, and frustration with devices. However, there was a substantial amount of data loss during the study period, both on the night of PSG comparison and throughout the study week. Data loss on the PSG night generally reflected technical errors or failure to record, since study staff were present to ensure devices were charged, worn, and connected. Data loss over the full week may represent user error, including failures to charge, wear, or start a sleep session on a device. For younger adults, week-long device data loss was lowest in the wearable devices (Fitbit and Oura), which may be due to greater baseline familiarity with wearable devices than with nearables. The high levels of data loss, particularly for the Sleep Score Max, were unexpected in this group. It is possible that the need to manually start a sleep session with the Sleep Score Max ran counter to what they were familiar with and that prior knowledge overrode the device training they received. According to a framework developed by Kim and Choudhury (2020), when deciding whether or not to adopt a wearable activity tracker into daily life, older adults considered the difficulty of learning how to use a device an important factor in their decision, while younger adults did not factor the difficulty of learning how to use the device into their decision at all [27]. This predisposition could have led our older adult participants to be more engaged when learning how to use the devices, and more cognizant of connectivity issues and other data loss than young adult participants.
For older adults, week-long device data loss fell along a spectrum: the most successful device was the one that required no input after setup (Withings Sleep Mat), and the least successful was the one requiring the most user input in the form of starting a sleep session (Sleep Score Max). The devices with intermediate amounts of data loss were those requiring intermediate amounts of user input (e.g., Fitbit Sense and Oura Ring, which required that a participant charge and wear them but not interact directly). This suggests that reducing the amount of user input may substantially improve data collection when using these devices, particularly in older adults who may be less accustomed to CST usage than younger adults. From a consumer perspective, this may not matter, since adults using these devices for longer than a week are more likely to habituate to higher burden devices if they are motivated to track their own sleep. For research purposes, however, considering user burden would be vital to prevent data loss.
Sleep measurement
Overall, there were many differences in the characterization of sleep in older versus younger adults across all devices. Limits of agreement were wider across all categories in all devices for older adults than for younger adults, suggesting less precision in measuring older adults’ sleep overall. All devices underestimated TST in all participants, but this bias was greater, and only significant, in older adults. When devices differed significantly from PSG in older adults, effect sizes were moderate (Fitbit, p=.0115, d = 0.67) to high (Oura, p<.0001, d = 1.49, and Sleep Score Max, p=.0011, d = 1.12). Given that all devices also underestimated WASO and sleep onset latency in older adults, this may reflect difficulty in accurately detecting sleep or wake onset in older adults. For older adult consumers interested in their own sleep, this could cause anxieties over perceived lack of sleep, given the reduction in TST already present with age. While the results in younger adults suggest that CSTs are useful for measuring TST and other general sleep measures in this age group, researchers should use these devices with caution in an older adult population, despite the relative convenience of a CST over PSG.
While wake after sleep onset seemed to be measured relatively accurately in younger adults, WASO was underestimated by all devices in older adults, and limits of agreement were much wider in older adults. There was a proportional bias in older adults in that increased WASO resulted in less accurate device measurement. Effect sizes were consistently stronger in older adults than in younger adults. CCCs indicated low concordance with PSG across all devices and were consistently lower in older adults. This is a critical consideration for potential use of these devices for detecting age-related changes in sleep, as increased sleep fragmentation is one of the earliest changes in sleep in adults [28]. WASO is a major risk factor for cognitive decline [2,3], and as such is a parameter of interest in older adults for researchers and clinicians alike.
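One common way to quantify a proportional bias of this kind is to regress the device-minus-PSG differences on the reference value and inspect the slope. The Python sketch below fits that line by ordinary least squares; regressing on the PSG value (as opposed to the mean of the two methods) is an assumption made here for illustration, and the function name and example values are hypothetical.

```python
from statistics import mean


def proportional_bias(device, psg):
    """Least-squares slope and intercept of (device - PSG) regressed on PSG."""
    diffs = [d - p for d, p in zip(device, psg)]
    mp, md = mean(psg), mean(diffs)
    sxx = sum((p - mp) ** 2 for p in psg)
    if sxx == 0:
        raise ValueError("PSG values must vary to estimate a slope")
    slope = sum((p - mp) * (e - md) for p, e in zip(psg, diffs)) / sxx
    intercept = md - slope * mp
    return slope, intercept


# Hypothetical WASO minutes (not study data): error grows as PSG WASO grows.
psg_waso = [20.0, 40.0, 60.0, 80.0, 100.0]
device_waso = [18.0, 30.0, 40.0, 50.0, 60.0]

slope, intercept = proportional_bias(device_waso, psg_waso)
print(f"slope = {slope:.2f} min per min of PSG WASO, intercept = {intercept:.1f} min")
```

A slope near zero would indicate a constant bias, whereas a clearly negative slope, as in this example, means the device underestimates WASO more on nights with more wake.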
Consumer sleep tracking devices have long had trouble accurately assessing sleep stages [29]. Here we found that estimates of light sleep were of mixed accuracy in both younger and older adults. The Oura had significant differences from PSG in both groups, and the Withings and the Sleep Score Max had lower accuracy in older adults. The Fitbit provided reasonably accurate estimates of light sleep for young adults but not older adults. Once again, CCCs reflected lowered concordance in older adults compared to younger adults.
Deep sleep was also rather poorly estimated by the CSTs, particularly in older adults, for whom deep sleep was drastically overestimated. The Fitbit was the best performer, particularly in the young adult group. However, MAPE and limits of agreement were greater in older adults for all devices, as was bias for all devices except the Oura, suggesting less accurate results for older adults regardless of device selection. Effect sizes were also very strong, and higher in older adults than in younger adults for all devices except the Oura. CCCs were generally low, reflecting the devices’ difficulty with specific sleep staging, but CCCs were consistently lower in the older adult sample. Given the ties between deep sleep, health, and learning, researchers may be particularly interested in accurate staging of deep sleep, but these results suggest that CSTs may not be a viable alternative to PSG in this age group. For consumers, an inflated amount of deep sleep may give a false impression of sleep quality.
REM sleep was the sleep stage with the fewest group differences, with only the Fitbit having significant differences from PSG in older adults, and all devices with relatively low bias in REM detection in younger adults on average. The relative accuracy of REM staging compared to light and deep sleep may be due to the close ties between heart rate variability and REM sleep [30] and the success in these device sensors in capturing heart rate. Nonetheless, REM was frequently misestimated by as much as 100 minutes, and CCCs reflected this non-concordance.
Sleep efficiency was significantly different across all devices for older adults, and it was consistently overestimated by CSTs in this group. This may be due to a combination of inaccurate WASO measurement and inaccurate sleep onset measurement. The Oura had significant differences in sleep efficiency as compared to PSG, despite relatively accurate WASO measurement, suggesting that this may be driven by its low performance in sleep onset latency. In contrast, the Sleep Score Max’s lower performance in sleep efficiency in older adults is more likely to be driven by its struggles in WASO detection, as its sleep onset latency measurement was more accurate.
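To illustrate how errors in sleep onset latency and WASO can propagate into sleep efficiency, the Python sketch below uses the simplified relation SE = 100 × (TIB − SOL − WASO) / TIB. This formula, the fixed time in bed, and all of the numbers are assumptions for illustration only; each device may define its recording window and sleep efficiency differently.

```python
def sleep_efficiency(tib, sol, waso):
    """Sleep efficiency (%) assuming TST = TIB - SOL - WASO (a simplification)."""
    if tib <= 0:
        raise ValueError("time in bed must be positive")
    tst = tib - sol - waso
    return 100 * tst / tib


# Hypothetical night with 480 min in bed (not study data).
reference = sleep_efficiency(tib=480, sol=57, waso=80)
# Same night if a device underestimates SOL but measures WASO exactly.
sol_error_only = sleep_efficiency(tib=480, sol=10, waso=80)
# Same night if a device underestimates WASO but measures SOL exactly.
waso_error_only = sleep_efficiency(tib=480, sol=57, waso=30)

print(f"reference SE      = {reference:.1f}%")
print(f"SOL underestimate = {sol_error_only:.1f}%")
print(f"WASO underestimate = {waso_error_only:.1f}%")
```

Under this simplification, underestimating either quantity raises the computed sleep efficiency, so a device can overestimate sleep efficiency through either route.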
Generally, sleep onset latency was measured slightly more accurately, with the nearable devices (Withings and Sleep Score Max) providing estimates of sleep onset latency that did not differ significantly from PSG. However, lower CCCs in older adults, higher MAPE, and higher bias suggest that CSTs still struggled to categorize sleep onset accurately, and in older adults sleep onset latency was underestimated by all three devices. This underestimation of sleep onset latency may have contributed to the overestimation of sleep efficiency in older adults.
While all devices broadly gave relatively accurate results in younger adults, there was no one device or category (wearable vs nearable) that was most accurate in measuring older adults’ sleep. When considering levels of concordance and limits of agreement, the Oura and Sleep Score Max performed the best in measuring TST while the Oura and Fitbit were most successful in measuring WASO; however the Oura and Sleep Score Max had significant deviations from PSG in TST with a strong effect size, and the Fitbit had significant deviations in WASO measurement with a moderate effect size. In measuring specific sleep stages, the Sleep Score Max performed slightly better in measuring light sleep in older adults, the Fitbit in measuring deep sleep, and the Oura in measuring REM sleep. When considering bias, the nearables (Withings and Sleep Score Max) seemed to be slightly more accurate in measuring TST, but the Oura and Withings had the lowest bias in measuring WASO. For specific sleep stages, the Fitbit was least biased in older adults in measuring light and deep sleep, but had one of the highest levels of bias in REM sleep.
These heterogeneous results suggest that device performance, especially in older adults, is heavily dependent on specific device algorithms, since each device may prioritize certain sensors or measurements in its sleep algorithm. Without knowing what an algorithm uses, we can only speculate as to the specific reasons these devices fail to capture older adult sleep as accurately. For example, if wearable device algorithms happen to be heavily reliant on skin temperature measurement as a proxy for sleep onset, they may not be as accurate when faced with the reduced skin temperature seen in older adults [31, 32]. Additionally, many wearables use photodiodes that do not necessarily perform as well on older adults due to thicker blood vessels, vasoconstriction due to temperature regulation changes, and thinning skin. This may cause wearables to miscategorize periods of wake as sleep more frequently than in younger adults with higher baseline skin temperatures. However, nearables make assumptions as well. For instance, the Withings Mat and Sleep Score Max rely in part on movement detection. Healthy older adults actually move less in their sleep than young adults [33], which may be why these devices underestimated how much the older adults were awake. Additionally, as seen in Table 1, the older adult group had increased WASO relative to young adults. This decrease in sleep quality is expected in this age group. Errors in WASO classification may be a driver of magnified inaccuracies in older adults, since wake miscategorized as sleep may contribute to inflated TST or individual sleep stages. In other words, young adults with high WASO would likely also have high risk of misclassification. Individual differences in device sensors and algorithms may account for the varied performance of all devices, but the results strongly suggest that all devices tested do not account for changes in sleep and physiology across the lifespan.
Limitations
Our sample size was small. Previous literature validating devices against PSG in younger adults has had larger sample sizes (ranging from 21 [12] to 62 [10]), but work looking at different age groups has included group sizes as small as 16 [16, 17]. Additionally, our sample was primarily White and, although not measured, fair-skinned. Several studies have suggested that photodiode-based devices are less effective in individuals with darker skin tones [34]. Additional studies should consider the intersection of skin tone and age, as skin thinning with age may also affect validity.
Although this study screened out participants who had been diagnosed with a sleep disorder, it is possible that there were subclinical or undiagnosed sleep disorders remaining in eligible participants. While a diagnostic PSG session was out of the scope of this study, future work could include an exclusion night in which participants’ sleep could be examined for markers of sleep disorders. In addition, because participants with diagnosed sleep disorders were excluded, these results may not be generalizable to adults who do have diagnosed sleep disorders.
While this study aimed for greater ecological validity by having all data collection occur in-home, the nature of the protocol is not typical usage of a consumer sleep tracking device. Specifically, the need for participants to use multiple sleep tracking devices concurrently may have resulted in greater user error or frustration than might be found while using one device at a time. The use of multiple sleep tracking devices also meant that to reduce participant burden, participants were only asked to use wearables at night. This means that daytime rest or nap periods were not accounted for. Additionally, while participants were screened for user comfort with a smartphone, participants did not use their own phones and may have been less comfortable with the less familiar interface of a lab-provided phone. In older adults with less experience with different smartphone operating systems, this may have increased user error. In terms of epoch-by-epoch data availability, the phone and devices were in the field before the PSG night, so there may have been drift in time synchronization. This, as well as the lack of availability of epoch-by-epoch measures for all devices, eliminated the ability to test the specific accuracy of devices at staging sleep, and further studies should assess epoch-by-epoch agreement in older adults. Lastly, these results are only relevant for the specific firmware versions and algorithms current at the time of data collection, so caution should be used when interpreting these results as algorithms continue to change and be updated.
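For future work with epoch-by-epoch exports, one simple way to examine clock drift is to compute the proportion of matching epochs at several candidate time offsets. The Python sketch below does this for two hypnograms coded as lists of stage labels; the function names, the stage codes, and the choice of raw agreement (which is not corrected for chance) are all assumptions for illustration and were not part of this study's analysis.

```python
def epoch_agreement(ref, test):
    """Proportion of overlapping epochs on which two hypnograms agree."""
    pairs = list(zip(ref, test))  # zip truncates to the shorter series
    if not pairs:
        raise ValueError("no overlapping epochs")
    return sum(r == t for r, t in pairs) / len(pairs)


def best_lag(ref, test, max_lag):
    """Try shifts where test[i] is paired with ref[i + lag]; return (lag, agreement)."""
    best = (0, -1.0)
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            score = epoch_agreement(ref[lag:], test)
        else:
            score = epoch_agreement(ref, test[-lag:])
        if score > best[1]:
            best = (lag, score)
    return best


# Hypothetical 30-s epochs: W = wake, N = NREM, R = REM (not study data).
psg = list("WWWNNNNRRRNNNWW")
# Device hypnogram identical to PSG but starting two epochs late.
device = ["W", "W"] + psg[:-2]

print(f"agreement with no shift: {epoch_agreement(psg, device):.2f}")
print(f"best (lag, agreement):   {best_lag(psg, device, max_lag=3)}")
```

Selecting the offset that maximizes agreement can overstate accuracy, so in practice an independent synchronization signal would be preferable where one is available.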
Conclusions
These results support previous literature indicating that CSTs are a reasonable alternative to PSG or research-grade actigraphy in younger adults, particularly for more general needs like total sleep time. However, current algorithms may not be equipped to track sleep as accurately in older adults. These findings are consistent with the recent recommendations from the World Sleep Society, which suggest caution in interpreting CST results, particularly in situations like older age or reduced sleep continuity, where CSTs are likely to be less accurate [35]. Given the disparity in sleep quality in older adults and the cognitive and health effects of age-related changes in sleep, further validation and algorithm development for a range of ages is needed for CSTs to be viable across the lifespan, despite the benefits in comfort and availability they provide. Developers of sleep tracking devices must more strongly consider the diversity of their user base and tailor their algorithms and designs to be effective for every individual who may use them, particularly in regard to changes in sleep across the lifespan.
Acknowledgments
Data collection was supported by Mireya Ortiz and Kyle Valade. This study was supported by NIH National Institute on Aging grant P30 AG073107.
Contributor Information
Mary Emma Searles, Institute for Applied Life Sciences, University of Massachusetts, Amherst, MA, United States.
Angela Licata, Institute for Applied Life Sciences, University of Massachusetts, Amherst, MA, United States.
Matthew Cucinotta, Institute for Applied Life Sciences, University of Massachusetts, Amherst, MA, United States.
Kyle Kainec, Institute for Applied Life Sciences, University of Massachusetts, Amherst, MA, United States.
Rebecca M C Spencer, Institute for Applied Life Sciences, University of Massachusetts, Amherst, MA, United States; Department of Psychological & Brain Sciences, University of Massachusetts, Amherst, MA, United States.
Author contributions
Mary Emma Searles (Formal analysis [lead], Investigation [lead], Methodology [supporting], Project administration [supporting], Supervision [supporting], Visualization [lead], Writing—original draft [lead], Writing—review & editing [lead]), Angela Licata (Investigation [supporting], Visualization [lead], Writing—original draft [supporting], Writing—review & editing [supporting]), Matthew Cucinotta (Investigation [supporting], Visualization [supporting], Writing—original draft [supporting], Writing—review & editing [supporting]), Kyle Kainec (Conceptualization [supporting], Funding acquisition [supporting], Methodology [supporting], Project administration [lead]), Rebecca M. C. Spencer (Conceptualization [lead], Funding acquisition [lead], Methodology [supporting], Project administration [lead], Resources [lead], Supervision [lead], Writing—original draft [supporting], Writing—review & editing [supporting]).
Disclosure statement
Financial disclosure: None.
Non-financial disclosure: None.
Data availability
The data underlying this article will be shared on reasonable request to the corresponding author.
References
1. Bubu OM, Brannick M, Mortimer J, et al. Sleep, cognitive impairment, and Alzheimer’s disease: a systematic review and meta-analysis. Sleep. 2017;40(1):zsw032. doi:10.1093/sleep/zsw032
2. Irwin MR, Vitiello MV. Implications of sleep disturbance and inflammation for Alzheimer’s disease dementia. Lancet Neurol. 2019;18(3):296–306. doi:10.1016/S1474-4422(18)30450-2
3. Wennberg A, Wu M, Rosenberg P, Spira A. Sleep disturbance, cognitive decline, and dementia: a review. Semin Neurol. 2017;37(4):395–406. doi:10.1055/s-0037-1604351
4. Knechel NA, Chang P. The relationships between sleep disturbance and falls: a systematic review. J Sleep Res. 2022;31(5):e13580. doi:10.1111/jsr.13580
5. Zhou T, Dai X, Yuan Y, et al. Adherence to a healthy sleep pattern is associated with lower risks of incident falls and fractures during aging. Front Immunol. 2023;14:1234102. doi:10.3389/fimmu.2023.1234102
6. Keage HAD, Banks S, Yang KL, Morgan K, Brayne C, Matthews FE. What sleep characteristics predict cognitive decline in the elderly? Sleep Med. 2012;13(7):886–892. doi:10.1016/j.sleep.2012.02.003
7. Spira AP. Sleep and health in older adulthood: recent advances and the path forward. J Gerontol A. 2018;73(3):357–359. doi:10.1093/gerona/glx263
8. de Zambotti M, Cellini N, Goldstone A, Colrain IM, Baker FC. Wearable sleep technology in clinical and research settings. Med Sci Sports Exerc. 2019;51(7):1538–1557. doi:10.1249/MSS.0000000000001947
9. Kainec K, Caccavaro J, Barnes M, Hoff C, Berlin A, Spencer RMC. Evaluating accuracy in five commercial sleep-tracking devices compared to research-grade actigraphy and polysomnography. Sensors. 2024;24(2):635. doi:10.3390/s24020635
10. Schyvens AM, Peters B, Van Oost NC, et al. A performance validation of six commercial wrist-worn wearable sleep-tracking devices for sleep stage scoring compared to polysomnography. SLEEP Advances. 2025;6(2):zpaf021. doi:10.1093/sleepadvances/zpaf021
11. Robbins R, Weaver MD, Sullivan JP, et al. Accuracy of three commercial wearable devices for sleep tracking in healthy adults. Sensors. 2024;24(20):6532. doi:10.3390/s24206532
12. Chinoy ED, Cuellar JA, Jameson JT, Markwald RR. Performance of four commercial wearable sleep-tracking devices tested under unrestricted conditions at home in healthy young adults. Nat Sci Sleep. 2022;14:493–516. doi:10.2147/NSS.S348795
13. Hughes JM, Song Y, Fung CH, et al. Measuring sleep in vulnerable older adults: a comparison of subjective and objective sleep measures. Clin Gerontol. 2018;41(2):145–157. doi:10.1080/07317115.2017.1408734
14. Lach HW, Lorenz RA, Palmer JL, Koedbangkham J, Noimontree W. Home monitoring to track activity and sleep patterns among older adults: a feasibility study. Comput Inform Nurs. 2019;37(12):628–637. doi:10.1097/CIN.0000000000000569
15. Wei J, Boger J. Sleep detection for younger adults, healthy older adults, and older adults living with dementia using wrist temperature and actigraphy: prototype testing and case study analysis. JMIR Mhealth Uhealth. 2021;9(6):e26462. doi:10.2196/26462
16. Ghorbani S, Golkashani HA, Chee NI, et al. Multi-night at-home evaluation of improved sleep detection and classification with a memory-enhanced consumer sleep tracker. Nat Sci Sleep. 2022;14:645–660. doi:10.2147/NSS.S359789
17. Ong JL, Golkashani HA, Ghorbani S, et al. Selecting a sleep tracker from EEG-based, iteratively improved, low-cost multisensor, and actigraphy-only devices. Sleep Health. 2024;10(1):9–23. doi:10.1016/j.sleh.2023.11.005
18. Li J, Vitiello MV, Gooneratne N. Sleep in normal aging. Sleep Med Clin. 2018;13(1):1–11. doi:10.1016/j.jsmc.2017.09.001
19. Miner B, Kryger MH. Sleep in the aging population. Sleep Med Clin. 2017;12(1):31–38. doi:10.1016/j.jsmc.2016.10.008
20. Bari DS, Yacoob Aldosky HY, Martinsen ØG. Simultaneous measurement of electrodermal activity components correlated with age-related differences. J Biol Phys. 2020;46(2):177–188. doi:10.1007/s10867-020-09547-4
21. Ancoli-Israel S. Sleep and its disorders in aging populations. Sleep Med. 2009;10:S7–S11. doi:10.1016/j.sleep.2009.07.004
22. Brandt J, Spencer M, Folstein M. The telephone interview for cognitive status. Cogn Behav Neurol. 1988;1(2):111.
23. Hart SG, Staveland LE. Development of NASA-TLX (task load index): results of empirical and theoretical research. In: Hancock PA, Meshkati N, eds. Advances in Psychology. Vol 52. Human Mental Workload; 1988:139–183.
24. Knight JF, Baber C, Schwirtz A, Bristow HW. The comfort assessment of wearable computers. In: Proceedings. Sixth International Symposium on Wearable Computers; 2002:65–72.
25. Malhotra RK. AASM scoring manual 3: a step forward for advancing sleep care for patients with obstructive sleep apnea. J Clin Sleep Med. 20(5):835–836. doi:10.5664/jcsm.11040
26. Menghini L, Cellini N, Goldstone A, Baker FC, de Zambotti M. A standardized framework for testing the performance of sleep-tracking technology: step-by-step guidelines and open-source code. Sleep. 2020;44(2):zsaa170. doi:10.1093/sleep/zsaa170
27. Kim S, Choudhury A. Comparison of older and younger adults’ attitudes toward the adoption and use of activity trackers. JMIR Mhealth Uhealth. 2020;8(10):e18312. doi:10.2196/18312
28. Ohayon MM, Carskadon MA, Guilleminault C, Vitiello MV. Meta-analysis of quantitative sleep parameters from childhood to old age in healthy individuals: developing normative sleep values across the human lifespan. Sleep. 2004;27(7):1255–1273. doi:10.1093/sleep/27.7.1255
29. Grandner MA, Lujan MR, Ghani SB. Sleep-tracking technology in scientific research: looking to the future. Sleep. 2021;44(5):zsab071. doi:10.1093/sleep/zsab071
30. Penzel T, Kantelhardt JW, Lo CC, Voigt K, Vogelmeier C. Dynamics of heart rate and sleep stages in normals and patients with sleep apnea. Neuropsychopharmacol. 2003;28 Suppl 1(1):S48–S53. doi:10.1038/sj.npp.1300146
31. Raymann RJEM, Van Someren EJW. Diminished capability to recognize the optimal temperature for sleep initiation may contribute to poor sleep in elderly people. Sleep. 2008;31(9):1301–1309.
32. Blatteis CM. Age-dependent changes in temperature regulation – a mini review. Gerontology. 2011;58(4):289–295. doi:10.1159/000333148
33. Gori S, Ficca G, Giganti F, Nasso ID, Murri L, Salzarulo P. Body movements during night sleep in healthy elderly subjects and their relationships with sleep stages. Brain Res Bull. 2004;63(5):393–397. doi:10.1016/j.brainresbull.2003.12.012
34. Koerber D, Khan S, Shamsheri T, Kirubarajan A, Mehta S. Accuracy of heart rate measurement with wrist-worn wearable devices in various skin tones: a systematic review. J Racial Ethn Health Disparities. 2022;10(6):2676–2684. doi:10.1007/s40615-022-01446-9
35. Chee MW, Baumert M, Scott H, et al. World sleep society recommendations for the use of wearable consumer health trackers that monitor sleep. Sleep Med. 2025;131:106506. doi:10.1016/j.sleep.2025.106506