This is a preprint.

It has not yet been peer reviewed by a journal.

[Preprint]. 2025 Oct 28:2025.08.11.25332026. Originally published 2025 Aug 13. [Version 2] doi: 10.1101/2025.08.11.25332026

Pulse oximeter performance and skin pigment: comparison of 34 oximeters using current and emerging regulatory frameworks

Caroline Hughes 1,2,3, Danni Chen 1,2, Tyler Law 1,2,4, Philip Bickler 1,2, John Feiner 1,2, Leonid Shmuylovich 5, Ella Behnke 1,2, Lily Ortiz 1,2,4, Gregory Leeb 1,2,4, Isabella Auchus 2, Fekir Negussie 4, Ronald Bisegerwa 4,6, René Vargas Zamora 1,2, Elizabeth Igaga 4,6, Kelvin Moore Jr 1,4,7, Olubunmi Okunlola 8, Ellis Monk 9, Jana Lyn Fernandez 1,4, Odinakachukwu Ehie 2,4, Bernadette Wilks 2,4, Koyinsola Oyefeso 1,4, Deleree Schornack 1,2, Cornelius Sendagire 4,6,10, Michael S Lipnick 1,2,4
PMCID: PMC12363702  PMID: 40832401

Abstract

Background

The International Organization for Standardization (ISO) and US Food and Drug Administration (FDA) are updating regulations for pulse oximeters to reduce performance disparities linked to skin pigment. We tested common oximeters with current and anticipated regulatory frameworks. We hypothesized that not all oximeters show more positive bias in darkly vs lightly pigmented participants and that few oximeters would ‘pass’ the anticipated FDA regulations.

Methods

We used a controlled desaturation protocol to test 34 oximeters across arterial oxygen saturations (SaO2) 70–100% in healthy adults. Based on what FDA and ISO had shared at the time of study design, we studied cohort sizes of ≥ 24 with ≥ 25% of participants being darkly pigmented. We used the subjective Monk Skin Tone (MST) scale and the objective individual typology angle (ITA) derived from a spectrophotometer to characterize skin pigment. The root mean square error (ARMS), bias (mean of SpO2 - SaO2 error), and skin pigment differential bias were calculated. Monte Carlo simulation explored potential impacts of participant selection on device passing.

Results

For cohorts of 24 participants, 28/34 oximeters passed 2017 ISO Standard (ARMS ≤ 4%), 22/34 passed 2013 FDA guidance (ARMS ≤ 3%), 21/34 oximeters passed both ARMS and differential bias criteria for anticipated ISO standards, and 1/34 passed anticipated FDA criteria. More devices passed with cohorts > 24. Eleven oximeters had more positive bias in participants with dark vs. light (dorsal finger) pigmentation across 70–100% SaO2. Eighteen devices could pass or fail depending on cohorts selected for analysis.

Conclusions

Pulse oximeters show variable performance across manufacturers and models. Notably, only some devices show more positive bias in people with darker skin. Anticipated updates to ISO and FDA frameworks yield strikingly different assessments and require refinement of cohort sizes and differential bias criteria. Whether new guidelines will translate into improved real-world performance or reduced health disparities is yet to be determined.

Introduction

Pulse oximeters are essential clinical tools that noninvasively estimate the percent of hemoglobin bound with oxygen (SpO2). However, long-standing concerns over variable performance across manufacturers and worse performance in people with darker skin have only recently received significant attention. Reports of oximeter inaccuracies in people with darker skin pigmentation began in the 1980s, and the COVID-19 pandemic brought this issue to the forefront as numerous studies reported not only positive bias (i.e. SpO2 overestimating true arterial blood functional oxygen saturation, SaO2) but also healthcare disparities (i.e. underrecognition or undertreatment of hypoxemia) in patients who self-identified as Black, Asian, Hispanic, or Native American.1–4 Laboratory studies in healthy participants have shown similar findings for some oximeters, especially at lower SaO2.5–8

In response to these concerns, the International Organization for Standardization (ISO) and US Food and Drug Administration (FDA) began updating pulse oximeter regulatory frameworks to improve performance and reduce potential disparities related to skin pigment. The FDA 510(k) premarket notification guidance for pulse oximeters and ISO 80601-2-61 standards were last updated in 2013 and 2017, respectively, and recommend manufacturers verify pulse oximeter performance using “10 or more healthy participants that vary in age and gender.”9,10 The 2013 FDA guidance also recommended at least 15% of the participants (i.e., the proportion of the US population in 2013 who identified as African American) should have “darkly pigmented” skin. As previously reported, these recommendations are likely underpowered to ensure equitable device performance.11 In 2024, the FDA released a discussion paper proposing larger verification study cohorts (24 participants) with more diversity. In 2025, FDA released a new draft guidance recommending even larger cohorts (150 participants), with more diversity of skin pigment, and additional statistical analyses.12 The ISO has shared some details of anticipated changes via public forums.13 Both agencies are expected to imminently release finalized versions of their new regulatory frameworks for pulse oximeters.13–16

Considering these upcoming changes and ongoing uncertainty about the extent to which oximeters on the market have bias related to skin color, we sought to independently test the performance of 34 commonly used pulse oximeters with both current and anticipated regulatory frameworks. We hypothesized that relatively few devices would pass anticipated frameworks, and that most but not all pulse oximeters would show more positive bias in people with dark vs. light pigmentation.

Methods

This study was conducted at the University of California San Francisco (UCSF) Hypoxia Laboratory from 2022 to 2024 with UCSF IRB approval (#21–35637, ClinicalTrials.gov ID: NCT06142019). Written informed consent was obtained from all participants.

Oximeter selection

We tested 34 pulse oximeters. Devices were chosen based on popularity in online marketplaces, clinician input from diverse settings, and global health donor procurement lists to capture commonly used devices across varied clinical settings and price points. Purchase prices varied from $10 to $5,999. A list of tested devices, including prices, form factors, wavelengths, and regulatory data is available in Table S4. For three devices (the Shenzhen PC-60NW, Masimo MightySat, and Acare AH-M1/MX), we purchased two of the same model and reported results by year of device purchase. For two additional devices (Masimo Rad97 and Masimo Rad G), we tested each with two different probes and reported each device-probe combination separately. All participants were monitored throughout the protocol with the lab’s ‘clinical monitor’ oximeter (Nellcor PM1000N, Medtronic, USA).

Study demographics and skin pigment assessment

Participants were healthy adults 18–50 years old who were non-smoking, with no history of lung, cardiovascular, kidney, or liver disease and without hemoglobinopathy, anemia, clotting disorders, or Raynaud’s disease. Our enrollment targeted diversity of both skin pigment and sex assigned at birth.

Participants’ age, sex, and US National Institutes of Health (NIH) race were self-reported. Participants’ height, weight, and finger diameter were measured. Percent modulation of infrared light (an indicator of participant perfusion) was recorded from the clinical monitor and divided by 10 to approximate comparability with Masimo Perfusion Index.8

Our approach to skin pigment assessment is provided in the Supplemental Digital Content eMethods. Briefly, two research coordinators assigned subjective skin pigmentation data for each participant using the Monk Skin Tone (MST) scale.17,18 Coordinators also used the Konica Minolta CM 700-d (KM) spectrophotometer to derive individual typology angle (ITA), a frequently used surrogate for melanin19,20 at the study participant’s dorsal distal phalanx (DP) (the site for study oximeters) and forehead (the recommended site for using MST).19

We categorized participants into light, medium or dark bins based on previously proposed cutoffs: ‘light’ (MST 1–4 and ITA of > 30°), ‘medium’ (MST 5–7 and ITA between 30° and −30°), ‘dark’ (MST 8–10 and ITA of < −30° with ≥ 50% of these participants having ITA < −50°).12,21,22 We refer to participants with both MST 8–10 and ITA of < −50° as ‘very dark.’ Unless otherwise noted, we used the ‘dark’ definition above (not ‘very dark’) for analyses. In cases of discordance between ITA and MST, the participant was binned by ITA.
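The binning rule above can be sketched in a few lines of Python. This is an illustrative sketch rather than the study's analysis code: the function names are hypothetical, and treating ITA values of exactly ±30° as 'medium' is an assumption drawn from the phrase "between 30° and −30°". The cohort-level requirement that ≥ 50% of 'dark' participants have ITA < −50° is not checked here.

```python
from typing import Tuple


def mst_bin(mst: int) -> str:
    """Bin a Monk Skin Tone score (1-10) using the cutoffs quoted in the text."""
    if not 1 <= mst <= 10:
        raise ValueError("MST must be between 1 and 10")
    if mst <= 4:
        return "light"
    if mst <= 7:
        return "medium"
    return "dark"


def ita_bin(ita_deg: float) -> str:
    """Bin an individual typology angle (degrees); +/-30 is assumed to be 'medium'."""
    if ita_deg > 30:
        return "light"
    if ita_deg < -30:
        return "dark"
    return "medium"


def pigment_bin(mst: int, ita_deg: float) -> Tuple[str, bool]:
    """Return (bin, concordant). When MST and ITA disagree, the ITA bin is used."""
    by_ita = ita_bin(ita_deg)
    return by_ita, by_ita == mst_bin(mst)


def is_very_dark(mst: int, ita_deg: float) -> bool:
    """'Very dark' requires both MST 8-10 and ITA < -50 degrees."""
    return mst >= 8 and ita_deg < -50
```

For example, a participant with MST 5 and ITA 35° would be binned as 'light' and flagged as discordant.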

Controlled desaturation protocol

Each participant underwent a controlled desaturation protocol as previously described and available online and in the Supplemental Digital Content eMethods.23–25 Briefly, study investigators controlled partial pressures of inspired oxygen, carbon dioxide, and nitrogen to achieve six stable “plateaus” of targeted arterial functional oxygen saturation (SaO2) between ~70% and 100% (i.e. 2 plateaus in each decile 70–80%, 80–90%, and 90–100%). At each plateau, multiple arterial blood samples were collected ≥ 20 seconds apart from a radial artery catheter. At the time of each blood sample, pulse oximeter SpO2 values were recorded, and blood samples were immediately analyzed for SaO2 using Radiometer ABL90 Flex Plus (ABL) (Radiometer, Copenhagen, Denmark) blood gas analyzers. Of note, participants’ hands were not warmed by default, and most probes were randomly placed across all five digits (see the Supplemental Digital Content eMethods).

Sample size and cohort composition

The cohort size and composition for the primary analysis were based on what had been shared publicly by the FDA and ISO at the time of study design.13–16,26 These criteria were: 1. ≥ 24 unique participants; 2. Each participant must contribute 16 to 30 data points; 3. Pooled SaO2 data must span at least 73% to 97% SaO2; 4. ≥ 90% of participants must provide ≥ 1 data point < 85%; 5. ≥ 69% of participants must provide ≥ 1 data point in the 70–80% decile; 6. The cohort must include ≥ 33% of each sex; 7. Using forehead skin data, ≥ 25% of participants must fall into each color bin (i.e. light, medium, or dark). Given uncertainties in anticipated recommendations for cohort size, we tested all devices in at least 24 participants and continued testing with as many participants as possible based on available resources.
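The seven cohort criteria listed above lend themselves to an automated check. The following Python sketch shows one way to encode them. The `Participant` record and `check_cohort` function are hypothetical names, and the decile bounds (70 ≤ SaO2 < 80) and the reading of "span at least 73% to 97%" as minimum ≤ 73 and maximum ≥ 97 are assumptions.

```python
from typing import Dict, List, NamedTuple


class Participant(NamedTuple):
    sex: str            # "F" or "M" (sex assigned at birth)
    forehead_bin: str   # "light", "medium", or "dark"
    sao2: List[float]   # SaO2 (%) of each paired data point


def check_cohort(cohort: List[Participant]) -> Dict[str, bool]:
    """Evaluate the seven cohort criteria; returns one boolean per criterion."""
    n = len(cohort)
    if n == 0:
        raise ValueError("cohort is empty")
    pooled = [s for p in cohort for s in p.sao2]

    def share(pred) -> float:
        # Fraction of participants satisfying a predicate
        return sum(1 for p in cohort if pred(p)) / n

    return {
        "1_size": n >= 24,
        "2_points_per_participant": all(16 <= len(p.sao2) <= 30 for p in cohort),
        "3_sao2_span": bool(pooled) and min(pooled) <= 73 and max(pooled) >= 97,
        "4_point_below_85": share(lambda p: any(s < 85 for s in p.sao2)) >= 0.90,
        "5_point_in_70_80": share(lambda p: any(70 <= s < 80 for s in p.sao2)) >= 0.69,
        "6_sex_balance": all(
            share(lambda p, x=x: p.sex == x) >= 0.33 for x in ("F", "M")
        ),
        "7_pigment_bins": all(
            share(lambda p, b=b: p.forehead_bin == b) >= 0.25
            for b in ("light", "medium", "dark")
        ),
    }
```

A cohort is eligible under these assumptions when `all(check_cohort(cohort).values())` is true; returning the individual flags makes it clear which criterion blocks a given cohort.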

Statistical and sensitivity analyses

The primary analysis used 24 participant cohorts and skin pigment data from the forehead and dorsal distal phalanx (DP). We used two metrics to define a ‘passing device’ based on what was shared publicly by FDA and ISO at the time of study design (Table 2): 1. accuracy root mean square error (ARMS); and 2. differential bias. ARMS is a performance metric: the square root of the mean of the squared differences between SpO2 and SaO2. Differential bias is defined in two ways: 1. For ITA, as the difference in SpO2 bias between two theoretical participants with an ITA difference of 100° (i.e., one participant with dark ITA −50° and another with light ITA 50°); and 2. For MST, as the difference between MST bins. We estimated the differential bias and 95% confidence interval (CI) using a linear mixed effects (LME) model. The anticipated ISO standard recommended differential bias thresholds (point estimate ≤ 4% for 70–85% SaO2 and ≤ 2% for 85–100% SaO2) for only ITA, while the anticipated FDA guidance recommended thresholds (95% CI < 3.5% for 70–85% SaO2 and < 1.5% for 85–100% SaO2) for both ITA and pairwise comparisons of MST bins 1–4, 5–7, and 8–10.12,15,16
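As a concrete illustration of the two basic quantities defined above, the short Python sketch below computes bias and ARMS from paired readings. The function names are illustrative and the three data pairs are invented for the example.

```python
import math
from typing import Sequence


def bias(spo2: Sequence[float], sao2: Sequence[float]) -> float:
    """Mean error, where error = SpO2 - SaO2 for each paired sample."""
    errors = [p - a for p, a in zip(spo2, sao2)]
    return sum(errors) / len(errors)


def arms(spo2: Sequence[float], sao2: Sequence[float]) -> float:
    """Accuracy root mean square error: sqrt(mean((SpO2 - SaO2)^2))."""
    errors = [p - a for p, a in zip(spo2, sao2)]
    return math.sqrt(sum(e * e for e in errors) / len(errors))


# Invented example: errors are +1, -1 and +3 percentage points
spo2 = [98.0, 95.0, 90.0]
sao2 = [97.0, 96.0, 87.0]
print(bias(spo2, sao2))            # 1.0
print(round(arms(spo2, sao2), 3))  # 1.915
```

Because ARMS² equals bias² plus the (population) variance of the error, a device can fail an ARMS threshold through systematic offset, through scatter, or both, which is why bias is reported alongside ARMS.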

Table 2.

Summary of current and anticipated regulatory frameworks for pulse oximetry

Number of participants for verification study cohorts
ISO (2017): ≥ 10
Anticipated ISO: ≥ 24
FDA (2013): ≥ 10
Anticipated FDA: 150

Diversity of biological sex
ISO (2017): −
Anticipated ISO: ≥ 33% of each biological sex
Anticipated FDA: In each MST group, at least 40% of participants are male, and at least 40% of participants are female

Diversity of skin pigment
ISO (2017): Skin color should be described*
Anticipated ISO: MST at forehead: ≥ 25% MST 1–3 (light), 4–7 (medium), 8–10 (dark). ITA at forehead: ≥ 25% ITA > 30° (light), 30° ≥ ITA ≥ −30° (medium), < −30° (dark) AND at least 50% of the participants in dark category have an ITA ≤ −50° (If a subject has ITA and MST that do not place this subject in the same light, medium, dark category, use the ITA categorization)
FDA (2013): 2 of 10, or 15% of total should have dark skin*
Anticipated FDA: MST at forehead: ≥ 25% MST 1–4, 5–7, 8–10. ITA at the forehead: at least 50% of the participants in MST group 8–10 have an ITA ≤ −50° at the forehead

ARMS criteria between 70–100% SaO2
ISO (2017): ARMS < 4%; “upper 95 % and lower 95% limits of agreement shall at a minimum be provided”
Anticipated ISO: ARMS ≤ 3% with 95% CI disclosed
FDA (2013): ARMS ≤ 3% transmittance; ARMS ≤ 3.5% reflectance
Anticipated FDA: ARMS < 3% including 95% CI for both transmittance and reflectance devices

Statistical assessment of bias due to skin color
ISO (2017): None
Anticipated ISO: For SaO2 ranges 70–85% and 85–100%: using linear regressions, estimate differential bias between values at an ITA of −50° and +50° at the sensor emitter site; absolute differential bias ≤ 2% for 85–100% SaO2, and ≤ 4% for 70–85% SaO2
FDA (2013): None
Anticipated FDA: For SaO2 ranges 70–85% and 85–100%: absolute difference in SpO2 bias across ITA at the sensor emitter site AND forehead MST levels < 1.5% for SaO2 > 85% and < 3.5% for SaO2 70–85%

* No specific recommendations provided on how skin color should be measured or described

We assessed for bias related to pigment by calculating ARMS and error (SpO2 minus SaO2) for each pigment bin (forehead and DP).12 Error and ARMS were summarized as means with 95% CIs. The 95% CIs were determined using bootstrapping (random resampling with replacement) with 1,000 repetitions. Error across skin color bins was compared using an LME model, which included a random intercept to account for repeated measures within participants. ARMS was computed for each participant and compared between groups of participants based on pigment binning using a Welch’s two-sample t-test. A two-sided p-value < 0.05 was considered significant. P-values are reported unadjusted for multiple comparisons in accordance with FDA guidance for independent pairwise analyses, which does not mandate multiplicity adjustments.
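The bootstrap step described above can be illustrated with a percentile interval for ARMS. This is a sketch under stated assumptions, not the study's code: the text does not specify the resampling unit, so resampling whole participants (to respect repeated measures) and using a simple percentile interval are choices made here for illustration, and `bootstrap_arms_ci` is a hypothetical name.

```python
import math
import random
from typing import Dict, List, Tuple


def arms(errors: List[float]) -> float:
    """Root mean square of SpO2 - SaO2 errors."""
    return math.sqrt(sum(e * e for e in errors) / len(errors))


def bootstrap_arms_ci(
    errors_by_participant: Dict[str, List[float]],
    n_reps: int = 1000,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Return (point estimate, lower, upper) for a 95% percentile bootstrap CI.

    Assumption: participants, not individual samples, are resampled with
    replacement so that each participant's repeated measures stay together.
    """
    rng = random.Random(seed)
    ids = sorted(errors_by_participant)
    point = arms([e for pid in ids for e in errors_by_participant[pid]])
    reps = []
    for _ in range(n_reps):
        sample_ids = rng.choices(ids, k=len(ids))  # sampling with replacement
        reps.append(arms([e for pid in sample_ids for e in errors_by_participant[pid]]))
    reps.sort()
    lower = reps[int(0.025 * n_reps)]
    upper = reps[int(0.975 * n_reps) - 1]
    return point, lower, upper
```

With 1,000 repetitions, the interval is read from the 26th and 975th sorted replicate values.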

We conducted a secondary analysis for each device using the largest cohorts possible that still met cohort diversity criteria. Additional secondary analyses included the impact on device ‘passing’ of different anatomical pigment assessment sites, different definitions of pigment binning, and different statistical methods for estimating bias, including use of a linear regression (LR) model for differential bias instead of the LME model. The LR model regressed the participant-level bias on the ITA values and estimated the difference in predicted bias between ITA values of −50° and 50° based on the fitted model.
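The LR approach to differential bias described above reduces to an ordinary least squares fit followed by a difference of two predictions. The Python sketch below shows that calculation. The function names are hypothetical, and the sign convention (predicted bias at ITA −50° minus predicted bias at ITA 50°, i.e. dark minus light) is an assumption.

```python
from typing import Sequence, Tuple


def ols_fit(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least squares for y = intercept + slope * x; returns (slope, intercept)."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need at least two paired observations")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("ITA values must not all be identical")
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def differential_bias(
    ita_deg: Sequence[float],
    participant_bias: Sequence[float],
    dark_ita: float = -50.0,
    light_ita: float = 50.0,
) -> float:
    """Predicted bias at dark_ita minus predicted bias at light_ita."""
    slope, intercept = ols_fit(ita_deg, participant_bias)
    return (intercept + slope * dark_ita) - (intercept + slope * light_ita)
```

Because the fit is linear, the result is simply −100 times the fitted slope, so the estimate is driven by how steeply bias changes with ITA rather than by the bias of any single participant.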

A Monte Carlo simulation explored whether participant selection could influence ‘passing’ status. For each device, the simulation created 100 cohorts of 24 participants, each with at least six darkly pigmented participants. If a device always passed or always failed in all simulated cohorts, it was considered less susceptible to selection bias (i.e. ‘cherry-picking’ participants).
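A simulation of this kind could be sketched as follows. The sampling scheme (draw six dark participants first, then fill the remaining 18 seats from everyone not yet chosen), the pooled-ARMS pass rule with a 3% threshold, and all names are assumptions made for illustration. The text does not describe how the simulated cohorts were drawn.

```python
import math
import random
from typing import Dict, List, NamedTuple


class Subject(NamedTuple):
    is_dark: bool
    errors: List[float]  # SpO2 - SaO2 for each paired sample


def simulate_pass_fraction(
    subjects: Dict[str, Subject],
    threshold: float = 3.0,   # assumed ARMS pass threshold (%)
    n_cohorts: int = 100,
    cohort_size: int = 24,
    min_dark: int = 6,
    seed: int = 0,
) -> float:
    """Fraction of randomly drawn cohorts in which pooled ARMS <= threshold."""
    rng = random.Random(seed)
    dark_ids = sorted(pid for pid, s in subjects.items() if s.is_dark)
    if len(dark_ids) < min_dark or len(subjects) < cohort_size:
        raise ValueError("not enough participants to build a cohort")
    passes = 0
    for _ in range(n_cohorts):
        chosen = rng.sample(dark_ids, min_dark)          # guarantees >= min_dark
        remaining = sorted(set(subjects) - set(chosen))  # may include more dark
        chosen += rng.sample(remaining, cohort_size - min_dark)
        errors = [e for pid in chosen for e in subjects[pid].errors]
        cohort_arms = math.sqrt(sum(e * e for e in errors) / len(errors))
        passes += cohort_arms <= threshold
    return passes / n_cohorts


def classify(pass_fraction: float) -> str:
    """Label a device by how its simulated cohorts behaved."""
    if pass_fraction == 1.0:
        return "always passes"
    if pass_fraction == 0.0:
        return "always fails"
    return "depends on cohort"
```

A device labeled "depends on cohort" is one whose regulatory outcome could be changed by choosing which participants to analyze.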

The SpO2 from most study oximeters was recorded manually into a Research Electronic Data Capture (REDCap) Database at the time of each arterial blood sample, except for the Nellcor PM1000N and Masimo Rad97 which streamed data directly into Labview (National Instruments, Austin, TX). Most oximeters used a self-contained interface that displayed SpO2 data on the oximeter itself and could not directly stream data from the oximeter. All other physiologic data were collected into Labview, and all Labview data were merged into the REDCap Database. All analyses were performed with R 4.3.2 (R Core Team, 2023) and Python v3.9.6 (Python Software Foundation, 2024). De-identified data for this study are openly accessible through the Open Oximetry Data Repository via PhysioNet and accessible via the OpenOximetry.org website.27

Results

We completed 348 desaturation studies in 155 participants (30,174 paired SpO2:SaO2 samples). Each device was tested in ≥ 24 unique participants (median 38, range 24–120) with > 540 SpO2:SaO2 pairs approximately equally distributed across SaO2 deciles, and with > 25% of data points in each decile coming from participants in each skin color bin (Table S3). Demographics are reported in Table 1.

Table 1.

Baseline characteristics of participants

Covariates Overall
Demographics
Age (years) 25 [24,29]
BMI (kg/m2) 23 [21,26]
Female 80 (52%)
Male 75 (48%)
Finger Diameter (mm) 10 [9,12]
Percent Modulation 2 [1,5]
NIH Race
Asian 40 (26%)
Black/African American 36 (23%)
Hispanic/Latino 16 (10%)
White 46 (30%)
Other/Multiple 17 (11%)
MST (Forehead)
1 5 (1%)
2 41 (12%)
3 17 (5%)
4 57 (17%)
5 74 (22%)
6 37 (11%)
7 20 (6%)
8 65 (19%)
9 21 (6%)
10 4 (1%)
ITA
ITA (Forehead) 18 [−19,33]
ITA (DP) 11 [−27,27]
ITA Bin (Forehead)
ITA >30° 111 (32%)
−30° ≤ ITA ≤ 30° 158 (45%)
ITA <−30° 79 (23%)
ITA <−50° 42 (12%)

Categorical variables are presented as N (%), while continuous variables are summarized as median [Q1, Q3].

BMI = Body Mass Index; MST = Monk Skin Tone Scale; ITA = Individual Typology Angle; NIH = National Institutes of Health; DP = Dorsal Distal Phalanx

This table presents the descriptive statistics for demographics and skin color data. Data are presented as median [Q1, Q3] for continuous variables and count (%) for categorical variables. Race was classified according to National Institutes of Health (NIH) categories. Skin tone was assessed using the Monk Skin Tone Scale (MST) and the Individual Typology Angle (ITA) at two anatomical sites: forehead and dorsal distal phalanx (DP), with repeated measures taken in participants who completed more than one study visit. Note that participants with ITA <−50° are a subset of participants with ITA <−30°.

Overall oximeter performance

For cohorts of 24 participants, device ARMS ranged from 1.69 to 8.04 (Figure 1). Of the 12 devices with ARMS > 3% (i.e. 2013 FDA Guidance threshold), eight had 510(k) premarket clearances. All six devices with an ARMS > 4% (i.e. 2017 ISO Standard threshold) reported a European Conformity (CE) marking (Figure 1 and Table S4).

Figure 1. Device conformity with current and anticipated FDA and ISO performance thresholds for 24 participant cohorts and DP ITA.

This figure displays devices from lowest to highest accuracy root mean square error (ARMS) with 95% confidence intervals (CI) for each device tested in a cohort of 24 participants. The boxes marked with colors indicate that the device meets the ARMS or differential bias thresholds (differential bias and 95% CI < 3.5% for 70–85% SaO2 and < 1.5% for 85–100% SaO2 for 2025 FDA, and differential bias point estimate ≤ 4% for 70–85% SaO2 and ≤ 2% for 85–100% SaO2 for 2025 ISO). Differential bias was calculated with the Monk Skin Tone (MST) Scale and individual typology angle (ITA) as indicated. The 95% CI were determined using bootstrapping (random resampling with replacement) with 1,000 repetitions.

Device SpO2 bias ranged from −5.71 to 3.48 with a median of 0.97 (IQR: −0.04 – 1.67) (Table S2). We found 24 devices with overall positive bias, and 10 devices with overall negative bias.

ARMS by skin color bin

In cohorts of 24, we analyzed ARMS by DP pigmentation across 70–100% SaO2 and found four devices with significantly different ARMS between light and dark skin color bins (Table S1 and Figure 2). We found significantly higher ARMS in the dark (DP) skin color bin for two devices at 70–80% SaO2, four devices at 80–90% SaO2, and two devices at 90–100% SaO2. One device showed significantly higher ARMS in the light skin color bin (Table S1).

Figure 2. Distribution of bias and ARMS across ITA bins at the DP site over 70–100% saturation range in 24 participant cohorts.

This figure shows the distribution of pulse oximeter measured oxygen saturation (SpO2) bias (mean of SpO2 minus SaO2 error, where SaO2 is functional arterial oxygen saturation) and root mean square error (ARMS) in each skin color bin defined by Individual Typology Angle (ITA) measured at the dorsal distal phalanx (DP) site across the saturation range of 70–100%. Devices shown are those with statistically significant differences in either bias or ARMS between light (ITA >30°) and dark (ITA <−30°) bins. Panel A displays the distribution of bias in closed circles with 95% confidence interval (CI) whiskers. Panel B shows the distribution of ARMS in closed circles with 95% CI whiskers. Data are stratified into four ITA bins (top to bottom): ITA > 30°, −30° ≤ ITA ≤ 30°, ITA < −30°, and ITA < −50°; note that ITA < −50° is a subset of ITA < −30°. Sample size (n) for each bin is indicated at left in the corresponding color. The vertical dashed line in Panel A represents bias of 0 and in Panel B represents the ARMS 3% threshold.

When analyzing ARMS by forehead pigmentation in cohorts of 24, two devices showed significantly higher ARMS in the dark vs. light color bin at 70–100% SaO2 (Table S6). Results by SaO2 decile were similar, but not identical, to those at the DP site.

With cohorts larger than 24, more oximeters demonstrated statistically significant ARMS differences between light and dark bins (Table S1, Table S6, Table S8, Table S9). Two devices demonstrated statistically significant differences in ARMS only when comparing light and ‘very dark’ color bins (Table S9).

Bias by skin color bin

We found five devices with more positive bias in dark vs. light DP skin color bins across SaO2 70–100% (median difference in bias (dark minus light) 1.56, IQR: 1.30 – 2.70), and two devices with more positive bias in the light bin (Table S2 and Figure 2). When analyzing by SaO2 deciles, we found significantly more positive bias in dark vs light skin color bins for four devices at 70–80% SaO2 (median difference 4.22, IQR: 3.40 – 4.74), four devices at 80–90% SaO2 (median difference 2.11, IQR: 1.60 – 2.85) and two devices at 90–100% SaO2 (median difference 1.36, range: 1.20 – 1.51). Bias by forehead pigmentation also varied (Table S7).

Analysis of larger cohort sizes revealed more devices (11/34) with significant differences in bias when comparing light vs. dark bins at the DP across SaO2 70–100%. Four devices demonstrated significantly more positive bias when comparing ‘very dark’ vs light color bins, but not dark vs light (Table S10 and Figure 3).

Figure 3. Distribution of bias and ARMS across ITA bins at the DP site over 70–100% saturation range in maximum participant cohorts.

This figure shows the distribution of pulse oximeter measured oxygen saturation (SpO2) bias (mean of SpO2 minus SaO2 error, where SaO2 is functional arterial oxygen saturation) and root mean square error (ARMS) in each skin color bin defined by Individual Typology Angle (ITA) measured at the dorsal distal phalanx (DP) site across the saturation range of 70–100%. Devices shown are those with statistically significant differences in either bias or ARMS between light (ITA >30°) and dark (ITA <−30°) bins. Panel A displays the distribution of bias in closed circles with 95% confidence interval (CI) whiskers. Panel B shows the distribution of ARMS in closed circles with 95% CI whiskers. Data are stratified into four ITA bins (top to bottom): ITA > 30°, −30° ≤ ITA ≤ 30°, ITA < −30°, and ITA < −50°; note that ITA < −50° is a subset of ITA < −30°. Sample size (n) for each bin is indicated at left in the corresponding color. The vertical dashed line in Panel A represents bias of 0 and in Panel B represents the ARMS 3% threshold.

Figures 4S13 contain the distribution of ARMS and bias for all devices across all saturation ranges and both anatomical sites.

Figure 4. Modified Bland-Altman plots for four selected devices in cohorts of 24 participants.

This figure shows the modified Bland-Altman (BA) plots for four devices, showing pulse oximeter measured oxygen saturation (SpO2) bias (mean of SpO2 minus SaO2 error, where SaO2 is functional arterial oxygen saturation) and 95% limits of agreement (LOA). Data points are colored by Monk Skin Tone Scale (MST) at the forehead, as recommended by the U.S. Food and Drug Administration (FDA) draft 510(k) guidance. Panel A: Shenzhen Med-Link Electronics Tech Co., Ltd. AM801 (passing ARMS/failing Individual Typology Angle (ITA) differential bias); Panel B: Nonin Medical, Inc. Co-Pilot H500 (passing ARMS/passing ITA differential bias); Panel C: Bistos Co., Ltd. BT-710 (failing ARMS/passing ITA differential bias); Panel D: Guangdong Biolight Meditech Co., Ltd. M70 (failing ARMS/failing ITA differential bias). In each panel, the red solid horizontal line represents the SpO2 bias and the dashed lines are the 95% LOA; both account for repeated measures.

Differential bias

In cohorts of 24 using DP ITA, differential bias ranged from −1.39 to 9.17 at 70–85% SaO2 (median 1.50, IQR: 0.00 – 2.53) and −1.12 to 2.18 at 85–100% SaO2 (median 0.42, IQR: −0.37 – 0.94) (Table S12).

With the larger cohort sizes, DP ITA differential bias ranged from −1.34 to 4.83 at 70–85% SaO2 (median 1.36, IQR: 0.01 – 2.33) and from −0.92 to 1.72 at 85–100% SaO2 (median of 0.38, IQR: −0.10 – 0.96) (Table S13). Forehead ITA differential bias was similar. The LME and LR models produced similar results.

Old vs. anticipated regulatory frameworks

With cohorts of 24 participants, 28/34 devices passed 2017 ISO ARMS criteria (≤4%), 22/34 passed 2013 FDA ARMS (≤ 3%) criteria (which is also the anticipated ISO criteria), and 12/34 passed anticipated FDA ARMS criteria (95% UCI < 3%). For differential bias, 1/34 oximeters passed the anticipated FDA criteria, and 32/34 passed the anticipated ISO criteria (Table S12 and Figure 1). Overall, 21/34 oximeters passed both the ARMS and differential bias criteria for anticipated ISO standard, and 1/34 oximeters passed both criteria for anticipated FDA guidance.
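The pass/fail logic behind these counts can be made explicit in code. The sketch below encodes the thresholds as stated in the text. Reading the anticipated FDA criterion "95% CI < 3.5%" as requiring the entire interval to lie within ±3.5 (and ±1.5 for 85–100% SaO2) is an assumption, as are the `DeviceResult` fields and function names.

```python
from typing import Dict, List, NamedTuple, Tuple


class DeviceResult(NamedTuple):
    arms: float                 # ARMS over 70-100% SaO2
    arms_upper_ci: float        # upper bound of the 95% CI for ARMS
    db_low: float               # ITA differential bias point estimate, 70-85% SaO2
    db_high: float              # ITA differential bias point estimate, 85-100% SaO2
    db_low_cis: List[Tuple[float, float]]   # 95% CIs (ITA and MST pairs), 70-85%
    db_high_cis: List[Tuple[float, float]]  # 95% CIs (ITA and MST pairs), 85-100%


def cis_within(cis: List[Tuple[float, float]], limit: float) -> bool:
    """True if every interval lies strictly inside (-limit, +limit)."""
    return all(max(abs(lo), abs(hi)) < limit for lo, hi in cis)


def evaluate(d: DeviceResult) -> Dict[str, bool]:
    """Apply the four frameworks' thresholds as summarized in the text."""
    return {
        "ISO 2017": d.arms <= 4.0,
        "FDA 2013": d.arms <= 3.0,
        "Anticipated ISO": (
            d.arms <= 3.0 and abs(d.db_low) <= 4.0 and abs(d.db_high) <= 2.0
        ),
        "Anticipated FDA": (
            d.arms_upper_ci < 3.0
            and cis_within(d.db_low_cis, 3.5)
            and cis_within(d.db_high_cis, 1.5)
        ),
    }
```

The structure makes the asymmetry visible: the anticipated ISO test uses point estimates, whereas the anticipated FDA test is applied to interval bounds, so a wide confidence interval from a small cohort can fail a device whose point estimates are well within limits.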

With the larger cohort sizes, the number of devices passing both the anticipated ARMS and differential bias criteria increased from 1/34 to 7/34 for FDA, and from 21/34 to 25/34 for ISO (Table S3 and Table S13).

Impact of pigmentation definitions and measurement sites on bias

The use of race, MST or ITA with varied binning thresholds and different anatomical sites yielded different numbers of devices with significant differences in bias (Table S14). Analysis of light vs dark DP ITA in maximum cohort sizes identified the most devices with significant differences in bias (11), while analysis by race (white vs. Black/African American) identified the fewest.

Monte Carlo Simulation

We found numerous devices could ‘pass’ or ‘fail’ ARMS criteria depending on which participants were selected for cohort analysis (Figure S3).

Discussion

We found that many pulse oximeters fail to meet current or anticipated regulatory performance recommendations, and some but not all oximeters had more positive bias in dark vs light participants. Anticipated ISO and FDA regulatory frameworks produced markedly different ‘passing’ rates for the same devices. Most devices passed the anticipated ISO standard, yet only one device passed the anticipated FDA guidance (in cohorts of 24 healthy adults).

Our findings corroborate prior studies demonstrating variable performance and positive bias in some oximeters on the market.6,7,23,24,28,29 In several instances, our findings differ from manufacturers’ reports (Figure 1 and Table S4), which may be partly explained by methodological differences. For example, we did not warm most participants’ hands and used all five digits. This was done in an effort to better reflect clinical reality and contrasts with most prior oximeter regulatory verification studies, which typically only used digits 2–4 and actively warmed participants’ hands to increase peripheral perfusion (a factor known to improve device performance and permitted by FDA and ISO).8 Additionally, we analyzed consecutively enrolled participants. Our Monte Carlo analysis suggests that had we ‘cherry-picked’ participants for analysis, a practice not explicitly prohibited for regulatory submissions, we could make poorly performing oximeters look better.

Consistent with our hypothesis, we found most (18/34) but not all oximeters exhibited more positive bias (i.e. SpO2 overestimating SaO2) in participants with dark vs. light skin (Table S10 and Table S11). Unexpectedly, several devices demonstrated more positive bias in participants with light vs. dark skin.

We confirmed our hypothesis that fewer devices pass anticipated regulatory frameworks, but the contrast in differential bias criteria between FDA and ISO is particularly noteworthy. Nearly all devices we tested passed the anticipated differential bias criteria for ISO, but only one passed anticipated FDA criteria (Figure 1). The anticipated ISO standard performed similarly to the 2013 FDA guidance. This is concerning given that devices cleared by 2013 FDA guidance were used in studies demonstrating oximeter performance and health disparity concerns.4,30 ISO should tighten its recommendations and ensure differential bias thresholds reflect magnitudes of bias (e.g. <2%) seen in clinically relevant SpO2 ranges (e.g. 85–95%).31–37 The FDA differential bias thresholds necessitate larger study cohorts for some devices to pass. This has generated concern that if studies become too costly or time-consuming, there could be negative global market implications (e.g. decreased device access and increased costs).13,38

Increasing the size of verification cohorts (from 10 to 24) and utilizing MST and ITA are good steps toward improving diversity. Our findings suggest that most devices will require cohorts > 24 to pass anticipated FDA criteria, but a cohort size of 150, as proposed in the January 2025 draft FDA guidance, is likely unnecessary. Additionally, we found the anatomical site for skin pigment assessment and definition of “dark” skin pigment also impacted bias, ARMS, and regulatory ‘passing’ rates (Table S1 and Table S15).

Without a better understanding of how performance criteria in the lab correlate with performance (or health disparities) in the clinical setting, the optimal regulatory performance thresholds will remain uncertain.

There was no device characteristic (e.g. cost, number of wavelengths, form factor etc.) that was clearly linked to performance, though the study was not designed for this purpose.

This study had several limitations. First, we generally tested only one oximeter per model. For devices with detachable probes, we replaced probes after approximately 24 participants. Thus, we cannot assess variability related to manufacturing differences or wear and tear. For the three oximeter models, we found different performance across manufactured years despite no obvious differences in hardware or software (Table S16, Figure 1, and Table S5). Our clinical monitor (PM1000N) had reasonably consistent performance in these same participant cohorts, suggesting this was not a methodological problem. Further work is needed to analyze variability across different probes of the same model, a potentially important source of error13 Second, devices were tested in comparable but not identical cohorts. For example, some cohorts (including those for two devices with the best ARMS) had higher participant perfusion (Table S17). Third, several aspects of our desaturation protocol could have influenced findings. For example, not all oximeters are designed for use on the 1st and 5th digits. Plateau stability was partially defined by stability of the clinical monitor, which may not always align with stability of tested oximeters. To mitigate this, we also used stability of calculated saturation (ScO2)23 based on exhaled CO2 and O2 partial pressures, to define plateau stability. Another protocol-related limitation is that SaO2 can vary by 1–3% across co-oximeter (i.e. blood gas analyzer) brands.39 We could not account for this factor because manufacturers generally do not disclose which co-oximeter was used in testing. Finally, there are also limitations in our skin pigment assessment protocol as previously reported.40 Although individuals with ITA < −50° represent a critical group for ensuring devices work equitably across all skin tones, we were unable to enroll a sufficient number of participants in this category to meaningfully assess its impact on bias analyses. 
Our findings should be generalized with caution to the clinical setting, pediatric patients, or devices we did not test.
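The co-oximeter limitation above can be made concrete with a small numerical sketch. The Python below is illustrative only: the paired readings are invented, the helper names `bias` and `arms` are hypothetical, and the formulas are the standard definitions (bias as the mean of SpO2 - SaO2, ARMS as the root mean square of the same errors) rather than the analysis code used in this study. It shows how a constant offset in the reference SaO2, of the size reported across co-oximeter brands, would carry straight into both metrics.

```python
import math

def bias(spo2, sao2):
    """Mean of the paired errors (SpO2 - SaO2), in saturation percentage points."""
    errors = [p - a for p, a in zip(spo2, sao2)]
    return sum(errors) / len(errors)

def arms(spo2, sao2):
    """Root mean square of the paired errors (SpO2 - SaO2)."""
    errors = [p - a for p, a in zip(spo2, sao2)]
    return math.sqrt(sum(e * e for e in errors) / len(errors))

# Invented paired readings for illustration; not data from this study.
sao2 = [72.0, 80.0, 88.0, 95.0]   # reference co-oximeter values
spo2 = [74.0, 81.0, 89.0, 95.0]   # pulse oximeter values

# Assumption: a different co-oximeter reads lower by a constant offset.
for offset in (0.0, 1.0, 3.0):
    shifted_reference = [a - offset for a in sao2]
    print(
        f"reference offset {offset:.0f}%: "
        f"bias = {bias(spo2, shifted_reference):.2f}, "
        f"ARMS = {arms(spo2, shifted_reference):.2f}"
    )
```

In this toy case a constant reference offset adds directly to the bias and inflates ARMS, so the same oximeter could appear to perform differently depending on which co-oximeter serves as the reference. Real between-analyzer differences need not be constant across the saturation range.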

Conclusion

Clinicians should be aware that many oximeters do not meet current or anticipated regulatory performance recommendations, and that some, but not all, devices show more positive bias in people with dark vs light skin pigment. While regulators may be uncertain how much differential bias is reasonable to allow, some oximeters clearly have minimal or no significant bias related to pigment, and all manufacturers should aspire to this performance goal regardless of the thresholds regulators ultimately propose. Future studies are needed to optimize regulatory frameworks, though many improvements can be made immediately using available data. Despite the performance limitations of regulatory-compliant oximeters, clinicians should continue to use these essential tools but exercise great caution when applying absolute SpO2 cutoffs to provide or withhold treatments.

Supplementary Material

Supplement 1
media-1.docx (12.4KB, docx)
Supplement 2

Acknowledgements

We are grateful to members of the Open Oximetry Collaborative Community who participated in the project’s open forum discussions on protocols and approaches to data analysis, including James Ramsay, Jenna Lester, Alex Pogorzelski, Margaret Akey, Daryl Dorsey, Bob Kopotic and Sandy Weininger (ISO/IEC Conveners of the Oximeters Medical Device Standards), and many others.

Funding Statement:

This study was conducted as part of the Open Oximetry Project funded by the Gordon and Betty Moore Foundation, Patrick J McGovern Foundation, PATH/UNITAID, and Robert Wood Johnson Foundation. Dr Ellis Monk’s time utilized for data analysis, reviewing and editing was funded by grant number: DP2MH132941.

Abbreviations and Acronyms:

ARMS: Accuracy root mean square error
CI: Confidence interval
UCI: Upper confidence interval
DP: Dorsal distal phalanx
FDA: US Food and Drug Administration
ISO: International Organization for Standardization
ITA: Individual Typology Angle
KM: Konica Minolta
LME: Linear mixed effects (model)
LR: Linear regression (model)
MST: Monk Skin Tone (Scale)
NIH: National Institutes of Health
SaO2: Functional arterial hemoglobin oxygen saturation
SpO2: Pulse oximeter indirect measure of arterial hemoglobin oxygen saturation
UCSF: University of California San Francisco

Footnotes

Conflicts of Interest: The UCSF Hypoxia Research Laboratory receives funding from multiple industry sponsors to test the sponsors’ devices for the purposes of product development and regulatory performance testing. This paper does not include data collected for sponsors. All data were collected from devices procured by the Hypoxia Research Laboratory for the purposes of independent research. No company provided any direct funding for this study, participated in study design, or was involved in analyzing data or writing the manuscript. None of the authors own stock or equity interests in any pulse oximeter companies.

References

  • 1. Ruppel H, Makeneni S, Faerber JA, et al. Evaluating the Accuracy of Pulse Oximetry in Children According to Race. JAMA Pediatr. 2023;177(5):540–543. doi: 10.1001/jamapediatrics.2023.0071
  • 2. Wong AKI, Charpignon M, Kim H, et al. Analysis of Discrepancies Between Pulse Oximetry and Arterial Oxygen Saturation Measurements by Race and Ethnicity and Association With Organ Dysfunction and Mortality. JAMA Netw Open. 2021;4(11):e2131674. doi: 10.1001/jamanetworkopen.2021.31674
  • 3. Henry NR, Hanson AC, Schulte PJ, et al. Disparities in Hypoxemia Detection by Pulse Oximetry Across Self-Identified Racial Groups and Associations With Clinical Outcomes. Crit Care Med. 2022;50(2):204–211. doi: 10.1097/CCM.0000000000005394
  • 4. Sjoding MW, Dickson RP, Iwashyna TJ, Gay SE, Valley TS. Racial Bias in Pulse Oximetry Measurement. N Engl J Med. 2020;383(25):2477–2478. doi: 10.1056/NEJMc2029240
  • 5. Zeballos RJ, Weisman IM. Reliability of Noninvasive Oximetry in Black Subjects during Exercise and Hypoxia. Am Rev Respir Dis. 1991;144(6):1240–1244. doi: 10.1164/ajrccm/144.6.1240
  • 6. Feiner JR, Severinghaus JW, Bickler PE. Dark Skin Decreases the Accuracy of Pulse Oximeters at Low Oxygen Saturation: The Effects of Oximeter Probe Type and Gender. Anesth Analg. 2007;105(6):S18. doi: 10.1213/01.ane.0000285988.35174.d9
  • 7. Bickler PE, Feiner JR, Severinghaus JW. Effects of Skin Pigmentation on Pulse Oximeter Accuracy at Low Saturation. Anesthesiology. 2005;102(4):715–719. doi: 10.1097/00000542-200504000-00004
  • 8. Gudelunas MK, Lipnick M, Hendrickson C, et al. Low Perfusion and Missed Diagnosis of Hypoxemia by Pulse Oximetry in Darkly Pigmented Skin: A Prospective Study. Anesth Analg. 2024;138(3):552. doi: 10.1213/ANE.0000000000006755
  • 9. ISO 80601-2-61:2017. ISO. Accessed June 4, 2024. https://www.iso.org/standard/67963.html
  • 10. Pulse Oximeters - Premarket Notification Submissions [510(k)s]: Guidance for Industry and Food and Drug Administration Staff. April 11, 2024. Accessed June 4, 2024. https://www.fda.gov/regulatory-information/search-fda-guidance-documents/pulse-oximeters-premarket-notification-submissions-510ks-guidance-industry-and-food-and-drug
  • 11. Okunlola OE, Lipnick MS, Batchelder PB, Bernstein M, Feiner JR, Bickler PE. Pulse Oximeter Performance, Racial Inequity, and the Work Ahead. Respir Care. 2022;67(2):252–257. doi: 10.4187/respcare.09795
  • 12. Center for Devices and Radiological Health. Pulse Oximeters for Medical Purposes - Non-Clinical and Clinical Performance Testing, Labeling, and Premarket Submission Recommendations. January 10, 2025. Accessed January 12, 2025. https://www.fda.gov/regulatory-information/search-fda-guidance-documents/pulse-oximeters-medical-purposes-non-clinical-and-clinical-performance-testing-labeling-and
  • 13. openoximetry.org. Collaborative Community. OpenOximetry. Accessed July 12, 2024. https://openoximetry.org/community/
  • 14. FDA Executive Summary: Performance Evaluation of Pulse Oximeters Taking into Consideration Skin Pigmentation, Race and Ethnicity. Published online February 2, 2024:49. https://www.fda.gov/media/175828/download
  • 15. Stakeholder Consultation on Pulse Oximeters: Next Steps and Closing Remarks. 2024. Accessed June 26, 2024. https://www.youtube.com/watch?v=2MqyMdrSZfs
  • 16. Stakeholder Consultation on Pulse Oximeters: Updating Global Requirements for Pulse Oximeters. 2024. Accessed June 26, 2024. https://www.youtube.com/watch?v=zSl5OWkOyck
  • 17. Hypoxia Lab Skin Color Assessment Protocol. Google Docs. Accessed February 10, 2025. https://docs.google.com/document/d/17SGVjf1phxyhBB3TvetyuOFN_UC5z2humiEHI4dcEro/edit?usp=sharing&usp=embed_facebook
  • 18. Skin Tone Research @ Google AI. Accessed June 10, 2024. https://www.skintone.google/mste-dataset
  • 19. Vasudevan S, Vogt WC, Weininger S, Pfefer TJ. Melanometry for objective evaluation of skin pigmentation in pulse oximetry studies. Commun Med. 2024;4(1):1–19. doi: 10.1038/s43856-024-00550-7
  • 20. Stamatas GN, Kollias N. Blood stasis contributions to the perception of skin pigmentation. J Biomed Opt. 2004;9(2):315–322. doi: 10.1117/1.1647545
  • 21. Skin Color Subgroup Meeting 3-March 13, 2024. 2024. Accessed June 9, 2025. https://www.youtube.com/watch?v=6ELI1XcbI
  • 22. Del Bino S, Bernerd F. Variations in skin colour and the biological consequences of ultraviolet radiation exposure. Br J Dermatol. 2013;169(s3):33–40. doi: 10.1111/bjd.12529
  • 23. Leeb G, Auchus I, Law T, et al. The performance of 11 fingertip pulse oximeters during hypoxemia in healthy human participants with varied, quantified skin pigment. eBioMedicine. 2024;102. doi: 10.1016/j.ebiom.2024.105051
  • 24. Lipnick MS, Feiner JR, Au P, Bernstein M, Bickler PE. The Accuracy of 6 Inexpensive Pulse Oximeters Not Cleared by the Food and Drug Administration: The Possible Global Public Health Implications. Anesth Analg. 2016;123(2):338–345. doi: 10.1213/ANE.0000000000001300
  • 25. Hypoxia lab study protocol basic pulse ox hypoxemia performance study (Public comment). Google Docs. Accessed July 1, 2024. https://docs.google.com/document/d/1bpVzZg9M0StLoyB_iF07V8at0h8lQhDyYnNCoSMVZQU/edit?usp=sharing&usp=embed_facebook
  • 26. Approach for Improving the Performance Evaluation of Pulse Oximeter Devices Taking into Consideration Skin Pigmentation, Race and Ethnicity: Discussion Paper and Request for Feedback.
  • 27. openoximetry.org. Data Repository. OpenOximetry. Accessed March 21, 2025. https://openoximetry.org/data-repository/
  • 28. Blanchet MA, Mercier G, Delobel A, et al. Accuracy of Multiple Pulse Oximeters in Stable Critically Ill Patients. Respir Care. 2023;68(5):565–574. doi: 10.4187/respcare.10582
  • 29. Harskamp RE, Bekker L, Himmelreich JCL, et al. Performance of popular pulse oximeters compared with simultaneous arterial oxygen saturation or clinical-grade pulse oximetry: a cross-sectional validation study in intensive care patients. BMJ Open Respir Res. 2021;8(1):e000939. doi: 10.1136/bmjresp-2021-000939
  • 30. Fawzy A, Wu TD, Wang K, et al. Racial and Ethnic Discrepancy in Pulse Oximetry and Delayed Identification of Treatment Eligibility Among Patients With COVID-19. JAMA Intern Med. 2022;182(7):730. doi: 10.1001/jamainternmed.2022.1906
  • 31. Martin D, Johns C, Sorrell L, et al. Effect of skin tone on the accuracy of the estimation of arterial oxygen saturation by pulse oximetry: a systematic review. Br J Anaesth. 2024;132(5):945–956. doi: 10.1016/j.bja.2024.01.023
  • 32. Graham HR, King C, Rahman AE, et al. Reducing global inequities in medical oxygen access: the Lancet Global Health Commission on medical oxygen security. Lancet Glob Health. 2025;13(3):e528–e584. doi: 10.1016/S2214-109X(24)00496-0
  • 33. The BOOST II United Kingdom, Australia, and New Zealand Collaborative Groups. Oxygen Saturation and Outcomes in Preterm Infants. N Engl J Med. 2013;368(22):2094–2104. doi: 10.1056/NEJMoa1302298
  • 34. Oster ME, Pinto NM, Pramanik AK, et al. Newborn Screening for Critical Congenital Heart Disease: A New Algorithm and Other Updated Recommendations: Clinical Report. Pediatrics. 2024;155(1):e2024069667. doi: 10.1542/peds.2024-069667
  • 35. Siemieniuk RAC, Chu DK, Kim LHY, et al. Oxygen therapy for acutely ill medical patients: a clinical practice guideline. Published online October 24, 2018. doi: 10.1136/bmj.k4169
  • 36. World Health Organization. Clinical management of severe acute respiratory infection (SARI) when COVID-19 disease is suspected. Interim guidance. Published online March 13, 2020. Accessed July 21, 2025. https://www.who.int/docs/default-source/coronaviruse/clinical-management-of-novel-cov.pdf
  • 37. Moore KL, Gudelunas K, Lipnick MS, Bickler PE, Hendrickson CM. Pulse Oximeter Bias and Inequities in Retrospective Studies––Now What? Respir Care. 2022;67(12):1633–1636. doi: 10.4187/respcare.10654
  • 38. Lipnick MS, Ehie O, Igaga EN, Bicker P. Pulse Oximetry and Skin Pigmentation—New Guidance From the FDA. JAMA. 2025;333(16):1393–1395. doi: 10.1001/jama.2025.1959
  • 39. Gehring H, Duembgen L, Peterlein M, Hagelberg S, Dibbelt L. Hemoximetry as the “Gold Standard”? Error Assessment Based on Differences Among Identical Blood Gas Analyzer Devices of Five Manufacturers. Anesth Analg. 2007;105(6):S24. doi: 10.1213/01.ane.0000268713.58174.cc
  • 40. Lipnick MS, Chen D, Law T, et al. Comparison of methods for characterizing skin pigment diversity in research cohorts. Br J Dermatol. Published online October 10, 2025:ljaf397. doi: 10.1093/bjd/ljaf397

Articles from medRxiv are provided here courtesy of Cold Spring Harbor Laboratory Preprints
