American Journal of Epidemiology. 2019 May 30;188(9):1695–1704. doi: 10.1093/aje/kwz132

Utility of the 5-Minute Apgar Score as a Research Endpoint

Marit L Bovbjerg, Mekhala V Dissanayake, Melissa Cheyney, Jennifer Brown, Jonathan M Snowden
PMCID: PMC6736341  PMID: 31145428

Abstract

Although Apgar scores are commonly used as proxy outcomes, little evidence exists in support of the most common cutpoints (<7, <4). We used 2 data sets to explore this issue: one contained planned community births from across the United States (n = 52,877; 2012–2016), and the other contained hospital births from California (n = 428,877; 2010). We treated 5-minute Apgars as clinical “tests,” compared against 18 known outcomes; we calculated sensitivity, specificity, positive and negative predictive values, and the area under the receiver operating characteristic curve for each. We used 3 different criteria to determine optimal cutpoints. Results were very consistent across data sets, outcomes, and all subgroups: The cutpoint that maximizes the trade-off between sensitivity and specificity is universally <9. However, extremely low positive predictive values for all outcomes at <9 indicate more misclassification than is acceptable for research. The areas under the receiver operating characteristic curves (which treat Apgars as quasicontinuous) were generally indicative of adequate discrimination between infants destined to experience poor outcomes and those not; comparing median Apgars between groups might be an analytical alternative to dichotomizing. Nonetheless, because Apgar scores are not clearly on any causal pathway of interest, we discourage researchers from using them unless the motivation for doing so is clear.

Keywords: Apgar score, infant health, ROC curve


Dr. Virginia Apgar created her namesake score in 1953 (1), and its use quickly became ubiquitous in maternity care settings throughout high-resource countries, including the United States. Epidemiologists likewise quickly adopted the Apgar score as an outcome in perinatal research, because it is straightforward, easily understood, and almost universally recorded in birth-related data sources. The score comprises 5 categories (skin color, heart rate, reflex irritability, activity/flexion, and respiratory effort) that are each scored from 0 to 2, resulting in an overall range of 0 to 10, with 10 indicating that the highest score was given for each clinical indicator of neonatal well-being. In modern practice, the Apgar score is calculated at 1 and 5 minutes after birth, and again at 10 minutes if the 5-minute Apgar was low.

In theory, Apgar scores are used clinically as a screening tool, to identify those infants in need of more careful observation or intervention; in practice, however, a distressed infant will receive interventions well before the first Apgar is assigned. Thus, clinicians have debated the utility of Apgar scores for screening purposes almost since their inception (2–9), and yet this has not affected the ubiquity with which Apgar scores are assessed and recorded in vital records and other birth data.

Leaving that debate to clinicians, here we are concerned solely and explicitly with a second common use of Apgar scores: use in research, as a proxy endpoint for eventual infant well-being. Few would argue that the Apgar score itself is the thing that matters; rather, it is a marker for other, later, more meaningful neonatal morbidity or mortality endpoints such as neonatal seizures, sepsis, or death. Although validity of surrogate endpoints is a hotly debated topic in clinical research (7, 10–14), relatively little research exists on the utility of the Apgar score in this context. If the Apgar is indeed an adequate proxy endpoint, and dozens of studies each year for decades have treated it in this manner (we cite only some of the most recent examples (15–28)), it must accurately predict long-term sequelae. However, whether the Apgar score can indeed predict (as opposed to merely being associated with) subsequent, more clinically relevant adverse outcomes has received little attention in the literature.

Furthermore, despite near-universal dichotomization of the Apgar in research, little work has been done to determine what cutpoints are most useful for determining which neonates might be expected to have long-term negative sequelae. Soon after her initial publication, Dr. Apgar coauthored a follow-up paper that proposed the now-familiar 0–3, 4–6, and 7–10 groupings (29), commonly referred to as “very low,” “low,” and “normal” Apgar ranges, respectively. Most (indeed, nearly all) research using this variable in the intervening 65 years has operated under the tacit assumption that these categories are the most meaningful when trying to predict neonatal prognosis and longer-term outcomes, despite the absence of evidence supporting their use.

To assess the degree to which Apgar scores predict neonatal complications, and are therefore a valid proxy endpoint when studying pregnancy and birth, we had 3 objectives. The first was to determine the overall discriminatory power of 5-minute Apgar scores for a variety of neonatal outcomes within the context of contemporary US maternity care. The second objective was to determine the optimal cutpoint for 5-minute Apgar scores, maximizing the trade-off between sensitivity and specificity. Our a priori hypothesis for this second objective was that a 5-minute Apgar that was very low (<4) would perform better than one that was low (<7) for this purpose. Because differential Apgar assignment has been observed according to gestational age (2, 3, 7, 30–32), birthweight (4, 33–35), and other factors, including provider type and geographic region (5, 9, 36–38), our third objective was to compare 5-minute Apgar discriminatory performance and chosen cutpoints across different risk-stratified populations.

METHODS

Data sources

We used 2 separate data sets for this analysis. The first is the Midwives Alliance of North America Statistics Project (MANA Stats) data registry. MANA Stats contains medical records–based data from midwife-led, planned, community (home or birth center) births, predominantly in the United States. Reliability and validity of this data set are presented elsewhere (39). Participation by community birth midwives is voluntary; we estimate that MANA Stats captures 16% of birth center births and 19% of homebirths (based on 2015 data, the most recent year for which data are available via Centers for Disease Control and Prevention’s WONDER tool (40)) in the United States. The institutional review board at Oregon State University has approved this analysis; both childbearing women and midwives give informed consent for their MANA Stats data to be used for research.

We used MANA Stats data from birth years 2012 to 2016, excluding people who transferred care to a different provider prior to the onset of labor (usually to a hospital-based provider because of a pregnancy complication, or concerns that insurance would not cover the community birth; n = 10,711). We also excluded records from births that did not occur in the United States (n = 488), those that ended with an intrapartum fetal death (n = 67), and those missing the 5-minute Apgar score (n = 3,218). After applying these criteria, the final sample size used in these analyses was 52,877 pregnancies (53,008 neonates). These are typically (although not always (41, 42)) low-risk pregnancies, given that they are planned community births with midwives. Preterm labor is a contraindication for community birth; thus, these births do not appear in the sample.

The second data set contains linked vital statistics/patient discharge data from the state of California in 2010. This data set is linked and maintained by the Office of Statewide Health Planning and Development (OSHPD). It includes patient discharge data for hospital admissions up to 9 months prior to delivery, linked to maternal and neonatal admissions up to a year postpartum. The California Health and Human Services Agency codes the data according to consistent specifications, performs rigorous quality checks, and reviews the cohort file before release. For the main analysis, we excluded births with missing gestational ages (n = 12,899) or missing Apgar scores (n = 2,901) as well as those that occurred preterm (n = 47,539). The final sample size for this analysis after applying these criteria was 428,877 pregnancies (n = 432,099 neonates). The OSHPD data contain a mix of risk profiles, as one would expect with data from multiple hospitals. The institutional review board at Oregon Health & Science University approved this analysis; these are administrative data so informed consent is not required.

Analysis

To address our first objective (determine overall discriminatory power of 5-minute Apgar scores), we began by dichotomizing the 5-minute Apgar score at every possible cutpoint: 10 versus <10; 9 or 10 versus <9; ≥8 versus <8, etc. We then constructed a screening/diagnostic-testing 2 × 2 table (with Apgar as the “test” and one of 9 known outcomes as the “disease”) for each cutpoint, from which we calculated sensitivity, specificity, and positive (PPV) and negative (NPV) predictive values. We drew receiver operating characteristic (ROC) curves and calculated the area under the curve (AUC), which we used to quantify the overall discriminatory power of 5-minute Apgar scores for each outcome. The AUC can be interpreted as the probability that a randomly selected infant who experienced the outcome scored lower than a randomly selected one who did not (43). Importantly, AUC addresses overall discriminatory power of a clinical test without use of any single cutpoint. This and all other analyses were conducted separately for the 2 data sets, because the precise outcomes and definitions varied slightly between them. For ease of comparison, these outcomes and definitions are shown in Table 1.
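As an illustration of the approach described above, the following minimal Python sketch builds the 2 × 2 table at a single cutpoint, computes the 4 test characteristics, and calculates the AUC in its probabilistic form. It is not the code used for the study; the function names and the toy data are hypothetical, and counting tied scores as one-half in the AUC is a conventional choice assumed here rather than something stated in the text.

```python
from dataclasses import dataclass


@dataclass
class TestStats:
    cutpoint: int
    sensitivity: float
    specificity: float
    ppv: float
    npv: float


def _ratio(numerator, denominator):
    # Return NaN instead of raising when a margin of the 2 x 2 table is empty.
    return numerator / denominator if denominator else float("nan")


def stats_at_cutpoint(scores, outcomes, cutpoint):
    """Treat 'Apgar < cutpoint' as a positive test and tabulate against the outcome."""
    tp = fp = fn = tn = 0
    for score, had_outcome in zip(scores, outcomes):
        test_positive = score < cutpoint
        if test_positive and had_outcome:
            tp += 1
        elif test_positive:
            fp += 1
        elif had_outcome:
            fn += 1
        else:
            tn += 1
    return TestStats(
        cutpoint=cutpoint,
        sensitivity=_ratio(tp, tp + fn),
        specificity=_ratio(tn, tn + fp),
        ppv=_ratio(tp, tp + fp),
        npv=_ratio(tn, tn + fn),
    )


def auc(scores, outcomes):
    """Probability that a random case scored lower than a random non-case.

    Ties are counted as one-half (an assumption of this sketch).
    """
    cases = [s for s, y in zip(scores, outcomes) if y]
    noncases = [s for s, y in zip(scores, outcomes) if not y]
    wins = 0.0
    for case_score in cases:
        for noncase_score in noncases:
            if case_score < noncase_score:
                wins += 1.0
            elif case_score == noncase_score:
                wins += 0.5
    return wins / (len(cases) * len(noncases))


# Toy data (invented for illustration): 10 infants, 3 of whom had the outcome.
toy_scores = [3, 6, 8, 9, 9, 10, 9, 8, 10, 9]
toy_outcomes = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]

for cut in range(1, 11):  # cutpoints <1 through <10
    print(stats_at_cutpoint(toy_scores, toy_outcomes, cut))
print("AUC:", auc(toy_scores, toy_outcomes))
```

Looping over every cutpoint from <1 to <10 yields the sensitivity and specificity pairs that trace out the ROC curve, while the AUC function needs no cutpoint at all.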

Table 1.

Outcomes Used for This Analysis and Their Frequencies, Comparing the Community Birth Sample (Midwives Alliance of North America Statistics Project, United States, 2012–2016) and the Hospital Birth Sample (Office of Statewide Health Planning and Development, California, 2010)

| Outcome | Community birth sample (n = 53,008 infants): definition | Community: outcome frequency (a) | Community: outcome, % | Hospital birth sample (n = 432,099 infants): definition | Hospital: outcome frequency | Hospital: outcome, % |
| --- | --- | --- | --- | --- | --- | --- |
| Neonatal transfer of care | A transfer of care to a higher-level facility (usually transfer from the community setting to a hospital) within 6 hours of birth for a neonatal indication. The most common reasons for neonatal transfer are evaluation of congenital anomalies and symptoms of respiratory distress. | 824/48,587 (b) | 1.7 | Transfer to another facility within 24 hours of delivery | 1,682 | 0.4 |
| Respiratory distress syndrome | Midwife indicated “respiratory distress/respiratory distress syndrome” as a neonatal complication in the first 28 days, as a reason for neonatal transfer to the hospital in the first 6 hours, as a reason for neonatal hospitalization in the first 6 weeks, and/or as the cause of neonatal or infant death. | 900 | 1.7 | ICD-9 code 769 | 1,592 | 0.4 |
| Meconium aspiration syndrome | Midwife indicated “meconium aspiration syndrome” as a neonatal complication in the first 28 days, as a reason for neonatal transfer to the hospital in the first 6 hours, and/or as a reason for neonatal hospitalization in the first 6 weeks. | 142 | 0.3 | ICD-9 code 770.11, 770.12 | 482 | 0.1 |
| Neonatal infection | Midwife indicated “infection other than sepsis” as a neonatal complication in the first 28 days and/or “signs/symptoms of infection in the baby” as a reason for neonatal transfer to the hospital in the first 6 hours or as a reason for neonatal hospitalization in the first 6 weeks. | 524 | 1.0 | | | |
| Asphyxia | | | | ICD-9 code 768.5, 768.6, 768.9, 768.7x | 192 | 0.04 |
| Sepsis | Midwife indicated “sepsis” as a neonatal complication in the first 28 days and/or “bacterial sepsis of the newborn” as the cause of neonatal or infant death. | 60 | 0.1 | ICD-9 code 771.81 | 5,996 | 1.4 |
| Seizures | Midwife indicated “seizures” as a neonatal complication in the first 28 days, as a reason for neonatal transfer to the hospital in the first 6 hours, and/or as a reason for neonatal hospitalization in the first 6 weeks. | 86 | 0.2 | Any involuntary repetitive, convulsive movement or behavior. Serious neurologic dysfunction: severe alteration of alertness. Excludes: (1) lethargy or hypotonia in the absence of other neurologic findings; (2) symptoms associated with CNS congenital anomalies; ICD-9 code of 779.0 | 318 | 0.07 |
| Neonatal hospitalization | Any neonatal hospitalization in the first 6 weeks | 1,893 (c) | 3.6 | | | |
| Prolonged length of stay | | | | Length of stay ≥3 days for a vaginal birth or ≥5 days for a cesarean | 11,589 | 2.68 |
| NICU admission | Any NICU admission in the first 6 weeks | 1,331 (d) | 2.5 | Admission into a facility or unit staffed and equipped to provide continuous mechanical ventilatory support for a newborn | 14,390 | 3.3 |
| Neonatal death | Liveborn infant who died during the first 27 complete days | 51/52,978 (e) | 0.96 per 1,000 (f) | Liveborn infant who died during the first 27 complete days | 291 | 0.67 per 1,000 |

Abbreviations: CNS, central nervous system; ICD-9, International Classification of Diseases, Ninth Revision; NICU, neonatal intensive care unit.

a Denominator was 53,008 unless otherwise specified.

b n = 4,417 intrapartum transfers dropped for analyses using this outcome because they were no longer at risk; n = 3 missing.

c n = 136 missing.

d n = 140 missing.

e n = 30 missing.

f Includes deaths that were attributed to congenital anomalies. Without these, the neonatal death rate in the community birth sample is 0.59/1,000.

For our second objective, we used the sensitivities, specificities, and ROC curves to choose the optimal cutpoint according to 3 criteria that have been advocated in the literature: Youden’s index (44), which selects the cutpoint that maximizes the sum of sensitivity and specificity minus 1; the closest-to-(0,1) criterion (45), which selects the cutpoint that minimizes the linear distance to the (0,1) point in the upper-left corner of the ROC plot; and Liu’s criterion (46), which selects the cutpoint that maximizes the product of sensitivity and specificity. This process was repeated separately for the 9 outcomes, separately in each data set.
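The 3 criteria can be sketched in a few lines of Python. This is an illustrative sketch only: the function names are hypothetical, and the sensitivity and specificity pairs in the example are invented values, not results from either data set.

```python
import math


def youden_index(sensitivity, specificity):
    # Youden's index: sensitivity + specificity - 1 (larger is better).
    return sensitivity + specificity - 1


def distance_to_corner(sensitivity, specificity):
    # Euclidean distance from the ROC point (1 - specificity, sensitivity)
    # to the ideal point (0, 1) (smaller is better).
    return math.hypot(1 - specificity, 1 - sensitivity)


def liu_product(sensitivity, specificity):
    # Liu's criterion: sensitivity * specificity (larger is better).
    return sensitivity * specificity


def best_cutpoints(roc_points):
    """roc_points maps each cutpoint to its (sensitivity, specificity) pair."""
    return {
        "youden": max(roc_points, key=lambda c: youden_index(*roc_points[c])),
        "closest_to_0_1": min(roc_points, key=lambda c: distance_to_corner(*roc_points[c])),
        "liu": max(roc_points, key=lambda c: liu_product(*roc_points[c])),
    }


# Invented values for cutpoints <4, <7, <9, and <10 (illustration only).
example_roc = {
    4: (0.20, 0.999),
    7: (0.35, 0.99),
    9: (0.70, 0.92),
    10: (0.90, 0.40),
}
print(best_cutpoints(example_roc))
```

With these invented inputs, all 3 criteria select the same cutpoint (<9), because very low cutpoints sacrifice too much sensitivity and <10 sacrifices too much specificity. The criteria need not agree in general.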

For research purposes, PPV and NPV are also important. PPV is interpreted as the probability that a particular infant experienced the outcome, given that they tested positive (testing positive is having a 5-minute Apgar score below a certain cutpoint), and NPV is the probability that a particular infant did not experience the outcome, given that they tested negative (testing negative is having a 5-minute Apgar score at or above the cutpoint). We plotted PPV and NPV at each possible cutpoint, for each outcome for each data set, to observe how confident a researcher could be that the 5-minute Apgar score is indeed sufficient as a proxy endpoint for the outcomes studied.

The third objective was to explore whether results from either of the above varied according to risk status of the pregnancy, birth location, delivery mode, provider type, and so on. We undertook a series of sensitivity analyses to accomplish this objective: First, we repeated the above, comparing, in the MANA Stats data, birth center versus home, Certified Nurse Midwife versus Certified Professional Midwife, and finally, term versus postterm (postterm is defined as ≥42 completed weeks; preterm labor is a contraindication for community birth so these pregnancies do not appear in the sample). In the OSHPD data, we compared preterm versus term (postterm ≥42 weeks is uncommon in modern US hospitals), white non-Hispanic versus black versus Hispanic, and planned cesarean without labor versus cesarean during labor versus vaginal birth. We also limited the OSHPD data to only low-risk births (excluding preterm, multiples, breech, and elective repeat cesarean delivery) and repeated the analysis. Analyses were performed using R, version 3.3.2 (R Foundation for Statistical Computing, Vienna, Austria); Stata, version 14 (StataCorp LLC, College Station, Texas); and SPSS, version 24.0.0.0 (IBM Corp, Armonk, New York).

RESULTS

Demographic and pregnancy risk factors for the 2 samples are shown in Table 2. Aside from birth place, pregnancy risk level, provider type, and geographic region (all discussed above), the most notable difference between the data sets is demographic, specifically race/ethnicity: The MANA Stats data are composed of predominantly white, non-Hispanic women, while more than half of the women in the OSHPD data set are Latina. As expected, the community birth sample (MANA Stats) contains much lower rates of common pregnancy risk factors such as multiple gestation, history of cesarean, and primiparity. Notably, the incidence of cesarean in the index pregnancy (which would occur for planned community births after an intrapartum transfer to a hospital) is lower in the MANA Stats data by a factor of 10 (3.2% vs. 32.2%).

Table 2.

Sample Demographic and Pregnancy Risk Characteristics of the Community Birth Sample (Midwives Alliance of North America Statistics Project, United States, 2012–2016) and Hospital Birth Sample (Office of Statewide Health Planning and Development, California, 2010)

| Characteristic | Community birth (n = 52,877 pregnancies): No. | Community birth: % | Hospital birth (n = 428,877 pregnancies): No. | Hospital birth: % |
| --- | --- | --- | --- | --- |
| Maternal age, years (a) | 30.6 (4.9) | | 28.4 (6.3) | |
| Pregravid BMI (b) | 22.9 (20.8–26.0) | | 24.6 (21.6–28.9) | |
| Maternal race: White, non-Hispanic | 45,617 | 86.4 | 116,597 | 27.3 |
| Maternal race: Hispanic | 3,146 | 5.9 | 228,010 | 53.4 |
| Maternal race: Black | 1,381 | 2.6 | 22,506 | 5.3 |
| Maternal race: Asian | 1,925 | 3.6 | 48,823 | 11.4 |
| Maternal race: American Indian/Alaska Native (includes “other, multiracial,” hospital data only) | 765 | 1.4 | 10,738 | 2.5 |
| Mother eligible for Medicaid (c) (community birth), or birth paid for by Medicaid (hospital) | 12,010 | 22.8 | 216,275 | 50.05 |
| Breech presentation | 559 | 1.1 | 10,623 | 2.5 |
| Multiple pregnancy | 131 | 0.2 | 6,800 | 1.6 |
| Mother was primiparous | 17,155 | 32.4 | 172,719 | 40.0 |
| History of cesarean (denominator is multiparas only) | 2,890 | 8.09 | 72,536 | 27.8 |
| Cesarean delivery for the index pregnancy | 1,712 (d) | 3.2 | 138,981 | 32.2 |

Abbreviation: BMI, body mass index.

a Values are expressed as mean (standard deviation).

b BMI calculated as weight (kg)/height (m)². Values are expressed as median (interquartile range).

c The community birth data set uses this variable, rather than whether Medicaid paid for the birth, because Medicaid does not reimburse for community birth in all states.

d Cesarean delivery in the community birth data would occur following an intrapartum transfer of care to a hospital.

Frequencies of neonatal outcomes are shown with the outcome definitions in Table 1. Counterintuitively, incidence of many of these outcomes is higher in the lower-risk community birth sample; this is largely secondary to practice and policy differences between the 2 birth sites and is addressed further in the Discussion.

In terms of overall discriminatory power, we found variable results, although in all cases the Apgar performed better than random. The ROC curve for neonatal death in the community birth sample is shown in Figure 1A, and ROC curves for the remaining outcomes in Web Figure 1 (available at https://academic.oup.com/aje). The AUCs ranged from 0.608 for neonatal infection to 0.908 for meconium aspiration syndrome. The ROC curve for neonatal death in the hospital birth sample is shown in Figure 1B, and ROC curves for the remaining outcomes in Web Figure 2. Here, AUCs ranged from 0.549 for prolonged length of stay to 0.860 for asphyxia. For the 7 outcomes common to both data sets, discrimination appeared to be slightly better in the community birth setting.

Figure 1.


Receiver operating characteristic curves for neonatal death showing discrimination of the 5-minute Apgar score dichotomized at every possible point. A) Data are from medical records (n = 53,008), Midwives Alliance of North America Statistics Project, United States (2012–2016); area under the curve is 0.849. B) Data are from vital records/claims data for hospital births (n = 444,500), Office of Statewide Health Planning and Development, California (2010); area under the curve is 0.846. The asterisk indicates sensitivity/specificity values for the cutoff of <10; the filled square is for <9, and the circle is for <8. In these data, each of the 3 criteria for choosing the “best” cutoff (Youden’s index, the closest to (0,1) criterion, and Liu’s criterion) indicated that <9 is the optimal cutoff.

Across both data sets, for most outcomes, there was agreement among all 3 criteria for the optimal research cutpoint (see Web Figures 1 and 2). Furthermore, and in contrast to our a priori hypothesis, the best cutpoint was neither <4 nor <7; rather, it was consistently <9.

PPVs and NPVs for all outcomes are shown in Figure 2, with MANA Stats data on the left and OSHPD data on the right. NPVs were universally close to 1.0 (as expected for rare outcomes, but they thus provide very little discriminatory power), and PPVs, although slightly more variable, were universally too low to provide adequate identification of infants destined to experience the outcomes. At the “chosen” cutpoint of <9, in particular, the PPVs were mostly well below 20%, indicating upwards of 80% misclassification.
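A short worked calculation may help show why rare outcomes produce this pattern. The sketch below derives PPV and NPV from sensitivity, specificity, and outcome prevalence using Bayes’ theorem. The function name is hypothetical, and the inputs (70% sensitivity, 92% specificity, 1% prevalence) are invented round numbers chosen only to resemble the scale of the outcomes in Table 1; they are not estimates from either data set.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) implied by a test's accuracy and the outcome prevalence."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    false_neg = (1 - sensitivity) * prevalence
    true_neg = specificity * (1 - prevalence)
    ppv = true_pos / (true_pos + false_pos)
    npv = true_neg / (true_neg + false_neg)
    return ppv, npv


# Invented inputs for illustration: a reasonably accurate test, a 1% outcome.
ppv, npv = predictive_values(sensitivity=0.70, specificity=0.92, prevalence=0.01)
print(f"PPV = {ppv:.3f}, NPV = {npv:.3f}")
```

Under these assumed inputs, the PPV is roughly 8% and the NPV is above 99%: when 99% of infants do not experience the outcome, even a small false-positive rate among them swamps the true positives, and nearly any negative result is correct.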

Figure 2.


Positive and negative predictive values for several neonatal outcomes, calculated at all possible 5-minute Apgar score cutoffs. A) Data are from a planned community birth sample (n = 53,008), Midwives Alliance of North America Statistics Project, United States (2012–2016). B) Data are from vital records/claims data for hospital births (n = 444,500), Office of Statewide Health Planning and Development, California (2010). Because the outcomes are rare, all negative predictive values are near 100%; these are the lines at the top of each panel. The positive predictive values are more variable, and appear nearer to the x-axis. LOS, length of stay; MAS, meconium aspiration syndrome; NICU, neonatal intensive care unit; RDS, respiratory distress syndrome.

In our sensitivity analyses (objective 3), nearly identical results were found in all subgroups (data not shown): The most consistently chosen optimal cutpoint according to all 3 criteria was <9, no matter the subgroup, outcome, or data set, but this cutpoint does not perform well in terms of PPV. AUCs were consistently better than random but, as with the main analysis, varied substantially by particular outcome/data set.

DISCUSSION

We found that, in the context of contemporary US maternity care, the overall discriminatory power of the 5-minute Apgar varied by outcome; for some outcomes (e.g., infection, prolonged length of stay) discrimination was only marginally better than random (Web Figures 1 and 2). On the other hand, for arguably the most important outcome, neonatal death, discrimination was relatively high (85% in both data sets; Figure 1). Although few other researchers have approached Apgar scores using AUC from ROC curves, our neonatal death results do agree very closely with the 2 others of which we are aware: 80% discrimination for neonatal death among term infants as reported by Chong and Karlberg (43) and 85% (term) as reported by Cnattingius et al. (47).

Regarding specific cutpoints, given that so much previous research has shown that very low Apgar (<4) is associated with extremely high relative risks and relative odds for adverse neonatal outcomes (2–4, 7, 8, 30, 31, 33–35), we expected to find that <4 (or something similarly low) was the optimal cutpoint when dichotomizing an Apgar score. However, we found instead that <9 was the cutpoint that maximized the trade-off between sensitivity and specificity. This remained true in all subgroups we examined, across all outcomes, consistently in both data sets. Interestingly, <9 is also the “best” cutpoint published in the single other paper we know of that systematically examined test characteristics of options other than <4 and <7 (43). However, even if the Apgar score is dichotomized at this optimal point of <9, at least 80% of infants categorized as “positive” for the outcome will be false positives, regardless of the outcome under study (see Figure 2). There are also numerous false negatives, as evidenced by the sensitivities shown in Web Figures 1 and 2 at the <9 points. This level of error makes the dichotomized Apgar score a poor proxy outcome for research purposes. Therefore, we propose that future research analyses should not use a dichotomized Apgar score: Even at the “best” cutpoint, discriminatory performance is too poor.

Despite these shortcomings of the dichotomized Apgar score, our AUC results suggest relatively high discriminatory power for the nondichotomized Apgar score. AUC treats the Apgar as a quasicontinuous variable, and when used in this manner, the Apgar does indeed perform adequately at discriminating between infants who will become cases and those who will not, for many outcomes. Thus, we propose that, for analyses that must use the Apgar score, researchers should plan to compare median Apgar scores between 2 groups instead of dichotomizing. This approach is also not without issues (most importantly, treating an 11-level ordinal variable as though it were continuous, with all of the assumptions therein), but given the poor performance of dichotomized Apgar scores, we believe that it is the best solution short of no longer using Apgar scores in research.
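A minimal sketch of such a comparison follows. The text does not specify a particular statistical test, so this example simply reports each group’s median alongside a rank-based summary (the probability that a randomly chosen infant from one group scored lower than one from the other, with ties counted as one-half), which is the same quantity as the AUC when the groups are defined by outcome status. The function name and the scores are hypothetical.

```python
from statistics import median


def compare_apgar(group_a, group_b):
    """Summarize two groups of 5-minute Apgar scores without dichotomizing."""
    lower = 0.0
    for a in group_a:
        for b in group_b:
            if a < b:
                lower += 1.0
            elif a == b:
                lower += 0.5  # ties split evenly (an assumption of this sketch)
    return {
        "median_a": median(group_a),
        "median_b": median(group_b),
        "prob_a_lower": lower / (len(group_a) * len(group_b)),
    }


# Invented scores for two small groups (illustration only).
exposed = [7, 8, 8, 9, 9]
unexposed = [8, 9, 9, 10, 10]
print(compare_apgar(exposed, unexposed))
```

Because the summary is rank based, it respects the ordinal nature of the score and does not require treating a 1-point difference as equally meaningful everywhere on the 0 to 10 scale.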

Using medians might solve one issue going forward, but still unaddressed is the substantial body of literature reporting large risk and odds ratios for associations between the Apgar score and various adverse outcomes, using the standard <7 or <4 dichotomized Apgar scores (seizures (48, 49), respiratory distress (50, 51), sepsis (52), neonatal intensive care unit admission (51), hypoxic-ischemic encephalopathy (51), cerebral palsy (33, 35, 49, 53), and neonatal or infant death (2–4, 7, 30, 31, 33–35, 49, 54)). The seeming incompatibility between the extensive misclassification shown in our results and the strong associations between dichotomized scores and poor outcomes is a result of several factors. First, there is an underappreciated distinction between predictive performance (e.g., as quantified by an AUC) and strength of statistical association (e.g., as quantified by a risk ratio or risk difference) (55). It has likewise been noted that even very strong values of an explanatory measure such as an odds ratio or risk difference do not necessarily imply strong predictive performances (56). One thus cannot conclude that Apgar scores (or any similar measure) predict future poor outcomes merely because statistically significant odds or risk ratios have been reported in the literature. Our results, in fact, suggest that the opposite is true for Apgar scores: They are indeed associated with numerous adverse neonatal outcomes (a population-level measure), but they are unsuccessful for prediction (an individual-level endeavor).
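The distinction between association and prediction can be made concrete with a hypothetical 2 × 2 table. In the sketch below, the counts are invented for a notional cohort of 100,000 births and do not come from either data set; the function name is likewise hypothetical.

```python
def association_vs_prediction(a, b, c, d):
    """Summarize a 2 x 2 table of dichotomized Apgar score by outcome.

    a: low Apgar, outcome       b: low Apgar, no outcome
    c: normal Apgar, outcome    d: normal Apgar, no outcome
    """
    risk_low = a / (a + b)
    risk_normal = c / (c + d)
    return {
        "risk_ratio": risk_low / risk_normal,
        "odds_ratio": (a * d) / (b * c),
        "ppv": risk_low,  # PPV equals the risk among test positives
        "sensitivity": a / (a + c),
    }


# Invented counts: 2,000 low scores and 100 outcomes among 100,000 births.
summary = association_vs_prediction(a=40, b=1960, c=60, d=97940)
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

In this invented table, the risk ratio exceeds 30, which would be reported as a very strong association, yet only 2% of infants with a low score experience the outcome and 60% of the infants who experience it had a normal score.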

The second issue is where the Apgar score falls in the causal structure. From a research perspective, the Apgar score is neither an exposure, a confounder, nor a true outcome. Rather, the Apgar score is a way of quantifying a mediator (a “descending proxy,” in the language of directed acyclic graphs (57)). As shown in Figure 3, a pregnant woman’s medical history, demographic characteristics, and pregnancy risk level will affect the course of labor, which in turn affects how well the infant fares during the immediate postpartum period, which in turn affects morbidity and mortality outcomes. Importantly, the Apgar score does not appear in this causal chain—it instead quantifies the third factor, that of how well an infant is faring during the immediate postpartum period, which includes how well they are responding to any resuscitation efforts. The Apgar score does not, by itself, cause any adverse outcomes (there are no arrows leading away from Apgar in Figure 3).

Figure 3.


Proposed causal model demonstrating where the 5-minute Apgar score falls on the causal pathway. We included all posited arrows involving the Apgar score, our key study variable. For simplicity and ease of reading, we have omitted some arrows (e.g., direct arrows from pregnancy and maternal factors to adverse neonatal outcomes).

One could argue that we actually do not know the degree to which Apgar scores are themselves meaningful outcomes, but the clinical basis does not provide a priori evidence that they are. For example, the 0–10 score range was chosen arbitrarily, and there are any number of ways a neonate could score a 7 (and almost certainly these are not medically equivalent) (47). Public health or clinical interventions are generally not proposed with the exclusive purpose of modifying Apgar scores, and it seems unlikely that this could be accomplished without modifying exposures in Figure 3, outcomes, or both. Thus the degree to which Apgar scores are meaningful at all in research is open for debate.

Data presented in Tables 1 and 2 appear contradictory on first inspection: The community birth data do, indeed, represent a lower-risk sample (lower proportions of primiparas, breech presentations, twins, etc.), yet they have higher incidences of many of the adverse outcomes. This is not a result of the community birth setting per se (i.e., for low-risk women it is not inherently dangerous to plan a community birth (41, 58–62)) but rather a natural consequence of the realities of practicing in the community birth setting. For instance, neonatal transfer is more common simply because, for community births (in the MANA Stats data), the threshold for transfer is “does this baby need a hospital?”, whereas for OSHPD data, the birth occurred in a hospital already, and so the threshold is “does this baby need a higher-level hospital?” Often the transfers recorded in MANA Stats data are for evaluation of congenital anomalies, monitoring for suspected respiratory distress syndrome, and similar issues. None of these would likely trigger a transfer from a lower-level hospital to an academic hospital unless the neonate’s condition was extremely poor. Further, some births in the OSHPD data set are already occurring in the highest-level, most specialized hospital in the region. Thus, a meaningful proportion of the OSHPD hospital births are not “at risk” for the neonatal transfer outcome (although we could not distinguish these), while all births are at risk of transfer in the MANA Stats data. This also helps to explain the respiratory distress syndrome and meconium aspiration syndrome discrepancies. In the MANA Stats data, we counted an infant as having experienced respiratory distress syndrome or meconium aspiration syndrome if those were listed as the reason for transfer. One must consider, however, that community midwives are trained to transfer conservatively, when any of these or other conditions are even suspected; not all infants thus transferred will receive these as official diagnoses after evaluation at the hospital. As such, these conditions are likely overestimated in the MANA Stats data, because we do not have access to the eventual discharge diagnoses. Additionally, midwives use their medical records for reference when entering MANA Stats data, whereas OSHPD data come from discharge/claims codes and birth certificates. Because not all conditions listed in the electronic health record make it onto the discharge code list, conditions such as respiratory distress syndrome or meconium aspiration syndrome might be underestimated in the OSHPD data. Nonetheless, 2 of the most important outcomes, neonatal intensive care unit admission and sepsis, are indeed more prevalent in the OSHPD data, as one would expect for a sample including more higher-risk births. These outcomes are likely also recorded in the data sets with the least amount of error, because they have clear diagnostic criteria and billing implications.

Strengths and limitations

To our knowledge, ours is the largest study that quantitatively and theoretically assesses the usefulness of Apgar scores for research purposes. We included data from 2 different data sets and from neonates with a variety of demographic characteristics, pregnancy-related risk factors, birth locations, and provider types. Our results were remarkably consistent regardless of how we subgrouped the data and, furthermore, are highly consistent with the few previous reports that exist. Our very large sample provided sufficient event numbers even for rare outcomes, and the data sets were sufficiently detailed to allow inclusion of numerous outcomes. Although there are noted differences in assigning Apgar scores according to birth attendant credential, birth place, gestational age, and other variables, including delivery mode (36, 38, 43, 47, 63), the consistency of our results across outcomes and data sets, including across all sensitivity analysis subgroups, suggests that this variability does not affect our main conclusions. It remains possible, however, that the Apgar score would perform well in a subgroup other than those we tested.

However, the extent to which our findings generalize to the US population is limited by our data sets. Participation in the MANA Stats data registry is voluntary, and it includes data from 15%–20% of planned community births in the United States. Community births themselves comprise fewer than 2% of US births annually (64). Additionally, our hospital data come exclusively from California. Regional variability is known to occur in other elements of obstetric practice (65), and Apgar scores might also vary regionally within the United States (there is evidence that they vary by country (38)). We thus certainly have substantial selection bias in our sample, although the degree to which this might affect our main findings is unclear (66). Future research could explore the Apgar score’s predictive ability in other populations as well.

Another limitation is that timing of neonatal death is not available in the data sets. An infant who dies in the first hour after birth is not at risk for some of the other outcomes we examined (prolonged length of stay or neonatal intensive care unit admission, in particular), and we were unable to remove these infants from the denominators. However, neonatal death is a rare enough outcome in our sample that it seems unlikely that removing these infants from the denominators for later-onset outcomes would substantially affect results. We also lack data on timing of diagnoses (e.g., meconium aspiration syndrome), and there is likely some conflation of relevant pre- and post-Apgar assignment events beyond those discussed above. Finally, it is certainly the case that labor, delivery, and neonatal care practices change over time (67, 68), and the MANA Stats data and the OSHPD data are not from the same birth years. This could have affected our results, and might also hinder comparisons with earlier work. However, given the high degree of consistency of our results across all analyses, subgroups, outcomes, and 2 very distinct data sets, any biases introduced by these limitations would need to be substantial to alter our main conclusions.
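The denominator argument above can be illustrated with a small arithmetic sketch. Only the hospital sample size (n = 428,877) comes from the study; the event counts and the function names are hypothetical and chosen purely for illustration, so the printed figure says nothing about the actual data.

```python
def incidence(events: int, denominator: int) -> float:
    """Cumulative incidence expressed as a simple proportion."""
    if denominator <= 0:
        raise ValueError("denominator must be positive")
    return events / denominator


def relative_change_after_removal(events: int, n_total: int, n_not_at_risk: int) -> float:
    """Relative change in incidence when infants who were never at risk
    are dropped from the denominator. The event count is unchanged,
    because infants not at risk cannot contribute events."""
    before = incidence(events, n_total)
    after = incidence(events, n_total - n_not_at_risk)
    return (after - before) / before


# Inputs for illustration only.
n_total = 428_877          # hospital sample size reported in the abstract
early_deaths = 400         # ASSUMED count of very early neonatal deaths
nicu_admissions = 30_000   # ASSUMED count of NICU admissions

change = relative_change_after_removal(nicu_admissions, n_total, early_deaths)
print(f"Relative change in estimated NICU incidence: {change:.4%}")
```

Under these assumed counts, the relative change equals the number removed divided by the remaining denominator, which stays small whenever the removed group is a small fraction of the sample.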

CONCLUSIONS

We conclude that a dichotomized Apgar score is not an effective proxy outcome for research purposes. Even at the cutpoint that consistently maximizes sensitivity and specificity (which was neither <7 nor <4 but rather <9), both the false-positive and false-negative rates are higher than is acceptable for research purposes. The Apgar score performs better when left as a quasicontinuous measure. However, given the variable and in some cases very limited predictive performance we have demonstrated for a wide range of neonatal complications, we encourage epidemiologists and health services researchers to refrain from using Apgar scores when possible or to be explicit about what analytical purpose the Apgar score is serving.
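To make the link between cutpoint performance and misclassification concrete, the following minimal sketch derives predictive values from sensitivity, specificity, and outcome prevalence using Bayes' rule, and computes Youden's J as one common summary of the trade-off between sensitivity and specificity. The input values and function names are hypothetical, not estimates from this study; they show only why a rare outcome can yield a low positive predictive value even when sensitivity and specificity look reasonable.

```python
def predictive_values(sensitivity: float, specificity: float, prevalence: float):
    """Return (PPV, NPV) for a dichotomous test via Bayes' rule.

    All three inputs are proportions strictly between 0 and 1.
    """
    for name, value in (("sensitivity", sensitivity),
                        ("specificity", specificity),
                        ("prevalence", prevalence)):
        if not 0.0 < value < 1.0:
            raise ValueError(f"{name} must be strictly between 0 and 1")

    # Expected cell proportions of the 2x2 table.
    true_pos = sensitivity * prevalence
    false_neg = (1.0 - sensitivity) * prevalence
    true_neg = specificity * (1.0 - prevalence)
    false_pos = (1.0 - specificity) * (1.0 - prevalence)

    ppv = true_pos / (true_pos + false_pos)
    npv = true_neg / (true_neg + false_neg)
    return ppv, npv


def youden_j(sensitivity: float, specificity: float) -> float:
    """Youden's J statistic: sensitivity + specificity - 1."""
    return sensitivity + specificity - 1.0


# ASSUMED values for illustration only (not results from this study).
sens, spec, prev = 0.70, 0.80, 0.01

ppv, npv = predictive_values(sens, spec, prev)
print(f"Youden's J = {youden_j(sens, spec):.2f}")
print(f"PPV = {ppv:.3f}, NPV = {npv:.3f}")
```

Because the positive predictive value depends on prevalence, rerunning the sketch with a smaller `prev` lowers it further while leaving sensitivity, specificity, and Youden's J unchanged.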

Supplementary Material

Web Material

ACKNOWLEDGMENTS

Author affiliations: Epidemiology Program, College of Public Health and Human Sciences, Oregon State University, Corvallis, Oregon (Marit L. Bovbjerg); Department of Obstetrics and Gynecology, Oregon Health and Science University, Portland, Oregon (Mekhala V. Dissanayake); Anthropology Program, College of Liberal Arts, Oregon State University, Corvallis, Oregon (Melissa Cheyney); Department of Epidemiology, School of Public Health, University of Washington, Seattle, Washington (Jennifer Brown); and School of Public Health, Oregon Health and Science University–Portland State University, Portland, Oregon (Jonathan M. Snowden).

This work was supported by the Health Resources and Services Administration (grant R40MC26810 to M.L.B. and M.C.) and the Eunice Kennedy Shriver National Institute of Child Health and Human Development (grant R00 HD079658-03 to J.M.S., also funding M.V.D.). Data collection for Midwives Alliance of North America Statistics Project, J.B. as a research assistant, and assistance with manuscript formatting were funded by the Foundation for the Advancement of Midwifery.

We thank Frances M. Biel for assistance with manuscript formatting. We also thank the Midwives Alliance of North America Division of Research volunteers, without whose work data collection would not be possible.

This work was presented at the 30th Annual Meeting of the Society for Pediatric and Perinatal Epidemiologic Research, June 18–19, 2018, Baltimore, Maryland; and the International Normal Labour and Birth Research Conference, June 25–27, 2018, Ann Arbor, Michigan.

Conflict of interest: none declared.

Abbreviations

AUC: area under the curve
MANA Stats: Midwives Alliance of North America Statistics Project
NPV: negative predictive value
OSHPD: Office of Statewide Health Planning and Development
PPV: positive predictive value
ROC: receiver operating characteristic

REFERENCES

  • 1. Apgar V. A proposal for a new method of evaluation of the newborn infant. Curr Res Anesth Analg. 1953;32(4):260–267. [PubMed] [Google Scholar]
  • 2. Li F, Wu T, Lei X, et al. The Apgar score and infant mortality. PLoS One. 2013;8(7):e69072. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 3. Iliodromiti S, Mackay DF, Smith GC, et al. Apgar score and the risk of cause-specific infant mortality: a population-based cohort study. Lancet. 2014;384(9956):1749–1755. [DOI] [PubMed] [Google Scholar]
  • 4. Drage JS, Kennedy C, Schwarz BK. The Apgar score as an index of neonatal mortality. A report from the Collaborative Study of Cerebral Palsy. Obstet Gynecol. 1964;24:222–230. [PubMed] [Google Scholar]
  • 5. Rubarth L. The Apgar score: simple yet complex. Neonatal Netw. 2012;31(3):169–177. [DOI] [PubMed] [Google Scholar]
  • 6. Jepson HA, Talashek ML, Tichy AM. The Apgar score: evolution, limitations, and scoring guidelines. Birth. 1991;18(2):83–92. [DOI] [PubMed] [Google Scholar]
  • 7. Schmidt B, Kirpalani H, Rosenbaum P, et al. Strengths and limitations of the Apgar score: a critical appraisal. J Clin Epidemiol. 1988;41(9):843–850. [DOI] [PubMed] [Google Scholar]
  • 8. Ellis M, Manandhar N, Manandhar DS, et al. An Apgar score of three or less at one minute is not diagnostic of birth asphyxia but is a useful screening test for neonatal encephalopathy. Indian Pediatr. 1998;35(5):415–421. [PubMed] [Google Scholar]
  • 9. Bharti B, Bharti S. A review of the Apgar score indicated that contextualization was required within the contemporary perinatal and neonatal care framework in different settings. J Clin Epidemiol. 2005;58(2):121–129. [DOI] [PubMed] [Google Scholar]
  • 10. Alonso A, Van der Elst W, Molenberghs G, et al. On the relationship between the causal-inference and meta-analytic paradigms for the validation of surrogate endpoints. Biometrics. 2015;71(1):15–24. [DOI] [PubMed] [Google Scholar]
  • 11. Buyse M, Molenberghs G, Paoletti X, et al. Statistical evaluation of surrogate endpoints with examples from cancer clinical trials. Biom J. 2016;58(1):104–132. [DOI] [PubMed] [Google Scholar]
  • 12. Gomella LG, Oliver Sartor A. The current role and limitations of surrogate endpoints in advanced prostate cancer. Urol Oncol. 2014;32(1):28.e1–28.e9. [DOI] [PubMed] [Google Scholar]
  • 13. Patel RB, Vaduganathan M, Samman-Tahhan A, et al. Trends in utilization of surrogate endpoints in contemporary cardiovascular clinical trials. Am J Cardiol. 2016;117(11):1845–1850. [DOI] [PubMed] [Google Scholar]
  • 14. Schievink B, Lambers Heerspink H, Leufkens H, et al. The use of surrogate endpoints in regulating medicines for cardio-renal disease: opinions of stakeholders. PLoS One. 2014;9(9):e108722. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 15. Bovbjerg ML, Cheyney M, Everson C. Maternal and newborn outcomes following waterbirth: the Midwives Alliance of North America Statistics Project, 2004 to 2009 cohort. J Midwifery Womens Health. 2016;61(1):11–20. [DOI] [PubMed] [Google Scholar]
  • 16. Cox KJ, Bovbjerg ML, Cheyney M, et al. Planned home VBAC in the United States, 2004–2009: outcomes, maternity care practices, and implications for shared decision making. Birth. 2015;42(4):299–308. [DOI] [PubMed] [Google Scholar]
  • 17. Nethery E, Gordon W, Bovbjerg ML, et al. Rural community birth: maternal and neonatal outcomes for planned community births among rural women in the United States, 2004–2009. Birth. 2018;45(2):120–129. [DOI] [PubMed] [Google Scholar]
  • 18. Al-Shaikh GK, Ibrahim GH, Fayed AA, et al. Grand multiparity and the possible risk of adverse maternal and neonatal outcomes: a dilemma to be deciphered. BMC Pregnancy Childbirth. 2017;17:310. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 19. Kim T, Burn SC, Bangdiwala A, et al. Neonatal morbidity and maternal complication rates in women with a delivery body mass index of 60 or higher. Obstet Gynecol. 2017;130(5):988–993. [DOI] [PubMed] [Google Scholar]
  • 20. Kuper SG, Sievert RA, Steele R, et al. Maternal and neonatal outcomes in indicated preterm births based on the intended mode of delivery. Obstet Gynecol. 2017;130(5):1143–1151. [DOI] [PubMed] [Google Scholar]
  • 21. Zhou X, Jin LJ, Hu CY, et al. Efficacy and safety of remifentanil for analgesia in cesarean delivery. Medicine (Baltimore). 2017;96(48):e8341. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 22. Lindroos L, Elfvin A, Ladfors L, et al. The effect of twin-to-twin delivery time intervals on neonatal outcome for second twins. BMC Pregnancy Childbirth. 2018;18:36. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 23. Goossens SMTA, Ensing S, van der Hoeven MAHBM, et al. Comparison of planned caesarean delivery and planned vaginal delivery in women with a twin pregnancy: a nation wide cohort study. Eur J Obstet Gynecol Reprod Biol. 2018;221:97–104. [DOI] [PubMed] [Google Scholar]
  • 24. Dolgun ZN, Inan C, Altintas AS, et al. Is there a relationship between route of delivery, perinatal characteristics, and neonatal outcome in preterm birth? Niger J Clin Pract. 2018;21(3):312–317. [DOI] [PubMed] [Google Scholar]
  • 25. Fajar JK, Andalas M, Harapan H. Comparison of Apgar scores in breech presentations between vaginal and cesarean delivery. Ci Ji Yi Xue Za Zhi. 2017;29(1):24–29. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 26. Jang W, Flatley C, Greer RM, et al. Comparison between public and private sectors of care and disparities in adverse neonatal outcomes following emergency intrapartum cesarean at term—a retrospective cohort study. PLoS One. 2017;12(11):e0187040. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 27. Bodner-Adler B, Kimberger O, Griebaum J, et al. A ten-year study of midwife-led care at an Austrian tertiary care center: a retrospective analysis with special consideration of perineal trauma. BMC Pregnancy Childbirth. 2017;17:357. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 28. Grobman WA, Rice MM, Reddy UM, et al. Labor induction versus expectant management in low-risk nulliparous women. N Engl J Med. 2018;379(6):513–523. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 29. Apgar V, James LS. Further observations on the newborn scoring system. Am J Dis Child. 1962;104:419–428. [DOI] [PubMed] [Google Scholar]
  • 30. Lagatta J, Yan K, Hoffmann R. The association between 5-min Apgar score and mortality disappears after 24 h at the borderline of viability. Acta Paediatr. 2012;101(6):e243–e247. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 31. Lee HC, Subeh M, Gould JB. Low Apgar score and mortality in extremely preterm neonates born in the United States. Acta Paediatr. 2010;99(12):1785–1789. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 32. Marinov VG, Koleva-Georgieva DN, Sivkova NP, et al. The 5-minute Apgar score as a prognostic factor for development and progression of retinopathy of prematurity. Folia Med (Plovdiv). 2017;59(1):78–83. [DOI] [PubMed] [Google Scholar]
  • 33. Nelson KB, Ellenberg JH. Apgar scores as predictors of chronic neurologic disability. Pediatrics. 1981;68(1):36–44. [PubMed] [Google Scholar]
  • 34. de Oliveira TG, Freire PV, Moreira FT, et al. Apgar score and neonatal mortality in a hospital located in the southern area of São Paulo City, Brazil. Einstein (Sao Paulo). 2012;10(1):22–28. [DOI] [PubMed] [Google Scholar]
  • 35. Phalen AG, Kirkby S, Dysart K. The 5-minute Apgar score: survival and short-term outcomes in extremely low-birth-weight infants. J Perinat Neonatal Nurs. 2012;26(2):166–171. [DOI] [PubMed] [Google Scholar]
  • 36. Grünebaum A, McCullough LB, Brent RL, et al. Justified skepticism about Apgar scoring in out-of-hospital birth settings. J Perinat Med. 2015;43(4):455–460. [DOI] [PubMed] [Google Scholar]
  • 37. Clark DA, Hakanson DO. The inaccuracy of Apgar scoring. J Perinatol. 1988;8(3):203–205. [PubMed] [Google Scholar]
  • 38. Siddiqui A, Cuttini M, Wood R, et al. Can the Apgar score be used for international comparisons of newborn health? Paediatr Perinat Epidemiol. 2017;31(4):338–345. [DOI] [PubMed] [Google Scholar]
  • 39. Cheyney M, Bovbjerg M, Everson C, et al. Development and validation of a national data registry for midwife-led births: the Midwives Alliance of North America Statistics Project 2.0 dataset. J Midwifery Womens Health. 2014;59(1):8–16. [DOI] [PubMed] [Google Scholar]
  • 40. Centers for Disease Control Natality, 2007–2015 Results Form. https://wonder.cdc.gov/controller/datarequest/D66;jsessionid=8FA38BA256FF8A57D57DB827BB7334D0 Accessed February 16, 2018.
  • 41. Cheyney M, Bovbjerg M, Everson C, et al. Outcomes of care for 16,924 planned home births in the United States: the Midwives Alliance of North America Statistics Project, 2004 to 2009. J Midwifery Womens Health. 2014;59(1):17–27. [DOI] [PubMed] [Google Scholar]
  • 42. Bovbjerg ML, Cheyney M, Brown J, et al. Perspectives on risk: assessment of risk profiles and outcomes among women planning community birth in the United States. Birth. 2017;44(3):209–221. [DOI] [PubMed] [Google Scholar]
  • 43. Chong DSY, Karlberg J. Refining the Apgar score cut-off point for newborns at risk. Acta Paediatr. 2004;93:53–59. [PubMed] [Google Scholar]
  • 44. Youden WJ. Index for rating diagnostic tests. Cancer. 1950;3(1):32–35. [DOI] [PubMed] [Google Scholar]
  • 45. Perkins NJ, Schisterman EF. The inconsistency of “optimal” cutpoints obtained using two criteria based on the receiver operating characteristic curve. Am J Epidemiol. 2006;163(7):670–675. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 46. Liu X. Classification accuracy and cut point selection. Stat Med. 2012;31(23):2676–2686. [DOI] [PubMed] [Google Scholar]
  • 47. Cnattingius S, Norman M, Granath F, et al. Apgar score components at 5 minutes: risks and prediction of neonatal mortality. Paediatr Perinat Epidemiol. 2017;31(4):328–337. [DOI] [PubMed] [Google Scholar]
  • 48. Blackman JA. The value of Apgar scores in predicting developmental outcome at age five. J Perinatol. 1988;8(3):206–210. [PubMed] [Google Scholar]
  • 49. Thorngren-Jerneck K, Herbst A. Low 5-minute Apgar score: a population-based register study of 1 million term births. Obstet Gynecol. 2001;98(1):65–70. [DOI] [PubMed] [Google Scholar]
  • 50. Wennergren M, Krantz M, Hjalmarson O, et al. Low Apgar score as a risk factor for respiratory disturbances in the newborn infant. J Perinat Med. 1987;15(2):153–160. [DOI] [PubMed] [Google Scholar]
  • 51. Salustiano EM, Campos JA, Ibidi SM, et al. Low Apgar scores at 5 minutes in a low risk population: maternal and obstetrical factors and postnatal outcome. Rev Assoc Medica Bras (1992). 2012;58(5):587–593. [PubMed] [Google Scholar]
  • 52. Soman M, Green B, Daling J. Risk factors for early neonatal sepsis. Am J Epidemiol. 1985;121(5):712–719. [DOI] [PubMed] [Google Scholar]
  • 53. Krebs L, Langhoff-Roos J, Thorngren-Jerneck K. Long-term outcome in term breech infants with low Apgar score—a population-based follow-up. Eur J Obstet Gynecol Reprod Biol. 2001;100(1):5–8. [DOI] [PubMed] [Google Scholar]
  • 54. Casey BM, McIntire DD, Leveno KJ. The continuing value of the Apgar score for the assessment of newborn infants. N Engl J Med. 2001;344(7):467–471. [DOI] [PubMed] [Google Scholar]
  • 55. Shmueli G. To explain or predict? Stat Sci. 2010;25(3):289–310. [Google Scholar]
  • 56. Pepe MS, Janes H, Longton G, et al. Limitations of the odds ratio in gauging the performance of a diagnostic, prognostic, or screening marker. Am J Epidemiol. 2004;159(9):882–890. [DOI] [PubMed] [Google Scholar]
  • 57. Schisterman EF, Cole SR, Platt RW. Overadjustment bias and unnecessary adjustment in epidemiologic studies. Epidemiology. 2009;20(4):488–495. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 58. de Jonge A, van der Goes BY, Ravelli AC, et al. Perinatal mortality and morbidity in a nationwide cohort of 529,688 low-risk planned home and hospital births. BJOG. 2009;116(9):1177–1184. [DOI] [PubMed] [Google Scholar]
  • 59. Offerhaus P, Rijnders M, de Jonge A, et al. Planned home compared with planned hospital births in the Netherlands: intrapartum and early neonatal death in low-risk pregnancies. Obstet Gynecol. 2012;119(2):387–388; author reply 388–389. [DOI] [PubMed] [Google Scholar]
  • 60. Birthplace in England Collaborative Group, Brocklehurst P, Hardy P, et al. Perinatal and maternal outcomes by planned place of birth for healthy women with low risk pregnancies: the Birthplace in England national prospective cohort study. BMJ. 2011;343:d7400. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 61. Stapleton SR, Osborne C, Illuzzi J. Outcomes of care in birth centers: demonstration of a durable model. J Midwifery Womens Health. 2013;58(1):3–14. [DOI] [PubMed] [Google Scholar]
  • 62. Snowden JM, Tilden EL, Snyder J, et al. Planned out-of-hospital birth and birth outcomes. N Engl J Med. 2015;373(27):2642–2653. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 63. Watterberg KL, Committee on Fetus and Newborn. Policy statement on planned home birth: upholding the best interests of children and families. Pediatrics. 2013;132(5):924–926. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 64. MacDorman MF, Declercq E. Trends and characteristics of United States out-of-hospital births 2004–2014: new information on risk status and access to care. Birth. 2016;43(2):116–124. [DOI] [PubMed] [Google Scholar]
  • 65. Baicker K, Buckles KS, Chandra A. Geographic variation in the appropriate use of cesarean delivery. Health Aff (Millwood). 2006;25(5):w355–w367. [DOI] [PubMed] [Google Scholar]
  • 66. Rothman KJ, Gallacher JE, Hatch EE. Why representativeness should be avoided. Int J Epidemiol. 2013;42(4):1012–1014. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 67. Vivian-Taylor J, Sheng J, Hadfield RM, et al. Trends in obstetric practices and meconium aspiration syndrome: a population-based study. BJOG. 2011;118(13):1601–1607. [DOI] [PubMed] [Google Scholar]
  • 68. Harrison W, Goodman D. Epidemiologic trends in neonatal intensive care, 2007–2012. JAMA Pediatr. 2015;169(9):855–862. [DOI] [PubMed] [Google Scholar]

Articles from American Journal of Epidemiology are provided here courtesy of Oxford University Press
