PLOS ONE. 2020 Dec 15;15(12):e0243843. doi: 10.1371/journal.pone.0243843

Developing a national birth cohort for child health research using a hospital admissions database in England: The impact of changes to data collection practices

Ania Zylbersztejn 1,2,*, Ruth Gilbert 1,2,3, Pia Hardelid 1,2
Editor: Umberto Simeoni 4
PMCID: PMC7737962  PMID: 33320878

Abstract

Background

National birth cohorts derived from administrative health databases constitute unique resources for child health research due to whole country coverage, ongoing follow-up and linkage to other data sources. In England, a national birth cohort can be developed using Hospital Episode Statistics (HES), an administrative database covering details of all publicly funded hospital activity, including 97% of births, with longitudinal follow-up via linkage to hospital and mortality records. We present methods for developing a national birth cohort using HES and assess the impact of changes to data collection over time on coverage and completeness of linked follow-up records for children.

Methods

We developed a national cohort of singleton live births in 1998–2015, with information on key risk factors at birth (birth weight, gestational age, maternal age, ethnicity, area-level deprivation). We identified three changes to data collection, which could affect linkage of births to follow-up records: (1) the introduction of the “NHS Numbers for Babies (NN4B)”, an on-line system which enabled maternity staff to request a unique healthcare patient identifier (NHS number) immediately at birth rather than at civil registration, in Q4 2002; (2) the introduction of additional data quality checks at civil registration in Q3 2009; and (3) correcting a postcode extraction error for births by the data provider in Q2 2013. We evaluated the impact of these changes on trends in two outcomes in infancy: hospital readmissions after birth (using interrupted time series analyses) and mortality rates (compared to published national statistics).

Results

The cohort covered 10,653,998 babies, accounting for 96% of singleton live births in England in 1998–2015. Overall, 2,077,929 infants (19.5%) had at least one hospital readmission after birth. Readmission rates declined by 0.2 percentage points per annual quarter in Q1 1998 to Q3 2002, shifted up by 6.1 percentage points (compared to the expected value based on the trend before Q4 2002) to 17.7% in Q4 2002 when NN4B was introduced, and increased by 0.1 percentage points per annual quarter thereafter. Infant mortality rates were under-reported by 16% for births in 1998–2002 and similar to published national mortality statistics for births in 2003–2015. The trends in infant readmission were not affected by changes to data collection practices in Q3 2009 and Q2 2013, but the proportion of unlinked mortality records in HES and in ONS further declined after 2009.

Discussion

HES can be used to develop a national birth cohort for child health research with follow-up via linkage to hospital and mortality records for children born from 2003 onwards. Re-linking births before 2003 to their follow-up records would maximise potential benefits of this rich resource, enabling studies of outcomes in adolescents with over 20 years of follow-up.

Introduction

Administrative health and vital statistics data, including hospital admission and birth and death registration data, are extremely valuable research resources for population health. Whole-country coverage minimises selection bias and loss to follow-up [1], and enables developing “natural experiments” to assess the impact of policy changes and public health interventions on population health [2, 3]. Large sample size enables studying rare outcomes, such as child mortality or congenital anomalies [4, 5]. Further, secondary use of routinely collected data reduces study costs and time compared to de novo data collection. In several countries, national birth cohorts for child health research are commonly derived from birth registers (with information about key risk factors at birth, covering all children born in a given jurisdiction), with follow-up from linked death registration and hospital admission records. Such cohorts are commonly used for child health research, for example in the Nordic countries [6], Australia [7, 8] and Canada [9, 10].

In the UK, national birth cohorts based on birth registration datasets, with follow-up from linked hospital admission and mortality records, are routinely used for research in Scotland [11, 12] and Wales [13]. In England, a whole-country birth cohort with rich information about the baby, mother and family could be developed by linking multiple data sources with information about birth and delivery (from Office for National Statistics [ONS] Birth Registration data, National Health Service [NHS] birth notification data, and Hospital Episode Statistics [HES], an administrative hospital dataset), with follow-up via linkage to hospital admission and mortality records. Feasibility of such linkage has been demonstrated by researchers at City University of London [14–16]. These data, however, are only available for births between 2005 and 2014, linkage is not updated routinely, and access is not straightforward [15].

Instead, researchers are increasingly using HES to develop national birth cohorts in England [5, 10, 17]. HES covers details of all patient care funded by the National Health Service (NHS) in England, including all births that occur in NHS hospitals (in 2016, 97.4% of all deliveries in England occurred in NHS hospitals) [18, 19]. For each birth admission, HES contains information about the health of the baby at birth and offers the possibility of longitudinal follow-up through routine linkage to consecutive hospital contacts (including admissions) and mortality records. With on-going data collection and hospital records linkable to an individual reaching back to April 1997, HES has the potential to be an invaluable source of data for child health research, provided there is consistent, high quality of linkage to longitudinal follow-up. However, as with any administrative dataset, changes to data collection methods are outside the control of researchers yet can have profound impacts on the quantity and quality of data collected.

This study aims to evaluate how changes to data collection for HES and vital statistics systems over time have affected the quality of linkage between birth admissions and subsequent HES and mortality records. First, we describe methods for developing a birth cohort using HES. We then demonstrate how changes to collection of patient identifiers used for linkage of birth admissions to subsequent HES and mortality records have affected estimates of readmission and mortality rates. Finally, we suggest how linkage error can be addressed retrospectively to create hospital record trajectories in England for children and adolescents from 1997 onwards with accurate linkage to their birth episodes.

Methods

Data sources

Hospital episode statistics

HES is an administrative database that covers all hospital activity paid for by the NHS in England (covering an estimated 98–99% of all hospital activity in England) [18]. HES is collated nationally and maintained by NHS Digital. Initially, the database was established to inform management and planning of healthcare services [18]. Since April 2004, data on all admissions are collected for financial purposes [20].

The basic analysis unit in the HES Admitted Patient Care dataset (HES APC) is a consultant episode, defined as the time during which a patient is under the care of one hospital consultant. A hospital admission can consist of multiple consultant episodes if a patient is seen by more than one consultant/healthcare professional [18]. Each HES APC episode includes patient demographic information (e.g. sex, age, ethnicity, partial postcode) and clinical details (codes for diagnoses and procedures) which can be used to derive measures of comorbidities such as congenital anomalies [4], chronic conditions [21], and long term outcomes (such as emergency admissions [22]). Clinical coders translate information from medical discharge notes into diagnostic codes (using the International Classification of Disease version 10, ICD-10 [23]) and procedure codes (using a UK-specific system, the Office of Population Censuses and Surveys Classification of Interventions and Procedures, OPCS [24]). Although all NHS clinical coders are required to complete national accreditation training to ensure standardisation of recorded information between hospitals, coding sensitivity could vary according to the quality and level of detail covered in discharge notes [18]. Socioeconomic status is measured using the Index of Multiple Deprivation (IMD) score, an area-level measure of deprivation allocated at the Lower Layer Super Output Area level (covering on average 1500 people) according to the patient’s postcode. The score combines area-level indicators in seven domains: income, employment, health and disability, education, crime, barriers to housing and services, and living environment [25].

HES APC also includes the 97% of births in England that occur in NHS hospitals (but not home births or births in private hospitals). Each birth in HES APC leads to at least two records: one delivery episode for the mother and a birth episode for each baby. Both the maternal delivery and baby’s birth episodes contain the same 19 variables detailing the delivery and labour, called the “baby tail”, including information such as gestational age, birth weight, maternal age, mode of delivery etc. Delivery and birth episodes are not routinely linked by NHS Digital [18].

HES has been collected nationally since 1989. Patients’ hospital admission records can be linked over time since 1st April 1997 using the HESID–a study-specific pseudonymised patient identifier generated by NHS Digital [26]. The algorithm to generate HESID is based on the NHS number, local hospital patient identifier, date of birth, postcode and sex (see S1 Appendix Table A for details) [27].

HES-ONS linked mortality data

Information about causes and date of deaths can be obtained through linkage of HES to Office for National Statistics (ONS) mortality records, which cover all deaths registered in England. For deaths at ≥28 days of life, information about the underlying cause of death, and any additional conditions that contributed to death are recorded. For deaths at age <28 days, the neonatal death certificate is used, which lists main conditions in the baby and the mother, with no single underlying cause of death [28].

NHS Digital links HES episodes to ONS mortality records for deaths registered from the 1st January 1998 onwards using NHS number, date of birth, sex and postcode (see S1 Appendix Table B for details) [29].

NHS Digital also flags all in-hospital deaths recorded in HES (that is, hospital episodes where the discharge method was recorded as ‘died’). NHS Digital compares information for in-hospital deaths recorded in HES and ONS mortality records and provides an indicator of agreement between these sources. Deaths identified using HES only (even if there is no link to an ONS mortality record) are included in the HES-ONS mortality data. However, these deaths do not have any recorded causes of death available [29].

Study participants

We developed a birth cohort of singleton live-born infants, who resided in England and were born between 1st January 1998 and 31st December 2015, based on birth admissions recorded in HES. We identified birth admissions using broad selection criteria based on diagnostic codes and admission details recorded in HES (such as admission type, see S2 Appendix Table A for details). We used only information recorded in baby’s birth record, as linkage between maternal delivery and baby’s birth admissions is not routinely available. We excluded multiple births due to an increased risk of false matches (i.e. multiple individuals being allocated the same HESID, meaning that it is not possible to distinguish between their hospital or mortality records) among same sex siblings. We also excluded stillbirths and births indicated by unfinished HES episodes, which should not include any clinical details (detailed exclusion criteria are listed in S2 Appendix Table B). We cleaned data on key risk factors recorded at birth admission in HES (birth weight, gestational age, sex, ethnicity, maternal age, IMD score and county of residence), excluding implausible or inconsistent recordings, and finally we removed infants who were not resident in England. Details of data cleaning are described in S2 Appendix, and Stata code for cohort derivation can be found at https://github.com/UCL-CHIG/HES-birth-cohorts.

Children were followed up from birth to their first birthday or death by linkage to hospital admission and mortality records using HESID. All episodes of care in HES APC were linked into hospital admissions, which we defined as a continuous period of time that a child spent under NHS hospital care. Hospital transfers and admissions within 1 day of each other were treated as one inpatient admission [30].
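The admission rule described above can be sketched as follows. This is an illustration in Python rather than the authors' Stata code: the function name `collapse_into_admissions` and the (start, end) tuple layout are assumptions, and "within 1 day" is interpreted here as a gap of at most one calendar day between one discharge and the next admission.

```python
from datetime import date, timedelta

def collapse_into_admissions(episodes, max_gap_days=1):
    """Merge (start, end) episode dates for ONE child into continuous admissions.

    Episodes that overlap, or that start within `max_gap_days` of the previous
    discharge (for example hospital transfers), are treated as one admission.
    """
    admissions = []
    for start, end in sorted(episodes):
        if admissions and start - admissions[-1][1] <= timedelta(days=max_gap_days):
            # Extend the current admission and keep the later discharge date.
            prev_start, prev_end = admissions[-1]
            admissions[-1] = (prev_start, max(prev_end, end))
        else:
            admissions.append((start, end))
    return admissions

# Invented example episodes for a single child.
episodes = [
    (date(2010, 3, 1), date(2010, 3, 3)),  # birth episode
    (date(2010, 3, 3), date(2010, 3, 6)),  # care passed to a second consultant
    (date(2010, 3, 7), date(2010, 3, 8)),  # admitted 1 day after discharge
    (date(2010, 6, 1), date(2010, 6, 2)),  # separate readmission
]
admissions = collapse_into_admissions(episodes)
```

In this toy example the first three episodes collapse into a single birth admission and the June episode counts as one readmission.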

Outcomes

We focus on assessing trends in two outcomes which could be affected by changes in data collection practices for patient identifiers used for linkage between birth admissions and follow-up records. First, we looked at the hospital readmission rate in the first year of life to assess the impact of changes in data collection on internal linkage between birth and hospital admissions within HES. We defined the hospital readmission rate as the proportion of infants with at least 1 hospital admission in the first year of life after discharge from the birth admission. The birth admission included transfers to neonatal intensive care units or other hospitals within 1 day of birth. Second, we looked at the infant mortality rate to evaluate the impact of changes in data collection on external linkage to ONS mortality records. Infant deaths were indicated if a linked ONS mortality record was found or if the discharge method in the hospital record indicated death. We calculated the infant mortality rate as the number of children who died at age 0–364 days per 1000 live births.
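As a worked illustration of these two definitions, the sketch below applies them to the whole-cohort counts reported in the Results section. The function names are illustrative assumptions, not part of the authors' code.

```python
def readmission_rate(n_readmitted, n_births):
    """Percentage of infants with at least one admission after the birth admission."""
    return 100.0 * n_readmitted / n_births

def infant_mortality_rate(n_deaths, n_live_births):
    """Deaths at age 0-364 days per 1000 live births."""
    return 1000.0 * n_deaths / n_live_births

# Whole-cohort counts reported in the Results section (births in 1998-2015).
cohort_births = 10_653_998
readmit_pct = round(readmission_rate(2_077_929, cohort_births), 1)
imr_per_1000 = round(infant_mortality_rate(42_963, cohort_births), 2)
print(readmit_pct, imr_per_1000)
```

These reproduce the 19.5% readmission rate and 4.03 deaths per 1000 live births quoted for the cohort.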

Changes in collection of patient identifiers used for linkage

We identified three changes in data collection which could affect the completeness and accuracy of patient identifiers recorded in HES and influence linkage within HES and with other datasets:

  1. Q4 2002: implementation of NHS Numbers for Babies (NN4B) service on 29th October 2002. Since October 2002 maternity staff in England request an NHS number for babies in hospital after birth using an on-line system as part of the Statutory Birth Notification process. Prior to this point babies were allocated an NHS number by registrars at birth registration, which could take up to 6 weeks [31].

  2. Q3 2009: introduction of Registration ONline system (RON) on 1st July 2009. RON is a web-based online registration system for births and deaths used by registrars in local authority registry offices. The introduction of RON enabled additional validation checks at the point of birth and death registration, such as validation of address and postcode [32].

  3. Q2 2013: correction of a postcode extraction error by NHS Digital on 1st April 2013. An extraction error resulted in birth admissions missing postcode prior to the 2013/14 financial year. As a result, all consultant episodes specified as births (using the epitype variable in HES) were missing a postcode and all variables derived from postcode (including IMD score) [33].

Statistical analysis

Cohort coverage

We estimated the coverage of the HES birth cohort by comparing the number of births per year in the cohort to national birth statistics for England published by the ONS [34, 35]. We derived the proportion of births per calendar year with valid and complete information on key risk factors recorded at birth: birth weight, gestational age, sex, ethnicity, maternal age, and quintile of IMD score as a proportion of all births in the HES birth cohort.
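The coverage calculation amounts to a simple ratio per year. A minimal sketch, using three rows and the totals from Table 1 (the dictionary layout and function name are assumptions for illustration):

```python
# year -> (births in HES cohort, ONS-based estimate of singleton live births),
# copied from Table 1.
table1 = {
    1998: (546_804, 584_928),
    2002: (540_030, 549_003),
    2015: (616_184, 643_363),
}

def coverage_pct(cohort_births, national_births):
    """HES cohort births as a percentage of national singleton live births."""
    return 100.0 * cohort_births / national_births

coverage = {year: round(coverage_pct(*counts), 1) for year, counts in table1.items()}
overall = round(coverage_pct(10_653_998, 11_056_368), 1)
print(coverage, overall)
```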

Impact on internal linkage between births and hospital admission records

We used interrupted time series analysis (ITSA) to assess the impact of changes in data collection on internal linkage of birth records to hospital admission records [36]. We fitted a linear regression model with the proportion of babies with at least one hospital readmission after birth per annual quarter of birth as the outcome. We adjusted for year of birth (as a continuous variable) and indicator of annual quarter of birth. We also included three binary indicators for time periods before and after each of the changes to data collection equal to 0 before and 1 after change (to quantify changes in observed readmission rates in Q4 2002, Q3 2009 and Q2 2013 vs “expected” rates for these time points based on the trends before each change). Lastly, we also adjusted for an effect modification term between year of birth and the period indicator before vs after each change to data collection (to assess changes in the trends in readmission rates in Q1 1998-Q3 2002, Q4 2002-Q2 2009, Q3 2009-Q1 2013 and Q2 2013 onwards).
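The covariates of such a segmented regression can be sketched as below. This is not the authors' Stata specification: the variable names are invented, time is expressed in years since Q1 1998, and each slope-change term is centred at its change point, which is a common ITSA parameterisation but an assumption here because the paper does not state how the interaction was coded. Fitting would then be done with any ordinary least squares routine.

```python
# Quarters in which data collection changed (year, quarter), from the text.
CHANGES = {"nn4b": (2002, 4), "ron": (2009, 3), "postcode_fix": (2013, 2)}

def quarter_index(year, quarter):
    """Number of quarters elapsed since Q1 1998 (Q1 1998 -> 0)."""
    return (year - 1998) * 4 + (quarter - 1)

def itsa_row(year, quarter):
    """Covariates for one birth quarter in a segmented linear regression."""
    t = quarter_index(year, quarter)
    row = {
        "time_years": t / 4.0,  # underlying linear trend
        "q2": int(quarter == 2),  # seasonal indicators (Q1 is the reference)
        "q3": int(quarter == 3),
        "q4": int(quarter == 4),
    }
    for name, (change_year, change_quarter) in CHANGES.items():
        t0 = quarter_index(change_year, change_quarter)
        post = int(t >= t0)
        row[f"post_{name}"] = post  # step change in level at the change point
        row[f"slope_{name}"] = post * (t - t0) / 4.0  # change in trend afterwards
    return row

# One row per birth quarter, Q1 1998 to Q4 2015.
design = [itsa_row(y, q) for y in range(1998, 2016) for q in (1, 2, 3, 4)]
```

With this coding, the coefficient on each `post_*` term estimates the gap between the observed and "expected" readmission rate in the quarter of the change, and each `slope_*` coefficient estimates the change in trend relative to the preceding period.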

We hypothesised that the quality of identifiers, and therefore linkage, would be better for babies who have a longer birth admission as their NHS number and other identifiers can be updated during their hospital stay. Therefore, we repeated the models separately for readmissions in babies with long birth admission (≥7 days) and in babies with short birth admission (<7 days).

Impact on external linkage between births and infant mortality records

We compared trends in crude infant mortality rates derived from the HES birth cohort with official infant mortality statistics published for singleton live births in England and Wales by the ONS (95% of births in England and Wales occur in England) by year of birth to evaluate the impact of changes in data collection on linkage between HES and ONS mortality records [37–39]. We tabulated mortality by age at death as neonatal mortality (at 0–27 days) and post-neonatal mortality (at 28–364 days).

Deaths in linked HES-ONS data can be defined through a link to ONS mortality records or via the discharge method recorded in hospital, and all in-hospital deaths should link to ONS mortality records. To further explore the impact of the three changes to data collection on linkage between HES and ONS mortality data, we looked at changes by year of birth in: (1) the number and proportion of deaths in the HES birth cohort that were identified using only discharge method in HES (with no ONS mortality record), and (2) the number and proportion of infant deaths from ONS mortality records which did not link to any HES record.
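The first of these measures can be illustrated with a small sketch that classifies each death by its source of ascertainment. The field names `has_ons_record` and `hes_discharge_died` and the toy records are hypothetical; they do not correspond to actual HES variable names.

```python
def death_source(has_ons_record, hes_discharge_died):
    """Classify how a death in the cohort was ascertained."""
    if has_ons_record and hes_discharge_died:
        return "both"
    if has_ons_record:
        return "ons_only"
    if hes_discharge_died:
        return "hes_only"
    return None  # no evidence of death in either source

def pct_hes_only(children):
    """Percentage of deaths identified from the HES discharge method alone."""
    sources = [death_source(c["has_ons_record"], c["hes_discharge_died"])
               for c in children]
    deaths = [s for s in sources if s is not None]
    return 100.0 * deaths.count("hes_only") / len(deaths)

# Invented records: four deaths (one of them HES-only) and one survivor.
toy = [
    {"has_ons_record": True, "hes_discharge_died": True},
    {"has_ons_record": True, "hes_discharge_died": False},
    {"has_ons_record": True, "hes_discharge_died": True},
    {"has_ons_record": False, "hes_discharge_died": True},
    {"has_ons_record": False, "hes_discharge_died": False},
]
result = pct_hes_only(toy)
```

The second measure (ONS deaths with no HES record) would need the ONS mortality file as the denominator and is not shown.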

All deaths with an ONS mortality record should have recorded causes of death. We explored the proportion of these deaths in the HES birth cohort with no recorded causes of death as an indicator of potential data extraction errors. All analyses were done using Stata version 15.

Ethics statement

We have a data sharing agreement with National Health Service (NHS) Digital to use a de-identified extract of HES data linked to ONS mortality records for research on child health. We did not require ethical approval to use these datasets [40].

Results

Cohort coverage and data completeness

We identified 10,653,998 singleton live births in HES APC in calendar years 1998–2015, covering 96.4% of all singleton live births in England (Table 1). Overall, birth weight was complete for 66% of births, gestational age for 63%, maternal age for 62%. Completeness has improved over time, reaching on average 82% for birth weight, 78% for gestational age, and 69% for maternal age for births from 2009 onwards (Fig 1). Baby’s sex was over 99% complete throughout the study period. IMD score (based on earliest recording in any hospital admission record in the first year of life) and ethnicity (based on most commonly recorded value) were complete for 53% and 74% of births, respectively. IMD score was complete for only 34% of births in 2008–2012 reflecting the NHS Digital data extraction error that led to postcode and all postcode-derived variables being missing from birth admission records. The error was corrected in April 2013. IMD score was complete in 89% of records in calendar years 2014–15.

Table 1. Coverage of HES birth cohort compared to national statistics published by the ONS for England.

| Year of birth | HES birth cohort (births) | England, according to ONS* (births) | Coverage |
| --- | --- | --- | --- |
| 1998 | 546,804 | 584,928 | 93.5% |
| 1999 | 542,734 | 572,611 | 94.8% |
| 2000 | 526,254 | 556,172 | 94.6% |
| 2001 | 527,602 | 547,292 | 96.4% |
| 2002 | 540,030 | 549,003 | 98.4% |
| 2003 | 554,104 | 572,711 | 96.8% |
| 2004 | 571,530 | 589,248 | 97.0% |
| 2005 | 577,964 | 595,019 | 97.1% |
| 2006 | 592,768 | 616,588 | 96.1% |
| 2007 | 602,801 | 635,561 | 94.8% |
| 2008 | 631,027 | 652,280 | 96.7% |
| 2009 | 630,170 | 649,416 | 97.0% |
| 2010 | 649,431 | 665,746 | 97.5% |
| 2011 | 650,223 | 666,320 | 97.6% |
| 2012 | 653,116 | 672,505 | 97.1% |
| 2013 | 626,619 | 646,941 | 96.9% |
| 2014 | 614,637 | 640,663 | 95.9% |
| 2015 | 616,184 | 643,363 | 95.8% |
| Total | 10,653,998 | 11,056,368 | 96.4% |

HES = Hospital Episode Statistics. ONS = Office for National Statistics.

*We estimated the number of singleton live births in England based on the number of total births in England [34, 35] and assuming that the ratio of singleton live births to all live births was the same in England as in England and Wales (97.0% in 1998–2015) [37–39].

Fig 1. Completeness of key risk factors recorded in baby’s birth records in HES birth cohort.


HES = Hospital Episode Statistics, IMD = Index of Multiple Deprivation. *Note that ethnicity and IMD score were completed using each child’s longitudinal hospital admission records (see S2 Appendix for details).

Impact on internal linkage between births and hospital admission records

Overall, 2,077,929 (19.5%) babies had at least one hospital readmission after birth during infancy. The proportion of babies with hospital readmission after birth declined by 0.2 percentage points per annual quarter from Q1 1998 until Q3 2002 (Fig 2). In Q4 2002, after the NN4B system was introduced, the hospital readmission rate increased by 6.1 percentage points (compared to the “expected” readmission rate based on the trend before Q4 2002) to a total of 17.7%. Thereafter, the readmission rate increased by 0.1 percentage points per annual quarter for births in Q4 2002–Q2 2009 and in Q3 2009–Q1 2013, and by 0.2 percentage points per annual quarter from Q2 2013 onwards. There was no statistically significant change in the observed vs expected proportion in Q3 2009 when RON was introduced nor in Q2 2013 when the postcode extraction error for births was corrected by the data provider. Detailed ITSA results are shown in S3 Appendix Table A.

Fig 2. Trends in the proportion of infants with at least one hospital admission after birth in infancy by quarter and year of birth.


HES = Hospital Episode Statistics, ITSA = Interrupted time series analysis. Vertical lines indicate time points when the collection of identifiers used to generate HESID changed: (1) Q4 2002: implementation of NHS Numbers for Babies service; (2) Q3 2009: introduction of Registration ONline system; (3) Q2 2013: correction of postcode extraction error by NHS Digital.

Results separated by length of birth admission were comparable, suggesting that linkage error was not affected by length of stay. The proportion of babies with a long birth admission (≥7 days) who had at least one hospital readmission in infancy after birth declined by 0.2 percentage points per annual quarter between Q1 1998 and Q3 2002. In Q4 2002, when the NN4B system was introduced, the readmission rate increased by 8.6 percentage points to a total of 29.8% (compared to the “expected” readmission rate based on the trend in 1998–2002), and continued to increase by 0.2 percentage points per annual quarter in Q4 2002–Q2 2009, and by 0.1 percentage points thereafter. There was no statistically significant change in the observed vs expected readmission rate in Q3 2009 nor in Q2 2013. For babies with a short birth admission (<7 days), the readmission rate declined by 0.2 percentage points per annual quarter of birth in Q1 1998–Q3 2002. In Q4 2002, the observed readmission rate increased by 5.9 percentage points to a total of 16.7% of babies with hospital readmission compared to the “expected” rate and increased thereafter. There was no statistically significant change in the trend in Q3 2009 nor in Q2 2013.

Impact on external linkage between births and infant mortality records

We identified 42,963 infant deaths between 1998 and 2015 in the HES birth cohort. Infant mortality in the cohort was 4.03 deaths per 1000 live births in 1998–2015, compared to 4.16/1000 live births reported by ONS for England and Wales. Infant mortality rates were underestimated by 16% for births before 2003, and closely matched rates reported by the ONS for births in 2003–2015 (Fig 3). The difference in rates before 2003 was primarily driven by underreported post-neonatal mortality rates in the HES birth cohort. Given 2,683,424 births in 1998–2002 identified in HES, the difference in mortality rates derived from HES birth cohort and national statistics reported by the ONS translates to 2,082 deaths missing from the HES birth cohort, of which 1,347 were in the post-neonatal period.
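The figures in this paragraph can be cross-checked against Tables 1 and 2 with some back-of-envelope arithmetic. The sketch below is a consistency check, not the authors' calculation: it assumes that the 2,082 missing deaths represent the shortfall relative to the number expected from the ONS rate, and it ignores the fact that the ONS comparator covers England and Wales.

```python
# Births in the HES cohort, 1998-2002 (Table 1).
births_1998_2002 = 546_804 + 542_734 + 526_254 + 527_602 + 540_030
# Infant deaths identified in the HES cohort, 1998-2002 (Table 2).
hes_deaths_1998_2002 = 2_337 + 2_396 + 2_269 + 1_948 + 2_104
# Deaths reported above as missing from the HES birth cohort.
missing_deaths = 2_082

observed_rate = 1000 * hes_deaths_1998_2002 / births_1998_2002
implied_expected_rate = 1000 * (hes_deaths_1998_2002 + missing_deaths) / births_1998_2002
shortfall_pct = 100 * missing_deaths / (hes_deaths_1998_2002 + missing_deaths)

print(births_1998_2002, round(observed_rate, 2),
      round(implied_expected_rate, 2), round(shortfall_pct))
```

Under these assumptions the birth count matches the 2,683,424 quoted above, and the implied shortfall rounds to the 16% under-estimation reported for births before 2003.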

Fig 3. Comparison of components of infant mortality in HES birth cohort compared to national figures reported by ONS.


HES = Hospital Episode Statistics, ONS = Office for National Statistics. Vertical lines indicate time points when the collection of identifiers used to generate HESID changed: (1) Q4 2002: implementation of NHS Numbers for Babies service; (2) Q3 2009: introduction of Registration ONline system; (3) Q2 2013: correction of postcode extraction error by NHS Digital.

Overall, 13% of deaths in the HES birth cohort were identified using only information recorded in hospital admission records, with no link to an ONS mortality record. The proportion of deaths with no linked ONS mortality record decreased over time: before the introduction of the NN4B system in 2002, 29% of infant deaths had no link to an ONS mortality record. This proportion decreased to 10% between 2003 and 2009 (when RON was introduced) and further to 5% in 2010–2015 (Table 2). There was no change after 2013 when the postcode extraction error was corrected by NHS Digital. Similar patterns were observed for the proportion of infant deaths recorded in ONS mortality records that did not link to any HES record (Table 2).

Table 2. Number and proportion of infant deaths that did not link to ONS mortality record and number and proportion of infant deaths that were indicated using HES data only.

| Year of birth | HES birth cohort: total infant deaths* | HES birth cohort: % without a link to ONS mortality record | ONS mortality records: total infant deaths | ONS mortality records: % without a link to HES |
| --- | --- | --- | --- | --- |
| 1998 | 2,337 | 33% | 4,357 | 28% |
| 1999 | 2,396 | 24% | 4,584 | 26% |
| 2000 | 2,269 | 27% | 4,358 | 27% |
| 2001 | 1,948 | 33% | 4,320 | 30% |
| 2002 (introduction of NN4B) | 2,104 | 32% | 4,246 | 29% |
| 2003 | 2,571 | 12% | 3,967 | 14% |
| 2004 | 2,585 | 10% | 3,864 | 13% |
| 2005 | 2,602 | 9% | 3,870 | 11% |
| 2006 | 2,651 | 9% | 3,931 | 11% |
| 2007 | 2,651 | 10% | 3,969 | 11% |
| 2008 | 2,668 | 10% | 4,106 | 12% |
| 2009 (introduction of RON) | 2,577 | 8% | 3,985 | 8% |
| 2010 | 2,546 | 6% | 3,834 | 7% |
| 2011 | 2,444 | 6% | 3,656 | 7% |
| 2012 | 2,342 | 4% | 3,509 | 8% |
| 2013 (correcting postcode extraction error for births in HES) | 2,171 | 5% | 3,181 | 8% |
| 2014 | 2,075 | 4% | 2,934 | 8% |
| 2015 | 2,026 | 4% | 2,924 | 7% |

HES = Hospital Episode Statistics, NN4B = NHS Numbers for Babies, ONS = Office for National Statistics, RON = Registration ONline system. Rows marked in grey indicate years when data collection practices changed.

*Infant deaths were indicated if a linked ONS mortality record was found or if discharge method in hospital record indicated death.

80% of deaths with no ONS mortality record occurred in the first week of life (n = 4,548). Because there is no linked ONS mortality record, these deaths do not have any recorded causes of death in the linked HES-ONS data. We found that a further 1% of all deaths with a linked ONS mortality record had no recorded causes of death. 98% of these deaths were at age 28–30 days, accounting for 75% of all deaths on days 28–30 in 1998–2016.

Discussion

Key findings

The implementation of NN4B, which allowed the NHS number to be allocated to babies in hospital shortly after birth, led to substantial improvements in ascertainment of subsequent hospital admissions and infant deaths in a birth cohort developed using HES birth records. Unreliable linkage between births and follow-up records before 2003 resulted in a misleading downward trend in hospital readmissions before the introduction of NN4B in October 2002. Infant mortality rates derived from the HES birth cohort were also underestimated for births in 1998–2002. The introduction of RON correlated with a reduction in the proportion of unlinked deaths indicated in HES and ONS mortality records. Fixing the postcode extraction error in 2013 did not impact the quality of linkage, but it helped to improve the completeness of the IMD score for birth records in HES.

Strengths & limitations

We present validated methods for developing a national birth cohort using HES, which covered 96.4% of births in England. Whole-country coverage enabled us to assess the impact of changes in data collection on linkage to follow-up records as the trends in mortality or hospital admission records were not affected by loss to follow-up or selection bias. To ensure we captured all births, we used broad selection criteria for identifying births including diagnostic and procedure codes, and administrative variables recorded in HES. We recommend this approach to maximise cohort coverage. For example, previous studies using only admission method to indicate births identified 87% of births [41], while an algorithm based on multiple variables captured 97% of births [42]. Our criteria were consistent with previous studies (with minor differences reflecting differences in study aim) [42, 43]. Our Stata code is openly available on GitHub, with a detailed description for users in S2 Appendix of this paper. Our paper can serve as a helpful primer for researchers interested in using HES for child health research.

Our methods excluded multiple births from the cohort, due to the increased risk of false matches for multiple births, especially for same sex siblings [27], and increased uncertainty about coding of stillbirths among multiple births [42]. Our exclusion criteria (listed in S2 Appendix Table B) could be used to develop a cohort of multiple births. Further work is needed to evaluate the coverage of multiple births in HES, and coding of stillbirths for multiple births in the HES birth cohort. A recent national audit of maternity and perinatal care for women with multiple births and their babies has identified case ascertainment of 89.5% using multiple detailed datasets [44]. Further work is needed to evaluate the quality of linkage to longitudinal hospital admission and mortality records among twins and triplets.

We did not have individual-level information about personal identifiers recorded at each admission. To evaluate the quality of linkage we were therefore limited to looking at trends in two outcomes likely to be affected by changes in data collection for HES and vital statistics. Information on the proportion of babies with missing NHS number, postcode and sex per calendar year could be provided by NHS Digital to researchers when data are provided, allowing the assessment of potential biases in their studies due to linkage errors.

Interpretation

The introduction of NN4B in October 2002 had the biggest impact on the quality of linkage between birth, hospital and mortality records. Prior to the introduction of NN4B in October 2002, newborns were assigned an NHS number at birth registration, which could occur up to 6 weeks after birth. If a birth episode did not contain an NHS number, it could only be linked to consecutive admissions in HES and to ONS mortality records using postcode, date of birth and sex and no link would be established if, for example, the child changed their address. As a consequence, babies born before October 2002 were more likely to be allocated a new HESID at first hospital admission following birth.

Re-linking birth episodes to longitudinal hospital admissions prior to 2003 would provide a unique resource for birth cohort studies of health outcomes in adolescents with 20 years of follow-up after birth. There are two barriers to re-linking births before 2003 to their consecutive records. First, many births might be missing NHS numbers as discussed above. Secondly, due to a data extraction error by the data provider, birth episodes prior to April 2013 are missing postcode at birth [33]. Re-linking these births to consecutive hospital admissions and death records would therefore require three steps. First, babies need to be linked to their mothers to obtain information about postcode at birth, as postcode is 99% complete in mothers’ delivery records in HES [42]. Second, using the date of birth, sex and complete postcode at delivery, birth episodes would need to be linked to the Personal Demographics Service (PDS), a national database of all patients who interact with the NHS (including all patients registered with a GP, babies who have received an NHS number at birth, as well as patients admitted to hospital via accident and emergency), and its predecessor the National Health Service Central Register (NHSCR) [45]. Finally, the NHS number obtained via linkage between HES birth records and PDS/NHSCR could be used to re-link HES birth episodes before 2003 to HES admissions after birth. However, more support and resources would need to be allocated to NHS Digital to address these data quality issues.
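The three steps can be sketched as follows. This is a minimal illustration under strong assumptions, not a description of how NHS Digital would implement re-linkage: the baby-to-mother link is assumed to exist already, the `register` list is a hypothetical stand-in for PDS/NHSCR, only exact matching is used, and all field names and records are invented.

```python
def relink_birth(birth, mother_delivery, register):
    """Return the NHS number recovered for a birth episode, or None."""
    # Step 1: take postcode at birth from the linked maternal delivery record.
    postcode = birth.get("postcode") or mother_delivery.get("postcode")
    if postcode is None:
        return None
    # Step 2: look for a unique register entry with the same date of birth,
    # sex and postcode (the register stands in for PDS/NHSCR).
    key = (birth["dob"], birth["sex"], postcode)
    candidates = [
        person for person in register
        if (person["dob"], person["sex"], person["postcode_at_birth"]) == key
    ]
    if len(candidates) != 1:
        return None  # no match, or more than one possible match
    # Step 3: the recovered NHS number is the key for re-linking admissions.
    return candidates[0]["nhs_number"]


register = [
    {"nhs_number": "NHS_1", "dob": "2001-05-01", "sex": "F",
     "postcode_at_birth": "POSTCODE_A"},
    {"nhs_number": "NHS_2", "dob": "2001-05-01", "sex": "M",
     "postcode_at_birth": "POSTCODE_A"},
]
birth = {"dob": "2001-05-01", "sex": "F", "postcode": None}
mother_delivery = {"postcode": "POSTCODE_A"}
admissions = [
    {"nhs_number": "NHS_1", "admission_date": "2002-01-10"},
    {"nhs_number": "NHS_2", "admission_date": "2002-03-02"},
]

nhs_number = relink_birth(birth, mother_delivery, register)
linked_admissions = [a for a in admissions if a["nhs_number"] == nhs_number]
print(nhs_number, linked_admissions)
```

Returning no link when more than one register entry matches is a conservative choice made for this sketch; it avoids false matches at the cost of missed matches.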

We also identified a need to improve the completeness of risk factors recorded at birth admissions in HES. Key risk factors at birth admission (birth weight, gestational age, maternal age) were recorded for 62–66% of births. It has previously been shown that completeness and accuracy of these baby variables in HES vary between hospitals (as submission of these variables to HES is not mandatory), and several approaches have been developed to derive and validate nationally representative sub-cohorts of births in hospitals with high quality of recorded data [5, 41, 46, 47]. Due to a data extraction error, the IMD score, the only measure of socioeconomic status available in HES, was missing from birth episodes prior to April 2013. Completeness of key risk factors at birth could be improved by linking mother’s delivery and baby’s birth admission records, as maternal delivery records are often more complete than birth records and have not been affected by the postcode extraction error prior to 2013. Mother and baby records can be linked using deterministic and probabilistic methods (with a linkage rate of 96%) [42]. We have previously shown that copying information from the maternal delivery record to the linked baby birth record results in highly complete recording of key risk factors, including birth weight and gestational age [48].
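Copying information from a linked maternal delivery record can be sketched as below. This is a minimal illustration that assumes the mother-baby link has already been made and that the birth is a singleton (so that the delivery record describes one baby); the field names and values are hypothetical, not HES variable names or real data.

```python
RISK_FACTORS = ("birth_weight", "gestational_age", "maternal_age")


def complete_from_mother(baby, mother, fields=RISK_FACTORS):
    """Return a copy of the baby record with missing risk factors filled
    from the linked maternal delivery record. Values already recorded on
    the baby record are kept unchanged."""
    completed = dict(baby)
    for field in fields:
        if completed.get(field) is None and mother.get(field) is not None:
            completed[field] = mother[field]
    return completed


# Invented example records (birth weight in grams, gestational age in weeks).
baby = {"birth_weight": None, "gestational_age": 39, "maternal_age": None}
mother = {"birth_weight": 3200, "gestational_age": 40, "maternal_age": 30}

completed = complete_from_mother(baby, mother)
print(completed)
```

Keeping the value on the baby record when the two records disagree is an arbitrary choice made for this sketch; a real analysis would need an explicit rule for resolving such conflicts.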

Complete information on key risk factors at birth could also be obtained through linkage of HES to ONS birth registration and NHS birth notification data. These datasets contain highly complete information on a number of risk factors including birth weight, gestational age, parity, parental country of birth and ethnic group [49]. Feasibility of linkage between these datasets and HES has been demonstrated by researchers at City University of London [14–16]. Since these datasets are now all held by NHS Digital, this valuable linkage could be routinely updated [14]. Linkage of multiple children to the same mother through HES, ONS birth registration, NHS birth notification or all of these datasets, would additionally enable characterisation of siblings and family groups [50, 51].

Since April 2015, maternity and child health services in England have also been required to contribute data collected in antenatal clinics (such as smoking status or body mass index at first booking), and details of delivery and birth collected at the maternity ward (such as gestational age, delivery method, and diagnoses of the newborn baby), to the Maternity Services Data Set (MSDS), which is also maintained by NHS Digital [52]. This dataset is not yet linked to HES and the quality of recorded data has not been evaluated nationally. Given sufficient data quality, linkage of MSDS and HES would provide a rich resource for perinatal health research in the future.

National birth cohorts from administrative linked datasets provide an invaluable resource for child health research. Many countries (e.g. the Nordic countries, Scotland or Australia) have a long tradition of using such data for research, with data collection reaching back to the ‘80s. We demonstrate that in England, HES can be used to develop a birth cohort, following children born from 2003 into their adolescent years. A birth cohort for children born in 1998 onwards would be possible to develop if errors in linkage of birth episodes were corrected. Birth admissions cover key characteristics of mothers and babies, such as birth weight, gestational age, sex, maternal age and area-level deprivation, comparable with information recorded in birth registers in the Nordic countries, Scotland, Canada and Australia, although improvements to data completeness are needed. Diagnoses are coded using ICD-10 classification, which is also used in hospital records in Europe, New Zealand, Australia and Canada [18], enabling international comparisons of child health outcomes (such as mortality, congenital anomalies, chronic conditions or respiratory tract infections) [4, 5, 8, 10, 53].

Conclusion

We identified a significant improvement in linkage within HES records and to ONS mortality records with the introduction of a programme to allocate NHS numbers at birth in 2002. HES provides a unique resource for future child health studies, with ongoing data collection, and historical data going back to 1998, allowing over 20 years of follow up for the oldest children in the cohort. To fully benefit from this rich resource for child health research, improvements in the quality of recorded data are needed. Linkage of babies’ birth records to mothers’ delivery records, ONS birth registration and NHS birth notification data can be used to enhance the completeness of key risk factors at birth. Birth admissions prior to 2003 need to be re-linked to consecutive admissions and death records. Such re-linkage would be invaluable for birth cohort studies of health outcomes in adolescents and adults where long follow-up times are needed.

Supporting information

S1 Appendix. HES linkage algorithms.

(DOCX)

S2 Appendix. Developing a cohort of singleton live births in HES.

(DOCX)

S3 Appendix. Additional results.

(DOCX)

Acknowledgments

This work uses data provided by patients and collected by the NHS as part of their care and support.

Data Availability

Authors do not have permission to share patient-level Hospital Episode Statistics (HES) data. Qualified researchers can request access to the data from the NHS Digital Data Access Advisory Group (enquiries@nhsdigital.nhs.uk).

Funding Statement

AZ was funded by a PhD studentship funded from awards to the Farr Institute of Health Informatics Research, London, from the Medical Research Council, Arthritis Research UK, British Heart Foundation, Cancer Research UK, Chief Scientist Office, Economic and Social Research Council, Engineering and Physical Sciences Research Council, National Institute for Health Research, National Institute for Social Care and Health Research, and Wellcome Trust (grant no MR/K006584/1). This research benefits from and contributes to the National Institute for Health Research (NIHR) Children and Families Policy Research Unit, but was not commissioned by the NIHR Policy Research Programme. The views expressed are those of the author(s) and not necessarily those of the NIHR or the Department of Health and Social Care. Research at UCL Great Ormond Street Institute of Child Health is supported by the NIHR Great Ormond Street Hospital Biomedical Research Centre. RG receives funding from Health Data Research UK. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

Decision Letter 0

Umberto Simeoni

21 Jul 2020

PONE-D-20-14030

Developing a national birth cohort for child health research using a hospital admissions database in England: the impact of changes to data collection practices

PLOS ONE

Dear Dr. Zylbersztejn,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands.

Major objections have been raised during the review process about the methods and attainable aims of the described process and database: Although some of them may be out of reach given the setting of the study, we invite you to consider submitting a revised version of the manuscript that addresses the remarks made by the reviewers. 

Should you choose to do so, please address all the points made by the reviewers in the report below.

Please submit your revised manuscript by Sep 04 2020 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: http://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols

We look forward to receiving your revised manuscript.

Kind regards,

Umberto Simeoni

Academic Editor

PLOS ONE

Journal Requirements:

When submitting your revision, we need you to address these additional requirements.

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at

https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and

https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf

2. We note that you have indicated that data from this study are available upon request. PLOS only allows data to be available upon request if there are legal or ethical restrictions on sharing data publicly. For information on unacceptable data access restrictions, please see http://journals.plos.org/plosone/s/data-availability#loc-unacceptable-data-access-restrictions.

In your revised cover letter, please address the following prompts:

a) If there are ethical or legal restrictions on sharing a de-identified data set, please explain them in detail (e.g., data contain potentially identifying or sensitive patient information) and who has imposed them (e.g., an ethics committee). Please also provide contact information for a data access committee, ethics committee, or other institutional body to which data requests may be sent.

b) If there are no restrictions, please upload the minimal anonymized data set necessary to replicate your study findings as either Supporting Information files or to a stable, public repository and provide us with the relevant URLs, DOIs, or accession numbers. Please see http://www.bmj.com/content/340/bmj.c181.long for guidelines on how to de-identify and prepare clinical data for publication. For a list of acceptable repositories, please see http://journals.plos.org/plosone/s/data-availability#loc-recommended-repositories.

We will update your Data Availability statement on your behalf to reflect the information you provide.

3. Please include captions for your Supporting Information files at the end of your manuscript, and update any in-text citations to match accordingly. Please see our Supporting Information guidelines for more information: http://journals.plos.org/plosone/s/supporting-information.

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Partly

Reviewer #2: Yes

**********

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: Yes

**********

3. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: No

Reviewer #2: No

**********

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

**********

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: This is a manuscript describing hospital episode statistics (HES) data on births in the UK and assessing whether this data can be used for following up children using routine data. While this paper provides interesting descriptive information about England’s HES statistics, the research question (s) are difficult to discern. It reads, at times, more like a report describing the constitution of a dataset, than a scientific study. The authors need to be more straightforward about the central premise of their study, clarify the research aims and place the results in a broader international context.

The introduction and title of the MS place emphasis on the methods for identifying a “national birth cohort” – the authors begin by citing the failure of the 2015 effort to establish the Lifestudy and then refer to examples from other countries where register data are used for research. From the onset, the authors need to clarify what they mean by “national birth cohort”. There is a large difference between being able to link hospital data and studying some longer-term (mainly hospital-related) outcomes and creating a national birth cohort (as initiated in many countries, see https://lifecycle-project.eu/). When the authors claim in their discussion that they have shown the feasibility of using HES to create a national birth cohort, this is not convincing if they mean this cohort to be an alternative for cohort studies such as Lifestudy or if comparing with the Nordic registers. Many other issues remain outstanding such as whether data can be linked to other sources, issues of consent, etc…

The research questions underlying the study aims are not clear. It is stated: “In this study, we present methods for developing a national birth cohort using HES and provide Stata code for cohort derivation. We demonstrate how long-term health outcomes of children in the cohort (such as hospital admission or death) are affected by changes in the quality of recorded identifiers used for linkage of birth admissions to consecutive hospital admissions within HES and to other datasets. We suggest how linkage error can be addressed retrospectively to create hospital record trajectories for children and adolescents from 1997 onwards with accurate linkage to their birth episodes.”

Concerning the first aim, presenting methods and giving code is transparent and helpful for other researchers, but does not constitute a research question. So, although instructions to identify births are provided, there is no test to show how this definition improves on others. There is no discussion about whether these methods are similar to those used in other datasets (for instance: Kuklina EV, Whiteman MK, Hillis SD, et al. An enhanced method for identifying obstetric deliveries: implications for estimating maternal morbidity. Matern Child Health J. Jul 2008;12(4):469-477). Most studies using hospital discharge data use algorithms to identify births.

Second, the section on how changes to the quality of recorded identifiers improved linkage and quality of indicators relying on linkage isn’t particularly surprising and therefore, while it is reassuring to find that better linkage seems to improve the accuracy of some indicators (leading to higher admission rates, for instance), it’s not clear how this adds to overall knowledge about how hospital episode statistics can be used for research on children. Furthermore, there is no gold standard - in the case of hospital admissions – and there may still be substantial errors.

In terms of the conceptualisation of the study, the database constituted by the authors only allows a partial evaluation of the capabilities of the HES data as mothers are not linked with their babies, even though the authors state that this is possible and would improve the quality of data. The authors need to justify why they developed this study without linking these data.

Given the large amounts of missing data, it is not clear why the authors did not describe the characteristics of births with and without missing data – is this related to the hospital? The region?

The reader also wonders whether there has been any validation of these data with medical records to assess the validity of the data?

Many countries use hospital discharge data for research; putting the HES within this broader international context would be useful. How do these results compare to those in other countries?

Abstract –“Numbers for Babies (NN4B) system for allocation of unique National Health Service (NHS) number at birth in Q4 2002” is not interpretable as a stand alone sentence.

While multiples pose many problems for linkages and use of administrative and register data, the solution of eliminating them is not optimal.

Reviewer #2: This study described methods used to create a national birth cohort using HES and aimed to evaluate the quality of linkage between births and follow-up records and its impact on two health outcomes in children. Overall, I think this is a well written and informative study.

To help further improve the manuscript it would be good if the authors could address the following points:

- It’s great to see that the authors have provided a link to the Stata code for derivation of the cohort. Can they just check that this is complete and consistent with all the steps they describe in the paper e.g. from a quick scan of the Stata code I could not see any code for excluding non-English residents. Similarly, it would be really helpful if they can make sure that all the data cleaning steps in the Stata code are described in the appendix of the paper e.g. from the Stata code it looks like they have a number of additional cleaning steps such as those described under the overall heading ‘Additional data cleaning & duplication to ensure one birth episode per HESID’ that are currently not described in the appendix of the paper.

- Discussion, key findings & abstract – when the authors say the proportion of babies with hospital readmission after birth increased by a third to 17.7% is this compared to the proportion with hospital readmission in Q1 1998? If so, can the authors make this explicit in both the discussion and abstract or consider instead stating what I think is probably the more informative figure of 6.1% compared to the expected value based on the trend before Q4 2002.

- Page 4 of the discussion - From the data the authors have it cannot be stated with certainty that babies with longer birth admission were more likely to have their NHS number updated during the hospital stay. Also are the 35% and 25% figures quoted on page 4 of the discussion compared to the hospital readmission proportions in Q1 1998? If so, can the authors make this explicit. However, I again would consider the 5.9% and 8.6% figures (compared to expected value based on trend before Q4 2002) that the authors quote in the results to be the most relevant – and these figures actually imply a greater shift in the readmission rate occurred for births with longer not shorter birth admissions.

Other minor points:

- Abstract discussion – even if births prior to 2003 are not correctly re-linked, HES has the potential to provide national longitudinal hospitalisation birth cohort data for child research so suggest slightly reword last sentence to something like “HES has the potential to provide national longitudinal hospitalisation birth cohort data for child health research, but births prior to 2003 need correctly re-linking to follow-up records.”

- Reference 6 relates to a data linkage study conducted in Scotland so would not cite it with reference to Canada as have done in the introduction.

- Introduction – please make it explicit that HES includes all births in English NHS hospitals and presumably does the 97.4% figure relate to the proportion of all births in England rather than England and Wales?

- Methods - suggest rephrasing first sentence under study participants to something like “We developed a cohort of singleton live births between 1st January 1998 and 31st December 2015 to mothers resident in England based on birth…”

- Methods – you state that you cleaned data on maternal age but this is not detailed in appendix 2.

- Can you make it a bit more explicit in the methods that you did not use information recorded in the mother’s delivery records in this study.

- It would be helpful for completeness to include the details of the HES field/variable names you used to identify the risk factors in the appendix.

- In the methods section you state that you defined hospital admissions as a continuous period of time that a child spent under hospital care and that hospital transfers and admissions within 1 day of each other were treated as one inpatient admission which seems to contradict with what you say in appendix 3 (hospital admission defined as total time spent by a patient in one hospital, with hospital transfers classified as separate inpatient admissions) – please clarify.

- In the outcomes section of the methods where you define infant deaths suggest rephrasing slightly for clarity to: “Infant deaths were defined where a linked death record was found (that is, via link to ONS mortality record) or the discharge method in the hospital record was recorded as ‘died’.”

- Methods – the first time you mention the implementation of NHS numbers for babies, did you mean “Q4 2002..” rather than “Q3/Q4 2002..”?

- Can you clarify that you were looking at hospital readmission in the first year of life in the methods outcomes section and appendix table 6.

- Figure 3 – would be helpful to mark the time points when the collection of identifiers used to generate HESID changed as you did for Figure 2.

- Page 2 of discussion – think you need to add an ‘of’ after ‘Further work is needed to evaluate the quality’

- On page 4 of the discussion would suggest softening the wording slightly to something like “Fixing the postcode extraction error in 2013 did not appear to impact on quality of linkage, but…” Also, according to your Figure 1, fixing the postcode extraction error did not ensure that IMD was available for all births – it only correlated with an increase in the completeness of this variable to 89% in years 2014-2015 – can you amend the wording on page 4 of the discussion to reflect this.

- It would probably be clearer to use different colours rather than different shades of grey in the figures.

**********

6. PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

PLoS One. 2020 Dec 15;15(12):e0243843. doi: 10.1371/journal.pone.0243843.r002

Author response to Decision Letter 0


2 Nov 2020

We would like to thank both reviewers for their comments, which helped us to revise and improve the manuscript.

Comments to the Author - Reviewer #1:

1. This is a manuscript describing hospital episode statistics (HES) data on births in the UK and assessing whether this data can be used for following up children using routine data. While this paper provides interesting descriptive information about England’s HES statistics, the research question (s) are difficult to discern. It reads, at times, more like a report describing the constitution of a dataset, than a scientific study. The authors need to be more straightforward about the central premise of their study, clarify the research aims and place the results in a broader international context.

We thank the reviewer for their comments. We have now revised the paper to reflect its aim more clearly, which is to evaluate how changes to data collection in HES and in vital statistics data over time have affected the quality of linkage between birth admissions and longitudinal follow-up records. Specific changes are listed below:

- We have clarified in the abstract and introduction that the overall aim of this study is to evaluate how changes to data collection in HES and vital statistics data over time have affected the quality of linkage between birth admissions and longitudinal follow-up records.

- We have changed the short title to “Impact of changes to data collection on a national birth cohort from administrative health records in England”

- In the methods section, we include “Changes to collection of patient identifiers used for linkage” in a separate section to highlight that this is the main exposure of interest for all outcomes.

- We have rephrased subheadings used in the statistical analysis section of methods and results to match our aim.

- We now explicitly refer to the three changes in data collection when describing results for infant mortality (we also highlight these changes in figure 3 and table 2)

- In the discussion, we have revised the key findings to focus on impact of the three changes to data collection

- We have also re-organised the interpretation section of the discussion to better reflect the aim of this paper.

We also agree that the paper was lacking international context. We have now revised the introduction and discussion to provide a comparison of hospital discharge datasets available in other countries:

- In the introduction we discuss the advantages of population-level birth cohorts (1st paragraph of the introduction), we describe datasets available in England and provide a rationale for deriving a birth cohort using HES.

- In the “interpretation” section of the discussion (paragraph 1, p22), we compare information available in HES to that collected in other countries and cite international comparative studies that include birth cohorts from HES.

2. The introduction and title of the MS place emphasis on the methods for identifying a “national birth cohort” – the authors begin by citing the failure of the 2015 effort to establish the Life study and then refer to examples from other countries where register data are used for research. From the onset, the authors need to clarify what they mean by “national birth cohort”. There is a large difference between being able to link hospital data and studying some longer-term (mainly hospital-related) outcomes and creating a national birth cohort (as initiated in many countries, see https://lifecycle-project.eu/). When the authors claim in their discussion that they have shown the feasibility of using HES to create a national birth cohort, this is not convincing if they mean this cohort to be an alternative for cohort studies such as Life study or if comparing with the Nordic registers. Many other issues remain outstanding such as whether data can be linked to other sources, issues of consent, etc…

We thank the reviewer for this useful comment. We did not intend to suggest that national birth cohorts from administrative health records can replace the traditional, nationally-representative cohorts such as the Life Study, which collected much more detailed variables and (often) biosamples for a smaller number of participants. We have now revised the introduction accordingly – we removed the mention of Life Study, and we define national birth cohorts from administrative health databases in the first paragraph of the introduction.

3. The research questions underlying the study aims are not clear. It is stated: “In this study, we present methods for developing a national birth cohort using HES and provide Stata code for cohort derivation. We demonstrate how long-term health outcomes of children in the cohort (such as hospital admission or death) are affected by changes in the quality of recorded identifiers used for linkage of birth admissions to consecutive hospital admissions within HES and to other datasets. We suggest how linkage error can be addressed retrospectively to create hospital record trajectories for children and adolescents from 1997 onwards with accurate linkage to their birth episodes.”

We have now clarified that the aim of this study is to evaluate how changes to data collection in national registration systems and HES have affected the quality of linkage between birth admissions and subsequent HES and mortality records, and we then list steps we took to achieve this aim. The last paragraph of the introduction (p5) now reads as follows:

“This study aims to evaluate how changes to data collection for HES and vital statistics systems over time have affected the quality of linkage between birth admissions and subsequent HES and mortality records. First, we describe methods for developing a birth cohort using HES. We then demonstrate how changes to collection of patient identifiers used for linkage of birth admissions to subsequent HES and mortality records have affected estimates of readmission and mortality rates. Finally, we suggest how linkage error can be addressed retrospectively to create hospital record trajectories in England for children and adolescents from 1997 onwards with accurate linkage to their birth episodes”

4. Concerning the first aim, presenting methods and giving code is transparent and helpful for other researchers, but does not constitute a research question. So, although instructions to identify births are provided, there is no test to show how this definition improves on others. There is no discussion about whether these methods are similar to those used in other datasets (for instance: Kuklina EV, Whiteman MK, Hillis SD, et al. An enhanced method for identifying obstetric deliveries: implications for estimating maternal morbidity. Matern Child Health J. Jul 2008;12(4):469-477). Most studies using hospital discharge data use algorithms to identify births.

We have now revised the aim of the paper, as discussed in point 3 above. The first step required to meet this aim is to develop a birth cohort using Hospital Episode Statistics. Although this is not strictly an aim of this paper, our description of how to develop a birth cohort within HES and the accompanying Stata code will be helpful for researchers interested in using HES for child health research (as pointed out by reviewer 2), which they can use and amend according to their research needs.

In the discussion we compare our method to other proposed methods applied to HES data (first paragraph under the “Strengths & limitations” heading). Comparison with methods applied in other countries is less relevant to this work. While our algorithm includes some ICD-10 codes (which could be applied internationally), it also relies on a number of data fields that are specific to HES and not applicable to other datasets (such as the patient classification and admission method variables, whose coding will likely vary between hospital datasets internationally). Similarly, we note that the paper cited by the reviewer uses the ICD-9 Clinical Modification system, which is not used in any UK country. Further, in most countries, identification of births does not require complex algorithms, as births are indicated via linkage to birth certificates / birth records (e.g. medical birth registers in the Nordic countries).

5. Second, the section on how changes to the quality of recorded identifiers improved linkage and quality of indicators relying on linkage isn’t particularly surprising and therefore, while it is reassuring to find that better linkage seems to improve the accuracy of some indicators (leading to higher admission rates, for instance), it’s not clear how this adds to overall knowledge about how hospital episode statistics can be used for research on children. Furthermore, there is no gold standard - in the case of hospital admissions – and there may still be substantial errors.

Since HES is collected for financial purposes rather than research, changes to data collection that are outside the control of researchers could lead to misleading conclusions (as demonstrated by the declining trend in hospital admissions before 2002 due to missed links between births and hospital admissions). There is no information on the quality of linkage from the data provider, nor information about the proportion of records with complete patient identifiers used for linkage. Therefore careful validation studies, including of any linkage between HES and other databases, need to be carried out by researchers who use these data.

Documenting strengths and limitations of HES and methods for evaluating quality of linkage without having full information on personal identifiers used for linkage will be extremely valuable to the growing number of researchers using HES. A quick search in PubMed for mentions of “Hospital Episode Statistics” in titles and abstracts revealed that over 1000 papers have used HES since 2010 and the number of papers is growing over time. Our paper highlights that follow-up for births before 2002 is not reliable, which can help researchers to assess feasibility of their proposed studies before starting a time consuming and resource intensive application for HES data.

We are also not clear what the reviewer means regarding no existing gold standard for measuring hospital admissions. While this may be the case in other countries, HES is considered the gold standard hospital admission dataset in England, since the vast majority of acute and planned care, particularly for children, takes place within NHS hospitals and would therefore be recorded in HES (we now mention in the methods section that HES covers an estimated 98–99% of all hospital activity in England).

6. In terms of the conceptualisation of the study, the database constituted by the authors only allows a partial evaluation of the capabilities of the HES data as mothers are not linked with their babies, even though the authors state that this is possible and would improve the quality of data. The authors need to justify why they developed this study without linking these data.

This paper aimed to assess the impact of changes to data collection practices on readmission and mortality rates among infants in a HES birth cohort. While it is definitely possible to link mothers and babies within pseudonymised HES data, this paper did not aim to describe this linkage, which is detailed elsewhere. We point the reader to a relevant paper which presents methods for deterministic and probabilistic linkage of mothers and babies (paragraph 1 on page 21 in the “interpretation” section of the discussion). Such linkage is not carried out routinely by NHS Digital, the HES data provider, which we now highlight in the paper (last sentence on p6, first paragraph on p8), and therefore some researchers might only have access to data on babies.

7. Given the large amounts of missing data, it is not clear why the authors did not describe the characteristics of births with and without missing data – is this related to the hospital? The region?

Patterns of missing data differ between variables. As we mention, IMD score was missing due to a data extraction error for all episodes of care marked as birth. Completeness of variables in the “baby tail”, such as birth weight, gestational age and maternal age, varies by hospital, since some hospitals submit these variables to NHS Digital and some do not (submitting these variables to NHS Digital is not mandatory). We now mention in the discussion that rates of missing data vary between hospitals and we provide a number of references for how missing data on baby characteristics have been dealt with in other studies. For example, researchers have developed approaches to identify hospitals with high quality of recorded data, and to validate and analyse data from “sub-cohorts” of births in those hospitals (last paragraph on p20).

8. The reader also wonders whether there has been any validation of these data with medical records to assess the validity of the data?

Validation has been attempted for specific conditions, e.g. selected mental health disorders (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5868851/), but not more generally. However, diagnostic information in HES is entered by clinical coders, who are required to complete national accreditation training to ensure standardisation of recorded information between hospitals. Clinical coders translate medical notes into ICD-10 codes once a patient is discharged from hospital. Nonetheless, coding sensitivity could vary according to the quality and level of detail covered in medical notes. We have now added this information in the methods under the “Hospital Episode Statistics” heading (paragraph 1, p6):

9. Many countries use hospital discharge data for research; putting the HES within this broader international context would be useful. How do these results compare to those in other countries?

We agree that the paper was lacking international context. We have now revised the introduction to reflect our ambition to derive a data resource comparable with population-level registers available in e.g. the Nordic countries (paragraph 1, p4). We explain that a comparable data source linking birth registration, maternity and hospital admission records has only been developed for one study to date in England, and this data resource is not updated routinely (paragraph 2, page 4). Instead, researchers are increasingly using HES to develop whole-country birth cohorts (paragraph 1, page 5). In the discussion, we also compare the HES birth cohort data resource to those available in other countries (last paragraph of the “interpretation” section, p 22).

10. Abstract –“Numbers for Babies (NN4B) system for allocation of unique National Health Service (NHS) number at birth in Q4 2002” is not interpretable as a stand alone sentence.

Thank you for pointing this out. We have rephrased it for clarity as

“We identified three changes to data collection, which could affect linkage of births to follow-up records: (1) the introduction of the “NHS Numbers for Babies (NN4B)”, an on-line system which enabled maternity staff to request a unique healthcare patient identifier (NHS number) immediately at birth rather than at civil registration, in Q4 2002; (2) the introduction of additional data quality checks at civil registration in Q3 2009; and (3) fixing a postcode extraction error for births by the data provider in Q2 2013.”

11. While multiples pose many problems for linkages and use of administrative and register data, the solution of eliminating them is not optimal.

We agree with the reviewer and we point this out as an important limitation of this study in the discussion (paragraph 2 on page 19). Further work is needed to evaluate the quality of linkage of birth records to hospital admissions and mortality records for multiple births; however, this was beyond the scope of this paper. Some of the challenges associated with multiple births include babies being allocated the same patient identifier, making it impossible to distinguish hospital trajectories, and challenges with recording of stillbirths. We now also refer to a recent national audit of maternity and perinatal care for women with multiple births and their babies, which identified case ascertainment of 89.5% using multiple detailed datasets (Discussion page 19).

Comments to the Author - Reviewer #2:

This study described methods used to create a national birth cohort using HES and aimed to evaluate the quality of linkage between births and follow-up records and its impact on two health outcomes in children. Overall, I think this is a well written and informative study. To help further improve the manuscript it would be good if the authors could address the following points:

1. It’s great to see that the authors have provided a link to the Stata code for derivation of the cohort. Can they just check that this is complete and consistent with all the steps they describe in the paper e.g. from a quick scan of the Stata code I could not see any code for excluding non-English residents. Similarly, it would be really helpful if they can make sure that all the data cleaning steps in the Stata code are described in the appendix of the paper e.g. from the Stata code it looks like they have a number of additional cleaning steps such as those described under the overall heading ‘Additional data cleaning & duplication to ensure one birth episode per HESID’ that are currently not described in the appendix of the paper.

We thank the reviewer for checking our Stata code on GitHub; this is greatly appreciated. We have now added extra description of our code in S2 Appendix and we have added additional do-files on GitHub for completeness. Specific changes include:

- We now refer to the GitHub repository in the appendix. We have numbered sections of the appendix and we highlight which do-file from GitHub relates to work described in each section.

- We added a brief introduction mentioning preliminary data cleaning for hospital admission and infant mortality records listed in do-files 1 and 2.

- We have expanded section 1 of the appendix on developing the birth cohort to cover more detailed description of data cleaning to derive one birth episode per HESID, and additional information about cleaning of implausible values of birth weight, gestational age and maternal age. All steps are included in do-file 3.

- We have added additional Stata do-files describing cleaning and linking hospital admissions in the first year of life (do-file 4, described in section 2 of S2 appendix), and cleaning and linking ONS mortality records (do-file 5, described in section 3 of S2 appendix).

- We also added do-file 6 with steps taken to finalise the cohort (such as linking in additional variables derived from linked hospital admission and mortality records, deriving length of birth admission, and excluding non-English residents). These steps are described in section 4 of S2 appendix.

2. Discussion, key findings & abstract – when the authors say the proportion of babies with hospital readmission after birth increased by a third to 17.7% is this compared to the proportion with hospital readmission in Q1 1998? If so, can the authors make this explicit in both the discussion and abstract or consider instead stating what I think is probably the more informative figure of 6.1% compared to the expected value based on the trend before Q4 2002.

Thank you for highlighting this. We have now rephrased the abstract and results (p14-15) to indicate that we refer to an increase of 6.1 percentage points compared to the expected value based on the trend before Q4 2002. We have made additional minor changes throughout the results section (p13-14) to clarify how model results should be interpreted. We have revised the “key findings” section of the discussion, and the specific figures are no longer included.
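For readers unfamiliar with the phrase “expected value based on the trend before Q4 2002”, one common way to write a single-interruption segmented regression is sketched below. This is a generic textbook form, not necessarily the exact ITSA specification fitted in the manuscript; the symbols $t_0$ (the quarter of the change, here Q4 2002), $D_t$ and the $\beta$ coefficients are introduced only for illustration.

```latex
% Generic single-interruption segmented regression (illustrative form only)
Y_t = \beta_0 + \beta_1 t + \beta_2 D_t + \beta_3 (t - t_0) D_t + \varepsilon_t,
\qquad
D_t =
\begin{cases}
0, & t < t_0 \\
1, & t \ge t_0
\end{cases}

% Expected (counterfactual) value from the pre-change trend, and the
% difference between the fitted and expected values after the change
\hat{Y}^{\mathrm{exp}}_t = \hat{\beta}_0 + \hat{\beta}_1 t,
\qquad
\hat{Y}_t - \hat{Y}^{\mathrm{exp}}_t = \hat{\beta}_2 + \hat{\beta}_3 (t - t_0)
\quad \text{for } t \ge t_0
```

Under this form, when $Y_t$ is a percentage of births, the step change $\hat{\beta}_2$ at $t = t_0$ is expressed in percentage points, which is the sense in which a shift is reported relative to the pre-change trend rather than relative to the first quarter of the series.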

3. Page 4 of the discussion - From the data the authors have it cannot be stated with certainty that babies with longer birth admission were more likely to have their NHS number updated during the hospital stay. Also are the 35% and 25% figures quoted on page 4 of the discussion compared to the hospital readmission proportions in Q1 1998? If so, can the authors make this explicit. However, I again would consider the 5.9% and 8.6% figures (compared to expected value based on trend before Q4 2002) that the authors quote in the results to be the most relevant – and these figures actually imply a greater shift in the readmission rate occurred for births with longer not shorter birth admissions.

We thank the reviewer for highlighting this. We agree that it cannot be stated with certainty that babies with longer birth admissions were more likely to have their NHS number updated during the hospital stay, and we have removed this from the discussion. Instead, in the results section we now state: “Results separated by length of birth admission were comparable suggesting that linkage error was not affected by length of stay” before stating ITSA model results (last paragraph p 13).

Other minor points:

4. Abstract discussion – even if births prior to 2003 are not correctly re-linked, HES has the potential to provide national longitudinal hospitalisation birth cohort data for child research so suggest slightly reword last sentence to something like “HES has the potential to provide national longitudinal hospitalisation birth cohort data for child health research, but births prior to 2003 need correctly re-linking to follow-up records.”

Thank you, we agree and we have revised the discussion of the abstract to say “HES can be used to develop a national birth cohort for child health research with follow-up via linkage to hospital and mortality records for children born from 2003 onwards. Re-linking births before 2003 to their follow-up records would maximise potential benefits of this rich resource, enabling studies of outcomes in adolescents with over 20 years of follow-up.”

5. Reference 6 relates to a data linkage study conducted in Scotland so would not cite it with reference to Canada as have done in the introduction.

Thank you for spotting this error, we have now included a more relevant reference.

6. Introduction – please make it explicit that HES includes all births in English NHS hospitals and presumably does the 97.4% figure relate to the proportion of all births in England rather than England and Wales?

Thank you for this suggestion. The previously cited figure referred to England and Wales, but we have now double-checked the figure for deliveries in England only. The sentence now reads: “HES covers details of all patient care funded by the National Health Service (NHS) in England, including all births that occur in NHS hospitals (in 2016, 97.4% of all deliveries in England occurred in NHS hospitals).”

Please note that later on we compare mortality to national statistics reported for England and Wales, as mortality rates for singleton live births (rather than all live births) are not published for England only. We highlight this in the methods section and state that 95% of births in England and Wales occur in England (page 10, paragraph 1).

7. Methods - suggest rephrasing first sentence under study participants to something like “We developed a cohort of singleton live births between 1st January 1998 and 31st December 2015 to mothers resident in England based on birth…”

Thank you, we have now rephrased this sentence to say “We developed a birth cohort of singleton live-born infants, who resided in England and were born between 1st January 1998 and 31st December 2015, based on birth admissions recorded in HES”

8. Methods – you state that you cleaned data on maternal age but this is not detailed in appendix 2.

Thank you for highlighting this omission. We have now added information about cleaning of maternal age to the appendix, and we have double-checked that this is also included in the Stata code on GitHub (see S2 Appendix section 1 and do-file 3).

9. Can you make it a bit more explicit in the methods that you did not use information recorded in the mother’s delivery records in this study.

We have added a sentence to clarify this in “study participants” section of the methods section (p8), which reads as: “We identified birth admissions using broad selection criteria based on diagnostic codes and admission details recorded in HES (such as admission type, see appendix table 3 for details). We used only information recorded in baby’s birth record, as linkage between maternal delivery and baby’s birth admissions is not routinely available”

10. It would be helpful for completeness to include the details of the HES field/variable names you used to identify the risk factors in the appendix.

Thank you for this suggestion, we have now added variable names for all variables mentioned in the appendix, for example for birth characteristics we say:

“Next, we removed implausible values of gestational age (we replaced gestational age to missing if gestational age was <22 or >45 weeks, gestat variable in HES), birth weight (replaced to missing if <200g, birweit variable in HES) and maternal age (replaced to missing if <10 or >60, matage variable in HES).”
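The cleaning rule quoted above can be sketched in a few lines. The authors' own implementation is the Stata code in do-file 3 on GitHub; the Python below is only an illustrative translation of the stated thresholds. The HES field names `gestat`, `birweit` and `matage` come from the text, while the function name and the use of `None` to represent a missing value are assumptions made for this sketch.

```python
def clean_birth_characteristics(record: dict) -> dict:
    """Return a copy of `record` with implausible values set to None (missing).

    Thresholds follow the appendix text quoted above:
      gestat  (gestational age, weeks): missing if < 22 or > 45
      birweit (birth weight, grams):    missing if < 200
      matage  (maternal age, years):    missing if < 10 or > 60
    The function name and None-as-missing convention are hypothetical.
    """
    cleaned = dict(record)  # do not modify the caller's record

    gestat = cleaned.get("gestat")
    if gestat is not None and (gestat < 22 or gestat > 45):
        cleaned["gestat"] = None

    birweit = cleaned.get("birweit")
    if birweit is not None and birweit < 200:
        cleaned["birweit"] = None

    matage = cleaned.get("matage")
    if matage is not None and (matage < 10 or matage > 60):
        cleaned["matage"] = None

    return cleaned
```

Values exactly on a boundary (for example a gestational age of 22 weeks) are kept, since the quoted rule uses strict inequalities.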

11. In the methods section you state that you defined hospital admissions as a continuous period of time that a child spent under hospital care and that hospital transfers and admissions within 1 day of each other were treated as one inpatient admission which seems to contradict with what you say in appendix 3 (hospital admission defined as total time spent by a patient in one hospital, with hospital transfers classified as separate inpatient admissions) – please clarify.

Thank you for highlighting this inconsistency, the statement in the manuscript was correct and we have now updated information in section 2 of S2 Appendix to say:

“An admission was defined as a continuous period of time that a child spent under NHS hospital care. Hospital transfers and admissions within 1 day of each other were treated as one inpatient admission”
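As an illustration of this definition, the sketch below collapses one child's episodes of care into admissions by merging any episode that starts within 1 day of the end of the previous one. It is not the authors' Stata code (do-file 4); the function name, the `(start, end)` tuple layout and the reading of “within 1 day” as a gap of at most one calendar day between discharge and the next admission are assumptions made for this example.

```python
from datetime import date, timedelta


def group_into_admissions(episodes, max_gap_days=1):
    """Merge one child's episodes into continuous inpatient admissions.

    `episodes` is a list of (start_date, end_date) tuples. Episodes that
    overlap, or that start no more than `max_gap_days` after the previous
    admission ends (e.g. a hospital transfer), are merged into one admission.
    """
    admissions = []
    for start, end in sorted(episodes):
        if admissions and start - admissions[-1][1] <= timedelta(days=max_gap_days):
            # Continuation of the current admission: extend its end date
            prev_start, prev_end = admissions[-1]
            admissions[-1] = (prev_start, max(prev_end, end))
        else:
            # Gap is longer than allowed: start a new admission
            admissions.append((start, end))
    return admissions


# Example: a transfer on the day after discharge is merged, a later stay is not
episodes = [
    (date(2010, 1, 1), date(2010, 1, 3)),
    (date(2010, 1, 4), date(2010, 1, 6)),
    (date(2010, 1, 10), date(2010, 1, 11)),
]
admissions = group_into_admissions(episodes)
```

In the example, the first two episodes form a single admission from 1 to 6 January and the third remains a separate admission.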

12. In the outcomes section of the methods where you define infant deaths suggest rephrasing slightly for clarity to: ‘Infant deaths were defined where a linked death record was found (that is, via link to ONS mortality record) or the discharge method in the hospital record was recorded as ‘died’.

Thank you, we have now rephrased this sentence as follows: “Infant deaths were indicated if a linked ONS mortality record was found or if discharge method in hospital record indicated death” (p8)
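The rule in the rephrased sentence is a simple either/or, which can be sketched as follows. This is an illustrative Python rendering rather than the authors' Stata code; the function name is hypothetical, the discharge method code for “died” (`4`) is an assumption that should be checked against the HES data dictionary, and restriction to deaths in the first year of life is assumed to have been applied to the inputs beforehand.

```python
# Assumed HES discharge method code meaning "patient died"; verify against
# the HES data dictionary before use.
DISMETH_DIED = 4


def infant_death_indicated(has_linked_ons_death_record, discharge_methods):
    """Flag an infant death from either source of information.

    `has_linked_ons_death_record`: True if a linked ONS mortality record exists.
    `discharge_methods`: discharge method codes from the child's hospital
    records in infancy (an empty list if there are none).
    """
    died_in_hospital = any(code == DISMETH_DIED for code in discharge_methods)
    return bool(has_linked_ons_death_record) or died_in_hospital
```

Using both sources means a death is still captured when the link to the mortality record is missed but the hospital record indicates death, and vice versa.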

13. Methods – the first time you mention the implementation of NHS numbers for babies, did you mean “Q4 2002..” rather than “Q3/Q4 2002..”?

Thank you for highlighting this, yes, we meant Q4 2002 and we have now corrected this.

14. Can you clarify that you were looking at hospital readmission in the first year of life in the methods outcomes section and appendix table 6.

We have now clarified this in methods section and appendix table 6 (now S3 Appendix Table A).

15. Figure 3 – would be helpful to mark the time points when the collection of identifiers used to generate HESID changed as you did for Figure 2.

Thank you, we have now added lines to the figure to indicate time periods when data collection has changed, and we added a detailed caption listing these time points.

16. Page 2 of discussion – think you need to add an ‘of’ after “Further work is needed to evaluate the quality”

Thank you for highlighting this, we have now corrected this sentence.

17. On page 4 of the discussion would suggest softening the wording slightly to something like “Fixing the postcode extraction error in 2013 did not appear to impact on quality of linkage, but…” Also, according to your Figure 1, fixing the postcode extraction error did not ensure that IMD was available for all births – it only correlated with an increase in the completeness of this variable to 89% in years 2014-2015 – can you amend the wording on page 4 of the discussion to reflect this.

Thank you for highlighting this. We have now moved this sentence to the key findings (to better address our aim – to assess impact of changes in data collection on linkage of births to follow-up records), which now reads:

“Fixing the postcode extraction error in 2013 did not impact the quality of linkage, but it helped to improve the completeness of the IMD score for birth records in HES.”

18. It would probably be clearer to use different colours rather than different shades of grey in the figures.

Thank you, we have now added colour to the figures.

Attachment

Submitted filename: reviewer comments.docx

Decision Letter 1

Umberto Simeoni

30 Nov 2020

Developing a national birth cohort for child health research using a hospital admissions database in England: the impact of changes to data collection practices

PONE-D-20-14030R1

Dear Dr. Zylbersztejn,

We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements.

Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication.

An invoice for payment will follow shortly after the formal acceptance. To ensure an efficient process, please log into Editorial Manager at http://www.editorialmanager.com/pone/, click the 'Update My Information' link at the top of the page, and double check that your user information is up-to-date. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

Kind regards,

Umberto Simeoni

Academic Editor

PLOS ONE

Additional Editor Comments (optional):

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

Reviewer #1: (No Response)

Reviewer #2: All comments have been addressed

**********

2. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

Reviewer #2: Yes

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: Yes

**********

4. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: No

Reviewer #2: No

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

**********

6. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: The authors have been very responsive to the first round of review comments and the objectives and methods of the MS are now clearly stated. This will be a useful contribution to the literature.

Reviewer #2: (No Response)

**********

7. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

Acceptance letter

Umberto Simeoni

4 Dec 2020

PONE-D-20-14030R1

Developing a national birth cohort for child health research using a hospital admissions database in England: the impact of changes to data collection practices

Dear Dr. Zylbersztejn:

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now with our production department.

If your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information please contact onepress@plos.org.

If we can help with anything else, please email us at plosone@plos.org.

Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,

PLOS ONE Editorial Office Staff

on behalf of

Dr. Umberto Simeoni

Academic Editor

PLOS ONE

Associated Data

    Supplementary Materials

    S1 Appendix. HES linkage algorithms.

    (DOCX)

    S2 Appendix. Developing a cohort of singleton live births in HES.

    (DOCX)

    S3 Appendix. Additional results.

    (DOCX)

    Data Availability Statement

    Authors do not have permission to share patient-level Hospital Episode Statistics (HES) data. Qualified researchers can request access to the data from the NHS Digital Data Access Advisory Group (enquiries@nhsdigital.nhs.uk).


    Articles from PLoS ONE are provided here courtesy of PLOS
