Author manuscript; available in PMC: 2020 Nov 12.
Published in final edited form as: Traffic Inj Prev. 2019 Nov 12;20(SUP2):S151–S155. doi: 10.1080/15389588.2019.1679552

Catalyzing Traffic Safety Advancements via Data Linkage: Development of the New Jersey Safety and Health Outcomes (NJ-SHO) Data Warehouse

Allison E Curry 1,2, Melissa R Pfeiffer 2, Meghan E Carey 2, Lawrence J Cook 3
PMCID: PMC7035196  NIHMSID: NIHMS1543933  PMID: 31714800

Abstract

Objective

Our objective is to describe the development of the New Jersey Safety and Health Outcomes (NJ-SHO) data warehouse—a unique and comprehensive data source that integrates various state-level administrative databases in NJ to enable the field of traffic safety to address critical, high-priority research questions.

Methods

We obtained full identifiable data from the following statewide administrative databases for the state of New Jersey: (1) driver licensing database; (2) Administration of the Courts data on traffic-related citations; (3) police-reported crash database; (4) birth certificate data; (5) death certificate data; and (6) hospital discharge data. We also obtained (7) childhood electronic health records for NJ residents who were patients of the Children’s Hospital of Philadelphia pediatric healthcare network and (8) Census tract-level indicators. We undertook an iterative process to develop a linkage algorithm in LinkSolv 9.0 software using records for individuals born in select birth years (1987 and 1988) and subsequently executed the linkage for the entire study period (2004 through 2017). Several metrics were used to evaluate the quality of the linkage process.

Results

We identified a total of 62,685,619 records and 19,247,363 distinct individuals; 10,352,998 of these individuals had more than one record brought together during the linkage process. Our evaluation of this linkage suggests that the linkage was of high quality.

Conclusions

The resulting NJ-SHO data warehouse will be one of the most comprehensive and rich traffic safety data warehouses to date. The warehouse has already been utilized for numerous studies and will be fully primed to support a host of rigorous studies, both in and beyond the field of traffic safety.

Keywords: administrative data, data linkage, injury prevention, motor vehicle crash

INTRODUCTION

Analyses of state- and national-level motor vehicle crash report data have substantially improved our understanding of crash causation and enabled us to monitor progress in reducing crash burden. However, the utility of these data has been substantially limited. Crash reports contain data only on the events occurring just prior to the crash, the crash event itself, and the conditions of involved individuals in the moments just after the crash, essentially limiting the period of study to just a few minutes. Further, without access to individual-level data, crash events can only be studied in isolation; multiple events experienced by an individual driver cannot be connected. Finally, lack of access to identifiable data precludes linkage to other valuable existing data sources, including driver license and health outcome data; this severely limits our ability to situate crash events in the context of a driver’s history and their short- and long-term outcomes and to follow individual drivers over the course of licensure.

Fortunately, the presence of identifiable information on US state-level crash reports enables these data to be linked to an array of other relevant data sources. Previous linkages of crash data to subsequent injury outcomes (i.e., death or hospital discharge data), for example the National Highway Traffic Safety Administration’s (NHTSA) Fatality Analysis Reporting System (FARS) database and the NHTSA-funded Crash Outcome Data Evaluation System (CODES) project, have established the ability of linked datasets to advance knowledge in the field of traffic safety. Population-level linkages between pre-crash and crash data are much rarer; thus far, there have been only a few published examples in the US of statewide driver licensing-to-crash linkages (Chapman, Masten, & Browning, 2014; Curry, Pfeiffer, Durbin, & Elliott, 2015; Foss, Masten, & Martell, 2014; Zhang & Lin, 2013). To our knowledge, there is no existing comprehensive traffic safety data resource that spans the continuum from underlying contributing factors and relevant previous events to crash injury outcomes.

Our efforts to develop the New Jersey Safety and Health Outcomes (NJ-SHO) data warehouse began in 2011 with the overall goal of creating a unique and comprehensive data source—which integrates numerous state-level administrative databases in New Jersey—that catalyzes our ability to both develop novel epidemiologic methods and address critical, high-priority research questions in traffic safety. Further, we intentionally designed the warehouse so that it catalyzes novel research beyond traffic safety, and even in fields beyond injury prevention. In the current paper, we describe the data sources included in the warehouse; detail our process of obtaining, linking, and evaluating the quality of the warehouse; and comment on its previous, current, and future uses.

METHODS

Data Sources

As shown in Table 1, we obtained full identifiable data from numerous statewide administrative databases for New Jersey. (1) NJ’s statewide driver licensing database includes the complete licensing record for all individuals who held a NJ driver’s license from January 2004 through December 2018. The database includes full names, 15-digit driver licensing numbers, and residential addresses (up to 3 per person); exact dates of birth, learner’s permit, independent licensure, and death; and the date and type of license transactions (i.e., initial, renewal, duplicate, change, upgrade, and downgrade). (2) NJ’s statewide traffic citation database was populated by the NJ Administration of the Courts into the licensing database (at the driver level) prior to our receipt. This database includes dates and types of all traffic-related citations as well as license suspensions and restorations. (3) NJ’s police-reported crash database includes all data collected on the NJ Police Crash Investigation Report (NJTR-1) for all police-reported crashes from 2004 through 2017. A crash is reportable in NJ if it results in an injury or >$500 in property damage (New Jersey Motor Vehicle Commission, 2011). (4) NJ birth certificate and (5) NJ death certificate databases include all data collected on vital statistics collection forms. Death data include injury-related fields (date, time, place, location) and cause of death. (6) The NJ Hospital Discharge Data Collection System contains data from all hospital inpatient, outpatient, and emergency department visits across the state of NJ, including International Classification of Diseases diagnostic codes and E-codes. In addition to these statewide administrative databases, we are incorporating several other sources into the NJ-SHO warehouse to support a wider array of analyses: (7) childhood electronic health records for all patients of the Children’s Hospital of Philadelphia (CHOP) healthcare network (which encompasses >50 locations in southeastern Pennsylvania and southern New Jersey) who reside in NJ, and (8) US Census and American Community Survey data, which include Census tract-level indicators (e.g., median household income, population density, availability of healthcare providers).

For each of these data sources, files were obtained from the relevant organization listed in Table 1. Data from all sources were imported into a common structure (i.e., SAS datasets), and the variables used for the linkage were standardized. Standardized data from all data sources were initially combined into a single dataset; we then created a separate dataset for each birth year (or for several birth years combined when the number of records was smaller). Finally, we identified the range of identifiable data elements in each source that were available to be included in a probabilistic linkage (Table 2).
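The standardization and birth-year partitioning described above can be illustrated with a minimal Python sketch. This is not the authors' SAS code; the field names (`first`, `last`, `dob`), the accepted date formats, and the cleaning rules are all assumptions chosen for illustration.

```python
import re
import unicodedata
from collections import defaultdict
from datetime import date, datetime
from typing import Optional


def standardize_name(raw: Optional[str]) -> str:
    """Uppercase a name, strip accents and punctuation, collapse whitespace."""
    text = unicodedata.normalize("NFKD", raw or "")
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^A-Za-z ]+", " ", text).upper()
    return " ".join(text.split())


def standardize_dob(raw: Optional[str]) -> Optional[date]:
    """Parse a date of birth written in one of a few assumed formats."""
    for fmt in ("%Y-%m-%d", "%m/%d/%Y", "%Y%m%d"):
        try:
            return datetime.strptime((raw or "").strip(), fmt).date()
        except ValueError:
            continue
    return None  # unparseable dates are kept but left without a birth year


def partition_by_birth_year(records):
    """Standardize linkage variables, then group records by birth year."""
    by_year = defaultdict(list)
    for rec in records:
        dob = standardize_dob(rec.get("dob"))
        clean = {
            **rec,
            "first": standardize_name(rec.get("first")),
            "last": standardize_name(rec.get("last")),
            "dob": dob,
        }
        by_year[dob.year if dob else None].append(clean)
    return dict(by_year)


# Hypothetical records from three different sources
example = [
    {"source": "license", "first": "José", "last": "O'Brien", "dob": "1987-03-05"},
    {"source": "crash", "first": "JOSE", "last": "O BRIEN", "dob": "03/05/1987"},
    {"source": "birth", "first": "Ann", "last": "Lee", "dob": "19880101"},
]
partitions = partition_by_birth_year(example)
```

After standardization the first two hypothetical records carry identical name and date values, so they can be compared directly in a later linkage step.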

Table 1.

Description of New Jersey administrative datasets being integrated into the New Jersey Safety and Health Outcomes (NJ-SHO) warehouse.

| Data Source | Contains | Years Obtained | # of Records | Provided By |
|---|---|---|---|---|
| NJ Driver Licensing Database | Detailed data on every driver licensed in the state of NJ | 2004 - 2018 | ≈11 million drivers | NJ Motor Vehicle Commission |
| NJ Crash Report Database [1,2] | Crash-, vehicle-, driver-, occupant-, and pedestrian-level data for all police-reported crashes in NJ | 2004 - 2017 | ≈7 million crash-involved drivers, ≈100,000 crash-involved pedestrians | NJ Motor Vehicle Commission |
| NJ Administration of the Courts (AOC) Data | Date and type of all traffic-related citations issued in NJ; directly populated by AOC into the NJ Driver Licensing Database | 2004 - 2018 | Multiple records for drivers in licensing database | NJ Motor Vehicle Commission |
| NJ Birth Certificate Data | Birth certificate data for all births occurring in NJ | Birth years 1979 - 2000 | ≈2.5 million births | NJ Department of Health |
| NJ Death Certificate Data [3] | Death certificate data for all deaths in NJ | 2004 - 2016 | ≈940,000 deaths | NJ Department of Health |
| NJ Hospital Discharge Data Collection System | Detailed utilization data on all NJ inpatient, outpatient, and emergency department discharges; files are derived from hospital uniform billing information | 2004 - 2017 | ≈41 million records imported | NJ Department of Health |
| Childhood electronic health record (EHR) data | EHR data for all CHOP healthcare network patients who were NJ residents at last CHOP visit | 2005 - 2018 [4] | ≈200,000 patients | CHOP |
| US Census and American Community Survey Data | Age-, sex-, and race/ethnicity-specific population data; Census-tract-level geographic and socioeconomic indicators | 2004 - 2014 | - | US Census Bureau website |
1. A crash is reportable to police if it results in injury to or death of any person, or damage to property of any one person in excess of $500.00.
2. Records for crash-involved drivers, occupants, and pedestrians/bicyclists are separate records. Drivers and pedestrians/bicyclists are included in the current linkage; occupants will be linked separately.
3. We expect this to be a more complete reporting of crash-related fatalities than NHTSA’s FARS data, as FARS is limited to fatalities occurring ≤ 30 days of crash.
4. Data are limited to those born from 1987 through 2000 (i.e., driving eligible ages).

Table 2.

List of identifiable data elements used for linkage.

Data Element License Crash-Involved Driver Crash-Involved Pedestrian Birth Death Hospital CHOP EHR
Name (first, last, middle initial) X X X X X X X
Date of birth X X X X X X X
Geography of residence X X X X X X X
Sex X X X X X X X
Social Security Number X X
Date of death X X X X
Event date X X X X X
Event location X X X
Race and ethnicity X X X X

Linkage Process

We conducted a probabilistic linkage in LinkSolv 9.0. Briefly, LinkSolv uses Bayes’ rule to calculate posterior probabilities of a true match between two records based on agreements (within a specified tolerance) and disagreements (outside the established tolerance) between examined data elements (Gelman, Carlin, Stern, & Rubin, 2014). Match probabilities are determined both by the discriminating power of data elements (agreement on common values has less impact than agreement on rare values) and by their reliability (disagreement on accurate data elements provides more evidence against a match than disagreement on less accurate data elements). A full linkage process involves several “passes.” Each pass brings together pairs of records that have exactly the same values on selected criteria (“join” criteria) and then evaluates those pairs based on additional criteria (“match” criteria). Match criteria are the same for each pass, but join criteria differ, thereby ensuring that disagreement on a single data element will not prevent the identification of a true match. Using an iterative process, we developed an algorithm that ultimately consisted of two passes and balanced identifying true matches against minimizing the inclusion of false matches. Development was conducted with two separate birth years (1987 and 1988) that had the largest numbers of records and representation from each data source. Subsequently, we executed the linkage on the full warehouse to identify all records pertaining to a single individual; all records identified as belonging to a single individual were combined into a “set.”
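The pass structure and the Bayes' rule calculation can be sketched in a few lines of Python. This is a generic probabilistic-linkage illustration and not LinkSolv's implementation: the per-field probabilities `m` (agreement given a true match, reflecting reliability) and `u` (agreement given a non-match, reflecting discriminating power), the prior, the threshold, and the join keys are all hypothetical. A fuller version would let `u` depend on how common the agreeing value is.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical parameters per field: (m, u)
#   m = P(fields agree | same person), u = P(fields agree | different people)
FIELDS = {
    "last": (0.95, 0.01),
    "first": (0.93, 0.02),
    "dob": (0.98, 0.003),
    "sex": (0.99, 0.5),
}


def match_probability(a, b, prior=0.01):
    """Posterior probability of a true match via Bayes' rule on the odds scale."""
    odds = prior / (1 - prior)
    for field, (m, u) in FIELDS.items():
        if a.get(field) is None or b.get(field) is None:
            continue  # a missing value contributes no evidence either way
        if a[field] == b[field]:
            odds *= m / u  # agreement raises the odds, more so when u is small
        else:
            odds *= (1 - m) / (1 - u)  # disagreement lowers them, more so when m is high
    return odds / (1 + odds)


def candidate_pairs(records, join_keys):
    """Yield index pairs whose join criteria agree exactly."""
    blocks = defaultdict(list)
    for idx, rec in enumerate(records):
        key = tuple(rec.get(k) for k in join_keys)
        if None not in key:
            blocks[key].append(idx)
    for members in blocks.values():
        yield from combinations(members, 2)


def link(records, passes, threshold=0.9):
    """Run each pass with its own join criteria but the same match criteria."""
    accepted = {}
    for join_keys in passes:
        for pair in candidate_pairs(records, join_keys):
            if pair not in accepted:
                p = match_probability(records[pair[0]], records[pair[1]])
                if p >= threshold:
                    accepted[pair] = p
    return accepted


records = [
    {"last": "SMITH", "first": "JOHN", "dob": "1987-03-05", "sex": "M"},
    {"last": "SMITH", "first": "JON", "dob": "1987-03-05", "sex": "M"},
    {"last": "SMYTH", "first": "JOHN", "dob": "1987-03-05", "sex": "M"},
    {"last": "SMITH", "first": "JANE", "dob": "1988-07-01", "sex": "F"},
]
passes = [("last", "dob"), ("first", "dob")]  # two passes with different join criteria
accepted = link(records, passes)
```

In this toy data the first pass joins records 0 and 1 despite the first-name discrepancy, while records 0 and 2 are only brought together by the second pass, which does not require agreement on last name.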

We evaluated the overall linkage process in several ways. First, we determined the median and interquartile range (IQR) of the match probabilities (i.e., the likelihood of the match being true, as calculated by LinkSolv) for all pairs that were joined, as well as the lowest match probability among all of the pairs in each set. We also calculated the median and IQR for the number of distinct records combined in each set. Finally, we determined the proportion of records from each data source that matched with a record from one of the other sources (e.g., license record matched with a birth record) or the same source (e.g., two birth records matched).
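A minimal sketch of these evaluation metrics follows, assuming accepted pairs are available as a mapping from record-index pairs to match probabilities (a hypothetical format, not LinkSolv output). Sets are formed here as connected components with a union-find structure, which is one reasonable way to combine pairwise links and is an assumption about how sets are built.

```python
from collections import defaultdict
from statistics import quantiles


def summarize(values):
    """Return (median, (Q1, Q3)) for a list of at least two numbers."""
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    return q2, (q1, q3)


def evaluate_linkage(n_records, accepted):
    """Group linked records into sets and compute summary metrics."""
    parent = list(range(n_records))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for i, j in accepted:
        parent[find(j)] = find(i)  # merge the two records' sets

    members = defaultdict(list)
    for idx in range(n_records):
        members[find(idx)].append(idx)

    lowest = {}  # lowest accepted match probability within each set
    for (i, _), p in accepted.items():
        root = find(i)
        lowest[root] = min(p, lowest.get(root, 1.0))

    set_sizes = [len(m) for m in members.values() if len(m) > 1]
    return {
        "pair_probability": summarize(list(accepted.values())),
        "set_size": summarize(set_sizes),
        "share_sets_min_ge_0.99": sum(p >= 0.99 for p in lowest.values()) / len(lowest),
    }


# Hypothetical accepted pairs among six records; record 5 links to nothing
example_pairs = {(0, 1): 0.9999, (1, 2): 0.95, (3, 4): 0.995}
metrics = evaluate_linkage(6, example_pairs)
```

Unlinked records are excluded from the set-size summary, mirroring the convention of reporting sizes only for records that were combined with at least one other record.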

RESULTS

Linkage Results

Table 1 describes the number of records contained in each data source that was included in the linkage. There was a total of 62,685,619 records for individuals across all data sources (birth: 2,469,971; licensing/citation: 10,932,316; crash-involved driver: 6,653,904; crash-involved pedestrian/bicyclist: 105,198; CHOP EHR: 206,201; hospital (inpatient, outpatient, ED): 41,376,316; death: 941,713). Through the linkage process, we identified a total of 19,247,363 distinct individuals, 10,352,998 of whom had more than one record brought together during the linkage process (i.e., were included in a set).
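As a quick consistency check, the per-source counts reported above can be summed and the share of individuals included in a set computed directly. The figures are taken from the text; the variable names are illustrative only.

```python
# Record counts by source, as reported in the text
records_by_source = {
    "birth": 2_469_971,
    "licensing/citation": 10_932_316,
    "crash-involved driver": 6_653_904,
    "crash-involved pedestrian/bicyclist": 105_198,
    "CHOP EHR": 206_201,
    "hospital": 41_376_316,
    "death": 941_713,
}

total_records = sum(records_by_source.values())

distinct_individuals = 19_247_363
individuals_in_sets = 10_352_998
share_in_set = individuals_in_sets / distinct_individuals
hospital_share = records_by_source["hospital"] / total_records

print(f"total records: {total_records:,}")
print(f"individuals with more than one linked record: {share_in_set:.1%}")
print(f"hospital share of all records: {hospital_share:.1%}")
```

The sum reproduces the reported total of 62,685,619 records, roughly 54% of distinct individuals had more than one record linked, and hospital records make up about two-thirds of all records.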

Evaluation of Linkage

We evaluated the quality of our linkage process in several ways. First, we assessed the match probability (that is, the likelihood of the match being true) for each pair of records that was joined within each set. Overall, the median match probability among all accepted pairs was 0.9999991 (IQR: 0.9998636, 1.0000000). Second, the lowest match probability for any two records within a set was 0.99 or higher for 84% of all sets and 0.90 or higher for 95% of sets. Third, we determined the number of distinct records that were combined in each set. The median (IQR) number of distinct records combined in each set (therefore excluding records that were not linked with any other record) was 4 (3, 6). The maximum number of records per set was 412, largely due to multiple hospital visits. Finally, Table 3 shows the number and proportion of records in each source that, after the linkage process, ended up in a set with at least one record from another source. In particular, we were interested in the number of same-source matches for data sources that are expected to have only one record per individual (i.e., birth, license, EHR, and death datasets). As we had hoped, the proportions of records that were part of a set with a same-source record were very low (1.0% of birth records; 0.3% of death records; 0.8% of EHR records; 0.5% of license records). In all, 0.35% (n=36,277) of total sets had an issue in which more than one record that should be unique had matched.
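The same-source check can be sketched as follows, assuming each set is represented simply as a list of the source labels of its records (a hypothetical representation chosen for illustration). A set is flagged when it contains more than one record from a source that should contribute at most one record per person.

```python
from collections import Counter

# Sources expected to hold a single record per individual
UNIQUE_SOURCES = {"birth", "license", "ehr", "death"}


def flag_same_source_sets(sets):
    """Return IDs of sets with duplicate records from a should-be-unique source."""
    flagged = []
    for set_id, sources in sets.items():
        counts = Counter(sources)
        if any(counts[source] > 1 for source in UNIQUE_SOURCES):
            flagged.append(set_id)
    return flagged


# Hypothetical sets: repeated hospital visits are fine, two birth records are not
example_sets = {
    "A": ["birth", "license", "hospital", "hospital"],
    "B": ["birth", "birth", "license"],
    "C": ["death", "hospital"],
}
flagged = flag_same_source_sets(example_sets)
flagged_share = len(flagged) / len(example_sets)
```

Applied to the reported figures, 36,277 flagged sets out of 10,352,998 gives about 0.35%, which agrees with the stated proportion if the denominator is the number of individuals with more than one linked record.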

Table 3.

Number and proportion of all records in each source that are in a set with at least one record from another source.

Columns are the data source of the record; each cell gives N (%).

| In a set with at least one record from: | Birth | CHOP EHR | Death | Driver/crash | Hospital | License | Pedestrian/bicyclist |
|---|---|---|---|---|---|---|---|
| Birth | 25,248 (1.0) | 154,569 (75.0) | 12,202 (60.3) | 1,261,405 (54.8) | 5,988,705 (55.2) | 1,582,725 (51.1) | 18,764 (49.7) |
| CHOP EHR | 155,616 (9.4) | 1,611 (0.8) | 1,075 (12.1) | 92,358 (7.9) | 561,444 (9.1) | 161,644 (9.1) | 1,634 (6.2) |
| Death | 12,325 (0.5) | 1,079 (0.5) | 2,531 (0.3) | 206,464 (3.1) | 3,943,208 (9.6) | 592,561 (5.4) | 5,413 (5.2) |
| Crash-involved driver | 730,262 (29.6) | 59,675 (28.9) | 139,513 (15.0) | 3,660,441 (55.0) | 12,729,728 (35.6) | 3,368,797 (30.9) | 25,933 (25.1) |
| Hospital | 1,456,989 (59.0) | 154,629 (75.0) | 847,101 (90.0) | 4,897,134 (73.6) | 35,899,722 (86.8) | 6,489,900 (59.4) | 71,881 (68.3) |
| License | 1,583,834 (64.1) | 160,761 (78.0) | 590,613 (63.3) | 5,546,743 (83.4) | 25,336,317 (70.2) | 58,446 (0.5) | 45,441 (43.8) |
| Crash-involved pedestrian/bicyclist | 18,522 (0.8) | 1,619 (0.8) | 5,218 (0.6) | 51,109 (0.8) | 446,029 (1.1) | 44,865 (0.4) | 4,139 (3.9) |

DISCUSSION

The resulting NJ-SHO data warehouse will be one of the most comprehensive and rich traffic safety data warehouses in the US to date. Our evaluation suggests that the linkage was of high quality. The warehouse will be fully primed to support a host of rigorous and innovative traffic safety studies. Inclusion of drivers of all ages, as well as vehicle occupants, pedestrians, and bicyclists, will ensure that the NJ-SHO warehouse can support studies in a wide array of high-priority areas of traffic safety research, including impaired driving, older driver crashes, pedestrian and bicyclist injuries, and child passenger safety. In addition, the analytic warehouse will have several unique features, including fully geocoded residential addresses of licensed drivers and fully geocoded crash locations. Further, because records in any two data sources are linked independently of all other data sources (e.g., birth records are linked to EHR records regardless of driver licensure status), the NJ-SHO can also be leveraged for analyses outside of the field of traffic safety, and even beyond injury prevention. Such features will optimize its ability to catalyze research on a wide array of topics.

Exemplar Applied Research Using NJ-SHO

Thus far, we have utilized the NJ-SHO database to evaluate and directly inform Graduated Driver Licensing (GDL) policy, uniquely advance injury methods, and enhance our understanding of young driver behavior and crashes. A total of 20 peer-reviewed scientific papers have been published using NJ-SHO data. These include an evaluation of New Jersey’s first-in-the-US GDL decal provision (Curry, Localio, Pfeiffer, & Durbin, 2013), examinations of licensing and crash rates among adolescents and young adults with medical conditions (Curry et al., 2017; Curry, Yerys, Metzger, Carey, & Power, 2019), and foundational analyses to inform US states considering applying GDL policies to older novice drivers (Curry et al., 2015). Numerous additional studies are ongoing or planned, including linkage of the NJ-SHO to Medicaid and Medicare health insurance claim data, a study assessing medication use among older drivers, and a study examining restraint use among child passengers.

ACKNOWLEDGEMENTS

This work was supported by the Eunice Kennedy Shriver National Institute of Child Health and Human Development at the National Institutes of Health, Award R21HD092850 (PI: Curry).

REFERENCES

  1. Chapman EA, Masten SC, & Browning KK (2014). Crash and traffic violation rates before and after licensure for novice California drivers subject to different driver licensing requirements. Journal of Safety Research, 50, 125–138.
  2. Curry AE, Localio R, Pfeiffer MR, & Durbin DR (2013). Graduated Driver Licensing Decal Law: Effect on Young Probationary Drivers. American Journal of Preventive Medicine, 44, 1–7.
  3. Curry AE, Metzger KB, Pfeiffer MR, Elliott MR, Winston FK, & Power TJ (2017). Motor Vehicle Crash Risk Among Adolescents and Young Adults With Attention-Deficit/Hyperactivity Disorder. JAMA Pediatrics, 164(6), 942–948.
  4. Curry AE, Pfeiffer MR, Durbin DR, & Elliott MR (2015). Young driver crash rates by licensing age, driving experience, and license phase. Accident Analysis and Prevention, 80, 243–250.
  5. Curry AE, Yerys BE, Metzger KB, Carey ME, & Power TJ (2019). Traffic Crashes, Violations, and Suspensions Among Young Drivers With ADHD. Pediatrics, 143(6), e20182305.
  6. Foss RD, Masten SV, & Martell CA (2014). Examining the safety implications of later licensure: crash rates of older vs. younger novice drivers before and after Graduated Driver Licensing; Washington D.C.
  7. Gelman A, Carlin JB, Stern HS, & Rubin DB (2014). Bayesian data analysis (Vol 2). Boca Raton, FL, USA: Chapman & Hall/CRC.
  8. New Jersey Motor Vehicle Commission. (2011). NJTR-1 form field manual. Retrieved May 1, 2015, from http://www.state.nj.us/transportation/refdata/accident/pdf/NJTR-1Field_Manual.pdf
  9. Zhang Y, & Lin G (2013). Disparity surveillance of nonfatal motor vehicle crash injuries. Traffic Injury Prevention, 14(7), 697–702.
