Author manuscript; available in PMC: 2013 Apr 1.
Published in final edited form as: Acad Emerg Med. 2012 Apr;19(4):469–480. doi: 10.1111/j.1553-2712.2012.01324.x

Evaluating the Use of Existing Data Sources, Probabilistic Linkage, and Multiple Imputation to Build Population-based Injury Databases Across Phases of Trauma Care

Craig Newgard 1, Susan Malveau 1, Kristan Staudenmayer 1, N Ewen Wang 1, Renee Y Hsia 1, N Clay Mann 1, James F Holmes 1, Nathan Kuppermann 1, Jason S Haukoos 1, Eileen M Bulger 1, Mengtao Dai 1, Lawrence J Cook 1; the WESTRN investigators1
PMCID: PMC3334286  NIHMSID: NIHMS355617  PMID: 22506952

Abstract

Objectives

The objective was to evaluate the process of using existing data sources, probabilistic linkage, and multiple imputation to create large population-based injury databases matched to outcomes.

Methods

This was a retrospective cohort study of injured children and adults transported by 94 emergency medical systems (EMS) agencies to 122 hospitals in seven regions of the western United States over a 36-month period (2006 to 2008). All injured patients evaluated by EMS personnel within specific geographic catchment areas were included, regardless of field disposition or outcome. The authors performed probabilistic linkage of EMS records to four hospital and postdischarge data sources (emergency department [ED] data, patient discharge data, trauma registries, and vital statistics files) and then handled missing values using multiple imputation. The authors compare and evaluate matched records, match rates (proportion of matches among eligible patients), and injury outcomes within and across sites.

Results

There were 381,719 injured patients evaluated by EMS personnel in the seven regions. Among transported patients, match rates ranged from 14.9% to 87.5% and were directly affected by the availability of hospital data sources and proportion of missing values for key linkage variables. For vital statistics records (1-year mortality), estimated match rates ranged from 88.0% to 98.7%. Use of multiple imputation (compared to complete case analysis) reduced bias for injury outcomes, although sample size, percentage missing, type of variable, and combined-site versus single-site imputation models all affected the resulting estimates and variance.

Conclusions

This project demonstrates the feasibility and describes the process of constructing population-based injury databases across multiple phases of care using existing data sources and commonly available analytic methods. Attention to key linkage variables and decisions for handling missing values can be used to increase match rates between data sources, minimize bias, and preserve sampling design.


Injury continues to be a major cause of death and disability, particularly among the young.1,2 While the development of trauma centers, trauma systems, injury prevention programs, and public policy have resulted in many strides toward reducing the burden of injury, much work remains. Integral to understanding and further reducing the burden of injury is the ability to measure how injury relates to meaningful health outcomes across broad populations and different phases of care. Trauma registries have traditionally provided the bulk of available injury data. However, trauma registries preferentially target more seriously injured patients treated at trauma centers, are not population-based, and are typically limited to in-hospital outcomes. There is a growing need for broad population-based injury data that effectively span multiple phases of care (out-of-hospital, in-hospital, and postdischarge); include patients with minor and severe injuries; and are not limited by cost, resource, and confidentiality constraints. Such population-based data may be used for trauma quality assurance, improving the effectiveness of field triage protocols and early treatment interventions, and evaluating the outcomes of injured patients not treated at trauma centers and falling outside of traditional quality assurance data sources (e.g., trauma registries).

The increasing availability of electronic data combined with certain analytic methods (probabilistic linkage3,4 and multiple imputation5) provides an opportunity to create such unique data resources. Probabilistic linkage is a method for matching disparate datasets when a unique identifier is not available and has been used to match emergency medical systems (EMS) records to hospital outcomes6,7 and validated among injured patients.8 Because match rates (the proportion of matches among eligible patients) are typically less than 100%, and because EMS and trauma data sources inherently contain a substantial proportion of missing values, missing data have been another obstacle in developing population-based injury databases. Handling missing values inappropriately can generate bias, reduce sample size, and lessen study power.9–14 Multiple imputation can effectively mitigate these limitations, provided that certain assumptions are met. We are unaware of any studies evaluating the combined use of probabilistic linkage and multiple imputation in the construction of large population-based databases. While the use and integration of electronic health information are being actively promoted in the United States,15 there is a need for additional literature detailing the methods to effectively link such records across multiple phases of care between different agencies and institutions and appropriately handle missing values.

In this article, we describe the methods and evaluate the use of existing data files, probabilistic linkage, and multiple imputation in constructing large population-based injury databases matched to outcomes under a variety of conditions across seven regions in the western United States.

METHODS

Study Design

This was a population-based retrospective cohort study. Fifteen institutional review boards at the seven sites reviewed and approved this protocol.

Study Setting and Population

The cohort included seven regions encompassing 12,590 square miles and a population of 10.5 million persons. We included patients evaluated by 94 EMS agencies and transported to 122 hospitals (15 Level I trauma centers, eight Level II trauma centers, three Level III hospitals, four Level IV hospitals, one Level V hospital, and 91 nontrauma hospitals) over a 36-month period (January 2006 through December 2008). The seven sites included: Portland, Oregon/Vancouver, Washington (four counties); King County, Washington; Sacramento, California (two counties); San Francisco, California; Santa Clara, California (two counties); Denver County, Colorado; and Salt Lake City, Utah (four counties). These sites are part of the Western Emergency Services Translational Research Network (WESTRN), a research consortium of geographic regions, EMS agencies, and hospitals linked through Clinical and Translational Science Award (CTSA) centers funded by the National Institutes of Health. Each site consisted of a predefined geographic EMS “footprint” typically including a central metropolitan area (urban/suburban) with or without surrounding rural and frontier areas.

The study sample included all children and adults for whom the 9-1-1 EMS system was activated within the predefined geographic regions and the EMS provider(s) recorded a primary impression of “injury” or “trauma,” regardless of field disposition or outcome. Specifying the sample in this manner allowed for a broad, population-based, out-of-hospital injury cohort defined by EMS providers that included patients with injuries ranging from mild to severe. We excluded interhospital transfers without an initial EMS response; EMS records listed as “cancelled,” “no patient found,” or “stand by” (i.e., calls without patient contact); and scheduled (i.e., non 9-1-1 activation) transports. The database was originally created as part of a project evaluating the use of field trauma triage guidelines among the broad injury population served by EMS.

Data Sources, Data Capture, and Processing

EMS Data

All EMS agencies included in the study had electronic health record systems and the ability to export these data in aggregate form. The EMS data served as the primary database into which all hospital-based and postdischarge record sources were linked for each site. EMS records that did not match to hospital records were retained to preserve the population-based sampling frame. EMS data fields included patient demographics (including age, sex, date of birth, and home zip code), date of service, EMS times, vital signs, Glasgow Coma Scale (GCS) score, procedures, triage criteria, field disposition, mode of transport, and destination hospital. For sites with a dual-response or tiered EMS system (whereby two EMS charts are generated for each patient), we used probabilistic linkage to match multiple EMS records for the same patient.

Trauma Registries

We used data from 11 trauma registries (two statewide trauma registries and nine individual trauma center registries). Although inclusion criteria varied between registries, the content of the registries was fairly consistent. Variables included patient demographics, date of admission, prehospital vital signs and times, receiving hospital, International Classification of Diseases, Ninth Revision (ICD9) diagnosis codes, ICD9 procedure codes, Abbreviated Injury Scale (AIS) scores, Injury Severity Scores (ISS), emergency department (ED) disposition, hospital length of stay (LOS), intensive care unit (ICU) stay, and in-hospital mortality.

Patient Discharge Databases

Five different statewide, nonpublic (some patient identifiers) patient discharge databases (PDD) were used for this project. The PDD data files capture administrative data for all admitted patients (except Veterans Affairs hospitals) and have similar formatting and content between states. These data files contained patient demographics (including age, sex, date of birth, and home zip code), date of admission and discharge, hospital name, ICD9 diagnosis codes, ICD9 procedure codes, LOS, and hospital disposition (including death). For linkage to EMS records, we restricted the PDD files to patients with an admission source of “emergency room,” admission type listed as “emergency” or “urgent,” admission dates within the 36-month study period, and hospitals included in the EMS-defined catchment regions. Of note, the PDD files were not restricted to patients with hospital-based injury diagnoses, as the primary sample was generated by EMS provider primary impression of injury, regardless of subsequent hospital diagnoses. Restricting hospital files by diagnosis codes would have unnecessarily restricted the number of records available for matching to EMS records and thereby reduced match rates.

ED Data

Nonpublic ED data were available for four sites (sites D, E, F, and G). The format and content of these data were similar to those of the PDD files. The ED data files captured nonadmitted patients presenting to any ED in the state, including those discharged, expired, or transferred to another hospital. The ED and PDD data files were mutually exclusive in all but one region. As with the PDD data, ED data files were not restricted to patients with hospital-based injury diagnoses.

Vital Statistics Data

Vital statistics data were available for two sites (F and G). These files provided 1+ year follow-up data, including mortality, cause of death, date of death, and most recent hospitalization prior to death for patients initially admitted to a hospital in the state. The format and content were very similar to the PDD and ED data sources.

Data Analysis

Probabilistic Linkage

We used probabilistic linkage3,4 (LinkSolv, v.8.2, Strategic Matching, Inc., Morrisonville, NY) to match patient medical records between different data sources. A detailed description of the process and variables used in record linkage is provided elsewhere.6,7 Briefly, common variables between two data sets were identified, assigned estimates for error (the expected probability of mismatch among records known to be true matches), and match weights. We applied tolerance parameters to adjust for expected differences in variables between data sets (e.g., ±1 day for date of service) and adjusted for variables with codependence (e.g., city and county) to avoid inflating the probability of a match. We used blocking variables to restrict analyses to records with exact matches on certain terms (e.g., date, age) to improve computational efficiency. Three to five passes (different combinations of blocking variables) were run for each linkage analysis. High-probability matches with a cumulative match weight above a given threshold value (equivalent to 90% probability of a match) were considered “true” matches and retained in the database, while matches below this weight were rejected.6,8,16 However, we manually reviewed matches above and below this value (i.e., probability of match 0.80 to 0.95) to assure selection of an appropriate cut-point for each linkage analysis.
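The weighting and thresholding steps described above can be illustrated with a minimal Fellegi-Sunter-style sketch. This is not the LinkSolv implementation: the field list, the m and u probabilities, the prior odds, and the example records are all hypothetical values chosen for illustration, and the adjustment for codependent variables is not modeled.

```python
import math
from datetime import date

# Hypothetical per-field parameters (illustrative, not the study's values):
#   m = P(field agrees | true match)  = 1 - expected error rate
#   u = P(field agrees | non-match)   = chance agreement
FIELDS = {
    "date_of_service": {"m": 0.98, "u": 3 / 365},   # 3 days fall within +/-1 day
    "date_of_birth":   {"m": 0.95, "u": 1 / 20000},
    "sex":             {"m": 0.99, "u": 0.5},
    "hospital":        {"m": 0.95, "u": 0.05},
}

def agrees(field, a, b):
    """Return True/False for agreement, or None if either value is missing."""
    if a is None or b is None:
        return None
    if field == "date_of_service":
        return abs((a - b).days) <= 1   # tolerance of +/-1 day
    return a == b

def match_weight(rec_a, rec_b):
    """Cumulative log2 match weight over all non-missing common fields."""
    total = 0.0
    for field, p in FIELDS.items():
        result = agrees(field, rec_a.get(field), rec_b.get(field))
        if result is None:
            continue                     # missing fields add no evidence
        if result:
            total += math.log2(p["m"] / p["u"])
        else:
            total += math.log2((1 - p["m"]) / (1 - p["u"]))
    return total

def match_probability(weight, prior_odds):
    """Posterior probability of a match given the weight and prior odds."""
    odds = prior_odds * 2.0 ** weight
    return odds / (1.0 + odds)

# Hypothetical EMS and hospital records for the same patient
ems = {"date_of_service": date(2007, 3, 14), "date_of_birth": date(1980, 5, 2),
       "sex": "F", "hospital": "H01"}
hosp = {"date_of_service": date(2007, 3, 15), "date_of_birth": date(1980, 5, 2),
        "sex": "F", "hospital": "H01"}

prob = match_probability(match_weight(ems, hosp), prior_odds=1 / 10000)
is_match = prob >= 0.90   # retain pairs at or above the 90% threshold
```

In this sketch a missing field simply contributes no weight, which is one way to see why records lacking a highly discriminating field such as date of birth struggle to reach the threshold.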

Up to five linkage analyses were used for each site, including EMS-to-EMS, EMS-to-trauma registry, EMS-to-PDD, EMS-to-ED, and EMS-to-vital statistics. For the EMS-to-EMS and EMS-to-registry linkages, we used up to 18 common variables, including date of service, times, date of birth, zip code (home, incident), demographics (age, sex), field vital signs, field procedures, cause of injury, incident city, field disposition, and destination hospital. For the EMS-to-PDD, EMS-to-ED, and EMS-to-vital statistics system record linkage analyses, we used six variables: date of service, date of birth, home zip code, age, sex, and hospital. One site (G) also used name, social security number (last four digits), and cause of injury (grouped e-codes) in these linkage analyses. Only deaths in the field were excluded from the linkage analyses. Figure 1 depicts the data linkage processes and missing data methods used in this study.
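For any one of these linkage analyses, the use of several blocking passes can be sketched as follows. This is an illustrative outline only, assuming records are held as dictionaries; the function names, the three blocking combinations, and the example records are hypothetical and are not the passes used in the study.

```python
from collections import defaultdict

def candidate_pairs(ems_records, hospital_records, block_keys):
    """Pairs (i, j) whose records agree exactly on every blocking key."""
    index = defaultdict(list)
    for j, rec in enumerate(hospital_records):
        key = tuple(rec.get(k) for k in block_keys)
        if None not in key:              # records missing a key cannot be blocked
            index[key].append(j)
    pairs = set()
    for i, rec in enumerate(ems_records):
        key = tuple(rec.get(k) for k in block_keys)
        if None not in key:
            pairs.update((i, j) for j in index.get(key, ()))
    return pairs

def multi_pass_candidates(ems_records, hospital_records, passes):
    """Union of candidate pairs over several blocking passes."""
    pairs = set()
    for block_keys in passes:
        pairs |= candidate_pairs(ems_records, hospital_records, block_keys)
    return pairs

# Hypothetical blocking combinations, one tuple per pass
PASSES = [("date_of_service", "date_of_birth"),
          ("date_of_service", "age", "sex"),
          ("date_of_service", "hospital")]

ems = [{"date_of_service": "2007-03-14", "date_of_birth": None,
        "age": 26, "sex": "F", "hospital": "H01"}]
hosp = [{"date_of_service": "2007-03-14", "date_of_birth": "1980-05-02",
         "age": 26, "sex": "F", "hospital": "H01"},
        {"date_of_service": "2007-03-14", "date_of_birth": "1950-01-01",
         "age": 57, "sex": "M", "hospital": "H02"}]

pairs = multi_pass_candidates(ems, hosp, PASSES)
```

The EMS record here lacks a date of birth, so the first pass produces no candidates for it, while the later passes still pair it with the first hospital record. Only the candidate pairs that survive blocking would then be scored with match weights.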

Figure 1. Schematic of data flow, probabilistic linkage, and multiple imputation.

Injury Outcomes

Because injury severity measures (AIS, ISS, ICD9-based Injury Severity Score [ICISS]) are not typically included in administrative data sources, we used a mapping function (ICDPIC module for Stata v11, StataCorp, College Station, TX) and ICD9 diagnoses from all linked hospital records to generate injury severity measures.17 Previous studies have validated software for mapping administrative diagnosis codes to anatomic injury scores.18,19 We used two common definitions of “serious injury”: ISS ≥ 16 and AIS ≥ 3. In addition, we created measures for orthopedic and nonorthopedic (abdominal, thoracic, brain, spine, neck, vascular, and interventional radiology procedures) operative interventions using ICD9 procedure codes. All sources of linked hospital records were used to code in-hospital mortality.
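The two definitions of "serious injury" can be made concrete with a short sketch. It assumes that region-level maximum AIS scores are already available (in the study these were derived from ICD9 codes by the ICDPIC mapping, which is not reproduced here) and uses the standard ISS definition: the sum of squares of the three highest region-level AIS scores, with any AIS of 6 setting ISS to 75. The function names and body-region labels are hypothetical.

```python
def injury_severity_score(max_ais_by_region):
    """ISS from the maximum AIS score in each injured body region."""
    scores = sorted(max_ais_by_region.values(), reverse=True)
    if scores and scores[0] == 6:
        return 75                        # unsurvivable injury, by convention
    return sum(s * s for s in scores[:3])

def serious_injury_flags(max_ais_by_region):
    """Apply both definitions of serious injury used in the text."""
    iss = injury_severity_score(max_ais_by_region)
    max_ais = max(max_ais_by_region.values(), default=0)
    return {"iss_ge_16": iss >= 16, "ais_ge_3": max_ais >= 3}

# Hypothetical patients
multi_region = {"head": 4, "chest": 3, "extremity": 2}   # ISS = 16 + 9 + 4
single_region = {"extremity": 3}                          # ISS = 9

flags_multi = serious_injury_flags(multi_region)
flags_single = serious_injury_flags(single_region)
```

The second patient meets the AIS ≥ 3 definition but not ISS ≥ 16, which illustrates why the AIS-based definition captures a larger share of patients.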

Multiple Imputation

Following completion of record linkage and injury outcome coding, we used multiple imputation to handle missing out-of-hospital and in-hospital data points. Multiple imputation is an analytic method whereby observed values are used to generate a range of plausible values for each previously missing data point based on existing correlations and relationships between variables, provided certain assumptions are met.5 Two or more values (the imputations) are selected from a range of plausible values for each previously missing data point and included in the multiply imputed data sets.5 The variability in values within and between data sets allows the inherent uncertainty in multiple imputation to be appropriately accounted for when these data are analyzed.5 We have previously validated these methods for handling missing out-of-hospital values and trauma data under a variety of conditions.9,10 Our imputation models included 40 out-of-hospital variables (age, sex, field vital signs, GCS score, mechanism of injury, trauma triage criteria, and field procedures) and 10 hospital variables (hospital, interhospital transfer, ISS, AIS, ICISS, nonorthopedic surgery, orthopedic surgery, blood transfusion, LOS, and in-hospital mortality). We included nontransported patients in the imputation models to provide a source of patients similar to those evaluated and discharged from the ED (this was particularly important in sites without ED data). Nontransported patients without a matched hospital record were assumed to be alive and not seriously injured. 
We used flexible chains regression models for multiple imputation (IVEware, Survey Methodology Program, Survey Research Center, Institute for Social Research, University of Michigan, Ann Arbor, MI)20 with generation of 10 multiply imputed data sets, analyzed independently and combined using Rubin’s rules to appropriately account for within- and between-data set variance.5 All continuous variables were retained in their primary form to preserve information.9,21 We compared results from the multiple imputation models analyzed with varying levels of missing data, different sources of data available for linkage, and site-specific versus combined-site imputation models. The combined-site models used the same processes as described above, but compiled data from multiple sites (e.g., three sites) and included a fixed-effect term for site in the imputation model.
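The pooling step can be sketched as follows. This is a minimal illustration of Rubin's rules for a single parameter, not the IVEware or SAS workflow used in the study; the function name, the ten example proportions, and the sample size are hypothetical, and the interval uses a normal approximation rather than the t-based reference distribution.

```python
import math
from statistics import mean, variance

def pool_rubin(estimates, variances):
    """Pool one parameter across m imputed data sets using Rubin's rules."""
    m = len(estimates)
    q_bar = mean(estimates)             # pooled point estimate
    w = mean(variances)                 # within-imputation variance
    b = variance(estimates)             # between-imputation variance (m - 1 denominator)
    t = w + (1 + 1 / m) * b             # total variance
    return q_bar, t

# Hypothetical proportions with ISS >= 16 from 10 imputed data sets
estimates = [0.027, 0.029, 0.026, 0.028, 0.030,
             0.027, 0.028, 0.026, 0.029, 0.028]
n = 3973                                # hypothetical sample size per data set
variances = [p * (1 - p) / n for p in estimates]

pooled, total_var = pool_rubin(estimates, variances)
half_width = 1.96 * math.sqrt(total_var)   # normal-approximation 95% interval
```

The between-imputation term is what widens the interval when imputed values disagree across data sets, which is how the uncertainty of imputation carries through to the final estimate.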

Comparative Analysis Between Sites

We used descriptive statistics to compare and contrast results from the linkage and multiple imputation processes between sites. We estimated the proportion of record matches out of eligible patients (also commonly termed “match rate”—this term will be used throughout the paper for this metric) for the following groups: EMS-transported patients, nontransported patients, admitted patients, and 1-year mortality. Transported patients provided a subgroup where, in theory, all persons were eligible to match to a hospital (ED, registry, or PDD) record. Nontransported patients represented a different subgroup, in which only a minority of patients was expected to match to a hospital record. Match rates for transported patients were calculated using total matches from any hospital source (ED, registry, or PDD) as the numerator and all transported patients as the denominator. A similar strategy was used to calculate match rates for nontransported patients. Calculating match rates for admitted patients and 1-year mortality required assumptions about the total number of eligible patients (the denominator), as there were no variables in the EMS data to directly identify such groups. For sites with all hospital data sources (ED, registry, and PDD), the number of expected admissions was estimated by applying the admission rate among matched records to the full sample of transported patients. For sites with only registry and PDD data sources (admission rate not available), we used a weighted average admission rate calculated from sites with all hospital data sources. The numerator was the total number of admitted patients from PDD and trauma registry data. For linkage to vital statistics records, we estimated the expected number of deaths (the denominator) using admission rate and literature detailing 1-year mortality among adult trauma patients.22 The numerator was the number of actual vital statistics system matches (deaths) among transported patients. 
All analyses were conducted using SAS (v9.2, SAS Institute, Cary, NC).

RESULTS

Assessing Probabilistic Linkage

We collected data for 381,719 injured patients evaluated by EMS in the seven regions over the 3-year study period. Table 1 demonstrates site-specific data sources, sample sizes, field disposition, missing data for key linkage variables, the number and proportion of matches to each hospital record source, and estimated match rates. There was variability between sites in data sources available for linkage to EMS records, which directly affected match rates. Modifying the values used for assumptions in calculating admission rates among sites without ED data directly affected the estimated match rates. For example, by changing the estimated admission rate for transported patients in site B from 26.4% (weighted average) to 23.0%, the calculated match rate increased from 58.5% to 67.2%. Estimates using the weighted average admission rate are presented in Table 1 for sites without ED data. Among the four sites with all hospital data sources available, ED data were the largest contributor to hospital outcomes (Table 1). PDD was also an important source of outcomes, particularly among sites that did not have ED data available. The number of outcomes provided solely by trauma registry data was modest at sites with all data sources, but was a substantial contributor of unique outcomes at sites without ED data.
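The sensitivity of the estimated match rate to the assumed admission rate can be reproduced approximately from the site B counts in Table 1. The sketch below is an illustration under the stated assumption that the denominator is the number of transported patients multiplied by the admission rate; the function name is hypothetical.

```python
def estimated_match_rate(matched_admitted, transported, admission_rate):
    """Matched admissions divided by the expected number of admissions."""
    expected_admissions = transported * admission_rate
    return matched_admitted / expected_admissions

# Site B: 12,071 matched admitted patients among 78,075 transported patients
rate_weighted = estimated_match_rate(12_071, 78_075, 0.264)   # weighted average rate
rate_lower = estimated_match_rate(12_071, 78_075, 0.230)      # alternative assumption

print(f"{rate_weighted:.1%} vs. {rate_lower:.1%}")
```

With these inputs the two rates come out near the reported 58.5% and 67.2%; any small discrepancy for the first value presumably reflects rounding of the published admission rate.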

Table 1.

Data Sources, Key Linkage Variables, Number of Record Matches, and Match Rates for the Seven Study Sites

| Site | A | B | C | D | E | F | G |
|---|---|---|---|---|---|---|---|
| Data sources | | | | | | | |
| EMS agencies supplying data | 10 | 32 | 1 | 1 | 2 | 2 | 46 |
| Trauma registries used | 3 | 1 | 1 | 1 | 1 | 4 | 1 |
| Patient discharge data | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
| ED data | No | No | No | Yes | Yes | Yes | Yes |
| VSS data | No | No | No | No | No | Yes | Yes |
| Total EMS records, n (%) | 65,483 | 109,825 | 58,394 | 4,693 | 18,161 | 64,720 | 60,443 |
| Dead in the field | 170 (0.3) | 313 (0.3) | 101 (0.2) | 22 (0.5) | 118 (0.7) | 229 (0.4) | 303 (0.5) |
| No transport | 7,556 (11.5) | 30,539 (27.8) | 11,820 (20.2) | 683 (15.0) | 2,568 (14.1) | 5,538 (8.6) | 18,587 (30.8) |
| Transported | 57,737 (88.2) | 78,075 (71.1) | 46,158 (79.1) | 3,973 (84.7) | 15,401 (84.8) | 58,683 (90.7) | 41,187 (68.1) |
| Police custody | 20 (0.03) | 898 (0.8) | 315 (0.5) | 15 (0.3) | 74 (0.4) | 270 (0.4) | 366 (0.6) |
| Missing key EMS linkage variables,* n (%) | | | | | | | |
| Date of service | 0 (0) | 89 (0.01) | 0 (0) | 1 (0.02) | 0 (0) | 0 (0) | 120 (0.2) |
| Date of birth | 5,757 (8.8) | 21,095 (19.2) | 5,594 (9.6) | 1 (0.02) | 15,730 (86.6) | 2,866 (4.4) | 27,538 (45.6) |
| Zip code | 54,575 (83.3) | 70,172 (63.9) | 31,051 (53.2) | 4,693 (100) | 16,347 (90.0) | 6,497 (10.0) | 2,883 (4.8) |
| Age | 578 (0.9) | 2,101 (1.9) | 710 (1.2) | 1 (0.02) | 76 (0.4) | 1,229 (1.9) | 1,600 (2.7) |
| Sex | 16,756 (25.5) | 6,986 (6.4) | 727 (1.2) | 10 (0.2) | 73 (0.4) | 1,013 (1.6) | 2,490 (4.1) |
| Destination hospital | 2,255 (3.4) | 4,891 (4.5) | 1,133 (1.9) | 209 (4.5) | 563 (3.6) | 1,648 (2.8) | 937 (1.6) |
| Hospital record matches, n (% of matches) | | | | | | | |
| Total | 15,680 | 12,297 | 8,815 | 3,656 | 2,387 | 49,288 | 37,322 |
| PDD data | 11,422 (72.8) | 9,723 (79.1) | 8,012 (90.9) | 728 (15.5) | 721 (30.2) | 11,169 (22.7) | 4,198 (11.2) |
| Unique PDD | 9,018 (57.5) | 6,229 (50.7) | 6,019 (68.3) | 464 (9.9) | 271 (11.4) | 7,148 (11.0) | 1,638 (4.4) |
| Registry data | 6,665 (42.5) | 6,068 (49.3) | 2,796 (31.7) | 385 (8.2) | 830 (34.8) | 13,625 (27.6) | 5,453 (14.6) |
| Unique registry | 4,260 (27.1) | 2,574 (20.9) | 803 (9.1) | 103 (2.2) | 367 (15.4) | 2,722 (4.2) | 2,533 (6.8) |
| ED data | | | | 2,825 (60.2) | 1,299 (54.4) | 35,397 (71.8) | 32,199 (86.3) |
| Unique ED | | | | 2,807 (59.8) | 1,286 (53.9) | 28,515 (44.1) | 29,436 (78.9) |
| VSS | | | | | | 974 (3.2) | 1,134 (3.0) |
| Estimated match rates,§ n (%) | | | | | | | |
| Any hospital record among transported patients | 15,427 (26.7) | 12,071 (15.5) | 8,523 (18.5) | 3,476 (87.5) | 2,300 (14.9) | 47,644 (81.2) | 32,904 (80.0) |
| Any hospital record among admitted patients | 14,490 (95.0) | 12,071 (58.5) | 8,523 (69.9) | 811 (87.6) | 1,067 (14.9) | 13,799 (81.2) | 7,382 (77.9) |
| Any hospital record among non-transports | 253 (3.3) | 226 (0.7) | 282 (2.4) | 175 (25.6) | 83 (3.2) | 1,578 (28.5) | 4,223 (22.7) |
| Vital statistics records among transported patients | | | | | | 967 (88.0) | 916 (98.7) |
| Other | | | | | | | |
| Admission rate | | | | 23.3 | | 29.0 | 23.0 |

PDD = patient discharge data; VSS = vital statistics systems.

*Six variables were used as the primary match fields for the EMS-ED, EMS-PDD, and EMS-VSS record linkages. For the EMS-registry linkage, up to 12 additional variables (field vital signs, procedures, times, location, mechanism of injury) were also used for linkage.

Home zip code was used for six sites, although site G used incident zip code.

Matches specified as “unique” represent EMS records that matched to a single source of hospital data without any duplicate hospital record match from other data files.

For site F, vital statistics records were only available for patients during the initial 2-year period (2006 and 2007). The “% of matches” value was therefore calculated using a denominator of total hospital record matches for the initial 2-year period (n = 30,894).

§Match rates are calculated within each of the categories listed: EMS transports; EMS transports resulting in admission; patients evaluated by EMS, but not transported; and EMS patients transported who die within 1 year of initial presentation.

Match rates varied based on the proportion of missing data for key linkage variables (Figure 2). Sites with a low proportion of missing data for linkage variables had high match rates, with date of birth serving as a critical variable in determining match rate. While site G had 46% missing data for date of birth, the availability of patient name and social security number helped sustain a high match rate at that site. The missing data pattern at site E (high proportion of missing data for date of birth and home zip code) illustrates that date of service, age, sex, and hospital are not enough to ensure a successful linkage to hospital outcomes. Furthermore, the missing data pattern in site D suggests that home zip code is not mandatory for a successful linkage, provided the proportion of missing data for the other key variables (including date of birth) is low.

Figure 2. Match rate as a function of missing data for key linkage variables among the four sites with three sources of hospital data available (PDD, trauma registry, and ED).*

*All sites used date, date of birth (DOB), zip code (home or incident), age, sex, and hospital for probabilistic linkage. Site G also used name (first, last), social security number (last 4 digits), and cause of injury (grouped e-codes) for linkage. PDD = patient discharge data.

Match weights for the different linkage variables are demonstrated in Figure 3 for site G (the one site that used the six core linkage terms, plus name and social security number). This figure illustrates that date of birth has similar discriminatory value to first name, last name, and the last four digits of social security number. While the latter three terms substantially increase the ability to match patient records, they are often not available (e.g., six of the seven sites did not have these variables available for linkage). The broad range of values for zip code, age, and hospital demonstrates that common values in these fields have relatively low weights of agreement, and therefore lower ability to generate high-probability matches, while uncommon values (rare zip codes, extremes of age, and hospitals with low EMS traffic) yield higher match weights and therefore greater likelihood of being retained as “true” matches.
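The dependence of agreement weights on how common a value is can be illustrated with a frequency-based sketch. It assumes the common approximation that the chance of two unrelated records agreeing on a specific value is governed by that value's relative frequency, giving a weight of log2(m / frequency). The function name, the m value, and the zip code counts are hypothetical and are not taken from the study.

```python
import math
from collections import Counter

def agreement_weights(values, m=0.95):
    """Value-specific agreement weight log2(m / f_v), where f_v is the
    relative frequency of value v among the observed records."""
    counts = Counter(values)
    n = len(values)
    return {v: math.log2(m / (c / n)) for v, c in counts.items()}

# Hypothetical zip code field: one very common value and one rare value
zip_codes = ["97201"] * 90 + ["97999"] * 10
weights = agreement_weights(zip_codes)

common_weight = weights["97201"]   # agreement on a common value adds little
rare_weight = weights["97999"]     # agreement on a rare value adds much more
```

Under these assumptions, agreement on the rare zip code contributes several times the weight of agreement on the common one, consistent with the wide percentile ranges shown for zip code, age, and hospital.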

Figure 3. Median match weights of agreement for nine common probabilistic linkage variables from one site.*

*Error bars represent the 95th and 5th percentile values (i.e., the middle 90% of all match weights) for agreement match weights. A single site is represented in the figure, as there was only one site that had access to names and social security numbers (white boxes) for linkage. The percentiles for date and sex were small enough to appear nonexistent in this figure.

Missing Data and Use of Multiple Imputation

In Table 2A, we list the proportions of missing data for several hospital-based injury outcomes among patients transported by EMS. For most measures, the proportion of missing data is simply a reflection of the proportion of EMS records that did not match to a hospital record. However, some measures (e.g., ICISS) have higher percentages of missing data due to a lack of hospital-based ICD9 injury codes necessary for calculation. In Table 2B, we compare key injury metrics between the nonimputed and multiply imputed samples, plus the respective sample sizes. There are different proportions of patients with serious injuries (ISS ≥ 16, AIS ≥ 3) between the nonimputed and imputed samples within the same site, with the nonimputed samples generating universally higher values. For in-hospital mortality, four sites have comparable values between the imputed and nonimputed samples, although mortality rates differ for site C (1.6% vs. 3.6%), site D (3.5% vs. 1.5%), and site E (22.8% vs. 1.7%).

Table 2.

The Proportion of Missing Hospital Data and Use of Multiple Imputation for Handling Missing Outcomes Among Patients Transported by EMS

| Site | A | B | C | D | E | F | G |
|---|---|---|---|---|---|---|---|
| A. Proportion of Missing Values for Hospital Outcomes | | | | | | | |
| Total EMS transported patients, n | 57,737 | 78,075 | 46,158 | 3,973 | 15,401 | 58,683 | 41,187 |
| Hospital record match rate among transported patients | 26.7% | 15.5% | 18.5% | 87.5% | 14.9% | 81.2% | 80.0% |
| Hospital data available for linkage | | | | | | | |
| Trauma registries | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
| Patient discharge data | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
| ED data | No | No | No | Yes | Yes | Yes | Yes |
| Data to be imputed among transported patients, n (%) | | | | | | | |
| Missing ISS | 42,359 (73.4) | 66,135 (84.7) | 37,635 (81.5) | 497 (12.5) | 13,101 (85.1) | 11,039 (18.8) | 8,283 (20.1) |
| Missing maximum AIS | 42,359 (73.4) | 66,194 (84.8) | 37,692 (81.7) | 497 (12.5) | 13,101 (85.1) | 11,039 (18.8) | 8,283 (20.1) |
| Missing ICISS | 45,065 (78.1) | 67,368 (86.3) | 40,866 (88.5) | 884 (22.3) | 13,402 (87.0) | 19,105 (32.6) | 9,189 (22.3) |
| Missing LOS | 42,311 (73.3) | 66,006 (84.5) | 37,714 (81.7) | 521 (13.1) | 13,217 (85.8) | 11,339 (19.3) | 8,523 (20.7) |
| Missing nonorthopedic surgery | 42,310 (73.3) | 66,004 (84.5) | 37,635 (81.5) | 497 (12.5) | 13,101 (85.1) | 11,039 (18.8) | 8,283 (20.1) |
| Missing orthopedic surgery | 42,310 (73.3) | 66,004 (84.5) | 37,635 (81.5) | 497 (12.5) | 13,101 (85.1) | 11,039 (18.8) | 8,283 (20.1) |
| Missing in-hospital mortality | 42,310 (73.3) | 65,970 (84.5) | 37,635 (81.5) | 497 (12.5) | 13,101 (85.1) | 11,039 (18.8) | 8,283 (20.1) |
| B. Hospital Outcome Measures Among Injured Patients Transported by EMS Before and After Using Multiple Imputation | | | | | | | |
| Nonimputed, linked sample, n (%) | | | | | | | |
| N | 15,427 | 12,071 | 8,523 | 3,476 | 2,300 | 47,644 | 32,904 |
| ISS ≥ 16 | 1,885 (12.3) | 2,094 (17.5) | 926 (10.9) | 100 (2.9) | 118 (5.1) | 588 (1.2) | 1,380 (4.2) |
| Maximum AIS ≥ 3 | 5,664 (36.8) | 6,439 (54.2) | 2,656 (31.4) | 206 (5.9) | 302 (13.1) | 1,881 (4.0) | 3,944 (12.0) |
| Nonorthopedic surgery | 1,458 (9.5) | 1,269 (10.5) | 1,025 (12.0) | 131 (3.8) | 114 (5.0) | 1,163 (2.4) | 739 (2.3) |
| Orthopedic surgery | 4,984 (32.3) | 4,817 (39.9) | 1,893 (22.2) | 242 (7.0) | 257 (11.2) | 4,378 (9.2) | 2,623 (8.0) |
| In-hospital mortality | 445 (2.9) | 380 (3.2) | 308 (3.6) | 52 (1.5) | 40 (1.7) | 451 (1.0) | 392 (1.2) |
| Fully imputed, linked sample, n (%) | | | | | | | |
| N | 57,737 | 78,075 | 46,158 | 3,973 | 15,401 | 58,683 | 41,187 |
| ISS ≥ 16 | 3,281 (5.7) | 2,801 (3.6) | 1,119 (2.4) | 109 (2.7) | 123 (0.8) | 590 (1.0) | 1,430 (3.5) |
| Maximum AIS ≥ 3 | 14,318 (24.8) | 17,102 (21.9) | 4,407 (9.6) | 248 (6.2) | 385 (2.5) | 1,771 (3.0) | 4,256 (10.3) |
| Nonorthopedic surgery | 4,308 (7.5) | 4,546 (5.8) | 5,906 (12.8) | 207 (5.2) | 1,074 (7.0) | 1,625 (2.8) | 915 (2.2) |
| Orthopedic surgery | 20,382 (35.3) | 25,897 (33.2) | 7,574 (16.4) | 354 (8.9) | 3,287 (21.3) | 6,821 (11.6) | 2,852 (6.9) |
| In-hospital mortality | 1,340 (2.3) | 1,916 (2.5) | 741 (1.6) | 139 (3.5) | 3,511 (22.8) | 710 (1.2) | 502 (1.2) |

AIS = Abbreviated Injury Scale; LOS = length of stay; ICISS = ICD9-based Injury Severity Score; ISS = Injury Severity Score.

Table 3 illustrates the relationship between sample size, data availability, and the handling of missing values. The table is restricted to sites with all data sources available and high match rates. Restricting the sample to patients who matched to a hospital record reduces the sample size by 12.5% to 82.1%, depending on the data sources available. In the absence of ED data, restricting to patients with matched hospital records reduces the sample size by 64.8% to 82.1%. Figures 4A through 4D demonstrate the relationship between match rate, the amount of missing data (generally reflective of hospital data available for linkage and match rates), decisions regarding handling missing values, sample size, and hospital outcome metrics. For most sites, the proportion of patients with ISS ≥ 16 is comparable between nonimputed and imputed samples, except for sites D and G, where there are notably more seriously injured patients among the PDD/registry dataset restricted to observed values. In contrast, there is much more within-site variation for patients with AIS ≥ 3, major nonorthopedic surgery, and in-hospital mortality metrics when using the different analytic strategies. These figures also demonstrate that the variance (95% confidence intervals) of estimates is sensitive to sample size, type of variable being imputed, and the combination of these factors.

Table 3.

Sample Size Based on Data Availability for Linkage and Whether or Not Multiple Imputation Is Used to Handle Missing Values*

| | Site D | Site F | Site G |
| --- | --- | --- | --- |
| Full sample of injured patients transported by EMS | 3,973 | 58,683 | 41,187 |
| Linked ED, PDD, registry | 3,476 (87.5) | 47,644 (81.2) | 32,904 (80.0) |
| Linked ED, PDD, registry + MI | 3,973 (100) | 58,683 (100) | 41,187 (100) |
| Linked PDD, registry | 829 (20.9) | 20,669 (35.2) | 7,382 (17.9) |
| Linked PDD, registry + MI | 3,973 (100) | 58,683 (100) | 41,187 (100) |

Calculation of sample sizes without multiple imputation assumes that persons with missing outcomes are excluded from the analysis (complete case analysis).

Values in parentheses are percentages.

MI = multiple imputation; PDD = patient discharge data; registry = trauma registry data.

*The table is restricted to the three sites with all three hospital data sources (ED, PDD, registry) available and high (≥ 80%) match rates.
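The sample-size reductions quoted in the text can be recomputed from the counts in Table 3. The sketch below is only an arithmetic check; the variable and function names are ours and are not part of the original analysis.

```python
# Counts copied from Table 3 (sites D, F, and G).
full_sample = {"D": 3973, "F": 58683, "G": 41187}
linked_ed_pdd_registry = {"D": 3476, "F": 47644, "G": 32904}
linked_pdd_registry = {"D": 829, "F": 20669, "G": 7382}


def percent_reduction(full: int, retained: int) -> float:
    """Percentage of the full sample lost when only `retained` records are kept."""
    return round(100.0 * (full - retained) / full, 1)


# Complete case analysis with ED, PDD, and registry data available for linkage.
reduction_with_ed = {
    site: percent_reduction(full_sample[site], linked_ed_pdd_registry[site])
    for site in full_sample
}

# Complete case analysis when ED data are not available for linkage.
reduction_without_ed = {
    site: percent_reduction(full_sample[site], linked_pdd_registry[site])
    for site in full_sample
}

for site in full_sample:
    print(
        f"Site {site}: {reduction_with_ed[site]}% lost with ED data, "
        f"{reduction_without_ed[site]}% lost without ED data"
    )
```

Site D with all three sources gives the 12.5% lower bound, and sites F and G without ED data give the 64.8% and 82.1% figures cited above.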

Figure 4.

Differences in outcome metrics by availability of data sources for linkage and use of multiple imputation among injured patients transported by EMS.* (A) ISS ≥ 16. (B) AIS ≥ 3. (C) Major nonorthopedic surgical intervention. (D) In-hospital mortality. *The y-axis scale for Figure 4B is expanded to 50%. AIS = Abbreviated Injury Scale; ISS = Injury Severity Score; MI = multiple imputation (method for handling missing values); PDD = patient discharge data; registry = trauma registry data.

Figures 5A and 5B are histograms of ISS and AIS using multiply imputed data from each of the seven sites. The shape of the AIS curves varies between sites. Sites with higher proportions of missing values generally yield higher estimates for AIS ≥ 3. For ISS, there is variability between sites for low to moderate values (ISS ≤ 10), although higher ISS values are more consistent. When dichotomized to ISS ≥ 16 versus ISS < 16, proportions are comparable between sites, despite wide variation in percent missing and sample sizes. Table 4 presents injury outcome metrics for site-specific versus combined-site approaches to multiple imputation. Sites D and E have notable differences in outcome metrics (e.g., in-hospital mortality) between single-site and combined-site values. ISS ≥ 16 estimates for single-site versus combined-site imputation are almost identical, regardless of sample size or percentage of missing values.

Figure 5.

Smoothed-curve histograms of maximum AIS score and ISS using multiply imputed data, by site.* (A) AIS. (B) ISS. *The y-axes in Figures 5A and 5B represent the percentages of injured patients transported by EMS with each grade of injury severity among the fully imputed, complete data set (i.e., including patients with observed outcomes, plus multiply imputed outcomes). AIS = Abbreviated Injury Scale; ISS = Injury Severity Score.

Table 4.

Outcome Metrics Among Injured Patients Transported by EMS Using Site-specific Versus Combined-site Multiple Imputation Models

| | Site D: Single Site | Site D: Combined Sites | Site E: Single Site | Site E: Combined Sites | Site F: Single Site | Site F: Combined Sites |
| --- | --- | --- | --- | --- | --- | --- |
| Number used for MI models* | 4,671 | 87,205 | 18,043 | 87,205 | 64,491 | 87,205 |
| Number of transported patients | 3,973 | 3,973 | 15,401 | 15,401 | 58,683 | 58,683 |
| ISS ≥ 16 | 109 (2.7) | 100 (2.5) | 123 (0.8) | 123 (0.8) | 590 (1.0) | 589 (1.0) |
| Maximum AIS ≥ 3 | 248 (6.2) | 206 (5.2) | 385 (2.5) | 548 (3.6) | 1,771 (3.0) | 1,741 (3.0) |
| Nonorthopedic surgery | 207 (5.2) | 145 (3.7) | 1,074 (7.0) | 688 (4.5) | 1,625 (2.8) | 1,374 (2.3) |
| Orthopedic surgery | 354 (8.9) | 285 (7.2) | 3,287 (21.3) | 2,752 (17.9) | 6,821 (11.6) | 5,487 (9.4) |
| In-hospital mortality | 139 (3.5) | 61 (1.5) | 3,511 (22.8) | 569 (3.7) | 710 (1.2) | 565 (1.0) |

Values in parentheses are percentages.

*Sample sizes used for multiple imputation models include nontransported patients (to represent lower acuity, less injured patients), although outcome values are calculated for transported patients (the primary group of interest).

AIS = Abbreviated Injury Scale; ISS = Injury Severity Score; MI = multiple imputation.

DISCUSSION

In this study, we demonstrated that probabilistic linkage and multiple imputation can be used to build population-based injury databases matched to outcomes from multiple sources of existing electronic data. Among sites with three sources of hospital data and low proportions of missing values for key linkage variables, match rates to EMS records were high (≥80%). Our results also illustrate that multiple imputation can be used to minimize bias and preserve sample size, although imputed results and variance are affected by sample size, types of variables being imputed, proportion of missing values, and single-site versus combined-site approaches to imputation. The findings within and across sites in this study provide important insight and guidance for other regions seeking to replicate these processes and for studies that aim to track a large volume of patients from the out-of-hospital to the hospital setting and beyond. While previous research has demonstrated the ability to link many different datasets for injured patients,23 this study fills an important void by describing the effect of different combinations of linkage variables (and missing data for linkage variables) on match rates, by describing the effects of different missing data strategies on linked databases, and by focusing on linking primary EMS data across multiple phases of care.

Match rates were closely related to the amount of missing data for certain linkage variables. Among the six core linkage variables used in all linkage analyses (date of service, date of birth, home zip code, age, sex, and hospital), date of birth had the greatest importance in determining EMS match rate to hospital records. In sites with high percentages of missing values for date of birth, achieving high match rates was unlikely without the availability of other highly discriminating linkage variables such as first name, last name, and social security number. It is important to note that no single variable is sufficiently discriminating to generate high probability matches on its own; record linkage must be done using multiple variables of adequate discriminatory value relative to the file sizes to yield high probability matches.16 When attempting record linkage without the availability of date of birth, names, or social security number (e.g., site E), match rates will be low and likely skewed toward nonrepresentative patients (e.g., extremes of age, rare zip codes, transports to hospitals with low EMS traffic). Home zip code had a range of values for match weight (reflecting common and less common zip codes served by EMS), although we did not find the same correlation between match rate and zip code as with date of birth. We did find high match rates for EMS records using variables routinely captured in EMS, ED, inpatient, and vital statistics data. These findings illustrate that successful linkage of data across phases of care is critically dependent upon complete data capture for the six core linkage variables (particularly among EMS charts) and having access to variables with adequate discriminatory value (e.g., date of birth).
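The role of discriminatory value can be illustrated with the classic Fellegi-Sunter form of match weights, in which a field that agrees contributes log2(m/u) and a field that disagrees contributes log2((1 − m)/(1 − u)). The sketch below is a generic illustration, not the linkage software or parameters used in this study: the m- and u-probabilities are hypothetical, only four of the six core variables are shown, and treating a missing field as contributing zero weight is an assumption.

```python
import math
from typing import Dict, Optional, Tuple

# Hypothetical (m, u) pairs, chosen only for illustration:
#   m = P(field agrees | the two records are a true match)
#   u = P(field agrees | the two records are not a match)
FIELD_PROBS: Dict[str, Tuple[float, float]] = {
    "date_of_birth": (0.95, 1 / (365 * 80)),  # chance agreement is rare
    "home_zip": (0.90, 1 / 200),
    "hospital": (0.95, 1 / 20),
    "sex": (0.98, 1 / 2),                     # chance agreement is common
}


def agreement_weight(m: float, u: float) -> float:
    """Weight added to the match score when a field agrees."""
    return math.log2(m / u)


def disagreement_weight(m: float, u: float) -> float:
    """Weight (negative) added to the match score when a field disagrees."""
    return math.log2((1 - m) / (1 - u))


def total_weight(comparison: Dict[str, Optional[bool]]) -> float:
    """Sum field weights; None means the field is missing (assumed zero weight)."""
    total = 0.0
    for field, agrees in comparison.items():
        m, u = FIELD_PROBS[field]
        if agrees is None:
            continue
        total += agreement_weight(m, u) if agrees else disagreement_weight(m, u)
    return total


all_agree = {field: True for field in FIELD_PROBS}
dob_missing = dict(all_agree, date_of_birth=None)

for field, (m, u) in FIELD_PROBS.items():
    print(f"{field:14s} agreement weight = {agreement_weight(m, u):5.2f}")
print(f"all fields agree:      {total_weight(all_agree):5.2f}")
print(f"date of birth missing: {total_weight(dob_missing):5.2f}")
```

Under these assumed probabilities, date of birth carries far more weight than sex, so a record pair with a missing date of birth loses most of its evidence for a match, which is consistent with the pattern described above.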

The concept of multiple imputation has been applied to probabilistic linkage for missing “links” to handle situations with poorly discriminating variables and high rates of missingness for important linkage variables, thereby generating more representative samples of matched patients.24 However, integrating multiple imputation into the linkage process generates multiple linked datasets (e.g., 5), each with a slightly different combination of linked records, which must be analyzed using standard multiple imputation methods (i.e., Rubin’s rules for combining estimates5,11). While such a strategy is used to produce a linked-only sample with representative characteristics of the original sample, it also eliminates the option of using multiple imputation to later handle other missing values (e.g., outcome metrics, EMS clinical data) with the accordant bias and sample size consequences. That is, there is not a viable option for using multiple imputation twice. In this study, we demonstrate the strategy of keeping high-probability record matches (without multiple imputation in the linkage process) and employing multiple imputation for subsequently handling missing, clinically relevant EMS and outcome data to preserve the original sample size and to provide outcomes for all EMS patients.
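Rubin's rules for combining estimates across multiply imputed datasets can be sketched as follows. The pooled estimate is the mean of the per-dataset estimates, and the total variance is the mean within-imputation variance plus (1 + 1/m) times the between-imputation variance. The function name and the input values are hypothetical and are not taken from the study data.

```python
import math
from statistics import mean, variance
from typing import Sequence, Tuple


def pool_rubin(
    estimates: Sequence[float], variances: Sequence[float]
) -> Tuple[float, float, float]:
    """Combine m per-imputation estimates and their variances with Rubin's rules.

    Returns (pooled estimate, total variance, pooled standard error).
    """
    m = len(estimates)
    if m < 2 or len(variances) != m:
        raise ValueError("need matching estimates and variances from >= 2 imputations")
    pooled = mean(estimates)        # average of the m point estimates
    within = mean(variances)        # average within-imputation variance
    between = variance(estimates)   # sample variance of the estimates (divisor m - 1)
    total = within + (1 + 1 / m) * between
    return pooled, total, math.sqrt(total)


# Hypothetical proportions (and their variances) from five imputed datasets.
example_estimates = [0.024, 0.026, 0.025, 0.027, 0.023]
example_variances = [4e-6] * 5

pooled, total_var, pooled_se = pool_rubin(example_estimates, example_variances)
print(f"pooled estimate = {pooled:.4f}, standard error = {pooled_se:.5f}")
```

The between-imputation term is what widens confidence intervals when imputed values disagree across datasets, which is one reason variance grows with the proportion of missing values.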

Our findings also illustrate the relative importance of different types of hospital record sources for maximizing the likelihood of outcome matches to EMS records. Trauma registries, PDD, and ED data each provided unique hospital data sources for matching to EMS data files, although ED data provided the largest overall proportion of record linkages. For states contemplating the development and usability of different data systems, these findings strongly support wider adoption of statewide ED databases to comprehensively match prehospital and hospital phases of care. PDD data files provided the next largest volume of record links and were particularly important for sites without statewide ED data available. The proportion of record linkages from trauma registries was more variable, likely due to differences in registry inclusion criteria and missing data for registry linkage variables.

Following completion of record linkage, the amount of missing data and how missing values were handled had a substantial effect on sample size, bias, and variance. Simply omitting records with missing values (complete case analysis) markedly reduced sample size and generated inflated proportions for outcome metrics. The type of variable being imputed (e.g., ordinal vs. categorical) and the range of values also directly affected outcome metrics in the imputed samples. The proportion of patients with ISS ≥ 16 was comparable between sites, regardless of sample size and proportion of missing data, suggesting that for ordinal (or continuous) terms with a broad range of values, multiple imputation is relatively robust to differences in the amount of missing data and sample size. This finding was further demonstrated using within-site comparisons of ISS ≥ 16 of imputed and nonimputed samples with different proportions of missing values. We have shown similar results in simulation studies using motor vehicle crash data.9 Conversely, among ordinal terms with more compressed scales (e.g., AIS) and dichotomous variables (e.g., in-hospital mortality), there was notably more variability in multiply imputed estimates and increased variance. These findings suggest that multiple imputation is particularly sensitive to sample size and the proportion of missing values for categorical and compressed ordinal data. Although there are many factors affecting rules of thumb for the “acceptable” proportion of missing values for multiple imputation (e.g., sample size, depth of information in observed values, mechanism of missingness, types of variables being imputed), our study suggests that effective use of multiple imputation requires more conservative percentages for dichotomous variables (<20% missing values), yet more tolerance with broad ordinal and continuous variables (>50% missing values). Sites with larger sample sizes (e.g., 40,000 or more) may be able to compensate for higher proportions of missing values for dichotomous variables.

Comparison of estimates from single-site versus multisite imputation models demonstrates that sites with modest sample sizes or high proportions of missing values can “borrow” information from other sites to generate less biased estimates. While we did not know the true values, the benefit of combined-site versus single-site multiple imputation models can be inferred by comparing values between the two approaches within sites. For example, in site D (modest sample size), the mortality rate varied between the nonimputed, single-site imputed, and combined-site imputed samples. For the single-site imputed value to be true, the 497 subjects with missing outcomes would have had to have a mortality rate of 17.5%, which is very unlikely. Additional factors such as similar versus dissimilar populations and sites, the overall percentage of missing values across sites, and number and type of variables available for imputation models must also be considered before combining data from multiple sites. However, these results are encouraging for smaller sites and systems, as partnering with similar sites to increase sample size and observed values may help navigate through single-site limitations.
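The site D mortality argument can be retraced with simple arithmetic from Tables 3 and 4. The sketch below assumes that the 497 patients with missing outcomes are the transported patients who did not link to a hospital record. The observed death count is not reported in this section, so the value computed here is back-calculated from the stated 17.5% and should be read as implied rather than reported.

```python
# Site D counts from Tables 3 and 4.
transported = 3973              # full sample of transported patients
linked = 3476                   # linked to ED, PDD, or registry records
missing = transported - linked  # patients with missing outcomes

single_site_deaths = 139        # in-hospital deaths, single-site imputation
combined_site_deaths = 61       # in-hospital deaths, combined-site imputation
stated_missing_rate = 0.175     # mortality among missing patients, from the text

# Deaths the single-site model must have assigned to the missing patients.
implied_missing_deaths = round(stated_missing_rate * missing)

# Back-calculated (not reported) deaths among patients with observed outcomes.
implied_observed_deaths = single_site_deaths - implied_missing_deaths
implied_observed_rate = implied_observed_deaths / linked

# Mortality the combined-site model implies for the same missing patients.
combined_missing_rate = (combined_site_deaths - implied_observed_deaths) / missing

print(f"patients with missing outcomes: {missing}")
print(f"implied observed mortality: {100 * implied_observed_rate:.1f}%")
print(f"implied mortality among missing, single-site: {100 * stated_missing_rate:.1f}%")
print(f"implied mortality among missing, combined-site: {100 * combined_missing_rate:.1f}%")
```

Under these assumptions, the combined-site model assigns the unlinked patients a mortality close to that of patients with observed outcomes, whereas the single-site model assigns them a rate roughly ten times higher.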

While this study focused on the subset of injured patients served by EMS, these data strategies and techniques are readily transferable to all patients served by EMS, as well as many other patient populations. Other applications using similar methods and additional sources of existing databases may include evaluating the effectiveness of regionalized STEMI, stroke, and cardiac arrest care; exploring and measuring different alternatives to health care delivery among broad populations; measuring the effectiveness of new EMS treatment or triage protocols; providing EMS agency-specific feedback (including outcomes) for quality assurance and education purposes; and tracking long-term survival for certain groups of patients.

LIMITATIONS

To allow comparison between sites, we had to assume that the inclusion criteria for the EMS samples were applied consistently across sites. While it is possible that there was some site-to-site variability in application of the EMS inclusion criteria and injury outcomes, the within-site comparisons allowed us to account for such intersite variability in the results. We did not know true matches and nonmatches for any site, as such information would have required access to the original medical records. However, we have previously demonstrated the validity of probabilistic linkage using identical software and approach to linkage analysis, including a low mismatch rate (high specificity) across a variety of linkage scenarios.8 Our calculated match rates represent estimates of sensitivity without the ability to calculate specificity, although we believe specificity remained high based on the variables used for linkage.8,16 In addition, while we were able to directly calculate match rates for transported and nontransported patients, match rates for admitted patients and 1-year mortality required several assumptions and therefore represent estimated values. Modifying the values used for these assumptions affects the estimated match rates.

One of the key assumptions for generating valid estimates using multiple imputation is that the missing data mechanism is “ignorable.”5,11 Ignorable data exist when observed values can be used to effectively explain the mechanism by which data are missing and accurately estimate the range of plausible values (i.e., that persons with missing values are not systematically different from those with observed values).5,11 However, the presence of ignorable data is not directly testable. We believe that this assumption was met in these data based on the depth of information provided in observed values, previous simulation studies demonstrating the validity of multiply imputed values using similar EMS and trauma data,9,10 and by the consistency between imputed and nonimputed values for sites with relatively little missing data (i.e., where imputed values could be compared to known values for the same sample).
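A small deterministic example may help show what an ignorable mechanism implies in practice. In the hypothetical population below (all counts are invented for illustration), the chance that an outcome is missing depends only on an observed field characteristic (acuity stratum), not on the outcome itself. This is an expected-value illustration of the assumption, not an implementation of multiple imputation.

```python
# Hypothetical population, invented for illustration. Within each stratum the
# observed patients have the same mortality as the patients with missing
# outcomes, so missingness depends only on the observed stratum.
strata = {
    "high_acuity": {"n": 1000, "deaths": 100, "n_observed": 900, "deaths_observed": 90},
    "low_acuity": {"n": 9000, "deaths": 90, "n_observed": 3600, "deaths_observed": 36},
}

total_n = sum(s["n"] for s in strata.values())

# True mortality, which would be unknown in practice.
true_rate = sum(s["deaths"] for s in strata.values()) / total_n

# Complete case analysis: drop every patient with a missing outcome.
complete_case_rate = sum(s["deaths_observed"] for s in strata.values()) / sum(
    s["n_observed"] for s in strata.values()
)

# Use the observed stratum to fill in expected outcomes for everyone:
# apply each stratum's observed mortality to all patients in that stratum.
adjusted_rate = (
    sum(s["n"] * s["deaths_observed"] / s["n_observed"] for s in strata.values())
    / total_n
)

print(f"true mortality:          {100 * true_rate:.1f}%")
print(f"complete case estimate:  {100 * complete_case_rate:.1f}%")
print(f"stratum-based estimate:  {100 * adjusted_rate:.1f}%")
```

Because low-acuity patients are more likely to have missing outcomes here, complete case analysis overstates mortality, while an estimate that conditions on the observed stratum recovers the true value. If missingness also depended on the unobserved outcome, no adjustment based on observed values alone would be guaranteed to remove the bias.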

CONCLUSIONS

We demonstrate that existing data files for EMS, ED, patient discharge data, trauma registries, and vital statistics data for injured children and adults can be probabilistically linked with high match rates using commonly available variables, provided that there are minimal missing data for key linkage terms. Attention to minimizing missing values for key linkage variables (e.g., date of birth) among EMS charts, further development of statewide data sources (e.g., ED and vital statistics records), and assuring responsible access to protected health information across data sources will further increase the ability to match large volumes of records across phases of care and provide outcomes for injured patients served by EMS. Handling missing values in such linked databases with multiple imputation yields less biased estimates than complete case analysis, although sample size, variable type, percentage missing, and combined-site versus single-site imputation strategies directly affect the estimates and accompanying variance. These results illustrate the feasibility of, and provide a roadmap for, building broad, population-based injury databases that span prehospital, hospital, and postdischarge phases of care.

Acknowledgments

This project was supported by the Robert Wood Johnson Foundation Physician Faculty Scholars Program; the Oregon Clinical and Translational Research Institute (grant UL1 RR024140); University of California, Davis Clinical and Translational Science Center (grant UL1 RR024146); Stanford Center for Clinical and Translational Education and Research (grant 1UL1 RR025744); University of Utah Center for Clinical and Translational Science (grant UL1-RR025764 and C06-RR11234); and UCSF Clinical and Translational Science Institute (grant UL1 RR024131). All Clinical and Translational Science Awards are from the National Center for Research Resources, a component of the National Institutes of Health (NIH), and NIH Roadmap for Medical Research.

The authors acknowledge and thank all the participating emergency medical services (EMS) agencies, EMS medical directors, trauma registrars, and state offices who supported and helped provide data for this project.

Footnotes

The authors have no relevant financial information or potential conflicts of interest to disclose.

References

  • 1. Kochanek KD, Xu JQ, Murphy SL, Miniño AM, Kung HC. Deaths: preliminary data for 2009. National Vital Statistics Reports. 2011;59:1–51.
  • 2. Centers for Disease Control and Prevention. Injury Prevention & Control: Data & Statistics (WISQARS). Nonfatal Injury Data. Overall All Injury Causes Nonfatal Injuries and Rates per 100,000. Available at: http://www.cdc.gov/injury/wisqars/nonfatal.html. Accessed Jan 9, 2012.
  • 3. Clark DE. Practical introduction to record linkage for injury research. Inj Prev. 2004;10:186–191. doi: 10.1136/ip.2003.004580.
  • 4. Jaro MA. Probabilistic linkage of large public health data files. Stat Med. 1995;14:491–498. doi: 10.1002/sim.4780140510.
  • 5. Rubin DB. Multiple Imputation for Nonresponse in Surveys. New York, NY: John Wiley & Sons, Inc.; 1987.
  • 6. Dean JM, Vernon DD, Cook L, Nechodom P, Reading J, Suruda A. Probabilistic linkage of computerized ambulance and inpatient hospital discharge records: a potential tool for evaluation of emergency medical services. Ann Emerg Med. 2001;37:616–626. doi: 10.1067/mem.2001.115214.
  • 7. Newgard CD, Zive D, Malveau S, Leopold R, Worrall W, Sahni R. Developing a statewide emergency medical services database linked to hospital outcomes: a feasibility study. Prehosp Emerg Care. 2011;15:303–319. doi: 10.3109/10903127.2011.561404.
  • 8. Newgard CD. Validation of probabilistic linkage to match de-identified ambulance records to a state trauma registry. Acad Emerg Med. 2006;13:69–75. doi: 10.1197/j.aem.2005.07.029.
  • 9. Newgard CD, Haukoos JS. Missing data in clinical research--part 2: multiple imputation. Acad Emerg Med. 2007;14:669–678. doi: 10.1197/j.aem.2006.11.038.
  • 10. Newgard CD. The validity of using multiple imputation for missing prehospital data in a state trauma registry. Acad Emerg Med. 2006;13:314–324. doi: 10.1197/j.aem.2005.09.011.
  • 11. Little RJA, Rubin DB. Statistical Analysis with Missing Data. 2nd ed. New York, NY: John Wiley & Sons, Inc.; 2002.
  • 12. Van Der Heijden GJ, Donders ART, Stijnen T, Moons KG. Imputation of missing values is superior to complete case analysis and the missing-indicator method in multivariable diagnostic research: a clinical example. J Clin Epidemiol. 2006;59:1102–1109. doi: 10.1016/j.jclinepi.2006.01.015.
  • 13. Crawford SL, Tennstedt SL, McKinlay JB. A comparison of analytic methods for non-random missingness of outcome data. J Clin Epidemiol. 1995;48:209–219. doi: 10.1016/0895-4356(94)00124-9.
  • 14. Greenland S, Finkle WD. A critical look at methods for handling missing covariates in epidemiologic regression analyses. Am J Epidemiol. 1995;142:1255–1264. doi: 10.1093/oxfordjournals.aje.a117592.
  • 15. One hundred eleventh Congress of the United States of America. Final Feb 19, 2009. Title XIII--Health Information Technology. American Recovery and Reinvestment Act of 2009; pp. 112–165.
  • 16. Cook LJ, Olson LM, Dean JM. Probabilistic record linkage: relationships between file sizes, identifiers, and match weights. Method Inform Med. 2001;40:196–203.
  • 17. Clark DE, Osler TM, Hahn DR. ICDPIC: Stata Module to Provide Methods for Translating International Classification of Diseases (Ninth Revision) Diagnosis Codes into Standard Injury Categories and/or Scores. Boston, MA: Boston College, Department of Economics; 2009.
  • 18. MacKenzie EJ, Steinwachs DM, Shankar BS, Turney SZ. An ICD-9CM to AIS conversion table: development and application. Proc AAAM. 1986;30:135–151.
  • 19. MacKenzie EJ, Steinwachs DM, Shankar B. Classifying trauma severity based on hospital discharge diagnoses. Validation of an ICD-9CM to AIS-85 conversion table. Med Care. 1989;27:412–422. doi: 10.1097/00005650-198904000-00008.
  • 20. Raghunathan T, Lepkowski, Van Hoewyk J, Solenberger P. A multivariate technique for multiply imputing missing values using a sequence of regression models. Survey Methodol. 2001;27:85–95.
  • 21. Royston P, Altman DG, Sauerbrei W. Dichotomizing continuous predictors in multiple regression: a bad idea. Statist Med. 2005;25:127–141. doi: 10.1002/sim.2331.
  • 22. Davidson GH, Hamlat CA, Rivara FP, Koepsell TD, Jurkovich GJ, Arbabi S. Long-term survival of adult trauma patients. JAMA. 2011;305:1001–1007. doi: 10.1001/jama.2011.259.
  • 23. Johnson SW, Walker J. The Crash Outcome Data Evaluation System (CODES). National Highway Traffic and Safety Administration, U.S. Department of Transportation. DOT HS 808 338. Available at: http://www-nrd.nhtsa.dot.gov/Pubs/808-338.pdf. Accessed Jan 9, 2012.
  • 24. McGlincy MH. A Bayesian Record Linkage Methodology for Multiple Imputation of Missing Links. Available at: http://www.amstat.org/sections/srms/Proceedings/y2004/files/Jsm2004-000683.pdf. Accessed Jan 9, 2012.
