Abstract
Background:
Linking records from the National Spinal Cord Injury Model Systems (SCIMS) database to the National Trauma Data Bank (NTDB) provides a unique opportunity to study early variables in predicting long-term outcomes after traumatic spinal cord injury (SCI). The public use data sets of SCIMS and NTDB are stripped of protected health information, including dates and zip code.
Objectives:
To develop and validate a probabilistic algorithm linking data from an SCIMS center and its affiliated trauma registry.
Method:
Data on SCI admissions 2011–2018 were retrieved from an SCIMS center (n = 302) and trauma registry (n = 723), of which 202 records had the same medical record number. The SCIMS records were divided equally into two data sets for algorithm development and validation, respectively. We used a two-step approach: blocking and weight generation for linking variables (race, insurance, height, and weight).
Results:
In the development set, 257 SCIMS-trauma pairs shared the same sex, age, and injury year across 129 clusters, of which 91 records were true-match. The probabilistic algorithm identified 65 of the 91 true-match records (sensitivity, 71.4%) with a positive predictive value (PPV) of 80.2%. The algorithm was validated over 282 SCIMS-trauma pairs across 127 clusters and had a sensitivity of 73.7% and PPV of 81.1%. Post hoc analysis shows the addition of injury date and zip code improved the specificity from 57.9% to 94.7%.
Conclusion:
We demonstrate the feasibility of probabilistic linkage between SCIMS and trauma records, which needs further refinement and validation. Gaining access to injury date and zip code would improve record linkage significantly.
Keywords: databases, data linkage, rehabilitation, spinal cord injuries, trauma
Established in 1975, the National Spinal Cord Injury Model Systems (SCIMS) database has been enrolling patients with traumatic spinal cord injury (SCI) who received inpatient care at the SCIMS centers within 1 year of injury and also conducting follow-up data collection at postinjury years 1 and 5 and every 5 years thereafter.1 As of March 2020, the database captured about 6% of new SCI cases in the United States and had outcome data up to 45 years post injury for 34,504 people with SCI enrolled by 29 SCIMS centers. The database has been used to address a number of long-term physical and psychosocial outcomes after SCI.1–4 Patient enrollment usually occurs at rehabilitation. Due to constraints with resources, the SCIMS database does not contain emergency medical services (EMS) or trauma care information. As a result, the opportunity for investigators to utilize the database to address how acute trauma care can influence rehabilitation outcomes is limited.
Little research has addressed this knowledge gap in the SCI literature. Nemunaitis and colleagues merged data from trauma and rehabilitation services within an academic center and identified early predictors of functional outcomes after SCI.5 A series of reports published in this issue of Topics in Spinal Cord Injury Rehabilitation also aimed to investigate the data elements from the EMS and acute trauma care in predicting rehabilitation outcomes after SCI.6,7 However, the generalizability of these study findings to the SCI population could be limited by the study size, geographic representation, and length of follow-up.
The National Trauma Data Bank (NTDB) is the world’s largest trauma data repository, including over 7.5 million records voluntarily submitted by more than 900 trauma centers in the United States.8 Since its inception in 1989, the NTDB has continuously evolved and improved data quality, for instance, through the implementation of the National Trauma Data Standard in 2007 and American College of Surgeons Trauma Quality Improvement Program in 2010. Data contained in the NTDB include EMS and trauma services as well as anatomic, physiologic, and procedure variables.9 Data have been widely used for research investigating the effectiveness of trauma system development as well as acute mortality and morbidity from traumatic injuries.10 There have been very few NTDB studies, however, looking at clinical outcomes after SCI or spine trauma,11,12 even though NTDB is well positioned to study rare occurrences, like SCI, due to its large sample size.
Because data collection ceases at discharge from trauma care, the NTDB does not have data on rehabilitation or quality of life. To answer questions regarding early care’s influences on long-term outcomes after SCI, record linkage between NTDB and SCIMS offers a unique opportunity that is not possible in either database alone. Although the NTDB cannot be considered nationally representative of traumatic injuries, virtually all level I/II trauma centers now submit data to the NTDB. As all SCIMS centers are affiliated with one or more level I trauma centers, it is reasonable to assume that most of SCIMS database participants should also have been reported to the NTDB in recent years.
For privacy reasons, many national administrative and research data sets available to the public, including SCIMS and NTDB, are stripped of the protected health information (PHI) under the Health Insurance Portability and Accountability Act (HIPAA),13 which prevents deterministic matching by medical record number, names, date of birth, and other personal identifying information between data sets. As a result, probabilistic matching techniques have been utilized in research to link records from different sources using data elements that are common to the data sets of interest, such as age, sex, race, year, and medical diagnosis. For example, investigators have developed a probabilistic matching algorithm linking records from a Traumatic Brain Injury Model Systems (TBIMS) center to a trauma registry.14,15 This algorithm was further applied to link data from the National TBIMS database to NTDB and identified 3,575 matched records for studying the impact of extracranial injury on mental health after traumatic brain injury (TBI).16
This study was conducted to develop and validate an algorithm for probabilistic record linkage between a single SCIMS center and local trauma registry, based on nonpersonal identifying information common to these two registries. If successful, the algorithm can be further utilized to match data from the National SCIMS database and NTDB for research conducted on a large scale to identify early care factors associated with long-term outcomes after SCI.
Method
Data on admissions between 2011 and 2018 were retrieved from an SCIMS center and a trauma registry within the same institution. The institutional trauma registry has been submitting data to the NTDB. Selection criteria for the trauma records included (a) patients alive at discharge from trauma care and (b) diagnosis of spine trauma based on the Abbreviated Injury Scale body region code. For the SCIMS records, we excluded SCIs as a result of medical and surgical complications in agreement with the NTDB traumatic injury criteria. The true-match records (same persons) between the SCIMS and trauma registry were determined by the medical record number. This study was approved by the local institutional review board.
We randomly divided the SCIMS records equally into two data sets, training and validation, for probabilistic algorithm development and validation, respectively. We conducted chi-square, Fisher’s exact, and Student t tests to compare demographic and injury-related factors between the training and validation sets.
Probabilistic linkage: Development of algorithm
We used the same approach as previous studies14,15,17,18 in developing a probabilistic matching algorithm: blocking and weight generation for linking variables. We then chose the optimal cutoff point of total weight based on validity metrics.
Blocking and clusters of SCIMS-trauma pairs
To improve efficiency of matching, we limited the SCIMS-trauma pairwise comparisons to those pairs having the same values on “blocking” variables.17,18 We selected common variables of high specificity, accuracy, and completeness for “blocking”; they were age, sex, and year of injury. After blocking, each SCIMS record was paired with one or more trauma records that share the same sex, age, and year of injury, which formed a cluster. The size of each cluster was determined by the number of trauma records that agreed on the blocking variables with the index SCIMS record in the cluster.
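The blocking step described above can be illustrated with a short Python sketch. This is a minimal illustration, not the authors' SAS implementation; the record layout (dictionaries with `id`, `age`, `sex`, and `injury_year` fields), the function name `block_pairs`, and the toy records are all assumptions made for the example.

```python
from collections import defaultdict


def block_pairs(scims_records, trauma_records, keys=("age", "sex", "injury_year")):
    """Return {scims_id: [trauma_id, ...]} for records agreeing on every blocking key.

    Field names and record layout are hypothetical, chosen for illustration only.
    """
    # Index trauma records by their blocking-key values for fast lookup.
    index = defaultdict(list)
    for t in trauma_records:
        index[tuple(t[k] for k in keys)].append(t["id"])

    # Each SCIMS record with at least one candidate forms a cluster.
    clusters = {}
    for s in scims_records:
        candidates = index.get(tuple(s[k] for k in keys), [])
        if candidates:
            clusters[s["id"]] = list(candidates)
    return clusters


# Toy data (fabricated for illustration; not study data).
scims = [
    {"id": "S1", "age": 34, "sex": "M", "injury_year": 2015},
    {"id": "S2", "age": 61, "sex": "F", "injury_year": 2012},
]
trauma = [
    {"id": "T1", "age": 34, "sex": "M", "injury_year": 2015},
    {"id": "T2", "age": 34, "sex": "M", "injury_year": 2015},
    {"id": "T3", "age": 61, "sex": "M", "injury_year": 2012},
]
clusters = block_pairs(scims, trauma)
n_pairs = sum(len(v) for v in clusters.values())
```

In this toy example, S1 forms a cluster of two pairs, whereas S2 drops out of the comparison because its only candidate disagrees on sex. The same mechanism explains how a true-match record can be lost at the blocking stage when a blocking variable is recorded differently in the two registries.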
Linking variables and weight generation
We assessed the agreement/disagreement on the linking variables for each SCIMS-trauma pair and calculated the total weight. We chose four variables that existed in both the SCIMS and NTDB and had fewer values missing/unknown as linking variables: race, insurance, height, and body weight.
We used two measures in generating the weight (W) of a linking variable: the quality of the data (m) and the probability of random agreement (u). The m was calculated for each linking variable as the percentage of true-match records that agreed on the variable of interest, which reflects the probability that the two records (SCIMS and trauma) of the same person share the same characteristics. The u was estimated for every category/value of each linking variable based on its frequency in the trauma registry, the larger of the two data sets. In other words, every category/value of the variable (such as race = white) has its own u that indicates the likelihood of the two records (SCIMS and trauma) within a pair having the same status (race = white) by chance.
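As a rough sketch of how m and u could be estimated from data, the Python functions below follow the definitions in this paragraph. The function names, the dictionary-based record layout, and the toy values are assumptions for illustration and do not reproduce the study's SAS code or data.

```python
from collections import Counter


def estimate_m(true_pairs, variable):
    """Proportion of true-match (SCIMS, trauma) pairs that agree on `variable`."""
    agree = sum(1 for scims_rec, trauma_rec in true_pairs
                if scims_rec[variable] == trauma_rec[variable])
    return agree / len(true_pairs)


def estimate_u(trauma_records, variable):
    """Relative frequency of each category of `variable` in the trauma registry."""
    counts = Counter(rec[variable] for rec in trauma_records)
    total = sum(counts.values())
    return {value: n / total for value, n in counts.items()}


# Toy data (fabricated for illustration; not study data).
true_pairs = [
    ({"race": "white"}, {"race": "white"}),
    ({"race": "black"}, {"race": "black"}),
    ({"race": "white"}, {"race": "white"}),
    ({"race": "white"}, {"race": "black"}),  # recorded differently in the two sources
]
trauma_registry = [{"race": r} for r in ["white", "white", "white", "black", "black"]]

m_race = estimate_m(true_pairs, "race")
u_race = estimate_u(trauma_registry, "race")
```

Here three of four true-match pairs agree on race, so m is 0.75, and u is 0.6 for white and 0.4 for black. A variable is most useful for linkage when m is high and the u of the observed category is low.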
We estimated the weight (W) based on the agreement/disagreement on the variable of interest between two records. If the two records (SCIMS and trauma) within a pair share the same value for that variable (e.g., race = white):
$$W = \log_2\left(\frac{m}{u}\right)$$
If the two records disagree on that variable (e.g., race = white in SCIMS, black in trauma):
$$W = \log_2\left(\frac{1 - m}{1 - u}\right)$$
The total weight (Wtotal) for each SCIMS-trauma pair was computed by summing up the weight of each linking variable:
$$W_{\text{total}} = W_{\text{race}} + W_{\text{insurance}} + W_{\text{height}} + W_{\text{weight}}$$
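The weight calculation can be sketched in Python as follows. This sketch assumes the conventional Fellegi-Sunter log-ratio form with a base-2 logarithm; the base and the function names are assumptions, and the choice of which category's u to use when two records disagree is left to the caller because the text does not specify it. The m and u values in the usage lines are those reported for race in the Results (m = 94.9%; u for non-Hispanic white = 59.34%) and for insurance (m = 69.1%; u for private/commercial = 36.51%).

```python
import math


def variable_weight(agree, m, u):
    """Weight for one linking variable (base-2 log is an assumption of this sketch)."""
    if agree:
        return math.log2(m / u)
    return math.log2((1 - m) / (1 - u))


def total_weight(comparisons):
    """Sum the weights over linking variables.

    `comparisons` is an iterable of (agree, m, u) tuples, one per linking variable.
    """
    return sum(variable_weight(agree, m, u) for agree, m, u in comparisons)


# Agreement on race = non-Hispanic white.
w_race_agree = variable_weight(True, 0.949, 0.5934)

# Same pair, but disagreeing on insurance (u taken for private/commercial here).
w_pair = total_weight([(True, 0.949, 0.5934), (False, 0.691, 0.3651)])
```

Under these assumptions, agreement on a common category such as non-Hispanic white adds only about 0.68 to the total, while a single disagreement on insurance is enough to pull the total below zero. Agreement on a rare category (small u) would contribute a much larger positive weight.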
Validity metrics
A higher total weight suggests a greater likelihood that the two records belong to the same person. As the first criterion, the SCIMS-trauma pair with the highest Wtotal within each cluster was considered “linked” (positive). To reduce the false-positive rate, we further required that this highest Wtotal be above a cutoff value (second criterion). The optimal Wtotal cutoff point was determined from the distribution of Wtotal and the validity metrics: sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV).
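The two linkage criteria can be expressed as a small Python sketch. The function name, the nested-dictionary input, and the toy weights are assumptions for illustration. The default cutoff of 0.5, applied as "greater than or equal to", follows the value and notation reported in the Results tables; ties within a cluster are not addressed in the text, and this sketch simply keeps the first pair encountered.

```python
def link_clusters(cluster_weights, cutoff=0.5):
    """Apply the two criteria to every cluster.

    `cluster_weights` maps scims_id -> {trauma_id: total_weight}.
    Returns scims_id -> linked trauma_id, or None when no pair qualifies.
    """
    links = {}
    for scims_id, weights in cluster_weights.items():
        # Criterion 1: the pair with the highest total weight in the cluster.
        best_id = max(weights, key=weights.get)
        # Criterion 2: that highest weight must reach the cutoff.
        links[scims_id] = best_id if weights[best_id] >= cutoff else None
    return links


# Toy weights (fabricated for illustration; not study data).
example = {
    "S1": {"T1": 2.1, "T2": -0.4},
    "S2": {"T3": 0.2},
}
links = link_clusters(example)
```

In this example S1 is linked to T1, while S2 remains unlinked because its only candidate falls below the cutoff. The second criterion is what allows a SCIMS record with no true counterpart in the trauma registry to be left unlinked.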
Probabilistic linkage: Validation of algorithm
We applied the same blocking variables to the validation set and generated the Wtotal for each SCIMS-trauma pair using the m and u identified in the training set. The validity metrics were also computed based on the same criteria as the training set.
Post hoc analysis
To assess any selection bias of the probabilistic algorithm toward certain groups of individuals, we conducted the chi-square, Fisher’s exact, and Student t tests to compare demographic and injury characteristics of those correctly linked SCIMS records (true positives) versus those being falsely excluded (false negatives). To evaluate the utility of PHI in improving the accuracy of probabilistic linkage, we added the date of injury and residential zip code, along with race, insurance, height, and body weight as linking variables in the development and validation of the probabilistic linkage algorithm. The validity metrics were also calculated. Date of injury and zip code are usable data points in a limited data set under HIPAA.19 Both variables are reported to the SCIMS and NTDB, but excluded from the public use data sets.
All data analyses were conducted using SAS 9.4 software.
Results
The final study sample included 302 SCIMS and 723 trauma records, of which 202 had the same medical record number and were considered as true-match (same persons). The SCIMS records were further randomly divided into two data sets: training (n = 151) and validation (n = 151). These two data sets were similar regarding the demographic and injury-related characteristics (Table 1). The number of true-match records was 97 in the training set and 105 in the validation set.
Table 1.
Demographic and injury-related characteristics of SCIMS records in training and validation sets
Characteristics | Training set [mean ± SD or n (%)] | Validation set [mean ± SD or n (%)] | p* |
---|---|---|---|
Sample size | 151 | 151 | |
Blocking variables | |||
Age at injury | 40.2 ± 17.0 | 38.8 ± 16.7 | .47 |
Sex | .78 | ||
Male | 118 (78.1) | 120 (79.5) | |
Female | 33 (21.9) | 31 (20.5) | |
Year of injury | .09 | ||
2011 | 18 (11.9) | 10 (6.6) | |
2012 | 15 (9.9) | 20 (13.2) | |
2013 | 10 (6.6) | 18 (11.9) | |
2014 | 23 (15.2) | 21 (13.9) | |
2015 | 21 (13.9) | 35 (23.2) | |
2016 | 25 (16.6) | 22 (14.6) | |
2017 | 20 (13.3) | 12 (8.0) | |
2018 | 19 (12.6) | 13 (8.6) | |
Linking variables | |||
Race | .41 | ||
Hispanic | 3 (2.0) | 0 (0.0) | |
Non-Hispanic white | 77 (51.0) | 83 (55.0) | |
Non-Hispanic black | 69 (45.7) | 66 (43.7) | |
Other<sup>a</sup> | 2 (1.3) | 2 (1.3) | |
Insurance | .85 | ||
Medicaid | 23 (15.2) | 25 (16.5) | |
Medicare | 15 (10.0) | 17 (11.3) | |
Self-pay | 42 (27.8) | 37 (24.5) | |
Private/Commercial | 56 (37.1) | 62 (41.1) | |
Other government | 0 (0.0) | 0 (0.0) | |
Other unclassified | 2 (1.3) | 2 (1.3) | |
Unknown | 13 (8.6) | 8 (5.3) | |
Height, cm | 176.6 ± 9.9 | 176.9 ± 10.6 | .83 |
Unknown | 2 (1.3) | 4 (2.6) | |
Weight, kg | 84.1 ± 22.6 | 82.5 ± 20.2 | .53 |
Unknown | 5 (3.3) | 7 (4.6) | |
Injury-related characteristics | |||
Etiology of injury | .76 | ||
Vehicular | 82 (54.3) | 79 (52.3) | |
Violence | 28 (18.5) | 29 (19.2) | |
Falls | 32 (21.2) | 31 (20.5) | |
Other | 8 (5.3) | 12 (7.9) | |
Unknown | 1 (0.7) | 0 (0.0) | |
Level of injury | .13 | ||
Tetraplegia | 87 (57.6) | 71 (47.0) | |
Paraplegia | 59 (39.1) | 74 (49.0) | |
Normal | 0 (0.0) | 2 (1.3) | |
Unknown | 5 (3.3) | 4 (2.7) | |
Completeness of injury | .55 | ||
AIS A | 50 (33.1) | 50 (33.1) | |
AIS B | 10 (6.6) | 13 (8.6) | |
AIS C | 27 (17.9) | 20 (13.2) | |
AIS D | 58 (38.4) | 62 (41.1) | |
AIS E | 0 (0.0) | 2 (1.3) | |
Unknown | 6 (4.0) | 4 (2.7) |
Note: AIS = American Spinal Injury Association Impairment Scale; SCIMS = Spinal Cord Injury Model Systems.
<sup>a</sup>Other race includes American Indian, Alaska Native, Asian, Pacific Islander, some other race, and multiracial category.
*Based on the chi-square, Fisher’s exact, and Student t tests, as appropriate.
Algorithm development
The potential pairwise comparisons between 151 SCIMS and 723 trauma records were 109,173 (= 151 × 723) in the training set. By blocking, we reduced the number of SCIMS-trauma pairwise comparisons to 257 across 129 clusters (SCIMS patients), of which 91 pairs were true-match (Table 2). Six (6.2%) of the 97 true-match records were excluded through the blocking process because of the disagreement on age (n = 4) and sex (n = 2) between the SCIMS and trauma records. The number of SCIMS-trauma pairs within each cluster ranged from 1 to 6, with an average of 2.0 (Table 2).
Table 2.
Descriptive statistics of the training and validation sets pre and post blocking
Statistics, n | Training set, before blocking | Training set, after blocking | Validation set, before blocking | Validation set, after blocking |
---|---|---|---|---|
Trauma records | 723 | 220 | 723 | 213 |
SCIMS records | 151 | 129 | 151 | 127 |
True-match records | 97 | 91 | 105 | 99 |
SCIMS-trauma pairs | 109,173 | 257 | 109,173 | 282 |
Mean pairs per cluster (range) | na | 2.0 (1–6) | na | 2.2 (1–6) |
True-match with highest Wtotal (%)<sup>a</sup> | na | 80 (87.9) | na | 87 (87.9) |
True-match with highest Wtotal ≥0.5 (%)<sup>a</sup> | na | 65 (71.4) | na | 73 (73.7) |
Note: SCIMS = Spinal Cord Injury Model Systems; na = not applicable.
<sup>a</sup>Probabilistic linkage with race, insurance, height, and body weight.
The percentage of true-match records that agreed on race (mrace), insurance (minsurance), height (mheight), and body weight (mweight) was 94.9%, 69.1%, 71.1%, and 68.0%, respectively. To optimize the m value, height and body weight were categorized into eight and six groups, respectively (Table 3). The frequency distribution of the four linking variables (u) in the trauma registry is shown in Table 3. Based on the m and u values and the agreement status for each linking variable, we calculated the total weight (Wtotal) for each SCIMS-trauma pair. Using the first criterion, we considered the pair with the highest Wtotal within each cluster as “linked” (positive) and correctly identified 80 of the 91 true-match records (sensitivity, 87.9%; Table 2); 80 of the 129 “linked” pairs were true-match records (PPV, 62.0%).
Table 3.
The frequency of each linking variable (u) based on 723 trauma records
Categories | u value, % |
---|---|
Race/Ethnicity | |
Hispanic | 1.80 |
Non-Hispanic white | 59.34 |
Non-Hispanic black | 37.48 |
Other | 1.38 |
Insurance | |
Medicaid | 11.48 |
Medicare | 16.18 |
Self-pay | 32.78 |
Private/Commercial | 36.51 |
Other government | 0.41 |
Other | 2.63 |
Height, cm | |
0–149 | 0.41 |
150–159 | 3.87 |
160–169 | 16.74 |
170–179 | 30.43 |
180–189 | 34.02 |
190–199 | 5.67 |
200–250 | 0.69 |
Missing/Unknown | 8.16 |
Body weight, kg | |
1–59 | 6.78 |
60–79 | 30.01 |
80–99 | 35.13 |
100–119 | 16.46 |
120–650 | 7.33 |
Missing/Unknown | 4.29 |
The Wtotal frequency distribution among the 257 SCIMS-trauma pairs shows some hint of bimodal characteristics (Figure 1). The first mode reflects lower Wtotal values corresponding to mostly incorrect-match pairs, while the second contains greater Wtotal values representing a large portion of correct-match pairs. On visual inspection, the right tail of the overlapping Wtotal between the highest within cluster (n = 129) versus all other weights (n = 128) falls between 0.0 and 2.5. The validity metrics using different Wtotal cutoff values are presented as a receiver operating characteristic (ROC) curve in Figure 2. The optimal cutoff point was set at 0.5, which correctly identified 65 of the 91 true-match records (sensitivity, 71.4%) and at which 65 of the 81 “linked” pairs were true-match (PPV, 80.2%; Table 4).
Figure 1.
Frequency distribution of total weights (Wtotal) among 257 SCIMS-trauma pairs in the training set: the highest weight within each cluster (black bar) versus all other weights (white bar). Grey bar represents the overlapping of the highest and all other weights. (A) Probabilistic linkage with race, insurance, height, and body weight. (B) Probabilistic linkage with race, insurance, height, body weight, injury date, and residential zip code.
Figure 2.
Accuracy of record linkage in the training set presented by the receiver operating characteristic curve. The dotted line represents the probabilistic linkage with race, insurance, height, and body weight. The solid line represents the probabilistic linkage with race, insurance, height, body weight, injury date, and residential zip code. Each black dot indicates a different cutoff value of total weight (Wtotal).
Table 4.
Probabilistic linkage with race, insurance, height, and body weight: Validity metrics in training and validation sets
Link status | Training set: true-match | Training set: not true-match | Training set: total | Validation set: true-match | Validation set: not true-match | Validation set: total |
---|---|---|---|---|---|---|
Link<sup>a</sup> | 65 | 16 | 81 | 73 | 17 | 90 |
Not link | 26 | 22 | 48 | 26 | 11 | 37 |
Total | 91 | 38 | 129 | 99 | 28 | 127 |
Sensitivity (%) | 71.4 | | | 73.7 | | |
PPV (%) | 80.2 | | | 81.1 | | |
Specificity (%) | 57.9 | | | 39.3 | | |
NPV (%) | 45.8 | | | 29.7 | | |
Note: NPV = negative predictive value; PPV = positive predictive value.
<sup>a</sup>SCIMS-trauma pairs with the highest total weight within each cluster ≥ 0.5.
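The validity metrics in Table 4 follow directly from the cell counts. The short Python sketch below recomputes them from the reported 2 × 2 counts; the function name and argument names are illustrative choices, and the counts are taken from the table (linked and true-match = true positives; linked and not true-match = false positives; not linked and true-match = false negatives; not linked and not true-match = true negatives).

```python
def validity_metrics(tp, fp, fn, tn):
    """Return sensitivity, PPV, specificity, and NPV as percentages."""
    return {
        "sensitivity": 100 * tp / (tp + fn),
        "ppv": 100 * tp / (tp + fp),
        "specificity": 100 * tn / (tn + fp),
        "npv": 100 * tn / (tn + fn),
    }


# Cell counts as reported in Table 4.
training = validity_metrics(tp=65, fp=16, fn=26, tn=22)
validation = validity_metrics(tp=73, fp=17, fn=26, tn=11)

training_rounded = {name: round(value, 1) for name, value in training.items()}
validation_rounded = {name: round(value, 1) for name, value in validation.items()}
```

Rounded to one decimal place, these reproduce the values in the table: 71.4, 80.2, 57.9, and 45.8 for the training set and 73.7, 81.1, 39.3, and 29.7 for the validation set.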
Algorithm validation
By blocking, there were 282 pairwise comparisons across 127 clusters (SCIMS patients) in the validation set, among which 99 were true-match records (Table 2). The average size of the cluster was 2.2 (range, 1–6 pairs per cluster). The frequency distribution of the Wtotal also showed a hint of bimodal distribution. Using a Wtotal cutoff value of 0.5, the sensitivity and PPV were 73.7% and 81.1%, respectively (Table 4).
True positives versus false negatives
There were 65 SCIMS records correctly linked and 32 records falsely excluded through blocking (n = 6) or probabilistic linkage (n = 26) in the training set (Table 2). The corresponding true positives and false negatives in the validation set were 73 and 32 records, respectively. Taking these two data sets together, the correctly linked records (n = 138) had a higher percentage of males (84.1% vs 71.9%, p = .04) and white race (56.5% vs 42.2%, p = .02) but a lower percentage of missing data in insurance (0.7% vs 12.5%), height (0.0% vs 7.8%), and body weight (0.7% vs 9.4%) than those falsely excluded (n = 64). Correct linkage was also associated with more recent years of injury (p = .01) and a greater body weight (p = .05; Table 5).
Table 5.
Demographic and injury-related characteristics of 202 SCIMS records: True positives versus false negatives from probabilistic linkage with race, insurance, height, and body weight
Characteristics | True positives [mean ± SD or n (%)] | False negatives [mean ± SD or n (%)] | p* |
---|---|---|---|
Sample size | 138 | 64 | |
Blocking variables | |||
Age at injury | 39.1 ± 15.6 | 36.0 ± 16.2 | .20 |
Sex | .04 | ||
Male | 116 (84.1) | 46 (71.9) | |
Female | 22 (15.9) | 18 (28.1) | |
Year of injury | .01 | ||
2011 | 4 (2.9) | 12 (18.8) | |
2012 | 19 (13.8) | 7 (10.9) | |
2013 | 12 (8.7) | 8 (12.5) | |
2014 | 26 (18.8) | 8 (12.5) | |
2015 | 27 (19.6) | 13 (20.3) | |
2016 | 20 (14.5) | 10 (15.6) | |
2017 | 17 (12.3) | 4 (6.3) | |
2018 | 13 (9.4) | 2 (3.1) | |
Linking variables | |||
Race | .02 | ||
Hispanic | 2 (1.5) | 1 (1.6) | |
Non-Hispanic white | 78 (56.5) | 27 (42.2) | |
Non-Hispanic black | 58 (42.0) | 33 (51.5) | |
Other<sup>a</sup> | 0 (0.0) | 3 (4.7) | |
Insurance | <.01 | ||
Medicaid | 16 (11.6) | 12 (18.7) | |
Medicare | 13 (9.4) | 6 (9.4) | |
Self-pay | 44 (31.9) | 21 (32.8) | |
Private/Commercial | 62 (44.9) | 16 (25.0) | |
Other government | 0 (0.0) | 0 (0.0) | |
Other unclassified | 2 (1.5) | 1 (1.6) | |
Unknown | 1 (0.7) | 8 (12.5) | |
Height, cm | 177.0 ± 10.5 | 174.8 ± 9.7 | .17 |
Unknown | 0 (0.0) | 5 (7.8) | |
Weight, kg | 84.6 ± 22.1 | 79.0 ± 16.2 | .05 |
Unknown | 1 (0.7) | 6 (9.4) | |
Demographic characteristics | |||
Marital status | .27 | ||
Single | 57 (41.3) | 34 (53.1) | |
Married | 56 (40.6) | 22 (34.4) | |
Other<sup>b</sup> | 25 (18.1) | 8 (12.5) | |
Education | .01 | ||
>High school | 26 (18.9) | 6 (9.4) | |
High school | 78 (56.5) | 31 (48.4) | |
<High school | 33 (23.9) | 22 (34.4) | |
Unknown | 1 (0.7) | 5 (7.8) | |
Employment | .16 | ||
Employed | 90 (65.2) | 31 (48.5) | |
Student/Trainee | 7 (5.1) | 5 (7.8) | |
Unemployed | 31 (22.4) | 21 (32.8) | |
Other<sup>c</sup> | 10 (7.3) | 7 (10.9) | |
Injury-related characteristics | |||
Etiology of injury | .14 | ||
Vehicular | 72 (52.2) | 37 (57.8) | |
Violence | 25 (18.1) | 17 (26.6) | |
Falls | 29 (21.0) | 8 (12.5) | |
Other | 12 (8.7) | 2 (3.1) | |
Level of injury | .96 | ||
Tetraplegia | 72 (52.2) | 33 (51.6) | |
Paraplegia | 61 (44.2) | 28 (43.7) | |
Normal | 1 (0.7) | 1 (1.6) | |
Unknown | 4 (2.9) | 2 (3.1) | |
Completeness of injury | .17 | ||
AIS A | 42 (30.5) | 31 (48.4) | |
AIS B | 14 (10.1) | 6 (9.4) | |
AIS C | 19 (13.8) | 5 (7.8) | |
AIS D | 57 (41.3) | 19 (29.7) | |
AIS E | 1 (0.7) | 1 (1.6) | |
Unknown | 5 (3.6) | 2 (3.1) |
Note: AIS = American Spinal Injury Association Impairment Scale; SCIMS = Spinal Cord Injury Model Systems.
<sup>a</sup>Other race includes American Indian, Alaska Native, Asian, Pacific Islander, some other race, and multiracial category.
<sup>b</sup>Other marital status includes divorced, separated, widowed, and unclassified.
<sup>c</sup>Other employment includes homemaker, retired, and unclassified.
*Based on the chi-square, Fisher’s exact, and Student t tests, as appropriate.
Probabilistic linkage with use of PHI
The percentage of true-match records that agreed on date of injury (mdate) and residential zip code (mzip) was 84.5% and 80.4%, respectively. As illustrated in Figure 1, the frequency distribution of Wtotal was bimodal. Using 0.5 as the cutoff, the sensitivity was 96.7% and the specificity was 94.7% in the training set (Table 6). The ROC curve is shown in Figure 2.
Table 6.
Probabilistic linkage with race, insurance, height, body weight, injury date, and residential zip code: Validity metrics in training and validation sets
Link status | Training set: true-match | Training set: not true-match | Training set: total | Validation set: true-match | Validation set: not true-match | Validation set: total |
---|---|---|---|---|---|---|
Link<sup>a</sup> | 88 | 2 | 90 | 98 | 3 | 101 |
Not link | 3 | 36 | 39 | 1 | 25 | 26 |
Total | 91 | 38 | 129 | 99 | 28 | 127 |
Sensitivity (%) | 96.7 | | | 99.0 | | |
PPV (%) | 97.8 | | | 97.0 | | |
Specificity (%) | 94.7 | | | 89.3 | | |
NPV (%) | 92.3 | | | 96.2 | | |
Note: PPV = positive predictive value; NPV = negative predictive value.
<sup>a</sup>SCIMS-trauma pairs with the highest Wtotal ≥ 0.5.
Discussion
Analyzing data from a single SCIMS center and local trauma registry, this study developed a probabilistic matching algorithm to link records between trauma and rehabilitation. Similar to previous TBIMS research,14,15 we used a two-step approach and applied the same blocking variables (age, sex, and year of injury). Because the SCIMS does not have as many anatomic and physiologic variables as NTDB and TBIMS do, we used four linking variables (race, insurance, height, and body weight), whereas the TBIMS report15 used 12 variables (race, insurance, acute care length of stay, initial Glasgow Coma Scale motor, verbal, eye movement, total, respiratory rate, systolic blood pressure, fracture of base of skull, fracture of calvarium, and cause of injury). Despite fewer linking variables, the sensitivity of our algorithm is similar to TBIMS (71.4% vs 74.1%), whereas the PPV is lower than the TBIMS (80.2% vs 98.2%).
Because of the small cluster size, ranging from one to six pairs (average, two pairs) per cluster, we were not able to create the cluster weight difference (CWD) that was used as a third criterion in previous TBIMS studies.14,15 With an average cluster size of 8.9 pairs, the recent TBI report calculated the CWD as the difference between the highest and the second-highest total weights within each cluster and defined the cutoff point for CWD as the 90th percentile of CWD values among the false matched records.15 If the CWD was smaller than the cutoff value, all matched pairs within the same cluster would be rejected to account for the margin of error in distinguishing which pair is the true-match. Nevertheless, Kumar et al concluded that the third criterion may be too stringent for practical applications. It would be interesting to see the utility of this matching criterion in the SCIMS database in a future study with larger cluster sizes and broader geographic representation.
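For readers unfamiliar with the CWD criterion, the following Python sketch illustrates the idea as described in this paragraph. It is not the TBIMS authors' implementation: the function names, the nearest-rank percentile definition, and the toy values are all assumptions made for the example.

```python
def cluster_weight_difference(weights):
    """Highest minus second-highest total weight; None for single-pair clusters."""
    if len(weights) < 2:
        return None
    top, second = sorted(weights, reverse=True)[:2]
    return top - second


def percentile_nearest_rank(values, pct):
    """Nearest-rank percentile (an assumed definition; `pct` is an integer 1-100)."""
    ordered = sorted(values)
    rank = max(1, (pct * len(ordered) + 99) // 100)  # integer ceiling of pct*n/100
    return ordered[rank - 1]


# Toy CWD values among false matched records (fabricated for illustration).
false_match_cwds = [0.1, 0.3, 0.4, 0.6, 0.8, 1.0, 1.2, 1.5, 1.9, 2.4]
cwd_cutoff = percentile_nearest_rank(false_match_cwds, 90)

# A cluster whose two best pairs are close together is rejected;
# a cluster with a clear winner is kept.
ambiguous = cluster_weight_difference([3.0, 2.5])
clear = cluster_weight_difference([4.0, 1.0])
keep_ambiguous = ambiguous >= cwd_cutoff
keep_clear = clear >= cwd_cutoff
```

With one-pair clusters the CWD is undefined, which is why the criterion could not be applied meaningfully to the small clusters in the present study.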
The specificity (57.9%) of the probabilistic matching algorithm for the training set is not desirable, as a high number of false positive records would adversely impact the validity of studies using the merged data set to address early predictors of rehabilitation outcomes. Specificity was not previously reported in TBIMS research,14,15 perhaps because their rehabilitation patients were all present in the trauma registry; there were no “true negatives.” In the present study, 100 of the 302 SCIMS patients (33.1%) were missing from the local trauma registry for various reasons. For example, some SCIMS rehabilitation patients were referred from other hospitals or other departments (such as neurosurgery). Patients being present in the SCIMS database but not in the trauma registry could also be due to coding errors in SCI diagnosis, incomplete ascertainment of SCI from the trauma registry using the Abbreviated Injury Scale body region code, or incomplete reporting of trauma patients to the local trauma registry. As virtually all level I/II trauma centers submit data to the NTDB and all SCIMS centers are affiliated with level I trauma center(s), we do not anticipate as large a proportion of SCIMS patients to be missing from the NTDB data set as was missing from the local registry.
To improve the sensitivity and specificity of individual data linkage between the SCIMS and NTDB, we should investigate other variables that are highly specific, such as the date of injury and residential zip code, both available in the SCIMS and NTDB. The post hoc analysis shows the specificity increased from 57.9% to 94.7% with the addition of injury date and zip code to the probabilistic linkage process. These two variables are classified as PHI by HIPAA but are acceptable as data points in a limited data set that technically can be shared between entities with data use agreement. The American College of Surgeons (NTDB owner) currently does not share PHI nor allow PHI data to be linked by a third party. In contrast, the SCIMS database, funded by the National Institute on Disability, Independent Living, and Rehabilitation Research, is available for free download (public use data set) and for request with data use agreement (limited data set).20
There are several other non-PHI variables that currently exist in the SCIMS and NTDB that can be considered for further algorithm development and refinement. The acute care length of stay is one such variable. Nevertheless, to agree on the operational definition between the SCIMS and trauma registry, the acute care length of stay would be good only for SCIMS patients admitted to the SCIMS affiliated acute care within 24 hours of injury (Day 1 admissions); otherwise, the SCIMS database would not capture the appropriate information about acute length of stay. Unfortunately, Day 1 admissions only account for one third of total SCIMS patients. There is also an opportunity for adding NTDB variables for SCIMS data collection, which might improve record linkage in the future. The Injury Severity Score would be a good trauma variable for consideration, as it is specific for record linkage and also important in predicting functional outcomes after SCI.5
In addition to the small cluster size mentioned above, another limitation of the present study is that the probabilistic matching algorithm was preliminary, based on data from a single SCIMS site and affiliated level I trauma center. Its application to other SCIMS sites and national databases deserves further validation, as the referral pattern and data quality (such as missing data) could vary by SCIMS and trauma centers. The accuracy of probabilistic linkage is influenced by data completeness and demographic factors, which needs further improvement.
Conclusion
This study demonstrates the feasibility of probabilistic record linkage between the SCIMS database and trauma registry. The probabilistic matching algorithm will need to be refined and validated in a larger study with broad geographic representation before being applied to the national data sets. Sorting out the logistics and regulatory requirements regarding access to NTDB’s date of injury and residential zip code variables would improve the accuracy of record linkage significantly. The combination of two large databases provides an unprecedented opportunity for studying the influences of early variables (EMS and trauma care as well as anatomic, physiologic, and procedures data) on rehabilitation and long-term outcomes after SCI. Given the relatively rare occurrence, yet immense social and economic impact of SCI, the merged data set with long-term follow-up will facilitate research and improvement of trauma and rehabilitation care for patients with SCI.
Footnotes
Financial Support
Dr. Roach reports grants from MetroHealth System during the conduct of the study.
This study was supported by the National Institute on Disability, Independent Living, and Rehabilitation Research (NIDILRR) (grant no. 90DP0083). NIDILRR is a center within the Administration for Community Living (ACL), Department of Health and Human Services (HHS). The contents of this manuscript do not necessarily represent the policy of NIDILRR, ACL, or HHS, and you should not assume endorsement by the Federal Government.
Conflicts of Interest
The authors declare no conflicts of interest.
REFERENCES
- 1. Chen Y, DeVivo MJ, Richards JS, SanAgustin TB. Spinal Cord Injury Model Systems: Review of program and national database from 1970 to 2015. Arch Phys Med Rehabil. 2016;97(10):1797–1804. doi: 10.1016/j.apmr.2016.02.027.
- 2. DeVivo MJ, Jackson AB, Dijkers MP, Becker BE. Current research outcomes from the Model Spinal Cord Injury Care Systems. Arch Phys Med Rehabil. 1999;80:1363–1364. doi: 10.1016/s0003-9993(99)90245-9.
- 3. Lammertse DP, Jackson AB, Sipski ML. Research from the Model Spinal Cord Injury Systems: Findings from the current 5-year grant cycle. Arch Phys Med Rehabil. 2004;85(11):1737–1739. doi: 10.1016/j.apmr.2004.08.002.
- 4. Chen Y, Deutsch A, DeVivo MJ et al. Current research outcomes from the spinal cord injury model systems. Arch Phys Med Rehabil. 2011;92(3):329–331. doi: 10.1016/j.apmr.2010.12.011.
- 5. Nemunaitis G, Roach MJ, Claridge J, Mejia M. Early predictors of functional outcome after trauma. PM R. 2016;8(4):314–320. doi: 10.1016/j.pmrj.2015.08.007.
- 6. Volovetz J, Roach MJ, Stampas A, Nemunaitis G, Kelly ML. Blood alcohol concentration is associated with improved AIS motor score after spinal cord injury. Top Spinal Cord Inj Rehabil. 2020;26(4):261–267. doi: 10.46292/sci20-00014.
- 7. Slocum CS, Zafonte R. Early trauma indicators and rehabilitation outcomes in traumatic spinal cord injury. Top Spinal Cord Inj Rehabil. 2020;26(4):253–260. doi: 10.46292/sci20-00017.
- 8. Hashmi ZG, Kaji AH, Nathens AB. Practical guide to surgical data sets: National Trauma Data Bank (NTDB). JAMA Surg. 2018;153(9):852–853. doi: 10.1001/jamasurg.2018.0483.
- 9. National Trauma Data Standard. Data dictionary 2020 admissions. https://www.facs.org/-/media/files/quality-programs/trauma/ntdb/ntds/data-dictionaries/ntds_data_dictionary_2020.ashx Accessed April 30, 2020.
- 10. Haider AH, Saleem T, Leow JJ et al. Influence of the National Trauma Data Bank on the study of trauma outcomes: Is it time to set research best practices to further enhance its impact? J Am Coll Surg. 2012;214(5):756–768. doi: 10.1016/j.jamcollsurg.2011.12.013.
- 11. Schoenfeld AJ, Belmont PJ Jr, See AA, Bader JO, Bono CM. Patient demographics, insurance status, race, and ethnicity as predictors of morbidity and mortality after spine trauma: A study using the National Trauma Data Bank. Spine J. 2013;13(12):1766–1773. doi: 10.1016/j.spinee.2013.03.024.
- 12. Branco BC, Plurad D, Green DJ et al. Incidence and clinical predictors for tracheostomy after cervical spinal cord injury: A National Trauma Databank review. J Trauma. 2011;70(1):111–115. doi: 10.1097/TA.0b013e3181d9a559.
- 13. United States Department of Health and Human Services. Guidance regarding methods for deidentification of protected health information in accordance with the Health Insurance Portability and Accountability Act (HIPAA) privacy rule. November 26, 2012. https://www.hhs.gov/sites/default/files/ocr/privacy/hipaa/understanding/coveredentities/De-identification/hhs_deid_guidance.pdf Accessed April 30, 2020.
- 14. Kesinger MR, Kumar RG, Ritter AC, Sperry JL, Wagner AK. Probabilistic matching approach to link deidentified data from a trauma registry and a traumatic brain injury model system center. Am J Phys Med Rehabil. 2017;96(1):17–24. doi: 10.1097/PHM.0000000000000513.
- 15. Kumar RG, Wang Z, Kesinger MR et al. Probabilistic matching of deidentified data from a trauma registry and a traumatic brain injury model system center: A follow-up validation study. Am J Phys Med Rehabil. 2018;97(4):236–241. doi: 10.1097/PHM.0000000000000838.
- 16. Kesinger MR, Juengst SB, Bertisch H et al. Acute trauma factor associations with suicidality across the first 5 years after traumatic brain injury. Arch Phys Med Rehabil. 2016;97(8):1301–1308. doi: 10.1016/j.apmr.2016.02.017.
- 17. Mason CA, Tu S. Data linkage using probabilistic decision rules: A primer. Birth Defects Res A Clin Mol Teratol. 2008;82(11):812–821. doi: 10.1002/bdra.20510.
- 18. Sayers A, Ben-Shlomo Y, Blom AW, Steele F. Probabilistic record linkage. Int J Epidemiol. 2016;45(3):954–964. doi: 10.1093/ije/dyv322.
- 19. United States Department of Health and Human Services. Section 164.514 (e) Other requirements relating to uses and disclosures of protected health information: Limited data set. October 1, 2003. https://www.govinfo.gov/content/pkg/CFR-2003-title45-vol1/xml/CFR-2003-title45-vol1-sec164-514.xml Accessed April 30, 2020.
- 20. National Spinal Cord Injury Statistical Center. Using the National Spinal Cord Injury Model Systems Database. 2019. https://www.nscisc.uab.edu/Public_Pages/Database_files/Using_National_SCIMS_Database.pdf Accessed April 30, 2020.