Table 1. Characteristics of the Hospital Safety Reports, Patient Population, and Data Sets for Machine Learning Model Development and Validation.
Characteristic<sup>a</sup>, No. (%) | MGH: Data set I annotated (with keywords) | MGH: Data set II (without keywords) | MGH: Data set III (recent reports) | MGH: All MGH reports | BWH: Data set IV (all BWH reports) | Total: All reports |
---|---|---|---|---|---|---|
Years | April 2006-March 2016 | April 2006-March 2016 | March 2016-June 2018 | April 2006-June 2018 | May 2004-January 2019 | |
Patients<sup>b</sup> | 7630 | 63 768 | 27 922 | 97 778 | 75 076 | 172 854 |
All reports<sup>c</sup> | 9107 | 105 904<sup>d</sup> | 46 046 | 174 799 | 124 229 | 299 028 |
Reports of identifiable patients<sup>e</sup> | 9047 | 94 692 | 42 454 | 157 824 | 118 764 | 276 588 |
No. of reports per patient, mean (range)<sup>f</sup> | 1.2 (1-12) | 1.5 (1-54) | 1.5 (1-34) | 1.6 (1-54) | 1.6 (1-40) | 1.6 (1-54) |
No. of words per report, median (IQR) | 74 (43-124) | 51 (30-86) | 63 (35-106) | 57 (33-96) | 37 (17-67) | 48 (25-84) |
Patient demographics | ||||||
Age, median (IQR), y<sup>g</sup> | 58.3 (38.6-71.5) | 59.3 (43.4-71.9) | 60.1 (43.6-71.7) | 59.3 (43.0-71.6) | 60.2 (44.7-71.6) | 59.7 (43.8-71.6) |
Sex | ||||||
Female | 3504 (45.9) | 30 823 (48.3) | 13 594 (48.7) | 47 891 (49.0) | 38 653 (51.5) | 86 544 (50.1) |
Male | 3977 (52.1) | 31 715 (49.7) | 13 859 (49.6) | 48 016 (49.1) | 32 303 (43.0) | 80 319 (46.5) |
Unknown | 149 (2.0) | 1230 (1.9) | 469 (1.7) | 1871 (1.9) | 4120 (5.5) | 5991 (3.5) |
Race | ||||||
White | 5999 (78.6) | 50 043 (78.5) | 21 617 (77.4) | 76 322 (78.1) | 53 736 (71.6) | 130 058 (75.2) |
Black | 415 (5.4) | 3543 (5.6) | 1742 (6.2) | 5481 (5.6) | 6832 (9.1) | 12 313 (7.1) |
Asian | 228 (3.0) | 1956 (3.1) | 1048 (3.8) | 3264 (3.3) | 1877 (2.5) | 5141 (3.0) |
Others | 94 (1.2) | 841 (1.3) | 280 (1.0) | 1213 (1.2) | 613 (0.8) | 1826 (1.1) |
Unknown | 894 (11.7) | 7385 (11.6) | 3235 (11.6) | 11 498 (11.8) | 12 018 (16.0) | 23 516 (13.6) |
Ethnicity | ||||||
Non-Hispanic | 6605 (86.6) | 55 408 (86.9) | 24 079 (86.2) | 84 579 (86.5) | 62 271 (82.9) | 146 850 (85.0) |
Hispanic | 588 (7.7) | 4802 (7.5) | 2298 (8.2) | 7610 (7.8) | 5417 (7.2) | 13 027 (7.5) |
Unknown | 437 (5.7) | 3558 (5.6) | 1545 (5.5) | 5589 (5.7) | 7388 (9.8) | 12 977 (7.5) |
Abbreviations: BWH, Brigham and Women’s Hospital; IQR, interquartile range; MGH, Massachusetts General Hospital.
<sup>a</sup> Summary of the characteristics of patient demographic information and cases.
<sup>b</sup> Patients with a complete and valid medical record number.
<sup>c</sup> Reports including those with and without a valid patient medical record number.
<sup>d</sup> The 3 MGH data sets do not sum to the total number of MGH reports. In the previous study in which data set I was created,<sup>12</sup> exact keyword matching against a gradually curated keyword list was used, so some cases containing morphological or lexical variations of the keywords were missed. To strictly evaluate the model’s ability to identify allergic reactions missed by keyword search, we constructed data set II in this study using a more comprehensive keyword-matching algorithm: we excluded every report that contained any expert-curated keyword or a morphological or lexical variation of one (eg, a prefix [eg, allerg-], a suffix [eg, -cillin], or a letter-case variant such as uppercase, lowercase, or capitalized). As a result, data set I plus data set II contains fewer reports than all MGH reports between April 2006 and March 2016.
<sup>e</sup> Reports linked to a valid patient medical record number.
<sup>f</sup> Calculated using the reports linked to a valid patient medical record number.
<sup>g</sup> Calculated using the event date and patient’s date of birth.
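
The exclusion step described in the footnote on data set II (removing reports that contain a keyword or a prefix, suffix, or letter-case variant of one) can be illustrated with a minimal Python sketch. This is not the study's algorithm: the stem lists, the function names, and the use of regular expressions are assumptions made for illustration, and only the two stems quoted in the footnote (allerg-, -cillin) are used because the expert-curated keyword list is not given here.

```python
import re

# Hypothetical stems taken from the two examples in the footnote;
# the study's full expert-curated keyword list is not reproduced here.
PREFIX_STEMS = ["allerg"]   # intended to match allergy, allergic, Allergies, ...
SUFFIX_STEMS = ["cillin"]   # intended to match penicillin, AMOXICILLIN, ...


def build_pattern(prefixes, suffixes):
    """Compile one case-insensitive pattern covering prefix and suffix stems."""
    # A prefix stem must start a word; any word characters may follow it.
    parts = [rf"\b{re.escape(p)}\w*" for p in prefixes]
    # A suffix stem must end a word; any word characters may precede it.
    parts += [rf"\w*{re.escape(s)}\b" for s in suffixes]
    return re.compile("|".join(parts), re.IGNORECASE)


PATTERN = build_pattern(PREFIX_STEMS, SUFFIX_STEMS)


def contains_keyword(text):
    """Return True if the report text contains any keyword variant."""
    return PATTERN.search(text) is not None


def exclude_keyword_reports(reports):
    """Keep only reports with no keyword variant (a data set II-style filter)."""
    return [r for r in reports if not contains_keyword(r)]
```

Applied to a list of free-text reports, `exclude_keyword_reports` would keep only those with no match, which mirrors how a "without keywords" set could be separated from the remaining reports under these assumptions.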