2020 Nov 16;3(11):e2022836. doi: 10.1001/jamanetworkopen.2020.22836

Table 1. Characteristics of the Hospital Safety Reports, Patient Population, and Data Sets for Machine Learning Model Development and Validation.

Values are No. (%) unless otherwise indicated.

| Characteristic [a] | MGH: Data set I, annotated (with keywords) | MGH: Data set II (without keywords) | MGH: Data set III (recent reports) | MGH: All MGH reports | BWH: Data set IV (all BWH reports) | Total: All reports |
|---|---|---|---|---|---|---|
| Years | April 2006-March 2016 | April 2006-March 2016 | March 2016-June 2018 | April 2006-June 2018 | May 2004-January 2019 | BWH: May 2004-January 2019; MGH: April 2006-June 2018 |
| Patients [b] | 7630 | 63 768 | 27 922 | 97 778 | 75 076 | 172 854 |
| All reports [c] | 9107 | 105 904 [d] | 46 046 | 174 799 | 124 229 | 299 028 |
| Reports of identifiable patients [e] | 9047 | 94 692 | 42 454 | 157 824 | 118 764 | 276 588 |
| No. of reports per patient, mean (range) [f] | 1.2 (1-12) | 1.5 (1-54) | 1.5 (1-34) | 1.6 (1-54) | 1.6 (1-40) | 1.6 (1-54) |
| No. of words per report, median (IQR) | 74 (43-124) | 51 (30-86) | 63 (35-106) | 57 (33-96) | 37 (17-67) | 48 (25-84) |
| Patient demographics | | | | | | |
| Age, median (IQR), y [g] | 58.3 (38.6-71.5) | 59.3 (43.4-71.9) | 60.1 (43.6-71.7) | 59.3 (43.0-71.6) | 60.2 (44.7-71.6) | 59.7 (43.8-71.6) |
| Sex | | | | | | |
| Female | 3504 (45.9) | 30 823 (48.3) | 13 594 (48.7) | 47 891 (49.0) | 38 653 (51.5) | 86 544 (50.1) |
| Male | 3977 (52.1) | 31 715 (49.7) | 13 859 (49.6) | 48 016 (49.1) | 32 303 (43.0) | 80 319 (46.5) |
| Unknown | 149 (2.0) | 1230 (1.9) | 469 (1.7) | 1871 (1.9) | 4120 (5.5) | 5991 (3.5) |
| Race | | | | | | |
| White | 5999 (78.6) | 50 043 (78.5) | 21 617 (77.4) | 76 322 (78.1) | 53 736 (71.6) | 130 058 (75.2) |
| Black | 415 (5.4) | 3543 (5.6) | 1742 (6.2) | 5481 (5.6) | 6832 (9.1) | 12 313 (7.1) |
| Asian | 228 (3.0) | 1956 (3.1) | 1048 (3.8) | 3264 (3.3) | 1877 (2.5) | 5141 (3.0) |
| Others | 94 (1.2) | 841 (1.3) | 280 (1.0) | 1213 (1.2) | 613 (0.8) | 1826 (1.1) |
| Unknown | 894 (11.7) | 7385 (11.6) | 3235 (11.6) | 11 498 (11.8) | 12 018 (16.0) | 23 516 (13.6) |
| Ethnicity | | | | | | |
| Non-Hispanic | 6605 (86.6) | 55 408 (86.9) | 24 079 (86.2) | 84 579 (86.5) | 62 271 (82.9) | 146 850 (85.0) |
| Hispanic | 588 (7.7) | 4802 (7.5) | 2298 (8.2) | 7610 (7.8) | 5417 (7.2) | 13 027 (7.5) |
| Unknown | 437 (5.7) | 3558 (5.6) | 1545 (5.5) | 5589 (5.7) | 7388 (9.8) | 12 977 (7.5) |

Abbreviations: BWH, Brigham and Women’s Hospital; IQR, interquartile range; MGH, Massachusetts General Hospital.

[a] Summary of the characteristics of patient demographics information and cases.

[b] Patients with a complete and valid medical record number.

[c] Reports including those with and without a valid patient medical record number.

[d] The 3 MGH data sets do not sum to the total number of MGH reports. In the previous study in which data set I was created (reference 12), exact keyword matching against a gradually curated keyword list was used, so some cases containing morphological or lexical variations of the keywords were missed. In this study, to strictly evaluate the model’s ability to identify allergic reactions missed by keyword search, we constructed data set II with a more comprehensive keyword-matching algorithm: we excluded every report that contained any expert-curated keyword or a morphological or lexical variation of one (eg, prefixes such as allerg-, suffixes such as -cillin, and differences in letter case). As a result, data set I plus data set II contains fewer reports than all MGH reports between April 2006 and March 2016.

[e] Reports linked to a valid patient medical record number.

[f] Calculated using the reports linked to a valid patient medical record number.

[g] Calculated using the event date and patient’s date of birth.
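
The exclusion step described in the footnote on data set II (dropping any report that contains a curated keyword or a prefix, suffix, or letter-case variation of one) can be illustrated with a minimal Python sketch. This is not the authors' algorithm: the stem lists, function names, and sample reports below are hypothetical, and only the two stems quoted in the footnote (allerg-, -cillin) are used.

```python
import re

# Hypothetical stems for illustration only; the study's expert-curated
# keyword list is not reproduced in this table.
PREFIX_STEMS = ["allerg"]   # intended to catch allergy, allergic, allergies
SUFFIX_STEMS = ["cillin"]   # intended to catch penicillin, amoxicillin

def build_pattern(prefixes, suffixes):
    """Compile one case-insensitive regex covering prefix and suffix stems."""
    parts = [rf"\b{re.escape(p)}\w*" for p in prefixes]
    parts += [rf"\w*{re.escape(s)}\b" for s in suffixes]
    return re.compile("|".join(parts), flags=re.IGNORECASE)

def contains_keyword(text, pattern):
    """Return True if the report text matches any keyword variation."""
    return pattern.search(text) is not None

def exclude_keyword_reports(reports, pattern):
    """Keep only reports with no keyword match (the 'without keywords' set)."""
    return [r for r in reports if not contains_keyword(r, pattern)]

PATTERN = build_pattern(PREFIX_STEMS, SUFFIX_STEMS)

# Made-up example texts, not real safety reports.
sample_reports = [
    "Patient developed ALLERGIC rash after infusion",
    "Given Amoxicillin, hives noted",
    "Patient fell near bed",
]
without_keywords = exclude_keyword_reports(sample_reports, PATTERN)
```

Matching on stems rather than exact words is what separates this from the exact keyword matching used for data set I: a report mentioning "Amoxicillin" is caught even though only the suffix is listed.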