NIHPA Author Manuscript. Available in PMC: 2024 Aug 1.
Published in final edited form as: J Biomed Inform. 2023 May 12;144:104390. doi: 10.1016/j.jbi.2023.104390

Enhancing early autism prediction based on electronic records using clinical narratives

Junya Chen 1, Matthew Engelhard 1, Ricardo Henao 1, Samuel Berchuck 1, Brian Eichner 1, Eliana M Perrin 1, Guillermo Sapiro 1, Geraldine Dawson 1
PMCID: PMC10526711  NIHMSID: NIHMS1907090  PMID: 37182592

Abstract

Recent work has shown that predictive models can be applied to structured electronic health record (EHR) data to stratify autism likelihood from an early age (<1 year). Integrating clinical narratives (or notes) with structured data has been shown to improve prediction performance in other clinical applications, but the added predictive value of this information in early autism prediction has not yet been explored. In this study, we aimed to enhance the performance of early autism prediction by using both structured EHR data and clinical narratives. We built models based on structured data and clinical narratives separately, and then an ensemble model that integrated both sources of data. We assessed these models using data from the Duke University Health System spanning 14 years, evaluating ensemble models that predict later autism diagnosis (by age 4 years) from data collected from ages 30 to 360 days. Our sample included 11,750 children (385 meeting autism diagnostic criteria by age 3 years). The ensemble model for autism prediction showed superior performance: at age 30 days it achieved 46.8% specificity at high (90%) sensitivity (95% confidence interval, CI: 22.0%, 52.9%), 28.0% positive predictive value (PPV) at high (90%) specificity (CI: 20.0%, 33.1%), and an AUC4 (computed with at least 4-year follow-up for controls) of 0.769 (CI: 0.715, 0.811). Prediction by 360 days achieved 44.5% specificity at high (90%) sensitivity (CI: 23.6%, 62.9%), 13.7% PPV at high (90%) sensitivity (CI: 9.6%, 18.9%), and an AUC4 of 0.797 (CI: 0.746, 0.840). Results show that incorporating clinical narratives in early autism prediction achieved promising accuracy by age 30 days, outperforming models based on structured data only. Furthermore, findings suggest that additional features learned from clinical narratives may be hypothesis-generating for understanding early development in autism.

Keywords: Autism, EHR data, unstructured data, ensemble model, language models

INTRODUCTION

Autism spectrum disorder (hereafter referred to as “autism”) is characterized by qualitative differences in social communication and social interaction, as well as restricted and repetitive behaviors [1], [2]. Prevalence estimates for autism in the United States have increased from 0.7% (one in 150) in surveillance years 2000 and 2002 to 2.3% (one in 44) in surveillance year 2018, based on findings from the U.S. Autism and Developmental Disabilities Monitoring (ADDM) Network [3]. The rising prevalence of autism increases the demand for early prediction and intervention to improve children’s outcomes. Even though autism prevalence is rising and reliable diagnosis is possible by age 2 years, the median age of diagnosis is after the fourth birthday [3], more than 2 years later than the age at which the American Academy of Pediatrics recommends universal autism screening. This disparity is even worse among medically underserved ethnic and racial minorities. For example, in one study, White children received an autism diagnosis at a mean age of 6.3 years, compared with 7.9 years for Black children [4].

Routine screening for autism often relies on the Modified Checklist for Autism in Toddlers (M-CHAT), which begins with a screening questionnaire administered by a pediatrician; more in-depth follow-up is recommended for children who score in the lower positive range. Although useful [5], mounting evidence suggests that the M-CHAT-R/F has significant limitations [6]–[10]. It has lower accuracy in primary care settings [8] and with caregivers from Black and Hispanic/Latino backgrounds, caregivers with lower education, and girls [8], [11]–[13]. Compared to White parents, Black parents report fewer autism concerns [11]. A substantial underuse of screening in primary care stems from difficulty interpreting caregivers’ reports and from the requirement to conduct a follow-up interview to increase screening accuracy [14]. Even when near-universal screening is achieved, a study of over 25,000 children found that the M-CHAT-R/F’s sensitivity was 38.8% and its positive predictive value was 14.6%, with lower values for girls, children of color, and children from lower-income households [8].

The M-CHAT relies heavily on parent-reported behaviors and thus cannot assess all aspects of a child’s development, including those that require observation or testing by a qualified professional. Recognizing this limitation, considerable research effort has been devoted to alternative, objective early screening methods. The predictive value of a number of different data sources has been explored, including EEG signals [15], eye movement data [16]–[18], behavioral data quantifying social interaction, restricted interests, and repetitive behaviors [19], fMRI and MRI data [20], [21], and genetic testing [22]. While these approaches have delivered substantial benefits, a significant obstacle is that they typically require specialized biomarker collection or rely on expert knowledge. In contrast, passive monitoring of electronic health record (EHR) data is a promising alternative approach to early prediction. In our previous work [23], we found that EHR-based autism prediction achieved promising accuracy by age 30 days, improving by 1 year of age. Model-based autism prediction at age 30 days achieved 45.5% sensitivity and 23.0% positive predictive value (PPV) at high (90%) specificity, which compares favorably to the performance of the M-CHAT at a much later age. Importantly, these results were obtained using information found in structured EHR fields only.

However, structured data account for only approximately 20% of healthcare data [24]. The remaining 80% is unstructured, including clinical narratives or notes, namely, EHR free-form text fields, discharge summaries, progress notes, etc. Recent evidence has demonstrated that integrating structured and unstructured EHR data improves prediction performance in multiple medical applications [25], [26]. We hypothesized that unstructured data provide complementary information to structured data (and vice versa), and thus that models integrating heterogeneous EHR data types could detect autism with stronger predictive power. Additionally, we hypothesized that using an explainable span model [27], [28], which highlights the specific text passages having the greatest impact on model predictions, would allow us to identify predictive phrases and passages, which may in turn help encourage trust in the model and reveal clinical descriptions found early in life that are associated with later autism diagnosis.

In this study, we trained and evaluated our EHR-based early autism prediction models on data from the Duke University Health System (DUHS), a large academic medical center located in and around Durham, North Carolina. In recognition of the limitations discussed above, we present an ensemble learning scheme that brings together natural language processing and machine learning methods to integrate unstructured and structured data, thereby fully leveraging the predictive value of information contained in the EHR. We aimed to (i) improve the performance of early autism prediction by incorporating clinical notes (and quantify the resulting improvement); (ii) use explainable models to highlight early descriptive findings that predict diagnosis; and (iii) present a comprehensive discussion to help interpret results and inspire further research.

METHODS

Cohort identification and data extraction.

This research utilized data from the Duke University Health System (DUHS) from 2013 to 2022. These records were extracted from the DUHS EHR, which is based on the platform developed by Epic (Verona, Wisconsin); records were also sourced from several EHR platforms operating prior to 2013. The research was conducted at Duke University from 8/1/2021 to 4/1/2022, and all procedures were approved by the Duke Health Institutional Review Board. All analyses were completed within the Duke Protected Analytics Computing Environment (PACE), an isolated virtual network that only authorized users can access. Participant consent was waived due to the minimal risk posed by the study procedures and the infeasibility of obtaining consent from a large retrospective cohort.

The inclusion criteria for sample selection were as follows: (a) date of birth between 3/1/2013 and 12/1/2019; (b) at least 1 recorded encounter within the DUHS before age 30 days; (c) at least 1 clinical note within the DUHS before age 30 days; and (d) at least 2 total recorded encounters within the DUHS before age 1 year.

Diagnosis identification

Our methods for identifying diagnoses and selecting controls have been previously described [23]. Briefly, billing codes 299.00, 299.01, 299.80, 299.81, and 299.90 from the ninth revision, and F84.0, F84.5, F84.8, and F84.9 from the tenth revision of the International Classification of Diseases, Clinical Modification (ICD-9-CM and ICD-10-CM), were used to identify autism spectrum disorder (“autism”) (see sTable 1 for more details). Our computable phenotype required [23]: (a) ≥2 codes associated with encounters on distinct dates, or (b) ≥1 code associated with an encounter at a DUHS clinic specializing in neurodevelopmental conditions. These requirements follow prior work [8], [29], which required either (a) at least two autism-related codes, or (b) one code from a specialist, based on the finding that this improves the accuracy of ICD-based autism identification. This approach was validated at DUHS specifically in our previous work [23].
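For concreteness, this phenotype rule can be expressed as a short predicate. The sketch below is illustrative only, not the authors' code; the function name, the encounter tuple layout, and the specialty-clinic set are hypothetical.

```python
# ICD-9-CM and ICD-10-CM codes used to identify autism (see sTable 1).
AUTISM_CODES = {"299.00", "299.01", "299.80", "299.81", "299.90",
                "F84.0", "F84.5", "F84.8", "F84.9"}

def meets_autism_phenotype(encounters, specialty_clinics):
    """encounters: iterable of (encounter_date, icd_code, clinic_name) tuples;
    specialty_clinics: set of clinics specializing in neurodevelopmental
    conditions. Returns True if the computable phenotype is met."""
    autism_encounters = [(d, clinic) for d, code, clinic in encounters
                         if code in AUTISM_CODES]
    distinct_dates = {d for d, _ in autism_encounters}           # criterion (a)
    from_specialist = any(clinic in specialty_clinics            # criterion (b)
                          for _, clinic in autism_encounters)
    return len(distinct_dates) >= 2 or from_specialist
```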

Patients meeting the criteria for any of the conditions above by age 3 years were included in the analysis. Additionally, a population of controls (a) not meeting the criteria for any of these conditions and (b) not later diagnosed with neurodevelopmental conditions, including autism, attention-deficit/hyperactivity disorder (ADHD), and intellectual disability, was selected for inclusion (see sTable 1 for details). Since the prevalence of each condition increases with age, our selection of controls was stratified by year of birth to avoid possible bias. Specifically, we selected as many controls as possible while maintaining the same ratio of autism cases to controls across all birth years. This procedure was also designed to yield a sample-specific autism prevalence comparable to the overall prevalence of autism within the DUHS.

Data preprocessing

The dataset included both structured EHR data (documented diagnosis codes, procedure codes, laboratory measurements, medications1, vital signs, and encounter details) and unstructured EHR data from clinical notes, including note text, note titles, and specialty.

For the structured EHR data, we first clipped the continuous covariates at the 5th and 95th percentiles to remove outliers, then applied a min-max scaler to transform all features in the training set to the range [0, 1]. We then applied principal component analysis (PCA) over these variables and selected the first 100 components, which explain 87% of the variance in the original data; this alleviates multicollinearity in the dataset. In the unstructured EHR data, each case has several clinical notes of different note types, reflecting different clinical aspects, and the notes are ordered chronologically by timestamp. Models were constructed and trained at the individual level, whereby the clinical notes for each individual were combined into a single sample. Notably, a prerequisite for sample selection was the inclusion of at least one clinical note within the Duke University Health System (DUHS) before the age of 30 days, so that sample sizes were equivalent across different age groups. One challenge with the way unstructured EHR data are entered is that providers commonly use shortcuts and create templates when filling out forms, which propagates more patient data than is relevant (note bloat), as well as outdated information, throughout health records. To decrease redundancy in the raw unstructured data, duplicated notes at the individual level were deleted. We then discarded irrelevant notes (the templates are listed in sTable 2) following the guidance of two DUHS providers on the research team, Drs. B. Eichner and G. Dawson. Finally, the clinical notes underwent standard language preprocessing steps, including removal of punctuation and special characters, removal of stop words, and lowercasing, before being fed into the models.
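A minimal sketch of this pipeline is shown below, assuming `X_train` and `X_test` are NumPy arrays of continuous covariates; it is illustrative, not the authors' code.

```python
import re
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

# Structured data: clip outliers at the 5th/95th percentiles (learned on the
# training set), scale to [0, 1], then reduce to 100 principal components.
lo, hi = np.percentile(X_train, [5, 95], axis=0)
scaler = MinMaxScaler().fit(np.clip(X_train, lo, hi))
pca = PCA(n_components=100).fit(scaler.transform(np.clip(X_train, lo, hi)))

X_train_pca = pca.transform(scaler.transform(np.clip(X_train, lo, hi)))
X_test_pca = pca.transform(scaler.transform(np.clip(X_test, lo, hi)))

# Unstructured data: lowercase, strip punctuation and special characters,
# and remove stop words before modeling.
def preprocess_note(note: str) -> str:
    note = re.sub(r"[^a-z0-9\s]", " ", note.lower())
    return " ".join(w for w in note.split() if w not in ENGLISH_STOP_WORDS)
```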

Model development and evaluation

Data were divided at random into a training set (70%) used to develop models, a validation set (15%) used for model selection and hyperparameter tuning, and a test set (15%) used exclusively to evaluate the performance of the final model.

Three types of models were explored independently during model development on structured EHR data: (a) logistic regression; (b) L2-regularized Cox proportional hazards (Cox-PH) models [30]; and (c) random survival forest models [31]. Four types of models were explored independently during model development on unstructured EHR data: (a) TF-IDF with a Naïve Bayes model (TF-IDF); (b) a convolutional neural network (CNN; details in sTable 3); (c) a span prediction model (Span) [28] (details in sTable 3); and (d) a BioBERT model pretrained on biomedical literature, with the same architecture as in [32].
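The base models can be instantiated as follows; hyperparameter values shown are placeholders rather than the tuned values reported in sTable 4, and the deep models (CNN, Span, BioBERT) are omitted here.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sksurv.linear_model import CoxPHSurvivalAnalysis
from sksurv.ensemble import RandomSurvivalForest

# Structured-data base models (placeholder hyperparameters).
logistic = LogisticRegression(max_iter=1000)
coxph_l2 = CoxPHSurvivalAnalysis(alpha=1.0)          # alpha is the L2 penalty
rsf = RandomSurvivalForest(n_estimators=200, random_state=0)

# Unstructured-data baseline: TF-IDF features with a Naive Bayes classifier.
tfidf_nb = make_pipeline(TfidfVectorizer(), MultinomialNB())
```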

We oversampled autism cases with replacement to obtain a more balanced class distribution in the training set. Because this step increases the likelihood of overfitting, we also set the sample weights in the loss function to the inverse of the number of samples in each class [33]. Hyperparameters were tuned on the validation set during model development. Other details about model development and hyperparameters are included in sTable 4.
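A sketch of this balancing step, assuming `y_train` holds 0/1 labels and `X_train_pca` the preprocessed features from above:

```python
import numpy as np
from imblearn.over_sampling import RandomOverSampler

# Oversample autism cases with replacement on the training set, then weight
# each sample in the loss by the inverse of its class count so that
# duplicated minority samples do not dominate training.
X_res, y_res = RandomOverSampler(random_state=0).fit_resample(X_train_pca, y_train)
class_counts = np.bincount(y_res)
sample_weights = 1.0 / class_counts[y_res]
```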

After the base models were trained, their predictions on the validation and test sets were collected; a min-max scaler was fit on the training data and applied to the validation and test sets. A deep survival model with two linear layers (hidden size = 128) was then trained on the validation-set predictions as the overall ensemble model, and results were reported on the test set. The best models trained on unstructured and structured EHR data are denoted the unstructured model and the structured model, respectively, and the final ensemble over all base models is denoted the ensemble model.
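The stacking step might look like the sketch below, where `base_val` and `base_test` are assumed arrays of shape (n_samples, n_base_models) holding the stacked base-model scores. The paper trains a deep survival objective; a sigmoid/cross-entropy head is shown here for brevity, and the scaler is fit on the validation predictions for self-containedness.

```python
import tensorflow as tf
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler().fit(base_val)                # scale base-model scores
val_in, test_in = scaler.transform(base_val), scaler.transform(base_test)

# Two-layer ensemble network (hidden size 128) trained on validation scores.
ensemble = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
ensemble.compile(optimizer="adam", loss="binary_crossentropy")
ensemble.fit(val_in, y_val, epochs=50, batch_size=64, verbose=0)
test_scores = ensemble.predict(test_in)              # reported on the test set
```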

Models were defined and trained in Python (v3.7) using scikit-learn for logistic regression and TF-IDF; scikit-survival for the Cox-PH and random survival forest models; deep-survival for the ensemble model; imblearn for oversampling; and TensorFlow for the CNN, Span, and BioBERT models. Among all models, the model with the highest average concordance index (see Performance measures) on the validation set was selected as the final model, then applied to the test set and evaluated.

Performance measures

Similar to our previous research [23], we consider the area under the receiver operating characteristic (ROC) curve (AUC) [34], average positive predictive value (AP)2 [35], and concordance index [36] to quantitatively assess performance. Details of the measures are listed in sTable 5.

Patients classified as negative early in the study may develop the condition later with longer follow-up, and their predictor values may change from baseline during follow-up. Thus, we report the AUC and AP metrics as functions of time, where AUCt (APt) is defined as the AUC (AP) when limiting negative cases (controls) to patients followed for at least t years. These metrics were chosen as our primary evaluation measures due to their higher clinical relevance and interpretability.
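An illustrative helper (not the authors' code) for these time-thresholded metrics:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

def time_thresholded_metrics(y_true, scores, followup_years, t=4):
    """Compute AUC_t and AP_t: all cases are kept, but a control counts
    only if it was followed for at least t years."""
    y = np.asarray(y_true)
    keep = (y == 1) | (np.asarray(followup_years) >= t)
    s = np.asarray(scores)[keep]
    return roc_auc_score(y[keep], s), average_precision_score(y[keep], s)
```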

We performed a sensitivity analysis on the follow-up threshold t for these measures, computing AUCt and APt from the corresponding receiver operating characteristic and precision-recall curves. Of all thresholds, we examined t = 4 years in detail, as this cutoff reflects a good compromise between (a) ensuring children have had an opportunity to be diagnosed and (b) maintaining a large evaluation sample covering a wide range of age groups. For this 4-year follow-up threshold, high-sensitivity and high-specificity operating points were selected to achieve 90% sensitivity and 90% specificity, respectively. All performance measures were evaluated on the full test set as well as in subgroups defined by demographic variables, namely sex, race, and ethnicity.

We also compared the performance of models incorporating all note types with models incorporating discharge summaries only. Discharge summaries comprise only 1% of all notes but typically provide a comprehensive overview of a patient’s hospital stay, including key information such as admission and discharge diagnoses, a medication list, and any procedures performed. We trained two models with identical architecture, one on all notes and one on discharge summaries only, to pinpoint the differences in the information these notes provide.

Other statistical analyses

Group differences in demographic variables (sex, racial group, and ethnic group) between autism cases and controls were assessed by cross-tabulating autism diagnosis status with each distinct value of the variable, then applying a chi-square test to the resulting contingency table. To quantify the expected variability of model performance in subgroups of cases and controls, we obtained interquartile intervals, i.e., 25%–75% percentiles, around the mean difference by performing 1000 bootstrap resamples [37].
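A sketch of both analyses, assuming a DataFrame `df` with a binary "autism" column plus one column per demographic variable, and NumPy arrays `perf_a` and `perf_b` of per-resample performance values for two subgroups (all hypothetical names):

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

# Chi-square test on the autism-by-sex contingency table.
contingency = pd.crosstab(df["autism"], df["sex"])
chi2, p_value, dof, expected = chi2_contingency(contingency)

# Bootstrap interquartile interval (25th-75th percentiles) around the mean
# performance difference between two subgroups, 1000 resamples.
rng = np.random.default_rng(0)
diffs = [perf_a[rng.integers(0, len(perf_a), len(perf_a))].mean()
         - perf_b[rng.integers(0, len(perf_b), len(perf_b))].mean()
         for _ in range(1000)]
iqi = np.percentile(diffs, [25, 75])
```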

Considering that the note type, rather than its content, may itself be informative, we also extracted one-hot metadata from note types, note titles, and specialties. Specifically, for each patient, the corresponding categorical representation indicates whether the patient has at least one note of a given note type, note title, or specialty (see Table 2). A deep survival model with two linear layers (hidden size = 128) was trained on these metadata. The hyperparameters of the deep survival model [38] were tuned on the validation set, and the model performance is reported on the test set.

Table 2:

Note demographics compared between autism cases and controls.

Variable Value Autism Control
Total N 385 11,365
Type Physician 385 (100.0%) 11350 (99.9%)
Registered Nurse 385 (100.0%) 11297 (99.4%)
Medical Assistant 295 (76.6%) 5500 (48.4%)
Audiologist 226 (58.7%) 1879 (16.5%)
Resident 360 (93.5%) 8977 (79.0%)
Physician Assistant 236 (61.3%) 5345 (47.0%)
Case Manager 303 (78.7%) 7074 (62.2%)
Licensed Clinical Social Worker 205 (53.2%) 3162 (27.8%)
Nurse Practitioner 324 (84.2%) 7804 (68.7%)
Speech and Language Pathologist 232 (60.3%) 697 (6.1%)
Psychologist 239 (62.1%) 151 (1.3%)
Title Pediatrics 383 (99.5%) 11187 (98.4%)
Obstetrics 326 (84.7%) 9960 (87.6%)
Family Medicine 256 (63.6%) 5679 (50.0%)
Licensed Clinical Social Worker 210 (54.5%) 3052 (26.9%)
Emergency Medicine 215 (55.8%) 4570 (40.2%)
Internal Medicine 243 (63.1%) 5293 (46.5%)
Speech Pathology 232 (60.2%) 695 (6.1%)
Psychology 239 (62.1%) 153 (1.3%)
Specialty Pediatrics 318 (82.6%) 8481 (74.6%)
Emergency Medicine 239 (62.1%) 5078 (44.7%)
Urgent Care 206 (53.5%) 4810 (42.3%)
Pediatric Psychiatry 237 (61.6%) 111 (1.0%)

To better understand the gains from the unstructured model beyond the heuristic that more data provide more information, the top 50 words with the highest odds ratios in the TF-IDF model were listed and aggregated into clinically cohesive groups to quantify each group’s total contribution to predictions across time. These keywords were divided into six groups under the guidance of Dr. B. Eichner: prematurity and in-utero growth-related, perinatal factors, sex of baby, maternal factors, infant infectious complications, and infant respiratory, with each group labeled in a specific color. Similarly, text passages highlighted by the span model were ranked by their effect on model-predicted log-odds, and the top 20 positive and top 20 negative scores were highlighted.
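A sketch of ranking vocabulary terms by their Naive Bayes log-odds, assuming `tfidf_nb` is the fitted TF-IDF + MultinomialNB pipeline from earlier and class 1 denotes autism:

```python
import numpy as np

# log P(word | autism) - log P(word | control) for each vocabulary term.
vec = tfidf_nb.named_steps["tfidfvectorizer"]
nb = tfidf_nb.named_steps["multinomialnb"]

log_odds = nb.feature_log_prob_[1] - nb.feature_log_prob_[0]
order = np.argsort(log_odds)[::-1][:50]              # top 50 predictive words
vocab = np.asarray(vec.get_feature_names_out())
top_words, odds_ratios = vocab[order], np.exp(log_odds[order])
```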

RESULTS

Description of cohort

There were 24,650 patients who met the eligibility criteria, of whom 466 (1.9%) satisfied our autism computable phenotype (see Diagnosis Identification). Of these, 385 (82.6%) had clinical notes within the DUHS before the age of 30 days. A total of 16,378 patients had no diagnosis codes related to autism, ADHD, or other neurodevelopmental conditions and were therefore eligible for selection as controls. Of these, 11,365 (69.4%) had clinical notes within the DUHS before the age of 30 days. In total, 385 cases and 11,365 controls (11,750 patients) were included in the study.

A total of 443,887 clinical notes from the 385 autism cases and 11,365 controls met all inclusion criteria. Cohort demographics are compared between autism cases and controls in Table 1. Among the autism cases, there were 303 males and 82 females (78.7% male). Note characteristics (provider type, provider title, and specialty) are compared between autism cases and controls in Table 2. For each individual, the median number of notes of each note type (where the median is > 0) is compared in sFigure 1; progress notes make up 61.9% of all notes, and the largest difference lies in the consult type (median of 9 consult notes in cases vs. 2 in controls). The 25th percentile, median, and 75th percentile of the number of distinct provider names per individual were 30, 47, and 69 among the autism cases and 14, 22, and 36 among controls.

Table 1:

Cohort demographics compared between autism cases and controls.

Variable Value Autism Control p-value
Total N 385 11,365
Sex Female 82 (21.3%) 5863 (51.6%) <0.001
Male 303 (78.7%) 5502 (48.4%)
Race American Indian or Alaskan Native 6 (1.6%) 60 (0.5%) <0.001
Asian 16 (4.2%) 542 (4.8%) 0.568
Black 133 (34.5%) 3598 (31.7%) 0.252
Multiracial 14 (3.6%) 682 (6.0%) 0.051
Native Hawaiian or Pacific Islander 1 (0.3%) 24 (0.2%) 0.842
Unknown Race 74 (19.2%) 1801 (15.8%) 0.081
White 142 (36.9%) 4645 (40.9%) 0.104
Ethnicity Hispanic 68 (17.7%) 1635 (14.4%) 0.078
Not Hispanic 305 (79.2%) 9148 (80.5%) 0.444
Unknown Ethnicity 12 (3.1%) 582 (5.1%) 0.075

Prediction performance over time

Model performance was evaluated on the 1,762 of the 11,750 patients who were randomly assigned to the test set. The BioBERT model was not effective in this application, with prediction performance only slightly better than random guessing, so it was removed from our model set. As shown in Figure 1, Table 3, and Table 4, AUC4 of the unstructured model ranged from 0.732 at 60 days to 0.785 at 360 days, and AUC4 of the structured model ranged from 0.716 at 90 days to 0.770 at 360 days. AP4 of the unstructured model ranged from 0.320 at 60 days to 0.379 at 360 days, and AP4 of the structured model ranged from 0.221 at 60 days to 0.316 at 270 days. In contrast, AUC4 of the ensemble model ranged from 0.751 at 60 days to 0.797 at 360 days, and AP4 increased from 0.336 at 60 days to 0.399 at 360 days (see Table 5, Table 6). The ensemble model had the highest AUC4, AP4, and concordance index compared with the structured and unstructured models at all ages (see sFigure 2), with these metrics significantly higher than those of the structured and unstructured models (p < 0.05). When including controls with other neurodevelopmental conditions, AUC4 ranged from 0.650 at 30 days to 0.669 at 270 days, AP4 ranged from 0.148 at 60 days to 0.221 at 360 days, and the concordance index ranged from 0.652 at 30 days to 0.673 at 360 days.

Figure 1: Autism prediction performance on EHR structured and unstructured data by age at the time of prediction.

Figure 1:

Prediction performance is shown for structured models based on structured EHR data, unstructured models based on unstructured EHR data, and ensemble models based on all data. Data were collected from birth through 30, 60, 90, 180, 270, and 360 days of age. Cases and controls were defined according to the inclusion criteria. The panels summarize performance by age via the area under the receiver operating characteristic curve (left), average positive predictive value (middle), and concordance index (right), limiting negative cases (controls) to patients followed for at least 4 years.

Table 3:

Performance measures over time (EHR-Unstructured data).

Data Collection Threshold
Performance Measure 30 days 60 days 90 days 180 days 270 days 360 days
Concordance Index 0.718 0.693 0.711 0.709 0.733 0.743
95% confidence interval for Concordance Index (0.679, 0.783) (0.658, 0.771) (0.682, 0.795) (0.699, 0.772) (0.700, 0.807) (0.710, 0.813)
AUC4 0.767 0.732 0.749 0.755 0.770 0.785
95% confidence interval for AUC4 (0.717, 0.783) (0.676, 0.777) (0.697, 0.797) (0.699, 0.801) (0.719, 0.818) (0.732, 0.833)
AP4 0.322 0.320 0.327 0.346 0.377 0.379
95% confidence interval for AP4 (0.248, 0.408) (0.240, 0.415) (0.249, 0.421) (0.261, 0.435) (0.293, 0.465) (0.298, 0.470)
Sensitivity at 90% specificity 0.395 0.342 0.395 0.342 0.421 0.434
95% confidence interval for Sensitivity at 90% specificity (0.300, 0.500) (0.264, 0.441) (0.301, 0.493) (0.258, 0.450) (0.323, 0.536) (0.338, 0.521)
PPV at 90% specificity 0.283 0.287 0.318 0.330 0.313 0.292
95% confidence interval for PPV at 90% specificity (0.200, 0.328) (0.223, 0.353) (0.240, 0.371) (0.275, 0.405) (0.248, 0.378) (0.223, 0.358)
Sensitivity at 97% specificity 0.184 0.263 0.250 0.263 0.303 0.303
95% confidence interval for Sensitivity at 97% specificity (0.102, 0.264) (0.173, 0.350) (0.164, 0.343) (0.174, 0.356) (0.217, 0.392) (0.211, 0.396)
PPV at 97% specificity 0.433 0.452 0.514 0.452 0.500 0.500
95% confidence interval for PPV at 97% specificity (0.325, 0.439) (0.333, 0.568) (0.377, 0.600) (0.327, 0.560) (0.397, 0.629) (0.386, 0.600)
Specificity at 90% sensitivity 0.336 0.357 0.402 0.359 0.353 0.379
95% confidence interval for Specificity at 90% sensitivity (0.256, 0.493) (0.243, 0.469) (0.218, 0.522) (0.261, 0.488) (0.208, 0.536) (0.209, 0.579)
PPV at 90% sensitivity 0.117 0.121 0.129 0.121 0.120 0.124
95% confidence interval for PPV at 90% sensitivity (0.085, 0.159) (0.093, 0.152) (0.094, 0.162) (0.092, 0.157) (0.089, 0.168) (0.082, 0.175)

Table 4:

Performance measures over time (EHR-Structured data).

Data Collection Threshold
Performance Measure 30 days 60 days 90 days 180 days 270 days 360 days
Concordance Index 0.680 0.704 0.695 0.745 0.746 0.761
95% confidence interval for Concordance Index (0.659, 0.744) (0.670, 0.776) (0.668, 0.775) (0.698, 0.799) (0.702, 0.808) (0.725, 0.823)
AUC4 0.718 0.720 0.716 0.747 0.753 0.770
95% confidence interval for AUC4 (0.651, 0.786) (0.680, 0.787) (0.674, 0.773) (0.699, 0.800) (0.720, 0.814) (0.740, 0.835)
AP4 0.240 0.221 0.224 0.269 0.316 0.287
95% confidence interval for AP4 (0.187, 0.275) (0.189, 0.265) (0.198, 0.283) (0.214, 0.301) (0.267, 0.374) (0.228, 0.350)
Sensitivity at 90% specificity 0.316 0.395 0.368 0.395 0.342 0.487
95% confidence interval for Sensitivity at 90% specificity (0.229, 0.412) (0.287, 0.518) (0.285, 0.473) (0.288, 0.526) (0.261, 0.475) (0.378, 0.592)
PPV at 90% specificity 0.267 0.274 0.303 0.311 0.391 0.324
95% confidence interval for PPV at 90% specificity (0.211, 0.338) (0.189, 0.333) (0.235, 0.372) (0.246, 0.375) (0.328, 0.421) (0.256, 0.286)
Sensitivity at 97% specificity 0.250 0.224 0.171 0.237 0.303 0.263
95% confidence interval for Sensitivity at 97% specificity (0.166, 0.333) (0.130, 0.317) (0.103, 0.254) (0.163, 0.387) (0.217, 0.385) (0.173, 0.385)
PPV at 97% specificity 0.462 0.421 0.364 0.459 0.538 0.487
95% confidence interval for PPV at 97% specificity (0.298, 0.580) (0.279, 0.520) (0.243, 0.482) (0.352, 0.580) (0.412, 0.650) (0.369, 0.585)
Specificity at 90% sensitivity 0.285 0.368 0.422 0.492 0.377 0.354
95% confidence interval for Specificity at 90% sensitivity (0.181, 0.428) (0.220, 0.515) (0.233, 0.504) (0.177, 0.606) (0.227, 0.552) (0.226, 0.536)
PPV at 90% sensitivity 0.110 0.123 0.132 0.146 0.124 0.120
95% confidence interval for PPV at 90% sensitivity (0.089, 0.148) (0.093, 0.164) (0.097, 0.163) (0.092, 0.184) (0.090, 0.168) (0.082, 0.161)

Table 5:

Ensemble model performance measures over time (EHR-structured data and EHR-unstructured data).

Data Collection Threshold
Performance Measure 30 days 60 days 90 days 180 days 270 days 360 days
Concordance Index 0.725 0.721 0.731 0.740 0.749 0.757
95% confidence interval for Concordance Index (0.688, 0.793) (0.691, 0.796) (0.688, 0.800) (0.698, 0.804) (0.726, 0.814) (0.722, 0.825)
AUC4 0.769 0.751 0.767 0.761 0.780 0.797
95% confidence interval for AUC4 (0.715, 0.811) (0.701, 0.801) (0.701, 0.812) (0.723, 0.800) (0.729, 0.824) (0.746, 0.840)
AP4 0.347 0.336 0.346 0.359 0.387 0.399
95% confidence interval for AP4 (0.255, 0.429) (0.236, 0.410) (0.253, 0.429) (0.287, 0.473) (0.293, 0.477) (0.314, 0.489)
Sensitivity at 90% specificity 0.421 0.434 0.460 0.460 0.421 0.447
95% confidence interval for Sensitivity at 90% specificity (0.293, 0.513) (0.304, 0.520) (0.353, 0.548) (0.364, 0.554) (0.326, 0.534) (0.364, 0.544)
PPV at 90% specificity 0.280 0.302 0.312 0.327 0.283 0.344
95% confidence interval for PPV at 90% specificity (0.200, 0.331) (0.211, 0.331) (0.212, 0.344) (0.214, 0.341) (0.210, 0.336) (0.210, 0.336)
Sensitivity at 97% specificity 0.223 0.223 0.250 0.289 0.303 0.306
95% confidence interval for Sensitivity at 97% specificity (0.148, 0.309) (0.145, 0.306) (0.168, 0.346) (0.192, 0.397) (0.194, 0.375) (0.205, 0.414)
PPV at 97% specificity 0.500 0.432 0.462 0.488 0.489 0.553
95% confidence interval for PPV at 97% specificity (0.320, 0.521) (0.256, 0.463) (0.274, 0.500) (0.313, 0.529) (0.315, 0.528) (0.326, 0.573)
Specificity at 90% sensitivity 0.468 0.447 0.365 0.437 0.343 0.445
95% confidence interval for Specificity at 90% sensitivity (0.220, 0.529) (0.150, 0.519) (0.244, 0.476) (0.247, 0.526) (0.289, 0.572) (0.236, 0.629)
PPV at 90% sensitivity 0.140 0.138 0.122 0.134 0.118 0.137
95% confidence interval for PPV at 90% sensitivity (0.095, 0.168) (0.083, 0.162) (0.092, 0.154) (0.099, 0.168) (0.098, 0.173) (0.096, 0.189)

Table 6:

Performance measures over time for the structured model, the unstructured model, and the ensemble model.

Data Collection Threshold
Performance Measure Data 30 days 60 days 90 days 180 days 270 days 360 days
Concordance Index Unstructured data 0.718 0.693 0.711 0.709 0.733 0.743
Structured data 0.680 0.704 0.695 0.745 0.746 0.761
Ensemble 0.725 0.721 0.731 0.740 0.749 0.757
AUC4 Unstructured data 0.767 0.732 0.749 0.755 0.770 0.785
Structured data 0.718 0.720 0.716 0.747 0.753 0.770
Ensemble 0.769 0.751 0.767 0.761 0.780 0.797
AP4 Unstructured data 0.322 0.320 0.327 0.346 0.377 0.379
Structured data 0.240 0.221 0.224 0.269 0.316 0.287
Ensemble 0.347 0.336 0.346 0.359 0.387 0.399

To quantify the independence of the model predictions, the true positives (autism cases correctly predicted by the model), false negatives (autism cases incorrectly classified as negative by the model), and true negatives (controls correctly predicted by the model) are shown in a Venn diagram (sFigure 3). All 18 autism cases identified by both the structured model and the unstructured model are also recognized by the ensemble model, and only 2 autism cases missed by both the structured and unstructured models are also missed by the ensemble model.

Sensitivity to the length of required follow-up among controls is shown for the 30-day ensemble models (Figure 2, sFigure 4) and the 360-day ensemble models (sFigure 5). At 30 days, the AUCt ranged from an AUC2 of 0.699 to an AUC5 of 0.777, and the APt ranged from an AP2 of 0.250 (a 5.9-fold increase over effective prevalence) to an AP7 of 0.665 (a 1.8-fold increase over effective prevalence). At 360 days, the AUCt ranged from an AUC2 of 0.722 to an AUC5 of 0.810, and the APt ranged from an AP2 of 0.282 (a 6.7-fold increase over effective prevalence) to an AP7 of 0.715 (1.9-fold increase over effective prevalence).

Figure 2: Sensitivity to follow-up threshold for 30-day models.

Figure 2:

The panels show the effect of varying the required follow-up threshold t from 2 to 7 years when evaluating the performance of the 30-day models via the APt.

Performance in subgroups

AUC4 was higher in males (0.763) than in females (0.652) (see Figure 3). AP4 was higher in males (0.462) than in females (0.160), although the fold increase in AP4 over autism prevalence was higher in females (11.4) than in males (8.8). Among all racial groups represented in the test set (with controls required to have 4-year follow-up), AUC4 ranged from 0.748 (Unknown Race) to 0.857 (Black).

Figure 3: Prediction performance by sex and race.

Figure 3:

Performance when discriminating between children later diagnosed with autism and children not diagnosed through age 4 years is stratified by sex and race. The panels show the AUC4 (top), AP4 (middle), and AP4 divided by autism prevalence (bottom) in each group. The dotted lines indicate performance associated with random guessing (i.e., no information).

We also compared the model trained on unstructured EHR data of all note types with the model trained on discharge summaries only. The two models have similar AUC4 (0.767 for all note types vs. 0.739 for discharge summaries at 30 days) and AP4 (0.322 vs. 0.280 at 30 days) in the early period, but performance with all note types improves with age (AUC4 of 0.785 and AP4 of 0.379 at 360 days), whereas performance with discharge summaries changes little (AUC4 of 0.723 and AP4 of 0.314 at 360 days). The ensemble model follows a pattern similar to that of the unstructured model (see sFigure 7).

To understand the contribution of note metadata, we compared two identical survival models, one trained on the clinical notes and one trained on the note metadata. At 30 days, AUC4 was higher for predictions based on clinical notes (0.767) than for predictions based on metadata (0.574), as was AP4 (0.322 vs. 0.105).

Model Interpretability

The top 40 keywords with the highest log-odds learned by the TF-IDF model are listed in sTable 6. At 30 days, the odds ratios ranged from 9.408 (infant respiratory group) to 55.846 (prematurity and in-utero growth-related group). The odds ratio of each group does not change significantly across days, except for the prematurity and in-utero growth-related group, for which the linear regression coefficient over time is −0.005 (p = 0.005).

The 20 spans most strongly predictive of autism (scores ranging from 2.669 to 2.136) and the 20 spans most strongly predictive against autism (scores ranging from −7.580 to −4.087) learned by the Span model are listed in sTable 7. Keywords identified by the TF-IDF model are highlighted and colored by group.

DISCUSSION

Clinical notes are readily available in the EHR and do not require any data collection beyond what already takes place during routine care. Consistent with earlier findings supporting the added predictive value of clinical notes [27], our results demonstrate that notes contain information complementary to structured data (and vice versa) for early prediction of autism likelihood, and that ensemble models built on full EHR data are more effective than models that incorporate either structured or unstructured data only. Whereas screening tools such as the M-CHAT are validated and recommended for children 16–30 months old, the EHR-based approach provided predictive information about a later autism diagnosis as early as 30 days after birth. This information could be used to alert the provider to increase clinical surveillance of a patient during the period before a formal diagnosis of autism can be made.

Furthermore, the significant differences in the distribution of note types, note titles, and specialties between autism cases and controls suggest that these metadata may contribute to the improved prediction. With other note types roughly equally distributed, the number of consult notes in autism cases is 4.5 times that in controls, which might reflect higher clinical complexity in the autism group (see sFigure 2). Although differences exist in the distribution of note types, note titles, and specialties, the prediction model based on the metadata alone is much worse than prediction based on the clinical notes, indicating that the improvement comes from the note content itself, not the note distributions. Our machine learning results demonstrate that unstructured EHR data provide information complementary to structured EHR data for autism prediction, and the ensemble model covers most of the base models’ correct predictions and delivers the highest test AUC4 and AP4 among all models (sFigure 3).

The two groups with the highest keyword odds ratios are the prematurity and in-utero growth-related group and the maternal factors group, which include extreme prematurity and very low birth weight, and maternal obesity, mental health conditions, and immune system disorders, respectively. Environmental risk factors involved in such events before and during birth have been shown to be associated with autism [39]. However, these factors alone are unlikely to cause autism; rather, they appear to increase a child’s likelihood (risk) of developing autism when combined with genetic factors [40]. Our approach to interpretability shows that at least two distinct groups of keywords appear in each of the top 20 spans most strongly predictive of autism. The interplay between the keyword groups suggests that unstructured EHR data contain autism-associated information and that multiple domains are related to the prediction.

Results shown in sFigure 6 suggest that the discharge summary is sufficient to capture most of the unstructured data contributing to prediction at 30 days, but it does not fully capture all predictive information later in the first year of life. Additional subgroup analyses show that although male sex is associated with higher autism likelihood, model-based autism prediction (AUC4) was more effective in boys than in girls, whereas the fold increase in AP4 over prevalence was higher in girls (see Figure 3).

BERT (Bidirectional Encoder Representations from Transformers) [41] has been shown to be particularly effective in natural language processing tasks such as machine translation [42], question answering [43], and text classification [44], and for analyzing medical texts [26], [45]. In our setting, we used the pretrained BioBERT model available at https://github.com/naver/biobert-pretrained instead of GloVe embeddings [46], and built the same decoder MLP layers as in the CNN and Span models. However, the BERT model was not effective for early autism prediction. We suspect that this might be due to the high redundancy in the clinical notes, which may allow the deep model to overfit to noise in the raw clinical data and thus fail to generalize to out-of-sample data. A potential solution to this problem is to feed the model cleaner data, such as ICD codes or data with templates [26], [45].
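For reference, a BioBERT encoder with a small MLP head might be set up as below. This is a sketch under stated assumptions: the Hugging Face checkpoint name stands in for the naver/biobert-pretrained weights used in the paper, and `notes` is a hypothetical list of preprocessed note strings.

```python
import tensorflow as tf
from transformers import AutoTokenizer, TFAutoModel

name = "dmis-lab/biobert-base-cased-v1.1"            # assumed checkpoint
tokenizer = AutoTokenizer.from_pretrained(name)
encoder = TFAutoModel.from_pretrained(name, from_pt=True)  # convert PyTorch weights

inputs = tokenizer(notes, truncation=True, padding=True,
                   max_length=512, return_tensors="tf")
cls_embedding = encoder(**inputs).last_hidden_state[:, 0, :]   # [CLS] token

head = tf.keras.Sequential([                         # MLP decoder on [CLS]
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
risk_scores = head(cls_embedding)
```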

Overall, our results suggest that incorporating clinical notes helps unlock the full potential of EHR data to stratify autism likelihood from an early age and thereby guide early autism prediction. Adding notes to structured data improves classification performance while still avoiding any need for additional, potentially costly data collection. Additional features learned from the notes, such as spans and keywords, may be hypothesis-generating for further neural or behavioral investigation and treatment research. Subsequent work will focus on the development of predictive models that can be deployed and iteratively evaluated in the context of the clinical workflow.

Limitations

Our findings are based on samples from DUHS, a large academic medical center in Durham, North Carolina. Because demographic and population health characteristics may differ elsewhere, our results may not be applicable to other settings, including different EHR systems or data models. The analysis of autism cases was limited by the relatively recent adoption of EHRs, and results may improve as more cases are collected. Additionally, an EHR phenotype was used as a proxy for a standardized diagnosis, so the diagnostic methods may be heterogeneous and less reliable. Moreover, providers often enter typed information using shortcuts and templates, which makes it difficult for natural language processing (NLP) methods to distinguish original sentences from templated text. In future work, it would be beneficial to explore NLP approaches that extract concepts, which could be encoded as embeddings and used as predictors in a predictive model; the resulting model would be more robust to negation and ambiguous statements and would produce more explainable predictions. Furthermore, the machine learning approaches used here are associative only, and no causal relationships can be concluded from them. Lastly, a bias with respect to the methods used for review (rule-based vs. machine learning) may have been introduced, as rule-based methods are more commonly used in clinical contexts than in other areas.

Future work

These results motivate several directions for future work. One direction is to explore methods for interpreting numeric values in clinical notes, since the specific values may be clinically relevant; this could be approached by adding a digit embedding to the NLP framework that enables searching for the best interpretation of numeric features in clinical notes. Another direction is to build a BERT model on the cleaner parts of the EHR data, such as ICD codes and templated notes; with more features extracted, the performance of the ensemble model could be further boosted. We also need to compare the performance of the ensemble model with different autism screening tools, including the M-CHAT, to determine their relative strengths and weaknesses and to identify any subpopulations of autistic children that may be missed by one tool but identified by another. We view these as complementary methods that could be used together to improve early autism prediction. As more health records become available, we can continuously improve our model and analysis. We are also working to deploy our models within DUHS, integrate them with clinical workflows, and test the impact of early prediction by surfacing model predictions to primary care and other healthcare providers. It will also be important to explore how EHR-based prediction can be used in the context of primary care and its potential impact on timely referrals for diagnostic evaluation and access to early intervention.

Supplementary Material


STATEMENT OF SIGNIFICANCE.

Autism is a neurodevelopmental disorder that impacts communication and social interaction. Despite the numerous benefits of early detection, such as enabling early intervention, improving outcomes, providing access to services, and reducing family stress, the diagnosis of autism is typically delayed, with the median age of diagnosis often beyond the recommended universal screening age. While structured EHR data have shown promise for early autism prediction, incorporating unstructured data might improve predictive power. Accordingly, this investigation aimed to create and assess an ensemble model that merges structured and unstructured EHR data to enhance early autism prediction.

Funding

Funding support was provided by NIMH R01MH121329 (Dawson, PI) and NICHD P50HD093074 (Dawson and Sapiro, Co-PIs). GS is also supported by NSF and DoD as well as gifts from Google, Amazon, and Cisco.

Footnotes

Conflict of Interest Statements

Geraldine Dawson is on the Scientific Advisory Boards of Akili Interactive, Inc., Zynerba, the Nonverbal Learning Disability Project, and Tris Pharma; is a consultant to Apple, Gerson Lehrman Group, and Guidepoint Global, Inc.; and receives book royalties from Guilford Press and Springer Nature. Dr. Dawson has developed technology that has been licensed to Apple, Inc., and she and Duke University have benefited financially.

Guillermo Sapiro has developed technology that has been licensed and he and Duke University have benefited financially. Sapiro is also affiliated with Apple, Inc. This work is independent of such affiliation.

The other authors have no financial relationships or potential conflicts of interest relevant to this article.

Declaration of interests

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

CRediT author statement

Junya Chen, Matthew Engelhard, Ricardo Henao: Conceptualization, Methodology, Software

Matthew Engelhard, Samuel Berchuck: Data curation

Brian Eichner, Eliana M. Perrin, Guillermo Sapiro, and Geraldine Dawson: Supervision, Writing - Reviewing and Editing.

1

We did not explicitly exclude any medication used to treat autism or highly related to autism because the feature collection window does not overlap with the autism diagnosis window. More specifically, the predictors available to our model (including medications) were based on data collected before age 1 year, which is months before autism can be diagnosed. Although there is debate regarding the earliest age at which autism is detectable in principle, it is generally believed to be after age 1 year, and in practice autism diagnosis before 18 months is exceedingly rare. Indeed, no diagnoses before age 1 year were observed in our dataset (i.e., within our health system). Thus, while specific medications prescribed before age 1 year may indeed be associated with later autism diagnosis, they cannot directly indicate the presence of a diagnosis, and as such our use of this information to inform autism prediction is intentional.

2

The positive predictive value (PPV) is also referred to as precision. We compute average precision (AP) rather than the area under the precision-recall curve (AUPR), as the latter necessitates numerical integration because the curve is not piecewise constant. While AP and AUPR are not technically equivalent, it is common practice for software packages to compute AP as an approximation in place of a numerical integration estimate.

Publisher's Disclaimer: This is a PDF file of an article that has undergone enhancements after acceptance, such as the addition of a cover page and metadata, and formatting for readability, but it is not yet the definitive version of record. This version will undergo additional copyediting, typesetting and review before it is published in its final form, but we are providing this version to give early visibility of the article. Please note that, during the production process, errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

References

[1] Cheng W, Rolls ET, Gu H, Zhang J, and Feng J, “Autism: reduced connectivity between cortical areas involved in face expression, theory of mind, and the sense of self,” Brain J. Neurol., vol. 138, no. Pt 5, pp. 1382–1393, May 2015, doi: 10.1093/brain/awv051.
[2] Lai M-C, Lombardo MV, and Baron-Cohen S, “Autism,” The Lancet, vol. 383, no. 9920, pp. 896–910, Mar. 2014, doi: 10.1016/S0140-6736(13)61539-1.
[3] Christensen DL et al., “Prevalence and Characteristics of Autism Spectrum Disorder Among Children Aged 8 Years — Autism and Developmental Disabilities Monitoring Network, 11 Sites, United States, 2012,” MMWR Surveill. Summ., vol. 65, no. 13, pp. 1–23, Nov. 2018, doi: 10.15585/mmwr.ss6513a1.
[4] Mandell DS, Listerud J, Levy SE, and Pinto-Martin JA, “Race Differences in the Age at Diagnosis Among Medicaid-Eligible Children With Autism,” J. Am. Acad. Child Adolesc. Psychiatry, vol. 41, no. 12, pp. 1447–1453, Dec. 2002, doi: 10.1097/00004583-200212000-00016.
[5] Dawson G, “Why it’s important to continue universal autism screening while research fully examines its impact,” JAMA Pediatr., vol. 170, no. 6, pp. 527–528, 2016.
[6] Yuen T, Penner M, Carter MT, Szatmari P, and Ungar WJ, “Assessing the accuracy of the Modified Checklist for Autism in Toddlers: a systematic review and meta-analysis,” Dev. Med. Child Neurol., vol. 60, no. 11, pp. 1093–1100, 2018.
[7] Carbone PS et al., “Primary care autism screening and later autism diagnosis,” Pediatrics, vol. 146, no. 2, 2020.
[8] Guthrie W et al., “Accuracy of autism screening in a large pediatric network,” Pediatrics, vol. 144, no. 4, 2019.
[9] Stenberg N et al., “Identifying children with autism spectrum disorder at 18 months in a general population sample,” Paediatr. Perinat. Epidemiol., vol. 28, no. 3, pp. 255–262, 2014.
[10] Sturner R, Howard B, Bergmann P, Stewart L, and Afarian TE, “Comparison of autism screening in younger and older toddlers,” J. Autism Dev. Disord., vol. 47, no. 10, pp. 3180–3188, 2017.
[11] Scarpa A et al., “The modified checklist for autism in toddlers: Reliability in a diverse rural American sample,” J. Autism Dev. Disord., vol. 43, no. 10, pp. 2269–2279, 2013.
[12] Dickerson AS et al., “Autism spectrum disorder reporting in lower socioeconomic neighborhoods,” Autism, vol. 21, no. 4, pp. 470–480, 2017.
[13] Donohue MR, Childs AW, Richards M, and Robins DL, “Race influences parent report of concerns about symptoms of autism spectrum disorder,” Autism, vol. 23, no. 1, pp. 100–111, 2019.
[14] Robins DL, Casagrande K, Barton M, Chen C-MA, Dumont-Mathieu T, and Fein D, “Validation of the modified checklist for autism in toddlers, revised with follow-up (M-CHAT-R/F),” Pediatrics, vol. 133, no. 1, pp. 37–45, 2014.
[15] Baygin M et al., “Automated ASD detection using hybrid deep lightweight features extracted from EEG signals,” Comput. Biol. Med., vol. 134, p. 104548, Jul. 2021, doi: 10.1016/j.compbiomed.2021.104548.
[16] Jones W and Klin A, “Attention to eyes is present but in decline in 2–6-month-old infants later diagnosed with autism,” Nature, vol. 504, no. 7480, pp. 427–431, 2013.
[17] Liu W, Yu X, Raj B, Yi L, Zou X, and Li M, “Efficient autism spectrum disorder prediction with eye movement: A machine learning framework,” in 2015 International Conference on Affective Computing and Intelligent Interaction (ACII), Sep. 2015, pp. 649–655, doi: 10.1109/ACII.2015.7344638.
[18] Chang Z et al., “Computational methods to measure patterns of gaze in toddlers with autism spectrum disorder,” JAMA Pediatr., vol. 175, no. 8, pp. 827–836, 2021.
[19] V. R and S. R, “A machine learning based approach to classify autism with optimum behavior sets,” Int. J. Eng. Technol., vol. 7, no. 4, Art. no. 4, Dec. 2018, doi: 10.14419/ijet.v7i3.18.14907.
[20] Sen B, Borle NC, Greiner R, and Brown MRG, “A general prediction model for the detection of ADHD and Autism using structural and functional MRI,” PLoS One, vol. 13, no. 4, p. e0194856, 2018, doi: 10.1371/journal.pone.0194856.
[21] Hazlett HC et al., “Early brain development in infants at high risk for autism spectrum disorder,” Nature, vol. 542, no. 7641, pp. 348–351, Feb. 2017, doi: 10.1038/nature21369.
[22] Krishnan A et al., “Genome-wide prediction and functional characterization of the genetic basis of autism spectrum disorder,” Nat. Neurosci., vol. 19, no. 11, pp. 1454–1462, Nov. 2016, doi: 10.1038/nn.4353.
[23] Engelhard M et al., “Predictive value of early autism detection models based on electronic health record data collected before age 1,” JAMA Netw. Open, 2022 (in press).
[24] Kong H-J, “Managing Unstructured Big Data in Healthcare System,” Healthc. Inform. Res., vol. 25, no. 1, pp. 1–2, Jan. 2019, doi: 10.4258/hir.2019.25.1.1.
[25] Hernandez-Boussard T, Monda KL, Crespo BC, and Riskin D, “Real world evidence in cardiovascular medicine: ensuring data validity in electronic health record-based studies,” J. Am. Med. Inform. Assoc., vol. 26, no. 11, pp. 1189–1194, Nov. 2019, doi: 10.1093/jamia/ocz119.
[26] Meng Y, Speier W, Ong M, and Arnold CW, “HCET: Hierarchical Clinical Embedding with Topic Modeling on Electronic Health Record for Predicting Depression,” IEEE J. Biomed. Health Inform., vol. 25, no. 4, pp. 1265–1272, Apr. 2021, doi: 10.1109/JBHI.2020.3004072.
[27] Zhang D, Yin C, Zeng J, Yuan X, and Zhang P, “Combining structured and unstructured data for predictive models: a deep learning approach,” BMC Med. Inform. Decis. Mak., vol. 20, no. 1, p. 280, Oct. 2020, doi: 10.1186/s12911-020-01297-6.
[28] Subramanian V, Engelhard M, Berchuck S, Chen L, Henao R, and Carin L, “SpanPredict: Extraction of Predictive Document Spans with Neural Attention,” in Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics, Jun. 2021, pp. 5234–5258, doi: 10.18653/v1/2021.naacl-main.413.
[29] Burke JP et al., “Does a claims diagnosis of autism mean a true case?,” Autism, vol. 18, no. 3, pp. 321–330, Apr. 2014, doi: 10.1177/1362361312467709.
[30] Therneau TM and Grambsch PM, “The Cox Model,” in Modeling Survival Data: Extending the Cox Model, Statistics for Biology and Health, New York, NY: Springer, 2000, pp. 39–77, doi: 10.1007/978-1-4757-3294-8_3.
[31] Ishwaran H, Kogalur UB, Blackstone EH, and Lauer MS, “Random survival forests,” Ann. Appl. Stat., vol. 2, no. 3, pp. 841–860, Sep. 2008, doi: 10.1214/08-AOAS169.
[32] Lee J et al., “BioBERT: a pre-trained biomedical language representation model for biomedical text mining,” Bioinformatics, p. btz682, Sep. 2019, doi: 10.1093/bioinformatics/btz682.
[33] Barua S, Islam MM, Yao X, and Murase K, “MWMOTE--Majority Weighted Minority Oversampling Technique for Imbalanced Data Set Learning,” IEEE Trans. Knowl. Data Eng., vol. 26, no. 2, pp. 405–425, Feb. 2014, doi: 10.1109/TKDE.2012.232.
[34] Bradley AP, “The use of the area under the ROC curve in the evaluation of machine learning algorithms,” Pattern Recognit., vol. 30, no. 7, pp. 1145–1159, Jul. 1997, doi: 10.1016/S0031-3203(96)00142-2.
[35] Yang Y and Pedersen JO, “A comparative study on feature selection in text categorization,” in ICML, 1997, p. 35.
[36] Kwiecien R, Kopp-Schneider A, and Blettner M, “Concordance Analysis,” Dtsch. Ärztebl. Int., vol. 108, no. 30, pp. 515–521, Jul. 2011, doi: 10.3238/arztebl.2011.0515.
[37] Efron B and Tibshirani R, An Introduction to the Bootstrap, Monographs on Statistics and Applied Probability, no. 57, New York: Chapman & Hall, 1993.
[38] Ranganath R, Perotte A, Elhadad N, and Blei D, “Deep Survival Analysis,” in Proceedings of the 1st Machine Learning for Healthcare Conference, PMLR, 2016, pp. 101–114. Available: https://proceedings.mlr.press/v56/Ranganath16.html
[39] Kolevzon A, Gross R, and Reichenberg A, “Prenatal and Perinatal Risk Factors for Autism: A Review and Integration of Findings,” Arch. Pediatr. Adolesc. Med., vol. 161, no. 4, pp. 326–333, 2007.
[40] Muhle R, Trentacoste SV, and Rapin I, “The genetics of autism,” Pediatrics, vol. 113, no. 5, pp. e472–e486, 2004.
[41] Devlin J, Chang M-W, Lee K, and Toutanova K, “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding,” arXiv, May 2019. Available: http://arxiv.org/abs/1810.04805
[42] Zhu J et al., “Incorporating BERT into Neural Machine Translation,” arXiv, Feb. 2020, doi: 10.48550/arXiv.2002.06823.
[43] Qu C, Yang L, Qiu M, Croft WB, Zhang Y, and Iyyer M, “BERT with history answer embedding for conversational question answering,” in Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, 2019, pp. 1133–1136.
[44] Sun C, Qiu X, Xu Y, and Huang X, “How to fine-tune BERT for text classification?,” in China National Conference on Chinese Computational Linguistics, Springer, 2019, pp. 194–206.
[45] Meng Y, Speier W, Ong MK, and Arnold CW, “Bidirectional Representation Learning from Transformers using Multimodal Electronic Health Record Data to Predict Depression,” IEEE J. Biomed. Health Inform., vol. 25, no. 8, pp. 3121–3129, Aug. 2021, doi: 10.1109/JBHI.2021.3063721.
[46] Pennington J, Socher R, and Manning C, “GloVe: Global Vectors for Word Representation,” in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar: Association for Computational Linguistics, Oct. 2014, pp. 1532–1543, doi: 10.3115/v1/D14-1162.
