Abstract
Electronic phenotyping is the task of ascertaining whether an individual has a medical condition of interest by analyzing their medical record and is foundational in clinical informatics. Increasingly, electronic phenotyping is performed via supervised learning. We investigate the effectiveness of multitask learning for phenotyping using electronic health records (EHR) data. Multitask learning aims to improve model performance on a target task by jointly learning additional auxiliary tasks and has been used in disparate areas of machine learning. However, its utility when applied to EHR data has not been established, and prior work suggests that its benefits are inconsistent. We present experiments that elucidate when multitask learning with neural nets improves performance for phenotyping using EHR data relative to neural nets trained for a single phenotype and to well-tuned baselines. We find that multitask neural nets consistently outperform single-task neural nets for rare phenotypes but underperform for relatively more common phenotypes. The effect size increases as more auxiliary tasks are added. Moreover, multitask learning reduces the sensitivity of neural nets to hyperparameter settings for rare phenotypes. Last, we quantify phenotype complexity and find that neural nets trained with or without multitask learning do not improve on simple baselines unless the phenotypes are sufficiently complex.
Keywords: Electronic Health Records, Electronic phenotyping algorithms, Deep learning, Multi-task learning
1. Introduction
The goal of electronic phenotyping is to identify patients with (or without) a specific disease or medical condition using their electronic medical records. Identifying sets of such patients (i.e. a patient cohort) is the first step in a wide range of applications such as comparative effectiveness studies,1,2 clinical decision support,3,4 and translational research.5 Increasingly, such phenotyping is done via supervised machine learning methods.6–8
Multitask learning (MTL) is a widely used technique in machine learning that seeks to improve performance on a target task by jointly modeling the target task and additional auxiliary tasks.9 MTL has been used to good effect in a wide variety of domains including computer vision,10 natural language processing,11,12 speech recognition,13 and even drug development.14,15 However, its effectiveness using EHR data is less well established, with prior work providing contradictory evidence regarding its utility.16,17
In this work, we investigate the effectiveness of MTL for phenotyping using EHR. Our preliminary studies recapitulated the inconsistent benefits found in prior work.16,17 We thus aimed to elucidate the properties of the phenotypes for which MTL helps versus harms performance.
In this paper, we present a systematic exploration of the factors that determine whether or not MTL improves the performance of neural nets for phenotyping with EHR data. Our experiments suggest the following conclusions:
- MTL helps performance for low prevalence (i.e. rare) phenotypes, but harms performance for relatively high prevalence phenotypes. Consistent with some prior work, there is a dose-response relationship with the number of auxiliary tasks, with the magnitude of the benefit or harm generally increasing as auxiliary tasks are added.
- MTL reduces the sensitivity of neural nets to hyperparameter settings. This is of practical importance when one has a limited computational budget for model development.
- Neural nets trained with or without MTL do not improve on simple baselines unless phenotypes are sufficiently complex. However, learning more complex models can be problematic with complex but low prevalence phenotypes. We explore this phenomenon by quantifying phenotype complexity using information theoretic metrics.
2. Background
2.1. Multitask nets
Multitask Learning
MTL seeks to improve performance on a given target task by jointly learning additional auxiliary tasks. For instance, if the target task is whether or not a patient has type 2 diabetes, one might jointly learn auxiliary tasks such as whether or not the patient has other diseases such as congestive heart failure or emphysema. MTL is most frequently embodied as a neural net in which the earliest layers of the network are shared among the target and auxiliary tasks, with separate outputs for each task (see Figure 1). MTL was originally proposed to improve performance on risk stratification of pneumonia patients by leveraging information in lab values as auxiliary tasks.9 It has since been used extensively for health care problems such as predicting illness severity18 and mortality,17 and disease risk and progression.19–23 However, the reported benefits of MTL are inconsistent across problems. For example, Che et al. showed that MTL improved performance on identifying physiological markers in clinical time series data,16 while Nori et al. concluded that MTL failed to improve performance on predicting mortality in an acute care setting.17 Our aim in this study is to clarify when one might expect MTL to help performance on problems using EHR data. We focus specifically on the foundational problem of phenotyping, which we discuss next.
Fig. 1.
The architecture of a multitask neural net for electronic phenotyping is shown on the right: the target task (shown in yellow) and the auxiliary tasks (shown in blue) share hidden layers and have distinct output layers; for comparison, we show the corresponding single-task neural net on the left with a single output layer for the target phenotype.
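The shared-layer architecture described above can be illustrated with a minimal sketch in plain Python. It is not the implementation used in this study: the class name `MultitaskNet`, the single shared hidden layer, and the unweighted sum of per-task cross-entropy losses are assumptions made for illustration only.

```python
import math
import random


def relu(v):
    return [max(0.0, x) for x in v]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def linear(weights, bias, v):
    # weights is a list of rows; returns weights @ v + bias
    return [sum(w * x for w, x in zip(row, v)) + b for row, b in zip(weights, bias)]


def init_layer(rng, n_out, n_in):
    # Xavier/Glorot uniform initialization with zero biases
    scale = math.sqrt(6.0 / (n_in + n_out))
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


class MultitaskNet:
    """Hypothetical minimal multitask net: one shared hidden layer, one head per task."""

    def __init__(self, n_features, hidden_size, n_aux, seed=0):
        rng = random.Random(seed)
        self.shared = init_layer(rng, hidden_size, n_features)
        # Head 0 is the target phenotype; heads 1..n_aux are the auxiliary tasks.
        self.heads = [init_layer(rng, 1, hidden_size) for _ in range(1 + n_aux)]

    def forward(self, x):
        hidden = relu(linear(self.shared[0], self.shared[1], x))  # shared by all tasks
        return [sigmoid(linear(w, b, hidden)[0]) for w, b in self.heads]


def multitask_loss(probs, labels):
    # Assumption: unweighted sum of binary cross-entropy terms over all tasks.
    eps = 1e-12
    return sum(
        -(y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps))
        for p, y in zip(probs, labels)
    )


net = MultitaskNet(n_features=4, hidden_size=3, n_aux=5)
probs = net.forward([1, 0, 1, 0])
loss = multitask_loss(probs, [1, 0, 0, 0, 0, 0])
```

Setting `n_aux=0` in this sketch gives the corresponding single-task net, with one output for the target phenotype only.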
Electronic Phenotyping
In this study, phenotyping is simply identifying whether or not a patient has a given disease or disorder. The gold standard for phenotyping remains manual chart review by trained clinicians, which is time-consuming and expensive.24–26
This has spurred work on electronic phenotyping, which aims to solve the same problem using automated means and EHR data as input. The earliest electronic phenotyping algorithms were rule-based decision criteria created by domain experts.24–28 Figure 2 shows an example of a rule-based algorithm for type 2 diabetes mellitus. In this approach, identifying patients with the phenotype can be automated once the algorithm is specified, but the latter process is still time consuming and expensive.
Fig. 2.
Rule-based definitions for Type 2 Diabetes Mellitus from PheKB.34
More recent work has focused on using statistical learning6,29–33 to automate the process of specifying the algorithm itself, using machine learning models such as logistic regression, random forests, and neural nets. MTL is one technique for improving the performance of such learned classifiers. Our goal in this work is not to maximize performance for some phenotype but rather to gain insight into when MTL helps versus harms in this approach to phenotyping.
3. Methods
3.1. Dataset Construction and Design
Dataset
Our data comprises de-identified patient data spanning 2010 through 2016 for 1,221,401 patients from the Stanford Translational Research Integrated Database Environment (STRIDE) database.35 Each patient’s data includes timestamped diagnosis (ICD-9), procedure (CPT), drug (RxNorm) codes, along with demographic information (age, gender, race, and ethnicity). We use a simple multi-hot feature representation whereby each ICD-9, CPT, and RxNorm code is mapped to a binary indicator variable for whether the code occurs in the patient’s medical history. We similarly encode gender, race, ethnicity, and each integer value of age. This process results in a sparse representation of 29,102 features.
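As a rough illustration of the multi-hot representation, the sketch below maps each distinct code to one binary indicator. The patient identifiers and code strings (including the `ICD9:`, `CPT:`, `RXNORM:`, `AGE:`, and `GENDER:` prefixes) are made-up placeholders, not values or naming conventions from STRIDE.

```python
def build_vocabulary(patients):
    # Assign one column index to every distinct code seen in any patient history.
    codes = sorted({code for history in patients.values() for code in history})
    return {code: idx for idx, code in enumerate(codes)}


def multi_hot(history, vocab):
    # 1 if the code occurs anywhere in the patient's history, else 0.
    # Repeated occurrences and timestamps are deliberately ignored.
    vector = [0] * len(vocab)
    for code in history:
        idx = vocab.get(code)
        if idx is not None:
            vector[idx] = 1
    return vector


# Hypothetical toy records; demographics are encoded as codes as well.
patients = {
    "p1": ["ICD9:250.00", "RXNORM:6809", "ICD9:250.00", "AGE:54", "GENDER:F"],
    "p2": ["CPT:93000", "ICD9:427.31", "AGE:71", "GENDER:M"],
}
vocab = build_vocabulary(patients)
features = {pid: multi_hot(history, vocab) for pid, history in patients.items()}
```

With 29,102 features and few codes per patient, a real implementation would presumably store these vectors in a sparse format rather than as dense lists.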
Target Task Phenotypes
Phenotyping with statistical classifiers is typically framed as a binary classification task, which requires data labeled with whether or not the patient has the phenotype. For this study, we derive the phenotypes using rule-based definitions from PheKB,36 a compendium of phenotype definitions developed to support genome-wide association studies. We focus on four phenotypes, chosen to span a range of prevalences: type 2 diabetes mellitus (T2DM), atrial fibrillation (AF), abdominal aneurysm (AA), and angioedema (AE). The respective prevalences of these phenotypes in our data are 2.95%,[a] 2.89%, 0.12%, and 0.08%. We use these rule-based definitions to derive the phenotypes because they are easy to implement, scalable, and transparent. Later we describe how we take advantage of the rule-based definitions to gain insight into the effectiveness of MTL relative to baselines.
Auxiliary Tasks
Our auxiliary tasks are to classify phecodes, manually curated groupings of ICD-9 codes originally used to facilitate phenome-wide association studies.37 We randomly select phecodes with prevalence between 0.08% and 2.95%, i.e. the lowest and the highest target phenotype prevalences, as auxiliary tasks. We conduct binary classification on each phecode and experiment with 5, 10, and 20 randomly selected phecodes as auxiliary tasks.
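One way to draw such auxiliary tasks is sketched below. The function name, the toy prevalence table, and the choice to build the 5-, 10-, and 20-task sets as prefixes of a single random draw (so that they are nested, as in the experiments described later) are assumptions for illustration, not the authors' code.

```python
import random


def select_nested_aux_tasks(phecode_prevalence, low, high, sizes=(5, 10, 20), seed=0):
    # Keep only phecodes whose prevalence lies within the target phenotypes' range.
    eligible = sorted(c for c, p in phecode_prevalence.items() if low <= p <= high)
    if len(eligible) < max(sizes):
        raise ValueError("not enough eligible phecodes")
    rng = random.Random(seed)
    chosen = rng.sample(eligible, max(sizes))
    # Prefixes of one random draw give nested sets: 5 within 10 within 20.
    return {k: chosen[:k] for k in sizes}


# Hypothetical prevalence table (fractions, not percentages).
toy_prevalence = {f"phe_{i}": 0.0005 * (i + 1) for i in range(80)}
aux_sets = select_nested_aux_tasks(toy_prevalence, low=0.0008, high=0.0295)
```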
3.2. Experimental Design
We aim to investigate whether and under what circumstances MTL improves performance upon baselines. Recent work suggests that we need to be careful in order to draw robust conclusions on the relative merits of machine learning, especially neural net based methods.38–41
First, one typically randomly partitions data into training, validation and test sets. We fit models to the training set, select or tune models using the validation set, and estimate performance on new data using the test set. All three steps use finite samples and are thus subject to noise due to sampling. This is especially true when data exhibit extreme class imbalance, as is the case with our phenotypes. Second, the performance of even simple feed-forward neural nets is known to be sensitive to hyperparameters such as the number of hidden layers and their sizes. Finally, fitting neural nets is inherently stochastic due to random initialization of model parameters and training by some variation of stochastic gradient descent. This, combined with the highly non-convex nature of neural nets, implies that different training runs of a neural net with fixed hyperparameters and dataset splits can still result in widely varying performance.42
We thus designed our experiments to mitigate noise due to these factors. First, for each phenotype, we perform ten random splits of the data into training (80%), validation (10%), and test (10%) sets. We use stratified sampling to fix the prevalence of the targets to the overall sample prevalence in each of the training, validation, and test sets. Second, for each of these splits, we perform a grid search over the following hyperparameters for the MTNN and STNN models: the number of hidden layers (1 or 2), their size (128, 256, 512, 1024, or 2048), and the initial learning rate (1e-4 or 5e-5). Moreover, we vary the number of auxiliary tasks (5, 10, or 20 nested, randomly selected phecodes) for MTNNs, repeating the above grid search for each scenario. For each split, we also fit an L1 regularized logistic regression model, tuned on the validation set. We use the area under the Precision-Recall curve (AUPRC) as our evaluation metric since it can be more informative than the area under the receiver operating characteristic curve (AUROC) in problems with extreme class imbalance.43
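The splitting and grid-search design can be sketched as follows. This is a simplified illustration under stated assumptions: the helper names and the rounding used to size each subset are hypothetical, and only the split fractions and grid values are taken from the text.

```python
import itertools
import random


def stratified_split(labels, fractions=(0.8, 0.1, 0.1), seed=0):
    # Split positives and negatives separately so each subset keeps
    # (approximately) the overall prevalence.
    rng = random.Random(seed)
    train, val, test = [], [], []
    for cls in (0, 1):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n_train = round(fractions[0] * len(idx))
        n_val = round(fractions[1] * len(idx))
        train.extend(idx[:n_train])
        val.extend(idx[n_train:n_train + n_val])
        test.extend(idx[n_train + n_val:])
    return train, val, test


# Hyperparameter values as listed in the text.
GRID = {
    "n_hidden_layers": [1, 2],
    "hidden_size": [128, 256, 512, 1024, 2048],
    "learning_rate": [1e-4, 5e-5],
}


def grid_configs(grid):
    keys = list(grid)
    return [dict(zip(keys, values)) for values in itertools.product(*(grid[k] for k in keys))]


toy_labels = [1] * 30 + [0] * 970  # toy data with 3% prevalence
splits = [stratified_split(toy_labels, seed=s) for s in range(10)]  # ten random splits
configs = grid_configs(GRID)  # 2 * 5 * 2 = 20 settings per model type
```

Under this grid, each split requires 20 STNN fits and 20 MTNN fits per auxiliary-task scenario.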
Phenotype Complexity
Our experiments suggested that the complexity of the phenotype is important in whether MTNNs and STNNs outperform well-tuned logistic regression. We quantified the phenotype complexity with regard to a subset of the features upon which the classifiers are built.[b] If we had access to an oracle that told us which features of the patient representation are important in determining a patient’s phenotype, we could characterize the complexity of the phenotype with regard to the observed combinations of these features in the positive cases. We could also compare the distributions of the positive and negative cases to examine how difficult it is to discriminate positive and negative cases given the relevant features.
Our phenotypes are derived from the rule-based definitions, which we use as such an oracle: for each phenotype, we extract the features involved in its rule-based definitions (the oracle features) and count occurrences of each distinct combination of these features observed in the positive and negative cases. Each unique combination is represented as a binary string with each digit indicating the presence or absence of an oracle feature. Since some of the phenotype definitions involve very many combinations, we hash the combinations into a lower-dimensional space, i.e. a fixed number of buckets. Specifically, we use a hash function to map the combinations (the variable-length binary strings) to a fixed number of hash codes (the buckets). We obtain the counts in each bucket for the positive and negative cases and analyze the resulting histograms using two information theoretic metrics.
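The bucketing step might look like the sketch below. The text does not specify the hash function, so SHA-256 reduced modulo the number of buckets is an assumption, as are the function names and the toy oracle features.

```python
import hashlib
from collections import Counter

N_BUCKETS = 32  # the bucket count reported for Figure 6


def combination_string(patient_features, oracle_features):
    # One binary digit per oracle feature: present (1) or absent (0).
    return "".join("1" if f in patient_features else "0" for f in oracle_features)


def bucket_of(bits, n_buckets=N_BUCKETS):
    # Assumed hash: SHA-256 of the bit string, reduced modulo n_buckets.
    digest = hashlib.sha256(bits.encode("ascii")).hexdigest()
    return int(digest, 16) % n_buckets


def bucket_histogram(patients, oracle_features, n_buckets=N_BUCKETS):
    counts = Counter(
        bucket_of(combination_string(p, oracle_features), n_buckets) for p in patients
    )
    return [counts.get(b, 0) for b in range(n_buckets)]


# Hypothetical oracle features and positive cases, each a set of observed features.
oracle = ["dx_code", "rx_code", "lab_abnormal"]
positives = [{"dx_code", "rx_code"}, {"dx_code"}, {"dx_code", "rx_code"}]
pos_hist = bucket_histogram(positives, oracle)
```

Running the same function on the negative cases yields the second histogram used for comparison.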
Let $x_i$ be the vector of oracle features for bucket $i$. We summarize the phenotype complexity of positive cases by treating the histogram as a discrete probability distribution $P^{+}$ and calculating its information entropy,44 defined as:
$$H(P^{+}) = -\sum_{i=1}^{n} P^{+}(x_i) \log P^{+}(x_i)$$
where $n$ is the number of buckets and $P^{+}(x_i)$ is the normalized frequency of bucket $i$ among the positive cases. This metric summarizes the diversity of positive cases with respect to the oracle features and is higher for more complex phenotypes.
We compare the distributions of the positive and negative cases using the Kullback-Leibler (KL) divergence.45 For discrete probability distributions $P^{+}$ and $P^{-}$, the KL divergence from $P^{-}$ to $P^{+}$ is defined as:
$$D_{\mathrm{KL}}(P^{+} \,\|\, P^{-}) = \sum_{i=1}^{n} P^{+}(x_i) \log \frac{P^{+}(x_i)}{P^{-}(x_i)}$$
where $n$ is the number of buckets,[c] and $P^{+}(x_i)$ and $P^{-}(x_i)$ are the normalized frequencies of bucket $i$ for cases and controls respectively. KL divergence measures the dissimilarity between the case and control distributions and is lower for the phenotypes that are harder to discriminate.[d]
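Both metrics can be computed directly from the bucket counts, as in the sketch below. Several choices here are assumptions rather than details given in the text: the natural logarithm, add-one (Laplace) smoothing applied only to the KL divergence, and entropy computed on the unsmoothed histogram.

```python
import math


def normalize(counts, alpha=0.0):
    # alpha > 0 applies additive (Laplace) smoothing before normalizing.
    total = sum(counts) + alpha * len(counts)
    return [(c + alpha) / total for c in counts]


def entropy(counts):
    # Shannon entropy of the histogram; empty buckets contribute zero.
    probs = normalize(counts)
    return -sum(p * math.log(p) for p in probs if p > 0)


def kl_divergence(pos_counts, neg_counts, alpha=1.0):
    # D_KL(P+ || P-), smoothed so that no bucket has zero probability.
    p = normalize(pos_counts, alpha)
    q = normalize(neg_counts, alpha)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


uniform_entropy = entropy([5, 5, 5, 5])        # all buckets equally likely
point_entropy = entropy([10, 0, 0, 0])         # every case in one bucket
same_kl = kl_divergence([3, 7], [30, 70])      # similar shapes, nonzero after smoothing
far_kl = kl_divergence([10, 0], [0, 10])       # disjoint support
```

Because smoothing depends on the raw counts, two histograms with identical proportions but different totals (as in `same_kl`) give a small nonzero divergence in this sketch.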
Neural Net Details
All neural nets used ReLU activations46 for the hidden layers and Xavier initialization47 and were trained using Adam48 with standard parameters (β1 = 0.9 and β2 = 0.99) for 6 epochs.[e] We controlled overfitting with batch normalization and early stopping on the validation set.
4. Experiments and Results
In this section, we present results that provide insight into the following questions:
- When does MTL improve performance relative to single-task models for phenotyping?
- How do the effects of MTL change with the number of phecodes as auxiliary tasks?
- How do the neural net methods compare with strong baseline methods, and what are the characteristics of the tasks for which they provide some benefit?
4.1. When Does Multitask Learning Improve Performance?
We investigate the performance of MTNNs over a range of hyperparameter settings and over multiple random splits of the data. MTNN performance is compared to the performance of STNNs over the same hyperparameter settings and data splits. Figure 3 shows the optimal MTNN and STNN performance achieved on each split for the four phenotypes. We find that MTNNs consistently outperform STNNs for the low prevalence phenotypes, i.e. angioedema and abdominal aneurysm. In contrast, MTL harms performance for the relatively high-prevalence phenotypes, i.e. T2DM and atrial fibrillation. The left plot in Figure 4 shows the pairwise differences between MTNN and STNN optimal performance across the splits.
Fig. 3.
MTNN and STNN performance for Angioedema, Abdominal Aneurysm, Atrial Fibrillation, and Type 2 Diabetes Mellitus with various hyperparameter settings across the ten splits; the best case MTNN and STNN performance is emphasized by the solid dots: the blue and red dots correspond to MTNNs and STNNs respectively.
Fig. 4.
The left plot shows the pairwise differences in AUPRC values of the optimal MTNNs and STNNs for Angioedema, Abdominal Aneurysm, Atrial Fibrillation, and Type 2 Diabetes Mellitus across the ten splits. The right plot shows the pairwise differences in AUPRC values of the optimal STNNs and MTNNs with different numbers of phecodes as auxiliary tasks.
Moreover, the performance of STNNs is very sensitive to hyperparameter settings for the low prevalence phenotypes, as illustrated by the large spread in AUPRC values (see Figure 3). In contrast, MTNNs are more robust to hyperparameter settings for these phenotypes. In practice, tuning neural nets is time-consuming and finding an ideal model demands extensive computation. MTL may increase our chance of finding a reasonable model, which is of practical value when one has a limited computational budget on model space exploration.
4.2. Relationship Between Performance and Number of Tasks
We investigate how MTL is influenced by the number of auxiliary tasks as defined in the form of phecodes. We trained MTNNs with nested sets of 5, 10, and 20 randomly selected phecodes (i.e. the 5-phecode set is a subset of the 10-phecode set, and so on), and report the performance with the optimal hyperparameter setting for each split. The right plot in Figure 4 shows pairwise differences in AUPRC values between MTNNs and STNNs. For the low prevalence phenotypes, adding more phecodes increases the performance gains. Similarly, adding more phecodes for the high prevalence phenotypes leads to more severe negative effects, though the scale of the negative effects is smaller than that of the positive effects for the low prevalence phenotypes.[f]
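The per-split comparison behind these pairwise differences can be sketched as follows. The AUPRC values are made-up placeholders rather than results from this study. Taking the maximum over hyperparameter settings within each split is one reading of "optimal", adopted here as an assumption.

```python
def best_auprc(auprcs_for_split):
    # Best AUPRC over all hyperparameter settings tried on one split.
    return max(auprcs_for_split)


def pairwise_differences(mtnn_results, stnn_results):
    # One difference per split: positive values mean the MTNN did better.
    return [best_auprc(m) - best_auprc(s) for m, s in zip(mtnn_results, stnn_results)]


# Placeholder values: three splits, two hyperparameter settings each.
mtnn_toy = [[0.30, 0.35], [0.28, 0.33], [0.31, 0.29]]
stnn_toy = [[0.20, 0.32], [0.25, 0.30], [0.27, 0.22]]
diffs = pairwise_differences(mtnn_toy, stnn_toy)
```

Pairing by split means each difference compares models fit and evaluated on the same partition of the data, which removes split-to-split variation from the comparison.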
4.3. Comparison with Logistic Regression Baseline
In discussing the merits of MTL, it is important to also compare the performance against simpler baseline methods in addition to single-task neural nets. We compare the performance of the neural nets with L1 regularized logistic regression (LR), a consistently strong baseline for EHR data49,50 (see Figure 5). LR is consistently outperformed by the neural nets for abdominal aneurysm and type 2 diabetes mellitus, which are low and high prevalence respectively. For angioedema, a low prevalence phenotype, performance relative to LR is inconsistent across the splits, although MTNNs consistently beat STNNs. For atrial fibrillation, a high prevalence phenotype, MTNNs and STNNs provide little or no benefit over LR. Prevalence alone is therefore insufficient to account for the performance of MTNNs and STNNs relative to LR.
Fig. 5.
MTNN, STNN, and LR optimal performance for Angioedema, Abdominal Aneurysm, Atrial Fibrillation, and Type 2 Diabetes Mellitus across splits: the blue squares, the red triangles, and the green dots correspond to MTNN, STNN, and LR respectively.
4.4. Interaction between Phenotype Prevalence and Complexity
Our comparison of MTNNs and STNNs versus LR suggests that phenotype prevalence alone cannot explain when neural nets outperform simpler linear models. We hypothesized that phenotype complexity also plays a role since neural nets with or without MTL can automatically model non-linearities and interactions, while LR must have non-linearities and interactions explicitly encoded in features. We leveraged the rule-based phenotype definitions to explore this hypothesis and found evidence of an interaction between phenotype prevalence and complexity.
Phenotype Complexity
For each phenotype, we generated histograms of the observed combinations of the oracle features for the positive and negative cases (see Figure 6) and calculated the information entropy of the positive cases and the KL divergence between the positive and negative cases (see Table 1) as described in Methods 3.2.
Fig. 6.
Distributions of the combinations of the oracle features involved in the rule-based definitions for Angioedema, Abdominal Aneurysm, Atrial Fibrillation, and Type 2 Diabetes Mellitus. The yellow and blue bars correspond to the positive and negative cases respectively. The x-axes represent the buckets of unique combinations of the oracle features: in our study, we use 32 buckets. Note that the choice of 32 buckets was arbitrary and not tuned in any way.
Table 1.
Phenotype Complexity
| Phenotype | Prevalence | Entropy | KL Divergence |
|---|---|---|---|
| Angioedema | 0.08% | 3.233 | 0.930 |
| Abdominal Aneurysm | 0.12% | 1.396 | 2.414 |
| Atrial Fibrillation | 2.89% | 0.709 | 5.383 |
| Type 2 Diabetes Mellitus | 2.95% | 3.012 | 3.806 |
We find that atrial fibrillation, a high-prevalence phenotype, has low entropy and high KL divergence. With respect to the oracle features, all the positive cases are similar to each other, while the positive and negative cases are very dissimilar to each other. A relatively simple model should be able to capture this, explaining the observation that LR achieves comparable performance to MTNNs and STNNs for this phenotype.
Abdominal aneurysm, a low prevalence phenotype, and T2DM, a high prevalence phenotype, have higher information entropy and lower KL divergence values than atrial fibrillation. Thus, the positive cases are more diverse and discrimination is more difficult than atrial fibrillation with respect to each phenotype’s oracle features. For these phenotypes, both MTNNs and STNNs outperform LR – we benefit from more expressive models. However, whether MTNNs beat STNNs depends on prevalence.
Finally, angioedema has the highest entropy and lowest KL divergence – it is both the most complex and hardest to discriminate of the four phenotypes. Complex phenotypes should benefit from more expressive models. However, we observe that while MTNNs consistently outperform STNNs, their performance relative to LR is inconsistent across splits. One possible explanation for this behavior is that relative performance is sensitive to the assignment of patients to training, validation and test sets: with such diverse cases and common support with respect to the oracle features, it is much more likely for the test set to contain patients unlike any seen in the training set.
5. Limitations
We have set out to investigate MTL and its effectiveness for electronic phenotyping. However, our work has important limitations. First, we randomly select phecodes for auxiliary tasks, but it has been argued that auxiliary tasks should be directly related to the target task.51 It is possible that better auxiliary tasks would improve the benefit of MTL. Specifically, more related phecodes might mitigate or eliminate the performance degradation observed for the high-prevalence phenotypes or the inconsistent relative performance between MTNN and LR for angioedema. However, the notion of task relatedness is underspecified, so it is difficult to compute in order to select auxiliary tasks. Indeed, in preliminary work we explored various formulations of relatedness to select auxiliary tasks but found that none performed better than random selection. One could ask domain experts to manually construct or pick auxiliary tasks for specific phenotypes, but this is beyond the scope of this work. Moreover, it has also been shown that task relatedness is unnecessary for MTL to provide benefits.52 Nevertheless, we acknowledge that further exploring how to improve multitask learning for electronic phenotyping is an interesting line of inquiry for future work. Second, to address the unavailability of large-scale ground truth phenotypes, we use rule-based definitions because they are transparent and available, but we recognize that the phenomena we observe may be artifacts of the rule-based definitions. We also acknowledge the possibility that the observed phenomena might not generalize to other phenotypes; we focused on four phenotypes to conduct an in-depth examination, sacrificing breadth. Finally, the rule-based phenotype definitions contain predicates encoding temporal relationships, e.g., a drug code followed by a diagnosis code. Our simple multi-hot feature representation does not encode temporal information. As a result, there is an upper bound on the performance of any statistical classifier using this feature representation.
6. Conclusion
We have investigated the effectiveness of multitask learning on electronic phenotyping with EHR data, aiming to elucidate the properties of situations for which MTL improves or harms performance. We trained multitask neural networks to classify a target phenotype jointly with auxiliary tasks drawn from phecodes. We found that MTL provided consistent performance improvements over single-task neural networks on extremely rare phenotypes. However, for relatively higher prevalence phenotypes, MTL actually reduced performance. In both cases, the effect scaled with the number of auxiliary tasks as defined in the form of phecodes. Moreover, we found that MTL improved the robustness of neural networks to hyperparameter settings for the extremely rare phenotypes, which is of practical value in situations when one has a limited computational budget for model exploration. Finally, we analyzed phenotype complexity to shed light on the relative performance of both MTNN and STNN versus well-tuned L1 regularized logistic regression baselines and found evidence of an interaction between phenotype prevalence and complexity. We showed that simple linear models are sufficient for non-complex phenotyping tasks. More expressive models can substantially improve performance for more complex phenotypes, but only if the data support learning them well, which may be problematic for rare phenotypes.
Acknowledgments
This work was supported by NLM R01-LM011369-05 and a grant supporting the Observational Health Data Science and Informatics (OHDSI) by Janssen Research and Development LLC. Internal funding by the School of Medicine at Stanford also supported part of this work. We gratefully acknowledge Jason Fries for many helpful discussions about this work.
Footnotes
a. The prevalence is low compared to the population prevalence of approximately 9% because the rule-based definitions from PheKB are tuned for high precision at the cost of lower recall.
b. There is no direct way to quantify the complexity of the rule-based definitions shown in Figure 2.
c. KL divergence does not admit zero probabilities, so we use Laplace smoothing on the distributions to deal with combinations that do not have mutual support.
d. Please refer to https://arxiv.org/abs/1808.03331 for a more detailed description of our method.
e. We found that 6 epochs was sufficient for all models to converge.
f. This dose-response relationship with the number of auxiliary tasks recapitulates the findings of Ramsundar et al.,14 but we find that the relationship holds for both the benefit and harm of MTL.
References
1. Crawford DC, Crosslin DR, Tromp G, Kullo IJ, Kuivaniemi H, Hayes MG, Denny JC, Bush WS, Haines JL, Roden DM et al., Frontiers in Genetics 5, p. 184 (2014).
2. Manion FJ, Harris MR, Buyuktur AG, Clark PM, An LC and Hanauer DA, Current Oncology Reports 14, 494 (2012).
3. Longhurst CA, Harrington RA and Shah NH, Health Affairs 33, 1229 (2014).
4. Wei W-Q and Denny JC, Genome Medicine 7, p. 41 (2015).
5. Shah NH, Nature Biotechnology 31, p. 1095 (2013).
6. Agarwal V, Podchiyska T, Banda JM, Goel V, Leung TI, Minty EP, Sweeney TE, Gyang E and Shah NH, Journal of the American Medical Informatics Association 23, 1166 (2016).
7. Halpern Y, Horng S and Sontag D, Proceedings of the 1st Machine Learning in Health Care (MLHC), p. 209 (2016).
8. Banda J, Halpern Y, Sontag D and Shah N, AMIA Summits on Translational Science Proceedings, p. 48 (2017).
9. Caruana R, Baluja S and Mitchell TM, Advances in Neural Information Processing Systems, 959 (1995).
10. Girshick R, Donahue J, Darrell T and Malik J, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 580 (2014).
11. Plank B, Søgaard A and Goldberg Y, Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (2016).
12. Liu P, Qiu X and Huang X, Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, 1 (2017).
13. Toshniwal S, Tang H, Lu L and Livescu K, 18th Annual Conference of the International Speech Communication Association, 3532 (2017).
14. Ramsundar B, Kearnes S, Riley P, Webster D, Konerding D and Pande V, arXiv preprint arXiv:1502.02072 (2015).
15. Zhang P, Wang F and Hu J, AMIA Annual Symposium Proceedings 2014, p. 1258 (2014).
16. Che Z, Kale D, Li W, Bahadori MT and Liu Y, Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 507 (2015).
17. Nori N, Kashima H, Yamashita K, Ikai H and Imanaka Y, Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 855 (2015).
18. Ghassemi M, Pimentel MAF, Naumann T, Brennan T, Clifton DA, Szolovits P and Feng M, Proceedings of the 29th Conference on Artificial Intelligence, 446 (2015).
19. Ngufor C, Upadhyaya S, Murphree D, Kor D and Pathak J, IEEE International Conference on Data Science and Advanced Analytics, 1 (2015).
20. Wang X, Wang F, Hu J and Sorrentino R, AMIA Annual Symposium Proceedings 2014, p. 1180 (2014).
21. Razavian N, Marcus J and Sontag D, Machine Learning for Healthcare Conference, 73 (2016).
22. Zhou J, Yuan L, Liu J and Ye J, Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 814 (2011).
23. Lipton ZC, Kale DC, Elkan C and Wetzel R, arXiv preprint arXiv:1511.03677 (2015).
24. Newton KM, Peissig PL, Kho AN, Bielinski SJ, Berg RL, Choudhary V, Basford M, Chute CG, Kullo IJ, Li R et al., Journal of the American Medical Informatics Association 20, e147 (2013).
25. Overby CL, Pathak J, Gottesman O, Haerian K, Perotte A, Murphy S, Bruce K, Johnson S, Talwalkar J, Shen Y et al., Journal of the American Medical Informatics Association 20, e243 (2013).
26. Mo H, Thompson WK, Rasmussen LV, Pacheco JA, Jiang G, Kiefer R, Zhu Q, Xu J, Montague E, Carrell DS et al., Journal of the American Medical Informatics Association 22, 1220 (2015).
27. Kho AN, Pacheco JA, Peissig PL, Rasmussen L, Newton KM, Weston N, Crane PK, Pathak J, Chute CG, Bielinski SJ et al., Science Translational Medicine 3, 79re1 (2011).
28. Conway M, Berg RL, Carrell D, Denny JC, Kho AN, Kullo IJ, Linneman JG, Pacheco JA, Peissig P, Rasmussen L et al., AMIA Annual Symposium Proceedings 2011, p. 274 (2011).
29. Huang Y, McCullagh P, Black N and Harper R, Artificial Intelligence in Medicine 41, 251 (2007).
30. Chen Y, Ghosh J, Bejan CA, Gunter CA, Gupta S, Kho A, Liebovitz D, Sun J, Denny J and Malin B, Journal of Biomedical Informatics 55, 82 (2015).
31. Zhou J, Wang F, Hu J and Ye J, Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 135 (2014).
32. Ho JC, Ghosh J and Sun J, Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 115 (2014).
33. Halpern Y, Horng S, Choi Y and Sontag D, Journal of the American Medical Informatics Association 23, 731 (2016).
34. Type 2 diabetes mellitus. https://phekb.org/phenotype/18, Accessed: 2018-07-23.
35. Lowe HJ, Ferris TA, Hernandez PM and Weber SC, American Medical Informatics Association Annual Symposium (2009).
36. Kirby JC, Speltz P, Rasmussen LV, Basford M, Gottesman O, Peissig PL, Pacheco JA, Tromp G, Pathak J, Carrell DS et al., Journal of the American Medical Informatics Association 23, 1046 (2016).
37. Wei W-Q, Bastarache LA, Carroll RJ, Marlo JE, Osterman TJ, Gamazon ER, Cox NJ, Roden DM and Denny JC, PloS One 12, p. e0175508 (2017).
38. Li Y, Du N and Bengio S, arXiv preprint arXiv:1708.00065 (2017).
39. Melis G, Dyer C and Blunsom P, arXiv preprint arXiv:1707.05589 (2017).
40. Lucic M, Kurach K, Michalski M, Gelly S and Bousquet O, arXiv preprint arXiv:1711.10337 (2017).
41. Oliver A, Odena A, Raffel C, Cubuk ED and Goodfellow IJ, arXiv preprint arXiv:1804.09170, 1 (2018).
42. Keskar NS, Mudigere D, Nocedal J, Smelyanskiy M and Tang PTP, arXiv preprint arXiv:1609.04836 (2016).
43. Saito T and Rehmsmeier M, PloS One 10, p. e0118432 (2015).
44. Shannon CE, ACM SIGMOBILE Mobile Computing and Communications Review 5, 3 (2001).
45. Kullback S and Leibler RA, The Annals of Mathematical Statistics 22, 79 (1951).
46. Nair V and Hinton GE, 807 (2010).
47. Glorot X and Bengio Y, Proceedings of the 13th International Conference on Artificial Intelligence and Statistics, 249 (2010).
48. Kingma DP and Ba J, arXiv preprint arXiv:1412.6980 (2014).
49. Rajkomar A, Oren E, Chen K, Dai AM, Hajaj N, Hardt M, Liu PJ, Liu X, Marcus J, Sun M et al., Digital Medicine 1, p. 18 (2018).
50. Razavian N, Blecker S, Schmidt AM, Smith-McLallen A, Nigam S and Sontag D, Big Data 3, 277 (2015).
51. Caruana R, Machine learning 28, 41 (1997).
52. Romera-Paredes B, Argyriou A, Berthouze N and Pontil M, Proceedings of the 15th International Conference on Artificial Intelligence and Statistics, 951 (2012).






