Abstract
Phenotyping algorithms are essential tools for conducting clinical research on observational data. Manually developed phenotyping algorithms, such as those curated within the eMERGE (electronic Medical Records and Genomics) Network, represent the gold standard but are time-consuming to create. In this work, we propose a framework for learning from the structure of eMERGE phenotype concept sets to assist construction of novel phenotype definitions. We use eMERGE phenotypes as a source of reference concept sets and engineer rich features characterizing the concept pairs within each set. We treat these pairwise relationships as edges in a concept graph, train models to perform edge prediction, and identify candidate phenotype concept sets as highly connected subgraphs. Candidate concept sets may then be interrogated and composed to construct novel phenotype definitions.
Introduction
Phenotyping algorithms are essential tools for conducting clinical research on observational data. Developing phenotypes in a high-throughput manner, however, remains a major challenge1. In practice, phenotype development often involves identifying and iteratively refining sets of concepts from controlled terminologies (herein “concept sets”) which, when composed with rules, yield a phenotype definition. This expert-driven, manual process is time-consuming, and it remains a persistent bottleneck in phenotype development. As such, past efforts have tried to minimize the resource burden associated with phenotype concept set curation2–4. Most of these works have formulated the problem of concept set construction by considering the concept as the base instance and developing concept feature representations. In this work, we consider the concept pair as our base instance.
For our purposes, a phenotype is an observable manifestation of a clinical entity (e.g. a disease). Each phenotype is represented by one or more concept sets. If we treat concepts as nodes and concept pairs as edges, then a concept set is a fully connected subgraph, or clique, within a larger concept graph. From this perspective, constructing a phenotype concept set is equivalent to constructing a concept clique.
We learn how to build concept sets by first learning how to predict edges in the concept graph. We do so by training binary classifiers on rich concept pair features; a concept pair is a positive instance if both concepts belong to a common phenotype concept set, and negative otherwise. To obtain ground-truth concept-pair labels, we extract a reference set of concept sets from the eMERGE Network’s Phenotype KnowledgeBase (PheKB) phenotype definitions. We evaluate our concept pair prediction models on held-out PheKB concept pairs (positive) and random samples of concept pairs not found in any PheKB phenotypes (negative).
Once trained, our models estimate edge likelihoods for all possible concept pairs. We propose two approaches to recovering highly connected subgraphs from these edge likelihood estimates to serve as candidate phenotype concept sets for use in phenotype construction: maximal clique and greedy concept set construction. We evaluate the ability of each method to recover all concept pairs from a sample of fully held-out PheKB concept sets.
1 Methods
1.1 Phenotype Reference Sets
Our datasets are derived from the eMERGE phenotypes, standardized vocabularies from the Observational Medical Outcomes Partnership Common Data Model (OMOP CDM) as defined by the Observational Health Data Sciences and Informatics (OHDSI) Program, and clinical data from New York Presbyterian Hospital, a major tertiary care hospital, formatted to the OMOP CDM standard5. The standardized vocabularies describe ontological relationships among clinical terms such as conditions, procedures, measurements, and observations, all referred to in this work as concepts.
We generate a reference set of phenotypes from clinically validated Phenotype KnowledgeBase (PheKB) phenotype definitions from the eMERGE Network, including a broad range of disorders such as diabetes, autism, cataract, and severe childhood obesity6, 7. We treat this reference set as a ground truth source of phenotype concept sets. For each reference set phenotype, we extract all disease (ICD-9-CM, ICD-10-CM), procedure (HCPCS, CPT-4, ICD9Proc, ICD-10 PCS) and measurement (LOINC) codes. We then map these codes to concepts in the OMOP standardized vocabularies.
We use the structure of the OMOP standard vocabularies and concept observations in clinical data to build features for all concept pairs within our reference set. We describe these features in detail below.
1.2 Concept pair features
We developed 21 different features describing pairwise relationships among concepts. Each pair is characterized by its co-occurrence, semantic, lexical, and embedding similarity, and a binary indicator encoding the presence or absence of the pair in any of our phenotype concept sets. We grouped these features into four categories: lexical, semantic, co-occurrence, and concept embedding.
Lexical features. Our lexical features measure the degree to which two concepts, ci and cj , are similar linguistically. Each measure operates on the string representation of each concept’s name, si and sj . We calculated 5 lexical features for each concept pair:
• Levenshtein distance: A recursive definition for the absolute Levenshtein distance between si and sj is as follows:

$$\mathrm{lev}(m, n) = \begin{cases} \max(m, n) & \text{if } \min(m, n) = 0 \\ \min \begin{cases} \mathrm{lev}(m-1, n) + 1 \\ \mathrm{lev}(m, n-1) + 1 \\ \mathrm{lev}(m-1, n-1) + \mathbb{1}[s_i[m] \neq s_j[n]] \end{cases} & \text{otherwise} \end{cases}$$

where m and n index character positions in si and sj. The distance is calculated for the full strings by passing in their lengths, |si| and |sj|.
• Levenshtein ratio: Normalizes the Levenshtein distance by the maximum length of the two concept strings, si and sj:

$$\mathrm{lev\_ratio}(s_i, s_j) = \frac{\mathrm{lev}(|s_i|, |s_j|)}{\max(|s_i|, |s_j|)}$$
• Jaro: Measures similarity between si and sj in terms of character matches and transpositions:

$$\mathrm{sim_J}(s_i, s_j) = \begin{cases} 0 & \text{if } m = 0 \\ \frac{1}{3} \left( \frac{m}{|s_i|} + \frac{m}{|s_j|} + \frac{m - t}{m} \right) & \text{otherwise} \end{cases}$$
where m is the number of matching characters and t is half the number of character transpositions in si and sj .
-
• Jaro-Winkler: A modification of Jaro metric which gives higher weight to strings that match beginning at a set prefix proportional length. We use l = 0.1:
i j i j 10 i j
math
• Fuzz partial ratio: A Levenshtein-based measure which focuses on substring matching. Letting sshort be the shorter of the two strings and slong the longer:

$$\mathrm{partial\_ratio}(s_i, s_j) = \max_{t \subseteq s_{\mathrm{long}},\ |t| = |s_{\mathrm{short}}|} \mathrm{ratio}(s_{\mathrm{short}}, t)$$

where t ranges over the substrings of the longer string and ratio is the normalized Levenshtein similarity.
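The lexical measures above are standard string metrics. As a minimal pure-Python sketch (written for illustration; the paper does not specify its string-matching implementation), three of the five measures could be computed as:

```python
def levenshtein(a: str, b: str) -> int:
    """Absolute Levenshtein distance via dynamic programming (two-row variant)."""
    m, n = len(a), len(b)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cur[j] = min(prev[j] + 1,                       # deletion
                         cur[j - 1] + 1,                    # insertion
                         prev[j - 1] + (a[i - 1] != b[j - 1]))  # substitution
        prev = cur
    return prev[n]

def levenshtein_ratio(a: str, b: str) -> float:
    """Distance normalized by the longer string's length."""
    if not a and not b:
        return 0.0
    return levenshtein(a, b) / max(len(a), len(b))

def jaro(a: str, b: str) -> float:
    """Jaro similarity: matched characters within a window, halved transpositions."""
    if not a or not b:
        return 0.0
    window = max(len(a), len(b)) // 2 - 1
    match_a, match_b = [False] * len(a), [False] * len(b)
    m = 0
    for i, ca in enumerate(a):
        lo, hi = max(0, i - window), min(len(b), i + window + 1)
        for j in range(lo, hi):
            if not match_b[j] and b[j] == ca:
                match_a[i] = match_b[j] = True
                m += 1
                break
    if m == 0:
        return 0.0
    # t = half the number of matched characters that are out of order
    t = sum(x != y for x, y in zip(
        (c for c, f in zip(a, match_a) if f),
        (c for c, f in zip(b, match_b) if f))) / 2
    return (m / len(a) + m / len(b) + (m - t) / m) / 3
```

In practice, libraries such as rapidfuzz provide optimized versions of these measures, including the partial ratio.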
Semantic features. The similarity score between two concepts, ci and cj , is based on the likeness of their meaning or semantic content. This set of features is meant to represent the semantic overlap between each concept. Eight different semantic features were extracted:
• Ancestry indicator: Indicator function encoding whether one concept is an ancestor of the other:

$$\mathrm{anc}(c_i, c_j) = \begin{cases} 1 & \text{if } c_i \in A(c_j) \text{ or } c_j \in A(c_i) \\ 0 & \text{otherwise} \end{cases}$$

where A(c) returns all ancestors of a concept c.
• Semantic similarity: Let dist(ci, cj) measure the minimum path distance between concepts ci and cj in their graph of origin. Semantic similarity takes the ratio of the distance between the nearest common ancestor ca of concepts ci and cj and the root ancestor R, and the concepts’ respective distances to R:

$$\mathrm{sim}(c_i, c_j) = \frac{2 \cdot \mathrm{dist}(c_a, R)}{\mathrm{dist}(c_i, R) + \mathrm{dist}(c_j, R)}$$
• Resnik’s similarity8: Information content (IC) is the negative log of the empirical probability of a concept and all its descendants. Given two concepts, ci and cj, the maximal information content ancestor (MICA) is the common ancestor with the largest IC. Resnik similarity is the IC of the MICA:

$$\mathrm{sim_{Res}}(c_i, c_j) = \mathrm{IC}(\mathrm{MICA}(c_i, c_j))$$
• Jiang measure9: This measure builds on Resnik’s similarity, considering the information content of the two concepts ci and cj as well as that of their MICA:

$$\mathrm{dist_{Jiang}}(c_i, c_j) = \mathrm{IC}(c_i) + \mathrm{IC}(c_j) - 2 \cdot \mathrm{IC}(\mathrm{MICA}(c_i, c_j))$$
• Lin measure9: This measure reweights Resnik’s similarity to account for the IC of each concept being compared. If ci or cj have high (low) IC, the Resnik ratio will be down- (up-) weighted:

$$\mathrm{sim_{Lin}}(c_i, c_j) = \frac{2 \cdot \mathrm{IC}(\mathrm{MICA}(c_i, c_j))}{\mathrm{IC}(c_i) + \mathrm{IC}(c_j)}$$
• Relevance measure9: Let p(c) be the empirical probability of concept c and all its descendants. The relevance measure weights the Lin measure by 1 − p(MICA(ci, cj)):

$$\mathrm{sim_{Rel}}(c_i, c_j) = \mathrm{sim_{Lin}}(c_i, c_j) \cdot \left(1 - p(\mathrm{MICA}(c_i, c_j))\right)$$
• Information coefficient9: This measure builds upon the Lin measure and improves on the Relevance measure. The adjustment term is more sensitive to the information content, especially for concepts close to the root (i.e. with empirical probability close to 1) or the leaves (i.e. with empirical probability close to 0) of the taxonomy:

$$\mathrm{sim_{IC}}(c_i, c_j) = \mathrm{sim_{Lin}}(c_i, c_j) \cdot \left(1 - \frac{1}{1 + \mathrm{IC}(\mathrm{MICA}(c_i, c_j))}\right)$$
• GraphIC measure9: GraphIC was originally developed to compare the portion of the Gene Ontology graph shared by a pair of proteins. The metric takes all common ancestors of two concepts into account. Below, A(c) returns all the ancestors of a concept, c:

$$\mathrm{sim_{GIC}}(c_i, c_j) = \frac{\sum_{c \in A(c_i) \cap A(c_j)} \mathrm{IC}(c)}{\sum_{c \in A(c_i) \cup A(c_j)} \mathrm{IC}(c)}$$
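Once the empirical probabilities of each concept and of the MICA are precomputed, the IC-based measures reduce to simple arithmetic. A sketch under that assumption (the probability arguments are hypothetical inputs; looking up the MICA itself is not shown):

```python
import math

def ic(p: float) -> float:
    """Information content: negative log of the empirical probability
    of a concept together with all its descendants."""
    return -math.log(p)

def resnik(p_mica: float) -> float:
    """Resnik similarity: IC of the maximal information content ancestor."""
    return ic(p_mica)

def lin(p_i: float, p_j: float, p_mica: float) -> float:
    """Lin measure: MICA's IC relative to the concepts' own ICs."""
    return 2 * ic(p_mica) / (ic(p_i) + ic(p_j))

def jiang_distance(p_i: float, p_j: float, p_mica: float) -> float:
    """Jiang-Conrath distance: combined IC minus twice the MICA's IC."""
    return ic(p_i) + ic(p_j) - 2 * ic(p_mica)

def relevance(p_i: float, p_j: float, p_mica: float) -> float:
    """Relevance measure: Lin weighted by 1 - p(MICA)."""
    return lin(p_i, p_j, p_mica) * (1 - p_mica)

def info_coefficient(p_i: float, p_j: float, p_mica: float) -> float:
    """Information coefficient: Lin with an IC-sensitive adjustment term."""
    return lin(p_i, p_j, p_mica) * (1 - 1 / (1 + ic(p_mica)))
```

Note that when a concept is its own MICA (p_i = p_j = p_mica), Lin similarity is 1 and the Jiang distance is 0, as expected for identical concepts.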
Co-occurrence features. Co-occurrence matrices are computed based on domain tables from the latest OMOP CDM (inpatient and outpatient data) database available from New York Presbyterian Hospital-Columbia University Medical Center. Domain tables include conditions, procedures, drugs, measurements, and observations. Concept co-occurrence is calculated based on the frequency with which two concepts, ci and cj, co-occur within windowed patient time-series. We used various time windows to measure the co-occurrence matrices, including 60 days, 90 days, 180 days, 360 days, and lifetime.
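A minimal sketch of windowed co-occurrence counting; the `patient_events` structure and the quadratic per-patient loop are illustrative assumptions, not the paper's implementation (which operates over OMOP domain tables at scale):

```python
from collections import Counter
from itertools import combinations

def cooccurrence_counts(patient_events, window_days):
    """Count, over all patients, how often each unordered concept pair
    occurs within `window_days` of one another.  `patient_events` maps a
    patient id to a list of (day, concept_id) observations."""
    counts = Counter()
    for events in patient_events.values():
        events = sorted(events)  # order by day so d2 >= d1 below
        for (d1, c1), (d2, c2) in combinations(events, 2):
            if d2 - d1 <= window_days and c1 != c2:
                counts[tuple(sorted((c1, c2)))] += 1
    return counts
```

A production version would use a sliding window rather than all pairs, but the counting semantics are the same: each pair of distinct concepts observed within the window contributes one count.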
Concept embedding features. The GloVe algorithm10 is run on the single visit, 180 day, 5 year, and lifetime co- occurrence matrices to generate concept embeddings. We then calculate a cosine similarity matrix for all concept pairs based on each concept’s embedding.
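The cosine similarity step can be sketched as follows, assuming GloVe has already produced a matrix of row-vector concept embeddings:

```python
import numpy as np

def cosine_similarity_matrix(embeddings: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity for row-vector concept embeddings:
    normalize each row to unit length, then take the Gram matrix."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)  # guard zero vectors
    return unit @ unit.T
```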
1.3 Concept pair prediction
We may train any binary classifier to perform concept pair prediction. Therefore, we experiment with a suite of models to explore how various model types perform within our framework. We train our models to predict if two concepts should appear together within a phenotype concept set using the rich concept pair features described above. Later we describe how these models are used to build concept sets.
Models. We use binary classifiers implemented in the popular SciKit Learn Python package11 including L1- and L2-regularized logistic regression, naive Bayes, decision trees, random forest, gradient boosted trees, and adaboost.
Experiments and Evaluation. Our datasets are heavily imbalanced because the overwhelming majority of concept pairs do not occur within any phenotype concept set. To correct for this, we randomly sample negative concept pairs to match the number of positive pairs.
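The balanced negative sampling step can be sketched as below; `feats` and `pos` are toy stand-ins for the real concept pair feature table and the PheKB-derived positive pairs, and the resulting (X, y) could be fed to any scikit-learn classifier:

```python
import random

def balanced_training_set(positive_pairs, all_pair_features, seed=0):
    """Match each positive concept pair with one randomly sampled
    negative pair.  `all_pair_features` maps a concept pair to its
    feature vector; any pair not listed as positive is a negative."""
    rng = random.Random(seed)
    negatives = [p for p in all_pair_features if p not in positive_pairs]
    sampled = rng.sample(negatives, len(positive_pairs))
    X = [all_pair_features[p] for p in positive_pairs + sampled]
    y = [1] * len(positive_pairs) + [0] * len(sampled)
    return X, y

# hypothetical toy feature table: pair -> [semantic, lexical] features
feats = {("a", "b"): [0.9, 0.8], ("a", "c"): [0.1, 0.2],
         ("b", "c"): [0.85, 0.7], ("a", "d"): [0.05, 0.1]}
pos = [("a", "b"), ("b", "c")]
X, y = balanced_training_set(pos, feats)
```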
We evaluate our models’ performance with respect to 1) random held-out concept pair prediction (random hold-out) and 2) prediction of concept pairs within fully held-out phenotype concept sets (phenotype-aware hold-out). The latter setting is meant to mimic prediction of concept pairs for a novel phenotype. In this case none of the concept set’s pairs would be available for training.
For random hold-out, we construct a held-out test set by randomly sampling 10% of our positive concept pairs as well as an equal number of negative concept pairs. The remaining 90% of positive pairs along with an equal number of randomly sampled negative pairs are used as the training set. We repeat this process ten times to generate ten random training and test sets.
For phenotype-aware hold-out, we randomly select ten existent phenotype concept sets whose positive concept pairs contain approximately 10% of all positive pairs. We hold out all positive pairs contained in these concept sets and randomly sample an equal number of negative pairs to create a test set. For the training set, we use all remaining positive pairs along with a matching number of randomly sampled negative pairs. As with random hold-out, we repeat this process ten times to generate ten phenotype-aware held-out training and test sets.
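The phenotype-aware split can be sketched as follows; the `phenotype_pairs` mapping is an assumed input, and pairs shared between held-out and retained phenotypes are removed from training to avoid leakage:

```python
import random

def phenotype_aware_split(phenotype_pairs, frac=0.10, seed=0):
    """Hold out whole phenotypes until roughly `frac` of all positive
    pairs are held out.  `phenotype_pairs` maps a phenotype name to its
    set of positive concept pairs."""
    rng = random.Random(seed)
    total = sum(len(p) for p in phenotype_pairs.values())
    names = list(phenotype_pairs)
    rng.shuffle(names)
    held, n_held = [], 0
    for name in names:
        if n_held >= frac * total:
            break
        held.append(name)
        n_held += len(phenotype_pairs[name])
    test = set().union(*(phenotype_pairs[n] for n in held))
    train = set().union(*(phenotype_pairs[n] for n in names if n not in held))
    return train - test, test  # drop shared pairs from training
```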
To determine the feature weightings that influenced model performance, we re-ran L1 logistic regression on all 15 non-empty combinations of the 4 feature groups (semantic, lexical, co-occurrence, and concept embedding).
1.4 Concept set recovery
Our aim is to use concept pair prediction models to construct novel phenotype concept sets. To do this, we use our estimated concept pair (edge) likelihoods to recover highly connected sets of concepts from the concept graph. To focus our search, we construct a set of “seed” concepts and explore the graph to find neighbors which are highly connected to the seed set. In practice, the seed would contain concepts considered integral to the ultimate phenotype definition. We develop two approaches to recover concept sets given a concept seed.
Maximal clique concept sets. Here we recover a concept set as the maximal clique containing all of our seed concepts. To begin, we chose a trained model and estimate all concept pair likelihoods. Next we threshold these likelihoods to obtain a binary adjacency matrix. We then identify all concepts which are fully connected to the seed. This set of concepts along with the seed define a sub-graph which we probe to isolate the maximal clique containing our seed. We use maximal clique finding algorithms defined in NetworkX, an open-source Python package for graph analysis12,13. See Figure 2 for a visual summary of this approach.
Figure 2: Maximal clique concept set recovery.
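A minimal pure-Python sketch of this procedure (the paper uses NetworkX's clique algorithms; the greedy extension below yields *a* maximal clique containing the seed, not necessarily the largest one):

```python
def maximal_clique_with_seed(likelihood, concepts, seed, threshold=0.75):
    """Threshold pairwise edge likelihoods into an adjacency relation,
    keep concepts fully connected to the seed, then greedily extend the
    seed to a maximal clique.  `likelihood` maps frozenset({ci, cj})
    to an estimated edge probability."""
    def connected(a, b):
        return likelihood.get(frozenset((a, b)), 0.0) >= threshold

    # candidates must be connected to every seed concept
    candidates = [c for c in concepts
                  if c not in seed and all(connected(c, s) for s in seed)]
    clique = set(seed)
    for c in candidates:  # greedy maximal extension over the sub-graph
        if all(connected(c, m) for m in clique):
            clique.add(c)
    return clique
```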
Greedy concept sets. Here we take a greedy approach to building concept sets. As above, we choose a model, estimate all concept pair likelihoods, and define a concept seed. Next, we calculate the mean of edge likelihoods (MELs) between our seed concepts and all other concepts in the graph. Finally, the concept with the largest MEL is added to the seed. This process is repeated until all concepts are in the seed. The final output is a list of all concepts in their order of addition to the seed. See Figure 3 for a visual summary of this approach.
Figure 3: Greedy concept set recovery. For ease of presentation, we illustrate edge likelihoods for just two concepts not currently in the seed.
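The greedy procedure can be sketched as follows; the `likelihood` lookup keyed by unordered concept pairs is an assumed representation of the model's edge estimates:

```python
def greedy_ranking(likelihood, concepts, seed):
    """Repeatedly add to the seed the concept with the highest mean
    edge likelihood (MEL) to the current seed; return all non-seed
    concepts in their order of addition."""
    seed = list(seed)  # copy: the caller's seed is not mutated
    remaining = [c for c in concepts if c not in seed]
    order = []
    while remaining:
        def mel(c):  # mean edge likelihood of c against the current seed
            return sum(likelihood.get(frozenset((c, s)), 0.0)
                       for s in seed) / len(seed)
        best = max(remaining, key=mel)
        remaining.remove(best)
        seed.append(best)
        order.append(best)
    return order
```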
Experiments and Evaluation metrics. We focus on recovery of known, held-out phenotype concept sets. In our experiments, we estimate edge likelihoods using models trained in the phenotype-aware hold-out setting. We then evaluate how well our methods recover held-out phenotype concept sets given seeds of varying size. Seeds are always drawn from a target held-out concept set. Seed size is defined as a number of concepts (i.e. 3, 5, 10) or as a percentage of the target concept set (i.e. 10%, 30%, 50%).
Our two concept set recovery methods return different outputs; the first returns a subset of all concepts, while the second returns an ordered list of all concepts. Thus, we employ distinct metrics to evaluate each method. For maximal clique concept sets, we use precision, recall, and the F1 score. For greedy concept sets we use precision at a percentage of held-out phenotype (Prec.@%). This is the same as precision at K, where K is set equal to a percentage of the total concepts in a target held-out phenotype. We use this metric to permit aggregation of our results over held-out concept sets of various sizes.
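Precision at a percentage of the held-out phenotype can be sketched as a small helper (a plain precision-at-K with K derived from the target set's size):

```python
def precision_at_pct(ranked, target, pct):
    """Precision at K, where K is `pct` of the target concept set's size."""
    k = max(1, round(pct * len(target)))
    top = ranked[:k]
    return sum(c in target for c in top) / k
```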
For each target phenotype concept set and each seed size, we attempt recovery with 10 seeds randomly sampled from the target. For each metric, we aggregate results over all held-out targets within each model and seed size.
2 Results
Concept pair prediction. All of our models demonstrate strong predictive performance in both the random and phenotype-aware hold-out settings (see Table 1). For random hold-out, note the high AUC-ROC values ranging from 0.9223 (naive Bayes) to 0.9414 (random forest) and high AUC-PR values ranging from 0.9045 (decision tree) to 0.9268 (gradient boosting). For phenotype-aware hold-out, performance weakens across all models. This performance decay is expected; fully held-out phenotypes have no concept pairs available for training. This lack of context during training makes recovery of held-out concept pairs more difficult than in random hold-out.
Table 1: Concept pair prediction evaluation. AUC-ROC: area under ROC curve; AUC-PR: area under PR curve; Max. F1: maximum observed F1 value; Prec.@50%: precision at 50% of total held-out concept pairs.
| Hold-out | Model | AUC-ROC | AUC-PR | Max. F1 | Prec.@50% |
| --- | --- | --- | --- | --- | --- |
| Random | LR (L1) | 0.9310 ± 0.0007 | 0.9147 ± 0.0008 | 0.8622 ± 0.0013 | 0.9420 ± 0.0340 |
| Random | LR (L2) | 0.9294 ± 0.0007 | 0.9130 ± 0.0008 | 0.8619 ± 0.0012 | 0.9380 ± 0.0316 |
| Random | Naive Bayes | 0.9223 ± 0.0008 | 0.9049 ± 0.0012 | 0.8350 ± 0.0022 | 0.9320 ± 0.0392 |
| Random | Decision Tree | 0.9347 ± 0.0014 | 0.9045 ± 0.0036 | 0.8768 ± 0.0011 | 0.9580 ± 0.0316 |
| Random | Random Forest | 0.9414 ± 0.0009 | 0.9247 ± 0.0014 | 0.8794 ± 0.0028 | 0.9280 ± 0.0402 |
| Random | Gradient Boosting | 0.9439 ± 0.0010 | 0.9248 ± 0.0019 | 0.8841 ± 0.0011 | 0.9700 ± 0.0184 |
| Random | AdaBoost | 0.9410 ± 0.0005 | 0.9234 ± 0.0009 | 0.8775 ± 0.0007 | 0.9700 ± 0.0134 |
| Phen.-aware | LR (L1) | 0.8740 ± 0.0268 | 0.8713 ± 0.0270 | 0.8147 ± 0.0183 | 0.9480 ± 0.0402 |
| Phen.-aware | LR (L2) | 0.8752 ± 0.0297 | 0.8801 ± 0.0268 | 0.8118 ± 0.0197 | 0.9540 ± 0.0358 |
| Phen.-aware | Naive Bayes | 0.8766 ± 0.0266 | 0.8727 ± 0.0227 | 0.7873 ± 0.0445 | 0.9500 ± 0.0313 |
| Phen.-aware | Decision Tree | 0.8877 ± 0.0260 | 0.8634 ± 0.0260 | 0.8198 ± 0.0122 | 0.9000 ± 0.0639 |
| Phen.-aware | Random Forest | 0.8993 ± 0.0203 | 0.8695 ± 0.0217 | 0.8328 ± 0.0074 | 0.5080 ± 0.2964 |
| Phen.-aware | Gradient Boosting | 0.8952 ± 0.0252 | 0.8743 ± 0.0254 | 0.8254 ± 0.0112 | 0.7800 ± 0.2741 |
| Phen.-aware | AdaBoost | 0.8735 ± 0.0227 | 0.8554 ± 0.0237 | 0.8109 ± 0.0149 | 0.9580 ± 0.0340 |
To evaluate feature importance, we examined the trained coefficients in L1-penalized logistic regression. The most positively predictive covariates were the Lin measure and the Information coefficient, with beta coefficients of 7.290 and 8.868 respectively. Ancestry and same-visit co-occurrence had the most negative beta coefficients (-3.059). In our experiments with feature subsets, semantic similarity features contributed the most to model performance, with an aggregate AUC-ROC of 0.94 and AUC-PR of 0.74. This was followed by lexical features, which on their own had an AUC-ROC of 0.78 and AUC-PR of 0.44. The concept co-occurrence and concept embedding features had a generally lower AUC-ROC of 0.67.
Concept set recovery. We limit our experiments to concept pair likelihoods learned with L1- and L2-regularized logistic regression, random forest, and gradient boosting. These models demonstrated the most competitive performance in phenotype-aware hold-out concept pair prediction (Table 1).
For maximal clique concept set recovery we tested various concept pair likelihood thresholds, and achieved best performance at a value of 0.75. Table 2 summarizes our evaluation. Though standard deviations are large, some general trends are apparent. Logistic regression trends to higher precision, while ensemble methods trend to higher recall. Somewhat unexpectedly, performance in all models appears to trend downward as the seed size grows. This may be an artifact of larger seeds leaving fewer target concepts available for recovery. In addition, maximal cliques grown from larger seeds may struggle to incorporate additional concepts as the barrier to inclusion is higher relative to smaller seeds.
Table 2: Maximal clique concept set recovery. Concept pair likelihood threshold set to 0.75.
| Metric | Seed Size | Logistic Regression (L1) | Logistic Regression (L2) | Random Forest | Gradient Boosting |
| --- | --- | --- | --- | --- | --- |
| Precision | 3 | 0.4367 ± 0.3916 | 0.5066 ± 0.4025 | 0.5265 ± 0.4092 | 0.4089 ± 0.4185 |
| | 5 | 0.4223 ± 0.4222 | 0.4939 ± 0.4251 | 0.4502 ± 0.4242 | 0.3911 ± 0.4343 |
| | 10 | 0.3334 ± 0.4156 | 0.3823 ± 0.4295 | 0.3219 ± 0.4180 | 0.2271 ± 0.3833 |
| | 10% | 0.3270 ± 0.4246 | 0.3874 ± 0.4473 | 0.3479 ± 0.4099 | 0.2620 ± 0.4002 |
| | 30% | 0.2374 ± 0.3957 | 0.2336 ± 0.3959 | 0.1850 ± 0.3447 | 0.1713 ± 0.3402 |
| | 50% | 0.1760 ± 0.3559 | 0.1624 ± 0.3428 | 0.1584 ± 0.3331 | 0.1108 ± 0.2814 |
| Recall | 3 | 0.3102 ± 0.3225 | 0.3459 ± 0.3386 | 0.3859 ± 0.3506 | 0.3316 ± 0.3887 |
| | 5 | 0.2855 ± 0.3310 | 0.2909 ± 0.3432 | 0.3493 ± 0.3823 | 0.3107 ± 0.3971 |
| | 10 | 0.2240 ± 0.3467 | 0.2588 ± 0.3584 | 0.2448 ± 0.3710 | 0.2264 ± 0.3889 |
| | 10% | 0.1865 ± 0.2855 | 0.1907 ± 0.2962 | 0.2662 ± 0.3684 | 0.2433 ± 0.3757 |
| | 30% | 0.1454 ± 0.2760 | 0.1721 ± 0.3125 | 0.2039 ± 0.362 | 0.1831 ± 0.3478 |
| | 50% | 0.1348 ± 0.2843 | 0.1438 ± 0.3021 | 0.1950 ± 0.3695 | 0.1579 ± 0.3421 |
| F1 | 3 | 0.2914 ± 0.3019 | 0.3241 ± 0.3064 | 0.3403 ± 0.2909 | 0.2795 ± 0.3257 |
| | 5 | 0.2721 ± 0.3145 | 0.2712 ± 0.3050 | 0.2855 ± 0.3041 | 0.2499 ± 0.3214 |
| | 10 | 0.2043 ± 0.3099 | 0.2320 ± 0.3209 | 0.1849 ± 0.2877 | 0.1704 ± 0.3160 |
| | 10% | 0.1883 ± 0.2919 | 0.2015 ± 0.3026 | 0.2090 ± 0.2946 | 0.1923 ± 0.3128 |
| | 30% | 0.1234 ± 0.2473 | 0.1430 ± 0.2765 | 0.1288 ± 0.2541 | 0.1307 ± 0.2627 |
| | 50% | 0.1015 ± 0.2369 | 0.1069 ± 0.2520 | 0.1027 ± 0.2366 | 0.0947 ± 0.2328 |
Table 3 summarizes our evaluation for greedy concept set recovery. Here, as above, logistic regression trends to higher precision than the ensemble methods. However, performance for all models appears to grow with seed size. This behavior indicates that, unlike with maximal cliques, our greedy approach is better able to identify true concept pairs when a larger number of true concept pairs are incorporated into the seed.
Table 3: Greedy concept set recovery.
| Metric | Seed Size | Logistic Regression (L1) | Logistic Regression (L2) | Random Forest | Gradient Boosting |
| --- | --- | --- | --- | --- | --- |
| Prec.@10% | 3 | 0.8721 ± 0.2668 | 0.8766 ± 0.2569 | 0.8574 ± 0.2849 | 0.7956 ± 0.3374 |
| | 5 | 0.8880 ± 0.2575 | 0.8800 ± 0.2587 | 0.8606 ± 0.2845 | 0.7874 ± 0.3441 |
| | 10 | 0.8864 ± 0.2687 | 0.8690 ± 0.2770 | 0.8605 ± 0.2851 | 0.7922 ± 0.3407 |
| | 10% | 0.9151 ± 0.2182 | 0.8967 ± 0.2269 | 0.8762 ± 0.2558 | 0.7933 ± 0.3296 |
| | 30% | 0.9103 ± 0.2214 | 0.8989 ± 0.2245 | 0.8820 ± 0.2362 | 0.8365 ± 0.2791 |
| | 50% | 0.9197 ± 0.2290 | 0.8884 ± 0.2558 | 0.8781 ± 0.2442 | 0.8576 ± 0.2694 |
| Prec.@30% | 3 | 0.7933 ± 0.2862 | 0.8118 ± 0.2770 | 0.8063 ± 0.3027 | 0.7539 ± 0.3391 |
| | 5 | 0.8085 ± 0.2722 | 0.8202 ± 0.2667 | 0.8121 ± 0.2968 | 0.7563 ± 0.3387 |
| | 10 | 0.8049 ± 0.2784 | 0.8171 ± 0.2683 | 0.8022 ± 0.3042 | 0.7649 ± 0.3323 |
| | 10% | 0.8175 ± 0.2622 | 0.8269 ± 0.2578 | 0.’ ± 0.2870 | 0.7363 ± 0.3425 |
| | 30% | 0.8506 ± 0.2398 | 0.8459 ± 0.2321 | 0.8441 ± 0.2506 | 0.7766 ± 0.2845 |
| | 50% | 0.8840 ± 0.2218 | 0.8627 ± 0.2300 | 0.8511 ± 0.2431 | 0.8004 ± 0.2615 |
| Prec.@50% | 3 | 0.7086 ± 0.3038 | 0.7267 ± 0.2925 | 0.7369 ± 0.3086 | 0.6952 ± 0.3274 |
| | 5 | 0.7220 ± 0.2986 | 0.7386 ± 0.2843 | 0.7461 ± 0.3039 | 0.7099 ± 0.3227 |
| | 10 | 0.7226 ± 0.3002 | 0.7474 ± 0.2860 | 0.7499 ± 0.3133 | 0.7193 ± 0.3289 |
| | 10% | 0.7242 ± 0.2957 | 0.7555 ± 0.2870 | 0.7661 ± 0.3056 | 0.6927 ± 0.3412 |
| | 30% | 0.7820 ± 0.2678 | 0.7941 ± 0.2674 | 0.8063 ± 0.2767 | 0.7310 ± 0.3045 |
| | 50% | 0.8281 ± 0.2383 | 0.8202 ± 0.2411 | 0.8211 ± 0.2582 | 0.7485 ± 0.2701 |
| Prec.@100% | 3 | 0.5796 ± 0.3058 | 0.5837 ± 0.2956 | 0.5609 ± 0.2853 | 0.5586 ± 0.2945 |
| | 5 | 0.5980 ± 0.3025 | 0.5994 ± 0.2950 | 0.5706 ± 0.2861 | 0.5675 ± 0.2896 |
| | 10 | 0.6163 ± 0.3035 | 0.6162 ± 0.2985 | 0.5897 ± 0.2925 | 0.5791 ± 0.2996 |
| | 10% | 0.6198 ± 0.3019 | 0.6193 ± 0.2966 | 0.5962 ± 0.2905 | 0.5638 ± 0.3055 |
| | 30% | 0.6818 ± 0.2833 | 0.6920 ± 0.2787 | 0.6891 ± 0.2924 | 0.6349 ± 0.3063 |
| | 50% | 0.7155 ± 0.2749 | 0.7330 ± 0.2847 | 0.7476 ± 0.2972 | 0.6677 ± 0.3058 |
3 Discussion
The rich concept pair features we develop power binary classification models with strong predictive performance. Importantly, performance weakens only slightly when predicting concept pairs from fully held-out phenotype concept sets. This behavior is essential for recovering held-out phenotypes as described in this work. It is also needed for constructing novel phenotype concept sets for which concept pairs are unavailable for training.
We explore two methods for recovering phenotype concept sets using our concept pair prediction models: maximal clique and greedy concept set recovery. Both leverage concept pair likelihood estimates to isolate highly connected concepts which serve as candidates for inclusion within a phenotype definition. We find that each method is capable of recovering concepts in held-out phenotype concept sets, motivating their use in novel concept set construction to power phenotype development.
Our concept set recovery methods accomplish distinct, but related tasks. Maximal clique recovery produces a set of concepts which can be considered as a complete phenotype concept set. This is attractive, since a potential user is given output which closely resembles the desired end product. However, we note a steady drop in all our performance metrics as we increase the size of the seed provided to this method. This makes sense given that, as the seed grows, it is less likely that additional true concept pairs will be fully connected to the seed. This finding suggests small, well informed seeds should be constructed to achieve optimal performance. Future work will explore how this method performs with small, expert-defined seeds as opposed to the random seeds used here.
Greedy concept set recovery returns a ranked list of all available concepts. This output demands some amount of expert post processing to parse into a phenotype concept set. As such, we are interested in exploring the use of this method within a recommender system which, given a seed, suggests highly weighted concepts for inclusion in a growing phenotype concept set. In our experiments, we observe this method attains higher precision as the initial seed size grows. Thus, top ranked concepts are likely to be good candidates for inclusion within a heavily seeded phenotype definition. Unlike with maximal cliques, our greedy approach adds new concepts to the seed based on how likely they are to be connected on average to concepts already in the seed. Thus, this method is better able to pick out true concept pairs as more information (i.e. more concepts) is incorporated into the seed.
Our approaches to phenotype concept set construction provide two potential paths toward rapid, on-demand construction of phenotype definitions. Moving forward, we intend to work more closely with our clinical expert collaborators to improve our concept set recovery methods, identify optimal parameterizations, and construct and evaluate novel phenotype definitions.
Figures & Table
Figure 1: Visual summary of reference set curation, feature engineering, and edge prediction. Columns C1 and C2 represent concept pairs. Model abbreviations: LR (L1), logistic regression with L1 penalty; LR (L2), logistic regression with L2 penalty; RF, random forest; AB, AdaBoost; GB, gradient boosting; DT, decision tree; NB, naive Bayes.
References
- [1] Hripcsak G., Albers D.J. High-fidelity phenotyping: richness and freedom from bias. Journal of the American Medical Informatics Association. 2017.
- [2] Hripcsak G., Albers D.J. Next-generation phenotyping of electronic health records. Journal of the American Medical Informatics Association. 2013;20(1):117–21. doi: 10.1136/amiajnl-2012-001145.
- [3] Carroll R., Eyler A., Denny J. Naive electronic health record phenotype identification for rheumatoid arthritis. AMIA Annual Symposium Proceedings. 2011;2011:189–96.
- [4] Chen Y., Carroll R., Hinz E., Shah A., Eyler A., Denny J., Xu H. Applying active learning to high-throughput phenotyping algorithms for electronic health records data. Journal of the American Medical Informatics Association. 2013;20(e2):e253–e259. doi: 10.1136/amiajnl-2013-001945.
- [5] Reich C., Ryan P.B., Belenkaya R., Natarajan K., Blacketer C. OMOP Common Data Model v6.0 specifications.
- [6] Kirby J.C., Speltz P., Rasmussen L.V., et al. PheKB: a catalog and workflow for creating electronic phenotype algorithms for transportability. Journal of the American Medical Informatics Association. 2016;23(6):1046–1052. doi: 10.1093/jamia/ocv202.
- [7] Gottesman O., Kuivaniemi H., Tromp G., Faucett W.A., Li R., Manolio T.A., et al. The electronic medical records and genomics (eMERGE) network: past, present, and future. Genetics in Medicine. 2013;15:761–771. doi: 10.1038/gim.2013.72.
- [8] Resnik P. Semantic similarity in a taxonomy: an information-based measure and its application to problems of ambiguity in natural language. Journal of Artificial Intelligence Research. 1999;11:95–130.
- [9] Deng Y., Gao L., Wang B., Guo X. HPOSim: an R package for phenotypic similarity measure and enrichment analysis based on the human phenotype ontology. PLOS ONE. 2015;10(2):e0115692. doi: 10.1371/journal.pone.0115692.
- [10] Pennington J., Socher R., Manning C.D. GloVe: global vectors for word representation. Proceedings of EMNLP. 2014:1532–1543.
- [11] Pedregosa F., Varoquaux G., Gramfort A., Michel V., Thirion B., Grisel O., Blondel M., Prettenhofer P., Weiss R., Dubourg V., Vanderplas J., Passos A., Cournapeau D., Brucher M., Perrot M., Duchesnay E. Scikit-learn: machine learning in Python. Journal of Machine Learning Research. 2011;12:2825–2830.
- [12] Boppana R., Halldorsson M.M. Approximating maximum independent sets by excluding subgraphs. BIT Numerical Mathematics. 1992;32(2):180–196.
- [13] Hagberg A.A., Schult D.A., Swart P.J. Exploring network structure, dynamics, and function using NetworkX. Proceedings of the 7th Python in Science Conference (SciPy 2008). 2008:11–15.