CAGI SickKids challenges: Assessment of phenotype and variant predictions derived from clinical and genomic data of children with undiagnosed diseases

Laura Kasak; Jesse M Hunter; Rupa Udani; Constantina Bakolitsa; Zhiqiang Hu; Aashish N Adhikari; Giulia Babbi; Rita Casadio; Julian Gough; Rafael F Guerrero; Yuxiang Jiang; Thomas Joseph; Panagiotis Katsonis; Sujatha Kotte; Kunal Kundu; Olivier Lichtarge; Pier Luigi Martelli; Sean D Mooney; John Moult; Lipika R Pal; Jennifer Poitras; Predrag Radivojac; Aditya Rao; Naveen Sivadasan; Uma Sunderam; Saipradeep VG; Yizhou Yin; Jan Zaucha; Steven E Brenner; M Stephen Meyn

doi:10.1002/humu.23874

Author manuscript; available in PMC: 2020 Sep 3.

Published in final edited form as: Hum Mutat. 2019 Sep 3;40(9):1373–1391. doi: 10.1002/humu.23874


PMCID: PMC7318886 NIHMSID: NIHMS1042703 PMID: 31322791

Abstract

Whole genome sequencing (WGS) holds great potential as a diagnostic test. However, the majority of patients currently undergoing WGS lack a molecular diagnosis, largely due to the vast number of undiscovered disease genes and our inability to assess the pathogenicity of most genomic variants. The CAGI SickKids challenges attempted to address this knowledge gap by assessing state-of-the-art methods for clinical phenotype prediction from genomes. CAGI4 and CAGI5 participants were provided with WGS data and clinical descriptions of 25 and 24 undiagnosed patients from the SickKids Genome Clinic Project, respectively. Predictors were asked to identify primary and secondary causal variants. Additionally, for CAGI5, groups had to match each genome to one of three disorder categories (neurologic, ophthalmologic, connective), and separately to each patient. The performance of matching genomes to categories was no better than random but two groups performed significantly better than chance in matching genomes to patients. Two of the ten variants proposed by two groups in CAGI4 were deemed to be diagnostic, and several proposed pathogenic variants in CAGI5 are good candidates for phenotype expansion. We discuss implications for improving in silico assessment of genomic variants and identifying new disease genes.

Keywords: CAGI, SickKids, phenotype prediction, variant interpretation, whole genome sequencing data, pediatric rare disease

INTRODUCTION

Next generation sequencing (NGS) is a disruptive technology that provides more comprehensive tests and several fold higher diagnostic yields than conventional methods for diagnosing genetic disorders. Looking to the future, deploying NGS-based whole genome sequencing (WGS) as a first tier diagnostic test has the potential to revolutionize the diagnosis of genetic disorders, given that the diagnostic yield of WGS for children suspected of a Mendelian disorder currently averages over 40% and continues to increase with time (Clark et al., 2018; Scocchia et al., 2019). Nonetheless, the majority of patients who now undergo WGS after first-line genomic testing failed to yield an answer remain without a molecular diagnosis. This gap between the current performance of WGS and its ultimate potential as a diagnostic test is primarily due to the large number of undiscovered disease genes and our inability to assess the pathogenicity of most genomic variants.

WGS identifies ~3.8 million variants in the average individual (Shen et al., 2013) with over 650 million variants already cataloged in a small proportion of the world’s population (dbSNP Build 151, April 2018 release). Correctly assigning one or a few of these variants as the cause of disease in an individual suspected of a genetic disorder is a herculean task, in large part because the majority of rare and low frequency variants are of unknown clinical significance and non-coding variants are rarely classifiable (Giral, Landmesser, & Kratzer, 2018; Gloss & Dinger, 2018; Zhu, Tazearslan, & Suh, 2017). Beyond the sheer numbers of variants, the complexity of this endeavor is compounded by locus heterogeneity, in which pathogenic variants in multiple genes can yield overlapping phenotypes. Furthermore, many genetic disorders are likely oligogenic or polygenic in nature (Jordan & Do, 2018; Kousi & Katsanis, 2015).

Bioinformatics-guided analysis of clinical WGS data is essential to overcome these challenges. While there have been significant improvements in the detection and classification of single nucleotide variants, copy number variants, and other structural variation from WGS data, there is a critical need to substantially improve the accuracy and efficiency of computer algorithms designed to predict a patient’s phenotype from their genotype and distinguish a phenotype’s causal variant(s) from millions of others. Overcoming the current limitations of WGS variant interpretation will not only improve clinical diagnosis, it also will advance our understanding of the etiology of genetic disorders and facilitate the development of better therapeutics, which will ultimately lessen the burden of genetic disease.

The Hospital for Sick Children’s (SickKids) Genome Clinic Project was designed to pilot the diagnostic and predictive use of whole genome sequencing (WGS) in children (Bowdin, Hayeems, Monfared, Cohn, & Meyn, 2016). The Project’s first cohort involved testing the performance of WGS vs. diagnostic chromosomal microarray in 100 children referred to clinical geneticists for suspected genetic disease. The second cohort compared WGS against targeted panel sequencing in 103 children seen in pediatric specialty clinics. The initial diagnostic rates for WGS were 38% for the microarray cohort and 43% for the targeted panel cohort.

Genome Clinic patients who remained undiagnosed after clinical assessment of their WGS data form useful cohorts for trialing novel approaches to molecular diagnosis and gene discovery. In that regard, the Genome Clinic Project collaborated with the Critical Assessment of Genome Interpretation (CAGI) to create open bioinformatics challenges for CAGI4 and CAGI5. The challenge cohorts consisted of patients for whom previous diagnostic assessment of their WGS data had yielded no causal variants. The bioinformatics teams who participated were provided with demographic information and clinical descriptions for each patient, as well as assembled WGS data with variant calling for each genome.

The SickKids CAGI4 challenge involved 25 undiagnosed patients from the Genome Clinic microarray cohort (Stavropoulos et al., 2016). Bioinformatics teams were provided with linked WGS data and HPO-based clinical descriptions for each patient. The primary challenge task was to identify the causal genomic variant(s) responsible for the patient’s phenotype.

The SickKids CAGI5 challenge involved 24 undiagnosed patients from the Genome Clinic panel test cohort who were being evaluated for one of three disease categories: ophthalmologic disorders, neurologic disorders, or connective tissue disorders (Lionel et al., 2018). Importantly, unlike CAGI4, the WGS data were not linked to specific patients. Teams had three primary tasks: a) match each genome to one of the three broad disease categories (ophthalmologic, neurologic, or connective tissue disease); b) match each genome to a specific patient’s clinical phenotype; and c) propose one or more causal variants that would explain the selected patient’s disease phenotype. In addition, teams were encouraged to submit variants they considered secondary findings (pathogenic disease-causing variants not related to the patient’s current phenotype). Variants from genomes assigned by the teams to the correct patient as potentially causative of the phenotype were assessed and classified according to the American College of Medical Genetics (ACMG) guidelines (Richards et al., 2015). The results of the CAGI4 and CAGI5 challenges are presented here.

METHODS

Patient data

For CAGI5, the SickKids Genome Clinic project provided de-identified clinical phenotypic information and whole genome sequencing data for 24 cases that were selected from the SickKids Genome Clinic panel sequencing cohort. The 24 patients consisted of 13 girls and 11 boys, ranging from 3 to 18 years in age. Sequencing and data analysis were performed as described in Lionel et al., 2018. These 24 cases remained unsolved after initial screening by the project’s clinical molecular geneticists for plausible coding, splicing, non-coding, and structural variants. The challenge cohort consisted of 6 patients with ophthalmologic disorders, 7 with neurologic disorders, and 11 with connective tissue disorders (Supporting Information). Predictors were provided with the phenotypic descriptions as shared with the diagnostic laboratory.

In CAGI4, the SickKids challenge involved 25 children with a wide range of suspected genetic disorders who were referred for clinical genome sequencing, but remained unsolved after initial screening. Phenotypic data were provided by the referring physicians and entered into Phenotips, a Human Phenotype Ontology-based database (Girdea et al., 2013). Detailed information and description of these cases is provided in Stavropoulos et al., 2016 and Pal et al., 2017.

To model the clinical testing environment, phenotypic information was limited to that routinely obtained from clinicians prior to molecular testing, rather than from an iterative, genotype-driven assessment of the patient. The diversity of phenotypes in the dataset represents the range of clinical presentations typically seen in children referred for diagnostic evaluations in subspecialty clinics at SickKids. All patients in the CAGI cohorts were consented for sharing of de-identified genomic and phenotypic data with external research projects. The original Genome Clinic project and data sharing with CAGI were approved by the Research Ethics Board at The Hospital for Sick Children (REB Protocol #1000037726).

CAGI5 SickKids challenge format

The CAGI5 SickKids challenge was divided into several tasks. First, teams had to match genomes to a broad phenotype category (ophthalmologic, neurologic, or connective tissue disorder). Second, genomes had to be matched to individual patients based on their clinical phenotype descriptions. In addition, teams could report primary variant(s) underlying each prediction (i.e., diagnostic variants) and secondary variants predicted to confer high risk of other disorders not present in the clinical phenotypic description.

Groups were required to provide a probability (0–1; 0 = no match, 1 = match) that a genome sequence matched a broad phenotype class as well as a probability that it belonged to a specific patient. Each predicted probability of a match included a standard deviation indicating confidence in the prediction. Organizers provided a template file, which had to be used for submission. Up to six distinct submissions were allowed from each group.

Bioinformatics groups:

The CAGI5 SickKids challenge went live on the genomeinterpretation.org web site in December 2017 and submissions closed in April 2018. Seven bioinformatics groups provided a single submission, while one team (Group 6) provided two separate submissions. The CAGI4 SickKids challenge went live in December 2015 and submissions closed in February 2016. Four groups participated in this challenge.

Assessment for CAGI5 challenge

Predicted broad phenotype categories and specific phenotype-genotype matches in each submission were assessed against the SickKids answer key. First, assessors calculated the number of correct predictions for broad phenotype categories. Only the highest-probability predictions were included in the assessment. If probabilities for two categories were equal and one of them was correct, the prediction was scored as correct (giving full credit to ties). The number of matches with no credit to ties is also shown. For the probability-based assessment, probabilities were normalized in each submission for each genome to sum to 1.0. The mean probability that a submission assigned to the correct disease category provides an assessment of the assigned probabilities that does not depend on whether the highest-probability predictions were correct. For each disease category, recall and precision were defined as TP/(TP+FN) and TP/(TP+FP), respectively, where TP, FN, and FP are the numbers of true positives, false negatives, and false positives. Of note, SID#1, 2, and 7 did not provide predictions for all genomes (8, 8, and 6 genomes not predicted, respectively) (Supp. Table S1).
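The scoring rules above can be sketched in a few lines of Python. This is a minimal illustration, not the assessors' actual code: the function names, the toy genome labels, and the handling of unpredicted genomes (counted as contributing zero probability and no predicted category) are assumptions. Tied top categories are treated as predicted for every tied category, in line with the full-credit rule.

```python
def normalize(probs):
    """Scale one genome's category probabilities so they sum to 1.0."""
    total = sum(probs.values())
    if total == 0:
        return {c: 0.0 for c in probs}
    return {c: p / total for c, p in probs.items()}

def top_categories(probs):
    """Return every category tied for the highest (non-zero) probability."""
    if not probs:
        return set()
    best = max(probs.values())
    if best <= 0:
        return set()
    return {c for c, p in probs.items() if p == best}

def count_matches(predictions, truth, credit_ties=True):
    """Count genomes whose highest-probability category is the correct one."""
    correct = 0
    for genome, true_category in truth.items():
        top = top_categories(predictions.get(genome, {}))
        if true_category in top and (credit_ties or len(top) == 1):
            correct += 1
    return correct

def mean_correct_probability(predictions, truth):
    """Mean normalized probability assigned to the correct category.
    Assumption: unpredicted genomes contribute a probability of 0."""
    total = 0.0
    for genome, true_category in truth.items():
        probs = normalize(predictions.get(genome, {}))
        total += probs.get(true_category, 0.0)
    return total / len(truth)

def recall_precision(predictions, truth, category):
    """Recall = TP/(TP+FN) and precision = TP/(TP+FP) for one category."""
    tp = fp = fn = 0
    for genome, true_category in truth.items():
        predicted = category in top_categories(predictions.get(genome, {}))
        if predicted and true_category == category:
            tp += 1
        elif predicted:
            fp += 1
        elif true_category == category:
            fn += 1
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    return recall, precision

# Hypothetical three-genome submission and answer key (illustrative only).
toy_predictions = {
    "g1": {"ophthalmologic": 0.6, "neurologic": 0.2, "connective": 0.2},
    "g2": {"neurologic": 0.5, "connective": 0.5},  # tie
    "g3": {"connective": 0.9, "ophthalmologic": 0.1},
}
toy_truth = {"g1": "ophthalmologic", "g2": "connective", "g3": "neurologic"}
```

In this toy case the tie on `g2` is counted as a match only when ties receive credit, which is why the two match counts differ.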

To assess the statistical significance of the submissions, random predictions were simulated 10,000 times (Figure 1A and C). In each run, a disease category was assigned to each of the 24 genomes (6 ‘ophthalmologic’, 7 ‘neurologic’, and 11 ‘connective’) based purely on the category composition, i.e., a genome was assigned ‘ophthalmologic’, ‘neurologic’, or ‘connective’ with probability 6/24, 7/24, or 11/24, respectively; the numbers of correct assignments overall and for each category were then calculated. Moreover, to evaluate all the submitted predictions as a community, we also simulated random predictions from a community of 9 independent predictors (to match the number of submissions in this challenge) (Figure 1B). Specifically, in each run we recorded the highest match number among 9 random predictions (generated as described above), and this process was repeated 10,000 times.
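A minimal sketch of this simulation, assuming Python's standard `random` module, is given below. The function name, the seed, and the use of `random.choices` for composition-weighted sampling are illustrative choices; the original simulation code is not described in the text.

```python
import random

COMPOSITION = {"ophthalmologic": 6, "neurologic": 7, "connective": 11}

def simulate_category_matches(n_runs=10_000, n_predictors=1, seed=0):
    """Simulate composition-weighted random category calls.

    Each run records the highest number of correct calls among
    n_predictors independent random predictors (n_predictors=1 gives
    the single-submission null, n_predictors=9 the community null).
    """
    rng = random.Random(seed)
    truth = [cat for cat, n in COMPOSITION.items() for _ in range(n)]
    categories = list(COMPOSITION)
    weights = [COMPOSITION[cat] for cat in categories]
    results = []
    for _run in range(n_runs):
        best = 0
        for _predictor in range(n_predictors):
            guesses = rng.choices(categories, weights=weights, k=len(truth))
            matches = sum(g == t for g, t in zip(guesses, truth))
            best = max(best, matches)
        results.append(best)
    return results

single = simulate_category_matches()
community = simulate_category_matches(n_runs=2_000, n_predictors=9)
mean_single = sum(single) / len(single)
mean_community = sum(community) / len(community)
```

An empirical p-value for an observed match count `k` can then be estimated as the fraction of simulated runs with at least `k` matches.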

The second part of the challenge was to match the phenotype information given for each patient to the correct genome. Probabilities were normalized in each submission for each genome to sum to 1.0. Only the highest probabilities were considered in the assessment. If probabilities for two phenotype descriptions were equal and one of them was correct, the prediction was scored as correct. This only affected SID#3, which had 1 match instead of 2 if ties were not taken into account. Equal probability for ≥5 phenotype descriptions including the correct one was not considered a match. SID#2, 5, and 7 did not provide predictions for 12, 12, and 3 genomes, respectively. As for the broad disease category matching, the number of genome-phenotype matches expected at random was inferred by in silico simulation. The distribution was calculated from 10,000 simulation runs. In each run, we assumed that the 24 genomes were ordered as $G = (G_1, G_2, \ldots, G_{24})$; a prediction can then be treated as a random rearrangement of this order, $P = (G_{a_1}, G_{a_2}, \ldots, G_{a_{24}})$ with $a_1, a_2, \ldots, a_{24} \in \{1, 2, \ldots, 24\}$, and the number of matched genomes (positions $i$ with $a_i = i$) was recorded. As sex information, which can be accurately inferred from genomes, was listed in the phenotypes (11 males and 13 females), we also simulated the number of genome-phenotype matches taking sex into account. This was done by summing the match numbers of two independent simulations of the kind described above, one with 11 genomes and one with 13. In addition, to evaluate all the submissions as a community, we simulated the highest number of matches among 9 independent predictions.
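The permutation-based null model can be sketched as follows. This is an illustrative reconstruction under stated assumptions: the function name, seed, and `group_sizes` parameter are hypothetical, and sex is modeled simply by shuffling the 11 male and 13 female genomes separately and summing the matches.

```python
import random

def simulate_patient_matches(group_sizes=(11, 13), n_runs=10_000,
                             n_predictors=1, seed=0):
    """Simulate random genome-to-patient assignments.

    Genomes are shuffled independently within each group (here the
    11 males and 13 females), and a match is a genome left in its
    original position. Each run records the highest total among
    n_predictors independent random predictors.
    """
    rng = random.Random(seed)
    results = []
    for _run in range(n_runs):
        best = 0
        for _predictor in range(n_predictors):
            matches = 0
            for size in group_sizes:
                order = list(range(size))
                rng.shuffle(order)
                matches += sum(i == a for i, a in enumerate(order))
            best = max(best, matches)
        results.append(best)
    return results

with_sex = simulate_patient_matches()
without_sex = simulate_patient_matches(group_sizes=(24,))
mean_with_sex = sum(with_sex) / len(with_sex)
mean_without_sex = sum(without_sex) / len(without_sex)
```

Because a uniformly random permutation has one fixed point on average regardless of its length, splitting the cohort by sex raises the expected number of chance matches from about one to about two.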

Phenotypic informational content scores for each patient were generated by PhenoTips from Monarch Initiative phenotypic profile analyses of the HPO terms contained in their supplied clinical description (Girdea et al., 2013). Correct genome-patient match scores were based on the number of highest probability matches for the correct genome with one match receiving one point.

Clinical assessment and classification of predicted variants

The proposed primary diagnostic and secondary variant(s) submitted by each group with correct genome-patient matches were evaluated. Variants were classified as pathogenic, likely pathogenic, uncertain, likely benign, or benign according to ACMG diagnostic guidelines (Richards et al., 2015) by trained clinical genomic scientists. ACMG guidelines provide a framework for determining the level of evidence that a particular variant is a clinically actionable finding. The majority of information for variant classification was gleaned from VarSome (Kopanos et al., 2019). VarSome has links and information from the clinical variant curation database ClinVar (Landrum et al., 2018), the population database gnomAD (Karczewski et al., 2019), and references to relevant publications. The seventeen in silico predictions available in VarSome were also taken into consideration. The ACMG classifications listed in VarSome were not used to classify variants, as they are generally not curated. Human Splicing Finder (Desmet et al., 2009) was used to determine the impact of variants on splicing.

Prediction methods

A detailed description of the methods used for the challenge accompanied each submission file. A brief summary of each CAGI5 prediction method is provided here, and detailed descriptions as well as CAGI4 methods are included in Supporting Information.

Group 1:

For phenotype matching, text mining for HPO terms (TPX software) was used, followed by manual QC. Gene prioritization was done by querying HANRD (Heterogeneous Association Network for Rare Diseases) and TPXRD (PubMed text mining) that give a set of ranked genes based on the input phenotype. Variant prioritization was achieved by using an in-house method (VPR). MAF (minor allele frequency), evolutionary conservation, in silico predictions, and ClinVar data were considered. Matching of genotypic to phenotypic case was done manually using the best possible intermediary disease.

Group 2:

Group 2 used eDGAR (Babbi et al., 2017), which collects known associations among genes and diseases, and PhenPath (Babbi et al., 2019), which groups diseases in terms of HPO terms and OMIM classifications and provides associations among phenotypes and genes. For variant prioritization, SNPs&GO and UniProt were utilized. Sex of patients was also used to guide and validate the matching.

Group 3:

VCF files were analyzed using standard parameters, including variant quality, allele frequency, functional damage prediction and gene-phenotype associations, using a variety of tools and databases. Gender was considered in phenotype-genotype matches, but ethnic origin was not taken into account.

Group 4:

Group 4 used phenotype-weighted subjective scoring of phenotypic profiles (HPO and dbNSFP databases) together with gender information to guide matching. African ethnicity was also checked. The rationale for phenotype-weighted scoring was to extract from each genome the pathogenic genetic information relevant to a particular profile. MAF and reported and predicted pathogenicity were considered.

Group 5:

Group 5 used an evolutionary action approach. To predict the disorder class for each individual, the predictors calculated the effect of the genetic variants on the fitness of each gene (evolutionary action). This fitness effect was used as the input of a diffusion process over a network of genes and diseases. The diffusion signal on each of the three disorder classes was used to calculate the probability of each genome being linked to each disease. To match each individual’s genome to a clinical report, the predictors again used the diffusion process, together with manual matching. Sex and ethnic origin information were also used.

Group 6:

Group 6 provided 2 separate submissions based on two different approaches. Clinical notes were searched against a gene-phenotype database (Monarch initiative for submission 6.1 and eRAM for submission 6.2), and the genes were sorted by the highest number of matching terms. Prediction of sex and ethnic origin was implemented. For variant prioritization, MAF filtering of protein altering variants was performed.

Group 7:

Group 7 utilized Ingenuity Variant Analysis (QIAGEN), which utilizes curated content from the literature as well as external databases. Genes known to be associated with patients’ phenotypes were selected. Phred score, MAF and ACMG classification (pathogenic and likely pathogenic) were taken into account.

Group 8:

To predict correspondence between phenotypes and genomes, Group 8 calculated scores for all genome-phenotype pairs and assigned the most likely connections using a bipartite matching algorithm. FunctionalFlow (Nabieva, Jim, Agarwal, Chazelle, & Singh, 2005) was used to predict risk genes, with score calibration based on the proportion of disease genes estimated by the AlphaMax algorithm (Jain, White, Trosset, & Radivojac, 2016; Jain, White, & Radivojac, 2016). MutPred2 (Pejaver et al., 2017) and MutPred-LOF (Pagel et al., 2017) were used to assign pathogenicity scores to variants. The final scores were assigned by combining gene scores and variant scores. Sex and ethnic origin information were also considered.
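The text does not specify which bipartite matching algorithm Group 8 used. As a hedged illustration of the general idea, the sketch below finds the one-to-one genome-to-patient assignment with the highest total score by exhaustive search over a small, hypothetical score matrix. The function name and all score values are invented for illustration; for 24 genomes an efficient solver for the assignment problem (for example, SciPy's `scipy.optimize.linear_sum_assignment`) would be needed instead of brute force.

```python
from itertools import permutations

def best_assignment(scores):
    """Return the one-to-one assignment maximizing the total score.

    scores[g][p] is the score for pairing genome g with patient p.
    Exhaustive search is only feasible for very small matrices.
    """
    n = len(scores)
    best_total, best_perm = float("-inf"), None
    for perm in permutations(range(n)):
        total = sum(scores[g][perm[g]] for g in range(n))
        if total > best_total:
            best_total, best_perm = total, perm
    return list(best_perm), best_total

# Hypothetical 3x3 genome-by-patient score matrix (illustrative values).
toy_scores = [
    [0.9, 0.8, 0.1],
    [0.7, 0.1, 0.2],
    [0.2, 0.3, 0.6],
]
assignment, total = best_assignment(toy_scores)
```

In this toy matrix, greedily giving genome 0 its top-scoring patient would force genome 1 into a poor pairing; the joint optimum instead pairs genome 0 with patient 1 and genome 1 with patient 0, which is the kind of global trade-off a bipartite matching step captures.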

RESULTS

CAGI5 SickKids Challenge

The CAGI5 SickKids challenge was primarily designed to test how well bioinformatics algorithms are able (1) to match 24 genomes to three broad phenotype categories, and (2) to match each of the 24 genomes to a specific patient based on typical phenotype information. Groups could also identify diagnostic variants that underlie the predictions as well as secondary variants conferring high risk of other diseases. VCF files containing WGS data (SNVs and indels) from 24 patients and 24 unlinked phenotype descriptions were provided. Eight teams submitted predictions to this challenge (Table 1), with two distinct predictions from group 6.

Table 1.

A list of participating teams

ID	Submission ID	PI
CAGI5
Group 1	SID#1	Aditya Rao
Group 2	SID#2	Rita Casadio
Group 3	SID#3	Rehovot group
Group 4	SID#4	Lipika R. Pal/John Moult
Group 5	SID#5	Olivier Lichtarge
Group 6	SID#6.1, SID#6.2	Aashish Adhikari
Group 7	SID#7	Jennifer Poitras
Group 8	SID#8	Sean Mooney/Predrag Radivojac
CAGI4
Group 9	SID#9	Chris Mungall
Group 10	SID#10	Julian Gough
Group 11	SID#11	Aditya Rao
Group 12	SID#12	Lipika R. Pal/John Moult


Broad phenotype category matching

The first part of the CAGI5 SickKids challenge was to match 24 genomes to three broad phenotype categories: ophthalmologic (n=6), neurologic (n=7), or connective (n=11). Groups were allowed to give probabilities for one, two or all three categories for a given genome; however, it was noted in the challenge description that every genome sequence matches exactly one clinical phenotypic category. Table 2 shows which submissions assigned the highest probability to the correct category for each genome. On average, only 3.3 of the 9 submissions assigned the highest probability to the correct category for a given genome. Genome 81 had the correct category predicted in 8 out of 9 submission files, while genome 71 had the correct prediction in only one, SID#3 (Table 2). Intriguingly, connective tissue disorder was the correct category for both of these genomes.

Table 2.

Summary of the performance of all groups in matching the broad phenotype category to each genome and in predicting which specific clinical description corresponds to which genome for the CAGI5 challenge.

Genome code	Patient code	Correct category	SID# with correct disease category match	SID# with correct genome-patient match
7	X	ophthalmologic	7, 8	5, 7, 8
9	W	ophthalmologic	4, 5
17	H	ophthalmologic	4, 5	4
18	U	neurologic	1, 3
21	G	neurologic	5, 7	6.1
30	R	neurologic	1, 6.2
39	P	neurologic	1, 4, 5, 6.1, 7
42	O	ophthalmologic	3, 7	3
56	N	connective	3, 4, 5, 6.1, 6.2, 8	4, 8
57	T	connective	6.1, 6.2, 8
67	M	ophthalmologic	1, 5, 8	1
68	J	neurologic	3, 8	3, 5, 8
71	L	connective	3	1, 5
76	Q	connective	3, 4, 5, 6.1, 6.2
78	V	connective	1, 2, 3, 4, 6.2, 7, 8	6.2
79	K	connective	4, 5, 6.2
81	I	connective	1, 2, 3, 5, 6.1, 6.2, 7, 8
91	E	neurologic	4, 5, 7
92	S	connective	1, 3, 4, 6.1, 6.2
93	F	connective	4, 6.2, 7, 8	4, 8
95	C	ophthalmologic	1, 2, 4, 8	2, 4, 5
97	D	connective	5, 6.1, 6.2
99	B	neurologic	4, 8	4, 8
102	A	connective	6.2, 8	8


The nine submissions reached an average accuracy of 37% when the category with the highest probability was considered (giving ties full credit). The accuracy weighted by the submitted probabilities was even lower (25%). SID#4 performed the best by assigning the correct category to 50% of the genomes (Tables 2, 3), whereas SID#5, 6.2 and 8 did not lag far behind, correctly predicting the broad phenotype category for 11 of the 24 (45.8%) genomes. Among the best-performing groups, SID#8 assigned higher probabilities to its correct matches and overall had the highest mean probability (0.49) assigned to the correct disease category (Table 3). When giving no credit to ties, SID#4 and 8 both ranked first with 11 matches (Table 3) and the mean accuracy of all submissions was 33%. These aforementioned five submissions all considered gender and ethnic information to guide matching; however, the strategies used were rather different, from evolutionary action to phenotype-weighted subjective scoring of phenotypic profile (see Methods for details).
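The summary percentages quoted above follow directly from the per-submission values in Table 3. The short sketch below recomputes them; the values are transcribed from Table 3, while the variable and function names are illustrative.

```python
# Per submission: (matches with ties credited,
#                  sum of probabilities for matches,
#                  matches with no credit to ties), from Table 3.
TABLE3 = {
    "SID#1": (8, 7.0, 7),
    "SID#2": (3, 3.0, 3),
    "SID#3": (9, 4.7, 9),
    "SID#4": (12, 7.6, 11),
    "SID#5": (11, 5.2, 9),
    "SID#6.1": (7, 2.7, 7),
    "SID#6.2": (11, 7.8, 9),
    "SID#7": (8, 6.0, 6),
    "SID#8": (11, 9.9, 11),
}
N_GENOMES = 24

def mean_accuracy(column):
    """Average one Table 3 column over all submissions and genomes."""
    values = [row[column] for row in TABLE3.values()]
    return sum(values) / (len(values) * N_GENOMES)

accuracy_with_ties = mean_accuracy(0)        # 80 / 216
accuracy_prob_weighted = mean_accuracy(1)    # 53.9 / 216
accuracy_without_ties = mean_accuracy(2)     # 72 / 216
```

Rounded to whole percentages, these three values reproduce the 37%, 25%, and 33% reported in the text.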

Table 3.

Summary of the performance of each group’s submission(s) in broad disease category matching for CAGI5 challenge

Submission ID	Number of matches	Sum of probabilities for matches	Number of matches, no credit to ties	Mean probability assigned to the correct class
SID#1	8	7.0	7	0.44
SID#2	3	3.0	3	0.25
SID#3	9	4.7	9	0.34
SID#4	12	7.6	11	0.42
SID#5	11	5.2	9	0.36
SID#6.1	7	2.7	7	0.35
SID#6.2	11	7.8	9	0.40
SID#7	8	6.0	6	0.39
SID#8	11	9.9	11	0.49


To evaluate the statistical significance of the submissions, we simulated 10,000 random predictions based solely on the composition of disease categories (Figure 1A). A random prediction can, on average, correctly match 9 genomes (9/24=37.5%) to the corresponding category, and a submission would have to match at least 13 genomes to perform significantly better than chance at a p-value cutoff of 0.05. These results indicate that the submitted methods did not perform better than expected by chance, as the average accuracy of the nine submissions was equal to the expected accuracy of a random prediction. Equal numbers of submissions scored above and below 9 matches, the median for random predictions, and none of the individual submissions performed significantly better than chance. Moreover, the simulation results showed that the expected highest match number among nine random predictions was 12 (Figure 1B), the same as we observed here. An alternative naive strategy would be to assign the highest probability to the largest disease category (connective tissue) for every genome, which would always yield 11 correct matches out of 24 (46%). Measured against this baseline, the average performance of the submissions was even lower than chance.
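As a cross-check on the simulated null, the distribution of correct calls under composition-weighted random guessing can also be computed exactly, since each genome is an independent trial whose success probability equals the frequency of its true category. The sketch below does this with exact fractions. It is an illustrative calculation, not part of the original assessment, and the names are hypothetical.

```python
from fractions import Fraction

COMPOSITION = {"ophthalmologic": 6, "neurologic": 7, "connective": 11}
N = sum(COMPOSITION.values())

def match_distribution():
    """Exact distribution of the number of correct category calls.

    Returns a list pmf where pmf[k] is the probability of exactly k
    correct calls when each genome is guessed independently with
    category probabilities 6/24, 7/24, and 11/24.
    """
    pmf = [Fraction(1)]
    for count in COMPOSITION.values():
        p = Fraction(count, N)  # chance of a correct call for this category
        for _genome in range(count):
            updated = [Fraction(0)] * (len(pmf) + 1)
            for k, mass in enumerate(pmf):
                updated[k] += mass * (1 - p)
                updated[k + 1] += mass * p
            pmf = updated
    return pmf

pmf = match_distribution()
expected_matches = sum(k * mass for k, mass in enumerate(pmf))
majority_baseline = Fraction(max(COMPOSITION.values()), N)

def tail_probability(k):
    """Probability of at least k correct calls under the random model."""
    return sum(pmf[k:])
```

The exact mean is (6² + 7² + 11²)/24 = 103/12, or about 8.6 matches, consistent with the roughly 9 matches reported from the simulation, and `tail_probability` gives the corresponding exact tail probability for any observed match count.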

Looking at the different disease categories, genomes belonging to the connective tissue category were the easiest to match, with an average of 47% of these genomes correctly assigned across all submissions (Table 2). For the ophthalmologic and neurologic disease categories, none of the groups performed significantly better than random. Matching 4, 5, and >8 genomes for the ophthalmologic, neurologic, and connective tissue categories, respectively, would be required to achieve a p-value less than 0.05 (Figure 1C). SID#6.2 achieved the highest recall (0.91) in the connective tissue category by matching 10 out of 11 genomes correctly (Figure 2). This result is statistically significant compared to random prediction (Figure 1C); however, the submission had rather low precision (0.43) for the same category and failed to match any genomes in the ophthalmologic category. Overall, most submissions achieved low true positive rates (recall) as well as low positive predictive values (precision) in this part of the challenge (Figure 2).

Figure 2. — (A) Recall and (B) precision values shown for each broad phenotype category by submissions.

Matching specific genomes to specific patients using phenotypic descriptions

The second part of the CAGI5 SickKids challenge was to use each patient’s phenotypic information to match the child to the correct genome. Groups provided as many probabilities as they wished, but only the highest probabilities were considered in the assessment (described in Methods). Table 4 summarizes how often each submission assigned the highest probability to the correct phenotype description. On average, only 1 out of 9 submissions made the correct genome-to-patient match for a given genome (Table 2). Three genomes (7, 68, and 95) had the most matches, each being correctly matched with the highest probability by three submissions, while 11 genomes (9, 18, 30, 39, 57, 76, 79, 81, 91, 92 and 97) were not matched by any group. In contrast to the broad phenotype matching, the ophthalmologic category was the easiest to predict correctly here: 83% of genomes were matched by at least one submission, compared to 43–45% for the other categories.
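The per-category figures in this paragraph can be recomputed from the last column of Table 2. In the sketch below, each genome is listed with its correct category and the number of submissions that matched it to the right patient (transcribed from Table 2); the names are illustrative.

```python
# (genome code, correct category, submissions with a correct
#  genome-patient match), transcribed from Table 2.
TABLE2 = [
    (7, "ophthalmologic", 3), (9, "ophthalmologic", 0),
    (17, "ophthalmologic", 1), (18, "neurologic", 0),
    (21, "neurologic", 1), (30, "neurologic", 0),
    (39, "neurologic", 0), (42, "ophthalmologic", 1),
    (56, "connective", 2), (57, "connective", 0),
    (67, "ophthalmologic", 1), (68, "neurologic", 3),
    (71, "connective", 2), (76, "connective", 0),
    (78, "connective", 1), (79, "connective", 0),
    (81, "connective", 0), (91, "neurologic", 0),
    (92, "connective", 0), (93, "connective", 2),
    (95, "ophthalmologic", 3), (97, "connective", 0),
    (99, "neurologic", 2), (102, "connective", 1),
]

def matched_fraction(category):
    """Fraction of a category's genomes matched by at least one submission."""
    counts = [n for _genome, cat, n in TABLE2 if cat == category]
    return sum(n > 0 for n in counts) / len(counts)

unmatched = [genome for genome, _cat, n in TABLE2 if n == 0]
total_matches = sum(n for _genome, _cat, n in TABLE2)
mean_matches_per_genome = total_matches / len(TABLE2)
```

The 23 correct matches spread over 24 genomes correspond to the stated average of about one correct submission per genome.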

Table 4.

Summary of the performance of each group’s submission(s) in the specific genome to patient matching

Submission ID	Number of matches	Sum of probabilities for matches	Mean probability assigned to the correct class
SID#1	2	2.00	0.08
SID#2	1	1.00	0.10
SID#3	2	0.38	0.08
SID#4	5	2.32	0.14
SID#5	4	3.59	0.34
SID#6.1	1	0.06	0.06
SID#6.2	1	0.36	0.07
SID#7	1	1.00	0.05
SID#8	6	4.52	0.26


The nine submissions achieved a mean accuracy of 11% considering the highest probability predictions. If weighted by the submitted probabilities, the accuracy dropped to 7%. Groups 1, 3, 4, 6, and 8 each predicted the phenotype description correctly for one genome that no other submission predicted (Table 2). The matches that groups 2, 5, and 7 predicted correctly were also assigned correctly by at least one other submission. According to their method descriptions, most of the groups considered gender; however, only three submissions (SID#3, 4, and 8) predicted sex accurately for all 24 genomes (Supp. Table S2).

SID#8 ranked first in this part of the challenge, with the highest number of matches to correct phenotype descriptions (6 genomes out of 24). As for the broad disease category predictions, this submission also assigned the highest mean probability (0.26) to the correct phenotype descriptions among the best-performing groups (Table 4). As noted previously, the groups did not perform better than chance for broad disease category matching; however, two teams performed significantly better than chance for matching patients to genomes. To assess the significance of the submitted predictions, 10,000 random predictions were again simulated based purely on the number of genomes (Figure 3). Sex of the patients was also included in the simulation. A random prediction was found to correctly match 2 genomes on average (2/24=8.3%) to the corresponding patient, which is slightly lower than the average of all groups (2.6 genomes) (Figure 3A). If a submission matched at least 5 genomes, its performance would be significantly better than chance (p-value ≤0.05). The assessment shows that SID#4 and 8 performed significantly better than expected by chance, matching 5 and 6 genomes to the correct phenotype descriptions, respectively. Additionally, the simulation results revealed that the expected highest match number among nine random predictions was 4, which is lower than the predictors achieved, although the difference was not statistically significant (Figure 3B).

Figure 3. — The expected number of genome-phenotype matches inferred by 10,000 times in silico simulation. (A) Distribution of overall number of matches. (B) Distribution of the highest match number observed in nine predictions (to simulate nine submissions of this challenge). In the simulation, we assume that sex for all genomes can be predicted correctly.

Effect of phenotype informational content on genome-patient matching

Diagnostic laboratories routinely use phenotypic descriptions provided by clinicians to guide their assessment of variants in patients undergoing exome and genome sequencing. In this regard, there is a general assumption that the more detailed the clinical description, the more useful it is in aiding molecular diagnosis. To see if this applied to the CAGI5 SickKids assessments, we examined the relationship between the informational content of a patient’s phenotypic description and the number of times they were matched to the correct genome. We found that there was a modest correlation (R²=0.478, p>0.05) between the informational content of the clinical descriptions and number of correct matches of genome to ophthalmologic patients (Figure 4). In contrast, for both neurologic and connective tissue patients, there was almost no correlation between the informational content of the phenotypic descriptions and the number of correct genome-patient matches (R²=0.042 and R²=0.029 respectively). Although the numbers of patients were small, these results suggest that rich phenotypic descriptions may aid genome-patient matching for specific categories of disease.

Figure 4. — Correlation between the informational content of the patient’s clinical description and the number of correct genome-patient matches by all submissions.
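For readers who want to reproduce this type of analysis, the following is a minimal sketch of how a coefficient of determination such as those quoted above can be computed for one disease category. The measure of informational content is defined elsewhere in the paper and is not reimplemented here; the input values below are hypothetical placeholders, not data from the study, and `r_squared` is an illustrative helper name.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of cross-products and squared deviations about the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return (sxy * sxy) / (sxx * syy)

# Hypothetical placeholder values for one disease category:
# informational content per patient and number of correct matches.
info_content = [1.0, 2.0, 3.0, 4.0]
correct_matches = [0, 1, 1, 2]

r2 = r_squared(info_content, correct_matches)
```

With so few patients per category, the R² value should be read together with its p-value, as the text above does.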

Classification and evaluation of predicted diagnostic variants

The third part of the CAGI5 challenge was to submit variants predicted to be causative of the patient’s phenotype. The variants associated with the highest probability for the correct genome were evaluated and classified by trained clinical geneticists using the 2015 ACMG clinical interpretation guidelines (Richards et al., 2015). Consistent with these patients already having gone through a diagnostic laboratory assessment of their genomes (described in Methods) without finding any clearly causal variants, none of the variants proposed by the groups were deemed to meet clinical criteria for being returnable to the clinicians as the causal variants for the patients’ disorders. However, predicted variants classified as likely pathogenic and certain variants of unknown significance (VUS) could be included in a clinical report. These variants are discussed below with regard to the plausibility of their contribution to the patients’ phenotypes (Table 5). Predicted variants that were clinically classified as benign or likely benign are listed in Supp. Table S3.

Table 5.

Nominated diagnostic variants predicted with the highest probability for correct genome-patient matches

Genome (patient)	Phenotype category	SID#	Genomic position (hg19)	Gene	Transcript	Nucleotide change	Protein change	Plausible gene for category?	Gene Phenotype Correlation	Classification
7(X)	Ophthalmologic	7	4:16024957:A:G	PROM1	NM_001145849.1	c.776T>C	p.Met259Arg	Yes	Poor	VUS
67(M)	Ophthalmologic	1	11:89017973:C:T	TYR	NM_000372.4	c.1217C>T	p.Pro406Leu	Yes	Poor	Likely pathogenic
68(J)	Neurological	3	1:156105054:G:T	LMNA	NM_170707.3	c.887G>T	p.Arg296Leu	Yes	Partial	VUS
78(V)	Connective	6.2	8:11615803:C:T	GATA4	NM_002052.4	c.1148C>T	p.Thr383Met	No	None	VUS
93(F)	Connective	4	22:41574743:A:G	EP300	NM_001429.3	c.7028A>G	p.Gln2343Arg	No	Poor	VUS
99(B)	Neurological	4	20:44044789:C:T	PIGT	NM_015937.5	c.−8C>T	5’UTR	Yes	Poor	VUS


Importantly, while one or more groups identified potential variants in 13 out of 24 genomes, none of the groups identified the same variant or gene as disease-causing in the same genome (Table 5 and Supp. Table S3). In addition, while several candidate causal variants were associated with recessive disease, all proposed variants were heterozygous and no additional variants, excluding likely benign and benign variants, were nominated in the same genes. Of note, Group 5 did not participate in this part of the challenge, and Group 2 did not predict a variant for its correct patient-genome match (genome 95/patient C). The performance of each group based on the laboratory geneticist evaluation (Methods) is described below.