Assessment of patient clinical descriptions and pathogenic variants from gene panel sequences in the CAGI-5 intellectual disability challenge

Marco Carraro; Alexander Miguel Monzon; Luigi Chiricosta; Francesco Reggiani; Maria Cristina Aspromonte; Mariagrazia Bellini; Kymberleigh Pagel; Yuxiang Jiang; Predrag Radivojac; Kunal Kundu; Lipika R Pal; Yizhou Yin; Ivan Limongelli; Gaia Andreoletti; John Moult; Stephen J Wilson; Panagiotis Katsonis; Olivier Lichtarge; Jingqi Chen; Yaqiong Wang; Zhiqiang Hu; Steven E Brenner; Carlo Ferrari; Alessandra Murgia; Silvio CE Tosatto; Emanuela Leonardi

doi:10.1002/humu.23823

Author manuscript; available in PMC: 2020 Sep 1.

Published in final edited form as: Hum Mutat. 2019 Jul 3;40(9):1330–1345. doi: 10.1002/humu.23823

PMCID: PMC7341177 NIHMSID: NIHMS1033051 PMID: 31144778

Abstract

The CAGI-5 intellectual disability challenge asked participants to use computational methods to predict patient clinical phenotypes and the causal variant(s) based on an analysis of their gene panel sequence data. Sequence data for 74 genes associated with intellectual disability (ID) and/or autism spectrum disorders (ASD) from a cohort of 150 patients with a range of neurodevelopmental manifestations (i.e. ID, autism, epilepsy, microcephaly, macrocephaly, hypotonia, ataxia) were made available for this challenge. For each patient, predictors had to report the causative variants and which of the seven phenotypes were present. Since neurodevelopmental disorders are characterized by strong comorbidity, tested individuals often present more than one pathological condition. Considering the overall clinical manifestation of each patient, the correct phenotype was predicted by at least one group for 93 individuals (62%). ID and ASD were the best predicted among the seven phenotypic traits. Causative or potentially pathogenic variants were also predicted correctly by at least one group. However, predicting the correct causative variant seems to be insufficient to predict the correct phenotype. In some cases, the correct prediction was supported by rare or common variants in genes different from the causative one.

Keywords: Critical assessment, community challenge, genetic testing, phenotype prediction, variant interpretation

Introduction

Neurodevelopmental disorders (NDDs) are a spectrum of disease conditions affecting brain development. Manifestations increase as childhood progresses, because the pathogenic conditions disturb normal brain development. They usually start with a non-specific form of intellectual disability (ID), characterized by limitations both in intellectual functioning (reasoning, learning, problem solving) and in adaptive behaviour, which covers a range of everyday social and practical skills. However, additional manifestations, such as autistic spectrum disorders (ASD) and epileptic seizures, can arise (Bowley and Kerr 2000; Tonnsen et al. 2016). Structural abnormalities of the cranium (i.e. microcephaly, macrocephaly) may also be present at birth or appear postnatally. People with ID also show delayed motor development, which becomes evident with abnormalities in gait, such as ataxic gait (i.e. a lack of coordination in movement with a tendency to fall), hypotonia (general muscle weakness), or with ‘unconscious’ active motor behaviours (e.g. dyskinetic–dystonic movements or stereotypies) (Almuhtaseb, Oppewal, and Hilgenkamp 2014). NDDs are clinically and phenotypically diverse, but driven by a substantial and overlapping genetic component, with numerous shared risk genes underlying these conditions (Mitchell 2011). In particular, complex conditions such as ID and ASD have already been associated with hundreds of different genes. Next-generation sequencing (NGS) has led to the identification of many new NDD genes with an excess of de novo mutations when compared to controls (Iossifov et al. 2014). Despite remarkable genetic heterogeneity, the findings from NGS and improvements in systems biology approaches have unraveled convergent biological pathways involved in brain development and have helped our understanding of disease pathophysiology (Pinto et al. 2014; Krumm et al. 2014; Barabási, Gulbahce, and Loscalzo 2011; An et al. 2014).

As NDDs can in principle be diagnosed even before birth by genetic tests, next-generation sequencing is increasingly applied in clinical practice. Medical laboratories are routinely asked to screen hundreds of patients who are either affected by NDDs or at risk of developing the condition. The limiting factor for successful diagnosis has therefore become the identification of causative mutations to associate with given pathogenic phenotypes. As most of these mutations are extremely rare or private, the problem is one of interpreting the effects of scores of variants of unknown significance on a wide range of candidate genes. This background fits well into the framework of the Critical Assessment of Genome Interpretation (CAGI) experiment, which has a declared goal of assessing methods to help interpret the effects of variants of unknown significance. A similar challenge was present in the CAGI-4 experiment with the Hopkins gene panel, where predictors were asked to predict phenotypes based on the results of a genetic screening performed on a set of 83 genes associated with 14 different conditions (Chandonia et al. 2017).

The setup of the CAGI-5 ID challenge starts from a similar background. The Padua Genetics of Neurodevelopmental Disorders Lab at the Department of Woman and Child Health, University of Padua (henceforth, Padua NDD lab) has been using a gene panel to diagnose different NDD subtypes for the past couple of years. For the purpose of the CAGI-5 challenge, a dataset of 150 unpublished pediatric patients was released. Starting from the gene panel sequencing data, predictors were asked to predict (a) the phenotypes and (b) their causative or potentially causative variants for each patient. Phenotypes were derived from the clinical notes collected by geneticists visiting the patients. Candidate variants were validated by segregation analysis, i.e. by verifying their absence in the parents according to the de novo paradigm, or their inheritance from affected parents. It should be noted that this is a difficult “open world” CAGI challenge, as clinical notes may be somewhat subjective and only a subset of genes has been screened. Furthermore, the phenotypic traits to predict are pathophysiological conditions that can be present in different NDDs; thus, in contrast to the CAGI-4 Hopkins challenge, patients may manifest more than one of these phenotypes, in different combinations.

The challenge is realistic, as it represents well the difficulty of assigning causative mutations to complex neurological diseases in clinical practice. In a few selected cases, consistent predictions were used to challenge previous assumptions and have led to a revised molecular diagnosis.

Materials and Methods

Sequencing, variant nomenclature and analysis by the Padua NDD lab

Coding sequences and nearest flanking regions of 74 genes were targeted for deep sequencing with a custom Ampliseq panel assay using a mixture of oligonucleotides generating 1,834 amplicons covering 520 kb. Multiple indexed libraries were pooled and sequenced on the Ion PGM platform (Thermo Fisher Scientific). Alignment and variant calling were performed with the Ion Torrent Suite Software v 5.02 (Thermo Fisher Scientific). The panel of 74 genes was sequenced in 150 individuals referred to the Padua NDD lab for intellectual disability with or without autistic features. VCF files of the 150 patients were provided to the CAGI-5 organizers with clinical information regarding the presence of seven ‘phenotypic traits’ for each patient (Suppl. Table S1). The clinical information was provided by the patients’ physicians, who were asked to fill in a clinical record for each patient. When the clinician left a field empty, we indicated information about the specific trait as not available, although we cannot exclude that some patients may present it. The Padua NDD lab also indicated the identified variants of the sequenced genes that have been classified as causative, putative, or contributing factors (see Suppl. Table S2). Causative variants are supported by segregation analysis and genotype-phenotype correlation, while “putative” ones are rare or novel variants predicted as pathogenic for which segregation analysis is not available. Contributing factors are rare or novel variants predicted as pathogenic, inherited from apparently healthy parents, mapping on genes that confer a risk but are not sufficient to cause the disease, mapping on genes causing ASD susceptibility, or found mutated in individuals with very mild phenotypes. Table 1 summarizes the number of patients with variants associated with each phenotype.

Table 1.

Patients for whom the Padua NDD lab identified at least one causative or potentially disease-causing variant in the answer key, summarized by phenotype. Each variant is specific to one patient, and one patient can be associated with more than one phenotype.

Phenotype	Patients	Disease causing	Putative	Contributing factor	All variants
ID	49	25	18	12	55
ASD – Autistic traits	31	14	12	10	36
Epilepsy	18	9	8	2	19
Microcephaly	8	5	2	1	8
Macrocephaly	4	4	0	0	4
Hypotonia	6	4	1	1	6
Ataxia	3	1	2	1	4

To evaluate the putative clinical impact of the variants, the following criteria were applied: 1) allele frequency <0.002% in the gnomAD database, or <0.45% for variants in autosomal-recessive genes, as indicated by (Whiffin et al. 2019; Piton, Redin, and Mandel 2013); 2) absence of the variant in other samples (in-house database); 3) stop gain, frameshift and splicing variants were a priori considered to be most likely pathogenic; 4) for missense mutations, amino acid conservation and consensus of pathogenicity predictions were evaluated; 5) inheritance mode; 6) phenotypic consistency with the clinical signs associated with mutations in the same gene.

It is important to note that, for diagnostic purposes, the thresholds used by the Padua NDD lab to filter candidate variants have been calculated based on the assumption that the patient phenotype follows a Mendelian transmission.

Whiffin and colleagues demonstrated that, for human Mendelian disease, clinical genome interpretation is empowered by using high-resolution variant frequencies (Whiffin et al. 2019). To select candidate variants responsible for ID, Piton and colleagues suggested filtering variants with a frequency compatible with the incidence of the disease (i = 2% in the general population) (Piton, Redin, and Mandel 2013). Since the repeat expansion in the FMR1 gene remains the most frequent cause of X-linked forms of ID, and given the genetic heterogeneity of NDDs, we expect that mutations in any other gene account for less than 0.1% of all ID cases, resulting in a disease frequency <0.002% (i = 0.02 × 0.001). Variants in genes associated with recessive disorders should not exceed the threshold of 0.45% (√0.002%).
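The threshold arithmetic above can be reproduced with a few lines of Python. This is only an illustrative sketch: the variable names are ours, and the numeric inputs are the incidence and per-gene share stated in the text.

```python
import math

# Inputs taken from the text; variable names are illustrative.
id_incidence = 0.02      # incidence of ID in the general population (2%)
max_gene_share = 0.001   # assumed upper bound on the share of ID cases per gene (0.1%)

# Maximum disease frequency attributable to a single gene (0.02 x 0.001).
dominant_threshold = id_incidence * max_gene_share

# For recessive genes both alleles must carry a variant, so the
# allele-frequency ceiling is the square root of the disease frequency.
recessive_threshold = math.sqrt(dominant_threshold)

print(f"dominant / X-linked threshold: {dominant_threshold:.3%}")
print(f"recessive threshold: {recessive_threshold:.2%}")
```

Rounded to the precision used in the text, the two values correspond to the 0.002% and 0.45% cut-offs.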

Challenge format

Participants were provided with 150 VCF files, one per patient, a detailed description of the seven disease phenotypes given in Suppl. Table S1, the 74 gene identifiers, the gene captured regions used in sequencing the patients in Browser Extensible Data (BED) format, a submission template, and a submission validation script. Furthermore, participants were informed that each patient may have more than one phenotypic trait, and all have at least one.

Participants were asked to submit predictions of phenotypic traits and causative variants for each patient, based on their gene panel sequences. For each submission, participants were required to predict the probability that a patient has each of the 7 phenotypic traits provided, as well as the predicted causal variant(s) from the gene panel sequence dataset for every disease class with a non-zero probability. Each predicted disease class probability also included a mandatory standard deviation (SD) field indicating the confidence of the prediction, with low SD indicating high confidence and high SD indicating low confidence.

Assessment

The prediction assessment focused on evaluating the predictive ability of the different submissions, considering their performance on each disease phenotype. This approach has been successfully used for the analysis of multilabel classifier performance, since it focuses on a set of two-class prediction problems (Fawcett 2006). It also simplifies the assessment procedure, making it possible to compare and highlight the performance of different methods on each single phenotype, instead of evaluating them on the whole predicted class matrix (150 × 7, one prediction for each patient and phenotype).

Predicted disease classes for each submission were assessed against the clinical phenotype given in the Padua NDD lab answer key, using the procedure described below. When predictors did not provide a probability value and left the asterisk in the template file, it was treated as probability zero in the assessment.

The first phase of the assessment procedure was the conversion of submitted probability values to positive (1) or negative (0) classes. The conversion was done separately for each phenotype column, using as threshold the probability value that maximizes the Matthews correlation coefficient (MCC) for that phenotype. We compared all probability values of each phenotype with the corresponding threshold and assigned 0 or 1 if the value was lower or higher, respectively. In addition, different performance measures were used to assess the predictions for each phenotype. Sensitivity and specificity were used to evaluate the capability of a model to detect positive cases and to discriminate between positive and negative classes. The MCC, accuracy (ACC) and F1 measures were used to evaluate both negative and positive predictions at the same time (see Suppl. Material for details). In particular, MCC has been shown to be less influenced by an unbalanced dataset (Vihinen 2012), as is the case in this challenge, where some phenotypes are highly unbalanced (Figure 1C). ROC curves were produced comparing experimental and predicted probability values for each phenotype, and the area under the ROC curve (AUC) was calculated for each.
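The thresholding step can be sketched as follows. This is not the published R code from the assessment repository; it is a minimal Python illustration in which the function names and the toy data are hypothetical, and in which a probability equal to the threshold is assumed to count as positive (the text does not specify how ties are handled).

```python
import math

def confusion(labels, calls):
    """Return (tp, tn, fp, fn) for binary labels and binary calls."""
    tp = sum(1 for y, c in zip(labels, calls) if y == 1 and c == 1)
    tn = sum(1 for y, c in zip(labels, calls) if y == 0 and c == 0)
    fp = sum(1 for y, c in zip(labels, calls) if y == 0 and c == 1)
    fn = sum(1 for y, c in zip(labels, calls) if y == 1 and c == 0)
    return tp, tn, fp, fn

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; 0 when the denominator vanishes."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom

def best_mcc_threshold(labels, probs):
    """Scan every submitted probability as a candidate threshold."""
    best_t, best_score = None, -2.0
    for t in sorted(set(probs)):
        # Assumption: values at or above the threshold are called positive.
        calls = [1 if p >= t else 0 for p in probs]
        score = mcc(*confusion(labels, calls))
        if score > best_score:
            best_t, best_score = t, score
    return best_t, best_score

# Toy example for one phenotype column (made-up values).
labels = [1, 1, 1, 0, 1, 0]
probs = [0.9, 0.8, 0.6, 0.55, 0.3, 0.1]
best_t, best_score = best_mcc_threshold(labels, probs)
print(best_t, round(best_score, 3))
```

Because the threshold is chosen on the same data it is evaluated on, the resulting MCC, ACC and F1 values are best-case figures for each submission.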

Figure 1. — Summary of CAGI-5 intellectual disability challenge experimental data. A) For the 150 patients included in the study, the Padua NDD lab noted at least one mutation relevant to the phenotype in 33% of the patients. B) Distribution of variant classes. C) Number of patients where the presence or absence of the phenotype was ascertained by a clinician.

The R scripts used to perform the assessment are publicly available from the GitHub repository at URL: https://github.com/BioComputingUP/CAGI-ID-assessment.

Prediction methodology

A total of four groups, plus a late prediction (which can be found in the Suppl. Material), submitted predictions for the ID challenge. The group prediction approaches are summarized in Table 2 and described in detail below.

Table 2.

Computational approaches adopted by different groups.

Group ID	Submission	Group name	Annotation	Gene-Phenotype	Variant impact	Filter: frequency	Filter: low quality	Inheritance model
1	1.1	Mooney-Radivojac	ANNOVAR	HGMD, PhenoPred, and PPI for network propagation	MutPred2 and MutPredLOF	-	-	-
2	2.1	Moult Lab	Varant	OMIM+GHR, OMIM+HPO	13 levels of variant impact	SNVs >1%, SNVs in LCR low complexity region	yes	yes
	2.2
	2.3
	2.4
	2.5
	2.6
3	3.1	Lichtarge Lab	ANNOVAR	Diffusion on CTD (Comparative Toxicogenomics Database) associations	Evolutionary Action	No	yes	no
	3.2
	3.3
4	4.1	Brenner Lab	CHESS v0.1	Phenolyzer	VEP, REVEL score	SNVs MAF>5%	yes	yes
	4.2
	4.3

Group 1 (Mooney - Radivojac Lab):

Annotation of the protein-coding variants in the raw VCF files was performed using ANNOVAR [http://annovar.openbioinformatics.org/en/latest/], including extraction of wild type and mutant protein sequences (Wang, Li, and Hakonarson 2010). Pathogenicity prediction scores were assigned to missense, stop gain, and frameshifting indel variants with MutPred2 [http://mutpred.mutdb.org/] (Pejaver et al. 2017) and MutPred-LOF [http://mutpredlof.cs.indiana.edu/] (Pagel et al. 2017). In each individual, phenotypic trait risk was determined based only upon the variant with the highest pathogenicity prediction score across a set of phenotype-specific risk genes. For each phenotypic trait, a list of risk genes that are known to harbor disease-causing variants associated with that phenotypic trait was compiled from the Human Gene Mutation Database (HGMD) (Stenson et al. 2017).

Gene lists were extended, particularly those with fewer known risk genes (macrocephaly, hypotonia and ataxic gait), with the PhenoPred [https://www.phenopred.org/] web tool (Radivojac et al. 2008) and a gene prioritization algorithm. Confirmed risk genes were used as “seed” genes on the human protein-protein interaction network for running a network propagation algorithm (Nabieva et al. 2005). The propagation algorithm was run with 5-fold cross validation so as to obtain an initial score in the range [0, 1] for all the genes. The AlphaMax algorithm (Jain, White, and Radivojac 2016) was used to estimate the positive proportion of the risk genes and to calibrate those initial scores into proper probability scores measuring the likelihood of a gene being associated with the disease. For each phenotypic trait, the submitted probability was the MutPred2 or MutPred-LOF score of the highest scoring variant in the associated risk genes.
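The final scoring rule of Group 1, taking the highest variant score within the risk genes of each phenotype, can be illustrated with a short Python sketch. The function name, the gene identifiers and the scores below are hypothetical, and returning 0.0 when a patient carries no scored variant in the risk genes is our assumption rather than something stated by the group.

```python
def phenotype_probabilities(variant_scores, risk_genes):
    """Assign each phenotype the highest pathogenicity score among the
    patient's variants that fall in that phenotype's risk genes.

    variant_scores: list of (gene, score) pairs for one patient.
    risk_genes: dict mapping phenotype name to a set of gene names.
    """
    result = {}
    for phenotype, genes in risk_genes.items():
        scores = [score for gene, score in variant_scores if gene in genes]
        # Assumption: no variant in any risk gene yields probability 0.0.
        result[phenotype] = max(scores, default=0.0)
    return result

# Hypothetical patient and gene lists.
patient = [("GENE_A", 0.82), ("GENE_B", 0.35), ("GENE_C", 0.61)]
risk = {
    "ID": {"GENE_A", "GENE_B"},
    "Epilepsy": {"GENE_C"},
    "Ataxia": {"GENE_D"},
}
probabilities = phenotype_probabilities(patient, risk)
print(probabilities)
```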

Group 2 (Moult Lab):

The 150 VCF files (one VCF file per patient) provided for the challenge were annotated using the Varant tool [http://compbio.berkeley.edu/proj/varant/Home.html], including region of occurrence (intron, exon, splice site or intergenic), observed minor allele frequencies (MAF), mutation type, predicted impact on protein function, and previously established associated phenotypes reported in ClinVar (Landrum et al. 2014). The RefGene (Pruitt et al. 2014) gene definition file was used for gene and transcript annotations in Varant. In addition, in-house scripts were written to further annotate the VCF files with HGMD (Stenson et al. 2003) disease-related variants, with dbscSNV (Jian, Boerwinkle, and Liu 2014) and SPIDEX [http://tools.genes.toronto.edu/] (Xiong et al. 2015) variants that potentially alter splicing, and with REVEL (Ioannidis et al. 2016) scores for missense variants. A quality control (QC) analysis was performed to exclude outlier samples (see Suppl. Material). The transition/transversion ratio (Ts/Tv) and heterozygous/homozygous ratio were compared to the 1000 Genomes dataset for the genomic regions captured for sequencing in the challenge dataset. Comparison of common, rare, and novel variant counts across samples was also performed. The 74 genes were mapped to one or more of the seven phenotype traits using two independent approaches, generating two different gene-phenotype mapping files. In addition to the OMIM database, the Genetic Home Reference (https://ghr.nlm.nih.gov/) or Human Phenotype Ontology (https://hpo.jax.org/app/) databases, respectively, were used to map the phenotypes to the genes. The variant prioritization procedure was performed on each of these phenotype lists. Only rare variants (MAF less than or equal to 1% in ExAC [http://exac.broadinstitute.org]) or novel variants (not reported in ExAC), flagged as PASS in the VCF files, were considered. Indels in low complexity regions (LCR) were excluded from the analysis, based on the LCR dataset pre-computed for the human genome by Heng Li (Li 2014). A strand bias filter was used to remove variants whose alternate allele was present only on one strand of the reads mapped to the variant position. Variant prioritization was based on two main criteria, variant quality and variant impact, which were applied sequentially to each sample. Five different levels of variant quality and 13 different types of variant impact were defined for these two criteria, respectively (for more details see Suppl. Material). Putative causative variants identified were further filtered by the inheritance model associated with the gene, according to the information available for that gene in OMIM and the Genetic Home Reference database.
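The basic inclusion rules described for Group 2 (PASS flag, rare or novel in ExAC, no indels in low complexity regions, no single-strand support) can be summarized in a small predicate. This Python sketch is illustrative only: the function and parameter names are hypothetical, and the group's actual pipeline is described in the Suppl. Material.

```python
def keep_variant(filter_flag, exac_maf, is_indel, in_lcr,
                 alt_forward_reads, alt_reverse_reads, max_maf=0.01):
    """Return True if a variant passes the basic filters described above.

    exac_maf is None for novel variants (not reported in ExAC).
    """
    if filter_flag != "PASS":
        return False
    # Rare (MAF <= 1%) or novel variants only.
    if exac_maf is not None and exac_maf > max_maf:
        return False
    # Indels in low complexity regions are excluded.
    if is_indel and in_lcr:
        return False
    # Strand bias: the alternate allele must be seen on both strands.
    if alt_forward_reads == 0 or alt_reverse_reads == 0:
        return False
    return True

# Hypothetical examples.
print(keep_variant("PASS", None, False, False, 12, 9))   # novel SNV
print(keep_variant("PASS", 0.03, False, False, 12, 9))   # too common
print(keep_variant("PASS", 0.001, True, True, 12, 9))    # indel in LCR
print(keep_variant("PASS", 0.001, False, False, 20, 0))  # one strand only
```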

To compute a probability score, i.e. the probability of a variant causing a disease phenotype, a number of ad hoc procedures were used. The exception was missense variants, where the probability was assigned using the extent of consensus among the four missense-analysis methods, previously calibrated from HGMD data and a control set of inter-species variants. Other variant types were subjectively assigned probabilities depending on the severity of the impact. Furthermore, the probability score was adjusted depending on the mode of inheritance considered. Ad hoc probabilities of a correct variant call were also assigned to each variant based on the variant quality filters (see Suppl. Material). Six different predictions were produced based on the two different gene-phenotype lists and different combinations of probabilities.

Group 3 (Lichtarge Lab):

Variants of poor sequencing quality (QUAL<80) were excluded from the analysis and the remaining variants were annotated with ANNOVAR [http://annovar.openbioinformatics.org/en/latest/] (Wang, Li, and Hakonarson 2010). There were three submissions that used i) only missense, ii) missense and nonsense, and iii) all variants. The effect of each variant was estimated with the Evolutionary Action (EA) [http://mammoth.bcm.tmc.edu/uea/] equation (Katsonis and Lichtarge 2014) and the function loss of each gene was calculated as: LOF_g=1-∏(1-EA_i/100), where ∏ indicates the product over all mutations i in that gene. Nonsense and fs-indel variants were given an EA of 100, while silent variants were given an EA of 0. Genes were also weighted for their ability to tolerate mutations (w_g), calculated as the fractional rank of the average EA score of mutations seen in the gnomAD data (Lek et al. 2016). The weighted loss of function of each gene (w_g*LOF_g) was used as the starting value for diffusion across the CTD gene-disease network (Mattingly et al. 2003). Diffusion scores were calculated for each disease (Lin et al. 2018) and a collective burden was calculated for each of the seven disease categories (normalized between 0 and 1). The relative ratios of the collective burden of the disease categories were used as the probability that a patient belongs to each disease category. The variants that contributed most to the collective burden of each disease category were reported as the causal variants.
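The per-gene loss-of-function formula can be made concrete with a short Python sketch. The formula and the EA conventions (0 to 100 scale, 100 for nonsense and frameshift indels, 0 for silent variants) come from the text; the function names, the example EA scores and the example weight are hypothetical.

```python
from math import prod

def gene_lof(ea_scores):
    """LOF_g = 1 - prod(1 - EA_i / 100) over all variants i in a gene.

    EA scores are on a 0-100 scale. A gene with no variants gets LOF_g = 0.
    """
    return 1.0 - prod(1.0 - ea / 100.0 for ea in ea_scores)

def weighted_gene_lof(ea_scores, gene_weight):
    """Starting value for network diffusion: w_g * LOF_g."""
    return gene_weight * gene_lof(ea_scores)

# Two missense variants with EA = 50 each: 1 - 0.5 * 0.5 = 0.75.
print(gene_lof([50, 50]))
# A nonsense variant (EA = 100) saturates the gene regardless of the rest.
print(gene_lof([100, 30]))
# A silent variant (EA = 0) contributes nothing.
print(gene_lof([0]))
# Hypothetical tolerance weight w_g = 0.8.
print(weighted_gene_lof([50, 50], 0.8))
```

The product form treats variants as independent hits on the gene, so several moderate variants can accumulate to a loss comparable with a single severe one.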

Group 4 (Brenner Lab):

This group used their software CHESS v0.1, adjusting some parameters to perform predictions for the CAGI-5 ID challenge. Public data used in CHESS are variant frequency data from gnomAD v2.0.2 [https://gnomad.broadinstitute.org/] (Lek et al. 2016), pre-calculated variant deleteriousness scores from REVEL [https://sites.google.com/site/revelgenomics/] (Ioannidis et al. 2016), and clinical evidence data from ClinVar (Landrum et al. 2016) (downloaded on 2017-10-02). Phenotype matching scores for all genes were calculated using Phenolyzer (Yang, Robinson, and Wang 2015). Pre-called variants from the case exome were annotated using VEP (McLaren et al. 2016), gnomAD variant frequency data, ClinVar evidence, and the pre-calculated REVEL scores. To reduce the computing burden, common variants (MAF >= 5%) and non-protein-altering variants were excluded from the analysis. The selected variants were scored based on quality of data, impact severity, and phenotype-match score (see Suppl. Material). Different scoring adjustments were also performed based on the inheritance mode considered. The three submissions correspond to three models with different stringency in the final decision, based on variant frequency in the 150 patient cohort and the probability score threshold used for each prediction.

Results

Summary of experimental data and submissions

Four groups submitted a total of 13 predictions for the CAGI-5 intellectual disability (ID) challenge. Group 1 submitted one prediction, group 2 submitted six, and groups 3 and 4 submitted three each. In addition, a late submission (Group 5) was not considered for the general assessment but can be found in the Suppl. Material. Table 2 summarizes the participating groups, computational methods and their submissions.

An overview of the genetic and clinical data used in the ID challenge is shown in Figure 1. The 150 patients in the challenge can be divided into two groups: (1) patients for whom the Padua NDD lab identified at least one causative or potentially disease-causing variant in the answer key (50 patients, 33%) and (2) patients for whom the Padua NDD lab excluded the presence of potentially pathogenic variants (100 patients, 67%). The total number of variants associated with at least one phenotype is 56, and each variant is unique to one patient. Table 1 shows how variants are distributed among the different phenotypes. These variants were classified by the Padua NDD lab according to their possible effect as follows (Figure 1B): causative (25 variants), putative (18 variants) and contributing factor (13 variants). However, all variants were treated equivalently for purposes of assessing and ranking predictions.

Most of the patients with identified variants have at least one causative variant, while 16 and 13 patients show at least one putative or contributing factor variant, respectively. Combinations of different variant types in the same patient were observed only in a limited number of cases.

Phenotypic features were associated to each patient by a clinician. Although all patients have at least one feature assigned, the phenotypes were not equally represented in all patients. Figure 1C shows that most of the patients have ID, ASD, or Epilepsy. Other phenotypes (Microcephaly, Macrocephaly, Hypotonia and Ataxia) were less frequently observed in these patients. Nevertheless, for many patients no information was available about the presence or absence of a phenotype. Analyzing the overlap among phenotypes in patients, most patients have in common the phenotypes ID and ASD (39 patients), and ID, ASD and Epilepsy (21 patients) (Suppl. Figure S2).

Phenotype prediction assessment

In this CAGI-5 challenge, the phenotype assessment was performed individually for each of the seven phenotypic traits assigned by the Padua NDD lab. Figure 2 shows the number of groups correctly predicting a patient phenotype when it was present. ID was the best predicted phenotype: most groups predicted it correctly for about 90% of ID patients. ASD was the second best predicted phenotype. Despite the limited number of patients with Microcephaly (18 patients), this phenotype was correctly predicted in about 60% of cases by most of the groups. Other phenotypes, such as Epilepsy, Macrocephaly, Hypotonia and Ataxia, were poorly predicted by the different groups.

Figure 2. — Number of patients with the phenotype. Colors represent the proportion and number of groups which correctly predicted the phenotype.

Considering only patients for whom the presence or absence of a phenotype was ascertained by a clinician (Figure 1C), it is important to observe the patient coverage of each submission. Suppl. Figure S1 shows the number of patients with predictions in each submission for the different phenotypes; at least one prediction was made for all possible patients. However, group 3 and 5 submissions did not predict any probability values for many patients, particularly for the ASD, Epilepsy, Macrocephaly, Hypotonia and Ataxia phenotypes. Suppl. Figure S1 also shows the number of patients for whom the phenotype was correctly predicted.

The overall submission performance was assessed using AUC for each phenotype, with MCC, ACC, and F1 measures used to better evaluate predictions. Figure 3 shows a summary of the AUC values obtained by each submission in the different phenotypes. In addition, Figure 4 and Suppl. Table S3 show the ROC curves and performance measures, respectively, for all submissions in each phenotype. Since the overall predictions are far from perfect, the prediction assessment for each phenotypic trait was also performed on the subset of patients for whom the Padua NDD lab noted a potentially causative variant, as in previous CAGI challenge assessments (data not shown) (Chandonia et al. 2017). However, this did not show any improvement in predictor performance.

Figure 3. — Overall performance for each submission on phenotype prediction. A) Each cell represents the AUC values. The color scale ranges from dark (+1, perfect performance) to white (0, bad performance). White means random performance. B) Each cell represents the MCC values. The color scale ranges from green (+1, perfect correlation) to red (−1, negative correlation). White means no better than random prediction.

Figure 4. — ROC curves for each phenotype. Submissions are colored by predictor group.

For the ID phenotype, submission 4 of group 2 achieved the highest AUC value (0.78), followed by submissions 2, 6 and 3 of the same group and submission 3 of group 3. Indeed, submission 3.3 obtained the highest overall performance considering all measures: it correctly predicted 146 out of 150 patients and showed a moderate correlation with patient clinical data. ASD was the second most frequently noted phenotype in patients by the Padua NDD lab. While all group 4 submissions and submission 2.3 reached higher AUC values than other groups, AUC values (average 0.56) and ROC curves remain close to random. Submissions 4.3 and 1.1 achieved the best performance considering the other measures, with submission 1.1 matching it in ACC, MCC and F1. Both submissions correctly identified the patient phenotype in almost 100% of the cases.

Despite the rather good AUC, ACC and F1 values reached by some groups for the ID phenotype, and also for ASD, the MCC values remain quite low. Since MCC is less influenced by unbalanced categories, it shows a more realistic picture of prediction performance. As most patients have the ID and ASD phenotypes, the confusion matrix is strongly biased towards true positives due to the highly imbalanced classes. As a result, the ROC curve and the corresponding AUC do not correctly reflect the real predictor performance.
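The effect of class imbalance on these measures can be seen in a small numerical example. The confusion matrix below is invented for illustration (it is not taken from any submission): a predictor that calls almost every patient positive on a 140 positive / 10 negative dataset.

```python
import math

# Hypothetical confusion matrix: 140 patients with the phenotype, 10 without.
tp, fn = 138, 2   # positives called correctly / missed
fp, tn = 9, 1     # negatives called positive / called correctly

# Accuracy: fraction of all calls that are correct.
acc = (tp + tn) / (tp + tn + fp + fn)

# F1: harmonic mean of precision and recall; it ignores true negatives.
f1 = 2 * tp / (2 * tp + fp + fn)

# MCC uses all four cells, so poor performance on negatives lowers it.
mcc = (tp * tn - fp * fn) / math.sqrt(
    (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
)

print(f"ACC = {acc:.3f}, F1 = {f1:.3f}, MCC = {mcc:.3f}")
```

In this toy case ACC and F1 both exceed 0.9 while MCC stays below 0.2, because nine of the ten negative patients are misclassified.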

The presence or absence of the Epilepsy phenotype was poorly predicted by most groups, with an average MCC value of 0.05. This phenotype was particularly difficult, as roughly half of the patients had the disease. The best performances were achieved by group 4 and submission 1.1, which correctly predicted more than 60% of patients. While performance measures show modest values, group 4 obtained the highest AUC (0.56) and group 1 the best MCC, ACC, and F1 values (see Suppl. Table S3).

Information about the presence or absence of Microcephaly and Macrocephaly was available for about half (81) of the patients. Microcephaly was reported in 18 patients and Macrocephaly in 12. Predictions for Microcephaly performed modestly, the best AUC being reached by submission 4.3, which correctly predicted 42 patients. Group 1 also predicted most of the patients with the phenotype (15 correct). In addition, most group 2 submissions obtained the best MCC and ACC values compared to other groups, correctly predicting 66 patients. However, the best MCC values are again poor compared to other measures, denoting the effect of unbalanced categories on the predictions. Group 2 predictions were biased towards identifying patients without the phenotype (63 out of 63 patients) and identified just three patients with the disease. In contrast, submission 4.3 was biased towards predicting patients with the disease (17 out of 18 patients) and identified only 25 patients without the phenotype.

Performance assessment for Macrocephaly shows results similar to those for Microcephaly. Group 4 submissions performed better than other groups in terms of AUC. Submission 4.3 predicted the highest number of patients correctly (68 out of 69 patients without the phenotype and 3 out of 12 patients with the phenotype). Submissions 4.1 and 4.2 correctly predicted the highest number of patients with the phenotype (8 out of 12 patients). Group 2 scored quite well in the prediction of patients without the phenotype, with its submissions correctly identifying most of the patients in whom the phenotype was not noted. MCC values among submissions are again low, indicating that predictions were strongly biased.

The Hypotonia phenotype was positively or negatively noted in 68 patients by the Padua NDD lab. AUC values reached by the different groups are poor, averaging around 0.5. Indeed, performance measures such as MCC and ACC are lower than for other phenotypes. Submission 4.3 obtained the best AUC, MCC and ACC values compared to other groups, correctly predicting 44 patients (6 out of 28 with the phenotype and 38 out of 40 without the phenotype). Submissions 2.1 and 2.3 predicted most of the patients with the phenotype (17 and 16, respectively).

The Ataxia phenotype was positively or negatively noted in 54 patients, of whom only 11 had the disease. Submissions 4.1, 4.3, 2.1, 2.2 and 2.5 correctly predicted most of the patients but were biased towards detecting patients without the phenotype. Submissions 2.3 and 2.4 correctly predicted the presence of the disease in 7 and 8 patients, respectively. Consequently, the best AUC and MCC values were obtained by submission 2.4.

The overall submission ranking of this challenge was based on the average of the AUC rankings for each phenotype. Table 3 shows the position reached by each submission for the different phenotypes. The best average rank was obtained by submission 4.3, followed by the other submissions of the same group.

Table 3.

Overall ranking among phenotypes by each submission. Individual phenotype ranking for each submission was made considering the performance measured by AUC.

Submission	ID	ASD	Epilepsy	Microcephaly	Macrocephaly	Hypotonia	Ataxia	Avg. Ranking	Final
4.3	6	4	1	1	3	1	11	3.86	1
4.1	10	1	2	3	1	7	13	5.29	2
4.2	9	3	4	4	2	5	10	5.29	2
3.3	7	12	5	9	6	3.5	8	7.21	4
2.3	4	2	12	12	10	8	3	7.29	5
3.2	12	7	6	9	6	3.5	8	7.36	6
1.1	11	9	3	2	4	11	12	7.43	7
2.1	5	5	13	7	11	6	5	7.43	7
2.4	1	11	8	13	9	10	1	7.57	9
3.1	13	8	7	9	6	2	8	7.57	9
2.2	2	10	9	5	13	13	4	8	11
2.5	8	6	11	6	8	12	6	8.14	12
2.6	3	13	10	11	12	9	2	8.57	13
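The last two columns of Table 3 can be recomputed from the per-phenotype ranks, as in the following Python sketch. The ranks are copied from the table; the assumption made here is that tied submissions share the best available final position, which is what the repeated values in the Final column suggest.

```python
# Per-phenotype AUC ranks copied from Table 3, in the order
# ID, ASD, Epilepsy, Microcephaly, Macrocephaly, Hypotonia, Ataxia.
ranks = {
    "4.3": [6, 4, 1, 1, 3, 1, 11],
    "4.1": [10, 1, 2, 3, 1, 7, 13],
    "4.2": [9, 3, 4, 4, 2, 5, 10],
    "3.3": [7, 12, 5, 9, 6, 3.5, 8],
    "2.3": [4, 2, 12, 12, 10, 8, 3],
    "3.2": [12, 7, 6, 9, 6, 3.5, 8],
    "1.1": [11, 9, 3, 2, 4, 11, 12],
    "2.1": [5, 5, 13, 7, 11, 6, 5],
    "2.4": [1, 11, 8, 13, 9, 10, 1],
    "3.1": [13, 8, 7, 9, 6, 2, 8],
    "2.2": [2, 10, 9, 5, 13, 13, 4],
    "2.5": [8, 6, 11, 6, 8, 12, 6],
    "2.6": [3, 13, 10, 11, 12, 9, 2],
}

# Rank sums are compared instead of averages to avoid rounding issues in ties.
totals = {sub: sum(r) for sub, r in ranks.items()}
avg_rank = {sub: totals[sub] / len(r) for sub, r in ranks.items()}

# Assumed tie rule: final position = 1 + number of strictly better submissions.
final = {sub: 1 + sum(t < totals[sub] for t in totals.values()) for sub in ranks}

for sub in sorted(ranks, key=lambda s: (totals[s], s)):
    print(f"{sub}\t{avg_rank[sub]:.2f}\t{final[sub]}")
```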

For the CAGI-5 challenge, the assessment was also performed for each patient considering their overall clinical manifestations (Suppl. Table S4). Only 39 patients have all seven phenotypes negatively or positively assigned by the Padua NDD lab. Among them, 13 patients (33%) were correctly predicted by at least one group, two patients were correctly predicted by two groups and three patients were correctly predicted by three groups. Four of these individuals have at least one variant. Group 2 was the best, correctly predicting the phenotype of 8 patients when all six of its submissions are taken into account. In particular, submission 2.4 correctly predicted 7 patients. Among the 119 patients with information about at least three phenotypes, 63 (53%) were correctly predicted by at least one group. Group 1 and submission 4.3 were the best, each correctly predicting 24 of these 63 patients.

Furthermore, we assessed the prediction of the overall clinical manifestation only for the 38 patients for whom the Padua NDD lab noted a causative or putative variant. For 22 of them (58%) the whole phenotype was correctly predicted by at least one group. Submission 4.3 and group 1 correctly predicted the phenotype in the same number of patients (13; 34.2%), 7 with causative and 6 with putative variants. On the other hand, among the 100 patients with at least one assigned phenotype for whom the Padua NDD lab did not report a causative, putative or contributing factor variant, 63 (63%) were correctly predicted by at least one group. Considering only the patients with at least three assigned phenotypes (80), 22 (27%) were correctly predicted by at least one group. Again, submission 4.3 and group 1 performed best in this subset of patients without pathogenic variants.
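One possible reading of a "correctly predicted" overall clinical manifestation is an exact match across every trait that was positively or negatively assigned, ignoring traits without clinical information. The Python sketch below illustrates that reading only; the data layout, the function name and the 0.5 probability threshold are assumptions made for illustration and are not taken from the assessment itself.

```python
PHENOTYPES = ["ID", "ASD", "Epilepsy", "Microcephaly",
              "Macrocephaly", "Hypotonia", "Ataxia"]

def overall_match(observed, predicted, threshold=0.5):
    """True if every clinically assigned trait is predicted correctly.

    observed:  trait -> True / False, or None when the trait was not assessed.
    predicted: trait -> probability in [0, 1] (hypothetical submission format).
    """
    assigned = [t for t in PHENOTYPES if observed.get(t) is not None]
    if not assigned:
        return False  # nothing to compare against
    return all((predicted[t] >= threshold) == observed[t] for t in assigned)

# Hypothetical patient with four assigned traits.
observed = {"ID": True, "ASD": True, "Epilepsy": False, "Microcephaly": None,
            "Macrocephaly": None, "Hypotonia": False, "Ataxia": None}
predicted = {"ID": 0.9, "ASD": 0.7, "Epilepsy": 0.2, "Microcephaly": 0.8,
             "Macrocephaly": 0.1, "Hypotonia": 0.4, "Ataxia": 0.6}

print(overall_match(observed, predicted))
# A single wrong trait is enough to make the overall prediction incorrect.
print(overall_match(observed, dict(predicted, Epilepsy=0.8)))
```

Because a single mismatched trait invalidates the whole prediction, this criterion becomes stricter as more traits are assigned to a patient.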

Variant prediction assessment

Predictors were also assessed for their ability to detect variants in patients where clinicians noted at least one variant probably associated with the phenotype. Figure 5 shows variant predictions for all patients and phenotypes by the different submissions. The number of experimental variants (EV) and their corresponding classification are shown in the first three bars of the plot. Submissions of group 2 show the highest number of correctly predicted variants associated with the different patient phenotypes (37 out of 56). Indeed, group 2 outperformed other groups for causative (16 out of 25), putative causative (12 out of 18) and contributing factor (9 out of 13) variants. Submission 4.3 ranked second, correctly predicting 29 variants (11 causative, 9 putative causative and 9 contributing factor variants).

Figure 5. Predicted variants distribution. Category “Experimental” is the number of variants which were identified and classified by the Padua NDD lab. Each bar represents the number and type of variants predicted by each submission.

In addition, Figure 6 shows the fraction of each mutation type correctly predicted by the different groups. Only a small number of variants were correctly predicted by all groups. At least 3 groups correctly identified 28% of causative, 15% of contributing and 17% of putative variants. Table 4 contains the fraction of correctly predicted variants for each group submission. Group 2 not only correctly predicted the most variants but also obtained the highest fraction of correctly predicted variants, calculated as the number of correctly predicted variants divided by all the variants predicted for all patients and phenotypes. Suppl. Table S2 summarizes all variants noted by the Padua NDD lab and the groups which predicted them correctly. All 25 causative variants, except the SHANK3 frameshift indel chr22:51159830:A:TTC in patient MR1970.01, were correctly predicted by at least one group. After the initial assessment, we realized that this complex genetic event (nucleotide substitution chr22:51159830:A:C plus a TT insertion) was molecularly characterized by Sanger validation of the chr22:51159830:A:C variant, but the variant caller plugin failed to call the insertion at the adjacent position in the same reads. However, group 4 correctly predicted chr22:51159830:A:C as a potentially pathogenic variant.

Table 4:

Summary of variants prediction assessment by each submission.

Submission	Correctly pred. Variants	Total pred. Variants	Correctly pred. Variants / Exp. Variants	Correctly pred. Variants / Total pred. Variants
1.1	16	228	0.29	0.07
2.1	37	174	0.66	0.21
2.2	37	171	0.66	0.21
2.3	37	174	0.66	0.21
2.4	37	171	0.66	0.21
2.5	37	174	0.66	0.21
2.6	37	171	0.66	0.21
3.1	12	129	0.21	0.09
3.2	16	135	0.29	0.12
3.3	16	148	0.29	0.11
4.1	16	157	0.29	0.10
4.2	10	113	0.18	0.09
4.3	29	290	0.52	0.10
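The two ratio columns of Table 4 follow directly from the counts, as sketched below in Python. The counts are copied from the table and the 56 experimental variants are the sum of the 25 causative, 18 putative and 13 contributing factor variants. Describing the ratios as recall-like and precision-like is an interpretation added here, and recomputed values may differ from the table in the last digit depending on the rounding used.

```python
EXPERIMENTAL_VARIANTS = 56  # 25 causative + 18 putative + 13 contributing

# (correctly predicted, total predicted) per submission, from Table 4.
counts = {
    "1.1": (16, 228), "2.1": (37, 174), "2.2": (37, 171), "2.3": (37, 174),
    "2.4": (37, 171), "2.5": (37, 174), "2.6": (37, 171), "3.1": (12, 129),
    "3.2": (16, 135), "3.3": (16, 148), "4.1": (16, 157), "4.2": (10, 113),
    "4.3": (29, 290),
}

def variant_scores(correct, predicted, experimental=EXPERIMENTAL_VARIANTS):
    """Return (correct / experimental, correct / predicted)."""
    recall_like = correct / experimental   # share of lab variants recovered
    precision_like = correct / predicted   # share of predictions confirmed
    return recall_like, precision_like

scores = {sub: variant_scores(c, p) for sub, (c, p) in counts.items()}

for sub, (rec, prec) in scores.items():
    print(f"{sub}\t{rec:.2f}\t{prec:.2f}")
```

Read this way, submission 4.3 recovers about half of the experimental variants but does so with almost three hundred predictions, which explains its lower second ratio compared with group 2.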

The Padua NDD lab considered some causative missense variants difficult to interpret (ATRX: p.N1377S; RAB39B p.F193L; GRIA3 p.R216Q; MED13L p.G706E), since pathogenicity predictions were discordant, the allele frequency in control cohorts was higher than expected, or the proband phenotype was only partially consistent with those associated with the gene. However, the majority of the groups were able to predict these correctly. One example is the maternally inherited X-linked p.F193L of the RAB39B gene, which is associated with recessive X-linked Mental Retardation syndrome (MR-XL72, OMIM 300271) or with Waisman syndrome, which is characterized by ID and early-onset Parkinson disease (OMIM 311510). This variant is predicted damaging by three out of twelve computational tools provided by ANNOVAR (LRT, MutationTaster, and FATHMM-MKL), is moderately conserved during evolution, and is present in a hemizygous state in two control cohort individuals. However, the variant maps to the C-terminal hypervariable tail of RAB39B, which is relevant for protein interactions involved in protein targeting. The mother transmitting the p.F193L variant has a mild phenotype, consistent with those reported in the literature for a missense mutation at the nearby p.Gly192Arg position (Mata et al. 2015).

At least one group correctly predicted 16 out of 18 putative mutations. In particular, 7 variants were indicated by the majority of the groups. Three of these 7 variants were inherited and suspected to contribute to the disease together with other genetic or environmental factors. For the other four cases, after the CAGI-5 assessment, we contacted the families to follow up the molecular findings with segregation analysis of the identified variants. Only one family answered our call, which allowed us to establish the de novo status of the p.Y381H variant in the CASK gene. Although the pathogenicity predictions were discordant, this variant was absent from control cohorts and in silico analysis suggested a structural role of this residue in the homo- and hetero-dimerization of the CASK protein (Aspromonte et al. 2019). The proband phenotype is also consistent with those associated with CASK-related disorders.

In addition, at least one group correctly predicted the 13 variants classified as contributing factor, of which seven were indicated by the majority of the groups. This variant class is particularly relevant for autism susceptibility.

Novel variant predictions

Commonly predicted variants were also used to support those variants which were not considered by the Padua NDD lab. In order to check whether some relevant variant may be lost in the filtering process, the Padua NDD lab revised all of these 615 variants, which include 492 exonic (80%), 75 intronic (12.5%), 9 splicing (1.5%), and 6 5’/3’-UTR (untranslated region) (1%) variants.

Among the exonic variants, 80 (16.2%) were excluded for high allele frequency in the general population (MAF>0.002%). A further 150 variants with MAF<0.002% were excluded because they were present more than once in the cohort. Finally, 118 variants predicted to be likely gene disrupting (frameshift insertion, frameshift deletion, stop gain) were classified as sequencing errors after a visual check of the raw data.
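The exclusion criteria listed above can be summarized as a sequence of checks, sketched here in Python. This is an illustration rather than the laboratory pipeline: the record fields (`id`, `maf`, `lgd`, `passes_visual_check`) and the toy variants are hypothetical, the MAF cutoff is the one quoted in the text, and the manual inspection of raw data is represented only as a pre-recorded flag.

```python
from collections import Counter

MAF_THRESHOLD = 0.002 / 100  # 0.002% as quoted in the text, as a fraction

def filter_reason(variant, cohort_counts):
    """Return the reason a variant is excluded, or None if it is retained."""
    if variant["maf"] > MAF_THRESHOLD:
        return "high population frequency"
    if cohort_counts[variant["id"]] > 1:
        return "recurrent in cohort"
    # Likely gene-disrupting calls are kept only if the raw reads support them.
    if variant["lgd"] and not variant["passes_visual_check"]:
        return "likely sequencing error"
    return None

# Hypothetical records, one per (patient, variant) observation.
variants = [
    {"id": "v1", "maf": 1e-3, "lgd": False, "passes_visual_check": True},
    {"id": "v2", "maf": 0.0, "lgd": False, "passes_visual_check": True},
    {"id": "v2", "maf": 0.0, "lgd": False, "passes_visual_check": True},
    {"id": "v3", "maf": 0.0, "lgd": True, "passes_visual_check": False},
    {"id": "v4", "maf": 1e-5, "lgd": False, "passes_visual_check": True},
]

cohort_counts = Counter(v["id"] for v in variants)
reasons = [filter_reason(v, cohort_counts) for v in variants]
print(reasons)
```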

Focusing on variants indicated by the majority of the groups, we selected some variants that may be involved in the proband phenotype to be reconsidered for Sanger validation. In particular, for patient MR2001.01, three different groups (2, 3, and 4) predicted as potentially pathogenic two variants in NRXN1, p.L708I and p.I649V (NM_004801; 2:50765412 and 2:50765589 rs200074974). These variants were not reported to the patient because they were predicted neutral by the majority of the computational methods used. Variants of the NRXN1 gene are associated with schizophrenia, autism spectrum disorders, or the autosomal recessive Pitt-Hopkins-like syndrome 2, which is characterized by severe ID, developmental regression, hyperbreathing, autistic behavior, and dysmorphic features. One of the putative variants found in MR2001.01, p.I649V, was reported in the literature in a patient with schizophrenia, inherited from the affected mother (Gauthier et al. 2011). Patient MR2001.01 has borderline ID with autistic traits and other behavioral psychiatric manifestations, such as depression and anxiety. Thus, the patient's phenotype is not fully consistent with the recessive Pitt-Hopkins-like disorder. However, we performed segregation analysis and found that the two NRXN1 variants were absent from the DNA of the mother and the healthy sister. This suggests that the two rare variants might be transmitted in cis from the father, who is not available for further investigation.

Two other variants have been reconsidered for Sanger validation and segregation analysis, a non-frameshift deletion and a synonymous variant, both belonging to classes of variants for which pathogenicity prediction is difficult to obtain. In patient MR1289_01, an in-frame deletion in the CC2D1A gene (NM_017721:exon1:c.27_35del:p.10_12del) was indicated as potentially pathogenic by groups 4 and 5. Mutations in CC2D1A are associated with autosomal recessive ID 3 (MR-AR3), which is partially consistent with the proband phenotype. This position had a coverage of 74x, an alternative allele frequency of 100% and a genotype quality of 24. At that position, which is part of a repeat sequence, other patients analyzed in the same experiment present sequence and alignment errors. However, Sanger sequencing of this amplicon revealed that the proband carries this variant in a heterozygous state. No other variants have been identified in the CC2D1A gene, which is completely covered by gene panel sequencing. As alterations of CC2D1A have been implicated in NDDs only in recessive conditions, the final outcome for the patient does not change.

In patient MR1769_01, one hemizygous synonymous variant (NM_005120, c.3600C>T, p.(Arg1200Arg)) in the MED12 gene has been indicated as pathogenic by group 5 (late predictor). The variant is predicted to potentially alter splicing by Human Splicing Finder (Desmet et al. 2009). For this patient, we reported two variants in the CNTNAP2 and FOXP1 genes, which we hypothesized to act in a two-hit model to determine the disorder, as previously described by O’Roak et al. (2011). Nevertheless, a mutation in the MED12 gene could explain the family history suggestive of an X-linked disorder, since the maternal uncle presents ID and ASD. However, segregation analysis revealed that the mother transmitted the MED12 variant to the two sons. Since the phenotype of the MR1769.01 brother is not consistent with MED12-related disorders, we can exclude this variant as a main molecular cause of the MR1769.01 phenotype.

Assessment by group summary

Group 1 (Mooney-Radivojac Lab):

Considering each phenotypic trait individually, group 1 correctly predicted all individuals with ASD, 15 out of 18 individuals presenting microcephaly, and 60% of the epileptic patients. Furthermore, group 1 correctly predicted the overall phenotype of 11 patients that other groups did not predict correctly. Although their method was less accurate in predicting the causative or putative variants indicated by the Padua NDD Lab, group 1 was one of the two best at predicting the correct combination of phenotype traits in cases where the Padua NDD Lab indicated a causative or putative variant. They also performed better in the phenotype prediction for patients with at least three phenotypic traits available. Given the discrepancy between the accuracy at predicting causal variants and phenotypic traits, the Padua NDD Lab checked the variants that group 1 indicated as supporting its phenotype predictions. Although some supportive variants were classified as sequencing errors, many others were rare variants in genes associated with the specific phenotype trait. The Padua NDD Lab did not report these variants to the patient due to their relatively high frequency and discordant predictions among pathogenicity predictors. The group 1 method differs from that of other groups in the approach used to identify gene-phenotype associations, particularly for traits with fewer known risk genes, such as macrocephaly, hypotonia, and ataxic gait. The use of protein-protein interaction networks to expand genetic association with the disease has been useful for selecting relatively low frequency variants with less functional impact that may contribute to disease expression. This is in line with the emergent model explaining the genetic architecture of NDDs.