Abstract
Studies indicate that schizophrenia has a genetic component, however it cannot be isolated to a single gene. We aimed to determine how well one could predict that a person will develop schizophrenia based on their germ line genetics. We compared 1129 people from the UK Biobank dataset who had a diagnosis of schizophrenia to an equal number of age matched people drawn from the general UK Biobank population. For each person, we constructed a profile consisting of numbers. Each number characterized the length of segments of chromosomes. We tested several machine learning algorithms to determine which was most effective in predicting schizophrenia and if any improvement in prediction occurs by breaking the chromosomes into smaller chunks. We found that the stacked ensemble, performed best with an area under the receiver operating characteristic curve (AUC) of 0.545 (95% CI 0.539–0.550). We noted an increase in the AUC by breaking the chromosomes into smaller chunks for analysis. Using SHAP values, we identified the X chromosome as the most important contributor to the predictive model. We conclude that germ line chromosomal scale length variation data could provide an effective genetic risk score for schizophrenia which performs better than chance.
Subject terms: Computational biology and bioinformatics, Data mining, Genome informatics, Machine learning
Introduction
Schizophrenia is a highly heritable, complex psychiatric disorder1,2. Genome wide association studies have identified over one hundred genetic loci that contribute to its heritability2–7. However, these loci still account for less than half of the genetic risk for schizophrenia3. Environmental exposure to chemicals appears to play almost no role in the development of schizophrenia, but different forms of trauma experienced during development does appear to be a risk factor8. Twin studies have consistently shown a significant genetic contribution to schizophrenia, and many twin studies find that the environmental contribution to schizophrenia exists but that genetic effects provide significant liability to schizophrenia9.
Genetic risk scores10–12 have been developed for many different forms of disease, including breast cancer13, coronary artery disease14, and stroke15. Polygenic risk scores based on SNPs clearly can predict schizophrenia. One study measured an odds ratio of about 8 (95% CI 4–14) for the highest decile compared to the lowest decile16. A second study found that polygenic risk scores for schizophrenia (and bipolar disorder) are also associated with creativity17. A review of polygenic risk scores for schizophrenia highlighted the difficulty these studies had finding a consistent diagnosis of schizophrenia18. One limitation of polygenic risk scores is that they only consider linear combinations of SNPs.
Copy number variations (CNVs) in germ line DNA have also been associated with schizophrenia4,5,19–24. Evidence suggests that these CNVs associated with schizophrenia are represented also by SNPs24; the predictive power of CNVs does not add to the predictive power of SNPs when using linear prediction algorithms. The dimensionality of the data (many more SNPs than patients with schizophrenia) precludes the use of non-linear machine learning techniques.
Chromosome-scale length variation (CSLV) reduces the dimensionality of the data, while maintaining sufficient information for predictive algorithms. Combining CSLV with modern machine learning classification algorithms provides a powerful tool to predict phenotypes from a person’s genome25.The CSLV values are averages, across all or most of a chromosome, of copy number variation (CNV) measured at each SNP location. This method is particularly appealing for genetic risk scores because it includes epistatic effects that might be missed with conventional genome wide association studies, which use logistic regression—a linear combination of SNP scores. By attempting to still utilize every CNV value, this model aims to demonstrate that there are likely global CNV interactions which may be missed by conventional genetic risk scores.
The purpose of this paper is to evaluate how well a genetic risk score based on chromosome-scale length variation and machine learning classification algorithms can predict schizophrenia in individuals. We evaluated this approach on a dataset of 1129 patients who had schizophrenia in the UK Biobank dataset. These patients were previously genotyped as part of the UK Biobank project.
Methods
Data was obtained from the UK Biobank under Application Number 47850. The UK Biobank project collected extensive data from about 500,000 people who were between the ages of 40 and 69 during the 2006–2010 recruitment years. This data included genotyping data and medical records. In addition, most of the participants’ medical records are linked, through the National Health Service, to the UK Biobank records. This linkage provides for ongoing follow-up of health conditions26,27.
First, we downloaded the “l2r” files from the UK Biobank. Each chromosome has a separate “l2r” file. Each “l2r” file contained 488,377 columns and a variable number of rows. Each column represented a unique patient in the dataset, who can be identified with an encoded ID number. Each row represented a different location in the genome. The values in the file represent the log base 2 ratio of intensity relative to the expected two copies measured at the SNP location.
After downloading the “l2r” data from the UK Biobank, we computed the mean l2r value for different portions of each chromosome for each patient in the dataset. We created three different datasets, which we refer to as “splits”. We split each chromosome into either 1, 4, or 8 nominally equal parts. Then, we compute the length for each person’s chromosome split using the l2r files by taking the average of all l2r values measured within that portion of the chromosome split. A value of 0 represents the nominal average length of that portion of the particular chromosome. We call this dataset the chromosome-scale length variation (CSLV) dataset.
The CLSV numbers represent the copy number of the genomic DNA recognized by the probe. We computed a measure of the length of chromosomes, or chromosome fragments, by averaging these l2r measurements from different probes along the chromosome. For each person, we have 1 split, 4 split, and 8 split datasets. The 1 split data consists of 23 numbers, one for each of the autosomes and one for the X chromosome. The 4 split data consists of 92 numbers and the 8 split data has 184 numbers for each person.
This CSLV dataset was matched with the UK Biobank Health records dataset. UK Biobank matched the person in the Public Health England data with UK Biobanks internal records to produce the person’s encoded participant ID. The dataset we have, provided by UK Biobank, contains the participant ID and date the patient was diagnosed by a doctor as having schizophrenia.
Using the CSLV-Schizophrenia dataset, we selected all people who had a diagnosis of schizophrenia and labelled them in the dataset. We constructed an age-matched control group of the same size that had an identical age profile as those in the schizophrenia group. The age-matched control group was selected from all those in the UK Biobank dataset having no indication of schizophrenia. Since only a small fraction of the people in the UK Biobank had a schizophrenia diagnosis, we could rerun the analysis with a different age-matched control group many times to build up statistics.
We used the H2O machine learning package in R28,29. We created 100 machine learning models that were trained to classify a person in the dataset, consisting of those who had schizophrenia and age-matched controls, based solely on their chromosome scale length variation data. Each model was trained with fivefold cross-validation. Each model had a distinct set of controls. These models were trained to perform a binary classification, distinguishing between those who had been diagnosed with schizophrenia and those who did not have schizophrenia. The models were evaluated by measuring the area under the curve of the receiver operating characteristic curve, known as the AUC.
The H2O package implements several common machine learning algorithms. Distributed Random Forest (drf) is based on an algorithm originally called “Extremely randomized trees”30. The Gradient Boosting Machine algorithm (gbm) builds regression trees in parallel31,32. The generalized linear model (glm) is implemented using an augmented linear model33–35. XGBoost is a refinement to the general Gradient Boosting Machine algorithm36. Ensembles are a combination of these other machine learning algorithms. This combination often provides superior results to any particular algorithm37,38. The H2O package implements stacked ensembles as super learner algorithms39. The H2O package also uses SHAP values to interpret the models40. SHAP values are measures of how important different features are to the prediction.
Our computer analysis system is a Linux server running Ubuntu 18.04. The system is a 64-bit system running two Intel Xeon E5-2690 2.90 GHz CPUs. It also has a GeForce GT 710 NVIDIA GPU. 32 GBs of RAM were also available with a 10 TB HDD.
Ethics approval and consent to participate
Ethics approval and participant consent was collected by UK Biobank at the time participants enrolled. All subjects in the database have given informed consent, and if under 18, consent from a parent and/or legal guardian. Additionally, all subjects have the ability to withdraw at any time from the UK Biobank. This paper is an analysis of anonymized data provided by UK Biobank. According to UC Irvine’s IRB, analysis of anonymized data does not constitute Human Subjects Research. All methods and experimental research protocols were approved by the UK Biobank.
Results
Figure 1 presents results showing the performance of different machine learning algorithms. We found that the stacked ensemble models consistently performed best. As Fig. 1 shows, we found a slight difference between algorithms and their performance. But all algorithms could predict schizophrenia significantly better than chance (AUC = 0.50). This finding indicates that germ line genetics of the patient, as represented by the set of chromosome-scale length variation numbers, demonstrates predictability of schizophrenia.
The AUC (area under the curve of the receiver operating characteristic curve) for the machine learning classification models was 0.583 (standard deviation 0.014, 95% confidence interval of 0.581–0.586). A classification model with an AUC of 0.50 is equivalent to random guessing. The measured AUC differs from 0.50 with p < 0.00001.
We also tested how well each model could predict schizophrenia on a holdout set of validation data. The holdout set was 30% of the original test data and was not included in the training of the models. The AUC of the holdout set was 0.5734 with a 95% confidence interval of 0.569–0.578.
We then tested whether increasing the number of splits improves model performance. We constructed three overlapping datasets with 1 split, 4 splits, and 8 splits. The phrase “1 split” represents the average l2r value measured across an entire chromosome for all 23 chromosomes giving a total of 23 numbers, “4 splits” represents the average of each quarter of the 23 chromosomes l2r values for a total of 92 numbers, and “8 splits” represent the average of each eighth of the 23 chromosomes’ l2r values for a total of 184 numbers.
Figure 2 shows how models compare on the 3 different split datasets. Overall, a stacked ensemble had the best performance, however a general linear model (glm) was most often the best candidate model.
In all models, increasing splits improves model performance for the same runtime. Figure 3 demonstrates the difference of all models for 1 split, 4 splits, and 8 splits datasets. We tested whether finer splits of the dataset provided significantly improved AUCs. As shown in Table 1, the p-value of the 4 splits model compared to the 1 split model is . Comparing the mean AUC for the 8 splits model to the 1 split model gave a p-value of indicating that finer splits significantly improved the predictive ability of the models. The 4 splits and 8 splits models performed better than the 1 split models by a significant amount.
Table 1.
Dataset | Mean AUC | Standard Deviation | P-value vs 1 split |
---|---|---|---|
1 split | |||
4 splits | |||
8 splits |
We then calculated the odds ratio (OR) of our predictions drawn from the cross-validated model. Table 2 shows that a patient in the upper quintile is approximately twice as likely to have schizophrenia when compared to the lower quintile.
Table 2.
Quintile | Normal | Schizophrenia | Odds ratio | Count | 95% CI |
---|---|---|---|---|---|
1 | 185 | 123 | 0.67 | 308 | 0.51–0.85 |
2 | 156 | 152 | 0.97 | 308 | 0.76–1.24 |
3 | 153 | 155 | 1.0 | 308 | 0.79–1.3 |
4 | 142 | 165 | 1.2 | 307 | 0.91–1.5 |
5 | 133 | 174 | 1.3 | 307 | 1.0–1.7 |
In order to understand how our models came to their conclusions, we created several plots to explain them from H2O’s “explainability” framework. The first is a variable importance heatmap across the generated models which is shown in Fig. 4. Our analysis here indicated that chromosome X was one of the highest contributing variables in predicting Schizophrenia, especially in tree models such as GBM and XGBoost. We then confirmed this with a Shapley Additive exPlanation or SHAP plot in Fig. 5. This plot also indicates that chromosome X was the leading factor in our leading model for predicting schizophrenia.
Utilizing our findings above, we then proceeded to train new models from scratch using only CSLV values from chromosome X but with 64 CSLV splits. This model did not contain any information from the 22 autosomes but instead relied solely on CNVs in the X chromosome and our aim was to see if the model would be comparable to our previous 4-split and 8-split models. We found that on average these models had a comparable performance of about 0.58 with the highest being around 0.627 as shown in Fig. 6.
We then again performed a variable importance heatmap analysis to get greater granularity of our understanding of the contributing CSLVs in chromosome X. We found that this was again consistent with the previous findings from the 4-split model. Figure 7 indicates that the top features of variable importance are again being found in the first and last regions of chromosome X. As such it appears that the majority of the predictive power of any model trained with CSLV and when predicting schizophrenia in an individual is a result of CNVs on chromosome X. We also report corresponding estimates of hg38 coordinates in Table 3.
Table 3.
CSLV Split | Estimated hg38 Coordinates |
---|---|
1 | chrX:60425–634774 |
6 | chrX:5651118–7792613 |
9 | chrX:11426091–13234434 |
13 | chrX:20912585–22990332 |
42 | chrX:107331058–110669244 |
50 | chrX:128031497–130523635 |
58 | chrX:145709120–147908169 |
We wanted to ensure these results were not due to inherent sex differences. We trained 50 models using the 64 split chromosome X dataset which were not only age-matched with the controls but also sex-matched. 25 of the AutoML models were trained with the actual data with correctly labeled disease states. The other 25 AutoML models were trained with the schizophrenia diagnosis randomly shuffled. The results are shown in Table 4. Here we can see that a portion of the previous performance is most likely due to CSLV differences inherent between males and females (Supplemental D). However, a portion of the prediction is statistically still better than random guessing.
Table 4.
Dataset | Mean AUC | Standard Deviation |
---|---|---|
64 Split × normal | 0.545 | 0.01373103 |
64 Split × random | 0.525 | 0.01363745 |
Welch two sample t-test between normal and random |
T = − 5.0111 df = 47.998 |
p-value = 7.763e-06 |
Discussion
These results indicate that germline genetic variation contributes at least to some degree to the onset of schizophrenia in individuals. Our results indicate that genetic structural variation across the global chromosomal scope is sufficient to predict, better than guessing, whether or not an individual will have schizophrenia. The patients were an equal number of patients by gender between the control and disease group and the ages of patients in the control group also were matched to the ages of patients in the disease group. Further analysis revealed that length variation in a handful of regions of the X chromosome was sufficient to reproduce the predictive model. Recently, there has been revived discussion of copy number variations as a large contributing factor to several neurological ailments including schizophrenia41. Additionally, hypotheses about sex chromosome links to schizophrenia inheritance have been discussed for several decades and our findings lend support to this idea42.
On average, a stacked ensemble is the best approach to creating a predictive model for the prediction of schizophrenia. However, all models that were tested still created models with predictive power better than chance (Supplementary Information A, B, & C). Since H2O’s AutoML performs a grid-search of all the possible datasets and each trial we ran included the same disease group but with a different control groups, we can see in Fig. 1 that a general linear model (GLM) oftentimes was still the best option. Gradient Boosted Machines (GBM) and XGBoost also typically performed the same as GLM.
Utilizing a more granularized dataset by splitting the autosomes into quarters and eighths performs significantly better than using a CSLV averaged across an entire chromosome. This observation suggests we can increase performance by increasing splits. In the future, we plan on exploring the trade off in run time and computational resources required by increasing splits. Other methods of dimensionality reduction may also yield better results without sacrificing runtime performance.
The CSLV values are averages of copy number variation (CNV) measured at each SNP location. Simply using every single CNV value introduces a dimensionality problem as our dataset only has roughly 488,000 individuals while the total number of CNV values is 764,257 across the 22 autosomes and an additional 18,857 CNV values for the X Chromosome. This means there is likely diminishing returns for using more splits unless it can be offset with increased data.
This approach has several limitations. First, CSLV is an averaged measure of copy-number variations across a large section of the entire chromosome. We used SHAP values to highlight the regions that seem to be more important, but this does not provide a mechanistic explanation. Second, the dataset lacks diversity. The UK Biobank population is primarily Caucasian individuals in the United Kingdom (although not exclusively). Third, the diagnosis of schizophrenia in an individual is difficult to quantify and the disease might consist of a heterogeneous group of underlying biological processes. Finally, this analysis is based on a single dataset and the conclusions would be stronger if the analysis could be replicated in an independent dataset. However, similar datasets are not currently available.
Conclusion
We were able to create machine learning models for prediction of schizophrenia in patients. These models perform better than chance with an average AUC of 0.545. Prediction was performed with only chromosomal scale length variation measurements as the input variables. Further analysis of the SHAP values suggests that the length variation of several regions of the X chromosome are sufficient to reproduce this predictive value.
Supplementary Information
Acknowledgements
The data used in this study was obtained from the UK Biobank under Application Number 47850.
Abbreviations
- AUC
Area under the curve
- CNV
Copy number variation
- CSLV
Chromosomal-scale length variation
- GBM
Gradient boosted machines
- GLM
General linear model
- ROC
Receiver operator curve
- SHAP
SHapley Additive exPlanations
- SNP
Single nucleotide polymorphism
Author contributions
C.T. conceived of the study. C.T. and J.B. analyzed the UK Biobank data. C.T. and J.B. contributed to the manuscript. All authors read and approved the final manuscript.
Funding
No external funding supported this research.
Data availability
The datasets analyzed during the current study are available from UK Biobank at https://www.ukbiobank.ac.uk/.
Competing interests
The authors declare no competing interests.
Footnotes
Publisher's note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Contributor Information
Christopher Toh, Email: tohc@uci.edu.
James P. Brody, Email: jpbrody@uci.edu
Supplementary Information
The online version contains supplementary material available at 10.1038/s41598-021-97983-0.
References
- 1.Flint J, Munafò M. Genesis of a complex disease. Nature. 2014;511:412–413. doi: 10.1038/nature13645. [DOI] [PubMed] [Google Scholar]
- 2.Ripke S, Neale BM, Corvin A, Walters JTR, Farh KH, Holmans PA, et al. Biological insights from 108 schizophrenia-associated genetic loci. Nature. 2014 doi: 10.1038/nature13595. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 3.Lee SH, Decandia TR, Ripke S, Yang J, Sullivan PF, Goddard ME, et al. Estimating the proportion of variation in susceptibility to schizophrenia captured by common SNPs. Nat. Genet. 2012 doi: 10.1038/ng.1108. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 4.Ripke S, O’Dushlaine C, Chambert K, Moran JL, Kähler AK, Akterin S, et al. Genome-wide association analysis identifies 13 new risk loci for schizophrenia. Nat. Genet. 2013 doi: 10.1038/ng.2742. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 5.Ripke S, Sanders AR, Kendler KS, Levinson DF, Sklar P, Holmans PA, et al. Genome-wide association study identifies five new schizophrenia loci. Nat. Genet. 2011 doi: 10.1038/ng.940. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 6.Farrell MS, Werge T, Sklar P, Owen MJ, Ophoff RA, O’donovan MC, et al. Evaluating historical candidate genes for schizophrenia. Mol. Psychiatry. 2015;20:555–562. doi: 10.1038/mp.2015.16. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 7.Purcell SM, Wray NR, Stone JL, Visscher PM, O’Donovan MC, Sullivan PF, et al. Common polygenic variation contributes to risk of schizophrenia and bipolar disorder. Nature. 2009 doi: 10.1038/nature08185. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 8.Van Os J, Kenis G, Rutten BPF. The environment and schizophrenia. Nature. 2010 doi: 10.1038/nature09563. [DOI] [PubMed] [Google Scholar]
- 9.Sullivan PF, Kendler KS, Neale MC. Schizophrenia as a complex trait: Evidence from a meta-analysis of twin studies. Arch. Gen. Psychiatry. 2003 doi: 10.1001/archpsyc.60.12.1187. [DOI] [PubMed] [Google Scholar]
- 10.Sugrue LP, Desikan RS. What are polygenic scores and why are they important? JAMA. 2019;321:1820. doi: 10.1001/jama.2019.3893. [DOI] [PubMed] [Google Scholar]
- 11.Torkamani A, Wineinger NE, Topol EJ. The personal and clinical utility of polygenic risk scores. Nat. Rev. Genet. 2018 doi: 10.1038/s41576-018-0018-x. [DOI] [PubMed] [Google Scholar]
- 12.Lello L, Raben TG, Yong SY, Tellier LCAM, Hsu SDH. Genomic prediction of 16 complex disease risks including heart attack, diabetes, breast and prostate cancer. Sci. Rep. 2019 doi: 10.1038/s41598-019-51258-x. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 13.Mavaddat N, Michailidou K, Dennis J, Lush M, Fachal L, Lee A, et al. Polygenic risk scores for prediction of breast cancer and breast cancer subtypes. Am. J. Hum. Genet. 2019 doi: 10.1016/j.ajhg.2018.11.002. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 14.Fahed AC, Wang M, Homburger JR, Patel AP, Bick AG, Neben CL, et al. Polygenic background modifies penetrance of monogenic variants for tier 1 genomic conditions. Nat. Commun. 2020 doi: 10.1038/s41467-020-17374-3. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 15.Abraham G, Malik R, Yonova-Doing E, Salim A, Wang T, Danesh J, et al. Genomic risk score offers predictive performance comparable to clinical risk factors for ischaemic stroke. Nat. Commun. 2019 doi: 10.1038/s41467-019-13848-1. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 16.Agerbo E, Sullivan PF, Vilhjálmsson BJ, Pedersen CB, Mors O, Børglum AD, et al. Polygenic risk score, parental socioeconomic status, family history of psychiatric disorders, and the risk for schizophrenia: A Danish population-based study and meta-analysis. JAMA Psychiat. 2015 doi: 10.1001/jamapsychiatry.2015.0346. [DOI] [PubMed] [Google Scholar]
- 17.Power RA, Steinberg S, Bjornsdottir G, Rietveld CA, Abdellaoui A, Nivard MM, et al. Polygenic risk scores for schizophrenia and bipolar disorder predict creativity. Nat. Neurosci. 2015 doi: 10.1038/nn.4040. [DOI] [PubMed] [Google Scholar]
- 18.Mistry S, Harrison JR, Smith DJ, Escott-Price V, Zammit S. The use of polygenic risk scores to identify phenotypes associated with genetic risk of schizophrenia: Systematic review. Schizophr. Res. 2018 doi: 10.1016/j.schres.2017.10.037. [DOI] [PubMed] [Google Scholar]
- 19.Ruderfer, D. M., Chambert, K., Moran, J., Talkowski, M., Chen, E. S., Gigek, C. et al. Mosaic copy number variation in schizophrenia. Nature. 508 (2014). [DOI] [PMC free article] [PubMed]
- 20.Szatkiewicz JP, O’Dushlaine C, Chen G, Chambert K, Moran JL, Neale BM, et al. Copy number variation in schizophrenia in Sweden. Mol. Psychiatry. 2014 doi: 10.1038/mp.2014.40. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 21.Purcell SM, Moran JL, Fromer M, Ruderfer D, Solovieff N, Roussos P, et al. A polygenic burden of rare disruptive mutations in schizophrenia. Nature. 2014 doi: 10.1038/nature12975. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 22.Korn JM, Kuruvilla FG, McCarroll SA, Wysoker A, Nemesh J, Cawley S, et al. Integrated genotype calling and association analysis of SNPs, common copy number polymorphisms and rare CNVs. Nat. Genet. 2008;40:1253–1260. doi: 10.1038/ng.237. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 23.Derks EM, Vorstman JAS, Ripke S, Kahn RS, Ophoff RA, O’Donovan MC, et al. Investigation of the genetic association between quantitative measures of psychosis and schizophrenia: A polygenic risk score analysis. PLoS One. 2012 doi: 10.1371/journal.pone.0037852. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 24.Marshall CR, Howrigan DP, Merico D, Thiruvahindrapuram B, Wu W, Greer DS, et al. Contribution of copy number variants to schizophrenia from a genome-wide study of 41,321 subjects. Nat. Genet. 2017 doi: 10.1038/ng.3725. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 25.Toh C, Brody JP. Analysis of copy number variation from germline DNA can predict individual cancer risk. bioRxiv. 2018 doi: 10.1101/303339. [DOI] [Google Scholar]
- 26.Sudlow C, Gallacher J, Allen N, Beral V, Burton P, Danesh J, et al. UK Biobank: An open access resource for identifying the causes of a wide range of complex diseases of middle and old age. PLoS Med. 2015 doi: 10.1371/journal.pmed.1001779. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 27.Bycroft C, Freeman C, Petkova D, Band G, Elliott LT, Sharp K, et al. The UK Biobank resource with deep phenotyping and genomic data. Nature. 2018;562:203–209. doi: 10.1038/s41586-018-0579-z. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 28.Click, C., Malohlava, M., Candel, A., Roark, H. & Parmar, V. Gradient Boosting Machine with H2O. 30. Accessed 7 April 2021. https://www.H2OAi/Resources/ (2017).
- 29.Aiello, S., Eckstrand, E., Fu, A., Landry, M. & Abouyon, P. Machine learning with R and H2O. H2O booklet, 550. It's available at this URL: https://h2o-release.s3.amazonaws.com/h2o/master/3283/docs-website/h2odocs/booklets/R_Vignette.pdf (2016).
- 30.Geurts P, Ernst D, Wehenkel L. Extremely randomized trees. Mach. Learn. 2006 doi: 10.1007/s10994-006-6226-1. [DOI] [Google Scholar]
- 31.Friedman JH. Greedy function approximation: A gradient boosting machine. Ann. Stat. 2001;29:1189–1232. doi: 10.1214/aos/1013203451. [DOI] [Google Scholar]
- 32.Friedman JH. Stochastic gradient boosting. Comput. Stat. Data Anal. 2002;38:367–378. doi: 10.1016/S0167-9473(01)00065-2. [DOI] [Google Scholar]
- 33.Lee Y, Nelder JA. Hierarchical generalised linear models: A synthesis of generalised linear models, random-effect models and structured dispersions. Biometrika. 2001 doi: 10.1093/biomet/88.4.987. [DOI] [Google Scholar]
- 34.Lee Y, Nelder JA. Hierarchical generalized linear models. J. R. Stat. Soc. Ser. B (Methodol.) 1996 doi: 10.1111/j.2517-6161.1996.tb02105.x. [DOI] [Google Scholar]
- 35.Nelder, J. A., Lee, Y., & Pawitan, Y. Generalized Linear Models with Random Effects: Unified Analysis via H-likelihood, Second Edition (2nd ed.). Chapman and Hall/CRC. 10.1201/9781315119953 (2017).
- 36.Chen, T. & Guestrin, C. XGBoost. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining—KDD’16. 785–794 (ACM Press, 2016). 10.1145/2939672.2939785.
- 37.Wolpert DH. Stacked generalization. Neural Netw. 1992 doi: 10.1016/S0893-6080(05)80023-1. [DOI] [Google Scholar]
- 38.Breiman L. Stacked regressions. Mach. Learn. 1996 doi: 10.1007/bf00117832. [DOI] [Google Scholar]
- 39.Van Der Laan MJ, Polley EC, Hubbard AE. Super learner. Stat. Appl. Genet. Mol. Biol. 2007 doi: 10.2202/1544-6115.1309. [DOI] [PubMed] [Google Scholar]
- 40.Lundberg, S. M. & Lee, S. I. A unified approach to interpreting model predictions. In: Proceedings of the 31st international conference on neural information processing systems (pp. 4768–4777) https://proceedings.neurips.cc//paper/2017/file/8a20a8621978632d76c43dfd28b67767-Paper.pdf (2017).
- 41.Zarrei M, Burton CL, Engchuan W, Young EJ, Higginbotham EJ, MacDonald JR, et al. A large data resource of genomic copy number variation across neurodevelopmental disorders. NPJ Genom. Med. 2019 doi: 10.1038/s41525-019-0098-3. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 42.Bache WK, DeLisi LE. The sex chromosome hypothesis of schizophrenia: Alive, dead, or forgotten? A commentary and review. Mol. Neuropsychiatry. 2018;4:83–89. doi: 10.1159/000491489. [DOI] [PMC free article] [PubMed] [Google Scholar]
Associated Data
This section collects any data citations, data availability statements, or supplementary materials included in this article.
Supplementary Materials
Data Availability Statement
The datasets analyzed during the current study are available from UK Biobank at https://www.ukbiobank.ac.uk/.