Summary
In studies of individuals of primarily European genetic ancestry, common and low-frequency variants and rare coding variants have been found to be associated with the risk of bipolar disorder (BD) and schizophrenia (SZ). However, less is known about individuals of other genetic ancestries or about the role of rare non-coding variants in BD and SZ risk. We performed whole-genome sequencing (∼27X) of African American individuals: 1,598 with BD, 3,295 with SZ, and 2,651 unaffected controls (InPSYght study). We increased power by incorporating 14,812 jointly called psychiatrically unscreened ancestry-matched controls from the Trans-Omics for Precision Medicine (TOPMed) Program for a total of 17,463 controls (∼37X). To identify variants and sets of variants associated with BD and/or SZ, we performed single-variant tests, gene-based tests for singleton protein truncating variants, and rare and low-frequency variant annotation-based tests with conservation and universal chromatin states and sliding windows. We found suggestive evidence of the association of BD with single variants on chromosome 18 and of lower BD risk associated with rare and low-frequency variants on chromosome 11 in a region with multiple BD genome-wide association study loci, using a sliding window approach. We also found that chromatin and conservation state tests can be used to detect differential calling of variants in controls sequenced at different centers and to assess the effectiveness of sequencing metric covariate adjustments. Our findings reinforce the need for continued whole-genome sequencing in additional samples of African American individuals and more comprehensive functional annotation of non-coding variants.
Keywords: Whole genome sequencing, bipolar disorder, schizophrenia, burden test, GWAS, African American, non-coding variant, regional burden test, quality control, rare variants
This study performed whole-genome sequencing to identify rare genetic variant associations with bipolar disorder and schizophrenia risk in African American individuals and found suggestive association signals for bipolar disorder. The study also demonstrated how chromatin and conservation state annotations can be used to detect differences in variant calling between samples sequenced at different centers.
Introduction
Severe mental illnesses, including bipolar disorder (BD) and schizophrenia (SZ), are debilitating disorders that affect millions of people worldwide. BD and SZ encompass a wide range of shared symptoms, including recurrent episodes of psychosis, large mood swings, depression, and cognitive impairment. Both disorders are significantly associated with risk of suicide and increased all-cause mortality rate.1 Heritability estimates from family studies range from 60% to 85% for BD2 and 60% to 80% for SZ.3 Notably, there is considerable overlap in the underlying genetics of BD and SZ,4 and genetic correlations are estimated to be as high as 0.68 at the common variant level.5 Uncovering the genetic factors contributing to these disorders could lead to a deeper understanding of disease etiology and improved treatment options.
Hundreds of independent susceptibility loci for BD and SZ have been identified through large-scale genome-wide association studies (GWASs) by focusing on common and low-frequency alleles.6,7,8,9,10,11,12,13 An SZ case-control GWAS of European and east Asian ancestry individuals (Psychiatric Genomics Consortium, 76,755 individuals with SZ and 243,649 controls) identified 287 distinct loci, implicating genes associated with neurodevelopmental disorders and with brain-specific expression.13 A BD GWAS of European ancestry (41,917 BD cases and 371,549 controls) identified 64 distinct loci and significant enrichment of association signals within genes belonging to neuronal and synaptic pathways and targets for existing BD medications.4 Of the BD-associated variants, 17 were also associated with SZ. For both the BD and SZ GWASs, measured common variants are estimated to account for a modest portion of disease heritability (18.6% for BD4 and 24% for SZ14).
Whole-exome sequencing (WES) detects coding variants across the allele frequency spectrum. The Schizophrenia Exome Sequencing Meta-Analysis (SCHEMA) Consortium produced an SZ case-control WES study (SCHEMA) of multi-ancestry individuals (24,248 SZ cases and 97,322 controls), which identified enrichment of ultra-rare coding variants in 10 genes in individuals with SZ15; 2 of the genes were also implicated by common-variant GWAS.13 A WES BD case-control study (BipEx) of European ancestry individuals (13,933 BD cases and 14,422 controls) found that compared to controls, individuals with BD were enriched for ultra-rare protein truncating variants (PTVs) in constrained genes (probability of being loss-of-function intolerant [pLI] ≥ 0.9). When the BipEx and SCHEMA results were combined, AKAP11 was identified as a risk gene.16 These results identified ultra-rare coding variants as contributing to BD and SZ risk and suggested overlap between BD and SZ risk at both the rare and common variant levels.
Compared to common and exonic variants, less is known about the role of rare non-coding variants in SZ and BD. Whole-genome sequencing (WGS) allows detection of coding and non-coding variants across the allele frequency spectrum. Studies using WGS to investigate SZ or BD have generally been limited for non-coding variant analysis by sample sizes of no more than a few hundred individuals, often in family-based designs.17,18,19,20,21,22,23 One SZ WGS case-control study of Swedish samples (1,162 SZ cases and 936 controls) found association between SZ and structural variants at topologically associated domain boundaries but did not find significant differential burden of non-coding single-nucleotide variants (SNVs) and insertion or deletion polymorphisms (indels) between SZ cases and controls across a variety of biological groupings.24
To date, genomic studies of psychiatric disorders and many complex human diseases and traits have overwhelmingly been composed of individuals of European genetic ancestry.25,26 Although there has been progress in increasing the representation of ancestral backgrounds of individuals included in GWASs, notably in east Asians, the available data do not comprehensively represent individuals in the United States or the world.27 This impedes the discovery of genes and mechanisms that might be uncovered from the broader spectrum of variations across different ancestries. The use of European ancestry BD polygenic risk scores (PRSs) to predict disease risk across ancestries also has the potential to create health inequities. For example, PRSs from European ancestry GWASs predict a much smaller proportion of disease in east Asian and African American ancestry samples than in European ancestry samples.4 Increasingly, efforts are under way to assess the influence of genetic variation on complex traits in individuals of non-European ancestry.28,29,30 Genetic studies of mental health disorders that include WGS for individuals of diverse genetic ancestries will allow us to better address the disparities in diagnosis and treatment.
We examined the role of SNVs and short indels in BD and SZ susceptibility in African American individuals in a sample of 7,544 individuals (with 1,598 BD cases, 3,295 SZ cases, and 2,651 controls without SZ or BD) and an additional 14,812 phenotypically unscreened ancestry-matched individuals from the National Heart, Lung, and Blood Institute (NHLBI) Trans-Omics for Precision Medicine (TOPMed) program as external controls.30 Overall, we found suggestive evidence of single-variant BD association on chromosome 18. We observed that chromatin and conservation state burden tests were a sensitive way to assess the comparability of sequencing/calling between two WGS sample sets. We also found suggestive evidence of association of a chromosome 11 region sliding window located among multiple previously reported BD GWAS loci.4
Subjects and methods
Figure 1 contains an overview of the study design, samples, and analytical approaches.
Figure 1.
Study overview
(A) Number of total and unrelated study participants for each case or control group.
(B) Seven analysis groups: control-control and case-control.
(C) Four analysis types (total or unrelated samples used in analysis). PTV, protein truncating variant.
InPSYght study sample
We selected individuals for deep whole-genome sequencing (WGS) from US-based case-control studies of African American individuals as part of the Whole Genome Sequence for Psychiatric Disorders (WGSPD) Consortium.31 We refer to this study as the InPSYght study. We use the term “African American” to denote individuals who self-identified as African American on study forms; this term (which may have been one of a limited number of choices to describe African ancestry) can include individuals who, among others, are descendants of enslaved individuals or are more recent immigrants to the United States from African and other countries. The InPSYght study is composed of participants from the Genomic Psychiatry Cohort (GPC),32,33 the Consortium on the Genetics of SZ (COGS),34 the Bipolar Genome Study (BIGS),35 Lithium Treatment Moderate Dose Use Study (LiTMUS),36 and the Systematic Treatment Enhancement Program for Bipolar Disorder (STEP-BD) studies.37 All DNA samples were obtained from the National Institute of Mental Health (NIMH) Repository and Genomics Resource (Table S1). The array-based genotypes were obtained from the NIMH Repository or from the originating study. From these studies, we selected individuals with self-reported African/African American ancestry, with the goal of sequencing individuals who were more likely to have primarily European and African ancestry admixture. We used ADMIXTURE38 to estimate the percentage of African genetic ancestry in each individual using study sample genotype array data and the 1000 Genomes39 ancestry super-populations Admixed Americans (AMR), African (AFR), European (EUR), South Asian (SAS), and East Asian (EAS) as reference populations. We used a cutpoint of >25% estimated percentage global African genetic ancestry for sample inclusion in sequencing. All individuals designated as cases fulfilled the Diagnostic and Statistical Manual of Mental Disorders-IV criteria for SZ or BD. 
InPSYght controls (all from the GPC study) were included if they had no personal or family history of SZ or BD and if they did not have unipolar depression (from screening questions). Details of the recruitment strategies and instruments used for diagnosis are provided in the referenced publications for each cohort. All participants provided informed consent. The IRB of the University of Michigan reviewed the details of the study and determined it was “not regulated”.
TOPMed external controls
To increase statistical power to detect variants associated with BD and SZ, we selected as external controls previously whole-genome sequenced self-identified African American individuals from the NHLBI TOPMed project (hereafter referred to as TOPMed controls). TOPMed studies (case-control or family-based studies) are focused on heart, lung, blood and sleep disorders, and inclusion was not predicated on information about mental health.30 Details of the IRB approval for the studies in TOPMed are available elsewhere (See Section 4 in Taliun et al.30). We considered for inclusion individuals from 11 studies who agreed to the use of their samples as controls (Table S2). We considered for inclusion individuals with general research use (GRU) consent or consent for health/medical/biomedical purposes. Further inclusion/exclusion criteria are described below.
WGS of InPSYght and TOPMed individuals
InPSYght: Individuals were whole-genome sequenced (mean ± SD depth 26.8 ± 5.5) in seven batches at the Broad Institute on an Illumina HiSeq X10 instrument. The first WGS batch (n = 231 samples) was performed with PCR-based library preparation; the remaining batches were completed PCR-free. All sequencing was paired end with a 151-nt read length. Individuals with SZ, with BD, and control individuals were included in each sequencing batch, except for the PCR-amplified batch, which did not include individuals with BD (Table S3). Case and internal control samples were intermixed within each batch to help avoid sub-batch effects.
TOPMed individuals were whole-genome sequenced at five sequencing centers as previously described30 (Table S2), using paired-end sequencing and read length of 150 bp. The 14,812 samples used as controls had a mean sequencing depth of 36.8 ± 4.7.
Joint InPSYght and TOPMed variant discovery, genotype calling, and quality control
We used jointly called genotypes on human genome build version GRCh38 for bi-allelic SNVs and short indels from TOPMed Freeze 930 using the GotCloud/vt pipeline.38 We called genotypes for each individual on the autosomes and chromosome X on the across-individuals union of all variant sites. Genotypes for variants on the non-pseudo-autosomal region of the X chromosome were coded as 0 or 2 alleles for males, whereas those in the two PARs (PAR1 and PAR2, respectively) were coded 0, 1, or 2 alleles combining X and Y for males. We filtered variants based on outputs from a support vector machine (SVM) classifier based on inferred pedigree of related and duplicated individuals to calculate Mendelian consistency statistics and other features.30,39
Initial exclusions of individuals from analysis
For the InPSYght study, we excluded individuals with sex mismatches (self-reported sex disagreed with genetic sex) (n = 20), non-XX or XY sex karyotypes (n = 17), estimated DNA contamination >5% using verifyBamID240 (n = 4), or for whom <98% of sites were at a sequencing depth of ≥10 (n = 14) (see Table S4). For TOPMed, we excluded individuals with sex mismatches, non-XX or XY sex karyotypes, or with estimated DNA contamination >5%.30
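The sample-level exclusion rules above can be sketched as a small filter. This is only an illustration under stated assumptions: the `SampleQC` record, its field names, and the sex labels are hypothetical, and only the thresholds (contamination >5%, <98% of sites at depth ≥10) come from the text.

```python
from dataclasses import dataclass


@dataclass
class SampleQC:
    # Hypothetical per-sample QC record; field names are illustrative only.
    sample_id: str
    reported_sex: str             # "F" or "M"
    genetic_sex: str              # "F", "M", or any other label for non-XX/XY karyotypes
    contamination: float          # estimated fraction, e.g. 0.02 for 2%
    frac_sites_depth_ge10: float  # fraction of sites with sequencing depth >= 10


def exclusion_reasons(s: SampleQC) -> list:
    """Return the reasons (possibly none) for excluding a sample."""
    reasons = []
    if s.genetic_sex not in ("F", "M"):
        reasons.append("non-XX/XY karyotype")
    elif s.reported_sex != s.genetic_sex:
        reasons.append("sex mismatch")
    if s.contamination > 0.05:
        reasons.append("contamination >5%")
    if s.frac_sites_depth_ge10 < 0.98:
        reasons.append("<98% of sites at depth >=10")
    return reasons


# Toy records, not study data.
ok = SampleQC("s1", "F", "F", 0.01, 0.995)
bad = SampleQC("s2", "M", "F", 0.08, 0.97)
kept = [s.sample_id for s in (ok, bad) if not exclusion_reasons(s)]
```

Returning the list of reasons rather than a single boolean makes it easy to tabulate exclusions by category, as in Table S4.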
Construction of PCs for inclusion of sequenced samples
To compute principal components (PCs) of genetic ancestry, we removed variants in high long-range linkage disequilibrium (LD) regions41 and then pruned variants in PLINK/1.942 using --indep-pairwise flag and the following parameters: 500000 5 0.2. We computed PCs using the --pca flag within PLINK42 and restricted variant selection to those present in the Human Genome Diversity Project (HGDP) reference panel.43 These criteria resulted in 160,517 common (minor allele frequency [MAF] >5% in the InPSYght case-control + TOPMed control dataset) autosomal bi-allelic SNVs (from the WGS data). We retained InPSYght individuals who visually clustered in the first three PCs. Then, we retained TOPMed controls who visually clustered in the space occupied by the InPSYght individuals along the first three PCs (Figures S1A–S1D).
Identification of duplicated and related individuals
We estimated pairwise kinship among all individuals (InPSYght and TOPMed controls) using KING (kinship-based inference for GWASs).44 We excluded 1 person per pair of monozygotic/duplicate genotyped individuals (17 duplicates within InPSYght, 140 duplicates within TOPMed, and 52 within InPSYght/TOPMed). For InPSYght/TOPMed pairs, the InPSYght sample was retained. To obtain unrelated individuals for variant aggregation tests, we randomly retained an individual from related pairs or groups (defined by individuals being related with at least one other person in the group) with kinship >0.0409 (third-degree relationship or higher), resulting in 4,489 cases (3,006 SZ and 1,483 BD), 2,374 InPSYght controls, and 8,509 TOPMed controls.
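The selection of unrelated individuals described above (group individuals linked by kinship >0.0409, then keep one at random per group) can be sketched as follows. The input format, the function name, and the use of a union-find structure are assumptions made for illustration; the text does not describe the authors' implementation.

```python
import random
from collections import defaultdict

KINSHIP_THRESHOLD = 0.0409  # third-degree relationship or closer, as stated in the text


def pick_unrelated(samples, kinship_pairs, seed=0):
    """Keep one randomly chosen sample per relatedness group.

    samples: iterable of sample IDs.
    kinship_pairs: iterable of (id_a, id_b, kinship) tuples (hypothetical format).
    A group is a connected component: each member is related to at least one
    other member with kinship above the threshold.
    """
    parent = {s: s for s in samples}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, phi in kinship_pairs:
        if phi > KINSHIP_THRESHOLD:
            parent[find(a)] = find(b)

    groups = defaultdict(list)
    for s in parent:
        groups[find(s)].append(s)

    rng = random.Random(seed)
    return sorted(rng.choice(sorted(members)) for members in groups.values())


# Toy example: A-B-C form one group through chained relationships; D and E are unrelated.
samples = ["A", "B", "C", "D", "E"]
pairs = [("A", "B", 0.25), ("B", "C", 0.06), ("D", "E", 0.01)]
kept = pick_unrelated(samples, pairs, seed=1)
```

Random selection within a group is simple but does not maximize the number of retained individuals; that trade-off is a property of this sketch, not a claim about the study.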
Construction of PCs for association analysis
We computed genetic PCs for use in analysis as described in the section Construction of PCs for inclusion of sequenced samples by using only the selected TOPMed and InPSYght individuals (Figures S1E–S1H).
Genetic ancestry estimation in InPSYght using WGS data
We re-estimated the global genetic ancestries for InPSYght samples using WGS data to use the same genotype source across studies. We used the supervised learning approach implemented in the software ADMIXTURE45 trained on the 1000 Genomes Project phase 346 ancestry super-populations AMR, AFR, EUR, SAS, and EAS. To estimate sample ancestry, we used the same SNVs as for the PC analysis.
We further inferred finer-grained genetic ancestry using individuals from the HGDP reference panel as the reference and RFMix to estimate ancestry (run by chromosome and summed across chromosomes).47 HGDP was chosen here in an attempt to characterize the known high levels of genetic diversity across African populations.48 From the HGDP reference, we used 156 European individuals as a single group, and as separate groups we used between 8 and 27 individuals per population from the following 8 African populations: Bantu from South Africa, Bantu from Kenya, Biaka Pygmy, Mandenka, Mbuti Pygmy, Mozabite, San, and Yoruba.49
Genetic ancestry estimation in TOPMed using WGS data
Estimated genetic ancestry of TOPMed samples has been described previously.30 In summary, local ancestry was inferred using RFMix version 247 with the following option: --node-size = 5. For reference haplotypes used in local ancestry inference, we obtained the HGDP reference panel50 and processed the data according to Wang et al.,51 giving 938 individuals and 639,958 autosomal SNVs. We then condensed the 53 populations in HGDP into 7 super-populations: (1) Sub-Saharan Africa (n = 104), (2) Central/South Asia (n = 200), (3) East Asia (n = 229), (4) Europe (n = 154), (5) Native America (n = 63), (6) Oceania (n = 28), and (7) Middle East (n = 160). After running RFMix, we summed up inferred local ancestry across all genetic windows of each individual to calculate global ancestry proportions, corresponding to the seven super-populations. Almost all selected TOPMed controls (n = 14,804, 99.9%) had >25% African estimated global ancestry (range: 26%–100%), except 8 (0.05%) selected controls with range of estimated African ancestry: 8%–24%.
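As a rough illustration of the final step, global ancestry proportions can be obtained by summing local ancestry calls over windows and normalizing. The data layout below (one dictionary of per-population haplotype counts per window) is hypothetical, and weighting every window equally is an assumption; the text does not state whether windows were weighted by length.

```python
def global_ancestry(windows):
    """Sum local ancestry over windows and normalize to proportions.

    windows: list of dicts mapping super-population label -> number of
    haplotypes (0, 1, or 2) assigned to that population in the window.
    Equal window weights are an assumption of this sketch.
    """
    totals = {}
    for window in windows:
        for population, count in window.items():
            totals[population] = totals.get(population, 0.0) + count
    grand_total = sum(totals.values())
    return {population: value / grand_total for population, value in totals.items()}


# Toy individual with four windows (not study data).
windows = [
    {"AFR": 2, "EUR": 0},
    {"AFR": 1, "EUR": 1},
    {"AFR": 2, "EUR": 0},
    {"AFR": 1, "EUR": 1},
]
proportions = global_ancestry(windows)
passes_inclusion = proportions["AFR"] > 0.25  # the >25% African ancestry criterion
```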
Power calculations to detect single-variant associations
We calculated the odds ratio (OR) that would yield 80% power to detect variants associated with BD, SZ, or SZ + BD, using InPSYght + TOPMed samples as controls. We assumed a disease prevalence of 1% for BD and 1% for SZ and 2% for SZ + BD. We conservatively used approximate numbers of unrelated cases (n = 1,500 BD and n = 3,000 SZ) and controls (n = 11,000) present in the case-control comparison group of interest in our calculations. As we were interested in the power to detect lower frequency variants, we assumed risk allele frequencies of 0.01 and 0.05, using standard assumptions of a multiplicative disease model on the OR scale, population-based controls, and genome-wide significance level of 5 × 10−9 to account for the testing of variants with MAF <0.05, using the Genetic Power Calculator.52
Genome-wide single-variant case-control association analysis
We tested for association of SZ and/or BD with each SNV or indel (minor allele count [MAC] >20 in the tested individuals) using SAIGE (version 0.42), which employs a mixed model to account for related individuals and uses saddlepoint approximation to account for case-control imbalance in estimation of significance (the estimations are stable down to an MAC of 20).53 We used the reference genome allele as the reference allele in calculation of ORs. We performed a total of six case-control association analyses (Figure 1). We used as cases the SZ-only, BD-only, or combined SZ or BD InPSYght samples. We used as controls either only the InPSYght controls or the InPSYght + TOPMed controls. We chose to combine the SZ and BD cases based on previous evidence of substantial genetic correlation between the two disorders (based on common variants in European-ancestry populations).54 We included as covariates genetic sex and the first 10 genetic PCs. We also included the sequencing batch as a covariate for InPSYght sample-only analyses. To assess potential differences between the two sets of controls (InPSYght and TOPMed controls), we performed an association analysis of InPSYght controls versus TOPMed controls. We controlled for multiple testing within an analysis group using p < 5 × 10−9 for genome-wide significance; we used p < 5 × 10−9/7 comparisons = 7.1 × 10−10 for a conservative genome-wide significance.
PTV singletons-based burden tests
We performed burden tests for protein truncating SNV or indel (PTV) singletons at both the gene level and the gene-set level. We restricted our analysis to KING-estimated unrelated samples (less than third-degree relationships, see above) consisting of 4,489 InPSYght BD and SZ cases, 2,374 InPSYght controls, and 8,509 TOPMed controls. We defined singleton variants based on the unrelated samples to avoid excluding variants that occurred multiple times in a single family. We annotated 61,732 singleton variants as PTVs using the following Ensembl Variant Effect Predictor categories: frameshift, stop gained, splice acceptor, and splice donor. We performed gene-level burden tests of SZ + BD cases versus InPSYght + TOPMed controls on the aggregated singleton PTVs within each gene, testing for association of PTV count with case status using RVTESTS.55 To maximize power, we restricted testing to the genes with >10 PTV singletons (1,045 of 22,178 genes) and only tested SZ + BD cases versus InPSYght + TOPMed controls given the limited number of PTVs available per gene for testing. We included as covariates the first 10 genetic PCs, sex, and an individual’s total number of singleton alleles. We applied a Bonferroni-corrected significance threshold of 0.05/1,045 (of genes with ≥10 PTV singletons) = 4.8 × 10−5. As a sensitivity analysis, we expanded singletons to include PTVs up to 1% MAF and repeated the gene-based testing. At a 1% MAF threshold, we tested all 22,178 genes and used a corrected significance threshold of 0.05/22,178 = 2.3 × 10−6.
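The gene-level thresholds above are plain Bonferroni corrections, which a few lines of arithmetic can reproduce. The helper name `bonferroni` and its `n_comparisons` argument are illustrative and not taken from any tool used in the study.

```python
def bonferroni(alpha, n_tests, n_comparisons=1):
    """Per-test significance threshold after Bonferroni correction."""
    return alpha / (n_tests * n_comparisons)


# Gene-level singleton PTV tests: 1,045 genes tested.
singleton_threshold = bonferroni(0.05, 1045)

# Sensitivity analysis at MAF up to 1%: all 22,178 genes tested.
maf1_threshold = bonferroni(0.05, 22178)

print(f"{singleton_threshold:.1e}")  # 4.8e-05
print(f"{maf1_threshold:.1e}")       # 2.3e-06
```

The same helper reproduces the conservative single-variant threshold: `bonferroni(5e-9, 1, n_comparisons=7)` is approximately 7.1 × 10−10.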
In addition, among the 10 previously reported SZ associated genes from the SCHEMA Consortium study,15 we tested the 6 genes with at least 1 PTV singleton in our study; all 6 genes had <10 PTV singletons. We note that the SCHEMA Consortium study includes the InPSYght SZ and control samples; thus, this is not an independent test but a test to see the contribution of the African American samples.
We performed gene set-level tests for the enrichment of PTV singletons within sets of genes previously associated with SZ. Given the higher singleton counts in gene set tests compared to individual genes, in addition to testing SZ + BD versus InPSYght + TOPMed controls as for individual genes, we tested SZ versus InPSYght + TOPMed controls. We tested three gene sets previously associated with SZ in more than one study: 1,423 postsynaptic density genes,56,57 784 FMR1 protein-associated (formerly named FMRP) genes,58,59 and 3,063 constrained (pLI >0.90) genes.59 In addition, we tested a gene set we constructed containing the 10 SCHEMA SZ-associated genes,15 which as noted above has some overlapping samples with the samples for this study. For each gene set test, for each person we summed the PTV counts over all genes in the gene set and used RVTESTS55 to test for association with case status as described for single-gene tests. We used a single analysis Bonferroni significance threshold of 0.05/3 = 0.017; in analyses where a result passed the single analysis threshold, we further evaluated the result using a more stringent threshold of 0.05/(3 × 2 comparisons) = 0.0085 to account for multiple analyses.
Construction of sequencing metadata PCs
To control for potential sequencing batch effects within and across InPSYght and TOPMed, we constructed PCs based on a shared set of sample-level sequencing quality control metrics, including per-sample average depths and sample contamination levels (Table S5). We used the first four of these sequencing metadata PCs in the chromatin and conservation states analysis as they cumulatively explain 99.99% of the variance of the sequencing metadata PCs.
Test for case-control and InPSYght control-TOPMed control enrichment of rare and low-frequency variants in chromatin states and conservation states
We tested whether cases and controls exhibit differential enrichment of rare and low-frequency SNVs (MAF <0.05) for any class of genomic region defined based on chromatin or conservation states. Specifically, for the chromatin states, we used the universal ChromHMM60 100-chromatin state annotation of the human genome, which captures combinatorial and spatial patterns of chromatin marks over 1,000 epigenomic datasets from more than 100 cell and tissue types. The version of the annotations we used had been previously lifted over to human hg38 assembly from hg19.60 For the conservation states, we used a ConsHMM61 100-conservation state annotation of the human genome defined directly in hg38, which captures combinatorial and spatial patterns of individual nucleotides aligning to and matching the human reference genome within a 100-way vertebrate sequence alignment.61,62
For the same set of unrelated samples as in the gene-based tests, we used SNVs with MAF <0.05, excluding variants overlapping ENCODE-excluded regions.63 We annotated each variant with the ChromHMM and ConsHMM annotations described above. For each of the six case-control and InPSYght control/TOPMed control comparisons described in the single-variant test section, we used logistic regression to test for association between the non-reference allele count (predictor) and case-control or control study status (outcome). We upweighted rarer variants with the beta function beta(MAF, 1, 25) (mirroring the default choice of WGScan64). We included as covariates the first 10 genetic PCs, sex, sequencing batch (for tests involving only InPSYght samples), and the weighted total count of rare and low-frequency variants for each sample. We repeated the analysis, including the first four sequencing metadata PCs as covariates. We controlled for multiple testing with a Bonferroni correction for 200 tested states with significance p value threshold calculated as 0.05/200 = 0.00025; we used a more stringent p value threshold of 0.05/(200 × 7 comparisons) = 3.6 × 10−5 to account for multiple analyses.
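To make the weighting concrete, the sketch below evaluates a Beta(1, 25) density at each variant's MAF and forms a weighted allele count for one sample. It assumes that beta(MAF, 1, 25) denotes this density used directly as the weight; the exact convention inside WGScan (for example, whether weights are squared) is not asserted here, and the function names are illustrative.

```python
import math


def beta_density(x, a=1.0, b=25.0):
    """Beta(a, b) probability density at x, for 0 < x < 1."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(x) + (b - 1) * math.log1p(-x))


def weighted_burden(genotypes, mafs):
    """Weighted non-reference allele count for one sample.

    genotypes: allele counts (0, 1, or 2) per variant.
    mafs: minor allele frequency per variant, in the same order.
    """
    return sum(beta_density(maf) * g for g, maf in zip(genotypes, mafs))


# Toy values, not study data: rarer variants receive larger weights.
mafs = [0.001, 0.01, 0.04]
weights = [beta_density(m) for m in mafs]
score = weighted_burden([1, 0, 2], mafs)
```

With a = 1 and b = 25 the density reduces to 25(1 − MAF)^24, so the weight falls from about 24.4 at MAF 0.001 to about 9.4 at MAF 0.04.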
Test for InPSYght control-TOPMed control enrichment of rare and low-frequency variants in various genomic repeat categories
To investigate the potential effects of the sequencing technical differences between the InPSYght and TOPMed studies on the number of non-reference alleles detected in genomic repeat regions, we annotated each analyzed variant for its presence in a repeat region. We defined repeat regions using (1) repeat regions identified by RepeatMasker 3.0.1 obtained from the UCSC Genome Browser65 and (2) simple repeats defined by Tandem Repeats Finder.66 The repeat regions were tested as a class and were further divided into 21 repeat categories, for a total of 22 categories. We tested for differential enrichment of SNVs in each of the repeat categories, without inclusion of the sequencing metadata PCs as described for the ChromHMM and ConsHMM state tests. We controlled for multiple testing with a Bonferroni correction for 22 tested repeat categories with the significance p value threshold calculated as 0.05/22 = 0.0023; we used a more stringent p value threshold of 0.05/(22 × 7 comparisons) = 3.3 × 10−4 to account for multiple analyses.
Genome-wide rare and low-frequency variant sliding window burden tests for case-control and InPSYght control-TOPMed control comparisons
To identify local enrichments of disease-associated rare and low-frequency alleles, we performed sliding window burden tests using WGScan64 in the same six unrelated-samples case-control sets and one unrelated control-control sample set as in the chromatin and conservation states analysis. We also used the same variants as in the chromatin and conservation states analysis. Following the default parameters of WGScan, we tested variants in window sizes of 5, 10, 15, 20, 25, and 50 kb (including coding and non-coding regions) and upweighted rarer variants with beta function beta(MAF, 1, 25) weights. We included the same set of covariates as in the chromatin and conservation states analysis (with and without the first four sequencing metadata PCs). We used WGScan’s permutational approach with default parameters (including 5,000 permutation replicates) to estimate the effective number of tests (n) for each comparison group.64 We controlled for multiple testing with a Bonferroni-type correction, with significance thresholds calculated as 0.05/n (2.15 × 10−8 to 2.19 × 10−8); we used a more stringent p value threshold of 0.05/Σᵢnᵢ = 3.1 × 10−9 to account for multiple analysis sets (i), where Σᵢnᵢ = 16,207,362 is the total number of effective tests across all comparisons.
Secondary analysis for the most strongly associated window across all case-control test combinations
We conducted secondary analyses for the most strongly associated sliding window across all six case-control test combinations: the chr11:64,859,972–64,869,939 association observed in InPSYght BD versus InPSYght controls. We removed variants in the repeat regions (defined above) and performed a WGScan-based burden test of InPSYght BD versus InPSYght controls on this window using the approach described above. We also tested the InPSYght BD cases versus InPSYght controls single-variant association in the non-repeat region of chr11:64,859,972–64,869,939 using a two-sided Fisher’s exact test implemented in PLINK 1.943 (as many variants had an MAC lower than the SAIGE threshold [MAC <20]). We then performed two additional WGScan-based burden tests for this chr11:64,859,972–64,869,939 region: variants with nominally significant Fisher’s exact p values (p <0.05) only and variants in the window that were not nominally significant.
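For readers unfamiliar with the test, a two-sided Fisher's exact test on a 2 × 2 table of allele counts can be sketched with the standard library alone. The version below uses the common definition that sums the probabilities of all tables no more likely than the observed one; whether PLINK uses exactly this definition is not asserted here, and the example counts are invented for illustration.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p value for the 2x2 table [[a, b], [c, d]].

    For example: a, b = minor and major allele counts in cases;
    c, d = minor and major allele counts in controls.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denominator = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x minor alleles in row 1.
        return comb(row1, x) * comb(row2, col1 - x) / denominator

    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    p_observed = prob(a)
    # Sum all tables that are no more probable than the observed one
    # (small relative tolerance guards against floating-point ties).
    total = sum(
        p for p in (prob(x) for x in range(lowest, highest + 1))
        if p <= p_observed * (1 + 1e-7)
    )
    return min(1.0, total)


# Invented counts: 3 vs. 1 minor alleles among 4 chromosomes in each group.
p_value = fisher_exact_two_sided(3, 1, 1, 3)
```

Because Python integers have arbitrary precision, the binomial coefficients stay exact even for thousands of chromosomes; only the final ratio is converted to floating point.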
Results
Genome-wide single-variant case-control association analysis
The InPSYght study sample consists of 7,544 African American individuals (estimated African ancestry >25%): 1,598 with BD, 3,295 with SZ, and 2,651 without known BD or SZ or unipolar depression (Table S1). Of all the participants, 42% were female, and participants had an average age of 42.5 ± 12.7 years (Table S3). We generated WGS data for InPSYght samples at an average depth of 26.8 ± 5.5. We estimated the ancestral sources of African ancestry in the samples using the HGDP reference panel and found that almost all InPSYght individuals were genetically most similar to the West African populations represented by Yoruba and Mandenka samples (Figure S2).
To increase power to detect BD- and SZ-variant associations, we included as controls 14,812 African American individuals from the TOPMed study (99.9% of which had >25% African ancestry; average sequencing depth 36.8 ± 4.7). The TOPMed samples came from studies focused on diseases of the heart, lung, and blood and sleep disorders, and inclusion was not predicated on information about mental health (Table S2). To minimize differences between InPSYght and TOPMed samples used in our analyses, we jointly called the samples, and we selected TOPMed samples to have a genetic PC composition similar to that of the InPSYght samples (subjects and methods; Figure S1). The selected TOPMed samples were 62% female. In the jointly called InPSYght and TOPMed external control dataset, we identified 226,434,324 variants (210,210,658 SNVs and 16,223,666 short indels) on the autosomes and chromosome X, 220,310,579 variants of which have MAF <0.05 (204,467,345 SNVs and 15,843,234 short indels).
To identify SNV and short indels associated with BD and SZ, we performed GWAS single-variant tests of association (MAC >20 in the tested group). We adjusted for the first 10 genetic PCs, sex, and sequencing batches (only in InPSYght sample analysis) and accounted for relatedness using a mixed model. First, to determine whether differences in genetic ancestry or sequencing between InPSYght and TOPMed samples might cause artifactual associations in the BD and SZ association analysis, we performed a GWAS of InPSYght controls versus TOPMed controls. There was no evidence of inflation of genomic control (λGC = 1.02). There was one common genome-wide significant variant, an indel on chromosome 13 (rs11350613) (OR = 0.79 [95% confidence interval {CI} 0.74–0.85], p = 1.2 × 10−10, MAF of 0.63 versus 0.68 in InPSYght controls versus TOPMed controls, respectively) (Figure S3). However, variant rs11350613 is not in LD with other variants in our data, and it just barely passed the SVM-based QC filter in our study (−0.497, threshold for retention SVM >−0.5) and failed QC in the subsequent TOPMed 10 data freeze (https://bravo.sph.umich.edu/variant.html?chrom=13&pos=79615934&ref=C&alt=CT). Second, we estimated the power to detect case-control associations for each case group versus the InPSYght + TOPMed control group. We used p < 5 × 10−9 for genome-wide significance and p < 5 × 10−9/7 = 7.1 × 10−10 for conservative multiple groups testing genome-wide significance to account for the seven groups of samples being tested. For tests of BD, SZ, and SZ + BD association with InPSYght + TOPMed controls, we have approximately 80% power for p < 5 × 10−9 and p < 7.1 × 10−10 to detect ORs of 3.2, 2.4, and 2.1 and 3.4, 2.5, and 2.3, respectively, for an MAF of 0.01 and for ORs of 1.76, 1.53, and 1.43 and 1.78, 1.55, and 1.44, respectively, for an MAF of 0.05. 
We performed single-variant association analyses of BD, SZ, or SZ + BD versus InPSYght + TOPMed controls or versus InPSYght controls (Manhattan and quantile-quantile [Q-Q] plots; Figures S3–S10; Table S6). The genomic control inflation factor (λGC), which measures the deviation of the observed test statistics from those expected, ranged from 1.00 to 1.02 for the various case-control combinations tested, consistent with minimal stratification bias or p value inflation (Table 1). We observed one locus on chromosome 18 (2 SNVs and 1 indel) that was genome-wide significant before, but not after, correction for multiple group testing (p between 5 × 10−9/7 = 7.1 × 10−10 and 5 × 10−9) in the BD versus InPSYght + TOPMed control analysis (Figure S6): lead SNV chr18:49738979:G:T, OR (95% CI) = 30.7 (1.35 × 10−9), p = 1.3 × 10−9, MAF of 0.0069 (BD) versus 0.0011 (InPSYght + TOPMed controls). The three variants are within 600 bp of one another and are in strong LD (r2 > 0.9) (Figure 2). The locus zoom plots display the 1000G AFR LD, and there appears to be a variant with r2 > 0.80 with our associated variants; however, in our data, this variant has r2 = 0.65 with the most strongly associated variants and p = 0.0014. We observed less significant association results in the smaller BD versus InPSYght control analysis (chr18:49738979:G:T, OR [95% CI] = 8.51 [3.85–18.8], p = 1.22 × 10−7, MAF of 0.0069 [BD] versus 0.00056 [InPSYght controls]) (Table S7). These variants had no obvious quality control issues. The nearest genes in the region are ACAA2, LIPG, and MYO5B.
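The two significance tiers used for the single-variant results can be made explicit with a short sketch. The thresholds are those stated in the text; the function name and tier labels are illustrative assumptions rather than part of the study's pipeline.

```python
GENOME_WIDE = 5e-9                     # single-analysis threshold from the text
N_GROUPS = 7                           # sample-group comparisons tested
MULTI_GROUP = GENOME_WIDE / N_GROUPS   # about 7.1e-10


def significance_tier(p):
    """Classify a p value against the two genome-wide thresholds."""
    if p < MULTI_GROUP:
        return "significant after multiple-group correction"
    if p < GENOME_WIDE:
        return "genome-wide significant before correction only"
    return "not genome-wide significant"


# p values reported for the chromosome 18 lead SNV in the two BD analyses.
tier_combined_controls = significance_tier(1.3e-9)
tier_inpsyght_controls = significance_tier(1.22e-7)
```

The lead SNV in the combined-control analysis falls in the middle tier, which is why the locus is described as suggestive.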
Table 1.
λGC results for each analysis type
| Sample group 1 | Sample group 2 | Single-variant GWAS (base) | Gene exon burden (base) | Chromatin/conservation burden (base) | Chromatin/conservation burden (+ metadata PCs) | Sliding window burden (base) | Sliding window burden (+ metadata PCs) |
|---|---|---|---|---|---|---|---|
| InPSYght controls | TOPMed controls | 1.02 (S) | a | 1.51 | 1.02 | 1.00 | 1.01 (S) |
| InPSYght BD + SZ | InPSYght + TOPMed controls | 1.01 | 0.92 | 1.54 | 1.24 | 1.02 | 1.01 |
| InPSYght BD + SZ | InPSYght controls | 1.00 | a | 0.98 | 0.98 | 0.98 | 0.98 |
| InPSYght BD | InPSYght + TOPMed controls | 1.02 (S) | a | 1.70 (S) | 1.21 | 1.00 | 0.99 |
| InPSYght BD | InPSYght controls | 1.00 | a | 1.13 | 1.07 | 0.98 | 0.98 |
| InPSYght SZ | InPSYght + TOPMed controls | 1.00 | a | 1.28 | 1.10 | 1.02 | 1.01 |
| InPSYght SZ | InPSYght controls | 1.00 | a | 0.92 | 0.93 | 0.99 | 0.98 |
(S) represents at least one case/control genome-wide or category-wide significant association test for one comparison or seven comparisons. Base refers to the association test with base covariates (see subjects and methods) without the inclusion of metadata PCs. BD, bipolar disorder; SZ, schizophrenia.
a: Comparison was not tested in this study.
Figure 2.
Regional view of chromosome 18 locus showing evidence of association in the InPSYght BD versus InPSYght+TOPMed controls
Horizontal line shows a genome-wide significance threshold of 5 × 10−9.
We found that in chromatin and conservation state tests (see below), sequencing metric-based PCs may help control for sample sequencing differences. We therefore repeated our analysis of this region, including four sequencing metric metadata PCs. The results were slightly attenuated and no longer genome-wide significant (chr18:49738979:G:T, OR [95% CI] = 15.0 [5.78–39.1], p = 2.7 × 10−8).
Gene-based and gene set tests
The SCHEMA Consortium exome-sequencing meta-analysis of 24,248 SZ cases and 97,322 controls, predominantly of European ancestry, tested functional variant annotation groupings for association with SZ; they found that extremely rare PTVs had the strongest SZ associations.15 The SCHEMA analysis contained InPSYght SZ and control individuals but did not separately report the results for InPSYght African American individuals. To specifically assess gene-based PTV associations in African American individuals, we used 4,489 InPSYght SZ + BD cases and 10,883 InPSYght + TOPMed controls (all unrelated individuals) to test for PTV burden in 1,045 genes with singleton PTV count >10 (see subjects and methods). We included the BD cases to increase the number of alleles per gene given similarities in the underlying genetic architecture.54 We did not detect evidence of inflation of association statistics (λGC = 0.92) and found no SZ + BD-associated genes (Table 1; Figure S10). In a sensitivity analysis expanding the singleton criteria to PTVs up to 1% MAF, we again found no significant SZ + BD genes (Table S8). Of SCHEMA’s 10 most strongly associated SZ genes, 4 genes had no singleton PTVs in the InPSYght + TOPMed sample and 6 had fewer than 10 singleton PTVs; all had p > 0.05 (Table S9). Considering SCHEMA’s top 10 genes as a single gene set, we observed directionally consistent, although non-significant, enrichments of PTVs in SZ (OR = 1.65, p = 0.42) and SZ + BD cases (OR = 2.07, p = 0.15) compared to controls.
We tested three previously identified SZ-associated gene sets for PTV gene set enrichment in SZ or SZ + BD versus InPSYght + TOPMed controls. For the most strongly enriched SCHEMA15 gene set—3,063 constrained (pLI >0.90) genes15,59—we found significant association for both SZ (OR = 1.11, p = 8.2 × 10−3) and SZ + BD (OR = 1.13, p = 2.9 × 10−3) (samples contained within SCHEMA). For two SZ-associated gene sets identified in multiple papers—1,423 post-synaptic density genes56,57 and 784 FMR1 protein-associated genes58,59—we found directionally consistent ORs but no significant associations (Table 2).
Table 2.
InPSYght singleton PTV burden test results for previously published SZ gene sets
| Gene set | No. of genes | InPSYght study of BD and SZ | InPSYght study of SZ | Genovese et al.57 (SZ) | Singh et al. (SCHEMA)15,a (SZ) |
|---|---|---|---|---|---|
| Sample size | | Case n = 4,489; control n = 10,883 | Case n = 3,006; control n = 10,883 | Case n = 4,877; control n = 6,203 | Case n = 24,248; control n = 97,322 |
| Ancestry | | African American ancestry | African American ancestry | European ancestry | Multi-ethnic ancestry |
| Statistic | | OR (p value) | OR (p value) | OR (p value) | OR (p value) |
| Constrained (pLI > 0.90) | 3,063 | 1.13 (0.0030)b | 1.13 (0.0082)b | 1.17 (1.7 × 10−8) | 1.26 (7.6 × 10−35) |
| PSD | 1,423 | 1.09 (0.052) | 1.11 (0.039) | c | 1.20 (1.0 × 10−6) |
| FMRP | 784 | 1.10 (0.058) | 1.10 (0.099) | 1.23 (8.2 × 10−9) | 1.25 (2.2 × 10−17) |
a: Overlap in study sample with InPSYght.
b: Significant test, p < 0.0085.
c: Gene set was not tested in the listed study.
Case-control enrichment of rare and low-frequency SNVs in chromatin states and conservation states
We next investigated whether chromatin- or conservation state-based sets of rare and low-frequency SNVs (MAF <0.05) are differentially enriched between the InPSYght and TOPMed controls and between SZ and/or BD cases and InPSYght and/or TOPMed controls (Figure 1, unrelated individuals). Our goal with this analysis was to test a diverse set of systematically defined genomic regions that could potentially capture both technical artifacts and biological associations. For this, we annotated the SNVs using a set of 100-ChromHMM universal chromatin states60 and a set of 100-ConsHMM conservation states,61,62 for a total of 200 states (see subjects and methods). For each state, we performed a logistic regression to test whether two groups have a differential variant burden, with upweighting of variants with lower MAF (see subjects and methods).
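To make the idea of an MAF-upweighted burden concrete, the sketch below computes a per-individual weighted burden score for the variants assigned to one state. The weighting scheme is an assumption for illustration only: it uses the Beta(1, 25) density, a common choice in rare-variant tests, whereas the weights actually used are described in subjects and methods. Function names and the toy data are hypothetical. The resulting score would then serve as the predictor in a logistic regression of group membership, alongside covariates.

```python
def beta_1_25_weight(maf):
    """Beta(1, 25) density at maf; larger for rarer variants (assumed scheme)."""
    return 25.0 * (1.0 - maf) ** 24


def weighted_burden(genotypes, mafs, max_maf=0.05):
    """Return one weighted burden score per individual for a single state.

    genotypes: per-individual lists of minor-allele counts (0, 1, or 2)
    mafs: minor allele frequency of each variant, in the same column order
    """
    # Keep only rare and low-frequency variants (MAF < max_maf).
    keep = [j for j, maf in enumerate(mafs) if 0.0 < maf < max_maf]
    weights = {j: beta_1_25_weight(mafs[j]) for j in keep}
    return [sum(weights[j] * person[j] for j in keep) for person in genotypes]


# Toy example (hypothetical): three variants, the last one too common to count.
mafs = [0.001, 0.04, 0.20]
genotypes = [
    [1, 0, 2],  # carries the rarest variant and the excluded common variant
    [0, 1, 0],  # carries the low-frequency variant
    [0, 0, 0],  # carries none
]
scores = weighted_burden(genotypes, mafs)
```

In this toy example, the carrier of the rarest variant receives the largest score, and the common variant contributes nothing.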
We tested for state differences in variant burden in InPSYght controls versus TOPMed controls and found that the p value distribution was substantially inflated compared to the expected distribution (λGC = 1.51; Figures 3A and 3B; Table 1). Likewise, in case-control analyses that included TOPMed controls, we found substantial p value inflation (λGC = 1.28–1.70; Figures 3C, 3D, and S11; Table 1). One state (ConsHMM state 75) had lower variant burden in BD cases compared to InPSYght + TOPMed controls (OR = 0.71, p = 9.9 × 10−5, significant at a per-case-control group level).
Figure 3.
Chromatin and conservation state burden test results
Results for InPSYght controls versus TOPMed controls (A and B) and InPSYght cases versus InPSYght and TOPMed controls (C and D), with each point representing a ChromHMM or a ConsHMM state. Dashed lines show Bonferroni-based p value thresholds (p = 0.05/200).
(A and C) Q-Q plots and genomic inflation factors (λ) before and after inclusion of sequencing metadata PCs covariates; diagonal lines show the unit slope.
(B and D) Signed (by direction of enrichment coefficient) −log10p values before and after inclusion of sequencing metadata PCs covariates. Sign direction: case enrichment values are positive, and control enrichment values are negative.
To test whether these findings could be due to differences in technical sequencing factors between InPSYght and TOPMed, we constructed a set of sequencing metadata PCs from sequencing quality control metrics (see subjects and methods) and included them as covariates. These sequencing metadata PCs summarize various quality control metrics and were driven almost exclusively by sequencing depth-related metrics (Table S5). After inclusion of sequencing metadata PCs, the TOPMed controls versus InPSYght controls state-based burden test had a non-inflated λGC = 1.02, and the TOPMed control-containing case-control analyses had lower λGCs (1.10–1.24), with no visible inflation at more significant p values (Figures 3B, 3D, and S11; Table 1). Across all the case-control comparisons, no state reached a Bonferroni-based significance threshold of 0.05/200 = 2.5 × 10−4 (200 being the total number of states tested across the two models) after the inclusion of metadata PCs (Figure S12). Thus, we did not find evidence for enrichment in BD and/or SZ cases of rare and low-frequency variants in particular chromatin or conservation states.
Control-control enrichment of rare and low-frequency variants for repeat classes
We next sought to better understand why the inflated λGC values for chromatin and conservation states in the control-control enrichment analysis became non-inflated when sequencing metadata PCs were included. Given that repeat regions often present technical challenges in sequencing, we hypothesized that repeat regions as a whole or of particular classes might show specific differences in enrichment by control study. This hypothesis would also be consistent with strong enrichments shown for various categories of genomic repeat elements in specific ChromHMM60 and ConsHMM61 states. To test this hypothesis, we annotated rare and low-frequency SNVs with a total of 21 different categories of repeat regions defined by RepeatMasker and Tandem Repeats Finder.66 Using the same test as for state-based analysis, we asked whether the TOPMed and InPSYght controls were differentially enriched for SNVs over all repeats and in specific repeat categories. We did not see significant enrichment of SNVs in the overall repeat category (OR = 2.13, p = 4.1 × 10−2, overlapping 53% of variants on average). However, we found significant enrichment of rare and low-frequency variants in TOPMed controls compared to InPSYght controls in the RepeatMasker-defined short interspersed nuclear elements (SINE) (OR = 2.04, p = 3.0 × 10−9, overlapping 16% of variants on average) and simple repeat regions (OR = 1.31, p = 3.2 × 10−6, overlapping 1% of variants on average) (Figures 4A and 4B). After including sequencing metadata PCs as additional covariates, only simple repeat regions remained significant (at a single-analysis set level), with a greatly reduced significance level (OR = 1.35, p = 2.0 × 10−3) (Figures 4C and 4D). Overall, these results suggest that the TOPMed and InPSYght cohorts have different distributions of variants in specific repeat categories, which could produce artifactual associations in burden-type analyses. Such differences can be partially controlled for with the inclusion of sequencing metadata PCs.
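The annotation step described above, assigning each SNV to a repeat category, can be sketched as an interval lookup. This is a minimal illustration and not the study's pipeline: it assumes non-overlapping, half-open intervals on a single chromosome, and the coordinates and function names are hypothetical.

```python
from bisect import bisect_right


def build_index(intervals):
    """Sort (start, end, category) intervals and extract their start positions.

    Assumes intervals are half-open [start, end) and do not overlap.
    """
    ordered = sorted(intervals)
    starts = [interval[0] for interval in ordered]
    return starts, ordered


def repeat_category(pos, index):
    """Return the repeat category covering pos, or None if outside repeats."""
    starts, ordered = index
    # Find the last interval whose start is at or before pos.
    i = bisect_right(starts, pos) - 1
    if i >= 0:
        start, end, category = ordered[i]
        if start <= pos < end:
            return category
    return None


# Toy repeat annotation and SNV positions (hypothetical coordinates).
index = build_index([(1000, 1020, "Simple_repeat"), (100, 400, "SINE")])
snv_positions = [150, 400, 1010, 5000]
annotations = [repeat_category(pos, index) for pos in snv_positions]
fraction_in_repeats = sum(a is not None for a in annotations) / len(annotations)
```

Per-category variant sets built this way could then be passed to the same weighted burden test used for the state-based analysis.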
Figure 4.
Test of repeat categories for enrichment of rare and low-frequency variants in TOPMed controls versus InPSYght controls
Test of repeat categories without (A and B) or with (C and D) sequencing metadata PCs covariates.
“DNA?” represents elements with uncertain category classification to the DNA repeat element category. Horizontal dashed lines show Bonferroni-based p value thresholds (p = 0.05/22).
(A and C) Volcano plots with log odds ratio on the x axis and −log10p values on the y axis.
(B and D) Mean weighted percentage of variants overlapping each repeat category on the x axis (mean of the per-person weighted allele count used in the repeat test divided by the total weighted allele count; see subjects and methods), and −log10p values on the y axis.
Genome-wide rare and low-frequency variant sliding window burden tests for BD and SZ samples versus controls
To identify contiguous genomic regions that might harbor an excess of rare and low-frequency SNVs that predispose to or protect from BD and SZ, we conducted genome-wide 5- to 50-kb sliding window analyses comparing the TOPMed and InPSYght control groups and in 6 case-control groups (Figure 1) using the WGScan framework.64 Within each genomic window, we performed an allele frequency-weighted burden test, adjusting for the covariates used in the chromatin and conservation state analysis (with and without the metadata PCs).
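A sliding window scan can be pictured as enumerating overlapping windows of several sizes and running one burden test per window. The sketch below only generates the windows. The window sizes, the half-window step, and the function name are assumptions for illustration and do not reproduce the WGScan implementation or its interface.

```python
def sliding_windows(region_start, region_end, window_sizes, step_fraction=0.5):
    """Yield (start, end, size) half-open windows for each window size.

    Consecutive windows of the same size start step_fraction * size apart,
    and windows are truncated at the end of the region.
    """
    for size in window_sizes:
        step = max(1, int(size * step_fraction))
        start = region_start
        while start < region_end:
            yield (start, min(start + size, region_end), size)
            start += step


# Toy region of 20 kb scanned with 5-kb and 10-kb windows (hypothetical).
windows = list(sliding_windows(0, 20_000, window_sizes=[5_000, 10_000]))
n_5kb = sum(1 for w in windows if w[2] == 5_000)
n_10kb = sum(1 for w in windows if w[2] == 10_000)
```

Because every window is tested, the number of windows determines the multiple-testing burden, which is why a genome-wide window-level significance threshold is applied.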
To identify windows that might be affected by differences in sequencing between the TOPMed and InPSYght studies, we first compared InPSYght controls and TOPMed controls (Figure 5). The λGC of 1.01 was consistent with no inflation of the test statistics (Table 1). However, we identified a 50-kb region on chromosome 5 (chr5:58,500,010–58,550,007) with 12 windows with genome-wide significant associations, with the strongest association being chr5:58,510,332–58,515,331 (p = 1.29 × 10−9, significant when accounting for multiple analysis sets; Figure 5B). This region showed an elevated burden of rare and low-frequency SNVs for InPSYght controls (mean weighted burden = 156) compared to TOPMed controls (mean weighted burden = 145). To test the robustness of this association, we repeated the analysis without including the sequencing metadata PCs as covariates and found that none of the windows in this region remained significant (minimum p = 2.16 × 10−3) (Figures 5A, 5C, and 5D), leaving open the question of whether this was a chance association or whether the metadata PCs induced a false positive association in this region.
Figure 5.
WGScan genome-wide sliding window burden test for rare and low-frequency variants for InPSYght controls versus TOPMed controls
Horizontal lines show genome-wide p value significance thresholds (p = 2.18 × 10−8); diagonal line shows the unit slope.
(A and B) Manhattan plots of sliding window p values without sequencing metadata PCs (A) and with sequencing metadata PCs (B).
(C) Q-Q plot for the window p values, before and after inclusion of sequencing metadata PCs covariates.
(D) Comparison between sliding window p values before and after inclusion of sequencing metadata PCs covariates, showing windows in the chr5 region 58,500,010–58,550,007 (red).
In the six case-control analyses, we observed no inflation of the test statistics (λGC = 0.98–1.01) with the inclusion of metadata PCs (Table 1). We did not identify any window where the burden of rare and low-frequency SNVs was significantly associated with BD and/or SZ. The significant chromosome 5 region from the control-control analysis did not show an association signal in any of the case-control analyses (p ≥ 0.01 for every case-control analysis).
Interestingly, when we compared the sliding window results to BD and SZ GWAS results, we noticed that the most significant window across all case-control tests was located in a 6-Mb region of chromosome 11 containing multiple independent Psychiatric Genomics Consortium common variant BD associations (Table S10).4 This 10-kb window (chr11:64,859,972–64,869,939; GRCh38) overlaps the EHD1 gene and has a higher burden in controls than in cases (p = 4.06 × 10−8 for InPSYght BD versus InPSYght controls). This window is the tenth most significant window for the InPSYght BD versus combined InPSYght and TOPMed controls analysis (p = 1.37 × 10−6).
We assessed whether removing repetitive regions might strengthen the association signal because we could more accurately genotype non-repeat variants. In the non-repetitive region analysis of InPSYght BD versus InPSYght controls including sequencing metadata PCs, this window reached genome-wide significance (p = 1.12 × 10−9) and was the most strongly associated window across all case-control comparisons; we found similar results when the analysis was run without metadata PCs (p = 1.09 × 10−9). We identified a specific subset of control-enriched variants belonging to distinct haplotypes (Figure S13). Across cell and tissue types, this 10-kb window has an average of 44.5% of base pairs in the TxReg chromatin state (defined by a high presence of transcription, enhancer, and promoter chromatin marks) from a chromatin state model providing per-cell and tissue annotations67; only 0.02% of the 10-kb windows in the genome had a higher percentage of base pairs annotated by this state, suggesting high regulatory potential (Figure S14; see supplemental text for further analysis). These analyses point to a potential convergence of BD-related associations in this region that awaits replication in larger samples.
Discussion
We investigated the genomic basis of SZ and BD across the allele frequency spectrum for coding and non-coding variants in a cohort of African American individuals. We identified three low-frequency BD-associated variants (two SNVs and one indel) on chromosome 18. These variants are 200 kb from a BD association identified in a GWAS of 1,461 BD cases (bipolar 1 disorder) and 2,008 controls37; larger studies have not identified associations in this region.4,12 In addition to the single-variant results, we identified a significantly associated window on chromosome 11 when performing a secondary BD analysis excluding genomic repeat regions. The associated window is located in a 6-Mb cluster of five common variant GWAS associations for BD.4 This region has a significantly lower burden of rare and low-frequency variants in BD cases compared to InPSYght controls. Given the number of GWAS and sliding window tests we performed (albeit with overlapping sets of cases and controls), these findings would likely not survive a more stringent correction for the effective number of tests performed in our study. Thus, we consider these chromosome 18 and 11 results as potentially associated loci that would need to be assessed in larger WGS studies, including those with African American individuals.
Although we worked to minimize genotype calling differences by calling InPSYght and TOPMed samples together, one potential limitation of our study is that the TOPMed controls were more deeply sequenced and were sequenced separately from InPSYght cases and controls. Across our primary analyses, the only tests that appeared to be sensitive to potential differences in sequencing were the ConsHMM conservation state and universal ChromHMM chromatin state tests (based on an inflated λGC). These tests combine large numbers of rare and low-frequency variants from regions that may be more or less difficult to sequence or prone to sequencing artifacts. These states are known to have different levels of repetitive regions,60,61 and we found that particular classes of repeats, SINE and simple repeat regions, were enriched for significant differences between the InPSYght and TOPMed controls. These findings suggest that sensitivity analyses should be performed for tests that aggregate variants across any class of genomic regions. For example, gene set analysis of non-coding variants may be susceptible to false positives given that different repeat classes are enriched near biologically distinct sets of genes; SINE are enriched in housekeeping genes.68
In our initial analyses of genomic regions, we attempted to control for sequencing differences and variant density, in a manner analogous to that of tests of rare singletons, by inclusion of the total weighted rare and low-frequency allele count as a covariate. This covariate was not effective in controlling the false positive rate in the ConsHMM conservation and ChromHMM chromatin states tests. We found, however, that inclusion of sequencing metadata PCs reduced or even eliminated the λGC inflation. This suggests that for tests that aggregate rare and low-frequency variants (and potentially tests that aggregate singletons), inclusion of more extensive sequence metrics may be necessary to control the false positive rate.
In contrast to their ability to control for p value inflation in the InPSYght versus TOPMed control chromatin and conservation state tests, we observed that the inclusion of the sequencing metadata PCs appeared to induce a false positive association in the sliding window control-control (TOPMed versus InPSYght) analysis. When we included sequencing metadata PCs as covariates, we identified a genome-wide significant region on chromosome 5 that was 6 orders of magnitude more significant than without sequencing metadata PCs. These findings suggest that significant associations should be subject to sensitivity analyses with and without adjustment for sequencing-related covariates. The findings also highlight the challenges associated with controlling for batch/sequencing effect differences and variant quality in WGS studies.69
There are additional limitations in the present study and analysis. First, we have a very small effective sample size relative to genotype array-based GWASs. We expect that larger WGS case-control samples will be essential to the identification of rarer non-coding variants. Second, to have a focused study, we only considered SNVs and short indels, and we did not analyze large structural variations, including repeat expansions. Third, there was a substantially greater proportion of males in InPSYght cases than in TOPMed control individuals. We corrected for sex in our association testing, which would reduce the impact of any (unlikely) differences in allele frequency by sex.
Fourth, the TOPMed controls were not screened for psychiatric disorders, and thus the presence of individuals with BD or SZ in the control group could have decreased our power to detect BD and SZ associations. We expect misclassification in the control group to have a minimal effect70 because (1) the combined prevalence of BD and SZ is 2%–4% in the general population71 and (2) individuals with severe mental illness are less likely than individuals without severe mental illness to participate in non-psychiatric disorder-based studies due to exclusion criteria and/or a decreased likelihood of enrollment.72,73 As improved methods for analyzing rare non-coding variation are developed,64,70 and the field comes to a clearer consensus on which analytical strategies and annotations are most effective for WGS analysis, more disease-relevant information will be extracted from WGS data.
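The expected effect of unscreened controls can be illustrated with a simple mixture calculation: if a fraction of controls are undiagnosed cases, the control allele frequency shifts toward the case frequency and the observed odds ratio is pulled toward 1. The sketch below is a simplified allelic calculation with hypothetical frequencies and function names; it is not a power analysis from the study.

```python
def diluted_control_frequency(p_control, p_case, case_fraction):
    """Allele frequency in a control group containing a fraction of cases."""
    return (1.0 - case_fraction) * p_control + case_fraction * p_case


def allelic_odds_ratio(p_case, p_control):
    """Unadjusted allelic odds ratio from two allele frequencies."""
    return (p_case / (1.0 - p_case)) / (p_control / (1.0 - p_control))


# Hypothetical allele frequencies; 4% is the upper end of the combined
# BD and SZ prevalence range cited in the text.
p_case, p_control, case_fraction = 0.02, 0.01, 0.04

true_or = allelic_odds_ratio(p_case, p_control)
mixed_frequency = diluted_control_frequency(p_control, p_case, case_fraction)
observed_or = allelic_odds_ratio(p_case, mixed_frequency)
```

With these toy values, the observed odds ratio is only modestly smaller than the true one, in line with the expectation that misclassification at this level has a limited effect.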
In summary, WGS allowed us to test variants for association with SZ and/or BD across the allele frequency spectrum in both coding and non-coding regions, in particular, rare non-coding variants that are missed by GWASs or exome sequencing studies. Our study highlights the need to perform sensitivity analysis when conducting WGS analyses, particularly for aggregation tests of rare and low-frequency non-coding variants that use data from multiple studies, and the need for larger WGS studies of African American individuals. Inclusion of African American individuals and other historically under-represented populations in psychiatric studies is crucial to gaining a more comprehensive and equitable understanding of the contribution of genetic variation to psychiatric disorders and may uncover associations previously unobserved in a more limited set of ancestries.74 We expect our data and additional African American WGS studies will contribute to the growing understanding of rare non-coding variants and complex psychiatric diseases in individuals with different ancestries.
Data and code availability
- WGS data generated for this study’s samples are available in AnVIL: WGSPD Project 1: Whole Genome Sequencing for Schizophrenia and Bipolar Disorder (WGSPD1). The AnVIL Terra Workspaces containing data (AnVIL: https://anvilproject.org/data/studies/phs002041/workspaces) are AnVIL_NIMH_Broad_WGSPD1_McCarroll_Pato_GRU_WGS, ANVIL_NIMH_Broad_WGSPD_1_McCarroll_Braff_DS_WGS, and ANVIL_NIMH_Broad_WGSPD1_McCarroll_COGS_DS_WGS. Phenotype data are available at dbGaP: phs002041.v2.p1. GPC data used in the preparation of this study (genotype data, variables used in analysis, and case-control status), as well as selected results, are available in the NIMH Data Archive (NDA). NDA is a collaborative informatics system created by the National Institutes of Health to provide a national resource to support and accelerate research in mental health. Dataset identifier: https://doi.org/10.15154/1z0s-af20. This manuscript reflects the views of the authors and may not reflect the opinions or views of the NIH or of the Submitters submitting original data to NDA.
- The Freeze 9 TOPMed WGS genotype data used in this study are in dbGaP. See TOPMed projects (above) or Table S2 for the accession numbers.
- The code for ChromHMM and ConsHMM state-based analysis is available at https://github.com/ernstlab/inpsyght_states.
Acknowledgments
This work was supported by the University of Michigan Precision Health Scholar Award (https://precisionhealth.umich.edu/) (S.A.G.T.); R01 HG009976 (M.B.); U01MH105653 (M.B., C.N.P., and S.A.M.); salary support from the Fonds de Recherche du Québec–Santé: a Junior 1 Award, and currently a Junior 2 Award (S.A.G.T.); R01 MH115676 (R.A.O.); and DP1DA044371 and U01HG012079 (J.E.). This work used computational and storage services associated with the Hoffman2 Cluster, which is operated by the University of California, Los Angeles Office of Advanced Research Computing’s Research Technology Group. Biosamples and data for the InPSYght study were obtained from the NIMH Repository & Genomics Resource (U24MH068457), a centralized national biorepository for genetic studies of psychiatric disorders (https://www.nimhgenetics.org/). Contributing studies include the GPC, the Consortium of the Genetics of Schizophrenia, the Bipolar Genome Study, the LiTMUS, and the Systematic Treatment Enhancement Program for Bipolar Disorder. We gratefully acknowledge the participants who provided biological samples and data for these studies. We gratefully acknowledge the studies and participants who provided biological samples and data for TOPMed. Molecular data for the TOPMed program were supported by the NHLBI. See the supplemental information for TOPMed study and TOPMed Neurocognitive Working Group members. Core support, including centralized genomic read mapping and genotype calling, variant quality metrics, and filtering, was provided by the TOPMed Informatics Research Center (3R01HL-117626-02S1, contract HHSN268201800002I). Core support, including phenotype harmonization, data management, sample-identity quality control, and program coordination, was provided by the TOPMed Data Coordinating Center (R01HL-120393 and U01HL-120393; contract HHSN268201800001I).
Declaration of interests
A.E.L. is a shareholder of Regeneron Pharmaceuticals. K.C.B. is an employee of Oxford Nanopore Technologies, Ltd. G.A. is a shareholder of Regeneron Pharmaceuticals.
Footnotes
Supplemental information can be found online at https://doi.org/10.1016/j.xhgg.2025.100499.
Contributor Information
Jason Ernst, Email: jason.ernst@ucla.edu.
Laura J. Scott, Email: ljst@umich.edu.
References
- 1.Plana-Ripoll O., Pedersen C.B., Agerbo E., Holtz Y., Erlangsen A., Canudas-Romo V., Andersen P.K., Charlson F.J., Christensen M.K., Erskine H.E., et al. A comprehensive analysis of mortality-related health metrics associated with mental disorders: a nationwide, register-based cohort study. Lancet. 2019;394:1827–1835. doi: 10.1016/S0140-6736(19)32316-5. [DOI] [PubMed] [Google Scholar]
- 2.Song J., Bergen S.E., Kuja-Halkola R., Larsson H., Landén M., Lichtenstein P. Bipolar disorder and its relation to major psychiatric disorders: a family-based study in the Swedish population. Bipolar Disord. 2015;17:184–193. doi: 10.1111/bdi.12242. [DOI] [PubMed] [Google Scholar]
- 3.Owen M.J., Sawa A., Mortensen P.B. Schizophrenia. Lancet. 2016;388:86–97. doi: 10.1016/S0140-6736(15)01121-6. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 4.Mullins N., Forstner A.J., O’Connell K.S., Coombes B., Coleman J.R.I., Qiao Z., Als T.D., Bigdeli T.B., Børte S., Bryois J., et al. Genome-wide association study of more than 40,000 bipolar disorder cases provides new insights into the underlying biology. Nat. Genet. 2021;53:817–829. doi: 10.1038/s41588-021-00857-4. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 5.Gibson G. Rare and common variants: twenty arguments. Nat. Rev. Genet. 2012;13:135–145. doi: 10.1038/nrg3118. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 6.Schizophrenia Psychiatric Genome-Wide Association Study GWAS Consortium Genome-wide association study identifies five new schizophrenia loci. Nat. Genet. 2011;43:969–976. doi: 10.1038/ng.940. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 7.Ripke S., O’Dushlaine C., Chambert K., Moran J.L., Kähler A.K., Akterin S., Bergen S.E., Collins A.L., Crowley J.J., Fromer M., et al. Genome-wide association analysis identifies 13 new risk loci for schizophrenia. Nat. Genet. 2013;45:1150–1159. doi: 10.1038/ng.2742. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 8.Schizophrenia Working Group of the Psychiatric Genomics Consortium Biological insights from 108 schizophrenia-associated genetic loci. Nature. 2014;511:421–427. doi: 10.1038/nature13595. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 9.Pardiñas A.F., Holmans P., Pocklington A.J., Escott-Price V., Ripke S., Carrera N., Legge S.E., Bishop S., Cameron D., Hamshere M.L., et al. Common schizophrenia alleles are enriched in mutation-intolerant genes and in regions under strong background selection. Nat. Genet. 2018;50:381–389. doi: 10.1038/s41588-018-0059-2.
- 10.Lam M., Chen C.-Y., Li Z., Martin A.R., Bryois J., Ma X., Gaspar H., Ikeda M., Benyamin B., Brown B.C., et al. Comparative genetic architectures of schizophrenia in East Asian and European populations. Nat. Genet. 2019;51:1670–1678. doi: 10.1038/s41588-019-0512-x.
- 11.Sklar P., Ripke S., Scott L.J., Andreassen O.A., Cichon S., Craddock N., Edenberg H.J., Nurnberger J.I., Rietschel M., Blackwood D., et al. Large-scale genome-wide association analysis of bipolar disorder identifies a new susceptibility locus near ODZ4. Nat. Genet. 2011;43:977–983. doi: 10.1038/ng.943.
- 12.Stahl E.A., Breen G., Forstner A.J., McQuillin A., Ripke S., Trubetskoy V., Mattheisen M., Wang Y., Coleman J.R.I., Gaspar H.A., et al. Genome-wide association study identifies 30 loci associated with bipolar disorder. Nat. Genet. 2019;51:793–803. doi: 10.1038/s41588-019-0397-8.
- 13.Trubetskoy V., Pardiñas A.F., Qi T., Panagiotaropoulou G., Awasthi S., Bigdeli T.B., Bryois J., Chen C.-Y., Dennison C.A., Hall L.S., et al. Mapping genomic loci implicates genes and synaptic biology in schizophrenia. Nature. 2022;604:502–508. doi: 10.1038/s41586-022-04434-5.
- 14.Zeng J., Xue A., Jiang L., Lloyd-Jones L.R., Wu Y., Wang H., Zheng Z., Yengo L., Kemper K.E., Goddard M.E., et al. Widespread signatures of natural selection across human complex traits and functional genomic categories. Nat. Commun. 2021;12:1164. doi: 10.1038/s41467-021-21446-3.
- 15.Singh T., Poterba T., Curtis D., Akil H., Al Eissa M., Barchas J.D., Bass N., Bigdeli T.B., Breen G., Bromet E.J., et al. Rare coding variants in ten genes confer substantial risk for schizophrenia. Nature. 2022;604:509–516. doi: 10.1038/s41586-022-04556-w.
- 16.Palmer D.S., Howrigan D.P., Chapman S.B., Adolfsson R., Bass N., Blackwood D., Boks M.P.M., Chen C.-Y., Churchhouse C., Corvin A.P., et al. Exome sequencing in bipolar disorder identifies AKAP11 as a risk gene shared with schizophrenia. Nat. Genet. 2022;54:541–547. doi: 10.1038/s41588-022-01034-x.
- 17.Homann O.R., Misura K., Lamas E., Sandrock R.W., Nelson P., McDonough S.I., DeLisi L.E. Whole-genome sequencing in multiplex families with psychoses reveals mutations in the SHANK2 and SMARCA1 genes segregating with illness. Mol. Psychiatr. 2016;21:1690–1695. doi: 10.1038/mp.2016.24.
- 18.Khan F.F., Melton P.E., McCarthy N.S., Morar B., Blangero J., Moses E.K., Jablensky A. Whole genome sequencing of 91 multiplex schizophrenia families reveals increased burden of rare, exonic copy number variation in schizophrenia probands and genetic heterogeneity. Schizophr. Res. 2018;197:337–345. doi: 10.1016/j.schres.2018.02.034.
- 19.Sul J.H., Service S.K., Huang A.Y., Ramensky V., Hwang S.-G., Teshiba T.M., Park Y., Ori A.P.S., Zhang Z., Mullins N., et al. Contribution of common and rare variants to bipolar disorder susceptibility in extended pedigrees from population isolates. Transl. Psychiatry. 2020;10:74. doi: 10.1038/s41398-020-0758-1.
- 20.Alkelai A., Greenbaum L., Docherty A.R., Shabalin A.A., Povysil G., Malakar A., Hughes D., Delaney S.L., Peabody E.P., McNamara J., et al. The benefit of diagnostic whole genome sequencing in schizophrenia and other psychotic disorders. Mol. Psychiatr. 2022;27:1435–1447. doi: 10.1038/s41380-021-01383-9.
- 21.Georgi B., Craig D., Kember R.L., Liu W., Lindquist I., Nasser S., Brown C., Egeland J.A., Paul S.M., Bućan M. Genomic View of Bipolar Disorder Revealed by Whole Genome Sequencing in a Genetic Isolate. PLoS Genet. 2014;10 doi: 10.1371/journal.pgen.1004229.
- 22.Mojarad B.A., Yin Y., Manshaei R., Backstrom I., Costain G., Heung T., Merico D., Marshall C.R., Bassett A.S., Yuen R.K.C. Genome sequencing broadens the range of contributing variants with clinical implications in schizophrenia. Transl. Psychiatry. 2021;11:84. doi: 10.1038/s41398-021-01211-2.
- 23.Biernacka J., Jenkins G., McDonnell S., Batzler A., Sicotte H., Fogarty Z., Welkie B., Baheti S., Coombes B., McElroy S., Frye M. A whole genome sequencing study identifies a rare variant in ANK3 that may contribute to bipolar disorder. Eur. Neuropsychopharmacol. 2019;29.
- 24.Halvorsen M., Huh R., Oskolkov N., Wen J., Netotea S., Giusti-Rodriguez P., Karlsson R., Bryois J., Nystedt B., Ameur A., et al. Increased burden of ultra-rare structural variants localizing to boundaries of topologically associated domains in schizophrenia. Nat. Commun. 2020;11:1842. doi: 10.1038/s41467-020-15707-w.
- 25.Popejoy A.B., Fullerton S.M. Genomics is failing on diversity. Nature. 2016;538:161–164. doi: 10.1038/538161a.
- 26.Sirugo G., Williams S.M., Tishkoff S.A. The Missing Diversity in Human Genetic Studies. Cell. 2019;177:26–31. doi: 10.1016/j.cell.2019.02.048.
- 27.Martin A.R., Kanai M., Kamatani Y., Okada Y., Neale B.M., Daly M.J. Clinical use of current polygenic risk scores may exacerbate health disparities. Nat. Genet. 2019;51:584–591. doi: 10.1038/s41588-019-0379-x.
- 28.Mahajan A., Wessel J., Willems S.M., Zhao W., Robertson N.R., Chu A.Y., Gan W., Kitajima H., Taliun D., Rayner N.W., et al. Refining the accuracy of validated target identification through coding variant fine-mapping in type 2 diabetes. Nat. Genet. 2018;50:559–571. doi: 10.1038/s41588-018-0084-1.
- 29.Spracklen C.N., Horikoshi M., Kim Y.J., Lin K., Bragg F., Moon S., Suzuki K., Tam C.H.T., Tabara Y., Kwak S.-H., et al. Identification of type 2 diabetes loci in 433,540 East Asian individuals. Nature. 2020;582:240–245. doi: 10.1038/s41586-020-2263-3.
- 30.Taliun D., Harris D.N., Kessler M.D., Carlson J., Szpiech Z.A., Torres R., Taliun S.A.G., Corvelo A., Gogarten S.M., Kang H.M., et al. Sequencing of 53,831 diverse genomes from the NHLBI TOPMed Program. Nature. 2021;590:290–299. doi: 10.1038/s41586-021-03205-y.
- 31.Sanders S.J., Neale B.M., Huang H., Werling D.M., An J.-Y., Dong S., Whole Genome Sequencing for Psychiatric Disorders WGSPD, Abecasis G., Arguello P.A., Blangero J., et al. Whole genome sequencing in psychiatric disorders: the WGSPD consortium. Nat. Neurosci. 2017;20:1661–1668. doi: 10.1038/s41593-017-0017-9.
- 32.Pato M.T., Sobell J.L., Medeiros H., Abbott C., Sklar B.M., Buckley P.F., Bromet E.J., Escamilla M.A., Fanous A.H., Lehrer D.S., et al. The genomic psychiatry cohort: Partners in discovery. Am. J. Med. Genet. B Neuropsychiatr. Genet. 2013;162B:306–312. doi: 10.1002/ajmg.b.32160.
- 33.Bigdeli T.B., Genovese G., Georgakopoulos P., Meyers J.L., Peterson R.E., Iyegbe C.O., Medeiros H., Valderrama J., Achtyes E.D., Kotov R., et al. Contributions of common genetic variants to risk of schizophrenia among individuals of African and Latino ancestry. Mol. Psychiatr. 2020;25:2455–2467. doi: 10.1038/s41380-019-0517-y.
- 34.Swerdlow N.R., Gur R.E., Braff D.L. Consortium on the Genetics of Schizophrenia (COGS) assessment of endophenotypes for schizophrenia: An introduction to this Special Issue of schizophrenia research. Schizophr. Res. 2015;163:9–16. doi: 10.1016/j.schres.2014.09.047.
- 35.Smith E.N., Bloss C.S., Badner J.A., Barrett T., Belmonte P.L., Berrettini W., Byerley W., Coryell W., Craig D., Edenberg H.J., et al. Genome-wide association study of bipolar disorder in European American and African American individuals. Mol. Psychiatr. 2009;14:755–763. doi: 10.1038/mp.2009.43.
- 36.Nierenberg A.A., Friedman E.S., Bowden C.L., Sylvia L.G., Thase M.E., Ketter T., Ostacher M.J., Leon A.C., Reilly-Harrington N., Iosifescu D.V., et al. Lithium Treatment Moderate-Dose Use Study (LiTMUS) for Bipolar Disorder: A Randomized Comparative Effectiveness Trial of Optimized Personalized Treatment With and Without Lithium. Am. J. Psychiatr. 2013 doi: 10.1176/appi.ajp.2012.12060751.
- 37.Sklar P., Smoller J.W., Fan J., Ferreira M.A.R., Perlis R.H., Chambert K., Nimgaonkar V.L., McQueen M.B., Faraone S.V., Kirby A., et al. Whole-genome association study of bipolar disorder. Mol. Psychiatr. 2008;13:558–569. doi: 10.1038/sj.mp.4002151.
- 38.Jun G., Wing M.K., Abecasis G.R., Kang H.M. An efficient and scalable analysis framework for variant extraction and refinement from population-scale DNA sequence data. Genome Res. 2015;25:918–925. doi: 10.1101/gr.176552.114.
- 39.Center for Statistical Genetics. GitHub; 2020. Statgen: Topmed Variant Calling. https://github.com/statgen/topmed_variant_calling
- 40.Zhang F., Flickinger M., Taliun S.A.G., InPSYght Psychiatric Genetics Consortium, Abecasis G.R., Scott L.J., McCarroll S.A., Pato C.N., Boehnke M., Kang H.M. Ancestry-agnostic estimation of DNA sample contamination from sequence reads. Genome Res. 2020;30:185–194. doi: 10.1101/gr.246934.118.
- 41.Price A.L., Weale M.E., Patterson N., Myers S.R., Need A.C., Shianna K.V., Ge D., Rotter J.I., Torres E., Taylor K.D., et al. Long-Range LD Can Confound Genome Scans in Admixed Populations. Am. J. Hum. Genet. 2008;83:132–139. doi: 10.1016/j.ajhg.2008.06.005.
- 42.Purcell S., Neale B., Todd-Brown K., Thomas L., Ferreira M.A.R., Bender D., Maller J., Sklar P., De Bakker P.I.W., Daly M.J., Sham P.C. PLINK: A Tool Set for Whole-Genome Association and Population-Based Linkage Analyses. Am. J. Hum. Genet. 2007;81:559–575. doi: 10.1086/519795.
- 43.Rosenberg N.A., Pritchard J.K., Weber J.L., Cann H.M., Kidd K.K., Zhivotovsky L.A., Feldman M.W. Genetic Structure of Human Populations. Science. 2002;298:2381–2385. doi: 10.1126/science.1078311.
- 44.Manichaikul A., Mychaleckyj J.C., Rich S.S., Daly K., Sale M., Chen W.-M. Robust relationship inference in genome-wide association studies. Bioinformatics. 2010;26:2867–2873. doi: 10.1093/bioinformatics/btq559.
- 45.Alexander D.H., Novembre J., Lange K. Fast model-based estimation of ancestry in unrelated individuals. Genome Res. 2009;19:1655–1664. doi: 10.1101/gr.094052.109.
- 46.1000 Genomes Project Consortium. Auton A., Brooks L.D., Durbin R.M., Garrison E.P., Kang H.M., Korbel J.O., Marchini J.L., McCarthy S., McVean G.A., Abecasis G.R. A global reference for human genetic variation. Nature. 2015;526:68–74. doi: 10.1038/nature15393.
- 47.Maples B.K., Gravel S., Kenny E.E., Bustamante C.D. RFMix: A Discriminative Modeling Approach for Rapid and Robust Local-Ancestry Inference. Am. J. Hum. Genet. 2013;93:278–288. doi: 10.1016/j.ajhg.2013.06.020.
- 48.Fan S., Spence J.P., Feng Y., Hansen M.E.B., Terhorst J., Beltrame M.H., Ranciaro A., Hirbo J., Beggs W., Thomas N., et al. Whole-genome sequencing reveals a complex African population demographic history and signatures of local adaptation. Cell. 2023;186:923–939.e14. doi: 10.1016/j.cell.2023.01.042.
- 49.Cann H.M., de Toma C., Cazes L., Legrand M.-F., Morel V., Piouffre L., Bodmer J., Bodmer W.F., Bonne-Tamir B., Cambon-Thomsen A., et al. A Human Genome Diversity Cell Line Panel. Science. 2002;296:261–262. doi: 10.1126/science.296.5566.261b.
- 50.Li J.Z., Absher D.M., Tang H., Southwick A.M., Casto A.M., Ramachandran S., Cann H.M., Barsh G.S., Feldman M., Cavalli-Sforza L.L., Myers R.M. Worldwide Human Relationships Inferred from Genome-Wide Patterns of Variation. Science. 2008;319:1100–1104. doi: 10.1126/science.1153717.
- 51.Wang C., Zhan X., Bragg-Gresham J., Kang H.M., Stambolian D., Chew E.Y., Branham K.E., Heckenlively J., FUSION Study, Fulton R., et al. Ancestry estimation and control of population stratification for sequence-based association studies. Nat. Genet. 2014;46:409–415. doi: 10.1038/ng.2924.
- 52.Purcell S., Cherny S.S., Sham P.C. Genetic Power Calculator: design of linkage and association genetic mapping studies of complex traits. Bioinformatics. 2003;19:149–150. doi: 10.1093/bioinformatics/19.1.149.
- 53.Zhou W., Nielsen J.B., Fritsche L.G., Dey R., Gabrielsen M.E., Wolford B.N., LeFaive J., VandeHaar P., Gagliano S.A., Gifford A., et al. Efficiently controlling for case-control imbalance and sample relatedness in large-scale genetic association studies. Nat. Genet. 2018;50:1335–1341. doi: 10.1038/s41588-018-0184-y.
- 54.Cross-Disorder Group of the Psychiatric Genomics Consortium. Identification of risk loci with shared effects on five major psychiatric disorders: a genome-wide analysis. Lancet. 2013;381:1371–1379. doi: 10.1016/S0140-6736(12)62129-1.
- 55.Zhan X., Hu Y., Li B., Abecasis G.R., Liu D.J. RVTESTS: an efficient and comprehensive tool for rare variant association analysis using sequence data. Bioinformatics. 2016;32:1423–1426. doi: 10.1093/bioinformatics/btw079.
- 56.Purcell S.M., Moran J.L., Fromer M., Ruderfer D., Solovieff N., Roussos P., O’Dushlaine C., Chambert K., Bergen S.E., Kähler A., et al. A polygenic burden of rare disruptive mutations in schizophrenia. Nature. 2014;506:185–190. doi: 10.1038/nature12975.
- 57.Genovese G., Fromer M., Stahl E.A., Ruderfer D.M., Chambert K., Landén M., Moran J.L., Purcell S.M., Sklar P., Sullivan P.F., et al. Increased burden of ultra-rare protein-altering variants among 4,877 individuals with schizophrenia. Nat. Neurosci. 2016;19:1433–1441. doi: 10.1038/nn.4402.
- 58.Fromer M., Pocklington A.J., Kavanagh D.H., Williams H.J., Dwyer S., Gormley P., Georgieva L., Rees E., Palta P., Ruderfer D.M., et al. De novo mutations in schizophrenia implicate synaptic networks. Nature. 2014;506:179–184. doi: 10.1038/nature12929.
- 59.Singh T., Walters J.T.R., Johnstone M., Curtis D., Suvisaari J., Torniainen M., Rees E., Iyegbe C., Blackwood D., McIntosh A.M., et al. The contribution of rare variants to risk of schizophrenia in individuals with and without intellectual disability. Nat. Genet. 2017;49:1167–1173. doi: 10.1038/ng.3903.
- 60.Vu H., Ernst J. Universal annotation of the human genome through integration of over a thousand epigenomic datasets. Genome Biol. 2022;23 doi: 10.1186/s13059-021-02572-z.
- 61.Arneson A., Ernst J. Systematic discovery of conservation states for single-nucleotide annotation of the human genome. Commun. Biol. 2019;2:248. doi: 10.1038/s42003-019-0488-1.
- 62.Arneson A., Felsheim B., Chien J., Ernst J. ConsHMM Atlas: conservation state annotations for major genomes and human genetic variation. NAR Genom. Bioinform. 2020;2 doi: 10.1093/nargab/lqaa104.
- 63.Amemiya H.M., Kundaje A., Boyle A.P. The ENCODE Blacklist: Identification of Problematic Regions of the Genome. Sci. Rep. 2019;9:9354. doi: 10.1038/s41598-019-45839-z.
- 64.He Z., Xu B., Buxbaum J., Ionita-Laza I. A genome-wide scan statistic framework for whole-genome sequence data analysis. Nat. Commun. 2019;10:3018. doi: 10.1038/s41467-019-11023-0.
- 65.Karolchik D., Baertsch R., Diekhans M., Furey T.S., Hinrichs A., Lu Y.T., Roskin K.M., Schwartz M., Sugnet C.W., Thomas D.J., et al. The UCSC Genome Browser Database. Nucleic Acids Res. 2003;31:51–54. doi: 10.1093/nar/gkg129.
- 66.Benson G. Tandem repeats finder: a program to analyze DNA sequences. Nucleic Acids Res. 1999;27:573–580. doi: 10.1093/nar/27.2.573.
- 67.Ernst J., Kellis M. Large-scale imputation of epigenomic datasets for systematic annotation of diverse human tissues. Nat. Biotechnol. 2015;33:364–376. doi: 10.1038/nbt.3157.
- 68.Lu J.Y., Shao W., Chang L., Yin Y., Li T., Zhang H., Hong Y., Percharde M., Guo L., Wu Z., et al. Genomic Repeats Categorize Genes with Distinct Functions for Orchestrated Regulation. Cell Rep. 2020;30:3296–3311.e5. doi: 10.1016/j.celrep.2020.02.048.
- 69.Tom J.A., Reeder J., Forrest W.F., Graham R.R., Hunkapiller J., Behrens T.W., Bhangale T.R. Identifying and mitigating batch effects in whole genome sequencing data. BMC Bioinf. 2017;18:351. doi: 10.1186/s12859-017-1756-z.
- 70.Werling D.M., Brand H., An J.-Y., Stone M.R., Zhu L., Glessner J.T., Collins R.L., Dong S., Layer R.M., Markenscoff-Papadimitriou E., et al. An analytical framework for whole-genome sequence association studies and its implications for autism spectrum disorder. Nat. Genet. 2018;50:727–736. doi: 10.1038/s41588-018-0107-y.
- 71.Robinson N., Bergen S.E. Environmental Risk Factors for Schizophrenia and Bipolar Disorder and Their Relationship to Genetic Risk: Current Knowledge and Future Directions. Front. Genet. 2021;12 doi: 10.3389/fgene.2021.686666.
- 72.Legge S.E., Pardiñas A.F., Woolway G., Rees E., Cardno A.G., Escott-Price V., Holmans P., Kirov G., Owen M.J., O’Donovan M.C., Walters J.T.R. Genetic and Phenotypic Features of Schizophrenia in the UK Biobank. JAMA Psychiatry. 2024;81:681–690. doi: 10.1001/jamapsychiatry.2024.0200.
- 73.Moskvina V., Holmans P., Schmidt K.M., Craddock N. Design of Case-controls Studies with Unscreened Controls. Ann. Hum. Genet. 2005;69:566–576. doi: 10.1111/j.1529-8817.2005.00175.x.
- 74.Ragsdale A.P., Weaver T.D., Atkinson E.G., Hoal E.G., Möller M., Henn B.M., Gravel S. A weakly structured stem for human origins in Africa. Nature. 2023;617:755–763. doi: 10.1038/s41586-023-06055-y.
Data Availability Statement
- WGS data generated for this study’s samples are available in AnVIL: WGSPD Project 1: Whole Genome Sequencing for Schizophrenia and Bipolar Disorder (WGSPD1). The AnVIL Terra Workspaces containing data (AnVIL: https://anvilproject.org/data/studies/phs002041/workspaces) are AnVIL_NIMH_Broad_WGSPD1_McCarroll_Pato_GRU_WGS, ANVIL_NIMH_Broad_WGSPD_1_McCarroll_Braff_DS_WGS, and ANVIL_NIMH_Broad_WGSPD1_McCarroll_COGS_DS_WGS. Phenotype data are available at dbGaP: phs002041.v2.p1. GPC data used in the preparation of this study (genotype data, variables used in analysis, and case-control status), as well as selected results, are available in the NIMH Data Archive (NDA). NDA is a collaborative informatics system created by the National Institutes of Health to provide a national resource to support and accelerate research in mental health. Dataset identifier: https://doi.org/10.15154/1z0s-af20. This manuscript reflects the views of the authors and may not reflect the opinions or views of the NIH or of the Submitters submitting original data to NDA.
- The Freeze 9 TOPMed WGS genotype data used in this study are in dbGaP. See TOPMed projects (above) or Table S2 for the accession numbers.
- The code for ChromHMM and ConsHMM state-based analysis is available at https://github.com/ernstlab/inpsyght_states.





