Abstract
Exome sequencing studies have generally been underpowered to identify deleterious alleles with a large effect on complex traits, as such alleles are mostly rare. Because the population of northern and eastern Finland has expanded dramatically and in isolation following a series of bottlenecks, it harbors numerous deleterious alleles at relatively high frequency. Capitalizing on this circumstance, we exome sequenced nearly 20,000 individuals from these regions. Exome-wide association studies for 64 quantitative traits clinically relevant to cardiovascular and metabolic disease identified 26 newly associated deleterious alleles. Nineteen of these alleles are either unique to or >20 times more frequent in Finns than in other Europeans and show geographical clustering comparable to Mendelian disease mutations characteristic of the Finnish population. We estimate that sequencing studies in populations without this unique history would require hundreds of thousands to millions of participants to achieve comparable association power.
Introduction
Most alleles with a demonstrated deleterious effect on phenotypes directly alter protein structure or function1,2. Exome sequencing studies aim to discover such alleles and demonstrate their association to common diseases and disease-related quantitative traits. However, exome sequencing studies to date generally have identified few newly associated rare variants or genes3,4. The sample size required for such discoveries remains uncertain and theoretical analyses indicate that studies to date have been underpowered, since most deleterious variants are expected to be rare due to purifying selection5. These previous analyses also suggest that power to detect associations to deleterious alleles is greatest in populations that have expanded in isolation after recent bottlenecks, as alleles passing through the bottlenecks may rise to much higher frequencies than in other populations6–8.
Finland exemplifies such a history. Bottlenecks occurred at the founding of early-settlement regions (southern and western Finland) 2,000-4,000 years ago and again with internal migration to late-settlement regions (northern and eastern Finland) in the 15th and 16th centuries9. Finland’s subsequent population growth (to ~5.5 million) generated sizable geographic sub-isolates in late-settlement regions.
This unique population history has resulted in “the Finnish Disease Heritage”10, 36 Mendelian diseases that are much more common in Finns than in other Europeans. These disorders concentrate in late-settlement regions of Finland10, and the genes responsible for them exhibit extreme enrichment of deleterious variants11–13. We created the FinMetSeq study to capitalize on the population history of late-settlement Finland to discover rare-variant associations with cardiovascular and metabolic disease-relevant quantitative traits through exome sequencing of two extensively phenotyped population cohorts, FINRISK and METSIM (Methods).
We successfully sequenced 19,292 FinMetSeq participants and tested the identified variants for association with 64 clinically relevant quantitative traits, discovering 43 novel associations with deleterious variants14,15: 19 associations (11 traits) in FinMetSeq alone and 24 associations (20 traits) in a combined analysis of FinMetSeq with 24,776 Finns from three cohorts with imputed genome-wide genotypes. Nineteen of the 26 variants underlying these 43 associations were unique to Finland or enriched >20-fold in FinMetSeq compared to non-Finnish Europeans (NFE). These enriched alleles cluster geographically like Finnish Disease Heritage mutations, indicating that the distribution of trait-associated rare alleles may vary significantly between locations within a country.
We demonstrate that exome sequencing in a historically isolated population that expanded after recent population bottlenecks is an extraordinarily efficient strategy to discover alleles with a substantial effect on quantitative traits. As most of the novel, putatively deleterious trait-associated variants that we identified are unique to or highly enriched in Finland, we estimate that similarly powered studies of these variants in non-Finnish populations might require hundreds of thousands or millions of participants.
Results
Genetic variation
In 19,292 successfully sequenced exomes, we identified 1,318,781 single nucleotide variants (SNVs) and 92,776 insertion/deletion (indel) variants (Supplementary Tables 1-3, Supplementary Information). Compared to NFE control exomes (gnomAD v2.1, Extended Data Fig. 1A), FinMetSeq exomes showed depletion of singletons and doubletons and excess variants with minor allele count (MAC)≥5, particularly for predicted-deleterious alleles (Extended Data Fig. 1B).
Association analyses
We tested for association between genetic variants in FinMetSeq and 64 clinically relevant quantitative traits after standard adjustments for medications and covariates and transformation to normality for analyses (Methods, Supplementary Tables 4 & 5). Sixty-two of 64 traits exhibited significant heritability with common SNVs (P<0.05; 5%<h2<53%; Extended Data Fig. 2A, Supplementary Table 6), with substantial phenotypic and genetic correlations between traits (Extended Data Fig. 2B).
Single-variant association tests with genetic variants with MAC≥3 among the 3,558 to 19,291 individuals measured for each trait (Supplementary Tables 4 & 5) identified 1,249 associations (P<5×10-7) at 531 variants (Supplementary Table 7); 53 traits associated with ≥1 variant (Fig. 1A). All 1,249 associations remained significant after multiple testing adjustment (exome-wide and across the 64 traits using a hierarchical procedure setting average FDR at 5%, Methods). Using this procedure on the 531 associated variants, we detected 287 more associations (Supplementary Table 8), most reflecting high correlation between lipid traits. Of the 531 variants, those at >10x frequency in FinMetSeq compared to NFE were more likely to be trait-associated (OR=4.92, P=2.6×10-5; Extended Data Fig. 1C).
Figure 1. Characterization of associations.
A) Number of genomic loci associated with each trait. Bars are subdivided into common (MAF>1%, dark blue) and rare (MAF≤1%, light blue).
B) Relationship between estimated heritability and number of loci detected per trait. Each trait is colored by trait group. Vertical bars indicate ±2 standard errors. The gray line shows the linear regression fit to indicate the general trend. The number of independent individuals used in each point is listed in Supplementary Table 5. Height is the notable outlier.
After clumping associated variants within 1Mbp and with r2>0.5 into single loci (Methods), the 531 associated variants represented 262 distinct loci (597 trait-locus pairs, Supplementary Table 7). The number of associated loci per trait correlated positively with trait heritability (r=0.38, P=8.8×10-4), with height a notable outlier (Fig. 1B).
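The clumping step above can be illustrated with a greedy procedure. This is a minimal sketch, assuming variants are visited in order of increasing P-value and that pairwise r2 comes from a user-supplied lookup; the `Variant` type, the `clump` function, and the toy data are hypothetical, and the study's actual procedure is the one described in Methods.

```python
from typing import Callable, List, NamedTuple

class Variant(NamedTuple):
    vid: str     # variant identifier (hypothetical naming)
    chrom: str
    pos: int     # base-pair position
    p: float     # association P-value

def clump(variants: List[Variant], r2: Callable[[str, str], float],
          max_dist: int = 1_000_000, min_r2: float = 0.5) -> List[List[Variant]]:
    """Greedy clumping (assumed scheme): the remaining variant with the smallest
    P-value becomes a lead; variants on the same chromosome within max_dist of
    the lead and with r2 > min_r2 to the lead join its locus."""
    remaining = sorted(variants, key=lambda v: v.p)
    loci = []
    while remaining:
        lead = remaining.pop(0)
        members, keep = [lead], []
        for v in remaining:
            near = v.chrom == lead.chrom and abs(v.pos - lead.pos) <= max_dist
            if near and r2(lead.vid, v.vid) > min_r2:
                members.append(v)
            else:
                keep.append(v)
        remaining = keep
        loci.append(members)
    return loci

# Toy data, not taken from the study.
toy = [
    Variant("a", "1", 100, 1e-10),
    Variant("b", "1", 500_000, 1e-8),
    Variant("c", "1", 600_000, 1e-9),
    Variant("d", "2", 100, 1e-7),
]
toy_r2 = {frozenset(("a", "b")): 0.9, frozenset(("a", "c")): 0.1}
loci = clump(toy, lambda x, y: toy_r2.get(frozenset((x, y)), 0.0))
locus_ids = [[v.vid for v in locus] for locus in loci]
```

In this toy example, variants `a` and `b` merge into one locus because they are both close and correlated, while `c` stays separate despite being nearby.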
Most variants and loci (61%) associated to a single trait; 4% associated to ≥10 traits. Overlapping associations (Extended Data Fig. 3A) reflect both phenotypic and genetic correlations and the estimated genetic correlation of trait pairs predicts shared loci between traits (Extended Data Fig. 3B). Gene-based association tests revealed 54 associations with P<3.88×10-6 and multi-trait FDR<0.05 (Methods, Supplementary Table 9), including ten traits associated with APOB (Extended Data Fig. 4) and a novel association of SECTM1 with HDL2-C (Extended Data Fig. 5).
To determine which of the 1,249 single-variant associations are distinct from previous GWAS findings, we repeated association analysis for each trait conditioning on published associated variants in the EBI GWAS Catalog (December 2016, Methods); 478 associations at 126 loci remained significant (P<5×10-7), including at least one association for 48 traits (Supplementary Table 10). Conditionally-associated variants were more often rare (24% vs. 11%), more likely protein-altering (31% vs. 22%), and more frequently >10x enriched in FinMetSeq relative to NFE (19% vs. 10%) than associated variants overall.
Replication and follow-up
We attempted to replicate the 478 single-variant associations (unconditional and conditional P≤5×10-7) and follow up 2,120 sub-threshold associations from FinMetSeq (unconditional 5×10-7<P≤5×10-5 and conditional P≤5×10-5) in 24,776 participants from three Finnish cohort studies: FINRISK16,17 participants not in FinMetSeq (n=18,215), Northern Finland Birth Cohort 1966 (ref. 18) (n=5,139), and Helsinki Birth Cohort19 (n=1,412), all imputed using the Finnish SISu v2 reference panel (www.sisuproject.fi). Following association analysis within each cohort, we conducted meta-analysis of the three imputation-based studies to test for replication of FinMetSeq variants (“replication analysis”), and four-study meta-analysis with FinMetSeq to follow up suggestive associations (“combined analysis”).
Of 448 significant variant-trait associations with replication data, 392 (87.5%) replicated at P<0.05 (Supplementary Table 11). Of the 1,417 sub-threshold associations, 431 reached P<5×10-7 in the combined analysis (Supplementary Table 12); >60% of variants we could not follow up were absent in the reference panel.
Among the significant associations from FinMetSeq or combined analysis, 43 were with 26 predicted deleterious variants (six PTVs, 20 missense) that conditional analysis and literature review suggest are novel (Table 1). Nineteen associations (15 variants) were significant in FinMetSeq (Table 1; Supplementary Table 11); another 24 associations (16 variants) reached significance in combined analysis (Table 1; Supplementary Table 12). Of these 43 associations, 34 were with 19 variants either seen only in Finland or enriched >20-fold in FinMetSeq compared to NFE. Identifying associations for these 19 variants would have required much larger samples in NFE populations than in FinMetSeq (Fig. 2A, B). We provide brief summaries relating some of these associations to known biology and prior genetic evidence (Table 1, expanded version in Supplementary Table 13, Supplementary Information), highlighting here the most striking findings.
Table 1. Novel associations with predicted deleterious variants from FinMetSeq alone or combined analysis.
| Chr:Pos (GRCh37) | Gene | FMS MAF | NFE MAF# | MAF Ratio (95% CI) | Trait | FMS P | FMS Beta | Repl. or comb. P** | Repl. or comb. Beta |
|---|---|---|---|---|---|---|---|---|---|
| 1:55076137 | FAM151A | 0.099 | 0.0147 | 6.7 (6.1-7.5) | IDL-C | 5.4×10-16 | -0.187 | 2.1×10-17 | -0.191 |
| | | | | | IDL-P | 8.9×10-14 | -0.172 | 1.9×10-16 | -0.185 |
| 2:120848049 | EPB41L5 | 0.085 | 0.044 | 1.9 (1.8-2.1) | eGFR* | 1.7×10-6 | -0.093 | 4.8×10-12 | -0.107 |
| | | | | | Creatinine* | 2.5×10-6 | 0.091 | 2.5×10-12 | 0.098 |
| 3:125831672 | ALDH1L1 | 0.0026 | 0 | ∞ | Gly | 1.8×10-8 | -0.873 | 4.5×10-4 | -0.827 |
| 4:13612630 | BOD1L1 | 0.0001 | 0 | ∞ | WHR | 4.7×10-7 | -2.501 | NA | NA |
| 5:79336091 | THBS4 | 0.0045 | 0.0001 | 45 (14.4-140.9) | Weight* | 6.7×10-7 | -0.377 | 3.2×10-7 | -0.252 |
| 5:140181423 | PCDHA3 | 0.0001 | NA | NA | WHR | 2.7×10-7 | 2.559 | NA | NA |
| 9:107548661 | ABCA1 | 0.00023 | 0 | ∞ | HDL-C | 4.8×10-10 | -2.046 | NA | NA |
| 9:136501728 | DBH | 0.05 | 0.0021 | 23.8 (18.4-30.4) | Diast-BP* | 1.5×10-6 | -0.115 | 2.8×10-12 | -0.11 |
| 11:47282929 | NR1H3 | 0.0042 | 0.00003 | 140 (19.5-1004.4) | HDL-C | 1.4×10-7 | 0.425 | 6.7×10-7 | 0.435 |
| | | | | | HDL2-C* | 3.2×10-6 | 0.473 | 1.3×10-8 | 0.458 |
| | | | | | VLDL-C* | 4.0×10-6 | -0.469 | 3.1×10-7 | -0.412 |
| 11:116692293 | APOA4 | 0.0096 | 0.012 | 0.8 (0.7-0.9) | HDL-C* | 2.2×10-5 | 0.225 | 1.5×10-7 | 0.196 |
| 11:117352857 | DSCAML1 | 0.016 | 0.0002 | 80 (35.7-179.3) | VLDL-C | 4.1×10-8 | 0.299 | 2.0×10-3 | 0.162 |
| 14:101198426 | DLK1 | 0.023 | 0.00013 | 177 (66.3-472.4) | Height* | 2.7×10-5 | -0.149 | 1.2×10-10 | -0.163 |
| 16:55862682 | CES1 | 0.0018 | 0.00003 | 60 (8.3-432.0) | HDL-C | 1.1×10-10 | 0.771 | 3.8×10-6 | 0.793 |
| | | | | | ApoA1* | 1.9×10-6 | 0.668 | 4.0×10-9 | 0.718 |
| 16:56996009 | CETP | 0.0017 | 0.00003 | 56.7 (7.9-408.3) | ApoA1 | 2.6×10-8 | 0.834 | 1.8×10-4 | 1.034 |
| | | | | | HDL-C | 1.1×10-14 | 0.946 | 8.8×10-21 | 1.217 |
| 16:68013570 | DPEP3 | 0.0099 | 0.00044 | 22.5 (12.9-39.1) | HDL-C | 1.6×10-7 | -0.295 | 7.2×10-15 | -0.373 |
| | | | | | ApoA1* | 5.2×10-6 | -0.294 | 4.0×10-7 | -0.253 |
| 16:68732169 | CDH3 | 0.0044 | 0.00064 | 6.9 (4.2-11.2) | Pyr* | 3.7×10-5 | 0.417 | 6.6×10-10 | 0.471 |
| 17:6599157 | SLC13A5 | 0.00091 | 0 | ∞ | Cit | 1.3×10-9 | 1.294 | 9.5×10-12 | 1.309 |
| 17:7129898 | DVL2 | 0.02 | 0.02 | 1 (0.9-1.1) | Val* | 4.2×10-5 | -0.239 | 5.7×10-9 | -0.232 |
| 17:39135270 | KRT40 | 0.00013 | 0 | ∞ | HDL-C | 3.2×10-8 | 2.416 | NA | NA |
| 17:41062979 | G6PC | 0.025 | 0 | ∞ | MUFA | 4.4×10-7 | 0.275 | 3.5×10-1 | 0.067 |
| | | | | | Glol* | 5.8×10-6 | 0.218 | 4.1×10-7 | 0.183 |
| | | | | | CRP* | 1.6×10-5 | 0.175 | 4.0×10-9 | 0.185 |
| | | | | | TotTG* | 1.0×10-6 | 0.23 | 1.3×10-7 | 0.197 |
| 17:41926216 | CD300LG | 0.00034 | 0 | ∞ | HDL-C | 4.8×10-14 | 2.061 | 4.9×10-2 | 0.801 |
| | | | | | HDL2-C | 1.3×10-7 | 2.154 | NA | NA |
| | | | | | ApoA1 | 8.1×10-8 | 1.694 | NA | NA |
| 18:47091686 | LIPG | 0.0025 | 0 | ∞ | HDL2-C* | 1.2×10-5 | 0.579 | 5.6×10-10 | 0.624 |
| | | | | | PC* | 3.1×10-6 | 0.624 | 1.1×10-8 | 0.578 |
| | | | | | TotPG* | 9.0×10-6 | 0.594 | 1.1×10-7 | 0.538 |
| 19:10683762 | AP1M2 | 0.015 | 0.00009 | 167 (41.6-668.5) | ApoB | 5.8×10-8 | -0.282 | 1.5×10-3 | -0.199 |
| | | | | | IDL-C* | 1.1×10-6 | -0.289 | 6.9×10-14 | -0.319 |
| | | | | | IDL-P* | 2.1×10-6 | -0.281 | 8.5×10-14 | -0.318 |
| | | | | | Remnant-C* | 8.0×10-6 | -0.268 | 2.7×10-12 | -0.301 |
| 19:11350904 | ANGPTL8 | 0.0025 | 0 | ∞ | HDL2-C* | 3.4×10-6 | 0.564 | 1.1×10-8 | 0.574 |
| 19:49318380 | HSD17B14 | 0.046 | 0.05 | 0.9 (0.8-1.0) | Val* | 3.4×10-5 | -0.152 | 2.1×10-7 | -0.144 |
| 20:24994201 | ACSS1 | 0.0026 | 0 | ∞ | Ace* | 1.3×10-5 | 0.626 | 2.1×10-12 | 0.631 |
#Non-Finnish European (NFE) MAF taken from gnomAD v2.1 control exomes excluding Estonian or Swedish individuals.
0; variant present in gnomAD, but not in NFE controls. NA; variant not present in gnomAD.
*Association only reaches significance in combined analysis.
**Replication P-values<0.05 are highlighted in bold.
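The MAF Ratio column in Table 1 follows from the two frequency columns and the footnote conventions for 0 and NA. The sketch below reproduces only the point estimates; the function name is hypothetical, and the confidence intervals are not recomputed because their method is not stated in this section.

```python
import math
from typing import Optional

def maf_ratio(fms_maf: float, nfe_maf: Optional[float]) -> Optional[float]:
    """FinMetSeq-to-NFE frequency ratio, following the Table 1 conventions."""
    if nfe_maf is None:
        return None        # NA: variant not present in gnomAD
    if nfe_maf == 0:
        return math.inf    # present in gnomAD but not seen in NFE controls
    return fms_maf / nfe_maf

# MAF values copied from Table 1.
ratio_fam151a = maf_ratio(0.099, 0.0147)   # FAM151A
ratio_thbs4 = maf_ratio(0.0045, 0.0001)    # THBS4
ratio_aldh1l1 = maf_ratio(0.0026, 0)       # ALDH1L1
ratio_pcdha3 = maf_ratio(0.0001, None)     # PCDHA3
```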
Figure 2. Allelic enrichment in the Finnish population and its effect on genetic discovery.
A) Relationship between MAF and estimated effect size for associations discovered in FinMetSeq. Each variant reaching significance in FinMetSeq is plotted, with associations in Table 1 represented by dark blue points (FinMetSeq MAF) and green points (NFE MAF). Purple lines indicate 80% power curves for sample sizes of 10,000 and 20,000 at α=5×10-7.
B) Same plot as in A, highlighting the variants in Table 1 only reaching significance in the combined analysis.
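Power curves of the kind shown in Figure 2 can be approximated with a normal approximation to a one-degree-of-freedom additive test. The sketch below is a simplification and not the authors' calculation: it assumes effect sizes are in standard deviation units of the transformed trait, Hardy-Weinberg proportions, unrelated individuals, and a small variance explained per variant. The function names are hypothetical; the MAF and beta values are those of the THBS4 variant in Table 1.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()

def power(n: float, maf: float, beta: float, alpha: float = 5e-7) -> float:
    """Approximate two-sided power for an additive test on a standardized trait."""
    # Square root of the non-centrality parameter: |beta| * sqrt(2 * p * (1 - p) * n).
    root_ncp = abs(beta) * (2 * maf * (1 - maf) * n) ** 0.5
    z_crit = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    return _STD_NORMAL.cdf(root_ncp - z_crit) + _STD_NORMAL.cdf(-root_ncp - z_crit)

def required_n(maf: float, beta: float, alpha: float = 5e-7,
               target: float = 0.80) -> float:
    """Sample size giving roughly the target power (far tail ignored)."""
    z_crit = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    z_pow = _STD_NORMAL.inv_cdf(target)
    return (z_crit + z_pow) ** 2 / (2 * maf * (1 - maf) * beta ** 2)

# Same effect size, evaluated at the FinMetSeq and NFE allele frequencies.
n_fms = required_n(maf=0.0045, beta=-0.377)
n_nfe = required_n(maf=0.0001, beta=-0.377)
```

Under these assumptions the required sample size scales inversely with MAF×(1−MAF), so a variant that is 45 times rarer needs roughly 45 times as many participants for the same power.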
Anthropometric traits
A predicted damaging missense variant (p.Arg94Cys) in THBS4, 45X more frequent in FinMetSeq than in NFE, was associated in the combined analysis with a mean 5.9 kg decrease in body weight. THBS4 encodes thrombospondin 4, a matricellular protein found in blood vessel walls and highly expressed in heart and adipose20. THBS4 may regulate vascular inflammation21 and has been implicated in heart disease risk22.
A predicted damaging missense variant (p.Val104Met) in DLK1, 177X more frequent in FinMetSeq than in NFE, is associated in the combined analysis with a mean 1.3 cm decrease in height. DLK1 encodes Delta-Like Notch Ligand 1, an epidermal growth factor that interacts with fibronectin and inhibits adipocyte differentiation. Uniparental disomy of DLK1 causes Temple and Kagami-Ogata Syndromes, characterized by growth restriction, hypotonia, joint laxity, motor delay, and early onset of puberty23. Paternally-inherited common variants near DLK1 are associated with childhood obesity, type 1 diabetes, age at menarche, and precocious puberty24–26. Homozygous null mutations in the mouse ortholog Dlk-1 lead to embryos with reduced size, skeletal length, and lean mass27; in Darwin’s finches, SNVs at this locus have a strong effect on beak size28.
HDL-C
A predicted deleterious missense variant p.Arg112Trp in CD300LG is associated in FinMetSeq with a mean 0.95 mmol/l increase in HDL-C and is associated with increased HDL2-C and ApoA1. This variant, absent in NFE, has an opposite direction of effect from a previously reported deleterious missense variant in this gene29, which encodes a type I cell surface glycoprotein.
Amino acids
A stop gain variant (p.Arg722X) in ALDH1L1 is associated in FinMetSeq with reduced serum glycine levels and is absent in NFE; this trait may increase risk for cardiometabolic disorders30,31. ALDH1L1 encodes 10-formyltetrahydrofolate dehydrogenase, which competes with serine hydroxymethyltransferase to alter the ratio of serine to glycine in the cytosol. Gene-based tests suggest additional PTVs and missense variants in ALDH1L1 alter glycine levels (P=1.4×10-20, Extended Data Fig. 6, Supplementary Table 9).
Ketone bodies
A predicted damaging missense variant (p.Phe517Ser) in ACSS1 is associated in the combined analysis with increased serum acetate levels and is absent in NFE. ACSS1 encodes an acyl-coenzyme A synthetase and plays a role in conversion of acetate to acetyl-CoA. In rodents, increased acetate levels lead to obesity, insulin resistance, and metabolic syndrome32.
Trait-associations and disease endpoints
Genotype data from FinnGen33 enabled us to test whether deleterious variants responsible for our novel trait associations contribute to related disease endpoints. We examined 22 diseases for the 25 available variants in Table 1; three variants were associated with diseases in FinnGen at Bonferroni threshold P<0.05/(22×25)=9.0×10-5 (Supplementary Table 14).
A predicted damaging missense variant (p.Ser32Pro) in KRT40, associated in FinMetSeq with elevated HDL-C, but absent in NFE, is associated in FinnGen with increased pancreatitis risk. While this is the first disease association reported for KRT40, type I keratins regulate exocrine pancreas homoeostasis34. A 29bp deletion causing a frameshift in FAM151A is associated in FinMetSeq with decreased total cholesterol in IDL and decreased IDL particle concentration, is 6.7X more frequent in FinMetSeq than NFE, and is associated in FinnGen with decreased risk of myocardial infarction. Interpretation of this association is complicated as the variant is also situated in an overlapping gene (ACOT11) involved in fatty acid metabolism and lies <1Mbp from a cardioprotective variant in PCSK9. Finally, a predicted damaging missense variant (p.Arg65Trp) in DBH associated with a mean 1.0 mmHg decrease in diastolic blood pressure in the combined analysis, is 23.8X more frequent in FinMetSeq than in NFE, and is associated in FinnGen with decreased risk for hypertension. Distinct loci in this gene and gene-based tests are associated with mean arterial pressure35,36.
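The Bonferroni threshold used for the FinnGen lookups is the family-wise error rate divided by the number of disease-variant tests. A small sketch of that arithmetic follows; the function name is hypothetical, and the result is approximately 9×10-5.

```python
def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test significance threshold controlling family-wise error at alpha."""
    return alpha / n_tests

# 22 diseases tested against the 25 available Table 1 variants.
finngen_threshold = bonferroni_threshold(0.05, 22 * 25)
```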
Replication outside Finland
To assess the generalizability of these novel associations, we attempted to replicate associations from our combined analysis in the UK Biobank (UKB). Across eight anthropometric and blood pressure traits for which UKB data are publicly available, our combined analysis identified 31 trait-variant associations, of which 23 were present in UKB. Twenty of 23 associations were to variants with MAF>1% in FinMetSeq and comparable frequency in UKB; 15 (75%) showed association in UKB at P<0.05/23=2.2×10-3. The three rare variants in this analysis were all >10x more frequent in FinMetSeq than UKB; none were associated in UKB (Supplementary Table 15). However, even after adjusting for winner’s curse37, we had <50% power to detect these associations in UKB, consistent with the argument that extremely large samples will be needed in other populations to achieve the power for rare-variant association studies that we observed in Finland.
Enriched variants cluster geographically
Given the concentration of Finnish Disease Heritage mutations within regions of late-settlement Finland38, we hypothesized that trait-associated variants discovered through FinMetSeq might also cluster geographically. Principal component analysis supported this hypothesis, revealing broad-scale population structure within late-settlement regions among 14,874 unrelated FinMetSeq participants with known parental birthplaces (Extended Data Fig. 7). Carriers of PTVs and missense alleles showed more clustering of parental birthplaces than carriers of synonymous alleles, even after adjusting for MAC (Supplementary Tables 16A, B).
To analyze the distribution of variants within late-settlement Finland, we delineated geographically distinct population clusters using haplotype sharing among 2,644 unrelated individuals with both parents born in the same municipality (Methods, Extended Data Fig. 8). We compared variant counts across functional classes and frequencies between an early-settlement reference cluster and 12 clusters containing ≥100 individuals (Extended Data Fig. 9, Supplementary Tables 17, 18). Clusters representing the most heavily bottlenecked late-settlement regions (Lapland and Northern Ostrobothnia) displayed a deficit of singletons and enrichment of intermediate frequency variants compared to other clusters.
Variants >10x enriched in FinMetSeq compared to NFE displayed particularly strong geographical clustering (Supplementary Table 19). We further characterized clustering for FinMetSeq-enriched trait-associated variants, by comparing mean distances between birthplaces of parents of minor allele carriers to those of non-carriers (Supplementary Table 20). Most such variants were highly localized. For example, for rs780671030 in ALDH1L1, the mean distance between parental birthplaces is 135km for carriers and 250km for non-carriers (P<1.0×10-7, Fig. 3A).
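The distance comparison above can be illustrated with great-circle distances between birthplaces. The sketch below takes one plausible reading of the summary statistic, the mean pairwise distance among a set of parental birthplaces; the function names are hypothetical, the coordinates are approximate city locations chosen for illustration rather than study data, and the significance test is not reproduced.

```python
import math
from itertools import combinations
from typing import List, Tuple

EARTH_RADIUS_KM = 6371.0
Coord = Tuple[float, float]  # (latitude, longitude) in degrees

def haversine_km(a: Coord, b: Coord) -> float:
    """Great-circle distance between two points on a spherical Earth."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    dlat, dlon = lat2 - lat1, lon2 - lon1
    h = (math.sin(dlat / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))

def mean_pairwise_km(places: List[Coord]) -> float:
    """Mean distance over all pairs of birthplaces (assumed summary statistic)."""
    pairs = list(combinations(places, 2))
    return sum(haversine_km(a, b) for a, b in pairs) / len(pairs)

# Approximate coordinates, for illustration only.
kuopio, joensuu, iisalmi = (62.89, 27.68), (62.60, 29.76), (63.56, 27.19)
helsinki, oulu = (60.17, 24.94), (65.01, 25.47)

clustered = mean_pairwise_km([kuopio, joensuu, iisalmi])   # carrier-like pattern
dispersed = mean_pairwise_km([helsinki, oulu, joensuu])    # non-carrier-like pattern
```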
Figure 3. Geographical clustering of associated variants.
A) Example of geographical clustering for a novel trait-associated variant (Table 1). The map shows birth locations of all 113 parents of carriers (orange) and 113 randomly selected parents of non-carriers (blue) of the minor allele for rs780671030 in ALDH1L1.
B) Finnish Disease Heritage (FDH) mutations (N=38) geographically cluster (by parental birthplace) similarly to trait-associated variants (Table 1) that are >10x more frequent in FinMetSeq (FMS) than in NFE (N=12) and more than enriched variants from our combined analysis (N=7). For all variants, carriers clustered more than non-carriers (center line, median; box limits, upper and lower quartiles; whiskers, 1.5× interquartile range; points, outliers).
Finally, we identified comparable geographic clustering between carriers of 35 Finnish Disease Heritage mutations and carriers of FinMetSeq-enriched trait-associated variants (Fig. 3B, Methods). Clustering was dramatically greater than that observed for non-carriers of both sets of variants, suggesting that rare trait-associated variants may be much more unevenly distributed geographically than previously appreciated.
Discussion
We demonstrate that a well-powered exome sequencing study of deeply phenotyped individuals can identify numerous rare variants associated with medically relevant quantitative traits. The variants we identified provide a useful starting point for studies aimed at uncovering biological mechanisms and fostering clinical translation. The power of this study to discover rare-variant associations derives from the numerous deleterious variants that are enriched in or unique to Finland. Prioritizing the sequencing of multiple population isolates that have expanded from recent bottlenecks is a strategy for scaling up the discovery of rare-variant associations7,39–41. Because genetic drift causes a different set of alleles to pass through each population-specific bottleneck, enriching some variants and depleting others, the numerous rare-variant associations that could be identified by sequencing well-phenotyped samples across multiple isolates could rapidly increase our understanding of the genetic architecture of complex traits.
Our results support recent suggestions of continuity between the genetic architectures of complex traits and disorders classically considered monogenic42,43, by identifying numerous deleterious variants with large effects on quantitative traits that demonstrate geographical clustering comparable to that of the mutations responsible for the Finnish Disease Heritage.
Using a Finland-specific reference panel44 to impute FinMetSeq variants into array-genotyped samples from three other Finnish cohorts enabled us to identify additional novel associations. However, the clustering in FinMetSeq of deleterious trait-associated variants within limited geographical regions, and our inability to follow up >700 sub-threshold associations from FinMetSeq for which the associated variants were absent in the Finnish imputation reference panel, emphasize the importance of representing regional subpopulations in such reference panels to account for fine-scale population structure.
The value of rare-variant studies in population isolates will depend on the richness of phenotypes in sequenced cohorts from these populations. For example, we associated <100 of the >24,000 deleterious, highly enriched variants identified in FinMetSeq with one of the 64 quantitative traits studied here. The associations we identified to disease endpoints in FinnGen hint at the discoveries that will be possible when that database reaches its full size of 500,000 participants. The insights gained from such efforts will accelerate the implementation of precision health, informing projects in more heterogeneous populations which are still at an early stage45.
Methods
METSIM and FINRISK studies: designs, phenotypes, and sequenced participants
METSIM is a single-site study investigating cardiometabolic disorders and related traits in 10,197 men randomly selected from the population register of Kuopio, Eastern Finland, aged 45 to 73 years at initial examination from 2005 to 2010 (refs. 15,46). We attempted exome sequencing of all METSIM study participants.
FINRISK is a series of health examination surveys based on random population samples from five (six in 2002) geographical regions of Finland, carried out every five years beginning in 1972 (ref. 47). For exome sequencing, we chose 10,192 participants in the 1992-2007 FINRISK surveys from northeastern Finland (former provinces of North Karelia, Oulu, and Lapland).
All participants in both studies provided informed consent, and study protocols were approved by the Ethics Committees at participating institutions (National Public Health Institute of Finland; Hospital District of Helsinki and Uusimaa; Hospital District of Northern Savo). All relevant ethics committees approved this study.
Selection of traits, harmonization, exclusions, covariate adjustment, and transformation
Of the 257 quantitative traits measured in both METSIM and FINRISK, we selected 64 for association analysis in FinMetSeq based on clinical relevance for cardiovascular and metabolic health (Supplementary Tables 4, 5). We excluded individuals with type 1 diabetes and women who were pregnant at the time of phenotyping from all analyses; individuals with T2D from analyses of glycemic traits; and individuals not fasting for at least 8 hours after their last meal for traits influenced by food consumption. A complete list of exclusions is in Supplementary Table 5. We adjusted measured values of systolic and diastolic blood pressures for individuals on antihypertensive medication at the time of testing48,49, and serum lipid measures for individuals on lipid regulating medications50,51. Trait adjustments are listed in Supplementary Table 5.
We prepared quantitative traits for association analysis separately for METSIM and FINRISK by linear regression on trait-specific covariates after log transforming skewed variables. Covariates for regression analyses included: age and age² (METSIM); sex, age, age², and cohort year (FINRISK). Trait transformations and trait-specific covariates are listed in Supplementary Table 5. Several traits were adjusted for sex hormone treatment, which included women on contraceptives or hormone replacement therapy. We transformed residuals from these initial regression analyses to normality using inverse normal scores.
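The final transformation step can be sketched as a rank-based inverse normal transform of the regression residuals. This is an illustration only: the Blom offset (c=3/8) and the averaging of tied ranks are assumptions, since the text specifies only "inverse normal scores", and the preceding covariate regression is omitted. The function name and example residuals are hypothetical.

```python
from statistics import NormalDist
from typing import List, Sequence

def inverse_normal_scores(values: Sequence[float], c: float = 3.0 / 8.0) -> List[float]:
    """Map values to normal quantiles via their ranks (ties share the average rank)."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1          # average 1-based rank of the tied run
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    std_normal = NormalDist()
    # Offset c (assumed: Blom) keeps the quantile argument strictly inside (0, 1).
    return [std_normal.inv_cdf((r - c) / (n - 2 * c + 1)) for r in ranks]

residuals = [0.8, -1.2, 3.5, 0.1, -0.4]     # toy residuals, not study data
scores = inverse_normal_scores(residuals)
```

The transform preserves the ordering of the residuals while forcing their marginal distribution toward a standard normal, which limits the influence of extreme trait values on the association tests.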
Exome sequencing
We carried out exome sequencing in two phases.
Phase 1. We quantified 10,379 DNA samples with PicoGreen (ThermoFisher Scientific) and randomly parsed samples with adequate DNA (>250ng) into cohort-specific files. We then re-arrayed samples to ensure equal numbers of METSIM and FINRISK samples on each 96-well plate, alternating samples between studies in consecutive positions within and across plates, to minimize between-study batch effects.
Using 100-250ng input DNA, we constructed dual indexed libraries using the HTP Library Kit (KAPA Biosystems, target insert size of 250bp), pooling twelve libraries prior to hybridization to the SeqCap EZ HGSC VCRome (Roche) exome reagent. After estimating the concentration of each captured library pool by qPCR (KAPA Biosystems) to produce appropriate cluster counts for the HiSeq2000 platform (Illumina), we generated 2×100bp paired-end sequence data yielding ~6 Gb per sample to achieve a coverage depth of ≥20x for ≥70% of targeted bases for every sample.
Phase 2. We quantified, prepared, pooled, and captured 9,937 samples just as in Phase 1. Here we generated 2×125bp paired-end sequencing reads on the HiSeq2500 1T to achieve the same coverage as in Phase 1.
Contamination detection, sequence alignment, sample QC, and variant calling
We aligned sequence reads to human genome reference build 37 (bwa-mem, v0.7.7), realigned indels (GATK52 IndelRealigner v2.4), and marked duplicates (picard MarkDuplicates, v1.113; http://broadinstitute.github.io/picard) and overlapping bases (BamUtil clipOverlap v1.0.11; http://genome.sph.umich.edu/wiki/BamUtil:_clipOverlap).
For each sample, we required SNV genotype array concordance >90% if SNV array data were available, excluding samples with estimated contamination >3% or sample swaps compared to existing genotype data (verifyBamID53, v1.1.1; Supplementary Table 1).
We called SNVs and short indels with GATK52 (v3.3, using recommended best practices) for all targeted exome bases and 500bp of sequence upstream and downstream of each target region using HaplotypeCaller. We merged calls in batches of 200 individuals using CombineGVCFs and recalled genotypes for all individuals at all variable sites with GenotypeGVCFs.
After merging genotypes for the 19,378 samples that passed preliminary QC checks, we filtered SNVs and indels separately using the recommended best practices for Variant Quality Score Recalibration (VQSR). We used the true positive variants in the GATK resource bundle (v2.5; build37) to train the VQSR model after restricting to sites in targeted exome regions. After assessment with VQSR, we retained variants passing the score threshold at which ≥99% of the true positive sites used in the training model were recovered, for both SNVs and indels.
Following initial variant filtering, we decomposed multi-allelic variants into bi-allelic variants, left-aligned indels, and dropped redundant variants using vt54 (version 0.5). We filtered variants with >2% missing calls and/or Hardy-Weinberg p-value<10-6. We additionally removed variants with an overall allele balance (alternate AC/sum of total AC) <30% in genotyped samples. We excluded 86 individuals with >2% missing variant calls, yielding a final analysis set of 19,292 individuals.
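The variant-level thresholds above can be expressed as a simple predicate. This is a minimal sketch with hypothetical names (`VariantQC`, `passes_variant_qc`); it assumes the missingness, Hardy-Weinberg P-value, and allele balance have already been computed upstream, and it does not reproduce the vt normalization or the sample-level exclusion.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class VariantQC:
    vid: str                # variant identifier (hypothetical)
    missing_rate: float     # fraction of samples without a genotype call
    hwe_p: float            # Hardy-Weinberg P-value, computed upstream
    allele_balance: float   # alternate AC / sum of total AC in genotyped samples

def passes_variant_qc(v: VariantQC, max_missing: float = 0.02,
                      min_hwe_p: float = 1e-6,
                      min_allele_balance: float = 0.30) -> bool:
    """Keep a variant unless it fails any of the three stated filters."""
    return (v.missing_rate <= max_missing
            and v.hwe_p >= min_hwe_p
            and v.allele_balance >= min_allele_balance)

# Toy records, not study data.
candidates: List[VariantQC] = [
    VariantQC("ok", 0.001, 0.40, 0.48),
    VariantQC("too_missing", 0.05, 0.40, 0.48),
    VariantQC("hwe_fail", 0.001, 1e-9, 0.48),
    VariantQC("low_balance", 0.001, 0.40, 0.22),
]
kept = [v.vid for v in candidates if passes_variant_qc(v)]
```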
Array genotypes, genotype imputation, and integrated exome+imputation panel
For all but 1,488 participants (57 METSIM, 1,431 FINRISK), previously generated array genotypes were available17,55, with which we generated three datasets: (1) a merged array-based call set of all variants present in ≥90% of array-genotyped individuals across both cohorts; (2) a merged array-based Haplotype Reference Consortium (HRC) v1.1 imputed dataset using the Michigan Imputation Server56,57; (3) an integrated data set containing HRC imputed genotypes and exome-sequence variants (excluding all individuals without array data, and using the sequence-based genotypes where there was overlap between sequenced and imputed genotypes).
Annotation
We annotated the final set of sequence variants passing QC using Ensembl’s variant effect predictor (VEP v76)58 employing five in silico algorithms to predict the functional impact of missense variants: PolyPhen2 HumDiv and HumVar59, LRT60, MutationTaster61, and SIFT62.
Association testing
Single variants
We carried out single-variant association tests for transformed trait residuals with genotype dosages for variants with MAC≥3 assuming an additive genetic model, using the EMMAX63 linear mixed model approach, as implemented in EPACTS (v3.3.0; http://genome.sph.umich.edu/wiki/EPACTS), to account for relatedness between individuals. We used genotypes for sequenced variants with MAF≥1% to construct the genetic relationship matrix (GRM).
Conditioning on associated variants from prior GWAS
To differentiate association signals identified here from known associations, for each trait we performed exome-wide association analysis conditioning on variants previously associated (P<10⁻⁷) with that trait in the EBI GWAS catalog (https://www.ebi.ac.uk/gwas/downloads; December 4, 2016 version)64, publications, or manuscripts in preparation55,65–67. The keywords from the GWAS catalog we used to assign known variants to each trait are in Supplementary Table 21. We also manually curated published associations for specific metabolites65,68.
Using the combined HRC+exome panel, we pruned each trait-specific list of associated variants (“GWAS variants”) based on linkage disequilibrium (LD) (r²>0.95). Of 23 GWAS variants absent in the HRC+exome panel, we identified a proxy (r²>0.80) variant for 17; we excluded the remaining six variants from conditional analysis. The variants included in conditional analysis are listed in Supplementary Table 22. We extracted genotypes for variants used in conditional analysis from the HRC+exome panel and converted dosages to alternate allele counts by rounding to the nearest integer (0, 1, or 2). For conditional analyses, we imputed missing genotypes for the individuals without array data using the mean genotype. We then ran association analysis using the same linear mixed model approach as in unconditional analysis but including the complete set of pruned GWAS variants as covariates in the association test. We then evaluated the novelty of conditional associations by searching OMIM, ClinVar, and the literature.
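The two genotype-preparation steps for the conditioning variants (rounding dosages to allele counts, then mean-imputing missing genotypes) can be sketched as follows. The function names are hypothetical, and the handling of exact half-way dosages (rounded up here) is an assumption, since the text does not specify it.

```python
def dosage_to_count(dosage):
    """Round an imputed dosage in [0, 2] to the nearest allele count.

    Assumption: exact halves (0.5, 1.5) round up; the result is clamped
    to the valid range 0..2.
    """
    return min(2, max(0, int(dosage + 0.5)))


def mean_impute(genotypes):
    """Replace missing genotypes (None) with the mean observed genotype."""
    observed = [g for g in genotypes if g is not None]
    if not observed:
        raise ValueError("no observed genotypes to compute a mean from")
    mean = sum(observed) / len(observed)
    return [mean if g is None else float(g) for g in genotypes]
```

Mean imputation leaves the covariate's sample mean unchanged, so individuals without array data contribute no information about the conditioning variant.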
Defining loci
To identify the number of distinct associations for each trait, we performed LD clumping using Swiss (https://github.com/welchr/swiss) of variants with (1) unconditional P<5×10⁻⁷ or (2) both unconditional and conditional P<5×10⁻⁵ for at least one trait. For each variant in this subset, we provided Swiss with the minimum unconditional p-value across all traits. The clumping procedure starts with the variant with the smallest p-value, merges into one locus all variants within ±1Mbp that have r²>0.5 with the index variant, and iterates this process until no variants remain.
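The greedy clumping procedure described above can be sketched as follows. This illustrates the stated logic only and is not the Swiss implementation; the input layout (`variants` as a dictionary and `r2` as a caller-supplied function) is a hypothetical choice made for the example.

```python
def clump(variants, r2, window_bp=1_000_000, r2_min=0.5):
    """Greedy LD clumping.

    variants: dict mapping variant id -> (chrom, pos, p_value)
    r2:       function (id_a, id_b) -> LD r^2 between two variants
    Returns a list of (index_variant, [member ids]) in order of discovery.
    """
    remaining = set(variants)
    loci = []
    while remaining:
        # Index variant: smallest p-value among those left (ties broken by id).
        index = min(remaining, key=lambda v: (variants[v][2], v))
        chrom, pos, _ = variants[index]
        members = [index]
        for v in sorted(remaining - {index}):
            c, p, _ = variants[v]
            if c == chrom and abs(p - pos) <= window_bp and r2(index, v) > r2_min:
                members.append(v)
        remaining -= set(members)
        loci.append((index, members))
    return loci
```

Because each pass removes the index variant and everything merged with it, every variant ends up in exactly one locus.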
Calculating effects and variance explained of individual variants
For novel variants highlighted in Table 1 we evaluated the effect of each variant on the trait values by calculating the mean trait value in carriers and non-carriers. As the effect estimates from our association tests are standardized, we calculated variance explained for a given variant with the equation 2f(1−f)β², where f is the minor allele frequency and β is the estimated effect size. The variance explained is in Supplementary Table 10.
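As a worked illustration, the sketch below assumes the standard expression for the variance explained by an additive variant with a standardized effect, 2f(1−f)β², with f the minor allele frequency and β the estimated effect in standard-deviation units. The input values are made up for the example and are not results from this study.

```python
def variance_explained(maf, beta):
    """Fraction of trait variance explained by one additive variant.

    maf:  minor allele frequency f
    beta: effect size in standard-deviation units (standardized trait)
    """
    return 2.0 * maf * (1.0 - maf) * beta ** 2


# Illustrative inputs only: a 1% variant with a 0.5 SD effect per allele.
example = variance_explained(0.01, 0.5)  # 2 * 0.01 * 0.99 * 0.25 = 0.00495
```

Even a large per-allele effect explains little variance when the allele is rare, because the 2f(1−f) term shrinks with f.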
Gene-based testing
We carried out gene-based association tests using the mixed model implementation of SKAT-O69, considering three different, but nested, sets of variants (variant “masks”):
(1) PTVs at any allele frequency with VEP annotations: frameshift_variant, initiator_codon_variant, splice_acceptor_variant, splice_donor_variant, stop_lost, stop_gained;
(2) PTVs included in (1) plus missense variants with MAF<0.1% scored as “damaging” or “deleterious” by all five functional prediction algorithms;
(3) PTVs included in (1) plus missense variants with MAF<0.5% scored as “damaging” or “deleterious” by all five algorithms.
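The nesting of the three masks can be made explicit with a small sketch. The function name and its arguments are hypothetical. `missense_variant` is the VEP consequence term assumed here for missense variants, and `n_damaging_calls` is assumed to count how many of the five prediction algorithms scored the variant as damaging or deleterious.

```python
PTV_TERMS = {
    "frameshift_variant", "initiator_codon_variant",
    "splice_acceptor_variant", "splice_donor_variant",
    "stop_lost", "stop_gained",
}


def masks_for_variant(consequence, maf, n_damaging_calls, n_algorithms=5):
    """Return the set of masks (1, 2, 3) for which a variant qualifies."""
    if consequence in PTV_TERMS:
        # PTVs enter every mask regardless of allele frequency.
        return {1, 2, 3}
    masks = set()
    if consequence == "missense_variant" and n_damaging_calls == n_algorithms:
        if maf < 0.001:   # MAF < 0.1%
            masks.add(2)
        if maf < 0.005:   # MAF < 0.5%
            masks.add(3)
    return masks
```

Any variant in mask 1 is also in masks 2 and 3, and any variant in mask 2 is also in mask 3.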
For each trait and mask, we only tested genes with at least two qualifying variants. Each mask contained a different number of genes with at least two qualifying variants: up to 7,996, 12,795, and 12,890 for the three masks, respectively. The exact number of genes tested varied by trait due to sample size. We first used a Bonferroni-corrected exome-wide threshold for 12,890 genes, which corresponds to a threshold of P<3.88×10⁻⁶. Analogous to single-variant association, we passed genes meeting this association threshold forward for additional consideration with hierarchical FDR correction, described below.
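The gene-based threshold follows from dividing a family-wise error rate by the number of genes tested. The sketch below assumes a rate of 0.05, which is not stated explicitly but is implied by the reported threshold for 12,890 genes.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under Bonferroni correction."""
    return alpha / n_tests


# Assumed family-wise alpha of 0.05 across the 12,890 genes in the largest mask.
gene_threshold = bonferroni_threshold(0.05, 12890)  # approximately 3.88e-6
```

Using the gene count of the largest mask for all three masks makes the threshold slightly conservative for the two smaller masks.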
Hierarchical FDR correction for testing multiple traits and variants
To control for multiple testing across 64 traits, we adopted an FDR controlling procedure70, using a two-stage hierarchical strategy (described in Supplementary Information). Stage 1 identifies the set of R variants (or genes) associated with at least one trait (P<5×10⁻⁷ for single-variant unconditional results and P<3.88×10⁻⁶ for gene-based results), controlling genome-wide FDR across all variants at 0.05. Stage 2 identifies all traits associated with the discovered variants in a manner guaranteeing an average FDR<0.05.
Genotype validation
We validated exome sequence-based genotype calls using Sanger sequencing for METSIM carriers of 13 trait-associated very rare variants with MAF<0.1% in seven genes, finding concordance for 107 of 108 (99.1%) non-reference genotypes evaluated.
Replication in additional Finnish cohorts
We attempted to replicate significant single-variant associations (P<5×10⁻⁷) and follow-up suggestive single-variant associations (P<5×10⁻⁵) using imputed array data from up to 24,776 individuals from three cohort studies: Northern Finland Birth Cohort 1966 (NFBC1966)18, the Helsinki Birth Cohort Study (HBCS)19, and FINRISK study participants not included in FinMetSeq16,17.
For each cohort, prior to phasing we performed genotype quality control batch-wise using standard quality thresholds. We pre-phased array genotypes with Eagle71 (v2.3) and imputed genotypes genome-wide with IMPUTE72 (v2.3.1) using 2,690 sequenced Finnish genomes and 5,092 sequenced Finnish exomes. We assessed imputation quality by confirming sex, comparing sample allele frequencies with reference population estimates, and examining imputation quality (INFO score) distributions. We excluded any variant with INFO<0.7 within a given batch from all replication/follow-up analyses.
For each cohort, we matched, harmonized, covariate adjusted, and transformed available phenotypes as described above for FinMetSeq, and ran single-variant association using the EMMAX linear mixed model implemented in EPACTS, after generating kinship matrices from LD-pruned (command: plink --indep-pairwise 50 5 0.2) directly genotyped variants with MAF>5%.
Association to disease endpoints
From >1,100 disease endpoints available for analysis in FinnGen, we selected 22 we considered most relevant to the traits analyzed in FinMetSeq, identifying variant associations as described in Tabassum et al.33.
Association replication in UK Biobank
For eight FinMetSeq anthropometric and blood pressure traits available in UKB (height, weight, BMI, hip circumference, waist circumference, fat percentage, systolic blood pressure, and diastolic blood pressure), we extracted, for variants reaching P<5×10⁻⁷ in our combined analysis, trait-variant association statistics from http://www.nealelab.is/uk-biobank. Seven of the eight traits had at least one associated variant, and 23 of the 31 variants in total were available in UKB. A comparison of association results is in Supplementary Table 15.
Population genetic analyses
Identifying unrelated individuals
To identify nearly independent common SNVs, we removed SNVs with MAF<5% and pruned the remaining SNVs in windows of 50 SNVs, in steps of 5 SNVs, such that no pair of SNVs had r²>0.2. We used KING73 to estimate pairwise relationships among the exome-sequenced individuals, removing one individual from each pair inferred by KING to have a relationship of 3rd degree or closer, yielding 14,874 unrelated individuals for population genetic analyses.
Enrichment of predicted-deleterious alleles in Finland
We assessed enrichment of predicted-deleterious alleles in Finland by comparing the 14,874 nearly unrelated FinMetSeq individuals to the 14,944 NFE control exomes in gnomAD (after removing NFE individuals from countries with substantial Finnish populations, Estonia and Sweden). We analyzed the two most common alleles at each site with base quality score >10, mapping quality score >20, and coverage equal to or greater than that found in ≥80% of variable sites (17.73X in FinMetSeq, 32.27X in gnomAD), resulting in ~38.6 Mbp for comparisons. We contrasted the proportional site frequency spectra for FinMetSeq and NFE for five functional variant categories (PTVs, missense, synonymous, UTR, and intronic variants) after down-sampling both datasets to 18,000 chromosomes.
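The text does not state how the down-sampling was carried out. One common approach, shown here purely as an assumption, is hypergeometric projection: a site with k copies of an allele among n chromosomes contributes to each count j among m sampled chromosomes with probability C(k,j)C(n−k,m−j)/C(n,m). The function names are hypothetical, and exact fractions are used for clarity, so the sketch suits small examples rather than 18,000 chromosomes.

```python
from fractions import Fraction
from math import comb


def project_count(k, n, m):
    """Distribution of the allele count in m chromosomes drawn without
    replacement from n chromosomes, k of which carry the allele."""
    denom = comb(n, m)
    return [Fraction(comb(k, j) * comb(n - k, m - j), denom)
            for j in range(m + 1)]


def project_sfs(sfs, n, m):
    """Project a site frequency spectrum from n down to m chromosomes.

    sfs: dict mapping allele count k (in n chromosomes) -> number of sites
    Returns the expected number of sites at each allele count 0..m.
    """
    out = [Fraction(0)] * (m + 1)
    for k, sites in sfs.items():
        for j, p in enumerate(project_count(k, n, m)):
            out[j] += sites * p
    return out
```

Projecting both datasets to the same number of chromosomes puts their frequency spectra on a common scale, so proportions in each allele-count bin can be compared directly.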
We also assessed the enrichment of deleterious alleles within subpopulations of the FinMetSeq dataset. We applied Chromopainter and fineSTRUCTURE on 2,644 unrelated FinMetSeq individuals whose parents were both born in the same municipality to identify 16 sub-population clusters74 (Supplementary Information). Of the 16 clusters, we used as the reference population a cluster for which the highest proportion of the parents of its members were from early-settlement Finland (NSv3, Supplementary Table 17). We used the twelve clusters with >100 members in subsequent analyses (Supplementary Table 17). We then compared the ratio of the site frequency spectra to the reference for PTVs, missense, and synonymous variants, down-sampling both datasets to 200 haploid chromosomes. For each comparison, we computed statistical evidence for enrichment or depletion at a given allele count bin by exact binomial test against a null of equal number of variants found in both the test and reference cluster.
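The exact binomial test against a null of equal variant counts can be sketched with the standard library alone. The two-sided form used here (summing the probabilities of all outcomes no more likely than the one observed) is an assumption, as the text does not say whether the test was one- or two-sided. The function name is hypothetical.

```python
from math import comb


def exact_binomial_equal_counts(k_test, k_ref):
    """Two-sided exact binomial p-value for H0: a variant in this bin is
    equally likely to come from the test or the reference cluster (p = 0.5).

    k_test, k_ref: numbers of variants in the allele-count bin for the
    test and reference clusters.
    """
    n = k_test + k_ref
    if n == 0:
        return 1.0
    # With p = 0.5 every outcome has probability comb(n, i) / 2**n, so
    # integer binomial coefficients can be compared exactly.
    weights = [comb(n, i) for i in range(n + 1)]
    observed = weights[k_test]
    tail = sum(w for w in weights if w <= observed)
    return tail / 2 ** n
```

For instance, 8 variants in the test cluster against 2 in the reference gives a p-value of 112/1024, about 0.109.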
Geographical clustering of predicted functionally deleterious alleles
We first generated a distance matrix tabulating the pairwise geographical distance between the birthplaces of all available parents of unrelated sequenced individuals. For each variant of interest, we computed for the minor allele carriers in FinMetSeq the mean distance among all parent pairs. We evaluated statistical significance of geographical clustering by comparing the observed mean distance to mean distances for up to 10,000,000 sets of randomly drawn non-carrier individuals matched by cohort status and number of parents with birthplace information available. Birthplaces of carrier and non-carrier individuals were plotted on a map of Finland, including regions that were ceded prior to WW2 (© Karttakeskus Oy, 2001).
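The resampling logic of the clustering test can be sketched as below. Several simplifications are assumptions made for illustration: birthplaces are treated as planar (x, y) coordinates with Euclidean distance, each point stands for one parental birthplace, and the matching of random draws on cohort and on number of parents with birthplace information is omitted. The function names are hypothetical.

```python
import random
from itertools import combinations
from math import hypot


def mean_pairwise_distance(points):
    """Mean Euclidean distance over all pairs of (x, y) points."""
    pairs = list(combinations(points, 2))
    if not pairs:
        raise ValueError("need at least two points")
    return sum(hypot(a[0] - b[0], a[1] - b[1]) for a, b in pairs) / len(pairs)


def clustering_p_value(carrier_points, noncarrier_points, n_draws=10_000, seed=0):
    """Empirical p-value that carriers are at least as tightly clustered as
    equally sized random sets of non-carriers."""
    rng = random.Random(seed)
    observed = mean_pairwise_distance(carrier_points)
    k = len(carrier_points)
    hits = 0
    for _ in range(n_draws):
        sample = rng.sample(noncarrier_points, k)
        if mean_pairwise_distance(sample) <= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_draws + 1)
```

The smallest attainable p-value is 1/(n_draws + 1), which is why a very large number of draws is needed to resolve strongly clustered variants.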
To assess whether PTVs or missense variants may be more geographically clustered than synonymous variants, we first identified a set of near-independent variants (r²>0.02) with MAC≥3 and MAF≤5% among the 14,874 unrelated individuals. For each variant, we computed the mean pairwise geographical distance between the birthplaces across all pairs of the available parents of carriers of the minor allele and regressed this mean distance on variant class (PTVs, missense, or synonymous) and MAC, MAC², and MAC³ (Supplementary Table 16). For those variants in gnomAD, we also assessed whether variants enriched in FinMetSeq compared to NFE are more likely to be geographically clustered. As above, we computed the mean pairwise distances among parents of carriers of the minor allele and regressed mean distance on the logarithm of enrichment and MAC, MAC², and MAC³ (Supplementary Table 19). In both analyses we assessed a model with the interaction terms but report only the model without interactions if the interactions were not significant.
Heritability estimates and genetic correlations
We used genome-wide array genotype data on the 13,326 unrelated individuals for whom both exome sequence and array data were available to estimate heritability and genetic correlations for the 64 traits. We constructed a GRM with PLINK75 (v.1.90b, https://www.cog-genomics.org/plink2) by applying additional filters for MAF>1% and genotype missingness rate <2% to the set of previously-used genotyped SNVs, leaving 205,149 SNVs for GRM calculation. We used the exact mixed model approach of biMM76 (v.1.0.0, http://www.helsinki.fi/~mjxpirin/download.html) to estimate the heritability of our 64 traits and the genetic correlation of the 2,016 trait pairs.
Extended Data
Extended Data Fig. 1. Allele frequency comparisons between FinMetSeq and NFE from gnomAD.
A) Distribution of allelic frequencies between FinMetSeq and gnomAD NFE. The comparison of allele frequencies shows the excess of variants at higher frequency in Finland as a result of the multiple bottlenecks experienced in Finnish population history.
B) Proportional site frequency spectra between FinMetSeq and gnomAD NFE by variant annotation class. In general, we find a depletion of the variants in the rarest frequency class, as well as enrichment of variants in the intermediate to common frequency range. The site frequency spectra were down-sampled to 18,000 chromosomes for each dataset.
C) Comparison of MAFs for trait-associated variants in FinMetSeq and NFE gnomAD. Plotted in gray background is a 2-D histogram of variants with non-zero allele frequencies in both gnomAD and FinMetSeq but no trait associations. Variants associated with at least one trait are colored and scaled inversely proportional to the logarithm of the association p-value. Variants >10x enriched in FinMetSeq compared to NFE are pink, those <10x enriched are in blue. The dashed line is the line of equal frequency. Two-sided uncorrected P-values are from a regression of trait on the count of alternative allele at each variant. The number of independent individuals used in each point is listed in Supplementary Table 5.
Extended Data Fig. 2. Heritability of and correlations between traits.
Traits are in the same order, clockwise in A, and left to right and top to bottom in B, following the trait group color key.
A) Heritability estimated in 13,342 unrelated individuals (for abbreviations see Supplementary Table 4); for details see Supplementary Table 6.
B) Heatmap of: 1) absolute Pearson correlations of standardized trait values in upper triangle; 2) absolute values of estimated pairwise genetic correlations in lower triangle. Genetic correlations are estimated in 13,342 unrelated individuals. Values below the diagonal in gray had trait heritability less than 1.5 times the SE of heritability.
Extended Data Fig. 3. Properties of associations shared between traits.
A) Shared genomic associations by pairs of traits. For traits x and y, color in row x and column y reflects the number of loci associated with both traits divided by the number of loci associated with trait x. Traits are presented in the same order as in Extended Data Figure 2A, and the side and top color bars reflect trait groups.
B) Relationship between estimated genetic correlation and extent of sharing of genetic associations. For each trait-pair, the extent of locus sharing is defined as the number of loci associated with both traits divided by the total number of loci associated with either trait. Analysis using the absolute value of the Pearson correlation of the residual series results in a very similar pattern. The numbers of trait pairs in each x-axis category are as follows: 0-1%: 819; 1-10%: 204; 11-20%: 102; 21-30%: 41; 31-40%: 29; 41-50%: 16; >50%: 13. The bar within each box is the median, the box represents the upper and lower quartiles, whiskers extend to 1.5x the interquartile range, and points represent outliers.
Extended Data Fig. 4. Gene-based association of extremely rare variants in APOB with serum total cholesterol.
The upper panel shows the distribution of the covariate adjusted and inverse-normal transformed phenotype. The lower panel displays the association statistics for each variant included in the gene-based test along with the trait value for minor allele carriers of each variant (orange triangles). SV.P is the P-value from the analysis of each variant in a single-variant analysis. The number of independent individuals in the analysis is 19,291.
Extended Data Fig. 5. Gene-based association of rare variants in SECTM1 with HDL2 cholesterol.
The upper panel shows the distribution of the covariate adjusted and inverse-normal transformed phenotype. The lower panel displays the association statistics for each variant included in the gene-based test, along with the trait value for minor allele carriers of each variant (orange triangles). SV.P is the P-value from the analysis of each variant in a single-variant analysis. The number of independent individuals in the analysis is 10,984.
Extended Data Fig. 6. Gene-based association of extremely rare variants in ALDH1L1 with glycine levels.
The upper panel shows the distribution of the covariate adjusted and inverse-normal transformed phenotype. The lower panel displays the association statistics for each variant included in the gene-based test, along with the trait value for minor allele carriers of each variant (orange triangles). SV.P is the P-value from the analysis of each variant in a single-variant analysis. The number of independent individuals in the analysis is 8,206.
Extended Data Fig. 7. Population structure of the FinMetSeq dataset, by region.
Population structure, by region, from principal components analysis of exome sequencing variant data (MAF > 1%), for 14,874 unrelated individuals with known parental birthplaces. Color indicates individuals with both parents born in the same region; gray indicates individuals with different parental birth regions, or missing information for one parent. Abbreviations for the regions: Usm, Uusimaa; Swf, Southwest Finland; Stk, Satakunta; Khm, Kanta-Hame; Prk, Pirkanmaa; Phm, Paijat-Hame; Kyl, Kymenlaakso; SKa, Southern Karelia; Nka, Northern Karelia; SSv, Southern Savonia; NSv, Northern Savonia; Ctf, Central Finland; SOs, Southern Ostrobothnia; Osb, Ostrobothnia; COs, Central Ostrobothnia; NOs, Northern Ostrobothnia; Kai, Kainuu; Lap, Lapland; X, split parental birthplaces. Large solid circles represent the center of each region.
Extended Data Fig. 8. Hierarchical clustering tree produced by fineSTRUCTURE.
We identified 16 subpopulations within the FinMetSeq dataset by applying a haplotype-based clustering algorithm, fineSTRUCTURE, on 2,644 unrelated individuals born by 1955 whose parents were both born in the same municipality (Methods). Each subpopulation is named based on the most common parental birth location among its members, with the following abbreviations: NKa, North Karelia; NSv, North Savonia; SOs, South Ostrobothnia; NOs, North Ostrobothnia; Kai, Kainuu; Lap, Lapland; SuK, Surrendered Karelia. A map of Finland with regions labeled is supplied for reference. If multiple subpopulations share the same location label, the subpopulation is further distinguished with a numeral. NSv3 is used as an internal reference in enrichment analysis. See Supplementary Table 17 for more detailed demographic descriptions of each subpopulation.
Extended Data Fig. 9. Regional variation in allele frequencies by functional annotation.
Enrichment of variants by allelic class in regional sub-populations of late settlement Finland (defined in Supplementary Table 17). Each bin represents the ratio of variants in the subpopulation compared to the reference subpopulation (NSv3), after down-sampling the frequency spectra of all populations to 200 chromosomes. Pink cells represent an enrichment (ratio >1), blue cells represent a depletion (ratio <1). Sample sizes, confidence intervals for each enrichment ratio, and their P-values are presented in Supplementary Table 18. The results are consistent with multiple bottlenecks in late settlement Finland, particularly for populations in Lapland and Northern Ostrobothnia.
Supplementary Material
Supplementary Information is linked to the online version of the paper at www.nature.com/nature.
Acknowledgements
Thanks to Terri Teshiba for coordinating ethical permissions and samples. Thanks to Sini Kerminen, Daniel Lawson, and George Busby for discussions and providing scripts to run fineSTRUCTURE. SR was supported by the Academy of Finland Center of Excellence in Complex Disease Genetics (Grant No 312062), Academy of Finland (No. 285380), the Finnish Foundation for Cardiovascular Research, the Sigrid Juselius Foundation, Biocentrum Helsinki and University of Helsinki HiLIFE Fellow grant. VR acknowledges support by RFBR, research project No. 18-04-00789 A. VS was supported by the Finnish Foundation for Cardiovascular Research. CS and LS received funding from HG006695, HL113315, and MH105578. MAK is supported by a Senior Research Fellowship from the National Health and Medical Research Council (NHMRC) of Australia (APP1158958). He also works in a unit that is supported by the University of Bristol and UK Medical Research Council (MC_UU_12013/1). The Baker Institute is supported in part by the Victorian Government’s Operational Infrastructure Support Program. AUJ, DR, LJS, HMS, RW, PY, XY, and MB received funding from DK062370. SKS, CWKC, and NBF received funding from HL113315 and NS062691. The METSIM study was supported by grants from Academy of Finland (No. 321428), the Sigrid Juselius Foundation, the Finnish Foundation for Cardiovascular Research, Kuopio University Hospital, and Centre of Excellence of Cardiovascular and Metabolic Diseases supported by the Academy of Finland (ML). Sequencing was funded by 5U54HG003079, and AEL, KMS, HJB, CCC, CJK, KLK, DCK, DEL, JN, TJN, SKD, NOS, IMH, and RKW were funded by 5U54HG003079 and 5UM1HG008853-03.
Footnotes
Author Contributions
AEL, LJS, RKW, AaP, VS, ML, SR, MB, and NBF designed the study. AEL, KMS, HJA, RSF, DCK, DEL, JN, TJN, and JV produced and quality-controlled the sequence data. AEL, ASH, AUJ, ArP, HMS, MAK, VS, and ML collected, quality-controlled, and/or prepared the clinical data for association analysis. AEL, KMS, CWKC, SKS, ASH, LS, MP, CCC, AUJ, CJK, KK, VR, DR, JV, RW, PY, and XY analyzed data. ASH, JGE, MAK, MRJ, and MM collected, quality-controlled, and analyzed replication data. HL, SKD, NOS, IMH, CS, SR, MB, and NBF supervised experiments and analyses. AEL, KMS, CWKC, SKS, CS, MB and NBF wrote the paper. AEL, KMS, CWKC, and SKS contributed equally to this work. NBF and MB jointly supervised this work.
Author Information
Reprints and permission information is available at www.nature.com/reprints
Competing interests statements:
VS has participated in a conference trip sponsored by Novo Nordisk and received an honorarium from the same source for participating in an advisory board meeting. He also has ongoing research collaboration with Bayer Ltd.
HL is a member of the Nordic Expert group unconditionally supported by Gedeon Richter Nordics and has received an honorarium from Orion.
Data Availability: The sequence data can be accessed through dbGaP using study numbers phs000756 and phs000752. Association results can be accessed at http://pheweb.sph.umich.edu/FinMetSeq/ and are searchable via the Type 2 Diabetes Knowledge Portal (www.type2diabetesgenetics.org). Summary statistics will also be made available through the NHGRI-EBI GWAS Catalog at https://www.ebi.ac.uk/gwas/downloads/summary-statistics.
References
- 1. Samocha KE, et al. Regional missense constraint improves variant deleteriousness prediction. bioRxiv. 2017 doi: 10.1101/148353.
- 2. Marouli E, et al. Rare and low-frequency coding variants alter human adult height. Nature. 2017;542:186–190. doi: 10.1038/nature21039.
- 3. Flannick J, et al. Exome sequencing of 20,791 cases of type 2 diabetes and 24,440 controls. Nature. 2019;570:71–76. doi: 10.1038/s41586-019-1231-2.
- 4. Timpson NJ, Greenwood CMT, Soranzo N, Lawson DJ, Richards JB. Genetic architecture: the shape of the genetic contribution to human traits and disease. Nature reviews. Genetics. 2018;19:110–124. doi: 10.1038/nrg.2017.101.
- 5. Zuk O, et al. Searching for missing heritability: designing rare variant association studies. Proc Natl Acad Sci U S A. 2014;111:E455–464. doi: 10.1073/pnas.1322563111.
- 6. Xue Y, et al. Enrichment of low-frequency functional variants revealed by whole-genome sequencing of multiple isolated European populations. Nature communications. 2017;8:15927. doi: 10.1038/ncomms15927.
- 7. Southam L, et al. Whole genome sequencing and imputation in isolated populations identify genetic associations with medically-relevant complex traits. Nature communications. 2017;8:15606. doi: 10.1038/ncomms15606.
- 8. Manolio TA, et al. Finding the missing heritability of complex diseases. Nature. 2009;461:747–753. doi: 10.1038/nature08494.
- 9. Jakkula E, et al. The genome-wide patterns of variation expose significant substructure in a founder population. American journal of human genetics. 2008;83:787–794. doi: 10.1016/j.ajhg.2008.11.005.
- 10. Polvi A, et al. The Finnish disease heritage database (FinDis) update-a database for the genes mutated in the Finnish disease heritage brought to the next-generation sequencing era. Hum Mutat. 2013;34:1458–1466. doi: 10.1002/humu.22389.
- 11. Manning A, et al. A Low-Frequency Inactivating AKT2 Variant Enriched in the Finnish Population Is Associated With Fasting Insulin Levels and Type 2 Diabetes Risk. Diabetes. 2017;66:2019–2032. doi: 10.2337/db16-1329.
- 12. Lim ET, et al. Distribution and medical impact of loss-of-function variants in the Finnish founder population. PLoS genetics. 2014;10:e1004494. doi: 10.1371/journal.pgen.1004494.
- 13. Service SK, et al. Re-sequencing expands our understanding of the phenotypic impact of variants at GWAS loci. PLoS genetics. 2014;10:e1004147. doi: 10.1371/journal.pgen.1004147.
- 14. Wurtz P, et al. Quantitative Serum Nuclear Magnetic Resonance Metabolomics in Large-Scale Epidemiology: A Primer on -Omic Technologies. American journal of epidemiology. 2017;186:1084–1096. doi: 10.1093/aje/kwx016.
- 15. Laakso M, et al. The Metabolic Syndrome in Men study: a resource for studies of metabolic and cardiovascular diseases. Journal of lipid research. 2017;58:481–493. doi: 10.1194/jlr.O072629.
- 16. Borodulin K, et al. Forty-year trends in cardiovascular risk factors in Finland. Eur J Public Health. 2015;25:539–546. doi: 10.1093/eurpub/cku174.
- 17. Abraham G, et al. Genomic prediction of coronary heart disease. Eur Heart J. 2016;37:3267–3278. doi: 10.1093/eurheartj/ehw450.
- 18. Sabatti C, et al. Genome-wide association analysis of metabolic traits in a birth cohort from a founder population. Nature genetics. 2009;41:35–46. doi: 10.1038/ng.271.
- 19. Pulizzi N, et al. Interaction between prenatal growth and high-risk genotypes in the development of type 2 diabetes. Diabetologia. 2009;52:825–829. doi: 10.1007/s00125-009-1291-1.
- 20. Fagerberg L, et al. Analysis of the human tissue-specific expression by genome-wide integration of transcriptomics and antibody-based proteomics. Mol Cell Proteomics. 2014;13:397–406. doi: 10.1074/mcp.M113.035600.
- 21. Corsetti JP, et al. Thrombospondin-4 polymorphism (A387P) predicts cardiovascular risk in postinfarction patients with high HDL cholesterol and C-reactive protein levels. Thromb Haemost. 2011;106:1170–1178. doi: 10.1160/TH11-03-0206.
- 22. Zhang XJ, et al. Association between single nucleotide polymorphisms in thrombospondins genes and coronary artery disease: A meta-analysis. Thromb Res. 2015;136:45–51. doi: 10.1016/j.thromres.2015.04.019.
- 23. Beygo J, et al. New insights into the imprinted MEG8-DMR in 14q32 and clinical and molecular description of novel patients with Temple syndrome. Eur J Hum Genet. 2017;25:935–945. doi: 10.1038/ejhg.2017.91.
- 24. Wallace C, et al. The imprinted DLK1-MEG3 gene region on chromosome 14q32.2 alters susceptibility to type 1 diabetes. Nature genetics. 2010;42:68–71. doi: 10.1038/ng.493.
- 25. Day FR, et al. Genomic analyses identify hundreds of variants associated with age at menarche and support a role for puberty timing in cancer risk. Nature genetics. 2017;49:834–841. doi: 10.1038/ng.3841.
- 26. Perry JR, et al. Parent-of-origin-specific allelic associations among 106 genomic loci for age at menarche. Nature. 2014;514:92–97. doi: 10.1038/nature13545.
- 27. Cleaton MA, et al. Fetus-derived DLK1 is required for maternal metabolic adaptations to pregnancy and is associated with fetal growth restriction. Nature genetics. 2016;48:1473–1480. doi: 10.1038/ng.3699.
- 28. Chaves JA, et al. Genomic variation at the tips of the adaptive radiation of Darwin's finches. Mol Ecol. 2016;25:5282–5295. doi: 10.1111/mec.13743.
- 29. Surakka I, et al. The impact of low-frequency and rare variants on lipid levels. Nature genetics. 2015;47:589–597. doi: 10.1038/ng.3300.
- 30. Ding Y, et al. Plasma Glycine and Risk of Acute Myocardial Infarction in Patients With Suspected Stable Angina Pectoris. J Am Heart Assoc. 2015;5 doi: 10.1161/JAHA.115.002621.
- 31. Wittemans LBL, et al. Assessing the causal association of glycine with risk of cardio-metabolic diseases. Nature communications. 2019;10:1060. doi: 10.1038/s41467-019-08936-1.
- 32. Perry RJ, et al. Acetate mediates a microbiome-brain-beta-cell axis to promote metabolic syndrome. Nature. 2016;534:213–217. doi: 10.1038/nature18309.
- 33. Tabassum R, et al. Genetics of human plasma lipidome: Understanding lipid metabolism and its link to diseases beyond traditional lipids. bioRxiv. 2018 doi: 10.1101/457960.
- 34. Casanova ML, et al. Exocrine pancreatic disorders in transgenic mice expressing human keratin 8. J Clin Invest. 1999;103:1587–1595. doi: 10.1172/JCI5343.
- 35.Surendran P, et al. Trans-ancestry meta-analyses identify rare and common variants associated with blood pressure and hypertension. Nature genetics. 2016;48:1151–1161. doi: 10.1038/ng.3654. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 36.Liu C, et al. Meta-analysis identifies common and rare variants influencing blood pressure and overlapping with metabolic trait loci. Nature genetics. 2016;48:1162–1170. doi: 10.1038/ng.3660. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 37.Palmer C, Pe'er I. Statistical correction of the Winner's Curse explains replication variability in quantitative trait genome-wide association studies. PLoS genetics. 2017;13:e1006916. doi: 10.1371/journal.pgen.1006916. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 38.Norio R. Finnish Disease Heritage I: characteristics, causes, background. Hum Genet. 2003;112:441–456. doi: 10.1007/s00439-002-0875-3. [DOI] [PubMed] [Google Scholar]
- 39.Service S, et al. Magnitude and distribution of linkage disequilibrium in population isolates and implications for genome-wide association studies. Nature genetics. 2006;38:556–560. doi: 10.1038/ng1770. [DOI] [PubMed] [Google Scholar]
- 40.Chiang CWK, et al. Genomic history of the Sardinian population. Nature genetics. 2018 doi: 10.1038/s41588-018-0215-8. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 41.Rivas MA, et al. Insights into the genetic epidemiology of Crohn's and rare diseases in the Ashkenazi Jewish population. PLoS genetics. 2018;14:e1007329. doi: 10.1371/journal.pgen.1007329. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 42.Bastarache L, et al. Phenotype risk scores identify patients with unrecognized Mendelian disease patterns. Science. 2018;359:1233–1239. doi: 10.1126/science.aal4043. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 43.Niemi MEK, et al. Common genetic variants contribute to risk of rare severe neurodevelopmental disorders. Nature. 2018 doi: 10.1038/s41586-018-0566-4. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 44. Surakka I, Sarin A-P, Ruotsalainen SE, Durbin R, Salomaa V, Daly M, Palotie A, Ripatti S. The rate of false polymorphisms introduced when imputing genotypes from global imputation panels. bioRxiv. 2016 doi: 10.1101/080770.
- 45. Collins FS, Varmus H. A new initiative on precision medicine. N Engl J Med. 2015;372:793–795. doi: 10.1056/NEJMp1500523.
- 46. Stancáková A, et al. Changes in insulin sensitivity and insulin release in relation to glycemia and glucose tolerance in 6,414 Finnish men. Diabetes. 2009;58:1212–1221. doi: 10.2337/db08-1607.
- 47. Borodulin K, et al. Cohort Profile: The National FINRISK Study. Int J Epidemiol. 2017 doi: 10.1093/ije/dyx239.
- 48. Wu J, et al. A summary of the effects of antihypertensive medications on measured blood pressure. Am J Hypertens. 2005;18:935–942. doi: 10.1016/j.amjhyper.2005.01.011.
- 49. Tobin MD, Sheehan NA, Scurrah KJ, Burton PR. Adjusting for treatment effects in studies of quantitative traits: antihypertensive therapy and systolic blood pressure. Statistics in medicine. 2005;24:2911–2935. doi: 10.1002/sim.2165.
- 50. Liu DJ, et al. Exome-wide association study of plasma lipids in >300,000 individuals. Nature genetics. 2017 doi: 10.1038/ng.3977.
- 51. Friedewald WT, Levy RI, Fredrickson DS. Estimation of the concentration of low-density lipoprotein cholesterol in plasma, without use of the preparative ultracentrifuge. Clin Chem. 1972;18:499–502.
- 52. DePristo MA, et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet. 2011;43:491–498. doi: 10.1038/ng.806.
- 53. Jun G, et al. Detecting and estimating contamination of human DNA samples in sequencing and array-based genotype data. Am J Hum Genet. 2012;91:839–848. doi: 10.1016/j.ajhg.2012.09.004.
- 54. Tan A, Abecasis GR, Kang HM. Unified representation of genetic variants. Bioinformatics. 2015;31:2202–2204. doi: 10.1093/bioinformatics/btv112.
- 55. Davis JP, et al. Common, low-frequency, and rare genetic variants associated with lipoprotein subclasses and triglyceride measures in Finnish men from the METSIM study. PLoS genetics. 2017;13:e1007079. doi: 10.1371/journal.pgen.1007079.
- 56. Das S, et al. Next-generation genotype imputation service and methods. Nature genetics. 2016;48:1284–1287. doi: 10.1038/ng.3656.
- 57. McCarthy S, et al. A reference panel of 64,976 haplotypes for genotype imputation. Nature genetics. 2016;48:1279–1283. doi: 10.1038/ng.3643.
- 58. McLaren W, et al. The Ensembl Variant Effect Predictor. Genome Biol. 2016;17:122. doi: 10.1186/s13059-016-0974-4.
- 59. Adzhubei IA, et al. A method and server for predicting damaging missense mutations. Nature methods. 2010;7:248–249. doi: 10.1038/nmeth0410-248.
- 60. Chun S, Fay JC. Identification of deleterious mutations within three human genomes. Genome research. 2009;19:1553–1561. doi: 10.1101/gr.092619.109.
- 61. Schwarz JM, Cooper DN, Schuelke M, Seelow D. MutationTaster2: mutation prediction for the deep-sequencing age. Nature methods. 2014;11:361–362. doi: 10.1038/nmeth.2890.
- 62. Kumar P, Henikoff S, Ng PC. Predicting the effects of coding non-synonymous variants on protein function using the SIFT algorithm. Nature protocols. 2009;4:1073–1081. doi: 10.1038/nprot.2009.86.
- 63. Kang HM, et al. Variance component model to account for sample structure in genome-wide association studies. Nature genetics. 2010;42:348–354. doi: 10.1038/ng.548.
- 64. Buniello A, et al. The NHGRI-EBI GWAS Catalog of published genome-wide association studies, targeted arrays and summary statistics 2019. Nucleic Acids Res. 2019;47:D1005–D1012. doi: 10.1093/nar/gky1120.
- 65. Kettunen J, et al. Genome-wide study for circulating metabolites identifies 62 loci and reveals novel systemic effects of LPA. Nature communications. 2016;7:11122. doi: 10.1038/ncomms11122.
- 66. Kettunen J, et al. Genome-wide association study identifies multiple loci influencing human serum metabolite levels. Nature genetics. 2012;44:269–276. doi: 10.1038/ng.1073.
- 67. Teslovich TM, et al. Identification of seven novel loci associated with amino acid levels using single-variant and gene-based tests in 8545 Finnish men from the METSIM study. Hum Mol Genet. 2018;27:1664–1674. doi: 10.1093/hmg/ddy067.
- 68. Inouye M, et al. Novel Loci for metabolic networks and multi-tissue expression studies reveal genes for atherosclerosis. PLoS Genet. 2012;8:e1002907. doi: 10.1371/journal.pgen.1002907.
- 69. Lee S, et al. Optimal unified approach for rare-variant association testing with application to small-sample case-control whole-exome sequencing studies. American journal of human genetics. 2012;91:224–237. doi: 10.1016/j.ajhg.2012.06.007.
- 70. Peterson CB, Bogomolov M, Benjamini Y, Sabatti C. Many Phenotypes Without Many False Discoveries: Error Controlling Strategies for Multitrait Association Studies. Genet Epidemiol. 2016;40:45–56. doi: 10.1002/gepi.21942.
- 71. Loh PR, et al. Reference-based phasing using the Haplotype Reference Consortium panel. Nature genetics. 2016;48:1443–1448. doi: 10.1038/ng.3679.
- 72. Howie BN, Donnelly P, Marchini J. A flexible and accurate genotype imputation method for the next generation of genome-wide association studies. PLoS genetics. 2009;5:e1000529. doi: 10.1371/journal.pgen.1000529.
- 73. Manichaikul A, et al. Robust relationship inference in genome-wide association studies. Bioinformatics. 2010;26:2867–2873. doi: 10.1093/bioinformatics/btq559.
- 74. Lawson DJ, Hellenthal G, Myers S, Falush D. Inference of population structure using dense haplotype data. PLoS genetics. 2012;8:e1002453. doi: 10.1371/journal.pgen.1002453.
- 75. Chang CC, et al. Second-generation PLINK: rising to the challenge of larger and richer datasets. Gigascience. 2015;4:7. doi: 10.1186/s13742-015-0047-8.
- 76. Pirinen M, et al. biMM: efficient estimation of genetic variances and covariances for cohorts with high-dimensional phenotype measurements. Bioinformatics. 2017;33:2405–2407. doi: 10.1093/bioinformatics/btx166.
Associated Data