PLOS ONE. 2024 Dec 4;19(12):e0314502. doi: 10.1371/journal.pone.0314502

TRIO RVEMVS: A Bayesian framework for rare variant association analysis with expectation-maximization variable selection using family trio data

Duo Yu 1, Matthew Koslovsky 2, Margaret C Steiner 3, Kusha Mohammadi 4, Chenguang Zhang 5, Michael D Swartz 6,*
Editor: Priyadarshini Kachroo7
PMCID: PMC11616829  PMID: 39630689

Abstract

It is commonly reported that rare variants may be more functionally related to complex diseases than common variants. However, individual rare variant association tests remain challenging due to low minor allele frequency in the available samples. This paper proposes an expectation maximization variable selection (EMVS) method to simultaneously detect common and rare variants at the individual variant level using family trio data. TRIO_RVEMVS was assessed in both large (1500 families) and small (350 families) datasets based on simulation. The performance of TRIO_RVEMVS was compared with gene-level kernel and burden association tests that use pedigree data (PedGene) and rare-variant extensions of the transmission disequilibrium test (RV-TDT). At the region level, TRIO_RVEMVS outperformed PedGene and RV-TDT when common variants were included. TRIO_RVEMVS performed competitively with PedGene and outperformed RV-TDT when the analysis was restricted to rare variants only. At the individual variant level, with 1,500 trios, the average true positive rate of individual rare variants that were polymorphic across 500 datasets was 12.20%, and the average false positive rate was 0.74%. In the datasets with 350 trios, the average true and false positive rates of individual rare variants were 13.10% and 1.30%, respectively. When applying TRIO_RVEMVS to real data from the Gabriella Miller Kids First Pediatric Research Program, it identified 3 rare variants in q24.21 and q24.22 associated with the risk of orofacial clefts in the Kids First European population.

Introduction

Birth defects are prevalent, occurring in 1 out of every 33 babies born annually in the United States, and are the primary cause of infant mortality, responsible for 20% of all infant deaths [1]. The impact of birth defects may be underestimated in mortality statistics [2], thus understanding the etiology of these major birth defects remains a research priority in birth defects epidemiology. Family data supports the hypothesis that a significant component of the risk for various birth defects stems from genetic variation [3–6]. Recent genome-wide association studies have identified common SNPs associated with varying birth defects, including obstructive heart defects (OHDs) [7], multiple congenital heart defect (CHD) phenotypes [8], conotruncal heart defects (CTD) [9], left-sided lesions (LSL) [10], and tetralogy of Fallot [11]. However, the identified variants only account for a small portion of the heritability. Part of the missing genetic heritability is thought to reside in rare variants, which are largely undetectable through genome-wide association platforms [12–17].

Although next-generation sequencing allows researchers to sequence each variant along the genome, there are statistical challenges inherent in identifying which rare variants are associated with disease. The power of traditional association methods for SNPs depends on allele frequency; the low frequency of minor alleles may reduce the power to analyze rare variants [18–22]. Previous methods are based on global tests that pool rare variants in a region to test the association with diseases. These global tests can be classified into burden tests (such as the Cohort Allelic Sums Test (CAST) [23], the Combined Multivariate and Collapsing (CMC) method [24], and the Variable Threshold (VT) method [25, 26]), or quadratic tests (like the C-alpha test [27] and the Sequence Kernel Association Test (SKAT) [28]).

Some methods have been developed that can identify specific rare variants that drive association within a given region of interest, as well as include common variants [24, 29–31]. However, these methods are based on a case-control design and are therefore subject to bias from population stratification [31–34]. Diseases that affect young children, like birth defects, are good candidates for family-based study designs, such as parent-child trios, which allow for more robust methods to analyze genetic data, including both common and rare variants, in the presence of population substructure [32, 35, 36]. Methods that take advantage of pedigree data for genetic variant association, such as PedGene [37] and RV-TDT [38], have been developed. PedGene extends kernel and burden statistics for unrelated case-control data to include known pedigree relationships, which can account for population-structured data [37]. Similarly, to avoid the spurious associations that population-based methods produce when population substructure and admixture exist, RV-TDT extends commonly used population-based methods to analyze the association of rare variants in population-structured data, including the aforementioned CMC [24] and VT [25]. However, none of these methods can detect individual rare variants within the region of interest. In this study, we propose TRIO_RVEMVS, a Bayesian framework for individual rare variant association analysis with expectation-maximization variable selection using family trio data, which can simultaneously detect common and rare variants at the individual variant level.

The paper is organized as follows: We begin by constructing the likelihood of common and rare variants using case-parent trios. Next, we detail the Bayesian framework of TRIO_RVEMVS, which includes specifying priors for common and rare variants’ coefficients, conducting posterior inference using the EM algorithm, and tuning selection parameters. To assess TRIO_RVEMVS, we perform simulations on both large (1500 case-trios) and small (350 case-trios) datasets and compare the results with those obtained using PedGene and RV-TDT. We then apply TRIO_RVEMVS to a real-world trio dataset from the Gabriella Miller Kids First Pediatric Research Program consisting of trios with a child suffering from cleft lip with or without cleft palate (CL+P). The paper concludes with a discussion of our findings.

Constructing the likelihood of common and rare variants using case-parents trios

In the case-parent trio design, small nuclear families are collected, where the child is affected by the disease or phenotype of interest. Then, the affected child and both parents are genotyped. Assuming each family can be phased, we denote the haplotype pair of the child [39]

$$g = (g_m, g_f)$$

where gm and gf denote the haplotypes inherited from the mother and father, respectively. Let D+ denote the event that the child is diseased, Θ denote the transmission parameters, and Gm and Gf denote the parental haplotype pairs of the mother and father, respectively. To model the sampling distribution of the case-trio family data, we propose to use the conditional logistic regression likelihood to model the probability of haplotype transmission from parents to the diseased child, which is motivated by the literature [39–41]. In more detail, the sampling distribution for observing a case trio can first be expressed as

$$P(g, G_m, G_f \mid \Theta, D^+) = P(g \mid G_m, G_f, \Theta, D^+)\,P(G_m, G_f \mid \Theta, D^+),$$

Due to Mendelian laws of inheritance, we can assume the transmission parameters contained in Θ are conditionally independent of the parents’ genotypes Gm and Gf, given that the child is diseased. Since the trios are sampled through the diseased child, there is no information regarding Θ in the sampling distribution of the parents’ haplotypes, which implies

$$P(G_m, G_f \mid \Theta, D^+) = P(G_m, G_f \mid D^+),$$

and

$$P(g, G_m, G_f \mid \Theta, D^+) \propto P(g \mid G_m, G_f, \Theta, D^+). \tag{1}$$

Eq (1) implies the sampling distribution will be based on P(g|Gm, Gf, Θ, D+), which can be generally modeled by a conditional logistic regression according to previous studies [39–41].

The conditional probability of disease given the haplotypes of parents can be derived similarly as in [40],

$$P(g \mid G_m, G_f, \Theta, D^+) = \frac{P(D^+ \mid g, \Theta)}{\sum_{j=1}^{4} P(D^+ \mid g_j, \Theta)}, \tag{2}$$

where gj denotes one of the four different haplotype pairs inheritable from parents, j = 1, 2, 3, 4. We use a logistic regression modeling framework to include both common and rare variants in the sampling distribution as

$$\log\left\{\frac{P(D^+ \mid g, \Theta)}{1 - P(D^+ \mid g, \Theta)}\right\} = g_c\beta + g_r\alpha, \tag{3}$$

where gc is a 1 × S vector which denotes the common SNPs, β is a S × 1 coefficient vector, gr is a 1 × L vector representing the rare SNPs, and α represents the effect of the selected rare variants on risk. For rare diseases, we can assume 1 − P(D+|g, Θ) ≃ 1. Therefore, Eq (3) simplifies to

$$\log P(D^+ \mid g, \Theta) = g_c\beta + g_r\alpha, \tag{4}$$

Similarly,

$$\log P(D^+ \mid g_j, \Theta) = g_{cj}\beta + g_{rj}\alpha, \tag{5}$$

where gcj and grj are the common and rare variants of gj. With Eqs (4) and (5), finally, Eq (2) can be expressed as

$$P(g \mid G_m, G_f, \Theta, D^+) = \frac{\exp(g_c\beta + g_r\alpha)}{\sum_{j=1}^{4}\exp(g_{cj}\beta + g_{rj}\alpha)}. \tag{6}$$

Eq (6) can be understood as the likelihood for a 1:3 matched case-control design, where the affected child plays the role of the case and is matched with the 3 other possible genetic configurations of children that could have been offspring of the same parents. These other configurations are commonly called pseudo-siblings. Therefore, the conditional likelihood function for trios is given by

$$L(g \mid G_m, G_f, \Theta, D^+) = \prod_{n=1}^{N} P(g_n \mid G_{nm}, G_{nf}, D^+, \Theta), \tag{7}$$

where g denotes the collection of haplotype pairs of children from the case-trio data, Gm and Gf are the collections of haplotype pairs of parents in the case-trio data, n = 1, 2, ⋯, N are the indexes of families, and P(gn|Gnm, Gnf, D+, Θ) can be calculated through Eq (6).
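As a minimal sketch of Eqs (6) and (7), the Python below enumerates the four haplotype pairs a child could inherit and evaluates the conditional log-likelihood. The data encoding (haplotypes as tuples of 0/1 allele codes), the additive genotype coding, and all function names are assumptions made for illustration; `coef` stands for the concatenation of β and α.

```python
import math
from itertools import product

def offspring_pairs(mother, father):
    """All four (maternal, paternal) haplotype pairs a child could inherit.

    `mother` and `father` are each a pair of haplotypes; a haplotype is a
    tuple of 0/1 allele codes (hypothetical encoding for this sketch).
    """
    return [(gm, gf) for gm, gf in product(mother, father)]

def genotype(pair):
    """Allele count per variant for a haplotype pair (assumed additive coding)."""
    gm, gf = pair
    return [a + b for a, b in zip(gm, gf)]

def trio_log_likelihood(child, mother, father, coef):
    """Log of Eq (6): the child's score against all four inheritable pairs."""
    def score(pair):
        return sum(x * w for x, w in zip(genotype(pair), coef))
    scores = [score(p) for p in offspring_pairs(mother, father)]
    top = max(scores)  # log-sum-exp shift for numerical stability
    log_denominator = top + math.log(sum(math.exp(s - top) for s in scores))
    return score(child) - log_denominator

def log_likelihood(trios, coef):
    """Log of Eq (7): sum of the per-family contributions."""
    return sum(trio_log_likelihood(c, m, f, coef) for c, m, f in trios)
```

With all coefficients equal to zero, every family contributes log(1/4), which reflects the four equally likely transmissions under Mendelian inheritance.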

Trio rare variants EMVS (TRIO_RVEMVS)

The sampling distribution and likelihood for trio data using common variants have been previously discussed [35, 39–42], modeling the probability of transmitting genes to the affected child. In the previous section, we show the probability of transmission can be extended to incorporate rare variants and be written as Eq (6). In summary, trio data are modeled using conditional logistic regression, similar to the model for 1:3 matched case-control data. Here, the affected child is matched with the 3 “pseudo-siblings” who could potentially be offspring of the parents. Using this sampling distribution as our likelihood, we proceed to outline the remaining mathematical details to develop a Bayesian framework for the variable selection for both common and rare variants associated with disease and implement the EM algorithm to compute the posterior quantities of interest [43].

Hierarchical priors

In this section, we construct the prior distributions to model inclusion in the risk model for each common and rare variant. For common variant selection, we used a single binary indicator, γs, to denote whether a common variant is included in the model. For rare variant selection, we implemented a dual selection indicator structure, which was first used in genetics for common variants with multiple alleles [42] and later modified for use in a Bayesian sparse group selection framework [44]. Specifically, we used a binary indicator, ηr, to indicate a group of related rare variants (typically those within the same gene) and a second indicator, λrj, to indicate individual variants within group r. Each selection indicator follows a Bernoulli distribution, with a prior inclusion probability parameter πi (i = 1, 2, 3):

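$$\begin{aligned}
\gamma_s \mid \pi_1 &\sim \mathrm{Ber}(\pi_1), && s = 1, 2, \dots, S,\\
\eta_r \mid \pi_2 &\sim \mathrm{Ber}(\pi_2), && r = 1, 2, \dots, R,\\
p(\lambda_{rj} \mid \eta_r, \pi_3) &= \eta_r\,\pi_3^{\lambda_{rj}}(1-\pi_3)^{1-\lambda_{rj}} + (1-\eta_r)\,\delta_0, && j = 1, 2, \dots, J_r,
\end{aligned}\tag{8}$$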

where δ0 is a point mass at zero, S denotes the total number of common variants, R denotes the total number of regions, and Jr denotes the total number of rare variants in region r. If γs = 1, it indicates that a common variant is included in the model; if γs = 0, it indicates otherwise. Similarly, ηr = 1 indicates that a group of rare variants in a defined region (typically a gene) is included in the model; ηr = 0 indicates otherwise. λrj = 1 indicates that an individual rare variant j in group r is included in the model; λrj = 0 indicates otherwise. To make the variable selection more flexible, we assume beta priors on πi,

$$\pi_i \sim \mathrm{Beta}(a_i, b_i),$$

with hyper-parameters ai and bi, i = 1, 2, 3. This creates a Beta-Bernoulli prior on all inclusion indicators. We use the hyper-parameters to balance power and multiplicity correction for the number of variants, similar to [29]. Specifically, we use ai = 1 (i = 1, 2, 3), and set b1 as the total number of common variants, b2 as the total number of groups of rare variants, and b3 as the total number of rare variants.

Conditional on the selection indicator, the prior for the coefficient of a common variant, βs, is defined as a normal distribution,

$$p(\beta_s \mid \gamma_s) = N(0, d_s) \tag{9}$$

where ds is defined by the corresponding γs:

$$d_s = \begin{cases} v_1 & \text{if } \gamma_s = 1,\\ v_0 & \text{if } \gamma_s = 0, \end{cases} \qquad s = 1, 2, \dots, S.$$

Here, we set v0 as a very small positive value that has the effect of restricting the value of βs to be close to 0 when the SNP is not selected and v1 (v1 > 0) large to allow the βs coefficient to be estimated. Defining the prior on βs in this way results in marginal normal mixture distributions that are defined by inclusion, typical of common Bayesian variable selection paradigms [29, 42, 43]. For each common variant, βs is distributed as a normal distribution with the following mean and variance

$$\mu_s = 0, \quad \text{and} \quad \mathrm{var}(\beta_s) = (1-\gamma_s)\,v_0 + \gamma_s\,v_1.$$

For the rare variant coefficient α = (α11, α12, ⋯, αrj, ⋯)′, we proposed:

$$p(\alpha_{rj} \mid \eta_r, \lambda_{rj}) \sim (1 - \eta_r\lambda_{rj})\,N(0, v_2) + \eta_r\lambda_{rj}\,N(0, v_3), \tag{10}$$

where r is the index of region, r = 1, 2, ⋯, R; and j is the index of individual rare variant in the region r, j = 1, 2, ⋯, Jr; ηr is the binary selection indicator of region r, λrj is the binary selection indicator of individual rare variant j in region r; v2 is the exclusion parameter of individual rare variant when either ηr = 0 or λrj = 0, v3 is the inclusion parameter of individual rare variant when both ηr = 1 and λrj = 1.
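To make the hierarchy in Eqs (8) to (10) concrete, the following sketch draws one realization from the priors using only the Python standard library. The function names and the list-based layout are hypothetical, and the Beta hyper-parameters follow the choices stated above (ai = 1, bi equal to the number of common variants, regions, and rare variants).

```python
import math
import random

def sample_common(S, v0, v1, rng):
    """Draw (gamma, beta) for S common variants from Eqs (8) and (9)."""
    pi1 = rng.betavariate(1, S)  # Beta(a1 = 1, b1 = S)
    gammas = [int(rng.random() < pi1) for _ in range(S)]
    # Spike variance v0 when excluded, slab variance v1 when included.
    betas = [rng.gauss(0.0, math.sqrt(v1 if g else v0)) for g in gammas]
    return gammas, betas

def sample_rare(region_sizes, v2, v3, rng):
    """Draw (eta, lambda, alpha) per region from Eqs (8) and (10)."""
    R, L = len(region_sizes), sum(region_sizes)
    pi2 = rng.betavariate(1, R)  # Beta(a2 = 1, b2 = R)
    pi3 = rng.betavariate(1, L)  # Beta(a3 = 1, b3 = L)
    draws = []
    for size in region_sizes:
        eta = int(rng.random() < pi2)
        # When eta = 0, every lambda in the region is fixed at 0 (point mass).
        lambdas = [eta * int(rng.random() < pi3) for _ in range(size)]
        alphas = [rng.gauss(0.0, math.sqrt(v3 if lam else v2)) for lam in lambdas]
        draws.append((eta, lambdas, alphas))
    return draws
```

The dual-indicator structure shows up in the `eta * ...` product: an individual rare variant can only receive the slab variance v3 when its region is also selected.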

Posterior inference using the EM algorithm

Let γ, η, and λ denote the collections of binary selection indicators for individual common variants, regions, and individual rare variants, γ = {γs, s = 1, 2, ⋯, S}, η = {ηr, r = 1, 2, ⋯, R}, and λ = {λrj, r = 1, 2, ⋯, R;j = 1, 2, ⋯, Jr}. The full posterior distribution is denoted as logP(β, α, π1, π2, π3, γ, η, λ|g, Gm, Gf, D+). Given the likelihood of Eq (7), and priors of Eqs (8) to (10) as defined above, the posterior distribution does not have a closed form. Instead of simulating large samples directly from the posterior using Markov Chain Monte Carlo (MCMC) methods, we employ the expectation maximization (EM) algorithm to estimate the posterior modes of interest. In the EM algorithm, we treat the variable selection indicators γ, η, and λ as missing data and alternate between conditional expectation using the current best estimates for the parameters and maximization of the expectation of the complete log-likelihood (Q function) to estimate the posterior modes of β and α [43].

For the E-step, we determine the Q-function which is the conditional expectation of log-likelihood of complete data with respect to the missing indicator variables, γ, η, and λ, given the current estimates of the unknown parameters β(k),α(k),π1(k), π2(k),π3(k), where k is the index of current iteration. For the M-step, we maximize the Q-function with respect to the parameters β, α, π1, π2, π3 and iterate both steps until convergence. For iteration k, the Q-function is defined as:

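$$Q\left[\beta, \alpha, \pi_1, \pi_2, \pi_3 \mid \beta^{(k)}, \alpha^{(k)}, \pi_1^{(k)}, \pi_2^{(k)}, \pi_3^{(k)}\right] = E_{\gamma,\eta,\lambda \mid \cdot}\left[\log L(\beta, \alpha, \pi_1, \pi_2, \pi_3, \gamma, \eta, \lambda \mid g, G_m, G_f, D^+)\right] \tag{11}$$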

where $E_{\gamma,\eta,\lambda \mid \cdot} = E_{\gamma,\eta,\lambda \mid \beta^{(k)}, \alpha^{(k)}, \pi_1^{(k)}, \pi_2^{(k)}, \pi_3^{(k)}, g, G_m, G_f, D^+}$, the distributions of γ, η, λ are given by Eq (8), and L(β, α, π1, π2, π3, γ, η, λ|g, Gm, Gf, D+) is the likelihood of complete data which is given by Eq (7).

E-step

The objective function Q can be simplified as the sum of conditional functions

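$$Q(\cdot) = C + Q_1\left[\beta, \alpha \mid \beta^{(k)}, \alpha^{(k)}, \pi_1^{(k)}, \pi_2^{(k)}, \pi_3^{(k)}\right] + Q_2\left[\pi_1 \mid \beta^{(k)}, \alpha^{(k)}, \pi_1^{(k)}\right] + Q_3\left[\pi_2 \mid \beta^{(k)}, \alpha^{(k)}, \pi_2^{(k)}\right] + Q_4\left[\pi_3 \mid \beta^{(k)}, \alpha^{(k)}, \pi_3^{(k)}\right], \tag{12}$$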

where C is a constant term and each of Q1, Q2, Q3, Q4 can be maximized independently. For convenience, we index the genotype of the case in each family as g0n, and the genotypes of the pseudo-siblings are indexed by i ∈ {1, 2, 3}. Then, the common and rare SNPs of the case child from family n are denoted as gc0n and gr0n, respectively; and the common and rare SNPs of pseudo-siblings from family n are denoted as gcin and grin, i = 1, 2, 3, respectively. For a total of N families, according to the likelihood function in Eqs (6) and (7), the Q-function with respect to β and α can be written as

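$$\begin{aligned}
Q_1\left[\beta, \alpha \mid \beta^{(k)}, \alpha^{(k)}, \pi_1^{(k)}, \pi_2^{(k)}, \pi_3^{(k)}\right]
&= \sum_{n=1}^{N} \log\left[\frac{\exp(g_{c0n}\beta + g_{r0n}\alpha)}{\exp(g_{c0n}\beta + g_{r0n}\alpha) + \sum_{i=1}^{3}\exp(g_{cin}\beta + g_{rin}\alpha)}\right] \\
&\quad - \frac{1}{2}\sum_{s=1}^{S} \beta_s^2\, E_{\gamma_s \mid \cdot}\left[\frac{1}{v_0(1-\gamma_s) + v_1\gamma_s}\right]
 - \frac{1}{2}\sum_{r=1}^{R}\sum_{j=1}^{J_r} \alpha_{rj}^2\, E_{\eta_r,\lambda_{rj} \mid \cdot}\left[\frac{1}{v_2(1-\eta_r\lambda_{rj}) + v_3\,\eta_r\lambda_{rj}}\right] \\
&= -\sum_{n=1}^{N} \log\left(1 + \sum_{i=1}^{3} e^{-x_{cin}\beta - x_{rin}\alpha}\right) - \frac{1}{2}\beta' P_c^{(k)}\beta - \frac{1}{2}\alpha' P_r^{(k)}\alpha
\end{aligned}\tag{13}$$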

where $x_{in} = g_{0n} - g_{in}$, which is a 1 × p vector; $P_c^{(k)}$ is an S × S diagonal matrix with elements $(1-p_s^{(k)})\frac{1}{v_0} + p_s^{(k)}\frac{1}{v_1}$, $s \in \{1, 2, \dots, S\}$; S is the total number of common variants; and $p_s^{(k)}$ is the conditional expectation of the inclusion indicator. Based on Bayes’ rule, $p_s^{(k)}$ can be calculated as follows.

$$p_s^{(k)} = E_{\gamma_s \mid \cdot}[\gamma_s] = P(\gamma_s = 1 \mid \beta^{(k)}, \pi_1^{(k)}) = \frac{a_s}{a_s + b_s}, \tag{14}$$

where $a_s = P(\beta_s^{(k)} \mid \gamma_s = 1)\,P(\gamma_s = 1 \mid \pi_1^{(k)})$, $b_s = P(\beta_s^{(k)} \mid \gamma_s = 0)\,P(\gamma_s = 0 \mid \pi_1^{(k)})$, and $P(\gamma_s = 1 \mid \pi_1^{(k)}) = \pi_1^{(k)}$. $P_r^{(k)}$ is an L × L diagonal matrix with elements $(1-q_{rj}^{(k)})\frac{1}{v_2} + q_{rj}^{(k)}\frac{1}{v_3}$, r = 1, 2, ⋯, R, j = 1, 2, ⋯, Jr, where L is the total number of rare variants, R is the number of groups (regions) of variants, Jr is the number of rare variants in group (region) r, and

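$$q_{rj}^{(k)} = E_{\lambda_{rj}=1 \mid \eta_r=1,\,\cdot}[\lambda_{rj}] = P(\lambda_{rj} = 1 \mid \eta_r = 1, \alpha_{rj}^{(k)}, \pi_2^{(k)}, \pi_3^{(k)}) = \frac{c_{rj}}{c_{rj} + d_{rj}}, \tag{15}$$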

where $c_{rj} = \pi_3^{(k)}\,P(\alpha_{rj}^{(k)} \mid \lambda_{rj} = 1, \eta_r = 1)$, and $d_{rj} = (1-\pi_3^{(k)})\,P(\alpha_{rj}^{(k)} \mid \lambda_{rj} = 0, \eta_r = 1)$. The second and third terms of the Q function can be calculated as

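$$Q_2\left[\pi_1 \mid \beta^{(k)}, \pi_1^{(k)}\right] = \sum_{s=1}^{S} E_{\gamma_s \mid \cdot}[\gamma_s]\,\log\left[\frac{\pi_1}{1-\pi_1}\right] + (2S-1)\log(1-\pi_1), \tag{16}$$

$$Q_3\left[\pi_2 \mid \alpha^{(k)}, \pi_2^{(k)}\right] = \sum_{r=1}^{R} E_{\eta_r \mid \cdot}[\eta_r]\,\log\left[\frac{\pi_2}{1-\pi_2}\right] + (2R-1)\log(1-\pi_2), \tag{17}$$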

where

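$$E_{\eta_r \mid \cdot}[\eta_r] = P(\eta_r = 1 \mid \alpha_{r1}^{(k)}, \dots, \alpha_{rJ_r}^{(k)}, \pi_2^{(k)}, \pi_3^{(k)}) = \frac{e_r}{e_r + f_r} \equiv g_r^{(k)}, \tag{18}$$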

in which er and fr are defined as

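$$e_r = \pi_2^{(k)} \prod_{j=1}^{J_r}\left[\pi_3^{(k)}\,P(\alpha_{rj}^{(k)} \mid \lambda_{rj} = 1, \eta_r = 1) + (1-\pi_3^{(k)})\,P(\alpha_{rj}^{(k)} \mid \lambda_{rj} = 0, \eta_r = 1)\right],$$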

and

$$f_r = (1-\pi_2^{(k)}) \prod_{j=1}^{J_r} P(\alpha_{rj}^{(k)} \mid \lambda_{rj} = 0, \eta_r = 0).$$

The last term of the Q function is written as

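$$\begin{aligned}
Q_4\left[\pi_3 \mid \alpha^{(k)}, \pi_2^{(k)}, \pi_3^{(k)}\right]
&= E_{\eta,\lambda \mid \cdot}\log\left[P(\pi_3)\prod_{r=1}^{R}\prod_{j=1}^{J_r} P(\lambda_{rj} \mid \eta_r, \pi_3)\right] \\
&= \sum_{r=1}^{R} g_r^{(k)} \sum_{j=1}^{J_r} q_{rj}^{(k)} \log\left(\frac{\pi_3}{1-\pi_3}\right) + \left(L + \sum_{r=1}^{R} g_r^{(k)} J_r - 1\right)\log(1-\pi_3)
\end{aligned}\tag{19}$$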
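The conditional inclusion probabilities in Eqs (14), (15) and (18) can be sketched as below. This is an illustrative implementation rather than the authors' code: the function names are hypothetical, the normal densities follow Eqs (9) and (10), and the ratios are computed on the log scale purely for numerical stability.

```python
import math

def log_normal_pdf(x, variance):
    """Log density of N(0, variance) evaluated at x."""
    return -0.5 * x * x / variance - 0.5 * math.log(2.0 * math.pi * variance)

def share(log_a, log_b):
    """Return a / (a + b) given log(a) and log(b)."""
    return 1.0 / (1.0 + math.exp(log_b - log_a))

def common_inclusion(beta_s, pi1, v0, v1):
    """Eq (14): p_s = a_s / (a_s + b_s)."""
    log_a = math.log(pi1) + log_normal_pdf(beta_s, v1)
    log_b = math.log(1.0 - pi1) + log_normal_pdf(beta_s, v0)
    return share(log_a, log_b)

def rare_inclusion(alpha_rj, pi3, v2, v3):
    """Eq (15): q_rj = c_rj / (c_rj + d_rj)."""
    log_c = math.log(pi3) + log_normal_pdf(alpha_rj, v3)
    log_d = math.log(1.0 - pi3) + log_normal_pdf(alpha_rj, v2)
    return share(log_c, log_d)

def region_inclusion(alphas_r, pi2, pi3, v2, v3):
    """Eq (18): g_r = e_r / (e_r + f_r) for the rare variants of one region."""
    log_e = math.log(pi2)
    log_f = math.log(1.0 - pi2)
    for alpha in alphas_r:
        slab = math.log(pi3) + log_normal_pdf(alpha, v3)
        spike = math.log(1.0 - pi3) + log_normal_pdf(alpha, v2)
        top = max(slab, spike)
        # log of the two-component mixture density inside e_r
        log_e += top + math.log(math.exp(slab - top) + math.exp(spike - top))
        log_f += log_normal_pdf(alpha, v2)
    return share(log_e, log_f)
```

For a coefficient estimate at zero, the spike density dominates and the inclusion probability is small; as the estimate moves away from zero, the slab takes over and the probability approaches one.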

M-step

For the M-step, we maximize Q1, Q2, Q3 and Q4 separately. There is no closed-form solution for Q1 function. However, maximizing the Q1 with respect to β and α is equivalent to a minimization problem with respect to parameter ω (p × 1), where p = S + L, S is the total number of common variants, and L is the total number of rare variants. Based on Eq (13),

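$$\begin{aligned}
\max_{\beta,\alpha} Q_1\left[\beta, \alpha \mid \beta^{(k)}, \alpha^{(k)}, \pi_1^{(k)}, \pi_2^{(k)}, \pi_3^{(k)}\right]
&= \max_{\beta,\alpha}\left(-\sum_{n=1}^{N}\log\left(1 + \sum_{i=1}^{3} e^{-x_{cin}\beta - x_{rin}\alpha}\right) - \frac{1}{2}\beta' P_c^{(k)}\beta - \frac{1}{2}\alpha' P_r^{(k)}\alpha\right) \\
&= \min_{\beta,\alpha}\left(\sum_{n=1}^{N}\log\left(1 + \sum_{i=1}^{3} e^{-x_{cin}\beta - x_{rin}\alpha}\right) + \frac{1}{2}\beta' P_c^{(k)}\beta + \frac{1}{2}\alpha' P_r^{(k)}\alpha\right) \\
&= \min_{\omega}\left(\sum_{n=1}^{N}\log\left(1 + \sum_{i=1}^{3} e^{-x_{in}\omega}\right) + \frac{1}{2}\omega' P^{(k)}\omega\right),
\end{aligned}\tag{20}$$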

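where $x_{in} = (x_{cin}, x_{rin})$, $\omega = (\beta_1, \beta_2, \dots, \beta_S, \alpha_{11}, \alpha_{12}, \dots, \alpha_{rj}, \dots, \alpha_{RJ_R})'$, and $P^{(k)}$ is a p × p (p = S + L) diagonal matrix with diagonal elements $(1-p_s^{(k)})\frac{1}{v_0} + p_s^{(k)}\frac{1}{v_1}$, $s \in \{1, 2, \dots, S\}$, for the first S elements, and $(1-q_{rj}^{(k)})\frac{1}{v_2} + q_{rj}^{(k)}\frac{1}{v_3}$ for the remaining L elements.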
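The minimization problem in Eq (20) can be illustrated with a small sketch that evaluates the objective and its gradient. For simplicity it uses plain gradient descent with a backtracking line search, which is a stand-in chosen for readability and not the solver used in this study; the function names and the nested-list layout of the difference vectors are assumptions.

```python
import math

def dot(x, w):
    return sum(a * b for a, b in zip(x, w))

def objective(omega, diffs, penalty):
    """Eq (20): sum_n log(1 + sum_i exp(-x_in . omega)) + 0.5 * omega' P omega.

    diffs[n] holds the three difference vectors x_in = g_0n - g_in of family n;
    penalty is the diagonal of P^(k).
    """
    value = 0.5 * sum(p * w * w for p, w in zip(penalty, omega))
    for family in diffs:
        value += math.log1p(sum(math.exp(-dot(x, omega)) for x in family))
    return value

def gradient(omega, diffs, penalty):
    """Gradient of the objective above with respect to omega."""
    grad = [p * w for p, w in zip(penalty, omega)]
    for family in diffs:
        weights = [math.exp(-dot(x, omega)) for x in family]
        denom = 1.0 + sum(weights)
        for x, weight in zip(family, weights):
            for j, xj in enumerate(x):
                grad[j] -= weight / denom * xj
    return grad

def minimize(diffs, penalty, iterations=200):
    """Gradient descent with backtracking (illustrative stand-in solver)."""
    omega = [0.0] * len(penalty)
    for _ in range(iterations):
        grad = gradient(omega, diffs, penalty)
        norm2 = sum(g * g for g in grad)
        current = objective(omega, diffs, penalty)
        step = 1.0
        while step > 1e-12:
            trial = [w - step * g for w, g in zip(omega, grad)]
            if objective(trial, diffs, penalty) <= current - 1e-4 * step * norm2:
                omega = trial
                break
            step *= 0.5
        else:
            break  # no acceptable step was found: treat as converged
    return omega
```

Because the penalty matrix is diagonal, a variant with a small inclusion probability receives a large penalty weight (close to 1/v0 or 1/v2), which shrinks its coefficient toward zero in the next iteration.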

Since the negative log-likelihood of the conditional logistic regression and $\frac{1}{2}\omega' P^{(k)}\omega$ are convex functions of ω [43], we used stochastic dual coordinate ascent (SDCA) [45, 46], an efficient technique for solving regularized loss minimization problems in machine learning, to solve the minimization problem above. In particular, the accelerated mini-batch SDCA was implemented, and the details of the algorithm can be found in the Supplemental Materials [46]. Accordingly, we compute the β and α estimates for the next iteration based on the Q1 function. The remaining components Q2, Q3, and Q4 have closed forms. The derivations of the closed-form maximizers of Q2, Q3, and Q4 are shown in the Supplemental Materials. The closed form solution for Q2 is:

$$\pi_1^{(k+1)} = \frac{\sum_{s=1}^{S} p_s^{(k)}}{2S-1} \tag{21}$$

The closed form solution for Q3 is:

$$\pi_2^{(k+1)} = \frac{\sum_{r=1}^{R} g_r^{(k)}}{2R-1} \tag{22}$$

The closed form solution for Q4 is:

$$\pi_3^{(k+1)} = \frac{\sum_{r=1}^{R} g_r^{(k)} \sum_{j=1}^{J_r} q_{rj}^{(k)}}{L + \sum_{r=1}^{R} g_r^{(k)} J_r - 1} \tag{23}$$
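The three closed-form updates in Eqs (21) to (23) translate directly into code. The sketch below assumes the E-step quantities are stored as plain Python lists (`p` for the common variants, `g` for the regions, and `q[r]` for the rare variants of region r); these names and the layout are hypothetical.

```python
def update_pi1(p):
    """Eq (21): p[s] is the inclusion probability of common variant s."""
    return sum(p) / (2 * len(p) - 1)

def update_pi2(g):
    """Eq (22): g[r] is the inclusion probability of region r."""
    return sum(g) / (2 * len(g) - 1)

def update_pi3(g, q):
    """Eq (23): q[r][j] is the inclusion probability of rare variant j in region r."""
    total_rare = sum(len(q_r) for q_r in q)  # L
    numerator = sum(g_r * sum(q_r) for g_r, q_r in zip(g, q))
    denominator = total_rare + sum(g_r * len(q_r) for g_r, q_r in zip(g, q)) - 1
    return numerator / denominator
```

The denominators exceed the number of indicators because the Beta(1, b) priors with b equal to the number of variants or regions act as additional pseudo-observations of exclusion, which is how the multiplicity correction enters the updates.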

The algorithm is considered to have converged when the difference between two successive observed-data log-likelihoods is less than ϵ, i.e.,

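$$\ell(\theta^{(k+1)} \mid y_{obs}) - \ell(\theta^{(k)} \mid y_{obs}) = \left[Q(\theta^{(k+1)}, \theta^{(k)}) - Q(\theta^{(k)}, \theta^{(k)})\right] - \left[R(\theta^{(k+1)}, \theta^{(k)}) - R(\theta^{(k)}, \theta^{(k)})\right] < \epsilon, \tag{24}$$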

where ϵ is a pre-specified threshold, θ = (β, α, π1, π2, π3), and

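$$\begin{aligned}
R\left[\beta, \alpha, \pi_1, \pi_2, \pi_3 \mid \beta^{(k)}, \alpha^{(k)}, \pi_1^{(k)}, \pi_2^{(k)}, \pi_3^{(k)}\right]
&= E_{\gamma,\eta,\lambda \mid \cdot}\left[\log P(\gamma, \eta, \lambda \mid g, G_m, G_f, D^+, \beta, \alpha, \pi_1, \pi_2, \pi_3)\right] \\
&= \sum_{s=1}^{S} E_{\gamma_s \mid \cdot}[\gamma_s]\,\log\left[\frac{\pi_1}{1-\pi_1}\right] + S\log(1-\pi_1)
 + \sum_{r=1}^{R} E_{\eta_r \mid \cdot}[\eta_r]\,\log\left[\frac{\pi_2}{1-\pi_2}\right] + R\log(1-\pi_2) \\
&\quad + \sum_{r=1}^{R} g_r^{(k)} \sum_{j=1}^{J_r} q_{rj}^{(k)} \log\left(\frac{\pi_3}{1-\pi_3}\right) + \left(\sum_{r=1}^{R} g_r^{(k)} J_r\right)\log(1-\pi_3).
\end{aligned}\tag{25}$$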

Deterministic annealing

Though the conventional EM algorithm has attractive features, it can become trapped in local maxima of multimodal posterior distributions. To enhance the chance of discovering a global mode, the Deterministic Annealing variant of the EM algorithm (DAEM) is considered [47]. During each DAEM iteration, the conditional probability of inclusion indicators is parameterized by temperature 1/t (0 < t < 1). Therefore, ps(k), qrj(k) and gr(k) from Eqs (14), (15) and (18) were substituted by:

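$$p_s^{(k)} = \frac{a_s^{t}}{a_s^{t} + b_s^{t}} \tag{26}$$

$$q_{rj}^{(k)} = \frac{c_{rj}^{t}}{c_{rj}^{t} + d_{rj}^{t}} \tag{27}$$

$$g_r^{(k)} = \frac{e_r^{t}}{e_r^{t} + f_r^{t}} \tag{28}$$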

where as, bs, crj, drj, er and fr are the same as in Eqs (14), (15) and (18). In this study, for both simulation and real-data analysis, the initial value of coefficients of common and rare variants is 0.5; the initial value of t is 0.1 with an incremental value of 0.1.
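A short sketch of the tempering in Eqs (26) to (28) is given below. The function names are hypothetical, and the assumption that the schedule stops at t = 1 (where the untempered probabilities are recovered) is ours; the text only states the starting value and the increment.

```python
def tempered_share(a, b, t):
    """Eqs (26)-(28): a^t / (a^t + b^t) for positive a and b."""
    return a ** t / (a ** t + b ** t)

def temperature_schedule(start=0.1, increment=0.1, stop=1.0):
    """Values of t from `start` to `stop` (the stopping value is an assumption)."""
    steps = int(round((stop - start) / increment)) + 1
    return [round(start + k * increment, 10) for k in range(steps)]
```

Small values of t pull every inclusion probability toward 0.5, which flattens the posterior surface early on and makes it less likely that the algorithm commits to a poor local mode before the coefficients have stabilized.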

Selection parameter tuning

Here, we present a recommendation for tuning the selection parameters in TRIO_RVEMVS. Specifically, selection is controlled through the exclusion parameters, v0 and v2, and the inclusion parameters, v1 and v3, similar to [43]. To determine suitable values for these parameters, we first considered an odds ratio within [0.95, 1.05] to be clinically irrelevant. We then chose the inclusion parameters so that, for an included variant, the prior places 95% probability on odds ratios within [0.29, 3.45] for common variants and [0.27, 4.3] for rare variants; thus, we set v1 = 0.4 and v3 = 0.5. Next, we evaluated the local stability of regularization plots with respect to the exclusion parameters. The tuning process proceeded as follows:

  • We initially set the exclusion parameters for common and rare variants to be equal, i.e. v0 = v2, and chose the common exclusion parameter in the local stable regularization plot window.

  • Subsequently, with the common exclusion parameter fixed, we evaluated the local stability of the regularization plot with respect to the rare variant exclusion parameter v2, and chose the v2 within a stable window defined by at least 3 points in the grid where no shrinkage occurs.

This procedure was applied in both simulated and real data analyses.
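The link between an inclusion variance and a 95% prior range of odds ratios can be checked numerically. The sketch below assumes the range is the central 95% interval of a N(0, v) prior on the log odds ratio, shown here for the common-variant value v1 = 0.4; the function name is hypothetical.

```python
import math

def odds_ratio_interval(variance, z=1.959964):
    """Central 95% prior interval for exp(coefficient) when coefficient ~ N(0, variance)."""
    half_width = z * math.sqrt(variance)
    return math.exp(-half_width), math.exp(half_width)

low, high = odds_ratio_interval(0.4)  # v1 = 0.4 for common variants
```

Under this assumption, `low` and `high` round to 0.29 and 3.45, matching the common-variant range quoted above.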

Simulation

We simulated two scenarios consisting of 500 datasets each to evaluate the performance of TRIO_RVEMVS in identifying regions/genes of interest relative to existing methods (PedGene [37] and RV-TDT [38]) as well as its ability to identify individual variants. One scenario generates 1500 case-parent trios per data set, while the other involves generating 350 case-parent trios per dataset. These scenarios allowed us to assess the performance of TRIO_RVEMVS under large-sample and small-sample conditions, respectively.

To measure the overall performance of region detection, we calculated the weighted average correct association percentage with

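$$\frac{1}{2}\left[\frac{\sum P(\text{selected} \mid \text{associated})}{\text{total number of associated regions}} + \frac{\sum P(\text{unselected} \mid \text{unassociated})}{\text{total number of unassociated regions}}\right] \tag{29}$$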

Considering that most of the rare variants were not polymorphic across all data sets, we defined the Average True and False Positive Rate (ATPR and AFPR) for individual variants as follows:

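$$\begin{aligned}
\mathrm{ATPR} &= \frac{1}{\#\text{ of data sets}} \sum_{\text{data set } d} \frac{N_d(\text{selected} \mid \text{associated})}{N_d(\#\text{ of polymorphic associated variants})} \\
\mathrm{AFPR} &= \frac{1}{\#\text{ of data sets}} \sum_{\text{data set } d} \frac{N_d(\text{selected} \mid \text{unassociated})}{N_d(\#\text{ of polymorphic unassociated variants})}
\end{aligned}\tag{30}$$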

where Nd(selected|⋅) denotes the number of detected variants given the variants are associated or unassociated in data set d.
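The per-data-set normalization in Eq (30) can be sketched as follows. Variants are represented as sets of identifiers, and each data set is assumed to contain at least one polymorphic associated and one polymorphic unassociated variant; the function names and the tuple layout are hypothetical.

```python
def dataset_rates(selected, causal, polymorphic):
    """True and false positive rates within one data set (sets of variant ids)."""
    causal_poly = causal & polymorphic   # associated variants observed as polymorphic
    null_poly = polymorphic - causal     # unassociated polymorphic variants
    tpr = len(selected & causal_poly) / len(causal_poly)
    fpr = len(selected & null_poly) / len(null_poly)
    return tpr, fpr

def average_rates(datasets):
    """Eq (30): ATPR and AFPR averaged over (selected, causal, polymorphic) triples."""
    rates = [dataset_rates(*d) for d in datasets]
    atpr = sum(t for t, _ in rates) / len(rates)
    afpr = sum(f for _, f in rates) / len(rates)
    return atpr, afpr
```

Restricting the denominators to polymorphic variants means a causal variant that never appears in a given sample does not count against the method in that sample.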

Data simulation

We simulated the population of haplotypes using Cosi2 [48], which is a forward-time genetic simulator. We used the 1000 Genomes Project [49] haplotypes as the reference population for Cosi2 and simulated a 30kb region of chromosome 1, consisting of 45965 SNPs. We simulated populations of 80,000 African haplotypes and 80,000 European haplotypes. Then we constructed 500 samples, each consisting of 60,000 haplotypes (15,000 African and 45,000 European, reflecting 25% and 75%, respectively) from the simulated population (with replacement). For each of the 500 samples of haplotypes, we randomly selected and paired haplotypes within the race to construct individuals, and then randomly paired individuals within the race to construct parents. Then from each set of parents, we randomly selected one haplotype from each parent to be transmitted to the child to form 15,000 trios. The full simulation algorithm is described in the Supplemental Materials.
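The final pairing and transmission steps can be sketched as follows. This is a simplified illustration that draws from a single haplotype pool and ignores the within-race pairing described above; the function name and the tuple encoding are hypothetical, and the full algorithm is the one given in the Supplemental Materials.

```python
import random

def build_trio(pool, rng):
    """Form two parents from a haplotype pool and transmit one haplotype from each."""
    mother = (rng.choice(pool), rng.choice(pool))  # sampling with replacement
    father = (rng.choice(pool), rng.choice(pool))
    child = (rng.choice(mother), rng.choice(father))
    return child, mother, father
```

Each parental haplotype is transmitted with probability 1/2, so the four possible offspring configurations are equally likely before disease status is taken into account.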

We defined a gene region as 2700 base pairs, resulting in 12 simulated gene regions. Within gene regions, we used population allele frequencies to determine rare and common variants. Since we simulated admixed samples, we computed a weighted minor allele frequency (MAF) for each SNP. This was based on the frequency estimates from our reference genome (the 1000 Genomes Project) for each population (African and European), weighted by the proportion of each population admixed for our simulation (25% African and 75% European). Rare variants were defined as variants with a weighted MAF < 0.05. For simplicity, we simulated our causal SNPs on genes 1–6. We modeled disease based on two causal common SNPs (risk-increasing allele on gene 3 and risk-decreasing allele on gene 6) and 5% of any variant with weighted MAF less than 3% were randomly chosen as causal rare variants. In total, 1212 rare variants were simulated as causal to the simulated disease, with 606 simulated as risk increasing and 606 as risk decreasing. The distribution of associated rare variants across the 6 genes is reported in Table 1.

Table 1. Number of associated rare variants by range of weighted MAF and region in the haplotype pool.

Across all simulated data sets, most of the variants with weighted MAF less than 0.0001 were non-polymorphic, singletons, doubletons, or tripletons.

| Region | (0, 0.0001) | [0.0001, 0.001) | [0.001, 0.01) | [0.01, 0.05) |
|---|---|---|---|---|
| 1 | 204 | 4 | 3 | 1 |
| 2 | 172 | 2 | 0 | 0 |
| 3 | 204 | 3 | 2 | 0 |
| 4 | 220 | 5 | 0 | 0 |
| 5 | 197 | 2 | 1 | 0 |
| 6 | 185 | 6 | 1 | 0 |
| Total | 1182 | 22 | 7 | 1 |

We simulated the disease status of each child according to the logistic model:

$$\mathrm{logit}(P(y = 1)) = \alpha_0 + \beta_1 G_1^{c} + \beta_2 G_2^{c} + \cdots + \beta_p G_p^{c}, \tag{31}$$

where $G_1^{c}, G_2^{c}, \dots, G_p^{c}$ were the children’s genotypes for the p causal variants (consisting of the 2 common variants, with the rest being rare variants), and we set α0 = −2.2 to keep the disease incidence low. For the simulation, we set the magnitude of the coefficients for our causal common variants to be 0.9, and the magnitude of the coefficients for causal rare variants was computed as in [28]: $c\,|\log_{10}\mathrm{MAF}_i|$, where c = 0.4 for causal risk rare variants, and c = −0.4 for causal protective rare variants, and MAFi is the weighted MAF of locus i. Assigning the coefficients in this way results in rarer variants having a stronger effect on disease.
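The disease model in Eq (31) and the MAF-dependent effect sizes can be sketched as below. The function names and the additive genotype coding are assumptions for illustration; the intercept and the constant c are the values stated above.

```python
import math
import random

def rare_effect(maf, protective=False, c=0.4):
    """Effect size c * |log10(MAF)|, negated for protective variants."""
    magnitude = c * abs(math.log10(maf))
    return -magnitude if protective else magnitude

def disease_probability(genotypes, effects, intercept=-2.2):
    """Eq (31): inverse logit of the linear predictor."""
    eta = intercept + sum(g * b for g, b in zip(genotypes, effects))
    return 1.0 / (1.0 + math.exp(-eta))

def simulate_status(genotypes, effects, rng, intercept=-2.2):
    """Draw a disease indicator (True = diseased) for one child."""
    return rng.random() < disease_probability(genotypes, effects, intercept)
```

For a child carrying no causal alleles, the disease probability equals the inverse logit of −2.2, roughly 10%, and a single copy of a variant with weighted MAF 0.001 shifts the log odds by 1.2 in either direction.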

The distribution of associated rare variants across 6 regions is shown in Table 1. After simulating diseased probands in each set of 15,000 trios, we generated 500 replicates of 1,500 case-parent trios, and another 500 replicates of 350 case-parent trios to assess performance for large and small sample sizes, respectively.

Due to sampling, some loci that were polymorphic in the population with rare variants became non-polymorphic in the 500 samples, and the number of polymorphic loci varied across the 500 data sets. Some of the polymorphic loci with rare variants that were in the disease-generating model were not polymorphic in one or more samples. The distribution of polymorphic variants across the 500 data sets for each sample size is depicted in Table 2.

Table 2. Summary statistics of data sets with 1,500 and 350 case-trios.

| Summary statistics | 1500 case-trios | 350 case-trios |
|---|---|---|
| Average # polymorphic variants | 4008 | 1442 |
| Average # polymorphic common variants | 48 | 47 |
| Average # polymorphic causal common variants | 2 | 2 |
| Average # polymorphic rare variants | 3960 | 1395 |
| Average # polymorphic causal rare variants | 132 | 45 |
| # polymorphic variants across 500 data sets | 431 | 133 |
| # polymorphic causal variants across 500 data sets | 12 | 4 |

Analysis of simulated data

We used the same analysis strategy across all simulated datasets. First, we classified variants as either common or rare in each simulated data set. We estimated the MAF of each variant in each dataset for each sample size. For each sample size, we defined rare variants based on the median MAF across the 500 data sets, with the threshold of rare variants being median MAF < 5%. To assess performance in identifying regions of interest, we applied all three methods (TRIO_RVEMVS, PedGene, and RV-TDT) to the 500 data sets, summarizing the performance of each method using the weighted average correct association metric. Since only TRIO_RVEMVS identifies specific rare variants, we also report a similar weighted correct association metric for individual rare variants. For TRIO_RVEMVS, the detailed exclusion parameter tuning using regularization plot can be found in the Supplemental Materials.

Simulation results

In this section, we first compare selection at the region level with PedGene [37] and RV-TDT [38] with the weighted average correct association percentage. Specifically, we compared TRIO_RVEMVS with PedGene kernel and burden methods [37]. For RV-TDT, we evaluated different variants, such as BRV-Haplo, VT-BRV-Haplo, WSS-Haplo, CMC-Analytical, CMC-Haplo, and VT-CMC-Haplo [38]. TRIO_RVEMVS outperformed both PedGene and RV-TDT when jointly considering common and rare variants. Our simulation analyses also confirmed that PedGene showed improved performance in detecting rare variants compared to RV-TDT [50]. We conclude our simulation analysis by discussing the capacity of TRIO_RVEMVS to detect individual rare variants, which is not accomplished by either PedGene or RV-TDT.

We compared the performance of all methods in two ways. First, we compared each method’s ability to select risk regions based on rare variants only. Second, we compared each method’s ability to identify risk regions when jointly analyzing rare and common variants. For all analyses, we defined rare variants as SNPs whose MAF < 5%. Panels a) and b) of Fig 1 show the true and false positive rates of region selection when jointly analyzing common and rare variants using datasets with 1500 case-trios. Panel a) shows that TRIO_RVEMVS outperformed both PedGene and RV-TDT, with higher true positive rates in all 6 simulated causal regions except region 5, where PedGene had a higher true positive rate. For the false positive rates across regions 7–12, PedGene and TRIO_RVEMVS were competitive: TRIO_RVEMVS had lower false positive rates than PedGene in regions 7, 8, 10, and 12, while PedGene had slightly lower false positive rates in regions 9 and 11. Panels c) and d) of Fig 1 show the true and false positive rates when focusing solely on rare variant detection for data sets with 1500 case-trios. TRIO_RVEMVS outperformed PedGene in regions 1 and 3 in terms of true positive rate but performed just behind PedGene in regions 2, 4, 5, and 6.

Fig 1. For the data sets with 1,500 case-trios, panels (a) and (b) showed the true and false positive rates of analyzing regions using both common and rare variants; panels (c) and (d) showed the true and false positive rates of regions with rare variants only.


Table 3 summarizes the weighted average correct association percentage (WACAP), Eq (29), for TRIO_RVEMVS, RV-TDT, and PedGene. TRIO_RVEMVS shows the highest WACAP when selecting both common and rare variants. When focusing on using rare variants only, TRIO_RVEMVS was competitive with PedGene-Kernel, but did not always have the highest WACAP.

Table 3. The comparison of weighted average correct association percentage between TRIO_RVEMVS, PedGene and RV-TDT with and without common variants.

Methods | 1500 case-trios, common and rare | 1500 case-trios, rare only | 350 case-trios, common and rare | 350 case-trios, rare only
--- | --- | --- | --- | ---
TRIO_RVEMVS | 74.53 | 60.55 | 66.37 | 52.07
PedGene-kernel | 66.17 | 61.25 | 55.73 | 52.65
PedGene-burden | 50.97 | 50.77 | 49.90 | 49.70
CMC-Analytical | 32.80 | 55.86 | 36.32 | 52.35
BRV-Haplo | 42.60 | 56.98 | 40.58 | 53.08
CMC-Haplo | 41.23 | 56.68 | 39.00 | 52.95
VT-BRV-Haplo | 42.60 | 55.25 | 52.73 | 52.65
VT-CMC-Haplo | 41.21 | 54.53 | 57.33 | 52.10
WSS-Haplo | 52.65 | 55.33 | 50.18 | 52.75

In the 350 case-trio data sets, TRIO_RVEMVS outperformed both PedGene and RV-TDT with respect to the weighted average correct association percentage when common and rare variants were analyzed jointly (Table 3). TRIO_RVEMVS achieved the highest WACAP among all methods at 66.37%, and VT-CMC-Haplo had the second-highest at 57.33%. In each region, we observed a consistent pattern of true and false positive rates when analyzing both common and rare variants (Fig 2). Specifically, TRIO_RVEMVS outperformed the other methods in true positive rate in regions 1, 3, and 6 (Fig 2, panel a). With respect to the false positive rate, TRIO_RVEMVS performed similarly to PedGene, and both had lower false positive rates than RV-TDT (Fig 2, panel b). When analyzing only rare variants, all methods had similar WACAP values (Table 3). Because of the smaller sample size, all methods showed low power to detect the causal rare variants in general (Fig 2, panels c and d). Therefore, compared to the 1500 case-trios, we did not observe notably higher true positive rates or false positive rates at the region level when analyzing only rare variants in the 350 case-trios data analysis.

Fig 2. For the data sets with 350 case-trios, panels (a) and (b) show the true and false positive rates of region selection with common variants included; panels (c) and (d) show the true and false positive rates of region selection with rare variants only.

At the individual variant level, TRIO_RVEMVS detected 94 variants, 8 of them causal, in the 500 datasets with 1500 case-trios, and 57 variants, 4 of them causal, in the 500 datasets with 350 case-trios. The true positive rate (TPR) and false positive rate (FPR) of detected variants are shown in Figs 3 and 4. Because the low MAF of rare variants means that some of them are not polymorphic in every dataset, we primarily report the individual-level selection results for variants that were polymorphic across all datasets (Table 4). Individual-level selection results for variants that were not polymorphic in every dataset are summarized in the Supplemental Materials. For datasets with 1500 case-trios, the average true positive rates (ATPRs) were 26.83% and 12.20% with and without common variants, respectively. The two causal common variants were consistently detected, with an ATPR of 100%. The average false positive rates (AFPRs) were 0.67% and 0.74% with and without common variants, respectively. For datasets with 350 case-trios, when the common and rare variants were jointly analyzed, the ATPR was 48.45% and the AFPR was 1.38%; when only rare variants were analyzed, the ATPR was 13.10% and the AFPR was 1.30%. ATPR and AFPR for variants in different ranges of MAF are summarized in Table 4.

Fig 3. True and false positive rates of SNPs, with their corresponding median MAF, in 500 data sets with 1500 case-trios.

The horizontal dashed line represents a rate threshold of 0.05; the vertical line separates the causal and non-causal variants. Black marks true positive variants and grey marks false positive variants; dots denote rare variants and triangles denote common variants.

Fig 4. True and false positive rates of SNPs, with their corresponding median MAF, in 500 data sets with 350 case-trios.

The horizontal dashed line represents a rate threshold of 0.05; the vertical line separates the causal and non-causal variants. Black marks true positive variants and grey marks false positive variants; dots denote rare variants and triangles denote common variants.

Table 4. The average true and false positive rate of individual variants detection in different median MAF ranges for variants that were polymorphic across 500 datasets with different sample sizes.

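Measure | Sample size | MAF<0.01 | 0.01≤MAF<0.05 | MAF≥0.05 | Total
--- | --- | --- | --- | --- | ---
Number of associated | 1500 | 9 | 1 | 2 | 12
Number of associated | 350 | 1 | 1 | 2 | 4
ATPR (%) | 1500 | 3.08 | 94.2 | 100 | 26.83
ATPR (%) | 350 | 0 | 26.2 | 83.8 | 48.45
Number of unassociated | 1500 | 332 | 41 | 46 | 419
Number of unassociated | 350 | 43 | 40 | 46 | 129
AFPR (%) | 1500 | 0.09 | 6.08 | 0.09 | 0.67
AFPR (%) | 350 | 0 | 2.70 | 1.51 | 1.38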
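
As a check on how the per-bin entries of Table 4 relate to the overall rates quoted in the text, the following sketch pools the bins for the 1500 case-trio datasets. It assumes that each overall rate is the count-weighted mean of the per-bin rates, which is an interpretation of the table rather than a formula stated in the paper; the helper name `pooled_rate` is hypothetical.

```python
def pooled_rate(counts, rates):
    """Count-weighted mean of per-bin rates (rates given in percent)."""
    return sum(c * r for c, r in zip(counts, rates)) / sum(counts)


# Table 4, 1500 case-trios. Bin order: MAF<0.01, 0.01<=MAF<0.05, MAF>=0.05.
assoc_n = [9, 1, 2]          # number of associated variants per bin
atpr = [3.08, 94.2, 100.0]   # ATPR (%) per bin
unassoc_n = [332, 41, 46]    # number of unassociated variants per bin
afpr = [0.09, 6.08, 0.09]    # AFPR (%) per bin

atpr_all = pooled_rate(assoc_n, atpr)             # common and rare variants
atpr_rare = pooled_rate(assoc_n[:2], atpr[:2])    # rare variants only (MAF<0.05)
afpr_all = pooled_rate(unassoc_n, afpr)
afpr_rare = pooled_rate(unassoc_n[:2], afpr[:2])

for label, value in [("ATPR all", atpr_all), ("ATPR rare", atpr_rare),
                     ("AFPR all", afpr_all), ("AFPR rare", afpr_rare)]:
    print(f"{label}: {value:.2f}%")
```

Under this assumption the pooled values agree with the reported 26.83%, 12.20%, 0.67%, and 0.74% to within the rounding of the tabulated inputs, and the same calculation applied to the 350 case-trio rows gives values close to the reported 48.45%, 13.10%, 1.38%, and 1.30%.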

We observe that the true positive rate is lower and the false positive rate is higher for variants that have lower MAF. We illustrate this using the datasets with 1500 case-trios. For variants with a median MAF less than 0.01, the ATPR and AFPR were 3.08% and 0.09%, respectively. The highest FPR in this group was 8.8%, and the variant with this FPR had the same median MAF (0.006) as one of the associated variants in region 1. In addition, this falsely detected unassociated variant and the associated variant in region 1 were separated by only 53 base pairs, and the linkage disequilibrium between them was 1 (both D′ and r²), Fig 5. The AFPR in the group of variants with MAF in the range [0.01, 0.05) was 6.08%. The TPR of the only associated variant in this group (median MAF: 0.026) was 94.2%. Two unassociated variants in the same group had relatively high FPRs: one rare variant from region 5 had an FPR of 71.8%, and another rare variant from region 10 had an FPR of 62.8%. These three variants may have contributed to the high true and false positive rates, respectively, at the region level for both TRIO_RVEMVS and PedGene (Fig 1).

Fig 5. Linkage-disequilibrium (LD) for variants that were polymorphic across 500 data sets with 350 case-trios.

The symbol ‘+’ before a variant name denotes a detected associated rare variant; ‘-’ denotes a detected non-associated variant; ‘<’ denotes an associated variant that was never detected. V16613 from region 5 and V37318 from region 10 display high false positive rates, potentially due to LD: both are in strong LD with all the causal rare variants that are polymorphic across the 500 data sets.
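
To make the two LD measures concrete, here is a minimal sketch of the standard pairwise definitions of D′ and r² for two biallelic sites, computed from phased haplotypes. The function name and the toy haplotypes are hypothetical; the paper does not state which software produced Fig 5, so this should not be read as the authors' computation.

```python
from typing import List, Tuple


def pairwise_ld(haplotypes: List[Tuple[int, int]]) -> Tuple[float, float]:
    """Return (|D'|, r^2) for two biallelic sites.

    Each haplotype is a pair (a, b) of 0/1 allele codes at the two sites.
    """
    n = len(haplotypes)
    p_a = sum(a for a, _ in haplotypes) / n   # allele-1 frequency at site A
    p_b = sum(b for _, b in haplotypes) / n   # allele-1 frequency at site B
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n
    if not (0 < p_a < 1 and 0 < p_b < 1):
        raise ValueError("both sites must be polymorphic")
    d = p_ab - p_a * p_b                      # raw disequilibrium coefficient D
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = abs(d) / d_max
    r2 = d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))
    return d_prime, r2


# Toy sample: two rare alleles that always sit on the same 3 of 500 haplotypes
# (allele frequency 0.006 at both sites).
toy = [(1, 1)] * 3 + [(0, 0)] * 497
d_prime, r2 = pairwise_ld(toy)
```

When two rare alleles always occur on the same haplotypes, as in this toy sample, both measures equal 1 up to floating-point error, and the two variants carry identical information. This is the situation described for the associated variant in region 1 and its falsely detected neighbor.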

Real data application

We applied TRIO_RVEMVS to a trio data set from the Gabriella Miller Kids First Pediatric Research Program (https://commonfund.nih.gov/kidsfirst/overview). Access to the data and analysis was exempted by the UTHealth IRB under protocol HSC-SPH-18-1127. On November 11, 2019, we obtained a total of 380 trios affected by cleft lip with or without cleft palate from the Gabriella Miller Kids First Data Resource Center (DRC). All authors confirmed that we did not have access to any information that could identify individual participants during or after data collection. We applied TRIO_RVEMVS to analyze chromosome 8 sequencing data for association with the risk of orofacial clefts in the European population. Quality control was performed using PLINK [51] following the guidance of Turner et al. [52], and covered sample and marker genotyping efficiency/call rate, Mendelian inconsistency, and Hardy-Weinberg equilibrium. Subsequently, SHAPEIT2 [53] was used to phase the genotypes and obtain haplotypes for each trio. For TRIO_RVEMVS testing, our focus was on the region around 8q24, where SNPs have previously been reported to be associated with the risk of orofacial clefts [54]. First, we identified the LD blocks in chromosome 8 using Big-LD [55]. The LD block covering the region previously associated with orofacial clefts consists of 10401 SNPs after omitting singletons, doubletons, and tripletons across the 380 trios (1140 individuals). We used the same procedure as described above for the simulated data, including the regularization plot (Fig 6), to determine the exclusion parameters of the priors. The final selected SNPs are shown in Table 5. In total, we identified 8 SNPs in q24.21 and q24.22 associated with the risk of orofacial clefts in the Kids First European population.
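
The variant filtering and rare/common labeling described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' pipeline: it assumes that singletons, doubletons, and tripletons are variants whose minor allele count is 1, 2, or 3 across all 2280 haplotypes of the 1140 individuals (the paper does not say whether counts were taken over all individuals or over parents only), and it applies the MAF < 5% definition of a rare variant used in the simulations. The function name and toy variants are hypothetical.

```python
from typing import Dict, List, Tuple


def filter_and_classify(
    sites: Dict[str, List[int]], min_mac: int = 4, rare_maf: float = 0.05
) -> Dict[str, Tuple[str, float]]:
    """Drop sites with minor allele count below min_mac; label the rest by MAF."""
    kept = {}
    for name, alleles in sites.items():
        alt = sum(alleles)                      # copies of the allele coded 1
        mac = min(alt, len(alleles) - alt)      # minor allele count
        if mac < min_mac:                       # singleton, doubleton, or tripleton
            continue
        maf = mac / len(alleles)
        kept[name] = ("rare" if maf < rare_maf else "common", maf)
    return kept


N_HAPLOTYPES = 2 * 1140  # two haplotypes per individual (assumed counting basis)


def site(copies: int) -> List[int]:
    """Toy site with the given number of allele-1 copies."""
    return [1] * copies + [0] * (N_HAPLOTYPES - copies)


toy_sites = {"v_doubleton": site(2), "v_rare": site(40), "v_common": site(600)}
result = filter_and_classify(toy_sites)
```

In this toy input the doubleton is removed, the variant with 40 copies is labeled rare, and the variant with 600 copies is labeled common.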

Fig 6. Regularization plot for rare variants based on the trio data from the Gabriella Miller Kids First Data Resource Center (DRC).

Table 5. Final selected SNPs in the trio data from the Gabriella Miller Kids First Data Resource Center (DRC).

dbSNP | Position | ref MAF | Locus | Coefficients
--- | --- | --- | --- | ---
rs1474668949 | 128825584 | 0.01 | q24.21 | -0.12
rs7017665 | 128946138 | 0.29 | q24.21 | 0.26
rs17242358 | 128952627 | 0.29 | q24.21 | 0.26
rs55658222 | 128963890 | 0.29 | q24.21 | 0.27
rs1472381856 | 129156395 | 0.07 | q24.21 | -0.23
 | 129243536 | 0.04 | | -0.15
rs1192270083 | 129364943 | 0.02 | q24.21 | -0.13
rs78061696 | 130619334 | 0.03 | q24.22 | 0.12

Discussion

Although new sequencing technologies and statistical methods have accelerated genome-wide association studies, a large portion of the genetic variability associated with birth defects remains to be discovered [6, 56–58]. This missing heritability may reside in rare variants. Most existing genetic association methods, such as SKAT [28], PedGene [37], and RV-TDT [38], aggregate the burden of risk from rare variants within a region to test for association between that region and disease. These methods often lose power when the pooled region contains a large number of unassociated rare variants, or when the rare variants within a region act in opposite directions (i.e., some promote risk while others protect against the disease) [29]. Additionally, it is well known that trio data are more robust to population stratification, and trio methods are well suited to identifying the risk of birth defects stemming from genetic variation [6, 56–59]. We developed a statistical tool based on family trio data, TRIO_RVEMVS, that jointly models common and rare variants to identify the rare variants driving the association of a genetic region with the disease, rather than simply assessing genetic regions. One advantage of the proposed method is that common and rare variants are detected simultaneously, so the selection of rare variants does not need to be restricted to regions where common variants have been detected previously. Additionally, TRIO_RVEMVS can potentially be applied in fine-mapping studies, particularly following genome-wide association studies (GWAS) that have identified broad regions associated with certain phenotypes or diseases.

Using simulated data, we assessed TRIO_RVEMVS by comparing its region-level performance with that of PedGene and RV-TDT using a weighted average correct association metric. We also examined the average true positive rate (ATPR) and average false positive rate (AFPR) when identifying individual variants. TRIO_RVEMVS outperformed PedGene when common variants were included, whereas both methods were competitive when considering only rare variants. In this study, we also confirmed the result that PedGene outperformed RV-TDT at the region level whether or not common variants were included [50]. For 500 datasets with 1,500 trios, the ATPR was 2.45% and the AFPR was 0.07% when both common and rare variants were considered at the individual level; the ATPR was 0.94% and the AFPR was 0.07% when only rare variants were considered. For 500 data sets with 350 trios, the ATPR was 4.33% and the AFPR was 0.13% with common variants; the ATPR was 0.62% and the AFPR was 0.08% without common variants at the individual level.

When applied to real data from the Gabriella Miller Kids First Data Resource Center (DRC), TRIO_RVEMVS identified 8 SNPs in q24.21 and q24.22 that were associated with the risk of orofacial clefts in the Kids First European population. Three SNPs were previously reported as common variants in locus 8q24. SNP rs7017665 has been reported in the literature [60] and is highly correlated with another commonly reported SNP, rs987525 (LD: r² = 0.847, D′ = 0.983) [54, 60–62]. SNPs rs55658222 and rs17242358 have been previously reported in [63] and [64], respectively. One SNP we identified, rs78061696, has not yet been reported in the literature as associated with orofacial clefts.

One limitation of the proposed method is that it models haplotype data and therefore assumes accurate phasing. Accurate phasing is thus crucial before applying TRIO_RVEMVS. Fortunately, many phasing methods and software packages have been developed, and different methods can be applied in different situations to achieve better accuracy [65]. For example, given trio data, one can apply MERLIN [66], BEAGLE [67], or SHAPEIT2 [53], among others. These methods work well for trios and parent-offspring pairs. In this study, we applied SHAPEIT2 to phase the data for the real-data analysis. However, comparing different phasing methods and exploring their effect on the downstream analysis is beyond the scope of this study. Incorporating phasing uncertainty into TRIO_RVEMVS is left for future research. Some challenges remain for TRIO_RVEMVS in detecting rare variants: 1) TRIO_RVEMVS may fail to detect associated rare variants if their MAF is too small (less than 0.0027), and this threshold varies by sample size. 2) If too few rare variants in a given region are associated with a disease, TRIO_RVEMVS may detect only the region and not the individual variants. 3) TRIO_RVEMVS may falsely detect some variants due to high LD. Despite these challenges, TRIO_RVEMVS is pioneering in its ability to identify individual rare variants alongside gene regions. Extending TRIO_RVEMVS to genome-wide data is also left for future research.

Supporting information

S1 File. Supplemental materials.

(PDF)

pone.0314502.s001.pdf (1.1MB, pdf)

Acknowledgments

We would like to acknowledge the Gabriella Miller Kids First Pediatric Research Program for providing the data.

Data Availability

The data underlying this study cannot be shared publicly as they are owned by the National Institutes of Health (NIH) and are available through the Gabriella Miller Kids First Pediatric Research Program (https://commonfund.nih.gov/kidsfirst/overview). Access to the data requires an application process, and users are prohibited from sharing the data with others.

Funding Statement

This project was fully funded by the NIH/NICHD grant R03HD083674. The funders had no role in the study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

  • 1. CDC. About Birth Defects, May 16, 2024; 2024. https://www.cdc.gov/birth-defects.
  • 2. Sattolo ML, Arbour L, Bilodeau-Bertrand M, Lee GE, Nelson C, Auger N. Association of birth defects with child mortality before age 14 years. JAMA Network Open. 2022;5(4):e226739–e226739. doi: 10.1001/jamanetworkopen.2022.6739
  • 3. Beames TG, Lipinski RJ. Gene-environment interactions: aligning birth defects research with complex etiology. Development. 2020;147(21):dev191064. doi: 10.1242/dev.191064
  • 4. Qiao F, Wang Y, Zhang C, Zhou R, Wu Y, Wang C, et al. Comprehensive evaluation of genetic variants using chromosomal microarray analysis and exome sequencing in fetuses with congenital heart defect. Ultrasound in Obstetrics & Gynecology. 2021;58(3):377–387. doi: 10.1002/uog.23532
  • 5. Yang X, Li Q, Wang F, Yan L, Zhuang D, Qiu H, et al. Newborn screening and genetic analysis identify six novel genetic variants for primary carnitine deficiency in Ningbo Area, China. Frontiers in Genetics. 2021;12:686137. doi: 10.3389/fgene.2021.686137
  • 6. Yuan S, Zaidi S, Brueckner M. Congenital heart disease: emerging themes linking genetics and development. Current opinion in genetics & development. 2013;23(3):352–359. doi: 10.1016/j.gde.2013.05.004
  • 7. Rashkin SR, Cleves M, Shaw GM, Nembhard WN, Nestoridi E, Jenkins MM, et al. A genome-wide association study of obstructive heart defects among participants in the National Birth Defects Prevention Study. American Journal of Medical Genetics Part A. 2022;188(8):2303–2314. doi: 10.1002/ajmg.a.62759
  • 8. Shabana N, Shahid SU, Irfan U. Genetic contribution to congenital heart disease (CHD). Pediatric cardiology. 2020;41(1):12–23. doi: 10.1007/s00246-019-02271-4
  • 9. Lyu C, Webber DM, MacLeod SL, Hobbs CA, Li M, Study NBDP. Gene-by-gene interactions associated with the risk of conotruncal heart defects. Molecular Genetics & Genomic Medicine. 2020;8(1):e1010. doi: 10.1002/mgg3.1010
  • 10. Sun H, Yi T, Hao X, Yan H, Wang J, Li Q, et al. Contribution of single-gene defects to congenital cardiac left-sided lesions in the prenatal setting. Ultrasound in Obstetrics & Gynecology. 2020;56(2):225–232. doi: 10.1002/uog.21883
  • 11. Cordell HJ, Töpf A, Mamasoula C, Postma AV, Bentham J, Zelenika D, et al. Genome-wide association study identifies loci on 12q24 and 13q32 associated with Tetralogy of Fallot. Human Molecular Genetics. 2013;22(7):1473–1481. doi: 10.1093/hmg/dds552
  • 12. Case AP, Mitchell LE. Prevalence and patterns of choanal atresia and choanal stenosis among pregnancies in Texas, 1999-2004. American Journal of Medical Genetics, Part A. 2011;155(4):786–791. doi: 10.1002/ajmg.a.33882
  • 13. Gorlov IP, Gorlova OY, Sunyaev SR, Spitz MR, Amos CI. Shifting Paradigm of Association Studies: Value of Rare Single-Nucleotide Polymorphisms. American Journal of Human Genetics. 2008;82(1):100–112. doi: 10.1016/j.ajhg.2007.09.006
  • 14. Iyengar SK, Elston RC. The genetic basis of complex traits: rare variants or “common gene, common disease”? Methods in Molecular Biology (Clifton, NJ). 2007;376:71–84. doi: 10.1007/978-1-59745-389-9_6
  • 15. Smith DJ. The allelic structure of common disease. Human Molecular Genetics. 2002;11(20):2455–2461. doi: 10.1093/hmg/11.20.2455
  • 16. Zhu Q, Ge D, Maia JM, Zhu M, Petrovski S, Dickson SP, et al. A genome-wide comparison of the functional properties of rare and common genetic variants in humans. American Journal of Human Genetics. 2011;88(4):458–468. doi: 10.1016/j.ajhg.2011.03.008
  • 17. Young AI. Solving the missing heritability problem. PLoS Genetics. 2019;15(6):e1008222. doi: 10.1371/journal.pgen.1008222
  • 18. Asimit J, Zeggini E. Rare Variant Association Analysis Methods for Complex Traits. Annual Review of Genetics. 2010;44(1):293–308. doi: 10.1146/annurev-genet-102209-163421
  • 19. Bansal V, Libiger O, Torkamani A, Schork NJ. Statistical analysis strategies for association studies involving rare variants. Nature Reviews Genetics. 2010;11(11):773–785. doi: 10.1038/nrg2867
  • 20. Basu S, Pan W. Comparison of Statistical Tests for Disease Association with Rare Variants. Bone. 2011;23(1):1–7. doi: 10.1002/gepi.20609
  • 21. Morris AP, Zeggini E. An evaluation of statistical approaches to rare variant analysis in genetic association studies. Genetic Epidemiology. 2010;34(2):188–193. doi: 10.1002/gepi.20450
  • 22. Hoogmartens J, Cacace R, Van Broeckhoven C. Insight into the genetic etiology of Alzheimer’s disease: A comprehensive review of the role of rare variants. Alzheimer’s & Dementia: Diagnosis, Assessment & Disease Monitoring. 2021;13(1):e12155. doi: 10.1002/dad2.12155
  • 23. Andrés AM, Clark AG, Shimmin L, Boerwinkle E, Sing CF, Hixson JE. Understanding the accuracy of statistical haplotype inference with sequence data of known phase. Genetic Epidemiology: The Official Publication of the International Genetic Epidemiology Society. 2007;31(7):659–671. doi: 10.1002/gepi.20185
  • 24. Li B, Leal SM. Methods for Detecting Associations with Rare Variants for Common Diseases: Application to Analysis of Sequence Data. American Journal of Human Genetics. 2008;83(3):311–321. doi: 10.1016/j.ajhg.2008.06.024
  • 25. Price AL, Kryukov GV, de Bakker PIW, Purcell SM, Staples J, Wei LJ, et al. Pooled Association Tests for Rare Variants in Exon-Resequencing Studies. American Journal of Human Genetics. 2010;86(6):832–838. doi: 10.1016/j.ajhg.2010.04.005
  • 26. Pan W, Shen X. Adaptive Tests for Association Analysis of Rare Variants. Genetic Epidemiology. 2011;35(5):381–388. doi: 10.1002/gepi.20586
  • 27. Neale BM, Rivas MA, Voight BF, Altshuler D, Devlin B, Orho-Melander M, et al. Testing for an unusual distribution of rare variants. PLoS genetics. 2011;7(3):e1001322. doi: 10.1371/journal.pgen.1001322
  • 28. Wu MC, Lee S, Cai T, Li Y, Boehnke M, Lin X. Rare-variant association testing for sequencing data with the sequence kernel association test. American Journal of Human Genetics. 2011;89(1):82–93. doi: 10.1016/j.ajhg.2011.05.029
  • 29. Quintana MA, Berstein JL, Thomas DC, Conti DV. Incorporating Model Uncertainty in Detecting Rare Variants: The Bayesian Risk Index. Genetic Epidemiology. 2011;35(7):638–649. doi: 10.1002/gepi.20613
  • 30. Liang F, Xiong M. Bayesian Detection of Causal Rare Variants under Posterior Consistency. PLoS ONE. 2013;8(7):1–16. doi: 10.1371/journal.pone.0069633
  • 31. Boutry S, Helaers R, Lenaerts T, Vikkula M. Rare variant association on unrelated individuals in case–control studies using aggregation tests: existing methods and current limitations. Briefings in Bioinformatics. 2023;24(6):bbad412. doi: 10.1093/bib/bbad412
  • 32. Liu J, Lewinger JP, Gilliland FD, Gauderman WJ, Conti DV. Confounding and heterogeneity in genetic association studies with admixed populations. American Journal of Epidemiology. 2013;177(4):351–360. doi: 10.1093/aje/kws234
  • 33. Li Y, Lee S. Integrating external controls in case–control studies improves power for rare-variant tests. Genetic Epidemiology. 2022;46(3-4):145–158. doi: 10.1002/gepi.22444
  • 34. Lee S, Fuchsberger C, Kim S, Scott L. An efficient resampling method for calibrating single and gene-based rare variant association analysis in case–control studies. Biostatistics. 2016;17(1):1–15. doi: 10.1093/biostatistics/kxv033
  • 35. Schaid DJ, Rowland C. Use of parents, sibs, and unrelated controls for detection of associations between genetic markers and disease. American Journal of Human Genetics. 1998;63(5):1492–1506. doi: 10.1086/302094
  • 36. Mathieson I, McVean G. Differential confounding of rare and common variants in spatially structured populations. Nature Genetics. 2012;44(3):243–246. doi: 10.1038/ng.1074
  • 37. Schaid DJ, Mcdonnell SK, Sinnwell JP, Thibodeau SN. Multiple Genetic Variant Association Testing by Collapsing and Kernel Methods With Pedigree or Population Structured Data. Genetic Epidemiology. 2013;37(5):409–418. doi: 10.1002/gepi.21727
  • 38. He Z, O’Roak BJ, Smith JD, Wang G, Hooker S, Santos-Cortez RLP, et al. Rare-variant extensions of the transmission disequilibrium test: Application to autism exome sequence data. American Journal of Human Genetics. 2014;94(1):33–46. doi: 10.1016/j.ajhg.2013.11.021
  • 39. Thomas D, Pitkaeniemi J, Langholz B, Tuomilehto-Wolf E, Tuomilehto J, the DiMe Study Group. Variation in HLA-Associated Risks of Childhood Insulin-Dependent Diabetes in the Finnish Population: II. Haplotype Effects. Genetic Epidemiology. 1995;12:455–466. doi: 10.1002/gepi.1370120503
  • 40. Schaid DJ. General Score Tests for Associations of Genetic Markers with Disease Using Cases and Their Parents. Genetic Epidemiology. 1996;13:423–449.
  • 41. Schaid D. Relative-risk Regression Models Using Cases and Their Parents. Genetic Epidemiology. 1995;12:813–818. doi: 10.1002/gepi.1370120647
  • 42. Swartz MD, Kimmel M, Mueller P, Amos CI. Stochastic search gene suggestion: a Bayesian hierarchical model for gene mapping. Biometrics. 2006;62(2):495–503. doi: 10.1111/j.1541-0420.2005.00451.x
  • 43. Ročková V, George EI. EMVS: The EM approach to Bayesian variable selection. Journal of the American Statistical Association. 2014;109(506):828–846. doi: 10.1080/01621459.2013.869223
  • 44. Chen RB, Chu CH, Yuan S, Wu YN. Bayesian Sparse Group Selection. Journal of Computational and Graphical Statistics. 2016;25(3):665–683. doi: 10.1080/10618600.2015.1041636
  • 45. Shalev-Shwartz S, Zhang T. Stochastic Dual Coordinate Ascent methods for regularized loss minimization. Journal of Machine Learning Research. 2013;14(1):567–599.
  • 46. Shalev-Shwartz S, Zhang T. Accelerated mini-batch stochastic dual coordinate ascent. Advances in Neural Information Processing Systems. 2013; p. 1–17.
  • 47. Ueda N, Nakano R. Deterministic annealing variant of the EM algorithm. Advances in Neural Information Processing Systems. 1995; p. 545–552.
  • 48. Schaffner SF, Foo C, Gabriel S, Reich D, Daly MJ, Altshuler D. Calibrating a coalescent simulation of human genome sequence variation. Genome Research. 2005;15(11):1576–1583. doi: 10.1101/gr.3709305
  • 49. Siva N. 1000 Genomes project. Nature Biotechnology. 2008;26(3):256–257. doi: 10.1038/nbt0308-256b
  • 50. Wang L, Choi S, Lee S, Park T, Won S. Comparing family-based rare variant association tests for dichotomous phenotypes. BMC Proceedings. 2016;10(Suppl 7). doi: 10.1186/s12919-016-0027-8
  • 51. Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MA, Bender D, et al. PLINK: a tool set for whole-genome association and population-based linkage analyses. The American Journal of Human Genetics. 2007;81(3):559–575. doi: 10.1086/519795
  • 52. Turner S, Armstrong LL, Bradford Y, Carlson CS, Crawford DC, Crenshaw AT, et al. Quality control procedures for genome-wide association studies. Current Protocols in Human Genetics. 2011;68(1):1–19. doi: 10.1002/0471142905.hg0119s68
  • 53. Delaneau O, Zagury JF, Marchini J. Improved whole-chromosome phasing for disease and population genetic studies. Nature Methods. 2013;10(1):5. doi: 10.1038/nmeth.2307
  • 54. Assis Machado R, de Toledo IP, Martelli-Júnior H, Reis SR, Neves Silva Guerra E, Coletta RD. Potential genetic markers for nonsyndromic oral clefts in the Brazilian population: A systematic review and meta-analysis. Birth Defects Research. 2018;110(10):827–839. doi: 10.1002/bdr2.1208
  • 55. Kim SA, Cho CS, Kim SR, Bull SB, Yoo YJ. A new haplotype block detection method for dense genome sequencing data based on interval graph modeling of clusters of highly correlated SNPs. Bioinformatics. 2018;34(3):388–397. doi: 10.1093/bioinformatics/btx609
  • 56. Copp AJ, Stanier P, Greene NDE. Neural tube defects: Recent advances, unsolved questions, and controversies. The Lancet Neurology. 2013;12(8):799–810. doi: 10.1016/S1474-4422(13)70110-8
  • 57. Greene NDE, Stanier P, Copp AJ. Genetics of human neural tube defects. Human Molecular Genetics. 2009;18(R2). doi: 10.1093/hmg/ddp347
  • 58. Shkoukani MA, Chen M, Vong A. Cleft lip – A comprehensive review. Frontiers in Pediatrics. 2013;1(DEC):1–10. doi: 10.3389/fped.2013.00053
  • 59. Carroll NC. Clubfoot in the twentieth century: Where we were and where we may be going in the twenty-first century. Journal of Pediatric Orthopaedics Part B. 2012;21(1):1–6. doi: 10.1097/BPB.0b013e32834a99f2
  • 60. Leslie EJ, Taub MA, Liu H, Steinberg KM, Koboldt DC, Zhang Q, et al. Identification of functional variants for cleft lip with or without cleft palate in or near PAX7, FGFR2, and NOG by targeted sequencing of GWAS loci. The American Journal of Human Genetics. 2015;96(3):397–411. doi: 10.1016/j.ajhg.2015.01.004
  • 61. Grant SF, Wang K, Zhang H, Glaberson W, Annaiah K, Kim CE, et al. A genome-wide association study identifies a locus for nonsyndromic cleft lip with or without cleft palate on 8q24. The Journal of Pediatrics. 2009;155(6):909–913. doi: 10.1016/j.jpeds.2009.06.020
  • 62. Beaty TH, Murray JC, Marazita ML, Munger RG, Ruczinski I, Hetmanski JB, et al. A genome-wide association study of cleft lip with and without cleft palate identifies risk variants near MAFB and ABCA4. Nature Genetics. 2010;42(6):525–529. doi: 10.1038/ng.580
  • 63. Leslie EJ, Carlson JC, Shaffer JR, Feingold E, Wehby G, Laurie CA, et al. A multi-ethnic genome-wide association study identifies novel loci for non-syndromic cleft lip with or without cleft palate on 2p24.2, 17q23 and 19q13. Human Molecular Genetics. 2016;25(13):2862–2872. doi: 10.1093/hmg/ddw104
  • 64. Zhang W. Identification and comparison of imputed and genotyped variants for genome-wide association study of orofacial cleft case-parent trios. Johns Hopkins University; 2019.
  • 65. Browning SR, Browning BL. Haplotype phasing: existing methods and new developments. Nature Reviews Genetics. 2011;12(10):703–714. doi: 10.1038/nrg3054
  • 66. Abecasis GR, Wigginton JE. Handling marker-marker linkage disequilibrium: pedigree analysis with clustered markers. The American Journal of Human Genetics. 2005;77(5):754–767. doi: 10.1086/497345
  • 67. Browning BL, Browning SR. A unified approach to genotype imputation and haplotype-phase inference for large data sets of trios and unrelated individuals. The American Journal of Human Genetics. 2009;84(2):210–223. doi: 10.1016/j.ajhg.2009.01.005


Articles from PLOS ONE are provided here courtesy of PLOS
