Abstract
The development of set-based genetic-survival association tests has focused on right-censored survival outcomes. However, interval-censored failure time data arise widely in health science studies, especially those on the development of chronic diseases. In this paper, we propose a suite of set-based genetic association and interaction tests for interval-censored survival outcomes under a unified weighted-V-statistic framework. Besides dealing with interval censoring, the new tests can account for genetic effect heterogeneity and accommodate left truncation of survival outcomes. Simulation studies showed that the new tests perform well in terms of size and power under various scenarios and that the new interaction test is more powerful than the standard likelihood ratio test for testing gene-gene/gene-environment interactions. The practical utility of the developed tests was illustrated by a genome-wide association study of age to early childhood caries.
Keywords: genetic heterogeneity, interval censoring, left truncation, set-based test, weighted V statistic
Introduction
With the development of genotyping and sequencing technologies and the increasing size of genome databases, the genetic predisposition to human diseases has been studied extensively in the past two decades, using agnostic approaches such as genome-wide association study (GWAS) and whole-genome sequence association study. However, the hypothesis-free approaches face the issues of multiple testing and small effect sizes of individual variants, which motivated the development of multi-marker tests. Multi-marker tests test the joint association of multiple genetic markers within a biological unit, e.g., a gene or a biological pathway. The most popular multi-marker tests are based on kernel machine regressions, including Liu, Lin, and Ghosh (2007), Liu, Ghosh, and Lin (2008) and Wu et al. (2011) among others. The kernel association tests increase the statistical power of association studies by not only aggregating the signals of individual markers but also capturing their potentially non-linear and/or interactive effects through the use of non-linear kernels.
Most of the existing kernel association tests were developed for completely observable quantitative/categorical phenotypes. Less effort has been devoted to developing kernel association tests for censored survival traits, although survival traits are more informative than case-control phenotypes about the genetic contribution to disease development/progression. Existing kernel association tests for censored survival traits include the test based on Cox kernel machine regression (coxKM) (Cai, Tonini, & Lin, 2011), the test based on accelerated failure time kernel machine regression (aftKM) (Sinnott & Cai, 2013), the global test (gt) (Goeman, Oosting, Cleton-Jansen, Anninga, & van Houwelingen, 2005), and the sequence kernel association test in a Cox regression framework (coxSKAT) (Chen et al., 2014). The latter two use the same test statistic as coxKM with the linear kernel but different methods to obtain its null distribution. All four tests have noteworthy drawbacks, as shown in Li, Wu, and Lu (2021). Specifically, aftKM, gt and coxSKAT do not have the nominal size when the sample size is not large, nor does coxKM when adjusting for covariates correlated with the genetic markers of interest. To address those drawbacks, Li et al. (2021) developed a multi-marker test based on a weighted V statistic. The weighted V test (WV) inherits the advantage of the survival kernel association tests in that it can be powerful in detecting complex effects (e.g., nonlinear or interactive effects) of a SNP set or a gene set on survival outcomes. In addition, Li et al. (2021) demonstrated empirically that WV has the nominal size in moderate-sized samples and showed theoretically and empirically that WV with the linear kernel maintains the nominal size when adjusting for covariates that are linearly correlated with the markers of interest. Furthermore, WV can handle left truncation of survival traits and account for genetic effect heterogeneity (genetic heterogeneity) to increase power.
Survival traits of chronic diseases are usually interval censored due to the intermittent examinations of the disease status. Examples of such traits are age to early childhood caries (ECC), age to Alzheimer’s disease, age to type 2 diabetes, and time from menopause to osteoporosis in women, to name a few. To the best of our knowledge, multi-marker tests for interval-censored survival outcomes are lacking, which hampers the discovery of genetic components in chronic diseases. In this article, we develop the first set of multi-marker tests for genetic association and gene-gene/gene-environment (G–G/G–E) interaction with interval-censored survival traits. The tests are developed under the weighted-V-statistic framework adopted by Li et al. (2021) and thus are called the weighted V tests in the sequel. The proposed tests are fully nonparametric when not adjusting for covariates, and are semiparametric with covariate adjustment in the sense that only the effects of adjustment covariates on the survival outcome need to be specified through a semiparametric survival model. Besides dealing with interval censoring, another innovation of the proposed tests compared to Li et al. (2021) is that they can use any semiparametric transformation model (Zeng & Lin, 2006) to adjust for covariates, whereas Li et al. (2021) use only the Cox model. We conducted an extensive set of simulation studies, which showed excellent finite-sample performance of the new tests in terms of size and power. We also applied the proposed tests to a dbGaP dataset, Dental Caries: Whole Genome Association and Gene × Environment Studies (Accession No.: phs000095.v3.p1), to perform a gene-based association analysis of early childhood caries.
Methods
Set-up
We jointly test the effects of p genetic markers (e.g., expression values of a set of genes, or a set of SNPs, each coded as 0, 1 or 2, representing the number of minor alleles), G = (G1, …, Gp)T, on a survival time T, possibly adjusting for q covariates, Z = (Z1, …, Zq)T, which could be confounders or predictors that are independent of G. The survival time T is subject to interval censoring and possible left truncation. Therefore, the survival data of a subject can be summarized by the observed left-truncation time together with the pair (L, R), where L is the last failure inspection time before the failure occurrence, and R is the first inspection time after the occurrence. L = 0 when T is left-censored and R = ∞ when T is right-censored. These data are observed for a random sample of n subjects. We assume that the number and times of failure inspections are independent of T given Z and that the left-truncation time is independent of T given Z as well.
Weighted V tests for genetic association without considering genetic heterogeneity
The null hypothesis of our covariate-adjusted association test is that G is independent of T conditioning on Z. The test assumes that under the null, T given Z follows a semiparametric transformation model, $\Lambda(t \mid Z) = G\{\Lambda_0(t)\exp(\gamma^{T}Z)\}$, where Λ(t|Z) is the cumulative hazard function of T given Z, G(x) = log(1 + rx)/r with a specified r (r ≥ 0; G(x) = x when r = 0), and Λ0(t) is an unknown non-decreasing function with Λ0(0) = 0. Here the transformation function G(·) is distinct from the marker vector G. Correspondingly, we construct the test from a working semiparametric transformation frailty model, which augments the model above with a subject-level frailty $e^{h}$. The covariate-adjusted association test is based on the following weighted V statistic,
| (1) |
where and is the conditional likelihood (given frailty and left-truncation time) of the i-th subject’s data based on the semiparametric transformation frailty model. SZ,iSZ,j is considered as the V-statistic kernel and is a covariate-adjusted phenotype similarity measure. is considered as the weight function and is a covariate-centered genetic similarity measure defined by
where and f(Gi, Gj) is a genetic similarity function. It can be shown that . The choice of f(Gi, Gj) depends on the type of G and the expected form of G’s effect. If the effect of G is expected to be linear, we use cross-product kernel , a.k.a. linear kernel. For SNP covariates, we use IBS kernel , if the effect of G is not expected to be linear. For gene expression covariates, we use polynomial kernels or Gaussian kernel if non-linear and/or interactive effects of G are expected. As suggested by Wei and Lu (2017), we can also choose a unified Laplacian-kernel-based genetic similarity, , where Gi,k can be discrete or continuous variables, , and ωk is a function of variance of Gk defined by ωk = 1/σk for k = 1…, p. This kernel is particularly useful for association analyses with sequencing data, where a large portion of genetic variants are rare.
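The two genotype kernels named above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the kernel formulas are not displayed in this text, so the sketch assumes the commonly used forms, f(Gi, Gj) = GiT Gj for the cross-product kernel and f(Gi, Gj) = Σk (2 − |Gik − Gjk|)/(2p) for the IBS kernel. The function names `linear_kernel` and `ibs_kernel` are hypothetical.

```python
import numpy as np

def linear_kernel(G):
    """Cross-product (linear) kernel: f(G_i, G_j) = G_i' G_j."""
    G = np.asarray(G, dtype=float)
    return G @ G.T

def ibs_kernel(G):
    """IBS kernel for genotypes coded 0/1/2 (assumed standard form):
    f(G_i, G_j) = sum_k (2 - |G_ik - G_jk|) / (2p)."""
    G = np.asarray(G, dtype=float)
    p = G.shape[1]
    # Pairwise sum of absolute genotype differences over the p markers.
    diff = np.abs(G[:, None, :] - G[None, :, :]).sum(axis=2)
    return (2 * p - diff) / (2 * p)

# Three subjects, three SNPs; subjects 0 and 1 share all genotypes.
G = np.array([[0, 1, 2], [0, 1, 2], [2, 1, 0]])
F_lin = linear_kernel(G)
F_ibs = ibs_kernel(G)
```

Under the assumed IBS form, two subjects with identical genotypes have similarity 1, and the similarity decreases as the number of shared alleles decreases.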
Under the null hypothesis, SZ,i has conditional mean zero given Zi. When Z is independent of G, following the proof of Theorem 1 in Li et al. (2021), the null limiting distribution of can be shown to be
| (2) |
where , are independent chi-square variables with degree 1, and νt’s are the eigenvalues of (i.e., , with E(ϕs(G, Z)ϕt(G, Z)) = I(s = t)). In this case, the unadjusted test introduced below and the test based on (adjusted test) are both asymptotically correct, but the latter is more powerful than the former in finite samples if Z affects the survival outcome. When Z affects both G and the survival outcome, the unadjusted test is likely to yield false positives, whereas the adjusted test is less likely to yield false positives in general. When the cross-product kernel is used for constructing f(Gi, Gj), the adjusted test is still asymptotically correct in the presence of linear confounding, i.e., G = a + BTZ + e, where a and B are respectively a constant vector and a constant matrix, and e is a zero-mean random error vector that is independent of Z. It can be shown, following the proof of Theorem 2 in Li et al. (2021), that the null limiting distribution of with the cross-product kernel for f(Gi, Gj) is (2) under linear confounding. Under the alternative hypothesis that G has effects on the survival outcome adjusting for Z, the covariate-adjusted phenotype similarity SZ,iSZ,j is concordant with the covariate-centered genetic similarity . In other words, larger phenotype similarity is weighted heavier and smaller phenotype similarity is weighted lighter, leading to a large value of .
SZ,i involves unknown (Λ0, γ) and involves unknown conditional expectations. In real applications, we replace (Λ0, γ) by the estimates obtained from a nonparametric maximum likelihood estimation (Li, Pak, & Todem, 2020) for the model , and replace the expectations in by the corresponding sample averages. The resulting weighted V statistic can be shown to be , where , , F = {f(Gi, Gj)}n×n, I is an n × n identity matrix, and with being an n × (q + 1) matrix whose i-th row is (i = 1, …, n). We approximate ζ by and use a matrix eigen-decomposition of (I − H)T F(I − H) to obtain a finite-dimensional approximation for νt’s. Denote the eigenvalues and eigenvectors of (I − H)T F(I − H) by and , respectively. Because satisfies , instead of , a finite-dimensional approximation for νt is . With and , the large sample null distribution of nVZ,IC is approximately
| (3) |
Based on this distribution, we use Davies’ method (Davies, 1980) to compute the p-value P(nVZ,IC ≥ nVZ,IC,obs). The test without covariate adjustment can be developed in the same manner. The corresponding weighted V statistic is , where J is a n × n matrix with all elements being 1/n and is the same as except that is fixed as zero and that G is taken to be the identity transformation. Note that this is a nonparametric test, i.e., it has no model assumption, although we exploited a frailty model to derive . In the sequel, the tests based on VZ,IC and VIC are collectively called WV-IC.
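The computational steps described above (covariate centering of the genetic similarity matrix, eigen-decomposition, and a p-value from a chi-square mixture) can be sketched as follows. Several pieces are assumptions made for illustration only, because the displayed formulas are not available in this text: the statistic is taken to be the quadratic form ST(I − H)TF(I − H)S/n², H is taken to be the projection onto the columns of (1, Z), ζ is estimated by the mean squared score, and the mixture tail probability is approximated by Monte Carlo rather than by Davies' method. The score vector `S` is treated as given; computing it from the fitted transformation model is not shown.

```python
import numpy as np

def centered_kernel(F, Z):
    """Return (I - H)' F (I - H), with H the projection onto columns of [1, Z]."""
    n = F.shape[0]
    Zt = np.column_stack([np.ones(n), Z])        # n x (q + 1), i-th row (1, Z_i')
    H = Zt @ np.linalg.solve(Zt.T @ Zt, Zt.T)
    M = np.eye(n) - H
    return M.T @ F @ M

def wv_pvalue(S, F, Z, n_draws=20000, seed=0):
    """Monte Carlo p-value for n*V under an assumed chi-square mixture null."""
    S = np.asarray(S, dtype=float)
    n = S.size
    Fc = centered_kernel(F, Z)
    stat = S @ Fc @ S / n                        # n * V, assuming V = S' Fc S / n^2
    lam = np.linalg.eigvalsh((Fc + Fc.T) / 2) / n  # eigenvalues rescaled by 1/n
    lam = lam[lam > 1e-10]                       # drop numerically zero eigenvalues
    zeta = np.mean(S ** 2)                       # assumed estimate of the scale zeta
    rng = np.random.default_rng(seed)
    # Draws from zeta * sum_t lam_t * chi2_1 (a stand-in for Davies' method).
    draws = zeta * (rng.chisquare(1, size=(n_draws, lam.size)) @ lam)
    return stat, float(np.mean(draws >= stat))
```

A Monte Carlo tail estimate cannot resolve p-values far below 1/`n_draws`, so it is not a substitute for an exact method such as Davies' at genome-wide significance levels; it is used here only to keep the sketch dependency-free.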
Weighted V tests for genetic association considering genetic heterogeneity
The effects of a set of genetic markers on a survival outcome could vary across subpopulations (e.g., different sexes, races or genetic backgrounds). To jointly test the effects of multiple genetic markers considering heterogeneous effects across subpopulations, we extend the weighted V statistics, VIC and VZ,IC, in the previous section by replacing f(Gi, Gj) with a heterogeneity weighted kernel function to consider the (latent) population structure. Specifically, f(Gi, Gj) is replaced by wij = (1 + kij)f(Gi, Gj), where kij is a measurement of subpopulation similarity between two individuals. The resulting weighted V statistics are
and
where W = (1 + K) ⊙ F, 1 = {1}n×n, K = {kij}n×n and ⊙ is the element-wise matrix product operator. The tests based on these heterogeneity weighted V statistics are collectively called HWV-IC. They can accommodate the population structure that is completely determined by a vector of observable variables X = (X1, …, XD)T or the latent population structure (e.g., subpopulations with different ancestry backgrounds) that cannot be completely determined but can be inferred by a vector of observable variables X (e.g., a large number of SNPs from the GWAS data). In either case, we can choose kij to be , where is the standardized Xi,d, i.e., and . If X is a vector of dummy variables coding several subpopulations (e.g. males and females), we can also use the identity kernel I(Xi = Xj) for kij. If X is a set of SNPs, we can use the IBS kernel for kij as well. Under the null hypothesis that G has no effect on the survival outcome in any subpopulation adjusting for Z, we use (3) to approximate the distribution of , replacing F by W to compute . The null limiting distribution of can be obtained similarly. Based on these distributions, we use Davies’ method (Davies, 1980) to compute the p-values and .
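The heterogeneity weighted kernel W = (1 + K) ⊙ F is a simple element-wise operation. The sketch below illustrates it with the identity kernel I(Xi = Xj) for observable subpopulation labels; the function names and the toy matrix F are hypothetical.

```python
import numpy as np

def identity_kernel(labels):
    """k_ij = I(X_i = X_j) for observable subpopulation labels."""
    labels = np.asarray(labels)
    return (labels[:, None] == labels[None, :]).astype(float)

def heterogeneity_weighted_kernel(F, K):
    """W = (1 + K) ⊙ F: element-wise product of (1 + k_ij) and f(G_i, G_j)."""
    return (1.0 + np.asarray(K, dtype=float)) * np.asarray(F, dtype=float)

# Toy example: subjects 0 and 1 share a subpopulation, subject 2 does not.
labels = np.array([0, 0, 1])
F = np.array([[1.0, 0.5, 0.2],
              [0.5, 1.0, 0.4],
              [0.2, 0.4, 1.0]])
W = heterogeneity_weighted_kernel(F, identity_kernel(labels))
```

With this choice of K, the genetic similarity of two subjects from the same subpopulation is doubled, while the similarity of subjects from different subpopulations is left unchanged.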
Weighted V tests for G–G/G–E interactions
We denote a gene under testing by G = (G1, …, GM)T, where Gm codes the genotype (0, 1 or 2) of the m-th SNP in the gene. In the G–E interaction analysis, the environmental variable is represented by another vector X = (X1, …, XD)T, whose dimension is greater than one if the variable is categorical with more than two levels. In the G–G interaction analysis, we denote the other gene under testing by H = (H1, …, HQ)T.
For testing G–G interactions, we first construct a phenotype similarity measure based on a semiparametric transformation model with G and H as two covariates, assuming additive genetic effects of G and H under the null hypothesis of no interaction effects. That is, ,
and (, , ) is the nonparametric maximum likelihood estimator for (Λ0, α, γ) under the model . Define and . Let f(Gi, Gj) and k(Hi, Hj) respectively represent the similarities in the genes G and H between subjects i and j. The choice of kernel functions f(·, ·) and k(·, ·) depends on the mode of inheritance. For example, we can use the cross-product kernel if the effect of G is expected to be additive. We propose a G–G interaction test based on the following weighted V statistic
where , . where 1 denotes an n-dimensional column vector of ones, F = {f(Gi, Gj)}n×n, K = {k(Hi, Hj)}n×n. The null limiting distribution of nVIC,I can be approximated by a mixture of chi-square distributions computed in the same way as for nVZ,IC. Based on the null distribution, we use Davies’ method to compute the p-value, P(nVIC,I ≥ nVIC,I,obs), where VIC,I,obs is the observed value of VIC,I. When M + Q is large, we multiply the null distribution by a factor n/(n − M − Q − 1) to take into account the projection operator I − O for finite samples, as in Li et al. (2021). If the mode of inheritance is unspecified for G and H, we propose a similar G–G interaction test to the above with the following modifications:
, becomes a n × 2M matrix with the i-th row being (I(Gi1 = 1), I(Gi1 = 2), …, I(GiM = 1), I(GiM = 2)), and is defined similarly.
The G–E interaction test is the same as the G–G interaction test except that the genetic variable H is replaced by the environment variable X. The G–G and G–E interaction tests can both adjust for covariates by incorporating them into the projection operator I − O and as additive covariates into the null model.
The weighted V interaction tests (WVI-IC) are more powerful than the regular tests (e.g., likelihood ratio and Wald tests) for detecting interactions in a semiparametric transformation model when G and/or H contains correlated SNPs, because the weighted V tests have an effective degree of freedom (DF) lower than the DF of the classical tests. Here we define the effective DF to be the DF d0 of a rescaled chi-square distribution that matches the first two moments of nVIC,I.
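The moment-matching definition of the effective DF can be made concrete. If the null distribution is a mixture Σt λt χ²1, its mean is Σλt and its variance is 2Σλt²; matching these to a rescaled chi-square c·χ²d gives c = Σλt²/Σλt and d = (Σλt)²/Σλt². The sketch below is illustrative only: it uses the eigenvalues of an AR(1) correlation matrix as hypothetical mixture weights, not the actual weights of nVIC,I, to show that correlation lowers the effective DF.

```python
import numpy as np

def effective_df(eigvals):
    """Scale c and DF d of c * chi2_d matching the first two moments
    of the mixture sum_t lam_t * chi2_1."""
    lam = np.asarray(eigvals, dtype=float)
    lam = lam[lam > 0]
    scale = np.sum(lam ** 2) / np.sum(lam)
    df = np.sum(lam) ** 2 / np.sum(lam ** 2)
    return scale, df

def ar1_corr(p, rho):
    """AR(1) correlation matrix with entries rho^|k - l|."""
    idx = np.arange(p)
    return rho ** np.abs(idx[:, None] - idx[None, :])

# Hypothetical mixture weights: eigenvalues of 8 x 8 correlation matrices.
_, df_indep = effective_df(np.linalg.eigvalsh(ar1_corr(8, 0.0)))
_, df_corr = effective_df(np.linalg.eigvalsh(ar1_corr(8, 0.7)))
```

With uncorrelated components all weights are equal and the effective DF equals the number of components; with correlated components the weights become unequal and the effective DF drops below it.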
Simulations
We performed Monte Carlo simulations to assess the finite-sample performance of the (heterogeneity) weighted V tests for interval-censored and possibly left-truncated survival outcomes. In all simulation scenarios, subjects’ baseline covariates were adjusted for, and survival times were generated from the proportional hazards and proportional odds models, two special cases of semiparametric transformation models (r = 0 and r = 1). Due to the lack of available estimation methods for the proportional odds model with left-truncated and interval-censored data, left truncation was considered only under the proportional hazards model. For each subject, three follow-up times, {V1, V2, V3}, were generated as follows: V1 ~ Unif(L, R), V2 ~ V1 + Unif(L, R), and V3 ~ V2 + Unif(L, R), where L and R (here the bounds of the uniform distribution, not the interval endpoints of the Set-up section) were two positive numbers chosen so that the left and right censoring rates were controlled within 20%−30% and 30%−40%, respectively. In all simulations, unless otherwise specified, two sample sizes, n = 500, 1000, and two marker-set sizes, p = 4, 8, were considered. Also, in the presence of adjustment covariates, a binary covariate Z1 ~ Bernoulli(0.5) and a continuous covariate Z2 ~ Unif(0, 2) were generated for each subject. The genetic markers under testing were SNPs, except that when considering linear confounding, the markers were gene expressions. To mimic linkage disequilibrium (LD) between SNPs within a SNP set, genotypes were generated via a two-step procedure: 1) sample n vectors independently from a multivariate normal distribution with zero mean and a covariance matrix $\Sigma_{p\times p} = \{0.5^{|k-l|}\}$ (unless otherwise specified); 2) categorize each component of every multivariate normal vector into three levels labeled with 0, 1 and 2 using the cut-off values that were selected to achieve the Hardy–Weinberg equilibrium (HWE) and set the minor allele frequency (MAF) at 0.2 (unless otherwise specified). In all scenarios, 1000 Monte Carlo samples were generated.
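The data-generating steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' simulation code: the cut-off values are taken to be the standard normal quantiles at the HWE genotype probabilities (1 − MAF)², 2·MAF·(1 − MAF) and MAF², and the interval endpoints are taken to be the last visit before and the first visit at or after the failure time. The function names and the arguments `lo` and `hi` (the uniform bounds called L and R in the text) are hypothetical.

```python
import numpy as np
from statistics import NormalDist

def simulate_genotypes(n, p, maf=0.2, rho=0.5, rng=None):
    """Two-step genotype simulation: latent MVN with AR(1)-type covariance,
    then thresholding at HWE quantiles (assumed cut-off rule)."""
    rng = np.random.default_rng() if rng is None else rng
    idx = np.arange(p)
    sigma = rho ** np.abs(idx[:, None] - idx[None, :])   # entries rho^|k - l|
    latent = rng.multivariate_normal(np.zeros(p), sigma, size=n)
    c1 = NormalDist().inv_cdf((1 - maf) ** 2)            # P(genotype = 0)
    c2 = NormalDist().inv_cdf(1 - maf ** 2)              # P(genotype <= 1)
    return (latent > c1).astype(int) + (latent > c2).astype(int)

def interval_censor(T, lo, hi, rng=None):
    """Three visits V1 < V2 < V3 with Unif(lo, hi) gaps; returns (L, R)
    with L = 0 for left censoring and R = inf for right censoring."""
    rng = np.random.default_rng() if rng is None else rng
    T = np.asarray(T, dtype=float)
    visits = np.cumsum(rng.uniform(lo, hi, size=(T.size, 3)), axis=1)
    before = visits < T[:, None]
    L = np.where(before, visits, 0.0).max(axis=1)        # last visit before T
    R = np.where(~before, visits, np.inf).min(axis=1)    # first visit at/after T
    return L, R
```

In practice the bounds `lo` and `hi` would be tuned, as described in the text, until the left and right censoring rates fall in the targeted ranges.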
Unless otherwise specified, the significance level of a test was set at 0.05. Simulation results for the proportional hazards model in the absence of left truncation are presented below. Additional simulation results for the proportional hazards model under left truncation and simulation results for the proportional odds model in the absence of left truncation are given in the Supplementary Materials. The proposed tests without covariate adjustment performed well in simulations, but the corresponding simulation results are not presented to save space.
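The two special cases used in the simulations can be checked numerically. The sketch below assumes the standard form of the semiparametric transformation model of Zeng and Lin (2006) with time-independent covariates, in which the survival function is S(t|Z) = exp[−G{Λ0(t)exp(γTZ)}] with G(x) = log(1 + rx)/r. With r = 0 this reduces to the proportional hazards model, and with r = 1 the odds of failure, (1 − S)/S, equal Λ0(t)exp(γTZ), which is the proportional odds model. The function names and numeric inputs are hypothetical.

```python
import math

def transform_G(x, r):
    """G(x) = log(1 + r x) / r for r > 0, with the limit G(x) = x at r = 0."""
    if r < 0:
        raise ValueError("r must be non-negative")
    if r == 0:
        return x
    return math.log1p(r * x) / r

def survival(cum_baseline, lin_pred, r):
    """S(t | Z) = exp(-G(Lambda0(t) * exp(gamma' Z))) under the assumed model."""
    return math.exp(-transform_G(cum_baseline * math.exp(lin_pred), r))

# r = 1: the odds of failure equal Lambda0(t) * exp(gamma' Z).
s_po = survival(0.7, 0.3, r=1)
odds_po = (1 - s_po) / s_po

# r = 0: the cumulative hazard -log S equals Lambda0(t) * exp(gamma' Z).
s_ph = survival(0.7, 0.3, r=0)
cumhaz_ph = -math.log(s_ph)
```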
Testing genetic association in the absence of genetic heterogeneity
In this series of simulations, we investigated the performance of WV-IC in detecting the association of a SNP set with a survival outcome subject to interval censoring under no genetic heterogeneity. The survival time for subject i (i = 1, …, n) was generated using the inverse CDF method from the proportional hazards model with the following hazard function,
| (4) |
where the values of β1j and β2k vary depending on the simulation scenario.
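The inverse CDF method mentioned above can be sketched as follows. Because the hazard function (4) is not displayed in this text, the sketch makes two explicit assumptions: the baseline hazard is a constant (`baseline_hazard`, hypothetical), and the markers and covariates enter through a log-linear predictor Σj β1jGij + Σk β2kZik. Under these assumptions S(t) = exp{−λ0 t exp(η)}, so solving S(T) = 1 − U for U ~ Unif(0, 1) gives T = −log(1 − U)/{λ0 exp(η)}.

```python
import numpy as np

def sample_ph_times(G, Z, beta1, beta2, baseline_hazard=1.0, rng=None):
    """Inverse-CDF sampling of survival times from a proportional hazards
    model with an assumed constant baseline hazard."""
    rng = np.random.default_rng() if rng is None else rng
    eta = np.asarray(G) @ np.asarray(beta1) + np.asarray(Z) @ np.asarray(beta2)
    U = rng.uniform(size=eta.shape[0])           # U in [0, 1)
    # Solve S(T) = 1 - U with S(t) = exp(-baseline_hazard * t * exp(eta)).
    return -np.log1p(-U) / (baseline_hazard * np.exp(eta))
```

For a non-constant baseline, the same idea applies after replacing `baseline_hazard * t` by the cumulative baseline hazard and inverting it.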
Empirical size and power of WV-IC under various n’s and p’s.
In this simulation, we set the regression coefficients in (4) to be β1j = β2k = 0.05 and β1j = 0, β2k = 0.05 (j = 1, …, p and k = 1, 2) in power assessment and size assessment, respectively. We set L = 0.1 and R = 0.6 for the exam time generation. The IBS kernel was used to measure genetic similarity in WV-IC. Table 1 shows that the empirical sizes of WV-IC are close to the nominal level under various n and p. The power of the test increases with the sample size.
Table 1.
Empirical size and power of WV-IC with r = 0 in testing genetic effects in the absence of left truncation.
| Test | p | Empirical size (power), n=500 | Empirical size (power), n=1000 |
|---|---|---|---|
| WV-IC | 4 | 0.047 (0.231) | 0.057 (0.410) |
| WV-IC | 8 | 0.048 (0.415) | 0.054 (0.738) |
Empirical size and power of WV-IC under linear confounding.
In this simulation, gene expressions were considered as the genetic markers under testing. Gi = 1Zi1 + 1Zi2 + ei, where Gi = (Gi1, …, Gip)T are the expression levels of p genes in subject i, 1 is a p-dimensional vector of ones, and ei = (ei1, …, eip)T follows a multivariate normal distribution with zero mean and covariance matrix $\Sigma_{p\times p} = \{0.5^{|k-l|}\}$. We set L = 0.1, R = 0.6 and the coefficient parameters in (4) to be β1j = 0, β2k = 0.05 for empirical size assessment and β1j = β2k = 0.02 for power evaluation (j = 1, …, p; k = 1, 2). We used the cross-product kernel to measure gene expression similarity in WV-IC. Table 2 shows that the empirical sizes of WV-IC were close to the nominal level under various choices of n and p, which supports the validity of WV-IC under linear confounding. The power of the test increases with the sample size.
Table 2.
Empirical size and power of WV-IC with r = 0 in testing genetic effects under linear confounding and no left truncation.
| Test | p | Empirical size (power), n=500 | Empirical size (power), n=1000 |
|---|---|---|---|
| WV-IC | 4 | 0.050 (0.158) | 0.055 (0.313) |
| WV-IC | 8 | 0.041 (0.320) | 0.056 (0.574) |
Empirical size and power of WV-IC for testing rare variants.
In this simulation, we assessed the empirical size and power of WV-IC for testing the effects of SNP sets consisting of rare variants. The rare variants’ MAFs were generated from Unif(0.005, 0.05). In the power assessment, the effect of a rare variant was given by the formula: . All the other settings were the same as for Table 1. The simulation results are shown in Table 3. The empirical sizes of WV-IC with the IBS and the unified Laplacian kernels are both close to the nominal level. Since the rarer variants had larger effects, WV-IC with the unified Laplacian kernel had higher power than WV-IC with the IBS kernel.
Table 3.
Empirical size and power of WV-IC with r = 0 for testing genetic effects of rare variants in the absence of left truncation.
| Test | Genetic similarity kernel | p | Empirical size (power), n=500 | Empirical size (power), n=1000 |
|---|---|---|---|---|
| WV-IC | IBS | 4 | 0.043 (0.206) | 0.053 (0.413) |
| WV-IC | Laplacian | 4 | 0.046 (0.234) | 0.048 (0.476) |
| WV-IC | IBS | 8 | 0.044 (0.307) | 0.050 (0.660) |
| WV-IC | Laplacian | 8 | 0.041 (0.382) | 0.047 (0.766) |
Empirical size and power of WV-IC for testing a large SNP set.
We also performed simulations to assess the empirical size and power of WV-IC for testing the effect of a set of 54 SNPs with a sample size of 1000. The SNP-set size of 54 was chosen because it is the 95% quantile of the sizes of the 23008 genes tested in the real application section. The sample size of 1000 was chosen because the sample size in the real application presented later is 1125. The simulation settings are the same as in the previous section of testing rare variants, except p = 54 and . The simulation results are shown in Table 4. One can see that the empirical sizes of WV-IC with the IBS and the unified Laplacian kernels are both close to the nominal level, but WV-IC with the unified Laplacian kernel had higher power due to the rarer variants having larger effects.
Table 4.
Empirical size and power of WV-IC with r = 0 for testing genetic effects of a large set of SNPs in the absence of left truncation.
| Test | Genetic similarity kernel | Empirical size (power), p=54, n=1000 |
|---|---|---|
| WV-IC | IBS | 0.045 (0.760) |
| WV-IC | Laplacian | 0.047 (0.919) |
Testing genetic association in the presence of genetic heterogeneity
In this series of simulations, we investigated the performance of HWV-IC in testing the association of a SNP set in the presence of genetic heterogeneity across 1) observable subpopulations, 2) two latent subpopulations, 3) twenty latent subpopulations, and 4) individual genome profiles, respectively.
Empirical size and power of HWV-IC in the presence of genetic heterogeneity across observable subpopulations.
In this simulation, we investigated the empirical size and power of HWV-IC in the presence of genetic heterogeneity across four observable subpopulations with equal proportions. We let the SNPs under testing have different MAFs and LD structures across the four subpopulations. Specifically, when generating the SNPs following the steps described at the beginning of the Simulations section, we used four different sets of MAFs and LD structures (represented by Σp×p) as follows,
MAFs = {0.0118, 0.0304, 0.0476, 0.0450, 0.0393, 0.0076, 0.0459, 0.0054} and $\Sigma_{p\times p} = \{0.2^{|k-l|}\}$ in subpopulation 1;
MAFs = {0.0378, 0.0421, 0.0312, 0.0467, 0.0156, 0.0085, 0.0362, 0.0336} and $\Sigma_{p\times p} = \{0.5^{|k-l|}\}$ in subpopulation 2;
MAFs = {0.0157, 0.0299, 0.0220, 0.0119, 0.0272, 0.0093, 0.0063, 0.0354} and $\Sigma_{p\times p} = \{0.7^{|k-l|}\}$ in subpopulation 3;
MAFs = {0.0257, 0.0323, 0.0289, 0.0252, 0.0078, 0.0319, 0.0439, 0.0193} and $\Sigma_{p\times p} = \{0.6^{|k-l|}\}$ in subpopulation 4,
where every MAF was generated from Unif(0.005, 0.05). The first four and eight values of each set of MAFs were used when p was 4 and 8 respectively. The survival time of subject i (i = 1, …, n) was generated from an exponential distribution with the following hazard function,
| (5) |
where Zi = (Zi1, Zi2, Zi3)T is a three-dimensional vector of dummy variables that codes the subpopulation subject i belongs to. For the size and the power assessments, we used respectively (β0, ) = (0, 0, 0, 0) and (β0, ) = (0.0005, 0.4, 0.002, 0.0005). We set β2 = (0.2, 0.2, 0.1)T, L = 0.1 and R = 0.5. The IBS kernel was used to measure the genetic similarity, and the identity kernel was used to measure the subpopulation similarity in HWV-IC. Table 5 shows that the empirical sizes of HWV-IC and WV-IC are both around the nominal level, but HWV-IC had higher power by accounting for the genetic heterogeneity across the subpopulations.
Table 5.
Comparison of performance of WV-IC and HWV-IC (r = 0) in testing genetic association under genetic heterogeneity across four observable subpopulations and no left truncation.
| Test | p | Empirical size (power), n=500 | Empirical size (power), n=1000 |
|---|---|---|---|
| HWV-IC | 4 | 0.043 (0.219) | 0.056 (0.496) |
| WV-IC | 4 | 0.044 (0.173) | 0.056 (0.331) |
| HWV-IC | 8 | 0.053 (0.233) | 0.048 (0.627) |
| WV-IC | 8 | 0.048 (0.180) | 0.053 (0.412) |
Empirical size and power of HWV-IC in the presence of genetic heterogeneity across two latent subpopulations.
In this simulation, we investigated the empirical size and power of HWV-IC under genetic heterogeneity across two latent subpopulations and no left truncation. The survival time of the j-th subject in the i-th subpopulation was generated from an exponential distribution with the following hazard rate,
| (6) |
where βi1 = … = βip(i = 1, 2) represent the effects of Gk’s (k = 1, …, p) in subpopulation i and vary depending on the simulation scenario. One covariate, , was simulated to infer the subpopulation. Specifically, Xi1 = ai + 1 + e, where e ~ N(0, 0.5) and ai ~ Bernoulli(0.5) is an indicator variable such that ai = 1 indicates that the i-th subject is from the first subpopulation. βik = 0 (i = 1, 2; k = 1, …, p) in size assessment, while different values were given to βik’s according to the different heterogeneity scenarios considered in power evaluation, as shown in Table 7. The IBS kernel was used to measure genetic similarity and the Gaussian kernel for subpopulation similarity. Table 6 shows that the empirical size of HWV-IC is close to the nominal level. Table 7 shows that the power of HWV-IC increases with the sample size and the heterogeneity size, measured by |β1k − β2k|.
Table 7.
Power of HWV-IC with r = 0 in testing genetic effects under genetic heterogeneity across two latent subpopulations and no left truncation. Various heterogeneity scenarios were considered, determined by the values of β1k and β2k, including the same effect size and the same effect direction (T1), identical sizes but opposite directions (T2), no effect in one subpopulation while positive effect in the other (T3), different sizes and opposite directions (T4), and different sizes but the same direction (T5).
| Heterogeneity scenario | T1 | T1 | T2 | T2 | T3 | T3 | T4 | T4 | T5 | T5 |
|---|---|---|---|---|---|---|---|---|---|---|
| β1k | 0.05 | 0.1 | −0.05 | −0.1 | 0 | 0 | −0.025 | −0.025 | 0.02 | 0.02 |
| β2k | 0.05 | 0.1 | 0.05 | 0.1 | 0.05 | 0.1 | 0.05 | 0.1 | 0.05 | 0.1 |
| p=4, n=500 | 0.153 | 0.508 | 0.079 | 0.213 | 0.063 | 0.185 | 0.083 | 0.201 | 0.079 | 0.202 |
| p=4, n=1000 | 0.257 | 0.862 | 0.136 | 0.402 | 0.110 | 0.369 | 0.108 | 0.305 | 0.162 | 0.440 |
| p=8, n=500 | 0.210 | 0.790 | 0.232 | 0.735 | 0.139 | 0.441 | 0.167 | 0.437 | 0.142 | 0.450 |
| p=8, n=1000 | 0.423 | 0.989 | 0.477 | 0.976 | 0.230 | 0.801 | 0.280 | 0.788 | 0.258 | 0.812 |
Table 6.
Empirical size of HWV-IC with r = 0 in testing genetic effects under genetic heterogeneity across two latent subpopulations and no left truncation.
| Test | p | Empirical size, n=500 | Empirical size, n=1000 |
|---|---|---|---|
| HWV-IC | 4 | 0.042 | 0.049 |
| HWV-IC | 8 | 0.047 | 0.045 |
Empirical size and power of HWV-IC in the presence of genetic heterogeneity across twenty latent subpopulations.
In this simulation, we investigated the performance of HWV-IC under settings similar to those of the previous section but increased the number of latent subpopulations to 20. The genetic effects of Gk’s (k = 1, …, p) in each of the 20 subpopulations were set to satisfy βi1 = … = βip (i = 1, …, 20). The genetic effects were zero in empirical size assessment and were sampled from a uniform distribution with mean μβ and variance $\sigma_\beta^2$ in power assessment. We simulated 25 covariates, Xd (d = 1, …, 25), to infer the subpopulation. The Xd of the j-th subject in the i-th subpopulation was generated by Xijd = aid + δij, where aid (d = 1, …, 25) were randomly sampled from {1, …, 20} without replacement and δij ~ N(0, 0.5). We used the IBS kernel to measure the genetic similarity and the Gaussian kernel to measure the subpopulation similarity in HWV-IC. Table 8 shows that the empirical size of HWV-IC is close to the nominal level. Table 9 shows that as the average effect size (μβ) increases, the power of HWV-IC increases. For a fixed average effect size, as the genetic heterogeneity size (σβ) increases, HWV-IC gains power.
Table 8.
Empirical size of HWV-IC with r = 0 in testing genetic effects under genetic heterogeneity across twenty latent subpopulations and no left truncation.
| Test | p | Empirical size, n=500 | Empirical size, n=1000 |
|---|---|---|---|
| HWV-IC | 4 | 0.051 | 0.050 |
| HWV-IC | 8 | 0.045 | 0.052 |
Table 9.
Power of HWV-IC with r = 0 under genetic heterogeneity across twenty latent subpopulations and no left truncation.
| Power | |||||||||
|---|---|---|---|---|---|---|---|---|---|
| μ β | 0 | 0 | 0 | 0.025 | 0.025 | 0.025 | 0.05 | 0.05 | 0.05 |
| σ β | 0.05 | 0.1 | 0.15 | 0.05 | 0.1 | 0.15 | 0.05 | 0.1 | 0.15 |
| p=4, n=500 | 0.063 | 0.148 | 0.369 | 0.105 | 0.240 | 0.483 | 0.224 | 0.422 | 0.640 |
| p=4, n=1000 | 0.108 | 0.376 | 0.804 | 0.252 | 0.562 | 0.906 | 0.532 | 0.796 | 0.969 |
| p=8, n=500 | 0.179 | 0.766 | 0.993 | 0.283 | 0.862 | 0.996 | 0.549 | 0.933 | 0.999 |
| p=8, n=1000 | 0.434 | 0.989 | 1.000 | 0.687 | 0.999 | 1.000 | 0.908 | 1.000 | 1.000 |
Empirical size and power of HWV-IC in the presence of genetic heterogeneity across individual genome profiles.
In this simulation, we investigated the performance of HWV-IC when the subpopulation structure is “continuous”. Specifically, we let the genetic effect vary across individual genome profiles instead of a small number of subpopulations, e.g., males and females. We generated the survival time of the i-th (i = 1, …, n) subject from an exponential distribution with the following hazard rate,
| (7) |
where βi1 = βi2 = … = βip; these effects were set to zero in the empirical size assessment and were randomly sampled from a uniform distribution with mean μβ and variance σβ² in the power assessment. The values of μβ and σβ were varied to produce different simulation scenarios. We simulated a set of 1000 SNPs for each subject, (Xi1, …, Xi,1000), to represent the genome profile. The genome profile was generated in two steps: 1) for each 1 ≤ d ≤ 1000, sample a latent n-dimensional vector from a multivariate normal distribution, MVN(0, Σ), where Σ is an n × n covariance matrix with (i, j)-th element Σij = I(i = j) under the null hypothesis (no genetic association) and Σij = exp(−|βi1 − βj1|/σβ) under the alternative; 2) Xid is then obtained by categorizing the i-th element of the latent vector into three levels, 0, 1 and 2, using rank-based cut-off values (i = 1, …, n). The cut-off values were selected to achieve Hardy-Weinberg equilibrium and the pre-specified minor allele frequency, which was randomly sampled from Unif(0.05, 0.2). (L, R) were set to (0.2, 0.6) and (0.1, 0.5) for the empirical size assessment and the power assessment, respectively. We used the IBS kernel to measure both the genetic and the subpopulation similarity in HWV-IC. Table 10 shows that the empirical size of HWV-IC is around the nominal level. Table 11 shows that the power of HWV-IC increases with the mean effect size (μβ) and that, for a fixed mean effect size, HWV-IC becomes more powerful as the genetic heterogeneity (σβ) increases.
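The two-step genome-profile generation can be illustrated with the sketch below, shown for the null scenario (Σ = I) only. It is not the authors' implementation: the function name `latent_to_genotypes` is hypothetical, and the choice to assign the minor-allele genotypes to the largest latent values is an assumption. The cut-offs are placed on the ranks so that the genotype counts follow the Hardy-Weinberg proportions (1 − q)², 2q(1 − q) and q² for a minor allele frequency q.

```python
import numpy as np

def latent_to_genotypes(z, maf):
    """Categorise a latent vector into 0/1/2 using rank-based cut-offs chosen so
    that genotype frequencies follow Hardy-Weinberg proportions for `maf`."""
    n = z.size
    ranks = np.argsort(np.argsort(z))             # rank 0 = smallest latent value
    n0 = int(round(n * (1 - maf) ** 2))            # subjects with 0 minor alleles
    n01 = int(round(n * (1 - maf ** 2)))           # subjects with 0 or 1 minor alleles
    return (ranks >= n0).astype(int) + (ranks >= n01).astype(int)

rng = np.random.default_rng(1)
n, n_snps = 500, 1000

# Pre-specified minor allele frequency of each SNP, sampled from Unif(0.05, 0.2)
mafs = rng.uniform(0.05, 0.2, size=n_snps)

# Step 1 (null scenario): Sigma = I, so each latent n-vector is i.i.d. standard normal
Z = rng.standard_normal((n, n_snps))

# Step 2: rank-based categorisation, one SNP (column) at a time
X = np.column_stack([latent_to_genotypes(Z[:, d], mafs[d]) for d in range(n_snps)])
```

Under the alternative, the only change would be to draw each latent vector with the covariance Σij = exp(−|βi1 − βj1|/σβ), for example through a Cholesky factor of Σ, so that subjects with similar genetic effects receive correlated genome profiles.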
Table 10.
Empirical size of HWV-IC with r = 0 in testing genetic effects under genetic heterogeneity across individual genome profiles and no left truncation.
| Empirical size | p=4, n=500 | p=4, n=1000 | p=8, n=500 | p=8, n=1000 |
|---|---|---|---|---|
| HWV-IC | 0.043 | 0.052 | 0.047 | 0.049 |
Table 11.
Power of HWV-IC with r = 0 under genetic heterogeneity across individual genome profiles and no left truncation.
| μβ | 0 | 0 | 0 | 0.02 | 0.02 | 0.02 | 0.04 | 0.04 | 0.04 |
|---|---|---|---|---|---|---|---|---|---|
| σβ | 0.04 | 0.08 | 0.12 | 0.04 | 0.08 | 0.12 | 0.04 | 0.08 | 0.12 |
| p=4, n=500 | 0.062 | 0.076 | 0.103 | 0.089 | 0.116 | 0.149 | 0.170 | 0.190 | 0.252 |
| p=4, n=1000 | 0.069 | 0.087 | 0.223 | 0.125 | 0.204 | 0.369 | 0.330 | 0.442 | 0.590 |
| p=8, n=500 | 0.094 | 0.280 | 0.693 | 0.170 | 0.388 | 0.765 | 0.347 | 0.589 | 0.850 |
| p=8, n=1000 | 0.117 | 0.750 | 0.996 | 0.326 | 0.856 | 0.999 | 0.699 | 0.957 | 1.000 |
Testing G–G/G–E interaction
In this simulation, we compared the performance of WVI-IC and the regular likelihood ratio test (LRT) in testing interactions between G, a SNP set, and H, a SNP set or a binary variable (Bernoulli(0.5), acting as an environment variable). The survival times were generated from an exponential distribution with the following hazard rate,
| (8) |
where Gk, Hl, and (GH)m are, respectively, the k-th, l-th, and m-th elements of G, H and their interaction term GH. We set βm = 0 (m = 1, …, pq) under the null hypothesis of no G–H interaction and βm = 0.08 under the alternative hypothesis that the interaction exists. The cross-product kernel was used to measure the genetic similarity and the identity kernel was used to measure the environment similarity. Tables 12 and 13 show that the empirical size of WVI-IC is close to the nominal level, whereas that of LRT exceeds the nominal level when the number of elements in the interaction term is large. In all the simulation scenarios, WVI-IC is more powerful than LRT.
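To make the dimension of the interaction term concrete, the sketch below builds the pq interaction elements as all pairwise products Gk·Hl. Treating (GH)m as such products, ordering them with m = k·q + l (zero-based), and simulating independent SNPs with a minor allele frequency of 0.2 are illustrative assumptions, not details taken from the simulation design.

```python
import numpy as np

def interaction_terms(G, H):
    """Return the n x (p*q) matrix of all pairwise products G_k * H_l.
    Column m = k*q + l holds G[:, k] * H[:, l] (zero-based indices)."""
    n, p = G.shape
    q = H.shape[1]
    return (G[:, :, None] * H[:, None, :]).reshape(n, p * q)

rng = np.random.default_rng(2)
n, p, q = 500, 4, 2

# Illustrative SNP sets coded as minor-allele counts (MAF 0.2 is an assumption)
G = rng.binomial(2, 0.2, size=(n, p))
H = rng.binomial(2, 0.2, size=(n, q))

GH = interaction_terms(G, H)          # pq = 8 interaction elements per subject
beta = np.full(p * q, 0.08)           # alternative hypothesis: beta_m = 0.08 for all m
interaction_effect = GH @ beta        # contribution of the interaction to the log hazard
```

With p = 6 and q = 3 the interaction term already has 18 elements, which is the regime in which the tables below show inflated size for LRT.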
Table 12.
Empirical sizes and powers of WVI-IC and LRT in testing G–G interaction under no left truncation.
| Empirical size (power) | p=4, q=2, n=500 | p=4, q=2, n=1000 | p=6, q=3, n=500 | p=6, q=3, n=1000 |
|---|---|---|---|---|
| WVI-IC | 0.046 (0.458) | 0.052 (0.793) | 0.060 (0.864) | 0.057 (0.989) |
| LRT | 0.069 (0.305) | 0.073 (0.519) | 0.115 (0.590) | 0.076 (0.863) |
Table 13.
Empirical sizes and powers of WVI-IC and LRT in detecting G–E interaction under no left truncation.
| Empirical size (power) | p=4, q=1, n=500 | p=4, q=1, n=1000 | p=8, q=1, n=500 | p=8, q=1, n=1000 |
|---|---|---|---|---|
| WVI-IC | 0.051 (0.236) | 0.056 (0.435) | 0.047 (0.450) | 0.043 (0.768) |
| LRT | 0.055 (0.204) | 0.054 (0.350) | 0.069 (0.341) | 0.068 (0.575) |
Empirical sizes of WV-IC and HWV-IC under stringent p-value thresholds
Genome-wide association studies with genotyping or sequencing data usually test the associations between hundreds of thousands of genetic variants and a phenotype, which creates a multiple testing problem. Common approaches to this problem, such as the Bonferroni correction and the Benjamini-Hochberg procedure (Benjamini & Hochberg, 1995), lead to stringent p-value thresholds when applied to such association analyses. In this simulation, we investigated the sizes of WV-IC and HWV-IC under stringent p-value thresholds (much smaller than 0.05). The simulation setting for WV-IC was the same as that used to assess its size and power at the 0.05 level. The simulation setting for HWV-IC was the same as that used to assess its size and power at the 0.05 level in the presence of genetic heterogeneity across observable subpopulations. In both scenarios, 150,000 Monte Carlo samples with n = 1000 and p = 4 were generated to calculate the empirical sizes. Table 14 shows that the empirical sizes of WV-IC and HWV-IC are close to the stringent p-value thresholds. These results indicate that WV-IC and HWV-IC are suitable for large-scale genetic association analyses.
Table 14.
Empirical sizes of WV-IC and HWV-IC under stringent p-value thresholds and no left truncation.
| Threshold | WV-IC | HWV-IC |
|---|---|---|
| 0.005 | 0.0046 | 0.0047 |
| 0.0005 | 0.00039 | 0.00053 |
| 0.00005 | 0.000044 | 0.000047 |
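One way to judge how close the empirical sizes in Table 14 are to their thresholds is to compare each difference with the Monte Carlo standard error of a rejection proportion. The sketch below uses the usual binomial standard error, sqrt(α(1 − α)/N) with N = 150,000 replicates; this yardstick is our addition for illustration and is not part of the reported analysis.

```python
import math

N_REPLICATES = 150_000
thresholds = [0.005, 0.0005, 0.00005]
observed = {
    "WV-IC": [0.0046, 0.00039, 0.000044],
    "HWV-IC": [0.0047, 0.00053, 0.000047],
}

def mc_standard_error(alpha, n_rep):
    """Binomial standard error of an empirical rejection rate when the true size is alpha."""
    return math.sqrt(alpha * (1.0 - alpha) / n_rep)

# Standardised deviation of each empirical size from its nominal threshold
z_scores = {
    name: [(size - alpha) / mc_standard_error(alpha, N_REPLICATES)
           for size, alpha in zip(sizes, thresholds)]
    for name, sizes in observed.items()
}

for name, zs in z_scores.items():
    for alpha, z in zip(thresholds, zs):
        print(f"{name:7s} threshold={alpha:g}  deviation={z:+.2f} standard errors")
```

At the most stringent threshold only about 7.5 rejections are expected among 150,000 replicates, so the standard error there is of the same order as the threshold itself and small absolute differences carry little information.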
Application to a dental caries dataset
The etiology of dental caries has a substantial genetic component. Estimates of dental caries heritability range from 30% to 70%, with higher heritability found for primary versus permanent dentition caries (Bretz et al., 2005; Shaffer et al., 2012; Vieira, Modesto, & Marazita, 2014; Wang et al., 2010). However, genetic studies of early childhood caries (ECC), i.e., caries before age 6, have not identified any variant that reaches genome-wide significance. All the existing genome-wide association studies of ECC (Ballantine et al., 2017; Orlova et al., 2019; Shaffer et al., 2011) tested the genetic association in a single-locus manner. In this paper, we performed a genome-wide association analysis of an ECC dataset using the proposed multi-marker association tests. The dataset analyzed was extracted from a master dataset, Dental Caries: Whole Genome Association and Gene × Environment Studies, which was obtained from dbGaP (accession number: phs000095.v3.p1). The master dataset contains caries assessment data and whole-genome genotyping data of 5418 subjects from four study sites: PITT, IOWA, DRDR and GEIRS. The number of typed SNPs is 601,273. For each subject in the master dataset, caries data were available from only one dental exam. The dataset analyzed consists of the participants who were younger than age 6 at the dental exam. None of them were from DRDR, since that site had no such participant in the master dataset. The phenotype is the age to ECC, which is subject to case 1 interval censoring.
SNP-level and subject-level filtering was conducted using PLINK 1.9. Specifically, a SNP was removed if any of the following three criteria was met: 1) MAF < 0.01, 2) p-value of the HWE test < 10^−6, or 3) missing rate > 2%. A subject was removed if his/her genotype missing rate was greater than 2%. After the quality control, the final analysis dataset contained 1125 subjects, each with 553,194 SNPs. Of the 1125 subjects, 385 had developed ECC by the time of the dental exam. The missing genotypes of a SNP were imputed by sampling from a binomial distribution, Binomial(2, p̂), where p̂ was the sample minor allele frequency estimated using the non-missing genotypes.
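The imputation step can be sketched as below. This is a minimal illustration under stated assumptions: genotypes are coded as minor-allele counts with missing entries stored as NaN, the binomial draw uses two trials (one per allele) with success probability equal to the sample minor allele frequency of the non-missing genotypes, and the function name `impute_missing_genotypes` is hypothetical.

```python
import numpy as np

def impute_missing_genotypes(genotypes, rng):
    """Fill NaN entries of an n x p genotype matrix (0/1/2 minor-allele counts)
    with draws from Binomial(2, p_hat), where p_hat is estimated per SNP
    from the non-missing genotypes."""
    g = np.array(genotypes, dtype=float)           # work on a copy
    for j in range(g.shape[1]):
        missing = np.isnan(g[:, j])
        if missing.any():
            p_hat = np.nanmean(g[:, j]) / 2.0      # allele frequency = mean count / 2
            g[missing, j] = rng.binomial(2, p_hat, size=missing.sum())
    return g.astype(int)

# Toy example: 6 subjects, 2 SNPs, two missing genotypes
rng = np.random.default_rng(3)
raw = np.array([[0, 1], [1, np.nan], [0, 0], [np.nan, 2], [2, 1], [0, 0]])
imputed = impute_missing_genotypes(raw, rng)
```

Because the quality control already removes SNPs with a missing rate above 2%, every SNP has enough observed genotypes for the allele-frequency estimate; the sketch does not handle a column that is entirely missing.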
We performed gene-based genome-wide association analyses. The 553,194 SNPs were grouped into 23,008 genes based on the reference genome hg18. Specifically, the SNPs located within a gene or within 5000 base pairs upstream or downstream of it were assigned to that gene. We then tested the association of each of the 23,008 genes with age to ECC. To improve power and/or reduce confounding in the genetic association analyses, we adjusted for sex, race, study cohort (PITT, IOWA or GEIRS), and the top 10 principal components of the SNP data. We used WV-IC and HWV-IC for the association analyses. In the analyses using HWV-IC, we considered four heterogeneity sources: race, sex, genetic background, and home water fluoride level. Race has six categories: White, Asian, Black, American Indian, Bi- or multi-Racial, and Other. The genetic background was represented by a random sample of 200K SNPs from the whole genome. Home water fluoride level was dichotomized as ‘sufficient’ (> 0.7 mg/L) and ‘insufficient’ (≤ 0.7 mg/L). When considering genetic heterogeneity between the two home water fluoride levels, the analysis dataset was a subset consisting of 652 subjects from two study sites, PITT and IOWA, because home water fluoride data were missing in GEIRS. In all the analyses, we considered two transformation models: the proportional odds (r = 1) and proportional hazards (r = 0) models. The false discovery rate (FDR) was controlled using the Benjamini-Hochberg procedure (Benjamini & Hochberg, 1995) without the assumption that the tests were independent of each other. The corresponding p-value threshold for the i-th smallest p-value is iα/(m Σ_{j=1}^{m} 1/j) for i = 1, …, m, where m = 23,008 and α, the target FDR, is 10%. The cross-product and the IBS kernels were both used to measure the genetic similarity. The genetic background similarity was measured by the IBS kernel. The identity kernel was used to measure the subpopulation similarity when the subpopulations were defined by race, sex or the dichotomized home water fluoride level.
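The p-value thresholds reported in the last row of Table 15 can be reproduced with the short sketch below. It assumes that the dependence-robust version of the step-up procedure was used, in which the usual threshold iα/m is divided by the harmonic sum Σ_{j=1}^{m} 1/j (the adjustment commonly attributed to Benjamini and Yekutieli); this reading is inferred from the reported thresholds, and the function name is hypothetical.

```python
def step_up_thresholds(m, alpha, top=5):
    """Thresholds i * alpha / (m * H_m) for the `top` smallest p-values,
    where H_m = sum_{j=1}^{m} 1/j makes the step-up procedure valid
    under arbitrary dependence among the m tests."""
    harmonic = sum(1.0 / j for j in range(1, m + 1))
    return [i * alpha / (m * harmonic) for i in range(1, top + 1)]

# m = 23008 genes, target FDR alpha = 10%
thresholds = step_up_thresholds(m=23008, alpha=0.10)
formatted = [f"{t:.2E}" for t in thresholds]
print(formatted)
```

For comparison, the unadjusted threshold for the smallest p-value would be α/m ≈ 4.35E-06, roughly ten times larger, which shows how much the dependence adjustment tightens the criterion when m is in the tens of thousands.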
The data analysis was performed in R 4.0.2 on the computers (28-core CPU at 2.4 GHz and 115 GB RAM) of the MSU High Performance Computing Center (HPCC). The average computation time of the association tests was less than one second per gene. In our analysis, the 23,008 genes were divided into 231 tasks, each testing 100 genes (the last task tested only 8 genes). Therefore, the computation time of each task was less than 2 minutes on average. We carried out the 231 tasks simultaneously on 231 computers of the MSU HPCC. As a result, the real data analysis was completed in a few minutes. Had all the tasks been carried out sequentially, the analysis would have taken 6.4 hours at most.
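The task counts and time bounds quoted above follow from simple arithmetic, sketched here with the stated upper bound of one second per gene.

```python
import math

n_genes = 23008
genes_per_task = 100
sec_per_gene = 1.0   # stated upper bound on the average time per gene

n_tasks = math.ceil(n_genes / genes_per_task)                 # number of parallel tasks
last_task_genes = n_genes - (n_tasks - 1) * genes_per_task    # genes in the final task
minutes_per_task = genes_per_task * sec_per_gene / 60.0       # bound for a full task
sequential_hours = n_genes * sec_per_gene / 3600.0            # bound if run one after another

print(n_tasks, last_task_genes, round(minutes_per_task, 2), round(sequential_hours, 2))
```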
The analysis results are summarized in Table 15. None of the genes reached the significance level after the multiple-testing adjustment. Nevertheless, MPPED2 and TSPAN2 are the two genes that were most frequently found to have the strongest association with age to ECC. MPPED2 was implicated in the first GWAS of childhood caries (Shaffer et al., 2011), and subsequently showed evidence of association in a meta-analysis of five independent replication samples (Stanley et al., 2014), although its role in caries etiology remains unknown. TSPAN2 has not been reported in the literature of ECC genetics before. The functions of these two genes are summarized in Table 16.
Table 15.
Top five genes identified by WV-IC and HWV-IC in a gene-based genome-wide association analysis of the DC-WGAGE dataset. PH and PO stand for the proportional hazards model and the proportional odds model, respectively. CP and IBS stand for the cross-product kernel and the IBS kernel, respectively. Various types of heterogeneity were considered, including no genetic heterogeneity (S1), heterogeneity across races (S2), heterogeneity between the two home water fluoride levels (S3), heterogeneity between sexes (S4), and heterogeneity across genetic backgrounds (S5).
| Model + kernel | Scenario | 1st gene (p-value) | 2nd gene (p-value) | 3rd gene (p-value) | 4th gene (p-value) | 5th gene (p-value) |
|---|---|---|---|---|---|---|
| PO+CP | S1 | MPPED2 (5.01E-05) | OR2B11 (7.61E-05) | NECAB3 (8.62E-05) | E2F1 (8.62E-05) | CYB5R2 (1.69E-04) |
| PO+CP | S2 | MPPED2 (3.21E-05) | HOXC13-AS (1.29E-04) | CYB5R2 (1.37E-04) | NECAB3 (1.79E-04) | E2F1 (1.79E-04) |
| PO+CP | S3 | TSPAN2 (1.17E-05) | CRCT1 (2.23E-05) | HOXC13-AS (2.55E-05) | LCE5A (2.73E-05) | TSHB (4.78E-05) |
| PO+CP | S4 | MPPED2 (5.88E-05) | OR2B11 (9.91E-05) | NECAB3 (1.10E-04) | E2F1 (1.11E-04) | CDK5RAP1 (1.65E-04) |
| PO+CP | S5 | MPPED2 (4.94E-05) | OR2B11 (7.84E-05) | NECAB3 (8.70E-05) | E2F1 (8.70E-05) | CYB5R2 (1.69E-04) |
| PH+CP | S1 | MPPED2 (8.23E-05) | OR2B11 (8.48E-05) | NECAB3 (8.90E-05) | E2F1 (8.90E-05) | CYB5R2 (9.26E-05) |
| PH+CP | S2 | MPPED2 (5.75E-05) | CYB5R2 (9.12E-05) | LINC00511 (1.69E-04) | HOXC13-AS (1.90E-04) | NCR3LG1 (2.08E-04) |
| PH+CP | S3 | TSPAN2 (1.27E-05) | HOXC13-AS (3.05E-05) | CRCT1 (4.53E-05) | ECRG4 (5.07E-05) | LCE5A (5.50E-05) |
| PH+CP | S4 | CYB5R2 (2.16E-04) | MPPED2 (2.21E-04) | TSPAN2 (2.49E-04) | NCR3LG1 (2.64E-04) | OR2B11 (2.73E-04) |
| PH+CP | S5 | TSPAN2 (1.77E-04) | MPPED2 (1.93E-04) | CYB5R2 (2.02E-04) | TPM3P9, ZNF761, ZNF765 (2.07E-04) | NCR3LG1 (2.09E-04) |
| PO+IBS | S1 | LOC101927989 (4.68E-05) | MPPED2 (4.89E-05) | PAX9 (7.04E-05) | NECAB3 (7.61E-05) | E2F1 (7.61E-05) |
| PO+IBS | S2 | MPPED2 (2.78E-05) | STARD5 (3.56E-05) | LOC101927989 (1.19E-04) | CCDC185 (1.31E-04) | CYB5R2 (1.50E-04) |
| PO+IBS | S3 | TSPAN2 (5.84E-06) | CRCT1 (5.63E-05) | TSHB (6.04E-05) | LCE5A (6.23E-05) | HOXC13-AS (6.48E-04) |
| PO+IBS | S4 | LOC101927989 (5.03E-05) | CCDC185 (6.12E-05) | MPPED2 (6.46E-05) | PAX9 (8.93E-05) | NECAB3 (9.72E-05) |
| PO+IBS | S5 | LOC101927989 (4.79E-05) | MPPED2 (4.85E-05) | PAX9 (7.13E-05) | NECAB3 (7.79E-05) | E2F1 (7.79E-05) |
| PH+IBS | S1 | CCDC185 (5.76E-05) | MPPED2 (6.89E-05) | NECAB3 (7.58E-05) | E2F1 (7.58E-05) | PAX9 (8.95E-05) |
| PH+IBS | S2 | STARD5 (4.28E-05) | MPPED2 (4.36E-05) | CCDC185 (7.93E-05) | CYB5R2 (1.07E-04) | PPM1E (1.65E-04) |
| PH+IBS | S3 | TSPAN2 (6.21E-06) | HOXC13-AS (7.55E-05) | TSHB (7.83E-05) | CRCT1 (1.14E-04) | ECRG4 (1.25E-04) |
| PH+IBS | S4 | CCDC185 (2.04E-05) | STARD5 (1.02E-04) | TSPAN2 (1.63E-04) | MPPED2 (2.01E-04) | CYB5R2 (2.51E-04) |
| PH+IBS | S5 | CCDC185 (2.79E-05) | STARD5 (7.71E-05) | TSPAN2 (1.12E-04) | MPPED2 (1.60E-04) | CYB5R2 (2.14E-04) |
| p-value threshold | | 4.09E-07 | 8.18E-07 | 1.23E-06 | 1.64E-06 | 2.05E-06 |
Table 16.
Functions of MPPED2 and TSPAN2 according to Entrez.
| Gene | Function |
|---|---|
| MPPED2 | This gene likely encodes a metallophosphoesterase. The encoded protein may play a role in brain development. |
| TSPAN2 | The protein encoded by this gene is a member of the transmembrane 4 superfamily. The proteins mediate signal transduction events that play a role in the regulation of cell development, activation, growth and motility. |
Discussion
In this paper, we developed the first set of multi-marker tests for genetic associations and G–G/G–E interactions with interval-censored and possibly left-truncated survival outcomes. The new tests can adjust for covariates based on a semiparametric transformation model. The proposed HWV-IC can also account for genetic heterogeneity to increase the power of association testing.
Our methods are directly applicable to interval-censored competing risks data if the observation process is independent of the process of competing risks given the covariates. Specifically, by applying our methods to the reduced interval-censored competing risks data pertaining to the failure cause of interest (Hudgens, Li, & Fine, 2014), one can test the effect of a marker set or a G–G/G–E interaction on the cumulative incidence function of that cause.
It would be worthwhile to extend the proposed tests to multivariate survival phenotypes. This extension has applications to genetic studies of chronic diseases that can occur at multiple sites in the human body, such as dental caries, diabetic retinopathy, and age-related macular degeneration. Joint analysis of the failure times at the different sites could increase the power to detect the association of such a disease with SNPs, genes or biological pathways. The extension hinges on finding an appropriate phenotype similarity for multivariate interval-censored survival endpoints.
Acknowledgments
This work was partially supported by the National Institute of Dental and Craniofacial Research (Award No. R03DE027429), the National Institute on Drug Abuse (Award No. R01DA043501) and the National Library of Medicine (Award No. R01LM012848). Funding support for the study entitled Dental Caries: Whole Genome Association and Gene × Environment Studies was provided by the National Institute of Dental and Craniofacial Research (NIDCR, grant number U01-DE018903). This genome-wide association study is part of the Gene Environment Association Studies (GENEVA) program of the trans-NIH Genes, Environment and Health Initiative (GEI). Genotyping services were provided by the Center for Inherited Disease Research (CIDR). CIDR is fully funded through a federal contract from the National Institutes of Health (NIH) to The Johns Hopkins University, contract number HHSN268200782096C. Funds for this project’s genotyping were provided by the NIDCR through CIDR’s NIH contract. Assistance with phenotype harmonization and genotype cleaning, as well as with general study coordination, was provided by the GENEVA Coordinating Center (U01-HG004446) and by the National Center for Biotechnology Information (NCBI). Data and samples were provided by: (1) the Center for Oral Health Research in Appalachia (a collaboration of the University of Pittsburgh and West Virginia University funded by NIDCR R01-DE 014899); (2) the University of Pittsburgh School of Dental Medicine (SDM) DNA Bank and Research Registry, supported by the SDM and by the University of Pittsburgh Clinical and Translational Sciences Institute (funded by NIH/NCRR/CTSA Grant UL1-RR024153); (3) the Iowa Fluoride Study and the Iowa Bone Development Study, funded by NIDCR (R01-DE09551 and R01-DE12101, respectively); and (4) the Iowa Comprehensive Program to Investigate Craniofacial and Dental Anomalies (funded by NIDCR, P60-DE-013076).
The datasets used for the analyses described in this manuscript were obtained from dbGaP at https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs000095.v3.p1 through dbGaP accession number phs000095.v3.p1.
Funding information:
This work was partially supported by the National Institute of Dental and Craniofacial Research (Award No. R03DE027429 and R56DE030437), the National Institute on Drug Abuse (Award No. R01DA043501) and the National Library of Medicine (Award No. R01LM012848).
Footnotes
Computer Program
R code implementing the methods developed in this paper is available at https://github.com/didiwu345/WV-IC.
Supporting Information
Additional Supporting Information may be found online in the supporting information section.
Data Availability Statement
The GWAS data in the real application are available from https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs000095.v3.p1 with the permission of dbGaP.
References
- Ballantine J, Carlson J, Zandoná A, Agler C, Zeldin L, Rozier R, … Divaris K (2017, 10). Exploring the genomic basis of early childhood caries: a pilot study. International Journal of Paediatric Dentistry, 28. doi: 10.1111/ipd.12344
- Benjamini Y, & Hochberg Y (1995). Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society. Series B (Methodological), 57(1), 289–300. Retrieved from http://www.jstor.org/stable/2346101
- Bretz W, Corby P, Hart T, Costa S, Coelho M, Weyant R, … Schork N (2005, 05). Dental caries and microbial acid production in twins. Caries Research, 39, 168–72. doi: 10.1159/000084793
- Cai T, Tonini G, & Lin X (2011). Kernel machine approach to testing the significance of multiple genetic markers for risk prediction. Biometrics, 67(3), 975–986. doi: 10.1111/j.1541-0420.2010.01544.x
- Chen H, Lumley T, Brody J, Heard-Costa NL, Fox CS, Cupples LA, & Dupuis J (2014). Sequence kernel association test for survival traits. Genetic Epidemiology, 38(3), 191–197. doi: 10.1002/gepi.21791
- Davies RB (1980). Algorithm AS 155: The distribution of a linear combination of χ2 random variables. Journal of the Royal Statistical Society. Series C (Applied Statistics), 29(3), 323–333. Retrieved from http://www.jstor.org/stable/2346911
- Goeman JJ, Oosting J, Cleton-Jansen A-M, Anninga JK, & van Houwelingen HC (2005, 01). Testing association of a pathway with survival using gene expression data. Bioinformatics, 21(9), 1950–1957. doi: 10.1093/bioinformatics/bti267
- Hudgens MG, Li C, & Fine JP (2014). Parametric likelihood inference for interval censored competing risks data. Biometrics, 70(1), 1–9. doi: 10.1111/biom.12109
- Li C, Pak D, & Todem D (2020). Adaptive lasso for the Cox regression with interval censored and possibly left truncated data. Statistical Methods in Medical Research, 29(4), 1243–1255. doi: 10.1177/0962280219856238
- Li C, Wu D, & Lu Q (2021). Set-based genetic association and interaction tests for survival outcomes based on weighted V statistics. Genetic Epidemiology, 45, 46–63.
- Liu D, Ghosh D, & Lin X (2008). Estimation and testing for the effect of a genetic pathway on a disease outcome using logistic kernel machine regression via logistic mixed model. BMC Bioinformatics, 9, 292. doi: 10.1186/1471-2105-9-292
- Liu D, Lin X, & Ghosh D (2007). Semiparametric regression of multidimensional genetic pathway data: Least-squares kernel machines and linear mixed models. Biometrics, 63(4), 1079–1088. doi: 10.1111/j.1541-0420.2007.00799.x
- Orlova E, Carlson J, Lee M-K, Feingold E, McNeil D, Crout R, … Shaffer J (2019, 09). Pilot GWAS of caries in African-Americans shows genetic heterogeneity. BMC Oral Health, 19. doi: 10.1186/s12903-019-0904-4
- Shaffer J, Wang X, DeSensi R, Wendell S, Weyant R, Cuenco K, … Marazita M (2012, 02). Genetic susceptibility to dental caries on pit and fissure and smooth surfaces. Caries Research, 46, 38–46. doi: 10.1159/000335099
- Shaffer J, Wang X, Feingold E, Lee M-K, Begum F, Weeks D, … Marazita M (2011, 09). Genome-wide association scan for childhood caries implicates novel genes. Journal of Dental Research, 90, 1457–62. doi: 10.1177/0022034511422910
- Sinnott JA, & Cai T (2013). Omnibus risk assessment via accelerated failure time kernel machine modeling. Biometrics, 69(4), 861–873. doi: 10.1111/biom.12098
- Stanley B, Feingold E, Cooper M, Vanyukov M, Maher B, Slayton R, … Shaffer J (2014). Genetic association of MPPED2 and ACTN2 with dental caries. Journal of Dental Research, 93(7), 626–632. doi: 10.1177/0022034514534688
- Vieira A, Modesto A, & Marazita M (2014, 05). Caries: Review of human genetics research. Caries Research, 48, 491–506. doi: 10.1159/000358333
- Wang X, Shaffer J, Weyant R, Cuenco K, DeSensi R, Crout R, … Marazita M (2010, 07). Genes and their effects on dental caries may differ between primary and permanent dentitions. Caries Research, 44, 277–84. doi: 10.1159/000314676
- Wei C, & Lu Q (2017, 02). A generalized association test based on U statistics. Bioinformatics, 33(13), 1963–1971. doi: 10.1093/bioinformatics/btx103
- Wu M, Lee S, Cai T, Li Y, Boehnke M, & Lin X (2011, 07). Rare-variant association testing for sequencing data with the sequence kernel association test. American Journal of Human Genetics, 89, 82–93. doi: 10.1016/j.ajhg.2011.05.029
- Zeng D, & Lin DY (2006, 09). Efficient estimation of semiparametric transformation models for counting processes. Biometrika, 93(3), 627–640. doi: 10.1093/biomet/93.3.627
