Identifying Genetic Variants for Addiction via Propensity Score Adjusted Generalized Kendall’s Tau

Yuan Jiang; Ni Li; Heping Zhang

doi:10.1080/01621459.2014.901223

Author manuscript; available in PMC: 2015 Oct 2.

Published in final edited form as: J Am Stat Assoc. 2014 Oct 2;109(507):905–930. doi: 10.1080/01621459.2014.901223

PMCID: PMC4219655 NIHMSID: NIHMS571814 PMID: 25382885

Abstract

Identifying replicable genetic variants for addiction has been extremely challenging. Besides the common difficulties with genome-wide association studies (GWAS), environmental factors are known to be critical to addiction, and comorbidity is widely observed. Despite the importance of environmental factors and comorbidity for addiction study, few GWAS analyses adequately considered them due to the limitations of the existing statistical methods. Although parametric methods have been developed to adjust for covariates in association analysis, difficulties arise when the traits are multivariate because there is no ready-to-use model for them. Recent nonparametric development includes U-statistics to measure the phenotype-genotype association weighted by a similarity score of covariates. However, it is not clear how to optimize the similarity score. Therefore, we propose a semiparametric method to measure the association adjusted by covariates. In our approach, the nonparametric U-statistic is adjusted by parametric estimates of propensity scores using the idea of inverse probability weighting. The new measurement is shown to be asymptotically unbiased under our null hypothesis while the previous non-weighted and weighted ones are not. Simulation results show that our test improves power as opposed to the non-weighted and two other weighted U-statistic methods, and it is particularly powerful for detecting gene-environment interactions. Finally, we apply our proposed test to the Study of Addiction: Genetics and Environment (SAGE) to identify genetic variants for addiction. Novel genetic variants are found from our analysis, which warrant further investigation in the future.

Keywords: Addiction, Comorbidity, Genome-wide association study, Inverse probability weighting, Substance dependence

1 INTRODUCTION

Identifying genetic risk variants for addiction (substance dependence) has drawn much attention due to the popularity of genome-wide association studies (GWAS) based on high throughput data. Many genetic signals for addiction have been discovered using GWAS in recent years. Studies focusing on nicotine dependence include Bierut et al. (2007), Uhl et al. (2007), Luo et al. (2008), Drgon et al. (2009), Rice et al. (2012), and Wang et al. (2012), among others. Similarly, there are many important discoveries for alcohol dependence, including but not limited to, Reich et al. (1998), Treutlein et al. (2009), Edenberg et al. (2010), Bierut et al. (2010), Johnson et al. (2006), Kendler et al. (2011), Heath et al. (2011), Wang et al. (2011), and Frank et al. (2012).

Despite these important findings, identifying genetic variants for addiction remains very challenging, especially in view of the following two issues. First, comorbidity of addiction is widely observed in the existing literature (National Institute on Drug Abuse, 2010). For example, Zuo et al. (2012a) and Zuo et al. (2012b) studied the risk gene regions in alcohol and nicotine co-dependence. Substance dependence can also be comorbid with other diseases such as depression (Edwards et al., 2012). Second, environmental factors (covariates) are known to play an important role in the association analysis between genetic risk factors and addiction. Examples include stress and history of violence. These factors can produce confounding effects, or they can interact with genotypes, a phenomenon known as gene-environment interaction.

In this work, we aim to analyze the data from the Study of Addiction: Genetics and Environment (SAGE), which is part of the Gene Environment Association Studies initiative (GENEVA) funded by the National Human Genome Research Institute. In the SAGE data, addiction to six different substances was measured simultaneously for the subjects: alcohol, nicotine, marijuana, cocaine, opiates, and other drugs. A preliminary analysis shows that the different addictions are dependent. About 45% of the subjects are addicted to nicotine and 47% to alcohol. The nicotine and alcohol co-dependence rate is 32%, much higher than the roughly 21% (0.45 × 0.47) that would be expected if these two traits were statistically independent. Moreover, information about important environmental factors was also collected. Environmental factors such as history of sexual abuse or violence and socioeconomic status have a non-negligible effect on substance dependence. How to properly adjust for these important covariates with such a complicated constitution of phenotypes remains an open question for the SAGE data. This motivates us to develop a new statistical method to fill this gap.

Traditionally, covariates were usually adjusted in GWAS by being added into a parametric association model such as a binary or an ordinal logistic regression model (Wang et al., 2006). However, there are two major drawbacks when using a parametric model-based approach for analysis of comorbidity of multiple traits. First, it is challenging to build a parametric model for multiple traits especially with different scales. Second, it is not clear how to remove the confounding effects through the model. Therefore, nonparametric tools were recently proposed. To handle comorbidity, Zhang et al. (2010) proposed a nonparametric U-statistic to measure association, called the “generalized Kendall’s tau”, which can take any hybrid of dichotomous, ordinal and quantitative traits. The generalized Kendall’s tau is applicable to both population-based and family-based designs. It is also noteworthy that the family-based association tests (FBAT) (Laird et al., 2000; Rabinowitz and Laird, 2000) are a special case of the generalized Kendall’s tau. To further adjust for environmental factors in a nonparametric setting, Zhu et al. (2012) and Jiang and Zhang (2011) proposed weighted versions of generalized Kendall’s tau. For the weight function, Zhu et al. (2012) used covariates themselves while Jiang and Zhang (2011) used propensity scores (Rosenbaum and Rubin, 1983). The weighted nonparametric tests have shown their power for detecting genetic effects after considering environmental effects.

The weighted tests have proven useful but still face difficulties. For instance, researchers are often required to select tuning parameters in the weight function (Jiang and Zhang, 2011; Zhu et al., 2012). Although suggestions were made, this extra step makes the tests less accessible. In this work, we propose an alternative that is more natural and convenient. Instead of directly weighting the generalized Kendall’s tau, we employ the idea of “inverse probability weighting” from the applications of propensity scores (Rosenbaum, 1987; Robins et al., 2000; Lunceford and Davidian, 2004). First, we use a parametric model to estimate the genomic propensity scores (Zhao et al., 2009), which summarize all covariates. Then, we apply inverse probability weighting, using the parametric propensity score estimates, to the genotype kernel of the nonparametric U-statistic. These procedures result in our proposed semiparametric measurement of association adjusted by covariates.

In an observational study, the inverse probability weighting method aims to construct an unbiased estimator of treatment effect. Similarly, we show that our U-statistic is an asymptotically unbiased estimator of the phenotype-genotype association under the null hypothesis, while the non-weighted and other weighted U-statistics are not necessarily asymptotically unbiased. Moreover, the inverse probability weighted U-statistic is free of tuning parameters. Another contribution of this work is to provide the null distribution of our test statistic incorporating the estimation step of propensity scores. Interestingly, we find that if the propensity scores are estimated consistently (indeed, $\sqrt{n}$-consistently), the U-statistic has an even smaller variance than the one with true propensity scores. This confirms a surprising but known fact that “it is better to use the ‘estimated propensity score’ than the true propensity score even when the true score is known” (Robins et al., 1992). Nonetheless, to the best of our knowledge, this is the first time the idea has been rigorously formalized either from a U-statistic viewpoint or in the framework of genome-wide association tests.

To evaluate the performance of our proposed test, we perform simulation studies to compare with the generalized Kendall’s tau and its weighted versions in terms of type I error and power. The simulation results show that our test possesses a higher power in most situations we examined and is particularly powerful for detecting gene-environment interactions.

Finally, we apply our proposed test to the SAGE data, together with non-weighted and other weighted tests, for comorbidity of multiple addictions. We also compare the comorbidity based analyses with the analysis from a single addiction at a time. Interestingly, besides a few overlapped markers, novel regions have been detected using multiple phenotypes, and different approaches may be more powerful under different settings; for example, a comorbidity genetic analysis is more powerful only for shared genes. Among the tests for multiple addictions, we clearly see the advantage of adjusting for important covariates in our analysis. Without any adjustment, no SNP was identified to be genome-wide significant. With adjustment, different adjusted tests work complementarily to each other. Our proposed test, in particular, reveals SNPs/genes that are not discovered by other tests. For example, the SNP rs251133 (on chromosome 5) achieves the genome-wide significance only using our proposed test. The new findings from our analyses warrant further investigation with either a replication study or a biological verification.

2 SEMIPARAMETRIC ASSOCIATION TEST

2.1 Non-weighted and Weighted Association Measurements

Suppose we observe a vector of traits $Y_{i} = \{Y_{i}^{(1)}, \dots, Y_{i}^{(p)}\}'$, a test-locus genotype G_i, and a vector of covariates $Z_{i} = \{Z_{i}^{(1)}, \dots, Z_{i}^{(q)}\}'$ for the ith of the n study subjects in a population association study. Our data are independent samples $\{(Y_{i}', G_{i}, Z_{i}')' : i = 1, \dots, n\}$. In the following, we denote Y = {Y₁, … , Y_n} and Z = {Z₁, … , Z_n} for all the traits and covariates, respectively. We present here a few nonparametric association statistics to measure the association between the multiple traits and the genetic marker.

The first statistic was proposed by Zhang et al. (2010). For individuals i and j, let Y_i and Y_j be their vectors of traits respectively. Then, a trait kernel is defined as

$$\phi_{t}(Y_{i}, Y_{j}) = \left[f_{1}\{Y_{i}^{(1)} - Y_{j}^{(1)}\}, \dots, f_{p}\{Y_{i}^{(p)} - Y_{j}^{(p)}\}\right]',$$

where function f_k(·) (k = 1, … , p) can be chosen as the identity function for a quantitative or binary trait (Rabinowitz, 1997), or the sign function for an ordinal trait (Zhang et al., 2006). Traditionally, a genotype kernel is chosen as

$$\phi_{g}(G_{i}, G_{j}) = G_{i} - G_{j}.$$

Based on these two kernels, Zhang et al. (2010) proposed a nonparametric U-statistic to measure the association between the phenotype and genotype as

$$U = \binom{n}{2}^{-1} \sum_{i < j} \phi_{t}(Y_{i}, Y_{j})\, \phi_{g}(G_{i}, G_{j}), \tag{1}$$

which is a generalization of Kendall’s tau (Kendall, 1938). This U-statistic was used there to test the null hypothesis that there is no phenotype-genotype association.
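As an illustration of (1), the following is a minimal Python sketch of the generalized Kendall’s tau, assuming the identity kernel for quantitative or binary traits and the sign kernel for ordinal traits. The function names, the `ordinal` flag array, and the toy data are hypothetical and are not taken from Zhang et al. (2010).

```python
import numpy as np

def trait_kernel(y_i, y_j, ordinal):
    """phi_t: difference for quantitative/binary traits, sign of the difference for ordinal ones."""
    diff = y_i - y_j
    return np.where(ordinal, np.sign(diff), diff)

def generalized_kendall_tau(Y, G, ordinal):
    """U in (1): average over pairs i < j of phi_t(Y_i, Y_j) * (G_i - G_j)."""
    n, p = Y.shape
    total = np.zeros(p)
    for i in range(n):
        for j in range(i + 1, n):
            total += trait_kernel(Y[i], Y[j], ordinal) * (G[i] - G[j])
    return total / (n * (n - 1) / 2)

# Toy data, made up for illustration: 4 subjects, one binary and one ordinal trait.
Y = np.array([[1.0, 3.0], [0.0, 1.0], [1.0, 2.0], [0.0, 2.0]])
G = np.array([2, 0, 1, 1])
ordinal = np.array([False, True])
U = generalized_kendall_tau(Y, G, ordinal)
```

The double loop costs O(n²p) operations; it is written for readability rather than speed.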

For the purpose of adjusting for the covariates, Zhu et al. (2012) introduced another statistic, which is a weighted version of U in (1). Let w(Z_i, Z_j) be a weight function measuring the similarity between Z_i and Z_j. For instance, the most intuitive weight function w(Z_i, Z_j) can be defined as a function of the distance or similarity of the two covariate vectors Z_i and Z_j. Afterwards, they defined the weighted U-statistic as

$$U_{w,1} = \binom{n}{2}^{-1} \sum_{i < j} \phi_{t}(Y_{i}, Y_{j})\, \phi_{g}(G_{i}, G_{j})\, w(Z_{i}, Z_{j}). \tag{2}$$

This weighted U-statistic is used to measure the covariate-adjusted association between the multiple traits and the genetic marker.

Considering the fact that there exist potentially continuous (such as age) and categorical (such as gender) covariates, their distance or similarity can become arbitrary and complicated especially when we have many covariates. Therefore, Jiang and Zhang (2011) proposed to summarize all the covariates, continuous or categorical, into the propensity score (Rosenbaum and Rubin, 1983; Zhao et al., 2009). Its definition is the likelihood of an individual having a particular test-locus genotype based on that individual’s covariate makeup, which can be explicitly stated as

$$p(z_{i}) = \{P(G_{i} = g \mid Z_{i} = z_{i}) : g \in \mathcal{G}\}',$$

with 𝒢 being the set of possible values for the genotype G; in our context, 𝒢 = {0, 1, 2}, representing {aa, Aa, AA} for a SNP marker with two alleles A and a. Then the weighted U-statistic in (2) becomes

$$U_{w,2} = \binom{n}{2}^{-1} \sum_{i < j} \phi_{t}(Y_{i}, Y_{j})\, \phi_{g}(G_{i}, G_{j})\, w(p(Z_{i}), p(Z_{j})). \tag{3}$$

These weighted U-statistics (2) and (3) were proposed to adjust the association taking into account the covariate effects. They have been proven useful in both theory and application especially when the covariates have direct or indirect effects on the traits (Jiang and Zhang, 2011; Zhu et al., 2012).

2.2 Inverse Probability Weighting

In the case without covariates, a natural choice of measurement of genotype-phenotype association is given by U in (1). One property of U is its unbiasedness under the null hypothesis. That is, E(U | Y) = 0 when there is no association between the genotype and phenotype (Zhang et al., 2010). It is noteworthy that conditioning on the traits is necessary to eliminate the need for assumptions about the phenotypic distribution (Laird et al., 2000).

When the covariate information is available, however, in order to remove the confounding effects of the covariates, one needs to test the conditional independence between the genotype and phenotype conditional on the covariates (Zhu et al., 2012). That is ℋ₀ : Y_i ⊥ G_i | Z_i, i = 1, … , n. Under the new null hypothesis ℋ₀, however, the U-statistic U in (1) is not necessarily an unbiased measure. The reason is that, under ℋ₀,

$$E(U \mid Y) = \binom{n}{2}^{-1} \sum_{i < j} \phi_{t}(Y_{i}, Y_{j}) \{E(G_{i} \mid Y_{i}) - E(G_{j} \mid Y_{j})\},$$

which is a similar association measurement to U in (1) with the genotype G_i replaced by its conditional mean E(G_i | Y_i). This implies that E(U | Y) would have a non-degenerate distribution (when Y_i’s are regarded as random) unless all E(G_i | Y_i)’s are equal. Therefore, E(U | Y) cannot always be zero. The same conclusion holds for the weighted U-statistics U_W,1 and U_W,2 in (2) and (3). They are also not necessarily unbiased under the null hypothesis ℋ₀.

Therefore, we need to revise the above-mentioned U-statistics to ensure the theoretical unbiasedness. Borrowing the idea of the inverse probability weighting method for propensity scores (Rosenbaum, 1987; Robins et al., 2000; Lunceford and Davidian, 2004), we revise the genotype kernel from ϕ_g(G_i, G_j) = G_i − G_j to

$$\phi_{g}(G_{i}, G_{j}; Z_{i}, Z_{j}) = \frac{G_{i}}{e(Z_{i})} - \frac{G_{j}}{e(Z_{j})},$$

where e(z_i) = E(G_i | Z_i = z_i) is the conditional expectation of G_i given Z_i = z_i. In general, e(z_i) can be directly obtained from the propensity score as

$$e(z_{i}) = \sum_{g \in \mathcal{G}} g\, P(G_{i} = g \mid Z_{i} = z_{i}).$$

Then we propose the propensity score-inverse probability weighted U-statistic as

$$U_{IPW} = \binom{n}{2}^{-1} \sum_{i < j} \phi_{t}(Y_{i}, Y_{j})\, \phi_{g}(G_{i}, G_{j}; Z_{i}, Z_{j}). \tag{4}$$

From (4), we see that

$$E(U_{IPW} \mid Y) = \binom{n}{2}^{-1} \sum_{i < j} \phi_{t}(Y_{i}, Y_{j})\, E[E\{\phi_{g}(G_{i}, G_{j}; Z_{i}, Z_{j}) \mid Z_{i}, Z_{j}\} \mid Y] = 0,$$

as E{ϕ_g(G_i, G_j; Z_i, Z_j) | Z_i, Z_j} = 0 under ℋ₀. This shows that U_IPW is an unbiased estimator of the conditional association between the genotype and phenotype under ℋ₀, provided that the true values of propensity scores are known.
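The unbiasedness argument can be checked numerically in a small case. Both U and U_IPW are linear in the genotypes, so under ℋ₀ their conditional means given the traits and covariates are obtained by substituting e(Z_i) = E(G_i | Z_i) for G_i. The sketch below does exactly that; the helper name `pairwise_u` and the values of Y and e are made up for illustration, and the conditioning here is on both Y and Z, as in Section 2.3.

```python
import numpy as np

def pairwise_u(Y, a, ordinal):
    """Average over pairs i < j of phi_t(Y_i, Y_j) * (a_i - a_j)."""
    n = Y.shape[0]
    i, j = np.triu_indices(n, k=1)
    diff = Y[i] - Y[j]                               # shape (n_pairs, p)
    phi_t = np.where(ordinal, np.sign(diff), diff)
    return (phi_t * (a[i] - a[j])[:, None]).mean(axis=0)

# One quantitative trait that increases with the covariate-driven mean genotype.
Y = np.array([[0.0], [1.0], [2.0], [3.0]])
ordinal = np.array([False])
e = np.array([0.5, 0.8, 1.2, 1.5])   # hypothetical values of e(Z_i)

mean_U = pairwise_u(Y, e, ordinal)           # conditional mean of U: plug in E(G_i | Z_i)
mean_U_ipw = pairwise_u(Y, e / e, ordinal)   # conditional mean of U_IPW: E(G_i / e(Z_i) | Z_i) = 1
```

In this toy configuration the non-weighted statistic has a nonzero conditional mean even though genotype and trait are conditionally independent, while the inverse probability weighted version has mean zero.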

2.3 Asymptotic Distribution with True Propensity Scores

As illustrated by Zhu et al. (2012), the asymptotic distribution of U_IPW may be derived conditioning on both traits Y = y and covariates Z = z. Write $\bar{u}_{i} = \frac{1}{n} \sum_{j = 1}^{n} \phi_{t}(Y_{i}, Y_{j})$; then

$$U_{IPW} = \frac{2}{n - 1} \sum_{i = 1}^{n} \bar{u}_{i}\, G_{i} / e(Z_{i}).$$
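The linear form follows from the antisymmetry of the trait kernel. A short sketch, assuming each f_k is an odd function (as the identity and sign functions are), so that $\phi_{t}(Y_{j}, Y_{i}) = -\phi_{t}(Y_{i}, Y_{j})$ and $\phi_{t}(Y_{i}, Y_{i}) = 0$, and writing $a_{i} = G_{i} / e(Z_{i})$ as shorthand:

$$\begin{aligned} \sum_{i < j} \phi_{t}(Y_{i}, Y_{j}) (a_{i} - a_{j}) &= \sum_{i < j} \phi_{t}(Y_{i}, Y_{j})\, a_{i} + \sum_{i < j} \phi_{t}(Y_{j}, Y_{i})\, a_{j} \\ &= \sum_{i = 1}^{n} a_{i} \sum_{j = 1}^{n} \phi_{t}(Y_{i}, Y_{j}) = n \sum_{i = 1}^{n} \bar{u}_{i}\, a_{i}. \end{aligned}$$

Multiplying by $\binom{n}{2}^{-1} = 2 / \{n (n - 1)\}$ gives the factor 2/(n − 1).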

Conditioning on both traits and covariates, the mean of U_IPW is still zero under ℋ₀. The asymptotic distribution of U_IPW can be derived by applying the central limit theorem. Theorem 1 reveals that U_IPW has an asymptotic normal distribution after normalization by its variance.

Theorem 1. Let v(z_i) = var(G_i | Z_i = z_i). Assume inf_n,i |e(z_i)| > 0 and inf_n,i |v(z_i)| > 0. Suppose $\max_{1 \leq i \leq n} \|\bar{u}_{i}\|^{2} = o\{\lambda_{\min}(\sum_{i = 1}^{n} \bar{u}_{i} \bar{u}_{i}')\}$, where λ_min represents the minimum eigenvalue. Then, under the null hypothesis ℋ₀,

$$\sqrt{n}\, \Sigma^{-1/2} U_{IPW} \to N(0, I_{p})$$

in distribution, conditioning on all the traits and covariates, where

$$\Sigma = \frac{4}{n} \sum_{i = 1}^{n} \bar{u}_{i} \bar{u}_{i}'\, v(z_{i}) / e^{2}(z_{i}).$$

U_IPW is a linear combination of the independent genotypes G₁, … , G_n. This observation inspires the application of Corollary 1.3 in Shao (2003) to prove Theorem 1. The conditions inf_n,i |e(z_i)| > 0 and inf_n,i |v(z_i)| > 0 are assumed to ensure the positive definiteness of the covariance matrix Σ. Moreover, the condition $\max_{1 \leq i \leq n} \|\bar{u}_{i}\|^{2} = o\{\lambda_{\min}(\sum_{i = 1}^{n} \bar{u}_{i} \bar{u}_{i}')\}$ is used to control the contribution of each term in the linear combination so that no single term dominates the others (see the regularity condition in Corollary 1.3 in Shao (2003)).
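A minimal sketch of how Theorem 1 could be turned into a computation when the propensity scores are known: the linear form of U_IPW, the matrix Σ, and the quadratic form n U′Σ⁻¹U. The function names are hypothetical, and the data-generating step (a Bin(2, r) genotype with a made-up logistic r) is an assumption used only to obtain known e and v.

```python
import numpy as np

def u_bar(Y, ordinal):
    """Row i holds (1/n) * sum_j phi_t(Y_i, Y_j)."""
    diff = Y[:, None, :] - Y[None, :, :]             # shape (n, n, p)
    phi = np.where(ordinal, np.sign(diff), diff)
    return phi.mean(axis=1)

def ipw_statistic_known_ps(Y, G, e, v, ordinal):
    """U_IPW, Sigma of Theorem 1, and n * U' Sigma^{-1} U, with true e(z_i) and v(z_i)."""
    n = Y.shape[0]
    ub = u_bar(Y, ordinal)                           # shape (n, p)
    U = 2.0 / (n - 1) * (ub * (G / e)[:, None]).sum(axis=0)
    Sigma = 4.0 / n * (ub * (v / e**2)[:, None]).T @ ub
    T = n * U @ np.linalg.solve(Sigma, U)
    return U, Sigma, T

# Hypothetical data: one covariate, genotype ~ Bin(2, r), traits independent of genotype.
rng = np.random.default_rng(0)
n = 200
Z = rng.standard_normal(n)
r = 1.0 / (1.0 + np.exp(-0.5 * Z))
G = rng.binomial(2, r)
e, v = 2.0 * r, 2.0 * r * (1.0 - r)                  # mean and variance of Bin(2, r)
Y = np.column_stack([rng.standard_normal(n), rng.integers(0, 3, size=n)]).astype(float)
ordinal = np.array([False, True])
U, Sigma, T = ipw_statistic_known_ps(Y, G, e, v, ordinal)
```

Under ℋ₀ the quadratic form T would be compared with a chi-square distribution with p degrees of freedom, by the same argument the paper uses later for the estimated-score statistic.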

2.4 Test Statistic with Estimated Propensity Scores

In Section 2.3, U_IPW involves the true values of the propensity score p(z_i) and the mean e(z_i). In practice, however, the propensity scores are always estimated from the sample, by $\hat{p}(Z_{i})$, and so is the mean e(z_i), by $\hat{e}(Z_{i})$. In this case, the test statistic becomes

$$\hat{U}_{IPW} = \frac{2}{n - 1} \sum_{i = 1}^{n} \bar{u}_{i}\, G_{i} / \hat{e}(Z_{i}).$$

Therefore, we aim to find the asymptotic distribution of the test statistic ${\hat{U}}_{IPW}$ in this subsection. This distribution will serve as the reference distribution for our association test.

We assume a parametric model indexed by parameters θ ∈ R^d to estimate the propensity scores. Therefore, we call ${\hat{U}}_{IPW}$ a semiparametric measurement given both its parametric and nonparametric components. To estimate p(z_i) and further e(z_i), we make use of the maximum likelihood estimator or the root of the likelihood equations $\hat{θ}$ from this model. It is noteworthy that we do not limit ourselves to any specific form of models. Instead, we build the theory upon the following general parametric form,

$$P(G_{i} = g \mid Z_{i} = z_{i}) = p_{g}(z_{i}; \theta), \quad g = 0, 1, 2; \; i = 1, \dots, n, \tag{5}$$

with $\sum_{g = 0}^{2} p_{g} (z_{i}; θ) = 1$ . For clarity, θ₀ is used for the true values of θ. Thus, e_θ₀(z_i) and v_θ₀(z_i) denote the true values of e(z_i) and v(z_i), respectively.

With model (5), we observe that $\hat{U}_{IPW} = U_{IPW}(\hat{\theta})$ is a statistic with estimated parameters $\hat{\theta}$. To derive the asymptotic distribution of $\hat{U}_{IPW}$, we follow the approach suggested by Pierce (1982) and Randles (1982). The idea is to derive the asymptotic joint distribution of $\{U_{IPW}'(\theta_{0}), \hat{\theta}'\}'$ and then to approximate the distribution of $\hat{U}_{IPW}$ using the mean value theorem.

Before presenting the main theoretical result, we need to introduce some necessary notation. With i = 1, … , n, the log-likelihood function log ℓ_i(θ) of model (5) is

$$\log \ell_{i}(\theta) = \sum_{g = 0}^{2} I(G_{i} = g) \log p_{g}(z_{i}; \theta).$$

We assume the score function ψ_θ(G_i, z_i) and information matrix I_θ(z_i) are well defined as

$$\psi_{\theta}(G_{i}, z_{i}) = \frac{\partial}{\partial \theta} \log \ell_{i}(\theta) = \sum_{g = 0}^{2} I(G_{i} = g)\, p_{g}^{-1}(z_{i}; \theta) \frac{\partial}{\partial \theta} p_{g}(z_{i}; \theta), \tag{6}$$

$$I_{\theta}(z_{i}) = E\{\psi_{\theta}(G_{i}, z_{i}) \psi_{\theta}'(G_{i}, z_{i})\} = \sum_{g = 0}^{2} p_{g}^{-1}(z_{i}; \theta) \frac{\partial}{\partial \theta} p_{g}(z_{i}; \theta) \frac{\partial}{\partial \theta'} p_{g}(z_{i}; \theta). \tag{7}$$

In addition, define the following matrices,

$$\begin{aligned} \Sigma_{\theta_{0}} &= \frac{4}{n} \sum_{i = 1}^{n} \bar{u}_{i} \bar{u}_{i}'\, v_{\theta_{0}}(z_{i}) / e_{\theta_{0}}^{2}(z_{i}), \\ \Gamma_{\theta_{0}} &= \frac{2}{n} \sum_{i = 1}^{n} \bar{u}_{i} \sum_{g = 0}^{2} \left\{ g \frac{\partial}{\partial \theta'} p_{g}(z_{i}; \theta_{0}) \right\} / e_{\theta_{0}}(z_{i}), \end{aligned} \tag{8}$$

and vectors (for i = 1, … , n),

$$\begin{aligned} \gamma_{i1} &= \left\{ \bar{u}_{i}' / e_{\theta_{0}}(z_{i}),\; p_{1}^{-1}(z_{i}; \theta_{0}) \frac{\partial}{\partial \theta'} p_{1}(z_{i}; \theta_{0}) - p_{0}^{-1}(z_{i}; \theta_{0}) \frac{\partial}{\partial \theta'} p_{0}(z_{i}; \theta_{0}) \right\}', \\ \gamma_{i2} &= \left\{ 2 \bar{u}_{i}' / e_{\theta_{0}}(z_{i}),\; p_{2}^{-1}(z_{i}; \theta_{0}) \frac{\partial}{\partial \theta'} p_{2}(z_{i}; \theta_{0}) - p_{0}^{-1}(z_{i}; \theta_{0}) \frac{\partial}{\partial \theta'} p_{0}(z_{i}; \theta_{0}) \right\}'. \end{aligned}$$

Theorem 2 presents the asymptotic distribution of the test statistic ${\hat{U}}_{IPW}$ , with the detailed derivation provided in the Appendix.

Theorem 2. Let the parameter space Θ be an open set. Suppose that there exist some δ > 0 and c_θ₀ > 0 such that p_g(z_i; θ) ∈ [δ, 1 − δ] for all θ satisfying ∥θ − θ₀∥ ≤ c_θ₀ with g = 0, 1, 2 and i = 1, … , n; ℓ_i(θ) is twice continuously differentiable; for each g = 0, 1, 2,

$$\max_{1 \leq i \leq n} \sup_{\|\theta - \theta_{0}\| \leq c_{\theta_{0}}} \left\| \frac{\partial}{\partial \theta} p_{g}(z_{i}; \theta) \right\| = O(1), \quad \max_{1 \leq i \leq n} \sup_{\|\theta - \theta_{0}\| \leq c_{\theta_{0}}} \left\| \frac{\partial^{2}}{\partial \theta\, \partial \theta'} p_{g}(z_{i}; \theta) \right\| = O(1), \tag{9}$$

and there exist constants C_θ₀ > 0 and α > 0 such that for all θ satisfying ∥θ − θ₀∥ ≤ c_θ₀,

$$\frac{1}{n} \sum_{i = 1}^{n} \left\| \frac{\partial^{2}}{\partial \theta\, \partial \theta'} p_{g}(z_{i}; \theta) - \frac{\partial^{2}}{\partial \theta\, \partial \theta'} p_{g}(z_{i}; \theta_{0}) \right\| \leq C_{\theta_{0}} \|\theta - \theta_{0}\|^{\alpha}, \tag{10}$$

where ∥A∥ = {tr(A′A)}^{1/2} is the Frobenius norm for any matrix A; there exists a positive definite matrix I_θ₀ such that $\frac{1}{n} \sum_{i = 1}^{n} I_{\theta_{0}}(z_{i}) \to I_{\theta_{0}}$; $\lambda_{\max}(\sum_{i = 1}^{n} \bar{u}_{i} \bar{u}_{i}') = O(n)$ and $\max_{1 \leq i \leq n} \|\bar{u}_{i}\|^{2} = o(n)$; furthermore, $\max_{1 \leq i \leq n} \lambda_{\max}(\gamma_{i1} \gamma_{i1}' + \gamma_{i2} \gamma_{i2}') = o[\lambda_{\min}\{\sum_{i = 1}^{n} (\gamma_{i1} \gamma_{i1}' + \gamma_{i2} \gamma_{i2}')\}]$ and $\lambda_{\min}\{\sum_{i = 1}^{n} (\gamma_{i1} \gamma_{i1}' + \gamma_{i2} \gamma_{i2}')\} \geq n \epsilon$ for some ϵ > 0, where λ_max represents the maximum eigenvalue. Let $\Lambda_{\theta_{0}} = \Sigma_{\theta_{0}} - \Gamma_{\theta_{0}} I_{\theta_{0}}^{-1} \Gamma_{\theta_{0}}'$. Then, under the null hypothesis ℋ₀,

$$\sqrt{n}\, \Lambda_{\theta_{0}}^{-1/2}\, \hat{U}_{IPW} \to N(0, I_{p}),$$

in distribution, conditioning on all the traits and covariates.

The condition $\max_{1 \leq i \leq n} \lambda_{\max}(\gamma_{i1} \gamma_{i1}' + \gamma_{i2} \gamma_{i2}') = o[\lambda_{\min}\{\sum_{i = 1}^{n} (\gamma_{i1} \gamma_{i1}' + \gamma_{i2} \gamma_{i2}')\}]$ in Theorem 2 has the same role as the condition $\max_{1 \leq i \leq n} \|\bar{u}_{i}\|^{2} = o\{\lambda_{\min}(\sum_{i = 1}^{n} \bar{u}_{i} \bar{u}_{i}')\}$ in Theorem 1. It is a typical requirement of the central limit theorem for a weighted sum of independent random variables. That is, none of the weights would dominate all the others in an asymptotic sense.

Theorem 2 implies the asymptotic unbiasedness of the semiparametric statistic ${\hat{U}}_{IPW}$ under our null hypothesis ℋ₀, when the propensity scores are estimated using a parametric model. This property has not been achieved by either the non-weighted or the weighted statistics in the previous work (Zhang et al., 2010; Jiang and Zhang, 2011; Zhu et al., 2012). This agrees with our observation in Section 2.2 when the true values of propensity scores are assumed to be known.

In addition, a comparison between Theorems 1 and 2 reveals that the asymptotic variance of $\hat{U}_{IPW}$ is no larger than that of U_IPW, the U-statistic with true propensity scores. This confirms a surprising but known fact that “it is better to use the ‘estimated propensity score’ than the true propensity score even when the true score is known” (Robins et al., 1992). This phenomenon has been revealed by both theory (Rosenbaum, 1987; Robins et al., 1992) and empirical studies (Gu and Rosenbaum, 1993). Nonetheless, to the best of our knowledge, this is the first time the idea has been rigorously formalized either from a U-statistic viewpoint or in the framework of association tests.
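The variance comparison can be read off from the definition of Λ_θ₀ in Theorem 2. As a short sketch, take any vector a ∈ R^p (a symbol introduced here only for this argument); because I_θ₀ is positive definite, so is its inverse, and therefore

$$a' \Lambda_{\theta_{0}} a = a' \Sigma_{\theta_{0}} a - (\Gamma_{\theta_{0}}' a)'\, I_{\theta_{0}}^{-1}\, (\Gamma_{\theta_{0}}' a) \leq a' \Sigma_{\theta_{0}} a,$$

with strict inequality whenever $\Gamma_{\theta_{0}}' a \neq 0$. In other words, Σ_θ₀ − Λ_θ₀ is positive semidefinite, and the reduction is larger the more strongly the covariates enter Γ_θ₀.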

2.5 A Specific Example

As a specific example of model (5), we consider the ordinal logistic regression model

$$\operatorname{logit}\{P(G_{i} \leq g \mid Z_{i} = z_{i})\} = \lambda_{g} + \beta' z_{i}, \quad g = 0, 1; \; i = 1, \dots, n, \tag{11}$$

where λ₀ < λ₁ are ascending level parameters, and β reflects the association between the gene and covariates. Using the notation in Section 2.4, θ = (λ₀, λ₁, β′)′ ∈ R^{q+2} and d = q + 2.
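For concreteness, here is a minimal sketch of how the genotype probabilities, the mean e(z_i), and the variance v(z_i) could be evaluated under model (11) for a given θ. The function name and the numerical values of θ and Z are hypothetical.

```python
import numpy as np

def genotype_probs(theta, Z):
    """Model (11): logit P(G <= g | z) = lambda_g + beta'z for g = 0, 1."""
    lam0, lam1, beta = theta[0], theta[1], theta[2:]
    eta = Z @ beta
    q0 = 1.0 / (1.0 + np.exp(-(lam0 + eta)))         # P(G <= 0 | z)
    q1 = 1.0 / (1.0 + np.exp(-(lam1 + eta)))         # P(G <= 1 | z)
    P = np.column_stack([q0, q1 - q0, 1.0 - q1])     # columns: P(G=0), P(G=1), P(G=2)
    e = P[:, 1] + 2.0 * P[:, 2]                      # E(G | z)
    v = P[:, 1] + 4.0 * P[:, 2] - e**2               # var(G | z)
    return P, e, v

theta = np.array([-0.5, 1.0, 0.3, -0.3])             # made-up (lambda_0, lambda_1, beta')
Z = np.array([[0.7, -1.0], [-1.2, 1.0], [0.0, 1.0]])
P, e, v = genotype_probs(theta, Z)
```

The ordering λ₀ < λ₁ is what keeps the middle column q₁ − q₀ positive.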

Let

$$q_{g}(z_{i}; \theta) = \frac{\exp(\lambda_{g} + \beta' z_{i})}{1 + \exp(\lambda_{g} + \beta' z_{i})}, \quad g = 0, 1,$$

be the cumulative probabilities with $q_{g}(z_{i}; \theta) = \sum_{g' \leq g} p_{g'}(z_{i}; \theta)$. Then the first-order derivatives in (6) can be explicitly written as follows,

$$\begin{aligned} \frac{\partial}{\partial \theta} p_{0}(z_{i}; \theta) &= \pi\{q_{0}(z_{i}; \theta)\}\, \phi_{10i}, \\ \frac{\partial}{\partial \theta} p_{1}(z_{i}; \theta) &= \pi\{q_{1}(z_{i}; \theta)\}\, \phi_{01i} - \pi\{q_{0}(z_{i}; \theta)\}\, \phi_{10i}, \\ \frac{\partial}{\partial \theta} p_{2}(z_{i}; \theta) &= -\pi\{q_{1}(z_{i}; \theta)\}\, \phi_{01i}, \end{aligned}$$

with π(x) = x(1 − x), $\phi_{10i} = (1, 0, z_{i}')'$ and $\phi_{01i} = (0, 1, z_{i}')'$. The second-order derivatives in (9) and (10) can also be explicitly written as

$$\begin{aligned} \frac{\partial^{2}}{\partial \theta\, \partial \theta'} p_{0}(z_{i}; \theta) &= \varpi\{q_{0}(z_{i}; \theta)\}\, \phi_{10i} \phi_{10i}', \\ \frac{\partial^{2}}{\partial \theta\, \partial \theta'} p_{1}(z_{i}; \theta) &= \varpi\{q_{1}(z_{i}; \theta)\}\, \phi_{01i} \phi_{01i}' - \varpi\{q_{0}(z_{i}; \theta)\}\, \phi_{10i} \phi_{10i}', \\ \frac{\partial^{2}}{\partial \theta\, \partial \theta'} p_{2}(z_{i}; \theta) &= -\varpi\{q_{1}(z_{i}; \theta)\}\, \phi_{01i} \phi_{01i}', \end{aligned}$$

with $\varpi(x) = x(1 - x)(1 - 2x)$. In this way, we can write the explicit form of the information matrix in (7) as

$$\begin{aligned} I_{\theta}(z_{i}) ={}& \left[\frac{1}{p_{0}(z_{i}; \theta)} + \frac{1}{p_{1}(z_{i}; \theta)}\right] \pi^{2}\{q_{0}(z_{i}; \theta)\}\, \phi_{10i} \phi_{10i}' + \left[\frac{1}{p_{1}(z_{i}; \theta)} + \frac{1}{p_{2}(z_{i}; \theta)}\right] \pi^{2}\{q_{1}(z_{i}; \theta)\}\, \phi_{01i} \phi_{01i}' \\ & - \frac{1}{p_{1}(z_{i}; \theta)}\, \pi\{q_{0}(z_{i}; \theta)\}\, \pi\{q_{1}(z_{i}; \theta)\}\, (\phi_{10i} \phi_{01i}' + \phi_{01i} \phi_{10i}'), \end{aligned} \tag{12}$$

and the matrix Γ_θ₀ in (8) as

$$\Gamma_{\theta_{0}} = -\frac{2}{n} \sum_{i = 1}^{n} \bar{u}_{i} \left[\pi\{q_{0}(z_{i}; \theta_{0})\}\, \phi_{10i}' + \pi\{q_{1}(z_{i}; \theta_{0})\}\, \phi_{01i}'\right] / e_{\theta_{0}}(z_{i}). \tag{13}$$
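The closed-form derivatives can be checked against finite differences. The sketch below, with hypothetical function names and made-up values of θ and z, codes the first-order derivatives of p₀, p₁, p₂ and the bracketed per-subject factor of (13), then compares them with central differences.

```python
import numpy as np

def probs(theta, z):
    """(p_0, p_1, p_2) under model (11) for one subject with covariate vector z."""
    q0 = 1.0 / (1.0 + np.exp(-(theta[0] + z @ theta[2:])))
    q1 = 1.0 / (1.0 + np.exp(-(theta[1] + z @ theta[2:])))
    return np.array([q0, q1 - q0, 1.0 - q1])

def grads(theta, z):
    """Rows are dp_g/dtheta' for g = 0, 1, 2, using pi(x) = x(1 - x)."""
    p = probs(theta, z)
    q0, q1 = p[0], p[0] + p[1]
    phi10 = np.concatenate([[1.0, 0.0], z])
    phi01 = np.concatenate([[0.0, 1.0], z])
    d0 = q0 * (1.0 - q0) * phi10
    d2 = -q1 * (1.0 - q1) * phi01
    return np.vstack([d0, -d0 - d2, d2])             # the three rows sum to zero

def gamma_factor(theta, z):
    """sum_g g * dp_g/dtheta' = -[pi(q0) phi10' + pi(q1) phi01'], the bracket in (13)."""
    return np.arange(3.0) @ grads(theta, z)

theta = np.array([-0.5, 1.0, 0.3, -0.3])             # hypothetical (lambda_0, lambda_1, beta')
z = np.array([0.7, -1.0])
h = 1e-6
basis = np.eye(theta.size)
numeric = np.column_stack([
    (probs(theta + h * basis[k], z) - probs(theta - h * basis[k], z)) / (2.0 * h)
    for k in range(theta.size)
])
max_error = np.max(np.abs(numeric - grads(theta, z)))
```

A small `max_error` indicates that the coded derivatives agree with the model; the same check can be repeated at other values of θ and z.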

The main result in Theorem 2 follows as long as its conditions are satisfied. Indeed, some of the conditions become redundant in this specific example, such as the twice continuous differentiability of the likelihood function. Moreover, conditions (9) and (10) can be simplified into a simple condition max_1≤i≤n ∥z_i∥ = O(1). In summary, we present the following corollary parallel to Theorem 2 specifically for this example.

Corollary 1. Assume model (11) holds. Suppose that there exist some δ > 0 and c_θ₀ > 0 such that p_g(z_i; θ) ∈ [δ, 1 − δ] for all θ satisfying ∥θ − θ₀∥ ≤ c_θ₀ with g = 0, 1, 2 and i = 1, … , n; max_1≤i≤n ∥z_i∥ = O(1), $\max_{1 \leq i \leq n} \|\bar{u}_{i}\|^{2} = o(n)$, and $\lambda_{\max}(\sum_{i = 1}^{n} \bar{u}_{i} \bar{u}_{i}') = O(n)$; $\max_{1 \leq i \leq n} \lambda_{\max}(\gamma_{i1} \gamma_{i1}' + \gamma_{i2} \gamma_{i2}') = o[\lambda_{\min}\{\sum_{i = 1}^{n} (\gamma_{i1} \gamma_{i1}' + \gamma_{i2} \gamma_{i2}')\}]$ and $\lambda_{\min}\{\sum_{i = 1}^{n} (\gamma_{i1} \gamma_{i1}' + \gamma_{i2} \gamma_{i2}')\} \geq n \epsilon$ for some ϵ > 0, where

$$\begin{aligned} \gamma_{i1} &= \left\{ \bar{u}_{i}'\, e_{\theta_{0}}^{-1}(z_{i}),\; -1 - p_{i0} p_{i2} p_{i1}^{-1},\; p_{i2} + p_{i0} p_{i2} p_{i1}^{-1},\; -(p_{i0} + p_{i1}) z_{i}' \right\}', \\ \gamma_{i2} &= \left\{ 2 \bar{u}_{i}'\, e_{\theta_{0}}^{-1}(z_{i}),\; -p_{i1} - p_{i2},\; -p_{i0} - p_{i1},\; -(1 + p_{i1}) z_{i}' \right\}', \end{aligned}$$

with the simplified notation p_ig = p_g(z_i; θ₀); there exists a positive definite matrix I_θ₀ such that $\frac{1}{n} \sum_{i = 1}^{n} I_{\theta_{0}}(z_{i}) \to I_{\theta_{0}}$ with I_θ₀(z_i) in (12). Then, the conclusion of Theorem 2 holds with the explicit form of Γ_θ₀ given in (13).

Following the asymptotic distribution of ${\hat{U}}_{IPW}$ in Corollary 1, we define the test statistic

$$\hat{T}_{IPW} = n\, \hat{U}_{IPW}'\, \hat{\Lambda}^{-1}\, \hat{U}_{IPW},$$

where $\hat{\Lambda} = \Lambda_{\hat{\theta}}$ is the estimator of Λ_θ₀. The consistency of $\hat{\Lambda}$ can be verified under the conditions of Corollary 1. Therefore, it is clear that

$$\hat{T}_{IPW} \to \chi_{p}^{2},$$

in distribution, conditioning on all the traits and covariates. This serves as the reference distribution in our numerical studies.
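Putting Sections 2.4 and 2.5 together, the test could be sketched end to end as follows. This is an illustrative implementation under stated assumptions, not the authors’ code: the function names are hypothetical, θ is fitted by a generic BFGS optimizer from SciPy with the reparametrization λ₁ = λ₀ + exp(δ) to keep λ₀ < λ₁, the limit I_θ₀ is replaced by the sample average of (7) at the estimate, and the simulated data use made-up parameter values.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import chi2

def probs_and_grads(theta, Z):
    """P (n x 3) and dP/dtheta (n x 3 x d) under the ordinal logistic model (11)."""
    n = Z.shape[0]
    eta = Z @ theta[2:]
    q0 = 1.0 / (1.0 + np.exp(-(theta[0] + eta)))
    q1 = 1.0 / (1.0 + np.exp(-(theta[1] + eta)))
    P = np.column_stack([q0, q1 - q0, 1.0 - q1])
    phi10 = np.column_stack([np.ones(n), np.zeros(n), Z])
    phi01 = np.column_stack([np.zeros(n), np.ones(n), Z])
    d0 = (q0 * (1.0 - q0))[:, None] * phi10
    d2 = -(q1 * (1.0 - q1))[:, None] * phi01
    return P, np.stack([d0, -d0 - d2, d2], axis=1)

def fit_theta(G, Z):
    """Maximum likelihood for (11); lambda_1 = lambda_0 + exp(delta) enforces the ordering."""
    def unpack(par):
        return np.concatenate([[par[0], par[0] + np.exp(par[1])], par[2:]])
    def nll(par):
        P, _ = probs_and_grads(unpack(par), Z)
        return -np.sum(np.log(P[np.arange(len(G)), G]))
    start = np.concatenate([[-1.0, np.log(2.0)], np.zeros(Z.shape[1])])
    return unpack(minimize(nll, start, method="BFGS").x)

def t_ipw(Y, G, Z, ordinal):
    """Returns the statistic T, the vector U-hat, and the plug-in Lambda-hat."""
    n = len(G)
    P, dP = probs_and_grads(fit_theta(G, Z), Z)
    g = np.arange(3.0)
    e = P @ g
    v = P @ g**2 - e**2
    diff = Y[:, None, :] - Y[None, :, :]
    ub = np.where(ordinal, np.sign(diff), diff).mean(axis=1)    # u-bar_i, shape (n, p)
    U = 2.0 / (n - 1) * (ub * (G / e)[:, None]).sum(axis=0)
    Sigma = 4.0 / n * (ub * (v / e**2)[:, None]).T @ ub
    de = np.einsum("g,igd->id", g, dP)                          # sum_g g * dp_g/dtheta'
    Gamma = 2.0 / n * (ub / e[:, None]).T @ de                  # shape (p, d), as in (8)
    info = np.einsum("ig,igd,ige->de", 1.0 / P, dP, dP) / n     # sample average of (7)
    Lam = Sigma - Gamma @ np.linalg.solve(info, Gamma.T)
    return n * U @ np.linalg.solve(Lam, U), U, Lam

# Hypothetical null data: genotype depends on covariates, traits do not depend on genotype.
rng = np.random.default_rng(1)
n = 300
Z = np.column_stack([rng.standard_normal(n), rng.choice([-1.0, 1.0], size=n)])
P_true, _ = probs_and_grads(np.array([-0.5, 1.0, 0.3, -0.3]), Z)
u = rng.uniform(size=n)
G = (u > P_true[:, 0]).astype(int) + (u > P_true[:, 0] + P_true[:, 1]).astype(int)
Y = np.column_stack([rng.standard_normal(n), rng.integers(0, 2, size=n)]).astype(float)
ordinal = np.array([False, False])
T, U, Lam = t_ipw(Y, G, Z, ordinal)
p_value = chi2.sf(T, df=Y.shape[1])
```

The pairwise array `diff` needs O(n²p) memory, which is acceptable for a single marker at this sample size but would have to be computed once and reused across markers in a genome-wide scan, since it depends on the traits only.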

2.6 Genotype Coding

As mentioned in Section 2.1, the genotype G is coded as 0, 1, 2, representing aa, Aa, AA respectively, which records the number of copies of the reference allele A. Choosing the other allele a as the reference leads to a different genotype coding, G′ = 2 − G. We illustrate in this subsection the effect of different genotype codings on the association measurements we studied in Sections 2.1–2.2.

Firstly, notice that the genotype kernel ϕ_g(G_i, G_j) in (1) changes only in sign under the change of genotype coding from G to G′ = 2 − G, i.e., $\phi_{g}(G_{i}', G_{j}') = -\phi_{g}(G_{i}, G_{j})$. Therefore, the non-weighted U-statistic U in (1) and the weighted U-statistic U_W,1 in (2) are both invariant to the genotype coding up to sign, which leaves any test based on a quadratic form of them unchanged.

Secondly, the propensity score vector p(z_i) = {P(G_i = g | Z_i = z_i) : g ∈ 𝒢}′ in the weighted U-statistics U_W,2 in (3) is invariant except that the order of its elements is reversed. It leads to the invariance of U_W,2, as long as the weight function w(u₁, u₂) in (3) is not changed by the synchronous permutation of the elements in u₁ and u₂. This is often the case. For example, Jiang and Zhang (2011) used w(u₁, u₂) = exp(−∥u₁−u₂∥²/2), which satisfies the above condition.

Finally, we should note that our proposed measurement U_IPW does not possess the invariance property under the two genotype codings. The revised genotype kernel ϕ_g(G_i, G_j; Z_i, Z_j) is not invariant under codings G and G′. Using a different genotype coding will actually change our association measurement U_IPW and further change the test result. This is understandable because we apply a new weighting scheme. In the non-weighted U-statistic U, the genotypes G_i are treated equally in the genotype kernel. However, to achieve the unbiasedness under ℋ₀, the new U-statistic U_IPW inversely weights the genotypes by their expected values conditional on the covariates. It is the new weighting scheme that violates the invariance but achieves the unbiasedness. From the practical viewpoint, the new method can give us more flexibility to choose a genotype coding which better fits the real situation.
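The effect of recoding can be seen on a toy example. The sketch below uses a single quantitative trait and made-up values of G and e(z_i); the helper name `pairwise_u` is hypothetical. Under G′ = 2 − G the conditional mean becomes 2 − e(z_i), so the non-weighted statistic only flips sign while the inverse probability weighted statistic also changes in magnitude.

```python
import numpy as np

def pairwise_u(y, a):
    """Single quantitative trait: average over i < j of (y_i - y_j) * (a_i - a_j)."""
    i, j = np.triu_indices(len(y), k=1)
    return np.mean((y[i] - y[j]) * (a[i] - a[j]))

y = np.array([0.0, 1.0, 2.0, 3.0])
G = np.array([0, 1, 1, 2])
e = np.array([0.4, 0.9, 1.3, 1.6])            # hypothetical e(z_i) under the coding G

U = pairwise_u(y, G)
U_recoded = pairwise_u(y, 2 - G)              # coding G' = 2 - G
U_ipw = pairwise_u(y, G / e)
U_ipw_recoded = pairwise_u(y, (2 - G) / (2 - e))
```

Comparing `abs(U_ipw)` with `abs(U_ipw_recoded)` shows the lack of invariance that the text describes.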

For clarity, we recommend a simple genotype coding: for practical reasons, we choose the major allele as the reference allele. In practice, inverse probability weighting often encounters the difficulty of small weights in the denominator. However, it is fairly easy to see that the above choice is much less likely to result in small denominators e(z_i) (or $\hat{e} (z_{i})$ ) in U_IPW (or ${\hat{U}}_{IPW}$ ) than the other choice. We thereby try to avoid the situation where the weights e(z_i) (or $\hat{e} (z_{i})$ ) in the denominator are close to zero.

3 SIMULATION STUDIES

3.1 Settings

We conduct simulation studies to compare the performance of our semiparametric association test ${\hat{T}}_{IPW}$ with the three methods mentioned in Section 2.1. They are the non-weighted and weighted tests derived from the association measures (1)–(3), denoted by T, T_W,1 and T_W,2 respectively. We utilize the same “conditional independence” null hypothesis ℋ₀ (see Section 2.2) for all four tests for a fair comparison. The simulation results are obtained from samples with size of 500, which are generated as follows.

Step 1: For the ith sample, a continuous covariate Z_i1 is simulated from N(0, 1) distribution, and a binary covariate Z_i2 is randomly sampled from {−1, 1} with equal probabilities.

Step 2: For the relationship between the covariates and the test-locus genotype G_i, we generate G_i from the ordinal logistic regression model

$\text{OLR}: \quad \operatorname{logit}\{P(G_{i} \leq g \mid Z_{i1}, Z_{i2})\} = \mu_{g} - \nu_{1} Z_{i1} - \nu_{2} Z_{i2}, \quad g = 0, 1,$

where ν₁ and ν₂ control the association between the genotype and the covariates. An alternative genotype model is to generate G_i according to a binomial distribution Bin(2, r_i) with probability r_i satisfying

$\text{BIN}: \quad \operatorname{logit}(r_{i}) = \mu + \nu_{1} Z_{i1} + \nu_{2} Z_{i2} + \epsilon_{i},$

where ϵ_i ∼ N(0, 1) is a random error. We refer to the former model as “OLR” and the latter model as “BIN”. The former model is the one we specified in Section 2.5, while the latter model is used to assess the effect of model misspecification, with ϵ_i deliberately added for additional complexity.

Step 3: Conditional on the genotype G_i and the covariates Z_i1 and Z_i2, two binary traits $Y_{i} = (Y_{i}^{(1)}, Y_{i}^{(2)})'$ are generated according to a logistic regression phenotype model,

$\operatorname{logit}\{P(Y_{i}^{(j)} = 1 \mid G_{i}, Z_{i1}, Z_{i2})\} = \alpha_{j} + \beta_{G} G_{i} + \beta_{Z_{1}} Z_{i1} + \beta_{Z_{2}} Z_{i2} + \beta_{GZ_{1}} G_{i} Z_{i1} + \beta_{GZ_{2}} G_{i} Z_{i2} + \epsilon_{ij},$

with i = 1, … , n; j = 1, 2; and (ϵ_i1, ϵ_i2)′ ∼ N(0,Σ_ϵ).

In the two genotype models (OLR and BIN), the minor allele frequency (MAF) of the simulated genotype depends on the values of μ₀, μ₁, μ and ν₁, ν₂. To investigate the possible effect of different minor allele frequencies on our results, we fix ν₁ = ν₂ = 1 and select appropriate values of μ₀, μ₁ and μ. Their values are chosen so that the simulated minor allele frequency is equal to one of the following values: 0.05, 0.10, 0.15, … , 0.40. These choices give a broad and reasonable range for evaluating how an association test performs with different minor allele frequencies.

In the phenotype model, we set α₁ = −0.75, α₂ = −1, and $\Sigma_{ϵ} = \begin{pmatrix} 1 & 0.25 \\ 0.25 & 1 \end{pmatrix}$ . The choices of the coefficients (β_G, β_Z₁, β_Z₂, β_GZ₁, β_GZ₂)′ are provided by Table 1 as different phenotype models. The models N1 and N2 are null models under ℋ₀ in which Y_i and G_i are independent conditional on (Z_i1, Z_i2), and the models A1–A6 are under our alternative hypothesis.

Table 1.

Phenotype models

Null Models
N1	β_G = 0	β_Z₁ = β_Z₂ = 0	β_GZ₁ = β_GZ₂ = 0
N2	β_G = 0	β_Z₁ = β_Z₂ = 0.5	β_GZ₁ = β_GZ₂ = 0

Alternative Models
A1	β_G = 0.5	β_Z₁ = β_Z₂ = 0	β_GZ₁ = β_GZ₂ = 0
A2	β_G = 0.5	β_Z₁ = β_Z₂ = 0.5	β_GZ₁ = β_GZ₂ = 0
A3	β_G = 0.5	β_Z₁ = β_Z₂ = 0	β_GZ₁ = β_GZ₂ = 1
A4	β_G = 0.5	β_Z₁ = β_Z₂ = 0.5	β_GZ₁ = β_GZ₂ = 1
A5	β_G = 0.5	β_Z₁ = β_Z₂ = 0	β_GZ₁ = β_GZ₂ = 2
A6	β_G = 0.5	β_Z₁ = β_Z₂ = 0.5	β_GZ₁ = β_GZ₂ = 2
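
The data-generating steps of Section 3.1 can be sketched in code. This is a minimal illustration rather than the simulation code used for the reported results: the function names are hypothetical, and the intercepts `mu0`, `mu1` and `mu` below are placeholder values (the text only states that they are tuned to reach a target minor allele frequency). The phenotype coefficients follow model A4 of Table 1.

```python
import numpy as np

def expit(x):
    """Inverse of the logit function."""
    return 1.0 / (1.0 + np.exp(-x))

def draw_covariates(n, rng):
    """Step 1: Z1 ~ N(0, 1), Z2 uniform on {-1, 1}."""
    z1 = rng.standard_normal(n)
    z2 = rng.choice([-1.0, 1.0], size=n)
    return z1, z2

def draw_genotype_olr(z1, z2, mu0, mu1, nu1, nu2, rng):
    """Step 2 (OLR): logit P(G <= g | Z) = mu_g - nu1*Z1 - nu2*Z2, g = 0, 1.
    Requires mu0 <= mu1 so the cumulative probabilities are ordered."""
    lin = nu1 * z1 + nu2 * z2
    c0 = expit(mu0 - lin)  # P(G <= 0 | Z)
    c1 = expit(mu1 - lin)  # P(G <= 1 | Z)
    u = rng.random(z1.shape[0])
    return (u > c0).astype(int) + (u > c1).astype(int)

def draw_genotype_bin(z1, z2, mu, nu1, nu2, rng):
    """Step 2 (BIN): G ~ Bin(2, r) with logit(r) = mu + nu1*Z1 + nu2*Z2 + eps."""
    eps = rng.standard_normal(z1.shape[0])
    r = expit(mu + nu1 * z1 + nu2 * z2 + eps)
    return rng.binomial(2, r)

def draw_phenotypes(g, z1, z2, alpha, beta, sigma_eps, rng):
    """Step 3: two binary traits from the logistic phenotype model
    with correlated random errors (eps_i1, eps_i2) ~ N(0, sigma_eps)."""
    b_g, b_z1, b_z2, b_gz1, b_gz2 = beta
    eta = b_g * g + b_z1 * z1 + b_z2 * z2 + b_gz1 * g * z1 + b_gz2 * g * z2
    eps = rng.multivariate_normal(np.zeros(2), sigma_eps, size=g.shape[0])
    prob = expit(np.asarray(alpha)[None, :] + eta[:, None] + eps)
    return (rng.random(prob.shape) < prob).astype(int)

rng = np.random.default_rng(0)
n = 500
z1, z2 = draw_covariates(n, rng)

# mu0, mu1 and mu are illustrative placeholders, not values from the paper.
g_olr = draw_genotype_olr(z1, z2, mu0=1.0, mu1=3.0, nu1=1.0, nu2=1.0, rng=rng)
g_bin = draw_genotype_bin(z1, z2, mu=-1.5, nu1=1.0, nu2=1.0, rng=rng)

alpha = (-0.75, -1.0)
sigma_eps = np.array([[1.0, 0.25], [0.25, 1.0]])
beta_a4 = (0.5, 0.5, 0.5, 1.0, 1.0)  # model A4 in Table 1
y = draw_phenotypes(g_olr, z1, z2, alpha, beta_a4, sigma_eps, rng)
```

Setting all entries of `beta` except the covariate effects to zero gives the null models N1 and N2, in which the traits depend on the genotype only through the shared covariates.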

3.2 Results for Bivariate Phenotypes

In this subsection, we present simulation results for the generated bivariate phenotypes. In terms of type I error, Table 2 presents the empirical type I error of the four tests based on 10,000 replications when the nominal level is set to 0.001. Table 2 also includes the type I error results when the nominal level is 5 × 10⁻⁷. To save computational time, we fix the minor allele frequency at 0.10 there. This smaller nominal level provides an additional comparison among different methods in a situation similar to the real application (Burton et al., 2007). To illustrate the necessity of utilizing the “conditional independence” null hypothesis ℋ₀, we also include T′, the non-weighted test under the original “unconditional independence” null hypothesis $ℋ_{0}^{'}$ , that is, no association between phenotype and genotype. In terms of power, Figures 1-4 present the statistical power of the four tests with respect to a wide range of minor allele frequencies. Figures 1-2 correspond to the nominal level 0.001 and Figures 3-4 correspond to the nominal level 5 × 10⁻⁷.

Table 2.

Type I error for bivariate phenotypes

MAF	T	T _W,1	T _W,2	${\hat{T}}_{IPW}$	T ′	T	T _W,1	T _W,2	${\hat{T}}_{IPW}$	T ′
	Model OLR (Nominal Level: 0.001)
	Model N1					Model N2
0.05	1.0e-3	0.9e-3	1.6e-3	1.1e-3	0.5e-3	0.5e-3	0.8e-3	0.6e-3	1.1e-3	0.2358
0.10	0.7e-3	0.7e-3	0.7e-3	1.0e-3	0.7e-3	0.4e-3	0.9e-3	1.2e-3	0.9e-3	0.4913
0.15	1.4e-3	0.8e-3	0.8e-3	1.3e-3	0.6e-3	0.5e-3	0.6e-3	0.7e-3	1.0e-3	0.6463
0.20	1.0e-3	0.7e-3	0.9e-3	1.0e-3	0.7e-3	1.0e-3	1.1e-3	1.0e-3	1.4e-3	0.7249
0.25	0.8e-3	0.9e-3	0.6e-3	0.7e-3	0.5e-3	0.8e-3	1.0e-3	1.1e-3	1.1e-3	0.7804
0.30	0.9e-3	0.9e-3	1.0e-3	1.1e-3	0.7e-3	0.7e-3	1.2e-3	0.8e-3	0.7e-3	0.8049
0.35	0.9e-3	0.6e-3	0.8e-3	1.5e-3	0.8e-3	0.5e-3	0.8e-3	1.4e-3	0.9e-3	0.8250
0.40	0.9e-3	1.3e-3	1.0e-3	1.0e-3	1.3e-3	0.5e-3	0.9e-3	0.5e-3	1.2e-3	0.8391

	Model BIN (Nominal Level: 0.001)
	Model N1					Model N2
0.05	0.5e-3	1.1e-3	0.5e-3	0.8e-3	0.8e-3	0.2e-3	0.1e-3	0.3e-3	0.7e-3	0.1937
0.10	1.2e-3	1.2e-3	0.7e-3	1.7e-3	1.3e-3	0.2e-3	0.4e-3	0.5e-3	0.6e-3	0.4293
0.15	0.8e-3	0.7e-3	0.4e-3	0.9e-3	0.8e-3	0.6e-3	1.1e-3	0.8e-3	1.7e-3	0.5950
0.20	1.1e-3	1.2e-3	1.0e-3	1.5e-3	1.5e-3	0.6e-3	0.6e-3	0.7e-3	1.1e-3	0.6954
0.25	0.5e-3	0.6e-3	0.7e-3	0.5e-3	1.1e-3	0.4e-3	0.6e-3	0.6e-3	0.7e-3	0.7691
0.30	1.2e-3	0.9e-3	0.8e-3	1.7e-3	1.3e-3	0.6e-3	1.1e-3	1.2e-3	0.8e-3	0.8072
0.35	0.7e-3	0.7e-3	0.5e-3	1.4e-3	0.8e-3	1.0e-3	1.6e-3	1.4e-3	0.7e-3	0.8263
0.40	1.1e-3	1.2e-3	1.4e-3	0.9e-3	1.3e-3	0.8e-3	0.9e-3	1.2e-3	0.8e-3	0.8437

	Model OLR (Nominal Level: 5 × 10⁻⁷)
	Model N1					Model N2
0.10	2e-7	2e-7	6e-7	3e-7	2e-7	3e-7	6e-7	7e-7	5e-7	0.0466208

	Model BIN (Nominal Level: 5 × 10⁻⁷)
	Model N1					Model N2
0.10	2e-7	4e-7	1e-7	5e-7	5e-7	1e-7	1e-7	1e-7	5e-7	0.0331154

Power versus minor allele frequency for bivariate phenotypes. The significance level is 0.001. The genotype is simulated using model OLR. Solid line with circles: inverse probability weighted test ${\hat{T}}_{IPW}$ ; dashed line with triangles: non-weighted test T; dotted line with pluses: covariate weighted test T_W,1; dotdash line with crosses: propensity score weighted test T_W,2.

Power versus minor allele frequency for bivariate phenotypes. The significance level is 5 × 10⁻⁷. The genotype is simulated using model BIN. Solid line with circles: inverse probability weighted test ${\hat{T}}_{IPW}$ ; dashed line with triangles: non-weighted test T; dotted line with pluses: covariate weighted test T_W,1; dotdash line with crosses: propensity score weighted test T_W,2.

Power versus minor allele frequency for bivariate phenotypes. The significance level is 0.001. The genotype is simulated using model BIN. Solid line with circles: inverse probability weighted test ${\hat{T}}_{IPW}$ ; dashed line with triangles: non-weighted test T; dotted line with pluses: covariate weighted test T_W,1; dotdash line with crosses: propensity score weighted test T_W,2.

Power versus minor allele frequency for bivariate phenotypes. The significance level is 5 × 10⁻⁷. The genotype is simulated using model OLR. Solid line with circles: inverse probability weighted test ${\hat{T}}_{IPW}$ ; dashed line with triangles: non-weighted test T; dotted line with pluses: covariate weighted test T_W,1; dotdash line with crosses: propensity score weighted test T_W,2.

From the perspective of type I error (in models N1 and N2), we find that all four tests under ℋ₀ behave fairly well since they all possess reasonably accurate type I errors under both nominal levels. This is partially due to the fact that ℋ₀ removes the confounding effects of covariates. By contrast, T′ cannot control its type I error in model N2. The reason is clear: T′ does not remove the confounding effect in model N2 (Jiang and Zhang, 2011; Zhu et al., 2012).
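
To judge how close an empirical type I error is to its nominal level, it helps to keep the Monte Carlo noise in mind. The following sketch uses the standard binomial approximation (our own addition, not a computation reported in the paper) to obtain the expected number of rejections and the standard error of the empirical rejection rate for 10,000 replications at level 0.001.

```python
import math

def mc_standard_error(alpha, n_rep):
    """Standard error of an empirical rejection rate when the true
    rejection probability is alpha and n_rep independent replications are run."""
    return math.sqrt(alpha * (1.0 - alpha) / n_rep)

alpha = 0.001
n_rep = 10_000

expected_rejections = alpha * n_rep   # about 10 rejections
se = mc_standard_error(alpha, n_rep)  # about 3.2e-4
```

With only about ten rejections expected per cell, deviations of a few multiples of 10⁻⁴ from 0.001 are compatible with simulation noise, whereas the T′ entries under model N2 are larger by orders of magnitude.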

From the perspective of power, we consider models A1–A6. Models A1–A2 are from a phenotype model without the gene-environment interaction, and A3–A6 are with an interaction. To assess situations with different gene-environment interactions, in models A5–A6, we double the interaction coefficients from models A3–A4, respectively.

In model A1 with the genetic effect only, the non-weighted test T possesses the highest power among all four methods, although the differences are quite small. This agrees with our expectation since it is not necessary to adjust for covariates in this case; adjusting for covariates nevertheless does not harm the statistical power. In model A2 with both genetic and environmental effects, the non-weighted test T performs the worst for most values of minor allele frequency. The other three methods are slightly better, indicating the importance of including covariates in the association test. It is noteworthy that the proposed inverse probability weighted test favors the region of small minor allele frequencies in both models A1 and A2. Compared to the other weighted tests, the proposed test is comparable or even better for low MAFs, but is slightly underpowered when the MAF is higher than 0.30.

When gene-environment interactions are included (models A3–A4), the methods perform quite differently. It is fairly clear from all figures that the proposed test ${\hat{T}}_{IPW}$ outperforms all competitors at all minor allele frequencies. When the nominal level is 0.001, the proposed test has a power close to 1, which means that it can identify the genetic signal in almost every replicate of the simulated data. The covariate weighted test T_W,1 ranks second in terms of power. The non-weighted test T and the propensity score weighted test T_W,2 do not have comparable power over a wide range of minor allele frequencies.

A further study with stronger gene-environment interactions (models A5–A6) provides additional evidence for our conclusion drawn from models A3–A4. When the gene-environment interactions dominate both genetic and environmental effects, the semiparametric inverse probability weighted test outperforms the other tests at all minor allele frequencies we considered, showing the power of the proposed test in detecting gene-environment interactions.

Comparing the two genotype models (OLR versus BIN), we have not observed a major impact from the misspecified model on testing the associations. When the genotype is generated using the binomial distribution, our test derived from the ordinal logistic regression (Section 2.5) still has a quite accurate type I error and also a high power (even higher in some cases) to detect either genetic effects or gene-environment interactions.

Between the two nominal levels (0.001 and 5 × 10⁻⁷), the statistical power becomes smaller with the lower nominal level given the same effect sizes (β’s in Table 1), especially in models A1–A2. All methods are underpowered there; with the sample size of 500, it is expected that we cannot achieve a reasonable power for a full GWAS scan, but unfortunately, the simulation for a much larger sample size takes a very long time to complete. Since our objective is to compare the relative power, we can achieve this goal with the modest sample size. In fact, for models A3–A6, the power of our proposed test is only slightly affected by this small nominal level, and it still dominates all others. In a situation similar to the real application (nominal level 5×10⁻⁷), it is clear that some adjustment is necessary when there is a gene-environment interaction.

3.3 Results for Individual Phenotypes

In addition to the simulation results for the bivariate phenotypes in Section 3.2, we also present the results for each individual phenotype Y⁽¹⁾ and Y⁽²⁾ separately. For simplicity, we fix the nominal level to be 0.001 throughout this subsection. In terms of type I error, Table 3 presents the empirical type I error of the tests based on 10,000 replications. In terms of power, Figures 5-8 present the statistical power of the four tests with respect to a wide range of minor allele frequencies, where Figures 5-6 correspond to the first phenotype and Figures 7-8 correspond to the second phenotype.

Table 3.

Type I error for individual phenotypes

MAF	T	T _W,1	T _W,2	${\hat{T}}_{IPW}$	T ′	T	T _W,1	T _W,2	${\hat{T}}_{IPW}$	T ′
	Phenotype Y⁽¹⁾, Model OLR
	Model N1					Model N2
0.05	0.7e-3	0.9e-3	0.5e-3	0.9e-3	1.2e-3	0.4e-3	0.6e-3	0.8e-3	0.7e-3	0.1288
0.10	1.2e-3	1.2e-3	1.3e-3	1.1e-3	0.6e-3	0.3e-3	0.6e-3	0.6e-3	0.9e-3	0.2689
0.15	1.0e-3	0.8e-3	1.2e-3	1.0e-3	0.9e-3	0.4e-3	0.4e-3	1.1e-3	0.8e-3	0.3703
0.20	1.4e-3	1.3e-3	1.1e-3	1.2e-3	1.1e-3	0.9e-3	1.2e-3	1.0e-3	1.1e-3	0.4441
0.25	1.3e-3	1.5e-3	1.0e-3	0.6e-3	0.7e-3	1.1e-3	1.2e-3	1.2e-3	1.7e-3	0.4966
0.30	1.0e-3	1.0e-3	0.7e-3	1.0e-3	1.0e-3	0.5e-3	0.9e-3	0.8e-3	0.9e-3	0.5173
0.35	0.7e-3	0.7e-3	0.9e-3	0.8e-3	0.7e-3	0.7e-3	1.2e-3	1.0e-3	1.3e-3	0.5402
0.40	1.2e-3	1.1e-3	1.3e-3	1.6e-3	0.7e-3	0.5e-3	0.9e-3	1.1e-3	0.9e-3	0.5566

	Phenotype Y⁽¹⁾, Model BIN
	Model N1					Model N2
0.05	0.6e-3	0.2e-3	0.6e-3	0.7e-3	1.2e-3	0.3e-3	0.3e-3	0.5e-3	1.2e-3	0.1099
0.10	0.5e-3	0.6e-3	0.2e-3	1.0e-3	1.5e-3	1.0e-3	1.2e-3	0.8e-3	1.8e-3	0.2319
0.15	0.5e-3	0.8e-3	0.7e-3	1.2e-3	1.4e-3	0.6e-3	0.3e-3	0.3e-3	1.0e-3	0.3321
0.20	1.0e-3	1.3e-3	1.2e-3	1.2e-3	1.0e-3	1.0e-3	1.2e-3	1.3e-3	1.5e-3	0.4100
0.25	0.6e-3	1.1e-3	0.9e-3	1.2e-3	1.1e-3	0.7e-3	0.9e-3	0.7e-3	1.1e-3	0.4768
0.30	0.5e-3	0.4e-3	0.3e-3	0.9e-3	1.0e-3	0.3e-3	0.7e-3	0.4e-3	1.4e-3	0.5136
0.35	1.3e-3	1.4e-3	1.2e-3	1.3e-3	0.8e-3	0.5e-3	0.8e-3	0.9e-3	0.7e-3	0.5491
0.40	1.2e-3	1.1e-3	0.7e-3	0.7e-3	1.1e-3	0.8e-3	0.8e-3	1.2e-3	0.7e-3	0.5665

	Phenotype Y⁽²⁾, Model OLR
	Model N1					Model N2
0.05	0.9e-3	0.7e-3	0.7e-3	0.9e-3	0.8e-3	0.7e-3	1.0e-3	0.9e-3	1.1e-3	0.1246
0.10	0.6e-3	0.9e-3	0.6e-3	0.4e-3	0.3e-3	0.3e-3	0.9e-3	1.0e-3	0.9e-3	0.2586
0.15	1.2e-3	1.4e-3	1.5e-3	1.3e-3	1.1e-3	0.5e-3	0.7e-3	0.6e-3	0.7e-3	0.3620
0.20	1.7e-3	1.3e-3	1.7e-3	1.3e-3	1.4e-3	0.5e-3	0.6e-3	1.5e-3	1.3e-3	0.4232
0.25	1.0e-3	0.8e-3	1.0e-3	0.9e-3	0.9e-3	0.7e-3	1.1e-3	0.9e-3	1.0e-3	0.4678
0.30	1.0e-3	0.7e-3	1.3e-3	0.8e-3	0.9e-3	0.5e-3	0.9e-3	0.5e-3	1.0e-3	0.5047
0.35	1.1e-3	0.9e-3	1.2e-3	1.3e-3	0.6e-3	0.9e-3	0.9e-3	0.9e-3	1.1e-3	0.5170
0.40	0.4e-3	0.7e-3	0.6e-3	0.7e-3	1.0e-3	0.7e-3	1.4e-3	1.5e-3	1.2e-3	0.5235

	Phenotype Y⁽²⁾, Model BIN
	Model N1					Model N2
0.05	0.7e-3	0.7e-3	1.2e-3	0.8e-3	0.9e-3	0.5e-3	0.8e-3	0.4e-3	1.3e-3	0.1091
0.10	0.3e-3	0.4e-3	0.6e-3	0.7e-3	0.6e-3	0.7e-3	0.7e-3	0.2e-3	1.0e-3	0.2282
0.15	0.8e-3	0.9e-3	0.5e-3	0.8e-3	0.5e-3	0.2e-3	0.5e-3	0.2e-3	0.9e-3	0.3195
0.20	0.7e-3	0.7e-3	0.4e-3	1.0e-3	0.9e-3	0.9e-3	1.1e-3	0.9e-3	1.6e-3	0.4007
0.25	0.7e-3	0.6e-3	0.4e-3	1.0e-3	1.0e-3	0.5e-3	0.6e-3	1.1e-3	0.8e-3	0.4558
0.30	0.4e-3	0.6e-3	0.6e-3	0.3e-3	1.0e-3	0.7e-3	0.9e-3	1.0e-3	0.6e-3	0.4942
0.35	0.5e-3	0.7e-3	0.5e-3	1.4e-3	0.7e-3	1.1e-3	1.1e-3	1.3e-3	1.4e-3	0.5106
0.40	0.8e-3	0.9e-3	0.7e-3	1.2e-3	1.0e-3	0.6e-3	0.7e-3	1.0e-3	0.8e-3	0.5313

Power versus minor allele frequency for phenotype Y⁽¹⁾. The significance level is 0.001. The genotype is simulated using model OLR. Solid line with circles: inverse probability weighted test ${\hat{T}}_{IPW}$ ; dashed line with triangles: non-weighted test T; dotted line with pluses: covariate weighted test T_W,1; dotdash line with crosses: propensity score weighted test T_W,2.

Power versus minor allele frequency for phenotype Y⁽²⁾. The significance level is 0.001. The genotype is simulated using model BIN. Solid line with circles: inverse probability weighted test ${\hat{T}}_{IPW}$ ; dashed line with triangles: non-weighted test T; dotted line with pluses: covariate weighted test T_W,1; dotdash line with crosses: propensity score weighted test T_W,2.

Power versus minor allele frequency for phenotype Y⁽¹⁾. The significance level is 0.001. The genotype is simulated using model BIN. Solid line with circles: inverse probability weighted test ${\hat{T}}_{IPW}$ ; dashed line with triangles: non-weighted test T; dotted line with pluses: covariate weighted test T_W,1; dotdash line with crosses: propensity score weighted test T_W,2.

Power versus minor allele frequency for phenotype Y⁽²⁾. The significance level is 0.001. The genotype is simulated using model OLR. Solid line with circles: inverse probability weighted test ${\hat{T}}_{IPW}$ ; dashed line with triangles: non-weighted test T; dotted line with pluses: covariate weighted test T_W,1; dotdash line with crosses: propensity score weighted test T_W,2.

In our simulations, the single-trait results are very similar to the bivariate-trait results in Section 3.2. From the perspective of type I error, all four tests under ℋ₀ behave fairly well since they all possess reasonably accurate type I errors. By contrast, T′ cannot control its type I error in model N2. From the perspective of power, we observe that the inverse probability weighted test is generally comparable to the others when there are only genetic and/or environmental effects, and it outperforms the others when there are gene-environment interactions.

3.4 Impact of Model Misspecification

In Sections 3.2–3.3, we observed no major impact on testing the genetic associations caused by a possibly misspecified parametric gene-environment model. To better understand how the model misspecification affects the estimation of the propensity scores, we compare the estimation results under the two genotype models (OLR and BIN) used in Section 3.1. Figure 9 provides the boxplot of the mean squared errors of the estimated propensity scores ${\hat{p}}_{0}, {\hat{p}}_{1}$ and ${\hat{p}}_{2}$ from random samples with size of 500 based on 1,000 replications.

Mean squared error of the estimated propensity scores ${\hat{p}}_{0}, {\hat{p}}_{1}$ and ${\hat{p}}_{2}$ . Each panel includes the boxplots for mean squared errors of the estimated propensity scores ${\hat{p}}_{0}, {\hat{p}}_{1}$ and ${\hat{p}}_{2}$ , in that particular order, from genotype models OLR, BIN, and BIN’, respectively.

Since we use the ordinal logistic regression model to estimate the propensity scores (Section 2.5), when the genotype is simulated using model OLR, the estimation performance is the best. The mean squared errors of the estimated propensity scores are higher when the genotype data are simulated from model BIN.

We would like to note that we deliberately added a random error ϵ_i in model BIN for additional complexity, which can cause spurious estimation errors. For a fairer comparison, we also simulate genotype data using model BIN without the random error (referred to as model BIN’) and further present the results for BIN’ in Figure 9. From the results, it is obvious that the extra estimation error for model BIN is mainly caused by the random error we added. There is no significant difference between the estimation errors for models OLR and BIN’, indicating that the difference between the estimation performance under the two genotype models is negligible if no additional noise is included.

4 DATA ANALYSIS

4.1 Data and Methods

The Study of Addiction: Genetics and Environment (SAGE) aims to identify susceptible genetic factors that contribute to substance dependence through three large-scale genome-wide association studies: the Collaborative Study on the Genetics of Alcoholism (COGA), the Family Study of Cocaine Dependence (FSCD), and the Collaborative Genetic Study of Nicotine Dependence (COGEND). These three studies have been reported separately in previous work (Reich et al., 1998; Hartel et al., 2006; Luo et al., 2008; Bierut et al., 2008). The SAGE data include 4,121 subjects for whom the addiction to alcohol, nicotine, marijuana, cocaine, opiates, and other drugs and genome-wide SNP data (ILLUMINA Human 1M platform) were available. Lifetime dependence on these six categories of substances was diagnosed in accordance with the Diagnostic and Statistical Manual of Mental Disorders, Fourth Edition (DSM-IV). We hypothesize that there is a common genetic effect for the comorbidity including the addiction to the six categories of substances. We thus use multivariate traits, each of which stands for whether or not the subject is addicted to a single substance. The six phenotypes are coded into binary scales according to whether the subject is addicted to a particular substance.

In our study, we excluded 60 duplicate genotype samples and removed nine subjects with ethnic backgrounds other than African origin (black) or European origin (white). In total we have 3,627 unrelated subjects for whom we have both genotype and phenotype data. Following Chen et al. (2011), we performed separate analyses by race (black or white) and gender (female or male), due to the complexity of substance dependence with possible environmental components. Therefore, our analysis was performed in each of the four subpopulations: 1,393 white women, 1,131 white men, 568 black women, and 535 black men (Chen et al., 2011). In addition, we filtered SNPs by setting thresholds for call rate (> 90%), minor allele frequency (MAF) in each sub-population (> 1%), and Hardy-Weinberg equilibrium in each sub-population (p-value > 0.0001).
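
The three SNP filters can be written as a single boolean mask. The sketch below is only an illustration under stated assumptions: the function name `passes_qc` and the toy per-SNP summaries are hypothetical, and the call rate, MAF and Hardy-Weinberg p-value are assumed to have been computed beforehand within the subpopulation being analyzed.

```python
import numpy as np

def passes_qc(call_rate, maf, hwe_p,
              min_call_rate=0.90, min_maf=0.01, min_hwe_p=1e-4):
    """Return a boolean mask of SNPs that pass all three filters:
    call rate > 90%, MAF > 1%, and HWE p-value > 0.0001."""
    call_rate = np.asarray(call_rate, dtype=float)
    maf = np.asarray(maf, dtype=float)
    hwe_p = np.asarray(hwe_p, dtype=float)
    return (call_rate > min_call_rate) & (maf > min_maf) & (hwe_p > min_hwe_p)

# Toy per-SNP summaries (made up for illustration only).
call_rate = np.array([0.99, 0.85, 0.97, 0.95])
maf = np.array([0.20, 0.30, 0.005, 0.05])
hwe_p = np.array([0.50, 0.40, 0.30, 1e-6])

keep = passes_qc(call_rate, maf, hwe_p)
```

In this toy input only the first SNP is retained; the others fail the call rate, MAF and Hardy-Weinberg filters, respectively.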

As we have already split the data by the covariates race and gender, they were not adjusted for in the further analysis in each subset. Hence, the remaining covariates include age and some environmental risk factors, such as whether the subject experienced rape/sexual assault, physical assault, or non-assaultive trauma. Some other risk factors, such as neglect as a child, physical abuse as a child, and childhood sexual abuse, were not included due to their high rates of missing values.

Similar to the simulation study, we compare four association tests: non-weighted test T, covariate-weighted test T_W,1, propensity score-weighted test T_W,2, and our semiparametric propensity score-inverse probability weighted test ${\hat{T}}_{IPW}$ . With the above selected covariates, the weight functions w(·, ·) in both weighted tests T_W,1 and T_W,2 are chosen following previous work (Jiang and Zhang, 2011; Zhu et al., 2012) with default parameters. Meanwhile, we continue to use the ordinal logistic regression model for the genotype-covariate relationship in our proposed test. In addition to the above tests with multivariate traits, we also tabulate the results from analyses using a single trait at a time. For each of the six traits, we utilize two approaches to analyze them. Firstly, we fit a logistic regression model including both genotype and the selected covariates. The statistical significance is drawn from a likelihood ratio test based on the logistic regression model. Secondly, we apply the same association tests T, T_W,1, T_W,2, and ${\hat{T}}_{IPW}$ as above to each trait, and present the significant findings.

4.2 Summary Statistics

We provide in Table 4 the co-dependence information of the six substances among the 3,627 unrelated subjects included in our final analysis. The diagonal entries are the rates of each substance dependence, and the lower-diagonal entries are the co-dependence rates of each pair of substances. Comparing a lower-diagonal entry to its two corresponding diagonal entries suggests the statistical dependence among the six addictions. For example, there are 1,625 subjects (45%) who are addicted to nicotine and 1,693 subjects (47%) addicted to alcohol. The co-dependence rate of nicotine and alcohol is 32% (1,154 out of 3,627), which is much higher than the rate expected if these two addictions were statistically independent. This observation supports the existence of comorbidity among the six addictions in this data set.
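
The nicotine and alcohol comparison can be made explicit with a short calculation. The sketch below uses only the counts quoted above from Table 4; the product of the two marginal rates is the co-dependence rate that would be expected under statistical independence.

```python
n_total = 3627
n_nic = 1625   # nicotine dependent
n_alc = 1693   # alcohol dependent
n_both = 1154  # dependent on both

rate_nic = n_nic / n_total
rate_alc = n_alc / n_total

expected_rate = rate_nic * rate_alc  # under independence, about 0.21
observed_rate = n_both / n_total     # about 0.32
ratio = observed_rate / expected_rate
```

The observed co-dependence rate of about 32% is roughly one and a half times the 21% expected under independence.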

Table 4.

Dependence and co-dependence rate of six substances. nic: nicotine; mj: marijuana; coc: cocaine; op: opiates; alc: alcohol; oth: other drugs. The percentage in the parenthesis is the dependence or co-dependence rate in the 3,627 unrelated subjects.

	Substance Dependence
	nic (%)	mj (%)	coc (%)	op (%)	alc (%)	oth (%)
nic	1625 (45)
mj	486 (13)	620 (17)
coc	686 (19)	464 (13)	937 (26)
op	203 (6)	145 (4)	217 (6)	258 (7)
alc	1154 (32)	577 (16)	820 (23)	238 (7)	1693 (47)
oth	332 (9)	258 (7)	335 (9)	168 (5)	406 (11)	432 (12)

Table 5 summarizes the addiction distribution in each subset of data split by race and sex. We can see that the addiction to some categories of substances is homogeneous across the four subpopulations, such as nicotine, with addiction rates 47%, 48%, 47% and 41% respectively. However, other substance dependencies differ by race (e.g., cocaine, 46% and 36% for black men and women versus 27% and 12% for white men and women) and/or sex (e.g., alcohol, 62% and 62% for black and white men versus 39% and 31% for black and white women). Throughout our analysis, the data are divided into four subsets according to sex and race of the subjects. Therefore, we focus on the subset specific analysis, removing the heterogeneity across the subpopulations.

Table 5.

Summary of substance dependence in each subpopulation. nic: nicotine; mj: marijuana; coc: cocaine; op: opiates; alc: alcohol; oth: other drugs. The percentage in the parenthesis is the substance dependence rate in each subpopulation.

		Substance Dependence
Subset	Total	nic (%)	mj (%)	coc (%)	op (%)	alc (%)	oth (%)
Black Men	535	254 (47)	136 (25)	248 (46)	44 (8)	332 (62)	61 (11)
Black Women	568	271 (48)	78 (14)	206 (36)	35 (6)	224 (39)	37 (7)
White Men	1131	528 (47)	285 (25)	309 (27)	112 (10)	704 (62)	203 (18)
White Women	1393	572 (41)	121 (9)	174 (12)	67 (5)	433 (31)	131 (9)

Total	3627	1625 (45)	620 (17)	937 (26)	258 (7)	1693 (47)	432 (12)

4.3 Single-Trait Results

Before presenting the multiple-trait results, we summarize the single-trait results from logistic regression models and the association tests in Table 6 and Table 7, respectively. The p-values in bold characters indicate that they reach the genome-wide significance level after Bonferroni correction for the number of traits (p-value < 5 × 10⁻⁷/6) (Burton et al., 2007).
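
The Bonferroni-corrected threshold is simple arithmetic. The following sketch computes it and compares two p-values taken from the rs445057 row of Table 6; the variable names are ours.

```python
genome_wide_level = 5e-7
n_traits = 6

# Bonferroni correction for the six single-trait analyses.
threshold = genome_wide_level / n_traits  # about 8.3e-8

p_alc = 4.5e-8  # rs445057, alcohol (Table 6)
p_coc = 2.0e-4  # rs445057, cocaine (Table 6)

alc_significant = p_alc < threshold
coc_significant = p_coc < threshold
```

Only the alcohol p-value falls below the corrected threshold of roughly 8.3 × 10⁻⁸.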

Table 6.

Significant SNPs in the genome-wide association study of a single substance dependence from logistic regression. nic: nicotine; mj: marijuana; coc: cocaine; op: opiates; alc: alcohol; oth: other drugs.

				p-values
Chr	SNP	Gene	MAF	nic	mj	coc	op	alc	oth
White Women
3	rs445057	FHIT	0.174	5.9e-1	2.2e-2	2.0e-4	1.7e-1	4.5e-8	1.8e-2

Table 7.

Significant SNPs in the genome-wide association study of a single substance dependence from association tests. op: opiates; oth: other drugs.

				p-values
Chr	SNP	MAF	Gene	T	T _W,1	T _W,2	${\hat{T}}_{IPW}$
op
Black Men
2	rs2377339	0.019	NCK2	1.1e-8	1.1e-9	1.4e-9	8.2e-9
16	rs2042360	0.066	–	9.2e-7	6.5e-8	4.3e-7	9.6e-7
17	rs17544779	0.017	–	5.6e-8	6.3e-6	1.8e-6	4.6e-8
White Men
13	rs9529180	0.111	PCDH9	1.5e-7	4.6e-7	4.9e-8	1.1e-7
13	rs9540995	0.112	PCDH9	2.2e-7	7.0e-7	5.9e-8	1.5e-7
13	rs9529185	0.111	PCDH9	1.6e-7	4.7e-7	5.2e-8	1.1e-7
Black Women
5	rs2441010	0.012	–	1.0e-7	1.1e-4	8.2e-5	7.6e-8
7	rs2528381	0.084	UBE2D4	1.9e-5	5.1e-8	2.9e-5	1.6e-5
7	rs1182398	0.014	UBE3C	1.9e-7	5.6e-8	1.2e-6	1.1e-7
10	rs7911634	0.011	PCDH15	7.2e-5	2.7e-9	3.1e-6	6.6e-5
14	rs17197261	0.020	OR10G3	1.3e-5	4.5e-8	1.4e-3	1.0e-5
White Women
19	rs3745816	0.016	EML2	2.2e-5	4.4e-11	2.0e-5	1.3e-5
19	rs4445998	0.015	EML2	1.2e-5	1.2e-11	2.4e-5	6.7e-6
19	rs1545040	0.020	EML2	1.5e-3	5.7e-8	2.5e-3	1.1e-3

oth
Black Women
11	rs11603357	0.041	–	2.5e-7	2.6e-8	1.1e-8	1.5e-7
White Women
17	rs3098945	0.187	ANKRD13B	4.5e-6	1.8e-8	6.0e-7	1.1e-6

From Table 6, only one SNP achieves the genome-wide significance level (after Bonferroni correction) in the subpopulation of white women: rs445057 in gene FHIT is identified as a significant marker for addiction to alcohol. Very recently, FHIT has been reported to be associated with lifetime cigarette addiction (Antczak et al., 2013). This existing result, combined with our finding that FHIT is associated with alcohol dependence, partially supports the hypothesis that common genes underlie the comorbidity of multiple substance dependencies.

From Table 7, we have identified several significant SNP markers for each of the two phenotypes: addiction to opiates and addiction to other drugs, using the association tests T, T_W,1, T_W,2, and ${\hat{T}}_{IPW}$ .

For the addiction to opiates, three SNPs are identified to be genome-wide significant in black men. Among these SNPs, rs2377339 is located within gene NCK2, which has a strong association with normal tension glaucoma (Akiyama et al., 2008; Fuse, 2010). Furthermore, a meta-analysis (Bonovas et al., 2004) reported that smoking is a risk factor for glaucoma. These findings indicate some intriguing interplay between smoking and NCK2. A more recent study also verified the association of NCK2 with opiates addiction (Liu et al., 2013).

Three SNPs, all in gene PCDH9, are significantly associated with opiates dependence in white men. PCDH9 was discovered to contain variants that contribute to general addiction vulnerability (Liu et al., 2006), agreeing with our current finding.

Five additional SNPs, located in four known genes, achieve the genome-wide significance in black women. Among these genes, UBE3C has recently been discovered to be one of the four particularly promising candidate genes susceptible to cocaine dependence and major depressive episode (Yang et al., 2011); PCDH15 was also found to be associated with nicotine dependence by multiple human genome-wide association studies (Uhl et al., 2008; Lind et al., 2010). These results partially support our findings about the association between these two genes and opiates dependence.

Three SNPs in gene EML2 are discovered for addiction to opiates in white women. EML2 was found to be one of the potential candidate genes for bipolar disorder comorbid with alcoholism in mice (Le-Niculescu et al., 2008). However, no human studies have suggested the association of EML2 with substance dependence yet.

In addition to opiates, we have two more findings for addiction to other drugs, for which we have not found supporting evidence in the literature. All these single-trait findings can be potentially important for researchers to better understand the genetic components of substance dependence.

4.4 Multiple-Trait Results

The results from the analysis of multivariate traits are summarized in Table 8, with the p-values in bold characters indicating that they reach the genome-wide significance level (p-value < 5 × 10⁻⁷) (Burton et al., 2007). Comparing the four tests for multivariate traits, the advantage of adjusting for important covariates in this data set is fairly clear. Without any adjustment, no SNP can be identified at the genome-wide significance level using test T. In addition, we find that the different adjusted tests complement each other. The three adjusted tests (T_W,1, T_W,2 and ${\hat{T}}_{IPW}$ ) have some common findings and also non-overlapping discoveries. The results of the weighted tests might depend on the strength of the genetic signals and/or gene-environment interactions, as illustrated by our simulation studies. Similar conclusions can also be drawn from the comparison among different methods for single-trait results in Table 7.

Table 8.

Significant SNPs in the genome-wide association study of multiple substance dependencies. The symbol * indicates that the same SNP is also found by single-trait analysis in Table 7.

Chr	SNP	MAF	Gene	p-value (T)	p-value (T_W,1)	p-value (T_W,2)	p-value (${\hat{T}}_{IPW}$)
Black Men
2	rs2377339*	0.019	NCK2	1.1e-06	6.2e-08	1.4e-07	9.0e-07
5	rs251133	0.406	STARD4-AS1	5.3e-07	5.2e-06	2.8e-05	4.2e-07
5	rs10483285	0.037	ADCY4	2.4e-03	1.3e-07	5.0e-05	2.0e-03
White Men
3	rs4016435	0.042	CTNNB1	7.3e-07	6.2e-07	1.5e-07	2.6e-07
8	rs1477908	0.177	MMP16	1.1e-05	2.3e-05	2.3e-07	4.1e-06
Black Women
1	rs2175254	0.035	RASAL2	2.6e-05	4.1e-07	1.0e-05	1.7e-05
8	rs10504824	0.014	WWP1	1.1e-06	9.1e-09	2.7e-07	5.9e-07
8	rs17609515	0.014	CPNE3	1.1e-06	9.1e-09	2.7e-07	5.9e-07
10	rs7911634*	0.011	PCDH15	1.7e-04	1.1e-08	1.3e-05	1.6e-04
White Women
2	rs16866493	0.011	–	6.1e-04	1.9e-07	5.2e-04	3.3e-04
2	rs878167	0.010	–	1.3e-04	4.8e-08	1.0e-04	6.4e-05
2	rs6731600	0.039	–	2.1e-05	9.7e-06	7.1e-08	5.2e-06
2	rs6721762	0.039	MPV17	3.2e-05	1.1e-05	2.3e-07	8.7e-06
11	rs955396	0.068	TOLLIP/MUC5B	4.4e-05	1.5e-06	9.3e-08	4.4e-05
19	rs3745816*	0.016	EML2	5.2e-05	8.8e-10	1.7e-04	4.6e-05
19	rs4445998*	0.015	EML2	5.4e-05	3.8e-10	3.1e-04	4.6e-05
19	rs1545040*	0.020	EML2	6.7e-04	1.6e-07	2.4e-03	6.8e-04
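
As a small illustration of how the genome-wide threshold is applied to Table 8, the following Python sketch flags, for a few rows copied from the table, which tests fall below 5 × 10⁻⁷. The choice of rows, the dictionary layout, and the helper name `significant_tests` are made here for illustration only and are not part of the original analysis.

```python
# Genome-wide significance threshold quoted in the text (Burton et al., 2007).
THRESHOLD = 5e-7

# Order of the p-value columns in Table 8.
TESTS = ("T", "T_W1", "T_W2", "T_IPW")

# A few rows copied from Table 8: SNP -> p-values for (T, T_W,1, T_W,2, T_IPW).
TABLE8_ROWS = {
    "rs2377339": (1.1e-06, 6.2e-08, 1.4e-07, 9.0e-07),
    "rs251133": (5.3e-07, 5.2e-06, 2.8e-05, 4.2e-07),
    "rs1477908": (1.1e-05, 2.3e-05, 2.3e-07, 4.1e-06),
    "rs7911634": (1.7e-04, 1.1e-08, 1.3e-05, 1.6e-04),
}


def significant_tests(p_values, threshold=THRESHOLD):
    """Return the names of the tests whose p-value is below the threshold."""
    return [name for name, p in zip(TESTS, p_values) if p < threshold]


if __name__ == "__main__":
    for snp, p_values in TABLE8_ROWS.items():
        print(snp, significant_tests(p_values))
```

In these rows the unadjusted test T never crosses the threshold, while each adjusted test is the only one to flag at least one SNP, which is the complementarity described in the text.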


Interestingly, the multiple-trait results in Table 8 and the single-trait results in Table 7 share several findings. The common genes, such as NCK2, PCDH15, and EML2, can be of particular interest to addiction research. In the following, we provide a brief overview of the multiple-trait findings.

Three SNPs, rs2377339, rs251133 and rs10483285, which are located in genes NCK2, STARD4-AS1 and ADCY4 respectively, reach the genome-wide significance in black men. In addition to NCK2, previous research has also provided evidence for ADCY4: it is associated with opioid dependence (Wang et al., 2005; Li et al., 2008). All these results support NCK2 and ADCY4 as potentially relevant genes to substance dependence.

Two other SNPs, rs4016435 and rs1477908, in genes CTNNB1 and MMP16, achieve the genome-wide significance level in white men. The gene CTNNB1 has been suggested by microarray studies of nicotine exposure in rats (Sullivan et al., 2004), but this is the first time that this gene has been found to be related to substance dependence in a human study. In addition, MMP16 belongs to a family of genes (matrix metalloproteinases, i.e., MMPs) that is known to play an important role in drug addiction (Wright and Harding, 2009).

Four SNPs located in four different genes are discovered to be associated with substance dependence in black women. Similar to CTNNB1, RASAL2 is also a candidate gene for nicotine dependence from pathway analysis (Sullivan et al., 2004). Furthermore, multiple human genome-wide association studies identified PCDH15 to be associated with nicotine dependence (Uhl et al., 2008; Lind et al., 2010). These existing results provide partial support to our findings.

Eight other SNPs are identified using multiple addictions in white women. As with EML2, a previous microarray study in mice has provided evidence that MPV17 is associated with alcohol dependence (Li et al., 2008). However, no human studies have yet suggested an association of these two genes with substance dependence.

Besides the SNPs/genes discussed above, there are other SNPs/genes showing strong evidence of association with substance dependence in our study, and those SNPs/genes warrant further investigation.

5 DISCUSSION

Understanding comorbidity related to addictions is one of the most pressing challenges with enormous public health significance (National Institute on Drug Abuse, 2010). In this work, we studied the genetics of multiple addictions by analyzing the data from the Study of Addiction: Genetics and Environment (SAGE). To properly utilize the information collected by this study, we proposed a novel statistical method that incorporates environmental factors into a nonparametric U-statistic (generalized Kendall’s tau), which can handle comorbidity of multiple traits. Compared with directly imposing a weight function on the U-statistic, the idea of inverse probability weighting is more natural and convenient. On the one hand, the inverse probability weighted U-statistic is asymptotically unbiased under the null hypothesis, while the non-weighted and other weighted tests are not necessarily so. On the other hand, the proposed test is free of tuning parameters, which makes it more convenient and accessible than other weighted tests.
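
To make the weighting concrete, here is a minimal Python sketch of the inverse probability weighted statistic in the single-trait form that appears in the Appendix, U_IPW(θ) = 2/(n − 1) Σ_i ū_i G_i / e_θ(z_i), where e_θ(z_i) = Σ_g g p_g(z_i; θ) is the conditional mean genotype (this is how the derivative in (A.16) expands it). The per-subject quantities ū_i are defined earlier in the paper and are treated here as given inputs; the function names and the toy numbers are hypothetical.

```python
def expected_genotype(p0, p1, p2):
    """Propensity score e(z) = sum_g g * p_g(z) for genotype coding 0, 1, 2."""
    return 0.0 * p0 + 1.0 * p1 + 2.0 * p2


def ipw_u_statistic(u_bar, genotypes, propensity):
    """Scalar-trait IPW statistic: 2 / (n - 1) * sum_i u_bar[i] * G[i] / e[i]."""
    n = len(u_bar)
    if not (n == len(genotypes) == len(propensity)) or n < 2:
        raise ValueError("inputs must have equal length n >= 2")
    total = sum(u * g / e for u, g, e in zip(u_bar, genotypes, propensity))
    return 2.0 * total / (n - 1)


if __name__ == "__main__":
    # Toy inputs (hypothetical, for illustration only).
    u_bar = [0.5, -0.25, -0.25]
    genotypes = [2, 1, 0]
    # Same genotype probabilities for every subject, so e = 1.0 throughout.
    propensity = [expected_genotype(0.25, 0.5, 0.25)] * 3
    print(ipw_u_statistic(u_bar, genotypes, propensity))
```

Because E(G_i | z_i) = e(z_i), each ratio G_i / e(z_i) has conditional mean one when the genotype is independent of the traits given the covariates, which is the mechanism behind the unbiasedness property discussed above. No tuning parameter enters the computation.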

A byproduct of our theoretical work is the confirmation of a previous finding that estimated propensity scores can be preferable to their true values in applications. We show that our semiparametric U-statistic has a smaller asymptotic variance with $\sqrt{n}$ -consistent propensity score estimates than with true propensity scores. Although this phenomenon has been revealed before, to the best of our knowledge this is the first time it has been formalized in the areas of U-statistics and genetic association tests. Moreover, a recently proposed multiple-trait association test called “Scaled Multiple-phenotype Association Test” (SMAT) (Schifano et al., 2013) was brought to our attention by a referee. It is noteworthy that SMAT can only handle continuous phenotypes, while our proposed test can take any hybrid of dichotomous, ordinal and quantitative traits. Since we focus on binary responses in our current investigation of addictions, we leave the comparison study with SMAT to future work.
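
The variance comparison can be read off from quantities used in the Appendix. The following is a short sketch, assuming Λ_θ₀ = (I_p, −Γ_θ₀) Ω_θ₀ (I_p, −Γ_θ₀)′ as in the verification of (A.20), with Ω_θ₀ as given in Lemma 2 and Σ_θ₀ (the upper-left block of Ω_θ₀) being the limiting variance of $\sqrt{n}$ U_IPW(θ₀) under the true propensity scores:

```latex
\begin{aligned}
\Lambda_{\theta_0}
  &= (I_p,\; -\Gamma_{\theta_0})
     \begin{pmatrix}
       \Sigma_{\theta_0} & \Gamma_{\theta_0} I_{\theta_0}^{-1} \\
       I_{\theta_0}^{-1} \Gamma_{\theta_0}' & I_{\theta_0}^{-1}
     \end{pmatrix}
     \begin{pmatrix} I_p \\ -\Gamma_{\theta_0}' \end{pmatrix} \\
  &= \bigl(\Sigma_{\theta_0} - \Gamma_{\theta_0} I_{\theta_0}^{-1} \Gamma_{\theta_0}',\;\;
           \Gamma_{\theta_0} I_{\theta_0}^{-1} - \Gamma_{\theta_0} I_{\theta_0}^{-1}\bigr)
     \begin{pmatrix} I_p \\ -\Gamma_{\theta_0}' \end{pmatrix} \\
  &= \Sigma_{\theta_0} - \Gamma_{\theta_0} I_{\theta_0}^{-1} \Gamma_{\theta_0}' .
\end{aligned}
```

Since I_θ₀ is positive definite, Γ_θ₀ I_θ₀⁻¹ Γ_θ₀′ is positive semidefinite, so Σ_θ₀ − Λ_θ₀ is positive semidefinite: the limiting variance with estimated propensity scores is no larger than the one with the true scores.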

We have demonstrated numerical performance of our method, and should note the topics that deserve further research. For example, a key assumption for the distribution of our statistic is that the propensity scores are estimated under the correct parametric model. We assessed the impact of model misspecification in simulation studies, and our empirical results did not reveal a major impact. Nonetheless, a deeper theoretical understanding is still important. Another issue is the choice of genotype coding in our method. As discussed in Section 2.6, our test is not invariant to the genotype coding and we provided a practical suggestion. Although it is not the focus of the current study, it warrants some future investigations.

Applying the new method (together with other methods) to the SAGE data leads to a few interesting findings. Firstly, the multiple-trait analysis reveals new markers that were not identified by the single addiction analysis. When a genetic signal is not strong enough for any single addiction and yet underlies multiple ones, it can become stronger (to a detectable level) by combining different substance dependencies.

Secondly, our analysis of the SAGE data reveals an advantage of adjusting for environmental factors. To study comorbidity, adjusted tests identified a few genetic variants to addiction but the unadjusted test did not have any findings. This agrees with the observations from our simulation studies. Most of the time, the inclusion of important environmental factors can increase the power to detect either the genetic effect or the gene-environment interaction. Even under the situation with a genetic effect only (no environmental effects), an unnecessary adjustment for the environmental factors has little effect on the power of a test.

Lastly, tests with different adjustments behave differently. Due to the nature of the real data analysis, we cannot really tell which method performs the best. In a real application, it is usually not practical to have one method that is always superior to all others. Therefore, it is useful that different adjusted tests work complementarily to each other in this data set.

Acknowledgments

The authors would like to thank Zhifa Liu for his assistance in biologically interpreting the findings from the data analysis. The authors also thank the editor, the associate editor, and two anonymous referees for their comments and suggestions that led to considerable improvements of the paper. This research is supported in part by grants R01 DA016750 and R01 DA029081 from the National Institutes of Health (NIH). The dataset used for the analyses described in this manuscript was obtained from dbGaP at http://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs000092.v1.p1 through dbGaP accession number phs000092.v1.p. The data collection was funded by NIH grants U01 HG004422, U01 HG004446, U10 AA008401, P01 CA089392, R01 DA013423, U01 HG004438, and HHSN268200782096C.

A APPENDIX

We split our derivation of Theorem 2 into three steps as follows. The first step is to obtain an asymptotic representation of $\hat{θ}$ . Under regularity conditions, there exists a $\sqrt{n}$ -consistent estimator $\hat{θ}$ of θ₀. The following lemma presents the result, with its proof given in Appendix A.1.

Lemma 1. Let the parameter space Θ be an open set. Suppose that (i) there exist some δ > 0 and c_θ₀ > 0 such that p_g(z_i; θ) ∈ [δ, 1 − δ] for all θ satisfying ∥θ − θ₀∥ ≤ c_θ₀ with g = 0, 1, 2 and i = 1, … , n; (ii) ℓ_i(θ) is twice continuously differentiable; (iii) for each g = 0, 1, 2, condition (9) holds, and there exist constants C_θ₀ > 0 and α > 0 such that for all θ satisfying ∥θ − θ₀∥ ≤ c_θ₀, condition (10) holds; and (iv) there exists a positive definite matrix I_θ₀ such that $\frac{1}{n} \sum_{i = 1}^{n} I_{θ_{0}} (z_{i}) \to I_{θ_{0}}$ . Then there exists a root $\hat{θ}$ of the likelihood equations, as an estimator of θ₀, which has the following representation

\sqrt{n} (\hat{θ} - θ_{0}) = I_{θ_{0}}^{- 1} \frac{1}{\sqrt{n}} \sum_{i = 1}^{n} ψ_{θ_{0}} (G_{i}, z_{i}) + o_{p} (1) .

(A.1)

The result of Lemma 1 is fairly standard for a root of the likelihood equations $\hat{θ}$ in the framework of maximum likelihood. We refer to Theorem 5.21 in van der Vaart (1998) and Theorem 4.17 in Shao (2003) as similar conclusions. A distinct part of this lemma is that the samples are only independent but not identically distributed due to the conditional inference given all the covariates. In other words, the covariates are regarded as non-random. This characteristic results in the unique conditions (9) and (10) involving the covariate z_i’s, compared with the traditional theories. Thus, we provide a proof in Appendix A.1 for being clear and self-contained.
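
To illustrate what "a root of the likelihood equations" means computationally, the sketch below solves Σ_i ψ_θ(G_i, z_i) = 0 by Newton iterations. It assumes a hypothetical one-parameter model, G_i | z_i ~ Binomial(2, π_i) with π_i = 1 / (1 + exp(−θ z_i)), chosen here only for illustration; it is not claimed to be the genotype model used in the paper, and all function names are hypothetical. The covariates enter as fixed numbers, in line with the conditional inference described above.

```python
import math


def expit(x):
    """Logistic function 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def score(theta, genotypes, covariates):
    """Sum over subjects of psi_theta(G_i, z_i) = z_i * (G_i - 2 * pi_i)."""
    return sum(z * (g - 2.0 * expit(theta * z)) for g, z in zip(genotypes, covariates))


def information(theta, covariates):
    """Sum over subjects of the Fisher information 2 * z_i^2 * pi_i * (1 - pi_i)."""
    total = 0.0
    for z in covariates:
        pi = expit(theta * z)
        total += 2.0 * z * z * pi * (1.0 - pi)
    return total


def solve_likelihood_equation(genotypes, covariates, theta=0.0, tol=1e-10, max_iter=50):
    """Newton iterations for a root of the score (likelihood) equation."""
    for _ in range(max_iter):
        step = score(theta, genotypes, covariates) / information(theta, covariates)
        theta += step
        if abs(step) < tol:
            break
    return theta


if __name__ == "__main__":
    # Toy data (hypothetical): genotypes coded 0/1/2 and one fixed covariate.
    genotypes = [0, 2, 1, 1]
    covariates = [-1.0, 1.0, -1.0, 1.0]
    theta_hat = solve_likelihood_equation(genotypes, covariates)
    print(theta_hat, score(theta_hat, genotypes, covariates))
```

Under this toy model the per-subject information depends on z_i, so the summands of the score are independent but not identically distributed, which is the setting Lemma 1 addresses.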

The second step is to investigate the asymptotic joint distribution of ${U_{IPW}^{'} (θ_{0}), \hat{θ}'}'$ . The idea becomes clear with the conclusion of Lemma 1, as both U_IPW(θ₀) and $\hat{θ} - θ_{0}$ can be written in the form of a sum of independent random vectors. Hence, ${U_{IPW}^{'} (θ_{0}), (\hat{θ} - θ_{0})'}'$ becomes a sum of independent random vectors, on which we can apply the central limit theorem. Thus, we leave the proof in Appendix A.2 and present the result in the following lemma.

Lemma 2. In addition to the conditions in Lemma 1, assume that $λ_{max} (\sum_{i = 1}^{n} {\overset{‒}{u}}_{i} {\overset{‒}{u}}_{i}^{'}) = O (n)$ and ${max}_{1 \leq i \leq n} λ_{max} (γ_{i 1} γ_{i 1}^{'} + γ_{i 2} γ_{i 2}^{'}) = o [λ_{min} {\sum_{i = 1}^{n} (γ_{i 1} γ_{i 1}^{'} + γ_{i 2} γ_{i 2}^{'})}]$ . Then, under the null hypothesis ℋ₀,

\sqrt{n} Ω_{θ_{0}}^{- 1 ∕ 2} [\begin{matrix} U_{IPW} (θ_{0}) \\ \hat{θ} - θ_{0} \end{matrix}] \to N (0, I_{p + d}),

(A.2)

in distribution, conditioning on all the traits Y = y and covariates Z = z. In (A.2),

Ω_{θ_{0}} = \begin{pmatrix} Σ_{θ_{0}} & Γ_{θ_{0}} I_{θ_{0}}^{- 1} \\ I_{θ_{0}}^{- 1} Γ_{θ_{0}}^{'} & I_{θ_{0}}^{- 1} \end{pmatrix},

where Σ_θ₀ and Γ_θ₀ are defined in Section 2.4.

Finally, as the last step, the asymptotic distribution of ${\hat{U}}_{IPW}$ follows from the joint asymptotic distribution of U_IPW(θ₀) and $\hat{θ}$ , borrowing the idea from Pierce (1982) and Randles (1982). The proof of this step can be found in Appendix A.3.

A.1 Proof of Lemma 1

In this section, all probability related arguments/operations will be conditioning on the covariates. However, to simplify the notation, we still write E(·) or var(·) instead of E(· | Z = z) or var(· | Z = z).

We first prove that $\sqrt{n} (\hat{θ} - θ_{0}) = O_{p} (1)$ . This is implied by the fact that for any ϵ > 0, there exist C > 0 and n₀ > 1 such that

P {log ℓ (θ) - log ℓ (θ_{0}) < 0 for all θ \in \partial B_{n} (C)} \geq 1 - ϵ, n > n_{0},

(A.3)

where $log ℓ (θ) = \sum_{i = 1}^{n} log ℓ_{i} (θ)$ and ∂B_n(C) is the boundary of $B_{n} (C) = {θ : \sqrt{n} ∥ θ - θ_{0} ∥ \leq C}$ . Let $Ψ_{n} (θ) = \frac{1}{n} \sum_{i = 1}^{n} ψ_{θ} (G_{i}, z_{i})$ . The Taylor expansion gives that

\frac{1}{n} {log ℓ (θ) - log ℓ (θ_{0})} = Ψ_{n}^{'} (θ_{0}) (θ - θ_{0}) + \frac{1}{2} (θ - θ_{0})' \frac{\partial Ψ_{n} (\tilde{θ})}{\partial θ'} (θ - θ_{0}),

(A.4)

where $\tilde{θ}$ is the generic notation of a vector lying between θ₀ and θ. We will show at the end that,

‖ Ψ_{n} (θ_{0}) ‖ = O_{p} (n^{- 1 ∕ 2}), \frac{\partial Ψ_{n} (\tilde{θ})}{\partial θ'} + I_{θ_{0}} = o_{p} (1) .

(A.5)

Combining (A.4) and (A.5),

\frac{1}{n} {log ℓ (θ) - log ℓ (θ_{0})} = ‖ θ - θ_{0} ‖ O_{p} (n^{- 1 ∕ 2}) - \frac{1}{2} (θ - θ_{0})' {I_{θ_{0}} + o_{p} (1)} (θ - θ_{0}),

therefore (A.3) holds with large enough C and n₀. The $\sqrt{n}$ -consistency of $\hat{θ}$ is proved.

To obtain the asymptotic representation (A.1) of $\hat{θ}$ , we consider the Taylor expansion of $Ψ_{n} (\hat{θ})$ at θ₀. On the one hand, $Ψ_{n} (\hat{θ}) = 0$ by the definition of a root of the likelihood equations; on the other hand,

Ψ_{n} (\hat{θ}) = Ψ_{n} (θ_{0}) + \frac{\partial Ψ_{n} (\tilde{θ})}{\partial θ'} (\hat{θ} - θ_{0}),

(A.6)

where $\tilde{θ}$ lies between θ₀ and $\hat{θ}$ . Then the representation (A.1) in Lemma 1 holds by (A.6), $\sqrt{n} (\hat{θ} - θ_{0}) = O_{p} (1)$ , and the same result as the second part of (A.5) but with $\tilde{θ}$ denoting a vector between θ₀ and $\hat{θ}$ (which will be proved immediately).

At the end, we provide the proof of (A.5). For Ψ_n(θ₀), it is seen that

E {Ψ_{n} (θ_{0})} = 0, n var {Ψ_{n} (θ_{0})} = \frac{1}{n} \sum_{i = 1}^{n} I_{θ_{0}} (z_{i}) \to I_{θ_{0}},

because of the exchangeability of the partial derivative and integration with respect to a discrete measure. Then, for any ϵ > 0, we can choose C_ϵ large enough such that

P {‖ \sqrt{n} Ψ_{n} (θ_{0}) ‖ > C_{ϵ}} \leq C_{ϵ}^{- 2} E {n {‖ Ψ_{n} (θ_{0}) ‖}^{2}} = C_{ϵ}^{- 2} tr [n var {Ψ_{n} (θ_{0})}] < ϵ,

which is the first part of (A.5). For the second part, we need to show it holds for $\tilde{θ}$ satisfying either $\sqrt{n} ∥ \tilde{θ} - θ_{0} ∥ \leq C$ or $\sqrt{n} ∥ \tilde{θ} - θ_{0} ∥ = O_{p} (1)$ . In either case, we have that

\frac{\partial Ψ_{n} (\tilde{θ})}{\partial θ'} = \frac{\partial Ψ_{n} (θ_{0})}{\partial θ'} + o_{p} (1),

(A.7)

E {\frac{\partial Ψ_{n} (θ_{0})}{\partial θ'}} = - \frac{1}{n} \sum_{i = 1}^{n} I_{θ_{0}} (z_{i}) \to - I_{θ_{0}},

(A.8)

var {\frac{\partial Ψ_{n} (θ_{0})}{\partial θ'} c} = \frac{1}{n^{2}} \sum_{i = 1}^{n} var [{\frac{\partial}{\partial θ'} ψ_{θ_{0}} (G_{i}, z_{i})} c] \to 0,

(A.9)

for an arbitrary d-dimensional vector c. (A.7) follows from the following equation

\frac{\partial Ψ_{n} (θ)}{\partial θ'} = \frac{1}{n} \sum_{i = 1}^{n} \sum_{g = 0}^{2} I (G_{i} = g) {p_{g}^{- 1} (z_{i}; θ) \frac{\partial^{2}}{\partial θ \partial θ'} p_{g} (z_{i}; θ) - p_{g}^{- 2} (z_{i}; θ) \frac{\partial}{\partial θ} p_{g} (z_{i}; θ) \frac{\partial}{\partial θ'} p_{g} (z_{i}; θ)}

and the conditions (9) and (10) in Lemma 1. (A.8) follows from the exchangeability of the partial derivative and integration with respect to a discrete measure. (A.9) follows from the condition (9) in Lemma 1. By Markov’s inequality, for any ϵ > 0,

\begin{matrix} P [‖ {\frac{\partial Ψ_{n} (θ_{0})}{\partial θ'} + I_{θ_{0}}} c ‖ > ϵ] & \leq ϵ^{- 2} E [{‖ {\frac{\partial Ψ_{n} (θ_{0})}{\partial θ'} - E \frac{\partial Ψ_{n} (θ_{0})}{\partial θ'}} c ‖}^{2}] + ϵ^{- 2} E [{‖ {E \frac{\partial Ψ_{n} (θ_{0})}{\partial θ'} + I_{θ_{0}}} c ‖}^{2}] \\ = ϵ^{- 2} tr [var {\frac{\partial Ψ_{n} (θ_{0})}{\partial θ'} c}] + ϵ^{- 2} {‖ {E \frac{\partial Ψ_{n} (θ_{0})}{\partial θ'} + I_{θ_{0}}} c ‖}^{2} \to 0 . \end{matrix}

(A.10)

The second part of (A.5) is implied by (A.7) and (A.10).

A.2 Proof of Lemma 2

In the next two subsections (Sections A.2 and A.3), all probability related arguments/operations will be conditioning on the traits and covariates. However, to simplify the notation, we still write E(·) or var(·) instead of E(· | Y = y, Z = z) or var(· | Y = y, Z = z).

From the Cramér-Wold device, it suffices to find the asymptotic distribution of $c_{1}^{'} U_{IPW} (θ_{0}) + c_{2}^{'} (\hat{θ} - θ_{0})$ for arbitrary p- and d-dimensional vectors c₁ and c₂. As $\sqrt{n} U_{IPW} (θ_{0}) = O_{p} (1)$ from Theorem 1 and the condition $λ_{max} (\sum_{i = 1}^{n} {\overset{‒}{u}}_{i} {\overset{‒}{u}}_{i}^{'}) = O (n)$ , it is seen that

\sqrt{n} {c_{1}^{'} U_{IPW} (θ_{0}) + c_{2}^{'} (\hat{θ} - θ_{0})} = \frac{1}{\sqrt{n}} \sum_{i = 1}^{n} [2 c_{1}^{'} {\overset{‒}{u}}_{i} G_{i} ∕ e_{θ_{0}} (z_{i}) + c_{2}^{'} I_{θ_{0}}^{- 1} ψ_{θ_{0}} (G_{i}, z_{i})] + o_{p} (1) .

(A.11)

A direct calculation gives its variance

\begin{matrix} σ_{n}^{2} & = var [\frac{1}{\sqrt{n}} \sum_{i = 1}^{n} {2 c_{1}^{'} {\overset{‒}{u}}_{i} G_{i} ∕ e_{θ_{0}} (z_{i}) + c_{2}^{'} I_{θ_{0}}^{- 1} ψ_{θ_{0}} (G_{i}, z_{i})}] \\ = c_{1}^{'} [\frac{4}{n} \sum_{i = 1}^{n} {\overset{‒}{u}}_{i} {\overset{‒}{u}}_{i}^{'} ν_{θ_{0}} (z_{i}) ∕ e_{θ_{0}}^{2} (z_{i})] c_{1} + c_{2}^{'} [I_{θ_{0}}^{- 1} \frac{1}{n} \sum_{i = 1}^{n} I_{θ_{0}} (z_{i}) I_{θ_{0}}^{- 1}] c_{2} \end{matrix}

(A.12)

+ 2 c_{1}^{'} [\frac{2}{n} \sum_{i = 1}^{n} {\overset{‒}{u}}_{i} E {G_{i} ψ_{θ_{0}}^{'} (G_{i}, z_{i})} I_{θ_{0}}^{- 1} ∕ e_{θ_{0}} (z_{i})] c_{2},

(A.13)

where we have in (A.12) that

c_{2}^{'} [I_{θ_{0}}^{- 1} \frac{1}{n} \sum_{i = 1}^{n} I_{θ_{0}} (z_{i}) I_{θ_{0}}^{- 1}] c_{2} \to c_{2}^{'} I_{θ_{0}}^{- 1} c_{2}, n \to \infty,

and in (A.13) that

\begin{matrix} 2 c_{1}^{'} [\frac{2}{n} \sum_{i = 1}^{n} {\overset{‒}{u}}_{i} E {G_{i} ψ_{θ_{0}}^{'} (G_{i}, z_{i})} I_{θ_{0}}^{- 1} ∕ e_{θ_{0}} (z_{i})] c_{2} \\ = & 2 c_{1}^{'} (\frac{2}{n} \sum_{i = 1}^{n} {\overset{‒}{u}}_{i} \sum_{g = 0}^{2} [E {G_{i} I (G_{i} = g)} p_{g}^{- 1} (z_{i}; θ_{0}) \frac{\partial}{\partial θ'} p_{g} (z_{i}; θ_{0})] I_{θ_{0}}^{- 1} ∕ e_{θ_{0}} (z_{i})) c_{2} \\ = & 2 c_{1}^{'} [\frac{2}{n} \sum_{i = 1}^{n} {\overset{‒}{u}}_{i} \sum_{g = 0}^{2} {g \frac{\partial}{\partial θ'} p_{g} (z_{i}; θ_{0})} I_{θ_{0}}^{- 1} ∕ e_{θ_{0}} (z_{i})] c_{2} . \end{matrix}

Therefore,

σ_{n}^{2} = c_{1}^{'} Σ_{θ_{0}} c_{1} + c_{2}^{'} I_{θ_{0}}^{- 1} c_{2} + 2 c_{1}^{'} Γ_{θ_{0}} I_{θ_{0}}^{- 1} c_{2} + o (1) .

In order to apply the central limit theorem as in Corollary 1.3 in Shao (2003), we need to rewrite (A.11) into

\frac{1}{\sqrt{n}} \sum_{i = 1}^{n} [2 c_{1}^{'} {\overset{‒}{u}}_{i} G_{i} ∕ e_{θ_{0}} (z_{i}) + c_{2}^{'} I_{θ_{0}}^{- 1} ψ_{θ_{0}} (G_{i}, z_{i})] = \frac{1}{\sqrt{n}} \sum_{i = 1}^{n} d_{i}^{'} {R_{i} - E (R_{i})},

with d_i = (d_i1, d_i2)′, R_i = {I(G_i = 1), I(G_i = 2)}′, and

\begin{matrix} d_{i 1} & = 2 c_{1}^{'} {\overset{‒}{u}}_{i} ∕ e_{θ_{0}} (z_{i}) + c_{2}^{'} I_{θ_{0}}^{- 1} {p_{1}^{- 1} (z_{i}; θ_{0}) \frac{\partial}{\partial θ} p_{1} (z_{i}; θ_{0}) - p_{0}^{- 1} (z_{i}; θ_{0}) \frac{\partial}{\partial θ} p_{0} (z_{i}; θ_{0})} = (2 c_{1}^{'}, c_{2}^{'} I_{θ_{0}}^{- 1}) γ_{i 1}, \\ d_{i 2} & = 4 c_{1}^{'} {\overset{‒}{u}}_{i} ∕ e_{θ_{0}} (z_{i}) + c_{2}^{'} I_{θ_{0}}^{- 1} {p_{2}^{- 1} (z_{i}; θ_{0}) \frac{\partial}{\partial θ} p_{2} (z_{i}; θ_{0}) - p_{0}^{- 1} (z_{i}; θ_{0}) \frac{\partial}{\partial θ} p_{0} (z_{i}; θ_{0})} = (2 c_{1}^{'}, c_{2}^{'} I_{θ_{0}}^{- 1}) γ_{i 2} \end{matrix}

using the notation introduced in Lemma 2.

From the condition ${max}_{1 \leq i \leq n} λ_{max} (γ_{i 1} γ_{i 1}^{'} + γ_{i 2} γ_{i 2}^{'}) = o [λ_{min} {\sum_{i = 1}^{n} (γ_{i 1} γ_{i 1}^{'} + γ_{i 2} γ_{i 2}^{'})}]$ , we see that

\max_{1 \leq i \leq n} ‖ d_{i} ‖^{2} ∕ \sum_{i = 1}^{n} ‖ d_{i} ‖^{2} \to 0 .

The conditions in Lemma 2 also lead to $\inf_{n, i} λ_{min} \{var (R_{i})\} > 0$ and $\sup_{n, i} E (∥ R_{i} ∥^{2 + δ}) < \infty$ for δ = 2. These regularity conditions imply that

\frac{1}{σ_{n}} \sqrt{n} {c_{1}^{'} U_{IPW} (θ_{0}) + c_{2}^{'} (\hat{θ} - θ_{0})} \to N (0, 1)

in distribution. If Ω_θ₀ is positive definite, then substituting $(c_{1}^{'}, c_{2}^{'}) = ({\tilde{c}}_{1}^{'}, {\tilde{c}}_{2}^{'}) Ω_{θ_{0}}^{- 1 ∕ 2}$ already leads to the result in Lemma 2.

The last piece to prove is the positive definiteness of Ω_θ₀. Let V_i = var(R_i) and $A_{i} = diag (2 I_{p}, I_{θ_{0}}^{- 1}) (γ_{i 1}, γ_{i 2})$ , then

Ω_{θ_{0}} = \frac{1}{n} (A_{1}, \dots, A_{n}) diag (V_{1}, \dots, V_{n}) (A_{1}, \dots, A_{n})' + o (1) .

We see that inf_n[λ_min{diag(V₁, … , V_n)}] > 0. In addition, there exists some δ_n > 0,

{‖ (x', y') (A_{1}, \dots, A_{n}) ‖}^{2} = (2 x', y' I_{θ_{0}}^{- 1}) {\sum_{i = 1}^{n} (γ_{i 1} γ_{i 1}^{'} + γ_{i 2} γ_{i 2}^{'})} (2 x', y' I_{θ_{0}}^{- 1})' \geq δ_{n} {‖ (x', y') ‖}^{2},

for arbitrary p- and d-dimensional vectors x and y. Therefore, for n sufficiently large,

(x', y') Ω_{θ_{0}} (x', y')' \geq {δ_{n} ∕ (2 n)} inf_{n} [λ_{min} {diag (V_{1}, \dots, V_{n})}] {‖ (x', y') ‖}^{2},

(A.14)

which implies the positive definiteness of Ω_θ₀.

A.3 Proof of Theorem 2

The proof follows from the idea in Pierce (1982) and Randles (1982) who provided a general guidance of deriving the asymptotic distribution of statistics with estimated parameters. In our situation, the statistic is ${\hat{U}}_{IPW} = U_{IPW} (\hat{θ})$ where $\hat{θ}$ are the estimated parameters. The proof starts from the following fact,

{\hat{U}}_{IPW} = U_{IPW} (\hat{θ}) = U_{IPW} (θ_{0}) + \frac{\partial}{\partial θ'} U_{IPW} (\tilde{θ}) (\hat{θ} - θ_{0}),

(A.15)

with some $\tilde{θ}$ lying between θ₀ and $\hat{θ}$ . As

U_{IPW} (θ) = \frac{2}{n - 1} \sum_{i = 1}^{n} {\overset{‒}{u}}_{i} G_{i} ∕ e_{θ} (Z_{i}),

it is seen that

\begin{matrix} \frac{\partial}{\partial θ'} U_{IPW} (θ_{0}) & = - \frac{2}{n - 1} \sum_{i = 1}^{n} {\overset{‒}{u}}_{i} G_{i} \frac{\partial}{\partial θ'} e_{θ_{0}} (Z_{i}) ∕ e_{θ_{0}}^{2} (Z_{i}) \\ & = - \frac{2}{n - 1} \sum_{i = 1}^{n} {\overset{‒}{u}}_{i} G_{i} \sum_{g = 0}^{2} {g \frac{\partial}{\partial θ'} p_{g} (Z_{i}; θ_{0})} ∕ e_{θ_{0}}^{2} (Z_{i}) \\ & = - Γ_{θ_{0}} {1 + o (1)} + o_{p} (1) . \end{matrix}

(A.16)

The equality in (A.16) follows from the facts that

\begin{matrix} E {\frac{\partial}{\partial θ'} U_{IPW} (θ_{0})} & = - \frac{n}{n - 1} Γ_{θ_{0}}, and \\ var {\frac{\partial}{\partial θ'} U_{IPW}^{(l)} (θ_{0})} & = \frac{4}{{(n - 1)}^{2}} \sum_{i = 1}^{n} {{\overset{‒}{u}}_{i}^{(l)}}^{2} v_{θ_{0}} (z_{i}) \sum_{g = 0}^{2} {g \frac{\partial}{\partial θ} p_{g} (z_{i}; θ_{0})} \sum_{g = 0}^{2} {g \frac{\partial}{\partial θ'} p_{g} (z_{i}; θ_{0})} ∕ e_{θ_{0}}^{4} (z_{i}) \\ \to 0, \end{matrix}

due to the condition ${max}_{1 \leq i \leq n} ∥ {\overset{‒}{u}}_{i} ∥^{2} = o (n)$ and the first part of condition (9). In addition, since $\tilde{θ} - θ_{0} = O_{p} (n^{- 1 ∕ 2})$ ,

\frac{\partial}{\partial θ'} U_{IPW} (\tilde{θ}) - \frac{\partial}{\partial θ'} U_{IPW} (θ_{0}) = \frac{\partial^{2}}{\partial θ \partial θ'} U_{IPW} ({\tilde{θ}}^{*}) (\tilde{θ} - θ_{0}) = o_{p} (1),

(A.17)

with ${\tilde{θ}}^{*}$ between θ₀ and $\tilde{θ}$ . The equality in (A.17) follows from the fact that for each l = 1, … , p,

\begin{matrix} \frac{\partial^{2}}{\partial θ \partial θ'} U_{IPW}^{(l)} ({\tilde{θ}}^{*}) & = - \frac{2}{n - 1} \sum_{i = 1}^{n} {\overset{‒}{u}}_{i}^{(l)} G_{i} \sum_{g = 0}^{2} {g \frac{\partial^{2}}{\partial θ \partial θ'} p_{g} (Z_{i}; {\tilde{θ}}^{*})} ∕ e_{{\tilde{θ}}^{*}}^{2} (Z_{i}) + \frac{4}{n - 1} \sum_{i = 1}^{n} {\overset{‒}{u}}_{i}^{(l)} G_{i} \sum_{g = 0}^{2} {g \frac{\partial}{\partial θ} p_{g} (Z_{i}; {\tilde{θ}}^{*})} \sum_{g = 0}^{2} {g \frac{\partial}{\partial θ'} p_{g} (Z_{i}; {\tilde{θ}}^{*})} ∕ e_{{\tilde{θ}}^{*}}^{3} (Z_{i}) \\ = o_{p} (\sqrt{n}), \end{matrix}

by the condition ${max}_{1 \leq i \leq n} ∥ {\overset{‒}{u}}_{i} ∥^{2} = o (n)$ and the condition (9). Substituting (A.16) and (A.17) into (A.15) leads to

\begin{matrix} \sqrt{n} Λ_{θ_{0}}^{- 1 ∕ 2} {\hat{U}}_{IPW} & = \sqrt{n} Λ_{θ_{0}}^{- 1 ∕ 2} U_{IPW} (θ_{0}) - Λ_{θ_{0}}^{- 1 ∕ 2} Γ_{θ_{0}} \sqrt{n} (\hat{θ} - θ_{0}) + o_{p} (1) \\ = Λ_{θ_{0}}^{- 1 ∕ 2} \sqrt{n} {U_{IPW} (θ_{0}) - Γ_{θ_{0}} (\hat{θ} - θ_{0})} + o_{p} (1) \end{matrix}

(A.18)

= {Λ_{θ_{0}}^{- 1 ∕ 2} (I_{p}, - Γ_{θ_{0}}) Ω_{θ_{0}}^{1 ∕ 2}} \sqrt{n} Ω_{θ_{0}}^{- 1 ∕ 2} [\begin{matrix} U_{IPW} (θ_{0}) \\ \hat{θ} - θ_{0} \end{matrix}] + o_{p} (1) .

(A.19)

The equality in (A.18) follows if

{‖ Γ_{θ_{0}} ‖}_{2} = O (1) and {‖ Λ_{θ_{0}}^{- 1 ∕ 2} ‖}_{2} = O (1),

(A.20)

where $∥ A ∥_{2} = \{λ_{max} (A'A)\}^{1 ∕ 2}$ is the spectral norm for any matrix A. We will prove (A.20) at the end. Combining Lemma 2 and the fact that

{Λ_{θ_{0}}^{- 1 ∕ 2} (I_{p}, - Γ_{θ_{0}}) Ω_{θ_{0}}^{1 ∕ 2}} {Λ_{θ_{0}}^{- 1 ∕ 2} (I_{p}, - Γ_{θ_{0}}) Ω_{θ_{0}}^{1 ∕ 2}}' = I_{p},

(A.19) leads to the following convergence in distribution

\sqrt{n} Λ_{θ_{0}}^{- 1 ∕ 2} {\hat{U}}_{IPW} \to N (0, I_{p}) .
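
For a single trait (p = 1), the limit above says that $\sqrt{n} {\hat{U}}_{IPW}$ divided by the square root of Λ_θ₀ is approximately standard normal under the null hypothesis, so a two-sided p-value follows from the normal tail. The sketch below assumes that an estimate of Λ_θ₀ is already available; the function name and the example numbers are hypothetical.

```python
import math


def ipw_test_p_value(u_hat, lambda_hat, n):
    """Return (z, p) with z = sqrt(n) * u_hat / sqrt(lambda_hat) and a two-sided normal p-value."""
    if lambda_hat <= 0 or n <= 0:
        raise ValueError("lambda_hat and n must be positive")
    z = math.sqrt(n) * u_hat / math.sqrt(lambda_hat)
    # P(|N(0, 1)| > |z|) = erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value


if __name__ == "__main__":
    # Hypothetical numbers, for illustration only.
    z, p = ipw_test_p_value(u_hat=0.196, lambda_hat=1.0, n=100)
    print(z, p)
```

For p > 1, the same limit implies that n ${\hat{U}}_{IPW}^{'}$ Λ_θ₀⁻¹ ${\hat{U}}_{IPW}$ is approximately chi-square distributed with p degrees of freedom, which would require a chi-square tail function in place of `math.erfc`.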

At the end, we verify (A.20) to complete our proof. There exists a constant C > 0 such that

\begin{matrix} {‖ Γ_{θ_{0}} ‖}_{2} & \leq C \sum_{g = 0}^{2} \frac{1}{n} {‖ \sum_{i = 1}^{n} g {\overset{‒}{u}}_{i} \frac{\partial}{\partial θ} p_{g} (Z_{i}; θ_{0}) ‖}_{2} \\ \leq C \sum_{g = 0}^{2} \frac{2}{n} {‖ \sum_{i = 1}^{n} {\overset{‒}{u}}_{i} {\overset{‒}{u}}_{i}^{'} ‖}_{2}^{1 ∕ 2} {‖ \sum_{i = 1}^{n} \frac{\partial}{\partial θ} p_{g} (Z_{i}; θ_{0}) \frac{\partial}{\partial θ'} p_{g} (Z_{i}; θ_{0}) ‖}_{2}^{1 ∕ 2} \\ = \frac{1}{n} O (\sqrt{n}) O (\sqrt{n}) = O (1) . \end{matrix}

Also, for an arbitrary x ∈ R^p,

\begin{matrix} x' Λ_{θ_{0}} x & = x' (I_{p}, - Γ_{θ_{0}}) Ω_{θ_{0}} (I_{p}, - Γ_{θ_{0}})' x \\ & = (x', - x' Γ_{θ_{0}}) Ω_{θ_{0}} (x', - x' Γ_{θ_{0}})' \\ & \geq inf_{n} {λ_{min} (Ω_{θ_{0}})} {‖ x ‖}^{2} . \end{matrix}

(A.21)

With the condition $λ_{min} {\sum_{i = 1}^{n} (γ_{i 1} γ_{i 1}^{'} + γ_{i 2} γ_{i 2}^{'})} \geq n ϵ$ in Theorem 2, δ_n in (A.14) can be replaced with nδ for some δ > 0, which in turn implies that inf_n{λ_min(Ω_θ₀)} > 0. Then we know inf_n{λ_min(Λ_θ₀)} > 0 according to (A.21). So $‖ Λ_{θ_{0}}^{- 1 ∕ 2} ‖_{2} = O (1)$ .

References

Akiyama M, Yatsu K, Ota M, Katsuyama Y, Kashiwagi K, Mabuchi F, Iijima H, Kawase K, Yamamoto T, Nakamura M, Negi A, Sagara T, Kumagai N, Nishida T, Inatani M, Tanihara H, Ohno S, Inoko H, Mizuki N. Microsatellite analysis of the GLC1B locus on chromosome 2 points to NCK2 as a new candidate gene for normal tension glaucoma. British Journal of Ophthalmology. 2008;92:1293–1296. doi: 10.1136/bjo.2008.139980.
Antczak A, Migdalska-Sek M, Pastuszak-Lewandoska D, Czarnecka K, Nawrot E, Domańska D, Kordiak J, Górski P, Brzeziańska E. Significant frequency of allelic imbalance in 3p region covering RARβ and MLH1 loci seems to be essential in molecular non-small cell lung cancer diagnosis. Medical Oncology. 2013;30:1–10. doi: 10.1007/s12032-013-0532-9.
Bierut LJ, Agrawal A, Bucholz KK, Doheny KF, Laurie C, Pugh E, Fisher S, Fox L, Howells W, Bertelsen S, et al. A genome-wide association study of alcohol dependence. Proceedings of the National Academy of Sciences. 2010;107:5082–5087. doi: 10.1073/pnas.0911109107.
Bierut LJ, Madden PA, Breslau N, Johnson EO, Hatsukami D, Pomerleau OF, Swan GE, Rutter J, Bertelsen S, Fox L, et al. Novel genes identified in a high-density genome wide association study for nicotine dependence. Human Molecular Genetics. 2007;16:24–35. doi: 10.1093/hmg/ddl441.
Bierut LJ, Strickland JR, Thompson JR, Afful SE, Cottler LB. Drug use and dependence in cocaine dependent subjects, community-based individuals, and their siblings. Drug and Alcohol Dependence. 2008;95:14–22. doi: 10.1016/j.drugalcdep.2007.11.023.
Bonovas S, Filioussi K, Tsantes A, Peponis V. Epidemiological association between cigarette smoking and primary open-angle glaucoma: a meta-analysis. Public Health. 2004;118:256–261. doi: 10.1016/j.puhe.2003.09.009.
Burton PR, Clayton DG, Cardon LR, Craddock N, Deloukas P, Duncanson A, Kwiatkowski DP, McCarthy MI, Ouwehand WH, Samani NJ, et al. Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature. 2007;447:661–678. doi: 10.1038/nature05911.
Chen X, Cho K, Singer B, Zhang H. The nuclear transcription factor PKNOX2 is a candidate gene for substance dependence in European-origin women. PLoS One. 2011;6:e16002. doi: 10.1371/journal.pone.0016002.
Drgon T, Montoya I, Johnson C, Liu Q-R, Walther D, Hamer D, Uhl GR. Genome-wide association for nicotine dependence and smoking cessation success in NIH research volunteers. Molecular Medicine. 2009;15:21. doi: 10.2119/molmed.2008.00096.
Edenberg HJ, Koller DL, Xuei X, Wetherill L, McClintick JN, Almasy L, Bierut LJ, Bucholz KK, Goate A, Aliev F, et al. Genome-wide association study of alcohol dependence implicates a region on chromosome 11. Alcoholism: Clinical and Experimental Research. 2010;34:840–852. doi: 10.1111/j.1530-0277.2010.01156.x.
Edwards AC, Aliev F, Bierut LJ, Bucholz KK, Edenberg H, Hesselbrock V, Kramer J, Kuperman S, Nurnberger JI, Jr, Schuckit MA, et al. Genome-wide association study of comorbid depressive syndrome and alcohol dependence. Psychiatric Genetics. 2012;22:31–41. doi: 10.1097/YPG.0b013e32834acd07.
Frank J, Cichon S, Treutlein J, Ridinger M, Mattheisen M, Hoffmann P, Herms S, Wodarz N, Soyka M, Zill P, et al. Genome-wide significant association between alcohol dependence and a variant in the ADH gene cluster. Addiction Biology. 2012;17:171–180. doi: 10.1111/j.1369-1600.2011.00395.x.
Fuse N. Genetic bases for glaucoma. The Tohoku Journal of Experimental Medicine. 2010;221:1–10. doi: 10.1620/tjem.221.1.
Gu X, Rosenbaum P. Comparison of multivariate matching methods: structures, distances, and algorithms. Journal of Computational and Graphical Statistics. 1993;2:405–420.
Hartel DM, Schoenbaum EE, Lo Y, Klein RS. Gender differences in illicit substance use among middle-aged drug users with or at risk for HIV infection. Clinical Infectious Diseases. 2006;43:525–531. doi: 10.1086/505978.
Heath AC, Whitfield JB, Martin NG, Pergadia ML, Goate AM, Lind PA, McEvoy BP, Schrage AJ, Grant JD, Chou Y-L, et al. A quantitative-trait genome-wide association study of alcoholism risk in the community: findings and implications. Biological Psychiatry. 2011;70:513–518. doi: 10.1016/j.biopsych.2011.02.028.
Jiang Y, Zhang H. Propensity score-based nonparametric test revealing genetic variants underlying bipolar disorder. Genetic Epidemiology. 2011;35:125–132. doi: 10.1002/gepi.20558.
Johnson C, Drgon T, Liu Q-R, Walther D, Edenberg H, Rice J, Foroud T, Uhl GR. Pooled association genome scanning for alcohol dependence using 104,268 SNPs: validation and use to identify alcoholism vulnerability loci in unrelated individuals from the collaborative study on the genetics of alcoholism. American Journal of Medical Genetics Part B: Neuropsychiatric Genetics. 2006;141:844–853. doi: 10.1002/ajmg.b.30346.
Kendall MG. A new measure of rank correlation. Biometrika. 1938;30:81–93.
Kendler KS, Kalsi G, Holmans PA, Sanders AR, Aggen SH, Dick DM, Aliev F, Shi J, Levinson DF, Gejman PV. Genomewide association analysis of symptoms of alcohol dependence in the molecular genetics of schizophrenia (MGS2) control sample. Alcoholism: Clinical and Experimental Research. 2011;35:963–975. doi: 10.1111/j.1530-0277.2010.01427.x.
Laird N, Horvath S, Xu X. Implementing a unified approach to family-based tests of association. Genetic Epidemiology. 2000;19:S36–S42. doi: 10.1002/1098-2272(2000)19:1+<::AID-GEPI6>3.0.CO;2-M.
Le-Niculescu H, McFarland M, Ogden C, Balaraman Y, Patel S, Tan J, Rodd Z, Paulus M, Geyer M, Edenberg H, et al. Phenomic, convergent functional genomic, and biomarker studies in a stress-reactive genetic animal model of bipolar disorder and co-morbid alcoholism. American Journal of Medical Genetics Part B: Neuropsychiatric Genetics. 2008;147:134–166. doi: 10.1002/ajmg.b.30707. [DOI] [PubMed] [Google Scholar]
Li C-Y, Mao X, Wei L. Genes and (common) pathways underlying drug addiction. PLoS Computational Biology. 2008;4 doi: 10.1371/journal.pcbi.0040002. [DOI] [PMC free article] [PubMed] [Google Scholar]
Lind PA, Macgregor S, Vink JM, Pergadia ML, Hansell NK, de Moor MH, Smit AB, Hottenga J-J, Richter MM, Heath AC, et al. A genomewide association study of nicotine and alcohol dependence in Australian and Dutch populations. Twin Research and Human Genetics. 2010;13 doi: 10.1375/twin.13.1.10. [DOI] [PMC free article] [PubMed] [Google Scholar]
Liu Q-R, Drgon T, Johnson C, Walther D, Hess J, Uhl GR. Addiction molecular genetics: 639,401 SNP whole genome association identifies many cell adhesion genes. American Journal of Medical Genetics Part B: Neuropsychiatric Genetics. 2006;141:918–925. doi: 10.1002/ajmg.b.30436. [DOI] [PubMed] [Google Scholar]
Liu Z, Guo X, Jiang Y, Zhang H. NCK2 is significantly associated with opiates addiction in african-origin men. The Scientific World Journal. 2013:2013. doi: 10.1155/2013/748979. [DOI] [PMC free article] [PubMed] [Google Scholar]
Lunceford J, Davidian M. Stratification and weighting via the propensity score in estimation of causal treatment effects: a comparative study. Statistics in Medicine. 2004;23:2937–2960. doi: 10.1002/sim.1903. [DOI] [PubMed] [Google Scholar]
Luo Z, Alvarado GF, Hatsukami DK, Johnson EO, Bierut LJ, Breslau N. Race differences in nicotine dependence in the collaborative genetic study of nicotine dependence (COGEND) Nicotine & Tobacco Research. 2008;10:1223–1230. doi: 10.1080/14622200802163266. [DOI] [PubMed] [Google Scholar]
National Institute on Drug Abuse Comobidity: Addiction and other mental illnesses. Research Report Series, U.S. Department of Health and Human Services. 2010 NIH Publication Number 10-5771. [Google Scholar]
Pierce D. The asymptotic effect of substituting estimators for parameters in certain types of statistics. Annals of Statistics. 1982;10:475–478. [Google Scholar]
Rabinowitz D. A transmission disequilibrium test for quantitative trait loci. Human Heredity. 1997;47:342–350. doi: 10.1159/000154433. [DOI] [PubMed] [Google Scholar]
Rabinowitz D, Laird N. A unified approach to adjusting association tests for population admixture with arbitrary pedigree structure and arbitrary missing marker information. Human Heredity. 2000;50:211–223. doi: 10.1159/000022918. [DOI] [PubMed] [Google Scholar]
Randles RH. On the asymptotic normality of statistics with estimated parameters. Annals of Statistics. 1982;10:462–474. [Google Scholar]
Reich T, Edenberg HJ, Williams JT, Van Eerdewegh P, Foroud T, Hesselbrock V, Schuckit MA, Bucholz K, Porjesz B, Li TK, Conneally PM, Nurnberger JIJ, Tischfield JA, Crowe RR, Cloninger CR, Wu W, Shears S, Carr K, Crose C, Willig C, Begleiter H. Genome-wide search for genes affecting the risk for alcohol dependence. American Journal of Medical Genetics. 1998;81:207–215. [PubMed] [Google Scholar]
Rice JP, Hartz SM, Agrawal A, Almasy L, Bennett S, Breslau N, Bucholz KK, Doheny KF, Edenberg HJ, Goate AM, et al. CHRNB3 is more strongly associated with Fagerström Test for Cigarette Dependence-based nicotine dependence than cigarettes per day: phenotype definition changes genome-wide association studies results. Addiction. 2012;107:2019–2028. doi: 10.1111/j.1360-0443.2012.03922.x. [DOI] [PMC free article] [PubMed] [Google Scholar]
Robins J, Hernán M, Brumback B. Marginal structural models and causal inference in epidemiology. Epidemiology. 2000;11:550–560. doi: 10.1097/00001648-200009000-00011. [DOI] [PubMed] [Google Scholar]
Robins J, Mark S, Newey W. Estimating exposure effects by modelling the expectation of exposure conditional on confounders. Biometrics. 1992;48:479–495. [PubMed] [Google Scholar]
Rosenbaum P. Model-based direct adjustment. Journal of the American Statistical Association. 1987;82:387–394. [Google Scholar]
Rosenbaum P, Rubin D. The central role of the propensity score in observational studies for causal effects. Biometrika. 1983;70:41–55. [Google Scholar]
Schifano ED, Li L, Christiani DC, Lin X. Genome-wide association analysis for multiple continuous secondary phenotypes. The American Journal of Human Genetics. 2013;92:744–759. doi: 10.1016/j.ajhg.2013.04.004. [DOI] [PMC free article] [PubMed] [Google Scholar]
Shao J. Mathematical Statistics. 2nd Springer-Verlag New York, Inc; New York: 2003. [Google Scholar]
Sullivan PF, Neale BM, van den Oord E, Miles MF, Neale MC, Bulik CM, Joyce PR, Straub RE, Kendler KS. Candidate genes for nicotine dependence via linkage, epistasis, and bioinformatics. American Journal of Medical Genetics Part B: Neuropsychiatric Genetics. 2004;126:23–36. doi: 10.1002/ajmg.b.20138. [DOI] [PubMed] [Google Scholar]
Treutlein J, Cichon S, Ridinger M, Wodarz N, Soyka M, Zill P, Maier W, Moessner R, Gaebel W, Dahmen N, et al. Genome-wide association study of alcohol dependence. Archives of General Psychiatry. 2009;66:773. doi: 10.1001/archgenpsychiatry.2009.83. [DOI] [PMC free article] [PubMed] [Google Scholar]
Uhl GR, Liu Q-R, Drgon T, Johnson C, Walther D, Rose JE. Molecular genetics of nicotine dependence and abstinence: whole genome association using 520,000 SNPs. BMC Genetics. 2007;8:10. doi: 10.1186/1471-2156-8-10. [DOI] [PMC free article] [PubMed] [Google Scholar]
Uhl GR, Liu Q-R, Drgon T, Johnson C, Walther D, Rose JE, David SP, Niaura R, Lerman C. Molecular genetics of successful smoking cessation: convergent genome-wide association study results. Archives of General Psychiatry. 2008;65:683. doi: 10.1001/archpsyc.65.6.683. [DOI] [PMC free article] [PubMed] [Google Scholar]
van der Vaart AW. Asymptotic Statistics. Cambridge University Press; New York: 1998. [Google Scholar]
Wang H-Y, Friedman E, Olmstead M, Burns L. Ultra-low-dose naloxone suppresses opioid tolerance, dependence and associated changes in Mu opioid receptor-G protein coupling and G βγ signaling. Neuroscience. 2005;135:247–261. doi: 10.1016/j.neuroscience.2005.06.003. [DOI] [PubMed] [Google Scholar]
Wang K-S, Liu X, Zhang Q, Pan Y, Aragam N, Zeng M. A meta-analysis of two genome-wide association studies identifies 3 new loci for alcohol dependence. Journal of Psychiatric Research. 2011;45:1419–1425. doi: 10.1016/j.jpsychires.2011.06.005. [DOI] [PubMed] [Google Scholar]
Wang K-S, Liu X, Zhang Q, Zeng M. ANAPC1 and SLCO3A1 are associated with nicotine dependence: Meta-analysis of genome-wide association studies. Drug and Alcohol Dependence. 2012;124:325–332. doi: 10.1016/j.drugalcdep.2012.02.003. [DOI] [PubMed] [Google Scholar]
Wang X, Ye Y, Zhang H. Family-based association tests for ordinal traits adjusting for covariates. Genetic Epidemiology. 2006;30:728–736. doi: 10.1002/gepi.20184. [DOI] [PubMed] [Google Scholar]
Wright JW, Harding JW. Contributions of matrix metalloproteinases to neural plasticity, habituation, associative learning and drug addiction. Neural Plasticity. 2009:2009. doi: 10.1155/2009/579382. [DOI] [PMC free article] [PubMed] [Google Scholar]
Yang B-Z, Han S, Kranzler HR, Farrer LA, Gelernter J. A genomewide linkage scan of cocaine dependence and major depressive episode in two populations. Neuropsychopharmacology. 2011;36:2422–2430. doi: 10.1038/npp.2011.122. [DOI] [PMC free article] [PubMed] [Google Scholar]
Zhang H, Liu C-T, Wang X. An association test for multiple traits based on the generalized Kendall's tau. Journal of the American Statistical Association. 2010;105:473–481. doi: 10.1198/jasa.2009.ap08387. [DOI] [PMC free article] [PubMed] [Google Scholar]
Zhang H, Wang X, Ye Y. Detection of genes for ordinal traits in nuclear families and a unified approach for association studies. Genetics. 2006;172:693–699. doi: 10.1534/genetics.105.049122. [DOI] [PMC free article] [PubMed] [Google Scholar]
Zhao H, Rebbeck T, Mitra N. A propensity score approach to correction for bias due to population stratification using genetic and non-genetic factors. Genetic Epidemiology. 2009;33:679–690. doi: 10.1002/gepi.20419. [DOI] [PMC free article] [PubMed] [Google Scholar]
Zhu W, Jiang Y, Zhang H. Nonparametric covariate-adjusted association tests based on the generalized Kendall's tau. Journal of the American Statistical Association. 2012;107:1–11. doi: 10.1080/01621459.2011.643707. [DOI] [PMC free article] [PubMed] [Google Scholar]
Zuo L, Zhang F, Zhang H, Zhang X-Y, Wang F, Li C-SR, Lu L, Hong J, Lu L, Krystal J, et al. Genome-wide search for replicable risk gene regions in alcohol and nicotine co-dependence. American Journal of Medical Genetics Part B: Neuropsychiatric Genetics. 2012a;159:437–444. doi: 10.1002/ajmg.b.32047. [DOI] [PMC free article] [PubMed] [Google Scholar]
Zuo L, Zhang X-Y, Wang F, Li C-SR, Lu L, Ye L, Zhang H, Krystal JH, Deng H-W, Luo X. Genome-wide significant association signals in IPO11-HTR1A region specific for alcohol and nicotine codependence. Alcoholism: Clinical and Experimental Research. 2012b doi: 10.1111/acer.12032. [DOI] [PMC free article] [PubMed] [Google Scholar]
