Statistical inference on qualitative differences in the magnitude of an effect

Aaron Hudson; Ali Shojaie

doi:10.1002/sim.10025

Author manuscript; available in PMC: 2025 Mar 30.

Published in final edited form as: Stat Med. 2024 Feb 2;43(7):1419–1440. doi: 10.1002/sim.10025


PMCID: PMC10947912 NIHMSID: NIHMS1959976 PMID: 38305667

Abstract

Qualitative interactions occur when a treatment effect or measure of association varies in sign by sub-population. Of particular interest in many biomedical settings are absence/presence qualitative interactions, which occur when an effect is present in one sub-population but absent in another. Absence/presence interactions arise in emerging applications in precision medicine, where the objective is to identify a set of predictive biomarkers that have prognostic value for clinical outcomes in some sub-population but not others. They also arise naturally in gene regulatory network inference, where the goal is to identify differences in networks corresponding to diseased and healthy individuals, or to different subtypes of disease; such differences lead to identification of network-based biomarkers for diseases. In this paper, we argue that while the absence/presence hypothesis is important, developing a statistical test for this hypothesis is an intractable problem. To overcome this challenge, we approximate the problem in a novel inference framework. In particular, we propose to make inferences about absence/presence interactions by quantifying the relative difference in effect size, reasoning that when the relative difference is large, an absence/presence interaction occurs. The proposed methodology is illustrated through a simulation study as well as an analysis of breast cancer data from the Cancer Genome Atlas.

Keywords: Qualitative interactions, Differential network connectivity, Precision medicine

1 |. INTRODUCTION

An objective of many biomedical studies is to identify and test for interactions, which arise when a measure of effect or association between variables differs by sub-population^1,2,3. Precision medicine and genetic network inference provide examples of areas in which interactions are of interest. For instance, researchers in precision medicine seek to understand how patients’ characteristics are associated with heterogeneity in response to treatment. In omics data analysis, it is of interest to determine how biological networks, which summarize the associations between genes/proteins/etc, depend on phenotype.

Interactions may lack clinical or scientific significance when differences in effect are small. In addition to detecting interactions, it is important to identify which are meaningful. For example, in precision medicine, the most important differences in treatment effect may be those in which some sub-populations of patients benefit from the treatment, while other sub-populations are harmed or unaffected. Additionally, one may want to identify differences among sub-populations in the set of biomarkers that have prognostic value for a health outcome — that is, to determine whether some biomarkers are predictive of the outcome in only a subset of the full population. Genetic network inference provides another example: When comparing sub-population level genetic networks, it may be of primary interest to identify pairs of genes that share an association in some sub-populations but share no association in others, or to identify pairs that have a positive association in one sub-population and a negative association in another. This is known as differential network analysis^4,5.

Such qualitative interactions are the focus of this paper. Qualitative interactions occur when a measure of association differs in sign by sub-population. We consider two types of qualitative interactions: positive/negative interactions — also known in the literature as cross-over interactions⁶ — and absence/presence interactions — sometimes referred to as pure interactions⁷. Positive/negative interactions occur when an effect is positive in one sub-population and negative in another, and absence/presence interactions occur when the effect is present in one population but absent in another.

Our objective is to formally test for qualitative interactions, given independent samples from each sub-population. Testing for positive/negative interactions is well-studied, with several procedures available for identifying positive/negative interactions in a fixed number of sub-populations^{6,8,9,10,11,12}. More recent work has also extended to identifying interactions when there is a large, or possibly infinite, number of sub-groups^{13,14,15,16,17}. In contrast to positive/negative interactions, testing for absence/presence interactions has received little attention. Naïve approaches, to be discussed in Section 2.3, require an untenable minimum signal strength condition — that if an effect is present in any sub-population, it is large enough to be detected with absolute certainty. No approaches exist, to the best of our knowledge, that avoid this assumption.

In this paper, we propose a novel framework for inference about absence/presence interactions in two sample settings. Our proposed methodology allows for well-calibrated hypothesis testing under mild assumptions. We also introduce a numerical summary that measures the strength of absence/presence interactions, while accounting for the uncertainty associated with parameter estimation. Additionally, we describe methods for simultaneous inference about absence/presence and positive/negative interactions. The methodology we introduce provides an effective and flexible inference tool in precision medicine and genetic network analysis, as we illustrate in simulations and an analysis of breast cancer data from The Cancer Genome Atlas (TCGA).

2 |. BACKGROUND

2.1 |. Notation

As we begin to formalize the problem, we first introduce some notation. We consider two sub-populations, labeled by $g \in \{1, 2\}$. Let $θ_{g} \in \mathbb{R}$ denote a measure of association in sub-population $g$. For simplicity, we assume that $θ_{g}$ is finite. When convenient, we write $θ = (θ_{1}, θ_{2})$. We can consider various measures of association, such as: correlation coefficients, indicating the strength of linear relationship between two variables of interest; log odds ratios, describing the relationship between predictors and a binary outcome; and log hazard ratios, describing the association between predictors and a time-to-event outcome.

We assume that given i.i.d. samples of size $n_{1}$ and $n_{2}$ from each sub-population, $\sqrt{n_{g}}$ -consistent and asymptotically normal estimators ${\hat{θ}}_{g}$ of $θ_{g}$ are available, i.e.,

$$\sqrt{n_{g}} ({\hat{θ}}_{g} - θ_{g}) \to_{d} N (0, σ_{g}^{2}),$$

with $σ_{g}^{2} > 0$ denoting the asymptotic variance. For expositional simplicity, we assume balanced sample sizes $n_{1} = n_{2} = n$ (key results are stated more generally in the Appendix). We also assume $σ_{g}^{2}$ is known, though we can instead use a consistent estimate, as is commonly done in practice.

We now formally state the null hypotheses of no positive/negative interactions and no absence/presence interaction, labeled $H_{0}^{P / N}$ and $H_{0}^{A / P}$ , respectively:

$$H_{0}^{P / N} : θ_{g} \geq 0 \text{ for } g \in \{1, 2\} \quad \text{or} \quad θ_{g} \leq 0 \text{ for } g \in \{1, 2\} \tag{1}$$

$$H_{0}^{A / P} : θ_{g} = 0 \text{ for } g \in \{1, 2\} \quad \text{or} \quad θ_{g} \neq 0 \text{ for } g \in \{1, 2\} \tag{2}$$

We let $Θ_{0}^{P / N}, Θ_{0}^{A / P}$ and $Θ_{1}^{P / N}, Θ_{1}^{A / P}$ denote the corresponding null and alternative regions of the parameter space, depicted in the left and center panels of Figure 1. (Recall that the null region is the set of parameters such that the null hypothesis holds, and the alternative region is the complement of the null region.) The positive/negative null region is the union of the non-negative and non-positive orthants, and the absence/presence null region is the union of all open orthants and the origin.

Figure 1: Null and alternative regions of parameter space for positive/negative (left; introduced in Section 2.1), absence/presence (center; introduced in Section 2.1), and relative difference hypotheses (right; introduced in Section 3.1).

2.2 |. Testing Composite Null Hypotheses

Our goal is to use the estimate $\hat{θ}$ to perform tests of $H_{0}^{P / N}$ and $H_{0}^{A / P}$ such that the size is controlled asymptotically under mild assumptions. Recall that for a null hypothesis $H_{0}$ with accompanying null region $Θ_{0}$ , the size of a test is defined as

$$\sup_{θ_{0} \in Θ_{0}} P (\text{“Reject the null hypothesis”} \mid θ = θ_{0}).$$

In words, the size is the largest possible type-I error rate that could be achieved under any probability distribution given that $θ$ belongs to the null region.

Here, we describe our general approach for controlling the size at a pre-specified level $α \in (0, 1)$ . We first define a test statistic $T$ , a map from observable data to a real-valued number, with larger values of $T$ corresponding to more evidence against the null hypothesis. We then calculate the test statistic on the observed data, which we denote by $t$ . We write $P (T > t ∣ θ = θ_{0})$ as the probability of observing a random test statistic at least as large as the observed value $t$ , assuming $θ = θ_{0}$ . We reject the null hypothesis if

$$ρ (t) \equiv \sup_{θ_{0} \in Θ_{0}} P (T > t \mid θ = θ_{0}) < α .$$

One can think of $P (T > t \mid θ = θ_{0})$ as the p-value under a specific null distribution $θ = θ_{0}$; $ρ (t)$ is then the largest of all such p-values. We reject $H_{0}$ when there is sufficient evidence to reject all hypotheses $θ = θ_{0}$ with $θ_{0} \in Θ_{0}$. We can view $ρ (t)$ as a generalization of the usual p-value for simple null hypotheses to tests with composite null hypotheses, and we will simply refer to $ρ (t)$ as the p-value. Tests of the above form are guaranteed to control the size¹⁸.

2.3 |. Existing Methodology

We now review existing approaches to test for qualitative interactions. We first discuss testing positive/negative interactions before moving to absence/presence interactions.

Gail and Simon⁶ developed the most widely used procedure to test for positive/negative interactions. Though Gail and Simon proposed a general $K$ -sample test, we focus on the two-sample problem in this paper. We note that various $K$ -sample tests for positive/negative interactions have been proposed^8,10,11, but these procedures are essentially equivalent to the Gail-Simon test in the two-sample setting.

Gail and Simon’s approach is to perform a likelihood ratio test based on the asymptotic sampling distribution of $\hat{θ}$ . The likelihood ratio test rejects $H_{0}^{P / N}$ for large values of

$$- \frac{\sup_{(a_{1}, a_{2}) \in Θ_{0}^{P / N}} \prod_{g \in \{1,2\}} ϕ \{\sqrt{n} σ_{g}^{- 1} ({\hat{θ}}_{g} - a_{g})\}}{\sup_{(b_{1}, b_{2}) \in \mathbb{R}^{2}} \prod_{g \in \{1,2\}} ϕ \{\sqrt{n} σ_{g}^{- 1} ({\hat{θ}}_{g} - b_{g})\}},$$

where $ϕ (\cdot)$ is the standard normal density. By performing algebraic manipulations, one can show that the likelihood ratio test equivalently rejects the null for large values of

$$T^{P / N} = \min_{(a_{1}, a_{2}) \in Θ_{0}^{P / N}} \sum_{g \in \{1, 2\}} n \{σ_{g}^{- 1} ({\hat{θ}}_{g} - a_{g})\}^{2} .$$

The test statistic $T^{P / N}$ can be interpreted as the shortest distance between $\hat{θ}$ and the null region, where the distance is inversely weighted by the asymptotic variances of the estimates, as illustrated in the left panel of Figure 2.

Figure 2: Geometric interpretation of likelihood ratio statistic for the positive/negative hypothesis (left; introduced in Section 2.3) and relative difference hypotheses (right; introduced in Section 3.2).

Gail and Simon show that the test statistic $T^{P / N}$ can be calculated as

$$T^{P / N} = \min_{g \in \{1,2\}} \{n ({\hat{θ}}_{g} / σ_{g})^{2}\} \, 1 (\hat{θ} \in Θ_{1}^{P / N}),$$

where $1 (\cdot)$ denotes the indicator function. Furthermore, one can verify that for all $t > 0$ , the p-value is easily calculated as

$$ρ^{P / N} (t) = \sup_{θ_{0} \in Θ_{0}^{P / N}} \lim_{n \to \infty} P (T^{P / N} > t \mid θ = θ_{0}) = \frac{1}{2} P (χ_{1}^{2} > t) .$$

The likelihood ratio test is quite intuitive and rejects the null when a positive estimate of association is observed in one population, a negative estimate is observed in another population, and both associations are statistically significant.
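As an illustration, the two-sample statistic and its asymptotic p-value can be sketched in a few lines of Python. This is a minimal sketch rather than the authors' implementation: the function name `gail_simon_two_sample` and its argument layout are assumptions made here, and returning a p-value of one when $t = 0$ is a convention adopted for the sketch, since the formula above is stated for $t > 0$. The chi-squared tail uses the identity $P (χ_{1}^{2} > t) = \text{erfc} (\sqrt{t / 2})$.

```python
import math


def gail_simon_two_sample(theta_hat, sigma, n):
    """Two-sample Gail-Simon statistic and asymptotic p-value (illustrative).

    theta_hat: (theta_hat_1, theta_hat_2), estimated associations
    sigma:     (sigma_1, sigma_2), asymptotic standard deviations
    n:         common sample size
    """
    # Squared standardized estimates n * (theta_hat_g / sigma_g)^2.
    z_sq = [n * (th / s) ** 2 for th, s in zip(theta_hat, sigma)]
    # Indicator that the estimate lies in the alternative region:
    # the two estimates have strictly opposite signs.
    opposite_signs = theta_hat[0] * theta_hat[1] < 0
    t = min(z_sq) if opposite_signs else 0.0
    if t <= 0:
        # Assumed convention: no evidence against the null when t = 0.
        return t, 1.0
    # (1/2) * P(chi^2_1 > t), with P(chi^2_1 > t) = erfc(sqrt(t / 2)).
    p_value = 0.5 * math.erfc(math.sqrt(t / 2.0))
    return t, p_value


# Example call with hypothetical inputs.
statistic, p_value = gail_simon_two_sample((0.3, -0.25), (1.0, 1.0), 100)
```

With $\hat{θ} = (0.3, -0.25)$, $σ_{1} = σ_{2} = 1$ and $n = 100$, the statistic is $\min \{9, 6.25\} = 6.25$, and the p-value is half of the corresponding chi-squared tail probability.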

Now, we discuss approaches to test for absence/presence interactions. While one might be tempted to perform a likelihood ratio test for absence/presence interactions, the likelihood ratio test fails in the sense that it can never reject the null. To see this, we first recognize that, similar to the positive/negative interaction test, the absence/presence likelihood ratio test would reject for large values of

$$T^{A / P} = \min_{(a_{1}, a_{2}) \in Θ_{0}^{A / P}} \sum_{g \in \{1, 2\}} n \{σ_{g}^{- 1} ({\hat{θ}}_{g} - a_{g})\}^{2} .$$

Again, the test statistic $T^{A / P}$ is the shortest distance between $\hat{θ}$ and the null region $Θ_{0}^{A / P}$ . However, because the alternative region $Θ_{1}^{A / P}$ has zero area, $\hat{θ}$ lies in the null region with probability one. Therefore, the test statistic is always 0, and the likelihood ratio test has no power.

One might alternatively attempt to test the absence/presence null by separately testing the null hypotheses $H_{0}^{g} : θ_{g} = 0$ for $g \in \{1, 2\}$ and rejecting the absence/presence null when $H_{0}^{g}$ is rejected for one $g$ and not rejected for the other. To control the size of a test of this form, which we refer to as the naïve test for the absence/presence null, we need to simultaneously control the type-I error under two scenarios: (1) there is an association in both sub-populations, and (2) there is no association in either population. When there is an association in both populations, a type-I error occurs when we incorrectly fail to reject one of $H_{0}^{g}$. When there is no association in either population, we make a type-I error when we incorrectly reject one of $H_{0}^{g}$. Thus, controlling the size of the test for $H_{0}^{A / P}$ using this approach requires simultaneous control of the type-I error rate and type-II error rate of tests for $H_{0}^{g}$. If the tests for $H_{0}^{g}$ are consistent (that is, the type-II error rates tend to zero with sufficiently large samples), this approach is asymptotically valid, as only the type-I error rates for tests of $H_{0}^{g}$ need to be controlled. However, even with moderately large samples, we will not be able to correctly reject false $H_{0}^{g}$ with absolute certainty unless the true association is strong and hence easy to detect. This would make the test of $H_{0}^{A / P}$ unreliable in the presence of weak signal.

Next, we argue that the naïve test controls the type-I error only if $θ_{1}, θ_{2} > o (n^{- 1 / 2})$ , which is similar to the assumption recently made for differential network connectivity analysis ¹⁹. To see this, we construct a more formal argument. For simplicity, suppose $σ_{1} = σ_{2} = 1$ . We consider tests of $H_{0}^{g}$ of the form

$$ψ_{g} = \begin{cases} \text{Reject}, & \text{if } \sqrt{n} | {\hat{θ}}_{g} | > a \\ \text{Accept}, & \text{if } \sqrt{n} | {\hat{θ}}_{g} | < a \end{cases},$$

where $a$ is a constant that would be selected to control the size. The probability of rejecting the absence/presence null is

$$P (\text{Reject } H_{0}^{A / P}) = P (ψ_{1} = \text{Reject}) P (ψ_{2} = \text{Accept}) + P (ψ_{1} = \text{Accept}) P (ψ_{2} = \text{Reject}) .$$

Suppose $θ_{1}$ and $θ_{2}$ are guaranteed to be greater than $o (n^{- 1 / 2})$ if they are both nonzero. Then, for $θ_{1}, θ_{2} \neq 0$, $\sqrt{n} | {\hat{θ}}_{g} | \to \infty$, and $P (\text{Reject } H_{0}^{A / P}) \to 0$. Therefore, to control the size, we are only required to select $a$ so that the type-I error is controlled when $θ_{1} = θ_{2} = 0$; this can be done by taking $a$ as the $(1 - α / 4)$ quantile of the standard normal distribution. However, if we allow $θ_{g} < o (n^{- 1 / 2})$, we can see a drastically inflated type-I error rate. For instance, for a small $ϵ > 0$, let $θ_{1} = n^{- 1 / 2 + ϵ}$, $θ_{2} = n^{- 1 / 2 - ϵ}$. This asymptotic regime allows us to approximate the type-I error rate of this test in settings where the sample size is large relative to the signal strength. Then $\sqrt{n} | {\hat{θ}}_{1} | \to \infty > a$ while $\sqrt{n} | {\hat{θ}}_{2} | \to 0 < a$, so $P (\text{Reject } H_{0}^{A / P}) \to 1$. Thus, when small signal is permitted, tests of this form will be asymptotically anti-conservative.
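The argument above can be checked numerically under the large-sample approximation $\sqrt{n} {\hat{θ}}_{g} \sim N (\sqrt{n} θ_{g}, 1)$. The sketch below is illustrative only; the function name `naive_rejection_probability` and the parameterization by the scaled signals $μ_{g} = \sqrt{n} θ_{g}$ are assumptions introduced here.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def naive_rejection_probability(mu1, mu2, alpha=0.05):
    """P(reject H0^{A/P}) for the naive test, assuming sigma_1 = sigma_2 = 1.

    mu_g = sqrt(n) * theta_g is the scaled signal; sqrt(n) * theta_hat_g is
    treated as exactly N(mu_g, 1), which is its large-sample approximation.
    """
    # Threshold a: the (1 - alpha / 4) standard normal quantile, as in the text.
    a = STD_NORMAL.inv_cdf(1 - alpha / 4)

    def p_reject(mu):
        # P(|N(mu, 1)| > a) = P(N(mu, 1) < -a) + P(N(mu, 1) > a)
        return STD_NORMAL.cdf(-a - mu) + 1 - STD_NORMAL.cdf(a - mu)

    p1, p2 = p_reject(mu1), p_reject(mu2)
    # Reject the absence/presence null when exactly one test rejects.
    return p1 * (1 - p2) + (1 - p1) * p2


# Hypothetical scenarios: no signal, two strong signals, one strong and one weak.
for mu in [(0.0, 0.0), (10.0, 10.0), (10.0, 0.0)]:
    print(mu, naive_rejection_probability(*mu))
```

Under this approximation, the rejection probability is about $α (1 - α / 2)$ when both scaled signals are zero and essentially zero when both are large, but it is roughly $1 - α / 2$ when one scaled signal is large and the other is near zero (the test for the weak signal still rejects with probability about $α / 2$). The last case lies in the null region whenever the weak signal is non-zero, which is the anti-conservative behavior described above.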

Both approaches discussed above for testing absence/presence interactions, namely the likelihood ratio test and the naïve test, fail for a similar reason: it is difficult to gather evidence supporting that a measure of association is exactly equal to zero. This is captured by the alternative region having zero area, causing the failure of the first approach. In the second approach, to obtain evidence supporting that an association is zero, we require that $H_{0}^{g}$ is only accepted when $θ_{g} = 0$; for this, we rely upon a minimum signal strength condition to guarantee that any non-zero association is detected.

3 |. PROPOSED METHODOLOGY

3.1 |. Refinement of Absence/Presence Hypothesis

To mitigate the challenges described in Section 2, we consider a refinement of the absence/presence null hypothesis. The key idea is that in practice, absence/presence interactions can be approximated by considering the settings where an association is at least moderately large in one population and negligible or near zero in the other; or when one association is substantially stronger than the other. This means that we can expand the alternative region to include neighborhoods of zero in a way that the absence/presence interpretation is preserved.

Recall that when there exists an absence/presence interaction, the ratio of the maximum of the absolute value of the $θ_{g}$ to the minimum is infinite. We cannot test that the ratio is infinite because we will never have evidence to support that the denominator is exactly zero. However, we can test that the ratio is large because we may have evidence to support that the denominator is very small. Motivated by this intuition, we propose to test whether the relative difference between sub-population measures of association is greater than a large pre-specified constant $κ > 1$. Formally, let $θ_{max} = \max_{g} |θ_{g}|$ and $θ_{min} = \min_{g} |θ_{g}|$. We define the new relative difference null hypothesis $H_{0}^{κ}$ as

$$H_{0}^{κ} : θ_{max} / θ_{min} \leq κ \quad \text{or} \quad θ_{max} = θ_{min} = 0 .$$

Equivalently,

$$H_{0}^{κ} : θ_{max} - κ θ_{min} \leq 0 . \tag{3}$$

The null region $Θ_{0}^{κ}$, illustrated in the right panel of Figure 1, is the union of four linear subspaces, each residing in a separate orthant of $\mathbb{R}^{2}$; the boundary of each subspace is the union of the spans of vectors with absolute direction $(κ, 1)^{'}$ and $(1, κ)^{'}$.
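Form (3) of the null hypothesis is convenient computationally because it avoids dividing by $θ_{min}$ and covers the origin automatically. A minimal sketch of a membership check for $Θ_{0}^{κ}$ follows; the function name `in_relative_difference_null` is hypothetical and the example points are arbitrary.

```python
def in_relative_difference_null(theta1, theta2, kappa):
    """Return True when (theta1, theta2) satisfies theta_max - kappa * theta_min <= 0."""
    theta_max = max(abs(theta1), abs(theta2))
    theta_min = min(abs(theta1), abs(theta2))
    return theta_max - kappa * theta_min <= 0


# Arbitrary example points with kappa = 2.
print(in_relative_difference_null(0.0, 0.0, 2.0))   # origin: in the null region
print(in_relative_difference_null(1.0, 0.0, 2.0))   # exact absence/presence: alternative
print(in_relative_difference_null(1.0, 0.6, 2.0))   # ratio below kappa: null
print(in_relative_difference_null(-1.0, 0.4, 2.0))  # ratio above kappa: alternative
```

Only absolute values enter the check, so the same rule applies in all four orthants.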

The relative difference null region can be viewed as a relaxation of the absence/presence null. For a large choice of $κ$, the original and refined alternative hypotheses have a similar interpretation: the greater measure of association is substantially larger than the lesser. However, the hypotheses may not coincide for any value of $κ$ as the relative difference hypothesis can hold if $θ_{max}$ is non-zero while $θ_{min}$ is non-zero but small. It is nonetheless appealing that $H_{0}^{κ}$ has a reasonable interpretation for any choice of $κ$; that is, the multiplicative difference in strength of association is no larger than $κ$.

To motivate defining the refined null hypothesis in terms of relative differences rather than absolute differences, we argue that testing for relative differences has at least the following benefits. First, relative differences are unitless, so the relative difference null is compatible with unitless measures of association such as the Pearson correlation coefficient, which are often preferred in the analysis of biological data. Second, it is possible that in some contexts, $κ$ could be selected without prior knowledge on the range of strength of association. For instance, in studies of differential gene expression in two groups, it is common for investigators to focus on identifying genes for which the relative difference in the mean expression level exceeds a user-specified threshold. The threshold is chosen to be scientifically meaningful, based on the investigator’s own discretion^20,21. When the goal is identifying pairs of genes for which the correlation between gene expression levels differs by group, an investigator could similarly select a scientifically meaningful threshold for the relative difference in correlation at their own discretion.

Of course, the relative difference null hypothesis depends on the choice of $κ$ . For small values of $κ$ , the relative difference null may be too dissimilar from the absence/presence null to retain its interpretation; for large $κ$ , the alternative region becomes very small, and the hypothesis may be overly conservative. In settings where there is a scientifically meaningful threshold for the relative difference in effect size, $κ$ can be specified a priori based on its interpretation, often at the investigator’s discretion. In the following subsections, we first construct a test of the relative difference null hypothesis for a pre-specified $κ$ . We then describe an approach to identify the set of $κ$ values for which the test rejects the null hypothesis. This method can be useful when there is no clear scientifically meaningful threshold for the relative difference in effects.

3.2 |. Likelihood Ratio Test for Relative Difference Hypothesis

We next develop a testing procedure for the new relative difference null hypothesis. The relative difference alternative region, unlike the absence/presence alternative region, has a non-zero area, so a likelihood ratio test will not fail in the same manner as the likelihood ratio test for the absence/presence hypothesis. Similar to the previously discussed examples, the likelihood ratio test statistic is

$$T^{κ} = \min_{(a_{1}, a_{2}) \in Θ_{0}^{κ}} \sum_{g \in \{1,2\}} n \{σ_{g}^{- 1} ({\hat{θ}}_{g} - a_{g})\}^{2},$$

and can be interpreted as the shortest (weighted) distance between $\hat{θ}$ and the null region $Θ_{0}^{κ}$, as depicted in the right panel of Figure 2. Clearly, the test statistic is zero whenever $\hat{θ}$ lies in the null region. Otherwise, $T^{κ}$ is the shortest distance between $\hat{θ}$ and the closest of the four linear subspaces that define $Θ_{0}^{κ}$. The test statistic can be calculated as the distance between ${({\hat{θ}}_{max}, {\hat{θ}}_{min})}^{'}$ and its projection onto the span of the vector $(κ, 1)^{'}$.

The likelihood ratio test statistic is straightforward to calculate. Let ${\hat{θ}}_{max} = \max_{g} | {\hat{θ}}_{g} |$ and ${\hat{θ}}_{min} = \min_{g} | {\hat{θ}}_{g} |$ be the strongest and weakest estimated absolute association, respectively. In the following lemma, we state that $T^{κ}$ is equal to the difference between ${\hat{θ}}_{max}$ and $κ {\hat{θ}}_{min}$ divided by a normalizing constant. Thus, the test statistic can be viewed as a plug-in of $\hat{θ}$ into (3) with an additional normalizing constant. This test statistic closely resembles the test statistic proposed by Fieller²² to conduct inference about ratios of means.

Lemma 1.

The likelihood ratio test statistic $T^{κ}$ can be written as

$$T^{κ} = \frac{{\hat{θ}}_{max} - κ {\hat{θ}}_{min}}{\sqrt{{\hat{τ}}_{max} + κ^{2} {\hat{τ}}_{min}}},$$

where ${\hat{τ}}_{max} = n^{- 1} σ_{1}^{2} 1 (| {\hat{θ}}_{1} | = {\hat{θ}}_{max}) + n^{- 1} σ_{2}^{2} 1 (| {\hat{θ}}_{2} | = {\hat{θ}}_{max})$, and ${\hat{τ}}_{min} = n^{- 1} σ_{1}^{2} 1 (| {\hat{θ}}_{1} | = {\hat{θ}}_{min}) + n^{- 1} σ_{2}^{2} 1 (| {\hat{θ}}_{2} | = {\hat{θ}}_{min})$.
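A direct transcription of Lemma 1 into Python might look as follows. This is a sketch under the balanced-sample, known-variance setting of Section 2.1; the function name `relative_difference_statistic` is an assumption, and ties between $| {\hat{θ}}_{1} |$ and $| {\hat{θ}}_{2} |$ (a probability-zero event for continuous estimators) are broken in favor of group 1 rather than summing the two indicator terms.

```python
import math


def relative_difference_statistic(theta_hat, sigma, n, kappa):
    """Signed statistic T^kappa of Lemma 1 (illustrative transcription).

    theta_hat: (theta_hat_1, theta_hat_2), estimated associations
    sigma:     (sigma_1, sigma_2), asymptotic standard deviations
    n:         common sample size
    kappa:     pre-specified relative difference threshold (> 1)
    """
    abs_est = [abs(th) for th in theta_hat]
    # Index of the strongest / weakest estimated absolute association.
    # Assumption: ties go to group 1.
    i_max = 0 if abs_est[0] >= abs_est[1] else 1
    i_min = 1 - i_max
    # Variances of the corresponding estimators, sigma_g^2 / n.
    tau_max = sigma[i_max] ** 2 / n
    tau_min = sigma[i_min] ** 2 / n
    numerator = abs_est[i_max] - kappa * abs_est[i_min]
    return numerator / math.sqrt(tau_max + kappa ** 2 * tau_min)


# Example call with hypothetical inputs.
t_obs = relative_difference_statistic((0.5, 0.1), (1.0, 1.0), 100, 2.0)
```

With $\hat{θ} = (0.5, 0.1)$, $σ_{1} = σ_{2} = 1$, $n = 100$ and $κ = 2$, this gives $(0.5 - 0.2) / \sqrt{0.01 + 4 \times 0.01} \approx 1.34$. The statistic is negative whenever $\hat{θ}$ falls strictly inside the estimated null region.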

We now discuss how to obtain a p-value for the relative difference hypothesis. First, we obtain an observed test statistic $t$, a realization of $T^{κ}$ calculated from the data. Following the approach described in Section 2.2, we define the p-value as $ρ^{κ} (t)$, the largest of all asymptotic tail probabilities $\lim_{n \to \infty} P (T^{κ} > t \mid θ = θ_{0})$ such that $θ_{0}$ belongs to the null region. To determine the maximum tail probability, we characterize the limiting distribution of $T^{κ}$ assuming $θ = θ_{0}$ for all $θ_{0}$ in the null region.

Though the null region contains an infinite number of values, $T^{κ}$ can only attain one of three limiting distributions corresponding to the following three cases:

1. The true association is in the interior of the null region, i.e., $θ_{max} - κ θ_{min} < 0$.
2. The true association is on the boundary of the null region, but both associations are non-zero, i.e., $θ_{max} = κ θ_{min} > 0$.
3. The true association is zero in both sub-populations, i.e., $θ_{1} = θ_{2} = 0$.

In Proposition 1, we describe the asymptotic behavior of $T^{κ}$ for cases 1 and 2 above. We provide here some intuition for the result and reserve a formal argument for the Appendix. In case 1, it is easy to argue that because $\hat{θ}$ is consistent, $T^{κ}$ is a negative number with probability tending to one, and therefore never provides evidence against the null. In case 2, $T^{κ}$ asymptotically follows a standard normal distribution. To see this, we note that because both associations are non-zero, consistency and asymptotic normality of $\hat{θ}$ imply that, for large $n$, the sign of the estimators and the ranking of their magnitudes are deterministic. That the signs are asymptotically deterministic implies that $| \hat{θ} |$ is asymptotically normal (speaking loosely, $| {\hat{θ}}_{g} | \to \text{sign} (θ_{g}) {\hat{θ}}_{g}$); that ranking is asymptotically deterministic implies that ${\hat{θ}}_{max}$ and ${\hat{θ}}_{min}$ are asymptotically independent. Therefore, taking the difference between ${\hat{θ}}_{max}$ and $κ {\hat{θ}}_{min}$, suitably standardized, is asymptotically equivalent to taking a difference between two independent normal random variables with equal means. Dividing by the asymptotic standard deviation of this difference gives the claimed result.

Proposition 1.

In the interior of the null region, i.e., when $θ_{max} - κ θ_{min} < 0$, $T^{κ}$ converges in distribution to $- \infty$. At all nonzero boundary points of the null region, i.e., when $θ_{max} = κ θ_{min} > 0$, $T^{κ}$ converges in distribution to a standard normal random variable.

In case 3, when both associations are zero, the asymptotic distribution of the test statistic is more complicated. More specifically, $θ_{g} = 0$ implies that, asymptotically, $| {\hat{θ}}_{g} |$ follows a half-normal distribution instead of a normal distribution. Moreover, the ranking of $| {\hat{θ}}_{1} |$ and $| {\hat{θ}}_{2} |$ remains random in the limit. This gives rise to a non-standard limiting distribution. In particular, unlike cases 1 and 2, the limiting distribution depends on the asymptotic variances of the sub-population estimates (and also the ratio of the sample sizes in the unbalanced case). We are nonetheless able to derive an analytic expression for the distribution function, stated in Proposition 2. We reserve this statement for the Appendix, as the expression is cumbersome.

By Propositions 1 and 2, we can calculate the p-value as the maximum of the tail probabilities in cases 2 and 3. That is,

$$ρ^{κ} (t) = \max \{1 - Φ (t), 1 - F^{κ} (t)\}, \tag{4}$$

where $Φ (\cdot)$ denotes the standard normal distribution function, and $F^{κ} (t) \equiv \lim_{n \to \infty} P (T^{κ} < t \mid θ = 0)$ is the limiting distribution function for the test statistic when $θ = 0$. Though the limiting distribution under $θ = 0$ is non-standard, tail probabilities can be calculated easily, as we show in the Appendix (Remark 1). The p-value is therefore simple to calculate.
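Since the analytic form of $F^{κ}$ is deferred to the Appendix, one way to approximate the p-value (4) without it is to estimate the tail under $θ = 0$ by simulation. The sketch below does this; it is a Monte Carlo substitute for, not a reproduction of, the analytic calculation in Remark 1. The function names, the number of draws, and the seed are assumptions chosen for illustration. Under $θ = 0$ the factor $\sqrt{n}$ cancels from the numerator and denominator of $T^{κ}$, so the sample size does not enter the simulation.

```python
import math
import random
from statistics import NormalDist


def t_kappa(x1, x2, s1, s2, kappa):
    """Lemma 1 statistic for estimates x1, x2 with standard errors s1, s2."""
    if abs(x1) >= abs(x2):
        big, small, s_big, s_small = abs(x1), abs(x2), s1, s2
    else:
        big, small, s_big, s_small = abs(x2), abs(x1), s2, s1
    return (big - kappa * small) / math.sqrt(s_big ** 2 + kappa ** 2 * s_small ** 2)


def relative_difference_p_value(t, sigma, kappa, n_draws=100_000, seed=0):
    """Approximate rho^kappa(t) = max{1 - Phi(t), 1 - F^kappa(t)}.

    The tail under theta = 0 is estimated by simulation (an assumption of
    this sketch), not by the analytic expression in the paper's Appendix.
    """
    rng = random.Random(seed)
    exceed = 0
    for _ in range(n_draws):
        # Under theta = 0, sqrt(n) * theta_hat_g is approximately N(0, sigma_g^2).
        x1 = rng.gauss(0.0, sigma[0])
        x2 = rng.gauss(0.0, sigma[1])
        if t_kappa(x1, x2, sigma[0], sigma[1], kappa) > t:
            exceed += 1
    tail_zero = exceed / n_draws                    # estimate of 1 - F^kappa(t)
    tail_boundary = 1.0 - NormalDist().cdf(t)       # 1 - Phi(t), boundary case
    return max(tail_boundary, tail_zero)


# Example call with hypothetical inputs.
p_value = relative_difference_p_value(2.0, (1.0, 1.0), 2.0, n_draws=20_000)
```

Because the simulated tail carries Monte Carlo error, the number of draws should be large relative to $1 / α$ when the estimate is close to the rejection threshold.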

We have derived an analytic approximation for the power of the likelihood ratio test. In Proposition 2, we more generally characterize the limiting distribution of the test statistic under hypotheses of the form $(θ_{1}, θ_{2}) = n^{- 1 / 2} (c_{1}, c_{2})$. The local asymptotic power of the proposed test at the level $α$ is available by considering $β_{α}^{κ} (c_{1}, c_{2}) \equiv \lim_{n \to \infty} P (T^{κ} > t_{1 - α}^{*} \mid θ = n^{- 1 / 2} (c_{1}, c_{2}))$, where we define $t_{1 - α}^{*}$ as the maximum of the $1 - α$ quantiles of the limiting distribution of $T^{κ}$ under cases 2 and 3 described above. An analytic finite-sample approximation of the power can be calculated as $β_{α}^{κ} (n^{1 / 2} θ_{1}, n^{1 / 2} θ_{2})$.

A contour plot of the local asymptotic power is given in Figure 3 for both equal and unequal variances of the estimators. We observe that the likelihood ratio test has low power when the strongest effect $\max \{|c_{1}|, |c_{2}|\}$ is small, and power improves considerably when the strongest effect grows. Additionally, we find that in the presence of unequal variance, the test has greater power when the weakest effect is estimated with higher precision than the strongest effect.

Figure 3: Contour plots of the local asymptotic power of the relative difference likelihood ratio test with $κ = 2$ and $α = 0.05$. Bold lines represent the boundary of the null region. Settings of equal and unequal asymptotic variance of estimators are shown.

3.3 |. Quantifying Relative Difference in Effect by Inverting Likelihood Ratio Test

In some applications there may not be a clear scientifically meaningful threshold for the relative difference in effect size. In such settings, investigators may be disinclined to test the relative difference null for a single fixed choice of $κ$ , as the choice can be somewhat arbitrary. It may be preferable to instead provide a more informative summary describing the relative difference in effect size.

We propose to summarize the relative difference in effect size by reporting the set of $κ$ for which we would reject the relative difference null. We define

$$κ_{max}^{α} \equiv \sup \{κ : κ > 1, ρ^{κ} (t) < α\}$$

as the largest $κ > 1$ such that the likelihood ratio test rejects the null hypothesis at the $α$ level. When the likelihood ratio test fails to reject for all $κ > 1$, we will use the convention $κ_{max}^{α} = 1$. Since $T^{κ}$ is negative for $κ > {\hat{θ}}_{max} / {\hat{θ}}_{min}$ and since we only reject the relative difference null for large positive values of $T^{κ}$, $κ_{max}^{α}$ is bounded above by ${\hat{θ}}_{max} / {\hat{θ}}_{min}$.

Because the likelihood ratio test controls the type-I error at the level $α$, if $θ_{max} = θ_{min}$, then with probability at least $1 - α$, $κ_{max}^{α} = 1$. Moreover, if $θ_{max} / θ_{min} > 1$, $κ_{max}^{α}$ is bounded above by $θ_{max} / θ_{min}$ with probability at least $1 - α$. Thus, $κ_{max}^{α}$ provides a probabilistic lower bound for the relative difference in effect size. That is,

$$\lim_{n \to \infty} P (κ_{max}^{α} \leq θ_{max} / θ_{min}) \geq 1 - α .$$

Moreover, for any $κ < θ_{max} / θ_{min}$, the relative difference test will reject the null hypothesis that $θ_{max} \leq κ θ_{min}$ with probability tending to one. Therefore, in the limit of large $n$, $κ_{max}^{α}$ should approach but not exceed the true relative difference in effect size. We discuss the calculation of $κ_{max}^{α}$ in the Appendix.
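The paper's own calculation of $κ_{max}^{α}$ is given in the Appendix. As a rough alternative, a grid search over $κ$ between 1 and the estimated ratio ${\hat{θ}}_{max} / {\hat{θ}}_{min}$ can approximate it. The sketch below rests on several assumptions that are not part of the paper: the function names, the grid resolution, the cap `kappa_cap` on the search range, and a Monte Carlo estimate of the tail under $θ = 0$ in place of the analytic expression. The same simulated draws are reused for every $κ$ so that the estimated p-values vary smoothly along the grid.

```python
import math
import random
from statistics import NormalDist


def lemma1_statistic(x1, x2, se1, se2, kappa):
    """T^kappa from Lemma 1 for estimates x1, x2 with standard errors se1, se2."""
    if abs(x1) >= abs(x2):
        big, small, se_big, se_small = abs(x1), abs(x2), se1, se2
    else:
        big, small, se_big, se_small = abs(x2), abs(x1), se2, se1
    return (big - kappa * small) / math.sqrt(se_big ** 2 + kappa ** 2 * se_small ** 2)


def kappa_max(theta_hat, sigma, n, alpha=0.05, grid_size=200,
              n_draws=5000, kappa_cap=100.0, seed=0):
    """Grid approximation to sup{kappa > 1 : rho^kappa(t) < alpha}."""
    se1, se2 = sigma[0] / math.sqrt(n), sigma[1] / math.sqrt(n)
    weakest, strongest = sorted(abs(th) for th in theta_hat)
    # T^kappa <= 0 once kappa reaches the estimated ratio, so no rejection is
    # possible beyond it; kappa_cap (an assumption) keeps the grid finite.
    upper = kappa_cap if weakest == 0 else min(strongest / weakest, kappa_cap)
    if strongest == 0 or upper <= 1.0:
        return 1.0
    # Draws from the theta = 0 null, reused for every kappa on the grid.
    rng = random.Random(seed)
    null_draws = [(rng.gauss(0.0, se1), rng.gauss(0.0, se2)) for _ in range(n_draws)]
    std_normal = NormalDist()
    best = 1.0  # convention from the text when no kappa > 1 is rejected
    for k in range(1, grid_size + 1):
        kappa = 1.0 + (upper - 1.0) * k / grid_size
        t = lemma1_statistic(theta_hat[0], theta_hat[1], se1, se2, kappa)
        tail_boundary = 1.0 - std_normal.cdf(t)
        tail_zero = sum(
            lemma1_statistic(x1, x2, se1, se2, kappa) > t for x1, x2 in null_draws
        ) / n_draws
        if max(tail_boundary, tail_zero) < alpha:
            best = kappa  # largest rejected grid value seen so far
    return best


# Example call with hypothetical inputs.
k_hat = kappa_max((0.5, 0.05), (1.0, 1.0), 400, grid_size=50, n_draws=2000)
```

The returned value is accurate only up to the grid spacing and the Monte Carlo error, so it should be read as an approximation to $κ_{max}^{α}$ rather than an exact value.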

Larger values of $κ_{max}^{α}$ indicate that the relative difference in effect $θ_{max} / θ_{min}$ is large and provide greater evidence to support the occurrence of an absence/presence interaction. However, we emphasize that no value of $κ_{max}^{α}$ will be large enough to guarantee that an absence/presence interaction has truly occurred. While we argue that $κ_{max}^{α}$ is an informative and intuitive summary, we caution investigators against misinterpreting it.

3.4 |. Simultaneous Test for Qualitative Interactions

When it is of interest to identify both absence/presence and positive/negative qualitative interactions, it may be desirable to test for both simultaneously. In this section, we construct an omnibus test that achieves control of the size asymptotically.

We define the omnibus qualitative interaction null hypothesis as

H_{0}^{P / N, κ} : \text{Both } H_{0}^{P / N} \text{ and } H_{0}^{κ} \text{ hold.}

The null region $Θ_{0}^{P / N, κ}$ is the intersection of the positive/negative and relative difference null regions, as depicted in Figure 4. We observe that as $κ \to \infty$, the omnibus null and alternative regions tend to the positive/negative null and alternative regions.

(Left) Null and alternative region for omnibus qualitative interaction hypothesis. (Right) Geometric interpretation of likelihood ratio statistic for the omnibus qualitative interaction hypothesis.

To construct the likelihood ratio test, we proceed using similar arguments to those presented in Section 3.2. The likelihood ratio statistic $T^{P / N, κ}$ is the distance between the estimate $\hat{θ}$ and its projection onto the null region $Θ_{0}^{P / N, κ}$ , inversely weighted by the asymptotic variance of $\hat{θ}$ . A simple expression for $T^{P / N, κ}$ is given in Lemma 2.

Lemma 2.

The likelihood ratio statistic $T^{P / N, κ}$ can be written as

T^{P / N, κ} = \min \left\{\frac{{({\hat{θ}}_{1} - κ {\hat{θ}}_{2})}^{2}}{n^{- 1} (σ_{1}^{2} + κ^{2} σ_{2}^{2})}, \frac{{(κ {\hat{θ}}_{1} - {\hat{θ}}_{2})}^{2}}{n^{- 1} (κ^{2} σ_{1}^{2} + σ_{2}^{2})}\right\} \mathbf{1} (\hat{θ} \in Θ_{1}^{P / N, κ}) .

Unsurprisingly, the likelihood ratio statistic for the omnibus test approaches the Gail-Simon likelihood ratio statistic for positive/negative interactions in the limit of large $κ$. The tests will, therefore, be nearly identical for sufficiently large $κ$.
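The expression in Lemma 2 can be evaluated directly. The sketch below is a minimal illustration under stated assumptions: the null region is taken to be the set where the two estimates share a sign and the larger absolute estimate is at most $κ$ times the smaller one, which is our reading of the intersection of the positive/negative and relative difference null regions, and the asymptotic standard deviations are treated as known. The function name `omnibus_statistic` is hypothetical.

```python
def omnibus_statistic(theta1, theta2, sigma1, sigma2, n, kappa):
    """Evaluate the Lemma 2 expression at estimates (theta1, theta2).

    Assumption: the omnibus null region is {same sign} intersected with
    {max(|theta1|, |theta2|) <= kappa * min(|theta1|, |theta2|)}.
    """
    same_sign = theta1 * theta2 >= 0
    big = max(abs(theta1), abs(theta2))
    small = min(abs(theta1), abs(theta2))
    if same_sign and big <= kappa * small:
        return 0.0  # estimate lies in the null region, indicator is zero
    # Weighted squared distances to the two boundary lines of the null region.
    term1 = (theta1 - kappa * theta2) ** 2 / ((sigma1**2 + kappa**2 * sigma2**2) / n)
    term2 = (kappa * theta1 - theta2) ** 2 / ((kappa**2 * sigma1**2 + sigma2**2) / n)
    return min(term1, term2)


t_null = omnibus_statistic(1.0, 0.8, 1.0, 1.0, 100, kappa=2)      # ratio 1.25 <= 2
t_alt = omnibus_statistic(1.0, 0.1, 1.0, 1.0, 100, kappa=2)       # ratio 10 > 2
t_large = omnibus_statistic(1.0, -0.5, 1.0, 1.0, 100, kappa=1e6)  # opposite signs
```

With opposite-signed estimates and a very large $κ$, the value in `t_large` is close to $\min \{n {\hat{θ}}_{1}^{2} / σ_{1}^{2}, n {\hat{θ}}_{2}^{2} / σ_{2}^{2}\} = 25$, in line with the large-$κ$ limit described above.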

To characterize the asymptotic behavior of the omnibus test statistic at each location under the null, we use similar arguments to those in the previous section. If $θ$ belongs to the interior of the null region, $T^{P / N, κ}$ converges in probability to zero. If $θ$ belongs to the boundary of the null region and is non-zero, $T^{P / N, κ}$ converges weakly to a uniform mixture of zero and the chi-squared distribution with one degree of freedom. If $θ$ is zero, the limiting distribution of $T^{P / N, κ}$ is non-standard, though it can be characterized nonetheless. Formal statements of asymptotic properties of $T^{P / N, κ}$ are given in Proposition 3 and Remark 2 (the statement of Remark 2 is also cumbersome, and is reserved for the Appendix).

We have derived an analytic approximation for the power of the likelihood ratio test. In Proposition 4 in the Appendix, we more generally characterize the limiting distribution of the test statistic under hypotheses of the form $(θ_{1}, θ_{2}) = n^{- 1 / 2} (c_{1}, c_{2})$ , and in Remark 3, also in the Appendix, we describe how this result can be applied to calculate the local asymptotic power of the proposed test.

Proposition 3.

If $θ$ belongs to the interior of the null region $Θ_{0}^{P / N, κ}$ , $T^{P / N, κ}$ converges in distribution to zero. If $θ$ is on the boundary of the null region, but $θ \neq 0$ , $P (T^{P / N, κ} > t) \to \frac{1}{2} P (χ_{1}^{2} > t)$ as $n \to \infty$ .

Calculating the p-value for the omnibus test is no more difficult than calculating the p-value for the absence/presence test. Defining $F^{P / N, κ} (t) \equiv \lim_{n \to \infty} P (T^{P / N, κ} < t \mid θ = 0)$, the p-value can be calculated as

ρ^{P / N, κ} (t) = \max \{P (χ_{1}^{2} > t), 1 - F^{P / N, κ} (t)\} , \qquad (5)

where $t$ is the value of the test statistic $T^{P / N, κ}$ calculated on the observed data.
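One way to approximate the p-value numerically is to replace the analytic characterization of $F^{P / N, κ}$ with a Monte Carlo estimate. The sketch below is not the Appendix calculation. It assumes that the two sub-population estimators are independent, that $σ_{1}$ and $σ_{2}$ are known, and that the null region has the form described in the docstring. It then plugs the simulated tail probability into the p-value formula exactly as displayed above. All function names are hypothetical.

```python
import math
import random


def omnibus_statistic(theta1, theta2, sigma1, sigma2, n, kappa):
    """Lemma 2 expression; null region assumed to be same sign and ratio <= kappa."""
    big = max(abs(theta1), abs(theta2))
    small = min(abs(theta1), abs(theta2))
    if theta1 * theta2 >= 0 and big <= kappa * small:
        return 0.0
    term1 = (theta1 - kappa * theta2) ** 2 / ((sigma1**2 + kappa**2 * sigma2**2) / n)
    term2 = (kappa * theta1 - theta2) ** 2 / ((kappa**2 * sigma1**2 + sigma2**2) / n)
    return min(term1, term2)


def omnibus_p_value(t, sigma1, sigma2, kappa, n_draws=20_000, seed=0):
    """Monte Carlo version of max{P(chi2_1 > t), 1 - F(t)} for t >= 0."""
    rng = random.Random(seed)
    exceed = 0
    for _ in range(n_draws):
        # At theta = 0 the statistic does not depend on n, so we draw
        # sqrt(n) * theta_hat_g ~ N(0, sigma_g^2) and evaluate with n = 1.
        z1 = rng.gauss(0.0, sigma1)
        z2 = rng.gauss(0.0, sigma2)
        if omnibus_statistic(z1, z2, sigma1, sigma2, 1, kappa) >= t:
            exceed += 1
    one_minus_f = exceed / n_draws           # estimates P(T >= t | theta = 0)
    chi2_tail = math.erfc(math.sqrt(t / 2))  # P(chi2_1 > t)
    return max(chi2_tail, one_minus_f)


p_small_t = omnibus_p_value(0.0, 1.0, 1.0, kappa=2)
p_large_t = omnibus_p_value(50.0, 1.0, 1.0, kappa=2)
```

The Monte Carlo error shrinks at the usual rate in `n_draws`, so very small p-values would need many more draws or the analytic form.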

4 |. SIMULATION STUDY

In a Monte Carlo simulation study, we examine how type-I error rate and statistical power of the likelihood ratio tests for the relative difference in effect are affected by signal strength, sample size, and choice of $κ$ . Additionally, we examine how $κ_{m a x}^{α}$ depends on the true sub-population effects and the sample size.

We generate random observations $(Y_{g}, X_{g})$ in sub-population $g$ under the linear model:

Y_{g} = θ_{g} X_{g} + ϵ; \quad X_{g} \sim N (0, 1); \quad ϵ \sim N (0, 1) .

Here, $X_{g}$ is the predictor of interest, $Y_{g}$ is the response, and $ϵ$ is white noise. The measure of association in which we are interested is the regression coefficient $θ_{g}$. We fix $θ_{1} = 1$ and consider $θ_{2} \in \{- 1, - .95, - .9, \dots, .95, 1\}$. A total of 1000 synthetic data sets are randomly generated for each $θ_{2}$ and $n \in \{50, 100, 200, 400, 800, 1600\}$.
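The data-generating step and the least squares fit can be sketched as follows. This is a minimal illustration that uses only the Python standard library. The function names, the seed, and the choice to include an intercept in the regression are assumptions, since the text does not state whether an intercept was fitted.

```python
import math
import random


def simulate_group(theta, n, rng):
    """Draw n pairs (x, y) from Y = theta * X + eps with X, eps ~ N(0, 1)."""
    xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
    ys = [theta * x + rng.gauss(0.0, 1.0) for x in xs]
    return xs, ys


def ols_slope(xs, ys):
    """OLS slope (intercept included, an assumption) and model-based SE."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    rss = sum((y - intercept - slope * x) ** 2 for x, y in zip(xs, ys))
    se = math.sqrt(rss / (n - 2) / sxx)  # residual variance divided by Sxx
    return slope, se


rng = random.Random(2024)  # arbitrary seed, for reproducibility only
theta2_grid = [round(-1 + 0.05 * k, 2) for k in range(41)]  # -1, -.95, ..., 1
xs, ys = simulate_group(theta=1.0, n=800, rng=rng)
theta1_hat, se1 = ols_slope(xs, ys)
```

Repeating the last two lines for each $θ_{2}$ in `theta2_grid` and each sample size gives one replicate of the design described above.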

For each synthetic data set, we perform the relative difference likelihood ratio test and the omnibus test with $κ \in \{1.5, 2, 4, 8\}$, and use a significance level of $α = .05$. We additionally calculate $κ_{\max}^{α}$ with $α = .05$. Parameter estimation is performed with ordinary least squares, and model-based estimates of the standard error are used. We compare the likelihood ratio test with the naïve test for absence/presence interactions described in Section 2.3, which rejects the null when there is evidence for the existence of an effect in one population but insufficient evidence in the other population. We also compare with a test for quantitative interactions, which rejects the null when there is evidence that $θ_{1}$ and $θ_{2}$ are unequal.
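The two comparison procedures can be sketched with two-sided Wald tests. The exact form of the tests is an assumption here: we take "evidence for an effect" to mean a Wald z-test that rejects at level $α$, and we treat the two sub-population estimates as independent. The function names are hypothetical.

```python
import math


def wald_p_value(estimate, se):
    """Two-sided normal p-value for H0: parameter = 0."""
    return math.erfc(abs(estimate / se) / math.sqrt(2))


def naive_absence_presence_test(theta1, se1, theta2, se2, alpha=0.05):
    """Reject when exactly one sub-population shows a significant effect."""
    sig1 = wald_p_value(theta1, se1) < alpha
    sig2 = wald_p_value(theta2, se2) < alpha
    return sig1 != sig2


def quantitative_interaction_test(theta1, se1, theta2, se2, alpha=0.05):
    """Reject when theta1 and theta2 differ significantly (independent samples)."""
    se_diff = math.sqrt(se1**2 + se2**2)
    return wald_p_value(theta1 - theta2, se_diff) < alpha


# An estimate of 0.15 with SE 0.1 is not significant, so the naive test rejects
# even though the second effect need not be exactly zero.
naive_reject = naive_absence_presence_test(1.0, 0.1, 0.15, 0.1)
quant_reject = quantitative_interaction_test(1.0, 0.1, 0.15, 0.1)
```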

In Figure 5, we plot Monte Carlo estimates of the rejection probabilities for the relative difference likelihood ratio test, the naïve test, and the test for quantitative interactions over the range of $θ_{2}$ . We see that the relative difference likelihood ratio test achieves control of size for all choices of $κ$ and all sample sizes. Power is largest when $|θ_{2}|$ is near zero and is lower for larger values of $κ$ , as expected, because larger values of $κ$ correspond to a larger null region. With $κ = 8$ , the likelihood ratio test has almost zero power when the sample size is small, though we see an improvement in large sample settings.

Monte Carlo estimate of the rejection probability of the relative difference likelihood ratio test and the naïve test. The white vertical line at $θ_{2} = 0$ corresponds to the single point at which the absence/presence null hypothesis does not hold. The solid pink line denotes the specified size $α = .05$.

The naïve test of the absence/presence hypothesis exhibits poor type-I error control. Because the absence/presence null hypothesis fails to hold only when $θ_{2} = 0$, a well-calibrated test would reject the null with probability no greater than $α$ for any other value of $θ_{2}$. However, we can see that the type-I error rate of the naïve test is greatly inflated when $θ_{2}$ resides within a neighborhood of zero, and this is especially true in a small sample setting. Due to the test’s poor calibration, it is difficult to glean any meaningful information from a decision to reject the null hypothesis. While the relative difference test can also have a high rejection probability when $θ_{2}$ is in a neighborhood of zero, it is more easily interpretable: because this test is well-calibrated, we are able to understand what a decision to reject the null implies about the relationship between $θ_{1}$ and $θ_{2}$.

Figure 6 shows the estimated rejection probabilities of the omnibus test for varying $θ_{2}$ . Control of size is achieved in both large and small samples, as expected. For this test, power increases as $θ_{2}$ tends to −1, and as expected, power decreases with increasing $κ$ .

Monte Carlo estimate of the rejection probability of the omnibus test for qualitative interactions. The solid pink line denotes the specified size $α = .05$.

In Figure 7, we plot the $α$, 0.5, and $(1 - α)$ quantiles of $κ_{\max}^{α}$ values from 1000 synthetic data sets for each $θ_{2}$. As we would expect, for any fixed $θ_{2}$, at least $100 (1 - α) \%$ of the $κ_{\max}^{α}$ values fall below $|θ_{1}| / |θ_{2}|$. In addition, the $(1 - α)$ quantile of the distribution of $κ_{\max}^{α}$ approaches $|θ_{1}| / |θ_{2}|$ as sample size increases. However, when the sample size is small, $κ_{\max}^{α}$ tends to be much smaller than the relative difference in effect size. This suggests that $κ_{\max}^{α}$ is a loose lower bound for the relative difference in effect in a small sample setting, but the bound becomes progressively tighter as the sample size grows.

Distribution of $κ_{m a x}^{α}$ calculated on synthetic data. The .05, .50, and .95 quantiles are represented by the dashed blue and solid black curves. The grey curve represents $|θ_{1}| / |θ_{2}|$ , the value $κ_{m a x}^{α}$ is expected to approach for a given $θ_{2}$ .

5 |. DATA EXAMPLE

In this example, we investigate genetic differences in breast cancer sub-types. Classification of breast cancer based on expression of estrogen receptor (ER) is known to be associated with clinical outcomes. Approximately 70% of breast cancers are estrogen receptor positive (ER+) cancers, meaning that estrogen causes cancer cells to grow²³; breast cancers are otherwise estrogen receptor negative (ER−). Patients with ER+ breast cancer tend to experience better clinical outcomes than ER− patients²⁴.

We conduct an analysis using publicly available data from The Cancer Genome Atlas (TCGA)²⁵. We use clinical data and gene expression data from a total of 806 ER+ patients and 237 ER− patients.

We first investigate the differences between the genetic networks in ER+ and ER− breast cancers. Both ER+ and ER− breast cancer are expected to have similar pathways, but identifying differences between them may be key to understanding the underlying disease mechanisms. We then conduct an analysis to assess whether any genes in a set known to be associated with breast cancer are strongly prognostic of disease outcomes in only one of the estrogen receptor groups.

5.1 |. Differential Network Analysis

Our objective is to determine whether there are any pairs of genes that are much more strongly associated in one estrogen receptor group than the other. We consider the set of $p = 145$ genes in the Kyoto Encyclopedia of Genes and Genomes (KEGG)²⁶ breast cancer pathway and measure the association between gene expression levels using the Pearson correlation.

We test the relative difference null hypothesis for each pair of genes with $κ \in \{2, 2.5, 3\}$ and provide the resulting p-values in Table 1. In Figure 8, we display the pairs of genes that are statistically significant at the $α = .05$ level with $κ = 2$ after a Bonferroni adjustment. Each of the genes progesterone receptor (PGR), insulin-like growth factor 1 receptor (IGF1R), and estrogen receptor 1 (ESR1) has multiple differential connections; each belongs to at least two pairs such that the association is twice as strong in the ER+ population as in the ER− population. These genes have been shown in the literature to be associated with sub-type and prognosis^27,28,29.
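For reference, the size of the multiple testing problem and the form of a Bonferroni adjustment can be sketched as follows. The sketch assumes that the adjustment is taken over all unordered pairs of the $p = 145$ genes, which the text does not state explicitly. The function name `bonferroni` and the raw p-values in the usage lines are illustrative and are not taken from the analysis.

```python
from math import comb

n_genes = 145
n_pairs = comb(n_genes, 2)  # number of unordered gene pairs: 10440


def bonferroni(p, m):
    """Bonferroni-adjusted p-value: multiply by the number of tests, cap at 1."""
    return min(1.0, p * m)


# Illustrative raw p-values only.
adj_small = bonferroni(1e-6, n_pairs)  # 0.01044
adj_capped = bonferroni(0.5, n_pairs)  # capped at 1.0
```

The cap at one is consistent with the many adjusted p-values equal to 1 reported in Table 1.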

Table 1.

Bonferroni-adjusted p-values for pairs of genes for which we reject the relative difference null hypothesis with κ = 2, κ = 2.5, or κ = 3.

Gene Pair	κ = 2	κ = 2.5	κ = 3

(IGF1R, AKT3)	0.034	1	1
(IGF1R, FGF1)	0.0011	0.16	1
(KIT, EGFR)	0.0069	1	1
(PGR, FGF7)	0.00034	0.17	1
(PGR, DLL1)	0.014	0.5	1
(PGR, NOTCH4)	6.5e-07	0.00054	0.04
(FZD4, PGR)	8.8e-08	0.00013	0.013
(FGF18, FGF10)	0.017	0.74	1
(E2F1, ESR1)	3.2e-07	0.001	0.13
(E2F2, ESR1)	0.00094	0.11	1


(Left) Pairs of genes in the KEGG breast cancer pathway for which we reject the relative difference hypothesis with $κ = 2$. Blue edges indicate associations that are stronger in the ER+ group, and red edges indicate associations that are stronger in the ER− group. (Right) Log hazard ratios for KEGG genes in ER+ and ER− groups. The gray dashed line represents the 45-degree line. Blue diamonds and red triangles indicate genes for which $κ_{\max}^{α} > 1$ with $α = .10$, where the largest log hazard ratio is in the ER+ group and ER− group, respectively.

5.2 |. Prognostic Value of Biomarkers

The goal of this analysis is to assess whether any of the KEGG genes have a stronger association with time to death in one estrogen receptor group than in the other. For each gene, we fit a univariate Cox proportional-hazards model with time to death as the outcome in each estrogen receptor group separately; we measure association using the log hazard ratio. A total of 64 deaths occurred in the ER+ group, and 33 deaths occurred in the ER− group. We calculate $κ_{\max}^{α}$ for $α$ equal to .1, .05, .025, and .01, for each gene.

In Figure 8, we compare the log hazard ratios of the ER+ and ER− groups in a scatterplot. Though the log hazard ratios for most genes are similar between subgroups, there are twelve genes with $κ_{m a x}^{α}$ larger than one with $α = 0.10$ . A complete list is available in Table 2. The two genes with the strongest interactions are Growth Factor Receptor-bound Protein 2 (GRB2; $κ_{m a x}^{0.10} = 2.04$ ), which has a stronger association in the ER− group, and Adenomatous Polyposis Coli (APC; $κ_{m a x}^{0.10} = 1.91$ ), which has a stronger association in the ER+ group. Both genes have been hypothesized to be associated with breast cancer carcinogenesis^30,31. We acknowledge, however, that for small $α$ , none of the $κ_{m a x}^{α}$ values are much larger than one, and moreover, no $κ_{m a x}^{α}$ values exceed one at the Bonferroni-adjusted level .10∕145. This indicates that the relative difference in strength of association is possibly small for all genes, or alternatively, that we may not have sufficient power to detect large or moderate relative differences.

Table 2.

KEGG genes that are more strongly associated with time to death in one ER group than the other. Reported are $κ_{\max}^{α}$ values for $α \in \{.1, .05, .025, .01\}$.

Gene	ER+ Log HR (SE)	ER− Log HR (SE)	$κ_{\max}^{α}$, α = .1	$κ_{\max}^{α}$, α = .05	$κ_{\max}^{α}$, α = .025	$κ_{\max}^{α}$, α = .01

GRB2	−0.06 (0.31)	−1.66 (0.68)	2.04	1.36	1.00	1.00
APC	1.34 (0.32)	−0.09 (0.33)	1.91	1.57	1.32	1.08
BAX	−1.05 (0.24)	0.04 (0.36)	1.53	1.26	1.06	1.00
PIK3CA	1.13 (0.28)	0.14 (0.32)	1.51	1.24	1.05	1.00
SOS2	1.13 (0.36)	−0.1 (0.37)	1.33	1.03	1.00	1.00
MAP2K2	−0.87 (0.27)	0.03 (0.35)	1.22	1.00	1.00	1.00
GADD45G	−0.52 (0.13)	−0.07 (0.19)	1.21	1.00	1.00	1.00
HES5	0.02 (0.2)	0.51 (0.18)	1.19	1.00	1.00	1.00
WNT2	−0.36 (0.09)	0 (0.17)	1.14	1.00	1.00	1.00
DLL4	0.09 (0.2)	0.68 (0.27)	1.10	1.00	1.00	1.00
FRAT2	−1.22 (0.31)	−0.45 (0.29)	1.08	1.00	1.00	1.00
SOS1	1.19 (0.3)	−0.34 (0.42)	1.01	1.00	1.00	1.00


6 |. DISCUSSION

We proposed a general framework for inference about absence/presence qualitative interactions. We argued that naïve procedures rely upon untenable conditions because the absence/presence hypothesis is ill-posed. We thus proposed to relax the problem in order to conduct well-calibrated inference that maintains the absence/presence interpretation and only requires mild assumptions.

In simulations, we found that our methodology has low power when signal is weak or sample sizes are small. To an extent, this is just a feature of the problem; naturally, one would require even more information to detect absence/presence qualitative interactions than what is required to detect quantitative interactions. However, we provide no guarantee that our methodology is optimal, as tests for composite hypotheses based upon supremum p-values can be conservative in practice³².

Our framework is interpretable and provides a natural approach for quantifying differences in strength of association by sub-population in general settings. Though we only considered measures of marginal association in our examples, our method can be used with conditional measures of association as well; we only require that asymptotically normal estimators are available. In particular, our approach remains valid in the high-dimensional setting, where penalized estimators are biased³³ but asymptotically normal estimates can be obtained using the techniques of, e.g., van de Geer et al.³⁴ and Zhang and Zhang³⁵. Finally, we demonstrated the usefulness of our method in a real data example based on genomics data.

9 |. ACKNOWLEDGEMENTS

The authors gratefully acknowledge the support of the NSF Graduate Research Fellowship Program under grant DGE-1762114 as well as NSF grant DMS-1561814 and NIH grant R01-GM114029. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the funding agencies.

APPENDIX