Bayesian Hierarchical Quantile Regression with Application to Characterizing the Immune Architecture of Lung Cancer

Priyam Das; Christine B Peterson; Yang Ni; Alexandre Reuben; Jiexin Zhang; Jianjun Zhang; Kim-Anh Do; Veerabhadran Baladandayuthapani

doi:10.1111/biom.13774

Author manuscript; available in PMC: 2024 Sep 1.

Published in final edited form as: Biometrics. 2022 Nov 3;79(3):2474–2488. doi: 10.1111/biom.13774

PMCID: PMC10102253 NIHMSID: NIHMS1842069 PMID: 36239535

Summary:

The successful development and implementation of precision immuno-oncology therapies requires a deeper understanding of the immune architecture at a patient level. T-cell Receptor (TCR) repertoire sequencing is a relatively new technology that enables monitoring of T-cells, a subset of immune cells that play a central role in modulating immune response. These immunologic relationships are complex and are governed by various distributional aspects of an individual patient’s tumor profile. We propose Bayesian Quantile regression for hierarchical Covariates (QUANTICO) that allows simultaneous modeling of hierarchical relationships between multi-level covariates, conducts explicit variable selection, estimates quantile and patient-specific coefficient effects, to induce individualized inference. We show QUANTICO outperforms existing approaches in multiple simulation scenarios. We demonstrate the utility of QUANTICO to investigate the effect of TCR variables on immune response in a cohort of lung cancer patients. At population level, our analyses reveal the mechanistic role of T-cell proportion on the immune cell abundance, with tumor mutation burden as an important factor modulating this relationship. At a patient level, we find several outlier patients based on their quantile-specific coefficient functions, who have higher mutational rates and different smoking history.

Keywords: Bayesian quantile regression, non-small cell lung cancer, T-cell receptor repertoire, variable selection, varying sparsity regression

1. Introduction

Immunotherapy is a class of cancer treatments that harnesses the patient’s own immune system to fight cancer (Waldman et al., 2020). Although immunotherapies represent a major breakthrough in cancer treatment, offering immense clinical benefit to some patients with a lower toxicity burden than chemotherapy, resistance to immunotherapy remains a major challenge (Walsh and Soo, 2020). Tumors use various strategies, which might vary across patients, to protect themselves from anti-tumor immunity. Anti-tumor immune responses might also be mediated by several different mechanisms, driven by patient-specific immune architecture. Cancer immunotherapy therefore needs to be personalized to recognize patient-specific rate-limiting steps and employ strategies for overcoming these hurdles (Kakimi et al., 2017).

Identifying patient-specific influences on immune response requires integrating measures of immune activity and characterizing their dependence on upstream factors. Here we consider a setting where the response variable Y is a continuous measure of immune activation, such as the abundance of CD8+ T-cells, which play a key role in directly killing cancer cells. The abundance of these cells is driven by “Level 1” covariates T₁, …, T_p, which represent measures of upstream immune activity. Here we take these Level 1 covariates to be features derived from T-cell receptor (TCR) sequencing, which capture activity of the adaptive immune system. The TCR repertoire depends, in turn, on “Level 2” influences further upstream, such as antigens that the T-cells have been exposed to. These include tumor-specific antigens, which are abnormal proteins produced by tumor cells due to DNA mutations in the cell (see Figure 1(a)). We consider mutational variables M₁, …, M_g as (hierarchical) Level 2 covariates in our model. Our goal is to develop a hierarchical modeling approach that provides insights into the mechanistic relationships among these variables, wherein the measures of immune activity may depend on the upstream factors in a complex non-linear fashion.

Figure 1. — (a) Illustration of how mutation within tumor cells and TCR repertoire impact immune cells. (b) Illustration of the proposed model for a scenario with n patients, g mutation variables M₁, …, M_g, and p TCR variables T₁, …, T_p. The response variable is denoted Y₁, …, Y_n. Different colored lines describe the estimated dependency structure at different quantile levels denoted by (τ₁, τ₂, … τ_L). (c) Directed acyclic graph (DAG) of the QUANTICO model. Parameters are shown in circles and the observed data are shown in boxes.

We now briefly review prior work addressing the challenge of flexible regression modeling, which lays the foundation for the proposed approach. In order to estimate the subject-specific effect of variables on the outcome variable, Hastie and Tibshirani (1993) proposed the varying coefficient model (VCM). Since then, several variations of the VCM have been proposed (Fan and Zhang, 1999; Park et al., 2013), including approaches that incorporate shrinkage in estimation (Wang and Xia, 2009). However, existing VCM methods do not enable explicit multi-level variable selection, which is crucial in settings such as ours where there are a large number of explanatory variables within a hierarchy. Although VCMs allow more flexible coefficients than traditional regression models, they are still focused on estimation of the mean of the response variable. If one is interested in obtaining a comprehensive picture of the effect of the predictors on the response variable, mean regression might be insufficient. For example, if the dependent variable is multi-modal or skewed, estimating the mean effect might be misleading. In such scenarios, median or, more generally, quantile regression may be more appropriate (Koenker and Bassett, 1978). In particular, if the interest lies at higher or lower quantiles of the response variable, quantile regression is more suitable.

Over the last couple of decades, methodological developments in quantile regression have been proposed in both the classical and Bayesian frameworks (Kottas and Gelfand, 2001; Yu and Moyeed, 2001; Reich, 2012; Das and Ghosal, 2018). Several articles have integrated quantile regression with VCMs (Tang and Wang, 2005; Honda, 2004; Kim, 2007). In Bayesian settings, the literature on VCMs is relatively sparse (Biller and Fahrmeir, 2001), and these proposals do not address quantile regression. Recently, Ni et al. (2018) proposed a VCM incorporating variable selection in the Bayesian framework, which was applied to characterize the relationship of cancer patient outcomes to proteomics and genomics data. Although there are approaches for Bayesian variable selection in multi-level models that assume linear covariate effects (Stingo et al., 2013; Koslovsky et al., 2020), to the best of our knowledge, no prior work has considered variable selection in VCMs for quantile regression.

In order to understand the subject-specific effect of hierarchically structured covariates on the outcome variable, we propose Bayesian Quantile regression for hierarchical Covariates (QUANTICO), where the regression coefficients are allowed to differ across patients for any given quantile level of the outcome variable. Among the hierarchically structured (multi-level) covariates, we consider the Level 1 covariates to have a direct effect on the response variable, modulated by Level 2 covariates. Since we expect the covariate effects to be heterogeneous across patients, we want to perform individualized inference as well as allow variable selection for both Level 1 and Level 2 covariates. As we also expect the effects to be different at different parts of the distribution of the response variable, we model across the quantiles, rather than only considering fixed moments (e.g. mean). This provides a richer, broader, and more flexible exploration of the relationship structure. We show an illustration of the proposed model in Figure 1(b) with n subjects, p Level 1 and g Level 2 covariates. As shown in the figure, the selection of Level 1 and 2 covariates is allowed to vary across quantiles. Because there are two levels of covariates, and because the effect of Level 2 covariates on the output variable is induced via their effect on Level 1 covariates, we call it a hierarchical model.

The rest of the paper is organized as follows: Section 2 describes the QUANTICO modeling framework, with a discussion of the priors in Section 3. We then describe the computational algorithm for performing posterior inference (Section 4), and provide a simulation study comparing the performance of the proposed method with existing alternatives (Section 5). In Section 6, we apply QUANTICO to characterize the relationship of the CD8 immune marker with TCR and mutation variables for lung cancer patients. We conclude with a brief discussion and possible extensions of our methodology in Section 7.

2. QUANTICO model

In Section 2.1, we introduce quantile regression in a VCM setting with two levels of covariates. In Section 2.2, we describe variable selection procedures on Level 1 and Level 2 covariates. Section 2.3 summarizes the likelihood constructions.

2.1. Varying sparsity quantile regression model

Suppose there are n subjects whose response variables (immune marker values) are denoted by Y = (Y₁, …, Y_n) and p Level 1 covariates (TCR variables) T = (T₁, …, T_p). For the i-th subject, the set of Level 1 covariates is denoted as T_i = (T_i1, …, T_ip) for i = 1, …, n. We assume a linear relationship between the response variable Y and the Level 1 covariates T. Now, for the j-th Level 1 covariate T_j, consider a set of q_j Level 2 covariates (mutation variables) $M_{j} = (M_{j 1}, \dots, M_{j q_{j}})$ for j = 1, …, p. A Level 2 covariate induces its effect on the response variable Y through its effect on the j-th Level 1 covariate, i.e., T_j for j = 1, …, p. Note that the same Level 2 covariate may influence multiple Level 1 covariates, and that the number q_j of Level 2 covariates can be different for each Level 1 covariate. However, in the particular case study considered in this paper, the same set of Level 2 covariates (mutation variables) is considered for all Level 1 covariates (TCR variables). For the i-th subject, the set of Level 2 covariates is denoted by $M_{i j} = (M_{i j 1}, \dots, M_{i j q_{j}})$ for i = 1, …, n, j = 1, …, p. The total number of possible effects for the M variables which can be identified as a part of the coefficient of the T variables is given by $\sum_{j = 1}^{p} q_{j}$.

Instead of estimating the effect of T and M on the mean of the response variable, our interest lies in estimating their relationship for different quantile levels. To obtain the quantile-specific estimates, we assume the effect of the covariates on Y to be dependent on the quantile level τ (0 < τ < 1). Let

$$Q_{Y}(\tau \mid T, M) = \inf\{q : P(Y \leqslant q \mid T = T, M = M) \geqslant \tau\},$$

denote the τ-th conditional quantile (0 < τ < 1) of the response variable Y at T = T, M = M. The relation between the response and the covariates at the τ-th quantile is given by

$$Q_{Y_{i}}(\tau \mid T, M) = \beta_{0}(\tau, M_{i0}) + \sum_{j=1}^{p} T_{ij}\, \beta_{j}(\tau, M_{ij}), \tag{1}$$

where M_i0 denotes the values of all distinct Level 2 covariates for the i-th subject, and M_ij denotes the values of the Level 2 covariates corresponding to the j-th Level 1 covariate for the i-th subject. This equation shows the dependence structure of the τ-th quantile of the dependent variable Y for the i-th patient on the corresponding Level 1 (T) and Level 2 (M) covariates. Note that for each Level 1 covariate T_j we have q_j distinct M variables, but since two different Level 1 covariates may share the same M variables, the $\sum_{j = 1}^{p} q_{j}$ M variables counted in total may not all be distinct. Also note that both the intercept and slope terms are patient-specific (indexed by i). In order to incorporate the effect of the M variables on the dependent variable, at any given quantile level τ, all the slope terms (of T) are semi-parametric functions of the M variables and likewise the intercept is a semi-parametric function of the distinct M variables. The explicit structure of β_j(τ, M_ij) is discussed in Section 2.2.
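To make equation (1) concrete, the following Python sketch evaluates the τ-th conditional quantile for a single patient once the patient-specific intercept and slopes are available. The function name and all numeric values are hypothetical placeholders chosen for illustration; they are not estimates from the paper.

```python
def conditional_quantile(t_i, beta0_i, beta_i):
    """Equation (1) for one patient at a fixed tau:
    beta_0(tau, M_i0) + sum_j T_ij * beta_j(tau, M_ij)."""
    if len(t_i) != len(beta_i):
        raise ValueError("need one slope per Level 1 covariate")
    return beta0_i + sum(t * b for t, b in zip(t_i, beta_i))


# Hypothetical Level 1 covariate values for one patient (p = 3).
t_i = [0.5, 2.0, 1.0]

# Hypothetical (intercept, slopes) pairs, already evaluated at this patient's
# Level 2 covariates. Both the values and the zero pattern may change with tau.
coef_by_tau = {
    0.1: (1.0, [0.0, 0.5, 0.0]),
    0.5: (1.5, [1.0, 0.5, 0.0]),
    0.9: (2.0, [1.0, 0.5, 1.0]),
}

quantiles = {tau: conditional_quantile(t_i, b0, b)
             for tau, (b0, b) in coef_by_tau.items()}
print(quantiles)
```

The zero entries in the slope lists mimic the situation in which a Level 1 covariate is not selected for this patient at a given quantile level.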

2.2. Varying-sparsity coefficient modelling and selection

For any given quantile level τ, we consider β_j(τ, M_ij) as a smooth function of M_ij. The main motivation for taking β_j(τ, M_ij) as a smooth function of M_ij is so that the coefficients of any Level 1 covariate (T_j) for two different subjects are similar if the values of the corresponding Level 2 covariates (M_·j) are similar as well. Since for any Level 1 covariate, “neighboring” patients (with respect to the M variables) are expected to have a similar slope coefficient, this assumption enables borrowing of strength, increasing our power to estimate the subject-specific slope coefficients. For modelling the slope and intercept terms, we use spline functions due to their flexible construction, interpretation, and the ease of incorporating penalization.

At any given quantile τ (0 < τ < 1), the slope and the intercept terms are estimated as the sum of spline functions given by

$$\beta_{j}(\tau, M_{ij}) = \sum_{k=1}^{q_{j}} f_{jk}^{(\tau)}(M_{ijk}), \quad j = 0, \dots, p, \tag{2}$$

where β₀(τ, M_i0) denotes the (global) intercept term, and β_j(τ, M_ij) for j ⩾ 1 denotes the slope term. The spline components are given by $f_{j k}^{(τ)} (M_{i j k}) = S_{i j k} α_{j k}^{τ}$, where S_ijk denotes the cubic B-spline bases for M_ijk and $α_{j k}^{τ}$ denotes the corresponding spline coefficient. Note that we consider the intercept term to be a function of all Level 2 covariates. The number of knots for the B-spline bases is taken to be sufficiently large to capture the non-linear features. We do not perform knot selection; rather, knots are placed at equally spaced quantiles of each Level 2 covariate. Smoothing is induced via regularization and over-fitting is controlled through a roughness penalty on the spline coefficients, details of which are provided in Section A of the Supporting Information.
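The basis construction described above can be sketched in plain Python as follows. This is a minimal illustration, assuming a clamped cubic B-spline basis evaluated by the Cox-de Boor recursion with interior knots at equally spaced empirical quantiles; the function names, the number of interior knots, the data, and the coefficient vector are all hypothetical, and the authors' implementation may differ in detail.

```python
def quantile_knots(values, n_interior, degree=3):
    """Clamped knot vector with interior knots at equally spaced quantiles."""
    v = sorted(values)
    interior = []
    for k in range(1, n_interior + 1):
        pos = k / (n_interior + 1) * (len(v) - 1)  # fractional index
        lo = int(pos)
        hi = min(lo + 1, len(v) - 1)
        interior.append(v[lo] + (pos - lo) * (v[hi] - v[lo]))
    return [v[0]] * (degree + 1) + interior + [v[-1]] * (degree + 1)


def bspline_row(x, knots, degree=3):
    """Cox-de Boor recursion: values of all B-spline basis functions at x."""
    n_int = len(knots) - 1
    basis = [1.0 if knots[i] <= x < knots[i + 1] else 0.0 for i in range(n_int)]
    if x == knots[-1]:  # close the last non-empty interval on the right
        last = max(i for i in range(n_int) if knots[i] < knots[i + 1])
        basis[last] = 1.0
    for d in range(1, degree + 1):
        nxt = []
        for i in range(n_int - d):
            left_den = knots[i + d] - knots[i]
            right_den = knots[i + d + 1] - knots[i + 1]
            left = (x - knots[i]) / left_den * basis[i] if left_den > 0 else 0.0
            right = ((knots[i + d + 1] - x) / right_den * basis[i + 1]
                     if right_den > 0 else 0.0)
            nxt.append(left + right)
        basis = nxt
    return basis


# Hypothetical values of one Level 2 covariate M_ijk for n = 10 patients.
m_values = [0.1, 0.4, 0.5, 0.9, 1.3, 2.0, 2.2, 3.1, 3.5, 4.0]
knots = quantile_knots(m_values, n_interior=3)
S = [bspline_row(m, knots) for m in m_values]  # n x 7 basis matrix

# Hypothetical spline coefficients; f_jk(M_ijk) = S_ijk * alpha_jk.
alpha = [0.0, 0.2, 0.5, 0.4, 0.1, -0.3, -0.2]
f_values = [sum(s * a for s, a in zip(row, alpha)) for row in S]
```

Each row of `S` is non-negative and sums to one, which is the usual partition-of-unity property of a clamped B-spline basis.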

In equation (2), β_j(τ, ·) is modeled as a sum of a set of smooth spline functions. As discussed in Section 3, we construct a prior on the spline coefficients, so that, for any given T variable, the linear and non-linear effects corresponding to all associated M variables can be identified. Under these assumptions, the estimated coefficients of the T variables would be zero if both the linear and non-linear effects of all associated M variables are zero. However, for a larger number of M variables, it is unlikely that both the linear and non-linear effects of all M variables corresponding to any T variable are zero. If the number of Level 1 covariates (T variables) is also large, it becomes crucial to perform variable selection on them to enforce sparsity and interpretability.

In order to incorporate sparsity on the Level 1 covariates, one naïve approach could be to use discrete mixture priors such as spike-and-slab (George and McCulloch, 1993). But, since we consider the selection of the T covariates to be patient-specific, to apply a spike-and-slab prior, we would need to assign a latent indicator for each coefficient for every patient. This approach would therefore substantially increase the number of parameters in the model. In addition, as discussed below, the spike-and-slab prior is not well-suited for functional regression coefficients. Instead, we rely on a Bayesian hard-thresholding approach where we truncate the coefficients with smaller absolute values to zero so that only the important Level 1 variables are selected. For truncation, we take the Bayesian hard-thresholding function h(z, t) = zI(|z| > t) to threshold the slope coefficients of the T variables. We modify the values of the slope coefficients given in equation (2) as

$$\beta_{j}(\tau, M_{ij}) = h\Big(\sum_{k=1}^{q_{j}} f_{jk}^{(\tau)}(M_{ijk}),\ \lambda_{j}\Big),$$

where λ_j is a ‘minimum effect size’ for the effect of T_j to be considered as non-zero. Our use of the hard-thresholding prior instead of the more commonly adopted Bayesian spike-and-slab prior is due to the functional nature of the regression coefficients β_j(·). The hard-thresholding prior allows selection at any given input whereas spike-and-slab is not able to handle an infinite dimensional object. Another advantage of using the Bayesian hard-thresholding is that this ‘minimum effect size’ can be set at any reasonable value by the user based on intuition or prior experience, or estimated by assigning a prior (as we discuss in the next section). Note, no hard-thresholding is performed (or needed) on the intercept term as our main interest is to perform variable selection among Level 1 covariates only.
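The hard-thresholding step can be illustrated with a few lines of Python. The function names and the numeric inputs are hypothetical; the sketch only implements h(z, t) = zI(|z| > t) applied to the sum of spline terms, as in the thresholded version of equation (2).

```python
def hard_threshold(z, t):
    """h(z, t) = z * I(|z| > t): keep z only if it exceeds the minimum effect size."""
    return z if abs(z) > t else 0.0


def thresholded_slope(spline_terms, lam):
    """Patient-specific slope: threshold the sum of f_jk(M_ijk) over k."""
    return hard_threshold(sum(spline_terms), lam)


# Hypothetical spline contributions f_jk(M_ijk) for two patients and one T_j.
patient_a = [0.04, -0.01, 0.02]   # sums to 0.05, below the threshold
patient_b = [0.20, 0.10]          # sums to 0.30, above the threshold
lam = 0.1                         # hypothetical minimum effect size

print(thresholded_slope(patient_a, lam), thresholded_slope(patient_b, lam))
```

Because the threshold is applied to each patient's coefficient value separately, the same Level 1 covariate can be selected for one patient and excluded for another, without a separate latent indicator per patient.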

2.3. Likelihood construction

We now describe the likelihood construction of the QUANTICO model, assimilating the above constructs, for any fixed quantile level of interest τ. We shorten the notation β₀(τ, M_i0) and β_j(τ, M_ij) in equation (1) to $β_{0}^{(i)} (τ)$ and $β_{j}^{(i)} (τ)$, respectively, for i = 1, …, n, and j = 1, …, p. Following the principle of linear quantile regression (Koenker and Bassett, 1978), $β_{0}^{(i)} (τ)$ and $β_{j}^{(i)} (τ)$ can be estimated by minimizing the loss function

$$V(\tau) = \sum_{i=1}^{n} \psi_{\tau}\Big(y_{i} - \sum_{j=0}^{p} \beta_{j}^{(i)}(\tau)\, T_{ij}\Big),$$

where T_i0 = 1 and ψ_τ(t) is the check function given by ψ_τ(t) = τt, if t ⩾ 0, or ψ_τ(t) = −(1 − τ)t, if t < 0. Now consider the model

$$y_{i} = \sum_{j=0}^{p} \beta_{j}^{(i)}(\tau)\, T_{ij} + u_{i}, \quad i = 1, \dots, n,$$

where the u_i are i.i.d. draws from the asymmetric Laplace distribution with density f(u|τ) = τ(1 − τ) exp[−ψ_τ(u)]. Hence the joint density of y₁, …, y_n is given by

$$f(y_{1}, \dots, y_{n} \mid \tau) = \tau^{n} (1-\tau)^{n} \exp\Big[-\sum_{i=1}^{n} \psi_{\tau}\Big(y_{i} - \sum_{j=0}^{p} \beta_{j}^{(i)}(\tau)\, T_{ij}\Big)\Big].$$

Following Yu and Moyeed (2001), for any given τ, minimizing V(τ) with respect to the $β_{j}^{(i)} (τ)$ for j = 1, …, p is equivalent to maximizing f(u₁, …, u_n|τ). Hence, to estimate the τ-th quantile, the likelihood is given by

$$L_{\tau}\big(\{\beta_{0}^{(i)}, \dots, \beta_{p}^{(i)}\}_{i=1}^{n} \mid Y, T, M\big) = \tau^{n} (1-\tau)^{n} \exp\Big[-\sum_{i=1}^{n} \psi_{\tau}\Big(y_{i} - \sum_{j=0}^{p} \beta_{j}^{(i)}(\tau)\, T_{ij}\Big)\Big].$$
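The link between the check loss and the asymmetric Laplace likelihood can be verified numerically. The sketch below is illustrative only: the function names are hypothetical, and `fitted` stands for the patient-specific linear predictors $\sum_{j} β_{j}^{(i)} (τ) T_{i j}$, supplied here as made-up numbers.

```python
import math


def check_loss(t, tau):
    """psi_tau(t) = tau * t if t >= 0, and -(1 - tau) * t otherwise."""
    return tau * t if t >= 0 else -(1.0 - tau) * t


def ald_log_likelihood(y, fitted, tau):
    """Log of tau^n (1 - tau)^n exp[-sum_i psi_tau(y_i - fitted_i)]."""
    n = len(y)
    total_loss = sum(check_loss(yi - fi, tau) for yi, fi in zip(y, fitted))
    return n * math.log(tau) + n * math.log(1.0 - tau) - total_loss


# Hypothetical responses and two candidate fits at tau = 0.5.
y = [1.0, 2.0, 3.0]
fit_at_median = [2.0, 2.0, 2.0]
fit_elsewhere = [2.5, 2.5, 2.5]

ll_median = ald_log_likelihood(y, fit_at_median, 0.5)
ll_other = ald_log_likelihood(y, fit_elsewhere, 0.5)
print(ll_median, ll_other)
```

For fixed τ the first two terms of the log-likelihood are constant, so a smaller total check loss always corresponds to a larger likelihood; in this toy example the fit at the sample median has the higher value.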

3. Prior formulations

Note that the intercept and the slope terms ( $β_{0}^{(i)} (τ)$ and $β_{j}^{(i)} (τ)$ , respectively) are functions of the spline coefficients $α_{j k}^{τ}$ . In this section, we describe the prior on the spline coefficients $α_{j k}^{τ}$ , the prior used to induce selection of Level 2 covariates, and the prior assigned to the thresholding parameter λ_j to select Level 1 covariates.

Prior on spline coefficients.

We propose the use of a penalized spline in order to have a flexible but smooth fit. Specifically, we choose a large number of knots placed at equally-spaced quantiles of the covariates so that the local features can be captured, and penalize the roughness of $f_{j k}^{τ} (\cdot)$ through an improper Gaussian random walk prior on $α_{j k}^{τ}$ given by $α_{j k}^{τ} \sim N (0, s K^{-})$. The penalty matrix K is constructed from the second order differences of adjacent spline coefficients, which essentially penalizes the second derivatives of $f_{j k}^{τ} (\cdot)$. Because s scales the prior variance, a smaller value of s imposes a stronger roughness penalty and leads to a smoother fit, while a larger value of s allows a more irregular fit (Ni et al., 2015). Note that equation (2) is not identifiable since adding some quantity to any term of the summation and deducting the same quantity from any other term would yield the same summation value. Additionally, K is singular and therefore no penalty is imposed on the linear and constant trend of $f_{j k}^{τ} (\cdot)$ (i.e., the null space of K). In order to alleviate these two issues, we consider a similar approach as in Scheipl et al. (2012) and transform the spline bases into orthonormal bases. Let S_jk = (S_1jk, …, S_njk). For a given quantile τ, consider the spectral decomposition of the covariance of $S_{j k} α_{j k}^{(τ)}$

$$\operatorname{cov}\big(S_{jk}\, \alpha_{jk}^{(\tau)}\big) = s\, S_{jk} K^{-} S_{jk}^{T} = \begin{bmatrix} U_{jk} & * \end{bmatrix} \begin{bmatrix} D_{jk} & 0 \\ 0 & 0 \end{bmatrix} \begin{bmatrix} U_{jk} & * \end{bmatrix}^{T},$$

where U_jk is the orthonormal matrix of the eigenvectors corresponding to the positive eigenvalues in the diagonal matrix D_jk. Let $S_{j k}^{*} = U_{j k} D_{j k}^{\frac{1}{2}}$. Now if we assume an independent proper prior $α_{j k}^{* τ} \sim N (0, σ^{2} I)$, then the non-linear part of f_jk(·) can be parameterized by $S_{j k}^{*} α_{j k}^{* τ}$, which has a proper distribution proportional to the distribution of the original improper prior $S_{j k} α_{j k}^{τ}$. Suppose we denote the effect size of the j-th covariate for the i-th subject before applying the hard-thresholding by $β_{j}^{*} (τ, M_{i j})$. Thus the full reparameterization of $β_{j}^{*} (τ, M_{i j})$ is now given by

$$\beta_{j}^{*}(\tau, M_{ij}) = \sum_{k=1}^{q_{j}} f_{jk}^{(\tau)}(M_{ijk}) = \sum_{k=1}^{q_{j}} S_{jk}^{*}\, \alpha_{jk}^{*\tau} + \sum_{k=1}^{q_{j}} M_{ijk}\, \alpha_{jk}^{0\tau} + \alpha_{j}^{\tau}, \tag{3}$$

where $α_{j}^{τ}$ is the global constant term absorbing all constant terms from splines. This parameterization adds the flexibility to separately shrink and estimate the linear, non-linear and constant effects of the Level 2 (M) variables on the coefficients of the Level 1 (T) variables. In order to make the proposed method computationally more efficient, we only consider the first several eigenvectors which explain at least 99.5% of the variation, a similar idea as in principal component analysis. In order to induce sparsity on the linear, non-linear and the constant effects, we consider a parameter-expanded normal mixture of inverse gamma (peNMIG) prior on each $α_{j k}^{* τ}$ , $α_{j k}^{0 τ}$ , and $α_{j}^{τ}$ separately. We opt for the peNMIG prior since it is known to provide a more efficient Markov chain Monte Carlo (MCMC) algorithm compared to traditional spike-and-slab priors given the multivariate nature of the spline coefficients (Scheipl et al., 2012). peNMIG multiplicatively expands $α_{j k}^{* τ}$ as $α_{j k}^{* τ} = η_{j k} ξ_{j k}$ where η_jk is a scalar parameter and ξ_jk is a vector of the same length as $α_{j k}^{* τ}$ . Each of $α_{j k}^{0 τ}$ and $α_{j}^{τ}$ is also expanded in a similar fashion. A brief discussion on the choice of such a prior is provided in Section A of the Supporting Information.
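The role of the penalty matrix K and its null space can be checked with a small example. The sketch below assumes K = DᵀD, where D is the second-order difference matrix acting on adjacent spline coefficients; this is a common construction for such penalties and is stated here as an assumption, since the exact form used by the authors is given in the Supporting Information. Function names and coefficient vectors are hypothetical.

```python
def second_difference_penalty(m):
    """K = D^T D, where each row of D is (..., 1, -2, 1, ...)."""
    D = [[0.0] * m for _ in range(m - 2)]
    for r in range(m - 2):
        D[r][r], D[r][r + 1], D[r][r + 2] = 1.0, -2.0, 1.0
    return [[sum(D[r][a] * D[r][b] for r in range(m - 2)) for b in range(m)]
            for a in range(m)]


def roughness(alpha, K):
    """Quadratic form alpha^T K alpha = sum of squared second differences."""
    m = len(alpha)
    return sum(alpha[a] * K[a][b] * alpha[b] for a in range(m) for b in range(m))


K = second_difference_penalty(6)
constant = [2.0] * 6
linear = [0.5 * k for k in range(6)]
quadratic = [float(k * k) for k in range(6)]

# Constant and linear coefficient sequences lie in the null space of K and are
# not penalized; curvature is penalized.
print(roughness(constant, K), roughness(linear, K), roughness(quadratic, K))
```

The unpenalized constant and linear components are exactly the parts that equation (3) separates out and models through $α_{j}^{τ}$ and $α_{j k}^{0 τ}$, leaving only the penalized non-linear part in $S_{j k}^{*} α_{j k}^{* τ}$.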

Priors for selection of Level 2 covariates.

Since the Level 2 coefficients are at population level, we can induce explicit selection using a spike-and-slab prior on η_jk,

$$\eta_{jk} \sim \gamma_{jk}\, N(0, t_{jk}) + (1 - \gamma_{jk})\, N(0, \nu_{0}\, t_{jk}),$$

where γ_jk ~ Ber(ρ) and ν₀ is a fixed very small quantity close to zero. The selection of η_jk as a non-zero effect is indicated by the binary variable γ_jk, and thus γ_jk indicates the selection of the $α_{j k}^{* τ}$ vector due to the multiplicative construction. In terms of interpretation, the binary variable γ_jk = 1 indicates that M_ijk has a non-zero non-linear effect on T_ij. Similarly, in the expansion of $α_{j k}^{0 τ}$ and $α_{j}^{τ}$, γ_jk = 1 indicates the presence of linear and constant effects of M_ijk on T_ij, respectively. Thus, based on the estimated values of γ_jk in the expansion of $α_{j k}^{* τ}$ and $α_{j k}^{0 τ}$, we can identify the presence of non-linear and linear effects of Level 2 covariates, respectively. In essence, this construction allows the flexibility of considering only linear, only non-linear, or joint effects simultaneously. We choose conjugate hyper-priors for t_jk and ρ: t_jk ~ IG(a_t, b_t) and ρ ~ Beta(a_ρ, b_ρ). As the number of Level 2 covariates increases, this Beta-Bernoulli prior automatically corrects for multiplicity by making the posterior distribution of ρ concentrated at small values near 0 (Scott and Berger, 2010). We assign a mixture normal prior on each element of the vector $ξ_{j k} = (ξ_{j k}^{(i)})$, namely $ξ_{j k}^{(i)} \sim \frac{1}{2} N (1, 1) + \frac{1}{2} N (- 1, 1)$. The structure of this assumed prior discourages small effects. In a similar fashion, peNMIG priors are assumed for $α_{j k}^{0 τ}$ and $α_{j}^{τ}$ as well.
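A single draw from the peNMIG construction can be sketched as follows. This is an illustration under stated assumptions: the second argument of N(·, ·) is read as a variance, IG(a_t, b_t) is read in the shape-scale form (so that 1/t_jk is gamma with shape a_t and rate b_t), and the hyperparameter values are arbitrary placeholders rather than the settings used in the paper. The function name is hypothetical.

```python
import math
import random


def draw_penmig_coefficients(dim, rho, a_t, b_t, v0, rng):
    """One prior draw of alpha* = eta * xi under the peNMIG construction."""
    gamma = 1 if rng.random() < rho else 0           # gamma ~ Ber(rho)
    # t ~ IG(a_t, b_t): reciprocal of a gamma draw with shape a_t, scale 1/b_t.
    t = 1.0 / rng.gammavariate(a_t, 1.0 / b_t)
    variance = t if gamma == 1 else v0 * t           # slab or spike variance
    eta = rng.gauss(0.0, math.sqrt(variance))        # scalar importance parameter
    # Each xi element comes from 0.5 * N(1, 1) + 0.5 * N(-1, 1).
    xi = [rng.gauss(1.0 if rng.random() < 0.5 else -1.0, 1.0) for _ in range(dim)]
    alpha = [eta * x for x in xi]                    # multiplicative expansion
    return gamma, eta, xi, alpha


rng = random.Random(2022)
gamma, eta, xi, alpha = draw_penmig_coefficients(
    dim=4, rho=0.3, a_t=5.0, b_t=25.0, v0=1e-4, rng=rng)
print(gamma, eta, alpha)
```

Because the whole vector is scaled by the single scalar `eta`, a draw of `gamma = 0` shrinks every element of `alpha` toward zero together, which is how one indicator selects or removes an entire spline term.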

Prior on hard-thresholding for selection of Level 1 covariates.

As mentioned before, to select the Level 1 covariates, we adopt a hard-thresholding approach where we truncate the non-zero coefficient of the j-th Level 1 covariate with absolute value less than λ_j to 0. Since sparsity is induced while estimating the linear, non-linear, and constant effects of the Level 2 covariates on the Level 1 covariates, it is possible that β_j(·) = 0 even before hard-thresholding (when $α_{j k}^{* τ}$, $α_{j k}^{0 τ}$, and $α_{j}^{τ}$ are zero for all k = 1, …, q_j). In that case, λ_j can take any value without affecting the resulting estimates. To resolve this identifiability issue, we take λ_j = λ for j = 1, …, p. In the presence of at least one non-zero β_j(·), λ is now well-defined. We put a gamma prior on λ, λ ~ Gamma(a_λ, b_λ). The values of the shape and scale parameters (a_λ, b_λ) can be taken so that the mean of the gamma distribution (i.e., a_λ b_λ) is equal to the desired cut-off (based on intuition or prior experience). A brief discussion on how to choose the parameters of the gamma prior is provided in Supporting Information (Section A). Note that a more conventional variable selection prior, the spike-and-slab, is often used for finite-dimensional parametric models. We choose to use the random hard-thresholding because it is more suitable for infinite-dimensional functional selection, as discussed in Section 2.2. A schematic illustration of the proposed model and its key parameters is given in Figure 1(c).
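As a small worked example of matching the gamma prior to a desired cut-off, the sketch below uses the shape-scale parameterization, under which the mean is the product of shape and scale and the standard deviation is the scale times the square root of the shape. The function names and the numeric values are hypothetical; the authors' recommended choices are in Section A of the Supporting Information.

```python
import math


def gamma_prior_for_cutoff(cutoff, shape):
    """Return (shape, scale) so that the prior mean shape * scale equals cutoff."""
    if cutoff <= 0 or shape <= 0:
        raise ValueError("cutoff and shape must be positive")
    return shape, cutoff / shape


def gamma_mean_sd(shape, scale):
    """Mean and standard deviation of a Gamma(shape, scale) distribution."""
    return shape * scale, math.sqrt(shape) * scale


# Hypothetical desired minimum effect size of 0.2 with shape 4.
a_lam, b_lam = gamma_prior_for_cutoff(cutoff=0.2, shape=4.0)
prior_mean, prior_sd = gamma_mean_sd(a_lam, b_lam)
print(a_lam, b_lam, prior_mean, prior_sd)
```

For a fixed mean, a larger shape gives a prior that is more tightly concentrated around the chosen cut-off, while a smaller shape lets the data move λ further from it.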

4. MCMC and posterior inference

The posterior distributions of the model parameters are not analytically tractable. Therefore, a Markov chain Monte Carlo (MCMC) sampling algorithm is required to generate samples from the posterior distribution. We use a Gibbs sampling scheme for the parameters whose full conditional distributions are available in closed form (t_jk, γ_jk, ρ) and a Metropolis sampling scheme for the parameters without closed-form conditional distributions (η_jk, ξ_jk, λ). Details on the full conditional distributions are provided in Section B of the Supporting Information.

Inferential summaries.

When using this sampling algorithm, the quantile level of interest is fixed at the desired level. In order to estimate multiple quantiles simultaneously, the algorithm can be run in parallel for faster computation. The selection of Level 1 and Level 2 covariates can be performed in several ways based on the marginal posterior probability of inclusion. For the selection of Level 1 covariates, we consider the cut-off of the posterior probability of inclusion to be 0.5. A similar approach of thresholding the posterior inclusion probabilities can be taken for the selection of Level 2 covariates. In the presence of a higher number of Level 2 covariates, a false discovery rate (FDR) controlling approach can also be considered (Baladandayuthapani et al., 2010); see Section C of the Supporting Information.
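The two selection rules mentioned above can be sketched as follows. The thresholding rule follows the text directly. The FDR rule shown here is one commonly used Bayesian FDR procedure, which selects the largest set of covariates whose average posterior probability of being null stays at or below the target level; it is an assumption for illustration, and the exact procedure used by the authors is described in Section C of the Supporting Information. Function names and inclusion probabilities are hypothetical.

```python
def select_by_threshold(pip, cutoff=0.5):
    """Indices whose posterior inclusion probability exceeds the cutoff."""
    return [i for i, p in enumerate(pip) if p > cutoff]


def select_by_bayesian_fdr(pip, alpha=0.2):
    """Largest set, ranked by inclusion probability, whose average
    posterior null probability (1 - pip) is at most alpha."""
    order = sorted(range(len(pip)), key=lambda i: pip[i], reverse=True)
    selected, running_null = [], 0.0
    for rank, i in enumerate(order, start=1):
        running_null += 1.0 - pip[i]
        if running_null / rank > alpha:
            break  # the running average only grows from here on
        selected.append(i)
    return sorted(selected)


# Hypothetical marginal posterior inclusion probabilities for five covariates.
pip = [0.95, 0.10, 0.80, 0.55, 0.30]
by_threshold = select_by_threshold(pip)
by_fdr = select_by_bayesian_fdr(pip, alpha=0.2)
print(by_threshold, by_fdr)
```

In this toy example the FDR rule is stricter than the 0.5 cut-off: it drops the covariate with inclusion probability 0.55 because adding it would raise the expected proportion of false selections above 0.2.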

The proposed model allows the identification of both linear and non-linear effect components of M_j within the effect size of T_j for all subjects. From the posterior samples, the posterior probabilities of both linear and non-linear effect of Level 2 covariates within the effect-size of Level 1 covariate can be calculated explicitly. Due to the induction of sparsity on the effect-size components of Level 2 covariates, the effect size of each Level 2 covariate can be categorized as one out of four possible cases: linear, non-linear, both, or none.
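The four-way categorization can be written as a small decision rule. The sketch assumes that the linear and non-linear components are each declared present when their posterior inclusion probability exceeds 0.5; the cut-off, the function name, and the probabilities are hypothetical.

```python
def classify_effect(p_linear, p_nonlinear, cutoff=0.5):
    """Label a Level 2 covariate's effect as linear, non-linear, both, or none."""
    has_linear = p_linear > cutoff
    has_nonlinear = p_nonlinear > cutoff
    if has_linear and has_nonlinear:
        return "both"
    if has_linear:
        return "linear"
    if has_nonlinear:
        return "non-linear"
    return "none"


# Hypothetical posterior probabilities (linear, non-linear) for four M variables.
posterior_probs = {"M1": (0.9, 0.1), "M2": (0.2, 0.8), "M3": (0.7, 0.6), "M4": (0.1, 0.3)}
labels = {name: classify_effect(pl, pn) for name, (pl, pn) in posterior_probs.items()}
print(labels)
```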

Based on the posterior mean estimate of the effect sizes of the Level 1 covariates over a grid of quantiles, patient-specific posterior credible intervals over the quantiles can be obtained using QUANTICO. To calculate the posterior 95% credible interval of the quantile function for the ith patient, we calculate the posterior mean of Q_y(τ_l|T_i·, M_i·) over the quantile grid τ_l = 0.1l for l = 1, …, 9. Then, for quantile level τ_l, we calculate the 95-th percentile of the values $| Q_{y}^{(z)} (τ_{l} ∣ T_{i \cdot}, M_{i \cdot}) - {\hat{Q}}_{y} (τ_{l} ∣ T_{i \cdot}, M_{i \cdot}) |$ from the posterior sample, where $Q_{y}^{(z)} (τ_{l} ∣ T_{i \cdot}, M_{i \cdot})$ denotes the value of Q_y(τ_l|T_i·, M_i·) obtained in the z-th posterior sample. Thus, we derive the width of the posterior 95% credible interval at all τ_l’s. Further, by taking a dense grid of quantiles over [0, 1], patient-specific uniform posterior credible intervals can be calculated. A detailed description on the calculation of posterior point-wise and uniform credible intervals is provided in Supporting Information Section D.
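The point-wise interval construction at a single quantile level can be sketched as below. The input `draws` stands for the posterior samples $Q_{y}^{(z)} (τ_{l} ∣ T_{i \cdot}, M_{i \cdot})$ for one patient; the values used here are synthetic. The percentile is taken as an empirical order statistic, which is an assumption, since the precise definition is given in Section D of the Supporting Information.

```python
import math


def pointwise_band(draws, level=0.95):
    """Center at the posterior mean; half-width is the `level` percentile of
    the absolute deviations of the draws from that mean."""
    center = sum(draws) / len(draws)
    deviations = sorted(abs(d - center) for d in draws)
    index = min(len(deviations) - 1, math.ceil(level * len(deviations)) - 1)
    half_width = deviations[index]
    return center - half_width, center, center + half_width


# Synthetic posterior draws of the conditional quantile at one tau_l.
draws = [float(k) for k in range(1, 101)]
lower, center, upper = pointwise_band(draws)
print(lower, center, upper)
```

Repeating this calculation at every τ_l on the grid gives the patient-specific point-wise band over the quantile levels.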

5. Simulation studies

In this section, we evaluate the variable selection performance of QUANTICO across both Level 1 and Level 2 covariates and illustrate the subject-specific estimation at different quantile levels using simulation studies. The performance of QUANTICO is compared with the Varying Coefficient Quantile Regression Model (VCQRM) and variable selection in quantile regression using the lasso penalty (LASSO-QR) (Wu and Liu, 2009) as well as its Bayesian alternatives. All of these methods are variants of quantile-based VCMs and we provide explicit details of each method in Section C of the Supporting Information.

5.1. Variable selection performance

To compare the variable selection performance across both Level 1 (T) and Level 2 (M) covariates, we calculate the true positive rate (TPR), false positive rate (FPR), and area under the receiver operating characteristic (ROC) curve (AUC) separately for the T and M variables. A detailed description of the simulation design and computation details of the metrics is provided in Supporting Information Section C.
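For reference, the three metrics can be computed as in the following sketch. The AUC is computed through its rank interpretation, the probability that a truly active variable receives a higher score than an inactive one, with ties counted as one half. Function names and the toy inputs are hypothetical, and the exact scores used for the ROC curves in the paper are described in the Supporting Information.

```python
def tpr_fpr(truth, selected):
    """True and false positive rates; assumes both classes occur in truth."""
    tp = sum(1 for t, s in zip(truth, selected) if t and s)
    fp = sum(1 for t, s in zip(truth, selected) if not t and s)
    n_pos = sum(1 for t in truth if t)
    n_neg = len(truth) - n_pos
    return tp / n_pos, fp / n_neg


def auc(truth, scores):
    """Area under the ROC curve via pairwise comparisons (ties count 0.5)."""
    pos = [s for s, t in zip(scores, truth) if t]
    neg = [s for s, t in zip(scores, truth) if not t]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example: five covariates, the first two truly active.
truth = [1, 1, 0, 0, 0]
scores = [0.9, 0.4, 0.6, 0.2, 0.1]          # e.g. posterior inclusion probabilities
selected = [1 if s > 0.5 else 0 for s in scores]
tpr, fpr = tpr_fpr(truth, selected)
area = auc(truth, scores)

# A method that never sets a coefficient to zero selects everything.
tpr_all, fpr_all = tpr_fpr(truth, [1] * 5)
area_all = auc(truth, [1.0] * 5)
print(tpr, fpr, area, tpr_all, fpr_all, area_all)
```

The last case, with TPR = FPR = 1 and AUC = 0.5, is the pattern one would expect from a method that performs no thresholding of the Level 1 coefficients.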

Simulation structure.

To compare the performance of QUANTICO, VCQRM and LASSO-QR, we consider 2 scenarios with sample sizes of n = 100 and n = 200. In order to understand how the selection of the T variables is carried out in the presence of a higher number of covariates, we consider the cases p = 5, 10 for sample size scenario n = 100 and the cases p = 5, 10, 20 for the sample size scenario n = 200. For the cases where p > 5, the true model remains as given by equation (1) of the Supporting Information, so the additional T covariates considered in the p > 5 scenarios have no true effect on Y. The simulation is repeated 25 times, and each time new data are generated from the quantile function given in equation (1) of the Supporting Information. The mean and the standard deviation of the TPR, FPR, and AUC for the T and M variables are computed separately at τ = 0.1, 0.5, 0.9 for all the methods. In addition, the performance of QUANTICO in a few high-dimensional scenarios is evaluated in Section F of the Supporting Information. We observe that QUANTICO maintains a high TPR and AUC for larger values of p and q_j as well, along with a low FPR. We also compare QUANTICO with two other existing quantile regression methods, implemented in the R packages rqPen (Sherwood and Maidman, 2019) and Brq (Alhamzawi and Ali, 2020); this comparison is also provided in Section F of the Supporting Information. In general, QUANTICO outperforms rqPen and Brq.

Comparative performance evaluation.

The comparative performance of the three methods is reported in Table 1. In general, QUANTICO and VCQRM perform better than LASSO-QR. QUANTICO results in better selection of the T (Level 1) variables, and a large improvement in terms of FPR and AUC over VCQRM. In VCQRM, we do not incorporate any thresholding of the slope terms. Although it is possible to have a zero slope or intercept term without thresholding if all estimated linear, non-linear, and constant effects of all Level 2 covariates are zero, in the simulation results, the FPR of Level 1 covariates came out to be 1. We do not observe any strong pattern of differences in the comparative performance of QUANTICO and VCQRM in terms of TPR, FPR and AUC for M (Level 2) variables, since they follow the same mechanism for selection of M variables. In Table 1, QUANTICO yields very low TPR as well as FPR for Level 2 covariates for sample size scenario n = 100 despite a high AUC, which implies that QUANTICO is very conservative in its selection. In QUANTICO, the selection of Level 2 covariates is performed using a false discovery rate (FDR) based approach with cut-off α = 0.2. The performance of QUANTICO improves as n increases (as noted in Table 1).

Table 1.

Comparative performance study of QUANTICO, Varying Coefficient Quantile Regression Model (VCQRM) and LASSO Quantile Regression (LASSO-QR) methods at quantile levels τ = 0.1, 0.5, 0.9 for different sample size and number of T variable scenarios. lvl1TPR, lvl1FPR and lvl1AUC denote the True-Positive Rate, False-Positive Rate and Area Under Receiver Operating Characteristic Curve (AUC) of Level 1 covariates; and lvl2TPR, lvl2FPR and lvl2AUC denote the same for Level 2 covariates.

(n, p)	Measures	τ = 0.1			τ = 0.5			τ = 0.9
(n, p)	Measures	QUANTICO	VCQRM	LASSO-QR	QUANTICO	VCQRM	LASSO-QR	QUANTICO	VCQRM	LASSO-QR
(100,5)	lvl1TPR	0.94 (0.22)	1.00 (0.00)	1.00 (0.00)	0.97 (0.10)	1.00 (0.00)	1.00 (0.00)	0.85 (0.20)	1.00 (0.00)	0.99 (0.07)
	lvl1FPR	0.00 (0.01)	1.00 (0.00)	0.56 (0.41)	0.24 (0.18)	1.00 (0.00)	0.65 (0.33)	0.03 (0.11)	1.00 (0.00)	0.66 (0.40)
	lvl1AUC	0.97 (0.11)	0.50 (0.00)	0.94 (0.16)	0.95 (0.08)	0.50 (0.00)	0.93 (0.16)	0.94 (0.09)	0.50 (0.00)	0.91 (0.18)
	lvl2TPR	0.10 (0.25)	0.08 (0.19)	NA	0.58 (0.34)	0.48 (0.37)	NA	0.32 (0.38)	0.24 (0.39)	NA
	lvl2FPR	0.00 (0.00)	0.00 (0.00)	NA	0.05 (0.04)	0.05 (0.05)	NA	0.01 (0.01)	0.01 (0.01)	NA
	lvl2AUC	0.96 (0.09)	0.91 (0.13)	NA	0.88 (0.15)	0.89 (0.11)	NA	0.85 (0.14)	0.90 (0.13)	NA
(100,10)	lvl1TPR	0.90 (0.25)	1.00 (0.00)	1.00 (0.00)	1.00 (0.01)	1.00 (0.00)	0.98 (0.10)	0.78 (0.25)	1.00 (0.00)	0.99 (0.07)
	lvl1FPR	0.01 (0.03)	1.00 (0.00)	0.42 (0.28)	0.13 (0.09)	1.00 (0.00)	0.62 (0.30)	0.06 (0.10)	1.00 (0.00)	0.59 (0.32)
	lvl1AUC	0.98 (0.08)	0.50 (0.00)	0.96 (0.09)	0.99 (0.02)	0.50 (0.00)	0.96 (0.10)	0.90 (0.12)	0.50 (0.00)	0.91 (0.10)
	lvl2TPR	0.06 (0.17)	0.04 (0.14)	NA	0.42 (0.37)	0.40 (0.35)	NA	0.12 (0.30)	0.18 (0.32)	NA
	lvl2FPR	0.00 (0.00)	0.00 (0.00)	NA	0.02 (0.02)	0.02 (0.02)	NA	0.00 (0.01)	0.00 (0.00)	NA
	lvl2AUC	0.90 (0.14)	0.91 (0.09)	NA	0.86 (0.15)	0.88 (0.13)	NA	0.81 (0.18)	0.86 (0.16)	NA
(200,5)	lvl1TPR	1.00 (0.00)	1.00 (0.00)	1.00 (0.00)	0.99 (0.03)	1.00 (0.00)	1.00 (0.00)	1.00 (0.00)	1.00 (0.00)	1.00 (0.00)
	lvl1FPR	0.00 (0.00)	1.00 (0.00)	0.57 (0.35)	0.19 (0.16)	1.00 (0.00)	0.73 (0.24)	0.02 (0.09)	1.00 (0.00)	0.78 (0.36)
	lvl1AUC	1.00 (0.00)	0.50 (0.00)	0.99 (0.03)	0.88 (0.20)	0.50 (0.00)	0.98 (0.07)	1.00 (0.02)	0.50 (0.00)	0.99 (0.04)
	lvl2TPR	0.98 (0.10)	0.98 (0.10)	NA	0.86 (0.23)	0.88 (0.22)	NA	0.88 (0.30)	0.96 (0.20)	NA
	lvl2FPR	0.03 (0.01)	0.03 (0.01)	NA	0.10 (0.06)	0.10 (0.05)	NA	0.03 (0.02)	0.03 (0.02)	NA
	lvl2AUC	1.00 (0.00)	1.00 (0.00)	NA	0.95 (0.01)	0.94 (0.11)	NA	1.00 (0.01)	1.00 (0.01)	NA
(200,10)	lvl1TPR	1.00 (0.00)	1.00 (0.00)	1.00 (0.00)	0.99 (0.03)	1.00 (0.00)	1.00 (0.00)	0.99 (0.04)	1.00 (0.00)	1.00 (0.00)
	lvl1FPR	0.00 (0.00)	1.00 (0.00)	0.44 (0.28)	0.16 (0.09)	1.00 (0.00)	0.55 (0.31)	0.01 (0.02)	1.00 (0.00)	0.59 (0.26)
	lvl1AUC	1.00 (0.00)	0.50 (0.00)	0.99 (0.03)	0.98 (0.03)	0.50 (0.00)	0.97 (0.10)	1.00 (0.00)	0.50 (0.00)	0.98 (0.03)
	lvl2TPR	1.00 (0.00)	0.94 (0.22)	NA	0.82 (0.24)	0.80 (0.25)	NA	0.92 (0.19)	0.88 (0.26)	NA
	lvl2FPR	0.02 (0.00)	0.01 (0.01)	NA	0.06 (0.03)	0.05 (0.03)	NA	0.01 (0.01)	0.01 (0.01)	NA
	lvl2AUC	1.00 (0.00)	1.00 (0.00)	NA	0.95 (0.08)	0.95 (0.09)	NA	1.00 (0.01)	0.99 (0.04)	NA
(200,20)	lvl1TPR	1.00 (0.01)	1.00 (0.00)	1.00 (0.00)	0.93 (0.22)	1.00 (0.00)	1.00 (0.00)	0.97 (0.13)	1.00 (0.00)	1.00 (0.00)
	lvl1FPR	0.00 (0.00)	1.00 (0.00)	0.28 (0.22)	0.11 (0.06)	1.00 (0.00)	0.34 (0.25)	0.01 (0.02)	1.00 (0.00)	0.43 (0.22)
	lvl1AUC	0.99 (0.01)	0.50 (0.00)	0.99 (0.02)	0.92 (0.22)	0.50 (0.00)	1.00 (0.01)	0.99 (0.06)	0.50 (0.00)	0.98 (0.03)
	lvl2TPR	0.96 (0.20)	0.92 (0.24)	NA	0.70 (0.35)	0.68 (0.28)	NA	0.78 (0.33)	0.78 (0.33)	NA
	lvl2FPR	0.01 (0.00)	0.01 (0.00)	NA	0.03 (0.02)	0.02 (0.01)	NA	0.01 (0.01)	0.00 (0.00)	NA
	lvl2AUC	1.00 (0.00)	1.00 (0.00)	NA	0.88 (0.18)	0.95 (0.08)	NA	0.98 (0.06)	0.98 (0.09)	NA


We further report the selection and estimation performance of QUANTICO evaluated at quantile levels τ = 0.1, 0.2, …, 0.9. Figure 2 (a) illustrates the selection of T variables. QUANTICO selected the correct T variables across all quantiles. Specifically, it selected variable T₃ for the quantiles greater than 0.5 as it is in the true model. In Figure 2 (b) and (c), the corresponding true G variables have the highest posterior probability of selection for T₁ and T₂, respectively. In Figure 2 (d), (e) and (f), the true and estimated quantile functions are plotted for 3 randomly selected subjects along with the corresponding estimated 95% credible bounds.

We also compute the uniform posterior 95% credible intervals of the quantile functions. For point estimates, the observed coverage may come close to the nominal level of the credible interval; for functions, however, the observed coverage is commonly lower than the nominal level. This under-coverage of uniform posterior credible bands for smooth functions is well known and has been addressed in several articles (Cox, 1993; Knapik et al., 2011; Szabo et al., 2015; Das and Ghosal, 2017). To improve coverage, either under-smoothing or inflation of the obtained credible interval is required (Yoo and Ghosal, 2016). We therefore increase the width of the posterior uniform credible intervals using an inflation factor. As mentioned in Remark 5.4 and Theorem 5.3 of Yoo and Ghosal (2016), the inflation factor of the uniform credible interval should increase slowly to infinity as a function of the sample size. Through experimentation, we observe that the inflation factor $f(n) = 1.5 \sqrt{\log(n)}$ (which has a similar form to that considered in Das and Ghosal (2017) in the context of uniform credible bounds over quantiles) works well in simulation under various settings for the sample size and number of Level 1 covariates. A detailed discussion of the computation of the uniform posterior 95% credible intervals of the quantile functions, along with an extensive study of the computational time, is provided in Supporting Information Section D.
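The effect of the inflation factor can be sketched as follows. This is a minimal illustration under two assumptions that are not stated in the text: the logarithm is taken to be the natural logarithm, and the band is widened symmetrically about its pointwise midpoint. The function names are hypothetical; the actual construction is described in Supporting Information Section D.

```python
import math


def inflation_factor(n, c=1.5):
    """f(n) = c * sqrt(log n); natural log assumed, c = 1.5 as reported in the text."""
    return c * math.sqrt(math.log(n))


def inflate_band(lower, upper, n):
    """Widen a credible band about its pointwise midpoint by the factor f(n).

    lower, upper: band limits evaluated on a common grid of quantile levels.
    """
    f = inflation_factor(n)
    out_lo, out_hi = [], []
    for lo, hi in zip(lower, upper):
        mid = 0.5 * (lo + hi)
        half = 0.5 * (hi - lo)
        out_lo.append(mid - f * half)
        out_hi.append(mid + f * half)
    return out_lo, out_hi


for n in (100, 200):
    print(n, round(inflation_factor(n), 2))
```

Under the natural-log assumption, the factor is roughly 3.22 at n = 100 and 3.45 at n = 200, so it grows slowly with the sample size, as the cited theory requires.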

6. Case study on immune architecture of non-small cell lung cancer

6.1. Scientific problem and data description

Lung cancer is the second most common cancer and is one of the leading causes of cancer death. Though early stage patients can be treated with surgery, late stage lung cancer requires the use of systemic therapies (Ettinger et al., 2017). In recent years, immunotherapies have emerged as a successful treatment in a subset of late stage lung cancers, largely through their ability to boost the activity of T-cells, the subset of immune cells that target infected or malignant cells based on the detection of specific antigens. However, a large proportion of lung cancer patients still do not respond to immunotherapy (Doroshow et al., 2019).

The nature of the antigens specifically recognized by T-cells has been the object of intense focus, and recent work has suggested that somatic mutations harbored by the tumor can be presented to T-cells as neoantigens (Lee et al., 2018). Analysis of specific somatic mutations and overall tumor mutational burden has shown a positive association with response to immunotherapy in patients with non-small cell lung cancers (NSCLC). This supports the role of these mutations in aiding T-cell responses by increasing tumor immunogenicity. Recent technologies have emerged to sequence the T-cell receptor (TCR) to gain insight into T-cell responses, and studies have confirmed that TCR sequencing can be used to monitor immune response in various types of cancer (Page et al., 2016). In order to develop patient-specific immunotherapeutic treatment strategies for NSCLC, it is of critical importance to understand the interplay between somatic mutations, TCR variables, and the immune microenvironment, illustrated schematically in Figure 1(a).

We consider a cohort of 215 non-small cell lung cancer (NSCLC) patients recruited at UT MD Anderson Cancer Center. The tumors of these subjects were analyzed to obtain immune profiling, TCR sequencing, and mutational status of immune-related genes. We focus here on understanding the impact of mutation and TCR variables on the CD8 marker, which is the outcome variable in our analysis and is discussed in more detail below. In order to assess the effect of the explanatory variables on the outcome variable across different patients, we would like to estimate the patient-specific effect at different quantiles of the response.

As our Level 1 covariates, we take standard summary measures of TCR sequencing data, including T-cell clonality, T-cell entropy, T-cell productive proportion of cells, and T-cell richness. T-cell clonality is a measure of heterogeneity among T-cells and has been linked to patient outcomes (Reuben, Gittelman et al., 2017). T-cell entropy is Shannon’s entropy, and highly diverse samples have comparatively higher entropy. The T-cell productive proportion of cells denotes the proportion of the tumor that consists of T-cells. TCR richness is another measure of TCR heterogeneity, defined as the number of different unique sequences in the sample.
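The repertoire summaries described above can be illustrated with a short sketch that starts from per-clonotype read counts. Richness and Shannon entropy follow the definitions in the text. The clonality formula (one minus entropy normalized by its maximum) is a convention often used in TCR sequencing and is an assumption here, as are the base-2 logarithm, the function name, and the example counts.

```python
import math


def tcr_summaries(clone_counts):
    """Richness, Shannon entropy (bits), and clonality from per-clonotype counts."""
    counts = [c for c in clone_counts if c > 0]
    if not counts:
        raise ValueError("at least one clonotype with a positive count is required")
    total = sum(counts)
    richness = len(counts)                      # number of unique sequences
    freqs = [c / total for c in counts]
    entropy = -sum(p * math.log2(p) for p in freqs)
    # Assumed convention: clonality = 1 - normalized entropy
    # (0 for a perfectly even repertoire, 1 for a single clone).
    clonality = 1.0 - entropy / math.log2(richness) if richness > 1 else 1.0
    return richness, entropy, clonality


# Hypothetical repertoires: one even, one dominated by a single clone.
print(tcr_summaries([10, 10, 10, 10]))
print(tcr_summaries([97, 1, 1, 1]))
```

The two hypothetical repertoires have the same richness but very different entropy and clonality, which is why these summaries can carry distinct information as Level 1 covariates.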

As our Level 2 covariates, we consider mutation counts for the six most frequently mutated genes across the cohort, namely CSMD3, MUC16, RYR2, TP53, TTN, and USH2A, along with total tumor mutational burden (TMB), which is the total number of mutations observed per sample. As the mutation counts for individual genes are sparse (i.e., zero for a large proportion of patients), we only consider the linear and constant effects for those six variables. For TMB, linear, non-linear, and constant effects are allowed.

As our response variable, we focus on CD8 abundance as a key measure of immune activity. CD8 is a protein found on the surface of cytotoxic T cells. CD8+ T-cells have the ability to mount a response against pathogens and defend against tumors by killing transformed tumor cells (Berg and Forman, 2006), and are therefore a vital part of cancer immunity. In precision oncology, understanding the patient-specific effect of T-cell architecture on CD8+ T-cell abundance is therefore crucial.

Out of the cohort of 215 NSCLC patients, we focus on the 87 patients for whom all three data types (immune, TCR, and mutational profiling) are available. Since the TCR variables considered have differing magnitudes, it is crucial to transform them to a similar range of values to make any comparison of their effect sizes meaningful. The same applies to the mutation variables. Therefore, before applying QUANTICO, we transform the TCR variables and the CD8 immune markers to the unit interval using a log-normal cumulative distribution function (cdf) transformation (see Supporting Information Section G). The mutation variables are transformed into the unit interval using a linear transformation.
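A minimal sketch of the two kinds of transformation is given below. The exact procedure is in Supporting Information Section G and is not reproduced here; this sketch assumes that the log-normal parameters are estimated by the sample mean and standard deviation of the log values, and that the linear transformation is a min-max rescaling. Both choices, and the function names, are assumptions.

```python
import math
import statistics


def lognormal_cdf_transform(values):
    """Map positive values into (0, 1) via a log-normal cdf fitted to the sample.

    Assumes non-constant, strictly positive input.
    """
    logs = [math.log(v) for v in values]
    mu = statistics.fmean(logs)
    sigma = statistics.stdev(logs)
    # The normal cdf of the standardized log value equals the log-normal cdf of the raw value.
    return [0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0)))) for x in logs]


def minmax_transform(values):
    """Linear map of non-constant values onto [0, 1] (assumed form of the linear transformation)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical inputs on very different scales.
print(lognormal_cdf_transform([1.0, 10.0, 100.0]))
print(minmax_transform([0, 3, 6]))
```

Both maps are monotone, so they preserve the ordering of patients on each variable while placing all variables on a comparable scale.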

6.2. Results

Population level findings.

Using the proposed method, the coefficients of the TCR variables are estimated for each patient at nine quantile levels τ = 0.1, 0.2, …, 0.9. We use the same values of the hyper-parameters as in the simulation study except for the total number of iterations and the burn-in, which are taken to be 50,000 and 10,000, respectively. The average posterior probability of selection of each TCR variable is plotted in Figure 3(a). The T-cell productive proportion variable has an average posterior probability greater than 0.5 at all quantile levels except τ = 0.9, and T-cell entropy has an average posterior probability marginally higher than 0.5 at τ = 0.2. In general, the average posterior probability of the TCR variables having a non-zero effect on the CD8+ immune cell abundance decreases at higher quantile levels. This implies that for patients with a lower abundance of CD8+ cells, the number of CD8+ cells has a stronger dependence on the TCR variable measures. However, in patients with a higher density of CD8+ cells, this dependence is less prominent.

Figure 3. (a) Posterior probability of selection of the TCR variables, computed as the average of the individual posterior probabilities over the 87 patients. T-cell productive proportion is the only selected variable for quantiles τ = 0.1, 0.2, …, 0.8. No TCR variables are selected at τ = 0.9. (b) Posterior probability of the mutation variables having a non-zero effect on the coefficient of the T-cell productive proportion variable. Tumor Mutation Burden (TMB) is shown to have the highest posterior probability of a non-zero effect among all mutation variables, especially at higher quantile levels. (c) Hierarchical clustering of the patient-specific estimated coefficients of T-cell productive proportion of cells at quantile levels τ = 0.1, 0.2, …, 0.8 for the 87 NSCLC patients considered. Three clinical variables, i.e., smoking status, recurrence and vital status, are also shown.

To summarize our Level 2 findings, we assessed the effect of the mutation variables on the coefficient of the T-cell productive proportion, which was identified to be the most important among the TCR variables (Figure 3(b)). The non-zero effects of the mutation variables on the coefficient of the T-cell productive proportion are not strong at lower quantile levels, while TMB is shown to have a marginally higher posterior probability of having a non-zero effect at higher quantiles. Thus, at higher quantiles, a large proportion of the effect of the T-cell productive proportion on CD8+ immune cells is due to TMB. This is consistent with previous studies that have shown a positive correlation between tumor mutational burden and CD8+ in melanoma (Reuben, Spencer et al., 2017), where immunotherapy using checkpoint inhibitors has been shown to be successful. Hence, our findings suggest that TMB could be used for predicting the response to anti-PD-1/PD-L1 therapies in NSCLC.

Patient level findings.

Our results also provide insights into patient-specific immune profiles, offering the potential to guide the development of precision immuno-oncology treatment strategies based on patient-specific information such as smoking history and mutational profiles. As an illustration of this, in Figure 3(c), we show the estimated coefficients of the T-cell productive proportion for each subject across quantiles in the rows of a heatmap, along with patient-level covariates. In this figure, we rely on hierarchical clustering over the estimated coefficients at quantile levels τ = 0.1, …, 0.8 to group patients with similar coefficients. Focusing on the last few rows of the heatmap, it is apparent that patients with overall lower values of the coefficient of the T-cell productive proportion have higher recurrence rates of NSCLC. Layering this analysis with clinical covariates reveals that the effect size of T-cell productive proportion is in general lower for non-smokers compared to recent and former smokers (Figure S3 in Supporting Information). Reuben et al. (2020) found higher T cell clonality in current and former smokers compared to never smokers.

In Figure 4(a), we show a boxplot of the coefficients of the T-cell productive proportion for all patients, and we detect outlier patients (patient numbers 4, 45, 51, 56 and 76) at quantile level 0.1 and an overlapping set of outliers (patient numbers 40, 45, 56, and 76) at quantile level 0.3. Since this coefficient is modeled as a function of the mutation variables, we further compare the number of mutations for these 6 distinct patients (Figure 4(b-d)). In general, these outlier patients have a higher number of mutations than the cohort average, specifically for TMB and the TP53 and MUC16 genes.
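Boxplot-based outlier detection of this kind can be sketched as follows. The text does not state which boxplot convention is used, so this sketch assumes Tukey's rule, flagging values more than 1.5 interquartile ranges beyond the quartiles. The coefficient values below are invented for illustration and are not the estimated patient coefficients.

```python
import statistics


def boxplot_outliers(coefs, whisker=1.5):
    """Return 1-based indices of values outside Tukey's fences (assumed convention)."""
    q1, _, q3 = statistics.quantiles(coefs, n=4, method="inclusive")
    iqr = q3 - q1
    lo, hi = q1 - whisker * iqr, q3 + whisker * iqr
    return [i for i, c in enumerate(coefs, start=1) if c < lo or c > hi]


# Hypothetical coefficients for 8 patients at one quantile level.
coefs = [0.50, 0.52, 0.48, 0.51, 0.49, 0.10, 0.53, 0.47]
print(boxplot_outliers(coefs))
```

Applying such a rule separately at each quantile level would produce one set of flagged patients per level, and these sets can then be compared across levels.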

7. Conclusions and future work

In this paper, we propose QUANTICO with multi-level covariates where, at any specific quantile level, selection over the direct (Level 1) covariates is performed for each subject. A novel feature of the proposed model is the development of a quantile-specific varying sparsity coefficient estimation approach which allows us to explicitly delineate how at different quantiles, the response variable depends on different covariates for each subject. The proposed method also enables selection of the indirect (Level 2) covariates, and can be used for obtaining patient-specific posterior credible bands over the quantile levels.

The proposed method is used to analyze how the CD8 immune marker depends on TCR and mutation variables at different quantiles. We find that T-cell productive proportion is the most important TCR variable, and influences CD8 immune cells for most quantile levels. Out of all mutation variables considered, total tumor mutational burden (TMB) is found to be the most important. Based on the structure of the relationship between the TCR and immune variables, we identify outlier patients, who turn out to have a higher number of mutations across several critical genes of known clinical relevance in cancer. This information is potentially useful for devising effective immunologic therapies for such patients based on their unique immune architectures.

There are several potential refinements that could be made to our modeling framework. We can extend our approach to simultaneous quantile regression, where instead of estimating the model parameters at specific quantiles, the entire quantile function and associated parameters are estimated simultaneously (Yang and Tokdar, 2017; Das and Ghosal, 2018). On the theoretical side, one could investigate posterior consistency, rates of convergence and posterior contraction for such hierarchical varying-coefficient quantile regression models, building on some of the theoretical results proposed for Bayesian quantile regression by ourselves and others (Yang and Tokdar, 2017; Das and Ghosal, 2017). Given the non-trivial nature of these explorations, we leave them as future work.

Supplementary Material

Supporting information file: NIHMS1842069-supplement-supinfo.pdf (2MB, pdf)

Acknowledgments

VB was supported by NIH grants R01-CA160736, R01CA244845-01A1, R21-CA220299, and P30 CA46592, NSF grant 1463233, and start-up funds from the U-M Rogel Cancer Center and School of Public Health. CBP was supported by NIH/NCI CCSG P30CA016672 and CPRIT grant RP150521. KAD was partially supported by a CCSG NCI Grant P30 CA016672, NIH grants UL1TR003167, 5R01GM122775, the prostate cancer SPORE P50 CA140388, CPRIT grant RP160693, and the Moon Shots funding at MD Anderson Cancer Center. YN was partially supported by the NSF DMS-1918851 and NSF DMS-2112943.

Footnotes

Supporting Information

Web Appendices, Tables, and Figures referenced in Sections 2–6, along with the relevant code, are available with this paper at the Biometrics website on Wiley Online Library. The code is also available online at https://github.com/bayesrx/QUANTICO.

Data Availability Statement

Code and data: Code for the simulations and real data analysis, along with a synthetic data set designed to closely resemble our example T-cell Receptor (TCR) data, are available online at https://github.com/bayesrx/QUANTICO.

References

Alhamzawi R and Ali H (2020). Brq: an R package for Bayesian quantile regression. METRON 78, 313–328.
Baladandayuthapani V, Ji Y, Talluri R, Nieto-Barajas LE, and Morris JS (2010). Bayesian random segmentation models to identify shared copy number aberrations for array CGH data. Journal of the American Statistical Association 105, 1358–1375.
Berg RE and Forman J (2006). The role of CD8 T cells in innate immunity and in antigen non-specific protection. Current Opinion in Immunology 18, 338–343.
Biller C and Fahrmeir L (2001). Bayesian varying-coefficient models using adaptive regression splines. Statistical Modelling 1, 195–211.
Cox DD (1993). An analysis of Bayesian inference for non-parametric regression. Annals of Statistics 21, 903–923.
Das P and Ghosal S (2017). Bayesian quantile regression using random B-spline series prior. Computational Statistics & Data Analysis 109, 121–143.
Das P and Ghosal S (2018). Bayesian non-parametric simultaneous quantile regression for complete and grid data. Computational Statistics & Data Analysis 127, 172–186.
Doroshow DB, Sanmamed MF, Hastings K, Politi K, Rimm DL, Chen L et al. (2019). Immunotherapy in non–small cell lung cancer: facts and hopes. Clinical Cancer Research 25, 4592–4602.
Ettinger DS, Wood DE, Aisner DL, Akerley W, Bauman J, Chirieac LR et al. (2017). Non–small cell lung cancer, version 5.2017, NCCN clinical practice guidelines in oncology. Journal of the National Comprehensive Cancer Network 15, 504–535.
Fan J and Zhang W (1999). Statistical estimation in varying coefficient models. Annals of Statistics 27, 1491–1518.
George EI and McCulloch RE (1993). Variable selection via Gibbs sampling. Journal of the American Statistical Association 88, 881–889.
Hastie T and Tibshirani R (1993). Varying-coefficient models. Journal of the Royal Statistical Society: Series B 55, 757–796.
Honda T (2004). Quantile regression in varying coefficient models. Journal of Statistical Planning and Inference 121, 113–125.
Kakimi K, Karasaki T, Matsushita H, and Sugie T (2017). Advances in personalized cancer immunotherapy. Breast Cancer 24, 16–24.
Kim M (2007). Quantile regression with varying coefficients. Annals of Statistics 35, 92–108.
Knapik B, van der Vaart A, and van Zanten J (2011). Bayesian inverse problems with Gaussian priors. Annals of Statistics 39, 2626–2657.
Koenker R and Bassett G (1978). Regression quantiles. Econometrica 46, 33–50.
Koslovsky MD, Hoffman KL, Daniel CR, and Vannucci M (2020). A Bayesian model of microbiome data for simultaneous identification of covariate associations and prediction of phenotypic outcomes. The Annals of Applied Statistics 14, 1471–1492.
Kottas A and Gelfand AE (2001). Bayesian semiparametric median regression modeling. Journal of the American Statistical Association 96, 1458–1468.
Lee C, Yelensky R, Jooss K, and Chan TA (2018). Update on tumor neoantigens and their utility: why it is good to be different. Trends in Immunology 39, 536–548.
Ni Y, Stingo FC, and Baladandayuthapani V (2015). Bayesian nonlinear model selection for gene regulatory networks. Biometrics 71, 585–595.
Ni Y, Stingo FC, Ha MJ, Akbani R, and Baladandayuthapani V (2018). Bayesian hierarchical varying-sparsity regression models with application to cancer proteogenomics. Journal of the American Statistical Association 114, 48–60.
Page DB, Yuan J, Redmond D, Wen YH, Durack JC, Emerson R et al. (2016). Deep sequencing of T-cell receptor DNA as a biomarker of clonally expanded TILs in breast cancer after immunotherapy. Cancer Immunology Research 4, 835–844.
Park BU, Mammen E, Lee YK, and Lee ER (2013). Varying coefficient regression models: a review and new developments. International Statistical Review 83, 36–64.
Reich BJ (2012). Spatiotemporal quantile regression for detecting distributional changes in environmental processes. Journal of the Royal Statistical Society: Series C (Applied Statistics) 61, 535–553.
Reuben A, Gittelman R, Gao J, Zhang J, Yusko EC, Wu C et al. (2017). TCR repertoire intratumor heterogeneity in localized lung adenocarcinomas: An association with predicted neoantigen heterogeneity and postsurgical recurrence. Cancer Discovery 7, 1088–1097.
Reuben A, Spencer CN, Prieto PA, Gopalakrishnan V, Reddy SM, Miller JP et al. (2017). Genomic and immune heterogeneity are associated with differential responses to therapy in melanoma. NPJ Genomic Medicine 2, 1–11.
Reuben A, Zhang J, Chiou S, Gittelman RM, Li J, Lee W et al. (2020). Comprehensive T cell repertoire characterization of non-small cell lung cancer. Nature Communications 11, 1–13.
Scheipl F, Fahrmeir L, and Kneib T (2012). Spike-and-slab priors for function selection in structured additive regression models. Journal of the American Statistical Association 107, 1518–1532.
Scott JG and Berger JO (2010). Bayes and empirical-Bayes multiplicity adjustment in the variable-selection problem. Annals of Statistics 38, 2587–2619.
Sherwood B and Maidman A (2019). Penalized Quantile Regression. https://cran.r-project.org/web/packages/rqPen/rqPen.pdf.
Stingo FC, Guindani M, Vannucci M, and Calhoun VD (2013). An integrative Bayesian modeling approach to imaging genetics. Journal of the American Statistical Association 108, 876–891.
Szabo B, van der Vaart A, and van Zanten J (2015). Frequentist coverage of adaptive nonparametric Bayesian credible sets. Annals of Statistics 43, 1391–1428.
Tang Q and Wang J (2005). l₁-estimation for varying coefficient model. Statistics 39, 389–404.
Waldman AD, Fritz JM, and Lenardo MJ (2020). A guide to cancer immunotherapy: from T cell basic science to clinical practice. Nature Reviews Immunology 20, 651–668.
Walsh R and Soo R (2020). Resistance to immune checkpoint inhibitors in non-small cell lung cancer: biomarkers and therapeutic strategies. Therapeutic Advances in Medical Oncology 12.
Wang H and Xia Y (2009). Shrinkage estimation of the varying coefficient model. Journal of the American Statistical Association 104, 747–757.
Wu Y and Liu Y (2009). Variable selection in quantile regression. Statistica Sinica 19, 801–817.
Yang Y and Tokdar S (2017). Joint estimation of quantile planes over arbitrary predictor spaces. Journal of the American Statistical Association 112, 1107–1120.
Yoo WW and Ghosal S (2016). Supremum norm posterior contraction and credible sets for non-parametric multivariate regression. Annals of Statistics 44, 1069–1102.
Yu K and Moyeed R (2001). Bayesian quantile regression. Statistics & Probability Letters 54, 437–447.

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

supinfo

NIHMS1842069-supplement-supinfo.pdf^{(2MB, pdf)}

Data Availability Statement

[R1] Alhamzawi R and Ali H (2020). Brq: an R package for Bayesian quantile regression. METRON 78, 313–328. [Google Scholar]

[R2] Baladandayuthapani V, Ji Y, Talluri R, Nieto-Barajas LE, and Morris JS (2010). Bayesian random segmentation models to identify shared copy number aberrations for array CGH data. Journal of the American Statistical Association 105, 1358–1375. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R3] Berg RE and Forman J (2006). The role of CD8 T cells in innate immunity and in antigen non-specific protection. Current Opinion in Immunology 18, 338–343. [DOI] [PubMed] [Google Scholar]

[R4] Biller C and Fahrmeir L (2001). Bayesian varying-coefficient models using adaptive regression splines. Statistical Modelling 1, 195–211. [Google Scholar]

[R5] Cox DD (1993). An analysis of Bayesian inference for non-parametric regression. Annals of Statistics 21, 903–923. [Google Scholar]

[R6] Das P and Ghosal S (2017). Bayesian quantile regression using random B-spline series prior. Computational Statistics & Data Analysis 109, 121–143. [Google Scholar]

[R7] Das P and Ghosal S (2018). Bayesian non-parametric simultaneous quantile regression for complete and grid data. Computational Statistics & Data Analysis 127, 172–186. [Google Scholar]

[R8] Doroshow DB, Sanmamed MF, Hastings K, Politi K, Rimm DL, Chen L et al. (2019). Immunotherapy in non–small cell lung cancer: facts and hopes. Clinical Cancer Research 25, 4592–4602. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R9] Ettinger DS, Wood DE, Aisner DL, Akerley W, Bauman J, Chirieac LR et al. (2017). Non–small cell lung cancer, version 5.2017, NCCN clinical practice guidelines in oncology. Journal of the National Comprehensive Cancer Network 15, 504–535. [DOI] [PubMed] [Google Scholar]

[R10] Fan J and Zhang W (1999). Statistical estimation in varying coefficient models. Annals of Statistics 27, 1491–1518. [Google Scholar]

[R11] George EI and McCulloch RE (1993). Variable selection via Gibbs sampling. Journal of the American Statistical Association 88, 881–889. [Google Scholar]

[R12] Hastie T and Tibshirani R (1993). Varying-coefficient models. Journal of the Royal Statistical Society: Series B 55, 757–796. [Google Scholar]

[R13] Honda T (2004). Quantile regression in varying coefficient models. Journal of Statistical Planning and Inference 121, 113–125. [Google Scholar]

[R14] Kakimi K, Karasaki T, Matsushita H, and Sugie T (2017). Advances in personalized cancer immunotherapy. Breast Cancer 24, 16–24. [DOI] [PubMed] [Google Scholar]

[R15] Kim M (2007). Quantile regression with varying coefficients. Annals of Statistics 35, 92–108. [Google Scholar]

[R16] Knapik B, van der Vaart A, and van Zanten J (2011). Bayesian inverse problems with Gaussian priors. Annals of Statistics 39, 2626–2657. [Google Scholar]

[R17] Koenkar R and Bassett G (1978). Regression quantiles. Econometrica 46, 33–50. [Google Scholar]

[R18] Koslovsky MD, Ho man KL, Daniel CR, and Vannucci M (2020). A Bayesian model of microbiome data for simultaneous identification of covariate associations and prediction of phenotypic outcomes. The Annals of Applied Statistics 14, 1471–1492. [Google Scholar]

[R19] Kottas A and Gelfand AE (2001). Bayesian semiparametric median regression modeling. Journal of the American Statistical Association 96, 1458–1468. [Google Scholar]

[R20] Lee C, Yelensky R, Jooss K, and Chan TA (2018). Update on tumor neoantigens and their utility: why it is good to be different. Trends in Immunology 39, 536–548. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R21] Ni Y, Stingo FC, and Baladandayuthapani V (2015). Bayesian nonlinear model selection for gene regulatory networks. Biometrics 71, 585–595. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R22] Ni Y, Stingo FC, Ha MJ, Akbani R, and Baladandayuthapani V (2018). Bayesian hierarchical varying-sparsity regression models with application to cancer proteogenomics. Journal of the American Statistical Association 114, 48–60. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R23] Page DB, Yuan J, Redmond D, Wen YH, Durack JC, Emerson R et al. (2016). Deep sequencing of T-cell receptor DNA as a biomarker of clonally expanded TILs in breast cancer after immunotherapy. Cancer Immunology Research 4, 835–844. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R24] Park BU, Mammen E, Lee YK, and Lee ER (2013). Varying coefficient regression models: a review and new developments. International Statistical Review 83, 36–64. [Google Scholar]

[R25] Reich BJ (2012). Spatiotemporal quantile regression for detecting distributional changes in environmental processes. Journal of the Royal Statistical Society: Series C (Applied Statistics) 61, 535–553. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R26] Reuben A, Gittelman R, Gao J, Zhang J, Yusko EC, Wu C et al. (2017). TCR repertoire intratumor heterogeneity in localized lung adenocarcinomas: An association with predicted neoantigen heterogeneity and postsurgical recurrence. Cancer Discovery 7, 1088–1097.

[R27] Reuben A, Spencer CN, Prieto PA, Gopalakrishnan V, Reddy SM, Miller JP et al. (2017). Genomic and immune heterogeneity are associated with differential responses to therapy in melanoma. NPJ Genomic Medicine 2, 1–11.

[R28] Reuben A, Zhang J, Chiou S, Gittelman RM, Li J, Lee W et al. (2020). Comprehensive T cell repertoire characterization of non-small cell lung cancer. Nature Communications 11, 1–13.

[R29] Scheipl F, Fahrmeir L, and Kneib T (2012). Spike-and-slab priors for function selection in structured additive regression models. Journal of the American Statistical Association 107, 1518–1532.

[R30] Scott JG and Berger JO (2010). Bayes and empirical-Bayes multiplicity adjustment in the variable-selection problem. Annals of Statistics 38, 2587–2619.

[R31] Sherwood B and Maidman A (2019). Penalized Quantile Regression. https://cran.r-project.org/web/packages/rqPen/rqPen.pdf.

[R32] Stingo FC, Guindani M, Vannucci M, and Calhoun VD (2013). An integrative Bayesian modeling approach to imaging genetics. Journal of the American Statistical Association 108, 876–891.

[R33] Szabo B, van der Vaart A, and van Zanten J (2015). Frequentist coverage of adaptive nonparametric Bayesian credible sets. Annals of Statistics 43, 1391–1428.

[R34] Tang Q and Wang J (2005). l₁-estimation for varying coefficient model. Statistics 39, 389–404.

[R35] Waldman AD, Fritz JM, and Lenardo MJ (2020). A guide to cancer immunotherapy: from T cell basic science to clinical practice. Nature Reviews Immunology 20, 651–668.

[R36] Walsh R and Soo R (2020). Resistance to immune checkpoint inhibitors in non-small cell lung cancer: biomarkers and therapeutic strategies. Therapeutic Advances in Medical Oncology 12.

[R37] Wang H and Xia Y (2009). Shrinkage estimation of the varying coefficient model. Journal of the American Statistical Association 104, 747–757.

[R38] Wu Y and Liu Y (2009). Variable selection in quantile regression. Statistica Sinica 19, 801–817.

[R39] Yang Y and Tokdar S (2017). Joint estimation of quantile planes over arbitrary predictor spaces. Journal of the American Statistical Association 112, 1107–1120.

[R40] Yoo WW and Ghosal S (2016). Supremum norm posterior contraction and credible sets for non-parametric multivariate regression. Annals of Statistics 44, 1069–1102.

[R41] Yu K and Moyeed R (2001). Bayesian quantile regression. Statistics and Probability Letters 54, 437–447.
