2025 Apr 25;44(8-9):e70036. doi: 10.1002/sim.70036

Extensions of Heterogeneity in Integration and Prediction (HIP) With R Shiny Application

Jessica Butts 1, Leif Verace 1, Christine Wendt 2, Russell Bowler 3, Craig P Hersh 4, Qi Long 5, Lynn Eberly 1, Sandra E Safo 1
PMCID: PMC12023842  PMID: 40277350

ABSTRACT

Multiple data views measured on the same set of participants are becoming more common and have the potential to deepen our understanding of many complex diseases by analyzing these different views simultaneously. Equally important, many of these complex diseases show evidence of subgroup heterogeneity (e.g., by sex or race). HIP (Heterogeneity in Integration and Prediction) is among the first methods proposed to integrate multiple data views while also accounting for subgroup heterogeneity to identify common and subgroup‐specific markers of a particular disease. However, HIP is applicable only to continuous outcomes and requires programming expertise from the user. Here we propose extensions to HIP that accommodate multi‐class, Poisson, and Zero‐Inflated Poisson outcomes while retaining the benefits of HIP. Additionally, we introduce an R Shiny application, accessible on shinyapps.io at https://multi‐viewlearn.shinyapps.io/HIP_ShinyApp/, that provides an interface to the Python implementation of HIP, allowing more researchers to use the method anywhere and on any device. We applied HIP to identify genes and proteins common and specific to males and females that are associated with exacerbation frequency. Although some of the identified genes and proteins show evidence of a relationship with chronic obstructive pulmonary disease (COPD) in existing literature, others may be candidates for future research investigating their relationship with COPD. We demonstrate the use of the Shiny application with publicly available data. An R‐package for HIP is available at https://github.com/lasandrall/HIP.

Keywords: COPD, integrative analysis, multimodal, multi‐omics, multi‐view data, subgroup heterogeneity


Abbreviations

COPD

Chronic Obstructive Pulmonary Disease

HIP

Heterogeneity in Integration and Prediction

1. Introduction

Chronic obstructive pulmonary disease (COPD) is a chronic disease of the lungs and airways that affected almost 4% of the global population in 2017 [1]. Cigarette smoking is a known risk factor for COPD, but fewer than 50% of heavy smokers develop COPD [2]. Many genetic and environmental factors also influence risk [3, 4, 5, 6]. COPD research is further complicated by the subgroup heterogeneity that exists between males and females. In a meta‐analysis, female smokers had a faster annual decline in forced expiratory volume in 1 s (FEV$_1$) even if they smoked fewer cigarettes [7]. Another study found that female smokers generally had higher airway wall thickness (AWT) than male smokers [8]. Additionally, researchers found that women experienced an increased risk of hospitalization for COPD compared to men even when controlling for smoking [9].

The Genetic Epidemiology of COPD (COPDGene) Study [10] was designed to understand genetic factors related to the development of COPD. The study collected genetic and proteomic data on a subset of participants at the Phase 2 (P2) study visit. Given the availability of multi‐view data (e.g., genomic and proteomic data) and the known subgroup heterogeneity between males and females, applying an integrative analysis method that accounts for subgroup heterogeneity to the COPDGene Study data offers the opportunity to gain new insights into COPD. Butts et al. [11] recently proposed a method called HIP, short for Heterogeneity in Integration and Prediction, for integrating data from multiple sources and simultaneously predicting a continuous outcome while accounting for subgroup heterogeneity. HIP allows identification of the common and subgroup‐specific variables contributing most to the overall association among the views and to the variation in the outcome. HIP was used to investigate airway wall thickness (AWT) as a proxy for COPD severity, and the authors demonstrated that HIP could identify genes and proteins, common and specific to males and females, that were predictive of AWT. However, AWT is not the only way to characterize the effects of COPD, and one may be interested in outcome types that are not continuous.

One such outcome of interest is the number of COPD exacerbations, generally defined as acute worsenings of symptoms that require a change in treatment; these symptoms can include cough, wheezing, dyspnea, chest tightness, and decreased exercise tolerance. While exacerbations vary in severity, patients with severe exacerbations may need to be hospitalized and put on a ventilator; patients who require ICU treatment have a 43%–46% risk of death within a year of the hospitalization [12]. As such, the number of exacerbations experienced by patients is a clinically meaningful outcome. Additionally, the TORCH (Towards a Revolution in COPD Health) study found that females had a rate of exacerbations 25% higher than males during the 3‐year follow‐up [13]. This emphasizes the importance not only of examining exacerbation frequency but also of accounting for sex differences. Because exacerbation frequency is a count or rate variable, it is not compatible with the originally proposed HIP [11].

Our goal is to deepen our investigation into the molecular underpinnings of sex differences in COPD mechanisms using data from the COPDGene Study [10] by identifying genes and proteins, common and specific to males and females, related to exacerbation frequency. For this analysis, we included COPDGene Study [10] participants with COPD (defined as GOLD stage 1) at P2 and with genomic, proteomic, and selected clinical covariates (age, BMI, race, pack‐years, FEV$_1$%, AWT, and % emphysema) available. Figure 1 shows the distribution of the number of exacerbations in the past year for males and females in this subsample. First, we note a statistically significant difference in the number of events per person‐year between males and females. Second, both males and females included many participants with zero exacerbations, suggesting that a zero‐inflated Poisson (ZIP) outcome may better fit these data.

FIGURE 1. Distribution of exacerbation frequency for males and females at COPDGene Study visit P2. The figure includes data from the P2 visit of the COPDGene Study for the subset of subjects included in our analysis. There is a statistically significant difference in the number of exacerbations experienced per person‐year between males and females.

To address our research goal, the simplest approach we could consider would be to use a penalized regression method such as the Elastic Net [14] or Lasso [15]. The glmnet package [16] can implement the Lasso and Elastic Net on a binary or Poisson outcome, but these methods neither perform integrative analysis nor account for subgroup heterogeneity; they also cannot accommodate a ZIP outcome. These approaches would require concatenating data views as they can only accept a single data view; they would also require running separate analyses for each subgroup to allow for subgroup heterogeneity. These approaches are easily implemented but are not ideal for our research goal where we want to integrate data from multiple sources and associate these data with a ZIP outcome while accounting for subgroup heterogeneity.

There are several integrative analysis methods we could consider for this analysis, but to the best of our knowledge, none of them account for subgroup heterogeneity, and few can accommodate a ZIP outcome. For example, Canonical Variate Regression (CVR) [17] is a one‐step method for simultaneously associating data from multiple sources and predicting a clinical outcome. CVR can accommodate continuous, binary or Poisson outcomes but not a ZIP outcome. Sparse Integrative Discriminant Analysis (SIDA) [18] is a one‐step method for joint association and classification of data from multiple sources, but it is only applicable to classification problems (i.e., binary or multi‐class outcome). The aforementioned methods are one‐step in that the problem of associating multiple views is coupled with the problem of predicting an outcome. We could use a two‐step method to first model the associations between views and then model the Poisson or ZIP response using information from the first step. For example, we could perform canonical correlation analysis (CCA) using SELP (Sparse Estimation through Linear Programming) [19] and then use the canonical variates as predictors in a Poisson or ZIP regression model. However, these all still fail to account for subgroup heterogeneity. To use any of these integrative analysis methods, we would have to either (a) concatenate the subgroups which ignores any potential subgroup heterogeneity or (b) run a separate analysis for each subgroup which limits power, especially in a high‐dimensional data setting where the sample size is typically less than the number of variables.

Alternatively, we could consider methods that account for subgroup heterogeneity, but to the best of our knowledge, none of them perform integrative analysis. One example is the Joint Lasso [20], but this method is unable to accommodate a binary, Poisson, or ZIP outcome. More generally, methods that account for subgroup heterogeneity but do not perform integrative analysis would require either (a) concatenating data views within each subgroup which fails to model the associations between data views or (b) considering each view separately which fails to fully utilize the multi‐view data and requires combining results post hoc.

With all of these existing methods, the only way to obtain subgroup‐specific information is to fit the model separately on each subgroup, but this can greatly reduce the sample size and thus the power to detect effects. Alternatively, if these methods are run on the combined subgroup data, any potential subgroup heterogeneity is ignored. Existing methods therefore do not fully utilize the available COPDGene Study [10] data to address our interest in identifying genes and proteins common and specific to males and females related to exacerbation frequency (modeled as a ZIP outcome). Thus, we propose extensions of HIP that accommodate multi‐class, Poisson, and ZIP outcomes while preserving the existing benefits of HIP (a joint association and prediction method for integrative analysis that accounts for subgroup heterogeneity and ranks features) to identify common and subgroup‐specific features predictive of an outcome. Further, because HIP requires Python programming expertise, which limits its accessibility, we developed a web application using the R Shiny framework and hosted it on shinyapps.io, allowing HIP to be used anywhere, on any device, and by users with limited programming expertise. We believe this will increase the adoption of HIP by biomedical researchers interested in integrative analysis methods that also account for subgroup differences.

The remainder of this paper is structured as follows. In Section 2, we present the proposed methods to extend HIP. In Section 3, we describe the algorithmic implementation of this extension. In Section 4, we describe the design and results of simulations assessing the performance of HIP in comparison with existing methods. In Section 5, we apply HIP to data from the COPDGene Study [10] to examine the relationship of genes and proteins with exacerbation frequency. Section 6 introduces an R Shiny [21] application that provides a graphical interface to the underlying Python implementation of HIP. We conclude with a discussion of limitations and future work in Section 7.

2. Methods

2.1. Notation and Problem

Consider the scenario where we have $D$ data views (e.g., genomics, proteomics, clinical) measured on the same set of $N$ subjects. View $d$ has $p_d$ variables, and the subjects fall into $S$ subgroups (known a priori), with subgroup $s$ having sample size $n_s$. For our application, $S=2$ for biological sex (males and females). The total number of samples is $N = \sum_{s=1}^{S} n_s$. The data matrix for view $d$ and subgroup $s$ is denoted by $X^{d,s} \in \mathbb{R}^{n_s \times p_d}$, where rows correspond to samples and columns to variables. We define the outcome for subgroup $s$ as $Y^s$. For a multi‐class outcome, $Y^s \in \mathbb{R}^{n_s \times m}$, with $m$ the number of classes, is an indicator matrix in which each row has a one in the column corresponding to the class of the observation and a zero in all other columns. For a Poisson or ZIP outcome, $Y^s \in \mathbb{R}^{n_s \times 2}$, where the first column holds the observed counts and the second column an offset. If no offset is provided, we use a column of ones as the offset, which does not affect estimation.
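As a concrete illustration of this outcome coding, the matrices can be built as below. This is a sketch under our own naming (`make_multiclass_Y`, `make_count_Y` are not part of the HIP package):

```python
import numpy as np

# Illustrative sketch of the outcome coding described above; the
# helper names are ours, not part of the HIP implementation.

def make_multiclass_Y(labels, m):
    """Indicator matrix: row i has a one in the column of class labels[i]."""
    Y = np.zeros((len(labels), m))
    Y[np.arange(len(labels)), labels] = 1.0
    return Y

def make_count_Y(counts, offset=None):
    """Counts in the first column; offset (ones if absent) in the second."""
    counts = np.asarray(counts, dtype=float)
    if offset is None:
        offset = np.ones_like(counts)  # a unit offset leaves estimation unchanged
    return np.column_stack([counts, offset])

Y_class = make_multiclass_Y([0, 2, 1], m=3)  # n_s = 3 subjects, m = 3 classes
Y_count = make_count_Y([0, 4, 1])            # Poisson/ZIP outcome, unit offset
```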

2.2. Existing HIP Framework

Since we extend the original HIP to accommodate outcome types beyond continuous outcome(s), we briefly introduce HIP for completeness. HIP simultaneously associates data from multiple views, predicts an outcome, and ranks or identifies common and subgroup‐specific variables contributing to the overall dependency structure and the variation in an outcome by minimizing an objective function that is a sum of three terms: an association term, a hierarchical penalty term, and a prediction term. Consider the association term. HIP assumes that each view $X^{d,s}$ can be approximated by the product of a low‐rank view‐independent matrix $Z^s \in \mathbb{R}^{n_s \times K}$ and view‐specific loadings $B^{d,s} \in \mathbb{R}^{p_d \times K}$. That is, HIP assumes $X^{d,s} = Z^s {B^{d,s}}^{\mathrm{T}} + E^{d,s}$, where $Z^s$ contains latent scores describing the association across views and is shared by all views for subgroup $s$, $B^{d,s}$ are view‐ and subgroup‐specific variable loadings, and $E^{d,s}$ are the errors from approximating $X^{d,s}$ with $Z^s {B^{d,s}}^{\mathrm{T}}$. Here, $K$ is the number of latent components, typically $K \ll p_d$, resulting in data dimensionality reduction. The factor decomposition model is a popular approach to capture dependencies across variables (e.g., Fan et al. [22], Bai [23], and Chekouo and Safo [24]). It is well known that this decomposition is identifiable only up to orthonormal rotations. We note that when using latent models for prediction, rotational invariance is irrelevant: since our focus is not to interpret the regression coefficients of $Z^s$, and we do not explicitly order the latent components $Z^s$ or the factor loadings $B^{d,s}$, we are not concerned with the non‐identifiability of this model. Given this latent decomposition, $B^{d,s}$ and $Z^s$ are estimated to minimize the difference between the observed and estimated data via the loss function:

$F(X^{d,s}, Z^s, B^{d,s}) = \|X^{d,s} - Z^s {B^{d,s}}^{\mathrm{T}}\|_F^2$ (1)

where $\|\cdot\|_F$ is the Frobenius norm. We now consider the hierarchical penalty term, which allows for identifying common and subgroup‐specific variables. The hierarchical penalty implemented in HIP is a modified version of the penalty proposed in the Meta Lasso [25]. For common and subgroup‐specific variable selection, HIP decomposes $B^{d,s}$ into the element‐wise product of $G^d$ and $\Xi^{d,s}$, i.e., $B^{d,s} = G^d \cdot \Xi^{d,s}$. Here, $G^d$ allows us to model common effects across subgroups while $\Xi^{d,s}$ allows us to model heterogeneity between subgroups. Since $G^d$ is common to all subgroups, its estimation borrows strength from all subgroups and improves power, especially in high‐dimensional settings where the number of samples is smaller than the number of variables. Of note, under this reparameterization the exact values of $G^d$ and $\Xi^{d,s}$ are not identifiable, but they are not directly needed for variable ranking since ranking is based on $B^{d,s}$. HIP imposes a block $L_{2,1}$ penalty on both $G^d$ and $\Xi^{d,s}$, which encourages the selection of a variable in all $K$ components or in none of them. Specifically, the hierarchical penalty term proposed in HIP is:

$\sum_{d=1}^{D}\sum_{s=1}^{S} \mathcal{J}(B^{d,s}) = \lambda_G \sum_{d=1}^{D} \gamma_d \sum_{l=1}^{p_d} \|g_l^d\|_2 + \lambda_\xi \sum_{d=1}^{D} \gamma_d \sum_{s=1}^{S} \sum_{l=1}^{p_d} \|\xi_l^{d,s}\|_2$ (2)

where $g_l^d$ is the vector of length $K$ in the $l$th row of $G^d$ and $\xi_l^{d,s}$ is the vector of length $K$ in the $l$th row of $\Xi^{d,s}$. The strength of the penalty on $G^d$ and $\Xi^{d,s}$ is controlled by the hyperparameters $\lambda_G > 0$ and $\lambda_\xi > 0$, respectively. The user‐specified indicator $\gamma_d$ determines whether view $d$ is penalized, i.e., whether variables in view $d$ are subject to selection. This term is useful for forcing the inclusion of covariates believed to affect the outcome in the integrative analysis model.
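To make the block structure of Equation (2) concrete, the penalty can be computed as below (a numpy sketch under our own naming; the HIP package itself works in PyTorch):

```python
import numpy as np

# Sketch of the hierarchical penalty in Equation (2). G[d] is p_d x K,
# Xi[d][s] is p_d x K; gamma, lam_G, lam_xi follow the text. The row-wise
# L2 norms implement the block L2,1 penalty: a variable's K loadings are
# kept or shrunk together.

def hierarchical_penalty(G, Xi, gamma, lam_G, lam_xi):
    pen = 0.0
    for d, Gd in enumerate(G):
        pen += lam_G * gamma[d] * np.linalg.norm(Gd, axis=1).sum()
        for Xi_ds in Xi[d]:
            pen += lam_xi * gamma[d] * np.linalg.norm(Xi_ds, axis=1).sum()
    return pen

# One view (D=1), one subgroup, p_d = 2, K = 2:
G = [np.array([[3.0, 4.0], [0.0, 0.0]])]      # row norms 5 and 0
Xi = [[np.zeros((2, 2))]]
pen = hierarchical_penalty(G, Xi, gamma=[1.0], lam_G=1.0, lam_xi=1.0)  # 5.0
```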

In our previous work, we used a factor regression model to relate the outcome to the factors, and hence to the association term. In particular, we considered the loss function $F(Y^s, Z^s, \Theta, \beta_0) = \|Y^s - \mathcal{J}_{n_s}\beta_0^{\mathrm{T}} - Z^s \Theta^{\mathrm{T}}\|_F^2$, where $\Theta \in \mathbb{R}^{q \times K}$ is a matrix of regression coefficients, $\mathcal{J}_{n_s}$ is a vector of ones of length $n_s$, and $\beta_0$ is a vector of length $q$ for the intercept. $\Theta$ is common and not subgroup‐dependent; this allows the outcome data for all subgroups to be used in estimating the parameters and can improve the overall prediction of the outcome. Of note, for a single continuous outcome $q=1$, and for multiple continuous outcomes $q>1$. The constraint ${Z^s}^{\mathrm{T}} Z^s = I$ could be imposed so that the columns of $Z^s$ are uncorrelated and each of the $K$ components provides unique information. However, since our goal is not to interpret the coefficients $\Theta$ as in regression analysis but to use $Y$ to guide the estimation of the $B^{d,s}$ and to rank the variables corresponding to those coefficients, we do not impose this constraint. The presence of $Z^s$ in both the prediction and association terms is what makes HIP a one‐step method. In other words, HIP simultaneously models the association between multiple views and predicts an outcome, unlike two‐step methods that separate the association and prediction problems, which may result in $Z^s$ that are not clinically meaningful. Since the two steps are combined in HIP, the estimation of the low‐dimensional view‐independent components $Z^s$ is guided by the outcome, and therefore $Z^s$ is naturally endowed with prediction capabilities. Put together, HIP solves the following optimization problem to estimate the subgroup‐specific view‐independent components $Z^s$, the view‐specific common effects $G^d$, the view‐ and subgroup‐specific effects $\Xi^{d,s}$, and the regression coefficients $\Theta$, thereby associating multiple views, predicting a continuous outcome(s), and identifying common and subgroup‐specific variables:

$(\hat{B}^{d,s}, \hat{Z}^s, \hat{\Theta}, \hat{\beta}_0) = \min_{B^{d,s}, Z^s, \Theta, \beta_0} \sum_{s=1}^{S} F(Y^s, Z^s, \Theta, \beta_0) + \sum_{d=1}^{D}\sum_{s=1}^{S} F(X^{d,s}, Z^s, B^{d,s}) + \sum_{d=1}^{D}\sum_{s=1}^{S} \mathcal{J}(B^{d,s})$ (3)

Because the prediction term depends on the type of outcome, in the current work we redefine $F(Y^s, Z^s, \Theta, \beta_0)$ according to the outcome type. We define $F(Y^s, Z^s, \Theta, \beta_0)$ for multi‐class, Poisson, and ZIP outcomes below.

Remark 1

We note that the factor analysis model, which decomposes high‐dimensional data $X$ into $X = Z B^{\mathrm{T}} + E$, where $Z$ is a matrix of latent factors (we suppress the subscript $s$) and $E$ are errors uncorrelated with $Z$, is a popular approach to capture dependencies across variables (e.g., Fan et al. [22], Bai [23], and Chekouo and Safo [24]). The factor regression model $Y = Z\theta + e_y$ is also commonly used in the literature to relate latent factors to a continuous outcome. More recently, Fan et al. [22] proposed a factor regression model that depends on both the latent factors $Z$ and the error term $E$: $Y = Z\theta + E\beta + e_y$. The factor regression model we use is a special case of that proposed by Fan et al. [22] with $\beta = 0$. The addition of errors in the factor regression model allows the use of information beyond the linear space spanned by the predictors (i.e., beyond the latent components).

2.3. Beyond Gaussian Outcome(s): Modification of HIP Prediction Term

2.3.1. Multi‐Class Outcome

As with the continuous outcome, we relate the shared component $Z^s$ to the multi‐class outcome. For this purpose, we use the cross‐entropy loss function $F(Y^s, Z^s, \Theta, \beta_0) = -\sum_{i=1}^{n_s}\sum_{j=1}^{m} y_{ij}^s \log(a_{ij}^s)$, where $a_{ij}^s = \exp\{w_{ij}^s\} / \sum_{j'=1}^{m} \exp\{w_{ij'}^s\}$ is the softmax function generalizing logistic regression from binary to multi‐class problems. Here, $W^s = \mathcal{J}_{n_s}\beta_0 + Z^s\Theta$ represents the scores, $\mathcal{J}_{n_s}$ is an $n_s \times 1$ vector of ones, $\Theta \in \mathbb{R}^{K \times m}$, and $w_{ij}^s$ is the $ij$th entry of $W^s$. The softmax function forces each row of probabilities to sum to 1, so that entry $j$ of row $i$ represents the probability that case $i$ belongs to class $j$.
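A minimal numpy sketch of this loss (ours, for illustration; the HIP implementation computes it in PyTorch):

```python
import numpy as np

# Multi-class prediction term: scores W = beta0 + Z Theta, row-wise
# softmax, then cross-entropy against the indicator matrix Y.

def multiclass_loss(Y, Z, Theta, beta0):
    W = beta0 + Z @ Theta                      # n_s x m score matrix
    W = W - W.max(axis=1, keepdims=True)       # shift for numerical stability
    A = np.exp(W) / np.exp(W).sum(axis=1, keepdims=True)  # softmax rows
    return float(-(Y * np.log(A)).sum())       # cross-entropy

# With all-zero scores every class has probability 1/m, so the loss is n log m:
Y = np.array([[1., 0.], [0., 1.], [1., 0.], [0., 1.]])
loss = multiclass_loss(Y, np.zeros((4, 3)), np.zeros((3, 2)), np.zeros(2))
```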

2.3.2. Poisson Outcome

For a Poisson outcome, we define the prediction term $F(y^s, Z^s, \Theta, \beta_0) = \sum_{i=1}^{n_s} \left\{ -y_i^s\left[\log(t_i^s) + \beta_0 + Z_i^s\Theta\right] + t_i^s \exp(\beta_0 + Z_i^s\Theta) + \log(y_i^s!) \right\}$, which is the negative log‐likelihood for a Poisson regression with an offset and a log link. Here, $y_i^s$ is the outcome for the $i$th subject in subgroup $s$, $t_i^s$ is the corresponding offset, and $Z_i^s$ is the $i$th row of $Z^s$. In this case $\Theta \in \mathbb{R}^{K \times 1}$ and $\beta_0 \in \mathbb{R}$.
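A direct numpy transcription of this term (our sketch; $\log(y!)$ is computed via the log‐gamma function):

```python
import math
import numpy as np

# Poisson negative log-likelihood with offset t and log link, as above.

def poisson_loss(y, t, Z, Theta, beta0):
    eta = beta0 + Z @ Theta                                  # linear predictor
    log_mu = np.log(t) + eta                                 # log mean with offset
    log_fact = np.array([math.lgamma(v + 1.0) for v in y])   # log(y!)
    return float(np.sum(-y * log_mu + t * np.exp(eta) + log_fact))

# One subject with y = 0, t = 1, and zero linear predictor: loss = mu = 1.
loss = poisson_loss(np.array([0.0]), np.array([1.0]),
                    np.zeros((1, 2)), np.zeros(2), 0.0)
```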

2.3.3. Zero‐Inflated Poisson Outcome

As mentioned in Section 1, the motivating data set has a ZIP outcome rather than a true Poisson outcome. To accommodate this, we add a loss function based on Lambert [26]. In this model, the observed outcome comes from the zero state with probability $\tau$ and from a Poisson random variable with probability $1-\tau$. We assume that covariates are related only to the Poisson mean ($\lambda$) and not to $\tau$, and that there is no relationship between $\lambda$ and $\tau$. Thus, $\log(\lambda_i) = \log(t_i) + \beta_0 + Z_i^s\Theta$. This is a simplifying assumption: while we could miss a covariate effect on $\tau$ that is truly present, the assumption does not impose any constraints on the Poisson part of the model. Using this distribution, we use the negative log‐likelihood to define the loss function as:

$F(Y^s, Z^s, \Theta, \beta_0) = -\sum_{y_i^s=0} \log\left(\exp[\tau] + \exp\left[-t_i^s \exp(\beta_0 + Z_i^s\Theta)\right]\right) - \sum_{y_i^s>0} \left(y_i^s\left[\log(t_i^s) + \beta_0 + Z_i^s\Theta\right] - t_i^s \exp\left[\beta_0 + Z_i^s\Theta\right]\right) + \sum_{i=1}^{n_s} \left(\log(y_i^s!) + \log\left[1 + \exp(\tau)\right]\right)$ (4)

This requires estimating the additional parameter $\tau$, which is described in Section 3. Again $\Theta \in \mathbb{R}^{K \times 1}$ and $\beta_0 \in \mathbb{R}$.
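For completeness, Equation (4) can be transcribed as follows (our sketch; `tau` enters through $\exp(\tau)$ and $\log(1+\exp(\tau))$ exactly as written above):

```python
import math
import numpy as np

# ZIP negative log-likelihood following Equation (4).

def zip_loss(y, t, Z, Theta, beta0, tau):
    eta = beta0 + Z @ Theta
    mu = t * np.exp(eta)                       # Poisson mean with offset
    zero = (y == 0)
    ll_zero = np.log(np.exp(tau) + np.exp(-mu[zero])).sum()
    ll_pos = (y[~zero] * (np.log(t[~zero]) + eta[~zero]) - mu[~zero]).sum()
    log_fact = sum(math.lgamma(v + 1.0) for v in y)
    norm = len(y) * math.log(1.0 + math.exp(tau))
    return float(-ll_zero - ll_pos + log_fact + norm)

# One zero count with t = 1, linear predictor 0, and tau = 0:
loss = zip_loss(np.array([0.0]), np.array([1.0]),
                np.zeros((1, 2)), np.zeros(2), 0.0, 0.0)
```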

2.4. Prediction

Suppose we have test data $X_{\text{test}}^{d,s}$, $s=1,\ldots,S$, $d=1,\ldots,D$. Our goal in this section is to use the estimated parameters $\hat{B}^{d,s}$, $\hat{\Theta}$, and $\hat{\beta}_0$ together with the test data $X_{\text{test}}^{d,s}$ to predict the test outcome $\hat{Y}^s$ for subgroup $s$. We first estimate the test shared component for subgroup $s$, $\hat{Z}_{\text{pred}}^s$, by solving the optimization problem:

$\hat{Z}_{\text{pred}}^s = \min_{Z^s} \sum_{d=1}^{D} F(X_{\text{test}}^{d,s}, Z^s, \hat{B}^{d,s}) = \min_{Z^s} \sum_{d=1}^{D} \|X_{\text{test}}^{d,s} - Z^s {\hat{B}^{d,s}}{}^{\mathrm{T}}\|_F^2$ (5)

Let $X_{\text{cat}}^s$ be the $n_s \times (p_1 + \cdots + p_D)$ matrix that concatenates all $D$ test views for subgroup $s$, i.e., $X_{\text{cat}}^s = [X_{\text{test}}^{1,s}, \ldots, X_{\text{test}}^{D,s}]$. Similarly, let $\hat{B}_{\text{cat}}^s = [\hat{B}^{1,s}; \ldots; \hat{B}^{D,s}]$ be the $(p_1 + \cdots + p_D) \times K$ matrix of stacked variable coefficients. Then the solution to optimization problem (5) is $\hat{Z}_{\text{pred}}^s = X_{\text{cat}}^s \hat{B}_{\text{cat}}^s ({\hat{B}_{\text{cat}}^{s}}{}^{\mathrm{T}} \hat{B}_{\text{cat}}^s)^{-1}$ for $s = 1, \ldots, S$. Given $\hat{Z}_{\text{pred}}^s$, we predict the test outcome as follows:

$\hat{y}_i^s = \begin{cases} \arg\max_j \exp\{\hat{w}_{ij}^s\} \big/ \sum_{j'=1}^{m} \exp\{\hat{w}_{ij'}^s\} & \text{Multi-class} \\ t_i^s \exp(\hat{\beta}_0 + \hat{Z}_{\text{pred},i}^s \hat{\Theta}) & \text{Poisson} \\ (1 - \hat{\tau})\, t_i^s \exp(\hat{\beta}_0 + \hat{Z}_{\text{pred},i}^s \hat{\Theta}) & \text{ZIP} \end{cases}$

where $\hat{Z}_{\text{pred},i}^s$ is the $i$th row of $\hat{Z}_{\text{pred}}^s$ and $\hat{w}_{ij}^s$ is the $ij$th element of $\hat{W}^s = \mathcal{J}_{n_s}\hat{\beta}_0 + \hat{Z}_{\text{pred}}^s\hat{\Theta}$.
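The closed‐form solution to Equation (5) can be sketched in a few lines (ours for illustration, checked on noise‐free data):

```python
import numpy as np

# Sketch of the closed-form solution to Equation (5): concatenate the
# test views and the estimated loadings, then compute
# Z_hat = X_cat B_cat (B_cat^T B_cat)^{-1}.

def predict_Z(X_test, B_hat):
    X_cat = np.hstack(X_test)                  # n_s x (p_1 + ... + p_D)
    B_cat = np.vstack(B_hat)                   # (p_1 + ... + p_D) x K
    return X_cat @ B_cat @ np.linalg.inv(B_cat.T @ B_cat)

# Noise-free check: if X^{d,s} = Z B^{d,s T} exactly, Z is recovered.
rng = np.random.default_rng(0)
Z = rng.normal(size=(5, 2))
B1, B2 = rng.normal(size=(4, 2)), rng.normal(size=(3, 2))
Z_hat = predict_Z([Z @ B1.T, Z @ B2.T], [B1, B2])
```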

3. Algorithm

3.1. Optimizing Parameters and Variable Ranking

We use an alternating minimization algorithm to solve optimization problem (3). We initialize the entries of $Z^{s(0)}$ by randomly sampling from a $U(0.9, 1.1)$ distribution. We initialize the entries of $G^{d(0)}$ for $d=1,\ldots,D$, $\Theta^{(0)}$, and $\beta_0^{(0)}$ with ones. We initialize $\Xi^{d,s(0)}$ as $\Xi^{d,s(0)} = [({Z^{s(0)}}^{\mathrm{T}} Z^{s(0)})^{-1} {Z^{s(0)}}^{\mathrm{T}} X_{\text{train}}^{d,s}]^{\mathrm{T}}$.

We estimate $\hat{Z}^{s(t)}$ at iteration $t$ by solving the following problem using gradient descent, with gradients calculated using PyTorch [27]:

$\hat{Z}^{s(t)} = \min_{Z^s} \sum_{s=1}^{S} F(Y^s, Z^s, \Theta^{(t-1)}, \beta_0^{(t-1)}) + \sum_{d=1}^{D}\sum_{s=1}^{S} \|X^{d,s} - Z^s {B^{d,s(t-1)}}^{\mathrm{T}}\|_F^2$ (6)

where $F(Y^s, Z^s, \Theta^{(t-1)}, \beta_0^{(t-1)})$ depends on the type of outcome. We use FISTA (fast iterative shrinkage‐thresholding algorithm) with backtracking [28] to speed up convergence and to select an appropriate step size. For a ZIP outcome, further consideration must be given to the additional parameter $\tau$, the probability that a given observation is in the zero state. Note that this differs from $P(y_i^s = 0) = \tau + (1-\tau)e^{-\lambda_i}$, since $P(y_i^s = 0)$ includes both the probability that the observation is in the zero state and the probability of a zero from a Poisson distribution with mean $\lambda_i$. We initialize $\tau$ using the observed proportion of excess zeros beyond the proportion predicted by the Poisson model across all observations [26], shown in Equation (7). Specifically, the first term in the numerator is the observed number of zeros, and the second term is $P(y_i^s = 0 \mid \lambda_i)$ with $\log(\lambda_i) = \beta_0 + Z_i^s\Theta$, summed across all observations. In the sub‐optimizations for $Z^s$ and $\Theta/\beta_0$, $\tau$ is treated as a fixed value. Once all other model estimates have been updated, $\tau$ is recalculated with the current estimates of $Z^s$, $\Theta$, and $\beta_0$ as:

$\hat{p}_0 = \dfrac{\sum_{s=1}^{S}\sum_{i=1}^{n_s} I(y_i^s = 0) - \sum_{s=1}^{S}\sum_{i=1}^{n_s} \exp[-\exp(\beta_0 + Z_i^s\Theta)]}{\sum_{s=1}^{S} n_s}$ (7)
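The Equation (7) update can be computed directly, pooling subgroups as above (our sketch; names are illustrative):

```python
import numpy as np

# Initialization/update of the excess-zero proportion in Equation (7):
# observed zeros minus the zero probability implied by the Poisson part,
# divided by the pooled sample size. ys and etas are lists over subgroups,
# with etas[s] holding beta0 + Z_i Theta for each subject.

def excess_zero_proportion(ys, etas):
    n = sum(len(y) for y in ys)
    observed = sum((y == 0).sum() for y in ys)
    expected = sum(np.exp(-np.exp(eta)).sum() for eta in etas)
    return float((observed - expected) / n)

# One subgroup, four subjects, all with beta0 + Z_i Theta = 0:
p0 = excess_zero_proportion([np.array([0, 0, 1, 2])], [np.zeros(4)])
```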

To estimate $B^{d,s(t)}$, we first estimate $G^{d(t)}$ for each of the $D$ data views. For fixed $\Xi^{d,s(t-1)}$, we solve the following optimization problem using the Adagrad [29] optimizer in PyTorch [27]:

$\hat{G}^{d(t)} = \min_{G^d \in \mathbb{R}^{p_d \times K}} \sum_{s=1}^{S} \|X^{d,s} - Z^{s(t)} (G^d \cdot \Xi^{d,s(t-1)})^{\mathrm{T}}\|_F^2 + \lambda_G \gamma_d \sum_{l=1}^{p_d} \|g_l^d\|_2$ (8)

Convergence is defined by the relative change in Equation (8) evaluated at $\hat{G}^{d(t)}$ and $\hat{G}^{d(t-1)}$. We then use the updated estimates of $Z^s$ and $G^d$ to estimate $\Xi^{d,s(t)}$ by solving the optimization problem:

$\hat{\Xi}^{d,s(t)} = \min_{\Xi^{d,s} \in \mathbb{R}^{p_d \times K}} \|X^{d,s} - Z^{s(t)} (G^{d(t)} \cdot \Xi^{d,s})^{\mathrm{T}}\|_F^2 + \lambda_\xi \gamma_d \sum_{l=1}^{p_d} \|\xi_l^{d,s}\|_2$ (9)

We use the same technique described for the optimization of $G^d$, with an analogous convergence criterion defined as the relative change in Equation (9) evaluated at $\hat{\Xi}^{d,s(t)}$ and $\hat{\Xi}^{d,s(t-1)}$.

We note that because we use an automatic differentiation algorithm, the $L_{2,1}$ (or block $l_2/l_1$) penalty does not result in exactly zero coefficients. However, the magnitudes of the coefficients in $\hat{B}^{d,s}$ for the noise variables are clearly smaller than those of the signal variables. Thus, we rank and identify relevant variables by the magnitude of the $L_2$ norm of the rows of $\hat{B}^{d,s}$. In implementing the ranking procedure, the user specifies the number of variables (denoted $N_{\text{top}}$) they wish to keep, which can vary across data views. We run Algorithm 1 on the full training data and select the $N_{\text{top}}$ variables for each view and subgroup based on the estimated $B^{d,s}$. Then we run Algorithm 1 a second time with only the selected variables. The parameters estimated from this "subset" of the data are used in the prediction procedure described in Section 2.4.
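The ranking rule reduces to a few lines (our sketch; `top_variables` is an illustrative name):

```python
import numpy as np

# Rank variables by the L2 norm of their row in B_hat (p_d x K) and
# return the indices of the N_top largest, as described above.

def top_variables(B_hat, n_top):
    scores = np.linalg.norm(B_hat, axis=1)       # one score per variable
    return np.argsort(scores)[::-1][:n_top]      # largest first

B_hat = np.array([[0.1, 0.0],    # near-zero row: likely noise
                  [3.0, 4.0],    # row norm 5: strong signal
                  [1.0, 0.0]])   # row norm 1
keep = top_variables(B_hat, n_top=2)
```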

ALGORITHM 1. Overview of optimization algorithm.


We estimate $\Theta$ and $\beta_0$ using $\hat{Z}^{s(t)}$ by solving the optimization problem:

$(\hat{\Theta}^{(t)}, \hat{\beta}_0^{(t)}) = \min_{\Theta, \beta_0} \sum_{s=1}^{S} F(Y^s, Z^{s(t)}, \Theta, \beta_0)$

We use ISTA (iterative shrinkage‐thresholding algorithm) with backtracking [28] to select an appropriate step size. The convergence criterion is the relative change in the objective above evaluated at $(\hat{\Theta}^{(t)}, \hat{\beta}_0^{(t)})$ and $(\hat{\Theta}^{(t-1)}, \hat{\beta}_0^{(t-1)})$.

3.2. Tuning Parameters

The optimization problem depends on $\lambda = (\lambda_G, \lambda_\xi)$ and $K$. We use grid and random [30] searches to select $\lambda$, and we follow the automatic or scree plot approaches to selecting $K$ described in Butts et al. [11]. However, we add eBIC (extended Bayesian Information Criterion) [31] as an additional model selection criterion. This criterion modifies the priors used in the traditional BIC (Bayesian Information Criterion) to prevent larger probabilities from being assigned to models with more covariates. If we consider the set $\mathcal{M}_q$ of models with $q$ covariates, then for the $j$th model $\mathcal{M}_{qj} \in \mathcal{M}_q$ with maximum likelihood parameter estimates $\hat{\theta}(\mathcal{M}_{qj})$, eBIC is defined as

$\mathrm{eBIC}_\delta(\mathcal{M}_{qj}) = -2 \log \ell_n(\hat{\theta}(\mathcal{M}_{qj})) + \nu(\mathcal{M}_{qj}) \log(n) + 2\delta \log(\kappa(\mathcal{M}_q))$

for $0 \le \delta \le 1$, where $\nu(\mathcal{M}_{qj})$ is the number of estimated parameters, $\kappa(\mathcal{M}_q)$ is the number of models in $\mathcal{M}_q$, and $n$ is the number of observations. When $\delta = 0$, eBIC is equivalent to the standard BIC, and $\delta = 1$ ensures consistency when the number of covariates is very large.

The eBIC criterion for HIP is defined in Equation (10). We use $F(Y^s, \hat{Z}^s, \hat{\Theta}, \hat{\beta}_0)$ in the first term as it is proportional to the negative log‐likelihood. For the number of parameters $\nu$, we count the number of variables that will be included in the subset model fit for HIP. A variable is included in the subset fit if it is one of the $N_{\text{top}}$ variables selected in at least one subgroup, where $N_{\text{top}}$ is user‐specified (e.g., the top 10% of variables or the top 50 variables) and can vary by data view. Thus, we define $\nu(\hat{B}) = \sum_{d=1}^{D}\sum_{l=1}^{p_d} I(\hat{B}_l^{d,s} \text{ is among the } N_{\text{top}} \text{ for any } s = 1, \ldots, S)$; note this value is the same for all subgroups. For the number of models in $\mathcal{M}_q$, we note that a given view could include as few as $N_{\text{top}}$ variables, when all subgroups select the same set, and as many as $S \cdot N_{\text{top}}$ variables, when all subgroups select distinct sets. Thus, to count the size of $\mathcal{M}_q$ with $q = N_{\text{top}}$, we sum the numbers of ways of choosing each possibility between $N_{\text{top}}$ and $S \cdot N_{\text{top}}$ from the $p_d$ variables in the view. The code returns values for $\mathrm{eBIC}_0$, $\mathrm{eBIC}_{0.5}$, and $\mathrm{eBIC}_1$:

$\mathrm{eBIC}_\delta = 2\sum_{s=1}^{S} F(Y^s, \hat{Z}^s, \hat{\Theta}, \hat{\beta}_0) + \sum_{s=1}^{S} \log(n_s)\, \nu(\hat{B}) + 2\delta \sum_{d=1}^{D} \log\left[\sum_{w=N_{\text{top}}}^{S \cdot N_{\text{top}}} \binom{p_d}{w}\right]$ (10)
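The last term of Equation (10) needs only the binomial counts; a sketch of that term follows (ours; Python integers handle the large binomial coefficients exactly):

```python
import math

# Model-count term of Equation (10): for each view, the number of
# models selecting between N_top and S*N_top of the p_d variables,
# then 2*delta times the sum of the log counts.

def ebic_model_term(p_dims, n_top, S, delta):
    total = 0.0
    for p_d in p_dims:
        count = sum(math.comb(p_d, w) for w in range(n_top, S * n_top + 1))
        total += math.log(count)
    return 2.0 * delta * total

# Toy check: p_d = 5, N_top = 1, S = 2 gives C(5,1) + C(5,2) = 15 models.
term = ebic_model_term([5], n_top=1, S=2, delta=1.0)
```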

4. Simulations

4.1. Set‐Up

Simulations were run for binary, Poisson, and ZIP outcomes. For all outcomes, there were two views and two subgroups, i.e., $D = S = 2$, with $n_1 = 250$ subjects in the first subgroup and $n_2 = 260$ in the second. We considered two scenarios for the degree of subgroup heterogeneity: Full Overlap and Partial Overlap. In both scenarios, 50 variables in each view are important for each subgroup. In the Full Overlap scenario, the 50 important variables are the same for both subgroups; in the Partial Overlap scenario, 25 of the important variables are common to both subgroups and 25 are unique to each subgroup. For each outcome and scenario, we considered two sets of variable dimensions. In the low‐dimensional setting, the first view had $p_1 = 300$ variables and the second $p_2 = 350$. In the high‐dimensional setting, the first view had $p_1 = 2000$ variables and the second $p_2 = 3000$. We also considered settings where (i) each variable in each view was binary (i.e., non‐Gaussian) and (ii) the sample sizes of the two subgroups were unbalanced, to investigate the robustness of the proposed method. Please refer to the Supporting Information for the results of these scenarios.

4.2. Data Generation

The data were generated following a process based on Luo et al. [17]. Entries in rows of $B^{d,s}$ corresponding to a signal variable were drawn from $U(-1, -0.5) \cup U(0.5, 1)$; otherwise, the entry was set to 0. The columns of each $B^{d,s}$ were then orthonormalized using a QR decomposition. The entries of $Z^s$ were drawn from $N(\mu = 25.0, \sigma = 3.0)$ and the entries of $E^{d,s}$ from $N(\mu = 0.0, \sigma = 1.0)$. The covariate matrices were then formed as $X^{d,s} = Z^s {B^{d,s}}^{\mathrm{T}} + E^{d,s}$.
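This generation scheme can be sketched with small dimensions (ours for illustration; note that the reduced QR keeps the zero rows of $B$ at zero, since the column space lies entirely in the signal coordinates):

```python
import numpy as np

# Simulate one view for one subgroup following the recipe above,
# with toy dimensions instead of the paper's.
rng = np.random.default_rng(1)
n_s, p_d, K, n_signal = 20, 30, 2, 5

B = np.zeros((p_d, K))
signs = rng.choice([-1.0, 1.0], size=(n_signal, K))        # U(-1,-0.5) ∪ U(0.5,1)
B[:n_signal] = signs * rng.uniform(0.5, 1.0, size=(n_signal, K))
Q, _ = np.linalg.qr(B)                                     # orthonormal columns

Z = rng.normal(loc=25.0, scale=3.0, size=(n_s, K))         # latent scores
E = rng.normal(loc=0.0, scale=1.0, size=(n_s, p_d))        # noise
X = Z @ Q.T + E                                            # X = Z B^T + E
```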

At this point, the processes diverge for the different outcomes. For the binary outcome, we apply the softmax function to $W^s = \mathcal{J}_{n_s}\beta_0 + Z^s\Theta + E_y^s$, where $\mathcal{J}_{n_s}$ is an $n_s \times 1$ vector of ones and $E_y^s$ is an $n_s \times 2$ matrix of standard normal errors, and assign each observation the class with the largest probability. Here $\beta_0 = (0.5, 0.5)$ and $\Theta$ is the $2 \times 2$ matrix with rows $(1.0, 0.5)$ and $(0.2, 0.8)$. For the Poisson and ZIP outcomes, we first standardize the columns of $Z^s$ to have mean 0 and variance 1. The observations are then generated from the PyTorch [27] Poisson random variable generator with mean $\exp(\mathcal{J}_{n_s}\beta_0 + Z^s\Theta)$, i.e., $y_i^s \sim \text{Poisson}(\exp(\beta_0 + Z_i^s\Theta))$. Here $\beta_0 = 2.0$ and $\Theta = (0.7, 0.2)^{\mathrm{T}}$. For the ZIP outcome, each observation is then multiplied by a draw from a Bernoulli distribution that is 0 with probability $\tau = 0.25$.

4.3. Comparison Methods

For all outcomes under consideration (binary, Poisson, and ZIP), we are unaware of any other methods that perform integrative analysis while also accounting for subgroup heterogeneity. For the binary outcome, we compare HIP to two integrative analysis methods: CVR [17], as implemented in the R package CVR [32], and SIDA [18], as implemented in the R package mvlearnR [33]. Because neither of these methods accounts for subgroup heterogeneity, we implement each in two ways: (1) with all subgroups concatenated within each view (Concatenated) and (2) with a separate model for each subgroup (Subgroup). We also compare HIP to the Lasso [15] and the Elastic Net [14] as implemented in the R package glmnet [16]. Neither of these methods performs integrative analysis, so the two views are concatenated for them. We again implement each in two ways: (1) concatenating the subgroups (Concatenated) and (2) fitting separate models for each subgroup (Subgroup).

For the Poisson outcome, SIDA is no longer applicable, so we instead compare HIP to the two‐step integrative analysis method SELPCCA [19] as implemented in the R package mvlearnR [33]. Since SELPCCA does not account for subgroup heterogeneity, we again fit concatenated and subgroup models (Concatenated SELPCCA and Subgroup SELPCCA, respectively).

For the ZIP outcome, CVR, Lasso, and Elastic Net cannot explicitly account for a ZIP outcome, but we still fit these models using a Poisson family. We also fit HIP specifying a Poisson outcome [HIP (Grid)‐Poisson and HIP (Random)‐Poisson] to demonstrate the importance of accounting for the zero‐inflated nature of the data. Finally, because SELPCCA is a two‐step method, we use the canonical variables from SELPCCA in a ZIP regression model fit with the zeroinfl function [34] in R package pscl [35]. We again fit a model on the concatenated subgroups (Concatenated SELPCCA‐ZIP) and separate models for each subgroup (Subgroup SELPCCA‐ZIP).

Tuning parameters for the comparison methods were selected using 10‐fold cross‐validation. For the Elastic Net, we set α=0.5. Tuning parameters for HIP were selected using eBIC1; we considered a range of (0,2] for λξ and λG with 8 steps for each. The true values for K (i.e., 2) and Ntop (i.e., 50) were used for all simulations.
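For illustration, the candidate (λξ, λG) pairs for the grid search over (0, 2] with 8 steps per parameter, and a simple random-search alternative, might be constructed as follows (a sketch only; HIP's actual search implementation may differ):

```python
import itertools
import random

def lambda_grid(lo=0.0, hi=2.0, steps=8):
    """8 evenly spaced values in (lo, hi], paired for (lambda_xi, lambda_G)."""
    step = (hi - lo) / steps
    values = [lo + step * (i + 1) for i in range(steps)]  # excludes lo, includes hi
    return list(itertools.product(values, values))        # 8 x 8 = 64 candidate pairs

def lambda_random(n_draws=8, lo=0.0, hi=2.0, seed=1):
    """Random-search alternative: sample pairs uniformly from (lo, hi]^2."""
    rng = random.Random(seed)
    return [(rng.uniform(lo, hi), rng.uniform(lo, hi)) for _ in range(n_draws)]
```

Each candidate pair would then be scored with eBIC and the best-scoring model retained.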

4.4. Evaluation Measures

We compare HIP to the existing methods in terms of variable selection and prediction ability for new data. For variable selection, we estimate the true positive rate (TPR = TP / (TP + FN)), false positive rate (FPR = FP / (TN + FP)), and F1 score (F1 = TP / (TP + ½(FP + FN))), where TP, FP, TN, and FN denote the numbers of true positives, false positives, true negatives, and false negatives; all three measures are constrained between 0 and 1. Ideally, TPR and F1 are 1 and FPR is 0.
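The three selection measures can be computed directly from selection indicators; a minimal sketch:

```python
def selection_metrics(selected, truth):
    """TPR, FPR, and F1 for variable selection.

    `selected` and `truth` are boolean sequences over the same variables:
    whether each variable was selected, and whether it is a true signal."""
    tp = sum(s and t for s, t in zip(selected, truth))
    fp = sum(s and not t for s, t in zip(selected, truth))
    fn = sum(t and not s for s, t in zip(selected, truth))
    tn = sum(not s and not t for s, t in zip(selected, truth))
    tpr = tp / (tp + fn) if tp + fn else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    f1 = tp / (tp + 0.5 * (fp + fn)) if tp + fp + fn else 0.0
    return tpr, fpr, f1
```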

To compare predictive ability for the binary outcome, we use classification accuracy; for both the Poisson and ZIP outcomes, we use the fraction of deviance explained, D² = (D_null − D_opt) / D_null, where D_null is the deviance of the null model and D_opt is the deviance of the model with optimal tuning parameters [36]. Results are averaged over 20 Monte Carlo data sets.
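The fraction of deviance explained can be sketched for a Poisson model as follows (the exact deviance formula each package uses is an assumption here; this uses the standard Poisson deviance with an intercept-only null model):

```python
import math

def poisson_deviance(y, mu):
    """Poisson deviance: 2 * sum[ y*log(y/mu) - (y - mu) ], with y*log(y/mu) = 0 when y = 0."""
    dev = 0.0
    for yi, mi in zip(y, mu):
        term = yi * math.log(yi / mi) if yi > 0 else 0.0
        dev += 2.0 * (term - (yi - mi))
    return dev

def deviance_explained(y, mu_fit):
    """D^2 = (D_null - D_fit) / D_null, where the null model predicts mean(y) for all observations."""
    ybar = sum(y) / len(y)
    d_null = poisson_deviance(y, [ybar] * len(y))
    d_fit = poisson_deviance(y, mu_fit)
    return (d_null - d_fit) / d_null
```

A perfect fit gives D² = 1, and a fit no better than the null model gives D² = 0.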

4.5. Results

We focus on the results for the ZIP outcome here as the motivating data have a ZIP outcome (Figures 2 and 3); results for the ZIP outcome with Full Overlap scenario as well as binary and Poisson outcomes are presented in the Supporting Information: Figures S1–S8.

FIGURE 2.

Results for ZIP outcome, full overlap scenario. The first row corresponds to the low dimension (p1 = 300, p2 = 350) and the second to the high dimension (p1 = 2 000, p2 = 3 000). For all settings, n1=250 and n2=260. The right column is the fraction of deviance explained (D2), so a higher value indicates better performance. Results are mean ± one standard deviation summarized across 20 generated data sets.

FIGURE 3.

Performance results for ZIP outcome, partial overlap scenario. The first row corresponds to the low dimension (p1 = 300, p2 = 350) and the second to the high dimension (p1 = 2 000, p2 = 3 000). For all settings, n1=250 and n2=260. The right column is the fraction of deviance explained (D2), so a higher value indicates better performance. Results are mean ± one standard deviation summarized across 20 generated data sets.

First, we note that the performance of HIP (Grid) and HIP (Random) is very similar in both the low‐ and high‐dimensional settings for both the Full and Partial Overlap scenarios (Figures 2 and 3, respectively), so we recommend HIP (Random) as it is computationally faster. In terms of variable selection, HIP (Grid) and HIP (Random) show the highest TPR and F1 values; all methods show low FPRs. The ZIP model improves variable selection only slightly over the HIP Poisson fits. The other two integrative methods, CVR and SELPCCA‐ZIP, show similar variable selection performance to each other and a benefit over the non‐integrative methods (Elastic Net and Lasso), but they do not perform nearly as well as HIP. All of the comparison methods appear to miss many of the true signal variables, as evidenced by their lower TPRs.

In terms of predictive ability in the Full Overlap scenario, HIP and the SELPCCA‐ZIP models have a much higher fraction of deviance explained than the methods that do not account for the zero‐inflated nature of the data, highlighting the importance of doing so. In the Partial Overlap scenario, Concatenated SELPCCA‐ZIP has a much lower D2 than in the Full Overlap scenario, suggesting that this method has more difficulty when subgroup heterogeneity exists in the data. HIP has the highest fraction of deviance explained of all methods in both the low‐ and high‐dimensional settings and in both the Full and Partial Overlap scenarios, indicating that HIP remains favorable even if subgroup heterogeneity does not exist in the data.

Computation times for all methods are summarized for the Full and Partial Overlap scenarios in Supporting Information: Figures S9 and S10, respectively. The Lasso and Elastic Net are the fastest in all scenarios, but because these methods neither perform integrative analysis nor account for subgroup heterogeneity, their variable selection and predictive ability suffer. Of the methods that perform integrative analysis, HIP (Random) is consistently the fastest, with larger computational advantages in the high‐dimensional setting, particularly compared to CVR. Concatenated SELPCCA‐ZIP has computation times similar to HIP (Random) in the Partial Overlap setting, but this is the setting where the performance of Concatenated SELPCCA‐ZIP dropped dramatically.

5. Application to Exacerbation Frequency

5.1. Goals

In this section, our goal is to use the genetic and proteomic data from the COPDGene Study [10] in combination with clinical data to gain new insights into the molecular architecture of COPD in males and females. To be included in the analyses, participants had to have COPD at P2 (defined as GOLD stage 1) and have proteomic, genomic, and selected clinical covariates (age, BMI, race, pack‐years, FEV1%, AWT, and % emphysema) available. There were N = 1374 participants meeting these criteria, with n1 = 780 males and n2 = 594 females; demographic characteristics of this sample are in Table 1. Continuous variables were compared between males and females using t‐tests and categorical variables using χ2 tests. Participants were predominantly non‐Hispanic white; there were no statistically significant sex differences in age, BMI, BODE index, percentage of current smokers, or percentage with diabetes. Males and females differed in their exacerbation frequency (p < 0.001) but did not have significantly different lung function as measured by mean FEV1% predicted. Given the available data and the sex differences in exacerbation frequency, we first identify genes and proteins, common and specific to males and females, associated with exacerbation frequency. Using those identified genes and proteins, we then explore pathways enriched for males and females.

TABLE 1.

COPDGene participant characteristics.

Variable | Males (N = 782) | Females (N = 594) | p
Age | 68.28 (8.35) | 68.03 (8.36) | 0.581
BMI | 28.03 (5.62) | 27.69 (6.59) | 0.317
FEV1% Predicted | 61.94 (22.97) | 62.91 (22.59) | 0.431
BODE index | 2.45 (2.45) | 2.63 (2.38) | 0.176
% Emphysema | 11.30 (11.86) | 9.39 (11.45) | 0.003
Pack years | 53.05 (26.63) | 47.57 (24.99) | < 0.001
Airway wall thickness | 1.17 (0.23) | 1.00 (0.21) | < 0.001
Non‐Hispanic White (%) | 82 | 78 | 0.084
Current Smoker (%) | 66 | 65 | 0.510
Diabetes (%) | 17 | 14 | 0.204

Note: The measurements presented were collected at the Year 5 study visit to align with the collection of the proteomic and genomic data.

Abbreviations: BMI = Body Mass Index; BODE = Body mass index, airflow Obstruction, Dyspnea, and Exercise capacity; COPD = Chronic Obstructive Pulmonary Disease; FEV1 = Forced Expiratory Volume in 1 s.

5.2. Applying HIP and Existing Methods

The COPDGene data include 19,263 genes and 4,979 proteins. We first performed unsupervised filtering to select the 5,000 genes and 2,000 proteins with the largest variances. To preserve as much generalizability as possible, we then randomly split the data 50 times into train (75%) and test (25%) data sets, keeping the proportions of males and females the same. Genes and proteins that were consistently selected across these splits were considered "stable", as these would be most likely to show consistent findings. Within each split, we performed supervised filtering on the training data by regressing exacerbation frequency on each of the genes and proteins retained after the unsupervised filtering. We used the zeroinfl function [34] with the current gene or protein as the only predictor in the count model and only an intercept in the zero model. Genes and proteins with an uncorrected p‐value < 0.10 were included in the models. Because this supervised filtering was repeated for each split, a different set of variables could enter the models for each split of the data.
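A stratified 75/25 split that preserves the subgroup proportions, as described above, can be sketched with the standard library (a sketch with our own names, not the paper's code):

```python
import random

def stratified_split(subgroup_labels, train_frac=0.75, seed=0):
    """Return (train, test) index lists, splitting each subgroup train_frac / (1 - train_frac)."""
    rng = random.Random(seed)
    by_group = {}
    for i, g in enumerate(subgroup_labels):
        by_group.setdefault(g, []).append(i)
    train, test = [], []
    for idx in by_group.values():
        rng.shuffle(idx)                      # randomize within subgroup
        cut = round(train_frac * len(idx))    # per-subgroup train size
        train += idx[:cut]
        test += idx[cut:]
    return sorted(train), sorted(test)
```

Repeating this with 50 different seeds gives the 50 Train/Test splits used in the stability analysis.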

For existing methods, we fit the subgroup implementations described in Section 4.3 for CVR (using a Poisson family), SELPCCA‐ZIP, Lasso (using a Poisson family), and Elastic Net (using a Poisson family). For HIP, we applied HIP (Grid) and HIP (Random) using the ZIP family and the Poisson family [HIP (Grid)‐Poisson and HIP (Random)‐Poisson]. Additionally, we fit HIP using the ZIP family with an additional clinical data view that was not penalized [HIP (Grid)‐ZIP+Clinical and HIP (Random)‐ZIP+Clinical]. We used the training data to select tuning parameters and calculate training D2 and then used the test data to calculate a test D2.

We defined the "stable" genes and proteins by ranking them on the product of (a) the number of splits in which the variable was among the Ntop variables and (b) that count divided by the number of splits in which the variable entered the models. For each method, the top 1% of genes and proteins for males and females under this ranking were identified as the "stable" genes and proteins.
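The stability ranking above can be sketched directly (variable names and data structures here are ours for illustration):

```python
def stability_scores(top_counts, entered_counts):
    """Rank variables by (times in Ntop) * (times in Ntop / times entered).

    top_counts[v]: number of splits where variable v was among the Ntop variables.
    entered_counts[v]: number of splits where v passed supervised filtering."""
    scores = {}
    for v, entered in entered_counts.items():
        top = top_counts.get(v, 0)
        scores[v] = top * (top / entered) if entered else 0.0
    # Highest score first; the top 1% of this list would be the "stable" variables
    return sorted(scores, key=scores.get, reverse=True)
```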

To select tuning parameters, CVR, SELPCCA, Lasso, and Elastic Net used 10‐fold cross‐validation. For HIP, when we used the λ range (0,2] from the simulations, the upper bound was consistently selected, so we increased the upper bound until this no longer happened, resulting in a range of (0,15] for λG and λξ. The best model was selected using eBIC1. We also needed to specify K, the number of latent components for HIP, and an equivalent parameter for CVR. The automatic approach specified in Butts et al. [11] with a threshold of 0.20 on the concatenated data suggested K = 3. On the separate Xd,s, it suggested K = 3 for the genes and K = 1 for the proteins. Scree plots for the concatenated and separate data are in Supporting Information: Figure S24. We thus selected K = 3 components for HIP and CVR. For HIP, we set Ntop = 125 genes and 50 proteins.
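One simple heuristic in the spirit of a thresholded scree rule can be sketched as follows. Note this is our assumed reading of the 0.20 threshold (components whose singular values exceed 0.20 times the largest), not necessarily the exact rule of the automatic approach in Butts et al. [11]:

```python
import numpy as np

def choose_K(X, threshold=0.20):
    """Heuristic: count components whose singular value exceeds `threshold`
    times the largest singular value of the (centered/concatenated) data.
    This thresholding rule is an assumption, not the published algorithm."""
    s = np.linalg.svd(X, compute_uv=False)  # singular values, descending
    return int(np.sum(s / s[0] > threshold))
```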

5.3. Results

5.3.1. Fraction of Deviance Explained and Computation Times

Figure 4 shows violin plots of the train and test fraction of deviance explained for the 50 Train/Test data splits. As expected, HIP (using the ZIP model) and SELPCCA‐ZIP, the two methods that account for the zero‐inflated nature of the data, have the best predictive ability. The number of variables selected in the 50 Train/Test splits is summarized in Supporting Information: Table S1. Similarly, Supporting Information: Figure S25 shows violin plots of the run times for each method. Of the methods performing integrative analysis, HIP (Random), regardless of family, tended to have the fastest computation times, although SELPCCA‐ZIP often had similar run times.

FIGURE 4.

Fraction of deviance explained across 50 data splits of the COPDGene data. Violin plots display the distribution of the fraction of deviance explained across the 50 data splits. For each of CVR, SELPCCA‐ZIP, Lasso, and Elastic Net, there is a violin plot for each subgroup as comparator methods require separate models for each subgroup to allow for possible subgroup heterogeneity.

5.3.2. Selected Genes and Proteins

Supporting Information: Table S2 shows the number of "stable" genes and proteins common and specific to males and females identified by each method. Supporting Information: Figure S26 shows the overlap of "stable" genes and proteins selected for males and females by each method. There are few overlaps in the "stable" genes and proteins, but the overlaps that do occur tend to involve the methods that perform integrative analysis (i.e., HIP, CVR, and SELPCCA), suggesting that integrative methods may yield more reproducible findings.

HIP (Random) and HIP (Grid) showed strong agreement in the “stable” genes and proteins (Supporting Information: Table S3) which also supports the use of the random search instead of the grid search. The “stable” genes selected by HIP (Random) for males and females and their average estimated weights are in Supporting Information: Tables S4 and S5 respectively. Analogous results for proteins are presented in Supporting Information: Table S6. The weights in these tables are averages of the estimated weights (L2 norm of rows in B^d,s) from the subset fit in the splits where the variable was included in the subset fit (i.e., the variable was in Ntop for at least one subgroup). Supporting Information: Tables S7–S9 list the “stable” gene IDs and proteins selected by the competing methods.
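Given an estimated B̂_{d,s}, the per-variable weights used in these tables (L2 norms of its rows) and the within-view, within-subgroup ranking can be sketched as (names are ours):

```python
import numpy as np

def variable_weights(B_hat):
    """Variable importance as the L2 norm of each row of an estimated B_{d,s}."""
    return np.linalg.norm(B_hat, axis=1)

def rank_variables(B_hat, names):
    """(name, weight) pairs sorted by decreasing weight within one view and subgroup."""
    w = variable_weights(B_hat)
    order = np.argsort(-w)  # indices of weights, largest first
    return [(names[i], float(w[i])) for i in order]
```

Averaging these weights over the splits in which a variable entered the subset fit gives the values reported in Supporting Information: Tables S4–S6.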

The top gene for males was activating signal cointegrator 1 complex subunit (ASCC2). Wilson et al. [37] looked for genes that were differentially expressed in COPD patients with and without cachexia (a loss of weight and muscle) and identified ASCC2 in a sample of 400 COPDGene Study participants and replicated the finding in a sample of 114 participants from the ECLIPSE Study; cachexia occurs more frequently in those with more advanced COPD. The top gene for females was dematin actin binding protein (DMTN). Lee et al. [38] measured DNA methylation on 100 participants (60 with and 40 without COPD) from a Korean COPD Cohort and identified DMTN (also known as EPB49) as a differentially methylated region when comparing current smokers to never smokers; the authors also note this gene has been identified in previous epigenome‐wide association studies of smoking. Thus this gene may be a candidate for future research to investigate the relationship with COPD specifically.

The top protein for males was SHC adaptor protein 1 (SHC1) (ranked fifth for females). Li et al. [39] ranked candidate genes based on gene risk scores, where a higher rank indicated a stronger relationship to COPD; the SHC1 gene was ranked third out of 200 candidate genes and had lower expression levels in patients with COPD compared to healthy controls. The top protein for females was amyloid beta precursor protein (APP) (ranked second for males). Almansa et al. [40] compared gene expression levels between 12 patients with COPD requiring treatment in the ICU and 16 patients with COPD who were admitted to the hospital for treatment but did not require the ICU; the patients admitted to the ICU showed higher expression levels of APP.

5.3.3. Pathway Analysis

We tested for overrepresentation of pathways in our “stable” proteins and genes for males and females using Ingenuity Pathway Analysis [41]. Table 2 shows the top 10 canonical pathways for genes and proteins for males and females.

TABLE 2.

Top canonical pathways from IPA enrichment analysis.

View | Subgroup | Canonical pathway | Molecules | Unadjusted p
Genes | Males | Iron homeostasis signaling pathway | CDC34, FECH, SLC25A37 | 0.003
Genes | Males | Methylglyoxal Degradation I | HAGH | 0.006
Genes | Males | Heme Biosynthesis from Uroporphyrinogen‐III I | FECH | 0.008
Genes | Males | Pentose Phosphate Pathway (Non‐oxidative Branch) | RPIA | 0.013
Genes | Males | Heme Biosynthesis II | FECH | 0.019
Genes | Males | Pentose Phosphate Pathway | RPIA | 0.023
Genes | Males | Erythropoietin Signaling Pathway | BCL2L1, GATA1 | 0.054
Genes | Males | ID1 Signaling Pathway | BCL2L1, GSPT1 | 0.068
Genes | Males | Sertoli Cell‐Sertoli Cell Junction Signaling | SPTB, YBX3 | 0.069
Genes | Males | Autophagy | GABARAPL2, SLC1A5 | 0.076
Genes | Females | Heme Biosynthesis II | ALAS2, FECH | < 0.001
Genes | Females | Iron homeostasis signaling pathway | ALAS2, CDC34, FECH, SLC25A37 | < 0.001
Genes | Females | Methylglyoxal Degradation I | HAGH | 0.006
Genes | Females | Heme Biosynthesis from Uroporphyrinogen‐III I | FECH | 0.008
Genes | Females | Tetrapyrrole Biosynthesis II | ALAS2 | 0.010
Genes | Females | Hypoxia Signaling in the Cardiovascular System | CDC34, UBE2H | 0.011
Genes | Females | Pentose Phosphate Pathway (Non‐oxidative Branch) | RPIA | 0.013
Genes | Females | Pentose Phosphate Pathway | RPIA | 0.023
Genes | Females | Erythropoietin Signaling Pathway | BCL2L1, GATA1 | 0.054
Genes | Females | ID1 Signaling Pathway | BCL2L1, GSPT1 | 0.068
Proteins | Males | Role of JAK2 in Hormone‐like Cytokine Signaling | EPO, LEP, PTPN6 | < 0.001
Proteins | Males | White Adipose Tissue Browning Pathway | BDNF, LEP, NPPB | < 0.001
Proteins | Males | Erythropoietin Signaling Pathway | EPO, LEP, PTPN6 | < 0.001
Proteins | Males | Serotonin Receptor Signaling | ADIPOQ, BDNF, LEP, NPPB | < 0.001
Proteins | Males | AMPK Signaling | ADIPOQ, INS, LEP | 0.001
Proteins | Males | Leptin Signaling in Obesity | INS, LEP | 0.002
Proteins | Males | IL‐3 Signaling | PPP3R1, PTPN6 | 0.002
Proteins | Males | Maturity Onset Diabetes of Young (MODY) Signaling | ADIPOQ, INS | 0.002
Proteins | Males | Thyroid Cancer Signaling | BDNF, INS | 0.002
Proteins | Males | ABRA Signaling Pathway | NPPB, PPP3R1 | 0.003
Proteins | Females | Granulocyte Adhesion and Diapedesis | PF4, PPBP, TNFRSF1A | < 0.001
Proteins | Females | Agranulocyte Adhesion and Diapedesis | PF4, PPBP, TNFRSF1A | < 0.001
Proteins | Females | Wound Healing Signaling Pathway | EGF, PF4, TNFRSF1A | < 0.001
Proteins | Females | Huntington's Disease Signaling | BDNF, CPLX2, EGF | 0.001
Proteins | Females | Pathogen Induced Cytokine Storm Signaling Pathway | PF4, PPBP, TNFRSF1A | 0.002
Proteins | Females | Glioma Signaling | CDKN2D, EGF | 0.004
Proteins | Females | Type II Diabetes Mellitus Signaling | ADIPOQ, TNFRSF1A | 0.005
Proteins | Females | Axonal Guidance Signaling | BDNF, EGF, PAPPA | 0.005
Proteins | Females | Tumor Microenvironment Pathway | EGF, TNFRSF1A | 0.007
Proteins | Females | Regulation of the Epithelial‐Mesenchymal Transition by Growth Factors Pathway | EGF, TNFRSF1A | 0.008

There were both common and subgroup‐specific gene pathways for males and females. The top gene pathway for males was the CLEAR Signaling Pathway, and the top gene pathway for females was the STAT3 pathway. The STAT3 pathway is known to be involved in inflammatory responses to many diseases including COPD [42]. The Iron homeostasis signaling pathway that was found in the AWT analysis in Butts et al. [11] for both males and females was again present in the top gene pathways for males and females.

There was complete overlap in the top 10 protein pathways for males and females; this makes some sense because, of the 20 stable proteins identified for males and females, 17 overlapped. For both males and females, the top pathway was the wound healing signaling pathway, followed by PDGF signaling. The PDGF family includes PDGFA (platelet‐derived growth factor subunit A) and PDGFB (platelet‐derived growth factor subunit B) and is associated with wound healing; when PDGFs are found outside the context of wound healing, they seem to contribute to many diseases [43]. Though these authors focus on asthma rather than COPD, they also state that PDGF is expressed in airway epithelial cells and PDGFB is expressed in inflamed airway tissue.

6. R Shiny Application

While HIP and its extension to additional outcomes could be applied in many research areas (within and outside of human health), it is implemented in Python, which may be a barrier to researchers without a coding background. We introduce an R Shiny [21] application that provides a graphical interface to apply HIP to data that is uploaded, simulated, or available within the application. In this section, we provide a brief overview and highlights of the application. A detailed example of using the application is in the Supporting Information. The application is accessible on shinyapps.io at https://multi‐viewlearn.shinyapps.io/HIP_ShinyApp/.

The Shiny application has three tabs. The first tab, "About", provides a brief overview of the method and related links (Figure 5a). The second tab, "HIP", is where the user sets up the data and parameters for the method. We provide options for the user to upload their data, simulate data (based on our simulation examples), or implement HIP on an example of COVID‐19 data (Figure 5b). The third tab, "Results", is where the user submits the analysis and sees results. This tab produces outputs of the model, including prediction information and variable importance, to help the user understand the results.

FIGURE 5.

HIP Shiny application screens.

Once the “Run Analysis” button has been clicked, a progress notification (Supporting Information: Figure S32) will appear in the lower right corner of the screen so the user knows the analysis has started. The user will know the analysis is complete when the progress notification disappears and the run time is displayed to the right of the “Run Analysis” button. After the analysis is complete, the left column in the “Result Summary” section will provide a button to download results, and the right column will display some basic information about the results for the user including the convergence status, the λ values used, the eBIC0 or other selection criterion value, the run time, and the applicable training prediction metric (Supporting Information: Figure S33). We also show prediction metrics for the training data and the test data if they are available. If there are no test data available, a message will be printed for the user stating so. The prediction metric is the mean squared error for Gaussian outcomes, classification accuracy for multi‐class outcomes, and fraction of deviance explained for Poisson and ZIP outcomes. The appropriate metric and formula are displayed in a gold box in the left column for users.

The weights from the estimated Bd,s matrices are displayed in two formats. The first is a variable importance plot that includes all variables that were included in the subset model fit (Supporting Information: Figure S35). The variables are ranked based on the weights within each view and subgroup. The second output is an interactive table with columns for the rank (within view and subgroup), variable, weight, view, and subgroup (Supporting Information: Figure S36). This table can be sorted by or filtered on any column so that the user can explore the results. Some possible options that could be of interest include selecting the top‐ranked variable(s) for each view and subgroup, selecting a specific variable to compare between subgroups, or selecting all results for a single view and subgroup (Supporting Information: Figures S37–S39).

In general, user inputs are in gray boxes, generated outputs have a white background, and information for the user is in gold boxes. Please refer to the Supporting Information for detailed descriptions of HIP and an illustration of HIP on publicly available data on COVID‐19.

7. Conclusion

In this paper, we have extended HIP as proposed in Butts et al. [11] to accommodate multi‐class, Poisson, and zero‐inflated Poisson (ZIP) outcomes allowing researchers to investigate additional clinically relevant outcomes in an integrative analysis framework that also accounts for subgroup heterogeneity. We retain the benefits of a joint association and prediction method for data integration to select clinically meaningful subgroup‐specific and common features and the ability to include clinical covariates. In simulations, HIP demonstrated improved variable selection abilities for binary, Poisson, and ZIP outcomes compared to existing methods. While all methods showed similar classification accuracy in the binary outcome simulations, HIP showed small improvements in D2 in the Poisson outcome simulations and substantial improvements in D2 in the ZIP outcome simulations. When applied to data from the COPDGene Study, we were able to identify common and subgroup‐specific genes and proteins associated with exacerbation frequency; previous literature has identified at least some of these as being related to COPD.

The R Shiny Application developed here provides a user‐friendly interface to apply HIP to any data with multiple data views, measured on pre‐specified subgroups, being used to predict an outcome. In the first tab, the application introduces HIP and describes why HIP may be a desirable method to consider for analyzing a multi‐view data set with potential subgroup heterogeneity. In the second tab, the application guides the user through inputting data, selecting an appropriate K, and setting parameters for running HIP. Finally, the third tab produces outputs of the model including prediction information and variable importance to help the user understand the results.

There are still some limitations to HIP requiring further research. As in the originally proposed HIP [11], Ntop, the number of variables to keep for each view, must be specified. In simulations we know the true value of Ntop, but in applications to real data we cannot. Using simulations, we investigated performance when Ntop was set below and above the truth. Variable selection performance declined compared to when Ntop was set to the truth, but the impact of setting Ntop below the true number was smaller than setting it above, particularly in the high‐dimensional scenarios. In general, Ntop had less impact on variable selection in the low‐dimensional scenario when it was not set to the true value. HIP was still competitive, producing better or comparable variable selection and prediction estimates. We recommend that users investigate different cutoff points for their specific applications, weighing prediction estimates against model parsimony.

The inclusion of clinical covariates in the COPDGene application did not improve predictive performance on the test data. The reason for this is unclear. One possibility is that the variables in the clinical data do not explain additional variation in the outcome beyond the genes and proteins. Another is that the large difference in the number of variables in this view compared to the other views affects the estimation. As another limitation, we focused on identifying joint or shared latent components among views for each subgroup; however, some applications may have view‐ and subgroup‐specific variation that our method does not capture.
Future work could consider extensions of HIP that model both joint and view‐specific latent components for each subgroup. The factor regression model proposed by Fan et al. [22], which uses information beyond the linear space spanned by the predictors (i.e., beyond the latent components), is appealing, but its benefits in our model are uncertain; future studies could explore this further. We have also assumed that each of the predictors comes from a Gaussian distribution. This assumption might not hold in practice, as real‐world data can have different underlying distributions. Future work could therefore explore alternative loss functions beyond the Frobenius norm to better accommodate various data distributions, potentially improving model robustness and predictive accuracy. Despite these limitations, our extension to HIP allows researchers to explore a wide variety of new questions by considering multi‐class, Poisson, or ZIP outcomes that are clinically meaningful.

Conflicts of Interest

The authors declare no conflicts of interest.

Supporting information

Supporting information.


Acknowledgments

This work was supported by NIGMS grant 1R35GM142695, NHLBI grants U01HL089897 and U01HL089856, and by NIH contract 75N92023D00011. The COPDGene study (NCT00608764) is also supported by the COPD Foundation through contributions made to an Industry Advisory Committee that has included AstraZeneca, Bayer Pharmaceuticals, Boehringer‐Ingelheim, Genentech, GlaxoSmithKline, Novartis, Pfizer, and Sunovion.

Funding: This study was supported by the National Institutes of Health (Grant Numbers: 1R35GM142695, 75N92023D00011, U01HL089856, U01HL089897).

Data Availability Statement

Access to the clinical and genomic data can be requested through dbGaP (IDs: phs000951.v4.p4 and phs000179.v6.p2). The proteomic data can be requested from the COPDGene Study Group (http://www.copdgene.org/). The Python source code and R‐package for implementing the methods and generating simulated data along with README files is available on GitHub at https://github.com/lasandrall/HIP. A Shiny Application of HIP for users with limited programming expertise can be found at https://multi‐viewlearn.shinyapps.io/HIP_ShinyApp/.

References

  • 1. Soriano J. B., Kendrick P. J., Paulson K. R., Gupta V., and Abrams E. M., “Prevalence and Attributable Health Burden of Chronic Respiratory Diseases, 1990–2017: A Systematic Analysis for the Global Burden of Disease Study 2017,” Lancet Respiratory Medicine 8, no. 6 (2020): 585–596, 10.1016/S2213-2600(20)30105-3. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 2. Agustí A., Celli B. R., Criner G. J., et al., “Global Initiative for Chronic Obstructive Lung Disease 2023 Report: GOLD Executive Summary,” American Journal of Respiratory and Critical Care Medicine 207, no. 7 (2023): 819–837. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 3. Silverman E. K., “Genetics of COPD,” Annual Review of Physiology 82, no. 1 (2020): 413–431, 10.1146/annurev-physiol-021317-121224. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 4. Hardin M. and Silverman E. K., “Chronic Obstructive Pulmonary Disease Genetics: A Review of the Past and a Look Into the Future,” Chronic Obstructive Pulmonary Diseases: Journal of the COPD Foundation 1, no. 1 (2014): 33. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 5. Hu G., Zhou Y., Tian J., et al., “Risk of COPD From Exposure to Biomass Smoke: A Metaanalysis,” Chest 138, no. 1 (2010): 20–31. [DOI] [PubMed] [Google Scholar]
  • 6. Chung K. and Adcock I., “Multifaceted Mechanisms in COPD: Inflammation, Immunity, and Tissue Repair and Destruction,” European Respiratory Journal 31, no. 6 (2008): 1334–1356. [DOI] [PubMed] [Google Scholar]
  • 7. Gan W. Q., Man S. P., Postma D. S., Camp P., and Sin D. D., “Female Smokers Beyond the Perimenopausal Period Are at Increased Risk of Chronic Obstructive Pulmonary Disease: A Systematic Review and Meta‐Analysis,” Respiratory Research 7, no. 1 (2006): 1–9. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 8. Kim Y. I., Schroeder J., Lynch D., et al., “Gender Differences of Airway Dimensions in Anatomically Matched Sites on CT in Smokers,” COPD: Journal of Chronic Obstructive Pulmonary Disease 8, no. 4 (2011): 285–292. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 9. Prescott E., Bjerg A., Andersen P., Lange P., and Vestbo J., “Gender Difference in Smoking Effects on Lung Function and Risk of Hospitalization for COPD: Results From a Danish Longitudinal Population Study,” European Respiratory Journal 10, no. 4 (1997): 822–827. [PubMed] [Google Scholar]
  • 10. Regan E. A., Hokanson J. E., Murphy J. R., et al., “Genetic Epidemiology of COPD (COPDGene) Study Design,” COPD: Journal of Chronic Obstructive Pulmonary Disease 7, no. 1 (2011): 32–43. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 11. Butts J., Wendt C., Bowler R. P., et al., “HIP: A Method for High‐Dimensional Multi‐View Data Integration and Prediction Accounting for Subgroup Heterogeneity,” Briefings in Bioinformatics 25, no. 6 (2024): bbae470. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 12. Evenson A., “Management of COPD Exacerbations,” American Family Physician 81, no. 5 (2010): 607–613. [PubMed] [Google Scholar]
  • 13. Celli B., Vestbo J., Jenkins C. R., et al., “Sex Differences in Mortality and Clinical Expressions of Patients With Chronic Obstructive Pulmonary Disease,” American Journal of Respiratory and Critical Care Medicine 183, no. 3 (2011): 317–322PMID: 20813884, 10.1164/rccm.201004-0665OC. [DOI] [PubMed] [Google Scholar]
  • 14. Zou H. and Hastie T., “Regularization and Variable Selection via the Elastic Net,” Journal of the Royal Statistical Society, Series B 67 (2005): 301–320.
  • 15. Tibshirani R., “Regression Shrinkage and Selection via the Lasso,” Journal of the Royal Statistical Society, Series B 58 (1996): 267–288.
  • 16. Friedman J., Hastie T., and Tibshirani R., “Regularization Paths for Generalized Linear Models via Coordinate Descent,” Journal of Statistical Software 33, no. 1 (2010): 1–22.
  • 17. Luo C., Liu J., Dey D. K., and Chen K., “Canonical Variate Regression,” Biostatistics 17, no. 3 (2016): 468–483.
  • 18. Safo S. E., Min E. J., and Haine L., “Sparse Linear Discriminant Analysis for Multiview Structured Data,” Biometrics 78, no. 2 (2022): 612–623, 10.1111/biom.13458.
  • 19. Safo S. E., Ahn J., Jeon Y., and Jung S., “Sparse Generalized Eigenvalue Problem With Application to Canonical Correlation Analysis for Integrative Analysis of Methylation and Gene Expression Data,” Biometrics 74, no. 4 (2018): 1362–1371, 10.1111/biom.12886.
  • 20. Dondelinger F., Mukherjee S., and the Alzheimer's Disease Neuroimaging Initiative, “The Joint Lasso: High‐Dimensional Regression for Group Structured Data,” Biostatistics 21, no. 2 (2018): 219–235, 10.1093/biostatistics/kxy035.
  • 21. Chang W., Cheng J., Allaire J., et al., “Shiny: Web Application Framework for R,” 2023, R package version 1.7.4.1.
  • 22. Fan J., Lou Z., and Yu M., “Are Latent Factor Regression and Sparse Regression Adequate?,” Journal of the American Statistical Association 119, no. 546 (2023): 1076–1088.
  • 23. Bai J., “Inferential Theory for Factor Models of Large Dimensions,” Econometrica 71, no. 1 (2003): 135–171.
  • 24. Chekouo T. and Safo S. E., “Bayesian Integrative Analysis and Prediction With Application to Atherosclerosis Cardiovascular Disease,” Biostatistics 24, no. 1 (2023): 124–139.
  • 25. Li Q., Wang S., Huang C. C., Yu M., and Shao J., “Meta‐Analysis Based Variable Selection for Gene Expression Data,” Biometrics 70, no. 4 (2014): 872–880.
  • 26. Lambert D., “Zero‐Inflated Poisson Regression, With an Application to Defects in Manufacturing,” Technometrics 34, no. 1 (1992): 1–14.
  • 27. Paszke A., Gross S., Massa F., et al., “PyTorch: An Imperative Style, High‐Performance Deep Learning Library,” in Advances in Neural Information Processing Systems, ed. Wallach H., Larochelle H., Beygelzimer A., d'Alché‐Buc F., Fox E., and Garnett R. (Curran Associates, Inc., 2019), 8024–8035.
  • 28. Beck A. and Teboulle M., “A Fast Iterative Shrinkage‐Thresholding Algorithm for Linear Inverse Problems,” SIAM Journal on Imaging Sciences 2, no. 1 (2009): 183–202, 10.1137/080716542.
  • 29. Duchi J., Hazan E., and Singer Y., “Adaptive Subgradient Methods for Online Learning and Stochastic Optimization,” Journal of Machine Learning Research 12, no. 61 (2011): 2121–2159.
  • 30. Bergstra J. and Bengio Y., “Random Search for Hyper‐Parameter Optimization,” Journal of Machine Learning Research 13 (2012): 281–305.
  • 31. Chen J. and Chen Z., “Extended Bayesian Information Criteria for Model Selection With Large Model Spaces,” Biometrika 95 (2008): 759–771, 10.1093/biomet/asn034.
  • 32. Luo C. and Chen K., “CVR: Canonical Variate Regression,” 2017, R package version 0.1.1.
  • 33. Safo S. E. and Palzer E. F., “mvlearnR: Multiview Learning Methods in R,” 2022, https://github.com/lasandrall/mvlearnR.
  • 34. Zeileis A., Kleiber C., and Jackman S., “Regression Models for Count Data in R,” Journal of Statistical Software 27 (2008): 1–25.
  • 35. Jackman S., pscl: Classes and Methods for R Developed in the Political Science Computational Laboratory (United States Studies Centre, University of Sydney, 2020), R package version 1.5.5.1.
  • 36. Hastie T., Tibshirani R., and Wainwright M., Statistical Learning With Sparsity: The Lasso and Generalizations (CRC Press, 2016).
  • 37. Wilson A. C., Kumar P. L., Lee S., et al., “Heme Metabolism Genes Downregulated in COPD Cachexia,” Respiratory Research 21, no. 1 (2020): 100, 10.1186/s12931-020-01336-w.
  • 38. Lee M. K., Hong Y., Kim S. Y., London S. J., and Kim W. J., “DNA Methylation and Smoking in Korean Adults: Epigenome‐Wide Association Study,” Clinical Epigenetics 8, no. 1 (2016): 103, 10.1186/s13148-016-0266-6.
  • 39. Li W., Zhang Y., Wang Y., et al., “Candidate Gene Prioritization for Chronic Obstructive Pulmonary Disease Using Expression Information in Protein–Protein Interaction Networks,” BMC Pulmonary Medicine 21, no. 1 (2021): 280, 10.1186/s12890-021-01646-9.
  • 40. Almansa R., Socias L., Sanchez‐Garcia M., et al., “Critical COPD Respiratory Illness Is Linked to Increased Transcriptomic Activity of Neutrophil Proteases Genes,” BMC Research Notes 5, no. 1 (2012): 401, 10.1186/1756-0500-5-401.
  • 41. Krämer A., Green J., Pollard J. Jr., and Tugendreich S., “Causal Analysis Approaches in Ingenuity Pathway Analysis,” Bioinformatics 30, no. 4 (2014): 523–530, 10.1093/bioinformatics/btt703.
  • 42. Kiszałkiewicz J. M., Majewski S., Piotrowski W. J., et al., “Evaluation of Selected IL6/STAT3 Pathway Molecules and miRNA Expression in Chronic Obstructive Pulmonary Disease,” Scientific Reports 11, no. 1 (2021): 22756, 10.1038/s41598-021-01950-8.
  • 43. Kardas G., Daszyńska‐Kardas A., Marynowski M., Brzakalska O., Kuna P., and Panek M., “Role of Platelet‐Derived Growth Factor (PDGF) in Asthma as an Immunoregulatory Factor Mediating Airway Remodeling and Possible Pharmacological Target,” Frontiers in Pharmacology 11 (2020): 47, 10.3389/fphar.2020.00047.

Associated Data


Supplementary Materials

Supporting information.

SIM-44-0-s001.pdf (12.4MB, pdf)

Data Availability Statement

Access to the clinical and genomic data can be requested through dbGaP (IDs: phs000951.v4.p4 and phs000179.v6.p2). The proteomic data can be requested from the COPDGene Study Group (http://www.copdgene.org/). The Python source code and R‐package for implementing the methods and generating simulated data, along with README files, are available on GitHub at https://github.com/lasandrall/HIP. A Shiny application of HIP for users with limited programming expertise can be found at https://multi‐viewlearn.shinyapps.io/HIP_ShinyApp/.


Articles from Statistics in Medicine are provided here courtesy of Wiley