Summary
Current diagnosis of neurological disorders often relies on late-stage clinical symptoms, which poses barriers to developing effective interventions at the premanifest stage. Recent research suggests that biomarkers and subtle changes in clinical markers may occur in a time-ordered fashion and can be used as indicators of early disease. In this article, we tackle the challenges of leveraging multidomain markers to learn the early disease progression of neurological disorders. We propose to integrate heterogeneous types of measures from multiple domains (e.g., discrete clinical symptoms, ordinal cognitive markers, continuous neuroimaging, and blood biomarkers) using a hierarchical Multilayer Exponential Family Factor (MEFF) model, where the observations follow exponential family distributions with lower-dimensional latent factors. The latent factors are decomposed into shared factors across multiple domains and domain-specific factors, where the shared factors provide robust information to perform extensive phenotyping and partition patients into clinically meaningful and biologically homogeneous subgroups. Domain-specific factors capture remaining unique variations for each domain. The MEFF model also captures the nonlinear trajectory of disease progression and orders critical events of neurodegeneration measured by each marker. To overcome computational challenges, we fit our model using approximate inference techniques for large-scale data. We apply the developed method to Parkinson’s Progression Markers Initiative data to integrate biological, clinical, and cognitive markers arising from heterogeneous distributions. The model learns lower-dimensional representations of Parkinson’s disease (PD) and the temporal ordering of the neurodegeneration of PD.
Keywords: Data integration, Factor models, Inflection points, Neurological disorders, Parkinson’s disease, Probabilistic models, Variational inference
1. Introduction
Patients affected by neuropsychiatric diseases often exhibit substantial heterogeneity in disease progression measured by cognitive, behavioral, and biological measures. Due to a lack of gold standard biomarkers, diagnosis of neurological disorders (e.g., Parkinson’s disease [PD] and Alzheimer’s disease [AD]) still heavily relies on clinical symptoms that occur at a late disease stage. Recent research has shown that disease-related biomarkers and subtle cognitive signs may follow a temporal ordering with varying rates of deterioration as the disease progresses. For example, Jack and others (2010) suggested that certain cerebrospinal fluid (CSF) biomarkers decline much earlier than neuronal dysfunction and cognitive symptoms (e.g., memory loss) for AD. Thus, it is important to identify markers that can measure the earlier pathological changes associated with underlying disease progression. In addition, since no gold standard diagnostic criterion defined by an objective biological marker is available for neurological disorders, it is also of interest to simultaneously integrate multiple data modalities (e.g., biological markers, cognitive or psychiatric signs, and motor symptoms) to measure disease progression.
Analysis of data collected in a single domain does not allow for a comprehensive characterization of disease risk and progression. Although evidence has started to emerge from separate studies of genetic variants and neuroimaging biomarkers associated with PD, only a handful has been replicated, and some are contradictory (Chen-Plotkin, 2014). There is a growing body of literature suggesting that it is beneficial to examine the complementary contribution from genomic, imaging, and CSF biomarkers to study susceptibility or risk of neurological disorders. Given the multiple genetic causes for complex diseases (e.g., PD and AD), the variability of markers, and the evident heterogeneity of symptoms across subjects (Wang and others, 2013), it is likely that multiple markers with utilities ranging from molecular genetic mechanisms, disease pathological states (e.g., spinal fluid biomarkers or neuroimaging measures) to clinical symptoms need to be integrated to fully map disease risk and progression.
Integrating multidomain disease markers and clinical data offers a comprehensive modeling of complex diseases resulting from an elaborate interplay among biological factors, environmental exposures, and behavioral or lifestyle factors. However, several challenges hamper the integration of heterogeneous types of longitudinal markers to model neurological disease progression. First, current research suggests that the progression profiles of many clinical and biological markers are nonlinear (Jack and others, 2010), and jointly modeling multivariate nonlinear random variables often leads to computational instability. Furthermore, the clinical symptoms and markers are measured on different scales, not all of them are collected for all subjects, and they may require different types of distributions for modeling (e.g., ordinal, multinomial, and continuous). In addition, patients enter a study at different stages of the disease and are usually followed within short time periods. As a result, the measurements are not aligned to directly observe the long-term disease pathological processes.
Despite the evidence of nonlinear disease progression, mixed models assuming linear progression are often used in clinical applications to model patients with repeated measures and allow subject-specific random variation, partially due to a lack of appropriate methods and stable software implementation for nonlinear models. For example, Lessig and others (2012) investigated the change of two cognitive instruments, Mini-Mental Status Examination (MMSE) and Montreal Cognitive Assessment (MoCA), among PD patients over time using a linear mixed effects model (LME). Furthermore, most existing studies focus on modeling a single outcome using LME or generalized linear mixed models (GLMM). In fact, when jointly modeling a large number of discrete outcomes, GLMMs often fail to converge due to complex random effects structures. With increasing computational power and Bayesian estimation techniques (e.g., Markov Chain Monte Carlo [MCMC]), some multivariate models have been proposed to accommodate interactions among multiple outcomes (Iddi and others, 2018). These methods only apply to continuous outcomes with an identity link function, and the computation is intensive for large-scale data. Matrix decomposition-based methods have been proposed to integrate multiview continuous data with Gaussian distributions (Lock and others, 2013; Gaynanova and Li, 2019). Recently, Zhu and others (2020), Li and Gaynanova (2018), and Li and others (2018) proposed integrative analysis to accommodate mixed type data under an exponential family framework, but they do not model longitudinal data to characterize disease progression. Simultaneously modeling and integrating a large number of longitudinal measures with different types of distributions remains a challenge.
In this work, we propose a generative model to integrate a large number of markers from different modalities and mixed data types (e.g., continuous and ordinal) to simultaneously learn their interactions and effects on disease progression. Our model is based on composing distributions from the exponential families with latent variables in a hierarchical fashion. We simultaneously examine the associations among these markers in the latent space so as to create lower-dimensional factors and partition patients into biologically meaningful, homogeneous subgroups based on shared variation across data modalities. There are four main foci of our approach to modeling disease progression: (i) The model will allow the markers’ distribution types to be heterogeneous and use the corresponding exponential family for each type; (ii) We will use a generative hierarchical model with nonlinear trajectories to capture associations between markers and identify the order of impairments by examining model parameters (e.g., average inflection points); (iii) We allow subject-specific random inflection points to accommodate patients’ differential stage of disease and align them in the latent space; (iv) To accommodate between-patient heterogeneity, the model will produce modality-specific factor scores and shared factor scores across modalities that can distinguish patient subgroups in terms of their disease progression stages based on both clinical symptoms and shared variation with biomarkers.
Our modeling framework is based on a Multilayer Exponential Family Factor (MEFF) model with nonlinear trajectories over time consistent with theories of disease cascade and evidence in empirical studies. The convergence of algorithms for fitting such nonlinear latent models with large-scale, mixed-type outcomes is a challenging statistical problem. For a single outcome, the computation and estimation of nonlinear mixed effects models often use various approximations of the log-likelihood function, including Laplace approximation, LME model approximation, Gaussian quadrature, or penalized quasilikelihood. However, these methods become unstable or infeasible when models are highly nonlinear and outcomes are multivariate and of different types (e.g., continuous and discrete). There are three main numeric challenges for fitting large-scale nonlinear models. They arise from the computation of integrals over the multidimensional latent factors, the optimization of the nonlinear objective function, and the dependence of the model optimization used for each data type. To overcome these challenges, we propose to fit the model with state-of-the-art computational techniques for large-scale data based on variational inference (VI) (Blei and others, 2017). The optimization method is sufficiently general to accommodate multiple distributions in the exponential family and connect latent variables and observed variables in a hierarchical fashion. To accommodate a large number of markers and observations, stochastic gradient descent is used (Ranganath and others, 2014), and the method is applicable to “wide and tall” large-scale data with repeated measures not necessarily aligned among subjects. We show the feasibility of the method through simulation studies and an application to the Parkinson’s Progression Markers Initiative (PPMI) to model the progression of PD.
2. Construction of the generative models
2.1. Integration of different data modalities
Let
denote measurement of the
th time-varying marker within the
th data modality at visit
on the
th subject, let
denote time-invariant covariates to be adjusted for confounding (e.g., demographics) or to inform disease mechanisms (e.g., genetic mutation), and let
denote the observed measurement time or age for each modality. Different modalities of outcomes (e.g., clinical symptoms, CSF biomarkers, and neuroimaging biomarkers) can be of different data types (e.g., continuous, ordinal, and discrete), and each type requires a different link function under the exponential family models. Assume that the first modality
measures ordinal clinical markers (e.g., items in an instrument measuring the severity of motor symptoms for PD patients), where
indicates normal symptoms with larger integer values indicating more severe impairments. A commonly used model from measurement theory in psychology and psychiatry is the item response theory (IRT) model, where dichotomous markers are modeled using logit link functions incorporating latent traits (Partchev, 2004). Our model for ordinal markers is based on an extension of the IRT, which accommodates inflated normal symptoms for long-term resilient subjects using a Bernoulli random variable
. Figure C.1 of the Supplementary material available at Biostatistics online shows an example of observed ordinal markers and their histograms for PD patients in our motivating study. We can observe a large proportion of normal outcomes since many patients only present normal signs for the two markers. In our model, a subject with
is not susceptible to pathological changes manifested through the
th marker and therefore will always have normal symptoms during the life course (i.e.,
for all
).
On the contrary, susceptible patients will have ordinal outcomes
following an adjacent category model with latent variables, and we can obtain
![]() |
(2.1) |
![]() |
(2.2) |
In (2.1), the outcome follows an inflated adjacent category model given the unobserved random variables
. Parameters (
) are global parameters specific to a marker’s trajectory shared by all subjects. The parameter
indicates the marker-specific rate of impairment. Since the probability of having less severe symptoms decreases as age
increases,
is constrained to be negative. Parameter
allows the inflection point age to differ for different successive outcome levels, and
is fixed at
to ensure model identifiability. The random subject-specific inflection point of the
th marker (i.e.,
) is the age at which a susceptible patient reaches the midpoint on the probability curve of a normal outcome,
. In other words, it is the age at which a susceptible patient starts to have a probability greater than
of experiencing abnormal symptoms (i.e.,
). It is also the age when the impairment probability curve
decreases the fastest with respect to
, where
are local latent variables allowing individual shifts around a group-average mean age of inflection point,
. These random shifts allow patients to be at different stages of their disease progression when entering the study, and the progression trajectory is governed by global parameters.
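A minimal sketch of an inflated adjacent category model may help fix ideas. The functional form below is an assumption made for illustration, not the exact expression in (2.1): the log-odds of two adjacent severity levels are taken to be a negative rate times the distance between the current age and a level-specific inflection age, and a resilient subject always shows the normal level. All function and argument names are hypothetical.

```python
import math


def adjacent_category_probs(age, inflection_age, rate, offsets):
    """Category probabilities for a susceptible subject.

    Assumed form (illustrative only):
        log P(Y = l) / P(Y = l + 1) = rate * (age - inflection_age - offsets[l])
    with rate < 0 and offsets[0] == 0 for identifiability.
    """
    # Build unnormalised log-probabilities, using category 0 as the reference.
    log_weights = [0.0]
    for offset in offsets:
        adjacent_logit = rate * (age - inflection_age - offset)
        log_weights.append(log_weights[-1] - adjacent_logit)
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(log_weights)
    weights = [math.exp(w - peak) for w in log_weights]
    total = sum(weights)
    return [w / total for w in weights]


def inflated_probs(age, inflection_age, rate, offsets, p_susceptible):
    """Mix the susceptible model with a point mass at the normal level 0."""
    probs = adjacent_category_probs(age, inflection_age, rate, offsets)
    mixed = [p_susceptible * p for p in probs]
    mixed[0] += 1.0 - p_susceptible  # resilient subjects stay normal
    return mixed


# With two levels, a susceptible subject is at the midpoint at the inflection age.
print(adjacent_category_probs(60.0, 60.0, -0.3, [0.0]))
# Three levels, ten years past the inflection age, 70% susceptible (made-up values).
print(inflated_probs(70.0, 60.0, -0.3, [0.0, 4.0], 0.7))
```

Under this assumed form, the probability of a normal outcome falls with age because the rate is negative, and the inflation term keeps it bounded below by the resilient proportion.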
The subject- and marker-specific inflection point age
is a critical landmark point, since it represents the time when the deterioration from a normal state occurs at the maximal rate. The values of
indicate the temporal ordering of the
markers as the disease progresses for the
th subject. It is further modeled as a linear function of fixed covariates
and the zero-mean random shift scores
. In (2.2), an interbattery latent factor model (IBFA) is used to capture subject-specific disease progression measured by multiple markers jointly. The factor model in (2.2) expresses variations among observed markers through a lower-dimensional, oriented subspace spanned by modality-specific latent factors
(
for the first modality) and common latent factors
shared across all modalities. Loadings
and
specify relative contributions of common and modality-specific latent factors to the latent space. Details are described in Section 2.2.
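The way inflection ages translate into a temporal ordering can be sketched as follows. This is an illustration under assumptions: the inflection age is taken to be a marker-level mean plus a linear covariate effect plus a random shift score, and the marker names, covariate, and numbers are all hypothetical.

```python
def inflection_ages(mean_ages, covariate_effects, covariates, shift_scores):
    """Subject-specific inflection age for each marker.

    Assumed form: mean age + sum_j effect_j * covariate_j + random shift.
    """
    ages = {}
    for marker, mean_age in mean_ages.items():
        adjustment = sum(
            effect * value
            for effect, value in zip(covariate_effects[marker], covariates)
        )
        ages[marker] = mean_age + adjustment + shift_scores[marker]
    return ages


def temporal_order(ages):
    """Markers sorted from earliest to latest inflection age."""
    return sorted(ages, key=ages.get)


# Hypothetical population-level mean inflection ages (years).
mean_ages = {"tremor": 58.0, "rigidity": 62.0, "delayed_recall": 66.0}
# One hypothetical binary covariate (e.g., mutation carrier status).
covariate_effects = {"tremor": [-1.0], "rigidity": [-3.0], "delayed_recall": [0.0]}
# Hypothetical zero-mean shift scores for one subject.
shift_scores = {"tremor": 1.5, "rigidity": -2.0, "delayed_recall": -0.5}

subject_ages = inflection_ages(mean_ages, covariate_effects, [1.0], shift_scores)
print(temporal_order(mean_ages))     # population-level ordering
print(temporal_order(subject_ages))  # ordering for this subject
```

In this made-up example the subject's ordering differs from the population ordering, which is the kind of individual variation the random shifts are meant to absorb.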
In addition to discrete clinical markers, important continuous measures from biological, cognitive, and imaging data modalities are often collected. To accommodate nonlinear disease progression measured by CSF or imaging biomarkers, Sun and Wang (2018) recast the modeling of sigmoidal nonlinear trajectories as the modeling of probabilities of Bernoulli latent variables, so that the marginal distribution of the biomarker would be sigmoidal. This model is within the class of latent nonlinear mixed effects models with random inflection points, which are difficult to fit as reviewed in Section 1. However, by formulating a nonlinear model as a mixture of two latent states (e.g.,
), each with a linear trajectory, but with switching probability depending on time through a nonlinear link function, a linearization of the nonlinear model is achieved. This linearization reduces some parameter optimization to weighted least squares regressions and thus circumvents unreliable computation. The previous work performs neither integrative analysis of multimodality measures nor dimension reduction, which are the foci of the current work.
Here, we model the continuous outcome
as
![]() |
(2.3) |
![]() |
(2.4) |
The scale parameter
is constrained to be positive and represents the magnitude of change induced from the two states. Here, we define
in a way such that smaller values indicate more severe impairments, which is often the case for cognitive markers. The linear terms
and
account for impairments due to normal aging. Thus, we should expect
to be negative as aging will impair the cognitive functions. The parameter
, the variables
and
are modeled similarly to those in model (2.1), and
are normally distributed measurement errors with marker-specific variances. One difference in (2.3) is that
is a binary variable with one indicating a normal disease state and zero for a severe state. As a result, when a patient has long-term resilience (i.e.,
), or when a patient has a normal disease state (i.e.,
) for the
th marker, we should expect to see a less impaired outcome with
added. We show details of model (2.3) in the Supplementary materials A.3 available at Biostatistics online.
This model combines the linear trend and the well-known sigmoidal trajectories at the population level when marginalizing over
and
. In addition, the mixture of latent variables
and
allows modeling the probability of switching between a normal and a severe status. The factor model for
in
serves a similar role as in (2.2). The latent factors
are used to capture underlying subject-specific properties uniquely manifested by continuous markers specific to the second modality. Models (2.3) and (2.4) are used to model continuous biomarkers such as brain caudate and putamen regional measures. The linearity assumption when
for all
is motivated by the observation that the normal aging process in regional brain volumes follows a linear trajectory throughout the adult life span (Allen and others, 2005; Fjell and others, 2009), and that acceleration of aging is due to disease progression (Fjell and others, 2009).
Here, if
are observed, the progression trajectory given
is shifted by
. However,
are unobserved latent variables used to model a nonlinear trajectory. The progression trajectories marginalizing over latent variable probabilities are nonlinear functions of age (since probabilities are sigmoidal functions), as we demonstrate in Figure 8. In addition to population nonlinear trajectories, the model is flexible to allow a random horizontal shift
for each individual and each marker under the self-modeling regression framework (SEMOR; Lawton and others, 1972; Donohue and others, 2014). However, our model goes well beyond the original SEMOR by modeling a large number of markers simultaneously and determining the random shifts through a lower-dimensional factor model that accounts for within- and between-domain correlations.
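The two-state construction can be illustrated with a short sketch. It assumes a simplified version of the continuous-marker model, ignoring the resilience indicator: each latent state has a linear trajectory in age, the normal state adds a positive scale term, and the probability of the normal state is a logistic function of age with a negative rate. Parameter names and values are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def state_mean(age, intercept, aging_slope, scale, normal_state):
    """Mean outcome given the latent state (1 = normal, 0 = severe): linear in age."""
    return intercept + aging_slope * age + scale * normal_state


def marginal_mean(age, intercept, aging_slope, scale, rate, inflection_age):
    """Mean outcome after averaging over the latent state.

    Assumed switching probability: P(normal) = sigmoid(rate * (age - inflection_age)),
    with rate < 0 so that the normal state becomes less likely with age.
    """
    p_normal = sigmoid(rate * (age - inflection_age))
    return intercept + aging_slope * age + scale * p_normal


# Hypothetical parameters for a standardized cognitive marker.
params = dict(intercept=1.0, aging_slope=-0.01, scale=2.0)
switch = dict(rate=-0.4, inflection_age=65.0)

for age in (45, 55, 65, 75, 85):
    normal = state_mean(age, normal_state=1, **params)
    severe = state_mean(age, normal_state=0, **params)
    marginal = marginal_mean(age, **params, **switch)
    print(age, round(normal, 3), round(marginal, 3), round(severe, 3))
```

Conditional on the state the trajectory is a straight line, while the marginal curve moves from the upper line to the lower one along a sigmoid whose steepest point is the inflection age.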
A graphical illustration of the MEFF model for each patient
can be found in Figure 1. In our MEFF,
reflects the disease progression of a patient and can be used to investigate the underlying subgroups of patients (e.g., by performing a clustering analysis of
or modeling them as a Gaussian mixture distribution). The MEFF model with random shifts falls under the SEMOR framework (Donohue and others, 2014) but goes beyond it by modeling a large number of markers simultaneously and determining the random shifts through a lower-dimensional factor model.
Figure 1.
An example of the generative model of MEFF with two data modalities for the patient
. The stacked outcomes represent records of multiple visits for a patient and a marker. The latent factor space is represented by the lower-dimensional latent factor
shared across modalities and the unique latent factor
specific for each modality
.
is informative of the underlying disease progression of subjects from different subgroups.
2.2. Interbattery factor model for the latent space
IBFA has recently been reinterpreted from a probabilistic perspective (Browne, 1979). It has a close connection with probabilistic canonical correlation analysis, which aims to obtain components that capture correlations between two sets of random variables (Bach and Jordan, 2005). In contrast, IBFA models not only provide a latent lower-dimensional subspace shared among all the modalities but also include modality-specific latent variables to account for each data modality’s unique variability. The main idea of IBFA has been extended to the exponential family to demonstrate its capability to separate shared information from that specific to each modality (Klami and others, 2013).
Different from current literature that applies IBFA directly to observed outcomes, we use it to factorize the random shift scores
embedded in (2.2) and (2.4). With two modalities, our factor model is
![]() |
(2.5) |
where
, and
,
for
. Here, we have decomposed the lower-dimensional latent space into a common shared latent space with
and modality-specific latent spaces with
and
. Note that the dimension of
,
, is assumed to be much lower than the number of markers
. Thus, the IBFA model also serves the purpose of dimension reduction. The group-wise sparsity structure imposed for the loading matrix
in (2.5) helps to resolve the unidentifiability issue in allocating the shared and unique components (Klami and others, 2013). We further require that components in
and
are not all zero. Otherwise, the two modalities do not share information and thus do not warrant joint modeling. The group-wise sparsity and the requirement for
to have an identity variance–covariance matrix only guarantee the identifiability of
and
for each modality
. Loading matrices
and
are only identifiable up to an orthogonal rotation matrix
(i.e.,
). In real data analysis, we use QR decomposition to fix a rotation that generates the most interpretable loadings (Section 5.2 and Section A.1 of the Supplementary material available at Biostatistics online). Our model has the advantage of capturing the common variation and learning robust information using all data domains. In particular, when predicting one data source
from the other source
, not all variation of
is relevant. Using only the denoised, shared latent factors to predict one from another is sufficient, which is similar to the rationale of a partial least squares analysis. In our model, the shared latent factors
are used to reflect the underlying subject-specific disease progression status by combining observations from all modalities. Hence, the fitted shared latent factors based on posterior distributions given observed data can provide comprehensive and robust information to distinguish patients in heterogeneous subgroups.
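The group-wise sparse loading structure and the rotational ambiguity can be illustrated with a small sketch. It assumes two modalities, a two-dimensional shared factor, one modality-specific factor each, identity factor covariance, and no residual noise. The loading values are made up, and the helper names are hypothetical.

```python
import math


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def stack_loadings(shared_1, specific_1, shared_2, specific_2):
    """Stack loadings as [shared | specific to 1 | specific to 2] with zero blocks."""
    width_1, width_2 = len(specific_1[0]), len(specific_2[0])
    rows = [s + b + [0.0] * width_2 for s, b in zip(shared_1, specific_1)]
    rows += [s + [0.0] * width_1 + b for s, b in zip(shared_2, specific_2)]
    return rows


def implied_covariance(loadings):
    """Covariance of the shift scores when the factors have identity covariance."""
    return matmul(loadings, transpose(loadings))


def rotate(shared, angle):
    """Apply a 2 x 2 orthogonal rotation to a block of shared loadings."""
    c, s = math.cos(angle), math.sin(angle)
    return matmul(shared, [[c, -s], [s, c]])


# Hypothetical loadings: 2 markers in modality 1, 3 markers in modality 2.
shared_1 = [[0.8, 0.1], [0.5, -0.3]]
specific_1 = [[0.4], [0.6]]
shared_2 = [[0.7, 0.2], [0.3, 0.9], [-0.2, 0.5]]
specific_2 = [[0.3], [0.1], [0.8]]

cov = implied_covariance(stack_loadings(shared_1, specific_1, shared_2, specific_2))
cross_block = [row[2:] for row in cov[:2]]  # modality 1 markers vs modality 2 markers
print(cross_block)
print(matmul(shared_1, transpose(shared_2)))  # depends on shared loadings only
```

Rotating both shared blocks by the same orthogonal matrix leaves the implied covariance unchanged, which is why an additional convention is needed to fix a rotation.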
Several features of our modeling framework are worth noting. First, the model maintains nonlinear trajectories observed for markers in neurological disorders. To compare marker trajectories between patient subgroups and predict personalized trajectories, the model incorporates subject covariates (
) informative of disease progression (e.g., mutation status and baseline levodopa treatment status) and subject-specific latent factors (
). Second, instead of assessing individual markers separately, we jointly estimate a large number of markers to borrow information and reduce dimensionality using latent factors. The overlaps of variance among markers are effectively accounted for by the shared latent factors. Third, we can compare and establish temporal relationships across markers by estimating marker-specific parameters (e.g.,
and
) and mapping different types of markers onto the same latent space so that they are comparable. MEFF recognizes and organizes disease markers to jointly model the underlying long-term disease course as well as learn shared information across multiple domains and unique information specific to a domain.
3. Estimation and inference
Denoting
as all the parameters in the model, the marginal log-likelihood is
![]() |
With observed data
and the latent variables
, the goal is to learn the global progression parameters for each modality (e.g.,
), along with factor loadings
. For estimation with latent variables, the Expectation-Maximization (EM) algorithm is widely used. However, due to the intractability of the posterior expectation of multiple layers of latent variables and the nonlinearity in our model, it is impractical to use the EM algorithm, especially for large data sets with many markers. In this case, VI provides an approximation with an easy assessment of convergence through optimizing a variational lower bound. Specifically, VI requires proposing a variational distribution
parameterized by
to obtain an adequate approximation of the conditional distribution of latent variables given observed data (i.e.,
) (Jordan and others, 1999). This is equivalent to maximizing the evidence lower bound (ELBO)
of the data log-likelihood (Jordan and others, 1999), where
Based on the form of the likelihood in model (2.1), when using the logit link for
and
, the posterior
is intractable. However, VI overcomes this hurdle by avoiding numeric integration over multivariate latent factors
, and thus substantially speeds up computation.
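To make the role of the ELBO concrete, the sketch below uses a toy conjugate model rather than MEFF: a single latent variable with a standard normal prior and one observation that is normal around the latent variable with unit variance. For this toy model the log evidence is known in closed form, so a Monte Carlo estimate of the ELBO (the expectation under the variational distribution of the log joint density minus the log variational density) can be compared against it. The model and all names are assumptions chosen for illustration.

```python
import math
import random


def log_joint(y, z):
    """log p(z) + log p(y | z) for z ~ N(0, 1) and y | z ~ N(z, 1)."""
    return -math.log(2.0 * math.pi) - 0.5 * z * z - 0.5 * (y - z) ** 2


def log_q(z, mean, scale):
    """Log density of the Gaussian variational distribution N(mean, scale^2)."""
    return -0.5 * math.log(2.0 * math.pi * scale ** 2) - 0.5 * ((z - mean) / scale) ** 2


def elbo_estimate(y, mean, scale, n_samples=20000, seed=0):
    """Monte Carlo estimate of E_q[log p(y, z) - log q(z)]."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        z = rng.gauss(mean, scale)
        total += log_joint(y, z) - log_q(z, mean, scale)
    return total / n_samples


def log_evidence(y):
    """Exact log p(y) for the toy model, where marginally y ~ N(0, 2)."""
    return -0.5 * math.log(4.0 * math.pi) - y * y / 4.0


y_obs = 1.5
# The exact posterior of the toy model is N(y / 2, 1 / 2): the bound is tight there.
print(elbo_estimate(y_obs, y_obs / 2.0, math.sqrt(0.5)), log_evidence(y_obs))
# A poorly matched variational distribution gives a strictly lower value.
print(elbo_estimate(y_obs, 0.0, 1.0))
```

The gap between the log evidence and the ELBO is the Kullback-Leibler divergence from the variational distribution to the posterior, so maximizing the ELBO moves the approximation toward the posterior without evaluating the evidence integral.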
Black box variational inference (BBVI) was proposed so that VI can be easily applied and implemented for a wide variety of models (Ranganath and others, 2014). BBVI replaces tedious model-specific derivations of gradients in the updates of VI by an equivalent form and leverages computational power, which forms the cornerstone of deep exponential families. It derives a noisy but unbiased estimator for the gradient of the ELBO, where
With variational distribution
of any proposed easy form, we can numerically compute the gradient by drawing Monte Carlo samples. Then stochastic optimization can be used to maximize
and easily scale to a large number of observations. One advantage of BBVI is that the computation of the gradient
does not involve the underlying model. Therefore, as long as the model-specific complete data log-likelihood and the proposed variational distributions can be numerically evaluated, any probabilistic programming package can be applied to automatically estimate an approximation to the posterior distribution. Thus, the optimization challenges of fitting nonlinear models are tackled by stochastic sampling and generic model-independent gradients, an approach that is scalable to big data (Tran and others, 2016). Edward is one such Python library, built on TensorFlow, that supports probabilistic modeling, efficient inference, and techniques for model criticism (Tran and others, 2016). In this work, Edward is used for model estimation.
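The score-function gradient behind BBVI can be sketched in plain Python on a toy model, without Edward. The sketch assumes a standard normal prior on one latent variable, a unit-variance normal likelihood, and a Gaussian variational distribution whose scale is held fixed so that only its mean is learned. It omits the variance-reduction devices (Rao-Blackwellization and control variates) that Ranganath and others (2014) recommend, and every name in it is hypothetical.

```python
import math
import random


def log_joint(y, z):
    """log p(z) + log p(y | z) for z ~ N(0, 1) and y | z ~ N(z, 1)."""
    return -math.log(2.0 * math.pi) - 0.5 * z * z - 0.5 * (y - z) ** 2


def log_q(z, mean, scale):
    """Log density of the variational distribution N(mean, scale^2)."""
    return -0.5 * math.log(2.0 * math.pi * scale ** 2) - 0.5 * ((z - mean) / scale) ** 2


def score_gradient(y, mean, scale, rng, n_samples):
    """Noisy unbiased estimate of d ELBO / d mean.

    Uses only evaluations of log_joint and log_q: the model enters as a black box.
    """
    total = 0.0
    for _ in range(n_samples):
        z = rng.gauss(mean, scale)
        score = (z - mean) / scale ** 2  # derivative of log q with respect to mean
        total += score * (log_joint(y, z) - log_q(z, mean, scale))
    return total / n_samples


def fit_mean(y, scale, n_steps=300, n_samples=50, seed=0):
    """Stochastic gradient ascent on the ELBO with a decreasing step size."""
    rng = random.Random(seed)
    mean = 0.0
    for step in range(n_steps):
        step_size = 0.5 / (step + 1)
        mean += step_size * score_gradient(y, mean, scale, rng, n_samples)
    return mean


# For this toy model the exact posterior mean is y / 2.
print(fit_mean(1.5, math.sqrt(0.5)))
```

Nothing in `score_gradient` depends on the form of the model beyond being able to evaluate its log joint density, which is the property that lets the same routine serve heterogeneous likelihoods.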
4. Simulation studies
We conducted simulation studies to examine the performance of the proposed method by simulating data that closely reflects the observed PPMI study data. There were
simulated subjects with ages from
to
who shared the same baseline covariates (i.e., gender) as the PPMI study cohorts. For ease of presentation, we considered three domains with the first and second modalities
simulated from
and
ordinal outcomes taking integer values from
to
, and the third modality
simulated from
continuous outcomes using
. The
estimated from the real data is small since we standardized all markers to have variance one. Each subject has on average
repeated visits in the first modality, and
repeated visits in the second modality. As described in (2.5), we simulated
using their own
D modality-specific latent factors
for
, and a shared
D latent factor
that correlates the three modalities. The parameters were estimated using BBVI implemented in Edward 1.3.5.
We report the estimates of the important parameters directly related to describing the disease progression (i.e.,
) in Table 1 from
simulated data sets. We see that all the mean estimates have small biases from the true parameter values (highlighted in bold) with small standard deviations. In particular,
for marker-specific rate of impairment and
for population-level inflection point age are two crucial parameters to distinguish useful markers and perform temporal ordering. Both parameters are precisely estimated for all the markers. In conclusion, the proposed MEFF can be estimated accurately using BBVI. Example code is available on GitHub (https://github.com/qw2223/MEFF). More simulation studies exploring estimation performance under different sample sizes, noise variances, missing proportions, and misspecification of the number of latent factors are included in Section B of the Supplementary material available at Biostatistics online.
Table 1.
The parameter estimates of the
simulated data from three modalities mimicking PPMI
| | Ordinal | | | | | | Continuous | | | |
|---|---|---|---|---|---|---|---|---|---|---|
| | **7.0** | **6.0** | **7.5** | **7.5** | **4.5** | **5.0** | **3.5** | **5.0** | **4.0** | **4.5** |
| Bias | 0.111 | 0.032 | 0.005 | 0.071 | 0.208 | 0.251 | 0.012 | 0.008 | 0.020 | 0.014 |
| s.d. | 0.221 | 0.232 | 0.291 | 0.227 | 0.202 | 0.207 | 0.019 | 0.018 | 0.024 | 0.024 |
| | **1.0** | **1.2** | **0.8** | **1.4** | **1.5** | **1.0** | **0.8** | **1.3** | **1.3** | **1.2** |
| Bias | 0.003 | 0.001 | 0.003 | 0.002 | 0.013 | 0.009 | 0.003 | 0.005 | 0.004 | 0.004 |
| s.d. | 0.007 | 0.010 | 0.011 | 0.009 | 0.023 | 0.010 | 0.002 | 0.002 | 0.004 | 0.003 |
| | **2.5** | **4.0** | **3.0** | **6.0** | **3.0** | **3.0** | **6.5** | **5.0** | **3.5** | **4.0** |
| Bias | 0.002 | 0.011 | 0.013 | 0.151 | 0.016 | 0.034 | 0.063 | 0.017 | 0.013 | 0.000 |
| s.d. | 0.058 | 0.113 | 0.093 | 0.593 | 0.188 | 0.146 | 0.462 | 0.033 | 0.025 | 0.026 |
| | **0.5** | **2.5** | **2.0** | **1.0** | **0.5** | **0.0** | **1.0** | **2.0** | **2.0** | **1.5** |
| Bias | 0.005 | 0.012 | 0.017 | 0.238 | 0.002 | 0.010 | 0.272 | 0.006 | 0.008 | 0.003 |
| s.d. | 0.048 | 0.087 | 0.078 | 0.469 | 0.170 | 0.109 | 0.748 | 0.015 | 0.016 | 0.010 |
| | **0.2** | **0.5** | **0.3** | **0.1** | **0.4** | **0.6** | **0.5** | **0.2** | **0.4** | **0.7** |
| Bias | 0.003 | 0.005 | 0.008 | 0.004 | 0.016 | 0.027 | 0.010 | 0.019 | 0.003 | 0.006 |
| s.d. | 0.006 | 0.016 | 0.010 | 0.005 | 0.020 | 0.024 | 0.005 | 0.009 | 0.005 | 0.006 |
| | **2.5** | **1.5** | **3.0** | **2.0** | **2.5** | **3.5** | **1.5** | **2.5** | **1.0** | **0.5** |
| Bias | 0.021 | 0.020 | 0.006 | 0.025 | 0.001 | 0.022 | 0.005 | 0.001 | 0.004 | 0.003 |
| s.d. | 0.084 | 0.103 | 0.100 | 0.091 | 0.096 | 0.081 | 0.009 | 0.005 | 0.006 | 0.004 |
| | **3.5** | **1.5** | **5.0** | **3.5** | **2.5** | **4.0** | **6.5** | **7.5** | **5.0** | **4.5** |
| Bias | 0.050 | 0.016 | 0.018 | 0.002 | 0.012 | 0.021 | 0.031 | 0.008 | 0.014 | 0.009 |
| s.d. | 0.125 | 0.113 | 0.130 | 0.140 | 0.130 | 0.120 | 0.037 | 0.007 | 0.006 | 0.007 |
The true parameter values are in bold. The bias is the mean difference between the estimates and the true parameter values.
The continuous markers were generated with
.
5. Applications to the PPMI study
5.1. Study design
The PPMI is a large international study of PD that aims to identify biomarkers of PD progression using longitudinal multimodality data including clinical measurements and biological markers such as genetics and neuroimaging measures (Marek and others, 2011). To understand the disease progression and distinguish markers of early PD patients from other signs, the PPMI study recruited eight cohorts with different genetic backgrounds and different disease diagnostic statuses determined at baseline by trained neurologists. Three major cohorts include the de novo “PD” cohort (early PD patients), the “HC” cohort (healthy controls), and the “Prodromal” cohort (nondiagnosed PD with early nonmotor symptoms). There is a high chance for a prodromal patient to be diagnosed with PD in the near future. Therefore, it is crucial to distinguish the prodromal cohort from healthy controls in order to develop effective early interventions before severe symptoms appear. Although prodromal patients had complete information on several cognitive measures (e.g., delayed recall), most did not have data from imaging biomarkers (e.g., caudate and putamen). Hence, it would be beneficial to borrow information from other similar subjects to study the PD progression for prodromal subjects.
A complication for multimodality analysis of the PPMI is that not all the subjects have complete outcome measurements for all the markers, which is a challenge that often occurs in practice for joint models. More specifically, PPMI study cohorts were assessed under different schedules. In the study period,
of the subjects have only baseline measurements, and
have 7–15 repeated measures. The median length of follow-up time is 4 years among those who have at least one observation other than the baseline measure. The advantage of MEFF is that it can allow such missingness and still produce the shared latent factor scores
for a subject by borrowing information from the subject’s other available outcomes and from other similar subjects. We have shown the validity of model estimation in the presence of missing data in Section B of the Supplementary material available at Biostatistics online. Figure C.2 of the Supplementary material available at Biostatistics online shows that the missingness patterns of the PPMI study are quite different across cohorts:
subjects of the genetic cohorts do not have data on MDS-UPDRS and 64
of prodromal patients have no records for either MDS-UPDRS or SCOPA-AUT items. In this case, methods such as principal component analysis cannot include subjects with incomplete markers. We tackle these challenges using MEFF.
To incorporate information from different sources, we considered 10 modalities, including clinical, cognitive, imaging, and biological markers. The first modality contains categorical measures from MDS-UPDRS, the most common scale used to measure PD severity in terms of nonmotor autonomic functions and motor symptoms. Each item has five levels to describe various impairment domains of PD symptomatology (e.g., bradykinesia, dystonia, tremor, and rigidity). Because the most severe level has very few observations, we combine it with the moderate level. Since some items have the same score for most of the subjects, we only focus on the
items with some variability in the analysis. The second modality contains
categorical SCOPA-AUT items measuring autonomic functions that are important for early PD. In addition, we have six continuous modalities of cognitive measurements. They include semantic fluency test (VLT), MoCA (i.e., MCATOT, visuospatial, attention, language, delayed recall, and orientation), symbol-digit modalities, letter number sequencing test (LNS), Hopkins verbal learning test (HVLT), and Benton judgment of line orientation (JLO). We also have one modality for CSF biological markers (i.e.,
,
-synuclein, p-tau, and t-tau), and one for DaTscan imaging markers (i.e., caudate and putamen). All markers in the continuous modalities are standardized to mean zero and variance one for modeling.
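The two preprocessing steps described above (collapsing the sparse most-severe level into the level below it, and standardizing continuous markers) can be sketched in plain Python. This is a minimal illustration rather than the authors' pipeline: the function names, the use of None for a missing value, and the choice to standardize over observed values only are assumptions. The 0 to 4 coding follows the usual MDS-UPDRS item scoring.

```python
from statistics import mean, pstdev

def merge_top_level(scores, top=4):
    """Collapse the most severe level (assumed coded as `top`) into the level below."""
    return [None if s is None else min(s, top - 1) for s in scores]

def standardize(values):
    """Scale observed values to mean 0 and variance 1; None stays missing.

    Assumes the marker is not constant (constant items are dropped beforehand).
    """
    observed = [v for v in values if v is not None]
    mu, sd = mean(observed), pstdev(observed)
    return [None if v is None else (v - mu) / sd for v in values]

# Toy values for illustration only (not PPMI data).
item = merge_top_level([0, 1, 4, None, 3, 2])
marker = standardize([1.2, None, 0.8, 1.0])
```

Missing entries are carried through unchanged, so subjects with incomplete markers stay in the data set.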
We summarize subject-and-marker-specific observed symptom scores by taking the average across repeated measures of a subject and visualize the available observed data in Figure C.2 of the Supplementary material available at Biostatistics online and the correlation matrix of all 77 markers in Figure 2(a). Due to limited space, we only list the names for half of the markers in both figures. The observed data in Figure C.2 of the Supplementary material available at Biostatistics online show that, overall, patient cohorts (i.e., “PD,” “SWEDD,” “GCPD,” and “GRPD”) have more severe symptoms across markers than control cohorts (i.e., “HC,” “Prodromal,” “GCUN,” and “GRUN”). Consistent with the cohort descriptions, “SWEDD” subjects do not have severe symptoms in DaTscan imaging markers (e.g., caudate and putamen), and “Prodromal” subjects do not have severe MDS-UPDRS markers (e.g., NP1SLPN). Figure 2(a) shows that markers of the same modality, or markers measuring the same types of symptoms (e.g., cognitive functions), form clusters of highly correlated markers. Specifically, we see clusters of MDS-UPDRS markers, cognitive markers (e.g., VLTFRUIT), biological markers (e.g., t-tau), and imaging markers. In comparison, the SCOPA-AUT markers (e.g., SCAU1) have smaller within-group correlations. Data from different clusters can also have between-group correlations. For example, the DaTscan imaging markers are highly correlated with MDS-UPDRS Part II items (e.g., NP2SPCH). This is consistent with the fact that the current diagnosis of PD heavily relies on MDS-UPDRS and DaTscan imaging markers.
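The summary step behind the observed-data heatmap (average each marker over a subject's repeated measures, then correlate markers across subjects) can be sketched as follows. The record layout, the function names, and the use of pairwise-complete subjects for each correlation are assumptions made for illustration; the source does not state how missing pairs were handled.

```python
from collections import defaultdict
from math import sqrt

def subject_means(records):
    """Average each marker over a subject's visits, skipping missing values.

    records: iterable of (subject_id, marker_name, value); value may be None.
    Returns {marker_name: {subject_id: mean_value}}.
    """
    sums = defaultdict(lambda: defaultdict(lambda: [0.0, 0]))
    for subject, marker, value in records:
        if value is not None:
            cell = sums[marker][subject]
            cell[0] += value
            cell[1] += 1
    return {m: {s: total / n for s, (total, n) in by_subj.items()}
            for m, by_subj in sums.items()}

def pairwise_corr(a, b):
    """Pearson correlation over subjects observed on both markers."""
    common = sorted(set(a) & set(b))
    if len(common) < 2:
        return None
    xs = [a[s] for s in common]
    ys = [b[s] for s in common]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return None
    return sxy / sqrt(sxx * syy)

# Toy records for illustration only (not PPMI data).
records = [
    (1, "putamen", 1.0), (1, "putamen", 3.0), (1, "caudate", 2.0),
    (2, "putamen", 4.0), (2, "caudate", None), (2, "caudate", 5.0),
    (3, "putamen", 6.0), (3, "caudate", 8.0),
    (4, "putamen", 0.5),  # subject 4 has no caudate measurement
]
means = subject_means(records)
r = pairwise_corr(means["putamen"], means["caudate"])
```

Subject 4 contributes to the putamen average but is excluded from the putamen and caudate correlation because caudate is missing.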
Figure 2.
Heatmap of the correlation matrix of the 77 markers averaged across patient repeated measures. (a) For observed symptom scores. (b) For model fitted symptom scores. (c) For subject-and-marker-specific fitted shift scores
based on the MEFF. The correlation patterns inferred by MEFF in (b) are much sharper than those in the observed data (a) because noise is removed and hidden patterns are revealed. Markers measuring the same domain of functions have high correlations.
The visualization results indicate that PPMI measures collected from different modalities each describe the disease progression from a distinct perspective, while they also share some underlying common features. We use the MEFF model to identify both shared and unique latent factors in lower dimensions of the observed measures across all domains, identify combinations of markers to better distinguish diagnostic groups (e.g., prodromal patients from PD and HC subjects), and compare markers’ inflection points.
5.2. Fitting MEFF on PPMI
We fit MEFF on PPMI data using
subjects from eight diagnostic cohorts measured by
77 markers from 10 modalities. Since gender differences are observed in PD (Shulman, 2007), we include gender as a covariate in the analysis. Based on prior knowledge and observed correlation patterns from Figure 2(a), we assume that the 10 modalities share latent factors
of dimension three and each modality has its own unique latent factor. The number of shared latent factors is chosen from the prior knowledge that the 10 modalities of data consist of three main domains (i.e., cognition, autonomic function, and motor function). When prior knowledge is not available, a preliminary factor analysis or principal components analysis can be used to determine the likely range of the dimension of latent factors, and the Bayesian Information Criterion (BIC) can be used to choose the final dimension. We conduct sensitivity analyses in the Supplementary material available at Biostatistics online to show that model parameters are robust to misspecification of the number of latent factors. The VI algorithm implemented in Edward estimates the model parameters and latent factors by maximizing the ELBO function in Section 3. The optimization process takes around 6–8 min to converge, which is fast given the large number of observations and markers. Details on fitting PPMI data using VI and the identifiability issues are included in Section A.1 of the Supplementary material available at Biostatistics online.
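When the number of shared factors is not known in advance, the BIC comparison mentioned above could look like the following sketch. The function names and the fit summaries are hypothetical. How the log-likelihood is evaluated for a model fitted by VI is not specified here, so it is treated as a given input.

```python
from math import log

def bic(log_likelihood, n_params, n_obs):
    """Bayesian Information Criterion: -2 log L + k log n (smaller is better)."""
    return -2.0 * log_likelihood + n_params * log(n_obs)

def choose_dimension(candidates, n_obs):
    """candidates: {dimension: (log_likelihood, n_params)}.

    Returns the dimension with the smallest BIC and all BIC values.
    """
    scores = {d: bic(ll, k, n_obs) for d, (ll, k) in candidates.items()}
    return min(scores, key=scores.get), scores

# Hypothetical fit summaries for illustration only (not PPMI results).
fits = {2: (-5400.0, 160), 3: (-5000.0, 240), 4: (-4990.0, 320)}
best, scores = choose_dimension(fits, n_obs=1000)
```

In this toy comparison the gain in log-likelihood from three to four factors is too small to offset the penalty for the extra parameters, so three factors are selected.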
From the fitted model, we obtained all the parameter estimates as well as the posterior means of
and
for each modality
given the observed data as the estimates for the latent factors. To compare observed outcomes versus fitted outcomes, we obtain the correlation matrix from the observed symptom scores in Figure 2(a) and the correlation matrix using fitted symptom scores
in Figure 2(b). The fitted scores not only preserve major information from observed data but also reveal much stronger patterns after removing noise. To examine correlation in the latent space, let
denote all posterior means of latent factors concatenated together. Since the subject-and-marker-specific random inflection point shifts
are not observed, we first estimate them using
and estimates of
as discussed in Section 5.2. Figure 2(c) is the heatmap of the correlation matrix of
obtained by concatenating
for all subjects. Compared with the correlation matrix of the observed symptoms in Figure 2(a), the estimated inflection point shifts
not only capture similar patterns both within and between modalities but also considerably strengthen the correlation structures after adjusting for age, disease resilience, rate of impairment, and so on. Specifically, SCOPA-AUT items are highly correlated with MDS-UPDRS Part I items, which is expected because they all measure autonomic functions. In addition, SCOPA-AUT items are roughly divided into three groups, where the first and the third groups are highly correlated. Consistent with findings from the observed data, Figure 2(b) from the fitted model also indicates that DaTscan imaging markers have high correlations with motor markers (i.e., MDS-UPDRS Part II and Part III). In contrast, most markers from cognitive tests are not directly correlated with the other modalities.
5.3. Latent factor representation of PD
The factor loading matrix contains valuable information to describe the representation from the high-dimensional observation space to the low-dimensional latent factors learned from PPMI. We first calculate correlation coefficients between estimated random inflection shifts
and latent factors
. Note that in the traditional factor analysis when
are observed and standardized, the factor loadings
are the correlations between each latent factor in
and each observed variable in
. However, since in our analysis
cannot be observed and standardized beforehand, we will calculate the pairwise correlation coefficients between the
fitted marker-specific shift scores in
and the
estimated latent factors in
. Figure 3(a) shows the heatmap of the correlation matrix, where each row is a marker and each column is a latent factor. The first three columns are the three-dimensional shared latent factors, from which we see that the first factor
mainly represents MDS-UPDRS and DaTscan markers. Since PD diagnosis is based on motor symptoms and dopamine deficit in DaTscan imaging, this latent factor plays the most important role in distinguishing patients from control groups. For example, from the density plot in Figure 3(b), we see that “PD” and “HC” cohorts are well separated using
. In addition, “Prodromal” cohort is another important group of interest because prodromal subjects can hardly be distinguished from HC based on motor symptoms, but they have a high chance of converting to PD in the later course. Since
contains not only the motor and DaTscan markers but also incorporates other important measurements from different modalities, it can further separate the “Prodromal” cohort from “HC” as in Figure 3(b). Therefore, our model shows the potential to improve diagnosis by integrating information from multiple modalities. The second shared factor
is almost solely loaded on markers from cognitive tests, indicating that the modality measuring cognitive functions does not interact directly with other modalities. The third latent factor
is loaded on SCOPA-AUT items and MDS-UPDRS Part I items both of which represent autonomic functions. Figure 3(c) shows the smoothed density plot of
and
, where we see that “PD”, “HC” and “Prodromal” subjects differ in the direction of
due to different locations and prodromal subjects differ in the direction of
due to different variances. To summarize, the correlations between predicted
and
provide an intuitive way to understand the utility of latent factors. Our model has flexible interpretations while also maintaining good power to distinguish subjects with different diagnostic statuses.
Figure 3.
(a) Heatmap of the correlation between each subject-and-marker-specific fitted shift scores
and each latent factor. (b) Smoothed density plot of the first shared latent factor
for cohorts HC, PD, and Prodromal. (c) 2D density map for the two shared latent factors. Shared latent factors can distinguish prodromal subjects from HC.
PD patients often present heterogeneous clinical symptoms and disease patterns, making it challenging to establish diagnostic criteria and develop effective treatments. Since the shared latent factors contain major information common to many modalities, they are valuable for clustering patients into meaningful subgroups. For example, we can use an unsupervised clustering algorithm (e.g., Gaussian mixture model) to cluster the PD patients into homogeneous subgroups. Since patients in the same subgroup share similar progression profiles, they can be studied together to understand the disease progression trend and targeted treatment effects.
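As a rough illustration of the clustering idea, the sketch below fits a two-component Gaussian mixture by EM to a single latent score per patient. It is a simplification under stated assumptions: the shared factors in the analysis are three-dimensional, the function names and toy scores are hypothetical, and no numerical safeguards (e.g., for empty components) are included. In practice a library implementation such as scikit-learn's GaussianMixture would be a more robust choice.

```python
from math import exp, pi, sqrt

def normal_pdf(x, mu, var):
    """Density of a univariate normal distribution."""
    return exp(-0.5 * (x - mu) ** 2 / var) / sqrt(2.0 * pi * var)

def fit_gmm_1d(xs, k=2, n_iter=100, min_var=1e-6):
    """EM for a k-component univariate Gaussian mixture (illustrative only)."""
    n = len(xs)
    ordered = sorted(xs)
    # Deterministic start: means at evenly spaced quantiles, common variance.
    means = [ordered[int((j + 0.5) * n / k)] for j in range(k)]
    overall = sum(xs) / n
    var0 = max(sum((x - overall) ** 2 for x in xs) / n, min_var)
    variances = [var0] * k
    weights = [1.0 / k] * k
    for _ in range(n_iter):
        # E-step: posterior probability that each point belongs to each component.
        resp = []
        for x in xs:
            dens = [w * normal_pdf(x, m, v)
                    for w, m, v in zip(weights, means, variances)]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M-step: update weights, means, and variances from the responsibilities.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            variances[j] = max(
                sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, xs)) / nj,
                min_var,
            )
    # Hard labels from the last computed responsibilities.
    labels = [max(range(k), key=lambda j: r[j]) for r in resp]
    return weights, means, variances, labels

# Toy latent scores for illustration only (not estimates from the paper).
scores = [-2.1, -1.9, -2.0, -1.8, 2.0, 2.2, 1.9, 2.1]
weights, means, variances, labels = fit_gmm_1d(scores)
```

The hard labels assign each patient to the component with the highest posterior probability; the posterior probabilities themselves could be kept when subgroup membership is uncertain.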
5.4. Validation and temporal ordering of neurodegeneration of PD
To validate the findings from the MEFF models, we use clinically important external variables that were not used in the model fitting to demonstrate the validity of the fitted shift scores. One external marker is the initiation of medication (e.g., how soon a subject initiated medication after baseline), an important milestone indicating that the disease has progressed to the point where PD symptoms are no longer tolerable without levodopa treatment. Another marker is the REM sleep score, which measures sleep disorders that reflect evolving neural degeneration (Stiasny-Kolster and others, 2007). In Figure C.3 of the Supplementary material available at Biostatistics online, we compute the correlations between each of the two external variables and the observed symptoms (shaded gray), fitted symptoms (yellow), and fitted inflection point shift scores (blue), respectively. We observe that across almost all markers, the correlations with external variables are improved using MEFF fitted results (i.e., fitted symptom scores or shift scores). Higher correlations with external variables of clinical importance provide evidence that the MEFF has captured and enhanced the signals from the raw data, and that the resulting progression scores are more predictive of early disease progression (i.e., higher correlation with an early progression marker, initiation of levodopa).
Besides posterior estimates for latent variables
and
, the fixed effects parameters from MEFF are also informative for learning disease progression, especially for identifying early disease markers. To understand the disease progression only for patients, we fit the MEFF for PD patients excluding HC. Figure 4 shows the population-average marker trajectories represented by the probability of having normal symptoms for
ordinal markers selected from MDS-UPDRS and
continuous markers selected from cognitive and DaTscan imaging domain. As age increases, the probability of having normal symptoms decreases and equivalently the disease severity increases. The dashed line indicates the population-averaged inflection point age
and the slope of the curve indicates the rate of impairment
. For a marker to be useful for early diagnosis, it should have a relatively early inflection point age and large rate of impairment. Figures 4(a) and (b) show that MDS-UPDRS items are more useful for early diagnosis than cognitive markers. Specifically, global spontaneity of movements (NP3BRADY) and facial expression (NP3FACXP) degenerate earlier in the disease course, which is consistent with a recent study (Prashanth and Roy, 2018). Among cognitive and imaging markers, the putamen area has the earliest inflection point age and the largest rate of impairment, followed by the caudate. These findings corroborate the current PD diagnosis which heavily relies on MDS-UPDRS and DaTscan imaging measures. Furthermore, for early screening, neurologists may pay more attention to global spontaneity of movements, facial expression, and putamen area degeneration.
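To make the roles of the inflection point age and the rate of impairment concrete, the sketch below uses a plain logistic curve for the probability of normal symptoms and ranks markers by earlier inflection age, then by larger rate. The logistic form, the function names, and the marker values are assumptions for illustration; the MEFF's actual link functions and parameterization are those defined in the model sections of the paper.

```python
from math import exp

def prob_normal(age, inflection_age, rate):
    """Probability of normal symptoms under an assumed logistic curve.

    Equals 0.5 at the inflection age and decreases with age when rate > 0.
    """
    return 1.0 / (1.0 + exp(rate * (age - inflection_age)))

def rank_for_early_diagnosis(markers):
    """Order markers by earlier inflection age, breaking ties by larger rate."""
    return sorted(markers, key=lambda name: (markers[name][0], -markers[name][1]))

# Hypothetical (inflection_age, rate) pairs; not estimates from the paper.
markers = {
    "marker_A": (62.0, 0.30),
    "marker_B": (70.0, 0.10),
    "marker_C": (62.0, 0.15),
}
order = rank_for_early_diagnosis(markers)
```

Under this form, a larger rate gives a steeper drop around the inflection age, which is why an early inflection age combined with a large rate marks a candidate for early diagnosis.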
Figure 4.
Population-level progression curves estimated from the probability of having normal symptoms. The dashed lines represent the locations of the corresponding inflection point age
shared across PD patients (lines for inflection point age larger than
are omitted). Markers including global spontaneity of movements, facial expression, and putamen clearly indicate evidence of early progression. (a) MDS-UPDRS markers. (b) Cognitive and DaTscan markers.
6. Discussion
In this work, we propose the MEFF model to integrate multidomain data to learn a lower-dimensional latent space of disease impairment, characterize the temporal ordering of clinical and biological markers of PD neurodegeneration, and cluster patients into homogeneous groups. Since neurodegeneration rarely follows a linear trend and patient heterogeneity cannot be sufficiently described by a single set of latent variables, the MEFF takes a nonlinear form with multiple levels of latent variables. Computational challenges of optimizing multilevel nonlinear models are resolved by BBVI, which can handle a large number of observations and heterogeneous distributions.
The MEFF models have several practical utilities. First, at the population level, the model estimates disease progression by establishing which marker deteriorates earlier and faster than the others through the inflection point age
and marker-specific deterioration rate
. Markers with a larger deterioration rate will be more sensitive for monitoring a patient’s disease progression, which will allow investigators to rank and choose a marker to be tested in clinical trials using relatively smaller sample sizes and shorter periods of time. Second, at the individual level, the estimated shared latent scores integrate information across all modalities and show the underlying disease stage relative to the population. This measurement will be more robust than a few specific symptoms for determining the disease status. When a new patient arrives with limited available symptom observations, the model can integrate all available modalities to learn latent scores. In addition, we can estimate the subject-and-marker-specific shift score
and obtain the predicted progression curve of each marker by plugging in population-level parameters in the model. Lastly, MEFF reduces the dimensionality of a large number of markers to lower-dimensional latent factors to allow partitioning of patients into homogeneous groups.
There are a few limitations of BBVI used to fit MEFF. First, the approximation to the targeted posterior distributions may not be tight (Blei and others, 2017). However, compared with exact optimization techniques such as MCMC and EM, VI has convergence criteria that are easy to assess, and the computational gain is substantial for the applications we consider. Second, similar to many nonlinear and nonconvex optimization procedures, the VI algorithm may converge to a local optimum, and the estimation process may be influenced by initial values. We have demonstrated in Table B.3 of the Supplementary material available at Biostatistics online that the estimates from simulated data are robust to the choice of initial values. For our analysis of PPMI, multiple initial values were examined to ensure better convergence.
Learning the global shared latent factor common to all modalities can be restrictive when one or more modalities do not contain valuable information. In this case, we can allow MEFF to learn a partially joint structure through factors shared by a subset of modalities. Prior knowledge is needed to determine which modalities may have partially shared information and which ones are isolated from the others. In the absence of scientific knowledge, one can also run MEFF including all modalities as an exploratory step and determine the modality structures from the fitted results. Another complexity worth noting is the impact on modeling of different numbers of markers in each modality. A modality with a very large number of markers may dominate the fitted results. To balance the effects of unequal numbers of markers, we can consider a weighted likelihood with weights inversely related to the number of markers in each modality. There is no technical difficulty in including time-varying covariates or different covariates for different markers. For other distributions in the exponential family, the likelihood function will change when using VI, but the essential algorithm remains the same. Here, our main results are based on the shared latent factors because they integrate information across all modalities and may be more robust. However, unique factors can potentially also describe disease progression not explained by the common factors.
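One simple way to realize the weighted likelihood mentioned above is to weight each modality by the reciprocal of its number of markers, so that every modality contributes its average per-marker log-likelihood. The sketch below is an assumption about how such weights might be chosen, not a procedure reported in the paper; the function names and numbers are hypothetical.

```python
def modality_weights(marker_counts):
    """Weight each modality by 1 / (number of markers it contains)."""
    return {name: 1.0 / count for name, count in marker_counts.items()}

def weighted_log_likelihood(marker_logliks, weights):
    """Sum of modality-weighted marker log-likelihoods.

    marker_logliks: {modality: [log-likelihood contribution of each marker]}.
    """
    return sum(weights[m] * sum(lls) for m, lls in marker_logliks.items())

# Hypothetical log-likelihood contributions for illustration only.
logliks = {
    "motor": [-10.0, -12.0, -8.0, -10.0],  # 4 markers
    "imaging": [-6.0, -4.0],               # 2 markers
}
weights = modality_weights({name: len(lls) for name, lls in logliks.items()})
total = weighted_log_likelihood(logliks, weights)
```

With these weights the four-marker modality no longer counts twice as much as the two-marker modality in the objective.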
In the real data analysis, we did not focus on demonstrating the accuracy of long-term prediction for each patient because each patient had only been followed for a short period of time. In fact, one of the advantages of our model is to utilize the short-term observations to learn the long-term trajectories. In the simulation studies, we show that all the population parameters can be accurately estimated, individual latent factors are accurate, and thus the long-term marker trajectories can be accurately described. We note that at the individual patient level, our goal is not to rely on any single marker trajectory to determine the underlying disease progression but to focus on the information extracted from the integrated modalities.
To directly account for patient heterogeneity using subgroups, a Gaussian mixture model for
can be considered. To identify mechanisms through which putative causal mutations (e.g., LRRK2 genotype for PD, Apolipoprotein E (APOE) for AD) confer risk for a disorder (e.g., onset of PD), it is crucial to demonstrate that the markers as intermediate outcomes indicate disease progression and are on the pathway from genetic predisposition to neuropathology. However, since gold standard disease diagnosis does not exist unless an autopsy is performed for neuropathologic confirmation, using clinical diagnosis or milestone events of disease progression based on symptoms provides an alternative way to evaluate a marker’s validity. When it is unknown which domain a new marker belongs to, one can consider a method for discovering domain structure through topic modeling.
Supplementary Material
Acknowledgments
Conflict of Interest: None declared.
Contributor Information
Qinxia Wang, Department of Biostatistics, Mailman School of Public Health, Columbia University, 722 W168th Street, New York, 10032, USA.
Yuanjia Wang, Department of Biostatistics, Mailman School of Public Health, Columbia University, 722 W168th Street, New York, 10032, USA.
Funding
U.S. National Institutes of Health (NIH) (NS073671, GM124104, and MH123487).
References
- Allen, J. S., Bruss, J., Brown, C. K. and Damasio, H. (2005). Normal neuroanatomical variation due to age: the major lobes and a parcellation of the temporal region. Neurobiology of Aging 26, 1245–1260.
- Bach, F. R. and Jordan, M. I. (2005). A probabilistic interpretation of canonical correlation analysis. Technical Report 688. Department of Statistics, University of California, Berkeley.
- Blei, D. M., Kucukelbir, A. and McAuliffe, J. D. (2017). Variational inference: a review for statisticians. Journal of the American Statistical Association 112, 859–877.
- Browne, M. W. (1979). The maximum-likelihood solution in inter-battery factor analysis. British Journal of Mathematical and Statistical Psychology 32, 75–86.
- Chen-Plotkin, A. S. (2014). Unbiased approaches to biomarker discovery in neurodegenerative diseases. Neuron 84, 594–607.
- Donohue, M. C., Jacqmin-Gadda, H., Le Goff, M., Thomas, R. G., Raman, R., Gamst, A. C., Beckett, L. A., Jack, C. R. Jr, Weiner, M. W., Dartigues, J.-F. and others. (2014). Estimating long-term multivariate progression from short-term data. Alzheimer’s & Dementia 10, S400–S410.
- Fjell, A. M., Walhovd, K. B., Fennema-Notestine, C., McEvoy, L. K., Hagler, D. J., Holland, D., Brewer, J. B. and Dale, A. M. (2009). One-year brain atrophy evident in healthy aging. Journal of Neuroscience 29, 15223–15231.
- Gaynanova, I. and Li, G. (2019). Structural learning and integrative decomposition of multi-view data. Biometrics 75, 1121–1132.
- Iddi, S., Li, D., Aisen, P. S., Rafii, M. S., Litvan, I., Thompson, W. K. and Donohue, M. C. (2018). Estimating the evolution of disease in the Parkinson’s progression markers initiative. Neurodegenerative Diseases 18, 173–190.
- Jack, C. R., Knopman, D. S., Jagust, W. J., Shaw, L. M., Aisen, P. S., Weiner, M. W., Petersen, R. C. and Trojanowski, J. Q. (2010). Hypothetical model of dynamic biomarkers of the Alzheimer’s pathological cascade. The Lancet Neurology 9, 119–128.
- Jordan, M. I., Ghahramani, Z., Jaakkola, T. S. and Saul, L. K. (1999). An introduction to variational methods for graphical models. Machine Learning 37, 183–233.
- Klami, A., Virtanen, S. and Kaski, S. (2013). Bayesian canonical correlation analysis. Journal of Machine Learning Research 14, 965–1003.
- Lessig, S., Nie, D., Xu, R. and Corey-Bloom, J. (2012). Changes on brief cognitive instruments over time in Parkinson’s disease. Movement Disorders 27, 1125–1128.
- Li, G. and Gaynanova, I. (2018). A general framework for association analysis of heterogeneous data. The Annals of Applied Statistics 12, 1700–1726.
- Li, G., Huang, J. Z. and Shen, H. (2018). Exponential family functional data analysis via a low-rank model. Biometrics 74, 1301–1310.
- Lock, E. F., Hoadley, K. A., Marron, J. S. and Nobel, A. B. (2013). Joint and individual variation explained (JIVE) for integrated analysis of multiple data types. The Annals of Applied Statistics 7, 523.
- Marek, K., Jennings, D., Lasch, S., Siderowf, A., Tanner, C., Simuni, T., Coffey, C., Kieburtz, K., Flagg, E., Chowdhury, S. and others. (2011). The Parkinson progression marker initiative (PPMI). Progress in Neurobiology 95, 629–635.
- Partchev, I. (2004). A visual guide to item response theory. Retrieved November 9, 2004.
- Prashanth, R. and Roy, S. D. (2018). Novel and improved stage estimation in Parkinson’s disease using clinical scales and machine learning. Neurocomputing 305, 78–103.
- Ranganath, R., Gerrish, S. and Blei, D. M. (2014). Black box variational inference. In: Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS). pp. 814–822.
- Shulman, L. M. (2007). Gender differences in Parkinson’s disease. Gender Medicine 4, 8–18.
- Stiasny-Kolster, K., Mayer, G., Schäfer, S., Möller, J. C., Heinzel-Gutenbrunner, M. and Oertel, W. H. (2007). The REM sleep behavior disorder screening questionnaire—a new diagnostic instrument. Movement Disorders 22, 2386–2393.
- Sun, M. and Wang, Y. (2018). Nonlinear model with random inflection points for modeling neurodegenerative disease progression. Statistics in Medicine 37, 4721–4742.
- Tran, D., Kucukelbir, A., Dieng, A. B., Rudolph, M., Liang, D. and Blei, D. M. (2016). Edward: a library for probabilistic modeling, inference, and criticism. arXiv preprint arXiv:1610.09787.
- Wang, J., Hoekstra, J. G., Zuo, C., Cook, T. J. and Zhang, J. (2013). Biomarkers of Parkinson’s disease: current status and future perspectives. Drug Discovery Today 18, 155–162.
- Zhu, H., Li, G. and Lock, E. F. (2020). Generalized integrative principal component analysis for multi-type data with block-wise missing structure. Biostatistics 21, 302–318.