A Bayesian multiple imputation approach to bivariate functional data with missing components

Jeong Hoon Jang; Amita K Manatunga; Changgee Chang; Qi Long

doi:10.1002/sim.9093

Author manuscript; available in PMC: 2022 May 23.

Published in final edited form as: Stat Med. 2021 Jun 8;40(22):4772–4793. doi: 10.1002/sim.9093


PMCID: PMC9125166 NIHMSID: NIHMS1806891 PMID: 34102703

Abstract

Existing missing data methods for functional data mainly focus on reconstructing missing measurements along a single function—a univariate functional data setting. Motivated by a renal study, we focus on a bivariate functional data setting, where each sampling unit is a collection of two distinct component functions, one of which may be missing. Specifically, we propose a Bayesian multiple imputation approach based on a bivariate functional latent factor model that exploits the joint changing patterns of the component functions to allow accurate and stable imputation of one component given the other. We further extend the framework to address multilevel bivariate functional data with missing components by modeling and exploiting inter-component and intra-subject correlations. We develop a Gibbs sampling algorithm that simultaneously generates multiple imputations of missing component functions and posterior samples of model parameters. For multilevel bivariate functional data, a partially collapsed Gibbs sampler is implemented to improve computational efficiency. Our simulation study demonstrates that our methods outperform other competing methods for imputing missing components of bivariate functional data under various designs and missingness rates. The motivating renal study aims to investigate the distribution and pharmacokinetic properties of baseline and post-furosemide renogram curves that provide further insights into the underlying mechanism of renal obstruction, with post-furosemide renogram curves missing for some subjects. We apply the proposed methods to impute missing post-furosemide renogram curves and obtain more refined insights.

Keywords: Bayesian latent factor model, bivariate functional data, curves, missing data, multiple imputation

1 |. INTRODUCTION

Contemporary medical imaging technology is increasingly producing complex data objects that could potentially aid noninvasive diagnostics by effectively delineating pathophysiology of various diseases. One such data object is a function, whose unit of observation is assumed to be a realization of a smooth continuous process defined over a time domain. The dynamic structure of functions, induced by their inherent smoothness and infinite dimensionality, is a rich source of pathophysiological information, but at the same time poses unique challenges in their analysis and modeling. Many statistical techniques have been developed to translate functional data into clinically useful information to facilitate better clinical decision making. See, for example, the monograph by Ramsay and Silverman¹ and references therein. As with other traditional data objects (eg, univariate, multivariate, longitudinal), one of the major impediments to reliable and efficient analysis of functional data is missing data. The complex structure of functions, however, renders traditional tools for handling missing data² inadequate, and, as such, there is an increasing demand for developing missing data methods that are specifically tailored for functional data.

Our work is motivated by a renal study conducted at Emory University. 99mTc-Mercaptoacetyltriglycine (MAG3) renography is a high-tech, noninvasive imaging technique widely used to detect renal obstruction (i.e., obstruction to urine drainage from the kidney).³ MAG3 renography generates two time-activity curves by imaging and monitoring the time-varying concentration of intravenously injected MAG3 inside the kidney. The first renogram curve, called baseline, consists of MAG3 photon counts detected at 59 time points (every 15–30 seconds) during an initial period of 24 minutes. Then, following a 20- to 30-minute waiting period and an intravenous injection of furosemide, a second renogram curve, called post-furosemide, is obtained at 40 time points (every 30 seconds) over a 20-minute period.

Diagnosis of renal obstruction using renogram curves is a difficult problem and generally requires substantial expertise in renal physiology and MAG3 pharmacokinetics.³ As such, the scientific goal of the Emory renal study has been to improve interpretation of renogram curves by investigating and understanding their distribution and pharmacokinetic properties that provide further insights into the underlying physiological mechanism of renal obstruction. The study dataset contains MAG3 renal scans of 314 kidneys that were collected from an archived database at Emory University Hospital. The final obstruction statuses of the kidneys were retrospectively determined by the consensus of three nuclear medicine experts, according to which 248 kidneys were non-obstructed and 66 kidneys were obstructed.

A major challenge to achieving the scientific goal of the study is missing data. For a subset of kidneys, post-furosemide renogram curves were not collected if the attending physician (not a nuclear medicine expert) deemed that obstruction could be excluded only by inspecting their baseline renogram curves, leading to 96 out of 314 kidneys with missing post-furosemide renogram curves (missing rate of 30%). Note, this does not mean that kidneys whose baseline renogram curve patterns suggest non-obstruction always have missing post-furosemide renogram curves, but only have a higher likelihood to do so. The reason is two-fold. First, a complete set of baseline and post-furosemide renogram curves was collected from a majority of kidneys regardless of their obstruction status. Second, the decision to not collect the post-furosemide curve was made by an attending physician prone to making errors. In sum, post-furosemide renogram curves are missing at random (MAR),² where the likelihood of missingness depends on the observed patterns of baseline renogram curves. A simple approach to addressing missing post-furosemide renogram curves is complete-case analysis (CCA), in which kidneys with missing data are discarded from analysis. However, this approach is likely to produce biased and inefficient findings under MAR.² Thus, there is a critical need for a more statistically principled approach that can handle missing post-furosemide renogram curves in the renal study.

The missing data problem encountered in the Emory renal study corresponds to a challenging extrapolation problem, where one component (post-furosemide curve) of bivariate functional data should be imputed using the information from another (baseline curve). Much of the existing literature, however, has focused on the missing data problem under a univariate functional data setting, where the goal is to impute/reconstruct the missing measurements along a single function. Morris et al⁴ and He et al⁵ introduced multiple imputation approaches for incomplete curve data under wavelet-based and smoothing-splines-based functional linear mixed models, respectively. Chiou et al⁶ proposed to impute incomplete curve data using the conditional expectation approach to functional principal component analysis. Yao et al⁷ suggested a smoothing technique for interpolating a sparsely observed function, that is, a function observed on a small number of points that are distributed randomly over its domain. For a fragmented function, that is, a function observed at points that cover only a subset of its domain, methods based on a “shift-and-connect” procedure⁸ and a Markov chain model⁹ were developed to reconstruct the missing fragments of a function given its observed fragments. Kraus¹⁰ proposed an alternative approach based on a Hilbert-Schmidt operator that takes the observed part of a function as input to recover its missing part. Kneip and Liebl¹¹ introduced a more general class of reconstruction operators that exhibits better optimality properties than the restrictive case of Hilbert-Schmidt operators. Liebl and Rameseder¹² considered the case of reconstructing a fragmented function under specific violations of the missing-completely-at-random (MCAR) assumption. Furthermore, Ciarleglio et al¹³ proposed an extension to multiple imputation by chained equations that leverages scalar and functional response regression models to impute missing scalar and functional data.

The aim of this article is to propose a new multiple imputation approach specifically tailored to bivariate functional data with missing component functions. Multiple imputation is a statistically principled method in which the missing values (ideally MAR) are filled multiple times based on observed data to create “completed” datasets.¹⁴ It is popular among practitioners for its flexibility and ease of use. Various existing statistical tools can be readily applied to analyze completed datasets and perform valid inference.² This advantage is particularly sensible for the Emory renal study where a variety of existing statistical tools are sought to study patterns of renogram curves that relate to the underlying mechanism of renal obstruction. The crux of our multiple imputation approach is in efficiently characterizing and exploiting the joint changing patterns of the two component functions to allow accurate and stable imputation of one component given the other. We start by using splines to project the mean of each component function onto a finite-dimensional space that is computationally tractable. We then formulate a bivariate functional latent factor model to jointly regularize the spline coefficients of the two component functions and extract low-dimensional shared latent factors that govern their joint mean structure. A fully Bayesian framework is proposed for imputation. Specifically, we develop a Gibbs sampling algorithm that simultaneously generates imputations of missing component functions (imputation step) and posterior samples of model parameters (posterior step).

The proposed framework exhibits several desirable characteristics. First, it borrows strength across bivariate components via shared latent factors and from other observations via model parameters that are learned by pooling information from the entire sample. Second, it can be applied to sparse functions as well as those with measurement error. Third, a multiplicative gamma process shrinkage prior¹⁵ is utilized for adaptive regularization, which prevents overfitting while retaining flexibility. Lastly, other covariate information can be incorporated to improve imputation accuracy. Furthermore, it is possible that renogram curves from the left and right kidneys of the same subject are correlated. We thus extend the proposed modeling framework to accommodate a setting where multiple sets of bivariate functional data are collected for each subject (i.e., multilevel bivariate functional data); here our approach exploits both inter-component and intra-subject correlations to improve imputation accuracy and efficiency.

The remainder of the article is organized as follows. In Section 2, we propose a multiple imputation approach for bivariate functional data with missing components. In Section 3, we extend the method to a setting of multilevel bivariate functional data. In Section 4, we conduct simulation studies to evaluate the performance of the proposed approaches. In Section 5, we describe the application of our methods to the Emory renal study. In Section 6, we provide concluding remarks.

2 |. METHODS

2.1 |. General setup

Consider the setting of bivariate functional data; that is, each sampling unit is a collection of two functions $(g^{(1)}, g^{(2)})$ that may refer to fundamentally different processes defined on different domains $T^{(1)}$ and $T^{(2)}$. Each component function, $g^{(c)} : T^{(c)} \to ℝ$ $(c = 1, 2)$, is assumed to belong to the space of square-integrable functions $L^{2} (T^{(c)})$, where $T^{(c)} \subset ℝ$.

Let $\{(g_{i}^{(1)}, g_{i}^{(2)}); i = 1, \dots, n\}$ denote n independent copies of bivariate functional data. Given the smoothness of each function $g_{i}^{(c)}$, we model it by a basis expansion. Different systems of basis functions can be chosen in different contexts. For instance, the Fourier basis is widely used to model periodic curve data,¹ and the wavelet basis is appropriate for modeling curves with many peaks and spikes.⁴ The B-spline basis is a powerful tool for modeling nonperiodic curves and achieves a desirable balance between flexibility and computational efficiency.¹⁶ Without loss of generality, we choose to use cubic B-splines, and the resulting parameterization is

$$g_{i}^{(c)} (t) = \sum_{k = 1}^{K^{(c)}} β_{i k}^{(c)} b_{k} (t) = b^{(c)} {(t)}^{T} β_{i}^{(c)}, \tag{1}$$

where $b^{(c)} (t) = {[b_{1} (t), \dots, b_{K^{(c)}} (t)]}^{T}$ is a $K^{(c)}$-dimensional cubic B-spline basis evaluated at argument $t \in T^{(c)}$, and $β_{i}^{(c)} = {[β_{i 1}^{(c)}, \dots, β_{i K^{(c)}}^{(c)}]}^{T}$ is a $K^{(c)}$-dimensional vector of spline coefficients. Note that the number of knots used here is $K^{(c)} - 2$. We recommend choosing a sufficiently large $K^{(c)}$ to incorporate the rich variety of patterns that $g_{i}^{(c)}$ may exhibit.
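The basis matrix in (1) can be sketched numerically as follows. This is a minimal illustration using the Cox-de Boor recursion with a clamped, equally spaced knot sequence; the knot placement, the choice K = 15, and the equally spaced time grid are assumptions made for the example, not details taken from the study. With this construction the number of distinct knots (interior knots plus the two boundaries) is K − 2, which agrees with the count stated above, although the authors' exact knot placement is not specified here.

```python
import numpy as np

def cubic_bspline_basis(t, n_basis, t_min, t_max):
    """Evaluate a clamped cubic B-spline basis with equally spaced knots.

    Returns a (len(t), n_basis) matrix whose j-th row is b(t_j)^T.
    """
    degree = 3
    n_interior = n_basis - degree - 1  # interior knots of a clamped basis
    if n_interior < 0:
        raise ValueError("a cubic basis needs n_basis >= 4")
    interior = np.linspace(t_min, t_max, n_interior + 2)[1:-1]
    knots = np.concatenate(
        [[t_min] * (degree + 1), interior, [t_max] * (degree + 1)]
    )
    t = np.asarray(t, dtype=float)

    # Degree-0 basis: indicators of the half-open knot spans.
    B = np.zeros((t.size, len(knots) - 1))
    for k in range(len(knots) - 1):
        B[:, k] = (knots[k] <= t) & (t < knots[k + 1])
    # Assign t == t_max to the last non-empty span so the basis is defined there.
    B[t == t_max, n_basis - 1] = 1.0

    # Cox-de Boor recursion, raising the degree one step at a time.
    for d in range(1, degree + 1):
        B_new = np.zeros((t.size, len(knots) - 1 - d))
        for k in range(len(knots) - 1 - d):
            left_den = knots[k + d] - knots[k]
            right_den = knots[k + d + 1] - knots[k + 1]
            if left_den > 0:
                B_new[:, k] += (t - knots[k]) / left_den * B[:, k]
            if right_den > 0:
                B_new[:, k] += (knots[k + d + 1] - t) / right_den * B[:, k + 1]
        B = B_new
    return B

# Hypothetical grid: 59 equally spaced time points over 24 minutes, K = 15.
t_grid = np.linspace(0.0, 24.0, 59)
B = cubic_bspline_basis(t_grid, n_basis=15, t_min=0.0, t_max=24.0)
```

Given a coefficient vector `beta` of length 15, `B @ beta` evaluates the curve on the grid. Each row of `B` is nonnegative and sums to one, the usual partition-of-unity property of a clamped B-spline basis.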

In practice, each $g_{i}^{(c)}$ is not directly observable, but available as noisy measurements on a discrete grid $(t_{i 1}^{(c)}, t_{i 2}^{(c)}, \dots, t_{i m_{i}^{(c)}}^{(c)})$ over $T^{(c)}$ . We denote these measurements as $Y_{i j}^{(c)}$ and model them as

$$Y_{i j}^{(c)} = g_{i}^{(c)} (t_{i j}^{(c)}) + ϵ_{i j}^{(c)}, \quad t_{i j}^{(c)} \in T^{(c)}, \quad j = 1, \dots, m_{i}^{(c)}, \tag{2}$$

where $ϵ_{i j}^{(c)}$ are error terms that are independent and identically distributed (i.i.d.) as $N (0, σ_{ϵ}^{2 (c)})$ across i and j. Under this modeling assumption, the observed bivariate functional data are $(Y_{i}^{(1)}, Y_{i}^{(2)})$, where $Y_{i}^{(c)} = {[Y_{i 1}^{(c)}, Y_{i 2}^{(c)}, \dots, Y_{i m_{i}^{(c)}}^{(c)}]}^{T}$ denotes an $m_{i}^{(c)}$-dimensional vector of observed measurements from the cth component function. The missing data problem considered in this article can then be framed as a case where either $Y_{i}^{(1)}$ or $Y_{i}^{(2)}$ is missing for some i.

2.2 |. Bivariate functional latent factor model

Let $Y_{i} = {[Y_{i}^{(1) T}, Y_{i}^{(2) T}]}^{T}$ denote an $m_{i}$-dimensional $(m_{i} = m_{i}^{(1)} + m_{i}^{(2)})$ vector that concatenates the observed functional measurements from the two components. Let $β_{i} = {[β_{i}^{(1) T}, β_{i}^{(2) T}]}^{T}$ denote a K-dimensional (K = K⁽¹⁾ + K⁽²⁾) vector that concatenates the spline coefficients of the two underlying functions $(g_{i}^{(1)}, g_{i}^{(2)})$; in other words, $β_{i k} = β_{i k}^{(1)}$ for k ≤ K⁽¹⁾, and $β_{i k} = β_{i, k - K^{(1)}}^{(2)}$ for k > K⁽¹⁾. Denote $B_{i}^{(c)} = {[b^{(c)} (t_{i 1}^{(c)}), \dots, b^{(c)} (t_{i m_{i}^{(c)}}^{(c)})]}^{T}$ as an $m_{i}^{(c)} \times K^{(c)}$ matrix of cubic B-spline basis functions evaluated on the observation grid, and set $B_{i} = blkdiag (B_{i}^{(1)}, B_{i}^{(2)})$, which is an $m_{i} \times K$ block diagonal matrix with blocks $B_{i}^{(1)}$ and $B_{i}^{(2)}$. Then, under the assumed models (1) and (2), we can succinctly write the joint model for the observed bivariate functional data as:

$$Y_{i} = B_{i} β_{i} + ϵ_{i}, \quad ϵ_{i} \sim N_{m_{i}} (0, Σ_{ϵ}^{(i)}), \tag{3}$$

where $ϵ_{i} = {[ϵ_{i 1}^{(1)}, \dots, ϵ_{i m_{i}^{(1)}}^{(1)}, ϵ_{i 1}^{(2)}, \dots, ϵ_{i m_{i}^{(2)}}^{(2)}]}^{T} = {[ϵ_{i}^{(1) T}, ϵ_{i}^{(2) T}]}^{T}$ is an $m_{i}$-dimensional combined error vector, with $ϵ_{i}^{(1)} = {[ϵ_{i 1}^{(1)}, \dots, ϵ_{i m_{i}^{(1)}}^{(1)}]}^{T}$ and $ϵ_{i}^{(2)} = {[ϵ_{i 1}^{(2)}, \dots, ϵ_{i m_{i}^{(2)}}^{(2)}]}^{T}$ denoting vectors of $m_{i}^{(1)}$ and $m_{i}^{(2)}$ error terms corresponding to the first and second component functions, respectively. Here, $Σ_{ϵ}^{(i)} = blkdiag (σ_{ϵ}^{2 (1)} I_{m_{i}^{(1)}}, σ_{ϵ}^{2 (2)} I_{m_{i}^{(2)}})$, where $I_{m_{i}^{(c)}}$ denotes an $m_{i}^{(c)} \times m_{i}^{(c)}$ identity matrix.

Although a high-dimensional β_i is needed to capture rich information contained in both component functions, the resulting nonsparsity may cause the problem of overfitting and inefficiency. As such, we propose to jointly model the coefficients of both component functions by formulating a bivariate functional latent factor model, which extends its univariate counterpart.¹⁷ Specifically, we model the concatenated coefficients β_i as:

$$β_{i} = Λ η_{i} + ζ_{i}, \quad ζ_{i} \sim N_{K} (0, Σ_{ζ}), \tag{4}$$

where Λ = {λ_kl}(k = 1, … , K; l = 1, … , q) is a K × q matrix of factor loadings with q ≪ K, and ζ_i = [ζ_i1, … , ζ_iK]^T is a K-dimensional residual vector that is independent of other variables in the model and is normally distributed with mean 0 and a diagonal covariance matrix $Σ_{ζ} = diag (σ_{ζ 1}^{2}, \dots, σ_{ζ K}^{2})$. η_i = [η_i1, … , η_iq]^T is a q-dimensional vector of latent factors that is shared across the bivariate components and parsimoniously captures their joint changing patterns, providing a low-dimensional approximation to β_i. Shared latent factors η_i play two crucial roles. First, they strike a balance between underfitting and overfitting of bivariate functions in a data-driven manner. Second, they allow us to impute one component function from the other with accuracy and stability by facilitating efficient information sharing between the components. To illustrate these points, note that, given η_i, the expected value of each component function $g_{i}^{(c)} (t)$ can be written via a basis expansion: $E \{g_{i}^{(c)} (t)\} = \sum_{l = 1}^{q} η_{i l} ψ_{l}^{(c)} (t)$, where $ψ_{l}^{(c)} (t) = \sum_{k = 1}^{K^{(c)}} λ_{k^{*} l} b_{k} (t)$, l = 1, … , q, with k* = k + (c − 1)K⁽¹⁾, form an unknown low-dimensional basis system to be learned from data, and η_il are the corresponding coefficients, which are shared across the bivariate components. The dimension of the latent factors, q, can be learned from data.

Let x_i = [x_1i, … , x_pi]^T denote a p-dimensional vector of covariates, which are assumed to be fully observed. Following the approach of Montagna et al,¹⁷ we incorporate covariate information by linking x_i to the latent factors through the following linear model

$$η_{i} = Γ x_{i} + e_{i}, \quad e_{i} \sim N_{q} (0, I_{q}), \tag{5}$$

where Γ = {γ_lr}(l = 1, … , q; r = 1, … , p) is a q × p matrix of unknown coefficients, and e_i is a q-dimensional vector of pure latent factors. The modeling assumption (5) implies that, conditionally on $\{Λ, Σ_{ζ}, Γ, {(x_{i})}_{i = 1}^{n}, {(B_{i})}_{i = 1}^{n}\}$, $g_{1}^{(c)}, \dots, g_{n}^{(c)}$ are independent Gaussian processes with mean function $E \{g_{i}^{(c)} (t)\} = \sum_{l = 1}^{q} \sum_{r = 1}^{p} γ_{l r} x_{r i} ψ_{l}^{(c)} (t)$, $t \in T^{(c)}$, and covariance function

$$\mathrm{Cov} \{g_{i}^{(c)} (t), g_{i}^{(c)} (u)\} = \sum_{l = 1}^{q} ψ_{l}^{(c)} (t) ψ_{l}^{(c)} (u) + \sum_{k = 1}^{K^{(c)}} b_{k} (t) b_{k} (u) σ_{ζ k^{*}}^{2}, \quad t, u \in T^{(c)} .$$

Conditioning on the same values, we can also derive the cross covariance function between the components of bivariate functional data:

$$\mathrm{Cov} \{g_{i}^{(1)} (t), g_{i}^{(2)} (u)\} = \sum_{l = 1}^{q} ψ_{l}^{(1)} (t) ψ_{l}^{(2)} (u), \quad t \in T^{(1)}, \; u \in T^{(2)} .$$

If covariates are not available in the dataset, we can assume that $η_{i} \sim N_{q} (0, I_{q})$, which corresponds to the original Bayesian latent factor model.¹⁵
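The mean, covariance, and cross-covariance functions above follow from substituting (5) into (4) and using the independence of e_i and ζ_i. The short derivation below spells out that step. The block notation Λ^(c) (the K^(c) × q block of rows of Λ belonging to component c) and Σ_ζ^(c) (the corresponding diagonal block of Σ_ζ) is introduced here only for this illustration.

```latex
\begin{aligned}
\beta_i &= \Lambda \Gamma x_i + \Lambda e_i + \zeta_i, \\
E(\beta_i) &= \Lambda \Gamma x_i, \qquad
\mathrm{Cov}(\beta_i) = \Lambda \Lambda^{T} + \Sigma_{\zeta}, \\
\mathrm{Cov}\{g_i^{(c)}(t),\, g_i^{(c')}(u)\}
  &= b^{(c)}(t)^{T}
     \left[ \Lambda^{(c)} \Lambda^{(c')T}
            + 1(c = c')\, \Sigma_{\zeta}^{(c)} \right]
     b^{(c')}(u).
\end{aligned}
```

Since $b^{(c)}(t)^{T} Λ^{(c)} = [ψ_{1}^{(c)}(t), \dots, ψ_{q}^{(c)}(t)]$, the first term gives the sum over the products of ψ functions. Because Σ_ζ is diagonal, the residual term appears only when c = c′, which is why the cross-covariance between the two components depends on the loadings alone.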

2.3 |. Specification of Bayesian priors

To complete the Bayesian model specification, we assign prior distributions for all unknown model parameters. We first note that the factor model (4) is invariant under rotation or scaling of the loadings and scores, so the loading matrix Λ is not identifiable from the covariance matrix Ω = ΛΛ^T + Σ_ζ. This issue is typically addressed by imposing identifiability constraints, such as assuming the loading matrix to be lower triangular with positive diagonal entries.¹⁸ One disadvantage of this approach is that it induces a priori order dependence in the off-diagonal entries of the covariance matrix, with the specification of the order of the first k response variables being an important but difficult modeling decision.¹⁹ On the other hand, from a Bayesian perspective, identifiability of the loading elements is not required for a wide class of applications, including estimation and inference on the covariance matrix.¹⁵,²⁰ Also, our model is identifiable as long as the covariance matrix Ω is well identified, making our case particularly amenable to Bayesian techniques. As a result, we assign a multiplicative gamma process shrinkage prior¹⁵ on the factor loadings as follows:

$$λ_{k l} \sim N (0, ϕ_{k l}^{- 1} τ_{l}^{- 1}), \quad ϕ_{k l} \sim G (v / 2, v / 2), \quad τ_{l} = \prod_{h = 1}^{l} δ_{h}, \quad δ_{1} \sim G (a_{1}, 1), \quad δ_{h} \sim G (a_{2}, 1), \; h \geq 2,$$

where τ_l is a global shrinkage parameter for the lth column of the loading matrix Λ, ϕ_kl is a local shrinkage parameter applied to each element of the lth column, and $G (a, b)$ denotes a Gamma distribution with shape parameter a and rate parameter b. This prior formulation has several advantages. First, the prior is placed on a parameter-expanded factor loading matrix, making the induced prior on the covariance matrix Ω free of order dependence. Second, it allows the introduction of infinitely many factors and hence avoids the need to specify the number of factors, which is often a difficult task.²¹ Third, by setting a₂ > 1 (that is, by letting τ_l stochastically increase in the factor (column) index l), the loadings are increasingly shrunk toward zero as the factor index increases. This induced sparsity on the factor loadings enables efficient basis selection in (4) and accurate posterior computations. Finally, the simultaneous use of global and local shrinkage parameters allows strong global shrinkage for zero loadings while not over-shrinking nonzero loadings.
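To make the shrinkage mechanism concrete, the following sketch draws one loading matrix from the prior above. The hyperparameter values (v = 3, a₁ = 2, a₂ = 3) and the dimensions are hypothetical choices for illustration only; the function name is likewise an assumption. Note that NumPy parameterizes the Gamma distribution by shape and scale, so a rate of v/2 becomes a scale of 2/v.

```python
import numpy as np

def sample_mgp_loadings(K, q, v=3.0, a1=2.0, a2=3.0, rng=None):
    """Draw a K x q loading matrix from the multiplicative gamma process prior."""
    rng = np.random.default_rng() if rng is None else rng
    # Local shrinkage: phi_kl ~ Gamma(shape v/2, rate v/2), i.e. scale 2/v.
    phi = rng.gamma(shape=v / 2, scale=2 / v, size=(K, q))
    # delta_1 ~ Gamma(a1, 1) and delta_h ~ Gamma(a2, 1) for h >= 2.
    delta = np.concatenate(
        [rng.gamma(a1, 1.0, size=1), rng.gamma(a2, 1.0, size=q - 1)]
    )
    # Global shrinkage for column l: tau_l = delta_1 * ... * delta_l.
    tau = np.cumprod(delta)
    # lambda_kl ~ N(0, 1 / (phi_kl * tau_l)); tau broadcasts across rows.
    Lambda = rng.normal(size=(K, q)) / np.sqrt(phi * tau)
    return Lambda, tau

# Hypothetical sizes: K = 30 spline coefficients, q = 8 candidate factors.
Lambda_draw, tau_draw = sample_mgp_loadings(30, 8, rng=np.random.default_rng(1))
```

With a₂ > 1 the expected value of each δ_h exceeds one, so the precisions `tau_draw` tend to grow with the column index and later columns of `Lambda_draw` tend to be closer to zero.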

We use conjugate priors for the variances of error terms in (2) and (4); that is, $σ_{ζ k}^{2} \sim I G (a_{ζ}, b_{ζ})$ and $σ_{ϵ}^{2 (c)} \sim I G (a_{ϵ}^{(c)}, b_{ϵ}^{(c)})$, where $I G (a, b)$ denotes an Inverse Gamma distribution with shape parameter a and scale parameter b. For the regression coefficients Γ in (5), the prior distributions can be assigned as $γ_{l r} \sim N (0, ω_{l r}^{- 1})$ and $ω_{l r} \sim G (1 / 2, 1 / 2)$.

2.4 |. Bayesian multiple imputation framework

We now introduce our Bayesian multiple imputation scheme for bivariate functional data with missing components. Define the missingness indicator as $Δ_{i}^{(c)} = 0$ if $Y_{i}^{(c)}$ is missing, and $Δ_{i}^{(c)} = 1$ otherwise. Then let ${\tilde{Y}}_{mis}^{(1)} = \{{(Y_{i}^{(1)}, Y_{i}^{(2)})}_{i = 1}^{n} : Δ_{i}^{(1)} = 0, Δ_{i}^{(2)} = 1\}$ and ${\tilde{Y}}_{mis}^{(2)} = \{{(Y_{i}^{(1)}, Y_{i}^{(2)})}_{i = 1}^{n} : Δ_{i}^{(1)} = 1, Δ_{i}^{(2)} = 0\}$ represent the collections of functional data from observations with missing first and second component functions, respectively. Let ${\tilde{Y}}_{obs} = \{{(Y_{i}^{(1)}, Y_{i}^{(2)})}_{i = 1}^{n} : Δ_{i}^{(1)} = 1, Δ_{i}^{(2)} = 1\}$ denote the collection of functional data from observations with fully observed components, and let $\tilde{X} = {(x_{i})}_{i = 1}^{n}$ denote the collection of covariate data. Our goal is to impute ${\tilde{Y}}_{mis}^{(c)}$ based on the information contained in ${\tilde{Y}}_{mis}^{(c^{'})}$ $(c^{'} \neq c)$, ${\tilde{Y}}_{obs}$ and $\tilde{X}$.

Denote $θ = \{{(β_{i})}_{i = 1}^{n}, {(η_{i})}_{i = 1}^{n}, σ_{ϵ}^{2 (1)}, σ_{ϵ}^{2 (2)}, Σ_{ζ}, Λ, {(ϕ_{k l})}_{k = 1, \dots, K}^{l = 1, \dots, q}, {(δ_{h})}_{h = 1}^{q}, Γ, {(ω_{l r})}_{l = 1, \dots, q}^{r = 1, \dots, p}\}$ as the collection of all unknown parameters in the bivariate functional latent factor model described in Section 2.2. Assuming MAR for missing components of bivariate functional data, a natural approach is to treat ${\tilde{Y}}_{mis}^{(1)}$ and ${\tilde{Y}}_{mis}^{(2)}$ as unknown quantities to be updated at each MCMC iteration. Specifically, we propose to use the Gibbs sampling technique²² to iteratively obtain draws of ${\tilde{Y}}_{mis}^{(1)}$, ${\tilde{Y}}_{mis}^{(2)}$ and θ from their full conditional distributions. Each dth iteration runs through the following three steps:

Step I. $θ^{(d + 1)} \sim p (θ \mid {\tilde{Y}}_{mis}^{(1) (d)}, {\tilde{Y}}_{mis}^{(2) (d)}, {\tilde{Y}}_{obs}, \tilde{X})$;

Step II. ${\tilde{Y}}_{mis}^{(1) (d + 1)} \sim p ({\tilde{Y}}_{mis}^{(1)} \mid {\tilde{Y}}_{mis}^{(2) (d)}, {\tilde{Y}}_{obs}, θ^{(d + 1)}, \tilde{X})$;

Step III. ${\tilde{Y}}_{mis}^{(2) (d + 1)} \sim p ({\tilde{Y}}_{mis}^{(2)} \mid {\tilde{Y}}_{mis}^{(1) (d + 1)}, {\tilde{Y}}_{obs}, θ^{(d + 1)}, \tilde{X})$.

The full Gibbs sampling algorithm is presented in Appendix A of the online Supplementary Material.
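Because Appendix A is not reproduced here, the following is only an illustrative sketch of the conditioning idea behind Steps II and III, not the authors' sampler. It assumes current draws of Λ, Σ_ζ, and Γ are available and, as a simplification, treats the spline coefficients of the observed component as known. Integrating η_i out of (4) and (5) gives β_i ~ N_K(ΛΓx_i, ΛΛ^T + Σ_ζ), so the coefficients of the missing component follow the usual Gaussian conditional distribution given those of the observed component. The function name, argument names, and toy dimensions are all hypothetical.

```python
import numpy as np

def draw_missing_coefs(beta_obs, obs_idx, mis_idx, Lambda, sigma2_zeta, Gamma, x, rng):
    """Draw spline coefficients of a missing component given the observed one."""
    mu = Lambda @ (Gamma @ x)                         # E(beta_i) = Lambda Gamma x_i
    Omega = Lambda @ Lambda.T + np.diag(sigma2_zeta)  # Cov(beta_i)
    O_oo = Omega[np.ix_(obs_idx, obs_idx)]
    O_mo = Omega[np.ix_(mis_idx, obs_idx)]
    O_mm = Omega[np.ix_(mis_idx, mis_idx)]
    # gain = O_mo @ inv(O_oo), computed with a linear solve (O_oo is symmetric).
    gain = np.linalg.solve(O_oo, O_mo.T).T
    cond_mean = mu[mis_idx] + gain @ (beta_obs - mu[obs_idx])
    cond_cov = O_mm - gain @ O_mo.T
    cond_cov = (cond_cov + cond_cov.T) / 2            # symmetrize against round-off
    draw = rng.multivariate_normal(cond_mean, cond_cov)
    return draw, cond_mean, cond_cov

# Toy example: K1 = 3 first-component and K2 = 2 second-component coefficients.
rng = np.random.default_rng(0)
K1, K2, q, p = 3, 2, 2, 1
Lambda = rng.normal(size=(K1 + K2, q))
sigma2_zeta = np.full(K1 + K2, 0.1)
Gamma = rng.normal(size=(q, p))
x = np.ones(p)
obs_idx, mis_idx = np.arange(K1), np.arange(K1, K1 + K2)
beta_obs = rng.normal(size=K1)
beta_mis, cond_mean, cond_cov = draw_missing_coefs(
    beta_obs, obs_idx, mis_idx, Lambda, sigma2_zeta, Gamma, x, rng
)
```

Under model (3), an imputed curve on a chosen grid would then be obtained by multiplying the corresponding basis matrix by `beta_mis` and adding Gaussian noise with the current draw of the error variance for that component.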

Given the initial values of model parameters, the Gibbs sampler repeatedly cycles through the steps presented in Appendix A of the online Supplementary Material to produce a sequence of draws of model parameters and missing functions. Note that the number of latent factors, q, is not fixed but learned within the Gibbs sampler. Convergence of the Gibbs chain can be assessed by monitoring the trace plots and calculating various diagnostic measures, including the Gelman-Rubin²³ and Raftery-Lewis²⁴ statistics. Although the Gibbs chain is guaranteed to converge given randomly chosen initial values, we provide a set of initial values that speed up convergence in Appendix A of the online Supplementary Material. Once the chain has converged, we take M = 5 or 10 draws of missing functions with sufficient lags between the draws to reduce autocorrelation. Various statistical analyses can be performed with the imputed datasets. Note that no matter which analysis is selected, the estimates from each imputed dataset should be combined using Rubin’s rules¹⁴ to account for the variability in results between the imputed datasets and ensure proper estimation and inference. Specifically, let θ denote the parameter of interest (a scalar quantity, distinct from the collection of model parameters in Step I), and let ${\hat{θ}}_{(m)}$ and $\hat{Var} ({\hat{θ}}_{(m)})$, respectively, denote the point estimate and estimated variance obtained from the mth imputed dataset (m = 1, … , M). Then the final pooled estimate and its estimated variance can be computed as $\hat{θ} = M^{- 1} \sum_{m = 1}^{M} {\hat{θ}}_{(m)}$ and $\hat{Var} (\hat{θ}) = V_{W} + (1 + M^{- 1}) V_{B}$, respectively, where $V_{W} = M^{- 1} \sum_{m = 1}^{M} \hat{Var} ({\hat{θ}}_{(m)})$ is the within-imputation variance reflecting the variability of the parameter estimates in each imputed dataset, and $V_{B} = {(M - 1)}^{- 1} \sum_{m = 1}^{M} {({\hat{θ}}_{(m)} - \hat{θ})}^{2}$ is the between-imputation variance reflecting the variability in the estimates caused by the missing information.
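The pooling formulas above translate directly into a few lines of code. The sketch below implements them for a scalar parameter; the function name and the numbers in the example are made up for illustration.

```python
def pool_rubin(estimates, variances):
    """Pool M point estimates and their variances with Rubin's rules."""
    M = len(estimates)
    if M < 2 or len(variances) != M:
        raise ValueError("need M >= 2 estimates with matching variances")
    theta_hat = sum(estimates) / M                    # pooled point estimate
    v_within = sum(variances) / M                     # V_W
    v_between = sum((e - theta_hat) ** 2 for e in estimates) / (M - 1)  # V_B
    total_var = v_within + (1 + 1 / M) * v_between
    return theta_hat, total_var

# Hypothetical estimates and variances from M = 3 imputed datasets.
pooled_estimate, pooled_variance = pool_rubin([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
```

In this toy example the pooled estimate is 2.0, the within-imputation variance is 0.5, and the between-imputation variance is 1.0, so the pooled variance is 0.5 + (4/3)(1.0).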

The main advantage of our approach is that it borrows strength both within and across observations while performing the imputation. The borrowing of strength within an observation is accomplished through the latent factor η_i, which contains information shared across the components of bivariate functional data as well as the covariates. The borrowing of strength across observations occurs when their information is pooled to estimate Λ and Γ, which characterize the distributions in (4) and (5), respectively.

3 |. EXTENSION TO MULTILEVEL BIVARIATE FUNCTIONAL DATA

3.1 |. Model formulation

In the Emory renal study, each subject has two sets of baseline and post-furosemide renogram curves, one from the left kidney and one from the right kidney. This situation corresponds to a setting of multilevel/repeated bivariate functional data, where S_i sets of bivariate functional data are collected for the ith subject; that is, $(g_{s i}^{(1)}, g_{s i}^{(2)})$, s = 1, … , S_i, i = 1, … , n. Here, s indicates a level-one unit (kidney), and i indicates a level-two unit (subject). In the renal study, s indicates a left or right kidney (s = L, R), so that S_i = 2 for all i. We consider two types of correlation in multilevel bivariate functional data: (A) inter-component correlation, that is, correlation between $g_{s i}^{(1)}$ and $g_{s i}^{(2)}$; and (B) intra-subject correlation, that is, correlation among $g_{1 i}^{(c)}, \dots, g_{S_{i}, i}^{(c)}$ (c = 1, 2). Our goal is to exploit both correlations to improve accuracy and efficiency of imputation.

As previously done, we model observed functional measurements as:

$$Y_{s i j}^{(c)} = g_{s i}^{(c)} (t_{s i j}^{(c)}) + ϵ_{s i j}^{(c)}, \quad t_{s i j}^{(c)} \in T^{(c)}, \quad c = 1, 2, \quad j = 1, \dots, m_{s i}^{(c)}, \quad ϵ_{s i j}^{(c)} \sim N (0, σ_{ϵ}^{2 (c)}),$$

where $ϵ_{s i j}^{(c)}$ is an error term, and $g_{s i}^{(c)} (t)$ is an underlying smooth function that can be modeled using K^(c)-dimensional cubic B-splines b^(c)(t) as

g_{s i}^{(c)} (t) = b^{(c)} {(t)}^{T} β_{s i}^{(c)},

with $β_{s i}^{(c)} = {[β_{s i 1}^{(c)}, \dots, β_{s i K^{(c)}}^{(c)}]}^{T}$ denoting a $K^{(c)}$-dimensional vector of coefficients. Let $Y_{s i}^{(c)} = {[Y_{s i 1}^{(c)}, \dots, Y_{s i m_{s i}^{(c)}}^{(c)}]}^{T}$, and denote $B_{s i}^{(c)} = {[b^{(c)} (t_{s i 1}^{(c)}), \dots, b^{(c)} (t_{s i m_{s i}^{(c)}}^{(c)})]}^{T}$ as an $m_{s i}^{(c)} \times K^{(c)}$ matrix of spline basis functions that model $Y_{s i}^{(c)}$. Let $Y_{s i} = {[Y_{s i}^{(1) T}, Y_{s i}^{(2) T}]}^{T}$ denote an $m_{s i}$-dimensional $(m_{s i} = m_{s i}^{(1)} + m_{s i}^{(2)})$ vector that concatenates the observed functional measurements from the two components, and let $β_{s i} = {[β_{s i}^{(1) T}, β_{s i}^{(2) T}]}^{T}$ denote a K-dimensional (K = K⁽¹⁾ + K⁽²⁾) vector of corresponding coefficients. Then, denoting $B_{s i} = blkdiag (B_{s i}^{(1)}, B_{s i}^{(2)})$, the model for Y_si is

$$Y_{s i} = B_{s i} β_{s i} + ϵ_{s i}, \quad ϵ_{s i} \sim N_{m_{s i}} (0, Σ_{ϵ}^{(s i)}),$$

where $ϵ_{s i} = {[ϵ_{s i 1}^{(1)}, \dots, ϵ_{s i m_{s i}^{(1)}}^{(1)}, ϵ_{s i 1}^{(2)}, \dots, ϵ_{s i m_{s i}^{(2)}}^{(2)}]}^{T}$, and $Σ_{ϵ}^{(s i)} = blkdiag (σ_{ϵ}^{2 (1)} I_{m_{s i}^{(1)}}, σ_{ϵ}^{2 (2)} I_{m_{s i}^{(2)}})$.

To facilitate information sharing between the two component functions and between level-one units (kidneys) from the same level-two unit (subject), we specify a bivariate functional latent factor model on their concatenated coefficients β_si as follows

$$β_{s i} = Λ η_{s i} + ξ_{i} + ζ_{s i}, \quad ξ_{i} \sim N_{K} (0, Σ_{ξ}), \quad ζ_{s i} \sim N_{K} (0, Σ_{ζ}), \tag{6}$$

where Λ = {λ_kl}(k = 1, … , K; l = 1, … , q) is a K × q matrix of factor loadings, η_si = [η_si1, … , η_siq]^T is a q-dimensional vector of latent factors, ξ_i = [ξ_i1, … , ξ_iK]^T is a K-dimensional vector of mean-zero subject-specific random effects with covariance matrix $Σ_{ξ} = diag (σ_{ξ 1}^{2}, \dots, σ_{ξ K}^{2})$, and ζ_si = [ζ_si1, … , ζ_siK]^T is a K-dimensional residual vector with diagonal covariance matrix $Σ_{ζ} = diag (σ_{ζ 1}^{2}, \dots, σ_{ζ K}^{2})$. Note that ξ_i and ζ_si are independent of each other and of all other variables.

As done previously, we incorporate covariate information by linking the covariate vector, x_si = [x_1si, … , x_psi]^T, to the latent factor η_si as follows

$$η_{s i} = Γ x_{s i} + e_{s i}, \quad e_{s i} \sim N_{q} (0, I_{q}), \tag{7}$$

where Γ = {γ_lr}(l = 1, … , q; r = 1, … , p) is a q × p matrix of regression coefficients, and e_si is a q-dimensional vector of pure latent factors. Under the hierarchical modeling framework of (6) and (7), the coefficients from two component functions, β_si, contain inter-component and intra-subject information via shared latent factors η_si and subject-specific random effects ξ_i. In other words, our framework simultaneously induces inter-component and intra-subject correlations, and facilitates borrowing of information across components and within subjects, leading to a more efficient and accurate imputation.

To illustrate how inter-component and intra-subject correlations are induced in the renal study application, we derive the covariance function (8), inter-component cross-covariance function (9), and between-kidney (intra-subject) cross-covariance function (10), conditionally on {(B_si), Λ, Σ_ξ, Σ_ζ, Γ, x_si}:

Cov {g_{s i}^{(c)} (t), g_{s i}^{(c)} (u)} = \sum_{l = 1}^{q} ψ_{l}^{(c)} (t) ψ_{l}^{(c)} (u) + \sum_{k = 1}^{K^{(c)}} b_{k} (t) b_{k} (u) (σ_{ξ k^{*}}^{2} + σ_{ζ k^{*}}^{2}), t, u \in T^{(c)},

(8)

Cov {g_{s i}^{(1)} (t), g_{s i}^{(2)} (u)} = \sum_{l = 1}^{q} ψ_{l}^{(1)} (t) ψ_{l}^{(2)} (u), t \in T^{(1)}, u \in T^{(2)},

(9)

Cov {g_{L i}^{(c)} (t), g_{R i}^{(c)} (u)} = \sum_{k = 1}^{K^{(c)}} b_{k} (t) b_{k} (u) σ_{ξ k^{*}}^{2}, t, u \in T^{(c)},

(10)

where $ψ_{l}^{(c)} (t) = \sum_{k = 1}^{K^{(c)}} b_{k} (t) λ_{k^{*} l}$ , and k* = k + (c − 1)K⁽¹⁾.
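
The covariance structure in (8)-(10) can be checked at the level of the basis coefficients. The following is a minimal sketch, not the authors' implementation: it builds the two coefficient-level covariance blocks implied by (6) and (7), from which (8)-(10) follow after pre- and post-multiplying by the basis evaluations b(t) and b(u). The function names and the toy dimensions are hypothetical and chosen only for illustration.

```python
def matmul_t(a, b):
    """Return a @ b^T for matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row_a, row_b)) for row_b in b]
            for row_a in a]


def coefficient_covariances(lam, var_xi, var_zeta):
    """Covariance blocks of beta_si implied by (6)-(7), conditional on x_si.

    lam      : K x q loading matrix Lambda (list of rows)
    var_xi   : the K diagonal entries of Sigma_xi
    var_zeta : the K diagonal entries of Sigma_zeta
    Returns (within, between), both K x K:
      within  = Cov(beta_si, beta_si) = Lambda Lambda^T + Sigma_xi + Sigma_zeta
      between = Cov(beta_Li, beta_Ri) = Sigma_xi
    """
    K = len(lam)
    shared = matmul_t(lam, lam)  # Lambda Lambda^T, because e_si ~ N(0, I_q)
    within = [[shared[j][k] + (var_xi[k] + var_zeta[k] if j == k else 0.0)
               for k in range(K)] for j in range(K)]
    between = [[var_xi[k] if j == k else 0.0 for k in range(K)]
               for j in range(K)]
    return within, between


# Toy sizes (hypothetical): K^(1) = 2, K^(2) = 1, q = 1.
lam = [[1.0], [0.5], [2.0]]
within, between = coefficient_covariances(lam, [0.2] * 3, [0.1] * 3)
```

In this toy example the entries of `within` that link the first two coefficients (component 1) with the third (component 2) come only from ΛΛ^T, which mirrors (9), while `between` is diagonal and contains only the subject-level variances, which mirrors (10).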

3.2 |. Bayesian priors and multiple imputation

We impose a conjugate prior distribution on the variance components of the subject-specific random effects: $σ_{ξ k}^{2} \sim IG (a_{ξ}, b_{ξ})$, k = 1, … , K. Priors for the other parameters are specified as in Section 2.3.

Define the missingness indicator at the sth level (kidney) as $Δ_{s i}^{(c)} = 0$ if $Y_{s i}^{(c)}$ is missing, and $Δ_{s i}^{(c)} = 1$ otherwise. Set ${\tilde{Y}}_{mis}^{(1)} = \{ {(Y_{s i}^{(1)}, Y_{s i}^{(2)})}_{s = 1, \dots, S_{i}}^{i = 1, \dots, n} : Δ_{s i}^{(1)} = 0, Δ_{s i}^{(2)} = 1 \}$, ${\tilde{Y}}_{mis}^{(2)} = \{ {(Y_{s i}^{(1)}, Y_{s i}^{(2)})}_{s = 1, \dots, S_{i}}^{i = 1, \dots, n} : Δ_{s i}^{(1)} = 1, Δ_{s i}^{(2)} = 0 \}$, and ${\tilde{Y}}_{obs} = \{ {(Y_{s i}^{(1)}, Y_{s i}^{(2)})}_{s = 1, \dots, S_{i}}^{i = 1, \dots, n} : Δ_{s i}^{(1)} = 1, Δ_{s i}^{(2)} = 1 \}$. Denote $θ = \{ {(β_{s i})}_{s = 1, \dots, S_{i}}^{i = 1, \dots, n}, {(η_{s i})}_{s = 1, \dots, S_{i}}^{i = 1, \dots, n}, {(ξ_{i})}_{i = 1}^{n}, σ_{ϵ}^{2 (1)}, σ_{ϵ}^{2 (2)}, Σ_{ζ}, Σ_{ξ}, Λ, {(ϕ_{k l})}_{k = 1, \dots, K}^{l = 1, \dots, L}, {(δ_{r})}_{r = 1}^{p}, Γ, {(ω_{l r})}_{l = 1, \dots, q}^{r = 1, \dots, p} \}$ as the collection of all unknown parameters. Under this notation, the proposed multiple imputation framework iterates through the three steps presented in Section 2.4 to obtain draws of ${\tilde{Y}}_{mis}^{(1)}$, ${\tilde{Y}}_{mis}^{(2)}$, and θ from their full conditional distributions.

We highlight here the modifications made to the sampling steps for β_si, η_si, Λ, and ξ_i. In our highly structured model, the ordinary Gibbs sampler²² based on full conditional posterior distributions suffers from poor mixing, high autocorrelations, and strong posterior correlations between the parameters. To improve the chain’s convergence while maintaining the target stationary distribution, we derive a partially collapsed Gibbs (PCG) sampler²⁵ that permutes and reduces the conditioning in the draws of β_si, η_si, Λ, and ξ_i. The detailed PCG sampler is provided in Appendix B of the online Supplementary Material.

4 |. SIMULATIONS

In this section, we conduct simulation studies to evaluate the performance of the proposed method and compare it with that of several competing methods. Hereafter, we refer to the proposed multiple imputation method for bivariate functional data introduced in Section 2 as “FUNMI” and to that for multilevel bivariate functional data introduced in Section 3 as “FUNMI2”.

4.1 |. Data generated from a bivariate functional latent factor model

In the first set of simulations, we generate data from the bivariate functional latent factor model. Specifically, we generate bivariate functional data $Y_{i} = {[Y_{i}^{(1) T}, Y_{i}^{(2) T}]}^{T}$ from $N_{m_{i}} (μ_{i}, Σ_{i})$ with joint mean μ_i = B_iΛΓx_i and covariance $Σ_{i} = B_{i} Λ Λ^{T} B_{i}^{T} + B_{i} Σ_{ζ} B_{i}^{T} + Σ_{ϵ}^{(i)}$ on respective time grids $T_{G}^{(1)} = {t_{i 1}^{(1)}, \dots, t_{i m_{i}^{(1)}}^{(1)}}$ and $T_{G}^{(2)} = {t_{i 1}^{(2)}, \dots, t_{i m_{i}^{(2)}}^{(2)}}$ , both of which are inside the unit interval [0, 1]. Block diagonal components of $B_{i} = blkdiag (B_{i}^{(1)}, B_{i}^{(2)})$ include evaluations of K₁ = 6 and K₂ = 4 cubic B-splines on $T_{G}^{(1)}$ and $T_{G}^{(2)}$ , respectively. We set $σ_{ϵ}^{2 (1)} = 1.2$ , $σ_{ϵ}^{2 (2)} = 0.6$ , and $σ_{ζ k}^{2} = 1$ for all k = 1, … ,10. Values of Λ and Γ are presented in Appendix C of the online Supplementary Material. We consider sample sizes of n = 100 and 200.

We vary the sizes of $m_{i}^{(1)}$ and $m_{i}^{(2)}$ to assess the sensitivity of the proposed method to varying density of time points. Specifically, we consider three cases. (20U) Sparse and irregular functional data with $m_{i}^{(1)}$ and $m_{i}^{(2)}$ following a Poisson distribution with mean 20 and lower-truncated at 12. Except for the preset endpoints $t_{i 1}^{(c)} = 0$ and $t_{i m_{i}^{(c)}}^{(c)} = 1$, the remaining $(m_{i}^{(c)} - 2)$ observation times are randomly drawn from the equally spaced grid {(u − 1)/39; u = 2, … , 39}. (20B) Sparse and regular functional data with $m_{i}^{(1)} = m_{i}^{(2)} = 20$ time points equally spaced on [0, 1] for all functions. (40B) Dense and regular functional data with $m_{i}^{(1)} = m_{i}^{(2)} = 40$ time points equally spaced on [0, 1] for all functions. For the covariates, we set x_i = [x_1i, x_2i]^T, where x_1i is an intercept, and x_2i is a binary group indicator generated from Bernoulli(0.5), with x_2i = 0 and x_2i = 1 representing data from groups 0 and 1, respectively.
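
The (20U) design described above can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the simulation code used in the study: "lower-truncated at 12" is read as rejecting draws below 12, draws above 40 are also rejected because only 38 interior grid points are available, and the helper names `sample_poisson` and `sample_20u_grid` are hypothetical.

```python
import math
import random


def sample_poisson(mean, rng):
    """Draw one Poisson variate with Knuth's multiplication method."""
    limit = math.exp(-mean)
    count, product = 0, rng.random()
    while product > limit:
        count += 1
        product *= rng.random()
    return count


def sample_20u_grid(rng, mean=20, lower=12, n_grid=40):
    """Return one sorted, irregular time grid on [0, 1] for design (20U)."""
    # Rejection step: keep m in [lower, n_grid] (upper cap is an assumption).
    while True:
        m = sample_poisson(mean, rng)
        if lower <= m <= n_grid:
            break
    # Interior candidates (u - 1)/39 for u = 2, ..., 39.
    interior = [(u - 1) / (n_grid - 1) for u in range(2, n_grid)]
    chosen = rng.sample(interior, m - 2)  # sampled without replacement
    return [0.0] + sorted(chosen) + [1.0]


rng = random.Random(2021)
grid = sample_20u_grid(rng)
```

Each call returns a grid that always contains the preset endpoints 0 and 1, so a separate grid can be drawn for each subject and each component.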

We consider a missing data mechanism that mimics that of the Emory renal study, where the probability of missing for post-furosemide renogram curves $(Y_{i}^{(2)})$ is higher when patterns of baseline renogram curves $(Y_{i}^{(1)})$ strongly suggest non-obstruction. Specifically, missingness indicators for $Y_{i}^{(2)}$ are generated according to the following logistic model:

\log (\frac{P (Δ_{i} = 1)}{1 - P (Δ_{i} = 1)}) = γ_{0} + γ_{1} Y_{i}^{(1)} + γ_{2} x_{2 i} + γ_{3} Y_{i}^{(1)} x_{2 i} .

(11)

Here, γ₁ = [0.43, 0, … , 0, … , 0, 0.43]^T (an $m_{i}^{(1)}$-dimensional vector with only the first and last elements being nonzero), γ₂ = 18, and γ₃ = −γ₁. We set γ₀ = −15 or γ₀ = −16 so that the percentage of missing $Y_{i}^{(2)}$ is around 15% or 30%, respectively. Note that the model formulation (11) implies that subjects in group 0 tend to have more missing $Y_{i}^{(2)}$ than those in group 1. Specifically, on average, the percentages of missing $Y_{i}^{(2)}$ are 25% in group 0 and 5% in group 1 under the overall missing rate of 15%, and 47% in group 0 and 12% in group 1 under the overall missing rate of 30%. Figure 1 illustrates a single representative simulated dataset of size n = 200 with overall missing rates of 15% and 30%. We see that both first and second component functions from group 1 tend to have higher values over their respective domains than those from group 0. A visual comparison of the datasets of second component functions before (top right panel) and after removing the missing observations (bottom left panel: 15% missing; bottom right panel: 30% missing) further confirms that our missing data mechanism (11) deletes more $Y_{i}^{(2)}$ from group 0 than from group 1.
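
The mechanism in (11) can be illustrated with a short sketch. It assumes that Δ_i = 1 denotes an observed $Y_{i}^{(2)}$, which is the reading under which the stated coefficients reproduce the reported group-1 missing rates; the function name `prob_observed` and the example curve values are hypothetical. Because γ₃ = −γ₁, the curve term cancels in group 1 and the log-odds reduce to γ₀ + γ₂.

```python
import math


def prob_observed(y1, x2, gamma0, gamma1_end, gamma2):
    """P(Delta_i = 1) under (11), with gamma_1 nonzero only at the first
    and last time points (both equal to gamma1_end) and gamma_3 = -gamma_1."""
    curve_term = gamma1_end * (y1[0] + y1[-1])  # gamma_1 applied to Y_i^(1)
    logit = gamma0 + curve_term + gamma2 * x2 - curve_term * x2
    return 1.0 / (1.0 + math.exp(-logit))


# Group 1: log-odds = gamma0 + gamma2 = 3, whatever the baseline curve is.
p_group1 = prob_observed([9.0, 11.0, 10.0], x2=1,
                         gamma0=-15.0, gamma1_end=0.43, gamma2=18.0)
missing_rate_group1 = 1.0 - p_group1  # about 0.047, in line with the 5% above
```

With γ₀ = −16 the same calculation gives log-odds of 2 and a group-1 missing rate of about 0.12, in line with the 12% reported for the 30% overall missing rate. In group 0 the probability depends on the first and last values of the baseline curve, so larger endpoint values make $Y_{i}^{(2)}$ more likely to be observed.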

FIGURE 1. A single representative simulated dataset of size n = 200 at missing rates of 15% and 30% for second component functions. The top left panel presents data of first component functions. The top right panel presents data of second component functions before removing the missing observations. The bottom left and right panels present data of second component functions after removing missing observations under missing rates of 15% and 30%, respectively. Green (gray on print version) and black curves indicate data from groups 1 and 0, respectively.

We apply FUNMI to impute missing $Y_{i}^{(2)}$. Hyperparameters of the Gibbs sampler are set to $a_{ϵ}^{(1)} = a_{ϵ}^{(2)} = a_{ζ} = 1.5$, $b_{ϵ}^{(1)} = b_{ϵ}^{(2)} = b_{ζ} = 0.5$, ν = 5, and a₁ = a₂ = 10. Note that values of the global shrinkage parameters, a₁ and a₂, greater than 5 led to good imputation performance in our numerical experiments. We set the initial number of latent factors to q = 5, and choose α₀ = −1 and α₁ = −5 × 10⁻⁴ so that q is automatically adapted around every 10 iterations at the beginning of the Gibbs chain, and the frequency of adaptation exponentially decreases as the chain grows.¹⁷ Given that such an adaptation induces effective basis selection, we specify a generous number of cubic B-splines, K₁ = K₂ = 10, to ensure that important local variations of curves are captured. Five imputations for each of the missing second component functions are obtained at the 9960th, 9970th, 9980th, 9990th, and 10 000th iterations of the Gibbs sampler. Figure 2 presents the true and 5 imputed second component functions of two observations randomly selected from a sample of size n = 200 with missing rates of 15% (top panel) and 30% (bottom panel). All imputed functions are generally close to but scattered around the true function, reflecting the uncertainty of imputation.

FIGURE 2. True (black) and five imputed (green; gray on print version) second component functions of two randomly selected observations from a sample of size n = 200 with missing rates of 15% (top panel) and 30% (bottom panel).

Five other competing methods are also applied. The first competing method is the CCA. The second competing method is “multiple imputation by chained equations” (MICE), which produces imputations for various types of missing multivariate data by iterating over a set of conditional densities, one for each incomplete variable.²⁶ We apply MICE by treating $Y_{i}^{(2)}$ as $m_{i}^{(2)}$-dimensional multivariate data and generate five imputations. Note that this method is only applicable to regular functional data, (20B) and (40B). For the third competing method, we apply MICE to directly impute the TMAG3 values (MICE-TMAG3), rather than the curves themselves. For the fourth competing method, we consider a method that aims to reconstruct the missing part of a curve given its observed part. Specifically, we choose Kneip and Liebl’s method,¹¹ which minimizes the mean squared error loss between missing and reconstructed values using an optimal reconstruction operator. This method is referred to as RFUN hereafter. Note that RFUN does not acknowledge that $Y_{i}^{(1)}$ and $Y_{i}^{(2)}$ are distinct functions observed on distinct domains and is primarily applicable to a univariate functional data setting, where $Y_{i}^{(2)}$ (missing part) is a direct continuation of $Y_{i}^{(1)}$ (observed part). Nonetheless, we found it important to consider this method, as it is one of very few methods that explicitly deal with missing functional data objects. To apply RFUN, we assume that $Y_{i}^{(1)}$ and $Y_{i}^{(2)}$ constitute a single, extended curve on a combined grid $T_{G} = \{T_{G}^{(1)}, T_{G}^{(2)} + t^{*}\} = \{t_{i 1}^{(1)}, \dots, t_{i m_{i}^{(1)}}^{(1)}, t_{i 1}^{(2)} + t^{*}, \dots, t_{i m_{i}^{(2)}}^{(2)} + t^{*}\}$, where t* = 1.02 under (40B) and t* = 1.05 under (20U) and (20B). The optimal reconstruction operator then leverages the observed part of this extended curve, $Y_{i}^{(1)}$, to reconstruct its missing part, $Y_{i}^{(2)}$.
The last competing method is functional regression (FUNREG), in which we fit a penalized function-on-function regression²⁷ that models $Y_{i}^{(2)}$ on $Y_{i}^{(1)}$ and x_2i (group indicator), and then impute the missing $Y_{i}^{(2)}$ by prediction from the fitted model.

We consider two estimands that distinguish between groups 0 and 1 (ie, x₂ = 0 vs. x₂ = 1).

(i) In a renal study, the area under the post-furosemide renogram curve, $\int y_{i}^{(2)} (t) d t$, quantifies the total MAG3 level (TMAG3) remaining in the kidney after furosemide injection and is an important marker of renal obstruction.²⁸ Thus, we are interested in assessing the discriminative ability of the TMAG3 of the second component function $Y_{i}^{(2)}$ for distinguishing between the two groups. The corresponding estimand of interest is the area under the receiver operating characteristic curve (AUC) of TMAG3. In this article, the AUC is estimated using the trapezoidal rule. The variance and 95% confidence interval of the AUC estimate are obtained using a stratified bootstrap technique with 2000 replicates.²⁹
(ii) Accurate estimation of the mean function is one of the important themes in functional data analysis. For instance, in a renal study, evaluating the mean functions of post-furosemide renogram curves of obstructed and non-obstructed kidneys and assessing their difference can potentially help characterize the distribution of renogram curves suggestive of renal obstruction. Accordingly, the second estimand is the between-group difference in the mean function of $Y_{i}^{(2)}$, that is, $μ_{d}^{(2)} (t) = μ_{1}^{(2)} (t) - μ_{0}^{(2)} (t)$, where $μ_{x_{2}}^{(2)} (t)$, x₂ = 0, 1, denotes a group-specific mean function. In this article, we apply a local linear smoother to estimate the group-specific mean function. The pointwise variance of the difference in the estimated group-specific mean functions, ${\hat{μ}}_{d}^{(2)} (t) = {\hat{μ}}_{1}^{(2)} (t) - {\hat{μ}}_{0}^{(2)} (t)$, can be estimated as $\hat{Var} \{ {\hat{μ}}_{d}^{(2)} (t) \} = n_{0}^{- 1} {\hat{K}}_{0}^{(2)} (t, t) + n_{1}^{- 1} {\hat{K}}_{1}^{(2)} (t, t)$, where $n_{x_{2}}$ is the group-specific sample size, and ${\hat{K}}_{x_{2}}^{(2)} (\cdot, \cdot)$ denotes the estimated group-specific covariance function of the observed curves obtained by applying a local linear surface smoother on off-diagonal elements and a local linear smoother on diagonal elements. The details of these estimation procedures based on kernel smoothing can be found in Yao et al.⁷ In all simulation configurations, ${\hat{μ}}_{d}^{(2)} (t)$ is evaluated on an equidistant working grid of 40 time points over the unit interval [0, 1], that is, G = {(u − 1)/39; u = 1, … , 40}.
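
The first estimand can be sketched in two steps: a trapezoidal approximation of the area under each curve (TMAG3), followed by the area under the empirical ROC curve of those areas. The sketch below is illustrative only. The function names are hypothetical, treating group 1 as the "positive" class is an assumption, and the AUC is computed through the pairwise-comparison form, which coincides with the trapezoidal area under the empirical ROC curve when ties count one half.

```python
def trapezoid_area(times, values):
    """Trapezoidal approximation of the integral of a curve over its grid."""
    return sum(0.5 * (values[j] + values[j + 1]) * (times[j + 1] - times[j])
               for j in range(len(times) - 1))


def empirical_auc(scores_neg, scores_pos):
    """Area under the empirical ROC curve.

    Computed as the proportion of (positive, negative) pairs in which the
    positive score is larger, counting ties as one half.
    """
    wins = sum((p > n) + 0.5 * (p == n)
               for p in scores_pos for n in scores_neg)
    return wins / (len(scores_pos) * len(scores_neg))


# Toy curves on a common grid (hypothetical values).
grid = [0.0, 0.5, 1.0]
group0_curves = [[1.0, 1.0, 1.0], [2.0, 3.0, 2.0]]
group1_curves = [[2.0, 2.0, 2.0], [4.0, 4.0, 4.0]]
tmag3_group0 = [trapezoid_area(grid, c) for c in group0_curves]
tmag3_group1 = [trapezoid_area(grid, c) for c in group1_curves]
auc = empirical_auc(tmag3_group0, tmag3_group1)
```

A stratified bootstrap for the variance, as described above, would resample curves within each group and recompute `auc` on every replicate.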

We estimate and infer the two estimands based on CCA and using datasets imputed/reconstructed by MICE, MICE-TMAG3, RFUN, FUNREG, and FUNMI, and compare their finite-sample performance. Rubin’s rules¹⁴ are applied to pool estimates from the five imputed datasets generated by FUNMI, MICE, and MICE-TMAG3, respectively. Note that for pooling ${\hat{μ}}_{d}^{(2)} (t)$, Rubin’s rules are applied pointwise to each $t \in T_{G}^{(2)}$. The estimation and inference made on the data before deletion (BD) of the missing cases are taken as the gold standard. For the first estimand, (i) AUC, the following three statistics are studied to assess the performance of the methods: empirical bias (EmpBias), root of the relative mean squared error, $RRMSE = \sqrt{MSE (Method) / MSE (BD)}$, and empirical coverage rate of the 95% confidence interval (Cov95). For the second estimand, (ii) the mean function difference, we consider two statistics on the equally spaced grid of 40 time points: root of the relative mean integrated squared error, $RRMISE = \sqrt{MISE (Method) / MISE (BD)}$, where $MISE = E \int \{ {\hat{μ}}_{d}^{(2)} (t) - μ_{d}^{(2)} (t) \}^{2} d t$, and average of the empirical coverage rate of the pointwise 95% confidence intervals (ACOV95). We generate 200 Monte Carlo replications for each scenario.
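
For reference, the pooling step with Rubin's rules can be sketched as follows. This is a minimal sketch of the standard combining rules for a scalar estimand (such as the AUC, or the mean difference at one time point), not the authors' code; the function name `pool_rubin` and the numeric inputs are hypothetical.

```python
def pool_rubin(estimates, variances):
    """Pool M completed-data estimates and their variances.

    Returns (pooled_estimate, total_variance), where
      total_variance = W + (1 + 1/M) * B,
    W is the average within-imputation variance, and B is the sample
    variance of the M estimates (between-imputation variance).
    """
    m = len(estimates)
    pooled = sum(estimates) / m
    within = sum(variances) / m
    between = sum((q - pooled) ** 2 for q in estimates) / (m - 1)
    return pooled, within + (1.0 + 1.0 / m) * between


# Five hypothetical AUC estimates, one per imputed dataset.
auc_hat = [0.80, 0.82, 0.81, 0.79, 0.83]
auc_var = [0.001] * 5
pooled_auc, total_var = pool_rubin(auc_hat, auc_var)
```

For the mean function difference, the same function would be applied separately at each time point of the working grid.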

The simulation results under the missing data rate of 15% are presented in Table 1. As expected, CCA performs badly across all configurations. Its biases are about −0.02 in every configuration, and its RRMSEs and RRMISEs range from roughly 1.3 to 1.6. Its coverage rates drop rapidly as the sample size increases. The performance of MICE is highly unstable, breaking down with large biases, RRMSEs and RRMISEs, and low coverage rates when faced with high dimensionality of missing data (40B). The performance of MICE-TMAG3 is similar to that of CCA in all configurations (not reported), suggesting unsatisfactory performance of directly imputing the scalar measure that condenses functional data. RFUN performs well with respect to the AUC of TMAG3 but yields large RRMISEs and low coverage rates for the mean function difference $μ_{d}^{(2)} (t)$. The performance of FUNREG is generally satisfactory, except in some cases of sparse functional data (20U and 20B) where the average coverage rate for the mean function difference hovers around or below 0.9. FUNMI performs well for both estimands, outperforming the competing methods in most configurations. Its estimates are virtually unbiased and have MSEs and MISEs comparable to those of BD. Its coverage rates are close to the 95% nominal level in all cases.

TABLE 1.

Simulation results for the missing data rate of 15%

Estimand:			(i) AUC of TMAG3			(ii) Mean difference
n	Design	Method	Bias	RRMSE	Cov95	RRMISE	ACov95
100	20U	BD	−0.002	Ref.	0.945	Ref.	0.959
		CCA	−0.022	1.463	0.935	1.376	0.886
		RFUN	−0.003	1.013	0.945	1.496	0.834
		FUNREG	0.001	1.000	0.940	1.099	0.916
		FUNMI	−0.003	1.025	0.945	1.052	0.957
	20B	BD	−0.003	Ref.	0.930	Ref.	0.956
		CCA	−0.023	1.435	0.925	1.425	0.882
		MICE	−0.015	1.177	0.940	1.462	0.885
		RFUN	−0.004	1.001	0.935	1.547	0.817
		FUNREG	−0.005	1.118	0.905	1.119	0.932
		FUNMI	−0.001	1.000	0.925	1.050	0.951
	40B	BD	0.001	Ref.	0.950	Ref.	0.965
		CCA	−0.018	1.473	0.950	1.328	0.922
		MICE	−0.060	2.736	0.780	2.379	0.881
		RFUN	0.001	1.017	0.945	1.502	0.849
		FUNREG	0.002	1.033	0.950	1.029	0.957
		FUNMI	0.002	1.032	0.945	1.019	0.970
200	20U	BD	−0.002	Ref.	0.925	Ref.	0.965
		CCA	−0.023	1.642	0.875	1.607	0.832
		RFUN	−0.003	0.985	0.935	1.766	0.782
		FUNREG	0.000	0.980	0.920	1.202	0.888
		FUNMI	−0.004	1.039	0.935	1.072	0.957
	20B	BD	−0.002	Ref.	0.945	Ref.	0.941
		CCA	−0.023	1.608	0.860	1.551	0.814
		MICE	−0.005	1.072	0.930	1.169	0.910
		RFUN	−0.004	1.019	0.935	1.804	0.736
		FUNREG	−0.001	1.070	0.910	1.079	0.907
		FUNMI	−0.001	1.013	0.930	1.039	0.935
	40B	BD	−0.002	Ref.	0.940	Ref.	0.951
		CCA	−0.022	1.624	0.905	1.632	0.810
		MICE	−0.019	1.472	0.930	1.889	0.776
		RFUN	−0.002	1.013	0.935	1.898	0.721
		FUNREG	0.000	1.026	0.930	1.045	0.935
		FUNMI	−0.001	1.012	0.935	1.043	0.953

Note: Two estimands of interest are AUC of TMAG3 and mean function difference. Estimation and inference performance of the six methods (BD, CCA, MICE, RFUN, FUNREG, and FUNMI) are summarized using Bias, RRMSE, and Cov95 for the AUC and using RRMISE and ACov95 for the mean difference.

The simulation results under the 30% missing data rate are shown in Table 2. The performance of all competing methods drops compared to that under the missing rate of 15%, with the most dramatic drops found in MICE and RFUN. Specifically, MICE now fails to produce accurate imputations for both sparse (20B) and dense (40B) functional data when n = 100. RFUN produces even larger RRMISEs and lower coverage rates for the mean function difference and also begins to yield some bias in estimating the AUC for sparse functional data (20U and 20B). The performance of FUNREG is also negatively affected, with average coverage rates of the mean function difference falling below 0.9 for sparse functional data in most cases, and RRMSEs for estimating the AUC exceeding 1.2 for regular functional data when n = 100. On the other hand, FUNMI maintains an overall satisfactory performance that is as good as or only slightly worse than that under the missing rate of 15%.

TABLE 2.

Simulation results for the missing data rate of 30%

Estimand:			(i) AUC of TMAG3			(ii) Mean difference
n	Design	Method	Bias	RRMSE	Cov95	RRMISE	ACov95
100	20U	BD	−0.002	Ref.	0.940	Ref.	0.959
		CCA	−0.034	1.893	0.890	1.726	0.837
		RFUN	−0.008	1.085	0.945	2.376	0.567
		FUNREG	0.003	1.031	0.940	1.172	0.884
		FUNMI	−0.002	1.032	0.960	1.131	0.944
	20B	BD	−0.003	Ref.	0.935	Ref.	0.956
		CCA	−0.037	1.892	0.910	1.815	0.827
		MICE	−0.046	2.013	0.920	2.482	0.745
		RFUN	−0.011	1.093	0.935	2.429	0.530
		FUNREG	−0.012	1.323	0.890	1.208	0.908
		FUNMI	−0.001	1.043	0.915	1.107	0.946
	40B	BD	0.001	Ref.	0.950	Ref.	0.965
		CCA	−0.031	2.013	0.910	1.705	0.866
		MICE	−0.117	4.883	0.345	3.867	0.725
		RFUN	−0.003	1.065	0.960	2.409	0.569
		FUNREG	−0.001	1.214	0.935	1.099	0.934
		FUNMI	0.005	1.055	0.925	1.071	0.965
200	20U	BD	−0.002	Ref.	0.935	Ref.	0.971
		CCA	−0.038	2.300	0.790	2.222	0.719
		RFUN	−0.008	1.077	0.960	3.051	0.399
		FUNREG	0.003	1.026	0.915	1.237	0.858
		FUNMI	−0.003	1.056	0.955	1.201	0.940
	20B	BD	−0.002	Ref.	0.940	Ref.	0.941
		CCA	−0.036	2.219	0.800	2.088	0.685
		MICE	−0.009	1.195	0.920	1.556	0.821
		RFUN	−0.009	1.122	0.925	2.965	0.359
		FUNREG	−0.002	1.146	0.895	1.156	0.881
		FUNMI	0.001	1.018	0.920	1.101	0.932
	40B	BD	−0.002	Ref.	0.940	Ref.	0.951
		CCA	−0.036	2.266	0.825	2.167	0.703
		MICE	−0.047	2.708	0.795	3.322	0.576
		RFUN	−0.007	1.071	0.945	3.176	0.340
		FUNREG	0.001	1.063	0.915	1.052	0.922
		FUNMI	0.002	1.009	0.930	1.096	0.947

4.2 |. Data generated from a multivariate Karhunen-Loève expansion

In the next set of simulations, we consider a data generation model that is more general than that specified in Section 4.1. Specifically, we generate bivariate functional data from a multivariate Karhunen-Loève expansion,³⁰,³¹ which represents any multivariate square integrable stochastic process as a countable linear combination of orthogonal functions and uncorrelated scores. Let Y(t) = [Y⁽¹⁾(t⁽¹⁾), Y⁽²⁾(t⁽²⁾)]^T, $t = {[t^{(1)}, t^{(2)}]}^{T} \in T^{(1)} \times T^{(2)} = [0, 1] \times [0, 1]$, denote observed bivariate functional data. Then, based on the multivariate Karhunen-Loève expansion, we can generate bivariate functional data for each observation i ∈ {1, … , n} as:

Y_{i} (t) = μ (t) + \sum_{h = 1}^{H} ρ_{i h} ϕ_{h} (t) + ϵ_{i} (t), t \in T,

(12)

where μ(t) = [μ⁽¹⁾(t⁽¹⁾), μ⁽²⁾(t⁽²⁾)]^T is an overall bivariate mean function, $ϕ_{h} (t) = {[ϕ_{h}^{(1)} (t^{(1)}), ϕ_{h}^{(2)} (t^{(2)})]}^{T}$ are orthonormal functions, ρ_ih are mean-zero uncorrelated scores, and $ϵ_{i} (t) = {[ϵ_{i}^{(1)} (t^{(1)}), ϵ_{i}^{(2)} (t^{(2)})]}^{T}$ is an independent measurement error term with $E \{ ϵ^{(c)} (t^{(c)}) \} = 0$ and $Var \{ ϵ^{(c)} (t^{(c)}) \} = σ_{ϵ}^{2}$, c = 1, 2. Note that ϕ_h is orthonormal in the sense that $\sum_{c = 1}^{2} \int_{T^{(c)}} ϕ_{h}^{(c)} (t^{(c)}) ϕ_{j}^{(c)} (t^{(c)}) d t^{(c)} = I (h = j)$.

We consider two scenarios with different shapes of bivariate functions. First, let x_i = [x_1i, x_2i]^T denote covariate data where x_1i is an intercept and x_2i ∈ {0, 1} is a binary group indicator (group 0 vs. group 1) generated from Bernoulli(0.5). Also let $μ_{g}^{(c)} (t^{(c)})$ denote the mean function of the cth component in group g ∈ {0, 1}. In Scenario A, the mean of the first component function is bimodal, while the mean of the second component function is unimodal; that is, we set $μ_{0}^{(1)} (t^{(1)}) = 10 - 0.5 \sqrt{2} \sin (2 π t^{(1)}) - 0.3 \sqrt{2} \cos (4 π t^{(1)}) - 0.3 \sqrt{2} \sin (4 π t^{(1)})$ and $μ_{1}^{(1)} (t^{(1)}) = μ_{0}^{(1)} (t^{(1)}) + 1$ for the first component, and $μ_{0}^{(2)} (t^{(2)}) = 9.5 - 0.8 \sqrt{2} \cos (2 π t^{(2)}) - 0.8 \sqrt{2} \sin (2 π t^{(2)})$ and $μ_{1}^{(2)} (t^{(2)}) = μ_{0}^{(2)} (t^{(2)}) + 1$ for the second component. In Scenario B, we set the mean functions of both components to be unimodal but with different peak values; that is, $μ_{0}^{(1)} (t^{(1)}) = 9.5 - 0.3 \sqrt{2} \cos (2 π t^{(1)}) - 0.3 \sqrt{2} \sin (2 π t^{(1)})$, $μ_{1}^{(1)} (t^{(1)}) = μ_{0}^{(1)} (t^{(1)}) + 1$, $μ_{0}^{(2)} (t^{(2)}) = 9.8 - 0.8 \sqrt{2} \cos (2 π t^{(2)}) - 0.8 \sqrt{2} \sin (2 π t^{(2)})$ and $μ_{1}^{(2)} (t^{(2)}) = μ_{0}^{(2)} (t^{(2)}) + 1$. In both scenarios, we set $\{ ϕ_{h}^{(1)} (t^{(1)}), h = 1, \dots, 6 \}$ and $\{ ϕ_{h}^{(2)} (t^{(2)}), h = 1, \dots, 6 \}$ as the first 6 normalized Legendre polynomials divided by $\sqrt{2}$. The scores ρ_ih are independently generated from $N (0, ν_{h})$, where ν_h = 0.5^h, h = 1, … , 6. Figure 3 illustrates simulated bivariate functional data of size n = 200, overlaid with group-specific mean functions (bold lines), under Scenarios A (top panel) and B (bottom panel).
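
The division by $\sqrt{2}$ is what makes the bivariate basis orthonormal in the summed inner product of the multivariate Karhunen-Loève expansion: each component then contributes 1/2 to the squared norm. The sketch below checks this numerically. It assumes that "normalized" means unit L² norm on [0, 1] (shifted Legendre polynomials scaled by $\sqrt{2h+1}$) and indexes the functions by polynomial degree, which the text does not specify; all function names are hypothetical.

```python
import math


def legendre(h, x):
    """Legendre polynomial P_h(x) on [-1, 1] via the three-term recurrence."""
    previous, current = 1.0, x
    if h == 0:
        return previous
    for n in range(1, h):
        previous, current = current, ((2 * n + 1) * x * current
                                      - n * previous) / (n + 1)
    return current


def phi(h, t):
    """Degree-h shifted Legendre polynomial, unit norm on [0, 1], over sqrt(2)."""
    return math.sqrt(2 * h + 1) * legendre(h, 2.0 * t - 1.0) / math.sqrt(2.0)


def bivariate_inner_product(h, j, n_steps=2000):
    """Sum over both components of the integral of phi_h * phi_j on [0, 1].

    The same basis is used in each component, so the sum is twice the
    single-component integral (composite Simpson rule, n_steps even).
    """
    step = 1.0 / n_steps
    total = 0.0
    for k in range(n_steps + 1):
        weight = 1 if k in (0, n_steps) else (4 if k % 2 else 2)
        t = k * step
        total += weight * phi(h, t) * phi(j, t)
    return 2.0 * total * step / 3.0


norm_sq = bivariate_inner_product(1, 1)   # close to 1
cross = bivariate_inner_product(1, 2)     # close to 0
```

Scores for one subject could then be drawn as independent normal variates with variances 0.5^h and combined with these functions as in (12).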

FIGURE 3. Simulated bivariate functional data of size n = 200 (transparent lines) overlaid with group-specific mean functions (bold lines) under Scenarios A (top panel) and B (bottom panel). Green (gray on print version) and black curves indicate data from groups 1 and 0, respectively.

In each scenario, we consider a sample size of n = 200 and two different values of the measurement error variance: $σ_{ϵ}^{2} = 0.2$ and 0.7. The same three domain designs, (20U), (20B), and (40B), are considered as before (see Section 4.1). As in Section 4.1, the missingness of the second component functions $(Y_{i}^{(2)})$ depends on the pattern of the first component functions $(Y_{i}^{(1)})$ as well as the group indicators (x_2i) according to the logistic regression model (11) with γ₀ = −5, $γ_{1} = {[0.27, 0, \dots, 0, \dots, 0, 0.27]}^{T} \in ℝ^{m_{i}^{(1)}}$, γ₂ = 7, and γ₃ = −γ₁. Herein, the percentages of missing $Y_{i}^{(2)}$ are approximately 50% and 10% in groups 0 and 1, respectively. This amounts to an approximately 30% missing rate in the entire dataset.

The proposed method, FUNMI, is applied using the same hyperparameter values assigned in Section 4.1, with 5 imputations obtained at the 9960th, 9970th, 9980th, 9990th, and 10 000th iterations of the proposed Gibbs sampler. The five competing methods (CCA, MICE, MICE-TMAG3, RFUN, and FUNREG) are also applied to handle the missing second component functions. As in Section 4.1, the performance of the methods is evaluated based on the estimation and inference accuracy of the two estimands using the imputed datasets: (i) AUC of TMAG3; and (ii) mean function difference between the two groups. We generate 200 Monte Carlo replications for each scenario.

Table 3 presents the simulation results for Scenario A. As expected, the performance of CCA is not satisfactory, with biases of about −0.02 or worse and with RRMSEs and RRMISEs exceeding 1.6 and 1.4, respectively, in all configurations. Its coverage rates for the AUC (Cov95) are also low, falling under 0.9 when the measurement error variance is relatively large $(σ_{ϵ}^{2} = 0.7)$. The performance of MICE-TMAG3 is close to that of CCA (not reported). The performance of MICE is acceptable under the (20B) design with relatively small measurement error variance $(σ_{ϵ}^{2} = 0.2)$, but it suffers under all other configurations with high dimensionality of curves (40B) and/or larger measurement error variance $(σ_{ϵ}^{2} = 0.7)$. The performance of RFUN is on par with that of CCA in most cases. Even in some cases where RFUN does reasonably well in terms of MSEs or MISEs, it still suffers from low coverage rates. FUNREG and FUNMI show comparable and satisfactory performance when $σ_{ϵ}^{2} = 0.2$. Here, biases are smaller than 0.01 in magnitude and RRMSEs are below 1.1. RRMISEs stay below 1.1 in most cases, although FUNMI produces an RRMISE greater than 1.2 for irregular functional data (20U). All coverage rates are close to the nominal level of 0.95. However, when the measurement error variance increases to $σ_{ϵ}^{2} = 0.7$, the performance of FUNREG in estimating the AUC of TMAG3 falls apart; the biases and RRMSEs are unacceptably high, and the coverage rates plummet below 0.8. On the other hand, the performance of FUNMI is robust to the increase in $σ_{ϵ}^{2}$, with all evaluation metrics close to those of the gold standard analysis (BD). The results for Scenario B, presented in Table 4, convey similar findings, suggesting that FUNMI achieves satisfactory performance on data generated from the multivariate Karhunen-Loève expansion, which encompasses a wide range of curve shapes.

TABLE 3.

Simulation results for Scenario A

Estimand:			(i) AUC of TMAG3			(ii) Mean difference
$σ_{ϵ}^{2}$	Design	Method	Bias	RRMSE	Cov95	RRMISE	ACov95
0.2	20U	BD	−0.004	Ref.	0.955	Ref.	0.960
		CCA	−0.019	1.695	0.905	1.466	0.934
		RFUN	−0.024	1.675	0.820	1.564	0.754
		FUNREG	−0.001	1.011	0.955	1.086	0.914
		FUNMI	−0.005	1.014	0.970	1.282	0.936
	20B	BD	−0.005	Ref.	0.950	Ref.	0.958
		CCA	−0.020	1.716	0.910	1.506	0.924
		MICE	−0.007	1.078	0.945	1.212	0.948
		RFUN	−0.025	1.658	0.825	1.460	0.763
		FUNREG	−0.006	1.095	0.930	1.059	0.941
		FUNMI	−0.004	1.013	0.955	1.076	0.962
	40B	BD	−0.003	Ref.	0.940	Ref.	0.960
		CCA	−0.019	1.794	0.885	1.480	0.941
		MICE	−0.014	1.345	0.945	1.604	0.957
		RFUN	−0.022	1.600	0.830	1.429	0.765
		FUNREG	−0.003	1.031	0.940	0.988	0.954
		FUNMI	−0.004	1.013	0.955	1.018	0.974
0.7	20U	BD	−0.017	Ref.	0.900	Ref.	0.983
		CCA	−0.031	1.620	0.845	1.420	0.974
		RFUN	−0.037	1.635	0.660	1.437	0.863
		FUNREG	−0.066	3.155	0.370	1.212	0.930
		FUNMI	−0.018	1.078	0.955	1.309	0.973
	20B	BD	−0.015	Ref.	0.930	Ref.	0.962
		CCA	−0.030	1.674	0.840	1.414	0.944
		MICE	−0.022	1.268	0.925	1.339	0.950
		RFUN	−0.034	1.627	0.665	1.170	0.856
		FUNREG	−0.064	3.149	0.380	1.201	0.930
		FUNMI	−0.013	1.008	0.955	1.059	0.975
	40B	BD	−0.008	Ref.	0.955	Ref.	0.966
		CCA	−0.025	1.833	0.880	1.414	0.958
		MICE	−0.033	1.955	0.850	1.747	0.956
		RFUN	−0.027	1.670	0.760	1.128	0.862
		FUNREG	−0.028	2.101	0.765	1.040	0.948
		FUNMI	−0.009	1.053	0.960	1.014	0.982

TABLE 4.

Simulation results for Scenario B

Estimand:			(i) AUC of TMAG3			(ii) Mean difference
$σ_{ϵ}^{2}$	Design	Method	Bias	RRMSE	Cov95	RRMISE	ACov95
0.2	20U	BD	−0.004	Ref.	0.960	Ref.	0.960
		CCA	−0.018	1.648	0.910	1.422	0.936
		RFUN	−0.024	1.647	0.825	1.462	0.791
		FUNREG	−0.002	1.021	0.935	1.059	0.927
		FUNMI	−0.005	1.019	0.975	1.261	0.934
	20B	BD	−0.005	Ref.	0.950	Ref.	0.958
		CCA	−0.019	1.689	0.905	1.468	0.926
		MICE	−0.007	1.080	0.960	1.188	0.946
		RFUN	−0.025	1.665	0.820	1.400	0.788
		FUNREG	−0.006	1.087	0.940	1.052	0.943
		FUNMI	−0.003	0.981	0.955	1.077	0.978
	40B	BD	−0.003	Ref.	0.940	Ref.	0.960
		CCA	−0.018	1.709	0.900	1.440	0.940
		MICE	−0.012	1.251	0.955	1.495	0.960
		RFUN	−0.021	1.584	0.850	1.356	0.799
		FUNREG	−0.003	1.027	0.950	0.987	0.954
		FUNMI	−0.003	1.000	0.955	1.029	0.969
0.7	20U	BD	−0.017	Ref.	0.900	Ref.	0.983
		CCA	−0.030	1.576	0.845	1.383	0.975
		RFUN	−0.036	1.625	0.665	1.362	0.889
		FUNREG	−0.064	3.150	0.375	1.205	0.940
		FUNMI	−0.014	0.989	0.965	1.270	0.971
	20B	BD	−0.015	Ref.	0.925	Ref.	0.962
		CCA	−0.030	1.651	0.820	1.389	0.945
		MICE	−0.022	1.228	0.925	1.303	0.953
		RFUN	−0.035	1.665	0.655	1.140	0.876
		FUNREG	−0.064	3.157	0.385	1.212	0.932
		FUNMI	−0.010	0.954	0.950	1.035	0.978
	40B	BD	−0.008	Ref.	0.945	Ref.	0.966
		CCA	−0.024	1.766	0.880	1.382	0.957
		MICE	−0.030	1.803	0.885	1.645	0.960
		RFUN	−0.027	1.636	0.805	1.074	0.892
		FUNREG	−0.026	2.068	0.770	1.035	0.949
		FUNMI	−0.006	0.982	0.960	0.998	0.980

Note: Two estimands of interest are AUC of TMAG3 and mean difference. Estimation and inference performance of the six methods (BD, CCA, MICE, RFUN, FUNREG, and FUNMI) are summarized using Bias, RRMSE, and Cov95 for the AUC and using RRMISE and ACov95 for the mean difference. $σ_{ϵ}^{2}$ denotes the measurement error variance.

In sum, our proposed method, FUNMI, yields robust and satisfactory performance across all settings and performs best among the methods considered for handling bivariate functional data with missing components. A further simulation study and results on FUNMI2 are presented in Appendix D and Table S1 of the online Supplementary Material; there, FUNMI2 also shows satisfactory performance.

5 |. RENAL STUDY

We apply the proposed method to the Emory renal study introduced in Section 1. The baseline and post-furosemide renogram curves of 218 kidneys (with complete data) are plotted in the top left and right columns of Figure 4, respectively. Inspection of renogram curve data supports the concept that baseline and post-furosemide curves are realizations of two separate continuous processes (ie, bivariate functional data), which is expected as they are generated under different biological circumstances (MAG3 counts pre- and post-injection of furosemide water pill). At the same time, there are interpretable patterns of renogram curves that physicians look for to diagnose renal obstruction. For example, renogram curves of a non-obstructed kidney are often characterized by a quick uptake of MAG3 at baseline, followed by its fast drainage at post-furosemide (see black curve in the bottom panel of Figure 4). On the other hand, the renogram of an obstructed kidney often exhibits a prolonged period of MAG3 accumulation at baseline, followed by its poor excretion at post-furosemide (see gray curve in the bottom panel of Figure 4). In practice, however, renogram curves exhibit high kidney-to-kidney variability, and many show patterns that are not clear-cut, leading to erroneous diagnosis by physicians who lack expertise in MAG3 pharmacokinetics and renal physiology.³

FIGURE 4.

The top panel presents the baseline (left) and post-furosemide (right) renogram curves of 218 kidneys. The bottom panel presents the baseline (left) and post-furosemide (right) renogram curves of two kidneys that are diagnosed as “non-obstructed” (black) and “obstructed” (gray).

A scientific goal of the Emory renal study is to improve the interpretation of MAG3 renography by investigating the distribution and pharmacokinetic properties of renogram curves that provide further insights into the underlying mechanism of renal obstruction. The dataset consists of 314 kidneys (157 left kidneys and 157 right kidneys) of 157 subjects collected from Emory University Hospital’s archived database dating from 1/29/1998 to 7/17/2017. The final obstruction status of each kidney was retrospectively determined by the consensus of three internationally renowned experts in nuclear medicine, resulting in 248 non-obstructed (79%) and 66 obstructed kidneys (21%). Gender (84 women [54%], 73 men [46%]) and age (mean=56; SD=16; range: 18–87 years) of the subjects were also recorded. A major challenge in the analysis of the renal data is that post-furosemide renogram curves are missing for 96 of the 314 kidneys (missing rate of 30%), where missingness depends on the observed patterns of baseline renogram curves (ie, MAR). See Section 1 for the cause and description of this missing data problem.

The planned statistical analyses are outlined as follows. We first apply the proposed multiple imputation algorithms, FUNMI and FUNMI2, to impute missing post-furosemide renogram curves. Although various analyses can be performed with the imputed datasets to address the scientific goal of the study, we perform two important analyses for illustration. A common clinical practice among physicians for arriving at an accurate diagnosis of renal obstruction is to study simple, interpretable pharmacokinetic parameters of renogram curves that relate to the pathophysiology of renal obstruction.³ Accordingly, our first analysis focuses on evaluating the diagnostic accuracy of one important pharmacokinetic parameter of the post-furosemide renogram curve, namely its TMAG3, and on establishing its clinical utility for improving the interpretation of MAG3 renography. Our second analysis explores the distribution of post-furosemide renogram curves, specifically by estimating and comparing their mean functions between obstructed and non-obstructed kidneys.

Prior hyperparameters of FUNMI and FUNMI2 are set to $a_{ϵ}^{(1)} = a_{ϵ}^{(2)} = a_{ζ} = 1.5$, $b_{ϵ}^{(1)} = b_{ϵ}^{(2)} = b_{ζ} = 0.5$, ν = 5, and a₁ = a₂ = 10. The initial number of latent factors is set to q = 10, along with α₀ = −1 and α₁ = −5 × 10⁻⁴. We set K₁ = 20 and K₂ = 10 given that the patterns of baseline renogram curves are generally more complex than those of post-furosemide renogram curves (mostly straight), but any choice of K₁ and K₂ between 10 and 20 produced similar results (not reported). For FUNMI2, we additionally set a_ξ = 1.5 and b_ξ = 0.5. For both methods, we run M = 30 000 iterations and obtain five imputations of the missing post-furosemide curves at the 29 600th, 29 700th, 29 800th, 29 900th, and 30 000th iterations of the sampler. The MCMC traceplots (from the 10 000th to the 30 000th iteration) for some parameters of FUNMI and FUNMI2, each run with two different initializations, are presented in Figures S1 to S4 of the online Supplementary Material. We can see that the chains are stationary and well-mixed. Five competing methods are also applied to address missing post-furosemide renogram data: CCA, MICE, MICE-TMAG3, RFUN, and FUNREG. Note that RFUN is applied assuming that the baseline and post-furosemide renogram curves constitute a single curve, where the latter is a continuation of the former with a 30-minute waiting period in between.
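The imputation schedule described above amounts to keeping five draws from the tail of the chain, 100 iterations apart. A minimal Python sketch of that bookkeeping is given here; the function name and its arguments are hypothetical and are not taken from the authors' software.

```python
def imputation_iterations(total_iters: int, n_imputations: int, spacing: int) -> list[int]:
    """Return the 1-based iteration numbers at which imputed curves are stored.

    The last stored draw is the final iteration of the chain, and consecutive
    stored draws are `spacing` iterations apart.
    """
    first = total_iters - spacing * (n_imputations - 1)
    if n_imputations < 1 or spacing < 1 or first < 1:
        raise ValueError("the requested schedule does not fit inside the chain")
    return [first + k * spacing for k in range(n_imputations)]


# Values reported in the text: 30 000 iterations, five imputations, 100 apart.
schedule = imputation_iterations(total_iters=30_000, n_imputations=5, spacing=100)
print(schedule)  # [29600, 29700, 29800, 29900, 30000]
```

Spacing the stored draws apart, rather than taking five consecutive iterations, is a common way to reduce the dependence between imputed datasets.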

For the first analysis, we estimate the AUCs of the TMAG3 of the post-furosemide renogram using the imputed datasets obtained from FUNMI, FUNMI2, and the competing methods, under the rule that a larger TMAG3 indicates a greater likelihood of renal obstruction. For FUNMI, FUNMI2, and MICE, we combine the estimates using Rubin’s rules.¹⁴ Table 5 lists the AUCs of TMAG3 estimated and inferred based on CCA and using imputed/recovered data from MICE, MICE-TMAG3, RFUN, FUNREG, FUNMI, and FUNMI2 in left and right kidneys. The AUCs are slightly higher in the left kidney than in the right kidney across all configurations. In both kidneys, we immediately find that the estimated AUCs from MICE are markedly lower than those estimated using other methods, suggesting low imputation accuracy due to the high dimensionality of renogram curve data. The performance of MICE-TMAG3 is similar to that of CCA, which is consistent with our simulation results. On the other hand, RFUN, FUNREG, FUNMI, and FUNMI2 improve upon CCA, producing higher AUC estimates with smaller standard errors. Among all methods, FUNMI2 produces the highest AUC estimates with narrowest confidence intervals, followed by FUNMI, then FUNREG. Specifically, the AUC estimate from FUNMI2 in the left kidney is as high as 0.924, revealing excellent discriminative power of the TMAG3 for classifying obstruction vs non-obstruction. This demonstrates that our methods produce reasonable imputations of missing post-furosemide curves, correcting for underestimation of the AUC and enhancing estimation efficiency.
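The two ingredients of this analysis, an AUC computed on each imputed dataset and a combination of the per-imputation results by Rubin's rules, can be sketched in Python as follows. This is an illustration under stated assumptions rather than the authors' code: the AUC is the empirical Mann-Whitney estimate, the function names are hypothetical, and the five AUC values and standard errors in the example are made-up numbers, not results from the study.

```python
import math


def empirical_auc(obstructed_scores, non_obstructed_scores):
    """Mann-Whitney estimate of P(score_obstructed > score_non_obstructed).

    Larger scores (eg, larger TMAG3) are taken to indicate obstruction;
    ties count as one half.
    """
    wins = 0.0
    for pos in obstructed_scores:
        for neg in non_obstructed_scores:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5
    return wins / (len(obstructed_scores) * len(non_obstructed_scores))


def pool_rubin(estimates, std_errors):
    """Combine M point estimates and standard errors with Rubin's rules.

    Returns the pooled estimate, its standard error, and the degrees of
    freedom of the reference t distribution.
    """
    m = len(estimates)
    if m < 2 or m != len(std_errors):
        raise ValueError("need at least two estimates with matching standard errors")
    pooled = sum(estimates) / m
    within = sum(se ** 2 for se in std_errors) / m                  # average within-imputation variance
    between = sum((q - pooled) ** 2 for q in estimates) / (m - 1)   # between-imputation variance
    total = within + (1 + 1 / m) * between
    if between == 0:
        dof = math.inf
    else:
        dof = (m - 1) * (1 + within / ((1 + 1 / m) * between)) ** 2
    return pooled, math.sqrt(total), dof


# Hypothetical AUCs and standard errors from five imputed datasets.
aucs = [0.90, 0.91, 0.92, 0.91, 0.91]
ses = [0.03, 0.03, 0.03, 0.03, 0.03]
pooled_auc, pooled_se, dof = pool_rubin(aucs, ses)
print(round(pooled_auc, 3), round(pooled_se, 4))
```

The pooled standard error exceeds the average per-imputation standard error whenever the imputations disagree, which is how the uncertainty due to missing curves enters the final interval.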

TABLE 5.

AUC of the TMAG3 of the post-furosemide renogram estimated and inferred based on CCA and using imputed/recovered datasets by MICE, MICE-TMAG3, RFUN, FUNREG, FUNMI, and FUNMI2

Method	Kidney	AUC (SE)	95% CI
CCA	Left	0.878 (0.039)	0.801 to 0.955
	Right	0.829 (0.056)	0.718 to 0.939
MICE	Left	0.727 (0.047)	0.635 to 0.818
	Right	0.704 (0.057)	0.593 to 0.815
MICE-TMAG3	Left	0.873 (0.036)	0.803 to 0.943
	Right	0.832 (0.052)	0.731 to 0.933
RFUN	Left	0.898 (0.033)	0.834 to 0.963
	Right	0.839 (0.048)	0.745 to 0.933
FUNREG	Left	0.906 (0.030)	0.847 to 0.966
	Right	0.854 (0.044)	0.768 to 0.939
FUNMI	Left	0.909 (0.029)	0.852 to 0.965
	Right	0.857 (0.044)	0.772 to 0.943
FUNMI2	Left	0.924 (0.028)	0.869 to 0.978
	Right	0.868 (0.044)	0.782 to 0.954

Note: The estimates are presented for left and right kidneys.

Abbreviations: CI, confidence interval; SE, standard error.

For the second analysis, we estimate the mean functions of post-furosemide renogram curves of obstructed and non-obstructed kidneys, μ_obs(t) and μ_non(t), and assess their difference, μ_d(t) = μ_obs(t) − μ_non(t). Estimates of μ_d(t) based on CCA and on datasets imputed by RFUN, FUNREG, FUNMI, and FUNMI2 are plotted in Figure 5 for left and right kidneys. The estimate from MICE is excluded because of its low values across the scan period (range: 10–20). All estimates of μ_d(t) are positive and eventually decrease over time, reflecting the quicker drainage of MAG3 from the non-obstructed kidneys than from the obstructed kidneys, especially during the early scan period. The estimates of μ_d(t) from FUNMI2 lie above those of the competing methods across the domain, with a larger discrepancy at the earlier time points. The curves estimated by FUNMI are close to but slightly lower than those of FUNMI2. In sum, applying our methods reveals a more dramatic discrepancy in the post-furosemide curves between the obstructed and non-obstructed kidneys, especially during the first 10-minute period of the scan, and reinforces the importance of monitoring the post-furosemide MAG3 level in this period for accurate diagnosis of renal obstruction.
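As a rough illustration of how μ_d(t) can be obtained from multiply imputed curves, the Python sketch below computes the pointwise group-mean difference within each imputed dataset and then averages the differences across imputations. It assumes all curves are recorded on a common time grid and uses plain pointwise averages; the function names and the toy numbers are hypothetical, and the authors' estimator may differ (for example, by smoothing).

```python
from statistics import fmean


def mean_curve(curves):
    """Pointwise mean of equal-length curves given as lists of values."""
    return [fmean(values) for values in zip(*curves)]


def mean_difference(curves, obstructed):
    """mu_d(t) = mu_obs(t) - mu_non(t) for one completed dataset."""
    obs = [c for c, flag in zip(curves, obstructed) if flag]
    non = [c for c, flag in zip(curves, obstructed) if not flag]
    mu_obs, mu_non = mean_curve(obs), mean_curve(non)
    return [a - b for a, b in zip(mu_obs, mu_non)]


def pooled_mean_difference(imputed_datasets, obstructed):
    """Average the per-imputation estimates of mu_d(t) across imputations."""
    per_imputation = [mean_difference(d, obstructed) for d in imputed_datasets]
    return mean_curve(per_imputation)


# Toy example: four kidneys, three time points, two imputed datasets.
# Only the last (non-obstructed) kidney has an imputed curve, so only
# that row changes between the two datasets.
obstructed = [True, True, False, False]
dataset_1 = [[10, 8, 6], [12, 10, 8], [4, 2, 1], [6, 4, 1]]
dataset_2 = [[10, 8, 6], [12, 10, 8], [4, 2, 1], [8, 6, 3]]
print(pooled_mean_difference([dataset_1, dataset_2], obstructed))
```

Averaging point estimates across imputations is the first step of Rubin's rules applied at each time point; pointwise variances could be combined in the same way.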

FIGURE 5.

Plot of the difference in the mean functions between the obstructed (obs) and non-obstructed (non) left (left panel) and right (right panel) kidneys, that is, μ_d(t) = μ_obs(t) − μ_non(t), using complete or imputed/recovered data. The gray solid line represents μ_d(t) estimated based on CCA. The gray dashed line and gray dotted line represent μ_d(t) estimated from recovered/imputed data by RFUN and FUNREG, respectively. The black solid line and black dashed line represent μ_d(t) estimated from data imputed by FUNMI and FUNMI2, respectively.

6 |. DISCUSSION

We have developed a Bayesian multiple imputation approach for bivariate functional data with missing components. Our method is based on a bivariate functional latent factor model that borrows strength from the observed component and other covariates within each observation, as well as from other observations. Furthermore, we have extended the framework to handle multilevel bivariate functional data by modeling both inter-component and intra-subject correlations. In the Emory renal study, the proposed methods successfully imputed missing post-furosemide renogram curves and supported an unbiased and efficient study of their pharmacokinetic properties and distribution in relation to renal obstruction. Although the proposed methods are illustrated via AUC analysis and mean function estimation, they are flexible enough to be adapted to many other analyses. That is, the imputed datasets can be readily analyzed using a variety of statistical techniques to approach the problem from a different perspective and gain further insights into the underlying mechanism of renal obstruction.

One potential concern with our Bayesian framework is computation time, owing to the iterative nature of MCMC, the high dimensionality of curve data, and the complexity of the model. Despite this concern, the proposed algorithms (FUNMI and FUNMI2) have good computational efficiency that scales well with the dimensions and sample sizes considered in the simulations and real data analysis. For example, in various simulation settings with a missing rate of 30%, the times taken for FUNMI to generate 10 000 MCMC samples were: (i) 14 minutes for n = 100 and $m_{i}^{(1)} = m_{i}^{(2)} = 20$ (N = 4000 functional measurements); (ii) 30 minutes for n = 200 and $m_{i}^{(1)} = m_{i}^{(2)} = 20$ (N = 8000); and (iii) 58 minutes for n = 200 and $m_{i}^{(1)} = m_{i}^{(2)} = 40$ (N = 16 000). In the real data analysis with n = 314, $m_{i}^{(1)} = 59$, $m_{i}^{(2)} = 40$ (N = 31 086), and a missing rate of 30%, FUNMI took about 110 minutes to generate 10 000 MCMC samples. The computation times of FUNMI2 were similar.
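The reported timings can be checked against the problem sizes with a few lines of arithmetic. The Python sketch below recomputes N as n times the sum of the two grid sizes and divides the reported minutes by N in thousands; the variable and function names are hypothetical, and the figures are those quoted in the text.

```python
# (label, n, m1, m2, minutes per 10 000 MCMC samples), as reported in the text.
settings = [
    ("simulation (i)", 100, 20, 20, 14),
    ("simulation (ii)", 200, 20, 20, 30),
    ("simulation (iii)", 200, 40, 40, 58),
    ("renal study", 314, 59, 40, 110),
]


def summarize(rows):
    """Return (label, N, minutes per 1000 functional measurements) per setting."""
    summary = []
    for label, n, m1, m2, minutes in rows:
        total = n * (m1 + m2)  # N = total number of functional measurements
        summary.append((label, total, minutes / (total / 1000)))
    return summary


for label, total, rate in summarize(settings):
    print(f"{label:17s} N = {total:6d}  {rate:.2f} min per 1000 measurements")
```

Across the four settings the cost stays between about 3.5 and 3.75 minutes per 1000 functional measurements, so the reported run times grow roughly in proportion to N.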

There are several extensions and future directions for this work. First, the method can be readily extended to a multivariate functional data setting where multiple functions are collected for each observational unit; that is, $(g_{i}^{(1)}, \dots, g_{i}^{(C)})$ , C ≥ 2, for each i = 1, … , n. The strategy for imputing missing components of multivariate functional data is similar to the bivariate case. (i) Model each component function as $g_{i}^{(c)} (t) = b_{i}^{(c)} β_{i}^{(c)}$ , c = 1, … , C, where $b_{i}^{(c)}$ is a suitable basis system, and $β_{i}^{(c)}$ is a vector of corresponding coefficients. (ii) Jointly model the concatenated coefficients, $β_{i} = {[β_{i}^{(1) T}, \dots, β_{i}^{(C) T}]}^{T}$ , using a multivariate functional latent factor model: β_i = λη_i + ζ_i, where latent factors η_i parsimoniously capture covarying patterns of multivariate components. (iii) Link covariate information to η_i. (iv) Assign priors, formulate a joint posterior distribution of parameters and missing data, and derive a multiple imputation algorithm. Second, in some practical cases, data are missing not at random (MNAR); that is, the probability of missingness can depend on unobserved values.² One possible approach to handle MNAR is to directly model the missingness mechanism by building a nonignorable model³² that accounts for scale shift relative to ignorable imputations of functions.
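Steps (i) and (ii) of the multivariate extension, together with the kind of conditional draw that an imputation step would use, can be sketched in Python as follows. This is a toy illustration and not the authors' sampler: it assumes independent residuals ζ_i with a common standard deviation, treats the loading matrix and latent factors as known, and uses hypothetical names, basis sizes, and number of factors.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * x for a, x in zip(row, vector)) for row in matrix]


def split_blocks(vector, sizes):
    """Split a concatenated coefficient vector into per-component blocks."""
    blocks, start = [], 0
    for size in sizes:
        blocks.append(vector[start:start + size])
        start += size
    return blocks


def draw_component(loadings, sizes, eta, component, noise_sd, rng):
    """Draw the basis coefficients of one component given the latent factors.

    Assumes beta^(c) | eta ~ N(Lambda^(c) eta, noise_sd^2 I), where Lambda^(c)
    holds the rows of the loading matrix that belong to component c.
    """
    start = sum(sizes[:component])
    rows = loadings[start:start + sizes[component]]
    return [mean + rng.gauss(0.0, noise_sd) for mean in matvec(rows, eta)]


rng = random.Random(2021)
sizes = [4, 3, 2]  # hypothetical basis sizes for C = 3 components
q = 2              # hypothetical number of latent factors
loadings = [[rng.gauss(0.0, 1.0) for _ in range(q)] for _ in range(sum(sizes))]
eta = [rng.gauss(0.0, 1.0) for _ in range(q)]

# Step (ii): concatenated coefficients beta_i = Lambda eta_i + zeta_i.
beta = [mean + rng.gauss(0.0, 0.1) for mean in matvec(loadings, eta)]
blocks = split_blocks(beta, sizes)

# Suppose the third component (index 2) is missing: draw it given eta_i.
imputed = draw_component(loadings, sizes, eta, component=2, noise_sd=0.1, rng=rng)
print([len(b) for b in blocks], len(imputed))
```

In a full Gibbs sampler the latent factors η_i would themselves be updated from the observed components and covariates at each iteration, so the observed functions inform the imputed one through η_i.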

Supplementary Material

NIHMS1806891-supplement-Supplementary_Material.pdf (3.6 MB, PDF)

ACKNOWLEDGEMENTS

This research was supported by the grant R01DK108070 from the National Institute of Diabetes and Digestive and Kidney Diseases.

Footnotes

CONFLICT OF INTEREST

The authors declare no potential conflict of interest.

SUPPORTING INFORMATION

Additional supporting information may be found online in the Supporting Information section at the end of this article.

DATA AVAILABILITY STATEMENT

The dataset that supports the findings of this study is proprietary and confidential, and cannot be shared with the general public for subsequent research purposes. The authors do not have the permission to publish or share the raw data.

REFERENCES

1. Ramsay J, Silverman B. Functional Data Analysis. New York, NY: Springer; 2005.
2. Little R, Rubin D. Statistical Analysis with Missing Data. Hoboken, NJ: Wiley; 2002.
3. Taylor A, Manatunga A, Garcia E. Decision support systems in diuresis renography. Semin Nucl Med. 2008;38:67–81.
4. Morris J, Arroyo C, Coull B, Ryan L, Herrick R, Gortmaker S. Using wavelet-based functional mixed models to characterize population heterogeneity in accelerometer profiles: a case study. J Am Stat Assoc. 2006;101:1352–1364.
5. He Y, Yucel R, Raghunathan T. A functional multiple imputation approach to incomplete longitudinal data. Stat Med. 2011;30:1137–1156.
6. Chiou J, Zhang Y, Chen W, Chang C. A functional data approach to missing value imputation and outlier detection for traffic flow data. Transp B. 2014;2:106–129.
7. Yao F, Müller H, Wang J. Functional data analysis for sparse longitudinal data. J Am Stat Assoc. 2005;100:577–590.
8. Delaigle A, Hall P. Classification using censored functional data. J Am Stat Assoc. 2013;108:1269–1283.
9. Delaigle A, Hall P. Approximating fragmented functional data by segments of Markov chains. Biometrika. 2016;103:779–799.
10. Kraus D. Components and completion of partially observed functional data. J R Stat Soc Ser B Methodol. 2015;77:777–801.
11. Kneip A, Liebl D. On the optimal reconstruction of partially observed functional data. Ann Stat. 2020;48:1692–1717.
12. Liebl D, Rameseder S. Partially observed functional data: the case of systematically missing parts. Comput Stat Data Anal. 2019;131:104–115.
13. Ciarleglio A, Petkova E, Harel O. Multiple imputation in functional regression with applications to EEG data in a depression study. Technical report. arXiv:2001.08175 [stat.AP]; Preprint Posted Online December 4, 2020. https://arxiv.org/abs/2001.08175.
14. Rubin D. Multiple Imputation for Nonresponse in Surveys. Hoboken, NJ: Wiley; 1987.
15. Bhattacharya A, Dunson D. Sparse Bayesian infinite factor models. Biometrika. 2011;98:291–306.
16. Green P, Silverman M. Nonparametric Regression and Generalized Linear Models: A Roughness Penalty Approach. London, UK: Chapman & Hall; 1994.
17. Montagna S, Tokdar S, Neelon B, Dunson D. Bayesian latent factor regression for functional and longitudinal data. Biometrics. 2012;68:1064–1073.
18. Geweke J, Zhou G. Measuring the price of the arbitrage pricing theory. Rev Finan Stud. 1996;9:557–587.
19. Carvalho C, Chang G, Lucas J, Nevins J, Wang Q, West M. High-dimensional sparse factor modeling: applications in gene expression genomics. J Am Stat Assoc. 2008;103:1439–1456.
20. Sun J, Herazo-Maya J, Kaminski N, Zhao H, Warren J. A dirichlet process mixture model for clustering longitudinal gene expression data. Stat Med. 2017;36:3495–3506.
21. Preacher K, Zhang G, Kim C, Mels G. Choosing the optimal number of factors in exploratory factor analysis: a model selection perspective. Multivar Behav Res. 2013;48:28–56.
22. Geman S, Geman D. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Trans Pattern Anal Mach Intell. 1984;6:721–741.
23. Gelman A, Rubin D. Inference from iterative simulation using multiple sequences. Stat Sci. 1992;7:457–511.
24. Raftery A, Lewis S. One long run with diagnostics: implementation strategies for Markov chain Monte Carlo. Stat Sci. 1992;7:493–497.
25. Van Dyk D, Park T. Partially collapsed Gibbs samplers: theory and methods. J Am Stat Assoc. 2008;103:790–796.
26. Buuren S, Groothuis-Oudshoorn K. mice: multivariate imputation by chained equations in R. J Stat Softw. 2011;45:1–67.
27. Ivanescu A, Staicu A, Scheipl F, Greven S. Penalized function-on-function regression. Comput Stat. 2015;30:539–568.
28. Jang J, Peng L, Manatunga A. Assessing alignment between functional markers and ordinal outcomes based on broad sense agreement. Biometrics. 2019;75:1367–1379.
29. Robin X, Turck N, Hainard A, et al. pROC: an open-source package for R and S+ to analyze and compare ROC curves. BMC Bioinform. 2011;12:77.
30. Chiou J, Chen Y, Yang Y. Multivariate functional principal component analysis: a normalization approach. Stat Sin. 2014;24:1571–1596.
31. Happ C, Greven S. Multivariate functional principal component analysis for data observed on different (dimensional) domains. J Am Stat Assoc. 2018;113:649–659.
32. Van Buuren S. Flexible Imputation of Missing Data. Boca Raton, FL: CRC Press; 2018.
