Longitudinal functional additive model with continuous proportional outcomes for physical activity data

Haocheng Li; Sarah Kozey-Keadle; Victor Kipnis; Raymond J Carroll

doi:10.1002/sta4.121

Author manuscript; available in PMC: 2017 Oct 4.

Published in final edited form as: Stat (Int Stat Inst). 2016 Oct 4;5(1):242–250. doi: 10.1002/sta4.121


PMCID: PMC5124499 NIHMSID: NIHMS825619 PMID: 27904749

Abstract

Motivated by physical activity data obtained from the BodyMedia FIT device (www.bodymedia.com), we take a functional data approach for longitudinal studies with continuous proportional outcomes. The functional structure depends on three factors. In our three-factor model, the regression structures are specified as curves measured at various factor-points with random effects that have a correlation structure. The random curve for the continuous factor is summarized using a few important principal components. The difficulties in handling the continuous proportion variables are solved by using a quasilikelihood type approximation. We develop an efficient algorithm to fit the model, which involves the selection of the number of principal components. The method is evaluated empirically by a simulation study. This approach is applied to the BodyMedia data with 935 males and 84 consecutive days of observation, for a total of 78,540 observations. We show that sleep efficiency increases with increasing physical activity, while its variance decreases at the same time.

Keywords: BodyMedia FIT device, continuous proportions, functional data, mixed-effects model, physical activity, sleep efficiency

1. Introduction

Motivated by physical activity data, we take a functional data approach for longitudinal studies on continuous proportional outcomes. The research is aimed at understanding the influence of physical activity on sleep efficiency. The response variable, sleep efficiency, is measured by the ratio of daily sleep time to lying down time for each participant (Lambiase et al., 2013), and thus it is a continuous proportion. The major explanatory factor, physical activity level, is measured by the daily minutes of moderate to vigorous physical activity (MVPA). The intensity of physical activity is evaluated in a unit called METs, with 1 MET being the energy required to sit quietly, a quantity that depends on one’s body weight and other characteristics. MVPA corresponds to 3–6 METs, which is roughly when a person is moving fast enough or strenuously enough to burn three to six times as much energy per minute as when she/he is sitting quietly. Month and weekday effects may also influence sleep efficiency, and we take them into account as two additional factors. Therefore, our functional structure depends on three factors: physical activity, month, and weekday.

It has been widely reported that greater physical activity leads to greater sleep efficiency, both marginally (Lambiase et al., 2013; Oudegeest-Sander et al., 2013) and, in longitudinal data, within-person (Ekstedt et al., 2013). However, these studies mainly use correlation-based or linear regression methods, which do not take advantage of newly developed instruments, such as the BodyMedia FIT device (bodymedia.com), the ActiGraph device (actigraphcorp.com), and the ActivPal device (paltechnologies.com). These devices can measure physical activity data continuously (e.g., minute-by-minute) for an extended period (e.g., weeks). In this study, we use data from the BodyMedia FIT device. The BodyMedia FIT device is a multi-sensor armband which measures skin temperature, heat flux, galvanic skin response, and motion through a 3-axis accelerometer. With this device, there is the opportunity for new and more powerful statistical methods to understand physical activity and other outcomes, both within and between individuals.

The data we have are summaries of daily minutes of moderate to vigorous physical activity (MVPA) and daily sleep efficiency over 12-week periods, with starting dates distributed throughout the calendar year. Numerous questions arise, including the following:

What is the effect of daily MVPA minutes on daily sleep efficiency?
If daily MVPA minutes lead to greater sleep efficiency, is the effect constant or is there some point where the effect of increasing MVPA plateaus or even decreases?
Does increasing MVPA minutes influence the stability (variability) of sleep efficiency?

The purpose of this paper is to develop initial methods that can be used to answer the questions and conjectures described above. We do not claim that the methods are the last word in the analysis of physical activity and sleep efficiency, but we believe that our work is novel because we take a functional data analysis approach to the question, while simultaneously recognizing explicitly that sleep efficiency is a variable that necessarily is constrained to the unit interval (0, 1), a constraint that, to the best of our knowledge, has not been considered in the functional data literature.

There are many studies for continuous proportional responses, but not for longitudinal functional data. One naive way is to ignore that the proportional outcomes should be in the unit interval (0, 1), and then fit the responses by ordinary linear regression, but this potentially gives predictions outside the unit interval (Kieschnick & McCullough, 2003). The beta distribution (Ferrari & Cribari-Neto, 2004) can also be employed to analyze the data. Simas et al. (2010) and Zhao et al. (2012) used beta regression with functional forms for predictors, but they were limited by assuming that all observations are independent. Verkuilen & Smithson (2012) and Figueroa-Zúñiga et al. (2013) used mixed-effects in the beta regression to model correlated data, but their fixed and random effect structures were not in functional formulations. On the other hand, continuous proportions can be analyzed by first taking the logit transformation of the outcomes, and then using linear regression to fit the data. However, as we will show, this approach can lead to biased results.

There are of course many statistical papers that focus on functional data analysis, but, as far as we are aware, there are only a few studies that focus on proportional responses. Hall et al. (2008) proposed a functional model for non-Gaussian data. However, the model estimation is partly based on a simplified first-order linear approximation, which is known to lead to biased results in some settings (Serban et al., 2013). Goldsmith et al. (2015) proposed a generalized multilevel function-on-scalar regression model for outcomes from an exponential family. Gertheiss et al. (2015) discussed a marginal functional regression model for binary outcomes, and Scheipl et al. (2016) studied generalized functional additive mixed models for outcomes with exponential family distributions as well as others such as the beta distribution. However, these models cannot handle the effects from the three factors in our data.

In the case of correlated functional data, a large number of random effects are required to model the smooth random curves. Current computational methods for random effects mainly use Monte Carlo or Gauss-Hermite quadrature approximations (Molenberghs & Verbeke, 2005), but in our context these are computationally expensive. For example, Figueroa-Zúñiga et al. (2013) suggested that the beta regression could require more quadrature nodes than logistic regression.

We address this problem with the approach proposed by Cox (1996), where a quasilikelihood method (Wedderburn, 1974) is used to model continuous proportions. The quasilikelihood does not require a full distribution for the outcomes but only the specification of the first and second moments. However, the existing methodology is limited to independent data, and we extend it to our longitudinal functional data scenario. In particular, modeling the correlation structure with respect to physical activity, month and weekday effects is necessary. To build a flexible model, we use functional random curves for the MVPA minutes. To avoid the dimension problem in random curves, we use only a few important functional principal components to summarize the random curves. This method was developed by Zhou et al. (2008) for the linear model, and we take this general approach to address a continuous proportional response.

A new efficient algorithm is proposed. The algorithm includes both features of penalized quasilikelihood (Breslow & Clayton, 1993) and the eigen-decomposition discussed in Yao et al. (2005). Since our problem involves quasilikelihood modeling and random effects, a penalized quasilikelihood approach is convenient. On the other hand, the eigen-decomposition approach is efficient for simultaneous selection and estimation of the functional principal components. As a result, the new algorithm includes both procedures.

The paper is organized as follows. Section 2 describes the model, and Section 3 presents our model-fitting algorithm. Section 4 gives results from a simulation study. Section 5 analyzes the BodyMedia data set involving physical activity, and suggests answers to the three questions posed earlier. Concluding remarks are given in Section 6.

2. Model

2.1. The Mixed Effects Model for Continuous Proportional Data

Let Y_i(r, s, t) be a continuous proportional observation at MVPA minute r, month s and weekday t for subject i = 1,…, n. Each subject has m_i observations. The possible values for s are 1, 2,…, 12, representing January to December, while t takes the values 1, 2,…, 7, indicating Sunday through Saturday. We use Y_i(r_ij, s_ij, t_ij) to denote the j^th observation for subject i. Define Y_i = {Y_i(r_i1, s_i1, t_i1),…, Y_i(r_im_i, s_im_i, t_im_i)}^⊤.

According to the quasilikelihood method suggested by Wedderburn (1974) and McCullagh & Nelder (1989), we only specify the first and second moments for the outcomes. The mean and variance functions of Y_i(r, s, t) given random effects are

E {Y_{i} (r, s, t) | U_{i} (r, s, t)} = H {μ (r, s, t) + U_{i} (r, s, t)},

(1)

var {Y_{i} (r, s, t) | U_{i} (r, s, t)} = σ^{2} [E {Y_{i} (r, s, t) | U_{i} (r, s, t)}] [1 - E {Y_{i} (r, s, t) | U_{i} (r, s, t)}],

where H(·) denotes the logistic distribution function, μ(r, s, t) is a fixed curve, U_i(r, s, t) is a random effects curve, and σ² is a dispersion parameter. We further assume that given the random effects U_i(r, s, t), the variables in Y_i are independent.
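As a minimal sketch of the moment specification in (1), the following Python fragment evaluates the conditional mean and variance for given values of the fixed and random curves. The function names and the numerical inputs are illustrative assumptions, not part of the paper's software.

```python
import math

def logistic(eta):
    """Logistic distribution function H(eta) = 1 / (1 + exp(-eta))."""
    return 1.0 / (1.0 + math.exp(-eta))

def conditional_moments(mu, u, sigma2):
    """Mean and variance of Y given the random effect, following model (1).

    mu     : value of the fixed curve mu(r, s, t)
    u      : value of the random curve U_i(r, s, t)
    sigma2 : dispersion parameter sigma^2
    """
    mean = logistic(mu + u)
    variance = sigma2 * mean * (1.0 - mean)
    return mean, variance

# Illustrative inputs only (not taken from the fitted model).
mean, variance = conditional_moments(mu=1.0, u=0.5, sigma2=1.0 / 30.0)
```

The variance is largest when the conditional mean is 0.5 and shrinks toward zero as the mean approaches either end of the unit interval, which is the behavior one expects for a proportion.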

Cox (1996) discussed other candidates for modeling Y_i(r, s, t). For example, the variance structure can be specified as

σ^{2} {[E {Y_{i} (r, s, t) | U_{i} (r, s, t)}]}^{2} {[1 - E {Y_{i} (r, s, t) | U_{i} (r, s, t)}]}^{2} .

However, unlike Cox’s independent observation scenarios, our method involves the random effect curves U_i(r, s, t) to model the correlation structure. Therefore, we focus on model (1).

We further specify μ(r, s, t) and U_i(r, s, t) terms in (1) by additive models

μ (r, s, t) = μ_{0} (r) + μ_{1} (s) + μ_{2} (t);

(2)

U_{i} (r, s, t) = U_{i, 0} (r) + U_{i, 1} (s) + U_{i, 2} (t),

(3)

where μ₀(r), μ₁(s) and μ₂(t) are fixed curves at r, s, t, and U_i,0(r), U_i,1(s), U_i,2(t) are random curves at r, s, t, respectively. For model identifiability, we set μ₁(1) = μ₂(1) = 0. We also assume that U_i,0(r), U_i,1(s) and U_i,2(t) are mutually independent for all r, s, t.

2.2. Basis Functions

To model the fixed and random curves in (2) and (3), let b₀(r) = {b_0,1(r),…, b_0,q₀(r)}^⊤, b₁(s) = {b_1,1(s),…, b_1,q₁(s)}^⊤ and b₂(t) = {b_2,1(t),…, b_2,q₂(t)}^⊤ be the vectors of orthogonal B-spline basis functions evaluated at physical activity minutes r, month s and weekday t, respectively. The orthogonal B-spline basis functions can be computed using an exact approach found in the R package “orthogonalsplinebasis” (Redd, 2011; R Core Team, 2016).

We model the fixed effect curves to be

μ_{0} (r) = b_{0}^{⊤} (r) β_{0}, μ_{1} (s) = b_{1}^{⊤} (s) β_{1}, μ_{2} (t) = b_{2}^{⊤} (t) β_{2},

where β₀, β₁, β₂ are q₀ × 1, q₁ × 1, q₂ × 1 regression coefficient vectors, and s = 2,…, 12 and t = 2,…, 7.

For the random effect curve U_i,0(r), we set

U_{i, 0} (r) = b_{0}^{⊤} (r) u_{i, 0},

where u_i,0 are q₀ × 1 correlated random effect vectors. In practice, when q₀ is large, the estimation of the variance structure for u_i,0 could be difficult. Based on the principal component approach (Zhou et al., 2008), we summarize u_i,0 by using only a few principal components by setting $u_{i, 0} ≐ \sum_{ℓ = 1}^{L} θ_{ℓ} α_{i, ℓ}$ , where L is the number of principal components, θ_ℓ is the ℓ^th q₀ × 1 orthogonal principal component vector, and α_i,ℓ is the ℓ^th principal component score. For identifiability, the principal components are sorted in decreasing order by the variance of α_i,ℓ, and the α_i,ℓ are assumed to be independent across all ℓ = 1,…, L. Denote Θ = (θ₁,…, θ_L) and α_i = (α_i,1,…, α_i,L)^⊤. We assume α_i ~ Normal(0, Δ) where Δ = diag(Δ₁,…, Δ_L), and thus u_i,0 ~ Normal(0, Ψ₀) with Ψ₀ = ΘΔΘ^⊤. We further denote $f_{ℓ} (r) = b_{0}^{⊤} (r) θ_{ℓ}$ as the ℓ^th principal component curve.
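The relation Ψ₀ = ΘΔΘ^⊤ can be illustrated with a small pure-Python sketch. The dimensions (q₀ = 3, L = 2), the vectors in `theta_cols` and the helper names are hypothetical choices made only for this example; the code checks that each θ_ℓ is an eigenvector of Ψ₀ with eigenvalue Δ_ℓ, which is what makes an eigen-decomposition of Ψ₀ a natural way to recover the components.

```python
import math

def covariance_from_components(theta_cols, deltas):
    """Return Psi0 = Theta * diag(deltas) * Theta^T.

    theta_cols : list of L orthonormal vectors, each of length q0
    deltas     : list of L score variances, in decreasing order
    """
    q0 = len(theta_cols[0])
    psi0 = [[0.0] * q0 for _ in range(q0)]
    for theta, delta in zip(theta_cols, deltas):
        for a in range(q0):
            for b in range(q0):
                psi0[a][b] += delta * theta[a] * theta[b]
    return psi0

def mat_vec(matrix, vector):
    """Plain matrix-vector product."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

# Hypothetical example: q0 = 3 basis functions, L = 2 components.
s = 1.0 / math.sqrt(2.0)
theta_cols = [[s, s, 0.0], [s, -s, 0.0]]  # orthonormal columns of Theta
deltas = [12.0, 6.0]                      # Delta_1 >= Delta_2
psi0 = covariance_from_components(theta_cols, deltas)

# Psi0 * theta_1 should equal Delta_1 * theta_1.
image = mat_vec(psi0, theta_cols[0])
```

Because L < q₀ in this example, Ψ₀ has reduced rank, which is the dimension reduction the principal component representation is meant to provide.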

Remark 1

Instead of using principal components, there are two commonly used models for the random effect curve U_i,0(r), namely

U_{i, 0} (r) = u_{i, 0}^{*}; and U_{i, 0} (r) = u_{i, 0}^{*} + u_{i, 1}^{*} r,

where $u_{i, 0}^{*}$ and $u_{i, 1}^{*}$ are scalar random effects. The first formulation only involves random-intercepts, which implies homoscedasticity for U_i,0(r) across r. The second model has an additional random-slope term, so that the variance of U_i,0(r) is a quadratic function over r. However, as we show in the simulation study, both formulations can be limited when the random effect structure is complicated, and they can lead to biased estimates.
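To make the comparison in Remark 1 explicit, the variance functions implied by the two simple formulations and by the principal component representation can be written side by side. This is a short derivation from the definitions above; the covariance term in the second line allows the two scalar random effects to be correlated, which is an assumption about that formulation rather than something stated in the text.

```latex
\begin{align*}
\text{random intercept:}\quad
  & \operatorname{var}\{U_{i,0}(r)\} = \operatorname{var}(u_{i,0}^{*}), \\
\text{random slope:}\quad
  & \operatorname{var}\{U_{i,0}(r)\} = \operatorname{var}(u_{i,0}^{*})
    + 2 r \operatorname{cov}(u_{i,0}^{*}, u_{i,1}^{*})
    + r^{2} \operatorname{var}(u_{i,1}^{*}), \\
\text{principal components:}\quad
  & \operatorname{var}\{U_{i,0}(r)\} = b_{0}^{\top}(r)\, \Theta \Delta \Theta^{\top} b_{0}(r)
    = \sum_{\ell=1}^{L} \Delta_{\ell}\, f_{\ell}(r)^{2}.
\end{align*}
```

The first two forms are constant or quadratic in r, whereas the third can follow any shape that the curves f_ℓ(r) can represent.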

For U_i,1(s) and U_i,2(t), we use dummy variables given as

U_{i, 1} (s) = \sum_{k = 1}^{12} I (s = k) u_{i, 1, k} = I (s = 1) u_{i, 1, 1} + I (s = 2) u_{i, 1, 2} + \dots + I (s = 12) u_{i, 1, 12},

U_{i, 2} (t) = \sum_{k = 1}^{7} I (t = k) u_{i, 2, k} = I (t = 1) u_{i, 2, 1} + I (t = 2) u_{i, 2, 2} + \dots + I (t = 7) u_{i, 2, 7},

where I(·) is an indicator function, u_i,1 = (u_i,1,1,…,u_i,1,12)^⊤ and u_i,2 = (u_i,2,1, …, u_i,2,7)^⊤ are 12 × 1 and 7 × 1 random effect vectors. We assume u_i,1 ~ Normal(0,Ψ₁) and u_i,2 ~ Normal(0,Ψ₂) with Ψ₁ = diag(Ψ_1,1,…, Ψ_1,12) and Ψ₂ = diag(Ψ_2,1,…, Ψ_2,7).

Therefore, the model (2)–(3) can be rewritten as

μ (r, s, t) = b_{0}^{⊤} (r) β_{0} + b_{1}^{⊤} (s) β_{1} + b_{2}^{⊤} (t) β_{2};

(4)

U_{i} (r, s, t) = b_{0}^{⊤} (r) Θ α_{i} + \sum_{k = 1}^{12} I (s = k) u_{i, 1, k} + \sum_{k = 1}^{7} I (t = k) u_{i, 2, k} .

(5)

The modeling with B-splines involves six sets of parameters to be estimated: (a) the dispersion parameter: σ²; (b) the B-spline coefficients for the fixed effects: β₀, β₁ and β₂; (c) the number of principal components: L; (d) the B-spline coefficients for principal component functions: Θ; (e) the principal component scores’ covariance matrix: Δ; and (f) the covariance matrices for u_i,1 and u_i,2: Ψ₁ and Ψ₂.

3. Model Fitting Procedure

3.1. Second Order Approximation for Continuous Proportions

Estimation of the parameters is complicated by the continuous proportional outcomes. We approximate the continuous proportions using a penalized quasilikelihood that includes a second order approximation term. This method was introduced in Goldstein & Rasbash (1996) and it outperforms the methods proposed by Breslow & Clayton (1993). The method is as follows. Since H(·) is the logistic distribution function, the first and second derivatives of H(·) are H′(·) = H(·){1 − H(·)} and H″(·) = {1 − 2H(·)} H′(·). Let g(·) = 1/H′(·).
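The derivative identities for the logistic function can be checked numerically. The sketch below compares the closed forms with central finite differences; the evaluation point and step size are arbitrary choices for illustration.

```python
import math

def H(x):
    """Logistic distribution function."""
    return 1.0 / (1.0 + math.exp(-x))

def H1(x):
    """First derivative: H'(x) = H(x){1 - H(x)}."""
    return H(x) * (1.0 - H(x))

def H2(x):
    """Second derivative: H''(x) = {1 - 2H(x)} H'(x)."""
    return (1.0 - 2.0 * H(x)) * H1(x)

def g(x):
    """g(x) = 1 / H'(x), as used in the working response."""
    return 1.0 / H1(x)

def central_difference(f, x, h=1e-5):
    """Second-order accurate numerical derivative of f at x."""
    return (f(x + h) - f(x - h)) / (2.0 * h)

x = 0.7  # arbitrary evaluation point
err_first = abs(central_difference(H, x) - H1(x))
err_second = abs(central_difference(H1, x) - H2(x))
```

Note that g(x)H″(x) simplifies to 1 − 2H(x), so the second order correction changes sign at x = 0 and vanishes there.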

Set $X (r, s, t) = {b_{0}^{⊤} (r), b_{1}^{⊤} (s), b_{2}^{⊤} (t)}$ . Let I₁(s) = {I(s = 1),…, I(s = 12)}^⊤ and I₂(t) = {I(t = 1),…, I(t = 7)}^⊤, and set $Z (r, s, t) = {b_{0}^{⊤} (r), I_{1}^{⊤} (s), I_{2}^{⊤} (t)}$ . Denote $β = {(β_{0}^{⊤}, β_{1}^{⊤}, β_{2}^{⊤})}^{⊤}$ and $u_{i} = {(u_{i, 0}^{⊤}, u_{i, 1}^{⊤}, u_{i, 2}^{⊤})}^{⊤}$ . Given known values of (β̂, û_i), letting η̂_i(r, s, t) = X(r, s, t)β̂ + Z(r, s, t)û_i, we use the approximate model

Y_{i}^{*} (r, s, t) = g {{\hat{η}}_{i} (r, s, t)} [Y_{i} (r, s, t) - H {{\hat{η}}_{i} (r, s, t)}] + {\hat{η}}_{i} (r, s, t) - (1 / 2) g {{\hat{η}}_{i} (r, s, t)} H ″ {{\hat{η}}_{i} (r, s, t)} [Z (r, s, t) \hat{var} (u_{i} - {\hat{u}}_{i}) Z^{⊤} (r, s, t)] \approx X (r, s, t) β + Z (r, s, t) u_{i} + ε_{i} (r, s, t),

(6)

where ε_i(r, s, t) ~ Normal[0, σ²g{η̂_i(r, s, t)}]. The derivation of this approximation can be found in Molenberghs & Verbeke (2005).
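A minimal sketch of the transformation on the left-hand side of (6) for a single observation is given below. The argument `z_var_z` is a hypothetical name for the scalar Z(r, s, t) var(u_i − û_i) Z^⊤(r, s, t), which is assumed to have been computed elsewhere; the function names are likewise illustrative.

```python
import math

def H(x):
    """Logistic distribution function."""
    return 1.0 / (1.0 + math.exp(-x))

def working_response(y, eta_hat, z_var_z):
    """Transformed outcome Y* in (6) for one observation.

    y        : observed proportion in (0, 1)
    eta_hat  : current linear predictor X beta_hat + Z u_hat
    z_var_z  : scalar Z var(u - u_hat) Z^T (assumed supplied by the caller)
    """
    h = H(eta_hat)
    h1 = h * (1.0 - h)          # H'(eta_hat)
    h2 = (1.0 - 2.0 * h) * h1   # H''(eta_hat)
    g = 1.0 / h1                # g(eta_hat)
    return g * (y - h) + eta_hat - 0.5 * g * h2 * z_var_z

def working_variance(eta_hat, sigma2):
    """Variance of the working error: sigma^2 * g(eta_hat)."""
    h = H(eta_hat)
    return sigma2 / (h * (1.0 - h))
```

The working variance follows from the conditional variance in (1): multiplying Y − H(η̂) by g scales its variance σ²H′ by g², giving σ²g²H′ = σ²g. When the observation equals its fitted mean and the correction term is zero, the working response reduces to η̂.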

3.2. Estimation Algorithm

According to the second order approximation in (6), the transformed continuous proportional outcomes $Y_{i}^{*} (r, s, t)$ can be treated as continuous variables with normal distributions. We estimate the parameters using an ECME algorithm (Schafer, 1998). The ECME algorithm updates the fixed structure parameters by the Newton-Raphson approach, and updates the random effects parameters by the EM method. We provide a brief sketch of the model estimation procedure here; the details are in Appendix S1 of the supporting information.

We set the initial number of principal components to be L = q₀, so that Ψ₀ is a full rank covariance matrix. We also give initial values for the other parameters listed in Section 2.2. The iteration procedure is then

1. update β and σ² by a Newton-Raphson approach,
2. update Ψ₀, Ψ₁ and Ψ₂ by the EM method, and
3. update L, Θ, and Δ with an eigen-decomposition of Ψ₀.

The entire procedure is iterated until convergence. For convergence properties of the ECME algorithm, see Liu & Rubin (1994).

3.3. Maximum Penalized Likelihood

The previous discussion focuses on the modeling of the response variables using basis functions. It is helpful, however, to introduce roughness penalties to regularize the fits of functions (Eilers & Marx, 1996). Denote θ_ℓ to be the ℓ^th column of Θ.

We penalize the loglikelihood and update the parameters in each iteration to maximize

ℒ_{pen} = ℒ - τ_{β} β^{⊤} D β - τ_{θ} \sum_{ℓ = 1}^{L} θ_{ℓ}^{⊤} D_{0} θ_{ℓ},

where ℒ is defined in Appendix S1, equation (S.2), τ_β and τ_θ are penalty parameters, and the penalty matrices are $D_{0} = \int b_{0}^{″} (r) b_{0}^{″} {(r)}^{⊤} d r, D_{1} = \int b_{1}^{″} (s) b_{1}^{″} {(s)}^{⊤} d s, D_{2} = \int b_{2}^{″} (t) b_{2}^{″} {(t)}^{⊤} d t$ , and D = diag(D₀, D₁, D₂).

Using maximum penalized likelihood has only a minor effect on the estimation algorithm, although of course it has a major effect on the estimation results. We describe the details in supporting information for Appendix S1.3.

In all of our work, we use five-fold crossvalidation to choose penalty parameters. We searched over a two dimensional grid for τ_β and τ_θ. The tuning parameters are obtained by maximizing the crossvalidated loglikelihood

- \sum_{i = 1}^{n} [m_{i} log (2 π) + log {| \hat{cov} (Y_{i}) |} + {Y_{i} - Ê (Y_{i})}^{⊤} {\hat{cov} (Y_{i})}^{- 1} {Y_{i} - Ê (Y_{i})}],

where the estimates Ê(Y_i) and $\hat{cov} (Y_{i})$ are described in supporting information for Appendix S1.4.

4. Simulation Studies

We use a simulation of 500 runs to assess the performance of our longitudinal functional additive model. There are n = 240 subjects, and each subject has 84 visits observed in 12 weeks; this is similar to our BodyMedia data, but with a much smaller number of subjects. Each week has complete observations from Monday to Sunday. We set each month to have four weeks. All subjects are observed in three consecutive months. For example, subject 1 is observed from January to March and subject 2 is observed from February to April. Then Y_i(r, s, t) is generated according to the beta distribution, with density function conditional on U_i(r, s, t) given by

\frac{Γ (ϕ)}{Γ (κ ϕ) Γ {(1 - κ) ϕ}} {y_{i} (r, s, t)}^{κ ϕ - 1} {1 - y_{i} (r, s, t)}^{(1 - κ) ϕ - 1},

where Γ(·) is the gamma function, κ = E{Y_i(r, s, t)|U_i(r, s, t)} and ϕ = 1/σ² − 1. We set E{Y_i(r, s, t)|U_i(r, s, t)} = H{μ₀(r) + μ₁(s) + μ₂(t) + f₁(r)α_i,1 + f₂(r)α_i,2 + U_i,1(s) + U_i,2(t)}, where μ₀(r) = H(r/2 − 5.5), μ₁(s) = (s − 7)²/36 − 1 and μ₂(t) = −(t − 3)²/5 + 0.8. The principal component curves are $f_{1} (r) = \sin (2 π r / 22) / \sqrt{11}$ and $f_{2} (r) = \cos (2 π r / 22) / \sqrt{11}$ . We generate α_i,1, α_i,2, U_i,1(s) and U_i,2(t) as normally distributed with zero means, and set Δ₁ = 12, Δ₂ = 6, Ψ_1,s = 2 for all s, and Ψ_2,t = 1 for all t. We also generate r as uniformly distributed in [0, 22]. For σ², we studied σ² = 0.02, following the suggestion from Figueroa-Zúñiga et al. (2013), and σ² = 1/30, which is similar to the result of our data application in Section 5. Our method performs well in both scenarios, and we only report the simulation results for σ² = 1/30 here.
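The data-generating mechanism described above can be sketched for a single observation using only the Python standard library. The seed, the helper names, and the choice of month s = 1 and weekday t = 2 are arbitrary assumptions for illustration; this is not the simulation code used for the reported results.

```python
import math
import random

def H(x):
    """Logistic distribution function."""
    return 1.0 / (1.0 + math.exp(-x))

def f1(r):
    """First principal component curve."""
    return math.sin(2.0 * math.pi * r / 22.0) / math.sqrt(11.0)

def f2(r):
    """Second principal component curve."""
    return math.cos(2.0 * math.pi * r / 22.0) / math.sqrt(11.0)

def conditional_mean(r, s, t, alpha1, alpha2, u1_s, u2_t):
    """E{Y_i(r, s, t) | U_i(r, s, t)} under the simulation design."""
    mu0 = H(r / 2.0 - 5.5)
    mu1 = (s - 7) ** 2 / 36.0 - 1.0
    mu2 = -((t - 3) ** 2) / 5.0 + 0.8
    return H(mu0 + mu1 + mu2 + f1(r) * alpha1 + f2(r) * alpha2 + u1_s + u2_t)

def draw_beta_outcome(kappa, sigma2, rng):
    """Beta draw with mean kappa and variance sigma2 * kappa * (1 - kappa)."""
    phi = 1.0 / sigma2 - 1.0
    return rng.betavariate(kappa * phi, (1.0 - kappa) * phi)

rng = random.Random(2016)  # arbitrary seed
sigma2 = 1.0 / 30.0
alpha1 = rng.gauss(0.0, math.sqrt(12.0))  # variance Delta_1 = 12
alpha2 = rng.gauss(0.0, math.sqrt(6.0))   # variance Delta_2 = 6
u1_s = rng.gauss(0.0, math.sqrt(2.0))     # variance Psi_{1,s} = 2
u2_t = rng.gauss(0.0, 1.0)                # variance Psi_{2,t} = 1
r = rng.uniform(0.0, 22.0)
kappa = conditional_mean(r, 1, 2, alpha1, alpha2, u1_s, u2_t)
y = draw_beta_outcome(kappa, sigma2, rng)
```

With shape parameters κϕ and (1 − κ)ϕ, the beta variance is κ(1 − κ)/(1 + ϕ), so the choice ϕ = 1/σ² − 1 reproduces the variance function in (1). The factor 1/√11 makes f₁ and f₂ orthonormal on [0, 22].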

As a comparison to our method, three naive approaches are explored. The first approach (labeled as NAIVE1) follows our algorithm but uses a random-intercepts model for U_i,0(r) as discussed in Remark 1. The second method (labeled as NAIVE2) is similar to NAIVE1 but uses a random-slopes model. The third method (labeled as NAIVE3) uses the same random effect structure as our method, but it first takes a logit transformation of the responses and then fits the outcomes by a linear functional data model (Zhou et al., 2008).
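One possible intuition for why a logit-then-linear approach such as NAIVE3 can be biased is that the mean of logit(Y) is generally not the logit of the mean of Y. The sketch below illustrates this with beta draws at a fixed conditional mean; the values κ = 0.8 and σ² = 1/30, the seed, and the sample size are assumptions for the example, and this is offered as a heuristic rather than as the paper's own explanation.

```python
import math
import random

def H(x):
    """Logistic distribution function."""
    return 1.0 / (1.0 + math.exp(-x))

def logit(p):
    """Inverse of H."""
    return math.log(p / (1.0 - p))

rng = random.Random(7)  # arbitrary seed
kappa, sigma2 = 0.8, 1.0 / 30.0
phi = 1.0 / sigma2 - 1.0

# Beta outcomes with conditional mean kappa, as in the simulation design.
draws = [rng.betavariate(kappa * phi, (1.0 - kappa) * phi) for _ in range(20000)]

mean_of_logits = sum(logit(y) for y in draws) / len(draws)
logit_of_mean = logit(kappa)

# Back-transforming the average logit does not recover kappa.
back_transformed = H(mean_of_logits)
```

In this setting the average of the logits exceeds logit(κ), so a linear model fitted on the logit scale targets a quantity different from the linear predictor in (1).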

We use cubic B-spline basis functions with 10 equispaced knots to fit μ₀(r), and use linear B-spline basis functions with 5 and 4 knots to fit μ₁(s) and μ₂(t), respectively. Convergence was achieved for all simulated data sets. The correct number of principal components was selected in all simulated data sets. Table 1 presents the mean estimates and the mean squared errors (MSE) of the parameters, which indicate good performance in the estimation of the model parameters for our approach.

Table 1.

Simulation results for Section 4. Displayed are the average estimates and mean squared errors (MSE) of the parameters. The symbol * means that the displayed value is the actual number multiplied by 10000.

Parameter	σ²	Δ₁	Δ₂	Ψ_1,1	Ψ_1,2	Ψ_1,3	Ψ_1,4	Ψ_1,5	Ψ_1,6	Ψ_1,7	Ψ_1,8

True	0.03	12.00	6.00	2.00	2.00	2.00	2.00	2.00	2.00	2.00	2.00
Mean	0.03	12.05	6.01	2.00	1.97	2.03	1.97	1.98	2.03	1.99	1.98
MSE	0.01*	1.22	0.31	0.14	0.17	0.17	0.17	0.15	0.16	0.16	0.17

Parameter	Ψ_1,9	Ψ_1,10	Ψ_1,11	Ψ_1,12	Ψ_2,1	Ψ_2,2	Ψ_2,3	Ψ_2,4	Ψ_2,5	Ψ_2,6	Ψ_2,7

True	2.00	2.00	2.00	2.00	1.00	1.00	1.00	1.00	1.00	1.00	1.00
Mean	2.00	1.99	1.98	2.02	1.00	1.00	1.01	1.02	1.01	1.01	0.99
MSE	0.15	0.16	0.15	0.15	0.01	0.01	0.01	0.01	0.01	0.01	0.01


Figures 1 (a)–(b) show the true fixed curve μ₀(r), and the averaged estimates of the four methods. They indicate that NAIVE1, NAIVE2 and NAIVE3 approaches lead to obviously biased outcomes, while there is little bias for our method. Figures 1 (c)–(f) represent the performance of our approach in fixed curves μ₁(s), μ₂(t) and the principal component curves f₁(r), f₂(r), respectively. Our approach captures all of the true curve patterns.

Figure 1. Fitted fixed and random effects curves for 500 simulated data sets: (a) mean of the fixed effects curve estimates of μ₀(r) obtained from NAIVE1, NAIVE2 and our method, (b) mean of the fixed effects curve estimates of μ₀(r) obtained from NAIVE3 and our method, (c) mean of the fixed effects curve estimates of μ₁(s), (d) mean of the fixed effects curve estimates of μ₂(t), (e) mean of the principal component curve estimates of f₁(r), (f) mean of the principal component curve estimates of f₂(r). Dotted lines denote the true curves. Solid thin lines represent the average values of the fitted curves from our method. Dot-dashed thin and thick lines in panel (a) represent the average values of the fitted curves from the NAIVE1 and NAIVE2 methods, respectively. The solid thick line in panel (b) represents the average values of the fitted curves from the NAIVE3 method. The upper and lower dashed lines in panels (c)–(f) are the 10% and 90% quantiles of the fitted values from our method over the 500 simulated data sets.

5. Application to Physical Activity Data

In this section we apply our methods to the BodyMedia data to help answer the questions raised in the introduction. Our data involve 935 males, and each person has 84 observations consisting of daily METs and daily sleep efficiency. Referring back to the notation of Section 2, Y_i(r, s, t) is the ratio of daily sleep time to lying down time, r is the average minutes of moderate and vigorous activity (MVPA) time on the current day and previous day, s is the month and t is the weekday. We use cubic B-spline basis functions with 24 equispaced knots to fit μ₀(r), while the other basis functions follow the settings in Section 4. The dispersion parameter σ² is estimated to be 0.034.

Figure 2 presents the estimation results for the fixed effect curves μ₀(r), μ₁(s) and μ₂(t), and the principal component curve f₁(r). Figure 2(a) suggests that, conditional on the random effect U_i(r, s, t), the relation between MVPA minutes and sleep efficiency can be divided into three parts. From 0 to 60 minutes, an increase in MVPA minutes leads to higher sleep efficiency. The effect of increasing MVPA minutes flattens out gradually between 60 and 120 minutes, while greater MVPA minutes have a negative influence on sleep efficiency beyond 120 minutes. Figure 2(b) suggests that sleep efficiency has a strong monthly trend, where January and February have higher sleep efficiency while October has lower sleep efficiency. Figure 2(c) indicates that Sunday and Monday have lower sleep efficiency, but the sleep efficiency on Wednesday and Thursday is higher. Thus, it suggests that sleep efficiency is greater in the middle of the week.

Figure 2. Fitted fixed effect and principal component curves for data application. (a) Fitted fixed effects curve μ₀(r) for the factor MVPA minutes, (b) Fitted fixed effects curve μ₁(s) for the factor month, (c) Fitted fixed effects curve μ₂(t) for the factor weekday, (d) Fitted principal component curve f₁(r) for the factor MVPA minutes. Solid lines represent the values of the fitted curves. The upper and lower dashed lines are the 10% and 90% quantiles of the fitted values across 500 bootstrap estimates. Dot-dashed vertical lines mark MVPA times of 60 and 120 minutes, respectively.

The number of principal components for MVPA minutes is selected to be 1. Figure 2(d) shows the principal component curve is decreasing with increasing MVPA time, which means greater physical activity leads to less between-subject variability. The result implies subjects with more physical activity have more consistent sleep efficiency.

We also study the marginal mean and variance structure of Y_i(r, s, t). The month is set to January and the weekday to Monday. Figure 3(a) presents the marginal mean of the outcomes evaluated at different MVPA minutes. It shows that increasing MVPA improves sleep efficiency. However, the increase in sleep efficiency continues only up to about 120 MVPA minutes, and then it tails off. Figure 3(b) shows the marginal variance of the responses. The variability of sleep efficiency decreases with increasing MVPA time. In particular, the variability at 200 MVPA minutes is about half of that at 0 MVPA minutes. This suggests that people with higher physical activity time will generally have more consistent sleep efficiency.
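To illustrate how marginal moments relate to the conditional model (1), the following sketch averages over a single scalar normal random effect by Monte Carlo and applies the law of total variance. The scalar random effect, the function name, and the inputs are simplifying assumptions for this example; the estimators actually used are described in the supporting information and may differ.

```python
import math
import random

def H(x):
    """Logistic distribution function."""
    return 1.0 / (1.0 + math.exp(-x))

def marginal_moments(mu, random_effect_var, sigma2, n_draws=50000, seed=0):
    """Monte Carlo marginal mean and variance of Y.

    mean     = E[H(mu + U)]
    variance = E[sigma2 * H(1 - H)] + var[H(mu + U)],  U ~ Normal(0, random_effect_var)
    """
    rng = random.Random(seed)
    sd = math.sqrt(random_effect_var)
    cond_means = [H(mu + rng.gauss(0.0, sd)) for _ in range(n_draws)]
    mean = sum(cond_means) / n_draws
    between = sum((p - mean) ** 2 for p in cond_means) / n_draws
    within = sigma2 * sum(p * (1.0 - p) for p in cond_means) / n_draws
    return mean, within + between

# Hypothetical inputs: fixed-curve value 2 and random-effect variance 4.
mean, variance = marginal_moments(mu=2.0, random_effect_var=4.0, sigma2=1.0 / 30.0)
```

With a positive fixed-curve value, averaging over the random effect pulls the marginal mean toward 0.5 relative to H(μ), and a smaller random-effect variance at a given MVPA level lowers the between-subject part of the marginal variance.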

Figure 3. Fitted marginal curves for data application. (a) Fitted marginal mean curve for the factor MVPA minutes on Monday in January, (b) Fitted marginal variance curve for the factor MVPA minutes on Monday in January. Solid lines represent the values of the fitted curves. The upper and lower dashed lines are the 10% and 90% quantiles of the fitted values across 500 bootstrap estimates. Dot-dashed vertical lines mark MVPA times of 60 and 120 minutes, respectively.

We show the correlation structure of the outcomes with respect to MVPA time on Monday in January, corr{Y_i(j, 1, 2), Y_i(k, 1, 2)}, as a 3-D plot in Figure 4. The figure reaches its peak around (j = 0, k = 0). This makes sense because lower MVPA time leads to higher variability in sleep efficiency which causes greater correlation. On the other hand, the plot decreases as MVPA minutes increase. This is likely because, intuitively, sleep efficiency is relatively constant for people with longer MVPA minutes.

Figure 4. The estimates of correlation surfaces for corr{Y_i(j, 1, 2), Y_i(k, 1, 2)} (j ≠ k) on Monday in January.

6. Discussion

We have proposed a three-factor joint modeling and estimation strategy for functional data with continuous proportions. The simulation results are encouraging, with little bias. The analysis of the BodyMedia data using our method demonstrates its utility in real applications. Our conclusions are that daily sleep efficiency improves with increasing MVPA up to about 120 minutes, and that increasing MVPA results in a decrease in the variance of sleep efficiency throughout the range of MVPA minutes. The former conclusion makes sense in general; however, the plateau of mean daily sleep efficiency at about 120 MVPA minutes has not been reported previously, largely because fully linear modeling of such data is standard. We believe that the substantial decrease in the variability of sleep efficiency as MVPA minutes increase is also a new finding, since standard analyses focus only on means.


Acknowledgments

Li was supported by the Discovery Grants Program of the Natural Sciences and Engineering Research Council of Canada (NSERC, RGPIN-2015-04409). Carroll was supported by a grant from the National Cancer Institute (U01-CA057030). The authors thank BodyMedia, Inc. for making the data available to them.

Footnotes

Supporting Information

Additional information for this article is available at the publisher’s web-site.

Appendix S1: Detailed model estimation algorithm.

Figure 1: Fitted fixed and random effects curves for 500 simulated data sets.

Figure 2: Fitted fixed effect and principal component curves for data application.

Figure 3: Fitted marginal curves for data application.

Figure 4: The estimates of correlation surfaces.

Table 1: Results for simulation results in Section 4.

References

Breslow NE, Clayton DG. Approximate inference in generalized linear mixed models. Journal of the American Statistical Association. 1993;88:9–25.
Cox C. Nonlinear quasi-likelihood models: applications to continuous proportions. Computational Statistics and Data Analysis. 1996;21:449–461.
Eilers PHC, Marx BD. Flexible smoothing with B-splines and penalties. Statistical Science. 1996;11:89–121.
Ekstedt M, Nyberg G, Ingre M, Ekblom Ö, Marcus C. Sleep, physical activity and BMI in six to ten-year-old children measured by accelerometry: a cross-sectional study. International Journal of Behavioral Nutrition and Physical Activity. 2013;10:82. doi: 10.1186/1479-5868-10-82.
Ferrari S, Cribari-Neto F. Beta regression for modelling rates and proportions. Journal of Applied Statistics. 2004;31:799–815.
Figueroa-Zúñiga JI, Arellano-Valle RB, Ferrari SL. Mixed beta regression: A Bayesian perspective. Computational Statistics and Data Analysis. 2013;61:137–147.
Gertheiss J, Maier V, Hessel EF, Staicu AM. Marginal functional regression models for analyzing the feeding behavior of pigs. Journal of Agricultural, Biological, and Environmental Statistics. 2015;20:353–370.
Goldsmith J, Zipunnikov V, Schrack J. Generalized multilevel function-on-scalar regression and principal component analysis. Biometrics. 2015. doi: 10.1111/biom.12278.
Goldstein H, Rasbash J. Improved approximations for multilevel models with binary responses. Journal of the Royal Statistical Society, Series A. 1996;159:505–513.
Hall P, Müller HG, Yao F. Modelling sparse generalized longitudinal observations with latent Gaussian processes. Journal of the Royal Statistical Society, Series B. 2008;70:703–723.
Kieschnick R, McCullough BD. Regression analysis of variates observed on (0, 1): percentages, proportions and fractions. Statistical Modelling. 2003;3:193–213.
Lambiase MJ, Gabriel KP, Kuller LH, Matthews KA. Temporal relationships between physical activity and sleep in older women. Medicine and Science in Sports and Exercise. 2013;45:2362–2368. doi: 10.1249/MSS.0b013e31829e4cea.
Liu C, Rubin DB. The ECME algorithm: A simple extension of EM and ECM with faster monotone convergence. Biometrika. 1994;81:633–648.
McCullagh P, Nelder JA. Generalized Linear Models. London: Chapman and Hall; 1989.
Molenberghs G, Verbeke G. Models for Discrete Longitudinal Data. Springer; 2005.
Oudegeest-Sander MH, Eijsvogels TH, Verheggen RJ, Poelkens F, Hopman MT, Jones H, Thijssen DH. Impact of physical fitness and daily energy expenditure on sleep efficiency in young and older humans. Gerontology. 2013;59:8–16. doi: 10.1159/000342213.
R Core Team. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing; 2016.
Redd A. orthogonalsplinebasis: Orthogonal Bspline Basis Functions. R package version 0.1.5. 2011.
Schafer JL. Some improved procedures for linear mixed models. Technical report. The Pennsylvania State University: The Methodological Center; 1998.
Scheipl F, Gertheiss J, Greven S. Generalized functional additive mixed models. Electronic Journal of Statistics. 2016.
Serban N, Staicu AM, Carroll RJ. Multilevel cross-dependent binary longitudinal data. Biometrics. 2013;69:903–913. doi: 10.1111/biom.12083.
Simas AB, Barreto-Souza W, Rocha AV. Improved estimators for a general class of beta regression models. Computational Statistics and Data Analysis. 2010;54:348–366.
Verkuilen J, Smithson M. Mixed and mixture regression models for continuous bounded responses using the beta distribution. Journal of Educational and Behavioral Statistics. 2012;37:82–113.
Wedderburn RWM. Quasi-likelihood functions, generalized linear models, and the Gauss-Newton method. Biometrika. 1974;61:439–447.
Yao F, Müller HG, Wang JL. Functional data analysis for sparse longitudinal data. Journal of the American Statistical Association. 2005;100:577–590.
Zhao W, Zhang R, Huang Z, Feng J. Partially linear single-index beta regression model and score test. Journal of Multivariate Analysis. 2012;103:116–123.
Zhou L, Huang JZ, Carroll RJ. Joint modelling of paired sparse functional data using principal components. Biometrika. 2008;95:601–619. doi: 10.1093/biomet/asn035.


