Abstract
We study a mixed-effects model in which the response and the main covariate are linked by position. The covariate corresponding to the observed response is not directly observable, but there exists a latent covariate process that represents the underlying positional features of the covariate. When the positional features and the underlying distributions are parametric, the expectation-maximization (EM) algorithm is the most commonly used procedure. Without the parametric assumptions, however, the practical feasibility of a semiparametric EM algorithm and the corresponding inference procedures remains to be investigated. In this paper, we propose a semiparametric approach and identify the conditions under which the semiparametric estimators share the same asymptotic properties as the unachievable estimators that use the true values of the latent covariate; that is, the oracle property is achieved. We propose a Monte Carlo graphical evaluation tool to assess the adequacy of the sample size for achieving the oracle property. The semiparametric approach is then applied to data from a colon carcinogenesis study on the effects of cell DNA damage on the expression level of the oncogene bcl-2. The graphical evaluation shows that, with a moderate number of subunits, the numerical performance of the semiparametric estimator is very close to the asymptotic limit. This indicates that a complex EM-based implementation could achieve at most minimal improvement and is thus unnecessary.
Keywords and phrases: Carcinogenesis, Consistency, Generalized estimating equation, Local linear smoothing, Mixed-effects model
1. INTRODUCTION
1.1 Colon carcinogenesis study
Recent research on colon cancer has focused on linking colon tumor development to the inhibition of apoptosis (cell death; see Heemels et al. 2000). When the body is exposed to a carcinogen, apoptosis terminates cells with irreparable genetic damage and thus prevents them from proliferating into cancer cells. It consequently reduces the risk of cancer. Any inhibition of apoptosis, on the other hand, promotes cancer development.
An oncogene closely linked to, but adversely affecting, apoptosis is bcl-2. Over-expression of the bcl-2 gene leads to suppression of apoptosis and thus allows tumor cells to survive and proliferate. During the initial stage of colon carcinogenesis, few apoptotic cells are formed and the main information about apoptosis is carried by apoptosis-related genes, e.g., bcl-2. For the purpose of cancer prevention, it would be beneficial if the level of bcl-2 gene expression decreased as cell DNA damage increased. Therefore, in this study, we focus on investigating the relationship between cell DNA damage and bcl-2 gene expression during the initial stage of colon cancer. Our primary interest is how diet affects this relationship at different times post carcinogen exposure.
We now briefly describe the experiment. Thirty rats were divided evenly into two groups. Each group was fed one of two diets, fish oil supplemented or corn oil supplemented, for two weeks. After this, all 30 rats were injected with azoxymethane (AOM), a carcinogen that induces colon cancer. Three rats from each diet group were then euthanized at 0, 3, 6, 9, and 12 hours post injection to measure the cell DNA damage and bcl-2 gene expression. In the laboratory, cell DNA damage is measured by the DNA adduct level. For each rat, 20 crypts were selected to measure bcl-2, and another group of 15 to 25 crypts was selected to measure the DNA adduct level. These two measurements were taken at each cell within the selected crypts. There are about 14 to 56 cells in each crypt.
Colon crypts are discrete units within the colon where colonic cells replicate. At the bottom of each crypt are the stem cells that generate all the cells within the crypt. Daughter cells are formed from stem cells, move up along the crypt, and exfoliate into the lumen as more cells are created. Thus, a cell’s relative position within a crypt is an indicator of its age: cells at the bottom are younger, and cells near the top are older. In this study, the position of a cell was recorded as the relative cell position, as in Morris et al. (2001). The relative cell positions range from 0 at the bottom to 1 at the top.
Our goal is to understand the relationship between the response (bcl-2 gene expression) and the covariate (cell DNA adduct level), as well as how this relationship changes with diet. More precisely, we want to investigate whether, in comparison to the corn oil supplemented diet, the fish oil supplemented diet helps to suppress the increasing trend of bcl-2 gene expression as DNA damage increases. We need a mixed-effects model to accommodate the diet and time treatment effects, as well as the random effects for rat and crypt. The special aspect of this study is that DNA adduct level and bcl-2 gene expression were measured in the same rats but not in the same crypts. This is because, once a crypt was used to take the DNA adduct measurement, it could not be used again to measure bcl-2. Instead, a different crypt from the same rat was used. Since the number of cells varied from crypt to crypt, cells within different crypts had different relative positions. Consequently, the two measurements, bcl-2 gene expression and DNA adduct level, were observed at different relative crypt depths (i.e., relative cell positions) in addition to being from different crypts. This creates a problem of misaligned measurements, for which conventional regression methods are not appropriate.
To illustrate the data structure, Table 1 lists the portion of the data from one rat under one treatment condition (i.e., one combination of diet and time post carcinogen exposure). It shows that, within this rat, bcl-2 was observed from 20 crypts while DNA adduct was observed from 23 crypts. These are two different groups of crypts; that is, crypt j for observing bcl-2 is different from crypt j′ for observing adduct, for all j and j′. Within the first crypt (j = 1) for bcl-2 observation, bcl-2 was measured over 32 cells, while within the first crypt (j′ = 1) for DNA adduct observation, DNA adduct was measured over 40 cells.
Table 1.
Colon Carcinogenesis Data within One Rat and Under One Treatment Condition: the Left Four Columns Are bcl-2 Observations from 20 Crypts, and the Right Four Columns Are DNA Adduct Observations from 23 Crypts; Crypt j Is the j-th Crypt for Observing bcl-2 and j′ Is the j′-th Crypt for Observing DNA Adduct; These Are Two Different Sets of Crypts in the Same Rat.
| bcl-2 observations | | | | DNA adduct observations | | | |
|---|---|---|---|---|---|---|---|
| crypt (j) | cell | cell position | bcl-2 reading | crypt (j′) | cell | cell position | DNA adduct reading |
| 1 | 1 | 0 | y1,1 | 1 | 1 | 0 | w1,1 |
| 1 | 2 | 1/31 | y1,2 | 1 | 2 | 1/39 | w1,2 |
| 1 | 3 | 2/31 | y1,3 | 1 | 3 | 2/39 | w1,3 |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| 1 | 32 | 1 | y1,32 | 1 | 32 | 31/39 | w1,32 |
| | | | | ⋮ | ⋮ | ⋮ | ⋮ |
| | | | | 1 | 40 | 1 | w1,40 |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| 20 | 1 | 0 | y20,1 | 20 | 1 | 0 | w20,1 |
| 20 | 2 | 1/28 | y20,2 | 20 | 2 | 1/35 | w20,2 |
| 20 | 3 | 2/28 | y20,3 | 20 | 3 | 2/35 | w20,3 |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| 20 | 29 | 1 | y20,29 | 20 | 29 | 29/35 | w20,29 |
| | | | | ⋮ | ⋮ | ⋮ | ⋮ |
| | | | | 20 | 36 | 1 | w20,36 |
| | | | | ⋮ | ⋮ | ⋮ | ⋮ |
| | | | | 23 | 1 | 1/30 | w23,1 |
| | | | | ⋮ | ⋮ | ⋮ | ⋮ |
| | | | | 23 | 31 | 1 | w23,31 |
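The misalignment in Table 1 can be illustrated with a small sketch. It assumes, consistent with most rows of the table, that cell k (counted from 1) in a crypt with K cells has relative position (k − 1)/(K − 1); the function name `relative_positions` is hypothetical and not part of the original analysis.

```python
from fractions import Fraction


def relative_positions(n_cells):
    """Relative position (k - 1) / (n_cells - 1) for cells k = 1, ..., n_cells (assumed rule)."""
    if n_cells < 2:
        raise ValueError("need at least two cells to define a relative position")
    return [Fraction(k - 1, n_cells - 1) for k in range(1, n_cells + 1)]


bcl2_positions = relative_positions(32)    # first bcl-2 crypt in Table 1 (32 cells)
adduct_positions = relative_positions(40)  # first DNA adduct crypt in Table 1 (40 cells)

# Positions at which both crypts have a measurement.
shared = set(bcl2_positions) & set(adduct_positions)
print(sorted(shared))
```

Under this rule the two crypts share only the endpoints 0 and 1, so almost no bcl-2 reading has a DNA adduct reading at the same relative position, even before accounting for the fact that the crypts themselves differ.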
1.2 Statistical background
When the covariate values are not directly available, imputation is the traditional practice, for example by the nearest neighbor method (NN; Pielou, 1961; Huang and Zhu, 2002) or the last observation carry-forward method (LOCF; Carroll, 2004). In the colon carcinogenesis study, these two methods use, within each rat, the DNA adduct value observed nearest to or immediately before the position of the bcl-2 measurement, respectively, as the matching DNA adduct measurement. The pitfall of this naive imputation approach is the attenuation effect discussed in Carroll (2004).
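A minimal sketch of the two imputation rules may help fix ideas. It assumes a single pooled set of covariate positions `t_obs` with readings `w_obs` for one rat; the function names and the handling of targets that precede the first observed position are illustrative choices, not taken from the cited papers.

```python
import numpy as np


def impute_nn(t_target, t_obs, w_obs):
    """Nearest neighbor: use the reading observed at the position closest to each target."""
    t_target, t_obs, w_obs = (np.asarray(a, dtype=float) for a in (t_target, t_obs, w_obs))
    idx = np.abs(t_target[:, None] - t_obs[None, :]).argmin(axis=1)
    return w_obs[idx]


def impute_locf(t_target, t_obs, w_obs):
    """LOCF: use the reading at the largest observed position not exceeding each target."""
    t_target, t_obs, w_obs = (np.asarray(a, dtype=float) for a in (t_target, t_obs, w_obs))
    order = np.argsort(t_obs)
    t_sorted, w_sorted = t_obs[order], w_obs[order]
    idx = np.searchsorted(t_sorted, t_target, side="right") - 1
    # Assumption: a target before the first observed position takes the first reading.
    idx = np.clip(idx, 0, len(t_sorted) - 1)
    return w_sorted[idx]


t_obs = [0.0, 0.25, 0.5, 0.75, 1.0]
w_obs = [10.0, 20.0, 30.0, 40.0, 50.0]
t_target = [0.2, 0.3, 0.9]
nn_values = impute_nn(t_target, t_obs, w_obs)
locf_values = impute_locf(t_target, t_obs, w_obs)
```

Both rules plug a single noisy reading into the regression, which is the source of the attenuation discussed above.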
Another possible approach is to assume a subject-level latent process Xi(·) for the positional feature behind the observed covariate from subject i and apply the maximum likelihood method through an expectation-maximization algorithm (EM; Dempster, Laird, and Rubin, 1977). The drawback of EM is two-fold. First, maximum likelihood is sensitive to the model assumptions. Second, due to the complexity of the colon carcinogenesis data, the derivation of EM is not straightforward and the computational implementation is extremely intensive.
In this paper, we propose a semiparametric method. As in the EM algorithm, we consider a latent covariate process Xi(·) for the DNA adduct versus the relative cell position within each rat. Unlike EM, we directly “estimate” the rat-level latent covariate by nonparametric regression and use the estimated DNA adduct in the estimation of the primary model. This semiparametric approach and the NN and LOCF methods are all “plug-in” methods, which are often viewed unfavorably. However, we prove that, under adequate within-subject sample sizes, the estimators from this semiparametric approach are consistent and achieve the oracle property. An EM estimator, depending on its construction, may at most share the same asymptotic properties. We later propose a graphical tool to check the adequacy of the “effective” sample sizes.
Our semiparametric approach models the primary relationship between bcl-2 and DNA adduct and, at the same time, the positional feature of the DNA adduct level. Joint modeling of parametric longitudinal features and a primary endpoint has been studied extensively; see Tsiatis and Davidian (2001), Li, Zhang, and Davidian (2004), and the references therein. However, identifying a parametric structure for longitudinal biomarkers is not always feasible. In this paper, the joint modeling is extended to accommodate nonparametric longitudinal covariate features and a longitudinal response.
The semiparametric approach can also be considered as an extension of Carroll and Wand (1991) and Pepe and Fleming (1991) in that a nonparametric estimation is used to obtain the estimates of the unobserved covariate. However, there are three major differences. First, in these previous works, true covariate values were available for a portion of the subjects, but in our example, there were absolutely no matched measurements. Second, the DNA adduct measurement alone was from a nonparametric mixed-effects model with the marginal mean as a function of the relative cell position. Meanwhile, the response bcl-2 was also measured repeatedly from the same crypt within the same subject. Therefore, the observations in our example were correlated while the previous works focused on independent cases. Third, the issue of “effective” sample size and the corresponding diagnostic tool had never been investigated in the previous papers.
The rest of this paper is organized as follows. § 2 formulates the mixed-effects models for the study and describes the proposed semiparametric method. We present the asymptotic properties of the semiparametric estimators and the conditions to reach the oracle property. § 3 gives the numerical results which include a simulation study, the application to the colon carcinogenesis, and a sensitivity analysis on the “effective” sample size. Finally, § 4 contains the concluding remarks.
2. THE MODEL, THE METHOD AND THE ASYMPTOTICS
2.1 Model specification
Let X(·) denote a latent covariate process. For the colon carcinogenesis study, Xi(t) is the realization of the rat-level process for DNA adduct in rat i at cell position t. Hereafter, the cell position refers to the relative crypt depth of the cell within a colon crypt, thus t ∈ [0, 1].
The following mixed-effects model describes a general relationship between the response Y and the latent covariate X,
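$$Y^{tr}_{ijk} = H\{X_i(t_{ijk}), \beta\} + Z^{tr\,T}_{ij} b^{tr}_{i} + \varepsilon^{tr}_{ijk}. \qquad (1)$$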
To clarify the notation, we associate it with the terminology of the colon carcinogenesis example. That is, Y is the bcl-2 gene expression, X is the DNA adduct latent covariate, i is the subject index of the rat, j is the subunit index of the crypt selected to measure bcl-2, k is the index of the cell in the selected crypt, and the superscript “tr” is the treatment indicator for the diet and the time. The cell-level bcl-2 gene expression is linked to the rat-level covariate process through the cell position tijk. β is the unknown fixed effect parameter vector and H is the known link function. The zero-mean, bounded-variance random effect btr, coupled with the rat- and crypt-level observed covariate Ztr, describes the hierarchical rat- and crypt-level dependency. Finally, γtr denotes the vector of variance parameters in the distribution of btr and the additive error εtr. For simplicity, we hereafter suppress the superscript “tr” in the text.
The latent covariate Xi(t) is completely unobservable but can be considered as the rat-level mean at cell position t. Let Wij′k′ be the DNA adduct observed from cell k′ of crypt j′ in rat i. What follows is a natural model that links the observed Wij′k′ to the rat-level latent covariate,
$$W_{ij'k'} = X_i(t_{ij'k'}) + d_{ij'}(t_{ij'k'}) + e_{ij'k'}. \qquad (2)$$
In this nonparametric mixed-effects model, dij′ denotes the crypt-level variation and eij′k′ the additive error. Conditional on Xi(t), we assume that measurements from different crypts are independent. We let $\gamma_{W,i}$ denote the vector of variance parameters in model (2) for the covariate observation within rat i. Note that j′ denotes the index of the crypt selected for measuring the DNA adduct, and k′ is the index of the cell within that crypt. Due to the nature of this experiment, a crypt j′ in (2) is never the same crypt as a crypt j in (1). Throughout this paper, we use “ ′ ” for the indices of covariate observations to distinguish them from the indices of response observations.
Since crypts are randomly selected from the same rat to measure bcl-2 and DNA adduct, the two groups of crypts should be biologically similar. As pointed out in Morris et al. (2001), cells at the same positions of different crypts share common characteristics. Therefore, it is reasonable to assume that the rat-level latent process for DNA adduct is the same for the two groups of crypts. This suggests that we can estimate the latent covariate Xi(·) in model (1) from the nonparametric model (2). In fact, the latent covariate process can only be assumed at the rat level to link the observed surrogate covariate to the underlying covariate that corresponds to the response. This is because no covariate was observed in the same subunit (crypt) as a response observation; consequently, no crypt-level process can be obtained.
2.2 Method description
Our semiparametric approach can be implemented in two steps. In step 1, we nonparametrically estimate the latent covariate Xi for each subject i based on model (2). In step 2, we use the nonparametrically estimated Xi(tijk) in the primary model (1) to estimate β and γ.
For the nonparametric estimation of Xi(·), we use local linear smoothing (Fan and Gijbels, 1996) with the working independence correlation (Lin and Carroll, 2000), and estimate within each rat separately. Other smoothing procedures can also be used. We choose the working independence approach because we intend to show that even the simplest approach works in this setting. For the parametric estimation of the primary model, we use the generalized estimating equation (GEE) with a working covariance matrix. We introduce the semiparametric estimation for the case in which H is quadratic. The corresponding approach for the generalized linear mixed-effects model, together with its properties, has been developed in the dissertation of the first author. We focus on the more complex quadratic scenario here because it was the model used for analyzing the data.
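Step 1 can be sketched as follows. This is a minimal illustration of local linear smoothing under working independence, in which all covariate readings from one rat are pooled and smoothed as if they were independent. The Epanechnikov kernel, the function name `local_linear`, and the guard on the local design are assumptions made for the sketch; the text does not specify them.

```python
import numpy as np


def local_linear(t0, t, w, h):
    """Local linear estimate of E[W | position = t0] from pooled pairs (t, w), bandwidth h."""
    t0 = np.atleast_1d(np.asarray(t0, dtype=float))
    t = np.asarray(t, dtype=float)
    w = np.asarray(w, dtype=float)
    est = np.empty_like(t0)
    for m, x in enumerate(t0):
        u = (t - x) / h
        k = np.where(np.abs(u) <= 1, 0.75 * (1 - u**2), 0.0)  # Epanechnikov weights (assumed kernel)
        s0 = k.sum()
        s1 = (k * (t - x)).sum()
        s2 = (k * (t - x) ** 2).sum()
        denom = s0 * s2 - s1**2
        if denom <= 1e-12:  # illustrative guard against an empty or degenerate window
            raise ValueError("too few observations within the bandwidth window")
        # Intercept of the kernel-weighted least-squares line centered at x.
        est[m] = ((s2 * k - s1 * k * (t - x)) * w).sum() / denom
    return est


# Sanity check on noise-free linear data, which a local linear fit reproduces exactly.
grid = np.linspace(0.0, 1.0, 41)
fitted = local_linear([0.0, 0.5, 1.0], grid, 2.0 + 3.0 * grid, h=0.1)
```

In the setting of the paper, `t` and `w` would hold the relative cell positions and DNA adduct readings from all covariate crypts of a single rat, and `t0` the positions at which bcl-2 was measured in that rat.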
Let n be the total number of rats. $Y_i = (Y_{i,1}^T, \ldots, Y_{i,J_i}^T)^T$ is the vector of crypt-by-crypt bcl-2 observations in rat i, with $Y_{i,j}$ the vector of bcl-2 observations from crypt j of rat i, for i = 1, …, n and j = 1, …, Ji. $T_i = (T_{i,1}^T, \ldots, T_{i,J_i}^T)^T$ denotes the vector of cell positions for observing bcl-2. Xi(Ti) is the realization of Xi(·) at Ti.
For the mixed-effects quadratic model, the semiparametric estimator for β is,
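$$\hat\beta = \Big\{\sum_{i=1}^{n} \tilde D_i^T \hat\Sigma_i^{-1} \tilde D_i\Big\}^{-1} \sum_{i=1}^{n} \tilde D_i^T \hat\Sigma_i^{-1} Y_i, \qquad \tilde D_i = \{1, \tilde X_i(T_i), \tilde X_i^2(T_i)\}, \qquad (3)$$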
where X̃i(Ti) is the nonparametrically estimated Xi(Ti) and Σ̂i is the estimated covariance matrix in primary model (1).
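Step 2 can be illustrated with a short sketch. It assumes that the estimator takes the usual weighted least-squares form of a GEE for a quadratic mean, with design columns 1, X̃ and X̃²; this form and the function name `gee_quadratic` are assumptions made for illustration rather than the authors' code.

```python
import numpy as np


def gee_quadratic(y_list, x_list, sigma_list):
    """Solve sum_i D_i' S_i^{-1} (Y_i - D_i beta) = 0 with D_i = [1, x_i, x_i^2]."""
    lhs = np.zeros((3, 3))
    rhs = np.zeros(3)
    for y, x, s in zip(y_list, x_list, sigma_list):
        x = np.asarray(x, dtype=float)
        d = np.column_stack([np.ones_like(x), x, x**2])
        s_inv_d = np.linalg.solve(s, d)  # S_i^{-1} D_i, with S_i symmetric
        lhs += d.T @ s_inv_d             # D_i' S_i^{-1} D_i
        rhs += s_inv_d.T @ np.asarray(y, dtype=float)  # D_i' S_i^{-1} Y_i
    return np.linalg.solve(lhs, rhs)


# Noise-free check: the coefficients of y = 1 - 2x + x^2 are recovered.
rng = np.random.default_rng(1)
x_list = [rng.uniform(0.0, 5.0, size=6) for _ in range(4)]
sigma_list = [np.eye(6) + 0.5 * np.ones((6, 6)) for _ in range(4)]
y_list = [1.0 - 2.0 * x + x**2 for x in x_list]
beta_hat = gee_quadratic(y_list, x_list, sigma_list)
```

In the semiparametric method, `x_list` would contain the smoothed values X̃i(Ti) rather than the unobservable Xi(Ti), and `sigma_list` the estimated working covariance matrices.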
To account for the nested experimental design in the colon carcinogenesis study for cells within a crypt and crypts within a rat, we consider a dependent covariance structure (4) as the working covariance for the primary model,
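$$\Sigma_i = \sigma_r^2 J_{N_i} + \sigma_c^2\,\mathrm{diag}(J_{K_{i,1}}, \ldots, J_{K_{i,J_i}}) + \sigma_\varepsilon^2 I_{N_i}, \qquad (4)$$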
where $\sigma_r^2$ and $\sigma_c^2$ are the variance components for the rat- and crypt-level random effects, respectively, and $\sigma_\varepsilon^2$ is that for the random error. J is a square matrix with all entries equal to 1 and I is the identity matrix, each with the dimension given by its subscript. All indices here refer to the bcl-2 observations in rat i: Ni is the total number of bcl-2 observations, Ji is the number of crypts selected for bcl-2 observation, and Ki,j is the number of cells in crypt j.
When this assumed structure is true, it is known that the covariance parameters can be consistently estimated under the assumption of normal random effects even if the distributions of the random effects are not normal (see Verbeke and Lesaffre, 1997). Consequently, there is no need for distributional assumptions on these random effects. For estimation efficiency, we also assume the same variance parameters across all treatment groups as well as at different cell positions. Thus $\gamma = (\sigma_r^2, \sigma_c^2, \sigma_\varepsilon^2)^T$, together with (4), describes the covariance in the primary model. Σ̂i is calculated by replacing γ with γ̂.
Note that there are other possible choices of Σi that allow for correlation within crypts. For example, instead of the term INi in (4), we could use a Markov-structured matrix in which the correlation between two cells, on the logarithmic scale, is inversely proportional to the distance between the two cells. We checked and found that the correlations between adjacent cells within a crypt are low on average; it is thus reasonable to assume the covariance structure (4) for this study.
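The nested working covariance can be assembled directly from the crypt sizes. The sketch below assumes the structure described in the text: a rat-level component shared by all cells of the rat, a crypt-level component shared by cells of the same crypt, and an independent error component. The function name `working_covariance` is hypothetical, and its arguments are variances rather than standard deviations.

```python
import numpy as np


def working_covariance(cells_per_crypt, var_rat, var_crypt, var_error):
    """Covariance of all bcl-2 readings in one rat under the nested structure."""
    cells_per_crypt = np.asarray(cells_per_crypt, dtype=int)
    n_total = int(cells_per_crypt.sum())
    crypt_label = np.repeat(np.arange(len(cells_per_crypt)), cells_per_crypt)
    same_crypt = (crypt_label[:, None] == crypt_label[None, :]).astype(float)
    return (var_rat * np.ones((n_total, n_total))  # rat-level random effect
            + var_crypt * same_crypt               # crypt-level random effect
            + var_error * np.eye(n_total))         # independent error


# Toy rat with two crypts of 2 and 3 cells.
sigma = working_covariance([2, 3], var_rat=1.0, var_crypt=2.0, var_error=3.0)
```

Two cells in the same crypt then share covariance `var_rat + var_crypt`, and two cells in different crypts of the same rat share only `var_rat`.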
For the estimation of the variance components represented by γ, we focus on a simple regression-based method that uses the transformed response and covariates. The exact formulae are given in the Appendix. This is not a maximum likelihood method but it shares the same consistency property and performs numerically better for small sample sizes. An “equivalent” method was proposed in Henderson (1953) and was studied extensively by Fuller and Battese (1973) for nested designs. We choose this estimator for variance components not only because the role played by the latent covariate can be explicitly reflected, but also that the estimated parameters provide a summary of the variations at different levels even if the assumed variance structure is not exact.
2.3 Asymptotic properties of the semiparametric estimators
In this subsection, to simplify the notation, we assume that all rats have the same number of crypts J′ and the same number of cells K′ within each crypt for observing the covariate. Due to the structure of the colon, the number of cells within each crypt is limited, while the number of crypts within a colon is nearly unbounded. This is because, compared with the dimension of a crypt, the colon is of nearly infinite length. Therefore, in this subsection, we focus on the asymptotic scenario in which J′ goes to ∞ and K′ is bounded. We derive the asymptotic properties of the semiparametric estimator and identify the conditions under which the oracle property can be reached. Later in § 3.3, a bootstrap-based graphical tool will be used to evaluate the adequacy of the number of crypts J′.
For simplicity, we denote Xi(Ti) as Xi in the following. Recall that $\gamma = (\sigma_r^2, \sigma_c^2, \sigma_\varepsilon^2)^T$ is the vector of the variance components of the primary model (1), and let $\gamma_W = (\gamma_{W,i},\ i = 1, \ldots, n)$ be the variance parameters in the nonparametric model (2) for the subject-level latent covariate processes.
For the mixed-effects quadratic model, we obtain the following properties:
Proposition 1
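As n and J′ → ∞, h → 0 and J′K′h → ∞, $\sqrt{n}(\hat\beta - \beta - B_\beta) \to N(0, V_\beta)$ in distribution, with
$$B_\beta = C_h h^2 + C_{J'K'h}(J'K'h)^{-1} + C_{J'}(J')^{-1}, \qquad V_\beta = V_\beta^{0} + D_h h^2 + D_{J'K'h}(J'K'h)^{-1} + D_{J'}(J')^{-1},$$
where $V_\beta^{0}$ is the asymptotic variance of the GEE estimator when X is observed, and Ch, CJ′K′h, CJ′, Dh, DJ′K′h, DJ′ are functions of (X, γ, γW, β) and of order O(1).
When $nh^4 \to 0$, $n(J'K'h)^{-2} \to 0$, and $n(J')^{-2} \to 0$, β̂ is $\sqrt{n}$-consistent. The exact expressions of Ch, CJ′K′h, CJ′, Dh, DJ′K′h, and DJ′, as well as a sketch of the proof, are in the Appendix.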
Remarks
The variance of β̂* has two parts. The first term of Vβ is the asymptotic variance from the regular GEE if X were observed. The second term of Vβ represents the extra variation due to the nonparametric estimation of the latent covariate. As h → 0, J′ → ∞, and J′K′h → ∞, the second term diminishes and the asymptotic variance of β̂* reduces to the first term. That is, the estimator achieves the oracle property.
Both the bias Bβ and the second term of the asymptotic variance Vβ go to 0 as (J′K′h)−1 and h go to 0 and J′ goes to ∞. So, instead of requiring K′h → ∞, the bandwidth selection for this semiparametric estimation is governed by J′K′h → ∞. That is, we do not require the observations (over cells) within each subunit (crypt) for observing the covariate to be dense, but only that the observations pooled over the subunits within each subject (rat) be dense. This is because the latent covariate process is assumed at the subject level instead of the subunit level. Therefore, in addition to reflecting the fact that the same positional feature is shared across the subunits of a subject, a latent covariate process assumed at the subject level also allows feasible bandwidth selection for attaining the oracle property.
In the colon carcinogenesis example, DNA adduct was measured within each selected crypt over all the cells, with the number of cells ranging from 14 to 56 per crypt. With about 20 crypts selected to observe the DNA adduct, the total number of cells within each rat ranged from several hundred to one thousand. Thus, it is J′K′h rather than K′h that can be sufficiently large, which enables reasonable bandwidth selection.
Due to the colon structure, the number of cells (K′) within a crypt is limited while the number of crypts (J′) within a colon is nearly infinite. Therefore, the only within-subject sample size that could be large enough is that of the crypt. It also indicates that, for the application of this semiparametric method, the major assumption to check is that the sub-unit number within each subject is sufficiently large. We describe a graphical diagnostic tool to check this assumption in § 3.3.
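A rough calculation illustrates the difference between K′h and J′K′h. The values below are illustrative only: J′ = 20 and K′ = 40 are typical of the counts quoted for this study, and h = 0.05 is of the order of the bandwidth selected in § 3.2. The function name `window_counts` is hypothetical, and the counts ignore constants that depend on the kernel.

```python
def window_counts(n_crypts, cells_per_crypt, bandwidth):
    """Approximate number of covariate readings within a window of width `bandwidth`."""
    per_crypt = cells_per_crypt * bandwidth  # K'h: readings from a single crypt
    pooled = n_crypts * per_crypt            # J'K'h: readings pooled over the rat
    return per_crypt, pooled


per_crypt, pooled = window_counts(n_crypts=20, cells_per_crypt=40, bandwidth=0.05)
print(f"K'h = {per_crypt:.1f}, J'K'h = {pooled:.1f}")
```

A single crypt contributes only about two readings per window, far too few for smoothing, whereas pooling over the crypts of a rat yields about forty.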
In Proposition 1, the bias and variance of the semi-parametric estimators are determined by the variance parameters γ and γW from both models, the underlying true covariate, and the parameter β.
We assumed that the variance components γ of the primary model are the same for all treatment groups and do not depend on the cell positions. This assumption can be evaluated and further relaxed through a known parametric variance structure, as stated in § 2.2. In the nonparametric mixed-effects model for the latent covariate, we allow each rat to have its own variance parameters; in addition, the variance parameters can be functions of the cell position. Although in the proof, for simplicity of presentation, we consider the case in which $\gamma_{W,i} \equiv \gamma_W$ for all i, the consistency remains true for general scenarios, provided that either each $\gamma_{W,i}$ or their sum over the n rats divided by n is bounded.
For the estimation of the variance components in (4), we obtain the following consistency result.
Proposition 2
For the latent covariate Xi satisfying the conditions in Lemma 1, the estimators of the variance components $\sigma_r^2$, $\sigma_c^2$, and $\sigma_\varepsilon^2$ computed with the nonparametric estimate of Xi are consistent as h → 0, J′ → ∞ and J′K′h → ∞.
3. NUMERICAL STUDY
We study the numerical performance of the semiparametric estimation by a simulation study in § 3.1, then present the application to the colon carcinogenesis example in § 3.2. Though the results developed in § 2.3 regard the asymptotic performance in the case of J′ → ∞, we show graphically in § 3.3 that the moderate crypt numbers observed in the colon carcinogenesis example seem to be sufficiently large to achieve the oracle property; that is, the estimates and their variances remain roughly the same even when we artificially increase the crypt number.
3.1 Simulation results
Here, we investigate the finite sample performance of semiparametric estimators for a quadratic mixed-effects model with a latent covariate.
We simulate the data to mimic that of the colon carcinogenesis study. Thirty (n = 30) subjects are considered. For each subject, we generate K = 15 observations of response (Y) within each of the J = 20 crypts. Also, in the same subject, we generate the observed covariate (W) in J′ = 20 crypts at K′ = 40 positions. The positions for observing W are evenly spaced, and those for observing Y follow a uniform distribution, both in [0, 1].
The underlying covariate process is Xi(t) = 5 − 5 sin(3t · ri1) + ri2, with ri1 ~ unif[0.9, 1.1] and ri2 ~ N(0, 1). We generate the observed covariate Wi by model (2) with dij′ (·) ≡ dij′; dij′ and eij′k′ are independent with mean 0 and variances $\sigma_d^2$ and $\sigma_e^2$, respectively. In this simulation, we let σd = 0.3 and σe = 0.7. We generate the observed response Y by a quadratic mixed-effects model with β0 = 1, β1 = −2, β2 = 1, and the covariance structure (4) with variance components σr = 1, σc = 1, and σε = 3. The data generation and the semiparametric estimation are repeated 1000 times.
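The data-generating mechanism for one subject can be sketched as follows. Several choices here are assumptions rather than statements from the text: all random effects and errors are drawn from normal distributions (the text specifies only means and variances), the σ values are treated as standard deviations, and the function name `simulate_subject` is hypothetical.

```python
import numpy as np


def simulate_subject(rng, J=20, K=15, Jp=20, Kp=40, beta=(1.0, -2.0, 1.0),
                     sd_d=0.3, sd_e=0.7, sd_r=1.0, sd_c=1.0, sd_eps=3.0):
    """Simulate misaligned covariate (W) and response (Y) data for one subject."""
    r1 = rng.uniform(0.9, 1.1)
    r2 = rng.normal(0.0, 1.0)

    def latent(t):
        # Subject-level latent covariate process X_i(t).
        return 5.0 - 5.0 * np.sin(3.0 * t * r1) + r2

    # Covariate: Jp crypts observed at Kp evenly spaced positions, as in model (2).
    t_w = np.tile(np.linspace(0.0, 1.0, Kp), (Jp, 1))
    d = rng.normal(0.0, sd_d, size=(Jp, 1))  # crypt-level deviation (normality assumed)
    w = latent(t_w) + d + rng.normal(0.0, sd_e, size=(Jp, Kp))

    # Response: J different crypts observed at K uniformly distributed positions.
    t_y = rng.uniform(0.0, 1.0, size=(J, K))
    x = latent(t_y)
    mean = beta[0] + beta[1] * x + beta[2] * x**2
    y = (mean
         + rng.normal(0.0, sd_r)                   # rat-level random effect
         + rng.normal(0.0, sd_c, size=(J, 1))      # crypt-level random effect
         + rng.normal(0.0, sd_eps, size=(J, K)))   # cell-level error
    return {"t_w": t_w, "w": w, "t_y": t_y, "y": y, "x_true": x}


rng = np.random.default_rng(0)
subject = simulate_subject(rng)
```

Calling the function n = 30 times with the same generator gives one simulated data set; the array `x_true` is kept only so that the estimator based on the true covariate can be computed for comparison.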
We estimate the above mixed-effects model with latent covariate by three methods: (1) GEE with the true covariate values (True); (2) the semiparametric method (Semip), for which, although the optimal bandwidth for nonparametric smoothing can be obtained through “leave-one-subject-out” cross validation, we present the estimates at three different bandwidths around the optimum to show the influence of the bandwidth; and (3) the last observation carry-forward method (LOCF), which sets the covariate value at the position of the response, say tijk, to the observed covariate at the position immediately before the target position; that is, $W_{ij'k'}$ with $t_{ij'k'} \le t_{ijk} < t_{ij'(k'+1)}$. Since NN estimates are similar to LOCF estimates, only LOCF estimates are presented. Simulation results are in Table 2.
Table 2.
Simulation Results Based on 1000 Repetitions: Primary Mixed-Effects Quadratic Model with β0 = 1, β1 = −2, β2 = 1, and the Latent Covariate Process Xi(t) = 5 − 5 sin(3t · ri1) + ri2 with ri1 ~ unif[0.9, 1.1] and ri2 ~ N (0, 1); RB (%): Relative Bias, SD: Monte Carlo Standard Deviation, SE: Average of the Estimated Standard Error from the Asymptotic Distribution, CP: Monte Carlo Coverage Probability of the 95% Wald Confidence Interval
| Method | Statistic | β0 | β1 | β2 |
|---|---|---|---|---|
| True | RB | 1.024 | 0.056 | 0.020 |
| | SD | 0.192 | 0.047 | 0.009 |
| | SE | 0.194 | 0.046 | 0.009 |
| | CP | 94.8 | 94.7 | 94.3 |
| Semip, h = 0.03 | RB | −0.521 | −2.421 | −1.406 |
| | SD | 0.199 | 0.073 | 0.015 |
| | SE | 0.193 | 0.065 | 0.013 |
| | CP | 95.7 | 91.1 | 92.8 |
| Semip, h = 0.06 | RB | 2.071 | −0.208 | −0.238 |
| | SD | 0.199 | 0.073 | 0.015 |
| | SE | 0.195 | 0.065 | 0.013 |
| | CP | 95.5 | 94.4 | 95.1 |
| Semip, h = 0.09 | RB | 5.217 | 1.920 | 0.483 |
| | SD | 0.200 | 0.074 | 0.015 |
| | SE | 0.196 | 0.066 | 0.013 |
| | CP | 95.0 | 93.6 | 93.7 |
| LOCF | RB | −20.948 | 60.280 | −37.085 |
| | SD | 0.241 | 0.121 | 0.028 |
In Table 2, we observe that the LOCF estimates are biased toward zero, due to the attenuation effect (Carroll, Ruppert, and Stefanski, 1995). For the semiparametric estimators, the performance depends on the bandwidth h. At h = 0.06, the semiparametric estimates show negligible biases, and the coverage probabilities of the 95% Wald confidence intervals are very close to the nominal level. In addition, the estimated standard errors based on the asymptotic normality in Proposition 1 are close to the Monte Carlo standard deviations. The semiparametric estimates have larger variation than the estimates using the true covariate values, which are unattainable in practice; this extra variation originates from the nonparametric estimation of the latent covariate and would diminish as the within-subject sample size J′ increases. Overall, the simulation results indicate how closely the proposed semiparametric estimators can approach the oracle property.
For the variance components in primary model (1), the semiparametric estimates from (A.3) to (A.5) at h = 0.06 are σ̃r = 0.999, σ̃c = 1.022, and σ̃ε = 3.003, with corresponding Monte Carlo standard deviations of 0.064, 0.03, and 0.019, respectively.
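The summary statistics reported in Table 2 can be computed from the Monte Carlo replicates as in the sketch below. It assumes the usual definitions: relative bias as 100 × (mean estimate − true value) / true value, and Wald intervals of the form estimate ± 1.96 × standard error. The function name `summarize` and the toy inputs are hypothetical.

```python
import numpy as np


def summarize(estimates, std_errors, truth):
    """Relative bias (%), Monte Carlo SD, mean estimated SE, and 95% Wald coverage (%)."""
    est = np.asarray(estimates, dtype=float)
    se = np.asarray(std_errors, dtype=float)
    covered = (est - 1.96 * se <= truth) & (truth <= est + 1.96 * se)
    return {
        "RB": 100.0 * (est.mean() - truth) / truth,
        "SD": est.std(ddof=1),
        "SE": se.mean(),
        "CP": 100.0 * covered.mean(),
    }


# Toy example with four replicates of a parameter whose true value is 1.
summary = summarize([0.9, 1.1, 1.0, 1.2], [0.1, 0.1, 0.1, 0.1], truth=1.0)
```

Agreement between the SD and SE entries indicates that the asymptotic standard errors describe the finite-sample variability well, which is the comparison made in the discussion of Table 2.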
3.2 Analysis of colon carcinogenesis data
Here, we summarize the analysis of the colon carcinogenesis data. The goal of the study was to investigate whether bcl-2 gene expression increases with the DNA adduct level, and whether the trend varies with diet. Recall that the response, bcl-2, and the covariate, the DNA adduct, were not observed from the same crypts within a rat. We assume the rat-level latent covariate process for the DNA adduct level as Xi(·).
Several features of this study should be noted. First, as discussed earlier, the relative crypt depth of a cell represents its physiologic function. If we divide the crypt into three sections (the bottom 1/3, the middle 1/3, and the top 1/3), these three sections roughly contain the stem cells, the proliferating cells, and the differentiated cells, respectively. We accordingly carry out the analysis for each section separately. Second, in the analysis of each section, we use the DNA adduct level “centered” around the section mean of each rat. The reason for centering is as follows: the rat-to-rat variation is fairly large in the DNA adduct measurements, but the range of bcl-2 is about the same for almost all rats. If we perform the regression analysis rat by rat, we can see that the regression pattern is roughly shared by the rats within the same treatment group. Through “centering” the DNA adduct, we can easily summarize the within-rat bcl-2 versus DNA adduct relationship over all rats in the same treatment group. On the other hand, a regression that uses the uncentered DNA adduct essentially models the trend between the rat-level averages of bcl-2 and the rat-level averages of the DNA adduct, which is not of interest in this study.
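The centering step amounts to subtracting each rat's section mean from its DNA adduct values. A minimal sketch follows, with a hypothetical function name and toy numbers that are not taken from the study data.

```python
import numpy as np


def center_within_rat(rat_ids, adduct):
    """Subtract each rat's mean adduct level (within the section analyzed) from its values."""
    rat_ids = np.asarray(rat_ids)
    adduct = np.asarray(adduct, dtype=float)
    centered = np.empty_like(adduct)
    for rat in np.unique(rat_ids):
        mask = rat_ids == rat
        centered[mask] = adduct[mask] - adduct[mask].mean()
    return centered


# Two rats with very different adduct levels but the same within-rat spread.
centered = center_within_rat([1, 1, 2, 2], [10.0, 14.0, 100.0, 104.0])
```

After centering, the two rats in the toy example have identical covariate values, so a regression on the centered covariate describes the within-rat relationship rather than the differences between rat-level averages.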
Analysis is performed in all the three sections of the crypt. The results in Hong et al. (2000) indicated that the top section is where the proportions of apoptosis differ between fish oil enhanced and corn oil enhanced diets in the later stages of carcinogenesis. Therefore, we only report the results for the top section here. The results for the other two sections are either non-significant or similar to the findings for the top section.
For the semiparametric method, bandwidth selection is carried out by the “leave-one-subject-out” cross validation (Rice and Silverman, 1991). This method has been successfully applied by Hu, Wang, and Carroll (2004) and several other authors in semiparametric modeling of correlated data. For this study, the selected bandwidth is about h = 0.05. The values of the generalized cross validation function change little over bandwidths around 0.05, and the semiparametric estimates are very close with bandwidths in that neighborhood.
We report the results from the linear mixed-effects model. Although a quadratic mixed-effects regression was conducted, we found that for all but one treatment group the quadratic trend was not significant (with p-values > 0.2). More importantly, a linear model allows easier interpretation of the diet effect on the bcl-2 versus DNA adduct relationship. Therefore, we use the linear mixed-effects model as the primary model, and the properties in § 2.3 also apply here.
In Table 3, we report the estimated intercepts and slopes and their standard errors, as well as the p-values for the contrast between the two diets at each of the five time points. Both the standard errors and the p-values are obtained by the method of bootstrap, where the observed covariate is bootstrapped from model (2) with the underlying latent covariate as the rat-level smoothing estimate, and the response is bootstrapped from the primary model (1) with the semi-parametrically estimated parameters. For comparison, we report the estimated slopes from the LOCF method. We can see that these estimated slopes shrink towards zero, as is the case in the simulation study.
Table 3.
Estimates of the Linear Mixed-Effects Model of bcl-2 Versus DNA Adduct: SE Is the Standard Error and the p-Value Is for the Comparison Between the Two Diets at Each Time Point
| Time | Diet | Semip intercept (SE) | Semip slope (SE) | Comparison p-value: intercept | Comparison p-value: slope | LOCF slope |
|---|---|---|---|---|---|---|
| 0 | fish | 33.54 (2.79) | 2.32 (0.53) | 0.38 | < 0.01 | 0.05 |
| | corn | 37.08 (2.94) | 0.83 (0.43) | | | −0.08 |
| 3 | fish | 25.13 (2.79) | −0.79 (0.28) | 0.04 | < 0.01 | −0.09 |
| | corn | 33.08 (2.80) | 0.35 (0.28) | | | 0.05 |
| 6 | fish | 25.57 (2.81) | 0.18 (0.25) | 0.43 | < 0.01 | 0.08 |
| | corn | 28.51 (2.81) | 2.25 (0.37) | | | 0.12 |
| 9 | fish | 19.48 (2.88) | −1.28 (0.27) | 0.54 | < 0.01 | −0.02 |
| | corn | 22.38 (2.89) | 0.92 (0.36) | | | −0.02 |
| 12 | fish | 24.99 (2.72) | −0.52 (0.30) | 0.72 | < 0.01 | 0.07 |
| | corn | 26.42 (3.04) | 0.42 (0.27) | | | 0.09 |
From Table 3, we can see that, in the bcl-2 versus DNA adduct relationship, the fish oil fed rats have significantly smaller slopes than the corn oil fed rats during the initial stage of colon carcinogenesis, except at time 0. More specifically, as cell DNA damage increases, the bcl-2 gene expression level always increases in the corn oil fed rats, while in the fish oil fed rats it either decreases (at times 3, 9, and 12) or remains relatively stable (at time 6).
As we know, the main function of bcl-2 is to suppress apoptosis activity. It can consequently prevent premature cell death when the DNA damage is within a normal range, but leads to less active self-termination of the cancer-prone cells where the DNA damage is genetically irreparable. Our findings in Table 3 suggest that during the first 12 hours post carcinogen exposure, the fish oil diet, compared with the corn oil diet, suppresses the increment in the gene expression level of bcl-2 when the DNA damage increases. Therefore, the fish oil supplemented diet is more advantageous in promoting apoptosis and potentially reducing the risk of colon cancer. On the other hand, since there is no cell DNA damage caused by the carcinogen at time 0, the positive slope of the fish oil diet at this time suggests that the fish oil diet is also good at preventing premature cell death in case of no abnormal cell damage.
The variance component estimates of the primary linear mixed-effects model are the rat-level variation σ̂r = 4.89, the crypt-level variation σ̂c = 1.80, and the cell-level error σ̂ε = 12.48. This shows that the crypt-level variation is relatively small compared with the other two sources of variation.
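To make the comparison of the three sources concrete, the following short calculation converts the reported estimates into shares of the total variance. It assumes that the reported values are standard deviations on the scale of the response, which the text does not state explicitly; the variable names are illustrative.

```python
# Reported estimates, assumed here to be standard deviations.
sd = {"rat": 4.89, "crypt": 1.80, "cell": 12.48}

# Square to obtain variances, then express each as a share of the total.
var = {level: s**2 for level, s in sd.items()}
total = sum(var.values())
share = {level: v / total for level, v in var.items()}

for level, p in share.items():
    print(f"{level:>5}: {100 * p:.1f}% of total variance")
```

Under this assumption, the crypt level accounts for roughly 2% of the total variance, against about 13% for the rat level and 85% for the cell-level error.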
Figure 1 illustrates the fitted lines from semiparametric regression at bandwidth h = 0.05 in the top section; the Y axis indicates the observed bcl-2 gene expression and the X axis indicates the centered nonparametrically estimated cell DNA adduct level. Because the data points within each time group are extremely dense, we present the data in the following way. That is, for each time group, we divide the range of the centered DNA adduct values into 25 segments of equal length. Then for the centered DNA adduct values within each segment, we produce the box plot of the corresponding bcl-2 observations.
Figure 1.
Fitted Regression Lines for bcl-2 vs. the Centered Smooth Estimates of DNA Adduct. Observations Are from the Top 1/3 Section at 0, 3, 6, 9, 12 Hours Post Carcinogen Exposure; Box Plots Are Produced for 25 Equal-Distance Regions for Corn Oil (Red) and Fish Oil (Blue) Observations, Respectively. The Plot in the Right-Lower Corner Provides Regression Slopes for All Groups; a Color of Dark Red (Blue) Indicates that the Slope Significantly Differs from 0 (at Significance Level α = 0.05).
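The binned display used for Figure 1 can be sketched as below: the range of the centered covariate is split into 25 segments of equal length, and box plot summaries of the response are computed within each segment. The function name `binned_box_stats` and the returned fields are hypothetical, chosen only for this illustration.

```python
import numpy as np

def binned_box_stats(x, y, n_bins=25):
    """Quartile summaries of y within equal-length segments of the range of x."""
    edges = np.linspace(x.min(), x.max(), n_bins + 1)
    # Using only the interior edges yields bin indices 0, ..., n_bins - 1,
    # with the maximum of x falling in the last bin.
    idx = np.digitize(x, edges[1:-1])
    stats = []
    for b in range(n_bins):
        yb = y[idx == b]
        if yb.size == 0:
            continue  # skip empty segments
        q1, med, q3 = np.percentile(yb, [25, 50, 75])
        stats.append({"left": edges[b], "right": edges[b + 1],
                      "n": int(yb.size), "q1": q1, "median": med, "q3": q3})
    return stats
```

Each returned entry corresponds to one box; drawing the boxes at the segment midpoints and overlaying the fitted line gives a display of the kind described in the caption.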
3.3 A graphical assumption-checking tool
The key point of this work is that for a hierarchical mixed-effects model with subject-level process for latent covariate, a simple semiparametric estimation can replace the complicated EM computation. This is often the case when the number of total observations per subject is sufficiently large. Coupling the proposed semiparametric estimation with a graphical assumption-checking tool can be very useful in practice.
In Proposition 1, the asymptotics for the semiparametric estimators are built upon the condition that the crypt number for covariate observation within each rat is big, i.e., J′ → ∞. That is, we expect both the estimates and the corresponding variation to stabilize when J′ is sufficiently large. Next, we investigate how the crypt number J′ affects the semiparametric estimates, and roughly check whether the crypt number is large enough in the colon carcinogenesis example. We do this by the following bootstrap procedure, where the first two steps are to estimate the crypt effects and the random errors in model (2).
1. Within each subject, subtract the nonparametric estimate X̃i(t) from the covariate values observed at t, and obtain the best linear unbiased predictor (BLUP) of the crypt-level random effects dij′(·). Here, we focus on the scenario in which the crypt effect does not change with the cell position; that is, dij′(·) ≡ dij′. We then construct the kernel estimate of the crypt effect density fd.
2. Denote the corresponding residual process at crypt j′ by rij′(·).
3. Let J* denote the crypt size under consideration, that is, the number of crypts within each rat for covariate observation. Sample independent crypt-level random effects d<b>iℓ, ℓ = 1, …, J*, from the estimated crypt effect density fd. This can be achieved by letting d<b>iℓ = d*iℓ + hd εiℓ, where d*iℓ is sampled with replacement from the original set of BLUPs of the crypt-level random effects; hd is the bandwidth; K(·) is the kernel density function in the estimation of fd; and εiℓ is a randomly generated number from K(·).
4. Create bootstrapped surrogate covariates by letting W<b>iℓ(t) = X̃i(t) + d<b>iℓ + r<b>iℓ(t), where r<b>iℓ(·) is sampled with replacement from the original residual processes {rij′(·)}. Although step 1 did not take the cell position into account in the estimation of the crypt effect, the effect of cell position is absorbed into the residuals and contributes to the generation of the bootstrapped W<b> at this step.
5. Create parametrically the corresponding bootstrapped responses Y<b>i from the primary model with the estimated parameter β̂ and X̃i(·).
6. Obtain the semiparametric estimate β̂<b> from Y<b>i and {W<b>iℓ, ℓ = 1, …, J*}, i = 1, …, n.
The above procedure (step 3 to step 6) is repeated for b = 1, …, B.
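Steps 3 to 6 of the procedure above can be sketched in code as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: the kernel K is taken to be Gaussian, the BLUPs and residual processes are pooled over rats, covariates and responses are placed on a common grid of cell positions, the primary model is linear, and ordinary least squares stands in for the mixed-effects fit in step 6. All function and argument names are hypothetical.

```python
import numpy as np

def local_linear(t0, t, y, h):
    """Local linear smoother with a Gaussian kernel, evaluated at t0."""
    out = np.empty(len(t0))
    for m, s in enumerate(t0):
        w = np.exp(-0.5 * ((t - s) / h) ** 2)
        Z = np.column_stack([np.ones_like(t), t - s])
        coef = np.linalg.solve(Z.T @ (w[:, None] * Z), Z.T @ (w * y))
        out[m] = coef[0]
    return out

def bootstrap_beta(x_tilde, d_blup, resid, t, beta_hat, sigmas,
                   J_star, n_y_crypts, h, h_d, B, rng):
    """Repeat steps 3-6 B times and return the (B, 2) bootstrap estimates.

    x_tilde : (n, K) rat-level smooth estimates on the common grid t
    d_blup  : 1-D array of crypt-effect BLUPs (pooled over rats)
    resid   : (R, K) residual processes (pooled over rats)
    sigmas  : (rat, crypt, cell) standard deviations of the primary model
    """
    n, K = x_tilde.shape
    s_r, s_c, s_e = sigmas
    betas = np.empty((B, 2))
    for b in range(B):
        x_star = np.empty((n, K))
        y_star = np.empty((n, n_y_crypts, K))
        for i in range(n):
            # Step 3: smoothed bootstrap draw from the kernel estimate of f_d.
            d_star = rng.choice(d_blup, J_star) + h_d * rng.standard_normal(J_star)
            # Step 4: surrogate = smooth curve + crypt effect + resampled residual.
            r_star = resid[rng.integers(0, resid.shape[0], J_star)]
            w_star = x_tilde[i] + d_star[:, None] + r_star
            # Step 6 (first part): re-smooth, pooling the J_star crypts of rat i.
            x_star[i] = local_linear(t, np.tile(t, J_star), w_star.ravel(), h)
            # Step 5: parametric responses from the linear primary model.
            mean = beta_hat[0] + beta_hat[1] * x_tilde[i]
            y_star[i] = (mean + s_r * rng.standard_normal()
                         + s_c * rng.standard_normal((n_y_crypts, 1))
                         + s_e * rng.standard_normal((n_y_crypts, K)))
        # Step 6 (second part): regress responses on the re-estimated covariate.
        x_rep = np.repeat(x_star[:, None, :], n_y_crypts, axis=1).ravel()
        Z = np.column_stack([np.ones_like(x_rep), x_rep])
        betas[b] = np.linalg.lstsq(Z, y_star.ravel(), rcond=None)[0]
    return betas
```

Calling `bootstrap_beta` over a range of `J_star` values and plotting the bootstrap mean and percentile limits of each coefficient against `J_star` gives a diagnostic display of the kind the procedure is designed to produce.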
Based on the semiparametric estimates in Table 3, we obtain a sequence of the bootstrap estimates {β̂<b>, b = 1, …, B} for each desired crypt size J*, with J* ranging from 3 to 50. We keep all other setups the same as the original data and let B = 1000. Figure 2 presents the bootstrap estimates for the intercepts and the slopes under both diets at 9 hours post carcinogen exposure. For all plots, the X-axis indicates the crypt sizes J*. In the top two panels, the Y-axis corresponds to the estimated intercepts, and in the bottom two are the estimated slopes. The fish oil diet plots are on the left, and the corn oil diet plots are on the right.
Figure 2.
Diagnostic Plots of the Semiparametrically Estimated Parameters Versus the Crypt Sizes at 9 Hours Post Carcinogen Exposure. Left Panel Plots: Fish Oil; Right Panel Plots: Corn Oil; Top Panel Plots: Intercepts; Bottom Panel Plots: Slopes. The Center, Top and Bottom Curves in Each Plot Correspond to the Bootstrap Averages and the 95% Confidence Limits, Respectively.
In Figure 2, it appears that the estimated fish oil diet slope is the most sensitive to the crypt size among the four estimates, whose bootstrap estimated standard errors decrease about 8.2% from J* = 5 to J* = 50. At crypt size J* = 24, which is the “typical” crypt number per rat for DNA adduct observation in this example, the corresponding estimates and standard errors are within 1.01% of those obtained at J* = 50, respectively, and are apparently close to the asymptotic limits. This implies that, in this example, the semiparametric estimates are close to the asymptotic results. While the bias and variation of the semiparametric estimates could be further reduced by having a larger number of crypts, the improvement would be very limited.
4. CONCLUDING REMARKS
We propose a semiparametric approach for estimating a hierarchical mixed-effects model with an unobservable latent functional covariate. Compared with a parametric approach such as maximum likelihood, the semiparametric approach is computationally easier to implement for applications with hierarchical structure; meanwhile, it avoids the problem of model misspecification for the latent covariate. When the “effective” sample size is large enough, the semiparametric estimator attains the oracle property, which is the best a maximum likelihood estimator can do.
In the colon carcinogenesis study, the interest lies in the relationship between bcl-2 gene expression level and DNA adduct, whereas the latent covariate functional (the positional feature of DNA adduct) is not of direct interest. This research objective makes the semiparametric approach, with parametric modeling for the primary relationship and nonparametric modeling for the positional feature, appropriate. It obviates model specification for the positional feature of DNA adduct, yet leads to consistent estimation of the primary model.
Due to the long-standing belief that within a rat, cells at the same position of different crypts share the same biological characteristics, we propose estimating the positional feature of DNA adduct at the rat level. This is done by nonparametric estimation of the positional feature over all the adduct observation crypts within each rat. The advantage of this nonparametric estimation scheme is twofold. First, it makes it possible to relate the response bcl-2 to the covariate DNA adduct, which were observed over two different groups of colon crypts in each rat. Second, the basic observation units (the cells) were of fixed number within each crypt, while the subunits (the crypts) were randomly selected from the rat and technically of infinite number. If we pool the crypts within each rat to estimate the rat-level DNA adduct positional feature, the total number of observation cells can be sufficiently large and the relative cell positions are densely distributed in [0, 1], which allows for a reasonable bandwidth selection in the nonparametric estimation. Based on this pooling-over-subunit nonparametric estimation scheme, the consistency of the estimators for both the primary parameters and the latent covariate functional requires only the number of crypts to be sufficiently large. We also propose a bootstrap-based diagnostic tool to check the sufficiency of this “effective” sample size, namely the number of crypts within each rat.
Though the method developed here is motivated by the colon carcinogenesis study, it has potentially wider applications. In biological studies, it is common that two measurements of interest cannot be taken from the same subunit of a subject due to the destructive nature of the experiment. It is even more common that true covariates are not directly observable but can only be postulated as coefficients or functions of another regression model (see Li, Zhang, and Davidian, 2004, for a parametric example). However, it is not always feasible to identify a parametric model for the covariate; in such cases, the semiparametric approach offers more flexibility.
Acknowledgments
This research was supported by a grant, CA74552, from the US National Cancer Institute. It was also partially supported by the Texas A&M Foundation LINK program.
The authors thank Drs. Hong and Lupton for kindly providing the dataset.
APPENDIX
1 Lemma and definitions for Proposition 1
We assume a common marginal distribution of the cell positions for the covariate observation cells, i.e., density fT(t) for cell position t. Similar to Ti, denotes the vector of cell positions for covariate observation. Based on the mixed-effects nonparametric model (2), the local linear smoothing estimator Xi(·) at t has the following expression, with the subscript i suppressed.
Lemma 1
For X(·) ∈ C²(0, 1) and a kernel density function K satisfying ∫ sK(s)ds = 0, ∫ s²K(s)ds = 1,
| (A.1) |
where Kh(v) = h−1K(v/h) and h is the smoothing bandwidth. W2(t) = K′fT(t) and , where X(2)(·) and denote the second derivatives of X(·)and fT(·), respectively, and ηj′k′ = dj′ + ej′k′.
The above expression (A.1) can be derived similarly as done in Lin and Carroll (2000).
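For reference, the local linear estimator whose expansion is discussed here has the standard weighted least squares form (see, e.g., Fan and Gijbels, 1996). The display below is a sketch of that standard definition for one rat, pooling its J′ crypts with K′ cells each; the notation W_{j′k′} for the observed surrogate covariate at position T_{j′k′} and the symbols a₀, a₁ are assumptions made for this illustration.

```latex
\[
(\hat a_0, \hat a_1)
  = \arg\min_{a_0,\,a_1}
    \sum_{j'=1}^{J'} \sum_{k'=1}^{K'}
    K_h(T_{j'k'} - t)\,
    \bigl\{ W_{j'k'} - a_0 - a_1 (T_{j'k'} - t) \bigr\}^2 ,
\qquad
\tilde X(t) = \hat a_0 ,
\]
\[
K_h(v) = h^{-1} K(v/h).
\]
```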
The nonparametric estimate X̃i(Ti) is the realization of (A.1) at the response observation positions Ti. Thus the second term of (A.1) at t = Ti is the random error evaluated at Ti, and the third term is the bias (h²/2)ζ(Ti).
The covariance of is
In , the l-th diagonal entry corresponding to position t is , and the (l, m)-th off-diagonal entry corresponding to paired position (tl, tm) is ; In , the diagonal entry is and the off-diagonal entry is . In the following, denote as the vector of diagonal entries of and as the vector of diagonal entries of .
With the notations above, in Proposition 1, Ch = C1β, CJ′K′h = C2,1β, CJ′ = C2,2β, , where
for s = 1,2, , and diag(Xi) as the Ni × Ni diagonal matrix of Xi.
2 Proof of Proposition 1
Following (A.1) and suppressing subject index i,
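Denote X̃i as the nonparametric estimate of Xi, and write the random error term of (A.1) evaluated at Ti as a vector. Since this error vector has mean zero, each of its entries is Op{(J′K′h)^{−1/2}} + Op{(J′)^{−1/2}}, and each entry of its covariance matrix is O{(J′K′h)⁻¹} + O{(J′)⁻¹}.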
For the semiparametric estimator β̂* in (3), , where
Denote
then, is the matrix with entry , for r, s = 1, 2, 3.
| (A.2) |
in probability, as J′ → ∞, h → 0 and J′K′h → ∞. Consequently, A1 → V0 in probability.
Write
Note that A21 and A22 correspond to the mean and variance terms for the quadratic regression if X were observed. The bias of β̂* originates from A23 and A25; the extra variance comes from A24 and A25 and their covariance with A22. Therefore, the bias terms are as follows,
The additional variance is as follows,
where,
with
Thus, cov(A25) = n⁻¹{(J′K′h)⁻¹ · C3,1 + (J′)⁻¹ · C3,2}.
Since cov(A22, A24) = n⁻¹h² · C1 and cov(A24) = O(h⁴), Proposition 1 follows from the above derivations.
3 Proof of Proposition 2
Fuller and Battese (1973) gave the variance component estimators for the nested design. When the true covariate is known, the estimators of the variance components have the following expressions:
| (A.3) |
| (A.4) |
| (A.5) |
Denote X as the design matrix with the true covariates. Hb, Ha1, and Ha2 are the hat matrices in the estimation of variance components. Here the notations are the same as in Fuller and Battese (1973).
The semiparametric variance component estimators σ̃r², σ̃c², and σ̃ε² have the same expressions as in (A.5), (A.4), and (A.3), except that X is replaced by X̃, the design matrix of the nonparametrically estimated covariates. To study these semiparametric variance component estimators, we need only focus on the effects of the nonparametric estimation of the covariates, which are contained in the terms Qx = X̃ᵀX̃ and Hx = X̃(X̃ᵀX̃)⁻¹X̃ᵀ, where Qx appears in the hat matrices Hb, Ha1, and Ha2, and Hx appears in the estimated sums of squared errors τ̂ᵀτ̂, ûᵀû, and v̂ᵀv̂.
It is easy to see that n⁻¹Qx → n⁻¹XᵀX in probability as J′ → ∞, h → 0 and J′K′h → ∞. Similarly, n⁻¹Hx → n⁻¹X(XᵀX)⁻¹Xᵀ in probability. Thus, the semiparametric variance component estimators (σ̃r², σ̃c², σ̃ε²) converge to (σ̂r², σ̂c², σ̂ε²) in probability. Since (σ̂r², σ̂c², σ̂ε²) is an unbiased estimator of (σr², σc², σε²), the semiparametric variance component estimators are consistent.
Contributor Information
Zonghui Hu, Email: huzo@niaid.nih.gov, NIAID, National Institutes of Health, Bethesda, MD 20817, USA.
Naisyin Wang, Email: nwang@stat.tamu.edu, Department of Statistics, Texas A&M University, College Station, TX 77843, USA.
References
- Carroll RJ. Discussion of two important missing data issues. Statistica Sinica. 2004;14:627–629.
- Carroll RJ, Lin X. Nonparametric function estimation for clustered data when the predictor is measured without/with error. Journal of the American Statistical Association. 2000;95:520–535.
- Carroll RJ, Ruppert D, Stefanski LA. Measurement Error in Nonlinear Models. Chapman & Hall; London: 1995.
- Carroll RJ, Wand MP. Semiparametric estimation in logistic measurement error models. Journal of the Royal Statistical Society B. 1991;53:573–585.
- Dempster AP, Laird NM, Rubin DB. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society B. 1977;39:1–38.
- Fan J, Gijbels I. Local Polynomial Modeling and its Applications. Chapman & Hall; London: 1996.
- Fuller WA, Battese GE. Transformations of estimation of linear models with nested-error structure. Journal of the American Statistical Association. 1973;68:626–632.
- Heemels MT, Dhand R, Allen L. Apoptosis. Nature. 2000;407:769–769.
- Henderson CR. Estimation of variance and covariance components. Biometrics. 1953;9:226–252.
- Hong MY, Lupton JR, Morris JS, Wang N, Carroll RJ, Davidson LA, Elder RH, Chapkin RS. Dietary fish oil reduces O6-methylguanine DNA adduct levels in rat colon in part by increasing apoptosis during tumor initiation. Cancer Epidemiology Biomarkers & Prevention. 2000;9:819–826.
- Hu Z, Wang N, Carroll RJ. Profile-kernel versus backfitting in the partially linear models for longitudinal/clustered data. Biometrika. 2004;91:251–262.
- Huang X, Zhu Q. A pseudo-nearest-neighbor approach for missing data recovery on Gaussian random data sets. Pattern Recognition Letters. 2002;23:1613–1622.
- Li E, Zhang D, Davidian M. Conditional estimation for generalized linear models when covariates are subject-specific parameters in a mixed model for longitudinal measurements. Biometrics. 2004;60:1–7. doi: 10.1111/j.0006-341X.2004.00170.x.
- Lin X, Carroll RJ. Nonparametric function estimation for clustered data when the predictor is measured without/with error. Journal of the American Statistical Association. 2000;95:520–534.
- Morris JS, Wang N, Lupton JR, Chapkin RS, Turner ND, Hong MY, Carroll RJ. Parametric and nonparametric methods for understanding the relationship between carcinogen-induced DNA adduct levels in distal and proximal regions of the colon. Journal of the American Statistical Association. 2001;96:816–827. doi: 10.1007/978-1-4419-9019-8_7.
- Pepe MS, Fleming TR. A general nonparametric method for dealing with errors in missing or surrogate data. Journal of the American Statistical Association. 1991;86:108–113.
- Pielou EC. Segregation and symmetry in two species populations as studied by nearest neighbor methods. Journal of Ecology. 1961;49:255–269.
- Rice JA, Silverman BW. Estimating the mean and covariance structure nonparametrically when the data are curves. Journal of the Royal Statistical Society B. 1991;53:233–243.
- Tsiatis AA, Davidian M. A semiparametric estimator for the proportional hazards model with longitudinal covariates measured with error. Biometrika. 2001;88:447–458. doi: 10.1093/biostatistics/3.4.511.
- Verbeke G, Lesaffre E. The effect of misspecifying the random effects distribution in linear mixed effects models for longitudinal data. Computational Statistics and Data Analysis. 1997;23:541–556.


