Author manuscript; available in PMC: 2019 Jan 24.
Published in final edited form as: Environmetrics. 2018 Oct 3;29(8):e2536. doi: 10.1002/env.2536

Linear regression with left‐censored covariates and outcome using a pseudolikelihood approach

Michael P Jones 1
PMCID: PMC6344928  NIHMSID: NIHMS1001575  PMID: 30686916

Abstract

Environmental toxicology studies often involve sample values that fall below a laboratory procedure’s limit of quantification. Such left-censored data give rise to several problems for regression analyses. First, both covariates and outcome may be left censored. Second, the transformed toxicant levels may not be normal but mixtures of normals because of differences in personal characteristics, e.g. exposure history and demographic factors. Third, the outcome and covariates may be linear functions of left-censored variates, such as averages and differences. Fourth, some toxicant levels may be functions of other toxicant levels resulting in a recursive system. In this paper marginal and pseudo-likelihood based methods are proposed for estimation of the means and covariance matrix of variates found in these four settings. Next, linear regression methods are developed allowing outcomes and covariates to be linear combinations of left-censored measures. This is extended to a recursive system of modeling equations. Bootstrap standard errors and confidence intervals are used. Simulation studies demonstrate the proposed methods are accurate for a wide range of study designs and left-censoring probabilities. The proposed methods are illustrated through the analysis of an on-going community-based study of polychlorinated biphenyls, which motivated the proposed methodology.

Keywords: environmental exposure, limit of quantification, polychlorinated biphenyls, toxicology

1 |. INTRODUCTION

A common problem in environmental exposure assessment is the occurrence of contaminant levels that fall below a laboratory’s ability for accurate measurement. Values below this limit of quantification (LOQ) are known as left-censored observations. This problem occurs when a laboratory assay’s LOQ is high relative to typical field exposure, or when the sampling period is short (Hewett, 2006, p. 415). Thousands of chemicals are in use today, with more introduced every year, and laboratory assays of their levels are frequently not quantifiable at the lower levels. This is an important public health and policy issue since values below the LOQ may be toxic or bioaccumulate to toxic levels. Quantification limits are common with air, water, soil/sediment, food, blood, urine and tissue samples. A few examples include herbicides/pesticides/nitrates in the water supply (Holden et al., 1992; Jones et al. 2016); toxic metals (e.g. lead, chromium) in highway runoff (Shumway et al., 2002) and public water supply lines, as in Flint, MI; okadaic acid, arsenic and cadmium in food (EFSA, 2010; The Economist, 2017); endotoxin levels of environments with organic dust (Thorne et al., 2010); and polychlorinated biphenyls (PCBs) in air and blood samples (Korwel et al., 2005; Marek et al., 2013). The motivation for the proposed methods of this paper derives from an on-going community-based study assessing blood PCB levels and the need to easily handle left-censoring of the predictors as well as the outcome variable in linear regression models.

The distribution of environmental contaminants is typically right skewed and assumed to follow a lognormal distribution, for which industrial hygienists and environmental toxicologists historically have estimated summary statistics for a single variate. Although maximum likelihood estimation (MLE) was introduced early by Hald (1949) to handle left censoring due to measurements below the LOQ, the methods were computationally laborious and required special tables. This led to single imputation of left-censored values by the LOQ, LOQ/2 (Nehls & Akland, 1973) and LOQ/√2 (Hornung & Reed, 1990). Many researchers (e.g. Hewett & Ganser, 2007) have studied these imputation methods via simulation, showing that MLE is far more accurate. For regression problems, historical development focused first on the problem in which only the outcome variate is censored. Maximum likelihood regression methods for censored outcomes with completely observed covariates have a long successful history in both the statistical (Kalbfleisch & Prentice, 2002) and econometric literature (Tobin, 1958), and yield consistent, asymptotically normal estimates. The problem of longitudinal or spatial outcomes subject to left censoring is more complex but has been tackled by full likelihood (Gadda et al., 2000), Monte-Carlo EM (Hughes, 1999) and Bayesian methods (Fridley & Dixon, 2006).

Recently, there has been increased interest in regression problems having usually one, sometimes more, censored covariates, but assuming a completely observed outcome. Several approaches have been proposed. A simple regression approach is the complete-case analysis that omits subjects with any censored variates. When only the covariates are subject to censoring, the complete-case analysis will produce unbiased parameter estimates so long as the regression error term remains uncorrelated with the observed covariates. The key concern is the loss of efficiency as the sample size decreases, which can be substantial with heavy censoring. A second approach is single imputation, which includes the previously mentioned functions of LOQ and the expected value below the LOQ (Richardson & Ciampi, 2003). These tend to exhibit bias in the parameter estimate, its standard error or both (Nie et al., 2010). A third and more sophisticated approach is multiple imputation. For the outcome Y modeled as a linear function of a right-censored Z and uncensored U, Atem et al. (2017) sampled from the predictive distribution of Z given the values of U and Y, substituting in parameter estimates derived from a complete-case regression of Y on (Z, U) and a Cox proportional hazards regression model fit to Z given U. Their approach is appropriate for right-censored Z, but less so for our left-censored setting, in part because the proportional hazards family does not contain the lognormal and, as they noted, because of extrapolation beyond the range of uncensored values, which is especially problematic for heavily censored variates. The fourth approach is the EM algorithm. May et al. (2011) introduced approximate ML estimation based on a weighted Monte-Carlo EM algorithm in which the conditional Gaussian distribution of the censored variate(s) given the uncensored values, including the outcome, was repeatedly sampled using the Gibbs sampler and the adaptive rejection Metropolis algorithm. Unlike other authors, May et al. (2011) allowed more than one covariate to be censored.

On another note, there has also been recent interest in estimation of the covariance matrix for a large number of left-censored variates. In 2015, our paper (Jones et al.) and two others (Hoffman & Johnson, 2015; Pesonen et al., 2015) tackled this problem. Although the three methods differed somewhat in their approaches, each came to the conclusion that a pairwise likelihood approach to estimating the correlations was the key to finding a computationally efficient method for finding a consistent set of estimators. For the most part, all assumed multivariate normality of variates. However, this assumption is not realistic enough for many real world applications since the variates may only be normal within homogeneous strata of the population defined by both discrete and continuous variables (e.g. gender and age), thereby resulting in a complex mixture of normals.

An important but so far ignored area of research concerns environmental problems in which both the outcome variate and covariates are subject to left censoring, or more generally, are linear functions of censored variates, such as a difference or an average. Addressing this setting is the goal of this paper. In Section 2, we introduce methods for estimating the covariance matrix of left-censored variates that can be modeled in terms of a set of linear equations based on uncensored variates and normal error terms. This is then generalized to covariance matrices of linear combinations of left-censored variates. These estimators are used as building blocks in Section 3 to create multiple linear regression methods in which both the outcome and covariates can be linear functions of left-censored variates. This is further generalized to incorporate a recursive system of regression equations to accommodate censored covariates that are generated from other censored covariates. Section 4 studies in detail three designs one might find in environmental epidemiology, the last of which extends the framework of the recursive system of regressions. This section also reports on simulation studies of the performance of these methods. Section 5 provides an in-depth application of the proposed methods to the PCB study data that motivated our methodology. Section 6 concludes with some overall discussion.

2 |. ESTIMATION OF THE COVARIANCE MATRIX

Section 2.1 begins by postulating that the censored variates can be written as a set of linear functions of the uncensored covariates plus normal possibly correlated error terms. We propose an extension to the maximum pairwise pseudo-likelihood technique of Jones et al. (2015) for estimating the covariance matrix. Section 2.2 considers the special case of no uncensored covariates and provides a warning. Section 2.3 describes a straightforward approach to estimation of the covariance matrix for variates that are linear combinations of the censored and uncensored variates.

2.1 |. Estimation for uncensored and left-censored variates

Let $D_i^{\circ} = (U_{i1}, \ldots, U_{iq}, Z_{i1}^{\circ}, \ldots, Z_{ip}^{\circ})$ be the i-th unit's variates, of which the $Z^{\circ}$'s will be subject to left censoring, whereas the $U$'s are uncensored and so measured exactly. The special case q = 0 is allowed and will be discussed later. Let $L_i = (L_{i1}, \ldots, L_{ip})$ be the laboratory lower limits of quantification for measuring the $Z_i^{\circ}$ variates. (The lower limit may depend on the i-th unit's sample volume or other characteristics and hence is subscripted by i.) There are three basic assumptions. (A1) The vector of $Z^{\circ}$ variates is conditionally multivariate normal given the $U$ variates. It will be convenient for later use to write this out explicitly as

$$
\begin{aligned}
Z_{i1}^{\circ} &= \phi_{10} + \phi_{11} U_{i1} + \cdots + \phi_{1q} U_{iq} + \varepsilon_{i1} = \phi_{10} + \phi_1' U_i + \varepsilon_{i1} \\
&\;\;\vdots \\
Z_{ip}^{\circ} &= \phi_{p0} + \phi_{p1} U_{i1} + \cdots + \phi_{pq} U_{iq} + \varepsilon_{ip} = \phi_{p0} + \phi_p' U_i + \varepsilon_{ip}
\end{aligned}
\tag{1}
$$

where $(\varepsilon_{i1}, \ldots, \varepsilon_{ip}) \sim N_p(0, \Sigma_{\varepsilon})$ for n independent units. Both p and q are fixed. As for regression models in general, one should avoid models with a large number of variates relative to the sample size. No distributional assumptions are made about the $U$ variates. (A2) As usual in regression, we assume the $U$ variates are uncorrelated with the $\varepsilon$ error terms. (A3) $Z_{ij}^{\circ}$ is assumed conditionally independent of $L_{ij}$ given $U_i$. If necessary, we assume that the $Z^{\circ}$ variates have already been appropriately transformed so that the error structure is normal. For laboratory data this is often the log transform. Next, let $Z_{ij} = \max(Z_{ij}^{\circ}, L_{ij})$ and $\eta_{ij} = 1[Z_{ij}^{\circ} \ge L_{ij}]$ be the observed variates, so that $Z_{ij}$ is uncensored if $\eta_{ij} = 1$, but is left censored if $\eta_{ij} = 0$. As an example using the PCB study, $Z_{ij}$ is the log-transformed, possibly left-censored (below LOQ) measurement of PCB congener j in a blood sample from subject i.
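The construction of the observed pair (Z, η) from the latent values and the quantification limits can be sketched in a few lines of Python with NumPy. This is only an illustration: the function name `left_censor` and the numbers are hypothetical, and treating a value exactly equal to the limit as quantified is an assumption that is immaterial for continuous data.

```python
import numpy as np

def left_censor(z_true, loq):
    """Return (z_obs, eta) for latent values z_true and limits loq.

    z_obs = max(z_true, loq); eta = 1 if the value is quantified,
    0 if it is left censored at the limit.
    """
    z_true = np.asarray(z_true, dtype=float)
    loq = np.broadcast_to(np.asarray(loq, dtype=float), z_true.shape)
    eta = (z_true >= loq).astype(int)
    z_obs = np.maximum(z_true, loq)
    return z_obs, eta

# Hypothetical log-scale values: 3 units, 2 variates, one limit per variate.
z_true = np.array([[0.3, -1.2], [-0.8, 0.5], [1.1, -0.1]])
loq = np.array([0.0, -0.5])
z_obs, eta = left_censor(z_true, loq)
```

A censored entry is recorded at its limit, so the pair (z_obs, eta) carries exactly the information available to the analyst.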

The goal of this section is to estimate the marginal mean and marginal covariance matrix of $D^{\circ} = (U, Z^{\circ})$ from the observed data $D = \{(U_i, Z_i, \eta_i) : i = 1, \ldots, n\}$, where $U_i = (U_{i1}, \ldots, U_{iq})$, $Z_i = (Z_{i1}, \ldots, Z_{ip})$ and $\eta_i = (\eta_{i1}, \ldots, \eta_{ip})$. We propose a straightforward two-stage approach using marginal likelihood and pairwise pseudo-likelihood that is particularly appealing when p is reasonably large.

In Stage 1, the marginal means and variances of the $Z^{\circ}$'s are estimated. We begin by considering the j-th marginal model from (1), namely, $Z_{ij}^{\circ} = \phi_{j0} + \phi_j' U_i + \varepsilon_{ij}$ $(i = 1, \ldots, n)$ with $\varepsilon_{ij} \sim N(0, \sigma_{\varepsilon j}^2)$. The parameters $(\phi_{j0}, \phi_j, \sigma_{\varepsilon j}^2)$ are estimated by maximizing the j-th likelihood given $U_i$

$$
L_j(\phi_{j0}, \phi_j, \sigma_{\varepsilon j}^2) = \prod_{i=1}^{n} \{ f(z_{ij} - \phi_{j0} - \phi_j' U_i;\, 0, \sigma_{\varepsilon j}^2) \}^{\eta_{ij}} \{ F(z_{ij} - \phi_{j0} - \phi_j' U_i;\, 0, \sigma_{\varepsilon j}^2) \}^{1 - \eta_{ij}}
$$

where f and F are the normal density and distribution function; $(0, \sigma_{\varepsilon j}^2)$ refer to their mean and variance. The resultant maximum likelihood estimators $(\hat\phi_{j0}, \hat\phi_j, \hat\sigma_{\varepsilon j}^2)$ are consistent. Maximization is repeated for each $Z_j$ separately $(j = 1, \ldots, p)$ and easily performed by existing software, e.g. the survreg function in R (R Development Core Team, 2017). The marginal mean $EZ_j^{\circ} = E_U\{E(Z_j^{\circ} \mid U)\}$ can therefore be estimated by
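To make the Stage 1 likelihood concrete, the following Python sketch evaluates its logarithm for one variate. It is not the survreg implementation mentioned in the text; the function names, the simulated data and the common limit of 0.5 are all assumptions chosen for illustration, and in practice a general-purpose optimizer would be applied to this function.

```python
import math
import numpy as np

def norm_logpdf(x, sd):
    """Log density of N(0, sd^2) at x."""
    return -0.5 * math.log(2.0 * math.pi) - math.log(sd) - 0.5 * (x / sd) ** 2

def norm_logcdf(x, sd):
    """Log distribution function of N(0, sd^2) at x (floored to avoid log(0))."""
    return math.log(max(0.5 * math.erfc(-x / (sd * math.sqrt(2.0))), 1e-300))

def censored_loglik(phi0, phi, sigma, z_obs, eta, U):
    """Log of L_j: density terms where eta = 1, distribution-function terms where eta = 0."""
    resid = z_obs - phi0 - U @ phi
    return sum(
        norm_logpdf(r, sigma) if e == 1 else norm_logcdf(r, sigma)
        for r, e in zip(resid, eta)
    )

# Hypothetical simulated data: one uncensored covariate, common LOQ of 0.5.
rng = np.random.default_rng(0)
n = 500
U = rng.normal(size=(n, 1))
z_true = 1.0 + 0.5 * U[:, 0] + rng.normal(size=n)
loq = 0.5
eta = (z_true >= loq).astype(int)
z_obs = np.maximum(z_true, loq)

# The log-likelihood at the generating parameters versus a badly shifted intercept.
ll_true = censored_loglik(1.0, np.array([0.5]), 1.0, z_obs, eta, U)
ll_off = censored_loglik(3.0, np.array([0.5]), 1.0, z_obs, eta, U)
```

Censored observations contribute only the probability of falling below the limit, which is why no value needs to be imputed for them.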

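$$
\hat{E} Z_j^{\circ} = \hat\phi_{j0} + \hat\phi_{j1} \bar U_1 + \cdots + \hat\phi_{jq} \bar U_q \quad (j = 1, \ldots, p)
$$

where $\bar U_l = n^{-1} \sum_i U_{il}$. We shall refer to this estimator as the maximum marginal likelihood estimator (MMLE) since it estimates a marginal mean and uses no information from the other $Z$ variates. The marginal variance is

$$
\mathrm{Var}(Z_j^{\circ}) = \mathrm{Var}_U\{E(Z_j^{\circ} \mid U)\} + E_U\{\mathrm{Var}(Z_j^{\circ} \mid U)\} = \mathrm{Var}_U\{\phi_j' U\} + \mathrm{Var}(\varepsilon_j) = \phi_j' \Sigma_{UU} \phi_j + \sigma_{\varepsilon j}^2
$$

which can be estimated by

$$
\widehat{\mathrm{Var}}(Z_j^{\circ}) = \hat\phi_j' \hat\Sigma_{UU} \hat\phi_j + \hat\sigma_{\varepsilon j}^2, \tag{2}
$$

where $\hat\Sigma_{UU}$ is the usual sample covariance matrix of $U$. As is typical in regression settings, marginal means and variances of the outcome depend on representative samples of subjects for covariate patterns, and one should be careful in extrapolating these results beyond the range of the data, such as age ranges not covered by the sample. Two generally useful facts are that the population linear regression coefficients and intercept can be written as $\phi_j = \Sigma_{UU}^{-1} \Sigma_{U Z_j^{\circ}}$ and $\phi_{j0} = EZ_j^{\circ} - \phi_j' EU$, respectively. Hence, we can estimate the covariance $\mathrm{Cov}(U, Z_j^{\circ})$ by

$$
\hat\Sigma_{U Z_j^{\circ}} = \hat\Sigma_{UU} \hat\phi_j. \tag{3}
$$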

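In Stage 2, we propose a pairwise pseudo-likelihood approach for estimating the correlation and covariance matrices of $Z^{\circ}$. From model (1) and the fact that any pair of error terms is bivariate normal, i.e. $(\varepsilon_j, \varepsilon_m) \sim N_2(0, 0, \sigma_{\varepsilon j}^2, \sigma_{\varepsilon m}^2, \rho_{\varepsilon jm})$, it follows that

$$
\begin{aligned}
\mathrm{Cov}(\varepsilon_j, \varepsilon_m) &= \mathrm{Cov}\{ Z_j^{\circ} - EZ_j^{\circ} - \Sigma_{Z_j^{\circ} U} \Sigma_{UU}^{-1} (U - EU),\; Z_m^{\circ} - EZ_m^{\circ} - \Sigma_{Z_m^{\circ} U} \Sigma_{UU}^{-1} (U - EU) \} \\
&= \mathrm{Cov}(Z_j^{\circ}, Z_m^{\circ}) - \Sigma_{Z_j^{\circ} U} \Sigma_{UU}^{-1} \Sigma_{U Z_m^{\circ}}
\end{aligned}
$$

and hence

$$
\mathrm{Cov}(Z_j^{\circ}, Z_m^{\circ}) = \Sigma_{Z_j^{\circ} U} \Sigma_{UU}^{-1} \Sigma_{U Z_m^{\circ}} + \mathrm{Cov}(\varepsilon_j, \varepsilon_m). \tag{4}
$$

We only need to estimate $\mathrm{Cov}(\varepsilon_j, \varepsilon_m)$, $j \ne m$, since the other covariance terms in (4) are estimated in Stage 1.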

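In Stage 2, the pairwise error correlations $\rho_{\varepsilon jm}$ are estimated by generalizing the pairwise pseudo-likelihood approach proposed by Jones et al. (2015) in order to handle the setting described by (1). For each pair $(\varepsilon_j, \varepsilon_m)$, we form the pair-specific pseudo-likelihood

$$
PPL_{jm}(\rho_{\varepsilon jm}) = L_{jm}(0, 0, \hat\sigma_{\varepsilon j}^2, \hat\sigma_{\varepsilon m}^2, \rho_{\varepsilon jm}) \tag{5}
$$

where $L_{jm}$ is the left-censored bivariate normal likelihood and $(\hat\sigma_{\varepsilon j}^2, \hat\sigma_{\varepsilon m}^2)$ are the MMLEs of the variances found in Stage 1. This likelihood is based solely on $\{(r_{ij}, \eta_{ij}, r_{im}, \eta_{im}) : i = 1, \ldots, n\}$ where

$$
r_{ij} = Z_{ij} - \hat{E} Z_j^{\circ} - \hat\Sigma_{Z_j^{\circ} U} \hat\Sigma_{UU}^{-1} (U_i - \bar U),
$$

with $\bar U = (\bar U_1, \ldots, \bar U_q)$. Thus, $r_{ij}$ is the estimated (uncensored) residual $\hat\varepsilon_{ij}$ if $\eta_{ij} = 1$ and is left censored if $\eta_{ij} = 0$. The pairwise normal likelihood $L_{jm}(0, 0, \hat\sigma_{\varepsilon j}^2, \hat\sigma_{\varepsilon m}^2, \rho_{\varepsilon jm})$ for left-censored data is

$$
\prod_{i=1}^{n} \{ f_{jm}(r_{ij}, r_{im}) \}^{\eta_{ij}\eta_{im}} \left\{ \frac{\partial}{\partial a} F_{jm}(a, r_{im}) \Big|_{a = r_{ij}} \right\}^{\eta_{ij}(1 - \eta_{im})} \times \left\{ \frac{\partial}{\partial b} F_{jm}(r_{ij}, b) \Big|_{b = r_{im}} \right\}^{(1 - \eta_{ij})\eta_{im}} \{ F_{jm}(r_{ij}, r_{im}) \}^{(1 - \eta_{ij})(1 - \eta_{im})},
$$

where $f_{jm}$ and $F_{jm}$ are the bivariate normal density and distribution function with means (0, 0), variances evaluated at $(\hat\sigma_{\varepsilon j}^2, \hat\sigma_{\varepsilon m}^2)$ and free correlation parameter $\rho_{\varepsilon jm}$. Maximization of $PPL_{jm}(\rho_{\varepsilon jm})$ produces the maximum pairwise pseudo-likelihood estimator (MPPLE) $\hat\rho_{\varepsilon jm}$. This maximization is repeated for all $p(p-1)/2$ pairings to compute the MPPLE $\hat P$ of the error correlation matrix $P$. Computing formulae for the likelihood components can be found in Jones et al. (2015), along with other details on maximization.

The MPPLE of the covariance matrix $\Sigma_{\varepsilon}$ of error terms in (1) is defined as $\hat\Sigma_{\varepsilon} = \hat V^{1/2} \hat P \hat V^{1/2}$, where $\hat V$ is the Stage 1-derived diagonal matrix whose diagonal contains the MMLEs $(\hat\sigma_{\varepsilon 1}^2, \ldots, \hat\sigma_{\varepsilon p}^2)$. Since the MMLEs $(\hat\sigma_{\varepsilon j}^2, \hat\sigma_{\varepsilon m}^2)$ are $\sqrt{n}$-consistent estimators, the maximizer $\hat\rho_{\varepsilon jm}$ of the pseudo-likelihood $PPL_{jm}(\rho_{\varepsilon jm})$ is also consistent by the pseudo-likelihood theory of Gong and Samaniego (1981). By the continuous mapping theorem the MPPLE $\hat\Sigma_{\varepsilon}$ is therefore consistent. The MPPLE $\hat\Sigma_{Z^{\circ} Z^{\circ}}$ of the covariance matrix of $Z^{\circ}$ uses consistent plug-in estimators for the population parameters in equation (4) calculated in Stage 1 and Stage 2 and hence is also consistent. At this point we have consistent estimators of $\mu_{D^{\circ}}$ and $\Sigma_{D^{\circ}}$.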
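Once Stage 1 and Stage 2 have produced their estimates, assembling the full covariance matrix of (U, Z°) is pure matrix algebra. The sketch below, in Python with NumPy, combines equations (2) to (4) with the definition of the error covariance; the function name and all numerical inputs are hypothetical stand-ins for the estimates, and the slope vectors are assumed to be stored as the columns of a q × p array.

```python
import numpy as np

def assemble_covariance(Sigma_UU, phi, sd_eps, P_eps):
    """Build the (q+p) x (q+p) covariance matrix of (U, Z) from the two stages.

    Sigma_UU : q x q sample covariance of U
    phi      : q x p array whose j-th column is the slope vector phi_j
    sd_eps   : length-p error standard deviations from Stage 1
    P_eps    : p x p error correlation matrix from Stage 2
    """
    V_half = np.diag(sd_eps)
    Sigma_eps = V_half @ P_eps @ V_half              # V^{1/2} P V^{1/2}
    Sigma_UZ = Sigma_UU @ phi                        # eq. (3), one column per Z_j
    Sigma_ZZ = phi.T @ Sigma_UU @ phi + Sigma_eps    # eq. (2) on the diagonal, eq. (4) off it
    return np.block([[Sigma_UU, Sigma_UZ], [Sigma_UZ.T, Sigma_ZZ]])

# Hypothetical inputs with q = 1 and p = 2.
Sigma_UU = np.array([[4.0]])
phi = np.array([[0.5, -1.0]])
sd_eps = np.array([1.0, 2.0])
P_eps = np.array([[1.0, 0.25], [0.25, 1.0]])
Sigma_D = assemble_covariance(Sigma_UU, phi, sd_eps, P_eps)
```

The off-diagonal term in (4) reduces to the product of the two slope vectors around the covariance of U because the cross-covariances in (3) are themselves that covariance times the slopes.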

2.2 |. Case of no uncensored covariates

In the case of no $U$ covariates (q = 0), the assumption in (1) implies, for all (j, m) pairings,

$$
(Z_{1j}^{\circ}, Z_{1m}^{\circ}), \ldots, (Z_{nj}^{\circ}, Z_{nm}^{\circ}) \overset{iid}{\sim} N_2(\mu_j, \mu_m, \sigma_{\varepsilon j}^2, \sigma_{\varepsilon m}^2, \rho_{\varepsilon jm}) \tag{6}
$$

where $\mu_l = \phi_{l0}$ in (1). This is precisely the same setting explored by Jones et al. (2015). Stage 1, as described above, is used to estimate $(\mu_j, \mu_m)$ and $(\sigma_{\varepsilon j}^2, \sigma_{\varepsilon m}^2)$, the marginal means and variances of $(Z_j^{\circ}, Z_m^{\circ})$, and Stage 2, as described above, is used to estimate $\rho_{\varepsilon jm}$, the correlation between $Z_j^{\circ}$ and $Z_m^{\circ}$. The MMLEs $\hat\mu_j, \hat\mu_m, \hat\sigma_{\varepsilon j}^2, \hat\sigma_{\varepsilon m}^2$ and the MPPLE $\hat\rho_{\varepsilon jm}$ are consistent under the assumption of pairwise bivariate normality (6). This assumption may be reasonable in many settings, but should be seriously scrutinized in each problem. As a cautionary tale, consider two types of workers at a plant, those involved in the manufacturing process with high exposure levels to a toxicant and other workers with far less exposure. Blood toxicant levels, i.e. $(Z^{\circ} \mid \text{group} = g)$, may follow a normal distribution for each group but with different means and variances. Ignoring the existence of the binary covariate and combining the two groups into one results in a mixture of normals rather than in a normal, violating the assumption in (6). The correct approach involves model (1) with $U$ covariates describing the exposure patterns.
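A small worked calculation illustrates how far such a mixture can be from normal. The numbers are hypothetical and, for simplicity, the two groups are given equal sizes and equal variances: suppose $Z^{\circ} \mid g = 0 \sim N(0, 1)$ and $Z^{\circ} \mid g = 1 \sim N(4, 1)$, each with probability 1/2.

$$
\begin{aligned}
E Z^{\circ} &= \tfrac{1}{2}(0) + \tfrac{1}{2}(4) = 2, \\
\mathrm{Var}(Z^{\circ}) &= E\{\mathrm{Var}(Z^{\circ} \mid g)\} + \mathrm{Var}\{E(Z^{\circ} \mid g)\} = 1 + 4 = 5, \\
E(Z^{\circ} - 2)^4 &= E(W \pm 2)^4 = E W^4 + 6 \cdot 4 \cdot E W^2 + 16 = 3 + 24 + 16 = 43, \quad W \sim N(0, 1), \\
\text{kurtosis} &= 43 / 5^2 = 1.72 \ne 3.
\end{aligned}
$$

A normal variate has kurtosis 3, so the pooled distribution is markedly non-normal (here bimodal), even though each group is exactly normal. Including the group indicator as an uncensored covariate, $Z^{\circ} = 0 + 4U + \varepsilon$ with $\varepsilon \sim N(0, 1)$, restores the conditional normality required by model (1).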

2.3 |. Estimation for linear functions of left-censored variates

Under the assumptions of multivariate normal errors and no unmeasured confounding in model (1), we have successfully estimated the marginal means $\hat\mu_{D^{\circ}}$ and covariance matrix $\hat\Sigma_{D^{\circ}}$ of all variates as described in the previous section. We can achieve substantially greater generality by considering linear combinations of the uncensored and censored variates, for which we will estimate means as well as covariance and correlation matrices. This will extend the methodology beyond typical censored-data methods by allowing us, for example, to compute summary statistics for possibly weighted averages and differences of left-censored variates.

Again, $D^{\circ} = (U, Z^{\circ})$ consists of the q-vector $U$, not subject to left censoring, and the p-vector $Z^{\circ}$, which is the uncensored version of $Z$. Let $G$ be a $(k+1) \times (q+p)$ matrix of constants. The mean and covariance matrix of the linear combination $GD^{\circ}$ are estimated by

$$
\hat{E}(GD^{\circ}) = G \hat\mu_{D^{\circ}}, \qquad \widehat{\mathrm{Cov}}(GD^{\circ}) = G \hat\Sigma_{D^{\circ}} G' \tag{7}
$$

where $\hat\mu_{D^{\circ}}$ and $\hat\Sigma_{D^{\circ}}$ are calculated in the two-stage process described earlier.
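As an illustration of (7), the following Python sketch computes the estimated mean and covariance of a difference and an average of two left-censored variates. The mean vector and covariance matrix are hypothetical placeholders for the two-stage estimates, with the variates ordered as (U, Z1, Z2).

```python
import numpy as np

# Hypothetical two-stage estimates for D = (U, Z1, Z2).
mu_D = np.array([1.0, 0.5, -1.0])
Sigma_D = np.array([[4.0, 2.0, -4.0],
                    [2.0, 2.0, -1.5],
                    [-4.0, -1.5, 8.0]])

# Each row of G defines one linear combination of (U, Z1, Z2).
G = np.array([[0.0, -1.0, 1.0],    # difference Z2 - Z1
              [0.0, 0.5, 0.5]])    # average (Z1 + Z2) / 2

mean_GD = G @ mu_D                 # estimated means, eq. (7)
cov_GD = G @ Sigma_D @ G.T         # estimated covariance matrix, eq. (7)
```

No censored value is ever differenced or averaged directly; the censoring is handled entirely in the estimation of the mean vector and covariance matrix, and G is applied afterwards.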

3 |. METHODS FOR LINEAR REGRESSION

Using the covariance matrix and mean vector from the previous section, Section 3.1 gives straightforward formulae for estimation of the regression coefficients and population multiple $R^2$. The nonparametric bootstrap is proposed for standard errors. Section 3.2 covers the typical setting in which all left-censored variates are generated from the uncensored variates, as in (1), and then one of them is chosen to be the outcome to be regressed on the rest. The regression framework (1) for censored variates may not seem reasonable when some censored covariates are generated from other censored variates. Section 3.3 addresses this problem via a recursive system of regression models which are then shown to reduce to the original equation (1) framework, and hence allow estimation and hypothesis testing.

3.1 |. Multiple Linear Regression

Our regression setup will allow both the outcome and covariates to be linear combinations of uncensored and left-censored covariates as defined in Section 2.3. Specifically, our goal is to estimate the regression coefficients of the multiple linear regression model

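$$
Y_i^{\circ} = \beta_0 + \beta_1 X_{i1}^{\circ} + \cdots + \beta_k X_{ik}^{\circ} + e_i \tag{8}
$$

where both $Y^{\circ}$ and $(X_1^{\circ}, \ldots, X_k^{\circ})$ can be linear combinations of uncensored variates and variates that are subject to left censoring; $e_1, \ldots, e_n$ have mean zero and constant variance $\sigma_e^2$, and are independent of each other and of the covariates. Let $G$ be a $(k+1) \times (q+p)$ matrix of constants, where $G_Y$ is the first row and $G_X$ is the remaining k rows of $G$. Construct $G$ so that

$$
Y^{\circ} = G_Y D^{\circ}, \qquad X^{\circ} = G_X D^{\circ}.
$$

By the previous section, the means and covariances of $Y^{\circ}$ and $X^{\circ}$ can be estimated by

$$
\hat\mu_{Y^{\circ}} = G_Y \hat\mu_{D^{\circ}}, \quad \hat\Sigma_{Y^{\circ} Y^{\circ}} = G_Y \hat\Sigma_{D^{\circ}} G_Y', \quad \hat\mu_{X^{\circ}} = G_X \hat\mu_{D^{\circ}}, \quad \hat\Sigma_{X^{\circ} X^{\circ}} = G_X \hat\Sigma_{D^{\circ}} G_X', \quad \hat\Sigma_{X^{\circ} Y^{\circ}} = G_X \hat\Sigma_{D^{\circ}} G_Y'.
$$

Minimization of the mean squared error $E(Y^{\circ} - \beta_0 - \beta' X^{\circ})^2$ for an arbitrary linear function leads to $\Sigma_{X^{\circ} X^{\circ}} \beta = \Sigma_{X^{\circ} Y^{\circ}}$ and $\beta_0 = EY^{\circ} - \beta' EX^{\circ}$. Our proposal is to use the substitution or plug-in principle, so that the MPPLEs of the regression coefficients are

$$
\hat\beta = \hat\Sigma_{X^{\circ} X^{\circ}}^{-1} \hat\Sigma_{X^{\circ} Y^{\circ}}, \qquad \hat\beta_0 = \hat\mu_{Y^{\circ}} - \hat\beta' \hat\mu_{X^{\circ}}. \tag{9}
$$

The nonparametric bootstrap is proposed for standard errors and Wald-type confidence intervals. In addition, one can estimate the multiple correlation coefficient

$$
R^2 = \Sigma_{X^{\circ} Y^{\circ}}' \Sigma_{X^{\circ} X^{\circ}}^{-1} \Sigma_{X^{\circ} Y^{\circ}} / \Sigma_{Y^{\circ} Y^{\circ}} \tag{10}
$$

by replacing its population components by their MPPLEs. Note that this $R^2$ is a population quantity, and care should be given to interpreting $\hat R^2$ as the percentage of the variability of a potentially censored Y explained by the regression function of the observed X's, also subject to censoring, when using MPPLEs of the regression coefficients for a given data set.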
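The nonparametric bootstrap proposed above resamples whole units with replacement and recomputes the estimator on each resample. The Python sketch below shows the mechanics for a generic estimator; the function name `bootstrap_se_ci` is hypothetical, and the column-mean estimator in the usage lines is only a stand-in, since in an actual analysis the estimator would rerun Stages 1 and 2 and then apply (9) to each resampled data set.

```python
import numpy as np

def bootstrap_se_ci(data, estimator, n_boot=200, seed=0, z_crit=1.96):
    """Bootstrap standard errors and Wald-type intervals for estimator(data).

    data      : n x d array with one row per unit
    estimator : function mapping such an array to a 1-D array of estimates
    """
    rng = np.random.default_rng(seed)
    n = data.shape[0]
    estimate = np.asarray(estimator(data), dtype=float)
    reps = np.empty((n_boot,) + estimate.shape)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)      # resample units with replacement
        reps[b] = estimator(data[idx])
    se = reps.std(axis=0, ddof=1)             # bootstrap standard error
    return estimate, se, estimate - z_crit * se, estimate + z_crit * se

# Stand-in usage on hypothetical data: column means as the "estimator".
demo = np.random.default_rng(1).normal(size=(200, 2))
est, se, lower, upper = bootstrap_se_ci(demo, lambda d: d.mean(axis=0))
```

Resampling rows keeps each unit's covariates, censored values and censoring indicators together, which is what preserves the dependence structure the estimator relies on.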

3.2 |. Typical Linear Regression Example

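A typical environmental health problem often involves uncensored variates $(U_1, \ldots, U_q)$ and contaminants $(Z_1^{\circ}, \ldots, Z_p^{\circ})$ subject to left censoring for which the linear system of equations (1) holds along with the associated assumptions. The $\varepsilon = (\varepsilon_1, \ldots, \varepsilon_p)$ error terms in (1) are allowed to share unmeasured factors that induce correlations among the variates. In this section we shall assume that no $Z^{\circ}$ directly generates any other (e.g. no metabolites) so that model (1) is directly applicable. We shall relax this assumption in the next section. Suppose that the scientific problem of interest is how one of the $Z^{\circ}$'s, say $Y^{\circ} = Z_p^{\circ}$, is related to the others by a single regression model

$$
Y_i^{\circ} = \beta_0 + \beta_1' U_i + \beta_2' Z_{i*}^{\circ} + e_i \tag{11}
$$

where $Z_{i*}^{\circ} = (Z_{i1}^{\circ}, \ldots, Z_{i,p-1}^{\circ})$. In this section we explore the parametric connection between (1) and (11). Although somewhat terse, matrix algebra is useful for simplified expressions and derivations. Model (1) can be abbreviated to $Z_i^{\circ} = \phi_0 + \phi' U_i + \varepsilon_i$, where $\phi_0 = (\phi_{10}, \ldots, \phi_{p0})$ and $\phi = (\phi_1, \ldots, \phi_p)$, the latter being a q × p matrix. The major blocks of the $(q+p) \times (q+p)$ covariance matrix $\Sigma_{D^{\circ}}$ of $D^{\circ} = (U, Z^{\circ})$ can easily be shown to be $\Sigma_{UU}$, $\Sigma_{U Z^{\circ}} = \Sigma_{UU} \phi$ and $\Sigma_{Z^{\circ} Z^{\circ}} = \phi' \Sigma_{UU} \phi + \Sigma_{\varepsilon}$. Denote the vector of predictors in (11) by $X^{\circ} = (U', Z_*^{\circ\prime})'$. Let $G$ be the identity matrix of dimension q + p, $G_Y$ the last row of $G$, and $G_X$ the first q + p − 1 rows of $G$, so that $Y^{\circ} = G_Y D^{\circ}$ and $X^{\circ} = G_X D^{\circ}$. Then $\Sigma_{X^{\circ} X^{\circ}} = G_X \Sigma_{D^{\circ}} G_X'$ and $\Sigma_{X^{\circ} Y^{\circ}} = G_X \Sigma_{D^{\circ}} G_Y'$, so that $\beta = \Sigma_{X^{\circ} X^{\circ}}^{-1} \Sigma_{X^{\circ} Y^{\circ}}$ and $\beta_0 = EY^{\circ} - \beta' EX^{\circ}$. Equivalently, one could write $\mathrm{Cov}(X^{\circ}, e) = \mathrm{Cov}(X^{\circ}, Y^{\circ} - \beta' X^{\circ}) = \Sigma_{X^{\circ} Y^{\circ}} - \Sigma_{X^{\circ} X^{\circ}} \beta = 0$. Hence, the $\hat\beta$ estimators are functions of $\hat\phi$, $\hat\Sigma_{UU}$ and $\hat\Sigma_{\varepsilon}$, i.e. estimators of the parameters underlying model system (1).

A different approach, which can offer valuable insight, is to assume the first $p - 1$ models of (1) for $(Z_1^{\circ}, \ldots, Z_{p-1}^{\circ})$ plus model (11) for $Y^{\circ}$ and then to derive how they determine the $\phi$ parameters for $Z_p^{\circ} = Y^{\circ}$, i.e. the parameters of the last equation in (1). Since $Z_{ij}^{\circ} = \phi_{j0} + \phi_j' U_i + \varepsilon_{ij}$ by model (1), substitution into (11) gives

$$
\begin{aligned}
Y_i^{\circ} &= \beta_0 + \beta_1' U_i + \sum_{j=1}^{p-1} \beta_{2j} (\phi_{j0} + \phi_j' U_i + \varepsilon_{ij}) + e_i \\
&= \Big( \beta_0 + \sum_{j=1}^{p-1} \beta_{2j} \phi_{j0} \Big) + \Big( \beta_1 + \sum_{j=1}^{p-1} \beta_{2j} \phi_j \Big)' U_i + \Big( \sum_{j=1}^{p-1} \beta_{2j} \varepsilon_{ij} + e_i \Big) \\
&= \phi_{Y0} + \phi_Y' U_i + e_{Yi},
\end{aligned}
$$

with the obvious equivalence of $(\phi_{Y0}, \phi_Y)$ to the parameters in the last line of (1) and with $e_{Yi} = \varepsilon_{ip}$. Furthermore, $e_i = \varepsilon_{ip} - \sum_{j=1}^{p-1} \beta_{2j} \varepsilon_{ij}$. Letting $c = (-\beta_{21}, \ldots, -\beta_{2,p-1}, 1)$, we have $e \sim N(0, c \Sigma_{\varepsilon} c')$.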

3.3 |. Recursive system of regression models

In some settings the model (1) assumption that the variates are generated solely from a set of uncensored U variates may not seem realistic. For example, in drawing a causal graph for a specific real-world problem, some variates with lower limits of quantification may be generated from other variates having a lower limit of quantification. As a simple example, consider the recursive system of models

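$$
\begin{aligned}
Z_{i1}^{\circ} &= \gamma_{10} + \gamma_1' U_i + \varepsilon_{i1}^{*} && (12) \\
Z_{i2}^{\circ} &= \gamma_{20} + \gamma_2' U_i + \psi_{21} Z_{i1}^{\circ} + \varepsilon_{i2}^{*} && (13) \\
Z_{i3}^{\circ} &= \gamma_{30} + \gamma_3' U_i + \psi_{31} Z_{i1}^{\circ} + \psi_{32} Z_{i2}^{\circ} + \varepsilon_{i3}^{*} && (14)
\end{aligned}
$$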

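where $(\varepsilon_{i1}^{*}, \varepsilon_{i2}^{*}, \varepsilon_{i3}^{*})$ is multivariate normal with mean 0 and covariance $\Sigma_{\varepsilon^{*}}$. As an example, the $Z_{ij}^{\circ}$'s might be the true contaminant levels in years j = 1, 2, 3 of a study. We make the usual assumption of no unmeasured confounders, whereby in each model the $U$ covariates and the $Z^{\circ}$'s, used as covariates, are uncorrelated with the regression error. This assumption is required for unbiased estimation of the regression parameters when all $Z^{\circ}$ covariates are measured exactly and is necessary here for the censored-data setting as well. Since $\mathrm{cov}(Z_1^{\circ}, \varepsilon_2^{*}) = \gamma_1' \mathrm{cov}(U, \varepsilon_2^{*}) + \mathrm{cov}(\varepsilon_1^{*}, \varepsilon_2^{*})$, the requirement that $\varepsilon_2^{*}$ be uncorrelated with $Z_1^{\circ}$ and $U$ in (13) means $\varepsilon_1^{*}$ and $\varepsilon_2^{*}$ must be uncorrelated. Similarly, one can show that if $\varepsilon_3^{*}$ is uncorrelated with $(U, Z_1^{\circ}, Z_2^{\circ})$ in model (14), then $\varepsilon_3^{*}$ is uncorrelated with $\varepsilon_1^{*}$ and $\varepsilon_2^{*}$. Hence, $\Sigma_{\varepsilon^{*}}$ is a diagonal matrix, as generally assumed for recursive systems (Johnston, 1972); see the exception below.

The recursive system of equations described in (12)-(14) can be rewritten in the form of (1), which only involves $U$ covariates, by making sequential substitutions of the $Z_{ij}^{\circ}$ outcome from one model into the $Z_{ij}^{\circ}$ covariates of succeeding models. This results in the following model (1) parameters: $(\phi_{10}, \phi_1) = (\gamma_{10}, \gamma_1)$ for model (12), $(\phi_{20}, \phi_2) = (\gamma_{20} + \psi_{21} \gamma_{10},\; \gamma_2 + \psi_{21} \gamma_1)$ for model (13), and $(\phi_{30}, \phi_3) = (\gamma_{30} + \psi_{31} \gamma_{10} + \psi_{32} \phi_{20},\; \gamma_3 + \psi_{31} \gamma_1 + \psi_{32} \phi_2)$ for model (14), with model (1) error terms becoming

$$
\begin{pmatrix} \varepsilon_{i1} \\ \varepsilon_{i2} \\ \varepsilon_{i3} \end{pmatrix}
=
\begin{pmatrix} 1 & 0 & 0 \\ \psi_{21} & 1 & 0 \\ \psi_{31} + \psi_{32} \psi_{21} & \psi_{32} & 1 \end{pmatrix}
\begin{pmatrix} \varepsilon_{i1}^{*} \\ \varepsilon_{i2}^{*} \\ \varepsilon_{i3}^{*} \end{pmatrix}. \tag{15}
$$

Writing (15) succinctly as $\varepsilon = B \varepsilon^{*}$, the model (1) error terms are multivariate normal with mean 0 and covariance $B \Sigma_{\varepsilon^{*}} B'$. Importantly, $\varepsilon$ is uncorrelated with $U$. The exception to $\Sigma_{\varepsilon^{*}}$ being a diagonal matrix occurs as follows. If $\psi_{21} = 0$, then $\varepsilon_{i2}^{*} = \varepsilon_{i2}$ and $\mathrm{Cov}(\varepsilon_{i1}, \varepsilon_{i2})$ need not be zero. Section 4.3 gives an example of a more general semi-recursive system in which (12) and (13) each contain multiple models rather than just a single model.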
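The matrix in (15) can be checked numerically. Collecting the ψ coefficients of (12)-(14) in a strictly lower-triangular matrix Ψ (a notation introduced here for illustration, not used in the paper), the system reads Z° = γ0 + γ'U + ΨZ° + ε*, so that ε = (I − Ψ)⁻¹ε*. The Python sketch below, with hypothetical coefficient values and error variances, confirms that this inverse equals the matrix B obtained by sequential substitution and forms the implied error covariance.

```python
import numpy as np

# Hypothetical recursive coefficients.
psi21, psi31, psi32 = 0.5, -0.3, 0.8

# Strictly lower-triangular matrix of the psi coefficients in (12)-(14).
Psi = np.array([[0.0, 0.0, 0.0],
                [psi21, 0.0, 0.0],
                [psi31, psi32, 0.0]])

# Matrix B as written in (15), obtained by sequential substitution.
B_closed = np.array([[1.0, 0.0, 0.0],
                     [psi21, 1.0, 0.0],
                     [psi31 + psi32 * psi21, psi32, 1.0]])

# The same matrix obtained by solving the system directly.
B_solved = np.linalg.inv(np.eye(3) - Psi)

# Diagonal covariance of the recursive errors (hypothetical variances)
# and the implied covariance of the model (1) errors.
Sigma_eps_star = np.diag([1.0, 2.0, 0.5])
Sigma_eps = B_closed @ Sigma_eps_star @ B_closed.T
```

Even though the recursive errors are uncorrelated, the model (1) errors generally are not, which is why Stage 2 must estimate their correlations.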

The important result here is that the recursive system of models (12)-(14) falls into the framework of model (1) for which the assumption of multivariate normal errors is satisfied. Hence, the means and covariance matrices of $(U, Z^{\circ})$ can be estimated by the two-stage process described in Section 2. Estimation of regression parameters is as follows. For model (13) let $Y_i^{\circ} = Z_{i2}^{\circ}$ represent the outcome and $X_i^{\circ} = (U_i, Z_{i1}^{\circ})$ be the covariates with regression parameters $\beta_0 = \gamma_{20}$ and $\beta = (\gamma_2, \psi_{21})$. For model (14) let $Y_i^{\circ} = Z_{i3}^{\circ}$ and $X_i^{\circ} = (U_i, Z_{i1}^{\circ}, Z_{i2}^{\circ})$ with intercept $\beta_0 = \gamma_{30}$ and regression coefficients $\beta = (\gamma_3, \psi_{31}, \psi_{32})$. In either case let $\Sigma_{X^{\circ} X^{\circ}}$ be the covariance matrix of $X^{\circ}$ and $\Sigma_{X^{\circ} Y^{\circ}}$ be the vector of covariances between $X^{\circ}$ and $Y^{\circ}$. These quantities are estimable via (2)-(4) since the recursive system of equations falls into the framework of model (1). Hence, under the usual assumption of no unmeasured confounders, the consistent estimators are again given by $\hat\beta = \hat\Sigma_{X^{\circ} X^{\circ}}^{-1} \hat\Sigma_{X^{\circ} Y^{\circ}}$ and $\hat\beta_0 = \hat{E} Y^{\circ} - \hat\beta' \hat{E} X^{\circ}$. This estimation process is performed successively down the list of recursive equation models.

In some settings the data analyst may be provided a single regression model with multiple left-censored covariates. The obvious solution is to add model equations, one for each left-censored covariate. For example, if presented with just model (14), models (12) and (13) can be added, with or without ψ21 depending on subject matter knowledge. The tools of this section are then available.

4 |. SPECIFIC DESIGNS AND SIMULATION STUDIES

In this section we describe the details of the proposed methods for specific designs, and then report on simulation experiments, comparing the new estimators to both the true values and the estimators obtained from analysis of the uncensored version of the data. In Section 4.1 we examine first how well the MMLEs of means and variances and the MPPLEs of correlations from Section 2.1 work and then consider a standard regression setting which parallels Section 3.2. For comparison, we consider the LOQ/2 and LOQ/√2 imputation methods, given the lack of other methods that can handle the general settings under study. As an aside, we assume the data have been pre-transformed to normal, hence appropriate replacement of a left-censored value by, say, LOQ/2 is done on the lognormal scale. In Section 4.2 we consider a regression model in which the outcome is the difference in two left-censored variates and the covariates consist of uncensored and left-censored variates. Section 4.3 studies a model in which the outcome is the average of two left-censored variates and the covariates consist of uncensored variates and the average of three left-censored variates. Given the poor performance of the imputation methods in Section 4.1, they are not reported in the later sections.

4.1 |. Standard regression design

In the first simulation experiment suppose a study measures 7 variates on n = 100 subjects. There are 2 uncensored variates: $U_1 \sim \mathrm{Bin}(1, 0.5)$ and $U_2$ is the normal mixture $U_1 + \delta$, where $\delta \sim N(0, 4)$. There are also 5 left-censored variates $(Z_1^{\circ}, \ldots, Z_5^{\circ})$ that satisfy model equations (1) with

$$
\phi = \begin{pmatrix} 0.0 & 0.5 & 1.0 \\ 0.5 & 0.3 & 1.0 \\ -0.5 & -0.5 & -1.0 \\ 0.3 & 0.0 & -1.0 \\ 0.2 & 0.4 & -0.6 \end{pmatrix}
\quad \text{and} \quad
\Sigma_{\varepsilon} = \begin{pmatrix} 1.0 & 0.2 & 0.2 & 0.3 & 0.5 \\ 0.2 & 1.0 & 0.4 & 0.3 & 0.4 \\ 0.2 & 0.4 & 1.0 & 0.5 & 0.3 \\ 0.3 & 0.3 & 0.5 & 1.0 & 0.2 \\ 0.5 & 0.4 & 0.3 & 0.2 & 1.0 \end{pmatrix},
$$

where row j of $\phi$ holds $(\phi_{j0}, \phi_{j1}, \phi_{j2})$ for $Z_j^{\circ}$, a layout and sign pattern that reproduce the true means and variances reported in Table 1.

Furthermore, the percentages of left-censoring are 20%, 30%, 40%, 50% and 35%. The first task is to examine how well the proposed MMLEs/MPPLEs estimate the true means, variances and correlations. For comparison we include estimators based on the uncensored version of the data and the imputed versions based on the LOQ/2 and LOQ/√2 approaches. Table 1 summarizes the results over 1000 simulated data sets. For the sake of space only the correlations between $Z_5^{\circ}$ and the other variates are shown. In all cases the MMLEs/MPPLEs closely match the true values and uncensored data estimates of the means, variances and correlations. The imputation methods overestimate the means and underestimate the variances. Their estimates of the correlations are biased towards zero, as one might expect in a measurement error case. The second task is to examine estimates of the regression coefficients. Here, $Y^{\circ} = Z_5^{\circ}$ is regressed on the other variates using the same data sets. The true values are found as described in Section 3.2. The second panel of Table 1 summarizes the averages and standard deviations of the regression coefficients, and their average standard errors. Fifty bootstrap resamples are used to estimate the standard errors and Wald-type confidence intervals for the proposed MPPLEs. The MPPLE estimates of the regression coefficients are excellent, especially when one considers the heavy censoring of the covariates and outcome. Estimation of the population $R^2$ is also very good. The bootstrap standard errors tend to overestimate the simulation standard error, but the confidence coverages range from 95.8% to 97.5%. The imputation estimators are uniformly dismal with a substantial underestimate of $R^2$.
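The "True" rows of Table 1 can be reproduced from the design by applying the formulae of Sections 3.1 and 3.2 to the population moments. The Python sketch below does so under two stated assumptions: each row of the coefficient array holds the intercept and the two slopes of one left-censored variate, with the signs shown, and N(0, 4) denotes a variance of 4. Both readings agree with the true means and variances reported in Table 1; the variable names are hypothetical.

```python
import numpy as np

# Assumed layout: rows are Z1..Z5, columns are (intercept, U1 slope, U2 slope).
phi_full = np.array([[0.0, 0.5, 1.0],
                     [0.5, 0.3, 1.0],
                     [-0.5, -0.5, -1.0],
                     [0.3, 0.0, -1.0],
                     [0.2, 0.4, -0.6]])
Sigma_eps = np.array([[1.0, 0.2, 0.2, 0.3, 0.5],
                      [0.2, 1.0, 0.4, 0.3, 0.4],
                      [0.2, 0.4, 1.0, 0.5, 0.3],
                      [0.3, 0.3, 0.5, 1.0, 0.2],
                      [0.5, 0.4, 0.3, 0.2, 1.0]])

# Population moments of U1 ~ Bin(1, 0.5) and U2 = U1 + delta, Var(delta) = 4.
mu_U = np.array([0.5, 0.5])
Sigma_UU = np.array([[0.25, 0.25], [0.25, 4.25]])

phi0 = phi_full[:, 0]
phi = phi_full[:, 1:].T                              # q x p matrix of slopes
mu_Z = phi0 + phi.T @ mu_U
Sigma_UZ = Sigma_UU @ phi
Sigma_ZZ = phi.T @ Sigma_UU @ phi + Sigma_eps
mu_D = np.concatenate([mu_U, mu_Z])
Sigma_D = np.block([[Sigma_UU, Sigma_UZ], [Sigma_UZ.T, Sigma_ZZ]])

# Outcome Z5 (last row of G), predictors U1, U2, Z1..Z4 (first six rows).
G = np.eye(7)
G_Y, G_X = G[-1:], G[:-1]
Sigma_XX = G_X @ Sigma_D @ G_X.T
Sigma_XY = (G_X @ Sigma_D @ G_Y.T).ravel()
Sigma_YY = (G_Y @ Sigma_D @ G_Y.T).item()

beta = np.linalg.solve(Sigma_XX, Sigma_XY)           # slopes, as in (9)
beta0 = (G_Y @ mu_D).item() - beta @ (G_X @ mu_D)    # intercept, as in (9)
R2 = Sigma_XY @ beta / Sigma_YY                      # population R^2, as in (10)
```

Solving the linear system rather than inverting the covariance matrix explicitly is a numerical convenience and gives the same coefficients.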

Table 1.

Simulation results for standard regression design. Estimation of summary measures, regression coefficients and standard errors, R2. Percent left-censoring for (U, Z°) : 0, 0, 20, 30, 40, 50, 35; n = 100 over 1000 simulated data sets.

Means U1 U2 Z10 Z20 Z30 Z40 Z50
True 0.50 0.50 0.75 1.15 −1.25 −0.20 0.10
Uncensored 0.500 0.507 0.759 1.156 −1.258 −0.210 0.097
MPPLE 0.500 0.507 0.758 1.156 −1.258 −0.217 0.096
LOQ/2 0.500 0.507 0.882 1.389 −0.860 0.357 0.224
LOQ/2 0.500 0.507 0.952 1.493 −0.721 0.530 0.345
Variances
True 0.250 4.250 5.562 5.422 5.562 5.250 2.450
Uncensored 0.250 4.239 5.544 5.409 5.547 5.226 2.448
MPPLE 0.250 4.239 5.534 5.409 5.536 5.254 2.452
LOQ/2 0.250 4.239 4.549 3.859 3.266 2.517 1.812
LOQ/2 0.250 4.239 4.180 3.437 2.829 2.112 1.495
Correlations with Z50
True −0.064 −0.759 −0.535 −0.566 0.752 0.739
Uncensored −0.061 −0.758 −0.535 −0.565 0.751 0.737
MPPLE −0.061 −0.760 −0.539 −0.570 0.752 0.739
LOQ/2 −0.057 −0.717 −0.485 −0.495 0.710 0.681
LOQ/2 −0.056 −0.710 −0.471 −0.477 0.711 0.682
Coefficients β0 β1 β2 β3 β4 β5 β6 R2

 True 0.158 0.167 −1.267 0.442 0.281 0.144 −0.089 0.738
Uncensored Data
 Ave 0.156 0.168 −1.273 0.440 0.284 0.1406 −0.088 0.750
 Std Dev 0.146 0.180 0.198 0.086 0.091 0.100 0.102
 Ave(SE) 0.147 0.186 0.199 0.088 0.092 0.101 0.100
MPPLE
 Ave 0.152 0.169 −1.288 0.443 0.288 0.142 −0.098 0.760
 Std Dev 0.198 0.223 0.299 0.114 0.135 0.142 0.151
 Ave(BSE) 0.221 0.251 0.343 0.132 0.154 0.166 0.179
 CI Coverage 95.8 95.8 96.7 96.3 95.8 96.1 97.5
LOQ/2
 Ave 0.193 0.230 −0.744 0.300 0.150 0.204 −0.018 0.668
 Std Dev 0.177 0.172 0.157 0.089 0.095 0.116 0.128
 Ave(SE) 0.169 0.179 0.145 0.089 0.0924 0.109 0.113
LOQ/√2
 Ave 0.255 0.215 −0.646 0.270 0.136 0.202 −0.011 0.662
 Std Dev 0.181 0.158 0.145 0.084 0.090 0.119 0.135
 Ave(SE) 0.169 0.164 0.132 0.085 0.089 0.107 0.114
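
As a rough illustration of why substitution distorts the summary measures in Table 1, the following sketch draws a single normal variate on the log scale, censors about 40% of it, and replaces the censored values by log(LOQ/2) or log(LOQ/√2). It is not the simulation code behind Table 1; the mean, standard deviation, sample size and seed are illustrative assumptions.

```python
import math
import random
import statistics

random.seed(1)

# Assumed log-scale distribution of the true values (illustrative only).
n, mu, sigma = 20000, 0.0, 2.0
# Log-scale limit of quantification chosen to give about 40% left-censoring.
log_loq = statistics.NormalDist(mu, sigma).inv_cdf(0.40)
z_true = [random.gauss(mu, sigma) for _ in range(n)]


def substitute(values, log_limit, divisor):
    """Replace every value below the limit by log(LOQ / divisor)."""
    fill = log_limit - math.log(divisor)
    return [v if v >= log_limit else fill for v in values]


versions = {
    "uncensored": z_true,
    "LOQ/2": substitute(z_true, log_loq, 2.0),
    "LOQ/sqrt(2)": substitute(z_true, log_loq, math.sqrt(2.0)),
}
summary = {
    name: (statistics.fmean(v), statistics.variance(v))
    for name, v in versions.items()
}
for name, (mean, var) in summary.items():
    print(f"{name:12s} mean={mean:6.3f} variance={var:6.3f}")
```

Under these assumptions both substitution rules pull the mean upward and shrink the variance, the same direction of bias reported for the imputation rows of Table 1.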

4.2 |. Regression for a post-minus-pre design

In this problem we consider a difference between two variates subject to left censoring as the outcome variable which is regressed on uncensored and potentially censored variates. A typical application is the post-pre outcome regression model

$$Y_i^{\circ} = Z_{i2}^{\circ} - Z_{i1}^{\circ} = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \beta_3 X_{i3} + e_i \tag{16}$$

where Xi1 = Ui1, Xi2 = Ui2, and Xi3 = Zi1°. Furthermore, we suppose that a causal diagram implies the following set of regression equations for the variates subject to left censoring

Zi1°=γ10+γ11Ui1+γ12Ui2+εi1* (17)
Zi2°=γ20+γ21Ui1+γ22Ui2+ψZi1°+εi2* (18)

where (εi1*,εi2*) are independent normals and uncorrelated with (Ui1,Ui2). As an example, Zi1° and Zi2° might be the true blood levels of some toxicant in years 1 and 2, respectively, while Ui1 and Ui2 might be the subject’s gender and age. Substitution of (17) into (18) yields

Zi2°=(γ20+ψγ10)+(γ21+ψγ11)Ui1+(γ22+ψγ12)Ui2+(ψεi1*+εi2*).

We can now rewrite (17) and (18) into a form equivalent to (1)

Zi1°=ϕ10+ϕ11Ui1+ϕ12Ui2+εi1 (19)
Zi2°=ϕ20+ϕ21Ui1+ϕ22Ui2+εi2 (20)

where (ϕ10,ϕ11,ϕ12) = (γ10,γ11,γ12) and (ϕ20,ϕ21,ϕ22) = (γ20+ψγ10, γ21+ψγ11, γ22+ψγ12) and

$$\begin{pmatrix}\varepsilon_{i1}\\ \varepsilon_{i2}\end{pmatrix} = \begin{pmatrix}1 & 0\\ \psi & 1\end{pmatrix}\begin{pmatrix}\varepsilon_{i1}^{*}\\ \varepsilon_{i2}^{*}\end{pmatrix}, \tag{21}$$

which can be abbreviated as εi = Qεi*. To get unbiased estimators, we require εi1* to be uncorrelated with (Ui1,Ui2) and εi2* to be uncorrelated with (Ui1,Ui2,Zi1°,εi1*). Since

$$\mathrm{Cov}(Z_1^{\circ},\varepsilon_2^{*}) = \mathrm{Cov}(\gamma_{11}U_1+\gamma_{12}U_2+\varepsilon_1^{*},\ \varepsilon_2^{*}) = (\gamma_{11},\gamma_{12})\,\mathrm{Cov}\big((U_1,U_2),\varepsilon_2^{*}\big) + \mathrm{Cov}(\varepsilon_1^{*},\varepsilon_2^{*}),$$

Cov(Z1°,ε2*) = 0 if and only if (U1,U2,ε1*) are uncorrelated with ε2*. Hence, by equation (21), ε ~ N2(0, QΣε*Q′). Assumptions (A1)-(A3) are therefore satisfied, which implies that the means and covariance matrix ΣD° of D° = (U1,U2,Z1°,Z2°) are estimable by Stages 1 and 2 given in Section 2.1. Next, define

$$G = \begin{pmatrix} 0 & 0 & -1 & 1\\ 1 & 0 & 0 & 0\\ 0 & 1 & 0 & 0\\ 0 & 0 & 1 & 0 \end{pmatrix}.$$

The covariance matrix of (Z2° − Z1°, U1, U2, Z1°) is GΣ̂D°G′. Estimators of the β’s in (16) follow from Section 3.1. One can also estimate the population R² and the multiple partial correlation coefficients. The nonparametric bootstrap is used for estimation of standard errors and Wald-type confidence intervals.
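
The step from the covariance matrix of D° to the regression coefficients of (16) can be sketched numerically. The formulas of Section 3.1 are not reproduced in this part of the paper, so the sketch assumes the standard population expressions β = Σ_XX⁻¹Σ_XY, β0 = μ_Y − β′μ_X and R² = Σ_YXΣ_XX⁻¹Σ_XY/Σ_YY. The parameter values are those of the simulation study in this section, with the signs of γ21 and γ22 taken from the true β column of Table 2, and U1 and U2 are assumed independent. Population moments stand in for the Stage 1 and Stage 2 estimates.

```python
import numpy as np

# Assumed population values (see lead-in); D = (U1, U2, Z1, Z2).
gamma1 = np.array([1.0, 1.0, 1.0])         # (gamma10, gamma11, gamma12)
gamma2 = np.array([0.25, -0.50, -1.00])    # (gamma20, gamma21, gamma22)
psi = 0.75
mu_U = np.array([0.5, 0.5])                # U1 ~ Bin(1, 0.5), U2 ~ Unif(0, 1)
Sigma_U = np.diag([0.25, 1.0 / 12.0])      # U1 and U2 assumed independent
Sigma_eps_star = np.diag([1.0, 0.7])

# Reduced form (19)-(20): Z = phi0 + Phi U + eps with eps = Q eps*.
Q = np.array([[1.0, 0.0], [psi, 1.0]])
phi0 = np.array([gamma1[0], gamma2[0] + psi * gamma1[0]])
Phi = np.vstack([gamma1[1:], gamma2[1:] + psi * gamma1[1:]])
Sigma_eps = Q @ Sigma_eps_star @ Q.T

mu_D = np.concatenate([mu_U, phi0 + Phi @ mu_U])
Sigma_D = np.block([
    [Sigma_U, Sigma_U @ Phi.T],
    [Phi @ Sigma_U, Phi @ Sigma_U @ Phi.T + Sigma_eps],
])

# G maps D to (Z2 - Z1, U1, U2, Z1): outcome first, then covariates.
G = np.array([[0, 0, -1, 1],
              [1, 0, 0, 0],
              [0, 1, 0, 0],
              [0, 0, 1, 0]], dtype=float)


def regression_from_moments(mu, Sigma):
    """Regress the first variate on the remaining ones using moments only."""
    S_xx, S_xy, S_yy = Sigma[1:, 1:], Sigma[1:, 0], Sigma[0, 0]
    beta = np.linalg.solve(S_xx, S_xy)
    beta0 = mu[0] - beta @ mu[1:]
    r2 = (S_xy @ beta) / S_yy
    return beta0, beta, r2


beta0, beta, r2 = regression_from_moments(G @ mu_D, G @ Sigma_D @ G.T)
print(beta0, beta, round(r2, 3))
```

Under these assumptions the sketch returns β0 = 0.25, β = (−0.50, −1.00, −0.25) and R² ≈ 0.323, which agree with the true values listed in Table 2. In practice the estimated mean vector and covariance matrix would replace mu_D and Sigma_D.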

Simulation Study and Results: We generated 1000 data sets, each of size n = 100, with U1 ~ Bin(1, 0.5), U2 ~ Unif(0,1) and (ε1*,ε2*)~N2(0,Σε*) where the coefficients in (17)–(18) and error covariance matrix are

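$$\begin{pmatrix}\gamma_{10} & \gamma_{11} & \gamma_{12}\\ \gamma_{20} & \gamma_{21} & \gamma_{22}\end{pmatrix} = \begin{pmatrix}1 & 1 & 1\\ 0.25 & -0.50 & -1.00\end{pmatrix}, \qquad \Sigma_{\varepsilon^{*}} = \begin{pmatrix}1 & 0\\ 0 & 0.7\end{pmatrix}$$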

and where ψ = 0.75. The percent left censoring for (Z1°,Z2°) is (22.0%, 53.6%). As seen in Table 2, the proposed MPPLEs of the β coefficients in regression model (16) have little bias, their bootstrap standard errors are very close to the sample standard deviations of the 1000 estimated β’s and the Wald-type bootstrap confidence coverages are close to the nominal 95% level. Moreover, the average of the 1000 estimates of the population multiple R² (0.358) for the regression model is close to the true value (0.323) and to the average from regression fits based on knowing the true values of the Z° variates (0.343).

Table 2.

Simulation results for Post-Pre design. Estimation of regression coefficients, bootstrap standard errors and 95% confidence interval coverages. n = 100. Percent left-censoring for Z1°,Z2° : 22.0%, 53.6%

True β Method Ave(β^) SD(β^) Ave(SE(β^)) CI Coverage

β0 = 0.25 Uncensored 0.259 0.203 0.207 94.9%
MPPLE 0.253 0.298 0.298 94.7%
β1 = −0.50 Uncensored −0.497 0.190 0.189 93.9%
MPPLE −0.500 0.237 0.233 94.8%
β2 = −1.00 Uncensored −1.007 0.304 0.306 95.6%
MPPLE −1.019 0.401 0.386 93.9%
β3 = −0.25 Uncensored −0.253 0.084 0.085 94.8%
MPPLE −0.249 0.127 0.123 93.0%

Note: Multiple R²: 0.323 for true model, 0.343 for uncensored data analysis, 0.358 for MPPLE analysis. Results summarized over 1000 simulated data sets.
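
The bootstrap standard errors and Wald-type intervals reported in Table 2 follow a generic recipe that can be sketched as below. This is a minimal illustration, not the paper's software: ordinary least squares on fully observed data stands in for the MPPLE fit, the data are simulated with hypothetical values, and the 1.96 multiplier assumes a nominal 95% normal-theory interval.

```python
import numpy as np


def ols(X, y):
    """Stand-in estimator: least squares with an intercept (not the MPPLE)."""
    design = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef


def bootstrap_wald(X, y, estimator, n_boot=50, seed=0):
    """Nonparametric bootstrap SEs and Wald-type 95% intervals."""
    rng = np.random.default_rng(seed)
    n = len(y)
    estimate = estimator(X, y)
    draws = np.empty((n_boot, estimate.size))
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)   # resample units with replacement
        draws[b] = estimator(X[idx], y[idx])
    se = draws.std(axis=0, ddof=1)
    return estimate, se, estimate - 1.96 * se, estimate + 1.96 * se


# Hypothetical data set of size n = 100 with three covariates.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
y = 0.25 + X @ np.array([-0.5, -1.0, -0.25]) + rng.normal(scale=0.8, size=100)

estimate, se, lower, upper = bootstrap_wald(X, y, ols)
for name, e, s, lo, hi in zip(["b0", "b1", "b2", "b3"], estimate, se, lower, upper):
    print(f"{name}: {e:6.3f} (SE {s:5.3f}), 95% CI ({lo:6.3f}, {hi:6.3f})")
```

Any estimator that maps a resampled data set to a coefficient vector can be passed in place of `ols`; whole units are resampled so that each subject's censoring pattern stays intact.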

4.3 |. Regression design based on averages

In this example, suppose that in the previous year, 3 measurements of a toxicant are taken on unit i with true values (Zi1°,Zi2°,Zi3°), and in the current year, 2 more measurements are taken with true values (Zi4°,Zi5°). To make this setting more realistic, we include a covariate Ui1 that would affect last year’s values as well as this year’s Z° values. In addition, suppose that a random sample of the units received an intervention or remediation after last year’s measurements, i.e. Ui2 is 1 for the intervention and is zero otherwise. Suppose that the model of interest is

Yi°=β0+β1Ui1+β2Ui2+β3Xi°+ei (22)

where Yi° = (Zi4°+Zi5°)/2 and Xi° = (Zi1°+Zi2°+Zi3°)/3, the true average values of this year’s and last year’s toxicant levels, respectively. All Z° variates making up these averages are subject to left censoring.

To design a strategy for fitting model (22), suppose that a causal graph describing how each of last year’s variates were generated results in

Zi1°=ϕ10+ϕ11Ui1+εi1, (23)
Zi2°=ϕ20+ϕ21Ui1+εi2, (24)
Zi3°=ϕ30+ϕ31Ui1+εi3. (25)

Keeping in mind that model (22) is our goal, we will model this year’s Z° values conditional on last year’s values as

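$$Z_{i4}^{\circ} = \eta_{40} + \eta_{41}U_{i1} + \eta_{42}U_{i2} + \eta_{43}\left(\frac{Z_{i1}^{\circ}+Z_{i2}^{\circ}+Z_{i3}^{\circ}}{3}\right) + \varepsilon_{i4}^{*}, \tag{26}$$
$$Z_{i5}^{\circ} = \eta_{50} + \eta_{51}U_{i1} + \eta_{52}U_{i2} + \eta_{53}\left(\frac{Z_{i1}^{\circ}+Z_{i2}^{\circ}+Z_{i3}^{\circ}}{3}\right) + \varepsilon_{i5}^{*}, \tag{27}$$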

where (εi4*,εi5*) are bivariate normal. Note that U1 is still allowed to have an effect on this year’s measurements beyond what is contained in last year’s average X°. As generally required for validity of regression models, U1 is assumed independent of the error terms (ε1,ε2,ε3) in models (23)-(25) and (U1,U2,X°) are independent of (ε4*,ε5*) in models (26)-(27). Models (23)-(25) are allowed to contain the same unmeasured covariates, so that the errors (ε1,ε2,ε3) are allowed to be correlated. These unmeasured covariates are contained in (Z1°,Z2°,Z3°), so that within models (26) and (27) we assume that (ε1,ε2,ε3) are uncorrelated with (ε4*,ε5*). This is the analogue of the zero correlation assumption made in recursive models (Section 3.3). Finally, (ε4*,ε5*) are allowed to be correlated. Note that (23)-(27) extend the recursive system of equations of Section 3.3 by allowing a set of covariates to be added to the previous equation(s) rather than only one covariate at a time.

Before we can estimate the parameters of (22), we need to reformulate models (26) and (27) into model framework (1). The (Z1°,Z2°,Z3°) terms in models (26) and (27) are replaced by their respective models (23)-(25). For j = 4, 5 this results in

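$$Z_{ij}^{\circ} = \Big(\eta_{j0} + \eta_{j3}\tfrac{1}{3}\sum_{l=1}^{3}\phi_{l0}\Big) + \Big(\eta_{j1} + \eta_{j3}\tfrac{1}{3}\sum_{l=1}^{3}\phi_{l1}\Big)U_{i1} + \eta_{j2}U_{i2} + \Big(\eta_{j3}\tfrac{1}{3}\sum_{l=1}^{3}\varepsilon_{il} + \varepsilon_{ij}^{*}\Big) = \phi_{j0} + \phi_{j1}U_{i1} + \phi_{j2}U_{i2} + \varepsilon_{ij} \tag{28}$$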

where (ϕj0,ϕj1,ϕj2,εij) are implicitly defined in (28). Models (23)-(25) and (28) with j = 4, 5 have now been reconfigured in the form of (1). Next, note that the corresponding errors are

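$$\begin{pmatrix}\varepsilon_{i1}\\ \varepsilon_{i2}\\ \varepsilon_{i3}\\ \varepsilon_{i4}\\ \varepsilon_{i5}\end{pmatrix} = \begin{pmatrix} 1 & 0 & 0 & 0 & 0\\ 0 & 1 & 0 & 0 & 0\\ 0 & 0 & 1 & 0 & 0\\ \eta_{43}/3 & \eta_{43}/3 & \eta_{43}/3 & 1 & 0\\ \eta_{53}/3 & \eta_{53}/3 & \eta_{53}/3 & 0 & 1 \end{pmatrix}\begin{pmatrix}\varepsilon_{i1}\\ \varepsilon_{i2}\\ \varepsilon_{i3}\\ \varepsilon_{i4}^{*}\\ \varepsilon_{i5}^{*}\end{pmatrix} := A\varepsilon^{*},$$

so that the error vector of equation (1) is multivariate normal with mean 0 and covariance matrix AΣε*A′. The error terms (ε1,ε2,ε3,ε4,ε5) can be correlated. The only stipulation is that (ε1,ε2,ε3) be independent of (ε4*,ε5*). This generalizes the diagonal error covariance matrix assumption made in recursive systems which add only one covariate at a time.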

At this point models (23)-(25) plus (28) written out for j = 4 and j = 5 represent models for Z1°, …, Z5° in the framework of (1), in which the error terms (ε1, …, ε5) follow a multivariate normal. We are now justified in using Stage 1 and Stage 2 estimation methods to estimate the mean μD° and covariance matrix ΣD° of D° = (U1, U2, Z1°, …, Z5°). Next, define the G matrix

$$G = \begin{pmatrix} 0 & 0 & 0 & 0 & 0 & 1/2 & 1/2\\ 1 & 0 & 0 & 0 & 0 & 0 & 0\\ 0 & 1 & 0 & 0 & 0 & 0 & 0\\ 0 & 0 & 1/3 & 1/3 & 1/3 & 0 & 0 \end{pmatrix},$$

noting that the first row of G, denoted GY, defines the outcome Y° = GY D°, the average of Z4° and Z5°, and the last 3 rows of G, denoted GX, define the covariates (U1, U2, X°) = GX D°. Equations (9) and (10) give the estimated regression coefficients (β, β0) of (22) and the estimated population R², respectively. The nonparametric bootstrap can be used for standard errors and constructing confidence intervals.

Simulation Study and Results.

We generated 1000 data sets, each with n = 100, in which U1 ~ U(0,1) and U2 ~ Bin(1, 0.5). The true (Z1°,Z2°,Z3°) were generated from (23)-(25) in which their ϕ coefficients were all equal to 1, and the error terms were trivariate normal with variances of 1 and covariances each equal to 0.3. In models (26)-(27), (ηj0,ηj1,ηj2,ηj3) = (0.2, −0.5, −1.0, 0.7) for j = 4, 5, and hence so too were the β’s of the regression model (22). The error terms, (ε4*,ε5*), were generated from a bivariate normal with variances (0.7, 0.7) and covariance 0.4. The lower limits of quantification were chosen so that each observed Z underwent 40% left censoring. The simulation results (Table 3) illustrate that the MPPLEs of the β’s in (22) are very close to the true values and to the analytic results for the uncensored version of the simulated data. The MPPLE standard errors were based on 50 nonparametric bootstrap resamples of a data set. As seen in Table 3 they are a close match to the empirical standard deviations taken over the 1000 simulated data sets; in addition, the confidence interval coverages are quite good, ranging from 93.4% to 95.3%. The population R² is estimated with almost no bias (0.499 on average against a true value of 0.499).

Table 3.

Simulation results for Design Based on Averages. Estimation of regression coefficients, bootstrap standard errors and 95% confidence interval coverages. n = 100. Percent left-censoring for Z1°, …, Z5°: 40% each

True β Method Ave(β^) SD(β^) Ave(SE(β^)) CI Coverage

β0 = 0.2 Uncensored 0.200 0.194 0.196 95.7%
MPPLE 0.197 0.233 0.236 95.3%
β1 = −0.5 Uncensored −0.494 0.295 0.280 94.2%
MPPLE −0.492 0.330 0.316 93.4%
β2 = −1.0 Uncensored −0.996 0.152 0.149 94.3%
MPPLE −0.999 0.173 0.172 95.0%
β3 = 0.7 Uncensored 0.698 0.107 0.103 94.2%
MPPLE 0.699 0.131 0.129 95.0%

Note: Multiple R²: 0.499 for true model, 0.496 for uncensored data analysis, 0.499 for MPPLE analysis. Results summarized over 1000 simulated data sets.
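
The construction for the averages design, with the error mixing matrix A and the partition of G into its outcome row and covariate rows, can be checked with population moments. The sketch below is illustrative only. It assumes the simulation settings described above, with the signs of ηj1 and ηj2 taken from the true β column of Table 3, independence of U1 and U2, and the standard moment formulas β = Σ_XX⁻¹Σ_XY and β0 = μ_Y − β′μ_X in place of equation (9), which is not reproduced in this part of the paper.

```python
import numpy as np

# Assumed simulation settings (see lead-in); D = (U1, U2, Z1, ..., Z5).
eta = np.array([0.2, -0.5, -1.0, 0.7])   # (eta_j0, eta_j1, eta_j2, eta_j3), j = 4, 5
phi_prev = np.array([1.0, 1.0])          # (phi_l0, phi_l1), l = 1, 2, 3
mu_U = np.array([0.5, 0.5])              # U1 ~ U(0, 1), U2 ~ Bin(1, 0.5)
Sigma_U = np.diag([1.0 / 12.0, 0.25])    # U1 and U2 assumed independent

# Covariance of (eps1, eps2, eps3, eps4*, eps5*): block diagonal.
Sigma_eps_star = np.zeros((5, 5))
Sigma_eps_star[:3, :3] = 0.3 + 0.7 * np.eye(3)      # variances 1, covariances 0.3
Sigma_eps_star[3:, 3:] = [[0.7, 0.4], [0.4, 0.7]]

# A maps (eps1, eps2, eps3, eps4*, eps5*) to the reduced-form errors.
A = np.eye(5)
A[3:, :3] = eta[3] / 3.0
Sigma_eps = A @ Sigma_eps_star @ A.T

# Reduced-form intercepts and slopes from (23)-(25) and (28).
phi0 = np.concatenate([np.full(3, phi_prev[0]),
                       np.full(2, eta[0] + eta[3] * phi_prev[0])])
Phi = np.vstack([np.tile([phi_prev[1], 0.0], (3, 1)),
                 np.tile([eta[1] + eta[3] * phi_prev[1], eta[2]], (2, 1))])

mu_D = np.concatenate([mu_U, phi0 + Phi @ mu_U])
Sigma_D = np.block([
    [Sigma_U, Sigma_U @ Phi.T],
    [Phi @ Sigma_U, Phi @ Sigma_U @ Phi.T + Sigma_eps],
])

G = np.array([[0, 0, 0, 0, 0, 1 / 2, 1 / 2],
              [1, 0, 0, 0, 0, 0, 0],
              [0, 1, 0, 0, 0, 0, 0],
              [0, 0, 1 / 3, 1 / 3, 1 / 3, 0, 0]])
G_Y, G_X = G[0], G[1:]

beta = np.linalg.solve(G_X @ Sigma_D @ G_X.T, G_X @ Sigma_D @ G_Y)
beta0 = G_Y @ mu_D - beta @ (G_X @ mu_D)
print(beta0, beta)
```

Under these assumptions the recovered coefficients equal (β0, β1, β2, β3) = (0.2, −0.5, −1.0, 0.7), the true values in Table 3, which confirms that averaging models (26) and (27) reproduces regression model (22).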

5 |. REGRESSION ANALYSIS OF PCB DATA

Polychlorinated biphenyls (PCBs) are toxic man-made industrial chemicals that persist in the environment and bioaccumulate up the food chain. Due to their chemical stability, electrical insulating properties and low flammability, they were widely manufactured for use in hydraulic fluids, insulating fluids in transformers, fluorescent light ballasts, and plasticizers in paints, plastics and caulk. Although the intentional manufacture was banned in the United States and European Union, they are still released into the environment through disposal, waste burning and atmospheric recycling of PCB-containing products. They are even transported from industrial nations by air and ocean currents to highly sensitive Arctic ecosystems (Andersen et al., 2001). There is an extensive literature on serious health effects in humans and animals including carcinogenicity (Lauby-Secretan et al., 2013); reproductive abnormalities and fetal toxicity; adverse effects on the immune, nervous and endocrine systems (Brouwer, Reijnders and Koeman, 1989; Henry and DeVito, 2003 and references therein).

In the process of creating PCBs, chlorine atoms are substituted for hydrogen atoms on the biphenyl molecule, a bonded pair of benzene rings. The number and location of the chlorine atoms (up to a maximum of 10) result in 209 different PCB compounds, called congeners, and also determine the physical properties and toxicity. In general, the higher the number of chlorine atoms, the greater the bioaccumulation and the lower rate of metabolism. In the U.S., Monsanto manufactured complex mixtures of PCBs and marketed them under the name Aroclor followed by 4 digits. The last two digits are the percent chlorine by weight.

As part of an on-going U.S. National Institutes of Health-funded study, the AESOP study (Ampelman et al., 2015), blood samples are taken from subjects in East Chicago (urban site) and Columbus Junction, Iowa (rural site). Concentrations of congener-specific levels are determined by gas chromatography with tandem mass spectrometry. Of the 209 congeners, 159 chromatographic peaks can be measured: some are individual congeners, e.g. PCB 118, and others are co-eluting congeners, e.g. PCBs 99 and 83. Of the 159 peaks measured, 10 have been chosen for this analytical example. All have at least 35% of their values uncensored. As PCBs are lipophilic, these measurements are standardized as the ratio of nanograms (ng) of congener per milliliter (ml) of blood lipids.

Two issues arise with the measurements. The first issue is that the amount of congener, the ratio numerator, may fall below the laboratory LOQ, whereas the amount of blood lipids, the ratio denominator, is measured exactly. The left-censoring point of the ratio differs therefore from one subject to another based on different lipid levels. The second issue is the normality assumption. The ratio data are highly skewed, and hence the log transform has been conventionally used in this field. For this analytical data set only one subject stands out in the normal probability plots as an outlier and has been removed. She has extremely high concentrations of most congeners relative to all subjects, even though her blood lipid level is typical.
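
The first issue, a censoring point that varies by subject, can be made concrete with a small sketch. The function name, the LOQ and the measurements below are hypothetical; the only point illustrated is that a common LOQ for the numerator yields a subject-specific limit log(LOQ/lipid) for the log ratio.

```python
import math


def log_ratio_or_limit(congener_ng, lipid_ml, loq_ng):
    """Return (log value, censored flag) for one subject.

    If the congener amount is missing or below the LOQ, the returned value
    is the subject-specific left-censoring point log(LOQ / lipid).
    """
    if congener_ng is None or congener_ng < loq_ng:
        return math.log(loq_ng / lipid_ml), True
    return math.log(congener_ng / lipid_ml), False


# Hypothetical LOQ and (congener ng, lipid ml) pairs; None marks a non-detect.
loq_ng = 0.05
subjects = [(0.20, 0.005), (None, 0.004), (None, 0.008)]
results = [log_ratio_or_limit(c, lipid, loq_ng) for c, lipid in subjects]
for (value, censored) in results:
    print(f"{value:6.3f} {'left-censored at this limit' if censored else 'observed'}")
```

The two censored subjects share the same laboratory LOQ but end up with different censoring points because their lipid amounts differ.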

Year 2 data from the AESOP study were previously analyzed by Jones, Perry & Thorne (2015). We shall use a subset of the year 2 data consisting solely of the 92 adult women on study. Table 4 provides the names of the 10 PCB congener groupings, referred to as peaks, and the percentage of samples that are above the LOQ (uncensored). Two uncensored variables will be incorporated into all regression models: site, to measure urban vs. rural difference, and age, since older subjects should have greater cumulative PCB exposure. Table 4 lists the MMLEs of the mean and standard deviation of each peak, found separately by use of regression model (1) with peak as the outcome, and site and age as covariates. There is considerable variability among congeners as to the amounts in blood samples.

Table 4.

Analysis of PCB congeners*. In each line the outcome variable P118 is regressed on 3 covariates: Site, Age and Peak.

Columns: Peak; PCB Congener Grouping; MMLE of Mean (SD); Percent Uncensored; Linear Regression** Slope (SEB); Wald; R²; Partial Cor(Y, Peak)††

A P70+74+61+76 1.832 (0.675) 40.86 0.625 (0.145) 4.322 0.324 0.519
B P99+83 1.369 (0.613) 90.32 0.724 (0.098) 7.368 0.485 0.665
C P105 0.339 (0.830) 37.63 0.714 (0.099) 7.192 0.808 0.890
D P118 (Y) 1.808 (0.677) 83.87
E P138+129+163 2.150 (0.670) 95.70 0.568 (0.142) 4.006 0.317 0.511
F P146 0.166 (0.826) 75.27 0.321 (0.114) 2.811 0.191 0.353
G P153+168 2.275 (0.682) 98.92 0.506 (0.141) 3.600 0.267 0.454
H P156+157 0.486 (0.786) 58.06 0.532 (0.135) 3.928 0.337 0.531
I P187 0.790 (0.823) 84.95 0.238 (0.126) 1.890 0.137 0.256
J P203 −0.246 (0.956) 46.24 0.051 (0.102) 0.501 0.080 0.063
* Log of congener values used to achieve normality in model (1).

For co-eluting congeners, the major congener by Aroclor production is listed first. Minor congeners are superscripted.

** Coefficient (SE), Wald Statistic shown for Peak but not Intercept, Site and Age.

†† Partial correlation between P118 (Y) and Peak is conditional on Site and Age.

To illustrate a left-censored outcome regressed on uncensored and left-censored covariates, PCB 118 (Peak D) is chosen as the outcome variable of interest. This particularly toxic dioxin-like congener displays clear carcinogenic activity in rats, specifically, liver cancers, as well as non-neoplastic lesions in the liver, lung, adrenal cortex, pancreas, thyroid, nose and kidney (National Toxicology Program Technical Report, 2010). Table 4 lists the MPPL estimates of the regression coefficients, bootstrap standard errors and Wald ratios for 9 separate regressions predicting PCB 118 using site, age and a single peak as the 3 covariates. The coefficient estimates for site and age are not given. In addition, Table 4 lists the MPPLEs of the population R² and partial correlation between PCB 118 and each peak given site and age. One of the most striking results is that Peak B has a slightly larger Wald statistic (7.368) than Peak C (7.192), but its partial correlation of 0.665 is much smaller than Peak C’s partial correlation of 0.890. This is also evident when comparing the estimates of the population R² values: 0.485 for Peak B and 0.808 for Peak C. This Wald ratio-partial correlation paradox is due in part to the fact that Peak B has 90.3% of its values uncensored as compared to just 37.6% uncensored values for Peak C. We return to this issue later.
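
A partial correlation such as those in Table 4 can be obtained from a covariance matrix by conditioning on the adjustment variables. The sketch below uses the standard conditional covariance (Schur complement) formula, which is assumed here rather than taken from the paper, and a hypothetical 3 × 3 covariance matrix in which two variates are related only through a third.

```python
import numpy as np


def partial_correlation(Sigma, i, j, given):
    """Correlation of variates i and j after conditioning on `given`."""
    pair, given = [i, j], list(given)
    S_aa = Sigma[np.ix_(pair, pair)]
    S_ab = Sigma[np.ix_(pair, given)]
    S_bb = Sigma[np.ix_(given, given)]
    cond = S_aa - S_ab @ np.linalg.solve(S_bb, S_ab.T)   # conditional covariance
    return cond[0, 1] / np.sqrt(cond[0, 0] * cond[1, 1])


# Hypothetical covariance of (Y, Peak, C), where Y = C + e1 and Peak = C + e2
# with unit-variance, mutually independent C, e1 and e2.
Sigma = np.array([[2.0, 1.0, 1.0],
                  [1.0, 2.0, 1.0],
                  [1.0, 1.0, 1.0]])

marginal = Sigma[0, 1] / np.sqrt(Sigma[0, 0] * Sigma[1, 1])
partial = partial_correlation(Sigma, 0, 1, [2])
print(round(marginal, 3), round(partial, 3))
```

In this toy example the marginal correlation is 0.5 while the partial correlation given C is zero. With real data the estimated covariance matrix of (P118, Peak, Site, Age) would take the place of Sigma, with Site and Age as the conditioning set.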

As a further illustration of the proposed multiple linear regression technique using both uncensored and left-censored covariates, we next consider variable selection for a multivariable model. There are several model building strategies, the most popular ones based on significance tests, information criteria and penalized likelihoods. Forward stepwise selection is illustrated here, since the other methods would require new theoretical development that is beyond the scope of this paper. Step 1 in Table 5 lists the estimated coefficients, standard errors and Wald statistics for the base model containing site and age. Since Peak B has the largest Wald value (Table 4), it is the first variable added to the base model and is summarized in Step 2 of Table 5. In the next step Peak C has the largest Wald ratio of any peak, and the resulting model is summarized in Step 3. Note that the addition of Peak C has caused the Peak B Wald statistic not only to lose significance but even to change sign. Hence, in the backward glance of the forward stepwise procedure, Peak B is dropped, resulting in the model given in Step 4 of Table 5. In the next step Peak I with a Wald statistic of 2.618 is added (Step 5 summary). Note that this addition has almost no effect on the coefficient or standard error of Peak C. The partial correlation between Peak D (PCB 118) and Peak I (PCB 187) given site, age and Peak C (PCB 105) is 0.401. Finally, no other covariate significantly improved on the last model; the highest positive Wald statistic is 0.850 for Peak F and the largest negative Wald statistic is −0.861 for Peak B, which had been dropped in Step 4. Had either of these variates been added, the corresponding estimated population R² values would be 0.850 and 0.853, which are only slight increases over 0.839. Hence, the final model is given in the last column of Table 5. The corresponding 95% bootstrap confidence intervals are (−0.169, 0.212) for Site, (−0.016, 0.015) for Age, (0.508, 0.896) for P105, (0.043, 0.299) for P187 and (0.825, 2.058) for the intercept.

Table 5.

Forward stepwise regression for the outcome P118 with congeners* from Table 4 considered as potential covariates. Site and age are forced into each model.

Summary** of Forward Stepwise Regression Modeling Steps (a hyphen marks a term that is not in the model at that step)
Term (Peak) Step 1 Step 2 Step 3 Step 4 Step 5

Intercept 0.572 (0.534) 0.326 (0.358) 1.121 (0.423) 1.068 (0.347) 1.441 (0.314)
 Wald 1.071 0.910 2.652 3.077 4.586
Site −0.019 (0.147) −0.095 (0.108) 0.041 (0.118) 0.030 (0.101) 0.022 (0.097)
 Wald −0.132 −0.883 0.345 0.300 0.222
Age 0.031 (0.012) 0.014 (0.008) 0.013 (0.008) 0.012 (0.008) −0.0005 (0.008)
 Wald 2.507 1.800 1.582 1.513 −0.059
P99+83(B) - 0.724 (0.098) −0.071 (0.211) - -
 Wald - 7.368 −0.338 - -
P105(C) - - 0.755 (0.169) 0.714 (0.099) 0.702 (0.099)
 Wald - - 4.471 7.192 7.084
P187(I) - - - - 0.171 (0.065)
 Wald - - - - 2.618
MPPLE-R² 0.076 0.485 0.809 0.808 0.839
* Log of congener values used to achieve normality in model (1).

** Tabled values are regression coefficient estimate (standard error) and Wald statistic.

For co-eluting congeners, the major congener by Aroclor production is listed first. Minor congeners are superscripted.

6 |. DISCUSSION

Left-censored linear regression is an important area of research that is vital to toxicology and environmental epidemiology. In 2010 the journal Epidemiology devoted a special issue to left-censored data (Schisterman and Little, 2010), with a call for further development. Early developmental work focused on the setting in which only the outcome variate is subject to left censoring. For general settings it was common in environmental and toxicological research to replace left-censored covariates and outcome by LOQ/2 or LOQ/√2, then to log transform the data and use common regression tools, like least squares. This practice is still in common use today. More recent proposals have been to use multiple imputation and the EM algorithm, oftentimes developed for the simpler case of a single left-censored covariate.

The goal of this paper is to address the far more general framework in which both the outcome and multiple covariates are subject to left censoring or are linear combinations of left-censored variates. The proposed approach involves a pseudo-likelihood framework that uses plug-in estimators of lower level parameters in order to estimate higher level parameters. One would expect this approach to lose some efficiency relative to classical maximum likelihood but to have some practical advantages too. Simultaneous maximization of a joint likelihood in one stage is not as practical due to the many different patterns of left-censoring (a maximum of 2^p) and the complicated integrations. This issue with the MLE is described in greater detail in Jones et al. (2015, pp. 87–8). That paper also showed that joint maximization of the means, variances and correlation for the case of p = 2 is hampered by suboptimal convergence rates, and may have no real improvement in bias or mean squared error. The MPPLE approach will be pursued in subsequent papers for other settings, e.g. the analysis of multivariate data including repeated measures, profiles, and principal components. As far as I am aware, all methods proposed in this paper are new for the field of left-censored data analysis. Since the focus in this paper is on measurement involving a lower LOQ, the methods are developed for left-censored data. However, the likelihoods Lj in Stage 1 and Ljm in Stage 2 are easily modified to handle right-censored or interval-censored data and are then maximized in the same way. Usable software is under development and will become available. The framework and proposed methods are briefly summarized next.

After transformation, usually logarithmic, measures of environmental contaminants are very often normally distributed for a homogeneous population given a common exposure. However, this may not be true for non-homogeneous populations or those with different exposure histories. For this reason, toxicants like PCBs, which persist and bioaccumulate so that blood levels may depend on personal characteristics such as age and residence, exhibit a distribution that is a mixture of normals. Modeling framework (1) addresses this issue by allowing one to adjust for uncensored covariates. A marginal and pairwise pseudo-likelihood approach is proposed to estimate the means and covariance matrix of all uncensored and left-censored covariates in this framework. This estimation is extended to linear combinations, such as differences and averages of left-censored variates. Methods for linear regression involving an outcome and covariates, which can all be linear combinations of left-censored and uncensored variates, are proposed and are shown to perform quite well via simulation studies. This work is further extended into recursive and semi-recursive systems of equations that extend the conceptual framework of (1). The methods are developed for three common research designs and studied via simulation, where they show little bias and confidence interval coverages near the nominal level. The proposed linear regression methodology is illustrated on data from an on-going PCB study which motivated the proposed methods.
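
The remark that the Stage 1 likelihoods extend easily to right-censored or interval-censored data can be illustrated with the contribution of a single observation. The likelihood Lj itself is defined in Section 2.1 and is not reproduced here; the sketch assumes the usual normal-theory form in which an exact value contributes a density and a censored value contributes the probability of its censoring region. The function and argument names are hypothetical.

```python
import math
from statistics import NormalDist

STD = NormalDist()   # standard normal


def loglik_contribution(mu, sigma, kind, value=None, lower=None, upper=None):
    """Log-likelihood contribution of one observation under N(mu, sigma^2)."""
    if kind == "observed":      # exact value
        return math.log(STD.pdf((value - mu) / sigma) / sigma)
    if kind == "left":          # true value lies below `upper` (e.g. the LOQ)
        return math.log(STD.cdf((upper - mu) / sigma))
    if kind == "right":         # true value lies above `lower`
        return math.log(1.0 - STD.cdf((lower - mu) / sigma))
    if kind == "interval":      # true value lies between `lower` and `upper`
        return math.log(STD.cdf((upper - mu) / sigma) - STD.cdf((lower - mu) / sigma))
    raise ValueError(f"unknown observation type: {kind}")


# Hypothetical sample mixing the four observation types.
sample = [
    {"kind": "observed", "value": 1.2},
    {"kind": "left", "upper": -0.5},
    {"kind": "right", "lower": 2.0},
    {"kind": "interval", "lower": 0.0, "upper": 0.8},
]
total = sum(loglik_contribution(0.5, 1.0, **obs) for obs in sample)
print(round(total, 4))
```

Only the censored-data term changes from one censoring type to another; the summed log-likelihood would then be maximized over the mean and standard deviation exactly as in the left-censored case.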

ACKNOWLEDGEMENTS

I would like to thank Dr. Peter Thorne for helpful discussions on polychlorinated biphenyls. This work was funded in part by 2 NIH grants: the Iowa Superfund Research Program on Semi-Volatile PCBs (NIH P42 ES013661) and the Environmental Health Sciences Research Center, University of Iowa (NIH P30 ES005605). Finally, I would like to thank the Associate Editor and the Reviewer for their insightful comments.

REFERENCES

  1. Ampelman MD, Martinez A, DeWall J, Rawn DFK, Hornbuckle KC & Thorne PS (2015). Inhalation and dietary exposure to PCBs in urban and rural cohorts via congener-specific measurements. Environmental Science & Technology, 49, 1156–1164.
  2. Andersen M, Lie E, Derocher AE, Belikov SE, Bernhoft A, Boltunov AN, Garner GW, Skaare JU & Wiig O (2001). Geographic variation of PCB congeners in polar bears (Ursus maritimus) from Svalbard east to the Chukchi Sea. Polar Biology, 24, 231–238.
  3. Atem FD, Sampene E & Greene TJ (2017). Improved conditional imputation for linear regression with a randomly censored predictor. Statistical Methods in Medical Research, Epub 2017 Jan 1. 10.1177/0962280217727033
  4. Brouwer A, Reijnders PJH & Koeman JH (1989). Polychlorinated biphenyl (PCB) contaminated fish induces vitamin A and thyroid hormone deficiency in the common seal. Aquatic Toxicology, 15, 99–106.
  5. EFSA. (2010). Management of left-censored data in dietary exposure assessment of chemical substances. EFSA Journal, 8, 1557. Ninety-six page scientific report of the European Food Safety Authority.
  6. Frame GM (2001). The current state-of-the-art of comprehensive, quantitative, congener specific PCB analysis, and what we know about the distributions of individual congeners in commercial Aroclor mixtures. In PCBs: Recent Advances in Environmental Toxicology and Health Effects, edited by Robertson LW & Hansen LG, The University Press of Kentucky.
  7. Fridley BL & Dixon P (2007). Data augmentation for a Bayesian spatial model involving censored observations. Environmetrics, 18, 107–123.
  8. Gadda HJ, Thiébaut R, Chêne G & Commenges D (2000). Analysis of left-censored longitudinal data with application to viral load in HIV infection. Biostatistics, 1, 355–368.
  9. Gong G & Samaniego FJ (1981). Pseudo maximum likelihood estimation: theory and applications. The Annals of Statistics, 9, 861–869.
  10. Hald A (1949). Maximum likelihood estimation of the parameters of a normal distribution which is truncated at a known point. Skandinavisk Aktuarietidskrift, 32, 119–134.
  11. Henry TR & DeVito MJ (2003). Non-Dioxin-like PCBs: Effects and consideration in ecologic risk assessment. United States Environmental Protection Agency Report. http://www.epa.gov/oswer/riskassessment/pdf/1340-erasc-003.pdf
  12. Hewett P (2006). Analysis of censored data. Appendix VIII of A Strategy for Assessing and Managing Occupational Exposures, (eds.) Bullock WH & Ignacio JS (third edition).
  13. Hewett P & Ganser GH (2007). A comparison of several methods for analyzing censored data. The Annals of Occupational Hygiene, 51, 611–632.
  14. Hoffman HJ & Johnson RE (2015). Pseudo-likelihood estimation of multivariate normal parameters in the presence of left-censored data. Journal of Agricultural, Biological and Environmental Statistics, 20, 156–171.
  15. Holden LR, Graham JA, Whitmore RW, Alexander WJ, Pratt RW, Liddle SK & Piper LL (1992). Results of the National Alachlor Well Water Survey. Environmental Science & Technology, 26, 935–943.
  16. Hornung RW & Reed LD (1990). Estimation of average concentration in the presence of nondetectable values. Applied Occupational and Environmental Hygiene, 5, 46–51.
  17. Hughes JP (1999). Mixed Effects Models with Censored Data with Application to HIV RNA Levels. Biometrics, 55, 625–629.
  18. Johnston J (1972). Econometric Methods. New York: McGraw-Hill.
  19. Jones MP, Perry SS & Thorne PS (2015). Maximum Pairwise Pseudo-Likelihood Estimation of the Covariance Matrix from Left-Censored Data. Journal of Agricultural, Biological and Environmental Statistics, 20, 83–99.
  20. Jones RR, Weyer PJ, DellaValle CT, Inoue-Choi M, Anderson KE, Cantor KP, Krasner S, Robien K, Freeman LEB, Silverman DT & Ward MH (2016). Nitrate from drinking water and diet and bladder cancer among postmenopausal women in Iowa. Environmental Health Perspectives, 124, 1751–1758.
  21. Kalbfleisch J & Prentice R (2002). The Statistical Analysis of Failure Time Data. New York: Wiley.
  22. Korwel IK, Hornbuckle KC, Peck A, Ludewig G, Robertson LW, Sulkowski WW, Espandiari P, Gairola CG & Lehmler HJ (2005). Congener-specific tissue distribution of Aroclor 1254 and a highly chlorinated environmental PCB mixture in rats. Environmental Science & Technology, 39, 3513–3520.
  23. Lauby-Secretan B, Loomis D, Grosse Y, El Ghissassi F, Bouvard V, Benbrahim-Tallaa L, Guha N, Baan R, Mattock H & Straif K (2013). Carcinogenicity of polychlorinated biphenyls and polybrominated biphenyls. The Lancet Oncology, 14, 287–288.
  24. Marek RF, Thorne PS, Wang K, DeWall J & Hornbuckle KC (2013). PCBs and OH-PCBs in serum from children and mothers in urban and rural U.S. communities. Environmental Science & Technology, 47, 3353–3361. Erratum in: Environmental Science & Technology, 47, 9555–9556.
  25. May RC, Ibrahim JG & Chu H (2010). Maximum likelihood estimation in generalized linear models with multiple covariates subject to detection limits. Statistics in Medicine, 30, 2551–2561.
  26. National Toxicology Program. (2010). Toxicology and carcinogenesis studies of 2,3’,4,4’,5-pentachlorobiphenyl (PCB 118) (CAS No. 31508-00-6) in female Harlan Sprague-Dawley rats (gavage studies). National Toxicology Program Technical Report Series, November; 559, 1–174. (ntp.niehs.nih.gov/ntp/htdocs/lt_rpts/tr559.pdf)
  27. Nehls GJ & Akland GG (1973). Procedures for handling aerometric data. Journal of the Air Pollution Control Association, 23, 180–184.
  28. Nie L, Chu H, Liu C, Cole SR, Vexler A & Schisterman EF (2010). Linear regression with an independent variable subject to a detection limit. Epidemiology, 21(July Supplement), S17–S24.
  29. Pesonen M, Pesonen J & Nevalainen J (2015). Covariance matrix estimation for left-censored data. Computational Statistics & Data Analysis, 92, 13–25.
  30. R Development Core Team. (2017). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria: http://www.rproject.org/.
  31. Richardson DB & Ciampi A (2003). Effects of exposure measurement error when an exposure variable is constrained by a lower limit. American Journal of Epidemiology, 157, 355–363.
  32. Schisterman EF & Little RJ (2010). Opening the black box of biomarker measurement error. Epidemiology, 21(July Supplement), S1–S3.
  33. Shumway RH, Azari RS & Kayhanian M (2002). Statistical approaches to estimating mean water quality concentrations with detection limits. Environmental Science & Technology, 36, 3345–3353.
  34. The Economist. (2017). Pollution in China: The bad earth. June 10–16, 2017, 24–26.
  35. Thorne PS, Perry SS, Saito R, O’Shaughnessy PT, Mehaffy J, Metwali N, Keefe T, Donham KJ & Reynolds SJ (2010). Evaluation of the Limulus amebocyte lysate and recombinant factor C assays for assessment of airborne endotoxin. Applied and Environmental Microbiology, 76, 4988–4995.
  36. Tobin J (1958). Estimation of relationships for limited dependent variables. Econometrica, 26, 24–36.
