PLOS ONE. 2022 Jan 21;17(1):e0262148. doi: 10.1371/journal.pone.0262148

A parametric framework for multidimensional linear measurement error regression

Stanley Luck*

Editor: Sriparna Saha
PMCID: PMC8782541  PMID: 35061791

Abstract

The ordinary linear regression method is limited to bivariate data because it is based on the Cartesian representation y = f(x). Using the chain rule, we transform the method to the parametric representation (x(t), y(t)) and obtain a linear regression framework in which the weighted average is used as a parameter for a multivariate linear relation for a set of linearly related variable vectors (LRVVs). We confirm the proposed approach by a Monte Carlo simulation, where the minimum coefficient of variation for error (CVE) provides the optimal weights when forming a weighted average of LRVVs. Then, we describe a parametric linear regression (PLR) algorithm in which the Moore-Penrose pseudoinverse is used to estimate measurement error regression (MER) parameters individually for the given variable vectors. We demonstrate that MER parameters from the PLR and nonlinear ODRPACK methods are quite similar for a wide range of reliability ratios, but ODRPACK is formulated only for bivariate data. We identify scale invariant quantities for the PLR and weighted orthogonal regression (WOR) methods and their correspondences with the partitioned residual effects between the variable vectors. Thus, the specification of an error model for the data is essential for MER and we discuss the use of Monte Carlo methods for estimating the distributions and confidence intervals for MER slope and correlation coefficient. We distinguish between elementary covariance for the y = f(x) representation and covariance vector for the (x(t), y(t)) representation. We also discuss the multivariate generalization of the Pearson correlation as the contraction between Cartesian polyad alignment tensors for the LRVVs and weighted average. Finally, we demonstrate the use of multidimensional PLR in estimating the MER parameters for replicate RNA-Seq data and quadratic regression for estimating the parameters of the conical dispersion of read count data about the MER line.

1 Introduction

In this work, we consider the problem of fitting a multidimensional line for data that are subject to stochastic error. The motivation for this work comes from a collaborative R & D effort involving the application of the genome-wide association study (GWAS) [1] and eQTL [2] methods to identify beneficial agronomic variation in maize. This led to our applied algebraic investigation of the merits of various effect size measures and their associated statistical methodologies as described in our two recent publications in this journal. In the first, we discussed the importance of factoring a 2 × 2 contingency table for obtaining marginal scale invariant effect size measures for proportional variation [3]. We also used projective geometric concepts for discussing connections between the phi coefficient, odds ratio, relative risk and proportion with respect to the representation of effect sizes. In the second, we discussed the formulation of a nonparametric measure for nonoverlap proportion and the importance of specifying a complete set of parameters for point-biserial variation in the formulation of an effect size measure [4]. We also discussed the importance of providing an error model for data and the use of Monte Carlo (MC) methods for estimating distributions and confidence intervals for effect size, as required for best practices [5]. However, in RNA-Seq studies, error estimation is complicated by the low number of replications usually employed. Our RNA-Seq experimental design is consistent with the recommended minimum of three replicates [6], but this is less than the minimum of 13 replicates required for reliably estimating variance [7]. Furthermore, RNA-Seq read counts can range over more than three orders of magnitude with heteroscedastic error. Thus, weighted least squares (WLS) methods [8, 9] are needed for properly partitioning residual effects in data analyses. The importance of accounting for measurement errors in omics studies has been discussed in several publications [7, 10–13]. These problems provide the motivation for our investigation of linear measurement error regression (MER) methods for multivariate data. The concepts discussed in this paper significantly extend our previous gene expression data analysis work that is briefly described in [14].

We use the term ‘measurement error regression’ in accordance with the recommendation in [15] instead of the other commonly used terms, such as ‘error-in-variables’ or ‘major axis regression’ [16]. There are many studies on the MER problem [17–19], but [9, 20–22] were particularly useful in this work. There are two forms of WLS optimization in measurement error regression for (x, y) data. The elementary form involves the application of the normal equations [9] algorithm to obtain WLS-based linear regression (WR) estimates for the necessary parameters. However, WR is subject to the limitation that the design matrix is constructed from error-free independent data, and the residual effects are assigned exclusively to the dependent data. Thus, WR requires the specification of ‘dependent’ and ‘independent’ quantities. Then, when both x and y are subject to error, the WR estimate is biased, and there is extensive literature describing various approaches for solving this problem [18–20]. The second form of WLS involves ‘weighted orthogonal regression’ (WOR) or ‘orthogonal distance regression’, where the weighted sum of squares of the combined residual effects for x and y is minimized [23]. The ‘dependent’ versus ‘independent’ distinction for y and x does not hold in WOR. Then, WOR can be regarded as a generalization because under the condition where only the y data are subject to error, the WOR and WR estimates are equivalent. However, there is no analytical solution for WOR [23], and the implementation requires a numerical iterative search method. In this work, we use the implementation of WOR in open source statistics software [24, 25] based on the nonlinear ODRPACK algorithm [26]. However, ODRPACK is limited to the analysis of bivariate data because it is based on the Cartesian representation y = f(x). Our objective is to describe a multivariate generalization of the WR algorithm that provides unbiased estimates for MER parameters for a set of linearly related variable vectors (LRVVs). The main novel contributions of this work are as follows: 1) We associate a set of LRVVs with a convex set of weighted averages and use Monte Carlo simulation to demonstrate that the minimum coefficient of variation for error (CVE) provides the optimal weights for forming the weighted average (Eqs 13, 16 and 17). 2) We formulate a parametric linear regression (PLR) algorithm where the weighted average of the LRVVs serves as the independent parameter, and the MER parameters are estimated using the Moore-Penrose pseudoinverse (Eqs 22 and 23; Proposition 1). 3) Convex mappings from the data points in the MER graph to the predicted values along the MER line are provided (Eqs 6 and 7). 4) We propose that in the parametric representation for the linear relation between variable vectors, the corresponding covariance is a parametric vector quantity (Eq 25; Proposition 2). 5) We identify scale invariant quantities for the partitioned residual effects between the LRVVs for both PLR (Eq 27) and WOR (Eq 37). 6) We propose a multivariate generalization of the Pearson correlation coefficient based on the m-fold contraction of Cartesian polyad alignment tensors for the LRVVs and weighted average (Eqs 32, 33 and 35). 7) A quadratic regression method is used to estimate the conical dispersion parameters for determining the read counting error in replicated RNA-Seq data (Eqs 40 and 43).

2 Methods

Section 2.1 introduces our notations. In sections 2.2–2.4, we examine the WLS methods for partitioning residual effects in linear regression for (x, y) data. In section 2.5, we discuss the minimum CVE criterion for obtaining the optimal weighted average of a set of LRVVs. In section 2.6, we describe the PLR framework where parametric equations are used to obtain a novel multidimensional generalization of the normal WLS algorithm for linear regression. In section 3.2, we describe the application of the PLR algorithm for the analysis of the conical dispersion of replicated RNA-Seq data about the MER line.

It has already been established that the specification of an error model E for the given data is essential in MER analysis. An error model serves as a realistic assessment of the performance of a data acquisition system, and error parameters are estimated from replicated measurements [27]. The parameters are usually summarized as an ‘expected variance’ for error with respect to each observation. When referring to experimental error, we use the term ‘expected variance’ because error parameters are empirical. We provide a detailed discussion of the role of E in partitioning residual effects for the WLS optimization of MER. Furthermore, proper statistical practice dictates that effect size measures such as the regression slope, the correlation coefficient and Cohen’s d must be qualified by a distribution [5, 28], and an assessment of substantive significance [29] must account for a distributed system response. Consequently, we also discuss the use of Monte Carlo methods and E for estimating the distributions and confidence intervals of MER parameters.

2.1 Notation

Our notation and terminology are partly taken from [30], but comprehensive discussions of vector spaces and linear algebra for applied statistics are found in many textbooks, such as [8, 9, 31]. Scalars are denoted by italics y, and vectors are written in bold lowercase letters y ≡ (y_i) ≡ (y_1, y_2, y_3, …, y_dim(y)). Matrices are written in bold uppercase letters Y ≡ [y_ij], and the transpose is denoted as Y^T. Then, n joint observations for m experimental quantities produce a dataset {y_ij | y_ij ∈ ℝ, 1 ≤ i ≤ n, 1 ≤ j ≤ m}, which corresponds to the Cartesian product of the data vectors Y = (y_j), y_j ∈ ℝⁿ, where dim(Y) = n × m. We use a convenient terminology for the subsets of Y, where each axis of a regression graph is assigned to a variable vector y_j, the linear combinations ∑_j k_j y_j are elements of the variable space, and each point in the graph corresponds to an observation vector y_(i) [30, 32]. In this work, we describe a novel parametric linear regression algorithm for estimating MER parameters for a set Y of LRVVs. We also use the shortened term ‘variable vectors’ synonymously when referring to LRVVs, except in a few clearly identified cases. The inner product with the one-vector 1 = (1, 1, 1, …) produces the mean value ȳ = 1·y/dim(y). The subtraction of the mean vector produces a centered vector y_c = y − ȳ1. The hat symbol ‘^’ denotes a unit-length vector, i.e., ‖ŷ‖ = 1. Then, the Pearson correlation coefficient r(x, y) is defined as the cosine of the angle between unit-length, centered vectors [30, 32]

$r = \hat{\mathbf{x}}_c \cdot \hat{\mathbf{y}}_c.$  (1)

The variance for y_j is denoted as Var(y_j). The expected variance for normally distributed errors e_yij is denoted as s_yij², and the vector s_yj² denotes the expected variance of the error in y_j. For bivariate data (y_j, y_k), the covariance is denoted as Cov(y_j, y_k); the notation for a variance-covariance matrix is not needed in this work. In our Cov(y_j, y_k) notation, y_j and y_k are independent and dependent quantities, respectively, and the roles are reversed for Cov(y_k, y_j). The connection between correlation and covariance as measures of linear dependence in data is discussed in [21, 32]. A general discussion of convex sets, linear fractional transformations and perspective functions is found in [33]. The set containing all convex combinations of a set of variable vectors Y is denoted as conv(Y), where τ ∈ conv(Y) ⟺ τ = ∑_j w_j y_j, for ∑_j w_j = 1 and 0 ≤ w_j ∀ j. A perspective function has the form P(y, t) = y/t. In recent publications, we provide an elementary discussion about projective geometry for fractional variation [3], and identify the homogeneous coordinates of the Pearson correlation coefficient as points on the line passing through the point (r, √(1 − r²)) ∈ S₊¹ [4]. Equivalent homogeneous coordinates are indicated by a two-sided arrow, i.e., y ↔ x ⟺ y = x t.

2.2 The WLS linear regression algorithm

Consider a set of n linearly related observations {(x_i, y_i) | 1 ≤ i ≤ n}, which are represented as the variable vectors {(x, y) | x, y ∈ ℝⁿ}. The observations are subject to error, and the differences between the true (x*, y*) and observed values are denoted by (e_x, e_y) = (x − x*, y − y*). The fact that (x*, y*) are unknown implies that (e_x, e_y) are also indeterminate. The specification of an error model for the data E(x, y) is essential for estimating the distributions and corresponding confidence intervals of the statistical parameters, including the slope; see section 2.6. The default approach is to reduce E(x, y) to a stochastic form or some approximation thereof, where the errors (e_x, e_y) are independent and associated with normal distributions with zero means (0, 0) and expected variances (s_x², s_y²), where Cov(e_xi, e_xk) = 0 ∀ i ≠ k, Cov(e_yi, e_yk) = 0 ∀ i ≠ k, and Cov(e_xi, e_yk) = 0 ∀ i, k. In the standard MER framework [20, 21], the linear relation between x and y is expressed using Cartesian equations:

$\mathbf{y} = \alpha_{xy}\mathbf{1} + \beta_{xy}\mathbf{x}^{*} + \mathbf{e}_y,$  (2)
$\quad\;\; = \alpha_{xy}\mathbf{1} + \beta_{xy}\mathbf{x} + (\mathbf{e}_y - \beta_{xy}\mathbf{e}_x),$  (3)

where α_xy, β_xy ∈ ℝ are the model parameters. We use the subscript ‘xy’ to indicate the correspondence with the derivative β_xy = dy_i*/dx_i*, because it will be necessary to distinguish between different parameterizations in Eq 21. In linear regression analysis, the objective is to obtain sample estimates (a_xy, b_xy) for (α_xy, β_xy). After introducing the design matrix X = [1, x] and the diagonal variance matrix V_y = diag(s_y²), the WLS linear regression estimate is obtained by solving the normal equations to obtain [9]

$\mathbf{u}_{xy,\mathrm{WR}} = \left(\mathbf{X}^{T}\mathbf{V}_{y}^{-1}\mathbf{X}\right)^{-1}\mathbf{X}^{T}\mathbf{V}_{y}^{-1}\mathbf{y},$  (4)

where u_xy,WR = (a_xy,WR, b_xy,WR). V_y is required for the WLS scaling of both X and y, and (X^T X)⁻¹ X^T is the Moore-Penrose pseudoinverse of X; in practice, the pseudoinverse is estimated by singular value decomposition. The ordinary linear regression (OLR) estimate is obtained when V_y ∝ I, where I is the identity matrix. However, the b_xy,WR estimate is biased because of the correlation of x with e_x in Eq 3 [20]. The reliability ratio [17] for x is defined as

$\kappa_x = 1 - \frac{\mathrm{Var}(\mathbf{e}_x)}{\mathrm{Var}(\mathbf{x})},$

where Var(x) = Var(x*) + Var(e_x), and Var(e_x) = 1·s_x²/dim(s_x²). As illustrated in Fig 1A and 1B, b_xy,WR is attenuated as κ_x decreases from 1. This bias in b_xy,WR arises from the algebraic requirement that X must serve as an error-free quantity in the normal equations, and the residuals are assigned exclusively to y. The solution u_xy,WR holds under the condition where Var(e_x) is small enough such that x ≈ x* with κ_x ≈ 1. A key objective of this work is to describe a novel parametric generalization in which the WR algorithm itself serves as the basis for measurement error regression for a set of LRVVs (section 2.6). We also compare the performance of PLR with that of existing MER methods and provide realistic MER examples using publicly available data. In particular, we discuss the WLS orthogonal regression and attenuation-corrected slope estimation methods.
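For concreteness, the following is a minimal NumPy sketch of the WR estimate in Eq 4; the synthetic data, error variances, and seed are illustrative assumptions, not part of the original analysis.

```python
import numpy as np

def wls_regression(x, y, sy2):
    """WLS normal-equations estimate (a, b) for y = a + b*x (Eq 4 sketch)."""
    X = np.column_stack([np.ones_like(x), x])          # design matrix [1, x]
    Vinv = np.diag(1.0 / sy2)                          # inverse of diag(s_y^2)
    # u = (X^T V^-1 X)^-1 X^T V^-1 y, with the pseudoinverse for numerical stability
    return np.linalg.pinv(X.T @ Vinv @ X) @ (X.T @ Vinv @ y)

# Hypothetical data with errors in both x and y: the WR slope is attenuated
rng = np.random.default_rng(0)
x_true = np.linspace(1.0, 25.0, 25)
y_true = x_true.copy()                                 # true slope = 1
x = x_true + rng.normal(0.0, 0.3, x_true.size)
y = y_true + rng.normal(0.0, 0.3, y_true.size)
a_wr, b_wr = wls_regression(x, y, np.full(y.size, 0.3**2))
print(b_wr)                                            # tends to be < 1 (attenuation)
```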

Fig 1. Comparison of various measurement error regression methods.


A, B, C: Monte Carlo simulations of MER for 25 evenly spaced values y_i* = x_i*. The statistical parameters are estimated from 5000 Monte Carlo datasets (x, y) = (x* + e_x, y* + e_y), with normally distributed errors (e_x, e_y). The parametric linear regression and orthogonal distance regression methods both produce consistent estimates for the slope (b_MER) for a wide range of reliability ratios κ_x. The weighted average (b_xy,W) of the attenuation-corrected direct and inverse WR estimates also produces comparable results. In contrast, the standard (b_WR) estimate is attenuated. A: Homoscedastic error: (s_y* = 0.3, s_x* = α), and α = {0.01, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35}. B: Heteroscedastic error: (s_y* = 0.01 + 0.3y*, s_x* = 0.01 + αx*), and α = {0.01, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6}. C: Using the results from (B), the Pearson correlation coefficient r_xy is attenuated compared to the parametric correlation coefficient r_PLR = r_τw,x r_τw,y for the weighted average τ_w = wx + (1 − w)y. D, E, F: MER results for the data (x, y) from Table 11.3.1 in [21]. However, error estimates were not provided for the data. For demonstration purposes, we used a rudimentary error model: s_x² = 0.04 + 0.005x², s_y² = 0.1 + 0.005y², κ_x = 0.93, κ_y = 0.91, r = 0.49, and r_PLR = 0.74. D: Each MER line is labeled with its slope. The ordinary least squares (b_OLR) and orthogonal regression (b_OR) estimates are consistent with [21]. The WLS orthogonal regression (b_ORw) estimate is also shown. The predicted values for (x_i, y_i) are plotted for PLR and WOR. E: The b_xy,PLR estimate is obtained from separate regressions b_τx,WR and b_τy,WR. The inset shows the distribution of b_xy,PLR estimated from 100 histograms with 5000 MC samples per histogram (see Fig 4); the error bars indicate the degree of convergence for the MC simulation. F: The weighted average prediction residuals for b_xy = 1.38 and the WOR residuals are quite similar. The PLR residuals from the τ WR results in (E) are comparable but do not account for heteroscedastic error. A–E: The error bars correspond to ±2σ.

2.3 WLS orthogonal regression

The WOR algorithm minimizes the weighted sum of squares for the combined residual errors (e_xi², e_yi²):

$\min_{a,b} \sum_i \left( w_{y_i}\left(a + b x_i^{*} - y_i\right)^2 + w_{x_i} e_{x_i}^2 \right),$  (5)

with (w_xi, w_yi) ∝ (1/s_xi², 1/s_yi²) [26], and does not require the identification of dependent and independent variables. However, this minimization problem does not possess an analytical solution [23], and a numerical method is needed for the implementation of WOR. For our demonstration, the Python scipy.odr package [25] provides a convenient WOR implementation based on the ODRPACK optimization library [26]. We are interested in WOR for a linear function y = a + bx, but ODRPACK is more general and uses the Levenberg-Marquardt nonlinear least squares algorithm to support parameter estimation for a user-specified bivariate function y = f(x). As demonstrated in Fig 1, the ODRPACK algorithm performs well for homoscedastic (Fig 1A) and heteroscedastic (Fig 1B) error models over a wide range of reliability ratios. When either κ_x or κ_y approaches 1, the WOR estimate is equivalent to the corresponding WR estimate. Thus, the WOR algorithm is regarded as a generalization of WR. Furthermore, the ODRPACK algorithm provides predicted and residual values for (x_i, y_i), as demonstrated in Fig 1F. For comparison, we describe a novel algebraic method to estimate predicted and residual values in MER. First, we associate the observed data (x_i, y_i) with a convex mapping to an interval on an MER line with (intercept, slope) ≡ (a, b). The interval is bounded by the following two points:

$\mathbf{v}_{x_i} = (x_i,\ b x_i + a)$

and

$\mathbf{v}_{y_i} = \left(\frac{y_i - a}{b},\ y_i\right),$

which correspond to the special cases s_xi² = 0 and s_yi² = 0, respectively. Then, the weighted average

$\mathbf{v}_{i,w} = w\,\mathbf{v}_{x_i} + (1 - w)\,\mathbf{v}_{y_i}$  (6)

is contained in conv(v_xi, v_yi) for 0 ≤ w ≤ 1. Using the signal-to-noise ratios (x̄/s_xi², ȳ/s_yi²) (see Eq 17), the optimal weight is obtained:

$w = \left(1 + \frac{\bar{y}}{\bar{x}}\,\frac{s_{x_i}^2}{s_{y_i}^2}\right)^{-1},$  (7)

which produces the weighted average prediction (WP) value v_i,WP. In Fig 1F, we demonstrate that the ODRPACK and WP estimates for residuals are quite similar. The generalization of Eqs 6 and 7 for MER in ℝ^m is straightforward.
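As an illustration of the two approaches just discussed, the following sketch fits a line by WOR with the scipy.odr interface to ODRPACK and then computes weighted average prediction values along the fitted line (Eqs 6 and 7); the simulated data and error model are hypothetical.

```python
import numpy as np
from scipy import odr

# Hypothetical heteroscedastic data; sx, sy are expected error standard deviations
rng = np.random.default_rng(1)
x_true = np.linspace(1.0, 25.0, 25)
y_true = 2.0 * x_true
sx = 0.01 + 0.2 * x_true
sy = 0.01 + 0.3 * y_true
x = x_true + rng.normal(0.0, sx)
y = y_true + rng.normal(0.0, sy)

# Weighted orthogonal distance regression for a straight line (ODRPACK via scipy.odr)
linear = odr.Model(lambda beta, t: beta[0] + beta[1] * t)
fit = odr.ODR(odr.RealData(x, y, sx=sx, sy=sy), linear, beta0=[0.0, 1.0]).run()
a, b = fit.beta                                        # WOR intercept and slope

# Weighted average prediction (Eqs 6 and 7 sketch): map each (xi, yi) onto the line
w = 1.0 / (1.0 + (y.mean() / x.mean()) * (sx**2 / sy**2))
vx = np.column_stack([x, a + b * x])                   # projection assuming s_x^2 = 0
vy = np.column_stack([(y - a) / b, y])                 # projection assuming s_y^2 = 0
v_wp = w[:, None] * vx + (1.0 - w)[:, None] * vy       # predicted points on the MER line
```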

However, the WR and WOR algorithms are both subject to the algebraic limitation that the Cartesian representation y = f(x) only allows for the representation of lines (or curves) in two dimensions. Alternatively, the specification of a straight line in ℝ^m requires the use of the parametric representation

$\mathbf{u} = \boldsymbol{\alpha} + \boldsymbol{\beta}\,t,$  (8)

with u, α, β ∈ ℝ^m and t ∈ ℝ [34]. Consequently, our objective is to describe a parametric framework where the WR algorithm serves as the basis for measurement error regression for LRVVs.

2.4 Covariance in ordinary linear regression

The OLR algorithm corresponds to the special case where Vy is replaced with the identity matrix I in Eq 4. Then, the expression for the slope takes a simple form and provides the intuition for our novel parametric algorithm in section 2.6. From Eq 3, the sample covariance [22] is expressed as

$\mathrm{Cov}(\mathbf{x},\mathbf{y}) = b_{xy}\mathrm{Var}(\mathbf{x}) + \mathrm{Cov}(\mathbf{x},\ \mathbf{e}_y - b_{xy}\mathbf{e}_x) = b_{xy}\left(\mathrm{Var}(\mathbf{x}) - \mathrm{Var}(\mathbf{e}_x)\right),$

where Cov(ex, ey) = 0, and Cov(x, ex) = Var(ex). Then, we obtain the direct estimator for the slope

$b_{xy,\kappa_x} = \frac{\mathrm{Cov}(\mathbf{x},\mathbf{y})}{\mathrm{Var}(\mathbf{x}) - \mathrm{Var}(\mathbf{e}_x)} = \frac{b_{xy,\mathrm{OLR}}}{\kappa_x},$  (9)

where

$b_{xy,\mathrm{OLR}} \equiv \frac{\mathrm{Cov}(\mathbf{x},\mathbf{y})}{\mathrm{Var}(\mathbf{x})}$  (10)

and b_xy,κx → b_xy,OLR as κ_x → 1. Even though b_xy,κx is referred to as the ‘regression coefficient corrected for attenuation’ [17], we note that b_xy,κx is also biased. Reversing the roles of x and y yields the inverse estimator for β_xy:

$b_{yx,\kappa_y}^{-1} = \frac{\mathrm{Var}(\mathbf{y}) - \mathrm{Var}(\mathbf{e}_y)}{\mathrm{Cov}(\mathbf{y},\mathbf{x})} = \kappa_y\, b_{yx,\mathrm{OLR}}^{-1},$  (11)

where κy = 1 − Var(ey)/Var(y). The relations in Eqs 9 and 11 indicate that the OLR estimates serve as bounding values because for any least squares estimate bMER, it can be shown [20] that

$|b_{xy,\mathrm{OLR}}| \le |b_{\mathrm{MER}}| \le |b_{yx,\mathrm{OLR}}^{-1}|.$  (12)

In section 2.6, we provide a statistical explanation (see Eq 26) and a graphical illustration (Fig 2) of the bounded range of |b_MER|. To compensate for the bias in b_xy,κx and b_yx,κy⁻¹, various ad hoc combinations have been proposed, including the arithmetic mean [16, 35]:

$b_{xy,A} = \left(b_{xy,\kappa_x} + b_{yx,\kappa_y}^{-1}\right)/2.$

Fig 2. Bounded ranges of the regression slopes for the errors in both x and y.


Monte Carlo simulations of parametric linear regression are conducted for 25 evenly spaced values y_i* = x_i*, and the error model is s_yi = 0.2y_i*, s_xi = αx_i* for α = 0.01, 0.2, 0.4. The estimates for the slope b_PLR (A) and correlation r_PLR = r_τx r_τy (B) are averages over 5000 datasets (x, y) = (x* + e_x, y* + e_y) with normally distributed errors (e_x, e_y) and a weighted average τ = w_x x + (1 − w_x)y. The optimal (b_PLR, r_PLR) estimates corresponding to the minimum CVE in τ are also plotted (red). The 2σ error bars are estimates for the positive and negative deviations from the median value.

Unfortunately, the limiting values of b_xy,A are not consistent with the OLR estimate for either κ_x = 1 or κ_y = 1. Instead, we introduce the weighted average

$b_{xy,W} = w_x\frac{b_{xy,\mathrm{WR}}}{\kappa_x} + w_y\frac{\kappa_y}{b_{yx,\mathrm{WR}}},$

where the weights are determined from the signal-to-noise ratios (w_x, w_y) ∝ (x̄/Var(e_x), ȳ/Var(e_y)) (see Eq 17). In Fig 1A and 1B, we observe that the b_xy,W estimates compare well with the expected slope but with much larger confidence intervals compared to those of WOR.
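A sketch of the attenuation-corrected direct and inverse estimators (Eqs 9 and 11) and of their signal-to-noise weighted combination is given below; it assumes homoscedastic errors with known variances and uses OLR slopes in place of the WR slopes.

```python
import numpy as np

def corrected_slopes(x, y, var_ex, var_ey):
    """Attenuation-corrected direct (Eq 9) and inverse (Eq 11) slope estimates,
    combined with signal-to-noise weights as in the b_xy,W average (sketch)."""
    cov_xy = np.cov(x, y)[0, 1]
    b_direct = cov_xy / (np.var(x, ddof=1) - var_ex)   # b_xy,OLR / kappa_x
    b_inverse = (np.var(y, ddof=1) - var_ey) / cov_xy  # kappa_y / b_yx,OLR
    wx, wy = x.mean() / var_ex, y.mean() / var_ey      # Eq 17 style weights
    b_w = (wx * b_direct + wy * b_inverse) / (wx + wy)
    return b_direct, b_inverse, b_w
```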

2.5 Noise reduction for a weighted average of linearly related variable vectors

Consider the weighted average of the LRVVs {(x, y) | x, y ∈ ℝⁿ}:

$\tau = w\mathbf{x} + (1 - w)\mathbf{y},$  (13)

with 0 ≤ w ≤ 1. Then, τ ∈ conv(x, y) and corresponds to points on the line segment connecting x and y. The average of x and y is associated with the partial cancellation of error and an improved signal-to-noise ratio. Let Var(e_x) and Var(e_y) denote the expected variances of the errors in x and y, respectively. Then, the propagation of independent error for the weighted sum of two variables [34] gives

$\mathrm{Var}(\mathbf{e}_\tau) = w^2\,\mathrm{Var}(\mathbf{e}_x) + (1 - w)^2\,\mathrm{Var}(\mathbf{e}_y).$  (14)

By introducing the mean value τ̄ = w x̄ + (1 − w) ȳ and the change in coordinates ω = w/(1 − w), the coefficient of variation for error in τ is

$\frac{\sqrt{\mathrm{Var}(\mathbf{e}_\tau)}}{\bar{\tau}} = \frac{\sqrt{\frac{\mathrm{Var}(\mathbf{e}_x)}{\mathrm{Var}(\mathbf{e}_y)}\,\omega^2 + 1}}{\frac{\bar{x}}{\bar{y}}\,\omega + 1}\;\frac{\sqrt{\mathrm{Var}(\mathbf{e}_y)}}{\bar{y}}.$  (15)

Then, setting the derivative $d\left(\sqrt{\mathrm{Var}(\mathbf{e}_\tau)}/\bar{\tau}\right)/d\omega = 0$ yields

$\omega = \frac{\bar{x}}{\bar{y}}\,\frac{\mathrm{Var}(\mathbf{e}_y)}{\mathrm{Var}(\mathbf{e}_x)},$

and we obtain the criterion

$w = \left(1 + \frac{\bar{y}}{\bar{x}}\,\frac{\mathrm{Var}(\mathbf{e}_x)}{\mathrm{Var}(\mathbf{e}_y)}\right)^{-1},$  (16)

for the minimum CVE. In Fig 3, we use Monte Carlo simulations to demonstrate that the minimum $\sqrt{\mathrm{Var}(\mathbf{e}_\tau)}/\bar{\tau}$ is consistent with Eq 16. By partitioning the weighted average τ = ∑_j w_j y_j into a sequence of pairwise averages for the data (y_j), we obtain the general min(CVE) expression

$w_j = \frac{\bar{y}_j}{\mathrm{Var}(\mathbf{e}_{y_j})}\left(\sum_j \frac{\bar{y}_j}{\mathrm{Var}(\mathbf{e}_{y_j})}\right)^{-1}.$  (17)

Fig 3. Noise reduction for a weighted average of variable vectors.


Monte Carlo simulations of weighted averaging are conducted for the data (x, y) consisting of 25 evenly spaced values for y* = 2x*. The statistical parameters are estimated from 5000 datasets (x, y) = (x* + e_x, y* + e_y) with normally distributed errors (e_x, e_y), and the error model is: s_y* = 0.3y*, s_x* = αx* for α = 0.01, 0.3, 0.5. The minimum coefficient of variation for error √Var(e_τ)/τ̄ for the weighted average τ = w_x x + (1 − w_x)y is consistent with Eq 17, as shown by the optimal w_min curve. Deming regression is associated with the weights (w_x, w_y) ∝ (1/Var(e_x), 1/Var(e_y)), and the corresponding CVE curve (grey) intersects the w_min curve.

Thus, the noise reduction for the average is optimal when the ȳ_j/Var(e_yj) ratios are balanced for the LRVVs. To the best of our knowledge, this expression for optimizing the CVE of a weighted average has not been previously reported. We note that the ȳ_j/Var(e_yj) weights correspond to the signal-to-noise ratios ȳ_j/√Var(e_yj) [36]. However, for data that are not linearly related, averaging results in the uncontrolled mixing of noise and signal and a loss of information. This implies that the treatment of measurement error in more general problems, including multiple and polynomial regression, requires nonlinear optimization methods. That explains why this work is limited to the treatment of MER for LRVVs. In the next section, we discuss the use of τ as a parameter for measurement error regression.
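The min(CVE) weighting of Eq 17 reduces to a few lines of NumPy; the two-vector example data and error variances below are illustrative assumptions.

```python
import numpy as np

def min_cve_weights(Y, var_e):
    """Optimal weights of Eq 17: w_j proportional to mean(y_j) / Var(e_yj) (sketch)."""
    snr = np.array([y.mean() / v for y, v in zip(Y, var_e)])
    return snr / snr.sum()

def weighted_average(Y, w):
    """Weighted average tau = sum_j w_j * y_j of the linearly related variable vectors."""
    return sum(wj * yj for wj, yj in zip(w, Y))

# Illustrative two-vector case with hypothetical error variances
rng = np.random.default_rng(2)
x = np.linspace(1.0, 25.0, 25) + rng.normal(0.0, 0.3, 25)
y = np.linspace(1.0, 25.0, 25) + rng.normal(0.0, 0.6, 25)
w = min_cve_weights([x, y], [0.3**2, 0.6**2])
tau = weighted_average([x, y], w)
```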

2.6 The chain rule in linear regression

By applying the chain rule [34] to Eq 2, we obtain the factorization

$\beta_{xy} = \frac{dy_i^{*}}{dx_i^{*}},$  (18)
$\quad\;\; = \frac{dy_i^{*}}{d\tau_i^{*}}\,\frac{d\tau_i^{*}}{dx_i^{*}},$  (19)
$\quad\;\; \leftrightarrow \left(\frac{dy_i^{*}}{d\tau_i^{*}},\ \frac{dx_i^{*}}{d\tau_i^{*}}\right),$  (20)
$\boldsymbol{\beta}_{xy} = (\beta_{\tau y},\ \beta_{\tau x}).$  (21)

The fractional form of Eq 19 is associated with scale invariance and corresponds to the homogeneous coordinates β_xy ↔ β_xy t, with t ∈ ℝ. The mapping from the scalar β_xy to the slope vector β_xy = (β_τy, β_τx) is equivalent to invoking the parametric representation (x(t), y(t)) for the linear relation between the variables (Eq 8). Therefore, the chain rule permits a generalization in which the slope in linear regression is parameterized by τ*, where β_xy corresponds to the point (β_xy, 1) on the projective line P¹ and to a perspective function of the slope vector β_xy [4, 33]. Consequently, Eq 21 serves as the basis for our generalization of the linear regression algorithm, where the weighted average τ serves to parameterize the relation between variable vectors. Analogously to Eq 9, the PLR slope vector is obtained as

$\mathbf{b}_{xy,\mathrm{PLR}} \equiv \frac{1}{\kappa_\tau}\left(b_{\tau y,\mathrm{WR}},\ b_{\tau x,\mathrm{WR}}\right),$  (22)

where the components of b_xy,PLR are estimated individually using the WR algorithm (Eq 4) with τ in the design matrix. Thus, x and y both serve as dependent variables. Without loss of generality, the factor κ_τ⁻¹ can be ignored because of the homogeneous coordinates equivalence. Therefore, we can regard τ as corresponding to a set of fixed values. Then, the parametric linear regression estimator for β_xy is

$b_{xy,\mathrm{PLR}} = \frac{b_{\tau y,\mathrm{WR}}}{b_{\tau x,\mathrm{WR}}}.$  (23)

In the special case where w_x = 1, τ = x and b_xy,PLR = b_xy,WR; x serves as both a dependent variable and an independent variable. Conversely, b_xy,PLR = b_yx,WR⁻¹ for w_y = 1, and y serves as an independent variable. Therefore, the fact that there is a bounded range of values for the MER slope (Eq 12) comes from the dependence of b_xy,PLR on τ, which ranges over conv(x, y). The monotonic decrease of b_xy,PLR with w_x is shown in Fig 2A, with the lower and upper bounds given by (b_xy,WR, 1) and (1, b_yx,WR), respectively; this is consistent with the OLR constraints in Eq 12. As we discuss in section 3.2, the extension of the PLR algorithm for estimating the MER parameters for a set of LRVVs Y = (y_j) is straightforward. Therefore, we make the parametric linear regression proposition.

Proposition 1. For a set of linearly related variable vectors Y = (y_j), the weighted average τ = ∑_j w_j y_j corresponding to the minimum coefficient of variation for error (Eq 17) provides the optimal measurement error regression estimate for the slope vector b_Y,PLR = (b_τyj,WR).

In Fig 1A and 1B, we show that the performance of the b_xy,PLR estimate is comparable to that of the WOR algorithm for a wide range of reliability ratios. The PLR algorithm provides predicted values for y_(i) (Fig 1E), but predicted values can also be estimated using the WP method (Eq 7). In the special case where the LRVVs are not subject to error and the points fall exactly on a straight line, we obtain κ_τ = 1 ∀ τ ∈ conv(Y); then any τ can serve as the parametric variable vector in PLR.
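A compact sketch of the PLR estimator (Eqs 22 and 23, Proposition 1) for a list of LRVVs is given below; the helper names and the simple diagonal variance handling are assumptions of this sketch rather than a reference implementation.

```python
import numpy as np

def wls_slope(t, y, sy2):
    """WR slope of y regressed on the parameter t (Eq 4), with error variances sy2."""
    X = np.column_stack([np.ones_like(t), t])
    Vinv = np.diag(1.0 / sy2)
    _, b = np.linalg.pinv(X.T @ Vinv @ X) @ (X.T @ Vinv @ y)
    return b

def plr_slope_vector(Y, var_e, s2):
    """PLR sketch (Proposition 1): Y is a list of LRVVs, var_e their scalar error
    variances (for the min-CVE weights), and s2 their per-observation variance vectors."""
    w = np.array([y.mean() / v for y, v in zip(Y, var_e)])
    w /= w.sum()
    tau = sum(wj * yj for wj, yj in zip(w, Y))              # weighted average parameter
    return np.array([wls_slope(tau, y, s) for y, s in zip(Y, s2)])

# For two LRVVs (x, y), the MER slope of y versus x is the component ratio (Eq 23):
# b_tau = plr_slope_vector([x, y], [var_ex, var_ey], [sx2, sy2]); b_xy = b_tau[1] / b_tau[0]
```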

Now, we consider the pedagogically important special case where Vx and Vy are both proportional to I. We obtain

$\mathbf{b}_{xy,\mathrm{PLR}} \to \mathbf{b}_{xy,\mathrm{POLR}} = \frac{1}{\mathrm{Var}(\tau)}\left(\mathrm{Cov}(\tau,\mathbf{y}),\ \mathrm{Cov}(\tau,\mathbf{x})\right),$  (24)
$\mathrm{Cov}(\mathbf{y},\mathbf{x}\mid\tau) \equiv \left(\mathrm{Cov}(\tau,\mathbf{y}),\ \mathrm{Cov}(\tau,\mathbf{x})\right),$  (25)
$\quad\;\; \leftrightarrow (b_{xy,\mathrm{POLR}},\ 1),$  (26)

where b_xy,POLR = Cov(τ, y)/Cov(τ, x) is a perspective function of the vector b_xy,POLR. In Eq 25, we introduce the parametric covariance vector notation Cov(y, x | τ) to indicate the homogeneous coordinates equivalence with b_xy,POLR. Thus, we note that covariance serves as a measure of linear dependence between variable vectors [21, 32] and is therefore subject to the chain rule. Consequently, we make the parametric covariance proposition.

Proposition 2. In the analysis of linear dependence for a set of LRVVs Y, the chain rule implies that the covariance is associated with a conditional vector form Cov(Y | τ) = (Cov(τ, y_j)), which is parameterized by the weighted average τ ∈ conv(Y).

Therefore, covariance is a vector quantity in the parametric representation for the linear relation between variable vectors. Then, there is a range for Cov(y, x | τ) in correspondence with b_xy,POLR and the bounded range for b_MER in Eq 12. For (w_x, w_y) given by Eq 17 and τ = w_x x + w_y y, we obtain the expression

$\frac{\mathrm{Cov}(\tau,\mathbf{y})}{w_y\,\mathrm{Var}(\mathbf{y})} = \frac{w_x\,\mathrm{Cov}(\mathbf{x},\mathbf{y})}{w_y\,\mathrm{Var}(\mathbf{y})} + 1,$  (27)

and observe that the right side is invariant to scaling of the data, (x, y) → (k_x x, k_y y); note that there is a similar factorization for Cov(τ, x).

$\left\{\frac{w_x\,\mathrm{Cov}(\mathbf{x},\mathbf{y})}{w_y\,\mathrm{Var}(\mathbf{y})},\ \frac{w_y\,\mathrm{Cov}(\mathbf{y},\mathbf{x})}{w_x\,\mathrm{Var}(\mathbf{x})}\right\},$

implies that w_x/w_y ∝ k_y/k_x, which is a necessary condition but not sufficient for specifying the optimal weights for τ; in section 3.1, we discuss Deming weighting, which also satisfies the scaling invariance property. Then, we obtain the homogeneous coordinates equivalence

$\left(\mathrm{Cov}(\tau,\mathbf{y}),\ \mathrm{Cov}(\tau,\mathbf{x})\right) \leftrightarrow \left(\frac{w_y\,\mathrm{Var}(\mathbf{y})}{w_x\,\mathrm{Var}(\mathbf{x})}\,\frac{\mathrm{Cov}(\tau,\mathbf{y})}{w_y\,\mathrm{Var}(\mathbf{y})},\ \frac{\mathrm{Cov}(\tau,\mathbf{x})}{w_x\,\mathrm{Var}(\mathbf{x})}\right),$

and observe the corresponding scaling relation

$\frac{w_y\,\mathrm{Var}(\mathbf{y})}{w_x\,\mathrm{Var}(\mathbf{x})} \to \frac{k_y}{k_x}\,\frac{w_y\,\mathrm{Var}(\mathbf{y})}{w_x\,\mathrm{Var}(\mathbf{x})}.$

Then, b_xy,POLR varies proportionally with the scaling (k_x x, k_y y); the b_xy,POLR estimate is consistent with respect to the scaling of the data. These covariance factorizations provide another demonstration of the importance of WLS optimization for partitioning residual effects in measurement error regression. The generalization of these expressions for a (y_j) dataset is straightforward. However, we note that Cov(τ, y_j) assumes that V_yj ∝ I and does not account for the heteroscedastic error in y_j. Thus, in practice, the PLR estimate (Eq 22) is preferred. Finally, we note that the two special cases

$\mathbf{b}_{xy,\mathrm{POLR}} = (b_{xy,\mathrm{OLR}},\ 1)$  (28)
$\quad\;\; \leftrightarrow \left(\mathrm{Cov}(\mathbf{x},\mathbf{y}),\ \mathrm{Var}(\mathbf{x})\right),$  (29)

for τ = x and

$\mathbf{b}_{xy,\mathrm{POLR}} = (1,\ b_{yx,\mathrm{OLR}})$  (30)
$\quad\;\; \leftrightarrow \left(\mathrm{Var}(\mathbf{y}),\ \mathrm{Cov}(\mathbf{y},\mathbf{x})\right),$  (31)

for τ = y, correspond to the bounding values in Eq 12. Therefore, Cov(x, y) and Cov(y, x) are numerically equal but serve as components for different parametric covariance vectors, and correspond to different estimates for MER slope, bxy,OLR and byx,OLR, respectively.

Now, we consider the multivariate generalization of the Pearson correlation coefficient r (Eq 1). Using the well-known correspondence between covariance and Pearson correlation [32],

$r_{\tau y_j} = \frac{\mathrm{Cov}(\tau,\mathbf{y}_j)}{\sqrt{\mathrm{Var}(\tau)\,\mathrm{Var}(\mathbf{y}_j)}} = \hat{\tau}_c \cdot \hat{\mathbf{y}}_{j,c},$

we transform Eq 25 and define a parametric correlation vector

$\mathbf{r}(\mathbf{x},\mathbf{y}\mid\tau) = (r_{\tau x},\ r_{\tau y}).$  (32)

Then, we observe that the Pearson correlation coefficient rτ(x, y) can be written in Cartesian tensor form as the double contraction of dyads

$r_\tau = \hat{\tau}_c\hat{\tau}_c : \hat{\mathbf{x}}_c\hat{\mathbf{y}}_c,$  (33)
$\quad\; = r_{\tau x}\, r_{\tau y}.$  (34)

The dyad x̂_c ŷ_c is a bilinear measure of the joint variation in x and y. In the special cases where τ̂_c = x̂_c or τ̂_c = ŷ_c, r_τ reduces to the standard Pearson correlation coefficient r (Eq 1). Geometrically, r_τ serves as a measure of the degree to which x_c and y_c are mutually aligned in the τ_c direction. Then, the generalization of Eq 33 as the m-fold contraction of polyads for the variable vectors (ŷ_j,c) ∈ ℝ^(n×m) serves as the basis for our definition of the parametric correlation coefficient:

$r_{\mathrm{PLR}} = \prod_{j=1}^{m} r_{\tau y_j},$  (35)

with m ≥ 2. The τ̂_c-polyads τ̂_c τ̂_c, τ̂_c τ̂_c τ̂_c, … are referred to as 2nd, 3rd, … order Pearson correlation tensors, respectively. Therefore, we associate a set of m LRVVs with a set of two-way, three-way, …, m-way weighted averages, PLRs, covariance vectors and PLR correlation coefficients. However, a detailed discussion of Cartesian tensors [37, 38] is beyond the scope of this paper. In Fig 2B, the concave-down variation of r_PLR = r_τx r_τy results from the product of the monotonically increasing r_τx function and decreasing r_τy function, with the endpoints equal to the Pearson correlation coefficient r. Thus, r_PLR varies with the expected error (s_x², s_y²), but Pearson's r serves as the lower bound. Then, r is attenuated compared to r_PLR when the data (x, y) are subject to measurement errors [13], as shown in Fig 1C. When the data (x, y) are highly correlated and Pearson's r approaches 1, the r_PLR curve becomes a horizontal line at r_PLR = 1 (or nearly so). We conclude that the parametric covariance (Eq 25), correlation (Eq 32) and PLR slope (Proposition 1) correspond to different perspective functions and coordinate systems for representing the linear relation between variable vectors, i.e., points dispersed about the MER line. Then, a condition for the interpretation of covariance and correlation is that the deviation of the points from the MER line must be consistent with the expected error (s_y(i)²). In section 3.1, we discuss the ambiguity that arises when the MER line is perturbed by outliers.
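A sketch of the parametric correlation coefficient (Eqs 32–35) in terms of centered unit vectors follows; the function names are illustrative.

```python
import numpy as np

def unit_centered(v):
    """Centered, unit-length vector used in the cosine form of the correlation (Eq 1)."""
    vc = v - v.mean()
    return vc / np.linalg.norm(vc)

def parametric_correlation(Y, tau):
    """r_PLR sketch (Eq 35): m-fold product of the correlations between tau and each LRVV."""
    tau_hat = unit_centered(tau)
    r_components = np.array([tau_hat @ unit_centered(y) for y in Y])   # (r_tau_yj)
    return r_components.prod(), r_components
```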

Analytical methods for the propagation of errors and the estimation of distributions for ratios [39, 40], proportions [41, 42], and correlation coefficients [43] are complicated by fractional transformations, bounded ranges, and discreteness. The propagation of errors for convoluted quantities such as bxy,PLR and rPLR is prohibitively complicated. Alternatively, Monte Carlo methods are practical approaches for estimating the distributions and confidence intervals of fractional statistical parameters, and they allow for the detailed simulation of stochastic effects in the data acquisition process as described previously [4]. The error bars shown in various figures in this paper are estimated using MC methods. Fig 4 shows an example of a two-dimensional MC histogram for the joint (rPLR, bxy,PLR) distribution. We note that the distributions for MER parameters are often skewed, so confidence intervals are estimated separately for + /− deviations from the median.
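As a sketch of the Monte Carlo approach, the data can be repeatedly perturbed according to the error model and the chosen estimator applied to each draw; the percentile-based summary below is one simple way to report separate +/− deviations from the median for skewed distributions.

```python
import numpy as np

def mc_distribution(x, y, sx, sy, estimator, n_draws=5000, seed=0):
    """Monte Carlo sketch: perturb (x, y) with the error model and collect estimates."""
    rng = np.random.default_rng(seed)
    draws = np.array([estimator(x + rng.normal(0.0, sx), y + rng.normal(0.0, sy))
                      for _ in range(n_draws)])
    lo, med, hi = np.percentile(draws, [2.5, 50.0, 97.5])
    return draws, (med - lo, hi - med)     # separate +/- deviations from the median
```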

Fig 4. Distribution of the statistical parameters in parametric linear regression.


A two-dimensional histogram (B) is calculated from 500000 Monte Carlo datasets for the data in Fig 1D. The distributions of rPLR (A) and bPLR (C) are averages obtained from 100 MC histograms, with 5000 MC datasets (x, y) per histogram. The 2σ error bars are estimates of the positive and negative deviations from the median and indicate the degree of convergence of the MC estimate. The (rPLR, bPLR) = (0.74, 1.44) estimates are indicated in red.

3 Data analysis and results

Using publicly available data, we provide practical examples that illustrate the ambiguity that arises when the variable vectors are not linearly related and the application of multidimensional parametric linear regression for analyzing the dispersion of RNA-Seq read count data.

3.1 Example 1: Measurement error regression with outliers

The data for this example come from Table 11.3.1 in [21] and originate from consulting work (R. Berger, personal communication). However, error estimates are not available for these data. For our demonstration, we provided a rudimentary error model with (s_x², s_y²) = (0.04 + 0.005x², 0.1 + 0.005y²), as shown by the 2σ xy-error bars in Fig 1D. The ordinary linear regression estimate for the slope is attenuated as expected (Eq 9). However, the orthogonal regression (OR) estimate reported in [21] is also biased because of the equal weight assumption for the residual effects (e_x, e_y). Using the defined notation for sums of squares,

$S_{xx} = \sum_i (x_i - \bar{x})^2,\quad S_{yy} = \sum_i (y_i - \bar{y})^2,\quad S_{xy} = \sum_i (x_i - \bar{x})(y_i - \bar{y}),$

and making the substitution (x, y) ↦ (√w_x x, √w_y y), we obtain a WLS OR estimate:

$b_{xy,\mathrm{OR}w} = \sqrt{\frac{w_x}{w_y}}\left(-B + \sqrt{B^2 + 1}\right),$  (36)

where the √(w_x/w_y) factor is needed for the transformation back to the original coordinates for (x, y), and the factor for the weighted residuals in (x, y) is

$B = \frac{1}{2S_{xy}}\left(\sqrt{\frac{w_x}{w_y}}\,S_{xx} - \sqrt{\frac{w_y}{w_x}}\,S_{yy}\right).$  (37)

The weights are determined from the signal-to-noise ratios (w_x, w_y) ∝ (x̄/Var(e_x), ȳ/Var(e_y)) with the constraints w_x + w_y = 1 and w_x, w_y ≥ 0 (Eq 17). Then, we note that B is invariant to the scaling of the data, (x, y) → (k_x x, k_y y). Therefore, the b_xy,ORw estimate is consistent with respect to the scaling of the data. The b_xy,OR estimate is recovered when w_x = w_y [20], and because of this constraint, b_xy,OR is not a consistent estimate since it does not vary proportionally with (k_x x, k_y y) scaling. There is better agreement between b_xy,WOR and b_xy,ORw (Fig 1D), but the latter does not account for the heteroscedastic errors (s_x², s_y²). The b_xy,PLR estimate is obtained from the separate regressions (b_τx,WR, b_τy,WR), as shown in Fig 1E. The differences between the b_xy,WOR and b_xy,PLR estimates are small compared to the experimental uncertainty. However, we observe that there are many positive and negative outliers in the data (Fig 1D), and only a subset of the points are close to any one of the MER lines. This implies that the data (x, y) are not linearly related, and the weighted average τ ∈ conv(x, y) is then subject to confounding effects. Then, WLS optimization produces misleading results, and the MER line does not possess a functional or operational interpretation. Criteria can be applied to select an LRVV subset for measurement error regression, but this works best when the outlier subset is small. Otherwise, linear regression is not recommended for the analysis of data that are not linearly related. In [4], we discuss the use of regression tree association graphs in the analysis of weakly correlated data. Finally, we note that the Deming orthogonal regression estimate [22] is obtained by substituting the alternative weighting (w_x, w_y) ∝ (1/Var(e_x), 1/Var(e_y)) into Eq 36. Our simulations confirm that Deming weighting has the scale invariance property, but it is not optimal for MER (Fig 3), and the corresponding parameter estimates are biased for both PLR and ORw (not shown).
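A sketch of the WLS orthogonal regression slope of Eqs 36 and 37 is shown below; it assumes the square-root weight substitution used in the reconstruction above, so with w_x = w_y it reduces to the OR estimate, and with (w_x, w_y) ∝ (1/Var(e_x), 1/Var(e_y)) it reproduces the Deming slope.

```python
import numpy as np

def wor_slope(x, y, wx, wy):
    """WLS orthogonal regression slope (Eqs 36 and 37 sketch), assuming the
    substitution (x, y) -> (sqrt(wx) x, sqrt(wy) y)."""
    xc, yc = x - x.mean(), y - y.mean()
    Sxx, Syy, Sxy = xc @ xc, yc @ yc, xc @ yc
    rho = np.sqrt(wx / wy)
    B = (rho * Sxx - Syy / rho) / (2.0 * Sxy)
    return rho * (-B + np.sqrt(B**2 + 1.0))
```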

3.2 Example 2: Conical dispersion for replicated RNA-Seq data

In this section, our objective is to demonstrate that multidimensional linear MER has practical applications. We expect that there are many applications for PLR in the ‘big data’ world, but the application for RNA-Seq data analysis is convenient because of our previous collaborations in gene expression research. In particular, we discuss the use of the PLR method in the analysis of read counting errors in a replicated RNA-Seq dataset. The estimation of the experimental errors and confidence intervals for effect size measures are important problems in RNA-Seq analysis because experiments are usually performed with very few replicates [6]. The replicated data correspond to the Cartesian product of LRVVs Y = (y_j) with y_j ∈ ℝⁿ, dim(Y) = n × m; y_ij is the read count for the ith RNA tag of the y_j replicate, n is the number of RNA tags, m is the number of replicates, and m ≪ n because RNA-Seq assays are highly multiplexed and n is large. The observation vectors for the RNA tags are denoted g_(i). Our error analysis algorithm is iterative and requires an initial guess for the expected error in the data (s_yj²)_init; the read counting errors are assumed to be independent or approximately so: Cov(e_yij, e_ykl) = 0 for (i, j) ≠ (k, l). In our implementation, the initial error estimate is of the form s_yj² = q_0 + q_2 y_j². Let min(g_(i)) correspond to a set of RNA tags with the lowest read counts, excluding those with ḡ_(i) = 0; i.e., min(g_(i)) is a set of points close to the origin of the regression graph. Then, q_0 is initially estimated from the variance for min(g_(i)); typically 1 ≤ q_0 ≤ 50. An initial value for q_2 is obtained by visual inspection of the data (Fig 5); typically 0.002 ≤ q_2 ≤ 0.05 (Fig 6 legend). Then, the optimal τ ∈ conv(Y) for the minimum CVE is estimated, and the PLR method provides parameter estimates for the MER line:

$\boldsymbol{\ell} = \mathbf{a}_{\min} + \mathbf{b}_{\mathrm{PLR}}\,t,$  (38)

with a_min equal to the mean of the min(g_(i)) set, t ∈ ℝ, and a_min, b_PLR ∈ ℝ^m. Then, the data are transformed so that ℓ passes through the origin with unit slope (Fig 5B & 5D); i.e., b_PLR = 1 and a_min ≈ 0. This adjustment for ℓ is important because afterwards the g_(i) are conically dispersed about the 1 line. Then, the g_(i) are partitioned into orthogonal components for the deviation δ_(i) and the mean ḡ_(i) = g_(i)·1/dim(g_(i)):

$\mathbf{g}_{(i)} = \boldsymbol{\delta}_{(i)} + \bar{g}_{(i)}\mathbf{1}.$  (39)

Fig 5. Conical quadratic error parameters for replicated RNA-Seq data.


Replicate Arabidopsis thaliana RNA-Seq data from [46] for the (μg, 4 reps) sample in (A, B) and the (1g, 3 reps) sample in (C, D) are scaled so that the parametric linear regression line corresponds to the unit slope line (b_PLR = 1) and passes through the origin. Then, the conical dispersion of points about the 1 line in the MER graph serves as a measure of the replication error, as shown in the y1 projections (B, D). An iterative WLS regression of the mean squared deviation versus the mean value (ḡ_(i)) produces parameter estimates for the quadratic error model (Eq 41), as shown in (A, C). The corresponding 2σ threshold curves are shown in (B, D). The linear component of the error (q_0 + q_1ḡ_(i)) is shown as ‘convergent’ dashed curves. The larger overdispersion q_2 in (C, D) compared to that in (A, B) indicates that there is a significant variation in data quality between the two experiments. The histogram (D inset) shows that the read counts range over four decades.

Fig 6. Overdispersion of the residual effects in replicated RNA-Seq data.


Empirical survival functions for the root mean square scaled deviation (δ′_rms) from the conical dispersion analysis (see Fig 5) of the RNA-Seq data are plotted for six A. thaliana samples under fractional gravity [46]. The χ²₃ and χ²₄ curves from the Monte Carlo simulation for the standard normal distribution are also shown. The slower decay in the observed δ′_rms indicates that RNA-Seq data are subject to heterogeneous dispersive effects. The inset shows the overdispersion for a 1000-point subset around the median ḡ_(i) for the μg sample. Thus, overdispersive effects are observed across the entire range of read counts. In the legend, each sample is labeled with its q_2 coefficient for the quadratic error model. The variation in q_2 indicates that there are significant data quality variations between samples.

Statistically, only the magnitude of δ_(i) is important because the orientation is random. Next, we introduce the mean squared deviation (MSD) $\overline{\delta_{(i)}^2} = \boldsymbol{\delta}_{(i)}\cdot\boldsymbol{\delta}_{(i)}/\dim(\boldsymbol{\delta}_{(i)})$, and we obtain the two-parameter representation for the dispersion $(\bar{g}_{(i)},\ \overline{\delta_{(i)}^2})$, as shown in Fig 5A and 5C. We adopt a heuristic quadratic model for the MSD [14, 44]

$\overline{\delta_{(i)}^2} = q_0 + q_1\bar{g}_{(i)} + q_2\bar{g}_{(i)}^2 + e_{(i)}.$  (40)

Then, a WLS quadratic regression of the MSD produces sample estimates for the qi (Fig 5A and 5C), which then serve as the error model for the input data:

$s_{y_{ij}}^2 = q_0 + q_1 y_{ij} + q_2 y_{ij}^2,$  (41)

as shown by the 2σ threshold curves in Fig 5B and 5D. The q_0 and q_1 terms correspond to the background noise and Poisson-like components of the dispersion, respectively [45]. The overdispersive component q_2 suggests that there are uncontrolled experimental effects that cannot be averaged in a random manner. We note that for a well-designed data acquisition process, the q_2 component should be minimized. The initial guess for (s_yj²)_init is updated, and the algorithm iterates several times to reach convergence for both ℓ and {q_i}.
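A sketch of the quadratic regression step (Eqs 40 and 41) is given below; the iterative reweighting by the predicted MSD is an assumption of this sketch, since the weights used in the WLS quadratic fit are not specified here.

```python
import numpy as np

def fit_quadratic_error_model(g_mean, msd, n_iter=5):
    """Sketch of the WLS quadratic regression of the MSD on the mean (Eq 40),
    returning (q0, q1, q2) for the error model of Eq 41; the reweighting scheme
    (weights 1/predicted^2) is an assumption of this sketch."""
    X = np.column_stack([np.ones_like(g_mean), g_mean, g_mean**2])
    q = np.linalg.pinv(X) @ msd                        # ordinary least squares start
    for _ in range(n_iter):
        pred = np.clip(X @ q, 1e-12, None)             # predicted MSD, kept positive
        Vinv = np.diag(1.0 / pred**2)
        q = np.linalg.pinv(X.T @ Vinv @ X) @ (X.T @ Vinv @ msd)
    return q
```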

The data for this demonstration come from a study of the response of Arabidopsis thaliana to fractional gravity [47] and can be retrieved from the publicly accessible NASA GeneLab Omics database [46]. The dataset contains read count data for six samples with varying gravities (g) {μ, 0.09, 0.18, 0.36, 0.57, 1}. For each sample, the replicates (y_j), with m = 3 or 4, correspond to a set of LRVVs with approximately identical noise properties: s_yj² ≈ s_yk² ∀ j, k, with the possible exception of a small subset of irreproducible outliers which must be flagged. The read counts are distributed over more than three decades (Fig 5D inset), and the point estimates are skewed due to the heavy tail effects at large values [6]. Consequently, an upper limit for the mean value (ḡ_(i) ≤ 2000) of the observation vectors is imposed, and 4% of the data are removed from the MER analysis; however, all of the data are included in the graphs (Figs 5 and 6). In Fig 5, we observe that the dispersion varies between the μg (A, B) and 1g (C, D) samples. As discussed above, the data are transformed such that the PLR slope is 1 and the curvature is negligible near the origin in log-log plots (B, D). We note that log-log scales are convenient for visualizing dispersion, but the data are not log-transformed in actual calculations; a log or ratio transformation results in nonlinear confounding effects for weighted averages. For duplicate read counts g_(i) = (x, y), we define the scaled difference

$d = \frac{x - y}{\sqrt{s_x^2 + s_y^2}},$  (42)

with the expected variance for the error in x − y given by the sum of the individual variances: s²_(x−y) = s_x² + s_y² [21]. Then, we apply the condition s_y² = q_0 + q_1 y + q_2 y² to obtain the quadratic equation

$\left(q_2 - \frac{1}{d^2}\right)y^2 + \left(q_1 + \frac{2x}{d^2}\right)y + q_0 + s_x^2 - \frac{x^2}{d^2} = 0.$  (43)
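For plotting, a minimal sketch of solving Eq 43 for the threshold y values follows; the function name, signature, and default d = 2 are illustrative.

```python
import numpy as np

def threshold_curve(x, sx2, q, d=2.0):
    """Roots of Eq 43: y values at which the scaled duplicate difference reaches |d|,
    given s_y^2 = q0 + q1*y + q2*y^2 (a sketch for plotting the 2-sigma envelope)."""
    q0, q1, q2 = q
    a = q2 - 1.0 / d**2
    b = q1 + 2.0 * x / d**2
    c = q0 + sx2 - x**2 / d**2
    disc = np.sqrt(b**2 - 4.0 * a * c)
    return (-b + disc) / (2.0 * a), (-b - disc) / (2.0 * a)
```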

The |d| = 2 threshold curves in Fig 5B and 5D are calculated from the roots of this expression. The q2 component dominates at high read counts, as indicated by the convergence of the dashed curves for the linear components (q0, q1). Then, q2 serves as an indicator of the data quality and reproducibility, and we observe significant variations between the samples (Fig 6 legend). We define the root mean square (rms) scaled deviation as

$\acute{\delta}_{\mathrm{rms},(i)} = \sqrt{\frac{\overline{\delta_{(i)}^2}}{s_{\bar{g}_{(i)}}^2}},$

where s²_ḡ(i) = q_0 + q_1ḡ_(i) + q_2ḡ_(i)². A comparison of the empirical survival distributions for δ′_rms,(i) shows that there are heavy tails at large deviations for the six samples (Fig 6). We use Monte Carlo methods to simulate the read count data for normal distributions N(ḡ_(i), s_ḡ(i), dim(g_(i))) and confirm that the simulated distributions for δ′_rms,(i) are identical to chi-squared distributions (Fig 6). The much slower decay in the experimental δ′_rms,(i) is an indication of uncontrolled processes in the RNA-Seq assay that result in overdispersion in the g_(i). The inset graph in Fig 6 shows the overdispersion for a small {g_(i)} subset around the median ḡ_(i) for the μg sample. Thus, the overdispersive effects in the δ′_rms,(i) are observed for intervals over the entire range of read counts. Statistical methods that assume normal distributions and do not account for this overdispersion will overestimate the effect size. In biological studies, the variable vectors for RNA-Seq are not linearly related. Then, our parametric LRVV framework serves an elementary role in searching for multiway dependency in the variation of gene expression. The detailed discussion of weighted least squares optimization problems, including normalization and the estimation of effect size, in gene expression data analysis is a topic for a separate paper.

4 Software

Numerical computations were performed using the python (v3.8.4) language with the NumPy (v1.19.0) [48] and SciPy (v1.5.0) [49] packages. Figures were prepared using the Matplotlib (v3.4.1) package [50]. The manuscript was prepared using the MiKTeX (v2.9) implementation of TeX/LaTeX [51].

5 Discussion

In this work, we discuss the fact that statistical measures of linear dependence, including covariance, correlation coefficient and regression slope, are subject to the chain rule. We develop a novel multidimensional linear regression algorithm where the relation between variable vectors is parameterized by a weighted average, and the weights are determined from an error model E for the input data. Then, the implementation of PLR involves the use of weighted least squares normal equations to estimate the parameters of the best fit line for a set of LRVVs. Standard statistical concepts, including covariance and the Moore-Penrose pseudoinverse, provide the necessary intuition for the formulation of our PLR method. This contrasts with the ODRPACK algorithm which requires iterative numeric optimization because the WOR least-squares equation cannot be solved analytically. The PLR and ODRPACK methods yield very similar results but the latter is formulated only for bivariate data. We find that in the parametric representation of a linear relation, covariance and correlation are associated with vector quantities. We also identify scale invariant quantities for partitioned residual effects between the LRVVs for both the PLR and weighted orthogonal regression methods. Therefore, when E is undefined, MER is not a well-posed problem [52]; i.e., there is not a unique solution. This supports previous suggestions that the specification of E is a requirement for solving the MER problem [27]. E is also essential for determining the distributions and corresponding confidence intervals of PLR parameter estimates and effect size. Constructing a realistic error model for a data acquisition process can be daunting, especially in omics studies, because it is necessary to account for all relevant sources of experimental error and system variability. However, an approximate trial and error approach can be informative, and Monte Carlo methods are convenient because they allow for the detailed simulation of stochastic effects in the data acquisition process. Finally, the averaging of variable vectors that are not linearly related results in the loss of information because of the destructive interference associated with the mixing of signal and noise. Consequently, accounting for measurement error in polynomial and multiple regression requires nonlinear optimization methods, which is a topic for a separate paper. We conclude that the parametric representation for the relation between variable vectors provides a more general framework for linear regression compared to the standard Cartesian representation.

Acknowledgments

I thank many former colleagues in the Genetic Discovery group at DuPont for many helpful discussions about genome-wide association and eQTL methods. I also especially thank my biologist collaborator and spouse, Ada Ching.

Data Availability

The RNA-Seq GLDS-251 dataset is available from the GeneLab Omics database at https://genelab-data.ndc.nasa.gov/genelab/accession/GLDS-251/. The textbook, "Statistical Inference" by G. Casella and R. L. Berger, also serves as a data source.

Funding Statement

The author, Stanley Luck, is a member of Vector Analytics LLC, which is a science consulting company. The funder provided support in the form of salaries for authors [SL], but did not have any additional role in the study design, data collection and analysis, decision to publish, or preparation of the manuscript. The specific roles of these authors are articulated in the ‘author contributions’ section.

References

1. Beló A, Zheng P, Luck S, Shen B, Meyer DJ, Li B, et al. Whole genome scan detects an allelic variant of fad2 associated with increased oleic acid levels in maize. Molecular Genetics and Genomics. 2008;279(1):1–10. doi: 10.1007/s00438-007-0289-y
2. Holloway B, Luck S, Beatty M, Rafalski JA, Li B. Genome-wide expression quantitative trait loci (eQTL) analysis in maize. BMC Genomics. 2011;12(1):336. doi: 10.1186/1471-2164-12-336
3. Luck S. Factoring a 2 x 2 contingency table. PLOS ONE. 2019;14(10):e0224460. doi: 10.1371/journal.pone.0224460
4. Luck S. Nonoverlap proportion and the representation of point-biserial variation. PLOS ONE. 2020;15(12):e0244517. doi: 10.1371/journal.pone.0244517
5. Nakagawa S, Cuthill IC. Effect size, confidence interval and statistical significance: a practical guide for biologists. Biological Reviews. 2007;82(4):591–605. doi: 10.1111/j.1469-185X.2007.00027.x
6. Conesa A, Madrigal P, Tarazona S, Gomez-Cabrero D, Cervera A, McPherson A, et al. A survey of best practices for RNA-seq data analysis. Genome Biology. 2016;17(1):1–19. doi: 10.1186/s13059-016-1047-4
7. Moseley HNB. Error analysis and propagation in metabolomics data analysis. Computational and Structural Biotechnology Journal. 2013;4(5):e201301006. doi: 10.5936/csbj.201301006
8. Draper NR, Smith H. Applied Regression Analysis. 3rd ed. Wiley Series in Probability and Statistics. Wiley; 1998.
9. Press WH, Teukolsky SA, Vetterling WT, Flannery BP. Numerical Recipes: The Art of Scientific Computing. 3rd ed. New York: Cambridge University Press; 2007.
10. Carroll RJ, Wang Y. Nonparametric variance estimation in the analysis of microarray data: a measurement error approach. Biometrika. 2008;95(2):437–449. doi: 10.1093/biomet/asn017
11. Wang B, Zhang SG, Wang XF, Tan M, Xi Y. Testing for Differentially-Expressed MicroRNAs with Errors-in-Variables Nonparametric Regression. PLoS ONE. 2012;7(5):e37537. doi: 10.1371/journal.pone.0037537
12. Liao J, Li X, Wong TY, Wang JJ, Khor CC, Tai ES, et al. Impact of Measurement Error on Testing Genetic Association with Quantitative Traits. PLoS ONE. 2014;9(1):e87044. doi: 10.1371/journal.pone.0087044
13. Saccenti E, Hendriks MHWB, Smilde AK. Corruption of the Pearson correlation coefficient by measurement error and its estimation, bias, and correction under different error models. Scientific Reports. 2020;10(1):438. doi: 10.1038/s41598-019-57247-4
14. Luck SD. Normalization and error estimation for biomedical expression patterns. In: Bittner ML, Chen Y, Dorsel AN, Dougherty ER, editors. Proceedings of SPIE—The International Society for Optical Engineering. vol. 4266; 2001. p. 153–157. Available from: http://proceedings.spiedigitallibrary.org/proceeding.aspx?articleid=901072.
15. Stefanski LA. Measurement Error Models. Journal of the American Statistical Association. 2000;95(452):1353. doi: 10.1080/01621459.2000.10474347
16. Cantrell CA. Technical Note: Review of methods for linear least-squares fitting of data and application to atmospheric chemistry problems. Atmospheric Chemistry and Physics. 2008;8(17):5477–5487. doi: 10.5194/acp-8-5477-2008
17. Fuller WA. Measurement Error Models. Wiley Series in Probability and Statistics. Hoboken, NJ: John Wiley & Sons; 2006.
18. Carroll RJ, Ruppert D, Stefanski LA, Crainiceanu CM. Measurement Error in Nonlinear Models: A Modern Perspective. 2nd ed. CRC Press; 2006.
19. Buonaccorsi JP. Measurement error: models, methods, and applications. Boca Raton: Chapman and Hall/CRC; 2010.
20. Cheng CL, Ness JWV. Statistical Regression with Measurement Error. New York: Wiley; 1999.
21. Casella G, Berger R. Statistical Inference. 2nd ed. Pacific Grove, CA: Duxbury; 2002.
22. Gillard J, Iles T. Methods of Fitting Straight Lines where Both Variables are Subject to Measurement Error. Current Clinical Pharmacology. 2009;4(3):164–171.
23. Markovsky I, Van Huffel S. Overview of total least-squares methods. Signal Processing. 2007;87(10):2283–2302. doi: 10.1016/j.sigpro.2007.04.004
24. Spiess AN. Orthogonal Nonlinear Least-Squares Regression; 2015. Available from: https://cran.r-project.org/package=onls.
25. Boggs PT, Byrd RH, Rogers JE, Schnabel RB. Orthogonal Distance Regression. In: SciPy v.1.5.2 Reference Guide. SciPy; 2020. Available from: https://docs.scipy.org/doc/scipy/reference/odr.html.
26. Boggs PT, Byrd RH, Rogers JE, Schnabel RB. User's Reference Guide for ODRPACK Version 2.01: Software for Weighted Orthogonal Distance Regression; 1992.
27. Cheng CL, Riu J. On Estimating Linear Relationships When Both Variables Are Subject to Heteroscedastic Measurement Errors. Technometrics. 2006;48(4):511–519. doi: 10.1198/004017006000000237
28. Cumming G. Understanding The New Statistics. New York, NY: Routledge; 2012.
29. Kelley K, Preacher KJ. On effect size. Psychological Methods. 2012;17(2):137–152. doi: 10.1037/a0028086
30. Puntanen S, Styan GPH, Isotalo J. Matrix Tricks for Linear Statistical Models. Berlin, Heidelberg: Springer; 2011.
31. Noble B, Daniel JW. Applied Linear Algebra. 2nd ed. Englewood Cliffs: Prentice-Hall; 1977.
32. Rodgers JL, Nicewander WA. Thirteen Ways to Look at the Correlation Coefficient. The American Statistician. 1988;42(1):59–66. doi: 10.2307/2685263
33. Boyd SP, Vandenberghe L. Convex Optimization. New York, NY: Cambridge University Press; 2004.
34. Boas ML. Mathematical Methods in the Physical Sciences. 2nd ed. New York, NY: John Wiley & Sons; 1983.
35. Schneeweiss H, Shalabh H. On the estimation of the linear relation when the error variances are known. Computational Statistics & Data Analysis. 2007;52(2):1143–1148. doi: 10.1016/j.csda.2007.06.018
36. Voigtman E. Comparison of Signal-to-Noise Ratios. Analytical Chemistry. 1997;69(2):226–234. doi: 10.1021/ac960675d
37. Drew TB. Handbook of Vector and Polyadic Analysis. New York, NY: Reinhold; 1961.
38. Lebedev LP, Cloud MJ, Eremeyev VA. Tensor Analysis with Applications in Mechanics. World Scientific; 2010.
39. Marsaglia G. Ratios of Normal Variables. Journal of Statistical Software. 2006;16(4):1–10. doi: 10.18637/jss.v016.i04
40. von Luxburg U, Franz VH. A Geometric Approach to Confidence Sets for Ratios: Fieller's Theorem, Generalizations, and Bootstrap. Statistica Sinica. 2009;19:1095–1117.
41. Newcombe RG. Interval estimation for the difference between independent proportions: comparison of eleven methods. Statistics in Medicine. 1998;17(8):873–890.
42. Agresti A. Dealing with discreteness: making 'exact' confidence intervals for proportions, differences of proportions, and odds ratios more exact. Statistical Methods in Medical Research. 2003;12(1):3–21. doi: 10.1191/0962280203sm311ra
43. Bishara AJ, Hittner JB. Reducing Bias and Error in the Correlation Coefficient Due to Nonnormality. Educational and Psychological Measurement. 2015;75(5):785–804. doi: 10.1177/0013164414557639
44. Weng L, Dai H, Zhan Y, He Y, Stepaniants SB, Bassett DE. Rosetta error model for gene expression analysis. Bioinformatics. 2006;22(9):1111–1121. doi: 10.1093/bioinformatics/btl045
45. Alkemade CTJ, Snelleman W, Boutilier GD, Pollard BD, Winefordner JD, Chester TL, et al. A review and tutorial discussion of noise and signal-to-noise ratios in analytical spectrometry—I. Fundamental principles of signal-to-noise ratios. Spectrochimica Acta Part B: Atomic Spectroscopy. 1978;33(8):383–399. doi: 10.1016/0584-8547(78)80049-4
46. Herranz R, Vandenbrink JP, Medina F, Kiss JZ. GLDS-251: RNAseq analysis of the response of Arabidopsis thaliana to fractional gravity under blue-light stimulation during spaceflight; 2019. Available from: https://genelab-data.ndc.nasa.gov/genelab/accession/GLDS-251/.
47. Herranz R, Vandenbrink JP, Villacampa A, Manzano A, Poehlman WL, Feltus FA, et al. RNAseq Analysis of the Response of Arabidopsis thaliana to Fractional Gravity Under Blue-Light Stimulation During Spaceflight. Frontiers in Plant Science. 2019;10(November):1–11. doi: 10.3389/fpls.2019.01529
  • 48. Harris CR, Millman KJ, van der Walt SJ, Gommers R, Virtanen P, Cournapeau D, et al. Array programming with NumPy. Nature. 2020;585(7825):357–362. doi: 10.1038/s41586-020-2649-2 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 49. Virtanen P, Gommers R, Oliphant TE, Haberland M, Reddy T, Cournapeau D, et al. SciPy 1.0: fundamental algorithms for scientific computing in Python. Nature Methods. 2020;17(3):261–272. doi: 10.1038/s41592-019-0686-2 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 50. Hunter JD. Matplotlib: A 2D Graphics Environment. Computing in Science & Engineering. 2007;9(3):90–95. doi: 10.1109/MCSE.2007.55 [DOI] [Google Scholar]
  • 51.Schenk C. MiKTeX; 2021. Available from: http://miktex.org/.
  • 52. Logan JD. Applied Mathematics. 2nd ed. New York, NY: John Wiley & Sons, Inc.; 1997. [Google Scholar]

Decision Letter 0

Sriparna Saha

22 Sep 2021

PONE-D-21-22997

The chain rule, measurement error regression and RNA-Seq analysis

PLOS ONE

Dear Dr. Luck,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

Please incorporate the suggestions/comments given by both the reviewers and revise the paper accordingly.

Please submit your revised manuscript by Nov 06 2021 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols.

We look forward to receiving your revised manuscript.

Kind regards,

Sriparna Saha, PhD

Academic Editor

PLOS ONE

Journal Requirements:

When submitting your revision, we need you to address these additional requirements.

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf.

2. Thank you for providing the following Funding Statement:  

The author, Stanley Luck, is a member of Vector Analytics LLC, which is a science consulting company. Vector Analytics LLC did not have any additional role in the study design, data collection and analysis, decision to publish, or preparation of the manuscript. The specific roles of the author are articulated in the ‘author contributions’ section. The author received no specific funding for this work.

We note that one or more of the authors is affiliated with the funding organization, indicating the funder may have had some role in the design, data collection, analysis or preparation of your manuscript for publication; in other words, the funder played an indirect role through the participation of the co-authors. 

If the funding organization did not play a role in the study design, data collection and analysis, decision to publish, or preparation of the manuscript and only provided financial support in the form of authors' salaries and/or research materials, please review your statements relating to the author contributions, and ensure you have specifically and accurately indicated the role(s) that these authors had in your study in the Author Contributions section of the online submission form. Please make any necessary amendments directly within this section of the online submission form.  Please also update your Funding Statement to include the following statement: “The funder provided support in the form of salaries for authors [insert relevant initials], but did not have any additional role in the study design, data collection and analysis, decision to publish, or preparation of the manuscript. The specific roles of these authors are articulated in the ‘author contributions’ section.” 

If the funding organization did have an additional role, please state and explain that role within your Funding Statement. 

"Please also provide an updated Competing Interests Statement declaring this commercial affiliation along with any other relevant declarations relating to employment, consultancy, patents, products in development, or marketed products, etc.  

Within your Competing Interests Statement, please confirm that this commercial affiliation does not alter your adherence to all PLOS ONE policies on sharing data and materials by including the following statement: "This does not alter our adherence to  PLOS ONE policies on sharing data and materials.” (as detailed online in our guide for authors http://journals.plos.org/plosone/s/competing-interests) . If this adherence statement is not accurate and  there are restrictions on sharing of data and/or materials, please state these. Please note that we cannot proceed with consideration of your article until this information has been declared.

3. Please include captions for your Supporting Information files at the end of your manuscript, and update any in-text citations to match accordingly. Please see our Supporting Information guidelines for more information: http://journals.plos.org/plosone/s/supporting-information. 

Additional Editor Comments:

As commented by both the reviewers, the paper should be revised based on the suggestions provided.


Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

Reviewer #2: No

**********

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: No

**********

3. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: No

Reviewer #2: Yes

**********

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

**********

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: In this paper, the author proposed “The chain rule, measurement error regression and RNA-Seq analysis”.

The strengths of the paper are that it is well structured, the description of the related work is well done, and the results are extensively compared with those of similar research.

1) Draw a graphical abstract of this work.

2) Justify the novelty of the proposed approach.

3) Proofread the article once again.

4) The experimental validation of the proposed approach is quite confusing; please justify it.

5) Give some experimental comparison of the proposed approach with existing approaches.

Reviewer #2: The manuscript reads more like a technical report (heavy on methodological description). To attract strong interest, it is highly recommended that the manuscript be revised to focus on the major objectives and findings in a concise manner. I also have some comments, which are given below.

1. Title: Consider modifying it to be more specific. The wording of this current title ("The chain rule, measurement error regression and RNA-Seq analysis") is somewhat imprecise.

2. On pages 3 and 4 the author mentions seven novel contributions of his work, but they are not clearly established in the paper as a whole.

3. It is recommended that the “data analysis and results” part be revised in detail.

4. The discussion part (pages 23 and 24) does not clearly state the point of view of this paper. Revision of this part is needed.

5. I recommend that the author revise Fig. 1, Fig. 5 and Fig. 6 to illustrate the workflow.

**********

6. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

Decision Letter 1

Sriparna Saha

19 Dec 2021

A parametric framework for multidimensional linear measurement error regression

PONE-D-21-22997R1

Dear Dr. Luck,

We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements.

Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication.

An invoice for payment will follow shortly after the formal acceptance. To ensure an efficient process, please log into Editorial Manager at http://www.editorialmanager.com/pone/, click the 'Update My Information' link at the top of the page, and double check that your user information is up-to-date. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

Kind regards,

Sriparna Saha, PhD

Academic Editor

PLOS ONE

Additional Editor Comments (optional):

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

Reviewer #1: All comments have been addressed

Reviewer #2: All comments have been addressed

**********

2. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

Reviewer #2: Yes

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: Yes

**********

4. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: No

Reviewer #2: Yes

**********

6. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: All my comments are clearly addressed by the author, and the explanations given are impressive.

Reviewer #2: The author tried to make the changes according to the instructions. It looks better than the previous version, though the introduction portion is not up to the mark.

**********

7. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

Acceptance letter

Sriparna Saha

22 Dec 2021

PONE-D-21-22997R1

A parametric framework for multidimensional linear measurement error regression

Dear Dr. Luck:

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now with our production department.

If your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information please contact onepress@plos.org.

If we can help with anything else, please email us at plosone@plos.org.

Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,

PLOS ONE Editorial Office Staff

on behalf of

Dr. Sriparna Saha

Academic Editor

PLOS ONE

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Supplementary Materials

    Attachment

    Submitted filename: ResponseToReviewers.pdf

    Data Availability Statement

    The RNA-Seq GLDS-251 dataset is available from the GeneLab Omics database at https://genelab-data.ndc.nasa.gov/genelab/accession/GLDS-251/. The textbook, "Statistical Inference" by G. Casella and R. L. Berger, also serves as a data source.

