Author manuscript; available in PMC 2019 Aug 8.
Published in final edited form as: IEEE/ACM Trans Comput Biol Bioinform. 2018 Jan 8;16(2):442–454. doi: 10.1109/TCBB.2018.2790918

A Unified Model for Joint Normalization and Differential Gene Expression Detection in RNA-Seq data

Kefei Liu 1, Jieping Ye 2, Yang Yang 3, Li Shen 4, Hui Jiang 5
PMCID: PMC6686202  NIHMSID: NIHMS1043996  PMID: 29993952

Abstract

RNA sequencing (RNA-seq) is becoming increasingly popular for quantifying gene expression levels. Since RNA-seq measurements are relative in nature, between-sample normalization of counts is an essential step in differential expression (DE) analysis. In existing DE detection algorithms, normalization is ad hoc and performed once and for all prior to DE detection, which may be suboptimal since ideally normalization should be based on non-DE genes only and thus coupled with DE detection. We propose a unified statistical model for joint normalization and DE detection of log-transformed RNA-seq data. Sample-specific normalization factors are modeled as unknown parameters in the gene-wise linear models and jointly estimated with the regression coefficients. By imposing a sparsity-inducing L1 penalty (or mixed L1/L2 penalty for multiple treatment conditions) on the regression coefficients, we formulate the problem as a penalized least-squares regression problem and apply the augmented Lagrangian method to solve it. Simulation studies show that the proposed model and algorithms perform better than or comparably to existing methods in terms of detection power and false-positive rate. The performance gain increases with larger sample size or higher signal-to-noise ratio, and is more significant when a large proportion of genes are differentially expressed in an asymmetric manner.

Index Terms—RNA-Seq, differential expression analysis, normalization, linear regression, L1-norm regularization, augmented Lagrangian method

1. INTRODUCTION

Ultra high-throughput sequencing of transcriptomes (RNA-seq) is a widely used method for quantifying gene expression levels due to its low cost, high accuracy and wide dynamic range for detection [1]. As of today, modern ultra high-throughput sequencing platforms can generate hundreds of millions of sequencing reads from each biological sample in a single day. RNA-seq also facilitates the detection of novel transcripts [2] and the quantification of transcripts at the isoform level [3], [4]. For these reasons, RNA-seq has become the method of choice for assaying transcriptomes [5].

One major limitation of RNA-seq is that it only provides relative measurements of transcript abundances due to differences in library size (i.e., sequencing depth) between samples [6]. Normalization of RNA-seq read counts is required in gene differential expression analysis to correct for such variation between samples. A popular form of between-sample normalization is achieved by scaling the raw read counts in each sample by a sample-specific factor related to library size [6], [7]. These include CPM/RPM (counts/reads per million) [8], quantile normalization [9], [10], upper-quartile normalization [11], trimmed mean of M values [8] and DESeq normalization [12]. Commonly used gene expression measures such as TPM (transcripts per million) [13] and RPKM/FPKM (reads/fragments per kilobase of exon per million mapped reads) [1], [2] also correct for differences in gene length within a sample [14] (the so-called within-sample normalization). In particular, the CPM, FPKM and TPM for the i-th gene from the j-th sample are respectively defined as

$$\mathrm{cpm}_{ij} = \frac{10^6\, c_{ij}}{N_j}, \qquad \mathrm{fpkm}_{ij} = \frac{10^9\, c_{ij}}{\ell_i N_j}, \qquad \mathrm{tpm}_{ij} = \frac{10^6\, c_{ij}/\ell_i}{\sum_i c_{ij}/\ell_i}, \tag{1}$$

where $c_{ij}$ is the observed read count for gene i from the j-th sample, $N_j = \sum_i c_{ij}$ is the sequencing depth of the j-th sample, and $\ell_i$ is the length of gene i. In this work we focus on between-sample normalization.
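For concreteness, the three measures in (1) can be computed from a count matrix as in the following sketch (Python/NumPy; the inputs `c` and `length` are hypothetical):

```python
import numpy as np

def expression_measures(c, length):
    """c: m x n raw count matrix; length: gene lengths in bases, shape (m,)."""
    N = c.sum(axis=0)                        # sequencing depth N_j per sample
    cpm = 1e6 * c / N                        # counts per million
    fpkm = 1e9 * c / (length[:, None] * N)   # per kilobase per million mapped reads
    rate = c / length[:, None]               # length-normalized counts c_ij / l_i
    tpm = 1e6 * rate / rate.sum(axis=0)      # transcripts per million
    return cpm, fpkm, tpm
```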

In traditional count-based RNA-seq analysis methods, the read counts for each gene are assumed to follow a Poisson [15] or negative binomial (NB) distribution. One issue with count-based RNA-seq analysis methods is that their procedures are complicated and contain many ad hoc heuristics. Moreover, the Poisson and NB distributions are mathematically less tractable than the normal distribution [16], [17], which makes count-based methods difficult to generalize to new data. In addition, commonly used statistical methods for microarray data analysis, e.g., quality weighting of RNA samples, addition of random noise to generate technical replicates, and gene set tests [16], have been designed for normally distributed data, and it is unclear whether they can be adapted to count data. The presence of outliers is another issue that is addressed only in a very ad hoc manner by existing methods. To handle these issues, the authors of [16] take the logarithm of the raw read counts and apply normal distribution-based statistical methods to analyze them. The logarithmic transformation compresses the dynamic range of the RNA-seq counts such that outlier counts are largely transformed into “normal” data. As a result, sophisticated procedures to detect and discard outliers [18], [19], [20] are not required.

In this paper, as in [16], [17], we work with log-transformed gene expression values and propose a unified statistical model for differential gene expression. Different from [16], [17], we model the sample-specific scaling factors for between-sample normalization as unknown parameters and incorporate them into the gene-wise linear models. By imposing a sparsity-inducing penalty (L1 penalty for a single treatment factor and mixed L1/L2 penalty for multiple treatment factors) on the regression coefficients and carefully choosing the tuning parameter, the model is able to achieve jointly accurate detection of DE genes and between-sample normalization. To fit the model, we first eliminate the sample-specific parameters analytically to formulate the problem as a penalized linear regression problem, and then solve it with the alternating direction method of multipliers (ADMM), which is known for its fast convergence to modest accuracy [21]. Regarding the choice of tuning parameter, we theoretically derive the smallest tuning parameter $\alpha_{\max}$ that leads to the all-zero solution, and thereby find a proper tuning parameter within $[0, \alpha_{\max}]$.

Our work is preceded by [22], which addresses the differential expression problem in a similar way. The difference is that the model of [22] considers only categorical or qualitative predictor/explanatory variables (treatment conditions). For example, label “0” is assigned to samples from the control group and label “1” to samples from the treatment group. In our model, by contrast, the predictor/explanatory variables can take arbitrary numeric values, so our model generalizes [22] from the discrete to the continuous predictor-variable case. Note that the algorithm in [22] does not apply to the numeric-variable model at hand, because (i) applicability: it requires that multiple samples be present in each group, but in the continuous-predictor model the concept of “group” no longer exists, or more precisely, each group contains only one sample; and (ii) algorithmic complexity: it requires a p-dimensional exhaustive search, where p is the number of treatment conditions. When p > 1 (see Section 4), the algorithm is computationally very expensive.

The remainder of the paper is organized as follows. In Section 2, we formulate the problem in the context of a single treatment factor. In Section 3, we recast the problem as a penalized simple regression problem and derive an efficient ADMM algorithm to solve it, together with the estimation of the noise variances and the tuning parameter. In Section 4, we extend the simple regression model to a multiple linear regression model. Comparison with existing methods is presented in Section 5, followed by discussion in Section 6.

2. DATA MODEL AND PROBLEM FORMULATION

Throughout the paper, the subscript and superscript are used to index the rows and columns of a matrix, respectively. For example, the i-th row and the j-th column of a matrix A are denoted by $a_i$ and $a^j$, respectively. Note that this does not conform to conventional notation, where the subscript indexes the columns of a matrix and the superscript indexes the rows.

2.1. Data model

Suppose there are a total of m genes measured in n samples. Let yij, i = 1, 2, …, m, j=1, 2, …, n, be the log-transformed gene expression measurements (a small positive number is usually added before taking logarithm) for the i-th gene from the j-th sample. The following statistical model is assumed

$$y_{ij} = \beta_{i0} + \beta_i x_j + d_j + \varepsilon_{ij}, \tag{2}$$

where $\beta_{i0}$ is the y-intercept for gene i; $x_j$, j = 1, 2, …, n, is the predictor variable representing the treatment condition (e.g., drug dosage) for sample j; $\beta_i$ is the slope or regression coefficient representing the log-fold change of the expression level of gene i per unit change of $x_j$; $d_j$ is the scaling factor (e.g., log(sequencing depth) or log(library size)) for sample j for between-sample normalization [6]; and $\varepsilon_{ij} \sim N(0, \sigma_i^2)$ models the measurement noise. We assume that the error terms $\varepsilon_{ij}$ are uncorrelated with the predictor variable and uncorrelated with each other (across both genes i and samples j).

In (2), we consider a single treatment condition. Extension to models with multiple treatment conditions will be discussed in Section 4.

Our main interest is to detect differentially expressed (DE) genes, i.e., to determine whether $\beta_i$ is equal to zero. If $\beta_i \ne 0$, gene i is differentially expressed across the n samples; otherwise it is not.

Remark 2.1. Since $\beta_{i0}$ and $d_j$ in (2) respectively model gene-specific factors (e.g., gene length) and sample-specific factors, model (2) is able to work with any log-transformed gene expression measure of the form

$$y_{ij} = \log\frac{c_{ij}}{\ell_i q_j}, \tag{3}$$

where $c_{ij}$ is the raw count, $\ell_i$ is the length of gene i and $q_j$ is the normalization factor of the j-th sample, since $\ell_i$ and $q_j$ can be absorbed into $\beta_{i0}$ and $d_j$, respectively. Note that gene expression measures of the form $c_{ij}/(\ell_i q_j)$ include the raw counts (with $\ell_i = q_j = 1$), measures based on between-sample normalization only ($\ell_i = 1$) [6], and FPKM and TPM, which are shown in (1) and involve both between- and within-sample normalization.

2.2. Penalized likelihood

The likelihood function based on the measured data is given by

$$L(\beta_0, \beta, d; y) = \prod_{i=1}^m \prod_{j=1}^n \frac{1}{\sqrt{2\pi\sigma_i^2}} \exp\left\{-\frac{(y_{ij} - \beta_{i0} - \beta_i x_j - d_j)^2}{2\sigma_i^2}\right\}, \tag{4}$$

where

$$\beta = (\beta_1, \beta_2, \ldots, \beta_m)^T.$$

Assuming that $\{\sigma_i^2\}_{i=1}^m$ are known, maximizing (4) is equivalent to minimizing the negative log-likelihood

$$\ell(\beta_0, \beta, d) = \sum_{i=1}^m \sum_{j=1}^n \frac{1}{2\sigma_i^2}(y_{ij} - \beta_{i0} - \beta_i x_j - d_j)^2, \tag{5}$$

where we have dropped the irrelevant additive constant.

In practice, we solve for {σi2}i=1m using an ad hoc approach, which will be described in Section 3.4.

We introduce an L1 penalty on the $\beta_i$'s,

$$p(\beta) = \alpha\|\beta\|_1 := \alpha\sum_{i=1}^m |\beta_i|. \tag{6}$$

It is well known that the L1 penalty favors sparse solutions (it forces some coefficients to be exactly zero) [23]. This is reasonable since in practice many genes are not differentially expressed.

The objective function to be minimized is

$$f(\beta_0, \beta, d) = \sum_{i=1}^m \sum_{j=1}^n \frac{1}{2\sigma_i^2}(y_{ij} - \beta_{i0} - x_j\beta_i - d_j)^2 + \alpha\sum_{i=1}^m|\beta_i|. \tag{7}$$

3. ALGORITHM DEVELOPMENT

3.1. Formulation of (7) as Penalized Simple Linear Regression Model

It can be proved that the optimization problem in (7) is jointly convex in $(\beta_0, \beta, d)$. Therefore, any stationary point of (7) is a global minimizer.

The derivative of $f(\beta_0, \beta, d)$ with respect to $d_j$, j = 1, 2, …, n, is

$$\frac{\partial f}{\partial d_j} = -\sum_{i=1}^m \frac{1}{\sigma_i^2}(y_{ij} - \beta_{i0} - x_j\beta_i - d_j). \tag{8}$$

Setting (8) to zero gives

$$d_j = \frac{1}{\sum_{i=1}^m \frac{1}{\sigma_i^2}} \sum_{i=1}^m \frac{1}{\sigma_i^2}(y_{ij} - \beta_{i0} - x_j\beta_i). \tag{9}$$

Model (2) is non-identifiable because we can add any constant to all the $d_j$'s and subtract the same constant from all the $\beta_{i0}$'s while obtaining the same fit. To resolve this issue, we fix $d_1 = 0$. Therefore

$$d_j = d_j - d_1 = \left(\bar{y}_{\cdot j}^{(w)} - \bar{y}_{\cdot 1}^{(w)}\right) - (x_j - x_1)\bar{\beta}^{(w)}, \tag{10}$$

where

$$\bar{y}_{\cdot j}^{(w)} := \frac{1}{\sum_{i=1}^m \frac{1}{\sigma_i^2}} \sum_{i=1}^m \frac{1}{\sigma_i^2} y_{ij}, \quad \text{for } j = 1, 2, \ldots, n, \tag{11}$$
$$\bar{\beta}^{(w)} := \frac{1}{\sum_{i=1}^m \frac{1}{\sigma_i^2}} \sum_{i=1}^m \frac{1}{\sigma_i^2}\beta_i. \tag{12}$$

Here, the superscript (w) indicates that the mean is a weighted mean instead of an unweighted one.

On the other hand, from

$$\frac{\partial f}{\partial \beta_{i0}} = -\frac{1}{\sigma_i^2}\sum_{j=1}^n (y_{ij} - \beta_{i0} - x_j\beta_i - d_j) = 0, \tag{13}$$

we have

$$\beta_{i0} = \frac{1}{n}\sum_{j=1}^n (y_{ij} - x_j\beta_i - d_j) = \bar{y}_{i\cdot} - \bar{x}\beta_i - \frac{1}{n}\sum_{j=1}^n d_j, \tag{14}$$

where

$$\bar{y}_{i\cdot} := \frac{1}{n}\sum_{j=1}^n y_{ij}, \quad \text{for } i = 1, 2, \ldots, m, \tag{15}$$
$$\bar{x} := \frac{1}{n}\sum_{j=1}^n x_j. \tag{16}$$

From (10) we have

$$\frac{1}{n}\sum_{j=1}^n d_j = \left(\bar{y}^{(w)} - \bar{y}_{\cdot 1}^{(w)}\right) - (\bar{x} - x_1)\bar{\beta}^{(w)}, \tag{17}$$

where

$$\bar{y}^{(w)} := \frac{1}{\sum_{i=1}^m \frac{1}{\sigma_i^2}} \sum_{i=1}^m \left(\frac{1}{\sigma_i^2}\cdot\frac{1}{n}\sum_{j=1}^n y_{ij}\right). \tag{18}$$

Substituting (17) into (14) yields

$$\beta_{i0} = \bar{y}_{i\cdot} + \bar{y}_{\cdot 1}^{(w)} - \bar{y}^{(w)} + (\bar{x} - x_1)\bar{\beta}^{(w)} - \bar{x}\beta_i. \tag{19}$$

Without loss of generality, we make the following two assumptions:

Assumption 3.1.

$$\sum_{j=1}^n x_j = n\bar{x} = 0, \qquad \sum_{j=1}^n x_j^2 = 1. \tag{20}$$

These assumptions are reasonable since in model (2) the centering and scaling of the $x_j$'s can be absorbed into $\beta_{i0}$ and $\beta_i$, respectively.

Then (19) simplifies to

$$\beta_{i0} = \bar{y}_{i\cdot} + \bar{y}_{\cdot 1}^{(w)} - \bar{y}^{(w)} - x_1\bar{\beta}^{(w)}. \tag{21}$$

The sum of (10) and (21) yields

$$\beta_{i0} + d_j = \bar{y}_{i\cdot} + \bar{y}_{\cdot j}^{(w)} - \bar{y}^{(w)} - x_j\bar{\beta}^{(w)}. \tag{22}$$

Substituting (22) into (7), the latter simplifies to

$$f(\beta) = \sum_{i=1}^m \frac{1}{2\sigma_i^2}\sum_{j=1}^n\left(\tilde{y}_{ij} - x_j\beta_i + x_j\bar{\beta}^{(w)}\right)^2 + \alpha\sum_{i=1}^m|\beta_i|, \tag{23}$$

where

$$\tilde{y}_{ij} := y_{ij} - \bar{y}_{i\cdot} - \bar{y}_{\cdot j}^{(w)} + \bar{y}^{(w)}. \tag{24}$$

It can be shown by straightforward calculation that the $\tilde{y}_{ij}$'s satisfy

$$\sum_{i=1}^m \frac{1}{\sigma_i^2}\tilde{y}_{ij} = 0, \tag{25}$$
$$\sum_{j=1}^n \tilde{y}_{ij} = 0. \tag{26}$$
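As a sanity check on the centering transform, here is a minimal NumPy sketch of (24) that verifies properties (25) and (26) numerically (variable names are ours, not from the released code):

```python
import numpy as np

def double_center(y, sigma2):
    """y_tilde of eq. (24): y is m x n; sigma2 holds the gene-wise variances."""
    w = 1.0 / sigma2                                   # inverse-variance weights
    y_gene = y.mean(axis=1, keepdims=True)             # \bar{y}_{i.}, eq. (15)
    y_samp = (w[:, None] * y).sum(axis=0) / w.sum()    # \bar{y}_{.j}^{(w)}, eq. (11)
    y_all = (w * y.mean(axis=1)).sum() / w.sum()       # \bar{y}^{(w)}, eq. (18)
    return y - y_gene - y_samp[None, :] + y_all

rng = np.random.default_rng(0)
y = rng.normal(size=(5, 4)); s2 = rng.uniform(0.5, 2.0, size=5)
yt = double_center(y, s2)
assert np.allclose(((1.0 / s2)[:, None] * yt).sum(axis=0), 0)  # eq. (25)
assert np.allclose(yt.sum(axis=1), 0)                          # eq. (26)
```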

3.2. Model Fitting by ADMM

We propose to use the alternating direction method of multipliers (ADMM) [21] to solve (23). Although ADMM can be very slow to converge to high accuracy, it is often the case that ADMM converges to modest accuracy very fast (within a few tens of iterations) [21].

To apply the ADMM, the problem (23) is reformulated as

$$f(\beta) = \sum_{i=1}^m \frac{1}{2\sigma_i^2}\sum_{j=1}^n\left(\tilde{y}_{ij} - x_j\beta_i + x_j\delta_0\right)^2 + \alpha\sum_{i=1}^m|\beta_i|, \tag{27a}$$

subject to

$$\frac{1}{\sum_{i=1}^m \frac{1}{\sigma_i^2}}\sum_{i=1}^m \frac{1}{\sigma_i^2}\beta_i = \delta_0. \tag{27b}$$

The augmented Lagrangian of (27) is given in (28) below.

Step 1: Update $\beta_i$, i = 1, 2, …, m:

The derivative of (28) with respect to $\beta_i$ is

$$\frac{\partial L_\rho}{\partial \beta_i} = -\frac{1}{\sigma_i^2}\sum_{j=1}^n x_j\left(\tilde{y}_{ij} - x_j\beta_i + x_j\delta_0\right) + \alpha\,\partial|\beta_i| + \frac{\frac{1}{\sigma_i^2}}{\sum_{l=1}^m\frac{1}{\sigma_l^2}}\,\lambda + \rho\,\frac{\frac{1}{\sigma_i^2}}{\sum_{l=1}^m\frac{1}{\sigma_l^2}}\left(\frac{1}{\sum_{l=1}^m\frac{1}{\sigma_l^2}}\sum_{l=1}^m\frac{1}{\sigma_l^2}\beta_l - \delta_0\right), \tag{29}$$

where $\partial|\beta_i|$ is the subgradient of $|\beta_i|$ with respect to $\beta_i$ and is defined as

$$\partial|\beta_i| = \begin{cases}1, & \beta_i > 0,\\ -1, & \beta_i < 0,\\ [-1, 1], & \beta_i = 0.\end{cases}$$

Setting (29) equal to zero gives (30) below, where $\mathcal{T}$ is the soft-thresholding operator:

$$\mathcal{T}_{\sigma_i^2\alpha}[x] := \mathrm{sign}(x)\left(|x| - \sigma_i^2\alpha\right)_+ = \begin{cases}x - \sigma_i^2\alpha, & x > \sigma_i^2\alpha,\\ x + \sigma_i^2\alpha, & x < -\sigma_i^2\alpha,\\ 0, & -\sigma_i^2\alpha \le x \le \sigma_i^2\alpha.\end{cases}$$
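In code, the soft-thresholding operator is a one-liner; a minimal NumPy sketch:

```python
import numpy as np

def soft_threshold(x, t):
    """Soft-thresholding T_t[x] = sign(x) * max(|x| - t, 0), applied elementwise."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

# Example: soft_threshold(np.array([-3.0, -0.5, 0.2, 2.0]), 1.0)
# -> array([-2.,  0.,  0.,  1.])
```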

Step 2: Update δ0:

The derivative of (28) with respect to δ0 is

$$\frac{\partial L_\rho}{\partial \delta_0} = \sum_{i=1}^m\frac{1}{\sigma_i^2}\sum_{j=1}^n x_j\left(\tilde{y}_{ij} - x_j\beta_i + x_j\delta_0\right) - \lambda + \rho\left(\delta_0 - \frac{1}{\sum_{i=1}^m\frac{1}{\sigma_i^2}}\sum_{i=1}^m\frac{1}{\sigma_i^2}\beta_i\right). \tag{31}$$

Setting (31) equal to zero gives

$$\delta_0 = \frac{1}{\sum_{i=1}^m\frac{1}{\sigma_i^2}+\rho}\left(\lambda - \sum_{i=1}^m\frac{1}{\sigma_i^2}\sum_{j=1}^n x_j\tilde{y}_{ij}\right) + \frac{1}{\sum_{i=1}^m\frac{1}{\sigma_i^2}}\sum_{i=1}^m\frac{1}{\sigma_i^2}\beta_i = \frac{\lambda}{\sum_{i=1}^m\frac{1}{\sigma_i^2}+\rho} + \frac{1}{\sum_{i=1}^m\frac{1}{\sigma_i^2}}\sum_{i=1}^m\frac{1}{\sigma_i^2}\beta_i, \tag{32}$$

where the second equality is due to (25).

Step 3: Update λ:

$$\lambda^{\mathrm{new}} = \lambda^{\mathrm{old}} + \rho\left(\frac{1}{\sum_{i=1}^m\frac{1}{\sigma_i^2}}\sum_{i=1}^m\frac{1}{\sigma_i^2}\beta_i - \delta_0\right). \tag{33}$$

The model fitting algorithm is described in Algorithm 1.
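To make the steps concrete, here is a minimal sketch of one way the updates (30), (32) and (33) could be arranged into the loop of Algorithm 1. It is our illustrative reconstruction in Python/NumPy (not the authors' released R code), and it assumes `ytilde` has been formed per (24) and `x` satisfies Assumption 3.1:

```python
import numpy as np

def soft_threshold(x, t):
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def admm_simple(ytilde, x, sigma2, alpha, rho=1.0, n_iter=50):
    """ADMM for problem (27). ytilde: m x n centered data from (24);
    x: standardized predictor (sum x_j = 0, sum x_j^2 = 1)."""
    m, _ = ytilde.shape
    w = 1.0 / sigma2
    S = w.sum()
    xy = ytilde @ x                           # sum_j x_j * ytilde_ij, per gene
    beta, delta0, lam = np.zeros(m), 0.0, 0.0
    for _ in range(n_iter):
        for i in range(m):                    # Step 1: update beta_i via (30)
            rest = ((w * beta).sum() - w[i] * beta[i]) / S
            arg = xy[i] + delta0 - (rho / S) * (rest - delta0 + lam / rho)
            shrink = sigma2[i] * S**2 / (sigma2[i] * S**2 + rho)
            beta[i] = shrink * soft_threshold(arg, sigma2[i] * alpha)
        delta0 = lam / (S + rho) + (w * beta).sum() / S   # Step 2: eq. (32)
        lam += rho * ((w * beta).sum() / S - delta0)      # Step 3: eq. (33)
    return beta
```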

3.3. Estimation of Tuning Parameter α

Eq. (23) can be expressed in matrix form as

$$f(\beta) = \frac{1}{2}\left\|\Sigma^{1/2}\left(\tilde{Y} - M\beta x^T\right)\right\|_F^2 + \alpha\|\beta\|_1, \tag{34}$$

where

$$\Sigma = \mathrm{diag}\{\sigma\}, \tag{35}$$

with

$$\sigma = \left(1/\sigma_1^2,\ 1/\sigma_2^2,\ \ldots,\ 1/\sigma_m^2\right)^T, \tag{36}$$

and

$$M = I_m - \frac{1}{\sum_{i=1}^m\frac{1}{\sigma_i^2}}\begin{pmatrix}1/\sigma_1^2 & 1/\sigma_2^2 & \cdots & 1/\sigma_m^2\\ \vdots & \vdots & & \vdots\\ 1/\sigma_1^2 & 1/\sigma_2^2 & \cdots & 1/\sigma_m^2\end{pmatrix} = I_m - \frac{1}{\sum_{i=1}^m\frac{1}{\sigma_i^2}}\mathbf{1}_m\sigma^T. \tag{37}$$

After expansion, (34) becomes

$$f(\beta) = \frac{1}{2}\left\|\Sigma^{1/2}\tilde{Y}\right\|_F^2 - \beta^TM^T\Sigma\tilde{Y}x + \frac{1}{2}\beta^TM^T\Sigma M\beta + \alpha\|\beta\|_1, \tag{38}$$

where we have exploited the assumption $x^Tx = 1$.

Since $\frac{1}{2}\beta^TM^T\Sigma M\beta \ge 0$, with equality at $\beta = 0$, it follows that $\hat{\beta} = 0$ is the minimizer of $f(\beta)$ when

$$\alpha \ge \left\|M^T\Sigma\tilde{Y}x\right\|_\infty := \max_{1\le i\le m}\left|m_i^T\Sigma\tilde{Y}x\right|, \tag{39}$$

where mi denotes the i-th column of M in (37).

$$L_\rho(\beta,\delta_0,\lambda) = \sum_{i=1}^m\frac{1}{2\sigma_i^2}\sum_{j=1}^n\left(\tilde{y}_{ij} - x_j\beta_i + x_j\delta_0\right)^2 + \alpha\sum_{i=1}^m|\beta_i| + \lambda\left(\frac{1}{\sum_{i=1}^m\frac{1}{\sigma_i^2}}\sum_{i=1}^m\frac{1}{\sigma_i^2}\beta_i - \delta_0\right) + \frac{\rho}{2}\left(\frac{1}{\sum_{i=1}^m\frac{1}{\sigma_i^2}}\sum_{i=1}^m\frac{1}{\sigma_i^2}\beta_i - \delta_0\right)^2 \tag{28}$$

$$\beta_i = \frac{\sigma_i^2\left(\sum_{l=1}^m\frac{1}{\sigma_l^2}\right)^2}{\sigma_i^2\left(\sum_{l=1}^m\frac{1}{\sigma_l^2}\right)^2+\rho}\,\mathcal{T}_{\sigma_i^2\alpha}\!\left[\sum_{j=1}^n x_j\tilde{y}_{ij} + \delta_0 - \frac{\rho}{\sum_{l=1}^m\frac{1}{\sigma_l^2}}\left(\frac{1}{\sum_{l=1}^m\frac{1}{\sigma_l^2}}\sum_{l\ne i}\frac{1}{\sigma_l^2}\beta_l - \delta_0 + \frac{\lambda}{\rho}\right)\right] \tag{30}$$


Note that

$$M^T\Sigma\tilde{Y} = \left(I_m - \frac{1}{\sum_{i=1}^m\frac{1}{\sigma_i^2}}\sigma\mathbf{1}_m^T\right)\Sigma\tilde{Y} = \Sigma\tilde{Y}, \tag{40}$$

where the last equality holds because $\mathbf{1}_m^T\Sigma\tilde{Y} = 0$ due to (25).

Substituting (40) into (39) yields

$$\alpha_{\max} = \left\|\Sigma\tilde{Y}x\right\|_\infty = \max_i\left|\frac{1}{\sigma_i^2}x^T\tilde{y}_i\right|. \tag{41}$$

Our strategy is to first sort $\left|\frac{1}{\sigma_1^2}x^T\tilde{y}_1\right|, \left|\frac{1}{\sigma_2^2}x^T\tilde{y}_2\right|, \ldots, \left|\frac{1}{\sigma_m^2}x^T\tilde{y}_m\right|$ in ascending order and then set α to the P-th percentile (0 < P < 100) of the m ordered values. We set P = 5 in Section 5.
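A short sketch of this rule (assuming the centered data `ytilde` from (24)):

```python
import numpy as np

def choose_alpha(ytilde, x, sigma2, percentile=5):
    """alpha_max of (41) and the percentile rule of Section 3.3."""
    scores = np.abs(ytilde @ x) / sigma2        # |x^T ytilde_i| / sigma_i^2 per gene
    alpha_max = scores.max()                    # all-zero solution for alpha >= alpha_max
    alpha = np.percentile(scores, percentile)   # P-th percentile of the m values
    return alpha, alpha_max
```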

3.4. Maximum Likelihood Estimation of $\{\sigma_i^2\}_{i=1}^m$

To solve for $\{\sigma_i^2\}_{i=1}^m$, consider the negative log-likelihood function of (4) with $\{\sigma_i^2\}_{i=1}^m$ treated as unknown parameters as well:

$$\ell\left(\beta_0, \beta, d, \{\sigma_i^2\}_{i=1}^m\right) = \sum_{i=1}^m\left[\frac{n}{2}\log\left(2\pi\sigma_i^2\right) + \frac{1}{2\sigma_i^2}\sum_{j=1}^n(y_{ij} - \beta_{i0} - x_j\beta_i - d_j)^2\right]. \tag{42}$$

Taking the partial derivatives of ℓ(·) with respect to $d_j$ and $\beta_{i0}$ and setting the results to zero, we arrive at (10) and (21), respectively. The sum of (10) and (21) gives (22).

Taking the partial derivative of ℓ(·) with respect to $\beta_i$ and setting the result to zero, we have

$$\beta_i = \sum_{j=1}^n x_jy_{ij} - \sum_{j=1}^n x_j(\beta_{i0} + d_j). \tag{43}$$

Substituting (22) into (43) yields

$$\beta_i = \sum_{j=1}^n x_jy_{ij} - \frac{1}{\sum_{i=1}^m\frac{1}{\sigma_i^2}}\sum_{i=1}^m\frac{1}{\sigma_i^2}\sum_{j=1}^n x_jy_{ij} + \bar{\beta}^{(w)}, \tag{44}$$

where $\bar{\beta}^{(w)}$ is defined in (12).

Taking the partial derivative with respect to $\sigma_i^2$ and setting the result to zero gives

$$\sigma_i^2 = \frac{1}{n}\sum_{j=1}^n\left(y_{ij} - \beta_{i0} - x_j\beta_i - d_j\right)^2. \tag{45}$$

Substituting (22) into (45) yields

$$\sigma_i^2 = \frac{1}{n}\sum_{j=1}^n\left(y_{ij} - \bar{y}_{i\cdot} - \bar{y}_{\cdot j}^{(w)} + \bar{y}^{(w)} - x_j\beta_i + x_j\bar{\beta}^{(w)}\right)^2, \tag{46}$$

where $\bar{y}_{i\cdot}$, $\bar{y}_{\cdot j}^{(w)}$ and $\bar{y}^{(w)}$ are defined in (15), (11) and (18), respectively.

Given initial estimates for $\bar{\beta}^{(w)}$ and $\{\sigma_i^2\}_{i=1}^m$, we can iterate equations (44), (46) and (12) to gradually refine the estimates of $\beta_i$ and $\sigma_i^2$, as shown in Algorithm 2.
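A sketch of these alternating updates (our reconstruction of Algorithm 2; the initialization and iteration count are assumptions):

```python
import numpy as np

def estimate_beta_sigma(y, x, n_iter=20):
    """Alternate eqs. (44), (46) and (12): y is the m x n log-expression matrix,
    x the standardized predictor of Assumption 3.1."""
    m, n = y.shape
    xy = y @ x                                  # sum_j x_j y_ij per gene
    sigma2 = np.ones(m)                         # assumed initial variances
    beta_bar = 0.0                              # assumed initial weighted mean slope
    for _ in range(n_iter):
        w = 1.0 / sigma2
        beta = xy - (w * xy).sum() / w.sum() + beta_bar   # eq. (44)
        beta_bar = (w * beta).sum() / w.sum()             # eq. (12)
        # double-centered data of eq. (24)
        yt = (y - y.mean(axis=1, keepdims=True)
              - (w[:, None] * y).sum(axis=0) / w.sum()
              + (w * y.mean(axis=1)).sum() / w.sum())
        resid = yt - np.outer(beta - beta_bar, x)
        sigma2 = (resid ** 2).mean(axis=1)                # eq. (46)
    return beta, sigma2
```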


To obtain a robust estimate of $\sigma_i^2$, we further take a weighted average of $\hat{\sigma}_i^2$ and the estimated mean variance across all genes. That is,

$$\hat{\sigma}_i^2 \leftarrow (1 - w)\,\hat{\sigma}_i^2 + w\,\overline{\hat{\sigma}^2}, \tag{47}$$

where

$$\overline{\hat{\sigma}^2} = \frac{1}{m}\sum_{i=1}^m\hat{\sigma}_i^2, \tag{48}$$

and the weight w is calculated using the following formula, which is derived based on an empirical Bayes approach [24]:

$$w = \frac{2(m-1)}{n+1}\left(\frac{1}{m} + \frac{\left(\overline{\hat{\sigma}^2}\right)^2}{\sum_{i=1}^m\left(\hat{\sigma}_i^2 - \overline{\hat{\sigma}^2}\right)^2}\right). \tag{49}$$

This kind of variance estimation approach is widely used in differential gene expression analysis with small sample sizes [25], [26]. The estimated variances $\hat{\sigma}_i^2$ can then be used in Algorithm 1 to solve for $\{\beta_i\}_{i=1}^m$.

Remark 3.1. In the special case of $\sigma_1^2 = \sigma_2^2 = \cdots = \sigma_m^2 = \sigma^2$, there is no need to estimate $\sigma^2$, since the unknown $\sigma^2$ in (7) can be absorbed into the tuning parameter α.
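A sketch of the shrinkage step (47)–(49), following our reading of (49); clipping the weight at 1 is our assumption, not stated in the text:

```python
import numpy as np

def shrink_variances(sigma2_hat, n):
    """Empirical-Bayes shrinkage of eqs. (47)-(49) toward the mean variance."""
    m = sigma2_hat.size
    mean_var = sigma2_hat.mean()                                     # eq. (48)
    w = (2.0 * (m - 1) / (n + 1)) * (
        1.0 / m + mean_var**2 / ((sigma2_hat - mean_var)**2).sum())  # eq. (49)
    w = min(w, 1.0)  # assumption: keep the weight a valid proportion
    return (1.0 - w) * sigma2_hat + w * mean_var                     # eq. (47)
```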

4. EXTENSION TO MULTIPLE LINEAR REGRESSION MODEL AND ALGORITHM DEVELOPMENT

In the multiple linear regression model, each response or outcome is modeled by p > 1 predictors:

$$y_{ij} = \beta_{i0} + \beta_i^Tx_j + d_j + \varepsilon_{ij}, \tag{50}$$

where

$$\beta_i = \left(\beta_{i1}, \beta_{i2}, \ldots, \beta_{ip}\right)^T \in \mathbb{R}^{p\times 1} \tag{51}$$

is a vector of regression coefficients representing the log-fold changes of the expression level of gene i between treatment conditions, and

$$x_j = \left(x_{j1}, x_{j2}, \ldots, x_{jp}\right)^T \in \mathbb{R}^{p\times 1} \tag{52}$$

is a vector of predictors representing the treatment conditions (drug dosage, blood pressure, age, BMI, etc.) for sample j, and $\beta_{i0}$, $d_j$ and $\varepsilon_{ij} \sim N(0, \sigma_i^2)$ are the y-intercept, the scaling factor for sample j and the measurement noise, respectively. We assume that the error terms $\varepsilon_{ij}$ are uncorrelated with all the predictor variables and uncorrelated with each other.

The likelihood function based on the observed data is given by

$$L\left(\beta_0, \{\beta_i\}_{i=1}^m, d; Y\right) = \prod_{i=1}^m\prod_{j=1}^n\frac{1}{\sqrt{2\pi\sigma_i^2}}\exp\left\{-\frac{\left(y_{ij} - \beta_{i0} - \beta_i^Tx_j - d_j\right)^2}{2\sigma_i^2}\right\}. \tag{53}$$

Assuming that $\{\sigma_i^2\}_{i=1}^m$ are known, maximizing (53) is equivalent to minimizing the negative log-likelihood

$$\ell\left(\beta_0, \{\beta_i\}_{i=1}^m, d\right) = \sum_{i=1}^m\sum_{j=1}^n\frac{1}{2\sigma_i^2}\left(y_{ij} - \beta_{i0} - \beta_i^Tx_j - d_j\right)^2. \tag{54}$$

The objective function to be minimized is

$$f\left(\beta_0, \{\beta_i\}, d\right) = \sum_{i=1}^m\sum_{j=1}^n\frac{1}{2\sigma_i^2}\left(y_{ij} - \beta_{i0} - x_j^T\beta_i - d_j\right)^2 + \sum_{i=1}^m p(\beta_i). \tag{55}$$

Below we introduce two types of penalty functions $p(\beta_i)$.

1. Type I penalty:

$$p(\beta_i) = \alpha|\beta_{ip}|. \tag{56}$$

Gene i is differentially expressed if $\beta_{ip} \ne 0$ and not otherwise. This penalty is for applications where one covariate (e.g., treatment) is of main interest while we want to adjust for the possible effects of other confounding covariates (e.g., age, gender).

2. Type II penalty:

$$p(\beta_i) = \alpha\|\beta_i\|. \tag{57}$$

Gene i is differentially expressed if $\beta_i \ne 0$ and not otherwise. This penalty is for applications where all covariates are of interest and we want to identify the genes for which at least one covariate has an effect.

It can be proved that the optimization problem (55) with penalty (56) or (57) is jointly convex in (β0, {βi} ,d).

Assume that

$$\sum_{j=1}^n x_j = 0, \tag{58}$$

and set $d_1 = 0$. Using arguments similar to those in Section 3.1 to eliminate $\beta_0$ and d, we simplify (55) to

$$f(\{\beta_i\}) = \sum_{i=1}^m\frac{1}{2\sigma_i^2}\sum_{j=1}^n\left(\tilde{y}_{ij} - x_j^T\beta_i + x_j^T\bar{\beta}^{(w)}\right)^2 + \sum_{i=1}^m p(\beta_i), \tag{59}$$

where $\tilde{y}_{ij}$ is the same as in (24), and

$$\bar{\beta}^{(w)} := \frac{1}{\sum_{i=1}^m\frac{1}{\sigma_i^2}}\sum_{i=1}^m\frac{1}{\sigma_i^2}\beta_i. \tag{60}$$

4.1. Regression with type I penalty: Model fitting by ADMM

To apply the ADMM, we reformulate the Type I penalized regression problem as

$$f(\{\beta_i\}, \delta_0) = \sum_{i=1}^m\frac{1}{2\sigma_i^2}\sum_{j=1}^n\left(\tilde{y}_{ij} - x_j^T\beta_i + x_j^T\delta_0\right)^2 + \alpha\sum_{i=1}^m|\beta_{ip}|, \tag{61a}$$

subject to

$$\frac{1}{\sum_{i=1}^m\frac{1}{\sigma_i^2}}\sum_{i=1}^m\frac{1}{\sigma_i^2}\beta_i = \delta_0. \tag{61b}$$

The augmented Lagrangian of (61) is given in (62) below.

Step 1: Update $\beta_i$, i = 1, 2, …, m:

Taking the partial derivative of (62) with respect to $\beta_i$ and setting the result to zero gives

$$\left[\frac{1}{\sigma_i^2}X^TX + \frac{\rho}{\sigma_i^4}\frac{1}{\left(\sum_{l=1}^m\frac{1}{\sigma_l^2}\right)^2}I_p\right]\beta_i + \alpha\begin{pmatrix}0\\ \vdots\\ 0\\ \partial|\beta_{ip}|\end{pmatrix} = v_i, \tag{63}$$

where

$$X = \begin{pmatrix}x_1^T\\ x_2^T\\ \vdots\\ x_n^T\end{pmatrix} = \begin{pmatrix}x_{11} & x_{12} & \cdots & x_{1p}\\ x_{21} & x_{22} & \cdots & x_{2p}\\ \vdots & \vdots & & \vdots\\ x_{n1} & x_{n2} & \cdots & x_{np}\end{pmatrix} \in \mathbb{R}^{n\times p}, \tag{64}$$

$\partial|\beta_{ip}|$ is the subgradient of $|\beta_{ip}|$ with respect to $\beta_{ip}$, and

$$v_i = \frac{1}{\sigma_i^2}\left(\sum_{j=1}^n x_j\tilde{y}_{ij} + X^TX\delta_0\right) - \frac{\rho}{\sigma_i^2}\frac{1}{\sum_{l=1}^m\frac{1}{\sigma_l^2}}\left(\frac{1}{\sum_{l=1}^m\frac{1}{\sigma_l^2}}\sum_{l\ne i}\frac{1}{\sigma_l^2}\beta_l - \delta_0 + \frac{\lambda}{\rho}\right). \tag{65}$$

Consider the matrix partition

$$\frac{1}{\sigma_i^2}X^TX + \frac{\rho}{\sigma_i^4}\frac{1}{\left(\sum_{l=1}^m\frac{1}{\sigma_l^2}\right)^2}I_p = Q = \begin{pmatrix}Q_{11} & q\\ q^T & q_{pp}\end{pmatrix}, \quad \beta_i = \begin{pmatrix}\beta_i^-\\ \beta_{ip}\end{pmatrix}, \quad v_i = \begin{pmatrix}v_i^-\\ v_{ip}\end{pmatrix},$$

where $Q_{11}$ is the submatrix of Q with the last row and last column deleted, and $\beta_i^-$ and $v_i^-$ denote $\beta_i$ and $v_i$ with the last entry deleted. From (63) we have

$$Q_{11}\beta_i^- + q\beta_{ip} = v_i^-, \tag{66}$$
$$q^T\beta_i^- + q_{pp}\beta_{ip} + \alpha\,\partial|\beta_{ip}| = v_{ip}. \tag{67}$$

From (66) it follows that

$$\beta_i^- = Q_{11}^{-1}\left(v_i^- - q\beta_{ip}\right). \tag{68}$$

Substituting (68) into (67) yields

$$\beta_{ip} = \frac{1}{q_{pp} - q^TQ_{11}^{-1}q}\,\mathcal{T}_\alpha\!\left[v_{ip} - q^TQ_{11}^{-1}v_i^-\right]. \tag{69}$$
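This partitioned solve (a Schur-complement step plus one scalar soft-threshold) can be sketched as follows for a single gene; `Q` and `v_i` are assumed to be formed per (63) and (65):

```python
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def update_beta_i(Q, v_i, alpha):
    """Solve (63) via the partition (66)-(69): Q is the p x p coefficient
    matrix, v_i the right-hand side of (65)."""
    Q11, q, qpp = Q[:-1, :-1], Q[:-1, -1], Q[-1, -1]
    v_rest, v_p = v_i[:-1], v_i[-1]
    t = np.linalg.solve(Q11, np.column_stack([v_rest, q]))
    Q11inv_v, Q11inv_q = t[:, 0], t[:, 1]
    beta_p = soft_threshold(v_p - q @ Q11inv_v, alpha) / (qpp - q @ Q11inv_q)  # eq. (69)
    beta_rest = Q11inv_v - Q11inv_q * beta_p                                   # eq. (68)
    return np.append(beta_rest, beta_p)
```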

Step 2: Update δ0:

Taking the derivative of (62) with respect to δ0 and setting the result to zero gives

$$\delta_0 = \left(\sum_{i=1}^m\frac{1}{\sigma_i^2}X^TX + \rho I_p\right)^{-1}\lambda + \frac{1}{\sum_{i=1}^m\frac{1}{\sigma_i^2}}\sum_{i=1}^m\frac{1}{\sigma_i^2}\beta_i, \tag{70}$$

where we have exploited (25).

Step 3: Update λ:

$$\lambda^{\mathrm{new}} = \lambda^{\mathrm{old}} + \rho\left(\frac{1}{\sum_{i=1}^m\frac{1}{\sigma_i^2}}\sum_{i=1}^m\frac{1}{\sigma_i^2}\beta_i - \delta_0\right). \tag{71}$$

The model fitting algorithm is described in Algorithm 3.

4.2. Regression with type II penalty: Model fitting by ADMM

The Type II penalized regression problem is reformulated as

$$f(\{\beta_i\}, \delta_0) = \sum_{i=1}^m\frac{1}{2\sigma_i^2}\sum_{j=1}^n\left(\tilde{y}_{ij} - x_j^T\beta_i + x_j^T\delta_0\right)^2 + \alpha\sum_{i=1}^m\|\beta_i\|, \tag{72a}$$

subject to

$$\frac{1}{\sum_{i=1}^m\frac{1}{\sigma_i^2}}\sum_{i=1}^m\frac{1}{\sigma_i^2}\beta_i = \delta_0. \tag{72b}$$

The augmented Lagrangian of (72) is given in (73) below.

Step 1: Update $\beta_i$, i = 1, 2, …, m:

The terms of (73) relevant to the derivative with respect to $\beta_i$ are collected in (74) below, where c is a constant that does not depend on $\beta_i$, and $v_i$ is defined in (65).

$$L_\rho(\{\beta_i\},\delta_0,\lambda) = \sum_{i=1}^m\frac{1}{2\sigma_i^2}\sum_{j=1}^n\left(\tilde{y}_{ij} - x_j^T\beta_i + x_j^T\delta_0\right)^2 + \alpha\sum_{i=1}^m|\beta_{ip}| + \lambda^T\left(\frac{1}{\sum_{i=1}^m\frac{1}{\sigma_i^2}}\sum_{i=1}^m\frac{1}{\sigma_i^2}\beta_i - \delta_0\right) + \frac{\rho}{2}\left\|\frac{1}{\sum_{i=1}^m\frac{1}{\sigma_i^2}}\sum_{i=1}^m\frac{1}{\sigma_i^2}\beta_i - \delta_0\right\|^2. \tag{62}$$

It can be shown that when $\|v_i\| \le \alpha$, then $\beta_i = 0$; otherwise, denoting the eigendecomposition of $X^TX$ by $X^TX = UDU^T$, minimization of (74) is equivalent to

$$\min_{\beta_i}\ \frac{1}{2}\left\|Z_i\beta_i - b_i\right\|^2 + \alpha\|\beta_i\|, \tag{75a}$$

where

$$Z_i = \left[\frac{1}{\sigma_i^2}D + \frac{\rho}{\sigma_i^4}\frac{1}{\left(\sum_{l=1}^m\frac{1}{\sigma_l^2}\right)^2}I_p\right]^{1/2}U^T, \tag{75b}$$
$$b_i = \left[\frac{1}{\sigma_i^2}D + \frac{\rho}{\sigma_i^4}\frac{1}{\left(\sum_{l=1}^m\frac{1}{\sigma_l^2}\right)^2}I_p\right]^{-1/2}U^Tv_i. \tag{75c}$$

As in [27], we use a coordinate descent procedure to optimize (75). For each s, given the estimates $\{\hat{\beta}_{il}\}_{l\ne s}$, $\beta_{is}$ can be estimated by solving

$$\min_{\beta_{is}}\ \frac{1}{2}\left\|z_s\beta_{is} - r_i^{(s)}\right\|^2 + \alpha\sqrt{\beta_{is}^2 + \sum_{l\ne s}\hat{\beta}_{il}^2}, \tag{76}$$

where

$$r_i^{(s)} = b_i - \sum_{l\ne s} z_l\hat{\beta}_{il}. \tag{77}$$

We solve (76) via a one-dimensional search. Note that the solution to (76) falls between 0 and $\beta_{is}^{o} = z_s^Tr_i^{(s)}/\|z_s\|^2$, the ordinary least-squares estimate. We can use the optimize function in R, or the fminbnd function in MATLAB, which performs a one-dimensional search based on golden section search and successive parabolic interpolation.
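A sketch of this coordinate update using SciPy's bounded scalar minimizer (an analogue of R's optimize and MATLAB's fminbnd; variable names are ours):

```python
import numpy as np
from scipy.optimize import minimize_scalar

def update_beta_is(z_s, r_s, other_sq, alpha):
    """Coordinate update for (76): z_s is the s-th column of Z_i, r_s the
    partial residual of (77), other_sq = sum_{l != s} beta_il^2."""
    ols = z_s @ r_s / (z_s @ z_s)          # unpenalized one-dimensional solution
    if ols == 0.0:
        return 0.0
    lo, hi = sorted((0.0, ols))            # the solution lies between 0 and OLS
    obj = lambda b: (0.5 * np.sum((z_s * b - r_s) ** 2)
                     + alpha * np.sqrt(b ** 2 + other_sq))
    return minimize_scalar(obj, bounds=(lo, hi), method="bounded").x
```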

After updating $\{\beta_i\}_{i=1}^m$, the updates of $\delta_0$ and λ turn out to be the same as those in Section 4.1. The model fitting algorithm is described in Algorithm 4.

4.3. Estimation of Tuning Parameter α

Eq. (59) can be expressed in matrix form as

$$f(B) = \frac{1}{2}\left\|\Sigma^{1/2}\left(\tilde{Y} - MBX^T\right)\right\|_F^2 + p(B), \tag{78}$$

where M and X are respectively defined in (37) and (64), and

$$B = \begin{pmatrix}\beta_1^T\\ \beta_2^T\\ \vdots\\ \beta_m^T\end{pmatrix} = \begin{pmatrix}\beta_{11} & \beta_{12} & \cdots & \beta_{1p}\\ \beta_{21} & \beta_{22} & \cdots & \beta_{2p}\\ \vdots & \vdots & & \vdots\\ \beta_{m1} & \beta_{m2} & \cdots & \beta_{mp}\end{pmatrix} \in \mathbb{R}^{m\times p}, \tag{79}$$

and $p(B)$ is the penalty function.

The derivative of $f(B)$ with respect to B is

$$\frac{\partial f}{\partial B} = M^T\Sigma MBX^TX - M^T\Sigma\tilde{Y}X + \frac{\partial p(B)}{\partial B}. \tag{80}$$

4.3.1. Type I Penalty

When $p(B) = \alpha\sum_{i=1}^m|\beta_{ip}|$, its derivative with respect to B is

$$\frac{\partial p(B)}{\partial B} = \alpha\begin{pmatrix}0 & \cdots & 0 & \partial|\beta_{1p}|\\ 0 & \cdots & 0 & \partial|\beta_{2p}|\\ \vdots & & \vdots & \vdots\\ 0 & \cdots & 0 & \partial|\beta_{mp}|\end{pmatrix} = \begin{pmatrix}0_{m\times(p-1)} & \alpha\dfrac{\partial\|\beta^p\|_1}{\partial\beta^p}\end{pmatrix}. \tag{81}$$

Denote

$$X = \left[x^1\ \cdots\ x^{p-1}\ x^p\right] = \left[X_1\ x^p\right], \qquad B = \left[\beta^1\ \cdots\ \beta^{p-1}\ \beta^p\right] = \left[B_1\ \beta^p\right].$$

Setting (80) equal to zero gives

$$M^T\Sigma M\left(B_1X_1^T + \beta^p(x^p)^T\right)X_1 = M^T\Sigma\tilde{Y}X_1, \tag{82}$$
$$M^T\Sigma M\left(B_1X_1^T + \beta^p(x^p)^T\right)x^p + \alpha\frac{\partial\|\beta^p\|_1}{\partial\beta^p} = M^T\Sigma\tilde{Y}x^p. \tag{83}$$

Since $M^T\Sigma M$ is rank deficient,2 the solution to (82) is not unique. We apply the pseudoinverse of $M^T\Sigma M$ to obtain the minimum-norm solution to (82):

$$B_1 = \left(M^T\Sigma M\right)^\dagger\left(M^T\Sigma\tilde{Y} - M^T\Sigma M\beta^p(x^p)^T\right)X_1\left(X_1^TX_1\right)^{-1}. \tag{84}$$

Substituting (84) into (83) yields

$$M^T\Sigma M\beta^p(x^p)^T\left[I_n - X_1\left(X_1^TX_1\right)^{-1}X_1^T\right]x^p + \alpha\frac{\partial\|\beta^p\|_1}{\partial\beta^p} = M^T\Sigma\tilde{Y}\left[I_n - X_1\left(X_1^TX_1\right)^{-1}X_1^T\right]x^p. \tag{85}$$

2. Simple analysis shows that the rank of $M^T\Sigma M$ is m − 1.

$$L_\rho(\{\beta_i\},\delta_0,\lambda) = \sum_{i=1}^m\frac{1}{2\sigma_i^2}\sum_{j=1}^n\left(\tilde{y}_{ij} - x_j^T\beta_i + x_j^T\delta_0\right)^2 + \alpha\sum_{i=1}^m\|\beta_i\| + \lambda^T\left(\frac{1}{\sum_{i=1}^m\frac{1}{\sigma_i^2}}\sum_{i=1}^m\frac{1}{\sigma_i^2}\beta_i - \delta_0\right) + \frac{\rho}{2}\left\|\frac{1}{\sum_{i=1}^m\frac{1}{\sigma_i^2}}\sum_{i=1}^m\frac{1}{\sigma_i^2}\beta_i - \delta_0\right\|^2 \tag{73}$$

$$L_i(\{\beta_i\},\delta_0,\lambda) = \frac{1}{2\sigma_i^2}\sum_{j=1}^n\left(\tilde{y}_{ij} - x_j^T\beta_i + x_j^T\delta_0\right)^2 + \alpha\|\beta_i\| + \lambda^T\frac{\frac{1}{\sigma_i^2}}{\sum_{l=1}^m\frac{1}{\sigma_l^2}}\beta_i + \frac{\rho}{2}\left\|\frac{\frac{1}{\sigma_i^2}}{\sum_{l=1}^m\frac{1}{\sigma_l^2}}\beta_i + \frac{1}{\sum_{l=1}^m\frac{1}{\sigma_l^2}}\sum_{l\ne i}\frac{1}{\sigma_l^2}\beta_l - \delta_0\right\|^2 = \frac{1}{2}\beta_i^T\left(\frac{1}{\sigma_i^2}X^TX + \frac{\rho}{\sigma_i^4}\frac{1}{\left(\sum_{l=1}^m\frac{1}{\sigma_l^2}\right)^2}I_p\right)\beta_i - \beta_i^Tv_i + \alpha\|\beta_i\| + c \tag{74}$$

Note that to arrive at (85), we have exploited the fact that $\left(M^T\Sigma M\right)^\dagger\left(M^T\Sigma M\right)M^T\Sigma = M^T\Sigma$, which follows from $M^T\Sigma M = M^T\Sigma$ (by the definition of M in (37)) and the definition of the pseudoinverse of a matrix.

Since the coefficient matrix of $\beta^p$, i.e., $M^T\Sigma M\left((x^p)^T\left[I_n - X_1(X_1^TX_1)^{-1}X_1^T\right]x^p\right)$, is positive semidefinite, (85) implies that we obtain the all-zero solution when

$$\alpha \ge \left\|M^T\Sigma\tilde{Y}\left[I_n - X_1(X_1^TX_1)^{-1}X_1^T\right]x^p\right\|_\infty = \left\|\Sigma\tilde{Y}\left[I_n - X_1(X_1^TX_1)^{-1}X_1^T\right]x^p\right\|_\infty = \max_i\left|\frac{1}{\sigma_i^2}\tilde{y}_i^T\left[I_n - X_1(X_1^TX_1)^{-1}X_1^T\right]x^p\right|, \tag{86}$$

where the first equality is due to (40).

4.3.2. Type II Penalty

The derivative of $p(B) = \alpha\sum_{i=1}^m\|\beta_i\|$ with respect to B is

$$\frac{\partial p(B)}{\partial B} = \alpha\begin{pmatrix}\partial\|\beta_1\|/\partial\beta_1^T\\ \partial\|\beta_2\|/\partial\beta_2^T\\ \vdots\\ \partial\|\beta_m\|/\partial\beta_m^T\end{pmatrix}, \tag{87}$$

where $\dfrac{\partial\|\beta_i\|}{\partial\beta_i} = \dfrac{\beta_i}{\|\beta_i\|}$ if $\beta_i \ne 0$ and $\left\|\dfrac{\partial\|\beta_i\|}{\partial\beta_i}\right\| \le 1$ otherwise [27], [28].

Setting (80) equal to zero yields

$$X^TXB^TM^T\Sigma m_i - X^T\tilde{Y}^T\Sigma m_i + \alpha\frac{\partial\|\beta_i\|}{\partial\beta_i} = 0_{p\times 1}, \tag{88}$$

for i = 1, 2, …, m, where $m_i$ is the i-th column of M in (37). The minimizer of f(B) is the zero matrix when

$$\alpha \ge \max_i\left\|X^T\tilde{Y}^T\Sigma m_i\right\|. \tag{89}$$

Note that

$$\tilde{Y}^T\Sigma m_i = \tilde{Y}^T\Sigma\left(e_i - \frac{1}{\sum_{l=1}^m\frac{1}{\sigma_l^2}}\frac{1}{\sigma_i^2}\mathbf{1}_m\right) = \tilde{Y}^T\Sigma e_i = \frac{1}{\sigma_i^2}\tilde{y}_i, \tag{90}$$

where the second equality is due to (25). Substituting (90) into (89) yields

$$\alpha_{\max} = \max_i\frac{1}{\sigma_i^2}\left\|X^T\tilde{y}_i\right\|. \tag{91}$$

4.4. Maximum Likelihood Estimation of $\{\sigma_i^2\}_{i=1}^m$

To solve for $\{\sigma_i^2\}_{i=1}^m$, consider the negative log-likelihood function with $\{\sigma_i^2\}_{i=1}^m$ treated as unknown parameters as well:

$$\ell\left(\beta_0, \{\beta_i\}_{i=1}^m, d, \{\sigma_i^2\}_{i=1}^m\right) = \sum_{i=1}^m\left[\frac{n}{2}\log\left(2\pi\sigma_i^2\right) + \frac{1}{2\sigma_i^2}\sum_{j=1}^n\left(y_{ij} - \beta_{i0} - x_j^T\beta_i - d_j\right)^2\right]. \tag{92}$$

Taking the partial derivatives of ℓ(·) with respect to $d_j$ and $\beta_{i0}$ and setting the results to zero, we arrive at

$$d_j = d_j - d_1 = \left(\bar{y}_{\cdot j}^{(w)} - \bar{y}_{\cdot 1}^{(w)}\right) - (x_j - x_1)^T\bar{\beta}^{(w)}, \tag{93}$$
$$\beta_{i0} = \bar{y}_{i\cdot} - \frac{1}{n}\sum_{j=1}^n x_j^T\beta_i - \frac{1}{n}\sum_{j=1}^n d_j = \bar{y}_{i\cdot} + \bar{y}_{\cdot 1}^{(w)} - \bar{y}^{(w)} - x_1^T\bar{\beta}^{(w)}, \tag{94}$$

where to derive the second equality we have exploited assumption (58).

The sum of (93) and (94) gives

$$\beta_{i0} + d_j = \bar{y}_{i\cdot} + \bar{y}_{\cdot j}^{(w)} - \bar{y}^{(w)} - x_j^T\bar{\beta}^{(w)}. \tag{95}$$

Taking the partial derivative of ℓ(·) with respect to $\beta_i$ and setting the result to zero, we have

$$\beta_i = \left(X^TX\right)^{-1}\left[\sum_{j=1}^n x_jy_{ij} - \sum_{j=1}^n x_j\left(\beta_{i0} + d_j\right)\right]. \tag{96}$$

Substituting (95) into (96) yields

$$\beta_i = \left(X^TX\right)^{-1}\left[\sum_{j=1}^n x_jy_{ij} - \frac{1}{\sum_{i=1}^m\frac{1}{\sigma_i^2}}\sum_{i=1}^m\frac{1}{\sigma_i^2}\sum_{j=1}^n x_jy_{ij}\right] + \bar{\beta}^{(w)}, \tag{97}$$

where $\bar{\beta}^{(w)}$ is defined in (60).

Taking the partial derivative with respect to $\sigma_i^2$ and setting the result to zero gives

$$\sigma_i^2 = \frac{1}{n}\sum_{j=1}^n\left(y_{ij} - \beta_{i0} - x_j^T\beta_i - d_j\right)^2. \tag{98}$$

Substituting (95) into (98) yields

$$\sigma_i^2 = \frac{1}{n}\sum_{j=1}^n\left(y_{ij} - \bar{y}_{i\cdot} - \bar{y}_{\cdot j}^{(w)} + \bar{y}^{(w)} - x_j^T\beta_i + x_j^T\bar{\beta}^{(w)}\right)^2, \tag{99}$$

where $\bar{y}_{i\cdot}$, $\bar{y}_{\cdot j}^{(w)}$ and $\bar{y}^{(w)}$ are defined in (15), (11) and (18), respectively.

Given initial estimates for $\bar{\beta}^{(w)}$ and $\{\sigma_i^2\}_{i=1}^m$, the estimates of $\beta_i$ and $\sigma_i^2$ can be iteratively updated using equations (97), (99) and (60) until convergence, as shown in Algorithm 5.

After the $\sigma_i^2$'s are estimated, they can be shrunk (squeezed) toward the common noise variance to obtain robust estimates, as done in Section 3.4.

5. EXPERIMENTS

We evaluate the performance of the proposed algorithm (referred to as ELMSeq, short for extended linear model for RNA-seq data analysis). To save space, we only verify the proposed algorithm for the simple regression model (2). We use the 5th percentile to set the tuning parameter α (see Section 3.3).

We compare our method with the state-of-the-art methods for detecting differential gene expression from RNA-seq data: edgeR-robust [20], [29], DESeq2 [18], and limma-voom [16], [17].

5.1. Simulations on Synthetic Data

We simulate RNA-seq data with a total of m = 1000 genes and n = 15 samples. The data generation process is described in Table 1.

Table 1:

Synthetic data generation process and parameters

$\ell_i \sim 2^{\mathrm{unif}(5,10)}$ — gene length of gene i
$\beta_{i0} \sim N(0, 1)$ — other log scaling factors of gene i
$\beta_i = 0$ — log-fold change for non-DE genes
$\beta_i \sim N(2, 1)$ — log-fold change for up-regulated DE genes
$\beta_i \sim N(-2, 1)$ — log-fold change for down-regulated DE genes
$x_j \sim N(0, 1)$ — condition data of sample j
$N_j \sim \mathrm{unif}(2, 3) \times 10^6$ — library size of sample j
$d_j \sim N(0, 1)$ — other log scaling factors of sample j
$\mu_{ij} = N_j\,\ell_i e^{\beta_{i0} + \beta_i x_j + d_j} \big/ \sum_{i=1}^m \ell_i e^{\beta_{i0} + \beta_i x_j + d_j}$ — expected RNA-seq read counts of gene i from sample j
$c_{ij} = e^{N(\log\mu_{ij},\, 0.1)}$ — read counts
$y_{ij} = \log c_{ij}$ — log-transformed gene expression
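A sketch of this generation process (Python/NumPy); the exact normalization of $\mu_{ij}$ follows our reading of Table 1:

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 1000, 15
n_de, frac_up = 300, 0.5                     # e.g., 300 DE genes, 50% up-regulated

length = 2.0 ** rng.uniform(5, 10, m)        # gene lengths
beta0 = rng.normal(0.0, 1.0, m)              # gene-specific log scaling factors
beta = np.zeros(m)                           # log-fold changes (0 for non-DE genes)
n_up = int(n_de * frac_up)
beta[:n_up] = rng.normal(2.0, 1.0, n_up)             # up-regulated DE genes
beta[n_up:n_de] = rng.normal(-2.0, 1.0, n_de - n_up) # down-regulated DE genes
x = rng.normal(0.0, 1.0, n)                  # condition of each sample
N = rng.uniform(2e6, 3e6, n)                 # library sizes
d = rng.normal(0.0, 1.0, n)                  # sample-specific log scaling factors

rel = length[:, None] * np.exp(beta0[:, None] + np.outer(beta, x) + d[None, :])
mu = N * rel / rel.sum(axis=0)               # expected read counts mu_ij
c = np.exp(rng.normal(np.log(mu), 0.1))      # log-normal counts, noise sigma = 0.1
y = np.log(c)                                # log-transformed expression
is_de = beta != 0                            # ground-truth DE labels
```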

We first examine whether the proposed algorithm can accurately estimate the log-fold changes (or slopes) $\beta_i$. For ease of illustration, we set the true slopes of the DE genes to $\beta_i = \pm 2$ instead of $\beta_i \sim N(\pm 2, 1)$. We start with 300 DE and 700 non-DE genes. Among the DE genes, 50% are up-regulated while the remaining 50% are down-regulated. The fitted $\{\beta_i\}_{i=1}^m$ using ELMSeq are plotted in Figure 1(a). We see that the estimated slopes are centered around the true ones: the estimated $\beta_i$'s of the DE genes are centered around ±2, while those of the non-DE genes are close to zero. In Figure 1(b) and Figure 1(c), we increase the percentage of up-regulated DE genes to 70% and 90%, respectively. Our method still accurately retrieves all non-zero $\beta_i$'s while shrinking all other $\beta_i$'s to zero.

Figure 1:

Estimated $\beta_i$ in the simple linear regression model from simulated RNA-seq data, where the number of genes is m = 1000 and the number of samples is n = 15. The number of DE genes varies from 300 to 700, and the percentage of up-regulated DE genes varies from 50% to 90%. Along the horizontal axis, from left to right: up-regulated genes ($\beta_i = 2$), down-regulated genes ($\beta_i = -2$) and non-DE genes ($\beta_i = 0$).

In Figure 1(d–f), we increase the number of DE genes to 500, among which 50%, 70% or 90% are up-regulated while the others are down-regulated. Our method still achieves accurate estimates. In Figure 1(g–h), we further increase the number of DE genes to 700, among which 50% or 70% are up-regulated, and our method still achieves accurate estimates. Only when we simulate 700 DE genes among which 90% are up-regulated does our method fail to distinguish between DE and non-DE genes, since the estimated regression coefficients of the latter are not zero either [Figure 1(i)]. A theoretical explanation of Figure 1(i) is provided in the supplementary material.

Using a different gene expression measure such as CPM, RPKM or TPM values computed with formulas in (1) yields essentially the same result.

Using Algorithm 1, we estimate the regression coefficient $\hat{\beta}_i$ for each gene i. We decide there is a linear relationship between the predictor variable $x_j$ and the expression data $y_{ij}$ if $\hat{\beta}_i \ne 0$; the larger $|\hat{\beta}_i|$ is, the stronger the relationship. We then sort the genes in descending order of $|\hat{\beta}_i|$ and vary the threshold to construct the receiver operating characteristic (ROC) curve and to calculate the area under the ROC curve (AUC).
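A sketch of this evaluation step (using scikit-learn's roc_auc_score; `beta_hat` and `is_de` are hypothetical arrays from the simulation above):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def de_auc(beta_hat, is_de):
    """Rank genes by |beta_hat| and score DE detection by the area under
    the ROC curve; is_de marks the truly DE genes in the simulation."""
    return roc_auc_score(np.asarray(is_de, dtype=int), np.abs(beta_hat))
```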

The AUCs for DE gene detection using all four methods are summarized in Table 2. We see that ELMSeq performs better than or comparably to the other three methods, regardless of how many genes are differentially expressed and whether or not they are expressed in a symmetric manner. In challenging cases where a large proportion of genes are differentially expressed in an asymmetric manner (e.g., 50% DE genes among which 90% are up-regulated, or 70% DE genes among which 70% are up-regulated), the performance gain of ELMSeq over competing methods is more significant.

Table 2:

AUC comparison of edgeR-robust, DESeq2, limma-voom and ELMSeq on log-normally distributed data. Number of samples: n = 15, log-fold change for DE genes: $\beta_i \sim N(\pm 2, 1)$, and noise level: $\sigma_i = 0.1$. The table shows the percentage of DE genes (DE %), the percentage of up-regulated genes among the DE genes (Up %), and the mean AUCs for all four methods measured over 10 simulated replicates. The standard errors of the mean AUCs are given in parentheses.

DE (%)  Up (%)  edgeR            DESeq2           voom             ELMSeq
10      50      0.9903 (0.0016)  0.6068 (0.0807)  0.9910 (0.0018)  0.9914 (0.0017)
10      70      0.9935 (0.0021)  0.4527 (0.0638)  0.9941 (0.0021)  0.9943 (0.0021)
10      90      0.9869 (0.0028)  0.6878 (0.0637)  0.9875 (0.0024)  0.9897 (0.0022)
30      50      0.9898 (0.0010)  0.5508 (0.0883)  0.9900 (0.0010)  0.9900 (0.0010)
30      70      0.9891 (0.0014)  0.7946 (0.0640)  0.9897 (0.0014)  0.9910 (0.0011)
30      90      0.9788 (0.0023)  0.6114 (0.0805)  0.9796 (0.0022)  0.9795 (0.0014)
50      50      0.9917 (0.0008)  0.4290 (0.0797)  0.9916 (0.0008)  0.9917 (0.0008)
50      70      0.9748 (0.0026)  0.4923 (0.0810)  0.9754 (0.0026)  0.9826 (0.0015)
50      90      0.8717 (0.0133)  0.4697 (0.0667)  0.8801 (0.0119)  0.9662 (0.0020)
70      50      0.9907 (0.0009)  0.5572 (0.1027)  0.9915 (0.0008)  0.9923 (0.0007)
70      70      0.8564 (0.0180)  0.5307 (0.0588)  0.8696 (0.0148)  0.9591 (0.0034)
70      90      0.3375 (0.0108)  0.4808 (0.0192)  0.3204 (0.0154)  0.4718 (0.0124)

In Table 3, we decrease the log-fold changes of the DE genes to $\beta_i \sim N(\pm 0.2, 0.1)$ while keeping all other data generation parameters (including the noise level) the same as those in Table 2. We see that all methods suffer a degradation in AUC performance; but again, ELMSeq consistently performs better than or comparably to all other methods.

Table 3:

AUC comparison of edgeR-robust, DESeq2, limma-voom and ELMSeq on log-normally distributed data. The data generation parameters are the same as those in Table 2 except that the log-fold changes for DE genes decrease to $\beta_i \sim N(\pm 0.2, 0.1)$.

DE (%)  Up (%)  edgeR            DESeq2           voom             ELMSeq
10      50      0.8055 (0.0089)  0.5241 (0.0142)  0.8224 (0.0095)  0.8232 (0.0095)
10      70      0.8086 (0.0090)  0.4846 (0.0126)  0.8212 (0.0095)  0.8234 (0.0101)
10      90      0.7867 (0.0084)  0.5078 (0.0084)  0.7955 (0.0104)  0.8024 (0.0106)
30      50      0.8087 (0.0050)  0.4970 (0.0119)  0.8158 (0.0054)  0.8157 (0.0054)
30      70      0.7848 (0.0052)  0.5471 (0.0211)  0.7949 (0.0052)  0.8013 (0.0055)
30      90      0.7398 (0.0059)  0.5329 (0.0181)  0.7505 (0.0059)  0.7730 (0.0054)
50      50      0.8143 (0.0061)  0.4931 (0.0137)  0.8265 (0.0049)  0.8268 (0.0051)
50      70      0.7611 (0.0054)  0.5061 (0.0155)  0.7704 (0.0054)  0.7752 (0.0056)
50      90      0.6451 (0.0060)  0.5017 (0.0102)  0.6503 (0.0059)  0.6793 (0.0025)
70      50      0.8149 (0.0022)  0.5231 (0.0273)  0.8261 (0.0030)  0.8267 (0.0028)
70      70      0.7271 (0.0074)  0.5093 (0.0100)  0.7354 (0.0086)  0.7388 (0.0083)
70      90      0.5449 (0.0066)  0.5158 (0.0089)  0.5505 (0.0081)  0.5434 (0.0069)

Note that when more samples are available, the performance gain of ELMSeq over competing methods becomes even more significant. The results for various sample sizes n = 5, 8, 25, 50, 100 are provided in the supplementary materials (Tables S1–S5 for genes with large log-fold changes, $\beta_i \sim N(\pm 2, 1)$, and Tables S6–S10 for genes with small log-fold changes, $\beta_i \sim N(\pm 0.2, 0.1)$).

We also performed simulations with the multiple linear regression model in Section 4, and the preliminary results are similar to those obtained for the simple regression model. Note that unlike the simple regression model and the type I penalized multiple linear regression model, the type II penalized multiple linear regression model does not allow defining up- and down-regulated genes, as multiple regression coefficients are tested simultaneously.

5.2. An Application to a Real RNA-Seq Dataset

We further evaluate our algorithm on a prostate adenocarcinoma (PRAD) RNA-seq dataset published as part of The Cancer Genome Atlas (TCGA) project [30]. The RNA-seq data of 20531 genes from 187 samples were downloaded from the TCGA data portal (https://tcga-data.nci.nih.gov/tcga). We wish to identify genes that are associated with pre-operative prostate-specific antigen (PSA), an important risk factor for prostate cancer. The gene expression data were preprocessed by the TCGA consortium. Tissue samples from 333 PRAD patients were sequenced using Illumina sequencing instruments. The raw sequencing reads were processed and analyzed using the SeqWare Pipeline 0.7.0 and the MapspliceRSEM workflow 0.7 developed by the University of North Carolina, and aligned to the human reference genome using MapSplice [31]. The gene expression distributions of all samples were normalized to have the same 75th percentile expression value (1,000).

Using Algorithm 1, we obtain the estimated between-sample normalization factors $\hat{d}_j$ and the regression coefficient $\hat{\beta}_i$ for each gene i. We then substitute the $\hat{d}_j$'s into model (2), and for each gene i compute the p-value by testing the null hypothesis that the slope of the regression line is equal to zero, i.e., $\beta_i = 0$. We declare a gene differentially expressed if the p-value associated with its linear regression model is less than 0.05/m. Here the threshold 0.05/m is determined using the Bonferroni correction to adjust for multiple significance tests and to achieve a desired family-wise error rate of 0.05. The relations between the sets of differentially expressed genes selected by edgeR, DESeq2, limma-voom and ELMSeq are depicted in Figure 2.
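A sketch of this testing step: with the estimated $\hat{d}_j$'s plugged in, each gene reduces to a simple linear regression whose slope can be tested with a standard t-test (our illustrative implementation, not the authors' released code):

```python
import numpy as np
from scipy import stats

def slope_pvalues(y, x, d_hat):
    """Per-gene t-test of H0: beta_i = 0 in model (2) after plugging in the
    estimated normalization factors d_hat."""
    m, n = y.shape
    z = y - d_hat[None, :]                    # remove estimated sample effects
    xc = x - x.mean()
    zc = z - z.mean(axis=1, keepdims=True)
    beta = zc @ xc / (xc @ xc)                # OLS slope per gene
    resid = zc - np.outer(beta, xc)
    s2 = (resid ** 2).sum(axis=1) / (n - 2)   # residual variance
    tstat = beta / np.sqrt(s2 / (xc @ xc))
    return 2.0 * stats.t.sf(np.abs(tstat), df=n - 2)

# Declare gene i DE when p_i < 0.05 / m (Bonferroni correction)
```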

Figure 2:

Venn diagram showing the relation between the sets of differentially expressed genes detected by edgeR, DESeq2, limma-voom and ELMSeq.

Nine genes are uniquely detected by ELMSeq: RIC3, ALDH1A2, BCL11A, CDH3, DIRAS3, EPHA5, CEACAM1, PRSS16, and AJAP1. For most of these genes, evidence of an association with prostate cancer has been reported in the literature. For example, the genes ALDH1a2 [32] and CEACAM1 [33] are reported to be tumor suppressors in prostate cancer: underexpression of these genes promotes prostate cancer cell proliferation.

Twelve genes are detected by all four methods: KANK4, RHOU, TPT1, SH2D3A, EEF1A1P9, ZCWPW1, ZNF454, RACGAP1, PTPLA, POC1A, AURKA and TIMM17A. The genes detected in common by exactly three methods are: six genes (CDK1, FAM111B, MLF1IP, PRC1, DTL, RAD54B) by edgeR, DESeq2 and limma-voom; three genes (SH3RF2, ATCAY and PCP4) by edgeR, DESeq2 and ELMSeq; three genes (FERMT1, FOXA3 and LRAT) by edgeR, limma-voom and ELMSeq; and one gene (IPO9) by DESeq2, limma-voom and ELMSeq. For most of these genes, evidence of an association with prostate cancer has been reported in the literature. For example, silencing of the gene RHOU decreases the invasion, proliferation and motility of prostate cancer cells [34].

6. DISCUSSION

A unified statistical model is proposed for joint between-sample normalization and DE detection of RNA-seq data. The sample-specific normalization factors are modeled as unknown parameters and estimated jointly with DE detection. As a result, the model is robust against normalization errors and is independent of the units (i.e., counts, CPM/RPM, RPKM/FPKM or TPM) in which gene expression levels are summarized.

For the model with a single treatment condition, we introduce the L1 penalty into the linear regression model. The L1 penalty favors sparse solutions (it forces some coefficients to be exactly zero). This is desirable since many genes are not differentially expressed. From a Bayesian point of view, the lasso penalty corresponds to a Laplace (double exponential, centered at zero) prior over the regression coefficients. By contrast, existing methods do not exploit this sparsity information. We also extend the simple linear regression model to a multiple linear regression model to accommodate multiple treatment conditions. Two types of penalty functions are introduced. In the first, only one covariate is of interest while all other covariates are treated as confounding factors; we are interested in testing whether that specific covariate is associated with differential expression. In the second, all covariates are of interest (there are no confounding covariates) and we are interested in testing whether any covariate affects the differential expression of a gene.

Simulation studies show that the proposed methods always perform better than or comparably to existing methods in terms of AUC. The performance gain increases with a larger sample size or higher signal-to-noise ratio, and is more significant when a large proportion of genes are differentially expressed in an asymmetric manner.

The R codes of the algorithms described in the paper are available for download at http://www-personal.umich.edu/~jianghui/lr-ADMM/.

Supplementary Material

Supplemental materials

ACKNOWLEDGMENTS

The authors would like to thank the anonymous reviewers for their valuable comments and suggestions to improve the quality of the paper.

Biographies


Kefei Liu received the B.Sc. degree in mathematics from Wuhan University in 2006 and the Ph.D. degree in Electronic Engineering from City University of Hong Kong in 2013. He is currently a Postdoctoral Research Fellow at the Center for Computational Biology and Bioinformatics, Indiana University School of Medicine. Before joining IU, he worked as a Postdoctoral Research Associate at The Biodesign Institute of Arizona State University and the Department of Computational Medicine and Bioinformatics of University of Michigan. His current research interests include machine learning, optimization, tensor decompositions and their applications in biomedical data analysis.


Jieping Ye received the Ph.D. degree in computer science from the University of Minnesota, Twin Cities, MN, USA, in 2005.

He is an Associate Professor of Department of Computational Medicine and Bioinformatics and Department of Electrical Engineering and Computer Science at the University of Michigan, Ann Arbor, MI, USA. His research interests include machine learning, data mining, and biomedical informatics. Dr. Ye has served as Senior Program Committee/Area Chair/Program Committee Vice Chair of many conferences including NIPS, ICML, KDD, IJCAI, ICDM, SDM, ACML, and PAKDD. He serves as a PC Co-Chair of SDM 2015. He serves as an Associate Editor for IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING and IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, and serves as an Action Editor for Data Mining and Knowledge Discovery. He won the NSF CAREER Award in 2010. His papers have been selected for the outstanding student paper at ICML in 2004, the KDD best research paper honorable mention in 2010, the KDD best research paper nomination in 2011 and 2012, the SDM best research paper runner up in 2013, the KDD best research paper runner up in 2013, and the KDD best student paper award in 2014.


Yang Yang is a Ph.D. candidate at Beihang University. He is currently a visiting student under the supervision of distinguished professor Philip S. Yu at the University of Illinois at Chicago. He received his bachelor's and master's degrees from Xidian University. His research interests are social network analysis, machine learning, and complex networks.


Li Shen holds a B.S. degree from Xi’an Jiao Tong University, an M.S. degree from Shanghai Jiao Tong University, and a Ph.D. degree from Dartmouth College, all in Computer Science. He is an Associate Professor of Radiology and Imaging Sciences at Indiana University School of Medicine. His research interests include medical image computing, bioinformatics, data mining, network science, systems biology, brain imaging genomics, and brain connectomics.


Hui Jiang is an Assistant Professor in the Department of Biostatistics at University of Michigan. He received his Ph.D. in Computational and Mathematical Engineering from Stanford University in 2009. He received his B.S. and M.S. in Computer Science from Peking University. Before joining the University of Michigan in 2011, he was a postdoctoral scholar in the Department of Statistics and Genome Technology Center at Stanford University. He is interested in developing statistical and computational methods for the analysis of large-scale biological data generated using modern high-throughput technologies.

Contributor Information

Kefei Liu, Department of Radiology and Imaging Sciences, Indiana University School of Medicine, Indianapolis, IN 46202.

Jieping Ye, Department of Computational Medicine and Bioinformatics, University of Michigan, MI 48109.

Yang Yang, School of Computer Science and Engineering, Beihang University, Beijing 100191, China.

Li Shen, Department of Radiology and Imaging Sciences, Indiana University School of Medicine, Indianapolis, IN 46202.

Hui Jiang, Department of Biostatistics, University of Michigan, MI 48109.

REFERENCES

[1] Mortazavi A, Williams BA, McCue K, Schaeffer L, and Wold B, "Mapping and quantifying mammalian transcriptomes by RNA-Seq," Nat Methods, vol. 5, no. 7, pp. 621–628, July 2008.
[2] Trapnell C, Williams BA, Pertea G, Mortazavi A, Kwan G, van Baren MJ, Salzberg SL, Wold BJ, and Pachter L, "Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation," Nat Biotechnol, vol. 28, no. 5, pp. 511–515, May 2010.
[3] Jiang H and Wong WH, "Statistical inferences for isoform expression in RNA-Seq," Bioinformatics, vol. 25, no. 8, pp. 1026–1032, April 2009.
[4] Salzman J, Jiang H, and Wong WH, "Statistical modeling of RNA-Seq data," Statistical Science, vol. 26, no. 1, pp. 62–83, 2011.
[5] Wang Z, Gerstein M, and Snyder M, "RNA-Seq: a revolutionary tool for transcriptomics," Nat Rev Genet, vol. 10, no. 1, pp. 57–63, January 2009.
[6] Dillies M-A, Rau A, Aubert J, Hennequet-Antier C, Jeanmougin M, Servant N, Keime C, Marot G, Castel D, Estelle J, Guernec G, Jagla B, Jouneau L, Laloë D, Le Gall C, Schaeffer B, Le Crom S, Guedj M, and Jaffrézic F, "A comprehensive evaluation of normalization methods for Illumina high-throughput RNA sequencing data analysis," Brief Bioinform, vol. 14, no. 6, pp. 671–683, November 2013.
[7] Rapaport F, Khanin R, Liang Y, Pirun M, Krek A, Zumbo P, Mason CE, Socci ND, and Betel D, "Comprehensive evaluation of differential gene expression analysis methods for RNA-seq data," Genome Biology, vol. 14, no. 9, p. R95, September 2013.
[8] Robinson MD and Oshlack A, "A scaling normalization method for differential expression analysis of RNA-seq data," Genome Biol, vol. 11, no. 3, p. R25, 2010.
[9] Bolstad BM, Irizarry RA, Astrand M, and Speed TP, "A comparison of normalization methods for high density oligonucleotide array data based on variance and bias," Bioinformatics, vol. 19, no. 2, pp. 185–193, January 2003.
[10] Smyth GK, "Limma: linear models for microarray data," in Bioinformatics and Computational Biology Solutions Using R and Bioconductor. Springer, 2005, pp. 397–420.
[11] Bullard JH, Purdom E, Hansen KD, and Dudoit S, "Evaluation of statistical methods for normalization and differential expression in mRNA-Seq experiments," BMC Bioinformatics, vol. 11, p. 94, 2010.
[12] Anders S and Huber W, "Differential expression analysis for sequence count data," Genome Biol, vol. 11, no. 10, p. R106, 2010.
[13] Li B, Ruotti V, Stewart RM, Thomson JA, and Dewey CN, "RNA-Seq gene expression estimation with read mapping uncertainty," Bioinformatics, vol. 26, no. 4, pp. 493–500, February 2010.
[14] Oshlack A, Wakefield MJ et al., "Transcript length bias in RNA-seq data confounds systems biology," Biol Direct, vol. 4, no. 1, p. 14, 2009.
[15] Marioni JC, Mason CE, Mane SM, Stephens M, and Gilad Y, "RNA-seq: an assessment of technical reproducibility and comparison with gene expression arrays," Genome Research, vol. 18, no. 9, pp. 1509–1517, 2008.
[16] Law CW, Chen Y, Shi W, and Smyth GK, "voom: precision weights unlock linear model analysis tools for RNA-seq read counts," Genome Biol, vol. 15, no. 2, p. R29, 2014.
[17] Ritchie ME, Phipson B, Wu D, Hu Y, Law CW, Shi W, and Smyth GK, "limma powers differential expression analyses for RNA-sequencing and microarray studies," Nucleic Acids Research, vol. 43, no. 7, p. e47, January 2015.
[18] Love MI, Huber W, and Anders S, "Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2," Genome Biology, vol. 15, no. 12, p. 550, 2014.
[19] Jiang H and Salzman J, "A penalized likelihood approach for robust estimation of isoform expression," Statistics and Its Interface, vol. 8, no. 4, pp. 437–445, 2015.
[20] Zhou X, Lindsay H, and Robinson MD, "Robustly detecting differential expression in RNA sequencing data using observation weights," Nucleic Acids Research, vol. 42, no. 11, pp. e91–e91, 2014.
[21] Boyd S, Parikh N, Chu E, Peleato B, and Eckstein J, "Distributed optimization and statistical learning via the alternating direction method of multipliers," Foundations and Trends in Machine Learning, vol. 3, no. 1, pp. 1–122, 2011.
[22] Jiang H and Zhan T, "Unit-free and robust detection of differential expression from RNA-Seq data," Statistics in Biosciences, vol. 9, no. 1, pp. 178–199, 2017.
[23] Tibshirani R, "Regression shrinkage and selection via the lasso," Journal of the Royal Statistical Society, Series B (Methodological), pp. 267–288, 1996.
[24] Ji H and Wong WH, "TileMap: create chromosomal map of tiling array hybridizations," Bioinformatics, vol. 21, no. 18, pp. 3629–3636, 2005.
[25] Ji H and Liu XS, "Analyzing 'omics data using hierarchical models," Nature Biotechnology, vol. 28, no. 4, pp. 337–340, 2010.
[26] Smyth GK, "Linear models and empirical Bayes methods for assessing differential expression in microarray experiments," Statistical Applications in Genetics and Molecular Biology, vol. 3, no. 1, 2004.
[27] Friedman J, Hastie T, and Tibshirani R, "Regularization paths for generalized linear models via coordinate descent," Journal of Statistical Software, vol. 33, no. 1, p. 1, 2010.
[28] Yuan M and Lin Y, "Model selection and estimation in regression with grouped variables," Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol. 68, no. 1, pp. 49–67, 2006.
[29] Robinson MD, McCarthy DJ, and Smyth GK, "edgeR: a Bioconductor package for differential expression analysis of digital gene expression data," Bioinformatics, vol. 26, pp. 139–140, January 2010.
[30] The Cancer Genome Atlas Research Network, "The molecular taxonomy of primary prostate cancer," Cell, vol. 163, pp. 1011–1025, November 2015.
[31] Wang K, Singh D, Zeng Z, Coleman SJ, Huang Y, Savich GL, He X, Mieczkowski P, Grimm SA, Perou CM, MacLeod JN, Chiang DY, Prins JF, and Liu J, "MapSplice: accurate mapping of RNA-seq reads for splice junction discovery," Nucleic Acids Research, vol. 38, p. e178, October 2010.
[32] Kim H, Lapointe J, Kaygusuz G, Ong DE, Li C, van de Rijn M, Brooks JD, and Pollack JR, "The retinoic acid synthesis gene ALDH1a2 is a candidate tumor suppressor in prostate cancer," Cancer Research, vol. 65, no. 18, pp. 8118–8124, 2005.
[33] Busch C, Hanssen TA, Wagener C, and Öbrink B, "Down-regulation of CEACAM1 in human prostate cancer: correlation with loss of cell polarity, increased proliferation rate, and Gleason grade 3 to 4 transition," Human Pathology, vol. 33, no. 3, pp. 290–298, 2002.
[34] Alinezhad S, Väänänen R-M, Mattsson J, Li Y, Tallgrén T, Ochoa NT, Bjartell A, Åkerfelt M, Taimen P, Boström PJ et al., "Validation of novel biomarkers for prostate cancer progression by the combination of bioinformatics, clinical and functional studies," PLoS ONE, vol. 11, no. 5, p. e0155901, 2016.
