Modeling Skewness in Human Transcriptomes

Joaquim Casellas; Luis Varona

doi:10.1371/journal.pone.0038919

. 2012 Jun 11;7(6):e38919. doi: 10.1371/journal.pone.0038919

Modeling Skewness in Human Transcriptomes

Joaquim Casellas ^1,^*, Luis Varona ²

Editor: Timothy Ravasi³

PMCID: PMC3372486 PMID: 22701729

Abstract

Gene expression data are influenced by multiple biological and technological factors leading to a wide range of dispersion scenarios, although skewed patterns are not commonly addressed in microarray analyses. In this study, the distribution pattern of several human transcriptomes has been studied on free-access microarray gene expression data. Our results showed that, even in previously normalized gene expression data, probe and differential expression within probe effects suffer from substantial departures from the commonly assumed symmetric Gaussian distribution. We developed a flexible mixed model for non-competitive microarray data analysis that accounted for asymmetric and heavy-tailed (Student’s t distribution) dispersion processes. Random effects for gene expression data were modeled under asymmetric Student’s t distributions where the asymmetry parameter (λ) took values from perfect symmetry (λ = 0) to right- (λ>0) or left-side (λ>0) over-expression patterns. This approach was applied to four free-access human data sets and revealed clearly better model performance when comparing with standard approaches accounting for traditional symmetric Gaussian distribution patterns. Our analyses on human gene expression data revealed a substantial degree of right-hand asymmetry for probe effects, whereas differential gene expression addressed both symmetric and left-hand asymmetric patterns. Although these results cannot be extrapolated to all microarray experiments, they highlighted the incidence of skew dispersion patterns in human transcriptome; moreover, we provided a new analytical approach to appropriately address this biological phenomenon. The source code of the program accommodating these analytical developments and additional information about practical aspects on running the program are freely available by request to the corresponding author of this article.

Introduction

Mixed models have been advocated in gene expression analyses due to their superiority in partitioning sources of variation and their flexibility for accommodating various experimental designs [1]; furthermore, they can be used for joint analysis of all loci [2], appropriately accounting for variability both across and within microarray probes [3]. Microarray data sets are characterized by high dimensionality in the sense of a small number of replicates (i.e. microarray slides) and a large number of probes per replicate. Mixed models account for these peculiarities of the microarray gene expression data, the sources of variation being preferentially treated as random effects [4] to appropriately address large numbers of levels with scarce amounts of information per level. A typical assumption for the distribution of random effects in mixed model analyses is the Gaussian density function [4], which is systematically applied in standard gene expression analyses [3], [5]. Although this parametric assumption could be viewed as a reasonable compromise between mathematical convenience and biological plausibility, its suitability in gene-expression analyses has been questioned in recent studies [6]–[12].

Taking the probe-specific differential expression effect between two treatments, the Gaussian distribution forces a symmetrical pattern between the two treatments, whereas a wide range of skewed distributions and treatment-related over-expressions may seem more reasonable. Moreover, the Gaussian assumption suffers substantial misadjusts in the presence of outliers [13], which are common in microarray data [14]. Given the inconsistencies of the Gaussian distribution for random effects in the gene expression data, recent researches have proposed parametric alternatives for modeling gene expression data, assuming heavy-tailed processes like Cauchy [8] and Student’s t distributions [7] or asymmetric distributions like Pareto [15], Gamma [6] and skew Laplace [16], [17]. Although these studies have reported substantial improvements in terms of model fit to experimental data, none of them allowed joint, flexible modeling of gene expression data under variable incidence of outliers or asymmetry, or the incidence of both positive (right-hand tail over-expressed) and negative (left-hand tail over-expressed) asymmetric patterns.

The Student’s t distribution has been proposed as a useful assumption for attenuating the impact of outliers in mixed models [7] and, furthermore, asymmetry can be easily accommodated in the Student’s t density [18]. Within this context, the aim of this research was to check for asymmetric and heavy-tailed patterns in random effects of gene expression data, developing a new analytical approach to appropriately accommodate both sources of departure from the standard symmetric Gaussian assumption.

Results

Model Comparison

Four independent microarray gene expression data sets from human tissues (Table 1) were analyzed under three different hierarchical mixed linear models. These analyses accounted for the systematic effect of each microarray slide and two random sources of variation, probe and treatment within probe (with two levels in each data set). Models differed in the a priori distributions of these random effects, they being multivariate normal densities (Model SG) [5], symmetric Student’s t densities (Model ST), or asymmetric Student’s t densities (Model AS) following Sahu et al. [18]. The deviance information criterion (DIC) [19] assessed model performance under these three different prior distributions for random effects, revealing a huge penalization for model SG in all cases (Table 2). Note that models with a smaller DIC were favored as this indicated a better fit and lower degree of model complexity [19]. In all four comparisons, DIC sequentially and drastically reduced with models ST and AT (Table 2), where Sahu et al. [18] asymmetric Student’s t priors for probe and differential expression within probe effects were clearly preferred (Table 2). Note that differences larger that 3 to 5 DIC units are assumed as statistically relevant [19] and Model AT showed the lowest DIC with 4,678 (dataset 1) to 60,335 (dataset 3) less DIC units than Model ST. These large DIC departures ruled out any possible controversy concerning the most preferable model. It is important to note that the number of differentially expressed genes reduced with Model AT, also suggesting a more conservative behavior for this parameterization (Table 2).

Table 1. Summary of the free-access data sets analyzed.

	Platform^(a)	Tissue	Groups of comparison (number ofsamples per group)	Reference	GEO^(b)
Dataset 1	Affymetrix GeneChip Human FullLength Array HuGeneFL	Mononuclear cell layer	Non-pulmonary arterial hypertension (6)vs. pulmonary arterial hypertension (14)	Bull et al. [26]	GSE703
Dataset 2	Affymetrix GeneChip Human FullLength Array HuGeneFL	Bronchoalveolar lavage cells	Non-smoker (5) vs. smoker (5)	Heguy et al. [27]	GSE3212
Dataset 3	Affymetrix GeneChip Human GenomeU133 Plus 2.0 Array	Spermatozoa	Normal (12) vs. teratozoospermicindividuals (8)	Platts et al. [28]	GSE6969
Dataset 4	Illumina humanRef-8 v2.0 expressionbeadchip	Carotid endarterectomy samples	Carotid artery stenosis treated withmycophenolate (9) vs. placebo (11)	Unpublished	GSE13922

Open in a new tab

^(a)

The approximate number of interrogated transcripts were 5,000, 47,000 and 16,000 for Affymetrix GeneChip Human Full Length Array HuGeneFL (Affymetrix, Inc., Santa Clara, CA), Affymetrix GeneChip Human Genome U133 Plus 2.0 Array (Affymetrix, Inc., Santa Clara, CA) and Illumina human Ref-8 v2.0 expression beadchip (Illumina, Inc., San Diego, CA), respectively.

^(b)

Gene Expression Omnibus accession number (http://www.ncbi.nlm.nih.gov/geo/).

Table 2. Model comparison and characterization of the dispersion patter of probe and differential expression within probe under Model AT.

	Dataset 1	Dataset 2	Dataset 3	Dataset 4
DIC^(a) (and number of probes with significant differential expression^(b))
Model SG^(c)	284,161 (189)	231,581 (5)	3,692,344 (702)	1,756,122 (12)
Model ST^(d)	247,509 (31)	224,741 (1)	3,614,823 (692)	1,734,053 (4)
Model AT^(e)	242,831 (2)	188,835 (0)	3,554,488 (639)	1,724,667 (2)
Parameters^(f) under Model AT. Mode (and highest posterior density region at 95%)
v_p	8.95 (4.21 to 26.61)	5.62 (4.16 to 11.05)	6.77 (4.15 to 16.0)	8.87 (4.40 to 30.09)
λ_p	0.38 (0.04 to 0.66)	0.13 (0.01 to 0.32)	1.84 (1.61 to 1.93)	2.03 (1.98 to 2.09)
v_d	7.36 (4.18 to 18.05)	6.90 (4.15 to 19.76)	5.99 (4.38 to 11.51)	8.48 (4.66 to 23.90)
λ_d	0.01 (−0.04 to 0.06)	−0.00 (−0.04 to 0.04)	−1.88 (−1.96 to −1.81)	−0.00 (−0.01 to 0.01)

Open in a new tab

^(a)

Deviance information criterion.

^(b)

Differentially expressed genes after Bonferroni [29]-like correction (α = 0.05). The adjusted significance threshold for posterior probabilities was calculated as α/π, were π was the number of probes included in each analysis.

Random effects g and d(g) were assumed as symmetric Gaussian^(c), symmetric Student’s t ^(d) or asymmetric Student’s t ^(e) distributed following Sahu et al. [18].

^(f)

Degrees of freedom (v) and asymmetry parameter ( Inline graphic ) for probe (p) and differential expression within probe (d) effects.

Estimates for Asymmetry and non-Gaussian Patterns

Under Model AT, the non-Gaussian distributions of probe and differential expression within probe effects were characterized in terms of heavy tails (Student’s t) and asymmetric dispersion patterns by means of v (degrees of freedom of the Student’s t distribution) and Inline graphic (asymmetry parameter). Probe effects revealed both heavy tails and positive asymmetry with a substantial over-expression of the right tail of the distribution. The modal estimates of the degrees of freedom fluctuated between 5.62 (data set 2) and 8.95 (data set 1), with the highest posterior density region at 95% roughly ranged between 4 and 30. The right-hand asymmetry was clearly demonstrated in all datasets with positive modal estimates of Inline graphic , their HPD95 excluding the null or negative values (Figure 1a). The differential expression within probe effect showed a similar pattern with small v, although significant asymmetry was only revealed in data set 3 ( = −1.88; HPD95: −1.96 to −1.81; Figure 1b).

Distribution of mean estimates for probe (a) and differential expression within-probe (b) effects under Model SG (grey line) and Model AT (black line) for data set 3.

Discussion

The asymmetric and non-Gaussian distribution of the human transcriptome has been revealed in four independent human data sets from different microarray platforms and technologies. Although our results cannot be completely extrapolated to all microarray data, they show that deviations from the standard Gaussian prior for random effects should be accurately considered in current gene expression studies. Normalization of gene expression data has been a topic of main interest during the last decade [20], [21], but our results suggested that non-Gaussian patterns must be considered as an inherent property of gene expression data, and this phenomenon should be appropriately accounted for in analytical models in order to avoid biases on final estimates (Table 2). Note that the Student’s t density converges to a Gaussian density when v tends to infinity, although both densities are assumed roughly similar for v values larger than 30 [13]. In our case, small modal estimates (<10) were obtained for the degrees of freedom of the Student’s t distribution, suggesting a relevant departure from the standard Gaussian distribution as corroborated by the DIC statistic. Our small values for v reported a substantial incidence of outlier gene expressions as was previously sugested by Gottardo et al. [7] and Khondoker et al. [8] in alternative microarray data sets. Moreover, Model AT was preferred, highlighting the usefulness of the hierarchical mixed model with asymmetric Student’s t prior distributions for random sources of variation.

All data sets agreed with right-hand over-distributed probe effects, whereas left-hand over-expression was revealed for differential expression within probe estimates in data set 3 (Figure 1b). This right-hand asymmetry in human transcriptome must be linked to the fact that lowly expressed probes are roughly grouped in the left tail of the scanning spectrum due to technological limitations of the microarray technique, whereas a substantial incidence of high or extremely-high gene expression intensities can be anticipated [22]. Note that this phenomenon is not commonly accounted for in gene expression analyses worldwide, whereas the mixed model parameterization developed in this manuscript provides a highly flexible statistical tool accounting for the non-Gaussian properties of human (and even non-human) transcriptome. As shown in Figure 1a, departures from the standard Model SG do not reduce to the symmetry pattern only, but also rely on the average mathematical expectation for probe effects. Sahu’s et al. [18] method can accommodate distributions with non-zero modal estimates (Figure 1a). Focusing on data set 3, the modal estimate for differential expression effects was placed around 2 (Model AT) and linked to larger estimates for the array effect when comparing with Model SG. It implied a moderate relocation of systematic and random sources of variation for gene expression data in the output of the mixed model.

Results from differential expression within probe effects highlighted the remarkable flexibility of Sahu’s et al. [18] method to accommodate any kind of asymmetry pattern. This peculiarity is of special relevance for differential gene expression given that a wide range of asymmetric patterns could be find in gene expression studies. Although the standard symmetric Gaussian distributions may be valid sometimes, a wide range of left- and right-tail over-expressions could be addressed with Model AT. Indeed, data set 3 (all datasets) showed that asymmetric (heavy-tailed) patterns are not unusual and they must be considered in gene expression analyses. The relevance of a proper modeling of random effects is clearly highlighted in Figure 1b where the symmetric Gaussian prior distribution for the differential expression produces a bimodal artifact in the posterior distribution of the estimates, clearly differing from the expected drawn under the a priori assumption.

Note that all model comparisons were made on the basis of the DIC statistic [19], a widely used statistical criterion to assess model complexity and fit. Indeed, DIC measures posterior predictive error by penalizing the fit of the model (i.e., deviance) by its complexity, determined by the effective number of parameters as defined by Spiegelhalter et al. [19]. Within this context, model AT must be clearly viewed as the most parsimonious and reliable parameterization, at least among the alternatives we are considering in this study. DIC evidenced that the incidence of asymmetry and heavy-tailed patterns in human gene expression data must be out of any doubt and, as consequence, model AT characterized a quasi-optimum approach to analyze this kind of microarray data. Nevertheless, DIC does not provide specific information about testing properties of current models when evaluating differentially expressed genes, although better model fit must be linked to better testing properties. Model AT reported the smallest number of differentially expressed probes in all data sets (Table 2) and all those probes were previously identified as differentially expressed by models SG and ST. The same patterns were obtained in preliminary analyses of simulated gene expression data (results not shown) and suggested that the better fit and more restrictive testing behavior of model AT could be linked to false positives under models SG and ST.

In conclusion, the incidence of asymmetric random effects has been highlighted in non-competitive gene expression data from human tissues; the new model proposed below provides a better adjustment of gene expression data and even a more conservative testing pattern has been suggested. Although this manuscript has focused on non-competitive hybridization microarrays, models can be easily adapted to two channel microarrays following Purdom and Holmes [16].

Materials and Methods

Mixed Model for Non-competitive Microarray Data

We assume as a starting point non-competitive hybridization microarray data from n unrelated individuals appropriately grouped in two different treatments (e.g. normal versus tumor cells) and m probes. These data (y) can be analyzed by the mixed model:

where X, Z ₁, and Z ₂ are incidence matrices for array (a), probe (p) and differential expression (between treatments) within probe ( Inline graphic ) effects, and e is the vector of residuals. Following a standard Bayesian development, the joint posterior distribution of all parameters in the model conditional to the data is proportional to the Bayesian likelihood,

graphic file with name pone.0038919.e007.jpg

multiplied by the a priori distribution of each parameter in the model. Note that this equation describes a heteroskedastic normal density [5] with gene-specific residual variances and with null residual covariance between genes (R). A priori distributions for p and Inline graphic could be described as

and

they being independent Gaussian densities with a mean of zero and variances equal to Inline graphic and , respectively (Model SG). Note that i was the number of elements in p and j was the number of elements in d(p). Nevertheless, robustness must be gained under a skew-Student’s t prior. This prior can be parameterized as a skewed-normal density,

graphic file with name pone.0038919.e013.jpg

multiplied by the conditional distribution of the mixing parameter (s_k), this being a Gamma prior,

graphic file with name pone.0038919.e014.jpg

Note that Inline graphic was the scale parameter, were the degrees of freedom, and was the asymmetry parameter modelled following Sahu et al. [18] (Model AT). Moreover, φ and denoted the density function and cumulative distribution function of a standard normal distribution with kernel as defined between parentheses, respectively, and Inline graphic was the standard gamma function with argument as defined within parentheses. Note that describes perfect symmetry, whereas right (or left) tail proportionally increases for positive (or negative) values of . An uniform prior distribution was defined for , as previously suggested by Varona et al. [23]. Symmetric Student’s t priors (Model ST) can be easily defined if Inline graphic is appropriately fixed to 0. A priori distributions for degrees of freedom were defined as exponential [24] and flat priors were assumed for the remaining parameters. Note that the Student’s t density converges to the Gaussian one when degrees of freedom tend to infinity, whereas few degrees of freedom account for heavy-tailed densities [13]. All the unknown factors in the model can be easily sampled from their joint posterior distribution by Markov chain Monte Carlo methods [25].

Example with Free-access Human Gene Expression Data

To illustrate the asymmetric pattern of the human transcriptome, we applied the models to four free-access human microarray datasets (http://www.ncbi.nlm.nih.gov/geo/; accession numbers GSE703, GSE3212, GSE6969 and GSE 13922). Note that all data sets are MIAME compliant and they were previously deposited in the Gene Expression Omnibus database (http://www.ncbi.nlm.nih.gov/geo/). These datasets were representative of two different trademarks and hybridization technologies, evaluated in diverse human tissues (see Table 1). All of them focused on the comparison between two groups, non-pulmonary arterial hypertension versus pulmonary arterial hypertension (Dataset 1; [26]), non-smoker versus smoker (Dataset 2; [27]), normal versus teratozoospermic individuals (Dataset 3; [28]) and carotid artery stenosis treated with mycophenolate versus placebo (Dataset 4; unpublished). A base 2 logarithm was applied to normalize gene-expression scores.

Note that the four human data sets were selected at random to evaluate the three mixed model parameterizations on different human tissues and microarray platforms. Of course, both tissue and data quality could have some impact on the distribution pattern, although this escaped from the objectives of this research. Different preprocessing approaches would have different impacts on further analyses of gene expression data and even skewed or heavy-tailed patterns could be partially addressed by preliminary data editing methodologies such as normalization of background correction. Nevertheless, we focused on the development, implementation and evaluation of a reliable parameterization to account for non-Gaussian patterns in gene expression data, assuming that all preliminary data editing processes where properly satisfied.

For each dataset, the three different models were analyzed (models SG, ST and AT). Each model was solved through Bayesian inference with a single Monte Carlo Markov chain of 500,000 elements after discarding the first 50,000 as burn-in. Models were compared with the DIC [19].

Footnotes

Competing Interests: The authors have declared that no competing interests exist.

Funding: The research contract of JC was partially funded by Spain’s Ministerio de Ciencia e Innovación (reference, RYC-2009-04049). The funder had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. No additional external funding was received for this study.

References

1.Cui X, Churchill GA. Statistical tests for differential expression in cDNA microarray experiments. Genome Biol. 2003;4:210. doi: 10.1186/gb-2003-4-4-210. [DOI] [PMC free article] [PubMed] [Google Scholar]
2.Hoeschele I, Li H. A note on joint versus gene-specific mixed model analysis of microarray gene expression data. Biostatistics. 2005;6:186. doi: 10.1093/biostatistics/kxi001. [DOI] [PubMed] [Google Scholar]
3.Wolfinger RD, Gibson G, Wolfinger ED, Bennett L, Hamadeh H. Assessing gene significance from cDNA microarray expression data via mixed models. J Comput Biol. 2001;8:637. doi: 10.1089/106652701753307520. [DOI] [PubMed] [Google Scholar]
4.Searle SR. Matrix Algebra Useful for Statistics. John Wiley & Sons, New York, NY. 1982.
5.Casellas J, Ibáñez-Escriche N, Martínez-Giner M, Varona L. GEAMM v1.4.: a versatile program for mixed model analysis of gene expression data. Anim Genet. 2008;39:90. doi: 10.1111/j.1365-2052.2007.01670.x. [DOI] [PubMed] [Google Scholar]
6.Kendziorski CM, Newton MA, Lan H, Gould MN. On parametric empirical Bayes methods for comparing multiple groups using replicate gene expression profiles. Stat Med. 2003;22:3914. doi: 10.1002/sim.1548. [DOI] [PubMed] [Google Scholar]
7.Gottardo R, Raftery AE, Yeung KY, Bumgarner RE. Bayesian robust inference for differential gene expression in microarrays with multiple samples. Biometrics. 2006;62:18. doi: 10.1111/j.1541-0420.2005.00397.x. [DOI] [PubMed] [Google Scholar]
8.Khondoker MR, Glasbey CA, Worton BJ. Statistical estimation of gene expression using multiple laser scans of microarrays. Bioinformatics. 2006;22:219. doi: 10.1093/bioinformatics/bti790. [DOI] [PubMed] [Google Scholar]
9.Angelini C, Cutillo L, De Canditiis D, Mutarelli M, Pensky M. BATS: a Bayesian user-friendly software for analyzing time series microarray experiments. BMC Bioinformatics. 2008;9:415. doi: 10.1186/1471-2105-9-415. [DOI] [PMC free article] [PubMed] [Google Scholar]
10.Hardin J, Wilson J. A note on oligonucleotide expression values not being normally distributed. Biostatistics. 2009;10:450. doi: 10.1093/biostatistics/kxp003. [DOI] [PubMed] [Google Scholar]
11.Salas-Gonzalez D, Kuruoglu EE, Ruiz DP. A heavy-tailed empirical Bayes method for replicated microarray ata. Comput Stat Data Anal. 2009;53:1546. [Google Scholar]
12.Posekany A, Felsenstein K, Sykacek P. Biological assessment of robust noise models in microarray data analysis. Bioinformatics. 2011;27:814. doi: 10.1093/bioinformatics/btr018. [DOI] [PMC free article] [PubMed] [Google Scholar]
13.Lange KL, Little RJA, Taylor JMG. Robust statistical modelling using the t distribution. J Am Stat Assoc. 1989;84:896. [Google Scholar]
14.Model F, König T, Piepenbrock C, Adorján P. Statistical process control for large scale microarray experiments. Bioinformatics. 2002;18:S162. doi: 10.1093/bioinformatics/18.suppl_1.s155. [DOI] [PubMed] [Google Scholar]
15.Kuznetsov VA, Knott GD, Bonner RF. General statistics of stochastic process of gene expression in eukaryotic cells. Genetics. 2002;161:1332. doi: 10.1093/genetics/161.3.1321. [DOI] [PMC free article] [PubMed] [Google Scholar]
16.Purdom E, Holmes SP. Error distribution for gene expression data. Stat Appl Genet Mol Biol. 2005;4:16. doi: 10.2202/1544-6115.1070. [DOI] [PubMed] [Google Scholar]
17.Bhowmick D, Davison AC, Goldstein DR, Ruffieux Y. A Laplace mixture model for identification of differential expressions in microarray experiments. Biostatistics. 2006;7:641. doi: 10.1093/biostatistics/kxj032. [DOI] [PubMed] [Google Scholar]
18.Sahu SK, Dey DK, Branco MD. A new class of multivariate skew distributions with applications to Bayesian regression models. Can J Stat. 2003;31:150. [Google Scholar]
19.Spiegelhalter DJ, Best NG, Carlin BP, var der Linde A. Bayesian measures of model complexity and fit. J Royal Statist Soc B. 2002;64:639. [Google Scholar]
20.Beyene J, Hu P, Parkhomenko E, Tritchler D. Impact of normalization and filtering on linkage analysis of gene expression data. BMC Proc. 2007;1:S150. doi: 10.1186/1753-6561-1-s1-s150. [DOI] [PMC free article] [PubMed] [Google Scholar]
21.Smyth GK, Speed TP. Normalization of cDNA microarray data. Methods. 2003;31:273. doi: 10.1016/s1046-2023(03)00155-5. [DOI] [PubMed] [Google Scholar]
22.Chen D-T, Lin S-H, Soong S-J. Gene selection for oligonucleotide array: an approach using PM probe level data. Bioinformatics. 2004;20:862. doi: 10.1093/bioinformatics/btg493. [DOI] [PubMed] [Google Scholar]
23.Varona L, Ibáñez-Escriche N, Quintanilla R, Noguera JL, Casellas J. Bayesian analysis of quantitative traits using skewed distributions. Genet Res. 2008;90:190. doi: 10.1017/S0016672308009233. [DOI] [PubMed] [Google Scholar]
24.Strandén I, Gianola D. Mixed effects linear models with t-distributions for quantitative genetic analysis: a Bayesian approach. Genet Sel Evol. 1999;31:42. [Google Scholar]
25.Gilks WR, Richardson S, Spiegelhalter DJ. Markov chain Monte Carlo in practice. Chapman & Hall, New York, NY. 1996.
26.Bull TM, Coldren CD, Moore M, Sotto-Santiago SM, Pham DV. Gene microarray analysis of peripheral blood cells in pulmonary arterial hypertension. Am J Resp Crit Care Med. 2004;170:919. doi: 10.1164/rccm.200312-1686OC. [DOI] [PubMed] [Google Scholar]
27.Heguy A, O’Connor TP, Luettich K, Worgall S, Cieciuch A. Gene expression profiling of human alveolar macrophages of phenotypically normal smokers and nonsomkers reveals a previously unrecognized subset of genes modulated by cigarette smoking. J Mol Med. 2006;84:328. doi: 10.1007/s00109-005-0008-2. [DOI] [PubMed] [Google Scholar]
28.Platts AE, Dix DJ, Chemes HE, Thompson KE, Goodrich R. Success and failure in human spermatogenesis as revealed by teratozoospermic RNAs. Hum Mol Genet. 2007;16:773. doi: 10.1093/hmg/ddm012. [DOI] [PubMed] [Google Scholar]
29.Bonferroni CE. Elementi di Statistica Generale. Libreria Seber, Florence, Italy. 1930.

[pone.0038919-Cui1] 1.Cui X, Churchill GA. Statistical tests for differential expression in cDNA microarray experiments. Genome Biol. 2003;4:210. doi: 10.1186/gb-2003-4-4-210. [DOI] [PMC free article] [PubMed] [Google Scholar]

[pone.0038919-Hoeschele1] 2.Hoeschele I, Li H. A note on joint versus gene-specific mixed model analysis of microarray gene expression data. Biostatistics. 2005;6:186. doi: 10.1093/biostatistics/kxi001. [DOI] [PubMed] [Google Scholar]

[pone.0038919-Wolfinger1] 3.Wolfinger RD, Gibson G, Wolfinger ED, Bennett L, Hamadeh H. Assessing gene significance from cDNA microarray expression data via mixed models. J Comput Biol. 2001;8:637. doi: 10.1089/106652701753307520. [DOI] [PubMed] [Google Scholar]

[pone.0038919-Searle1] 4.Searle SR. Matrix Algebra Useful for Statistics. John Wiley & Sons, New York, NY. 1982.

[pone.0038919-Casellas1] 5.Casellas J, Ibáñez-Escriche N, Martínez-Giner M, Varona L. GEAMM v1.4.: a versatile program for mixed model analysis of gene expression data. Anim Genet. 2008;39:90. doi: 10.1111/j.1365-2052.2007.01670.x. [DOI] [PubMed] [Google Scholar]

[pone.0038919-Kendziorski1] 6.Kendziorski CM, Newton MA, Lan H, Gould MN. On parametric empirical Bayes methods for comparing multiple groups using replicate gene expression profiles. Stat Med. 2003;22:3914. doi: 10.1002/sim.1548. [DOI] [PubMed] [Google Scholar]

[pone.0038919-Gottardo1] 7.Gottardo R, Raftery AE, Yeung KY, Bumgarner RE. Bayesian robust inference for differential gene expression in microarrays with multiple samples. Biometrics. 2006;62:18. doi: 10.1111/j.1541-0420.2005.00397.x. [DOI] [PubMed] [Google Scholar]

[pone.0038919-Khondoker1] 8.Khondoker MR, Glasbey CA, Worton BJ. Statistical estimation of gene expression using multiple laser scans of microarrays. Bioinformatics. 2006;22:219. doi: 10.1093/bioinformatics/bti790. [DOI] [PubMed] [Google Scholar]

[pone.0038919-Angelini1] 9.Angelini C, Cutillo L, De Canditiis D, Mutarelli M, Pensky M. BATS: a Bayesian user-friendly software for analyzing time series microarray experiments. BMC Bioinformatics. 2008;9:415. doi: 10.1186/1471-2105-9-415. [DOI] [PMC free article] [PubMed] [Google Scholar]

[pone.0038919-Hardin1] 10.Hardin J, Wilson J. A note on oligonucleotide expression values not being normally distributed. Biostatistics. 2009;10:450. doi: 10.1093/biostatistics/kxp003. [DOI] [PubMed] [Google Scholar]

[pone.0038919-SalasGonzalez1] 11.Salas-Gonzalez D, Kuruoglu EE, Ruiz DP. A heavy-tailed empirical Bayes method for replicated microarray ata. Comput Stat Data Anal. 2009;53:1546. [Google Scholar]

[pone.0038919-Posekany1] 12.Posekany A, Felsenstein K, Sykacek P. Biological assessment of robust noise models in microarray data analysis. Bioinformatics. 2011;27:814. doi: 10.1093/bioinformatics/btr018. [DOI] [PMC free article] [PubMed] [Google Scholar]

[pone.0038919-Lange1] 13.Lange KL, Little RJA, Taylor JMG. Robust statistical modelling using the t distribution. J Am Stat Assoc. 1989;84:896. [Google Scholar]

[pone.0038919-Model1] 14.Model F, König T, Piepenbrock C, Adorján P. Statistical process control for large scale microarray experiments. Bioinformatics. 2002;18:S162. doi: 10.1093/bioinformatics/18.suppl_1.s155. [DOI] [PubMed] [Google Scholar]

[pone.0038919-Kuznetsov1] 15.Kuznetsov VA, Knott GD, Bonner RF. General statistics of stochastic process of gene expression in eukaryotic cells. Genetics. 2002;161:1332. doi: 10.1093/genetics/161.3.1321. [DOI] [PMC free article] [PubMed] [Google Scholar]

[pone.0038919-Purdom1] 16.Purdom E, Holmes SP. Error distribution for gene expression data. Stat Appl Genet Mol Biol. 2005;4:16. doi: 10.2202/1544-6115.1070. [DOI] [PubMed] [Google Scholar]

[pone.0038919-Bhowmick1] 17.Bhowmick D, Davison AC, Goldstein DR, Ruffieux Y. A Laplace mixture model for identification of differential expressions in microarray experiments. Biostatistics. 2006;7:641. doi: 10.1093/biostatistics/kxj032. [DOI] [PubMed] [Google Scholar]

[pone.0038919-Sahu1] 18.Sahu SK, Dey DK, Branco MD. A new class of multivariate skew distributions with applications to Bayesian regression models. Can J Stat. 2003;31:150. [Google Scholar]

[pone.0038919-Spiegelhalter1] 19.Spiegelhalter DJ, Best NG, Carlin BP, var der Linde A. Bayesian measures of model complexity and fit. J Royal Statist Soc B. 2002;64:639. [Google Scholar]

[pone.0038919-Beyene1] 20.Beyene J, Hu P, Parkhomenko E, Tritchler D. Impact of normalization and filtering on linkage analysis of gene expression data. BMC Proc. 2007;1:S150. doi: 10.1186/1753-6561-1-s1-s150. [DOI] [PMC free article] [PubMed] [Google Scholar]

[pone.0038919-Smyth1] 21.Smyth GK, Speed TP. Normalization of cDNA microarray data. Methods. 2003;31:273. doi: 10.1016/s1046-2023(03)00155-5. [DOI] [PubMed] [Google Scholar]

[pone.0038919-Chen1] 22.Chen D-T, Lin S-H, Soong S-J. Gene selection for oligonucleotide array: an approach using PM probe level data. Bioinformatics. 2004;20:862. doi: 10.1093/bioinformatics/btg493. [DOI] [PubMed] [Google Scholar]

[pone.0038919-Varona1] 23.Varona L, Ibáñez-Escriche N, Quintanilla R, Noguera JL, Casellas J. Bayesian analysis of quantitative traits using skewed distributions. Genet Res. 2008;90:190. doi: 10.1017/S0016672308009233. [DOI] [PubMed] [Google Scholar]

[pone.0038919-Strandn1] 24.Strandén I, Gianola D. Mixed effects linear models with t-distributions for quantitative genetic analysis: a Bayesian approach. Genet Sel Evol. 1999;31:42. [Google Scholar]

[pone.0038919-Gilks1] 25.Gilks WR, Richardson S, Spiegelhalter DJ. Markov chain Monte Carlo in practice. Chapman & Hall, New York, NY. 1996.

[pone.0038919-Bull1] 26.Bull TM, Coldren CD, Moore M, Sotto-Santiago SM, Pham DV. Gene microarray analysis of peripheral blood cells in pulmonary arterial hypertension. Am J Resp Crit Care Med. 2004;170:919. doi: 10.1164/rccm.200312-1686OC. [DOI] [PubMed] [Google Scholar]

[pone.0038919-Heguy1] 27.Heguy A, O’Connor TP, Luettich K, Worgall S, Cieciuch A. Gene expression profiling of human alveolar macrophages of phenotypically normal smokers and nonsomkers reveals a previously unrecognized subset of genes modulated by cigarette smoking. J Mol Med. 2006;84:328. doi: 10.1007/s00109-005-0008-2. [DOI] [PubMed] [Google Scholar]

[pone.0038919-Platts1] 28.Platts AE, Dix DJ, Chemes HE, Thompson KE, Goodrich R. Success and failure in human spermatogenesis as revealed by teratozoospermic RNAs. Hum Mol Genet. 2007;16:773. doi: 10.1093/hmg/ddm012. [DOI] [PubMed] [Google Scholar]

[pone.0038919-Bonferroni1] 29.Bonferroni CE. Elementi di Statistica Generale. Libreria Seber, Florence, Italy. 1930.

PERMALINK

Modeling Skewness in Human Transcriptomes

Joaquim Casellas

Luis Varona

Roles

Abstract

Introduction

Results

Model Comparison

Table 1. Summary of the free-access data sets analyzed.

Table 2. Model comparison and characterization of the dispersion patter of probe and differential expression within probe under Model AT.

Estimates for Asymmetry and non-Gaussian Patterns

Figure 1.

Discussion

Materials and Methods

Mixed Model for Non-competitive Microarray Data

Example with Free-access Human Gene Expression Data

Footnotes

References

ACTIONS

PERMALINK

RESOURCES

Cite

Add to Collections

PERMALINK

Modeling Skewness in Human Transcriptomes

Joaquim Casellas

Luis Varona

Roles

Abstract

Introduction

Results

Model Comparison

Table 1. Summary of the free-access data sets analyzed.

Table 2. Model comparison and characterization of the dispersion patter of probe and differential expression within probe under Model AT.

Estimates for Asymmetry and non-Gaussian Patterns

Figure 1.

Discussion

Materials and Methods

Mixed Model for Non-competitive Microarray Data

Example with Free-access Human Gene Expression Data

Footnotes

References

ACTIONS

PERMALINK

RESOURCES

Similar articles

Cited by other articles

Links to NCBI Databases