Bioinformatics. 2020 Jan 6;36(8):2486–2491. doi: 10.1093/bioinformatics/btz974

MatchMixeR: a cross-platform normalization method for gene expression data integration

Serin Zhang 1, Jiang Shao 2, Disa Yu 3, Xing Qiu 4, Jinfeng Zhang 5
Editor: Alfonso Valencia
PMCID: PMC7868049  PMID: 31904810

Abstract

Motivation

Combining gene expression (GE) profiles generated from different platforms enables studies that would otherwise be infeasible due to sample size limitations. Several cross-platform normalization methods have been developed to remove the systematic differences between platforms, but they may also remove meaningful biological differences among datasets. In this work, we propose a novel approach that removes the platform differences, not the biological ones. Dubbed 'MatchMixeR', our method models platform differences by a linear mixed effects regression (LMER) model and estimates them from matched GE profiles of the same cell line or tissue measured on different platforms. The resulting model can then be used to remove platform differences in other datasets. By using LMER, we achieve a better bias-variance trade-off in parameter estimation. We also designed a computationally efficient algorithm based on the method of moments, which is ideal for ultra-high-dimensional LMER analysis.

Results

Compared with several prominent competing methods, MatchMixeR achieved the highest after-normalization concordance. Subsequent differential expression analyses based on datasets integrated from different platforms showed that MatchMixeR achieved the best trade-off between true and false discoveries, and this advantage was more apparent in datasets with limited samples or unbalanced group proportions.

Availability and implementation

Our method is implemented in an R package, 'MatchMixeR', freely available at: https://github.com/dy16b/Cross-Platform-Normalization.

Supplementary information

Supplementary data are available at Bioinformatics online.

1 Introduction

Large volumes of gene expression (GE) data are generated in biological research every day. The availability of such data provides an unprecedented opportunity to conduct integrative analyses that address important biomedical questions. Over the years, different companies have produced various high-throughput experimental platforms for measuring the expression levels of genes. Currently, the gene expression omnibus (GEO) database (Edgar et al., 2002) hosts 15 major GE measuring platforms, each with more than 10 000 samples, together with hundreds of minor platforms with smaller numbers of samples. GE data produced by different platforms cannot be directly combined because of platform differences. Figure 1 compares six scatter plots of matched samples' GE profiles quantified by three different platform pairs. If there were no systematic platform differences, the points would cluster around the diagonal line y = x, since each point represents the expression level of the same gene in the same sample, measured by two different platforms. Clearly, there are marked patterns for each of the three platform pairs in Figure 1(a–c). Specifically, scatter plots of the same cell line on different platform pairs [a and b for the breast cancer (BRCA) cell line MCF7; d and e for the central nervous system (CNS) cancer cell line SF_268] look quite different, while the top and bottom plot pairs (a–d, b–e and c–f) share the same patterns even though the two samples carry different biological information. This empirical evidence suggests that platform differences not only exist, but may be the dominant effect in cross-platform integrative analyses.

Fig. 1.

Scatterplots of GE values for the same cell line or tissue sample generated by different platforms before normalization, for three pairs of platforms: (a) BRCA cell line MCF7 by GPL96 and GPL570; (b) BRCA cell line MCF7 by GPL4133 and GPL570; (c) BRCA tissue from the same patient by Agilent G4502A and GPL11154; (d) CNS cancer cell line SF_268 by GPL96 and GPL570; (e) CNS cancer cell line SF_268 by GPL4133 and GPL570; and (f) BRCA tissue from another patient by Agilent G4502A and GPL11154. Each point represents the mRNA expression value of a gene. The X axis is GE measured by one platform and the Y axis is GE for the same cell line or tissue sample measured by the other platform

In general, cross-study differences (also known as batch effects) between two datasets generated by two different platforms from two different studies can be decomposed into three components as follows:

Cross-study differences = Sample differences + Lab differences + Platform differences.  (1)

Lab differences arise from the different conditions and technicians involved in performing the same experiment. Platform differences arise because the biological samples are measured by different instruments based on different techniques. Lab differences and platform differences are unwanted variations that have a detrimental effect on integrative analyses. Sample differences come from differences in sample collection (or study design) between two studies. They may be minor if the samples are collected from very similar populations, or extremely large if the samples are collected from two entirely dissimilar populations. These sample differences, if relevant to the study, should be kept in the cross-platform normalization (XPN) process. For example, one study may have more tumor samples and the other study more normal samples. A new project comparing tumor and normal samples by combining the two earlier studies should not remove the biological differences between the two datasets caused by the dissimilar proportions of tumor patients. In an extreme case, if one GE dataset contains only tumor samples measured by platform A and another contains only normal samples generated by platform B, removing the sample differences between the two datasets in the combining process would make the subsequent analyses meaningless.

Most lab differences can be removed by standard normalization methods (Irizarry et al., 2003; Leek et al., 2010; McCall et al., 2010; Zahurak et al., 2007), assuming that there is neither global gene amplification or repression nor a large proportion of differentially expressed genes (DEGs) (Loven et al., 2012; Qiu et al., 2013). Lab differences can also be minimized by having researchers follow standard experimental protocols (Bergamaschi et al., 2006). Platform differences are harder to eliminate, since data produced by different platforms may not satisfy the assumptions behind these standard normalization methods, such as GEs having similar medians (Parrish and Spencer, 2004) and/or similarly shaped distributions (Bolstad et al., 2003).
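As an illustration of such a standard within-platform normalization (our example, not one prescribed by the text), quantile normalization forces all samples on a platform to share the same empirical distribution, which is precisely why it presumes no global shifts or a large DEG fraction:

library(preprocessCore)  # Bioconductor

## Quantile-normalize a p-by-n log2 expression matrix within one platform.
## This equalizes the per-sample distributions, so it is only appropriate
## under the no-global-shift assumptions discussed above.
expr_qn <- normalize.quantiles(expr)   # expr: numeric matrix, genes x samples
dimnames(expr_qn) <- dimnames(expr)    # normalize.quantiles drops dimnames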

To remove platform differences, several batch-effect removal (BER) methods have been developed, with various levels of success (Rudy and Valafar, 2011). Specifically, distance weighted discrimination (DWD) adapts support vector machines to find the hyperplane separating two datasets that minimizes the sum of the inverse distances from the data points to the plane (Benito et al., 2004). XPN is based on a block-linear model: it first clusters similar genes and samples into blocks by k-means clustering, and then fits a combination of additive and multiplicative weights within each block (Shabalin et al., 2008); the output of the algorithm is the average of the normalized values obtained over repeated runs. ComBat (Johnson et al., 2007; Walker et al., 2008), based on an empirical Bayes method, is a shrinkage version of the location-scale adjustment model. It estimates the location-scale model parameters by pooling information across genes, so that the parameter estimates are pulled toward the grand mean of the estimates across genes. A review paper (Rudy and Valafar, 2011) compared nine XPN methods (Benito et al., 2004; Jiang et al., 2004; Shabalin et al., 2008; Walker et al., 2008; Warnat et al., 2005; Xia et al., 2009) and showed that the commonly used quantile normalization method performed rather poorly, while DWD, XPN and ComBat generally performed better than the other methods.

Existing BER methods remove all three types of differences in Equation (1), and consequently they induce bias when group distributions differ between the two datasets to be combined. Because it is very challenging to distinguish lab differences from sample differences, a better strategy is to remove as much of the lab differences as possible using a common normalization method on each individual platform, and then use an XPN method to take out only the platform differences across the datasets. The platform differences can be treated as constant between two given platforms and can be estimated using data collected for that purpose. The estimated differences can then be used in transformations that convert data from one platform to another.

Arguably, a matched sample design, under which GE levels are measured by different platforms for the same biological samples, is best for estimating platform differences. By design, there are no sample differences among the matched datasets. If the experiments are done under standard protocols, lab differences are minimized as well. Therefore, cross-study differences between matched samples processed by different platforms come mainly from the platform differences. We identified matched sample datasets in the public domain, from the CellMiner database and The Cancer Genome Atlas (TCGA). We developed a fast linear mixed effects regression (FLMER) algorithm to fit gene-specific models using selected matched sample data (called the benchmark training data). These models were then used to construct transformations that homogenize the platform differences of other data to be combined (called the research data). Figure 2 illustrates the main concept of our method.

Fig. 2.

A visual explanation of our new method, MM. The goal is to infer systematic platform differences between platforms X and Y. First, GE data of the same samples measured by two different platforms, called matched sample data (benchmark training data), were collected. These data were then used to estimate the systematic platform differences between platforms X and Y using our FLMER algorithm. The inferred transformations will then be used to transform research data in one platform to another platform before being combined

We compared MatchMixeR (MM) with DWD, XPN and ComBat using both simulated and real datasets. Overall, MM achieved the highest after-normalization concordance and showed remarkable advantages over the other methods in differential expression (DE) analysis. Our approach not only increased the reproducibility of integrative analyses using multiple datasets from different platforms, but also allowed datasets with very small sample sizes to be readily combined based on pre-trained transformations.

2 Methods and data

2.1 Methods

Throughout the study, we assumed that we had a benchmark training dataset with GEs obtained from matched samples measured on two platforms: A (denoted by $x_{ij}$, where $i = 1, 2, \ldots, p$ indexes genes and $j = 1, 2, \ldots, n$ indexes samples) and B (denoted by $y_{ij}$). We also assumed that a researcher would like to conduct an integrative analysis on two sets of GE data measured on platforms A (denoted by $x_{ij}^{\text{new},A}$) and B (denoted by $y_{ij}^{\text{new},B}$). Our goal was to estimate per-gene transformations between platforms A and B from the matched samples, and then apply them to $x_{ij}^{\text{new},A}$ to obtain $\hat{x}_{ij}^{\text{new},B}$, the predicted GEs of the first set of data on platform B. With the platform differences removed, $\hat{x}_{ij}^{\text{new},B}$ can then be combined with $y_{ij}^{\text{new},B}$ for downstream analyses.

A simple version of our method was based on the following ordinary least squares (OLS) model [the individualized OLS model; see Equation (22) in the Supplementary Material for more technical details]:

$$y_{ij} = \beta_{0i} + x_{ij}\beta_{1i} + \epsilon_{ij}, \qquad \epsilon_{ij} \sim N(0, \sigma_\epsilon^2).$$

Once the estimates $\hat{\beta}_i = (\hat{\beta}_{0i}, \hat{\beta}_{1i})'$ were obtained, we defined the transformed data as $\hat{x}_{ij}^{\text{new},B} = \hat{\beta}_{0i} + x_{ij}^{\text{new},A}\hat{\beta}_{1i}$.
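A minimal sketch of this individualized OLS transformation (our illustration; the packaged implementation may differ), where x and y are p-by-n matrices of matched training GEs on platforms A and B, and x_new holds the research data on platform A:

## Fit a separate intercept/slope for each gene i and map platform-A research
## data onto the platform-B scale: x_hat = beta0_i + x_new * beta1_i.
ols_transform <- function(x, y, x_new) {
  stopifnot(nrow(x) == nrow(y), nrow(x) == nrow(x_new))
  x_hat <- x_new
  for (i in seq_len(nrow(x))) {
    beta <- coef(lm(y[i, ] ~ x[i, ]))        # (beta0_i, beta1_i)
    x_hat[i, ] <- beta[1] + beta[2] * x_new[i, ]
  }
  x_hat                                      # predicted platform-B expressions
}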

By design, platform B was considered the target platform, so $y_{ij}^{\text{new},B}$ was not changed. We recommend assigning the platform with the larger sample size and/or better signal-to-noise ratio as platform B, to reduce manipulations of the data.

The OLS model was simple and computationally efficient, and it generated reasonably good results throughout our analyses. However, it had too many degrees of freedom (DF = 2p + 1), so the resulting estimates had high variability when the training data were small. It is well known that penalty-based shrinkage methods, such as the LASSO (Tibshirani, 1996) and ridge regression (Hoerl and Kennard, 1970), can reduce the variability of a regression model. However, these approaches shrink $\hat{\beta}_i$ toward zero, thus inducing detrimental biases. Another drawback of penalized regression is that the penalty parameters need to be determined by time-consuming cross-validation or the generalized cross-validation (GCV) principle, which makes it less suitable for high-throughput data analysis.

As an alternative, certain linear mixed effects regression (LMER) models are known to have smaller variability than their OLS counterparts, and they shrink the estimates toward the fixed effects instead of zero (Maldonado, 2009), and are therefore much less biased. Another attractive property of the LMER approach is that the magnitude of shrinkage, the analog of the penalty parameter in LASSO/ridge regression, is controlled by the ratio of the variances of the i.i.d. noise and the random effects [Supplementary Equation (6)], which can be estimated from the data without time-consuming cross-validation. Specifically, we proposed the following LMER model [see Model (20) in the Supplementary Material for more details]:

$$y_{ij} = (\beta_0 + \gamma_{0i}) + x_{ij}(\beta_1 + \gamma_{1i}) + \epsilon_{ij}.$$

Here, $\beta_0$ and $\beta_1$ are fixed effects shared by all genes, and $\gamma_{0i}$, $\gamma_{1i}$ are random intercepts/slopes pertaining to individual genes. We call the above model the MM model because it is based on matched samples and mixed effects regression.

Shrinking gene-specific platform effects toward the fixed effects makes much more sense than shrinking them to zero, as penalized regression methods do. However, currently available LMER fitting algorithms, such as those in the lme4 package in R, are based on computationally intensive iterative algorithms and are prone to numerical instability and occasional convergence issues. We developed a novel moment-based method for the MM model. It does not depend on iterative procedures and was more than 130 times faster than lme4; we therefore named it the FLMER method. In a simulation study, we found that at a fraction of the computational cost, FLMER produced slightly more accurate estimates than the lme4-based approach, and both were better than the OLS method, especially when the training samples were small. Technical details of the FLMER method and the comparisons among FLMER, lme4 and OLS are documented in Supplementary Sections 1.4 and 1.5. In the remaining sections, we focus on comparing the MM model with three major competing methods, XPN, DWD and ComBat, which were applied directly to the data to be combined.
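For reference, the MM model can be written in lme4 syntax as gene-level random intercepts and slopes around the shared fixed effects. This is the (much slower) fit that FLMER replaces, sketched here under the assumption that x and y are the p-by-n matched training matrices:

library(lme4)

## Long-format data: one row per (gene, sample) pair. as.vector() stacks a
## matrix column by column, so the gene index repeats 1..p within each sample.
p <- nrow(x); n <- ncol(x)
long <- data.frame(y    = as.vector(y),
                   x    = as.vector(x),
                   gene = factor(rep(seq_len(p), times = n)))

## y_ij = (beta0 + gamma0_i) + x_ij * (beta1 + gamma1_i) + eps_ij
fit <- lmer(y ~ x + (1 + x | gene), data = long)
shrunk <- coef(fit)$gene   # per-gene (intercept, slope), shrunk toward the fixed effects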

2.2 Data

In the real data study, three matched GE datasets were used. The first, from CellMiner (Shankavaram et al., 2009), contained 58 NCI60 cell lines whose GEs were measured by two microarray platforms, the Human Genome U133A Array and the Human Genome U133 Plus 2.0 Array. Both microarrays were manufactured by Affymetrix (Dalma-Weiszhausz et al., 2006), and their platform accession codes in GEO (Edgar et al., 2002) were GPL96 and GPL570, respectively. The second dataset, also from CellMiner, contained 59 NCI60 cell lines whose GEs were measured by two microarray platforms from different manufacturers: GPL570 and the Whole Human Genome Microarray made by Agilent (GPL4133) (CrysAlis, 2011; Tomczak et al., 2015). The last dataset, from the TCGA Data Portal (Tomczak et al., 2015), contained 583 matched patients' mRNA expression data measured by a microarray platform (Agilent G4502A) and the Illumina HiSeq 2000 (GPL11154), a next-generation sequencing (NGS) platform. All data downloaded from the public databases were properly pre-processed, including within-platform normalization and log2 transformation.

3 Results

3.1 Simulation

We simulated three matched sample datasets to evaluate our method and compare it with existing methods. All three datasets contained GE levels for 10 000 genes (rows) and the same number of matched samples from both platforms A and B. The numbers of samples per platform in the three datasets were 100, 30 and 5, respectively. We used the first dataset, with 100 samples, as the benchmark training data, and the other two datasets as large and small testing data (i.e. the research data). Technical details about these simulated datasets can be found in Section 1.5 of the Supplementary Material.

We fit the FLMER model to the training data to obtain the transformations and applied them to the two testing datasets. The data from platform A were transformed and combined with the data from platform B. Note that our method is one of only a few that use a matched dataset as a training set to obtain the transformations, allowing the data values of platform B to remain unchanged, whereas most other methods directly normalize the testing dataset and transform the data values from both platforms A and B.

3.1.1. After-normalization concordance

As an initial assessment, we first compared the cross-platform concordances after XPN by looking at the column-wise R2 between the transformed data of platforms A and B.

Figure 3 shows the boxplots of R2 after normalization by MM, XPN, DWD and ComBat. All four methods performed very well in terms of the R2 values, indicating that applying an appropriate XPN enhances the concordance of data generated on different platforms. For a closer look, we recorded the root mean square distance (RMSD) in Table 1. Here, the RMSD represents the gap between expression values in platform A and the corresponding values in platform B. Of note, the RMSD equals the root mean RSS for our method but not for the others, because they adjust the data values from both platforms A and B. As with the RSS, a smaller RMSD indicates better performance, because the data values from both platforms are mRNA expressions of the same genes in the same samples. Table 1 shows that all methods substantially reduced the RMSD, but MM produced the smallest values for both the small (n = 5) and large (n = 30) testing data. We also compared the computational times of all methods on the small testing dataset and found that MM (0.895 s) was about 3–200 times faster than the other methods (DWD: 5.307 s; XPN: 173.671 s; ComBat: 3.195 s).
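Both metrics are straightforward to compute. A sketch, assuming xb_hat is the transformed platform-A matrix and y_b the matched platform-B matrix (names are ours):

## Column-wise R^2 (one value per sample) and overall RMSD between the
## transformed platform-A data and the matched platform-B data.
col_r2 <- sapply(seq_len(ncol(y_b)),
                 function(j) cor(xb_hat[, j], y_b[, j])^2)
rmsd   <- sqrt(mean((xb_hat - y_b)^2))
boxplot(col_r2, ylab = expression(R^2))   # as in Figure 3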

Fig. 3.

Boxplots of column-wise R2 for each method: (a) for the test set with five samples, and (b) for the test set with 30 samples

Table 1.

RMSD between expression values on platforms A and B after normalization

n     Before   MM      DWD     XPN     ComBat
30    2.518    0.001   0.051   0.012   0.024
5     2.518    0.001   0.046   0.025   0.026

3.1.2. DE analysis

Next, we conducted a DE analysis using simulated data to show how well these methods selectively remove platform differences while keeping biological signals. We first generated four simulated datasets, each with 10 000 genes (rows), of which 1000 were DEGs. All datasets had 60 samples, and the composition of these 60 samples differed across datasets. The first dataset contained 15 platform A samples and 15 platform B samples in both Group 1 and Group 2. In a real case, Group 1 and Group 2 might represent a tumor group and a normal group, respectively. This was the most balanced case and the most ideal situation for the competing methods. Datasets 2, 3 and 4 were unbalanced, with dataset 4 the most unbalanced: all samples in Group 1 were from platform A and all samples in Group 2 were from platform B. For the two-group comparison, we employed the rowttests function in the 'genefilter' R package and used the Holm–Bonferroni method for the P-value adjustment. The threshold for selecting DEGs was padj < 0.05.
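The two-group test and adjustment described above amount to the following, where expr is the combined (normalized) expression matrix and grp a two-level factor over its columns:

library(genefilter)

## Row-wise two-sample t-tests, Holm-Bonferroni adjustment, and DEG selection
## at the adjusted 0.05 level.
tt    <- rowttests(expr, grp)
p_adj <- p.adjust(tt$p.value, method = "holm")
degs  <- which(p_adj < 0.05)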

We used the F1-score (a.k.a. F-score or F-measure), defined as $F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$, to evaluate the different methods. As the harmonic mean of precision and recall, the F1-score is an overall measure of a model's accuracy. As shown in Table 2, all methods performed well in the balanced case (dataset 1). However, MM performed substantially better than the other methods in the unbalanced cases. In the most unbalanced situation (dataset 4), only our method worked, while all other methods failed completely. A similar set of simulated datasets with smaller samples showed similar results (see Supplementary Table S6).
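Given the selected genes and the 1000 simulated DEGs, the table entries follow directly. A small helper, with truth denoting the indices of the true DEGs (our name):

## True/false positives and F1-score for a set of selected genes.
f1_score <- function(selected, truth) {
  tp <- length(intersect(selected, truth))
  fp <- length(setdiff(selected, truth))
  fn <- length(setdiff(truth, selected))
  precision <- tp / (tp + fp)
  recall    <- tp / (tp + fn)
  c(TP = tp, FP = fp, F1 = 2 * precision * recall / (precision + recall))
}
f1_score(degs, truth)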

Table 2.

Mean true positives, false positives and F1 scores over 30 repetitions of the downstream differential GE analyses for the simulated data

Dataset (Group 1 A:B / Group 2 A:B)        MM        DWD      XPN      ComBat
(1) 15:15 / 15:15                   TP     958.9     959.3    954.9    959.9
                                    FP     0.03      0.10     0.03     0.10
                                    F1     0.9790    0.9792   0.9769   0.9795*
(2) 15:15 / 30:0                    TP     960.5     893.5    796.9    878.3
                                    FP     0.10      0.00     0.00     0.00
                                    F1     0.9798*   0.9438   0.8869   0.9352
(3) 5:25 / 25:5                     TP     952.8     815.1    807.6    822.4
                                    FP     0.00      0.00     0.00     0.00
                                    F1     0.9758*   0.8981   0.8935   0.9025
(4) 30:0 / 0:30                     TP     962.8     0        0        0
                                    FP     0.13      0        0        0
                                    F1     0.9810*   NaN      NaN      NaN

Note: The ratios in the first column (A:B / C:D) are sample sizes in the two groups and two platforms: A is the Group 1 sample size on platform A; B, Group 1 on platform B; C, Group 2 on platform A; D, Group 2 on platform B. For example, in the second case (15:15 / 30:0), 15 samples in Group 1 are from platform A and 15 from platform B, while all 30 samples in Group 2 are from platform A. Entries marked with an asterisk are the best F1 score among all methods.

3.2 Real data analysis

3.2.1. After-normalization concordance

In the real data analyses, we split the three real matched sample datasets acquired from CellMiner and TCGA into training and testing datasets. The training datasets were used to obtain the transformations defined by the FLMER model for combining the testing datasets. The other methods were applied directly to the testing datasets. For the microarray data from CellMiner, 40 random cell lines were selected as the training dataset and 10 cell lines were randomly chosen from the remaining data as the testing dataset. For the TCGA data, 450 matched patient samples were used as the training dataset and 50 samples from the remaining matched samples as the testing dataset. We repeated the experiments 30 times and recorded the average performance metrics. As in the simulation studies, we compared the mean column-wise R2 values. The columns represent samples (cell lines or patients) and the rows represent genes or probes. Again, a higher R2 value indicates higher cross-platform concordance and better performance.

Figure 4 shows that our method had high cross-platform concordance on all three datasets. All methods performed very well on the GPL96/GPL570 platform pair: because GPL570 is an upgraded version of GPL96, the two microarrays are technically quite alike, and hence the GE data quantified by the two platforms have similar statistical properties. Here, XPN and ComBat had slightly higher R2 values than our method, but the gaps were not significant. On the other hand, our method performed slightly better for the two microarray platforms with larger differences (Affymetrix vs. Agilent), and it performed much better than the other methods for the array/NGS platform pair, which had the largest differences. DWD had the worst performance on all three platform pairs, and its disadvantage became much clearer as the platforms grew more different. Overall, in the comparison of after-normalization concordance, MM performed the best among the four XPN methods.

Fig. 4.

Box plots of mean column-wise R2 using real GE data: (a) the GPL96/GPL570 platform pair; (b) GPL570/GPL4133; and (c) Agilent G4502A/GPL11154

3.2.2. DE analysis

We also conducted a DE analysis on the TCGA BRCA data. Here, we used the entire TCGA BRCA dataset, whereas only 583 matched samples were used in the previous section. Specifically, the 583 matched samples, comprising 524 tumor samples and 59 normal samples, were used as the training set to obtain the transformation functions for our MM method.

As for the research (testing) data, we used six different datasets, which were not matched samples; the composition of tumor and normal samples on each platform varied across the testing datasets (see Supplementary Table S7). Since the real DEGs of the TCGA data were unknown, we obtained a list of DEGs using DESeq2 (Love et al., 2014) on the entire TCGA BRCA RNA-Seq (NGS) data, which contained 1212 samples (1020 tumor and 112 normal) measured by GPL11154. From this list, we selected the 9555 DEGs whose adjusted P-values were <0.001. This set of DEGs was treated as the gold standard.
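A sketch of how such a gold-standard list can be derived with the standard DESeq2 workflow, assuming counts is the gene-by-sample count matrix and condition a tumor/normal factor (both names are ours):

library(DESeq2)

## Build the dataset, fit the negative-binomial model, and keep genes with
## adjusted P < 0.001 as the gold-standard DEGs.
dds  <- DESeqDataSetFromMatrix(countData = counts,
                               colData   = data.frame(condition = condition),
                               design    = ~ condition)
dds  <- DESeq(dds)
res  <- results(dds)
gold <- rownames(res)[which(res$padj < 0.001)]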

Among the six scenarios (Table 3), dataset 1 represents the most balanced case, with a 1:1 ratio between array samples and RNA sequencing samples in both the normal and tumor groups. Dataset 6 represents the most unbalanced case, consisting of only array data for normal samples and only sequencing data for tumor samples.

Table 3.

Average true and false positives and the F1 scores from 30 repetitions of the downstream DE analyses for the TCGA data

Dataset (Group 1 A:B / Group 2 A:B)        MM        DWD      XPN      ComBat   NGS      Fisher   Array
(1) 10:10 / 10:10                   TP     6115.6    6163.8   5971.8   6168.0   5609.4   5328.6   4913.2
                                    FP     1353.8    1465.0   1335.8   1471.6   1805.0   1542.2   1275.5
                                    F1     0.7184*   0.7174   0.7083   0.7174   0.6844   0.6488   0.6241
(2) 15:5 / 15:5                     TP     6007.8    6098.2   6125.6   6100.6   5744.4   4211.8   4147.1
                                    FP     1495.0    1680.8   1655.8   1688.0   1918.2   1077.2   1304.2
                                    F1     0.7044    0.7036   0.7066*  0.7034   0.6673   0.5674   0.5527
(3) 15:5 / 5:15                     TP     5564.8    4600.0   4268.2   4753.2   4967.8   4021.2   3872.6
                                    FP     1391.6    816.4    704.2    873.4    1501.6   1098.1   1018.8
                                    F1     0.6740*   0.6145   0.5876   0.6262   0.6200   0.5481   0.5361
(4) 25:0 / 5:10                     TP     5372.0    1582.4   1691.0   2185.8   4005.6   4005.6   N/A
                                    FP     1661.6    243.8    252.4    361.6    1074.5   1074.5   N/A
                                    F1     0.6477*   0.2780   0.2941   0.3612   0.5474   0.5474   N/A
(5) 20:5 / 0:15                     TP     5628.6    1675.6   227.6    1064.4   N/A      3642.6   3642.6
                                    FP     1394.6    269.4    38.8     188.0    N/A      1001.7   1001.7
                                    F1     0.6790*   0.2914   0.0562   0.1969   N/A      0.5025   0.5025
(6) 20:0 / 0:20                     TP     5404.2    0        0        0        N/A      N/A      N/A
                                    FP     1465.0    0        0        0        N/A      N/A      N/A
                                    F1     0.6581*   N/A      N/A      N/A      N/A      N/A      N/A

Note: The ratios in the first column follow the same A:B / C:D convention as Table 2. Entries marked with an asterisk are the best F1 score among all methods.

In this analysis, we also included the results of Fisher's P-value combination method and of DE analyses based on a single platform. For Fisher's method, we first performed the DE analyses separately on the microarray data and the sequencing data, and then combined the P-values using Fisher's method. For datasets 4 and 5, Fisher's method had the same performance as the microarray-only or NGS-only analysis, because DE analysis could be performed for only one platform. For dataset 6, DE analysis could not be performed on any individual platform, because each platform contained samples from only one patient group (tumor or normal). For all six datasets, we repeated the experiment 30 times and recorded the average true positives, false positives and F1 scores in Table 3.
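For two platforms, Fisher's combination has a closed form: under the null, $-2(\log p_1 + \log p_2)$ follows a chi-squared distribution with 4 degrees of freedom. A per-gene sketch (vector names are ours):

## Combine two vectors of per-gene P-values (one per platform) by Fisher's
## method; 2 tests => 2 * 2 = 4 degrees of freedom.
fisher_combine <- function(p1, p2) {
  pchisq(-2 * (log(p1) + log(p2)), df = 4, lower.tail = FALSE)
}
p_comb <- fisher_combine(p_array, p_ngs)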

According to Table 3, among the XPN methods, MM had the most robust and stable results across all six cases. As in the simulation studies, DWD, XPN and ComBat performed relatively well in the balanced case. However, their performance was poor for unbalanced data. This tendency is visualized by their F1-score lines across all the examples in Figure 5: their starting points were close to our method but decreased rapidly from case (3) onward and became useless in case (6), the most unbalanced case. In contrast, our method produced stable results even in the worst unbalanced scenario. Since many real datasets are somewhat unbalanced, we believe that robustness to imbalance is an important advantage of our method.

Fig. 5.

F1 scores of XPN methods and the Fisher method

4 Discussion

The main purpose of XPN is to remove systematic differences across platforms while maintaining the relevant biological differences. Existing BER methods cannot effectively distinguish between lab differences and sample differences. Consequently, they perform poorly when combining datasets with different distributions of biological samples, which can be quite common in integrative studies using public datasets.

In this study, we developed MM using a matched sample design to estimate GE platform differences, which can then be used to convert data from one platform to another. MM offers several unique advantages when compared to the existing methods.

First, once the platform differences between two platforms are inferred, they can be used to merge multiple datasets from those two platforms, or even from more platforms. To combine multiple datasets from two platforms, we first learn the transformations between them; we then keep the datasets from one platform, designated the reference platform, unchanged, and transform all datasets from the other platform onto the reference platform. Datasets from more than two platforms can be combined in the same way. For example, using GPL570 as the reference platform, we can learn the GPL96-to-GPL570 and GPL4133-to-GPL570 transformations, and then combine datasets generated on GPL96, GPL4133 and GPL570 by keeping the GPL570 datasets unchanged and transforming all GPL96 and GPL4133 datasets onto GPL570. Secondly, for any two platforms, we can learn two transformations by making each of them in turn the reference platform. When combining datasets, we keep the dataset with more samples unchanged and apply the transformation only to the dataset with fewer samples, to minimize the undesirable bias caused by data manipulation. Thirdly, since the transformations are learned from benchmark data, MM can be used to combine datasets with very small numbers of samples. Fourthly, by using the platform with the most abundant samples as the reference, we can transform datasets from all other platforms in public databases, such as GEO, onto that reference platform. Finally, since the MM model is gene-specific, we can quantify confidence intervals for the transformed expressions of individual genes and use them as weights in downstream DE analyses. It is also possible to design a dichotomous DE analysis strategy: (i) for genes with relatively tight XPN confidence intervals, conduct a single DE analysis based on the combined expression profiles; and (ii) for the remaining genes, perform separate DE analyses on the individual platforms and then combine the resulting P-values with Fisher's method. All of these advantages are difficult or impossible to achieve with existing BER methods.

In this study, our goal was to demonstrate the effectiveness of MM in removing only the platform differences between datasets generated by different platforms. We downloaded pre-normalized data from CellMiner and TCGA. However, this pre-normalization may affect the transformations learned between the two platforms. For the learned transformations to serve as future standards for other researchers, the pre-normalization step needs to be standardized. One good option is to first perform frozen robust multiarray analysis (fRMA; McCall et al., 2010) on all of the samples used to build the models. When combining future datasets, fRMA would be used again to normalize those datasets.
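A sketch of that option using the Bioconductor frma package (the file path is hypothetical): fRMA normalizes each array against frozen, pre-estimated reference distributions, so samples processed at different times remain comparable:

library(affy)   # ReadAffy
library(frma)   # frozen RMA

## Read raw CEL files from a hypothetical directory and apply frozen RMA,
## which normalizes each sample against fixed reference parameters.
abatch <- ReadAffy(celfile.path = "cel_files")
eset   <- frma(abatch)
expr   <- exprs(eset)   # log2 expression matrix, genes x samples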

Finally, from Table 3, we can see that the XPN methods generally performed better than Fisher's method, which in turn was better than the single-platform analyses. This indicates that: (i) an integrated DE analysis, regardless of the method, was generally better than an analysis based on only one platform; and (ii) when possible, one should combine GE profiles before performing DE analysis rather than combining P-values after individual analyses.

5 Conclusion

In this study, we designed a method, called MM, for combining multiple GE datasets. MM was designed to remove platform differences while keeping the biological differences. Using both simulated and real data, we showed that MM performed better than competing methods in terms of both after-normalization concordance and downstream DE analysis. We believe that MM will have wide applications in integrative studies of GE data. Our method is implemented as an R package, 'MatchMixeR', freely available at: https://github.com/dy16b/Cross-Platform-Normalization.

Funding

This work was supported by the National Institute of General Medical Sciences of the National Institutes of Health [R01GM126558 to J.Z.].

Conflict of Interest: none declared.

Supplementary Material

btz974_Supplementary_Data

Contributor Information

Serin Zhang, Department of Statistics, Florida State University, Tallahassee, FL 32306, USA.

Jiang Shao, Gilead Sciences Inc., Foster City, CA 94404, USA.

Disa Yu, Department of Statistics, Florida State University, Tallahassee, FL 32306, USA.

Xing Qiu, Department of Biostatistics and Computational Biology, University of Rochester, Rochester, NY 14624, USA.

Jinfeng Zhang, Department of Statistics, Florida State University, Tallahassee, FL 32306, USA.

References

  1. Benito M. et al. (2004) Adjustment of systematic microarray data biases. Bioinformatics, 20, 105–114.
  2. Bergamaschi A. et al. (2006) Distinct patterns of DNA copy number alteration are associated with different clinicopathological features and gene-expression subtypes of breast cancer. Genes Chromosomes Cancer, 45, 1033–1040.
  3. Bolstad B.M. et al. (2003) A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics, 19, 185–193.
  4. CrysAlis P. (2011) Agilent Technologies, Yarnton, Oxfordshire, England.
  5. Dalma-Weiszhausz D.D. et al. (2006) The Affymetrix GeneChip® platform: an overview. Methods Enzymol., 410, 3–28.
  6. Edgar R. et al. (2002) Gene Expression Omnibus: NCBI gene expression and hybridization array data repository. Nucleic Acids Res., 30, 207–210.
  7. Hoerl A.E., Kennard R.W. (1970) Ridge regression: biased estimation for nonorthogonal problems. Technometrics, 12, 55–67.
  8. Irizarry R.A. et al. (2003) Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics, 4, 249–264.
  9. Jiang H. et al. (2004) Joint analysis of two microarray gene-expression data sets to select lung adenocarcinoma marker genes. BMC Bioinformatics, 5, 81.
  10. Johnson W.E. et al. (2007) Adjusting batch effects in microarray expression data using empirical Bayes methods. Biostatistics, 8, 118–127.
  11. Leek J.T. et al. (2010) Tackling the widespread and critical impact of batch effects in high-throughput data. Nat. Rev. Genet., 11, 733–739.
  12. Love M.I. et al. (2014) Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol., 15, 550.
  13. Loven J. et al. (2012) Revisiting global gene expression analysis. Cell, 151, 476–482.
  14. Maldonado Y.M. (2009) Mixed models, posterior means and penalized least-squares. IMS Lecture Notes-Monograph Series, Institute of Mathematical Statistics, Vol. 57, pp. 216–236.
  15. McCall M.N. et al. (2010) Frozen robust multiarray analysis (fRMA). Biostatistics, 11, 242–253.
  16. Parrish R.S., Spencer H.J. 3rd. (2004) Effect of normalization on significance testing for oligonucleotide microarrays. J. Biopharm. Stat., 14, 575–589.
  17. Qiu X. et al. (2013) The impact of quantile and rank normalization procedures on the testing power of gene differential expression analysis. BMC Bioinformatics, 14, 124.
  18. Rudy J., Valafar F. (2011) Empirical comparison of cross-platform normalization methods for gene expression data. BMC Bioinformatics, 12, 467.
  19. Shabalin A.A. et al. (2008) Merging two gene-expression studies via cross-platform normalization. Bioinformatics, 24, 1154–1160.
  20. Shankavaram U.T. et al. (2009) CellMiner: a relational database and query tool for the NCI-60 cancer cell lines. BMC Genomics, 10, 277.
  21. Tibshirani R. (1996) Regression shrinkage and selection via the Lasso. J. R. Stat. Soc. Ser. B Methodol., 58, 267–288.
  22. Tomczak K. et al. (2015) The Cancer Genome Atlas (TCGA): an immeasurable source of knowledge. Contemp. Oncol., 19, A68.
  23. Walker W.L. et al. (2008) Empirical Bayes accomodation of batch-effects in microarray data using identical replicate reference samples: application to RNA expression profiling of blood from Duchenne muscular dystrophy patients. BMC Genomics, 9, 494.
  24. Warnat P. et al. (2005) Cross-platform analysis of cancer microarray data improves gene expression based classification of phenotypes. BMC Bioinformatics, 6, 265.
  25. Xia X.Q. et al. (2009) WebArrayDB: cross-platform microarray data analysis and public data repository. Bioinformatics, 25, 2425–2429.
  26. Zahurak M. et al. (2007) Pre-processing Agilent microarray data. BMC Bioinformatics, 8, 142.
