Abstract
Motivation
Scaling by sequencing depth is usually the first step in the analysis of bulk or single-cell RNA-seq data, but estimating sequencing depth accurately can be difficult, especially for single-cell data, putting the validity of downstream analyses at risk. It is thus of interest to eliminate the use of sequencing depth and analyze the original count data directly.
Results
We call an analysis method ‘scale-invariant’ (SI) if it gives the same result under different estimates of sequencing depth and hence can use the original count data without scaling. For the problem of classifying samples into pre-specified classes, such as normal versus cancerous, we develop a deep-neural-network-based SI classifier named the scale-invariant deep neural-network classifier (SINC). On nine bulk and single-cell datasets, the classification accuracy of SINC is better than or competitive with the best of eight other classifiers. SINC is easier to use and more reliable on data where the proper sequencing depth is hard to determine.
Availability and implementation
The source code of SINC is available at https://www.nd.edu/~jli9/SINC.zip.
Supplementary information
Supplementary data are available at Bioinformatics online.
1 Introduction
In RNA-seq experiments, different total numbers of reads are generated for different samples, and eliminating this difference by ‘normalization’ is often the first step of analyzing RNA-seq data. Such normalization is typically done by first estimating a size factor called ‘sequencing depth’, which is roughly proportional to the total number of reads of the sample, and then dividing individual counts by the corresponding sequencing depth. Correct normalization is a prerequisite for successful analysis and unbiased interpretation of RNA-seq data.
However, the estimation of sequencing depth is far from trivial. The naïve method, called total counts (TC), estimates the sequencing depth by the total number of read counts in the sample. It can be highly biased in the presence of differentially expressed genes (Dillies et al., 2013). Relying explicitly or implicitly on the assumption that most genes are not differentially expressed, other methods have been proposed, such as upper quartile (UQ) (Bullard et al., 2010), TMM (Robinson and Oshlack, 2010) and DESeq (Anders and Huber, 2010), which have shown superior performance compared with TC in tasks such as differential expression analysis of bulk RNA-seq data (for a review of normalization methods for bulk RNA-seq data, see Dillies et al., 2013). In single-cell RNA-seq (scRNA-seq) data, because of widespread dropouts and a much higher noise-to-signal ratio, methods such as UQ, TMM and DESeq often encounter difficulties (Vallejos et al., 2017), and methods such as scran (Lun et al., 2016) were proposed to handle the unique characteristics of single-cell data (Bacher et al., 2017; Ding et al., 2015; Katayama et al., 2013).
However, there are two problems with the current normalization-before-analysis strategy for RNA-seq data. First, different methods may give distinct estimates of sequencing depth, especially when applied to bulk data of highly heterogeneous samples, such as cancerous samples, or to single-cell data, and there is currently no systematic guideline on how to choose the best estimation method for the data at hand (Vallejos et al., 2017). Second, the assumption that most genes are not differentially expressed across samples, on which most methods for sequencing depth estimation rely, may not hold on real data: cancerous samples can be so different from normal samples that a majority of genes are differentially expressed, and cells of different phenotypes can also differ in the expression of most genes (Tosh and Slack, 2002).
A solution to these problems is to develop methods that give the same result under different estimates of sequencing depth. We call such methods ‘scale-invariant’, or SI for short, since sequencing depth serves as a scaling parameter during normalization. To make the concept of SI precise, let $x_{ij}$ be the number of reads mapped to gene $i$ in sample $j$, and let $X = (x_{ij})_{p \times n}$ be the data matrix. Suppose $s_1, \ldots, s_n$ are the estimated sequencing depths; then the normalized data matrix is $\tilde{X} = (\tilde{x}_{ij})_{p \times n}$, where $\tilde{x}_{ij} = x_{ij}/s_j$. SI means that a method gives the same result using the data matrix $\tilde{X}$ with any value of the sequencing depth vector $s = (s_1, \ldots, s_n)$.
We can go one step further. Since the results of SI methods will be the same for any value of the sequencing depth vector $s$, we can simply assign $s_j = 1$ for all $j$, in which case $\tilde{x}_{ij} = x_{ij}$ and $\tilde{X} = X$. This means that SI methods can directly take the original count data as input. There is no need to obtain an accurate estimate of sequencing depth, as the normalization step is completely eliminated.
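To make this concrete, here is a minimal numpy sketch (an illustration, not part of SINC itself) of the column-wise normalization just described and of the $s_j = 1$ shortcut that SI methods permit; the matrix sizes and depth values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.poisson(lam=10.0, size=(5, 3))   # p = 5 genes, n = 3 samples
s = np.array([0.8, 1.0, 1.3])            # one estimated depth per sample
X_tilde = X / s                          # column-wise scaling: x_ij / s_j

# An SI method returns the same result for X_tilde under any s.
# Setting every s_j = 1 leaves the counts untouched, so an SI method
# can take the raw count matrix X directly:
assert np.allclose(X / np.ones(3), X)
```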
Such SI methods do not exist for all tasks of RNA-seq data analysis. For example, differential analysis of individual genes is sensitive to scaling by definition. In this paper, we propose an SI method for another important analysis of RNA-seq data: the classification of samples, that is, using gene expression levels to classify samples into pre-defined classes, such as normal versus cancerous.
SI methods have not been specifically developed for classification based on RNA-seq data, and most generic classification methods are not SI. For example, methods such as nearest neighbors, linear discriminant analysis and support vector machines (SVMs) are based on distances between sample points, from samples to class centroids, or from samples to a hyperplane, and thus their results are sensitive to the scaling of samples. As another example, tree-based methods, such as decision trees and random forests (RFs), are composed of splits, each of which compares the value of a single feature to a threshold, and thus are also sensitive to the scaling of features. Methods developed specifically for RNA-seq data, Poisson linear discriminant analysis (PLDA) (Witten et al., 2011) and negative binomial linear discriminant analysis (NBLDA) (Dong et al., 2016), assume Poisson or negative binomial distributions for the counts, and they are not SI either.
However, RNA-seq data do contain scale-irrelevant information about the class labels of the samples. For example, in a two-class problem, it is possible that gene $i$ has higher expression than gene $j$ in one class of samples and lower expression than gene $j$ in the other class. Based on whether gene $i$ has higher or lower expression than gene $j$ in a new sample, one can predict which class the new sample comes from. Such ‘relative abundance’ between a pair of genes will not change under any scaling of the data, and thus the classifier will be SI. This idea of using a pair of genes was first proposed by Geman et al. (2004) for the classification problem on microarray data. For a two-class classification problem, among all possible gene pairs $(i, j)$, they search for the pair that maximizes $|\hat{p}_{ij|1} - \hat{p}_{ij|2}|$, where $\hat{p}_{ij|1}$ and $\hat{p}_{ij|2}$ are the observed proportions of samples in which gene $i$ has higher expression than gene $j$ in Classes 1 and 2, respectively. This pair of genes is called the top-scoring pair (TSP), which is also the name of their method.
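As an illustration, a brute-force sketch of the TSP search is given below. This is our own simplified rendering of the rule described by Geman et al. (2004), not their implementation; the function name and the exhaustive loop over all pairs are ours, and real implementations restrict or accelerate this search.

```python
import numpy as np
from itertools import combinations

def top_scoring_pair(X, labels):
    """X: genes x samples count matrix; labels: numpy array of 0/1 per sample."""
    best_pair, best_score = None, -1.0
    for i, j in combinations(range(X.shape[0]), 2):
        # Proportion of samples with expression_i > expression_j in each class
        p1 = np.mean(X[i, labels == 0] > X[j, labels == 0])
        p2 = np.mean(X[i, labels == 1] > X[j, labels == 1])
        if abs(p1 - p2) > best_score:
            best_pair, best_score = (i, j), abs(p1 - p2)
    return best_pair, best_score

# The comparison X[i] > X[j] within each sample is unchanged by any
# per-sample scaling, so the resulting classifier is SI.
```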
Interestingly, while TSP is SI, it was developed for microarray data, which do not have a ‘sequencing depth’, and scaling is not a required pre-processing step for their analysis. Indeed, the motivations behind TSP are higher classification accuracy and easier interpretation, which have been demonstrated on real datasets (Geman et al., 2004). Later, TSP was generalized to k-TSP (Tan et al., 2005), which includes the top k pairs of genes (k > 1 and typically no more than 10). In k-TSP, each top gene pair gives a prediction, and the final predicted label is given by a majority vote. Other variations further extended the idea of TSP, e.g. by comparing the relative abundance of three or more genes simultaneously (Lin et al., 2009; Magis and Price, 2012; Wang et al., 2013; Yang and Naiman, 2014). The success of TSP and its variations has been shown in numerous real applications.
In this paper, we propose the first SI method specifically designed for the classification of bulk and single-cell RNA-seq data. The focus of our method is high classification accuracy, as TSP and its variations are nearly unsurpassable in ease of interpretation. Our method is based on a deep neural network (DNN), which is known for its superior ability to generate complicated features that are efficient for classification. We use data augmentation to achieve scale-invariance automatically. Furthermore, we also use data augmentation to incorporate the technical variance of RNA-seq data, which robustifies the classifier and alleviates overfitting.
In fact, TSP and its variations, originally designed for microarray data, can be directly applied to RNA-seq data. We will report their performance on three bulk RNA-seq datasets and six single-cell RNA-seq datasets, and show that our method gives much higher classification accuracy than TSP and k-TSP on all nine datasets. On each of the nine datasets, our method also gives higher or similar accuracy compared with the best performer among six other methods that do not have the SI property.
2 Materials and methods
We propose an SI classifier named the scale-invariant deep neural-network classifier (SINC) under the framework of DNNs. SINC uses data augmentation to achieve scale-invariance. Data augmentation is a technique that artificially creates new training data from existing training data. It is most commonly used with DNNs in computer vision. Suppose one wants to build a classifier on a training dataset to classify the only object in a photo as a car or a plane. Since whether the object is a car or a plane does not depend on its location, angle or size, one can translate, rotate and scale each image to create many more images while keeping the class label unchanged. These artificially created images are called augmented images. The DNN is then trained on both the original and the augmented images. With such data augmentation, the DNN can not only be trained more accurately because of the enlarged training sample size, but also become equipped with rotation-, scaling- and translation-invariance, i.e. its prediction will not change when the image is rotated, scaled or translated. This is because achieving high accuracy on both the original and the augmented images requires classifying each augmented image into the same class as the corresponding original image; thus, as training proceeds, the DNN gains these desired invariances automatically. These invariances are highly valued properties in image classification. In this paper, we augment RNA-seq data by adding random scaling parameters so that the resulting classifier gains scale-invariance.
Beyond that, we also use data augmentation to account for technical noise in RNA-seq data. This is done by adding random noise that mimics the technical noise in RNA-seq data, so that the predicted class labels are ‘invariant under technical noise’, or in other words, more robust to it.
Here we give details of the data augmentation we use. As in the Introduction, we use $x_{ij}$ for the number of reads mapped to gene $i$ in sample $j$, $s_j$ for an estimated sequencing depth of experiment $j$, and $\tilde{x}_{ij} = x_{ij}/s_j$ for the sequencing-depth-scaled data. For classification, people typically use data on the log scale, which has more stable variance. The log-transformed original count is $y_{ij} = \log(x_{ij})$, and the log-transformed scaled count is $\tilde{y}_{ij} = \log(\tilde{x}_{ij}) = y_{ij} - \log(s_j)$. We see that the effect of scaling is to introduce an additive offset to the log-transformed data. In practice, a small number such as 1.0 is often added to $x_{ij}$ and $\tilde{x}_{ij}$ before taking the log to avoid the trouble of $\log(0)$. This makes $\tilde{y}_{ij} = y_{ij} - \log(s_j)$ hold only approximately, although the difference is typically ignorable unless $x_{ij}$ is close to 0. Let $X$ and $Y$ be the matrices of counts and log-transformed counts, respectively. We generate the augmented data matrix $Y^{\mathrm{aug}}$ on the log scale by
$$Y^{\mathrm{aug}} = Y + \mathbf{1}_p\,\mathbf{c}^{\top} + E \qquad (1)$$
Here $\mathbf{1}_p$ is a length-$p$ column vector of ones, $\mathbf{c}$ is a length-$n$ column vector consisting of randomly generated scaling factors (sequencing depths) and $E$ is a $p \times n$ matrix consisting of randomly generated technical noises. The elements in $\mathbf{c}$ are generated i.i.d. from a distribution controlled by a constant $u$; under the value of $u$ used for all computations in this paper, the range of the data is roughly doubled. The elements in $E$ are independently sampled from $N(0, \sigma_i^2)$, where $\sigma_i^2$ is the estimated technical variance for gene $i$; the details of how we calculate $\sigma_i^2$ are given in the Supplementary Material. At each epoch of the training of the DNN, we use Equation (1) to randomly generate a matrix $Y^{\mathrm{aug}}$ and supply both the $n$ samples in $Y$ and the $n$ samples in $Y^{\mathrm{aug}}$ to the DNN. We find that the network typically gains the scale-invariance property in fewer than 100 epochs (Supplementary Fig. S1).
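The following numpy sketch mimics the augmentation in Equation (1). It is an illustration rather than the SINC implementation: the uniform distribution for the offsets in $\mathbf{c}$, the choice of $u$, and the constant technical variances are all placeholder assumptions (the paper estimates the per-gene variances as described in the Supplementary Material).

```python
import numpy as np

def augment(Y, u, sigma2, rng):
    """Y: p x n log-scale data; u: scale constant; sigma2: length-p variances."""
    p, n = Y.shape
    c = rng.uniform(-u, u, size=n)                   # assumed offset distribution
    E = rng.normal(0.0, np.sqrt(sigma2)[:, None],    # per-gene technical noise
                   size=(p, n))
    return Y + np.ones((p, 1)) @ c[None, :] + E      # Y + 1_p c^T + E

rng = np.random.default_rng(1)
Y = np.log(rng.poisson(10.0, size=(1500, 24)) + 1.0)
# Placeholder choices: u set to the data range, constant variance per gene
Y_aug = augment(Y, u=np.ptp(Y), sigma2=np.full(1500, 0.1), rng=rng)
```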
3 Data
We applied SINC and several other classifiers on nine real RNA-seq datasets, including three bulk datasets and six single-cell datasets. An overview of these datasets is given in Table 1. The three bulk RNA-seq datasets were from TCGA (The Cancer Genome Atlas), each consisting of samples from a type of cancer: lung adenocarcinoma (LUAD), pheochromocytoma and paraganglioma (PCPG) and renal cell cancer (RCC). For LUAD and PCPG data, the clinical subtypes were used as class labels. The RCC data have three subtypes of samples: 606 kidney renal papillary cell carcinoma, 323 kidney renal clear cell carcinoma and 91 kidney chromophobe. These three subtypes were used as class labels. These RNA-seq data, as well as the associated clinical information, were downloaded using an R package called ‘TCGA2STAT’ (Wan et al., 2016). The data imported by TCGA2STAT were the latest version of all version-stamped standardized datasets hosted and maintained by the Broad GDAC Firehose.
Table 1.
Overview of datasets
| Dataset | Bulk/sc | No. of samples | No. of classes (no. of samples in each class) | No. of genes | Data source/accession no. | References |
|---|---|---|---|---|---|---|
| LUAD | Bulk | 230 | 3 (78, 63, 89) | 10 925 | TCGA | – |
| PCPG | Bulk | 173 | 4 (22, 68, 61, 22) | 10 652 | TCGA | Fishbein et al. (2017) |
| RCC | Bulk | 1020 | 3 (606, 323, 91) | 10 781 | TCGA | – |
| Buettner | sc | 288 | 3 (96, 96, 96) | 13 561 | E-MTAB-2805 | Buettner et al. (2015) |
| Li | sc | 432 | 2 (272, 160) | 6726 | GSE81861 | Li et al. (2017) |
| Tang | sc | 246 | 6 (43, 38, 31, 47, 48, 39) | 1901 | GSE100911 | Tang et al. (2017) |
| Baron1 | sc | 1895 | 9 (110, 51, 236, 872, 214, 120, 130, 70, 92) | 3070 | GSE84133 | Baron et al. (2016) |
| Baron2 | sc | 1223 | 5 (284, 495, 101, 280, 63) | 3658 | GSE84133 | Baron et al. (2016) |
| Chen | sc | 9774 | 12 (1148, 413, 818, 379, 151, 167, 724, 3541, 1741, 51, 32, 609) | 1665 | GSE87544 | Chen et al. (2017) |
We also collected six scRNA-seq datasets from different publications; here we use the last name of the first author to denote each dataset. Buettner et al. (2015) studied the differentiation of naïve T cells into T helper 2 cells, and we used the stage information of cells during differentiation as their class labels. Li et al. (2017) analyzed the transcriptional heterogeneity in colorectal tumors, and we used the count data from 272 tumor epithelial cells and 160 normal epithelial cells for classification analysis. Tang et al. (2017) studied cell heterogeneity within the zebrafish kidney marrow, and we used the six fluorescent transgenic lines as the class labels. We created two datasets, Baron1 and Baron2, from Baron et al. (2016), which revealed the pancreas population structure with the transcriptomes of over 12 000 pancreatic cells from four human donors and two mice. Baron1 contains all major cell groups from the first human donor, excluding those with fewer than 20 cells. Baron2 contains the five largest cell groups from the second human donor. Chen et al. (2017) sequenced more than 14 000 single cells from adult mouse hypothalamus tissues and identified 11 non-neuronal and 34 neuronal cell types. We excluded 18 GABA cell types and 15 glutamatergic neuron cell types, which have very small numbers of samples, and used the remaining 12 cell types as 12 classes.
For both bulk and single-cell RNA-seq data, we filtered out genes that have zero counts in more than 80% of samples. For bulk RNA-seq data, we further filtered out genes with mean expression (on the log scale) less than five. The numbers of remaining genes are shown in Table 1.
4 Results
To measure the accuracy of classifiers on each dataset, we use the misclassification rate under 5-fold cross-validation (CV). In CV, each dataset is randomly divided into five equal folds, each containing 20% of all samples. In each of five iterations, one fold is used as test data and the other four folds are used as training data. After the five iterations, every sample has been used as test data exactly once, and the proportion of samples misclassified when serving as test data is the ‘5-fold CV misclassification rate’, or the ‘CV error’ for short.
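A sketch of this procedure using scikit-learn is shown below. Whether the folds are stratified by class is not specified above, so the use of StratifiedKFold here is an assumption, and fit_predict stands for any classifier's train-and-predict step.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

def cv_error(fit_predict, X, y, seed=0):
    """fit_predict(X_tr, y_tr, X_te) -> predicted labels; X: samples x genes."""
    wrong = 0
    for tr, te in StratifiedKFold(5, shuffle=True, random_state=seed).split(X, y):
        wrong += np.sum(fit_predict(X[tr], y[tr], X[te]) != y[te])
    return wrong / len(y)   # every sample is a test sample exactly once
```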
4.1 The structure of DNN
For all the real datasets, we used neural networks with the same structure, consisting of two hidden layers as well as an input layer and an output layer. The first hidden layer has 256 nodes, and the second hidden layer has 128 nodes. All layers are fully connected, and the ReLU activation function is used. A dropout rate of 0.5 is used for each hidden layer. The loss function used for classification is cross entropy. An Adam optimizer (Kingma and Ba, 2015) with learning rate 0.02 is used for training the model. For each dataset, the model is trained for 200 epochs with batch size 24. The model is implemented in TensorFlow 1.8 (Abadi et al., 2015).
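This description corresponds to a network along the following lines, written here with the current tf.keras API for readability (the paper's implementation is in TensorFlow 1.8, so the actual code differs); build_sinc_net is a hypothetical name.

```python
import tensorflow as tf

def build_sinc_net(n_genes, n_classes):
    model = tf.keras.Sequential([
        # Input: e.g. 1500 pre-selected genes on the log scale
        tf.keras.layers.Dense(256, activation="relu", input_shape=(n_genes,)),
        tf.keras.layers.Dropout(0.5),                 # dropout on 1st hidden layer
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dropout(0.5),                 # dropout on 2nd hidden layer
        tf.keras.layers.Dense(n_classes, activation="softmax"),
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.02),
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# Training: at each epoch, supply both the original samples and freshly
# augmented samples from Equation (1); 200 epochs, batch size 24.
```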
Note that for all datasets we used the same simple network structure and the same set of hyperparameters for SINC. (The network is not particularly ‘deep’, although the techniques we used, such as the ReLU activation function, dropout and data augmentation, are often used for DNNs; see ‘The neural network that SINC uses: deep or shallow?’ in the Supplementary Material for a detailed discussion.) We studied the influence of the structure and a few hyperparameters (results given in the Supplementary Material), but we did not do a comprehensive search for the best combination on each individual dataset. Although such a search may further improve the performance of SINC, we will show that SINC gives impressive performance even under the current sub-optimal setting.
To reduce the dimensionality of the input layer, we did a simple pre-selection of genes. In each iteration of the CV, we conducted an F-test on each gene, using only the training data, to test whether different classes have different mean expression levels. We kept only the top 1500 genes with the most significant P-values. Note that in each iteration the test samples were excluded from the selection of genes to avoid possible underestimation of the CV error.
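A sketch of this pre-selection is given below, assuming the per-gene test is scipy's one-way ANOVA F-test (consistent with testing whether class means differ); the function name is ours.

```python
import numpy as np
from scipy.stats import f_oneway

def preselect_genes(Y_train, labels, n_keep=1500):
    """Y_train: genes x training-samples log-scale data; labels: class labels."""
    classes = np.unique(labels)
    # One-way ANOVA F-test per gene, on training samples only
    pvals = np.array([
        f_oneway(*(gene[labels == c] for c in classes)).pvalue
        for gene in Y_train
    ])
    return np.argsort(pvals)[:n_keep]   # indices of the most significant genes
```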
4.2 Eight other classifiers
We compared SINC with eight other classifiers, including four generic classifiers, two classifiers that were specifically designed for RNA-seq data, and two SI classifiers that were developed for microarray data.
The four generic classifiers are K-nearest neighbors (KNN) (Altman, 1992), SVM (Cortes and Vapnik, 1995), classification and regression trees (CART) (Breiman, 2017) and RF (Ho, 1995). All of them require sequencing-depth-normalized data. For bulk RNA-seq data, we tried three different methods for the estimation of sequencing depths: TC, UQ (Bullard et al., 2010) and DESeq (Anders and Huber, 2010). For scRNA-seq data, we tried two methods: TC and scran (Lun et al., 2016). After the estimation of sequencing depths, we scaled the data, took a log transformation and then supplied the result to the classifiers.
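For concreteness, simplified versions of the TC and UQ size-factor estimates are sketched below; the rescaling to geometric mean 1 and the restriction of UQ to each sample's nonzero counts are simplifying conventions of this sketch, not necessarily the exact rules of Bullard et al. (2010).

```python
import numpy as np

def size_factors_tc(X):
    """X: genes x samples. TC: depth proportional to per-sample total counts."""
    s = X.sum(axis=0).astype(float)
    return s / np.exp(np.mean(np.log(s)))   # rescale to geometric mean 1

def size_factors_uq(X):
    """UQ: depth proportional to the 75th percentile of each sample's counts."""
    s = np.array([np.percentile(col[col > 0], 75) for col in X.T])
    return s / np.exp(np.mean(np.log(s)))

# Normalized, log-transformed input for the generic classifiers:
# Y_tilde = np.log(X / size_factors_tc(X) + 1.0)
```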
The two classifiers designed for RNA-seq data are PLDA (Witten et al., 2011) and NBLDA (Dong et al., 2016). These methods take the original count data together with estimated sequencing depths as input and handle normalization internally. We tried the same set of methods for estimating sequencing depths as for the generic classifiers. In addition, for PLDA, a power transformation was applied, as its authors suggest, to attenuate overdispersion.
The mechanisms of these six classifiers imply that none of them should be SI, and this will be validated on our data. All six classifiers have tuning parameters, such as the number of neighbors for KNN, the cost parameter for SVM, the complexity parameter for CART, the number of trees for RF, and the shrinkage parameter for PLDA and NBLDA. We tried multiple values for each tuning parameter and chose the one that gave the smallest CV error.
The two SI classifiers developed for microarray data that we used for comparison are TSP and k-TSP, both of which have publicly available R packages. The optimal k for k-TSP is selected as the value that gives the smallest CV error. TSP and k-TSP were originally developed for two-class classification problems. To generalize them to multiclass problems, the authors of k-TSP (Tan et al., 2005) suggested using one of three generic strategies: one-versus-rest (1-versus-r), one-versus-one (1-versus-1) or hierarchical classification (HC). See Bishop (2006) or Hastie et al. (2009) for an introduction to the three strategies. To make our comparison comprehensive, we implemented all three strategies for both TSP and k-TSP.
4.3 Measurements of performance
To measure the accuracy of classifiers, we use the CV error as introduced at the beginning of Section 4. We also call it the ‘CV error on the original data’ to differentiate it from another CV error introduced in the next paragraph.
To measure whether a classifier is SI, we use the ‘CV error on the augmented data’. At each iteration of CV, the test samples are augmented M times, and the new test samples include both the original and augmented test samples. The misclassification rate on all these test samples over all iterations is defined as the ‘CV error on the augmented data’. M = 20 was used for this paper, and the augmentation was done using Equation (1) without the technical noise term $E$. Non-SI classifiers are sensitive to the scaling of the data and are thus expected to perform much worse on the augmented test data than on the original test data. SI classifiers, on the other hand, should have similar misclassification rates on both. Thus, the increase of the CV error on the augmented data over the CV error on the original data serves as a measure of SI.
We also define a more direct measure of SI called the ‘concordance rate’, based on the augmented test samples in CV. Suppose the M augmented samples from an original test sample $i$ have predicted labels $\hat{l}_{i1}, \ldots, \hat{l}_{iM}$, and the majority vote of these M labels is $\hat{l}_i$. The concordance rate of sample $i$ is defined as the proportion of $\hat{l}_{i1}, \ldots, \hat{l}_{iM}$ equal to $\hat{l}_i$, and the concordance rate (on a dataset) is the average concordance rate over all samples. The concordance rate is a positive number no greater than 1; the higher it is, the less sensitive to scaling the classifier is. A value close to 1 indicates SI.
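A small sketch of this computation, assuming the M predictions per original sample are already collected in an array:

```python
import numpy as np

def concordance_rate(pred_aug):
    """pred_aug: array of shape (n_samples, M) of predicted labels."""
    rates = []
    for labels in pred_aug:
        # Majority vote among the M predictions for this sample
        vals, counts = np.unique(labels, return_counts=True)
        majority = vals[np.argmax(counts)]
        # Proportion of the M predictions that agree with the majority
        rates.append(np.mean(labels == majority))
    return float(np.mean(rates))
```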
4.4 Comparison with non-SI classifiers
Figure 1 shows the CV errors on the original data and on the augmented data for the bulk and single-cell datasets. On the original data (shown as solid lines), no single non-SI classifier outperforms all other non-SI classifiers on all datasets. For example, SVM is among the best for the LUAD, RCC, Buettner, Li and Chen data, but its performance is much inferior to PLDA on the PCPG, Baron1 and Baron2 data, and much inferior to RF on the Tang data. While none of the six non-SI methods show uniformly best or close-to-best results on all datasets, SINC gives the lowest CV error on six of the nine datasets. On the other three datasets, its CV error is very close to the best: 0.0694 versus 0.0556 (from SVM with TC normalization) on the Buettner data, 0.0201 versus 0.0195 (from PLDA with TC normalization) on Baron1, and 0.0139 versus 0.0123 (from NBLDA with scran normalization) on Baron2. Averaged over all nine datasets (and over different normalization methods for the six non-SI classifiers), the CV error of SINC is 68, 32, 62, 52, 45 and 59% lower than that of KNN, SVM, CART, RF, PLDA and NBLDA, respectively.
Fig. 1.
CV errors of SINC and six other classifiers on three bulk RNA-seq datasets (top) and six single-cell datasets (middle and bottom). While SINC does not require normalization, different normalization methods (results shown in different colors) have been used for the other six classifiers. The CV errors of the six classifiers differ between the original data (shown as solid lines) and the augmented data (shown as dashed lines). SINC has virtually the same CV error on both the original and the augmented test data, and this error is shown as a black horizontal line in each plot. (Color version of this figure is available at Bioinformatics online.)
On the augmented data, the difference is even larger. While the CV error of SINC shows no noticeable change, the CV errors of the other classifiers inflate dramatically in most cases. On average (over all datasets and normalization methods), the CV error increases by 79, 451, 102, 203, 214 and 125% for KNN, SVM, CART, RF, PLDA and NBLDA, respectively. (Note that the CV error can exceed 0.50, as all datasets except Li contain more than two classes.) The error inflation is especially high for SVM, the overall best performer among the six non-SI methods on the original data. On the augmented data, the average CV error of SINC is 81–88% lower than those of SVM and the other five non-SI methods.
Figure 2 shows the concordance rates for all methods on the nine datasets. The concordance rates of SINC are nearly perfect (>0.99 on every dataset), indicating that it indeed achieves scale-invariance. In contrast, the average (over all datasets and all normalization methods) concordance rates of the other six methods are between 0.75 and 0.81, meaning that they are generally not SI. Notably, though, the concordance rates of both PLDA and NBLDA are higher than 0.99 on the RCC dataset when TC normalization is used. This means that they also achieve scale-invariance, albeit only on this single dataset. This is quite interesting, and we have found a mathematical explanation of why and when this can happen; the detailed derivations are given in the Supplementary Material (Claim 1 and its proof).
Fig. 2.
Concordance rates of SINC and six other classifiers on three bulk RNA-seq datasets (top) and six single-cell datasets (middle and bottom). While SINC does not require normalization, different normalization methods (results shown in different colors) have been used for the other six classifiers. The concordance rate of SINC, which is very close to 1.0, is shown as a black horizontal line in each plot. (Color version of this figure is available at Bioinformatics online.)
4.5 Comparison with TSP and k-TSP
Since TSP and k-TSP are also SI, their concordance rates are naturally 1. Thus, we only compare their CV errors (on the original data) with that of our method, shown in Figure 3. Our method consistently outperforms TSP and k-TSP on the nine real datasets, no matter which multiclass classification strategy is used for TSP and k-TSP. Averaged over datasets and over strategies for multiclass classification, the CV error of our method is 77 and 63% lower than those of TSP and k-TSP, respectively.
Fig. 3.
CV errors of SINC, TSP and k-TSP on three bulk RNA-seq datasets (top) and six single-cell datasets (middle and bottom). These three methods do not require normalization. Different strategies for combining results from two-class predictions to get multiclass predictions were tried for TSP and k-TSP, and they are shown in different colors. The Li data have only two classes of samples, so there is only one line in its sub-figure. SINC handles multiclass problems naturally, and its CV error is shown as a black horizontal line in each plot. (Color version of this figure is available at Bioinformatics online.)
On the other hand, comparing with the results in Figure 1, we find that although TSP and k-TSP perform worse than our method, k-TSP still outperforms classifiers like KNN and CART on many datasets, even though it uses a small number of genes and a simple classification rule, indicating the strong power of using relative expression levels between genes instead of individual gene expression levels.
5 Discussion
We have presented SINC, an SI method for the classification of samples based on the DNN framework, and compared its performance with existing classifiers on real bulk and single-cell RNA-seq datasets. SINC not only achieves scale-invariance but also outperforms existing methods in terms of classification accuracy.
Specifically, SINC achieves much higher accuracy than TSP and k-TSP, two SI classifiers originally developed for microarray data. This can be due to three reasons. First, while SINC uses a large number of genes for classification (1500 for all data in this paper), TSP and k-TSP use only up to a few gene pairs: TSP uses a single pair, while k-TSP typically limits the number of pairs to ten. For many datasets, such a small number of genes simply does not contain enough information for highly accurate classification. Second, our higher accuracy may also come from the power of the DNN in describing sophisticated relationships between genes, which the simple classification rules used by TSP and k-TSP may not be capable of capturing. Third, while SINC models the technical noises and utilizes them to robustify the classifier, TSP and k-TSP do not take technical noises into account. The technical noises in RNA-seq data are often substantial, and they follow characteristic patterns that were revealed by previous fundamental work in the field; SINC takes advantage of these important results.
However, SINC loses an important advantage of TSP and k-TSP: interpretability. A DNN captures sophisticated relationships involving many genes, but these relationships are almost impossible to interpret.
The scale-invariance of SINC is obtained by training an ordinary fully connected neural network on augmented data. An alternative strategy that does not rely on data augmentation is to train a network with a specific structure or constraints that guarantee scale-invariance. This is an interesting strategy worth exploring. We started with the simplest network structure: a network without any hidden layer. Let $w_{ik}$ be the weight that connects input node $i$ and output node $k$. We have shown (mathematical derivations given in the Supplementary Material as Claim 2 and its proof) that a sufficient condition for the network to be SI is that $\sum_i w_{ik}$ is the same for all $k$'s. This constraint is actually equivalent to the sum-to-zero constraint in penalized linear regression proposed by Shen et al. (2017). Unfortunately, we failed to find easy-to-describe constraints that guarantee scale-invariance for neural networks with one or more hidden layers. Given the complexity and non-linearity of neural networks with hidden layers, such constraints may not have a concise mathematical form and are thus difficult to impose programmatically. Our strategy of using data augmentation on an ordinary network without explicit constraints is more feasible: the theoretical constraints are expected to be satisfied automatically as training on the augmented data proceeds, and we have shown that this is indeed the case for neural networks without a hidden layer (see Supplementary Material for details).
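For readers who want the flavor of the argument, a brief sketch is given below, assuming the no-hidden-layer network has softmax outputs (Claim 2 in the Supplementary Material is the precise statement).

```latex
% Scaling sample j adds a constant c to every log-scale input (Section 2),
% so the class-k logit changes as
\[
  \sum_i w_{ik}\,(y_{ij} + c) \;=\; \sum_i w_{ik}\,y_{ij} \;+\; c \sum_i w_{ik}.
\]
% If \(\sum_i w_{ik} = a\) for every class k, all logits shift by the same
% amount ca, and softmax is invariant to a common shift of all logits:
\[
  \frac{e^{z_k + ca}}{\sum_{k'} e^{z_{k'} + ca}}
  \;=\; \frac{e^{z_k}}{\sum_{k'} e^{z_{k'}}},
\]
% so the predicted class, and indeed the whole output distribution,
% is unchanged under any scaling of the sample.
```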
SINC is flexible in the form of RNA-seq data it takes as input. We have shown that, on the original count data without any normalization, SINC achieves high classification accuracy and scale-invariance. In practice, many datasets are available as TPM, FPKM or RPKM, which take gene length into consideration. SINC can directly take data in these measurements as input, just like the original count data; no additional pre-processing steps are required. Furthermore, on these measurements, SINC achieves classification accuracy comparable to that on the count data, and it retains the SI property. See Supplementary Material for a theoretical justification and results on real datasets.
SINC, a classifier based on both a DNN and data augmentation, achieves scale-invariance and has higher classification accuracy than the other methods we consider. The scale-invariance is, of course, attributable to the data augmentation. To study whether the improvement in classification accuracy is due to the superior classification power of the DNN, the data augmentation, or both, we compared the performance of SINC with that of the corresponding DNN: a network with the same structure, trained with the same set of hyperparameters, but without data augmentation. Their CV errors are given in Supplementary Figure S4. While the differences in CV errors between the two methods are small on most datasets, SINC achieves significantly lower CV errors on three datasets: LUAD (0.10 versus 0.14, a 25% decrease), PCPG (0.058 versus 0.064, a 9% decrease) and Buettner (0.069 versus 0.076, a 9% decrease). Note that all three datasets have relatively small sample sizes (<300). This means that while the use of a DNN is the main reason why SINC outperforms other classification methods, data augmentation further elevates the classification accuracy on data with smaller sample sizes.
We also studied the effect of the sample size on the performance of SINC by downsampling real datasets and then applying SINC. As expected, the CV error of SINC increases as the sample size decreases, although there does not seem to be a common minimum sample size for decent accuracy on different datasets (see Supplementary Material for details).
All our comparisons so far are based on the CV error, which is the de facto standard for comparing different methods on single datasets. We also conducted a case study in which two separate datasets were used, one for training and one for testing. To obtain such a pair of datasets, we revisited Baron1 and Baron2, which are scRNA-seq datasets with cells from two different human donors but with overlapping cell types. The nine cell types in the Baron1 data include all five cell types in the Baron2 data, as well as four other cell types. We removed these four cell types and called the remaining data Baron1c, so that Baron1c and Baron2 have the same five cell types. We then used Baron1c for training and Baron2 for testing. Supplementary Figure S5 shows the test errors (test misclassification rates) and concordance rates of SINC and the six non-SI classifiers. The test error of SINC (0.0124) shows no inflation compared with its CV error (0.0139) on the Baron2 data, which is not the case for most other classifiers. More importantly, on the test data, SINC still achieves scale-invariance while the other methods do not, and SINC still has the highest classification accuracy.
Supplementary Material
Acknowledgements
The results shown here are in part based upon data generated by the TCGA Research Network: https://www.cancer.gov/tcga. The authors thank Dr Siyuan Zhang from the Department of Biological Sciences for helpful discussions.
Funding
This work was supported by the National Institutes of Health [R03CA212964 to J.L.].
Conflict of Interest: none declared.
References
- Abadi M. et al. (2015) TensorFlow: Large-scale Machine Learning on Heterogeneous Systems. Software available from tensorflow.org.
- Altman N.S. (1992) An introduction to kernel and nearest-neighbor nonparametric regression. Am. Stat., 46, 175–185.
- Anders S., Huber W. (2010) Differential expression analysis for sequence count data. Genome Biol., 11, R106.
- Bacher R. et al. (2017) SCnorm: robust normalization of single-cell RNA-seq data. Nat. Methods, 14, 584.
- Baron M. et al. (2016) A single-cell transcriptomic map of the human and mouse pancreas reveals inter- and intra-cell population structure. Cell Syst., 3, 346–360.
- Bishop C.M. (2006) Pattern Recognition and Machine Learning. Springer, New York, NY.
- Breiman L. (2017) Classification and Regression Trees. Routledge, New York, NY.
- Buettner F. et al. (2015) Computational analysis of cell-to-cell heterogeneity in single-cell RNA-sequencing data reveals hidden subpopulations of cells. Nat. Biotechnol., 33, 155.
- Bullard J.H. et al. (2010) Evaluation of statistical methods for normalization and differential expression in mRNA-seq experiments. BMC Bioinformatics, 11, 94.
- Chen R. et al. (2017) Single-cell RNA-seq reveals hypothalamic cell diversity. Cell Rep., 18, 3227–3241.
- Cortes C., Vapnik V. (1995) Support-vector networks. Mach. Learn., 20, 273–297.
- Dillies M.-A. et al. (2013) A comprehensive evaluation of normalization methods for Illumina high-throughput RNA sequencing data analysis. Brief. Bioinformatics, 14, 671–683.
- Ding B. et al. (2015) Normalization and noise reduction for single cell RNA-seq experiments. Bioinformatics, 31, 2225–2227.
- Dong K. et al. (2016) NBLDA: negative binomial linear discriminant analysis for RNA-seq data. BMC Bioinformatics, 17, 369.
- Fishbein L. et al. (2017) Comprehensive molecular characterization of pheochromocytoma and paraganglioma. Cancer Cell, 31, 181–193.
- Geman D. et al. (2004) Classifying gene expression profiles from pairwise mRNA comparisons. Stat. Appl. Genet. Mol. Biol., 3, 1–19.
- Hastie T. et al. (2009) The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer Series in Statistics. Springer, New York, NY.
- Ho T.K. (1995) Random decision forests. In: Proceedings of the 3rd International Conference on Document Analysis and Recognition, Montreal, Quebec, Canada. Vol. 1, IEEE, pp. 278–282.
- Katayama S. et al. (2013) SAMstrt: statistical test for differential expression in single-cell transcriptome with spike-in normalization. Bioinformatics, 29, 2943–2945.
- Kingma D.P., Ba J. (2015) Adam: a method for stochastic optimization. In: Proceedings of the 3rd International Conference on Learning Representations (ICLR'15), San Diego.
- Li H. et al. (2017) Reference component analysis of single-cell transcriptomes elucidates cellular heterogeneity in human colorectal tumors. Nat. Genet., 49, 708.
- Lin X. et al. (2009) The ordering of expression among a few genes can provide simple cancer biomarkers and signal BRCA1 mutations. BMC Bioinformatics, 10, 256.
- Lun A.T. et al. (2016) Pooling across cells to normalize single-cell RNA sequencing data with many zero counts. Genome Biol., 17, 75.
- Magis A.T., Price N.D. (2012) The top-scoring ‘N’ algorithm: a generalized relative expression classification method from small numbers of biomolecules. BMC Bioinformatics, 13, 227.
- Robinson M.D., Oshlack A. (2010) A scaling normalization method for differential expression analysis of RNA-seq data. Genome Biol., 11, R25.
- Shen R. et al. (2017) Identification of gene pairs through penalized regression subject to constraints. BMC Bioinformatics, 18, 466.
- Tan A.C. et al. (2005) Simple decision rules for classifying human cancers from gene expression profiles. Bioinformatics, 21, 3896–3904.
- Tang Q. et al. (2017) Dissecting hematopoietic and renal cell heterogeneity in adult zebrafish at single-cell resolution using RNA sequencing. J. Exp. Med., 214, 2875–2887.
- Tosh D., Slack J.M. (2002) How cells change their phenotype. Nat. Rev. Mol. Cell Biol., 3, 187.
- Vallejos C.A. et al. (2017) Normalizing single-cell RNA sequencing data: challenges and opportunities. Nat. Methods, 14, 565.
- Wan Y.-W. et al. (2016) TCGA2STAT: simple TCGA data access for integrated statistical analysis in R. Bioinformatics, 32, 952–954.
- Wang H. et al. (2013) TSG: a new algorithm for binary and multi-class cancer classification and informative genes selection. BMC Med. Genomics, 6, S3.
- Witten D.M. et al. (2011) Classification and clustering of sequencing data using a Poisson model. Ann. Appl. Stat., 5, 2493–2518.
- Yang S., Naiman D.Q. (2014) Multiclass cancer classification based on gene expression comparison. Stat. Appl. Genet. Mol. Biol., 13, 477–496.