Proceedings of the National Academy of Sciences of the United States of America. 2022 Dec 12;119(51):e2206938119. doi: 10.1073/pnas.2206938119

Correlated gene modules uncovered by high-precision single-cell transcriptomics

Alec R Chapman a,1,2, David F Lee a,2,3, Wenting Cai a,4, Wenping Ma b,c,d, Xiang Li b,c,d,5, Wenjie Sun b,c, Xiaoliang Sunney Xie b,c,6
PMCID: PMC9907105  PMID: 36508663

Significance

In a human cell, transcription of a particular gene invariably fluctuates with time, and the fluctuations of several transcripts can be correlated because they are regulated by the same transcription factor. By developing a single-cell transcriptome method with high detectability (MALBAC-DT) to measure pair-wise correlations among mRNA abundances under steady-state conditions, we discovered correlated gene modules (CGMs), groups of genes whose expression is synchronized so that they can work together to carry out certain biological functions, such as protein synthesis or cholesterol synthesis. CGMs provide information regarding the genome’s biological functions through protein-to-protein interactions.

Keywords: scRNA-seq, correlated gene modules, single cell, transcriptomics

Abstract

Correlations in gene expression are used to infer functional and regulatory relationships between genes. However, correlations are often calculated across different cell types or perturbations, causing genes with unrelated functions to be correlated. Here, we demonstrate that correlated modules can be better captured by measuring correlations of steady-state gene expression fluctuations in single cells. We report a high-precision single-cell RNA-seq method called MALBAC-DT to measure the correlation between any pair of genes in a homogenous cell population. Using this method, we were able to identify numerous cell-type specific and functionally enriched correlated gene modules. We confirmed through knockdown that a module enriched for p53 signaling predicted p53 regulatory targets more accurately than a consensus of ChIP-seq studies and that steady-state correlations were predictive of transcriptome-wide response patterns to perturbations. This approach provides a powerful way to advance our functional understanding of the genome.


Single-cell RNA-seq (scRNA-seq) has greatly expanded our knowledge of gene expression (1–12). However, while scRNA-seq has been particularly effective at distinguishing cell types (5, 13–17), the functional connections among genes are still poorly understood. Such an understanding is critical to a wide range of biological problems, including unraveling the gene networks controlling cellular differentiation and identifying drug targets. One powerful approach is to infer gene relationships from scRNA-seq data through “guilt-by-association”, in which changes in expression are used to cluster genes into putatively coregulated modules. Although this approach is widely used, the downside of existing studies is that the change in expression is calculated across different cell types, perturbation conditions, or developmental stages (18–21), where multiple unrelated pathways can be confounded. This hinders the assignment of the large number of differentially expressed genes to specific functional pathways.

We hypothesize that regulatory gene modules could be better extracted if genes were clustered by pair-wise correlations of steady-state fluctuations rather than differential expression. If two genes A and B are regulated by the same transcription factor TF1, and two other genes C and D by TF2, then even though each gene undergoes stochastic fluctuations, A and B (and likewise C and D) exhibit a positive correlation, whereas A and C (or B and D) exhibit no correlation (Fig. 1A). Other regulatory factors subject to stochastic fluctuations include epigenetic regulators, spatial position of genes within the nucleus, or regulators of transcript degradation such as miRNAs, and information on such regulatory features could potentially be inferred from covariance analysis as well. These steady-state correlations can be measured in a phenotypically homogenous population of single cells (Materials and Methods).
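A minimal simulation sketch of the scheme in Fig. 1A, under assumptions that are not taken from the paper: transcription factor activity and gene-specific noise are both modeled as unit-variance Gaussians that add linearly, and the function names are hypothetical. With equal variances, the expected correlation between coregulated genes is 0.5, while independently regulated genes should be uncorrelated up to sampling noise.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def simulate_cells(n_cells, seed=0):
    """Toy model: genes A and B follow TF1, genes C and D follow TF2.

    Assumption (illustrative only): each gene's level is the activity of
    its transcription factor plus independent gene-specific noise.
    """
    rng = random.Random(seed)
    genes = {"A": [], "B": [], "C": [], "D": []}
    for _ in range(n_cells):
        tf1 = rng.gauss(0.0, 1.0)  # fluctuation of TF1 in this cell
        tf2 = rng.gauss(0.0, 1.0)  # fluctuation of TF2 in this cell
        genes["A"].append(tf1 + rng.gauss(0.0, 1.0))
        genes["B"].append(tf1 + rng.gauss(0.0, 1.0))
        genes["C"].append(tf2 + rng.gauss(0.0, 1.0))
        genes["D"].append(tf2 + rng.gauss(0.0, 1.0))
    return genes


if __name__ == "__main__":
    cells = simulate_cells(5000, seed=1)
    print("r(A, B) =", round(pearson(cells["A"], cells["B"]), 3))
    print("r(A, C) =", round(pearson(cells["A"], cells["C"]), 3))
```

In this toy model the shared factor contributes half of each gene's variance, which is why the coregulated pair sits near 0.5; a weaker shared component would give a proportionally smaller correlation.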

Fig. 1.

(A) A schematic model depicting how gene regulatory interactions can result in correlations in a steady-state population. Genes A and B are coregulated by TF1, while genes C and D are coregulated by TF2. Stochastic fluctuations in a transcription factor will result in fluctuations in its target genes, causing them to be correlated. Independently regulated genes, on the other hand, will exhibit no correlation. (B) Measurement error associated with estimating correlation from a limited number of cells. Plotted are the distributions of correlations that would be measured for a pair of uncorrelated genes if a given number of cells were sampled. (C) Impact of detection efficiency on correlation measurements. For a pair of genes with a given true correlation, the correlation that would be measured by sampling an unlimited number of cells is plotted as a function of the detection efficiency for individual transcripts. (D) MALBAC-DT protocol and experimental workflow. A homogenous cell population is trypsinized and sorted into individual wells of 96-well plates by flow cytometry. Reverse transcription is carried out using a poly-T primer containing a cell-specific barcode and unique molecular identifier (UMI). First-strand cDNA is amplified by random primers using MALBAC thermocycling to ensure linear amplification followed by additional cycles of exponential amplification by PCR. After amplification, samples are pooled together for library preparation and sequencing.

However, reliable measurement of the steady-state, genome-wide pair-wise covariance matrix requires a method with high transcript detectability and counting accuracy. Here, we present Multiple Annealing and Looping Based Amplification Cycles for Digital Transcriptomics (MALBAC-DT), which enables us to accurately capture the correlated gene modules (CGMs) revealed by the stochastic fluctuations of gene expression and facilitates our understanding of gene functions.

Results

Development of MALBAC-DT.

We show by theoretical modeling and simulations that these fluctuations can produce measurable cross-correlation coefficients of ~0.50, even with conservative assumptions about the mechanisms underlying transcription and degradation (SI Appendix, Fig. S1). However, measuring such correlations requires a method with high detection sensitivity and accuracy (Fig. 1C). That is, there must be a high probability of detecting a transcript if only a single copy is present, and a precise digital count is needed if a transcript is present at many copies. Furthermore, a large number of cells must be sampled in order to average out sampling noise in the measured correlations (Fig. 1B).
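The attenuation caused by limited detection efficiency can be sketched analytically under one simplifying assumption that is ours rather than the paper's: each transcript copy is detected independently with probability p (binomial thinning). In that case the covariance of the observed counts scales as p², while each observed variance becomes p²·Var + p(1 − p)·mean, so the measured correlation shrinks as p falls. The function name and the example numbers are hypothetical.

```python
import math


def measured_correlation(true_r, mean_x, var_x, mean_y, var_y, p):
    """Correlation expected after binomial thinning with detection probability p.

    Assumes every transcript copy is detected independently with
    probability p, so that for true counts X and observed counts X':
        Var(X')     = p^2 * Var(X) + p * (1 - p) * E[X]
        Cov(X', Y') = p^2 * Cov(X, Y)
    """
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    cov = true_r * math.sqrt(var_x * var_y)
    obs_var_x = p * p * var_x + p * (1.0 - p) * mean_x
    obs_var_y = p * p * var_y + p * (1.0 - p) * mean_y
    return p * p * cov / math.sqrt(obs_var_x * obs_var_y)


if __name__ == "__main__":
    # Illustrative gene pair: mean 10 copies, variance 20, true r = 0.5.
    for p in (1.0, 0.5, 0.2, 0.1):
        r = measured_correlation(0.5, 10, 20, 10, 20, p)
        print(f"detection efficiency {p:.1f}: measured r = {r:.3f}")
```

The attenuation is strongest for lowly expressed genes, because the detection-noise term p(1 − p)·mean then dominates the observed variance.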

To meet these unique technical demands, we designed a single-cell mRNA amplification method, MALBAC-DT (Fig. 1D), which improves on several aspects of our prior work on amplifying DNA and RNA from single cells (4, 22). Because amplification by MALBAC-DT does not rely on template switching, we had greater flexibility to choose reverse transcriptases and optimize reaction conditions to maximize first-strand cDNA production. Furthermore, with MALBAC-DT, it is possible to successfully amplify transcripts that are only partially reverse transcribed, either due to their length or secondary structure. First-strand cDNA is then amplified linearly by directly annealing MALBAC random primers along the cDNA strand, which minimizes the number of enzymatic steps for second-strand synthesis, followed by exponential amplification by PCR. To improve accuracy, we developed a unique molecular identifier (UMI) design, in which all UMIs are separated by a minimum Hamming distance of three, that can reliably correct for UMI artifacts that occur during amplification and sequencing (Materials and Methods). Finally, to improve throughput and reproducibility, an automated workflow for MALBAC-DT was developed using the Biomek FXP Workstation (Materials and Methods).

With MALBAC-DT, we detected ~20% more genes from single-cell amounts of RNA compared with Smart-seq2 (SI Appendix, Table S1). Moreover, we obtained a lower percentage of reads corresponding to ERCC spike-ins, suggesting lower bias toward shorter or less complex transcripts (SI Appendix, Table S1). The specific UMI design allowed us to correct for transcript miscounting. Clustering analysis found groups of highly similar UMI sequences that likely are sequencing or amplification artifacts (Materials and Methods). Without the minimum Hamming distance of our UMI design, these artifacts would otherwise have been indistinguishable from legitimate UMI sequences.
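The benefit of a minimum Hamming distance of three can be illustrated with a short sketch. The codebook below is a hypothetical toy set of 6-base sequences, not the UMI set used in MALBAC-DT, and the function names are ours. When every pair of valid sequences differs in at least three positions, a read carrying a single-base error lies within distance one of exactly one valid sequence and can be assigned to it unambiguously.

```python
from itertools import combinations


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def min_pairwise_distance(codebook):
    """Smallest Hamming distance between any two sequences in the codebook."""
    return min(hamming(a, b) for a, b in combinations(codebook, 2))


def correct_umi(read, codebook):
    """Return the unique codebook entry within one mismatch of read, else None."""
    hits = [code for code in codebook if hamming(read, code) <= 1]
    return hits[0] if len(hits) == 1 else None


# Hypothetical toy codebook with minimum pairwise distance three.
CODEBOOK = ["AAAAAA", "AAATTT", "TTTAAA", "CCCGGG"]

if __name__ == "__main__":
    print(min_pairwise_distance(CODEBOOK))
    print(correct_umi("AAAATT", CODEBOOK))  # one error away from AAATTT
    print(correct_umi("GGGGGG", CODEBOOK))  # too far from every entry
```

Without the distance constraint, a single-base error could turn one legitimate UMI into another, and the artifact would be counted as an extra molecule.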

CGMs Are Observed in a Uniform Cell Population.

To demonstrate that MALBAC-DT can provide unique insights into gene function, we amplified and sequenced 768 cells from the human osteosarcoma U2OS cell line, of which 738 cells passed quality filters. As expected for a homogenous cell culture, clustering of cells by t-distributed stochastic neighbor embedding (tSNE) (23, 24) based on gene expression corrected for cell-cycle effects did not reveal distinct subpopulations of cells (Fig. 2A). However, clustering genes by tSNE did reveal distinct sets of genes that displayed similar expression patterns across cells (Fig. 2B), as did hierarchical clustering (Fig. 2C).

Fig. 2.

(A) Clustering of 738 U2OS cells by tSNE. Cells were obtained from a homogenous culture and cDNA was amplified by MALBAC-DT. Consistent with a homogenous culture, no subclusters of cells are apparent. (B) Clustering of genes by tSNE. Genes were clustered based on their expression levels across the 738 U2OS cells. Many clusters are present, representing sets of genes that have similar expression patterns. (C) Hierarchical clustering of gene expression data from 738 U2OS cells. Several sets of genes with similar expression patterns are observed, although fewer than are observable by tSNE. (D) Correlation matrix for ~11,000 genes that were detected in at least 10% of the 738 U2OS cells. Genes are ordered by hierarchical clustering to reveal numerous modules of highly correlated and/or anticorrelated genes. Inset provides an enlarged view to highlight the detail present in many of these clusters. The genes in many of these correlated gene modules (CGMs) are enriched for particular biological functions or contain binding sites for specific transcription factors. (E) A CGM related to protein synthesis, with genes responsible for specific processes indicated by arrows. (F and G) CGMs related to sterol synthesis (F) and extracellular matrix remodeling (G). Genes with known functional roles in these processes are labeled in red. (H) Genes with high correlations are more likely to be identified as related in protein–protein interaction databases. Gene pairs are binned based on their correlation coefficient. For each bin, the fraction of pairs identified by StringDb is plotted.

To further investigate these clusters of genes, we computed the correlation coefficient for each pair of genes across all cells. Upon hierarchical clustering of this correlation matrix (Fig. 2D), we identified 148 CGMs of 10 to 200 genes that are highly correlated with each other. Many of these modules consist of genes pertaining to a specific biological function, including general housekeeping functions—such as cell cycle control and protein synthesis (Fig. 2E) and cholesterol synthesis (Fig. 2F)—as well as cell-type specific functions such as bone growth and extracellular matrix remodeling (Fig. 2G). For example, the protein synthesis module includes genes responsible for synthesizing tRNAs and amino acids, as well as the machinery required to initiate transcription and translation (Fig. 2E and SI Appendix, Table S3 module 45). The same is true for cholesterol synthesis (Fig. 2F and SI Appendix, Table S3 module 27).
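The first step of this analysis, building a gene-by-gene correlation matrix across cells, can be sketched as follows. The grouping step here is a deliberately simplified stand-in: it links genes whose correlation exceeds a threshold and reports connected components, whereas the paper uses hierarchical clustering. The toy expression values, the threshold, and all function names are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def correlation_matrix(expr):
    """expr maps gene name -> expression values across the same cells."""
    genes = sorted(expr)
    corr = [[pearson(expr[g], expr[h]) for h in genes] for g in genes]
    return genes, corr


def modules_by_threshold(genes, corr, threshold):
    """Group genes into connected components of the graph r >= threshold.

    Simplified stand-in for hierarchical clustering, for illustration only.
    """
    unvisited = set(range(len(genes)))
    modules = []
    while unvisited:
        start = min(unvisited)
        unvisited.remove(start)
        stack, component = [start], []
        while stack:
            i = stack.pop()
            component.append(genes[i])
            for j in sorted(unvisited):
                if corr[i][j] >= threshold:
                    unvisited.remove(j)
                    stack.append(j)
        modules.append(sorted(component))
    return modules


# Hypothetical toy data: four genes measured in five cells.
TOY_EXPR = {
    "g1": [1, 2, 3, 4, 5],
    "g2": [2, 4, 6, 8, 11],
    "g3": [5, 3, 4, 1, 2],
    "g4": [10, 6, 9, 2, 3],
}

if __name__ == "__main__":
    names, matrix = correlation_matrix(TOY_EXPR)
    print(modules_by_threshold(names, matrix, threshold=0.8))
```

In the toy data, g1 and g2 rise together across cells while g3 and g4 fall together, so the two pairs separate into distinct modules that are anticorrelated with each other.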

We observed slightly more positively correlated than negatively correlated pairs of genes (SI Appendix, Fig. S2). If all genes were independent, we would expect measurement noise to produce positive and negative correlations equally often. We believe that positive pair-wise correlations in our data arise primarily from genes that share one or more common regulatory transcription factors. Genes within a CGM have positive pair-wise correlations, whereas correlations between different CGMs are often negative because of their phase differences. Although negative correlation between certain pairs of genes is mechanistically possible, the slight overall excess of positive over negative pair-wise correlations indicates that coregulation of a group of genes by a transcription factor is a widespread phenomenon. Such modules were also largely identified using the Weighted Gene Correlation Network Analysis (WGCNA) algorithm (25, 26) (SI Appendix, Fig. S3, Materials and Methods). A full list of CGMs and their associated functional enrichments is provided in SI Appendix, Table S3.

Although gene-to-gene correlation measurements have been widely used, they have been made almost exclusively by means of the perturbative approach (18, 27–29), i.e., evaluation of the correlations after introducing a new experimental condition. This perturbative approach is bound to affect a large number of genes in the cell, usually resulting in large groups of correlated genes, most of which may be functionally unrelated. Using our method to evaluate correlations of steady-state fluctuations in single cells revealed CGMs with a smaller number of intrinsically correlated genes, thus making it a more focused approach.

While analysis and normalization of cell cycle dependencies is important for cell typing and differential expression analyses (30), the CGMs we observed are largely unaffected when expression levels are adjusted to control for cell cycle, except those directly related to cell cycle activity (SI Appendix, Fig. S4). Although genes representing different phases of the cell cycle become uncorrelated after normalization, we observe that genes within cell cycle-related CGMs remain correlated, as would be expected due to correlations in the stochastic fluctuations that are not removed by normalization.

Those highly correlated genes we observed were more likely to be identified as being related in databases of protein–protein interactions (PPI) (Fig. 2H), indicating that correlations observed at steady state can imply functional relationships between genes, raising the possibility of identifying mechanistically related sets of genes from such measurements.

CGM Detection Requires a Method with Both High Precision and High Throughput.

Detecting CGMs relies on precise measurements of correlations between genes. This requires large numbers of cells, accurate quantification of transcripts, and high detectability. Low cell numbers add sampling noise to the correlation coefficients (Fig. 1B), while low detection sensitivity attenuates the correlations (Fig. 1C). Simulations demonstrate that neither high cell numbers nor high detection sensitivity alone is sufficient (SI Appendix, Figs. S5 and S6), and that the CGMs detected were not the result of spurious correlations due to limited sample size (SI Appendix, Fig. S7).
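The sampling-noise side of this trade-off can be quantified with a standard result: for two independent, approximately normally distributed variables, the sample Pearson correlation has mean zero and variance 1/(n − 1) when n cells are sampled. The sketch below applies this to a few cell numbers; it is an approximation for real count data, and the function name is hypothetical.

```python
import math


def null_correlation_sd(n_cells):
    """Standard deviation of the sample correlation for an uncorrelated pair.

    Uses Var(r) = 1 / (n - 1), which holds for independent bivariate
    normal variables and is only approximate for transcript counts.
    """
    if n_cells < 3:
        raise ValueError("need at least three cells")
    return 1.0 / math.sqrt(n_cells - 1)


if __name__ == "__main__":
    for n in (96, 738, 10000):
        print(f"{n:>6} cells: sd of r under no correlation = {null_correlation_sd(n):.4f}")
```

Because the noise falls only as the square root of the cell number, attenuated correlations from a low-sensitivity method need disproportionately more cells to rise above this floor.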

Previously published data generated by 10× Genomics (7), which has high throughput but low sensitivity, is unable to reveal CGMs (SI Appendix, Fig. S8A), further confirming that large cell numbers cannot compensate for low sensitivity. A dataset more recently made available by 10× Genomics is able to detect 40 modules compared with 178 CGMs identified by MALBAC-DT using the same cell line. Of these 40 modules, 53% (21/40) were found to be significantly correlated by MALBAC-DT (q < 0.05, Materials and Methods), while of the 178 CGMs identified by MALBAC-DT, only 9.6% (17/178) were able to be detected as significantly correlated in the 10× Genomics data. Recently, we generated both MALBAC-DT and 10× Genomics (v3 chemistry) data from the same single-cell-derived clone of the K562 cell line. After filtering out the low-quality cells from the 10× data (nearly half of the cells), even with similar detectability, sequencing depth, and cell numbers (SI Appendix, Fig. S8D), correlated genes identified by MALBAC-DT are more enriched in the protein–protein interaction database than those identified by 10× (SI Appendix, Fig. S8F).

CGMs Reveal Regulatory Relationships.

To validate that transcripts within an identified module are coregulated, we further investigated a module comprising targets of the key tumor suppressor protein p53 (Fig. 3A). This CGM contains 197 genes, most of which have been identified by ChIP-seq studies to contain p53 binding sites. Moreover, all p53 targets identified by a previous ChIP-chip study (31) of this cell line are contained in this module.

Fig. 3.

(A) CGM related to p53 activity. Genes with significant literature support (32) for being targets of p53 are indicated by red arrows, while genes with limited literature support are indicated by black arrows. (B) Distribution of p53 transcript levels in control and shRNA knockdown cells. (C) Mean expression levels of transcripts in p53 knockdown cells vs. control cells. Red points indicate genes that are significantly differentially expressed. (D) Correlation among genes in a homogenous cell population is predictive of their response to perturbation. Pairs of genes differentially expressed in response to p53 knockdown are categorized as strongly anticorrelated (less than −0.2), moderately anticorrelated (between −0.2 and −0.1), weakly anticorrelated (between −0.1 and 0), weakly correlated (between 0 and 0.1), moderately correlated (between 0.1 and 0.2), and strongly correlated (greater than 0.2). For each category, the fraction of gene pairs that are concordantly regulated (both up-regulated or both down-regulated) or discordantly regulated (one up-regulated while the other is down-regulated) is shown. (E) Hierarchical clustering of genes differentially expressed upon p53 knockdown. Genes cluster into ~10 CGMs. Genes within a CGM have the same directional response to p53 knockdown, consistent with their regulation as a functional unit. Direct p53 targets, indicated by red and black arrows as in (A), are predominantly found in a single CGM, as are the genes originally identified as a CGM related to p53 function. CGMs thus distinguish direct p53 targets from downstream pathways.

We performed an shRNA knockdown of p53 followed by MALBAC-DT and examined its effect on the expression of transcripts within the p53 module. Mean transcript levels in the p53 module decreased 15-fold in the cells with p53 knockdown compared with the control cells (Fig. 3B). Strikingly, correlations in our steady-state dataset were able to predict the genes that would be perturbed by p53 knockdown even more accurately than ChIP-seq studies. Whereas 49% of genes in the p53 module were down-regulated upon p53 knockdown, this was only the case for 33% of genes from a consensus of ChIP-seq studies (32), and 9% of genes identified by a ChIP-chip study of the same cell line (31).

In addition to the downregulation of genes in the p53 module, 1337 genes were significantly up- or down-regulated throughout the whole transcriptome upon p53 perturbation (Fig. 3C). Gene correlations in our original dataset proved to be predictive of perturbation response: strongly correlated genes tended to be both up-regulated or down-regulated, whereas the opposite was true for strongly anticorrelated genes (Fig. 3D). Furthermore, these differentially expressed genes clustered into ~10 CGMs, several of which are associated with distinct pathways (Fig. 3E). These associations can potentially be used to infer the function of previously uncharacterized differentially expressed genes.
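The concordance analysis of Fig. 3D can be sketched as a binning step followed by a per-bin fraction. The category thresholds follow the figure legend; how values falling exactly on a boundary are assigned, the input format, and the function names are our assumptions, and the example pairs are invented for illustration.

```python
def correlation_category(r):
    """Assign a steady-state correlation to one of the six Fig. 3D bins.

    Boundary handling (e.g. r exactly 0.1) is an arbitrary choice here.
    """
    if r < -0.2:
        return "strongly anticorrelated"
    if r < -0.1:
        return "moderately anticorrelated"
    if r < 0.0:
        return "weakly anticorrelated"
    if r <= 0.1:
        return "weakly correlated"
    if r <= 0.2:
        return "moderately correlated"
    return "strongly correlated"


def concordance_by_category(pairs):
    """Fraction of concordantly regulated gene pairs in each correlation bin.

    pairs: iterable of (r, direction_a, direction_b), where each direction
    is +1 (up-regulated) or -1 (down-regulated) after the perturbation.
    """
    counts = {}
    for r, dir_a, dir_b in pairs:
        category = correlation_category(r)
        concordant, total = counts.get(category, (0, 0))
        counts[category] = (concordant + (dir_a == dir_b), total + 1)
    return {cat: conc / total for cat, (conc, total) in counts.items()}


if __name__ == "__main__":
    # Invented example pairs: (correlation, direction of gene a, direction of gene b).
    toy_pairs = [
        (0.35, 1, 1), (0.25, -1, -1), (0.30, 1, -1),
        (-0.40, 1, -1), (-0.30, -1, 1), (0.05, 1, -1),
    ]
    for category, fraction in concordance_by_category(toy_pairs).items():
        print(f"{category}: {fraction:.2f} concordant")
```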

CGMs Reveal Regulatory Targets.

Because gene correlations reflect regulatory relationships, they have the potential to identify genes that act in the same regulatory pathway. Indeed, several genes which previously had not been identified as p53 targets exhibited a strong correlation with p53 expression and were perturbed by p53 knockdown. For example, the deubiquitinase JOSD1 was found to be strongly anticorrelated with p53 activity. Although JOSD1 had not been previously associated with p53, other deubiquitinases are known to modulate p53 activity either directly or through Mdm2. Moreover, a structurally related protein ATXN3 was recently shown to stabilize p53 via deubiquitination (33). We therefore hypothesized that JOSD1 might play a role in the p53 pathway. Indeed, JOSD1 was up-regulated upon p53 inactivation, consistent with their anticorrelation in the unperturbed system, indicating a possible negative feedback loop in which p53 inhibits JOSD1 transcription, while the JOSD1 protein stabilizes p53.

Understanding Differential Expression between Cell Types.

We hypothesized that CGMs could provide insight into differential gene expression across cell types and the biological pathways that are differentially activated. To this end, we amplified and sequenced 748 single cells from the human embryonic kidney HEK293T cell line. As expected for dramatically different cell types, several thousand genes were found to be differentially expressed between the HEK293T and U2OS cell lines (Fig. 4 A and B). While differential expression failed to provide meaningful classification of these genes, CGMs provided a natural way to understand the differences between these two cell types. Of the 148 CGMs identified in U2OS, 22 were also observed in HEK293T cells, representing housekeeping machinery shared across vastly different cell types. For many CGMs, the genes are up- or down-regulated as a group (Fig. 4 C and D). In some of these cases, the CGM is absent in one cell type (Fig. 4C), indicating that the module is switched off. In some other cases, the CGM is present in both cell types, indicating that the genes remain coregulated but at different expression levels (Fig. 4D).

Fig. 4.

Shared and cell-type-specific CGMs between the U2OS and HEK293T cell lines. (A) Mean expression levels of genes in HEK293T vs. U2OS. (B) Hierarchical clustering of expression separates HEK293T and U2OS cells, and reveals ~6,000 up- and down-regulated genes, but does not provide a clear assignment of genes into concise modules. (C–F) CGMs organize genes into biologically relevant pathways in a manner that is distinct from and not apparent by differential clustering. Top row heat maps represent the correlation between pairs of genes. Correlations in U2OS are shown above the diagonal and correspond to subsets of the data presented in Fig. 2D. Correlations for the same sets of genes in HEK293T cells are shown below the diagonal. Bottom row heat maps show the corresponding expression patterns of these genes across cells. In some cases (C) differentially expressed genes correspond to differentially correlated modules, and CGMs can further separate differentially expressed genes into distinct functional units. In other cases (D), a set of differentially expressed genes can be resolved into a common CGM between multiple cell types. Moreover, differential correlation between cell types can occur without differential expression (E), possibly indicating multiple modes of regulation for the component genes. Finally, correlations can be consistently observed across cell types and organized into distinct CGMs in the absence of differential expression (F).

CGMs can also identify changes in regulatory architecture across cell types even in the absence of differential expression. Although HEK293T is known to lack p53 activity (34, 35), target genes of p53 are not down-regulated in HEK293T compared with U2OS (Fig. 4E), suggesting their potential roles in other pathways. Despite similar expression levels of the component genes, the p53 CGM is absent in HEK293T.

Discussion

While single-cell studies have typically investigated differential expression between cell types, such studies have several limitations. Between two cell types or conditions, multiple pathways will be simultaneously perturbed, making it difficult to separate the large number of differentially expressed genes into their respective pathways. Meanwhile, many pathways will not be affected, making it challenging and costly to find the right perturbation and timing to isolate a pathway of interest. Finally, perturbing a pathway will affect top-level regulators and distant downstream genes alike, making it difficult to unravel the regulatory hierarchy.

Our results demonstrate that stochastic fluctuations in gene expression contain a wealth of information about the structure of gene regulatory networks, which can be investigated using a single-cell amplification method that is sufficiently sensitive and accurate. By sequencing a uniform population of cells, we identified a number of CGMs corresponding to a variety of biological functions, including metabolic processes such as protein and cholesterol synthesis, tissue-specific regulatory pathways, and p53 regulation. Notably, all of these CGMs were identified a priori from a single cell line, with no need to conduct targeted perturbations or choose specific cell types to compare. Although we rely experimentally on MALBAC-DT, it is possible that the method of using stochastic fluctuations to identify CGMs could be applied more generally in the future as the sensitivity of other single-cell amplification methods improves.

CGMs indicate an alternative method for assigning previously uncharacterized genes to functional pathways. Compared with differential expression studies, CGMs provide a more focused approach by narrowing down the large number of differentially expressed genes into smaller sized functional modules. In this way, CGM analysis can be used to identify direct and indirect regulatory targets of a specific regulator even more accurately than ChIP-seq studies, as evidenced by the p53 example where targets of p53 were identified. The degree of gene correlation or anticorrelation was predictive of response patterns caused by perturbation, further emphasizing that these correlations imply coregulatory relationships.

Finally, we have shown that CGMs identified within uniform cell populations provide a framework for understanding the molecular pathways that are affected by perturbations and distinguish different cell types. Upon p53 knockdown, the CGMs identified in our original population served to group the large number of differentially expressed genes into direct regulatory targets and specific downstream processes. Similarly, when comparing U2OS and HEK293T cell lines, we were able to separate the thousands of differentially expressed genes into cell type specific CGMs and identified different mechanisms of regulation of p53 targets between the two cell types despite the lack of consistent differential expression of these genes.

Although the analyses we presented on cell lines using CGMs in this paper are only proof-of-principle, our results shed light on critical biological processes relevant to many cell types. We expect the analysis of CGMs in a diverse set of cell types and tissues using methods with high sensitivity will provide critical insights not obtainable using differential expression analyses alone.

Materials and Methods

Cell Culture and Handling.

U2OS and HEK293T cell lines were obtained from ATCC and cultured at 37 °C in RPMI-1640 medium with 10% fetal bovine serum and 1% penicillin–streptomycin. To form single-cell suspensions for flow sorting, culture medium was removed, and cultures were rinsed with Dulbecco’s phosphate-buffered saline (D-PBS) and incubated with 1 mL 0.25% trypsin for 5 min. Detached cells in D-PBS were pelleted by centrifugation at 300 g for 5 min and resuspended in D-PBS. Single-cell suspensions were then kept on ice until flow sorting.

shRNA knockdowns were performed by incubating ~30% confluent U2OS cells with 1 μg of either 11,653 C3 plasmid for TP53 knockdown cells or with TransIT-LT1 plasmid for control cells. Cells were incubated for 48 h followed by flow sorting to isolate single cells in lysis buffer.

K562 cells were used for both MALBAC-DT and 10× Genomics sequencing library preparation. K562 single cells were seeded by mouth pipetting into a 96-well plate containing RPMI-1640 medium with 10% fetal bovine serum and 1% penicillin–streptomycin and cultured at 37 °C. The culture medium was changed every 3 d for about 3 wk. Cells were checked under an inverted microscope and the colony was transferred to a T25 cell culture flask for expansion with the same RPMI medium. Cells were cultured in 5% CO2 at 37 °C and the culture medium was changed approximately every 3 d until the cell density reached about 0.6 × 10^6 cells/mL. Cells were split into two flasks and cultured for another 2 or 3 d before MALBAC-DT cell sorting or 10× Genomics cell preparation. On the day of cell sorting or cell preparation, K562 cell cultures were centrifuged at 300 g to collect cells, washed twice with 1 mL 0.04% BSA-PBS and suspended in 1 mL BSA-PBS. Then, suspended cells were filtered through a 70-μm cell strainer and stained with Hoechst 33432 for 10 min before cell sorting.

MALBAC-DT Protocol.

Sequencing libraries for U2OS and HEK293T cell lines were prepared by the original MALBAC-DT protocol. Cells were flow-sorted into 3 μL lysis buffer consisting of 1 μL H2O, 0.6 μL 5× SSIV buffer, 0.15 μL 10% ICA-630, 0.8 μL 5 M betaine, 0.05 μL SUPERase in, 0.2 μL 50 μM RT-An primer, and 0.2 μL 10 mM dNTP mix. Plates were stored at −80 °C until ready for amplification. During pipetting, plates were kept on ice; after each pipetting step, they were vortexed and briefly centrifuged.

To perform reverse transcription, plates were incubated at 72 °C for 3 min, then 1 μL RT mix was added consisting of 0.264 μL H2O, 0.16 μL 5× SSIV buffer, 0.2 μL 100 mM DTT, 0.152 μL SUPERase in, 0.024 μL 1 M MgSO4, and 0.2 μL SuperScript IV. Plates were incubated at 55 °C for 10 min.

Next, excess reverse transcription primers were degraded by exonuclease digestion. 1 μL exonuclease mix was added consisting of 0.1 μL Exo I buffer, 0.1 μL H2O, 0.6 μL Exo I, and 0.2 μL 50 μM RT-Bn primer. Plates were incubated at 37 °C for 30 min and then at 80 °C for 20 min.
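As a bookkeeping aid, the per-well recipes above can be written as data and checked against their stated totals (3 μL lysis buffer, 1 μL RT mix, 1 μL exonuclease mix). The sketch below also scales a recipe to a full plate; the 10% pipetting overage and the function names are our assumptions, not part of the protocol.

```python
from decimal import Decimal

# Per-well volumes in microliters, transcribed from the protocol text.
LYSIS_BUFFER = {
    "H2O": "1", "5x SSIV buffer": "0.6", "10% ICA-630": "0.15",
    "5 M betaine": "0.8", "SUPERase in": "0.05",
    "50 uM RT-An primer": "0.2", "10 mM dNTP mix": "0.2",
}
RT_MIX = {
    "H2O": "0.264", "5x SSIV buffer": "0.16", "100 mM DTT": "0.2",
    "SUPERase in": "0.152", "1 M MgSO4": "0.024", "SuperScript IV": "0.2",
}
EXO_MIX = {
    "Exo I buffer": "0.1", "H2O": "0.1", "Exo I": "0.6",
    "50 uM RT-Bn primer": "0.2",
}


def total_volume(mix):
    """Sum of component volumes, using exact decimal arithmetic."""
    return sum((Decimal(v) for v in mix.values()), Decimal(0))


def scale_mix(mix, n_wells, overage=Decimal("0.1")):
    """Master-mix volumes for n_wells, with a fractional overage (assumed 10%)."""
    factor = Decimal(n_wells) * (1 + overage)
    return {name: Decimal(v) * factor for name, v in mix.items()}


if __name__ == "__main__":
    for label, mix in (("lysis", LYSIS_BUFFER), ("RT", RT_MIX), ("exo", EXO_MIX)):
        print(label, total_volume(mix), "uL per well")
    for name, volume in scale_mix(RT_MIX, 96).items():
        print(f"{name}: {volume} uL for a 96-well plate")
```

Decimal arithmetic is used so that sums of small volumes such as 0.024 μL compare exactly against the stated totals, which binary floating point would not guarantee.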

Amplification was performed by adding 24 μL amplification mix consisting of 18.64 μL H2O, 3 μL ThermoPol buffer, 0.4 μL 10 mM dNTP mix, 0.16 μL 100 mM MgSO4, 0.4 μL 50 μM GAT-7N, 0.4 μL 50 μM GAT-COM, and 1 μL Deep Vent (exo-). The following thermocycle program was run:

Step  Temperature (°C)  Time (min:s)
1     95                5:00
2     4                 0:50
3     10                0:50
4     20                0:50
5     30                0:50
6     40                0:45
7     50                0:45
8     65                4:00
9     95                0:20
10    58                0:20
11    Goto step 2, 10×
12    95                1:00
13    95                0:20
14    58                0:30
15    72                3:00
16    Goto step 13, 17×
17    72                5:00
18    4                 hold

Finally, amplification was completed by adding 0.4 μL 50 μM Tru2-Gn-RT primers and running an additional five cycles of PCR steps 12 to 15. Amplified plates were stored at −20 °C until library preparation.
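The thermocycle program can be encoded as data to estimate instrument time. The sketch below sums programmed hold times only; it ignores ramping, the final 4 °C hold, and the five additional PCR cycles, and it assumes that "10×" and "17×" denote the total number of passes through each loop (some thermocyclers interpret a Goto count as additional repeats). All names are hypothetical.

```python
def parse_time(mmss):
    """Convert a 'min:s' string such as '4:00' into seconds."""
    minutes, seconds = mmss.split(":")
    return int(minutes) * 60 + int(seconds)


# (temperature in deg C, hold time) for one pass of each loop.
LINEAR_CYCLE = [
    (4, "0:50"), (10, "0:50"), (20, "0:50"), (30, "0:50"),
    (40, "0:45"), (50, "0:45"), (65, "4:00"), (95, "0:20"), (58, "0:20"),
]
PCR_CYCLE = [(95, "0:20"), (58, "0:30"), (72, "3:00")]

# (label, steps, number of passes); pass counts follow the assumption above.
PROGRAM = [
    ("initial denaturation", [(95, "5:00")], 1),
    ("MALBAC linear cycles (steps 2-10)", LINEAR_CYCLE, 10),
    ("denaturation (step 12)", [(95, "1:00")], 1),
    ("PCR cycles (steps 13-15)", PCR_CYCLE, 17),
    ("final extension (step 17)", [(72, "5:00")], 1),
]


def total_hold_seconds(program):
    """Total programmed hold time in seconds, excluding ramp times."""
    return sum(
        passes * sum(parse_time(t) for _, t in steps)
        for _, steps, passes in program
    )


if __name__ == "__main__":
    seconds = total_hold_seconds(PROGRAM)
    print(f"{seconds} s = {seconds // 3600} h {seconds % 3600 // 60} min {seconds % 60} s")
```

Under these assumptions the programmed holds add up to a little under three hours, with the ten linear MALBAC cycles accounting for more than half of that time.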

In the original version of the protocol, a total of 8 RT3-An primers were used per 96-well plate, with one primer corresponding to one row of the plate. During the final amplification step, 12 Tru2-Gn-RT primers were used, with one primer corresponding to one column of the plate. In the automated version of the protocol, 96 RT3-An primers were used, with a distinct primer corresponding to each well. In the final step, a single Tru2-Gn-RT primer was then used. While the former method requires lower upfront costs to synthesize primers, the latter method simplifies preparing plates at larger scales and avoids cross-contamination of samples during amplification.
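The two barcoding schemes can be compared with a small mapping sketch: in the original scheme a well is identified by the combination of a row primer (one of 8) and a column primer (one of 12), while in the automated scheme each well has its own primer (one of 96). The primer index numbering and function names below are hypothetical; the protocol text does not specify how primers are numbered.

```python
ROWS = "ABCDEFGH"
N_COLUMNS = 12


def combinatorial_primers(well):
    """Original scheme: (row primer index 1-8, column primer index 1-12)."""
    row, column = well[0].upper(), int(well[1:])
    if row not in ROWS or not 1 <= column <= N_COLUMNS:
        raise ValueError(f"not a 96-well position: {well}")
    return ROWS.index(row) + 1, column


def automated_primer(well):
    """Automated scheme: a single well-specific primer index 1-96."""
    row_index, column = combinatorial_primers(well)
    return (row_index - 1) * N_COLUMNS + column


if __name__ == "__main__":
    for well in ("A1", "C7", "H12"):
        print(well, combinatorial_primers(well), automated_primer(well))
```

Both schemes give 96 distinct identities per plate; the combinatorial one needs only 8 + 12 = 20 primers, which is the lower upfront synthesis cost mentioned above.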

To prepare libraries for sequencing, 1 μL from each of the wells was combined and purified using 0.8× Ampure beads. The Nextera library preparation kit was used to add Illumina adapters by tagmentation. During subsequent PCR steps, Ix-Tru2 primers were substituted for Nextera S7XX primers in order to select the 3′ ends of transcripts containing cell barcodes and UMIs.

Automated MALBAC-DT Protocol.

The library for the K562 cell line was prepared by an automated MALBAC-DT protocol. The details of the automated MALBAC-DT protocol have been previously described (33). In brief, cells were sorted into a 96-well plate containing 2 μL lysis buffer and incubated at 72 °C for 3 min for cell lysis, followed by primer annealing at 4 °C. Reverse transcription was performed by incubating the plates at 55 °C for 10 min after the addition of 2 μL RT mix. Excess RT primers were digested by adding 2 μL Exo mix and incubating at 37 °C for 30 min and 80 °C for 20 min. For cDNA amplification, 24 μL PCR master mix was added and amplified for 15 cycles. Subsequent to cDNA amplification, 2 μL 10 μM Tru2-G-RT primer was added to each well and amplified for another five cycles. Sequencing libraries were prepared with the Vazyme TruePrep DNA Library Prep Kit and sequenced on NovaSeq (2 × 150 bp) with a custom sequencing primer for Read 2.

10× Genomics.

A 10× sequencing library of K562 cells was prepared using Chromium Single Cell 3′ Reagent Kits v3 (10× Genomics) according to the manufacturer’s instructions. Briefly, K562 cells were centrifuged, washed, and filtered as described under “Cell culture and handling”. After cell counting, about 3,200 cells were loaded onto the chip with an estimated recovery of 2,000 cells. After reverse transcription, single-cell cDNA was purified and amplified for 12 cycles as recommended in the instructions. The amplified cDNA was then used for sequencing library preparation by fragmentation, end repair, A-tailing, adaptor ligation, and PCR amplification (11 cycles). The library was sequenced on the Illumina HiSeq X Ten system with 2 × 150-bp paired-end sequencing.

Sequence Processing.

Separate fastq files were generated for each cell based on the outer and inner barcode sequences. Reads that did not exactly match the barcode of any cell were discarded. Barcodes, adapter sequences, and UMIs were stripped from the reads, which were then aligned to the human GRCh38.p7 reference using STAR v.2.5.2. For each gene, a list of UMIs was obtained for all reads mapped to that gene, excluding regions masked by RepeatMasker. To remove extraneous UMIs resulting from amplification or sequencing errors, UMIs for a particular gene were represented as nodes in a graph, with connections between UMIs differing at no more than 7 bases. Connected components were identified, and the consensus sequence within each component was determined. Consensus sequences matching the (HBDV)5 RT-An pattern and differing from the (VDBH)5 RT-Bn pattern at three or more bases were retained. To avoid potential cross-talk between wells, UMIs observed for the same gene in multiple cells were discarded.
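The graph-based UMI collapsing step can be illustrated with a minimal Python sketch. It is not the authors' code: the function names (`hamming`, `collapse_umis`, `matches_rt_an`) are hypothetical, the consensus is taken as a per-position majority vote (an assumption, since the consensus rule is not specified), and the comparison against the (VDBH)5 RT-Bn pattern is omitted for brevity.

```python
import re
from collections import Counter
from itertools import combinations

# IUPAC codes: H = A/C/T, B = C/G/T, D = A/G/T, V = A/C/G.
RT_AN_PATTERN = re.compile(r"(?:[ACT][CGT][AGT][ACG]){5}")


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def matches_rt_an(umi):
    """True if a 20-base UMI follows the (HBDV)5 pattern."""
    return RT_AN_PATTERN.fullmatch(umi) is not None


def collapse_umis(umis, max_dist=7):
    """Group the UMIs of one gene into connected components and
    return one consensus sequence per component (sorted)."""
    unique = sorted(set(umis))
    parent = {u: u for u in unique}

    def find(u):
        # Follow parent links to the component root, compressing the path.
        while parent[u] != u:
            parent[u] = parent[parent[u]]
            u = parent[u]
        return u

    # Connect every pair of UMIs differing at no more than max_dist bases.
    for a, b in combinations(unique, 2):
        if hamming(a, b) <= max_dist:
            parent[find(a)] = find(b)

    # Collect the observed reads (with multiplicity) of each component.
    groups = {}
    for u in umis:
        groups.setdefault(find(u), []).append(u)

    # Assumed consensus rule: per-position majority vote within a component.
    consensus = []
    for members in groups.values():
        cons = "".join(Counter(col).most_common(1)[0][0] for col in zip(*members))
        consensus.append(cons)
    return sorted(consensus)
```

The all-pairs comparison is quadratic in the number of distinct UMIs per gene, which is acceptable for a sketch but may need indexing for highly expressed genes.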

Quality Control of UMI Counts.

After obtaining UMI counts for all genes and cells, cells in which more than 1% of transcripts are from ERCC spike-ins or which contain fewer than 1,000 total transcripts are discarded, as are genes observed in fewer than 10% of cells. Counts are normalized relative to the total number of transcripts in each cell prior to computing the correlation matrix. For both MALBAC-DT and 10× libraries from K562 cell lines, UMI counts are processed using Seurat 3.9.9.9024 (34). In addition to the above filtering criteria, cells with over 20% mitochondrial gene content are discarded.
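As an illustration of the filtering and normalization just described, the following NumPy sketch applies the ERCC, total-transcript, and gene-detection thresholds to a genes × cells matrix. The function name `qc_and_normalize` and its arguments are hypothetical. Several details are assumptions rather than statements of the authors' pipeline: total transcripts include ERCC counts, gene detection is computed after cell filtering, ERCC rows are dropped before normalization, and the mitochondrial filter is left out.

```python
import numpy as np


def qc_and_normalize(counts, is_ercc, min_transcripts=1000,
                     max_ercc_frac=0.01, min_cell_frac=0.10):
    """counts: genes x cells UMI matrix; is_ercc: boolean mask over rows.

    Returns the normalized matrix plus the gene and cell masks that were kept.
    """
    counts = np.asarray(counts, dtype=float)
    is_ercc = np.asarray(is_ercc, dtype=bool)

    # Per-cell totals and the fraction of transcripts from ERCC spike-ins.
    total = counts.sum(axis=0)
    ercc_frac = counts[is_ercc].sum(axis=0) / np.maximum(total, 1.0)
    keep_cells = (ercc_frac <= max_ercc_frac) & (total >= min_transcripts)

    # Keep genes observed in at least min_cell_frac of the retained cells.
    kept = counts[:, keep_cells]
    detected_frac = (kept > 0).mean(axis=1)
    keep_genes = (detected_frac >= min_cell_frac) & ~is_ercc

    # Normalize each cell by its total count over the retained genes.
    kept = kept[keep_genes]
    col_sum = np.maximum(kept.sum(axis=0, keepdims=True), 1.0)
    return kept / col_sum, keep_genes, keep_cells
```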

The number of genes detected per cell in the 10× library we generated has a bimodal distribution (nFeature_RNA in SI Appendix, Fig. S8B). Nearly half of the cells have fewer than 2,500 detected genes. To filter out low-quality cells, a threshold of > 5,000 genes per cell is applied to the 10× UMI count matrix, resulting in 917 high-quality cells (SI Appendix, Fig. S8C). To conduct a fair comparison between MALBAC-DT and 10× libraries, the MALBAC-DT library was downsampled to match the 10× library in mean reads, median genes, and median UMIs per cell. From the downsampled library, 930 cells were randomly chosen (SI Appendix, Fig. S8C).

CGM Identification.

Hierarchical clustering was performed using the SciPy function “scipy.cluster.hierarchy.linkage” with method “average” and a distance metric of 1 − abs(ρij), where ρij is the Pearson correlation between genes i and j. CGMs were called using the R package Dynamic Tree Cut (35) with the options “minClusterSize = 10, method = hybrid, deepsplit = 0”.
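A minimal sketch of the clustering step is given below, using the SciPy linkage function with the “average” method and the 1 − abs(ρij) distance. Because Dynamic Tree Cut is an R package, the sketch substitutes a fixed-height cut via `scipy.cluster.hierarchy.fcluster`; this stand-in, the function name `cluster_genes`, and the default `cut_height` are assumptions for illustration and will not reproduce the hybrid tree-cut results.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform


def cluster_genes(expr, cut_height=0.9, min_size=10):
    """expr: genes x cells normalized matrix (no constant rows).

    Returns per-gene cluster labels and a list of index arrays, one per
    cluster with at least min_size genes.
    """
    rho = np.corrcoef(expr)                  # Pearson correlation between genes
    dist = np.clip(1.0 - np.abs(rho), 0.0, None)
    np.fill_diagonal(dist, 0.0)
    dist = (dist + dist.T) / 2.0             # remove floating-point asymmetry
    condensed = squareform(dist, checks=False)
    tree = linkage(condensed, method="average")
    # Stand-in for Dynamic Tree Cut: cut the dendrogram at a fixed height.
    labels = fcluster(tree, t=cut_height, criterion="distance")
    modules = [np.flatnonzero(labels == k) for k in np.unique(labels)]
    modules = [m for m in modules if m.size >= min_size]
    return labels, modules
```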

To test the robustness of this clustering algorithm, we randomized the UMI counts across all cells for each particular gene and recomputed the correlations between gene pairs, resulting in a distribution of correlation coefficients that would be expected due to limited sample size alone. On top of this background of uncorrelated genes, we set a group of genes to have a stronger correlation and examined whether these genes could then be identified by clustering. We found that for the magnitudes of correlations typically observed in our data, groups of correlated genes could be reliably recovered (SI Appendix, Fig. S9).
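The randomization used to build the background distribution can be sketched as follows: each gene's counts are permuted independently across cells, which preserves every gene's marginal distribution while destroying gene–gene correlations. The function names are hypothetical, and the step of planting a correlated gene group on top of this background is not shown.

```python
import numpy as np


def shuffle_within_genes(counts, rng):
    """Independently permute each gene's counts across cells (rows = genes)."""
    shuffled = np.array(counts, dtype=float, copy=True)
    for row in shuffled:
        rng.shuffle(row)  # in-place permutation of one gene's values
    return shuffled


def null_correlations(counts, seed=0):
    """Pairwise Pearson correlations expected from limited sample size alone."""
    rng = np.random.default_rng(seed)
    rho = np.corrcoef(shuffle_within_genes(counts, rng))
    upper = np.triu_indices_from(rho, k=1)   # each gene pair once
    return rho[upper]
```

The spread of the returned values gives a rough scale against which observed correlations can be judged for a given number of cells.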

GO and PPI Enrichment Analysis.

For CGMs identified from MALBAC-DT and 10× libraries of K562 cell lines, gene set enrichment analysis is performed using the R package “enrichR” (36, 37) with a P-value threshold of 1e-5 to associate gene sets in modules with the “KEGG_2019_Human”, “GO_Biological_Process_2018”, “GO_Cellular_Component_2018”, “GO_Molecular_Function_2018”, and “Reactome_2016” databases. The same settings were used when comparing the CGM calling method in this paper with the WGCNA method.

PPI enrichment analysis was performed using the STRING database (36) with the options “version = 11, species = 9606, score_threshold = 700”.

Pseudotime Inference and Cell-Cycle Correction.

Pseudotime was inferred for each cell by assuming that the expression of cell cycle genes followed a sinusoidal function along the time trajectory. The actual expression of each cell cycle gene was further modeled as a normal distribution centered on the level predicted by the sinusoidal function, with a variance that combines stochastic expression variance and technical noise:

$$y_{g,c} \sim \mathcal{N}\left(\mu_{g,c},\; v_g^2 + v_{tech}^2\right),$$
$$\mu_{g,c} = Amp_g\left(\cos\left(t_c - T_{peak,g}\right) + 1\right) + AmpShift_g.$$

$y_{g,c}$: actual expression of gene g for cell c.

$\mu_{g,c}$: expected expression of g for c from the sinusoidal function.

$v_g^2$: gene-specific variance from stochastic expression for g.

$v_{tech}^2$: common technical noise.

$Amp_g$, $AmpShift_g$: amplitude and amplitude shift of the sinusoidal function for g.

$T_{peak,g}$: the peak time of g, on the time scale of percentage into the cell cycle. Retrieved from Cyclebase.org (38).

$t_c$: the pseudotime of cell c.

The transcriptome was fitted against the described model, with a pseudotime optimized for each cell to maximize the overall likelihood. Maximum likelihood estimation was carried out using “PyTorch”.
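To make the likelihood concrete, the sketch below evaluates the Gaussian log-likelihood of one cell under a sinusoidal mean of the form Amp_g (cos(phase) + 1) + AmpShift_g and picks the pseudotime by grid search. It is a simplified stand-in, not the PyTorch fit described above: gene parameters are held fixed rather than estimated jointly, times are expressed as fractions of the cell cycle and converted to a phase with a factor of 2π (an assumption about units), and all function names are hypothetical.

```python
import numpy as np


def expected_expression(t, t_peak, amp, amp_shift):
    """Sinusoidal mean; t and t_peak are fractions of the cell cycle in [0, 1)."""
    phase = 2.0 * np.pi * (t - t_peak)
    return amp * (np.cos(phase) + 1.0) + amp_shift


def log_likelihood(y, t, t_peak, amp, amp_shift, var_gene, var_tech):
    """Gaussian log-likelihood of one cell's expression vector y at pseudotime t."""
    mu = expected_expression(t, t_peak, amp, amp_shift)
    var = var_gene + var_tech   # stochastic plus technical variance
    return np.sum(-0.5 * np.log(2.0 * np.pi * var) - 0.5 * (y - mu) ** 2 / var)


def infer_pseudotime(y, t_peak, amp, amp_shift, var_gene, var_tech, n_grid=200):
    """Grid search for the pseudotime that maximizes the likelihood of one cell."""
    grid = np.arange(n_grid) / n_grid
    scores = [log_likelihood(y, t, t_peak, amp, amp_shift, var_gene, var_tech)
              for t in grid]
    return grid[int(np.argmax(scores))]
```

A gradient-based optimizer would replace the grid search when gene parameters and pseudotimes are fitted together.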

In order to correct the covariance matrix for cell cycle effects, cells were then ordered by the assigned pseudotime, and the expression of each gene was corrected by subtracting the mean of the surrounding rolling window. In total, approximately 350 genes were adjusted for cell cycle effects (SI Appendix, Fig. S10).
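The rolling-window correction can be sketched as below. The window size is not stated in the text, so the default here is purely illustrative, as is the choice to wrap the window around both ends of the ordering on the grounds that the cell cycle is periodic. The function name `correct_cell_cycle` is hypothetical.

```python
import numpy as np


def correct_cell_cycle(expr, pseudotime, window=51):
    """expr: genes x cells matrix; pseudotime: one value per cell.

    Orders cells by pseudotime and subtracts, for each gene, the mean of a
    centered rolling window of neighboring cells. Returns the corrected
    matrix in the original cell order.
    """
    if window < 3 or window % 2 == 0:
        raise ValueError("window must be an odd integer >= 3")
    expr = np.asarray(expr, dtype=float)
    order = np.argsort(pseudotime)
    ordered = expr[:, order]
    half = window // 2
    # Assumption: wrap around both ends because pseudotime is periodic.
    padded = np.concatenate([ordered[:, -half:], ordered, ordered[:, :half]], axis=1)
    kernel = np.ones(window) / window
    rolling = np.array([np.convolve(row, kernel, mode="valid") for row in padded])
    corrected = np.empty_like(expr)
    corrected[:, order] = ordered - rolling
    return corrected
```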

Correlations between Coregulated Genes.

We consider the case of two genes that are regulated by the same transcription factor (SI Appendix, Fig. S1A) with dynamics described by:

$$\frac{d[A]}{dt} = v_A[\mathrm{TF1}] - d_A[A],$$
$$\frac{d[B]}{dt} = v_A[\mathrm{TF1}] - d_A[B],$$
$$[\mathrm{TF1}] = f(t).$$

For the sake of simplicity, we have taken the transcription rate $v_A$ and degradation rate $d_A$ to be the same for transcripts A and B. Transcription factor TF1, which regulates both A and B, is allowed to fluctuate in time arbitrarily.

The lifetimes of mRNAs are often short relative to those of proteins. In this case, as TF1 fluctuates, transcripts A and B rapidly adjust and fluctuate independently from one another around the steady-state concentration $[A]_{ss} = [B]_{ss} = [\mathrm{TF1}]\,v_A/d_A$, and these fluctuations will follow a Poisson distribution.

Under these assumptions, the covariance between [A] and [B] is:

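$$\begin{aligned}
\langle \delta a\,\delta b \rangle &= \sum_{tf1} \langle \delta a\,\delta b \mid tf1 \rangle\, p(tf1) \\
&= \sum_{tf1} \langle \delta a \mid tf1 \rangle \langle \delta b \mid tf1 \rangle\, p(tf1) \\
&= \sum_{tf1} \frac{v_A^2}{d_A^2}\, \delta tf1^2\, p(tf1) \\
&= \frac{v_A^2}{d_A^2}\, \sigma_{tf1}^2 .
\end{aligned}$$

Following a similar procedure to obtain $\langle \delta a^2 \rangle$ and $\langle \delta b^2 \rangle$, we obtain the correlation coefficient

$$\rho_{AB} = \frac{\langle \delta a\,\delta b \rangle}{\sqrt{\langle \delta a^2 \rangle \langle \delta b^2 \rangle}} = \frac{\mu_A\, cv_{TF1}^2}{1 + \mu_A\, cv_{TF1}^2},$$

where $\mu_A$ is the mean of A, and $cv_{TF1}$ is the coefficient of variation of TF1.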
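The closed-form correlation can be checked numerically with a short simulation. The sketch assumes a gamma-distributed TF1 with unit mean (an illustrative choice; the derivation only needs its coefficient of variation) and draws A and B as independent Poisson counts around the steady-state level given TF1. Function names and default parameter values are hypothetical. For a mean of 20 transcripts and a TF1 coefficient of variation of 0.3, the formula gives 1.8/2.8 ≈ 0.64.

```python
import numpy as np


def predicted_correlation(mean_a, cv_tf):
    """rho_AB = mu_A * cv^2 / (1 + mu_A * cv^2)."""
    x = mean_a * cv_tf ** 2
    return x / (1.0 + x)


def simulate_correlation(mean_a=20.0, cv_tf=0.3, n_cells=200_000, seed=0):
    """Empirical Pearson correlation between two coregulated transcripts."""
    rng = np.random.default_rng(seed)
    # Gamma-distributed TF1 with mean 1 and the requested CV (assumption).
    shape = 1.0 / cv_tf ** 2
    tf1 = rng.gamma(shape, 1.0 / shape, size=n_cells)
    # Given TF1, A and B are independent Poisson draws around the
    # steady-state level [TF1] * v_A / d_A (here v_A / d_A = mean_a).
    a = rng.poisson(mean_a * tf1)
    b = rng.poisson(mean_a * tf1)
    return np.corrcoef(a, b)[0, 1]
```

The formula also makes the dependence on expression level explicit: for a fixed TF1 variability, lowly expressed gene pairs are expected to show weaker correlations because Poisson noise dominates.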

Analysis of Datasets from the 10× Genomics Website.

Datasets were downloaded from the website of 10× Genomics https://www.10xgenomics.com/. The dataset reported by Zheng et al. (7), consisting of ~2,800 HEK293T cells, and the newer dataset, consisting of a mixture of ~10,000 HEK293T and mouse NIH3T3 cells, were used. For the latter dataset, only HEK293T cells were used for further analysis. BAM files were downloaded and filtered according to the same mapping criteria used for MALBAC-DT datasets. To determine whether a CGM detected with one method was significantly correlated in the other, each of the genes within a given module was randomly substituted with a gene from the top 50 genes with nearest mean expression, and the average of the absolute value of the correlations in this randomized module was calculated. This randomization was repeated 10,000 times, and a Bonferroni-corrected P-value was obtained by comparing the average correlation in the true module to the distribution of average correlations in the randomized modules.
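The expression-matched randomization test can be sketched as follows. The function names, the add-one P-value estimator, and the redraw of substitute sets that contain the same gene twice are assumptions made for this illustration; `n_tests` stands for the number of modules tested in the Bonferroni correction.

```python
import numpy as np


def mean_abs_corr(rho, genes):
    """Average |rho| over all distinct gene pairs in a module."""
    sub = np.abs(rho[np.ix_(genes, genes)])
    iu = np.triu_indices(len(genes), k=1)
    return sub[iu].mean()


def module_p_value(rho, mean_expr, module, n_perm=10_000, n_nearest=50,
                   n_tests=1, seed=0):
    """Bonferroni-corrected P-value for one module in a second dataset."""
    rng = np.random.default_rng(seed)
    module = np.asarray(module)
    observed = mean_abs_corr(rho, module)
    # For each module gene, the n_nearest other genes closest in mean expression.
    candidates = []
    for g in module:
        order = np.argsort(np.abs(mean_expr - mean_expr[g]))
        candidates.append(order[order != g][:n_nearest])
    null = np.empty(n_perm)
    for i in range(n_perm):
        while True:
            substitute = np.array([rng.choice(c) for c in candidates])
            # Assumption: redraw if a gene was picked twice.
            if np.unique(substitute).size == substitute.size:
                break
        null[i] = mean_abs_corr(rho, substitute)
    p = (1 + np.sum(null >= observed)) / (1 + n_perm)
    return min(1.0, p * n_tests)
```

Matching substitutes by mean expression controls for the fact that achievable correlations depend on expression level.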

Comparison of Modules Detected by Hierarchical Clustering and WGCNA.

Hierarchical clustering of the correlation matrix is performed using the “hclust” function of R software, with the “average” method and a distance metric of 1 − abs(ρ_ij), where ρ_ij is the correlation between genes i and j. Subclusters are obtained by cutting the dendrogram using the “cutree” function with parameter h = 0.9. Subclusters containing more than 10 genes are identified as CGMs. We also compare our module detection method with WGCNA, a widely used gene coexpression analysis package. Modules are identified by the “blockwiseModules” function, with “power = 1, TOMType = unsigned, minModuleSize = 10”, and the other parameters at their defaults. For the U2OS cell line, only 19 modules are identified by WGCNA, and nearly 9,000 genes do not form any module. We use the Dice coefficient as a measure of similarity between modules detected by both methods and find that CGMs and WGCNA modules are highly similar.
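For reference, the Dice coefficient between two gene sets A and B is 2|A ∩ B| / (|A| + |B|). A minimal sketch is shown below; the helper `best_matches`, which reports the best-scoring counterpart for each module, is a hypothetical convenience and not necessarily how the comparison was summarized.

```python
def dice(a, b):
    """Dice coefficient between two collections of gene identifiers."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # convention for two empty sets
    return 2.0 * len(a & b) / (len(a) + len(b))


def best_matches(modules_a, modules_b):
    """For each module in modules_a, the highest Dice score against modules_b."""
    return [max((dice(m, n) for n in modules_b), default=0.0) for m in modules_a]
```

A value of 1 indicates identical gene sets and 0 indicates no shared genes.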

Supplementary Material

Appendix 01 (PDF)

Acknowledgments

We thank L. Tan for helpful discussions of analysis methods and S. Mulepati for assistance with knockdown experiments. Portions of this paper have appeared as part of Dr. Lee’s doctoral dissertation. This work was first supported by an NIH Director’s Pioneer Award (DP1 CA186693), followed by a generous gift grant from Xianhong Wu to Harvard University, and continued at Beijing Advanced Innovation Center for Genomics at Peking University with a grant from the Beijing Municipal Commission of Science and Technology (Z201100005320015).

Author contributions

A.R.C. and X.S.X. designed research; A.R.C., D.F.L., and W.M. performed research; A.R.C., W.C., X.L., and W.S. analyzed data; and A.R.C. and X.S.X. wrote the paper with the help of other authors.

Competing interest

The authors declare a competing interest. The authors have patent filings to disclose: A.R.C., D.F.L., and X.S.X. are inventors on the patent PCT/US18/34689 filed by President and Fellows of Harvard College.

Footnotes

Reviewers: S.R.Q., Stanford University; and P.A.S., Columbia University Irving Medical Center.

Data, Materials, and Software Availability

All data are available on NCBI SRA (accession no. PRJNA837885). Software and code are available upon request.

References

  • 1. Phillips J., Eberwine J. H., Antisense RNA amplification: A linear amplification method for analyzing the mRNA population from single living cells. Methods 10, 283–288 (1996).
  • 2. Tang F., et al., mRNA-Seq whole-transcriptome analysis of a single cell. Nat. Methods 6, 377–382 (2009).
  • 3. Picelli S., et al., Smart-seq2 for sensitive full-length transcriptome profiling in single cells. Nat. Methods 10, 1096–1098 (2013).
  • 4. Chapman A. R., et al., Single cell transcriptome amplification with MALBAC. PLoS One 10, e0120889 (2015).
  • 5. Klein A. M., et al., Droplet barcoding for single-cell transcriptomics applied to embryonic stem cells. Cell 161, 1187–1201 (2015).
  • 6. Macosko E. Z., et al., Highly parallel genome-wide expression profiling of individual cells using nanoliter droplets. Cell 161, 1202–1214 (2015).
  • 7. Hashimshony T., et al., CEL-Seq2: Sensitive highly-multiplexed single-cell RNA-Seq. Genome Biol. 17, 77 (2016).
  • 8. Zheng G. X. Y., et al., Massively parallel digital transcriptional profiling of single cells. Nat. Commun. 8, 14049 (2017).
  • 9. Lovatt D., et al., Transcriptome in vivo analysis (TIVA) of spatially defined single cells in live tissue. Nat. Methods 11, 190–196 (2014).
  • 10. Islam S., et al., Characterization of the single-cell transcriptional landscape by highly multiplex RNA-seq. Genome Res. 21, 1160–1167 (2011).
  • 11. Islam S., et al., Highly multiplexed and strand-specific single-cell RNA 5′ end sequencing. Nat. Protoc. 7, 813 (2012).
  • 12. Islam S., et al., Quantitative single-cell RNA-seq with unique molecular identifiers. Nat. Methods 11, 163–166 (2014).
  • 13. Hagemann-Jensen M., et al., Single-cell RNA counting at allele and isoform resolution using Smart-seq3. Nat. Biotechnol. 38, 708–714 (2020).
  • 14. Shalek A. K., et al., Single-cell transcriptomics reveals bimodality in expression and splicing in immune cells. Nature 498, 236 (2013).
  • 15. Grün D., et al., Single-cell messenger RNA sequencing reveals rare intestinal cell types. Nature 525, 251 (2015).
  • 16. Zeisel A., et al., Cell types in the mouse cortex and hippocampus revealed by single-cell RNA-seq. Science 347, 1138 (2015).
  • 17. Usoskin D., et al., Unbiased classification of sensory neuron types by large-scale single-cell RNA sequencing. Nat. Neurosci. 18, 145 (2014).
  • 18. Dixit A., et al., Perturb-seq: Dissecting molecular circuits with scalable single-cell RNA profiling of pooled genetic screens. Cell 167, 1853–1866.e17 (2016).
  • 19. Kotliar D., et al., Identifying gene expression programs of cell-type identity and cellular activity with single-cell RNA-Seq. Elife 8, e43803 (2019).
  • 20. Sagar, Grün D., Deciphering cell fate decision by integrated single-cell sequencing analysis. Annu. Rev. Biomed. Data Sci. 3, 1–22 (2020).
  • 21. Wagner A., Regev A., Yosef N., Revealing the vectors of cellular identity with single-cell genomics. Nat. Biotechnol. 34, 1145–1160 (2016).
  • 22. Zong C., Lu S., Chapman A. R., Xie X. S., Genome-wide detection of single-nucleotide and copy-number variations of a single human cell. Science 338, 1622 (2012).
  • 23. Maaten L. V. D., Hinton G., Visualizing high-dimensional data using t-SNE. J. Mach. Learn. Res. 9, 26 (2008).
  • 24. Maaten L. V. D., Accelerating t-SNE using tree-based algorithms. J. Mach. Learn. Res. 15, 3221–3245 (2014).
  • 25. Langfelder P., Horvath S., WGCNA: An R package for weighted correlation network analysis. BMC Bioinformatics 9, 559 (2008).
  • 26. Zhang B., Horvath S., A general framework for weighted gene co-expression network analysis. Stat. Appl. Genet. Mol. Biol. 4, 17 (2005).
  • 27. Adamson B., et al., A multiplexed single-cell CRISPR screening platform enables systematic dissection of the unfolded protein response. Cell 167, 1867–1882.e1821 (2016), 10.1016/j.cell.2016.11.048.
  • 28. Dixit A., et al., Perturb-seq: Dissecting molecular circuits with scalable single-cell RNA profiling of pooled genetic screens. Cell 167, 1853–1866.e17 (2016).
  • 29. Jaitin D. A., et al., Dissecting immune circuits by linking CRISPR-pooled screens with single-cell RNA-seq. Cell 167, 1883–1896.e15 (2016).
  • 30. Datlinger P., et al., Pooled CRISPR screening with single-cell transcriptome readout. Nat. Methods 14, 297–301 (2017).
  • 31. Scialdone A., et al., Computational assignment of cell-cycle stage from single-cell transcriptome data. Methods 85, 54–61 (2015).
  • 32. Smeenk L., et al., Characterization of genome-wide p53-binding sites upon stress response. Nucleic Acids Res. 36, 3639–3654 (2008).
  • 33. Fischer M., Grossmann P., Padi M., DeCaprio J. A., Integration of TP53, DREAM, MMB-FOXM1 and RB-E2F target gene analyses identifies cell cycle gene regulatory networks. Nucleic Acids Res. 44, 6070–6086 (2016).
  • 34. Gao R., et al., Inactivation of PNKP by mutant ATXN3 triggers apoptosis by activating the DNA damage-response pathway in SCA3. PLoS Genet. 11, e1004834 (2015).
  • 35. Ahuja D., Sáenz-Robles M. T., Pipas J. M., SV40 large T antigen targets multiple cellular pathways to elicit cellular transformation. Oncogene 24, 7729 (2005).
  • 36. Steegenga W. T., Shvarts A., Riteco N., Bos J. L., Jochemsen A. G., Distinct regulation of p53 and p73 activity by adenovirus E1A, E1B, and E4orf6 proteins. Mol. Cell. Biol. 19, 3885–3894 (1999).
  • 37. Szklarczyk D., et al., STRING v11: Protein-protein association networks with increased coverage, supporting functional discovery in genome-wide experimental datasets. Nucleic Acids Res. 47, D607–D613 (2019).
  • 38. Santos A., Jensen L. J., Wernersson R., Cyclebase 3.0: A multi-organism database on cell-cycle regulation and phenotypes. Nucleic Acids Res. 43, D1140–D1144 (2014).

Articles from Proceedings of the National Academy of Sciences of the United States of America are provided here courtesy of National Academy of Sciences