Abstract
Although an established model organism, Tetrahymena thermophila remains comparatively inaccessible to high throughput screens, and alternative bioinformatic approaches still rely on unconnected datasets and outdated algorithms. Here, we report a new approach to consolidating RNA-seq and microarray data based on a systematic exploration of parameters and computational controls, enabling us to infer functional gene associations from their co-expression patterns. To illustrate the power of this approach, we took advantage of new data regarding a previously studied pathway, the biogenesis of a secretory organelle called the mucocyst. Our untargeted clustering approach recovered over 80% of the genes that were previously verified to play a role in mucocyst biogenesis. Furthermore, we tested four new genes that we predicted to be mucocyst-associated based on their co-expression and found that knocking out each of them results in mucocyst secretion defects. We also found that our approach succeeds in clustering genes associated with several other cellular pathways that we evaluated based on prior literature. We present the Tetrahymena Gene Network Explorer (TGNE) as an interactive tool for genetic hypothesis generation and functional annotation in this organism and as a framework for building similar tools for other systems.
Graphical Abstract
Introduction
Gene co-expression, particularly in response to an experimental perturbation, has long been used as evidence for the functional association of genes that are otherwise uncharacterized [1]. The transcriptome is an intermediary between genotype and phenotype, and transcriptomics is often cheaper, faster, and higher throughput than using biochemistry or genetic engineering to functionally associate a given gene with a biological pathway or process [2]. The number of transcriptomic datasets has grown dramatically over the past two decades, raising deep questions such as: how well do co-expression patterns translate from one set of experimental conditions to another? How many cellular processes are driven by genetic co-expression, and does this change under different growth or environmental conditions? These questions point to the importance and challenge of using the wealth of publicly available data to pursue new hypotheses, rather than treating whole-transcriptome experiments as either purely descriptive or one-and-done assays to study a single organism- or condition-specific problem. Answering these questions requires appropriate model systems and principled approaches.
The ciliate Tetrahymena thermophila is a unicellular eukaryote that has featured in groundbreaking discoveries regarding programmed genome rearrangements, telomeres/telomerase, and cytoskeletal motor proteins [3]. However, some features of T. thermophila present challenges to its use in uncovering new biology broadly. Ciliates are over a billion years diverged from better studied organisms such as fungi and animals [4], an evolutionary distance that frequently creates obstacles to identifying gene orthologs and thereby inferring conserved functions. Moreover, interesting novel mechanisms may have arisen over that large evolutionary distance, such as the recently discovered unique secretory apparatus that ciliates share only with the related apicomplexans and dinoflagellates [5]. One way to address these issues would be a forward genetic approach, using random mutagenesis to identify phenotypes of interest and then associate them with causative mutations [6–8]. However, ciliate nuclear organization makes it challenging to undertake high throughput forward genetic approaches in these organisms [9]. Due to all these factors, high throughput bioinformatic studies offer a potential breakthrough for interrogating both the evolutionarily conserved and novel biology in T. thermophila. Previous research has indicated that protein expression in T. thermophila tends to be regulated on the level of transcription as opposed to transcript degradation or translation rates, which is in line with observations in yeast [10, 11]. Thus, gene co-expression studies promise to be informative for the analysis of gene functions.
T. thermophila has distinct vegetative and sexual life stages. Consequently, many genes are tuned for differential expression during stages of vegetative growth/mitosis or conjugation/meiosis, previously explored in microarray-based experiments and co-expression analyses [12–15]. Strikingly, we found that many characterized genes involved in the biosynthesis and secretion of a particular secretory organelle, called the mucocyst, are co-expressed across growth, starvation, and conjugation [16–18]. This allowed us to subsequently identify multiple other co-expressed genes. A large subset of these were then verified as involved in mucocyst biogenesis or secretion [16–18]. This success led us to develop a tool we called the Co-regulation Data Harvester (CDH), which scraped the available co-expression data for T. thermophila and performed reciprocal-best-BLAST queries to indicate potential functional annotations based on orthologous genes in other organisms [19]. This tool also allowed us to identify candidates for genes involved in the secretion of homologous organelles in the apicomplexan Toxoplasma gondii, which were then experimentally verified [20].
However, the CDH became obsolete as new algorithms for identifying gene co-expression, as well as new databases for orthology searches, became available after our publication [21–24]. Additionally, extensive new revisions of the T. thermophila genome model were published, as well as an RNA-seq dataset from cell cycle-synchronized cultures [25–27]. These developments prompted us to align the original microarray data with the newest genome model, while also bridging insights from the different transcriptomic datasets, to develop a stable, accessible tool for the research community. Here, we report the Tetrahymena Gene Network Explorer (TGNE), an interactive tool for revealing co-expression patterns, by taking advantage of the gene annotations and expression data that have only recently become available [21, 22, 28]. One important aspect of the expanded datasets is that the microarray and RNA-seq expression profiles are independent from each other, the former covering bulk growth, starvation, and sexual conjugation, and the latter covering a synchronized mitotic cell cycle. Using the TGNE, we found that co-expression of many mucocyst-related genes is a feature of both datasets. We further found that these genes are also upregulated during experimentally induced mucocyst biosynthesis, implying that their co-expression in “untargeted” experiments reflects functional association. This demonstrates that the TGNE can be used to generate experimentally verifiable hypotheses and provides a direct insight into the dynamics of functionally associated genes in T. thermophila. A similar pattern emerges from TGNE analysis with regard to other cellular processes in T. thermophila, such as regulation of histone, ribosomal, and proteasomal subunits.
Beyond drawing insights specific to T. thermophila biology, our approach to developing the TGNE provides a framework for revitalizing microarray data and integrating it with RNA-seq results. We leverage computational negative controls to support our choice of (dis)similarity metric for co-expression profiles and our optimization of parameters for partitioning the profiles into clusters. We also compare different normalization strategies to show the degree to which the shape and the magnitude of gene expression patterns differently affect the clustering results. Approaches to computational negative controls, distance metrics, normalization, and clustering algorithms have all been detailed in prior work [1, 24, 29–32]. However, to our knowledge, these strategies have not been previously brought together to unite bioinformatic insights from different experiments. Our results indicate that there are more testable hypotheses to be found and more insights to glean from the troves of publicly available bioinformatic data, even for evolutionarily distant and experimentally challenging organisms.
Materials and methods
Microarray data preprocessing
The microarray data analyzed in this study were sourced from the NCBI Gene Expression Omnibus (GEO; https://www.ncbi.nlm.nih.gov/geo/) under accession numbers GSE11300, GSE26650, GSE26384, and GSE26385. Experimental probes were aligned to the June 28, 2024 release of the Tetrahymena thermophila genome model CDS using HISAT2 with default parameters [27, 33]. In the NimbleGen Design File (NDF), the SEQ_ID of each singly aligned probe was replaced with the TTHERM_ID of the corresponding sequence. Probes that did not align to a single gene coding sequence were discarded with the exception of RANDOM probes. All raw microarray data files were converted to XYS format. All XYS files were compiled to create an expression set and RMA normalized with oligo and pdInfoBuilder in R [34, 35].
Microarray chip quality control
Chips were removed if they met all three of the following criteria: (i) a 25th percentile NUSE (normalized unscaled standard error) > 1, (ii) a relatively large NUSE interquartile range, and (iii) expression intensity autocorrelation on reconstructed pseudo-images of the original chips [36]. After this quality control, if only one replicate remained for a given time point, it was also excluded from the analysis. The microarray chips were hierarchically clustered with hclust in R to observe any clustering biases. All microarray chips from GEO Accession GSE26385 were removed, as they clustered away from other replicates for their respective conditions and were collected by a specific individual, which is indicative of a batch effect [13].
Microarray gene filtering
Genes were filtered to remove ones that had too low expression or variance to be informative in the analysis. All genes were subjected to two filters: one based on the distribution of their respective expression statistics and one based on likelihood of differential expression. The first filter required that:
The gene’s geometric mean expression was greater than or equal to the 25th percentile of the geometric means of expression for all genes.
The gene’s geometric coefficient of variation of expression was greater than or equal to the median geometric coefficient of variation for all genes.
The gene’s maximum fold-change of expression was greater than or equal to the median maximum fold-change of expression for all genes.
The gene’s ratio of its median absolute deviation to its median expression was greater than or equal to the median ratio for all genes.
To identify genes with robust differential expression patterns that may have been lost to the above filter, we used maSigPro parametrized to a false discovery rate of q = 0.001, with the Benjamini–Hochberg correction for multiple hypothesis testing [37]. The genes identified by maSigPro were added to the ones that passed the first filter, and this total gene set was used for subsequent analysis.
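The two-stage gene filter described above can be sketched in a few lines. This is a minimal, dependency-free illustration rather than the pipeline used in the study: it assumes strictly positive expression values, assumes that a gene must satisfy all four criteria of the first filter (the text does not state how they are combined), and takes the differential expression result as a precomputed set of gene IDs. The function names are hypothetical.

```python
import math
from statistics import median, quantiles

def geometric_mean(xs):
    return math.exp(sum(math.log(x) for x in xs) / len(xs))

def geometric_cv(xs):
    # Geometric coefficient of variation: sqrt(exp(var(ln x)) - 1)
    logs = [math.log(x) for x in xs]
    mu = sum(logs) / len(logs)
    var = sum((v - mu) ** 2 for v in logs) / (len(logs) - 1)
    return math.sqrt(math.exp(var) - 1)

def max_fold_change(xs):
    return max(xs) / min(xs)

def mad_over_median(xs):
    m = median(xs)
    return median(abs(x - m) for x in xs) / m

def first_filter(expr):
    """expr maps gene ID -> list of positive expression values."""
    stats = {g: (geometric_mean(v), geometric_cv(v),
                 max_fold_change(v), mad_over_median(v))
             for g, v in expr.items()}
    cols = list(zip(*stats.values()))
    cutoffs = (
        quantiles(cols[0], n=4)[0],  # 25th percentile of geometric means
        median(cols[1]),             # median geometric CV
        median(cols[2]),             # median maximum fold-change
        median(cols[3]),             # median MAD/median ratio
    )
    # Assumption: a gene must meet every criterion to pass.
    return {g for g, s in stats.items()
            if all(value >= cut for value, cut in zip(s, cutoffs))}

def filtered_genes(expr, significant):
    """Union of the first filter and a precomputed set of significant genes."""
    return first_filter(expr) | (set(significant) & set(expr))
```

Here `significant` stands in for the genes flagged by the differential expression test, so a gene with a flat summary profile is still retained if that test rescues it.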
RNA-seq analysis
The sequencing data used in this study were downloaded from the Sequence Read Archive (SRA; https://www.ncbi.nlm.nih.gov/sra/) under the BioProject PRJNA861835. Adapter sequences and low-quality reads were removed with Trimmomatic [38] with default parameters. The quality of the reads in each sample before and after trimming was assessed with FastQC [39] and MultiQC [40]. The transcript abundance for each gene was computed with Kallisto [41] using the trimmed reads and the T. thermophila genome model CDS. Transcripts per million (TPM) and counts per million (CPM) were computed from transcript abundance and the effective length of each transcript.
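For readers less familiar with the two abundance units, the sketch below shows the standard definitions of CPM and TPM. It is illustrative only: the input dictionaries (estimated counts and effective lengths per gene) and the function names are assumptions, not the study's code.

```python
def cpm(counts):
    """Counts per million: scale each count by the library size."""
    total = sum(counts.values())
    return {g: c / total * 1e6 for g, c in counts.items()}

def tpm(counts, eff_len):
    """Transcripts per million: length-normalize first, then scale to 1e6."""
    rate = {g: counts[g] / eff_len[g] for g in counts}
    denom = sum(rate.values())
    return {g: r / denom * 1e6 for g, r in rate.items()}

# Toy example: gene "b" has three times the reads and three times the length.
counts = {"a": 100, "b": 300}
eff_len = {"a": 1000, "b": 3000}
print(cpm(counts))           # "b" has the higher CPM
print(tpm(counts, eff_len))  # both genes have equal TPM
```

CPM reflects sequencing depth alone, whereas TPM also corrects for transcript length, which is why the two units can rank the same genes differently.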
RNA-seq gene filtering
Jaccard filtering was applied to the RNA-seq gene expression data to remove genes with noisy and unreplicated expression patterns [42]. We determined the maximum Jaccard similarity index between replicate gene measurements to be 0.9422, which corresponded to a CPM of 0.0802. Only genes with a maximum CPM measurement >0.0802 were kept for the subsequent analyses. After filtering, the TPM values for expression were used to compute co-expression clustering.
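One plausible reading of this filter is sketched below: for each candidate CPM threshold, the genes called "expressed" in two replicates are compared with the Jaccard index, and the threshold that maximizes agreement is kept. This is an assumption about the procedure in [42], not a reproduction of it, and the function names and toy values are hypothetical.

```python
def jaccard_at(threshold, rep1, rep2):
    """Jaccard index of the gene sets above `threshold` in two replicates."""
    a = {g for g, v in rep1.items() if v > threshold}
    b = {g for g, v in rep2.items() if v > threshold}
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def best_threshold(rep1, rep2, candidates):
    """Candidate threshold with the highest replicate agreement (first on ties)."""
    return max(candidates, key=lambda t: jaccard_at(t, rep1, rep2))

def keep_genes(max_cpm, threshold):
    """Genes whose maximum CPM across samples exceeds the chosen threshold."""
    return {g for g, v in max_cpm.items() if v > threshold}

# Toy replicates: g1 and g4 are detected inconsistently at very low CPM.
rep1 = {"g1": 0.05, "g2": 5.0, "g3": 10.0, "g4": 0.0}
rep2 = {"g1": 0.0, "g2": 4.0, "g3": 12.0, "g4": 0.07}
cutoff = best_threshold(rep1, rep2, [0.0, 0.1, 1.0])
```

In the toy data, raising the threshold from 0.0 to 0.1 removes the two inconsistently detected genes and brings the replicates into full agreement.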
Orthology-based annotation
eggNOG-mapper v2.1.12 was applied to the T. thermophila genome model protein sequences to mine annotations of orthologs using the following parameters: the HMMER database with 2759 as the taxID, tax scope constrained to eukaryota, 2759 selected for the target taxa, report orthologs enabled, nonelectronic GO terms only, the HMM database, and no PFAM realignment [21, 22, 43]. The exact command was: -m hmmer -d 2759 --no_annot --tax_scope Eukaryota --target_taxa 2759 --report_orthologs --report_no_hits --go_evidence non-electronic --pfam_realign none --dbtype hmmdb. InterProScan 5.68-100.0 was applied to the T. thermophila genome model protein sequences to mine annotations of orthologs using the default parameters [28, 44].
Enrichment analysis
The modified two-tailed Fisher’s exact test with a Bonferroni correction for multiple hypothesis testing, as implemented in the DAVID database, was used to determine if any GO, COG, EC, KEGG_ko, PFAM, and/or InterPro annotation terms were enriched in each cluster relative to the genome background [45, 46].
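The enrichment test can be illustrated with a small standard-library sketch. The two-tailed Fisher's exact test below is the standard one; the "modified" variant follows our understanding of the EASE-style adjustment (one hit is removed from the cluster count before testing), which is an assumption about the DAVID implementation rather than a confirmed reproduction of it. All function names are hypothetical.

```python
from math import comb

def fisher_two_tailed(a, b, c, d):
    """Two-tailed Fisher's exact test p-value for the table [[a, b], [c, d]]."""
    row1, col1, n = a + b, a + c, a + b + c + d

    def pmf(k):
        # Hypergeometric probability of k annotated genes in the cluster
        return comb(col1, k) * comb(n - col1, row1 - k) / comb(n, row1)

    lo, hi = max(0, row1 + col1 - n), min(row1, col1)
    p_obs = pmf(a)
    # Sum all tables at least as unlikely as the observed one
    total = sum(p for p in map(pmf, range(lo, hi + 1)) if p <= p_obs * (1 + 1e-7))
    return min(1.0, total)

def ease_style_p(hits_in_cluster, cluster_size, hits_in_genome, genome_size):
    """Conservative variant: drop one cluster hit before testing (assumption)."""
    a = max(hits_in_cluster - 1, 0)
    b = cluster_size - hits_in_cluster
    c = hits_in_genome - hits_in_cluster
    d = genome_size - cluster_size - c
    return fisher_two_tailed(a, b, c, d)

def bonferroni(p, n_tests):
    """Bonferroni-adjusted p-value, capped at 1."""
    return min(1.0, p * n_tests)
```

Removing one hit penalizes terms supported by very few genes, so small clusters need stronger evidence to count as enriched.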
Clustering
The raw microarray and RNA-seq expression datasets were preprocessed and clustered using the same approach. Two different preprocessing pipelines were applied to each dataset: (i) each gene expression profile was log-transformed elementwise and subsequently z-score normalized; (ii) each gene expression profile was min–max normalized. All four preprocessed datasets were subject to the same clustering pipeline. The arithmetic mean of the normalized expression values was computed for each gene across replicates at each time point. A high-dimensional Manhattan distance matrix was precomputed with scikit-learn [47]. The nearest neighbors for each gene expression profile in the high-dimensional space were computed using a modified scikit-learn function. By default, scikit-learn does not include a point as its own nearest neighbor. The scikit-learn function was used to compute the eleven nearest neighbors, and the point itself was added manually to the set of nearest neighbors and distances to complete the set of 12. A graph of the high-dimensional space was built with umap-learn [23] with a varying number of nearest neighbors. Genes were clustered via community detection of networks with leidenalg [24] using the Constant Potts Model [48] quality function with a varying linear resolution parameter.
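The preprocessing and neighbor search described above can be sketched without external libraries. This is a simplified stand-in for the scikit-learn calls used in the study: the choice of population standard deviation for the z-score is an assumption, the brute-force neighbor search is only practical for small examples, and the function names are hypothetical. The UMAP graph construction and Leiden community detection steps are not shown.

```python
import math
from statistics import mean, pstdev

def log_zscore(profile):
    """Log-transform elementwise, then z-score (the log base cancels out)."""
    logs = [math.log2(x) for x in profile]
    mu, sd = mean(logs), pstdev(logs)  # assumption: population standard deviation
    return [(v - mu) / sd for v in logs]

def min_max(profile):
    """Rescale a profile to the [0, 1] range."""
    lo, hi = min(profile), max(profile)
    return [(x - lo) / (hi - lo) for x in profile]

def manhattan(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))

def neighbors_with_self(profiles, k_others=11):
    """For each gene, return itself followed by its k_others nearest genes."""
    result = {}
    for gene, p in profiles.items():
        others = sorted((manhattan(p, q), other)
                        for other, q in profiles.items() if other != gene)
        # The gene itself is added manually at distance zero, so the
        # neighbor set has k_others + 1 members (12 with the default).
        result[gene] = [(0.0, gene)] + others[:k_others]
    return result

profiles = {
    "a": min_max([1, 2, 3]),
    "b": min_max([2, 4, 6]),   # same shape as "a", different magnitude
    "c": min_max([3, 2, 1]),   # opposite shape
}
knn = neighbors_with_self(profiles, k_others=1)
```

In the toy data, genes "a" and "b" differ only in magnitude and become identical after min–max scaling, which illustrates how normalization makes the clustering sensitive to profile shape.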
Parameter optimization and partition quality validation
Partitions were computed with a varying number of nearest neighbors for UMAP graph generation (range=[2, 12], step = 1) and Leiden linear resolution parameters (range=[0, 1], step = 0.005). The modularity of each partition was computed using the graph and the clusters generated by the partition with networkx [49]. The number of enriched clusters was counted, and the fraction of enriched clusters was computed as the number of enriched clusters divided by the total number of clusters in the partition. The interquartile range among all of the cluster sizes was computed for each partition. Pareto-efficient partitions were computed in a three-dimensional space defined by the fraction of enriched clusters, modularity, and the interquartile range of cluster sizes. For a partition to be considered optimal, we required a modularity >0.7 and an interquartile range of cluster sizes >10. Computational negative control partitions were used to assess the statistical significance of each of the four optimal partitions. Scrambled negative control partitions were generated by randomly swapping raw expression values within each gene’s expression profile before preprocessing and clustering. Simulated negative control partitions were generated by creating a uniformly distributed Latin hypercube with SciPy [50] with the same dimensionality as the dataset. The hypercube values were then scaled to match the range of values within the dataset and used as raw input for preprocessing and clustering. One thousand computational negative control partitions of each type were computed with the optimal parameterizations for each of the microarray and RNA-seq datasets. A two-tailed one-sample t-test was used to assess the difference between the modularity of each optimal partition and each of the corresponding negative control modularity distributions.
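The selection of Pareto-efficient partitions can be sketched as follows. The example assumes that all three objectives are maximized (the text does not state the direction for the interquartile range) and uses invented toy values, not results from this study; the function names are hypothetical.

```python
def dominates(p, q):
    """True if p is at least as good as q everywhere and better somewhere."""
    return (all(a >= b for a, b in zip(p, q))
            and any(a > b for a, b in zip(p, q)))

def pareto_front(partitions):
    """partitions maps (n_neighbors, resolution) ->
    (fraction_enriched, modularity, cluster_size_iqr)."""
    return {k: v for k, v in partitions.items()
            if not any(dominates(w, v) for w in partitions.values())}

def optimal_candidates(partitions, min_modularity=0.7, min_iqr=10):
    """Pareto-efficient partitions that also pass the two hard thresholds."""
    return {k: v for k, v in pareto_front(partitions).items()
            if v[1] > min_modularity and v[2] > min_iqr}

# Toy values for illustration only.
toy = {
    (3, 0.005): (0.40, 0.75, 17),
    (4, 0.005): (0.35, 0.74, 15),  # dominated by (3, 0.005)
    (2, 0.010): (0.45, 0.60, 20),  # efficient, but modularity too low
    (5, 0.020): (0.30, 0.80, 8),   # efficient, but IQR too small
}
front = pareto_front(toy)
best = optimal_candidates(toy)
```

A partition stays on the front whenever no other parameterization beats it on every objective at once, so the hard thresholds are what narrow the front down to a usable choice.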
Differential expression analysis of induced mucocyst replacement dataset
To distinguish between genes upregulated due to demands of mucocyst synthesis versus genes upregulated in response to the exocytic stimulus per se, we exploited a mutant cell line, MN173, that does not secrete its mucocysts upon stimulation [51]. Using biological triplicates, total RNA was isolated from wild-type cells (strain CU428) prior to stimulation of exocytosis, and then 60 min post-stimulation. In parallel, cells from the MN173 mutant line were treated and processed equivalently. Cells were grown in 1% proteose peptone, 0.2% dextrose, 0.1% yeast extract, and 0.003% ferric EDTA. Cells were grown to 150,000–200,000 cells/ml and pelleted in 50 ml conical tubes for 45 s at 800 × g. They were washed once and suspended in Dryl’s medium with added magnesium and calcium (DMC) (0.1 mM Na2HPO4, 0.1 mM NaH2PO4, 0.65 mM CaCl2, 0.1 mM MgCl2, pH 7.1) for 16 h at room temperature with shaking. Fifty-milliliter aliquots were stimulated by pelleting as above and resuspending in 5 ml. Alcian Blue (2%) was added to 0.05%, and the tube contents were mixed by inversion and then diluted immediately by addition of 45 ml of 0.25% proteose peptone, 0.5 mM CaCl2. Cells were then washed once in DMC and resuspended for recovery in DMC at room temperature with shaking. Pelleted cells were lysed with 5 M guanidinium thiocyanate, 10 mM EDTA, 50 mM Tris–HCl pH 7.5, and 8 mM 2-mercaptoethanol. RNA was precipitated with seven volumes of cold 4 M LiCl, and the pellet was washed once with 3 M LiCl and suspended in 0.5% N-lauroyl-sarcosine, 1 mM EDTA, 10 mM Tris pH 7.5. RNA was then phenol/chloroform extracted and ethanol precipitated.
The complementary DNA (cDNA) synthesis and Cy3 labeling were done by Roche NimbleGen Systems, as described in [52]. Hybridization and staining of arrays were carried out by Roche NimbleGen Systems as described in [53]. Arrays were scanned by Roche NimbleGen using a GenePix 4000B (Molecular Devices, Sunnyvale, CA), and the data were extracted using NimbleScan software. Array normalization was performed using the quantile normalization method [54]. Normalized expression values for the individual probes were used to obtain the expression values for a given open reading frame by using the robust multiarray average (RMA) procedure [55]. Data were analyzed based on the RMA-processed expression values.
The microarray data from the mucocyst replacement experiments were processed the same way as the untargeted co-expression microarray dataset up to and including the RMA normalization step. The RMA-normalized expression set was analyzed with limma [56] to determine the differential expression of each gene over 1 h in the MN173 mutant relative to wild-type T. thermophila. Genes that had a fold-change of >1.5×, a Benjamini–Hochberg adjusted false discovery rate (q-value) <0.01, and a B-statistic >1 (i.e. a Bayesian posterior probability >73.1% of being differentially expressed) were classified as differentially expressed genes.
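The correspondence between the B-statistic cutoff and the stated posterior probability follows from the B-statistic being the natural log of the posterior odds of differential expression, so that the probability is e^B / (1 + e^B). The short sketch below checks the arithmetic and applies the three thresholds; the function names are hypothetical, and the fold-change argument is assumed to be on a linear (not log2) scale.

```python
import math

def posterior_from_b(b_stat):
    """Posterior probability implied by a log-odds (B) statistic."""
    return 1.0 / (1.0 + math.exp(-b_stat))

def is_differentially_expressed(fold_change, q_value, b_stat):
    """Apply the three thresholds stated in the text."""
    return fold_change > 1.5 and q_value < 0.01 and b_stat > 1

print(round(posterior_from_b(1), 3))  # 0.731, i.e. the 73.1% quoted above
```

A B-statistic of zero corresponds to even odds, so the cutoff of 1 requires the odds of differential expression to exceed e to 1.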
Cross-validation of untargeted co-expression analyses against targeted mucocyst replacement experiment
We identified co-expression clusters that were statistically significantly enriched for the 33 genes that are experimentally validated to be involved in mucocyst biogenesis (Supplementary File S1) in both the microarray and RNA-seq datasets. This was done by a Fisher’s exact test relative to the background genome as in [45]. For each respective normalization framework, these sets of genes and the set of genes that are upregulated during mucocyst replacement were then compared to determine their mutual agreement. A Venn diagram was generated to display the intersections of the three sets. To assess whether the intersections between these sets are more likely to include experimentally validated genes than the genes excluded from the intersections, we again employed a Fisher’s exact test, this time looking at the background of the union of the three sets, rather than the entire genome.
Macronuclear knockouts of candidate genes
Genes of interest (TTHERM_00141040, TTHERM_00193465, TTHERM_01213910, TTHERM_00047330, TTHERM_00317390, TTHERM_00283800, TTHERM_00241790, TTHERM_01332070, TTHERM_00059370, and TTHERM_00227750) were knocked out via a standard biolistic bombardment, homologous recombination, and selection protocol [57]. In brief, polymerase chain reaction (PCR) was used to amplify the 5′ and 3′ flanking regions (1.5–2 kb each) for each gene. These amplicons were subsequently subcloned into the SacI and XhoI sites that flank a neomycin resistance cassette (Neo4), thus granting the cassette homology arms to replace the endogenous gene. These vectors were linearized by KpnI and SapI and transformed into CU428 cells by biolistic transformation. Biolistic transformations were as described previously [58], with the following modifications: gold particles were prepared as recommended with 15 μg of total linearized plasmid DNA. To select positive transformants, paromomycin was added 4 h after bombardment to cultures that had been shaking at 30°C. Transformants were selected in 120 μg/ml paromomycin, and CdCl2 was added at 1 μg/ml to induce Neo4 expression. Putative transformants were identified after 3 days of selection. These were then serially transferred daily in increasing amounts of paromomycin for at least 4 weeks before further testing.
Dibucaine mucocyst secretion assay to experimentally validate new mucocyst gene knockouts
T. thermophila cells (wild-type or knockout) were grown to a density between 4 × 10⁵ and 6 × 10⁵ cells/ml and washed once with 10 mM Na-HEPES (pH 7.2) after being pelleted for 30 s at 400 × g in a clinical centrifuge. Loose cell pellets (concentrated ∼10-fold relative to the initial culture) were stimulated for 30 s by addition of 2.5 mM dibucaine (final concentration). The cells were then diluted at least five-fold with 10 mM Na-HEPES (pH 7.2) and centrifuged at 1200 × g for 1 min in 15 ml conical tubes. After the centrifugation, we imaged the two-layer pellet, with cells overlaid by flocculent extruded mucocyst contents, to determine the strains’ relative capability to secrete mucocysts.
Results
Uniting insights from microarray and RNA-seq datasets
To compare gene expression patterns between the disparate microarray and RNA-seq datasets, we processed both into a lingua franca of normalized gene expression. The microarray expression dataset covered bulk growth, starvation, and conjugation conditions [12, 13, 15], and the RNA-seq expression dataset covered 1.5 mitotic cycles of cell cycle-synchronized cultures [26]. We filtered both datasets, removing batch effects, noisy expression, and unreplicated samples (Supplementary Figs S1 and S2). After the quality control steps, 20,428 genes remained in the microarray dataset and 23,113 genes remained in the RNA-seq dataset (the total gene number in T. thermophila is 27,494) [27] (Table 1). After normalization, we scanned over five distance metrics (Euclidean, Manhattan, context likelihood of relatedness (CLR), angular, and linear correlation [13, 32, 59]), numbers of nearest neighbors from 2 to 12, and Leiden clustering resolution parameters between zero and one (Supplementary Figs S3–S7) [24]. Using Pareto optimization [30], we settled on Manhattan distance, three nearest neighbors, and a resolution parameter of r = 0.005 as the most effective combination (Supplementary Fig. S3). The Pareto optimization considered modularity [60], the fraction of clusters with significantly enriched functional terms, and the cluster size interquartile range (Supplementary Figs S3–S7). To determine the functional term enrichment, we first used eggNOG and InterProScan to annotate all the genes in our dataset based on orthologous groups and protein domains [21, 22, 28] (Table 1). In each cluster, the enrichment of each functional term was calculated against its background abundance in the genome using a modified Fisher’s exact test with a Bonferroni correction for multiple hypothesis testing [45, 46] (Table 2).
Table 1.
Annotation statistics of the microarray and RNA-seq datasets
| | Microarray | RNA-seq |
|---|---|---|
| Total number of genes | 20,428 | 23,113 |
| Fraction genes with COG category terms | 0.55 | 0.52 |
| Fraction genes with GO terms | 0.08 | 0.07 |
| Fraction genes with KEGG KO terms | 0.32 | 0.29 |
| Fraction genes with EC terms | 0.14 | 0.13 |
| Fraction genes with PFAM terms | 0.49 | 0.45 |
| Fraction genes with InterPro terms | 0.65 | 0.63 |
Table 2.
Normalization-specific cluster and enrichment statistics of the optimal microarray and RNA-seq partitions
| | Microarray full | Microarray full | RNA-seq | RNA-seq |
|---|---|---|---|---|
| Normalization | min-max | z-score | min-max | z-score |
| Modularity | 0.77 | 0.76 | 0.79 | 0.79 |
| Number of clusters | 636 | 636 | 731 | 740 |
| Mean cluster size | 32.12 | 32.12 | 31.62 | 31.23 |
| Median cluster size | 30.0 | 30.0 | 31.0 | 29.0 |
| Standard deviation of cluster size | 12.65 | 13.58 | 11.45 | 11.23 |
| Minimum cluster size | 3 | 3 | 3 | 3 |
| Maximum cluster size | 82 | 98 | 76 | 80 |
| Number of enriched clusters | 250 | 255 | 199 | 219 |
| Mean enriched cluster size | 34.57 | 35.10 | 35.00 | 34.25 |
| Median enriched cluster size | 32.0 | 32.0 | 34.0 | 33.0 |
| Standard deviation of enriched cluster size | 14.37 | 15.09 | 11.55 | 11.07 |
| Maximum enriched cluster size | 82 | 98 | 76 | 66 |
| Minimum enriched cluster size | 3 | 3 | 9 | 14 |
| Number of genes in enriched clusters | 8643 | 8950 | 6965 | 7501 |
After all these steps, we simulated the null hypothesis that each gene expression profile was completely unrelated to the others using two methods: (i) scrambling the expression values within each gene, or (ii) generating simulated expression profiles for each gene drawn from an evenly distributed hypercube of values (Fig. 1). For each method, we ran 1000 independent simulations and found that the normally distributed modularity values corresponding to the null hypothesis were statistically significantly lower than the modularity of the chosen optimal partitions (P < 0.005 by two-tailed t-test), which indicated that our parametrization identified significant gene co-expression modules (Fig. 1A–D). For both the microarray and RNA-seq datasets, and for both normalization strategies, we hierarchically clustered the co-expression modules around their centroids, allowing us to plot gene co-expression modules by their relative similarity (Fig. 1E–H). These heatmaps reveal a whole-transcriptome view of gene expression across all the assayed conditions, and each condition has a corresponding co-expression module that reaches either its minimum or maximum at that point.
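The scrambled negative control and the comparison against the observed modularity can be sketched as follows. This is an illustration under stated assumptions, not the study's code: the clustering step that turns each scrambled dataset into a modularity value is omitted, only the t statistic is computed (a P-value would additionally require the t distribution with n − 1 degrees of freedom), and the function names and toy numbers are hypothetical.

```python
import math
import random
from statistics import mean, stdev

def scramble_within_genes(expr, rng):
    """Shuffle each gene's values across conditions, breaking co-expression
    while preserving every gene's own distribution of values."""
    return {g: rng.sample(v, len(v)) for g, v in expr.items()}

def one_sample_t(null_values, observed):
    """t statistic for H0: the mean of the null distribution equals `observed`."""
    n = len(null_values)
    return (mean(null_values) - observed) / (stdev(null_values) / math.sqrt(n))

rng = random.Random(0)  # fixed seed so the sketch is reproducible
toy_expr = {"g1": [1, 2, 3, 4], "g2": [10, 20, 30, 40]}
scrambled = scramble_within_genes(toy_expr, rng)

# Toy null modularities versus a toy observed modularity.
t_stat = one_sample_t([0.1, 0.2, 0.3], observed=0.8)
```

A strongly negative statistic indicates that the null modularities sit well below the observed value, which is the pattern reported for both negative controls.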
Figure 1.
Optimal parameterization significance testing for each dataset/normalization scheme and illustrations of optimal experimental partitions. Histograms illustrating modularity distributions for computational negative control (NC) partitions compared to the experimental partition created from the optimal parameterization. The computational negative controls based on scrambled data are in black, the computational negative controls based on a simulated hypercube with uniform data distribution are in purple, and the modularity value for each optimized partition is indicated by the dashed green line. (A) The computational negative control comparison for the min–max normalized microarray dataset. (B) The computational negative control comparison for the min–max normalized RNA-seq dataset. (C) The computational negative control comparison for the z-score normalized microarray dataset. (D) The computational negative control comparison for the z-score normalized RNA-seq dataset. In each case, the modularity for the optimized clustering of the real data was statistically significantly greater than in either negative control (P < 0.005). Heatmaps illustrating the optimal partitions generated from (E) the min–max normalized microarray dataset, (F) the min–max normalized RNA-seq dataset, (G) the z-score normalized microarray dataset, and (H) the z-score normalized RNA-seq dataset. Modules of gene expression profiles are ordered by hierarchical clustering of the module centroids using average linkage. Each row of a given heatmap corresponds to one gene’s expression. In panels (E) and (G), the x-axis denotes the different phases of the T. thermophila life cycle: low density logarithmic growth (Ll), medium density logarithmic growth (Lm), high density logarithmic growth (Lh), 0–24 h of starvation (S0–S24), and 0–18 h of conjugation (C0–C18) [12]. In panels (F) and (H), the x-axis denotes the stages of the mitotic cell cycle and corresponding timepoints for sampling.
Mucocyst biogenesis genes are recovered in untargeted bioinformatic analysis
With the gene co-expression being normalized and computed in the same way for the two datasets, we were able to test the degree of their agreement. We focused on mucocyst biogenesis, a cellular process that we have previously studied, including with the use of inferences from gene co-expression [16, 18, 20]. For our current analysis, we determined which clusters in both the microarray and RNA-seq datasets were enriched for 33 genes that have been previously experimentally verified to be involved in mucocyst biogenesis or secretion (Supplementary File S1). Using the min–max normalized data, we found clusters that are enriched for these experimentally verified genes: six in the microarray co-expression dataset (m002, m003, m004, m005, m006, and m378, totaling 182 genes, Fig. 2A) and four in the RNA-seq co-expression dataset (m040, m194, m219, and m294, totaling 104 genes, Fig. 2B). Using the z-score normalized data, we found four such clusters in the microarray co-expression dataset (m169, m171, m172, and m424, totaling 172 genes, Supplementary Fig. S9A) and four in the RNA-seq co-expression dataset (m632, m634, m636, and m679, totaling 144 genes, Supplementary Fig. S9B).
Figure 2.
Enrichment, differential expression, and overlap of experimentally validated mucocyst-associated and differentially expressed, upregulated genes. Min–max normalized expression profiles for genes in (A) the six microarray and (B) the four RNA-seq clusters significantly enriched for experimentally validated mucocyst-associated genes as well as the 33 genes overlapping between the upregulated, enriched microarray clusters, and enriched RNA-seq clusters in (C) the microarray and (D) the RNA-seq datasets. (E) Volcano plot illustrating differential expression of each gene represented in the microarray dataset over one hour in the MN173 mutant relative to the wild type T. thermophila. Thresholds are represented by blue dashed lines (q < 0.01 and fold-change > 1.5). All genes that passed the thresholds have a Bayesian posterior probability of differential expression >80%. (F) Venn diagram describing the overlapping genes in the enriched microarray clusters, enriched RNA-seq clusters, and the set of upregulated genes with min–max normalization. Min-max normalized expression profiles for genes that are co-expressed in the microarray and RNA-seq datasets, but not detected in the upregulated dataset: (G) gene expression in the microarray profiles and (H) gene expression in the RNA-seq profiles.
To determine whether this agreement between the two datasets has biological significance, we experimentally assessed which genes are upregulated after we stimulated massive mucocyst secretion, when the cells are induced to synthesize a large cohort of these organelles. For this analysis, we compared a wild-type strain (CU428) to a mutant (MN173) that produces mucocysts but is incapable of releasing them, and which therefore does not induce new mucocyst synthesis upon stimulation [61]. This was a microarray assay that we processed in the same way as described above (Supplementary Fig. S8). Focusing on min–max normalized data, of the 3220 genes that were differentially upregulated in this experiment (Fig. 2E), 112 are shared with the co-expressed clusters in the microarray dataset alone and five are shared with the ones in the RNA-seq dataset alone (Fig. 2F–H). Thirty-three genes are shared across the differential upregulation and the two co-expression datasets (Fig. 2F).
In Fig. 2F, the 33-gene intersection of the three datasets includes 13 experimentally verified genes (Supplementary File S1, “min–max triple agreement” tab), which is a statistically significant enrichment of experimentally verified genes relative to the background of all genes in Fig. 2F (p < 1 × 10−6 by a two-tailed Fisher’s exact test). The five-gene intersection of the RNA-seq co-expression and differential upregulation alone contains no experimentally verified genes (Supplementary File S1, “min–max upreg & RNA-seq” tab; p = 1). The 112-gene intersection of the microarray co-expression and differential upregulation alone contains eight experimentally verified genes (Supplementary File S1, “min–max upreg & microarray” tab; p < 1 × 10−6). The eight-gene intersection of the microarray and RNA-seq co-expression alone contains seven experimentally verified genes (Supplementary File S1, “min–max microarray & RNA-seq” tab; p < 1 × 10−6). Thus, 28 of the 33 genes (84%) that were experimentally verified to be involved in mucocyst biogenesis prior to this analysis were recovered, all of which are found in the microarray co-expression dataset and in at least one of the two other datasets. We obtained analogous results when starting with z-score normalized data (Supplementary Fig. S9 and Supplementary File S1, “z-score” tabs).
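The enrichment test used above can be illustrated with a minimal, standard-library sketch of a two-tailed Fisher's exact test. Only the 13-of-33 and 0-of-5 counts are taken from the text; the background size (3500 genes) and the number of verified genes in that background (28) are hypothetical placeholders, because the composition of the full background in Fig. 2F is not restated here. This is not the authors' code.

```python
from math import comb


def fisher_exact_two_tailed(a, b, c, d):
    """Two-tailed Fisher's exact test on the 2x2 table [[a, b], [c, d]].

    a: verified genes inside the intersection
    b: other genes inside the intersection
    c: verified genes outside the intersection
    d: other genes outside the intersection
    """
    row1, row2, col1 = a + b, c + d, a + c
    total_tables = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x verified genes in the intersection.
        return comb(row1, x) * comb(row2, col1 - x) / total_tables

    p_observed = prob(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    # Sum every table that is no more probable than the observed one.
    # The small tolerance guards against floating-point ties.
    return sum(
        p for p in (prob(x) for x in range(lowest, highest + 1))
        if p <= p_observed * (1 + 1e-7)
    )


# HYPOTHETICAL background: 3500 genes in total, 28 of them verified.
BACKGROUND, VERIFIED = 3500, 28

# Triple intersection: 13 verified genes among 33 (counts from the text).
p_triple = fisher_exact_two_tailed(
    13, 33 - 13, VERIFIED - 13, BACKGROUND - 33 - (VERIFIED - 13)
)
# RNA-seq and upregulation only: 0 verified genes among 5 (counts from the text).
p_rnaseq_only = fisher_exact_two_tailed(0, 5, VERIFIED, BACKGROUND - 5 - VERIFIED)

print(f"triple intersection: p = {p_triple:.3g}")
print(f"RNA-seq-only intersection: p = {p_rnaseq_only:.3g}")
```

With any background of this order of magnitude, an intersection with no verified genes gives a p-value of essentially 1, whereas 13 verified genes among 33 is far below the 1 × 10−6 level.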
Figure 2G and H show the expression profiles of the eight genes at the intersection of the microarray and RNA-seq co-expression datasets. This list includes GRL1, 3, 4, 5, 7, and 8 and GRT1, which are all known to be mucocyst cargo proteins [58, 62–64]. The other gene is TTHERM_00537380, which is unnamed, unannotated, and lacks any orthologs or protein domains that were identified by eggNOG or InterProScan (Supplementary File S1).
For 10 of the genes that are co-expressed with the 33 previously verified genes, we performed new knockout experiments confirming their role in mucocyst biogenesis (Fig. 3). Six of these genes have been previously implicated as part of the mucocyst docking and discharge complex by co-immunoprecipitation but have not been genetically assayed (Fig. 3A) [5]. The remaining four had not been previously studied, but we selected them based on their co-expression clusters and for their putative annotations as proton-pumping pyrophosphatases, which have been implicated in mucocyst and trichocyst biogenesis (Fig. 3B) [65–67]. In our knockout experiments, the loss of each of these ten genes resulted in a mucocyst secretion defect, as evidenced by the loss of the mucosal layer over the cell pellets after dibucaine treatment.
Figure 3.
Experimental validation of ten genes that are suggested to be mucocyst-associated by our co-expression analysis. (A) Genes that co-immunoprecipitated as members of the mucocyst docking and discharge protein complex (TTHERM_00141040, TTHERM_00193465, TTHERM_01213910, TTHERM_00047330, TTHERM_00317390, and TTHERM_00227750) [5]. (B) Four genes that were knocked out solely on the basis of our co-expression inference (TTHERM_00283800, TTHERM_00241790, TTHERM_01332070, and TTHERM_00059370). For each gene, the left tube shows the wild-type response to dibucaine as evidenced by a flocculent layer of mucus overlying the cell pellet after centrifugation. The boundary of the cell pellet is denoted by the solid line, and the boundary of the mucus layer is denoted by the dotted line. The right tube in each panel displays the phenotype of strains with the respective genes genetically knocked out. Each has a defect in mucocyst release in response to the dibucaine treatment.
The topology of the T. thermophila gene network reveals other functionally enriched modules
The previous two studies of the T. thermophila gene co-expression landscape identified relatively large modules with significantly enriched functional terms: the RNA-seq study identified 3032 genes as cell cycle-regulated that were divided into 10 clusters, only four of which had significantly enriched functional terms [26]. The microarray study reported 55 co-expression clusters for the full genome, but did not report a statistical analysis of functional enrichment [13]. However, both studies found functionally associated genes clustering together, most prominently histone-, proteasome-, and ribosome-associated genes. We set out to assess whether our new analysis reproduces or expands on these findings (Fig. 4).
Figure 4.
Min–max normalized expression profiles of clusters significantly enriched for (A and B) histone, (C and D) ribosome, and (E and F) proteasome functional annotation terms in the microarray (left column) and RNA-seq (right column) analyses. In each case, the same number of clusters come up in the two datasets: one for histone-associated profiles, two for ribosome-associated profiles, and three for proteasome-associated profiles. The histone-associated profiles are characterized by low expression during starvation and high expression during growth and conjugation (A) and high expression during the S-phase of the cell cycle (B). Ribosome-associated profiles are characterized by high expression during growth or starvation and low expression during conjugation (C). In the RNA-seq expression dataset, the ribosome-associated profiles appear to be at a minimum during the first G1 phase and at a maximum at the second G1 phase, indicating that in this experiment they are not following the cyclicity of the mitotic cell cycle (D). The main characteristics of the proteasome-associated co-expression pattern are a sharp loss of expression at the beginning of conjugation (E) and a peak of expression during mitotic division (F).
In the case of histone-associated genes, the min–max microarray co-expression analysis identified one cluster of 18 genes (module m179), and the min–max RNA-seq co-expression analysis identified one cluster of 15 genes (module m721) (Fig. 4A and B). The intersection of these two gene sets comprises six genes, all of which were previously annotated as histone components: TTHERM_00790792 (HTA1), TTHERM_00633360 (HTB1), TTHERM_00498190 (HHF1), TTHERM_00316500 (HTA2), TTHERM_00283180 (HTB2), and TTHERM_00143660 (HTA3) (Supplementary File S2, “AB overlap”). Additionally, the microarray co-expression cluster includes five members of the MCM helicase and a chromatin-associated protein: TTHERM_00554270 (MCM2), TTHERM_00092850 (MCM3), TTHERM_00277550 (MCM4), TTHERM_00448570 (MCM6), TTHERM_00011750 (putative MCM7), and TTHERM_00729230 (IBD1) (Supplementary File S2, “Fig. 4A”). The RNA-seq co-expression cluster also includes more histone- and chromatin-associated genes, such as TTHERM_00823720 (HHO1), TTHERM_00660180 (HMG1), TTHERM_00257230 (HMGB2), TTHERM_00189170 (HHF2), and TTHERM_00571055 (HHT1) (Supplementary File S2, “Fig. 4B”).
For the ribosome-associated genes, the microarray co-expression analysis identified two clusters with functional enrichment, modules m601 and m602, comprising 105 genes (Fig. 4C). The RNA-seq analysis also identified two clusters, modules m458 and m460, comprising 82 genes (Fig. 4D). The overlap between the two consists of 49 genes, each one of which is annotated as a ribosomal gene (Supplementary File S2, “CD overlap”). Similarly, the proteasome-associated genes separate into three co-expression profiles in both the microarray analysis (modules m374, m375, and m453; 102 genes total) and the RNA-seq analysis (m467, m470, and m473; 87 genes total) (Fig. 4E and F). The intersection between these gene sets contains 22 genes, 21 of which are annotated as proteasomal components (Supplementary File S2, “EF overlap”). The exception is TTHERM_00600110 (TTN1), which is a nuclease [68]. Curiously, the ribosomal co-expression profiles do not follow the periodicity of the mitotic cell cycle (Fig. 4D), unlike the histone (Fig. 4B) and proteasome (Fig. 4F) co-expression profiles.
The Interactive Tetrahymena Gene Network Explorer (TGNE)
Given that our analysis appears to be broadly informative for T. thermophila cell biology, we developed an interactive tool for reproducing our investigations for any gene or pathway of interest (Fig. 5). The TGNE is a standalone HTML file that contains all the data, which makes it portable and maintenance-free. The microarray version is 292.1 MB, and the RNA-seq version is 106.2 MB. The TGNE works in any web browser that supports WebGL 2.0 (e.g. Chrome 56+, Firefox 51+, Safari 15+, and Opera 43+), and every plot in the tool is interactive. The annotation table with the selected genes, as well as the functional term enrichment data for the corresponding modules, can be downloaded using the buttons at the upper right corner (Fig. 5C). The functional term enrichment data for each module in each variant of the TGNE are available in Supplementary File S3. The interactive dashboards for the microarray and RNA-seq variants of the TGNE are available as Supplementary File S4 (microarray version) and Supplementary File S5 (RNA-seq version).
Figure 5.
A labeled diagram of the TGNE dashboard showing the z-score normalized data for the gene module enriched for histone methyltransferase genes. (A) The “Conditions Selection Tabs” are exclusive to the microarray dashboard and allow the user to specify which life cycle phases are included within the input data to the clustering pipeline: the entire profile, just the vegetative conditions, or just the conjugative conditions. The “Normalization Selection Tabs” allow the user to select which normalization technique should be used on the input data: z-score or min–max. (B) The search bars are text fields that can be used to select genes based on their modules, annotations, or functional term codes. The left search bar allows searches for specific modules; the middle search bar is for TTHERM_ID, common names, descriptions, and module number; and the right search bar allows searches for functional annotation terms or codes, specifically: PFAM names or GO/KEGG/InterPro/E.C. alphanumeric codes. Here, “m270” was used as the search term to select the entire module that is enriched for histone-methyltransferases. The Boolean logic operators between the search bars allow for more complex searches, e.g. genes in module X AND genes with annotation Y. The Boolean logic between the middle and righthand search bars is always performed first, as indicated by the parentheses. (C) The download buttons. The annotation table, functional enrichment information, and normalized gene expression values for the selected genes/modules can be downloaded as tab-separated files using these three buttons. (D) The heatmap representation of the normalized expression of all genes across all conditions, as in Fig. 2E. The selected module is highlighted, and the unselected genes are grayed out. (E) This plot shows all modules with significantly enriched functional terms, which are the same terms as those that can be searched using the right-hand search bar. 
As with the heatmap, when a certain module is selected, the others are grayed out. Here, the y-axis is zoomed in to focus on modules 270 and 271. Moving the cursor over any of the circles in the plot displays the enriched term, its fold-change relative to the genome background, and the Bonferroni-corrected P-value. Here, the indicated circle represents the PFAM term “DNA_Pol_E_B,” which corresponds to “DNA polymerase alpha/epsilon subunit B.” This term is 306.7 times over-represented in this cluster relative to the genome background, with a Bonferroni-corrected P-value of ∼1.4 × 10−3. (F) An interactive UMAP representation of the gene expression with one tab showing the UMAP embedding of each cluster and the other tab showing the UMAP embedding of each gene. Selected genes and modules are highlighted, while unselected ones are grayed out. Clicking on any circle or selecting them with one of the tools to the right of the plot selects those module(s) or gene(s) for display. (G) The graph for displaying the expression profiles of the selected genes. This is an equivalent representation of the data in the heatmap. (H) The annotation table. When genes are selected, their annotation information based on the published T. thermophila genome, eggNOG, and InterProScan is populated into this table. Columns after the eggNOG preferred names are not displayed in this figure. When rows of the annotation table are clicked, the corresponding genes’ expression patterns are highlighted in panel (G). Here, we selected the histone methyltransferase genes to be highlighted. Each graphical panel of this figure is presented in more detail in Supplementary Figs S11 and S12.
Discussion
Our results show that, at least in the four cases we explored, there is significant agreement between co-expression patterns in two disparate experiments: one measuring gene expression using microarrays across growth, starvation, and conjugation and one measuring gene expression using RNA-seq in cell cycle-synchronized cells, which were generated fifteen years apart in different laboratories [12, 26]. The fact that we were able to bring the data into a shared framework that bridges these experimental and temporal differences is an indication that old data do not need to lie fallow. Crucially, our approach relies on careful quality filtering, normalization, systematic parameter optimization, and computational negative controls. The basis for our reanalysis was to align the probes/reads against the newest model of the genome and then normalize the expression data such that the resulting co-expression clusters would be translatable between datasets. Each dataset was filtered to remove batch effects, unreplicated gene expression, and noise (Supplementary Figs S1 and S2). We normalized each dataset in two ways to satisfy two different perspectives on the data. Min–max normalization linearly scales the data between zero and one, emphasizing only the shape of the given expression profile. In contrast, z-score normalization incorporates both the shape and the magnitude of expression within each profile. The min–max normalization thus primarily serves to provide a coarser-grained separation between co-expression clusters, while the z-score normalization distinguishes between genes that may have highly correlated expression patterns but distinctly different variance in expression levels across conditions. For example, this is evidenced in a comparison between Fig. 2H and Supplementary Fig. S9H, where in the former (the min–max normalization) the genes are grouped into only one cluster and in the latter (the z-score normalization) the genes are grouped into two.
Consequently, we recommend that users of the TGNE start by interrogating their pathways of interest using min–max normalization and then further explore the subtleties of gene expression using the z-score normalization. For example, the consideration of variance in gene expression in the z-score normalization schema may separate transcription factors from their downstream regulated genes. We hope that our work can be a roadmap for using co-expression data to identify functionally associated genes in other systems.
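To make the two normalization schemes concrete, the following minimal sketch applies each of them to a single expression profile. It assumes that normalization is applied per gene across conditions and that the z-score uses the population standard deviation; both are assumptions made for illustration, and the example profiles are invented. The authors' actual processing is described in their Supplementary File S7.

```python
from statistics import mean, pstdev


def min_max(profile):
    """Linearly rescale one expression profile to the interval [0, 1]."""
    low, high = min(profile), max(profile)
    if high == low:  # flat profile: there is no shape to preserve
        return [0.0] * len(profile)
    return [(x - low) / (high - low) for x in profile]


def z_score(profile):
    """Center one profile on its mean and scale it by its standard deviation."""
    mu, sigma = mean(profile), pstdev(profile)  # population SD is an assumption
    if sigma == 0:
        return [0.0] * len(profile)
    return [(x - mu) / sigma for x in profile]


# HYPOTHETICAL log2 expression values across four conditions.
steady_ramp = [2.0, 4.0, 6.0, 8.0]
late_spike = [2.0, 2.0, 2.0, 8.0]

for name, profile in (("steady_ramp", steady_ramp), ("late_spike", late_spike)):
    print(name, "min-max:", [round(v, 3) for v in min_max(profile)])
    print(name, "z-score:", [round(v, 3) for v in z_score(profile)])
```

Min–max values are pinned to the profile's extremes, whereas z-scores are pinned to its mean and spread, so the same pair of genes can sit at different distances from one another under the two schemes.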
Our comparison of metrics for difference (or similarity) between gene expression profiles showed that the Manhattan distance (also termed the L1 norm) performs as well as or better than the other metrics, which is in line with prior literature on overcoming the “curse of dimensionality” (Supplementary Figs S3–S7) [69, 70]. For all distance metrics except CLR, three nearest neighbors was the best parametrization for Leiden clustering (Supplementary Figs S3–S7) [24]. Notably, the CLR distance metric worked significantly differently for the microarray and RNA-seq expression datasets, unlike any of the other metrics (Supplementary Fig. S5A and B), which indicates that it would be inappropriate for bridging the two datasets. The previous Tetrahymena gene network landscape employed CLR to detect co-expression clusters, so our analysis is fundamentally distinct [13]. The other four distance metrics appear to be largely equivalent in terms of the resulting modularity, fraction of clusters with enriched functional terms, and interquartile range for cluster size (Supplementary Figs S3, S4, S6, and S7). We chose the Manhattan distance for our subsequent analyses because it gave the most similar clustering statistics between the microarray and RNA-seq datasets, but it is possible that the other distance metrics could reveal subtle differences in the detected co-expression patterns.
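As an illustration of the distance metric and neighborhood size discussed above, the sketch below computes Manhattan distances between normalized profiles and builds a three-nearest-neighbor graph. The gene names and values are hypothetical, and the brute-force search is only suitable for small examples. The Leiden clustering step that would operate on such a graph is not shown.

```python
def manhattan(p, q):
    """Manhattan (L1) distance between two equal-length expression profiles."""
    return sum(abs(a - b) for a, b in zip(p, q))


def knn_graph(profiles, k=3):
    """Map each gene to its k nearest neighbors by Manhattan distance."""
    graph = {}
    for gene, profile in profiles.items():
        ranked = sorted(
            (manhattan(profile, other), name)
            for name, other in profiles.items()
            if name != gene
        )
        graph[gene] = [name for _, name in ranked[:k]]
    return graph


# HYPOTHETICAL min-max normalized profiles over three conditions.
profiles = {
    "geneA": [0.0, 0.5, 1.0],
    "geneB": [0.1, 0.5, 1.0],
    "geneC": [0.0, 0.6, 0.9],
    "geneD": [1.0, 0.5, 0.0],
    "geneE": [0.9, 0.4, 0.0],
}

for gene, neighbors in knn_graph(profiles, k=3).items():
    print(gene, "->", neighbors)
```

Genes with similar profile shapes (here geneA, geneB, and geneC) list each other first, which is the local structure that a graph-based clustering algorithm can then exploit.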
The Leiden algorithm generates flat clusters, as opposed to hierarchical ones. Consequently, we do not report the relative pairwise relatedness of genes, unlike the previous Tetrahymena gene network analysis [13]. This approach allowed us to calculate statistically significantly enriched terms for ∼40% of clusters (Table 2). However, the interactive TGNE dashboard can be used to glean inter-cluster relatedness based on the heatmap, which is hierarchically sorted based on each cluster's centroid in geometrical space, as well as the UMAP embeddings and shared annotation enrichment terms (Fig. 5D–F).
We specifically chose to incorporate modularity and the fraction of enriched clusters in the three-dimensional Pareto optimization to optimize for both mathematical and biological significance, respectively. A higher relative modularity indicates that there are more intra-cluster paths (i.e. more relatedness between gene expression profiles in the same cluster) and fewer inter-cluster paths (i.e. less relatedness between genes in different clusters) [60]. The fraction of enriched clusters corresponds to the proportion of clusters that have statistically significantly enriched functional annotations relative to the distribution of annotation terms in the genome. Modularity and functional enrichment fraction are not sufficient to avoid partitions where there are either several large clusters containing all genes or where the majority of clusters are tiny, neither of which would be biologically informative. Optimizing the interquartile range for cluster size allowed us to avoid parameter settings that resulted in either of these situations. From the Pareto optimal set of partitions, the first optimal partition with an interquartile range of cluster sizes greater than ten was chosen, as this was the point where cluster sizes were still small enough to be manually verified.
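The selection procedure described above can be sketched as follows. The sketch assumes that all three objectives are maximized and that the “first” optimal partition means the Pareto-optimal partition with the smallest interquartile range above ten; neither detail is spelled out in this section, so both are labeled assumptions. The candidate partitions and their statistics are invented.

```python
OBJECTIVES = ("modularity", "enriched_fraction", "iqr")


def dominates(a, b, keys=OBJECTIVES):
    """True if a is at least as good as b on every objective and better on one."""
    return all(a[k] >= b[k] for k in keys) and any(a[k] > b[k] for k in keys)


def pareto_front(partitions, keys=OBJECTIVES):
    """Keep only the partitions that no other partition dominates."""
    return [
        p for p in partitions
        if not any(dominates(q, p, keys) for q in partitions)
    ]


def choose_partition(partitions, min_iqr=10):
    """Pick the Pareto-optimal partition with the smallest IQR above min_iqr.

    ASSUMPTION: "first" is read as "smallest qualifying IQR".
    """
    for p in sorted(pareto_front(partitions), key=lambda p: p["iqr"]):
        if p["iqr"] > min_iqr:
            return p
    return None


# HYPOTHETICAL clustering statistics for four parameter settings.
candidates = [
    {"name": "P1", "modularity": 0.60, "enriched_fraction": 0.30, "iqr": 40},
    {"name": "P2", "modularity": 0.70, "enriched_fraction": 0.38, "iqr": 14},
    {"name": "P3", "modularity": 0.72, "enriched_fraction": 0.40, "iqr": 8},
    {"name": "P4", "modularity": 0.65, "enriched_fraction": 0.35, "iqr": 6},
]

print([p["name"] for p in pareto_front(candidates)])
print(choose_partition(candidates)["name"])
```

In this toy example P4 is dominated by P3 and drops out of the front, and P3 is skipped because its interquartile range does not exceed ten.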
Our approach enabled us to perform computational negative controls to assess whether our clustering performs better than the null hypothesis, namely that there is no co-expression network in the data (Fig. 1). This methodology allowed us to analyze all the datasets in the same way and ensure the validity of our chosen partitions. Interestingly, the two variants of the negative control distributions for each dataset were never completely superimposed. For the min–max normalized datasets, the scrambled negative control distributions had a higher average modularity than the simulated distributions. The opposite was true for the z-score normalized datasets. While the use of computational negative controls in co-expression studies has previously been employed qualitatively or propounded on theoretical grounds [1, 29–31], our treatment of these T. thermophila datasets is the most systematic that we are aware of. Furthermore, the consistent modularity scores of our Pareto-optimized partitions and the performance of the negative controls gave us confidence to draw comparisons between the co-expression patterns in the microarray and RNA-seq datasets.
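One possible way to construct the two kinds of negative control is sketched below. How the scrambled and simulated controls were actually generated is not described in this section, so both functions are assumptions for illustration only: the scrambled control shuffles each gene's values across conditions, and the simulated control draws uniform random values. The function names and example data are hypothetical.

```python
import random


def scramble_profiles(profiles, seed=0):
    """Return a copy in which each gene's values are shuffled across conditions.

    ASSUMPTION: scrambling is done independently within each gene.
    """
    rng = random.Random(seed)
    scrambled = {}
    for gene, values in profiles.items():
        shuffled = list(values)  # copy so the input is left untouched
        rng.shuffle(shuffled)
        scrambled[gene] = shuffled
    return scrambled


def simulate_profiles(n_genes, n_conditions, seed=0):
    """Return random profiles with values drawn uniformly from [0, 1).

    ASSUMPTION: a uniform distribution stands in for the simulated control.
    """
    rng = random.Random(seed)
    return {
        f"sim{i}": [rng.random() for _ in range(n_conditions)]
        for i in range(n_genes)
    }


# HYPOTHETICAL normalized expression profiles.
example = {
    "geneA": [0.0, 0.2, 0.7, 1.0],
    "geneB": [0.1, 0.3, 0.8, 1.0],
}

scrambled = scramble_profiles(example, seed=1)
simulated = simulate_profiles(n_genes=2, n_conditions=4, seed=1)
print(scrambled)
print(simulated)
```

Either control would then be passed through the same clustering pipeline as the real data, so that the modularity of the real partitions can be compared with what arises when no co-expression structure is present.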
Our primary goal for the TGNE was to develop a tool for generating testable hypotheses about T. thermophila cell biology. To evaluate its effectiveness, we used it to revisit the biogenesis of an organelle in Tetrahymena that we have previously studied: the mucocyst (Fig. 2). Using the TGNE, we found clusters that were enriched for the 33 genes that are experimentally known to be associated with mucocysts (i.e. genes that either are required for mucocyst biogenesis/secretion, localize to mucocysts, or both) in both the RNA-seq and microarray datasets (Fig. 2A and B). We compared these genes against our differential upregulation experiment, which assessed gene expression in wild-type and secretion-null mutant cultures after stimulating mucocyst release. The degree of agreement between co-expression and differential upregulation patterns, as displayed in the intersections of the Venn diagrams in Fig. 2F and Supplementary Fig. S9F, gave us confidence that our co-expression clusters contain novel genes that are important for mucocyst biogenesis. We generated new knockouts for 10 genes (Fig. 3): six were previously co-immunoprecipitated with the mucocyst docking and discharge complex (Fig. 3A) [5]; four were unstudied but had putative annotations as proton-pumping pyrophosphatases, which have been implicated in ciliate membrane trafficking (Fig. 3B) [65–67]. Each knockout had a mucocyst secretion defect, and after this initial confirmation, these genes will be the subject of detailed future studies. The agreement between co-immunoprecipitation and co-expression data is not limited to mucocyst biogenesis. We found that the histone methyltransferase complex components that co-immunoprecipitate with AMT1 (AMT6, AMT7, AMTP1, and AMTP2) also cluster together in the z-score normalization of the microarray data (Fig. 5) [71].
Importantly, each intersection in the Venn diagram in Fig. 2F and Supplementary Fig. S9F indicates other clear candidates for genes involved in mucocyst biogenesis that have not been previously studied (Supplementary File S1). These intersections include four orthologs to the Paramecium tetraurelia trichocyst cargo proteins (TTHERM_00321725, TTHERM_00773710, TTHERM_00697290, and TTHERM_00773700) and eight members of an expanded gene family that shares a beta/gamma crystallin domain with known mucocyst cargo genes (TTHERM_00585170, TTHERM_00471040, TTHERM_00038880, TTHERM_00558350, TTHERM_00570550, TTHERM_01002860, TTHERM_01002870, and TTHERM_00989430) [72–74]. There are also two genes that are known to be essential for trichocyst secretion: ND6 (TTHERM_00410160) and ND9 (TTHERM_00938850) [75, 76].
Furthermore, eight genes which were co-expressed in both the microarray and RNA-seq datasets but were not induced upon regranulation (Fig. 2F) include GRL1 (TTHERM_00527180), GRL3 (TTHERM_00624730), GRL4 (TTHERM_00624720), GRL5 (TTHERM_00378890), GRL7 (TTHERM_00522600), GRL8 (TTHERM_01055600), GRT1 (TTHERM_00221120), and TTHERM_00537380 (which has no annotation or clear orthologs outside the ciliates available) (Supplementary File S1). Of these, the GRLs and GRT1 are known mucocyst cargo genes [51, 58, 62, 63, 72, 73]. Every gene in this set has an average log2 intensity >15.5 on the microarray, apart from the unnamed gene, which is >14.9. Given the 16-bit detection camera of the system that was used for the microarray-based experiments, there was not enough dynamic range to detect further upregulation of these genes (Supplementary File S6). However, we have previously detected upregulation of GRL1, GRL3, and GRL4 during regranulation by qPCR and northern blot in the same experimental framework [77]. Thus, we expect that if the differential upregulation experiment were performed using RNA-seq instead of microarrays, the majority of these genes would also be in the triple-intersection of the Venn diagram. Of note, the completely unannotated TTHERM_00537380 was over-represented in the constitutive secretome of the T. thermophila SB281 mutant, which is also the case for mucocyst cargo proteins (Supplementary File S1) [51, 78, 79]. The fact that it shares a strong co-expression profile with verified cargo genes and is in this secretome makes TTHERM_00537380 a prime candidate for future study.
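The dynamic range argument above can be made concrete with a short calculation. The sketch assumes, as a simplification, that the reported average log2 intensities sit on the same scale as the raw counts of a 16-bit camera, which may not hold exactly after normalization. The function name is hypothetical; the 1.5-fold threshold is the one given in the Fig. 2 legend.

```python
from math import log2


def max_detectable_fold_change(mean_log2_intensity, bit_depth=16):
    """Largest fold increase that fits below the detector's signal ceiling.

    ASSUMPTION: log2 intensities are directly comparable to raw camera counts.
    """
    ceiling = log2(2 ** bit_depth - 1)  # just under 16 for a 16-bit detector
    headroom = max(0.0, ceiling - mean_log2_intensity)
    return 2 ** headroom


FOLD_CHANGE_THRESHOLD = 1.5  # threshold stated in the Fig. 2 legend

headroom_fold = max_detectable_fold_change(15.5)
print(f"maximum detectable fold change at log2 = 15.5: {headroom_fold:.2f}")
print("can exceed threshold:", headroom_fold > FOLD_CHANGE_THRESHOLD)
```

Under this assumption, a gene that already averages a log2 intensity of 15.5 has less than 1.5-fold of headroom before saturating the detector, which is consistent with the explanation that further upregulation of these genes could not be measured on this platform.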
A consequence of our clustering approach is that it highlights statistically significant, but qualitatively subtle, differences in expression patterns among well-characterized complexes or functionally related genes. One intriguing example of this phenomenon is found within the nine-member gene family called GRL, for Granule Lattice. All GRL products are structurally related secretory proteins that are co-packaged within mucocysts, and early analyses of transcriptomic data revealed that the GRL genes are highly co-regulated. However, while most GRL proteins are likely to be required to form the physical core of the mucocyst, GRL6 appears to play a distinct regulatory role, as yet poorly understood [62]. Remarkably, in our current analysis GRL6 partitions into a different cluster from the other GRL genes, potentially reflecting this functional divergence (Supplementary Fig. S10). We posit that, even in the absence of any other data, the separate clustering of closely related genes may provide hints of functional diversification. More broadly, in cases where specialized cell biological structures or pathways in T. thermophila were co-inherited by other ciliates or the relatively closely related dinoflagellates and apicomplexans, the transcriptional clusters detected in the TGNE may help to uncover novel features within this deep lineage.
The efficacy of bioinformatic approaches like ours is necessarily limited by the input datasets. Even though we were able to “modernize” the microarray data by aligning it to the newest genome model and applying more quality control, microarray datasets are inherently limited by both maximum and minimum signal intensities. This results in a plateau effect in gene expression profiles in which high expression levels reach a signal ceiling and low expression levels fall below detection thresholds. These limitations reduce the resolution and detail in affected expression profiles and, consequently, restrict the amount of variance available for clustering genes algorithmically. RNA-seq datasets do not suffer from the plateau effect, but the RNA-seq dataset analyzed in this study was limited in that it was only performed in duplicate and only during the growth phase of the Tetrahymena life cycle. Additionally, despite the samples being synchronized, many gene expression observations were not repeated in the duplicated G1 and S phases, such as in Fig. 4D. This was likely due to cell cycle synchronization diminishing over time. Overall, the tool would benefit from a new RNA-seq dataset that covers both the growth and conjugation phases with highly replicated samples collected at smaller time intervals. This would enhance both the precision and accuracy of the expression profiles and clustering algorithm’s ability to accurately partition genes. Naturally, more replicates in all conditions would further help to reduce the noise and improve clustering.
We highlighted only four biological functions in this report (mucocyst biogenesis and histone, ribosome, or proteasome processes), each with up to six associated co-expression clusters. However, our analysis produced hundreds of co-expression clusters that are enriched for biological functions (Table 2). These include metabolism, membrane trafficking, cytoskeletal organization, DNA replication, and many others (Supplementary File S3). A detailed analysis of all these gene modules is outside the scope of the present work, but it suggests the opportunity to target the study of many genes of interest in Tetrahymena thermophila. However, it is important to note that new transcriptomic datasets may uncover co-expression patterns that might be absent from our present analysis. Every new set of experiments that studies T. thermophila cell biology from a categorically new perspective (e.g. temperature stress, antibiotic treatment, co-culture with predators or prey, etc.) may reveal co-expression clusters that are impossible to identify using the currently available data. Such datasets may be well suited for use in undergraduate laboratory courses: by searching for genes that are co-expressed with known transcription factors or other regulatory elements, a group of students can arrive at lists of candidate genes to analyze (e.g. knock out or tag) in a timeframe that is appropriate for an academic term.
One exciting extension of approaches like the TGNE will lie in their ability to elucidate cell biological pathways that evolved in specific lineages. As an example, we have previously found that co-expression analysis in T. thermophila could be leveraged to identify genes involved in a secretory protein complex that appears unique to the Alveolata lineage, which includes Tetrahymena and the apicomplexan Toxoplasma gondii [20]. This indicates that signatures of functional association, as evidenced by co-expression patterns, persist and can therefore be informative through evolutionary time and speciation. In future work, tools like the TGNE for organisms that are chosen for their phylogenetic diversity, rather than for their experimental accessibility, could provide opportunities for translating experimental results between evolutionarily distant model systems, as well as for identifying lineage-specific cellular innovations.
Acknowledgements
We thank Daniela Sparvoli and Liam Elliot for their feedback on our manuscript and the TGNE. The University of Chicago Research Computing Center provided high performance computing resources and technical support. We are grateful to Benilton Carvalho (creator of the R oligo package) for technical support about RMA normalization in oligo. Wei Miao and Geoffrey Kapler released their Tetrahymena gene expression datasets publicly, and we appreciate the opportunity to analyze them. We thank Wei Miao for sharing negative control probe data for their original microarray study. Stijn van Dongen provided helpful discussions about co-expression clustering analysis.
Author contributions: Michael A. Bertagna (Conceptualization [supporting], Data curation [supporting], Formal Analysis [equal], Investigation [supporting], Methodology [equal], Software [equal], Validation [supporting], Visualization [supporting], Writing—original draft [supporting], Writing—review & editing [supporting]), Lydia J. Bright (Investigation [supporting]), Fei Ye (Data curation [supporting], Investigation [supporting], Writing—review & editing [supporting]), Yu-Yang Jiang (Investigation [supporting]), Ajay Pradhan (Investigation [supporting], Writing—review & editing [supporting]), Debolina Sarkar (Investigation [supporting], Writing—review & editing [supporting]), Santosh Kumar (Funding acquisition [supporting], Investigation [supporting], Writing—review & editing [supporting]), Shan Gao (Funding acquisition [supporting], Resources [supporting], Writing—review & editing [supporting]), Aaron P. Turkewitz (Conceptualization [supporting], Funding acquisition [lead], Project administration [supporting], Resources [lead], Supervision [equal], Validation [supporting], Writing—original draft [supporting], Writing—review & editing [supporting]), Lev M.Z. Tsypin (Conceptualization [lead], Data curation [lead], Formal Analysis [equal], Funding acquisition [supporting], Investigation [lead], Methodology [equal], Project administration [lead], Software [equal], Supervision [equal], Validation [lead], Visualization [lead], Writing—original draft [lead], Writing—review & editing [lead]).
Contributor Information
Michael A Bertagna, Department of Molecular Genetics and Cell Biology, University of Chicago, Chicago, IL, 60637, United States.
Lydia J Bright, Department of Biology, State University of New York at New Paltz, New Paltz, NY, 12561, United States.
Fei Ye, MOE Key Laboratory of Evolution & Marine Biodiversity and Institute of Evolution & Marine Biodiversity, Ocean University of China, Qingdao 266003, China; Laboratory for Marine Biology and Biotechnology, Qingdao Marine Science and Technology Center, Qingdao 266237, China.
Yu-Yang Jiang, Department of Molecular Genetics and Cell Biology, University of Chicago, Chicago, IL, 60637, United States.
Debolina Sarkar, National Centre for Cell Science, NCCS Complex, Savitribai Phule Pune University Campus, Pune, Maharashtra State, 411007, India.
Ajay Pradhan, National Centre for Cell Science, NCCS Complex, Savitribai Phule Pune University Campus, Pune, Maharashtra State, 411007, India.
Santosh Kumar, National Centre for Cell Science, NCCS Complex, Savitribai Phule Pune University Campus, Pune, Maharashtra State, 411007, India.
Shan Gao, MOE Key Laboratory of Evolution & Marine Biodiversity and Institute of Evolution & Marine Biodiversity, Ocean University of China, Qingdao 266003, China; Laboratory for Marine Biology and Biotechnology, Qingdao Marine Science and Technology Center, Qingdao 266237, China.
Aaron P Turkewitz, Department of Molecular Genetics and Cell Biology, University of Chicago, Chicago, IL, 60637, United States.
Lev M Z Tsypin, Department of Pathology, Stanford University School of Medicine, Palo Alto, CA, 94305, United States.
Supplementary data
Supplementary data is available at NAR Genomics & Bioinformatics online.
Conflict of interest
None declared.
Funding
L.M.Z.T. thanks the Stanford Energy Postdoctoral Fellowship, facilitated by the Precourt Institute for Energy at Stanford University, for supporting travel and publication costs for this work. A.P.T.’s laboratory was supported by NIH GM077607 and NSF MCB 1937326. L.J.B.’s work was supported by NIH Genetics and Regulation Training Grant T32 GM007197. S.K.’s laboratory is funded by the Department of Biotechnology, India (BT/PR38584/MED/122/247/202), Department of Science and Technology (DST), India (CRG/2021/000732), and DBT/Wellcome Trust India Alliance (IA/I/22/2/506480). S.G.’s laboratory is supported by the National Natural Science Foundation of China (32125006 and 32070437).
Data availability
All code and data necessary to reproduce our analysis are available through Zenodo (https://doi.org/10.5281/zenodo.14353373) and figshare (https://doi.org/10.6084/m9.figshare.28022501). Our new microarray dataset for mucocyst regeneration after secretion is available on the NCBI Gene Expression Omnibus: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE276404. A flow diagram and all the pseudocode describing our data processing are available in Supplementary File S7.
References
- 1. Eisen MB, Spellman PT, Brown PO et al. Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci USA. 1998; 95:14863–8. 10.1073/pnas.95.25.14863.
- 2. Lowe R, Shirley N, Bleackley M et al. Transcriptomics technologies. PLoS Comput Biol. 2017; 13:e1005457. 10.1371/journal.pcbi.1005457.
- 3. Ruehle MD, Orias E, Pearson CG Tetrahymena as a unicellular model eukaryote: genetic and genomic tools. Genetics. 2016; 203:649–65. 10.1534/genetics.114.169748.
- 4. Jiang C, Gu S, Pan T et al. Dynamics and timing of diversification events of ciliated eukaryotes from a large phylogenomic perspective. Mol Phylogenet Evol. 2024; 197:108110. 10.1016/j.ympev.2024.108110.
- 5. Kuppannan A, Jiang Y-Y, Maier W et al. A novel membrane complex is required for docking and regulated exocytosis of lysosome-related organelles in Tetrahymena thermophila. PLoS Genet. 2022; 18:e1010194. 10.1371/journal.pgen.1010194.
- 6. Orias E, Flacks M, Satir BH Isolation and ultrastructural characterization of secretory mutants of Tetrahymena thermophila. J Cell Sci. 1983; 64:49–67. 10.1242/jcs.64.1.49.
- 7. Gutierrez JC, Orias E Genetic characterization of Tetrahymena thermophila mutants unable to secrete capsules. Dev Genet. 1992; 13:160–6. 10.1002/dvg.1020130210.
- 8. Kontur C, Kumar S, Lan X et al. Whole genome sequencing identifies a novel factor required for secretory granule maturation in Tetrahymena thermophila. G3. 2016; 6:2505–16. 10.1534/g3.116.028878.
- 9. Orias E, Flacks M Macronuclear genetics of Tetrahymena. I. Random distribution of macronuclear genecopies in T. pyriformis, syngen 1. Genetics. 1975; 79:187–206. 10.1093/genetics/79.2.187.
- 10. Stargell LA, Karrer KM, Gorovsky MA Transcriptional regulation of gene expression in Tetrahymena thermophila. Nucl Acids Res. 1990; 18:6637–9. 10.1093/nar/18.22.6637.
- 11. Csárdi G, Franks A, Choi DS et al. Accounting for experimental noise reveals that mRNA levels, amplified by post-transcriptional processes, largely determine steady-state protein levels in yeast. PLoS Genet. 2015; 11:e1005206. 10.1371/journal.pgen.1005206.
- 12. Miao W, Xiong J, Bowen J et al. Microarray analyses of gene expression during the Tetrahymena thermophila life cycle. PLoS One. 2009; 4:e4429. 10.1371/journal.pone.0004429.
- 13. Xiong J, Yuan D, Fillingham JS et al. Gene network landscape of the ciliate Tetrahymena thermophila. PLoS One. 2011; 6:e20124. 10.1371/journal.pone.0020124.
- 14. Xiong J, Lu Y, Feng J et al. Tetrahymena functional genomics database (TetraFGD): an integrated resource for Tetrahymena functional genomics. Database (Oxford). 2013; 2013:bat008. 10.1093/database/bat008.
- 15. Xiong J, Lu X, Lu Y et al. Tetrahymena Gene Expression Database (TGED): a resource of microarray data and co-expression analyses for Tetrahymena. Sci China Life Sci. 2011; 54:65–7. 10.1007/s11427-010-4114-1.
- 16. Briguglio JS, Kumar S, Turkewitz AP Lysosomal sorting receptors are essential for secretory granule biogenesis in Tetrahymena. J Cell Biol. 2013; 203:537–50. 10.1083/jcb.201305086.
- 17. Kumar S, Briguglio JS, Turkewitz AP Secretion of polypeptide crystals from Tetrahymena thermophila secretory organelles (mucocysts) depends on processing by a cysteine cathepsin, CTH4p. Euk Cell. 2015; 14:817–33. 10.1128/EC.00058-15.
- 18. Kumar S, Briguglio JS, Turkewitz AP An aspartyl cathepsin, CTH3, is essential for proprotein processing during secretory granule maturation in Tetrahymena thermophila. MBoC. 2014; 25:2444–60. 10.1091/mbc.e14-03-0833.
- 19. Tsypin LM, Turkewitz AP The Co-regulation Data Harvester: automating gene annotation starting from a transcriptome database. SoftwareX. 2017; 6:165–71. 10.1016/j.softx.2017.06.006.
- 20. Sparvoli D, Delabre J, Penarete-Vargas DM et al. An apical membrane complex for triggering rhoptry exocytosis and invasion in Toxoplasma. EMBO J. 2022; 41:e111158. 10.15252/embj.2022111158.
- 21. Cantalapiedra CP, Hernández-Plaza A, Letunic I et al. eggNOG-mapper v2: functional annotation, orthology assignments, and domain prediction at the metagenomic scale. Mol Biol Evol. 2021; 38:5825–9. 10.1093/molbev/msab293.
- 22. Huerta-Cepas J, Szklarczyk D, Heller D et al. eggNOG 5.0: a hierarchical, functionally and phylogenetically annotated orthology resource based on 5090 organisms and 2502 viruses. Nucleic Acids Res. 2019; 47:D309–14. 10.1093/nar/gky1085.
- 23. McInnes L, Healy J, Melville J UMAP: uniform manifold approximation and projection for dimension reduction. arXiv, 18 September 2020, preprint: not peer reviewed. 10.48550/arXiv.1802.03426.
- 24. Traag VA, Waltman L, van Eck NJ From Louvain to Leiden: guaranteeing well-connected communities. Sci Rep. 2019; 9:5233. 10.1038/s41598-019-41695-z.
- 25. Sheng Y, Duan L, Cheng T et al. The completed macronuclear genome of a model ciliate Tetrahymena thermophila and its application in genome scrambling and copy number analyses. Sci China Life Sci. 2020; 63:1534–42. 10.1007/s11427-020-1689-4.
- 26. Zhang L, Cervantes MD, Pan S et al. Transcriptome analysis of the binucleate ciliate Tetrahymena thermophila with asynchronous nuclear cell cycles. MBoC. 2023; 34:rs1. 10.1091/mbc.E22-08-0326.
- 27. Ye F, Chen X, Li Y et al. Comprehensive genome annotation of the model ciliate Tetrahymena thermophila by in-depth epigenetic and transcriptomic profiling. Nucleic Acids Res. 2025; 23:gkae1177. 10.1093/nar/gkae1177.
- 28. Jones P, Binns D, Chang H-Y et al. InterProScan 5: genome-scale protein function classification. Bioinformatics. 2014; 30:1236–40. 10.1093/bioinformatics/btu031.
- 29. Pearson RK, Zylkin T, Schwaber JS et al. Quantitative evaluation of clustering results using computational negative controls. Proceedings of the 2004 SIAM International Conference on Data Mining (SDM). 2004; University City, Philadelphia, Pennsylvania, USA: Society for Industrial and Applied Mathematics; 188–99.
- 30. Martínez-Peñaloza M-G, Mezura-Montes E, Cruz-Ramírez N et al. Improved multi-objective clustering with automatic determination of the number of clusters. Neural Comput Applic. 2017; 28:2255–75.
- 31. Mao L, Van Hemert JL, Dash S et al. Arabidopsis gene co-expression network and its functional modules. BMC Bioinf. 2009; 10:346. 10.1186/1471-2105-10-346.
- 32. Langfelder P, Horvath S WGCNA: an R package for weighted correlation network analysis. BMC Bioinf. 2008; 9:559. 10.1186/1471-2105-9-559.
- 33. Kim D, Paggi JM, Park C et al. Graph-based genome alignment and genotyping with HISAT2 and HISAT-genotype. Nat Biotechnol. 2019; 37:907–15. 10.1038/s41587-019-0201-4.
- 34. Carvalho BS, Irizarry RA A framework for oligonucleotide microarray preprocessing. Bioinformatics. 2010; 26:2363–7. 10.1093/bioinformatics/btq431.
- 35. Falcon S, Carvey V, Settles M et al. pdInfoBuilder: Platform Design Information Package Builder. 2017; R Bioconductor Package Manager; 10.18129/B9.BIOC.PDINFOBUILDER.
- 36. Brettschneider J, Collin F, Bolstad BM et al. Quality assessment for short oligonucleotide microarray data. Technometrics. 2008; 50:241–64. 10.1198/004017008000000334.
- 37. Nueda MJ, Tarazona S, Conesa A Next maSigPro: updating maSigPro bioconductor package for RNA-seq time series. Bioinformatics. 2014; 30:2598–602. 10.1093/bioinformatics/btu333.
- 38. Bolger AM, Lohse M, Usadel B Trimmomatic: A flexible trimmer for Illumina sequence data. Bioinformatics. 2014; 30:2114–20. 10.1093/bioinformatics/btu170.
- 39. Andrews S FastQC: A quality control tool for high throughput sequence data. 2022; https://www.bioinformatics.babraham.ac.uk/projects/fastqc/.
- 40. Ewels P, Magnusson M, Lundin S et al. MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics. 2016; 32:3047–8. 10.1093/bioinformatics/btw354.
- 41. Bray NL, Pimentel H, Melsted P et al. Near-optimal probabilistic RNA-seq quantification. Nat Biotechnol. 2016; 34:525–7. 10.1038/nbt.3519.
- 42. Rau A, Gallopin M, Celeux G et al. Data-based filtering for replicated high-throughput transcriptome sequencing experiments. Bioinformatics. 2013; 29:2146–52. 10.1093/bioinformatics/btt350.
- 43. Eddy SR Accelerated profile HMM searches. PLoS Comput Biol. 2011; 7:e1002195. 10.1371/journal.pcbi.1002195.
- 44. Blum M, Chang H-Y, Chuguransky S et al. The InterPro protein families and domains database: 20 years on. Nucleic Acids Res. 2021; 49:D344–54. 10.1093/nar/gkaa977.
- 45. Huang DW, Sherman BT, Lempicki RA Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nat Protoc. 2009; 4:44–57. 10.1038/nprot.2008.211.
- 46. Huang DW, Sherman BT, Lempicki RA Bioinformatics enrichment tools: paths toward the comprehensive functional analysis of large gene lists. Nucleic Acids Res. 2009; 37:1–13. 10.1093/nar/gkn923.
- 47. Pedregosa F, Varoquaux G, Gramfort A et al. Scikit-learn: machine learning in Python. JMLR. 2011; 12: 10.48550/arXiv.1201.0490.
- 48. Traag VA, Van Dooren P, Nesterov Y Narrow scope for resolution-limit-free community detection. Phys Rev E. 2011; 84:016114. 10.1103/PhysRevE.84.016114.
- 49. Hagberg A, Swart PJ, Schult DA Exploring network structure, dynamics, and function using NetworkX, Los Alamos, NM (United States). 2008; Los Alamos, New Mexico, USA: Los Alamos National Laboratory (LANL) https://www.osti.gov/biblio/960616.
- 50. Virtanen P, Gommers R, Oliphant TE et al. SciPy 1.0: fundamental algorithms for scientific computing in Python. Nat Methods. 2020; 17:261–72. 10.1038/s41592-019-0686-2.
- 51. Bowman GR, Turkewitz AP Analysis of a mutant exhibiting conditional sorting to dense core secretory granules in Tetrahymena thermophila. Genetics. 2001; 159:1605–16. 10.1093/genetics/159.4.1605.
- 52. Nuwaysir EF, Huang W, Albert TJ et al. Gene expression analysis using oligonucleotide arrays produced by maskless photolithography. Genome Res. 2002; 12:1749–55. 10.1101/gr.362402.
- 53. Ulijasz AT, Andes DR, Glasner JD et al. Regulation of iron transport in Streptococcus pneumoniae by RitR, an orphan response regulator. J Bacteriol. 2004; 186:8123–36. 10.1128/JB.186.23.8123-8136.2004.
- 54. Bolstad BM, Irizarry RA, Åstrand M et al. A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics. 2003; 19:185–93. 10.1093/bioinformatics/19.2.185.
- 55. Irizarry R, Hobbs B, Collin F et al. Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics. 2003; 4:249–64. 10.1093/biostatistics/4.2.249.
- 56. Ritchie ME, Phipson B, Wu D et al. limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Res. 2015; 43:e47–. 10.1093/nar/gkv007.
- 57. Chalker DL Collins K Transformation and Strain Engineering of Tetrahymena. Methods in Cell Biology, Tetrahymena Thermophila. 2012; 109: Cambridge, Massachusetts, USA: Academic Press; 327–45. 10.1016/B978-0-12-385967-9.00011-6.
- 58. Chilcoat ND, Melia SM, Haddad A et al. Granule lattice protein 1 (Grl1p), an acidic, calcium-binding protein in Tetrahymena thermophila dense-core secretory granules, influences granule size, shape, content organization, and release but not protein sorting or condensation. J Cell Biol. 1996; 135:1775–87. 10.1083/jcb.135.6.1775.
- 59. Faith JJ, Hayete B, Thaden JT et al. Large-scale mapping and validation of Escherichia coli transcriptional regulation from a compendium of expression profiles. PLoS Biol. 2007; 5:e8. 10.1371/journal.pbio.0050008.
- 60. Clauset A, Newman MEJ, Moore C Finding community structure in very large networks. Phys Rev E. 2004; 70:066111. 10.1103/PhysRevE.70.066111.
- 61. Melia SM, Cole ES, Turkewitz AP Mutational analysis of regulated exocytosis in Tetrahymena. J Cell Sci. 1998; 111:131–40. 10.1242/jcs.111.1.131.
- 62. Cowan AT, Bowman GR, Edwards KF et al. Genetic, genomic, and functional analysis of the granule lattice proteins in Tetrahymena secretory granules. MBoC. 2005; 16:4046–60. 10.1091/mbc.e05-01-0028.
- 63. Chilcoat ND, Elde NC, Turkewitz AP An antisense approach to phenotype-based gene cloning in Tetrahymena. Proc Natl Acad Sci USA. 2001; 98:8709–13. 10.1073/pnas.151243498.
- 64. Rahaman A, Miao W, Turkewitz AP Independent transport and sorting of functionally distinct protein families in Tetrahymena thermophila dense core secretory granules. Euk Cell. 2009; 8:1575–83. 10.1128/EC.00151-09.
- 65. Plattner H Chapter 3 - membrane trafficking in protozoa: SNARE proteins, H+-ATPase, actin, and other key players in ciliates. International Review of Cell and Molecular Biology. 2010; 280: Cambridge, Massachusetts, USA: Academic Press; 79–184.
- 66. Cheng C-Y, Romero DP, Zoltner M et al. Structure and dynamics of the contractile vacuole complex in Tetrahymena thermophila. J Cell Sci. 2023; 136:jcs261511. 10.1242/jcs.261511.
- 67. Kaur H, Sparvoli D, Osakada H et al. An endosomal syntaxin and the AP-3 complex are required for formation and maturation of candidate lysosome-related secretory organelles (mucocysts) in Tetrahymena thermophila. MBoC. 2017; 28:1551–64. 10.1091/mbc.e17-01-0018.
- 68. Howard-Till RA, Yao M-C Tudor nuclease genes and programmed DNA rearrangements in Tetrahymena thermophila. Euk Cell. 2007; 6:1795–804. 10.1128/EC.00192-07.
- 69. Altman N, Krzywinski M The curse(s) of dimensionality. Nat Methods. 2018; 15:399–400. 10.1038/s41592-018-0019-x.
- 70. Aggarwal CC, Hinneburg A, Keim DA. Van Den Bussche J., Vianu V On the surprising behavior of distance metrics in high dimensional space. Database Theory — ICDT 2001, Lecture Notes in Computer Science. 2001; 1973: Berlin, Heidelberg, Germany: Springer; 420–34. 10.1007/3-540-44503-X_27.
- 71. Wang Y, Nan B, Ye F et al. Dual modes of DNA N6-methyladenine maintenance by distinct methyltransferase complexes. Proc Natl Acad Sci USA. 2025; 122:e2413037121. 10.1073/pnas.2413037121.
- 72. Bowman GR, Smith DGS, Michael Siu KW et al. Genomic and proteomic evidence for a second family of dense core granule cargo proteins in Tetrahymena thermophila. J Eukaryotic Microbiol. 2005; 52:291–7. 10.1111/j.1550-7408.2005.00045.x.
- 73. Bowman GR, Elde NC, Morgan G et al. Core formation and the acquisition of fusion competence are linked during secretory granule maturation in Tetrahymena. Traffic. 2005; 6:303–23. 10.1111/j.1600-0854.2005.00273.x.
- 74. Gautier M-C, Sperling L, Madeddu L Cloning and sequence analysis of genes coding for Paramecium secretory granule (trichocyst) proteins: A unique protein fold for a family of polypeptides with different primary structures. J Biol Chem. 1996; 271:10247–55. 10.1074/jbc.271.17.10247.
- 75. Froissard M, Keller AM, Cohen J ND9P, a novel protein with armadillo-like repeats involved in exocytosis: physiological studies using allelic mutants in Paramecium. Genetics. 2001; 157:611–20. 10.1093/genetics/157.2.611.
- 76. Gogendeau D, Keller A-M, Yanagi A et al. Nd6p, a novel protein with RCC1-like domains involved in exocytosis in Paramecium tetraurelia. Euk Cell. 2005; 4:2129–39. 10.1128/EC.4.12.2129-2139.2005.
- 77. Haddad A, Bowman GR, Turkewitz AP New class of cargo protein in Tetrahymena thermophila dense core secretory granules. Euk Cell. 2002; 1:583–93. 10.1128/EC.1.4.583-593.2002.
- 78. Maihle NJ, Satir BH Protein secretion in Tetrahymena thermophila: characterization of the secretory mutant strain SB281. J Cell Sci. 1985; 78:49–52. 10.1242/jcs.78.1.49.
- 79. Madinger CL, Collins K, Fields LG et al. Constitutive secretion in Tetrahymena thermophila. Euk Cell. 2010; 9:674–81. 10.1128/EC.00024-10.